If I have downloaded Wikipedia XML dumps, is there any way of removing all of the internal links from within an XML file?
Thanks
4 Answers
#1
One thing you could do, if you are importing them into a local wiki, is to import all the pages you want and then run a bot (e.g. pywikipediabot, which is easy to use) to strip out all the internal links.
#2
Wikipedia database dumps and information about using them are located here: Wikipedia:Database download. You should do this instead of writing a script to scrape Wikipedia.
#3
I would try to use XSLT to transform the XML file into another XML file.
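XSLT aside, the same XML-to-XML transformation can be sketched with Python's standard library. This is a minimal sketch, not the answerer's method: it assumes the article wikitext lives in `<text>` elements, as in MediaWiki export dumps, and matches on the local tag name because the dump's namespace URI varies between export versions.

```python
import re
import xml.etree.ElementTree as ET

def strip_links_in_dump(in_path: str, out_path: str) -> None:
    """Copy a dump, removing [[...]] markup from every <text> element."""
    tree = ET.parse(in_path)          # fine for small files; for a full
    for elem in tree.iter():          # dump, prefer ET.iterparse streaming
        # Match the local tag name, ignoring any XML namespace prefix.
        if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
            # Drop the link markup, keeping the visible label of piped links.
            elem.text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]",
                               r"\1", elem.text)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

For a real multi-gigabyte dump you would swap `ET.parse` for `ET.iterparse` and write pages out incrementally, but the per-element logic stays the same.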
#4
You could do a search and replace in your favorite text editor, replacing [[ and ]] with nothing.
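One caveat with a blind replace: piped links such as `[[Target|label]]` would be left as `Target|label`. A hypothetical regex sketch that keeps only the visible label for both link forms:

```python
import re

# Matches [[Target]] and [[Target|label]]; the capture group is the
# text that should remain visible after the link is removed.
WIKILINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")

def strip_wikilinks(wikitext: str) -> str:
    """Replace every internal wiki link with its display text."""
    return WIKILINK.sub(r"\1", wikitext)

print(strip_wikilinks("See [[Python (programming language)|Python]] and [[XML]]."))
# See Python and XML.
```

Most text editors with regex search-and-replace accept an equivalent pattern, so the same fix works without writing any code.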