How do I remove internal links from Wikipedia XML files?

Date: 2023-01-15 00:03:02

If I have downloaded Wikipedia XML dumps, is there any way of removing all of the internal links from within an XML file?

Thanks

4 answers

#1


One thing you could do, if you are importing them into a local wiki, is to import all the files you want, then use a bot (e.g. pywikipediabot, which is easy to use) to strip out all the internal links.

#2


Wikipedia database dumps and information about using them are located here: Wikipedia:Database download. You should do this instead of writing a script to scrape Wikipedia.

#3


I would try to use XSLT to transform the XML file into another XML file.
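As a rough illustration of the same XML-to-XML idea, here is a minimal sketch that uses Python's standard `xml.etree.ElementTree` in place of XSLT (XSLT 1.0 has no regex replace, so the text rewriting itself is easier in a scripting language). It assumes the page wikitext lives in elements whose local name is `text`, as in the MediaWiki export format. The function name `rewrite_text_elements` and the `transform` callback are hypothetical, introduced only for this example.

```python
import xml.etree.ElementTree as ET


def rewrite_text_elements(xml_in, xml_out, transform):
    """Copy an XML document, applying `transform` to every <text> element.

    xml_in / xml_out may be file names or binary file objects.
    Assumption: wikitext is stored in elements whose local name is "text".
    """
    # Note: this loads the whole document into memory, so it only suits
    # small dumps or extracts; a full dump would need a streaming parser.
    tree = ET.parse(xml_in)
    for elem in tree.iter():
        # Namespaced tags look like "{namespace-uri}text"; compare the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text" and elem.text:
            elem.text = transform(elem.text)
    # ElementTree may rewrite namespace prefixes (e.g. "ns0:") on output
    # unless the namespace is registered with ET.register_namespace first.
    tree.write(xml_out, encoding="utf-8", xml_declaration=True)
```

For example, `rewrite_text_elements("in.xml", "out.xml", lambda s: s.replace("[[", "").replace("]]", ""))` would apply a plain bracket removal to each page's text; the file names here are placeholders.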

#4


You could do a search and replace in your favorite text editor, replacing [[ and ]] with nothing.
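One caveat with a plain search and replace: a piped link such as `[[mat (furniture)|mat]]` would be left as `mat (furniture)|mat`. A minimal sketch of a regex-based variant in Python that keeps only the visible label is shown below. The function name `strip_internal_links` is hypothetical, and the pattern is an assumption about typical wikitext: it does not handle nested links (for example inside image captions), and category or file links are unwrapped like any other link.

```python
import re

# Matches [[target]] and [[target|label]]. Nested links are not handled.
_LINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def strip_internal_links(wikitext: str) -> str:
    """Replace each internal link with its visible text."""
    def _replace(match):
        target, label = match.group(1), match.group(2)
        # Use the label when one is given, otherwise fall back to the target.
        return label or target
    return _LINK.sub(_replace, wikitext)


print(strip_internal_links("The [[cat]] sat on the [[mat (furniture)|mat]]."))
# prints: The cat sat on the mat.
```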
