I'm trying to extract some data from various HTML pages using a Python program. Unfortunately, some of these pages contain user-entered data that occasionally has "slight" errors, namely mismatched tags.
Is there a good way to have Python's xml.dom try to correct errors, or something of the sort? Alternatively, is there a better way to extract data from HTML pages that may contain errors?
4 Answers
#1
You could use HTML Tidy to clean the markup up, or Beautiful Soup to parse it. You may have to save the intermediate result to a temp file, but it should work.
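A minimal sketch of the Beautiful Soup route, using the current bs4 package; the sample markup and parser choice are illustrative, not from the original answer:

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # deliberately broken markup: the <b> tag is never closed
    broken = "<p>Unclosed paragraph<b>mismatched bold</p>"
    soup = BeautifulSoup(broken, "html.parser")  # tolerant of bad nesting
    print(soup.find("b").get_text())  # -> "mismatched bold"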
#2
I used to use BeautifulSoup for such tasks, but I have since shifted to html5lib (http://code.google.com/p/html5lib/), which works well in many cases where BeautifulSoup fails.
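A short sketch of what that looks like; html5lib builds an ElementTree-compatible tree by default, and the sample markup is illustrative:

    import html5lib

    # html5lib repairs markup the way a browser would, so mismatched
    # tags are closed and reordered automatically
    doc = html5lib.parse("<p>Unclosed<b>bold", treebuilder="etree")
    # note: the resulting elements carry the XHTML namespace by default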
Another alternative is "Element Soup" (http://effbot.org/zone/element-soup.htm), which is an ElementTree wrapper around Beautiful Soup.
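Going by the linked page, a sketch of the Element Soup approach; the module name, the parse() call, and the file name here are assumptions based on that page:

    import ElementSoup  # single-module wrapper from the effbot page

    # parse() runs Beautiful Soup underneath and hands back an
    # ElementTree element, so the usual find()/findall() API applies
    tree = ElementSoup.parse("broken_page.html")  # hypothetical input file
    for anchor in tree.findall(".//a"):
        print(anchor.get("href"))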
#3
lxml does a decent job of parsing invalid HTML.
According to its documentation, Beautiful Soup and html5lib sometimes perform better depending on the input. With lxml you can choose which parser to use and access them all through a unified API.
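A minimal sketch of lxml's lenient HTML parser; the sample markup is made up:

    from lxml import html

    # lxml's HTML parser recovers from mismatched and unclosed tags
    tree = html.fromstring("<p>First<b>bold</p><p>Second")
    print(tree.xpath("//b/text()"))  # -> ['bold']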
#4
If Jython is acceptable to you, TagSoup is very good at parsing junk; if it is, I found the JDOM libraries far easier to use than other XML alternatives.
This is a snippet from a demo mockup that screen-scrapes TfL's journey planner:
    private Document getRoutePage(HashMap params) throws Exception {
        String uri = "http://journeyplanner.tfl.gov.uk/bcl/XSLT_TRIP_REQUEST2";
        HttpWrapper hw = new HttpWrapper();
        String page = hw.urlEncPost(uri, params);
        // TagSoup's SAX parser turns the messy HTML into a well-formed
        // event stream that JDOM's SAXBuilder can consume directly
        SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
        Reader pageReader = new StringReader(page);
        return builder.build(pageReader);
    }