I'm writing code that collects image URLs from arbitrary web pages. The code is in Python and uses BeautifulSoup and httplib2. When I run it, I get the following error:
Look me http://movies.nytimes.com (this line is printed by the code)
Traceback (most recent call last):
  File "main.py", line 103, in <module>
    visit(initialList,profundidad)
  File "main.py", line 98, in visit
    visit(dodo[indice], bottom -1)
  File "main.py", line 94, in visit
    getImages(w)
  File "main.py", line 34, in getImages
    iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 942, column 118
Can someone explain how to fix this, or how to catch the error with an exception handler?
3 solutions
#1
To catch that error specifically, change your code to look like this:
from HTMLParser import HTMLParseError

try:
    iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
except HTMLParseError:
    # Do something intelligent here, e.g. skip this page
    iSoupList = []
Here's some more reading on Python's try except blocks: http://docs.python.org/tutorial/errors.html
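Since the question's code visits many pages recursively, it may be convenient to wrap the try/except in a small helper so that one bad page does not stop the whole crawl. The following is only a sketch: `parse_or_skip` and its parameters are hypothetical names, not part of BeautifulSoup, and the parser callable and the exception types to catch are passed in by the caller.

```python
def parse_or_skip(parse, markup, parse_errors=(Exception,)):
    """Return parse(markup), or None if parsing raises one of parse_errors.

    `parse` is any callable that takes the markup and returns a parsed
    result; `parse_errors` is a tuple of exception types to treat as
    "this page is unparseable". Both are assumptions of this sketch.
    """
    try:
        return parse(markup)
    except parse_errors as exc:
        # Report the failure and let the caller move on to the next page.
        print("Skipping unparseable page: %s" % exc)
        return None
```

In the question's code, `parse` would be something like `lambda m: BeautifulSoup(m, parseOnlyThese=SoupStrainer('img'))` and `parse_errors` would be `(HTMLParseError,)`; `getImages` could then return early when the result is `None`.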
#2
Are you using the latest version of BeautifulSoup?
This seems to be a known issue with version 3.1.x, which started using a new parser (HTMLParser instead of SGMLParser) that is much worse at processing malformed HTML. You can find more information about this on the BeautifulSoup website.
As a quick solution, you can simply use an older version (3.0.7a).
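To check whether an installed copy falls in the affected series, one option is to look at its version string. This is a minimal sketch, assuming the installed module exposes the version as `BeautifulSoup.__version__`; the helper name `is_affected_version` is hypothetical, and "affected" here means only the 3.1.x series named above.

```python
def is_affected_version(version):
    """Return True if a version string such as '3.1.0.1' is in the 3.1.x series."""
    # Compare only the major and minor components; suffixes like '7a' are ignored.
    return version.split(".")[:2] == ["3", "1"]
```

It could be called as `is_affected_version(BeautifulSoup.__version__)` after `import BeautifulSoup`, provided that attribute is present in your copy.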
#3
I got that error when my HTML document contained the string =&. After I replaced that string (in my case with =and), the parsing error went away.
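If the pages come from the network rather than from a file you control, the same replacement can be applied to the downloaded markup before it is handed to the parser. This is a rough sketch of that idea: the helper name `replace_equals_ampersand` is hypothetical, and the default replacement simply mirrors the `=and` used above. Note that it rewrites every occurrence, including ones inside URLs such as `?a=&b=1`, so it may not be appropriate when the attribute values themselves matter.

```python
def replace_equals_ampersand(markup, replacement="=and"):
    """Replace every '=&' in the markup with `replacement` before parsing."""
    return markup.replace("=&", replacement)
```

In the question's code this would be called on `response` just before the `BeautifulSoup(...)` call.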