I'm trying to read an HTML file, but when I pull out the titles and URLs to compare them against my keyword list 'alist',
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'.
Screenshot of the error: http://tinypic.com/r/307w8bl/8
Code
for q in soup.find_all('a'):
    title = q.get('title')
    url = q.get('href')
    length = len(alist)
    i = 0
    while length > 0:
        if alist[i] in str(title):  # checks the title against each keyword from the HTML form
            r.write(title)
            r.write("\n")
            r.write(url)
            r.write("\n")
        i = i + 1
        length = length - 1
doc.close()
r.close()
A little background: alist contains a list of keywords that I compare against each title to pick out the links I want. The strange thing is that if alist contains two or more words, the code runs perfectly, but if it contains only one word, the error shown above appears. Thanks in advance.
3 Solutions
#1
3
If your list must be a list of byte strings, try encoding the title variable:
>>> alist = ['á']  # byte string (non-ASCII, UTF-8 encoded here)
>>> title = u'á'  # unicode string
>>> alist[0] in title
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> title and alist[0] in title.encode('utf-8')
True
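The opposite direction also works, and is often easier to keep consistent: decode the keywords to unicode once instead of encoding every title. This is only a sketch; to_text is a hypothetical helper, the sample keywords and title are made up, and UTF-8 is assumed as the encoding of any byte-string keywords.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a unicode text string, decoding byte strings.

    Hypothetical helper; UTF-8 is an assumption about the keyword bytes.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up sample data: one UTF-8 byte string and one unicode string.
alist = [b"\xc3\xa1", u"caf\u00e9"]
keywords = [to_text(k) for k in alist]

title = u"\u00e1rbol"  # unicode title containing a non-ASCII character

# Both sides are unicode now, so no implicit ASCII conversion is needed.
matches = [k for k in keywords if k in title]
```

With everything in unicode, the `in` test never has to guess an encoding.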
#2
0
Presumably, title is a Unicode string that can contain any kind of character; str(title) tries to turn it into a bytestring using the ASCII codec, but that fails because your title contains a non-ASCII character.
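To make that concrete: in Python 2, calling str() on a unicode object is an implicit encode with the ASCII codec. The sketch below does the same encode explicitly so the failure is easy to see; the title text is a made-up sample containing the same u'\u2019' character as in the question.

```python
# Made-up sample title with a right single quotation mark (u'\u2019').
title = u"It\u2019s here"

error = None
try:
    # Explicit version of what str(title) attempts in Python 2.
    title.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# Encoding with a codec that covers the character succeeds.
encoded = title.encode("utf-8")
```

The exception object records which codec failed and at which position, which helps locate the offending character in a long title.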
What are you trying to do? Why do you need to turn the title into a bytestring?
#3
0
The problem is in str(title). You are trying to convert unicode data to a byte string.
Why are you converting title to a string? You can access it directly.
soup.find_all returns a list of tags, and q.get('title') on each tag already gives you a unicode string (or None if the attribute is missing).
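Putting that together, a rough sketch of the loop without str() could look like the following. It assumes the (title, url) pairs come from something like (q.get('title'), q.get('href')) for each q in soup.find_all('a'); write_matches, the sample links, and the results.txt path are all hypothetical. The output file is opened with an explicit UTF-8 encoding so that writing the title does not fall back to ASCII either.

```python
import io
import os
import tempfile


def write_matches(links, keywords, out):
    """Write the title and URL of each link whose title contains a keyword.

    Hypothetical helper: links is an iterable of (title, url) pairs and
    out is a text-mode file object.
    """
    for title, url in links:
        if not title or not url:
            continue  # .get() returns None when the attribute is missing
        if any(keyword in title for keyword in keywords):
            out.write(title + u"\n")
            out.write(url + u"\n")


# Made-up sample data standing in for the tags found in the page.
links = [
    (u"It\u2019s a test", u"http://example.com/a"),
    (None, u"http://example.com/b"),
    (u"Other", u"http://example.com/c"),
]

path = os.path.join(tempfile.mkdtemp(), "results.txt")
with io.open(path, "w", encoding="utf-8") as r:
    write_matches(links, [u"test"], r)

with io.open(path, encoding="utf-8") as f:
    written = f.read()
```

Note that this writes each matching link once, whereas the loop in the question writes it once per matching keyword; adjust if you need the original behaviour.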