I want to build a web scraper. Currently, I'm learning Python. This is the very basics!
Python Code
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://basketball.realgm.com/")
htmltext = htmlfile.read()
title = re.findall('<title>(.*)</title>', htmltext)
print (htmltext)
Error:
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
2 Solutions
#1
5
You have to decode your data before applying a string pattern to it. Since the website in question declares charset=iso-8859-1, use that encoding; utf-8 won't work in this case.
htmltext = htmlfile.read().decode('iso-8859-1')
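If you would rather not hard-code the encoding, one option is to read it from the response's Content-Type header. The following is only a rough sketch: the helper name charset_from_content_type, the sample bytes, and the sample header value are all made up for illustration, and it uses no network access so it can be run as-is.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper for illustration; not part of urllib.
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


# Made-up data standing in for htmlfile.read() and the response header.
raw = b"<title>Caf\xe9 Hoops</title>"
header = "text/html; charset=ISO-8859-1"

charset = charset_from_content_type(header)
htmltext = raw.decode(charset)

# The same bytes are not valid UTF-8, so decoding them as utf-8 raises.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(charset, htmltext, utf8_ok)
```

With a live response, htmlfile.headers.get_content_charset() should give the same information directly, since the headers object is an email.message.Message subclass. It returns None when the server does not send a charset, so a fallback is still needed.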
#2
3
Use a bytes literal as the pattern:
title = re.findall(b'<title>(.*)</title>', htmltext)
or decode the retrieved data to a string:
title = re.findall('<title>(.*)</title>', htmltext.decode('utf-8'))
(replace utf-8 with the actual encoding of the document)
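To see both options side by side without hitting the network, here is a small sketch. The sample bytes in raw are made up and simply stand in for htmlfile.read(); the point is that the pattern and the data must be the same type.

```python
import re

# Made-up page content standing in for htmlfile.read(), which returns bytes.
raw = b"<html><head><title>RealGM Basketball</title></head></html>"

# Option 1: bytes pattern on bytes data; the matches are bytes.
titles_bytes = re.findall(b"<title>(.*)</title>", raw)

# Option 2: decode first, then use a str pattern; the matches are str.
titles_str = re.findall("<title>(.*)</title>", raw.decode("utf-8"))

# Mixing a str pattern with bytes data reproduces the error from the question.
try:
    re.findall("<title>(.*)</title>", raw)
    mixed_raises = False
except TypeError:
    mixed_raises = True

print(titles_bytes, titles_str, mixed_raises)
```

Decoding first is usually the more convenient choice if you plan to print or compare the title, because you get ordinary strings back.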