Python web scraping error - TypeError: can't use a string pattern on a bytes-like object

I want to build a web scraper. Currently, I'm learning Python. This is the very basics!

Python Code

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://basketball.realgm.com/")

htmltext = htmlfile.read()
title = re.findall('<title>(.*)</title>', htmltext)

print (htmltext)

Error:

  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

2 Answers

#1

You have to decode your data. Since the website in question says

charset=iso-8859-1

use that. utf-8 won't work in this case.

htmltext = htmlfile.read().decode('iso-8859-1')
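Rather than hard-coding the charset, one option is to read it from the response headers and fall back to a default only when the server does not declare one. The following is a minimal sketch under that assumption; `fetch_text` and `find_title` are hypothetical helper names, and the sample bytes stand in for a real page so the demonstration runs offline.

```python
import re
import urllib.request


def fetch_text(url, fallback="iso-8859-1"):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Use the charset from the Content-Type header when the server sends one.
        charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


def find_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Offline demonstration with made-up page content encoded as iso-8859-1.
sample = "<html><head><title>Caf\xe9 Hoops</title></head></html>".encode("iso-8859-1")

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The byte 0xE9 is not valid UTF-8 here, so the wrong codec fails.
    utf8_ok = False

title = find_title(sample.decode("iso-8859-1"))
print(utf8_ok, title)
```

With a live page you would call `find_title(fetch_text("http://basketball.realgm.com/"))` instead of decoding the sample bytes by hand.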

#2

Use a bytes literal as the pattern:

title = re.findall(b'<title>(.*)</title>', htmltext)

or decode the retrieved data to a string:

title = re.findall('<title>(.*)</title>', htmltext.decode('utf-8'))

(replace utf-8 with the actual encoding of the document)
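To see how the two options differ, here is a small offline sketch. The HTML bytes are invented sample data, not the real page. It shows that mixing a str pattern with bytes raises the TypeError, that a bytes pattern returns bytes matches, and that decoding first returns str matches.

```python
import re

# Made-up sample standing in for the result of htmlfile.read().
html_bytes = b"<html><head><title>RealGM Basketball</title></head></html>"

# A str pattern against bytes data reproduces the error from the question.
try:
    re.findall('<title>(.*)</title>', html_bytes)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: bytes pattern on bytes data; the matches are bytes objects.
bytes_titles = re.findall(b'<title>(.*)</title>', html_bytes)

# Option 2: decode first, then use a str pattern; the matches are str objects.
str_titles = re.findall('<title>(.*)</title>', html_bytes.decode('utf-8'))

print(error_message)
print(bytes_titles)
print(str_titles)
```

Decoding first is usually more convenient if you plan to print or further process the title as text, since bytes matches would need decoding later anyway.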
