I crawled some web pages using Python's urllib.request API and saved what I read into a new file.
f = open(docId + ".html", "w+")
with urllib.request.urlopen('http://*.com') as u:
    s = u.read()
    f.write(str(s))
But when I open the saved files, I see many strings such as \xe2\x86\x90, which was an arrow symbol in the original page. It seems to be the UTF-8 encoding of the symbol, but how do I convert it back to the symbol?
2 Answers
#1
2
Your code is broken: u.read() returns a bytes object, and str(bytes_object) returns a string representation of that object (what the bytes literal would look like), which is not what you want here:
>>> str(b'\xe2\x86\x90')
"b'\\xe2\\x86\\x90'"
Either save the bytes on disk as is:
import urllib.request
urllib.request.urlretrieve('http://*.com', 'so.html')
or open the file in binary mode ('wb') and save it manually:
import shutil
from urllib.request import urlopen
with urlopen('http://*.com') as u, open('so.html', 'wb') as file:
    shutil.copyfileobj(u, file)
or convert the bytes to Unicode and save them to disk using any encoding you like:
import io
import shutil
from urllib.request import urlopen
with urlopen('http://*.com') as u, \
     open('so.html', 'w', encoding='utf-8', newline='') as file, \
     io.TextIOWrapper(u, encoding=u.headers.get_content_charset('utf-8'), newline='') as t:
    shutil.copyfileobj(t, file)
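For files that were already written through str(bytes_object), re-crawling may not be strictly necessary. The following is only a minimal sketch, assuming each saved file holds exactly one bytes literal (the b'...' form shown at the top of this answer) and that the page was UTF-8 encoded. The helper name recover_text is hypothetical, not part of any library.

```python
import ast

raw = b'\xe2\x86\x90'            # UTF-8 bytes of the left arrow (U+2190)

as_repr = str(raw)               # the repr of the bytes object, escapes included
as_text = raw.decode('utf-8')    # the real one-character string

def recover_text(saved, encoding='utf-8'):
    """Turn text produced by str(bytes_object) back into real text.

    Assumes `saved` is a single bytes literal such as "b'...'".
    """
    data = ast.literal_eval(saved)   # safely parse the literal back into bytes
    if not isinstance(data, bytes):
        raise ValueError('expected a bytes literal')
    return data.decode(encoding)

recovered = recover_text(as_repr)
print(recovered == as_text)
```

ast.literal_eval is used here instead of eval because it only accepts Python literals, so a saved page cannot execute arbitrary code while being parsed.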
#2
1
Try:
# Python 2 only: urllib2 does not exist in Python 3 (use urllib.request there)
import urllib2, io
with io.open("test.html", "w", encoding='utf8') as fout:
    s = urllib2.urlopen('http://*.com').read()
    s = s.decode('utf8', 'ignore')  # or s.decode('utf8', 'replace')
    fout.write(s)
See https://docs.python.org/2/howto/unicode.html
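The snippet above imports urllib2, which is a Python 2 module, while the question uses Python 3. Below is a rough Python 3 sketch of the same decode-then-write idea that also shows how the 'ignore' and 'replace' error handlers differ. The sample bytes stand in for what urllib.request.urlopen(url).read() would return, and save_decoded is a hypothetical helper name.

```python
import os
import tempfile

def save_decoded(data, path, errors='replace'):
    """Decode UTF-8 bytes and write the result as UTF-8 text."""
    text = data.decode('utf-8', errors)
    with open(path, 'w', encoding='utf-8', newline='') as fout:
        fout.write(text)
    return text

# Sample bytes: a valid UTF-8 arrow followed by one invalid byte (0xff).
data = b'arrow: \xe2\x86\x90, stray: \xff'

ignored = data.decode('utf-8', 'ignore')     # the invalid byte is silently dropped
replaced = data.decode('utf-8', 'replace')   # the invalid byte becomes U+FFFD

# Write to a temporary directory and read the file back to check the round trip.
path = os.path.join(tempfile.mkdtemp(), 'test.html')
saved = save_decoded(data, path)
with open(path, encoding='utf-8') as fin:
    round_trip = fin.read()
```

'replace' is the default in this sketch because it leaves a visible marker where data was lost, whereas 'ignore' hides the problem.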