I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t
The problem I'm having is that the original paragraph has all those squiggly lines, reversed letters, and so on, so when I read the local files I end up with funny escape characters like \x85, \xa7, \x8d, etc.
My question is: is there any way I can convert all those escape characters into their respective UTF-8 characters? E.g., if there is an 'à', how do I convert it into a standard 'a'?
Python calling code:
import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)
I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me, Linux people; it was a client requirement), and the wget exe is launched from a Python 2.6 script.
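As an aside, a sketch of the same call using subprocess instead of os.system, which avoids the quoting pitfalls of a concatenated command string. The wget name and output path mirror the ones in the question and are assumptions about the local setup:

```python
import subprocess

word = 'apple'
# Paths mirror the question's layout; adjust for your own machine.
out_path = 'G:/projects/words/dictionary/urls/' + word + '-dict.html'
cmd = ['wget',
       '--output-document=' + out_path,
       'http://dictionary.reference.com/browse/' + word]
print(cmd)
# subprocess.call(cmd)  # uncomment to actually run wget
```

Passing the arguments as a list means the word and paths never need manual quoting.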
3 Answers
#1
37
how do I convert all those escape characters into their respective characters, like if there is a Unicode à, how do I convert that into a standard a?
Assuming you have loaded your Unicode text into a variable called my_unicode, normalizing à into a is this simple:
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
An explicit example:
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>
How it works: unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the Unicode text; str.encode('ascii', 'ignore') then transforms the decomposed characters into ASCII, ignoring the combining marks it cannot represent.
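For anyone on Python 3, the same trick works with one extra step: encode() returns bytes, which need decoding back into a str. A sketch, not part of the original answer:

```python
import unicodedata

def strip_accents(text):
    # NFD splits 'à' into 'a' plus a combining grave accent; the
    # combining mark is non-ASCII, so encode(..., 'ignore') drops it.
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')

print(strip_accents(u'àà'))  # aa
```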
#2
1
I needed something similar, but one that removes only accented characters while leaving other special ones alone, so I wrote this small function:
# -*- coding: utf-8 -*-
import re

def remove_accents(string):
    # Accept byte strings too: decode them to unicode first (Python 2).
    if type(string) is not unicode:
        string = unicode(string, encoding='utf-8')
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)
    return string
I like this function because you can customize it in case you need to leave other characters untouched.
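Since the unicode builtin is gone in Python 3, here is a rough port of the same idea for newer interpreters, as an illustration rather than part of the original answer:

```python
import re

def remove_accents(string):
    # Python 3: str is already Unicode; only raw bytes need decoding.
    if isinstance(string, bytes):
        string = string.decode('utf-8')
    # Same character classes as the Python 2 version above.
    for pattern, repl in [(u"[àáâãäå]", 'a'), (u"[èéêë]", 'e'),
                          (u"[ìíîï]", 'i'), (u"[òóôõö]", 'o'),
                          (u"[ùúûü]", 'u'), (u"[ýÿ]", 'y')]:
        string = re.sub(pattern, repl, string)
    return string

print(remove_accents(u'déjà vu'))  # deja vu
```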
#3
0
The given URL returns UTF-8, as the HTTP response headers clearly indicate:
wget -S http://dictionary.reference.com/browse/apple?s=t
--2013-01-02 08:43:40-- http://dictionary.reference.com/browse/apple?s=t
Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: Apache
Cache-Control: private
Content-Type: text/html;charset=UTF-8
Date: Wed, 02 Jan 2013 07:43:40 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
Length: unspecified [text/html]
Inspecting the saved file with vim also shows that the data is correctly UTF-8 encoded... the same is true when fetching the URL with Python.
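The practical upshot for the question: open the saved file with an explicit UTF-8 decoding rather than a plain open(), so you get real Unicode text instead of raw bytes full of \x.. escapes. A minimal sketch (the file name and content are hypothetical; codecs.open works on both Python 2 and 3):

```python
import codecs

# Simulate one of the saved dictionary pages (hypothetical name/content).
with codecs.open('apple-dict.html', 'w', encoding='utf-8') as f:
    f.write(u'caf\u00e9')

# Reading with an explicit encoding yields proper Unicode text.
with codecs.open('apple-dict.html', encoding='utf-8') as f:
    html = f.read()
print(html)  # café
```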