Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:
In [5]: print urllib.unquote(urllib.quote(u'Cataño'))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/kkinder/<ipython console> in <module>()
/usr/lib/python2.6/urllib.pyc in quote(s, safe)
1222 safe_map[c] = (c in safe) and c or ('%%%02X' % i)
1223 _safemaps[cachekey] = safe_map
-> 1224 res = map(safe_map.__getitem__, s)
1225 return ''.join(res)
1226
KeyError: u'\xc3'
Encoding the value to UTF8 also does not work:
In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño
It's recognized as a bug and there is a fix, but not for my version of Python.
What I'd like is something similar to urllib.quote/urllib.unquote that handles Unicode values correctly, such that this code would work:
decode_url(encode_url(u'Cataño')) == u'Cataño'
Any recommendations?
4 answers
#1
39
Python's urllib.quote and urllib.unquote do not handle Unicode correctly
urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.
IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.
Encoding the value to UTF8 also does not work:
In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño
Ah, well now you're typing Unicode into a console, and doing print-Unicode to the console. This is generally unreliable, especially in Windows and in your case with the IPython console.
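One plausible way output like Cataño arises, offered as an illustration rather than a diagnosis of your exact setup: somewhere between keyboard and screen, the UTF-8 bytes of the text get read one byte per character, as in Latin-1. The sketch below reproduces that effect with escape sequences only, so it doesn't depend on any terminal's encoding. The variable names are made up for the example.

```python
# Illustration only: how UTF-8 bytes misread as Latin-1 turn one character into two.
text = u'Cata\u00f1o'                      # the intended text, 6 characters

utf8_bytes = text.encode('utf-8')          # n-tilde becomes two bytes: C3 B1
misread = utf8_bytes.decode('latin-1')     # each byte wrongly treated as one character

print(len(text))                           # 6
print(len(misread))                        # 7
print(misread == u'Cata\u00c3\u00b1o')     # True: U+00C3 U+00B1 replaced the n-tilde

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode('latin-1').decode('utf-8')
print(repaired == text)                    # True
```

Nothing here involves urllib, which is the point: the garbling can happen entirely in the console layer.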
Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:
>>> u'Cata\u00F1o'.encode('utf-8')
'Cata\xC3\xB1o'
>>> urllib.quote(_)
'Cata%C3%B1o'
>>> urllib.unquote(_)
'Cata\xC3\xB1o'
>>> _.decode('utf-8')
u'Cata\xF1o'
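Wrapping those four steps into the encode_url/decode_url pair from the question might look like the sketch below. The two function names are the hypothetical ones from the question, not part of urllib, and UTF-8 is assumed as the encoding. The try/except import is only there so the same sketch runs on Python 3, where the byte-oriented variants are called quote_from_bytes and unquote_to_bytes.

```python
# Sketch only: encode_url/decode_url are the hypothetical names from the question.
try:
    # Python 2: quote/unquote operate on byte strings (str).
    from urllib import quote, unquote
except ImportError:
    # Python 3: these variants take and return bytes explicitly.
    from urllib.parse import quote_from_bytes as quote
    from urllib.parse import unquote_to_bytes as unquote


def encode_url(text):
    """Percent-encode Unicode text via its UTF-8 bytes."""
    return quote(text.encode('utf-8'))


def decode_url(quoted):
    """Reverse encode_url: unquote to bytes, then decode as UTF-8."""
    return unquote(quoted).decode('utf-8')


original = u'Cata\u00f1o'
encoded = encode_url(original)
print(encoded)                           # Cata%C3%B1o
print(decode_url(encoded) == original)   # True
```

On Python 2, pass decode_url a byte string rather than a unicode object, since unquote on a unicode object is where the trouble in the question starts.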
#2
4
"""Encoding the value to UTF8 also does not work""" ... the result of your code is a str object which at a guess appears to be the input encoded in UTF-8. You need to decode it or define "does not work" -- what do you expect?
Note: So that we don't need to guess the encoding of your terminal and the type of your data, use print repr(whatever) instead of print whatever.
>>> # Python 2.6.6
... from urllib import quote, unquote
>>> s = u"Cata\xf1o"
>>> q = quote(s.encode('utf8'))
>>> u = unquote(q).decode('utf8')
>>> for x in (s, q, u):
... print repr(x)
...
u'Cata\xf1o'
'Cata%C3%B1o'
u'Cata\xf1o'
>>>
For comparison:
>>> # Python 3.2
... from urllib.parse import quote, unquote
>>> s = "Cata\xf1o"
>>> q = quote(s)
>>> u = unquote(q)
>>> for x in (s, q, u):
... print(ascii(x))
...
'Cata\xf1o'
'Cata%C3%B1o'
'Cata\xf1o'
>>>
#3
1
I encountered the same problem and used a helper function to handle non-ASCII values with urllib.urlencode (which calls quote_plus internally):
import logging
import urllib

def utf8_urlencode(params):
    # Problem: urllib.urlencode(params.items()) is not Unicode-safe in Python 2,
    # so UTF-8 encode the keys and values of the params dictionary first.
    encoded = {}  # build a new dict instead of mutating params while looping
    for k, v in params.items():
        # TRY urllib.unquote_plus(artist.encode('utf-8')).decode('utf-8')
        if isinstance(v, (int, long, float)):
            encoded[k] = v  # numbers need no encoding
        else:
            try:
                encoded[k.encode('utf-8')] = v.encode('utf-8')
            except Exception as e:
                logging.warning('**ERROR utf8_urlencode ERROR** %s' % e)
                encoded[k] = v  # fall back to the unencoded pair
    return urllib.urlencode(encoded.items()).decode('utf-8')
Adapted from "Unicode URL encode / decode with Python".
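For comparison, on Python 3 a helper like this should not be needed, since urllib.parse.urlencode encodes str keys and values as UTF-8 by default. A brief sketch, assuming Python 3; the parameter names are invented for the example:

```python
# Python 3 only. The parameter names below are hypothetical.
from urllib.parse import urlencode, parse_qs

params = {"city": "Cata\xf1o", "page": 2}

query = urlencode(params)       # str values are UTF-8 encoded, then percent-quoted
print(query)                    # city=Cata%C3%B1o&page=2

# parse_qs reverses the process; every value comes back as a list of strings.
decoded = parse_qs(query)
print(decoded["city"][0] == "Cata\xf1o")   # True
print(decoded["page"])                     # ['2']
```

Note that the number does not survive the round trip as a number: parse_qs has no type information, so 2 comes back as the string '2'.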
#4
1
So I had the same problem: I wanted to put query parameters in a URL, but some of them contained weird characters (diacritics).
Dealing with encoding gave a messy URL and was fragile.
My solution was to replace every accented or otherwise weird Unicode character with its ASCII equivalent. It's straightforward thanks to unidecode (see "What is the best way to remove accents in a Python unicode string?").
pip install unidecode
then
from unidecode import unidecode
print unidecode(u"éèê")
# prints eee
So I have a clean URL. It also works for Chinese, etc.
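If installing a third-party package is not an option, accent stripping alone can be approximated with the standard library's unicodedata module. This is a sketch of a different and weaker technique than unidecode: it only removes combining marks, so it will not transliterate Chinese or other non-Latin scripts. The name strip_accents is chosen here for illustration.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(u'\u00e9\u00e8\u00ea'))   # eee
print(strip_accents(u'Cata\u00f1o'))          # Catano
```

Either way the conversion is lossy: once the accents are gone, the original text cannot be recovered from the URL, so this does not give the round trip the question asks for.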