I am using the Python interpreter in the Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.
I type:
>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'
Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?
I continue, and type:
>>> us=unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'
Question 2: I tried the latin-1 encoding on the off chance that it would turn the string into a unicode string (actually, I tried a bunch of others first, including utf-8). How can I find out which encoding the terminal used to encode my string?
Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me. print(s) does the job.
I already looked at this related SO question, but found no clues there: Set Python terminal encoding on Windows
8 answers
#1
11
Unicode is not an encoding. You encode into byte strings and decode into Unicode:
>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'
The Windows terminal uses legacy DOS code pages. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary.
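To see which encodings are in play on a given machine, something like the following sketch may help. The helper name report_encodings is made up for illustration. sys.stdout.encoding can be None when output is redirected (notably on Python 2), so the sketch keeps whatever value it finds rather than assuming a string.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for the current session."""
    return {
        # Encoding used for text written to the terminal (None if unknown).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used to decode what is typed at the prompt.
        "stdin": getattr(sys.stdin, "encoding", None),
        # The locale's preferred encoding (typically the Windows code page).
        "preferred": locale.getpreferredencoding(),
        # Codec used for implicit byte string/unicode conversions.
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%-9s %s" % (name, value))
```

Running it once in the terminal and once in IDLE should show the two different code pages described above.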
#2
3
Avoid Windows Terminal
I'm not going out on a limb by saying that the 'terminal' (more appropriately, the 'DOS prompt') that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with PowerShell; I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.
Output to a file instead
Set the PYTHONIOENCODING environment variable and then redirect the output to a file.
set PYTHONIOENCODING=utf-8
python myscript.py > output.txt
Then you can view the UTF-8 version of your output in Notepad++.
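The same idea can be applied from inside the script: write the text to a file with an explicit encoding so the terminal's code page never gets involved. This is only a sketch. The helper name write_utf8 and the temporary file location are assumptions made for the demo.

```python
import io
import os
import tempfile


def write_utf8(path, text):
    """Write text to path as UTF-8, independent of the terminal code page."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Demo with a temporary file; the file name is arbitrary.
demo_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_utf8(demo_path, u"\xeb")

# Read the raw bytes back to see what actually landed on disk.
with io.open(demo_path, "rb") as handle:
    raw = handle.read()
print(repr(raw))
```

The bytes on disk are the two-byte UTF-8 form of the letter (c3 ab), whatever code page the console is using.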
Install win-unicode-console
win-unicode-console can fix your problems. You should try it out.
pip install win-unicode-console
If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.
py -m run script.py
This enables it per script. Alternatively, you can follow their directions and call win_unicode_console.enable() on every invocation by adding it to usercustomize or sitecustomize.
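A guarded version of that call might look like the sketch below, which could live in a usercustomize.py. The wrapper name enable_unicode_console is hypothetical; only win_unicode_console.enable() comes from the package. The guards are there so that the same file is harmless on machines where the package is missing or the platform is not Windows.

```python
import sys


def enable_unicode_console():
    """Try to enable win-unicode-console; return True only on success."""
    if sys.platform != "win32":
        return False  # The package targets the Windows console only.
    try:
        import win_unicode_console
    except ImportError:
        return False  # Not installed: leave the streams untouched.
    win_unicode_console.enable()
    return True


enabled = enable_unicode_console()
```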
#3
1
Read through the Python Unicode HOWTO after you read this section from the tutorial:
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
To answer your first question, they are different because only when using u'' are you creating a unicode string.
2nd question: sys.getdefaultencoding() returns the default encoding.
But to quote from the link:
Python users who are new to Unicode sometimes are attracted by default encoding returned by sys.getdefaultencoding(). The first thing you should know about default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings to unicode strings.
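This is why unicode(s) failed in the question: on Python 2, with no codec given, the conversion falls back to the 'ascii' default, which cannot represent byte 0x89. The sketch below reproduces that with an explicit decode call so it behaves the same on Python 2 and 3. The choice of cp437 is an assumption taken from the other answers in this thread.

```python
raw = b"\x89"  # the byte the terminal produced for the letter

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # 0x89 is outside the 7-bit ASCII range (0x00-0x7f).
    ascii_ok = False

# Naming the codec explicitly avoids depending on the default.
text = raw.decode("cp437")
print(ascii_ok)
print(text == u"\xeb")
```

Naming the codec at every decode and encode call keeps the default encoding out of the picture entirely.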
#4
1
You've answered question 1 as you ask it: the first string is an encoded byte-string, but the second is not an encoding at all; it refers to a unicode code-point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab and in Win-1252 it's actually the same as the unicode code-point, i.e. hex eb. So, it's a bit of a mystery.
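One way to check those byte values is to encode the letter under a handful of codecs and print the hex. This is a sketch; the candidate list is an assumption, with the two DOS code pages added because other answers in this thread point to them.

```python
import binascii
import unicodedata

letter = u"\xeb"
print(unicodedata.name(letter))

# Candidate codecs are an assumption; extend the list as needed.
candidates = ["utf-8", "latin-1", "cp1252", "cp437", "cp850"]
encoded = {}
for codec in candidates:
    # hexlify gives the encoded bytes as hex digits, e.g. "c3ab".
    encoded[codec] = binascii.hexlify(letter.encode(codec)).decode("ascii")
    print("%-8s %s" % (codec, encoded[codec]))
```

The utf-8 and Win-1252 rows match the values quoted above, while the two DOS code pages both give 89, the byte seen in the question.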
#5
1
It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.
>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'
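When the character that was typed is known, the guesswork can be automated by trying each candidate codec and keeping those that decode the byte to that character. This is a rough sketch: the function name matching_codecs and the candidate list are illustrative assumptions, not an exhaustive search.

```python
def matching_codecs(raw, expected, candidates):
    """Return the candidate codecs that decode raw to expected."""
    matches = []
    for codec in candidates:
        try:
            if raw.decode(codec) == expected:
                matches.append(codec)
        except UnicodeDecodeError:
            pass  # raw is not valid in this codec at all
    return matches


candidates = ["utf-8", "latin-1", "cp1252", "cp1251", "cp437", "cp850"]
matches = matching_codecs(b"\x89", u"\xeb", candidates)
print(matches)  # ['cp437', 'cp850']
```

More than one codec can match, so a single byte narrows the field but does not identify the code page uniquely; sys.stdout.encoding remains the direct way to ask.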
#6
1
- Actually, a unicode object has no 'encoding'. You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate: http://farmdev.com/talks/unicode/
- You are on a Russian version of Windows, right? Your terminal uses cp1251.
#7
1
As you've figured out:
>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё
Do you open any file when you get such errors? If so, try opening it with:
import codecs
f = codecs.open('filename.txt','r','utf-8')
#8
0
In case others land on this page when searching: the easiest way is to set the code page in the terminal first:
CHCP 65001
then run your program.
This works well for me. For PowerShell, start it with:
powershell.exe -NoExit /c "chcp.com 65001"
It's from python: unicode in Windows terminal, encoding used?