Understanding Python Unicode and the Linux terminal

Date: 2022-08-09 00:10:06

I have a Python script that writes some strings with UTF-8 encoding. In my script I mainly use the str() function to convert values to strings. It looks like this:

mystring = "this is unicode string:" + japanesevalues[1]
# japanesevalues is a list of unicode values; I am sure it is unicode
print mystring

I don't use the Python terminal, just the standard Linux Red Hat x86_64 terminal. I set the terminal to output utf8 chars.

If I execute this:

#python myscript.py
this is unicode string: カラダーズ ソフィー

But if I do this:

#python myscript.py > output

I get the typical error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)

Why is that?

2 answers

#1


15  

The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.

But when you redirect, you are no longer writing to the terminal; the output goes to a file (or a Unix pipe). A file or pipe doesn't have a charset, and Python has no way of knowing which encoding you want, so it falls back to a default character set. You tagged your question "Python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. There, sys.getdefaultencoding() is generally 'ascii', and the error message shows that it is in your case. And of course you cannot encode Japanese characters as ASCII, so you get an error.
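The failure described above can be reproduced without any redirection by encoding the text by hand. This is only a minimal sketch: try_encode is a hypothetical helper name, the sample text is taken from the question, and the block is written for Python 3.

```python
# Sample text from the question; any non-ASCII string behaves the same way.
text = u"this is unicode string: カラダーズ ソフィー"


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(text, "ascii")  # Japanese characters are outside ASCII
utf8_result = try_encode(text, "utf-8")   # UTF-8 can represent every code point

print("ascii ok:", ascii_result is not None)
print("utf-8 ok:", utf8_result is not None)
```

The ASCII attempt fails for the same reason the redirected print does: the codec has no byte for any character above 127.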

Your best bet when using Python 2 is to encode the string as UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. The output will be garbled if your terminal uses a different encoding, though; to handle that, read the terminal encoding from sys.stdout.encoding and use it, falling back to UTF-8 when it is None (which it is when redirecting under Python 2).
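One way to sketch that fallback is shown here. The names stream_encoding, encode_for and RedirectedStream are hypothetical and exist only for illustration; RedirectedStream merely imitates a Python 2 stdout whose encoding attribute is None.

```python
import sys


def stream_encoding(stream, fallback="utf-8"):
    """Return the stream's declared encoding, or the fallback when it has none."""
    return getattr(stream, "encoding", None) or fallback


def encode_for(stream, text, fallback="utf-8"):
    """Encode text with the encoding chosen for the given stream."""
    return text.encode(stream_encoding(stream, fallback))


class RedirectedStream(object):
    """Hypothetical stand-in for a redirected Python 2 stdout (encoding is None)."""
    encoding = None


# With no declared encoding, the helper falls back to UTF-8.
data = encode_for(RedirectedStream(), u"カラダーズ ソフィー")

# The encoding that would be used for the real stdout of this process.
current = stream_encoding(sys.stdout)
```

Under Python 2 the script from the question would then end with print encode_for(sys.stdout, mystring). The choice of UTF-8 as the fallback is an assumption that suits files read back as UTF-8.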

In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
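For comparison, a Python 3 version of the snippet might look like the sketch below. The contents of japanesevalues are a made-up stand-in for the list in the question, and io.StringIO replaces stdout only so that the example behaves the same in any environment.

```python
import io

# Hypothetical stand-in for the list in the question.
japanesevalues = ["", "カラダーズ ソフィー"]
mystring = "this is unicode string: " + japanesevalues[1]

# In Python 3, str is already Unicode and print is a function.
buffer = io.StringIO()
print(mystring, file=buffer)
```

In the real script the last two lines would simply be print(mystring).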

#2


2  

If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.
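To see what Python derives from the environment, something like the following can be run. It is a rough sketch for Python 3; LANG is only one of the locale variables involved (LC_ALL and LC_CTYPE take precedence over it when set).

```python
import locale
import os

# LANG as seen by this process, e.g. "en_US.UTF-8"; None when it is unset.
lang = os.environ.get("LANG")

# The encoding Python infers from the locale settings; False means
# "do not change the process locale while looking it up".
preferred = locale.getpreferredencoding(False)

print("LANG =", lang)
print("preferred encoding =", preferred)
```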
