Python: encoding output to UTF-8

Date: 2022-04-19 20:20:57

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened with the 'w+' and "utf-8" arguments.


However, when I try x.write(string) I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)


I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' negate that...


Any suggestions?


EDIT: Actual code here:


import codecs

source = codecs.open("actionbreak/" + target + '.csv', 'r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)

Essentially, all this is supposed to do is build a large number of strings that look similar to this:


[日木曜 Deliverables]= CASE WHEN things = 11 THEN C ELSE 0 END


3 Answers

#1


5  

Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ASCII strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.

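For illustration, a minimal sketch contrasting the two approaches (the file name out.txt is just a placeholder):

# Built-in open(): you must encode each unicode string yourself
with open('out.txt', 'w+') as f:
    f.write(u'abc \u0430\u0431\u0432'.encode('utf-8'))

# codecs.open(): the wrapped file object encodes unicode strings for you
import codecs
with codecs.open('out.txt', 'w+', 'utf-8') as f:
    f.write(u'abc \u0430\u0431\u0432')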


As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest attempting to open the input and/or output file with the encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) works, then I believe the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not have an issue but the output could.

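A sketch of that suggestion, with placeholder file names standing in for your actual paths:

import codecs

# "utf-8-sig" strips a leading BOM when reading (and writes one when writing)
source = codecs.open("actionbreak/input.csv", 'r', 'utf-8-sig')
outTarget = codecs.open("actionbreak/output.csv", 'w+', 'utf-8')
outTarget.write(source.read())
source.close()
outTarget.close()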


Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).

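A contrived reproduction of the reported error, assuming the built string begins with the BOM the traceback mentions:

>>> s = u'\ufeff[\u65e5\u6728\u66dc Deliverables]'
>>> str(s)        # implicit ASCII encode of a non-ASCII unicode string
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
>>> unicode(s)    # stays unicode, so a codecs.open() file can encode it
u'\ufeff[\u65e5\u6728\u66dc Deliverables]'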

Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.


#2


5  

In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, the second contains unicode code points. It is easy to tell which type a string is: the repr of a unicode string starts with u:


# byte string
>>> 'abc'
'abc'

# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'

The 'abc' characters look the same because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. "Code point" is Python's internal representation of unicode characters; code points can't be saved to a file directly, so they need to be encoded to bytes first. Here is what the encoded unicode string looks like (once encoded, it becomes a byte string):


>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'

This encoded string can now be written to a file:


>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))

Now it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode it. Here is what the data looks like without decoding:


>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'

You see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode them, we need to provide the codec name:


>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'

After decoding, we've got our unicode string with unicode code points back:


>>> print content.decode('utf8')
abc абв
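
Tying this back to the question: codecs.open() performs these encode and decode steps for you, so you pass in and get back unicode strings directly. A minimal sketch:

>>> import codecs
>>> s = u'abc \u0430\u0431\u0432'
>>> with codecs.open('text.txt', 'w+', 'utf8') as f:
...     f.write(s)           # unicode in; the wrapper encodes to UTF-8
...
>>> with codecs.open('text.txt', 'r', 'utf8') as f:
...     print f.read()       # bytes are decoded back to unicode automatically
...
abc абв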

#3


1  

xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's a callback to the early days of unicode, when people couldn't agree on which byte order they wanted. Now many unicode documents are prefaced with this mark: it reads as \ufeff when decoded in the right byte order, and as \ufffe when decoded in the wrong one, which is how readers detect the order.


If you hit an error on those characters in the first position, you can be fairly sure the issue is that you are not decoding the file as utf-8 (or utf-8-sig), and the file itself is probably fine.

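To see the difference concretely, a small decoding demo (the byte string below is a hypothetical file prefix):

>>> data = '\xef\xbb\xbfabc'      # UTF-8 BOM followed by 'abc'
>>> data.decode('utf-8')          # plain utf-8 keeps the BOM as a character
u'\ufeffabc'
>>> data.decode('utf-8-sig')      # utf-8-sig strips it
u'abc'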
