The following code iterates over rows, an iterable of strings that together contain a PDF byte stream. Each row was of type str, and the resulting file was a valid PDF that could be opened.
with open(fname, "wb") as fd:
for row in rows:
fd.write(row)
Due to a new C library and changes in the Python implementation, the type of row changed from str to unicode, and the corresponding content changed as well, so my PDF file is now broken.
Starting bytes of the first row object:
old row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A ...
new row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A ...
I have aligned the corresponding byte positions here, so this looks like a Unicode encoding problem.
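That suspicion can be checked directly: if the file was mistakenly read as Latin-1, the new dump should be exactly the old dump decoded as Latin-1 and re-encoded as UTF-8. A minimal Python 3 sketch using the byte values from the dumps above:

```python
# Hypothesis check: the "new" bytes should equal the "old" bytes
# decoded as Latin-1 and re-encoded as UTF-8.
# Byte values are taken from the two dumps above.
old = bytes.fromhex("25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A")
new = bytes.fromhex("25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A")

# Every byte 0x00-0xFF maps to the code point of the same value in Latin-1,
# so this round trip reproduces the mis-decoding exactly.
assert old.decode("latin-1").encode("utf-8") == new
```

The assertion holds: each non-ASCII byte (e.g. E2) turns into the two-byte UTF-8 sequence seen in the new dump (C3 A2), while the ASCII bytes stay unchanged.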
I think this is a good start, but my input is still a unicode string...
>>> "\xc3\xa2".decode('utf8') # but as input I have u"\xc3\xa2"
u'\xe2'
I have already tried several combinations of encode and decode calls, so I need a more analytical way to fix this; I can't see the wood for the trees. Thank you.
2 Answers
#1
0
When you find u"\xc3\xa2"
in a Python unicode string, it usually means that a UTF-8 encoded file was read as if it were Latin-1 encoded. So the best thing to do is certainly to fix the initial read.
That being said, if you have to depend on the broken code, the fix is still easy: you just encode the string as Latin-1 and then decode it as UTF-8:
fixed_u_str = broken_u_str.encode('Latin1').decode('UTF-8')
For example:
u"\xc3\xa2\xc3\xa3".encode('Latin1').decode('utf8')
correctly gives u"\xe2\xe3", which displays as âã.
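The same round trip, written as a self-contained Python 3 sketch (in Python 3 the text type is simply str, so the u prefix is optional):

```python
# Mojibake repair: UTF-8 bytes were mistakenly decoded as Latin-1.
# Reversing the wrong step (encode as Latin-1 to get the raw bytes back)
# and then redoing it correctly (decode as UTF-8) recovers the text.
broken = "\xc3\xa2\xc3\xa3"                        # what the misread produced
fixed = broken.encode("latin-1").decode("utf-8")   # undo, then redo correctly

assert fixed == "\xe2\xe3"                         # displays as "âã"
```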
#2
0
It looks like you should be doing
fd.write(row.encode('utf-8'))
assuming the type of row is now unicode (that is my understanding of how you presented things).
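A self-contained Python 3 sketch of that write loop (the rows data, fname, and the sample header row are made up for illustration). One caveat: the codec passed to encode must match however the bytes originally became text; given the dumps in the question, Latin-1 reproduces the original byte values exactly, whereas UTF-8 only round-trips cleanly if the rows were correctly UTF-8-decoded in the first place.

```python
import os
import tempfile

# Made-up sample data: one unicode row holding the PDF header
# from the question (code points, not bytes).
rows = ["%PDF-1.3\r\n%\xe2\xe3\xcf\xd3\r\n"]
fname = os.path.join(tempfile.mkdtemp(), "out.pdf")

with open(fname, "wb") as fd:
    for row in rows:
        # Latin-1 maps code points 0-255 back to identical byte values,
        # so this reproduces the original PDF bytes exactly.
        fd.write(row.encode("latin-1"))

with open(fname, "rb") as fd:
    assert fd.read().startswith(b"%PDF-1.3")
```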