Disadvantages of using encode('utf-8') when reading strings from Excel in Python

I am reading a large amount of data from an Excel spreadsheet, then reformatting and rewriting it, using the following general structure:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
# x and y are column indices; start at 1 so that z never runs past the last row
for z in range(1, sheettwo.nrows):
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")

where x and y are arbitrary column indices in this case, with x being less arbitrary: it is a column that contains non-ASCII (UTF-8) characters.

So far I have only been using .encode('utf-8') on cells where I know, or can foresee, that an error would occur without it.

My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells, even where it is unnecessary? Efficiency is not an issue. The main concern is that the script keeps working even if a non-ASCII character turns up in a place where there shouldn't be one. If no errors would result from simply appending .encode('utf-8') to every cell read, I will probably end up doing that.

2 Answers

#1

The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, the cells contain Unicode code points anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert it to ASCII (which is what the str() function does). Use the code below:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
# Start at 1 so that z never runs past the last row
for z in range(1, sheettwo.nrows):
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()
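
A possible alternative is to let the file object do the encoding, so that no individual string needs an explicit .encode() call. The following is only a minimal sketch, assuming io.open (available in Python 2.6+ and in Python 3); the helper name write_lines and the sample data are made up for illustration and are not part of xlrd.

```python
import io
import os
import tempfile


def write_lines(path, lines):
    """Write text lines to `path`; the file object encodes them as UTF-8."""
    # io.open accepts text (unicode) strings and encodes them on write,
    # on both Python 2.6+ and Python 3.
    with io.open(path, 'w', encoding='utf-8') as out:
        for line in lines:
            out.write(line + u"\n")


# Hypothetical sample data standing in for values read from the sheet.
sample = [u"caf\u00e9", u"\u0400 and done"]
path = os.path.join(tempfile.mkdtemp(), 'output.file')
write_lines(path, sample)

# Read the raw bytes back to inspect what was actually written.
with io.open(path, 'rb') as raw:
    data = raw.read()
```

With this approach the formatting code deals only in Unicode text, and the encoding is decided in one place, where the file is opened.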

#2

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
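
The same idea can be written so that it also runs on Python 3, where the built-in unicode no longer exists. This is a sketch under assumptions: text_type and cell_to_text are hypothetical names, not xlrd functions. It also hints at why calling .encode('utf-8') on every cell is not safe in general: xlrd returns number cells as floats, and floats have no .encode method.

```python
# `text_type` is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")


def cell_to_text(value):
    """Convert a cell value to text, using repr() for floats."""
    if isinstance(value, float):
        # repr() keeps enough digits to round-trip the float.
        return text_type(repr(value))
    if isinstance(value, text_type):
        # Already text: return it unchanged.
        return value
    return text_type(value)


# Floats have no .encode method, so encoding every cell blindly would fail.
try:
    (1234.5).encode('utf-8')
    float_has_encode = True
except AttributeError:
    float_has_encode = False
```

Converting every value to text first and encoding once at write time avoids having to know in advance which cells hold numbers and which hold strings.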

(3) xlrd builds Cell objects on the fly when demanded, so asking for the value directly avoids creating a throwaway Cell object on every access:

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
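
Putting these comments together, the loop might look roughly like the sketch below. It relies only on sheet.nrows and sheet.cell_value(rowx, colx) from xlrd; the function name format_rows, the start_row parameter (assumed to be 1 because the original loop skips the first row) and the FakeSheet stand-in used to exercise it without a real workbook are all assumptions for illustration.

```python
def format_rows(sheet, y, x, start_row=1):
    """Yield one formatted text line per row, using columns y and x.

    `sheet` is expected to behave like an xlrd sheet, i.e. to provide
    `nrows` and `cell_value(rowx, colx)`.
    """
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    for rowx in range(start_row, sheet.nrows):
        yield u"".join([
            text_type(sheet.cell_value(rowx, y)),
            u" more formatting! ",
            text_type(sheet.cell_value(rowx, x)),
            u" and done",
        ])


class FakeSheet(object):
    """Hypothetical stand-in for an xlrd sheet, backed by a list of rows."""

    def __init__(self, rows):
        self._rows = rows
        self.nrows = len(rows)

    def cell_value(self, rowx, colx):
        return self._rows[rowx][colx]


sheet = FakeSheet([[u"id", u"name"], [1.0, u"caf\u00e9"], [2.0, u"\u0400"]])
lines = list(format_rows(sheet, 0, 1))
```

Each yielded line is Unicode text, so it can be encoded once when it is written out.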
