How do I use xlrd to read an Excel file into Python? Can it read the newer Office formats?

My issue is below, but I would be interested in comments from anyone with experience with xlrd.

I just found xlrd and it looks like the perfect solution, but I'm having a little problem getting started. I am attempting to programmatically extract data from an Excel file I pulled from Dow Jones with the current components of the Dow Jones Industrial Average (link: http://www.djindexes.com/mdsidx/?event=showAverages).

When I open the file unmodified I get a nasty BIFF error (binary format not recognized)

However you can see in this screenshot that Excel 2008 for Mac thinks it is in 'Excel 1997-2004' format (screenshot: http://skitch.com/alok/ssa3/componentreport-dji.xls-properties)

If I instead open it in Excel manually and explicitly save it in 'Excel 1997-2004' format, then open it in Python using xlrd, everything is wonderful. Remember, Office thinks the file is already in 'Excel 1997-2004' format. All files are .xls.

Here is a pastebin of an ipython session replicating the issue: http://pastie.textmate.org/private/jbawdtrvlrruh88mzueqdq

Any thoughts on: How to trick xlrd into recognizing the file so I can extract data? How to use python to automate the explicit 'save as' format to one that xlrd will accept? Plan B?

5 Answers

#1

FWIW, I'm the author of xlrd, and the maintainer of xlwt (a fork of pyExcelerator). A few points:

The file ComponentReport-DJI.xls is misnamed; it is not an XLS file, it is a tab-separated-values file. Open it with a text editor (e.g. Notepad) and you'll see what I mean. You can also look at the not-very-raw raw bytes with Python:

```
>>> open('ComponentReport-DJI.xls', 'rb').read(200)
'COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR\tMARKET CAP RANGE\
tWEIGHT PCT\tUSD CLOSE\t\r\n3M Co.\tNew York SE\tMMM\tN/A\tDiversified Industria
ls\tBroad\t5.15676229508\t50.33\t\r\nAlcoa Inc.\tNew York SE\tA'
```
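A quick way to check what a file really is, before handing it to a reader, is to look at its first bytes. The sketch below is only a heuristic and assumes Python 3; the function name `guess_container` is made up for this illustration. It relies on two well-known signatures: OLE2 compound documents (the container used by 'Excel 97-2004' .xls files) start with the bytes `D0 CF 11 E0 A1 B1 1A E1`, and ZIP archives (the container used by the Office 2007+ formats) start with `PK\x03\x04`. Very old, non-OLE2 Excel files would fall through to "unknown".

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound document signature
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header signature

def guess_container(head):
    """Guess the container type from the first bytes of a file (heuristic only)."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # No NUL bytes and at least one tab: probably tab-separated text.
    if b"\t" in head and b"\x00" not in head:
        return "tab-separated text"
    return "unknown"

# The first bytes of ComponentReport-DJI.xls, as shown in the session above.
head = b"COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR"
print(guess_container(head))
```

For a file on disk, something like `guess_container(open(path, 'rb').read(200))` would do the same check.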
You can read this file using Python's csv module ... just use delimiter="\t" in your call to csv.reader().
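As a minimal sketch of that suggestion (assuming Python 3; `read_tsv` is a hypothetical helper name, and the sample text is the first two lines of the byte dump shown above), the rows can be turned into dictionaries keyed by the header line:

```python
import csv
import io

def read_tsv(stream):
    """Return the data rows of a tab-separated stream as a list of dicts."""
    reader = csv.reader(stream, delimiter="\t")
    # Each line in the sample ends with a tab, which yields an empty
    # trailing field; drop empty names from the header.
    header = [name for name in next(reader) if name]
    # zip() stops at the end of the header, so the empty trailing
    # field of each data row is ignored as well.
    return [dict(zip(header, row)) for row in reader if row]

sample = (
    "COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR\t"
    "MARKET CAP RANGE\tWEIGHT PCT\tUSD CLOSE\t\r\n"
    "3M Co.\tNew York SE\tMMM\tN/A\tDiversified Industrials\tBroad\t"
    "5.15676229508\t50.33\t\r\n"
)
rows = read_tsv(io.StringIO(sample, newline=""))
print(rows[0]["TICKER"], rows[0]["USD CLOSE"])
```

With the real file, the stream would come from `open('ComponentReport-DJI.xls', newline='')` instead of `io.StringIO`. All values arrive as strings, so numeric columns such as USD CLOSE still need `float()`.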

xlrd can read any file that pyExcelerator can, and read them better—dates don't come out as floats, and the full story on Excel dates is in the xlrd documentation.

pyExcelerator is abandonware—xlrd and xlwt are alive and well. Check out http://groups.google.com/group/python-excel

HTH John

#2

xlrd support for Office 2007/2008 (OpenXML) format is in alpha test - see the following post in the python-excel newsgroup: http://groups.google.com/group/python-excel/msg/0c5f15ad122bf24b?hl=en

#3

More info on pyExcelerator: To read a file, do this:

```
import pyExcelerator
book = pyExcelerator.parse_xls(filename)
```

where filename is a string that is the filename to read (not a file-like object). This will give you a data structure representing the workbook: a list of pairs, where the first element of the pair is the worksheet name and the second element is the worksheet data.

The worksheet data is a dictionary, where the keys are (row, col) pairs (starting with 0) and the values are the cell contents -- generally int, float, or string. So, for instance, in the simple case of all the data being on the first worksheet:

```
data = book[0][1]
print 'Cell A1 of worksheet %s is: %s' % (book[0][0], repr(data[(0, 0)]))
```
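Because the worksheet data is a dictionary keyed by (row, col), cells with no entry can be read with `dict.get` rather than direct indexing. The following is a rough sketch in Python 3 that does not call pyExcelerator at all: `sheet_to_rows` is a hypothetical helper, and `book` is made-up data shaped like the structure described above.

```python
def sheet_to_rows(cells, empty=None):
    """Convert a {(row, col): value} dict into a rectangular list of rows."""
    if not cells:
        return []
    n_rows = max(r for r, _ in cells) + 1
    n_cols = max(c for _, c in cells) + 1
    # dict.get supplies `empty` for cells that have no entry.
    return [[cells.get((r, c), empty) for c in range(n_cols)]
            for r in range(n_rows)]

# Hypothetical workbook: a list of (worksheet name, worksheet data) pairs.
book = [
    ("Sheet1", {(0, 0): "TICKER", (0, 1): "USD CLOSE",
                (1, 0): "MMM", (1, 1): 50.33,
                (2, 0): "AA"}),  # cell (2, 1) is deliberately missing
]
name, cells = book[0]
rows = sheet_to_rows(cells)
for row in rows:
    print(row)
```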

If the cell is empty, you'll get a KeyError. If you're dealing with dates, they may (I forget) come through as integers or floats; if this is the case, you'll need to convert. Basically the rule is: datetime.datetime(1899, 12, 31) + datetime.timedelta(days=n) but that might be off by 1 or 2 (because Excel treats 1900 as a leap-year for compatibility with Lotus, and because I can't remember if 1900-1-1 is 0 or 1), so do some trial-and-error to check. Datetimes are stored as floats, I think (days and fractions of a day).
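To take the trial-and-error out of the date rule, here is a small sketch in Python 3 under one stated assumption: the workbook uses Excel's 1900 date system, where serial 1 is 1900-01-01 and serial 60 is the nonexistent 1900-02-29. Workbooks saved with the 1904 date system (historically common on the Mac) need a different epoch and are not handled. The function name `excel_serial_to_datetime` is made up for this example.

```python
import datetime

def excel_serial_to_datetime(serial):
    """Convert a 1900-system Excel serial (days, possibly fractional) to a datetime."""
    if serial < 1:
        raise ValueError("serials below 1 are not calendar dates")
    if 60 <= serial < 61:
        raise ValueError("serial 60 is Excel's nonexistent 1900-02-29")
    # Excel counts a fictitious 1900-02-29, so every serial from 61 on
    # is one day ahead of the real calendar; shift the epoch to compensate.
    if serial < 60:
        epoch = datetime.datetime(1899, 12, 31)
    else:
        epoch = datetime.datetime(1899, 12, 30)
    return epoch + datetime.timedelta(days=serial)

print(excel_serial_to_datetime(39448.5))
```

The fractional part of the serial becomes the time of day, which matches the "days and fractions of a day" storage mentioned above.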

I think there is partial support for formulas, but I wouldn't guarantee anything.

#4

Well here is some code that I did: (look down the bottom): here

Not sure about the newer formats. If xlrd can't read them, xlrd needs to have a new version released!

#5

Do you have to use xlrd? I just downloaded 'UPDATED - Dow Jones Industrial Average Movers - 2008' from that website and had no trouble reading it with pyExcelerator.

```
import pyExcelerator
book = pyExcelerator.parse_xls('DJIAMovers.xls')
```
