I'm trying to parse an XLSX file using the roo gem in a Ruby script.
In Excel, dates are stored as floats or integers in the format DDDDD.ttttt, counting from 1900-01-00 (day 00, not 01). So in order to convert a date such as 40396, you would take 1900-01-00 + 40396 and you should get 2010-10-15, but I'm getting 2010-08-08.
I'm using active_support/time to do the calculation like so:
Time.new("1900-01-01") + 40396.days
Am I doing my calculation wrong, or is there a bug in Active Support?
I'm running Ruby 1.9.3 (MRI) on Windows 7 with the latest active_support gem (3.2.1).
EDIT
I was looking at an older file in Excel with the wrong data, while my script and console were pulling the right data, hence my confusion. I was doing everything right, except for using the right file! Damn the all-nighters!
Thanks to everyone who replied. I will keep the question here in case somebody needs info on how to convert dates from Excel using Ruby.
Also, for anyone else running into this: the spreadsheet gem does NOT properly support reading XLSX files at this point (v0.7.1), so I'm using roo for reading and axlsx for writing.
3 Answers
#1
25
You have an off-by-one error in your day numbering - due to a bug in Lotus 1-2-3 that Excel and other spreadsheet programs have carefully maintained compatibility with for 30+ years.
Originally, day 1 was intended to be January 1, 1900 (which would, as you stated, make day 0 equal to December 31, 1899). But Lotus incorrectly considered 1900 to be a leap year, so the day numbers for everything before March 1st of that year are one too high. Using those numbers with a calendar that correctly counts 1900 as a common year, day 1 becomes December 31st and day 0 shifts back to the 30th. So the epoch for date arithmetic in Lotus-based spreadsheets is Saturday, December 30th, 1899. (Modern Excel and some other spreadsheets extend the Lotus bug-compatibility far enough to continue to label that date "December 31st" while agreeing that it was a Saturday, but other Lotus-based spreadsheets don't, and Ruby certainly doesn't either.)
Even allowing for this error, however, your stated example is incorrect: Lotus day number 40,396 is August 6th, 2010, not October 15th. I have confirmed this correspondence in Excel, LibreOffice, and Google Sheets, all of which agree. You must have crossed examples somewhere.
Here's one way to do the conversion:
Time.utc(1899,12,30) + 40396.days #=> 2010-08-06 00:00:00 UTC
Alternatively, you could take advantage of another known correspondence. Time zero for Ruby (and POSIX systems in general) is the moment January 1, 1970, at midnight GMT. January 1, 1970 is Lotus day 25,569. As long as you remember to do your calculations in UTC, you can also do this:
Time.at( (40396 - 25569).days ).utc # => 2010-08-06 00:00:00 UTC
In either case, you probably want to declare a symbolic constant for the epoch date (either the Time object representing 1899-12-30 or the POSIX "day 0" value 25,569).
You can replace those calls to .days with multiplication by 86,400 if you don't need active_support/core_ext/integer/time for anything else, and don't want to load it just for this.
#2
4
"Excel stores dates and times as a number representing the number of days since 1900-Jan-0, plus a fractional portion of a 24 hour day: ddddd.tttttt . This is called a serial date, or serial date-time." (http://www.cpearson.com/excel/datetime.htm)
If your column contains a date-time, rather than just a date, the following code is useful:
dt = DateTime.new(1899, 12, 30) + excel_value.to_f
Also keep in mind that there are two date modes in an Excel workbook, 1900-based and 1904-based. The 1904 mode is typically enabled by default for spreadsheets created on the Mac. If you consistently find your dates off by 4 years, you should use a different base date:
dt = DateTime.new(1904, 1, 1) + excel_value.to_f
You can enable or disable 1904 date mode for any spreadsheet, but if you change the setting after adding data, the existing dates will then appear off by 4 years in the spreadsheet. In general you should always use 1900 date mode, since most Excel users in the wild are Windows-based.
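The two base dates can be folded into one helper, sketched below. The method name and the date1904 argument are hypothetical; the flag has to come from the workbook's own settings, and how to read it with a particular gem is not covered here.

```ruby
require "date"

# Hypothetical helper, not part of roo or any other gem.
# date1904 should reflect the workbook's date system; pass true for
# workbooks saved in 1904 mode.
def excel_serial_to_datetime(serial, date1904 = false)
  base = date1904 ? DateTime.new(1904, 1, 1) : DateTime.new(1899, 12, 30)
  base + serial.to_f
end

# The same moment (noon on 2010-08-06) expressed in each date system.
# The two systems differ by 1,462 days.
puts excel_serial_to_datetime(40396.5).strftime("%Y-%m-%d %H:%M")         # 2010-08-06 12:00
puts excel_serial_to_datetime(38934.5, true).strftime("%Y-%m-%d %H:%M")   # 2010-08-06 12:00
```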
Note: A gotcha with this method is that the result may be off by up to a second (+/- 1 second) because of floating-point rounding. For me the dates I import are "close enough", but it is something to keep in mind. A better solution might round the fractional seconds to solve this issue.
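One possible way to apply that rounding, sketched under the assumption that the cell value is a 1900-mode serial and that whole-second precision is enough: convert the serial to seconds, round to the nearest integer, and only then add it to the epoch. The method name is made up for this example.

```ruby
# Hypothetical helper; assumes the 1900 date system and UTC.
def excel_serial_to_time(serial)
  epoch = Time.utc(1899, 12, 30)
  # Round to the nearest whole second so float noise in the fractional
  # day cannot push the result one second off.
  seconds = (serial.to_f * 86_400).round
  epoch + seconds
end

puts excel_serial_to_time(40396.75).strftime("%Y-%m-%d %H:%M:%S")  # 2010-08-06 18:00:00
```

Because the seconds value is an Integer by the time it is added, the resulting Time carries no sub-second remainder.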
#3
3
You're doing your calculation wrong. How do you arrive at the expected result of 2010-10-15?
In Excel, 40396 is 2010-08-06 (not using the 1904 calendar, of course). To demonstrate that, type 40396 into an Excel cell and set the format to yyyy-mm-dd.
Alternatively:
40396 / 365.2422 = 110.6 (years -- 1900 + 110 = 2010)
0.6 * 12 = 7.2 (months -- January = 1; 1 + 7 = 8; 8 = August)
0.2 * 30 = 6 (days)
Excel's calendar incorrectly includes 1900-02-29; that accounts for one day of the difference between 2010-08-06 and your 2010-08-08 result. I'm not sure about the reason for the second day of difference.
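A likely explanation for the second day is that Excel's serial numbers start at 1, not 0, for 1900-01-01, so adding the serial to 1900-01-01 counts that day twice. The short sketch below, using only the standard date library, compares the question's base date with the 1899-12-30 epoch; the variable names are illustrative.

```ruby
require "date"

serial = 40396

# Adding the serial to 1900-01-01, as the question does.
naive = Date.new(1900, 1, 1) + serial
# Adding it to the 1899-12-30 epoch instead.
adjusted = Date.new(1899, 12, 30) + serial

puts naive                       # 2010-08-08
puts adjusted                    # 2010-08-06
puts((naive - adjusted).to_i)    # 2
```

The two-day gap is exactly the distance between the two base dates: one day for the phantom 1900-02-29 and one for the serial starting at 1.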