Convert string dates to R Date FAST for all dates

Time: 2022-03-04 22:52:21

This has been asked several times with no clear answer: I would like to convert an R character string of the form "YYYY-mm-dd" into a Date. The as.Date function is exceedingly slow. The question "convert character to date *quickly* in R" provides a solution using fasttime that works for dates from 1970 onward. My issue is that my dates start from 1900, and there are about 100 million of them to convert. I have to do this frequently, so speed is important. Are there any other solutions?


5 solutions

#1


7  

I can get a little speedup by using the date package:


library(date)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
#    user  system elapsed 
#    6.54    0.01    6.56 
system.time(ddate <- as.Date(as.date(x,"ymd")))
#    user  system elapsed 
#    3.42    0.22    3.64 

You might want to look at the C code it uses and see if you can modify it to be faster for your specific situation.
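As a rough illustration of what "specialized for your situation" could mean, here is a minimal base R sketch that assumes every input is a well-formed, zero-padded "YYYY-mm-dd" string and computes the day count with integer arithmetic instead of general-purpose parsing. The function name ymd_to_date is hypothetical, the arithmetic is the well-known days-from-civil calculation rather than the date package's own C code, and no validation is performed. Whether it is actually faster on your data should be measured.

```r
# Hypothetical helper: fixed-width "YYYY-mm-dd" strings to Date, no validation.
ymd_to_date <- function(x) {
  y <- as.integer(substr(x, 1L, 4L))
  m <- as.integer(substr(x, 6L, 7L))
  d <- as.integer(substr(x, 9L, 10L))

  # Treat January and February as months 11 and 12 of the previous year,
  # so the leap day falls at the end of the shifted year.
  y   <- y - (m <= 2L)
  era <- y %/% 400L                      # 400-year cycle index
  yoe <- y - era * 400L                  # year within the cycle, 0..399
  mp  <- m + ifelse(m > 2L, -3L, 9L)     # shifted month, March = 0
  doy <- (153L * mp + 2L) %/% 5L + d - 1L
  doe <- yoe * 365L + yoe %/% 4L - yoe %/% 100L + doy

  # 719468 is the offset that makes 1970-01-01 equal to day 0.
  structure(as.numeric(era * 146097L + doe - 719468L), class = "Date")
}

x <- c("1900-01-01", "1969-12-31", "1970-01-01", "2000-02-29")
ymd_to_date(x)
```

Because nothing is validated, a string such as "2021-02-31" would silently produce a date in March, so this only makes sense when the input format is guaranteed.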


#2


6  

I had a similar problem a while ago and came up with the following solution:


  1. Convert the string to a factor (if it is not already a factor).
  2. Convert the levels of the factor to a Date.
  3. Expand the converted levels to the full result using the index vector of the factor.

Extending Joshua Ulrich's example, I get the following (with slower timings on my laptop):

library(date)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
#    user  system elapsed 
#    12.09   0.00   12.12 
system.time(ddate <- as.Date(as.date(x,"ymd")))
#    user  system elapsed 
#    6.97    0.04    7.05 
system.time({
    xf <- as.factor(x)
    dDate <- as.Date(levels(xf))[as.integer(xf)]
})
#    user  system elapsed 
#    1.16    0.00    1.15

Here, step 2 does not depend on the length of x once x is large enough, because the number of distinct dates is bounded, and step 3 scales extremely well (simple vector indexing). The bottleneck should be step 1, which can be avoided if the data is already stored as a factor.
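The same idea can be sketched without an explicit factor by using unique() and match() from base R. This is a variation on the approach above, not the code that was timed, and the function name fast_as_date is hypothetical. It assumes the input is a character vector in "YYYY-mm-dd" form.

```r
# Hypothetical helper: parse each distinct string once, then expand by index.
fast_as_date <- function(x) {
  ux  <- unique(x)                          # distinct date strings only
  idx <- match(x, ux)                       # position of each element in ux
  as.Date(ux, format = "%Y-%m-%d")[idx]     # parse few values, index many
}

x <- c("1900-01-01", "2000-02-29", "1900-01-01", NA, "1969-12-31")
fast_as_date(x)
```

The gain depends on how many repeated values the data contains: with 100 million rows drawn from a few tens of thousands of distinct days, almost all of the parsing work is skipped.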


#3


3  

The function parse_date_time from the 'lubridate' package is quite fast too:


library(date)
library(lubridate)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(date1 <- as.Date(x))
#  user  system elapsed 
# 12.86    0.00   12.94 
system.time(date2 <- as.Date(as.date(x,"ymd"))) # from package 'date'
#  user  system elapsed 
#  4.82    0.00    4.85 
system.time(date3 <- as.Date(parse_date_time(x,'%y-%m-%d'))) # from package 'lubridate'
#  user  system elapsed 
#  0.27    0.00    0.26 
all(date1 == date2)
#  TRUE
all(date1 == date3)
#  TRUE

#4


1  

A further speedup: You already work with data.table. So, create a lookup table with your dates and merge them with your data.


library(lubridate)
library(data.table)

y <- seq(as.Date('1900-01-01'), Sys.Date(), by = 'day')
id.date <- data.table(id = as.character(y), date = as.Date(y), key = 'id')

set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))

system.time(date3 <- as.Date(parse_date_time(x,'%y-%m-%d'))) # from package 'lubridate'
#  user  system elapsed 
#  0.15  0.00   0.15  

system.time(date4 <- id.date[setDT(list(id = x)), on='id', date])
#  user  system elapsed 
#  0.08  0.00   0.08

all(date3 == date4)
# TRUE

It's kind of a workaround, but I believe that's how data.table is intended to be used. I don't know whether the time/date packages mentioned above are internally based on algorithms or also on lookup tables (hash tables).

For larger datasets, whenever character manipulation is involved, which tends to be slow, I consider switching to a lookup against a reference table.
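For readers without data.table, the reference-table idea can be sketched in base R with match(). This is a simplified stand-in for the keyed join above, not an equivalent benchmark. The names days, keys, and lookup_date are hypothetical, and the end date of the table is an arbitrary assumption.

```r
# Build the reference table once: every calendar day in an assumed range.
days <- seq(as.Date("1900-01-01"), as.Date("2100-12-31"), by = "day")
keys <- format(days, "%Y-%m-%d")          # character keys such as "1900-01-01"

# Hypothetical helper: look each string up in the table.
# Strings that are not in the table come back as NA.
lookup_date <- function(x) days[match(x, keys)]

x <- c("1900-01-01", "1969-12-31", "2000-02-29", "not a date")
lookup_date(x)
```

The table costs a one-off construction step and a few megabytes of memory, after which each conversion involves no date parsing at all.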


#5


1  

Consider the very fast anytime library, which has no problem with dates before 1970. It uses the Boost date_time C++ library and provides the functions anytime() and anydate() for conversions. Comparison:

require(anytime)        #anydate()
require(lubridate)      #parse_date_time()
require(microbenchmark) #microbenchmark()

set.seed(21)
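# 1 million random dates. Note: test.dd is a Date vector, not a character vector,
# so the timings below may not reflect string parsing.
test.dd <- as.Date("2018-05-16") - sample(40000, 1e6, TRUE)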

microbenchmark(
    strptime(test.dd, "%Y-%m-%d"),                     #basic strptime
    parse_date_time(test.dd, orders = "ymd"),          #lubridate (POSIXct class)
    as.Date(parse_date_time(test.dd, orders = "ymd")), #lubridate + date class conversion
    anydate(test.dd),                                  #anytime library
    times = 10L, unit = "s"
)

Result/Output:

Unit: seconds
                                             expr          min           lq         mean       median           uq          max neval cld
                    strptime(test.dd, "%Y-%m-%d") 10.177406012 10.472527403 1.064532e+01 10.621221596 10.819156870 11.288330598    10   c
         parse_date_time(test.dd, orders = "ymd")  4.541542019  4.603663894 4.844961e+00  4.869800287  5.055844972  5.128409226    10  b 
as.Date(parse_date_time(test.dd, orders = "ymd"))  4.461140695  4.568415584 4.867837e+00  4.739026273  5.080610126  5.532028490    10  b 
                                 anydate(test.dd)  0.000000755  0.000004909 5.777500e-06  0.000005664  0.000006042  0.000012839    10 a 

P.S. For working with time series, consider the flipTime library. It has all the required tools and is almost as fast as anytime for conversion purposes:

require(devtools)
install_github("Displayr/flipTime")
