I have hundreds of matrices that need to be used in R, most of them around 45000x350 each. What I'd like to do is find an optimal database software choice and schema to store the data in, and be able to call subsets of the matrices from the database. This needs to be done with the objective of making data extraction as fast as possible.
As a baseline, here is code that creates 5 matrices similar to what I'm dealing with:
if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") }
require("zoo", quietly=TRUE)
numSymbols <- 45000
numVariables <- 5
rDatePattern <- "%d/%m/%Y"
startDate <- "31/12/1982"
endDate <- "30/09/2011"
startYearMonth <- as.yearmon(startDate,format=rDatePattern)
alphaNumeric <- c(1:9,toupper(letters))
numMonths <- (as.yearmon(endDate,format=rDatePattern)-startYearMonth)*12
numValues <- numSymbols*(numMonths+1)
dateVector <- sapply(1:(numMonths+1), function(x) {as.character(format(as.Date(startYearMonth+x*1/12,frac=0)-1,rDatePattern))})
symbolNames <- sapply(1:numSymbols,function(x) {as.character(paste((sample(alphaNumeric,7)),collapse=""))})
for(i in 1:numVariables) {
  assign(paste("Variable",i,sep="_"),
         matrix(sample(c(rnorm(numValues/2),rep(NA,numValues/2))),
                nrow=numSymbols,
                ncol=(numMonths+1),
                dimnames=list(symbolNames,dateVector)))
}
Basically all the matrices will have about half the values filled with doubles and the rest NAs.
# > ls()[grepl("Variable_",ls())]
# [1] "Variable_1" "Variable_2" "Variable_3" "Variable_4" "Variable_5"
# > dim(Variable_1)
# [1] 45000 346
# > Variable_1[1:10,1:3]
# 31/12/1982 31/01/1983 28/02/1983
# AF3HM5V NA NA -1.15076100366945755
# DL8TVIY NA NA -1.59412257037490046
# JEFDYPO NA NA NA
# 21ZV689 NA NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC NA NA -1.31113363079930023
# XG3ZA1W NA 0.8531074740045220 0.06797987526470438
# W1JCXIE 0.2782029710832690 -1.2668560986048898 NA
# X3RKT2B 1.5220172324681460 -1.0460218516729356 NA
# 3EUB8VN -0.9405417187846803 1.1151437940206490 1.60520458945005262
I want to be able to store these in a database. An RDBMS would be the default option, but I'm willing to look at alternatives. The biggest part is finding the optimal solution for quick querying, whether for a whole matrix or a subset of one, e.g. 2000 symbols, 100 dates etc.
The current solution I've been using is saving each matrix as an RData file, loading the whole matrix in, and truncating it to the size I need. This is really quick, but I feel a database design would be more beneficial for scaling the matrices in terms of symbols + dates, and for backing up the data.
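For reference, a minimal sketch of that RData workflow (the file names and subset vectors are hypothetical; the per-variable matrices come from the generation code above):
# One-off: save each variable matrix to its own RData file.
for(i in 1:numVariables) {
  varName <- paste("Variable", i, sep="_")
  save(list=varName, file=paste0(varName, ".RData"))
}
# Later: load the whole matrix back and immediately truncate to the needed subset.
wantedSymbols <- symbolNames[1:2000]   # hypothetical subset
wantedDates   <- dateVector[1:100]     # hypothetical subset
loadedName <- load("Variable_1.RData") # pulls the full 45000 x 346 matrix into RAM
subMat <- get(loadedName)[wantedSymbols, wantedDates]
rm(list=loadedName); gc()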
What I've tried so far in terms of RDBMS options is:
A)
- Fields: Symbol, Variable, Date, Value
- Separate and clustered indices for all fields but Value.
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory inefficient; see the pivot sketch after this list)
- Average query for a normal sample into R: 4-8 minutes
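For illustration, here is a minimal sketch of that pivot step, assuming the long-format rows for one variable have already been fetched into a data.frame longDF with columns Symbol, Date and Value (hypothetical names):
library(reshape2)
# Cast the long (Symbol, Date, Value) rows back into a Symbol x Date matrix;
# combinations absent from longDF come back as NA. The intermediate copies made
# here are the main source of the memory overhead.
wideMat <- acast(longDF, Symbol ~ Date, value.var="Value")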
B)
- Each variable in a separate table.
- Fields: Symbol, Date, Value
- Separate and clustered indices for all fields but Value.
- Views added to cache common subsets (unclear whether that helped at all...)
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory inefficient)
- Average query for a normal sample into R: 3-5 minutes
C) [Should maybe have tried a column-based database here]
- Symbols and dates stored separately and mapped to row and column numbers only
- Each variable in a separate table, with symbols as rows and dates as columns
- Really bad data-to-disk mapping when scaling rows and columns.
- Data already in the correct format for R (see the read sketch after this list)
- Average query for a normal sample into R: 1-3 minutes
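As a rough sketch of the read path under this layout (DBI/RSQLite is used purely for illustration; the table, column and file names are hypothetical):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "variables.sqlite")  # hypothetical database file
# The date columns and row numbers would first be resolved via the separate
# symbol/date mapping tables; a contiguous block is assumed here for simplicity.
dateCols <- paste0("d", 1:100)                            # hypothetical date-column names
sql <- sprintf("SELECT %s FROM variable_1 WHERE row_id BETWEEN 1 AND 2000",
               paste(dateCols, collapse=", "))
subMat <- as.matrix(dbGetQuery(con, sql))                 # already in matrix shape for R
dbDisconnect(con)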
In comparison with the above database setups, loading a whole variable from the RData files takes 5 seconds locally and 20 seconds over the network. All the database times are over the network.
Is there anything I can do to make the database route come anywhere close to the binary file speeds?
Maybe one of the tabular NoSQL databases is what I need?
How does that scale in terms of additional symbols + dates?
Any help from someone who's dealt with something similar would be appreciated.
Update: Thought I'd post an update to this. In the end I went with Iterator's suggestion: the data is now hosted in bigmemory memory-mapped files, with RData copies for quick drag-and-drop checking, and also output to CSV and pulled into SQL Server for backup purposes. Any database solution was too slow to be used by multiple users. Also, using RODBC against SQL Server is crazy slow; I also tried moving data in and out of R via CSV to and from SQL, and that was okay but pointless.
Also for reference, byte-compiling the load method for bigmemory does have an impact. Here are the results of my load test for RData vs bigmemory.
workingDirectory <- "/Users/Hans/92 Speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")
# Load each full RData matrix, then truncate it to the requested sedols/dates.
LoadVariablesInFolder <- function(folder, sedols, dates) {
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.RData", filesInFolder)]
  filesToLoad <- paste(folder, filesToLoad, sep="/")
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    loadedVar <- load(fToLoad)                       # loads the whole matrix
    assign(loadedVar, get(loadedVar)[sedols, dates]) # subset after the fact
    ans <- gc()
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar))
  }
  return(variablesThatWereLoaded)
}
cLoadVariablesInFolder <- cmpfun(LoadVariablesInFolder)
# Attach each file-backed big.matrix via its descriptor file and pull only the
# requested sedols/dates into memory.
BigMLoadVariablesInFolder <- function(folder, sedols, dates) {
  workD <- getwd()
  setwd(folder)
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.desc", filesInFolder)]
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    tempVar <- attach.big.matrix(dget(fToLoad))          # no full load into RAM
    loadedVar <- gsub(".*(NVAR_\\d+).*", "\\1", fToLoad, perl=TRUE)
    assign(loadedVar, tempVar[sedols, dates])            # copy out only the subset
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar, "tempVar"))
    ans <- gc()
  }
  setwd(workD)
  return(variablesThatWereLoaded)
}
cBigMLoadVariablesInFolder <- cmpfun(BigMLoadVariablesInFolder)
testCases <- list(
list(numSedols=1000,numDates=120),
list(numSedols=5000,numDates=120),
list(numSedols=50000,numDates=120),
list(numSedols=1000,numDates=350),
list(numSedols=5000,numDates=350),
list(numSedols=50000,numDates=350))
load(paste(workingDirectory,"dates.cache",sep="/"))
load(paste(workingDirectory,"sedols.cache",sep="/"))
for (testCase in testCases) {
  results <- benchmark(LoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
                       cLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
                       BigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
                       cBigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
                       columns=c("test", "replications", "elapsed", "relative"),
                       order="relative", replications=3)
  cat("Results for testcase:\n")
  print(testCase)
  print(results)
}
Basically, the smaller the subset, the more you gain, because you don't spend time loading the whole matrix. But loading the whole matrix is slower from bigmemory than from RData; I guess it's the conversion overhead:
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 6.799999999999955 1.000000000000000
# 3 3 14.389999999999986 2.116176470588247
# 1 3 235.639999999999986 34.652941176470819
# 2 3 250.590000000000032 36.851470588235543
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 7.080000000000155 1.000000000000000
# 3 3 32.730000000000018 4.622881355932105
# 1 3 249.389999999999873 35.224576271185654
# 2 3 254.909999999999854 36.004237288134789
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 120
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 146.3499999999999 1.000000000000000
# 4 3 148.1799999999998 1.012504270584215
# 2 3 238.3200000000002 1.628425008541171
# 1 3 240.4600000000000 1.643047488896482
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 83.88000000000011 1.000000000000000
# 4 3 91.71000000000004 1.093347639484977
# 1 3 235.69000000000005 2.809847401049115
# 2 3 240.79999999999973 2.870767763471619
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 135.6999999999998 1.000000000000000
# 4 3 155.8900000000003 1.148784082535008
# 2 3 233.3699999999999 1.719749447310245
# 1 3 240.5599999999995 1.772733971997051
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 350
# test
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 2 3 236.5000000000000 1.000000000000000
# 1 3 237.2100000000000 1.003002114164905
# 3 3 388.2900000000000 1.641818181818182
# 4 3 393.6300000000001 1.664397463002115
3 Answers
#1
I would strongly recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question to understand how I compare HDF5 and .RDat files.
Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
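For example, here is a hedged sketch of that kind of partial read using the Bioconductor rhdf5 package (file and dataset names are hypothetical; row/column names would need to be stored as separate datasets):
library(rhdf5)
# One-off: write a variable matrix into an HDF5 file.
h5createFile("variables.h5")
h5write(Variable_1, "variables.h5", "Variable_1")
# Later: read only the needed rows/columns, without touching the rest of the file.
subMat <- h5read("variables.h5", "Variable_1", index=list(1:2000, 1:100))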
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, including parallel I/O, version information, portability, etc.
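A minimal sketch of such a home-grown format, assuming one file per variable holding raw doubles in column-major order (all names hypothetical), with seek() giving random access to a single column:
# Write: dump the matrix as raw 8-byte doubles, column by column.
con <- file("Variable_1.bin", "wb")
writeBin(as.vector(Variable_1), con, size=8)
close(con)
# Read back just column j by jumping straight to its byte offset.
j <- 100                                        # hypothetical column index
con <- file("Variable_1.bin", "rb")
seek(con, where=(j - 1) * numSymbols * 8)
colJ <- readBin(con, what="double", n=numSymbols, size=8)
close(con)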
Note 1: If you have just a few types per row, i.e. 1 character and the rest are numeric, you can simply create 2 memory-mapped matrices, one of which is for characters, the other for numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: about 2 minutes to load, about 20 seconds to subselect what I needed, about 10 seconds to analyze, and about 30 seconds to save the results. After switching to bigmemory, I got down to about 10 seconds to analyze and about 5-15 seconds on the I/O.
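As a rough sketch of that setup (file names hypothetical): convert each matrix to a file-backed big.matrix once, then any session can attach and subset it without loading the whole thing:
library(bigmemory)
# One-off: turn the in-memory matrix into a file-backed big.matrix.
bm <- as.big.matrix(Variable_1,
                    backingfile="Variable_1.bin",
                    descriptorfile="Variable_1.desc")
# Any R session (several at once, if needed) can then attach and subset it;
# only the requested block is copied into memory.
bm2    <- attach.big.matrix("Variable_1.desc")
subMat <- bm2[1:2000, 1:100]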
Update 1: I overlooked the ff package - this is another good option, though it is a lot more complex than bigmemory.
#2
Maybe the database design of the TSdbi package is inspiring...
For a nosql solution, hdf5 might be an option. I do not know much about it, though.
#3
You could also delve into the uses/internals of sqldf. It seems they got their database work pretty well sorted out. There are also a lot of interesting examples given on their page.
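For instance, a minimal sqldf sketch (the data frame longDF and its columns are hypothetical): sqldf copies the data frame into a temporary SQLite database and runs plain SQL against it.
library(sqldf)
res <- sqldf("SELECT Symbol, Date, Value FROM longDF WHERE Symbol IN ('AF3HM5V', 'DL8TVIY')")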