I have recently been using R to run a Generalised Linear Model (GLM) on a 100 MB CSV file (9 million rows by 5 columns). The file contains 5 columns called depvar, var1, var2, var3, and var4, all randomly distributed and holding small integer codes (0 to 3; see the sample below). I have used the biglm package to run the GLM on this data file, and R processed it in approximately 2 minutes. This was on a Linux machine running R version 2.10 (I am currently updating to 2.14), with 4 cores and 8 GB of RAM. I want to get the run time down to around 30 to 60 seconds. One solution is adding more cores and RAM, but that would only be a temporary fix, as I realise that datasets will only get bigger. Ideally I want to find a way to make the bigglm code itself run faster. To that end I have profiled the run, adding the following line before the code whose speed I want to check:
Rprof('method1.out')
Then, after typing this command, I write my bigglm code, which looks something like this:
library(biglm)

# read the 100 MB file and fit a logistic GLM in 800,000-row chunks
x <- read.csv('file location of 100mb file')
form <- depvar ~ var1 + var2 + var3 + var4
a <- bigglm(form, data = x, chunksize = 800000, sandwich = FALSE,
            family = binomial())
summary(a)
AIC(a)
deviance(a)
After running this code, which takes around 2 to 3 minutes, I type the following to view the profiling results:
Rprof(NULL)  # stop profiling before summarising
summaryRprof('method1.out')
What I then get is a breakdown of the bigglm process showing which individual calls take the longest. Viewing this, I was surprised to see a call to Fortran code that was taking a very long time (around 20 seconds). This code can be found in the source of the biglm package at:
http://cran.r-project.org/web/packages/biglm/index.html
In the biglm_0.8.tar.gz file.
Basically, what I am asking the community is: can this code be made faster? For example, by changing how the Fortran code that performs the QR decomposition is called. Furthermore, there were other functions, like as.character and model.matrix, that also took a long time. I have not attached the profiling file here, as I believe it can easily be reproduced given the information I have supplied, but basically I am pointing at the bigger problem of running GLMs on big data. This is a problem shared across the R community, and I would be grateful for any feedback or help on the issue. You can probably replicate this example with a different dataset, look at what is taking so long in the bigglm code, and see whether you find the same hotspots I did. If so, can someone please help me figure out how to make bigglm run faster? After Ben requested it, I have uploaded the snippet of the profiling output I had, as well as the first 10 lines of my CSV file:
CSV File:
var1,var2,var3,var4,depvar
1,0,2,2,1
0,0,1,2,0
1,1,0,0,1
0,1,1,2,0
1,0,0,3,0
0,0,2,2,0
1,1,0,0,1
0,1,2,2,0
0,0,2,2,1
This CSV output was copied from my text editor UltraEdit. As can be seen, var1 takes the values 0 or 1, var2 takes the values 0 or 1, var3 takes the values 0, 1, or 2, var4 takes the values 0, 1, 2, or 3, and depvar takes the values 0 or 1. This CSV can be replicated in Excel using the RAND function for up to around 1 million rows, then copied and pasted several times over in a text editor like UltraEdit to get a larger number of rows. Basically, type RAND() into one column for 1 million rows, then apply ROUND() in the column beside it to get 1s and 0s. The same sort of thinking applies to 0, 1, 2, 3.
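If you would rather skip Excel, a minimal R sketch along these lines should produce an equivalent file (the seed and the output file name are my own choices, not part of the original setup):

# simulate 9 million rows with the same value ranges as the sample above
set.seed(1)
n <- 9e6
sim <- data.frame(var1   = sample(0:1, n, replace = TRUE),
                  var2   = sample(0:1, n, replace = TRUE),
                  var3   = sample(0:2, n, replace = TRUE),
                  var4   = sample(0:3, n, replace = TRUE),
                  depvar = sample(0:1, n, replace = TRUE))
write.csv(sim, 'simulated.csv', row.names = FALSE)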
The profiling file is long, so I have attached only the lines that took the most time:
summaryRprof('method1.out')
$by.self
self.time self.pct total.time total.pct
"model.matrix.default" 25.40 20.5 26.76 21.6
".Call" 20.24 16.3 20.24 16.3
"as.character" 17.22 13.9 17.22 13.9
"[.data.frame" 14.80 11.9 22.12 17.8
"update.bigqr" 5.72 4.6 14.38 11.6
"-" 4.36 3.5 4.36 3.5
"anyDuplicated.default" 4.18 3.4 4.18 3.4
"|" 3.98 3.2 3.98 3.2
"*" 3.44 2.8 3.44 2.8
"/" 3.18 2.6 3.18 2.6
"unclass" 2.28 1.8 2.28 1.8
"sum" 2.26 1.8 2.26 1.8
"attr" 2.12 1.7 2.12 1.7
"na.omit" 2.02 1.6 20.00 16.1
"%*%" 1.74 1.4 1.74 1.4
"^" 1.56 1.3 1.56 1.3
"bigglm.function" 1.42 1.1 122.52 98.8
"+" 1.30 1.0 1.30 1.0
"is.na" 1.28 1.0 1.28 1.0
"model.frame.default" 1.20 1.0 22.68 18.3
">" 0.84 0.7 0.84 0.7
"strsplit" 0.62 0.5 0.62 0.5
$by.total
total.time total.pct self.time self.pct
"standardGeneric" 122.54 98.8 0.00 0.0
"bigglm.function" 122.52 98.8 1.42 1.1
"bigglm" 122.52 98.8 0.00 0.0
"bigglm.data.frame" 122.52 98.8 0.00 0.0
"model.matrix.default" 26.76 21.6 25.40 20.5
"model.matrix" 26.76 21.6 0.00 0.0
"model.frame.default" 22.68 18.3 1.20 1.0
"model.frame" 22.68 18.3 0.00 0.0
"[" 22.14 17.9 0.02 0.0
"[.data.frame" 22.12 17.8 14.80 11.9
".Call" 20.24 16.3 20.24 16.3
"na.omit" 20.00 16.1 2.02 1.6
"na.omit.data.frame" 17.98 14.5 0.02 0.0
"model.response" 17.44 14.1 0.10 0.1
"as.character" 17.22 13.9 17.22 13.9
"names<-" 17.22 13.9 0.00 0.0
"<Anonymous>" 15.10 12.2 0.00 0.0
"update.bigqr" 14.38 11.6 5.72 4.6
"update" 14.38 11.6 0.00 0.0
"data" 10.26 8.3 0.00 0.0
"-" 4.36 3.5 4.36 3.5
"anyDuplicated.default" 4.18 3.4 4.18 3.4
"anyDuplicated" 4.18 3.4 0.00 0.0
"|" 3.98 3.2 3.98 3.2
"*" 3.44 2.8 3.44 2.8
"/" 3.18 2.6 3.18 2.6
"lapply" 3.04 2.5 0.04 0.0
"sapply" 2.44 2.0 0.00 0.0
"as.list.data.frame" 2.30 1.9 0.02 0.0
"as.list" 2.30 1.9 0.00 0.0
"unclass" 2.28 1.8 2.28 1.8
"sum" 2.26 1.8 2.26 1.8
"attr" 2.12 1.7 2.12 1.7
"etafun" 1.88 1.5 0.14 0.1
"%*%" 1.74 1.4 1.74 1.4
"^" 1.56 1.3 1.56 1.3
"summaryRprof" 1.48 1.2 0.02 0.0
"+" 1.30 1.0 1.30 1.0
"is.na" 1.28 1.0 1.28 1.0
">" 0.84 0.7 0.84 0.7
"FUN" 0.70 0.6 0.36 0.3
"strsplit" 0.62 0.5 0.62 0.5
I was mainly surprised by the .Call function that calls into the Fortran code. Maybe I have misunderstood it: it seems all the calculations are done once this function is used, whereas I thought it was just a linking function that dispatched to the Fortran code. Furthermore, if Fortran is doing all the work, all the iteratively reweighted least squares and QR updates, why is the rest of the code taking so long?
1 Answer (score 4):
My lay understanding is that biglm breaks the data into chunks and processes them sequentially.
- So you could speed things up by optimising the chunk size to be just as large as your memory allows (see the timing sketch after this list).
- It is also only using one of your cores. This is not multi-threaded code; you would need to do some extra work to get that going.
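To explore the first point, a sketch like the following (my own, reusing the data frame x and formula from the question; the candidate chunk sizes are arbitrary) times the fit at a few chunk sizes so you can pick the largest one your RAM tolerates:

library(biglm)

# time the logistic fit at a few candidate chunk sizes
form <- depvar ~ var1 + var2 + var3 + var4
for (cs in c(200000, 800000, 3000000)) {
  elapsed <- system.time(bigglm(form, data = x, chunksize = cs,
                                sandwich = FALSE, family = binomial()))["elapsed"]
  cat("chunksize =", cs, "elapsed =", elapsed, "seconds\n")
}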