Server running out of memory when using RJDBC in a parallel computing environment

Posted: 2022-02-22 20:11:23

I have an R server with 16 cores and 8 GB of RAM that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the merged dataset before writing the results back to the SQL server. The workers talk to the SQL server through RJDBC connections. When multiple workers are fetching data from the SQL server at the same time, RAM usage explodes and the R server crashes.

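The cluster scaffolding itself is not part of the code shown below. For context, a minimal sketch of what such a setup typically looks like (worker.inputs and get.merge.analyse are placeholder names, not the actual code):

library(snow)
cl <- makeCluster(10, type = "SOCK")   # 10 local workers on the 16-core machine
# each worker runs the download/merge/analyse function outlined below
results <- clusterApply(cl, worker.inputs, get.merge.analyse)
stopCluster(cl)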

The strange thing is that the RAM usage of a worker loading in data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20 MB when saved as an R object on disk and about 160 MB when saved as a comma-delimited file. Yet the RAM usage of the R session is about 2.3 GB.

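For a sense of scale, a quick back-of-the-envelope check (my own sketch, not part of the original code) shows that a data frame of these dimensions is already far larger in memory than either on-disk figure suggests:

n_rows <- 8000
n_cols <- 6500
# if every value were stored as an 8-byte double:
n_rows * n_cols * 8 / 1024^2
# [1] 396.7285   -> roughly 400 MB before any copies are made
# object.size() on a dummy data frame of the same shape gives a comparable figure:
# print(object.size(as.data.frame(matrix(0, n_rows, n_cols))), units = "MB")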

Here is an overview of the code (with some typographical changes to improve readability):

Establish connection using RJDBC:

require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")

After this there is some code that sorts the function's input vector requestedDataSets (which holds the names of all tables to query) by number of records, so that we load the datasets from largest to smallest:

nrow.to.merge <- rep(0, length(requestedDataSets))
for (d in 1:length(requestedDataSets)) {
    # count the records in each requested table
    nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1, 1]
}
merge.order <- order(nrow.to.merge, decreasing = TRUE)

We then loop over the requestedDataSets vector in that order and load and/or merge the data:

for (d in merge.order) {
    # force a reconnect to the SQL server
    drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "sqljdbc4.jar")
    try(dbDisconnect(con), silent = TRUE)
    con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>", "<user>", "<pass>")
    # remove the previous to.merge object, if any
    if (exists("complete.data.to.merge")) rm(complete.data.to.merge)
    # force garbage collection in R and in the JVM
    gc()
    jgc()
    # ask the database for dataset d
    complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))
    if (d == merge.order[1]) {
        # first dataset: it becomes the base table
        complete.data <- complete.data.to.merge
        colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
    } else {
        # later datasets: left-join onto the data collected so far
        complete.data <- merge(
            x = complete.data,
            y = complete.data.to.merge,
            by.x = "key_1", by.y = "key", all.x = TRUE)
    }
}
return(complete.data)

When I run this code on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely that the merge call somehow blows up memory usage. Across the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange, as the final dataset is at most 162 MB...

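As a sanity check (my own addition, not something from the original run), the size of the merged result can be measured directly and compared against what the session reports; note that memory.size() is Windows-only and returns the session total in MB:

print(object.size(complete.data), units = "MB")   # size of the merged data frame itself
memory.size()                                     # total memory used by the R session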

As you can see in the code above, I've already tried a couple of fixes, like calling gc() and jgc() (a small function to force a Java garbage collection: jgc <- function(){ .jcall("java/lang/System", method = "gc") }). I've also tried merging the data on the SQL-server side, but then I run into column-count constraints.

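For reference, here is that helper written out on its own (the same snippet as in the paragraph above; it relies on the rJava package, which RJDBC loads):

jgc <- function() {
    # call System.gc() via rJava to trigger a JVM garbage collection
    .jcall("java/lang/System", method = "gc")
}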

It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.

Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.

Thanks.

1 Answer

#1


This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160 MB does not mean that the amount of memory needed to process it is 160 MB. Many algorithms require O(n^2) storage space, which would be in the GBs for your chunk of data. So I actually don't see anything here which is surprising.

I've already tried a couple of fixes, like calling gc() and jgc() (a small function to force a Java garbage collection...

You can't force a garbage collection in Java; calling System.gc() only politely asks the JVM to do a garbage collection, and it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply running into the overhead R needs to crunch your data.
