Storing R objects in a relational database

Time: 2023-01-03 16:55:22

I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?

Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.

4 solutions

#1


10  

Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).

Reverse this for retrieval: get the string, then unserialize() it into an R object.
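
As a minimal sketch of that round trip, using only base R: the example object (a small kernel density) and the variable names are illustrative assumptions, and the actual database write is left out. Passing connection = NULL makes serialize() return the bytes as a raw vector instead of writing them to a connection.

```r
# Illustrative object: a kernel density estimate on a few made-up values
fit <- density(c(1.2, 2.4, 2.9, 3.1, 4.8))

# connection = NULL returns the serialized bytes as a raw vector
blob <- serialize(fit, connection = NULL)
is.raw(blob)

# 'blob' is what would be stored in a binary column.
# After fetching the bytes back, rebuild the object:
restored <- unserialize(blob)
identical(fit, restored)
```

The raw vector can go into any binary column type the database driver supports; how it is bound to a query depends on the driver.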

#2


10  

An example of a fairly complex R variable:

library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)

The best way to store R variables in a database depends on how you want to use them.

I need to do in-database analytics on the values

In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.

library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
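
For the kernel density case from the question, the same idea can be sketched in base R without broom. This is an assumption-laden illustration: the input data are simulated, and the column and variable names (density_df, meta_df) are made up for the example.

```r
# Simulated input standing in for data pulled from the database
set.seed(1)
x <- rnorm(100)
d <- density(x)

# One row per evaluation point of the density curve
density_df <- data.frame(x = d$x, y = d$y)

# One row of object-level metadata (bandwidth and sample size)
meta_df <- data.frame(bw = d$bw, n = d$n)

head(density_df)
meta_df
```

Both results are plain data frames of numbers, so they can be written to ordinary tables (for example with DBI::dbWriteTable) and queried in the database.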

I just want storage

In this case you want to serialize your R variables, that is, convert them to a string or a binary blob. There are several methods for this.


My data has to be accessible by programs other than R, and needs to be human-readable

You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.

library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916

library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
  model,
  {
     call <- as.character(call)
     terms <- as.character(terms)
  }
)
model_as_yaml <- as.yaml(model2) 
nchar(model_as_yaml) # 14493

My data has to be accessible by programs other than R, and doesn't need to be human-readable

You could write your data to an open, cross-platform binary format like HDF5. Currently support for HDF5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)

library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case

The feather package lets you save data frames in a format readable by both R and Python. To use this, you would first have to convert the model object into data frames, as described in the broom section earlier in the answer.

library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather")  # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather")   # 2093 bytes
write_feather(glance(model), "co2_model.feather") #  562 bytes

Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.

writeLines(model_as_json, "model.txt") # write the text to the file that gets archived
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # size of the compressed archive in bytes

My data only needs to be accessible by R, and needs to be human-readable

There are two options for turning a variable into a string: serialize and deparse.

p <- function(x)
{
  paste0(x, collapse = "\n")
}

To produce text, serialize needs to write to a text connection; rather than writing to a file, you can write to the console and capture the output.

 model_serialized <- p(capture.output(serialize(model, stdout())))
 nchar(model_serialized) # 23830
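
An alternative sketch that avoids capturing console output: serialize() with connection = NULL and ascii = TRUE returns the ASCII representation as a raw vector, which rawToChar() turns into a single string. The example object and variable names are assumptions for illustration, not part of the original answer.

```r
# Illustrative object with a few value types, including NA and Inf
x <- list(a = 1:3, b = "text", c = c(1.5, NA, Inf))

# ASCII serialization, returned in memory and converted to one string
txt <- rawToChar(serialize(x, connection = NULL, ascii = TRUE))
is.character(txt)
length(txt)

# Reverse: string -> raw bytes -> R object
y <- unserialize(charToRaw(txt))
identical(x, y)
```

The resulting string contains newlines, so it is best suited to a text column without length limits.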

Use deparse with control = "all" to maximise the reversibility when re-parsing later.

model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
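
Reading a deparsed string back means parsing and evaluating it. A minimal sketch on a simple list (the object and names are illustrative); more complex objects may not round-trip exactly, which is why control = "all" is recommended above.

```r
# Illustrative object
x <- list(a = 1:3, b = "text")

# Deparse to R source code, joined into one string
src <- paste0(deparse(x, control = "all"), collapse = "\n")
cat(src, "\n")

# Re-create the object by parsing and evaluating the source text
y <- eval(parse(text = src))
identical(x, y)
```

Because eval(parse(text = ...)) runs whatever code the string contains, this should only be applied to text from a trusted source.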

My data only needs to be accessible by R, and doesn't need to be human-readable

The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.

serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.

saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
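
If writing a temporary file is undesirable, a similar compressed blob can be built entirely in memory. This is a sketch using base R's memCompress() and memDecompress(); the object and variable names are assumptions for illustration.

```r
# Illustrative object
obj <- list(values = rnorm(1000), label = "example")

# Binary serialization to a raw vector, then gzip compression in memory
raw_bytes <- serialize(obj, connection = NULL)
packed <- memCompress(raw_bytes, type = "gzip")
c(uncompressed = length(raw_bytes), compressed = length(packed))

# Reverse: decompress, then unserialize
restored <- unserialize(memDecompress(packed, type = "gzip"))
identical(obj, restored)
```

The packed raw vector is what would be stored in a binary column; note that it is not a .rds file, so it must be read back with memDecompress() and unserialize() rather than readRDS().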

#3


2  

Using textConnection / saveRDS / readRDS is perhaps the most versatile and high-level approach:

zz <- textConnection('tempConnection', 'w')
saveRDS(myData, zz, ascii = TRUE)
TEXT <- paste(textConnectionValue(zz), collapse = '\n')

#write TEXT into SQL
...
close(zz)  #if the connection persists, new data will be appended

#reading back:
#1. pull from SQL into queryResult
...
#2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))

#4


2  

For sqlite (and possibly others):

CREATE TABLE data (blob BLOB);

Now in R:

RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)', params = list(blob = list(serialize(some_object, NULL))))

Note the list wrapper around the serialize call. The output of serialize (with NULL as the connection) is a raw vector. Without list, the INSERT statement would be executed for each vector element. Wrapping it in a list allows RSQLite::dbGetQuery to see it as one element.

To get the object back from the database:

some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])

What happens here is that you take the field blob, which is a list because RSQLite doesn't know how many rows the query will return. Since LIMIT 1 ensures that only one row is returned, we extract it with [[1]], which gives the original raw vector. Then you unserialize the raw vector to get your object.
