Best way to store variable-length data in an R data.frame?

Time: 2022-10-16 20:23:21

I have some mixed-type data that I would like to store in an R data structure of some sort. Each data point has a set of fixed attributes which may be 1-d numeric, factors, or characters, and also a set of variable length data. For example:

id  phrase                    num_tokens  token_lengths
1   "hello world"             2           5 5
2   "greetings"               1           9
3   "take me to your leader"  4           4 2 2 4 6

The actual values are not all computable from one another, but that's the flavor of the data. The operations I want to perform include subsetting the data based on boolean functions (e.g. something like nchar(data$phrase) > 10 or lapply(data$token_lengths, length) > 2). I'd also like to index into the variable-length portion and average the values at a given position. This doesn't work, but something like: mean(data$token_lengths[1], na.rm=TRUE)

I've found I can shoehorn "token_lengths" into a data.frame by making it an array:

d <- data.frame(id=c(1,2,3), ..., token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))

But is this the best way?

5 answers

#1


4  

Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.

This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)

as.mydata <- function(x)
{
   UseMethod("as.mydata")
}

as.mydata.character <- function(x)
{
   convert <- function(x)
   {
      md <- list()
      md$phrase = x
      spl <- strsplit(x, " ")[[1]]
      md$num_words <- length(spl)
      md$token_lengths <- nchar(spl)
      class(md) <- "mydata"
      md
   }
   lapply(x, convert)
}

Now your whole dataset looks like

mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))

mydataset
[[1]]
$phrase
[1] "hello world"

$num_words
[1] 2

$token_lengths
[1] 5 5

attr(,"class")
[1] "mydata"

[[2]]
$phrase
[1] "greetings"

$num_words
[1] 1

$token_lengths
[1] 9

attr(,"class")
[1] "mydata"

[[3]]
$phrase
[1] "take me to your leader"

$num_words
[1] 5

$token_lengths
[1] 4 2 2 4 6

attr(,"class")
[1] "mydata"

You can define a print method to make this look prettier.

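print.mydata <- function(x, ...)
{
   # Summarise one record on a single line
   cat(x$phrase, "consists of", x$num_words, "words, with", paste(x$token_lengths, collapse=", "), "letters.")
   # Print methods conventionally return their argument invisibly
   invisible(x)
}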
mydataset
[[1]]
hello world consists of 2 words, with 5, 5 letters.
[[2]]
greetings consists of 1 words, with 9 letters.
[[3]]
take me to your leader consists of 5 words, with 4, 2, 2, 4, 6 letters.

The sample operations you wanted to do are fairly straightforward with data in this format.

sapply(mydataset, function(x) nchar(x$phrase) > 10)
[1]  TRUE FALSE  TRUE
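
The other operations from the question (averaging the variable-length values by position, and filtering on the number of tokens) could be handled along the same lines. This is only a sketch: it rebuilds the records directly with lapply() instead of going through as.mydata, and nth_token_length is a hypothetical helper name, not something from the answer above.

```r
# Assumption: rebuild the three records as classed lists, mirroring the
# structure that as.mydata() produces.
phrases <- c("hello world", "greetings", "take me to your leader")
mydataset <- lapply(phrases, function(p) {
  spl <- strsplit(p, " ")[[1]]
  structure(
    list(phrase = p, num_words = length(spl), token_lengths = nchar(spl)),
    class = "mydata"
  )
})

# Hypothetical helper: the i-th token length of every record.
# Indexing past the end of a vector yields NA, so short records give NA.
nth_token_length <- function(ds, i) {
  vapply(ds, function(x) as.numeric(x$token_lengths[i]), numeric(1))
}

# Average of the first and second token lengths, skipping missing positions
first_mean  <- mean(nth_token_length(mydataset, 1), na.rm = TRUE)
second_mean <- mean(nth_token_length(mydataset, 2), na.rm = TRUE)

# Keep only the records that have more than two tokens
many_tokens <- Filter(function(x) length(x$token_lengths) > 2, mydataset)
```

Padding with NA is what makes na.rm=TRUE meaningful here: records that have no token at position i simply drop out of the average.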

#2


4  

I would just use the data in the "long" format.

E.g.

> d1 <- data.frame(id=1:3, num_words=c(2,1,4), phrase=c("hello world", "greetings", "take me to your leader"))
> d2 <- data.frame(id=c(rep(1,2), rep(2,1), rep(3,5)), token_length=c(5,5,9,4,2,2,4,6))
> d2$tokenid <- with(d2, ave(token_length, id, FUN=seq_along))
> d <- merge(d1,d2)
> subset(d, nchar(phrase) > 10)
  id num_words                 phrase token_length tokenid
1  1         2            hello world            5       1
2  1         2            hello world            5       2
4  3         4 take me to your leader            4       1
5  3         4 take me to your leader            2       2
6  3         4 take me to your leader            2       3
7  3         4 take me to your leader            4       4
8  3         4 take me to your leader            6       5
> with(d, tapply(token_length, id, mean))
  1   2   3 
5.0 9.0 3.6 

Once the data is in the long format, you can use sqldf or plyr to extract what you want from it.
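
If the variable-length part can be derived from the phrases, the long table does not have to be typed by hand. The following base-R sketch assumes the token lengths are simply the character counts of the space-separated words (which is not true of the question's real data), and the names long, mean_by_position and long_ids are made up for illustration.

```r
# Assumption: tokens are the space-separated words of each phrase.
phrases <- c("hello world", "greetings", "take me to your leader")
tokens  <- strsplit(phrases, " ")
n_tok   <- lengths(tokens)  # number of tokens per phrase

# One row per token: phrase id, position within the phrase, token length
long <- data.frame(
  id           = rep(seq_along(tokens), n_tok),
  tokenid      = sequence(n_tok),
  token_length = nchar(unlist(tokens))
)

# Mean token length at each position across all phrases
mean_by_position <- tapply(long$token_length, long$tokenid, mean)

# Ids of phrases that have more than two tokens
long_ids <- unique(long$id[long$tokenid > 2])
```

Grouping by tokenid instead of id is what gives the "average by index" the question asks for, with no padding needed.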

#3


4  

Another option would be to convert your data frame into a matrix of mode list, where each element of the matrix is a list. Standard array operations (slicing with [, apply(), etc.) would then be applicable.

> d <- data.frame(id=c(1,2,3), num_tokens=c(2,1,4), token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
> m <- as.matrix(d)
> mode(m)
[1] "list"
> m[,"token_lengths"]
[[1]]
[1] 5 5

[[2]]
[1] 9

[[3]]
[1] 4 2 2 4 6

> m[3,]
$id
[1] 3

$num_tokens
[1] 4

$token_lengths
[1] 4 2 2 4 6
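
For comparison, a list column can also stay inside the data frame if it is wrapped in I(), which stops data.frame() from trying to split the list into separate columns. This is a sketch of that variant, not part of the answer above; the names long_phrases, many_tokens and first_mean are illustrative only.

```r
# A data frame with a list column; I() keeps the list "as is"
d <- data.frame(
  id            = 1:3,
  phrase        = c("hello world", "greetings", "take me to your leader"),
  token_lengths = I(list(c(5, 5), 9, c(4, 2, 2, 4, 6))),
  stringsAsFactors = FALSE
)

# Subset on an ordinary column
long_phrases <- d[nchar(d$phrase) > 10, ]

# Subset on the length of each list element
n_tok <- vapply(d$token_lengths, length, integer(1))
many_tokens <- d[n_tok > 2, ]

# Average the first value of each element (NA where there is none)
first_mean <- mean(
  vapply(d$token_lengths, function(v) v[1], numeric(1)),
  na.rm = TRUE
)
```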

#4


1  

Since the R data frame structure is based loosely on the SQL table, having each element of the data frame be anything other than an atomic data type is uncommon. However, it can be done, as you've shown, and this linked post describes such an application implemented on a larger scale.

An alternative is to store your data as a string and have a function to retrieve it, or create a separate function to which the data is attached and extract it using indices stored in your data frame.

> ## alternative 1
> tokens <- function(x,i=TRUE) Map(as.numeric,strsplit(x[i],","))
> d <- data.frame(id=c(1,2,3), token_lengths=c("5,5", "9", "4,2,2,4,6"))
> 
> tokens(d$token_lengths)
[[1]]
[1] 5 5

[[2]]
[1] 9

[[3]]
[1] 4 2 2 4 6

> tokens(d$token_lengths,2:3)
[[1]]
[1] 9

[[2]]
[1] 4 2 2 4 6

> 
> ## alternative 2
> retrieve <- local({
+   token_lengths <- list(c(5,5), 9, c(4,2,2,4,6))
+   function(i) token_lengths[i]
+ })
> 
> d <- data.frame(id=c(1,2,3), token_lengths=1:3)
> retrieve(d$token_lengths[2:3])
[[1]]
[1] 9

[[2]]
[1] 4 2 2 4 6

#5


0  

I would also use strings for the variable length data, but as in the following example: "c(5,5)" for the first phrase. One needs to use eval(parse(text=...)) to carry out computations.

For example, the mean can be computed as follows:

sapply(data$token_lengths,function(str) mean(eval(parse(text=str))))
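
A self-contained version of this idea might look like the sketch below. The data frame contents are assumed from the question's example, and means is an illustrative name. Note that eval(parse(text=...)) runs whatever code the string contains, so this is only reasonable for strings you trust.

```r
# Assumption: each token_lengths entry is a string holding an R expression
data <- data.frame(
  id            = 1:3,
  token_lengths = c("c(5,5)", "c(9)", "c(4,2,2,4,6)"),
  stringsAsFactors = FALSE
)

# Parse and evaluate each string, then take the mean of the resulting vector
means <- sapply(
  data$token_lengths,
  function(str) mean(eval(parse(text = str))),
  USE.NAMES = FALSE
)
```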
