Split a large data frame into a list of data frames based on common values in a column

Date: 2022-05-09 04:06:18

I have a data frame with 10 columns that records the actions of "users". One of the columns (column 10) contains an ID that identifies the user; it is not unique across rows. The data frame has about 750,000 rows. I am trying to split it into individual data frames (that is, get a list or vector of data frames) by the column containing the user identifier, so that I can isolate the actions of a single actor.

ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
4  | aad   | bb4   | ... | u_002

resulting in

list(
ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
,
4  | aad   | bb4   | ... | u_002
...)

The following works very well for me on a small sample (1000 rows):

paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)

and then accessing the element I want by paths[1] for instance.

When I apply this to the original large data frame, or even to a matrix representation of it, my machine (4 GB RAM, Mac OS X 10.6, R 2.15) chokes and the call never completes. (I know that a newer R version exists, but I believe this is not the main problem.)

split seems to perform better and completes after a long time, but with my limited R knowledge I do not know how to piece the resulting list of vectors into a vector of matrices.

path = split(smallsampleMat, smallsampleMat[,10]) 

I have also considered using big.matrix etc., but without much success in speeding up the process.

2 solutions

#1


61  

You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each one as an element, because a matrix is itself an atomic vector with dimension attributes. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of a different type and size, so it's very versatile, and you can use the *apply functions to operate further on each element in the list. Example below.

#  For reproducible data
set.seed(1)

#  Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )

#  Split on userid
out <- split( df , f = df$userid )
#$`1`
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5

#$`2`
#  userid data1 data2
#2      2   xfv     4
#4      2   bfe    10
#6      2   mrx     2
#8      2   fqd     9

Access each element using the [[ operator like this:

out[[1]]
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5
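
The question indexed its result with single brackets (paths[1]), so it may help to see how that differs from [[. The following is a minimal sketch on a small made-up data frame (the names df, one_bracket, two_bracket and by_name are illustrative only, not from the original post):

```r
# Hypothetical toy data, similar in shape to the example above
df <- data.frame(userid = rep(1:2, times = 4), data2 = 1:8)
out <- split(df, f = df$userid)

one_bracket <- out[1]      # still a list, of length 1, wrapping the data frame
two_bracket <- out[[1]]    # the data frame itself
by_name     <- out[["2"]]  # elements are named after the group labels (as strings)

class(one_bracket)  # "list"
class(two_bracket)  # "data.frame"
nrow(by_name)       # 4
```

Looking elements up by name rather than by position is usually the safer choice when the group labels are user IDs, since the position depends on how the labels sort.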

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

sapply( out , function(x) mean( x$data2 ) )
#   1    2 
#3.75 6.25 
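
The question also mentions that split on a matrix returned a list of vectors. That happens because split treats a matrix as a plain atomic vector. If per-user matrices are really wanted, one possible approach is to split the row indices and subset the matrix group by group. This is only a sketch on a tiny made-up matrix (m, idx, mats and dfs are hypothetical names), and it makes no claim about speed on 750,000 rows:

```r
# Hypothetical character matrix standing in for the asker's data
m <- cbind(ID     = as.character(1:4),
           Data1  = c("aaa", "aab", "aac", "aad"),
           userID = c("u_001", "u_001", "u_001", "u_002"))

# Split the row indices by user, then subset the matrix one group at a time;
# drop = FALSE keeps single-row groups as matrices
idx  <- split(seq_len(nrow(m)), m[, "userID"])
mats <- lapply(idx, function(i) m[i, , drop = FALSE])

dim(mats[["u_001"]])  # 3 3
dim(mats[["u_002"]])  # 1 3

# Alternative: convert once and let split work on a data frame
dfs <- split(as.data.frame(m), m[, "userID"])
```

Either way the result is a list, so the [[ and *apply idioms shown above apply unchanged.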

#2


5  

I stumbled across this answer when I actually wanted BOTH groups (the data for that one user and the data for everything but that one user). It's not necessary for the specifics of this post, but I thought I would add it in case someone was googling the same issue as me.

df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# Rows where g is 'A'
test_x = split(df, df$g)[['A']]
# Rows where g is anything but 'A'
test_y = split(df, df$g != 'A')[['TRUE']]

Here's what it looks like:

> head(test_x)
    ran_data1  ran_data2 g
1   1.1362198  1.2969541 A
6   0.5510307 -0.2512449 A
11  0.0321679  0.2358821 A
16  0.4734277 -1.2889081 A
21 -1.2686151  0.2524744 A

> head(test_y)
    ran_data1  ran_data2 g
2 -2.23477293  1.1514810 B
3 -0.46958938 -1.7434205 C
4  0.07365603  0.1111419 D
5 -1.08758355  0.4727281 E
7  0.28448637 -1.5124336 B
8  1.24117504  0.4928257 C
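
Both groups can also be obtained from a single call, since splitting on a logical condition returns a list with a "FALSE" and a "TRUE" element. A minimal sketch, reusing the same made-up data layout (the seed and the name parts are assumptions added here for repeatability and illustration):

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# One split on a logical condition yields both groups at once
parts  <- split(df, df$g == "A")
test_x <- parts[["TRUE"]]   # rows for group A
test_y <- parts[["FALSE"]]  # all remaining rows

nrow(test_x)  # 25
nrow(test_y)  # 100
```

This avoids running split twice over the same data frame, which may matter on large inputs.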
