I am trying to subset a data frame so that I get multiple data frames, one for each combination of values in several columns. Here is my example:
>df
v1 v2 v3 v4 v5
A Z 1 10 12
D Y 10 12 8
E X 2 12 15
A Z 1 10 12
E X 2 14 16
The expected output is something like this, where the data frame is split into multiple data frames based on columns v1 and v2:
>df1
v3 v4 v5
1 10 12
1 10 12
>df2
v3 v4 v5
10 12 8
>df3
v3 v4 v5
2 12 15
2 14 16
I have written code that works right now, but I don't think it is the best way to do it; there must be a better way. Assume tab is the data.frame holding the initial data. Here is my code:
v1Factors <- levels(factor(tab$v1))
v2Factors <- levels(factor(tab$v2))
for (i in 1:length(v1Factors)) {
  for (j in 1:length(v2Factors)) {
    subsetTab <- subset(tab, v1 == v1Factors[i] & v2 == v2Factors[j], select = c("v3", "v4", "v5"))
    print(subsetTab)
  }
}
Can someone suggest a better method to do the above?
2 Answers
#1
24
You are looking for split():
split(df, with(df, interaction(v1,v2)), drop = TRUE)
$E.X
v1 v2 v3 v4 v5
3 E X 2 12 15
5 E X 2 14 16
$D.Y
v1 v2 v3 v4 v5
2 D Y 10 12 8
$A.Z
v1 v2 v3 v4 v5
1 A Z 1 10 12
4 A Z 1 10 12
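To match the expected output in the question, where each piece keeps only v3, v4 and v5, the grouping columns can be left out of the data that gets split. A minimal sketch, assuming the example data from the question is rebuilt as `df`; the names `group` and `pieces` are only illustrative:

```r
# Rebuild the example data from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1L, 10L, 2L, 1L, 2L),
  v4 = c(10L, 12L, 12L, 10L, 14L),
  v5 = c(12L, 8L, 15L, 12L, 16L)
)

# Grouping factor built from v1 and v2; drop = TRUE removes unused combinations
group <- interaction(df$v1, df$v2, drop = TRUE)

# Split only the value columns, so v1 and v2 do not appear in the pieces
pieces <- split(df[, c("v3", "v4", "v5")], group)

names(pieces)     # one name per observed v1/v2 combination
pieces[["A.Z"]]   # the rows where v1 == "A" and v2 == "Z"

# The result is an ordinary list, so lapply()/sapply() work on it
sapply(pieces, nrow)
```

Keeping the pieces in a named list, rather than creating df1, df2, df3 as separate variables, makes it easy to loop over them afterwards.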
As noted in the comments, any of the following would work:
library(microbenchmark)
microbenchmark(
split(df, list(df$v1,df$v2), drop = TRUE),
split(df, interaction(df$v1,df$v2), drop = TRUE),
split(df, with(df, interaction(v1,v2)), drop = TRUE))
Unit: microseconds
                                                  expr      min        lq    median       uq      max neval
            split(df, list(df$v1, df$v2), drop = TRUE) 1119.845 1129.3750 1145.8815 1182.119 3910.249   100
     split(df, interaction(df$v1, df$v2), drop = TRUE)  893.749  900.5720  909.8035  936.414 3617.038   100
 split(df, with(df, interaction(v1, v2)), drop = TRUE)  895.150  902.5705  909.8505  927.128 1399.284   100
It appears interaction is slightly faster (probably due to the fact that f = list(...) is just converted to an interaction within the function).
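A quick way to check that the three calls return the same pieces, sketched on the question's example data (rebuilt here as `df`; `by_list`, `by_inter` and `by_with` are illustrative names):

```r
# Rebuild the example data from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1L, 10L, 2L, 1L, 2L),
  v4 = c(10L, 12L, 12L, 10L, 14L),
  v5 = c(12L, 8L, 15L, 12L, 16L)
)

# The three variants from the benchmark
by_list  <- split(df, list(df$v1, df$v2), drop = TRUE)
by_inter <- split(df, interaction(df$v1, df$v2), drop = TRUE)
by_with  <- split(df, with(df, interaction(v1, v2)), drop = TRUE)

# Compare the resulting lists of data frames
all.equal(by_list, by_inter)
all.equal(by_inter, by_with)
```

Since the results agree, the choice between the variants comes down to readability and the small timing difference shown above.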
Edit
If you just want to use the subset data.frames, then I would suggest using data.table for ease of coding:
library(data.table)
dt <- data.table(df)
dt[, plot(v4, v5), by = list(v1, v2)]
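The same `by =` grouping can also return a per-group result instead of a side effect such as a plot. A minimal sketch, assuming the data.table package is installed and the question's data is rebuilt as `df`; the summary columns `n` and `mean_v4` and the name `pieces` are made up for illustration:

```r
library(data.table)

# Rebuild the example data from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1L, 10L, 2L, 1L, 2L),
  v4 = c(10L, 12L, 12L, 10L, 14L),
  v5 = c(12L, 8L, 15L, 12L, 16L)
)
dt <- as.data.table(df)

# One row per (v1, v2) group; .N is the number of rows in that group
summary_dt <- dt[, .(n = .N, mean_v4 = mean(v4)), by = .(v1, v2)]
summary_dt

# If an actual list of per-group tables is needed, data.table's split()
# method accepts the grouping columns by name; keep.by = FALSE drops them
pieces <- split(dt, by = c("v1", "v2"), keep.by = FALSE)
names(pieces)
```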
#2
3
There's now also nest() from tidyr, which is rather nice.
library(tidyr)
# tidyr 1.0.0 and later expect the new list-column to be named (data = ...)
nestdf <- df %>% nest(data = v3:v5)
> nestdf$data
[[1]]
# A tibble: 2 × 3
v3 v4 v5
<int> <int> <int>
1 1 10 12
2 1 10 12
[[2]]
# A tibble: 1 × 3
v3 v4 v5
<int> <int> <int>
1 10 12 8
[[3]]
# A tibble: 2 × 3
v3 v4 v5
<int> <int> <int>
1 2 12 15
2 2 14 16
Access individual tibbles with nestdf$data[[1]] and so on (single brackets, nestdf$data[1], return a list of length one rather than the tibble itself).
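A minimal sketch of pulling one group's tibble out of the list-column, assuming tidyr 1.0.0 or later is installed and the question's data is rebuilt as `df`; `first_tbl` and `first_list` are illustrative names:

```r
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1L, 10L, 2L, 1L, 2L),
  v4 = c(10L, 12L, 12L, 10L, 14L),
  v5 = c(12L, 8L, 15L, 12L, 16L)
)

# Nest the value columns; v1 and v2 stay as ordinary columns, one row per group
nestdf <- nest(df, data = c(v3, v4, v5))

first_tbl  <- nestdf$data[[1]]  # double brackets: the tibble for the first group
first_list <- nestdf$data[1]    # single brackets: a length-one list holding that tibble

# The group a piece belongs to is stored on the same row of nestdf
nestdf[1, c("v1", "v2")]
```

Unlike split(), the group labels live in the v1 and v2 columns of nestdf rather than in the names of the list.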