I want to split a data frame into several smaller ones. This looks like a trivial question, but I could not find a solution through web search.
8 solutions
#1
51
You may also want to cut the data frame into an arbitrary number of smaller dataframes. Here, we cut into two dataframes.
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
gives
$`1`
num let LET
3 3 c C
6 6 f F
10 10 j J
12 12 l L
14 14 n N
15 15 o O
17 17 q Q
18 18 r R
20 20 t T
21 21 u U
22 22 v V
23 23 w W
26 26 z Z
$`2`
num let LET
1 1 a A
2 2 b B
4 4 d D
5 5 e E
7 7 g G
8 8 h H
9 9 i I
11 11 k K
13 13 m M
16 16 p P
19 19 s S
24 24 x X
25 25 y Y
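To make the "arbitrary number" part concrete, here is a minimal sketch that generalizes the call above to k groups. The names k, groups and parts are illustrative and not from the answer; rep(..., length.out = ) is used so that the number of rows does not have to be a multiple of k.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
k <- 3  # assumed number of pieces
set.seed(10)
# one group label per row, spread as evenly as possible, then shuffled
groups <- sample(rep(seq_len(k), length.out = nrow(x)))
parts <- split(x, groups)
sapply(parts, nrow)  # group sizes differ by at most one
```

Each row ends up in exactly one element of parts, so the pieces can be recombined with do.call(rbind, parts) if needed.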
#2
18
If you want to split a dataframe according to the values of some variable, I'd suggest using daply() from the plyr package.
library(plyr)
x <- daply(df, .(splitting_variable), function(x)return(x))
Now, x is an array of dataframes. To access one of the dataframes, you can index it with the name of the level of the splitting variable.
x$Level1
#or
x[["Level1"]]
Before splitting the data into many dataframes, though, I'd make sure there isn't a cleverer way to deal with it.
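If plyr is not available, base R's split() gives a similar result as a plain named list. A minimal sketch, assuming a small made-up df whose column splitting_variable has the levels Level1 to Level3 (the data are hypothetical and chosen only to match the names used above):

```r
# hypothetical example data
df <- data.frame(
  splitting_variable = c("Level1", "Level2", "Level1", "Level3"),
  value = 1:4
)
# one data frame per distinct value of the splitting variable
parts <- split(df, df$splitting_variable)
names(parts)
parts$Level1
# or
parts[["Level1"]]
```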
#3
12
I just posted a kind of RFC that might help you: Split a vector into chunks in R
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
dfchunk <- split(x, factor(sort(rank(row.names(x))%%n)))
dfchunk
$`0`
num let LET
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
11 11 k K
12 12 l L
13 13 m M
$`1`
num let LET
14 14 n N
15 15 o O
16 16 p P
17 17 q Q
18 18 r R
19 19 s S
20 20 t T
21 21 u U
22 22 v V
23 23 w W
24 24 x X
25 25 y Y
26 26 z Z
Cheers, Sebastian
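An arguably more direct way to get contiguous chunks is to compute a chunk index from the row position. This is a sketch, not the method from the linked question; chunk_size and idx are names assumed for illustration.

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
## number of chunks
n <- 2
chunk_size <- ceiling(nrow(x) / n)             # rows per chunk
idx <- ceiling(seq_len(nrow(x)) / chunk_size)  # 1,1,...,2,2,...
dfchunk <- split(x, idx)
sapply(dfchunk, nrow)
```

When the number of rows is not divisible by n, the last chunk is smaller, and for some combinations (for example 5 rows and n = 4) this yields fewer than n chunks.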
#4
10
You could also use
data2 <- data[data$sum_points == 2500, ]
This creates a dataframe containing only the rows where sum_points equals 2500.
It gives:
airfoils sum_points field_points init_t contour_t field_t
...
491 5 2500 5625 0.000086 0.004272 6.321774
498 5 2500 5625 0.000087 0.004507 6.325083
504 5 2500 5625 0.000088 0.004370 6.336034
603 5 250 10000 0.000072 0.000525 1.111278
577 5 250 10000 0.000104 0.000559 1.111431
587 5 250 10000 0.000072 0.000528 1.111524
606 5 250 10000 0.000079 0.000538 1.111685
....
> data2 <- data[data$sum_points == 2500, ]
> data2
airfoils sum_points field_points init_t contour_t field_t
108 5 2500 625 0.000082 0.004329 0.733109
106 5 2500 625 0.000102 0.004564 0.733243
117 5 2500 625 0.000087 0.004321 0.733274
112 5 2500 625 0.000081 0.004428 0.733587
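One thing to watch with this kind of logical indexing is missing values: a condition that evaluates to NA produces an all-NA row in the result. The sketch below uses a made-up miniature of the table above (the values are hypothetical) and shows one way to get both the matching rows and the remainder.

```r
# hypothetical miniature of the timing table
data <- data.frame(
  airfoils = 5,
  sum_points = c(2500, 250, 2500, NA),
  field_t = c(6.32, 1.11, 0.73, 1.11)
)
keep <- data$sum_points == 2500        # TRUE, FALSE, TRUE, NA
data2 <- data[which(keep), ]           # which() ignores the NA
rest <- data[!(keep %in% TRUE), ]      # everything that did not match
```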
#5
7
subset() is also useful:
subset(DATAFRAME, COLUMNNAME == "")
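As a concrete illustration, here is a small sketch on the built-in airquality data (my choice of dataset, not from the answer). subset() can filter rows and pick columns in one call, and rows where the condition is NA are dropped.

```r
# rows with Temp above 80, keeping only two columns
hot <- subset(airquality, Temp > 80, select = c(Ozone, Temp))
head(hot)
```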
For a survey package, maybe the "survey" package is pertinent?
http://faculty.washington.edu/tlumley/survey/
#6
3
The answer you want depends very much on how and why you want to break up the data frame.
For example, if you want to leave out some variables, you can create new data frames from specific columns of the original data frame. The subscripts in brackets after the data frame name refer to row and column numbers. Check out Spoetry for a complete description.
newdf <- mydf[,1:3]
Or, you can choose specific rows.
newdf <- mydf[1:3,]
And these subscripts can also be logical tests, such as choosing rows that contain a particular value, or factors with a desired value.
What do you want to do with the chunks left over? Do you need to perform the same operation on each chunk of the data frame? If so, make sure the subsets end up in a convenient object, such as a list, so that you can run the same command on each chunk.
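To illustrate the "list of chunks" idea, here is a minimal sketch using the built-in mtcars data (an assumed example, not from the question): the chunks live in a list, and lapply() or sapply() applies the same operation to each one.

```r
# one data frame per number of cylinders, stored in a named list
chunks <- split(mtcars, mtcars$cyl)
# the same summary on every chunk
means <- sapply(chunks, function(d) mean(d$mpg))
# the same model on every chunk
fits <- lapply(chunks, function(d) lm(mpg ~ wt, data = d))
```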
#7
3
If you want to split by values in one of the columns, you can use lapply. For instance, to split ChickWeight into a separate dataset for each chick:
data(ChickWeight)
lapply(unique(ChickWeight$Chick), function(x) ChickWeight[ChickWeight$Chick == x,])
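Note that the list returned by the lapply() call above has no names. A sketch of an alternative, assuming you want to look chicks up by their ID: split() on the same column returns a list named by the factor levels.

```r
data(ChickWeight)
# one data frame per chick, named by the Chick level
by_chick <- split(ChickWeight, ChickWeight$Chick)
head(names(by_chick))
by_chick[["1"]]  # all rows for chick 1
```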
#8
1
Splitting the data frame seems counter-productive. Instead, use the split-apply-combine paradigm. For example, generate some data:
df = data.frame(grp=sample(letters, 100, TRUE), x=rnorm(100))
then split only the relevant column, apply the scale() function to x in each group, and combine the results (using split<- or ave):
df$z = 0
split(df$z, df$grp) = lapply(split(df$x, df$grp), scale)
## alternative: df$z = ave(df$x, df$grp, FUN=scale)
This will be very fast compared to splitting data.frames, and the result remains usable in downstream analysis without iteration. I think the dplyr syntax is:
library(dplyr)
df %>% group_by(grp) %>% mutate(z=scale(x))
In general, this dplyr solution is faster than splitting data frames but not as fast as the split<- / ave approach above.
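As a quick sanity check of the grouped scaling, here is a sketch using ave() with a fixed seed and only four groups (both are my assumptions, so that every group has several rows). scale() returns a one-column matrix, so it is wrapped in as.numeric() to keep z a plain numeric vector.

```r
set.seed(1)
df <- data.frame(grp = sample(letters[1:4], 100, TRUE), x = rnorm(100))
# standardize x within each group without splitting the data frame
df$z <- ave(df$x, df$grp, FUN = function(v) as.numeric(scale(v)))
tapply(df$z, df$grp, mean)  # each close to 0
tapply(df$z, df$grp, sd)    # each equal to 1 up to rounding
```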