Split a vector into chunks in R

Time: 2021-04-25 21:41:22

I have to split a vector into n chunks of equal size in R. I couldn't find any base function to do that, and Google didn't get me anywhere either. So here is what I came up with; hopefully it helps someone somewhere.

x <- 1:10
n <- 3
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk(x,n)
$`0`
[1] 1 2 3

$`1`
[1] 4 5 6 7

$`2`
[1]  8  9 10

Any comments, suggestions or improvements are really welcome and appreciated.

Cheers, Sebastian

15 solutions

#1


235  

A one-liner splitting d into chunks of size 20:

split(d, ceiling(seq_along(d)/20))

More details: I think all you need is seq_along(), split() and ceiling():

> d <- rpois(73,5)
> d
 [1]  3  1 11  4  1  2  3  2  4 10 10  2  7  4  6  6  2  1  1  2  3  8  3 10  7  4
[27]  3  4  4  1  1  7  2  4  6  0  5  7  4  6  8  4  7 12  4  6  8  4  2  7  6  5
[53]  4  5  4  5  5  8  7  7  7  6  2  4  3  3  8 11  6  6  1  8  4
> max <- 20
> x <- seq_along(d)
> d1 <- split(d, ceiling(x/max))
> d1
$`1`
 [1]  3  1 11  4  1  2  3  2  4 10 10  2  7  4  6  6  2  1  1  2

$`2`
 [1]  3  8  3 10  7  4  3  4  4  1  1  7  2  4  6  0  5  7  4  6

$`3`
 [1]  8  4  7 12  4  6  8  4  2  7  6  5  4  5  4  5  5  8  7  7

$`4`
 [1]  7  6  2  4  3  3  8 11  6  6  1  8  4
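
If you want a fixed number of chunks rather than a fixed chunk size, the same idea can be turned around. This is only a sketch: `chunk_by_size` and `chunk_by_count` are hypothetical helper names, not base R functions, and `chunk_by_count` assumes n is no larger than length(x).

```r
# Split x into consecutive chunks of at most `size` elements.
chunk_by_size <- function(x, size) {
  split(x, ceiling(seq_along(x) / size))
}

# Split x into `n` consecutive chunks of near-equal size.
# Position i is mapped to chunk ceiling(i * n / length(x)).
# Assumption: n <= length(x), otherwise some chunks would be missing.
chunk_by_count <- function(x, n) {
  split(x, ceiling(seq_along(x) * n / length(x)))
}

lengths(chunk_by_size(1:73, 20))   # chunk sizes 20, 20, 20, 13
lengths(chunk_by_count(1:10, 3))   # chunk sizes 3, 3, 4
```

The first helper fixes the chunk size and lets the number of chunks vary; the second fixes the number of chunks and lets the sizes differ by at most one.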

#2


43  

chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE)) 
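
To see what this one-liner does, it can help to look at the grouping vector that `cut()` produces before it is handed to `split()`. A small sketch with made-up data (the variable names are illustrative only):

```r
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
n <- 3

# cut() divides the index range 1..length(x) into n intervals of equal width
# and, with labels = FALSE, returns the interval number for each index.
groups <- cut(seq_along(x), n, labels = FALSE)
groups

# split() then collects the elements of x that share an interval number.
chunks <- split(x, groups)
chunks
```

Because the grouping is computed from positions rather than from the values of x, this also works for unsorted or non-numeric vectors.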

#3


18  

This will split it differently to what you have, but is still quite a nice list structure I think:

chunk.2 <- function(x, n, force.number.of.groups = TRUE, len = length(x), groups = trunc(len/n), overflow = len%%n) { 
  if(force.number.of.groups) {
    f1 <- as.character(sort(rep(1:n, groups)))
    f <- as.character(c(f1, rep(n, overflow)))
  } else {
    f1 <- as.character(sort(rep(1:groups, n)))
    f <- as.character(c(f1, rep("overflow", overflow)))
  }

  g <- split(x, f)

  if(force.number.of.groups) {
    g.names <- names(g)
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
  } else {
    g.names <- names(g[-length(g)])
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
    g.names.ordered <- c(g.names.ordered, "overflow")
  }

  return(g[g.names.ordered])
}

Which will give you the following, depending on how you want it formatted:

> x <- 1:10; n <- 3
> chunk.2(x, n, force.number.of.groups = FALSE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1] 7 8 9

$overflow
[1] 10

> chunk.2(x, n, force.number.of.groups = TRUE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1]  7  8  9 10

Running a couple of timings using these settings:

set.seed(42)
x <- rnorm(1e7)
n <- 3

Then we have the following results:

> system.time(chunk(x, n)) # your function 
   user  system elapsed 
 29.500   0.620  30.125 

> system.time(chunk.2(x, n, force.number.of.groups = TRUE))
   user  system elapsed 
  5.360   0.300   5.663 

EDIT: Changing from as.factor() to as.character() in my function made it twice as fast.

#4


16  

simplified version...
n = 3
split(x, sort(x%%n))
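
Note that `x %% n` works on the values of x, so this version relies on x being 1, 2, ..., length(x) as in the question. A sketch of the same idea using positions instead, so that it also works for arbitrary values (`chunk_by_position` is a hypothetical name):

```r
chunk_by_position <- function(x, n) {
  # seq_along(x) %% n cycles through 1, 2, ..., n - 1, 0; sorting these
  # labels makes equal labels adjacent, so each chunk is a consecutive run.
  split(x, sort(seq_along(x) %% n))
}

x <- c(10, 20, 30, 40, 50, 60, 70)
chunk_by_position(x, 3)
```

As in the question, the chunks are named `0`, `1`, ..., and their sizes depend on how often each remainder occurs.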

#5


13  

Try the ggplot2 function, cut_number:

library(ggplot2)
x <- 1:10
n <- 3
cut_number(x, n) # labels = FALSE if you just want an integer result
#>  [1] [1,4]  [1,4]  [1,4]  [1,4]  (4,7]  (4,7]  (4,7]  (7,10] (7,10] (7,10]
#> Levels: [1,4] (4,7] (7,10]

# if you want it split into a list:
split(x, cut_number(x, n))
#> $`[1,4]`
#> [1] 1 2 3 4
#> 
#> $`(4,7]`
#> [1] 5 6 7
#> 
#> $`(7,10]`
#> [1]  8  9 10

#6


12  

A few more variants to the pile...

> x <- 1:10
> n <- 3

Note that you don't need the factor function here, but you still want to sort; otherwise the elements are dealt out in turn rather than in consecutive runs (the first chunk would be 3 6 9):

> chunk <- function(x, n) split(x, sort(rank(x) %% n))
> chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1]  8  9 10

Or you can assign character indices instead of the numbers in backticks above:

> my.chunk <- function(x, n) split(x, sort(rep(letters[1:n], each=n, len=length(x))))
> my.chunk(x, n)
$a
[1] 1 2 3 4
$b
[1] 5 6 7
$c
[1]  8  9 10

Or you can use plain-word names stored in a vector. Note that using sort to get consecutive values in x alphabetizes the labels:

> my.other.chunk <- function(x, n) split(x, sort(rep(c("tom", "dick", "harry"), each=n, len=length(x))))
> my.other.chunk(x, n)
$dick
[1] 1 2 3
$harry
[1] 4 5 6
$tom
[1]  7  8  9 10

#7


7  

You could combine the split/cut, as suggested by mdsummer, with quantile to create even groups:

split(x,cut(x,quantile(x,(0:n)/n), include.lowest=TRUE, labels=FALSE))

This gives the same result for your example, but not for skewed variables.
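
One thing to keep in mind: because this cuts on the values of x rather than on positions, the chunks are groups of similar values, not consecutive runs. A small sketch with unsorted data (the numbers are made up for illustration):

```r
x <- c(5, 1, 9, 3, 7, 2)
n <- 2

# Break points at the 0%, 50% and 100% quantiles of x.
breaks <- quantile(x, (0:n) / n)

# Each element is assigned to the quantile interval its value falls in.
chunks <- split(x, cut(x, breaks, include.lowest = TRUE, labels = FALSE))
chunks
```

Here the first chunk holds the three smallest values (in their original order) and the second the three largest. With many tied values the quantiles can coincide, in which case `cut()` stops with an error about non-unique breaks.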

#8


6  

Here's another variant.

NOTE: with this sample you're specifying the CHUNK SIZE in the second parameter

  1. all chunks are uniform, except for the last;
  2. the last will at worst be smaller, never bigger than the chunk size.

chunk <- function(x,n)
{
    f <- sort(rep(1:(trunc(length(x)/n)+1),n))[1:length(x)]
    return(split(x,f))
}

# Test
v <- 1:11

chunks <- chunk(v, 5)

# join the elements of each chunk with "," and the chunks with "|"
cat(sapply(chunks, paste, collapse = ","), sep = "|")
# output
# 1,2,3,4,5|6,7,8,9,10|11

#9


5  

split(x,matrix(1:n,n,length(x))[1:length(x)])

Perhaps this is clearer, but it is the same idea:
split(x,rep(1:n, ceiling(length(x)/n),length.out = length(x)))

If you want the chunks to be consecutive, wrap the grouping vector in sort().
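
The difference between the unsorted and the sorted grouping vector can be sketched as follows (variable names are illustrative):

```r
x <- 1:10
n <- 3

# Without sort: labels cycle 1, 2, 3, 1, 2, 3, ... so the elements are
# dealt out round-robin, like cards.
labels <- rep(1:n, length.out = length(x))
round_robin <- split(x, labels)

# With sort: equal labels become adjacent, so chunks are consecutive runs.
consecutive <- split(x, sort(labels))

round_robin   # chunk 1 is 1 4 7 10
consecutive   # chunk 1 is 1 2 3 4
```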

#10


5  

I needed the same function and have read the previous solutions; however, I also needed the unbalanced chunk to be at the end, i.e. if I have 10 elements to split into vectors of 3 each, then my result should have vectors with 3, 3 and 4 elements respectively. So I used the following (I left the code unoptimised for readability; otherwise there is no need for so many variables):

chunk <- function(x,n){
  numOfVectors <- floor(length(x)/n)
  elementsPerVector <- c(rep(n,numOfVectors-1),n+length(x) %% n)
  elemDistPerVector <- rep(1:numOfVectors,elementsPerVector)
  split(x,factor(elemDistPerVector))
}
set.seed(1)
x <- rnorm(10)
n <- 3
chunk(x,n)
$`1`
[1] -0.6264538  0.1836433 -0.8356286

$`2`
[1]  1.5952808  0.3295078 -0.8204684

$`3`
[1]  0.4874291  0.7383247  0.5757814 -0.3053884

#11


2  

Credit to @Sebastian for this function

chunk <- function(x,y){
         split(x, factor(sort(rank(row.names(x))%%y)))
         }

#12


2  

If you don't like split() and you don't mind NAs padding out your short tail:

chunk <- function(x, n) {
  # number of NAs needed to pad x up to the next multiple of n (0 if none)
  pad <- (n - length(x) %% n) %% n
  # fill a matrix with n rows column by column; each column is one chunk
  matrix(c(x, rep(NA, pad)), nrow = n)
}

The columns of the returned matrix ([,1:ncol]) are the droids you are looking for.
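
If you later need the chunks as a list again, the columns can be pulled out one by one. A minimal sketch, building the padded matrix by hand for `x <- 1:10` and `n <- 3` (two NAs of padding) rather than calling the function:

```r
x <- 1:10
n <- 3

# 10 values padded with 2 NAs so they fill a 3-row matrix column by column.
m <- matrix(c(x, rep(NA, 2)), nrow = n)
m

# Turn the columns back into a list, dropping the NA padding.
# Caveat: this also drops any genuine NAs that were in x.
cols <- lapply(seq_len(ncol(m)), function(j) {
  col <- m[, j]
  col[!is.na(col)]
})
cols
```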

#13


2  

If you don't like split() and you don't like matrix() (with its dangling NAs), there's this:

chunk <- function(x, n) {
  starts <- seq.int(from = 1, to = length(x), by = n)  # first index of each chunk
  ends <- pmin(starts + (n - 1), length(x))            # last index, capped at length(x)
  mapply(function(a, b) x[a:b], starts, ends, SIMPLIFY = FALSE)
}

Like split(), it returns a list, but it doesn't waste time or space with labels, so it may be more performant.

#14


1  

I needed a function that takes the name of a data.table (as a quoted string) and an upper limit on the number of rows in each subset of that data.table. The function produces as many data.tables as that limit requires:

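library(data.table)
split_dt <- function(x, y) {
  dt <- get(x)  # x is the name of the data.table, passed as a string
  for (i in seq(from = 1, to = nrow(dt), by = y)) {
    # rows i .. i + y - 1, so each piece has at most y rows and pieces do not
    # overlap; indices past the last row come back as NA rows
    assign(paste0("df_", i), dt[i:(i + y - 1)], envir = globalenv())
  }
}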

This function gives me a series of data.tables named df_[number] with the starting row from the original data.table in the name. The last data.table can be short and filled with NAs so you have to subset that back to whatever data is left. This type of function is useful because certain GIS software have limits on how many address pins you can import, for example. So slicing up data.tables into smaller chunks may not be recommended, but it may not be avoidable.
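
If creating many global variables is not a requirement, a list of pieces is often easier to handle. The following is a sketch under assumptions, not the function above: it uses a plain base R data.frame and a hypothetical helper name `split_rows`, and it returns a list instead of assigning `df_*` objects.

```r
split_rows <- function(df, max_rows) {
  # Group number for each row: rows 1..max_rows -> 1, the next max_rows -> 2, ...
  grp <- ceiling(seq_len(nrow(df)) / max_rows)
  split(df, grp)  # list of data.frames, each with at most max_rows rows
}

df <- data.frame(id = 1:10, value = letters[1:10])
pieces <- split_rows(df, 4)
sapply(pieces, nrow)  # 4, 4, 2: the last piece is short, with no NA padding
```

Each piece can then be written out or processed in a loop over `pieces`, for example with `lapply()`.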

#15


0  

Simple function for splitting a vector just by using indexes; no need to overcomplicate this:

vsplit <- function(v, n) {
    l = length(v)
    r = l/n
    return(lapply(1:n, function(i) {
        s = max(1, round(r*(i-1))+1)
        e = min(l, round(r*i))
        return(v[s:e])
    }))
}
