Fastest way to add rows for missing values in a data.frame?

Date: 2021-11-26 09:18:11

I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes there might be missing time periods for any given group. I'd like to fill in those rows with NA. Below is example data for one group (of several thousand).

structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894, 
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id", 
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")


  Id Time      Value
1  1    1  0.5687805
2  1    2 -0.7207750
3  1    4  1.2425819
4  1    5  0.6821231

As you can see, Time 3 is missing. Often one or more periods could be missing. I can solve this on my own, but I am afraid I wouldn't be doing it in the most efficient way. My approach would be to create a function that does the following:

1. Generate a sequence of time periods from min(Time) to max(Time).
2. Do a setdiff to grab the missing Time values.
3. Convert that vector to a data.frame.
4. Pull the unique identifier variables (Id and others not listed above) and add them to this data.frame.
5. Merge the two.
6. Return from the function.
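The steps above could be sketched in base R roughly as follows. This is only one possible reading of them: my_pad_function is the name used in the question, but its body here is an assumption, as are the column names Id, Time and Value and the second group added to the sample data. The split/lapply/rbind calls are a base R stand-in for the plyr calls, not the asker's code.

```r
# Sample data: the question's group 1 (values rounded) plus a made-up second group.
original_data <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2, 2),
  Time  = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.57, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)

# Assumed implementation of the padding steps for the rows of a single Id.
my_pad_function <- function(d) {
  all_times     <- seq(min(d$Time), max(d$Time))
  missing_times <- setdiff(all_times, d$Time)
  if (length(missing_times) == 0) return(d)
  # One row per missing Time, carrying the group's Id.
  pad <- data.frame(Id = d$Id[1], Time = missing_times)
  # all = TRUE keeps rows from both sides; Value is NA on the padded rows.
  out <- merge(d, pad, by = c("Id", "Time"), all = TRUE)
  out[order(out$Time), ]
}

# Split by Id, pad each piece, and stack the pieces again.
pieces <- lapply(split(original_data, original_data$Id), my_pad_function)
filled_in_data <- do.call(rbind, pieces)
rownames(filled_in_data) <- NULL
```

Each group is padded only over its own range of Time, so the second group here gains a row for Time 4 but not for Time 1.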

So the entire process would then get executed as below:

    library(plyr)
    # Split the data into individual data.frames by Id.
    temp_list <- dlply(original_data, .(Id))
    # Pad each data.frame.
    tlist2 <- llply(temp_list, my_pad_function)
    # Collapse the list back to a data.frame.
    filled_in_data <- ldply(tlist2)

Is there a better way to achieve this?

4 Answers

#1


28  

Following up on comments with Ben Barnes and starting with his mydf3:

library(data.table)
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7           NA  NA
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5         NA
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7         NA
 [8,]  1   1    8         NA
 [9,]  1   1    9         NA
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

CJ stands for Cross Join; see ?CJ. The padding with NAs happens because nomatch is NA by default. Set nomatch to 0 instead to drop the rows that have no match. If, instead of padding with NAs, the prevailing (last observed) row is required, just add roll=TRUE. This can be more efficient than padding with NAs and then filling them in afterwards. See the description of roll in ?data.table.

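A small self-contained sketch of the three join variants may make the difference easier to see. The toy table below is made up for illustration (it is not mydf3), and the grid is built with explicit DT$ references rather than relying on column scoping inside [ ].

```r
library(data.table)

# Toy data (an assumption for illustration): one Id with Time 3 missing.
DT <- data.table(
  Id    = c(1L, 1L, 1L, 1L),
  Time  = c(1L, 2L, 4L, 5L),
  Value = c(0.57, -0.72, 1.24, 0.68)
)
setkey(DT, Id, Time)

# Every Id/Time combination over the observed range of Time.
grid <- CJ(Id = unique(DT$Id), Time = seq(min(DT$Time), max(DT$Time)))

padded  <- DT[grid]                # default nomatch = NA: Time 3 gets Value NA
matched <- DT[grid, nomatch = 0L]  # unmatched grid rows are dropped
rolled  <- DT[grid, roll = TRUE]   # Time 3 takes the Value observed at Time 2
```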

setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7 -0.691736451   1
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5 -1.9126874
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7 -0.6917365
 [8,]  1   1    8 -0.6917365
 [9,]  1   1    9 -0.6917365
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

#2


4  

Please see Matthew Dowle's answer (by now, hopefully above).

Here's something that uses the data.table package, and it may help when there is more than one ID variable. It may also be faster than merge, depending on how you want your results. I'd be interested in benchmarking and/or suggested improvements.

First, create some more demanding data with two ID variables:

library(data.table)

set.seed(1)

mydf3<-data.frame(Id=sample(1:100,10000,replace=TRUE),
  Value=rnorm(10000))
mydf3<-mydf3[order(mydf3$Id),]

mydf3$Time<-unlist(by(mydf3,mydf3$Id,
  function(x)sample(1:(nrow(x)+3),nrow(x)),simplify=TRUE))

mydf3$Id2<-sample(1:2,nrow(mydf3),replace=TRUE)

Create a function (This has been EDITED - see history)

padFun<-function(data,idvars,timevar){
# Coerce ID variables to character
  data[,idvars]<-lapply(data[,idvars,drop=FALSE],as.character)
# Create global ID variable of all individual ID vars pasted together
  globalID<-Reduce(function(...)paste(...,sep="SOMETHINGWACKY"),
    data[,idvars,drop=FALSE])
# Create data.frame of all possible combinations of globalIDs and times
  allTimes<-expand.grid(globalID=unique(globalID),
    allTime=min(data[,timevar]):max(data[,timevar]),
    stringsAsFactors=FALSE)
# Get the original ID variables back
  allTimes2<-data.frame(allTimes$allTime,do.call(rbind,
    strsplit(allTimes$globalID,"SOMETHINGWACKY")),stringsAsFactors=FALSE)
# Convert combinations data.frame to data.table with idvars and timevar as key
  allTimesDT<-data.table(allTimes2)
  setnames(allTimesDT,1:ncol(allTimesDT),c(timevar,idvars))
  setkeyv(allTimesDT,c(idvars,timevar))
# Convert data to data.table with same variables as key
  dataDT<-data.table(data,key=c(idvars,timevar))
# Join the two data.tables to create padding
  res<-dataDT[allTimesDT]
  return(res)
}

Use the function

(padded2<-padFun(data=mydf3,idvars=c("Id"),timevar="Time"))

#       Id Time        Value Id2
#  [1,]  1    1 -0.262482283   2
#  [2,]  1    2 -1.423935165   2
#  [3,]  1    3  0.500523295   1
#  [4,]  1    4 -1.912687398   1
#  [5,]  1    5 -1.459766444   2
#  [6,]  1    6 -0.691736451   1
#  [7,]  1    7           NA  NA
#  [8,]  1    8  0.001041489   2
#  [9,]  1    9  0.495820559   2
# [10,]  1   10 -0.673167744   1
# First 10 rows of 12800 printed.

(padded<-padFun(data=mydf3,idvars=c("Id","Id2"),timevar="Time"))

#      Id Id2 Time      Value
#  [1,]  1   1    1         NA
#  [2,]  1   1    2         NA
#  [3,]  1   1    3  0.5005233
#  [4,]  1   1    4 -1.9126874
#  [5,]  1   1    5         NA
#  [6,]  1   1    6 -0.6917365
#  [7,]  1   1    7         NA
#  [8,]  1   1    8         NA
#  [9,]  1   1    9         NA
# [10,]  1   1   10 -0.6731677
# First 10 rows of 25600 printed.

The edited function splits the globalID into its component parts in the combination data.frame, before merging with the original data. This should (I think) be better.

#3


2  

You can use tidyr for this.

Use tidyr::complete to fill in rows for Time, and by default the values are filled in with NA.

Create Data

I extended the sample data to show that it works for multiple Ids and even when within an Id the full range of Time is not present.

library(dplyr)
library(tidyr)


df <- tibble(
  Id = c(1, 1, 1, 1, 2, 2, 2),
  Time = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.56, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)

df
#> # A tibble: 7 x 3
#>      Id  Time Value
#>   <dbl> <dbl> <dbl>
#> 1     1     1  0.56
#> 2     1     2 -0.72
#> 3     1     4  1.24
#> 4     1     5  0.68
#> 5     2     2  1.46
#> 6     2     3  0.74
#> 7     2     5  0.99

Fill in the missing rows

df %>% complete(nesting(Id), Time = seq(min(Time), max(Time), 1L))

#> # A tibble: 10 x 3
#>       Id  Time Value
#>    <dbl> <dbl> <dbl>
#> 1      1     1  0.56
#> 2      1     2 -0.72
#> 3      1     3    NA
#> 4      1     4  1.24
#> 5      1     5  0.68
#> 6      2     1    NA
#> 7      2     2  1.46
#> 8      2     3  0.74
#> 9      2     4    NA
#> 10     2     5  0.99
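
If a value other than NA is wanted in the added rows, complete() also accepts a fill argument. The sketch below reuses the sample data from above as a plain data.frame; filling with 0 is only an assumed example of a replacement value, not something the question asks for.

```r
library(tidyr)

df <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2, 2),
  Time  = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.56, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)

# Same completion as above, but the new rows get Value 0 instead of NA.
filled <- complete(
  df,
  nesting(Id),
  Time = seq(min(Time), max(Time), 1L),
  fill = list(Value = 0)
)
```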

#4


1  

My general approach is to use freqTable <- as.data.frame(table(idvar1, idvar2, idvarN)), then pull out the rows where Freq == 0, pad them as necessary, and stack them back onto the original data.

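A minimal base R sketch of that idea, assuming a single ID column Id, a time column Time and a Value column. The sample data and the factor() call that forces every period into the table are illustrative additions, not part of the approach as described.

```r
df <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2, 2),
  Time  = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.56, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)

# Count every Id/Time combination; the factor levels make sure periods
# that are absent from all groups still get a cell in the table.
all_times <- seq(min(df$Time), max(df$Time))
freqTable <- as.data.frame(
  table(Id = df$Id, Time = factor(df$Time, levels = all_times)),
  stringsAsFactors = FALSE
)

# Combinations that never occur are the rows to add.
missing_rows <- freqTable[freqTable$Freq == 0, c("Id", "Time")]
missing_rows$Id    <- as.numeric(missing_rows$Id)
missing_rows$Time  <- as.numeric(missing_rows$Time)
missing_rows$Value <- rep(NA_real_, nrow(missing_rows))

# Stack onto the original data and restore the Id/Time order.
filled <- rbind(df, missing_rows)
filled <- filled[order(filled$Id, filled$Time), ]
rownames(filled) <- NULL
```

Unlike a per-group pad, this fills every group over the global range of Time, so the second Id also gains a row for Time 1.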
