I have the following table:
Name Date Quiz Homework
John 11-01-02 40 10
John 11-01-03 47 20
John 11-01-04 41 10
John 11-01-08 35 10
John 11-01-10 43 15
John 11-01-13 40 10
Adam 11-01-05 41 10
Adam 11-01-08 41 15
Adam 11-01-14 49 10
Adam 11-01-19 40 20
Adam 11-01-21 40 10
You can see that there are some gaps in the dates. I would like to fill in those gaps for each name and set the quiz and homework scores for the missing dates to zero. The final result I want is the following:
Name Date Quiz Homework
John 11-01-02 40 10
John 11-01-03 47 20
John 11-01-04 41 10
John 11-01-05 0 0
John 11-01-06 0 0
John 11-01-07 0 0
John 11-01-08 35 10
John 11-01-09 0 0
John 11-01-10 43 15
John 11-01-11 0 0
John 11-01-12 0 0
John 11-01-13 40 10
Adam 11-01-05 41 10
Adam 11-01-06 0 0
Adam 11-01-07 0 0
Adam 11-01-08 41 15
Adam 11-01-09 0 0
Adam 11-01-10 0 0
Adam 11-01-11 0 0
Adam 11-01-12 0 0
Adam 11-01-13 0 0
Adam 11-01-14 49 10
Adam 11-01-15 0 0
Adam 11-01-16 0 0
Adam 11-01-17 0 0
Adam 11-01-18 0 0
Adam 11-01-19 40 20
Adam 11-01-20 0 0
Adam 11-01-21 40 10
Is there a fast way of doing it? What I did was the following:
1) Find the minimum and maximum dates by name
2) For each name, create a sequence of dates from the minimum to the maximum date found in step 1)
3) Join the table created in step 2) with the original table
4) Replace NA values in Quiz and Homework with zero
But that was rather slow. I was wondering if there's a faster way of doing it.
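For reference, the four steps above might look roughly like this in base R. This is only a sketch of the approach described, not the code I actually ran: it uses four rows from the table, and the names `df`, `grid` and `out` are made up for illustration.

```r
# Four rows taken from the table above (dates already converted to Date)
df <- data.frame(
  Name     = c("John", "John", "Adam", "Adam"),
  Date     = as.Date(c("11-01-02", "11-01-04", "11-01-05", "11-01-08"),
                     format = "%y-%m-%d"),
  Quiz     = c(40L, 41L, 41L, 41L),
  Homework = c(10L, 10L, 10L, 15L)
)

# Steps 1) and 2): for each name, a daily sequence from its min to its max date
grid <- do.call(rbind, lapply(split(df, df$Name), function(d) {
  data.frame(Name = d$Name[1],
             Date = seq(min(d$Date), max(d$Date), by = "1 day"))
}))

# Step 3): left join the full grid with the original table
out <- merge(grid, df, by = c("Name", "Date"), all.x = TRUE)

# Step 4): replace the NA scores created by the join with zero
out$Quiz[is.na(out$Quiz)] <- 0L
out$Homework[is.na(out$Homework)] <- 0L
```

The split/rbind and merge steps are the likely bottleneck on a large table, which is why I am asking for something faster.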
2 Solutions
#1
A tidyverse solution:
library(dplyr)
library(tidyr)
library(lubridate) # for as_date()

df1 <- structure(list(Name = c("John", "John", "John", "John", "John",
                               "John", "Adam", "Adam", "Adam", "Adam", "Adam"),
                      Date = c("11-01-02", "11-01-03", "11-01-04",
                               "11-01-08", "11-01-10", "11-01-13",
                               "11-01-05", "11-01-08", "11-01-14",
                               "11-01-19", "11-01-21"),
                      Quiz = c(40L, 47L, 41L, 35L, 43L, 40L, 41L, 41L, 49L, 40L, 40L),
                      Homework = c(10L, 20L, 10L, 10L, 15L, 10L,
                                   10L, 15L, 10L, 20L, 10L)),
                 .Names = c("Name", "Date", "Quiz", "Homework"),
                 class = "data.frame",
                 row.names = c(NA, -11L))

df1 %>%
  # %y is the two-digit year (11 -> 2011); %C would mean the century
  mutate(Date = as_date(Date, format = "%y-%m-%d")) %>%
  group_by(Name) %>%
  # add one row per missing day within each name's date range, scores set to 0
  complete(Date = seq(min(Date), max(Date), by = "1 day"),
           fill = list(Quiz = 0, Homework = 0))
Name Date Quiz Homework
1 Adam 2011-01-05 41 10
2 Adam 2011-01-06 0 0
3 Adam 2011-01-07 0 0
4 Adam 2011-01-08 41 15
5 Adam 2011-01-09 0 0
6 Adam 2011-01-10 0 0
7 Adam 2011-01-11 0 0
8 Adam 2011-01-12 0 0
9 Adam 2011-01-13 0 0
10 Adam 2011-01-14 49 10
11 Adam 2011-01-15 0 0
12 Adam 2011-01-16 0 0
13 Adam 2011-01-17 0 0
14 Adam 2011-01-18 0 0
15 Adam 2011-01-19 40 20
16 Adam 2011-01-20 0 0
17 Adam 2011-01-21 40 10
18 John 2011-01-02 40 10
19 John 2011-01-03 47 20
20 John 2011-01-04 41 10
21 John 2011-01-05 0 0
22 John 2011-01-06 0 0
23 John 2011-01-07 0 0
24 John 2011-01-08 35 10
25 John 2011-01-09 0 0
26 John 2011-01-10 43 15
27 John 2011-01-11 0 0
28 John 2011-01-12 0 0
29 John 2011-01-13 40 10
#2
A solution using the data.table package, which should be fast:
library(data.table)

DT <- fread("Name Date Quiz Homework
John 11-01-02 40 10
John 11-01-03 47 20
John 11-01-04 41 10
John 11-01-08 35 10
John 11-01-10 43 15
John 11-01-13 40 10
Adam 11-01-05 41 10
Adam 11-01-08 41 15
Adam 11-01-14 49 10
Adam 11-01-19 40 20
Adam 11-01-21 40 10")

DT[, Date := as.Date(Date, format = "%y-%m-%d")]

# Join the original rows onto the full per-name date grid, then zero the NAs.
# The result is assigned to `filled`; otherwise the joined table is discarded.
filled <- DT[DT[, .(Date = seq(min(Date), max(Date), by = "1 day")), by = .(Name)],
             on = .(Name, Date)][
  , `:=`(Quiz = ifelse(is.na(Quiz), 0, Quiz),
         Homework = ifelse(is.na(Homework), 0, Homework))]
Explanation:
- Create the sequence of dates using allDates <- DT[, .(Date=seq(min(Date), max(Date), by="1 day")), by=.(Name)]
- Join with the original dataset using DT[allDates, on=.(Name, Date)]
- Finally, replace NAs with 0
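The three steps can also be written as separate statements, which may be easier to read and time individually. The sketch below is a variant, not the code above: it uses four rows from the question, assumes a data.table version that provides setnafill() (1.12.4 or later), and the name `filled` is only illustrative.

```r
library(data.table)

# Four rows taken from the question's table
DT <- data.table(
  Name     = c("John", "John", "Adam", "Adam"),
  Date     = as.Date(c("11-01-02", "11-01-04", "11-01-05", "11-01-08"),
                     format = "%y-%m-%d"),
  Quiz     = c(40L, 41L, 41L, 41L),
  Homework = c(10L, 10L, 10L, 15L)
)

# Step 1: one row per name and day between that name's first and last date
allDates <- DT[, .(Date = seq(min(Date), max(Date), by = "1 day")), by = .(Name)]

# Step 2: join the original rows onto the grid; days without data get NA scores
filled <- DT[allDates, on = .(Name, Date)]

# Step 3: overwrite the NAs with 0 by reference (no copy of the table)
setnafill(filled, type = "const", fill = 0L, cols = c("Quiz", "Homework"))
```

Unlike ifelse(), setnafill() modifies the columns in place and keeps them as integers.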