I'll start by saying that the question "Filling in missing data in one data frame with info from another" has a solution that may work for my problem. However, it relies on a for loop, and I would prefer a vectorized solution.
I have 125 years of climate data with year, month, temperature, precipitation, and open pan evaporation. It is daily data summarized by month. Some years in the late 1800s have entire months missing, and I would like to replace each missing month with the equivalent month from a 30-year average around that time.
I have pasted some of the code I've been playing with, below:
# For simplicity, let's pretend there are 5 months in the year, so year 3
# is the only year with a complete set of data, years 1 and 2 are missing some.
df1<-structure(
list(
Year=c(1,1,1,2,2,3,3,3,3,3),
Month=c(1,2,4,2,5,1,2,3,4,5),
Temp=c(-2,2,10,-4,12,2,4,8,14,16),
Precip=c(20,10,50,10,60,26,18,40,60,46),
Evap=c(2,6,30,4,48,4,10,32,70,40)
)
)
# This represents the 30-year average data:
df2<-structure(
list(
Month=c(1,2,3,4,5),
Temp=c(1,3,9,13,15),
Precip=c(11,13,21,43,35),
Evap=c(1,5,13,35,45)
)
)
# to match my actual setup
library(dplyr) # provides as_tibble() and a data-frame-aware setdiff()
df1<-as_tibble(df1)
df2<-as_tibble(df2)
# I can get to the list of months missing from a given year
full_year <- df2[,1]
compare_year1 <- df1[df1$Year==1,2]
missing_months <- setdiff(full_year,compare_year1)
# Or I can get the full data from each year missing one or more months
year_full <- df2[,1]
years_compare <- split(df1[,c(2)], df1$Year)
years_missing_months <- names(years_compare[sapply(years_compare,nrow)<5])
complete_years_missing_months <- df1[df1$Year %in% years_missing_months,]
This is where I've gotten stumped.
I've looked at anti_join and merge, but it looks like they need data of the same length in each frame. I can use the lists grouped by year to identify the years that are missing months, but I'm not sure how to actually insert the rows from there. It seems like lapply could be useful, but the answer isn't coming to me.
Thanks in advance.
Edit 7/19: As an illustration of what I need, just looking at year "1", the current data (df1) has the following:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 4 | 10 | 50 | 30
Months 3 and 5 are missing data, so I would like to insert the equivalent-month data from the 30-year average table (df2), so the final result for year "1" would look like:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 3 | 9 | 21 | 13
1 | 4 | 10 | 50 | 30
1 | 5 | 15 | 35 | 45
Every other year with missing months would then be filled in the same way. Year "3" would not change, because (in this 5-month example) it has no missing months.
1 solution
First just add rows to hold the imputed values, since you know that there are missing rows with known dates:
df1$date <- as.Date(paste0("200",df1$Year,"/",df1$Month,"/01"))
pretend_12months <- seq(min(df1$date),max(df1$date),by = "1 month")
pretend_5months <- pretend_12months[lubridate::month(pretend_12months) < 6]
pretend_5months <- data.frame(date=pretend_5months)
new <- merge(df1,
pretend_5months,
by="date",
all=TRUE)
# year() returns e.g. 2001; subtracting 2000 keeps Year numeric
new$Year <- ifelse(is.na(new$Year),
lubridate::year(new$date) - 2000,
new$Year)
new$Month <- ifelse(is.na(new$Month),
lubridate::month(new$date),
new$Month)
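If building fake dates feels roundabout, the placeholder rows can also be created by merging against a full Year x Month grid. This is only a minimal base R sketch under the assumptions of the toy example (5 months per year, df1 as given in the question); the names `grid` and `padded` are made up for this illustration.

```r
# Example data from the question (5 "months" per year)
df1 <- data.frame(
  Year   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Month  = c(1, 2, 4, 2, 5, 1, 2, 3, 4, 5),
  Temp   = c(-2, 2, 10, -4, 12, 2, 4, 8, 14, 16),
  Precip = c(20, 10, 50, 10, 60, 26, 18, 40, 60, 46),
  Evap   = c(2, 6, 30, 4, 48, 4, 10, 32, 70, 40)
)

# Every Year/Month combination that should exist
grid <- expand.grid(Month = as.numeric(1:5), Year = unique(df1$Year))

# Keep all grid rows; combinations absent from df1 get NA measurements
padded <- merge(grid, df1, by = c("Year", "Month"), all.x = TRUE)

# Make the row order explicit instead of relying on merge()'s sorting
padded <- padded[order(padded$Year, padded$Month), ]
rownames(padded) <- NULL
```

With the real data, the grid would presumably use `Month = 1:12` and the actual range of years.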
Impute the NA values using a left join:
# key part: left join using any library or builtin method (left_join,merge, etc)
fillin <- sqldf::sqldf("select a.date,a.Year,a.Month, b.Temp, b.Precip, b.Evap from new a left join df2 b on a.Month = b.Month")
# apply data set from join to the NA data
new$Temp[is.na(new$Temp)] <- fillin$Temp[is.na(new$Temp)]
new$Precip[is.na(new$Precip)] <- fillin$Precip[is.na(new$Precip)]
new$Evap[is.na(new$Evap)] <- fillin$Evap[is.na(new$Evap)]
         date Year Month Temp Precip Evap
1  2001-01-01    1     1   -2     20    2
2  2001-02-01    1     2    2     10    6
3  2001-03-01    1     3    9     21   13
4  2001-04-01    1     4   10     50   30
5  2001-05-01    1     5   15     35   45
6  2002-01-01    2     1    1     11    1
7  2002-02-01    2     2   -4     10    4
8  2002-03-01    2     3    9     21   13
9  2002-04-01    2     4   13     43   35
10 2002-05-01    2     5   12     60   48
11 2003-01-01    3     1    2     26    4
12 2003-02-01    3     2    4     18   10
13 2003-03-01    3     3    8     40   32
14 2003-04-01    3     4   14     60   70
15 2003-05-01    3     5   16     46   40
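The fill-in step above depends on the joined table coming back in the same row order as `new`. A variant that looks up the averages by month with match() avoids that dependence. The following is a hedged base R sketch rather than the method used above; `grid`, `filled`, `idx`, and `vars` are illustrative names, and the data are the toy df1 and df2 from the question.

```r
df1 <- data.frame(
  Year   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Month  = c(1, 2, 4, 2, 5, 1, 2, 3, 4, 5),
  Temp   = c(-2, 2, 10, -4, 12, 2, 4, 8, 14, 16),
  Precip = c(20, 10, 50, 10, 60, 26, 18, 40, 60, 46),
  Evap   = c(2, 6, 30, 4, 48, 4, 10, 32, 70, 40)
)
df2 <- data.frame(
  Month  = c(1, 2, 3, 4, 5),
  Temp   = c(1, 3, 9, 13, 15),
  Precip = c(11, 13, 21, 43, 35),
  Evap   = c(1, 5, 13, 35, 45)
)

# Pad df1 so every Year/Month combination has a row (NA where missing)
grid <- expand.grid(Month = df2$Month, Year = unique(df1$Year))
filled <- merge(grid, df1, by = c("Year", "Month"), all.x = TRUE)
filled <- filled[order(filled$Year, filled$Month), ]
rownames(filled) <- NULL

# For each row of filled, the row of df2 holding that month's average
idx <- match(filled$Month, df2$Month)

# Loop over the three measurement columns only; each fill is vectorized over rows
vars <- c("Temp", "Precip", "Evap")
for (v in vars) {
  gap <- is.na(filled[[v]])
  filled[[v]][gap] <- df2[[v]][idx[gap]]
}
```

Because each column is filled independently, this also replaces isolated NA values in rows that were otherwise present, which may or may not be what you want for the real data.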