Conditionally fill in missing values in R while reshaping a dataset from long to wide

I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality.

Using reshape2 I have "melted" those datasets into a single dataframe.

Example dataset:

d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE", 
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L, 
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5, 
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA, 
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"), 
    source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council", 
    "Eurostat", "OECD"), class = "factor")), .Names = c("cntry", 
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA, 
-19L))


d
#    cntry year indicator  sex   source
# 1     BE 1960       5.5 male Eurostat
# 2     BE 1970       1.2 male Eurostat
# 3     BE 1980       1.5 male Eurostat
# 4     DE 1960        NA male Eurostat
# 5     DE 1970       1.4 male Eurostat
# 6     GE 1960        NA male Eurostat
# 7     GE 1970        NA male Eurostat
# 8     BE 1960       5.5 male     OECD
# 9     BE 1970       1.2 male     OECD
# 10    DE 1960       2.3 male     OECD
# 11    DE 1970       1.4 male     OECD
# 12    GE 1960        NA male     OECD
# 13    GE 1970       1.4 male     OECD
# 14    BE 1960        NA male  Council
# 15    BE 1970        NA male  Council
# 16    DE 1960       2.3 male  Council
# 17    DE 1970       1.4 male  Council
# 18    GE 1970       1.4 male  Council
# 19    GE 1980        NA male  Council

I was hoping I could use cast() with fun.aggregate to convert this long dataset into the wide format, while selecting the highest-quality source (Eurostat > OECD > Council) for a given country-year combination to fill in the missing values. Unfortunately, I do not really understand how to work with such a custom aggregate function.

In other words, I want to reshape the dataset from a long to a wide format while merging multiple values depending on the value of a factor ("source"). Ideally it would work something like this:

full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA  NA   NA
# GE 1960 NA  male Council 
# GE 1970 1.4 male OECD
# GE 1980 NA  male Council

and optionally (or directly) into the wide format:

# cntry  sex 1960 1970 1980
#    BE male  5.5  1.2  1.5
#    DE male  2.3  1.4  NA
#    GE male   NA  1.4  NA

4 Solutions

#1

Assuming that the data is in the order you require, that is, the source column is ordered first by Eurostat, then by OECD and then by Council, I'd go about using data.table in this manner:

require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator", 
 subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))

#    cntry  sex 1960 1970 1980
# 1:    BE male  5.5  1.2  1.5
# 2:    DE male  2.3  1.4   NA
# 3:    GE male   NA  1.4   NA
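
If the rows cannot be assumed to arrive in preference order, or if sources may report different values for the same country and year, the selection can be made explicit. The following is a minimal base R sketch, not part of the answer above: pick_preferred() and the small demo data frame are hypothetical names with made-up values, introduced only for illustration.

```r
# Hypothetical helper: keep, for each country/year/sex, the non-missing value
# from the most preferred source. pref_order lists sources from best to worst.
pick_preferred <- function(d, pref_order) {
  d <- d[!is.na(d$indicator), ]                        # drop rows without a value
  pref <- match(as.character(d$source), pref_order)    # 1 = most preferred
  d <- d[order(d$cntry, d$year, d$sex, pref), ]        # best source first per group
  d <- d[!duplicated(d[c("cntry", "year", "sex")]), ]  # keep first row per group
  rownames(d) <- NULL
  d
}

# Illustrative data (values invented so that the sources disagree for BE 1960).
demo <- data.frame(
  cntry     = c("BE", "BE", "DE", "DE", "GE"),
  year      = c(1960L, 1960L, 1960L, 1960L, 1970L),
  indicator = c(5.5, 9.9, NA, 2.3, 1.4),
  sex       = "male",
  source    = c("Eurostat", "OECD", "Eurostat", "OECD", "Council"),
  stringsAsFactors = FALSE
)

best <- pick_preferred(demo, c("Eurostat", "OECD", "Council"))

# Wide layout: one row per country, one column per year; empty cells are NA.
wide <- tapply(best$indicator, list(best$cntry, best$year), function(x) x[1])
```

Here best keeps the source column, so it stays visible which dataset supplied each value; the final tapply() call is just one way to get a wide matrix without extra packages.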

#2

I am not sure if this meets all of your expectations, but it sounds like you're looking for something like the following:

toMerge <- expand.grid(cntry = c("BE", "DE", "GE"), 
                       year = c(1960, 1970, 1980), 
                       source = c("Eurostat", "OECD", "Council"), 
                       sex = "male")
d2 <- merge(d, toMerge, all = TRUE)

d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
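# Rank rows within each cntry/year/sex group by missingness only, so that the
# first non-NA row (the most preferred source, given the ordering above) gets
# rank 1. Ranking the indicator values themselves would favour the smallest value.
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")], 
                 FUN = function(x) rank(is.na(x), ties.method="first")))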
D <- d2[Rank == 1, ]
D
#    cntry year  sex   source indicator
# 2     BE 1960 male Eurostat       5.5
# 5     BE 1970 male Eurostat       1.2
# 8     BE 1980 male Eurostat       1.5
# 14    DE 1970 male Eurostat       1.4
# 17    DE 1980 male Eurostat        NA
# 20    GE 1960 male Eurostat        NA
# 26    GE 1980 male Eurostat        NA
# 12    DE 1960 male     OECD       2.3
# 24    GE 1970 male     OECD       1.4

library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
#   cntry 1960 1970 1980
# 1    BE  5.5  1.2  1.5
# 2    DE  2.3  1.4   NA
# 3    GE   NA  1.4   NA

#3

Perhaps the following could work as well:

library(reshape2)
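# Melt so that "indicator" becomes the value column, then spread sources into columns.
x <- melt(d, id.vars = c("cntry", "year", "source", "sex"))
y <- dcast(x, cntry + year + sex ~ source, value.var = "value")
# Take Eurostat if present, otherwise OECD, otherwise Council.
y$selected.value <- ifelse(is.na(y$Eurostat),
                           yes = ifelse(is.na(y$OECD), yes = y$Council, no = y$OECD),
                           no = y$Eurostat)
dcast(y, cntry + sex ~ year, value.var = "selected.value")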

The source is selected with a nested ifelse() statement. This approach loses track of which source was selected; if that is an issue, a similar ifelse() statement can create a variable recording the source:

y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
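
With more than three sources the nested ifelse() calls become hard to read. One possible generalization, sketched here in base R, walks the source columns in preference order. The helpers coalesce_cols() and first_source() and the y_demo table are hypothetical, made up for illustration, and are not part of reshape2.

```r
# Hypothetical helpers generalizing the nested ifelse() to any number of sources.
# cols lists the source columns from most to least preferred.
coalesce_cols <- function(df, cols) {
  # Walk the columns left to right, filling NAs from the next column.
  Reduce(function(a, b) ifelse(is.na(a), b, a), df[cols])
}

first_source <- function(df, cols) {
  # Index of the first non-missing column in each row (NA if all are missing).
  idx <- apply(!is.na(df[cols]), 1, function(r) which(r)[1])
  cols[idx]
}

# Illustrative wide-by-source table, shaped like `y` above (invented values).
y_demo <- data.frame(
  Eurostat = c(5.5, NA, NA),
  OECD     = c(5.5, 2.3, NA),
  Council  = c(NA, 2.3, NA)
)
pref <- c("Eurostat", "OECD", "Council")

y_demo$selected.value  <- coalesce_cols(y_demo, pref)
y_demo$selected.source <- first_source(y_demo, pref)
```

Changing the preference order then only requires editing pref, rather than rewriting the nesting.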

#4

Here is another option:

library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
  d2, cntry + sex ~ year, value.var="indicator", 
  fun.aggregate=function(x) {
    if(!length(x)) return(NA_real_)
    xs <- do.call(rbind, x)
    xs <- xs[complete.cases(xs), ]
    if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )

Produces:

  cntry  sex  1960  1970  1980
1    BE male 105.5 101.2 101.5
2    DE male   2.3 101.4    NA
3    GE male    NA   1.4    NA

Note that I added 100 to the "Eurostat" values to make them distinguishable from the others, since in this sample set they are otherwise equal.

Basically, we cheat by turning the indicator column into a column of list items containing both the indicator and the source, and then we use fun.aggregate to pick the item from each group with the lowest source value (note we reset the factors so the most desirable source has the lowest level).
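
The same idea, letting the aggregation function see both the value and its source, can also be sketched in base R by aggregating over row numbers instead of values. This is an illustrative variant rather than the code above: pick_by_source() and the demo data are hypothetical, and the Eurostat value for BE is deliberately different so that the choice is visible.

```r
# Illustrative data; source levels are ordered from most to least preferred.
demo <- data.frame(
  cntry     = c("BE", "BE", "DE", "DE", "GE"),
  year      = c(1960L, 1960L, 1960L, 1970L, 1970L),
  indicator = c(105.5, 5.5, 2.3, NA, 1.4),
  source    = factor(c("Eurostat", "OECD", "OECD", "Eurostat", "Council"),
                     levels = c("Eurostat", "OECD", "Council"))
)

# Hypothetical aggregate: receives the row numbers of one cntry/year group and
# returns the non-missing indicator whose source has the lowest factor level.
pick_by_source <- function(i) {
  rows <- demo[i, ]
  rows <- rows[!is.na(rows$indicator), ]
  if (nrow(rows) == 0L) return(NA_real_)
  rows$indicator[order(as.integer(rows$source))][1]
}

# tapply() groups the row numbers; cells with no rows come back as NA.
wide <- tapply(seq_len(nrow(demo)), list(demo$cntry, demo$year), pick_by_source)
```

Because the function looks rows up by number, no list column is needed, at the cost of referring to demo from inside the aggregate.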
