I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality.
Using reshape2, I have "melted" those datasets into a single data frame.
Example dataset:
d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE",
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L,
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L,
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5,
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA,
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"),
source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council",
"Eurostat", "OECD"), class = "factor")), .Names = c("cntry",
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA,
-19L))
d
# cntry year indicator sex source
# 1 BE 1960 5.5 male Eurostat
# 2 BE 1970 1.2 male Eurostat
# 3 BE 1980 1.5 male Eurostat
# 4 DE 1960 NA male Eurostat
# 5 DE 1970 1.4 male Eurostat
# 6 GE 1960 NA male Eurostat
# 7 GE 1970 NA male Eurostat
# 8 BE 1960 5.5 male OECD
# 9 BE 1970 1.2 male OECD
# 10 DE 1960 2.3 male OECD
# 11 DE 1970 1.4 male OECD
# 12 GE 1960 NA male OECD
# 13 GE 1970 1.4 male OECD
# 14 BE 1960 NA male Council
# 15 BE 1970 NA male Council
# 16 DE 1960 2.3 male Council
# 17 DE 1970 1.4 male Council
# 18 GE 1970 1.4 male Council
# 19 GE 1980 NA male Council
I was hoping I could use cast() with fun.aggregate to convert this long dataset into the wide format, while selecting the highest-quality source (Eurostat > OECD > Council) for a given country-year combination to fill in the missing values. Unfortunately, I do not really understand how to work with such a custom aggregate function.
In other words, I want to reshape the dataset from long to wide format while merging multiple values according to the value of a factor ("source"). Ideally it would work something like this:
full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA NA NA
# GE 1960 NA male Council
# GE 1970 1.4 male OECD
# GE 1980 NA male Council
and optionally (or directly) into the wide format:
# cntry sex 1960 1970 1980
# BE male 5.5 1.2 1.5
# DE male 2.3 1.4 NA
# GE male NA 1.4 NA
4 Answers
#1
2
Assuming that the data is in the order you require, that is, column source is ordered first by Eurostat, then by OECD and then by Council, I'd go about using data.table in this manner:
require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator",
subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))
# cntry sex 1960 1970 1980
# 1: BE male 5.5 1.2 1.5
# 2: DE male 2.3 1.4 NA
# 3: GE male NA 1.4 NA
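The approach above relies on the rows already being sorted by source quality, and, as far as I can tell, deduplicating on indicator as well would keep two rows for a country-year whenever two sources report different values. A minimal base R sketch that makes the preference order explicit instead; toy, pref and picked are illustrative names, and the values are made up:

```r
# Toy data in the same long layout as `d` (values are illustrative only).
toy <- data.frame(
  cntry     = c("BE", "BE", "DE", "DE", "DE"),
  year      = c(1960, 1960, 1960, 1960, 1970),
  indicator = c(NA, 5.5, 2.3, 2.9, NA),
  source    = c("Eurostat", "OECD", "Council", "OECD", "Eurostat"),
  stringsAsFactors = FALSE
)

pref <- c("Eurostat", "OECD", "Council")  # best source first

# 1. drop missing values
ok <- toy[!is.na(toy$indicator), ]
# 2. sort by source preference (position of each source in `pref`)
ok <- ok[order(match(ok$source, pref)), ]
# 3. keep the first, i.e. best-ranked, row per country-year
picked <- ok[!duplicated(ok[c("cntry", "year")]), ]
picked <- picked[order(picked$cntry, picked$year), ]
picked
```

Country-years for which every source is NA (DE 1970 here) drop out of picked; they reappear as NA once the result is cast to wide format against a full set of years.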
#2
1
I am not sure if this meets all of your expectations, but it sounds like you're looking for something like the following:
toMerge <- expand.grid(cntry = c("BE", "DE", "GE"),
year = c(1960, 1970, 1980),
source = c("Eurostat", "OECD", "Council"),
sex = "male")
d2 <- merge(d, toMerge, all = TRUE)
d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")],
FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D
# cntry year sex source indicator
# 2 BE 1960 male Eurostat 5.5
# 5 BE 1970 male Eurostat 1.2
# 8 BE 1980 male Eurostat 1.5
# 14 DE 1970 male Eurostat 1.4
# 17 DE 1980 male Eurostat NA
# 20 GE 1960 male Eurostat NA
# 26 GE 1980 male Eurostat NA
# 12 DE 1960 male OECD 2.3
# 24 GE 1970 male OECD 1.4
library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
# cntry 1960 1970 1980
# 1 BE 5.5 1.2 1.5
# 2 DE 2.3 1.4 NA
# 3 GE NA 1.4 NA
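The expand.grid() plus merge() step is what guarantees a row for every country-year combination, including ones that no source reports. A small sketch of just that step, with made-up values; have, grid and full are illustrative names:

```r
# Observations we actually have (illustrative values).
have <- data.frame(
  cntry     = c("BE", "DE"),
  year      = c(1960, 1970),
  indicator = c(5.5, 1.4),
  stringsAsFactors = FALSE
)

# Every country-year combination we want in the final table.
grid <- expand.grid(
  cntry = c("BE", "DE"),
  year  = c(1960, 1970),
  stringsAsFactors = FALSE
)

# Left join on the shared columns (cntry, year); combinations
# without an observation get indicator = NA.
full <- merge(grid, have, all.x = TRUE)
full
```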
#3
1
Perhaps the following could work as well:
library(reshape2)
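x <- melt(d, id.vars = c("cntry", "year", "source", "sex"))
# one column per source, one row per country-year-sex
y <- dcast(x, cntry + year + sex ~ source, value.var = "value")
# take Eurostat if present, else OECD, else Council
y$selected.value <- ifelse(is.na(y$Eurostat), yes = ifelse(is.na(y$OECD), yes = y$Council, no = y$OECD), no = y$Eurostat)
dcast(y, cntry + sex ~ year, value.var = "selected.value")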
The source selection is made using a nested ifelse statement. The information about which source was selected is lost with this approach. If that is an issue, a similar ifelse statement can be added to create a variable recording the source of each value:
y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
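With more than three sources the nested ifelse calls get hard to read. One possible generalisation, sketched here in base R, folds over the source columns in preference order; coalesce_cols is a hypothetical helper and the toy values are made up:

```r
pref <- c("Eurostat", "OECD", "Council")  # best source first

# Wide layout with one column per source (illustrative values).
y <- data.frame(
  cntry    = c("BE", "DE", "GE"),
  Eurostat = c(5.5, NA, NA),
  OECD     = c(5.0, 2.3, NA),
  Council  = c(NA, 2.0, NA)
)

# Hypothetical helper: walk through the columns in `pref` order and
# fill the gaps of the running result with the next column.
coalesce_cols <- function(df, pref) {
  Reduce(function(best, nxt) ifelse(is.na(best), nxt, best), df[pref])
}

y$selected.value <- coalesce_cols(y, pref)

# Name of the first non-missing source per row (NA if all are missing).
y$selected.source <- unname(apply(!is.na(y[pref]), 1, function(r) {
  if (any(r)) pref[which(r)[1]] else NA_character_
}))
y
```

Changing the priority then only requires reordering pref.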
#4
0
Here is another option:
library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
d2, cntry + sex ~ year, value.var="indicator",
fun.aggregate=function(x) {
if(!length(x)) return(NA_real_)
xs <- do.call(rbind, x)
xs <- xs[complete.cases(xs), ]
if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )
Produces:
cntry sex 1960 1970 1980
1 BE male 105.5 101.2 101.5
2 DE male 2.3 101.4 NA
3 GE male NA 1.4 NA
Note that I added 100 to the "Eurostat" values to make them distinguishable from the others, since in this sample set they are otherwise equal.
Basically, we cheat by turning the indicator column into a column of list items containing both the indicator and the source, and then we use fun.aggregate to pick the item from each group with the lowest source value (note we reset the factor levels so the most desirable source has the lowest level).
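The core of the trick is that, once the factor levels are in preference order, as.numeric() on the factor gives a rank. A stripped-down sketch of that selection rule without the list column or dcast; pick_best, toy and best are illustrative names and the values are made up:

```r
pref <- c("Eurostat", "OECD", "Council")  # level 1 = most desirable

toy <- data.frame(
  cntry     = c("BE", "BE", "DE", "DE"),
  year      = c(1960, 1960, 1960, 1960),
  indicator = c(NA, 5.5, 2.3, 2.9),
  source    = factor(c("Eurostat", "OECD", "Council", "OECD"), levels = pref)
)

# Hypothetical helper: among the non-missing values of one group,
# return the one whose source has the lowest factor level.
pick_best <- function(indicator, source) {
  keep <- !is.na(indicator)
  if (!any(keep)) return(NA_real_)
  indicator[keep][which.min(as.numeric(source[keep]))]
}

# Apply the rule to every country-year group.
groups <- split(toy, list(toy$cntry, toy$year), drop = TRUE)
best <- sapply(groups, function(g) pick_best(g$indicator, g$source))
best
```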