Simulate a conditional self join to impute NAs in a data frame

I have a data frame that looks like this:

d <- data.frame(Vessel = c("Hondo", "Whamo", "Hondo", "Delta", "Whamo", "Hondo"),
                PAX = c(250, 252, 249, 353, 252, 250),
                crew = c(35, 63, 36, NA, NA, NA))

I would like to impute the NAs using something like a conditional self join: if there is another row in the frame with the same Vessel, the missing crew value is taken from that row. If there are multiple matching rows, it can sample the crew value or pick the max/min; it won't matter, as crew values don't change dramatically. If there is no matching record, crew is set to round(0.25 * PAX). I have a feeling ddply would be the way to go here, and I apologize for not being able to figure this out on my own; I'm having trouble getting anywhere with this. I would like the final data.frame to look like this:

Vessel     PAX     crew
Hondo      250       35
Whamo      252       63
Hondo      249       36
Delta      353       88
Whamo      252       63
Hondo      250       35

Note: PAX and CREW values may vary (CREW varies very little) so the last "Hondo" CREW value could be 35, 36, or something close (but it should be based on the lookup and not the calculation).

Thanks in advance, --JT

2 Solutions

#1

Here's a solution using base R:

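# Look up the mean crew per Vessel/PAX pair, left-join it back onto d,
# then fall back to round(0.25 * PAX) where no lookup value exists.
transform(merge(d, aggregate(crew ~ ., d, mean), by=1:2, all.x=TRUE, sort=FALSE),
          crew=ifelse(!is.na(crew.x), crew.x,
                      ifelse(!is.na(crew.y), crew.y, round(0.25 * PAX))))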

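Note that mean is used to get a unique value for each Vessel/PAX pair. This could just as easily be function(x) head(x, 1) or whatever. Because the formula crew ~ . groups by both Vessel and PAX, a missing crew is only filled from rows that share the same Vessel and the same PAX.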
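
If the lookup should depend on Vessel alone, as the question describes, one possible variant is sketched below. It swaps merge for a match()-based lookup so the original row order is kept, and it assumes that the first known crew value per vessel is an acceptable pick. The names lookup, from_lookup and filled are illustrative only.

```r
d <- data.frame(Vessel = c("Hondo", "Whamo", "Hondo", "Delta", "Whamo", "Hondo"),
                PAX = c(250, 252, 249, 353, 252, 250),
                crew = c(35, 63, 36, NA, NA, NA))

# Lookup table with one crew value per Vessel. The formula method of
# aggregate() omits rows with NA by default, so Delta gets no entry.
lookup <- aggregate(crew ~ Vessel, d, function(x) head(x, 1))

# match() yields NA for vessels that are absent from the lookup (Delta here)
from_lookup <- lookup$crew[match(d$Vessel, lookup$Vessel)]

# Keep known values, then use the lookup, then fall back to 25% of PAX
filled <- d
filled$crew <- ifelse(!is.na(d$crew), d$crew,
                      ifelse(!is.na(from_lookup), from_lookup,
                             round(0.25 * d$PAX)))
filled
```

Replacing head(x, 1) with mean, max, or a sampling function changes only which known value is chosen per vessel.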

#2

Thanks to Joran's answer to my poorly worded question, I have a solution, albeit an ugly one...

library(plyr)
d <- data.frame(Vessel = c("Hondo", "Whamo", "Hondo", "Delta", "Whamo", "Hondo"),
                PAX = c(250, 252, 249, 353, 252, 250),
                crew = c(35, 63, 36, NA, NA, NA))
# One randomly sampled complete row per Vessel serves as the lookup table
crewlookup <- ddply(subset(d, !is.na(crew)), .(Vessel),
                    function(x) {
                      x[sample(nrow(x), size = 1), ]
                    })
# Left join on Vessel; the lookup's PAX and crew columns become PAXl and crewl
d2 <- join(d, crewlookup, by = "Vessel")
colnames(d2) <- c("Vessel", "PAX", "crew", "PAXl", "crewl")
# Fill missing crew from the lookup first, then fall back to 25% of PAX
d2$crew <- ifelse(is.na(d2$crew), d2$crewl, d2$crew)
d2 <- within(d2, crew[is.na(crew)] <- round(.25 * PAX[is.na(crew)]))
d <- subset(d2, select = c("Vessel", "PAX", "crew"))

Anything more elegant would be appreciated.
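
One possibly tidier route in base R is a grouped fill with ave(), followed by the PAX-based fallback. This is only a sketch: it assumes the first known crew value per vessel is good enough (rather than a random sample), and the names crew_filled and result are made up for illustration.

```r
d <- data.frame(Vessel = c("Hondo", "Whamo", "Hondo", "Delta", "Whamo", "Hondo"),
                PAX = c(250, 252, 249, 353, 252, 250),
                crew = c(35, 63, 36, NA, NA, NA))

# Step 1: within each Vessel, replace NA crew with the first known crew.
# Groups with no known value (Delta) are returned unchanged, i.e. still NA.
crew_filled <- ave(d$crew, d$Vessel, FUN = function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
})

# Step 2: anything still missing falls back to 25% of PAX, rounded
still_missing <- is.na(crew_filled)
crew_filled[still_missing] <- round(0.25 * d$PAX[still_missing])

result <- transform(d, crew = crew_filled)
result
```

Since ave() returns values in the original row order, no join or column renaming is needed.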
