Select a value based on the highest value in another column

I don't understand why I can't find a solution for this, since it feels like a pretty basic question, so I need to ask for help. I want to rearrange the airquality dataset by month, keeping the maximum temp value for each month. In addition, I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?

I have tried the following without success:

require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))

dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)

I am after something like this:

month day temp
5     7    89
...

4 answers

#1

There was quite a discussion a while back about whether being lazy is good or not. Anyway, this is short and natural to write and read (and it is fast for large data, so you won't need to change or optimize it later):

require(data.table)
DT=as.data.table(airquality)

DT[,.SD[which.max(Temp)],by=Month]

     Month Ozone Solar.R Wind Temp Day
[1,]     5    45     252 14.9   81  29
[2,]     6    NA     259 10.9   93  11
[3,]     7    97     267  6.3   92   8
[4,]     8    76     203  9.7   97  28
[5,]     9    73     183  2.8   93   3

.SD is the subset of the data for each group, and, if I understand correctly, you just want the row from it with the largest Temp. If you need the row number, that can be added.
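
One way the row number could be added is through data.table's `.I` symbol, which holds the row positions of the full table for the current group. This is only a sketch: the names `rows` and `result` are illustrative and not part of the original answer, and it reads `datasets::airquality` directly in case the column names were lowercased earlier in the session.

```r
library(data.table)

# Fresh copy of the built-in data set with its original column names
DT <- as.data.table(datasets::airquality)

# .I[which.max(Temp)] is the position, in the full table, of the first
# row holding each month's maximum Temp; $V1 extracts those positions
rows <- DT[, .I[which.max(Temp)], by = Month]$V1

# Pull those rows and keep the original row number next to them
result <- DT[rows, .(row = rows, Month, Day, Temp)]
print(result)
```

Because `which.max` returns only the first maximum, tied days are dropped here, just as in the `.SD[which.max(Temp)]` version.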

Or, to get all the rows where the max is tied:

DT[,.SD[Temp==max(Temp)],by=Month]

     Month Ozone Solar.R Wind Temp Day
[1,]     5    45     252 14.9   81  29
[2,]     6    NA     259 10.9   93  11
[3,]     7    97     267  6.3   92   8
[4,]     7    97     272  5.7   92   9
[5,]     8    76     203  9.7   97  28
[6,]     9    73     183  2.8   93   3
[7,]     9    91     189  4.6   93   4

#2

Another approach, with plyr:

require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')

library(plyr)

ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)

Gives

  month day temp
1     5  29   81
2     6  11   93
3     7   8   92
4     7   9   92
5     8  28   97
6     9   3   93
7     9   4   93

Or, even simpler:

require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset, 
  subset = temp == max(temp), select = c(month, day, temp) )

#3

How about with plyr?

library(plyr)

# For one month's rows, return every day on which the maximum Temp occurs
max.func <- function(df) {
   max.temp <- max(df$Temp)

   return(data.frame(day = df$Day[df$Temp == max.temp],
                     temp = max.temp))
}

ddply(airquality, .(Month), max.func)

As you can see, the max temperature for a month can happen on more than one day, in which case one row is returned per tied day. If you want different behavior, the function is easy enough to adjust.
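
As an example of such an adjustment, the sketch below keeps only the first day on which each monthly maximum occurs. The function name `first.max.func` and the variable `first_max` are hypothetical, and the code assumes the original column names (`Month`, `Day`, `Temp`), so it reads `datasets::airquality` directly.

```r
library(plyr)

# Hypothetical variant of max.func: one row per month, using the first
# day on which that month's maximum temperature occurs
first.max.func <- function(df) {
  i <- which.max(df$Temp)   # index of the first maximum within this month
  data.frame(day = df$Day[i], temp = df$Temp[i])
}

first_max <- ddply(datasets::airquality, .(Month), first.max.func)
print(first_max)
```

Swapping `which.max(df$Temp)` for `max(which(df$Temp == max(df$Temp)))` would pick the last tied day instead.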

#4

Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):

library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]

If you want to know what the .SD stands for, have a look here: SO
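
For a quick feel of it: `.SD` is the "Subset of Data" for the current group, and by default it leaves out the grouping column itself. The sketch below, with the illustrative name `sd_info`, simply reports how many rows and which columns `.SD` contains for each month.

```r
library(data.table)

DT <- as.data.table(datasets::airquality)

# For each Month, describe the .SD that j sees: its row count and
# its column names (the grouping column Month is not among them)
sd_info <- DT[, .(n_rows = nrow(.SD),
                  cols = paste(names(.SD), collapse = ", ")),
              by = Month]
print(sd_info)
```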
