Finding the range minimum in an R dataframe

Date: 2022-06-12 16:19:08

I'm trying to calculate the range minimum of a dataframe in R. The dataframe looks like this:

+-----+--------------+-----------+------+------+
| Key | DaysToEvent  | PriceEUR  | Pmin | Pmax |
+-----+--------------+-----------+------+------+
| AAA | 120          |        50 |   50 |   50 |
| AAA | 110          |        40 |   40 |   50 |
| AAA | 100          |        60 |   40 |   60 |
| BBB | ...          |           |      |      |
+-----+--------------+-----------+------+------+

So the range minimum price (Pmin) holds the minimum price seen for that key up to that point in time (DaysToEvent), and Pmax holds the corresponding maximum.

Here's my implementation:

# Assumed initialisation (not shown in the original post): no key seen yet
currentKey <- ""
for (i in 1:nrow(data)){
  currentRecord <- data[i,]

  if(currentRecord$Key != currentKey) {
    # New key detected - reset pmin and pmax
    pmin <- 100000
    pmax <- 0
    currentKey <- currentRecord$Key
  }

  if(currentRecord$PriceEUR < pmin) {
    pmin <- currentRecord$PriceEUR
  }
  if(currentRecord$PriceEUR > pmax) {
    pmax <- currentRecord$PriceEUR
  }

  currentRecord$Pmin <- pmin
  currentRecord$Pmax <- pmax

  # This line seems to be killing my performance
  # but otherwise the data variable is not updated in
  # global space
  data[i,] <- currentRecord
}

This works, but it is REALLY slow: only a couple of rows per second. It works because I've sorted the data frame first with data = data[order(data$Key, -data$DaysToEvent), ]. I did this hoping to get O(n log n) for the sort and O(n) for the for loop. So I thought I'd be flying through this data, but I'm not AT ALL; it takes hours.

How can I make this faster?

The previous approach is from my colleague; here it is in pseudocode:

for (i in 1:nrow(data)) {
    ...
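    # Rescan every row with the same key that lies earlier in time
    # (larger DaysToEvent) and take the minimum price among them
    earlier <- data$Key == currentRecord$Key &
               data$DaysToEvent > currentRecord$DaysToEvent
    currentRecord$Pmin <- min(data$PriceEUR[earlier])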
    ...
}

This also works, but I think the order of this function is way higher: n^2 log(n) if I'm correct, and it takes days. So I thought I was going to improve on that big time.

So I've tried to get my head around all kinds of *apply and by functions, and of course that's what you really want to use.

If I use by() and split on the key, that gets me pretty close. However, I cannot figure out how I would get the range minimum / maximum. I'm trying to think in a functional paradigm but I'm stuck. Any help is appreciated.

1 solution

#1


[Original answer: dplyr]

You can solve this problem by using the dplyr package:

library(dplyr)
d %>% 
  group_by(Key) %>% 
  mutate(Pmin=cummin(PriceEUR),Pmax=cummax(PriceEUR))

#   Key DaysToEvent PriceEUR Pmin Pmax
# 1 AAA         120       50   50   50
# 2 AAA         110       40   40   50
# 3 AAA         100       60   40   60
# 4 BBB         100       50   50   50

where d is assumed to be your data set, already sorted by Key and descending DaysToEvent:

d <- data.frame(Key=c('AAA','AAA','AAA','BBB'),DaysToEvent = c(120,110,100,100),PriceEUR = c(50,40,60,50), Pmin = c(50,40,40,30), Pmax = c(50,50,60,70))
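
As a quick sanity check of why this works: cummin() and cummax() return the running minimum and maximum of a vector, which is the "range minimum / maximum" the question asks for. Here is a minimal base R sketch on the AAA prices from the example table (the variable name prices is only for illustration):

```r
# Prices for key AAA, ordered from furthest to closest to the event
prices <- c(50, 40, 60)

# Running minimum / maximum up to and including each row
running_min <- cummin(prices)
running_max <- cummax(prices)

print(running_min)
print(running_max)
```

The result depends on row order, so the data should be sorted by Key and descending DaysToEvent first, as the question already does.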

[Update: data.table]

Another approach is to use data.table, which offers quite spectacular performance on large data sets:

library(data.table)
DT <- setDT(d)
DT[,c("Pmin","Pmax") := list(cummin(PriceEUR),cummax(PriceEUR)),by=Key]

DT
#    Key DaysToEvent PriceEUR Pmin Pmax
# 1: AAA         120       50   50   50
# 2: AAA         110       40   40   50
# 3: AAA         100       60   40   60
# 4: BBB         100       50   50   50

[Update 2: base R]

Here is another approach in case you'd like to use only base R for some reason:

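# Assumes d is sorted by Key: split() returns the groups in key order,
# so unlist() only lines up with the rows if they are grouped the same way
d$Pmin <- unlist(lapply(split(d$PriceEUR,d$Key),cummin))
d$Pmax <- unlist(lapply(split(d$PriceEUR,d$Key),cummax))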
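
If the rows are not already grouped by Key, base R's ave() may be a safer variant of the same idea, since it writes each group's result back to the original row positions. The sketch below uses a made-up data frame d2 (not from the question) with the keys deliberately interleaved:

```r
# Hypothetical data with interleaved keys, for illustration only
d2 <- data.frame(
  Key         = c("AAA", "BBB", "AAA", "BBB", "AAA"),
  DaysToEvent = c(120, 100, 110, 90, 100),
  PriceEUR    = c(50, 50, 40, 70, 60)
)

# ave() applies FUN within each Key and keeps the original row order
d2$Pmin <- ave(d2$PriceEUR, d2$Key, FUN = cummin)
d2$Pmax <- ave(d2$PriceEUR, d2$Key, FUN = cummax)

print(d2)
```

Within each key the rows still need to be in descending DaysToEvent order for the running values to mean "up to that point in time".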
