按组查找特定行之间的差异

时间:2021-04-01 15:23:07

Within a group, I want to find the difference between that row and the first time that user appeared in the data. For example, I need to create the diff variable below. Users have different number of rows each as in the following data:

在一个组中,我想找到该行与用户第一次出现在数据中的区别。例如,我需要在下面创建diff变量。用户具有不同的行数,如以下数据中所示:

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), 
    money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L, 
    2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L), diff = c(NA, 3L, 4L, 
    6L, NA, 2L, 3L, NA, NA, 8L)), .Names = c("ID", "money", "occurence", 
"diff"), class = "data.frame", row.names = c(NA, -10L))

   ID money occurence diff
1   1     9         1   NA
2   1    12         2    3
3   1    13         3    4
4   1    15         4    6
5   2     5         1   NA
6   2     7         2    2
7   2     8         3    3
8   3     5         1   NA
9   4     2         1   NA
10  4    10         2    8

2 个解决方案

#1


3  

You can use ave(). We just remove the first value per group and replace it with NA, and subtract the first value from the rest of the values.

你可以使用ave()。我们只删除每组的第一个值并将其替换为NA,并从其余值中减去第一个值。

with(df, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
# [1] NA  3  4  6 NA  2  3 NA NA  8

#2


1  

A solution, which uses the first function to get the first value and calculate the difference.

一个dplyr解决方案,它使用第一个函数来获取第一个值并计算差值。

library(dplyr)

df2 <- df %>%
  group_by(ID) %>%
  mutate(diff = money - first(money)) %>%
  mutate(diff = replace(diff, diff == 0, NA)) %>%
  ungroup()
df2
# # A tibble: 10 x 4
#       ID money occurence  diff
#    <int> <int>     <int> <int>
#  1     1     9         1    NA
#  2     1    12         2     3
#  3     1    13         3     4
#  4     1    15         4     6
#  5     2     5         1    NA
#  6     2     7         2     2
#  7     2     8         3     3
#  8     3     5         1    NA
#  9     4     2         1    NA
# 10     4    10         2     8

Update

Here is a solution provided by Sotos. Notice that no need to replace 0 with NA.

这是Sotos提供的data.table解决方案。请注意,无需用NA替换0。

library(data.table)

setDT(df)[, money := money - first(money), by = ID][]
#     ID money occurence diff
#  1:  1     0         1   NA
#  2:  1     3         2    3
#  3:  1     4         3    4
#  4:  1     6         4    6
#  5:  2     0         1   NA
#  6:  2     2         2    2
#  7:  2     3         3    3
#  8:  3     0         1   NA
#  9:  4     0         1   NA
# 10:  4     8         2    8

DATA

dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), 
    money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L, 
    2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "money", 
"occurence"), row.names = c(NA, -10L), class = "data.frame")

#1


3  

You can use ave(). We just remove the first value per group and replace it with NA, and subtract the first value from the rest of the values.

你可以使用ave()。我们只删除每组的第一个值并将其替换为NA,并从其余值中减去第一个值。

with(df, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
# [1] NA  3  4  6 NA  2  3 NA NA  8

#2


1  

A solution, which uses the first function to get the first value and calculate the difference.

一个dplyr解决方案,它使用第一个函数来获取第一个值并计算差值。

library(dplyr)

df2 <- df %>%
  group_by(ID) %>%
  mutate(diff = money - first(money)) %>%
  mutate(diff = replace(diff, diff == 0, NA)) %>%
  ungroup()
df2
# # A tibble: 10 x 4
#       ID money occurence  diff
#    <int> <int>     <int> <int>
#  1     1     9         1    NA
#  2     1    12         2     3
#  3     1    13         3     4
#  4     1    15         4     6
#  5     2     5         1    NA
#  6     2     7         2     2
#  7     2     8         3     3
#  8     3     5         1    NA
#  9     4     2         1    NA
# 10     4    10         2     8

Update

Here is a solution provided by Sotos. Notice that no need to replace 0 with NA.

这是Sotos提供的data.table解决方案。请注意,无需用NA替换0。

library(data.table)

setDT(df)[, money := money - first(money), by = ID][]
#     ID money occurence diff
#  1:  1     0         1   NA
#  2:  1     3         2    3
#  3:  1     4         3    4
#  4:  1     6         4    6
#  5:  2     0         1   NA
#  6:  2     2         2    2
#  7:  2     3         3    3
#  8:  3     0         1   NA
#  9:  4     0         1   NA
# 10:  4     8         2    8

DATA

dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), 
    money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L, 
    2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "money", 
"occurence"), row.names = c(NA, -10L), class = "data.frame")