How do I update values in a column based on values in different rows of the same column?

Time: 2021-03-10 09:16:07

Take this example:

> set.seed(42)
> ids <- c("u1", "u2", "u3")
> groups <- c(rep("A",3), rep("B",3), rep("C",3))
> reps <- c(rep("r1",9), rep("r2",9), rep("r3",9))
> vals <- rnorm(27, 0, 2)
> 
> df = data.frame(ids = rep(ids, 9), groups = rep(groups,3), reps = reps, vals = vals)
> df
   ids groups reps       vals
1   u1      A   r1  2.7419169
2   u2      A   r1 -1.1293963
3   u3      A   r1  0.7262568
4   u1      B   r1  1.2657252
5   u2      B   r1  0.8085366
6   u3      B   r1 -0.2122490
7   u1      C   r1  3.0230440
8   u2      C   r1 -0.1893181
9   u3      C   r1  4.0368474
10  u1      A   r2 -0.1254282
11  u2      A   r2  2.6097393
12  u3      A   r2  4.5732908
13  u1      B   r2 -2.7777214
14  u2      B   r2 -0.5575775
15  u3      B   r2 -0.2666427
16  u1      C   r2  1.2719008
17  u2      C   r2 -0.5685058
18  u3      C   r2 -5.3129108
19  u1      A   r3 -4.8809339
20  u2      A   r3  2.6402267
21  u3      A   r3 -0.6132772
22  u1      B   r3 -3.5626169
23  u2      B   r3 -0.3438347
24  u3      B   r3  2.4293494
25  u1      C   r3  3.7903869
26  u2      C   r3 -0.8609383
27  u3      C   r3 -0.5145388

What I want to do is subtract, for each id, the mean of that id's group C values (C.r1, C.r2 and C.r3) from every value belonging to that id. The idea is to use group C as a baseline for the other groups.

So in terms of expected outcome, for the first two rows:

  • (u1, A, r1) should be modified as 2.74 - mean(3.02, 1.27, 3.79) = 0.047

  • (u2, A, r1) should be modified as -1.13 - mean(-0.19, -0.57, -0.86) = -0.59
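
The expected values for the first two rows can be checked directly from the printed data. This is a minimal sketch in base R, using values copied from rows 1, 2, 7, 8, 16, 17, 25 and 26 of the output above. The variable names are illustrative only.

```r
# Values copied from the printed data frame (not recomputed)
u1_A_r1 <- 2.7419169
u1_C    <- c(3.0230440, 1.2719008, 3.7903869)     # (u1, C, r1..r3)
u2_A_r1 <- -1.1293963
u2_C    <- c(-0.1893181, -0.5685058, -0.8609383)  # (u2, C, r1..r3)

# mean() expects a single vector, so the three values are wrapped in c();
# calling mean(3.02, 1.27, 3.79) would not average the three numbers
adj_u1 <- u1_A_r1 - mean(u1_C)
adj_u2 <- u2_A_r1 - mean(u2_C)

round(c(adj_u1, adj_u2), 3)
```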

How can I get this to work on all rows of a large table (about 1M rows) that contains a number of other columns besides the relevant ones here? I obviously need to group by ids, but looking up the values where groups == "C" and taking the mean of vals for them is a bit tricky.

> dt <- setDT(df)
> dt[groups == "C", cmean := mean(vals), ids]

gives me the mean of the group C measurements for each id (repeated on every group C row), but I cannot use those values directly, because the assignment only touches the group C rows and all the other rows are left out. I think I might need to chain somehow, but I am not sure exactly how.
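
To make the problem concrete, here is a hedged sketch of what the attempt leaves behind and one possible way to chain from it. It rebuilds the example data with data.table directly. `n_missing` and `adjusted` are hypothetical names introduced only for this illustration.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(42)
dt <- data.table(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

# The attempt from the question: cmean is assigned on group C rows only,
# so rows in groups A and B are left as NA
dt[groups == "C", cmean := mean(vals), by = ids]
n_missing <- sum(is.na(dt$cmean))

# One way to chain: spread each id's baseline to all of its rows,
# then subtract it in a second step
dt[, cmean := mean(cmean, na.rm = TRUE), by = ids][, adjusted := vals - cmean]
head(dt)
```

The two-step chain is shown only because it continues from the original attempt. A single grouped or joined update avoids the intermediate NA column.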

I would be equally interested in solutions with data.table and dplyr.

2 solutions

#1

We can do this with a join. Subset the rows where 'groups' is 'C', compute the mean of 'vals' grouped by 'ids', then join that summary back to the original dataset on 'ids', subtract 'Meanvals' (from the summary) from 'vals' (from the original dataset), and assign (:=) the result to 'newvals'.

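library(data.table)
# Per-id mean of the group C values, joined back onto df by 'ids';
# newvals is added by reference, so df itself is modified
setDT(df)[df[groups == "C", .(Meanvals = mean(vals)), by = ids],
          newvals := vals - Meanvals, on = .(ids)]
head(df)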
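
The same result can probably also be reached without a join, by grouping on ids and indexing vals inside each group. The sketch below rebuilds the example data, applies the join-based update from this answer, and compares it with the grouped form. `newvals_grouped` is a hypothetical column name used only for the comparison.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(42)
df <- data.frame(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)
setDT(df)

# Join-based update, as in the answer
df[df[groups == "C", .(Meanvals = mean(vals)), by = ids],
   newvals := vals - Meanvals, on = .(ids)]

# Grouped alternative: within each id, subtract the mean of that id's group C rows
df[, newvals_grouped := vals - mean(vals[groups == "C"]), by = ids]

# Compare the two columns
all.equal(df$newvals, df$newvals_grouped)
```

The two forms differ for an id that has no group C rows: the join leaves newvals as NA for that id, while the grouped form yields NaN because it takes the mean of an empty vector.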

#2

One possible dplyr solution:

library(dplyr)
df %>% group_by(ids) %>%
  # subtract each id's group C mean from all of that id's values
  mutate(vals = vals - mean(vals[groups == "C"])) %>%
  ungroup()

# A tibble: 27 × 4
      ids groups   reps        vals
   <fctr> <fctr> <fctr>       <dbl>
1      u1      A     r1  0.04680632
2      u2      A     r1 -0.58980895
3      u3      A     r1  1.32312422
4      u1      B     r1 -1.42938536
5      u2      B     r1  1.34812404
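
For comparison with the join-based data.table answer, the same idea can be sketched in dplyr as a summary followed by a join. This is an illustration under assumptions, not part of the original answer. `baseline`, `cmean` and `res` are hypothetical names, and the data is rebuilt so the block stands alone.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(42)
df <- data.frame(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

# Per-id baseline: mean of the group C values
baseline <- df %>%
  filter(groups == "C") %>%
  group_by(ids) %>%
  summarise(cmean = mean(vals))

# Attach the baseline to every row of the same id and subtract it
res <- df %>%
  left_join(baseline, by = "ids") %>%
  mutate(vals = vals - cmean) %>%
  select(-cmean)

head(res, 5)
```

Keeping the baseline as a separate small table can be handy when the same per-id means are needed again later.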
