Take the example:
> set.seed(42)
> ids <- c("u1", "u2", "u3")
> groups <- c(rep("A",3), rep("B",3), rep("C",3))
> reps <- c(rep("r1",9), rep("r2",9), rep("r3",9))
> vals <- rnorm(27, 0, 2)
>
> df = data.frame(ids = rep(ids, 9), groups = rep(groups,3), reps = reps, vals = vals)
> df
ids groups reps vals
1 u1 A r1 2.7419169
2 u2 A r1 -1.1293963
3 u3 A r1 0.7262568
4 u1 B r1 1.2657252
5 u2 B r1 0.8085366
6 u3 B r1 -0.2122490
7 u1 C r1 3.0230440
8 u2 C r1 -0.1893181
9 u3 C r1 4.0368474
10 u1 A r2 -0.1254282
11 u2 A r2 2.6097393
12 u3 A r2 4.5732908
13 u1 B r2 -2.7777214
14 u2 B r2 -0.5575775
15 u3 B r2 -0.2666427
16 u1 C r2 1.2719008
17 u2 C r2 -0.5685058
18 u3 C r2 -5.3129108
19 u1 A r3 -4.8809339
20 u2 A r3 2.6402267
21 u3 A r3 -0.6132772
22 u1 B r3 -3.5626169
23 u2 B r3 -0.3438347
24 u3 B r3 2.4293494
25 u1 C r3 3.7903869
26 u2 C r3 -0.8609383
27 u3 C r3 -0.5145388
What I want to do is, for each id, subtract the mean of that id's values in C.r1, C.r2 and C.r3 from every row belonging to that id. The idea is to use group C as a baseline for the other groups.
So in terms of expected outcome, for the first two rows:
- (u1, A, r1) should be modified as 2.74 - mean(3.02, 1.27, 3.79) = 0.047
- (u2, A, r1) should be modified as -1.13 - mean(-0.19, -0.57, -0.86) = -0.59
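To make the arithmetic explicit, here is a small check using the full-precision values copied from the printed table. The variable names are just for illustration. Note that mean() takes a single vector, so the values have to be wrapped in c().

```r
# Group C values for each id, copied from the printed table above.
u1_baseline <- mean(c(3.0230440, 1.2719008, 3.7903869))     # u1: C.r1, C.r2, C.r3
u2_baseline <- mean(c(-0.1893181, -0.5685058, -0.8609383))  # u2: C.r1, C.r2, C.r3

# Baseline-corrected values for rows 1 and 2 of the table.
u1_A_r1 <- 2.7419169 - u1_baseline
u2_A_r1 <- -1.1293963 - u2_baseline

round(c(u1_A_r1, u2_A_r1), 4)
```

With these inputs the two results come to about 0.047 and -0.59.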
How can I get this to work on all rows in a large (about 1M rows) table that contains a number of other columns besides the relevant ones here? I obviously need to group by ids, but looking up only the values where groups == "C" and taking the mean of vals over them is a bit tricky.
> dt <- setDT(df)
> dt[groups == "C", cmean := mean(vals), ids]
gives me the means of group C measurements for each id (in multiple copies), but I cannot really use those values immediately, since all the other rows are already filtered out. I think I might need to chain somehow but I am not sure how exactly.
I would be equally interested in solutions with data.table and dplyr.
2 Answers
#1
We can do this with a join. Subset the rows where 'groups' is 'C', get the mean of 'vals' grouped by 'ids', then join that summary to the original dataset on 'ids', subtract 'Meanvals' (from the summary) from 'vals' (from the original dataset) and assign (:=) the result to 'newvals'.
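library(data.table)
# Per-id mean of group C, joined back on ids; newvals is added by reference
setDT(df)[df[groups == "C", .(Meanvals = mean(vals)), by = ids],
          newvals := vals - Meanvals, on = .(ids)]
head(df)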
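For comparison, the same result can probably be had without the join by computing the group C mean inside a grouped `:=`. This is only a sketch on the example data from the question (rebuilt here so that it runs on its own). I have not benchmarked it against the join on a 1M-row table.

```r
library(data.table)

# Rebuild the example data from the question.
set.seed(42)
dt <- data.table(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

# Within each id, subtract that id's group C mean from every row of the id.
dt[, newvals := vals - mean(vals[groups == "C"]), by = ids]
head(dt)
```

Any other columns in the table are left untouched, since only 'newvals' is added by reference.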
#2
One possible dplyr solution:
library(dplyr)
df %>%
  group_by(ids) %>%
  mutate(mean = mean(vals[groups == "C"]),  # per-id mean of the group C baseline
         vals = vals - mean) %>%
  select(-mean)
# A tibble: 27 × 4
ids groups reps vals
<fctr> <fctr> <fctr> <dbl>
1 u1 A r1 0.04680632
2 u2 A r1 -0.58980895
3 u3 A r1 1.32312422
4 u1 B r1 -1.42938536
5 u2 B r1 1.34812404
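If you would rather keep the original 'vals' column, a join-based variant in dplyr might look like the sketch below. It mirrors the join idea on the example data from the question. The names 'baseline', 'cmean' and 'newvals' are my own choices, not anything required by dplyr.

```r
library(dplyr)

# Rebuild the example data from the question.
set.seed(42)
df <- data.frame(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

# One row per id holding the mean of its group C values.
baseline <- df %>%
  filter(groups == "C") %>%
  group_by(ids) %>%
  summarise(cmean = mean(vals))

# Attach the baseline to every row of the same id and subtract it.
result <- df %>%
  left_join(baseline, by = "ids") %>%
  mutate(newvals = vals - cmean) %>%
  select(-cmean)

head(result)
```

One difference to be aware of: an id with no group C rows gets NA here (no match in the join), whereas the grouped mutate gives NaN, because the mean of an empty vector is NaN.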