I'm doing element-wise calculations on two data frames, but only where the same id exists in both sets of data.
The current method I'm using is to subset both data frames where the same ids exist, then sort the data by id, then do the calculation:
## Example data
id <- c('a','b','c','d','e')
v1 <- c(10, 20, 30,20,40)
v2 <- c(20,30,20,20,40)
df1 <- data.frame(id, v1, v2, stringsAsFactors=FALSE)
id <- c('a','c','d','b','f')
v1 <- c(20,60,30,10,20)
v2 <- c(60,20,50,10,20)
df2 <- data.frame(id, v1, v2, stringsAsFactors=FALSE)
## subset both data frames by ids that exist in both
df1_subset <- df1[df1$id %in% df2$id,]
df2_subset <- df2[df2$id %in% df1$id,]
id <- df1_subset$id
## arrange by id value
library(dplyr)
df1_sorted <- df1_subset %>% arrange(id)
df2_sorted <- df2_subset %>% arrange(id)
## find the difference between each value
df_result <- cbind(id, df2_sorted[,2:3] - df1_sorted[,2:3])
Is there a 'better' way of doing this calculation where the data doesn't need to be subset and sorted, and uses the id value directly to validate/ensure the calculation is being performed on the correct row & column of data?
3 solutions
#1
library(dplyr)
inner_join(df1, df2, by="id") %>%
mutate(v1=v1.y-v1.x, v2=v2.y-v2.x) %>%
select(id, v1, v2)
# id v1 v2
#1 a 10 40
#2 b -10 -20
#3 c 30 0
#4 d 10 30
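An inner join silently drops ids that appear in only one data frame. If you want to see which rows were excluded before trusting the result, one option is dplyr's anti_join(). This is a minimal sketch using the example data from the question; the names only_in_df1 and only_in_df2 are made up for illustration.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(id = c('a','b','c','d','e'),
                  v1 = c(10, 20, 30, 20, 40),
                  v2 = c(20, 30, 20, 20, 40),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c('a','c','d','b','f'),
                  v1 = c(20, 60, 30, 10, 20),
                  v2 = c(60, 20, 50, 10, 20),
                  stringsAsFactors = FALSE)

# Rows of df1 whose id has no match in df2, and vice versa
only_in_df1 <- anti_join(df1, df2, by = "id")
only_in_df2 <- anti_join(df2, df1, by = "id")

print(only_in_df1$id)  # "e"
print(only_in_df2$id)  # "f"
```

If both results are empty, every id took part in the calculation.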
#2
You can use merge() and then a single transform() to do what you need:
#merge will find the common ids between the dataframes
a <- merge(df1,df2, by='id')
#transform will add the two columns you need (subtracting one from the other)
a <- transform(a, v1 = v1.y - v1.x, v2 = v2.y - v2.x)
Output:
> a
id v1.x v2.x v1.y v2.y v1 v2
1 a 10 20 20 60 10 40
2 b 20 30 10 10 -10 -20
3 c 30 20 60 20 30 0
4 d 20 20 30 50 10 30
The v1 and v2 columns are the same as in your df_result:
> df_result
id v1 v2
1 a 10 40
2 b -10 -20
3 c 30 0
4 d 10 30
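With more than a couple of value columns, writing out each subtraction by hand gets tedious. The sketch below is one possible way to loop over the shared column names after merge(), assuming the default ".x" and ".y" suffixes and that every non-id column is numeric. The names value_cols, merged and result are illustrative only.

```r
# Example data from the question
df1 <- data.frame(id = c('a','b','c','d','e'),
                  v1 = c(10, 20, 30, 20, 40),
                  v2 = c(20, 30, 20, 20, 40),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c('a','c','d','b','f'),
                  v1 = c(20, 60, 30, 10, 20),
                  v2 = c(60, 20, 50, 10, 20),
                  stringsAsFactors = FALSE)

# Keep only ids present in both data frames
merged <- merge(df1, df2, by = "id")

# Value columns shared by both data frames (everything except the key)
value_cols <- setdiff(intersect(names(df1), names(df2)), "id")

# Start from the id column and add one difference column per value column
result <- merged["id"]
for (col in value_cols) {
  # ".y" columns come from df2, ".x" columns from df1
  result[[col]] <- merged[[paste0(col, ".y")]] - merged[[paste0(col, ".x")]]
}

print(result)
```

Because each difference is computed within a single merged row, there is no separate sort step that could put the two data frames out of alignment.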
#3
First, you can easily join these data frames on id with merge() (in base R):
df_merged = merge(df1,df2, by='id')
which gives you the following new column names:
names(df_merged)
# [1] "id" "v1.x" "v2.x" "v1.y" "v2.y"
because merge() by default adds suffixes to colliding column names.
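If ".x" and ".y" are hard to keep straight, merge() accepts a suffixes argument so the column names say which data frame they came from. A small sketch with the example data from the question; the suffix strings ".df1" and ".df2" are an arbitrary choice.

```r
# Example data from the question
df1 <- data.frame(id = c('a','b','c','d','e'),
                  v1 = c(10, 20, 30, 20, 40),
                  v2 = c(20, 30, 20, 20, 40),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c('a','c','d','b','f'),
                  v1 = c(20, 60, 30, 10, 20),
                  v2 = c(60, 20, 50, 10, 20),
                  stringsAsFactors = FALSE)

# Label colliding columns by their source instead of .x / .y
df_merged <- merge(df1, df2, by = "id", suffixes = c(".df1", ".df2"))

names(df_merged)
# [1] "id"     "v1.df1" "v2.df1" "v1.df2" "v2.df2"
```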
Then consider this combination to get your result:
df_result = with(df_merged, data.frame(id, result1 = v1.y - v1.x, result2 = v2.y - v2.x))
with() adds readability. There are many ways to do this, and lots of nice libraries like plyr and sqldf make it easy. I look forward to seeing a more R-like way in the other answers.
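As one more of those many ways, base R's match() looks up each id directly, which is close to what the question asks for: no sorting, and no merged intermediate table. This is only a sketch, assuming ids are unique within df2 (match() returns the first hit only). The names idx and keep are illustrative.

```r
# Example data from the question
df1 <- data.frame(id = c('a','b','c','d','e'),
                  v1 = c(10, 20, 30, 20, 40),
                  v2 = c(20, 30, 20, 20, 40),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c('a','c','d','b','f'),
                  v1 = c(20, 60, 30, 10, 20),
                  v2 = c(60, 20, 50, 10, 20),
                  stringsAsFactors = FALSE)

# Guard the assumption: match() only finds the first occurrence of an id
stopifnot(anyDuplicated(df2$id) == 0)

# For each id in df1, the row position of the same id in df2 (NA if absent)
idx  <- match(df1$id, df2$id)
keep <- !is.na(idx)

# Subtract df1 values from the df2 values found at the matched positions
df_result <- data.frame(id = df1$id[keep],
                        v1 = df2$v1[idx[keep]] - df1$v1[keep],
                        v2 = df2$v2[idx[keep]] - df1$v2[keep],
                        stringsAsFactors = FALSE)

print(df_result)
```

The rows come out in df1's order, and ids missing from df2 (here "e") are dropped by the keep filter.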