Left join in R (dplyr) - too many observations?

Time: 2022-11-10 22:48:29

I'm using dplyr's left_join function to match two data frames.

I have a panel data set A which consists of 4708 rows and two columns, ID and Name:

ID Name
1  Option1
1  Option2
1  Option3
2  Option2
2  Option3
3  Option1
3  Option4

My data set B holds a single definition and category for each value of the Name column (86 rows):

Name        Definition  Category
Option1     Def1         1
Option2     Def2         1
Option3     Def2         2
Option4     Def3         2

So in the end I need the following data set C, which attaches the columns of B to A:

ID Name      Definition   Category
1  Option1   Def1         1
1  Option2   Def2         1
1  Option3   Def2         2
2  Option2   Def2         1
2  Option3   Def2         2
3  Option1   Def1         1
3  Option4   Def3         2

I used the left_join command in dplyr to do this:

C <- left_join(A, B, by = "Name")

However, for some reason I got 5355 rows instead of the original 4708, so some rows were added. My understanding was that left_join simply assigns the definitions and categories of B to data set A.

Why do I get more rows? Or are there any other ways to get the desired data frame C?

2 answers

#1


With left_join(A, B), new rows are added wherever several rows in B match the same single row in A on the key columns (by default, the columns that have the same name in both data frames). For example:

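library(dplyr)

# df1 has one row per key; df2 repeats the keys "A" and "B"
df1 <- data.frame(col1 = LETTERS[1:4],
                  col2 = 1:4)
df2 <- data.frame(col1 = rep(LETTERS[1:2], 2),
                  col3 = 4:1)

# Joins on col1, the only column name the two data frames share
left_join(df1, df2)  # has 6 rows rather than 4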
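
The same effect can be reproduced with the column layout from the question. The sketch below uses made-up data: the repeated Option2 row in B is an assumption for illustration, not something known about the asker's real table. It also assumes dplyr 1.1.0 or later for the `relationship` argument, which makes the join stop with an error instead of silently adding rows; those versions should additionally warn about a many-to-many relationship on the plain join.

```r
library(dplyr)

# Toy stand-ins for the question's A and B (values are illustrative only)
A <- data.frame(ID   = c(1, 1, 2, 3),
                Name = c("Option1", "Option2", "Option2", "Option1"))
B <- data.frame(Name       = c("Option1", "Option2", "Option2"),  # Option2 appears twice
                Definition = c("Def1", "Def2", "Def2b"),
                Category   = c(1, 1, 2))

# Each Option2 row in A is matched by two rows in B, so the result grows
C <- left_join(A, B, by = "Name")
nrow(A)  # 4
nrow(C)  # 6

# Assumes dplyr >= 1.1.0: declare that every row of A should match at most
# one row of B, so duplicate keys in B raise an error that we catch here
guarded <- tryCatch(
  left_join(A, B, by = "Name", relationship = "many-to-one"),
  error = function(e) "join refused: B has duplicate keys"
)
```

If `guarded` ends up holding the message rather than a data frame, B contains repeated Name values and needs cleaning before the join.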

#2


It's hard to know without seeing your original data, but if data frame B does not contain unique values in the join columns, each affected row of data frame A is repeated once per matching row in B. You could try:

data_frame_b %>% count(join_col_1, join_col_2)

This shows how often each combination of the join columns occurs in B; any count above 1 marks a non-unique combination.
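
Applied to the question, where the only join column is Name, the check and one possible fix might look like the sketch below. The data are made up, and keeping the first row per Name with distinct() is an assumption about what is wanted; if the repeated rows in B carry different definitions or categories, they have to be resolved by hand instead.

```r
library(dplyr)

# Illustrative lookup table with one repeated key (not the asker's real data)
B <- data.frame(Name       = c("Option1", "Option2", "Option2", "Option3"),
                Definition = c("Def1", "Def2", "Def2", "Def2"),
                Category   = c(1, 1, 1, 2))
A <- data.frame(ID   = c(1, 1, 2),
                Name = c("Option1", "Option2", "Option3"))

# Keys that occur more than once in B are the ones that inflate the join
dup_keys <- B %>% count(Name) %>% filter(n > 1)

# Keep one row per Name, then join; the result has exactly nrow(A) rows
B_unique <- B %>% distinct(Name, .keep_all = TRUE)
C <- left_join(A, B_unique, by = "Name")
```

Comparing nrow(C) with nrow(A) after the join is a quick way to confirm that no rows were added.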
