在两个Pandas数据帧中查找公共行(交集)

时间:2021-12-03 22:58:03

Assume I have two dataframes of this format (call them df1 and df2):

假设我有两个这种格式的数据帧(称为df1和df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2. (ie. if a user_id is in both df1 and df2, include the two rows in the output dataframe)

我希望获得df1和df2中具有公共user_id的所有行的数据帧。 (即,如果user_id同时在df1和df2中,则在输出数据帧中包含两行)

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.

我可以想出很多方法来解决这个问题,但它们都让我感到笨拙。例如,我们可以在每个数据帧中找到所有唯一的user_id,创建一组,找到它们的交集,用结果集过滤两个数据帧并连接两个过滤的数据帧。

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.

也许这是最好的方法,但我知道熊猫很聪明。有更简单的方法吗?我看过合并,但我认为这不是我需要的。

3 个解决方案

#1


49  

My understanding is that this question is better answered over in this post.

我的理解是这个问题在这篇文章中得到了更好的回答。

But briefly, the answer to the OP with this method is simply:

但简而言之,使用此方法对OP的答案很简单:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.

这给了s1 5列:user_id和df1和df2中的每两列。

#2


9  

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

如果我理解正确,您可以使用Series.isin()和DataFrame.append()的组合:

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

这基本上是您使用惯用的pandas方法描述为“笨重”的算法。请注意重复的行索引。另请注意,如果df1和df2没有重叠的行索引,即if,则不会给出预期的输出

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.

实际上,如果它们的行索引不相等,它将不会给出预期的输出。

#3


3  

In SQL, this problem could be solved by several methods:

在SQL中,这个问题可以通过几种方法解决:

select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

or join and then unpivot (possible in SQL server)

或者加入然后忽略(可能在SQL服务器中)

select
    df1.user_id,
    c.rating
from df1
    inner join df2 on df2.user_i = df1.user_id
    outer apply (
        select df1.rating union all
        select df2.rating
    ) as c

Second one could be written in pandas with something like:

第二个可以用熊猫写成:

>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df4 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
   user_id  rating
0        3      20
0        3      30

#1


49  

My understanding is that this question is better answered over in this post.

我的理解是这个问题在这篇文章中得到了更好的回答。

But briefly, the answer to the OP with this method is simply:

但简而言之,使用此方法对OP的答案很简单:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.

这给了s1 5列:user_id和df1和df2中的每两列。

#2


9  

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

如果我理解正确,您可以使用Series.isin()和DataFrame.append()的组合:

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

这基本上是您使用惯用的pandas方法描述为“笨重”的算法。请注意重复的行索引。另请注意,如果df1和df2没有重叠的行索引,即if,则不会给出预期的输出

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.

实际上,如果它们的行索引不相等,它将不会给出预期的输出。

#3


3  

In SQL, this problem could be solved by several methods:

在SQL中,这个问题可以通过几种方法解决:

select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

or join and then unpivot (possible in SQL server)

或者加入然后忽略(可能在SQL服务器中)

select
    df1.user_id,
    c.rating
from df1
    inner join df2 on df2.user_i = df1.user_id
    outer apply (
        select df1.rating union all
        select df2.rating
    ) as c

Second one could be written in pandas with something like:

第二个可以用熊猫写成:

>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df4 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
   user_id  rating
0        3      20
0        3      30