I have two Series, s1 and s2, in pandas and want to compute their intersection, i.e. the values that are common to both.
How would I use the concat function to do this? I have been trying to work it out but have not managed to. (I want the intersection of the values, not of the indices of s1 and s2.)
Thanks in advance.
5 solutions
#1
32
Place both series in Python's set container then use the set intersection method:
set(s1).intersection(set(s2))
and then transform back to list if needed.
I just noticed the pandas tag. The result can be converted back to a Series:
pd.Series(list(set(s1).intersection(set(s2))))
Following the comments, I have changed this to a more Pythonic expression, which is shorter and easier to read:
pd.Series(list(set(s1) & set(s2)))
This should do the trick, unless the index data is also important to you.
I added the list(...) call to convert the set before passing it to pd.Series, because pandas does not accept a set as direct input for a Series.
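As a minimal, self-contained sketch of the set approach above (the sample values are borrowed from the other answers in this thread, and the sorted() call is my own addition, since a set has no guaranteed order):

```python
import pandas as pd

# Sample data, assumed for illustration
s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# Iterating a Series yields its values, so set(s1) is a set of values,
# not of index labels. Sorting gives a reproducible order.
common = pd.Series(sorted(set(s1) & set(s2)))

print(common.tolist())
```

Note that the resulting Series gets a fresh 0..n-1 index, so the original index labels are lost.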
#2
18
Setup:
s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])
Timings:
%%timeit
pd.Series(list(set(s1).intersection(set(s2))))
10000 loops, best of 3: 57.7 µs per loop
%%timeit
pd.Series(np.intersect1d(s1,s2))
1000 loops, best of 3: 659 µs per loop
%%timeit
pd.Series(np.intersect1d(s1.values,s2.values))
10000 loops, best of 3: 64.7 µs per loop
So the NumPy solution can be comparable to the set solution even for small series, if one passes the underlying values explicitly.
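For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard timeit module. The repeat and loop counts are arbitrary choices of mine, and the absolute numbers will differ by machine and by pandas/NumPy version:

```python
import timeit

import numpy as np
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# The three variants timed above, wrapped as zero-argument callables
candidates = {
    "set": lambda: pd.Series(list(set(s1).intersection(set(s2)))),
    "intersect1d on Series": lambda: pd.Series(np.intersect1d(s1, s2)),
    "intersect1d on values": lambda: pd.Series(np.intersect1d(s1.values, s2.values)),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 1000 calls each, converted to microseconds per call
    best = min(timeit.repeat(func, number=1000, repeat=3)) / 1000
    timings[name] = best * 1e6
    print(f"{name}: {timings[name]:.1f} us per call")
```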
#3
11
If you are using pandas, I assume you are also using NumPy. NumPy has a function, intersect1d, that works with a pandas Series.
Example:
pd.Series(np.intersect1d(pd.Series([1,2,3,5,42]), pd.Series([4,5,6,20,42])))
will return a Series with the values 5 and 42.
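If the original index labels matter, one possible variation (not part of the answer above) is to ask intersect1d for the positions of the matches via its return_indices argument, which needs NumPy 1.15 or later, and then select from the Series by position:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# values: sorted unique common values
# idx1, idx2: positions of the first occurrence of each value in s1 and s2
values, idx1, idx2 = np.intersect1d(
    s1.to_numpy(), s2.to_numpy(), return_indices=True
)

# Select by position so the result keeps the index labels of s1
result = s1.iloc[idx1]
print(result)
```

intersect1d always returns sorted, unique values, so duplicates in either input are collapsed.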
#4
5
Python
s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])
s1[s1.isin(s2)]
R
s1 <- c(4,5,6,20,42)
s2 <- c(1,2,3,5,42)
s1[s1 %in% s2]
Edit: This doesn't handle duplicates.
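To illustrate the duplicates caveat, here is a small sketch with made-up data. isin keeps every matching row of s1, so repeated values survive. Calling drop_duplicates() afterwards is one possible way to collapse them (my suggestion, not part of the original answer):

```python
import pandas as pd

# Made-up data containing repeated values
s1 = pd.Series([5, 5, 6, 42])
s2 = pd.Series([1, 5, 42, 42])

# Boolean mask: True where the value of s1 also occurs in s2
matches = s1[s1.isin(s2)]

# Keep only the first occurrence of each common value
unique_matches = matches.drop_duplicates()

print(matches.tolist())
print(unique_matches.tolist())
```

Unlike the set-based approaches, this keeps the index labels and the order of s1.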
#5
3
You could use the merge function, as follows:
pd.merge(df1, df2, how='inner')
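The snippet above does not define df1 and df2. A sketch of how they might be built from the two Series, assuming a shared column name "value" that I chose for illustration:

```python
import pandas as pd

# Give both Series the same name so the frames share a column to join on
s1 = pd.Series([4, 5, 6, 20, 42], name="value")
s2 = pd.Series([1, 2, 3, 5, 42], name="value")

df1 = s1.to_frame()
df2 = s2.to_frame()

# With no `on` argument, merge joins on the columns the frames have in common
common = pd.merge(df1, df2, how="inner")["value"]

print(common.tolist())
```

Be aware that an inner merge returns one row per matching pair, so values duplicated in both inputs are multiplied rather than collapsed.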