I'm having trouble filtering my result DataFrame with an or condition. I want my result df to keep all rows whose var column values are above 0.25 or below -0.25. The logic below raises an ambiguous truth value error, yet it works when I split the filtering into two separate operations. What is happening here? I'm not sure where to use the suggested a.empty, a.bool(), a.item(), a.any() or a.all().
result = result[(result['var']>0.25) or (result['var']<-0.25)]
4 answers
#1
136
The or and and Python statements require truth values. For pandas these are considered ambiguous, so you should use the "bitwise" | (or) or & (and) operations:
result = result[(result['var']>0.25) | (result['var']<-0.25)]
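A minimal sketch of the difference, assuming a small made-up DataFrame (only the column name var comes from the question; the values are illustrative):

```python
import pandas as pd

# Hypothetical sample data; only the column name 'var' is from the question.
result = pd.DataFrame({"var": [-0.5, -0.1, 0.0, 0.3, 0.9]})

# `or` needs a single truth value, so it calls bool() on the first Series,
# which pandas refuses with a ValueError.
try:
    result[(result["var"] > 0.25) or (result["var"] < -0.25)]
    raised = False
except ValueError:
    raised = True

# `|` combines the two boolean Series element by element instead.
mask = (result["var"] > 0.25) | (result["var"] < -0.25)
filtered = result[mask]

print(raised)                     # True
print(filtered["var"].tolist())   # [-0.5, 0.3, 0.9]
```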
These are overloaded for these kinds of data structures to yield the element-wise or (or and).
Just to add some more explanation to this statement:
The exception is thrown when you want to get the bool of a pandas.Series:
>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What you hit was a place where the operator implicitly converted the operands to bool (you used or, but it also happens for and, if and while):
>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
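To make the mechanism visible without pandas, here is a rough sketch using a toy class. Strict is a hypothetical stand-in for a Series, not how pandas is actually implemented. It shows that or goes through __bool__ while | goes through __or__:

```python
class Strict:
    """Toy stand-in for a Series: refuses bool(), but overloads | and &."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # `or`, `and`, `if` and `while` all end up calling this.
        raise ValueError("truth value is ambiguous")

    def __or__(self, other):
        # `|` dispatches here instead, so it can work element-wise.
        return Strict(a or b for a, b in zip(self.values, other.values))

    def __and__(self, other):
        # `&` dispatches here, again element-wise.
        return Strict(a and b for a, b in zip(self.values, other.values))


x = Strict([True, False, False])
y = Strict([False, False, True])

try:
    x or y  # implicit bool(x) -> ValueError
    or_raised = False
except ValueError:
    or_raised = True

combined_or = (x | y).values    # no bool() call involved
combined_and = (x & y).values

print(or_raised)      # True
print(combined_or)    # [True, False, True]
print(combined_and)   # [False, False, False]
```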
Besides these 4 statements there are several Python functions that hide some bool calls (like any, all, filter, ...). These are normally not problematic with pandas.Series, but for completeness I wanted to mention them.
In your case the exception isn't really helpful, because it doesn't mention the right alternatives. For and and or you can use (if you want element-wise comparisons):
- numpy.logical_or:
>>> import numpy as np
>>> np.logical_or(x, y)
or simply the | operator:
>>> x | y
- numpy.logical_and:
>>> np.logical_and(x, y)
or simply the & operator:
>>> x & y
If you're using the operators, make sure you set your parentheses correctly because of operator precedence: | and & bind more tightly than comparison operators such as > and <.
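A small sketch of why the parentheses matter, using made-up values. Without them, Python groups the expression around | first, so it ends up as a chained comparison that fails rather than the intended filter:

```python
import pandas as pd

s = pd.Series([-0.5, 0.0, 0.9])  # hypothetical values

# With parentheses: each comparison is evaluated first, then combined.
ok = (s > 0.25) | (s < -0.25)

# Without parentheses Python parses the expression as the chained comparison
#     s > (0.25 | s) < -0.25
# which is not the intended filter and raises an exception.
try:
    s > 0.25 | s < -0.25
    failed = False
except (TypeError, ValueError):
    failed = True

print(ok.tolist())   # [True, False, True]
print(failed)        # True
```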
There are several logical numpy functions which should work on pandas.Series.
The alternatives mentioned in the exception are more suited if you encountered it when doing if or while. I'll briefly explain each of these:
- If you want to check if your Series is empty:
>>> x = pd.Series([])
>>> x.empty
True
>>> x = pd.Series([1])
>>> x.empty
False
Python normally interprets the length of containers (like list, tuple, ...) as their truth value if they have no explicit boolean interpretation. So if you want the Python-like check, you could do: if x.size or if not x.empty instead of if x.
- If your Series contains one and only one boolean value:
>>> x = pd.Series([100])
>>> (x > 50).bool()
True
>>> (x < 50).bool()
False
Note that .bool() is deprecated in recent pandas releases (2.1 and later); .item() covers the same case.
- If you want to check the first and only item of your Series (like .bool(), but it works even for non-boolean contents):
>>> x = pd.Series([100])
>>> x.item()
100
- If you want to check if all or any item is not-zero, not-empty or not-False:
>>> x = pd.Series([0, 1, 2])
>>> x.all()   # because one element is zero
False
>>> x.any()   # because one (or more) elements are non-zero
True
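Putting the alternatives above to work in an if statement might look like the following sketch. The variable names are only illustrative; the point is that each call reduces the Series to one plain value before if sees it:

```python
import pandas as pd

x = pd.Series([0, 1, 2])

# Each of these yields a single truth value, so `if` has nothing ambiguous to decide.
non_empty = not x.empty
has_nonzero = bool(x.any())
all_nonzero = bool(x.all())

single = pd.Series([100])
only_value = single.item()  # only valid when the Series has exactly one element

if not x.empty and x.any():
    message = "at least one non-zero value"
else:
    message = "empty or all zeros"

print(non_empty, has_nonzero, all_nonzero)  # True True False
print(only_value)                           # 100
print(message)                              # at least one non-zero value
```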
#2
17
For boolean logic, use & and |.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
>>> df
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
To see what is happening, you get a column of booleans for each comparison, e.g.
df.C > 0.25
0 True
1 False
2 False
3 True
4 True
Name: C, dtype: bool
When you have multiple criteria, you will get multiple columns returned. This is why the join logic is ambiguous. Using and or or treats each column as a whole, so you first need to reduce that column to a single boolean value. For example, to see if any value or all values in each of the columns is True.
# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True
# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False
One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.
>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
For more details, refer to Boolean Indexing in the docs.
#3
2
Alternatively, you could use the operator module. More detailed information is in the Python docs.
import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
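One situation where the function form can be handy, sketched here with made-up values rather than the seeded frame above, is combining a whole list of conditions. functools.reduce is my own addition here, not part of the answer:

```python
import operator
from functools import reduce

import pandas as pd

# Made-up values for column C; not the random data shown above.
df = pd.DataFrame({"C": [0.98, -0.98, -0.10, 1.45, 0.44]})

# Any number of boolean Series can be folded together with operator.or_.
conditions = [df.C > 0.25, df.C < -0.25]
mask = reduce(operator.or_, conditions)
selected = df.loc[mask]

print(selected.index.tolist())  # [0, 1, 3, 4]
```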
#4
0
This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:
result = result.query("(var > 0.25) or (var < -0.25)")
See also http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.
(Some tests with a dataframe I'm currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)
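As a quick sanity check, sketched with made-up data, the query string and the bitwise mask should select the same rows:

```python
import pandas as pd

# Hypothetical sample data; only the column name 'var' is from the question.
result = pd.DataFrame({"var": [-0.5, -0.1, 0.0, 0.3, 0.9]})

# Inside a query string, `or` is evaluated element-wise by pandas.
by_query = result.query("(var > 0.25) or (var < -0.25)")

# The equivalent boolean-mask filter.
by_mask = result[(result["var"] > 0.25) | (result["var"] < -0.25)]

same = by_query.equals(by_mask)
print(same)                       # True
print(by_query["var"].tolist())   # [-0.5, 0.3, 0.9]
```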
A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2) and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"
I obtained the following exception cascade:
- KeyError: 'log2'
- UndefinedVariableError: name 'log2' is not defined
- ValueError: "log2" is not a supported function
I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.
A possible workaround is proposed here.
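The linked workaround is not reproduced here. One way to sidestep the query parser entirely, offered only as a sketch with made-up numbers (the column names follow the ones mentioned above), is plain boolean indexing with bracket column access:

```python
import numpy as np
import pandas as pd

# Made-up numbers; only the column names follow the text above.
df = pd.DataFrame({
    "WT_38hph_IP_2": [10.0, 40.0, 80.0],
    "WT_38hph_input_2": [10.0, 10.0, 80.0],
})
ratio_col = "log2(WT_38hph_IP_2/WT_38hph_input_2)"
df[ratio_col] = np.log2(df["WT_38hph_IP_2"] / df["WT_38hph_input_2"])

# Bracket access treats the whole string as a column label, so nothing
# tries to interpret log2(...) as a function call.
selected = df[(df[ratio_col] > 1) & (df["WT_38hph_IP_2"] > 20)]

print(selected.index.tolist())  # [1]
```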