The truth value of a numpy array with one falsey element seems to depend on dtype

Time: 2021-11-05 15:50:26
import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
d = np.array([' '])

Why should we have this inconsistency:

>>> bool(a)
False
>>> bool(b)
False
>>> bool(c)
True
>>> bool(d)
False

4 answers

#1


7  

For arrays with one element, the array's truth value is determined by the truth value of that element.

The main point to make is that np.array(['']) is not an array containing one empty Python string. This array is created to hold strings of exactly one byte each and NumPy pads strings that are too short with the null character. This means that the array is equal to np.array(['\0']).
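
The padding can be seen by looking at the dtype and the raw buffer. This is only a sketch, and it assumes Python 3, where NumPy stores a str element as fixed-width unicode (4 bytes per character) rather than as a one-byte string; the variable names are made up for illustration.

```python
import numpy as np

empty = np.array([''])      # looks like it holds an empty string
null = np.array(['\0'])     # holds one explicit null character

# Both arrays get a fixed-width dtype with room for exactly one character.
print(empty.dtype.kind, empty.dtype.itemsize)

# The raw buffers are identical: the "empty" string is stored as null padding.
print(empty.tobytes() == null.tobytes())

# With an explicit byte-string dtype the element really is one byte wide.
empty_bytes = np.array([b''], dtype='S1')
print(empty_bytes.tobytes())
```

The buffer comparison is the important part: once the data is in the array, there is nothing left to distinguish '' from '\0'.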

In this regard, NumPy is being consistent with Python which evaluates bool('\0') as True.

In fact, the only strings which are False in NumPy arrays are strings which do not contain any non-whitespace characters ('\0' is not a whitespace character).

Details of this Boolean evaluation are presented below.


Navigating NumPy's labyrinthine source code is not always easy, but we can find the code governing how values in different datatypes are mapped to Boolean values in the arraytypes.c.src file. This will explain how bool(a), bool(b), bool(c) and bool(d) are determined.

Before we get to the code in that file, we can see that calling bool() on a NumPy array invokes the internal _array_nonzero() function. If the array is empty, we get False. If there are two or more elements we get an error. But if the array has exactly one element, we hit the line:

return PyArray_DESCR(mp)->f->nonzero(PyArray_DATA(mp), mp);

Now, PyArray_DESCR is a struct holding various properties for the array. f is a pointer to another struct PyArray_ArrFuncs that holds the array's nonzero function. In other words, NumPy is going to call upon the array's own special nonzero function to check the Boolean value of that one element.
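
The size check described above can be modelled in a few lines of Python. This is a rough sketch, not NumPy's code: array_truth is a hypothetical helper, and it uses plain Python truthiness for the single element, which is exactly the step where NumPy calls the dtype's own nonzero function instead.

```python
import numpy as np

def array_truth(arr):
    """Pure-Python model of the size check described in the text (hypothetical)."""
    if arr.size == 0:
        return False                  # empty array: False, as described in the text
    if arr.size == 1:
        return bool(arr.item())       # one element: defer to that element
    raise ValueError("truth value of an array with more than one element is ambiguous")

one = np.array([3])
many = np.array([1, 2])

# For a one-element numeric array the model and NumPy agree.
print(array_truth(one), bool(one))

# With two or more elements NumPy refuses to pick a truth value.
try:
    bool(many)
    raised = False
except ValueError:
    raised = True
print(raised)
```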

Determining whether an element is nonzero or not is obviously going to depend on the datatype of the element. The code implementing the type-specific nonzero functions can be found in the "nonzero" section of the arraytypes.c.src file.

As we'd expect, floats, integers and complex numbers are False if they're equal to zero. This explains bool(a). In the case of object arrays, None is similarly going to be evaluated as False because NumPy just calls the PyObject_IsTrue function. This explains bool(b).
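
The numeric and object cases are easy to check. A small sketch; AlwaysFalse is a hypothetical class added here only to show that an object array defers to the element's own Python truth value.

```python
import numpy as np

class AlwaysFalse:
    """Hypothetical object whose Python truth value is False."""
    def __bool__(self):
        return False

# Numeric dtypes: a single zero of any kind is False.
numeric_zeros = [np.array([0]), np.array([0.0]), np.array([0j])]
print([bool(arr) for arr in numeric_zeros])

# Object dtype: the element is a real Python object, so Python's rules apply.
object_arrays = [
    np.array([None], dtype=object),
    np.array([''], dtype=object),          # a genuine empty Python str
    np.array([AlwaysFalse()], dtype=object),
]
print([bool(arr) for arr in object_arrays])
```

Note that with dtype=object the empty string stays a Python str, so it is falsey in the ordinary Python sense; no padding is involved.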

To understand the results of bool(c) and bool(d), we see that the nonzero function for string type arrays is mapped to the STRING_nonzero function:

static npy_bool
STRING_nonzero (char *ip, PyArrayObject *ap)
{
    int len = PyArray_DESCR(ap)->elsize; // size of dtype (not string length)
    int i;
    npy_bool nonz = NPY_FALSE;

    for (i = 0; i < len; i++) {
        if (!Py_STRING_ISSPACE(*ip)) {   // if it isn't whitespace, it's True
            nonz = NPY_TRUE;
            break;
        }
        ip++;
    }
    return nonz;
}

(The unicode case is more or less the same idea.)

So in arrays with a string or unicode datatype, a string is only False if it contains only whitespace characters:

>>> bool(np.array([' ']))
False

In the case of array c in the question, there is really a null character \0 padding the seemingly-empty string:

>>> np.array(['']) == np.array(['\0'])
array([ True], dtype=bool)

The STRING_nonzero function sees this non-whitespace character and so bool(c) is True.
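
The loop in STRING_nonzero can be mirrored in pure Python to see why a null byte counts as True. This is only a model of the C code quoted above: string_nonzero_model is a hypothetical name, and the whitespace set is an assumption (the usual ASCII whitespace characters) about what Py_STRING_ISSPACE accepts.

```python
# Assumed whitespace set: space, tab, newline, carriage return, VT, FF.
ASCII_WHITESPACE = b' \t\n\r\x0b\x0c'

def string_nonzero_model(buf):
    """Model of the quoted loop: True as soon as any byte is not whitespace."""
    for byte in buf:                      # iterating bytes yields ints
        if byte not in ASCII_WHITESPACE:
            return True
    return False

print(string_nonzero_model(b' '))       # array d: only whitespace
print(string_nonzero_model(b'\x00'))    # array c: the padding byte is not whitespace
```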

As noted at the start of this answer, this is consistent with Python's evaluation of strings containing a single null character: bool('\0') is also True.


Update: Wim has fixed the behaviour detailed above in NumPy's master branch by making strings which contain only null characters, or a mix of only whitespace and null characters, evaluate to False. This means that NumPy 1.10+ will see that bool(np.array([''])) is False, which is much more in line with Python's treatment of "empty" strings.
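
The rule described in this update can be sketched the same way. This is a pure-Python model written from the description above, not the actual patch; fixed_string_nonzero_model is a hypothetical name and the whitespace set is again an assumption.

```python
# Assumed whitespace set: space, tab, newline, carriage return, VT, FF.
ASCII_WHITESPACE = b' \t\n\r\x0b\x0c'

def fixed_string_nonzero_model(buf):
    """True only if some byte is neither a null byte nor whitespace."""
    for byte in buf:
        if byte != 0 and byte not in ASCII_WHITESPACE:
            return True
    return False

print(fixed_string_nonzero_model(b'\x00'))     # padded "empty" string
print(fixed_string_nonzero_model(b' \x00'))    # mix of whitespace and nulls
print(fixed_string_nonzero_model(b'a\x00'))    # a real character followed by padding
```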

#2


8  

I'm pretty sure the answer is, as explained in Scalars, that:

Array scalars have the same attributes and methods as ndarrays. [1] This allows one to treat items of an array partly on the same footing as arrays, smoothing out rough edges that result when mixing scalar and array operations.

So, if it's acceptable to call bool on a scalar, it must be acceptable to call bool on an array of shape (1,), because they are, as far as possible, the same thing.

And, while it isn't directly said anywhere in the docs that I know of, it's pretty obvious from the design that NumPy's scalars are supposed to act like native Python objects.

So, that explains why np.array([0]) is falsey rather than truthy, which is what you were initially surprised about.


So, that explains the basics. But what about the specifics of case c?

First, note that your array np.array(['']) is not an array of one Python object, but an array of one NumPy <U1 null-terminated character string of length 1. Fixed-length-string values don't have the same truthiness rule as Python strings, and they really couldn't; for a fixed-length-string type, "false if empty" doesn't make any sense, because they're never empty. You could argue about whether NumPy should have been designed that way or not, but it clearly does follow that rule consistently, and I don't think the opposite rule would be any less confusing here, just different.

But there seems to be something else weird going on with strings. Consider this:

>>> np.array(['a', 'b']) != 0
True

That's not doing an elementwise comparison of the <U1 strings to 0 and returning array([True, True]) (as you'd get from np.array(['a', 'b'], dtype=object)); it's doing an array-wide comparison and deciding that no array of strings is equal to 0, which seems odd… I'm not sure whether this deserves a separate answer here or even a whole separate question, but I am pretty sure I'm not going to be the one who writes that answer, because I have no clue what's going on here. :)


Beyond arrays of shape (1,), arrays of shape () are treated the same way, but anything else is a ValueError, because otherwise it would be very easy to misuse arrays with the and operator and other Python operators that NumPy can't automagically convert into elementwise operations.
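
Both halves of that are easy to try out. A minimal sketch with made-up variable names; np.logical_and is shown as the elementwise alternative to the and operator.

```python
import numpy as np

zero_d = np.array(0)        # shape ()
one_d = np.array([0])       # shape (1,)
scalar = np.int64(0)        # NumPy scalar

print(zero_d.shape, one_d.shape)
print(bool(zero_d), bool(one_d), bool(scalar))   # all three agree

left = np.array([True, False])
right = np.array([True, True])

# "and" needs bool(left), which is ambiguous for two elements.
try:
    left and right
    raised = False
except ValueError:
    raised = True
print(raised)

# The elementwise operation has to be requested explicitly.
both = np.logical_and(left, right)
print(both)
```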

I personally think being consistent with other arrays would be more useful than being consistent with scalars here—in other words, just raise a ValueError. I also think that, if being consistent with scalars were important here, it would be better to be consistent with the unboxed Python values. In other words, if bool(array([v])) and bool(array(v)) are going to be allowed at all, they should always return exactly the same thing as bool(v), even if that's not consistent with np.nonzero. But I can see the argument the other way.

#3


3  

It's fixed in master now.

I thought this was a bug, and the numpy devs agreed, so this patch was merged earlier today. We should see new behaviour in the upcoming 1.10 release.

#4


2  

NumPy seems to follow the same truth-value conversions as builtin Python; in this context the result seems to depend on which elements return true for calls to nonzero. Apparently len can also be used, but here none of these arrays are empty (length 0), so that's not directly relevant. Note that calling bool([False]) also returns True according to these rules.
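
The bool([False]) remark can be checked next to the array equivalent. A small sketch with illustrative names: a Python list is judged by its length, while a one-element array is judged by the element it holds.

```python
import numpy as np

as_list = [False]
as_array = np.array([False])

# The list is non-empty, so Python calls it True regardless of its contents.
print(len(as_list), bool(as_list))

# The one-element array takes the truth value of its single element.
print(as_array.size, bool(as_array))
```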

a = np.array([0])
b = np.array([None])
c = np.array([''])

>>> np.nonzero(a)
(array([], dtype=int64),)
>>> np.nonzero(b)
(array([], dtype=int64),)
>>> np.nonzero(c)
(array([0]),)

This also seems consistent with the more enumerative description of bool casting --- where your examples are all explicitly discussed.

Interestingly, there does seem to be systematically different behavior with string arrays, e.g.

>>> a.astype(bool)
array([False], dtype=bool)
>>> b.astype(bool)
array([False], dtype=bool)
>>> c.astype(bool)
ERROR: ValueError: invalid literal for int() with base 10: ''

I think, when numpy converts something into a bool it uses the PyArray_BoolConverter function which, in turn, just calls the PyObject_IsTrue function --- i.e. the exact same function that builtin python uses, which is why numpy's results are so consistent.
