Numpy astype "upcasts" the array instead of applying dtypes across columns

Date: 2021-02-09 16:10:14

I have a 2D numpy array, and I'd like to apply a specific dtype to each column.

a = np.arange(25).reshape((5,5))

In [40]: a
Out[40]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [41]: a.astype(dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])

I was expecting line 41 to apply the dtype that I desired, but instead it "upcast" by creating a new axis, replicating the whole array once for each of the dtypes:

Out[41]: 
array([[(0, 0, 0, 0.0, 0.0), (1, 1, 1, 1.0, 1.0), (2, 2, 2, 2.0, 2.0),
        (3, 3, 3, 3.0, 3.0), (4, 4, 4, 4.0, 4.0)],
       [(5, 5, 5, 5.0, 5.0), (6, 6, 6, 6.0, 6.0), (7, 7, 7, 7.0, 7.0),
        (8, 8, 8, 8.0, 8.0), (9, 9, 9, 9.0, 9.0)],
       [(10, 10, 10, 10.0, 10.0), (11, 11, 11, 11.0, 11.0),
        (12, 12, 12, 12.0, 12.0), (13, 13, 13, 13.0, 13.0),
        (14, 14, 14, 14.0, 14.0)],
       [(15, 15, 15, 15.0, 15.0), (16, 16, 16, 16.0, 16.0),
        (17, 17, 17, 17.0, 17.0), (18, 18, 18, 18.0, 18.0),
        (19, 19, 19, 19.0, 19.0)],
       [(20, 20, 20, 20.0, 20.0), (21, 21, 21, 21.0, 21.0),
        (22, 22, 22, 22.0, 22.0), (23, 23, 23, 23.0, 23.0),
        (24, 24, 24, 24.0, 24.0)]], 
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])

Why did this happen, given that the number of dtypes matches the number of columns (and so I didn't expect upcasting)?

How can I take an existing array in memory and apply per-column dtypes, as I had intended on line 41? Thanks.

3 Answers

#1


1  

This is an odd corner case that I've never encountered, but I believe the answer is related to the fact that in general, numpy only supports a few forms of assignment to structured arrays.

In this particular case, I think numpy is following the convention used for scalar assignment to structured arrays, and is then broadcasting the assignment over the whole input array to generate a result of the same shape as the original array.
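
The broadcasting described above can be checked on a smaller array. This is a minimal sketch, not taken from the original answer; the two-field dtype and the 2x3 shape are chosen deliberately so that the field count does not match the column count, which shows that astype never pairs fields with columns in the first place.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))
dt = np.dtype([("x", "<i4"), ("y", "<f4")])  # two fields, but three columns

# astype keeps the shape of `a`; every element becomes one structured scalar.
s = a.astype(dt)
print(s.shape)                    # (2, 3)

# The single source value is assigned to every field of that element.
print(s["x"][1, 2], s["y"][1, 2])  # 5 5.0
```

So the matching count of dtypes and columns in the question is a coincidence as far as astype is concerned: each of the 25 integers is converted on its own.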

Why the limit?

I believe the forms of assignment for structured arrays are limited because the "columns" of structured arrays are not much like the columns of ordinary 2-d arrays. In fact, it makes more sense to think of a ten-row, three-column structured array as a 1-d array of ten instances of an atomic row type.

These atomic row types are called "structured scalars". They have a fixed internal memory layout that cannot be dynamically reshaped, and so it doesn't really make sense to treat them the same way as the row of a 2-d array.
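
To make the "1-d array of atomic rows" picture concrete, here is a small sketch. The field names a, b, c are hypothetical and only serve the illustration.

```python
import numpy as np

# Three fields per record, i.e. a "ten-row, three-column" structured array.
dt = np.dtype([("a", "<i4"), ("b", "<i4"), ("c", "<f4")])
rows = np.zeros(10, dtype=dt)

print(rows.shape)       # (10,)  one axis only; the fields are not an axis
print(rows.ndim)        # 1
print(dt.itemsize)      # 12     bytes per record (4 + 4 + 4)
print(type(rows[0]))    # <class 'numpy.void'>  the structured scalar type

# A "column" is just a strided view onto one field of every record.
print(rows["c"].shape)                  # (10,)
print(np.shares_memory(rows["c"], rows))  # True
```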

How to create a structured view of an existing array

Honestly, I don't know! I will update this answer if I find a good way. But I don't think I will find a good way, because as discussed above, structured scalars have their own distinctive memory layout. It's possible to hack up something with a buffer that has the right layout, but you'd be digging into numpy internals to do that, which isn't ideal. That being said, see this answer from Mad Physicist, who has done this somewhat more elegantly than I thought was possible.

It's also worth mentioning that astype creates a copy by default. You can pass copy=False, but numpy might still make a copy if certain requirements aren't satisfied.
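
A quick way to see the copy behavior is np.shares_memory. The sketch below uses plain (unstructured) dtypes only, as an assumption to keep it short; the same copy rules apply to the structured case.

```python
import numpy as np

a = np.arange(25, dtype="<i4").reshape((5, 5))

copied = a.astype("<i4")                 # default copy=True: new buffer
same = a.astype("<i4", copy=False)       # dtype already matches: no copy
converted = a.astype("<f4", copy=False)  # dtype differs: copied anyway

print(np.shares_memory(a, copied))     # False
print(same is a)                       # True
print(np.shares_memory(a, converted))  # False
```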

Alternatives...

I rarely find that I actually need a view; often creating a copy causes no perceptible change in performance. My first approach to this problem would simply be to use one of the standard assignment strategies for record arrays. In this case, that would probably mean using subarray assignment. First we create the array. Note the tuples. They are required for expected behavior.

>>> a = np.array([(1, 2), (3, 4)], dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([(1., 2), (3., 4)], dtype=[('x', '<f8'), ('y', '<i8')])

Now if we try to assign an ordinary 2-d array to a, we get an error:

>>> a[:] = np.array([[11, 22], [33, 44]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (2,2) into shape (2)

But we can easily assign in a column-wise way:

>>> a['x'] = [11, 22]
>>> a['y'] = [33, 44]
>>> a
array([(11., 33), (22., 44)], dtype=[('x', '<f8'), ('y', '<i8')])
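
Applied to the 5x5 array from the question, column-wise assignment might look like the sketch below. It makes a copy rather than a view, and the name out is just a placeholder.

```python
import numpy as np

a = np.arange(25).reshape((5, 5))
dt = np.dtype([("width", "<i4"), ("height", "<i4"), ("depth", "<i4"),
               ("score", "<f4"), ("auc", "<f4")])

# One record per row of `a`; the contents are filled in field by field.
out = np.empty(a.shape[0], dtype=dt)
for col, name in enumerate(dt.names):
    out[name] = a[:, col]  # each column is cast to that field's dtype

print(out.shape)       # (5,)
print(out[1].item())   # (5, 6, 7, 8.0, 9.0)
```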

We can also use Python tuples. This overwrites the whole array:

>>> a[:] = [(111, 222), (333, 444)]
>>> a
array([(111., 222), (333., 444)], dtype=[('x', '<f8'), ('y', '<i8')])

We can also assign data row-wise using tuples:

>>> a[1] = (3333, 4444)
>>> a
array([( 111.,  222), (3333., 4444)], dtype=[('x', '<f8'), ('y', '<i8')])

Again, this fails if we try to pass a list or array:

>>> a[1] = [3333, 4444]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
>>> a[1] = np.array([3333, 4444])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.

Finally, note that we see the same behavior you saw with astype when we try to create a structured array from nested lists or numpy arrays. numpy just broadcasts the input array against the datatype, producing a 2-d array of structured scalars:

>>> a = np.array([[1, 2], [3, 4]], dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([[(1., 1), (2., 2)],
       [(3., 3), (4., 4)]], dtype=[('x', '<f8'), ('y', '<i8')])
>>> a = np.array(np.array([[1, 2], [3, 4]]), dtype=[('x', 'f8'), ('y', 'i8')])
>>> a
array([[(1., 1), (2., 2)],
       [(3., 3), (4., 4)]], dtype=[('x', '<f8'), ('y', '<i8')])
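
The difference between lists and tuples is the whole story here, so one workaround is to turn each row into a tuple before constructing the array. This is a sketch on the small 2x2 example; the variable names are placeholders.

```python
import numpy as np

dt = np.dtype([("x", "f8"), ("y", "i8")])
plain = np.array([[1, 2], [3, 4]])

# Nested lists: every number is broadcast into its own record.
broadcast = np.array(plain.tolist(), dtype=dt)
print(broadcast.shape)  # (2, 2)

# A list of tuples: each tuple is read as one record, field by field.
records = np.array([tuple(row) for row in plain.tolist()], dtype=dt)
print(records.shape)         # (2,)
print(records["x"].tolist())  # [1.0, 3.0]
print(records["y"].tolist())  # [2, 4]
```

The Python-level loop over rows is slow for large arrays, so this is mainly useful for small inputs or for understanding the behavior.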

If your goal is simply to create a new array, then see the answers to this question. They cover a couple of useful approaches, including numpy.core.records.fromarrays and numpy.core.records.fromrecords (both are also exposed under the public np.rec namespace). See also Paul Panzer's answer, which discusses how to create a new record array (a structured array that allows attribute access to columns).

#2


2  

As @senderle correctly points out, you rarely need a view, but here is a possible solution to do this almost in-place just for fun. The only modification you will need to do is to make sure your types are all of the same size.

a = np.arange(25, dtype='<i4').reshape((5,5))
b = a.view(dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
b['score'] = a[:, -2, np.newaxis].astype('<f4')
b['auc'] = a[:, -1, np.newaxis].astype('<f4')

If we are going to do non-recommended things, you can also insert a line b.shape = (5,) after getting the view to eliminate the extra dimension preserved from a, which makes the assignments below it simpler.

This will give you a view b, which has all the desired properties, but of course will mess up the contents of a:

>>> a
array([[         0,          1,          2, 1077936128, 1082130432],
       [         5,          6,          7, 1090519040, 1091567616],
       [        10,         11,         12, 1095761920, 1096810496],
       [        15,         16,         17, 1099956224, 1100480512],
       [        20,         21,         22, 1102577664, 1103101952]])
>>> b
array([[( 0,  1,  2,  3.,  4.)],
       [( 5,  6,  7,  8.,  9.)],
       [(10, 11, 12, 13., 14.)],
       [(15, 16, 17, 18., 19.)],
       [(20, 21, 22, 23., 24.)]],
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
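
The size requirement can be checked before taking the view. The sketch below compares the bytes in one row of a with the itemsize of the target dtype; the int64 array a64 is a hypothetical counter-example added to show what happens when the sizes do not line up.

```python
import numpy as np

dt = np.dtype([("width", "<i4"), ("height", "<i4"), ("depth", "<i4"),
               ("score", "<f4"), ("auc", "<f4")])
a = np.arange(25, dtype="<i4").reshape((5, 5))

# The view maps one row onto one record only if the byte counts agree.
row_bytes = a.shape[1] * a.dtype.itemsize
print(row_bytes, dt.itemsize)   # 20 20

b = a.view(dt)
print(b.shape)                  # (5, 1)
print(np.shares_memory(a, b))   # True

# With 8-byte integers a row holds 40 bytes, so it silently splits
# into two 20-byte records whose fields no longer match the columns.
a64 = np.arange(25, dtype="<i8").reshape((5, 5))
print(a64.view(dt).shape)       # (5, 2)
```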

#3


1  

Here is a workaround using np.rec.fromarrays:

>>> dtype = [('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')]
>>> np.rec.fromarrays(a.T, dtype=dtype)
rec.array([( 0,  1,  2,  3.,  4.), ( 5,  6,  7,  8.,  9.),
           (10, 11, 12, 13., 14.), (15, 16, 17, 18., 19.),
           (20, 21, 22, 23., 24.)],
          dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])

This is a recarray, but we can cast to ndarray if need be. In addition, the dtype is np.record; we need to (view-)cast that to void to get a "clean" numpy result.

>>> np.asarray(np.rec.fromarrays(a.T, dtype=dtype)).view(dtype)
array([( 0,  1,  2,  3.,  4.), ( 5,  6,  7,  8.,  9.),
       (10, 11, 12, 13., 14.), (15, 16, 17, 18., 19.),
       (20, 21, 22, 23., 24.)],
      dtype=[('width', '<i4'), ('height', '<i4'), ('depth', '<i4'), ('score', '<f4'), ('auc', '<f4')])
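
Another option, not mentioned in the answers above, is unstructured_to_structured from numpy.lib.recfunctions. As far as I know it needs NumPy 1.16 or later. It treats the last axis of the input as the fields and returns a plain ndarray, so no recarray cleanup is needed. A minimal sketch:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.arange(25).reshape((5, 5))
dt = np.dtype([("width", "<i4"), ("height", "<i4"), ("depth", "<i4"),
               ("score", "<f4"), ("auc", "<f4")])

# The last axis of `a` (length 5) is consumed as the five fields of `dt`.
s = rfn.unstructured_to_structured(a, dtype=dt)

print(type(s).__name__)  # ndarray
print(s.shape)           # (5,)
print(s[1].item())       # (5, 6, 7, 8.0, 9.0)
```

Because the field dtypes differ from the dtype of a, the result here is a converted copy, not a view of the original data.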
