Vectorized string operations in Numpy: why are they rather slow?

Date: 2021-05-21 21:19:45

This is one of those "mostly asked out of pure curiosity (in the possibly futile hope I will learn something)" questions.

I was investigating ways of saving memory on operations on massive numbers of strings, and for some scenarios it seems like string operations in numpy could be useful. However, I got somewhat surprising results:

import random
import string
import numpy as np

milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]

npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))

Memory consumption using memory_profiler:

%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB

%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB

So far, so good; however, the timing results surprised me:

%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is that? I expected that since Numpy uses contiguous chunks of memory for its arrays, its operations are vectorized (as the numpy doc page above says), and numpy string arrays apparently use less memory (so operating on them should at least potentially be more CPU-cache-friendly), performance on arrays of strings would be at least similar to that of pure Python?

Environment:

Python 3.6.3 x64, Linux

numpy==1.14.1

1 solution

#1



Vectorized is used in two ways when talking about numpy, and it's not always clear which is meant.

  1. Operations that operate on all elements of an array

  2. Operations that call optimized (and in many cases multi-threaded) numerical code internally

The second point is what makes vectorized operations much faster than a for loop in python, and the multithreaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they're referring to the second case as well. However, in the numpy documentation, vectorized only refers to the first case. It means you can use a function directly on an array, without having to loop through all the elements and call it on each element. In this sense it makes code more concise, but it isn't necessarily any faster. Some vectorized operations do call multithreaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operations since I think they are more readable than list comprehensions, even if performance is identical.
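The gap between the two senses is easy to demonstrate with np.vectorize, which numpy's own documentation describes as a convenience rather than a performance feature: it is "vectorized" in the first sense only, since it essentially runs a Python-level loop internally. A minimal sketch:

```python
import numpy as np

# np.vectorize gives sense-1 vectorization only: a scalar function can be
# applied to a whole array, but internally it is essentially a for loop,
# so it is not faster than a list comprehension.
vupper = np.vectorize(str.upper)

arr = np.array(['ab', 'cd', 'ef'], dtype=np.object_)
print(vupper(arr))  # ['AB' 'CD' 'EF']
```

So the concise call syntax alone says nothing about whether optimized native code is doing the work.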

Now, for the code in question, the documentation for np.char (which is an alias for np.core.defchararray) states:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

So there are four ways (one not recommended) to handle strings in numpy. Some testing is in order, since each way will surely have different advantages and disadvantages. Using arrays defined as follows:

npob = np.array(milstr, dtype=np.object_)
npuni = np.array(milstr, dtype=np.unicode_)
npstr = np.array(milstr, dtype=np.string_)
npchar = npstr.view(np.chararray)
npcharU = npuni.view(np.chararray)

This creates arrays (or chararrays for the last two) with the following datatypes:

In [68]: npob.dtype
Out[68]: dtype('O')

In [69]: npuni.dtype
Out[69]: dtype('<U10')

In [70]: npstr.dtype
Out[70]: dtype('S10')

In [71]: npchar.dtype
Out[71]: dtype('S10')

In [72]: npcharU.dtype
Out[72]: dtype('<U10')

The benchmarks give quite a range of performance across these datatypes:

%timeit [x.upper() for x in test]
%timeit np.char.upper(test)

# test = milstr
103 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npob
110 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
<error on second test, vectorized operations don't work with object arrays>

# test = npuni
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npstr
125 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npchar
663 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npcharU
887 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
325 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is involved performance becomes much worse. The chararray is by far the slowest, whether handling unicode or not. It should be clear why it's not recommended for use.

Using unicode strings incurs a significant performance penalty. The docs state the following about the differences between these types:

For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and np.string_ continues to map to np.bytes_. To use actual strings in Python 3 use U or np.unicode_. For signed bytes that do not need zero-termination b or i1 can be used.
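The size difference is visible directly in the dtypes: a fixed-width U dtype reserves four bytes per character (UCS-4), while an S dtype reserves one, so a unicode array moves four times as much memory for the same nominal string length:

```python
import numpy as np

# A fixed-width unicode dtype stores 4 bytes per character (UCS-4),
# while a bytes dtype stores 1 byte per character.
print(np.dtype('S10').itemsize)  # 10
print(np.dtype('U10').itemsize)  # 40
```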

In this case, where the character set does not require unicode, it would make sense to use the faster string_ type. If unicode is needed, you may get better performance by using a list, or a numpy array of type object_ if other numpy functionality is required. Another good example of when a list may be better is appending lots of data.
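That last point can be sketched as follows (toy data; the names are illustrative). A Python list append is amortized O(1), while np.append copies the entire array on every call, so the array-growing loop is quadratic overall:

```python
import numpy as np

items = ['id%d' % i for i in range(1000)]  # illustrative toy data

# Growing a list: amortized O(1) per append.
result_list = []
for s in items:
    result_list.append(s.upper())

# Growing an array: np.append copies the whole array each call, O(n^2) total.
result_arr = np.array([], dtype='U5')
for s in items:
    result_arr = np.append(result_arr, s.upper())

assert list(result_arr) == result_list
```

For repeated appends, build a list first and convert to an array once at the end.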

So, takeaways from this:

  1. Python, while generally accepted as slow, is very performant for some common things. Numpy is generally quite fast, but is not optimized for everything.

  2. Read the docs. If there's more than one way of doing things (and there usually is), odds are one way is better for what you're trying to do.

  3. Don't blindly assume that vectorized code will be faster - always profile when you care about performance (this goes for any "optimization" tips).
