Converting a NumPy array to a set takes too long

Time: 2022-02-13 01:49:20

I'm trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(item)

and it takes a very long time compared to:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    item.tolist()

Why does it take so much longer to convert a NumPy array to a set than to a list? Both conversions should basically be O(n), shouldn't they?

2 Answers

#1


30  

TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (at the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before iterating is (much) faster.

To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) besides the value it represents. For example, the integer ten = 10 could look like this:

[image: diagram of the integer ten - the name references an int instance that stores the value 10]

The blue circle is the "name" you use in the Python interpreter for the variable ten, and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here, I left them out of the images).

A Python list is just a collection of Python objects; for example mylist = [1, 2, 3] would be stored like this:

[image: diagram of mylist - the list instance references three separate Python int objects 1, 2 and 3]

This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.

But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

[image: diagram of myarray - the NumPy array instance stores the raw values 1, 2 and 3 directly in its buffer]

The values 1, 2 and 3 are stored directly in the NumPy array instance.
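
You can verify this overhead directly (a small check, assuming CPython on a typical 64-bit build): each element of the array occupies exactly itemsize bytes in one flat buffer, while even a tiny Python int is a full object that is several times larger:

```python
import sys
import numpy as np

myarray = np.array([1, 2, 3])

# The array stores raw values in one contiguous buffer: itemsize bytes each.
print(myarray.itemsize)  # 8 on a typical 64-bit build (4 on some platforms)
print(myarray.nbytes)    # itemsize * 3 bytes for all three values

# A Python int is a complete object with bookkeeping fields around the value.
print(sys.getsizeof(1))  # typically 28 bytes in CPython
```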


With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:

Each time you access the next element of a list, the list just returns a stored object. That's very fast because the element already exists as a Python object (the list only needs to increment its reference count by one).

On the other hand, when you want an element of an array, a new Python "box" for the value, with all the bookkeeping stuff, needs to be created before it is returned. When you iterate over the array, one such box has to be created for each element:

[image: diagram of iteration over myarray - each raw value is wrapped in a freshly created Python object ("box") as it is returned]
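
This boxing is easy to observe (a small check, assuming CPython): the list hands back the very object it stores, while the array manufactures a fresh scalar object on every access:

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array([1, 2, 3])

# The list returns the stored object itself, so two accesses are identical.
print(lst[0] is lst[0])  # True

# The array has no element objects to hand out; each access creates a new box.
print(arr[0] is arr[0])  # False

# The box is a NumPy scalar, not a plain Python int.
print(type(arr[0]))      # e.g. <class 'numpy.int64'>
```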

Creating these boxes is slow, and it is the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values together with their boxes:

import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The set "constructor" just does an iteration over the object.
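
In other words, set(obj) behaves roughly like this hand-written loop (a simplified sketch - the real constructor is implemented in C, but it consumes the iterator the same way, boxing every array element along the way):

```python
import numpy as np

def set_from_iterable(obj):
    """Roughly what set(obj) does: iterate and add each item."""
    result = set()
    for item in obj:  # for a NumPy array, each item is a freshly boxed scalar
        result.add(item)
    return result

arr = np.array([3, 2, 3])
print(set_from_iterable(arr) == set(arr))  # True
```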

One thing I can't answer definitively is why the tolist method is so much faster. In the end, each value in the resulting Python list needs to be in a "Python box", so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

arr = np.arange(100000)

%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Each of these has O(n) runtime complexity but the constant factors are very different.

In your case you compared set() to tolist() - which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr), or set(arr.tolist()) to arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())

%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So if you want sets, you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique, but because your rows contain only 3 items, that will likely be slower here (for thousands of elements per row it could be much faster!).
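
For comparison, a per-row np.unique variant could look like the sketch below. Note that np.unique returns a sorted ndarray rather than a Python set, so it only helps when an array result is acceptable:

```python
import numpy as np

arr = np.random.randint(0, 1000, (10000, 3))

def tounique(arr):
    # np.unique sorts and deduplicates each row in C, but the result
    # is an ndarray, not a set - converting it back would cost boxing again.
    return [np.unique(line) for line in arr]

uniques = tounique(arr)
# Each entry holds the sorted distinct values of one row.
```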


In the comments you asked about numba, and yes, it's true that numba could speed this up. Numba supports typed sets (numeric types only), but that doesn't mean it will always be faster.

I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

[image: diagram of a typed numba set - the values are stored directly in the set, without Python boxes]

Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume numba does too).

Like a NumPy array, a numba set stores the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, and when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) approach:

import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, Numba creates a Python set - including "creating the boxes" for all values in the set (that's something numba hides).

Just FYI: the same boxing/unboxing happens if you pass lists to numba nopython functions or return lists from them. So what is an O(1) operation in Python becomes an O(n) operation with numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).

I assume that if you did return these sets from the function (not really possible right now, because numba currently doesn't support lists of sets) it would be slower (because it creates a numba set and later converts it to a Python set) or only marginally faster (if the numba-set -> Python-set conversion is really, really fast).

Personally, I would use numba for sets only if I don't need to return them from the function, can do all operations on the set inside the function, and only if all those set operations are supported in nopython mode. In any other case I wouldn't use numba here.


Just a note: from numpy import * should be avoided. When you do that, you hide several Python built-in functions (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np: the np. in front of function calls makes the code clearer and isn't much extra typing.
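
The shadowing is easy to trip over because NumPy's sum interprets its second positional argument as an axis, while the built-in sum treats it as a start value - so the same call silently changes meaning after a star import:

```python
# Built-in sum: the second argument is a start value.
print(sum(range(5), -1))  # 9  (0+1+2+3+4, starting from -1)

from numpy import *       # now `sum` is numpy.sum

# numpy.sum: the second positional argument is the axis.
print(sum(range(5), -1))  # 10  (sums along axis -1; the -1 is not added)
```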

#2


1  

Here is a way to speed things up: avoid the explicit loop and use pool.map from a thread pool (multiprocessing.dummy wraps the threading module in the multiprocessing API):

from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool, not separate processes
import multiprocessing

pool = ThreadPool(multiprocessing.cpu_count())  # one worker thread per CPU core
y = pool.map(set, x)                            # apply set() to every row of x
pool.close()
pool.join()
