Is there an efficient way to create an arbitrarily long numpy array where each row consists of n elements drawn from a list of length >= n? Each element in the list can be drawn only once per row.
For instance, if I have the list l = ['cat', 'mescaline', 'popcorn'], I want to be able to type something like np.random.pick_random(l, (3, 2), replace=False) and get the array array([['cat', 'popcorn'], ['cat', 'popcorn'], ['mescaline', 'cat']]).
Thank you.
3 Answers
#1
7
There are a couple of ways of doing this, each with its pros and cons; the following four are just off the top of my head (a minimal sketch of each follows the list below)...
- Python's own random.sample is simple and built in, though it may not be the fastest...
- numpy.random.permutation is again simple, but it creates a copy which we have to slice, ouch!
- numpy.random.shuffle is faster since it shuffles in place, but we still have to slice.
- numpy.random.sample is the fastest, but it only works on the interval 0 to 1, so we have to scale it up and convert it to ints to get the random indices, and at the end we still have to slice; note that scaling to the size we want does not generate a uniform random distribution.
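For reference, here is a minimal sketch of the four approaches (my addition, not part of the original answer, written against the same Python 2 setup the benchmarks below use; the same calls also work on Python 3):

import random
import numpy

values = range(50)          # the population
number_of_members = 20      # how many unique elements to draw

# 1. random.sample draws without replacement directly.
subset = random.sample(values, number_of_members)

# 2. numpy.random.permutation returns a shuffled copy, which we then slice.
subset = numpy.random.permutation(values)[:number_of_members]

# 3. numpy.random.shuffle shuffles a mutable sequence in place; slice afterwards.
pool = list(values)
numpy.random.shuffle(pool)
subset = pool[:number_of_members]

# 4. numpy.random.sample gives uniform floats in [0, 1); scale to indices,
#    cast to int and slice (these indices are not guaranteed to be unique).
indices = (numpy.random.sample(len(values)) * len(values)).astype(int)
subset = numpy.asarray(values)[indices][:number_of_members]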
Here are some benchmarks.
import timeit
from matplotlib import pyplot as plt

setup = \
"""
import numpy
import random
number_of_members = 20
values = range(50)
"""

number_of_repetitions = 20
array_sizes = (10, 200)

python_random_times = [timeit.timeit(stmt = "[random.sample(values, number_of_members) for index in xrange({0})]".format(array_size),
                                     setup = setup,
                                     number = number_of_repetitions)
                       for array_size in xrange(*array_sizes)]

numpy_permutation_times = [timeit.timeit(stmt = "[numpy.random.permutation(values)[:number_of_members] for index in xrange({0})]".format(array_size),
                                         setup = setup,
                                         number = number_of_repetitions)
                           for array_size in xrange(*array_sizes)]

numpy_shuffle_times = [timeit.timeit(stmt = \
"""
random_arrays = []
for index in xrange({0}):
    numpy.random.shuffle(values)
    random_arrays.append(values[:number_of_members])
""".format(array_size),
                                     setup = setup,
                                     number = number_of_repetitions)
                       for array_size in xrange(*array_sizes)]

numpy_sample_times = [timeit.timeit(stmt = \
"""
values = numpy.asarray(values)
random_arrays = [values[indices][:number_of_members]
                 for indices in (numpy.random.sample(({0}, len(values))) * len(values)).astype(int)]
""".format(array_size),
                                    setup = setup,
                                    number = number_of_repetitions)
                      for array_size in xrange(*array_sizes)]

line_0 = plt.plot(xrange(*array_sizes),
                  python_random_times,
                  color = 'black',
                  label = 'random.sample')
line_1 = plt.plot(xrange(*array_sizes),
                  numpy_permutation_times,
                  color = 'red',
                  label = 'numpy.random.permutation')
line_2 = plt.plot(xrange(*array_sizes),
                  numpy_shuffle_times,
                  color = 'yellow',
                  label = 'numpy.random.shuffle')
line_3 = plt.plot(xrange(*array_sizes),
                  numpy_sample_times,
                  color = 'green',
                  label = 'numpy.random.sample')

plt.xlabel('Number of Arrays')
plt.ylabel('Time (s) for %i repetitions' % number_of_repetitions)
plt.title('Different ways to sample.')
plt.legend()
plt.show()
and the result:
So it looks like numpy.random.permutation is the worst, not surprisingly; Python's own random.sample is holding its own; and it looks like a close race between numpy.random.shuffle and numpy.random.sample, with numpy.random.sample edging it out. Either should suffice. Even though numpy.random.sample has a higher memory footprint, I still prefer it, since I really don't need to build the arrays, I just need the random indices...
$ uname -a
Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
$ python --version
Python 2.6.1
$ python -c "import numpy; print numpy.__version__"
1.6.1
UPDATE
Unfortunately numpy.random.sample doesn't draw unique elements from a population, so you'll get repetition; just stick with shuffle, which is just as fast.
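To make the repetition concrete, here is a small assumed snippet (not from the original answer) showing why the scaled indices are not unique:

import numpy

values = numpy.asarray(['cat', 'mescaline', 'popcorn'])

# Uniform floats in [0, 1), scaled up to integer indices into values.
indices = (numpy.random.sample(len(values)) * len(values)).astype(int)

# Nothing stops the same index from appearing twice, e.g. [2, 2, 0],
# so a single draw can contain the same element more than once.
print(values[indices])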
UPDATE 2
If you want to remain within numpy to leverage some of its built-in functionality, just convert the values into a numpy array.
import numpy as np
values = ['cat', 'popcorn', 'mescaline']
number_of_members = 2
N = 1000000
random_arrays = np.asarray([values] * N)
_ = [np.random.shuffle(array) for array in random_arrays]
subset = random_arrays[:, :number_of_members]
Note that N here is quite large, so you are going to get repeated permutations; by permutations I mean the order of the values, not repeated values within a permutation. Fundamentally there is only a finite number of permutations of any given finite set: if we permute the whole set there are n! of them, and if we only select k elements there are n!/(n - k)!. Even if this weren't the case, meaning our set were much larger, we might still get repetitions depending on the random function's implementation, since shuffle/permutation/... only work with the current set and have no idea of the population. This may or may not be acceptable, depending on what you are trying to achieve; if you want a set of unique permutations, then you will have to generate that set and subsample it.
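To put numbers on that counting argument (a sketch of my own, using the values above): with n = 3 values and rows of k = 2 unique elements there are only 3!/(3 - 2)! = 6 distinct ordered rows, so with N = 1,000,000 rows each one comes back roughly 166,000 times on average.

import math

n = 3        # size of the population ['cat', 'popcorn', 'mescaline']
k = 2        # number_of_members
N = 1000000  # number of rows generated

# Distinct ordered rows of k unique elements drawn from n: n!/(n - k)!
distinct_rows = math.factorial(n) // math.factorial(n - k)

print("%d distinct rows, each repeated ~%d times" % (distinct_rows, N // distinct_rows))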
#2
6
Here's a way to do it using numpy's np.random.randint:
In [68]: l = np.array(['cat', 'mescaline', 'popcorn'])

In [69]: l[np.random.randint(len(l), size=(3,2))]
Out[69]:
array([['cat', 'popcorn'],
       ['popcorn', 'popcorn'],
       ['mescaline', 'cat']],
      dtype='|S9')
EDIT: given the additional detail that each element should appear at most once in each row:
This is not very space efficient, do you need something better?
In [29]: l = np.array(['cat', 'mescaline', 'popcorn'])

In [30]: np.array([np.random.choice(l, 3, replace=False) for i in xrange(5)])
Out[30]:
array([['mescaline', 'popcorn', 'cat'],
       ['mescaline', 'popcorn', 'cat'],
       ['popcorn', 'mescaline', 'cat'],
       ['mescaline', 'cat', 'popcorn'],
       ['mescaline', 'cat', 'popcorn']],
      dtype='|S9')
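If the Python-level loop over rows becomes a bottleneck, one common vectorized alternative (my addition, not part of this answer) is to argsort a matrix of random keys: every row of the argsort is an independent random permutation of the indices, so slicing the first n columns yields n distinct picks per row.

import numpy as np

l = np.array(['cat', 'mescaline', 'popcorn'])
rows, n = 5, 2

# One random key per (row, element); argsorting each row gives a random
# permutation of range(len(l)), so the first n columns are n distinct indices.
keys = np.random.rand(rows, len(l))
indices = np.argsort(keys, axis=1)[:, :n]

print(l[indices])   # shape (rows, n), no repeated element within a row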
#3
2
>>> import numpy
>>> l = numpy.array(['cat', 'mescaline', 'popcorn'])
>>> l[numpy.random.randint(0, len(l), (3, 2))]
array([['popcorn', 'mescaline'],
       ['mescaline', 'popcorn'],
       ['cat', 'cat']],
      dtype='|S9')
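As with the pre-edit snippet in answer #2, this randint indexing can repeat an element within a row, which the question rules out. On newer NumPy (1.20+, so not available when these answers were written), one way to get per-row draws without replacement is Generator.permuted; a sketch under that assumption:

import numpy as np

l = np.array(['cat', 'mescaline', 'popcorn'])
rows, n = 3, 2

rng = np.random.default_rng()

# Tile l into a (rows, len(l)) matrix, shuffle each row independently
# along axis 1, then keep the first n columns.
result = rng.permuted(np.tile(l, (rows, 1)), axis=1)[:, :n]
print(result)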