Hi, I'm trying to map an array of numbers to their ranks. For example, [2,5,3] would become [0,2,1].
I'm currently using np.where to look up each rank in an array, but this takes a very long time because I have to do it for a very large array (over 2 million data points).
If anyone has any suggestions on how I could achieve this, I'd greatly appreciate it!
[EDIT] This is what the code to change a specific row currently looks like:
def change_nodes(row):
    a = row
    new_a = node_map[node_map[:,1] == a][0][0]
    return new_a
[EDIT 2] Duplicate numbers should also share the same rank.
[EDIT 3] Additionally, each distinct value should count only once towards the ranking. For example, the ranks for the list [2,3,3,4,5,7,7,7,7,8,1] would be:
{1:0, 2:1, 3:2, 4:3, 5:4, 7:5, 8:6 }
3 solutions
#1
2
Here is an efficient solution, along with a comparison against the solution that uses index (the index solution is also incorrect under the restriction added in edit 3 of the question).
import numpy as np

def rank1(x):
    # Sort the indices 0, 1, 2, ... using x[i] as the key
    y = sorted(range(len(x)), key=lambda i: x[i])
    # Map each value of x to a rank. If a value is already associated with a
    # rank, the rank is updated. Iterate in reversed order so we get the
    # smallest rank for each value.
    rank = {x[y[i]]: i for i in range(len(y) - 1, -1, -1)}
    # Remove gaps in the ranks
    kv = sorted(rank.items(), key=lambda p: p[1])
    for i in range(len(kv)):
        kv[i] = (kv[i][0], i)
    rank = {p[0]: p[1] for p in kv}
    # Preallocate an array to fill with ranks
    r = np.zeros((len(x),), dtype=int)
    for i, v in enumerate(x):
        r[i] = rank[v]
    return r

def rank2(x):
    x_sorted = sorted(x)
    # Create a new list to preserve x
    rank = list(x)
    for v in x_sorted:
        rank[rank.index(v)] = x_sorted.index(v)
    return rank
Comparison results
>>> d = np.arange(1000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
100 loops, best of 3: 1.97 ms per loop
>>> %timeit rank2(d)
1 loops, best of 3: 226 ms per loop
>>> d = np.arange(10000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
10 loops, best of 3: 32 ms per loop
>>> %timeit rank2(d)
1 loops, best of 3: 24.4 s per loop
>>> d = np.arange(100000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
1 loops, best of 3: 433 ms per loop
>>> d = np.arange(2000000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
1 loops, best of 3: 11.2 s per loop
The problem with the index solution is that its time complexity is O(n^2), because every call to index is a linear scan inside a loop over all n elements. The time complexity of my solution is O(n lg n), that is, the sort time.
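For an array of about 2 million values, the same O(n lg n) bound can probably be reached without any Python-level loop. The following is a minimal sketch, assuming a 1-D input and using np.unique with return_inverse=True; the name dense_rank is hypothetical and is not part of the answer above.

```python
import numpy as np

def dense_rank(x):
    """Give equal values the same rank and leave no gaps between ranks.

    Hypothetical helper for illustration; assumes a 1-D input.
    """
    x = np.asarray(x)
    # np.unique sorts the distinct values. `inverse` holds, for every element
    # of x, the position of its value in that sorted array, i.e. its rank.
    _, inverse = np.unique(x, return_inverse=True)
    return inverse

print(dense_rank([2, 3, 3, 4, 5, 7, 7, 7, 7, 8, 1]))
# [1 2 2 3 4 5 5 5 5 6 0]
```

This matches the mapping requested in edit 3 of the question: 1 gets rank 0, 2 gets rank 1, both 3s get rank 2, and so on.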
#2
3
What you want to use is numpy.argsort:
>>> import numpy as np
>>> x = np.array([2, 5, 3])
>>> x.argsort()
array([0, 2, 1])
See this question and its answers for thoughts on adjusting how ties are handled.
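One caveat worth illustrating: argsort returns the indices that would sort the array, and these only happen to coincide with the ranks for [2, 5, 3]. The sketch below uses a different input to show the difference, and applies argsort a second time to turn the sort order into ranks. It assumes a 1-D array with distinct values and is an illustration, not necessarily what the answer intended.

```python
import numpy as np

x = np.array([5, 2, 3])

order = x.argsort()      # indices that would sort x
ranks = order.argsort()  # position of each element of x in the sorted order

print(order)  # [1 2 0]
print(ranks)  # [2 0 1]
```

With this idiom, tied values receive different ranks, so on its own it does not satisfy the requirements added in edits 2 and 3 of the question.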
#3
2
I have a variant with only vanilla Python:
a = [2,5,3]
aSORT = list(a)
aSORT.sort()
for x in aSORT:
    a[a.index(x)] = aSORT.index(x)
print(a)
In my testing, the numpy version posted here took 0.1406 seconds to rank the list [2,5,3,62,5,2,5,1000,100,-1,-9], compared to only 0.0154 seconds with my method.