I was graphing out letter frequency in some large academic documents. As part of this process, is was sorting letters from large clippings of those documents into alphabetical order. I was using Python's
built in sorted function, and I started to wonder if I could make it faster. I then wrote the following function:
我正在绘制一些大型学术文件中的字母频率。作为此过程的一部分,是将这些文档的大量剪辑中的字母排序为字母顺序。我使用的是Python内置的排序函数,我开始怀疑是否可以让它更快。然后我写了以下函数:
def count_sort(l):
items = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,'j':0,'k':0,'l':0,'m':
0,'n':0,'o':0,'p':0,'q':0,'r':0,'s':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z'
:0}
for item in l:
items[item] += 1
sort_l = []
for key in items:
sort_l += key*items[key]
return sort_l
When testing this code versus sorted
on a 10000
letter long string of text, it was almost 20X
faster.
测试此代码时,对10000字母长的文本字符串进行排序,速度快了近20倍。
With such a performance boost, why isn't this sorting algorithm in the standard libs
?
有了这样的性能提升,为什么这个排序算法不在标准库中呢?
5 个解决方案
#1
You have rediscovered the counting sort algorithm.
您重新发现了计数排序算法。
To quote Wikipedia:
引用*:
For problem instances in which the maximum key value is significantly smaller than the number of items, counting sort can be highly space-efficient, as the only storage it uses other than its input and output arrays is the Count array which uses space O(k).
对于最大键值明显小于项目数的问题实例,计数排序可以高度节省空间,因为它使用除输入和输出数组之外的唯一存储是使用空间O(k)的Count数组)。
The counting sort algorithm becomes more and more efficient(relatively) the greater the difference is between the total number of items that being sorted, and the number of unique items being sorted.
计数排序算法变得越来越有效(相对),正在排序的项目总数与被排序的唯一项目的数量之间的差异越大。
You can see why this must be looking at your own code, or the Wikipedia example code:
您可以看到为什么必须查看您自己的代码或Wikipedia示例代码:
# calculate the histogram of key frequencies:
for x in input:
count[key(x)] += 1
# calculate the starting index for each key:
total = 0
for i in range(k): # i = 0, 1, ... k-1
oldCount = count[i]
count[i] = total
total += oldCount
# copy to output array, preserving order of inputs with equal keys:
for x in input:
output[count[key(x)]] = x
count[key(x)] += 1
return output
You have 2 for loops in your function: the first to iterate over the letters you are sorting, and the second to iterate over the items dictionary. As I posited previously, this makes sense of the items dictionary is considerably smaller than the list you are sorting, but it quickly becomes very inefficient if the as the number number of unique elements increases relative to the number of items being sorted.
你的函数中有2个for循环:第一个迭代你正在排序的字母,第二个循环遍历items字典。正如我之前提到的那样,这意味着项目字典比您正在排序的列表要小得多,但如果相对于被排序的项目数量的唯一元素的数量增加,它很快变得非常低效。
Like @BrenBarn answered, this only when you know exactly what characters to expect and you're willing to ignore any other characters. While it seems like counting sort is highly efficient in the example you gave, the problem of sorting letters is hardly the most common sorting problem.
就像@BrenBarn回答的那样,只有当你确切地知道了什么字符时,你才愿意忽略任何其他字符。虽然在你给出的例子中计算排序是高效的,但排序字母的问题并不是最常见的排序问题。
Below I have fixed your function to print the letters by iterating through a list rather than iterating through the keys in a dictionary(as Python's dictionaries are not ordered)
下面我修复了你的函数,通过遍历列表而不是遍历字典中的键来打印字母(因为Python的字典没有被排序)
def count_sort(l):
letters = [chr(i) for i in range(97, 122)]
items = dict()
for letter in letters:
items[letter] = 0
for item in l:
items[item] += 1
sort_l = list()
for letter in letters:
sort_l.extend(letter*items[letter])
return sort_l
#2
As mentioned in the comments and answers above you can have rediscovered the counting sort but you haven't discovered the python collections library:
正如上面的评论和答案中所提到的,您可以重新发现计数排序,但您还没有发现python集合库:
from collections import Counter
def count_sort(l):
items = Counter()
for item in l:
items[item] += 1
sort_l = []
for key in items.keys().sorted():
sort_l += key*items[key]
return sort_l
The major difference is that you will not get any entries for missing entries, you may also wish to change:
主要区别在于您不会获得任何缺失条目的条目,您可能还希望更改:
sort_l += key*items[key]
to:
sort_l.append((key, items[key]))
to return a sorted list of keys and counts. Another good trick would be to return a collections.OrderedDict
object.
返回键和计数的排序列表。另一个好方法是返回collections.OrderedDict对象。
#3
Two reasons.
One, your algorithm relies on knowing ahead of time what elements will exist in the list to be sorted. If the list contained an uppercase character or a digit or anything else, your code would fail with an error.
其一,您的算法依赖于提前知道要排序的列表中将存在哪些元素。如果列表包含大写字符或数字或其他任何内容,则代码将失败并显示错误。
The other reason is that your code relies on the order of the dictionary remaining the same. Dictionaries are not ordered, so there is no guarantee that for key in items
will iterate over the keys in alphabetical order. Your "sorted" list might not wind up sorted at all.
另一个原因是你的代码依赖于字典的顺序保持不变。字典不是有序的,因此无法保证键入项目将按字母顺序迭代键。你的“排序”列表可能根本没有排序。
#4
As stated in other answer, it is known as Counting Sort...
如其他答案所述,它被称为Counting Sort ...
Other than the reasons / limitations they have mentioned,
除了他们提到的原因/限制,
Counting sort somehow use the discrete items as array index and count them
计数排序以某种方式使用离散项作为数组索引并计算它们
So if the item's domain is something like double / floating point, then the implementation may be hard, at least maybe you have to use Map() to map the number into valid array index, and Map() itself is O(lg N), turns out the complexity is still O(N lg N)...
所以如果项目的域类似于双/浮点,那么实现可能很难,至少可能你必须使用Map()将数字映射到有效的数组索引,而Map()本身就是O(lg N) ,事实证明复杂性仍为O(N lg N)......
#5
If you're interested in the counting sort and how it works with other sorting algorithms, you should check this analysis of the Counting sort and Radix sort and the instances in which each of them are useful:
如果您对计数排序以及它如何与其他排序算法一起工作感兴趣,您应该检查Counting排序和Radix排序的分析以及每个排序算法对它们有用的实例:
Many times counting sort can be useful as a sub-routine of a larger sorting (most notably Radix sort)algorithm in instances where it would be impractical by itself.
计数排序很多时候可以作为更大排序(最明显的是Radix排序)算法的子例程,在这种情况下它本身是不切实际的。
#1
You have rediscovered the counting sort algorithm.
您重新发现了计数排序算法。
To quote Wikipedia:
引用*:
For problem instances in which the maximum key value is significantly smaller than the number of items, counting sort can be highly space-efficient, as the only storage it uses other than its input and output arrays is the Count array which uses space O(k).
对于最大键值明显小于项目数的问题实例,计数排序可以高度节省空间,因为它使用除输入和输出数组之外的唯一存储是使用空间O(k)的Count数组)。
The counting sort algorithm becomes more and more efficient(relatively) the greater the difference is between the total number of items that being sorted, and the number of unique items being sorted.
计数排序算法变得越来越有效(相对),正在排序的项目总数与被排序的唯一项目的数量之间的差异越大。
You can see why this must be looking at your own code, or the Wikipedia example code:
您可以看到为什么必须查看您自己的代码或Wikipedia示例代码:
# calculate the histogram of key frequencies:
for x in input:
count[key(x)] += 1
# calculate the starting index for each key:
total = 0
for i in range(k): # i = 0, 1, ... k-1
oldCount = count[i]
count[i] = total
total += oldCount
# copy to output array, preserving order of inputs with equal keys:
for x in input:
output[count[key(x)]] = x
count[key(x)] += 1
return output
You have 2 for loops in your function: the first to iterate over the letters you are sorting, and the second to iterate over the items dictionary. As I posited previously, this makes sense of the items dictionary is considerably smaller than the list you are sorting, but it quickly becomes very inefficient if the as the number number of unique elements increases relative to the number of items being sorted.
你的函数中有2个for循环:第一个迭代你正在排序的字母,第二个循环遍历items字典。正如我之前提到的那样,这意味着项目字典比您正在排序的列表要小得多,但如果相对于被排序的项目数量的唯一元素的数量增加,它很快变得非常低效。
Like @BrenBarn answered, this only when you know exactly what characters to expect and you're willing to ignore any other characters. While it seems like counting sort is highly efficient in the example you gave, the problem of sorting letters is hardly the most common sorting problem.
就像@BrenBarn回答的那样,只有当你确切地知道了什么字符时,你才愿意忽略任何其他字符。虽然在你给出的例子中计算排序是高效的,但排序字母的问题并不是最常见的排序问题。
Below I have fixed your function to print the letters by iterating through a list rather than iterating through the keys in a dictionary(as Python's dictionaries are not ordered)
下面我修复了你的函数,通过遍历列表而不是遍历字典中的键来打印字母(因为Python的字典没有被排序)
def count_sort(l):
letters = [chr(i) for i in range(97, 122)]
items = dict()
for letter in letters:
items[letter] = 0
for item in l:
items[item] += 1
sort_l = list()
for letter in letters:
sort_l.extend(letter*items[letter])
return sort_l
#2
As mentioned in the comments and answers above you can have rediscovered the counting sort but you haven't discovered the python collections library:
正如上面的评论和答案中所提到的,您可以重新发现计数排序,但您还没有发现python集合库:
from collections import Counter
def count_sort(l):
items = Counter()
for item in l:
items[item] += 1
sort_l = []
for key in items.keys().sorted():
sort_l += key*items[key]
return sort_l
The major difference is that you will not get any entries for missing entries, you may also wish to change:
主要区别在于您不会获得任何缺失条目的条目,您可能还希望更改:
sort_l += key*items[key]
to:
sort_l.append((key, items[key]))
to return a sorted list of keys and counts. Another good trick would be to return a collections.OrderedDict
object.
返回键和计数的排序列表。另一个好方法是返回collections.OrderedDict对象。
#3
Two reasons.
One, your algorithm relies on knowing ahead of time what elements will exist in the list to be sorted. If the list contained an uppercase character or a digit or anything else, your code would fail with an error.
其一,您的算法依赖于提前知道要排序的列表中将存在哪些元素。如果列表包含大写字符或数字或其他任何内容,则代码将失败并显示错误。
The other reason is that your code relies on the order of the dictionary remaining the same. Dictionaries are not ordered, so there is no guarantee that for key in items
will iterate over the keys in alphabetical order. Your "sorted" list might not wind up sorted at all.
另一个原因是你的代码依赖于字典的顺序保持不变。字典不是有序的,因此无法保证键入项目将按字母顺序迭代键。你的“排序”列表可能根本没有排序。
#4
As stated in other answer, it is known as Counting Sort...
如其他答案所述,它被称为Counting Sort ...
Other than the reasons / limitations they have mentioned,
除了他们提到的原因/限制,
Counting sort somehow use the discrete items as array index and count them
计数排序以某种方式使用离散项作为数组索引并计算它们
So if the item's domain is something like double / floating point, then the implementation may be hard, at least maybe you have to use Map() to map the number into valid array index, and Map() itself is O(lg N), turns out the complexity is still O(N lg N)...
所以如果项目的域类似于双/浮点,那么实现可能很难,至少可能你必须使用Map()将数字映射到有效的数组索引,而Map()本身就是O(lg N) ,事实证明复杂性仍为O(N lg N)......
#5
If you're interested in the counting sort and how it works with other sorting algorithms, you should check this analysis of the Counting sort and Radix sort and the instances in which each of them are useful:
如果您对计数排序以及它如何与其他排序算法一起工作感兴趣,您应该检查Counting排序和Radix排序的分析以及每个排序算法对它们有用的实例:
Many times counting sort can be useful as a sub-routine of a larger sorting (most notably Radix sort)algorithm in instances where it would be impractical by itself.
计数排序很多时候可以作为更大排序(最明显的是Radix排序)算法的子例程,在这种情况下它本身是不切实际的。