I am trying to find a frequency of each symbol in any given text using an algorithm of O(n) complexity. My algorithm looks like:
我试图在任何给定的文本中使用O(n)复杂度的算法来查找每个符号的频率。我的算法的样子:
s = len(text)
P = 1.0/s
freqs = {}
for char in text:
try:
freqs[char]+=P
except:
freqs[char]=P
but I doubt that this dictionary-method is fast enough, because it depends on the underlying implementation of the dictionary methods. Is this the fastest method?
但是我怀疑这种字典方法是否足够快,因为它依赖于字典方法的底层实现。这是最快的方法吗?
UPDATE: there is no increase in speed if collections and integers are used. It is because the algorithm is already of O(n) complexity, so no essential speedup is possible.
更新:如果使用集合和整数,速度不会增加。这是因为该算法已经具有复杂度,所以没有必要的加速。
For example, results for 1MB text:
例如,1MB文本的结果:
without collections:
real 0m0.695s
with collections:
real 0m0.625s
12 个解决方案
#1
43
Performance comparison
Note: time in the table doesn't include the time it takes to load files.
注意:表中的时间不包括加载文件所需的时间。
| approach | american-english, | big.txt, | time w.r.t. defaultdict |
| | time, seconds | time, seconds | |
|----------------+-------------------+---------------+-------------------------|
| Counter | 0.451 | 3.367 | 3.6 |
| setdefault | 0.348 | 2.320 | 2.5 |
| list | 0.277 | 1.822 | 2 |
| try/except | 0.158 | 1.068 | 1.2 |
| defaultdict | 0.141 | 0.925 | 1 |
| numpy | 0.012 | 0.076 | 0.082 |
| S.Mark's ext. | 0.003 | 0.019 | 0.021 |
| ext. in Cython | 0.001 | 0.008 | 0.0086 |
#+TBLFM: $4=$3/@7$3;%.2g
The files used: '/usr/share/dict/american-english'
and 'big.txt'
.
使用的文件:“/usr/share/dict/美式英语”和“big.txt”。
The script that compares 'Counter', 'setdefault', 'list', 'try/except', 'defaultdict', 'numpy', 'cython' -based, and @S.Mark's solutions is at http://gist.github.com/347000
比较“Counter”、“setdefault”、“list”、“try/except”、“defaultdict”、“numpy”、“cython”和“@S”的脚本。Mark的解决方案是http://gist.github.com/347000。
The fastest solution is Python extension written in Cython:
最快的解决方案是在Cython中编写的Python扩展:
import cython
@cython.locals(
chars=unicode,
i=cython.Py_ssize_t,
L=cython.Py_ssize_t[0x10000])
def countchars_cython(chars):
for i in range(0x10000): # unicode code points > 0xffff are not supported
L[i] = 0
for c in chars:
L[c] += 1
return {unichr(i): L[i] for i in range(0x10000) if L[i]}
Previous comparison:
* python (dict) : 0.5 seconds
* python (list) : 0.5 (ascii) (0.2 if read whole file in memory)
* perl : 0.5
* python (numpy): 0.07
* c++ : 0.05
* c : 0.008 (ascii)
Input data:
输入数据:
$ tail /usr/share/dict/american-english
éclat's
élan
élan's
émigré
émigrés
épée
épées
étude
étude's
études
$ du -h /usr/share/dict/american-english
912K /usr/share/dict/american-english
python (Counter): 0.5 seconds
#!/usr/bin/env python3.1
import collections, fileinput, textwrap
chars = (ch for word in fileinput.input() for ch in word.rstrip())
# faster (0.4s) but less flexible: chars = open(filename).read()
print(textwrap.fill(str(collections.Counter(chars)), width=79))
Run it:
运行该程序:
$ time -p python3.1 count_char.py /usr/share/dict/american-english
Counter({'e': 87823, 's': 86620, 'i': 66548, 'a': 62778, 'n': 56696, 'r': 56286, 't': 51588, 'o': 48425, 'l': 39914, 'c': 30020, 'd': 28068, 'u': 25810, "'": 24511, 'g': 22262, 'p': 20917, 'm': 20747, 'h': 18453, 'b': 14137, 'y': 12367, 'f': 10049, 'k': 7800, 'v': 7573, 'w': 6924, 'z': 3088, 'x': 2082, 'M': 1686, 'C': 1549, 'S': 1515, 'q': 1447, 'B': 1387, 'j': 1376, 'A': 1345, 'P': 974, 'L': 912, 'H': 860, 'T': 858, 'G': 811, 'D': 809, 'R': 749, 'K': 656, 'E': 618, 'J': 539, 'N': 531, 'W': 507, 'F': 502, 'O': 354, 'I': 344, 'V': 330, 'Z': 150, 'Y': 140, 'é': 128, 'U': 117, 'Q': 63, 'X': 42, 'è': 29, 'ö': 12, 'ü': 12, 'ó': 10, 'á': 10, 'ä': 7, 'ê': 6, 'â': 6, 'ñ': 6, 'ç': 4, 'å': 3, 'û': 3, 'í': 2, 'ô': 2, 'Å': 1}) real 0.44 user 0.43 sys 0.01
perl: 0.5 seconds
time -p perl -MData::Dumper -F'' -lanwe'$c{$_}++ for (@F);
END{ $Data::Dumper::Terse = 1; $Data::Dumper::Indent = 0; print Dumper(\%c) }
' /usr/share/dict/american-english
Output:
输出:
{'S' => 1515,'K' => 656,'' => 29,'d' => 28068,'Y' => 140,'E' => 618,'y' => 12367,'g' => 22262,'e' => 87823,'' => 2,'J' => 539,'' => 241,'' => 3,'' => 6,'' => 4,'' => 128,'D' => 809,'q' => 1447,'b' => 14137,'z' => 3088,'w' => 6924,'Q' => 63,'' => 10,'M' => 1686,'C' => 1549,'' => 10,'L' => 912,'X' => 42,'P' => 974,'' => 12,'\'' => 24511,'' => 6,'a' => 62778,'T' => 858,'N' => 531,'j' => 1376,'Z' => 150,'u' => 25810,'k' => 7800,'t' => 51588,'' => 6,'W' => 507,'v' => 7573,'s' => 86620,'B' => 1387,'H' => 860,'c' => 30020,'' => 12,'I' => 344,'' => 3,'G' => 811,'U' => 117,'F' => 502,'' => 2,'r' => 56286,'x' => 2082,'V' => 330,'h' => 18453,'f' => 10049,'' => 1,'i' => 66548,'A' => 1345,'O' => 354,'n' => 56696,'m' => 20747,'l' => 39914,'' => 7,'p' => 20917,'R' => 749,'o' => 48425} real 0.51 user 0.49 sys 0.02
python (numpy): 0.07 seconds
Based on Ants Aasma's answer (modified to support unicode):
基于蚂蚁Aasma的回答(修改为支持unicode):
#!/usr/bin/env python
import codecs, itertools, operator, sys
import numpy
filename = sys.argv[1] if len(sys.argv)>1 else '/usr/share/dict/american-english'
# ucs2 or ucs4 python?
dtype = {2: numpy.uint16, 4: numpy.uint32}[len(buffer(u"u"))]
# count ordinals
text = codecs.open(filename, encoding='utf-8').read()
a = numpy.frombuffer(text, dtype=dtype)
counts = numpy.bincount(a)
# pretty print
counts = [(unichr(i), v) for i, v in enumerate(counts) if v]
counts.sort(key=operator.itemgetter(1))
print ' '.join('("%s" %d)' % c for c in counts if c[0] not in ' \t\n')
Output:
输出:
("Å" 1) ("í" 2) ("ô" 2) ("å" 3) ("û" 3) ("ç" 4) ("â" 6) ("ê" 6) ("ñ" 6) ("ä" 7) ("á" 10) ("ó" 10) ("ö" 12) ("ü" 12) ("è" 29) ("X" 42) ("Q" 63) ("U" 117) ("é" 128) ("Y" 140) ("Z" 150) ("V" 330) ("I" 344) ("O" 354) ("F" 502) ("W" 507) ("N" 531) ("J" 539) ("E" 618) ("K" 656) ("R" 749) ("D" 809) ("G" 811) ("T" 858) ("H" 860) ("L" 912) ("P" 974) ("A" 1345) ("j" 1376) ("B" 1387) ("q" 1447) ("S" 1515) ("C" 1549) ("M" 1686) ("x" 2082) ("z" 3088) ("w" 6924) ("v" 7573) ("k" 7800) ("f" 10049) ("y" 12367) ("b" 14137) ("h" 18453) ("m" 20747) ("p" 20917) ("g" 22262) ("'" 24511) ("u" 25810) ("d" 28068) ("c" 30020) ("l" 39914) ("o" 48425) ("t" 51588) ("r" 56286) ("n" 56696) ("a" 62778) ("i" 66548) ("s" 86620) ("e" 87823)
real 0.07
user 0.06
sys 0.01
c++: 0.05 seconds
// $ g++ *.cc -lboost_program_options
// $ ./a.out /usr/share/dict/american-english
#include <iostream>
#include <fstream>
#include <cstdlib> // exit
#include <boost/program_options/detail/utf8_codecvt_facet.hpp>
#include <boost/tr1/unordered_map.hpp>
#include <boost/foreach.hpp>
int main(int argc, char* argv[]) {
using namespace std;
// open input file
if (argc != 2) {
cerr << "Usage: " << argv[0] << " <filename>\n";
exit(2);
}
wifstream f(argv[argc-1]);
// assume the file has utf-8 encoding
locale utf8_locale(locale(""),
new boost::program_options::detail::utf8_codecvt_facet);
f.imbue(utf8_locale);
// count characters frequencies
typedef std::tr1::unordered_map<wchar_t, size_t> hashtable_t;
hashtable_t counts;
for (wchar_t ch; f >> ch; )
counts[ch]++;
// print result
wofstream of("output.utf8");
of.imbue(utf8_locale);
BOOST_FOREACH(hashtable_t::value_type i, counts)
of << "(" << i.first << " " << i.second << ") ";
of << endl;
}
Result:
结果:
$ cat output.utf8
(í 2) (O 354) (P 974) (Q 63) (R 749) (S 1,515) (ñ 6) (T 858) (U 117) (ó 10) (ô 2) (V 330) (W 507) (X 42) (ö 12) (Y 140) (Z 150) (û 3) (ü 12) (a 62,778) (b 14,137) (c 30,020) (d 28,068) (e 87,823) (f 10,049) (g 22,262) (h 18,453) (i 66,548) (j 1,376) (k 7,800) (l 39,914) (m 20,747) (n 56,696) (o 48,425) (p 20,917) (q 1,447) (r 56,286) (s 86,620) (t 51,588) (u 25,810) (Å 1) (' 24,511) (v 7,573) (w 6,924) (x 2,082) (y 12,367) (z 3,088) (A 1,345) (B 1,387) (C 1,549) (á 10) (â 6) (D 809) (E 618) (F 502) (ä 7) (å 3) (G 811) (H 860) (ç 4) (I 344) (J 539) (è 29) (K 656) (é 128) (ê 6) (L 912) (M 1,686) (N 531)
c (ascii): 0.0079 seconds
// $ gcc -O3 cc_ascii.c -o cc_ascii && time -p ./cc_ascii < input.txt
#include <stdio.h>
enum { N = 256 };
size_t counts[N];
int main(void) {
// count characters
int ch = -1;
while((ch = getchar()) != EOF)
++counts[ch];
// print result
size_t i = 0;
for (; i < N; ++i)
if (counts[i])
printf("('%c' %zu) ", (int)i, counts[i]);
return 0;
}
#2
16
How about avoiding float operations inside the loop and do it after everything is done?
如何避免在循环中执行浮动操作,并在完成所有操作之后执行?
By that way, you could just do +1 everytime, and its should be faster.
通过这种方式,你可以每次都做+1,它应该更快。
And better use collections.defaultdict as S.Lott advised.
最好使用collections.defaultdict为S。洛特建议。
freqs=collections.defaultdict(int)
for char in text:
freqs[char]+=1
Or You may want to try, collections.Counter in python 2.7+
或者你可能想尝试,收集。计数器在python 2.7 +
>>> collections.Counter("xyzabcxyz")
Counter({'y': 2, 'x': 2, 'z': 2, 'a': 1, 'c': 1, 'b': 1})
Or
或
You may try psyco, which do just-in-time compiling for python. You have loops, so I think you would get some performance gain with psyco
您可以尝试psyco,它可以实时地为python编译。你有循环,我想你会得到一些psyco的性能增益。
Edit 1:
编辑1:
I did some benchmarks base on big.txt (~6.5 MB) which is used in spelling corrector by peter norvig
我做了一些基准测试。txt (~6.5 MB),由peter norvig在拼写校正器中使用。
Text Length: 6488666
dict.get : 11.9060001373 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
if char in dict : 9.71799993515 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
dict try/catch : 7.35899996758 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
collections.default : 7.29699993134 s
93 chars defaultdict(<type 'int'>, {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
CPU Specs: 1.6GHz Intel Mobile Atom CPU
CPU规格:1.6GHz英特尔移动Atom处理器。
According to that, dict.get
is slowest and collections.defaultdict
is fastest, try/except
is also the fast one.
根据这个,dict.get是最慢的,而collections.defaultdict是最快的,try/除了也是fast。
Edit 2:
编辑2:
Added collections.Counter
benchmarks, Its slower than dict.get
and took 15s in my laptop
添加收藏。计数器基准测试,它比dict慢,在我的笔记本电脑上取15秒。
collections.Counter : 15.3439998627 s
93 chars Counter({u' ': 1036511, u'e': 628234, u't': 444459, u'a': 395872, u'o': 382683, u'n': 362397, u'i': 348464,
#3
10
I've written Char Counter C Extension to Python, looks like 300x faster than collections.Counter
and 150x faster than collections.default(int)
我已经为Python编写了Char Counter C扩展,看起来比集合快300倍。计数器和150x比集合。默认值(int)
C Char Counter : 0.0469999313354 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8,
Here is Char Counter C Extension Codes
这是字符计数器C扩展码。
static PyObject *
CharCounter(PyObject *self, PyObject *args, PyObject *keywds)
{
wchar_t *t1;unsigned l1=0;
if (!PyArg_ParseTuple(args,"u#",&t1,&l1)) return NULL;
PyObject *resultList,*itemTuple;
for(unsigned i=0;i<=0xffff;i++)char_counter[i]=0;
unsigned chlen=0;
for(unsigned i=0;i<l1;i++){
if(char_counter[t1[i]]==0)char_list[chlen++]=t1[i];
char_counter[t1[i]]++;
}
resultList = PyList_New(0);
for(unsigned i=0;i<chlen;i++){
itemTuple = PyTuple_New(2);
PyTuple_SetItem(itemTuple, 0,PyUnicode_FromWideChar(&char_list[i],1));
PyTuple_SetItem(itemTuple, 1,PyInt_FromLong(char_counter[char_list[i]]));
PyList_Append(resultList, itemTuple);
Py_DECREF(itemTuple);
};
return resultList;
}
Where char_counter, and char_list are malloc-ated at module level, so no need to malloc every time when function calls.
其中char_counter和char_list在模块级别上是错误的,所以每次函数调用时都不需要malloc。
char_counter=(unsigned*)malloc(sizeof(unsigned)*0x10000);
char_list=(wchar_t*)malloc(sizeof(wchar_t)*0x10000);
It returns a List with Tuples
它返回一个带有元组的列表。
[(u'T', 16282), (u'h', 287323), (u'e', 628234), (u' ', 1036511), (u'P', 8946), (u'r', 303977), (u'o', 382683), ...
To convert to dict format, just dict()
will do.
要转换为dict格式,只需命令()就可以了。
dict(CharCounter(text))
PS: Benchmark included the time converting to dict
基准测试包括时间转换到指令。
CharCounter
accept only Python Unicode String u""
, if the text is utf8, need to do .decode("utf8")
in advance.
CharCounter只接受Python Unicode字符串u“”,如果文本是utf8,则需要提前进行。decode(“utf8”)。
Input Supports Unicode until Basic Multilingual Plane (BMP) - 0x0000 to 0xFFFF
输入支持Unicode,直到基本的多语言平面(BMP) - 0x0000到0xFFFF。
#4
6
No it's not the fastest, because you know that the characters have a limited range and you could use a list and direct indexing, using the numeric representation of the character, to store the frequencies.
不,它不是最快的,因为你知道字符的范围很有限,你可以使用一个列表和直接索引,使用字符的数字表示来存储频率。
#5
5
It is very, very hard to beat dict
. It is very highly tuned since almost everything in Python is dict-based.
这是非常非常难以击败的命令。它是非常高度调谐的,因为几乎所有的Python都是基于文本的。
#6
4
I'm not familiar with python, but for finding frequencies, unless you know the range of frequencies (in which case you can use an array), dictionary is the way to go.
If you know your characters in a unicode, ASCII, etc. range, you can define an array with the correct number of values.
However, this will change the space complexity of this from O(n) to O(possible n), but you will earn a time complexity improvement from O(n*(dictionary extraction/insertion time)) to O(n).
我不熟悉python,但对于查找频率,除非您知道频率的范围(在这种情况下您可以使用数组),字典是一种方法。如果您知道在unicode、ASCII等范围内的字符,您可以用正确的值定义一个数组。然而,这将改变从O(n)到O(可能的n)的空间复杂度,但是您将从O(n*(字典提取/插入时间))到O(n)获得一个时间复杂度的改进。
#7
2
If you are really concerned about speed, you might consider first counting characters with integers and then obtaining frequencies through (float) division.
如果你真的关心速度,你可以考虑先用整数计算字符,然后通过(浮点)除法获得频率。
Here are the numbers:
以下数字:
python -mtimeit -s'x=0' 'x+=1'
10000000 loops, best of 3: 0.0661 usec per loop
python -mtimeit -s'x=0.' 'x+=1.'
10000000 loops, best of 3: 0.0965 usec per loop
#8
2
well, you can do it in the old fashioned style... as we know that there are not too many different characters and they are contiguous, we can use a plain array (or list here) and use the characters ordinal numbering for indexing:
嗯,你可以用老式的风格……我们知道,没有太多不同的字符,它们是连续的,我们可以使用一个简单的数组(或列表),并使用字符序号进行索引:
s = 1.0*len(text)
counts = [0]*256 # change this if working with unicode
for char in text:
freqs[ord(char)]+=1
freqs = dict((chr(i), v/s) for i,v in enumerate(counts) if v)
This will be probably faster, but just by a constant factor, both methods should have the same complexity.
这可能会更快,但是仅仅通过一个常数因子,这两种方法都应该具有相同的复杂性。
#9
2
Using this code on Alice in Wonderland (163793 chars) and "The Bible, Douay-Rheims Version" (5649295 chars) from Project Gutenberg:
在《爱丽丝漫游奇境记》(163793 chars)和《圣经》中,古登堡计划(5649295)中使用这段代码:
from collections import defaultdict
import timeit
def countchars():
f = open('8300-8.txt', 'rb')
#f = open('11.txt')
s = f.read()
f.close()
charDict = defaultdict(int)
for aChar in s:
charDict[aChar] += 1
if __name__ == '__main__':
tm = timeit.Timer('countchars()', 'from countchars import countchars')
print tm.timeit(10)
I get:
我得到:
2.27324003315 #Alice in Wonderland
74.8686217403 #Bible
The ratio between the number of chars for both books is 0.029
and the ratio between the times is 0.030
, so, the algorithm is O(n) with a very small constant factor. Fast enough for most (all?) purposes, I should think.
两本书的字符数的比值是0.029,而时间是0.030,所以算法是O(n),一个很小的常数因子。我认为,对于大多数(所有的)目的来说,速度够快了。
#10
2
If the data is in a single byte encoding you can use numpy to accelerate the count process:
如果数据是在单个字节编码中,则可以使用numpy加速计数过程:
import numpy as np
def char_freq(data):
counts = np.bincount(np.frombuffer(data, dtype=np.byte))
freqs = counts.astype(np.double) / len(data)
return dict((chr(idx), freq) for idx, freq in enumerate(freqs) if freq > 0)
Some quick benchmarking shows that this is about 10x faster than aggregating to a defaultdict(int).
一些快速的基准测试表明,这比聚合到defaultdict(int)要快10倍。
#11
1
To avoid the try except overhead you can use a defaultdict.
为了避免这种尝试,除了开销,您可以使用defaultdict。
#12
1
Small speed up will be usage of dict.setdefault
method, that way you will not pay rather big price for every new encountered character:
小的加速将是使用dict.setdefault方法,这样你将不会为每一个新遇到的字符付出相当大的代价:
for char in text:
freq[char] = freq.setdefault(char, 0.0) + P
As a sidenote: having bare except:
is considered very bad practice.
作为一个旁注:除了:被认为是非常糟糕的实践。
#1
43
Performance comparison
Note: time in the table doesn't include the time it takes to load files.
注意:表中的时间不包括加载文件所需的时间。
| approach | american-english, | big.txt, | time w.r.t. defaultdict |
| | time, seconds | time, seconds | |
|----------------+-------------------+---------------+-------------------------|
| Counter | 0.451 | 3.367 | 3.6 |
| setdefault | 0.348 | 2.320 | 2.5 |
| list | 0.277 | 1.822 | 2 |
| try/except | 0.158 | 1.068 | 1.2 |
| defaultdict | 0.141 | 0.925 | 1 |
| numpy | 0.012 | 0.076 | 0.082 |
| S.Mark's ext. | 0.003 | 0.019 | 0.021 |
| ext. in Cython | 0.001 | 0.008 | 0.0086 |
#+TBLFM: $4=$3/@7$3;%.2g
The files used: '/usr/share/dict/american-english'
and 'big.txt'
.
使用的文件:“/usr/share/dict/美式英语”和“big.txt”。
The script that compares 'Counter', 'setdefault', 'list', 'try/except', 'defaultdict', 'numpy', 'cython' -based, and @S.Mark's solutions is at http://gist.github.com/347000
比较“Counter”、“setdefault”、“list”、“try/except”、“defaultdict”、“numpy”、“cython”和“@S”的脚本。Mark的解决方案是http://gist.github.com/347000。
The fastest solution is Python extension written in Cython:
最快的解决方案是在Cython中编写的Python扩展:
import cython
@cython.locals(
chars=unicode,
i=cython.Py_ssize_t,
L=cython.Py_ssize_t[0x10000])
def countchars_cython(chars):
for i in range(0x10000): # unicode code points > 0xffff are not supported
L[i] = 0
for c in chars:
L[c] += 1
return {unichr(i): L[i] for i in range(0x10000) if L[i]}
Previous comparison:
* python (dict) : 0.5 seconds
* python (list) : 0.5 (ascii) (0.2 if read whole file in memory)
* perl : 0.5
* python (numpy): 0.07
* c++ : 0.05
* c : 0.008 (ascii)
Input data:
输入数据:
$ tail /usr/share/dict/american-english
éclat's
élan
élan's
émigré
émigrés
épée
épées
étude
étude's
études
$ du -h /usr/share/dict/american-english
912K /usr/share/dict/american-english
python (Counter): 0.5 seconds
#!/usr/bin/env python3.1
import collections, fileinput, textwrap
chars = (ch for word in fileinput.input() for ch in word.rstrip())
# faster (0.4s) but less flexible: chars = open(filename).read()
print(textwrap.fill(str(collections.Counter(chars)), width=79))
Run it:
运行该程序:
$ time -p python3.1 count_char.py /usr/share/dict/american-english
Counter({'e': 87823, 's': 86620, 'i': 66548, 'a': 62778, 'n': 56696, 'r': 56286, 't': 51588, 'o': 48425, 'l': 39914, 'c': 30020, 'd': 28068, 'u': 25810, "'": 24511, 'g': 22262, 'p': 20917, 'm': 20747, 'h': 18453, 'b': 14137, 'y': 12367, 'f': 10049, 'k': 7800, 'v': 7573, 'w': 6924, 'z': 3088, 'x': 2082, 'M': 1686, 'C': 1549, 'S': 1515, 'q': 1447, 'B': 1387, 'j': 1376, 'A': 1345, 'P': 974, 'L': 912, 'H': 860, 'T': 858, 'G': 811, 'D': 809, 'R': 749, 'K': 656, 'E': 618, 'J': 539, 'N': 531, 'W': 507, 'F': 502, 'O': 354, 'I': 344, 'V': 330, 'Z': 150, 'Y': 140, 'é': 128, 'U': 117, 'Q': 63, 'X': 42, 'è': 29, 'ö': 12, 'ü': 12, 'ó': 10, 'á': 10, 'ä': 7, 'ê': 6, 'â': 6, 'ñ': 6, 'ç': 4, 'å': 3, 'û': 3, 'í': 2, 'ô': 2, 'Å': 1}) real 0.44 user 0.43 sys 0.01
perl: 0.5 seconds
time -p perl -MData::Dumper -F'' -lanwe'$c{$_}++ for (@F);
END{ $Data::Dumper::Terse = 1; $Data::Dumper::Indent = 0; print Dumper(\%c) }
' /usr/share/dict/american-english
Output:
输出:
{'S' => 1515,'K' => 656,'' => 29,'d' => 28068,'Y' => 140,'E' => 618,'y' => 12367,'g' => 22262,'e' => 87823,'' => 2,'J' => 539,'' => 241,'' => 3,'' => 6,'' => 4,'' => 128,'D' => 809,'q' => 1447,'b' => 14137,'z' => 3088,'w' => 6924,'Q' => 63,'' => 10,'M' => 1686,'C' => 1549,'' => 10,'L' => 912,'X' => 42,'P' => 974,'' => 12,'\'' => 24511,'' => 6,'a' => 62778,'T' => 858,'N' => 531,'j' => 1376,'Z' => 150,'u' => 25810,'k' => 7800,'t' => 51588,'' => 6,'W' => 507,'v' => 7573,'s' => 86620,'B' => 1387,'H' => 860,'c' => 30020,'' => 12,'I' => 344,'' => 3,'G' => 811,'U' => 117,'F' => 502,'' => 2,'r' => 56286,'x' => 2082,'V' => 330,'h' => 18453,'f' => 10049,'' => 1,'i' => 66548,'A' => 1345,'O' => 354,'n' => 56696,'m' => 20747,'l' => 39914,'' => 7,'p' => 20917,'R' => 749,'o' => 48425} real 0.51 user 0.49 sys 0.02
python (numpy): 0.07 seconds
Based on Ants Aasma's answer (modified to support unicode):
基于蚂蚁Aasma的回答(修改为支持unicode):
#!/usr/bin/env python
import codecs, itertools, operator, sys
import numpy
filename = sys.argv[1] if len(sys.argv)>1 else '/usr/share/dict/american-english'
# ucs2 or ucs4 python?
dtype = {2: numpy.uint16, 4: numpy.uint32}[len(buffer(u"u"))]
# count ordinals
text = codecs.open(filename, encoding='utf-8').read()
a = numpy.frombuffer(text, dtype=dtype)
counts = numpy.bincount(a)
# pretty print
counts = [(unichr(i), v) for i, v in enumerate(counts) if v]
counts.sort(key=operator.itemgetter(1))
print ' '.join('("%s" %d)' % c for c in counts if c[0] not in ' \t\n')
Output:
输出:
("Å" 1) ("í" 2) ("ô" 2) ("å" 3) ("û" 3) ("ç" 4) ("â" 6) ("ê" 6) ("ñ" 6) ("ä" 7) ("á" 10) ("ó" 10) ("ö" 12) ("ü" 12) ("è" 29) ("X" 42) ("Q" 63) ("U" 117) ("é" 128) ("Y" 140) ("Z" 150) ("V" 330) ("I" 344) ("O" 354) ("F" 502) ("W" 507) ("N" 531) ("J" 539) ("E" 618) ("K" 656) ("R" 749) ("D" 809) ("G" 811) ("T" 858) ("H" 860) ("L" 912) ("P" 974) ("A" 1345) ("j" 1376) ("B" 1387) ("q" 1447) ("S" 1515) ("C" 1549) ("M" 1686) ("x" 2082) ("z" 3088) ("w" 6924) ("v" 7573) ("k" 7800) ("f" 10049) ("y" 12367) ("b" 14137) ("h" 18453) ("m" 20747) ("p" 20917) ("g" 22262) ("'" 24511) ("u" 25810) ("d" 28068) ("c" 30020) ("l" 39914) ("o" 48425) ("t" 51588) ("r" 56286) ("n" 56696) ("a" 62778) ("i" 66548) ("s" 86620) ("e" 87823)
real 0.07
user 0.06
sys 0.01
c++: 0.05 seconds
// $ g++ *.cc -lboost_program_options
// $ ./a.out /usr/share/dict/american-english
#include <iostream>
#include <fstream>
#include <cstdlib> // exit
#include <boost/program_options/detail/utf8_codecvt_facet.hpp>
#include <boost/tr1/unordered_map.hpp>
#include <boost/foreach.hpp>
int main(int argc, char* argv[]) {
using namespace std;
// open input file
if (argc != 2) {
cerr << "Usage: " << argv[0] << " <filename>\n";
exit(2);
}
wifstream f(argv[argc-1]);
// assume the file has utf-8 encoding
locale utf8_locale(locale(""),
new boost::program_options::detail::utf8_codecvt_facet);
f.imbue(utf8_locale);
// count characters frequencies
typedef std::tr1::unordered_map<wchar_t, size_t> hashtable_t;
hashtable_t counts;
for (wchar_t ch; f >> ch; )
counts[ch]++;
// print result
wofstream of("output.utf8");
of.imbue(utf8_locale);
BOOST_FOREACH(hashtable_t::value_type i, counts)
of << "(" << i.first << " " << i.second << ") ";
of << endl;
}
Result:
结果:
$ cat output.utf8
(í 2) (O 354) (P 974) (Q 63) (R 749) (S 1,515) (ñ 6) (T 858) (U 117) (ó 10) (ô 2) (V 330) (W 507) (X 42) (ö 12) (Y 140) (Z 150) (û 3) (ü 12) (a 62,778) (b 14,137) (c 30,020) (d 28,068) (e 87,823) (f 10,049) (g 22,262) (h 18,453) (i 66,548) (j 1,376) (k 7,800) (l 39,914) (m 20,747) (n 56,696) (o 48,425) (p 20,917) (q 1,447) (r 56,286) (s 86,620) (t 51,588) (u 25,810) (Å 1) (' 24,511) (v 7,573) (w 6,924) (x 2,082) (y 12,367) (z 3,088) (A 1,345) (B 1,387) (C 1,549) (á 10) (â 6) (D 809) (E 618) (F 502) (ä 7) (å 3) (G 811) (H 860) (ç 4) (I 344) (J 539) (è 29) (K 656) (é 128) (ê 6) (L 912) (M 1,686) (N 531)
c (ascii): 0.0079 seconds
// $ gcc -O3 cc_ascii.c -o cc_ascii && time -p ./cc_ascii < input.txt
#include <stdio.h>
enum { N = 256 };
size_t counts[N];
int main(void) {
// count characters
int ch = -1;
while((ch = getchar()) != EOF)
++counts[ch];
// print result
size_t i = 0;
for (; i < N; ++i)
if (counts[i])
printf("('%c' %zu) ", (int)i, counts[i]);
return 0;
}
#2
16
How about avoiding float operations inside the loop and do it after everything is done?
如何避免在循环中执行浮动操作,并在完成所有操作之后执行?
By that way, you could just do +1 everytime, and its should be faster.
通过这种方式,你可以每次都做+1,它应该更快。
And better use collections.defaultdict as S.Lott advised.
最好使用collections.defaultdict为S。洛特建议。
freqs=collections.defaultdict(int)
for char in text:
freqs[char]+=1
Or You may want to try, collections.Counter in python 2.7+
或者你可能想尝试,收集。计数器在python 2.7 +
>>> collections.Counter("xyzabcxyz")
Counter({'y': 2, 'x': 2, 'z': 2, 'a': 1, 'c': 1, 'b': 1})
Or
或
You may try psyco, which do just-in-time compiling for python. You have loops, so I think you would get some performance gain with psyco
您可以尝试psyco,它可以实时地为python编译。你有循环,我想你会得到一些psyco的性能增益。
Edit 1:
编辑1:
I did some benchmarks base on big.txt (~6.5 MB) which is used in spelling corrector by peter norvig
我做了一些基准测试。txt (~6.5 MB),由peter norvig在拼写校正器中使用。
Text Length: 6488666
dict.get : 11.9060001373 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
if char in dict : 9.71799993515 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
dict try/catch : 7.35899996758 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
collections.default : 7.29699993134 s
93 chars defaultdict(<type 'int'>, {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....
CPU Specs: 1.6GHz Intel Mobile Atom CPU
CPU规格:1.6GHz英特尔移动Atom处理器。
According to that, dict.get
is slowest and collections.defaultdict
is fastest, try/except
is also the fast one.
根据这个,dict.get是最慢的,而collections.defaultdict是最快的,try/除了也是fast。
Edit 2:
编辑2:
Added collections.Counter
benchmarks, Its slower than dict.get
and took 15s in my laptop
添加收藏。计数器基准测试,它比dict慢,在我的笔记本电脑上取15秒。
collections.Counter : 15.3439998627 s
93 chars Counter({u' ': 1036511, u'e': 628234, u't': 444459, u'a': 395872, u'o': 382683, u'n': 362397, u'i': 348464,
#3
10
I've written Char Counter C Extension to Python, looks like 300x faster than collections.Counter
and 150x faster than collections.default(int)
我已经为Python编写了Char Counter C扩展,看起来比集合快300倍。计数器和150x比集合。默认值(int)
C Char Counter : 0.0469999313354 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8,
Here is Char Counter C Extension Codes
这是字符计数器C扩展码。
static PyObject *
CharCounter(PyObject *self, PyObject *args, PyObject *keywds)
{
wchar_t *t1;unsigned l1=0;
if (!PyArg_ParseTuple(args,"u#",&t1,&l1)) return NULL;
PyObject *resultList,*itemTuple;
for(unsigned i=0;i<=0xffff;i++)char_counter[i]=0;
unsigned chlen=0;
for(unsigned i=0;i<l1;i++){
if(char_counter[t1[i]]==0)char_list[chlen++]=t1[i];
char_counter[t1[i]]++;
}
resultList = PyList_New(0);
for(unsigned i=0;i<chlen;i++){
itemTuple = PyTuple_New(2);
PyTuple_SetItem(itemTuple, 0,PyUnicode_FromWideChar(&char_list[i],1));
PyTuple_SetItem(itemTuple, 1,PyInt_FromLong(char_counter[char_list[i]]));
PyList_Append(resultList, itemTuple);
Py_DECREF(itemTuple);
};
return resultList;
}
Where char_counter, and char_list are malloc-ated at module level, so no need to malloc every time when function calls.
其中char_counter和char_list在模块级别上是错误的,所以每次函数调用时都不需要malloc。
char_counter=(unsigned*)malloc(sizeof(unsigned)*0x10000);
char_list=(wchar_t*)malloc(sizeof(wchar_t)*0x10000);
It returns a List with Tuples
它返回一个带有元组的列表。
[(u'T', 16282), (u'h', 287323), (u'e', 628234), (u' ', 1036511), (u'P', 8946), (u'r', 303977), (u'o', 382683), ...
To convert to dict format, just dict()
will do.
要转换为dict格式,只需命令()就可以了。
dict(CharCounter(text))
PS: Benchmark included the time converting to dict
基准测试包括时间转换到指令。
CharCounter
accept only Python Unicode String u""
, if the text is utf8, need to do .decode("utf8")
in advance.
CharCounter只接受Python Unicode字符串u“”,如果文本是utf8,则需要提前进行。decode(“utf8”)。
Input Supports Unicode until Basic Multilingual Plane (BMP) - 0x0000 to 0xFFFF
输入支持Unicode,直到基本的多语言平面(BMP) - 0x0000到0xFFFF。
#4
6
No it's not the fastest, because you know that the characters have a limited range and you could use a list and direct indexing, using the numeric representation of the character, to store the frequencies.
不,它不是最快的,因为你知道字符的范围很有限,你可以使用一个列表和直接索引,使用字符的数字表示来存储频率。
#5
5
It is very, very hard to beat dict
. It is very highly tuned since almost everything in Python is dict-based.
这是非常非常难以击败的命令。它是非常高度调谐的,因为几乎所有的Python都是基于文本的。
#6
4
I'm not familiar with python, but for finding frequencies, unless you know the range of frequencies (in which case you can use an array), dictionary is the way to go.
If you know your characters in a unicode, ASCII, etc. range, you can define an array with the correct number of values.
However, this will change the space complexity of this from O(n) to O(possible n), but you will earn a time complexity improvement from O(n*(dictionary extraction/insertion time)) to O(n).
我不熟悉python,但对于查找频率,除非您知道频率的范围(在这种情况下您可以使用数组),字典是一种方法。如果您知道在unicode、ASCII等范围内的字符,您可以用正确的值定义一个数组。然而,这将改变从O(n)到O(可能的n)的空间复杂度,但是您将从O(n*(字典提取/插入时间))到O(n)获得一个时间复杂度的改进。
#7
2
If you are really concerned about speed, you might consider first counting characters with integers and then obtaining frequencies through (float) division.
如果你真的关心速度,你可以考虑先用整数计算字符,然后通过(浮点)除法获得频率。
Here are the numbers:
以下数字:
python -mtimeit -s'x=0' 'x+=1'
10000000 loops, best of 3: 0.0661 usec per loop
python -mtimeit -s'x=0.' 'x+=1.'
10000000 loops, best of 3: 0.0965 usec per loop
#8
2
well, you can do it in the old fashioned style... as we know that there are not too many different characters and they are contiguous, we can use a plain array (or list here) and use the characters ordinal numbering for indexing:
嗯,你可以用老式的风格……我们知道,没有太多不同的字符,它们是连续的,我们可以使用一个简单的数组(或列表),并使用字符序号进行索引:
s = 1.0*len(text)
counts = [0]*256 # change this if working with unicode
for char in text:
freqs[ord(char)]+=1
freqs = dict((chr(i), v/s) for i,v in enumerate(counts) if v)
This will be probably faster, but just by a constant factor, both methods should have the same complexity.
这可能会更快,但是仅仅通过一个常数因子,这两种方法都应该具有相同的复杂性。
#9
2
Using this code on Alice in Wonderland (163793 chars) and "The Bible, Douay-Rheims Version" (5649295 chars) from Project Gutenberg:
在《爱丽丝漫游奇境记》(163793 chars)和《圣经》中,古登堡计划(5649295)中使用这段代码:
from collections import defaultdict
import timeit
def countchars():
f = open('8300-8.txt', 'rb')
#f = open('11.txt')
s = f.read()
f.close()
charDict = defaultdict(int)
for aChar in s:
charDict[aChar] += 1
if __name__ == '__main__':
tm = timeit.Timer('countchars()', 'from countchars import countchars')
print tm.timeit(10)
I get:
我得到:
2.27324003315 #Alice in Wonderland
74.8686217403 #Bible
The ratio between the number of chars for both books is 0.029
and the ratio between the times is 0.030
, so, the algorithm is O(n) with a very small constant factor. Fast enough for most (all?) purposes, I should think.
两本书的字符数的比值是0.029,而时间是0.030,所以算法是O(n),一个很小的常数因子。我认为,对于大多数(所有的)目的来说,速度够快了。
#10
2
If the data is in a single byte encoding you can use numpy to accelerate the count process:
如果数据是在单个字节编码中,则可以使用numpy加速计数过程:
import numpy as np
def char_freq(data):
counts = np.bincount(np.frombuffer(data, dtype=np.byte))
freqs = counts.astype(np.double) / len(data)
return dict((chr(idx), freq) for idx, freq in enumerate(freqs) if freq > 0)
Some quick benchmarking shows that this is about 10x faster than aggregating to a defaultdict(int).
一些快速的基准测试表明,这比聚合到defaultdict(int)要快10倍。
#11
1
To avoid the try except overhead you can use a defaultdict.
为了避免这种尝试,除了开销,您可以使用defaultdict。
#12
1
Small speed up will be usage of dict.setdefault
method, that way you will not pay rather big price for every new encountered character:
小的加速将是使用dict.setdefault方法,这样你将不会为每一个新遇到的字符付出相当大的代价:
for char in text:
freq[char] = freq.setdefault(char, 0.0) + P
As a sidenote: having bare except:
is considered very bad practice.
作为一个旁注:除了:被认为是非常糟糕的实践。