Counting Words in a File: Shell and Python Versions

Date: 2024-07-23 10:35:08

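While studying shell scripting recently, I came across an exercise that counts how often each word occurs in a file. It uses awk, and the code is as follows: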

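#!/bin/bash
# Print the frequency of each word in the given file, most frequent first.
if [ $# -ne 1 ]; then
    echo "Usage: $(basename "$0") filename"
    exit 1
fi
filename=$1
# Emit one alphabetic word per line, then tally the lines in an awk array.
grep -Eo "[a-zA-Z]+" "$filename" |
awk '{ count[$0]++ }
END { printf "%-14s %s\n", "Word", "Count"
      for (i in count) printf "%-14s %s\n", i, count[i] | "sort -nrk 2" }'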

The words are matched with a regular expression; + means one or more occurrences of the preceding character class.
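To see what the pattern actually captures, here is a small Python sketch. The sample string is made up for illustration and is not taken from item.txt. Since digits are not in the character class, a token such as MP3 yields only MP, which may be why an entry like MP can show up in the results.

```python
import re

# Hypothetical sample line, made up for illustration (not from item.txt).
sample = "MP3 Player, HD Camcorder"

# [a-zA-Z]+ matches each maximal run of one or more ASCII letters;
# digits and punctuation end a run and are never part of a match.
words = re.findall(r"[a-zA-Z]+", sample)
print(words)  # ['MP', 'Player', 'HD', 'Camcorder']
```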

The result:

[root@localhost shellcookbook]# sh word_freq.sh item.txt
Word           Count
Tennis         2
Sports         2
Racket         2
Printer        2
Office         2
Laser          2
Video          1
Refrigerator   1
Player         1
MP             1
HD             1
Camcorder      1
Audio          1
Appliance      1

Since I happen to be learning Python, I implemented the same thing in Python as well. The code is as follows:

#!/usr/bin/env python
import sys, re

# Expect exactly one argument: the file to scan.
if len(sys.argv) != 2:
    print("Usage: %s file" % sys.argv[0])
    sys.exit(1)

try:
    filename = sys.argv[1]
    with open(filename) as f:
        data = f.read()
except IOError:
    print("Please check that %s exists!" % filename)
    sys.exit(1)
except Exception as e:
    print(e)
    sys.exit(1)

# Extract every run of ASCII letters as a word.
pattern = r'[a-zA-Z]+'
words = re.findall(pattern, data)
# Python 2 only alternative using cmp:
#print sorted([(i,words.count(i)) for i in set(words)],cmp=lambda x,y:cmp(x[1],y[1]),reverse=True)
# Pair each distinct word with its count, most frequent first.
wordcounts = sorted([(i, words.count(i)) for i in set(words)], key=lambda x: x[1], reverse=True)
print("%-14s %s" % ("Word", "Counts"))
for word, counts in wordcounts:
    print("%-14s %s" % (word, counts))

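This version also matches the words with a regular expression first, then counts each distinct word and orders the pairs by count with sorted. The result: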
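The list comprehension in the script calls words.count once per distinct word, so it rescans the whole list each time. As a possible alternative (not what the script above does), the following minimal sketch uses collections.Counter from the standard library to tally everything in one pass. The function name count_words and the sample text are made up for illustration.

```python
import re
from collections import Counter

def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-zA-Z]+", text)
    # Counter tallies in a single pass; most_common() sorts by count, descending.
    return Counter(words).most_common()

# Hypothetical sample text, for illustration only.
sample = "Laser Printer, Office Printer, Laser"
for word, n in count_words(sample):
    print("%-14s %s" % (word, n))
```

For a file, the same function could be applied to the string returned by f.read().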

[root@localhost shellcookbook]# python word_freq_py.py item.txt
Word           Counts
Printer        2
Laser          2
Office         2
Tennis         2
Sports         2
Racket         2
Appliance      1
Player         1
Video          1
HD             1
Audio          1
Camcorder      1
Refrigerator   1
MP             1

Now let's compare the two and see how the programs perform:

# time sh word_freq.sh item.txt

real    0m0.007s
user    0m0.003s
sys     0m0.005s
# time python word_freq_py.py item.txt

real    0m0.035s
user    0m0.031s
sys     0m0.004s

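In this comparison the shell script is faster, mainly because awk makes the counting efficient. So when writing small programs on Linux, if shell can do the job, it is still worth doing it in shell, with Python as a supplement.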
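Note that time on the command line measures the whole process, including interpreter startup. To look at the counting step on its own, one option is to time it inside Python. The sketch below is only an illustration: time_counting is a made-up helper and the generated text stands in for a real file, so it says nothing about the numbers above.

```python
import re
import time

def time_counting(text):
    """Count words in text and return (counts, elapsed_seconds)."""
    start = time.perf_counter()
    counts = {}
    for word in re.findall(r"[a-zA-Z]+", text):
        # Same idea as awk's count[$0]++: one dictionary slot per word.
        counts[word] = counts.get(word, 0) + 1
    elapsed = time.perf_counter() - start
    return counts, elapsed

# Hypothetical input, generated for illustration only.
text = "Laser Printer Office " * 1000
counts, elapsed = time_counting(text)
print("%d distinct words in %.6f s" % (len(counts), elapsed))
```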