Comprehensive Exercise: Word Frequency Counting

Date: 2023-02-13 12:58:56

1. English Word Frequency Count

Download the lyrics of an English song or an English article.

Replace all separators such as , . ? ! ' : with spaces.

news = '''
A man may usually be known by the books he reads as well as by the company he keeps; for there is a companionship of books as well as of men; and one should always live in the best company, whether it be of books or of men.
A good book may be among the best of friends. It is the same today that it always was, and it will never change. It is the most patient and cheerful of companions. It does not turn its back upon us in times of adversity or distress. It always receives us with the same kindness; amusing and instructing us in youth, and comforting and consoling us in age.
Men often discover their affinity to each other by the mutual love they have for a book just as two persons sometimes discover a friend by the admiration which both entertain for a third. There is an old proverb, 'Love me, love my dog.' But there is more wisdom in this: 'Love me, love my book.' The book is a truer and higher bond of union. Men can think, feel, and sympathize with each other through their favorite author. They live in him together, and he in them.
'''
sep = ''',.?!":;()''''''   # separators listed above, including ! and the quote marks
for c in sep:
    news = news.replace(c, ' ')
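
Equivalently, the character-by-character loop can be collapsed into a single regular-expression substitution. A minimal sketch using the standard-library re module (assuming news already holds the text):

import re

# Replace every character that is not a word character or whitespace
# with a space; curly quotes and other punctuation are covered too.
news = re.sub(r'[^\w\s]', ' ', news)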

  

Convert all uppercase letters to lowercase.

news = news.lower()

  

Generate the word list.

wordList = news.split()
for w in wordList:
    print(w)

  

Generate the word frequency count.

wordDist = {}
wordSet = set(wordList)
for w in wordSet:
    wordDist[w] = wordList.count(w)

for w in wordDist:
    print(w, wordDist[w])
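
For reference, the standard library's collections.Counter builds the same word-to-count mapping in a single pass, avoiding the repeated wordList.count() scans; a sketch:

from collections import Counter

# Counter tallies every word in one pass over the list;
# dict(...) converts it back to a plain dict like wordDist above.
wordDist = dict(Counter(wordList))

Counter also offers most_common(n), which returns the n most frequent pairs already sorted and could replace the manual sort in the next step.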

  

Sort.

dictList = list(wordDist.items())
dictList.sort(key = lambda x: x[1], reverse=True)

  

Exclude grammatical (function) words: pronouns, articles, conjunctions.

exclude = {'the', 'of', 'and', 's', 'to', 'which', 'will', 'as', 'on', 'is', 'by'}
wordSet = set(wordList) - exclude
wordDist = {}                       # rebuild so excluded words really drop out
for w in wordSet:
    wordDist[w] = wordList.count(w)
dictList = list(wordDist.items())   # re-sort after the exclusion
dictList.sort(key=lambda x: x[1], reverse=True)
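
An alternative to recounting is to filter the frequency dict that already exists; a sketch with a dict comprehension, using the same exclude set as above:

# Keep only entries whose word is not in the stop-word set,
# then re-sort so the TOP 20 below reflects the exclusion.
wordDist = {w: n for w, n in wordDist.items() if w not in exclude}
dictList = sorted(wordDist.items(), key=lambda x: x[1], reverse=True)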

  

Output the TOP 20 most frequent words.

for i in range(20):
    print(dictList[i])

  

Save the text to analyze as a UTF-8 encoded file, and obtain the content for frequency analysis by reading the file.

Read the news.txt file:

f = open('news.txt', 'r', encoding='utf-8')
news = f.read()
f.close()
print(news)

Write the sorted results to the newscount.txt file:

f = open('newscount.txt', 'w', encoding='utf-8')   # 'w' so reruns overwrite instead of appending
for i in range(25):
    f.write(dictList[i][0] + ' ' + str(dictList[i][1]) + '\n')
f.close()
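
The same read and write steps can also be written with context managers, which close the files automatically even if an error occurs; a sketch under the same file names:

with open('news.txt', 'r', encoding='utf-8') as f:
    news = f.read()

# 'w' rewrites the report on each run instead of appending to it.
with open('newscount.txt', 'w', encoding='utf-8') as f:
    for word, count in dictList[:25]:
        f.write(word + ' ' + str(count) + '\n')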

  

2. Chinese Word Frequency Count

Download a long Chinese article.

Read the text to analyze from the file.

f = open('gzccnews.txt', 'r', encoding='utf-8')   # open() returns a file object, so read() is needed
news = f.read()
f.close()

 

Install and use jieba for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(news)   # lcut already returns a list, so wrapping it in list(...) is unnecessary
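
For reference, jieba exposes three segmentation modes; a short sketch (the sample sentence comes from the jieba README):

import jieba

text = '我来到北京清华大学'

print(jieba.lcut(text))                  # accurate mode, returns a list
print(jieba.lcut(text, cut_all=True))    # full mode: every possible word
print(list(jieba.cut_for_search(text)))  # search-engine mode: finer-grained cuts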

import jieba

file = open('hong.txt', 'r', encoding='utf-8')
word = file.read()
file.close()

  

Generate the word frequency count.

wordList = list(jieba.cut_for_search(word))

wordDist = {}
for w in set(wordList):          # count each distinct token once
    wordDist[w] = wordList.count(w)

for w in wordDist:
    print(w, wordDist[w])

 

Sort.

dictList = list(wordDist.items())
dictList.sort(key = lambda x: x[1], reverse=True)

   

Exclude grammatical (function) words: pronouns, particles, conjunctions.

sep = ''',。?“”:、;!!'''
exclude = {' ', '\n', '了', '的', '\u3000', '他', '我', '也', '又', '是', '你', '着', '这', '就', '都', '呢', '只'}

# Strip punctuation BEFORE segmenting, then rebuild the counts --
# replacing characters in word after wordList was built has no effect.
for c in sep:
    word = word.replace(c, ' ')

wordList = list(jieba.cut_for_search(word))
wordSet = set(wordList) - exclude
wordDist = {}
for w in wordSet:
    wordDist[w] = wordList.count(w)
dictList = list(wordDist.items())
dictList.sort(key=lambda x: x[1], reverse=True)
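
As an optional extra filter (an assumption, not part of the original exercise), dropping single-character tokens removes many function words that a fixed stop-word set misses:

# Keep only tokens of two or more characters, then re-sort.
wordDist = {w: n for w, n in wordDist.items() if len(w) > 1}
dictList = sorted(wordDist.items(), key=lambda x: x[1], reverse=True)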

  

Output the TOP 20 most frequent words (or save the results to a file).

f = open('hongcount.txt', 'w', encoding='utf-8')   # 'w' + UTF-8 so reruns overwrite and Chinese text writes cleanly
for i in range(20):
    f.write(dictList[i][0] + ' ' + str(dictList[i][1]) + '\n')
f.close()