1. Use FreqDist to find the 50 most common words in a text
>>> fdist1 = FreqDist(text1)
>>> fdist1
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
>>> vocab = fdist1.most_common(50)
>>> vocab
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
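All the snippets in this section assume the NLTK example texts have been loaded first. A minimal setup sketch, assuming NLTK is installed and the 'book' data package has been downloaded:

>>> import nltk
>>> # nltk.download('book')     # one-time fetch of the example corpora, if not already present
>>> from nltk.book import *     # loads text1 (Moby Dick) ... text9; FreqDist and bigrams come along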
2. Plot a cumulative frequency graph of the 50 most common words
>>> fdist1.plot(50, cumulative=True)
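plot() requires matplotlib; if plain text output is enough, the same information can be printed as a table. A sketch, assuming the top-n argument of tabulate() behaves like plot()'s:

>>> fdist1.tabulate(10)     # print counts for the 10 most common samples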
3. Find words that occur only once (hapaxes)
>>> fdist1.hapaxes()
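hapaxes() returns a plain Python list, so counting the once-only words is just a len() call:

>>> len(fdist1.hapaxes())   # number of words that occur exactly once in text1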
4. Find words longer than 15 characters
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
5. Find all words in the text that are longer than 7 characters and occur more than 7 times
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
6. Extract the bigrams of a text
>>> list(bigrams(['more', 'is', 'said', 'done', 'than']))
[('more', 'is'), ('is', 'said'), ('said', 'done'), ('done', 'than')]
Note: the bigrams function is only available after from nltk import * (or from nltk import bigrams); in NLTK 3 it returns a generator, so wrap it in list() to display the pairs as a list.
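Since bigrams() yields tuples, the generator can also be fed straight into FreqDist to rank word pairs. A small sketch (fdist_bi is a name chosen here; the output is omitted because it depends on tokenization):

>>> fdist_bi = FreqDist(bigrams(text1))
>>> fdist_bi.most_common(5)     # the five most frequent word pairs in text1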
7. Find frequently occurring bigrams (collocations)
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
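collocations() is a convenience wrapper; roughly the same idea can be expressed with NLTK's lower-level collocation API. A sketch (the frequency filter of 3 and the cutoff of 20 are choices made for this example; the wrapper additionally filters out stopwords and short words):

>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(text4)
>>> finder.apply_freq_filter(3)     # drop bigrams seen fewer than 3 times
>>> finder.nbest(bigram_measures.likelihood_ratio, 20)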
8. Create a FreqDist over word lengths
>>> fdist = FreqDist([len(w) for w in text1])
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})
>>> fdist.keys()
dict_keys([1, 4, 2, 6, 8, 9, 11, 5, 7, 3, 10, 12, 13, 14, 16, 15, 17, 18, 20])
Here 3: 50223 means that words of length 3 occur 50,223 times.
fdist.keys() returns all the distinct word lengths (note that in NLTK 3 the keys are not sorted by frequency).
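Because the keys come back unordered, sorting them shows the full range of word lengths at a glance (19 is absent: no word in text1 is exactly 19 characters long):

>>> sorted(fdist.keys())
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]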
9. Some operations on word lengths
List each word length with its count:
>>> fdist.items()
dict_items([(1, 47933), (4, 42345), (2, 38513), (6, 17111), (8, 9966), (9, 6428), (11, 1873), (5, 26597), (7, 14399), (3, 50223), (10, 3528), (12, 1053), (13, 567), (14, 177), (16, 22), (15, 70), (17, 12), (18, 1), (20, 1)])
Find the most common word length:
>>> fdist.max()
3
Look up how many times word length 3 occurs:
>>> fdist[3]
50223
Relative frequency of word length 3:
>>> fdist.freq(3)
0.19255882431878046
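freq(3) is simply the count divided by the total number of samples, which is easy to verify by hand (fdist.N() here equals the token count of text1):

>>> fdist[3] / fdist.N()
0.19255882431878046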
10. Functions defined in NLTK's frequency distribution class
fdist = FreqDist(samples)       create a frequency distribution from the given samples
fdist[sample] += 1              increment the count for a sample (the NLTK 3 replacement for the removed fdist.inc(sample))
fdist['monstrous']              count of the given sample
fdist.freq('monstrous')         relative frequency of the given sample
fdist.N()                       total number of samples
fdist.keys()                    the distinct samples (use fdist.most_common() to get them sorted by decreasing frequency)
for sample in fdist:            iterate over the samples (not guaranteed to be in frequency order in NLTK 3)
fdist.max()                     the sample with the greatest count
fdist.tabulate()                print a table of the frequency distribution
fdist.plot()                    plot the frequency distribution
fdist.plot(cumulative=True)     plot a cumulative frequency distribution
fdist1 < fdist2                 test whether the samples in fdist1 occur less frequently than in fdist2
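As a closing sketch tying the table together, here is a FreqDist built one sample at a time with the NLTK 3 idiom that replaces fdist.inc(); the toy word list is made up for illustration:

>>> from nltk import FreqDist
>>> fd = FreqDist()
>>> for w in ['the', 'whale', 'the', 'sea']:    # hypothetical sample
...     fd[w] += 1
...
>>> fd.max()        # most frequent sample
'the'
>>> fd.N()          # total number of samples seen
4
>>> fd.freq('the')  # 2 / 4
0.5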