Categorizing and Tagging Words
1: Part-of-Speech Tagging
import nltk

# Split the sentence into tokens, then assign a part-of-speech tag to each one
text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
RB is an adverb, CC a coordinating conjunction, JJ an adjective, and NN a noun.
The meaning of each tag can be looked up with:
nltk.help.upenn_tagset()
Another example, tagging a sentence that contains homonyms:
"They refuse to permit us to obtain the refuse permit"
Result:
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
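The permit that follows "to" is a verb (VB), while the permit that follows "refuse" is a noun (NN). Likewise, the first refuse is a verb (VBP) and the second is a noun (NN).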
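To make the ambiguity explicit, the tagged result above can be grouped by word. This is a small sketch in plain Python, not part of NLTK; the names tags_by_word and ambiguous are illustrative, and the tagged list is copied from the result shown above.

```python
from collections import defaultdict

# Tagged output copied from the result above
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Collect the distinct tags seen for each word, in order of first appearance
tags_by_word = defaultdict(list)
for word, tag in tagged:
    if tag not in tags_by_word[word]:
        tags_by_word[word].append(tag)

# Keep only the words that received more than one tag
ambiguous = {w: t for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'refuse': ['VBP', 'NN'], 'permit': ['VB', 'NN']}
```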
Much of this part-of-speech information comes from a shallow analysis of how words are distributed in text.
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
# similar() prints its results itself and returns None, so it is not wrapped in print()
text.similar('car')
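For a word w, this method finds every context w1 w w2 in which w occurs, then finds all other words w' that occur in the same contexts, i.e. w1 w' w2.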
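The idea behind similar() can be sketched in a few lines of plain Python. This is only an illustration of the shared-context principle, not NLTK's actual implementation (which scores and ranks candidates differently); the function similar_words and the toy token list are made up for the example.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words by how many (left, right) contexts they share with target."""
    # Map each word w to the set of contexts (w1, w2) it appears in
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts[target]
    scores = {}
    for word, ctxs in contexts.items():
        if word == target:
            continue
        shared = len(ctxs & target_contexts)
        if shared:
            scores[word] = shared
    # Most shared contexts first; ties broken alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))

# Toy corpus, invented for illustration
tokens = ("the car stopped . the truck stopped . the car left . "
          "the bus left . a dog barked .").split()
print(similar_words(tokens, "car"))  # ['bus', 'truck']
```

Here "truck" shares the context (the, stopped) with "car" and "bus" shares (the, left), so both are returned, while "dog" never appears in any context of "car".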
Create a tuple consisting of a token and its tag:
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token[0])  # the token: fly
print(tagged_token[1])  # the tag: NN
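The same conversion can be applied to a whole string of word/TAG items. The sketch below uses a hypothetical helper, split_tagged, written in plain Python and intended to behave like str2tuple (split on the last separator and upper-case the tag); the sample string is only an illustration.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' on the last separator; return (token, None) if there is none."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag.upper())

# Illustrative tagged sentence in word/TAG form
sent = "The/AT grand/JJ jury/NN commented/VBD ./."
tagged = [split_tagged(t) for t in sent.split()]
print(tagged[:3])  # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')]
```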
2: Tagged Corpora
Reading a corpus that has already been tagged:
print(nltk.corpus.brown.tagged_words(tagset='universal')[:10])
Passing tagset='universal' maps the corpus tags to the simplified universal tagset.
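from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
# Count how often each tag occurs, ignoring the words themselves
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.plot(cumulative=True)
The plot shows which parts of speech are used most often in the news category.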
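What the cumulative plot displays can be reproduced with a small plain-Python sketch: count the tags, sort them by frequency, and accumulate their share of the total. The tagged_words list here is invented toy data, not taken from the Brown corpus.

```python
from collections import Counter

# Toy (word, tag) pairs, invented for illustration
tagged_words = [('the', 'DET'), ('jury', 'NOUN'), ('said', 'VERB'), ('the', 'DET'),
                ('election', 'NOUN'), ('was', 'VERB'), ('fair', 'ADJ'), ('.', '.')]

tag_counts = Counter(tag for (word, tag) in tagged_words)
total = sum(tag_counts.values())

# Walk the tags from most to least frequent, accumulating their share
running = 0
cumulative = []
for tag, count in tag_counts.most_common():
    running += count
    cumulative.append((tag, running / total))
    print(f"{tag:5} {count:2} {running / total:.0%}")
```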
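List all verbs, sorted by frequency:
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
# most_common() yields ((word, tag), count) pairs in descending order of frequency
print([word + '/' + tag for ((word, tag), count) in word_tag_fd.most_common() if tag.startswith('V')])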
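The same filtering step can be tried without downloading a corpus. This sketch counts (word, tag) pairs with collections.Counter and keeps only the verbs; the tagged_words list is invented toy data.

```python
from collections import Counter

# Toy (word, tag) pairs, invented for illustration
tagged_words = [('is', 'VERB'), ('said', 'VERB'), ('is', 'VERB'), ('the', 'DET'),
                ('said', 'VERB'), ('is', 'VERB'), ('market', 'NOUN'), ('rose', 'VERB')]

# Count each (word, tag) pair, then list the verbs from most to least frequent
pair_counts = Counter(tagged_words)
verbs = [f"{word}/{tag}" for (word, tag), _ in pair_counts.most_common() if tag == 'VERB']
print(verbs)  # ['is/VERB', 'said/VERB', 'rose/VERB']
```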