Natural Language Processing (Part 1)

Date: 2021-08-29 12:11:48

Categorizing and Tagging Words

1: Part-of-speech tagging

import nltk

# The tokenizer and tagger models must be fetched once with nltk.download().
text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))


[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

RB is an adverb, CC a coordinating conjunction, JJ an adjective, and NN a noun.


The meaning of each tag can be looked up with:

nltk.help.upenn_tagset()



Tagging the sentence "They refuse to permit us to obtain the refuse permit" gives the result:

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]


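The "permit" after "to" is a verb (VB), while the "permit" after "refuse" is a noun (NN). In the same way, the first "refuse" is a verb (VBP) and the second is a noun (NN).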
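To make the ambiguity explicit, the tagged output above can be grouped by word. This is a minimal sketch that copies the tagger output as a literal list, so it runs without any NLTK models; the variable names are illustrative only.

```python
from collections import defaultdict

# Tagger output copied from the result above.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Collect every tag seen for each word form.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words that received more than one distinct tag.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'refuse': ['NN', 'VBP'], 'permit': ['NN', 'VB']}
```

Only "refuse" and "permit" survive the filter, which matches the two homographs in the sentence.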



Much of the part-of-speech information above comes from a shallow analysis of how words are distributed in text.

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
# similar() prints its results itself and returns None, so no print() is needed.
text.similar('car')

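For a word w, this method finds every context w1 w w2 in which it occurs, then finds all other words w' that occur in the same context, i.e. w1 w' w2.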
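The context-matching idea can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: `similar_words` is a hypothetical helper, the token list is a made-up toy sentence, and NLTK's own ranking of candidates may differ.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words that share (left, right) contexts with `target`."""
    contexts = defaultdict(set)  # word -> set of (w1, w2) contexts
    for w1, w, w2 in zip(tokens, tokens[1:], tokens[2:]):
        contexts[w].add((w1, w2))
    target_contexts = contexts.get(target, set())
    # Score every other word by how many contexts it shares with target.
    scores = {w: len(c & target_contexts)
              for w, c in contexts.items() if w != target}
    # Most shared contexts first; ties broken alphabetically.
    return sorted((w for w, s in scores.items() if s > 0),
                  key=lambda w: (-scores[w], w))

# Toy input (hypothetical), lowercased like the Brown text above.
tokens = "the car stopped and the bus stopped but the dog barked".split()
print(similar_words(tokens, "car"))  # ['bus']
```

Here "bus" is returned because it appears between "the" and "stopped", exactly like "car"; "dog" follows "the" but precedes "barked", so it shares no full context.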


Create a tuple consisting of a token and its tag:

tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token[0])  # the token: fly
print(tagged_token[1])  # the tag: NN
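The same idea extends to a whole tagged sentence. The sketch below assumes tokens are separated by whitespace and written as word/TAG. `parse_tagged` is a hypothetical helper in plain Python, so it runs without NLTK; it splits at the last separator and uppercases the tag, which is assumed here to match what `nltk.tag.str2tuple` does.

```python
def parse_tagged(sentence, sep="/"):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for token in sentence.split():
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag.upper()))
        else:
            # No separator present: keep the token, leave the tag unknown.
            pairs.append((token, None))
    return pairs

print(parse_tagged("The/AT grand/JJ jury/NN commented/VBD"))
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a slash intact.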

2: Tagged corpora

Reading a corpus that has already been tagged:

print(nltk.corpus.brown.tagged_words(tagset='universal')[:10])
Passing tagset='universal' maps the corpus tags to the simplified universal tagset.

brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')

tag_fd = nltk.FreqDist(tag for (word,tag) in brown_news_tagged)
tag_fd.plot(cumulative=True) 

The cumulative plot shows which parts of speech are most frequent in the news category.
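The counting behind such a plot can be sketched with the standard library. The sample below is a tiny hand-made list of (word, tag) pairs, not real Brown corpus data, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter
from itertools import accumulate

# Tiny hand-made sample (not real corpus data), using universal tags.
sample = [('The', 'DET'), ('jury', 'NOUN'), ('said', 'VERB'),
          ('the', 'DET'), ('election', 'NOUN'), ('was', 'VERB'),
          ('fair', 'ADJ'), ('.', '.'), ('Voters', 'NOUN'), ('agreed', 'VERB')]

tag_counts = Counter(tag for _, tag in sample)
ranked = tag_counts.most_common()  # most frequent tags first
total = sum(tag_counts.values())

# Running total of counts: the quantity a cumulative plot displays.
running = list(accumulate(count for _, count in ranked))
for (tag, count), cum in zip(ranked, running):
    print(f"{tag:5} {count:2} {cum / total:.0%}")
```

The last running total always equals the number of tokens, so the cumulative curve ends at 100%.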


List all verbs sorted by frequency:

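wsj = nltk.corpus.treebank.tagged_words(tagset='universal')

word_tag_fd = nltk.FreqDist(wsj)

# most_common() returns the (word, tag) pairs in descending frequency order;
# iterating over the FreqDist directly does not guarantee that order.
print([word + '/' + tag for ((word, tag), _) in word_tag_fd.most_common() if tag == 'VERB'])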
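The frequency ordering comes from `most_common()`, not from plain iteration over the counts. A minimal sketch on made-up data, again using `collections.Counter` in place of `nltk.FreqDist`:

```python
from collections import Counter

# Hand-made (word, tag) pairs standing in for a tagged corpus.
pairs = [('is', 'VERB'), ('said', 'VERB'), ('is', 'VERB'), ('run', 'NOUN'),
         ('run', 'VERB'), ('is', 'VERB'), ('said', 'VERB'), ('the', 'DET')]

pair_counts = Counter(pairs)
# most_common() yields ((word, tag), count) in descending count order,
# so filtering it keeps the verbs sorted by frequency.
verbs = [f"{word}/{tag}" for (word, tag), _ in pair_counts.most_common()
         if tag == 'VERB']
print(verbs)  # ['is/VERB', 'said/VERB', 'run/VERB']
```

Note that "run" is counted separately as a noun and as a verb, because the key is the (word, tag) pair rather than the word alone.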