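This chapter answers the following questions:
(1) How can we build a system that extracts structured data from unstructured text?
(2) What are some robust methods for identifying the entities and relationships described in a text?
(3) Which corpora are appropriate for this work, and how do we use them to train and evaluate our models?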
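1 Information extraction
Information comes in many "shapes" and "sizes". One important form is structured data, where entities and relationships are organized in a regular and predictable way. For example, we might be interested in the relation between companies and locations, which can be stored in a relational database.
Things are trickier if we try to get similar information out of text. How do we discover a table of entities and relations in a passage of text?
One approach is to first convert the unstructured text into structured data, and then take advantage of powerful query tools such as SQL. This method of getting meaning from text is called "information extraction".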
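To make the payoff of structured data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and rows are invented for illustration; the point is only that once organization/location pairs sit in a table, a one-line SQL query answers questions that would be awkward to ask of raw text.

```python
import sqlite3

# Hypothetical (organization, location) pairs, invented for illustration only.
rows = [
    ("Omnicom", "New York"),
    ("BBDO South", "Atlanta"),
    ("Georgia-Pacific", "Atlanta"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (org TEXT, loc TEXT)")
conn.executemany("INSERT INTO locations VALUES (?, ?)", rows)

# Which organizations operate in Atlanta?
query = "SELECT org FROM locations WHERE loc = ? ORDER BY org"
atlanta_orgs = [org for (org,) in conn.execute(query, ("Atlanta",))]
print(atlanta_orgs)
conn.close()
```

The hard part of information extraction is, of course, filling such a table from text in the first place; the rest of these notes deal with that step.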
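Information extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research is extracting structured data from electronic scientific literature, especially in biology and medicine.
#Information extraction architecture
To perform the first three tasks of the pipeline, we can use a sentence segmenter, a tokenizer, and a part-of-speech tagger:
import nltk, re, pprint

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # sentence segmentation
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenization
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # part-of-speech tagging
    return sentences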
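2 Chunking
The basic technique used for entity recognition is chunking.
When a chunked sentence is drawn as a diagram, the smaller boxes show the word-level tokenization and part-of-speech tagging, while the larger boxes show the higher-level chunking.
In this section we explore chunking in some depth, beginning with the definition and representation of chunks. We will look at regular-expression and n-gram approaches to chunking, and use the CoNLL-2000 chunking corpus to develop and evaluate chunkers.
#Noun phrase chunking
NP-chunking (noun phrase chunking) looks for chunks that correspond to individual noun phrases.
One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in an information extraction system.
To create an NP-chunker, we first define a chunk grammar, which specifies how sentences should be chunked. In this example, we define a simple grammar with a single regular-expression rule.
The rule says that an NP chunk consists of an optional determiner, followed by any number of adjectives, followed by a noun. Using this grammar, we create a chunk parser and test it on our example sentence. The result is a tree, which we can print or display graphically.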
sentence = [("the","DT"),("little","JJ"),("yellow","JJ"),("dog","NN"),
            ("barked","VBD"),("at","IN"),("the","DT"),("cat","NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

result.draw()  # display the tree graphically
#Tag patterns
Tag patterns can be tried out interactively in the graphical interface nltk.app.chunkparser().
#Chunking with regular expressions
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk an optional determiner or possessive pronoun, adjectives and a noun
      {<NNP>+}                # chunk one or more proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel","NNP"),("let","VBD"),("down","RP"),
            ("her","PP$"),("long","JJ"),("golden","JJ"),("hair","NN")]
print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

# If a rule that matches two consecutive nouns is applied to a text
# containing three consecutive nouns, only the first two nouns are chunked.
nouns = [("money","NN"),("market","NN"),("fund","NN")]
grammar = "NP: {<NN><NN>}"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)
#Exploring text corpora
A chunker can be used to extract phrases that match a particular sequence of part-of-speech tags from a tagged corpus.
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
......
#Chinking
A chink is a sequence of tokens that is not included in a chunk.
Chinking is the process of removing a sequence of tokens from a chunk.
grammar = r"""
  NP:
    {<.*>+}          # chunk everything
    }<VBD|IN>+{      # chink sequences of VBD and IN
"""
sentence = [("the","DT"),("little","JJ"),("yellow","JJ"),("dog","NN"),
            ("barked","VBD"),("at","IN"),("the","DT"),("cat","NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
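The effect of chinking can also be sketched in plain Python, without NLTK. The helper `chink` below is a hypothetical name, not an NLTK function: it starts from the idea that the whole sentence is one chunk and cuts it wherever a run of chink tags occurs. Chunks are represented here as `(label, [tokens])` tuples, which is an assumption of this sketch rather than NLTK's tree format.

```python
from itertools import groupby

def chink(tagged_sent, chink_tags, label="NP"):
    """Chunk everything, then remove runs of tokens whose tag is in chink_tags."""
    result = []
    in_chink = lambda word_tag: word_tag[1] in chink_tags
    for is_chink, group in groupby(tagged_sent, key=in_chink):
        tokens = list(group)
        if is_chink:
            result.extend(tokens)           # chinked tokens stay outside any chunk
        else:
            result.append((label, tokens))  # each remaining run becomes one chunk
    return result

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
chunks = chink(sentence, {"VBD", "IN"})
for item in chunks:
    print(item)
```

This yields the same two NP chunks as the chink rule above, with barked/VBD and at/IN left outside.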
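#Representing chunks: tags vs. trees
As an intermediate state between tagging and parsing, chunk structures can be represented using either tags or trees. The most widespread representation uses IOB tags.
In this scheme, each token is tagged with one of three special chunk tags: I (inside), O (outside), or B (begin).
A token is tagged B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I, and all other tokens are tagged O.
The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP.
NLTK uses trees as its internal representation of chunks, but provides methods for converting between these trees and the IOB format (nltk.chunk.tree2conlltags and nltk.chunk.conlltags2tree, both used later in these notes).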
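The correspondence between the two representations can be illustrated with a small NLTK-free sketch. The function names `chunks_to_iob` and `iob_to_chunks` are hypothetical, and a chunk is represented as a `(label, [tokens])` tuple purely for this example; NLTK itself works with Tree objects.

```python
def chunks_to_iob(chunked):
    """Flatten a mix of (label, [(word, pos), ...]) chunks and bare (word, pos)
    tokens into (word, pos, iob_tag) triples."""
    triples = []
    for first, second in chunked:
        if isinstance(second, list):              # a chunk: (label, tokens)
            for k, (word, pos) in enumerate(second):
                prefix = "B" if k == 0 else "I"   # B opens the chunk, I continues it
                triples.append((word, pos, "%s-%s" % (prefix, first)))
        else:                                     # a token outside any chunk
            triples.append((first, second, "O"))
    return triples

def iob_to_chunks(triples):
    """Inverse of chunks_to_iob."""
    chunked = []
    for word, pos, iob in triples:
        if iob == "O":
            chunked.append((word, pos))
            continue
        label = iob[2:]
        prev = chunked[-1] if chunked else None
        continues = (iob.startswith("I-") and prev is not None
                     and isinstance(prev[1], list) and prev[0] == label)
        if continues:
            prev[1].append((word, pos))
        else:
            chunked.append((label, [(word, pos)]))
    return chunked

chunked = [("NP", [("We", "PRP")]),
           ("saw", "VBD"),
           ("NP", [("the", "DT"), ("yellow", "JJ"), ("dog", "NN")])]
iob = chunks_to_iob(chunked)
for triple in iob:
    print(triple)
```

Converting back with `iob_to_chunks(iob)` recovers the original structure, which is why the two forms can be used interchangeably.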
3 Developing and evaluating chunkers
This section looks at how to evaluate a chunker.
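The scores reported in this section (IOB accuracy, precision, recall, F-measure) can be sketched in plain Python as follows. This is a simplified illustration under the assumption that a chunk counts as correct only when its boundaries and label both match the gold standard; the helper names are hypothetical and NLTK's own scoring may differ in detail.

```python
def chunk_spans(iob_tags):
    """Return the set of (start, end, label) chunks encoded by a list of IOB tags."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(iob_tags) + ["O"]):   # sentinel closes a final chunk
        inside = tag.startswith("I-") and start is not None and tag[2:] == label
        if not inside:
            if start is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag != "O":
                start, label = i, tag[2:]
    return spans

def chunk_scores(gold_tags, guess_tags):
    """Return (IOB accuracy, precision, recall, F-measure) for one tag sequence."""
    gold, guess = chunk_spans(gold_tags), chunk_spans(guess_tags)
    correct = len(gold & guess)
    precision = correct / float(len(guess)) if guess else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
    matches = sum(1 for g, p in zip(gold_tags, guess_tags) if g == p)
    accuracy = matches / float(len(gold_tags))
    return accuracy, precision, recall, f_measure

gold = ["B-NP", "I-NP", "O", "O", "B-NP", "I-NP"]
no_chunks = ["O"] * len(gold)                       # a chunker that finds nothing
partial = ["B-NP", "I-NP", "O", "O", "B-NP", "O"]   # second chunk cut short

print(chunk_scores(gold, no_chunks))
print(chunk_scores(gold, partial))
```

A chunker that creates no chunks still gets a non-zero IOB accuracy, because every O tag in the gold standard is matched, while its precision, recall and F-measure are all zero. This is why tag accuracy alone is a poor measure of chunking quality.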
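#Reading IOB format and the CoNLL-2000 chunking corpus
The CoNLL-2000 chunking corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions and annotated with part-of-speech tags and chunk tags in IOB format.
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)
The corpus contains three chunk types: NP chunks, VP chunks and PP chunks.
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])  # select only the NP chunks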
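#Simple evaluation and baselines
cp = nltk.RegexpParser("")  # a chunker that creates no chunks
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))  # evaluation result
ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%

grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)  # a naive regular-expression chunker
# Note: no chunk_types filter is given here, so the gold standard also contains
# VP and PP chunks, which this NP-only grammar cannot produce.
test_sents = conll2000.chunked_sents('test.txt')
print(cp.evaluate(test_sents))  # evaluation result
ChunkParse score:
    IOB Accuracy:  62.5%
    Precision:     70.6%
    Recall:        38.5%
    F-Measure:     49.8%

Noun phrase chunking with a unigram tagger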
# Use the training corpus to find the chunk tag (I, O or B) that is most likely
# for each part-of-speech tag. A unigram tagger can be used to build the chunker:
# instead of determining the correct part-of-speech tag for each word, it tries to
# determine the correct chunk tag given each word's part-of-speech tag.
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # assign an IOB chunk tag to each part-of-speech tag
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)  # convert to a chunk tree

# Train and evaluate on the CoNLL-2000 chunking corpus
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%

# Inspect the chunk tag that the unigram tagger assigns to each part-of-speech tag
postags = sorted(set(pos for sent in train_sents for (word, pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))
[(u'#', u'B-NP'), (u'$', u'B-NP'), (u"''", u'O'), (u'(', u'O'), (u')', u'O'),
 (u',', u'O'), (u'.', u'O'), (u':', u'O'), (u'CC', u'O'), (u'CD', u'I-NP'),
 (u'DT', u'B-NP'), (u'EX', u'B-NP'), (u'FW', u'I-NP'), (u'IN', u'O'),
 (u'JJ', u'I-NP'), (u'JJR', u'B-NP'), (u'JJS', u'I-NP'), (u'MD', u'O'),
 (u'NN', u'I-NP'), (u'NNP', u'I-NP'), (u'NNPS', u'I-NP'), (u'NNS', u'I-NP'),
 (u'PDT', u'B-NP'), (u'POS', u'B-NP'), (u'PRP', u'B-NP'), (u'PRP$', u'B-NP'),
 (u'RB', u'O'), (u'RBR', u'O'), (u'RBS', u'B-NP'), (u'RP', u'O'), (u'SYM', u'O'),
 (u'TO', u'O'), (u'UH', u'O'), (u'VB', u'O'), (u'VBD', u'O'), (u'VBG', u'O'),
 (u'VBN', u'O'), (u'VBP', u'O'), (u'VBZ', u'O'), (u'WDT', u'B-NP'),
 (u'WP', u'B-NP'), (u'WP$', u'B-NP'), (u'WRB', u'O'), (u'``', u'O')]
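What the unigram chunker learns can be sketched without NLTK: count how often each part-of-speech tag occurs with each chunk tag, and keep the most frequent one. The training triples below are toy data invented for illustration, `train_unigram_chunk_tagger` is a hypothetical helper, and falling back to "O" for an unseen tag is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def train_unigram_chunk_tagger(train_sents):
    """Map each POS tag to its most frequent IOB chunk tag in the training data."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for _word, pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

# Toy training data, invented for illustration: (word, pos, chunktag) triples.
train = [
    [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
    [("a", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("slept", "VBD", "O")],
    [("dogs", "NNS", "B-NP"), ("bark", "VBP", "O")],
    [("rice", "NN", "B-NP"), ("fell", "VBD", "O")],
]
model = train_unigram_chunk_tagger(train)

# Tag a new sentence from its POS tags alone; unseen tags fall back to "O".
pos_tags = ["DT", "NN", "VBD", "UH"]
chunk_tags = [model.get(pos, "O") for pos in pos_tags]
print(list(zip(pos_tags, chunk_tags)))
```

NN is seen twice as I-NP and once as B-NP, so the model always tags it I-NP. A lookup of this kind cannot tell a chunk-initial noun from a chunk-internal one, which is the limitation that context-aware chunkers try to address.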
# A bigram tagger can be used to build a chunker in the same way. Given each word's
# part-of-speech tag, it chooses the chunk tag (I, O or B) using the current
# part-of-speech tag together with the chunk tag assigned to the previous token.
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # assign an IOB chunk tag to each part-of-speech tag
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)  # convert to a chunk tree

bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%
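#Training classifier-based chunkers
Sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked.
Install OCaml, which is needed to build megam.
Install megam, the maximum entropy (maxent) optimization package used by the classifier below.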
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # maximum entropy classifier, trained with megam
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

# Feature extractor 1: only the part-of-speech tag of the current token
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%

# Feature extractor 2: add the previous part-of-speech tag,
# to model the interaction between adjacent tags
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.7%
    Precision:     82.1%
    Recall:        87.2%
    F-Measure:     84.5%

# Feature extractor 3: add the current word itself
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.2%
    Precision:     83.2%
    Recall:        88.3%
    F-Measure:     85.7%

# Feature extractor 4: lookahead features, paired features,
# and a more complex contextual feature
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

# Collect the set of part-of-speech tags seen since the most recent determiner
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.8%
    Recall:        91.1%
    F-Measure:     89.9%
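4 Recursion in linguistic structure
#Building nested structure with cascaded chunkers
Chunk structures of arbitrary depth can be built simply by creating a multi-stage chunk grammar that contains recursive rules.
The following example has patterns for noun phrases, prepositional phrases, verb phrases, and clauses.
grammar = r"""
  NP: {<DT|JJ|NN.*>+}            # chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}                 # chunk a preposition followed by an NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}   # chunk a verb and its arguments
  CLAUSE: {<NP><VP>}             # chunk an NP followed by a VP
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Mary","NN"),("saw","VBD"),("the","DT"),("cat","NN"),
            ("sit","VB"),("on","IN"),("the","DT"),("mat","NN")]
print(cp.parse(sentence))
(S
  (NP Mary/NN)
  saw/VBD              # the VP headed by "saw" is not recognized
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

cp = nltk.RegexpParser(grammar, loop=2)  # apply the rules a second time
print(cp.parse(sentence))
(S
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))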
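#Trees
In NLTK, a tree is created by giving a node label and a list of children.
#Tree traversal
It is standard practice to use a recursive function to traverse a tree.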
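The recursive pattern can be illustrated without NLTK by standing in a plain `(label, children)` tuple for a tree node, with strings as leaves. This representation and the function names are assumptions of the sketch, not NLTK's Tree class, but the recursion has the same shape: handle a leaf directly, otherwise recurse into each child.

```python
def bracketed(tree):
    """Return a bracketed string for a (label, children) tree; leaves are strings."""
    if isinstance(tree, str):
        return tree                       # base case: a leaf
    label, children = tree
    return "(%s %s)" % (label, " ".join(bracketed(child) for child in children))

def leaves(tree):
    """Collect the leaves of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1] for leaf in leaves(child)]

# A node is built from a label and a list of children.
np1 = ("NP", ["Alice"])
np2 = ("NP", ["the", "rabbit"])
vp = ("VP", ["chased", np2])
sentence_tree = ("S", [np1, vp])

print(bracketed(sentence_tree))
print(leaves(sentence_tree))
```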
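5 Named entity recognition
The goal of a named entity recognition (NER) system is to identify all textual mentions of named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type. Named entity recognition is often a prelude to identifying relations in information extraction, and it also helps with other tasks. For example, in question answering (QA), we try to improve the precision of information retrieval by returning not whole pages but just the parts that contain an answer to the user's question. Most QA systems take the documents returned by standard information retrieval and then attempt to isolate the smallest text snippet in the document that contains the answer. Consider the question: Who was the first President of the US? A retrieved document may contain the answer, but what we want is an answer of the form X was the first President of the US, where X is not only a noun phrase but also a named entity of type PER.
How do we go about identifying named entities? One option is to look each word up in an appropriate list of names, but many entity terms are ambiguous: May and North, for example, may be a date and a location, but each could also be part of a person's name.
Named entities also often span several tokens, so we need to be able to identify the beginning and end of multi-token sequences.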
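Both problems show up in even the simplest list-based tagger. The sketch below uses a hypothetical three-entry gazetteer and made-up sentences; it looks tokens up one at a time and ignores context, which is exactly what makes it unreliable.

```python
# Hypothetical mini-gazetteer, invented for illustration only.
GAZETTEER = {"May": "DATE", "North": "LOCATION", "New York": "LOCATION"}

def gazetteer_tag(tokens):
    """Tag every token that appears in the gazetteer, ignoring its context."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

# Problem 1: ambiguity. "North" is a surname here, but is tagged as a location.
tagged = gazetteer_tag(["Mr.", "North", "arrived", "in", "May"])
print(tagged)

# Problem 2: multi-token names. "New York" is in the list, but a token-by-token
# lookup never sees the two words together, so nothing is found.
missed = gazetteer_tag(["She", "moved", "to", "New", "York"])
print(missed)
```

Fixing the second problem requires deciding where an entity begins and ends, which is the same boundary problem that chunking addresses.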
NER is a task that is well suited to the classifier-based approach.
NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk().
sent = nltk.corpus.treebank.tagged_sents()[22]
# with binary=True, named entities are simply tagged as NE
print(nltk.ne_chunk(sent, binary=True))
(S The/DT (NE U.S./NNP) is/VBZ one/CD of/IN ......

# otherwise the classifier adds category labels such as PERSON, ORGANIZATION and GPE
print(nltk.ne_chunk(sent))
(S The/DT (GPE U.S./NNP) is/VBZ ......
  (PERSON Brooke/NNP T./NNP Mossman/NNP) ,/, a/DT professor/NN of/IN pathlogy/NN
  at/IN the/DT (ORGANIZATION University/NNP) of/IN
  (PERSON Vermont/NNP College/NNP) of/IN (GPE Medicine/NNP)
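6 Relation extraction
Once the named entities in a text have been identified, we can extract the relations that exist between them.
One approach is to first look for all triples of the form (X, a, Y), where X and Y are named entities of the required types and a is the string of words between X and Y that expresses their relation. Regular expressions can then pick out just those instances of a that express the relation of interest.
# Match any filler that contains the word "in". The negative lookahead (?!\b.+ing)
# skips strings where "in" is followed by a gerund, e.g. "success in supervising".
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
[ORG: u'WHYY'] u'in' [LOC: u'Philadelphia']
[ORG: u'McGlashan & Sarrail'] u'firm in' [LOC: u'San Mateo']
[ORG: u'Freedom Forum'] u'in' [LOC: u'Arlington']
[ORG: u'Brookings Institution'] u', the research group in' [LOC: u'Washington']
[ORG: u'Idealab'] u', a self-described business incubator based in' [LOC: u'Los Angeles']
[ORG: u'Open Text'] u', based in' [LOC: u'Waterloo']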
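The triple search itself can be sketched in plain Python. This is a simplified stand-in, not how nltk.sem.extract_rels is implemented: a sentence is assumed to be a list in which entities are `(type, text)` tuples and all other tokens are strings, and only adjacent entity pairs are considered. The function name and both example sentences are invented for illustration.

```python
import re

# Same pattern as above: "in" not followed by a gerund.
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

def extract_triples(items, subj_type, obj_type, pattern):
    """Return (X, filler, Y) for adjacent entity pairs of the requested types
    whose intervening words match the pattern."""
    triples = []
    entities = [(i, item) for i, item in enumerate(items) if isinstance(item, tuple)]
    for (i, (xtype, x)), (j, (ytype, y)) in zip(entities, entities[1:]):
        filler = " ".join(items[i + 1:j])   # the words between the two entities
        if xtype == subj_type and ytype == obj_type and pattern.match(filler):
            triples.append((x, filler, y))
    return triples

# Made-up sentence: entities are (type, text) tuples, other tokens are strings.
items = [("ORG", "Open Text"), ",", "based", "in", ("LOC", "Waterloo"),
         "and", ("ORG", "WHYY"), "in", ("LOC", "Philadelphia")]
found = extract_triples(items, "ORG", "LOC", IN)
for triple in found:
    print(triple)

# The lookahead rejects a filler where "in" is followed by a gerund.
gerund = [("ORG", "Acme"), "succeeded", "in", "supervising", ("LOC", "Paris")]
rejected = extract_triples(gerund, "ORG", "LOC", IN)
print(rejected)
```

The pair (Waterloo, WHYY) is skipped because its types are LOC and ORG rather than ORG and LOC, so the type filter and the pattern filter work independently.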