NLP Basics (4): Named Entity Recognition (NER)

This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).

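Named entity recognition (NER) is an important building block for applications such as information extraction, question answering, syntactic parsing, and machine translation, and it plays a key role in making natural language processing practical. Generally speaking, the task is to identify the named entities in the input text that fall into three major classes (entity, time, and number) and seven subclasses (person name, organization name, place name, time, date, currency, and percentage).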

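As a simple example, running NER on the sentence “小明早上8点去学校上课。” ("Xiao Ming goes to school for class at 8 in the morning.") should extract the following information:

Person: Xiao Ming (小明); Time: 8 in the morning (早上8点); Location: school (学校).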
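One common way to represent such a result at the token level is the BIO tagging scheme (B = beginning of an entity, I = inside, O = outside). The article itself does not use BIO tags, so the sketch below is only an illustration: the tokenization, the tag names, and the helper `bio_to_entities` are all assumptions made for this example.

```python
def bio_to_entities(tokens, tags):
    """Collect (text, label) pairs from parallel token and BIO-tag lists."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; store the previous one, if any
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # The current entity continues
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the current entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    # An entity may end exactly at the end of the sentence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hand-written tags for the example sentence (not produced by any tagger)
tokens = ["Xiao", "Ming", "goes", "to", "school", "at",
          "8", "in", "the", "morning", "."]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-LOCATION", "O",
        "B-TIME", "I-TIME", "I-TIME", "I-TIME", "O"]

print(bio_to_entities(tokens, tags))
```

Each returned pair corresponds to one entry of the extracted information listed above.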

This article introduces a few tools for NER. In later posts, if the opportunity arises, we will try implementing NER with HMMs, CRFs, or deep learning.

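Let us first look at how NLTK and Stanford NLP categorize named entities, as shown in the figure below:

[Figure: named entity categories in NLTK and Stanford NLP]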

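In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents. LOCATION can denote those as well, and additionally covers natural features such as well-known mountains and rivers. FACILITY usually denotes famous monuments or man-made structures.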

Below we use two tools for the NER task: NLTK and Stanford NLP.

We start with NLTK. Our sample document (an introduction to FIFA, source: *) is as follows:

FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.

The Python code for NER is as follows:

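import re

import nltk
import pandas as pd

# NLTK needs its tokenizer, POS tagger, NE chunker and word-list data packages;
# install them once with nltk.download(...) (resource names vary by NLTK version)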

def parse_document(document):
    """Split a document into a list of stripped sentences."""
    # Validate the type first: re.sub would raise a TypeError on non-strings
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # Replace line breaks so that wrapped lines do not break up sentences
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]

# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # only subtrees carry an NE label; plain tokens are (word, tag) tuples
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))

# get unique named entities (deduplicate once, after the loop)
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

# display results
print(entity_frame)

The output is as follows:

        Entity Name   Entity Type
0              FIFA  ORGANIZATION
1   Central America  ORGANIZATION
2           Belgium           GPE
3         Caribbean      LOCATION
4              Asia           GPE
5            France           GPE
6           Oceania           GPE
7           Germany           GPE
8     South America           GPE
9           Denmark           GPE
10           Zürich           GPE
11           Africa        PERSON
12           Sweden           GPE
13      Netherlands           GPE
14            Spain           GPE
15      Switzerland           GPE
16            North           GPE
17           Europe           GPE

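As we can see, NLTK handles the NER task reasonably well overall: it recognizes FIFA as an organization (ORGANIZATION) and Belgium and Asia as GPE. There are some shortcomings, though. For example, it labels Central America as ORGANIZATION when it should be GPE, and it labels Africa as PERSON when it should also be GPE.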

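Next, we try the Stanford NLP tools, mainly the Stanford NER tagger. Before using it, you need to install Java (typically the JDK) on your machine and add it to the system path, and download the English NER package stanford-ner-2018-10-16.zip (172 MB) from https://nlp.stanford.edu/software/CRF-NER.shtml. On the author's machine, for example, Java is located at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the Stanford NER zip file was extracted to the folder E://stanford-ner-2018-10-16, as shown in the figure below:

[Figure: contents of the extracted stanford-ner-2018-10-16 folder]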
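Because the tagger depends on a Java executable, a model file, and a jar file, it can help to check that all three exist before loading it. The following is a minimal sketch using only the standard library; the helper name `missing_prerequisites` is hypothetical, and the paths in the usage lines are the author's example paths from above, which will differ on other machines.

```python
import os
import shutil


def missing_prerequisites(java_path, model_path, jar_path):
    """Return descriptions of the prerequisites that cannot be found."""
    problems = []
    # Accept either an explicit Java executable or one discoverable on PATH
    java_ok = bool(java_path) and os.path.isfile(java_path)
    if not java_ok and shutil.which("java") is None:
        problems.append("Java executable not found")
    # The model and the jar must both exist as regular files
    for label, path in (("model", model_path), ("jar", jar_path)):
        if not os.path.isfile(path):
            problems.append(f"{label} file not found: {path}")
    return problems


# Usage with the example paths from this article (adjust for your machine)
for problem in missing_prerequisites(
    r"C:\Program Files\Java\jdk1.8.0_161\bin\java.exe",
    "E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz",
    "E://stanford-ner-2018-10-16/stanford-ner.jar",
):
    print(problem)
```

An empty result means all three prerequisites were found.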

The classifiers folder contains the following files:

[Figure: files in the classifiers folder]

Their meanings are as follows:

3 class: Location, Person, Organization

4 class: Location, Person, Organization, Misc

7 class: Location, Person, Organization, Money, Percent, Date, Time
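The choice of model follows from which entity types you need. As a rough sketch, the label sets above can be put in a dictionary and the smallest model covering the required labels picked from it. The keys `3class`, `4class`, and `7class`, the upper-case label spelling, and the helper `smallest_model_for` are naming assumptions for this illustration, not part of Stanford NER.

```python
# Label sets of the three models, as listed above
MODEL_LABELS = {
    "3class": {"LOCATION", "PERSON", "ORGANIZATION"},
    "4class": {"LOCATION", "PERSON", "ORGANIZATION", "MISC"},
    "7class": {"LOCATION", "PERSON", "ORGANIZATION",
               "MONEY", "PERCENT", "DATE", "TIME"},
}


def smallest_model_for(required_labels):
    """Return the model with the fewest labels that covers all required ones."""
    required = set(required_labels)
    candidates = [name for name, labels in MODEL_LABELS.items()
                  if required <= labels]
    if not candidates:
        return None  # no single model covers every requested label
    return min(candidates, key=lambda name: len(MODEL_LABELS[name]))


print(smallest_model_for({"PERSON", "LOCATION"}))
print(smallest_model_for({"DATE", "LOCATION"}))
print(smallest_model_for({"MISC", "DATE"}))
```

Note that the 7-class model does not include Misc, so no single model covers both Misc and the date or number types.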

Stanford NER can be used from Python. The complete code is as follows:

import os
import re

import nltk
import pandas as pd
from nltk.tag import StanfordNERTagger

def parse_document(document):
    """Split a document into a list of stripped sentences."""
    # Validate the type first: re.sub would raise a TypeError on non-strings
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # Replace line breaks so that wrapped lines do not break up sentences
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path

# load stanford NER (7-class model: also tags money, percent, date and time)
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

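# tag sentences (tag_sents passes all sentences to Java in one call,
# instead of starting a new Java process for each sentence)
ne_annotated_sentences = sn.tag_sents(tokenized_sentences)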

# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    entity_terms = []  # tokens of the entity currently being built
    entity_tag = None  # its NE category
    # the trailing ('', 'O') sentinel flushes an entity that ends the sentence
    for term, tag in list(sentence) + [('', 'O')]:
        if tag != 'O' and tag == entity_tag:
            entity_terms.append(term)  # the current entity continues
            continue
        # the tag changed: store the entity collected so far, if any
        if entity_terms:
            named_entities.append((' '.join(entity_terms), entity_tag))
        if tag != 'O':
            entity_terms, entity_tag = [term], tag  # start a new entity
        else:
            entity_terms, entity_tag = [], None

# get unique named entities
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

# display results
print(entity_frame)

The output is as follows:

                Entity Name   Entity Type
0                      1904          DATE
1                   Denmark      LOCATION
2                     Spain      LOCATION
3   North & Central America  ORGANIZATION
4             South America      LOCATION
5                   Belgium      LOCATION
6                    Zürich      LOCATION
7           the Netherlands      LOCATION
8                    France      LOCATION
9                 Caribbean      LOCATION
10                   Sweden      LOCATION
11                  Oceania      LOCATION
12                     Asia      LOCATION
13                     FIFA  ORGANIZATION
14                   Europe      LOCATION
15                   Africa      LOCATION
16              Switzerland      LOCATION
17                  Germany      LOCATION

As we can see, Stanford NER gives better results here: it recognizes Africa as LOCATION and 1904 as a date (DATE), which NLTK missed. However, it still mislabels North & Central America as ORGANIZATION.
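Since the tagger returns one (token, tag) pair per token, the core of the extraction step is merging consecutive tokens that share a non-'O' tag. The sketch below shows that step in isolation with `itertools.groupby`; it also handles an entity at the very end of a sentence. The helper name `group_entities` and the sample input are made up for illustration and are not real tagger output.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' tag into one entity."""
    entities = []
    # groupby yields one run per stretch of identical consecutive tags
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(term for term, _ in run), tag))
    return entities


# Hand-written sample; the last entity ends the sentence
sample = [("FIFA", "ORGANIZATION"), ("oversees", "O"), ("competition", "O"),
          ("in", "O"), ("South", "LOCATION"), ("America", "LOCATION")]

print(group_entities(sample))
```

One limitation of this approach: two different entities of the same type that are directly adjacent are merged into one, because the tags carry no begin or inside markers.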

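Note that this does not mean Stanford NER always outperforms NLTK's NER. The two may differ in their target domains, training corpora, and algorithms, so choose the tool that fits your own needs.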
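To make the comparison concrete, the two result tables from this article can be compared directly. The sketch below copies both outputs and maps NLTK's GPE onto LOCATION before comparing. That mapping is an assumption made here only so the two label sets line up; neither library performs it.

```python
# Results copied from the two output tables in this article
nltk_entities = {
    ("FIFA", "ORGANIZATION"), ("Central America", "ORGANIZATION"),
    ("Belgium", "GPE"), ("Caribbean", "LOCATION"), ("Asia", "GPE"),
    ("France", "GPE"), ("Oceania", "GPE"), ("Germany", "GPE"),
    ("South America", "GPE"), ("Denmark", "GPE"), ("Zürich", "GPE"),
    ("Africa", "PERSON"), ("Sweden", "GPE"), ("Netherlands", "GPE"),
    ("Spain", "GPE"), ("Switzerland", "GPE"), ("North", "GPE"),
    ("Europe", "GPE"),
}
stanford_entities = {
    ("1904", "DATE"), ("Denmark", "LOCATION"), ("Spain", "LOCATION"),
    ("North & Central America", "ORGANIZATION"),
    ("South America", "LOCATION"), ("Belgium", "LOCATION"),
    ("Zürich", "LOCATION"), ("the Netherlands", "LOCATION"),
    ("France", "LOCATION"), ("Caribbean", "LOCATION"),
    ("Sweden", "LOCATION"), ("Oceania", "LOCATION"), ("Asia", "LOCATION"),
    ("FIFA", "ORGANIZATION"), ("Europe", "LOCATION"),
    ("Africa", "LOCATION"), ("Switzerland", "LOCATION"),
    ("Germany", "LOCATION"),
}

# Assumed mapping so that both tools use the same label for places
LABEL_MAP = {"GPE": "LOCATION"}


def normalize(entities):
    """Apply LABEL_MAP to the label of every (name, label) pair."""
    return {(name, LABEL_MAP.get(label, label)) for name, label in entities}


nltk_norm = normalize(nltk_entities)
stanford_norm = normalize(stanford_entities)

agreed = nltk_norm & stanford_norm
only_nltk = nltk_norm - stanford_norm
only_stanford = stanford_norm - nltk_norm

print("agreed:", len(agreed))
print("only NLTK:", sorted(only_nltk))
print("only Stanford NER:", sorted(only_stanford))
```

The entries that differ are exactly the cases discussed above, plus differences in entity boundaries such as "Netherlands" versus "the Netherlands".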

That concludes this post. If the opportunity arises, later posts will try implementing NER with HMMs, CRFs, or deep learning.

