Entity recognition in Stanford NLP using Python

I am using Stanford CoreNLP from Python. I took the code from here.
The code is as follows:

from collections import defaultdict
import json
import logging

from stanfordcorenlp import StanfordCoreNLP


class StanfordNLP:
    def __init__(self, host='http://localhost', port=9000):
        # Connect to a CoreNLP server that is already running on host:port.
        self.nlp = StanfordCoreNLP(host, port=port,
                                   timeout=30000, quiet=True, logging_level=logging.DEBUG)
        # Properties sent with every annotate() request.
        self.props = {
            'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref,relation,sentiment',
            'pipelineLanguage': 'en',
            'outputFormat': 'json'
        }

    def word_tokenize(self, sentence):
        return self.nlp.word_tokenize(sentence)

    def pos(self, sentence):
        return self.nlp.pos_tag(sentence)

    def ner(self, sentence):
        return self.nlp.ner(sentence)

    def parse(self, sentence):
        return self.nlp.parse(sentence)

    def dependency_parse(self, sentence):
        return self.nlp.dependency_parse(sentence)

    def annotate(self, sentence):
        # The server returns a JSON string; decode it into a dict.
        return json.loads(self.nlp.annotate(sentence, properties=self.props))

    @staticmethod
    def tokens_to_dict(_tokens):
        # Index the per-token annotations by token position.
        tokens = defaultdict(dict)
        for token in _tokens:
            tokens[int(token['index'])] = {
                'word': token['word'],
                'lemma': token['lemma'],
                'pos': token['pos'],
                'ner': token['ner']
            }
        return tokens


if __name__ == '__main__':
    sNLP = StanfordNLP()
    text = "China on Wednesday issued a $50-billion list of U.S. goods  including soybeans and small aircraft for possible tariff hikes in an escalating technology dispute with Washington that companies worry could set back the global economic recovery.The country's tax agency gave no date for the 25 percent increase..."
    ANNOTATE = sNLP.annotate(text)
    POS = sNLP.pos(text)
    TOKENS = sNLP.word_tokenize(text)
    NER = sNLP.ner(text)
    PARSE = sNLP.parse(text)
    DEP_PARSE = sNLP.dependency_parse(text)

I am only interested in entity recognition, whose result is saved in the variable NER. NER gives the following result:

If I run the same text on the Stanford website, the NER output is:

There are two problems with my Python code:

1. '$' and '50-billion' should be combined and labeled as a single entity. Similarly, I want '25' and 'percent' to form a single entity, as they do in the online Stanford output.
2. In my output, 'Washington' is labeled as State and 'China' as Country. I want both of them labeled 'Loc', as in the Stanford website output. A possible solution to this problem lies in the documentation.

But I don't know which model I am using or how to change it.

1 Solution

#1

Here is one way to solve this:

Make sure to download Stanford CoreNLP 3.9.1 and the necessary model jars.

Set up the server properties in a file named "ner-server.properties":

annotators = tokenize,ssplit,pos,lemma,ner
ner.applyFineGrained = false
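
If you would rather keep the wrapper from the question, the same settings can probably be sent with each request instead of a server-side file. This is only a minimal sketch: it assumes the stanfordcorenlp wrapper's annotate(text, properties=...) call shown in the question and that the server accepts ner.applyFineGrained as a per-request property. The helper names build_ner_props and coarse_ner are made up for illustration.

```python
import json


def build_ner_props(fine_grained=False):
    """Build request properties that mirror ner-server.properties."""
    return {
        'annotators': 'tokenize,ssplit,pos,lemma,ner',
        # CoreNLP properties are strings, so the boolean is spelled out.
        'ner.applyFineGrained': 'true' if fine_grained else 'false',
        'outputFormat': 'json',
    }


def coarse_ner(nlp, text):
    """Annotate text with an already-connected client (assumed to be a
    stanfordcorenlp.StanfordCoreNLP instance) and return (word, ner) pairs."""
    result = json.loads(nlp.annotate(text, properties=build_ner_props()))
    return [(tok['word'], tok['ner'])
            for sent in result['sentences']
            for tok in sent['tokens']]
```

With the class from the question this would be called as coarse_ner(sNLP.nlp, text).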

Start the server with this command:

java -Xmx12g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties ner-server.properties
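
Before running any client code it can help to confirm that something is actually listening on the port. A small sketch using only the standard library; server_is_up is a hypothetical helper, and it only checks that the TCP port accepts connections, not that CoreNLP has finished loading its models.

```python
import socket


def server_is_up(host='localhost', port=9000, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False
```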

Make sure you've installed this Python package:

https://github.com/stanfordnlp/python-stanford-corenlp

Run this Python code:

import corenlp

# Connect to the server started above instead of launching a new one.
client = corenlp.CoreNLPClient(start_server=False, annotators=["tokenize", "ssplit", "pos", "lemma", "ner"])
sample_text = "Joe Smith was born in Hawaii."
ann = client.annotate(sample_text)
# Each mention spans one or more tokens, so multi-token entities come back whole.
for mention in ann.sentence[0].mentions:
    print([x.word for x in ann.sentence[0].token[mention.tokenStartInSentenceInclusive:mention.tokenEndInSentenceExclusive]])
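
The same loop can be wrapped in a helper that covers every sentence and returns the entity type along with the text. This is a sketch rather than part of the package: mention_spans is a hypothetical name, and it relies only on the fields already used in the loop plus entityType.

```python
def mention_spans(ann):
    """Return (text, entity_type) for every entity mention in an annotation."""
    spans = []
    for sentence in ann.sentence:
        for mention in sentence.mentions:
            start = mention.tokenStartInSentenceInclusive
            end = mention.tokenEndInSentenceExclusive
            # Join the tokens covered by the mention into one string.
            words = [tok.word for tok in sentence.token[start:end]]
            spans.append((' '.join(words), mention.entityType))
    return spans
```

Calling mention_spans(ann) on the annotation above should give one tuple per entity, with multi-token entities joined by spaces.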

Here are all the fields available in the EntityMention for each entity:

sentenceIndex: 0
tokenStartInSentenceInclusive: 5
tokenEndInSentenceExclusive: 7
ner: "MONEY"
normalizedNER: "$5.0E10"
entityType: "MONEY"
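
If you only have per-token output, such as the list of (word, tag) pairs that the wrapper in the question appears to return from ner(), adjacent tokens with the same tag can be merged by hand. A minimal sketch, assuming that pair format and 'O' as the non-entity tag; merge_entities is a hypothetical helper and the sample list is hand-written for illustration, not real server output.

```python
from itertools import groupby


def merge_entities(tagged_tokens, outside_tag='O'):
    """Collapse runs of adjacent tokens that share a NER tag into one entity."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == outside_tag:
            continue  # skip tokens that are not part of any entity
        entities.append((' '.join(word for word, _ in group), tag))
    return entities


# Hand-written sample in the assumed (word, tag) format.
tagged = [('China', 'LOCATION'), ('on', 'O'), ('Wednesday', 'DATE'),
          ('issued', 'O'), ('a', 'O'), ('$', 'MONEY'), ('50-billion', 'MONEY'),
          ('list', 'O')]
print(merge_entities(tagged))
# [('China', 'LOCATION'), ('Wednesday', 'DATE'), ('$ 50-billion', 'MONEY')]
```

One limitation of this approach: two distinct entities of the same type that sit directly next to each other are merged into one, which the server-side mentions avoid.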
