Stanford NLP - Using Parsed or Tagged Text to Generate Full XML

I'm trying to extract data from the PennTreeBank Wall Street Journal corpus. Most of it already has parse trees (the wsj_DDXX.mrg files), but some of the data is only tagged (the wsj_DDXX.pos files).

I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.

I've read many of the Javadocs, but I cannot figure out how to get the output I described.

For the POS files, I tried using the LexicalizedParser. It lets me use the existing tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to generate even these sub-optimal XML files correctly, I had to write a script to strip all of the brackets from the files. This is the command I use:

java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
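
The bracket-stripping step mentioned above could look roughly like the following sketch. It assumes the usual layout of the WSJ .pos files: word/TAG tokens, noun-phrase chunks wrapped in standalone `[` and `]`, and rows of `=` between blocks. The class and method names are hypothetical, and treating each `=====` row as a sentence boundary is a simplification that may not hold for every file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: flattens PTB .pos lines into
// whitespace-separated word/TAG text suitable for -tokenized -tagSeparator /.
class PosBracketStripper {

    /** Converts the lines of one .pos file into one "word/TAG ..." line per block. */
    static List<String> toTaggedSentences(List<String> lines) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("=====")) {
                // Assumption: a row of '=' ends the current block.
                flush(current, sentences);
                continue;
            }
            for (String piece : trimmed.split("\\s+")) {
                // Drop empty pieces and the standalone chunk brackets.
                if (piece.isEmpty() || piece.equals("[") || piece.equals("]")) {
                    continue;
                }
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(piece);
            }
        }
        flush(current, sentences);
        return sentences;
    }

    /** Moves the accumulated tokens into the output list, if there are any. */
    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "======================================",
            "",
            "[ Pierre/NNP Vinken/NNP ]",
            ",/, ",
            "[ 61/CD years/NNS ]",
            "old/JJ ,/, will/MD join/VB ",
            "./. ",
            "======================================");
        toTaggedSentences(sample).forEach(System.out::println);
    }
}
```

Only tokens that are exactly `[` or `]` are dropped, so a token such as `[/(` that carries its own tag would be kept.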

I also can't generate the output I want for the WSJ data that already has trees. I tried what is described here and looked at the corresponding Javadocs. I used a command similar to the one described, but I had to write a Python program to capture the stdout produced for each file and write it to a new file. The result is only a text file with the dependencies, not the XML format I want.

To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:

java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg

and

java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos

Edit: fixed a link.

1 Answer

#1

Yes, this is possible, but it is a bit tricky. There is no out-of-the-box feature that does this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, if you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.

On a very high level you have to do the following:

Load your trees with MemoryTreebank

Loop through all the trees and for each tree create a sentence CoreMap to which you add
- a TokensAnnotation
- a TreeAnnotation and the SemanticGraphCoreAnnotations

Create an Annotation object with a list containing the CoreMap objects for all sentences

Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
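
As a rough illustration of the last step, the two options could be collected in a plain java.util.Properties object like this. The class name is hypothetical, and the sketch assumes that your CoreNLP version reads enforceRequirements as a property; some versions take it as a second argument to the StanfordCoreNLP constructor instead, so check the Javadoc for the release you use.

```java
import java.util.Properties;

// Hypothetical holder for the pipeline options named in the steps above.
class CorefPipelineConfig {

    static Properties build() {
        Properties props = new Properties();
        // Only the annotators that still have to run; tokens, tags and trees
        // are expected to be on the Annotation already.
        props.setProperty("annotators", "lemma,ner,dcoref");
        // Assumption: this key is honoured by the CoreNLP version in use.
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    public static void main(String[] args) {
        // Print the options that would be handed to the StanfordCoreNLP constructor.
        build().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```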

Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
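
To make the token step more concrete, here is a small sketch of the information a sentence's token list has to carry. It deliberately does not use the CoreNLP API: it is plain Java that reads one PTB-style bracketed tree and collects the word and tag of every preterminal, skipping the -NONE- empty elements that .mrg trees contain. In a real implementation MemoryTreebank and the Tree class would supply this, and each pair would become a CoreLabel in the TokensAnnotation. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone reader, used only to illustrate the tagged yield of a tree.
class PtbLeafReader {

    /** Returns {word, tag} pairs for every preterminal, skipping -NONE- empties. */
    static List<String[]> taggedYield(String bracketed) {
        // Pad the parentheses so that whitespace splitting yields clean tokens.
        String[] toks = bracketed.replace("(", " ( ").replace(")", " ) ")
                                 .trim().split("\\s+");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i + 3 < toks.length; i++) {
            // A preterminal has the shape: ( TAG word )
            boolean preterminal = toks[i].equals("(")
                && !isParen(toks[i + 1])
                && !isParen(toks[i + 2])
                && toks[i + 3].equals(")");
            if (preterminal && !toks[i + 1].equals("-NONE-")) {
                result.add(new String[] {toks[i + 2], toks[i + 1]});
            }
        }
        return result;
    }

    private static boolean isParen(String tok) {
        return tok.equals("(") || tok.equals(")");
    }

    public static void main(String[] args) {
        String tree = "( (S (NP-SBJ (NNP Vinken)) (VP (VBZ wants) "
            + "(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB join))))) (. .)))";
        for (String[] pair : taggedYield(tree)) {
            System.out.println(pair[0] + "/" + pair[1]);
        }
    }
}
```

This works because literal parentheses in PTB text are escaped as -LRB- and -RRB-, so every `(` or `)` in the file is tree structure. Skipping the empty elements matters: the pipeline's later annotators expect one token per surface word, as they would get from raw text.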
