I have an XML file that contains hundreds of documents. Each block looks like this:
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<PARENT> FR940104-2-00001 </PARENT>
<TEXT>
<!-- PJG FTAG 4703 -->
<!-- PJG STAG 4703 -->
<!-- PJG ITAG l=90 g=1 f=1 -->
<!-- PJG /ITAG -->
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
<!-- PJG ITAG l=90 g=1 f=1 -->
/ Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG 0012 frnewline -->
<!-- PJG /ITAG -->
<!-- PJG ITAG l=01 g=1 f=1 -->
Vol. 59, No. 2
<!-- PJG 0012 frnewline -->
<!-- PJG /ITAG -->
<!-- PJG ITAG l=02 g=1 f=1 -->
Tuesday, January 4, 1994
<!-- PJG 0012 frnewline -->
<!-- PJG 0012 frnewline -->
<!-- PJG /ITAG -->
<!-- PJG /STAG -->
<!-- PJG /FTAG -->
</TEXT>
</DOC>
I want to load this XML file into a dictionary, Text, with the DOCNO as the key and the text inside the <TEXT> tags as the value. The text should not contain any of the comments. For example, Text['FR940104-2-00001'] must contain Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994. This is the code I wrote:
L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            docno.append(node3.data);
            #print node2.data
L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] = node3.data
    i = i+1
Surprisingly, with my code I get Text['FR940104-2-00001'] as u'\n'. Why is that, and how do I get what I want?
5 solutions
#1
4
You could avoid looping through the doc twice by using xml.sax.handler:
import xml.sax
import xml.sax.handler
import collections

class DocBuilder(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.state = ''      # name of the most recently opened element
        self.docno = ''      # DOCNO of the document being read
        self.text = collections.defaultdict(list)

    def startElement(self, name, attrs):
        self.state = name

    def endElement(self, name):
        if name == u'TEXT':
            self.docno = ''  # reset for the next DOC block

    def characters(self, content):
        # SAX never passes comments to characters(), so only real text arrives here
        content = content.strip()
        if content:
            if self.state == u'DOCNO':
                self.docno += content
            elif self.state == u'TEXT':
                self.text[self.docno].append(content)

with open('test.xml') as f:
    data = f.read()
builder = DocBuilder()
xml.sax.parseString(data, builder)
for key, value in builder.text.items():
    print('{k}: {v}'.format(k=key, v=' '.join(value)))
# FR940104-2-00001: Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994
#2
2
Similar to unutbu's answer, though I think simpler:
from lxml import etree

with open('test.xml') as f:
    doc = etree.parse(f)

result = {}
for elm in doc.xpath("/DOC[DOCNO]"):
    key = elm.xpath("DOCNO")[0].text.strip()
    # join with a space so adjacent text fragments do not run together
    value = " ".join(t.strip() for t in elm.xpath("TEXT/text()") if t.strip())
    result[key] = value
The XPath that finds the DOC element in this example needs to be changed to suit your real document. For example, if all the DOC elements are children of a single top-level element, you'd change it to /*/DOC (keeping the [DOCNO] predicate). The predicate on that XPath skips any DOC element that doesn't have a DOCNO child, which would otherwise cause an exception when setting the key.
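The path adjustment described above can be sketched with the standard library instead of lxml. This is only an illustration under stated assumptions: it swaps in xml.etree.ElementTree (whose findall supports a limited XPath subset, including the [DOCNO] child predicate), and the <DOCS> wrapper element is a hypothetical root that does not appear in the question.

```python
import xml.etree.ElementTree as ET

# Assumption: the real file wraps all DOC blocks in one root element.
# <DOCS> is a hypothetical name chosen for this sketch.
SAMPLE = """<DOCS>
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
/ Vol. 59, No. 2
</TEXT>
</DOC>
<DOC>
<TEXT>This block has no DOCNO, so the predicate skips it.</TEXT>
</DOC>
</DOCS>"""

root = ET.fromstring(SAMPLE)
result = {}
# "DOC[DOCNO]" is evaluated relative to the root: direct DOC children
# that have at least one DOCNO child.
for doc in root.findall("DOC[DOCNO]"):
    key = doc.findtext("DOCNO").strip()
    text_elem = doc.find("TEXT")
    raw = "".join(text_elem.itertext()) if text_elem is not None else ""
    # split()/join collapses newlines and repeated spaces into single spaces
    result[key] = " ".join(raw.split())

print(result)
```

Normalizing whitespace with split() and join() keeps the result independent of how the parser chunks the text around the comments.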
#3
1
Using lxml:
import lxml.etree as le

with open('test.xml') as f:
    doc = le.parse(f)

texts = {}
for docno in doc.xpath('DOCNO'):
    docno_text = docno.text.strip()
    # text() returns only the text children of TEXT, so comments are excluded
    text = ' '.join([t.strip()
                     for t in docno.xpath('following-sibling::TEXT[1]/text()')
                     if t.strip()])
    texts[docno_text] = text  # use the stripped DOCNO as the key
print(texts)
# {'FR940104-2-00001': 'Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994'}
This version is a tad simpler than my first lxml solution. It handles multiple DOCNO and TEXT nodes. The DOCNO and TEXT nodes should alternate, but in any case each DOCNO is associated with the closest TEXT node that follows it.
#4
0
Your line
Text[docno[i]] = node3.data
replaces the value of the mapping instead of appending to it. Your <TEXT> node has both text and comment children, interleaved with each other, so each text node overwrites the previous one and only the last one, a bare newline, survives.
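To make the interleaving visible, here is a small sketch that lists the children of a cut-down <TEXT> element using xml.dom.minidom. The SAMPLE string is a shortened copy of the question's data and the variable names are made up for this example.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Shortened copy of the <TEXT> block from the question.
SAMPLE = """<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
</TEXT>"""

text_elem = minidom.parseString(SAMPLE).documentElement

# Record the kind and content of every direct child, in document order.
kinds = []
for child in text_elem.childNodes:
    if child.nodeType == Node.TEXT_NODE:
        kinds.append(("text", child.data))
    elif child.nodeType == Node.COMMENT_NODE:
        kinds.append(("comment", child.data))

for kind, data in kinds:
    print(kind, repr(data))
```

The first and last children are text nodes holding only a newline, which is why plain assignment in the loop ends up storing u'\n'.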
#5
0
The DOM parser keeps comments as separate comment nodes, so your TEXT_NODE check already filters them out. Each run of text between two comments is its own node.
So you need to use:
Text[docno[i]] += node3.data
but before that, the dictionary needs an empty-string entry for every key. You can add Text[node3.data] = '' in your first block of code.
So your code becomes:
L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            docno.append(node3.data);
            Text[node3.data] = '';
            #print node2.data
L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] += node3.data
    i = i+1
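A self-contained variant of the same accumulation idea is sketched below. It is not the code from the question: it walks each DOC element once instead of pairing two parallel lists by index, strips whitespace from the key and the text pieces, and assumes a hypothetical <DOCS> root wrapper. The helper names direct_text and load_docs are made up for this example.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Assumption: DOC blocks sit under one root; <DOCS> is a hypothetical name.
SAMPLE = """<DOCS>
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
/ Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG 0012 frnewline -->
</TEXT>
</DOC>
</DOCS>"""

def direct_text(elem):
    """Join the non-empty text children of elem, ignoring comment nodes."""
    pieces = [c.data.strip() for c in elem.childNodes
              if c.nodeType == Node.TEXT_NODE]
    return " ".join(p for p in pieces if p)

def load_docs(xml_string):
    """Map each DOCNO to the comment-free text of its TEXT element."""
    dom = minidom.parseString(xml_string)
    result = {}
    for doc_elem in dom.getElementsByTagName("DOC"):
        docnos = doc_elem.getElementsByTagName("DOCNO")
        texts = doc_elem.getElementsByTagName("TEXT")
        if not docnos or not texts:
            continue  # skip incomplete blocks rather than raise
        result[direct_text(docnos[0])] = direct_text(texts[0])
    return result

Text = load_docs(SAMPLE)
print(Text)
```

Looking up DOCNO and TEXT inside the same DOC element avoids the index bookkeeping, so a block that lacks one of the two cannot shift later keys onto the wrong text.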