I am trying to parse XML with tags embedded in tags, like the following, using Nokogiri and Ruby:
<seg>Trennmesser <ph>&lt;I.FIGREF ITEM="3" FORMAT="PARENTHESIS"&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET="CIADDAJA"&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>
In this case I only need the word "Trennmesser", which is not inside any of the embedded tags.
In this second example:
<seg>Hilfsmittel <ph>&lt;F34@Z7@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt;F34@Z7@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>
The words between a closing </ph> tag and the next opening <ph> tag are also of interest, so the regex would need to extract the string "Hilfsmittel 0,5mm zwischen Beschleunigerwalze und Trennmesser schieben." and discard everything else.
I have also uploaded a part of the document here:
http://pastebin.com/Q8CdnASz
2 solutions
#1
Try this in irb:
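require 'nokogiri'
# The reserved characters inside <ph> must stay escaped (&lt; / &gt;) for the XML to be well-formed.
x = Nokogiri::XML.parse('<seg>Hilfsmittel <ph>&lt;F34@Z7@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt;F34@Z7@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')
# Keep only the non-element (text) children of <seg>, then join their content.
x.xpath('//seg').children.reject { |node| node.element? }.map(&:content).join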
For me this outputs:
=> "Hilfsmittel X = 0,5mm zwischen Beschleunigerwalze D und Trennmesser schieben."
The idea here is that we iterate over the children of the <seg> tag, rejecting the ones that are elements themselves (<ph>), which should leave only the text nodes. We then take the resulting array and join the content of those nodes together into one string.
Note that the output is slightly different from what you described, because there is an additional X and an additional D in between two of the tags.
#2
The content inside the <ph> tags has been encoded to preserve the reserved characters < and >. A clean way to deal with this is to let Nokogiri reparse those chunks back into XML:
require 'nokogiri'
doc = Nokogiri::XML('<seg>Trennmesser <ph>&lt;I.FIGREF ITEM="3" FORMAT="PARENTHESIS"&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET="CIADDAJA"&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>')
# Reparse the decoded text of the first <ph> as an XML fragment.
ph = Nokogiri::XML::DocumentFragment.parse(doc.at('seg ph').content)
puts ph.to_xml
This outputs the following node, showing that Nokogiri recreated the fragment correctly:
<I.FIGREF ITEM="3" FORMAT="PARENTHESIS"/>
To extract the text inside the <seg> tag:
doc.at('//seg/text()').text
=> "Trennmesser "
When dealing with HTML or XML, it's never good to presuppose that regex will be the best path to extracting something. Both HTML and XML are too irregular and "flexible" (where flexible means they are often irritatingly malformed or defined in totally unique and unexpected ways).
To get the full content inside the <seg> tag in the second example:
require 'nokogiri'
doc = Nokogiri::XML('<seg>Hilfsmittel <ph>&lt;F34@Z7@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt;F34@Z7@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')
# Reparse the decoded text of the whole <seg> as an XML fragment.
seg = Nokogiri::XML::DocumentFragment.parse(doc.at('seg').content)
puts seg.content
Which outputs:
Hilfsmittel @Z7@Lge>X = 0,5mm zwischen Beschleunigerwalze @Z7@Lge>D und Trennmesser schieben.