I need to remove everything between XML tags, in particular whitespace and newlines.
For example, removing the whitespace and newlines from:
</node> \n<node id="whatever">
to get:
</node><node id="whatever">
This is not meant for parsing XML by hand, but for preparing XML data before a tool parses it. To be more specific, I'm using Hpricot (Ruby) to parse XML, and unfortunately we're currently stuck on version 0.6.164. I don't know about more recent versions, but this one often returns odd nodes (objects) that contain only whitespace and line breaks. So the idea is to clean up the XML before converting it into an Hpricot document. Alternative solutions are appreciated.
An example from a test: NoMethodError: undefined method `children' for "\n ":Hpricot::Text
The interesting part here is not the NoMethodError, because that's just fine, but that the Hpricot::Text element only contains a newline and nothing more.
5 solutions
#1
6
Please don't use regular expressions to parse XML. It's horribly error prone.
Use a proper XML library, which will make this trivial. There are XML libraries available for just about every programming platform you could ask for - there's really no excuse to use a regular expression for XML.
#2
6
A solution is to select all "blank" text nodes and remove them.
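require 'nokogiri'
doc = Nokogiri::XML(xml_source)
# Select every text node that is empty or whitespace-only and remove it from the document.
doc.xpath('//text()[not(normalize-space())]').remove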
#3
2
It is generally not a good idea to parse XML using regular expressions. One of the major benefits of XML is that there are dozens of well-tested parsers out there for any language/framework that you might ever want. There are some tricky rules within XML that prevent any regular expression from being able to properly parse XML.
That said, something like:
s/>.*?</></gs
(that is Perl syntax) might do what you want. It takes everything from a greater-than sign up to the next less-than sign and strips it away. The "g" at the end performs the substitution as many times as needed, and the "s" makes "." match all characters INCLUDING newlines (otherwise newlines would not be matched, so the pattern would have to be run once per line and would miss gaps between tags that span multiple lines). Be aware that this deletes real text content between tags as well, not just whitespace; to strip only whitespace, the narrower s/>\s+</></g is the safer form.
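Since the question is about Ruby, a narrower Ruby counterpart of the Perl substitution above, one that removes only whitespace-only runs between tags, could look roughly like this. It is a minimal sketch rather than a robust solution: the helper name `strip_inter_tag_whitespace` is made up for illustration, and because a regex knows nothing about XML structure, the pattern will also touch whitespace-only runs between `>` and `<` inside comments or CDATA sections.

```ruby
# Hypothetical helper (not from the original answers): removes runs of
# whitespace, including newlines, that sit directly between a '>' and a '<'.
# Text that contains any non-whitespace character is left untouched.
def strip_inter_tag_whitespace(xml)
  xml.gsub(/>\s+</, '><')
end

xml = "<root>\n  <node id=\"a\">some text</node> \n<node id=\"whatever\"/>\n</root>"
puts strip_inter_tag_whitespace(xml)
# prints: <root><node id="a">some text</node><node id="whatever"/></root>
```

The cleaned string can then be handed to the parser in place of the raw source.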
#4
1
You shouldn't use regex to parse XML or HTML; it's just not reliable, and there are far too many edge cases. Use an XML/HTML parser for this kind of thing instead.
#5
1
Don't use regex. Try parsing the XML into a DOM and manipulating it from there (what language/framework are you using?).
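As a rough illustration of the DOM approach in Ruby, the sketch below uses REXML rather than Hpricot or Nokogiri. That choice is an assumption, made only because REXML ships with standard Ruby distributions; the helper names `remove_blank_text!` and `strip_blank_text_nodes` are hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper: walks the tree and drops text nodes that contain
# nothing but whitespace. It iterates over a copy of the child list (to_a)
# so nodes can be removed safely while looping. CDATA sections are left alone.
def remove_blank_text!(element)
  element.to_a.each do |child|
    if child.instance_of?(REXML::Text)
      child.remove if child.value.strip.empty?
    elsif child.is_a?(REXML::Element)
      remove_blank_text!(child)
    end
  end
end

# Hypothetical wrapper: parses the source, cleans it, and returns the document.
def strip_blank_text_nodes(xml)
  doc = REXML::Document.new(xml)
  remove_blank_text!(doc.root)
  doc
end

doc = strip_blank_text_nodes("<root>\n  <node id=\"a\">text</node>\n  <node id=\"b\"/>\n</root>")
# List the classes of the root's remaining children.
doc.root.each { |child| puts child.class }
```

Working on the parsed tree means real text such as `text` inside the first node is never at risk, which is the main advantage over a string substitution.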