I have a large number of XML files that contain URLs. I'm writing a Groovy utility to find each URL and replace it with an updated version.
Given example.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <page>
      <content>
        <section>
          <link>
            <url>/some/old/url</url>
          </link>
          <link>
            <url>/some/old/url</url>
          </link>
        </section>
        <section>
          <link>
            <url>
              /a/different/old/url?with=specialChars&amp;escaped=true
            </url>
          </link>
        </section>
      </content>
    </page>
Once the script has run, example.xml should contain:
    <?xml version="1.0" encoding="UTF-8"?>
    <page>
      <content>
        <section>
          <link>
            <url>/a/new/and/improved/url</url>
          </link>
          <link>
            <url>/a/new/and/improved/url</url>
          </link>
        </section>
        <section>
          <link>
            <url>
              /a/different/new/and/improved/url?with=specialChars&amp;stillEscaped=true
            </url>
          </link>
        </section>
      </content>
    </page>
This is easy to do using Groovy's excellent XML support, except that I want to change the URLs and nothing else about the file.
By that I mean:
- whitespace must not change (files might contain spaces, tabs, or both)
- comments must be preserved
- Windows vs. Unix-style line separators must be preserved
- the XML declaration at the top must not be added or removed
- attributes in tags must retain their order
So far, after trying many combinations of XmlParser, DOMBuilder, XmlNodePrinter, XmlUtil.serialize(), and so on, I've settled on reading each file line by line and applying an ugly hybrid of the XML utilities and regular expressions.
Reading and writing each file:
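    files.each { File file ->
        def text = file.text   // read the file once and reuse it for both checks
        // Detect the file's line separator so it can be written back unchanged
        def lineEnding = text.contains('\r\n') ? '\r\n' : '\n'
        def newLineAtEof = text.endsWith(lineEnding)
        def lines = file.readLines()
        file.withWriter { w ->
            lines.eachWithIndex { line, index ->
                w.write(update(line))
                // Write a separator between lines, and after the last line only if the file ended with one
                if (index < lines.size() - 1 || newLineAtEof) w.write(lineEnding)
            }
        }
    }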
Searching for and updating URLs within a line:
    def matcher = (line =~ urlTagRegexp) // matches a <url> element and its contents
    matcher.each { groups ->
        def urlNode = new XmlParser().parseText(line)
        def url = urlNode.text()
        def newUrl = translate(url)
        if (newUrl) {
            urlNode.value = newUrl
            def replacement = nodeToString(urlNode)
            line = matcher.replaceAll(replacement)
        }
    }

    // Serializes a node back to a single-line string
    def nodeToString(node) {
        def writer = new StringWriter()
        writer.withPrintWriter { printWriter ->
            def printer = new XmlNodePrinter(printWriter)
            printer.preserveWhitespace = true
            printer.print(node)
        }
        writer.toString().replaceAll(/[\r\n]/, '')
    }
This mostly works, except that it can't handle a tag split over multiple lines, and juggling newlines when writing the files back out is cumbersome.
I'm new to Groovy, but I feel like there must be a groovier way of doing this.
2 Answers
#1
9
I just created a gist at https://gist.github.com/akhikhl/8070808 to demonstrate how such a transformation can be done with Groovy and JDOM2.
Important notes:
- Groovy can use any Java library. If something cannot be done with the Groovy JDK, it can be done with another library.
- The jaxen library (which implements XPath) must be included explicitly (via @Grab or via Maven/Gradle), since it is an optional dependency of JDOM2.
- The sequence of @Grab/@GrabExclude instructions works around jaxen's quirky dependency on JDOM 1.0.
- XPathFactory.compile also supports variable binding and filters (see the online javadoc).
- XPathExpression (returned by compile) has two main methods: evaluate and evaluateFirst. evaluate always returns a list of all XML nodes matching the expression, while evaluateFirst returns only the first matching node.
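To make the notes above concrete, here is a minimal sketch of the dependency setup and the two XPathExpression methods. It is not the code from the gist: the artifact versions, the sample document, and the //url expression are assumptions chosen for illustration.

```groovy
// Artifact versions are assumptions; adjust them to whatever your build uses.
@Grapes([
    @Grab('org.jdom:jdom2:2.0.6'),
    @Grab('jaxen:jaxen:1.1.6'),
    @GrabExclude('jdom:jdom')   // keeps JDOM 1.x off the classpath
])
import org.jdom2.filter.Filters
import org.jdom2.input.SAXBuilder
import org.jdom2.xpath.XPathFactory

// Hypothetical sample input, modelled on example.xml from the question.
def sample = '<page><link><url>/some/old/url</url></link>' +
             '<link><url>/another/old/url</url></link></page>'
def doc = new SAXBuilder().build(new StringReader(sample))

// Filters.element() restricts the results to Element nodes.
def expr = XPathFactory.instance().compile('//url', Filters.element())

def allUrls = expr.evaluate(doc)        // list of every matching element
def firstUrl = expr.evaluateFirst(doc)  // only the first match, or null if there is none

println allUrls*.text
println firstUrl.text
```

Compiling the expression once and reusing it for each document keeps the per-file work down when many files are processed.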
Update
The following code:
    new XMLOutputter().with {
        format = Format.getRawFormat()
        format.setLineSeparator(LineSeparator.NONE)
        output(doc, System.out)
    }
solves the problem of preserving whitespace and line separators. Format.getRawFormat() constructs a format object that preserves whitespace, and LineSeparator.NONE tells the format object not to convert line separators.
The gist mentioned above contains this new code as well.
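For readers without access to the gist, the whole round trip might look roughly like the sketch below. It is an illustration under stated assumptions rather than the gist's code: the artifact versions, the inline sample document, and the translate closure (a stand-in for the asker's URL mapping) are all made up here.

```groovy
// Artifact versions are assumptions; adjust them to whatever your build uses.
@Grapes([
    @Grab('org.jdom:jdom2:2.0.6'),
    @Grab('jaxen:jaxen:1.1.6'),
    @GrabExclude('jdom:jdom')
])
import org.jdom2.filter.Filters
import org.jdom2.input.SAXBuilder
import org.jdom2.output.Format
import org.jdom2.output.LineSeparator
import org.jdom2.output.XMLOutputter
import org.jdom2.xpath.XPathFactory

// Hypothetical input with tab indentation and a comment that should survive.
def source = '<page>\n\t<!-- keep me -->\n\t<url>/some/old/url</url>\n</page>'
def doc = new SAXBuilder().build(new StringReader(source))

// Placeholder for the real URL mapping (hypothetical).
def translate = { String url -> url.replace('/some/old/url', '/a/new/and/improved/url') }

// Rewrite the text of every <url> element in place.
XPathFactory.instance().compile('//url', Filters.element()).evaluate(doc).each { url ->
    url.setText(translate(url.getText()))
}

// Serialize with the raw format so text and whitespace inside the root element are written as parsed.
def out = new StringWriter()
new XMLOutputter().with {
    format = Format.getRawFormat()
    format.setLineSeparator(LineSeparator.NONE)
    output(doc, out)
}
def result = out.toString()
println result
```

The XML declaration and any whitespace outside the root element are produced by the outputter rather than copied from the input, so it is worth diffing the result against the original file before overwriting it.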
#2
6
There is also a solution that needs no third-party library.
    def xml = file.text
    def document = groovy.xml.DOMBuilder.parse(new StringReader(xml))
    def root = document.documentElement
    use(groovy.xml.dom.DOMCategory) {
        // manipulate the XML here, e.g. root.someElement?.each { it.value = 'new value' }
    }
    def result = groovy.xml.dom.DOMUtil.serialize(root)
    file.withWriter { w ->
        w.write(result)
    }
Taken from http://jonathan-whywecanthavenicethings.blogspot.de/2011/07/keep-your-hands-off-of-my-whitespace.html
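As a sketch of what the "manipulate the XML here" step could look like for the URLs in the question, the example below walks the url elements with plain org.w3c.dom calls instead of DOMCategory, and serializes with groovy.xml.XmlUtil.serialize instead of DOMUtil.serialize. Both substitutions are assumptions made to keep the example self-contained, and the sample document and URL mapping are hypothetical.

```groovy
import groovy.xml.DOMBuilder
import groovy.xml.XmlUtil

// Hypothetical input, modelled on example.xml from the question.
def xml = '<page>\n\t<!-- keep me -->\n\t<url>/some/old/url</url>\n</page>'
def document = DOMBuilder.parse(new StringReader(xml))
def root = document.documentElement

// Visit every <url> element and rewrite its text content.
def urls = root.getElementsByTagName('url')
for (int i = 0; i < urls.length; i++) {
    def node = urls.item(i)
    // Placeholder mapping; substitute the real old-to-new URL translation.
    node.textContent = node.textContent.replace('/some/old/url', '/a/new/and/improved/url')
}

def result = XmlUtil.serialize(root)
println result
```

Whether the serializer leaves whitespace, the XML declaration, and line endings exactly as they were is worth checking against a real file before writing the result back.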