Regex tool for large search-and-replace operations

I often find myself needing a tool that lets me search for multiple multi-line regex patterns in a large file and replace them using back-references.

Should I:

take the 2 hours it will require to build such a tool myself
use something someone has already built (please suggest)
learn to use a language that's particularly good at this type of thing (Perl?)

Example
I have an XML document containing thousands of entries. About 100 of those entries have a known value field and need to be removed. I can build a regular expression for each entry; the expression is the same for all 100 entries except for the value string. The tool would need either to loop through the file once per value, or to run just once with 100 OR terms (|) in the expression (which would be huge). In this case I'm replacing the matches with a blank, but in other cases I'd reformat the text and re-insert the value field.

4 Solutions

#1

I reckon you should write the thing in Python. The Python re library is great:

# get the re library
import re

# this is the line to process
xml_line = "<stuff><bad i_am_naughty=\"True\"></bad></stuff>"
# compile a regex with three groups: text before, the <bad> element, text after
exp = re.compile(r"(.*)(<bad.*bad>)(.*)")
# run the regex on the line
match = exp.search(xml_line)
# print out the groups the regex found
print(match.groups())
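
To get closer to the case in the question (about 100 known values, multi-line entries, back-references), here is a minimal sketch. The `<entry>`/`<value>` layout, the sample values, and all variable names are assumptions made for illustration, not something taken from the question's real data. It builds one alternation from the list of values with `re.escape`, uses `re.DOTALL` so the pattern can span lines, and guards the pattern so a match cannot run across an `</entry>` boundary.

```python
import re

# Hypothetical sample data; the <entry>/<value> layout is an assumption.
xml_text = """<entries>
<entry>
  <value>keep-me</value>
</entry>
<entry>
  <value>bad.1</value>
</entry>
<entry>
  <value>bad-2</value>
</entry>
</entries>
"""

# Hypothetical list of known values; in practice this could hold ~100 items.
values_to_remove = ["bad.1", "bad-2"]

# Escape each value so characters such as "." are matched literally,
# then join everything into a single alternation.
alternation = "|".join(re.escape(v) for v in values_to_remove)

# (?:(?!</entry>).)*? consumes text lazily but refuses to step over a closing
# </entry>, so a match always stays inside one entry.
inside_entry = r"(?:(?!</entry>).)*?"
entry_pattern = re.compile(
    r"<entry>" + inside_entry
    + r"<value>(?:" + alternation + r")</value>"
    + inside_entry + r"</entry>\n?",
    re.DOTALL,  # let "." match newlines so entries can span several lines
)

# Case 1: delete the matching entries.
cleaned = entry_pattern.sub("", xml_text)

# Case 2: keep the entries but reformat the value using a back-reference.
value_pattern = re.compile(r"<value>(" + alternation + r")</value>")
reformatted = value_pattern.sub(r'<value flagged="true">\1</value>', xml_text)

print(cleaned)
print(reformatted)
```

A single compiled alternation means one pass over the file instead of one pass per value. Whether that is faster on a given file is worth measuring rather than assuming.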

N.B. You could also use Python's XML parsing libraries to strip out the elements you don't want. Using an XML parser removes some of the complexity that I have ignored in my example (multiple lines, etc.). In lieu of a Python XML parsing example, this question has some good answers about parsing XML in Python.
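
As a rough illustration of the parser-based route, here is a sketch using the standard library's `xml.etree.ElementTree`. The `<entries>`/`<entry>`/`<value>` structure and the sample values are assumed for the example; the real document's layout is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are assumptions.
xml_text = """<entries>
  <entry><value>keep-me</value></entry>
  <entry><value>bad-1</value></entry>
  <entry><value>bad-2</value></entry>
</entries>"""

# Hypothetical set of known values whose entries should be dropped.
values_to_remove = {"bad-1", "bad-2"}

root = ET.fromstring(xml_text)

# Work on a list of the matching children so that removing elements
# does not interfere with the iteration.
for entry in list(root.findall("entry")):
    if entry.findtext("value") in values_to_remove:
        root.remove(entry)

remaining = [e.findtext("value") for e in root.findall("entry")]
result = ET.tostring(root, encoding="unicode")

print(remaining)
print(result)
```

Because the parser works on elements rather than raw text, line breaks and attribute order inside an entry stop mattering, which is the complexity the regex example skips over.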

#2

I am not quite sure what your data looks like, but I would consider writing the tool in Python in three passes:

convert the file of XML path plus variable = value to lines of XML.path.variable=value
apply the massive regex to each line, possibly deleting the line from the output
convert the shortened list of XML.path.variable=value lines back to XML
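
The first two passes might look roughly like the sketch below. It assumes the values live in attributes, and the sample document, the `flatten` helper, and the filter pattern are all hypothetical. The third pass is left out: converting lines back to XML needs a way to tell repeated sibling elements apart (for example an index in the path), which this flat format does not record.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the structure is an assumption.
xml_text = """<config>
  <server name="alpha" port="8080"/>
  <server name="beta" port="9090"/>
</config>"""


def flatten(element, path=""):
    """Pass 1: yield one 'XML.path.variable=value' line per attribute."""
    current = f"{path}.{element.tag}" if path else element.tag
    for name, value in element.attrib.items():
        yield f"{current}.{name}={value}"
    for child in element:
        yield from flatten(child, current)


lines = list(flatten(ET.fromstring(xml_text)))

# Pass 2: apply a regex to each line and drop the lines that match.
# The pattern here is a stand-in for the "massive regex".
unwanted = re.compile(r"^config\.server\.port=9090$")
kept = [line for line in lines if not unwanted.search(line)]

for line in kept:
    print(line)
```

Once the data is one record per line, the multi-line matching problem goes away and each regex only has to handle a single short string.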

#3

I would suggest not using regular expressions. XML should usually be handled with XML tools. Why not just use XSLT?

#4

There are a large number of modules that can be used to handle XML.

http://www.crummy.com/software/BeautifulSoup/
http://codespeak.net/lxml/
