Find/replace a URL in a document using Python regex

Python regular expression experts! I'm trying to change a line in an XML document. The original line is:

<Tag name="low"     Value="%hello%\dir"/>

The result I want to see is:

<Tag name="low"     Value="C:\art"/>

My failed straightforward attempt is:

lines = re.sub("%hello%\dir", "C:\art", a)

This doesn't work: it doesn't change anything in the document. Is it something to do with the %?

For testing purposes I tried:

lines = re.sub("dir", "C:\art", a)

And I get:

<Tag name="low"     Value="%hello%\C:BELrt"/>

The problem is that \a = BEL.

I've tried a bunch of other things, but to no avail. How do I go about this problem?

3 Answers

#1

Your issue is that some of the characters in your pattern have a special meaning in regexes.

\d means any digit, so %hello%\dir is read as %hello%[0-9]ir.

You need to escape those backslashes, and use a raw string so that Python does not process them first:

import re

a = r'<Tag name="low"     Value="%hello%\dir"/>'
# \\ in the pattern matches one literal backslash; \\ in the replacement emits one
lines = re.sub(r"%hello%\\dir", r"C:\\art", a)
print(lines)  # <Tag name="low"     Value="C:\art"/>
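
To see both failure modes from the question in isolation, here is a small check. It is only a sketch, and the variable names are made up for illustration:

```python
import re

line = r'<Tag name="low"     Value="%hello%\dir"/>'

# Unescaped, \d is the digit class, so the pattern looks for
# "%hello%" + one digit + "ir" and finds nothing in the line.
unescaped = re.compile(r"%hello%\dir")
print(unescaped.search(line))                # None
print(bool(unescaped.search("%hello%7ir")))  # True

# With the backslash escaped, the pattern matches the literal text.
escaped = re.compile(r"%hello%\\dir")
print(escaped.search(line).group())          # %hello%\dir

# In a normal (non-raw) literal, "\a" is the single BEL character (0x07),
# which is what ended up in the output of the test substitution.
print(repr("C:\art"))                        # 'C:\x07rt'
```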

#2

In Python, put the r prefix on a string literal so you don't have to escape backslashes for Python itself. Then escape the backslash for the regex engine so that \d doesn't match a digit.

lines = re.sub(r"%hello%\\dir", r"C:\\art", a)
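
A quick way to convince yourself what the r prefix changes is to compare the literals directly. This is just an illustrative check with made-up variable names:

```python
# A normal literal processes escapes; a raw literal keeps the backslash.
normal = "\a"      # one character: BEL (0x07)
raw = r"\a"        # two characters: backslash, 'a'
doubled = "\\a"    # the same two characters, written with an escaped backslash

print(len(normal), len(raw), len(doubled))  # 1 2 2
print(raw == doubled)                       # True

# The regex engine then applies its own escapes on top, so a literal
# backslash in the pattern has to reach it as two characters: \\
pattern = r"\\dir"
print(len(pattern))                         # 5
```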

#3

It's a good question. It shows three issues with text representation at once:

The Python string literal '\a' is a single BEL character.

To write a backslash followed by the letter 'a' in Python source code, you need to either use a raw literal, r'\a', or escape the backslash, '\\a'.

r'\d' (two characters) has a special meaning when interpreted as a regular expression: it matches a digit.

In addition to the rules for Python string literals, you also need to escape any regex metacharacters. You could use re.escape(your_string) in the general case, or just r'\\d' or '\\\\d'. The '\a' in the repl part should also be escaped (twice in your case: r'\\a' or '\\\\a'):

```
>>> import re
>>> xml = r'<Tag name="low"     Value="%hello%\dir"/>'
>>> old, new = r'%hello%\dir', r'C:\art'
>>> print(re.sub(re.escape(old), new.replace('\\', r'\\'), xml))
<Tag name="low"     Value="C:\art"/>
```
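
If the old and new strings come from variables, one way to avoid thinking about escapes is to wrap this in a small helper. The sketch below targets Python 3. `replace_literal` is a made-up name, and it passes a function as the replacement, which is a different technique from escaping the replacement string: re.sub inserts a callable's return value verbatim.

```python
import re

def replace_literal(text, old, new):
    """Replace every literal occurrence of `old` in `text` with `new`.

    Hypothetical helper: re.escape() neutralises regex metacharacters in
    `old`, and the lambda makes re.sub() insert `new` as-is, so
    backslashes in `new` are not treated as template escapes.
    """
    return re.sub(re.escape(old), lambda match: new, text)

xml = r'<Tag name="low"     Value="%hello%\dir"/>'
result = replace_literal(xml, r'%hello%\dir', r'C:\art')
print(result)
# <Tag name="low"     Value="C:\art"/>
```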
By the way, you don't need regular expressions at all in this case:

```
>>> print(xml.replace(old, new))
<Tag name="low"     Value="C:\art"/>
```
Finally, an XML attribute value can't contain certain characters as-is; they also have to be escaped, e.g., '&', '"', '<', etc.
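
For a hand-built attribute, the standard library can do this escaping. A minimal sketch using `xml.sax.saxutils.quoteattr`, which escapes the value and adds the surrounding quotes (the sample value and variable names are illustrative):

```python
from xml.sax.saxutils import quoteattr

value = r"C:\art & design <new>"

# quoteattr() escapes &, < and > and wraps the result in quotes,
# so it can be dropped straight into a tag.
attr = quoteattr(value)
print(attr)  # "C:\art &amp; design &lt;new&gt;"

tag = '<Tag name="low" Value=%s/>' % attr
print(tag)
```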

In general you should not use regex to manipulate XML. Python's stdlib has XML parsers.

>>> import xml.etree.ElementTree as etree
>>> xml = r'<Tag name="low"     Value="%hello%\dir"/>'
>>> tag = etree.fromstring(xml)
>>> tag.set('Value', r"C:\art & design")
>>> etree.dump(tag)
<Tag name="low" Value="C:\art &amp; design" />
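
Applied to a whole document rather than a single tag, the same idea might look like the sketch below. The `<Config>` wrapper and the second `<Tag>` are invented sample data, and `xml.etree.ElementTree` is used because the `cElementTree` alias is no longer available in current Python 3 releases.

```python
import xml.etree.ElementTree as ET

# Invented sample document; only the first Tag matches the question.
doc = r'''<Config>
  <Tag name="low"     Value="%hello%\dir"/>
  <Tag name="high"    Value="D:\other"/>
</Config>'''

old, new = r'%hello%\dir', r'C:\art'

root = ET.fromstring(doc)
for tag in root.iter('Tag'):
    # Plain string comparison: no regex or backslash escaping involved.
    if tag.get('Value') == old:
        tag.set('Value', new)

values = [tag.get('Value') for tag in root.iter('Tag')]
print(values)

# The serializer takes care of escaping attribute values on output.
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```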
