Regular expressions in Python: removing the pattern '[... / ...]' from a string

Date: 2021-10-17 01:38:54

I have an input string, for example:

input_str = 'this is a test for [blah] and [blah/blahhhh]'

and I want to keep [blah] but remove [blah/blahhhh] from the string above. I tried the following:

>>> re.sub(r'\[.*?\]', '', input_str)
'this is a test for  and '

and

>>> re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '

What is the right regex pattern to get the output "this is a test for [blah] and"?

3 solutions

#1


1  

I didn't see at first why your second regex wouldn't work, but I tested it and you are correct: it doesn't. You can keep the same idea and take a different approach.

Instead of the wildcard, you can use \w like this:

\[\w+\/\w+\]

By the way, if the text on either side of the / can contain non-word characters, you can use this regex instead:

\[[^\]]*\/[^\]]*]
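
As a rough check, the two patterns above can be tried in Python against the string from the question. This is only a sketch: the names word_only, no_closing_bracket and the second test string are made up for illustration, and the unnecessary backslash before / has been dropped.

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# \w+ on both sides of the slash: only word characters may appear in the brackets.
word_only = re.compile(r'\[\w+/\w+\]')

# Negated character class: anything except ']' may appear on either side.
no_closing_bracket = re.compile(r'\[[^\]]*/[^\]]*\]')

print(repr(word_only.sub('', input_str)))           # 'this is a test for [blah] and '
print(repr(no_closing_bracket.sub('', input_str)))  # 'this is a test for [blah] and '

# The two patterns differ once the bracketed text contains non-word characters
# (hypothetical input, not from the question).
other = 'keep [a b] drop [a b/c-d]'
print(repr(word_only.sub('', other)))           # 'keep [a b] drop [a b/c-d]'
print(repr(no_closing_bracket.sub('', other)))  # 'keep [a b] drop '
```

Both patterns leave a trailing space behind, because only the bracketed text itself is removed.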

#2


0  

The second regex in the original post matches more than the OP wants because . matches any character, including ]. So \[.*?\/ (or just \[.*?/, since the \ before the / is superfluous) matches more than intended: [blah] and [blah/ in input_str.
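
To see this directly, one can ask Python what the pattern actually matched. A small sketch using re.search on the string from the question (the variable names m and prefix are just for illustration):

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# The full pattern from the question (without the superfluous backslash).
m = re.search(r'\[.*?/.*?\]', input_str)
print(m.group(0))       # [blah] and [blah/blahhhh]

# Just the first half: the match starts at the first '[' and the wildcard
# walks over the ']' of [blah] until it reaches the first '/'.
prefix = re.search(r'\[.*?/', input_str)
print(prefix.group(0))  # [blah] and [blah/
```

Since the whole span from the first [ to the final ] is one match, re.sub removes all of it, which is why the OP got 'this is a test for '.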

The ? adds confusion. It makes the preceding .* lazy, so the repetition stops at the earliest point where the rest of the pattern can match, but you have to understand what repetition you're limiting [1]: laziness affects where the match ends, not where it starts. The match still begins at the first [ and extends until a / is found, crossing the ] of [blah] on the way. Plain "greedy" .* is a common stumbling block for the opposite reason: it matches as many characters as possible, up to the last occurrence of the next explicitly specified part of the regex, which is usually much more than people expect. Instead of using ? to counteract greedy matching with lazy matching, it is often better to be explicit about what must not be matched, for example any character other than a closing bracket.

As an illustration, the following shows .* grabbing everything up to the last occurrence of the character that follows it:

echo '////k////,/k' | sed -r 's|/.*/|XXX|'
XXXk

echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
XXXk
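
For comparison, here is a sketch of the same experiment in Python. The count=1 argument is used on the assumption that it is the closest equivalent to sed's default of replacing only the first match; the variable names are illustrative.

```python
import re

s = '////k////,/k'

# Greedy: from the first '/' to the last '/', as in the sed example.
greedy = re.sub(r'/.*/', 'XXX', s, count=1)
print(greedy)    # XXXk

# An optional group is still greedy; '(.*)?' is not a lazy quantifier.
optional = re.sub(r'/(.*)?/', 'XXX', s, count=1)
print(optional)  # XXXk

# Lazy: '.*?' stops at the first '/' it can, so only '//' is replaced.
lazy = re.sub(r'/.*?/', 'XXX', s, count=1)
print(lazy)      # XXX//k////,/k
```

Note that in the second sed command the ? applies to the group (.*), making it optional rather than lazy, which is why both sed commands print the same result.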

The subtleties of greedy and lazy matching can also vary from one regex implementation to the next (pcre, python, grep/egrep). For portability, simplicity, and clarity, be explicit when you can.

If you only want to match bracketed strings that contain no closing bracket before the slash, you can look explicitly for "not-a-closing-bracket" instead of using the wildcard:

>>> re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '

This uses a character class expression - [^]] - instead of the wildcard . to match any character that is explicitly not a closing bracket.
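
The substitution above leaves a trailing space, while the question asks for "this is a test for [blah] and". One possible way to wrap it up is sketched below; the function name remove_slashed_brackets and the whitespace clean-up step are assumptions, not part of the original answer.

```python
import re

# Same pattern as above: '[', no ']' before the slash, '/', no ']' after it, ']'.
BRACKETED_WITH_SLASH = re.compile(r'\[[^]]*/[^]]*\]')


def remove_slashed_brackets(text):
    """Drop every [..../....] group, then tidy the spaces left behind."""
    stripped = BRACKETED_WITH_SLASH.sub('', text)
    # Collapse runs of spaces created by the removal and trim both ends.
    return re.sub(r' {2,}', ' ', stripped).strip()


input_str = 'this is a test for [blah] and [blah/blahhhh]'
print(repr(remove_slashed_brackets(input_str)))
# 'this is a test for [blah] and'
```

Whether collapsing and trimming spaces is desirable depends on the real input; drop that line if the original spacing must be preserved.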

If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
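
To give a feel for what that might look like, here is a very small stack-based scanner. It is only a sketch under assumed rules that the question does not state: a group is dropped when a / appears directly inside it (not inside a nested child), nested groups disappear with a dropped parent, and unbalanced brackets are kept as ordinary text. The function name is hypothetical.

```python
def remove_nested_slashed(text):
    """Remove balanced [...] groups that directly contain a '/'."""
    out = []    # output pieces at the top level
    stack = []  # one frame per open '[': [pieces, has_slash]
    for ch in text:
        if ch == '[':
            stack.append([['['], False])
        elif ch == ']' and stack:
            pieces, has_slash = stack.pop()
            pieces.append(']')
            if not has_slash:
                # Keep the group: hand it to the parent frame or the output.
                target = stack[-1][0] if stack else out
                target.append(''.join(pieces))
        elif stack:
            if ch == '/':
                stack[-1][1] = True
            stack[-1][0].append(ch)
        else:
            out.append(ch)  # plain text, including a stray ']'
    # Any '[' that was never closed is emitted unchanged.
    for pieces, _ in stack:
        out.append(''.join(pieces))
    return ''.join(out)


print(repr(remove_nested_slashed('this is a test for [blah] and [blah/blahhhh]')))
# 'this is a test for [blah] and '
print(repr(remove_nested_slashed('x [a[b]/c] y')))
# 'x  y'
```

Different rules for nesting would need a different scanner; the point is only that bracket depth has to be tracked explicitly, which a single regex does not do well.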

Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.

[1] http://www.regular-expressions.info/repeat.html

#3


-1  

You can write a function that takes input_str as an argument and loops through the string; if it sees a '/' between '[' and ']', it jumps back to the position of the '[' and removes everything up to and including the ']'.
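
A minimal sketch of such a function follows. The name remove_slashed_brackets_loop is made up, and the sketch assumes brackets are not nested: a new '[' simply restarts the tracking.

```python
def remove_slashed_brackets_loop(input_str):
    """Remove [..../....] groups with a plain loop instead of a regex."""
    result = []
    open_index = None  # position in result of the most recent unclosed '['
    saw_slash = False  # whether a '/' was seen since that '['
    for ch in input_str:
        if ch == '[':
            open_index = len(result)
            saw_slash = False
            result.append(ch)
        elif ch == '/' and open_index is not None:
            saw_slash = True
            result.append(ch)
        elif ch == ']' and open_index is not None:
            if saw_slash:
                # Jump back to the '[' and drop everything from there on.
                del result[open_index:]
            else:
                result.append(ch)
            open_index = None
        else:
            result.append(ch)
    return ''.join(result)


print(repr(remove_slashed_brackets_loop('this is a test for [blah] and [blah/blahhhh]')))
# 'this is a test for [blah] and '
```

Like the regex answers, this leaves the surrounding spaces untouched.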
