I have an input string, for example:
input_str = 'this is a test for [blah] and [blah/blahhhh]'
I want to keep [blah] but remove [blah/blahhhh] from the string above. I tried the following:
>>> re.sub(r'\[.*?\]', '', input_str)
'this is a test for  and '
>>> re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '
What regex pattern would give the output "this is a test for [blah] and"?
3 solutions
I don't understand why your 2nd regex doesn't work. I tested it and yes, you are correct, it doesn't work. So you can use the same idea with a different approach.
Instead of using wildcards, you can use \w, like this:
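A minimal sketch of this idea, assuming each side of the slash contains only word characters (letters, digits, underscore). The exact pattern below is an illustration, not necessarily the one the answer had in mind:

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# \w+ cannot match ']' or '/', so a match can never run across a closing bracket.
# Assumption: each side of the slash is one or more word characters.
result = re.sub(r'\[\w+/\w+\]', '', input_str)
print(repr(result))  # 'this is a test for [blah] and '
```

Note the trailing space that is left behind; call result.rstrip() if it should go.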
By the way, if you can have non-word characters separated by /, then you can use this regex:
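One possible pattern for that case, assuming the extra characters are whitespace and hyphens. Both the character set and the sample string are hypothetical, chosen only to illustrate the idea:

```python
import re

# Hypothetical input with spaces and a hyphen inside the slashed brackets.
sample = 'keep [blah] drop [foo bar / baz-qux]'

# [\w\s-]+ allows word characters, whitespace and '-' on each side of the slash,
# but still excludes ']' so the match stays inside one pair of brackets.
cleaned = re.sub(r'\[[\w\s-]+/[\w\s-]+\]', '', sample)
print(repr(cleaned))  # 'keep [blah] drop '
```

Widen the character class if other characters may appear inside the brackets.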
The reason the second regex in the original post matches more than the OP wants is that . matches any character, including ]. So \[.*?\/ (or just \[.*?/ since the \ before the / is superfluous) will match more than it seems the OP wanted: [blah] and [blah/ in input_str.
The ? adds confusion. It will limit repetition of the .* part of the .*\] sub-expression, but you have to understand what repetition you're limiting [1]. A lazy quantifier only makes the match end as early as possible; it does not stop the match from starting at the first [ or from running across a ]. It's better to explicitly match any non-closing-bracket character instead of using the . wildcard to begin with. So-called "greedy" matching of .* is often a stumbling block, since it will match zero or more occurrences of any character until that wildcard match fails (usually much further than people expect): it consumes as much of the input as possible, up to the last occurrence of the next explicitly specified part of the regex (] or / in your regexes). Instead of using ? to try to counteract or limit greedy matching with lazy matching, it is often better to be explicit about what not to match in the greedy part.
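To see what is actually being matched, it can help to print the matched text instead of the substituted result. A small sketch using the string from the question:

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# Lazy wildcard: the match still starts at the first '[' and '.' may cross ']'.
lazy = re.search(r'\[.*?/.*?\]', input_str)
print(lazy.group())      # [blah] and [blah/blahhhh]

# Negated class: no part of the match may contain ']', so the attempt at the
# first '[' fails and the engine moves on to the second one.
explicit = re.search(r'\[[^]]*/[^]]*\]', input_str)
print(explicit.group())  # [blah/blahhhh]
```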
As an illustration, see the following example of .* grabbing everything until the last occurrence of the character after .*:
echo '////k////,/k' | sed -r 's|/.*/|XXX|'
echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
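For comparison, here is a rough Python counterpart of the same experiment. As far as I know, sed's extended syntax has no lazy quantifier, so (.*)? there is just an optional greedy group; Python's .*? is a true lazy quantifier, which makes the difference visible:

```python
import re

s = '////k////,/k'

# Greedy: '.*' runs to the last '/', so everything up to it is replaced.
greedy = re.sub(r'/.*/', 'XXX', s, count=1)
print(greedy)  # XXXk

# Lazy: '.*?' stops at the first '/' that lets the match succeed.
lazy = re.sub(r'/.*?/', 'XXX', s, count=1)
print(lazy)    # XXX//k////,/k
```

count=1 mirrors sed's default of replacing only the first match on a line.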
Subtleties of greedy/lazy matching behavior can also vary from one regex implementation to the next (PCRE, Python, grep/egrep). For portability, simplicity, and clarity, be explicit when you can.
If you only want to look for strings with brackets that don't include a closing bracket character before the slash character, you could more explicitly look for "not-a-closing-bracket" instead of the wildcard match:
>>> re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '
This uses a character class expression, [^]], instead of the wildcard . to match any character that is explicitly not a closing bracket.
If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.
You can write a function that takes input_str as an argument and loops through the string; if it sees a '/' between '[' and ']', it jumps back to the position of the '[' and removes everything up to and including the ']'.
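A minimal sketch of such a function, assuming brackets are never nested. The name remove_slashed_brackets is made up for this example:

```python
def remove_slashed_brackets(text):
    """Drop every [...] group that contains a '/'; keep all other text."""
    out = []           # characters of the result so far
    start = None       # index in `out` of the currently open '[', if any
    has_slash = False  # whether a '/' was seen since that '['
    for ch in text:
        out.append(ch)
        if ch == '[':
            start = len(out) - 1
            has_slash = False
        elif ch == '/' and start is not None:
            has_slash = True
        elif ch == ']' and start is not None:
            if has_slash:
                del out[start:]  # jump back to '[' and remove through ']'
            start = None
    return ''.join(out)


print(repr(remove_slashed_brackets('this is a test for [blah] and [blah/blahhhh]')))
# 'this is a test for [blah] and '
```

A '/' outside of brackets is left alone, since it only counts while a '[' is open.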