I have some text where each line contains some good words and some bad (unwanted) words. The pattern might look like this:
good1-good2 good3 bad1-good4-bad2 some more good words
good1-good2 good3 bad1 bad2
good1-good2 good3 bad1 bad2 bad3
Now I need to reject everything in a line from the first bad word onward, including the bad word itself. So
good1-good2 good3 bad1-good4-bad2 some more good words
should become good1-good2 good3
good1-good2 good3 bad1 bad2
should become good1-good2 good3
good1-good2 good3 bad1 bad2 bad3
should become good1-good2 good3
I am using Python, so this is what I did:
import re
p=re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+',re.I)
m=p.search('good1-good2 good3 bad1-good4-bad2 ')
m.group(1)
and this gives good1-good2 good3, which is what I want, but
m=p.search('good1-good2 good3 bad1 bad2 ')
m.group(1)
returns good1-good2 good3 bad1. I thought that, because + is greedy, the + in ([\w \d-]+) goes on matching characters until the end of the line and then backtracks to find the last bad word, which in this case is bad2. But when I do this
p=re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+',re.I)
m=p.search('good1-good2 good3 bad1 bad2 bad3')
m.group(1)
it again returns good1-good2 good3 bad1. Can you please explain that? There might be a problem with my understanding of greediness in regex. I have figured out how to solve this problem by using a regex like ([\w \d-]+?) (bad1|bad2|bad3).+ but I still do not understand why ([\w \d-]+) (bad1|bad2|bad3).+ always returns the first bad word (bad1 in this case).
Thanks for the time.
Edit: But suppose I have a line with only good words and no bad words, like good1-good2 good3 (only good words). What should the regex be then? I tried the regex ([\w \d-]+?) ?(bad1|bad2|bad3)?.* but this returns only the first letter of the line.
1 Answer
#1
3
Regarding this case:
m=p.search('good1-good2 good3 bad1 bad2 ')
You are correct. ([\w \d-]+) is greedy, so it "eats" as much as possible and then backtracks only as far as needed for the rest of the pattern to match.
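To make the backtracking visible, here is a minimal sketch that runs the greedy pattern and the lazy variant from the question on the same string. The variable names (`greedy`, `lazy`, `m_greedy`, `m_lazy`) are only illustrative.

```python
import re

# Patterns taken from the question; only the quantifier on the first group differs.
greedy = re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+', re.I)
lazy = re.compile(r'([\w \d-]+?) (bad1|bad2|bad3).+', re.I)

line = 'good1-good2 good3 bad1 bad2 '  # note the trailing space

# Greedy: the group grabs the whole line, then gives characters back
# until " bad2" plus at least one more character can match.
m_greedy = greedy.search(line)
print(repr(m_greedy.group(1)), m_greedy.group(2))
# 'good1-good2 good3 bad1' bad2

# Lazy: the group grows one character at a time and stops at the first bad word.
m_lazy = lazy.search(line)
print(repr(m_lazy.group(1)), m_lazy.group(2))
# 'good1-good2 good3' bad1
```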
Regarding this case however:
m=p.search('good1-good2 good3 bad1 bad2 bad3')
What you're probably not seeing is that your .+ has to match at least one character after the bad word. That's why the regex can't match bad3 as the bad word: if it did, no characters would be left for the .+ to match. Thus, it backtracks to bad2 once again. Change your .+ to .* to see the difference. It's only because you happened to have an extra space at the end of the string in the first case, i.e. 'bad2 ', that things "worked out as expected" there.
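The effect of .+ versus .* can be checked directly. This is a small sketch using the string from the question without a trailing space; the names `plus`, `star`, `m_plus` and `m_star` are made up for the example.

```python
import re

line = 'good1-good2 good3 bad1 bad2 bad3'  # no trailing space

# Same pattern twice; only the final quantifier differs.
plus = re.compile(r'([\w \d-]+) (bad1|bad2|bad3).+', re.I)
star = re.compile(r'([\w \d-]+) (bad1|bad2|bad3).*', re.I)

# With .+ the bad word cannot be bad3, because nothing would be left for .+
m_plus = plus.search(line)
print(repr(m_plus.group(1)), m_plus.group(2))
# 'good1-good2 good3 bad1' bad2

# With .* the greedy group only has to give back " bad3"
m_star = star.search(line)
print(repr(m_star.group(1)), m_star.group(2))
# 'good1-good2 good3 bad1 bad2' bad3
```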
In other words, some unfortunate coincidences left you confused; but your understanding of greediness is sound.
EDIT
For the edited part of the question, as written by @lovesh in the comments below:
([\w \d-]+?) ?(bad1|bad2|bad3|$)
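As a quick check, the pattern above can be wrapped in a small helper and run against the sample lines from the question. The helper name `keep_good_prefix` and the fallback of returning the line unchanged when nothing matches are assumptions made for this sketch, not part of the original answer.

```python
import re

# Lazy group, optional space, then either a bad word or the end of the line.
PATTERN = re.compile(r'([\w \d-]+?) ?(bad1|bad2|bad3|$)', re.I)

def keep_good_prefix(line):
    """Return the part of the line before the first bad word (hypothetical helper)."""
    m = PATTERN.search(line)
    # Assumption: if the pattern does not match at all, hand the line back unchanged.
    return m.group(1) if m else line

samples = [
    'good1-good2 good3 bad1-good4-bad2 some more good words',
    'good1-good2 good3 bad1 bad2',
    'good1-good2 good3 bad1 bad2 bad3',
    'good1-good2 good3',  # only good words
]
for s in samples:
    print(repr(keep_good_prefix(s)))
# each line prints 'good1-good2 good3'
```

Note that the group requires at least one character, so a line that starts with a bad word is not trimmed by this pattern; that case would need separate handling.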