How to replace a repeating character/word pattern only at the start of a string?

Date: 2022-02-27 22:18:58

Note that this question is in the context of Julia, and therefore (to my knowledge) PCRE.

Suppose that you had a string like this:

"sssppaaasspaapppssss"

and you wanted to match, individually, the repeating characters at the end of the string (in the case of our string, the four "s" characters - that is, so that matchall gives ["s","s","s","s"], not ["ssss"]). This is easy:

r"(.)(?=\1*$)"

It's practically trivial (and easily used - replace("hell",r"(.)(?=\1*$)","k") will give "hekk" while replace("hello",r"(.)(?=\1*$)","k") will give "hellk"). And it can be generalised for repeating patterns by switching out the dot for something more complex:

r"(\S+)(?=( \1)*$)"

which will, for instance, independently match the last three instances of "abc" in "abc abc defg abc h abc abc abc".
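
For reference, the end-of-string patterns above can be sketched as follows. This assumes Julia 1.x, where the call is written replace(string, pattern => replacement) and eachmatch stands in for the matchall used in the question; the variable names are only illustrative.

```julia
# Assumption: Julia 1.x syntax; `eachmatch` stands in for `matchall`.

# Each trailing "s" is matched on its own, not as a single "ssss" run.
tail_chars = [m.match for m in eachmatch(r"(.)(?=\1*$)", "sssppaaasspaapppssss")]

# Single-character pattern used for replacement.
hekk  = replace("hell",  r"(.)(?=\1*$)" => "k")
hellk = replace("hello", r"(.)(?=\1*$)" => "k")

# Word pattern: only the trailing run of "abc" is replaced.
tail_words = replace("abc abc defg abc h abc abc abc",
                     r"(\S+)(?=( \1)*$)" => "hello")

println(tail_chars)
println(tail_words)
```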

Which then leads to the question... how would you match the repeating character or pattern at the start of the string, instead? Specifically, using regex in the way it's used above.

The obvious approach would be to reverse the direction of the above regex as r"(?<=^\1*)(.)" - but PCRE/Julia doesn't allow lookbehinds to have variable length (except where each alternative has its own fixed length, like (?<=ab|cde)), and thus throws an error. The next thought is to use "\K" as something along the lines of r"^\1*\K(.)", but this only manages to match the first character (presumably because it "advances" after matching it, and no longer matches the caret).

For clarity: I'm seeking a regex that will, for instance, result in

replace("abc abc defg abc h abc abc abc",<regex here>,"hello")

producing

"hello hello defg abc h abc abc abc"

As you can see, it's replacing each "abc" from the start with "hello", but only until the first non-match. The reverse one I provide above does this at the other end of the string:

replace("abc abc defg abc h abc abc abc",r"(\S+)(?=( \1)*$)","hello")

produces

"abc abc defg abc h hello hello hello"

2 solutions

#1


8  

You can use the \G anchor that matches the position after the previous match or at the start of the string. In this way you ensure the contiguity of results from the start of the string to the last occurrence:

\G(\S+)( (?=\1 ))?

or to be able to match until the end of the string:

\G(\S+)( (?=\1(?: |\z)))?
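
A minimal sketch of how these two patterns might be used from Julia, assuming Julia 1.x syntax. Both patterns also consume the space that separates two copies, so the replacement has to put it back; keep_space is a hypothetical helper introduced here for that purpose and is not part of the answer.

```julia
# Hypothetical helper (not from the answer): re-emit the separating
# space whenever the match consumed one.
keep_space(m) = endswith(m, " ") ? "hello " : "hello"

s = "abc abc defg abc h abc abc abc"

# First pattern: the lookahead requires a space after the next copy.
leading = replace(s, r"\G(\S+)( (?=\1 ))?" => keep_space)

# With the first pattern, a run that reaches the end of the string
# is left one copy short.
partial = replace("abc abc abc", r"\G(\S+)( (?=\1 ))?" => keep_space)

# Second pattern: the next copy may also be the last word of the string.
all_same = replace("abc abc abc", r"\G(\S+)( (?=\1(?: |\z)))?" => keep_space)

println(leading)
println(partial)
println(all_same)
```

The function form of the replacement is used because it only sees the matched text, which sidesteps the question of how an unmatched second group would be expanded in a substitution string.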

#2


4  

For PCRE style engines, unfortunately there is no way to do this without
variable length lookbehind.

A pure solution is not possible.
There is no \G anchor trickery that can accomplish this.

Here is why the \G anchor won't work.

With the anchor, the only guarantee you have is that the previous match
succeeded, which means its lookahead checked that the text ahead of it
was equal to the text it matched.

As a result, you can only globally match up to N-1 of the duplicates from the beginning.

Here is a proof:

Regex:

 # (?:\G([a-c]+)(?=\1))

 (?:
      \G 
      ( [a-c]+ )                    # (1)
      (?=
           \1 
      )
 )

Input:

abcabcabcbca

Output:

 **  Grp 0 -  ( pos 0 , len 3 ) 
abc  
 **  Grp 1 -  ( pos 0 , len 3 ) 
abc  
------------
 **  Grp 0 -  ( pos 3 , len 3 ) 
abc  
 **  Grp 1 -  ( pos 3 , len 3 ) 
abc  

Conclusion:

Even though you know the Nth one is there from the previous lookahead,
the Nth one can't be matched without the condition of the current lookahead.
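
The run above can be reproduced from Julia roughly as follows. The regex and input are copied from this answer; the eachmatch wrapper and the variable names are assumptions, written for Julia 1.x. Note that Julia reports 1-based offsets, so pos 0 and pos 3 in the output above appear as 1 and 4.

```julia
# Regex and input copied from the answer; only the wrapper is new.
re = r"\G([a-c]+)(?=\1)"
input = "abcabcabcbca"

# Collect (1-based offset, matched text) for every match.
found = [(m.offset, m.match) for m in eachmatch(re, input)]

# The third "abc" (offset 7) is followed by "bca", so the mandatory
# lookahead fails there and it is not matched.
println(found)
```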

Sorry, and good luck!
Let me know if you find a pure regex solution.
