How do the look-ahead and look-behind concepts support zero-width assertions in Ruby regular expressions?

Time: 2021-02-02 15:58:23

I have just gone through the concept of zero-width assertions in the documentation, and some quick questions come to mind:

  • Why are they called zero-width assertions?
  • How do the look-ahead and look-behind concepts support the zero-width assertion concept?
  • What do the four symbols ?<=, ?<!, ?= and ?! instruct inside a pattern such as (?<=s)? Can you help me understand what is actually going on?

I also tried some small snippets to understand the logic, but I am not confident that I understand their output:

irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"

Can anyone help me understand this?

EDIT

Here I tried two snippets. The first uses a zero-width assertion:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

and the second does not:

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Both produce the same output. How does each regexp move through the string internally to produce it? Could you help me visualize that?

Thanks

3 answers

#1


16  

Regular expressions match from left to right, moving a sort of "cursor" along the string as they go. If your regex contains a literal character such as a, this means: "if there's a letter a right after the cursor, move the cursor ahead one character and keep going. Otherwise, something's wrong; back up and try something else." So you might say that a has a "width" of one character.

A "zero-width assertion" is just that: it asserts something about the string (that is, the match fails if some condition doesn't hold), but it doesn't move the cursor forwards, because its "width" is zero.

You're probably already familiar with some simpler zero-width assertions, like ^ and $. In Ruby these match at the start and end of a line (\A and \z anchor to the start and end of the whole string). If the cursor isn't at such a position when the engine reaches one of those symbols, the engine fails, backs up, and tries something else. But they never move the cursor forwards, because they don't match characters; they only check where the cursor is.

Lookahead and lookbehind work the same way. When the regex engine reaches one, it checks around the cursor to see whether the given pattern is ahead of or behind it, but even when the check succeeds, it doesn't move the cursor.

Consider:

/(?=foo)foo/.match 'foo'

This will match! The regex engine goes like this:

  1. Start at the beginning of the string: |foo.
  2. The first part of the regex is (?=foo). This means: only continue if foo appears right after the cursor. Does it? Yes, so we can proceed. But the cursor doesn't move, because this is zero-width. We still have |foo.
  3. Next is f. Is there an f right after the cursor? Yes, so proceed, and move the cursor past the f: f|oo.
  4. Next is o. Is there an o right after the cursor? Yes, so proceed, and move the cursor past the o: fo|o.
  5. Same thing again, bringing us to foo|.
  6. We reached the end of the regex, and nothing failed, so the pattern matches.
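
The walkthrough above can be checked in irb. This is only a small sketch using Ruby's standard MatchData methods; the variable names m and empty are labels chosen here for illustration.

```ruby
# The lookahead checks for "foo" without consuming it, so the literal
# foo that follows can still match the very same characters.
m = /(?=foo)foo/.match('foo')

puts m[0]        # the matched text
puts m.begin(0)  # index where the match starts
puts m.end(0)    # index just past the end of the match

# A lookahead on its own succeeds at a position and matches no characters.
empty = /(?=foo)/.match('foo')
puts empty[0].length
```

The full match still covers all three characters, while the lookahead by itself contributes a match of length zero.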

On your four assertions in particular:

  • (?=...) is "lookahead"; it asserts that ... does appear right after the cursor.

    1.9.3p125 :002 > 'jump june'.gsub(/ju(?=m)/, 'slu')
     => "slump june"

    The "ju" in "jump" matches because an "m" comes next. But the "ju" in "june" doesn't have an "m" next, so it's left alone.

    Since it doesn't move the cursor, you have to be careful when putting anything after it. (?=a)b will never match anything, because it checks that the next character is a, then checks that the same character is also b, which is impossible.

  • (?<=...) is "lookbehind"; it asserts that ... does appear right before the cursor.

    1.9.3p125 :002 > 'four flour'.gsub(/(?<=f)our/, 'ive')
     => "five flour"

    The "our" in "four" matches because there's an "f" immediately before it, but the "our" in "flour" has an "l" immediately before it, so it doesn't match.

    As above, you have to be careful with what you put before it. a(?<=b) will never match, because it checks that the next character is a, moves the cursor past it, then checks that the character just before the cursor (that same a) is b.

  • (?!...) is "negative lookahead"; it asserts that ... does not appear right after the cursor.

    1.9.3p125 :003 > 'child children'.gsub(/child(?!ren)/, 'kid')
     => "kid children"

    The standalone "child" matches, because what comes next is a space, not "ren". The "child" in "children" doesn't.

    This is probably the one I get the most use out of; finely controlling what can't come next comes in handy.

  • (?<!...) is "negative lookbehind"; it asserts that ... does not appear right before the cursor.

    1.9.3p125 :004 > 'foot root'.gsub(/(?<!r)oot/, 'eet')
     => "feet root"

    The "oot" in "foot" is fine, since there's no "r" before it. The "oot" in "root" clearly has an "r" before it, so it's left alone.

    As an additional restriction, most regex engines require that ... has a fixed length inside a lookbehind (positive or negative). So you generally can't use ?, +, *, or {n,m} there.
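
The two "never matches" warnings in the list above can be tried directly. The sketch below is only illustrative; the sample strings are chosen here and are not from the original answer.

```ruby
# (?=a)b asks for the next character to be both a and b: impossible.
p('ab' =~ /(?=a)b/)   # nil
p('b'  =~ /(?=a)b/)   # nil

# a(?<=b) consumes an a, then asks for the character just consumed to be b.
p('ba' =~ /a(?<=b)/)  # nil

# The useful forms put the assertion next to a different character.
p('ab' =~ /a(?=b)/)   # 0, an a that is followed by b
p('ba' =~ /(?<=b)a/)  # 1, an a that is preceded by b
```

The rule of thumb that falls out of this: a lookahead describes what comes after the text you consume, and a lookbehind describes what comes before it.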

You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I'll never have to maintain, so I don't have any great examples of real-world applications handy; honestly, they're weird enough that you should try to do what you want some other way first. :)

Afterthought: The syntax comes from Perl regular expressions, which used (? followed by various symbols for a lot of extended syntax, because a ? directly after an opening parenthesis was otherwise invalid. So <= doesn't mean anything by itself; (?<= is one entire token, meaning "this is the start of a lookbehind". It's like how += and ++ are separate operators, even though they both start with +.

They're easy to remember, though: = indicates looking forwards (or, really, "here"), < indicates looking backwards, and ! has its traditional meaning of "not".

Regarding your later examples:

irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"

Yes, these produce the same output. This is the tricky bit about using lookahead:

  1. The regex engine has tried some things, but they haven't worked, and now it's at fores|ight.
  2. It checks (?!s). Is the character after the cursor s? No, it's i! So that part succeeds and the matching continues, but the cursor doesn't move, and we still have fores|ight.
  3. It checks ight. Does ight come right after the cursor? Yes, it does, so move the cursor: foresight|.
  4. We're done!

The cursor moved over the substring ight, so that's the full match, and that's what gets replaced.
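
To see that both patterns from the question end up with the same match, you can inspect the MatchData. This is a small sketch; the variable names are chosen here for illustration.

```ruby
with_lookahead    = 'foresight'.match(/(?!s)ight/)
without_lookahead = 'foresight'.match(/ight/)

# Text before the match, then the match itself, for each pattern.
p with_lookahead.pre_match
p with_lookahead[0]
p without_lookahead.pre_match
p without_lookahead[0]
```

In both cases the text before the match is "fores" and the match is "ight", which is why sub produces the same result for both.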

Writing (?!a)b is pointless, since you're saying: the next character must not be a, and it must be b. But that's the same as just matching b!

Negative lookahead becomes useful when the pattern after it can match more than one thing: for example, (?!3)\d will match any digit that isn't a 3.
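
A quick way to compare the pointless and the useful form is String#scan. The input strings below are made up for illustration.

```ruby
# (?!a)b finds exactly the same matches as plain b.
p 'abba'.scan(/(?!a)b/)
p 'abba'.scan(/b/)

# (?!3)\d narrows \d: every digit except 3.
p '1234 3 53'.scan(/(?!3)\d/)
```

The first two calls return the same array, while the third returns ["1", "2", "4", "5"].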

This is what you want:

1.9.3p125 :001 > "foresight".sub(/(?<!s)ight/, 'ee')
 => "foresight" 

This asserts that s doesn't come right before ight, so "foresight", where ight is preceded by s, is left unchanged.

#2


5  

Zero-width assertions are difficult to understand until you realize that a regex matches positions as well as characters.

When you see the string "foo" you naturally read three characters. But there are also four positions, marked here by pipes: "|f|o|o|". A lookahead or lookbehind (collectively, lookarounds) matches a position where the characters after or before it match the expression.
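
One way to make those positions visible is to substitute a pipe at every place a zero-width pattern matches. This is a sketch, not part of the original answer; it relies on gsub accepting patterns that match the empty string.

```ruby
# The empty pattern matches at every position in the string.
p 'foo'.gsub(//, '|')

# A lookahead matches only the positions that are followed by an o.
p 'foo'.gsub(/(?=o)/, '|')

# A lookbehind matches only the positions that are preceded by an o.
p 'foo'.gsub(/(?<=o)/, '|')
```

The first call prints "|f|o|o|", the same picture as above; the other two print "f|o|o" and "fo|o|". No characters are removed, because nothing was consumed.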

The difference between a zero-width expression and other expressions is that the zero-width expression only matches the position and consumes no characters. So, for example:

/(app)apple/

will fail to match "apple", because the group first consumes "app" and the pattern then needs a full "apple" after it; the string would have to be "appapple". But

/(?=app)apple/

will succeed because the lookahead only matches the position that "app" follows. It doesn't consume the "app" characters, which leaves them for the next part of the expression to consume.
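
The two patterns can be compared side by side. This is a small sketch; the string "appapple" is added here only to show what the first pattern would need.

```ruby
# The capturing group consumes "app", so "apple" must follow it.
p('apple'    =~ /(app)apple/)   # nil
p('appapple' =~ /(app)apple/)   # 0

# The lookahead only inspects "app"; "apple" then matches from the same spot.
m = /(?=app)apple/.match('apple')
p m[0]
```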

LOOKAROUND DESCRIPTIONS

Positive Lookahead: (?=s)

Imagine you are a drill sergeant and you are performing an inspection. You begin at the front of the line with the intention of walking past each private and ensuring they meet expectations. But, before doing so, you look ahead one by one to make sure they have lined up in the proper order. The privates' names are "A", "B", "C", "D" and "E". /(?=ABCDE)...../.match('ABCDE'). Yep, they are all present and accounted for.

Negative Lookahead: (?!s)

You perform the inspection down the line and are finally standing past the last private, E. Now you are going to look ahead to make sure that "F" from the other company has not, yet again, accidentally slipped into the wrong formation. /.....(?!F)/.match('ABCDE'). Nope, he hasn't slipped in this time, so all is well.

Positive Lookbehind: (?<=s)

After completing the inspection, the sergeant is at the end of the formation. He turns and scans back to make sure no one has snuck away. /.....(?<=ABCDE)/.match('ABCDE'). Yep, everyone is present and accounted for.

Negative Lookbehind: (?<!s)

Finally, the drill sergeant takes one last look back to make sure that privates A and B have not, once again, switched places (because they like KP). /.....(?<!BACDE)/.match('ABCDE'). Nope, they haven't, so all is well.
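
The four inspections above can be run as written. The sketch below collects them in one place; the variable name line and the final counter-example with 'BACDE' are additions made here for illustration.

```ruby
line = 'ABCDE'

# Each inspection passes, and each time the five dots consume the whole line.
p(/(?=ABCDE)...../.match(line)[0])    # look ahead before walking the line
p(/.....(?!F)/.match(line)[0])        # no F after the end of the line
p(/.....(?<=ABCDE)/.match(line)[0])   # look back from the end of the line
p(/.....(?<!BACDE)/.match(line)[0])   # A and B have not switched places

# Counter-example: if A and B did switch, the last inspection fails.
p(/.....(?<!BACDE)/.match('BACDE'))
```

The first four calls print "ABCDE" and the last prints nil. In every passing case the lookaround adds nothing to the matched text; only the dots consume characters.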

#3


2  

A zero-width assertion is an expression that consumes zero characters while matching. For example, in

"foresight".sub(/sight/, 'ee')

what is matched is

foresight
    ^^^^^

and thus the result would be

foreee

However, in this example,

"foresight".sub(/(?<=s)ight/, 'ee')

what is matched is

foresight
     ^^^^

and therefore the result would be

foresee

Another example of a zero-width assertion is the word-boundary anchor, \b. For example, to match a complete word, you might try surrounding the word with whitespace, e.g.

"flight light plight".sub(/\slight\s/, 'dark')

to get

flightdarkplight

But do you see how matching the spaces removes them during substitution? Using a word boundary gets around this problem:

"flight light plight".sub(/\blight\b/, 'dark')

The \b matches at the beginning or end of a word, but does not actually match a character: it's zero-width.
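
Here are the two substitutions next to each other, plus a lookaround version for comparison. The third pattern is an addition made here, not part of the original answer; it shows that lookarounds leave the surrounding spaces alone in the same way \b does.

```ruby
text = 'flight light plight'

# \s consumes the spaces, so they are replaced along with the word.
p text.sub(/\slight\s/, 'dark')

# \b only checks the position, so the spaces survive.
p text.sub(/\blight\b/, 'dark')

# Lookarounds also check without consuming.
p text.sub(/(?<=\s)light(?=\s)/, 'dark')
```

The first call prints "flightdarkplight"; the other two print "flight dark plight".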

Maybe the most succinct answer to your question is this: lookahead and lookbehind assertions are one kind of zero-width assertion. All lookahead and lookbehind assertions are zero-width.

Here are explanations of your examples:

irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"

Above, you're saying, "Match where the next character is not an s, and is then an i." The first condition is always true when the next character is an i, since an i is never an s, so the substitution succeeds.

irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"

Above, you're saying, "Match where the next character is an s, and is then an i." This is never true, since an i is never an s, so the substitution fails.

irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"

This one is already explained above. (This is the correct one.)

irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"

This one should be clear by now. With this pattern, "firefight" would become "firefee", but "foresight" does not become "foresee".
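
To compare all four patterns at once, the sketch below runs each of them against both "foresight" and "firefight". The hash of labels is only an illustration device introduced here.

```ruby
patterns = {
  'negative lookahead  (?!s)ight'  => /(?!s)ight/,
  'positive lookahead  (?=s)ight'  => /(?=s)ight/,
  'positive lookbehind (?<=s)ight' => /(?<=s)ight/,
  'negative lookbehind (?<!s)ight' => /(?<!s)ight/
}

%w[foresight firefight].each do |word|
  patterns.each do |label, regex|
    # sub returns the word unchanged when the pattern does not match.
    puts "#{word}  #{label}  =>  #{word.sub(regex, 'ee')}"
  end
end
```

Only the two lookbehind patterns distinguish the words: (?<=s)ight changes "foresight" but not "firefight", and (?<!s)ight does the opposite.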
