Does bash support word-boundary regular expressions?

Date: 2021-09-25 15:01:23

I am trying to match on the presence of a word in a list before adding that word again (to avoid duplicates). I am using bash 4.2.24 and am trying the below:

[[  $foo =~ \bmyword\b ]]

also

[[  $foo =~ \<myword\> ]]

However, neither seems to work. Both are mentioned in the bash docs example: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.html.

I presume I am doing something wrong but I am not sure what.

8 answers

#1


20  

Yes, all the listed regex extensions are supported but you'll have better luck putting the pattern in a variable before using it. Try this:

re=\\bmyword\\b
[[ $foo =~ $re ]]

Digging around, I found this question, whose answers seem to explain why the behaviour changes when the regex is written inline as in your example. You'll probably have to rewrite your tests to use a temporary variable for your regexes, or use the 3.1 compatibility mode:

shopt -s compat31
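
For the original goal (adding a word only if it is not already in the list), the variable-based pattern could be wrapped in a small helper. This is only a sketch under some assumptions: a Linux system where \b is supported, a space-separated list, and a word that contains no regex metacharacters. The name add_unique is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print LIST with WORD appended unless WORD is already in it.
# Assumes the platform's regex library understands \b (e.g. glibc on Linux)
# and that WORD contains no regex metacharacters.
add_unique() {
  local list=$1 word=$2
  local re="\\b${word}\\b"   # keep the regex in a variable, not inline
  if [[ $list =~ $re ]]; then
    printf '%s\n' "$list"
  else
    printf '%s\n' "${list:+$list }$word"
  fi
}

foo='alpha beta'
foo=$(add_unique "$foo" beta)    # already present: list unchanged
foo=$(add_unique "$foo" gamma)   # not present: appended
printf '%s\n' "$foo"
```

On platforms without \b support, the regex in re would need to be swapped for one of the alternatives discussed in the other answers.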

#2


20  

tl;dr

  • To be safe, do not use a regex literal with =~.
    Instead, use an intermediate variable or a command substitution that outputs the regex (see the workarounds below).

  • Whether \b and \< / \> work at all depends on the host platform, not Bash:

    • they DO work on Linux,
    • but NOT on BSD-based platforms such as macOS.

If you want to know more, read on.

On bash 3.2+ (unless the compat31 shopt option is set), the right operand of the =~ operator must be unquoted in order to be recognized as a regex (if you quote the right operand, =~ performs regular string comparison instead).

More accurately, at least the special regex characters and sequences must be unquoted, so it's OK and useful to quote those substrings that should be taken literally; e.g., [[ ' ab' =~ ^' ab' ]] matches, because ^ is unquoted and thus correctly recognized as the start-of-string anchor.
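
The difference between a quoted and an unquoted right operand can be seen with a small experiment. This is an illustrative sketch for bash 3.2+ without compat31; the helper name matches and its mode argument are made up for the example.

```bash
#!/usr/bin/env bash
# matches STRING MODE PATTERN (hypothetical helper)
#   MODE=regex:   PATTERN is expanded unquoted, so it is treated as a regex.
#   MODE=literal: PATTERN is expanded in quotes, so it is matched as plain text.
matches() {
  case $2 in
    regex)   [[ $1 =~ $3 ]] ;;
    literal) [[ $1 =~ "$3" ]] ;;
  esac
}

matches 'abc' regex   'a.c' && echo 'unquoted: "." acts as a wildcard, abc matches'
matches 'abc' literal 'a.c' || echo 'quoted: "." is a literal dot, abc does not match'
matches 'a.c' literal 'a.c' && echo 'quoted: a.c matches itself'
```

The same rule applies to variables: $re is treated as a regex, whereas "$re" is matched literally.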

However, there appears to be a bug in (at least) bash 4.x where certain regex literals aren't parsed correctly, namely those containing \-prefixed constructs such as \< and \s (if you think this is not a bug, do let me know); behavior as of bash 4.2.46 on Linux:

   # BUG
[[ ' word ' =~ \<word\> ]] && echo MATCHES     # !! DOES NOT MATCH
[[ ' word ' =~ \\<word\\> ]] && echo MATCHES   # !! BREAKS
[[ ' word ' =~ \\\<word\\\> ]] && echo MATCHES # !! DOES NOT MATCH

   # WORKAROUNDS
re='\<word\>'; [[ ' word ' =~ $re ]] && echo MATCHES # OK - intermediate variable
[[ ' word ' =~ $(printf %s '\<word\>') ]] && echo MATCHES # OK - command subst.

Cross-platform support:

=~ is the rare case (the only case?) of a built-in bash feature that is platform-dependent: It uses the regex libraries of the platform it is running on, resulting in different regex flavors on different platforms.

For instance, on FreeBSD/OSX \< / \> and \b are NOT supported, but [[:<:]] and [[:>:]] are. On Linux it is the other way around.

Thus, it is non-trivial and requires extra care to write portable code that uses the =~ operator.
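
One way to cope with this is to probe at run time which syntax the platform's regex library accepts. The following is a rough sketch, not a tested cross-platform solution: wb_start, wb_end and has_word are hypothetical names, the word is assumed to contain no regex metacharacters, and the BSD-style fallback is simply assumed (not verified) whenever the GNU-style probe fails.

```bash
#!/usr/bin/env bash
# Probe once which word-boundary syntax the platform's regex library accepts.
probe='\<a\>'
if [[ a =~ $probe ]]; then
  wb_start='\<' wb_end='\>'            # GNU-style (e.g. Linux)
else
  wb_start='[[:<:]]' wb_end='[[:>:]]'  # BSD-style (e.g. macOS, FreeBSD); assumed
fi

# has_word STRING WORD: succeed if WORD occurs in STRING as a whole word.
has_word() {
  local re="${wb_start}$2${wb_end}"    # regex kept in a variable, not inline
  [[ $1 =~ $re ]]
}

has_word 'alpha beta' beta  && echo 'beta: found'
has_word 'alphabet'   alpha || echo 'alpha: not found as a whole word'
```

Because the regex is built in a variable, this also sidesteps the literal-parsing problem described above.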

#3


3  

The accepted answer focuses on using auxiliary variables to deal with the syntax oddities of regular expressions in Bash's [[ ... ]] expressions. Very good info.

However, the real answer is:

\b, \< and \> do not work on OS X 10.11.5 (El Capitan) with bash version 4.3.42(1)-release (x86_64-apple-darwin15.0.0).

Instead, use [[:<:]] and [[:>:]].

#4


0  

This worked for me

bar='\<myword\>'
[[ $foo =~ $bar ]]

#5


0  

Not exactly "\b", but for me more readable (and portable) than the other suggestions:

[[  $foo =~ (^| )myword($| ) ]]
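
If the word comes from a variable, the same idea could be wrapped in a function. This is a minimal sketch, assuming a list separated by single spaces and a word without regex metacharacters; in_list is a hypothetical name.

```bash
#!/usr/bin/env bash
# in_list LIST WORD: succeed if WORD is one of the space-separated items in LIST.
# Uses only start/end anchors and a literal space, so no word-boundary extensions
# are needed.
in_list() {
  local re="(^| )$2( |\$)"   # expands to: (^| )WORD( |$)
  [[ $1 =~ $re ]]
}

foo='one myword two'
if in_list "$foo" myword; then
  echo 'already in the list'
else
  foo="$foo myword"
fi
```

Unlike \b, this only treats spaces (and the ends of the string) as delimiters, so an item such as my-word would not count as containing myword.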

#6


0  

Tangential to your question, but if you can use egrep in your script:

if [ "$(echo "$foo" | egrep -c "\b${myword}\b")" -gt 0 ]; then

I ended up using this after flailing with bash's =~.

As mklement0 astutely points out, we could just rely on egrep's exit status and write:

if egrep -q "\b${myword}\b" <<<"$foo"; then

#7


0  

You can use grep, which is more portable than bash's regex matching, like this:

if echo "$foo" | grep -q '\<myword\>'; then
    echo "MATCH"
else
    echo "NO MATCH"
fi
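
A variant of the same idea that avoids the \< and \> escapes altogether is grep's -w (whole word) option combined with -F (fixed string). This is a sketch rather than a guaranteed-portable recipe: -w is not specified by POSIX, although both GNU grep and BSD grep provide it, and grep_has_word is a hypothetical name.

```bash
#!/usr/bin/env bash
# grep_has_word LIST WORD: succeed if WORD occurs in LIST as a whole word.
# -F treats WORD as a fixed string, so regex metacharacters in it are harmless;
# -w requires the match to be delimited by non-word characters or line ends;
# -- guards against a WORD that starts with a dash.
grep_has_word() {
  printf '%s\n' "$1" | grep -qwF -- "$2"
}

foo='alpha beta gamma'
if grep_has_word "$foo" beta; then
  echo "MATCH"
else
  echo "NO MATCH"
fi
```

Note that -w considers letters, digits and underscores to be word characters, so beta would also be found inside an item such as beta-2.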

#8


0  

I've used the following to match word boundaries on older systems. The key is to wrap $foo with spaces since [^[:alpha:]] will not match words at the beginning or end of the list.

[[ " $foo " =~ [^[:alpha:]]myword[^[:alpha:]] ]]

Tweak the character class as needed based on the expected contents of myword; otherwise this may not be a good solution.
