.NET regex: what is the word character \w?

Date: 2021-10-14 20:46:48

Simple question:
What is the pattern for the word character \w in C# / .NET?

My first thought was that it matches [A-Za-z0-9_] and the documentation tells me:

Character class    Description          Pattern     Matches
\w                 Matches any          \w          "I", "D", "A", "1", "3"
                   word character.                  in "ID A1.3"

which is not very helpful.
And \w seems to match äöü, too. What else? Is there a better (exact) definition available?

3 Answers

#1


86  

From the documentation:

Word Character: \w

\w matches any word character. A word character is a member of any of the Unicode categories listed in the following table.

  • Ll (Letter, Lowercase)
  • Lu (Letter, Uppercase)
  • Lt (Letter, Titlecase)
  • Lo (Letter, Other)
  • Lm (Letter, Modifier)
  • Nd (Number, Decimal Digit)
  • Pc (Punctuation, Connector)
    • This category includes ten characters, the most commonly used of which is the LOWLINE character (_), u+005F.

If ECMAScript-compliant behavior is specified, \w is equivalent to [a-zA-Z_0-9].
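
A minimal sketch of the difference, assuming the standard System.Text.RegularExpressions API. The class and method names (WordCharDemo, IsUnicodeWord, IsEcmaWord) are made up for illustration and are not part of the documentation. With default options, "äöü" and "日本語" should count as word characters, while with RegexOptions.ECMAScript only the ASCII sample should.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // Default .NET behavior: \w is Unicode-aware.
    public static bool IsUnicodeWord(string s) =>
        Regex.IsMatch(s, @"^\w+$");

    // ECMAScript-compliant behavior: \w is restricted to [a-zA-Z_0-9].
    public static bool IsEcmaWord(string s) =>
        Regex.IsMatch(s, @"^\w+$", RegexOptions.ECMAScript);

    public static void Main()
    {
        // "\u00E4\u00F6\u00FC" is "äöü"; "\u65E5\u672C\u8A9E" is "日本語".
        string[] samples = { "abc_123", "\u00E4\u00F6\u00FC", "\u65E5\u672C\u8A9E" };
        foreach (string sample in samples)
        {
            Console.WriteLine(
                $"{sample}: default={IsUnicodeWord(sample)}, ECMAScript={IsEcmaWord(sample)}");
        }
    }
}
```

If ASCII-only matching is what you actually want, spelling out [a-zA-Z_0-9] in the pattern is an alternative to switching the whole regex to ECMAScript mode.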

#2


13  

Basically it matches everything that can be considered the intuitive definition of letter in various scripts – plus the underscore and a few other oddballs.

You can find a complete list (at least for the BMP) with the following tiny PowerShell snippet:

0..65535 | ?{([char]$_) -match '\w'} | %{ "$_`: " + [char]$_ }
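
For C# readers, the same survey could be sketched roughly as follows. Instead of listing every character, it counts the matches per Unicode category, which makes it easier to see which categories \w actually covers. WordCharSurvey and CountByCategory are hypothetical names, and the counts depend on the Unicode data shipped with your runtime, so none are shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCharSurvey
{
    private static readonly Regex Word = new Regex(@"\w");

    // Counts the UTF-16 code units (0..65535) matched by \w, grouped by Unicode category.
    public static SortedDictionary<UnicodeCategory, int> CountByCategory()
    {
        var counts = new SortedDictionary<UnicodeCategory, int>();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (!Word.IsMatch(c.ToString()))
            {
                continue;
            }

            UnicodeCategory category = char.GetUnicodeCategory(c);
            counts.TryGetValue(category, out int current); // current is 0 when the key is missing
            counts[category] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        SortedDictionary<UnicodeCategory, int> counts = CountByCategory();
        foreach (KeyValuePair<UnicodeCategory, int> pair in counts)
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
        Console.WriteLine($"Total: {counts.Values.Sum()}");
    }
}
```

Like the PowerShell one-liner, this only covers the BMP, because a single char cannot represent a supplementary code point.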

#3


3  

After some research: using '\w' in .NET is equivalent to the following:

using System.Collections.Generic;
using System.Globalization;

public static class Extensions
{
    /// <summary>
    /// The Unicode categories matched by '\w'.
    /// </summary>
    private static readonly HashSet<UnicodeCategory> _wordCategories = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.DecimalDigitNumber,
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.ConnectorPunctuation,
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.ModifierLetter,
        UnicodeCategory.NonSpacingMark,
    };

    /// <summary>
    /// Determines whether the specified character is a word character (equivalent to '\w').
    /// </summary>
    /// <param name="c">The character to test.</param>
    /// <returns>true if the character belongs to one of the word categories; otherwise, false.</returns>
    public static bool IsWord(this char c) => _wordCategories.Contains(char.GetUnicodeCategory(c));
}

I've written this as an extension method so that it is easy to use on any character c: just invoke c.IsWord(), which returns true if the character is a word character. This should be significantly quicker than using a Regex.

Interestingly, this doesn't appear to match the .NET specification: '\w' in fact matches 938 'NonSpacingMark' characters, which are not mentioned there.

In total, this matches 49,760 of the 65,535 characters, so the simple regexes often shown on the web are incomplete.
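
If you want to check the equivalence on your own runtime, one possible sketch is to compare the category test against the regex for every char and list any disagreements. WordCharCheck and FindMismatches are hypothetical names, and the category set is copied from the answer above. Treat the result as a sanity check for one runtime version, not as a guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class WordCharCheck
{
    // Category set taken from the answer above, including NonSpacingMark.
    private static readonly HashSet<UnicodeCategory> WordCategories = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.DecimalDigitNumber,
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.ConnectorPunctuation,
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.ModifierLetter,
        UnicodeCategory.NonSpacingMark,
    };

    private static readonly Regex Word = new Regex(@"\w");

    // Returns every char for which the category test and the regex disagree.
    public static List<char> FindMismatches()
    {
        var mismatches = new List<char>();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            bool byCategory = WordCategories.Contains(char.GetUnicodeCategory(c));
            bool byRegex = Word.IsMatch(c.ToString());
            if (byCategory != byRegex)
            {
                mismatches.Add(c);
            }
        }
        return mismatches;
    }

    public static void Main()
    {
        List<char> mismatches = FindMismatches();
        Console.WriteLine($"Mismatches: {mismatches.Count}");
        foreach (char c in mismatches)
        {
            Console.WriteLine($"U+{(int)c:X4} ({char.GetUnicodeCategory(c)})");
        }
    }
}
```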
