Regex for names with special characters (Unicode)

Date: 2021-01-06 20:24:18

Okay, I have read about regex all day now and still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too.

I basically need a regex that checks that the name is at least two words and does not contain numbers or special characters like !"#¤%&/()=... The words can, however, contain characters like æ, é, Â and so on.

An example of an accepted name would be: "John Elkjærd" or "André Svenson"
A non-accepted name would be: "Hans", "H4nn3 Andersen" or "Martin Henriksen!"

If it matters, I use the JavaScript .match() function client side and want to use PHP's preg_replace() only "in negative" server side (removing non-matching characters).

Any help would be much appreciated.

Update:
Okay, thanks to Alix Axel's answer I have the important part down, the server-side one.

But as the page from LightWing's answer suggests, I was unable to find anything about Unicode support for JavaScript, so I ended up with half a solution for the client side, just checking for at least two words and a minimum of 5 characters, like this:

var words = name.match(/\S+/g) || []; // match() returns null when nothing matches
if (words.length >= minWords && name.length >= 5) {
  // valid
}

An alternative would be to specify all the Unicode characters as suggested in shifty's answer, which I might end up doing along with the solution above, but it is a bit impractical.

7 answers

#1


29  

Try the following regular expression:

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0)
{
    // valid
}

You should read it like this:

^   # start of subject
    (?:     # match this:
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s          # any kind of space
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s?         # any kind of space (optional: 0 or 1 times)
    )+      # one or more times
$   # end of subject

I honestly don't know how to port this to JavaScript, and I'm not even sure JavaScript supports Unicode properties, but in PHP PCRE this seems to work flawlessly @ IDEOne.com:

$names = array
(
    'Alix',
    'André Svenson',
    'H4nn3 Andersen',
    'Hans',
    'John Elkjærd',
    'Kristoffer la Cour',
    'Marco d\'Almeida',
    'Martin Henriksen!',
);

foreach ($names as $name)
{
    echo sprintf('%s is %s' . "\n", $name, (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) ? 'valid' : 'invalid');
}

I'm sorry I can't help you regarding the Javascript part but probably someone here will.
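
For readers who need the client-side half: engines that implement ES2018 accept Unicode property escapes when the regex carries the u flag, so the pattern can be ported almost verbatim. The following is only a sketch under that assumption and is not part of the original answer. The helper name isValidName is made up for illustration, and \x{2019} is written as \u2019 because JavaScript has no \x{...} syntax.

```javascript
// Sketch of a JavaScript port; assumes an ES2018+ engine (Unicode property escapes).
// isValidName is a hypothetical helper name, not part of the original answer.
const NAME_PATTERN =
  /^(?:[\p{L}\p{Mn}\p{Pd}'\u2019]+\s[\p{L}\p{Mn}\p{Pd}'\u2019]+\s?)+$/u;

function isValidName(name) {
  return NAME_PATTERN.test(name);
}

const names = [
  "Alix", "André Svenson", "H4nn3 Andersen", "Hans",
  "John Elkjærd", "Kristoffer la Cour", "Marco d'Almeida", "Martin Henriksen!",
];

// Print one verdict per name, mirroring the PHP loop above.
for (const name of names) {
  console.log(`${name} is ${isValidName(name) ? "valid" : "invalid"}`);
}
```

On engines without property-escape support the regex literal is a syntax error, so this would need a feature check or a fallback pattern.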


Validates:

  • John Elkjærd
  • André Svenson
  • Marco d'Almeida
  • Kristoffer la Cour

Invalidates:

  • Hans
  • H4nn3 Andersen
  • Martin Henriksen!

To replace invalid characters, though I'm not sure why you need this, you just need to change it slightly:

$name = preg_replace('~[^\p{L}\p{Mn}\p{Pd}\'\x{2019}\s]~u', '', $name); // the pattern has no capture group, so replace with an empty string

Examples:

  • H4nn3 Andersen -> Hnn Andersen
  • Martin Henriksen! -> Martin Henriksen

Note that you always need to use the u modifier.
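
The same negated character class can be used for client-side clean-up. This is a minimal sketch, assuming an ES2018+ engine. stripInvalidChars is a hypothetical helper name, and here too the u flag is required (plus g to replace every occurrence).

```javascript
// Sketch: remove every character that is not a letter, combining mark,
// dash, apostrophe or whitespace. Assumes an ES2018+ engine.
// stripInvalidChars is a hypothetical helper name.
function stripInvalidChars(name) {
  return name.replace(/[^\p{L}\p{Mn}\p{Pd}'\u2019\s]/gu, "");
}

console.log(stripInvalidChars("H4nn3 Andersen"));    // "Hnn Andersen"
console.log(stripInvalidChars("Martin Henriksen!")); // "Martin Henriksen"
```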

#2


3  

JavaScript is trickier: before ES2018, its regex syntax did not support Unicode character properties (newer engines accept \p{...} when the regex uses the u flag). For engines without that support, a pragmatic solution is to match letters like this:

[a-zA-Z\xC0-\uFFFF]

This allows letters in all languages and excludes numbers and all the special (non-letter) characters commonly found on keyboards. It is imperfect because it also allows unicode special symbols which are not letters, e.g. emoticons, snowman and so on. However, since these symbols are typically not available on keyboards I don't think they will be entered by accident. So depending on your requirements it may be an acceptable solution.
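
To make the trade-off concrete, here is a small sketch that probes the range with a few inputs. looksLikeLetters is a hypothetical helper name, and the anchors ^ and $ are added here so the whole string has to match.

```javascript
// Sketch: probing the range-based class from this answer.
// looksLikeLetters is a hypothetical helper name.
function looksLikeLetters(text) {
  return /^[a-zA-Z\xC0-\uFFFF]+$/.test(text);
}

console.log(looksLikeLetters("Elkjærd")); // true
console.log(looksLikeLetters("H4nn3"));   // false: digits are outside the range
console.log(looksLikeLetters("☃"));       // true: U+2603 is not a letter but is in range
console.log(looksLikeLetters("×"));       // true: U+00D7 (multiplication sign) is in range
```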

#3


2  

Visit this page: Unicode Characters in Regular Expression

#4


2  

You can add the allowed special chars to the regex.

Example:

[a-zA-ZßöäüÖÄÜæé]+

EDIT:

Not the best solution, but this would match if there are at least two words:

[a-zA-ZßöäüÖÄÜæé]+\s[a-zA-ZßöäüÖÄÜæé]+
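
One caveat when using this pattern for validation: it is unanchored, so it succeeds as soon as two adjacent words match anywhere in the input. Below is a sketch of an anchored variant. ALLOWED and isWhitelistedName are illustrative names, and the character list is simply the one from the example above, not a complete alphabet.

```javascript
// Sketch: anchoring the whitelist pattern so the whole input must match.
// ALLOWED and isWhitelistedName are hypothetical names for illustration.
const ALLOWED = "a-zA-ZßöäüÖÄÜæé";
const UNANCHORED = new RegExp(`[${ALLOWED}]+\\s[${ALLOWED}]+`);
const ANCHORED = new RegExp(`^[${ALLOWED}]+(?:\\s[${ALLOWED}]+)+$`);

function isWhitelistedName(name) {
  return ANCHORED.test(name);
}

console.log(UNANCHORED.test("Martin Henriksen!"));   // true: matches the "Martin Henriksen" part
console.log(isWhitelistedName("Martin Henriksen!")); // false
console.log(isWhitelistedName("John Elkjærd"));      // true
```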

#5


2  

Here's an optimization over the fantastic answer by @Alix above. It removes the need to define the character class twice, and allows for easier definition of any number of required words.

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+(?:$|\s+)){2,}$

It can be broken down as follows:

^         # start
  (?:       # non-capturing group
    [         # match a:
      \p{L}     # Unicode letter, or
      \p{Mn}    # Unicode accents, or
      \p{Pd}    # Unicode hyphens, or
      \'        # single quote, or
      \x{2019}  # single quote (alternative)
    ]+        # one or more times
    (?:       # non-capturing group
      $         # either end-of-string
    |         # or
      \s+       # one or more spaces
    )         # end of group
  ){2,}     # two or more times
$         # end-of-string

Essentially, it is saying to find a word as defined by the character class, then either find one or more spaces or an end of a line. The {2,} at the end tells it that a minimum of two words must be found for a match to succeed. This ensures the OP's "Hans" example will not match.
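
As a rough illustration of the "any number of required words" point, the pattern can be built from a parameter. This sketch is in JavaScript and assumes an ES2018+ engine for the \p{...} escapes. buildNameRegex is a hypothetical helper name, and \u2019 stands in for \x{2019}.

```javascript
// Sketch: building the "N or more words" pattern for a configurable minimum.
// Assumes an ES2018+ engine; buildNameRegex is a hypothetical helper name.
function buildNameRegex(minWords) {
  const word = "[\\p{L}\\p{Mn}\\p{Pd}'\\u2019]+";
  // Each word must be followed by whitespace or the end of the input.
  return new RegExp(`^(?:${word}(?:$|\\s+)){${minWords},}$`, "u");
}

const atLeastTwo = buildNameRegex(2);
console.log(atLeastTwo.test("John Elkjærd"));       // true
console.log(atLeastTwo.test("Kristoffer la Cour")); // true
console.log(atLeastTwo.test("Hans"));               // false
```

Because every word has to be followed by a separator or the end of the input, a single word cannot be split in two to satisfy the minimum.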


Lastly, since I found this question while looking for a similar solution for Ruby, here is the regular expression as it can be used in Ruby 1.9+:

\A(?:[\p{L}\p{Mn}\p{Pd}\'\u2019]+(?:\Z|\s+)){2,}\Z

The primary changes are using \A and \Z for beginning and end of string (instead of line) and Ruby's Unicode character notation.

#6


0  

When checking your input string you could

  • trim() it to remove leading/trailing whitespace
  • match against [^\w\s] to detect characters that are neither word characters nor whitespace
  • match against \s+ to count the word separators, which equals the number of words - 1.

However, I'm not sure whether the \w shorthand includes accented characters in every engine. In JavaScript it covers only [A-Za-z0-9_], so accented letters are flagged as non-word characters while digits pass.
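
A quick sketch of these checks in JavaScript shows where the \w approach falls short for this question. checkName is a hypothetical helper name, and the word count is taken by splitting on \s+ rather than counting separators.

```javascript
// Sketch of the checks in JavaScript; checkName is a hypothetical helper.
// Note: JavaScript's \w is [A-Za-z0-9_] only.
function checkName(input) {
  const trimmed = input.trim();
  const words = trimmed === "" ? [] : trimmed.split(/\s+/);
  return {
    wordCount: words.length,                  // separators + 1 for non-empty input
    hasNonWordChars: /[^\w\s]/.test(trimmed), // true for "!" but also for "é" and "æ"
  };
}

console.log(checkName("  Martin Henriksen! ")); // { wordCount: 2, hasNonWordChars: true }
console.log(checkName("H4nn3 Andersen"));       // { wordCount: 2, hasNonWordChars: false }
console.log(checkName("André Svenson"));        // { wordCount: 2, hasNonWordChars: true }
```

The last two lines are the problem cases: the name with digits is not flagged, and the valid accented name is.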

#7


0  

This is the JS regex that I use for fancy names composed of at most 3 words (each 1 to 60 chars), separated by a space, single quote or minus sign:

^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$
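
For what it's worth, here is a small sketch exercising that pattern. isFancyName is a hypothetical helper name. Note that the pattern also accepts a single word, so on its own it does not enforce the two-word requirement from the question.

```javascript
// Sketch: exercising the pattern from this answer; isFancyName is a hypothetical name.
function isFancyName(name) {
  return /^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-']{0,1}){1,3}$/.test(name);
}

console.log(isFancyName("Marco d'Almeida"));          // true: "Marco ", "d'", "Almeida"
console.log(isFancyName("Kristoffer la Cour"));       // true
console.log(isFancyName("Hans"));                     // true: one word is allowed
console.log(isFancyName("H4nn3 Andersen"));           // false: digits are rejected
console.log(isFancyName("Jean-Paul de la Fontaine")); // false: more than three parts
```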
