How to match accented characters with a regular expression?

I am running Ruby on Rails 3.0.10 and Ruby 1.9.2. I am using the following Regex in order to match names:

NAME_REGEX = /^[\w\s'"\-_&@!?()\[\]-]*$/u

validates :name,
  :presence   => true,
  :format     => {
    :with     => NAME_REGEX,
    :message  => "format is invalid"
  }

However, if I try to save words like the following:

Oilalà
Pì
Rùby
...

# In short, words with accented characters

I get the validation error "Name format is invalid".

How can I change the above regex so that it also matches accented characters like à, è, é, ì, ò, ù, ...?

2 Answers

#1

Instead of \w, use the POSIX bracket expression [:alpha:]:

"blåbær dèjá vu".scan /[[:alpha:]]+/  # => ["blåbær", "dèjá", "vu"]

"blåbær dèjá vu".scan /\w+/  # => ["bl", "b", "r", "d", "j", "vu"]

In your particular case, change the regex to this:

NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u
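
As a quick check of the pattern above, here is a minimal sketch that runs it against the sample names from the question. The accented letters are written as escapes so they are guaranteed to be single precomposed code points. STRICT_NAME_REGEX is a hypothetical variant that is not part of the answer; it assumes you want \A and \z, which anchor the whole string, instead of ^ and $, which anchor each line.

```ruby
# encoding: utf-8

# The answer's pattern, unchanged.
NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u

# Hypothetical stricter variant (an assumption, not from the answer):
# \A and \z anchor the whole string, whereas ^ and $ anchor each line.
STRICT_NAME_REGEX = /\A[[:alpha:]\s'"\-_&@!?()\[\]-]*\z/u

# Sample names from the question, with precomposed accented letters.
names = ["Oilal\u00E0", "P\u00EC", "R\u00F9by"]

names.each do |name|
  puts "#{name}: #{name =~ NAME_REGEX ? 'valid' : 'invalid'}"
end

# A multi-line value: only the first line has to satisfy ^...$.
multi_line = "Ruby\n<script>"
puts "line anchors accept it:   #{!!(multi_line =~ NAME_REGEX)}"
puts "string anchors accept it: #{!!(multi_line =~ STRICT_NAME_REGEX)}"
```

With line anchors, a value only needs one line that satisfies the pattern, so the string-anchored form may be the safer choice for a validation.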

This matches much more than just accented characters, though, which is a good thing. Make sure you read this blog entry about common misconceptions regarding names in software applications.

#2

One solution would of course be to simply find all of them and use them as you normally would, although I assume there can be quite a lot of them.

If you are using UTF-8, you will find that such characters are often split into two parts: the "base" character itself, followed by the accent (U+0300 and U+0301, I believe), which is also called a combining character. However, this is not always the case, since some characters can also be written using a single "hardcoded" (precomposed) character code, so you need to normalize the string to NFD form first.

Of course, you could also convert any string you have into UTF-8 and then back into the original charset, but the overhead might become quite large if you are doing bulk operations.

EDIT: To answer your question specifically, the best solution is probably to normalize your strings to NFD form and then add U+0300 and U+0301 to your list of acceptable characters, along with whatever other combining characters you want to allow (such as the ring and dots in åäö). You can find them all in "charmap" on Windows; look at U+0300 and up.
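
The normalize-then-whitelist idea can be sketched roughly as follows. This assumes Ruby 2.2 or later, where String#unicode_normalize is available (it is not in Ruby 1.9.2). NFD_NAME_REGEX and nfd_name_valid? are hypothetical names chosen for illustration, and allowing the whole Combining Diacritical Marks block (U+0300 to U+036F) is an assumption; you may want a narrower list.

```ruby
# encoding: utf-8

# Hypothetical pattern: ASCII letters, whitespace, apostrophe, hyphen,
# plus the Combining Diacritical Marks block (U+0300..U+036F).
NFD_NAME_REGEX = /\A[a-zA-Z\u0300-\u036F\s'-]*\z/

# Hypothetical helper: decompose the name first, then check it.
def nfd_name_valid?(name)
  decomposed = name.unicode_normalize(:nfd)
  !!(decomposed =~ NFD_NAME_REGEX)
end

precomposed = "R\u00F9by"                        # "ù" as one code point
decomposed  = precomposed.unicode_normalize(:nfd)

# After NFD, "ù" becomes "u" followed by U+0300 (combining grave accent).
puts decomposed.codepoints.map { |c| format("U+%04X", c) }.join(" ")

puts nfd_name_valid?(precomposed)          # accent decomposes into an allowed mark
puts nfd_name_valid?("bl\u00E5b\u00E6r")   # "æ" has no decomposition
```

Note that letters without a canonical decomposition, such as æ, are still rejected by this approach unless you list them explicitly, which is one reason the [:alpha:] approach may be simpler.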
