How to use a regex with UTF-8 in Ruby

Date: 2022-07-07 20:19:16

In RoR, how can I validate a Chinese or Japanese word submitted through a posting form in UTF-8?

With GBK encoding, the pattern [\u4e00-\u9fa5]+ is used to validate Chinese words. In PHP, /^[\x{4e00}-\x{9fa5}]+$/u is used for UTF-8 pages.

4 solutions

#1


10  

Ruby 1.8 has poor support for UTF-8 strings. You need to write the bytes individually in the regular expression, rather than the full code point:

>> "acentuação".scan(/\xC3\xA7/)
=> ["ç"]    

To match the range you specified, the expression becomes a bit more complicated:

/([\x4E-\x9E][\x00-\xFF])|(\x9F[\x00-\xA5])/  # (untested)

That will be improved in Ruby 1.9, though.

Edit: As noted in the comments, the Unicode code points U+4E00 to U+9FA5 map to the expression above only in the UTF-16BE encoding. Their UTF-8 encoding is different, so you need to analyze the mapping carefully and see whether you can come up with a byte-matching expression for Ruby 1.8.
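
One way to work out that mapping is to let Ruby encode the two endpoints and print their bytes. The sketch below is not from the original answer: the names utf8_bytes_hex and cjk_bytes are made up for illustration, and the byte-level pattern is written for a modern Ruby (2.0 or later, because of String#b), so treat it as a starting point rather than a tested Ruby 1.8 solution.

```ruby
# Hypothetical helper: the UTF-8 bytes of one code point, as hex strings.
def utf8_bytes_hex(codepoint)
  [codepoint].pack("U").unpack("C*").map { |byte| format("%02X", byte) }
end

puts utf8_bytes_hex(0x4E00).join(" ")  # E4 B8 80 (first code point of the range)
puts utf8_bytes_hex(0x9FA5).join(" ")  # E9 BE A5 (last code point of the range)

# Byte-level pattern for U+4E00..U+9FA5, split by lead byte.
# The n flag makes the regexp binary, so it is matched against raw bytes;
# the x flag allows the layout and comments below.
cjk_bytes = /\A(?:
    \xE4[\xB8-\xBF][\x80-\xBF]        |  # U+4E00..U+4FFF
    [\xE5-\xE8][\x80-\xBF][\x80-\xBF] |  # U+5000..U+8FFF
    \xE9[\x80-\xBD][\x80-\xBF]        |  # U+9000..U+9F7F
    \xE9\xBE[\x80-\xA5]                  # U+9F80..U+9FA5
  )+\z/nx

# String#b returns a binary copy, so the encodings of string and regexp agree.
puts cjk_bytes.match("\u4E2D\u6587".b) ? "match" : "no match"
```

On Ruby 1.8, where regular expressions already work on bytes by default, the n flag and the call to String#b would presumably not be needed; that variant is untested here.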

#2


3  

This is what I have done:

%r{^[#{"\344\270\200"}-#{"\351\277\277"}]+$}

This is basically a regular expression with the octal values that represent the range between U+4E00 and U+9FFF, the most common Chinese and Japanese characters.
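
To check which code points those octal escapes stand for, the bytes can be decoded back with unpack. This is a small verification sketch, not part of the original answer; the variable names are arbitrary.

```ruby
low  = "\344\270\200"   # octal escapes for the bytes E4 B8 80
high = "\351\277\277"   # octal escapes for the bytes E9 BF BF

# unpack("U*") decodes UTF-8 bytes into code point numbers.
low_cp  = low.unpack("U*").first
high_cp = high.unpack("U*").first

puts format("U+%04X", low_cp)   # U+4E00
puts format("U+%04X", high_cp)  # U+9FFF
```

Note that the upper bound here, U+9FFF, is slightly wider than the U+9FA5 used in the question, and that Hiragana and Katakana (U+3040 to U+30FF) lie outside this range, so a Japanese word written only in kana would not match.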

#3


2  

The Oniguruma regexp engine has proper support for Unicode. Ruby 1.9 uses Oniguruma by default. Ruby 1.8 can be recompiled to use it.

With Oniguruma you can use the exact same regex as in PHP, including the /u modifier to force Ruby to treat the string as UTF-8.
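
As a rough sketch of what this can look like on Ruby 1.9 or later: the pattern below uses Ruby's own \uXXXX escape rather than PHP's \x{...} form, and \A and \z instead of ^ and $, because in Ruby ^ and $ match at every line break and would therefore accept multi-line input. The method names cjk_word? and han_word? are made up for illustration.

```ruby
# Hypothetical helper: true if str consists only of U+4E00..U+9FA5.
def cjk_word?(str)
  !!(str =~ /\A[\u4e00-\u9fa5]+\z/)
end

# Alternative using the Unicode script property for Han characters.
def han_word?(str)
  !!(str =~ /\A\p{Han}+\z/)
end

puts cjk_word?("\u4E2D\u6587")        # two Han characters
puts cjk_word?("abc")                 # ASCII only
puts cjk_word?("\u4E2D\u6587\nabc")   # Han on the first line, ASCII on the second
```

Neither pattern covers Hiragana or Katakana; for Japanese input, \p{Hiragana} and \p{Katakana} could be added to the character class.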

#4


1  

ActiveSupport has a UTF-8 handler:

http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Handlers/UTF8Handler.html

Otherwise, look at Ruby 1.9 and the encoding method for Regexp objects.
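
For reference, Regexp#encoding and Regexp#fixed_encoding? report how Ruby 1.9 and later treat a given pattern. A small sketch (the variable names are arbitrary):

```ruby
plain   = /abc/               # ASCII-only source, no encoding flag
forced  = /abc/u              # the u flag fixes the encoding to UTF-8
escaped = /[\u4e00-\u9fa5]/   # a \u escape also fixes it to UTF-8

puts plain.fixed_encoding?    # false
puts forced.encoding          # UTF-8
puts escaped.encoding         # UTF-8
```

A pattern with a fixed encoding can only be matched against strings in a compatible encoding, which is why it helps to know what a given regexp reports here.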
