I am new to Ruby and am trying to work with regular expressions.
I have some text that looks something like this:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to select all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings that do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How can I make it match UTF-8 characters as well?
2 answers
#1
4
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This comes from the history of character encodings. More and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
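To see the ASCII limitation concretely, here is a minimal sketch. The names ASCII_UPPER and ascii_upper? are made up for illustration, and the sample words are only examples.

```ruby
# A-Z in a character class is a code point range: U+0041 ('A') to U+005A ('Z').
ASCII_UPPER = /\A[A-Z]+\z/

# Hypothetical helper: true when every character falls inside A-Z.
def ascii_upper?(text)
  !!(ASCII_UPPER =~ text)
end

puts "Č".ord.to_s(16)         # => 10c, well outside 41..5a
puts ascii_upper?("HEADING")  # => true
puts ascii_upper?("ČEŠNJA")   # => false, Č and Š are not in the range
```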
You could make a larger character class that matches the Slovenian characters you need by listing them explicitly.
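As a rough sketch of that approach, the class below simply appends Č, Š and Ž to the range from the question. The constant and method names are hypothetical, and the last example shows the obvious drawback: any uppercase letter you did not list is still rejected.

```ruby
# The question's pattern with Č, Š and Ž listed explicitly in each class.
SLOVENIAN_HEADING = /^[A-ZČŠŽ]{2,}\s?([A-ZČŠŽ]{2,}\s?)*$/

# Hypothetical helper for checking a single line.
def slovenian_heading?(line)
  !!(SLOVENIAN_HEADING =~ line)
end

puts slovenian_heading?("ČEŠNJE IN ŽELODI")  # => true
puts slovenian_heading?("Some text")         # => false
puts slovenian_heading?("ÄRGER")             # => false, Ä was never listed
```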
But there is a shortcut. The Unicode character data already records which characters are uppercase, so you can write a shorter match for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment gives:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
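A small sketch of how this pattern might be applied line by line. The sample text and the names UPPER_HEADING, headings and SAMPLE are assumptions for the example, not part of the question.

```ruby
# The adjusted pattern: [[:upper:]] is Unicode-aware for UTF-8 strings in Ruby.
UPPER_HEADING = /^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$/

# Hypothetical helper: returns the lines of text that look like headings.
def headings(text)
  text.each_line.map(&:chomp).select { |line| UPPER_HEADING =~ line }
end

SAMPLE = "ŠOLA\nSome text which is not capitalized.\nČEŠNJE IN ŽELODI\n"

p headings(SAMPLE)  # the two all-uppercase lines, without the prose line
```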
You may need to adjust it further. For instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
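Here is a quick sketch of the simplified pattern, checked one line at a time. LOOSE_HEADING and loose_heading? are names invented for this example.

```ruby
# Any mix of uppercase letters and whitespace, from line start to line end.
LOOSE_HEADING = /^[[:upper:]\s]+$/

# Hypothetical helper for checking a single line.
def loose_heading?(line)
  !!(LOOSE_HEADING =~ line)
end

puts loose_heading?("I AM A HEADING")  # => true, one-letter words are fine now
puts loose_heading?("ŽIVJO SVET")      # => true
puts loose_heading?("Some text")       # => false
puts loose_heading?("   ")             # => true, whitespace alone also matches
```

Two trade-offs are worth knowing about. A line containing only whitespace counts as a heading, and because \s also matches newlines, running the pattern over a whole multi-line string can merge adjacent heading lines into a single match. Checking line by line, as above, avoids the second issue.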
#2
2
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex becomes:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b
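A minimal sketch of \p{Lu} in use. The pattern here repeats the group with *, which is assumed to be the intent so that a heading of any number of words matches as a whole. LU_HEADING and uppercase_runs are hypothetical names, and the sample strings are only examples.

```ruby
# \p{Lu} matches any character in the Unicode "Letter, uppercase" category.
# The * after the group is an assumption, so longer headings match in one piece.
LU_HEADING = /\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b/

# Hypothetical helper: returns every run of uppercase words found in text.
def uppercase_runs(text)
  text.scan(LU_HEADING)
end

p uppercase_runs("POČITNICE")                            # one match
p uppercase_runs("TEŽKA NALOGA ZA DANES")                # one match, the whole line
p uppercase_runs("Some text which is not capitalized.")  # => []
```

Note that this pattern is not anchored to the start or end of a line, so it would also pick up an uppercase run inside ordinary prose.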