I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:
- It must split the string on all dashes, spaces, underscores, and periods.
- When several of these characters appear in a row, it must split only once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick'])
- It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown'])
- It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o'])
- It must use only lowercase letters in the split array.
If it is working properly, the following should be true:
"theQuick--brown_fox JumpsOver___the.lazy DOG".split_words ==
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
So far I am almost there. The only issue is that it splits on every capital letter, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].
I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.
Here's what I have so far:
class String
  def split_words
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
      map(&:downcase).
      reject(&:empty?)
  end
end
When called on the string from the test above, this returns:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]
How can I update this method to meet all of the above specs?
3 Solutions
#1
4
You may use a matching approach to extract chunks of 2 or more uppercase letters, or a letter followed by 0 or more lowercase letters:
s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
See the Ruby demo and the Rubular demo.
The regex matches:
- \p{Lu}{2,} - 2 or more uppercase letters
- | - or
- \p{L} - any letter
- \p{Ll}* - 0 or more lowercase letters.
With map(&:downcase), the items you get with .scan() are turned to lower case.
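As a rough sketch of how this fits the question's spec, the scan call can be wrapped in a helper and tried on the test string from the question. The standalone `split_words(str, pattern)` method (instead of a monkey patch on String) and the `acronym_aware` pattern are illustrative assumptions, not part of the answer itself.

```ruby
# Default pattern is the one from the answer: a run of 2+ uppercase letters,
# or any single letter followed by zero or more lowercase letters.
def split_words(str, pattern = /\p{Lu}{2,}|\p{L}\p{Ll}*/)
  str.scan(pattern).map(&:downcase)
end

words = split_words("theQuick--brown_fox JumpsOver___the.lazy DOG")
# words == ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Hypothetical variant (an assumption, not from the answer): an uppercase run
# may not be followed by a lowercase letter, so it gives back the capital
# that starts the next capitalized word.
acronym_aware = /\p{Lu}+(?!\p{Ll})|\p{L}\p{Ll}*/

split_words("HTTPServer")                 # => ["https", "erver"]
split_words("HTTPServer", acronym_aware)  # => ["http", "server"]
```

Because scan only collects letters, anything else in the input (separators, digits, symbols) is dropped rather than kept as a word. The variant only matters if inputs can contain an acronym directly followed by a capitalized word, which the question's rules do not cover.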
#2
5
You can slightly change the regex so that it doesn't split on every capital, but only before a sequence of letters that starts with a capital. This just involves putting a [a-z]+ after the [A-Z]+:
string = "theQuick--brown_fox JumpsOver___the.lazy DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
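The result above keeps the original capitalization, while the question asks for lowercase words. Here is a minimal sketch of one way to finish it, assuming it is acceptable to add a downcase step and to tighten the separator class to `[-_ .]+` (the commas inside the original brackets are literal, so that class also splits on commas). The method name `split_on_separators_and_capitals` is hypothetical.

```ruby
# Separator runs are consumed in a single match; the lookahead still splits
# before a capital that begins an uppercase-then-lowercase sequence.
def split_on_separators_and_capitals(string)
  regex = /[-_ .]+|(?=[A-Z]+[a-z]+)/
  string.split(regex).reject(&:empty?).map(&:downcase)
end

words = split_on_separators_and_capitals("theQuick--brown_fox JumpsOver___the.lazy DOG")
# words == ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

The reject(&:empty?) call is kept as a safeguard, since a string that begins with a separator still produces a leading empty string.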
#3
2
r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode
str = "theQuick--brown_dog, JumpsOver___the.--lazy FOX for $5"
str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]
If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" at Regexp for the reference.
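A small sketch of that replacement, written as a one-line regex rather than in free-spacing mode. The method name `split_on_punct` and the added reject(&:empty?) step are assumptions for illustration. The `$5` token is left out of the sample, because whether `$` counts as punctuation for [[:punct:]] may depend on the Ruby version.

```ruby
# Split on runs of spaces and punctuation, or at a lowercase-to-uppercase boundary.
def split_on_punct(str)
  str.split(/[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/).reject(&:empty?).map(&:downcase)
end

split_on_punct("theQuick--brown_dog, JumpsOver___the.--lazy FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

Compared with the output above, the comma after "dog" is now treated as a separator instead of staying attached to the word.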