How, if at all, can I use a regex to match a string that contains a variable number of matches?
The strings I want to parse look like:
'Every 15th of the month'
'Every 21st and 28th of the month'
'Every 21st, 22nd and 28th of the month'
ad infinitum...
I want to be able to capture the ordinal numbers (15th, 21st, etc.).
The language I'm using is Ruby, for what it's worth.
Thanks, Alex
1 solution
#1
You can capture them into an array with scan, which returns every occurrence of your regex in the string:
irb(main):001:0> s = 'every 15th of the month'
=> "every 15th of the month"
irb(main):003:0> s2 = 'every 21st and 28th of the month'
=> "every 21st and 28th of the month"
irb(main):004:0> s3 = 'every 21st, 22nd, and 28th of the month'
=> "every 21st, 22nd, and 28th of the month"
irb(main):006:0> myarray = s3.scan(/(\d{1,2}(?:st|nd|rd|th))/)
=> [["21st"], ["22nd"], ["28th"]]
irb(main):007:0> myarray = s2.scan(/(\d{1,2}(?:st|nd|rd|th))/)
=> [["21st"], ["28th"]]
irb(main):008:0> myarray = s.scan(/(\d{1,2}(?:st|nd|rd|th))/)
=> [["15th"]]
irb(main):009:0>
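Then of course you can access each match using the typical myarray[index] notation (or loop through all of them, etc.). Because the regex contains a capturing group, each element is itself a one-element array, so the first ordinal is myarray[0][0]; call flatten on the result if you want a flat array of strings.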
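As a small illustration of the point about nested results, here is a sketch that drops the capturing group so that scan returns a flat array directly. The constant name ORDINAL and the variables flat, days, and nested are made up for this example; the pattern itself is the one from the irb session.

```ruby
# Same pattern as in the irb session, minus the outer capturing group.
# With no capturing groups, String#scan returns a flat array of strings.
ORDINAL = /\d{1,2}(?:st|nd|rd|th)/

s3 = 'every 21st, 22nd, and 28th of the month'

flat = s3.scan(ORDINAL)
# flat is ["21st", "22nd", "28th"]

# String#to_i reads leading digits and ignores the rest, so "21st" becomes 21.
days = flat.map(&:to_i)
# days is [21, 22, 28]

# Equivalent route that keeps the capturing group and flattens afterwards.
nested = s3.scan(/(\d{1,2}(?:st|nd|rd|th))/).flatten
```

Either form works; the non-capturing version just saves the extra flatten or [0] step when you only care about the whole match.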
Edit: Based on your comments, this is how I would do this:
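require 'active_support/inflector' # supplies ActiveSupport::Inflector.ordinalize
# "1st" through "31st", the only ordinals that can be a day of the month
ORDINALS = (1..31).map { |n| ActiveSupport::Inflector.ordinalize(n) }
DAY_OF_MONTH_REGEX = /(#{ORDINALS.join('|')})/i
# string is the phrase to parse; each match comes back as a one-element array
myarray = string.scan(DAY_OF_MONTH_REGEX)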
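If pulling in ActiveSupport is not an option, the same list can be built in plain Ruby. The following is only a sketch: plain_ordinalize, PLAIN_ORDINALS, and PLAIN_DAY_REGEX are hypothetical names, and the \b word boundaries are an addition that is not in the snippet above, included so that something like "121st" or "32nd" does not yield a partial match.

```ruby
# Hypothetical stand-in for ActiveSupport's ordinalize, for non-negative integers.
def plain_ordinalize(n)
  # 11, 12 and 13 take "th" even though they end in 1, 2 and 3.
  return "#{n}th" if (11..13).cover?(n % 100)

  suffix = { 1 => 'st', 2 => 'nd', 3 => 'rd' }.fetch(n % 10, 'th')
  "#{n}#{suffix}"
end

# "1st" through "31st"
PLAIN_ORDINALS = (1..31).map { |n| plain_ordinalize(n) }

# Non-capturing alternation wrapped in word boundaries (an assumption added here),
# so scan returns a flat array and ignores ordinals embedded in longer numbers.
PLAIN_DAY_REGEX = /\b(?:#{PLAIN_ORDINALS.join('|')})\b/i

matches = 'every 21st, 22nd, and 28th of the month'.scan(PLAIN_DAY_REGEX)
# matches is ["21st", "22nd", "28th"]
```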
This really only gets tripped up by ordinal numbers that appear in other phrases. Trying to be more restrictive than that will probably get pretty ugly, since you have to cover a bunch of different cases. I might be able to come up with something, but it probably wouldn't be worth it. If you want to parse the string with really fine-grained control and a variable amount of text to match, then this probably just isn't a job for regex, to be honest.
It's hard to be certain without knowing what format the lines are in, whether this is coming from a file with other similar lines, whether you have any control over the format and contents of the strings, and so on.
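For what it's worth, if every line really does follow the "Every ... of the month" shape from the question, one way to be more restrictive is to validate the whole phrase first and only then scan the list part. This is a rough sketch under that assumption, not a general solution: DAY, LIST, PHRASE, and days_of_month are hypothetical names, and any other phrasing simply returns nil.

```ruby
# Day numbers 1..31 followed by an ordinal suffix. The suffix is not checked
# against the number, so a typo like "21th" is still accepted.
DAY = /(?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th)/i

# One day, then optional ", day" items, then an optional "and day" tail
# (with or without a comma before "and").
LIST = /#{DAY}(?:,\s*#{DAY})*(?:,?\s+and\s+#{DAY})?/i

# The whole phrase, anchored at both ends; group 1 is the list of days.
PHRASE = /\Aevery\s+(#{LIST})\s+of\s+the\s+month\z/i

# Returns the days as integers, or nil when the phrase has a different shape.
def days_of_month(phrase)
  match = PHRASE.match(phrase.strip)
  return nil unless match

  match[1].scan(DAY).map(&:to_i)
end

days_of_month('Every 21st, 22nd and 28th of the month') # [21, 22, 28]
days_of_month('The 3rd time this month')                # nil
```

The trade-off is the one described above: each new variation in wording means another branch in the pattern, which is where this starts to get ugly.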