I'm developing an algorithm to parse a number out of a series of short-ish strings. These strings are somewhat regular, but there are a few different general forms and several exceptions. I'm trying to build a set of regexes that will handle the various forms and exceptions; I'll apply them one after another to see if I get a match.
One of these forms goes something like this:
X (Y) Z
Where:
- X is a number I want to capture.
- Z is static, pre-defined text. It's basically how I determine whether this particular form is applicable or not.
- Y is a string of unknown length and content, surrounded by parentheses.
Also: Y is optional; it doesn't always appear in a string with Z and X. So, I want to be able to extract the numbers from all of these strings:
- 10 Z
- 20 (foo) Z
- 30 (bar) Z
Right now, I have a regex that will capture the first one:
([0-9]+) +Z
My problem is that I don't know how to construct a regex that will match a series of characters if and only if they're enclosed in parentheses. Can this be done in a single regex?
5 answers
#1
48
(\d+)\s+(\(.*?\))?\s?Z
Note the escaped parentheses, and the ? (zero or one) quantifiers. Any group you don't want to capture can be made a non-capturing group with (?:...).
I agree about the spaces. \s is a better option there. I also changed the quantifier to ensure there are digits at the beginning. As for newlines, that would depend on context: if the file is parsed line by line it won't be a problem. Another option is to anchor the start and end of the line (add a ^ at the front and a $ at the end).
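As a rough illustration of the pattern above with the suggested ^ and $ anchors added, here is a minimal JavaScript sketch. The names `formWithOptionalY` and `extractNumber` are made up for this example, and the literal `Z` stands in for the real static text, as in the question.

```javascript
// "Z" is a placeholder for the real static text, as in the question.
// Group 1 is the number; group 2 is the optional "(Y)" part.
const formWithOptionalY = /^(\d+)\s+(\(.*?\))?\s?Z$/;

// Returns the leading number as a string, or null when this form does not apply.
function extractNumber(line) {
  const m = formWithOptionalY.exec(line);
  return m === null ? null : m[1];
}

for (const line of ["10 Z", "20 (foo) Z", "30 (bar) Z", "no number Z"]) {
  console.log(JSON.stringify(line), "->", extractNumber(line));
}
```

When Y is absent, group 2 is simply `undefined` in the match result, so callers that only need the number can ignore it.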
#2
15
This ought to work:
^\d+\s?(\([^\)]+\)\s?)?Z$
I haven't tested it, but here is the breakdown, so if there are any bugs left they should be pretty straightforward to find:
First the beginning:
^ = beginning of string
\d+ = one or more decimal digits
\s? = one optional whitespace
Then this part:
(\([^\)]+\)\s?)?
Is actually:
(.............)?
Which makes the enclosed contents optional: they are matched only if they are present in full
\([^\)]+\)\s?
\( = an opening bracket
[^\)]+ = a series of at least one character that is not a closing bracket
\) = followed by a closing bracket
\s? = followed by one optional whitespace
And the end is made up of
Z$
Where
Z = your constant string
$ = the end of the string
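The pattern as written has no capture group around the digits, so it can tell whether the form applies but not return the number. The sketch below is one possible adaptation, not the answerer's exact pattern: it wraps `\d+` in a capture group and makes the optional part non-capturing. `strictForm` and `parseStrict` are hypothetical names, and `Z` again stands in for the real static text.

```javascript
// Adapted pattern: digits captured, optional "(Y)" part non-capturing.
const strictForm = /^(\d+)\s?(?:\([^)]+\)\s?)?Z$/;

// Returns the number, or null when the string does not fit this form.
function parseStrict(line) {
  const m = strictForm.exec(line);
  return m ? Number(m[1]) : null;
}

console.log(parseStrict("10 Z"));        // 10
console.log(parseStrict("20 (foo) Z"));  // 20
console.log(parseStrict("20(foo)Z"));    // 20, since each \s? is optional
console.log(parseStrict("20 (foo Z"));   // null, the parenthesis is never closed
console.log(parseStrict("20 () Z"));     // null, [^)]+ needs at least one character
```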
#3
7
You can do this:
你可以这样做:
([0-9]+) (\([^)]+\))? Z
This will not work with nested parentheses in Y, however. Nesting requires recursion, which makes the language context-free rather than strictly regular. Some modern regex engines can still handle it through extensions such as recursive patterns, albeit with some difficulty.
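Two things are worth checking with this pattern. Because there is a literal space on both sides of the optional group, a string without Y needs two spaces before Z to match; and nested parentheses in Y are rejected, as noted. The sketch below shows both, along with two possible variants that are assumptions of this example rather than part of the answer: one moves the trailing space inside the optional group, and one additionally tolerates a single level of nesting. All three constant names are made up.

```javascript
// The pattern exactly as given in the answer.
const asWritten = /([0-9]+) (\([^)]+\))? Z/;
// Variant: the space after "(Y)" belongs to the optional group.
const spaceInGroup = /([0-9]+) (?:\([^)]+\) )?Z/;
// Variant: also allows one level of nested parentheses inside Y.
const oneLevelNested = /([0-9]+) (?:\((?:[^()]|\([^()]*\))*\) )?Z/;

console.log(asWritten.test("20 (foo) Z"));               // true
console.log(asWritten.test("10 Z"));                     // false, needs "10  Z"
console.log(spaceInGroup.test("10 Z"));                  // true
console.log(spaceInGroup.test("40 (a (b) c) Z"));        // false, nested parentheses
console.log(oneLevelNested.exec("40 (a (b) c) Z")[1]);   // "40"
```

Deeper nesting would need one more alternation per level, or an engine with real recursion support, which JavaScript's built-in `RegExp` does not have.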
#4
2
Try this:
X (\(Y\))? Z
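Here X, Y and Z are placeholders to be replaced with real sub-patterns. As one hedged way of filling in the template, the sketch below builds the regex from the static text at run time and escapes any regex metacharacters it contains. The helpers `escapeRegExp` and `buildForm` and the sample text "units (net)" are invented for this illustration, and the whitespace handling is looser than in the template.

```javascript
// Hypothetical helper: escape regex metacharacters in literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: X = digits (captured), Y = optional "(...)", Z = staticText.
function buildForm(staticText) {
  return new RegExp(
    "^(\\d+)\\s*(?:\\([^)]*\\)\\s*)?" + escapeRegExp(staticText) + "$"
  );
}

const form = buildForm("units (net)");
console.log(form.exec("20 (foo) units (net)")[1]); // "20"
console.log(form.exec("10 units (net)")[1]);       // "10"
console.log(form.test("10 units"));                // false
```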
#5
0
Here is an email-matching example that uses nested optional groups, ()?, to match as much of the text as is present:
'email@com.ua'.match(/[a-z-_.]+(@([a-z]+(\.([a-z]+)?)?)?)?/g) // => ['email@com.ua']
For example, when a later part is missing, the optional group for it simply matches nothing and the pattern still matches the part that is present, so progressively shorter inputs give:
// ['email@com.ua']
// ['email@com.']
// ['email@com']
// ['email@']
// ['email']
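To make the list above reproducible, this short sketch runs the same pattern over each progressively shorter input. The variable names are made up for the example; the pattern is the one from the answer.

```javascript
// Same pattern as above: every part after the local name is an optional group.
const partialEmail = /[a-z-_.]+(@([a-z]+(\.([a-z]+)?)?)?)?/g;

const inputs = ["email@com.ua", "email@com.", "email@com", "email@", "email"];

// String.prototype.match with a /g regex returns an array of all full matches.
const results = inputs.map((s) => s.match(partialEmail));

results.forEach((r, i) => console.log(JSON.stringify(inputs[i]), "->", r));
```

Each input is matched in full because every missing trailing part falls inside an optional group.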