Matching an optional substring in a regex

Date: 2021-03-18 11:08:52

I'm developing an algorithm to parse a number out of a series of short-ish strings. These strings are somewhat regular, but there are a few different general forms and several exceptions. I'm trying to build a set of regexes that will handle the various forms and exceptions; I'll apply them one after another to see if I get a match.

One of these forms goes something like this:

X (Y) Z

Where:

  • X is a number I want to capture.
  • Z is static, predefined text. It's basically how I determine whether this particular form applies.
  • Y is a string of unknown length and content, surrounded by parentheses.

Also, Y is optional; it doesn't always appear in a string with X and Z. So I want to be able to extract the numbers from all of these strings:

  • 10 Z
  • 20 (foo) Z
  • 30 (bar) Z

Right now, I have a regex that will capture the first one:

([0-9]+) +Z

My problem is that I don't know how to construct a regex that will match a series of characters if and only if they're enclosed in parentheses. Can this be done in a single regex?

5 answers

#1


48  

(\d+)\s+(\(.*?\))?\s?Z

Note the escaped parentheses and the ? (zero or one) quantifiers. Any group you don't want to capture can be made a non-capturing group with (?:...).

I agree about the spaces; \s is a better option there. I also changed the quantifier to ensure there are digits at the beginning. As for newlines, that depends on context: if the file is parsed line by line, it won't be a problem. Another option is to anchor the start and end of the line (add a ^ at the front and a $ at the end).
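
A quick way to check this pattern against the sample strings from the question is sketched below in JavaScript. The language, the helper name `extractNumber`, and the switch to a non-capturing group for the optional (Y) part are assumptions made for illustration; they are not part of the original answer.

```javascript
// Pattern from the answer, with the optional (Y) part made non-capturing
// because only the number is wanted.
const pattern = /(\d+)\s+(?:\(.*?\))?\s?Z/;

// Hypothetical helper: returns the captured number, or null if the form does not apply.
function extractNumber(text) {
  const match = pattern.exec(text);
  return match ? Number(match[1]) : null;
}

for (const sample of ["10 Z", "20 (foo) Z", "30 (bar) Z", "no match here"]) {
  console.log(JSON.stringify(sample), "->", extractNumber(sample));
}
```

The first three samples should yield 10, 20 and 30, and the last one null. Because the pattern is unanchored, it also matches when the form appears in the middle of a longer string.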

#2


15  

This ought to work:

^\d+\s?(\([^\)]+\)\s?)?Z$

I haven't tested it, but here is the breakdown, so any remaining bugs should be fairly easy to find:

First the beginning:

^ = beginning of string
\d+ = one or more decimal digits
\s? = one optional whitespace

Then this part:

(\([^\)]+\)\s?)?

Is actually:

(.............)?

This makes the following content optional as a unit; it matches only if it is present in full:

\([^\)]+\)\s?

\( = an opening bracket
[^\)]+ = a series of at least one character that is not a closing bracket
\) = followed by a closing bracket
\s? = followed by one optional whitespace

And the end is made up of

Z$

Where

Z = your constant string
$ = the end of the string
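
The breakdown above can be tried out with a small JavaScript sketch. Note that the pattern as written does not capture the number, so this version makes two small changes that are assumptions of the sketch: the digits are wrapped in a capturing group, and the optional part is made non-capturing. The name `extractAnchored` is hypothetical.

```javascript
// The answer's pattern with the digits captured and the optional (Y) part
// turned into a non-capturing group.
const anchored = /^(\d+)\s?(?:\([^)]+\)\s?)?Z$/;

// Hypothetical helper: returns the number, or null when the whole string
// does not have the X (Y) Z form.
function extractAnchored(text) {
  const match = anchored.exec(text);
  return match ? Number(match[1]) : null;
}

console.log(extractAnchored("20 (foo) Z"));   // 20
console.log(extractAnchored("20(foo)Z"));     // 20, because every \s is optional
console.log(extractAnchored("20 () Z"));      // null: [^)]+ needs at least one character
console.log(extractAnchored("x 20 (foo) Z")); // null: ^ rejects the leading text
```

The anchors make this stricter than an unanchored pattern: any text before the number or after Z causes the match to fail.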

#3


7  

You can do this:

([0-9]+) (\([^)]+\) )?Z

This will not work with nested parentheses in Y, however. Nesting requires recursion, which is no longer strictly regular (it is context-free). Some modern regexp engines can still handle it through extensions such as recursive patterns, albeit with some difficulty.
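
To illustrate the nesting limitation, here is a minimal JavaScript sketch. It assumes a variant of the pattern in which the trailing space sits inside the optional group (so that the form without Y also matches), plus anchors. The functions `skipBalanced` and `extractNested` are hypothetical helpers showing one way to handle nested parentheses by counting depth by hand, rather than relying on an engine-specific recursion feature.

```javascript
// Flat pattern: Y may not contain ")" at all.
const flat = /^([0-9]+) (?:\([^)]+\) )?Z$/;

console.log(flat.test("20 (foo) Z"));       // true
console.log(flat.test("20 (foo (bar)) Z")); // false: nested parentheses defeat [^)]+

// Hypothetical helper: if text[start] opens a parenthesised group, return the
// index just past its matching ")"; return start if there is no group, and -1
// if the parentheses are unbalanced.
function skipBalanced(text, start) {
  if (text[start] !== "(") return start;
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "(") depth++;
    else if (text[i] === ")") depth--;
    if (depth === 0) return i + 1;
  }
  return -1;
}

// Hypothetical helper: parse "X Z" or "X (Y) Z", where Y may be nested.
function extractNested(text) {
  const head = /^([0-9]+) /.exec(text);
  if (!head) return null;
  let pos = skipBalanced(text, head[0].length);
  if (pos === -1) return null;
  if (pos > head[0].length) {
    // A (Y) group was skipped; expect exactly one space after it.
    if (text[pos] !== " ") return null;
    pos++;
  }
  return text.slice(pos) === "Z" ? Number(head[1]) : null;
}

console.log(extractNested("20 (foo (bar)) Z")); // 20
console.log(extractNested("10 Z"));             // 10
```

The regex still does the easy part (finding the number), while the depth counter handles the part a plain regular expression cannot.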

#4


2  

Try this:

X (\(Y\) )?Z

#5


0  

Here is an email-matching example that uses nested ()? groups to capture as much of the relevant text as is present:

'email@com.ua'.match(/[a-z-_.]+(@([a-z]+(\.([a-z]+)?)?)?)?/g) // => ['email@com.ua']

If a later part is missing, the match simply stops at the last part that is present. For progressively shorter inputs, the results are:

// ['email@com.ua']
// ['email@com.']
// ['email@com']
// ['email@']
// ['email']
