How do I get the (possibly nested) capture groups from a regular expression?

Date: 2022-11-29 20:15:30

Given a regular expression:

/say (hullo|goodbye) to my lovely (.*)/

and a string:

"my $2 is happy that you said $1"

What is the best way to build a new regular expression from the string, with each $N placeholder replaced by the corresponding capture group of the original regular expression? That is:

/my (.*) is happy that you said (hullo|goodbye)/

Clearly I could use regular expressions on a string representation of the original regular expression, but this would probably present difficulties with nested capture groups.

I'm using Ruby. My simple implementation so far goes along the lines of:

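class Regexp
  # Naive: a non-greedy scan over the pattern source.
  # It does not handle nested groups.
  def capture_groups
    source.scan(/\(.*?\)/)
  end
end

regexp = /say (hullo|goodbye) to my lovely (.*)/
string = "my $2 is happy that you said $1".dup

# Substitute each $N placeholder with the text of capture group N.
regexp.capture_groups.each_with_index do |capture, idx|
  string.gsub!("$#{idx+1}", capture)
end
/^#{string}$/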
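
To make the nesting problem concrete, here is a small check against a hypothetical nested pattern (the pattern is made up for illustration and is not from my application). It runs the same non-greedy scan directly on Regexp#source:

```ruby
# Hypothetical nested pattern, used only to illustrate the problem.
nested = /say (hullo|good(bye|night)) to (.*)/

# The non-greedy scan stops at the first ')' it sees, so the outer
# group is cut short and the inner group is never reported on its own.
naive_groups = nested.source.scan(/\(.*?\)/)

p naive_groups
# prints ["(hullo|good(bye|night)", "(.*)"]
```

What I actually want here is three groups: the outer one with its closing parenthesis, the inner (bye|night), and (.*).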

2 answers

#1


2  

I guess you need to write your own function to do this:

  • create empty dictionaries groups and active_groups, and initialize counter = 1
  • iterate over the characters in the string representation:
    • if the current character is '(' and the previous character is not '\':
      • add a counter key to active_groups and increment counter
    • add the current character to all active_groups
    • if the current character is ')' and the previous character is not '\':
      • remove the last item (key, value) from active_groups and add it to groups
  • convert groups to an array if needed
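
The steps above could be sketched in Ruby roughly as follows. This is only a minimal sketch: the method name scan_capture_groups is made up, and instead of checking just the previous character it tracks an escaped flag, so that an escaped backslash followed by a real '(' is still counted. Character classes and non-capturing groups are not handled here.

```ruby
# Collect the source text of every parenthesised group, ordered by the
# position of its opening parenthesis (hypothetical helper name).
def scan_capture_groups(source)
  groups = {}       # finished groups: number => text
  active = {}       # groups still open: number => text collected so far
  counter = 1
  escaped = false   # true when the previous character was an unescaped '\'

  source.each_char do |ch|
    if ch == '(' && !escaped
      active[counter] = String.new
      counter += 1
    end

    # Every open group receives the current character.
    active.each_value { |text| text << ch }

    if ch == ')' && !escaped && !active.empty?
      # Hashes keep insertion order, so the last key is the innermost group.
      number = active.keys.last
      groups[number] = active.delete(number)
    end

    escaped = (ch == '\\' && !escaped)
  end

  groups.sort.map { |_number, text| text }
end

p scan_capture_groups(/say (hullo|good(bye|night)) to (.*)/.source)
# prints ["(hullo|good(bye|night))", "(bye|night)", "(.*)"]
```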

You might also want to implement:

  • ignore = True between unescaped '[' and ']'
  • reset counter if current character = '|' and active_groups is empty (or decrease counter if active_groups is not empty); this only makes sense for branch-reset style numbering, since Ruby itself numbers capture groups left to right across alternations

UPDATES from comments:

  • ignore non-capturing groups starting with '(?:'
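
A possible way to fold the character-class and non-capturing-group rules into one scan is sketched below. The name capture_group_sources is hypothetical, and the sketch makes two simplifying assumptions: every group that opens with "(?" is treated as non-capturing (which also skips lookarounds and, as a limitation, named groups), and the '|' counter handling is left out because Ruby numbers groups left to right.

```ruby
# Return the source text of each capturing group, in the order of
# their opening parentheses (hypothetical helper name).
def capture_group_sources(source)
  groups = []       # [start_index, text] for each finished capture group
  stack = []        # start index of every open group; nil if non-capturing
  class_depth = 0   # > 0 while inside an unescaped [...]
  escaped = false

  source.each_char.with_index do |ch, i|
    if escaped
      escaped = false
    elsif ch == '\\'
      escaped = true
    elsif ch == '['
      class_depth += 1
    elsif ch == ']' && class_depth > 0
      class_depth -= 1
    elsif class_depth > 0
      # Inside [...]: parentheses are literal characters here.
    elsif ch == '('
      stack.push(source[i + 1] == '?' ? nil : i)
    elsif ch == ')' && !stack.empty?
      start = stack.pop
      groups << [start, source[start..i]] if start
    end
  end

  groups.sort.map { |_start, text| text }
end

p capture_group_sources('(a(b)(?:c(d)))')
# prints ["(a(b)(?:c(d)))", "(b)", "(d)"]
```

Using a stack of start indexes rather than appending to every open group keeps the bookkeeping small, because each group's text can be sliced from the source once its closing parenthesis is found.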

#2


1  

So once I realised that what I actually need is a regular expression parser, things started falling into place. I discovered this project:

which can generate strings that match a regular expression. It defines a regular expression grammar using http://treetop.rubyforge.org/. Unfortunately the grammar it defines is incomplete, though useful for many cases.

I also stumbled past https://github.com/mjijackson/citrus, which does a similar job to Treetop.

I then found this mind-blowing gem:

which defines a full regexp grammar and parses a regular expression into a walkable tree. I was then able to walk the tree and pick out the parts of the tree I wanted (the capture groups).

Unfortunately there was a minor bug, fixed in my fork: https://github.com/LaunchThing/regexp_parser.

Here's my patch to Regexp, which uses the fixed gem:

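require 'regexp_parser'

class Regexp
  # Parse this regexp into an expression tree.
  def parse
    Regexp::Parser.parse(self.to_s, 'ruby/1.9')
  end

  # Depth-first walk over the tree, yielding each node and its depth.
  def walk(e = self.parse, depth = 0, &block)
    block.call(e, depth)
    # Leaf nodes may not expose #expressions (this depends on the gem
    # version), so check before descending.
    if e.respond_to?(:expressions) && !e.expressions.empty?
      e.each do |s|
        walk(s, depth + 1, &block)
      end
    end
  end

  # Source text of every capture group, in the order the walk visits them.
  def capture_groups
    capture_groups = []
    walk do |e, _depth|
      capture_groups << e.to_s if Regexp::Expression::Group::Capture === e
    end
    capture_groups
  end
end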

I can then use this in my application to make replacements in my string - the final goal - along these lines:

from = /^\/search\/(.*)$/
to = '/buy/$1'

to_as_regexp = to.dup

# I should probably make this gsub tighter
from.capture_groups.each_with_index do |capture, idx|
  to_as_regexp.gsub!("$#{idx+1}", capture)
end
to_as_regexp = /^#{to_as_regexp}$/

# to_as_regexp = /^\/buy\/(.*)$/
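
One way the gsub could be made tighter is to handle all placeholders in a single pass, so that "$1" cannot clobber part of "$10", and to escape the literal text in between. A minimal sketch, assuming placeholders are always a '$' followed by digits and that everything else in the target string should match literally; build_regexp is a hypothetical name, and the capture groups are passed in as plain strings so the example does not depend on the gem:

```ruby
# Build an anchored regexp from a "$N" template and a list of capture
# group sources (hypothetical helper, not part of any library).
def build_regexp(template, capture_groups)
  # Splitting on a capturing pattern keeps the "$N" tokens in the result,
  # so literal pieces and placeholders alternate.
  pieces = template.split(/(\$\d+)/).map do |piece|
    match = piece.match(/\A\$(\d+)\z/)
    if match
      number = match[1].to_i
      raise ArgumentError, "groups are numbered from 1" if number < 1
      # fetch raises IndexError when the template names a missing group.
      capture_groups.fetch(number - 1)
    else
      Regexp.escape(piece)
    end
  end
  Regexp.new("^#{pieces.join}$")
end

to_as_regexp = build_regexp('/buy/$1', ['(.*)'])
p to_as_regexp
# prints /^\/buy\/(.*)$/
```

In my application the second argument would be from.capture_groups.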

I hope this helps someone else out.
