How to find the index of a group in a JavaScript regular expression?

Time: 2021-08-19 15:44:10

When I write a regular expression like:

var m = /(s+).*?(l)[^l]*?(o+)/.exec("this is hello to you");
console.log(m);

I get a match object containing the following:

{
  0: "s is hello",
  1: "s",
  2: "l",
  3: "o",
  index: 3,
  input: "this is hello to you"
}

I know the index of the entire match from the index property, but I also need to know the start and end of the groups matched. Using a simple search won't work. In this example it will find the first 'l' instead of the one found in the group.

Is there any way to get the offset of a matched group?
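
To make the problem concrete, here is a minimal sketch that reuses the regex and input from the question and compares a plain indexOf search with the position where group 2 actually matched. The variable names are only illustrative.

```javascript
// The regex and input from the question, unchanged.
const input = "this is hello to you";
const m = /(s+).*?(l)[^l]*?(o+)/.exec(input);

// Naive approach: search the input for the text captured by group 2.
const naiveIndex = input.indexOf(m[2]);

// Starting the search at the match index does not help either.
const naiveFromMatch = input.indexOf(m[2], m.index);

// Group 2 really matched the second "l" of "hello", because the regex
// requires the "l" to be followed (after non-"l" characters) by an "o".
const actualIndex = input.indexOf("llo") + 1;

console.log(naiveIndex);     // 10
console.log(naiveFromMatch); // 10
console.log(actualIndex);    // 11
```

Both searches land on the first "l" (index 10), while the group captured the one at index 11.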

5 solutions

#1


13  

The match object returned by exec() above does not include the index of each capture group. A workaround is to put every character of the pattern inside a capture group, even the parts you don't care about:

var m= /(s+)(.*?)(l)([^l]*?)(o+)/.exec('this is hello to you');

Now you've got the whole match in parts:

['s is hello', 's', ' is hel', 'l', '', 'o']

So you can add up the lengths of the strings before your group to get the offset from the match index to the group index:

function indexOfGroup(match, n) {
    var ix= match.index;
    for (var i= 1; i<n; i++)
        ix+= match[i].length;
    return ix;
}

console.log(indexOfGroup(m, 3)); // 11
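
As a side note that is not part of the original answer: engines that implement the ES2022 "d" (hasIndices) flag report group positions directly on the match object as match.indices. The sketch below assumes such an engine (for example a recent Node.js or browser) and uses the RegExp constructor so that the flag is only a string argument; on older engines the constructor throws a SyntaxError.

```javascript
// Same pattern as the question, compiled with the "d" (hasIndices) flag.
// Assumption: the runtime supports ES2022 match indices.
const re = new RegExp("(s+).*?(l)[^l]*?(o+)", "d");
const match = re.exec("this is hello to you");

// match.indices[n] is a [start, end] pair for group n (end is exclusive),
// or undefined if group n did not take part in the match.
const wholeSpan = match.indices[0];
const groupSpans = match.indices.slice(1);

console.log(wholeSpan);  // [3, 13]
console.log(groupSpans); // [[3, 4], [11, 12], [12, 13]]
```

Unlike the length-summing helper, this needs no extra groups in the pattern and also copes with nested groups; where the flag is unavailable, the approach above still applies.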

#2


8  

I wrote a simple (well, the initialization got a bit bloated) JavaScript object to solve this problem for a project I've been working on recently. It works the same way as the accepted answer, but it generates the new regexp and pulls out the data you requested automatically.

var exp = new MultiRegExp(/(firstBit\w+)this text is ignored(optionalBit)?/i);
var value = exp.exec("firstbitWithMorethis text is ignored");

value = {0: {index: 0, text: 'firstbitWithMore'},
         1: null};

Git Repo: My MultiRegExp. Hope this helps someone out there.

Edit (Aug 2015):

Try me: MultiRegExp Live.

#3


1  

Another JavaScript class, which can also parse nested groups, is available at: https://github.com/valorize/MultiRegExp2

Usage:

let regex = /a(?: )bc(def(ghi)xyz)/g;
let regex2 = new MultiRegExp2(regex);

let matches = regex2.execForAllGroups('ababa bcdefghixyzXXXX');

Will output:
[ { match: 'defghixyz', start: 8, end: 17 },
  { match: 'ghi', start: 11, end: 14 } ]

#4


0  

Based on the ECMAScript regular expression syntax, I've written a parser and, on top of it, an extension of the RegExp class. Besides this problem (it provides a fully indexed exec method), it also addresses other limitations of the JavaScript RegExp implementation, for example group-based search and replace. You can test and download the implementation here (it is also available as an NPM module).

The implementation works as follows (small example):

//Retrieve content and position of: opening-, closing tags and body content for: non-nested html-tags.
var pattern = '(<([^ >]+)[^>]*>)([^<]*)(<\\/\\2>)';
var str = '<html><code class="html plain">first</code><div class="content">second</div></html>';
var regex = new Regex(pattern, 'g');
var result = regex.exec(str);

console.log(5 === result.length);
console.log('<code class="html plain">first</code>'=== result[0]);
console.log('<code class="html plain">'=== result[1]);
console.log('first'=== result[3]);
console.log('</code>'=== result[4]);
console.log(5=== result.index.length);
console.log(6=== result.index[0]);
console.log(6=== result.index[1]);
console.log(31=== result.index[3]);
console.log(36=== result.index[4]);

I also tried the implementation from @velop, but it seems buggy. For example, it does not handle backreferences correctly, e.g. "/a(?: )bc(def(\1ghi)xyz)/g": when parentheses are added in front, the backreference \1 needs to be incremented accordingly, which his implementation does not do.
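
The backreference issue can be reproduced without any library. The sketch below uses a small hypothetical pattern (not the one from the answer) to show what happens when a previously ungrouped part is wrapped in parentheses and the backreference number is left as it was.

```javascript
// Hypothetical example: a dash followed by a doubled word character.
const original = /-(\w)\1/;     // \1 refers to (\w)

// Wrapping the dash in a group shifts every later group number by one.
const naive = /(-)(\w)\1/;      // \1 now refers to (-): meaning changed
const renumbered = /(-)(\w)\2/; // \2 refers to (\w) again

const sample = "a-bb";
console.log(original.exec(sample)[0]);   // "-bb"
console.log(naive.exec(sample));         // null
console.log(renumbered.exec(sample)[0]); // "-bb"
```

Any tool that rewrites a pattern by adding capture groups therefore has to renumber the backreferences that come after the inserted groups.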

#5


0  

For a global regex, you want to match only fragments and iterate, so the first solution won't work. This is a 30-minute solution based on indexOf and sums that works for this case:

https://codepen.io/cancerberoSgx/pen/qYwjjz?editors=0012#code-area

!function () {
  const regex = /\/\*\*\*@\s*([^@]+)\s*(@\*\*\*\/)/gim
  const exampleThatMatch = `
    /***@
    debug.print('hello editor, simpleNode kind is ' +
    arg.simpleNode.getKindName())
    @***/

    const a = 1 //user

    /***@
    debug.print(arg.simpleNode.getParent().getKindName())
    @***/
    `
  const text = exampleThatMatch 
  // Runs the global regex r over the string s and returns, for each match,
  // the value, start and end of every capture group.
  function exec(r, s) {
    let result
    const matches = []
    while ((result = r.exec(s))) {
      const match = []
      // Begin each search at the match itself; starting from 0 could pick up
      // an earlier occurrence of the same text.
      let relIndex = result.index
      for (let i = 1; i < result.length; i++) {
        relIndex = s.indexOf(result[i], relIndex)
        match.push({ value: result[i], start: relIndex, end: relIndex + result[i].length })
      }
      matches.push(match)
    }
    return matches
  }
  const groupsWithIndex = exec(regex, text)
  console.log({RESULT: groupsWithIndex })
  // now test - let's remove everything else but matched groups 
  let frag = '' , sep = '\n#######\n'
  groupsWithIndex.forEach(match => match.forEach(group => {
    frag += text.substring(group.start, group.end) + sep
  }))
  console.log('The following are only the matched groups using the result and text.substring just to verify it works OK:', '\n'+sep)
  console.log(frag)
}()

And, just in case, here is the TypeScript version:

https://codepen.io/cancerberoSgx/pen/yjrXxx?editors=0012
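
For comparison, the length-summing helper from answer #1 can also be applied per match inside a global loop, as long as every character of the pattern sits inside a capture group. This is only a sketch: the pattern and the input string are hypothetical and chosen to keep the offsets easy to check by hand.

```javascript
// Sums the lengths of groups 1..n-1 to locate group n (as in answer #1).
function indexOfGroup(match, n) {
  let ix = match.index;
  for (let i = 1; i < n; i++) ix += match[i].length;
  return ix;
}

// Hypothetical pattern: name, "=", digits, each part in its own group.
const pairRe = /(\w+)(=)(\d+)/g;
const pairText = "a=1, bb=22, ccc=333";

// Collect the span of group 3 (the digits) for every match.
const valueSpans = [];
let pair;
while ((pair = pairRe.exec(pairText)) !== null) {
  const start = indexOfGroup(pair, 3);
  valueSpans.push({ value: pair[3], start, end: start + pair[3].length });
}

console.log(valueSpans);
// [ { value: '1', start: 2, end: 3 },
//   { value: '22', start: 8, end: 10 },
//   { value: '333', start: 16, end: 19 } ]
```

Because the offsets come from match.index plus group lengths, this does not depend on indexOf, which can return the wrong position when a group's text also occurs earlier in the match.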

Enjoy
