How do I create a loop with a regular expression?

Time: 2021-11-20 13:20:43

Honestly, I think I should first ask for your help with the wording of this question.

But please, if you understand what I mean, edit the title to something suitable.

Is there a way to write a pattern that can split a text like this?

{{START}}
    {{START}}
        {{START}}
            {{START}}
            {{END}}
        {{END}}
    {{END}}
{{END}}

So every {{START}} matches its own {{END}}, from the innermost pair to the outermost!

And if I cannot do that with regex alone, what about doing it in PHP?

Thank you in advance.

3 Answers

#1


4  

This is beyond the capability of a regular expression in the formal sense, which can only describe regular languages. What you're describing requires a pushdown automaton, that is, a machine with a stack (regular languages are recognized by finite automata, which have no such memory).

You can use regular expressions to parse the individual elements, but the "depth" part needs to be handled by a language with a concept of memory (PHP is fine for this).

So in your solution, regexes will only be used to identify your tags, while the real logic of tracking depth and determining which element an END tag belongs to must live in your program itself.
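The split described above (a regex to find the tags, program logic to track depth) could be sketched roughly as follows. This is only an illustration in JavaScript, not the answerer's code: the function name matchBlocks, the shape of the returned objects and the error messages are all assumptions made for the example.

```javascript
// Hypothetical helper (not from the original answer): pairs each {{START}}
// with its {{END}} by keeping a stack of the offsets of still-open tags.
function matchBlocks(text) {
  const tagPattern = /\{\{(START|END)\}\}/g; // the regex only locates the tags
  const openOffsets = [];                      // stack of unmatched {{START}} offsets
  const blocks = [];

  for (const tag of text.matchAll(tagPattern)) {
    if (tag[1] === "START") {
      openOffsets.push(tag.index);
      continue;
    }
    // An {{END}} always closes the most recently opened {{START}}.
    if (openOffsets.length === 0) {
      throw new Error("{{END}} without {{START}} at offset " + tag.index);
    }
    const start = openOffsets.pop();
    const end = tag.index + tag[0].length;
    // After the pop, the stack size equals the number of enclosing blocks.
    blocks.push({ depth: openOffsets.length, text: text.slice(start, end) });
  }
  if (openOffsets.length > 0) {
    throw new Error("unclosed {{START}} at offset " + openOffsets.pop());
  }
  return blocks; // in closing order, so inner blocks come before outer ones
}

const sample = "{{START}}a{{START}}b{{END}}{{END}}";
for (const block of matchBlocks(sample)) {
  console.log(block.depth, block.text);
}
```

Because the pairing is done with a stack rather than by the pattern, the same sketch should also cope with sibling blocks at the same level, and it reports unbalanced tags instead of silently mismatching them.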

#2


2  

It is possible! You can get the content of each level using a recursive regular expression:

$data = <<<LOD
{{START1}}
    aaaaa
    {{START2}}
        bbbbb
        {{START3}}
            ccccc
            {{START4}}
                ddddd
            {{END4}}
        {{END3}}
    {{END2}}
{{END1}}
LOD;

$pattern = '~(?=({{START\d+}}(?>[^{]++|(?1))*{{END\d+}}))~';
preg_match_all ($pattern, $data, $matches);

print_r($matches);

Explanation:

part: ({{START\d+}}(?>[^{]++|(?1))*{{END\d+}})

This part of the pattern describes a nested structure with {{START#}} and {{END#}}.

(               # open the first capturing group
{{START\d+}}    # an opening tag
(?>             # open an atomic group (= backtracking forbidden)
    [^{]++      # anything that is not a {, one or more times (possessive)
  |             # OR
    (?1)        # recurse into the first capturing group
)*              # close the atomic group, repeated zero or more times
{{END\d+}}      # a closing tag
)               # close the first capturing group

Now the problem is that you can't capture every level with this part alone, because each match consumes the characters it covers. In other words, you can't match overlapping parts of the string.

The solution is to wrap this whole part inside a zero-width assertion, such as a lookahead (?=...), which doesn't consume characters. Result:
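To see why the lookahead makes overlapping captures possible, here is a small, simplified illustration. It is written in JavaScript, whose regex engine has no (?1) recursion, so it demonstrates only the zero-width trick on a toy pattern; the helper name allMatches is made up for this example.

```javascript
// Illustrative helper (not from the original answer): collects one capture
// group from every match of a global pattern.
function allMatches(pattern, text, group) {
  return Array.from(text.matchAll(pattern), (m) => m[group]);
}

// A consuming pattern: each match uses up its characters,
// so the next match can only start after the previous one ends.
const consuming = allMatches(/\w\w/g, "abcd", 0);

// The same pattern inside a lookahead: the match itself is empty, so the
// engine retries at every position and group 1 captures overlapping pairs.
const overlapping = allMatches(/(?=(\w\w))/g, "abcd", 1);

console.log(consuming);   // ["ab", "cd"]
console.log(overlapping); // ["ab", "bc", "cd"]
```

The recursive PCRE pattern relies on the same effect: since the lookahead consumes nothing, the engine can start a new attempt at each inner {{START#}} even though an outer match already covered it.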

(?=({{START\d+}}(?>[^{]++|(?1))*{{END\d+}}))

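This will match all the levels. Because the lookahead itself matches an empty string, the text of each level ends up in capture group 1 ($matches[1]).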

#3


1  

You cannot do this with pure RegEx; however, it can be accomplished with a simple loop.

JS Example:

// [\s\S]* matches any character, including line breaks
// (same effect as . with the dotall "s" flag, which JS supports since ES2018)
var exp = /\{\{START\}\}([\s\S]*)\{\{END\}\}/;

var myString = "{{START}}\ntest\n{{START}}\ntest 2\n{{START}}\ntest 3\n{{START}}\ntest4\n{{END}}\n{{END}}\n{{END}}\n{{END}}";

var matches = [];
var m = exp.exec(myString);
while ( m != null ) {
    matches.push(m[0]);     // keep the whole block at this level
    m = exp.exec(m[1]);     // search again inside the block's content
}

alert(matches.join("\n\n"));

PHP (I have no idea if this is correct; it's been forever since I've done PHP):

// Single quotes keep the backslashes in the pattern literal
$pattern = '/\{\{START\}\}([\s\S]*)\{\{END\}\}/';
$myString = "{{START}}\ntest\n{{START}}\ntest 2\n{{START}}\ntest 3\n{{START}}\ntest4\n{{END}}\n{{END}}\n{{END}}\n{{END}}";

$outMatches = array();
$subject = $myString;
// Without PREG_OFFSET_CAPTURE, $matches holds plain strings
while ( preg_match($pattern, $subject, $matches) ) {
    $outMatches[] = $matches[0];   // keep the whole block at this level
    $subject = $matches[1];        // search again inside the block's content
}
echo implode("\n\n", $outMatches);

Output:

{{START}}
test
{{START}}
test 2
{{START}}
test 3
{{START}}
test4
{{END}}
{{END}}
{{END}}
{{END}}

{{START}}
test 2
{{START}}
test 3
{{START}}
test4
{{END}}
{{END}}
{{END}}

{{START}}
test 3
{{START}}
test4
{{END}}
{{END}}

{{START}}
test4
{{END}} 
