preg_split on mixed HTML and PHP tags, except inside quotes and comments

Date: 2021-08-10 22:06:43

I have a php page mixed with HTML. Some example code:

<?php echo "<p>some text</p>"; ?>/* <? some php in comments ?> */
<p>some HTML text</p> <!-- <h1>some HTML in comments</h1> -->
<? $header_info = <<<END 
\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; ?>
<h2>Some more HTML</h2>

I would like to split at each PHP and HTML tag but leave any PHP tags or HTML tags in quotes or comments untouched/ignored. This is what I have so far:

$array = preg_split("/((^<\?php)|([^'|\"]<\?php)|([^'|\"]<\?)|([^'|\"]\?>)|(<\%)|(\%>))/i", $string, -1);

The issue I have is that some of the HTML closing brackets '>' are missing in the final $array. I would like to keep the HTML open and closing tags intact. Sometimes I end up with

<p></p instead of <p></p> 

It should look like this:

[0] echo "<p>some text</p>";  
[1] <p>some HTML text</p> 
[2] $header_info = <<<END 
\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; 
[3] <h2>Some more HTML</h2>

Any comments do not need to be part of the array as long as preg_split does not see them as any delimiters and disregards any of them.

I also just realized that some of the php tags, especially when using eval() can end up like this:

"?> <p>some HTML text</p> <?";

which would mean that the quotations in my regex would not match any of those cases.

preg_match() might be a better option, but I am not sure.

Any help would be very much appreciated as I am not very ingenious when it comes to regex and am rather stuck at this point.

Thanks a lot :)

1 solution

#1

PREAMBLE
Since a regular expression solution was asked for, the following solution relies on regular expressions. In this particular case, however, a PHP parser would be better suited.

Regular Expression

#(?<!"|\')<\\?(?:php)?\\s+(.+?)\\?>(?!"|\')|/\*.+\*/|<!--.+-->#is

Scriptlet

$subject = '<?php echo "<p>some text</p>"; ?>/* <? some php in comments ?> */
<p>some HTML text</p> <!-- <h1>some HTML in comments</h1> -->
<? $header_info = <<<END
\\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; ?>
<h2>Some more HTML</h2>';

$returnValue = preg_replace('#(?<!"|\')<\\?(?:php)?\\s+(.+?)\\?>(?!"|\')|/\*.+\*/|<!--.+-->#is', '$1', $subject, -1);

var_dump(preg_split('#\\r?\\n#s', $returnValue));

Result

array(6) {
  [0]=>
  string(25) "echo "<p>some text</p>"; "
  [1]=>
  string(22) "<p>some HTML text</p> "
  [2]=>
  string(21) "$header_info = <<<END"
  [3]=>
  string(60) "\$some="<?php @ob_start(); @session_set_save_handler(); ?>";"
  [4]=>
  string(5) "END; "
  [5]=>
  string(23) "<h2>Some more HTML</h2>"
}
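
The result above spreads the heredoc block over three entries, whereas the question asked for one entry per PHP or HTML segment. One possible way to get closer to that, offered here as a sketch and not as part of the original answer, is to pass the same pattern to preg_split() with PREG_SPLIT_DELIM_CAPTURE, so that the captured PHP code is returned alongside the HTML found between matches. The variable names and the nowdoc holding the sample input are illustrative choices.

```php
<?php
// Sample input from the question, held in a nowdoc so nothing needs escaping.
$subject = <<<'SRC'
<?php echo "<p>some text</p>"; ?>/* <? some php in comments ?> */
<p>some HTML text</p> <!-- <h1>some HTML in comments</h1> -->
<? $header_info = <<<END
\$some="<?php @ob_start(); @session_set_save_handler(); ?>";
END; ?>
<h2>Some more HTML</h2>
SRC;

// Same pattern as in the answer, unchanged.
$pattern = '#(?<!"|\')<\\?(?:php)?\\s+(.+?)\\?>(?!"|\')|/\*.+\*/|<!--.+-->#is';

// Group 1 (the PHP code) comes back as its own piece; comments capture nothing and vanish.
$pieces = preg_split($pattern, $subject, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Trim surrounding whitespace and drop pieces that held only line breaks.
$segments = array_values(array_filter(array_map('trim', $pieces), fn($s) => $s !== ''));

print_r($segments);
```

With the sample input this should give four segments, with the heredoc block kept in one piece. The caveats listed under "Known issues" still apply, since the pattern itself is unchanged.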

DEMO
http://sandbox.onlinephpfunctions.com/code/017a51877b50f272f151feade7b59e142757481e

Discussion

1. # 
2. (?<!"|\')
3. <\\?(?:php)?\\s+
4. (.+?)
5. \\?>
6. (?!"|\')
7. |/\*.+\*/
8. |<!--.+-->
9. #is

line 1 I use # as the regex delimiter because it avoids having to escape /.
line 2 This is the key to the regex. A negative lookbehind ensures that the opening PHP tag is NOT preceded by a single or double quote.
line 3 This defines what an opening PHP tag is. To support ASP tags too, this line can be changed to <(?:\\?(?:php)?|%)\\s+ and line 5 to (?:\\?|%)>
line 4 Once the start of a PHP code sequence has been detected, we lazily match every character in it. Note that the s flag on line 9 lets . match newlines too, so the PHP code sequence may span several lines.
line 5 This marks the end of the PHP code sequence.
line 6 A negative lookahead ensures that the closing PHP tag just matched is not followed by a single or double quote.
line 7,8 PHP and HTML comments are matched as well. Since they capture nothing, they are replaced with an empty string, that is, simply dropped.
line 9 End of the regex, with the i (case-insensitive) and s (dot matches newline) flags.
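
The lookbehind on line 2 is also what cures the missing '>' described in the question. The pattern there used a negated character class, [^'|\"], in front of each tag. A character class consumes one character, so a '>' sitting directly before <?php becomes part of the delimiter and is thrown away. A minimal comparison, with a made-up input string and simplified patterns:

```php
<?php
$html = '<p>text</p><?php echo 1; ?><h2>more</h2>';

// Negated class: the character before the opening tag is part of the delimiter and is lost.
$withClass = preg_split('/[^\'"]<\?php|\?>/', $html);

// Negative lookbehind: zero-width, so the preceding character stays in the output.
$withLookbehind = preg_split('/(?<![\'"])<\?php|\?>/', $html);

var_dump($withClass[0]);      // string(10) "<p>text</p"
var_dump($withLookbehind[0]); // string(11) "<p>text</p>"
```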

Known issues

  • After the regex has been applied to $subject, the result is simply split on newlines (each optionally preceded by a carriage return), so a multi-line PHP block ends up spread over several array entries.

  • No effort is made to handle PHP heredoc or nowdoc syntax.

  • This regex should NOT be viewed as bulletproof against arbitrary PHP code in the wild. PHP parsers are far better suited.
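
As a rough sketch of the parser route, PHP's built-in tokenizer (token_get_all()) can separate inline HTML from PHP code without any regex, and it already knows that tags inside strings, heredocs and /* */ comments are not real tags. The helper name splitPhpAndHtml() and the ['php' or 'html', text] segment shape are assumptions made for illustration. Whether the short <? tag is recognized depends on the short_open_tag setting, so the sample below sticks to <?php.

```php
<?php
// Hypothetical helper: returns a list of ['php', code] and ['html', markup] segments.
function splitPhpAndHtml(string $source): array
{
    $segments = [];
    $php = null; // null while outside a PHP block, otherwise the code collected so far

    foreach (token_get_all($source) as $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        [$id, $text] = is_array($token) ? $token : [null, $token];

        if ($id === T_OPEN_TAG || $id === T_OPEN_TAG_WITH_ECHO) {
            $php = '';
        } elseif ($id === T_CLOSE_TAG) {
            $segments[] = ['php', trim($php)];
            $php = null;
        } elseif ($id === T_INLINE_HTML) {
            $segments[] = ['html', $text];
        } elseif ($php !== null && $id !== T_COMMENT && $id !== T_DOC_COMMENT) {
            $php .= $text; // keep code, skip PHP comments
        }
    }
    if ($php !== null) {
        $segments[] = ['php', trim($php)]; // block without a closing tag
    }
    return $segments;
}

$source = <<<'SRC'
<?php echo "<p>some text</p>"; /* <?php in a comment ?> */ ?>
<p>some HTML text</p>
<?php $x = "?> still a string <?php"; ?>
SRC;

print_r(splitPhpAndHtml($source));
```

Note that /* ... */ placed outside any PHP tag is plain text as far as PHP is concerned, so the tokenizer reports it as inline HTML. Stripping such text, or HTML comments, would need a separate pass.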
