I am using this PHP code to split a string roughly every 120 characters. It splits at the closest space. But it also splits inside HTML and XML entities (tags), so it sometimes outputs fragments like id="id">. How can I make it ignore XML and HTML entities when choosing where to split, without removing them?
function splitWords($string, $max = 1)
{
    $words = preg_split( '/\s/', $string );
    $lines = array();
    $line = '';
    foreach ( $words as $k => $word ) {
        $newLine = $line . ' ' . $word;
        $length = strlen( $newLine );
        if ( $length <= $max ) {
            $line .= ' ' . $word;
        } else if ( $length > $max ) {
            if ( !empty( $line ) ) {
                $lines[] = trim( $line );
            }
            $line = $word;
        } else {
            $lines[] = trim( $line ) . ' ' . $word;
            $line = '';
        }
    }
    $lines[] = ( $line = trim( $line ) ) ? $line : $word;
    return $lines;
}
1 solution
#1
1
Description
I would change your split command so that it uses either a tag substring or a space as the delimiter.
This basic regex will:
- match either tags or spaces
- not match spaces inside tags
- avoid many of the pitfalls of pattern matching HTML text
<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|\s
With this regex you can do all sorts of crazy things depending on where you place the capturing parentheses and which options you pass to preg_split.
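To make the pattern above easier to follow, here is a sketch of the same regex laid out with the PCRE x (extended) flag and annotated piece by piece. The `~` delimiter, the `$pattern` and `$sample` names, and the sample string are my own choices for illustration; the comments are my reading of the pattern, not the answer author's wording.

```php
<?php
// Annotated copy of the regex above. The x flag makes PCRE ignore the layout
// whitespace and the # comments. "~" is used as the delimiter so that no
// character in the pattern or comments can end the pattern early.
$pattern = <<<'REGEX'
~
  <\/?\w+            # "<" with an optional slash, then the tag name
  (?=\s|>)           # the name must be followed by whitespace or ">"
  (?:
      [^>=|&)]*      # a run of characters other than > = | & )
    | ='[^']*'       # a single-quoted attribute value
    | ="[^"]*"       # a double-quoted attribute value
    | =[^'"][^\s>]*  # an unquoted attribute value
  )*
  >                  # the ">" that closes the tag
| \s                 # or, outside a tag, one whitespace character
~x
REGEX;

// Hypothetical sample input: the space inside class="x y" is not a split point.
$sample = 'one <b class="x y">two</b> three';
$parts = preg_split($pattern, $sample, -1, PREG_SPLIT_NO_EMPTY);
print_r($parts); // the three words: one, two, three
```

Because the quoted-value alternatives consume everything between the quotes, spaces and even `>` characters inside an attribute value never act as delimiters.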
Examples
Note that in this demo the anchor tags have some seriously difficult edge cases.
PHP v5.4.4 Code
<?php
$string = ' <a onmouseover=\' <a href="notreal.com">This is text inside an attribute</a> \' href=url.com>This is some inner text</a>This is outer text.
<a onmouseover=\' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; \' href=\'http://InterestedURL.com\' id=\'revSAR\'>
I am the inner text too.
</a>
';
echo "split retains all spaces\n";
$array = preg_split ('/(<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|\s)/', $string, 0, PREG_SPLIT_DELIM_CAPTURE);
echo implode(",",$array);
echo "\n\nsplit ignores spaces\n";
$array = preg_split ('/(<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>)|\s/', $string, 0, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
echo implode(",",$array);
echo "\n\nsplit ignores tags and spaces\n";
$array = preg_split ('/<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|\s/', $string, 0, PREG_SPLIT_NO_EMPTY);
echo implode(",",$array);
echo "\n\nsplit ignores tags and retains spaces\n";
$array = preg_split ('/<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|(\s)/', $string, 0, PREG_SPLIT_DELIM_CAPTURE);
echo implode(",",$array);
Output
You're probably most interested in the third option, "split ignores tags and spaces".
split retains all spaces
, ,,<a onmouseover=' <a href="notreal.com">This is text inside an attribute</a> ' href=url.com>,This, ,is, ,some, ,inner, ,text,</a>,This, ,is, ,outer, ,text.,
,,
,, ,,<a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href='http://InterestedURL.com' id='revSAR'>,,
,, ,, ,I, ,am, ,the, ,inner, ,text, ,too.,
,, ,, ,,</a>,,
,
split ignores spaces
<a onmouseover=' <a href="notreal.com">This is text inside an attribute</a> ' href=url.com>,This,is,some,inner,text,</a>,This,is,outer,text.,<a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href='http://InterestedURL.com' id='revSAR'>,I,am,the,inner,text,too.,</a>
split ignores tags and spaces
This,is,some,inner,text,This,is,outer,text.,I,am,the,inner,text,too.
split ignores tags and retains spaces
, ,,This, ,is, ,some, ,inner, ,text,This, ,is, ,outer, ,text.,
,,
,, ,,,
,, ,, ,I, ,am, ,the, ,inner, ,text, ,too.,
,, ,, ,,,
,
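To tie this back to the original question, here is a minimal sketch of a line wrapper that only breaks at whitespace outside tags, so the tags are kept but never cut in half. The function name `splitWordsKeepingTags` is hypothetical, and two choices are my assumptions rather than anything stated above: tags count toward the line length (as with strlen in the question's code), and a single tag or word longer than `$max` is left on its own over-long line.

```php
<?php
// Hypothetical helper, not part of the original answer: wraps $string at
// roughly $max characters, breaking only at whitespace that is outside tags.
function splitWordsKeepingTags($string, $max = 120)
{
    // Tag part of the pattern, copied from the answer.
    $tag = '<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>';

    // Capture tags and whitespace runs so both come back as tokens.
    $tokens = preg_split(
        '/(' . $tag . '|\s+)/',
        $string,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    if ($tokens === false) {
        return array($string); // PCRE error: hand the input back unsplit
    }

    // Glue tokens into unbreakable "words": a whitespace token ends the
    // current word, while tags stay attached to the text next to them.
    $words = array();
    $word = '';
    foreach ($tokens as $token) {
        if (preg_match('/^\s+$/', $token) === 1) {
            if ($word !== '') {
                $words[] = $word;
                $word = '';
            }
        } else {
            $word .= $token;
        }
    }
    if ($word !== '') {
        $words[] = $word;
    }

    // Greedy wrap: start a new line when the next word would exceed $max.
    $lines = array();
    $line = '';
    foreach ($words as $w) {
        $candidate = ($line === '') ? $w : $line . ' ' . $w;
        if ($line !== '' && strlen($candidate) > $max) {
            $lines[] = $line;
            $line = $w;
        } else {
            $line = $candidate;
        }
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return $lines;
}

// Usage with a small limit so the effect is visible.
print_r(splitWordsKeepingTags('Hello <a href="x y">big world</a> again', 20));
```

With the sample input, the space inside href="x y" is never used as a break point, so the opening tag stays in one piece. If tags should not count toward the limit, one option would be to measure `strlen(strip_tags($candidate))` instead, though that is a different trade-off and untested here.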