How can I split a string by a delimiter, but not if it is escaped? For example, I have a string:
1|2\|2|3\\|4\\\|4
The delimiter is | and an escaped delimiter is \|. Furthermore, I want to ignore escaped backslashes, so in \\| the | would still be a delimiter.
So with the above string the result should be:
[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4
5 solutions
#1
107
Use dark magic:
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);
\\\\. matches a backslash followed by a character, (*SKIP)(*FAIL) skips it and \| matches your delimiter.
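To make the explanation above concrete, here is a minimal sketch that runs the pattern against the string from the question. The variable names are only for illustration. Note that in a single-quoted PHP string `\\` is one literal backslash, so the regex engine sees `\\.` for the first alternative.

```php
<?php
// The sample string from the question: 1|2\|2|3\\|4\\\|4
$string = '1|2\\|2|3\\\\|4\\\\\\|4';

// First alternative:  \\.            a backslash plus the character it escapes
//                     (*SKIP)(*FAIL) discard that pair and never split inside it
// Second alternative: \|             a bare pipe, i.e. a real delimiter
$pattern = '~\\\\.(*SKIP)(*FAIL)|\|~s';

$array = preg_split($pattern, $string);
print_r($array);
```

Because the backslash pairs are consumed left to right, `\\|` is read as an escaped backslash followed by a real delimiter, while `\\\|` is read as an escaped backslash followed by an escaped pipe.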
#2
11
Instead of split(...), it's IMO more intuitive to use some sort of "scan" function that operates like a lexical tokenizer. In PHP that would be the preg_match_all function. You simply say you want to match:
- something other than a \ or |
- or a \ followed by a \ or |
- repeat the first or second alternative at least once
The following demo:
$input = "1|2\\|2|3\\\\|4\\\\\\|4";
echo $input . "\n\n";
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
print_r($parts[0]);
will print:
1|2\|2|3\\|4\\\|4
Array
(
[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4
)
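One difference worth knowing about, shown here as a small sketch: since the scan pattern needs at least one character per match, empty fields between two adjacent delimiters are not reported, whereas a preg_split based approach keeps them. The input string below is made up for the comparison.

```php
<?php
// Three fields: "a", an empty field, and "b\|c" (with an escaped pipe)
$input = 'a||b\\|c';

// Scanning: collects runs of escaped pairs and non-special characters
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $m);
$scanned = $m[0];

// Splitting: cuts at every unescaped pipe, keeping empty pieces
$split = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $input);

print_r($scanned);
print_r($split);
```

So the scanning approach fits when empty fields should be ignored, and the splitting approach fits when field positions matter.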
#3
4
Recently I devised a solution:
$array = preg_split('~ ((?<!\\\\)|(?<=[^\\\\](\\\\\\\\)+)) \| ~x', $string);
But the black magic solution is still three times faster.
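One caveat, as far as I know: PCRE expects lookbehind assertions to have a bounded length, so the repeated group inside (?<=...) may be rejected at compile time depending on the PCRE version in use. Below is a minimal sketch of the same idea (split on a pipe that is preceded by an even number of backslashes) using \K instead of a variable-length lookbehind. This is a variant for illustration, not the pattern from the answer above.

```php
<?php
// The sample string from the question: 1|2\|2|3\\|4\\\|4
$string = '1|2\\|2|3\\\\|4\\\\\\|4';

// (?<!\\)    the run of backslashes must start here, not earlier
// (?:\\\\)*  zero or more pairs of backslashes (an even count)
// \K         drop everything matched so far from the reported match
// \|         the delimiter, which is all that preg_split cuts out
$pattern = '~ (?<!\\\\) (?:\\\\\\\\)* \K \| ~x';

$array = preg_split($pattern, $string);
print_r($array);
```

The backslashes before the delimiter stay in the preceding piece because \K resets the start of the match to the pipe itself.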
#4
4
For future readers, here is a universal solution. It is based on NikiC's idea with (*SKIP)(*FAIL):
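function split_escaped($delimiter, $escaper, $text)
{
    // Quote both strings so they are treated literally inside the pattern.
    $d = preg_quote($delimiter, "~");
    $e = preg_quote($escaper, "~");
    // Skip escaped escapers and escaped delimiters; split on bare delimiters.
    $tokens = preg_split(
        '~' . $e . '(' . $e . '|' . $d . ')(*SKIP)(*FAIL)|' . $d . '~',
        $text
    );
    // Backslashes and dollar signs are special in replacement strings, so escape them.
    $escaperReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $escaper);
    $delimiterReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $delimiter);
    // Unescape each token: a doubled escaper becomes one escaper,
    // an escaped delimiter becomes a plain delimiter.
    return preg_replace(
        ['~' . $e . $e . '~', '~' . $e . $d . '~'],
        [$escaperReplacement, $delimiterReplacement],
        $tokens
    );
}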
Try it:
// the base situation:
$text = "asdf\\,fds\\,ddf,\\\\,f\\,,dd";
$delimiter = ",";
$escaper = "\\";
print_r(split_escaped($delimiter, $escaper, $text));
// other signs:
$text = "dk!%fj%slak!%df!!jlskj%%dfl%isr%!%%jlf";
$delimiter = "%";
$escaper = "!";
print_r(split_escaped($delimiter, $escaper, $text));
// delimiter with multiple characters:
$text = "aksd()jflaksd())jflkas(('()j()fkl'()()as()d('')jf";
$delimiter = "()";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));
// escaper is same as delimiter:
$text = "asfl''asjf'lkas'''jfkl''d'jsl";
$delimiter = "'";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));
Output:
Array
(
[0] => asdf,fds,ddf
[1] => \
[2] => f,
[3] => dd
)
Array
(
[0] => dk%fj
[1] => slak%df!jlskj
[2] =>
[3] => dfl
[4] => isr
[5] => %
[6] => jlf
)
Array
(
[0] => aksd
[1] => jflaksd
[2] => )jflkas((()j
[3] => fkl()
[4] => as
[5] => d(')jf
)
Array
(
[0] => asfl'asjf
[1] => lkas'
[2] => jfkl'd
[3] => jsl
)
Note: There is a theoretical problem: implode('::', ['a:', ':b']) and implode('::', ['a', '', 'b']) produce the same string: 'a::::b'. Imploding can also be an interesting problem.
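The ambiguity in the note can be checked directly. The sketch below first reproduces it, then shows one possible way to avoid it for a single-character delimiter by escaping each piece before joining. The function name join_escaped is hypothetical and not part of the answer above, and the multi-character case from the note is not solved by it.

```php
<?php
// Both calls produce the same string, so the pieces cannot be recovered.
$first  = implode('::', ['a:', ':b']);
$second = implode('::', ['a', '', 'b']);
var_dump($first === $second);

// Hypothetical helper: escape the escaper first, then the delimiter,
// so that escapers added for delimiters are not doubled again.
// Assumes a single-character delimiter that differs from the escaper.
function join_escaped(string $delimiter, string $escaper, array $pieces): string
{
    $escaped = [];
    foreach ($pieces as $piece) {
        $escaped[] = str_replace(
            [$escaper, $delimiter],
            [$escaper . $escaper, $escaper . $delimiter],
            $piece
        );
    }
    return implode($delimiter, $escaped);
}

$joinedA = join_escaped(':', '\\', ['a:', ':b']);
$joinedB = join_escaped(':', '\\', ['a', '', 'b']);
var_dump($joinedA, $joinedB);
```

With escaping applied before joining, the two inputs no longer collapse into the same string.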
#5
-1
Regex is painfully slow. A better method is removing escaped characters from the string prior to splitting then putting them back in:
$foo = 'a,b|,c,d||,e';

function splitEscaped($str, $delimiter, $escapeChar = '\\') {
    // Temporary marker strings, assumed not to appear in the original string
    $double = "\0\0\0_doub";
    $escaped = "\0\0\0_esc";
    // Hide escaped escape characters first, then escaped delimiters
    $str = str_replace($escapeChar . $escapeChar, $double, $str);
    $str = str_replace($escapeChar . $delimiter, $escaped, $str);
    $split = explode($delimiter, $str);
    // Put the hidden characters back, now unescaped
    foreach ($split as &$val) {
        $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
    }
    unset($val); // break the reference left over from the foreach
    return $split;
}

print_r(splitEscaped($foo, ',', '|'));
This splits on ',' unless it is escaped with "|". It also supports double escaping, so "||" becomes a single "|" after the split:
Array ( [0] => a [1] => b,c [2] => d| [3] => e )
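If relying on marker strings that must not occur in the input is a concern, a single pass over the string is another regex-free option. This is only a sketch under the assumption that both the delimiter and the escape character are single bytes; the function name split_escaped_scan is made up for the example.

```php
<?php
// Hypothetical single-pass splitter; assumes one-byte delimiter and escape char.
function split_escaped_scan(string $str, string $delimiter, string $escapeChar = '\\'): array
{
    $parts = [];
    $current = '';
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $char = $str[$i];
        $next = $i + 1 < $length ? $str[$i + 1] : null;
        if ($char === $escapeChar && ($next === $escapeChar || $next === $delimiter)) {
            // Escaped escape char or escaped delimiter: keep the next byte literally.
            $current .= $next;
            $i++;
        } elseif ($char === $delimiter) {
            // Unescaped delimiter: close the current piece.
            $parts[] = $current;
            $current = '';
        } else {
            $current .= $char;
        }
    }
    $parts[] = $current;
    return $parts;
}

$result = split_escaped_scan('a,b|,c,d||,e', ',', '|');
print_r($result);
```

Like the str_replace version, an escape character followed by anything other than the delimiter or itself is kept unchanged.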