Splitting a string (especially in Java, using java.util.regex or something else)

Time: 2023-01-07 14:01:52

Does anyone know how to split a string on a character taking into account its escape sequence?


For example, if the character is ':', "a:b" is split into two parts ("a" and "b"), whereas "a\:b" is not split at all.

I think this is hard (impossible?) to do with regular expressions.


Thank you in advance,


Kedar

2 answers

#1


Since Java supports variable-length look-behinds (as long as they are finite), you could do it like this:

import java.util.regex.*;

public class RegexTest {
    public static void main(String[] argv) {

        Pattern p = Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

        String text = "foo:bar\\:baz\\\\:qux\\\\\\:quux\\\\\\\\:corge";

        String[] parts = p.split(text);

        System.out.printf("Input string: %s\n", text);
        for (int i = 0; i < parts.length; i++) {
            System.out.printf("Part %d: %s\n", i+1, parts[i]);
        }

    }
}
  • (?<=(?<!\\)(?:\\\\){0,10}) looks behind for an even number of back-slashes (including zero, up to a maximum of 10 pairs, i.e. 20 back-slashes) that are not themselves preceded by another back-slash.

Output:

Input string: foo:bar\:baz\\:qux\\\:quux\\\\:corge
Part 1: foo
Part 2: bar\:baz\\
Part 3: qux\\\:quux\\\\
Part 4: corge
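
One detail to keep in mind with the split-based version: Pattern.split(CharSequence) drops trailing empty strings, so an input that ends in an unescaped ':' loses its last (empty) piece. Here is a minimal sketch of one way around that, assuming the same pattern as above. The class name SplitKeepEmpty and the helper splitOnColon are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitKeepEmpty {
    // Same pattern as above: a ':' preceded by an even number of
    // back-slashes (at most 10 pairs).
    private static final Pattern DELIM =
            Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

    // Hypothetical helper: a negative limit tells split() to keep
    // trailing empty pieces instead of discarding them.
    static String[] splitOnColon(String text) {
        return DELIM.split(text, -1);
    }

    public static void main(String[] args) {
        // Default split: the trailing empty piece is dropped.
        System.out.println(Arrays.toString(DELIM.split("a:b:")));
        // Negative limit: the trailing empty piece is kept.
        System.out.println(Arrays.toString(splitOnColon("a:b:")));
        // The escaped ':' is not treated as a delimiter.
        System.out.println(Arrays.toString(splitOnColon("a\\:b:c")));
    }
}
```

Whether the trailing empty piece matters depends on the data format, so this is a judgment call rather than a required change.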


Another way would be to match the parts themselves, instead of splitting at the delimiters.

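// Needs java.util.List and java.util.LinkedList; reuses "text" from the example above.
Pattern p2 = Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");
List<String> parts2 = new LinkedList<String>();
Matcher m = p2.matcher(text);
while (m.find()) {
    parts2.add(m.group(1)); // group 1 holds one piece, escape sequences left intact
}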

The strange syntax stems from the need to handle empty pieces at the start and end of the string. When a match spans exactly zero characters, the next attempt starts one character past the end of it. If it didn't, it would match another empty string, and another, ad infinitum…
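
The zero-length match behaviour is easy to see in isolation. This is a small sketch using an unrelated pattern, x*, which can match the empty string at every position. The class EmptyMatchDemo and the method matchStarts are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // Collects the start offset of every match of "x*" in the input.
    static List<Integer> matchStarts(String input) {
        Matcher m = Pattern.compile("x*").matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            // After an empty match, the next find() begins one character
            // further on, so this loop terminates.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "ab" contains no 'x', so every match is empty: one per position,
        // including the position at the end of the input.
        System.out.println(matchStarts("ab"));
    }
}
```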

  • (?<=\A|\G:) looks behind for either the start of the string (for the first piece), or the end of the previous match followed by the separator. If we used (?:\A|\G:) instead, it would fail when the first piece is empty (input starts with a separator).
  • \\. matches any escaped character.
  • [^:\\] matches any character that is not part of an escape sequence (because \\. has already consumed both characters of those).
  • ((?:\\.|[^:\\])*) captures all characters up to the first non-escaped delimiter into capture group 1.
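
Putting the pieces described above together, here is a self-contained sketch of the match-based approach, including inputs with empty pieces. The class name MatchPieces and the method pieces are assumptions made for this example. ArrayList is used in place of LinkedList purely as a matter of taste.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPieces {
    // Pattern from the answer: each match is one piece, captured in group 1.
    private static final Pattern PIECE =
            Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");

    // Returns the pieces of 'text' separated by unescaped ':' characters.
    // Escape sequences are left in the pieces as they are.
    static List<String> pieces(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = PIECE.matcher(text);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Escaped ':' and escaped back-slash inside the pieces.
        System.out.println(pieces("foo:bar\\:baz\\\\:qux"));
        // Empty pieces at the start, in the middle and at the end.
        System.out.println(pieces(":a::b:"));
    }
}
```

Unlike the default Pattern.split, this version keeps a trailing empty piece, which may or may not be what the caller wants.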

#2


(?<=^|[^\\]): gets you close, but doesn't address escaped slashes. (That's the literal regex; of course you have to escape the back-slashes in it to put it into a Java string.)

(?<=(^|[^\\])(\\\\)*): How about that? I think that should satisfy any ':' that is preceded by an even number of slashes.
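
The "even number of slashes" rule can also be checked without any look-behind, by walking the string once and tracking whether the current character is escaped. This is a minimal sketch of that idea and is not part of either answer. The class EscapeAwareSplitter and its split method are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeAwareSplitter {
    // Splits 'text' on 'delim' unless the delimiter is escaped by a
    // back-slash. Escape sequences are left in the pieces untouched,
    // matching the behaviour of the regex versions in this thread.
    static List<String> split(String text, char delim) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true if the previous char was an unescaped back-slash
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (escaped) {
                current.append(c);      // second half of an escape sequence
                escaped = false;
            } else if (c == '\\') {
                current.append(c);      // start of an escape sequence
                escaped = true;
            } else if (c == delim) {
                parts.add(current.toString());
                current.setLength(0);   // start a new piece
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());  // last piece, possibly empty
        return parts;
    }

    public static void main(String[] args) {
        String text = "foo:bar\\:baz\\\\:qux\\\\\\:quux\\\\\\\\:corge";
        for (String part : split(text, ':')) {
            System.out.println(part);
        }
    }
}
```

A hand-written loop like this has no cap on the number of back-slashes, at the cost of a few more lines than the regex.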


Edit: don't vote this up. MizardX's solution is better :)

