How can I use String#split to split a string on the delimiters + - * / ( ) and space, and keep them as extra tokens?

I need to split strings containing basic mathematical expressions, such as:
"(a+b)*c"
or
" (a - c) / d"
The delimiters are + - * / ( ) and space, and I need each of them as an independent token. Basically, the result should look like this:

"("
"a"
"+"
"b"
")"
"*"
"c"

And for the second example:

" "
"("
"a"
" "
"-"
...

I have read a lot of questions about similar problems with less complex delimiters, and the common answer was to use zero-width positive lookahead and lookbehind.
Like this: (?<=X | ?=X)
where X represents the delimiters. But putting them in a character class like this:
[\\Q+-*()\\E/\\s]
does not work in the desired way.
So how do I have to format the delimiters to make the split work the way I need it?

---Update---
Word-class characters and longer combinations should not be split,
such as "ab", "c1" or "12".
Or in short, I need the same result as StringTokenizer would give with the parameters "-+*/() " and true.

4 Solutions

#1

Try splitting your data using

yourString.split("(?<=[\\Q+-*()\\E/\\s])|(?=[\\Q+-*()\\E/\\s])(?<!^)");

I assume the problem you had was not in the \\Q+-*()\\E part but in (?<=X | ?=X): it should be (?<=X)|(?=X), since it needs to produce a separate look-behind and look-ahead.

Demo for "_a+(ab-c1__)+12_" (BTW, _ will be replaced with a space in the code. SO shows two spaces as one, so I had to use __ to present them somehow.)

String[] tokens = " a+(ab-c1  )+12 "
        .split("(?<=[\\Q+-*()\\E/\\s])|(?=[\\Q+-*()\\E/\\s])(?<!^)");
for (String token :  tokens)
    System.out.println("\"" + token + "\"");

result

" "
"a"
"+"
"("
"ab"
"-"
"c1"
" "
" "
")"
"+"
"12"
" "
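
Since the update in the question asks for the same result as StringTokenizer with "-+*/() " and true, the two can be compared directly. This is only a sketch for the sample input above; the class and method names are made up for the illustration. Note that \\s in the regex also matches tabs and newlines, while the tokenizer's delimiter string only contains the plain space, so the two may differ on other inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitVsTokenizer {
    // Delimiter set from the question: + - * / ( ) and space.
    static final String DELIMS = "-+*/() ";

    // Regex from this answer: split after a delimiter, or before one
    // (but not at the very start of the string).
    static final String REGEX =
            "(?<=[\\Q+-*()\\E/\\s])|(?=[\\Q+-*()\\E/\\s])(?<!^)";

    static List<String> viaSplit(String input) {
        return Arrays.asList(input.split(REGEX));
    }

    static List<String> viaTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        // The third argument (true) makes the tokenizer return delimiters as tokens.
        StringTokenizer st = new StringTokenizer(input, DELIMS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = " a+(ab-c1  )+12 ";
        System.out.println(viaSplit(input));
        System.out.println(viaTokenizer(input));
        // Both lists hold the same 13 tokens for this input.
        System.out.println(viaSplit(input).equals(viaTokenizer(input)));
    }
}
```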

#2

It is one thing if you are doing this as student work, but in practice this is more of a job for a lexical analyzer and parser. In C, you would use lex and yacc or GNU flex and bison. In Java, you'd use ANTLR or JavaCC.

But start by writing a BNF grammar for your expected input (usually called the input language).
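
For an input language this small, the lexer stage can also be written by hand. The following is only a rough sketch, not output of ANTLR or JavaCC; the grammar in the comment and the class name TinyLexer are assumptions made up for the illustration, not something given in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical grammar for the input language (an assumption):
//   expr   ::= term   (("+" | "-") term)*
//   term   ::= factor (("*" | "/") factor)*
//   factor ::= OPERAND | "(" expr ")"
public class TinyLexer {
    static final String DELIMS = "+-*/() ";

    // Emits every delimiter as its own token and every run of
    // non-delimiter characters (e.g. "ab", "c1", "12") as one token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char ch : input.toCharArray()) {
            if (DELIMS.indexOf(ch) >= 0) {
                if (word.length() > 0) {       // flush the pending operand
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                tokens.add(String.valueOf(ch));
            } else {
                word.append(ch);
            }
        }
        if (word.length() > 0) {               // flush a trailing operand
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(a+b)*c")); // prints [(, a, +, b, ), *, c]
    }
}
```

A parser for the grammar would then consume this token list instead of the raw string.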

#3

You can use the following regex:

\s*(?<=[()+*/a-z-])\s*

(?<=...) is a zero-width assertion, that is, it has to match, but the text it matches is not included in the match. The \s* takes care of the spaces around each token.

Code example:

String a = " (a - c) / d *       x   ";
String regex = "\\s*(?<=[()+*/a-z-])\\s*";
String[] split = a.split(regex);
System.out.println(Arrays.toString(split));

Output:

[ (, a, -, c, ), /, d, *, x]
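
It may be worth checking this pattern against the update in the question, where operands such as "ab" or "c1" should stay whole and spaces should be kept as tokens. A small sketch (the class name is made up for the illustration): because a-z is part of the look-behind class, the pattern also matches after every lowercase letter, and the \s* parts consume the spaces instead of returning them.

```java
import java.util.Arrays;
import java.util.List;

public class LookbehindOnlySplit {
    // Regex from this answer, unchanged.
    static final String REGEX = "\\s*(?<=[()+*/a-z-])\\s*";

    static List<String> tokens(String input) {
        return Arrays.asList(input.split(REGEX));
    }

    public static void main(String[] args) {
        // Multi-character operands are cut after each lowercase letter.
        System.out.println(tokens("ab+c1")); // prints [a, b, +, c, 1]
    }
}
```

So this variant fits single-letter operands with whitespace that can be thrown away, which is a different requirement from the StringTokenizer behaviour asked for in the update.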

#4

Try this instead:

[-+*/()\\s]

Dashes have to come first or last in a character class in order to not represent a range. The rest of the characters need no escaping (presumably what you were trying to do with \\Q and \\E) because most characters are taken literally anyway in a character class.

Also, I wasn't aware of the syntax, (?<=X|?=X). If it works, then great. But if it doesn't, try this equivalent expansion, whose syntax I know does work:

(?:(?<=X)|(?=X))
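
To make the two points above concrete, here is a minimal sketch, assuming Java 8 or newer (older versions may return an extra empty string at the front when a zero-width match occurs at index 0). The class and method names are made up for the illustration, and the class X below uses a plain space plus '/' so that it covers every delimiter from the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class DashInClass {
    // Dash comes first, so it is a literal and not a range operator.
    static final String X = "[-+*/() ]";

    // The expansion from this answer: split after a delimiter or before one.
    static final String REGEX = "(?:(?<=" + X + ")|(?=" + X + "))";

    static String[] tokens(String input) {
        return input.split(REGEX);
    }

    // Returns true if the regex compiles, false if Java rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "+-*" in the middle of a class is read as a range from '+' to '*',
        // which is out of order, so the pattern is rejected.
        System.out.println(compiles("[+-*]"));  // prints false
        System.out.println(compiles("[-+*]"));  // prints true
        System.out.println(Arrays.toString(tokens("(a-b)*c1"))); // prints [(, a, -, b, ), *, c1]
    }
}
```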
