Splitting a string on commas not contained within double quotes, with a twist

Time: 2021-09-09 21:42:05

I asked this question earlier and it was closed as a duplicate, which I accept. I found the answer in the question "Java: splitting a comma-separated string but ignoring commas in quotes", so thanks to whoever posted it.

But I've since run into another issue. Apparently what I need to do is use "," as my delimiter when there are zero or an even number of double-quotes, but also ignore any "," contained in brackets.

So the following:

"Thanks,", "in advance,", "for("the", "help")"

Would tokenize as:

  • Thanks,
  • in advance,
  • for("the", "help")

I'm not sure whether there's any way to modify the regex I'm currently using to allow for this, but any guidance would be appreciated.

line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

2 Answers

#1


5  

Sometimes it is easier to match what you want instead of what you don't want:

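// Requires java.util.regex.Matcher and java.util.regex.Pattern.
String s = "\"Thanks,\", \"in advance,\", \"for(\"the\", \"help\")\"";
// A token is a quoted run; inside it, a (...) group may contain anything except ')'.
String regex = "\"(\\([^)]*\\)|[^\"])*\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}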

Output:

"Thanks,"
"in advance,"
"for("the", "help")"

If you also need it to ignore closing brackets inside the quotes sections that are inside the brackets, then you need this:

 String regex = "\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"";

An example of a string which needs this second, more complex version is:

 "foo","bar","baz(":-)",":-o")"

Output:

"foo"
"bar"
"baz(":-)",":-o")"

However, I'd advise you to change your data format if at all possible. This would be a lot easier if you used a standard format like XML to store your tokens.
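
If changing the format is not an option, here is a minimal sketch of how the second, more complex pattern above might be wrapped in a helper that also strips the outer quotes, since the tokens listed in the question are shown without them. The class name QuotedTokens and the method name tokenize are made up for illustration, and the sketch assumes every token is wrapped in double quotes and that brackets are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokens {
    // Same pattern as the second regex in this answer: a quoted run in which
    // a (...) group may itself contain quoted sections.
    private static final Pattern TOKEN =
        Pattern.compile("\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"");

    // Hypothetical helper: returns each match with its outer quotes removed.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String quoted = m.group();
            tokens.add(quoted.substring(1, quoted.length() - 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("\"foo\",\"bar\",\"baz(\":-)\",\":-o\")\"")) {
            System.out.println(token);
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every line, which matters if this runs over a large file.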

#2


3  

A home-grown parser is easily written.

For example, this ANTLR grammar takes care of your example input without much trouble:

parse
  :  line*
  ;

line
  :  Quoted ( ',' Quoted )* ( '\r'? '\n' | EOF )
  ;

Quoted
  :  '"' ( Atom )* '"'
  ;

fragment
Atom
  :  Parentheses
  |  ~( '"' | '\r' | '\n' | '(' | ')' )
  ;

fragment
Parentheses
  :  '(' ~( '(' | ')' | '\r' | '\n' )* ')'
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

and it would be easy to extend this to take escaped quotes or parentheses into account.

When the parser generated by that grammar is fed the following two lines of input:

"Thanks,", "in advance,", "for("the", "help")"
"and(,some,more)","data , here"

it gets parsed like this:

[Image: http://i47.tinypic.com/258otvs.png]

If you are considering ANTLR for this, I can post a short how-to on generating a parser from the grammar above.
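
For anyone who would rather not pull in ANTLR, the Quoted, Atom and Parentheses rules can be roughly approximated by hand in plain Java. This is only a sketch under assumptions, not a port of the grammar: the names HandWrittenSplitter and splitLine are hypothetical, and the scanner is more lenient than the grammar (it skips any run of commas and blanks between tokens, and it accepts a stray ')' inside a token).

```java
import java.util.ArrayList;
import java.util.List;

class HandWrittenSplitter {

    // Splits one line into quoted tokens. Inside an outer pair of quotes, a
    // parenthesised group is consumed whole, so quotes and commas inside the
    // group do not end the token.
    static List<String> splitLine(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == ' ' || c == '\t' || c == ',') {
                i++; // skip separators and blanks between tokens
                continue;
            }
            if (c != '"') {
                throw new IllegalArgumentException("expected a quote at index " + i);
            }
            int start = i++;
            while (i < line.length() && line.charAt(i) != '"') {
                if (line.charAt(i) == '(') {
                    // Jump to the first ')' after the opening bracket.
                    int close = line.indexOf(')', i);
                    if (close < 0) {
                        throw new IllegalArgumentException("unclosed '(' at index " + i);
                    }
                    i = close + 1;
                } else {
                    i++;
                }
            }
            if (i >= line.length()) {
                throw new IllegalArgumentException("unclosed quote at index " + start);
            }
            tokens.add(line.substring(start, ++i)); // include the closing quote
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : splitLine("\"and(,some,more)\",\"data , here\"")) {
            System.out.println(token);
        }
    }
}
```

Escaped quotes or nested brackets would need extra state (an escape flag or a depth counter), which is where a generated parser starts to pay off.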
