Regular expression that splits on spaces while preserving double quotes, single quotes, and hyphens

Time: 2021-08-13 03:51:39

I use the Java Pattern class to specify the regex as a string.

For example, the input

I love being spider-man : "Peter Parker"

should yield spider-man and "Peter Parker" as separate tokens. Thanks.

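// Assumes java.io.*, java.util.*, java.util.regex.Pattern and the Lucene
// classes PatternTokenizer and CharTermAttribute are imported.
try (BufferedReader br = new BufferedReader(new FileReader(f))) {
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        // Keep a separator so the last word of one line does not merge with the first word of the next.
        sb.append(line).append(' ');
    }

    String everything = sb.toString();
    List<String> result = new ArrayList<String>();
    // A token is either a quoted run or a run of non-space characters.
    Pattern pat = Pattern.compile("([\"'].*?[\"']|[^ ]+)");
    // Group 0 asks the tokenizer to emit each whole match as a token.
    PatternTokenizer pt = new PatternTokenizer(new StringReader(everything), pat, 0);
    while (pt.incrementToken()) {
        result.add(pt.getAttribute(CharTermAttribute.class).toString());
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}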

So I guess the reason "some word" is not kept together is that each token is itself a string. Any clues? Thank you.

2 solutions

#1


Check whether this regex is what you need:

"([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))"

I assume that a quoted string never contains another quote character, single or double.
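
To show why that assumption matters, here is a small sketch that is not part of the original answer. It compares the quoting rule used above, where any quote character may close a quoted token, with a variant that uses the backreference \1 so that a token only ends at the same kind of quote that opened it. The class name BackrefQuoteTokens, the helper tokenize, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefQuoteTokens {
    // Quoting rule from the answer: any quote character may close a quoted token.
    static final Pattern ANY_QUOTE = Pattern.compile("[\"'].*?[\"']|[^ ]+");
    // Hypothetical variant: \1 forces the closing quote to equal the opening one.
    static final Pattern SAME_QUOTE = Pattern.compile("([\"']).*?\\1|[^ ]+");

    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // whole match, quotes included
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "say \"it's me\" done";
        System.out.println(tokenize(ANY_QUOTE, input));  // [say, "it', s, me", done]
        System.out.println(tokenize(SAME_QUOTE, input)); // [say, "it's me", done]
    }
}
```

With the backreference variant, an apostrophe inside a double-quoted phrase no longer cuts the token short. Quotes of the same kind nested inside each other are still not handled.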

There is also an assumption about delimiters: only space and : act as delimiters, so nothing in "foo_bar" will match. To allow more delimiters, such as ;, ., ,, and ?, add them to the character class in both the lookahead and the lookbehind assertion, like this:

"([\"'].*?[\"']|(?<=[ :;.,?]|^)[a-zA-Z0-9-]+(?=[ :;.,?]|$))"

I have not tested it on every input, but I did test it on this one:

String marked = "    sdfsdf \" sdfs  sdfsdfs \"   \"sdfsdf\"  sdfsdf   sdfsd  dsfshj sdfsdf-sdf  'sdfsdfsdf  sd f '  "
        // replaceAll wraps each captured token in X...Y so the matches are easy to check
        .replaceAll("([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))", "X$1Y");
System.out.println(marked);

And it works fine for me.
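
To see what the delimiter assumption means in practice, here is a small sketch that runs the two stricter patterns from this answer on the same input. The class name DelimiterAwareTokens, the helper tokenize, and the sample input are made up for illustration. The patterns themselves are copied from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterAwareTokens {
    // Pattern from the answer: only space and ':' count as delimiters.
    static final Pattern BASIC = Pattern.compile(
            "([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))");
    // Extended pattern from the answer: ';', '.', ',' and '?' are delimiters too.
    static final Pattern EXTENDED = Pattern.compile(
            "([\"'].*?[\"']|(?<=[ :;.,?]|^)[a-zA-Z0-9-]+(?=[ :;.,?]|$))");

    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "one,two;three spider-man";
        // Words touching ',' or ';' are rejected by the basic lookarounds.
        System.out.println(tokenize(BASIC, input));    // [spider-man]
        System.out.println(tokenize(EXTENDED, input)); // [one, two, three, spider-man]
    }
}
```

The stricter patterns silently drop any word that touches a character outside the delimiter set, so it is worth checking the delimiter list against real input.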

If you want more liberal matching, still under the same assumption about quoting:

"([\"'].*?[\"']|[^ ]+)"

To extract matches:

Matcher m = Pattern.compile(regex).matcher(inputString);
List<String> tokens = new ArrayList<String>();
while (m.find()) {
    tokens.add(m.group(1));
}
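
Putting the pieces together, a self-contained version of the extraction loop might look like the sketch below. It uses the more liberal pattern and the example sentence from the question. The class name QuoteAwareTokenizer and the method tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAwareTokenizer {
    // Liberal pattern from the answer: a quoted run, or any run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("([\"'].*?[\"']|[^ ]+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1)); // group 1 spans the whole alternation
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "I love being spider-man : \"Peter Parker\"";
        // prints [I, love, being, spider-man, :, "Peter Parker"]
        System.out.println(tokenize(input));
    }
}
```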

#2


If it doesn't have to be a regex and the quotes in your String are well-formed (properly ordered, not like " ' some data " '), then you can do it in a single pass:

String data = "I love being spider-man : \"Peter Parker\" or 'photo reporter'";

List<String> tokens = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
boolean inSingleQuote = false;
boolean inDoubleQuote = false;

for (char c : data.toCharArray()) {
    // Toggle the quote state whenever a quote character is seen.
    if (c == '\'') inSingleQuote = !inSingleQuote;
    if (c == '"') inDoubleQuote = !inDoubleQuote;
    if (c == ' ' && !inSingleQuote && !inDoubleQuote) {
        // A space outside quotes ends the current token; repeated spaces produce no empty tokens.
        if (sb.length() > 0) {
            tokens.add(sb.toString());
            sb.setLength(0);
        }
    } else {
        sb.append(c);
    }
}
// Add the last token, if there is one.
if (sb.length() > 0) tokens.add(sb.toString());
System.out.println(tokens);

Output:

[I, love, being, spider-man, :, "Peter Parker", or, 'photo reporter']
