I use the Java Pattern class and specify the regex as a string.
For example, the input I love being spider-man : "Peter Parker"
should produce spider-man and "Peter Parker" as separate tokens. Thanks.
try (BufferedReader br = new BufferedReader(new FileReader(f))) {
    // Read the whole file into one string (line breaks are dropped).
    StringBuilder sb = new StringBuilder();
    String line = br.readLine();
    while (line != null) {
        sb.append(line);
        line = br.readLine();
    }
    String everything = sb.toString();

    // Tokenize with Lucene's PatternTokenizer; group 0 means the whole match is the token.
    List<String> result = new ArrayList<String>();
    Pattern pat = Pattern.compile("([\"'].*?[\"']|[^ ]+)");
    PatternTokenizer pt = new PatternTokenizer(new StringReader(everything), pat, 0);
    while (pt.incrementToken()) {
        result.add(pt.getAttribute(CharTermAttribute.class).toString());
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}
So I guess the reason "some word" is not working is that each token is itself a string. Any clues? Thank you.
2 solutions
#1
Check whether this regex is what you need:
"([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))"
I assume that you don't have a (single/double) quote inside a (single/double) quoted phrase.
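The assumption matters because ["'].*?["'] accepts either quote character as the closing one, so an apostrophe inside a double-quoted phrase ends the token early. A minimal sketch of one possible variation, which is not part of this answer: capture the opening quote and require the same character again through a backreference. The class name QuoteAwareTokens, the helper tokenize, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAwareTokens {
    // Group 1 is the whole token; group 2 is the opening quote,
    // and \2 requires that same character as the closing quote.
    private static final Pattern TOKEN = Pattern.compile("(([\"']).*?\\2|[^ ]+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The apostrophe in "it's" no longer closes the double-quoted phrase.
        System.out.println(tokenize("say \"it's fine\" now"));
    }
}
```

With this pattern the sample input yields the tokens say, "it's fine" and now. A quote that is never closed falls through to the [^ ]+ branch and is split on spaces like any other text.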
There is also an assumption about the delimiters: I only allow space and : to work as delimiters. Nothing will be matched in the input foo_bar, because _ is neither an allowed token character nor a delimiter. If you want to allow more delimiters, such as ; . , and ?, add them to the character class in both the lookahead and the lookbehind assertion, like this:
"([\"'].*?[\"']|(?<=[ :;.,?]|^)[a-zA-Z0-9-]+(?=[ :;.,?]|$))"
I have not tested it on every input, but I have tested it on this one:
" sdfsdf \" sdfs sdfsdfs \" \"sdfsdf\" sdfsdf sdfsd dsfshj sdfsdf-sdf 'sdfsdfsdf sd f ' "
// I used replaceAll to check the captured group
.replaceAll("([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))", "X$1Y")
And it works fine for me.
If you want more liberal capturing, but still with the same assumption about quoting:
"([\"'].*?[\"']|[^ ]+)"
To extract matches:
Matcher m = Pattern.compile(regex).matcher(inputString);
List<String> tokens = new ArrayList<String>();
while (m.find()) {
    tokens.add(m.group(1));
}
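For reference, here is a self-contained sketch that runs both patterns from this answer on the sentence from the question. The class name RegexTokenDemo and the helper tokenize are made up here; only the two regex strings come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenDemo {
    // Strict variant: unquoted tokens must be bounded by space, ':' or the ends of the input.
    static final String STRICT = "([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))";
    // Liberal variant: any run of non-space characters is a token.
    static final String LIBERAL = "([\"'].*?[\"']|[^ ]+)";

    static List<String> tokenize(String regex, String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            tokens.add(m.group(1)); // group 1 wraps the whole alternation
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "I love being spider-man : \"Peter Parker\"";
        System.out.println(tokenize(STRICT, input));
        System.out.println(tokenize(LIBERAL, input));
    }
}
```

On this input the strict pattern returns I, love, being, spider-man and "Peter Parker", dropping the lone : because it is treated as a delimiter. The liberal pattern keeps the : as its own token.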
#2
If it doesn't have to be a regex and the quotes in your string are well-formed (properly paired, not interleaved like " ' some data " '), then you can do it in a single pass, like this:
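String data = "I love being spider-man : \"Peter Parker\" or 'photo reporter'";
List<String> tokens = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
boolean inSingleQuote = false;
boolean inDoubleQuote = false;
for (char c : data.toCharArray()) {
    // A quote character only toggles its own state outside the other kind of quote.
    if (c == '\'' && !inDoubleQuote) inSingleQuote = !inSingleQuote;
    if (c == '"' && !inSingleQuote) inDoubleQuote = !inDoubleQuote;
    if (c == ' ' && !inSingleQuote && !inDoubleQuote) {
        // An unquoted space ends the current token; repeated spaces add no empty tokens.
        if (sb.length() > 0) tokens.add(sb.toString());
        sb.setLength(0);
    } else {
        sb.append(c);
    }
}
// Flush the last token, if any.
if (sb.length() > 0) tokens.add(sb.toString());
System.out.println(tokens);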
Output:
[I, love, being, spider-man, :, "Peter Parker", or, 'photo reporter']
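The loop above silently turns everything after an unclosed quote into one long final token. Below is a minimal sketch of the same single-pass idea wrapped in a method that rejects unterminated quotes instead. The class name QuoteTokenizer, the method tokenize, and the choice to throw IllegalArgumentException are assumptions for illustration, not part of this answer.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteTokenizer {
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        char openQuote = 0; // 0 means "not inside a quoted phrase"
        for (char c : data.toCharArray()) {
            if (openQuote != 0) {
                // Inside a quoted phrase: keep everything, including spaces.
                sb.append(c);
                if (c == openQuote) openQuote = 0; // the matching quote closes the phrase
            } else if (c == '\'' || c == '"') {
                openQuote = c;
                sb.append(c);
            } else if (c == ' ') {
                // An unquoted space ends the current token.
                if (sb.length() > 0) tokens.add(sb.toString());
                sb.setLength(0);
            } else {
                sb.append(c);
            }
        }
        if (openQuote != 0) {
            throw new IllegalArgumentException("Unterminated quote: " + openQuote);
        }
        if (sb.length() > 0) tokens.add(sb.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I love being spider-man : \"Peter Parker\" or 'photo reporter'"));
    }
}
```

Note that an apostrophe in an unquoted word such as don't still counts as an opening quote here, just as it does in the loop above, so such input is rejected rather than tokenized.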