I am trying to split a string on spaces and some specific special characters.
Given the string "john - & + $ ? . @ boy" I want to get the array:
array[0]="john";
array[1]="boy";
I've tried several regular expressions and gotten nowhere. Here is my current attempt:
String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.@&].*");
This preserves "john" but not "boy". Can anyone help me get the rest of the way?
6 answers
#1
6
Just use:
String[] terms = input.split("[\\s@&.?$+-]+");
You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, \ and a leading ^. However, & is meaningful only when it appears as the pair &&, and - is treated as a literal character if it is placed at the beginning or the end of the character class.
Other languages may have different rules for parsing the pattern, but the rule about - applies to most regex engines.
As @Sean Patrick Floyd mentioned in his answer, the important thing is defining what constitutes a word. By default, \w in Java is equivalent to [a-zA-Z0-9_] (upper- and lower-case English letters, digits and underscore), and therefore \W matches every other character. If you want to treat Unicode letters and digits as word characters, look at the Unicode character classes, such as \p{L} and \p{N}.
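As a minimal sketch of the character-class approach above (the class and method names here are hypothetical, chosen only for illustration), the following runs the suggested pattern against the string from the question. It also shows one behaviour worth knowing: when the input starts with a delimiter, String.split keeps an empty first element.

```java
import java.util.Arrays;

class CharClassSplit {
    // Splits on runs of whitespace or any of the characters @ & . ? $ + -
    // The '-' is last in the class, so it is taken literally.
    static String[] splitTerms(String input) {
        return input.split("[\\s@&.?$+-]+");
    }

    public static void main(String[] args) {
        // The string from the question.
        System.out.println(Arrays.toString(splitTerms("john - & + $ ? . @ boy"))); // [john, boy]

        // A delimiter at the very start leaves an empty first element.
        System.out.println(splitTerms("- john boy").length); // 3
    }
}
```

If leading delimiters are possible in your input, you may want to strip them first or skip empty elements afterwards.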
#2
4
You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting the characters that make up a word instead of blacklisting delimiters, which is usually a good idea.
And of course, this could also be done, possibly more efficiently, with Guava's Splitter class.
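Here is a small sketch of the "\\W+" split, with hypothetical class and method names. It also compares the default \W with a Unicode-aware alternative built from \p{L} and \p{N}; treating every Unicode letter and digit as a word character is an assumption made for this illustration, and your definition of a word may differ.

```java
import java.util.Arrays;

class NonWordSplit {
    // By default \W is the complement of [a-zA-Z0-9_].
    static String[] splitAscii(String input) {
        return input.split("\\W+");
    }

    // Assumption: letters and digits of any script, plus underscore, are word characters.
    static String[] splitUnicode(String input) {
        return input.split("[^\\p{L}\\p{N}_]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAscii("john - & + $ ? . @ boy"))); // [john, boy]

        // "j\u00f6hn" contains o-umlaut, which the default \W treats as a delimiter.
        String accented = "j\u00f6hn & boy";
        System.out.println(splitAscii(accented).length);   // 3: "j", "hn", "boy"
        System.out.println(splitUnicode(accented).length); // 2: the accented name and "boy"
    }
}
```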
#3
0
To add to what has been said about Splitter, you can do something like this:
import com.google.common.base.Splitter;
import java.util.regex.Pattern;
String str = "john - & + $ ? . @ boy";
Iterable<String> ttt = Splitter.on(Pattern.compile("\\W")).trimResults().omitEmptyStrings().split(str);
#4
0
Breaking it down step by step:
In your case, you first remove the non-word characters (as pointed out), but keep the whitespace so that a simple String split works afterwards.
String ugly = "john - & + $ ? . @ boy";
String words = ugly.replaceAll("[^\\w\\s]", "");
The resulting String contains long runs of spaces, which you will probably want to collapse to a single space:
String formatted = words.trim().replaceAll(" +", " ");
Now you can easily split the String into an array of words:
String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
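The three steps can be wrapped into one method, as in the sketch below (class and method names are hypothetical). It also illustrates a trade-off of this approach: because punctuation is deleted rather than treated as a delimiter, two words separated only by punctuation are merged, whereas splitting directly on non-word characters keeps them apart.

```java
import java.util.Arrays;

class StripThenSplit {
    // The step-by-step approach: delete punctuation, collapse whitespace, then split.
    static String[] stripThenSplit(String input) {
        String words = input.replaceAll("[^\\w\\s]", "");
        String formatted = words.trim().replaceAll("\\s+", " ");
        return formatted.split(" ");
    }

    // For comparison: treat every run of non-word characters as one delimiter.
    static String[] splitOnNonWord(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(stripThenSplit("john - & + $ ? . @ boy"))); // [john, boy]

        // With no whitespace between the words, deleting the '.' merges them.
        System.out.println(Arrays.toString(stripThenSplit("john.boy"))); // [johnboy]
        System.out.println(Arrays.toString(splitOnNonWord("john.boy"))); // [john, boy]
    }
}
```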
#5
0
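Use a character class:
String s = "john - & + $ ? . @ boy";
// '-' is placed last so it is literal; the trailing '+' treats a run of delimiters as one
String reg = "[!_.',@?&+$\\s-]+";
String[] res = s.split(reg);
Include every character that you want to split on inside the [ ] brackets, and follow the class with + so that a run of delimiters counts as a single split point.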
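To see why a quantifier after the character class matters, here is a small sketch with hypothetical names. Without the +, every single delimiter character is its own split point, so adjacent delimiters produce empty strings in the result.

```java
import java.util.Arrays;

class QuantifierDemo {
    // Each comma or space is a separate split point.
    static String[] splitEach(String input) {
        return input.split("[, ]");
    }

    // A run of commas and spaces is a single split point.
    static String[] splitRuns(String input) {
        return input.split("[, ]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitEach("a, b"))); // [a, , b]
        System.out.println(Arrays.toString(splitRuns("a, b"))); // [a, b]
    }
}
```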
#6
0
You can use something like the following:
arrayOfStringType = string.split(" |'|,|\\.|\\+|_");
Here | works as an "or" operator. Note that . and + are regex metacharacters, so they must be escaped with \\ to match literally.
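The escaping matters more than it might seem. In the sketch below (hypothetical class and method names), an unescaped . inside an alternation matches any character, so every character of the input becomes a delimiter and String.split returns an empty array once the trailing empty strings are dropped.

```java
import java.util.Arrays;

class DotEscapeDemo {
    // The unescaped '.' matches any character, so everything is a delimiter.
    static String[] splitUnescaped(String input) {
        return input.split(" |,|.");
    }

    // Escaped '.' and '+' match only those literal characters.
    static String[] splitEscaped(String input) {
        return input.split(" |,|\\.|\\+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitUnescaped("john.boy+x"))); // []
        System.out.println(Arrays.toString(splitEscaped("john.boy+x")));   // [john, boy, x]
    }
}
```

For a list of single-character delimiters, a character class such as [ ,.+] is usually easier to read than an alternation, because most metacharacters need no escaping inside it.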