I have been trying to split a string that contains text in Vietnamese into individual words. For example:
s = "Chào bạn, mình tên Đạt."
should be split into an array:
arr = {"Chào", "bạn", "mình", "tên", "Đạt"}
Normally, for English text, this is easily solved in one line:
arr = s.split("\\W+");
But Vietnamese has many letters outside the ASCII alphabet, and "\W" treats them as non-word characters by default, so this one line does not work. So the question is: is there a regular expression that can replace this "\W+" (I'm not very good with regular expressions)? If not, is there any other way around it?
1 solution
#1
Split the String on spaces and punctuation. You can add more punctuation marks as needed. Since some characters are reserved in regex, I prefer to put each of them in a character class [].
arr = s.split("([ ]|[.]|[,]|[:]|[?])+"); //You can customize punctuation.
This is a working example.
public class SplitExample {
    public static void main(String[] args) {
        String inputStr = "Chào bạn, mình tên Đạt.";
        // Split on runs of spaces and the listed punctuation marks.
        String[] splitArray = inputStr.split("([ ]|[.]|[,]|[:]|[?])+");
        for (String s : splitArray) {
            System.out.println(s);
        }
    }
}
Prints:
Chào
bạn
mình
tên
Đạt
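As an alternative to listing every punctuation mark, and not part of the answer above, one option is to split on anything that is not a Unicode letter, combining mark, or digit. This is a minimal sketch; the class name UnicodeSplit and the method splitWords are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

public class UnicodeSplit {
    // Hypothetical helper: splits on runs of characters that are not
    // Unicode letters (\p{L}), combining marks (\p{M}), or decimal digits (\p{Nd}).
    static String[] splitWords(String text) {
        return text.split("[^\\p{L}\\p{M}\\p{Nd}]+");
    }

    public static void main(String[] args) {
        String s = "Chào bạn, mình tên Đạt.";
        // prints [Chào, bạn, mình, tên, Đạt]
        System.out.println(Arrays.toString(splitWords(s)));
    }
}
```

Including \p{M} keeps a word intact when its accents are stored as separate combining characters. Note that String.split drops trailing empty strings but not a leading one, so an input that starts with punctuation would yield an empty first element.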
Update
With only the plain space character [ ] in the pattern, this works well for the input above. However, consider this String, which also contains a newline and a tab:
String inputStr = "Chào bạn,\n mình tên\t Đạt.";
Result (\n and \t are not in the pattern, so they stay inside the tokens):
Chào
bạn
mình
tên
Đạt
To fix it, use the whitespace character class \s instead of [ ].
String [] splitArray = inputStr.split("(\\s|[.]|[,]|[:]|[?])+");
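To check the \s version against the String with the newline and tab, here is a small self-contained sketch. The class name WhitespaceSplit and both method names are hypothetical; the second method is an assumed equivalent that folds the alternation into a single character class.

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Pattern from the answer, with \s (any whitespace) in place of [ ].
    static String[] splitWords(String text) {
        return text.split("(\\s|[.]|[,]|[:]|[?])+");
    }

    // Same set of delimiters written as one character class;
    // '.' and '?' are literal inside [].
    static String[] splitWordsCompact(String text) {
        return text.split("[\\s.,:?]+");
    }

    public static void main(String[] args) {
        String inputStr = "Chào bạn,\n mình tên\t Đạt.";
        // Both lines print [Chào, bạn, mình, tên, Đạt]
        System.out.println(Arrays.toString(splitWords(inputStr)));
        System.out.println(Arrays.toString(splitWordsCompact(inputStr)));
    }
}
```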
Or loop through the array of Strings, and trim them.
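The trimming approach could look roughly like the sketch below, which keeps the original [ ] pattern. The names TrimTokens and splitAndTrim are hypothetical, and the sketch assumes that tokens left empty after trimming (for example a token that was only "\n") should be discarded.

```java
import java.util.ArrayList;
import java.util.List;

public class TrimTokens {
    // Hypothetical helper: split with the original pattern, then clean up each token.
    static List<String> splitAndTrim(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("([ ]|[.]|[,]|[:]|[?])+")) {
            String trimmed = token.trim(); // strips leading/trailing \n, \t, spaces
            if (!trimmed.isEmpty()) {      // skip tokens that were only whitespace
                words.add(trimmed);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String inputStr = "Chào bạn,\n mình tên\t Đạt.";
        // prints [Chào, bạn, mình, tên, Đạt]
        System.out.println(splitAndTrim(inputStr));
    }
}
```

Trimming only cleans the edges of each token, so whitespace in the middle of a token (such as "a\tb" with no space around the tab) would remain; the \s pattern avoids that case.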