Split a string into Unicode words? (specifically Vietnamese)

Time: 2021-10-21 03:54:03

I have been trying to split a string that contains text in Vietnamese into individual words. For example:

s = "Chào bạn, mình tên Đạt."

It should be split into an array:

arr = {"Chào", "bạn", "mình", "tên", "Đạt"}

Normally, for English text, this would be easily solved with a single line:

arr = s.split("\\W+");

but since Vietnamese has many letters with diacritics, which "\W" treats as non-word characters, it can't be solved with just that one line. So the question is: is there a regular expression that can replace "\W+" (I'm not very good with regular expressions)? If not, is there another way around it?
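
To make the problem concrete: by default, Java's predefined class \w covers only [a-zA-Z_0-9], so \W also matches accented letters such as à or Đ and the split cuts words apart. A minimal sketch, assuming the text is in precomposed (NFC) form; the class and method names are made up for illustration:

```java
import java.util.Arrays;

// Hypothetical demo class: shows how the default (ASCII-only) \W splits Vietnamese text.
class AsciiSplitDemo {
    static String[] splitAscii(String s) {
        // Without extra flags, \W means [^a-zA-Z_0-9],
        // so letters with diacritics are treated as delimiters.
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        String s = "Chào bạn, mình tên Đạt.";
        System.out.println(Arrays.toString(splitAscii(s)));
        // prints [Ch, o, b, n, m, nh, t, n, t]
    }
}
```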

1 solution

#1


2  

Split the string by spaces and punctuation. You can add your own punctuation marks. Since some characters are reserved in regex, I prefer to put each of them in a character class [].

arr = s.split("([ ]|[.]|[,]|[:]|[?])+"); //You can customize punctuation.

This is a working example.

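// The class name is arbitrary; it is only here so the example compiles on its own.
public class SplitExample {
    public static void main(String[] args) {
        String inputStr = "Chào bạn, mình tên Đạt.";
        // Delimiter: one or more spaces or listed punctuation marks.
        String[] splitArray = inputStr.split("([ ]|[.]|[,]|[:]|[?])+");
        for (String s : splitArray) {
            System.out.println(s);
        }
    }
}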

Prints:

Chào
bạn
mình
tên
Đạt
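
Instead of listing punctuation marks one by one, another option is to split on any run of characters that are not letters. The sketch below is an alternative to the pattern above, not the answer's own method. It relies on the Unicode property classes \p{L} (letters) and \p{M} (combining marks, in case the text is stored in decomposed form), and on the embedded flag (?U), which turns on Pattern.UNICODE_CHARACTER_CLASS so that \W becomes Unicode-aware. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class: two Unicode-aware ways to split text into words.
class UnicodeSplitDemo {
    static String[] splitOnNonLetters(String s) {
        // Delimiter: one or more characters that are neither letters nor combining marks.
        // Digits count as delimiters here; add \\p{N} inside the brackets to keep them.
        return s.split("[^\\p{L}\\p{M}]+");
    }

    static String[] splitWithUnicodeFlag(String s) {
        // (?U) enables UNICODE_CHARACTER_CLASS, so \W no longer matches accented letters.
        return s.split("(?U)\\W+");
    }

    public static void main(String[] args) {
        String s = "Chào bạn, mình tên Đạt.";
        System.out.println(Arrays.toString(splitOnNonLetters(s)));
        // prints [Chào, bạn, mình, tên, Đạt]
        System.out.println(Arrays.toString(splitWithUnicodeFlag(s)));
        // prints [Chào, bạn, mình, tên, Đạt]
    }
}
```

With either pattern, an input that starts with a delimiter (for example a leading space or quote) yields an empty first element, so that case may need to be filtered out.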

Update

With only the plain space character [ ] in the pattern, this works well for simple input. However, consider this string:

 String  inputStr = "Chào  bạn,\n mình tên\t Đạt.";

Result

Chào
bạn


mình
tên 
Đạt

To fix it, use the whitespace character class, \s, instead of the plain space:

  String [] splitArray = inputStr.split("(\\s|[.]|[,]|[:]|[?])+");
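
As a quick check of the whitespace fix, here is a small self-contained sketch that runs the \s pattern on the input with mixed spaces, a newline, and a tab. The class and method names are illustrative only.

```java
import java.util.Arrays;

// Hypothetical demo class: the \s-based pattern on input with mixed whitespace.
class WhitespaceSplitDemo {
    static String[] split(String s) {
        // \s matches space, tab, newline and similar characters, so a mixed run
        // such as ",\n " is consumed as a single delimiter.
        return s.split("(\\s|[.]|[,]|[:]|[?])+");
    }

    public static void main(String[] args) {
        String inputStr = "Chào  bạn,\n mình tên\t Đạt.";
        System.out.println(Arrays.toString(split(inputStr)));
        // prints [Chào, bạn, mình, tên, Đạt]
    }
}
```

The same delimiter set can also be written more compactly as one character class, "[\\s.,:?]+", since ., : and ? are literal inside brackets.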

Or loop through the array of strings and trim each one.
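
The trimming alternative could look roughly like this. One caveat: a token made up only of whitespace (like the blank entry in the result shown earlier) becomes an empty string after trimming, so this sketch also skips empty strings. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: keep the space-only pattern, then trim and filter the tokens.
class TrimSplitDemo {
    static List<String> splitAndTrim(String s) {
        List<String> words = new ArrayList<>();
        for (String part : s.split("([ ]|[.]|[,]|[:]|[?])+")) {
            // trim() strips leading and trailing whitespace such as \n and \t.
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String inputStr = "Chào  bạn,\n mình tên\t Đạt.";
        System.out.println(splitAndTrim(inputStr));
        // prints [Chào, bạn, mình, tên, Đạt]
    }
}
```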
