Splitting a string into words, including words with accented characters

Time: 2021-05-24 21:37:37

I'm using this regex:

x.split("[^a-zA-Z0-9']+");

This returns an array of strings with letters and/or numbers.

If I use this:

String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");

I got:

CEN01
Automated
TestCase
java

But if I use this:

String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");

I got:

CEN01
Automa
o
Caso
Teste
java

How can I modify this regex to include accented characters? (á,ã,õ, etc...)

5 solutions

#1


9  

From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.

Since the Character class has an isAlphabetic method, you can use

name.split("[^\\p{IsAlphabetic}0-9']+");

You can also use

name.split("(?U)[^\\p{Alpha}0-9']+");

but this requires the UNICODE_CHARACTER_CLASS flag, which you can enable by adding (?U) to the regex.
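
To compare the two patterns above side by side, here is a minimal sketch. The class name UnicodeSplitDemo and its method names are made up for illustration; only String.split and the two regexes come from this answer. The accented input is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative, not from the original answer.
class UnicodeSplitDemo {

    // Splits on runs of characters that are not Unicode-alphabetic,
    // not ASCII digits, and not an apostrophe.
    static String[] splitWords(String name) {
        return name.split("[^\\p{IsAlphabetic}0-9']+");
    }

    // Same idea with the POSIX class \p{Alpha}; the embedded (?U) flag
    // (UNICODE_CHARACTER_CLASS) widens it from ASCII to Unicode letters.
    static String[] splitWordsUnicodeFlag(String name) {
        return name.split("(?U)[^\\p{Alpha}0-9']+");
    }

    public static void main(String[] args) {
        // The accented file name from the question (c-cedilla, a-tilde as escapes).
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitWords(name)));
        System.out.println(Arrays.toString(splitWordsUnicodeFlag(name)));
    }
}
```

Both calls should yield the five tokens CEN01, Automação, Caso, Teste, java. Without (?U), \p{Alpha} only covers ASCII letters, so the second pattern would split the accented word just like the original regex.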

#2


2  

I would check out the Java documentation on regular expressions. There is a Unicode section, which I believe is what you are looking for.

EDIT: Example

Another way would be to match on the character code you are looking for. For example

\uFFFF where FFFF is the hexadecimal number of the character you are trying to match.

Example: \u00E0 matches à

Note that the backslash needs to be escaped in Java if you are using it in a string literal.

Read more about it here.
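
As a rough sketch of the character-code approach applied to the question's regex: the range U+00C0 to U+00FF below is my own assumption (it covers the accented Latin-1 letters such as à, ã, ç and õ), and the class name EscapeRangeSplitDemo is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the code point range is an assumption, not part of the answer.
class EscapeRangeSplitDemo {

    // The doubled backslashes keep the escapes intact for the regex engine,
    // which then reads them as the code point range 00C0 to 00FF.
    // Caveat: that range also contains the multiplication sign (00D7)
    // and the division sign (00F7), which are not letters.
    static String[] splitWords(String name) {
        return name.split("[^a-zA-Z0-9'\\u00C0-\\u00FF]+");
    }

    public static void main(String[] args) {
        // The accented file name from the question (c-cedilla, a-tilde as escapes).
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitWords(name)));
    }
}
```

This should keep Automação in one piece, but any letter outside the chosen range (for example ł, U+0142) would still act as a separator, so a fixed range only suits input whose alphabet is known in advance.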

#3


2  

You can use this:

String[] names = name.split("[^a-zA-Z0-9'\\p{L}]+");

System.out.println(Arrays.toString(names)); will output:

[CEN01, Automação, Caso, Teste, java]

See this for more information.

#4


1  

Why not split on the separator characters?

String[] names = name.split("[_.]");
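
A small sketch of how this behaves on the question's input, assuming underscore and dot are the only separators that ever occur. The class name SeparatorSplitDemo is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class for the separator-based split.
class SeparatorSplitDemo {

    // Splits on every single underscore or dot; all other characters,
    // accented or not, stay inside the tokens.
    static String[] splitOnSeparators(String name) {
        return name.split("[_.]");
    }

    public static void main(String[] args) {
        // The accented file name from the question (c-cedilla, a-tilde as escapes).
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitOnSeparators(name)));

        // Without a + quantifier, two adjacent separators produce an empty token.
        System.out.println(Arrays.toString(splitOnSeparators("a__b")));
    }
}
```

Because the pattern has no + quantifier, consecutive separators such as "a__b" produce an empty string between the tokens; "[_.]+" avoids that if empty tokens are unwanted.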

#5


0  

Instead of blacklisting all the characters you don't want, you could always whitelist the characters you want, like:

^[^<>%$]*$

The expression [^(many characters here)] just matches any character that is not listed.

But that is a personal opinion.
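
One possible reading of the whitelist idea, sketched under my own assumptions: rather than splitting on unwanted characters, match runs of wanted ones with Matcher.find. The pattern [\p{L}\p{N}']+ and the class name TokenExtractDemo are illustrative choices, not something this answer specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the whitelist pattern is an assumption for illustration.
class TokenExtractDemo {

    // Whitelist: Unicode letters, Unicode digits, and the apostrophe.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}']+");

    // Collects every maximal run of whitelisted characters.
    static List<String> extractWords(String name) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(name);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // The accented file name from the question (c-cedilla, a-tilde as escapes).
        System.out.println(extractWords("CEN01_Automa\u00E7\u00E3o_Caso_Teste.java"));
    }
}
```

Unlike split, this never returns a leading empty string when the input starts with a separator.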
