I'm trying to match Unicode characters in Java.
Input string: informa
String to match: informátion
So far I've tried this:
Pattern p = Pattern.compile("informa[\u0000-\uffff].*", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}
It comes out as "No match". Any ideas?
3 answers
#1
12
The term "Unicode characters" is not specific enough. It would match every character which is in the Unicode range, thus also "normal" characters. This term is however very often used when one actually means "characters which are not in the printable ASCII range".
In regex terms that would be [^\x20-\x7E].
boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
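The one-liner above can be wrapped into a small runnable sketch. The class and method names here are hypothetical, chosen only for illustration. This variant uses `Matcher.find()` with the bare character class instead of `String.matches(".*…​.*")`, which is an assumption about what is wanted: `.` does not match line terminators by default, so the `matches` form can return false for text that spans several lines.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answer.
final class AsciiCheck {
    // Any character outside the printable ASCII range 0x20..0x7E.
    private static final Pattern NON_PRINTABLE_ASCII =
            Pattern.compile("[^\\x20-\\x7E]");

    // Returns true if the string contains at least one such character.
    static boolean containsNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("information"));      // false
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // true
    }
}
```

Note that this only tells you whether such a character is present; it does not by itself help "informa" match "informátion".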
Depending on what you'd like to do with this information, here are some useful follow-up answers:
- Get rid of special characters
- Get rid of diacritical marks
#2
6
Is it because informa isn't a substring of informátion at all?
How would your code work if you removed the last a from informa in your regex?
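A minimal sketch of that experiment, assuming the input holds the precomposed character á (U+00E1). The class and method names are hypothetical, and `Pattern.CANON_EQ` is deliberately left out so that the comparison is purely character by character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class PrefixDemo {
    // Whole-string match, case-insensitive, without canonical equivalence.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        String s = "inform\u00e1tion"; // "informátion" with precomposed á

        // The literal 'a' in the pattern is compared against 'á' and fails.
        System.out.println(matchesWhole("informa[\\u0000-\\uffff].*", s)); // false

        // Without the trailing 'a', the character class consumes 'á'.
        System.out.println(matchesWhole("inform[\\u0000-\\uffff].*", s));  // true
    }
}
```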
#3
1
It sounds like you want to match letters while ignoring diacritical marks. If that's right, then normalize your strings to NFD form, strip out the diacritical marks, and then do your search.
String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
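One way the search step might be filled in is sketched below. It assumes a case-insensitive prefix search is what is wanted, which the answer does not specify; the class and method names are hypothetical. Both the text and the search term are normalized the same way before comparing.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; names are illustrative only.
final class DiacriticInsensitiveSearch {
    // Decompose to NFD, then drop the combining diacritical marks.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Assumption: a prefix test stands in for "the search".
    static boolean startsWithIgnoringDiacritics(String text, String prefix) {
        String t = stripDiacritics(text).toLowerCase(Locale.ROOT);
        String p = stripDiacritics(prefix).toLowerCase(Locale.ROOT);
        return t.startsWith(p);
    }

    public static void main(String[] args) {
        String text = "inform\u00e1tion"; // "informátion"
        System.out.println(stripDiacritics(text));                         // information
        System.out.println(startsWithIgnoringDiacritics(text, "informa")); // true
    }
}
```

Stripping marks is lossy, so keep the original string around if it needs to be displayed or stored afterwards.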
To learn more about NFD:
- https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
- http://unicode.org/faq/normalization.html