Is there a way to achieve transliteration of characters between charsets in Java? Something similar to the Unix command (or the similar PHP function):
iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt > new_doc.txt
Preferably operating on strings, not having anything to do with files.
I know you can change encodings with the String constructor, but that doesn't handle transliteration of characters that aren't in the resulting charset.
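For reference, a minimal sketch of that non-transliterating behavior (the input string and class name are just illustrations): re-encoding through US-ASCII substitutes '?' for unmappable characters instead of approximating them.

import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        // getBytes() replaces characters US-ASCII cannot encode with '?',
        // rather than transliterating them (e.g. "è" -> "e").
        byte[] bytes = "Michèle".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // Mich?le
    }
}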
3 Answers
#1
10
I'm not aware of any libraries that do exactly what iconv purports to do (which doesn't seem very well defined). However, you can use "normalization" in Java to do things like remove accents from characters. This process is well defined by Unicode standards.
I think NFKD (compatibility decomposition) followed by a filtering of non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.
import java.text.Normalizer;

/* Decompose the original "accented" string into base characters plus
   combining marks. */
String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);

/* Build a new String keeping only ASCII characters; combining marks
   (and anything else outside ASCII) are dropped. */
StringBuilder buf = new StringBuilder();
for (int idx = 0; idx < decomposed.length(); ++idx) {
    char ch = decomposed.charAt(idx);
    if (ch < 128) {
        buf.append(ch);
    }
}
String filtered = buf.toString();
With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely because none of them have an ASCII representation (this is more like iconv's //IGNORE).
Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and things) that are safe to strip. The best solution depends on the range of input characters you expect to handle.
#2
4
Let's start with a slight variation of Ericson's answer and build more //TRANSLIT features on it:
Decompose characters to obtain an ASCII-only String
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;

public class Translit {
    private static final Charset US_ASCII = Charset.forName("US-ASCII");

    private static String toAscii(final String input) {
        final CharsetEncoder charsetEncoder = US_ASCII.newEncoder();
        final char[] decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD).toCharArray();
        final StringBuilder sb = new StringBuilder(decomposed.length);
        for (int i = 0; i < decomposed.length; ) {
            final int codePoint = Character.codePointAt(decomposed, i);
            final int charCount = Character.charCount(codePoint);
            // Keep the code point only if the target charset can encode it.
            if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, i, charCount))) {
                sb.append(decomposed, i, charCount);
            }
            i += charCount;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final String a = "Michèleäöüß";
        System.out.println(a + " => " + toAscii(a));
        System.out.println(a.toUpperCase() + " => " + toAscii(a.toUpperCase()));
    }
}
While this should behave the same for US-ASCII, this solution is easier to adapt to different target encodings. (As characters are decomposed first, this does not necessarily yield better results for other encodings, though.)
The function is safe for supplementary code points (which is a bit of overkill with ASCII as the target, but may reduce headaches if another target encoding is chosen).
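As a quick illustration of the supplementary-code-point handling, the following could be added to the main method above (a sketch; U+1D518, the mathematical fraktur 'U', is stored as a surrogate pair in Java and NFKD-decomposes to a plain 'U'):

final String fraktur = "\uD835\uDD18nicode"; // "𝔘nicode"
System.out.println(fraktur + " => " + toAscii(fraktur)); // 𝔘nicode => Unicode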
Also note that a regular Java String is returned; if you need an ASCII byte[], you still need to convert it (but as we ensured there are no offending characters...).
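That final conversion is then trivial, for example (a sketch using the toAscii method above; requires java.nio.charset.StandardCharsets):

// Lossless here, because toAscii() only kept characters the US-ASCII encoder accepts.
byte[] asciiBytes = toAscii("Michèleäöüß").getBytes(StandardCharsets.US_ASCII);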
And this is how you could extend it to more character sets:
Replace or decompose characters to obtain a String encodeable in the supplied Charset
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Created for http://*.com/a/22841035/1266906
 */
public class Translit {
    public static final Charset US_ASCII = Charset.forName("US-ASCII");
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");

    public static final HashMap<Integer, String> REPLACEMENTS =
            new ReplacementBuilder().put('„', '"')
                                    .put('“', '"')
                                    .put('”', '"')
                                    .put('″', '"')
                                    .put('€', "EUR")
                                    .put('ß', "ss")
                                    .put('•', '*')
                                    .getMap();

    private static String toCharset(final String input, Charset charset) {
        return toCharset(input, charset, Collections.<Integer, String>emptyMap());
    }

    private static String toCharset(final String input,
                                    Charset charset,
                                    Map<? super Integer, ? extends String> replacements) {
        final CharsetEncoder charsetEncoder = charset.newEncoder();
        return toCharset(input, charsetEncoder, replacements);
    }

    private static String toCharset(String input,
                                    CharsetEncoder charsetEncoder,
                                    Map<? super Integer, ? extends String> replacements) {
        char[] data = input.toCharArray();
        final StringBuilder sb = new StringBuilder(data.length);
        for (int i = 0; i < data.length; ) {
            final int codePoint = Character.codePointAt(data, i);
            final int charCount = Character.charCount(codePoint);
            CharBuffer charBuffer = CharBuffer.wrap(data, i, charCount);
            if (charsetEncoder.canEncode(charBuffer)) {
                sb.append(data, i, charCount);
            } else if (replacements.containsKey(codePoint)) {
                // Replacements may themselves contain replaceable characters, hence the recursion.
                sb.append(toCharset(replacements.get(codePoint), charsetEncoder, replacements));
            } else {
                // Only perform NFKD normalization after ensuring the original
                // character is invalid, as this is an irreversible process.
                final char[] decomposed = Normalizer.normalize(charBuffer, Normalizer.Form.NFKD).toCharArray();
                for (int j = 0; j < decomposed.length; ) {
                    int decomposedCodePoint = Character.codePointAt(decomposed, j);
                    int decomposedCharCount = Character.charCount(decomposedCodePoint);
                    if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, j, decomposedCharCount))) {
                        sb.append(decomposed, j, decomposedCharCount);
                    } else if (replacements.containsKey(decomposedCodePoint)) {
                        sb.append(toCharset(replacements.get(decomposedCodePoint), charsetEncoder, replacements));
                    }
                    j += decomposedCharCount;
                }
            }
            i += charCount;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final String a = "Michèleäöü߀„“”″•";
        System.out.println(a + " => " + toCharset(a, US_ASCII));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1));
        System.out.println(a + " => " + toCharset(a, UTF_8));
        System.out.println(a + " => " + toCharset(a, US_ASCII, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, UTF_8, REPLACEMENTS));
    }

    public static class MapBuilder<K, V> {
        private final HashMap<K, V> map;

        public MapBuilder() {
            map = new HashMap<K, V>();
        }

        public MapBuilder<K, V> put(K key, V value) {
            map.put(key, value);
            return this;
        }

        public HashMap<K, V> getMap() {
            return map;
        }
    }

    public static class ReplacementBuilder extends MapBuilder<Integer, String> {
        public ReplacementBuilder() {
            super();
        }

        @Override
        public ReplacementBuilder put(Integer input, String replacement) {
            super.put(input, replacement);
            return this;
        }

        public ReplacementBuilder put(Integer input, char replacement) {
            return this.put(input, String.valueOf(replacement));
        }

        public ReplacementBuilder put(char input, String replacement) {
            return this.put((int) input, replacement);
        }

        public ReplacementBuilder put(char input, char replacement) {
            return this.put((int) input, String.valueOf(replacement));
        }
    }
}
I would strongly recommend building an extensive replacement table, as the simple example already shows how you might otherwise lose desired information like €. For ASCII this implementation is of course a bit slower, since decomposition is only done on demand and the StringBuilder may now need to grow to hold the replacements.
GNU's iconv uses the replacements listed in translit.def to perform a //TRANSLIT conversion, and you can use a method like the following if you want to use that file as a replacement map:
Import the original //TRANSLIT replacements
// Additional imports needed: java.io.BufferedReader, java.io.IOException,
// java.io.InputStream, java.io.InputStreamReader,
// java.util.regex.Matcher, java.util.regex.Pattern
private static Map<Integer, String> readReplacements() {
    HashMap<Integer, String> map = new HashMap<>();
    InputStream stream = Translit.class.getResourceAsStream("/translit.def");
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(stream, UTF_8));
    // translit.def lines look like: <hex code point> TAB <replacement> TAB # <comment>
    Pattern pattern = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");
    try {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Skip empty lines and comment lines.
            if (!line.isEmpty() && line.charAt(0) != '#') {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return map;
}
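To wire this into the converter above, pass the parsed map to toCharset, for example (a sketch; it assumes glibc's translit.def has been copied to the classpath root, as readReplacements expects):

Map<Integer, String> translit = readReplacements();
System.out.println(toCharset("Michèleäöü߀„“”″•", US_ASCII, translit));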
#3
3
One solution is to execute iconv as an external process. It will certainly offend purists. It depends on the presence of iconv on the system, but it works and does exactly what you want:
// Requires java.io.* and java.nio.charset.StandardCharsets
public static String utfToAscii(String input) throws IOException {
    Process p = Runtime.getRuntime().exec("iconv -f UTF-8 -t ASCII//TRANSLIT");
    // Feed the input as UTF-8, matching the -f option above.
    BufferedWriter bwo = new BufferedWriter(
            new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8));
    BufferedReader bri = new BufferedReader(
            new InputStreamReader(p.getInputStream(), StandardCharsets.US_ASCII));
    bwo.write(input, 0, input.length());
    bwo.flush();
    bwo.close();
    String line;
    StringBuilder stringBuilder = new StringBuilder();
    String ls = System.getProperty("line.separator");
    while ((line = bri.readLine()) != null) {
        stringBuilder.append(line);
        stringBuilder.append(ls);
    }
    bri.close();
    try {
        p.waitFor();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag
    }
    return stringBuilder.toString();
}
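A minimal call could look like the following sketch (assuming iconv is on the PATH). Be aware that writing the entire input before reading any output can deadlock for large inputs, since the pipe buffers between the JVM and the child process are finite; for robust use, the two streams should be serviced concurrently.

public static void main(String[] args) throws IOException {
    // Exact output depends on the platform's iconv and locale settings.
    System.out.println(utfToAscii("Michèleäöüß"));
}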