How to convert a Unicode-encoded string to a string of letters

Time: 2021-03-23 20:13:39

I have a string containing Unicode escape sequences of the form \uXXXX, and I want to convert them to the regular characters they represent (UTF-8). For example:

String myString = "\u0048\u0065\u006C\u006C\u006F World";

should become

"Hello World"

I know that when I print the string it shows Hello World. My problem is that I read file names from a file on a Unix machine and then search for those files. The file names are written with Unicode escapes, so the search fails: it looks for a file with a literal \uXXXX in its name.

12 solutions

#1


27  

Technically doing:

String myString = "\u0048\u0065\u006C\u006C\u006F World";

automatically converts it to "Hello World", so I assume you are reading the string in from some file. To convert it to "Hello" you'll have to parse the text into the separate Unicode code values (take each \uXXXX and keep just the XXXX), then call Integer.parseInt(XXXX, 16) to get the numeric value, and then cast that to char to get the actual character.

Edit: Some code to accomplish this:

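// Works only when the first word consists entirely of escape sequences.
String str = myString.split(" ")[0];      // first word, i.e. the escaped "Hello"
str = str.replace("\\", "");              // drop the backslashes
String[] arr = str.split("u");            // arr[0] is empty, the rest are hex codes
StringBuilder text = new StringBuilder();
for (int i = 1; i < arr.length; i++) {
    int hexVal = Integer.parseInt(arr[i], 16);
    text.append((char) hexVal);
}
// text.toString() is now "Hello"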

#2


66  

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

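// Commons Lang 2.x package; Lang 3 uses org.apache.commons.lang3.StringEscapeUtils
import org.apache.commons.lang.StringEscapeUtils;
import org.junit.Test; // assumes JUnit 4

@Test
public void testUnescapeJava() {
    // The doubled backslashes make the string hold literal escape text at run time
    String sJava = "\\u0048\\u0065\\u006C\\u006C\\u006F";
    System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}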


 output:
 StringEscapeUtils.unescapeJava(sJava):
 Hello

#3


18  

You can use StringEscapeUtils from Apache Commons Lang, e.g.:

// The backslashes are doubled so the string really contains escape text
String unicode = "\\u0048\\u0065\\u006C\\u006C\\u006F";
String title = StringEscapeUtils.unescapeJava(unicode);

#4


9  

Byte Encodings and Strings

In Java, the String class provides the following for converting a byte array (byte[]) to a string (String) and back:

The constructor String(byte[] bytes, String enc) takes the input bytes together with the name of their encoding; if the encoding is omitted, the default one is used.

The method getBytes(String enc) returns the string's bytes in the specified encoding; the encoding can also be omitted.

import java.io.UnsupportedEncodingException;

try {
    String myString = "\u0048\u0065\u006C\u006C\u006F World";
    byte[] utf8Bytes = myString.getBytes("UTF-8");   // encode to UTF-8 bytes
    String text = new String(utf8Bytes, "UTF-8");    // decode the bytes back
}
catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

UPDATE:

Since Java 1.7, use StandardCharsets.UTF_8:

import java.nio.charset.StandardCharsets;

String utf8Text = "\u0048\u0065\u006C\u006C\u006F World";
byte[] bytes = utf8Text.getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);
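To make the effect of getBytes and the String constructor concrete, here is a minimal sketch. The class name EncodingRoundTrip and the sample strings are made up for illustration. It shows that the byte count depends on the chosen charset, and that a string holding literal backslash-u text comes back unchanged from a UTF-8 round trip, so the escapes would still need to be parsed separately.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingRoundTrip {
    // Encodes the text as UTF-8 bytes and decodes those bytes again.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String accented = "caf\u00E9"; // four characters, the last one is e-acute
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);      // 5
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length); // 4

        // Text as read from a file: a literal backslash, the letter u, then hex digits.
        String escaped = "\\u0048\\u0065\\u006C\\u006C\\u006F";
        System.out.println(roundTrip(escaped).equals(escaped)); // true
    }
}
```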

#5


7  

This simple method will work for most cases, but it trips over input such as "\u005Cu0048", which should decode to the string "\u0048" but actually decodes to "H": the first pass produces "\u0048" as the working string, which the while loop then processes again.

static final String decode(final String in)
{
    String working = in;
    int index;
    index = working.indexOf("\\u");
    while(index > -1)
    {
        int length = working.length();
        if(index > (length-6))break;
        int numStart = index + 2;
        int numFinish = numStart + 4;
        String substring = working.substring(numStart, numFinish);
        int number = Integer.parseInt(substring,16);
        String stringStart = working.substring(0, index);
        String stringEnd   = working.substring(numFinish);
        working = stringStart + ((char)number) + stringEnd;
        index = working.indexOf("\\u");
    }
    return working;
}
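One possible way around the double-decoding problem described above is to build the output in a single left-to-right pass, so that characters produced by a decoded escape are never scanned again. The following is only a sketch under that assumption; the class name SinglePassUnescaper and the helper isHex are invented here. Sequences that are truncated or not valid hex are copied through unchanged.

```java
// Hypothetical single-pass variant, for illustration only.
final class SinglePassUnescaper {
    // Decodes backslash-u escapes (four hex digits) while walking the input once.
    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith("\\u", i) && i + 6 <= in.length() && isHex(in, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6; // skip the whole escape; its result is not re-examined
            } else {
                out.append(in.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in [from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Prints: Hello World
        System.out.println(decode("\\u0048\\u0065\\u006C\\u006C\\u006F World"));
        // Prints a backslash followed by u0048, not H
        System.out.println(decode("\\u005Cu0048"));
    }
}
```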

#6


4  

It's not totally clear from your question, but I'm assuming you're saying that you have a file in which each line is a filename, and each filename looks something like this:

\u0048\u0065\u006C\u006C\u006F

In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.

If so, what you're seeing is expected. Java only translates \uXXXX sequences in source code, such as in string literals (and when reading in stored Properties objects). When you read the contents of your file, you get a string consisting of the characters \, u, 0, 0, 4, 8 and so on, not the string Hello.

So you will need to parse that string to extract the 0048, 0065, etc. pieces, convert them to chars, build a string from those chars, and then pass that string to the routine that opens the file.
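The parsing steps described here could be sketched with a regular expression as follows. This is an illustration only: the class name EscapedFileNames, the sample line, and the use of Files.exists as the "routine that opens the file" are assumptions, not part of the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EscapedFileNames {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every escape in the line with the character it stands for.
    static String unescape(String line) {
        Matcher m = ESCAPE.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stands in for one line read from the file of filenames.
        String lineFromFile = "\\u0048\\u0065\\u006C\\u006C\\u006F.txt";
        Path candidate = Paths.get(unescape(lineFromFile));
        System.out.println(candidate + " exists: " + Files.exists(candidate));
    }
}
```

Because the matcher scans the original input only once, text produced by a decoded escape is not decoded a second time.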

#7


3  

Try:

import java.nio.charset.Charset;
private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) { return new String(input.getBytes(UTF_8), UTF_8); }

#8


2  

Shorter version:

public static String unescapeJava(String escaped) {
    if(escaped.indexOf("\\u")==-1)
        return escaped;

    String processed="";

    int position=escaped.indexOf("\\u");
    while(position!=-1) {
        if(position!=0)
            processed+=escaped.substring(0,position);
        String token=escaped.substring(position+2,position+6);
        escaped=escaped.substring(position+6);
        processed+=(char)Integer.parseInt(token,16);
        position=escaped.indexOf("\\u");
    }
    processed+=escaped;

    return processed;
}

#9


1  

One easy way I know of uses JSONObject:

try {
    JSONObject json = new JSONObject();
    json.put("string", myString);
    String converted = json.getString("string");

} catch (JSONException e) {
    e.printStackTrace();
}

#10


1  

Solution for Kotlin:

val result = String(someText.toByteArray())

Kotlin uses UTF-8 as the default encoding for these conversions.

You can also implement it as an extension function on the String class:

fun String.unescape(): String {
    return String(this.toByteArray())
}

and then simply use it:

val result = someText.unescape()

;)

#11


0  

Actually, I wrote an open-source library that contains some utilities. One of them converts a Unicode sequence to a String and vice versa. I found it very useful. Here is a quote about the Unicode converter from the article about this library:

Class StringUnicodeEncoderDecoder has methods that can convert a String (in any language) into a sequence of Unicode characters and vice versa. For example, the String "Hello World" will be converted into

"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"

and may be converted back.
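For readers who only need the encoding direction, a plain-Java sketch might look like the following. This is not the library's code or API: the class name UnicodeEscaper and its encode method are invented for illustration, and unlike the quoted output it inserts no extra spaces.

```java
// Hypothetical stand-alone encoder, not part of the library mentioned above.
final class UnicodeEscaper {
    // Writes every UTF-16 code unit as backslash-u plus four lowercase hex digits.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            out.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the five escapes for H, e, l, l, o, each six characters long.
        System.out.println(encode("Hello"));
    }
}
```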

Here is the link to the entire article, which explains what utilities the library has and how to get the library: "Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison". It is available as a Maven artifact or as source on GitHub, and it is very easy to use.

#12


0  

Here is my solution...

String decodedName = JwtJson.substring(startOfName, endOfName);
StringBuilder builtName = new StringBuilder();
int i = 0;
while (i < decodedName.length()) {
    if (decodedName.startsWith("\\u", i)) {
        // skip the backslash and 'u', then decode the four hex digits
        i += 2;
        builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i, i + 4), 16)));
        i += 4;
    } else {
        builtName.append(decodedName.charAt(i));
        i++;
    }
}
