Convert a string (UTF-16) to UTF-8 in C#

I need to convert a string to UTF-8 in C#. I've already tried many ways, but none works the way I want. I converted my string into a byte array and then tried to write it to an XML file (whose encoding is UTF-8), but either I got the same string (not encoded at all) or I got a list of bytes, which is useless. Has anyone faced the same issue?

Edit: This is some of the code I used:

string str = "testé";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(str);
return Encoding.UTF8.GetString(utf8Bytes);

The result is "testé", but I expected something like "testÃ©"...

6 solutions

#1

If you want a UTF8 string where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

public static string Utf16ToUtf8(string utf16String)
{
   /**************************************************************
    * Every .NET string will store text with the UTF16 encoding, *
    * known as Encoding.Unicode. Other encodings may exist as    *
    * Byte-Array or incorrectly stored with the UTF16 encoding.  *
    *                                                            *
    * UTF8 = 1 to 4 bytes per code point                         *
    *    ["100" for the ansi 'd']                                *
    *    ["206" and "186" for the Greek 'κ']                     *
    *                                                            *
    * UTF16 = 2 or 4 bytes per code point                        *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["186, 3" for the Greek 'κ']                            *
    *                                                            *
    * UTF8 inside UTF16                                          *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["206, 0" and "186, 0" for the Greek 'κ']               *
    *                                                            *
    * We can use the convert encoding function to convert an     *
    * UTF16 Byte-Array to an UTF8 Byte-Array. When we use UTF8   *
    * encoding to string method now, we will get a UTF16 string. *
    *                                                            *
    * So we imitate UTF16 by filling the second byte of a char   *
    * with a 0 byte (binary 0) while creating the string.        *
    **************************************************************/

    // Storage for the UTF8 string
    string utf8String = String.Empty;

    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Fill UTF8 bytes inside UTF8 string
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        // Because char always saves 2 bytes, fill char with 0
        byte[] utf8Container = new byte[2] { utf8Bytes[i], 0 };
        utf8String += BitConverter.ToChar(utf8Container, 0);
    }

    // Return UTF8
    return utf8String;
}

In my case the DLL also expects a UTF8 string, but unfortunately the UTF8 bytes must be interpreted as ANSI characters and stored with UTF16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–', which is 150, has to be converted to the UTF16 '–', which is 8211. If you have this case too, you can use the following instead:

public static string Utf16ToUtf8(string utf16String)
{
    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

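    // Return UTF8 bytes as ANSI string (Encoding.Default is the system ANSI
    // code page on .NET Framework, but UTF-8 on .NET Core and later)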
    return Encoding.Default.GetString(utf8Bytes);
}

Or the native method:

[DllImport("kernel32.dll")]
private static extern Int32 WideCharToMultiByte(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPWStr)] String lpWideCharStr, Int32 cchWideChar, [Out, MarshalAs(UnmanagedType.LPStr)] StringBuilder lpMultiByteStr, Int32 cbMultiByte, IntPtr lpDefaultChar, IntPtr lpUsedDefaultChar);

public static string Utf16ToUtf8(string utf16String)
{
    // Pass -1 in both calls so the reported size includes the terminating null that the second call writes
    Int32 iNewDataLen = WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, -1, null, 0, IntPtr.Zero, IntPtr.Zero);
    if (iNewDataLen > 1)
    {
        StringBuilder utf8String = new StringBuilder(iNewDataLen);
        WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, -1, utf8String, utf8String.Capacity, IntPtr.Zero, IntPtr.Zero);

        return utf8String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf8ToUtf16. Hope I could be of help.
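
The Utf8ToUtf16 function mentioned here is not shown in this thread. As a rough sketch of what the reverse direction could look like for the first variant (one UTF8 byte stored per char), and not the referenced implementation, something like the following might work. The class name `Utf8InUtf16` is made up for illustration, and the sketch assumes every char of the input is in the range 0 to 255:

```csharp
using System;
using System.Text;

public static class Utf8InUtf16
{
    // ISO-8859-1 maps each char 0..255 to the byte with the same value,
    // so it undoes the "one UTF8 byte per char" packing.
    // Chars above 255 cannot be mapped and are replaced by '?'.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Hypothetical inverse of Utf16ToUtf8 (first variant only)
    public static string Utf8ToUtf16(string utf8InUtf16)
    {
        // Recover the raw UTF8 bytes from the low byte of every char
        byte[] utf8Bytes = Latin1.GetBytes(utf8InUtf16);

        // Decode the UTF8 bytes into a normal .NET (UTF16) string
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

For example, `Utf8InUtf16.Utf8ToUtf16("test\u00C3\u00A9")` should give back "testé". It does not reverse the Encoding.Default variant, which would need the ANSI code page instead of ISO-8859-1.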

#2

A string in C# is always UTF-16; there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory. It only matters when you write the string to a stream (file, memory stream, network stream...).

If you want to write the string to an XML file, just specify the encoding when you create the XmlWriter.
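
A minimal sketch of that idea, assuming the encoding is passed through `XmlWriterSettings`. The class name `XmlUtf8Example` and the element names `root` and `value` are placeholders, not part of the question:

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlUtf8Example
{
    // Serializes a tiny XML document containing the given text as UTF-8 bytes
    public static byte[] WriteXml(string text)
    {
        var settings = new XmlWriterSettings
        {
            // UTF-8 without a byte order mark; the writer encodes the string
            Encoding = new UTF8Encoding(false),
            Indent = true
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                writer.WriteStartElement("root");        // placeholder element name
                writer.WriteElementString("value", text); // placeholder element name
                writer.WriteEndElement();
                writer.WriteEndDocument();
            }

            // The writer has been flushed by Dispose, so the bytes are complete
            return stream.ToArray();
        }
    }
}
```

The string itself is never converted by hand. To write straight to disk, `XmlWriter.Create` also accepts a file name together with the same settings object.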

#3

    private static string Utf16ToUtf8(string utf16String)
    {
        /**************************************************************
         * Every .NET string will store text with the UTF16 encoding, *
         * known as Encoding.Unicode. Other encodings may exist as    *
         * Byte-Array or incorrectly stored with the UTF16 encoding.  *
         *                                                            *
         * UTF8 = 1 to 4 bytes per code point                         *
         *    ["100" for the ansi 'd']                                *
         *    ["206" and "186" for the Greek 'κ']                     *
         *                                                            *
         * UTF16 = 2 or 4 bytes per code point                        *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["186, 3" for the Greek 'κ']                            *
         *                                                            *
         * UTF8 inside UTF16                                          *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["206, 0" and "186, 0" for the Greek 'κ']               *
         *                                                            *
         * We can use the convert encoding function to convert an     *
         * UTF16 Byte-Array to an UTF8 Byte-Array. When we use UTF8   *
         * encoding to string method now, we will get a UTF16 string. *
         *                                                            *
         * So we imitate UTF16 by filling the second byte of a char   *
         * with a 0 byte (binary 0) while creating the string.        *
         **************************************************************/

        // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
        char[] chars = new char[utf8Bytes.Length];

        for (int i = 0; i < utf8Bytes.Length; i++)
        {
            // Widen each UTF8 byte to a char; the high byte stays 0
            chars[i] = (char)utf8Bytes[i];
        }

        // Return UTF8
        return new String(chars);
    }

In the original post, the author concatenated strings. Strings in .NET are immutable reference types, so every string operation creates a new string. As a result, the function provided there is noticeably slow. Don't do that. Use an array of chars instead, write into it directly, and then convert the result to a string. In my case, processing 500 KB of text, the difference was almost 5 minutes.
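
As a possible shortcut, and only a sketch rather than what either answer uses, the per-byte loop can be replaced by decoding the UTF8 bytes as ISO-8859-1, which maps every byte 0 to 255 to the char with the same value. The class name `Utf8Packing` is made up for illustration:

```csharp
using System.Text;

public static class Utf8Packing
{
    // Returns a string whose chars each hold one UTF8 byte of the input
    public static string Utf16ToUtf8(string utf16String)
    {
        // Encode the .NET (UTF16) string directly to UTF8 bytes
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf16String);

        // ISO-8859-1 turns byte n into char n, with no per-char allocation
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }
}
```

With this sketch, `Utf8Packing.Utf16ToUtf8("testé")` should yield "testÃ©", the form the question was expecting.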

#4

Check Jon Skeet's answer to this other question: UTF-16 to UTF-8 conversion (for scripting in Windows)

It contains the source code that you need.

Hope it helps.
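
The linked answer is not reproduced in this thread. As a rough idea of what a file-level conversion might look like (this is an assumption about the use case, not the code from that answer; `FileTranscoder` and its method name are made up):

```csharp
using System.IO;
using System.Text;

public static class FileTranscoder
{
    // Reads a UTF-16 (little-endian) text file and rewrites it as UTF-8
    public static void ConvertUtf16FileToUtf8(string sourcePath, string targetPath)
    {
        // Decode the file into a .NET string; a byte order mark, if present,
        // is detected and not included in the returned text
        string text = File.ReadAllText(sourcePath, Encoding.Unicode);

        // Encode the string as UTF-8 without a byte order mark
        File.WriteAllText(targetPath, text, new UTF8Encoding(false));
    }
}
```

The string in between is still UTF-16 in memory; only the bytes on disk change encoding.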

#5

Does this example help?

using System;
using System.IO;
using System.Text;

class Test
{
    public static void Main()
    {
        using (StreamWriter output = new StreamWriter("practice.txt"))
        {
            // Create and write a string containing the symbol for Pi.
            string srcString = "Area = \u03A0r^2";

            // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
            byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
            byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

            // Write the UTF-8 and ASCII encoded byte arrays.
            output.WriteLine("UTF-8  Bytes: {0}", BitConverter.ToString(utf8String));
            output.WriteLine("ASCII  Bytes: {0}", BitConverter.ToString(asciiString));

            // Convert UTF-8 and ASCII encoded bytes back to UTF-16 encoded
            // string and write.
            output.WriteLine("UTF-8  Text : {0}", Encoding.UTF8.GetString(utf8String));
            output.WriteLine("ASCII  Text : {0}", Encoding.ASCII.GetString(asciiString));

            Console.WriteLine(Encoding.UTF8.GetString(utf8String));
            Console.WriteLine(Encoding.ASCII.GetString(asciiString));
        }
    }
}

#6

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        String unicodeString =
        "This Unicode string contains two characters " +
        "with codes outside the traditional ASCII code range, " +
        "Pi (\u03a0) and Sigma (\u03a3).";

        Console.WriteLine("Original string:");
        Console.WriteLine(unicodeString);
        UnicodeEncoding unicodeEncoding = new UnicodeEncoding();
        byte[] utf16Bytes = unicodeEncoding.GetBytes(unicodeString);
        // GetBytes emits no byte order mark, so decode from the first byte
        char[] chars = unicodeEncoding.GetChars(utf16Bytes);
        string s = new string(chars);
        Console.WriteLine();
        Console.WriteLine("Char Array:");
        foreach (char c in chars) Console.Write(c);
        Console.WriteLine();
        Console.WriteLine();
        Console.WriteLine("String from Char Array:");
        Console.WriteLine(s);

        Console.ReadKey();
    }
}

#1