How to convert a Unicode character to its ASCII equivalent

Date: 2020-12-14 00:28:23

Here's the problem:

In C# I'm getting information from a legacy ACCESS database. .NET converts the content of the database (in the case of this problem a string) to Unicode before handing the content to me.

How do I convert this Unicode string back to its ASCII equivalent?


Edit
Unicode char 710 is indeed MODIFIER LETTER CIRCUMFLEX ACCENT. Here's the problem stated a bit more precisely:

 -> (Extended) ASCII character ê (Extended ASCII 136) was inserted in the database.
 -> Either Access or the reading component in .NET converted this to U+02C6 U+0065
    (MODIFIER LETTER CIRCUMFLEX ACCENT + LATIN SMALL LETTER E)
 -> I need the (Extended) ASCII character 136 back.


Here's what I've tried (I see now why this did not work...):

string myInput = Convert.ToString(Convert.ToChar(710));
byte[] asBytes = Encoding.ASCII.GetBytes(myInput);

But this results in a byte with value 63, not 94...
Here's a new try, but it still does not work:

byte[] bytes = Encoding.ASCII.GetBytes("ê");


Solution
Thanks to both csgero and bzlm for pointing me in the right direction, I solved the problem here.

5 solutions

#1


9  

Okay, let's elaborate. Both csgero and bzlm pointed in the right direction.

Because of bzlm's reply I looked up the Windows-1252 page on Wikipedia and found that it's called a code page. The Wikipedia article on code pages states the following:

No formal standard existed for these ‘extended character sets’; IBM merely referred to the variants as code pages, as it had always done for variants of EBCDIC encodings.

This led me to codepage 437:

In ASCII-compatible code pages, the lower 128 characters maintained their standard US-ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used code page 437, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters.

So code page 437 was the code page I had been calling 'extended ASCII'. It has ê as character 136; I looked up some other characters as well, and they seem right.

csgero came up with the Encoding.GetEncoding() hint, which I used to write the following statement that solves my problem:

byte[] bytes = Encoding.GetEncoding(437).GetBytes("ê");
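
For anyone who wants to try this, here is the same statement as a small self-contained sketch. The class and method names (Cp437Demo, ToCp437) are made up for illustration. One assumption to be aware of: on .NET Core and .NET 5+ the legacy code pages are not available until CodePagesEncodingProvider is registered, while on the classic .NET Framework Encoding.GetEncoding(437) works without that line (there, drop the registration call unless the System.Text.Encoding.CodePages package is referenced).

```csharp
using System;
using System.Text;

public static class Cp437Demo
{
    // Encodes text with code page 437, the "extended ASCII" from the question.
    public static byte[] ToCp437(string text)
    {
        // Assumption: needed on .NET Core / .NET 5+ only; registering the
        // provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437).GetBytes(text);
    }

    public static void Main()
    {
        // "\u00EA" is ê, written as an escape so the result does not depend
        // on how the source file itself is encoded.
        byte[] bytes = ToCp437("\u00EA");
        Console.WriteLine(bytes[0]); // prints 136
    }
}
```

The escape sequence is a deliberate choice: a literal "ê" in the source only works if the compiler reads the file with the encoding it was saved in.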

#2


3  

You cannot use the default ASCII encoding (Encoding.ASCII) here, but must create the encoding with the appropriate code page using Encoding.GetEncoding(...). You might try to use code page 1252, which is a superset of ISO 8859-1.
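
To see why the choice of code page matters, the following sketch encodes the same character with three code pages. CodePageCompare and EncodeChar are hypothetical names; the code page numbers 437, 1252 and 28591 (ISO 8859-1) are the standard Windows identifiers. As above, the provider registration is assumed to be needed only on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

public static class CodePageCompare
{
    // Returns the first byte the given code page produces for the character.
    public static byte EncodeChar(int codePage, char c)
    {
        // Assumption: required on .NET Core / .NET 5+ for legacy code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetBytes(c.ToString())[0];
    }

    public static void Main()
    {
        char eCircumflex = '\u00EA'; // ê

        foreach (int codePage in new[] { 437, 1252, 28591 })
        {
            Console.WriteLine(codePage + ": " + EncodeChar(codePage, eCircumflex));
        }
        // prints:
        // 437: 136
        // 1252: 234
        // 28591: 234
    }
}
```

So code page 1252 only gives the byte 136 the question asks for if the data was never code page 437 to begin with; for ê it yields 234.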

#3


2  

ASCII does not define ê; the number 136 is the code of the circumflex (ˆ, U+02C6) in 8-bit encodings such as Windows-1252.

Can you verify that a small e with a circumflex (ê) is actually what is supposed to be stored in the Access database in this case? Perhaps U+02C6 U+0065 is the result of a conversion error, where the input is actually an e followed by a circumflex, or something else entirely. Perhaps your Access database has corrupt data in the sense that the designated encoding does not match the contents, in which case the .NET client might incorrectly parse the data (using the wrong decoder).

If this error is indeed introduced while reading from the database, posting some code or configuration settings might help.

In Code page 437, character number 136 is an e with a circumflex.
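
The wrong-decoder theory can be checked with a few lines. This is only a sketch under an assumption the thread does not confirm: that the bytes were written as code page 437 and later decoded as Windows-1252. DecoderMismatch, Decode and Repair are hypothetical names; the provider registration is assumed to be needed only on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

public static class DecoderMismatch
{
    private static Encoding Get(int codePage)
    {
        // Assumption: required on .NET Core / .NET 5+ for legacy code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage);
    }

    // Decodes a single byte with the given code page.
    public static string Decode(int codePage, byte value)
    {
        return Get(codePage).GetString(new[] { value });
    }

    // Undoes a mis-decode: recover the original bytes via Windows-1252,
    // then decode them as code page 437.
    public static string Repair(string misdecoded)
    {
        byte[] original = Get(1252).GetBytes(misdecoded);
        return Get(437).GetString(original);
    }

    public static void Main()
    {
        // Print code points in hex so the output does not depend on the console font.
        Console.WriteLine(((int)Decode(437, 136)[0]).ToString("X4"));  // prints 00EA (ê)
        Console.WriteLine(((int)Decode(1252, 136)[0]).ToString("X4")); // prints 02C6 (ˆ)
        Console.WriteLine(Repair("\u02C6") == "\u00EA");               // prints True
    }
}
```

Byte 136 decodes to ê under code page 437 but to U+02C6 under Windows-1252, which is consistent with the U+02C6 seen in the question.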

#4


0  

Hmm … I'm not sure which character you mean. The caret (“^”, CIRCUMFLEX ACCENT) has the same code in ASCII and Unicode (U+005E).

/EDIT: Damn, my fault. 710 (U+02C6) is actually the MODIFIER LETTER CIRCUMFLEX ACCENT. Unfortunately, this character isn't part of ASCII at all. It might look like the normal caret but it's a different character. Simple conversion won't help here. I'm not sure if .NET supports mapping of similar characters when converting from Unicode. Worth investigating, though.
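
On the question of mapping similar characters: one commonly used approach, different from the code page solution in this thread, is to decompose the string with string.Normalize and drop the combining marks. The sketch below uses a hypothetical helper named StripDiacritics. It is lossy, and it does not touch U+02C6 itself, because that character is a modifier letter rather than a combining mark.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AsciiApproximation
{
    // Removes combining accents, e.g. ê becomes e. Lossy by design.
    public static string StripDiacritics(string text)
    {
        // FormD splits ê (U+00EA) into e (U+0065) + combining circumflex (U+0302).
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine(StripDiacritics("\u00EA"));                    // prints e
        Console.WriteLine(StripDiacritics("cr\u00E8me br\u00FBl\u00E9e")); // prints creme brulee
    }
}
```

This gives a readable ASCII approximation, not the original byte 136, so it only fits cases where losing the accent is acceptable.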

#5


0  

The value 63 is the question mark, i.e. the encoder's way of saying "I cannot represent this character in ASCII".
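
A short sketch of that behaviour, plus a way to detect it instead of silently getting a question mark. AsciiFallbackDemo, EncodeLossy and IsPureAscii are hypothetical names; Encoding.GetEncoding with explicit fallback objects is a standard .NET overload.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // Default behaviour: characters outside ASCII are replaced with '?' (63).
    public static byte[] EncodeLossy(string text)
    {
        return Encoding.ASCII.GetBytes(text);
    }

    // Strict variant: the exception fallback throws instead of substituting '?'.
    public static bool IsPureAscii(string text)
    {
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(EncodeLossy("\u02C6")[0]); // prints 63
        Console.WriteLine(EncodeLossy("^")[0]);      // prints 94
        Console.WriteLine(IsPureAscii("^"));         // prints True
        Console.WriteLine(IsPureAscii("\u02C6"));    // prints False
    }
}
```

The 94 the question expected is the code of the plain caret (U+005E), which is a different character from U+02C6.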
