Convert a Unicode string to characters in Ruby?

Date: 2021-04-18 20:17:24

I have the following string:

l\u0092issue

My question is: how do I convert it to UTF-8 characters?

I have tried this:

1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
 => "l\u0092issue" 

2 answers

#1


11  

You seem to have got your encodings into a bit of a mix-up. If you haven’t already, you should first read Joel Spolsky’s article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, which provides a good introduction to this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.

In this specific case, the string l\u0092issue means that the second character is the character with the Unicode code point U+0092 (0x92). This code point is PRIVATE USE TWO (see the chart), a C1 control code with no printable glyph, which basically means this position isn’t used for text.

However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’ (RIGHT SINGLE QUOTATION MARK, U+2019), so if this is the missing character then the string would be l’issue, which looks a lot more likely, even though I don’t speak French.

What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
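
The suspected mix-up can be reproduced in a few lines. This is only a sketch of the hypothesis above: the variable names are made up for illustration, and the raw bytes stand in for whatever input the program actually received.

```ruby
# Bytes of "l’issue" as a CP-1252 producer would send them (0x92 is ’).
raw = "l\x92issue".b

# Wrong assumption: label the bytes as ISO-8859-1, then transcode to UTF-8.
# ISO-8859-1 maps byte 0x92 straight to the control code point U+0092.
wrong = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Right assumption: label the same bytes as Windows-1252 (CP-1252).
# CP-1252 maps byte 0x92 to U+2019, the right single quotation mark.
right = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

puts wrong == "l\u0092issue"  # true: this is the string from the question
puts right == "l\u2019issue"  # true: this is the intended text
```

The same bytes give two different strings depending only on which encoding they are labelled with before transcoding.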

The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
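
One way to be careful at the boundary is to state the encoding explicitly when data comes in. The sketch below assumes the data arrives as a CP-1252 file; the temporary file is a hypothetical stand-in for the real input, which the question does not describe.

```ruby
require 'tempfile'

# Hypothetical input: a file written by a program that uses CP-1252.
file = Tempfile.new('cp1252-sample')
file.binmode
file.write("l\x92issue".b)
file.close

# Declare the external encoding and ask Ruby to transcode to UTF-8
# ("external:internal") instead of relying on the default encoding.
text = File.read(file.path, encoding: 'Windows-1252:UTF-8')

puts text.encoding            # UTF-8
puts text == "l\u2019issue"   # true
file.unlink
```

If the source is a socket, an HTTP response, or a database driver rather than a file, the equivalent step is to find out which encoding that source really uses and label the bytes accordingly before converting.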

To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby that the real encoding is CP-1252, and then re-encode it to UTF-8:

2.1.0 :001 > s = "l\u0092issue"
 => "l\u0092issue" 
2.1.0 :002 > s = s.encode('iso-8859-1')
 => "l\x92issue" 
2.1.0 :003 > s.force_encoding('cp1252')
 => "l\x92issue" 
2.1.0 :004 > s.encode('utf-8')
 => "l’issue"

This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
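
If strings that are already damaged have to be repaired in bulk, the three steps can be wrapped in a small method. The method name is hypothetical (it is not part of Ruby), and the sketch assumes the damage is exactly "CP-1252 bytes read as ISO-8859-1"; a string that legitimately contains C1 control characters would be altered by it.

```ruby
# Hypothetical helper: undo "CP-1252 bytes decoded as ISO-8859-1".
def repair_cp1252_as_latin1(str)
  str.encode('ISO-8859-1')           # back to the original single bytes
     .force_encoding('Windows-1252') # relabel with the real encoding
     .encode('UTF-8')                # transcode properly this time
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  str # not this kind of damage, or not reversible: return it unchanged
end

puts repair_cp1252_as_latin1("l\u0092issue") == "l\u2019issue" # true
puts repair_cp1252_as_latin1("caf\u00E9") == "caf\u00E9"       # true
```

Strings with characters outside ISO-8859-1 cannot have been produced by this particular mistake, so the rescue clause returns them as they are.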

#2


4  

That is already encoded as UTF-8 (unless you changed the original string encoding). Ruby is just showing you the escape sequences when you inspect the string (which is what IRB does there). \u0092 is the escape sequence for this character.

Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.
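
To see what the string really contains, independent of how IRB displays it, you can look at its encoding, code points, and bytes. A small sketch using only core String methods:

```ruby
s = "l\u0092issue"

puts s.encoding                                   # UTF-8
p s.codepoints.map { |c| format('U+%04X', c) }    # second entry is U+0092
p s.bytes.map { |b| format('%02X', b) }           # U+0092 is stored as C2 92

# Encoding a UTF-8 string to UTF-8 changes nothing, which is why the
# attempt in the question returned the same string.
puts s.encode('UTF-8') == s                       # true
```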
