Converting Unicode code points to string characters in Ruby

Time: 2022-11-26 20:18:37

I have these values from a Unicode database, but I'm not sure how to translate them into human-readable form. What are these even called?

Here they are:

  • U+2B71F
  • U+2A52D
  • U+2A68F
  • U+2A690
  • U+2B72F
  • U+2B4F7
  • U+2B72B

How can I convert these to their readable symbols?

2 answers

#1


34  

How about:

puts ["2B71F".hex].pack("U")

Edit

In Ruby 1.9 and later you can even do this:

puts "\u{2B71F}"

That is, the \u{} escape sequence lets you write a Unicode code point directly inside a string literal.
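To handle the whole list from the question rather than a single value, something along these lines might work. It is only a sketch: `QUESTION_CODEPOINTS` and `codepoint_to_char` are names made up for illustration, and it uses `Integer#chr` with an explicit encoding as an alternative to `pack("U")`.

```ruby
# Code points from the question, in "U+XXXX" notation.
QUESTION_CODEPOINTS = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B].freeze

# Hypothetical helper (not from the answer): drop the "U+" prefix,
# parse the remaining hex digits, and ask Integer#chr for a UTF-8 string.
def codepoint_to_char(notation)
  notation.sub(/\AU\+/, "").hex.chr(Encoding::UTF_8)
end

# Print each code point next to the character it stands for.
QUESTION_CODEPOINTS.each do |notation|
  puts "#{notation} => #{codepoint_to_char(notation)}"
end
```

Whether the characters actually display depends on your terminal and font; these are rare CJK ideographs, so a missing glyph does not mean the conversion failed.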

#2


19  

Identifiers such as U+2B71F are called code points.

Unicode assigns a unique code point to every character it covers: the scripts of many world languages, scientific symbols, currency signs and so on. The character set keeps growing.

For example, U+221E is the infinity sign (∞).

A code point is a number, conventionally written in hexadecimal after the U+ prefix. Each character is assigned exactly one.
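You can see the number behind a character by going in the other direction. A minimal sketch, where `char_to_codepoint` is a hypothetical helper name and the character is assumed to be a one-character UTF-8 string:

```ruby
# Hypothetical helper: return the "U+XXXX" notation for a single character.
# String#ord gives the code point as an Integer; format renders it as
# uppercase hex, zero-padded to at least four digits.
def char_to_codepoint(char)
  format("U+%04X", char.ord)
end

puts "\u221E".ord                    # the code point as a decimal integer
puts char_to_codepoint("\u221E")     # the same value in U+ notation
puts char_to_codepoint("\u{2B71F}")  # works beyond four hex digits too
```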

There are many ways to lay a code point out in memory. Each such scheme is called an encoding; the common ones are UTF-8 and UTF-16. Conversion between them is well defined.
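To make the difference concrete, here is a small sketch that prints the bytes of the same code point under two encodings. `SAMPLE_CHAR` and `hex_bytes` are illustrative names, not part of any library.

```ruby
# One character, U+2B71F, written with the \u{} escape.
SAMPLE_CHAR = "\u{2B71F}"

# Hypothetical helper: transcode the string to the given encoding
# and show its bytes as space-separated uppercase hex.
def hex_bytes(str, encoding)
  str.encode(encoding).bytes.map { |byte| format("%02X", byte) }.join(" ")
end

puts hex_bytes(SAMPLE_CHAR, Encoding::UTF_8)     # F0 AB 9C 9F
puts hex_bytes(SAMPLE_CHAR, Encoding::UTF_16BE)  # D8 6D DF 1F (a surrogate pair)

# Converting there and back gives the original character again.
round_trip = SAMPLE_CHAR.encode(Encoding::UTF_16BE).encode(Encoding::UTF_8)
puts round_trip == SAMPLE_CHAR
```

The code point is the same in both cases; only the byte layout changes.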

Here you most likely want to convert a Unicode code point into a UTF-8 encoded character.

codepoint = "U+2B71F"

You need to extract the hex digits that come after U+, leaving only 2B71F. They end up in the first capture group.

codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/

And your UTF-8 character will be:

utf_8_character = [$1.hex].pack("U")
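Put together, the steps above could be wrapped up roughly like this. It is a sketch under a few assumptions: `CODEPOINT_PATTERN` and `parse_codepoint` are hypothetical names, the pattern is anchored at both ends (the original only anchors the end), and the match object is used instead of the global `$1`.

```ruby
# Same character classes as the regex above, anchored at both ends
# so that stray text around the code point is rejected.
CODEPOINT_PATTERN = /\AU\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})\z/

# Hypothetical helper: return the UTF-8 character for a "U+XXXX" string,
# or nil when the input is not in that notation.
def parse_codepoint(notation)
  match = CODEPOINT_PATTERN.match(notation.to_s)
  return nil unless match

  # match[1] is the first capture group: the hex digits after "U+".
  [match[1].hex].pack("U")
end

puts parse_codepoint("U+2B71F")
puts parse_codepoint("not a code point").inspect
```

Returning nil for bad input is just one design choice; raising an ArgumentError would be equally reasonable if malformed values should not pass silently.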

References:

  1. Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
  2. Tim Bray on the goodness of Unicode.
  3. Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
  4. Dissecting the Unicode regular expression.
