How can I get the Unicode code point of a Character?

Date: 2022-09-13 09:01:36

How can I extract the Unicode code point(s) of a given Character without first converting it to a String? I know that I can use the following:

let ch: Character = "A"
let s = String(ch).unicodeScalars
s[s.startIndex].value // returns 65

but it seems like there should be a more direct way to accomplish this using just Swift's standard library. The Language Guide sections "Working with Characters" and "Unicode" only discuss iterating through the characters in a String, not working directly with Characters.

5 answers

#1


27  

From what I can gather from the documentation, they want you to get Character values from a String because the String gives context: is this Character encoded as UTF-8, UTF-16, or 21-bit code points (scalars)?

If you look at how Character is defined in the Swift standard library (as of this 1.0 release), it is actually an enum. This was probably done to accommodate the various representations exposed by String.utf8, String.utf16, and String.unicodeScalars.

It seems they do not expect you to work with Character values directly but rather with Strings. You, as the programmer, decide how to get these values from the String itself, which allows the encoding to be preserved.

That said, if you need to get the code point in a concise manner, I would recommend an extension like this (note that it returns only the first scalar of the Character):

extension Character
{
    func unicodeScalarCodePoint() -> UInt32
    {
        let characterString = String(self)
        let scalars = characterString.unicodeScalars

        return scalars[scalars.startIndex].value
    }
}

Then you can use it like so:

let char : Character = "A"
char.unicodeScalarCodePoint()
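
Because a Character can be built from several scalars, a variant that returns all of them may be more useful than one that returns only the first. The following is a minimal sketch of that idea. The property name `codePoints` is made up for illustration and is not part of the standard library.

```swift
extension Character {
    /// Hypothetical helper (not a standard library API): the values of
    /// all Unicode scalars that make up this Character.
    var codePoints: [UInt32] {
        return String(self).unicodeScalars.map { $0.value }
    }
}

let plain: Character = "A"
let accented: Character = "e\u{301}" // "e" followed by COMBINING ACUTE ACCENT

print(plain.codePoints)    // [65]
print(accented.codePoints) // [101, 769]
```

Returning an array keeps the call site short while making it explicit that one Character does not always map to one number.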

In summary, string and character encoding is tricky once you factor in all the possibilities. To allow each possibility to be represented, they went with this scheme.

Also, remember that this is a 1.0 release; I'm sure they will expand Swift's syntactic sugar soon.

#2


15  

I think there are some misunderstandings about Unicode. Unicode itself is NOT an encoding: it does not transform characters (in the sense a human reader would use the word) into any sort of binary sequence. Unicode is just a big table that collects the characters used by all languages on Earth (unofficially also including Klingon). Those characters are organized and indexed by code points (a 21-bit number in Swift, written like U+D55C). You can find the character you are looking for in the big Unicode table by using its code point.

Meanwhile, the schemes called UTF-8, UTF-16, and UTF-32 are the actual encodings. Yes, there is more than one way to encode Unicode characters into binary sequences. Which one to use depends on the project you are working on, but most web pages are encoded in UTF-8 (you can check this page right now).

Concept 1: A Unicode code point (other than a surrogate) is called a Unicode scalar in Swift

A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.
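
The ranges quoted above can be checked with the failable `Unicode.Scalar` initializer that takes a `UInt32`. This is only a small sketch; the constant names are chosen for illustration.

```swift
// Unicode.Scalar's UInt32 initializer is failable: it returns nil for
// values that are not Unicode scalar values.
let letterA: UInt32 = 0x41          // inside U+0000...U+D7FF
let highSurrogate: UInt32 = 0xD800  // inside the surrogate range U+D800...U+DFFF
let lastScalar: UInt32 = 0x10FFFF   // upper bound of the second range
let tooLarge: UInt32 = 0x110000     // beyond U+10FFFF

print(Unicode.Scalar(letterA) != nil)       // true
print(Unicode.Scalar(highSurrogate) != nil) // false
print(Unicode.Scalar(lastScalar) != nil)    // true
print(Unicode.Scalar(tooLarge) != nil)      // false
```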

Concept 2: The code unit is the abstract representation of the encoding.

Consider the following code snippet:

let theCat = "Cat!🐱"

// UTF-8 code units of the string, in decimal
for codeUnit in theCat.utf8 {
    print("\(codeUnit) ", terminator: "")
}
print("")
// The same UTF-8 code units, in binary
for codeUnit in theCat.utf8 {
    print("\(String(codeUnit, radix: 2)) ", terminator: "")
}
print("")

// UTF-16 code units of the string, in decimal
for codeUnit in theCat.utf16 {
    print("\(codeUnit) ", terminator: "")
}
print("")
// The same UTF-16 code units, in binary
for codeUnit in theCat.utf16 {
    print("\(String(codeUnit, radix: 2)) ", terminator: "")
}
print("")

// Unicode scalar values (equivalent to UTF-32 code units), in decimal
for scalar in theCat.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}
print("")
// The same scalar values, in binary
for scalar in theCat.unicodeScalars {
    print("\(String(scalar.value, radix: 2)) ", terminator: "")
}

Abstract representation means that a code unit is written as a base-10 (decimal) number whose value equals the base-2 encoding (binary sequence). The encoding is made for machines; the code unit is more for humans, since it is easier to read than a binary sequence.
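
To make the relation between a code point and its code units concrete, the sketch below derives the UTF-16 surrogate pair for one scalar by hand and compares it with what the standard library reports. The scalar U+1F431 (CAT FACE) is just an example picked for illustration.

```swift
// Derive the UTF-16 surrogate pair for a scalar above U+FFFF by hand.
let scalar: Unicode.Scalar = "\u{1F431}"     // CAT FACE
let offset = scalar.value - 0x10000          // 20 significant bits remain
let high = UInt16(0xD800 + (offset >> 10))   // top 10 bits
let low = UInt16(0xDC00 + (offset & 0x3FF))  // bottom 10 bits

let text = String(Character(scalar))

print(high, low)          // 55357 56369
print(Array(text.utf16))  // [55357, 56369]
print(Array(text.utf8))   // [240, 159, 144, 177]
```

The same code point, 128049 in decimal, therefore shows up as four UTF-8 code units, two UTF-16 code units, or one 32-bit scalar value.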

Concept 3: A character may be represented by different Unicode code point(s). It depends on which scalars the grapheme cluster is constructed from (this is why I spoke of "characters" as a human reader sees them at the beginning).

Consider the following code snippet:

let precomposed: String = "\u{D55C}"
let decomposed: String = "\u{1112}\u{1161}\u{11AB}"
print(precomposed.count) // prints "1"
print(decomposed.count) // prints "1": three scalars still form one Character (grapheme cluster)
print(precomposed) // prints "한"
print(decomposed) // prints "한"

The precomposed and decomposed characters are visually and linguistically equal, but they consist of different Unicode code points and therefore different code units, even when encoded with the same encoding scheme (see the following example):

for preCha in precomposed.utf16 {
    print("\(preCha) ", terminator: "") // prints 54620
}

print("")

for deCha in decomposed.utf16 {
    print("\(deCha) ", terminator: "") //print 4370 4449 4523
}
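
Even though the code units differ, Swift compares strings by canonical equivalence, so both spellings are treated as the same text. A short sketch of that behavior follows; the constant names are made up so that it stands on its own.

```swift
let precomposedHan = "\u{D55C}"
let decomposedHan = "\u{1112}\u{1161}\u{11AB}"

// String equality in Swift is based on canonical equivalence,
// so the two spellings compare equal.
print(precomposedHan == decomposedHan) // true

// The underlying scalars are nevertheless different.
print(precomposedHan.unicodeScalars.map { $0.value }) // [54620]
print(decomposedHan.unicodeScalars.map { $0.value })  // [4370, 4449, 4523]
```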

Extra example

var word = "cafe"
print("the number of characters in \(word) is \(word.count)") // 4

word += "\u{301}" // COMBINING ACUTE ACCENT, attaches to the final "e"

print("the number of characters in \(word) is \(word.count)") // still 4

Summary: Code points, a.k.a. the position indexes of characters in the Unicode table, have nothing to do with the UTF-8, UTF-16, and UTF-32 encoding schemes.

Further reading:

http://www.joelonsoftware.com/articles/Unicode.html

http://kunststube.net/encoding/

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html

#3


6  

I think the issue is that Character doesn't represent a Unicode code point. It represents a "Unicode grapheme cluster", which can consist of multiple code points.

Instead, UnicodeScalar represents a Unicode code point.
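
In newer Swift versions, Character exposes its scalars directly through a `unicodeScalars` property, so no String round trip is needed. The following is a minimal sketch, assuming a Swift 5 or later toolchain; the constant names are chosen for illustration.

```swift
let letter: Character = "A"
// One Character (grapheme cluster) made of three conjoining jamo scalars.
let syllable: Character = "\u{1112}\u{1161}\u{11AB}"

for scalar in letter.unicodeScalars {
    print(scalar.value) // 65
}

print(syllable.unicodeScalars.count)               // 3
print(syllable.unicodeScalars.map { $0.value })    // [4370, 4449, 4523]
```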

#4


6  

I agree with you; there should be a way to get the code directly from a Character. But all I can offer is a shorthand:

let ch: Character = "A"
for code in String(ch).utf8 { print(code) } // prints 65
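
For characters that are known to be ASCII, later Swift releases offer a more direct route through the `asciiValue` property on Character. This is a small sketch, assuming Swift 5 or later; it does not help with non-ASCII characters, where the property is nil.

```swift
let asciiLetter: Character = "A"
let nonASCII: Character = "\u{E9}" // é

// asciiValue is an optional UInt8: nil when the Character is not ASCII.
if let value = asciiLetter.asciiValue {
    print(value) // 65
}
print(nonASCII.asciiValue == nil) // true
print(nonASCII.isASCII)           // false
```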

#5


0  

Have you tried:

import Foundation

let characterString: String = "abc"
var numbers: [Int] = []
// Each UTF-8 code unit is a UInt8, so it can be converted to Int directly.
for codeUnit in characterString.utf8 {
    numbers.append(Int(codeUnit))
}

print(numbers)

Output:

[97, 98, 99]

The String may also contain only a single Character.
