What do I need to know about Unicode?

Time: 2021-11-09 04:15:47

As an application developer, do I need to know Unicode?

11 answers

#1


45  

Unicode is a standard that defines numeric codes for the characters used in written communication. Or, as the Unicode Consortium puts it:

The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.

There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.

  • First, go to the source for authoritative, detailed information and implementation guidelines.

  • As mentioned by others, Joel Spolsky has a good list of these errors.

  • I also like Elliotte Rusty Harold's Ten Commandments of Unicode.

  • Developers should also watch out for canonical representation attacks.

Some of the key concepts you should be aware of are:

  • Glyphs: concrete graphics used to represent written characters.

  • Composition: combining glyphs to create another glyph (for example, a base letter plus an accent).

  • Encoding: converting Unicode code points to a stream of bytes.

  • Collation: locale-sensitive comparison of Unicode strings.
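
As a rough illustration of the composition and encoding concepts above, here is a minimal Python sketch. It assumes only the standard library's unicodedata module; the variable names are made up for the example and are not taken from any answer.

```python
import unicodedata

# Two ways to write "é": one precomposed code point, or "e" plus a combining accent.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# They usually render the same but are different code point sequences.
same_before = composed == decomposed   # False

# Normalizing both to NFC (the composed form) makes them compare equal.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
same_after = nfc_a == nfc_b            # True

# Encoding turns code points into bytes; the bytes depend on the encoding chosen.
utf8_bytes = composed.encode("utf-8")       # b'\xc3\xa9'
utf16_bytes = composed.encode("utf-16-be")  # b'\x00\xe9'

print(same_before, same_after, utf8_bytes, utf16_bytes)
```

Comparing or hashing user-supplied text without normalizing it first is one of the easily avoided errors mentioned above.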

#2


66  

Read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

#3


11  

At the risk of just adding another link, unicode.org is a spectacular resource.

In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters. UTF-8, which is more or less the standard these days, keeps every ASCII character at a single byte (using two to four bytes for everything else) and is byte-for-byte identical to ASCII for the 7-bit range.
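
The ASCII compatibility of UTF-8 can be checked with a short Python sketch. The sample strings are assumptions chosen for illustration.

```python
text = "plain ASCII text"

# For pure ASCII text, the UTF-8 bytes are exactly the ASCII bytes.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
identical = ascii_bytes == utf8_bytes   # True

# Characters outside ASCII need more than one byte each in UTF-8.
# The string below is: summation sign, space, less-than-or-equal, space, infinity.
math = "\u2211 \u2264 \u221e"
math_lengths = [len(ch.encode("utf-8")) for ch in math]   # [3, 1, 3, 1, 3]

print(identical, math_lengths)
```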

(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)

#4


6  

This article from Joel Spolsky should help you a lot.

#5


4  

Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent the world's writing systems. All developers need to know about it, as globalization is a growing concern.

#6


3  

One (open) source of code for handling Unicode is ICU, the International Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (it presents a C interface but is built with a C++ compiler).

#7


2  

Unicode is a character set that, unlike ASCII (which covers only the letters used in English: 128 characters, about a quarter of them non-printable control characters), has room for over a million code points (1,114,112 to be exact), including characters of practically every written language (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (even lots of dead-language symbols that are no longer in use but are useful for archiving ancient documents).

So instead of dealing with dozens of different character encodings, you have one character set for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch encodings in the middle of the string). There is still plenty of room left: we are far from having all of that space in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of space.

Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed as bytes is a different issue. There are several encodings for Unicode characters: UTF-8 (one to four bytes per character, depending on the code point; English is almost always one byte per character, accented Latin letters are usually two, Chinese and Japanese characters are mostly three), UTF-16 (most characters are two bytes, some rarely used ones are four), and UTF-32 (every character is four bytes). There are others, but these are the dominant ones.
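
To make the byte counts concrete, here is a small Python sketch that encodes a few characters in UTF-8, UTF-16, and UTF-32. The helper name encoded_lengths and the sample characters are assumptions made for this example.

```python
def encoded_lengths(ch):
    """Return the number of bytes ch occupies in each of three encodings."""
    # The "-le" variants are used so that no byte order mark is counted.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = [
    "A",           # U+0041, plain ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The last sample shows why code that assumes "one UTF-16 unit per character" breaks: characters above U+FFFF take four bytes in UTF-16 as well.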

Unicode is the default for many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java strings are Unicode, stored as UTF-16, and Python 3 strings are sequences of Unicode code points). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.

#8


2  

Here you can find a great guide:

http://www.joelonsoftware.com/articles/Unicode.html

#9


1  

Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.

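
A short Python sketch of what "code point" means in practice, using the built-in ord and chr plus the standard unicodedata module. The describe helper is a hypothetical name introduced only for this example.

```python
import unicodedata

def describe(ch):
    """Return the code point (in U+XXXX notation) and the official name of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

print(describe("A"))        # ('U+0041', 'LATIN CAPITAL LETTER A')
print(describe("\u20ac"))   # ('U+20AC', 'EURO SIGN')

# chr goes the other way: from a numeric code point back to the character.
euro = chr(0x20AC)
print(euro == "\u20ac")     # True
```

Note that nothing here involves bytes yet; code points are just numbers until an encoding is applied.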

Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".

There are various encoding schemes that can represent arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.

#10


1  

You don't need to learn all of Unicode to use it; it's a hell of a complex standard. You just need to know the main issues and how your programming tools deal with them. To learn that, check Galwegian's link and the documentation for your programming language and IDE.

For example:

You can convert any character from Latin-1 to Unicode, but the reverse does not work for all characters. PHP lets you know that some functions (like stristr) do not work with Unicode. Python declares a Unicode string this way: u"Hello World" (in Python 3, every str is already Unicode and the u prefix is optional).
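
The Latin-1 point can be demonstrated with a minimal Python 3 sketch. The variable names are assumptions made for the example.

```python
# Every possible Latin-1 byte value decodes to some Unicode character...
latin1_bytes = bytes(range(256))
text = latin1_bytes.decode("latin-1")

# ...and converts back to the same bytes, so this direction is lossless.
round_trip_ok = text.encode("latin-1") == latin1_bytes   # True

# The other direction can fail: the euro sign (U+20AC) has no Latin-1 byte.
try:
    "\u20ac".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(round_trip_ok, euro_fits)
```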

That's the kind of thing you must know.

Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.

#11