What is normalized UTF-8?

Date: 2023-01-14 09:40:45

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

However, I'm trying to figure out what this means for applications. For example, in which cases do I want "Canonical Equivalence" instead of "Compatibility equivalence", or vis-versa?

7 Answers

#1


165  

Everything You Never Wanted to Know about Unicode Normalization

Canonical Normalization

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form. The resulting code points should appear identical to the original ones barring any bugs in the fonts or rendering engine.

When To Use

Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit for bit identical to the input.

Canonical normalization comes in 2 forms: NFD and NFC. The two are equivalent in the sense that one can convert between these two forms without loss. Comparing two strings under NFC will always give the same result as comparing them under NFD.

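As a concrete sketch of canonical equivalence (using Python's standard unicodedata module rather than ICU's PHP bindings, purely for illustration):

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The raw strings differ code point by code point...
assert composed != decomposed

# ...but compare equal once both sides are in the same canonical form.
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
assert unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
```

As stated above, it does not matter whether you pick NFC or NFD, as long as both sides of the comparison use the same form.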
NFD

NFD has the characters fully expanded out. This is the faster normalization form to calculate, but it results in more code points (i.e. uses more space).

If you just want to compare two strings that are not already normalized, this is the preferred normalization form unless you know you need compatibility normalization.

NFC

NFC recombines code points when possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.

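The space trade-off is easy to see in a quick sketch (again Python's unicodedata, standing in for ICU):

```python
import unicodedata

s = "\u00e9"                             # é, a single code point
nfd = unicodedata.normalize("NFD", s)    # fully decomposed
nfc = unicodedata.normalize("NFC", nfd)  # recombined

assert len(nfd) == 2  # e + combining accent: more code points
assert len(nfc) == 1  # back to one code point: the shorter string
```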
Compatibility Normalization

Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these to allow text in those character sets to be processed as Unicode, and then be converted back without loss.

Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

Characters that include formatting information are replaced with ones that do not. For example the character ⁹ gets converted to 9. Others don't involve formatting differences. For example the roman numeral character Ⅸ is converted to the regular letters IX.

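Both examples can be checked directly; a minimal sketch with Python's unicodedata:

```python
import unicodedata

# Formatting-only difference: SUPERSCRIPT NINE becomes a plain 9.
assert unicodedata.normalize("NFKC", "\u2079") == "9"

# Not just formatting: ROMAN NUMERAL NINE becomes the ordinary letters IX.
assert unicodedata.normalize("NFKC", "\u2168") == "IX"

# Canonical normalization deliberately leaves both characters alone.
assert unicodedata.normalize("NFC", "\u2079") == "\u2079"
```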
Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.

When to use

The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

An excellent use case would be a search engine, since you would probably want a search for 9 to match ⁹.

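A sketch of that search-engine idea: fold queries and indexed text through the same compatibility key. The search_key helper below is hypothetical, not part of any library:

```python
import unicodedata

def search_key(text: str) -> str:
    # Hypothetical index/query key: compatibility-normalize, then case-fold.
    return unicodedata.normalize("NFKC", text).casefold()

# A query for "9" now matches text containing the superscript form.
assert search_key("\u2079") == search_key("9")
```

The displayed text is left untouched; only the keys used for matching are normalized.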
One thing you should probably not do is display to the user the result of applying compatibility normalization.

NFKC/NFKD

Compatibility normalization comes in two forms, NFKD and NFKC. They have the same relationship as NFD and NFC.

Any string in NFKC is inherently also in NFC, and the same for the NFKD and NFD. Thus NFKD(x)=NFD(NFKC(x)), and NFKC(x)=NFC(NFKD(x)), etc.

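Those identities can be spot-checked over a few sample strings (Python's unicodedata):

```python
import unicodedata

def nf(form, s):
    return unicodedata.normalize(form, s)

# Superscript nine, decomposed é, ANGSTROM SIGN, ff ligature.
for x in ["\u2079", "e\u0301", "\u212b", "\ufb00"]:
    assert nf("NFKD", x) == nf("NFD", nf("NFKC", x))
    assert nf("NFKC", x) == nf("NFC", nf("NFKD", x))
```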
Conclusion

If in doubt, go with canonical normalization. Choose NFC or NFD based on the space/speed trade-off applicable, or based on what is required by something you are inter-operating with.

#2


38  

Some characters, for example a letter with an accent (say, é) can be represented in two ways - a single code point U+00E9 or the plain letter followed by a combining accent mark U+0065 U+0301. Ordinary normalization will choose one of these to always represent it (the single code point for NFC, the combining form for NFD).

For characters that could be represented by multiple sequences of base characters and combining marks (say, "s, dot below, dot above" vs putting the dot above then the dot below, or using a base character that already has one of the dots), NFD will also pick one of these (the dot below goes first, as it happens).

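The canonical-ordering point is observable directly; a small sketch in Python (the thread itself is ICU/PHP-oriented):

```python
import unicodedata

# s with dot below (class 220) and dot above (class 230), in two input orders.
a = "s\u0323\u0307"  # dot below first: the canonical order
b = "s\u0307\u0323"  # dot above first

# NFD sorts the combining marks, so both inputs end up as the same sequence.
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b) == a
```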
The compatibility decompositions include a number of characters that "shouldn't really" be characters, but are because they were used in legacy encodings. Ordinary normalization won't unify these (to preserve round-trip integrity - this isn't an issue for the combining forms because no legacy encoding [except a handful of Vietnamese encodings] used both), but compatibility normalization will. Think of the "kg" kilogram sign that appears in some East Asian encodings (or the halfwidth/fullwidth katakana and alphabet), or the "fi" ligature in MacRoman.

See http://unicode.org/reports/tr15/ for more details.

#3


13  

Normal forms (of Unicode, not databases) deal primarily (exclusively?) with characters that have diacritical marks. Unicode provides some characters with "built in" diacritical marks, such as U+00C0, "Latin Capital A with Grave". The same character can be created from a "Latin Capital A" (U+0041) with a "Combining Grave Accent" (U+0300). That means even though the two sequences produce the same resulting character, a byte-by-byte comparison will show them as being completely different.

Normalization is an attempt at dealing with that. Normalizing assures (or at least tries to) that all the characters are encoded the same way -- either all using a separate combining diacritical mark where needed, or all using a single code point wherever possible. From a viewpoint of comparison, it doesn't really matter a whole lot which you choose -- pretty much any normalized string will compare properly with another normalized string.

In this case, "compatibility" means compatibility with code that assumes that one code point equals one character. If you have code like that, you probably want to use the compatibility normal form. Although I've never seen it stated directly, the names of the normal forms imply that the Unicode consortium considers it preferable to use separate combining diacritical marks. This requires more intelligence to count the actual characters in a string (as well as things like breaking a string intelligently), but is more versatile.

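To illustrate the counting problem described above (a sketch; note it is composition, the C forms, that collapses such sequences wherever a precomposed character exists, and full user-perceived-character counting needs grapheme cluster rules beyond normalization):

```python
import unicodedata

decomposed = "A\u0300"  # A + COMBINING GRAVE ACCENT: one character to the user
assert len(decomposed) == 2  # a naive code-point count sees two

composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00c0"  # the precomposed À
assert len(composed) == 1    # now one code point == one character
```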
If you're making full use of ICU, chances are that you want to use the canonical normal form. If you're trying to write code on your own that (for example) assumes a code point equals a character, then you probably want the compatibility normal form that makes that true as often as possible.

#4


5  

If two Unicode strings are canonically equivalent the strings are really the same, only using different Unicode sequences. For example Ä can be represented either using the character Ä or a combination of A and ◌̈.

If the strings are only compatibility equivalent the strings aren't necessarily the same, but they may be the same in some contexts. E.g. the ligature ﬀ could be considered the same as ff.

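Both claims in a quick sketch (Python's unicodedata; the ligature example assumes U+FB00, LATIN SMALL LIGATURE FF):

```python
import unicodedata

# Canonically equivalent: really the same text.
assert unicodedata.normalize("NFC", "A\u0308") == "\u00c4"  # Ä

# Only compatibility-equivalent: canonical forms keep the ligature distinct.
assert unicodedata.normalize("NFC", "\ufb00") == "\ufb00"
assert unicodedata.normalize("NFKC", "\ufb00") == "ff"
```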
So, if you are comparing strings you should use canonical equivalence, because compatibility equivalence isn't real equivalence.

But if you want to sort a set of strings it might make sense to use compatibility equivalence, as the equivalent strings are nearly identical.

#5


4  

Whether canonical equivalence or compatibility equivalence is more relevant to you depends on your application. The ASCII way of thinking about string comparisons roughly maps to canonical equivalence, but Unicode represents a lot of languages. I don't think it is safe to assume that Unicode encodes all languages in a way that allows you to treat them just like western european ASCII.

Figures 1 and 2 provide good examples of the two types of equivalence. Under compatibility equivalence, it looks like the same number in subscript and superscript form would compare equal. But I'm not sure that solves the same problem as the cursive Arabic forms or the rotated characters.

The hard truth of Unicode text processing is that you have to think deeply about your application's text processing requirements, and then address them as well as you can with the available tools. That doesn't directly address your question, but a more detailed answer would require linguistic experts for each of the languages you expect to support.

#6


4  

This is actually fairly simple. UTF-8 actually has several different representations of the same "character". (I use character in quotes since byte-wise they are different, but practically they are the same). An example is given in the linked document.

The character "Ç" can be represented as the byte sequence 0xc387. But it can also be represented by a C (0x43) followed by the byte sequence 0xcca7. So you can say that 0xc387 and 0x43cca7 are the same character. The reason that works is that 0xcca7 is a combining mark; that is to say it takes the character before it (a C here) and modifies it.

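In UTF-8 bytes this looks as follows (a Python sketch; 0x43 is the byte for C):

```python
import unicodedata

composed = "\u00c7"                                  # Ç as one code point
decomposed = unicodedata.normalize("NFD", composed)  # C + COMBINING CEDILLA

assert composed.encode("utf-8") == b"\xc3\x87"
assert decomposed.encode("utf-8") == b"C\xcc\xa7"    # 0x43 0xcc 0xa7
```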
Now, as far as the difference between canonical equivalence vs compatibility equivalence, we need to look at characters in general.

There are 2 types of characters: those that convey meaning through their value, and those that take another character and alter it. So 9 is a meaningful character. A superscript ⁹ takes that meaning and alters it by presentation. So canonically they have different meanings, but they still represent the base character.

So canonical equivalence is where the byte sequence is rendering the same character with the same meaning. Compatibility equivalence is when the byte sequence is rendering a different character with the same base meaning (even though it may be altered). So the 9 and ⁹ are compatibility equivalent since they both mean "9", but are not canonically equivalent since they don't have the same representation...

Hope that helps...

#7


1  

The problem of comparing strings: two strings with content that is equivalent for the purposes of most applications may contain differing character sequences.

See Unicode's canonical equivalence: if the comparison algorithm is simple (or must be fast), the Unicode equivalence check is not performed. This problem occurs, for instance, in XML canonical comparison; see http://www.w3.org/TR/xml-c14n

To avoid this problem... What standard to use? "expanded UTF-8" or "compact UTF-8"?
Use "ç" or "c+◌̧"?

The W3C and others (e.g. for file names) suggest using the "composed as canonical" form (keep in mind that C is for the "most compact", shorter strings)... So,

The standard is C! When in doubt, use NFC.

For interoperability, and for "convention over configuration" choices, the recommendation is to use NFC to "canonize" external strings. To store canonical XML, for example, store it in "FORM_C". The W3C's CSV on the Web Working Group also recommends NFC (section 7.2).

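A sketch of that ingest convention (Python's unicodedata here; PHP's intl extension offers the equivalent Normalizer::normalize() and Normalizer::isNormalized()):

```python
import unicodedata

def canonize(s: str) -> str:
    # Hypothetical ingest step: store all external strings in NFC.
    if unicodedata.is_normalized("NFC", s):  # cheap fast path (Python 3.8+)
        return s
    return unicodedata.normalize("NFC", s)

assert canonize("c\u0327") == "\u00e7"  # c + combining cedilla stored as ç
```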
PS: "FORM_C" is the default form in most libraries, e.g. in PHP's Normalizer::isNormalized().


The term "composition form" (FORM_C) is used for both: to say that "a string is in the C-canonical form" (the result of an NFC transformation) and to say that the transforming algorithm is used... See http://www.macchiato.com/unicode/nfc-faq

(...) each of the following sequences (the first two being single-character sequences) represent the same character:

  1. U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
  2. U+212B ( Å ) ANGSTROM SIGN
  3. U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. (...) A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).

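The toNFC(S)/isNFC(S) functions from the quote are one-liners over a normalization library; sketched with Python's unicodedata:

```python
import unicodedata

def toNFC(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def isNFC(s: str) -> bool:
    return unicodedata.is_normalized("NFC", s)  # Python 3.8+

# All three sequences quoted above collapse to the same NFC form.
for s in ["\u00c5", "\u212b", "A\u030a"]:
    assert toNFC(s) == "\u00c5"
assert not isNFC("\u212b")  # ANGSTROM SIGN is not in NFC
```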

Note: to test normalization of short strings (pure UTF-8 or XML-entity references), you can use this online test/normalize converter.
