What character encoding should I use for a web page containing mostly Arabic text?
Is UTF-8 okay?
5 solutions
#1
15
UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
However, if you were wondering what encoding would be most efficient:
Every Arabic character in common use fits in a single UTF-16 code unit (2 bytes). In UTF-8, the letters in the main Arabic block (U+0600 to U+06FF) take 2 code units (1 byte each), and only less common ranges such as the Arabic Presentation Forms take 3. So if you were encoding nothing but Arabic letters, UTF-16 would be at best slightly more space efficient.
However, you're not just encoding Arabic: you're also encoding a significant number of characters that take a single byte in UTF-8 but two bytes in UTF-16, namely all the HTML markup characters (<, &, >, =) and all the HTML element names.
It's a trade-off and, unless you're dealing with huge documents, it doesn't matter.
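To make the trade-off concrete, here is a minimal Python sketch that measures the same text in both encodings. The sample strings and the `sizes` helper are made up for illustration; the byte counts come from Python's standard codecs.

```python
# Sketch: compare encoded sizes for Arabic text with and without HTML markup.
# The sample strings are illustrative assumptions, not taken from a real page.

def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count), with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

arabic = "مرحبا بالعالم"  # 12 Arabic letters and 1 space
html = '<p class="greeting">' + arabic + "</p>"
presentation_form = "\ufedf"  # one character from Arabic Presentation Forms-B

for label, text in [("arabic", arabic), ("html", html), ("form", presentation_form)]:
    utf8_size, utf16_size = sizes(text)
    print(label, "UTF-8:", utf8_size, "UTF-16:", utf16_size)
```

With these samples the plain Arabic string is about the same size either way, the HTML-wrapped version is clearly smaller in UTF-8, and only the presentation-form character is smaller in UTF-16.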
#2
9
I mostly develop Arabic websites, and these are the two encodings I use:
1. Windows-1256
This is the most common encoding on Arabic websites. It works in most cases (roughly 90%) for Arabic users.
Here is one of the biggest Arabic web-development forums: http://traidnt.net/vb/. You can see that they are using this encoding.
The problem with this encoding is that if you are developing a website for international use, it won't work for every user, and some of them will see gibberish instead of the content.
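The gibberish appears when the bytes are read with a different encoding than the one they were written in. A small Python sketch of that mismatch, assuming a reader that wrongly falls back to Latin-1 (the sample word is arbitrary):

```python
# Sketch: what happens when Windows-1256 bytes are decoded with the wrong codec.
text = "مرحبا"

raw = text.encode("cp1256")    # bytes as a Windows-1256 page would send them
wrong = raw.decode("latin-1")  # a reader that assumes Latin-1 instead
right = raw.decode("cp1256")   # a reader that knows the real encoding

print(len(raw))       # one byte per Arabic letter
print(wrong)          # accented Latin characters, not Arabic
print(right == text)  # the original text survives only with the right codec
```

Nothing is lost in the bytes themselves; the reader just needs to be told, or to guess correctly, which encoding they are in.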
2. UTF-8
This encoding solves the previous problem and also works in URLs. If you want Arabic words in your URL, they need to be in UTF-8 or it won't work.
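As a rough illustration of the URL point, Python's `urllib.parse.quote` percent-encodes non-ASCII text as UTF-8 by default. The `/articles/` path is a hypothetical example, not a real site layout.

```python
# Sketch: percent-encoding an Arabic word for use in a URL path.
from urllib.parse import quote, unquote

word = "مرحبا"

utf8_path = "/articles/" + quote(word)                       # UTF-8 is the default
cp1256_path = "/articles/" + quote(word, encoding="cp1256")  # legacy alternative

print(utf8_path)
print(cp1256_path)

# unquote() also assumes UTF-8, so only the first form round-trips by default.
print(unquote(utf8_path) == "/articles/" + word)
```

The two paths differ byte for byte, so a server or browser expecting UTF-8 will not recover the word from the Windows-1256 form.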
The downside of this encoding is that if you save Arabic content to a database (e.g. MySQL) in UTF-8 (so the database is also set to UTF-8), it will take roughly double the space it would have taken in Windows-1256 (with the database set to latin-1), because each Arabic letter takes two bytes instead of one.
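A quick way to check the "roughly double" figure is to encode a sample string both ways in Python. This only measures the raw bytes, not any database's row or index overhead, and the sample text is arbitrary.

```python
# Sketch: raw byte counts for the same Arabic text in Windows-1256 and UTF-8.
sample = "قاعدة بيانات"  # 11 Arabic letters and 1 space

cp1256_size = len(sample.encode("cp1256"))  # 1 byte per character
utf8_size = len(sample.encode("utf-8"))     # 2 bytes per letter, 1 per space

print(cp1256_size, utf8_size, round(utf8_size / cp1256_size, 2))
```

Spaces, digits, and ASCII punctuation stay at one byte in UTF-8, so real content usually lands a little under double.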
I suggest going with UTF-8 if you can afford the size increase.
#3
8
UTF-8 is fine, yes. It can encode any code point in the Unicode standard.
Edited to add
To make the answer more complete, your realistic choices are:
- UTF-8
- UTF-16
- UTF-32
Each comes with tradeoffs and advantages.
UTF-8
As Joe Gauterin points out, UTF-8 is very efficient for European text but gets less efficient the "farther" you move from the Latin alphabet. Arabic letters take two bytes each in UTF-8, the same as in UTF-16, so all-Arabic text only comes out larger than its UTF-16 equivalent when it uses three-byte characters such as the Arabic presentation forms. In these days of cheap and plentiful RAM, this is rarely a problem in practice unless you have a lot of text to deal with. More of a problem is that the variable length of the encoding makes some string operations difficult and slow. For example, you can't easily get the fifth character of a string by byte offset, because some characters might be 1 byte long (ASCII punctuation, say) while others are two or three. This makes processing the raw bytes slow and error-prone.
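The "fifth character" problem can be sketched in Python, where `str` indexes by code point and `bytes` indexes by byte. The `byte_offset` helper is a hypothetical name written for this example; it walks the UTF-8 bytes and skips continuation bytes.

```python
# Sketch: finding the fifth character in raw UTF-8 bytes needs a linear scan.
text = "A: مرحبا"            # 3 one-byte characters, then 5 two-byte letters
data = text.encode("utf-8")

def byte_offset(data, index):
    """Return the byte position where the code point number `index` starts."""
    count = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point
            count += 1
            if count == index:
                return pos
    raise IndexError("code point index out of range")

start = byte_offset(data, 4)
print(start)                                      # not 4: earlier letters take 2 bytes
print(data[start:start + 2].decode("utf-8") == text[4])
```

Once the bytes are decoded into a string type that indexes by code point, `text[4]` does the same job directly.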
On the other hand, UTF-8 is likely your best choice if you're doing a lot of mixed European/Arabic text. The more European text in your documents, the better the UTF-8 choice will be.
UTF-16
UTF-16 takes about the same space as UTF-8 for predominantly Arabic text, and less only where the text uses characters that need three bytes in UTF-8. I don't know whether any Arabic code points fall outside the range a single 16-bit unit can hold, so I don't know if you risk variable-length encodings here. (My guess is that this is not an issue.) If you do in fact have variable-length encodings, all the string processing problems of UTF-8 apply here as well. If not, no problems.
On the other hand, if you have mixed European and Arabic text, UTF-16 will be less space-efficient. Also, if you later expand to other scripts, say Chinese, you can end up back with variable-length forms and the associated problems, since characters outside the Basic Multilingual Plane (rarer Chinese characters among them) need two UTF-16 code units.
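To see where UTF-16 stops being fixed-width, this Python sketch counts 16-bit code units per character. The `utf16_units` helper is a made-up name, and the three sample characters are just examples.

```python
# Sketch: counting UTF-16 code units for characters inside and outside the BMP.

def utf16_units(ch):
    """Number of 16-bit code units needed to encode `ch` in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

arabic_meem = "\u0645"     # Arabic letter meem
common_cjk = "\u4e2d"      # a common Chinese character
rare_cjk = "\U00020000"    # first character of CJK Extension B, outside the BMP

for ch in (arabic_meem, common_cjk, rare_cjk):
    print(f"U+{ord(ch):04X}", utf16_units(ch))
```

The first two need one unit each; the last needs a surrogate pair, which is the variable-length case.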
UTF-32
UTF-32 will basically double your space requirements compared with UTF-16. On the other hand, it's constant-sized (four bytes per code point) for all known (and, likely, unknown ;) scripts. For raw string processing it's your fastest, best option, without the problems that variable-length encoding will cause you. (This presupposes, naturally, that you have a string library that knows about 32-bit characters.)
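A small Python sketch of the fixed-width property: every code point takes four bytes in UTF-32, so code point number `i` always starts at byte `4 * i`. The `code_point_at` helper is a hypothetical name for this example.

```python
# Sketch: constant-time indexing into UTF-32 bytes.
text = "A: مرحبا \U00020000"      # ASCII, Arabic, and one character outside the BMP
data = text.encode("utf-32-le")   # little-endian, no byte order mark

def code_point_at(data, index):
    """Return code point number `index` by slicing at a fixed 4-byte stride."""
    return data[4 * index:4 * index + 4].decode("utf-32-le")

print(len(data) == 4 * len(text))
print(code_point_at(data, 4) == text[4])
print(code_point_at(data, len(text) - 1) == text[-1])
```

One caveat: fixed width applies to code points. A letter followed by a combining mark, such as an Arabic vowel sign, is still two code points.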
Recommendation
My own recommendation is to use UTF-8 as your external format (because everybody supports it) for storage, transmission, etc., unless you really see a size benefit with UTF-16. So any time you read a string from the outside world it would be UTF-8, and any time you send one to the outside world it, too, would be UTF-8. Within your software, though, unless you're in the habit of manipulating massive strings (in which case I'd recommend different data structures anyway!), I'd recommend using UTF-16 or UTF-32 instead (depending on whether there are any variable-length encoding issues in your UTF-16 data) for speed and simplicity of code.
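In code, this recommendation comes down to decoding once on the way in and encoding once on the way out. A minimal Python sketch, where `shorten_title` and its inputs are hypothetical and Python's `str` stands in for the internal format:

```python
# Sketch: UTF-8 at the boundaries, a code-point-indexed string type inside.

def shorten_title(raw: bytes, max_chars: int) -> bytes:
    """Decode UTF-8 input, trim and truncate by character, re-encode as UTF-8."""
    text = raw.decode("utf-8")                # decode once, at the input boundary
    shortened = text.strip()[:max_chars]      # work on characters, not bytes
    return shortened.encode("utf-8")          # encode once, at the output boundary

incoming = "  مرحبا بالعالم  ".encode("utf-8")
outgoing = shorten_title(incoming, 5)

print(outgoing.decode("utf-8"))
print(len(outgoing))  # 5 Arabic letters, 2 bytes each in UTF-8
```

Truncating the raw bytes at position 5 instead would cut the third letter in half and leave invalid UTF-8.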
#4
2
UTF-8 is the simplest way to go since it will work with almost everything:
UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance Chinese and Arabic can be in the same text without special codes inserted to switch the encoding. (via Wikipedia)
Of course, keep in mind that:
UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.
... but in most cases it's not a big issue. It would become one if you started handling huge documents.
#5
0
UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.