UTF-8 decode

Excerpted from the Lua 5.3 source file lutf8lib.c

 /*
 ** Decode one UTF-8 sequence, returning NULL if byte sequence is invalid.
 */
 static const char *utf8_decode (const char *o, int *val) {
   static const unsigned int limits[] = {0xFF, 0x7F, 0x7FF, 0xFFFF};
   const unsigned char *s = (const unsigned char *)o;
   unsigned int c = s[0];
   unsigned int res = 0;  /* final result */
   if (c < 0x80)  /* ascii? */
     res = c;
   else {
     int count = 0;  /* to count number of continuation bytes */
     while (c & 0x40) {  /* still have continuation bytes? */
       int cc = s[++count];  /* read next byte */
       if ((cc & 0xC0) != 0x80)  /* not a continuation byte? */
         return NULL;  /* invalid byte sequence */
       res = (res << 6) | (cc & 0x3F);  /* add lower 6 bits from cont. byte */
       c <<= 1;  /* to test next bit */
     }
     res |= ((c & 0x7F) << (count * 5));  /* add first byte */
     if (count > 3 || res > MAXUNICODE || res <= limits[count])
       return NULL;  /* invalid byte sequence */
     s += count;  /* skip continuation bytes read */
   }
   if (val) *val = res;
   return (const char *)s + 1;  /* +1 to include first byte */
 }
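
The shift by count * 5 on the first byte is probably the least obvious step in the function. The sketch below replays the same loop for the three-byte sequence E4 B8 A5 (U+4E25) and asserts each intermediate value. The name trace_three_byte is hypothetical, used only for illustration, and is not part of lutf8lib.c.

```c
#include <assert.h>

/* Hypothetical helper (not in lutf8lib.c): replays the decoder's steps
   for one fixed three-byte sequence and checks every intermediate value. */
static unsigned int trace_three_byte (void) {
  const unsigned char s[] = {0xE4, 0xB8, 0xA5};  /* UTF-8 for U+4E25 */
  unsigned int c = s[0];             /* 1110 0100 */
  unsigned int res = 0;
  int count = 0;
  while (c & 0x40) {                 /* bit 6 set: one more continuation byte */
    unsigned int cc = s[++count];
    assert((cc & 0xC0) == 0x80);     /* must look like 10xxxxxx */
    res = (res << 6) | (cc & 0x3F);
    c <<= 1;
  }
  assert(count == 2);
  assert(res == 0xE25);              /* 12 bits from the two continuation bytes */
  assert(c == 0x390);                /* 0xE4 shifted left twice */
  assert((c & 0x7F) == 0x10);        /* first-byte payload 0100, already shifted by count */
  res |= (c & 0x7F) << (count * 5);  /* 2 + 10 = 12 = 6 * count */
  return res;
}
```

When the loop exits, c has already been shifted left count times, so only 5 * count more bits are needed to place the first byte's payload above the 6 * count bits collected from the continuation bytes. The mask 0x7F discards the marker bits that the shifts pushed to bit 7 and above. The later test res <= limits[count] then rejects overlong forms, for example a two-byte sequence whose value is 0x7F or less.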

For background on UTF-8, see http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

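The UTF-8 encoding rules are simple. There are only two:

1) For a single-byte character, the first bit of the byte is set to 0 and the remaining 7 bits hold the character's Unicode code point. For English letters, the UTF-8 encoding is therefore identical to ASCII.

2) For an n-byte character (n > 1), the first n bits of the first byte are all set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits, taken together, hold the character's Unicode code point.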

The table below summarizes the encoding rules. The letter x marks the bits available for the code point.

Unicode code point range | UTF-8 encoding
(hexadecimal)            | (binary)
-------------------------+-------------------------------------
0000 0000-0000 007F      | 0xxxxxxx
0000 0080-0000 07FF      | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
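
Each row of the table translates directly into shifts and masks. The following is a minimal encoder sketch under that reading. The name utf8_encode_sketch and its signature are assumptions made for illustration and do not come from the Lua sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the UTF-8 form of code point cp into out
   (which must have room for 4 bytes) and returns the number of bytes
   written, or 0 if cp is above U+10FFFF. For brevity this sketch does
   not reject surrogates (U+D800 to U+DFFF). */
static size_t utf8_encode_sketch (unsigned long cp, unsigned char *out) {
  assert(out != NULL);
  if (cp < 0x80) {                      /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                             /* outside the Unicode range */
}
```

For example, U+4E25 falls in the third row, so it is written as the three bytes E4 B8 A5.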

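With the table above, reading UTF-8 is straightforward. If the first bit of a byte is 0, that byte is a complete character on its own. If the first bit is 1, the number of consecutive leading 1 bits gives the number of bytes the character occupies. (A byte that starts with exactly one 1, that is 10xxxxxx, is a continuation byte rather than the start of a character.)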
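
The leading-ones rule can be sketched as a small function that inspects only the first byte. The name utf8_seq_len is hypothetical. Lua's utf8_decode does not use a separate helper like this; it folds the same test into its while loop.

```c
/* Hypothetical helper: returns the sequence length implied by the first
   byte (1 to 4), or 0 if the byte is a continuation byte (10xxxxxx) or
   has more than four leading 1 bits. */
static int utf8_seq_len (unsigned char first) {
  int ones = 0;
  /* count consecutive 1 bits starting from the most significant bit */
  while (ones < 8 && (first & (0x80 >> ones)))
    ones++;
  if (ones == 0) return 1;              /* 0xxxxxxx: single byte */
  if (ones == 1 || ones > 4) return 0;  /* cannot start a sequence */
  return ones;                          /* 2, 3, or 4 bytes in total */
}
```

The length returned here says only how many bytes to expect. Whether those bytes are valid continuation bytes, and whether the decoded value is in range, still has to be checked the way utf8_decode does.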
