393. UTF-8 Validation

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n+1)th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.
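
The two rules above can be expressed as bit-mask tests on a single byte. The following is a minimal sketch; the class name `Utf8LeadByte` and both method names are hypothetical and do not come from the problem statement.

```java
final class Utf8LeadByte {
    /**
     * Returns the sequence length (1 to 4) announced by a lead byte,
     * or -1 if the byte cannot start a character under the rules above.
     */
    static int sequenceLength(int b) {
        b &= 0xFF;                         // only the low 8 bits carry data
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 10xxxxxx or 11111xxx
    }

    /** Returns true if the two most significant bits of the byte are 10. */
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;         // 10xxxxxx
    }
}
```

For instance, `sequenceLength(197)` yields 2 because 197 is 11000101 in binary, while `sequenceLength(130)` yields -1 because 10000010 is a continuation byte and cannot begin a character.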

This is how the UTF-8 encoding would work:

Char. number range  |        UTF-8 octet sequence
(hexadecimal)       |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
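
To make the table concrete, here is a small sketch that packs a code point into the octet patterns shown, one branch per table row. The class `Utf8EncodeSketch` and its `encode` method are hypothetical names used only for illustration. As a simplifying assumption, the sketch follows the table literally and does not reject the surrogate range D800-DFFF, which a strict encoder would.

```java
final class Utf8EncodeSketch {
    /** Encodes one code point into 1 to 4 octets, following the table row by row. */
    static int[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + cp);
        }
        if (cp <= 0x7F) {
            // 0xxxxxxx
            return new int[] { cp };
        }
        if (cp <= 0x7FF) {
            // 110xxxxx 10xxxxxx
            return new int[] { 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F) };
        }
        if (cp <= 0xFFFF) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new int[] {
                0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new int[] {
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)
        };
    }
}
```

Encoding U+0142 with this sketch gives the octets 197 and 130, the same pair that opens Example 1.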

Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.

Note:

The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.

It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character.

Example 2:

data = [235, 140, 4], which represents the octet sequence: 11101011 10001100 00000100.

Return false.

The first 3 bits are all ones and the 4th bit is 0, which means it is a 3-byte character.

The next byte is a continuation byte which starts with 10 and that's correct.

But the second continuation byte does not start with 10, so it is invalid.
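
When checking examples like these by hand, it helps to print the data as 8-bit octets. The helper below is a hypothetical debugging aid (the names `Utf8BitsSketch` and `toBits` are not part of the problem); it keeps only the low 8 bits of each integer, as the note above requires.

```java
final class Utf8BitsSketch {
    /** Renders each integer's low 8 bits as a zero-padded binary octet, separated by spaces. */
    static String toBits(int[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Setting bit 8 forces a 9-digit string; dropping the first digit leaves 8 padded bits.
            sb.append(Integer.toBinaryString((data[i] & 0xFF) | 0x100).substring(1));
        }
        return sb.toString();
    }
}
```

Calling `toBits(new int[] {235, 140, 4})` reproduces the octet sequence from Example 2, where the third octet visibly starts with 00 rather than 10.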

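Algorithm Analysis

The algorithm is simple: check each number in turn to see whether it falls in a legal range. If a number is in 0x00~0x7F, it is a 1-byte character, so move on to the next number. If a number is in 0xC0~0xDF, it should start a 2-byte character, so the next number must be in 0x80~0xBF. If a number is in 0xE0~0xEF, it should start a 3-byte character, so the next two numbers must be in 0x80~0xBF. If a number is in 0xF0~0xF7, it should start a 4-byte character, so the next three numbers must be in 0x80~0xBF. Any other leading value (0x80~0xBF or 0xF8~0xFF) is invalid, as is reaching the end of the array before all continuation bytes have been read.

Java implementation: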

public class Solution {
    public boolean validUtf8(int[] data) {
        int len = data.length;
        int index = 0;
        int num, num1, num2, num3;
        while (index < len) {
            num = data[index];
            num &= 0xff;
            if (num >= 0 && num <= 0x7f) {
                // 1-byte character
                index++;
            }
            else if (num >= 0xc0 && num <= 0xdf) {
                // 2-byte character
                if (index + 1 < len) {
                    num1 = data[index + 1];
                    num1 &= 0xff;
                    if (!(num1 <= 0xbf && num1 >= 0x80)) {
                        return false;
                    }
                    // the second byte is valid
                    index += 2;
                }
                else {
                    return false;
                }
            }
            else if (num >= 0xe0 && num <= 0xef) {
                // 3-byte character
                if (index + 2 < len) {
                    num1 = data[index + 1];
                    num2 = data[index + 2];
                    num1 &= 0xff;
                    num2 &= 0xff;
                    if (!(num1 >= 0x80 && num1 <= 0xbf && num2 >= 0x80 && num2 <= 0xbf)) {
                        return false;
                    }
                    index += 3;
                }
                else {
                    return false;
                }
            }
            else if (num >= 0xf0 && num <= 0xf7) {
                // 4-byte character
                if (index + 3 < len) {
                    num1 = data[index + 1];
                    num2 = data[index + 2];
                    num3 = data[index + 3];
                    num1 &= 0xff;
                    num2 &= 0xff;
                    num3 &= 0xff;
                    if (!(num1 >= 0x80 && num1 <= 0xbf && num2 >= 0x80 && num2 <= 0xbf && num3 >= 0x80 && num3 <= 0xbf)) {
                        return false;
                    }
                    index += 4;
                }
                else {
                    return false;
                }
            }
            else {
                // a continuation byte or 0xf8~0xff cannot start a character
                return false;
            }
        }
        return true;
    }
}
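
As an alternative to looking ahead with an index, the same range checks can be done in one pass with a counter of continuation bytes still expected. This is a sketch of a different formulation, not the solution above; the class name `Utf8CounterSketch` and the variable `pending` are hypothetical.

```java
final class Utf8CounterSketch {
    /** Single-pass check: `pending` counts the continuation bytes still owed. */
    static boolean validUtf8(int[] data) {
        int pending = 0;
        for (int value : data) {
            int b = value & 0xFF;                 // only the low 8 bits carry data
            if (pending > 0) {
                if ((b & 0xC0) != 0x80) {
                    return false;                 // expected 10xxxxxx
                }
                pending--;
            } else if ((b & 0x80) == 0x00) {
                pending = 0;                      // 0xxxxxxx: 1-byte character
            } else if ((b & 0xE0) == 0xC0) {
                pending = 1;                      // 110xxxxx
            } else if ((b & 0xF0) == 0xE0) {
                pending = 2;                      // 1110xxxx
            } else if ((b & 0xF8) == 0xF0) {
                pending = 3;                      // 11110xxx
            } else {
                return false;                     // 10xxxxxx or 11111xxx as a lead byte
            }
        }
        return pending == 0;                      // a truncated final character is invalid
    }

    public static void main(String[] args) {
        System.out.println(validUtf8(new int[] {197, 130, 1}));  // Example 1
        System.out.println(validUtf8(new int[] {235, 140, 4}));  // Example 2
    }
}
```

The counter version avoids the repeated bounds checks, at the cost of reporting a truncated sequence only after the loop ends.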
