A character in UTF-8 can be from 1 to 4 bytes long, subjected to the following rules:
1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
2. For n-bytes characters, the first n-bits are all one’s, and the n + 1 bit is 0, followed by n - 1 bytes, with the most significant 2 bits being 10.
You are given an array 'DATA' of integers representing the data, Your task is to find out whether the given array is a valid UTF-8 encoding.
The given input contains only integers, and only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Given data = [196, 128, 1], which represents the octet sequence: 11000101 10000010 00000001. Return true because this is a valid UTF-8 sequence for a 2-byte character followed by a 1-byte character.
Input format :
The first line of input contains an integer T denoting the number of test cases. The first line of each test case contains a single integer 'N', where ‘N’ is the number of elements of the array. The second line of each test case contains 'N' space-separated integers, denoting the array/list 'DATA'.
Output format :
For each test case, return whether the array denotes a correct UTF-8 sequence or not. The output of each test case will be printed in a separate line.
You don’t need to print anything, it has been already taken care of. You just need to implement the given function.
1 <= T <= 5 1 <= N <= 4 1 <= DATA[ i ] < 256 Where ‘DATA[ i ]’ is array element at index ‘i’. Time limit: 1 sec
The most important observation is for ‘N’ byte UTF 8 sequence looks like this for different ‘N’ according to the UTF-8 rules:
For ‘N’ = 1, it should be 0_ _ _ _ _ _ _
For ‘N’ = 2, it should be 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _
For ‘N’ = 3, it should be 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _
For ‘N’ = 4, it should be 1 1 1 1 0 _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _
1 0 _ _ _ _ _ _
As long as every byte in the array is of the right type, it is a valid UTF-8 encoding.
The steps are as follows:
- Start from index 0, determine each byte’s type and check its validity.
- There are five kinds of valid byte type: 0**, 10**, 110**,1110**, and 11110**
- Give them type numbers 0, 1, 2, 3, 4, which are the index of the first 0 from left.
- So, the index of the first ‘0’ determines the byte type because that could only make a UTF-8 of 1 byte according to the rules.
- If a byte belongs to one of them: if it is type 0, continue if it is type 2 or 3 or 4, check whether the following 1, 2, and 3 byte(s) are of type 1 or not.
- If not, return false; else, if a byte is type 1 or not of valid type, return false.
- If we found a valid set of UTF-8 sequences as described above, we will return true.