UTF-8 Validation

Posted: 13 Mar, 2021
Difficulty: Easy

PROBLEM STATEMENT

Try Problem

A character in UTF-8 can be from 1 to 4 bytes long, subjected to the following rules:

1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.

2. For n-bytes characters, the first n-bits are all one’s, and the n + 1 bit is 0, followed by n - 1 bytes, with the most significant 2 bits being 10.

You are given an array 'DATA' of integers representing the data, Your task is to find out whether the given array is a valid UTF-8 encoding.

Note:

The given input contains only integers, and only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example:
Given data = [196, 128, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true because this is a valid UTF-8 sequence for a 2-byte character followed by a 1-byte character.

Input format :

The first line of input contains an integer T denoting the number of test cases.

The first line of each test case contains a single integer 'N', where ‘N’ is the number of elements of the array.

The second line of each test case contains 'N' space-separated integers, denoting the array/list 'DATA'.

Output format :

For each test case, return whether the array denotes a correct UTF-8 sequence or not.

The output of each test case will be printed in a separate line.

Note:

You don’t need to print anything, it has been already taken care of. You just need to implement the given function.

Constraints:

1 <= T <= 5
1 <= N <= 4
1 <= DATA[ i ] < 256

Where ‘DATA[ i ]’ is array element at index ‘i’.

Time limit: 1 sec
Approach 1

The most important observation is for ‘N’ byte UTF 8 sequence looks like this for different ‘N’ according to the UTF-8 rules: 

 

For ‘N’ = 1, it should be 0_ _ _ _ _ _ _

For ‘N’ = 2, it should be 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _ 

For ‘N’ = 3, it should be 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _

For ‘N’ = 4, it should be 1 1 1 1 0 _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _  

1 0 _ _ _ _ _ _  

 

As long as every byte in the array is of the right type, it is a valid UTF-8 encoding.    

 

The steps are as follows:

 

  • Start from index 0, determine each byte’s type and check its validity.
  • There are five kinds of valid byte type: 0**, 10**, 110**,1110**, and 11110**
  • Give them type numbers 0, 1, 2, 3, 4, which are the index of the first 0 from left.
  • So, the index of the first ‘0’ determines the byte type because that could only make a UTF-8 of 1 byte according to the rules.
  • If a byte belongs to one of them: if it is type 0, continue if it is type 2 or 3 or 4, check whether the following 1, 2, and 3 byte(s) are of type 1 or not.
  • If not, return false; else, if a byte is type 1 or not of valid type, return false.
  • If we found a valid set of UTF-8 sequences as described above, we will return true.
Try Problem