UTF-8 Validation Problem

Problem

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-bytes character, the first n bits are all one’s, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

Number of Bytes	UTF-8 Octet Sequence(binary)
1	0xxxxxxx
2	110xxxxx 10xxxxxx
3	1110xxxx 10xxxxxx 10xxxxxx
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Examples

Example 1:

Input:
data = [197,130,1]
Output:
 true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

Input:
data = [235,140,4]
Output:
 false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Solution

Method 1 - Bit Manipulation Using `&` Operation and Masking

Consider the following numbers, which represent the first byte in the four combinations (replacing all x with 0s):

0xxxxxxx = 0
110xxxxx = 192
1110xxxx = 224
11110xxx = 240

Next, let’s create masks for the & operator. We add an extra 1, so the result matches the numbers above. The masks (replacing x with 0s) are:

0xxxxxxx → 1xxxxxxx = 128
110xxxxx → 111xxxxx = 224
1110xxxx → 1111xxxx = 240
11110xxx → 11111xxx = 248

To determine the category of the current byte, we can use these masks to check if the byte fits the pattern. For example, for a 1-byte case, if 128 & byte == 0, the byte is valid. If 1111xxxx & byte == 1110xxxx, it indicates a 3-byte case, and we must ensure the next two bytes follow the correct pattern.

Algo

Iterate through the array of bytes. For each byte, determine which case it falls into, i.e., whether it is the first byte of a 1-byte, 2-byte, 3-byte, or 4-byte sequence.
Verify if the subsequent numBytes match the pattern. The mask for all following bytes is 1100000 as the number pattern is 10000000. If any mismatch occurs, return false.
Return true if all bytes follow the pattern.

Code

Java

public boolean validUtf8(int[] data) {
	if (data == null || data.length == 0) {
		return false;
	}

	for (int i = 0; i<data.length; i++) {
		if (data[i] > 255) {
			return false; // 1 after 8th digit, 100000000
		}
		int numBytes = 0;
		if ((data[i] & 128) == 0) { // 0xxxxxxx, 1 byte, 128(10000000)
			numBytes = 1;
		} else if ((data[i] & 224) == 192) { // 110xxxxx, 2 bytes, 224(11100000), 192(11000000)
			numBytes = 2;
		} else if ((data[i] & 240) == 224) { // 1110xxxx, 3 bytes, 240(11110000), 224(11100000)
			numBytes = 3;
		} else if ((data[i] & 248) == 240) { // 11110xxx, 4 bytes, 248(11111000), 240(11110000)
			numBytes = 4;
		} else {
			return false;
		}
		// note no equality in the condition
		if (i + numBytes > data.length) {
			return false;
		}
		for (int j = 1; j<numBytes; j++) { // check that the next n bytes start with 10xxxxxx
			if ((data[i + j] & 192) != 128) { // 192(11000000), 128(10000000)
				return false;
			}
		}
		i = i + numBytes - 1;
	}
	return true;
}

Python

def validUtf8(data):
    if not data:
        return False

    i = 0
    while i < len(data):
        if data[i] > 255:
            return False  # Ensure each byte is within the 8-bit range

        numBytes = 0
        if (data[i] & 128) == 0:  # 0xxxxxxx, 1 byte
            numBytes = 1
        elif (data[i] & 224) == 192:  # 110xxxxx, 2 bytes
            numBytes = 2
        elif (data[i] & 240) == 224:  # 1110xxxx, 3 bytes
            numBytes = 3
        elif (data[i] & 248) == 240:  # 11110xxx, 4 bytes
            numBytes = 4
        else:
            return False

        if i + numBytes > len(data):
            return False

        for j in range(1, numBytes):
            if (data[i + j] & 192) != 128:  # 10xxxxxx, continue bytes
                return False

        i += numBytes

Complexity

⏰ Time complexity: O(n)
🧺 Space complexity: O(1)

Problem#

Examples#

Solution#

Method 1 - Bit Manipulation Using & Operation and Masking#

Algo#

Code#

Java#

Python#

Complexity#