UTF-8

UTF-8 is an encoding scheme for Unicode that represents every code point using 1, 2, 3, or 4 bytes. Thanks to this variable length, UTF-8 uses space efficiently, particularly for ASCII-heavy text, where each character takes a single byte.

UTF-8 is the most popular encoding scheme; it dominates the Internet and applications.

For instance, we know that the code point of the letter A is 65 and it can be encoded using a 7-bit binary representation. The following figure represents this letter encoded in UTF-8:

Figure 2.8 – Letter A encoded in UTF-8

This is very cool! UTF-8 used a single byte to encode A. The first (leftmost) 0 signals that this is a single-byte encoding. Next, let’s see the Chinese character, 暗:

Figure 2.9 – Chinese character, 暗, encoded in UTF-8

The code point of 暗 is 26263, so UTF-8 uses 3 bytes to represent it. The first byte starts with 4 bits (1110) that signal a 3-byte encoding. Each of the next two bytes starts with the 2 bits 10. All these 8 marker bits can be dropped, and the remaining 16 bits give us the expected code point. Finally, let’s tackle the following figure:

Figure 2.10 – UTF-8 encoding with 4 bytes

This time, the first byte signals a 4-byte encoding via 11110. Each of the remaining 3 bytes starts with 10. All these 11 marker bits can be dropped, and the remaining 21 bits, 000011111011000001101, give us the expected code point, 128525. In the following figure, you can see the UTF-8 template used for encoding Unicode characters:

Figure 2.11 – UTF-8 template used for encoding Unicode characters
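The template can also be filled in by hand. The following is a minimal sketch, not a production encoder: the class name Utf8TemplateSketch and the helpers encode(), decode(), and bits() are hypothetical names introduced here for illustration, and decode() assumes a single well-formed sequence. It encodes the three code points discussed above (65, 26263, and 128525) and compares the result with what the JDK produces via getBytes(StandardCharsets.UTF_8).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TemplateSketch {

    // Encodes one code point by filling the UTF-8 template by hand.
    // Surrogates and values above 0x10FFFF are rejected.
    static byte[] encode(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Cannot encode: " + cp);
        }
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Decodes one sequence by dropping the marker bits.
    // Assumption: the input is a single well-formed UTF-8 sequence.
    static int decode(byte[] b) {
        int first = b[0] & 0xFF;
        if (first < 0x80) {
            return first;                               // single byte
        }
        int len = first >= 0xF0 ? 4 : first >= 0xE0 ? 3 : 2;
        int cp = first & (0xFF >> (len + 1));           // payload of the lead byte
        for (int i = 1; i < len; i++) {
            cp = (cp << 6) | (b[i] & 0x3F);             // drop the leading 10
        }
        return cp;
    }

    // Formats bytes as space-separated 8-bit groups.
    static String bits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", s).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int cp : new int[] {65, 26263, 128525}) {
            byte[] manual = encode(cp);
            byte[] jdk = new String(Character.toChars(cp))
                .getBytes(StandardCharsets.UTF_8);
            System.out.println(cp + " -> " + bits(manual)
                + " | same as JDK: " + Arrays.equals(manual, jdk)
                + " | decoded: " + decode(manual));
        }
    }
}
```

For 26263, this prints the bytes 11100110 10011010 10010111, and decoding them returns 26263 again.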

Did you know that 8 zeros in a row (00000000 – U+0000) are interpreted as NULL? In C-style strings, a NULL marks the end of the string, so sending it “accidentally” is a problem because the rest of the string will not be processed. Fortunately, UTF-8 prevents this issue: every byte of a multi-byte sequence starts with a 1, so a zero byte appears only when we actually send the U+0000 code point.
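A quick way to check this is to count the zero bytes in an encoded string. This is a small sketch; the class name Utf8NullCheck and the helper zeroBytes() are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8NullCheck {

    // Counts the zero bytes in the UTF-8 encoding of str.
    static int zeroBytes(String str) {
        int count = 0;
        for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A, 暗 (U+6697), and the code point 128525 encoded together
        String text = "A\u6697" + new String(Character.toChars(128525));
        System.out.println(zeroBytes(text));        // 0

        // Only an explicit U+0000 yields a zero byte
        System.out.println(zeroBytes("A\u0000B"));  // 1
    }
}
```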

Java and Unicode

As long as we use characters with code points up to 65,535 (0xFFFF), we can rely on the charAt() method to obtain the code point. Here are some examples:

int cp1 = "A".charAt(0);                   // 65
String hcp1 = Integer.toHexString(cp1);    // 41
String bcp1 = Integer.toBinaryString(cp1); // 1000001
int cp2 = "暗".charAt(0);                  // 26263
String hcp2 = Integer.toHexString(cp2);    // 6697
String bcp2 = Integer.toBinaryString(cp2); // 110011010010111

Based on these examples, we may write a helper method that returns the binary representation of strings whose code points go up to 65,535 (0xFFFF), as follows (you saw the imperative version of this functional code earlier):

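// requires: import java.util.stream.Collectors;
public static String strToBinary(String str) {

   // chars() yields one int per 16-bit char value
   String binary = str.chars()
     .mapToObj(Integer::toBinaryString)
     .map(t -> "0" + t) // prepend a leading 0 to each group
     .collect(Collectors.joining(" "));
   return binary;
}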

If you run this code against a Unicode character having a code point greater than 65,535 (0xFFFF), then you’ll get a wrong result, because Java stores such a character as a pair of 16-bit char values (a surrogate pair). You won’t get an exception or any kind of warning. So, charAt() covers only a subset of Unicode characters. For covering all Unicode characters, Java provides an API that consists of several methods. For instance, if we replace charAt() with codePointAt(), then we obtain the correct code point in all cases, as you can see in the following figure:

Figure 2.12 – charAt() vs. codePointAt()
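The same comparison can be reproduced in code. This is a minimal sketch using the code point 128525 from the UTF-8 examples; the class name CodePointDemo and the helper codePointsToBinary() are hypothetical names introduced here. The helper is one possible variant of strToBinary() that iterates over code points via codePoints() instead of char values.

```java
import java.util.stream.Collectors;

class CodePointDemo {

    // Variant of strToBinary() that walks code points, not char values.
    static String codePointsToBinary(String str) {
        return str.codePoints()
            .mapToObj(Integer::toBinaryString)
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String str = new String(Character.toChars(128525)); // U+1F60D

        System.out.println(str.length());          // 2 (two char values)
        System.out.println((int) str.charAt(0));   // 55357 (0xD83D, high surrogate)
        System.out.println((int) str.charAt(1));   // 56845 (0xDE0D, low surrogate)
        System.out.println(str.codePointAt(0));    // 128525
        System.out.println(str.codePointCount(0, str.length())); // 1
        System.out.println(codePointsToBinary(str)); // 11111011000001101
    }
}
```

Note that charAt(0) returns 0xD83D here, which is only the first half of the surrogate pair and not a character on its own.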
