Explain and exemplify UTF-8, UTF-16, and UTF-32
By Adenike Adekola / May 13, 2022
In ASCII encoding, the letter A is 65, the letter B is 66, and so on. In Java, we can easily check this via the existing API as in the following simple code:
int decimalA = "A".charAt(0); // 65
String binaryA = Integer.toBinaryString(decimalA); // 1000001
Or, let's see the encoding of the text Hello World. This time, we include the free high bit as well, zero-padding each character to 8 bits, so the result will be 01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100:
char[] chars = "Hello World".toCharArray();
for (char ch : chars) {
    // zero-pad each character's binary form to 8 bits
    System.out.print(String.format("%8s", Integer.toBinaryString(ch)).replace(' ', '0') + " ");
}
If we perform a match, we see that 01001000 is H, 01100101 is e, 01101100 is l, 01101111 is o, 00100000 is space, 01010111 is W, 01110010 is r, and 01100100 is d. So, besides letters, ASCII encoding can represent the English alphabet (upper and lower case), digits, space, punctuation marks, and some special characters.

Besides the core ASCII for English, we also have ASCII extensions, which are basically variations of the original ASCII meant to support other alphabets. You have probably heard of ISO-8859-1 (known as ISO Latin 1), which is a famous ASCII extension. But even with ASCII extensions, there are still a lot of characters in the world that cannot be encoded. Some countries use far more characters than ASCII can encode, and some don't use alphabets at all. So, ASCII has its limitations.

I know what you're thinking... let's use that free bit (2⁷ + 127 = 255, giving 256 values). Yes, but even so, we can go up to only 256 characters. Still not enough! It is time to encode characters using more than 1 byte.
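To see this limitation in action, here is a minimal sketch (the sample character is my own choice; the charset constants come from java.nio.charset.StandardCharsets). The character é fits in ISO-8859-1, but it falls outside US-ASCII, so the ASCII encoder substitutes a question mark:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// 'é' (U+00E9) fits in ISO-8859-1 as the single byte 0xE9 (233)
byte[] latin1 = "é".getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.toString(latin1)); // [-23] (0xE9 as a signed byte)

// 'é' is not representable in US-ASCII, so the encoder replaces it with '?'
byte[] ascii = "é".getBytes(StandardCharsets.US_ASCII);
System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // ?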
Introducing multi-byte encoding
In different parts of the world, people started to create multi-byte encoding schemes (commonly, 2 bytes). For instance, Japan created Shift-JIS, while Taiwan and Hong Kong created Big5; these schemes use 1 or 2 bytes to represent a character.

But what happened when most countries came up with their own multi-byte encoding schemes, trying to cover their special characters, symbols, and so on? Obviously, this led to huge incompatibility between the encoding schemes used in different countries. Even worse, some countries have multiple encoding schemes that are totally incompatible with each other. For instance, Japan has three different incompatible encoding schemes, which means that encoding a document with one of them and decoding it with another leads to a garbled document.

But this incompatibility was not such a big issue until the Internet appeared and documents began to be shared massively around the globe. At that moment, the incompatibility between encoding schemes conceived in isolation (for instance, per country or geographical region) started to become painful. It was the perfect moment for the Unicode Consortium to be created.
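To simulate such garbling, here is a minimal sketch (note that Shift_JIS and EUC-JP are optional "extended" charsets; they ship with most full JDKs, but availability can vary). We encode Japanese text with one scheme and decode it with another, incompatible one:

import java.nio.charset.Charset;

// two real, mutually incompatible Japanese encodings
Charset shiftJis = Charset.forName("Shift_JIS");
Charset eucJp = Charset.forName("EUC-JP");

byte[] encoded = "日本語".getBytes(shiftJis); // "Japanese language"

// decoding with the wrong scheme yields a garbled string (mojibake)
System.out.println(new String(encoded, eucJp));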