Explain and exemplify UTF-8, UTF-16, and UTF-32 – Objects, Immutability, Switch Expressions, and Pattern Matching
By Adenike Adekola / May 10, 2022
34. Explain and exemplify UTF-8, UTF-16, and UTF-32
Character encoding/decoding is important for browsers, databases, text editors, file systems, networking, and so on. So, it’s a major topic for any programmer. Check out the following figure:

Figure 2.1 – Representing text with different char sets
In figure 2.1, we see several Chinese characters represented in UTF-8, UTF-16, and ANSI on a computer screen. But what are these? What is ANSI? What is UTF-8, and how did we get to it? Why don't these characters look normal in ANSI?

Well, the story begins with computers trying to represent characters (such as letters of the alphabet, digits, or punctuation marks). Computers understand and process everything from the real world as a binary representation, that is, as a sequence of 0s and 1s. This means that every character (for instance, A, 5, +, and so on) has to be mapped to a sequence of 0s and 1s.

The process of mapping a character to a sequence of 0s and 1s is known as character encoding, or simply encoding. The reverse process of un-mapping a sequence of 0s and 1s back to a character is known as character decoding, or simply decoding. Ideally, an encoding-decoding cycle should return the same character; otherwise, we obtain something that we don't understand or cannot use.

For instance, the Chinese character 久 should be encoded in the computer's memory as a sequence of 0s and 1s. Next, when this sequence is decoded, we expect back the same Chinese character, 久. In figure 2.1, this happens in the left-side and middle screenshots, while in the right screenshot the returned character is ä¹…. A Chinese speaker will not understand this (actually, nobody will), so something went wrong!

Of course, we don't have only Chinese characters to represent. We have many other sets of characters grouped in alphabets, emoticons, and so on. A set of characters has well-defined content (for instance, an alphabet has a certain number of well-defined characters) and is known as a character set or, for short, a charset.

Having a charset, the problem is to define a set of rules (a standard) that clearly explains how the characters of this charset should be encoded/decoded in the computer's memory. Without a clear set of rules, encoding and decoding may lead to errors or indecipherable characters. Such a standard is known as an encoding scheme.

One of the first encoding schemes was ASCII.
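To see the encoding-decoding cycle (and what goes wrong in the right screenshot) in code, here is a minimal Java sketch; the class name is ours, chosen for illustration. It encodes 久 as UTF-8 bytes and decodes them once with the matching charset and once with windows-1252, a code page that is often what "ANSI" refers to on Windows systems:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDecodingDemo {

    public static void main(String[] args) {
        String original = "久";

        // Encoding: map the character to a sequence of bytes (0s and 1s)
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8); // three bytes: E4 B9 85 (hex)

        // Decoding with the same charset returns the original character
        String roundTrip = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip); // 久

        // Decoding the same bytes with a different charset (here, windows-1252)
        // produces indecipherable characters
        String broken = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(broken); // ä¹…
    }
}
```

The mismatch in the last step reproduces exactly the ä¹… shown in the right screenshot of figure 2.1: the bytes are the same, but they are interpreted under the wrong set of rules.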
Introducing ASCII encoding scheme (or, single-byte encoding)
ASCII stands for American Standard Code for Information Interchange. This encoding scheme relies on a 7-bit binary system. In other words, each character that is part of the ASCII charset (http://ee.hawaii.edu/~tep/EE160/Book/chap4/subsection2.1.1.1.html) should be representable (encoded) in 7 bits. A 7-bit number can be a decimal between 0 and 127, as in the next figure:

Figure 2.2 – ASCII charset encoding
So, ASCII is an encoding scheme based on a 7-bit system that supports 128 different characters. But we know that computers operate on bytes (octets), and a byte has 8 bits. This means that ASCII is a single-byte encoding scheme that leaves one bit of each byte unused. See the following figure (and the code sketch after it):

Figure 2.3 – The highlighted bit is left free in ASCII encoding
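To make the 7-bit point concrete, here is a small sketch (again, the class name is ours) that encodes a few characters with the US_ASCII charset and prints their decimal code and 8-bit binary form; notice that the high-order bit is always 0:

```java
import java.nio.charset.StandardCharsets;

public class AsciiDemo {

    public static void main(String[] args) {
        for (char ch : new char[] {'A', '5', '+'}) {
            // ASCII is a single-byte encoding, so each character maps to exactly one byte
            byte encoded = String.valueOf(ch).getBytes(StandardCharsets.US_ASCII)[0];

            // Pad the binary form to 8 bits to make the unused high-order bit visible
            String bits = String.format("%8s", Integer.toBinaryString(encoded)).replace(' ', '0');

            System.out.printf("'%c' -> decimal %d -> binary %s%n", ch, encoded, bits);
        }
        // Output:
        // 'A' -> decimal 65 -> binary 01000001
        // '5' -> decimal 53 -> binary 00110101
        // '+' -> decimal 43 -> binary 00101011
    }
}
```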