Explain and exemplify UTF-8, UTF-16, and UTF-32 6 – Objects, Immutability, Switch Expressions, and Pattern Matching
By Adenike Adekola / May 25, 2022
Check out the last example, c2. Since codePointAt() returns the correct code point (128525), we can obtain the binary representation as follows:
String uc = Integer.toBinaryString(c2); // 11111011000001101
So, if we need a method that returns the binary encoding of any Unicode character, then we can replace the chars() call with a codePoints() call. The codePoints() method returns the code points of the given sequence:
// requires: import java.util.stream.Collectors;
public static String codePointToBinary(String str) {
    String binary = str.codePoints()
        .mapToObj(Integer::toBinaryString)
        .collect(Collectors.joining(" "));
    return binary;
}
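To make the difference between chars() and codePoints() concrete, here is a minimal sketch that runs both over the same supplementary character (code point 128525, written as a surrogate-pair escape). The class name CodePointsDemo and the method names are hypothetical and exist only for this illustration.

```java
import java.util.stream.Collectors;

class CodePointsDemo {

    // Binary form of each UTF-16 code unit, which is what chars() iterates over
    public static String charsToBinary(String str) {
        return str.chars()
                .mapToObj(Integer::toBinaryString)
                .collect(Collectors.joining(" "));
    }

    // Binary form of each Unicode code point, which is what codePoints() iterates over
    public static String codePointsToBinary(String str) {
        return str.codePoints()
                .mapToObj(Integer::toBinaryString)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // U+1F60D (128525) stored as the surrogate pair D83D DE0D
        String smiley = "\uD83D\uDE0D";

        // Two code units: 1101100000111101 1101111000001101
        System.out.println(charsToBinary(smiley));

        // One code point: 11111011000001101
        System.out.println(codePointsToBinary(smiley));
    }
}
```

With chars() the two surrogates are reported separately, so neither value is the real code point; codePoints() combines them into 128525.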
The codePoints() method is just one of the methods provided by Java for working with code points. The Java API also includes codePointAt(), offsetByCodePoints(), codePointCount(), codePointBefore(), Character.codePointOf(), and so on. The bundled code contains several examples of them; here is one that obtains a String from a given code point:
String str1 = String.valueOf(Character.toChars(65)); // A
String str2 = String.valueOf(Character.toChars(128525));
The toChars() method takes a code point and returns its UTF-16 representation as a char[]. The string returned by the first example (str1) has a length of 1 and is the letter A. The second example returns a string of length 2, since the character with the code point 128525 needs a surrogate pair. The returned char[] contains both the high and the low surrogate.
Finally, let's have a helper method that allows us to obtain the binary representation of a string for a given encoding scheme:
public static String stringToBinaryEncoding(
        String str, String encoding) {
    final Charset charset = Charset.forName(encoding);
    final byte[] strBytes = str.getBytes(charset);
    final StringBuilder strBinary = new StringBuilder();
    for (byte strByte : strBytes) {
        // emit the 8 bits of the current byte, most significant bit first
        for (int i = 0; i < 8; i++) {
            strBinary.append((strByte & 128) == 0 ? 0 : 1);
            strByte <<= 1;
        }
        strBinary.append(" ");
    }
    return strBinary.toString().trim();
}
Using this method is quite simple as you can see in the following examples:
// 00000000 00000000 00000000 01000001
String r1 = Charsets.stringToBinaryEncoding("A", "UTF-32");
// 10010111 01100110
String r2 = Charsets.stringToBinaryEncoding("暗",
    StandardCharsets.UTF_16LE.name());
You can practice more examples in the bundled code.
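As a complement to the binary dumps above, the following sketch compares how many bytes the same text occupies in each of the three encoding schemes. The class EncodingSizes and its encodedLength() method are hypothetical names used only here. The big-endian variants (UTF-16BE, UTF-32BE) are chosen on purpose, because Java's plain UTF-16 charset prepends a byte order mark when encoding, which would inflate the counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {

    // Number of bytes needed to store str in the given charset
    public static int encodedLength(String str, Charset charset) {
        return str.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32BE");

        // "A" (U+0041), the CJK character U+6697, and U+1F60D as a surrogate pair
        String[] samples = {"A", "\u6697", "\uD83D\uDE0D"};

        for (String s : samples) {
            System.out.printf("UTF-8: %d, UTF-16: %d, UTF-32: %d%n",
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE),
                    encodedLength(s, utf32));
        }
        // Byte counts per sample: A -> 1, 2, 4; U+6697 -> 3, 2, 4; U+1F60D -> 4, 4, 4
    }
}
```

The counts show the trade-off: UTF-8 is variable length (1 to 4 bytes), UTF-16 uses 2 bytes or a 4-byte surrogate pair, and UTF-32 always uses 4 bytes.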
JDK 18 defaults the charset to UTF-8
Before JDK 18, the default charset was determined from the operating system's charset and locale (for instance, on a Windows machine, it could be windows-1252). Starting with JDK 18, the default charset is UTF-8 (Charset.defaultCharset() returns the UTF-8 charset).
However, the default charset can still be set explicitly at the command line via the file.encoding system property, while the native.encoding system property reports the charset of the underlying platform. For instance, you may need such a setting to compile or run legacy code developed before JDK 18:
// the default charset is computed from native.encoding
java -Dfile.encoding=COMPAT
// the default charset is windows-1252
java -Dfile.encoding=windows-1252
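To check which charset a given JVM actually ended up with, a small diagnostic such as the following sketch can help. The class DefaultCharsetInfo and its describe() method are hypothetical names. Note that native.encoding is only available on newer JDKs (it was added in JDK 17), so on older runtimes the sketch simply reports null for it.

```java
import java.nio.charset.Charset;

class DefaultCharsetInfo {

    // Summarizes the effective default charset and the related system properties
    public static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", native.encoding = " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        // The output depends on the JDK version, the platform,
        // and any -Dfile.encoding=... option passed at startup
        System.out.println(describe());
    }
}
```

Running it once without options and once with -Dfile.encoding=COMPAT is a quick way to see the effect of the property on a particular machine.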
So, since JDK 18, classes that use encoding (for instance, FileReader/FileWriter, InputStreamReader/OutputStreamWriter, PrintStream, Formatter, Scanner, URLEncoder/URLDecoder) can take advantage of UTF-8 out of the box. For instance, using UTF-8 before JDK 18 for reading a file can be accomplished by explicitly specifying this charset encoding scheme as follows:
try (BufferedReader br = new BufferedReader(new FileReader(
        chineseUtf8File.toFile(), StandardCharsets.UTF_8))) {
    …
}
Accomplishing the same thing in JDK18+ doesn’t require explicitly specifying the charset encoding scheme:
try (BufferedReader br = new BufferedReader(
        new FileReader(chineseUtf8File.toFile()))) {
    …
}
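Code that has to behave the same on JDKs older and newer than 18 may still prefer to pass the charset explicitly. The following is a minimal round-trip sketch under that assumption: it writes a string to a temporary file as UTF-8 and reads it back with an explicit charset. The class Utf8RoundTrip, the method name, and the temporary file prefix are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8RoundTrip {

    // Writes text to a temp file as UTF-8 and returns the first line read back
    public static String writeAndReadFirstLine(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.writeString(tmp, text, StandardCharsets.UTF_8);
                // Explicit charset: the result does not depend on the default charset
                try (BufferedReader br = new BufferedReader(
                        new FileReader(tmp.toFile(), StandardCharsets.UTF_8))) {
                    return br.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u6697"; // the CJK character U+6697
        System.out.println(original.equals(writeAndReadFirstLine(original))); // true
    }
}
```

Both Files.writeString() and the FileReader(File, Charset) constructor require JDK 11 or later.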
However, System.out and System.err do not follow the UTF-8 default; in JDK 18+ they still use the charset of the host environment (console). So, if you are using System.out/err and you see question marks (?) instead of the expected characters, then most probably you should set UTF-8 via the stdout.encoding and stderr.encoding system properties (added in JDK 19):
-Dstderr.encoding=utf8 -Dstdout.encoding=utf8
Or, you can set them globally via an environment variable:
_JAVA_OPTIONS="-Dstdout.encoding=utf8 -Dstderr.encoding=utf8"
In the bundled code you can see more examples.
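The question marks come from the encoder replacing characters that the target charset cannot represent. The sketch below reproduces that effect in isolation by printing through a PrintStream bound to an in-memory buffer, once with US-ASCII and once with UTF-8. The class StdoutEncodingDemo and the printTo() method are hypothetical; this is an illustration of the mechanism, not a replacement for the system properties above.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StdoutEncodingDemo {

    // Prints text through a PrintStream using the given charset
    // and returns the bytes that reached the underlying stream
    public static byte[] printTo(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream ps = new PrintStream(sink, true, charset)) {
            ps.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u6697"; // the CJK character U+6697

        // US-ASCII cannot represent the character, so a single '?' byte (63) is written
        System.out.println(Arrays.toString(printTo(text, StandardCharsets.US_ASCII)));

        // UTF-8 writes the three bytes E6 9A 97, shown as signed values [-26, -102, -105]
        System.out.println(Arrays.toString(printTo(text, StandardCharsets.UTF_8)));
    }
}
```

The same PrintStream(OutputStream, boolean, Charset) constructor (JDK 10+) could be used to wrap the standard output with an explicit charset when changing JVM options is not possible.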