Wolfram Language

Use the Full Range of Unicode Characters

The Wolfram Language was one of the earliest adopters of the Unicode standard (www.unicode.org). Version 12 extends the range of characters that the Wolfram Language can process and write beyond the Basic Multilingual Plane, which holds the roughly 50,000 most common characters, to the full range of more than one million possible Unicode code points. Support includes conversion to and from UTF-8, a new special input form for entering 6-digit hexadecimal codes, and transmission over WSTP.

Enter a string of four characters using the previously existing hexadecimal input forms \.xx for 2-digit codes and \:yyyy for 4-digit codes, as well as the new form \|zzzzzz for 6-digit codes.
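The input for this step is not reproduced in the text, so the following is only a sketch of what such an input could look like. The four characters (z, ζ, ℤ and 𝛧) and the variable name string are illustrative assumptions, chosen so that each character needs a different number of hexadecimal digits.

```wolfram
(* One character per escape: 2-digit, 4-digit, 4-digit and 6-digit hexadecimal codes *)
string = "\.7a\:03b6\:2124\|01d6e7"

(* In Version 12 or later the 6-digit escape counts as a single character *)
StringLength[string]   (* 4 *)
```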

Convert the characters to code points. The last character, with a code point above 65535, is newly representable in the Wolfram Language.
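A minimal sketch of this conversion, assuming the same illustrative four-character string (the specific characters are an assumption, not taken from the original example):

```wolfram
string = "\.7a\:03b6\:2124\|01d6e7";

(* ToCharacterCode returns one integer code point per character *)
ToCharacterCode[string]
(* {122, 950, 8484, 120551} *)
```

The last value, 120551, exceeds 65535 and therefore lies outside the Basic Multilingual Plane.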

In base 16, the correspondence between the input form and code points becomes clear.
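One way to view the code points in base 16 is sketched below, again assuming the illustrative string used for this example. BaseForm affects only how the numbers are displayed, while IntegerString produces the hexadecimal digits as strings.

```wolfram
codes = ToCharacterCode["\.7a\:03b6\:2124\|01d6e7"];

(* Display the integers in base 16; the digits match those typed in the escapes *)
BaseForm[codes, 16]

(* The same digits as plain strings, convenient for comparison *)
IntegerString[codes, 16]
(* {"7a", "3b6", "2124", "1d6e7"} *)
```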

Convert the string to a ByteArray using UTF-8 encoding.
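A sketch of the conversion, assuming the illustrative string from this example; the name bytes is hypothetical. Normal is used here only to make the individual byte values visible.

```wolfram
string = "\.7a\:03b6\:2124\|01d6e7";

(* Encode the string as UTF-8; the result is a ByteArray object *)
bytes = StringToByteArray[string, "UTF-8"];

Length[bytes]   (* 10, that is 1 + 2 + 3 + 4 bytes for the four characters *)
Normal[bytes]   (* {122, 206, 182, 226, 132, 164, 240, 157, 155, 167} *)
```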

UTF-8 is a variable-length encoding in which larger code points require more bytes, from one byte up to four. Split the byte array into four arrays of increasing length.
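The split could be sketched as follows, assuming the illustrative four-character string. The bytes are converted to an ordinary list before calling TakeList and rewrapped afterwards; this detour is a cautious choice for the sketch, not necessarily how the original example does it.

```wolfram
bytes = StringToByteArray["\.7a\:03b6\:2124\|01d6e7", "UTF-8"];

(* Take consecutive runs of 1, 2, 3 and 4 bytes and rewrap each run as a ByteArray *)
arrays = ByteArray /@ TakeList[Normal[bytes], {1, 2, 3, 4}]

Length /@ arrays   (* {1, 2, 3, 4} *)
```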

Convert each array back to a string, showing that each array corresponds to exactly one character.
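A sketch of the round trip, under the same assumptions as before (illustrative characters, hypothetical variable names):

```wolfram
bytes = StringToByteArray["\.7a\:03b6\:2124\|01d6e7", "UTF-8"];
arrays = ByteArray /@ TakeList[Normal[bytes], {1, 2, 3, 4}];

(* Decode each piece separately; every piece yields a one-character string *)
chars = ByteArrayToString[#, "UTF-8"] & /@ arrays

(* Joining the pieces recovers the original string *)
StringJoin[chars] === "\.7a\:03b6\:2124\|01d6e7"   (* True *)
```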
