3.4. UTF-8 utilities
The UTF8_UTILS module provides Unicode UTF-8 string utilities including character iteration, codepoint extraction, byte length calculation, and validation of UTF-8 encoded text.
All functions and symbols are in the “utf8_utils” module; use require to get access to it:
require daslib/utf8_utils
3.4.1. Constants
- utf8_utils.s_utf8d = fixed_array<uint>(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0xa, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x3, 0x3, 0xb, 0x6, 0x6, 0x6, 0x5, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x0, 0xc, 0x18, 0x24, 0x3c, 0x60, 0x54, 0xc, 0xc, 0xc, 0x30, 0x48, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0x0, 0xc, 0xc, 0xc, 0xc, 0xc, 0x0, 0xc, 0x0, 0xc, 0xc, 0xc, 0x18, 0xc, 0xc, 0xc, 0xc, 0xc, 0x18, 0xc, 0x18, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0x18, 0xc, 0xc, 0xc, 0xc, 0xc, 0x18, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0x18, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0x24, 0xc, 0x24, 0xc, 0xc, 0xc, 0x24, 0xc, 0xc, 0xc, 0xc, 0xc, 0x24, 0xc, 0x24, 0xc, 0xc, 0xc, 0x24, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc)
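Byte-class and state-transition table for the UTF-8 DFA decoder. The first 256 entries map each byte value to a character class; the remaining entries hold the state transitions, indexed by state plus class.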
- utf8_utils.UTF8_ACCEPT = 0x0
DFA accept state indicating a valid UTF-8 sequence.
3.4.2. Encoding and decoding
- utf8_utils.decode_unicode_escape(str: string): string
Decodes Unicode escape sequences (backslash followed by hex digits) in a string to UTF-8.
- Arguments:
str : string
- utf8_utils.utf16_to_utf32(high: uint; low: uint): uint
Converts a UTF-16 surrogate pair to a single UTF-32 codepoint.
- Arguments:
high : uint
low : uint
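A minimal sketch of calling utf16_to_utf32, assuming gen2 syntax (options gen2) and a conventional main entry point. The sample values are illustrative: 0xD83D and 0xDE00 are the UTF-16 surrogate pair for U+1F600. The formula in the comment is the standard surrogate-pair rule; that the function follows it is an assumption, not something the signature states.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // 0xD83D / 0xDE00 are the high and low surrogates of U+1F600.
    let high = uint(0xD83D)
    let low = uint(0xDE00)
    let codepoint = utf16_to_utf32(high, low)
    // Standard rule: 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00).
    print("codepoint = {codepoint}\n")
}
```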
3.4.2.1. utf8_decode
- utf8_utils.utf8_decode(source_utf8_string: string): array<uint>
Converts a UTF-8 string to UTF-32 and returns it as an array of codepoints (a UTF-32 string).
- Arguments:
source_utf8_string : string
- utf8_utils.utf8_decode(dest_utf32_string: array<uint>; source_utf8_string: string)
- utf8_utils.utf8_decode(source_utf8_string: array<uint8>): array<uint>
- utf8_utils.utf8_decode(dest_utf32_string: array<uint>; source_utf8_string: array<uint8>)
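The two calling styles of utf8_decode can be sketched as follows, assuming gen2 syntax and an ASCII sample string chosen only for illustration. Whether the destination overload appends to or replaces existing contents is not stated here, so the sketch passes an empty array.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // Returning overload: the result is moved into a new array.
    var codepoints <- utf8_decode("hello")
    print("decoded {length(codepoints)} codepoints\n")

    // Destination overload: the caller supplies the output array.
    var dest : array<uint>
    utf8_decode(dest, "hello")
    print("dest holds {length(dest)} codepoints\n")
}
```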
3.4.2.2. utf8_encode
- utf8_utils.utf8_encode(dest_array: array<uint8>; source_utf32_string: array<uint>)
Converts a UTF-32 string to UTF-8 and appends it to the UTF-8 byte array.
- Arguments:
dest_array : array<uint8>
source_utf32_string : array<uint> implicit
- utf8_utils.utf8_encode(dest_array: array<uint8>; ch: uint)
- utf8_utils.utf8_encode(ch: uint): array<uint8>
- utf8_utils.utf8_encode(source_utf32_string: array<uint>): array<uint8>
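A short sketch of the utf8_encode overloads, assuming gen2 syntax. The codepoints are sample values: U+0048 encodes to one byte in UTF-8 and U+20AC (the euro sign) to three.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // Sample UTF-32 input: 'H' (1 byte in UTF-8) and U+20AC (3 bytes).
    var codepoints : array<uint>
    push(codepoints, uint(0x48))
    push(codepoints, uint(0x20AC))

    // Returning overload: a fresh byte array.
    var bytes <- utf8_encode(codepoints)
    print("encoded into {length(bytes)} bytes\n")

    // Appending overloads: a whole array, then a single codepoint.
    var out : array<uint8>
    utf8_encode(out, codepoints)
    utf8_encode(out, uint(0x21))
    print("out holds {length(out)} bytes\n")
}
```

The appending overloads are the natural choice when building one buffer from several pieces, since they avoid allocating an intermediate array per call.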
3.4.3. Length and measurement
3.4.3.1. utf8_length
- utf8_utils.utf8_length(utf8_string: string): int
Returns the number of characters (codepoints, not bytes) in the UTF-8 string.
- Arguments:
utf8_string : string
- utf8_utils.utf8_length(utf8_string: array<uint8>): int
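To illustrate the difference between byte count and character count, here is a sketch that assumes gen2 syntax and uses the byte-array overload. The bytes E2 82 AC are the UTF-8 encoding of the single codepoint U+20AC.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // Three bytes that together encode one character (U+20AC).
    var bytes : array<uint8>
    push(bytes, uint8(0xE2))
    push(bytes, uint8(0x82))
    push(bytes, uint8(0xAC))
    let chars = utf8_length(bytes)
    print("bytes = {length(bytes)}, characters = {chars}\n")

    // String overload with an ASCII sample.
    let ascii_chars = utf8_length("hello")
    print("ascii characters = {ascii_chars}\n")
}
```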
3.4.4. Validation
3.4.4.1. contains_utf8_bom
- utf8_utils.contains_utf8_bom(utf8_string: array<uint8>): bool
Returns true if the byte array starts with a UTF-8 BOM (byte order mark).
- Arguments:
utf8_string : array<uint8> implicit
- utf8_utils.contains_utf8_bom(utf8_string: string): bool
- utf8_utils.is_first_byte_of_utf8_char(ch: uint8): bool
Returns true if the given byte is the first byte of a UTF-8 character.
- Arguments:
ch : uint8
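A sketch of both checks, assuming gen2 syntax. The UTF-8 BOM is the byte sequence EF BB BF. 0x41 is an ASCII byte and 0x82 has the 10xxxxxx continuation-byte pattern; both are sample inputs picked for illustration.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // EF BB BF is the UTF-8 byte order mark, followed here by 'A'.
    var bytes : array<uint8>
    push(bytes, uint8(0xEF))
    push(bytes, uint8(0xBB))
    push(bytes, uint8(0xBF))
    push(bytes, uint8(0x41))
    let has_bom = contains_utf8_bom(bytes)
    print("has BOM: {has_bom}\n")

    // 0x41 starts a character; 0x82 is a continuation byte.
    let ascii_first = is_first_byte_of_utf8_char(uint8(0x41))
    let cont_first = is_first_byte_of_utf8_char(uint8(0x82))
    print("0x41 is first byte: {ascii_first}\n")
    print("0x82 is first byte: {cont_first}\n")
}
```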
3.4.4.2. is_utf8_string_valid
- utf8_utils.is_utf8_string_valid(utf8_string: array<uint8>): bool
Returns true if the byte array contains a valid UTF-8 encoded string.
- Arguments:
utf8_string : array<uint8> implicit
- utf8_utils.is_utf8_string_valid(utf8_string: string): bool
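Validation might be used as a guard before decoding untrusted bytes. The sketch below assumes gen2 syntax. The byte 0xFF is used as the invalid sample because it never occurs in well-formed UTF-8.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // Well-formed input: two ASCII bytes.
    var good : array<uint8>
    push(good, uint8(0x48))
    push(good, uint8(0x69))

    // Malformed input: 0xFF is not allowed anywhere in UTF-8.
    var bad : array<uint8>
    push(bad, uint8(0x48))
    push(bad, uint8(0xFF))

    let good_ok = is_utf8_string_valid(good)
    let bad_ok = is_utf8_string_valid(bad)
    print("good is valid: {good_ok}\n")
    print("bad is valid: {bad_ok}\n")

    // Decode only after the check passes.
    if (good_ok) {
        var codepoints <- utf8_decode(good)
        print("decoded {length(codepoints)} codepoints\n")
    }
}
```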