Character size unit and length semantics
When programming an application for a Western language such as English, a single-byte
character set can be used, and the logical size, storage size and print width of each character are the
same. For example, in ISO-8859-1, the ê character takes one logical position, has a
storage size of one byte and a print width of one.
When programming an international application using multiple languages and a multibyte character set encoding, you must distinguish three size units:
- The size in character units, used to count or position logical characters in a string. For example, the strings abc and åôë both have a length of 3 in character units.
- The size in byte units, the number of bytes used to encode a character in a given character set. For example, the Latin ê (e with circumflex) character uses a single byte in the ISO-8859-1 character set, but needs two bytes in UTF-8.
- The size in width units, used in formatting and alignment. The width is the horizontal space taken by the glyph of a character, especially in a fixed-width font. This concept is also known as "fullwidth" versus "halfwidth" characters. For example, the width of a Chinese logogram is twice the width of Latin characters such as A, B, C.
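The three size units can be compared side by side with a short sketch. Python is used here purely for illustration, and the function names are hypothetical helpers, not part of any product API. The width calculation is a simplification: it assumes that only characters classified as East Asian "wide" or "fullwidth" by the Unicode database take two columns, and it ignores combining marks and ambiguous-width characters.

```python
import unicodedata


def char_length(s: str) -> int:
    """Size in character units: number of logical characters (code points)."""
    return len(s)


def byte_length(s: str, encoding: str) -> int:
    """Size in byte units: number of bytes needed in the given character set."""
    return len(s.encode(encoding))


def display_width(s: str) -> int:
    """Size in width units, assuming a fixed-width font.

    Simplified rule: East Asian wide (W) and fullwidth (F) characters
    count as 2 columns, everything else as 1.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in s
    )


if __name__ == "__main__":
    for text in ("abc", "\u00e5\u00f4\u00eb", "\u4e2d\u6587"):
        print(
            text,
            char_length(text),
            byte_length(text, "utf-8"),
            display_width(text),
        )
```

With these helpers, the strings abc and åôë both report 3 character units, while åôë needs 6 bytes in UTF-8 but only 3 in ISO-8859-1.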
Working with byte units in a multibyte character set can be difficult: you need to calculate sizes, lengths and substring offsets in bytes, when the natural way is to count in characters.
Length semantics define the unit to be used for character data type definitions, character string lengths and positions.
With Byte Length Semantics, a length is expressed in bytes, while Character Length Semantics counts in characters.
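The practical difference between the two semantics shows up when a value must fit a declared length. The following Python sketch is only an illustration under stated assumptions: `truncate_chars` and `truncate_bytes` are hypothetical names, and UTF-8 is assumed as the multibyte encoding.

```python
def truncate_chars(s: str, n: int) -> str:
    """Character length semantics: keep at most n characters."""
    return s[:n]


def truncate_bytes(s: str, n: int, encoding: str = "utf-8") -> str:
    """Byte length semantics: keep at most n bytes of the encoded string.

    A multibyte character that would be cut in the middle is dropped
    entirely, so the result is always a valid string.
    """
    clipped = s.encode(encoding)[:n]
    # errors="ignore" discards an incomplete trailing byte sequence.
    return clipped.decode(encoding, errors="ignore")


if __name__ == "__main__":
    text = "\u00e5\u00f4\u00eb"  # three characters, six bytes in UTF-8
    print(truncate_chars(text, 3))
    print(truncate_bytes(text, 3))
```

With a length of 3, character semantics keeps all three characters of åôë, while byte semantics in UTF-8 keeps only the first one, because each character takes two bytes and the third byte would split the second character.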