Length semantics settings

Understanding length semantics

The length semantics of character string data matters when using a multibyte character set. Length semantics involves data type length specification for database column and program variable definitions, as well as string manipulations, regarding lengths, positions, offsets and substrings.

In a single-byte characters set like ISO-8859-1, a character is encoded on one byte. The length of a string can be counted in bytes or characters, the unit does not matter. In other words, the length semantics is identical in bytes or characters, with a single byte encoding. However, with a multibyte character set like UTF-8 or BIG5, a character can be encoded on several bytes. In such case, the unit regarding length semantics matters, because the number of bytes of a character string can be different from the number of characters.

For multibyte characters sets, the language supports Byte Length Semantics (BLS) and Character Length Semantics (CLS) specification. BLS or CLS usage depends on the current character set of the application. BLS is typically used with a character set such as BIG5, because for historical reasons programmers are used to count 2 bytes for each Asian ideogram. For UTF-8, which is a variable size encoding, CLS should be used instead. CLS simplifies data type definition and string handling when using UTF-8.

Programming areas concerned by length semantics are illustrated in the following code example:

SCHEMA shop

# CREATE TABLE mytable (
#     k INT,
#     vc VARCHAR(10)
#         -- what is the unit for the column size and how many
#         -- characters can be stored in this column?
#  )

MAIN
  DEFINE buf, tmp VARCHAR(50)  -- what is the unit for the size?
  DEFINE rec RECORD LIKE mytable  -- what is the size of vc member?
  DEFINE str STRING, len INT

  DATABASE shop

  SELECT LENGTH(vc) INTO len -- What unit use string functions in SQL?
     FROM mytable WHERE k = 45

  LET buf = "abcdef..."  -- How many chars can this variable hold?

  DISPLAY LENGTH(buf)  -- In what unit is the length expressed?

  LET tmp = buf[1,5]  -- What is the unit for char positions?

  LET str = buf
  DISPLAY str.getLength() -- What is the unit for the length?
  DISPLAY str.getIndexOf("def") -- What is the unit for the offset?

END MAIN

Using Byte Length Semantics

Byte Length Semantics must be used if the current locale defines a multibyte character set different from UTF-8.

Byte Length Semantics is the default.

With BLS, the size of CHAR/VARCHAR program variables is expressed in byte units. In a single-byte character set like ISO-8859-1, every character is encoded on a unique byte, so the number of bytes equals the number of characters. When using BLS with a multibyte character set, you must be aware of the storage size in byte units: Character encoding requires more than one byte, so the number of bytes to store a multibyte string is bigger than the number of characters. For example, in a BIG5 encoding, one Chinese character needs 2 bytes, so if you want to hold a BIG5 string with a maximum of 10 Chinese characters, you must define a CHAR(20). When using UTF-8, characters can take one or several bytes which can use two or three times more storage space as character count. You need to choose the right expansion factor to define CHAR or VARCHAR variables in byte units.

-- Using Byte Length Semantics
DEFINE var VARCHAR(10)  -- Can store 10 bytes / 10 single-byte chars.

Using Char Length Semantics

Character Length Semantics should be used with multibyte character sets such as UTF-8: Migrating to UTF-8 by using CLS will allow you to leave the source code untouched, even when doing complex string/substring manipulations.

The database should typically also use UTF-8 and CLS. If the database uses UTF-8 and only supports BLS, the programs can still use CLS with UTF-8.

With CLS, the size of a CHAR/VARCHAR program variable is expressed in character units, and the number of bytes needed to store these characters is allocated automatically. A VARCHAR(10) variable will hold 10 characters, of any byte length. Further, language functions and class methods dealing with character string length and positions will use character units.

-- Using Character Length Semantics
DEFINE var VARCHAR(10)  -- Can store 10 chars in UTF-8, or any encoding.
LET var = "Forêt"  -- 5 chars, that take 6 bytes in UTF-8
DISPLAY LENGTH(var)  -- Displays a length of 5 (characters)
DISPLAY "[",var[4,5],"]"    -- Displays [êt]

To enable Char Length Semantics, define the FGL_LENGTH_SEMANTICS environment variable to "CHAR". For example, on UNIX™:

$ FGL_LENGTH_SEMANTICS="CHAR"
$ export FGL_LENGTH_SEMANTICS

However, BLS can also be used with a multibyte character set, especially when the database supports BLS only. In order to use BLS, define the FGL_LENGTH_SEMANTICS environment variable to "BYTE", or just leave it unset (BLS is the default). For example, on UNIX:

$ FGL_LENGTH_SEMANTICS="BYTE"
$ export FGL_LENGTH_SEMANTICS

Length Semantics in SQL

On the database server side, the length semantics used for character data types varies from a vendor to another. Some databases use BLS, other use CLS, and other support both semantics. For example, Informix^® uses BLS only (with a special server configuration parameter SQL_LOGICAL_CHAR to define a size conversion ratio). Oracle supports both CLS and BLS at the database, session and even column level, with the CHAR(10 BYTE|CHAR) syntax. SQL Server supports non-UCS-2 character sets (Latin1, BIG5) in CHAR/VARCHAR/TEXT columns using BLS the size, while NCHAR/NVARCHAR/NTEXT columns store double-byte UCS-2 characters and use CLS.

This table shows the character data type length semantics of support database servers:

Table 1. Character data type length semantics of supported database servers
Database Engine	Length semantics in character data types	Summary
Genero db	Uses Byte Length Semantics for the size of character columns. Character string data is stored in UTF-8 when CHARACTER_SET=UTF-8 configuration parameter is defined.	BLS
Oracle	Supports both Byte or Character Length Semantics in character type definition, can be defined globally for the database or at column level. Character string data is stored in database character set for CHAR /VARCHAR columns and in national character set for NCHAR /NVARCHAR columns.	BLS/CLS
Informix	Uses Byte Length Semantics for the size of character columns. Can apply a ratio when creating columns, according to the SQL_LOGICAL_CHARS server configuration parameter. Character string data is stored in the database character set defined by DB_LOCALE.	BLS
IBM^® DB2^®	Uses Byte Length Semantics for the size of character columns. Character data is stored in the database character set defined by the CODESET of CREATE DATABASE.	BLS
Microsoft™ SQL Server	CHAR / VARCHAR sizes are specified in bytes; Data is stored in the character set defined by the database collation. NCHAR / NVARCHAR sizes are specified in characters; Data is stored in UCS-2. See SQL adaptation guide For SQL SERVER 2005, 2008, 2012 for more details.	BLS/CLS
PostgreSQL	Uses Character Length Semantics for the size of character columns. Character string data is stored in the database character set defined by WITH ENCODING of CREATE DATABASE.	CLS
MySQL	Uses Character Length Semantics for the size of character columns. Character string data is stored in the server character set defined by a configuration parameter.	CLS
SQLite	Uses Character Length Semantics for the size of character columns. Character string data is stored in UTF-8.	CLS
Sybase Adaptive Server Enterprise (ASE)	CHAR / VARCHAR sizes are specified in bytes; Data is stored in the db character set. NCHAR / NVARCHAR sizes are specified in characters; Data is stored the db character set. UNICHAR / UNIVARCHAR sizes are specified in characters; Data is stored in UTF-16. See SQL adaptation guide For Sybase ASE 15.5+ for more details.	BLS/CLS

Other SQL elements like functions and operators are affected by the length semantic. For example, Informix LENGTH() function always returns a number of bytes, while Oracle's LENGTH() function returns a number of characters (use LENGTHB() to get the number of bytes with Oracle).

It is important to understand properly how the database servers handle multibyte character sets. Check your database server reference manual: In most documentations you will find a "Localization" chapter which describes those concepts in detail.

Extracting database schemas

Database schema files (.sch) are used to resolve column data types when compiling .4gl modules and .per form files. This file contains size information for CHAR and VARCHAR types. It is important to identify the unit used by the database columns, to properly define CHAR/VARCHAR variables in programs and fields in forms.

Most database engines (like Oracle, SQL Server, PostgreSQL, Sybase, SQLite) provide catalog tables with column size information in character units. In this case, the fgldbsch tool extracts the column sizes in character units, without further conversion. If the column sizes is provided in bytes by catalog tables, fgldbsch will try to detect character length semantics usage in the database and apply a reduction factor to convert the number of bytes to chars.

As result - independently from the length semantics used in your programs - the CHAR/VARCHAR type sizes in the schema file are always expressed in character units. When using Byte Length Semantics, this makes no difference in a single-byte locale, because one character occupies a single byte. In a multibyte encoding (UTF-8) with BLS, this method garanties that the program variable will not hold more ASCII characters than the database column can hold. When using Character Length Semantics with a multibyte character set, the size in characters will define character type variables in the same unit.

For example, with BLS, a VARCHAR(10 (bytes or chars)) column will define a VARCHAR(10 (bytes)) in programs. With CLS, a VARCHAR(10 (chars)) column will define a VARCHAR(10 (chars)) in programs.

Moving from single-byte to UTF-8

Migration to Unicode (UTF-8) is facilitated with Char Length Semantics:

Verify that your database uses Char Length Semantics.
Convert your sources and string files from your single-byte locale to UTF-8 (iconv).
Turn on Char Length Semantics with FGL_LENGTH_SEMANTICS=CHAR.
Compile and run your programs untouched.