Using UTF-8 in IBM Informix databases
UTF-8 in Genero programs and Informix databases
Genero BDL programs can run in a UTF-8 locale, and use an IBM® Informix® database created with a UTF-8 locale.
The locale of an Informix database is defined at creation by the DB_LOCALE environment variable.
At runtime, the Informix client locale is defined by the CLIENT_LOCALE environment variable and must match the Genero application locale (LANG/LC_ALL), while DB_LOCALE must match the database locale. If needed, Informix will make charset conversions between CLIENT_LOCALE and DB_LOCALE. For more details about Informix client setup, see Prepare the runtime environment - connecting to the database.
- Define the DB_LOCALE environment variable to a UTF-8 locale like
en_us.utf8
- Check the
SQL_LOGICAL_CHAR
parameter in the onconfig Informix server file.
Defining the Informix database locale
When you create a database with a CREATE DATABASE
SQL statement, the current
value of the DB_LOCALE environment variable defines the locale for the created database:
$ DB_LOCALE=en_US.utf8
$ export DB_LOCALE
$ dbaccess - -
> CREATE DATABASE mydb WITH BUFFERED LOG;
en_us.8859-1
(i.e. en_us.819
).sysmaster
database, and run the following
query:$ dbaccess sysmaster -
Database selected.
> SELECT * FROM sysdbslocale WHERE dbs_dbsname = 'mydb';
dbs_dbsname mydb
dbs_collate en_US.819
The
charset code for ISO-8859-1 is 819, for UTF8 it is 57372.The SQL_LOGICAL_CHAR
onconfig parameter
Informix (14.10) does not support CHAR length semantics, it uses only BYTE length semantics: By
default, in a CHAR(10)
SQL type, the size represents a number of bytes. Since UTF-8
locale is a multi-byte, variable-length encoding, it is not possible to store 10 of all possible
UTF-8 characters in 10 bytes.
However, the Informix onconfig SQL_LOGICAL_CHAR
parameter
can be defined as a size multiplier for CHAR/VARCHAR
SQL types, when creating
tables. For example, when using a CHAR(10)
SQL type and
SQL_LOGICAL_CHAR=4
, the real SQL type will be CHAR(40)
, where 40
is the size in bytes.
sysmaster
!),
and execute the following
query:$ dbaccess mydb -
Database selected.
> SELECT MOD(flags,4)+1 FROM systables WHERE tabname = ' VERSION';
4
Persistence of database settings
The DB_LOCALE and SQL_LOGICAL_CHAR
values used at database creation are
persistent properties for the lifetime of that database.
If you stop the Informix engine, change the SQL_LOGICIAL_CHAR
, then restart the
engine, or set a different DB_LOCALE, the locale settings of existing databases remain unchanged:
New created tables are based on the DB_LOCALE and SQL_LOGICAL_CHAR
settings when
the database was created.
Database schema and CHAR/VARCHAR sizes
Program variables can be declared with DEFINE
LIKE
, to use SQL types from .sch schema files.
Schema files are produced with the fgldbsch tool, reading Informix
(syscolumns
) system tables. The sizes of CHAR/VARCHAR
columns is
always expressed in bytes in syscolumns.collength
.
For databases created with an SQL_LOGICAL_CHAR
size multiplier, the
fgldbsch tool divides the collength
by the
SQL_LOGICAL_CHAR
value used by the database.
- At database creation, onconfig server file has
SQL_LOGICAL_CHAR=4
CREATE TABLE tab1 ( col1 CHAR(10) )
- In the Informix system table:
syscolumns.collength = 40
(40 bytes) - fgldbsch -db mydb produces the .sch file (for
tab1.col1: 40/4=10
) - In .sch schema file,
tab1.col1
is defined asCHAR(10)
- Source code declares:
DEFINE rec RECOR LIKE tab1.*
- The compiler reads schema file and defines
rec.col1
asCHAR(10)
CHAR(10)
variable is then
defined by the FGL_LENGTH_SEMANTICS environment variable:- When FGL_LENGTH_SEMANTICS=CHAR, a
CHAR(10)
variable can store 10 UTF-8 characters, no matter the number of bytes used by these characters. - When FGL_LENGTH_SEMANTICS=BYTE, a
CHAR(10)
variable can store 10 ASCII-7 characters, 5 (5x2b=10b) UTF-8 accute characters, or 3 (3x3b=9b) UTF-8 Chinese characters.
FGL_LENGTH_SEMANTICS: CHAR or BYTE ?
With UTF-8 in Genero programs, it is recommended to use char length semantics by setting the FGL_LENGTH_SEMANTICS=CHAR environment variable.
However, depending on the level of compatibility with Informix SQL length semantics, and the complexity of character string operations in the program's code, you need to choose CHAR or BYTE length semantics for Genero programs.
If Informix would support CHAR length semantics, there would be no question: We should just use FGL_LENGTH_SEMANTICS=CHAR as well.
SQL_LOGICAL_CHAR
is defined). For example, in SQL statements:LENGTH(expr)
returns a number of bytescustname[2,5]
means the substrings from byte position 2 to 5
If the application code does not have complex string operations, you may consider using FGL_LENGTH_SEMANTICS=BYTE, to get the same length semantics as in SQL statements.
But when the application code uses a lot of substring positions such as
custcode[2,5]
, it is mandatory to use FGL_LENGTH_SEMANTICS=CHAR, to keep the source
code untouched.
CHAR(10)
will become a CHAR(10)
variable from
DEFINE LIKE
(even if syscolumns.collength
is defined as 40 bytes).
Thus, the schema file is prepared for FGL_LENGTH_SEMANTICS=CHAR usage.The future of Informix UTF-8 support
Ideally, Informix should support CHAR length semantics like most other databases, to get the same behavior with SQL and FGL expressions.
This feature request is known to Informix developers and should be available in a future version (> 14.10).