Unicode Support

BTR2SQL fully supports Unicode data. If your Btrieve/Pervasive.SQL tables use WString (data type 20) or WZString (data type 21) columns, the driver preserves the Unicode content when migrating to and running against a SQL backend.

Note

Most Btrieve applications use ANSI (single-byte) column types such as String and Zstring. If your application does not use WString or WZString columns, you do not need to do anything special — follow the standard migration and deployment steps.

WString and WZString Column Types

Pervasive.SQL defines two Unicode column types:

Btrieve Type

Type ID

Description

WString

20

Fixed-length wide (Unicode) character string. Stored as UTF-16LE in the Btrieve file.

WZString

21

Null-terminated wide (Unicode) string. Stored as UTF-16LE in the Btrieve file.

For reference, the Actian PSQL documentation describes these types in the SQL Data Types section.

SQL Type Mapping by Backend

When BTR2SQL migrates WString or WZString columns, it creates the appropriate Unicode-capable column type on each backend:

Backend

SQL Type

Notes

MS SQL Server

nvarchar / nchar

SQL Server stores Unicode natively in these types using UCS-2/UTF-16. No special database configuration is required.

PostgreSQL

varchar / text

PostgreSQL does not have separate Unicode column types. Instead, the database encoding determines whether text columns can hold Unicode data. The database must be created with UTF-8 encoding. See PostgreSQL Database Setup below.

Oracle

varchar2 / char / clob

Oracle uses the database character set to handle encoding. The database must be created with the AL32UTF8 character set. Oracle does not require the national character set types (nvarchar2 / nchar) — standard types with AL32UTF8 support full Unicode.

PostgreSQL Database Setup

PostgreSQL uses a single encoding for the entire database, which means you must choose between ANSI and Unicode at database creation time:

  • ANSI database — Use a locale-specific encoding (e.g. WIN1252 or LATIN1) with a compatible collation and CTYPE of C, using template0. This is the correct choice for most customers, as the majority of Btrieve applications use ANSI data exclusively.

  • UTF-8 database — Use UTF8 encoding. This is required if your Btrieve tables contain WString or WZString columns.

Important

A single PostgreSQL database cannot mix ANSI and UTF-8 encodings. If your application has both ANSI and Unicode columns, use a UTF-8 database — the driver handles both column types correctly when the database encoding is UTF-8.

Creating a UTF-8 database (for Unicode data):

CREATE DATABASE myapp
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8'
    TEMPLATE template0;

Tip

If you are unsure whether your tables use Unicode columns, check the DDFs or the migration output. The migration trace will show the Btrieve data type for each column. Types 20 (WString) and 21 (WZString) are the Unicode types.

Oracle Database Setup

For Oracle, the database must use the AL32UTF8 character set to store Unicode data. This is Oracle’s recommended character set for new databases and supports the full Unicode range using standard column types (varchar2, char, clob).

If your Oracle database was created with a non-Unicode character set (e.g. WE8MSWIN1252), Unicode columns cannot be stored correctly. Consult your DBA about migrating to AL32UTF8 or creating a new database with the correct character set.

MS SQL Server Setup

MS SQL Server stores Unicode data in nvarchar and nchar columns natively. The migration utility automatically creates the correct column types when it encounters WString or WZString columns in the source data.

Collation Considerations

SQL Server has well over 1000 collations available. Collations are commonly thought of as a way of sorting data, but they actually define much more than sort order. For non-Unicode data, collations determine how binary values are rendered as characters. For Unicode data, they can define what is treated as a character and whether data should be treated as UTF-8 or ANSI. Various collations have different settings related to character weighting based on language, accent sensitivity, supplementary character support, case sensitivity, Kanatype sensitivity (for Japanese), width sensitivity, binary order, and more. Choosing which collation to use for your database is a complex topic — it was complex before migration, and it remains complex after.

In the USA, the collation you will most likely want is SQL_Latin1_General_CP1_CI_AI. There is a good chance you are already using this collation. It uses codepage 1252 for non-Unicode columns and, by being accent-insensitive, makes words like “café” findable using the search term “cafe”. You may notice that this collation does not include _SC (supplementary character support). This does not mean you cannot store supplementary characters — it means that from a sorting perspective they will be treated as two separate byte-pairs and sorted accordingly, which may not produce the order a user expects. If you need supplementary characters to sort in a user-friendly manner (think emoji on an iOS keyboard, for example), consider Latin1_General_100_CI_AI_SC instead. Your scenario could be more complicated than this; customer requirements might mean you need the accent-sensitive version of either collation (replacing _AI with _AS).

The key point is that collations apply across both Unicode-aware and non-Unicode columns. What you probably do not want is a binary sort collation as your database default. Although binary sort often makes sense to developers, it rarely produces the results users expect because it does not account for the reality of sorting in most languages.

Also remember that individual columns can have their own collation that differs from the database collation, and that the way you build your SQL statements can affect collation behavior as well.

Note

We have not advised using a _UTF8 based collation. These collations are only available in SQL Server 2019 and newer.

Mixed Environments

In some applications, only a subset of tables use Unicode columns while the rest use ANSI types. BTR2SQL handles this transparently:

  • MS SQL Server — ANSI columns use varchar / char; Unicode columns use nvarchar / nchar. Both coexist in the same database without any special configuration. Be sure to test your application thoroughly.

  • PostgreSQL — Use a UTF-8 database. Both ANSI and Unicode columns work correctly because UTF-8 is a superset of ASCII. Standard varchar / text types handle both. Be careful about the collation you choose — it should support both ANSI and Unicode characters. Sorting of ANSI characters may behave differently than it did in your Btrieve application. Be sure to test your application thoroughly.

  • Oracle — Use AL32UTF8. Both ANSI and Unicode columns work correctly because AL32UTF8 is a superset of ASCII. Standard varchar2 / char types handle both. Be sure to test your application thoroughly.