Multibyte (MB) support allows
PostgreSQL to handle
multiple-byte character sets such as EUC (Extended Unix Code), Unicode, and
Mule internal code. With MB enabled, you can use multibyte
character sets in regular expressions (regexp), LIKE, and some
other functions. The default
encoding is selected when you initialize your
PostgreSQL installation using
initdb. This default can be
overridden when you create a database, either with
createdb or with the SQL command
CREATE DATABASE, so you can have multiple databases, each with
a different encoding. MB can also
handle single-byte character sets such as ISO-8859-1.
Multibyte support has been enabled by default since PostgreSQL version 7.3.
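As a rough illustration of why multibyte awareness matters, the following Python sketch compares the number of characters in a string with the number of bytes needed to store it. It uses Python's built-in codecs as stand-ins for the PostgreSQL encodings EUC_JP, UNICODE, and LATIN1. That mapping and the helper name byte_lengths are assumptions made for this example and are not part of PostgreSQL.

```python
# Illustration only: Python codecs stand in for PostgreSQL encodings.
# Assumed mapping: EUC_JP -> "euc_jp", UNICODE -> "utf-8", LATIN1 -> "latin-1".

def byte_lengths(text, codecs):
    """Return {codec: number of bytes needed to store text in that codec}."""
    return {codec: len(text.encode(codec)) for codec in codecs}

# Three Japanese characters, written with escapes to keep this file ASCII.
japanese = "\u65e5\u672c\u8a9e"
# Four Latin characters, one of them accented (e with acute).
french = "caf\u00e9"

if __name__ == "__main__":
    # Character count first, then byte counts per encoding.
    print(len(japanese), byte_lengths(japanese, ["euc_jp", "utf-8"]))
    print(len(french), byte_lengths(french, ["latin-1", "utf-8"]))
```

The character count and the byte count agree only for the single-byte encoding. Code that treats every byte as one character would miscount or split the Japanese text, which is the problem multibyte-aware functions avoid.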
The following encodings can be used as the database encoding.
Table 7-1. Character Set Encodings
Encoding        Description
SQL_ASCII       ASCII
EUC_JP          Japanese EUC
EUC_CN          Chinese EUC
EUC_KR          Korean EUC
JOHAB           Korean EUC (Hangul base)
EUC_TW          Taiwan EUC
UNICODE         Unicode (UTF-8)
MULE_INTERNAL   Mule internal code
LATIN1          ISO 8859-1 ECMA-94 Latin Alphabet No.1
LATIN2          ISO 8859-2 ECMA-94 Latin Alphabet No.2
LATIN3          ISO 8859-3 ECMA-94 Latin Alphabet No.3
LATIN4          ISO 8859-4 ECMA-94 Latin Alphabet No.4
LATIN5          ISO 8859-9 ECMA-128 Latin Alphabet No.5
LATIN6          ISO 8859-10 ECMA-144 Latin Alphabet No.6
LATIN7          ISO 8859-13 Latin Alphabet No.7
LATIN8          ISO 8859-14 Latin Alphabet No.8
LATIN9          ISO 8859-15 Latin Alphabet No.9
LATIN10         ISO 8859-16 ASRO SR 14111 Latin Alphabet No.10
ISO-8859-5      ECMA-113 Latin/Cyrillic
ISO-8859-6      ECMA-114 Latin/Arabic
ISO-8859-7      ECMA-118 Latin/Greek
ISO-8859-8      ECMA-121 Latin/Hebrew
KOI8            KOI8-R(U)
WIN             Windows CP1251
ALT             Windows CP866
WIN1256         Arabic Windows CP1256
TCVN            Vietnamese TCVN-5712 (Windows CP1258)
WIN874          Thai Windows CP874
Important: Before PostgreSQL 7.2, LATIN5 mistakenly
meant ISO 8859-5. From 7.2 on,
LATIN5 means ISO 8859-9. If you have a LATIN5
database created on 7.1 or earlier and want to migrate to 7.2 (or
later), you should be very careful about this change.
Important: Not all APIs support all the encodings listed above. For example, the
PostgreSQL
JDBC driver does not support MULE_INTERNAL, LATIN6,
LATIN8, or LATIN10.
initdb defines the default encoding
for a PostgreSQL installation. For example:
$ initdb -E EUC_JP
sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
Note that you can use --encoding instead of -E if you prefer
to type longer option strings.
If no -E or --encoding option is
given, SQL_ASCII is used.
You can create a database with a different encoding:
$ createdb -E EUC_KR korean
will create a database named korean with EUC_KR encoding.
Another way to accomplish this is to use a SQL command:
CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
The encoding of each database is stored in the
encoding column of the
pg_database system catalog.
You can see it by using the -l option or the
\l command of psql.
PostgreSQL supports automatic
encoding conversion between server and client for some
encodings. The conversion information is stored in the pg_conversion system
catalog. You can create a new conversion by using CREATE
CONVERSION. PostgreSQL comes with some predefined
conversions. They are listed in Table 7-2.
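Conceptually, the conversion re-encodes each string as it crosses the connection. The Python sketch below mimics this for an EUC_JP server and an SJIS client using Python's own euc_jp and shift_jis codecs. The function names to_client and to_server and the use of Python codecs are assumptions for illustration only. They say nothing about how the server's conversion routines are implemented.

```python
# Illustration only: mimic server/client conversion with Python codecs.
# Assumed mapping: EUC_JP -> "euc_jp", SJIS -> "shift_jis".

def to_client(server_bytes, server_codec, client_codec):
    """Re-encode bytes from the server's encoding into the client's."""
    return server_bytes.decode(server_codec).encode(client_codec)

def to_server(client_bytes, client_codec, server_codec):
    """Re-encode bytes from the client's encoding into the server's."""
    return client_bytes.decode(client_codec).encode(server_codec)

# Text as it might be stored in an EUC_JP database (three characters).
stored = "\u65e5\u672c\u8a9e".encode("euc_jp")

if __name__ == "__main__":
    sent = to_client(stored, "euc_jp", "shift_jis")
    # Same text, different byte sequences on each side of the connection.
    print(stored.hex(), "->", sent.hex())
    # Converting back yields the original stored bytes.
    print(to_server(sent, "shift_jis", "euc_jp") == stored)
```

The round trip is lossless here because every character in the sample exists in both character sets. That is not true for every pair of encodings.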
To enable the automatic encoding translation, you have to tell
PostgreSQL the encoding you would like
to use in the client. There are
several ways to accomplish this.
Using the \encoding command in
psql.
\encoding allows you to change the client
encoding on the fly. For
example, to change the encoding to SJIS, type:
\encoding SJIS
Using libpq functions.
\encoding actually calls
PQsetClientEncoding() to do this.
int PQsetClientEncoding(PGconn *conn, const char *encoding);
where conn is a connection to the server,
and encoding is the encoding you
want to use. If it successfully sets the encoding, it returns 0,
otherwise -1. The current encoding for this connection can be obtained by
using:
int PQclientEncoding(const PGconn *conn);
Note that it returns the encoding ID, not a symbolic string
such as EUC_JP. To convert an encoding ID to an encoding name, you
can use:
char *pg_encoding_to_char(int encoding_id);
Using SET CLIENT_ENCODING TO.
Setting the client encoding can be done with this SQL command:
SET CLIENT_ENCODING TO 'encoding';
You can also use the SQL92 syntax SET NAMES for this purpose:
SET NAMES 'encoding';
To query the current client encoding:
SHOW CLIENT_ENCODING;
To return to the default encoding:
RESET CLIENT_ENCODING;
Using PGCLIENTENCODING.
If the environment variable PGCLIENTENCODING is defined
in the client's environment, that client encoding is automatically
selected when a connection to the server is made. (This can subsequently
be overridden using any of the other methods mentioned above.)
Using the client_encoding variable.
If the client_encoding variable in postgresql.conf is set, that
client encoding is automatically selected when a connection to the
server is made. (This can subsequently be overridden using any of the
other methods mentioned above.)
If you choose EUC_JP for the server
and LATIN1 for the client,
some Japanese characters cannot be translated into LATIN1. In
this case, a letter that cannot be represented in the LATIN1 character set
would be transformed as:
Dec 7, 2000
* Automatic encoding translation between Unicode and other
encodings is implemented
* Changes above will appear in 7.1
May 20, 2000
* SJIS UDC (NEC selection IBM kanji) support contributed
by Eiji Tokuya
* Changes above will appear in 7.0.1
Mar 22, 2000
* Add new libpq functions PQsetClientEncoding, PQclientEncoding
* ./configure --with-mb=EUC_JP
is now deprecated; use
./configure --enable-multibyte=EUC_JP
instead
* Add SQL_ASCII regression test case
* Add SJIS User Defined Character (UDC) support
* All of above will appear in 7.0
July 11, 1999
* Add support for WIN1250 (Windows Czech) as a client encoding
(contributed by Pavel Behal)
* fix some compiler warnings (contributed by Tomoaki Nishiyama)
Mar 23, 1999
* Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
(thanks Oleg Broytmann for testing)
* Fix problem with MB and locale
Jan 26, 1999
* Add support for Big5 for frontend encoding
(you need to create a database with EUC_TW to use Big5)
* Add regression test case for EUC_TW
(contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)
Dec 15, 1998
* Bugs related to SQL_ASCII support fixed
Nov 5, 1998
* 6.4 release. In this version, pg_database has "encoding"
column that represents the database encoding
Jul 22, 1998
* determine encoding at initdb/createdb rather than compile time
* support for PGCLIENTENCODING when issuing COPY command
* support for SQL92 syntax "SET NAMES"
* support for LATIN2-5
* add UNICODE regression test case
* new test suite for MB
* clean up source files
Jun 5, 1998
* add support for the encoding translation between the backend
and the frontend
* new command SET CLIENT_ENCODING etc. added
* add support for LATIN1 character set
* enhance 8-bit cleanliness
April 21, 1998 some enhancements/fixes
* character_length(), position(), substring() are now aware of
multi-byte characters
* add octet_length()
* add --with-mb option to configure
* new regression tests for EUC_KR
(contributed by Soonmyung Hong)
* add some test cases to the EUC_JP regression test
* fix problem in regress/regress.sh in case of System V
* fix toupper(), tolower() to handle 8bit chars
Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
Mar 10, 1998 PL2 released
* add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
* add an English document (this file)
* fix problems concerning 8-bit single byte characters
Mar 1, 1998 PL1 released