PostgreSQL database migration, the SQL_ASCII to UTF8 problem
Another reminder to myself, as I've faced this situation too many times (not too much lately, as I've switched definetely to UTF-8 in my database servers but I still have some old ones that fit perfectly for this post).
In order to migrate a database from one PostgreSQL server to another one (even if there is a version mismatch, for example, from the 7.x branch to the 8.x branch or between 8.1 and 8.3) all you need is pg_dump. With pg_dump you will be able to create a SQL dump of a whole database, which could be later imported into another database.
For example:
pg_dump openbsd_es > openbsd_es.sql
This way I've created a dump of a database called openbsd_es (used in the past for the defunct effort to keep the work on the .es translation of OpenBSD documentation). The dump was called openbsd_es.sql and I could import it in another PostgreSQL server, in a database with whatever name it has.
pg_dump has a number of options that will help you creating your database dump. For example, you can create a dump that will create full-insert sql statements using pg_dump this way:
pg_dump -D openbsd_es > openbsd_es.sql
(pretty useful if you need to load the sql file into MySQL, SQLite or any other database system)
Then, you only have to use psql to load the sql file back into another database:
psql openbsd_es_clon < openbsd_es.sql
At this point it is posible that you've see an error on screen, something like:
ERROR: invalid byte sequence for encoding "UTF8": 0xf36e2020 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
AHA!, the fscked UTF-8 error!!!
This is caused by a mismatch in your database encodings, surely because the database from where you got the SQL dump was encoded as SQL_ASCII while the new one is encoded as UTF8.
You can get some information using psql with the -l option. In the old server:
psql -l [ stripped ] openbsd_es | wu | SQL_ASCII [ stripped ]
In the new one:
psql -l [ stripped ] openbsd_es_clon | wu | UTF8 [ stripped ]
Of course you can have databases with different encodings in the same PostgreSQL server, but it is not recommended (if your server data directory was created with UTF8 as its default encoding, you should stick to that for all of your databases).
So, get back to our sql file (openbsd_es.sql). We've created it from a SQL_ASCII encoded database, which results in something like:
[Fenris] ~> file openbsd_es.sql openbsd_es.sql: ISO-8859 English text, with very long lines [Fenris] ~>
As our new database is UTF8 encoded, it would be nice of we could change quickly the encoding of the sql file, just to import it without all those errors.
Here is where recode fits in this post. Recode is a small tool from the GNU project that let you change on-the-fly the encoding of a given file. For example, to change our sql file into a suitable UTF-8 encoding, we could call recode just like this:
cat openbsd_es.sql | recode iso-8859-1..u8 > openbsd_es_utf8.sql
And the new openbsd_es_utf8.sql file will be UTF-8 encoded:
[Fenris] ~> file openbsd_es_utf8.sql openbsd_es_utf8.sql: UTF-8 Unicode English text, with very long lines [Fenris] ~>
Just one more thing to do before importing the sql file into the new database, open the new UTF-8 encoded sql file with your favourite editor and change the line:
SET client_encoding = 'SQL_ASCII';
to:
SET client_encoding = 'UTF8';
as this setting will set the encoding for the psql command before executing the SQL statements within the sql file.
Now, you can import the sql file as usual with (hopefully) no errors this time:
psql openbsd_es_clon < openbsd_es_utf8.sql
(interesting: you can install recode from ports in both FreeBSD and OpenBSD, /usr/ports/converters/recode in both BSD flavours. In linux it seems that it will be installed by default in some distributions, install it from packages if not.)