PostgreSQL database migration, the SQL_ASCII to UTF8 problem
July 2014
Sun Mon Tue Wed Thu Fri Sat
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31    
About
This site is an effort to share some of the base knowledge I have gathered through all this years working with Linux, FreeBSD, OpenBSD, Python or Zope, among others. So, take a look around and I hope you will find the contents useful.
Recent Entries
Recent Comments
Recent Trackbacks
Categories
OpenBSD (8 items)
BSD (0 items)
FreeBSD (19 items)
Linux (3 items)
Security (3 items)
Python (18 items)
Zope (13 items)
Daily (139 items)
e-shell (9 items)
Hacks (13 items)
PostgreSQL (3 items)
OSX (7 items)
Nintendo DS (0 items)
enlightenment (0 items)
Apache (3 items)
Nintendo Wii (1 items)
Django (23 items)
Music (12 items)
Plone (7 items)
Varnish (0 items)
Lugo (2 items)
Sendmail (0 items)
europython (7 items)
Cherokee (1 items)
self (1 items)
Nature (1 items)
Hiking (0 items)
uwsgi (0 items)
nginx (0 items)
bikes (0 items)
Archives

Syndicate this site (XML)

RSS/RDF 0.91

09 febrero
2009

PostgreSQL database migration, the SQL_ASCII to UTF8 problem

or how to get the proper encoding to fit your new database encoding...

Another reminder to myself, as I've faced this situation too many times (not too much lately, as I've switched definetely to UTF-8 in my database servers but I still have some old ones that fit perfectly for this post).

In order to migrate a database from one PostgreSQL server to another one (even if there is a version mismatch, for example, from the 7.x branch to the 8.x branch or between 8.1 and 8.3) all you need is pg_dump. With pg_dump you will be able to create a SQL dump of a whole database, which could be later imported into another database.

For example:

pg_dump openbsd_es > openbsd_es.sql

This way I've created a dump of a database called openbsd_es (used in the past for the defunct effort to keep the work on the .es translation of OpenBSD documentation). The dump was called openbsd_es.sql and I could import it in another PostgreSQL server, in a database with whatever name it has.

pg_dump has a number of options that will help you creating your database dump. For example, you can create a dump that will create full-insert sql statements using pg_dump this way:

pg_dump -D openbsd_es > openbsd_es.sql

(pretty useful if you need to load the sql file into MySQL, SQLite or any other database system)

Then, you only have to use psql to load the sql file back into another database:

psql openbsd_es_clon < openbsd_es.sql

At this point it is posible that you've see an error on screen, something like:

ERROR:  invalid byte sequence for encoding "UTF8": 0xf36e2020
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

AHA!, the fscked UTF-8 error!!!

This is caused by a mismatch in your database encodings, surely because the database from where you got the SQL dump was encoded as SQL_ASCII while the new one is encoded as UTF8.

You can get some information using psql with the -l option. In the old server:

psql -l

[ stripped ]

openbsd_es      | wu     | SQL_ASCII

[ stripped ]

In the new one:

psql -l

[ stripped ]

openbsd_es_clon | wu     | UTF8

[ stripped ]

Of course you can have databases with different encodings in the same PostgreSQL server, but it is not recommended (if your server data directory was created with UTF8 as its default encoding, you should stick to that for all of your databases).

So, get back to our sql file (openbsd_es.sql). We've created it from a SQL_ASCII encoded database, which results in something like:

[Fenris] ~> file openbsd_es.sql
openbsd_es.sql: ISO-8859 English text, with very long lines
[Fenris] ~>

As our new database is UTF8 encoded, it would be nice of we could change quickly the encoding of the sql file, just to import it without all those errors.

Here is where recode fits in this post. Recode is a small tool from the GNU project that let you change on-the-fly the encoding of a given file. For example, to change our sql file into a suitable UTF-8 encoding, we could call recode just like this:

cat openbsd_es.sql | recode iso-8859-1..u8 > openbsd_es_utf8.sql

And the new openbsd_es_utf8.sql file will be UTF-8 encoded:

[Fenris] ~> file openbsd_es_utf8.sql
openbsd_es_utf8.sql: UTF-8 Unicode English text, with very long lines
[Fenris] ~>

Just one more thing to do before importing the sql file into the new database, open the new UTF-8 encoded sql file with your favourite editor and change the line:

SET client_encoding = 'SQL_ASCII';

to:

SET client_encoding = 'UTF8';

as this setting will set the encoding for the psql command before executing the SQL statements within the sql file.

Now, you can import the sql file as usual with (hopefully) no errors this time:

psql openbsd_es_clon < openbsd_es_utf8.sql

(interesting: you can install recode from ports in both FreeBSD and OpenBSD, /usr/ports/converters/recode in both BSD flavours. In linux it seems that it will be installed by default in some distributions, install it from packages if not.)

Posted by wu at 22:45 | Comments (3) | Trackbacks (0)
<< Una vez más, la misma mierda periodística | Main | XKCD: Security >>
Comments
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

Thank you very much for sharing!!! I was migrating an old Postgres db w/ the old encoding SQL_ASCII to the new one, UTF8 .

This solved my problem ;-)

Posted by: charlie at octubre 10,2011 17:23
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

You're great for sharing this!

I was trying to do some scripting to convert my backup file (207 millions of lines) with no luck...and recode solved it!

Thanks a lot, a lot, a lot!
Best regards,

HeCSa.

Posted by: HeCSa at octubre 29,2011 03:43
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

Thanks for this post, really useful. Makes me wonder why UTF8 isn't the default anyway...

Posted by: Arthur at noviembre 21,2011 21:57
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

You are a god amongst PostgreSQL administrators. Thank you for posting such a clear solution, saved me a bunch of time. (Siden note: I'm not really clear on why it says I'm posting this in Febrero 2009 when it's Diciembre 2011).

Posted by: Joseph Morris at diciembre 14,2011 07:35
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

Hi there Joseph. Thnx for the compliments, it is nice to see some of my stuff help other people.

I don't understand what you mean exactly with the date thing, your comment appears as published on december, the 14th, 2011 (it is my blog post with is dated back in 2009).

Cheers!

Posted by: Wu at diciembre 14,2011 08:02
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

So you've got an old SQL-ASCII database, and want to convert from ASCII to UTF-8 ?!

Well ...

Dump your existing database as plain text:
# pg_dump -U user_name --format=p --file=sql_ascii_dump.sql db_name

Drop and re-create your database:
# su - postgres
# dropdb db_name
# createdb -O user_name db_name "Some Comment"

Convert your database
# recode iso-8859-1..u8 sql_ascii_dump.sql

Edit this "sql_ascii_dump.sql" file, and change:
SET client_encoding = 'LATIN1';
To
SET client_encoding = 'UTF8';

Now restore your converted backup:
# psql -d db_name -U user_name -W So you've got an old SQL-ASCII database, and want to convert from ASCII to UTF-8 ?!

Well ...

Dump your existing database as plain text:
# pg_dump -U user_name --format=p --file=sql_ascii_dump.sql db_name

Drop and re-create your database:
# su - postgres
# dropdb db_name
# createdb -O user_name db_name "Some Comment"

Convert your database
# recode iso-8859-1..u8 sql_ascii_dump.sql

Edit this "sql_ascii_dump.sql" file, and change:
SET client_encoding = 'LATIN1';
To
SET client_encoding = 'UTF8';

Now restore your converted backup:
# psql -d db_name -U user_name -W

Posted by: Kingsley at septiembre 19,2012 02:13
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

Uh: that previous message was body-duplicated, maybe [Preview] is borked.

Anyway, if you need to restore an SQL-ASCII database, but have no existing postgresql database, you need to create it based on 'template0' (whatever that is) -

# createdb -T template0 --owner=user_name --encoding=iso-8859-1 --locale=C db_name "Some Comment"

Hope this helps someone out there

Posted by: Kingsley at septiembre 19,2012 02:17
Re: PostgreSQL database migration, the SQL_ASCII to UTF8 problem

Thank you very much for sharing this tip. It was very very useful for me and saved me much hassle and stress.

Thank you once again

Posted by: P.V.Anthony at octubre 23,2012 06:18
Trackbacks
Please send trackback to:http://blog.e-shell.org/134/tbping
There are no trackbacks.
Post a comment