Upgrading from 2.1 to 2.2.RC1 characters encoding problem [SOLVED with Solution]

alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Sat, 2007-02-24 18:51

UPDATE : You can see the solution a couple of post below.

Ok, I have a difficult (for me) problem.

Long story short : I am upgrading from 2.1 to 2.2.RC1. That works (7700 pictures and growing!). The part that does not work is the encoding characters. I was on another hosting, he updated mysql from 4.0 to 4.1, I changed to another hosting, all this combine results to mixed encoding.

Long :

If you go to the url http://www.salsamontreal.com/montrealsalsa/gallery , that's my gallery embedded in drupal 4.7.x here. You'll notice that some characters of titles are ok like "2006-09 au 2006-12-Des mardis au Délima". That's in French btw. Some others are not like "2007-02-16-Soirée de St-Valentin MAM". "Soirée" should be "Soirée" here. When upgrading from 2.1 to 2.2.RC1, gallery convert tables from latin1_swedish_ci to utf8_general_ci, witch is a good thing. The problem is that the valid accents "é", "à", etc goes away.

So after the upgrade the "2006-09 au 2006-12-Des mardis au Délima" becomes "2006-09 au 2006-12-Des mardis au D".

Does someone have any ideas how could I sort this out?

I did a conversion with drupal too, inspire by this post http://drupal.org/node/84214 , so I try the same thing, but since these are mix data, that does not work.

I have all backups, and made some tests on my local installation and could test some more without any problem.


(These are values of my local installation and not from the online version. Both runs the same software, but OS, mysql, php and apache version are not. For the problem here, those program should not interfere even if not the same version. For example, mysql online is 4.1.xxxx so all encoding works.)
Gallery version : 2.1 to 2.2.RC1
PHP version : 5.1.6
Webserver : 2.0.55
Database : mysql 5.0.24a-Debian_9-log
Operating system : GNU /Linux Ubuntu 6.10
Url of current gallery online : http://www.salsamontreal.com/montrealsalsa/gallery

Alexandre Racine
<a href="http://www.salsamontreal.com">Salsa Montréal - évènement, photos et beaucoup plus!</a>

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Sat, 2007-02-24 20:18

mysql 4.0 doesn't support UTF-8.
mysql 4.1. and later support UTF-8.
g2.1/2.2 store its data as utf-8, whether the mysql supports it or not.

g2 detects on each upgrade whether the mysql db supports utf-8 and if so, converts all columns and tables to utf-8.

the problem is that there are 2 different cases. for one of them (your case), the conversion isn't doing the 100% right thing and your text gets truncated.
see:
http://sourceforge.net/tracker/index.php?func=detail&aid=1593272&group_id=7130&atid=107130

@drupal:
drupal and g2 are using the same utf-8 conversion code (we looked at theirs when coding ours) since it's the recommended approach (recommended by the mysql team).

if you have a backup, there's probably a way to upgrade your g2 without losing any data.
let me know and i can provide instructions.

--------------
Doumentation: Support / Troubleshooting | Installation, Upgrade, Configuration and Usage

 
alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Sun, 2007-02-25 07:00

Yeah, I think you are right. I red the gallery and the mysql bug report.

When I upgraded from drupal 4.6 to 4.7, I had to do this manually :

mysqldump -u aracine_go -pgo --default-character-set=latin1 --skip-set-charset aracine_smdrupal47 > smdrupalbd-latin1.sql
replace latin1 utf8 -- smdrupalbd-latin1.sql
mysql -u aracine_go -pgo
DROP DATABASE aracine_smdrupal;
CREATE DATABASE aracine_smdrupal CHARACTER SET utf8 COLLATE utf8_general_ci; quit;
mysql -u aracine_go -pgo --default-character-set=utf8 aracine_smdrupal < smdrupalbd-latin1.sql

And then upgrade from 4.6 to 4.7.

Did you have something similar in mind?

Thanks.

Alexandre Racine
<a href="http://www.salsamontreal.com">Salsa Montréal - évènement, photos et beaucoup plus!</a>

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Sun, 2007-02-25 07:19

that sounds like an alternative.
i'd try that. if it fails , i have another approach ready.

--------------
Doumentation: Support / Troubleshooting | Installation, Upgrade, Configuration and Usage

 
alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Mon, 2007-02-26 05:45

Ok, my method fails. The result is the same, just like when gallery does the convertion : This "Des mardis au Délima" convert to "Des mardis au D" (bad) and this "Soirée de St-Valentin MAM" convert to "Soirée de St-Valentin MAM" (good).

So with "good" and "bad" data in the same table, I don't really know what to do.

Tell me, what is that other approach? :)

Thanks.

Alexandre Racine
http://www.salsamontreal.com - Évènement, photos et beaucoup plus!

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Mon, 2007-02-26 13:40

please clear the db cache (FAQ: How can I clear cached data?) since it caches some database results. so the text could be fine in the database, when you see the truncated text. or the other way.

ok, restore the database again and then follow these instructions:
http://gallery.menalto.com/forum/gallery_2.2_RC1_feedback?page=2#comment-222189

as far as we understand the problem, this should fix it for you.

--------------
Doumentation: Support / Troubleshooting | Installation, Upgrade, Configuration and Usage

 
alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Mon, 2007-02-26 17:21

Well, after trying this new approach, the correct text like "Des mardis au Délima" stays ok, and the not correct text like "Soirée de St-Valentin MAM" stays that way.

I can see that this is the convertion functions. I was a programmer once... err... 10 years ago.

Mabe another step could be introduce?

* 1. Detect current column attributes
* 2. Convert text column to binary column
* 2.1 Validate that this is good (with length comparison or other means)
* 3. Convert them to character/text columns with UTF8 charset

Alexandre Racine
http://www.salsamontreal.com - Évènement, photos et beaucoup plus!

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Mon, 2007-02-26 18:30

some problems with that suggested approach:
i'm pretty sure that the conversion to binary columns doesn't lose any data.
the truncation happens when converting from varbinary to varchar UTF-8 starting at the first byte-sequence that cannot be mapped to a unicode character.

but yes, we probably could validate the results to warn the user about the corruption (but it happened already).

and the validation (string length) could be very expensive. we'd have to record the strings in memory or in a copy of the table and compare the results to that copy.

--------------
Doumentation: Support / Troubleshooting | Installation, Upgrade, Configuration and Usage

 
alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Mon, 2007-02-26 21:10

Right, the code convert the entire table in one command. Speed vs integrity.

Looking in the tables, I can see that in my case all those "problems" are in the "g2_Item" table witch contain 7751 records. Either in the g_description, g_keywords, g_ownerId, g_summary or g_title field. This is of course, all pictures + all albums.

This could be an option. To validate everything, but that would somehow defeat the purpose of the command in the first place. From here http://bugs.mysql.com/bug.php?id=22719 mysql does not seems to want to fix that problem.

How about a diff? I could extract the good content, upgrade, and then put back the good content.

Are there any fields change in the "g2_Item" table from 2.1 to 2.2.RC1??

Thanks.

Alexandre Racine
http://www.salsamontreal.com - Évènement, photos et beaucoup plus!

 
alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Wed, 2007-02-28 20:41

I'll answer my own question.

From G2.1 to G2.2.RC1 in the g2_Item table, there is one more row named "g_renderer" and contain the "NULL" value.

Also, do these table are usefull? g2_SequenceId and g2_SequenceLock are still in latin1 and contain only two row.

So I'll try some kind of extraction of correct accents, upgrade (bad data will become good, and good data will corrupt), and dump back the correct data.

Alexandre Racine
http://www.salsamontreal.com - Évènement, photos et beaucoup plus!

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Wed, 2007-02-28 23:51

> Also, do these table are usefull? g2_SequenceId and g2_SequenceLock are still in latin1 and contain only two row.

that's fine.
see:
http://sourceforge.net/tracker/index.php?func=detail&aid=1590404&group_id=7130&atid=107130

don't worry about those 2 tables.

--------------
Doumentation: Support / Troubleshooting | Installation, Upgrade, Configuration and Usage

 
alexandreracine

Joined: 2005-06-16
Posts: 39
Posted: Fri, 2007-04-20 18:51

ok, so here it goes.

Whoever wants to do this, needs to know a little bit about mysql, linux command line, to adapt all this to his situation. But it works :) You also need phpmyadmin and this is a G2.1 to G2.2.RC2 procedure.

This is easier with mysql 5, but can work with mysql 4 too. (More details about mysql 4 and 5 in the post.)

In this case, my case, I got some mixt utf8 and latin1 data in latin1 encoding. This is particuly nothing to worry about if you are only using English as the language. But mixt that with some accents é, à, ù, from French, Spanish or Deutsch and you are so gone. So one picture might be "Année 2006", good accent, and another one might be "Soirée de St-Valentin", bad accent. This is all the result of not understanding what is happening and mighty administrator updates of mysql.

How is this gonna work? We'll keep the good characters, update gallery, put back the good characters. Why? Because when gallery is converting the database in the utf8 format, actually this is mysql fault, some utf8 strings get corrupted. See up this post for a link to the mysql bug tracker witch is close anyway.

Note : If you are unfamiliar with updading and mysql, I strongly suggest that you try this at home before doing that on your production server. By "at home", I mean "get yourself an old machine, install Ubuntu and recreate your gallery exactly like your online version". Now try this two times and if it work you'll be closer to more knowledge ;) The alternative? You could destroy all your data by accident. You have been warn.

First thing first : Empty your gallery cache : http://codex.gallery2.org/Gallery2:FAQ#How_can_I_clear_cached_data.3F
After that : Backup your G2 installation! - http://codex.gallery2.org/Gallery2:How_do_I_Make_Backups_of_My_Database

Now, go in phpmyadmin in the g2_Item table and clic Export! Export this table with the REPLACE mode and transfer the file so you can save it on your machine instead of having the result on your screen. Save the file as g2_Item.sql.latin1

Let's transform this file in a utf8 file :)

From the linux command line (I use Ubuntu www.ubuntu.com)

grep -E --regexp=[éêèëàùîçÉ] g2_Item.sql.latin1 > g2_Item.sql.utf8

Witch translate to "grep -E --regexp=[CHARACTERS YOU WANT TO KEEP] FROM.THIS.FILE.sql.latin1 > TO.THIS.FILE.sql.utf8"

Simply put, this will extract all good lines (row in mysql) that contain the good characters. Look that your new file is correct. Here, depending if you are using mysql 4 or 5, you'll have to ajust. For example, the export on mysql 5 with phpmyadmin, only put one "REPLACE" statement on the first lines of the file. If that first statement does not have any characters you want to keep, it wont be in your new file, and that statement is a must! So put it back. Also, you do want to remove any "CREATE DATABASE" or "DROP DATABASE" statements in your new file. There are two reasons for that. 1- You don't want to create or drop. 2- It has the "latin1" statement in it.

Now, update your gallery from G2.1 to G2.2.RC2 just like the book. (If G2.2 is there, consider it :)

While updating, your hole database will convert to utf8. Good accents will become corrupt and bad accents will be ok again!

Now, test your gallery, make sure that everything works like it should do, except for the accents.

Let's put back our good accents with the good database shall we?

Go back in phpmyadmin, in the g2_Item table and Import your new utf8 file. Once it is done and everything is ok, you'll have the "[Number here] (ex 333) or field updated." (or rows), with a success message.

To test if you have done right, clear the gallery cache completely : http://codex.gallery2.org/Gallery2:FAQ#How_can_I_clear_cached_data.3F

...and then refresh the gallery page. Voilà! :)

Alexandre Racine
http://www.salsamontreal.com - Évènement, photos et beaucoup plus!

 
sjsipila

Joined: 2007-04-03
Posts: 11
Posted: Tue, 2007-04-03 16:26

Ok,
I had this very same problem with some Scandinavian characters [åäö] when upgrading from G2.1 to G2.2. I followed the procedure Alexandre presented here and everything seems to be working ok, no more truncated descriptions etc.

I have these characters also in some other tables besides g2_Item (names of users etc.) but these were easily fixed manually.
Alexandre, thanks for the very nice instructions!

Julius Sipilä