character set handling with MySQL 4.1

LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 10:01

Hello,

I have installed Gallery 2.0 (final) on Fedora Linux Core 3 with Apache 2.0.53, PHP 4.3.11 and MySQL 4.1.13 (the database default encoding is UTF-8). I use czech language. It seems that the proper database encoding is not set anywhere in Gallery 2.0, which leads to messing up the characters with ascii value above 127. I have solved it with this dirty hack:

--- lib/adodb/drivers/adodb-mysql.inc.php.orig 2005-08-03 19:15:11.000000000 +0200
+++ lib/adodb/drivers/adodb-mysql.inc.php 2005-09-19 14:57:36.000000000 +0200
@@ -353,6 +353,7 @@
$this->_connectionID = mysql_connect($argHostname,$argUsername,$argPassword);

if ($this->_connectionID === false) return false;
+ mysql_query('set character set utf8',$this->_connectionID);
if ($argDatabasename) return $this->SelectDB($argDatabasename);
return true;
}

Of course this is not an acceptable solution for the distribution. Somebody would need to find out something more clever...

Leos

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2005-09-20 10:11

sounds reasonable, but the question is why it works for so many other people who use G2 with mysql 4 or 4.1 and their non-ascii characters. probably because they're using utf-8 DBs and it defaults to utf-8 connections.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 10:23

It seems that this would be problem only for MySQL 4.1. Older versions (4.0 and 3.x) would be fine. Check this URL for a possible explanation: http://bugs.mysql.com/bug.php?id=9948 - however, please note that the solution which is going to appear in MySQL 4.1.15 (new server option --skip-character-set-client-handshake) is probably not going to be acceptable for most Gallery users. Setting the character set explicitly seems to be the only safe option - of course only if it is done some smart way (setting it to the character set which the user really wants), not using my quick&dirty one line hack.

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2005-09-20 10:30

there's no choice for the internal character set G2 uses. G2 handles everything in UTF-8. UTF-8 output (generated HTML), UTF-8 input (data from browser), and a UTF-8 storage (database). Input from the filesystem (add from local server) is converted to UTF-8, if needed.
so specifying UTF-8 as the charset in the client handshake is correct.
the question is whether it is needed. For my mysql 4.1.10 / 4.1.11 installs, this wasn't necessary.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 10:32

It seems that we are facing a deficiency in cooperation between PHP 4 and MySQL 4.1. There is no way (or at least no easy way - recompiling libmysql doesn't count) how to implicitly configure the default MySQL 4.1 database character set for PHP scripts (for example in php.ini). Explicit mysql_query('set character set ...') is the only solution which I found. Are you really sure that there are any Gallery users who use MySQL 4.1 and do not have any problems with non-ascii characters?

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 10:36

All right, here is my test: create a new album and enter the description "Věruš

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2005-09-20 10:41

yes, me. i'm Swiss after all and we have French and German umlauts. or am i missing something?
and yep, i've seen character set problems with G2 and mysql, but these were resolved by moving to 4.1+ (from 3.x or 4.0) or a specific upgrade, because some mysql 4 versions have a character set collocation bug or something like that.

i'm not saying that specifying the character set is a bad idea. when i used to code perl / oracle i did it all the time, it makes sense. i'm just questioning if it's necessary in php for mysql. i don't know.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 10:43

All right, here is my test: create a new album and enter the description "VXruYZino" there. Unfortunatelly it seems that the Drupal installation has some problems with the MySQL character set, too, so I have to write it this way: please replace X with the character which is available on the key 2 on the czech keyboard, Y for key 3 and Z for key 4. Will it be really shown correctly with MySQL 4.1.10/4.1.11? It was not for me until I added that mysql_query('set character set utf-8'). I bet that Drupal has exactly the same problem with MySQL 4.1 as Gallery - that's not surprising, both are PHP applications.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 10:47

The French and German umlauts might be covered in ISO-8859-1, which is the default client charset for MySQL 4.1 with PHP. Please try to reproduce the test which I posted and let me know about the results.

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2005-09-20 10:49

just tested with mysql 4.1.10a and it worked. i had to view your previous post ( Posted: Tue, 2005-09-20 12:36) in internet explorer to see the correct characters, in firefox it looks garbled.
i don't have a czech keyboard, but it wasn't needed.

but let's stop this example thing.
the question is: couldn't you just configure your mysql server to handle everything in UTF-8 by default?
or isn't this a property that can be set for each database? is your mysql g2 database using utf-8?
or is it a good idea to use this set-charset command and is it only required for mysql 4.1+ or since what exact version? or can we query the mysql server if this is supported? etc.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 11:37

Did you really tested with the string which I suggested? My previous post is not shown correctly in internet explorer - it is just broken differently. It seems that internet explorer understands more UTF-8 characters than firefox, so the broken character has some meaning there (unlike in firefox). Check here how is that czech word supposed to look: http://www.leos.cz/veruscino.gif - if you did not quickly conclude "the czech keyboard wasn't needed" you would be able to check yourself. I chose the example to show you that my problem is really valid. I won't stop the example thing until I persuade you. ;-)

Here are my answers to your questions:

The MySQL server obeys what the client wants. To get everything in UTF-8 I could reconfigure the client not to ask for the default charset (server would fallback to its default charset, which is UTF-8 in my case) or ask for UTF-8. Both options require recompilig libmysql (which is linked to PHP), which sucks and I doubt all Gallery users would like to do it. The default libmysql (as distributed from mysql.com) asks for ISO-8859-1 (to be backwards compatible with older MySQL releases).

The default server charset can indeed be specified differently for different databases. However, I let my G2 database inherit the default server encoding, which is UTF-8. I just put "default-character-set=utf8" to my.cnf and I do not care about the charset anymore (until I find a broken client, of course).

According to the report at http://bugs.mysql.com/bug.php?id=9948 the charset handshake between client and server was changed in MySQL 4.1. Earlier versions did not support UTF-8, anyway. So it seems safe to query the server version and issue 'set character set utf8' only if it is 4.1 or newer.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 12:24

To be more specific: the library, which would need to be modified (or you could call it "hacked" as well), is libmysqlclient.so.14. In the source file sql-common/client.c as distributed from mysql.com you can see this:

/* Set character set */
if (!mysql->options.charset_name &&
!(mysql->options.charset_name=
my_strdup(MYSQL_DEFAULT_CHARSET_NAME,MYF(MY_WME))))
goto error;

Basically it means that either mysql->options.charset_name is set previously (unfortunatelly PHP does not do this), or you get the value of MYSQL_DEFAULT_CHARSET_NAME, which is "latin1". However, modification to this library could break other applications, so I do not recommend it. Better possibility would be to add call to mysql_options(&mysql->conn, MYSQL_SET_CHARSET_NAME, (const char *)&desired_character_set) to ext/mysql/php_mysql.c in the sources of PHP. The variable desired_character_set should be set based on the administrator's requirements, the best approach seems to be a new configuration directive in php.ini. Again - modification and recompilation is needed, which is not what an average Gallery user would like to do, I think.

I wrote all this detailed information here only to demonstrate that modifying the PHP scripts (Gallery, Drupal, etc.) to call mysql_query('set character set utf8') seems to be the easiest solution, even though it is just a workaround for the IMO incorrect PHP+MYSQL cooperation.

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2005-09-20 14:04

@example:
yes, i tested it with the exact string, thanks for supplying the gif. yes, i used this string, created an album, used it as description. aaved, browsed to album, looked correct.
just believe me, it works for other / some / most / whatever mysql 4.1.x installations. it's must be a configuraiton thing.
i even tried it with my other server which has mysql 4.0.24.
and VěrušĿino displayed correctly, as in your gif image. just the 4th last letter, something similar to a c showed in IE as a letter similar to a capital L, thus it's like that in both of the servers i tried. it exactly showed in the album what i entered in description which is identical to your image, with the exception of the one letter. but that's an IE issue.

now something that will suprise you: i use character set latin1 in my mysql 4.0.24.
and i use latin1_german1_ci in mysql 4.1.

how do you explain that? :)

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2005-09-20 23:54

That one letter (entered as something like c, but shown as something like L) is not an IE issue. It is what I try to explain since the beginning of this thread - it is a result of incorrect usage of MySQL charsets. That one character similar to c is not compatible with the latin1 character set and therefore gets corrupted (translated to another character, similar to L).

Let me explain how charsets work in MySQL:

When the client stores any textual data (for example char or varchar) to the server, it has to specify their charset. If it does not specify the charset explicitly, with MySQL 4.1 it defaults to latin1. With MySQL 4.0 it defaults to the default charset of the server. The server then translates the received data from the charset of the client to its own charset and stores them.

When the client retreives any textual data (for example char or varchar) from the server, it has to specify in which charset it would like to receive the data. If it does not specify the charset explicitly, with MySQL 4.1 it defaults to latin1. With MySQL 4.0 it defaults to the default charset of the database. The server translates the stored data from their saved charset to the charset which the client wants and then sends them to the client.

In my case I have configured the MySQL server to store the data in UTF-8 (with default-character-set=utf8 in my.cnf). Gallery 2.0 has the data in UTF-8, too. So I would like no translation to happen. To acheive this, I need to tell the MySQL server that the client charset is UTF-8, too. That is done by mysql_query('set character set utf8'). If I do not do it, MySQL assigns me the default client charset, which is latin1, and does the inappropriate translation, which corrupts some characters.

For MySQL versions 4.1 and newer I beleive that the best solution is what I have done - ask the server to store the data in UTF-8, either globally in my.cnf or for one specific table: CREATE TABLE xyz (...) ENGINE=InnoDB DEFAULT CHARSET=utf8, and do mysql_query('set character set utf8') to let the server know that the client character set is UTF-8, too. No translation happens and everything works nicely.

For MySQL versions 4.0 and older it is more difficult. Those versions do not support UTF-8 yet, but they support other charsets (8-bit only). That enables them to do the charset translations, which could eventually corrupt the textual data. After reading http://dev.mysql.com/doc/mysql/en/binary-varbinary.html please try to change the used column data types from "char(x) to "char(x) binary" or "binary(x)" and from "varchar(x)" to "varchar(x) binary" or "varbinary(x)". Will the character similar to c show correctly then, or will it again incorrectly show as similar to L?

I hope that this quite long post will explain you how the character sets really work in MySQL. If you understand how the character sets work with Oracle, you should find that MySQL is actually the same. The database stores the data in one character set and enables the clients to access it in different character sets, doing the translation automatically - provided that the clients specify their character set correctly.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Mon, 2005-09-26 12:34

Any news on this issue? Valiant, have you been able to try some of the suggestions from my last post?

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Mon, 2005-09-26 12:40

nope.
but it's on my uber - long to do list and i'm writing exams these days.

plus:
bharat let me know that mysql and postgresl can handle our utf8 characters even when using latin1 db tables. i don't get why, but as you see, it works in my g2 installs (except for this one character).
but in the installer we recommend utf-8 for mysql and unicode for postgresql, so i don't know yet how this stuff works.
i'll read into it, but not right now.

you'd have very strong arguments if you want to convince us to change to binary formats.

 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Mon, 2005-09-26 12:59

Works except for one characted equals to works incorrectly for me. That one character was something I found immediatelly after I started using Gallery 2.0 (that was the first version which I tried). I will check other czech characters and let you know about the result.

If you recommend using utf-8 for mysql, does it mean that you do not support older versions of mysql than 4.1? It would be the cleanest solution from my perspective (together with mysql_query('set character set utf8'), of course), but I am not sure if it would be acceptable to force the users of mysql 4.0 and 3.x to upgrade to 4.1.

All my arguments are already written in my older posts. I consider them very strong (what can be stronger than database corrupting some of your data?), but it is up to you as the Gallery developer to clasify them accordingly.

Have a good luck with your exams!

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Wed, 2005-12-21 23:34
 
LeosB

Joined: 2005-09-17
Posts: 16
Posted: Tue, 2006-01-31 15:11

FYI: Drupal finally seems to have this issue fixed in their release 4.7.0 beta 4. See http://drupal.org/node/40515 for more information.

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2006-01-31 15:29

thanks for the update. good that drupal did the hard work for us concerning mysql and pg. we'll also have to look into settings for oracle and db2.
the linked discussion is very informative and can be used as a resource when patching g2.

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Fri, 2006-02-10 18:10

fixed in G2.1 RC1 and later versions.