[debian-mysql] Bug#438970: UTF8 default character set for mysql-server package

Cristóbal Palmer cmpalmer at metalab.unc.edu
Wed Dec 5 07:42:02 UTC 2007


On Tue, Dec 04, 2007 at 11:18:38PM +0100, Norbert Tretkowski wrote:
> I've read the whole bugreport a few times now, and I think I have to
> agree with Sean, we should switch the default charset for new databases
> to utf8, but shouldn't touch existing ones.

Agreed. I think the easiest way to accomplish this (from the
sysadmin's perspective) would be to have a separate package, maybe
mysql-server-utf8? But again, I'm not an experienced packager. I'm
sure there are great debianish minds that will choose the best path.

As for why they didn't go with utf8 as the default in the past, I
recommend this article:

http://dev.mysql.com/tech-resources/articles/4.1/unicode.html

Actually, I *really* recommend that article. It made the difference
for me in terms of understanding what otherwise seemed like a
nonsensical charset forest with mysql.

> Having some testers with MySQL and utf8 experience would be great,
> thanks for your offer. Expect a package in experimental if we really
> decide to switch the default charset.

I look forward to it! I'm confident that together we can get something
packaged that will make a lot of people's lives significantly easier
in the long run.

I did want to chime in with a few more things:

(1) A previous comment seemed to indicate that changes to my.cnf would
cover everything. If it's possible to get mysql to do utf8 across the
board by default (server, db, client, conn) by only adjusting the
my.cnf under debian, would someone please attach such a my.cnf? I was
unsuccessful in my attempts to utf8-ify that way.

(2) What are the consequences of changing the default collation? The
default collation for utf8 is utf8_general_ci (try 'show character
set;'), but the default for latin1 is latin1_swedish_ci (‽‽). I'll
tell you now that the consequences could be painful ☮✈⚔⚠☫⚛☠±♫♥ for
users of some webapps, including (in the past) drupal:

http://drupal.org/node/66333

In case that wasn't clear, I mean that it can break things. Note that
the drupal example above was specifically a collation issue
(http://drupal.org/node/66333#comment-412577), and I feel sorry for
the reporter, who got a "won't fix" and "When it does occur, it is
relatively easy to fix by hand," which is--with all due respect to the
fine drupal people--bogus, imho.
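
To make "it can break things" concrete, here is a minimal Python sketch of
one way a collation change can bite: values that were distinct under an
accent-sensitive comparison can collide under an accent-insensitive one,
which is a problem for anything relying on a unique key. The two key
functions are hypothetical stand-ins written for this illustration; they
are not MySQL's actual collation algorithms, and this is not necessarily
the failure reported in the drupal thread.

```python
import unicodedata


def accent_sensitive_key(s: str) -> str:
    """Case-insensitive but accent-sensitive key (illustrative stand-in)."""
    return s.casefold()


def accent_insensitive_key(s: str) -> str:
    """Case- and accent-insensitive key.

    A rough approximation of an accent-insensitive collation,
    NOT the algorithm MySQL uses.
    """
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def clashes(values, key):
    """Return pairs of values that compare equal under the given key."""
    seen = {}
    found = []
    for value in values:
        k = key(value)
        if k in seen:
            found.append((seen[k], value))
        else:
            seen[k] = value
    return found


# Made-up sample data: the second name differs only by case and an umlaut.
names = ["Palmer", "p\u00e4lmer", "Tretkowski"]

before = clashes(names, accent_sensitive_key)
after = clashes(names, accent_insensitive_key)
print("clashes, accent-sensitive:  ", before)
print("clashes, accent-insensitive:", after)
```

Rows that coexisted happily before the switch would count as duplicates
afterwards, so any unique index over them would have to be sorted out by
hand.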

The problem is that you can't always anticipate when, where, or how
charset conversion or collation problems will show up. Here's a horrible
example of how NOT to do latin1 -> utf8:

http://lists.wikimedia.org/pipermail/mediawiki-l/2004-November/002245.html
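
One common way a latin1 -> utf8 conversion goes wrong (not necessarily
what happened in the post linked above) is double encoding: bytes that
are already UTF-8 get read as latin1 and then encoded to UTF-8 a second
time. A minimal Python sketch, using a made-up sample string:

```python
# Sample value chosen for illustration; any non-ASCII text behaves the same way.
original = "Crist\u00f3bal"

# The data is already stored as UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but a conversion step assumes the bytes are latin1.
misread = utf8_bytes.decode("latin-1")

# Re-encoding the misread text as UTF-8 bakes the damage in.
double_encoded = misread.encode("utf-8")

# If nothing else has touched the data, the mistake can be undone
# by reversing the two steps.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(repr(misread))
print(double_encoded)
print(repaired == original)
```

The repair only works while the damage is uniform. Once correctly
encoded and double-encoded rows are mixed in the same table, there is no
single transformation that fixes both.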

It would be nice if everybody understood encodings thoroughly and
played nice, but doing a little poking turns up tons of examples of
webapps behaving badly, and for a variety of reasons. Or maybe it's a
clash of expectations/preferences? My personal non-database favorite
encoding hobby horse is mailman lists and their archives. Perhaps it's
irrational of me to think that I shouldn't have to change browser
settings to view things correctly. Try visiting:

http://lists.ibiblio.org/pipermail/cc-jp/

for example. I promise it's not broken.* Mostly. :)

The more we consolidate on utf8, the better things get, but along the
way there will be painful moments. That's life.

UTF8 by default is a change that should happen, but carefully, and
there will likely need to be legacy support for the old defaults for
some time.

Cheers,
-- 
Cristóbal Palmer
ibiblio.org systems administrator

* Hint: View -> Character Encodings -> More Encodings -> East Asian ->
EUC-JP

Bonus points if you can tell me why some pages, e.g.

http://lists.ibiblio.org/pipermail/cc-jp/2004-March/000128.html

look broken. Mailing list archives are fun, see?
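
The EUC-JP hint above comes down to which decoder gets applied to the
same bytes. A minimal Python sketch of that effect; the sample string is
an illustration, not text taken from the archive, and this is not an
answer to the bonus question:

```python
# Three Japanese characters, written as escapes so this file stays ASCII.
text = "\u65e5\u672c\u8a9e"

# What a page served in EUC-JP puts on the wire.
raw = text.encode("euc_jp")

# A browser that guesses latin1 shows accented Western characters instead.
as_latin1 = raw.decode("latin-1")

# A strict UTF-8 decoder rejects the same bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the right decoder, nothing was ever broken.
as_euc_jp = raw.decode("euc_jp")

print(repr(as_latin1))
print("valid UTF-8:", utf8_ok)
print(as_euc_jp == text)
```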




