[Debian-l10n-devel] Translations cleanup

Wed Jun 9 08:02:27 UTC 2010

Hi Christian,

CCʼing the Ubuntu Translation Coordinators team, since Ubuntu is downstream
of Debian and it is therefore also in Ubuntuʼs interest that this
stuff gets fixed.

On 06/08/2010 01:56 AM, Christian PERRIER wrote:
> Quoting Arne Goetje (arne.goetje at canonical.com):
>> Hi Christian,
>
> CC'ing debian-l10n-devel mailing list, that gathers all people
> involved in the Debian i18n infrastructure and scripts.
>
>>
>> I just stumbled upon this translation overview page:
>> http://www.debian.org/international/l10n/po/
>>
>> The script which generates the page seems to need some improvement:
>>   * it seems not to use the iso_639_3.xml file from the iso-codes
>> package, since many language codes are marked as "Unknown language".
>
>
> I'm not sure this is easy to achieve. Nicolas, any idea?
>
>>   * language codes with @ modifiers are not parsed correctly. The
>> script should split the string at the @ and display it like this:
>> ca@valencia Catalan (valencia).
>
> That can probably be fixed even though my personal opinion about this
> ca@valencia "joke" is....let's say this politically correct...mitigated..:-)

Well, yes. ca@valencia is for political reasons. But other variations
(@latin, @devanagari) do make sense.
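
To make the suggestion concrete, here is a rough Python sketch of the
splitting I have in mind. I have not looked at the actual script, so the
function name and the small name table are hypothetical; a real version
would take the names from the iso-codes data instead. The territory part
(as in ca_ES) is simply ignored here when looking up the name.

```python
# Hypothetical lookup table, for illustration only; a real script would
# read the language names from the iso-codes package.
LANGUAGE_NAMES = {"ca": "Catalan", "sr": "Serbian"}


def describe(code):
    """Split a code such as 'ca@valencia' at the '@' and format it for display."""
    base, _, modifier = code.partition("@")
    lang = base.split("_")[0]  # drop a territory suffix such as _ES
    name = LANGUAGE_NAMES.get(lang, "Unknown language")
    if modifier:
        return f"{code} {name} ({modifier})"
    return f"{code} {name}"
```

With that, describe("ca@valencia") gives "ca@valencia Catalan (valencia)",
and the same code path handles @latin and @devanagari.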

>>   * some entries look bogus, e.g. vi_AR. There are no translations
>> with that code, so it needs to be investigated where this code comes
>> from.
>
> Certainly from some bogus package providing a vi_AR.po file.

Well, that bogus .po file seems to be completely empty then, since 
following such links leads to 0 translatable strings and no .po files 
listed. Probably we would need to add some debugging code in the script 
to tell us which package carries such crap.

Also, I donʼt see any reason to have CSB, KAB and TLH (upper case), 
which lead to csb, kab and tlh (lower case) respectively.
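
The kind of debugging helper I am thinking of could look roughly like
this. It is only a sketch: I am assuming a hypothetical layout where each
top-level directory under some root is one unpacked source package, and
the function names are made up. It reports which package ships a file for
a given code (and its size, to spot empty files), and flags codes whose
language part is not lower case.

```python
from pathlib import Path


def find_po_files(root, code):
    """Return (package, path, size) for each file named <code>.po under root.

    Assumes a hypothetical layout in which each top-level directory
    under root is one unpacked source package.
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob(f"{code}.po")):
        package = path.relative_to(root).parts[0]
        hits.append((package, str(path), path.stat().st_size))
    return hits


def has_uppercase_language(code):
    """True for codes like 'CSB' whose language part is not lower case."""
    lang = code.split("@")[0].split("_")[0]
    return lang != lang.lower()
```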

>> Also, I'd like to ask if there is any coordinated effort planned or
>> underway to fix the .po file names in the packages themselves? Quite
>> a few files need to be renamed in order to be useable.
>
> There have been some initiatives. In a quite distant past, I reported
> a few such errors to the relevant packages.

What would be the best approach to address such bugs? Can we tag bug 
reports, so that we can easily filter them for this task?
Should we prepare debdiffs or patches to fix these issues properly and 
attach them to the bug reports?

In Ubuntu we have established some guidelines [1] for developers to name 
those .po and .pot files properly, so that we can parse them easily when 
importing them into Launchpad. (We use Launchpad [2] to allow our 
translators to translate the packages, as you might know.)

[1] 
https://wiki.ubuntu.com/UbuntuDevelopment/Internationalisation/RecipeVerifyingTranslationUploads
[2] http://translations.launchpad.net/ubuntu/lucid

>>
>> Examples:
>>   * dk ->  should be da, according to the translations inside
>>   * sr_SR ->  the country code for Serbia is RS. It should actually be
>> just 'sr'. Likewise with sr_YU.
>>   * sr@Latn and sr@latin is actually the same and should be merged
>> into sr@latin. sr@Latn doesn't exist as a locale.
>>   * no and no_NO are discouraged. Translations should be either nb or
>> nn. In most cases, these 'no' translations are actually nb.
>>   * zh is also discouraged, they should be either zh_CN or zh_TW.
>>   * codes with country codes, where the language is only mainly
>> spoken in one country should be merged with the country-less
>> language codes to avoid confusion. E.g. ca_ES@valencia should get
>> merged into ca@valencia
>
>
> I even go further: fr_FR.po when there is no fr.po file and no other
> fr_* file is plain stupid. Indeed, my own personal opinion is that
> there is no serious argument for using country modifiers for most of
> the "multiple country" languages.

Thatʼs what I meant, yes. :)
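
For the renames in my examples above, a first pass could be table-driven.
The following Python sketch only encodes the cases from my list; the
table and function names are hypothetical, and the "no" and "zh" cases
deliberately return more than one candidate because somebody has to look
at the translations to decide.

```python
# Unambiguous renames, taken from the examples in this mail.
RENAMES = {
    "dk": "da",
    "sr_SR": "sr",
    "sr_YU": "sr",
    "sr@Latn": "sr@latin",
    "ca_ES@valencia": "ca@valencia",
}

# Discouraged codes where the right target depends on the file contents.
NEEDS_REVIEW = {
    "no": ["nb", "nn"],
    "no_NO": ["nb", "nn"],
    "zh": ["zh_CN", "zh_TW"],
}


def suggest(code):
    """Return the list of candidate codes a .po file should be renamed to."""
    if code in RENAMES:
        return [RENAMES[code]]
    if code in NEEDS_REVIEW:
        return list(NEEDS_REVIEW[code])
    return [code]  # nothing known to be wrong with this code
```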

> I had this debate many times in Debian lists...and, of course, there
> is always someone popping up in a more or less pedantic way and "kindly"
> explaining me that "French as spoken in Belgium" is different from
> "French as spoken in France", but:
>
> - after over 10 years in l10n, I know about all this and probably all
> specificities of most languages in the world. That's pedantic too but
> I think I deserve the right to be pedantic on that matter
>
> - software l10n is about *written* languages, not spoken ones, and
> apart from very specific, very well known cases ("ordenador" vs
> "computador" in es_ES and es_everywhere-else), there are no practical
> differences in most cases
>
> - only having fr_CA (for instance) translation files for French
> deprives users of other French locales from the French translation
> unless this file is copied as fr_FR, fr_CH, fr_BE, fr_LU, etc. Huge
> waste of resources. Of course, French is only an example, here.
>
> - exceptions to this (that is, real good reasons to use xx_YY.po files)
> are very limited:
>    - pt vs pt_BR
>    - zh_CN vs zh_TW (with all practical implications for users of zh_HK,
>    zh_SG...)
>    - possibly pa_IN/pa_PK and bn_BD/bn_IN
>
> So, in short, all occurrences of xx_YY.po files (apart from the
> abovementioned exceptions) should be hunted down....and I would
> wholeheartedly welcome an initiative about this. Of course, most of
> these errors belong to upstream software, but we can expect Debian
> developers to relay them upstream (and of course, then, have fun times
> arguing with upstream developers when they tell us we are wrong..:-))
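
Hunting these down could start with a simple filter over the codes found
in the archive. A minimal Python sketch, assuming the exception list is
exactly the one you give above (the names and the regular expression are
my own and would need tuning against real data):

```python
import re

# Country-specific codes with a good reason to exist, as listed above.
EXCEPTIONS = {"pt_BR", "zh_CN", "zh_TW", "pa_IN", "pa_PK", "bn_BD", "bn_IN"}

# xx_YY with an optional @modifier, e.g. fr_FR or ca_ES@valencia.
CODE_RE = re.compile(r"^([a-z]{2,3})_([A-Z]{2})(@\w+)?$")


def suspicious_codes(codes):
    """Return the codes of the form xx_YY that are not known exceptions."""
    flagged = []
    for code in codes:
        match = CODE_RE.match(code)
        if match and f"{match.group(1)}_{match.group(2)}" not in EXCEPTIONS:
            flagged.append(code)
    return flagged
```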

Yes, Iʼm with you on that. And we should mobilize forces together from 
Debian and Ubuntu to achieve this, since itʼs our common interest.
For gettext applications, .po files with country codes should only 
contain diffs to the main country-code-less .po files. I.e. fr.po is the 
main translation file and fr_CA only contains diffs for those strings 
which really need to be written differently, either because of spelling, 
grammar or specific terms.
How we deal with this on a case by case basis would need to be 
discussed. E.g. we could expect the fr.po file to mainly include strings 
from fr_FR, however if a software originates in Canada, translators 
might have used fr_CA terms which might not be appropriate in France. In 
that case it might make sense to put the fr_CA strings into fr.po and 
use a fr_FR diff for those strings which would be used differently in
France. Or, we modify those strings, so that fr.po is always equal to 
fr_FR and put the diffs into fr_CA.
This would need to be discussed and we should establish a common policy 
for this.
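
To illustrate the "diff" idea, here is a small Python sketch that models
a catalog as a plain dict from msgid to msgstr. This is not the gettext
API, only a model of the policy; the function names are hypothetical. One
function reduces a full fr_CA catalog to the strings that really differ
from fr, the other shows what a user of the fr_CA locale would end up
seeing when the diff is layered over fr.

```python
def reduce_to_diff(base, full):
    """Keep only the translated entries of 'full' that differ from 'base'."""
    return {
        msgid: msgstr
        for msgid, msgstr in full.items()
        if msgstr and base.get(msgid) != msgstr
    }


def effective_catalog(base, overrides):
    """Layer a country-specific diff over the country-less base catalog."""
    merged = dict(base)
    # Empty msgstr means "untranslated", so it must not hide the base string.
    merged.update({k: v for k, v in overrides.items() if v})
    return merged
```

Whichever variant ends up in fr.po, the same two steps apply; only the
roles of the base and the diff are swapped.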

One more thing Iʼd like to mention here:
As mentioned above already, Ubuntu translators and translation teams are 
translating and bug fixing translations in Launchpad. Those translations 
then get exported into language-packs for Ubuntu. It is our wish that
these changes also get contributed back upstream, but we see that
this does not work very well, except in a few cases.
For the packages which originate in Debian (at least), Iʼd like to kick 
off a discussion about how we could cooperate better to get translation 
improvements back into Debian. This includes for example 
debian-installer and iso-codes, for which the translations made in 
Launchpad are not used currently, since our policy is that we donʼt want 
to diverge further from upstream for these packages, just because of
translations.
So ideally, we would establish a channel to get those translations back 
into Debian. Either Debian translation teams could "harvest" the 
translations from Launchpad (they can be downloaded individually or in 
batches), or the Ubuntu translation teams somehow push them back to 
Debian one way or another.

What do you think?

Cheers
Arne