[Debian-l10n-devel] tdebs infrastructure

Neil Williams codehelp at debian.org
Sun Jun 1 12:11:38 UTC 2008


On Sun, 2008-06-01 at 07:43 +0200, Christian Perrier wrote:
> Would you mind if I forward this to
> "debian-l10n-devel at lists.alioth.debian.org" ?

Done.

> Quoting Steve McIntyre (steve at einval.com):
> > Hi folks,
> > 
> > [ Christian: I randomly bumped into Neil this afternoon in Cambridge
> >   and we had some discussion about tdebs: how they work themselves,
> >   and how infrastructure could/should be developed around them. ]
> 
> Good. Please understand that I've been involved in these discussions
> from quite far. The tdebs idea comes from the 2006 Extremadura
> sessions. It was there developed by Javier Fernandes Sanguino-Pena and
> Eddy Petrisor.
> 
> Eddy then pushed the idea and built the first version of the wiki
> page. Then the project mostly stalled until Neil came on it with the
> Emdebian needs and revived it strongly.

For the benefit of the mailing list, I revived the idea because Emdebian
needs to reduce the typical Debian installation size by over 80% and one
of the key ways to do this is to not download or install translations
for languages that are not supported or used on that particular device.
For example, a device that is not configured to support fr should not
have fr translation files installed and should not be expected to
download them just to delete them later. A device that is only
configured to support es should not have any other translation files
installed: de, fr, pt - all removed. Equally, if the device is later
configured to support fr but not de, then only the fr translation files
should be added to the installation. This can save 250MB on a typical
Debian installation. Emdebian devices also need to reduce download
sizes, so the files need to be removed *in the archive*, not pruned
after a multi-megabyte .deb has been downloaded, unpacked and configured
on a device that may only have 5MB total available space.

Emdebian now has a working implementation of TDebs, written with the
Debian TDeb proposals in mind.

> I followed all this from quite far mostly by roughly thinking that the
> guys were going the right way.... without myself really going into the
> deep of the proposal. My general feeling was that, at some moment, it
> might need to be discussed with folks who have the knowledge of the
> internals of our package and archive management system..:-)
> 
> Which is more or less the discussion you had with Neil, as far as I understand...:-)
> 
> 
> > 
> > Neil told me that:
> > 
> >  * Tdebs are created at the moment by changing how a normal build
> >    works. The strings are extracted and used to generate a new
> >    pseudo-source package.

The pseudo-source package contains the POT file, the existing PO files
and sufficient debian/ content to recreate an updated TDeb or a new TDeb
for a specific language.

> > 
> >  * That new pseudo-source package is uploaded (somewhere!) so that
> >    translators can create translated "binary" packages to match
> >    them. These new packages will be uploaded as normal, using dput or
> >    similar, but there are some changes needed in dak etc. to cope with
> >    them yet.

Emdebian currently uploads the pseudo-packages to the Emdebian locale
repository:
http://www.emdebian.org/emdebian/langupdate.html
http://www.emdebian.org/locale/

The current pseudo-source package for apt is available at:

http://www.emdebian.org/locale/pool/main/a/apt/ 

The rest of the repository is organised by "language root" - e.g. the
TDeb for pt_BR is under pt, alongside the pt TDeb.

http://www.emdebian.org/locale/pool/pt/a/apt/

Where a pt_BR TDeb does not exist, the pt TDeb can be installed instead,
etc.
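
As a rough illustration of that fallback (not the actual langupdate
code, which is C/C++ and is not reproduced here), a minimal Python
sketch might look like the following. The function names and the idea
of passing in a set of available locale codes are assumptions made for
the example.

```python
def language_root(locale):
    """Return the language root of a locale code, e.g. 'pt_BR' -> 'pt'."""
    # Drop any codeset ('.UTF-8') or modifier ('@latin'), then the territory.
    base = locale.split(".")[0].split("@")[0]
    return base.split("_")[0]


def pick_tdeb(locale, available):
    """Pick which TDeb locale to install for `locale`.

    `available` is the set of locale codes that have a TDeb under the
    same language root (hypothetical input for this sketch).
    Returns the exact locale if it exists, else the language root,
    else None when nothing suitable is available.
    """
    base = locale.split(".")[0].split("@")[0]
    if base in available:
        return base
    root = language_root(locale)
    if root in available:
        return root
    return None


if __name__ == "__main__":
    print(pick_tdeb("pt_BR", {"pt", "pt_BR"}))
    print(pick_tdeb("pt_BR", {"pt"}))
```

With both TDebs present the pt_BR one is chosen; with only the pt TDeb
present, pt is used instead.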

I've written a C/C++ application (langupdate) that can do this work and
it will be tested and improved during the ongoing development of
Emdebian TDebs.

These pseudo-source TDeb packages are experimental - there may well need
to be bug fixes and modifications but the expectation is that when
Debian adds translated manpages and other content to TDebs, Emdebian
will be able to retain TDebs that *only* have the .mo file using the
DEB_BUILD_OPTIONS="nodocs" support when the packages themselves are
cross-built.

(Also note that after discussions with Frans Pop at FOSDEM, Emdebian
TDebs are *architecture-dependent*. It is undecided whether Debian TDebs
should be also, but the consensus so far appears to be that the overhead
of converting the binary .mo file at runtime is less of a problem than
instantly multiplying the number of TDebs by 12. Emdebian uses
architecture-dependent TDebs because we have only 250 source packages,
not 20,000. Even so, Emdebian currently has 1,625 TDebs from just 250
source packages and that is only for one architecture, ARM. That could
easily translate to over 100,000 TDebs for Debian or 1.2 million if
Debian TDebs were architecture-dependent.)
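
The arithmetic behind those estimates can be checked with a few lines
of Python. The figures are the ones quoted above; the multiplier of 5
TDebs per source package is only the rough ratio that gives the
100,000 figure for Debian, and the helper name is made up for this
sketch.

```python
def estimate_tdebs(source_packages, tdebs_per_source, architectures=1):
    """Estimate the total number of TDebs in an archive."""
    return round(source_packages * tdebs_per_source * architectures)


# Emdebian today: 1,625 TDebs from 250 source packages, one architecture.
emdebian_ratio = 1625 / 250

# Debian, assuming roughly 5 TDebs per source package.
debian_arch_independent = estimate_tdebs(20_000, 5)
debian_arch_dependent = estimate_tdebs(20_000, 5, architectures=12)

if __name__ == "__main__":
    print(emdebian_ratio)
    print(debian_arch_independent)
    print(debian_arch_dependent)
```

Using Emdebian's own ratio of 6.5 instead of 5 would push the Debian
estimates higher still.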

> > 
> >  * The plan is to set up a new archive section for the translations,
> >    and add support in software to use that section for tdebs.
> 
> Yep, that's the general idea and how I understood it myself..:-)

Such an archive requires a non-trivial amount of automation - as above,
we can expect the number of TDebs to be roughly 5 times the number of
Debian source packages and that ratio will only increase as more
applications receive new translations.

In Emdebian, an embedded device would download the Packages.gz file for
the TDeb repository according to the supported language root(s) (cutting
the number of packages by a factor of 40), calculate the upgrade,
install the TDebs and then ditch the Packages.gz lists and the
downloaded TDebs - indeed some devices will expect to do all of that in
RAM.
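
A minimal sketch of the "all in RAM" part, assuming the Packages.gz
data has already been fetched into a bytes object: the index is
decompressed and parsed in memory, with no temporary files. This is
Python for illustration only, not what Emdebian devices actually run;
the function names are invented, and the plain string comparison
stands in for a real Debian version comparison (for example via
dpkg --compare-versions).

```python
import gzip


def parse_packages_gz(data):
    """Parse gzip-compressed Packages data (bytes) entirely in memory.

    Returns one dict of top-level fields per stanza. Continuation
    lines (those starting with whitespace) are ignored in this sketch.
    """
    text = gzip.decompress(data).decode("utf-8")
    stanzas = []
    for block in text.split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if not line or line[0] in " \t":
                continue
            key, sep, value = line.partition(":")
            if sep:
                fields[key] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def pending_upgrades(stanzas, installed):
    """Names of packages that are missing or at a different version.

    `installed` maps package name -> installed version string.
    """
    return [
        s["Package"]
        for s in stanzas
        if "Package" in s and installed.get(s["Package"]) != s.get("Version")
    ]
```

Once the upgrade has been calculated and applied, the parsed list and
the downloaded bytes can simply be dropped.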

> > 
> > I hope that I've understood that correctly?
> 
> The very same way I understood it. You know, Eddy's initial proposal
> and Neil's are very detailed, long and I have to admit that I
> overread them. I was mostly "feeling" they were going the right way
> and my confidence in the people promoting the idea and its internals
> was enough to give them the credit to push things the right way..:-)

:-)

If anyone has questions about the details, please contact me - but do
please read the existing documentation available from the Emdebian
website.

> > 
> > A few ideas (please comment, point out if you think they're
> > obvious/dumb/won't work!):
> > 
> >  * I'm thinking it would be really useful to pick out translations
> >    from uploaded files and put them in a central database/VCS. That
> >    could allow all kinds of useful things...
> 
> Sure, organising some meta-stuff around this would be good. Indeed, I
> don't really see "translators" building the tdebs. We/they would
> interact with some friendly tool offering them strings to translate,
> or PO files, and the tdebs would be built from that material by a
> dedicated infrastructure.

(Anyone willing to write such a tool is welcome to start asap - don't
wait for me to do it. :-) I'm happy to include whatever rules are
necessary for the pseudo-source packages to autobuild.)

I envisaged the Debian i18n translator teams appointing a person to
upload TDebs after the usual review process was complete rather than
giving upload privileges to all translators. Submissions via a web
interface may still need review by the existing Debian i18n teams.

> > 
> >  * If we can pick up on common strings, can we use data from one
> >    package to suggest (fuzzy?) translations for the same strings in
> >    other packages?
> 
> Yes. That's called "fuzzy matching" from a compendium. Actually Eddy
> Petrisor built some scripts to do this on i18n.debian.net

Yes, that's going to be very useful. Individual packages can use a
"glossary" upstream but a compendium can support translators across the
entire distribution instead of in just one upstream. There will always
be "corner-cases" where _File in one app is not quite the same meaning
as _File in almost every other app but that is why the compendium would
use "fuzzy".
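
To make the compendium idea concrete, here is a small Python sketch
using difflib from the standard library. It is not how the scripts on
i18n.debian.net work (those are not shown here); the compendium is
assumed to be a simple mapping of msgid to msgstr, and the function
name and cutoff value are arbitrary choices for the example.

```python
import difflib


def suggest(msgid, compendium, cutoff=0.6, limit=3):
    """Suggest translations for `msgid` from a compendium.

    `compendium` maps known msgids to their translations. Returns a
    list of (known_msgid, translation) pairs, best match first. Every
    suggestion, even an exact match, should be marked fuzzy so that a
    translator reviews it in the context of the new package.
    """
    close = difflib.get_close_matches(
        msgid, list(compendium), n=limit, cutoff=cutoff
    )
    return [(known, compendium[known]) for known in close]


if __name__ == "__main__":
    compendium = {
        "_File": "_Fichier",
        "Open file": "Ouvrir le fichier",
        "_Edit": "_Édition",
    }
    print(suggest("Open files", compendium))
```

Marking even exact matches as fuzzy is what covers the corner-cases:
the translator confirms that _File means the same thing in this
application before the string is used.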

> >  * If we have a database, that will allow us to more easily track
> >    translations that have been made and new translations that are
> >    needed.
> 
> Yep. That's the overall idea of Pootle (i18n.debian.net:8080)
> 
> > 
> >  * Using the database, we could build a web app that will allow people
> >    to do translation for us without needing to learn the big,
> >    scary technical workflow that can be involved in package
> >    maintenance. We could pull out a few strings at a time and ask the
> >    user to translate. Add those new translations to the list for
> >    people to moderate/review for the next upload. Maybe even automatic
> >    builds/uploads directly from the database?
> 
> Pootle..:-)...we don't need to build anything, it's there. We need to
> feed it, actually (and have it survive the huge amount of data it
> could have to deal with.....that was tried with Packages Descriptions translations).
> 
> 
> > 
> >  * We should add specific support into our archive and package tools
> >    to work out appropriate versions of translations to use. The
> >    versions will have to be slightly more fuzzy than our normal
> >    packaging relationships; this will allow for some disconnection
> >    between the versions of packages and their translated strings, so
> >    that we're not forced to bump versions if they're not needed.
> > 
> > Thoughts? :-)

This reflects the reality that a lot of packages have multiple uploads
into Debian and often a few new upstream releases, without any of the
translated strings being changed and without any new PO files being
added.

Part of implementing TDebs in Debian will be modifying debhelper and/or
debian/rules to not install anything in /usr/share/locale/ but to leave
that job to the TDebs. The Debian maintainer would then only need to
upload any TDebs him/herself *if* the package is NEW or if the
translations are known to have changed. We need to prevent maintainers
uploading old translations that then replace updated TDebs because the
Debian upload has a higher version string.
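
One possible guard, sketched in Python purely as an illustration, is
to compare the PO-Revision-Date header of the incoming PO file with
the one already in the archive and refuse the upload if it would go
backwards. Nothing like this exists in dak or in the Emdebian tools as
far as this message describes; the function names are hypothetical and
the check assumes both files carry a standard gettext header.

```python
from datetime import datetime

HEADER = '"PO-Revision-Date:'


def po_revision_date(po_text):
    """Return the PO-Revision-Date of a PO file as a datetime, or None."""
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith(HEADER):
            # Header lines look like: "PO-Revision-Date: 2008-05-30 21:14+0200\n"
            value = line[len(HEADER):].rstrip('"').replace("\\n", "").strip()
            return datetime.strptime(value, "%Y-%m-%d %H:%M%z")
    return None


def would_downgrade(incoming_po, current_po):
    """True if the incoming translation is older than the current one.

    If either file lacks a revision date the check cannot decide and
    returns False, leaving the decision to a human reviewer.
    """
    incoming = po_revision_date(incoming_po)
    current = po_revision_date(current_po)
    if incoming is None or current is None:
        return False
    return incoming < current
```

A check like this looks only at the translation itself, so it does not
matter that the maintainer's upload carries a higher package version.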

> 
> You very well understood the problem, imho. My current feeling is that
> we're now ready to go into the deep of these ideas. A first move could
> be done at Debconf, in the i18n "sessions" I proposed. What would be
> good could be to dedicate a few of these to the tdebs idea and bring
> there folks who have the knowledge of the archive management tools
> (FTP team, release team maybe).

I'll be at DebCamp as well as DebConf this year so there will be time to
work on these things.

> An i18n session will also happen in Extremadura this year. I very
> recently realised that it is currently planned for September, in the
> early week-ends, which is by far *too close* from Debconf for /me to
> be safely there without compromising the relation with my
> wife ;..:-)...I'm currently working to try pushing it away to November
> or December.

Emdebian is sticking to September for our Extremadura Work Session and I
would like to attend both the Emdebian and i18n sessions.

> Thanks for pushing this back to /me, Steve....and, thanks a lot to
> Neil for spending a lot of time developing your thoughts about the
> problem.

No problem - as explained above, this simply had to be done if Emdebian
was ever to be considered as a viable option for embedded systems and
the idea itself borrows heavily from existing technology used by
OpenEmbedded and various other embedded groups. (i.e. Don't assign me
the credit for most of this, I just had to get it working for Emdebian.)

-- 
Neil Williams <codehelp at debian.org>