[Soc-coordination] First report: Debsources as a Platform

Clément Schreiner clement at mux.me
Sun May 31 20:10:25 UTC 2015


Hi,

this is the first weekly report on my Summer of Code project 'Debsources
as a Platform'[1][2].

This project aims to

 1. make debsources more modular
 2. rewrite the backend (debsources-updater) using celery[3]
 3. write a new application using debsources: copyright.d.n
 4. write a new application using debsources: patch-tracker.d.n

I will work with another GSoC student: Orestis Ioannou[4][5], under the
supervision of Stefano and Matthieu.

For the first part of my GSoC, I have been tasked with rewriting the
backend (debsources-updater), which is currently entirely synchronous
and needs to be refactored and made asynchronous. The tasks initially
planned for the first week were:

 1. research the queue managers available to celery (rabbitmq, redis,
 sql, …)
 2. make a simple ('hello world') prototype for the updater
 3. research whether celery can be made transactional and guaranteed to
 never lose tasks

During the week I realized those tasks had to be split up this way:

 1. set up celery
 2. make a simple prototype
 3. research whether and how tasks can be persistent
 4. research whether and how we can make a transactional task queue with
 celery

I quickly realized tasks 3. and 4. would answer the question of which
queue manager we should use.



1 - setting up celery
---------------------

I thought this would be done really fast, because I'd used celery before
(in a django app for university classwork), but it took me half a day to
figure out why the example from the official documentation didn't work[6].

2 - making a simple prototype
-----------------------------

Then I wanted to run simple tasks equivalent to the current
debsources-updater's "stages" (extract new packages, update suites,
etc.). I haven't gone very far, because I haven't been able to pickle
instances of SourcePackage, the class that represents source packages
in debsources' mirror[7] (pickle is the default serializer for
transmitting data to celery tasks).

I've spent some time trying to figure out what exactly in SourcePackage
caused the problem, using dill[8]. Though I strongly suspect the culprit
is in deb822.Sources (from python-debian, inherited by SourcePackage),
I haven't been able to narrow it down.
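
One generic way to narrow down this kind of problem is to try pickling
each instance attribute on its own. The sketch below only illustrates
that idea: FakePackage and unpicklable_attributes are hypothetical
names, not debsources code, and a threading lock stands in for whatever
SourcePackage actually holds that pickle rejects.

```python
import pickle
import threading


def unpicklable_attributes(obj):
    """Map each instance attribute that pickle rejects to the error raised."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle raises several exception types
            failures[name] = repr(exc)
    return failures


class FakePackage(object):
    """Hypothetical stand-in for SourcePackage, used only for this demo."""

    def __init__(self):
        self.name = "hello"
        self.version = "2.9-2"
        self.handle = threading.Lock()  # locks cannot be pickled


bad = unpicklable_attributes(FakePackage())
print(sorted(bad))
```

This only inspects what vars() returns, so it would miss state stored
elsewhere (for example in the dictionary part of a dict-like class).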

The possible solutions to this problem are:

 - only pass package name/versions to celery tasks, and let them
   retrieve the package from the mirror

 - not serialize deb822 data that is not essential for the tasks

 - build a smaller object from SourcePackage that only contains
   essential data
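
The first option could look like the sketch below. It assumes that
SourcePackage, like the deb822.Sources entries it inherits from, exposes
the 'Package' and 'Version' fields through dictionary access;
to_task_args is a hypothetical helper, and a plain dict stands in for a
real SourcePackage.

```python
import pickle


def to_task_args(pkg):
    """Reduce a package to the (name, version) pair a task needs."""
    return (pkg["Package"], pkg["Version"])


# A plain dict stands in for a SourcePackage (assumption for this sketch).
stand_in = {"Package": "hello", "Version": "2.9-2", "Binary": "hello"}

args = to_task_args(stand_in)

# A tuple of plain strings goes through a pickle round trip without trouble.
restored = pickle.loads(pickle.dumps(args))
```

The task on the receiving side would then have to look the package up
in the mirror again, which costs some time but keeps messages small.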

3, 4 - research: is celery guaranteed not to lose tasks
-------------------------------------------------------

Meanwhile, I've read a lot of documentation about celery, and concluded
that we should[10]:

 - use the rabbitmq message broker, which by default saves messages on
   disk (so a message broker crash won't make us lose queued tasks)

 - make all of our tasks idempotent (running a task twice won't
   corrupt our data), and use celery's 'late acks' option[9]: a task's
   message is acknowledged only after the task has finished executing,
   not when a worker receives it. This way, a celery worker crashing
   won't make the task disappear either.
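
As a rough illustration of these two points, here is a sketch of a
settings fragment plus an idempotent task body. The setting names are
the uppercase ones from the configuration page linked in [9] (later
celery releases renamed them to lowercase, e.g. task_acks_late); the
broker URL assumes a local rabbitmq with default credentials, and
add_package and its in-memory store are hypothetical, not the real
updater.

```python
# Settings fragment (would live in the celery configuration module).
BROKER_URL = "amqp://guest:guest@localhost:5672//"  # assumed local rabbitmq
CELERY_ACKS_LATE = True  # acknowledge a message only after the task has run


def add_package(store, name, version):
    """Idempotent task body: recording the same version twice changes nothing."""
    store.setdefault(name, set()).add(version)
    return store


store = {}
add_package(store, "hello", "2.9-2")
# Simulate the message being redelivered after a worker crash.
add_package(store, "hello", "2.9-2")
print(store)
```

With late acks a task may run more than once, which is why the task
body has to tolerate being replayed.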

I've also looked into how we could implement transactional task queues
with celery, but we probably won't need it for this project.


Next week
---------

Next week, I will work on the following tasks:

 1. figure out how to send SourcePackage objects to celery tasks (or
 decide not to send them)

 2. finish the prototype, with a complete chain of tasks that won't do
 anything but can be expanded later

 3. design the protocol of the updater, depending on what I learn while
 experimenting with the prototype

 4. if needed, update the prototype according to the new specifications

If I still have time:

 5. implement the 'extract new packages' stage in the prototype


The trello board we use is now public, so Orestis' and my work can be
followed daily here: https://trello.com/b/LG8eUfPS/debsources

Thanks for reading,


Clément



[1] [http://sources.debian.net/]

[2]
[https://wiki.debian.org/SummerOfCode2015/StudentApplications/ClementSchreiner]

[3] [http://celery.readthedocs.org/en/latest/]

[4]
[https://wiki.debian.org/SummerOfCode2015/StudentApplications/OrestisIoannou]

[5]
[http://lists.alioth.debian.org/pipermail/soc-coordination/2015-May/002459.html]

[6]
[http://stackoverflow.com/questions/30484835/absolute-import-not-working-correctly-in-celery-project]

[7] [http://paste.debian.net/186870/]

[8] [https://pypi.python.org/pypi/dill]

[9]
[http://celery.readthedocs.org/en/latest/configuration.html#celery-acks-late]

[10] [https://github.com/clemux/debsources/blob/async/doc/celery.txt]


