[Soc-coordination] Penultimate report: Semantic Package Review Interface for mentors.debian.net

Clément Schreiner clement at mux.me
Mon Jul 30 23:53:47 UTC 2012

This is the fourth bi-weekly report on my Summer of Code project
'Semantic Package Review Interface for mentors.debian.net'.
My project aims to extract metadata from packages submitted to
mentors.d.n[1], and use this data to match a package with a potential
sponsor. Since a lot of packages get stuck in the mentoring process
because their maintainers have difficulty finding a sponsor, this
should ease their entry into the Debian process.

1 New plugin API for debexpo's importer 

In my last report, I mentioned an attempt at refactoring the importer;
unfortunately I had to give up because my changes were not a clear
improvement and this was taking too much time compared to its
usefulness.

Thus, I only applied changes as small as possible to the importer, and
managed to integrate the new plugin API. Moreover, I have improved the
model for accessing the stored plugin results. Using sqlalchemy's
"association proxy"[2], I managed to represent the data of a plugin
result as a dictionary-like object. I also switched to the declarative
API. Both of these changes made the plugins look a lot nicer than with
the former API: [3]

I can't resist giving a little example for the access to the
plugin results, since it looks so nice:

  In [4]: q = meta.session.query(PluginResult)
  In [5]: q = q.filter_by(package_version_id = 15)
  In [6]: for result in q.all():
     ...:     print result, \
     ...:         '(Severity: %d)' % result.severity, \
     ...:         '--- Plugin: %s' % result.plugin, \
     ...:         '--- Raw data: ', result._data
     ...:
  Package is not native (Severity: 1) --- Plugin: native --- Raw data:  {u'severity': u'1', u'native': u'false'}
  Buildsystem: Package uses debhelper (Severity: 0) --- Plugin: buildsystem --- Raw data:  {u'buildsystem': u'debhelper'}
  In [8]: q = q.filter_by(entity='native_test')
  In [10]: r = q.one()
  In [12]: r['native']
  Out[12]: u'false'
  In [14]: r.is_native          # property added to the model
  Out[14]: False
  In [15]: r['native'] = 'true' # this will also update the database object
  In [16]: r.is_native
  Out[16]: True
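
The association proxy itself needs SQLAlchemy, but the behaviour shown
in this session can be sketched with the standard library alone. The
following is a minimal Python 3 stand-in, not debexpo's actual model:
the table name plugin_result_data, its columns and the helper
make_demo_result are assumptions made up for the illustration. It only
shows how key/value rows can be exposed as a dictionary-like object
whose writes go back to the database.

```python
import sqlite3
from collections.abc import MutableMapping


class PluginResult(MutableMapping):
    """Dict-like view over the key/value rows of one plugin result."""

    def __init__(self, conn, result_id):
        self.conn = conn
        self.result_id = result_id

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM plugin_result_data"
            " WHERE result_id = ? AND key = ?",
            (self.result_id, key),
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # The (result_id, key) primary key makes this an insert-or-update.
        self.conn.execute(
            "INSERT OR REPLACE INTO plugin_result_data"
            " (result_id, key, value) VALUES (?, ?, ?)",
            (self.result_id, key, value),
        )

    def __delitem__(self, key):
        cursor = self.conn.execute(
            "DELETE FROM plugin_result_data"
            " WHERE result_id = ? AND key = ?",
            (self.result_id, key),
        )
        if cursor.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        rows = self.conn.execute(
            "SELECT key FROM plugin_result_data"
            " WHERE result_id = ? ORDER BY key",
            (self.result_id,),
        ).fetchall()
        return iter([row[0] for row in rows])

    def __len__(self):
        return self.conn.execute(
            "SELECT COUNT(*) FROM plugin_result_data WHERE result_id = ?",
            (self.result_id,),
        ).fetchone()[0]

    @property
    def is_native(self):
        # Convenience property on top of the raw string value.
        return self["native"] == "true"


def make_demo_result():
    """Build an in-memory database holding one hypothetical result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE plugin_result_data ("
        " result_id INTEGER, key TEXT, value TEXT,"
        " PRIMARY KEY (result_id, key))"
    )
    result = PluginResult(conn, 1)
    result["native"] = "false"
    result["severity"] = "1"
    return result
```

With this sketch, assigning result['native'] = 'true' rewrites the
stored row, and result.is_native reflects the change on the next read.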

2 Debtags and similar packages 

Using this new API, I wrote a new plugin for applying debtags
heuristics to new packages[4]. To this end, I extracted the relevant
module from debtagsd [5] (the application behind debtags.d.n) into a
temporary github repository[6] (added in debexpo/lib as a
submodule). I'll have to talk about these heuristics with Enrico Zini,
since he suggested we release those heuristics separately from
debtagsd. For packages already in Debian, the plugin can also retrieve
tags from the debtags database.

Then, I wrote a small wrapper around python-xapian for querying
apt-xapian-index [7]. It might be useful for other people, and maybe
it could also be extended and released separately. I'll look into
that once the Summer of Code is over.

Using this wrapper, I wrote a new plugin for finding packages similar
to the one being uploaded to debexpo [8]. With a proper tokenization
(compared to my first attempt in June) of the package's description,
the results were surprisingly good, even before I added the set of
tags to the xapian query. The wrapper in debexpo/lib/axi.py is able to
add tags to a query, but I haven't updated the plugin to use this
yet. Indeed, I need to improve the plugin API first, so that a plugin
can easily access another one's results: the plugin manager will have
to know about plugin inter-dependency.
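
To make the idea of description-based similarity concrete without
depending on Xapian, here is a rough Python 3 sketch. It is not the
code in debexpo/lib/axi.py or in the plugin: the stop word list, the
Jaccard overlap score and the function names are all assumptions chosen
for the illustration, and a real Xapian query ranks results quite
differently.

```python
import re

# Tiny, hypothetical stop word list; a real one would be much longer.
STOPWORDS = frozenset(
    {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}
)


def tokenize(description):
    """Split a package description into a set of lowercase terms."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {word for word in words if word not in STOPWORDS}


def similar_packages(description, index, limit=3):
    """Rank packages in `index` (name -> description) by term overlap."""
    query = tokenize(description)
    scored = []
    for name, other_description in index.items():
        terms = tokenize(other_description)
        union = query | terms
        # Jaccard similarity: shared terms divided by all distinct terms.
        score = len(query & terms) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, name))
    # Best score first, package name as a stable tie-breaker.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

Calling similar_packages() with the new package's description and a
dictionary of known descriptions returns the closest package names,
best match first.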

3 Next 

After I achieved that, I put the semantic metadata stuff aside and got
back to the plugin system. Indeed, even though it works relatively
well for my new plugins, I have not updated the existing plugins to
the new API, and so I have started that, in order to merge the git
branch 'plugin-api' soon. I have stumbled upon some design issues with
the plugins: it is not trivial to migrate the plugins' existing mako
templates; I have started to make some changes to the template
rendering code, and I believe I have a good idea of how to do it quickly.

Some of the remaining tasks:


 - automatically suggest more tags, using for example xapian's
   'relevance sets' [9], or the apriori[10] tag recommendation from
   debtagsd

 - write new heuristics to complete the existing ones: for example,
   add other languages than perl for implemented-in tags

 - to get the best results while writing as few heuristics as
   possible, I'll try to use apriori results from debtags; they give
   stats like this: 93% of packages with tags (uitoolkit::qt,
   interface::x11) also have tag x11::application

 - new controller and UI allowing the user to check/add/remove the
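
A statistic like the 93% figure above is the confidence of an
association rule. As a small Python 3 sketch of how such a number is
computed (the function name and the data layout are assumptions, and
this is not how debtagsd stores or derives its apriori results):

```python
def rule_confidence(packages, antecedent, consequent):
    """Share of packages carrying all `antecedent` tags that also carry
    the `consequent` tag.

    `packages` maps a package name to its set of tags. Returns None when
    no package carries all antecedent tags.
    """
    required = set(antecedent)
    matching = [tags for tags in packages.values() if required <= set(tags)]
    if not matching:
        return None
    hits = sum(1 for tags in matching if consequent in tags)
    return hits / len(matching)
```

A rule with high confidence can then be used to suggest the consequent
tag for any new package that already has the antecedent tags.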

Plugin system:

 - update the templates to display the plugin results

 - allow plugins to depend on others

 - allow plugins to be run any time, after certain actions from the
   user (example: run the similar packages plugin after the user has
   edited tags)
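
For the dependency item, one possible approach is to let the plugin
manager sort plugins topologically before running them. The sketch
below uses Python 3.9's graphlib module; the run_order function and the
dictionary layout are assumptions for the illustration, not part of the
current plugin API.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return plugin names ordered so that dependencies run first.

    `dependencies` maps a plugin name to the names of the plugins whose
    results it needs. graphlib raises CycleError on circular
    dependencies.
    """
    sorter = TopologicalSorter()
    for plugin, needed in dependencies.items():
        sorter.add(plugin, *needed)
    return list(sorter.static_order())
```

For example, declaring that 'similar' needs 'debtags' guarantees that
the debtags plugin appears before the similar packages plugin in the
returned list.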

Similar packages plugin:

 - add results from the debtags plugin to the xapian query

 - nice template for displaying those in the package info page

Sponsor recommendation (the ultimate goal of this GSoC):

 - new plugin, using similar packages to recommend sponsors to the
   uploader: their maintainers with an account on mentors.d.n are
   kept as potential sponsors for the package

 - new UI: allow sponsors to select some tags as interesting

 - use this data to improve the sponsor recommendation. Example: a
   sponsor designates the debtag 'implemented-in::python' as
   interesting. Then, when a python-based package is uploaded, if the
   sponsor maintains a similar package, the sponsor's ranking is
   increased.

 - new UI: help the new contributor to contact the appropriate
   sponsor, using the potential sponsors from the plugin
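
The recommendation strategy in this list can be sketched as follows in
Python 3. Everything here is hypothetical: the function name, the
arguments and above all the scoring (one point per similar package
maintained, plus a fixed bonus when the uploaded package has a tag the
sponsor marked as interesting) are assumptions, since the plugin has
not been written yet.

```python
from collections import Counter


def recommend_sponsors(similar, maintainers, accounts,
                       package_tags=(), interests=None, bonus=1):
    """Rank potential sponsors for an uploaded package.

    similar      -- names of packages similar to the uploaded one
    maintainers  -- package name -> list of maintainer identifiers
    accounts     -- identifiers of people with a mentors.d.n account
    package_tags -- debtags of the uploaded package
    interests    -- sponsor identifier -> tags marked as interesting
    """
    interests = interests or {}
    scores = Counter()
    for package in similar:
        for person in maintainers.get(package, ()):
            if person in accounts:
                scores[person] += 1
    for person in list(scores):
        # Boost sponsors who declared interest in one of the package's tags.
        if set(package_tags) & set(interests.get(person, ())):
            scores[person] += bonus
    # Highest score first, identifier as a stable tie-breaker.
    return sorted(scores, key=lambda person: (-scores[person], person))
```

Only maintainers of similar packages are ever considered, so the
interest bonus reorders candidates but never adds new ones.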

4 Conclusion 

Compared to my initial schedule, I am clearly very late: the last
month was supposed to be about UI development, bug squashing and
documentation writing.

Even though this schedule was not realistic and I had to change my
plan after the initial two weeks, I still think I should have made more
progress by now: I thought I would have implemented the quite simple
strategy 'find similar packages with xapian, keep their maintainers as
possible sponsors' in a few weeks maximum and could then have spent a
lot of time on semantic metadata extraction and improving the sponsor
recommendation strategy.

I have a few ideas for why I did not manage to do it as quickly as I
wanted. First, hacking into debexpo's existing codebase was not always
easy. I spent a fair amount of time trying to understand the web
application's code, often having to read pylons and sqlalchemy
documentation extensively.

Adding features without breaking anything was another challenge, and
some bugs took hours or days to figure out, especially when they were
due to existing bugs in debexpo.

I also wasted time because of bad decisions: I spent too much effort
on things that were not very useful, or at least had a low priority,
like trying to refactor the package importer, which I had to give up
after several days of work because it didn't get anywhere and was
distracting me from the actual project. For the plugin API, even
though the model I
eventually designed is very nice, I should have kept the first working
version to start the debtags work earlier. Too often have I been
distracted from the eventual goal, because I had ideas for improving
my code, and then read more documentation (on pylons or sqlalchemy)
than needed.

I don't think everything is negative: the plugin system really needed
an overhaul, for this project to be implemented correctly but also
because it will ease further development on debexpo; my changes seem
to be real improvements over the former implementation. Besides, I
feel confident I can complete the tasks described above; indeed, I
should be more productive now that I have finished shaving those yaks
/ laying foundations (mostly consisting of that plugin system). This
will fulfill the 'deliverables' requirements in the initial project,
although their implementation will probably not be as good as I
envisioned in April.

I'm aware all my previous estimations turned out overly optimistic,
but I hope I can do all that in under two weeks and finish before the
soft deadline, and thus be able to do more than those tasks.

In any case, some of my plans for debexpo and debtags will have to
wait until after the Summer of Code, so I will stick around to finish
everything properly.


[1] [http://mentors.debian.net/]

[2] [http://docs.sqlalchemy.org/en/rel_0_7/orm/extensions/associationproxy.html]

[3] [http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=blob;f=debexpo/plugins/native.py;hb=refs/heads/plugin-api]

[4] [http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=blob;f=debexpo/plugins/debtags.py;hb=semantic-review]

[5] [http://anonscm.debian.org/gitweb/?p=debtags/debtagsd.git;a=summary]

[6] [https://github.com/clemux/debtags-heuristics]

[7] [http://www.enricozini.org/sw/apt-xapian-index/]

[8] [http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=blob;f=debexpo/plugins/similar.py;hb=semantic-review]

[9] [http://www.enricozini.org/2007/debtags/axi-query-expand/]

[10] [http://en.wikipedia.org/wiki/Apriori_algorithm]
