[Soc-coordination] Second report: Semantic Package Review Interface for mentors.debian.net
Clément Schreiner
clement at mux.me
Mon Jun 18 21:23:46 UTC 2012
Hi,
this is the second bi-weekly report on my Summer of Code project
'Semantic Package Review Interface for mentors.debian.net'.
My project aims to extract metadata from packages submitted to
mentors.d.n[1], and use this data to match a package with a potential
sponsor. Since many packages get stuck in the mentoring process
because their maintainers have difficulty finding a sponsor, this
should make it easier for them to enter Debian.
First of all, because my last report did not explicitly link to my
git branch, my current work can be found in the 'semantic-review'
branch in debexpo's git repository:
    git clone git://anonscm.debian.org/debexpo/debexpo.git
    cd debexpo
    git checkout semantic-review
Web view: [http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=summary].
My plan for the last two weeks was to implement these tasks:
- apply debtags' heuristics to a package
- tokenize a package's description and build a Xapian query with
resulting tokens and above tags
- make the above work with packages uploaded to mentors.d.n
- ask the maintainer to check/complete the tags assigned to the
package
- present the result of the query in debexpo's web UI
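As an illustration of the second task, here is a minimal Python sketch of
turning a description and a set of tags into query terms. It is not the code
from the semantic-review branch: the function name, the stopword list and
the "XT" tag prefix are assumptions made for this example, and a real
version would hand the terms to Xapian instead of returning a list.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}

# Assumed prefix used to tell tag terms apart from description words.
TAG_PREFIX = "XT"


def build_query_terms(description, tags):
    """Return de-duplicated description tokens followed by prefixed tags."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    terms = []
    seen = set()
    for token in tokens:
        # Skip one-letter tokens, stopwords and repeats.
        if len(token) < 2 or token in STOPWORDS or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    # Tags are sorted so the output order does not depend on set ordering.
    for tag in sorted(tags):
        terms.append(TAG_PREFIX + tag)
    return terms


if __name__ == "__main__":
    # Made-up description and tags, for demonstration only.
    print(build_query_terms(
        "A command-line tool to browse web sites",
        {"role::program", "interface::commandline"},
    ))
```

Keeping the tags under their own prefix means a tag can never be confused
with a word that happens to appear in a description.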
Instead of doing that outside debexpo, I directly integrated my
changes into the current codebase. My progress was thus slower than I
expected, because I ran into some difficulties:
1. debexpo doesn't import all the information I needed into the
database (binary packages' descriptions, for example)
2. the packages' sources are not easily accessible
To fix (2), I've had discussions with my mentors Nicolas and Arno and
my fellow GSoC student Baptiste about a new storage backend for source
packages. I will soon help Baptiste store packages into a git
repository with extracted sources.
While working on (1), I noticed that binary packages
were not properly imported; only one binary package would be entered
into the database even when a source package built two or more of
them.
After I fixed that, I realized that an index of binary packages is
generated when debexpo's packages repository is updated. This contains
more data than the database. Also, this index is in the same RFC 822
format as Debian repositories. Since debtagsd's autotag[1] module
expects this very same format when working with packages, this will
ease my work this week.
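For readers unfamiliar with that format, here is a small Python sketch of
reading such an index: paragraphs separated by blank lines, "Field: value"
lines, and continuation lines that start with whitespace. The function name
and the sample data are made up for illustration, and this is not how
debexpo or debtagsd actually read the file.

```python
def parse_stanzas(text):
    """Parse blank-line-separated 'Field: value' paragraphs into dicts."""
    stanzas = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current paragraph.
            if current:
                stanzas.append(current)
            current = {}
            last_field = None
        elif line[0] in " \t":
            # Continuation line: it belongs to the previous field.
            if last_field is not None:
                current[last_field] += "\n" + line.strip()
        else:
            field, _, value = line.partition(":")
            last_field = field.strip()
            current[last_field] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


# Made-up index with two binary packages.
SAMPLE = """\
Package: foo
Version: 1.0-1
Description: example tool
 Longer description of the
 example tool.

Package: libfoo1
Version: 1.0-1
Description: example library
"""

if __name__ == "__main__":
    for stanza in parse_stanzas(SAMPLE):
        print(stanza["Package"], "-", stanza["Description"].splitlines()[0])
```

In practice the deb822 module from python-debian already handles this
format, so a hand-written parser like this one is only useful to show the
structure.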
However, a weird bug prevents some binary packages
from being indexed. I have been trying - unsuccessfully for now - to solve
that.
Besides those problems, I have also spent a lot of time reading and
understanding debexpo's codebase.
To sum up, I haven't been able to use debtags heuristics yet, but I
have a little similar-package-finder that uses xapian and
descriptions: [http://debexpo.mux.me/package/weboob].
This doesn't work very well yet. Hopefully the results will get
better once I add debtags to the query (this week); otherwise I'll
have to look for a better strategy.
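To give an idea of what "similar by description" means, here is a toy
Python sketch that ranks packages by the share of words their descriptions
have in common (Jaccard overlap). This is a plain stand-in chosen for
illustration, not Xapian's ranking and not the code running on the page
above; the function names and sample descriptions are made up.

```python
import re


def words(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_similar(target, candidates):
    """Rank candidate names by Jaccard overlap with the target description."""
    target_words = words(target)
    scored = []
    for name, description in candidates.items():
        candidate_words = words(description)
        union = target_words | candidate_words
        score = len(target_words & candidate_words) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties are broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Candidates with no word in common are dropped.
    return [name for score, name in scored if score > 0]


if __name__ == "__main__":
    print(rank_similar(
        "web scraping toolkit",
        {
            "pkg-a": "toolkit for web pages",
            "pkg-b": "audio player",
            "pkg-c": "web browser",
        },
    ))
```

A measure this crude treats every word as equally important, which is one
reason description-only matching tends to give noisy results.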
Currently, this search is made on-the-fly when someone visits the
package's page. In the next two weeks I will move the code to
debexpo-importer (which imports a package's data into the database
when it gets uploaded); I might have to improve the importer
and its plugin system first.
[1] [http://anonscm.debian.org/gitweb/?p=debtags/debtagsd.git;a=tree;f=debdata;h=59ee89569d6cc622d1c4a0bd02e84ebb1c66409a;hb=HEAD]