[Soc-coordination] Report 1 - PyPI to Debian repository converter

Natalia Frydrych natalia.frydrych at gmail.com
Sun Jun 3 21:37:08 UTC 2012



Hello,


This is my first progress report on the project PyPI to Debian
Repository Converter[0], mentored by Piotr Ożarowski.


----------------
Work: I’ve worked mainly on issues related to the PyPI repository[1]
and its XML-RPC interface[2]. My goal was to download the sources of
available Python 3 packages.

In the course of this work I’ve dealt with the following tasks:

----------------
1) Selection of packages intended for Python 3 (as agreed with my
mentor, I will work on packages for Python 3 first - once ready, I’ll
try to add support for Python 2 packages as well.)

After reading the Python Packaging chapter of the book “The Architecture
of Open Source Applications”[3], I used the browse method from PyPI's
XML-RPC interface, which makes it possible to search for packages
matching given classifiers[4]. Unfortunately, it is not possible to
determine the minimum and/or maximum required version of Python. You can
list specific versions or use the "Programming Language :: Python :: 3"
classifier, but “Python :: 3.2” does not imply “Python :: 3”. For this
reason I have to call this method once for each specific version, but in
the end I’m able to get a list of unique packages, each with a list of
its releases available for Python 3.
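
The per-classifier querying described above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
project's code: collect_releases and PY3_CLASSIFIERS are hypothetical
names, the set of minor versions is a guess, and browse is assumed to
accept a list of classifiers and return (name, version) pairs.

```python
from collections import defaultdict

# Classifiers queried one by one; this particular set is an assumption.
PY3_CLASSIFIERS = [
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.0",
    "Programming Language :: Python :: 3.1",
    "Programming Language :: Python :: 3.2",
]


def collect_releases(browse, classifiers=PY3_CLASSIFIERS):
    """Call browse once per classifier and merge the results.

    browse is assumed to take a list of classifiers and to return
    (name, version) pairs, as an XML-RPC proxy method would.
    """
    releases = defaultdict(set)
    for classifier in classifiers:
        for name, version in browse([classifier]):
            # A set removes releases that match several classifiers.
            releases[name].add(version)
    # Sorted lists keep the result deterministic.
    return {name: sorted(versions) for name, versions in releases.items()}
```

With a live server, browse would presumably be the corresponding method
of an xmlrpc.client.ServerProxy object. Passing it in as an argument
keeps the merging logic testable without network access.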

From my point of view it would be helpful if the browse function
provided the ability to select packages using wildcards in these
criteria, or to look for packages not meeting given conditions. I have
added this to my TODO list and, if time permits, I will prepare patches
for PyPI’s rpc.py.


I’ve decided to reject packages described in their classifiers as
'Development Status :: 1 - Planning', simply because they usually don’t
have source files yet. A Debian package for a project in the planning
phase is also not the best idea.

I’ve familiarized myself with PEP 386[5], but while sorting a list of
versions (harvested from real releases) using the distutils library[6]
(in order to select the latest available version), I came across a
problem which, at my mentor's suggestion, I reported to Python’s bug
tracker[7]. It was the first time I’ve reported a bug there and I
enjoyed an immediate response. Moreover, my mentor suggested that I look
into the library sources and propose an appropriate patch, which I did :-)
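
As an illustration of selecting the latest version from loosely
formatted strings, here is a minimal sketch. It is neither the distutils
code nor the patch mentioned above: loose_key and latest are
hypothetical names, and the key is built so that a number is never
compared directly with a word. Pre-release ordering as defined in PEP
386 is deliberately ignored, so "1.0a1" sorts after "1.0" here.

```python
import re

# A version component is either a run of digits or a run of letters.
_COMPONENT = re.compile(r"\d+|[a-zA-Z]+")


def loose_key(version):
    """Build a sort key in which numbers and words never compare directly."""
    key = []
    for part in _COMPONENT.findall(version):
        if part.isdigit():
            # Numeric components sort after words and compare as integers.
            key.append((1, int(part), ""))
        else:
            # Words compare case-insensitively among themselves.
            key.append((0, 0, part.lower()))
    return key


def latest(versions):
    """Return the greatest version string according to loose_key."""
    return max(versions, key=loose_key)
```

The leading 0 or 1 in each tuple decides the comparison whenever a word
meets a number, so the integer and string fields are only ever compared
with values of their own type.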

2) Download the relevant source files. In order to obtain links to
sources I’ve decided to use the release_urls method, which returns a
list of download URLs for the given package release. Unfortunately, this
method doesn't accept a list as input, so calling it successively for
each package is relatively slow. As long as it keeps this shape, further
optimization is difficult, so I am considering modifying this method and
sending patches as well.


From the list of files returned by release_urls I’ve chosen those which
have python_version set to source. In Python 3 it is possible to provide
archives in different formats, so I set the download priority to tar.xz,
then tar.bz2, tar.gz and zip. Python programmers have many unusual ideas
about how to name their files, so it took me some time to make sure I'm
downloading the appropriate files. Eventually I've (hopefully) reached
the state where only the right archive is downloaded. My algorithm
doesn’t skip other sources (e.g. additional plugins like the ones in
Pythomnic3k[a]) included in releases, but I had to add special cases for
packages such as waferslim[b] or tuxmodule[c] (i.e. check the
comment_text field).
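
The selection by format priority can be sketched as below. This is a
simplified, hypothetical version (pick_source and PRIORITY are made-up
names): it assumes each entry returned by release_urls is a dict with at
least the keys python_version and filename, returns a single preferred
archive, and leaves out the special cases mentioned above.

```python
# Preferred archive formats, best first (the order given in the text).
PRIORITY = (".tar.xz", ".tar.bz2", ".tar.gz", ".zip")


def pick_source(files):
    """Return the preferred source entry from a release_urls-style list.

    Each entry is assumed to be a dict with at least the keys
    'python_version' and 'filename'. Returns None if nothing matches.
    """
    best, best_rank = None, len(PRIORITY)
    for entry in files:
        # Only source distributions are of interest.
        if entry.get("python_version") != "source":
            continue
        name = entry.get("filename", "").lower()
        for rank, suffix in enumerate(PRIORITY):
            if name.endswith(suffix) and rank < best_rank:
                best, best_rank = entry, rank
                break
    return best
```

Entries with an unrecognized extension are ignored, and when two files
share the same format the first one in the list wins.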


Statistics for downloaded packages at this moment are as follows:

packages for Python 3:
~~~~~~~~~~~~~~~~~
unique packages: 1016
packages without source: 138


packages for Python 2:

~~~~~~~~~~~~~~~~~
unique packages: 2930
packages without source: 457

NOTE: I’m aware that there are about 15k packages that match
"Programming Language :: Python", but most of them don’t have any
further version classifiers, so I’ll assume that they support Python 2 only.


NOTE: The packages described as “packages without source” are those for
which the release_urls method doesn’t return links to the source. In the
classifiers dictionary (obtained by the release_data method) there’s a
download_url field available, but this link often leads to third-party
websites like sourceforge.net[8], which do not point to the archive
directly.

3) Update to the newest version of packages. To check whether there are
new versions of packages in PyPI or new packages have been added, the
list of unique packages which meet my criteria is generated again and
checked against the already downloaded files. It seemed unnecessary to
use release_urls to check the exact file name again at this point - as I
wrote earlier, calling it takes a lot of time. I realize that this is
not optimal and will try to change it a bit soon. Developers usually
stick to the package_name-version convention, but there are also
exceptions such as Python Bytecode Verifier[d] or tmdb[e].
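
A check of this kind, relying only on the package_name-version file
naming convention, might look like the following sketch. The names
missing_releases and ARCHIVE_SUFFIXES are hypothetical, and the input
shapes (a dict of newest versions and a list of file names on disk) are
assumptions made for illustration.

```python
ARCHIVE_SUFFIXES = (".tar.xz", ".tar.bz2", ".tar.gz", ".zip")


def missing_releases(newest, downloaded):
    """Return (name, version) pairs with no matching downloaded archive.

    newest maps a package name to its newest version; downloaded is an
    iterable of file names already on disk.
    """
    stems = set()
    for filename in downloaded:
        lower = filename.lower()
        for suffix in ARCHIVE_SUFFIXES:
            if lower.endswith(suffix):
                # Keep the file name without its archive extension.
                stems.add(lower[: -len(suffix)])
                break
    return sorted(
        (name, version)
        for name, version in newest.items()
        if "{}-{}".format(name, version).lower() not in stems
    )
```

A package whose archive does not follow the convention is always
reported as missing by this sketch, which is exactly the weakness the
exceptions above expose.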


With the help of my mentor, I located the PyPI sources[9]. There I found
the updated_releases method, which is not mentioned in the documentation
but seems to be useful - I compare my results with it.


-----------------
Summary: My tool is able to find and download the newest versions of
Python 3 packages available on PyPI. It was a fairly tedious part of the
job and I’m glad to have it behind me. Right now my code works as
expected; I'll check how it behaves after another round of PyPI updates
and make modifications if needed.


-----------------

Plans: In the next few days the most important task is to design a
detailed API for the plugin system, which will convert the packages into
a Debian repository. I have to think about how to integrate stdeb[10]
and pkgme[11] (the first two plugins) and how to add Python 3 support to
both of them. One of the biggest challenges will be to determine the
build dependencies.


I think that during the last 2 weeks my knowledge about PyPI has
increased dramatically, and I can't wait until my knowledge about Debian
packages also becomes a bit fuller :-)


My repository can be followed at: https://gitorious.org/pypi2deb
----
Natalia Frydrych


----------------
[0]
http://wiki.debian.org/SummerOfCode2012/StudentApplications/NataliaFrydrych
[1] http://pypi.python.org/
[2] http://wiki.python.org/moin/PyPiXmlRpc
[3] http://www.aosabook.org/en/packaging.html
[4] http://pypi.python.org/pypi?%3Aaction=list_classifiers
[5] http://www.python.org/dev/peps/pep-0386/
[6] http://docs.python.org/dev/distutils/introduction
[7] http://bugs.python.org/issue14894
[8] http://sourceforge.net
[9] https://bitbucket.org/loewis/pypi/src/3d39a7bcfc26/rpc.py
[10] https://github.com/astraw/stdeb
[11] https://launchpad.net/pkgme


----------------

[a] http://pypi.python.org/pypi/Pythomnic3k/1.2

[b] http://pypi.python.org/pypi/waferslim/1.0.2

[c] http://pypi.python.org/pypi/tuxmodule/1.0-
http://paste.debian.net/172552/

[d] http://pypi.python.org/pypi/Python%20Bytecode%20Verifier/0.1

[e] http://pypi.python.org/pypi/tmdb/0.9
