[Debian-med-packaging] Bug#833388: ITP: metaphlan2 -- Metagenomic Phylogenetic Analysis

Wed Aug 3 19:00:05 UTC 2016

Package: wnpp
Severity: wishlist
Owner: Andreas Tille <tille at debian.org>

* Package name    : metaphlan2
  Version         : 2.5.0
  Upstream Author : Nicola Segata <nicola.segata at unitn.it>
* URL             : https://bitbucket.org/nsegata/metaphlan2/wiki/Home
* License         : MIT
  Programming Lang: Python
  Description     : Metagenomic Phylogenetic Analysis
 MetaPhlAn is a computational tool for profiling the composition of
 microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from
 metagenomic shotgun sequencing data with species level resolution. From
 version 2.0, MetaPhlAn is also able to identify specific strains (in the
 not-so-frequent cases in which the sample contains a previously
 sequenced strain) and to track strains across samples for all species.
 .
 MetaPhlAn 2.0 relies on ~1M unique clade-specific marker genes (the
 marker information file can be found at src/utils/markers_info.txt.bz2)
 identified from ~17,000 reference genomes (~13,500 bacterial
 and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
 .
  * unambiguous taxonomic assignments;
  * accurate estimation of organismal relative abundance;
  * species-level resolution for bacteria, archaea, eukaryotes and
    viruses;
  * strain identification and tracking;
  * orders of magnitude speedups compared to existing methods;
  * metagenomic strain-level population genomics.

Remark: The package is a target for Debian Med in itself and will be
used by metaBIT.  It will be maintained by the Debian Med team and the
packaging is currently available at
   svn://anonscm.debian.org/debian-med/trunk/packages/metaphlan2/trunk/

******* I'd like to discuss the following issue on debian-devel list *******

While Debian Med is injecting several packages with low
popularity-contest scores, this one has an extraordinarily large set
of data, and thus I want to discuss the following options:

  1) The original orig.tar.gz is 1GB and contains 1.2GB of
     uncompressed binary data.  License-wise this should not be a
     problem, since there is a documented recipe for converting these
     files to text form and back[1].

     We would have: source package 1GB + binary package 1GB

  2) When unpacking the orig.tar.gz, translating the binary data to
     text format and recompressing with xz, the tarball is "only"
     265MB.  The transformation takes about 30min on my laptop, which
     is no longer than a larger project might need to build, but the
     resulting binary package would again be close to 1GB.

     This enables the options:

     2a) Source tarball 256MB + binary package 1GB

     2b) Do the format conversion in postinst at the expense of the
         user's time.  This is acceptable since the package is usually
         installed on high-performance machines and there are not many
         installations, which means bandwidth and disk space should be
         saved on the Debian mirrors rather than on users' machines.

         Source tarball 256MB + binary package ~250MB (estimated)

  3) Strip all data from the source package and download the data in
     postinst from the upstream Git repository.  This makes the package
     size uncritical from a Debian point of view, but might be
     problematic in user setups that have trouble with larger data
     downloads (possibly upstream can be convinced to provide a *.bz2
     tarball for maximum compression).

     3a) Use postinst

     3b) Tell the user to call a download script manually, so that apt
         is not blocked for a long time while dealing with potential
         download problems.
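
To make option 3b more concrete, here is a minimal sketch of what such
a manually invoked download helper might look like.  It is only an
illustration under assumptions: the function name fetch_verified, the
".part" temporary suffix and the idea of shipping a SHA-256 checksum
in the package are all hypothetical and not part of the current
packaging or of upstream.

```python
import hashlib
import os
import urllib.request


def fetch_verified(url, dest, expected_sha256, chunk_size=1 << 20):
    """Download url to dest and verify its SHA-256 checksum.

    Hypothetical helper: the data is streamed to "<dest>.part" and only
    moved into place once the checksum matches, so an interrupted or
    corrupted download never leaves a half-written database behind.
    """
    tmp = dest + ".part"
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
    if digest.hexdigest() != expected_sha256:
        os.remove(tmp)
        raise ValueError("checksum mismatch for " + url)
    os.replace(tmp, dest)  # atomic rename on the same filesystem
    return dest
```

The URL and the expected checksum would have to be supplied by the
package; neither is defined anywhere yet.  Because the script runs
outside of apt, a failed download only means the user has to call it
again.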

Which strategy do you think should be chosen to be kind to Debian
(and mirror) resources?
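
To put rough numbers on the question, the sizes quoted in options 1,
2a and 2b can be added up as in the sketch below.  It assumes 1GB is
counted as 1024MB, uses the 256MB tarball figure from the option
lines, and treats the ~250MB binary size of 2b as the estimate it is.
Option 3 is left out because no size is given for it.

```python
# Sizes in MB as quoted in the options (1GB taken as 1024MB).
OPTIONS = {
    "1": {"source": 1024, "binary": 1024},
    "2a": {"source": 256, "binary": 1024},
    "2b": {"source": 256, "binary": 250},  # binary size is an estimate
}


def mirror_footprint(sizes):
    """Return source plus binary package size in MB for one option."""
    return sizes["source"] + sizes["binary"]


for name, sizes in OPTIONS.items():
    print(f"option {name}: {mirror_footprint(sizes)} MB on the mirror")
# option 1: 2048 MB on the mirror
# option 2a: 1280 MB on the mirror
# option 2b: 506 MB on the mirror
```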

Kind regards

        Andreas.

[1] https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database