[Debian-med-packaging] Bug#833388: ITP: metaphlan2 -- Metagenomic Phylogenetic Analysis
Andreas Tille
tille at debian.org
Wed Aug 3 19:00:05 UTC 2016
Package: wnpp
Severity: wishlist
Owner: Andreas Tille <tille at debian.org>
* Package name : metaphlan2
Version : 2.5.0
Upstream Author : Nicola Segata <nicola.segata at unitn.it>
* URL : https://bitbucket.org/nsegata/metaphlan2/wiki/Home
* License : MIT
Programming Lang: Python
Description : Metagenomic Phylogenetic Analysis
MetaPhlAn is a computational tool for profiling the composition of
microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from
metagenomic shotgun sequencing data with species-level resolution. From
version 2.0, MetaPhlAn is also able to identify specific strains (in the
not-so-frequent cases in which the sample contains a previously
sequenced strain) and to track strains across samples for all species.
.
MetaPhlAn 2.0 relies on ~1M unique clade-specific marker genes (the
marker information file can be found at src/utils/markers_info.txt.bz2
or here) identified from ~17,000 reference genomes (~13,500 bacterial
and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
.
* unambiguous taxonomic assignments;
* accurate estimation of organismal relative abundance;
* species-level resolution for bacteria, archaea, eukaryotes and
viruses;
* strain identification and tracking;
* orders of magnitude speedups compared to existing methods;
* metagenomic strain-level population genomics.
Remark: The package is a target for Debian Med in itself and will be
used by metaBIT. It will be maintained by the Debian Med team and the
packaging is currently available at
svn://anonscm.debian.org/debian-med/trunk/packages/metaphlan2/trunk/
******* I'd like to discuss the following issue on the debian-devel list *******
While Debian Med uploads several packages with low popularity-contest
scores, this one has an extraordinarily large set of data and thus I
want to discuss the following options:
1) The original orig.tar.gz is 1GB in size and contains 1.2GB of
uncompressed binary data. License-wise this should not be a problem
since upstream gives a recipe for converting the data to text form
and back[1].
We would have: source package 1GB + binary package 1GB
2) When unpacking the orig.tar.gz, converting the binary data to
text format and recompressing with xz, the tarball is "only" 265MB.
The conversion takes about 30min on my laptop - not longer than
the build of any larger project - but the resulting binary package
would again be close to 1GB.
This enables the options:
2a) Source tarball 256MB + binary package 1GB
2b) Do the format conversion in postinst at the expense of the
user's time. This is acceptable since the package is usually
installed on high-performance machines and there are not many
installations, so bandwidth and disk space should be saved on
the Debian mirrors rather than on users' machines.
Source tarball 256MB + binary package ~250MB (estimated)
3) Strip all data from the source package and download the data in
postinst from the upstream Git repository. This gives the package
an uncritical size from a Debian point of view but might be
problematic in user setups that have trouble with larger data
downloads (possibly upstream can be convinced to provide a
*.bz2 tarball for maximum compression).
3a) Use postinst
3b) Tell the user to run a download script manually, so that apt
is not blocked for a long time while dealing with potential
download problems.
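To make the comparison easier, here is a rough Python sketch that adds
up the source and binary package sizes quoted above. All figures are the
estimates from this mail, 1GB is counted as 1000MB by assumption, and
option 2 uses the 256MB figure given under 2a/2b. Option 3 is left out
since no size is given for it. The names OPTIONS, mirror_footprint and
rank_options are made up for illustration only.

```python
# Approximate sizes in MB, taken from the estimates in this mail.
# Assumption: 1GB is counted as 1000MB; all names are illustrative only.
OPTIONS = {
    "1": {"source": 1000, "binary": 1000},
    "2a": {"source": 256, "binary": 1000},
    "2b": {"source": 256, "binary": 250},  # binary size is an estimate
}


def mirror_footprint(sizes):
    """Return source + binary package size in MB for one option."""
    return sizes["source"] + sizes["binary"]


def rank_options(options):
    """Return (name, total MB) pairs sorted from smallest to largest."""
    totals = [(name, mirror_footprint(sizes)) for name, sizes in options.items()]
    return sorted(totals, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, total in rank_options(OPTIONS):
        print(f"option {name}: ~{total}MB (source + binary)")
```

This only counts one source and one binary package per option; it says
nothing about the time or bandwidth spent on the user's side.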
Which strategy do you think should be chosen to be kind to Debian
(and mirror) resources?
Kind regards
Andreas.
[1] https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database