[med-svn] [Git][med-team/metaphlan2][upstream] 2 commits: New upstream version 2.7.63
Andreas Tille
gitlab at salsa.debian.org
Mon Sep 17 09:38:15 BST 2018
Andreas Tille pushed to branch upstream at Debian Med / metaphlan2
Commits:
4eea2afd by Andreas Tille at 2018-09-17T07:59:53Z
New upstream version 2.7.63
- - - - -
6a0a6214 by Andreas Tille at 2018-09-17T08:09:26Z
New upstream version 2.7.8
- - - - -
18 changed files:
- .hg_archival.txt
- .hgignore
- .hgsubstate
- .hgtags
- README.md
- metaphlan2.py
- strainphlan.py
- strainphlan_src/add_metadata_tree.py
- strainphlan_src/build_tree_single_strain.py
- strainphlan_src/compute_distance.py
- strainphlan_src/compute_distance_all.py
- strainphlan_src/extract_markers.py
- strainphlan_src/ooSubprocess.py
- strainphlan_src/plot_tree_ete2.py
- strainphlan_src/plot_tree_graphlan.py
- strainphlan_src/sam_filter.py
- strainphlan_src/sample2markers.py
- utils/merge_metaphlan_tables.py
Changes:
=====================================
.hg_archival.txt
=====================================
@@ -1,4 +1,4 @@
repo: 092c2fe2278cb7f0b18d81faeb4aab98b89dc096
-node: cbd7880df400b453b8beb4e62b39e4a23b5523b6
+node: 9760413b180fc2c68b817c23602541d3a97528af
branch: default
-tag: 2.7.6
+tag: 2.7.8
=====================================
.hgignore
=====================================
@@ -1,6 +1,7 @@
syntax: glob
-databases/
+metaphlan_databases/
+tests/
*.pyc
build/
dist/
=====================================
.hgsubstate
=====================================
@@ -1,2 +1,2 @@
c168a100f37e23e2c110849a8d91fac8da49f5bd utils/export2graphlan
-35dfd725e7f024fc6d0edef0cc191c7963108787 utils/hclust2
+69efddea43ae5e37d761cd138fcf083090371d1a utils/hclust2
=====================================
.hgtags
=====================================
@@ -14,3 +14,8 @@ a1fe0d15320c04f69d56f1b7dd31cff972a7b8df 2.7.2
b2f9b3286d4be376805e3b5c26cf141ed375c605 2.7.5
847b250adbe97b9f4adc7e15f0d4bb5a66e782ec 2.7.4
178d1aaf4ac76e5d5477833e8e614104dcd32088 2.7.3
+cbd7880df400b453b8beb4e62b39e4a23b5523b6 2.7.6
+1365be3b4e5a68ed872911e949a25d0944f822b2 2.7.61
+4d50b9ccd95234a436541db13bcb10741f99138b 2.7.62
+a71e7d6b9d50b89c4b80714af145e10d050be7cf 2.7.63
+3e9c612e0a922b73214c90db51c5e26be17bf8b6 2.7.7
=====================================
README.md
=====================================
@@ -2,11 +2,26 @@
# MetaPhlAn 2: Metagenomic Phylogenetic Analysis
+## Installation
+
+MetaPhlAn 2.0 can be obtained:
+
+through **Bioconda**
+
+``$ conda install metaphlan2``
+
+by **direct download** from [Bitbucket](https://bitbucket.org/biobakery/metaphlan2/get/default.zip)
+
+or by **cloning the repository** using the following command
+
+``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
+
+----------------------
## Description
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
-MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file mpa_v20_m200_marker_info.txt.bz2 can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
+MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file `mpa_v20_m200_marker_info.txt.bz2` can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
* unambiguous taxonomic assignments;
* accurate estimation of organismal relative abundance;
@@ -29,7 +44,7 @@ If you use StrainPhlAn, please cite the MetaPhlAn2 paper and the following Strai
-------------
-## MetaPhlAn and StrainPhlAn resources
+## MetaPhlAn and StrainPhlAn tutorials and resources
In addition to the information on this page, you can refer to the following additional resources.
@@ -43,14 +58,17 @@ In addition to the information on this page, you can refer to the following addi
* The related [bioBakery workflows](https://bitbucket.org/biobakery/biobakery/wiki/biobakery_workflows)
-
-------------
## Pre-requisites
-MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](http://www.numpy.org/) libraries installed
- (apart for numpy they are usually installed together with the python distribution).
- Python3 is also now supported.
+MetaPhlAn requires *python 2.7* or higher with argparse, tempfile, [numpy](http://www.numpy.org/), and [Biopython](https://biopython.org/) libraries installed
+(apart from numpy and Biopython, the others are usually installed together with the python distribution).
+Python3 is also now supported.
+
+MetaPhlAn requires the `read_fastx.py` script to be present in the system path; if it is not found, MetaPhlAn will try to locate it at `utils/read_fastx.py` in the folder containing the
+`metaphlan2.py` script.
+If you moved the `metaphlan2.py` script, please add the folder containing `read_fastx.py` to your PATH environment variable.
**If you provide the SAM output of [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) as input, there are no additional prerequisites.**
@@ -62,22 +80,8 @@ MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](
* MetaPhlAn is not tightly integrated with advanced heatmap plotting with [hclust2](https://bitbucket.org/nsegata/hclust2) and cladogram visualization with [GraPhlAn](https://bitbucket.org/nsegata/graphlan/wiki/Home). If you use such visualization tool please refer to their prerequisites.
-----------------------
-
-## Installation
-
-MetaPhlAn 2.0 can be obtained by either
-
-Dowloading MetaPhlAn v2.0 [here](https://bitbucket.org/biobakery/metaphlan2/get/default.zip) or [here](https://www.dropbox.com/s/ztqr8qgbo727zpn/metaphlan2.zip?dl=0)
-
-**OR**
-
-Cloning the repository via the following commands
-``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
-
--------------------------
-
## Basic Usage
This section presents some basic usages of MetaPhlAn2, for more advanced usages, please see at [its wiki](https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2).
@@ -97,55 +101,28 @@ Here is the basic example to profile a metagenome from raw reads (requires BowTi
$ metaphlan2.py metagenome.fastq --input_type fastq > profiled_metagenome.txt
```
-It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:
+It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (`--bowtie2out`), and use multiple CPUs (`--nproc`) if available:
```
#!bash
$ metaphlan2.py metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt
```
-If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):
+If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in a few seconds by using the previously saved `--bowtie2out` file and specifying the input (`--input_type bowtie2out`):
```
#!bash
$ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out > profiled_metagenome.txt
```
-You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
+You can also provide an externally BowTie2-mapped SAM if you specify this format with `--input_type`. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
```
#!bash
-$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
```
-In order to make MetaPhlAn 2 easily compatible with complex metagenomic pipeline, there are now multiple alternative ways to pass the input:
-
-```
-#!bash
-$ cat metagenome.fastq | metaphlan2.py --input_type fastq > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq --bowtie2db ${mpa_dir}/db_v20/mpa_v20_m200 > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq < metagenome.fastq > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2) > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz) > profiled_metagenome.txt
-```
-
MetaPhlAn 2 can also natively **handle paired-end metagenomes** (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
```
@@ -208,16 +185,9 @@ $ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out
* You can also provide an externally BowTie2-mapped SAM if you specify this format with
--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:
-$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
-* Multiple alternative ways to pass the input are also available:
-$ cat metagenome.fastq | metaphlan2.py --input_type fastq
-$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq
-$ metaphlan2.py --input_type fastq < metagenome.fastq
-$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)
-$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)
-
* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
@@ -254,7 +224,7 @@ $ metaphlan2.py -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out
* Finally, to obtain all markers present for a specific clade and all its subclades, the
'-t clade_specific_strain_tracker' should be used. For example, the following command
reports the presence/absence of the markers for the B. fragilis species and its strains
-$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 db_v20/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
+$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
the optional argument --min_ab specifies the minimum clade abundance for reporting the markers
-------------------------------------------------------------------
@@ -501,7 +471,7 @@ In order to add a marker to the database, the user needs the following steps:
```
#!bash
-bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
+bowtie2-inspect metaphlan2/databases/mpa_v20_m200 > metaphlan2/markers.fasta
```
* Add the marker sequence stored in a file new_marker.fasta to the marker set:
@@ -527,7 +497,7 @@ bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
import cPickle as pickle
import bz2
-db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
+db = pickle.load(bz2.BZ2File('databases/mpa_v20_m200.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = length of genome1
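The snippet in this hunk stops short of writing the edited database back to disk. A hedged, self-contained sketch of the full round trip (the taxonomy key and genome length below are placeholders, not real MetaPhlAn entries; on Python 2, `cPickle` stands in for `pickle`):

```python
import bz2
import pickle  # cPickle on Python 2

# toy database with the same top-level keys as the MetaPhlAn pkl
db = {'taxonomy': {}, 'markers': {}}

# add the taxonomy of the new genome (placeholder clade name and length)
db['taxonomy']['k__Bacteria|...|s__New_species'] = 4500000

# write the updated database back as a bz2-compressed pickle
with bz2.BZ2File('mpa_v21_m200.pkl', 'w') as ofile:
    ofile.write(pickle.dumps(db, pickle.HIGHEST_PROTOCOL))
```

Loading it again is the mirror image: `pickle.loads(bz2.BZ2File('mpa_v21_m200.pkl', 'r').read())`.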
@@ -577,12 +547,12 @@ In addition, the table below shows the number of snps between the sample strains
In the next sections, we will illustrate step by step how to run MetaPhlAn\_Strainer on this toy example to reproduce the above figures.
### Pre-requisites
-StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python) and [numpy](http://www.numpy.org/), [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). Besides, StrainPhlAn also needs the following programs in the executable path:
+StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python), [pandas](https://pandas.pydata.org) (tested on **version 0.22**), [numpy](http://www.numpy.org/) (tested on **version 1.14.2**), [scipy](https://www.scipy.org) (tested on **version 1.0.0**), and [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). In addition, StrainPhlAn needs the following programs in the executable path:
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for mapping reads against the marker database.
* [MUSCLE](http://www.drive5.com/muscle/) for the alignment step.
* [samtools, bcftools and vcfutils.pl](http://samtools.sourceforge.net/) which can be downloaded from [here](https://github.com/samtools) for building consensus markers. Note that vcfutils.pl is included in bcftools and **StrainPhlAn only works with samtools version 0.1.19** as samtools has changed the output format after this version.
-* [blastn](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree.
+* [blast+](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree (blastn and makeblastdb commands)
* [raxmlHPC and raxmlHPC-PTHREADS-SSE3](http://sco.h-its.org/exelixis/web/software/raxml/index.html) for building the phylogenetic trees.
All dependence binaries on Linux 64 bit can be downloaded in the folder "bin" from [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
@@ -621,7 +591,7 @@ for f in $(ls fastqs/*.bz2)
do
echo "Running metaphlan2 on ${f}"
bn=$(basename ${f} | cut -d . -f 1)
- tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../db_v20/mpa_v20_m200 --mpa_pkl ../db_v20/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
+ tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../databases/mpa_v20_m200 --mpa_pkl ../databases/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
done
```
@@ -654,8 +624,8 @@ The commands to run are:
```
#!bash
mkdir -p db_markers
-bowtie2-inspect ../db_v20/mpa_v20_m200 > db_markers/all_markers.fasta
-python ../strainphlan_src/extract_markers.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
+bowtie2-inspect ../databases/mpa_v20_m200 > db_markers/all_markers.fasta
+python ../strainphlan_src/extract_markers.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
```
Note that the "all\_markers.fasta" file can be reused for extracting other reference genomes.
@@ -666,7 +636,7 @@ Before building the trees, we should get the list of all clades detected from th
```
#!bash
-python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
+python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
```
The clade names in the output file "clades.txt" will be used for the next step.
@@ -681,7 +651,7 @@ The commands to run are:
```
#!bash
mkdir -p output
-python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
+python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
```
This step will take around 2 minutes. After this step, you will find the tree "output/RAxML\_bestTree.s\_\_Bacteroides\_caccae.tree". All the output files can be found in the folder "output" in [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
=====================================
metaphlan2.py
=====================================
@@ -16,8 +16,8 @@ from __future__ import with_statement
__author__ = ('Nicola Segata (nicola.segata at unitn.it), '
'Duy Tin Truong, '
'Francesco Asnicar (f.asnicar at unitn.it)')
-__version__ = '2.7.6'
-__date__ = '2 March 2018'
+__version__ = '2.7.7'
+__date__ = '31 May 2018'
import sys
@@ -62,7 +62,7 @@ DATABASE_DOWNLOAD = "https://bitbucket.org/biobakery/metaphlan2/downloads/"
# get the directory that contains this script
metaphlan2_script_install_folder = os.path.dirname(os.path.abspath(__file__))
# get the default database folder
-DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "databases")
+DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "metaphlan_databases")
#**********************************************************************************************
@@ -420,12 +420,12 @@ def read_params(args):
"$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq\n"
"$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt\n\n"
- "* Multiple alternative ways to pass the input are also available:\n"
- "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
- "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
- "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
- "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
- "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
+ # "* Multiple alternative ways to pass the input are also available:\n"
+ # "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
+ # "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
+ # "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
+ # "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
+ # "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
"* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in \n"
" multiple files (but you need to specify the --bowtie2out parameter):\n"
@@ -503,10 +503,11 @@ def read_params(args):
help=("The BowTie2 database file of the MetaPhlAn database. Used if "
"--input_type is fastq, fasta, multifasta, or multifastq [default "+DEFAULT_DB_FOLDER+"]\n"))
+ INDEX = 'v20_m200'
arg('-x', '--index', type=str, default='v20_m200',
help=("Specify the id of the database version to use. If the database\n"
"files are not found on the local MetaPhlAn2 installation they\n"
- "will be automatically downloaded\n"))
+ "will be automatically downloaded [default "+INDEX+"]\n"))
bt2ps = ['sensitive', 'very-sensitive', 'sensitive-local', 'very-sensitive-local']
arg('--bt2_ps', metavar="BowTie2 presets", default='very-sensitive',
@@ -647,9 +648,9 @@ def read_params(args):
help="Only checks if the MetaPhlAn2 DB is installed and installs it if not. All other parameters are ignored.")
arg('--read_min_len', type=int, default=70,
help="Specify the minimum length of the reads to be considered when parsing the input file with "
- "'read_fastx.py' script, default value is 60")
+ "'read_fastx.py' script, default value is 70")
arg('-v', '--version', action='version',
- version="MetaPhlAn version {}\t({})".format(__version__, __date__),
+ version="MetaPhlAn version {} ({})".format(__version__, __date__),
help="Prints the current MetaPhlAn version and exit")
arg("-h", "--help", action="help", help="show this help message and exit")
@@ -748,7 +749,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
for row in f:
md5_md5 = row.strip().split(' ')[0]
else:
- sys.stderr.write('File "{}" not found!'.format(md5_file))
+ sys.stderr.write('File "{}" not found!\n'.format(md5_file))
# compute MD5 of .tar.bz2
if os.path.isfile(tar_file):
@@ -760,7 +761,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
md5_tar = hash_md5.hexdigest()[:32]
else:
- sys.stderr.write('File "{}" not found!'.format(tar_file))
+ sys.stderr.write('File "{}" not found!\n'.format(tar_file))
if (md5_tar is None) or (md5_md5 is None):
sys.exit("MD5 checksums not found, something went wrong!")
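The checksum logic in this hunk hashes the downloaded tarball in fixed-size chunks before comparing it to the recorded MD5; a standalone sketch of that pattern (the helper name is ours, not from the source):

```python
import hashlib

def md5_of_file(path, chunk_size=4096):
    # hash the file in chunks so large tarballs need not fit in memory
    hash_md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            hash_md5.update(block)
    return hash_md5.hexdigest()[:32]
```

The `[:32]` slice mirrors the original code; an MD5 hex digest is exactly 32 characters, so it is effectively a no-op kept for parity.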
@@ -1264,7 +1265,7 @@ def map2bbh(mapping_f, input_type='bowtie2out', min_alignment_len=None):
return markers2reads
-def maybe_generate_biom_file(pars, abundance_predictions):
+def maybe_generate_biom_file(tree, pars, abundance_predictions):
json_key = "MetaPhlAn2"
if not pars['biom']:
@@ -1291,21 +1292,20 @@ def maybe_generate_biom_file(pars, abundance_predictions):
return tree.all_clades[name]
def to_biomformat(clade_name):
- return { 'taxonomy': clade_name.split(delimiter) }
+ return {'taxonomy': clade_name.split(delimiter)}
- clades = iter( (abundance, findclade(name))
- for (name, abundance) in abundance_predictions
- if istip(name) )
- packed = iter( ([abundance], clade.get_full_name(), clade.id)
- for (abundance, clade) in clades )
+ clades = iter((abundance, findclade(name))
+ for (name, abundance) in abundance_predictions if istip(name))
+ packed = iter(([abundance], clade.get_full_name(), clade.id)
+ for (abundance, clade) in clades)
- #unpack that tuple here to stay under 80 chars on a line
+ # unpack that tuple here to stay under 80 chars on a line
data, clade_names, clade_ids = zip(*packed)
# biom likes column vectors, so we give it an array like this:
# np.array([a],[b],[c])
data = np.array(data)
sample_ids = [pars['sample_id']]
- table_id='MetaPhlAn2_Analysis'
+ table_id = 'MetaPhlAn2_Analysis'
@@ -1493,7 +1493,7 @@ def metaphlan2():
outf.write( "\t".join( [k,str(v)] ) + "\n" )
else:
outf.write( "unclassified\t100.0\n" )
- maybe_generate_biom_file(pars, outpred)
+ maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'rel_ab_w_read_stats':
cl2ab, rr = tree.relative_abundances(
pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
@@ -1520,7 +1520,7 @@ def metaphlan2():
outf.write( "#estimated total number of reads from known clades: " + str(totl)+"\n")
else:
outf.write( "unclassified\t100.0\n" )
- maybe_generate_biom_file(pars, outpred)
+ maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'clade_profiles':
cl2pr = tree.clade_profiles( pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
=====================================
strainphlan.py
=====================================
@@ -2,9 +2,11 @@
# Author: Duy Tin Truong (duytin.truong at unitn.it)
# at CIBIO, University of Trento, Italy
-__author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
-__version__ = '1.0.0'
-__date__ = '2nd August 2016'
+__author__ = ('Duy Tin Truong (duytin.truong at unitn.it), '
+ 'Francesco Asnicar (f.asnicar at unitn.it), '
+ 'Moreno Zolfo (moreno.zolfo at unitn.it)')
+__version__ = '1.2.0'
+__date__ = '31 May 2018'
import sys
import os
@@ -17,7 +19,10 @@ sys.path.append(os.path.join(MAIN_DIR, 'strainphlan_src'))
import which
import argparse as ap
-import cPickle as pickle
+try:
+ import cPickle as pickle
+except ImportError:
+ import pickle
import msgpack
import glob
from mixed_utils import statistics
@@ -32,9 +37,9 @@ from Bio.Alphabet import IUPAC
import pandas
import logging
import logging.config
-import sample2markers
-import copy
-import threading
+# import sample2markers
+# import copy
+# import threading
import numpy
import random
import gc
@@ -86,12 +91,16 @@ def read_params():
help='The representative sample. The marker list of each species '\
'extracted from this sample will be used for all other samples.')
+ p.add_argument('-x', '--index', required=False, default="v20_m200", type=str,
+ help=("Specify the id of the database version to use. If the database files are not found "
+ "on the local MetaPhlAn2 installation they will be automatically downloaded"))
+
p.add_argument(
'--mpa_pkl',
required=False,
- default=os.path.join(metaphlan2_script_install_folder,"db_v20","mpa_v20_m200.pkl"),
+ default=os.path.join(metaphlan2_script_install_folder, "metaphlan_databases"),
type=str,
- help='The database of metaphlan3.py.')
+ help='The database of metaphlan2.py')
p.add_argument(
'--output_dir',
@@ -1166,15 +1175,29 @@ def load_sample(args):
marker2seq = {}
# reformat 'pileup'
+ marker2seq2exclude = []
for m in marker2seq:
- freq = marker2seq[m]['freq']
- marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
- range(len(marker2seq[m]['seq']))]
- for p in freq:
- marker2seq[m]['freq'][p] = freq[p]
- marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
+
+ # FIX: Check if freq vector is longer than the reconstructed sequence. In this case, the marker is removed
+ # TODO: uniform the two vectors in samples2markers.py
+ if len(marker2seq[m]['freq']) > len(marker2seq[m]['seq']):
+ logger.info('Skipping marker '+m+' as its sequence was reconstructed incorrectly')
+ marker2seq2exclude.append(m)
+
+ else:
+ freq = marker2seq[m]['freq']
+ marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
+ range(len(marker2seq[m]['seq']))]
+ for p in freq:
+ marker2seq[m]['freq'][p] = freq[p]
+ marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
+
+ # returning only the markers for which sequence was reconstructed correctly
+ for marker2exclude in marker2seq2exclude:
+ del marker2seq[marker2exclude]
return marker2seq
+
else:
# remove redundant clades and markers
clade2n_markers = defaultdict(int)
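The guard added in the hunk above boils down to discarding any marker whose per-site frequency vector is longer than its reconstructed sequence; a sketch of that filter as a standalone function (the function name is ours):

```python
def drop_inconsistent_markers(marker2seq):
    # drop markers whose per-site frequency vector outgrew the
    # reconstructed sequence (the FIX described in the hunk above)
    to_exclude = [m for m, rec in marker2seq.items()
                  if len(rec['freq']) > len(rec['seq'])]
    for m in to_exclude:
        del marker2seq[m]
    return marker2seq
```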
@@ -1533,6 +1556,12 @@ def check_dependencies(args):
def strainphlan():
args = read_params()
+
+ # fix db .pkl file
+ if '--mpa_pkl' not in sys.argv:
+ if os.path.isfile(os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")):
+ args['mpa_pkl'] = os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")
+
check_dependencies(args)
strainer(args)
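The `--mpa_pkl` default-resolution added just above can be sketched as a small helper (names are illustrative; the real code inspects `sys.argv` before rewriting `args['mpa_pkl']`):

```python
import os

def resolve_mpa_pkl(mpa_pkl, index='v20_m200'):
    # if --mpa_pkl was left at its default (a folder), look for the
    # versioned pickle inside it; otherwise keep the path as given
    candidate = os.path.join(mpa_pkl, 'mpa_' + index + '.pkl')
    return candidate if os.path.isfile(candidate) else mpa_pkl
```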
=====================================
strainphlan_src/add_metadata_tree.py
=====================================
@@ -3,14 +3,14 @@
# at CIBIO, University of Trento, Italy
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import pandas
-import copy
-import ConfigParser
+# import copy
+# import ConfigParser
import dendropy
-import numpy
+# import numpy
# import ipdb
@@ -18,13 +18,13 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_trees', nargs='+', required=True, default=None, type=str)
p.add_argument('--ifn_metadatas', nargs='+', required=True, default=None, type=str)
- p.add_argument('--string_to_remove',
+ p.add_argument('--string_to_remove',
required=False, default='', type=str,
help='string to be removed in the tree node names')
p.add_argument(
- '--metadatas',
- nargs='+',
- required=False,
+ '--metadatas',
+ nargs='+',
+ required=False,
default=['all'],
type=str,
help='The metadata fields that you want to add. '\
@@ -52,9 +52,9 @@ def main(args):
index_col = get_index_col(ifn)
df = pandas.read_csv(
ifn,
- sep='\t',
+ sep='\t',
dtype=unicode,
- header=0,
+ header=0,
index_col=index_col)
df = df.transpose()
df_list.append(df)
@@ -73,8 +73,8 @@ def main(args):
sample = node.get_node_str().strip("'")
sample = sample.replace(' ', '_')
sample = sample.replace(args['string_to_remove'], '')
- prefixes = [prefix for prefix in
- ['k__', 'p__', 'c__', 'o__',
+ prefixes = [prefix for prefix in
+ ['k__', 'p__', 'c__', 'o__',
'f__', 'g__', 's__'] \
if prefix in sample]
=====================================
strainphlan_src/build_tree_single_strain.py
=====================================
@@ -6,9 +6,9 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '17 Sep 2015'
-import sys
+# import sys
import os
-import argparse
+import argparse
import numpy
from Bio import SeqIO
import glob
@@ -17,34 +17,34 @@ import glob
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--ifn_alignments',
+ '--ifn_alignments',
nargs='+',
- required=True,
- default=None,
+ required=True,
+ default=None,
type=str,
help='The alignment file.')
p.add_argument(
- '--log_ofn',
- required=True,
- default=None,
+ '--log_ofn',
+ required=True,
+ default=None,
type=str,
help='The log file.')
p.add_argument(
- '--nprocs',
- required=True,
- default=None,
+ '--nprocs',
+ required=True,
+ default=None,
type=int,
help='Number of processors.')
p.add_argument(
- '--bootstrap_raxml',
- required=False,
- default=0,
+ '--bootstrap_raxml',
+ required=False,
+ default=0,
type=int,
help='The number of runs for bootstraping when building the tree. '\
'Default 0.')
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
@@ -84,15 +84,15 @@ def main(args):
if len(sample2polrate):
log_line = '%s\t%d\t%d\t%f\n'%\
- (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
- len(singles),
- len(sample2polrate),
+ (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
+ len(singles),
+ len(sample2polrate),
float(len(singles)) / len(sample2polrate))
else:
log_line = '%s\t%d\t%d\t%f\n'%\
- (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
- len(singles),
- len(sample2polrate),
+ (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
+ len(singles),
+ len(sample2polrate),
0)
lfile.write(log_line)
@@ -116,7 +116,7 @@ def main(args):
cmd += '-N %d '%(args.bootstrap_raxml)
cmd += '-s %s '%os.path.abspath(ifn_alignment2)
cmd += '-w %s '%os.path.abspath(os.path.dirname(ifn_alignment2))
- cmd += '-n %s '%output_suffix
+ cmd += '-n %s '%output_suffix
cmd += '-p 1234 '
else:
cmd = 'raxmlHPC-PTHREADS-SSE3 '
@@ -136,7 +136,7 @@ def main(args):
'''
run(cmd)
lfile.close()
-
+
=====================================
strainphlan_src/compute_distance.py
=====================================
@@ -15,8 +15,8 @@ sys.path.append(MAIN_DIR)
from mixed_utils import dist2file, statistics
import argparse as ap
-from Bio import SeqIO, Seq, SeqRecord
-from collections import defaultdict
+from Bio import SeqIO#, Seq, SeqRecord
+# from collections import defaultdict
import numpy
from ooSubprocess import ooSubprocess
@@ -24,8 +24,8 @@ from ooSubprocess import ooSubprocess
'''
SUBST = {
'A':{'A':0.0, 'C':1.0, 'G':1.0, 'T':1.0, '-':1.0},
- 'C':{'A':1.0, 'C':0.0, 'G':1.0, 'T':1.0, '-':1.0},
- 'G':{'A':1.0, 'C':1.0, 'G':0.0, 'T':1.0, '-':1.0},
+ 'C':{'A':1.0, 'C':0.0, 'G':1.0, 'T':1.0, '-':1.0},
+ 'G':{'A':1.0, 'C':1.0, 'G':0.0, 'T':1.0, '-':1.0},
'T':{'A':1.0, 'C':1.0, 'G':1.0, 'T':0.0, '-':1.0},
'-':{'A':1.0, 'C':1.0, 'G':1.0, 'T':1.0, '-':0.0}}
'''
@@ -35,14 +35,14 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_alignment', required=True, default=None, type=str)
p.add_argument('--ofn_prefix', required=True, default=None, type=str)
- p.add_argument('--count_gaps',
+ p.add_argument('--count_gaps',
required=False,
- dest='ignore_gaps',
+ dest='ignore_gaps',
action='store_false')
p.set_defaults(ignore_gaps=True)
- p.add_argument('--overwrite',
+ p.add_argument('--overwrite',
required=False,
- dest='overwrite',
+ dest='overwrite',
action='store_true')
p.set_defaults(overwrite=False)
@@ -72,7 +72,7 @@ def get_dist(seq1, seq2, ignore_gaps):
rel_sim = 1.0 - rel_dist
rel_snp = abs_snp / float(len(seq1))
return abs_dist, rel_dist, abs_sim, rel_sim, abs_snp, rel_snp
-
+
def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
ofn_abs_dist = ofn_prefix + '.abs_dist'
@@ -101,14 +101,14 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
for i in range(len(recs)):
for j in range(i, len(recs)):
- abs_d, rel_d, abs_s, rel_s, abs_sp, rel_sp = get_dist(recs[i].seq,
+ abs_d, rel_d, abs_s, rel_s, abs_sp, rel_sp = get_dist(recs[i].seq,
recs[j].seq,
ignore_gaps)
abs_dist[i][j] = abs_d
abs_dist[j][i] = abs_d
abs_dist_flat.append(abs_d)
-
+
rel_dist[i][j] = rel_d
rel_dist[j][i] = rel_d
rel_dist_flat.append(rel_d)
@@ -116,7 +116,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
abs_sim[i][j] = abs_s
abs_sim[j][i] = abs_s
abs_sim_flat.append(abs_s)
-
+
rel_sim[i][j] = rel_s
rel_sim[j][i] = rel_s
rel_sim_flat.append(rel_s)
@@ -124,7 +124,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
abs_snp[i][j] = abs_sp
abs_snp[j][i] = abs_sp
abs_snp_flat.append(abs_sp)
-
+
rel_snp[i][j] = rel_sp
rel_snp[j][i] = rel_sp
rel_snp_flat.append(rel_sp)
@@ -198,7 +198,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
'--flabel_size', '5',
'--slabel_size', '5',
'--max_flabel_len', '200'])
- '''
+ '''
ofn_abs_snp = ofn_prefix + '.abs_snp'
dist2file(abs_snp, labels, ofn_abs_snp)
@@ -208,17 +208,17 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
dist2file(rel_snp, labels, ofn_rel_snp)
with open(ofn_rel_snp + '.info', 'w') as ofile:
ofile.write(statistics(rel_snp_flat)[1])
-
+
def main(args):
compute_dist_matrix(
- args['ifn_alignment'],
+ args['ifn_alignment'],
args['ofn_prefix'],
args['ignore_gaps'],
- args['overwrite'])
-
+ args['overwrite'])
+
if __name__ == "__main__":
args = read_params()
main(args)
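The hunks above touch `get_dist()` and the symmetric matrix fill in `compute_dist_matrix()`. As a rough orientation for reviewers, here is a hedged sketch of what `get_dist()` appears to compute; only the relations `rel_sim = 1.0 - rel_dist` and `rel_snp = abs_snp / float(len(seq1))` are visible in the diff, so the gap handling and the distinction between `abs_dist` and `abs_snp` here are assumptions, not the upstream implementation.

```python
def get_dist(seq1, seq2, ignore_gaps):
    """Sketch of pairwise alignment metrics (assumed semantics, not upstream code).

    Positions containing '-' are skipped when ignore_gaps is set; every
    mismatch over the compared positions counts toward both the distance
    and the SNP tally in this simplified version.
    """
    abs_dist = 0
    abs_snp = 0
    length = 0
    for a, b in zip(seq1, seq2):
        if ignore_gaps and '-' in (a, b):
            continue  # assumption: gapped columns excluded from comparison
        length += 1
        if a != b:
            abs_dist += 1
            abs_snp += 1
    rel_dist = abs_dist / float(length) if length else 0.0
    abs_sim = length - abs_dist
    rel_sim = 1.0 - rel_dist                  # relation shown in the diff
    rel_snp = abs_snp / float(len(seq1))      # relation shown in the diff
    return abs_dist, rel_dist, abs_sim, rel_sim, abs_snp, rel_snp
```

The caller fills both `[i][j]` and `[j][i]` for each metric, which matches the symmetric-matrix writes visible in the `compute_dist_matrix()` hunks.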
=====================================
strainphlan_src/compute_distance_all.py
=====================================
@@ -13,9 +13,9 @@ MAIN_DIR = os.path.dirname(ABS_PATH)
os.environ['PATH'] += ':%s'%MAIN_DIR
sys.path.append(MAIN_DIR)
import argparse as ap
-from Bio import SeqIO, Seq, SeqRecord
-from collections import defaultdict
-import numpy
+from Bio import SeqIO#, Seq, SeqRecord
+# from collections import defaultdict
+# import numpy
from compute_distance import compute_dist_matrix
from ooSubprocess import parallelize
@@ -24,9 +24,9 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_alignments', nargs='+', required=True, default=None, type=str)
p.add_argument('--nprocs', required=True, default=None, type=int)
- p.add_argument('--count_gaps',
+ p.add_argument('--count_gaps',
required=False,
- dest='ignore_gaps',
+ dest='ignore_gaps',
action='store_false')
p.set_defaults(ignore_gaps=True)
@@ -37,11 +37,11 @@ def read_params():
def compute_dist_matrix_wrapper(args):
compute_dist_matrix(
- args['ifn_alignment'],
+ args['ifn_alignment'],
args['ofn_prefix'],
args['ignore_gaps'],
- overwrite=True)
-
+ overwrite=True)
+
def main(args):
@@ -49,7 +49,7 @@ def main(args):
for i in range(len(args['ifn_alignments'])):
args_list.append({})
args_list[i]['ifn_alignment'] = args['ifn_alignments'][i]
- args_list[i]['ofn_prefix'] = args['ifn_alignments'][i]
+ args_list[i]['ofn_prefix'] = args['ifn_alignments'][i]
if not args['ignore_gaps']:
args_list[i]['ofn_prefix'] += '.count_gaps'
args_list[i]['ignore_gaps'] = args['ignore_gaps']
=====================================
strainphlan_src/extract_markers.py
=====================================
@@ -6,8 +6,8 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '1 Sep 2014'
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import pickle
import bz2
=====================================
strainphlan_src/ooSubprocess.py
=====================================
@@ -8,12 +8,12 @@ import os
import multiprocessing
from multiprocessing.pool import ThreadPool
import sys
-import cStringIO
-from tempfile import NamedTemporaryFile
+# import cStringIO
+from tempfile import NamedTemporaryFile
import which
import functools
import traceback
-import numpy
+# import numpy
class ooSubprocessException(Exception):
@@ -62,8 +62,8 @@ class ooSubprocess:
elif get_out_pipe:
tmp_file = NamedTemporaryFile(dir=self.tmp_dir)
p = subprocess.Popen(
- cmd,
- stdin=in_pipe,
+ cmd,
+ stdin=in_pipe,
stdout=tmp_file,
**kwargs)
p.wait()
@@ -74,15 +74,15 @@ class ooSubprocess:
elif out_fn:
ofile = open(out_fn, 'w') if out_fn else None
result = subprocess.check_call(
- cmd,
- stdin=in_pipe,
- stdout=ofile,
+ cmd,
+ stdin=in_pipe,
+ stdout=ofile,
**kwargs)
ofile.close()
else:
result = subprocess.check_call(
- cmd,
- stdin=in_pipe,
+ cmd,
+ stdin=in_pipe,
**kwargs)
return result
=====================================
strainphlan_src/plot_tree_ete2.py
=====================================
@@ -6,23 +6,23 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '7 Sep 2016'
-import sys
-import os
-import argparse
+# import sys
+# import os
+import argparse
from ete2 import Tree, TreeStyle, NodeStyle
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--ifn',
- required=True,
- default=None,
+ '--ifn',
+ required=True,
+ default=None,
type=str,
help='The input tree file.')
p.add_argument(
- '--ofn',
- required=False,
- default=None,
+ '--ofn',
+ required=False,
+ default=None,
type=str,
help='The input tree file.')
=====================================
strainphlan_src/plot_tree_graphlan.py
=====================================
@@ -6,78 +6,78 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '4 May 2015'
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import dendropy
from StringIO import StringIO
import re
from collections import defaultdict
-import ConfigParser
+# import ConfigParser
import matplotlib.colors as colors
import subprocess
def read_params():
p = ap.ArgumentParser()
- p.add_argument('--ifn_tree',
- required=True,
- default=None,
+ p.add_argument('--ifn_tree',
+ required=True,
+ default=None,
type=str,
help='The input tree in newick format.')
- p.add_argument('--colorized_metadata',
- required=False,
- default='unset',
+ p.add_argument('--colorized_metadata',
+ required=False,
+ default='unset',
type=str,
help='The metadata field to colorize. Default "unset".')
- p.add_argument('--fig_size',
- required=False,
- default=8,
+ p.add_argument('--fig_size',
+ required=False,
+ default=8,
type=float,
help='The figure size. Default "8".')
- p.add_argument('--legend_marker_size',
- required=False,
- default=20,
+ p.add_argument('--legend_marker_size',
+ required=False,
+ default=20,
type=int,
help='The legend marker size. Default "20".'
)
- p.add_argument('--legend_font_size',
- required=False,
- default=10,
+ p.add_argument('--legend_font_size',
+ required=False,
+ default=10,
type=int,
help='The legend font size. Default "10".'
)
- p.add_argument('--legend_marker_edge_width',
- required=False,
- default=0.2,
+ p.add_argument('--legend_marker_edge_width',
+ required=False,
+ default=0.2,
type=float,
help='The legend marker edge width. Default "0.2".'
)
- p.add_argument('--leaf_marker_size',
- required=False,
- default=20,
+ p.add_argument('--leaf_marker_size',
+ required=False,
+ default=20,
type=int,
help='The legend marker size. Default "20".'
)
- p.add_argument('--leaf_marker_edge_width',
- required=False,
- default=0.2,
+ p.add_argument('--leaf_marker_edge_width',
+ required=False,
+ default=0.2,
type=float,
help='The legend marker edge width. Default "0.2".'
)
- p.add_argument('--dpi',
- required=False,
- default=300,
+ p.add_argument('--dpi',
+ required=False,
+ default=300,
type=int,
help='The figure dpi.')
- p.add_argument('--figure_extension',
- required=False,
- default='.png',
+ p.add_argument('--figure_extension',
+ required=False,
+ default='.png',
type=str,
help='The figure extension. Default ".png".')
- p.add_argument('--ofn_prefix',
- required=False,
- default=None,
+ p.add_argument('--ofn_prefix',
+ required=False,
+ default=None,
type=str,
help='The prefix of output files.')
return p.parse_args()
@@ -88,7 +88,7 @@ def read_params():
def run(cmd):
print cmd
subprocess.call(cmd.split())
-
+
@@ -133,7 +133,7 @@ def main(args):
ofile.write('branch_bracket_depth\t0\n')
#ofile.write('branch_thickness\t1.25\n')
ofile.write('annotation_background_width\t0\n')
-
+
# legend
ofile.write('#legends\n')
ofile.write('class_legend_font_size\t%d\n'%args.legend_font_size)
=====================================
strainphlan_src/sam_filter.py
=====================================
@@ -7,28 +7,28 @@ __version__ = '0.1'
__date__ = '18 Jul 2015'
import sys
-import os
-import argparse
+# import os
+import argparse
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--input_file',
- required=False,
- default=None,
+ '--input_file',
+ required=False,
+ default=None,
type=str,
help='The input sam file.')
p.add_argument(
- '--min_align_score',
- required=True,
- default=None,
+ '--min_align_score',
+ required=True,
+ default=None,
type=int,
help='The sam records with alignment score smaller than this value '
'will be discarded.')
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
=====================================
strainphlan_src/sample2markers.py
=====================================
@@ -16,23 +16,23 @@ import argparse as ap
import glob
import ooSubprocess
from ooSubprocess import print_stderr, trace_unhandled_exceptions
-import ConfigParser
+# import ConfigParser
from Bio import SeqIO, Seq, SeqRecord
-import cStringIO
+# import cStringIO
import msgpack
-import random
-import subprocess
-import bz2
-import gzip
+# import random
+# import subprocess
+# import bz2
+# import gzip
import logging
import logging.config
-import tarfile
-import threading
+# import tarfile
+# import threading
import multiprocessing
import pysam
from collections import defaultdict
from scipy import stats
-import numpy
+# import numpy
logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
disable_existing_loggers=False,
@@ -57,17 +57,17 @@ def read_params():
p.add_argument('--samtools_exe', required=False, default='samtools', type=str)
p.add_argument('--bcftools_exe', required=False, default='bcftools', type=str)
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
p.set_defaults(quiet=True)
'''
p.add_argument(
- '--use_processes',
- required=False,
- default=False,
+ '--use_processes',
+ required=False,
+ default=False,
action='store_false',
dest='use_threads',
help='Use multiprocessing. Default "Use multithreading".')
@@ -102,7 +102,7 @@ def build_bowtie2db(ifn_markers, tmp_dir, error_pipe=None):
oosp = ooSubprocess.ooSubprocess(tmp_dir)
if index_fns == []:
oosp.ex(
- 'bowtie2-build',
+ 'bowtie2-build',
['--quiet', ifn_markers, index_path],
stderr=error_pipe)
else:
@@ -113,17 +113,17 @@ def build_bowtie2db(ifn_markers, tmp_dir, error_pipe=None):
def sample2markers(
- ifn_sample,
+ ifn_sample,
min_read_len,
min_align_score,
min_base_quality,
min_read_depth,
error_rate,
- ifn_markers,
+ ifn_markers,
index_path,
- nprocs=1,
+ nprocs=1,
sam2file=None,
- marker2file=None,
+ marker2file=None,
tmp_dir='tmp',
quiet=False,
samtools_exe='samtools',
@@ -136,7 +136,7 @@ def sample2markers(
:param marker2file: the file name to store the consensus markers.
:param sam2file: the file name to store the sam content.
:returns: if marker2file==None, return the dictionary of the consensus
- markers. Otherwise, save the result in fasta format to the file declared by
+ markers. Otherwise, save the result in fasta format to the file declared by
marker2file
'''
@@ -148,7 +148,7 @@ def sample2markers(
# sample to sam
sample_pipe = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', ifn_sample],
stderr=error_pipe
)
@@ -159,7 +159,7 @@ def sample2markers(
stderr=error_pipe
)
bowtie2_pipe = oosp.chain(
- 'bowtie2',
+ 'bowtie2',
args=[
'-U', '-',
'-x', index_path,
@@ -173,13 +173,13 @@ def sample2markers(
else:
oosp.chain(
'compress_file.py',
- args=['--output_file', sam2file],
+ args=['--output_file', sam2file],
in_pipe=bowtie2_pipe,
stderr=error_pipe,
stop=True)
sam_pipe = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', sam2file],
stderr=error_pipe)
ofn_bam_sorted_prefix = os.path.join(
@@ -187,14 +187,14 @@ def sample2markers(
os.path.basename(ifn_sample) + '.bam.sorted')
return sam2markers(
- sam_file=sam_pipe,
+ sam_file=sam_pipe,
ofn_bam_sorted_prefix=ofn_bam_sorted_prefix,
min_align_score=min_align_score,
min_base_quality=min_base_quality,
min_read_depth=min_read_depth,
error_rate=error_rate,
- marker2file=marker2file,
- oosp=oosp,
+ marker2file=marker2file,
+ oosp=oosp,
tmp_dir=tmp_dir,
quiet=quiet,
samtools_exe=samtools_exe,
@@ -211,14 +211,14 @@ def save2file(tmp_file, ofn):
def sam2markers(
- sam_file,
+ sam_file,
ofn_bam_sorted_prefix,
min_align_score=None,
min_base_quality=30,
min_read_depth=0,
error_rate=0.01,
- marker2file=None,
- oosp=None,
+ marker2file=None,
+ oosp=None,
tmp_dir='tmp',
quiet=False,
samtools_exe='samtools',
@@ -247,12 +247,12 @@ def sam2markers(
if type(sam_file) == str:
p1 = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', sam_file],
stderr=error_pipe)
else:
p1 = sam_file
-
+
# filter sam
if min_align_score == None:
p1_filtered = p1
@@ -264,8 +264,8 @@ def sam2markers(
stderr=error_pipe)
# sam to bam
p2 = oosp.chain(
- samtools_exe,
- args=['view', '-bS', '-'],
+ samtools_exe,
+ args=['view', '-bS', '-'],
in_pipe=p1_filtered,
stderr=error_pipe)
@@ -274,12 +274,12 @@ def sam2markers(
for tmp_fn in tmp_fns:
os.remove(tmp_fn)
p3 = oosp.chain(
- samtools_exe,
- args=['sort', '-', '-o', ofn_bam_sorted_prefix],
+ samtools_exe,
+ args=['sort', '-', '-o', ofn_bam_sorted_prefix],
in_pipe=p2,
stderr=error_pipe)
- # extract polimorphic information
+ # extract polimorphic information
marker2seq = defaultdict(dict)
pysam.index(p3.name)
samfile = pysam.AlignmentFile(p3.name)
@@ -307,30 +307,30 @@ def sam2markers(
# bam to mpileup
p3.seek(0)
p4 = oosp.chain(
- samtools_exe,
- args=['mpileup', '-u', '-'],
+ samtools_exe,
+ args=['mpileup', '-u', '-'],
in_pipe=p3,
stderr=error_pipe)
# mpileup to vcf
p5 = oosp.chain(
- bcftools_exe,
- args=['view', '-c', '-g', '-p', '1.1', '-'],
+ bcftools_exe,
+ args=['view', '-c', '-g', '-p', '1.1', '-'],
in_pipe=p4,
stderr=error_pipe)
#stderr=open(os.devnull, 'w'))
# fix AF1=0
p6 = oosp.chain(
- 'fix_AF1.py',
- args=['--input_file', '-'],
+ 'fix_AF1.py',
+ args=['--input_file', '-'],
in_pipe=p5,
stderr=error_pipe)
# vcf to fastq
p7 = oosp.chain(
- 'vcfutils.pl',
- args=['vcf2fq'],
+ 'vcfutils.pl',
+ args=['vcf2fq'],
in_pipe=p6,
get_out_pipe=True,
stderr=error_pipe,
@@ -342,7 +342,7 @@ def sam2markers(
marker2seq = dict(marker2seq)
except Exception as e:
logger.error("sam2markers failed on file " + sam_file)
- raise
+ raise
if type(p1) == file:
p1.close()
@@ -369,17 +369,17 @@ def run_sample(args_list):
marker2file = output_prefix + args['marker2file_ext']
if args['input_type'] == 'fastq':
sample2markers(
- ifn_sample=ifn_sample,
+ ifn_sample=ifn_sample,
min_read_len=args['min_read_len'],
min_align_score=args['min_align_score'],
min_base_quality=args['min_base_quality'],
min_read_depth=args['min_read_depth'],
error_rate=args['error_rate'],
- ifn_markers=args['ifn_markers'],
+ ifn_markers=args['ifn_markers'],
index_path=args['index_path'],
nprocs=args['nprocs'],
sam2file=sam2file,
- marker2file=marker2file,
+ marker2file=marker2file,
tmp_dir=args['output_dir'],
quiet=args['quiet'],
samtools_exe=args['samtools_exe'],
@@ -389,7 +389,7 @@ def run_sample(args_list):
args['output_dir'],
os.path.basename(ifn_sample) + '.bam.sorted')
sam2markers(
- sam_file=ifn_sample,
+ sam_file=ifn_sample,
ofn_bam_sorted_prefix=ofn_bam_sorted_prefix,
min_align_score=args['min_align_score'],
min_base_quality=args['min_base_quality'],
@@ -435,7 +435,7 @@ def main(args):
print e
-
+
if __name__ == "__main__":
args = read_params()
main(args)
=====================================
utils/merge_metaphlan_tables.py
=====================================
@@ -39,9 +39,10 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For each input datum in each input stream...
pos = 0
+ dCol = None
- for f in aaastrIn :
- with open(f) as csvfile :
+ for f in aaastrIn:
+ with open(f) as csvfile:
iIn = csv.reader(csvfile, csv.excel_tab)
# Lines from the current file, empty list to hold data, empty hash to hold ids
@@ -51,18 +52,22 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For a line in the file
for astrLine in iIn:
if astrLine[0].startswith('#'):
+ dCol = astrLine.index('relative_abundance') if 'relative_abundance' in astrLine else None
continue
iLine += 1
- # ID is from first column, data are everything else
- strID, astrData = astrLine[iCol], ( astrLine[:iCol] + astrLine[( iCol + 1 ):] )
+ if dCol:
+ strID, astrData = astrLine[iCol], [astrLine[dCol]]
+ else:
+ # ID is from first column, data are everything else
+ strID, astrData = astrLine[iCol], (astrLine[:iCol] + astrLine[iCol + 1:])
hashIDs[strID] = iLine
- aastrData.append( astrData )
+ aastrData.append(astrData)
# Batch merge every new ID key set
- setstrIDs.update( hashIDs.keys( ) )
+ setstrIDs.update(hashIDs.keys())
pos += 1
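The `merge_metaphlan_tables.py` hunk above changes how each row is split into ID and data: when the header line names a `relative_abundance` column, only that column is kept per clade; otherwise the old behavior (everything except the ID column) applies. A minimal standalone sketch of that selection logic, with `extract_abundance` as a hypothetical helper name not present in the script:

```python
def extract_abundance(header, row, id_col=0):
    """Return (clade_id, data_columns) for one MetaPhlAn table row.

    Mirrors the new merge logic: prefer the 'relative_abundance' column
    from the header if present, else take all non-ID columns.
    """
    d_col = (header.index('relative_abundance')
             if 'relative_abundance' in header else None)
    if d_col is not None:
        return row[id_col], [row[d_col]]
    # legacy layout: ID is the first column, data are everything else
    return row[id_col], row[:id_col] + row[id_col + 1:]

# new-style header with an explicit relative_abundance column
header = ['#clade_name', 'NCBI_tax_id', 'relative_abundance', 'additional_species']
row = ['k__Bacteria', '2', '99.9', '']
print(extract_abundance(header, row))  # ('k__Bacteria', ['99.9'])
```

Note that the diff's own test is `if dCol:` rather than `if dCol is not None:`, so the upstream version would fall back to the legacy path if `relative_abundance` ever occupied column 0; the sketch uses the explicit `None` check.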
View it on GitLab: https://salsa.debian.org/med-team/metaphlan2/compare/51196ec977e34d401797054d8ecb78987e14dde4...6a0a6214036de2c11466dd22ee8df64761ff6a06
--
You're receiving this email because of your account on salsa.debian.org.
More information about the debian-med-commit
mailing list