[med-svn] [Git][med-team/metaphlan2][upstream] 2 commits: New upstream version 2.7.63
Andreas Tille
gitlab at salsa.debian.org
Mon Sep 17 09:38:15 BST 2018
Andreas Tille pushed to branch upstream at Debian Med / metaphlan2
Commits:
4eea2afd by Andreas Tille at 2018-09-17T07:59:53Z
New upstream version 2.7.63
- - - - -
6a0a6214 by Andreas Tille at 2018-09-17T08:09:26Z
New upstream version 2.7.8
- - - - -
18 changed files:
- .hg_archival.txt
- .hgignore
- .hgsubstate
- .hgtags
- README.md
- metaphlan2.py
- strainphlan.py
- strainphlan_src/add_metadata_tree.py
- strainphlan_src/build_tree_single_strain.py
- strainphlan_src/compute_distance.py
- strainphlan_src/compute_distance_all.py
- strainphlan_src/extract_markers.py
- strainphlan_src/ooSubprocess.py
- strainphlan_src/plot_tree_ete2.py
- strainphlan_src/plot_tree_graphlan.py
- strainphlan_src/sam_filter.py
- strainphlan_src/sample2markers.py
- utils/merge_metaphlan_tables.py
Changes:
=====================================
.hg_archival.txt
=====================================
@@ -1,4 +1,4 @@
repo: 092c2fe2278cb7f0b18d81faeb4aab98b89dc096
-node: cbd7880df400b453b8beb4e62b39e4a23b5523b6
+node: 9760413b180fc2c68b817c23602541d3a97528af
branch: default
-tag: 2.7.6
+tag: 2.7.8
=====================================
.hgignore
=====================================
@@ -1,6 +1,7 @@
syntax: glob
-databases/
+metaphlan_databases/
+tests/
*.pyc
build/
dist/
=====================================
.hgsubstate
=====================================
@@ -1,2 +1,2 @@
c168a100f37e23e2c110849a8d91fac8da49f5bd utils/export2graphlan
-35dfd725e7f024fc6d0edef0cc191c7963108787 utils/hclust2
+69efddea43ae5e37d761cd138fcf083090371d1a utils/hclust2
=====================================
.hgtags
=====================================
@@ -14,3 +14,8 @@ a1fe0d15320c04f69d56f1b7dd31cff972a7b8df 2.7.2
b2f9b3286d4be376805e3b5c26cf141ed375c605 2.7.5
847b250adbe97b9f4adc7e15f0d4bb5a66e782ec 2.7.4
178d1aaf4ac76e5d5477833e8e614104dcd32088 2.7.3
+cbd7880df400b453b8beb4e62b39e4a23b5523b6 2.7.6
+1365be3b4e5a68ed872911e949a25d0944f822b2 2.7.61
+4d50b9ccd95234a436541db13bcb10741f99138b 2.7.62
+a71e7d6b9d50b89c4b80714af145e10d050be7cf 2.7.63
+3e9c612e0a922b73214c90db51c5e26be17bf8b6 2.7.7
=====================================
README.md
=====================================
@@ -2,11 +2,26 @@
# MetaPhlAn 2: Metagenomic Phylogenetic Analysis
+## Installation
+
+MetaPhlAn 2.0 can be obtained:
+
+through **Bioconda**
+
+``$ conda install metaphlan2``
+
+by **direct download** from [Bitbucket](https://bitbucket.org/biobakery/metaphlan2/get/default.zip)
+
+or by **cloning the repository** using the following command
+
+``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
+
+----------------------
## Description
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
-MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file mpa_v20_m200_marker_info.txt.bz2 can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
+MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file `mpa_v20_m200_marker_info.txt.bz2` can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
* unambiguous taxonomic assignments;
* accurate estimation of organismal relative abundance;
@@ -29,7 +44,7 @@ If you use StrainPhlAn, please cite the MetaPhlAn2 paper and the following Strai
-------------
-## MetaPhlAn and StrainPhlAn resources
+## MetaPhlAn and StrainPhlAn tutorials and resources
In addition to the information on this page, you can refer to the following additional resources.
@@ -43,14 +58,17 @@ In addition to the information on this page, you can refer to the following addi
* The related [bioBakery workflows](https://bitbucket.org/biobakery/biobakery/wiki/biobakery_workflows)
-
-------------
## Pre-requisites
-MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](http://www.numpy.org/) libraries installed
- (apart for numpy they are usually installed together with the python distribution).
- Python3 is also now supported.
+MetaPhlAn requires *python 2.7* or higher with argparse, tempfile, [numpy](http://www.numpy.org/), and [Biopython](https://biopython.org/) libraries installed
+(apart from numpy and Biopython, the others are usually installed together with the python distribution).
+Python3 is also now supported.
+
+MetaPhlAn requires the `read_fastx.py` script to be present in the system path; if it is not found, MetaPhlAn will try to locate it at `utils/read_fastx.py` in the folder containing the
+`metaphlan2.py` script.
+If you moved the `metaphlan2.py` script, please add the folder containing `read_fastx.py` to your PATH environment variable.
**If you provide the SAM output of [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) as input, there are no additional prerequisites.**
@@ -62,22 +80,8 @@ MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](
* MetaPhlAn is not tightly integrated with advanced heatmap plotting with [hclust2](https://bitbucket.org/nsegata/hclust2) and cladogram visualization with [GraPhlAn](https://bitbucket.org/nsegata/graphlan/wiki/Home). If you use such visualization tool please refer to their prerequisites.
-----------------------
-
-## Installation
-
-MetaPhlAn 2.0 can be obtained by either
-
-Dowloading MetaPhlAn v2.0 [here](https://bitbucket.org/biobakery/metaphlan2/get/default.zip) or [here](https://www.dropbox.com/s/ztqr8qgbo727zpn/metaphlan2.zip?dl=0)
-
-**OR**
-
-Cloning the repository via the following commands
-``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
-
--------------------------
-
## Basic Usage
This section presents some basic usages of MetaPhlAn2, for more advanced usages, please see at [its wiki](https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2).
@@ -97,55 +101,28 @@ Here is the basic example to profile a metagenome from raw reads (requires BowTi
$ metaphlan2.py metagenome.fastq --input_type fastq > profiled_metagenome.txt
```
-It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:
+It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (`--bowtie2out`), and use multiple CPUs (`--nproc`) if available:
```
#!bash
$ metaphlan2.py metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt
```
-If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):
+If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in a few seconds by using the previously saved `--bowtie2out` file and specifying the input (`--input_type bowtie2out`):
```
#!bash
$ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out > profiled_metagenome.txt
```
-You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
+You can also provide an externally BowTie2-mapped SAM if you specify this format with `--input_type`. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
```
#!bash
-$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
```
-In order to make MetaPhlAn 2 easily compatible with complex metagenomic pipeline, there are now multiple alternative ways to pass the input:
-
-```
-#!bash
-$ cat metagenome.fastq | metaphlan2.py --input_type fastq > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq --bowtie2db ${mpa_dir}/db_v20/mpa_v20_m200 > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq < metagenome.fastq > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2) > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz) > profiled_metagenome.txt
-```
-
MetaPhlAn 2 can also natively **handle paired-end metagenomes** (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
```
@@ -208,16 +185,9 @@ $ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out
* You can also provide an externally BowTie2-mapped SAM if you specify this format with
--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:
-$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
-* Multiple alternative ways to pass the input are also available:
-$ cat metagenome.fastq | metaphlan2.py --input_type fastq
-$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq
-$ metaphlan2.py --input_type fastq < metagenome.fastq
-$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)
-$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)
-
* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
@@ -254,7 +224,7 @@ $ metaphlan2.py -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out
* Finally, to obtain all markers present for a specific clade and all its subclades, the
'-t clade_specific_strain_tracker' should be used. For example, the following command
reports the presence/absence of the markers for the B. fragilis species and its strains
-$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 db_v20/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
+$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
the optional argument --min_ab specifies the minimum clade abundance for reporting the markers
-------------------------------------------------------------------
@@ -501,7 +471,7 @@ In order to add a marker to the database, the user needs the following steps:
```
#!bash
-bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
+bowtie2-inspect metaphlan2/databases/mpa_v20_m200 > metaphlan2/markers.fasta
```
* Add the marker sequence stored in a file new_marker.fasta to the marker set:
@@ -527,7 +497,7 @@ bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
import cPickle as pickle
import bz2
-db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
+db = pickle.load(bz2.BZ2File('databases/mpa_v20_m200.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = length of genome1
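The snippet in this hunk stops short of writing the edited database back to disk. A hedged, self-contained sketch of the full round trip (the taxonomy key and genome length below are placeholders, not real MetaPhlAn entries; on Python 2, `cPickle` stands in for `pickle`):

```python
import bz2
import pickle  # cPickle on Python 2

# toy database with the same top-level keys as the MetaPhlAn pkl
db = {'taxonomy': {}, 'markers': {}}

# add the taxonomy of the new genome (placeholder clade name and length)
db['taxonomy']['k__Bacteria|...|s__New_species'] = 4500000

# write the updated database back as a bz2-compressed pickle
with bz2.BZ2File('mpa_v21_m200.pkl', 'w') as ofile:
    ofile.write(pickle.dumps(db, pickle.HIGHEST_PROTOCOL))
```

Loading it again is the mirror image: `pickle.loads(bz2.BZ2File('mpa_v21_m200.pkl', 'r').read())`.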
@@ -577,12 +547,12 @@ In addition, the table below shows the number of snps between the sample strains
In the next sections, we will illustrate step by step how to run MetaPhlAn\_Strainer on this toy example to reproduce the above figures.
### Pre-requisites
-StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python) and [numpy](http://www.numpy.org/), [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). Besides, StrainPhlAn also needs the following programs in the executable path:
+StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python), [pandas](https://pandas.pydata.org) (tested on **version 0.22**), [numpy](http://www.numpy.org/) (tested on **version 1.14.2**), [scipy](https://www.scipy.org) (tested on **version 1.0.0**), and [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). In addition, StrainPhlAn needs the following programs in the executable path:
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for mapping reads against the marker database.
* [MUSCLE](http://www.drive5.com/muscle/) for the alignment step.
* [samtools, bcftools and vcfutils.pl](http://samtools.sourceforge.net/) which can be downloaded from [here](https://github.com/samtools) for building consensus markers. Note that vcfutils.pl is included in bcftools and **StrainPhlAn only works with samtools version 0.1.19** as samtools has changed the output format after this version.
-* [blastn](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree.
+* [blast+](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree (blastn and makeblastdb commands)
* [raxmlHPC and raxmlHPC-PTHREADS-SSE3](http://sco.h-its.org/exelixis/web/software/raxml/index.html) for building the phylogenetic trees.
All dependence binaries on Linux 64 bit can be downloaded in the folder "bin" from [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
@@ -621,7 +591,7 @@ for f in $(ls fastqs/*.bz2)
do
echo "Running metaphlan2 on ${f}"
bn=$(basename ${f} | cut -d . -f 1)
- tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../db_v20/mpa_v20_m200 --mpa_pkl ../db_v20/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
+ tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../databases/mpa_v20_m200 --mpa_pkl ../databases/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
done
```
@@ -654,8 +624,8 @@ The commands to run are:
```
#!bash
mkdir -p db_markers
-bowtie2-inspect ../db_v20/mpa_v20_m200 > db_markers/all_markers.fasta
-python ../strainphlan_src/extract_markers.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
+bowtie2-inspect ../databases/mpa_v20_m200 > db_markers/all_markers.fasta
+python ../strainphlan_src/extract_markers.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
```
Note that the "all\_markers.fasta" file can be reused for extracting other reference genomes.
@@ -666,7 +636,7 @@ Before building the trees, we should get the list of all clades detected from th
```
#!bash
-python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
+python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
```
The clade names in the output file "clades.txt" will be used for the next step.
@@ -681,7 +651,7 @@ The commands to run are:
```
#!bash
mkdir -p output
-python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
+python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
```
This step will take around 2 minutes. After this step, you will find the tree "output/RAxML\_bestTree.s\_\_Bacteroides\_caccae.tree". All the output files can be found in the folder "output" in [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
=====================================
metaphlan2.py
=====================================
@@ -16,8 +16,8 @@ from __future__ import with_statement
__author__ = ('Nicola Segata (nicola.segata at unitn.it), '
'Duy Tin Truong, '
'Francesco Asnicar (f.asnicar at unitn.it)')
-__version__ = '2.7.6'
-__date__ = '2 March 2018'
+__version__ = '2.7.7'
+__date__ = '31 May 2018'
import sys
@@ -62,7 +62,7 @@ DATABASE_DOWNLOAD = "https://bitbucket.org/biobakery/metaphlan2/downloads/"
# get the directory that contains this script
metaphlan2_script_install_folder = os.path.dirname(os.path.abspath(__file__))
# get the default database folder
-DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "databases")
+DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "metaphlan_databases")
#**********************************************************************************************
@@ -420,12 +420,12 @@ def read_params(args):
"$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq\n"
"$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt\n\n"
- "* Multiple alternative ways to pass the input are also available:\n"
- "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
- "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
- "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
- "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
- "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
+ # "* Multiple alternative ways to pass the input are also available:\n"
+ # "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
+ # "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
+ # "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
+ # "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
+ # "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
"* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in \n"
" multiple files (but you need to specify the --bowtie2out parameter):\n"
@@ -503,10 +503,11 @@ def read_params(args):
help=("The BowTie2 database file of the MetaPhlAn database. Used if "
"--input_type is fastq, fasta, multifasta, or multifastq [default "+DEFAULT_DB_FOLDER+"]\n"))
+ INDEX = 'v20_m200'
arg('-x', '--index', type=str, default='v20_m200',
help=("Specify the id of the database version to use. If the database\n"
"files are not found on the local MetaPhlAn2 installation they\n"
- "will be automatically downloaded\n"))
+ "will be automatically downloaded [default "+INDEX+"]\n"))
bt2ps = ['sensitive', 'very-sensitive', 'sensitive-local', 'very-sensitive-local']
arg('--bt2_ps', metavar="BowTie2 presets", default='very-sensitive',
@@ -647,9 +648,9 @@ def read_params(args):
help="Only checks if the MetaPhlAn2 DB is installed and installs it if not. All other parameters are ignored.")
arg('--read_min_len', type=int, default=70,
help="Specify the minimum length of the reads to be considered when parsing the input file with "
- "'read_fastx.py' script, default value is 60")
+ "'read_fastx.py' script, default value is 70")
arg('-v', '--version', action='version',
- version="MetaPhlAn version {}\t({})".format(__version__, __date__),
+ version="MetaPhlAn version {} ({})".format(__version__, __date__),
help="Prints the current MetaPhlAn version and exit")
arg("-h", "--help", action="help", help="show this help message and exit")
@@ -748,7 +749,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
for row in f:
md5_md5 = row.strip().split(' ')[0]
else:
- sys.stderr.write('File "{}" not found!'.format(md5_file))
+ sys.stderr.write('File "{}" not found!\n'.format(md5_file))
# compute MD5 of .tar.bz2
if os.path.isfile(tar_file):
@@ -760,7 +761,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
md5_tar = hash_md5.hexdigest()[:32]
else:
- sys.stderr.write('File "{}" not found!'.format(tar_file))
+ sys.stderr.write('File "{}" not found!\n'.format(tar_file))
if (md5_tar is None) or (md5_md5 is None):
sys.exit("MD5 checksums not found, something went wrong!")
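The checksum logic in this hunk hashes the downloaded tarball in fixed-size chunks before comparing it to the recorded MD5; a standalone sketch of that pattern (the helper name is ours, not from the source):

```python
import hashlib

def md5_of_file(path, chunk_size=4096):
    # hash the file in chunks so large tarballs need not fit in memory
    hash_md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            hash_md5.update(block)
    return hash_md5.hexdigest()[:32]
```

The `[:32]` slice mirrors the original code; an MD5 hex digest is exactly 32 characters, so it is effectively a no-op kept for parity.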
@@ -1264,7 +1265,7 @@ def map2bbh(mapping_f, input_type='bowtie2out', min_alignment_len=None):
return markers2reads
-def maybe_generate_biom_file(pars, abundance_predictions):
+def maybe_generate_biom_file(tree, pars, abundance_predictions):
json_key = "MetaPhlAn2"
if not pars['biom']:
@@ -1291,21 +1292,20 @@ def maybe_generate_biom_file(pars, abundance_predictions):
return tree.all_clades[name]
def to_biomformat(clade_name):
- return { 'taxonomy': clade_name.split(delimiter) }
+ return {'taxonomy': clade_name.split(delimiter)}
- clades = iter( (abundance, findclade(name))
- for (name, abundance) in abundance_predictions
- if istip(name) )
- packed = iter( ([abundance], clade.get_full_name(), clade.id)
- for (abundance, clade) in clades )
+ clades = iter((abundance, findclade(name))
+ for (name, abundance) in abundance_predictions if istip(name))
+ packed = iter(([abundance], clade.get_full_name(), clade.id)
+ for (abundance, clade) in clades)
- #unpack that tuple here to stay under 80 chars on a line
+ # unpack that tuple here to stay under 80 chars on a line
data, clade_names, clade_ids = zip(*packed)
# biom likes column vectors, so we give it an array like this:
# np.array([a],[b],[c])
data = np.array(data)
sample_ids = [pars['sample_id']]
- table_id='MetaPhlAn2_Analysis'
+ table_id = 'MetaPhlAn2_Analysis'
@@ -1493,7 +1493,7 @@ def metaphlan2():
outf.write( "\t".join( [k,str(v)] ) + "\n" )
else:
outf.write( "unclassified\t100.0\n" )
- maybe_generate_biom_file(pars, outpred)
+ maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'rel_ab_w_read_stats':
cl2ab, rr = tree.relative_abundances(
pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
@@ -1520,7 +1520,7 @@ def metaphlan2():
outf.write( "#estimated total number of reads from known clades: " + str(totl)+"\n")
else:
outf.write( "unclassified\t100.0\n" )
- maybe_generate_biom_file(pars, outpred)
+ maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'clade_profiles':
cl2pr = tree.clade_profiles( pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
=====================================
strainphlan.py
=====================================
@@ -2,9 +2,11 @@
# Author: Duy Tin Truong (duytin.truong at unitn.it)
# at CIBIO, University of Trento, Italy
-__author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
-__version__ = '1.0.0'
-__date__ = '2nd August 2016'
+__author__ = ('Duy Tin Truong (duytin.truong at unitn.it), '
+ 'Francesco Asnicar (f.asnicar at unitn.it), '
+ 'Moreno Zolfo (moreno.zolfo at unitn.it)')
+__version__ = '1.2.0'
+__date__ = '31 May 2018'
import sys
import os
@@ -17,7 +19,10 @@ sys.path.append(os.path.join(MAIN_DIR, 'strainphlan_src'))
import which
import argparse as ap
-import cPickle as pickle
+try:
+ import cPickle as pickle
+except ImportError:
+ import pickle
import msgpack
import glob
from mixed_utils import statistics
@@ -32,9 +37,9 @@ from Bio.Alphabet import IUPAC
import pandas
import logging
import logging.config
-import sample2markers
-import copy
-import threading
+# import sample2markers
+# import copy
+# import threading
import numpy
import random
import gc
@@ -86,12 +91,16 @@ def read_params():
help='The representative sample. The marker list of each species '\
'extracted from this sample will be used for all other samples.')
+ p.add_argument('-x', '--index', required=False, default="v20_m200", type=str,
+ help=("Specify the id of the database version to use. If the database files are not found "
+ "on the local MetaPhlAn2 installation they will be automatically downloaded"))
+
p.add_argument(
'--mpa_pkl',
required=False,
- default=os.path.join(metaphlan2_script_install_folder,"db_v20","mpa_v20_m200.pkl"),
+ default=os.path.join(metaphlan2_script_install_folder, "metaphlan_databases"),
type=str,
- help='The database of metaphlan3.py.')
+ help='The database of metaphlan2.py')
p.add_argument(
'--output_dir',
@@ -1166,15 +1175,29 @@ def load_sample(args):
marker2seq = {}
# reformat 'pileup'
+ marker2seq2exclude = []
for m in marker2seq:
- freq = marker2seq[m]['freq']
- marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
- range(len(marker2seq[m]['seq']))]
- for p in freq:
- marker2seq[m]['freq'][p] = freq[p]
- marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
+
+ # FIX: Check if freq vector is longer than the reconstructed sequence. In this case, the marker is removed
+ # TODO: uniform the two vectors in samples2markers.py
+ if len(marker2seq[m]['freq']) > len(marker2seq[m]['seq']):
+ logger.info('Skipping marker '+m+' as its sequence was reconstructed incorrectly')
+ marker2seq2exclude.append(m)
+
+ else:
+ freq = marker2seq[m]['freq']
+ marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
+ range(len(marker2seq[m]['seq']))]
+ for p in freq:
+ marker2seq[m]['freq'][p] = freq[p]
+ marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
+
+ # returning only the markers for which sequence was reconstructed correctly
+ for marker2exclude in marker2seq2exclude:
+ del marker2seq[marker2exclude]
return marker2seq
+
else:
# remove redundant clades and markers
clade2n_markers = defaultdict(int)
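The guard added in the hunk above boils down to discarding any marker whose per-site frequency vector is longer than its reconstructed sequence; a sketch of that filter as a standalone function (the function name is ours):

```python
def drop_inconsistent_markers(marker2seq):
    # drop markers whose per-site frequency vector outgrew the
    # reconstructed sequence (the FIX described in the hunk above)
    to_exclude = [m for m, rec in marker2seq.items()
                  if len(rec['freq']) > len(rec['seq'])]
    for m in to_exclude:
        del marker2seq[m]
    return marker2seq
```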
@@ -1533,6 +1556,12 @@ def check_dependencies(args):
def strainphlan():
args = read_params()
+
+ # fix db .pkl file
+ if '--mpa_pkl' not in sys.argv:
+ if os.path.isfile(os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")):
+ args['mpa_pkl'] = os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")
+
check_dependencies(args)
strainer(args)
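The `--mpa_pkl` default-resolution added just above can be sketched as a small helper (names are illustrative; the real code inspects `sys.argv` before rewriting `args['mpa_pkl']`):

```python
import os

def resolve_mpa_pkl(mpa_pkl, index='v20_m200'):
    # if --mpa_pkl was left at its default (a folder), look for the
    # versioned pickle inside it; otherwise keep the path as given
    candidate = os.path.join(mpa_pkl, 'mpa_' + index + '.pkl')
    return candidate if os.path.isfile(candidate) else mpa_pkl
```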
=====================================
strainphlan_src/add_metadata_tree.py
=====================================
@@ -3,14 +3,14 @@
# at CIBIO, University of Trento, Italy
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import pandas
-import copy
-import ConfigParser
+# import copy
+# import ConfigParser
import dendropy
-import numpy
+# import numpy
# import ipdb
@@ -18,13 +18,13 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_trees', nargs='+', required=True, default=None, type=str)
p.add_argument('--ifn_metadatas', nargs='+', required=True, default=None, type=str)
- p.add_argument('--string_to_remove',
+ p.add_argument('--string_to_remove',
required=False, default='', type=str,
help='string to be removed in the tree node names')
p.add_argument(
- '--metadatas',
- nargs='+',
- required=False,
+ '--metadatas',
+ nargs='+',
+ required=False,
default=['all'],
type=str,
help='The metadata fields that you want to add. '\
@@ -52,9 +52,9 @@ def main(args):
index_col = get_index_col(ifn)
df = pandas.read_csv(
ifn,
- sep='\t',
+ sep='\t',
dtype=unicode,
- header=0,
+ header=0,
index_col=index_col)
df = df.transpose()
df_list.append(df)
@@ -73,8 +73,8 @@ def main(args):
sample = node.get_node_str().strip("'")
sample = sample.replace(' ', '_')
sample = sample.replace(args['string_to_remove'], '')
- prefixes = [prefix for prefix in
- ['k__', 'p__', 'c__', 'o__',
+ prefixes = [prefix for prefix in
+ ['k__', 'p__', 'c__', 'o__',
'f__', 'g__', 's__'] \
if prefix in sample]
=====================================
strainphlan_src/build_tree_single_strain.py
=====================================
@@ -6,9 +6,9 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '17 Sep 2015'
-import sys
+# import sys
import os
-import argparse
+import argparse
import numpy
from Bio import SeqIO
import glob
@@ -17,34 +17,34 @@ import glob
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--ifn_alignments',
+ '--ifn_alignments',
nargs='+',
- required=True,
- default=None,
+ required=True,
+ default=None,
type=str,
help='The alignment file.')
p.add_argument(
- '--log_ofn',
- required=True,
- default=None,
+ '--log_ofn',
+ required=True,
+ default=None,
type=str,
help='The log file.')
p.add_argument(
- '--nprocs',
- required=True,
- default=None,
+ '--nprocs',
+ required=True,
+ default=None,
type=int,
help='Number of processors.')
p.add_argument(
- '--bootstrap_raxml',
- required=False,
- default=0,
+ '--bootstrap_raxml',
+ required=False,
+ default=0,
type=int,
help='The number of runs for bootstraping when building the tree. '\
'Default 0.')
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
@@ -84,15 +84,15 @@ def main(args):
if len(sample2polrate):
log_line = '%s\t%d\t%d\t%f\n'%\
- (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
- len(singles),
- len(sample2polrate),
+ (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
+ len(singles),
+ len(sample2polrate),
float(len(singles)) / len(sample2polrate))
else:
log_line = '%s\t%d\t%d\t%f\n'%\
- (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
- len(singles),
- len(sample2polrate),
+ (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
+ len(singles),
+ len(sample2polrate),
0)
lfile.write(log_line)
@@ -116,7 +116,7 @@ def main(args):
cmd += '-N %d '%(args.bootstrap_raxml)
cmd += '-s %s '%os.path.abspath(ifn_alignment2)
cmd += '-w %s '%os.path.abspath(os.path.dirname(ifn_alignment2))
- cmd += '-n %s '%output_suffix
+ cmd += '-n %s '%output_suffix
cmd += '-p 1234 '
else:
cmd = 'raxmlHPC-PTHREADS-SSE3 '
@@ -136,7 +136,7 @@ def main(args):
'''
run(cmd)
lfile.close()
-
+
=====================================
strainphlan_src/compute_distance.py
=====================================
@@ -15,8 +15,8 @@ sys.path.append(MAIN_DIR)
from mixed_utils import dist2file, statistics
import argparse as ap
-from Bio import SeqIO, Seq, SeqRecord
-from collections import defaultdict
+from Bio import SeqIO#, Seq, SeqRecord
+# from collections import defaultdict
import numpy
from ooSubprocess import ooSubprocess
@@ -24,8 +24,8 @@ from ooSubprocess import ooSubprocess
'''
SUBST = {
'A':{'A':0.0, 'C':1.0, 'G':1.0, 'T':1.0, '-':1.0},
- 'C':{'A':1.0, 'C':0.0, 'G':1.0, 'T':1.0, '-':1.0},
- 'G':{'A':1.0, 'C':1.0, 'G':0.0, 'T':1.0, '-':1.0},
+ 'C':{'A':1.0, 'C':0.0, 'G':1.0, 'T':1.0, '-':1.0},
+ 'G':{'A':1.0, 'C':1.0, 'G':0.0, 'T':1.0, '-':1.0},
'T':{'A':1.0, 'C':1.0, 'G':1.0, 'T':0.0, '-':1.0},
'-':{'A':1.0, 'C':1.0, 'G':1.0, 'T':1.0, '-':0.0}}
'''
@@ -35,14 +35,14 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_alignment', required=True, default=None, type=str)
p.add_argument('--ofn_prefix', required=True, default=None, type=str)
- p.add_argument('--count_gaps',
+ p.add_argument('--count_gaps',
required=False,
- dest='ignore_gaps',
+ dest='ignore_gaps',
action='store_false')
p.set_defaults(ignore_gaps=True)
- p.add_argument('--overwrite',
+ p.add_argument('--overwrite',
required=False,
- dest='overwrite',
+ dest='overwrite',
action='store_true')
p.set_defaults(overwrite=False)
@@ -72,7 +72,7 @@ def get_dist(seq1, seq2, ignore_gaps):
rel_sim = 1.0 - rel_dist
rel_snp = abs_snp / float(len(seq1))
return abs_dist, rel_dist, abs_sim, rel_sim, abs_snp, rel_snp
-
+
def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
ofn_abs_dist = ofn_prefix + '.abs_dist'
@@ -101,14 +101,14 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
for i in range(len(recs)):
for j in range(i, len(recs)):
- abs_d, rel_d, abs_s, rel_s, abs_sp, rel_sp = get_dist(recs[i].seq,
+ abs_d, rel_d, abs_s, rel_s, abs_sp, rel_sp = get_dist(recs[i].seq,
recs[j].seq,
ignore_gaps)
abs_dist[i][j] = abs_d
abs_dist[j][i] = abs_d
abs_dist_flat.append(abs_d)
-
+
rel_dist[i][j] = rel_d
rel_dist[j][i] = rel_d
rel_dist_flat.append(rel_d)
@@ -116,7 +116,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
abs_sim[i][j] = abs_s
abs_sim[j][i] = abs_s
abs_sim_flat.append(abs_s)
-
+
rel_sim[i][j] = rel_s
rel_sim[j][i] = rel_s
rel_sim_flat.append(rel_s)
@@ -124,7 +124,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
abs_snp[i][j] = abs_sp
abs_snp[j][i] = abs_sp
abs_snp_flat.append(abs_sp)
-
+
rel_snp[i][j] = rel_sp
rel_snp[j][i] = rel_sp
rel_snp_flat.append(rel_sp)
@@ -198,7 +198,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
'--flabel_size', '5',
'--slabel_size', '5',
'--max_flabel_len', '200'])
- '''
+ '''
ofn_abs_snp = ofn_prefix + '.abs_snp'
dist2file(abs_snp, labels, ofn_abs_snp)
@@ -208,17 +208,17 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
dist2file(rel_snp, labels, ofn_rel_snp)
with open(ofn_rel_snp + '.info', 'w') as ofile:
ofile.write(statistics(rel_snp_flat)[1])
-
+
def main(args):
compute_dist_matrix(
- args['ifn_alignment'],
+ args['ifn_alignment'],
args['ofn_prefix'],
args['ignore_gaps'],
- args['overwrite'])
-
+ args['overwrite'])
+
if __name__ == "__main__":
args = read_params()
main(args)
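The hunks above touch `get_dist()` and the symmetric matrix fill in `compute_dist_matrix()`. As a rough orientation for reviewers, here is a hedged sketch of what `get_dist()` appears to compute; only the relations `rel_sim = 1.0 - rel_dist` and `rel_snp = abs_snp / float(len(seq1))` are visible in the diff, so the gap handling and the distinction between `abs_dist` and `abs_snp` here are assumptions, not the upstream implementation.

```python
def get_dist(seq1, seq2, ignore_gaps):
    """Sketch of pairwise alignment metrics (assumed semantics, not upstream code).

    Positions containing '-' are skipped when ignore_gaps is set; every
    mismatch over the compared positions counts toward both the distance
    and the SNP tally in this simplified version.
    """
    abs_dist = 0
    abs_snp = 0
    length = 0
    for a, b in zip(seq1, seq2):
        if ignore_gaps and '-' in (a, b):
            continue  # assumption: gapped columns excluded from comparison
        length += 1
        if a != b:
            abs_dist += 1
            abs_snp += 1
    rel_dist = abs_dist / float(length) if length else 0.0
    abs_sim = length - abs_dist
    rel_sim = 1.0 - rel_dist                  # relation shown in the diff
    rel_snp = abs_snp / float(len(seq1))      # relation shown in the diff
    return abs_dist, rel_dist, abs_sim, rel_sim, abs_snp, rel_snp
```

The caller fills both `[i][j]` and `[j][i]` for each metric, which matches the symmetric-matrix writes visible in the `compute_dist_matrix()` hunks.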
=====================================
strainphlan_src/compute_distance_all.py
=====================================
@@ -13,9 +13,9 @@ MAIN_DIR = os.path.dirname(ABS_PATH)
os.environ['PATH'] += ':%s'%MAIN_DIR
sys.path.append(MAIN_DIR)
import argparse as ap
-from Bio import SeqIO, Seq, SeqRecord
-from collections import defaultdict
-import numpy
+from Bio import SeqIO#, Seq, SeqRecord
+# from collections import defaultdict
+# import numpy
from compute_distance import compute_dist_matrix
from ooSubprocess import parallelize
@@ -24,9 +24,9 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_alignments', nargs='+', required=True, default=None, type=str)
p.add_argument('--nprocs', required=True, default=None, type=int)
- p.add_argument('--count_gaps',
+ p.add_argument('--count_gaps',
required=False,
- dest='ignore_gaps',
+ dest='ignore_gaps',
action='store_false')
p.set_defaults(ignore_gaps=True)
@@ -37,11 +37,11 @@ def read_params():
def compute_dist_matrix_wrapper(args):
compute_dist_matrix(
- args['ifn_alignment'],
+ args['ifn_alignment'],
args['ofn_prefix'],
args['ignore_gaps'],
- overwrite=True)
-
+ overwrite=True)
+
def main(args):
@@ -49,7 +49,7 @@ def main(args):
for i in range(len(args['ifn_alignments'])):
args_list.append({})
args_list[i]['ifn_alignment'] = args['ifn_alignments'][i]
- args_list[i]['ofn_prefix'] = args['ifn_alignments'][i]
+ args_list[i]['ofn_prefix'] = args['ifn_alignments'][i]
if not args['ignore_gaps']:
args_list[i]['ofn_prefix'] += '.count_gaps'
args_list[i]['ignore_gaps'] = args['ignore_gaps']
=====================================
strainphlan_src/extract_markers.py
=====================================
@@ -6,8 +6,8 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '1 Sep 2014'
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import pickle
import bz2
=====================================
strainphlan_src/ooSubprocess.py
=====================================
@@ -8,12 +8,12 @@ import os
import multiprocessing
from multiprocessing.pool import ThreadPool
import sys
-import cStringIO
-from tempfile import NamedTemporaryFile
+# import cStringIO
+from tempfile import NamedTemporaryFile
import which
import functools
import traceback
-import numpy
+# import numpy
class ooSubprocessException(Exception):
@@ -62,8 +62,8 @@ class ooSubprocess:
elif get_out_pipe:
tmp_file = NamedTemporaryFile(dir=self.tmp_dir)
p = subprocess.Popen(
- cmd,
- stdin=in_pipe,
+ cmd,
+ stdin=in_pipe,
stdout=tmp_file,
**kwargs)
p.wait()
@@ -74,15 +74,15 @@ class ooSubprocess:
elif out_fn:
ofile = open(out_fn, 'w') if out_fn else None
result = subprocess.check_call(
- cmd,
- stdin=in_pipe,
- stdout=ofile,
+ cmd,
+ stdin=in_pipe,
+ stdout=ofile,
**kwargs)
ofile.close()
else:
result = subprocess.check_call(
- cmd,
- stdin=in_pipe,
+ cmd,
+ stdin=in_pipe,
**kwargs)
return result
=====================================
strainphlan_src/plot_tree_ete2.py
=====================================
@@ -6,23 +6,23 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '7 Sep 2016'
-import sys
-import os
-import argparse
+# import sys
+# import os
+import argparse
from ete2 import Tree, TreeStyle, NodeStyle
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--ifn',
- required=True,
- default=None,
+ '--ifn',
+ required=True,
+ default=None,
type=str,
help='The input tree file.')
p.add_argument(
- '--ofn',
- required=False,
- default=None,
+ '--ofn',
+ required=False,
+ default=None,
type=str,
help='The input tree file.')
=====================================
strainphlan_src/plot_tree_graphlan.py
=====================================
@@ -6,78 +6,78 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '4 May 2015'
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import dendropy
from StringIO import StringIO
import re
from collections import defaultdict
-import ConfigParser
+# import ConfigParser
import matplotlib.colors as colors
import subprocess
def read_params():
p = ap.ArgumentParser()
- p.add_argument('--ifn_tree',
- required=True,
- default=None,
+ p.add_argument('--ifn_tree',
+ required=True,
+ default=None,
type=str,
help='The input tree in newick format.')
- p.add_argument('--colorized_metadata',
- required=False,
- default='unset',
+ p.add_argument('--colorized_metadata',
+ required=False,
+ default='unset',
type=str,
help='The metadata field to colorize. Default "unset".')
- p.add_argument('--fig_size',
- required=False,
- default=8,
+ p.add_argument('--fig_size',
+ required=False,
+ default=8,
type=float,
help='The figure size. Default "8".')
- p.add_argument('--legend_marker_size',
- required=False,
- default=20,
+ p.add_argument('--legend_marker_size',
+ required=False,
+ default=20,
type=int,
help='The legend marker size. Default "20".'
)
- p.add_argument('--legend_font_size',
- required=False,
- default=10,
+ p.add_argument('--legend_font_size',
+ required=False,
+ default=10,
type=int,
help='The legend font size. Default "10".'
)
- p.add_argument('--legend_marker_edge_width',
- required=False,
- default=0.2,
+ p.add_argument('--legend_marker_edge_width',
+ required=False,
+ default=0.2,
type=float,
help='The legend marker edge width. Default "0.2".'
)
- p.add_argument('--leaf_marker_size',
- required=False,
- default=20,
+ p.add_argument('--leaf_marker_size',
+ required=False,
+ default=20,
type=int,
help='The legend marker size. Default "20".'
)
- p.add_argument('--leaf_marker_edge_width',
- required=False,
- default=0.2,
+ p.add_argument('--leaf_marker_edge_width',
+ required=False,
+ default=0.2,
type=float,
help='The legend marker edge width. Default "0.2".'
)
- p.add_argument('--dpi',
- required=False,
- default=300,
+ p.add_argument('--dpi',
+ required=False,
+ default=300,
type=int,
help='The figure dpi.')
- p.add_argument('--figure_extension',
- required=False,
- default='.png',
+ p.add_argument('--figure_extension',
+ required=False,
+ default='.png',
type=str,
help='The figure extension. Default ".png".')
- p.add_argument('--ofn_prefix',
- required=False,
- default=None,
+ p.add_argument('--ofn_prefix',
+ required=False,
+ default=None,
type=str,
help='The prefix of output files.')
return p.parse_args()
@@ -88,7 +88,7 @@ def read_params():
def run(cmd):
print cmd
subprocess.call(cmd.split())
-
+
@@ -133,7 +133,7 @@ def main(args):
ofile.write('branch_bracket_depth\t0\n')
#ofile.write('branch_thickness\t1.25\n')
ofile.write('annotation_background_width\t0\n')
-
+
# legend
ofile.write('#legends\n')
ofile.write('class_legend_font_size\t%d\n'%args.legend_font_size)
=====================================
strainphlan_src/sam_filter.py
=====================================
@@ -7,28 +7,28 @@ __version__ = '0.1'
__date__ = '18 Jul 2015'
import sys
-import os
-import argparse
+# import os
+import argparse
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--input_file',
- required=False,
- default=None,
+ '--input_file',
+ required=False,
+ default=None,
type=str,
help='The input sam file.')
p.add_argument(
- '--min_align_score',
- required=True,
- default=None,
+ '--min_align_score',
+ required=True,
+ default=None,
type=int,
help='The sam records with alignment score smaller than this value '
'will be discarded.')
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
=====================================
strainphlan_src/sample2markers.py
=====================================
@@ -16,23 +16,23 @@ import argparse as ap
import glob
import ooSubprocess
from ooSubprocess import print_stderr, trace_unhandled_exceptions
-import ConfigParser
+# import ConfigParser
from Bio import SeqIO, Seq, SeqRecord
-import cStringIO
+# import cStringIO
import msgpack
-import random
-import subprocess
-import bz2
-import gzip
+# import random
+# import subprocess
+# import bz2
+# import gzip
import logging
import logging.config
-import tarfile
-import threading
+# import tarfile
+# import threading
import multiprocessing
import pysam
from collections import defaultdict
from scipy import stats
-import numpy
+# import numpy
logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
disable_existing_loggers=False,
@@ -57,17 +57,17 @@ def read_params():
p.add_argument('--samtools_exe', required=False, default='samtools', type=str)
p.add_argument('--bcftools_exe', required=False, default='bcftools', type=str)
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
p.set_defaults(quiet=True)
'''
p.add_argument(
- '--use_processes',
- required=False,
- default=False,
+ '--use_processes',
+ required=False,
+ default=False,
action='store_false',
dest='use_threads',
help='Use multiprocessing. Default "Use multithreading".')
@@ -102,7 +102,7 @@ def build_bowtie2db(ifn_markers, tmp_dir, error_pipe=None):
oosp = ooSubprocess.ooSubprocess(tmp_dir)
if index_fns == []:
oosp.ex(
- 'bowtie2-build',
+ 'bowtie2-build',
['--quiet', ifn_markers, index_path],
stderr=error_pipe)
else:
@@ -113,17 +113,17 @@ def build_bowtie2db(ifn_markers, tmp_dir, error_pipe=None):
def sample2markers(
- ifn_sample,
+ ifn_sample,
min_read_len,
min_align_score,
min_base_quality,
min_read_depth,
error_rate,
- ifn_markers,
+ ifn_markers,
index_path,
- nprocs=1,
+ nprocs=1,
sam2file=None,
- marker2file=None,
+ marker2file=None,
tmp_dir='tmp',
quiet=False,
samtools_exe='samtools',
@@ -136,7 +136,7 @@ def sample2markers(
:param marker2file: the file name to store the consensus markers.
:param sam2file: the file name to store the sam content.
:returns: if marker2file==None, return the dictionary of the consensus
- markers. Otherwise, save the result in fasta format to the file declared by
+ markers. Otherwise, save the result in fasta format to the file declared by
marker2file
'''
@@ -148,7 +148,7 @@ def sample2markers(
# sample to sam
sample_pipe = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', ifn_sample],
stderr=error_pipe
)
@@ -159,7 +159,7 @@ def sample2markers(
stderr=error_pipe
)
bowtie2_pipe = oosp.chain(
- 'bowtie2',
+ 'bowtie2',
args=[
'-U', '-',
'-x', index_path,
@@ -173,13 +173,13 @@ def sample2markers(
else:
oosp.chain(
'compress_file.py',
- args=['--output_file', sam2file],
+ args=['--output_file', sam2file],
in_pipe=bowtie2_pipe,
stderr=error_pipe,
stop=True)
sam_pipe = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', sam2file],
stderr=error_pipe)
ofn_bam_sorted_prefix = os.path.join(
@@ -187,14 +187,14 @@ def sample2markers(
os.path.basename(ifn_sample) + '.bam.sorted')
return sam2markers(
- sam_file=sam_pipe,
+ sam_file=sam_pipe,
ofn_bam_sorted_prefix=ofn_bam_sorted_prefix,
min_align_score=min_align_score,
min_base_quality=min_base_quality,
min_read_depth=min_read_depth,
error_rate=error_rate,
- marker2file=marker2file,
- oosp=oosp,
+ marker2file=marker2file,
+ oosp=oosp,
tmp_dir=tmp_dir,
quiet=quiet,
samtools_exe=samtools_exe,
@@ -211,14 +211,14 @@ def save2file(tmp_file, ofn):
def sam2markers(
- sam_file,
+ sam_file,
ofn_bam_sorted_prefix,
min_align_score=None,
min_base_quality=30,
min_read_depth=0,
error_rate=0.01,
- marker2file=None,
- oosp=None,
+ marker2file=None,
+ oosp=None,
tmp_dir='tmp',
quiet=False,
samtools_exe='samtools',
@@ -247,12 +247,12 @@ def sam2markers(
if type(sam_file) == str:
p1 = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', sam_file],
stderr=error_pipe)
else:
p1 = sam_file
-
+
# filter sam
if min_align_score == None:
p1_filtered = p1
@@ -264,8 +264,8 @@ def sam2markers(
stderr=error_pipe)
# sam to bam
p2 = oosp.chain(
- samtools_exe,
- args=['view', '-bS', '-'],
+ samtools_exe,
+ args=['view', '-bS', '-'],
in_pipe=p1_filtered,
stderr=error_pipe)
@@ -274,12 +274,12 @@ def sam2markers(
for tmp_fn in tmp_fns:
os.remove(tmp_fn)
p3 = oosp.chain(
- samtools_exe,
- args=['sort', '-', '-o', ofn_bam_sorted_prefix],
+ samtools_exe,
+ args=['sort', '-', '-o', ofn_bam_sorted_prefix],
in_pipe=p2,
stderr=error_pipe)
- # extract polimorphic information
+ # extract polimorphic information
marker2seq = defaultdict(dict)
pysam.index(p3.name)
samfile = pysam.AlignmentFile(p3.name)
@@ -307,30 +307,30 @@ def sam2markers(
# bam to mpileup
p3.seek(0)
p4 = oosp.chain(
- samtools_exe,
- args=['mpileup', '-u', '-'],
+ samtools_exe,
+ args=['mpileup', '-u', '-'],
in_pipe=p3,
stderr=error_pipe)
# mpileup to vcf
p5 = oosp.chain(
- bcftools_exe,
- args=['view', '-c', '-g', '-p', '1.1', '-'],
+ bcftools_exe,
+ args=['view', '-c', '-g', '-p', '1.1', '-'],
in_pipe=p4,
stderr=error_pipe)
#stderr=open(os.devnull, 'w'))
# fix AF1=0
p6 = oosp.chain(
- 'fix_AF1.py',
- args=['--input_file', '-'],
+ 'fix_AF1.py',
+ args=['--input_file', '-'],
in_pipe=p5,
stderr=error_pipe)
# vcf to fastq
p7 = oosp.chain(
- 'vcfutils.pl',
- args=['vcf2fq'],
+ 'vcfutils.pl',
+ args=['vcf2fq'],
in_pipe=p6,
get_out_pipe=True,
stderr=error_pipe,
@@ -342,7 +342,7 @@ def sam2markers(
marker2seq = dict(marker2seq)
except Exception as e:
logger.error("sam2markers failed on file " + sam_file)
- raise
+ raise
if type(p1) == file:
p1.close()
@@ -369,17 +369,17 @@ def run_sample(args_list):
marker2file = output_prefix + args['marker2file_ext']
if args['input_type'] == 'fastq':
sample2markers(
- ifn_sample=ifn_sample,
+ ifn_sample=ifn_sample,
min_read_len=args['min_read_len'],
min_align_score=args['min_align_score'],
min_base_quality=args['min_base_quality'],
min_read_depth=args['min_read_depth'],
error_rate=args['error_rate'],
- ifn_markers=args['ifn_markers'],
+ ifn_markers=args['ifn_markers'],
index_path=args['index_path'],
nprocs=args['nprocs'],
sam2file=sam2file,
- marker2file=marker2file,
+ marker2file=marker2file,
tmp_dir=args['output_dir'],
quiet=args['quiet'],
samtools_exe=args['samtools_exe'],
@@ -389,7 +389,7 @@ def run_sample(args_list):
args['output_dir'],
os.path.basename(ifn_sample) + '.bam.sorted')
sam2markers(
- sam_file=ifn_sample,
+ sam_file=ifn_sample,
ofn_bam_sorted_prefix=ofn_bam_sorted_prefix,
min_align_score=args['min_align_score'],
min_base_quality=args['min_base_quality'],
@@ -435,7 +435,7 @@ def main(args):
print e
-
+
if __name__ == "__main__":
args = read_params()
main(args)
=====================================
utils/merge_metaphlan_tables.py
=====================================
@@ -39,9 +39,10 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For each input datum in each input stream...
pos = 0
+ dCol = None
- for f in aaastrIn :
- with open(f) as csvfile :
+ for f in aaastrIn:
+ with open(f) as csvfile:
iIn = csv.reader(csvfile, csv.excel_tab)
# Lines from the current file, empty list to hold data, empty hash to hold ids
@@ -51,18 +52,22 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For a line in the file
for astrLine in iIn:
if astrLine[0].startswith('#'):
+ dCol = astrLine.index('relative_abundance') if 'relative_abundance' in astrLine else None
continue
iLine += 1
- # ID is from first column, data are everything else
- strID, astrData = astrLine[iCol], ( astrLine[:iCol] + astrLine[( iCol + 1 ):] )
+ if dCol:
+ strID, astrData = astrLine[iCol], [astrLine[dCol]]
+ else:
+ # ID is from first column, data are everything else
+ strID, astrData = astrLine[iCol], (astrLine[:iCol] + astrLine[iCol + 1:])
hashIDs[strID] = iLine
- aastrData.append( astrData )
+ aastrData.append(astrData)
# Batch merge every new ID key set
- setstrIDs.update( hashIDs.keys( ) )
+ setstrIDs.update(hashIDs.keys())
pos += 1
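The `merge_metaphlan_tables.py` hunk above changes how each row is split into ID and data: when the header line names a `relative_abundance` column, only that column is kept per clade; otherwise the old behavior (everything except the ID column) applies. A minimal standalone sketch of that selection logic, with `extract_abundance` as a hypothetical helper name not present in the script:

```python
def extract_abundance(header, row, id_col=0):
    """Return (clade_id, data_columns) for one MetaPhlAn table row.

    Mirrors the new merge logic: prefer the 'relative_abundance' column
    from the header if present, else take all non-ID columns.
    """
    d_col = (header.index('relative_abundance')
             if 'relative_abundance' in header else None)
    if d_col is not None:
        return row[id_col], [row[d_col]]
    # legacy layout: ID is the first column, data are everything else
    return row[id_col], row[:id_col] + row[id_col + 1:]

# new-style header with an explicit relative_abundance column
header = ['#clade_name', 'NCBI_tax_id', 'relative_abundance', 'additional_species']
row = ['k__Bacteria', '2', '99.9', '']
print(extract_abundance(header, row))  # ('k__Bacteria', ['99.9'])
```

Note that the diff's own test is `if dCol:` rather than `if dCol is not None:`, so the upstream version would fall back to the legacy path if `relative_abundance` ever occupied column 0; the sketch uses the explicit `None` check.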
View it on GitLab: https://salsa.debian.org/med-team/metaphlan2/compare/51196ec977e34d401797054d8ecb78987e14dde4...6a0a6214036de2c11466dd22ee8df64761ff6a06
--
You're receiving this email because of your account on salsa.debian.org.
More information about the debian-med-commit
mailing list