[med-svn] [Git][med-team/metaphlan2][master] 11 commits: Drop libjs-twitter-bootstrap dependency which is not needed any more
Andreas Tille
gitlab at salsa.debian.org
Mon Sep 17 09:38:10 BST 2018
Andreas Tille pushed to branch master at Debian Med / metaphlan2
Commits:
7cd85f0a by Andreas Tille at 2018-09-17T07:58:49Z
Drop libjs-twitter-bootstrap dependency which is not needed any more
- - - - -
4eea2afd by Andreas Tille at 2018-09-17T07:59:53Z
New upstream version 2.7.63
- - - - -
e0ba8c58 by Andreas Tille at 2018-09-17T07:59:54Z
Update upstream source from tag 'upstream/2.7.63'
Update to upstream version '2.7.63'
with Debian dir 3c4bb064fdf41825dc01c75df9980fcdce7c3a80
- - - - -
d63b5741 by Andreas Tille at 2018-09-17T07:59:54Z
New upstream version
- - - - -
d62d5bfa by Andreas Tille at 2018-09-17T07:59:58Z
Standards-Version: 4.2.1
- - - - -
79eaf51c by Andreas Tille at 2018-09-17T08:08:54Z
Uversionmangle to avoid misinterpretation of the nano version number that is not separated by '.'
- - - - -
6a0a6214 by Andreas Tille at 2018-09-17T08:09:26Z
New upstream version 2.7.8
- - - - -
bfd0b1cc by Andreas Tille at 2018-09-17T08:09:26Z
Update upstream source from tag 'upstream/2.7.8'
Update to upstream version '2.7.8'
with Debian dir 91ffb6ef82155e367d405f9470ac29bfc58d0333
- - - - -
1f16874f by Andreas Tille at 2018-09-17T08:10:24Z
Really latest upstream version
- - - - -
35acfa8d by Andreas Tille at 2018-09-17T08:17:07Z
Refresh patches
- - - - -
bf0c58e9 by Andreas Tille at 2018-09-17T08:18:42Z
Upload to unstable
- - - - -
24 changed files:
- .hg_archival.txt
- .hgignore
- .hgsubstate
- .hgtags
- README.md
- debian/changelog
- debian/control
- debian/patches/mpa_dir-is-usr_share_metaphlan2.patch
- debian/patches/spelling.patch
- debian/rules
- debian/watch
- metaphlan2.py
- strainphlan.py
- strainphlan_src/add_metadata_tree.py
- strainphlan_src/build_tree_single_strain.py
- strainphlan_src/compute_distance.py
- strainphlan_src/compute_distance_all.py
- strainphlan_src/extract_markers.py
- strainphlan_src/ooSubprocess.py
- strainphlan_src/plot_tree_ete2.py
- strainphlan_src/plot_tree_graphlan.py
- strainphlan_src/sam_filter.py
- strainphlan_src/sample2markers.py
- utils/merge_metaphlan_tables.py
Changes:
=====================================
.hg_archival.txt
=====================================
@@ -1,4 +1,4 @@
repo: 092c2fe2278cb7f0b18d81faeb4aab98b89dc096
-node: cbd7880df400b453b8beb4e62b39e4a23b5523b6
+node: 9760413b180fc2c68b817c23602541d3a97528af
branch: default
-tag: 2.7.6
+tag: 2.7.8
=====================================
.hgignore
=====================================
@@ -1,6 +1,7 @@
syntax: glob
-databases/
+metaphlan_databases/
+tests/
*.pyc
build/
dist/
=====================================
.hgsubstate
=====================================
@@ -1,2 +1,2 @@
c168a100f37e23e2c110849a8d91fac8da49f5bd utils/export2graphlan
-35dfd725e7f024fc6d0edef0cc191c7963108787 utils/hclust2
+69efddea43ae5e37d761cd138fcf083090371d1a utils/hclust2
=====================================
.hgtags
=====================================
@@ -14,3 +14,8 @@ a1fe0d15320c04f69d56f1b7dd31cff972a7b8df 2.7.2
b2f9b3286d4be376805e3b5c26cf141ed375c605 2.7.5
847b250adbe97b9f4adc7e15f0d4bb5a66e782ec 2.7.4
178d1aaf4ac76e5d5477833e8e614104dcd32088 2.7.3
+cbd7880df400b453b8beb4e62b39e4a23b5523b6 2.7.6
+1365be3b4e5a68ed872911e949a25d0944f822b2 2.7.61
+4d50b9ccd95234a436541db13bcb10741f99138b 2.7.62
+a71e7d6b9d50b89c4b80714af145e10d050be7cf 2.7.63
+3e9c612e0a922b73214c90db51c5e26be17bf8b6 2.7.7
=====================================
README.md
=====================================
@@ -2,11 +2,26 @@
# MetaPhlAn 2: Metagenomic Phylogenetic Analysis
+## Installation
+
+MetaPhlAn 2.0 can be obtained
+
+Through **Bioconda**
+
+``$ conda install metaphlan2``
+
+by **direct download** from [Bitbucket](https://bitbucket.org/biobakery/metaphlan2/get/default.zip)
+
+or **cloning the repository** using the following command
+
+``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
+
+----------------------
## Description
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
-MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file mpa_v20_m200_marker_info.txt.bz2 can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
+MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file `mpa_v20_m200_marker_info.txt.bz2` can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
* unambiguous taxonomic assignments;
* accurate estimation of organismal relative abundance;
@@ -29,7 +44,7 @@ If you use StrainPhlAn, please cite the MetaPhlAn2 paper and the following Strai
-------------
-## MetaPhlAn and StrainPhlAn resources
+## MetaPhlAn and StrainPhlAn tutorials and resources
In addition to the information on this page, you can refer to the following additional resources.
@@ -43,14 +58,17 @@ In addition to the information on this page, you can refer to the following addi
* The related [bioBakery workflows](https://bitbucket.org/biobakery/biobakery/wiki/biobakery_workflows)
-
-------------
## Pre-requisites
-MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](http://www.numpy.org/) libraries installed
- (apart for numpy they are usually installed together with the python distribution).
- Python3 is also now supported.
+MetaPhlAn requires *python 2.7* or higher with the argparse, tempfile, [numpy](http://www.numpy.org/), and [Biopython](https://biopython.org/) libraries installed
+(apart from numpy and Biopython, the others are usually installed together with the python distribution).
+Python3 is also now supported.
+
+MetaPhlAn requires the `read_fastx.py` script to be present in the system path; if it is not found there, MetaPhlAn will look for it in the folder containing the `metaphlan2.py`
+script, under `utils/read_fastx.py`.
+If you have moved the `metaphlan2.py` script, please add the directory containing `read_fastx.py` to your PATH environment variable.
**If you provide the SAM output of [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) as input, there are no additional prerequisites.**
@@ -62,22 +80,8 @@ MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](
* MetaPhlAn is not tightly integrated with advanced heatmap plotting with [hclust2](https://bitbucket.org/nsegata/hclust2) and cladogram visualization with [GraPhlAn](https://bitbucket.org/nsegata/graphlan/wiki/Home). If you use such visualization tool please refer to their prerequisites.
-----------------------
-
-## Installation
-
-MetaPhlAn 2.0 can be obtained by either
-
-Dowloading MetaPhlAn v2.0 [here](https://bitbucket.org/biobakery/metaphlan2/get/default.zip) or [here](https://www.dropbox.com/s/ztqr8qgbo727zpn/metaphlan2.zip?dl=0)
-
-**OR**
-
-Cloning the repository via the following commands
-``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
-
--------------------------
-
## Basic Usage
This section presents some basic usages of MetaPhlAn2; for more advanced usages, please see [its wiki](https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2).
@@ -97,55 +101,28 @@ Here is the basic example to profile a metagenome from raw reads (requires BowTi
$ metaphlan2.py metagenome.fastq --input_type fastq > profiled_metagenome.txt
```
-It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:
+It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (`--bowtie2out`), and use multiple CPUs (`--nproc`) if available:
```
#!bash
$ metaphlan2.py metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt
```
-If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):
+If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in a few seconds by using the previously saved `--bowtie2out` file and specifying the input (`--input_type bowtie2out`):
```
#!bash
$ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out > profiled_metagenome.txt
```
-You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
+You can also provide an externally BowTie2-mapped SAM if you specify this format with `--input_type`. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
```
#!bash
-$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
```
-In order to make MetaPhlAn 2 easily compatible with complex metagenomic pipeline, there are now multiple alternative ways to pass the input:
-
-```
-#!bash
-$ cat metagenome.fastq | metaphlan2.py --input_type fastq > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq --bowtie2db ${mpa_dir}/db_v20/mpa_v20_m200 > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq < metagenome.fastq > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2) > profiled_metagenome.txt
-```
-
-```
-#!bash
-$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz) > profiled_metagenome.txt
-```
-
MetaPhlAn 2 can also natively **handle paired-end metagenomes** (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
```
@@ -208,16 +185,9 @@ $ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out
* You can also provide an externally BowTie2-mapped SAM if you specify this format with
--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:
-$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
-* Multiple alternative ways to pass the input are also available:
-$ cat metagenome.fastq | metaphlan2.py --input_type fastq
-$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq
-$ metaphlan2.py --input_type fastq < metagenome.fastq
-$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)
-$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)
-
* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
@@ -254,7 +224,7 @@ $ metaphlan2.py -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out
* Finally, to obtain all markers present for a specific clade and all its subclades, the
'-t clade_specific_strain_tracker' should be used. For example, the following command
reports the presence/absence of the markers for the B. fragilis species and its strains
-$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 db_v20/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
+$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
the optional argument --min_ab specifies the minimum clade abundance for reporting the markers
-------------------------------------------------------------------
@@ -501,7 +471,7 @@ In order to add a marker to the database, the user needs the following steps:
```
#!bash
-bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
+bowtie2-inspect metaphlan2/databases/mpa_v20_m200 > metaphlan2/markers.fasta
```
* Add the marker sequence stored in a file new_marker.fasta to the marker set:
@@ -527,7 +497,7 @@ bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
import cPickle as pickle
import bz2
-db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
+db = pickle.load(bz2.BZ2File('databases/mpa_v20_m200.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = length of genome1
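
As a minimal end-to-end sketch of the database-editing step touched by this hunk: it only uses the `taxonomy` key shown above, and the clade name, genome length, and output filename are placeholders, not values from the repository.

```
#!python
import bz2
import pickle  # cPickle on Python 2

# Load the pickled MetaPhlAn database (example path, as in the snippet above)
with bz2.BZ2File('databases/mpa_v20_m200.pkl', 'r') as handle:
    db = pickle.load(handle)

# Register a new genome: map its taxonomy string to its genome length (placeholders)
db['taxonomy']['k__Bacteria|...|s__New_species'] = 4321000

# Save the updated database, bz2-compressed, under a new version name
with bz2.BZ2File('databases/mpa_v21_m200.pkl', 'w') as handle:
    pickle.dump(db, handle, pickle.HIGHEST_PROTOCOL)
```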
@@ -577,12 +547,12 @@ In addition, the table below shows the number of snps between the sample strains
In the next sections, we will illustrate step by step how to run MetaPhlAn\_Strainer on this toy example to reproduce the above figures.
### Pre-requisites
-StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python) and [numpy](http://www.numpy.org/), [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). Besides, StrainPhlAn also needs the following programs in the executable path:
+StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python), [pandas](https://pandas.pydata.org) (tested on **version 0.22**), [numpy](http://www.numpy.org/) (tested on **version 1.14.2**), [scipy](https://www.scipy.org) (tested on **version 1.0.0**) and [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). In addition, StrainPhlAn needs the following programs in the executable path:
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for mapping reads against the marker database.
* [MUSCLE](http://www.drive5.com/muscle/) for the alignment step.
* [samtools, bcftools and vcfutils.pl](http://samtools.sourceforge.net/) which can be downloaded from [here](https://github.com/samtools) for building consensus markers. Note that vcfutils.pl is included in bcftools and **StrainPhlAn only works with samtools version 0.1.19** as samtools has changed the output format after this version.
-* [blastn](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree.
+* [blast+](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree (blastn and makeblastdb commands)
* [raxmlHPC and raxmlHPC-PTHREADS-SSE3](http://sco.h-its.org/exelixis/web/software/raxml/index.html) for building the phylogenetic trees.
All dependency binaries for 64-bit Linux can be downloaded from the folder "bin" at [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
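
A quick way to confirm that the tools listed above are actually on the executable path before starting: a simple sketch using Python 3's `shutil.which`, not StrainPhlAn's own `check_dependencies` routine; adjust the list to the binaries you need.

```
#!python
import shutil
import sys

# Executables named in the list above (raxmlHPC-PTHREADS-SSE3, vcfutils.pl, etc. can be added)
required = ['bowtie2', 'muscle', 'samtools', 'bcftools', 'blastn', 'makeblastdb', 'raxmlHPC']

missing = [tool for tool in required if shutil.which(tool) is None]
if missing:
    sys.exit('Missing required executables: ' + ', '.join(missing))
print('All required executables found.')
```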
@@ -621,7 +591,7 @@ for f in $(ls fastqs/*.bz2)
do
echo "Running metaphlan2 on ${f}"
bn=$(basename ${f} | cut -d . -f 1)
- tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../db_v20/mpa_v20_m200 --mpa_pkl ../db_v20/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
+ tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../databases/mpa_v20_m200 --mpa_pkl ../databases/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
done
```
@@ -654,8 +624,8 @@ The commands to run are:
```
#!python
mkdir -p db_markers
-bowtie2-inspect ../db_v20/mpa_v20_m200 > db_markers/all_markers.fasta
-python ../strainphlan_src/extract_markers.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
+bowtie2-inspect ../databases/mpa_v20_m200 > db_markers/all_markers.fasta
+python ../strainphlan_src/extract_markers.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
```
Note that the "all\_markers.fasta" file consists can be reused for extracting other reference genomes.
@@ -666,7 +636,7 @@ Before building the trees, we should get the list of all clades detected from th
```
#!python
-python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
+python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
```
The clade names in the output file "clades.txt" will be used for the next step.
@@ -681,7 +651,7 @@ The commands to run are:
```
#!python
mkdir -p output
-python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
+python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
```
This step will take around 2 minutes. After this step, you will find the tree "output/RAxML\_bestTree.s\_\_Bacteroides\_caccae.tree". All the output files can be found in the folder "output" in [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
=====================================
debian/changelog
=====================================
@@ -1,3 +1,14 @@
+metaphlan2 (2.7.8-1) unstable; urgency=medium
+
+ * Drop libjs-twitter-bootstrap dependency which is not needed any more
+ Closes: #908426
+ * New upstream version
+ * Standards-Version: 4.2.1
+ * d/watch: Uversionmangle to avoid misinterpretation of the nano version
+ number that is not separated by '.'
+
+ -- Andreas Tille <tille at debian.org> Mon, 17 Sep 2018 10:17:22 +0200
+
metaphlan2 (2.7.6-1) unstable; urgency=medium
[ Andreas Tille ]
=====================================
debian/control
=====================================
@@ -8,7 +8,7 @@ Build-Depends: debhelper (>= 11~),
dh-python,
pandoc,
bowtie2
-Standards-Version: 4.1.4
+Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/metaphlan2
Vcs-Git: https://salsa.debian.org/med-team/metaphlan2.git
Homepage: https://bitbucket.org/biobakery/metaphlan2
@@ -21,8 +21,7 @@ Depends: ${python:Depends},
python-biom-format,
python-msgpack,
python-pandas,
- bowtie2,
- libjs-twitter-bootstrap
+ bowtie2
Description: Metagenomic Phylogenetic Analysis
MetaPhlAn is a computational tool for profiling the composition of
microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from
=====================================
debian/patches/mpa_dir-is-usr_share_metaphlan2.patch
=====================================
@@ -15,7 +15,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
+ "$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x /usr/share/metaphlan2/db_v20/mpa_v20_m200 -U metagenome.fastq\n"
"$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt\n\n"
- "* Multiple alternative ways to pass the input are also available:\n"
+ # "* Multiple alternative ways to pass the input are also available:\n"
@@ -1391,7 +1391,7 @@ def metaphlan2():
# check for the mpa_pkl file
if not os.path.isfile(pars['mpa_pkl']):
@@ -27,7 +27,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
sys.exit(1)
--- a/README.md
+++ b/README.md
-@@ -82,33 +82,27 @@ Cloning the repository via the following
+@@ -86,33 +86,27 @@ In case you moved the `metaphlan2.py` sc
This section presents some basic usages of MetaPhlAn2, for more advanced usages, please see at [its wiki](https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2).
@@ -48,7 +48,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
+$ metaphlan2 metagenome.fastq --input_type fastq > profiled_metagenome.txt
```
- It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:
+ It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (`--bowtie2out`), and use multiple CPUs (`--nproc`) if available:
```
#!bash
@@ -56,7 +56,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
+$ metaphlan2 metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt
```
- If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):
+ If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved `--bowtie2out` file and specifying the input (`--input_type bowtie2out`):
```
#!bash
@@ -64,47 +64,15 @@ Description: Instead of setting mpa_dir bash variable the path to the
+$ metaphlan2 metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out > profiled_metagenome.txt
```
- You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
-@@ -116,41 +110,41 @@ You can also provide an externally BowTi
+ You can also provide an externally BowTie2-mapped SAM if you specify this format with `--input_type`. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
+@@ -120,14 +114,14 @@ You can also provide an externally BowTi
```
#!bash
- $ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+ $ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
-$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
+$ metaphlan2 metagenome.sam --input_type sam > profiled_metagenome.txt
```
- In order to make MetaPhlAn 2 easily compatible with complex metagenomic pipeline, there are now multiple alternative ways to pass the input:
-
- ```
- #!bash
--$ cat metagenome.fastq | metaphlan2.py --input_type fastq > profiled_metagenome.txt
-+$ cat metagenome.fastq | metaphlan2 --input_type fastq > profiled_metagenome.txt
- ```
-
- ```
- #!bash
--$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq --bowtie2db ${mpa_dir}/db_v20/mpa_v20_m200 > profiled_metagenome.txt
-+$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2 --input_type fastq --bowtie2db ${mpa_dir}/db_v20/mpa_v20_m200 > profiled_metagenome.txt
- ```
-
- ```
- #!bash
--$ metaphlan2.py --input_type fastq < metagenome.fastq > profiled_metagenome.txt
-+$ metaphlan2 --input_type fastq < metagenome.fastq > profiled_metagenome.txt
- ```
-
- ```
- #!bash
--$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2) > profiled_metagenome.txt
-+$ metaphlan2 --input_type fastq <(bzcat metagenome.fastq.bz2) > profiled_metagenome.txt
- ```
-
- ```
- #!bash
--$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz) > profiled_metagenome.txt
-+$ metaphlan2 --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz) > profiled_metagenome.txt
- ```
-
MetaPhlAn 2 can also natively **handle paired-end metagenomes** (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
```
@@ -114,7 +82,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
```
For advanced options and other analysis types (such as strain tracking) please refer to the full command-line options.
-@@ -159,7 +153,7 @@ For advanced options and other analysis
+@@ -136,7 +130,7 @@ For advanced options and other analysis
```
@@ -123,7 +91,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
{fastq,fasta,multifasta,multifastq,bowtie2out,sam}
[--mpa_pkl MPA_PKL] [--bowtie2db METAPHLAN_BOWTIE2_DB]
[--bt2_ps BowTie2 presets] [--bowtie2_exe BOWTIE2_EXE]
-@@ -184,7 +178,7 @@ AUTHORS: Nicola Segata (nicola.segata at un
+@@ -161,7 +155,7 @@ AUTHORS: Nicola Segata (nicola.segata at un
COMMON COMMANDS
@@ -132,7 +100,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
main MetaPhlAn folder. Also BowTie2 should be in the system path with execution and read
permissions, and Perl should be installed.
-@@ -195,32 +189,32 @@ strains in particular cases) present in
+@@ -172,25 +166,25 @@ strains in particular cases) present in
relative abundance. This correspond to the default analysis type (--analysis_type rel_ab).
* Profiling a metagenome from raw reads:
@@ -152,23 +120,10 @@ Description: Instead of setting mpa_dir bash variable the path to the
* You can also provide an externally BowTie2-mapped SAM if you specify this format with
--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:
--$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
+ $ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
-$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
-+$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x /usr/share/metaphlan2/db_v20/mpa_v20_m200 -U metagenome.fastq
+$ metaphlan2 metagenome.sam --input_type sam > profiled_metagenome.txt
- * Multiple alternative ways to pass the input are also available:
--$ cat metagenome.fastq | metaphlan2.py --input_type fastq
--$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq
--$ metaphlan2.py --input_type fastq < metagenome.fastq
--$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)
--$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)
-+$ cat metagenome.fastq | metaphlan2 --input_type fastq
-+$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2 --input_type fastq
-+$ metaphlan2 --input_type fastq < metagenome.fastq
-+$ metaphlan2 --input_type fastq <(bzcat metagenome.fastq.bz2)
-+$ metaphlan2 --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)
-
* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
-$ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
@@ -176,7 +131,7 @@ Description: Instead of setting mpa_dir bash variable the path to the
-------------------------------------------------------------------
-@@ -238,23 +232,23 @@ file saved during the execution of the d
+@@ -208,23 +202,23 @@ file saved during the execution of the d
* The following command will output the abundance of each marker with a RPK (reads per kil-base)
higher 0.0. (we are assuming that metagenome_outfmt.bz2 has been generated before as
shown above).
@@ -199,12 +154,12 @@ Description: Instead of setting mpa_dir bash variable the path to the
* Finally, to obtain all markers present for a specific clade and all its subclades, the
'-t clade_specific_strain_tracker' should be used. For example, the following command
is reporting the presence/absence of the markers for the B. fragulis species and its strains
--$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 db_v20/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
-+$ metaphlan2 -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 db_v20/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
+-$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
++$ metaphlan2 -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
the optional argument --min_ab specifies the minimum clade abundance for reporting the markers
-------------------------------------------------------------------
-@@ -551,7 +545,7 @@ pickle.dump(db, ofile, pickle.HIGHEST_PR
+@@ -521,7 +515,7 @@ pickle.dump(db, ofile, pickle.HIGHEST_PR
ofile.close()
```
@@ -213,16 +168,16 @@ Description: Instead of setting mpa_dir bash variable the path to the
## Metagenomic strain-level population genomics
-@@ -621,7 +615,7 @@ for f in $(ls fastqs/*.bz2)
+@@ -591,7 +585,7 @@ for f in $(ls fastqs/*.bz2)
do
echo "Running metaphlan2 on ${f}"
bn=$(basename ${f} | cut -d . -f 1)
-- tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../db_v20/mpa_v20_m200 --mpa_pkl ../db_v20/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
+- tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../databases/mpa_v20_m200 --mpa_pkl ../databases/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
+ tar xjfO ${f} | metaphlan2 --bowtie2db /usr/share/metaphlan2/db_v20/mpa_v20_m200 --mpa_pkl /usr/share/metaphlan2/db_v20/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
done
```
-@@ -761,4 +755,4 @@ In the output folder, you can find the f
+@@ -731,4 +725,4 @@ In the output folder, you can find the f
1. clade_name.fasta: the alignment file of all metagenomic strains.
3. *.marker_pos: this file shows the starting position of each marker in the strains.
3. *.info: this file shows the general information like the total length of the concatenated markers (full sequence length), number of used markers, etc.
=====================================
debian/patches/spelling.patch
=====================================
@@ -4,7 +4,7 @@ Description: Spelling
--- a/README.md
+++ b/README.md
-@@ -337,7 +337,7 @@ Post-mapping arguments:
+@@ -307,7 +307,7 @@ Post-mapping arguments:
Additional analysis types and arguments:
-t ANALYSIS TYPE Type of analysis to perform:
* rel_ab: profiling a metagenomes in terms of relative abundances
@@ -13,7 +13,7 @@ Description: Spelling
* reads_map: mapping from reads to clades (only reads hitting a marker)
* clade_profiles: normalized marker counts for clades with at least a non-null marker
* marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)
-@@ -743,7 +743,7 @@ python ../strainphlan.py -h
+@@ -713,7 +713,7 @@ python ../strainphlan.py -h
The default setting can be stringent for some cases where you have very few samples left in the phylogenetic tree. You can relax some parameters to add more samples back:
1. *marker\_in\_clade*: In each sample, the clades with the percentage of present markers less than this threshold are removed. Default "0.8". You can set this parameter to "0.5" to add some more samples.
@@ -24,7 +24,7 @@ Description: Spelling
5. *relaxed\_parameters*: use this option to automatically set the above parameters to add some more samples by accepting some more gaps, Ns, etc. This option is equivalent to set: marker\_in\_clade=0.5, sample\_in\_marker=0.5, N\_in\_marker=0.5, gap\_in\_sample=0.5. Default "False".
--- a/strainphlan.py
+++ b/strainphlan.py
-@@ -328,7 +328,7 @@ def read_params():
+@@ -337,7 +337,7 @@ def read_params():
required=False,
default=['all'],
type=str,
@@ -35,7 +35,7 @@ Description: Spelling
'file where each clade name is on a line will be read.'
--- a/metaphlan2.py
+++ b/metaphlan2.py
-@@ -595,7 +595,7 @@ def read_params(args):
+@@ -596,7 +596,7 @@ def read_params(args):
default='rel_ab', help =
"Type of analysis to perform: \n"
" * rel_ab: profiling a metagenomes in terms of relative abundances\n"
=====================================
debian/rules
=====================================
@@ -8,9 +8,7 @@
override_dh_auto_build:
dh_auto_build
pandoc -s --from markdown --to html README.md | \
- sed -e 's#"[^"]*//cdnjs.cloudflare.com/ajax/libs/html5shiv/.*/html5shiv-printshiv.min.js"#"file:///usr/share/twitter-bootstrap/files/html5shiv/html5shiv-printshiv.min.js"#' \
- -e 's#"https://bitbucket.org/repo/r[Mm]969[Kk]/images#"images#' \
- > README.html
+ sed -e 's#"https://bitbucket.org/repo/r[Mm]969[Kk]/images#"images#' > README.html
override_dh_installchangelogs:
dh_installchangelogs changeset.txt
=====================================
debian/watch
=====================================
@@ -1,4 +1,5 @@
version=4
#opts="repacksuffix=+ds,dversionmangle=s/\+ds//g,repack,compression=xz" \
-https://bitbucket.org/biobakery/metaphlan2/downloads?tab=tags .*/(\d[\d.]+)\.tar\.bz2
+opts="uversionmangle=s/(\d\.\d\.\d)(\d)/$1.$2/" \
+ https://bitbucket.org/biobakery/metaphlan2/downloads?tab=tags .*/(\d[\d.]+)\.tar\.bz2
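
To illustrate what the new `uversionmangle` rule buys: upstream tags such as `2.7.61` carry a fourth "nano" digit with no separating dot, so a plain version comparison would rank 2.7.61 above 2.7.8 (61 > 8). The rule inserts the missing dot. A rough Python equivalent of the Perl substitution, for illustration only:

```
#!python
import re

def mangle(version):
    # Same idea as s/(\d\.\d\.\d)(\d)/$1.$2/ in debian/watch:
    # split a trailing fourth digit off with a dot.
    return re.sub(r'(\d\.\d\.\d)(\d)', r'\1.\2', version)

print(mangle('2.7.61'))  # -> 2.7.6.1 (now sorts below 2.7.8)
print(mangle('2.7.8'))   # -> 2.7.8 (no fourth digit, left unchanged)
```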
=====================================
metaphlan2.py
=====================================
@@ -16,8 +16,8 @@ from __future__ import with_statement
__author__ = ('Nicola Segata (nicola.segata at unitn.it), '
'Duy Tin Truong, '
'Francesco Asnicar (f.asnicar at unitn.it)')
-__version__ = '2.7.6'
-__date__ = '2 March 2018'
+__version__ = '2.7.7'
+__date__ = '31 May 2018'
import sys
@@ -62,7 +62,7 @@ DATABASE_DOWNLOAD = "https://bitbucket.org/biobakery/metaphlan2/downloads/"
# get the directory that contains this script
metaphlan2_script_install_folder = os.path.dirname(os.path.abspath(__file__))
# get the default database folder
-DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "databases")
+DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "metaphlan_databases")
#**********************************************************************************************
@@ -420,12 +420,12 @@ def read_params(args):
"$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq\n"
"$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt\n\n"
- "* Multiple alternative ways to pass the input are also available:\n"
- "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
- "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
- "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
- "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
- "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
+ # "* Multiple alternative ways to pass the input are also available:\n"
+ # "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
+ # "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
+ # "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
+ # "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
+ # "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
"* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in \n"
" multiple files (but you need to specify the --bowtie2out parameter):\n"
@@ -503,10 +503,11 @@ def read_params(args):
help=("The BowTie2 database file of the MetaPhlAn database. Used if "
"--input_type is fastq, fasta, multifasta, or multifastq [default "+DEFAULT_DB_FOLDER+"]\n"))
+ INDEX = 'v20_m200'
arg('-x', '--index', type=str, default='v20_m200',
help=("Specify the id of the database version to use. If the database\n"
"files are not found on the local MetaPhlAn2 installation they\n"
- "will be automatically downloaded\n"))
+ "will be automatically downloaded [default "+INDEX+"]\n"))
bt2ps = ['sensitive', 'very-sensitive', 'sensitive-local', 'very-sensitive-local']
arg('--bt2_ps', metavar="BowTie2 presets", default='very-sensitive',
@@ -647,9 +648,9 @@ def read_params(args):
help="Only checks if the MetaPhlAn2 DB is installed and installs it if not. All other parameters are ignored.")
arg('--read_min_len', type=int, default=70,
help="Specify the minimum length of the reads to be considered when parsing the input file with "
- "'read_fastx.py' script, default value is 60")
+ "'read_fastx.py' script, default value is 70")
arg('-v', '--version', action='version',
- version="MetaPhlAn version {}\t({})".format(__version__, __date__),
+ version="MetaPhlAn version {} ({})".format(__version__, __date__),
help="Prints the current MetaPhlAn version and exit")
arg("-h", "--help", action="help", help="show this help message and exit")
@@ -748,7 +749,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
for row in f:
md5_md5 = row.strip().split(' ')[0]
else:
- sys.stderr.write('File "{}" not found!'.format(md5_file))
+ sys.stderr.write('File "{}" not found!\n'.format(md5_file))
# compute MD5 of .tar.bz2
if os.path.isfile(tar_file):
@@ -760,7 +761,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
md5_tar = hash_md5.hexdigest()[:32]
else:
- sys.stderr.write('File "{}" not found!'.format(tar_file))
+ sys.stderr.write('File "{}" not found!\n'.format(tar_file))
if (md5_tar is None) or (md5_md5 is None):
sys.exit("MD5 checksums not found, something went wrong!")
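
For context, the checks touched in this hunk amount to reading the expected checksum from the downloaded `.md5` file and comparing it with a freshly computed digest of the tarball. A small self-contained sketch of that logic, assuming only the behaviour visible above (the function name is made up):

```
#!python
import hashlib
import os
import sys

def md5_matches(tar_file, md5_file):
    """Compare the recorded MD5 with the digest of the downloaded tarball."""
    if not (os.path.isfile(tar_file) and os.path.isfile(md5_file)):
        sys.exit("MD5 checksums not found, something went wrong!")

    with open(md5_file) as f:
        expected = f.read().strip().split(' ')[0]  # first token of the .md5 file

    hash_md5 = hashlib.md5()
    with open(tar_file, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):  # hash the archive in chunks
            hash_md5.update(chunk)

    return hash_md5.hexdigest() == expected
```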
@@ -1264,7 +1265,7 @@ def map2bbh(mapping_f, input_type='bowtie2out', min_alignment_len=None):
return markers2reads
-def maybe_generate_biom_file(pars, abundance_predictions):
+def maybe_generate_biom_file(tree, pars, abundance_predictions):
json_key = "MetaPhlAn2"
if not pars['biom']:
@@ -1291,21 +1292,20 @@ def maybe_generate_biom_file(pars, abundance_predictions):
return tree.all_clades[name]
def to_biomformat(clade_name):
- return { 'taxonomy': clade_name.split(delimiter) }
+ return {'taxonomy': clade_name.split(delimiter)}
- clades = iter( (abundance, findclade(name))
- for (name, abundance) in abundance_predictions
- if istip(name) )
- packed = iter( ([abundance], clade.get_full_name(), clade.id)
- for (abundance, clade) in clades )
+ clades = iter((abundance, findclade(name))
+ for (name, abundance) in abundance_predictions if istip(name))
+ packed = iter(([abundance], clade.get_full_name(), clade.id)
+ for (abundance, clade) in clades)
- #unpack that tuple here to stay under 80 chars on a line
+ # unpack that tuple here to stay under 80 chars on a line
data, clade_names, clade_ids = zip(*packed)
# biom likes column vectors, so we give it an array like this:
# np.array([a],[b],[c])
data = np.array(data)
sample_ids = [pars['sample_id']]
- table_id='MetaPhlAn2_Analysis'
+ table_id = 'MetaPhlAn2_Analysis'
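
The reformatted generator expressions above rely on the `zip(*iterable)` idiom to transpose `([abundance], name, id)` tuples into the parallel column vectors that the BIOM table expects; a toy illustration of that idiom with made-up values:

```
#!python
import numpy as np

# Each entry mimics ([abundance], clade_full_name, clade_id) from the code above
packed = [([12.5], 's__Example_one', 1),
          ([87.5], 's__Example_two', 2)]

# zip(*packed) transposes the tuples into three parallel sequences
data, clade_names, clade_ids = zip(*packed)

data = np.array(data)   # column vector of abundances
print(data.shape)       # (2, 1)
print(clade_names)      # ('s__Example_one', 's__Example_two')
```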
@@ -1493,7 +1493,7 @@ def metaphlan2():
outf.write( "\t".join( [k,str(v)] ) + "\n" )
else:
outf.write( "unclassified\t100.0\n" )
- maybe_generate_biom_file(pars, outpred)
+ maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'rel_ab_w_read_stats':
cl2ab, rr = tree.relative_abundances(
pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
@@ -1520,7 +1520,7 @@ def metaphlan2():
outf.write( "#estimated total number of reads from known clades: " + str(totl)+"\n")
else:
outf.write( "unclassified\t100.0\n" )
- maybe_generate_biom_file(pars, outpred)
+ maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'clade_profiles':
cl2pr = tree.clade_profiles( pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
=====================================
strainphlan.py
=====================================
@@ -2,9 +2,11 @@
# Author: Duy Tin Truong (duytin.truong at unitn.it)
# at CIBIO, University of Trento, Italy
-__author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
-__version__ = '1.0.0'
-__date__ = '2nd August 2016'
+__author__ = ('Duy Tin Truong (duytin.truong at unitn.it), '
+ 'Francesco Asnicar (f.asnicar at unitn.it), '
+ 'Moreno Zolfo (moreno.zolfo at unitn.it)')
+__version__ = '1.2.0'
+__date__ = '31 May 2018'
import sys
import os
@@ -17,7 +19,10 @@ sys.path.append(os.path.join(MAIN_DIR, 'strainphlan_src'))
import which
import argparse as ap
-import cPickle as pickle
+try:
+ import cPickle as pickle
+except ImportError:
+ import pickle
import msgpack
import glob
from mixed_utils import statistics
@@ -32,9 +37,9 @@ from Bio.Alphabet import IUPAC
import pandas
import logging
import logging.config
-import sample2markers
-import copy
-import threading
+# import sample2markers
+# import copy
+# import threading
import numpy
import random
import gc
@@ -86,12 +91,16 @@ def read_params():
help='The representative sample. The marker list of each species '\
'extracted from this sample will be used for all other samples.')
+ p.add_argument('-x', '--index', required=False, default="v20_m200", type=str,
+ help=("Specify the id of the database version to use. If the database files are not found "
+ "on the local MetaPhlAn2 installation they will be automatically downloaded"))
+
p.add_argument(
'--mpa_pkl',
required=False,
- default=os.path.join(metaphlan2_script_install_folder,"db_v20","mpa_v20_m200.pkl"),
+ default=os.path.join(metaphlan2_script_install_folder, "metaphlan_databases"),
type=str,
- help='The database of metaphlan3.py.')
+ help='The database of metaphlan2.py')
p.add_argument(
'--output_dir',
@@ -1166,15 +1175,29 @@ def load_sample(args):
marker2seq = {}
# reformat 'pileup'
+ marker2seq2exclude = []
for m in marker2seq:
- freq = marker2seq[m]['freq']
- marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
- range(len(marker2seq[m]['seq']))]
- for p in freq:
- marker2seq[m]['freq'][p] = freq[p]
- marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
+
+ # FIX: Check if freq vector is longer than the reconstructed sequence. In this case, the marker is removed
+ # TODO: uniform the two vectors in samples2markers.py
+ if len(marker2seq[m]['freq']) > len(marker2seq[m]['seq']):
+ logger.info('Skipping marker'+m+' as sequence is reconstructed incorrectly')
+ marker2seq2exclude.append(m)
+
+ else:
+ freq = marker2seq[m]['freq']
+ marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
+ range(len(marker2seq[m]['seq']))]
+ for p in freq:
+ marker2seq[m]['freq'][p] = freq[p]
+ marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
+
+ # returning only the markers for which sequence was reconstructed correctly
+ for marker2exclude in marker2seq2exclude:
+ del marker2seq[marker2exclude]
return marker2seq
+
else:
# remove redundant clades and markers
clade2n_markers = defaultdict(int)
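
To make the intent of this fix concrete, here is a minimal standalone version of the same filtering step, run on a made-up `marker2seq` dictionary with the `seq`/`freq` layout used above; it drops any marker whose frequency vector is longer than its reconstructed sequence and expands the remaining ones position by position:

```
#!python
# Toy input: one consistent marker and one whose frequency vector is too long
marker2seq = {
    'marker_ok':  {'seq': 'ACG-T', 'freq': {1: (0.9, 0.1, 0.0)}},
    'marker_bad': {'seq': 'AC',    'freq': {0: (1, 0, 0), 1: (1, 0, 0), 2: (1, 0, 0)}},
}

to_exclude = []
for m in marker2seq:
    if len(marker2seq[m]['freq']) > len(marker2seq[m]['seq']):
        to_exclude.append(m)  # sequence reconstructed incorrectly: drop this marker
    else:
        freq = marker2seq[m]['freq']
        # expand the sparse {position: value} dict into one entry per position
        marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for _ in marker2seq[m]['seq']]
        for p in freq:
            marker2seq[m]['freq'][p] = freq[p]
        marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N')  # no gaps

for m in to_exclude:
    del marker2seq[m]

print(sorted(marker2seq))  # ['marker_ok']
```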
@@ -1533,6 +1556,12 @@ def check_dependencies(args):
def strainphlan():
args = read_params()
+
+ # fix db .pkl file
+ if '--mpa_pkl' not in sys.argv:
+ if os.path.isfile(os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")):
+ args['mpa_pkl'] = os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")
+
check_dependencies(args)
strainer(args)
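
The net effect of the new block above: when `--mpa_pkl` is not passed explicitly, the default now names the `metaphlan_databases` directory and the concrete `.pkl` file is derived from `--index`. A small sketch of that resolution step (the function name and example arguments are illustrative only):

```
#!python
import os

def resolve_mpa_pkl(mpa_pkl, index, argv):
    """Turn the default database directory into a concrete mpa_<index>.pkl path."""
    if '--mpa_pkl' not in argv:
        candidate = os.path.join(mpa_pkl, "mpa_" + index + ".pkl")
        if os.path.isfile(candidate):
            return candidate
    return mpa_pkl  # user-supplied value, or file not downloaded yet

# Example: default directory plus the default index
print(resolve_mpa_pkl('metaphlan_databases', 'v20_m200', []))
```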
=====================================
strainphlan_src/add_metadata_tree.py
=====================================
@@ -3,14 +3,14 @@
# at CIBIO, University of Trento, Italy
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import pandas
-import copy
-import ConfigParser
+# import copy
+# import ConfigParser
import dendropy
-import numpy
+# import numpy
# import ipdb
@@ -18,13 +18,13 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_trees', nargs='+', required=True, default=None, type=str)
p.add_argument('--ifn_metadatas', nargs='+', required=True, default=None, type=str)
- p.add_argument('--string_to_remove',
+ p.add_argument('--string_to_remove',
required=False, default='', type=str,
help='string to be removed in the tree node names')
p.add_argument(
- '--metadatas',
- nargs='+',
- required=False,
+ '--metadatas',
+ nargs='+',
+ required=False,
default=['all'],
type=str,
help='The metadata fields that you want to add. '\
@@ -52,9 +52,9 @@ def main(args):
index_col = get_index_col(ifn)
df = pandas.read_csv(
ifn,
- sep='\t',
+ sep='\t',
dtype=unicode,
- header=0,
+ header=0,
index_col=index_col)
df = df.transpose()
df_list.append(df)
@@ -73,8 +73,8 @@ def main(args):
sample = node.get_node_str().strip("'")
sample = sample.replace(' ', '_')
sample = sample.replace(args['string_to_remove'], '')
- prefixes = [prefix for prefix in
- ['k__', 'p__', 'c__', 'o__',
+ prefixes = [prefix for prefix in
+ ['k__', 'p__', 'c__', 'o__',
'f__', 'g__', 's__'] \
if prefix in sample]
=====================================
strainphlan_src/build_tree_single_strain.py
=====================================
@@ -6,9 +6,9 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '17 Sep 2015'
-import sys
+# import sys
import os
-import argparse
+import argparse
import numpy
from Bio import SeqIO
import glob
@@ -17,34 +17,34 @@ import glob
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--ifn_alignments',
+ '--ifn_alignments',
nargs='+',
- required=True,
- default=None,
+ required=True,
+ default=None,
type=str,
help='The alignment file.')
p.add_argument(
- '--log_ofn',
- required=True,
- default=None,
+ '--log_ofn',
+ required=True,
+ default=None,
type=str,
help='The log file.')
p.add_argument(
- '--nprocs',
- required=True,
- default=None,
+ '--nprocs',
+ required=True,
+ default=None,
type=int,
help='Number of processors.')
p.add_argument(
- '--bootstrap_raxml',
- required=False,
- default=0,
+ '--bootstrap_raxml',
+ required=False,
+ default=0,
type=int,
help='The number of runs for bootstraping when building the tree. '\
'Default 0.')
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
@@ -84,15 +84,15 @@ def main(args):
if len(sample2polrate):
log_line = '%s\t%d\t%d\t%f\n'%\
- (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
- len(singles),
- len(sample2polrate),
+ (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
+ len(singles),
+ len(sample2polrate),
float(len(singles)) / len(sample2polrate))
else:
log_line = '%s\t%d\t%d\t%f\n'%\
- (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
- len(singles),
- len(sample2polrate),
+ (os.path.basename(ifn_polymorphic).replace('.polymorphic', ''),
+ len(singles),
+ len(sample2polrate),
0)
lfile.write(log_line)
@@ -116,7 +116,7 @@ def main(args):
cmd += '-N %d '%(args.bootstrap_raxml)
cmd += '-s %s '%os.path.abspath(ifn_alignment2)
cmd += '-w %s '%os.path.abspath(os.path.dirname(ifn_alignment2))
- cmd += '-n %s '%output_suffix
+ cmd += '-n %s '%output_suffix
cmd += '-p 1234 '
else:
cmd = 'raxmlHPC-PTHREADS-SSE3 '
@@ -136,7 +136,7 @@ def main(args):
'''
run(cmd)
lfile.close()
-
+
=====================================
strainphlan_src/compute_distance.py
=====================================
@@ -15,8 +15,8 @@ sys.path.append(MAIN_DIR)
from mixed_utils import dist2file, statistics
import argparse as ap
-from Bio import SeqIO, Seq, SeqRecord
-from collections import defaultdict
+from Bio import SeqIO#, Seq, SeqRecord
+# from collections import defaultdict
import numpy
from ooSubprocess import ooSubprocess
@@ -24,8 +24,8 @@ from ooSubprocess import ooSubprocess
'''
SUBST = {
'A':{'A':0.0, 'C':1.0, 'G':1.0, 'T':1.0, '-':1.0},
- 'C':{'A':1.0, 'C':0.0, 'G':1.0, 'T':1.0, '-':1.0},
- 'G':{'A':1.0, 'C':1.0, 'G':0.0, 'T':1.0, '-':1.0},
+ 'C':{'A':1.0, 'C':0.0, 'G':1.0, 'T':1.0, '-':1.0},
+ 'G':{'A':1.0, 'C':1.0, 'G':0.0, 'T':1.0, '-':1.0},
'T':{'A':1.0, 'C':1.0, 'G':1.0, 'T':0.0, '-':1.0},
'-':{'A':1.0, 'C':1.0, 'G':1.0, 'T':1.0, '-':0.0}}
'''
@@ -35,14 +35,14 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_alignment', required=True, default=None, type=str)
p.add_argument('--ofn_prefix', required=True, default=None, type=str)
- p.add_argument('--count_gaps',
+ p.add_argument('--count_gaps',
required=False,
- dest='ignore_gaps',
+ dest='ignore_gaps',
action='store_false')
p.set_defaults(ignore_gaps=True)
- p.add_argument('--overwrite',
+ p.add_argument('--overwrite',
required=False,
- dest='overwrite',
+ dest='overwrite',
action='store_true')
p.set_defaults(overwrite=False)
@@ -72,7 +72,7 @@ def get_dist(seq1, seq2, ignore_gaps):
rel_sim = 1.0 - rel_dist
rel_snp = abs_snp / float(len(seq1))
return abs_dist, rel_dist, abs_sim, rel_sim, abs_snp, rel_snp
-
+
def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
ofn_abs_dist = ofn_prefix + '.abs_dist'
@@ -101,14 +101,14 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
for i in range(len(recs)):
for j in range(i, len(recs)):
- abs_d, rel_d, abs_s, rel_s, abs_sp, rel_sp = get_dist(recs[i].seq,
+ abs_d, rel_d, abs_s, rel_s, abs_sp, rel_sp = get_dist(recs[i].seq,
recs[j].seq,
ignore_gaps)
abs_dist[i][j] = abs_d
abs_dist[j][i] = abs_d
abs_dist_flat.append(abs_d)
-
+
rel_dist[i][j] = rel_d
rel_dist[j][i] = rel_d
rel_dist_flat.append(rel_d)
@@ -116,7 +116,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
abs_sim[i][j] = abs_s
abs_sim[j][i] = abs_s
abs_sim_flat.append(abs_s)
-
+
rel_sim[i][j] = rel_s
rel_sim[j][i] = rel_s
rel_sim_flat.append(rel_s)
@@ -124,7 +124,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
abs_snp[i][j] = abs_sp
abs_snp[j][i] = abs_sp
abs_snp_flat.append(abs_sp)
-
+
rel_snp[i][j] = rel_sp
rel_snp[j][i] = rel_sp
rel_snp_flat.append(rel_sp)
@@ -198,7 +198,7 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
'--flabel_size', '5',
'--slabel_size', '5',
'--max_flabel_len', '200'])
- '''
+ '''
ofn_abs_snp = ofn_prefix + '.abs_snp'
dist2file(abs_snp, labels, ofn_abs_snp)
@@ -208,17 +208,17 @@ def compute_dist_matrix(ifn_alignment, ofn_prefix, ignore_gaps, overwrite):
dist2file(rel_snp, labels, ofn_rel_snp)
with open(ofn_rel_snp + '.info', 'w') as ofile:
ofile.write(statistics(rel_snp_flat)[1])
-
+
def main(args):
compute_dist_matrix(
- args['ifn_alignment'],
+ args['ifn_alignment'],
args['ofn_prefix'],
args['ignore_gaps'],
- args['overwrite'])
-
+ args['overwrite'])
+
if __name__ == "__main__":
args = read_params()
main(args)
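
For orientation, the pairwise comparison this script performs boils down to per-position counts over two aligned, equal-length sequences, optionally skipping gap columns; a rough simplified sketch of that idea, not the exact `get_dist` implementation:

```
#!python
def toy_dist(seq1, seq2, ignore_gaps=True):
    """Count mismatches between two aligned sequences of equal length."""
    assert len(seq1) == len(seq2)
    mismatches = 0
    compared = 0
    for a, b in zip(seq1, seq2):
        if ignore_gaps and '-' in (a, b):
            continue  # skip gap columns when requested
        compared += 1
        if a != b:
            mismatches += 1
    rel_dist = mismatches / float(compared) if compared else 0.0
    return mismatches, rel_dist, 1.0 - rel_dist  # absolute, relative, similarity

print(toy_dist('ACGT-A', 'ACCTAA'))  # (1, 0.2, 0.8)
```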
=====================================
strainphlan_src/compute_distance_all.py
=====================================
@@ -13,9 +13,9 @@ MAIN_DIR = os.path.dirname(ABS_PATH)
os.environ['PATH'] += ':%s'%MAIN_DIR
sys.path.append(MAIN_DIR)
import argparse as ap
-from Bio import SeqIO, Seq, SeqRecord
-from collections import defaultdict
-import numpy
+from Bio import SeqIO#, Seq, SeqRecord
+# from collections import defaultdict
+# import numpy
from compute_distance import compute_dist_matrix
from ooSubprocess import parallelize
@@ -24,9 +24,9 @@ def read_params():
p = ap.ArgumentParser()
p.add_argument('--ifn_alignments', nargs='+', required=True, default=None, type=str)
p.add_argument('--nprocs', required=True, default=None, type=int)
- p.add_argument('--count_gaps',
+ p.add_argument('--count_gaps',
required=False,
- dest='ignore_gaps',
+ dest='ignore_gaps',
action='store_false')
p.set_defaults(ignore_gaps=True)
@@ -37,11 +37,11 @@ def read_params():
def compute_dist_matrix_wrapper(args):
compute_dist_matrix(
- args['ifn_alignment'],
+ args['ifn_alignment'],
args['ofn_prefix'],
args['ignore_gaps'],
- overwrite=True)
-
+ overwrite=True)
+
def main(args):
@@ -49,7 +49,7 @@ def main(args):
for i in range(len(args['ifn_alignments'])):
args_list.append({})
args_list[i]['ifn_alignment'] = args['ifn_alignments'][i]
- args_list[i]['ofn_prefix'] = args['ifn_alignments'][i]
+ args_list[i]['ofn_prefix'] = args['ifn_alignments'][i]
if not args['ignore_gaps']:
args_list[i]['ofn_prefix'] += '.count_gaps'
args_list[i]['ignore_gaps'] = args['ignore_gaps']
=====================================
strainphlan_src/extract_markers.py
=====================================
@@ -6,8 +6,8 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '1 Sep 2014'
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import pickle
import bz2
=====================================
strainphlan_src/ooSubprocess.py
=====================================
@@ -8,12 +8,12 @@ import os
import multiprocessing
from multiprocessing.pool import ThreadPool
import sys
-import cStringIO
-from tempfile import NamedTemporaryFile
+# import cStringIO
+from tempfile import NamedTemporaryFile
import which
import functools
import traceback
-import numpy
+# import numpy
class ooSubprocessException(Exception):
@@ -62,8 +62,8 @@ class ooSubprocess:
elif get_out_pipe:
tmp_file = NamedTemporaryFile(dir=self.tmp_dir)
p = subprocess.Popen(
- cmd,
- stdin=in_pipe,
+ cmd,
+ stdin=in_pipe,
stdout=tmp_file,
**kwargs)
p.wait()
@@ -74,15 +74,15 @@ class ooSubprocess:
elif out_fn:
ofile = open(out_fn, 'w') if out_fn else None
result = subprocess.check_call(
- cmd,
- stdin=in_pipe,
- stdout=ofile,
+ cmd,
+ stdin=in_pipe,
+ stdout=ofile,
**kwargs)
ofile.close()
else:
result = subprocess.check_call(
- cmd,
- stdin=in_pipe,
+ cmd,
+ stdin=in_pipe,
**kwargs)
return result
=====================================
strainphlan_src/plot_tree_ete2.py
=====================================
@@ -6,23 +6,23 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '7 Sep 2016'
-import sys
-import os
-import argparse
+# import sys
+# import os
+import argparse
from ete2 import Tree, TreeStyle, NodeStyle
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--ifn',
- required=True,
- default=None,
+ '--ifn',
+ required=True,
+ default=None,
type=str,
help='The input tree file.')
p.add_argument(
- '--ofn',
- required=False,
- default=None,
+ '--ofn',
+ required=False,
+ default=None,
type=str,
help='The input tree file.')
=====================================
strainphlan_src/plot_tree_graphlan.py
=====================================
@@ -6,78 +6,78 @@ __author__ = 'Duy Tin Truong (duytin.truong at unitn.it)'
__version__ = '0.1'
__date__ = '4 May 2015'
-import sys
-import os
+# import sys
+# import os
import argparse as ap
import dendropy
from StringIO import StringIO
import re
from collections import defaultdict
-import ConfigParser
+# import ConfigParser
import matplotlib.colors as colors
import subprocess
def read_params():
p = ap.ArgumentParser()
- p.add_argument('--ifn_tree',
- required=True,
- default=None,
+ p.add_argument('--ifn_tree',
+ required=True,
+ default=None,
type=str,
help='The input tree in newick format.')
- p.add_argument('--colorized_metadata',
- required=False,
- default='unset',
+ p.add_argument('--colorized_metadata',
+ required=False,
+ default='unset',
type=str,
help='The metadata field to colorize. Default "unset".')
- p.add_argument('--fig_size',
- required=False,
- default=8,
+ p.add_argument('--fig_size',
+ required=False,
+ default=8,
type=float,
help='The figure size. Default "8".')
- p.add_argument('--legend_marker_size',
- required=False,
- default=20,
+ p.add_argument('--legend_marker_size',
+ required=False,
+ default=20,
type=int,
help='The legend marker size. Default "20".'
)
- p.add_argument('--legend_font_size',
- required=False,
- default=10,
+ p.add_argument('--legend_font_size',
+ required=False,
+ default=10,
type=int,
help='The legend font size. Default "10".'
)
- p.add_argument('--legend_marker_edge_width',
- required=False,
- default=0.2,
+ p.add_argument('--legend_marker_edge_width',
+ required=False,
+ default=0.2,
type=float,
help='The legend marker edge width. Default "0.2".'
)
- p.add_argument('--leaf_marker_size',
- required=False,
- default=20,
+ p.add_argument('--leaf_marker_size',
+ required=False,
+ default=20,
type=int,
help='The legend marker size. Default "20".'
)
- p.add_argument('--leaf_marker_edge_width',
- required=False,
- default=0.2,
+ p.add_argument('--leaf_marker_edge_width',
+ required=False,
+ default=0.2,
type=float,
help='The legend marker edge width. Default "0.2".'
)
- p.add_argument('--dpi',
- required=False,
- default=300,
+ p.add_argument('--dpi',
+ required=False,
+ default=300,
type=int,
help='The figure dpi.')
- p.add_argument('--figure_extension',
- required=False,
- default='.png',
+ p.add_argument('--figure_extension',
+ required=False,
+ default='.png',
type=str,
help='The figure extension. Default ".png".')
- p.add_argument('--ofn_prefix',
- required=False,
- default=None,
+ p.add_argument('--ofn_prefix',
+ required=False,
+ default=None,
type=str,
help='The prefix of output files.')
return p.parse_args()
@@ -88,7 +88,7 @@ def read_params():
def run(cmd):
print cmd
subprocess.call(cmd.split())
-
+
@@ -133,7 +133,7 @@ def main(args):
ofile.write('branch_bracket_depth\t0\n')
#ofile.write('branch_thickness\t1.25\n')
ofile.write('annotation_background_width\t0\n')
-
+
# legend
ofile.write('#legends\n')
ofile.write('class_legend_font_size\t%d\n'%args.legend_font_size)
=====================================
strainphlan_src/sam_filter.py
=====================================
@@ -7,28 +7,28 @@ __version__ = '0.1'
__date__ = '18 Jul 2015'
import sys
-import os
-import argparse
+# import os
+import argparse
def read_params():
p = argparse.ArgumentParser()
p.add_argument(
- '--input_file',
- required=False,
- default=None,
+ '--input_file',
+ required=False,
+ default=None,
type=str,
help='The input sam file.')
p.add_argument(
- '--min_align_score',
- required=True,
- default=None,
+ '--min_align_score',
+ required=True,
+ default=None,
type=int,
help='The sam records with alignment score smaller than this value '
'will be discarded.')
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
=====================================
strainphlan_src/sample2markers.py
=====================================
@@ -16,23 +16,23 @@ import argparse as ap
import glob
import ooSubprocess
from ooSubprocess import print_stderr, trace_unhandled_exceptions
-import ConfigParser
+# import ConfigParser
from Bio import SeqIO, Seq, SeqRecord
-import cStringIO
+# import cStringIO
import msgpack
-import random
-import subprocess
-import bz2
-import gzip
+# import random
+# import subprocess
+# import bz2
+# import gzip
import logging
import logging.config
-import tarfile
-import threading
+# import tarfile
+# import threading
import multiprocessing
import pysam
from collections import defaultdict
from scipy import stats
-import numpy
+# import numpy
logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
disable_existing_loggers=False,
@@ -57,17 +57,17 @@ def read_params():
p.add_argument('--samtools_exe', required=False, default='samtools', type=str)
p.add_argument('--bcftools_exe', required=False, default='bcftools', type=str)
p.add_argument(
- '--verbose',
- required=False,
+ '--verbose',
+ required=False,
dest='quiet',
action='store_false',
help='Show all information. Default "not set".')
p.set_defaults(quiet=True)
'''
p.add_argument(
- '--use_processes',
- required=False,
- default=False,
+ '--use_processes',
+ required=False,
+ default=False,
action='store_false',
dest='use_threads',
help='Use multiprocessing. Default "Use multithreading".')
@@ -102,7 +102,7 @@ def build_bowtie2db(ifn_markers, tmp_dir, error_pipe=None):
oosp = ooSubprocess.ooSubprocess(tmp_dir)
if index_fns == []:
oosp.ex(
- 'bowtie2-build',
+ 'bowtie2-build',
['--quiet', ifn_markers, index_path],
stderr=error_pipe)
else:
@@ -113,17 +113,17 @@ def build_bowtie2db(ifn_markers, tmp_dir, error_pipe=None):
def sample2markers(
- ifn_sample,
+ ifn_sample,
min_read_len,
min_align_score,
min_base_quality,
min_read_depth,
error_rate,
- ifn_markers,
+ ifn_markers,
index_path,
- nprocs=1,
+ nprocs=1,
sam2file=None,
- marker2file=None,
+ marker2file=None,
tmp_dir='tmp',
quiet=False,
samtools_exe='samtools',
@@ -136,7 +136,7 @@ def sample2markers(
:param marker2file: the file name to store the consensus markers.
:param sam2file: the file name to store the sam content.
:returns: if marker2file==None, return the dictionary of the consensus
- markers. Otherwise, save the result in fasta format to the file declared by
+ markers. Otherwise, save the result in fasta format to the file declared by
marker2file
'''
@@ -148,7 +148,7 @@ def sample2markers(
# sample to sam
sample_pipe = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', ifn_sample],
stderr=error_pipe
)
@@ -159,7 +159,7 @@ def sample2markers(
stderr=error_pipe
)
bowtie2_pipe = oosp.chain(
- 'bowtie2',
+ 'bowtie2',
args=[
'-U', '-',
'-x', index_path,
@@ -173,13 +173,13 @@ def sample2markers(
else:
oosp.chain(
'compress_file.py',
- args=['--output_file', sam2file],
+ args=['--output_file', sam2file],
in_pipe=bowtie2_pipe,
stderr=error_pipe,
stop=True)
sam_pipe = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', sam2file],
stderr=error_pipe)
ofn_bam_sorted_prefix = os.path.join(
@@ -187,14 +187,14 @@ def sample2markers(
os.path.basename(ifn_sample) + '.bam.sorted')
return sam2markers(
- sam_file=sam_pipe,
+ sam_file=sam_pipe,
ofn_bam_sorted_prefix=ofn_bam_sorted_prefix,
min_align_score=min_align_score,
min_base_quality=min_base_quality,
min_read_depth=min_read_depth,
error_rate=error_rate,
- marker2file=marker2file,
- oosp=oosp,
+ marker2file=marker2file,
+ oosp=oosp,
tmp_dir=tmp_dir,
quiet=quiet,
samtools_exe=samtools_exe,
@@ -211,14 +211,14 @@ def save2file(tmp_file, ofn):
def sam2markers(
- sam_file,
+ sam_file,
ofn_bam_sorted_prefix,
min_align_score=None,
min_base_quality=30,
min_read_depth=0,
error_rate=0.01,
- marker2file=None,
- oosp=None,
+ marker2file=None,
+ oosp=None,
tmp_dir='tmp',
quiet=False,
samtools_exe='samtools',
@@ -247,12 +247,12 @@ def sam2markers(
if type(sam_file) == str:
p1 = oosp.chain(
- 'dump_file.py',
+ 'dump_file.py',
args=['--input_file', sam_file],
stderr=error_pipe)
else:
p1 = sam_file
-
+
# filter sam
if min_align_score == None:
p1_filtered = p1
@@ -264,8 +264,8 @@ def sam2markers(
stderr=error_pipe)
# sam to bam
p2 = oosp.chain(
- samtools_exe,
- args=['view', '-bS', '-'],
+ samtools_exe,
+ args=['view', '-bS', '-'],
in_pipe=p1_filtered,
stderr=error_pipe)
@@ -274,12 +274,12 @@ def sam2markers(
for tmp_fn in tmp_fns:
os.remove(tmp_fn)
p3 = oosp.chain(
- samtools_exe,
- args=['sort', '-', '-o', ofn_bam_sorted_prefix],
+ samtools_exe,
+ args=['sort', '-', '-o', ofn_bam_sorted_prefix],
in_pipe=p2,
stderr=error_pipe)
- # extract polimorphic information
+ # extract polimorphic information
marker2seq = defaultdict(dict)
pysam.index(p3.name)
samfile = pysam.AlignmentFile(p3.name)
@@ -307,30 +307,30 @@ def sam2markers(
# bam to mpileup
p3.seek(0)
p4 = oosp.chain(
- samtools_exe,
- args=['mpileup', '-u', '-'],
+ samtools_exe,
+ args=['mpileup', '-u', '-'],
in_pipe=p3,
stderr=error_pipe)
# mpileup to vcf
p5 = oosp.chain(
- bcftools_exe,
- args=['view', '-c', '-g', '-p', '1.1', '-'],
+ bcftools_exe,
+ args=['view', '-c', '-g', '-p', '1.1', '-'],
in_pipe=p4,
stderr=error_pipe)
#stderr=open(os.devnull, 'w'))
# fix AF1=0
p6 = oosp.chain(
- 'fix_AF1.py',
- args=['--input_file', '-'],
+ 'fix_AF1.py',
+ args=['--input_file', '-'],
in_pipe=p5,
stderr=error_pipe)
# vcf to fastq
p7 = oosp.chain(
- 'vcfutils.pl',
- args=['vcf2fq'],
+ 'vcfutils.pl',
+ args=['vcf2fq'],
in_pipe=p6,
get_out_pipe=True,
stderr=error_pipe,
@@ -342,7 +342,7 @@ def sam2markers(
marker2seq = dict(marker2seq)
except Exception as e:
logger.error("sam2markers failed on file " + sam_file)
- raise
+ raise
if type(p1) == file:
p1.close()
@@ -369,17 +369,17 @@ def run_sample(args_list):
marker2file = output_prefix + args['marker2file_ext']
if args['input_type'] == 'fastq':
sample2markers(
- ifn_sample=ifn_sample,
+ ifn_sample=ifn_sample,
min_read_len=args['min_read_len'],
min_align_score=args['min_align_score'],
min_base_quality=args['min_base_quality'],
min_read_depth=args['min_read_depth'],
error_rate=args['error_rate'],
- ifn_markers=args['ifn_markers'],
+ ifn_markers=args['ifn_markers'],
index_path=args['index_path'],
nprocs=args['nprocs'],
sam2file=sam2file,
- marker2file=marker2file,
+ marker2file=marker2file,
tmp_dir=args['output_dir'],
quiet=args['quiet'],
samtools_exe=args['samtools_exe'],
@@ -389,7 +389,7 @@ def run_sample(args_list):
args['output_dir'],
os.path.basename(ifn_sample) + '.bam.sorted')
sam2markers(
- sam_file=ifn_sample,
+ sam_file=ifn_sample,
ofn_bam_sorted_prefix=ofn_bam_sorted_prefix,
min_align_score=args['min_align_score'],
min_base_quality=args['min_base_quality'],
@@ -435,7 +435,7 @@ def main(args):
print e
-
+
if __name__ == "__main__":
args = read_params()
main(args)
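
Beyond the indentation clean-up, the chains above document the consensus-calling pipeline that sample2markers.py drives: filtered SAM is converted to BAM (samtools view -bS), sorted (samtools sort), piled up (samtools mpileup -u), called (bcftools view -c -g -p 1.1) and turned into a consensus FASTQ (vcfutils.pl vcf2fq). A rough stand-alone sketch of the tail of that chain using plain subprocess pipes, assuming the same (older) samtools/bcftools versions the script targets are on PATH; the strainphlan-specific fix_AF1.py step is omitted and the function name is illustrative:

import subprocess

def consensus_fastq(sorted_bam, out_fastq):
    # Sorted BAM -> uncompressed pileup/BCF -> variant calls -> consensus FASTQ.
    p_mpileup = subprocess.Popen(
        ['samtools', 'mpileup', '-u', sorted_bam],
        stdout=subprocess.PIPE)
    p_bcftools = subprocess.Popen(
        ['bcftools', 'view', '-c', '-g', '-p', '1.1', '-'],
        stdin=p_mpileup.stdout, stdout=subprocess.PIPE)
    p_mpileup.stdout.close()  # let samtools see SIGPIPE if bcftools exits early
    with open(out_fastq, 'w') as ofile:
        subprocess.check_call(
            ['vcfutils.pl', 'vcf2fq'],
            stdin=p_bcftools.stdout, stdout=ofile)
    p_bcftools.stdout.close()
    p_mpileup.wait()
    p_bcftools.wait()
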
=====================================
utils/merge_metaphlan_tables.py
=====================================
@@ -39,9 +39,10 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For each input datum in each input stream...
pos = 0
+ dCol = None
- for f in aaastrIn :
- with open(f) as csvfile :
+ for f in aaastrIn:
+ with open(f) as csvfile:
iIn = csv.reader(csvfile, csv.excel_tab)
# Lines from the current file, empty list to hold data, empty hash to hold ids
@@ -51,18 +52,22 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For a line in the file
for astrLine in iIn:
if astrLine[0].startswith('#'):
+ dCol = astrLine.index('relative_abundance') if 'relative_abundance' in astrLine else None
continue
iLine += 1
- # ID is from first column, data are everything else
- strID, astrData = astrLine[iCol], ( astrLine[:iCol] + astrLine[( iCol + 1 ):] )
+ if dCol:
+ strID, astrData = astrLine[iCol], [astrLine[dCol]]
+ else:
+ # ID is from first column, data are everything else
+ strID, astrData = astrLine[iCol], (astrLine[:iCol] + astrLine[iCol + 1:])
hashIDs[strID] = iLine
- aastrData.append( astrData )
+ aastrData.append(astrData)
# Batch merge every new ID key set
- setstrIDs.update( hashIDs.keys( ) )
+ setstrIDs.update(hashIDs.keys())
pos += 1
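
The merge() hunks above change what counts as data: when a '#'-prefixed header names a relative_abundance column, only that column is kept per clade; otherwise the old behaviour (everything except the ID column) applies. A minimal standalone sketch of that selection logic, assuming tab-separated MetaPhlAn output; parse_rows is a hypothetical helper, not part of the upstream script:

import csv

def parse_rows(path, id_col=0):
    # Yield (clade_id, data) pairs; keep only the relative_abundance column if the header names one.
    rel_col = None
    with open(path) as csvfile:
        for row in csv.reader(csvfile, csv.excel_tab):
            if row and row[0].startswith('#'):
                # Newer MetaPhlAn headers name their columns; remember where relative_abundance sits.
                rel_col = row.index('relative_abundance') if 'relative_abundance' in row else None
                continue
            if rel_col is not None:
                yield row[id_col], [row[rel_col]]
            else:
                # Fall back to the old behaviour: everything except the ID column is data.
                yield row[id_col], row[:id_col] + row[id_col + 1:]
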
View it on GitLab: https://salsa.debian.org/med-team/metaphlan2/compare/9d9b7df7ff658d2e71f92ff93fe919d83ac79dab...bf0c58e9883a7bf10e0bb22ae1fb2aaf7d5b18b8