[med-svn] [Git][med-team/metaphlan][master] 8 commits: Source-only upload.

Andreas Tille (@tille) gitlab at salsa.debian.org
Thu Aug 25 07:25:36 BST 2022



Andreas Tille pushed to branch master at Debian Med / metaphlan2


Commits:
049cf86e by Andreas Tille at 2022-08-13T06:19:29+02:00
Source-only upload.

- - - - -
f99ecc53 by Andreas Tille at 2022-08-13T06:20:35+02:00
Upload to unstable

- - - - -
198dd6fa by Andreas Tille at 2022-08-25T08:17:27+02:00
Rename VCS path

- - - - -
7a0a4279 by Andreas Tille at 2022-08-25T08:17:55+02:00
New upstream version 4.0.0
- - - - -
0a437202 by Andreas Tille at 2022-08-25T08:17:55+02:00
routine-update: New upstream version

- - - - -
6dae2576 by Andreas Tille at 2022-08-25T08:17:57+02:00
Update upstream source from tag 'upstream/4.0.0'

Update to upstream version '4.0.0'
with Debian dir be9d5b4fd65d037a0481a204519a9092ca681fdf
- - - - -
024590bb by Andreas Tille at 2022-08-25T08:23:28+02:00
Do not install old wrapper scripts

- - - - -
21f3d851 by Andreas Tille at 2022-08-25T08:24:25+02:00
Upload to unstable

- - - - -


29 changed files:

- README.md
- bioconda_recipe/meta.yaml
- changeset.txt
- − debian/bin/metaphlan2
- − debian/bin/strainphlan
- debian/changelog
- debian/control
- − debian/install
- metaphlan/__init__.py
- metaphlan/metaphlan.py
- metaphlan/strainphlan.py
- + metaphlan/utils/VallesColomerM_2022_Nov19_thresholds.tsv
- metaphlan/utils/add_metadata_tree.py
- + metaphlan/utils/calculate_diversity.R
- − metaphlan/utils/calculate_unifrac.R
- metaphlan/utils/external_exec.py
- metaphlan/utils/extract_markers.py
- metaphlan/utils/merge_metaphlan_tables.py
- − metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk
- + metaphlan/utils/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk
- + metaphlan/utils/mpa_vJan21_CHOCOPhlAnSGB_202103_SGB2GTDB.tsv
- metaphlan/utils/parallelisation.py
- metaphlan/utils/plot_tree_graphlan.py
- metaphlan/utils/read_fastx.py
- metaphlan/utils/sample2markers.py
- + metaphlan/utils/sgb_to_gtdb_profile.py
- metaphlan/utils/strain_transmission.py
- metaphlan/utils/util_fun.py
- setup.py


Changes:

=====================================
README.md
=====================================
@@ -1,44 +1,33 @@
 # MetaPhlAn: Metagenomic Phylogenetic Analysis
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/metaphlan/README.html) [![PyPI - Downloads](https://img.shields.io/pypi/dm/metaphlan?label=MetaPhlAn%20on%20PyPi)](https://pypi.org/project/MetaPhlAn/) [![MetaPhlAn on DockerHub](https://img.shields.io/docker/pulls/biobakery/metaphlan?label=MetaPhlAn%20on%20DockerHub)](https://hub.docker.com/r/biobakery/metaphlan) [![Build MetaPhlAn package](https://github.com/biobakery/MetaPhlAn/workflows/Build%20MetaPhlAn%20package/badge.svg?branch=3.0)](https://github.com/biobakery/MetaPhlAn/actions?query=workflow%3A%22Build+MetaPhlAn+package%22)
-## What's new in version 3.1
-* 433 low-quality species were removed from the MetaPhlAn 3.1 marker database and 2,680 species were added (for a new total of 15,766; a 17% increase).
-* Marker genes for a subset of existing bioBakery 3 species were also revised.
-* Most existing bioBakery 3 species pangenomes were updated with revised or expanded gene content.
-* MetaPhlAn 3.1 software has been updated to work with revised marker database.
+## What's new in version 4
+* Adoption of the species-level genome bins system (SGBs, http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html)
+* New MetaPhlAn marker genes identified from ~1M microbial genomes
+* Ability to profile 21,978 known (kSGBs) and 4,992 unknown (uSGBs) microbial species
+* Better representation not only of the human gut microbiome but also of many other animal and ecological environments
+* Estimation of the fraction of the metagenome composed of microbes not included in the database with the parameter `--unclassified_estimation`
+* Compatibility with MetaPhlAn 3 databases with parameter `--mpa3`
 -------------
 
 ## Description
-MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
-
-MetaPhlAn relies on ~1.1M unique clade-specific marker genes (the latest marker information file `mpa_v31_CHOCOPhlAn_201901_marker_info.txt.bz2` can be found  [here](http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_v31_CHOCOPhlAn_201901_marker_info.txt.bz2)) identified from ~100,000 reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic), allowing:
+MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With StrainPhlAn, it is possible to perform accurate strain-level microbial profiling.
+MetaPhlAn 4 relies on ~5.1M unique clade-specific marker genes identified from ~1M microbial genomes (~236,600 references and 771,500 metagenomic assembled genomes) spanning 26,970 species-level genome bins (SGBs, http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html), 4,992 of them taxonomically unidentified at the species level (the latest marker information file can be found [here](http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt.bz2)), allowing:
 
 * unambiguous taxonomic assignments;
-* accurate estimation of organismal relative abundance;
-* species-level resolution for bacteria, archaea, eukaryotes and viruses;
+* an accurate estimation of organismal relative abundance;
+* SGB-level resolution for bacteria, archaea and eukaryotes;
 * strain identification and tracking
 * orders of magnitude speedups compared to existing methods.
 * metagenomic strain-level population genomics
 
 If you use MetaPhlAn, please cite:
 
-[**Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3**](https://elifesciences.org/articles/65088) *Francesco Beghini, Lauren J McIver, Aitor Blanco-Míguez, Leonard Dubois, Francesco Asnicar, Sagun Maharjan, Ana Mailyan, Paolo Manghi, Matthias Scholz, Andrew Maltez Thomas, Mireia Valles-Colomer, George Weingart, Yancong Zhang, Moreno Zolfo, Curtis Huttenhower, Eric A Franzosa, Nicola Segata*. eLife (2021)
+**Extending and improving metagenomic taxonomic profiling with uncharacterized species with MetaPhlAn 4.** Aitor Blanco-Miguez, Francesco Beghini, Fabio Cumbo, Lauren J. McIver, Kelsey N. Thompson, Moreno Zolfo, Paolo Manghi, Leonard Dubois, Kun D. Huang, Andrew Maltez Thomas, Gianmarco Piccinno, Elisa Piperni, Michal Punčochář, Mireia Valles-Colomer, Adrian Tett, Francesca Giordano, Richard Davies, Jonathan Wolf, Sarah E. Berry, Tim D. Spector, Eric A. Franzosa, Edoardo Pasolli, Francesco Asnicar, Curtis Huttenhower, Nicola Segata. Preprint (2022)
 
 If you use StrainPhlAn, please cite the MetaPhlAn paper and the following StrainPhlAn paper:
 
 [**Microbial strain-level population structure and genetic diversity from metagenomes.**](http://genome.cshlp.org/content/27/4/626.full.pdf) *Duy Tin Truong, Adrian Tett, Edoardo Pasolli, Curtis Huttenhower, & Nicola Segata*. Genome Research 27:626-638 (2017)
 
--------------
-
-## Installation
-The best way to install MetaPhlAn is through conda via the Bioconda channel. If you have not configured you Anaconda installation in order to fetch packages from Bioconda, **please follow [these steps](https://bioconda.github.io/user/install.html#set-up-channels) in order to setup the channels.**
-
-You can install MetaPhlAn by running
-
-```
-$ conda install -c bioconda python=3.7 metaphlan
-```
-
-For installing it from the source code and for further installation instructions, please see the Wiki at the [Installation](https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0#installation) paragraph.
 
 -------------
 
@@ -46,13 +35,9 @@ For installing it from the source code and for further installation instructions
 
 In addition to the information on this page, you can refer to the following additional resources.
 
-* The [MetaPhlAn tutorial](https://github.com/biobakery/MetaPhlAn/wiki).
-
-* The [StrainPhlAn tutorial](https://github.com/biobakery/MetaPhlAn/wiki/StrainPhlAn-3.0).
-
-* The [MetaPhlAn](https://github.com/biobakery/biobakery/wiki/metaphlan3) and [StrainPhlAn](https://github.com/biobakery/biobakery/wiki/strainphlan3) wikis on bioBakery.
+* The [MetaPhlAn documentation](https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4).
 
-* The [MetaPhlAn](https://forum.biobakery.org/c/Microbial-community-profiling/MetaPhlAn/) and [StrainPhlAn](https://forum.biobakery.org/c/Microbial-community-profiling/StrainPhlAn/) Discourse forum.
+* The [StrainPhlAn documentation](https://github.com/biobakery/MetaPhlAn/wiki/StrainPhlAn-4).
 
 * Related tools including [PanPhlAn](https://github.com/segatalab/panphlan) (and its [tutorial](https://github.com/segatalab/panphlan/wiki/Home)), [GraPhlAn](https://github.com/segatalab/graphlan) (and its [tutorial](https://github.com/biobakery/biobakery/wiki/graphlan)), [PhyloPhlAn 3](https://github.com/biobakery/phylophlan) (and its [tutorial](https://github.com/biobakery/biobakery/wiki/phylophlan)), [HUMAnN](https://github.com/biobakery/humann/) (and its [tutorial](https://github.com/biobakery/biobakery/wiki/humann2)).
 


=====================================
bioconda_recipe/meta.yaml
=====================================
@@ -1,5 +1,5 @@
 {% set name = "metaphlan" %}
-{% set version = "3.1" %}
+{% set version = "4.0" %}
 
 package:
   name: {{ name }}
@@ -60,6 +60,7 @@ test:
     - plot_tree_graphlan.py -h
     - sample2markers.py -h
     - strain_transmission.py -h
+    - sgb_to_gtdb_profile.py -h
 
 about:
   home: https://github.com/biobakery/metaphlan


=====================================
changeset.txt
=====================================
@@ -1,3 +1,11 @@
+=== Version 4
+* Adoption of the species-level genome bins system (SGBs).
+* New MetaPhlAn marker genes identified from ~1M microbial genomes.
+* Ability to profile 21,978 known (kSGBs) and 4,992 unknown (uSGBs) microbial species.
+* Better representation not only of the human gut microbiome but also of many other animal and ecological environments.
+* Estimation of the fraction of the metagenome composed of microbes not included in the database with the parameter --unclassified_estimation.
+* Compatibility with MetaPhlAn 3 databases with the parameter --mpa3.
+
 === Version 3.1
 * 433 low-quality species were removed from the MetaPhlAn 3.1 marker database and 2,680 species were added (for a new total of 15,766; a 17% increase).
 * Marker genes for a subset of existing bioBakery 3 species were also revised.


=====================================
debian/bin/metaphlan2 deleted
=====================================
@@ -1,5 +0,0 @@
-#!/bin/sh
-
-cmd=`basename "$0" .py`.py
-
-exec "/usr/share/metaphlan2/$cmd" "$@"


=====================================
debian/bin/strainphlan deleted
=====================================
@@ -1,7 +0,0 @@
-#!/bin/sh
-
-export PYTHONPATH="$PATH:/usr/share/metaphlan2/strainer_src"
-
-cmd=`basename "$0" .py`.py
-
-exec "/usr/share/metaphlan2/$cmd" "$@"


=====================================
debian/changelog
=====================================
@@ -1,3 +1,17 @@
+metaphlan (4.0.0-1) unstable; urgency=medium
+
+  * Rename VCS path
+  * New upstream version
+  * Do not install old wrapper scripts
+
+ -- Andreas Tille <tille at debian.org>  Thu, 25 Aug 2022 08:23:37 +0200
+
+metaphlan (3.1.0+renamed-2) unstable; urgency=medium
+
+  * Source-only upload.
+
+ -- Andreas Tille <tille at debian.org>  Sat, 13 Aug 2022 06:19:37 +0200
+
 metaphlan (3.1.0+renamed-1) unstable; urgency=medium
 
   * d/watch: Create proper source tarball name


=====================================
debian/control
=====================================
@@ -11,8 +11,8 @@ Build-Depends: debhelper-compat (= 13),
                python3-setuptools,
                python3-biopython <!nocheck>
 Standards-Version: 4.6.1
-Vcs-Browser: https://salsa.debian.org/med-team/metaphlan2
-Vcs-Git: https://salsa.debian.org/med-team/metaphlan2.git
+Vcs-Browser: https://salsa.debian.org/med-team/metaphlan
+Vcs-Git: https://salsa.debian.org/med-team/metaphlan.git
 Homepage: https://github.com/biobakery/MetaPhlAn
 Rules-Requires-Root: no
 


=====================================
debian/install deleted
=====================================
@@ -1,2 +0,0 @@
-*.py		usr/share/metaphlan2
-debian/bin	usr


=====================================
metaphlan/__init__.py
=====================================
@@ -8,7 +8,7 @@ import sys
 import tarfile
 import time
 import zipfile
-from glob import glob
+from glob import glob, iglob
 import urllib.request
 
 def remove_prefix(text):
@@ -69,10 +69,6 @@ class ReportHook():
             status += "        \r"
             sys.stderr.write(status)
 
-# set the location of the database download url
-DROPBOX_DATABASE_DOWNLOAD = "https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADHWzATSQcI0CNFD0sk7MAga"
-ZENODO_DATABASE_DOWNLOAD = "https://zenodo.org/record/3957592"
-
 def download(url, download_file, force=False):
     """
     Download a file from a url
@@ -109,13 +105,9 @@ def download_unpack_tar(download_file_name, folder, bowtie2_build, nproc, use_ze
     tar_file = os.path.join(folder, download_file_name + ".tar")
     md5_file = os.path.join(folder, download_file_name + ".md5")
 
-    #Download the list of all the files in the Dropbox folder
-    if not use_zenodo:
-        url_tar_file = "http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/{}.tar".format(download_file_name)
-        url_md5_file = "http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/{}.md5".format(download_file_name)
-    else:
-        url_tar_file = "https://zenodo.org/record/3957592/files/{}.tar?download=1".format(download_file_name)
-        url_md5_file = "https://zenodo.org/record/3957592/files/{}.md5?download=1".format(download_file_name)
+    # Build the URLs of the database tar archive and its MD5 checksum on the FTP
+    url_tar_file = "http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/{}.tar".format(download_file_name)
+    url_md5_file = "http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/{}.md5".format(download_file_name)
 
     # download tar and MD5 checksum
     download(url_tar_file, tar_file)
@@ -160,19 +152,29 @@ def download_unpack_tar(download_file_name, folder, bowtie2_build, nproc, use_ze
         sys.stderr.write("Warning: Unable to extract {}.\n".format(tar_file))
 
     # uncompress sequences
-    bz2_file = os.path.join(folder, download_file_name + ".fna.bz2")
+    for bz2_file in iglob(os.path.join(folder, download_file_name + "_*.fna.bz2")):
+        fna_file = bz2_file[:-4]
+
+        if not os.path.isfile(fna_file):
+            sys.stderr.write('\n\nDecompressing {} into {}\n'.format(bz2_file, fna_file))
+
+            with open(fna_file, 'wb') as fna_h, \
+                bz2.BZ2File(bz2_file, 'rb') as bz2_h:
+                for data in iter(lambda: bz2_h.read(100 * 1024), b''):
+                    fna_h.write(data)
+        os.remove(bz2_file)
+
+    # join fasta files
+    sys.stderr.write('\n\nJoining FASTA databases\n')
+    with open(os.path.join(folder, download_file_name + ".fna"), 'w') as fna_h:
+        for fna_file in iglob(os.path.join(folder, download_file_name + "_*.fna")):
+            with open(fna_file, 'r') as fna_r:
+                for line in fna_r:
+                    fna_h.write(line)
     fna_file = os.path.join(folder, download_file_name + ".fna")
 
-    if not os.path.isfile(fna_file):
-        sys.stderr.write('\n\nDecompressing {} into {}\n'.format(bz2_file, fna_file))
-
-        with open(fna_file, 'wb') as fna_h, \
-            bz2.BZ2File(bz2_file, 'rb') as bz2_h:
-            for data in iter(lambda: bz2_h.read(100 * 1024), b''):
-                fna_h.write(data)
-
     # build bowtie2 indexes
-    if not glob(os.path.join(folder, download_file_name + "*.bt2")):
+    if not glob(os.path.join(folder, download_file_name + "*.bt2l")):
         bt2_base = os.path.join(folder, download_file_name)
         bt2_cmd = [bowtie2_build, '--quiet']
 
@@ -193,14 +195,19 @@ def download_unpack_tar(download_file_name, folder, bowtie2_build, nproc, use_ze
             sys.exit(1)
 
     try:
-        for bt2 in glob(os.path.join(folder, download_file_name + "*.bt2")):
+        for bt2 in glob(os.path.join(folder, download_file_name + "*.bt2l")):
             os.chmod(bt2, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP | stat.S_IROTH)  # change permissions to 664
     except PermissionError as e:
-        sys.stderr.write('Cannot change permission for {}. Make sure the files are readable.'.format(os.path.join(folder, download_file_name + "*.bt2")))
+        sys.stderr.write('Cannot change permission for {}. Make sure the files are readable.'.format(os.path.join(folder, download_file_name + "*.bt2l")))
 
-    sys.stderr.write('Removing uncompress database {}\n'.format(fna_file))
+    sys.stderr.write('Removing uncompressed databases\n')
     os.remove(fna_file)
 
+    # remove all the individual FASTA files but ViralDB
+    for fna_file in iglob(os.path.join(folder, download_file_name + "_*.fna")):
+        if not fna_file.endswith('_VSG.fna'):
+            os.remove(fna_file)
+    
 def download_unpack_zip(url,download_file_name,folder,software_name):
     """
     Download the url to the file and decompress into the folder
@@ -259,8 +266,9 @@ def check_and_install_database(index, bowtie2_db, bowtie2_build, nproc, force_re
     
     use_zenodo = False
     try:
-        if urllib.request.urlopen("http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_latest").getcode() != 200:
-            use_zenodo = True
+        if urllib.request.urlopen("http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_latest").getcode() != 200:
+            # use_zenodo = True
+            pass
     except:
         print('WARNING: It seems that you do not have Internet access.')
         if os.path.exists(os.path.join(bowtie2_db,'mpa_latest')):
@@ -269,16 +277,13 @@ def check_and_install_database(index, bowtie2_db, bowtie2_build, nproc, force_re
                 latest_db_version = [line.strip() for line in mpa_latest if not line.startswith('#')]
         else:
             print("""ERROR: Cannot find a local database. Please run MetaPhlAn using option "-x <database_name>".
-            You can download the MetaPhlAn database from \n {} \n {} \n {} 
-                  """.format('http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases',ZENODO_DATABASE_DOWNLOAD, DROPBOX_DATABASE_DOWNLOAD))
+            You can download the MetaPhlAn database from \n {} 
+                  """.format('http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases'))
             sys.exit()
 
     #try downloading from the segatalab website. If fails, use zenodo
     if index == 'latest':
-        if not use_zenodo:
-            mpa_latest = 'http://cmprod1.cibio.unitn.it/biobakery3/metaphlan_databases/mpa_latest'
-        else:
-            mpa_latest = 'https://zenodo.org/record/3957592/files/mpa_latest?download=1'
+        mpa_latest = 'http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_latest'
 
         index = resolve_latest_database(bowtie2_db, mpa_latest, force_redownload_latest)
     
@@ -303,4 +308,4 @@ def check_and_install_database(index, bowtie2_db, bowtie2_build, nproc, force_re
                      "the size this might take a few minutes\n")
     download_unpack_tar(index, bowtie2_db, bowtie2_build, nproc, use_zenodo)
     sys.stderr.write("\nDownload complete\n")
-    return index
\ No newline at end of file
+    return index
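The `__init__.py` changes above switch from a single `.fna.bz2` database to several `<name>_*.fna.bz2` parts that are decompressed and concatenated before building the bowtie2 indexes. A minimal standalone sketch of that part-handling (hypothetical helper name; the real code also builds the `bt2l` indexes, streams with `iglob` in arbitrary order, and keeps the ViralDB part afterwards):

```python
import bz2
import os
from glob import iglob


def decompress_and_join(folder, name):
    """Decompress all <name>_*.fna.bz2 parts and join them into <name>.fna.

    Hypothetical sketch of the split-database handling shown in the diff.
    """
    # Decompress each part, streaming in 100 KiB chunks as the diff does
    for bz2_file in sorted(iglob(os.path.join(folder, name + "_*.fna.bz2"))):
        fna_file = bz2_file[:-4]  # strip the ".bz2" suffix
        if not os.path.isfile(fna_file):
            with open(fna_file, 'wb') as out, bz2.BZ2File(bz2_file, 'rb') as src:
                for chunk in iter(lambda: src.read(100 * 1024), b''):
                    out.write(chunk)
        os.remove(bz2_file)

    # Join the decompressed parts into a single <name>.fna database
    joined = os.path.join(folder, name + ".fna")
    with open(joined, 'w') as out:
        for part in sorted(iglob(os.path.join(folder, name + "_*.fna"))):
            with open(part) as src:
                out.write(src.read())
    return joined
```

Note the sketch sorts the part names for a deterministic join; the committed code iterates `iglob` directly, so part order is filesystem-dependent.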


=====================================
metaphlan/metaphlan.py
=====================================
@@ -1,11 +1,11 @@
 #!/usr/bin/env python
-__author__ = ('Francesco Beghini (francesco.beghini at unitn.it),'
+__author__ = ('Aitor Blanco-Miguez (aitor.blancomiguez at unitn.it), '
+              'Francesco Beghini (francesco.beghini at unitn.it), '
               'Nicola Segata (nicola.segata at unitn.it), '
               'Duy Tin Truong, '
-              'Francesco Asnicar (f.asnicar at unitn.it), '
-              'Aitor Blanco Miguez (aitor.blancomiguez at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+              'Francesco Asnicar (f.asnicar at unitn.it)')
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 import sys
 try:
@@ -17,7 +17,6 @@ if float(sys.version_info[0]) < 3.0:
     sys.stderr.write("MetaPhlAn requires Python 3, your current Python version is {}.{}.{}\n"
                     .format(sys.version_info[0], sys.version_info[1], sys.version_info[2]))
     sys.exit(1)
-    
 import os
 import stat
 import re
@@ -28,7 +27,6 @@ from glob import glob
 from subprocess import DEVNULL
 import argparse as ap
 import bz2
-import json
 import pickle
 import subprocess as subp
 import tempfile as tf
@@ -57,18 +55,18 @@ try:
 except ImportError:
     sys.stderr.write("Warning! Biom python library not detected!"
                      "\n Exporting to biom format will not work!\n")
+import json
 
 # get the directory that contains this script
 metaphlan_script_install_folder = os.path.dirname(os.path.abspath(__file__))
 # get the default database folder
 DEFAULT_DB_FOLDER = os.path.join(metaphlan_script_install_folder, "metaphlan_databases")
 DEFAULT_DB_FOLDER= os.environ.get('METAPHLAN_DB_DIR', DEFAULT_DB_FOLDER)
+# Whether to execute an SGB-based analysis
+SGB_ANALYSIS = True
 INDEX = 'latest'
 tax_units = "kpcofgst"
 
-# set default parameters
-GENOME_LENGTH_BOOST_FOR_UNKNOWN_ESTIMATE=1.05
-
 def read_params(args):
     p = ap.ArgumentParser( description=
             "DESCRIPTION\n"
@@ -181,7 +179,7 @@ def read_params(args):
 
     arg('-x', '--index', type=str, default=INDEX,
         help=("Specify the id of the database version to use. "
-              "If \"latest\", MetaPhlAn will get the latest version.\n" 
+              "If \"latest\", MetaPhlAn will get the latest version.\n"
               "If an index name is provided, MetaPhlAn will try to use it, if available, and skip the online check.\n"
               "If the database files are not found on the local MetaPhlAn installation they\n"
               "will be automatically downloaded [default "+INDEX+"]\n"))
@@ -227,6 +225,7 @@ def read_params(args):
          "'f' : families only\n"
          "'g' : genera only\n"
          "'s' : species only\n"
+         "'t' : SGBs only\n"
          "[default 'a']" )
     arg( '--min_cu_len', metavar="", default="2000", type=int, help =
          "minimum total nucleotide length for the markers in a clade for\n"
@@ -237,16 +236,20 @@ def read_params(args):
          "length smaller than this threshold will be discarded.\n"
          "[default None]\n"   )
     arg( '--add_viruses', action='store_true', help=
-         "Allow the profiling of viral organisms" )
+         "Together with --mpa3, allow the profiling of viral organisms" )
     arg( '--ignore_eukaryotes', action='store_true', help=
          "Do not profile eukaryotic organisms" )
     arg( '--ignore_bacteria', action='store_true', help=
          "Do not profile bacterial organisms" )
     arg( '--ignore_archaea', action='store_true', help=
          "Do not profile archeal organisms" )
+    arg( '--ignore_ksgbs', action='store_true', help=
+         "Do not profile known SGBs (together with --sgb option)" )
+    arg( '--ignore_usgbs', action='store_true', help=
+         "Do not profile unknown SGBs (together with --sgb option)" )
     arg( '--stat_q', metavar="", type = float, default=0.2, help =
          "Quantile value for the robust average\n"
-         "[default 0.2]"   )
+         "[default 0.2]" )
     arg( '--perc_nonzero', metavar="", type = float, default=0.33, help =
          "Percentage of markers with a non zero relative abundance for misidentify a species\n"
          "[default 0.33]"   )
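The `--stat_q` quantile above drives the robust average over per-marker abundances: the lowest and highest `stat_q` fraction of values are discarded before averaging. A minimal sketch of that trimming (a hypothetical standalone helper, not the actual `TaxClade` code, which also weights counts by marker length):

```python
def robust_average(marker_abundances, stat_q=0.2):
    """Quantile-trimmed average: drop the lowest and highest stat_q
    fraction of per-marker abundance values, then take the mean."""
    vals = sorted(marker_abundances)
    quant = int(stat_q * len(vals))  # number of values trimmed per side
    trimmed = vals[quant:-quant] if quant else vals
    return sum(trimmed) / len(trimmed) if trimmed else 0.0
```

With the default `stat_q=0.2`, a clade with ten markers averages only the middle six values, damping the effect of spurious or missing marker hits.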
@@ -313,7 +316,8 @@ def read_params(args):
 
     arg( '--legacy-output', action='store_true', help="Old MetaPhlAn2 two columns output\n")
     arg( '--CAMI_format_output', action='store_true', help="Report the profiling using the CAMI output format\n")
-    arg( '--unknown_estimation', action='store_true', help="Scale relative abundances to the number of reads mapping to known clades in order to estimate unknowness\n")
+    arg( '--unclassified_estimation', action='store_true', help="Scale relative abundances to the number of reads mapping to identified clades in order to estimate unclassified taxa\n")
+    arg( '--mpa3', action='store_true', help="Perform the analysis using the MetaPhlAn 3 algorithm\n")
 
     #*************************************************************
     #* Parameters related to biom file generation                *
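The renamed `--unclassified_estimation` flag rescales the profile by the fraction of reads that mapped to identified clades, reporting the remainder as UNCLASSIFIED. A toy illustration of that rescaling (hypothetical function and argument names, not MetaPhlAn's internal computation):

```python
def scale_with_unclassified(abundances, n_mapped, n_total):
    """Rescale relative abundances (summing to 100 over mapped reads) by the
    mapped-read fraction; report the remainder as UNCLASSIFIED."""
    mapped_frac = n_mapped / n_total
    profile = {taxon: ab * mapped_frac for taxon, ab in abundances.items()}
    profile["UNCLASSIFIED"] = 100.0 * (1.0 - mapped_frac)
    return profile
```

If 80% of reads map, a species reported at 50% of the mapped community becomes 40% of the whole metagenome, with 20% left as UNCLASSIFIED.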
@@ -348,11 +352,12 @@ def read_params(args):
 def set_mapping_arguments(index, bowtie2_db):
     mpa_pkl = 'mpa_pkl'
     bowtie2db = 'bowtie2db'
+    bt2_ext = 'bt2l' if SGB_ANALYSIS else 'bt2'
 
     if os.path.isfile(os.path.join(bowtie2_db, "{}.pkl".format(index))):
         mpa_pkl = os.path.join(bowtie2_db, "{}.pkl".format(index))
 
-    if glob(os.path.join(bowtie2_db, "{}*.bt2".format(index))):
+    if glob(os.path.join(bowtie2_db, "{}*.{}".format(index, bt2_ext))):
         bowtie2db = os.path.join(bowtie2_db, "{}".format(index))
 
     return (mpa_pkl, bowtie2db)
@@ -409,7 +414,6 @@ def run_bowtie2(fna_in, outfmt6_out, bowtie2_db, preset, nproc, min_mapq_val, fi
         except IOError as e:
             sys.stderr.write('IOError: "{}"\nUnable to open sam output file.\n'.format(e))
             sys.exit(1)
-        
         for line in p.stdout:
             if samout:
                 sam_file.write(line)
@@ -438,7 +442,6 @@ def run_bowtie2(fna_in, outfmt6_out, bowtie2_db, preset, nproc, min_mapq_val, fi
             if not avg_read_length:
                 sys.stderr.write('Fatal error running MetaPhlAn. The average read length was not estimated.\nPlease check your input files.\n')
                 sys.exit(1)
-            
             outf.write(lmybytes('#nreads\t{}\n'.format(int(nreads))))
             outf.write(lmybytes('#avg_read_length\t{}'.format(avg_read_length)))
             outf.close()
@@ -517,8 +520,9 @@ class TaxClade:
         return [(m,float(n)*1000.0/(np.absolute(self.markers2lens[m] - self.avg_read_length) +1) )
                     for m,n in self.markers2nreads.items()]
 
-    def compute_mapped_reads( self ):
-        if self.name.startswith('s__'):
+    def compute_mapped_reads( self ):        
+        tax_level = 't__' if SGB_ANALYSIS else 's__'
+        if self.name.startswith(tax_level):
             return self.nreads
         for c in self.children.values():
             self.nreads += c.compute_mapped_reads()
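The hunk above moves the terminal taxonomic level from `s__` (species) to `t__` (SGB) when `SGB_ANALYSIS` is on, which changes where the recursive read count bottoms out. A stripped-down illustration (hypothetical minimal class, not the real `TaxClade`):

```python
class Clade:
    """Minimal stand-in for TaxClade: mapped reads accumulate upward from
    the terminal level ('t__' under SGB analysis, 's__' under --mpa3)."""

    def __init__(self, name, nreads=0):
        self.name, self.nreads, self.children = name, nreads, []

    def compute_mapped_reads(self, sgb_analysis=True):
        tax_level = 't__' if sgb_analysis else 's__'
        if self.name.startswith(tax_level):
            return self.nreads  # terminal clade: stop recursing here
        for child in self.children:
            self.nreads += child.compute_mapped_reads(sgb_analysis)
        return self.nreads
```

Under SGB analysis a species node keeps recursing into its `t__` SGB children and sums their counts, whereas with `--mpa3` the recursion stops at `s__` exactly as in MetaPhlAn 3.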
@@ -579,7 +583,7 @@ class TaxClade:
         quant = int(self.quantile*len(rat_nreads))
         ql,qr,qn = (quant,-quant,quant) if quant else (None,None,0)
 
-        if self.name[0] == 't' and (len(self.father.children) > 1 or "_sp" in self.father.name or "k__Viruses" in self.get_full_name()):
+        if not SGB_ANALYSIS and self.name[0] == 't' and (len(self.father.children) > 1 or "_sp" in self.father.name or "k__Viruses" in self.get_full_name()):
             non_zeros = float(len([n for r,n in rat_nreads if n > 0]))
             nreads = float(len(rat_nreads))
             if nreads == 0.0 or non_zeros / nreads < 0.7:
@@ -588,10 +592,6 @@ class TaxClade:
 
         if rat < 0.0:
             pass
-        elif self.stat == 'unknown':
-            wnreads = sorted([(float(n)/(np.absolute(r-self.avg_read_length)+1),(np.absolute(r - self.avg_read_length)+1) ,n) for r,n in rat_nreads], key=lambda x:x[0])
-            den,num = zip(*[v[1:] for v in wnreads])
-            loc_ab = float(sum(num))/float(sum(den)) if any(den) else 0.0
         elif self.stat == 'avg_g' or (not qn and self.stat in ['wavg_g','tavg_g']):
             loc_ab = nrawreads / rat if rat >= 0 else 0.0
         elif self.stat == 'avg_l' or (not qn and self.stat in ['wavg_l','tavg_l']):
@@ -640,9 +640,8 @@ class TaxClade:
         return ret
 
 class TaxTree:
-    def __init__( self, mpa, markers_to_ignore = None, unknown_calculation = None ): #, min_cu_len ):
+    def __init__( self, mpa, markers_to_ignore = None ): #, min_cu_len ):
         self.root = TaxClade( "root", 0)
-        self.unknown_calculation = unknown_calculation
         self.all_clades, self.markers2lens, self.markers2clades, self.taxa2clades, self.markers2exts = {}, {}, {}, {}, {}
         TaxClade.markers2lens = self.markers2lens
         TaxClade.markers2exts = self.markers2exts
@@ -661,13 +660,17 @@ class TaxTree:
             father = self.root
             for i in range(len(clade)):
                 clade_lev = clade[i]
-                clade_taxid = taxids[i] if i < 7 and taxids is not None else None
+                if SGB_ANALYSIS:
+                    clade_taxid = taxids[i] if i < 8 and taxids is not None else None
+                else:
+                    clade_taxid = taxids[i] if i < 7 and taxids is not None else None
                 if not clade_lev in father.children:
                     father.add_child(clade_lev, tax_id=clade_taxid)
                     self.all_clades[clade_lev] = father.children[clade_lev]
+                if SGB_ANALYSIS: father = father.children[clade_lev]
                 if clade_lev[0] == "t":
                     self.taxa2clades[clade_lev[3:]] = father
-                father = father.children[clade_lev]
+                if not SGB_ANALYSIS: father = father.children[clade_lev]
                 if clade_lev[0] == "t":
                     father.glen = lenc
 
@@ -677,11 +680,7 @@ class TaxTree:
             lens = []
             for c in node.children.values():
                 lens.append( add_lens( c ) )
-            # use a different length for the unknown calculation
-            if self.unknown_calculation:
-                node.glen = np.median(lens) * GENOME_LENGTH_BOOST_FOR_UNKNOWN_ESTIMATE
-            else:
-                node.glen = min(np.mean(lens), np.median(lens))
+            node.glen = min(np.mean(lens), np.median(lens))
             return node.glen
         
         add_lens(self.root)
@@ -708,18 +707,27 @@ class TaxTree:
     def add_reads(  self, marker, n,
                     add_viruses = False,
                     ignore_eukaryotes = False,
-                    ignore_bacteria = False, ignore_archaea = False  ):
+                    ignore_bacteria = False, ignore_archaea = False, 
+                    ignore_ksgbs = False, ignore_usgbs = False  ):
         clade = self.markers2clades[marker]
         cl = self.all_clades[clade]
-        if ignore_eukaryotes or ignore_bacteria or ignore_archaea or not add_viruses:
+        if ignore_bacteria or ignore_archaea or ignore_eukaryotes:
             cn = cl.get_full_name()
-            if not add_viruses and cn.startswith("k__Vir"):
+            if ignore_archaea and cn.startswith("k__Archaea"):
+                return (None, None)
+            if ignore_bacteria and cn.startswith("k__Bacteria"):
                 return (None, None)
             if ignore_eukaryotes and cn.startswith("k__Eukaryota"):
                 return (None, None)
-            if ignore_archaea and cn.startswith("k__Archaea"):
+        if not SGB_ANALYSIS and not add_viruses:
+            cn = cl.get_full_name()
+            if not add_viruses and cn.startswith("k__Vir"):
                 return (None, None)
-            if ignore_bacteria and cn.startswith("k__Bacteria"):
+        if SGB_ANALYSIS and (ignore_ksgbs or ignore_usgbs):
+            cn = cl.get_full_name()
+            if ignore_ksgbs and not '_SGB' in cn.split('|')[-2]:
+                return (None, None)
+            if ignore_usgbs and '_SGB' in cn.split('|')[-2]:
                 return (None, None)
         # while len(cl.children) == 1:
             # cl = list(cl.children.values())[0]
@@ -756,7 +764,7 @@ class TaxTree:
 
         for tax_label, clade in clade2abundance_n.items():
             for clade_label, tax_id, abundance in sorted(clade.get_all_abundances(), key=lambda pars:pars[0]):
-                if clade_label[:3] != 't__':
+                if SGB_ANALYSIS or clade_label[:3] != 't__':
                     if not tax_lev:
                         if clade_label not in self.all_clades:
                             to = tax_units.index(clade_label[0])
@@ -770,7 +778,8 @@ class TaxTree:
                         else:
                             glen = self.all_clades[clade_label].glen
                             tax_id = self.all_clades[clade_label].get_full_taxids()
-                            if 's__' in clade_label and abundance > 0:
+                            tax_level = 't__' if SGB_ANALYSIS else 's__' 
+                            if tax_level in clade_label and abundance > 0:
                                 self.all_clades[clade_label].nreads = int(np.floor(abundance*glen))
 
                             clade_label = self.all_clades[clade_label].get_full_name()
@@ -797,7 +806,7 @@ class TaxTree:
         ret_r = dict([( tax, (abundance, clade2est_nreads[tax] )) for tax, abundance in clade2abundance.items() if tax in clade2est_nreads])
 
         if tax_lev:
-            ret_d[("UNKNOWN", '-1')] = 1.0 - sum(ret_d.values())  
+            ret_d[("UNCLASSIFIED", '-1')] = 1.0 - sum(ret_d.values())  
         return ret_d, ret_r, tot_reads
 
 def mapq_filter(marker_name, mapq_value, min_mapq_val):
@@ -826,7 +835,7 @@ def map2bbh(mapping_f, min_mapq_val, input_type='bowtie2out', min_alignment_len=
             if r.startswith('#') and 'nreads' in r:
                 n_metagenome_reads = int(c)
             if r.startswith('#') and 'avg_read_length' in r:
-                avg_read_length = float(c)            
+                avg_read_length = float(c)
             else:
                 reads2markers[r] = c
     elif input_type == 'sam':
@@ -931,55 +940,16 @@ def maybe_generate_biom_file(tree, pars, abundance_predictions):
 
     return True
 
-def create_tree_and_add_reads(mpa_pkl, ignore_markers, pars, avg_read_length, markers2reads, unknown_calculation=False):
-    # Create an instance of the TaxTree and update the stats settings and then add the reads and markers
-    tree = TaxTree( mpa_pkl, ignore_markers , unknown_calculation = unknown_calculation)
-    tree.set_min_cu_len( pars['min_cu_len'] )
-    tree.set_stat( pars['stat'], pars['stat_q'], pars['perc_nonzero'], avg_read_length, pars['avoid_disqm'] )
-
-    map_out = []
-    for marker,reads in sorted(markers2reads.items(), key=lambda pars: pars[0]):
-        if marker not in tree.markers2lens:
-            continue
-        tax_seq, ids_seq = tree.add_reads( marker, len(reads),
-                                  add_viruses = pars['add_viruses'],
-                                  ignore_eukaryotes = pars['ignore_eukaryotes'],
-                                  ignore_bacteria = pars['ignore_bacteria'],
-                                  ignore_archaea = pars['ignore_archaea'],
-                                  )
-        if tax_seq:
-            map_out +=["\t".join([r,tax_seq, ids_seq]) for r in sorted(reads)]
-
-    return tree, map_out
-
-def compute_relative_abundance_and_fraction_of_mapped_reads(tree, pars, n_metagenome_reads, avg_read_length, ESTIMATE_UNK, unknown_calculation=False):
-    # Using the tree compute the relative abundance and then the fraction of mapped reads using the total mapped reads
-
-    # if computing the unknown calculation, set the tree settings to reset to the unknown stat
-    # must be set here and then reset as this is a global setting that will apply to all trees if set
-    if unknown_calculation:
-        tree.set_stat( 'unknown', pars['stat_q'], pars['perc_nonzero'], avg_read_length, pars['avoid_disqm'] )
-
-    cl2ab, rr, tot_nreads = tree.relative_abundances(
-            pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
-
-    if unknown_calculation:
-        tree.set_stat( pars['stat'], pars['stat_q'], pars['perc_nonzero'], avg_read_length, pars['avoid_disqm'] )
-
-    # If the mapped reads are over-estimated, set the ratio at 1
-    fraction_mapped_reads = min(tot_nreads/float(n_metagenome_reads), 1.0) if ESTIMATE_UNK else 1.0
-
-    # check for a negative fraction
-    if unknown_calculation and fraction_mapped_reads < 0:
-        fraction_mapped_reads = 1.0
-
-    return cl2ab, fraction_mapped_reads, tot_nreads, rr
-
 def main():
     ranks2code = { 'k' : 'superkingdom', 'p' : 'phylum', 'c':'class',
                    'o' : 'order', 'f' : 'family', 'g' : 'genus', 's' : 'species'}
     pars = read_params(sys.argv)
-    ESTIMATE_UNK = pars['unknown_estimation']
+
+    #Set SGB- / species- analysis
+    global SGB_ANALYSIS
+    SGB_ANALYSIS = not pars['mpa3']
+
+    ESTIMATE_UNK = pars['unclassified_estimation']
 
     # check if the database is installed, if not then install
     pars['index'] = check_and_install_database(pars['index'], pars['bowtie2db'], pars['bowtie2_build'], pars['nproc'], pars['force_download'])
@@ -1046,15 +1016,14 @@ def main():
                 if os.path.exists(pars['bowtie2out']):
                     os.remove( pars['bowtie2out'] )
 
+        bt2_ext = 'bt2l' if SGB_ANALYSIS else 'bt2'            
         if bow and not all([os.path.exists(".".join([str(pars['bowtie2db']), p]))
-                            for p in ["1.bt2", "2.bt2", "3.bt2", "4.bt2", "rev.1.bt2", "rev.2.bt2"]]):
+                            for p in ["1." + bt2_ext, "2." + bt2_ext, "3." + bt2_ext, "4." + bt2_ext, "rev.1." + bt2_ext, "rev.2." + bt2_ext]]):
             sys.stderr.write("No MetaPhlAn BowTie2 database found (--index "
                              "option)!\nExpecting location {}\nExiting..."
                              .format(pars['bowtie2db']))
             sys.exit(1)
-
-        # check for an incomplete build
-        if bow and not (abs(os.path.getsize(".".join([str(pars['bowtie2db']), "1.bt2"])) - os.path.getsize(".".join([str(pars['bowtie2db']), "rev.1.bt2"]))) <= 1000):
+        if bow and not (abs(os.path.getsize(".".join([str(pars['bowtie2db']), "1." + bt2_ext])) - os.path.getsize(".".join([str(pars['bowtie2db']), "rev.1." + bt2_ext]))) <= 1000):
             sys.stderr.write("Partial MetaPhlAn BowTie2 database found at {}. "
                              "Please remove and rebuild the database.\nExiting..."
                              .format(pars['bowtie2db']))
@@ -1071,19 +1040,19 @@ def main():
         mpa_pkl = pickle.load( a )
 
     REPORT_MERGED = mpa_pkl.get('merged_taxon',False)
-    
+    tree = TaxTree( mpa_pkl, ignore_markers )
+    tree.set_min_cu_len( pars['min_cu_len'] )
+
     markers2reads, n_metagenome_reads, avg_read_length = map2bbh(pars['inp'], pars['min_mapq_val'], pars['input_type'], pars['min_alignment_len'])
 
-    # create a tree for the relative abundance and unknown calculations (each use different default computations as unknown includes all the alignments)
-    tree, map_out = create_tree_and_add_reads( mpa_pkl, ignore_markers, pars, avg_read_length, markers2reads)
-    tree_unknown, map_out_unknown = create_tree_and_add_reads( mpa_pkl, ignore_markers, pars, avg_read_length, markers2reads, unknown_calculation=True)
+    tree.set_stat( pars['stat'], pars['stat_q'], pars['perc_nonzero'], avg_read_length, pars['avoid_disqm'])
 
     if no_map:
         os.remove( pars['inp'] )
 
     if not n_metagenome_reads and pars['nreads']:
         n_metagenome_reads = pars['nreads']
-
+    
     if ESTIMATE_UNK and pars['input_type'] == 'sam':
         if not n_metagenome_reads and not pars['nreads']:
             sys.stderr.write(
@@ -1092,6 +1061,21 @@ def main():
                     "\nExiting...\n\n" )
             sys.exit(1)
 
+    map_out = []
+    for marker,reads in sorted(markers2reads.items(), key=lambda pars: pars[0]):
+        if marker not in tree.markers2lens:
+            continue
+        tax_seq, ids_seq = tree.add_reads( marker, len(reads),
+                                  add_viruses = pars['add_viruses'],
+                                  ignore_eukaryotes = pars['ignore_eukaryotes'],
+                                  ignore_bacteria = pars['ignore_bacteria'],
+                                  ignore_archaea = pars['ignore_archaea'],
+                                  ignore_ksgbs = pars['ignore_ksgbs'],
+                                  ignore_usgbs = pars['ignore_usgbs']
+                                  )
+        if tax_seq:
+            map_out +=["\t".join([r,tax_seq, ids_seq]) for r in sorted(reads)]
+
     if pars['output'] is None and pars['output_file'] is not None:
         pars['output'] = pars['output_file']
 
@@ -1107,6 +1091,20 @@ def main():
         if not CAMI_OUTPUT:
             outf.write('#' + '\t'.join((pars["sample_id_key"], pars["sample_id"])) + '\n')
 
+        if ESTIMATE_UNK:
+            mapped_reads = 0
+            cl2pr = tree.clade_profiles( pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None  )
+            cl2ab, _, _ = tree.relative_abundances( pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
+            confident_taxa = [taxstr for (taxstr, _),relab in cl2ab.items() if relab > 0.0]
+            for c, m in cl2pr.items():
+                if c in confident_taxa:
+                    markers_cov = [a  / 1000 for _, a in m if a > 0]
+                    mapped_reads += np.mean(markers_cov) * tree.all_clades[c.split('|')[-1]].glen
+            # If the mapped reads are over-estimated, set the ratio at 1
+            fraction_mapped_reads = min(mapped_reads/float(n_metagenome_reads), 1.0)
+        else:
+            fraction_mapped_reads = 1.0
+      
         if pars['t'] == 'reads_map':
             if not MPA2_OUTPUT:
                outf.write('#read_id\tNCBI_taxlineage_str\tNCBI_taxlineage_ids\n')
@@ -1121,11 +1119,9 @@ def main():
                 else:
                     outf.write('#clade_name\tNCBI_tax_id\trelative_abundance\n')
 
-            cl2ab, fraction_mapped_reads, tot_nreads, rr = compute_relative_abundance_and_fraction_of_mapped_reads(tree, pars, n_metagenome_reads, avg_read_length, ESTIMATE_UNK)
-
-            # compute the mapped reads fraction again considering all alignments
-            cl2ab_unknown, fraction_mapped_reads, tot_nreads, rr_unknown = compute_relative_abundance_and_fraction_of_mapped_reads(tree_unknown, pars, n_metagenome_reads, avg_read_length, ESTIMATE_UNK, unknown_calculation=True)
-
+            cl2ab, _, tot_nreads = tree.relative_abundances(
+                        pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
+            
             outpred = [(taxstr, taxid,round(relab*100.0,5)) for (taxstr, taxid), relab in cl2ab.items() if relab > 0.0]
             has_repr = False
             
@@ -1140,7 +1136,7 @@ def main():
                             outf.write( '\t'.join( [ leaf_taxid, rank, taxid, taxpathsh, str(relab*fraction_mapped_reads) ] ) + '\n' )
                 else:
                     if ESTIMATE_UNK:
-                        outf.write( "\t".join( [    "UNKNOWN",
+                        outf.write( "\t".join( [    "UNCLASSIFIED",
                                                     "-1",
                                                     str(round((1-fraction_mapped_reads)*100,5)),""]) + "\n" )
                                                     
@@ -1148,10 +1144,10 @@ def main():
                                         key=lambda x:x[2]+(100.0*(8-(x[0].count("|"))))):
                         add_repr = ''
                         if REPORT_MERGED and (clade, taxid) in mpa_pkl['merged_taxon']:
-                            if pars['use_group_representative']:
+                            if pars['use_group_representative'] and not SGB_ANALYSIS:
                                 if '_group' in clade:
                                     clade, taxid, _ = sorted(mpa_pkl['merged_taxon'][(clade, taxid)], key=lambda x:x[2], reverse=True)[0]
-                            else:
+                            elif not pars['use_group_representative']:
                                 add_repr = '{}'.format(','.join( [ n[0] for n in mpa_pkl['merged_taxon'][(clade, taxid)]] ))
                                 has_repr = True
                         if not MPA2_OUTPUT:
@@ -1169,15 +1165,15 @@ def main():
                                     )
             else:
                 if not MPA2_OUTPUT:
-                    outf.write( "UNKNOWN\t-1\t100.0\t\n" )
+                    outf.write( "UNCLASSIFIED\t-1\t100.0\t\n" )
                 else:
-                    outf.write( "UNKNOWN\t100.0\n" )
+                    outf.write( "UNCLASSIFIED\t100.0\n" )
+                sys.stderr.write("WARNING: MetaPhlAn did not detect any microbial taxa in the sample.\n")
             maybe_generate_biom_file(tree, pars, outpred)
 
         elif pars['t'] == 'rel_ab_w_read_stats':
-            cl2ab, fraction_mapped_reads, tot_nreads, rr = compute_relative_abundance_and_fraction_of_mapped_reads(tree, pars, n_metagenome_reads, avg_read_length, ESTIMATE_UNK)
-            # compute the mapped reads fraction again considering all alignments
-            cl2ab_unknown, fraction_mapped_reads, tot_nreads, rr_unknown = compute_relative_abundance_and_fraction_of_mapped_reads(tree_unknown, pars, n_metagenome_reads, avg_read_length, ESTIMATE_UNK, unknown_calculation=True)
+            cl2ab, rr, tot_nreads = tree.relative_abundances(
+                        pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
 
             unmapped_reads = max(n_metagenome_reads - tot_nreads, 0)
 
@@ -1191,7 +1187,7 @@ def main():
                                             "coverage",
                                             "estimated_number_of_reads_from_the_clade" ]) +"\n" )
                 if ESTIMATE_UNK:
-                    outf.write( "\t".join( [    "UNKNOWN",
+                    outf.write( "\t".join( [    "UNCLASSIFIED",
                                                 "-1",
                                                 str(round((1-fraction_mapped_reads)*100,5)),
                                                 "-",
@@ -1264,3 +1260,4 @@ if __name__ == '__main__':
     t0 = time.time()
     main()
     sys.stderr.write('Elapsed time to run MetaPhlAn: {} s\n'.format( (time.time()-t0) ) )
+


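Editorial note on the metaphlan.py hunks above: the patch drops the old two-tree `unknown_calculation` path and instead estimates mapped reads inline, as mean per-marker coverage times clade genome length, capped so the mapped fraction never exceeds 1. A minimal standalone sketch of that calculation (function and variable names here are illustrative, not code from the patch):

```python
# Sketch of the reworked unclassified-fraction estimate (MetaPhlAn 4 style).
# Marker values are divided by 1000 as in the patch; all names are illustrative.

def estimate_fraction_mapped(clade_profiles, clade_genome_lens, n_metagenome_reads):
    """clade_profiles: {clade: [(marker, coverage), ...]} for confidently
    detected clades; clade_genome_lens: {clade: genome length in bp}."""
    mapped_reads = 0.0
    for clade, markers in clade_profiles.items():
        covs = [cov / 1000 for _, cov in markers if cov > 0]
        if covs:
            # mean marker coverage scaled by the clade's genome length
            mapped_reads += (sum(covs) / len(covs)) * clade_genome_lens[clade]
    # if the mapped reads are over-estimated, cap the ratio at 1 (as in the diff)
    return min(mapped_reads / float(n_metagenome_reads), 1.0)

frac = estimate_fraction_mapped(
    {'s__Example': [('m1', 2000.0), ('m2', 1000.0)]},  # per-marker coverages
    {'s__Example': 3_000_000},                          # genome length in bp
    10_000_000)                                         # total metagenome reads
unclassified_pct = round((1 - frac) * 100, 5)           # reported as UNCLASSIFIED
```

The cap mirrors the patch's comment that an over-estimated mapped-read count must not push the UNCLASSIFIED fraction below zero.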
=====================================
metaphlan/strainphlan.py
=====================================
@@ -4,8 +4,8 @@ __author__ = ('Aitor Blanco Miguez (aitor.blancomiguez at unitn.it), '
               'Francesco Asnicar (f.asnicar at unitn.it), '
               'Moreno Zolfo (moreno.zolfo at unitn.it), '
               'Francesco Beghini (francesco.beghini at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 
 import sys
@@ -29,7 +29,7 @@ from Bio.Seq import Seq
 metaphlan_script_install_folder = os.path.dirname(os.path.abspath(__file__))
 DEFAULT_DB_FOLDER = os.path.join(metaphlan_script_install_folder, "metaphlan_databases")
 DEFAULT_DB_FOLDER = os.environ.get('METAPHLAN_DB_DIR', DEFAULT_DB_FOLDER)
-DEFAULT_DB_NAME =  "mpa_v31_CHOCOPhlAn_201901.pkl"
+DEFAULT_DB_NAME =  "mpa_vJan21_CHOCOPhlAnSGB_202103.pkl"
 DEFAULT_DATABASE = os.path.join(DEFAULT_DB_FOLDER, DEFAULT_DB_NAME)
 PHYLOPHLAN_MODES = ['accurate', 'fast']
 
@@ -371,7 +371,7 @@ For each sample, writes the FASTA files with the sequences of the filtered marke
 """
 def matrix_markers_to_fasta(cleaned_markers_matrix, clade, samples, references, 
     trim_sequences, tmp_dir, nprocs):     
-    tmp_dir=tmp_dir+clade+".StrainPhlAn3/"    
+    tmp_dir=tmp_dir+clade+".StrainPhlAn4/"    
     create_folder(tmp_dir)
 
     filtered_samples = []
@@ -560,11 +560,11 @@ Executes PhyloPhlAn2 to compute phylogeny
 """
 def compute_phylogeny(samples_list, samples_markers_dir, num_samples, tmp_dir, output_dir, clade, 
     marker_in_n_samples, min_markers, phylophlan_mode, phylophlan_configuration, mutation_rates, nprocs):    
-    info("\tCreating PhyloPhlAn 3.0 database...", init_new_line=True)
+    info("\tCreating PhyloPhlAn database...", init_new_line=True)
     create_phylophlan_db(tmp_dir, clade[:30])
     info("\tDone.", init_new_line=True)
     if not phylophlan_configuration:     
-        info("\tGenerating PhyloPhlAn 3.0 configuration file...", init_new_line=True)
+        info("\tGenerating PhyloPhlAn configuration file...", init_new_line=True)
         conf = get_phylophlan_configuration()
         phylophlan_configuration = generate_phylophlan_config_file(tmp_dir, conf)
         info("\tDone.", init_new_line=True)   
@@ -888,3 +888,4 @@ def main():
 
 if __name__ == '__main__':
     main()
+

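The strainphlan.py hunk above keeps the `METAPHLAN_DB_DIR` environment override while swapping the default pickle to the new SGB database. The resolution order can be sketched as follows (standalone illustration using the names from the diff):

```python
import os

def default_database(install_folder,
                     db_name="mpa_vJan21_CHOCOPhlAnSGB_202103.pkl"):
    """Mirror the diff's lookup: METAPHLAN_DB_DIR, if set, overrides the
    metaphlan_databases folder that sits next to the installed scripts."""
    folder = os.path.join(install_folder, "metaphlan_databases")
    folder = os.environ.get('METAPHLAN_DB_DIR', folder)
    return os.path.join(folder, db_name)
```

So a packaged install can point StrainPhlAn at a system-wide database location simply by exporting `METAPHLAN_DB_DIR`, without touching the code.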

=====================================
metaphlan/utils/VallesColomerM_2022_Nov19_thresholds.tsv
=====================================
@@ -0,0 +1,647 @@
+SGB	used_nGD_score_percentile
+SGB15369	0.000552792
+SGB5065	0.001298701
+SGB15216	0.001708817
+SGB5075	0.002011494
+SGB1626	0.00226576
+SGB4938_group	0.002639342
+SGB4367	0.003134796
+SGB4925	0.003558628
+SGB4951	0.00410027
+SGB14114	0.004101162
+SGB5077	0.004192872
+SGB4540	0.004343276
+SGB4933	0.004410658
+SGB5082	0.005245558
+SGB4303	0.005311355
+SGB4914	0.006474767
+SGB4811	0.006593407
+SGB5090	0.006881451
+SGB14631_group	0.00746484
+SGB5111	0.008006672
+SGB15356	0.008805031
+SGB4874	0.009523619
+SGB5121	0.00952381
+SGB4844	0.009562585
+SGB9286	0.009920635
+SGB4191	0.010416667
+SGB4290	0.01122807
+SGB9226	0.011264535
+SGB4348	0.011708861
+SGB5190	0.012303486
+SGB15180	0.01254902
+SGB5117	0.012635909
+SGB15260	0.012910798
+SGB4285	0.013098905
+SGB15318	0.015165616
+SGB15332	0.015370986
+SGB17248	0.015683346
+SGB15346	0.015686275
+SGB1846	0.015728978
+SGB4557	0.015888616
+SGB15340	0.016237003
+SGB15316	0.016501822
+SGB15286	0.016875786
+SGB5825	0.017105263
+SGB1798	0.018596491
+SGB1836_group	0.018794404
+SGB15265_group	0.019010344
+SGB15300	0.019277108
+SGB15452	0.019572418
+SGB15234_group	0.019636015
+SGB5089	0.020376176
+SGB15295	0.020568663
+SGB15244	0.020733062
+SGB1644	0.02122449
+SGB4422	0.02122449
+SGB5087	0.021245421
+SGB4652	0.021355077
+SGB15374	0.021428571
+SGB4886	0.021548822
+SGB1855	0.02173913
+SGB4826	0.021972534
+SGB4532	0.022144522
+SGB14311	0.023694928
+SGB4940	0.024437467
+SGB1814	0.024700911
+SGB4564	0.025074589
+SGB1872_group	0.026400886
+SGB4582	0.026463512
+SGB4716	0.027417027
+SGB15031_group	0.027508376
+SGB15106	0.028985507
+SGB4198	0.029315673
+SGB10115_group	0.03
+SGB10130	0.03
+SGB10131	0.03
+SGB10139_group	0.03
+SGB1024	0.03
+SGB12676	0.03
+SGB1303	0.03
+SGB13972	0.03
+SGB13976	0.03
+SGB13977	0.03
+SGB13979	0.03
+SGB13981	0.03
+SGB13982	0.03
+SGB13983	0.03
+SGB13997	0.03
+SGB13999	0.03
+SGB14005	0.03
+SGB14007	0.03
+SGB14017	0.03
+SGB14027	0.03
+SGB1404	0.03
+SGB14042	0.03
+SGB14110	0.03
+SGB14136	0.03
+SGB14177	0.03
+SGB14193	0.03
+SGB14205	0.03
+SGB14209	0.03
+SGB14215	0.03
+SGB14237	0.03
+SGB14243	0.03
+SGB14247	0.03
+SGB14248	0.03
+SGB14250	0.03
+SGB14252	0.03
+SGB14253	0.03
+SGB14256	0.03
+SGB14263	0.03
+SGB14294	0.03
+SGB14313	0.03
+SGB14317	0.03
+SGB14321	0.03
+SGB14322	0.03
+SGB14334	0.03
+SGB14336	0.03
+SGB14341_group	0.03
+SGB1437	0.03
+SGB14370	0.03
+SGB14373	0.03
+SGB14379	0.03
+SGB14380	0.03
+SGB14397	0.03
+SGB14398	0.03
+SGB1471	0.03
+SGB1472	0.03
+SGB1473	0.03
+SGB1474	0.03
+SGB1475	0.03
+SGB1477	0.03
+SGB14770	0.03
+SGB14773	0.03
+SGB14786	0.03
+SGB14809	0.03
+SGB14816	0.03
+SGB14822	0.03
+SGB14824_group	0.03
+SGB14834	0.03
+SGB14844	0.03
+SGB14853	0.03
+SGB14894	0.03
+SGB14899	0.03
+SGB14905	0.03
+SGB14906	0.03
+SGB14908	0.03
+SGB14912	0.03
+SGB14921	0.03
+SGB14923	0.03
+SGB14924	0.03
+SGB14932	0.03
+SGB14937	0.03
+SGB14953	0.03
+SGB14954	0.03
+SGB14974	0.03
+SGB14980	0.03
+SGB14990	0.03
+SGB14991	0.03
+SGB14999	0.03
+SGB15012	0.03
+SGB15015	0.03
+SGB15019	0.03
+SGB15022	0.03
+SGB15035	0.03
+SGB15040	0.03
+SGB15041	0.03
+SGB15042	0.03
+SGB15049	0.03
+SGB15051	0.03
+SGB15052	0.03
+SGB15065	0.03
+SGB15073	0.03
+SGB15081	0.03
+SGB15084	0.03
+SGB15085	0.03
+SGB15087	0.03
+SGB15093	0.03
+SGB15095	0.03
+SGB15097	0.03
+SGB15100	0.03
+SGB15101	0.03
+SGB15103	0.03
+SGB15107	0.03
+SGB15108	0.03
+SGB15119	0.03
+SGB15120	0.03
+SGB15121	0.03
+SGB15124	0.03
+SGB15126	0.03
+SGB15127	0.03
+SGB15131	0.03
+SGB15143	0.03
+SGB15145	0.03
+SGB15196	0.03
+SGB15198	0.03
+SGB15203	0.03
+SGB15213	0.03
+SGB15225	0.03
+SGB15229	0.03
+SGB15238	0.03
+SGB15271	0.03
+SGB15272	0.03
+SGB15273	0.03
+SGB15285	0.03
+SGB15287	0.03
+SGB15292	0.03
+SGB15299	0.03
+SGB15310	0.03
+SGB15323	0.03
+SGB15350	0.03
+SGB15355	0.03
+SGB15370	0.03
+SGB15373	0.03
+SGB15377	0.03
+SGB15382	0.03
+SGB15385	0.03
+SGB15390	0.03
+SGB15394	0.03
+SGB15395	0.03
+SGB15403	0.03
+SGB15413	0.03
+SGB15447	0.03
+SGB15451	0.03
+SGB15460	0.03
+SGB15467	0.03
+SGB15497	0.03
+SGB15878	0.03
+SGB1612	0.03
+SGB1613	0.03
+SGB1614	0.03
+SGB1617_group	0.03
+SGB1623	0.03
+SGB1636_group	0.03
+SGB1653	0.03
+SGB1657	0.03
+SGB1662	0.03
+SGB1663	0.03
+SGB1664	0.03
+SGB1666	0.03
+SGB1667	0.03
+SGB1668	0.03
+SGB1670	0.03
+SGB1672_group	0.03
+SGB1673	0.03
+SGB1675	0.03
+SGB1676	0.03
+SGB1677	0.03
+SGB1679	0.03
+SGB1680	0.03
+SGB16975_group	0.03
+SGB16985	0.03
+SGB17008	0.03
+SGB17009	0.03
+SGB1701	0.03
+SGB1705	0.03
+SGB17130	0.03
+SGB17152	0.03
+SGB17153	0.03
+SGB17156	0.03
+SGB17161_group	0.03
+SGB17167	0.03
+SGB17168	0.03
+SGB17169	0.03
+SGB17195	0.03
+SGB17231	0.03
+SGB17234	0.03
+SGB17237	0.03
+SGB17241	0.03
+SGB17243	0.03
+SGB17247	0.03
+SGB17256	0.03
+SGB17278	0.03
+SGB17364	0.03
+SGB1784_group	0.03
+SGB1797	0.03
+SGB1832	0.03
+SGB1853	0.03
+SGB1857	0.03
+SGB1858	0.03
+SGB1883_group	0.03
+SGB1891	0.03
+SGB1903	0.03
+SGB1928	0.03
+SGB1941	0.03
+SGB19436	0.03
+SGB1945	0.03
+SGB1948	0.03
+SGB1957	0.03
+SGB1962	0.03
+SGB1963	0.03
+SGB19692	0.03
+SGB19694	0.03
+SGB2021	0.03
+SGB2071	0.03
+SGB2212	0.03
+SGB2214	0.03
+SGB2227	0.03
+SGB2230	0.03
+SGB2238	0.03
+SGB2239	0.03
+SGB2240	0.03
+SGB2241	0.03
+SGB2276	0.03
+SGB2279	0.03
+SGB2286	0.03
+SGB2296	0.03
+SGB2311	0.03
+SGB2313	0.03
+SGB23226	0.03
+SGB2325	0.03
+SGB2326	0.03
+SGB2328	0.03
+SGB24891	0.03
+SGB25497	0.03
+SGB29095	0.03
+SGB3539	0.03
+SGB3546_group	0.03
+SGB3574	0.03
+SGB3668	0.03
+SGB3677	0.03
+SGB376	0.03
+SGB3922	0.03
+SGB3926	0.03
+SGB3934	0.03
+SGB3940	0.03
+SGB3952	0.03
+SGB3957	0.03
+SGB3964	0.03
+SGB3983	0.03
+SGB3988	0.03
+SGB3989	0.03
+SGB3996	0.03
+SGB4029	0.03
+SGB4037	0.03
+SGB4046	0.03
+SGB4110	0.03
+SGB4121_group	0.03
+SGB4122	0.03
+SGB4155	0.03
+SGB4181	0.03
+SGB4247	0.03
+SGB4260	0.03
+SGB4272	0.03
+SGB4274	0.03
+SGB4280	0.03
+SGB4289	0.03
+SGB4328	0.03
+SGB4329	0.03
+SGB4333	0.03
+SGB4350	0.03
+SGB4364	0.03
+SGB4365	0.03
+SGB4373	0.03
+SGB4394	0.03
+SGB4395	0.03
+SGB4418	0.03
+SGB4447_group	0.03
+SGB4537	0.03
+SGB4547	0.03
+SGB4552	0.03
+SGB4553	0.03
+SGB4571	0.03
+SGB4583	0.03
+SGB4584	0.03
+SGB4588_group	0.03
+SGB4617	0.03
+SGB4637	0.03
+SGB4638	0.03
+SGB4643	0.03
+SGB4648	0.03
+SGB4654	0.03
+SGB4655	0.03
+SGB4658	0.03
+SGB4659	0.03
+SGB4669	0.03
+SGB4670	0.03
+SGB4699	0.03
+SGB4706	0.03
+SGB4709	0.03
+SGB4711	0.03
+SGB4712	0.03
+SGB4714	0.03
+SGB4721	0.03
+SGB4741	0.03
+SGB4742	0.03
+SGB4750	0.03
+SGB4752	0.03
+SGB4753	0.03
+SGB4758	0.03
+SGB4762	0.03
+SGB4768	0.03
+SGB4771	0.03
+SGB4779	0.03
+SGB4780	0.03
+SGB4781	0.03
+SGB4782	0.03
+SGB4788	0.03
+SGB4804	0.03
+SGB4809	0.03
+SGB4810	0.03
+SGB4815	0.03
+SGB4816	0.03
+SGB4824	0.03
+SGB4828	0.03
+SGB4829	0.03
+SGB4831	0.03
+SGB4832	0.03
+SGB4834	0.03
+SGB4867	0.03
+SGB4882	0.03
+SGB4891	0.03
+SGB4893	0.03
+SGB4894	0.03
+SGB4900	0.03
+SGB4903	0.03
+SGB4909	0.03
+SGB4921	0.03
+SGB4922	0.03
+SGB4929	0.03
+SGB4930	0.03
+SGB4939	0.03
+SGB4950	0.03
+SGB4953	0.03
+SGB4957	0.03
+SGB4958	0.03
+SGB4959	0.03
+SGB4960	0.03
+SGB4966	0.03
+SGB4987	0.03
+SGB4988	0.03
+SGB5042	0.03
+SGB5043	0.03
+SGB5051	0.03
+SGB5060	0.03
+SGB5062	0.03
+SGB5063	0.03
+SGB5066	0.03
+SGB5071	0.03
+SGB5072	0.03
+SGB5076	0.03
+SGB5099	0.03
+SGB5104	0.03
+SGB5115	0.03
+SGB5116	0.03
+SGB5118	0.03
+SGB5234	0.03
+SGB5239	0.03
+SGB5736	0.03
+SGB5742	0.03
+SGB5752	0.03
+SGB5785	0.03
+SGB5803	0.03
+SGB5809	0.03
+SGB5843	0.03
+SGB5853	0.03
+SGB5858	0.03
+SGB5904	0.03
+SGB5906	0.03
+SGB5910_group	0.03
+SGB5967	0.03
+SGB6139	0.03
+SGB6140	0.03
+SGB6148	0.03
+SGB6173	0.03
+SGB6174	0.03
+SGB6178	0.03
+SGB6179	0.03
+SGB6180	0.03
+SGB6190	0.03
+SGB6191	0.03
+SGB6289	0.03
+SGB6305	0.03
+SGB6308	0.03
+SGB6310	0.03
+SGB6317	0.03
+SGB6328	0.03
+SGB6329	0.03
+SGB6340	0.03
+SGB6347	0.03
+SGB6358	0.03
+SGB6362	0.03
+SGB6367	0.03
+SGB6371	0.03
+SGB6376	0.03
+SGB6385	0.03
+SGB6420	0.03
+SGB6421	0.03
+SGB6422	0.03
+SGB6428	0.03
+SGB6438	0.03
+SGB6465	0.03
+SGB6473	0.03
+SGB6478	0.03
+SGB6502	0.03
+SGB6503	0.03
+SGB6506	0.03
+SGB6518	0.03
+SGB6522	0.03
+SGB6533	0.03
+SGB6571	0.03
+SGB6579	0.03
+SGB6584	0.03
+SGB6601	0.03
+SGB6602	0.03
+SGB6607	0.03
+SGB6609	0.03
+SGB6744	0.03
+SGB6747	0.03
+SGB6750	0.03
+SGB6771	0.03
+SGB6790	0.03
+SGB6806	0.03
+SGB6816	0.03
+SGB6817	0.03
+SGB6823	0.03
+SGB6836	0.03
+SGB6846	0.03
+SGB6847	0.03
+SGB6936	0.03
+SGB6939	0.03
+SGB6952	0.03
+SGB6956	0.03
+SGB6962	0.03
+SGB6970	0.03
+SGB6973	0.03
+SGB698	0.03
+SGB7061	0.03
+SGB7088	0.03
+SGB7115	0.03
+SGB7144	0.03
+SGB7852	0.03
+SGB7858	0.03
+SGB7865	0.03
+SGB7980_group	0.03
+SGB8002_group	0.03
+SGB8005	0.03
+SGB8022_group	0.03
+SGB8027_group	0.03
+SGB8076	0.03
+SGB8138_group	0.03
+SGB8590	0.03
+SGB8601	0.03
+SGB8617	0.03
+SGB8619	0.03
+SGB8640	0.03
+SGB8644	0.03
+SGB8651	0.03
+SGB8653	0.03
+SGB8655	0.03
+SGB8774	0.03
+SGB9202	0.03
+SGB9203	0.03
+SGB9205	0.03
+SGB9209	0.03
+SGB9210	0.03
+SGB9224	0.03
+SGB9228	0.03
+SGB9243	0.03
+SGB9260	0.03
+SGB9272_group	0.03
+SGB9273	0.03
+SGB9274	0.03
+SGB9278	0.03
+SGB9299	0.03
+SGB9311	0.03
+SGB9323	0.03
+SGB9324	0.03
+SGB9333	0.03
+SGB9340	0.03
+SGB9346	0.03
+SGB9347	0.03
+SGB9391	0.03
+SGB9712_group	0.03
+SGB4871	0.030075188
+SGB15090	0.030142782
+SGB6754	0.030379747
+SGB1699	0.030671602
+SGB15342	0.031550689
+SGB4936	0.031865157
+SGB4910	0.0323603
+SGB15089	0.033391574
+SGB4868	0.033633034
+SGB14909	0.033834586
+SGB4993	0.034712196
+SGB14861	0.035107169
+SGB15236	0.035652921
+SGB4581	0.036159079
+SGB2295	0.037366803
+SGB1934	0.03744763
+SGB4425	0.037647059
+SGB15091	0.038607595
+SGB1861	0.039012857
+SGB2301	0.039099155
+SGB4644	0.039389519
+SGB15254	0.039740621
+SGB2318	0.040191388
+SGB714_group	0.041403509
+SGB17244	0.041505792
+SGB1965	0.041692466
+SGB1830	0.04191947
+SGB5792	0.042183623
+SGB5767_group	0.042650104
+SGB15322	0.042857143
+SGB2303	0.045164122
+SGB1844_group	0.048100383
+SGB4964	0.048658318
+SGB4608	0.04916129
+SGB4577	0.04916336
+SGB10068	0.05
+SGB14306	0.05
+SGB14797_group	0.05
+SGB14874	0.05
+SGB14993_group	0.05
+SGB15054	0.05
+SGB15078	0.05
+SGB15132	0.05
+SGB15154	0.05
+SGB15159	0.05
+SGB15224	0.05
+SGB15249	0.05
+SGB15291	0.05
+SGB1790	0.05
+SGB1812	0.05
+SGB1815	0.05
+SGB1829	0.05
+SGB1860	0.05
+SGB1862	0.05
+SGB1877	0.05
+SGB1949	0.05
+SGB2290	0.05
+SGB4166	0.05
+SGB4184	0.05
+SGB4269	0.05
+SGB4327	0.05
+SGB4421	0.05
+SGB4575	0.05
+SGB4705	0.05
+SGB4749	0.05
+SGB4767	0.05
+SGB4820	0.05
+SGB5045	0.05
+SGB6276	0.05
+SGB6783	0.05
+SGB6796	0.05
+SGB9262	0.05
+SGB9283	0.05


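The new `VallesColomerM_2022_Nov19_thresholds.tsv` above maps each SGB to a normalized genetic distance percentile threshold (`used_nGD_score_percentile`) used for strain-sharing calls. A hedged sketch of loading it with the standard library (file path and column names come from the diff; the helper itself is illustrative):

```python
import csv

def load_sgb_thresholds(path):
    """Read the two-column TSV (SGB, used_nGD_score_percentile) into a dict."""
    with open(path, newline='') as fh:
        reader = csv.DictReader(fh, delimiter='\t')
        return {row['SGB']: float(row['used_nGD_score_percentile'])
                for row in reader}

# e.g. thresholds = load_sgb_thresholds('VallesColomerM_2022_Nov19_thresholds.tsv')
# then a pairwise nGD below thresholds['SGB15369'] would count as strain sharing
```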
=====================================
metaphlan/utils/add_metadata_tree.py
=====================================
@@ -1,8 +1,8 @@
 #!/usr/bin/env python
 __author__ = ('Duy Tin Truong (duytin.truong at unitn.it), '
               'Aitor Blanco Miguez (aitor.blancomiguez at unitn.it)')
-__version__ = '3.1.0'
-__date__    = '25 Jul 2021'
+__version__ = '4.0.0'
+__date__    = '22 Aug 2022'
 
 import argparse as ap
 import pandas


=====================================
metaphlan/utils/calculate_diversity.R
=====================================
@@ -0,0 +1,262 @@
+#!/usr/bin/env Rscript
+
+### calculate diversity ###
+
+# install required packages
+
+required_pkg <- c("optparse", "ape", "rbiom", "compositions", "BiocManager")
+a <- sapply(required_pkg, function(x) {  if (!requireNamespace(x, quietly = TRUE))
+  install.packages(x, repos = "http://cran.us.r-project.org")
+})
+if (! "microbiome" %in% installed.packages()){
+  BiocManager::install("microbiome")
+}
+
+# accept arguments from command line
+
+library("optparse")
+
+option_list = list(
+  
+  make_option(c("-f", "--file"), action="store", type="character", default=NULL, 
+              help="Merged MetaPhlAn profiles. 
+                A table with samples as columns and species as rows is required.",
+              metavar="character"),
+  
+  make_option(c("-o", "--out_directory"), action="store", type="character", default="diversity_analysis",
+              help="output directory.
+                [default = %default]"),
+  
+  make_option(c("-p", "--outfile_prefix"), action="store", type="character", default=NULL,
+              help="file name prefix of the output distance matrix and log files.
+                [default = input file basename]"),
+  
+  make_option(c("-t", "--tree"), action="store", type="character", default=NULL, 
+              help="Full path to the MetaPhlAn species Newick tree.
+                Mandatory for computing UniFrac distances."),
+  
+  make_option(c("-d", "--diversity"), action="store", type="character", default="beta", 
+              help="Choose whether to calculate alpha or beta diversity. 
+                Options are alpha or beta.
+                [default = %default]"),
+  
+  make_option(c("-m", "--metric"), action="store", type="character", default="bray-curtis", 
+              help="Name of the function to use when calculating diversity.
+                Options for alpha diversity are richness, shannon, simpson, gini.
+                Options for beta diversity are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, clr, aitchison.
+                [default = %default]"),
+  
+  make_option(c("-s", "--taxon_separator"), action="store", type="character", default="t__", 
+              help="taxon separator used in the input MetaPhlAn table.
+                Options are: t__ for MetaPhlAn4 profiles and s__ for MetaPhlAn3 profiles.
+                [default = %default]")
+); 
+
+opt_parser = OptionParser(option_list=option_list);
+opt = parse_args(opt_parser);
+
+if (is.null(opt$file)){
+  print_help(opt_parser)
+  stop('At least one argument must be supplied (input file).tsv', call.=FALSE)
+}
+
+if(! (opt$diversity %in% c('alpha', 'beta'))){
+  write(paste0('Method "', opt$diversity, '" not available!'), stdout())
+  write(paste0('Available diversity analyses are "alpha" and "beta"'), stdout())
+  quit(status = -1)
+}
+
+if(opt$diversity =="alpha" & ! (opt$metric %in% c('richness', 'shannon', 'simpson', 'gini'))){
+  write(paste0('Method "', opt$metric, '" not available for alpha diversity'), stdout())
+  write(paste0('Available alpha-diversity metrics are "richness", "shannon", "simpson", "gini".'), stdout())
+  quit(status = -1)
+}
+
+if(opt$diversity =="beta" & ! (opt$metric %in% c('bray-curtis', 'jaccard', 'weighted-unifrac', 'unweighted-unifrac', 'clr', 'aitchison'))){
+  write(paste0('Method "', opt$metric, '" not available for beta diversity'), stdout())
+  write(paste0('Available beta-diversity distance functions are "bray-curtis", "jaccard", "weighted-unifrac", "unweighted-unifrac", "clr", "aitchison".'), stdout())
+  quit(status = -1)
+}
+
+if(! (opt$taxon_separator %in% c('t__', 's__'))){
+  write(paste0('Taxon separator "', opt$taxon_separator, '" is not available'), stdout())
+  write(paste0('Possible taxon separators are "t__" for MetaPhlAn4 profiles and "s__" for MetaPhlAn3 profiles.'), stdout())
+  quit(status = -1)
+}
+
+if(is.null(opt$tree) & grepl('unifrac', opt$metric)){
+  write(paste0('Selected beta-diversity metric: "', opt$metric, '"'), stdout())
+  stop("A  tree is mandatory for computing UniFrac distances. (input tree).nwk", call.=FALSE)
+}
+
+for(x in c(opt$file, opt$tree)){
+  if(!file.exists(x)){
+    stop(paste0('Input file "', x, '" does not exist!'), call.=FALSE)
+  }
+}
+
+if(is.null(opt$outfile_prefix)){
+  outfile_prefix <- basename(opt$file)
+  outfile_prefix <- tools::file_path_sans_ext(outfile_prefix)
+} else {
+  outfile_prefix <- opt$outfile_prefix
+}
+
+current_dir <- getwd()
+dir.create(file.path(current_dir, opt$out_directory), showWarnings = FALSE)
+
+outfile <- paste(current_dir, opt$out_directory, outfile_prefix, sep="/")
+
+### table preprocessing ###
+
+mpa_table <- read.table(opt$file, comment.char = '#', sep = '\t', header = TRUE, check.names=FALSE)
+
+# check if NCBI id is present
+
+vec <- grepl("ncbi", colnames(mpa_table), ignore.case=TRUE)
+if(any(vec)){
+  # keep all the columns except the one with NCBI id
+  mpa_table <- mpa_table[grep(opt$taxon_separator, mpa_table[,1]), !vec]
+} else {
+  mpa_table <- mpa_table[grep(opt$taxon_separator, mpa_table[,1]),]
+}
+
+if(opt$taxon_separator == "t__"){
+  mpa_table[,1] <- gsub(".+\\|t__SGB", "", mpa_table[,1])
+} else { 
+  mpa_table[,1] <- gsub(".+\\|s__", "", mpa_table[,1])
+}
+
+mpa_table[,1] <- gsub("_group$", "", mpa_table[,1])
+rownames(mpa_table) <- mpa_table[,1]
+mpa_table <- mpa_table[,-1]
+
+# remove samples with all unknowns
+removed <- which(colSums(mpa_table) == 0)
+
+if(length(removed)>0){
+  
+  if(length(removed)==1){
+    message = "# WARNING: 1 sample with 100% unknown species was removed from the input table."
+  } else {
+    message = paste0("# WARNING: ", length(removed), " samples with 100% unknown species were removed from the input table.")
+  }
+  
+  write(message, stdout())
+  write(paste(names(removed), collapse='\n'), stdout())
+  
+  write(message, file=paste0(outfile, '_samples.log'))
+  write(paste(names(removed), collapse='\n'), file=paste0(outfile, '_samples.log'), append = TRUE)
+  
+  # remove samples
+  mpa_table <- mpa_table[, -removed]
+}
+
+### Data transformation
+mpa_table <- mpa_table / 100
+
+
+### Beta diversity ###
+
+if (opt$diversity == "beta"){
+  
+  # Bray-Curtis
+  if (opt$metric == "bray-curtis"){
+    mat <- rbiom::beta.div(as.matrix(mpa_table), method="bray-curtis", weighted=TRUE)
+  }
+  
+  # Jaccard
+  if (opt$metric == "jaccard"){
+    mat <- rbiom::beta.div(as.matrix(mpa_table), method="jaccard", weighted=FALSE)
+  }
+  
+  # Unifrac
+  if (grepl("unifrac", opt$metric)){
+    mpa_tree <- ape::read.tree(opt$tree)
+    
+    if(opt$taxon_separator == "s__"){
+      mpa_tree$tip.label <- gsub(".+\\|s__", "", mpa_tree$tip.label)
+    }
+    
+    removed <- setdiff(rownames(mpa_table), mpa_tree$tip.label)
+    
+    if(length(removed)){
+      message = paste0("# WARNING: ", length(removed), " species not present in the tree were removed from the input profile.")
+      write(message, stdout())
+      write(paste(removed, collapse='\n'), stdout())
+      
+      write(message, file=paste0(outfile, '_species.log'))
+      write(paste(removed, collapse='\n'), file=paste0(outfile, '_species.log'), append = TRUE)
+    }
+    filt_tree <- ape::keep.tip(mpa_tree, setdiff(rownames(mpa_table), removed))
+    filt_mpa_table <- mpa_table[filt_tree$tip.label,]
+    
+    # check again if after species removal some samples have 0s for all the remaining species, and remove them
+    removed <- which(colSums(filt_mpa_table) == 0)
+    
+    if(length(removed)){
+      message = paste0("# WARNING: after removal of species not in the tree, ", length(removed), " samples with 0 abundance of the remaining species were removed from the input table.")
+      
+      write(message, stdout())
+      write(paste(names(removed), collapse='\n'), stdout())
+      
+      if(file.exists(paste0(outfile, '_samples.log'))){
+        write(message, file=paste0(outfile, '_samples.log'), append = TRUE)
+      } else { 
+        write(message, file=paste0(outfile, '_samples.log'))
+      }
+      
+      write(paste(names(removed), collapse='\n'), file=paste0(outfile, '_samples.log'), append = TRUE)
+      
+      # remove samples
+      filt_mpa_table <- filt_mpa_table[, -removed]
+    }
+    
+    if (opt$metric == "weighted-unifrac"){
+      mat <- rbiom::beta.div(as.matrix(filt_mpa_table), tree=filt_tree, method="unifrac", weighted=TRUE)
+      
+    } else if (opt$metric == "unweighted-unifrac"){
+      mat <- rbiom::beta.div(as.matrix(filt_mpa_table), tree=filt_tree, method="unifrac", weighted=FALSE)
+    } 
+    
+  }
+  
+  # CLR or Aitchison
+  if (opt$metric == "clr" || opt$metric == "aitchison"){
+    # Centered Log-Ratio
+    ait_norm_mpa_table <- compositions::clr(mpa_table)
+    mat <- as.matrix(compositions::as.data.frame.rmult(ait_norm_mpa_table))
+
+    if (opt$metric == "aitchison"){
+      # Aitchison
+      mat <- rbiom::beta.div(mat, method="euclidean", weighted=TRUE)
+    }
+  }
+  
+  ### Alpha Diversity ###
+  
+} else if (opt$diversity == "alpha"){
+  
+  # Richness
+  if (opt$metric == "richness"){
+    mat <- microbiome::alpha(mpa_table, index = c("richness_observed"))
+  }
+  
+  # Shannon
+  if (opt$metric == "shannon"){
+    mat <- microbiome::alpha(mpa_table, index = c("diversity_shannon"))
+  }
+  
+  # Simpson
+  if (opt$metric == "simpson"){
+    mat <- microbiome::alpha(mpa_table, index = c("diversity_gini_simpson"))
+  }
+  
+  # Gini
+  if (opt$metric == "gini"){
+    mat <- microbiome::alpha(mpa_table, index = c("dominance_gini"))
+    row.names(mat) <- colnames(mpa_table)
+  }
+}
+
+write.table(as.matrix(mat), paste0(outfile, '_', opt$metric, '.tsv'), sep = '\t', quote = FALSE)


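Editorial aside: `calculate_diversity.R` delegates the actual distance computations to `rbiom::beta.div`, `compositions::clr` and `microbiome::alpha`. For reference, the Bray-Curtis dissimilarity it requests for two relative-abundance vectors is simply `sum(|u_i - v_i|) / sum(u_i + v_i)`; a standalone sketch (not part of the package, input vectors are made up):

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors:
    sum(|u_i - v_i|) / sum(u_i + v_i); 0 for identical profiles,
    1 for fully disjoint ones."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

# Example on two toy relative-abundance profiles over three species.
print(bray_curtis([0.6, 0.4, 0.0], [0.2, 0.0, 0.8]))  # -> 0.8
```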
=====================================
metaphlan/utils/calculate_unifrac.R deleted
=====================================
@@ -1,67 +0,0 @@
-#!/usr/bin/env Rscript
-
-args <- commandArgs(trailingOnly = TRUE)
-
-if(length(args) == 0)
-{
-  write('Usage:
-          RScript calculate_unifrac.R <merged MetaPhlAn profiles> <Newick tree> <output file> <UniFrac type> <normalization>
-        
-  <merged MetaPhlAn profiles>\tFull path to the merged MetaPhlAn profiles. A table with samples by columns and species by rows is required.
-  <Newick tree>\t\t\tFull path to the MetaPhlAn 3 species Newick tree
-  <output file>\t\t\tBasename for output UniFrac distance matrix
-  <UniFrac type>\t\tweighted for Weighted UniFrac, unweighted for Unweighted UniFrac (default weighted)
-  <normalization>\t\t\tFunction to apply to the data for normalization. Can be none, sinh_sqrt, log10 (default none)
-        ', stdout())
-  quit(save = 'no')  
-}
-
-mpa_infile <- args[1]
-tree_file <- args[2]
-outfile <- args[3]
-weighted <- args[4]=='weighted' || is.na(args[4])
-normalize <- args[5] %in% c('sinh_sqrt','log10')
-
-
-for(x in c(mpa_infile,tree_file)){
-if(!file.exists(x)){
-  write(paste0('Input file "', x, '" does not exists! Exiting.'), stdout())
-  quit(status = -1)
-}
-}
-
-required_pkg <- c('ape','vegan','rbiom')
-a <- sapply(required_pkg, function(x) {  if (!requireNamespace(x, quietly = TRUE))
-  install.packages(x)
-})
-
-
-suppressPackageStartupMessages(library(rbiom))
-suppressPackageStartupMessages(library(vegan))
-suppressPackageStartupMessages(library(ape))
-
-mpa_table <- read.table(mpa_infile, comment.char = '#', sep = '\t', header = TRUE)
-mpa_table <- mpa_table[grep('s__',mpa_table[,1]),-2]
-mpa_table[,1] <- gsub(".+\\|s__", "", mpa_table[,1])
-rownames(mpa_table) <- mpa_table[,1]
-mpa_table <- mpa_table[,-1]
-
-mpa_tree <- ape::read.tree(tree_file)
-mpa_tree$tip.label <- gsub(".+\\|s__", "", mpa_tree$tip.label)
-
-removed <- setdiff(rownames(mpa_table), mpa_tree$tip.label)
-if(length(removed)){
-  write(paste0('WARNING: ', length(removed), ' species not present in the tree were removed from the input profile.'), stdout())
-  write(paste(removed, colllapse='\n'), stdout())
-}
-filt_tree <- ape::keep.tip(mpa_tree, intersect(rownames(mpa_table),mpa_tree$tip.label))
-filt_mpa_table <- mpa_table[filt_tree$tip.label,] / 100.0
-
-if(normalize == 'log10')
-  filt_mpa_table <- log10(1 + filt_mpa_table)
-if(normalize == 'sinh_sqrt')
-  filt_mpa_table <- asinh(sqrt(filt_mpa_table))
-  
-rbiom_distmat <- rbiom::unifrac(as.matrix(filt_mpa_table), weighted=weighted, tree=filt_tree)
-
-write.table(as.matrix(rbiom_distmat), outfile,sep = '\t', quote = FALSE)
\ No newline at end of file


=====================================
metaphlan/utils/external_exec.py
=====================================
@@ -3,8 +3,8 @@ __author__ = ('Aitor Blanco Miguez (aitor.blancomiguez at unitn.it), '
               'Francesco Asnicar (f.asnicar at unitn.it), '
               'Moreno Zolfo (moreno.zolfo at unitn.it), '
               'Francesco Beghini (francesco.beghini at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 import os, sys, re, shutil, tempfile
 import subprocess as sb


=====================================
metaphlan/utils/extract_markers.py
=====================================
@@ -4,8 +4,8 @@ __author__ = ('Aitor Blanco Miguez (aitor.blancomiguez at unitn.it), '
               'Francesco Asnicar (f.asnicar at unitn.it), '
               'Moreno Zolfo (moreno.zolfo at unitn.it), '
               'Francesco Beghini (francesco.beghini at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 import sys
 try:
@@ -31,7 +31,7 @@ except ImportError:
 metaphlan_script_install_folder = os.path.dirname(os.path.abspath(__file__))
 DEFAULT_DB_FOLDER = os.path.join(metaphlan_script_install_folder, "../metaphlan_databases")
 DEFAULT_DB_FOLDER = os.environ.get('METAPHLAN_DB_DIR', DEFAULT_DB_FOLDER)
-DEFAULT_DB_NAME =  "mpa_v31_CHOCOPhlAn_201901.pkl"
+DEFAULT_DB_NAME =  "mpa_vJan21_CHOCOPhlAnSGB_202103.pkl"
 DEFAULT_DATABASE = os.path.join(DEFAULT_DB_FOLDER, DEFAULT_DB_NAME)
 
 """


=====================================
metaphlan/utils/merge_metaphlan_tables.py
=====================================
@@ -1,86 +1,81 @@
 #!/usr/bin/env python3
 
+
 import argparse
 import os
 import sys
-import re
 import pandas as pd
 from itertools import takewhile
 
-def merge( aaastrIn, ostm ):
+
+def merge(aaastrIn, ostm):
     """
     Outputs the table join of the given pre-split string collection.
 
     :param  aaastrIn:   One or more split lines from which data are read..
     :type   aaastrIn:   collection of collections of string collections
-    :param  iCol:       Data column in which IDs are matched (zero-indexed).
-    :type   iCol:       int
     :param  ostm:       Output stream to which matched rows are written.
     :type   ostm:       output stream
     """
 
     listmpaVersion = set()
-    merged_tables = pd.DataFrame()
+    profiles_list = []
+    merged_tables = None
 
     for f in aaastrIn:
-        with open(f) as fin:
-            headers = [x.strip() for x in takewhile(lambda x: x.startswith('#'), fin)]
-        if len(headers) == 1:
-            names = ['clade_name', 'relative_abundance']
-            index_col = 0
-        if len(headers) >= 4:
-            names = headers[-1].split('#')[1].strip().split('\t')
-            index_col = [0,1]
+        headers = [x.strip() for x in takewhile(lambda x: x.startswith('#'), open(f))]
+        listmpaVersion.add(headers[0])
 
-        mpaVersion = list(filter(re.compile('#mpa_v[0-9]{2,}_CHOCOPhlAn\w*/{0,3}_[0-9]{0,}').match, headers))
-        if len(mpaVersion):
-            listmpaVersion.add(mpaVersion[0])
+        if len(headers) == 4:
+            names = headers[-1].split('#')[1].strip().split('\t')
         else:
-            print('merge_metaphlan_tables found tables without a header including the database version. Exiting.\n')
+            print('merge_metaphlan_tables: wrong header format for "{}", please check your profiles.\n'.format(f))
             return
+
         if len(listmpaVersion) > 1:
-            print('merge_metaphlan_tables found tables made with different versions of the MetaPhlAn database.\nPlease re-run MetaPhlAn with the same database.\n')
+            print('merge_metaphlan_tables: profiles from different versions of MetaPhlAn, please profile your '
+                  'samples using the same MetaPhlAn version.\n')
             return
-        
-        iIn = pd.read_csv(f, 
-                          sep='\t',
-                          skiprows=len(headers),
-                          names = names,
-                          usecols=range(3),
-                        ).fillna('')
-        iIn = iIn.set_index(iIn.columns[index_col].to_list())
-        if merged_tables.empty:
-            merged_tables = iIn.iloc[:,0].rename(os.path.splitext(os.path.basename(f))[0]).to_frame()
-        else:
-            merged_tables = pd.merge(iIn.iloc[:,0].rename(os.path.splitext(os.path.basename(f))[0]).to_frame(),
-                                    merged_tables,
-                                    how='outer', 
-                                    left_index=True, 
-                                    right_index=True
-                                    )
-    if listmpaVersion:
-        ostm.write(list(listmpaVersion)[0]+'\n')
-    merged_tables.fillna('0').reset_index().to_csv(ostm, index=False, sep = '\t')
 
-argp = argparse.ArgumentParser( prog = "merge_metaphlan_tables.py",
-    description = """Performs a table join on one or more metaphlan output files.""")
-argp.add_argument( "aistms",    metavar = "input.txt", nargs = "+",
-    help = "One or more tab-delimited text tables to join" )
-argp.add_argument( '-o',    metavar = "output.txt", nargs = 1,
-    help = "Name of output file in which joined tables are saved" )
+        iIn = pd.read_csv(f, sep='\t', skiprows=len(headers), names=names, usecols=range(3), index_col=0)
+        profiles_list.append(pd.Series(data=iIn['relative_abundance'], index=iIn.index,
+                                       name=os.path.splitext(os.path.basename(f))[0].replace('_profile', '')))
+
+    merged_tables = pd.concat([merged_tables, pd.concat(profiles_list, axis=1).fillna(0)], axis=1).fillna(0)
+    ostm.write(list(listmpaVersion)[0]+'\n')
+    merged_tables.to_csv(ostm, sep='\t')
 
-__doc__ = "::\n\n\t" + argp.format_help( ).replace( "\n", "\n\t" )
 
-argp.usage = argp.format_usage()[7:]+"\n\n\tPlease make sure to supply file paths to the files to combine. If combining 3 files (Table1.txt, Table2.txt, and Table3.txt) the call should be:\n\n\t\tpython merge_metaphlan_tables.py Table1.txt Table2.txt Table3.txt > output.txt\n\n\tA wildcard to indicate all .txt files that start with Table can be used as follows:\n\n\t\tpython merge_metaphlan_tables.py Table*.txt > output.txt"
+argp = argparse.ArgumentParser(prog="merge_metaphlan_tables.py",
+                               description="Performs a table join on one or more metaphlan output files.")
+argp.add_argument("aistms", metavar="input.txt", nargs="*", help="One or more tab-delimited text tables to join")
+argp.add_argument("-l", help="Name of file containing the paths to the files to combine")
+argp.add_argument('-o', metavar="output.txt", help="Name of output file in which joined tables are saved")
+argp.add_argument('--overwrite', default=False, action='store_true', help="Overwrite output file if exists")
+
+argp.usage = (argp.format_usage() + "\nPlease make sure to supply file paths to the files to combine.\n\n" +
+              "If combining 3 files (Table1.txt, Table2.txt, and Table3.txt) the call should be:\n" +
+              "   ./merge_metaphlan_tables.py Table1.txt Table2.txt Table3.txt > output.txt\n\n" +
+              "A wildcard to indicate all .txt files that start with Table can be used as follows:\n" +
+              "    ./merge_metaphlan_tables.py Table*.txt > output.txt")
 
 
 def main( ):
-    args = argp.parse_args( )
-    if args.o is None:
-        merge(args.aistms, sys.stdout)
-    else:
-        with open(args.o[0], 'w') as fout:
-            merge(args.aistms, fout)
+    args = argp.parse_args()
+
+    if args.l:
+        args.aistms = [x.strip().split()[0] for x in open(args.l)]
+
+    if not args.aistms:
+        print('merge_metaphlan_tables: no inputs to merge!')
+        return
+
+    if args.o and os.path.exists(args.o) and not args.overwrite:
+        print('merge_metaphlan_tables: output file "{}" exists, specify the --overwrite param to overwrite it!'.format(args.o))
+        return
+
+    merge(args.aistms, open(args.o, 'w') if args.o else sys.stdout)
+
 
 if __name__ == '__main__':
-    main()
\ No newline at end of file
+    main()

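Editorial aside: the rewritten `merge()` above now builds one `pd.Series` per profile and outer-joins them in a single `pd.concat(profiles_list, axis=1).fillna(0)`, instead of the old incremental `pd.merge` loop. A minimal sketch of that join (sample names and abundances are made up):

```python
import pandas as pd

# Two toy per-sample profiles: clade name -> relative abundance
# (values invented for illustration).
s1 = pd.Series({'Bacteroides_dorei': 40.0, 'Escherichia_coli': 60.0}, name='sample1')
s2 = pd.Series({'Bacteroides_dorei': 10.0, 'Prevotella_copri': 90.0}, name='sample2')

# Outer join on clade names; clades missing from a sample become 0,
# mirroring pd.concat(profiles_list, axis=1).fillna(0) in merge().
merged = pd.concat([s1, s2], axis=1).fillna(0)
print(merged)
```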

=====================================
metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk deleted
=====================================
The diff for this file was not included because it is too large.

=====================================
metaphlan/utils/mpa_vJan21_CHOCOPhlAnSGB_202103.nwk
=====================================
The diff for this file was not included because it is too large.

=====================================
metaphlan/utils/mpa_vJan21_CHOCOPhlAnSGB_202103_SGB2GTDB.tsv
=====================================
The diff for this file was not included because it is too large.

=====================================
metaphlan/utils/parallelisation.py
=====================================
@@ -3,8 +3,8 @@ __author__ = ('Aitor Blanco Miguez (aitor.blancomiguez at unitn.it), '
               'Francesco Asnicar (f.asnicar at unitn.it), '
               'Moreno Zolfo (moreno.zolfo at unitn.it), '
               'Francesco Beghini (francesco.beghini at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 try:
     from .util_fun import error


=====================================
metaphlan/utils/plot_tree_graphlan.py
=====================================
@@ -1,8 +1,8 @@
 #!/usr/bin/env python
 __author__ = ('Duy Tin Truong (duytin.truong at unitn.it), '
               'Aitor Blanco Miguez (aitor.blancomiguez at unitn.it)')
-__version__ = '3.1.0'
-__date__    = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__    = '22 Aug 2022'
 
 import argparse as ap
 import dendropy


=====================================
metaphlan/utils/read_fastx.py
=====================================
@@ -61,6 +61,7 @@ def read_and_write_raw_int(fd, min_len=None, prefix_id=""):
     avg_read_length = 0
     nreads = 0
     discarded = 0
+    idx = 1
     #if min_len:
     r = []
 
@@ -178,3 +179,4 @@ def main():
 
 if __name__ == '__main__':
     main()
+


=====================================
metaphlan/utils/sample2markers.py
=====================================
@@ -4,8 +4,8 @@ __author__ = ('Aitor Blanco Miguez (aitor.blancomiguez at unitn.it), '
               'Francesco Asnicar (f.asnicar at unitn.it), '
               'Moreno Zolfo (moreno.zolfo at unitn.it), '
               'Francesco Beghini (francesco.beghini at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 import sys
 try:


=====================================
metaphlan/utils/sgb_to_gtdb_profile.py
=====================================
@@ -0,0 +1,83 @@
+__author__ = 'Aitor Blanco (aitor.blancomiguez at unitn.it)'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
+
+import os, time, sys
+import argparse as ap
+
+try:
+    from .util_fun import openw, info, error
+except ImportError:
+    from util_fun import openw, info, error
+
+GTDB_ASSIGNMENT_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mpa_vJan21_CHOCOPhlAnSGB_202103_SGB2GTDB.tsv")
+
+def read_params():
+    p = ap.ArgumentParser(description="", formatter_class=ap.ArgumentDefaultsHelpFormatter)
+    p.add_argument('-i', '--input', type=str, default=None,
+                   help="The input profile")
+    p.add_argument('-o', '--output', type=str, default=None,
+                   help="The output profile")
+    
+    return p.parse_args()
+
+def check_params(args):
+    if not args.input:
+        error('-i (or --input) must be specified', exit=True, 
+            init_new_line=True)
+    if not args.output:
+        error('-o (or --output) must be specified', exit=True, 
+            init_new_line=True)
+
+def get_gtdb_profile(mpa_profile, gtdb_profile):
+    tax_levels = ['d','p','c','o','f','g','s']
+    sgb2gtdb = dict()
+    with open(GTDB_ASSIGNMENT_FILE, 'r') as read_file:
+        for line in read_file:
+            line = line.strip().split('\t')
+            sgb2gtdb[line[0]] = line[1]
+
+    with open(gtdb_profile, 'w') as wf:
+        with open(mpa_profile, 'r') as rf:
+            unclassified = 0
+            abundances = {x:dict() for x in tax_levels}
+            for line in rf:
+                if line.startswith('#mpa_'):
+                    wf.write(line)
+                    wf.write('#clade_name\trelative_abundance\n')
+                elif line.startswith('UNCLASSIFIED'):
+                    unclassified = float(line.strip().split('\t')[2])
+                    wf.write('UNCLASSIFIED\t{}\n'.format(unclassified))
+                elif 't__SGB' in line:
+                    line = line.strip().split('\t')
+                    gtdb_tax = sgb2gtdb[line[0].split('|')[-1][3:]]
+                    if gtdb_tax not in abundances['s']:
+                        abundances['s'][gtdb_tax] = 0
+                    abundances['s'][gtdb_tax] += float(line[2])
+        tax_levels.reverse()
+        for tax_level in tax_levels[:-1]:
+            for tax in abundances[tax_level]:
+                new_tax = tax.replace(';{}'.format(tax.split(';')[-1]), '')
+                new_level = tax_levels[tax_levels.index(tax_level)+1]
+                if new_tax not in abundances[new_level]:
+                    abundances[new_level][new_tax] = 0
+                abundances[new_level][new_tax] += abundances[tax_level][tax]
+        tax_levels.reverse()
+        for tax_level in tax_levels:
+            for tax in abundances[tax_level]:
+                wf.write('{}\t{}\n'.format(tax, abundances[tax_level][tax]))
+
+def main():
+    t0 = time.time()
+    args = read_params()
+    info("Start execution")
+    check_params(args)
+
+    get_gtdb_profile(args.input, args.output)
+
+    exec_time = time.time() - t0
+    info("Finish execution ("+str(round(exec_time, 2))+" seconds)\n", 
+        init_new_line=True)
+	
+if __name__ == "__main__":
+	main()

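Editorial aside: the aggregation loop in `get_gtdb_profile()` walks the ranks from species upward, trimming the last `;`-separated field of each GTDB lineage to obtain the parent taxon and summing abundances into it. The same idea as a standalone sketch (lineages here are shortened, made-up examples):

```python
def roll_up(species_abundances):
    """Aggregate species-level abundances into every parent rank by
    trimming the last ';'-separated lineage field, mirroring the
    level-by-level loop in sgb_to_gtdb_profile.py."""
    levels = ['s', 'g', 'f', 'o', 'c', 'p', 'd']  # child -> parent order
    abundances = {lvl: {} for lvl in levels}
    abundances['s'] = dict(species_abundances)
    for child, parent in zip(levels, levels[1:]):
        for tax, ab in abundances[child].items():
            parent_tax = tax.rsplit(';', 1)[0]
            abundances[parent][parent_tax] = abundances[parent].get(parent_tax, 0.0) + ab
    return abundances

# Two hypothetical species of the same genus.
species = {
    'd__B;p__X;c__C;o__O;f__F;g__G;s__G sp1': 30.0,
    'd__B;p__X;c__C;o__O;f__F;g__G;s__G sp2': 20.0,
}
res = roll_up(species)
print(res['g'])  # genus level carries the summed 50.0
```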

=====================================
metaphlan/utils/strain_transmission.py
=====================================
@@ -1,7 +1,7 @@
 __author__ = ('Aitor Blanco (aitor.blancomiguez at unitn.it), '
              'Mireia Valles-Colomer (mireia.vallescolomer at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 import os, time, sys
 import argparse as ap
@@ -14,7 +14,12 @@ except ImportError:
     from pyphlan import PpaTree, dist_matrix
 
 
-DISTRIBUTION_THRESHOLD = 0.01
+DISTRIBUTION_THRESHOLD = 0.03
+metaphlan_script_install_folder = os.path.dirname(os.path.abspath(__file__))
+DEFAULT_UTILS_FOLDER = os.path.join(metaphlan_script_install_folder)
+DEFAULT_UTILS_FOLDER = os.environ.get('METAPHLAN_DB_DIR', DEFAULT_UTILS_FOLDER)
+PRECOMPUTED_FILE = os.path.join(DEFAULT_UTILS_FOLDER, 'VallesColomerM_2022_Nov19_thresholds.tsv')
+
 
 """
 Reads and parses the command line arguments of the script.
@@ -29,10 +34,14 @@ def read_params():
                    help="The input metadata")
     p.add_argument('-o', '--output_dir', type=str, default=None,
                    help="The output directory")
-    p.add_argument('--save_dist', action='store_true',
-                   help="[Optional] Save the PhyPhlAn pairwise distances file")
+    p.add_argument('--sgb_id', type=str, default=None,
+                   help="[Optional] If specified, it will use the precomputed transmisison threshold for the specific SGB from the VallesColomerM_2022 study")
     p.add_argument('--threshold', type=float, default=DISTRIBUTION_THRESHOLD,
                    help="[Optional] A custom distribution threshold value")
+    p.add_argument('--precomputed_thresholds_file', type=str, default=PRECOMPUTED_FILE,
+                   help="[Optional] The file containing the pre-computed thresholds")
+    p.add_argument('--save_dist', action='store_true',
+                   help="[Optional] Save the PhyPhlAn pairwise distances file")
     
     return p.parse_args()
 
@@ -239,30 +248,51 @@ def write_transmission_events(transmission_events, threshold, output_dir, metada
         for event in transmission_events:
             report.write(event['1']+" <-> "+event['2']+"\n") 
 
+"""
+Gets the precomputed threshold from VallesColomerM_2022 study
+
+:param sgb_id: the SGB id
+:returns: the transmission threshold
+"""
+def get_precomputed_threshold(sgb_id, precomputed_thresholds_file):
+    sgb2thres = dict()
+    with open(precomputed_thresholds_file, 'r') as rf:
+        rf.readline()
+        for line in rf:
+            line = line.strip().split('\t')
+            sgb2thres[line[0]] = line[1]
+    if sgb_id not in sgb2thres:
+        error('The SGB specified "{}" has not been precomputed'.format(sgb_id), exit=True, init_new_line=True)
+    else:
+        return float(sgb2thres[sgb_id])
+
 """
 Identifies transmission events in phylogenetic trees
 
 :param tree: the input Newick tree
 :param metadata: the metadata file
 :param distr_threshold: the distribution threshold
+:param sgb_id: the SGB id
+:param precomputed_thresholds_file: the file containing the precomputed thresholds
 :param save_dist: whether to save the pairwise distances file
 :param output_dir: the output directory to store the results
 """
-def strain_transmission(tree, metadata, distr_threshold, save_dist, output_dir):
+def strain_transmission(tree, metadata, distr_threshold, sgb_id, precomputed_thresholds_file, save_dist, output_dir):
     normalise = True
     matrix = False
     distances_file = tree+".dist"
     tree_pairwisedist(tree, normalise, matrix, os.path.join(output_dir, distances_file))
     pairwise_distances = parse_distances(os.path.join(output_dir, distances_file)) 
-    if  not save_dist:
+    if not save_dist:
         os.remove(os.path.join(output_dir, distances_file))
 
     nodes = get_nodes(pairwise_distances)
-    
     training_nodes, metadata_samples = get_training_nodes(nodes, metadata)    
     training_distances = get_training_distances(training_nodes, pairwise_distances)
-
-    threshold = get_threshold(training_distances, distr_threshold)
+    if sgb_id is None:
+        threshold = get_threshold(training_distances, distr_threshold)
+    else:
+        threshold = get_threshold(training_distances, get_precomputed_threshold(sgb_id, precomputed_thresholds_file))
 
     transmission_events = get_transmission_events(pairwise_distances, metadata_samples, threshold)
     write_transmission_events(transmission_events, threshold, output_dir, metadata_samples)
@@ -274,6 +304,8 @@ Main call
 :param tree: the input Newick tree
 :param metadata: the metadata file
 :param threshold: the distribution threshold
+:param sgb_id: the SGB id
+:param precomputed_thresholds_file: the file containing the precomputed thresholds
 :param save_dist: whether to save the pairwise distances file
 :param output_dir: the output directory to store the results
 """
@@ -283,7 +315,7 @@ def main():
     info("Start execution")
     check_params(args)
 
-    strain_transmission(args.tree, args.metadata, args.threshold, args.save_dist, args.output_dir)
+    strain_transmission(args.tree, args.metadata, args.threshold, args.sgb_id, args.precomputed_thresholds_file, args.save_dist, args.output_dir)
 
     exec_time = time.time() - t0
     info("Finish execution ("+str(round(exec_time, 2))+" seconds)\n", 
@@ -291,3 +323,4 @@ def main():
 	
 if __name__ == "__main__":
 	main()
+

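Editorial aside: `get_precomputed_threshold()` above reads the shipped `VallesColomerM_2022_Nov19_thresholds.tsv` into an SGB-to-threshold dict. A quick standalone sketch of that parse (the header and values below are hypothetical, not taken from the real file):

```python
import io

def load_thresholds(fh):
    """Parse an SGB -> threshold table shaped like the precomputed
    thresholds TSV (one header line, then tab-separated rows),
    mirroring get_precomputed_threshold()."""
    fh.readline()  # skip the header line
    return {fields[0]: float(fields[1])
            for fields in (line.strip().split('\t')
                           for line in fh if line.strip())}

# Made-up example rows for illustration only.
tsv = "sgb_id\tthreshold\nSGB1877\t0.012\nSGB4933\t0.034\n"
thresholds = load_thresholds(io.StringIO(tsv))
print(thresholds)
```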

=====================================
metaphlan/utils/util_fun.py
=====================================
@@ -3,8 +3,8 @@ __author__ = ('Aitor Blanco Miguez (aitor.blancomiguez at unitn.it), '
               'Francesco Asnicar (f.asnicar at unitn.it), '
               'Moreno Zolfo (moreno.zolfo at unitn.it), '
               'Francesco Beghini (francesco.beghini at unitn.it)')
-__version__ = '3.1.0'
-__date__ = '25 Jul 2022'
+__version__ = '4.0.0'
+__date__ = '22 Aug 2022'
 
 
 import os, sys, re, pickletools, pickle, time, bz2, gzip
@@ -44,7 +44,7 @@ def error(message, init_new_line=False, exit=False, exit_value=1):
     sys.stderr.flush()
 
     if exit:
-        sys.stderr.write('{}: Stop StrainPhlAn 3.0 execution.\n'.format(time.ctime(int(time.time()))))
+        sys.stderr.write('{}: Stop StrainPhlAn execution.\n'.format(time.ctime(int(time.time()))))
         sys.exit(exit_value)
 
 


=====================================
setup.py
=====================================
@@ -13,9 +13,9 @@ if sys.version_info[0] < 3:
 
 setuptools.setup(
     name='MetaPhlAn',
-    version='3.1.0',
-    author='Francesco Beghini',
-    author_email='francesco.beghini at unitn.it',
+    version='4.0.0',
+    author='Aitor Blanco-Miguez',
+    author_email='aitor.blancomiguez at unitn.it',
     url='http://github.com/biobakery/MetaPhlAn/',
     license='LICENSE.txt',
     packages=setuptools.find_packages(),
@@ -34,6 +34,7 @@ setuptools.setup(
             'read_fastx.py = metaphlan.utils.read_fastx:main',
             'sample2markers.py = metaphlan.utils.sample2markers:main',
             'strain_transmission.py = metaphlan.utils.strain_transmission:main',
+            'sgb_to_gtdb_profile.py = metaphlan.utils.sgb_to_gtdb_profile:main',
         ]
     },
     description='MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.',



View it on GitLab: https://salsa.debian.org/med-team/metaphlan/-/compare/a40606098a1bdf26f4222ccc02296c0b4c92a9fc...21f3d85179b51dc0154fe4d686a760fdf044723d
