[med-svn] [Git][med-team/plasmidid][upstream] New upstream version 1.6.3+dfsg

Steffen Möller gitlab at salsa.debian.org
Fri Jun 26 17:34:30 BST 2020



Steffen Möller pushed to branch upstream at Debian Med / plasmidid


Commits:
14fc856f by Steffen Moeller at 2020-06-26T17:53:41+02:00
New upstream version 1.6.3+dfsg
- - - - -


8 changed files:

- README.md
- bin/download_plasmid_database.py
- + bin/mashclust.py
- bin/spades_assembly.sh
- + bin/summary_report_pid.py
- + databases/card.fasta
- + img/pipeline_pID.png
- plasmidID


Changes:

=====================================
README.md
=====================================
@@ -1,28 +1,211 @@
 
-[![CircleCI Build Status](https://circleci.com/gh/circleci/circleci-docs.svg?style=shield)](https://circleci.com/gh/BU-ISCIII/plasmidID) [![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) [![CircleCi Community](https://img.shields.io/badge/community-CircleCI%20Discuss-343434.svg)](https://discuss.circleci.com) [![Scif](https://img.shields.io/badge/Filesystem-Scientific-brightgreen.svg)](https://sci-f.github.io)
+[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/plasmidid/README.html)
+[![CircleCI Build Status](https://circleci.com/gh/circleci/circleci-docs.svg?style=shield)](https://circleci.com/gh/BU-ISCIII/plasmidID) [![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) [![Scif](https://img.shields.io/badge/Filesystem-Scientific-brightgreen.svg)](https://sci-f.github.io)
 
 # plasmidID <img align="left" src="https://github.com/BU-ISCIII/plasmidID/blob/develop/img/plasmidID_logo.png" alt="Logo" width="100">
 
 <br>
 <br>
 
-![#f03c15](https://placehold.it/15/f03c15/000000?text=+) **PlasmidID is already available to download** ![#f03c15](https://placehold.it/15/f03c15/000000?text=+)
+* [Introduction](#introduction)
+* [Requirements](#requirements)
+    * [Software](#software)
+    * [Plasmid database](#plasmid-database)
+* [Installation](#installation)
+    * [Install from source](#install-from-source)
+    * [Install using conda](#install-using-conda)
+* [Quick usage](#quick-usage)
+* [Usage](#usage)
+* [Output](#output)
+* [Annotation file](#annotation-file)
+* [Illustrated pipeline](#illustrated-pipeline)
+* [Docker](#docker)
+
+## Introduction
 
 PlasmidID is a mapping-based, assembly-assisted plasmid identification tool that analyzes and provides a graphic solution for plasmid identification.
 
-PlasmidID is a **computational pipeline** implemented in **BASH** that maps Illumina reads over plasmid database sequences. The most covered sequences are clustered by identity to avoid redundancy and the longest are used as scaffold for plasmid reconstruction. Reads are assembled and annotated by automatic and specific annotation. All information generated from mapping, assembly, annotation and local alignment analyses is gathered and accurately represented in a **circular image** which allow user to determine plasmidic composition in any bacterial sample.
+PlasmidID is a **computational pipeline** implemented in **BASH** that maps Illumina reads over plasmid database sequences. The k-mer-filtered, most-covered sequences are clustered by identity to avoid redundancy, and the longest sequence of each cluster is used as a scaffold for plasmid reconstruction. Reads are assembled and annotated by automatic and specific annotation. All information generated from mapping, assembly, annotation and local alignment analyses is gathered and accurately represented in a **circular image** which allows the user to determine the plasmid composition of any bacterial sample.
+
+## Requirements
+
+#### Software
+
+* [Python >=3.6](https://www.python.org/)
+* [Trimmomatic v0.33](http://www.usadellab.org/cms/?page=trimmomatic) (Optional)
+* [Spades v3.8](http://cab.spbu.ru/software/spades/) (Optional)
+* [Perl v5.26.0](https://www.perl.org/get.html)
+* [NCBI_blast + v2.2.3](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
+* [Bedtools v2.25](http://bedtools.readthedocs.io/en/latest/)
+* [Bowtie 2 v2.2.4](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
+* [SAMtools v1.2](http://samtools.sourceforge.net/)
+* [prokka v1.12](http://www.vicbioinformatics.com/software.prokka.shtml)
+* [cd-hit v4.6.6](http://weizhongli-lab.org/cd-hit/) (no longer needed since v1.6)
+* [circos v0.69.3](http://circos.ca/software/download/circos/)
+* [mash v2.2](https://github.com/marbl/Mash)
+
+#### Plasmid database
+
+Since version v1.5.1, the plasmid database can be downloaded with the following command:
+
+```Bash
+ download_plasmid_database.py -o FOLDER
+```
+
+## Installation
+
+#### Install from source
+
+Install all dependencies and add them to $PATH:
+
+```Bash
+git clone https://github.com/BU-ISCIII/plasmidID.git
+```
+
+Then add plasmidID and its ./bin directory to $PATH.
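+
+For example, assuming the repository was cloned into the current directory:
+
+```Bash
+export PATH="$PWD/plasmidID:$PWD/plasmidID/bin:$PATH"
+```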
+
+#### Install using conda
+
+This option is recommended.
+
+Install [Anaconda3](https://www.anaconda.com/distribution/)
+
+```
+conda install -c conda-forge -c bioconda plasmidid
+```
+Wait for the environment to solve.
+
+Warnings and errors printed while solving can be ignored.
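+
+Alternatively, a dedicated conda environment keeps the pipeline's dependencies isolated (a minimal sketch; the environment name is arbitrary):
+
+```
+conda create -n plasmidid -c conda-forge -c bioconda plasmidid
+conda activate plasmidid
+```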
+
+## Quick usage
+
+Illumina paired-end
+```
+plasmidID \
+-1 SAMPLE_R1.fastq.gz  \
+-2 SAMPLE_R2.fastq.gz \
+-d YYYY-MM-DD_plasmids.fasta \
+-c SAMPLE_assembled_contigs.fasta \
+--no-trim \
+-s SAMPLE
+```
+
+SMRT sequencing (only contigs)
+```
+plasmidID \
+-d YYYY-MM-DD_plasmids.fasta \
+-c SAMPLE_contigs.fasta \
+-s SAMPLE
+```
+
+Annotate any fasta you want
+```
+plasmidID \
+-d YYYY-MM-DD_plasmids.fasta \
+-c SAMPLE_assembled_contigs.fasta \
+-a annotation_file \
+-s SAMPLE
+```
+More info about [annotation file](#annotation-file)
+
+If there are several samples in the same GROUP folder:
+```
+summary_report_pid.py -i NO_GROUP/
+```
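+
+The -g flag of summary_report_pid.py (defined in bin/summary_report_pid.py below) switches to a per-sample report; point it at a single sample folder instead:
+```
+summary_report_pid.py -i NO_GROUP/SAMPLE -g
+```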
+## Usage
+
+```
+usage : plasmidID <-1 R1> <-2 R2> <-d database(fasta)> <-s sample_name> [-g group_name] [options]
+
+	Mandatory input data:
+	-1 | --R1	<filename>	reads corresponding to paired-end R1 (mandatory)
+	-2 | --R2	<filename>	reads corresponding to paired-end R2 (mandatory)
+	-d | --database	<filename>	database to map and reconstruct (mandatory)
+	-s | --sample	<string>	sample name (mandatory), less than 37 characters
+
+	Optional input data:
+	-g | --group	<string>	group name (optional). If unset, samples will be gathered in NO_GROUP group
+	-c | --contigs	<filename>	file with contigs. If supplied, plasmidID will not assemble reads
+	-a | --annotate <filename>	file with configuration file for specific annotation
+	-o 		<output_dir>	output directory, by default the current directory
+
+	Pipeline options:
+	--explore	Relaxes default parameters to find less reliable relationships between the data supplied and the database
+	--only-reconstruct	Database supplied will not be filtered and all sequences will be used as scaffold
+						This option does not require R1 and R2, instead a contig file can be supplied
+	-w 			Disable the winner-takes-all algorithm when clustering by kmer - QUICKER MODE
+	Trimming:
+	--trimmomatic-directory Indicate directory holding trimmomatic .jar executable
+	--no-trim	Reads supplied will not be quality trimmed
+
+	Coverage and Clustering:
+	-C | --coverage-cutoff	<int>	minimum coverage percentage to select a plasmid as scaffold (0-100), default 80
+	-S | --coverage-summary	<int>	minimum coverage percentage to include plasmids in summary image (0-100), default 90
+	-f | --cluster	<float>	kmer identity to cluster plasmids into the same representative sequence (0 means identical) (0-1), default 0.5
+	-k | --kmer	<float>	identity to filter plasmids from the database with kmer approach (0-1), default 0.95
+
+	Contig local alignment:
+	-i | --alignment-identity <int>	minimum identity percentage aligned for a contig to annotate, default 90
+	-l | --alignment-percentage <int>	minimum length percentage aligned for a contig to annotate, default 20
+	-L | --length-total	<int>	minimum alignment length to filter blast analysis
+	--extend-annotation <int>	look for annotation over regions with no homology found (base pairs), default 500bp
+
+	Draw images:
+	--config-directory <dir>	directory holding config files, default config_files/
+	--config-file-individual <file-name>	file name of the individual config file used to reconstruct
+
+	Additional options:
+	-M | --memory	<int>	max memory allowed to use
+	-T | --threads	<int>	number of threads
+	-v | --version		version
+	-h | --help		display usage message
+
+example: plasmidID -1 ecoli_R1.fastq.gz -2 ecoli_R2.fastq.gz -d database.fasta -s ECO_553 -g ENTERO
+	plasmidID -1 ecoli_R1.fastq.gz -2 ecoli_R2.fastq.gz -d PacBio_sample.fasta -c scaffolds.fasta -C 60 -s ECO_60 -g ENTERO --no-trim
+```
+
+## Examples
+
+Under construction
+
+## Output
+
+Since v1.6, the most relevant output is located in the GROUP/SAMPLE folder:
+
+- **SAMPLE_final_results.html(.tab)**
+	- id: accession number of the reference sequence
+	- length: length of the reference sequence
+	- species: species of the reference sequence
+	- description: rest of reference fasta header
+	- contig_name: numbers of the contigs whose alignments meet the minimum required for the complete contig track
+	- SAMPLE:
+		- Image of the reconstructed plasmid (click to open in new tab)
+		- MAPPING % (percentage): percentage of reference covered with reads
+			- X for contig mode (gray colour)
+			- Indicative colouring (the closer to 100%, the better)
+		- ALIGN FR (fraction_covered): total length of contigs aligned (complete) / reference sequence length
+			- Indicative colouring (the closer to 1, the better)
+
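+For example, a 100 kb reference with 95 kb covered by reads and 98 kb of complete contig alignments would show MAPPING % = 95 and ALIGN FR = 0.98.
+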
+## Annotation file
+
+Under construction
+
+## Illustrated pipeline
 
 This image summarizes the PlasmidID pipeline, including the most important steps.
 For further details, including:
-- [Database download](https://github.com/BU-ISCIII/plasmidID/wiki/Plasmid-Database)
-- [Dependencies](https://github.com/BU-ISCIII/plasmidID/wiki/Installation-and-Dependencies)
-- [Execution](https://github.com/BU-ISCIII/plasmidID/wiki/Execution)
 - [Results interpretation](https://github.com/BU-ISCIII/plasmidID/wiki/Understanding-the-image:-track-by-track)
 - and more, please visit: [**PLASMIDID WIKI**](https://github.com/BU-ISCIII/plasmidID/wiki)
 
-<p align="center"><img src="https://github.com/BU-ISCIII/plasmidID/blob/master/img/Short_pipeline.png" alt="workflow_small"  width="500">
+<p align="center"><img src="https://github.com/BU-ISCIII/plasmidID/blob/master/img/pipeline_pID.png" alt="workflow_small"  width="500">
+
+
+
+
+
+
+
+## Docker
 
-# Quick usage
 Example:
 Clone the repo:
 ```Bash
@@ -45,7 +228,6 @@ docker run -v $PWD:$PWD -w $PWD buisciii/plasmidid plasmidID.sh \
      -s KPN
 ```
 
-
 Or you can use Singularity instead:
 ```Bash
 singularity exec docker://buisciii/plasmidid plasmidID.sh \


=====================================
bin/download_plasmid_database.py
=====================================
@@ -148,7 +148,7 @@ def main():
     if len(erroneous) > 0:
         with open(plasmid_failed_path, 'w+') as ferror:
             for acc, reason in erroneous.items():
-                ferror.write(acc + ": " + reason)
+                ferror.write(acc + ": " + reason + "\n")
 
     logger.info("ALL DONE\nFASTA file is available in: " + plasmid_fasta_path)
 


=====================================
bin/mashclust.py
=====================================
@@ -0,0 +1,399 @@
+#!/usr/bin/env python
+
+# Standard library imports
+import os
+import sys
+import re
+import logging
+import subprocess
+
+# Third party imports
+import argparse
+import datetime
+import pandas as pd
+import numpy as np 
+from Bio import Entrez
+from Bio import SeqIO
+
+logger = logging.getLogger()
+
+"""
+=============================================================
+HEADER
+=============================================================
+FUNCTION: Reduces redundancy in multifasta files using kmer mash distance,
+takes the longest sequence per cluster as representative
+
+INSTITUTION:CNM-ISCIII
+AUTHOR: Pedro J. Sola (pedroscampoy at gmail.com)
+d^v^b
+VERSION=0.1
+CREATED: 27 May 2020
+REVISION: 
+
+TODO:
+
+================================================================
+END_OF_HEADER
+================================================================
+"""
+
+END_FORMATTING = '\033[0m'
+WHITE_BG = '\033[0;30;47m'
+BOLD = '\033[1m'
+UNDERLINE = '\033[4m'
+RED = '\033[31m'
+GREEN = '\033[32m'
+MAGENTA = '\033[35m'
+BLUE =  '\033[34m'
+CYAN = '\033[36m'
+YELLOW = '\033[93m'
+DIM = '\033[2m'
+
+def check_create_dir(path):
+    if not os.path.exists(path):
+        os.mkdir(path)
+
+def mash_dist(input_file, output_dir, threads=10):
+    # mash dist -i database.filtered_0.9_term.fasta database.filtered_0.9_term.fasta > database.filtered_0.9_term.mash.distances.tab
+    #'reference_ID', 'query_ID', 'distance', 'p_value', 'shared_hashes'
+
+    input_file = os.path.abspath(input_file)
+    file_prefix = input_file.split('/')[-1]
+    prefix = ('.').join(file_prefix.split('.')[0:-1])
+
+    output_filename = prefix + ".mash.distances.tab"
+    
+    output_file = os.path.join(output_dir, output_filename)
+
+    cmd = ["mash", "dist", "-i", "-p", str(threads), input_file, input_file]
+
+    prog = cmd[0]
+    param = cmd[1:]
+
+    try:
+        with open(output_file, "w+") as outfile:
+            #calculate mash distance and save it in output file
+            command = subprocess.run(cmd,
+            stdout=outfile, stderr=subprocess.PIPE, universal_newlines=True)
+        if command.returncode == 0:
+            logger.debug(GREEN + "Program %s successfully executed" % prog + END_FORMATTING)
+        else:
+            logger.error(RED + BOLD + "Command %s FAILED\n" % prog + END_FORMATTING
+                + BOLD + "WITH PARAMETERS: " + END_FORMATTING + " ".join(param) + "\n"
+                + BOLD + "EXIT-CODE: %d\n" % command.returncode +
+                "ERROR:\n" + END_FORMATTING + command.stderr)
+    except OSError as e:
+        sys.exit(RED + BOLD + "failed to execute program '%s': %s" % (prog, str(e)) + END_FORMATTING)
+
+    return output_file
+
+def mash_dist_to_pairwise(distance_file, distance_type='hash_distance'):
+    df = pd.read_csv(distance_file, sep='\t', names=['reference_ID', 'query_ID', 'distance', 'p_value', 'shared_hashes'])
+    df[['hash_1', 'hash_2']] = df['shared_hashes'].str.split('/', expand=True)
+    df.hash_1 = df.hash_1.astype(float)
+    df.hash_2 = df.hash_2.astype(float)
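+    # 'shared_hashes' is reported by mash as "shared/total"; 1 - shared/total gives a simple k-mer distance (0 = identical)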
+    df['hash_distance'] = 1 - (df.hash_1 / df.hash_2)
+    dfpair = df[['reference_ID', 'query_ID', distance_type]]
+    
+    return dfpair
+
+def pairwise_to_cluster(pw,threshold = 0.5):
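+    # Greedy single-linkage clustering: each pair under the threshold either joins an existing cluster or starts a new one; clusters sharing a sample are merged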
+    groups = {}
+    columns = pw.columns.tolist()
+    sorted_df = pw[(pw[columns[0]] != pw[columns[1]]) & (pw[columns[2]] <= threshold)].sort_values(by=[columns[2]])
+
+    logger.debug(pw.head())
+    logger.debug('Pairs under threshold: %s' % sorted_df.shape[0])
+    
+    def rename_dict_clusters(cluster_dict):
+        reordered_dict = {}
+        for i, k in enumerate(list(cluster_dict)):
+            reordered_dict[i] = cluster_dict[k]
+        return reordered_dict
+    
+    def regroup_clusters(list_keys, groups_dict, both_samples_list):
+        #sum previous clusters
+        list_keys.sort()
+        new_cluster = sum([groups_dict[key] for key in list_keys], [])
+        #add new cluster
+        cluster_asign = list(set(new_cluster + both_samples_list))
+        #Remove duped cluster
+        first_cluster = list_keys[0]
+        groups_dict[first_cluster] = cluster_asign
+        rest_cluster = list_keys[1:]
+        for key in rest_cluster:
+            del groups_dict[key]
+        groups_dict = rename_dict_clusters(groups_dict)
+        return groups_dict
+        
+    for _, row in sorted_df.iterrows():
+        group_number = len(groups)
+        sample_1 = str(row[0])
+        sample_2 = str(row[1])
+        both_samples_list = row[0:2].tolist()
+                
+        if group_number == 0:
+            groups[group_number] = both_samples_list
+        
+        all_samples_dict = sum(groups.values(), [])
+                
+        if sample_1 in all_samples_dict or sample_2 in all_samples_dict:
+            #extract cluster which have the new samples
+            key_with_sample = {key for (key,value) in groups.items() if (sample_1 in value or sample_2 in value)}
+            
+            cluster_with_sample = list(key_with_sample)
+            cluster_with_sample_name = cluster_with_sample[0]
+            number_of_shared_clusters = len(key_with_sample)
+            if number_of_shared_clusters > 1:
+                groups = regroup_clusters(cluster_with_sample, groups, both_samples_list)
+            else:
+                groups[cluster_with_sample_name] = list(set(groups[cluster_with_sample_name] + both_samples_list))
+        else:
+            groups[group_number] = both_samples_list
+            
+    for _, row in pw[(pw[pw.columns[0]] != pw[pw.columns[1]]) & (pw[pw.columns[2]] > threshold)].iterrows():
+        sample_1 = str(row[0])
+        sample_2 = str(row[1])
+        all_samples_dict = sum(groups.values(), [])
+        if sample_1 not in all_samples_dict:
+            group_number = len(groups)
+            groups[group_number] = [sample_1]
+        
+        if sample_2 not in all_samples_dict:
+            group_number = len(groups)
+            groups[group_number] = [sample_2]
+            
+    cluster_df = pd.DataFrame(groups.values(),index=list(groups))
+    
+    cluster_df_return = cluster_df.stack().droplevel(1).reset_index().rename(columns={'index': 'group', 0: 'id'})
+
+    return cluster_df_return
+
+def big_pairwise_to_cluster(pw,threshold = 0.5):
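+    # Same clustering as pairwise_to_cluster, but streams the mash distance file line by line to avoid loading large tables into memory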
+    
+    def rename_dict_clusters(cluster_dict):
+        reordered_dict = {}
+        for i, k in enumerate(list(cluster_dict)):
+            reordered_dict[i] = cluster_dict[k]
+        return reordered_dict
+    
+    def regroup_clusters(list_keys, groups_dict, both_samples_list):
+        #sum previous clusters
+        list_keys.sort()
+        new_cluster = sum([groups_dict[key] for key in list_keys], [])
+        #add new cluster
+        cluster_asign = list(set(new_cluster + both_samples_list))
+        #Remove duped cluster
+        first_cluster = list_keys[0]
+        groups_dict[first_cluster] = cluster_asign
+        rest_cluster = list_keys[1:]
+        for key in rest_cluster:
+            del groups_dict[key]
+        groups_dict = rename_dict_clusters(groups_dict)
+        return groups_dict
+    
+    groups = {}
+    
+    with open(pw, "r") as f:
+        for line in f:
+            line_split = line.split('\t')
+            sample_1 = line_split[0]
+            sample_2 = line_split[1]
+            hash1, hash2 = line_split[4].split('/')
+            hash_distance = 1 - (int(hash1) / int(hash2))
+
+            if hash_distance <= threshold:
+                group_number = len(groups)
+
+                both_samples_list = [sample_1,sample_2]
+
+                if group_number == 0:
+                    groups[group_number] = both_samples_list
+
+                all_samples_dict = sum(groups.values(), [])
+
+                if sample_1 in all_samples_dict or sample_2 in all_samples_dict:
+                    #extract cluster which have the new samples
+                    key_with_sample = {key for (key,value) in groups.items() if (sample_1 in value or sample_2 in value)}
+
+                    cluster_with_sample = list(key_with_sample)
+                    cluster_with_sample_name = cluster_with_sample[0]
+                    number_of_shared_clusters = len(key_with_sample)
+                    if number_of_shared_clusters > 1:
+                        groups = regroup_clusters(cluster_with_sample, groups, both_samples_list)
+                    else:
+                        groups[cluster_with_sample_name] = list(set(groups[cluster_with_sample_name] + both_samples_list))
+                else:
+                    groups[group_number] = both_samples_list
+            else:
+                # recompute here: all_samples_dict is otherwise undefined when the first pair already exceeds the threshold
+                all_samples_dict = sum(groups.values(), [])
+                if sample_1 not in all_samples_dict:
+                    group_number = len(groups)
+                    groups[group_number] = [sample_1]
+
+                if sample_2 not in all_samples_dict:
+                    group_number = len(groups)
+                    groups[group_number] = [sample_2]
+            
+    cluster_df = pd.DataFrame(groups.values(),index=list(groups))
+    
+    cluster_df_return = cluster_df.stack().droplevel(1).reset_index().rename(columns={'index': 'group', 0: 'id'})
+            
+    return cluster_df_return
+            
+def calculate_seqlen(fasta_file):
+    len_df = pd.DataFrame(columns=['id','length'])
+    index = 0
+    for seq_record in SeqIO.parse(fasta_file, "fasta"):
+        len_df.loc[index, 'id'] = seq_record.id
+        len_df.loc[index, 'length'] = len(seq_record)
+        index = index + 1
+    len_df['length'] = len_df.length.astype('int')
+    return len_df
+    
+def extract_representative(row):
+    row.clustered.remove(row.id)
+    return
+
+def extract_length(row, final_cluster):
+    lengths = [final_cluster['length'][final_cluster.id == idclust].tolist()[0] for idclust in row.clustered]
+    return lengths
+
+def extract_distance_legacy(representative, list_clustered, pwdist):
+    distances = [round(pwdist[pwdist.columns[2]][(pwdist[pwdist.columns[0]] == representative) & (pwdist[pwdist.columns[1]] == idclust)].tolist()[0],2) for idclust in list_clustered]
+    return distances
+
+def extract_distance(representative, list_clustered, mash_file):
+    distances = []
+    for idclust in list_clustered:
+        with open(mash_file, "r") as f:
+            for line in f:
+                line_split = line.split('\t')
+                sample_1 = line_split[0]
+                sample_2 = line_split[1]
+                if sample_1 == representative and sample_2 == idclust:
+                    hash1, hash2 = line_split[4].split('/')
+                    hash_distance = 1 - (int(hash1) / int(hash2)) 
+                    distances.append(hash_distance)
+    return distances
+
+def retrieve_fasta_cluster(fasta_file, final_cluster, output_dir, mash_file, kmerdist, quantity_id=1, save_clustered=False):
+    input_file = os.path.abspath(fasta_file)
+    file_prefix = input_file.split('/')[-1]
+    prefix = ('.').join(file_prefix.split('.')[0:-1])
+    
+    output_representative = os.path.join(output_dir, prefix + '.' + str(kmerdist) + '.representative.fasta')
+    output_clustered = os.path.join(output_dir, prefix +  '.' + str(kmerdist) + '.clustered.fasta')
+    output_summary = os.path.join(output_dir, prefix + '.' + str(kmerdist) + '.clusters.tab')
+    
+    representative_id = final_cluster.sort_values(by=['group', 'length'], ascending=[True, False]).groupby('group').head(quantity_id)
+    summary_id_grouped = final_cluster.groupby('group')['id'].apply(list).reset_index(name='clustered')
+    representative_list = representative_id.id.tolist()
+    representative_and_sumary = representative_id.merge(summary_id_grouped, on='group', how='left')
+    #Use function extract_representative to remove the repr. from column
+    representative_and_sumary.apply(extract_representative, axis=1)
+    representative_and_sumary['lengths_clustered'] = representative_and_sumary.apply(lambda x: extract_length(x, final_cluster), axis=1)
+    representative_and_sumary['distance_clustered'] = representative_and_sumary.apply(lambda x: extract_distance(x.id, x.clustered, mash_file), axis=1)
+    #read the fasta and retrieve representative sequences
+    representative_records = []
+    clustered_records = []
+    for seq_record in SeqIO.parse(fasta_file, "fasta"):
+        if seq_record.id in representative_list:
+            representative_records.append(seq_record)
+        else:
+            clustered_records.append(seq_record)
+        
+    SeqIO.write(representative_records, output_representative, "fasta")
+    
+    if save_clustered:
+        SeqIO.write(clustered_records, output_clustered, "fasta")
+        
+    representative_and_sumary.to_csv(output_summary, sep='\t', index=False)
+
+    previous_seq = final_cluster.shape[0]
+    post_seq = representative_and_sumary.shape[0]
+
+    logger.info(MAGENTA + "%s sequences clustered into %s" % (previous_seq, post_seq) + END_FORMATTING)
+
+
+
+    
+
+def main():
+
+    def get_arguments():
+
+        parser = argparse.ArgumentParser(prog = 'mashclust.py', description= 'Cluster sequences in a multifasta by mash distance, keeping the longest per cluster as representative')
+        
+        parser.add_argument('-i', '--input', dest="input_file", metavar="input_file", type=str, required=True, help='REQUIRED. Input FASTA file')
+        parser.add_argument('-o', '--output', type=str, required=False, default=False, help='Output directory to extract clustered FASTA')
+        parser.add_argument('-d', '--distance', type=float, required=False, default=0.5, help='Threshold distance to cluster sequences [0-1]: 0 (identical) to 1 (unrelated), default 0.5')
+        parser.add_argument('-g', '--output_grouped', required=False,  action='store_true', help='Also output clustered (non-representative) sequences')
+
+        arguments = parser.parse_args()
+
+        return arguments
+
+    args = get_arguments()
+
+    input_file = os.path.abspath(args.input_file)
+
+    if args.output == False:
+        output_dir = ('/').join(input_file.split('/')[0:-1])
+    else:
+        output_dir = os.path.abspath(args.output)
+    
+    check_create_dir(output_dir)
+
+
+    #LOGGING
+    #Create log file with date and time
+    right_now = str(datetime.date.today())
+    right_now_full = "_".join(right_now.split(" "))
+
+    log_filename = 'mashclust' + "_" + right_now_full + ".log"
+    log_full_path = os.path.join(output_dir, log_filename)
+
+    logger = logging.getLogger()
+    logger.setLevel(logging.DEBUG)
+
+    formatter = logging.Formatter('%(asctime)s:%(message)s')
+
+    file_handler = logging.FileHandler(log_full_path)
+    file_handler.setLevel(logging.DEBUG)
+    file_handler.setFormatter(formatter)
+
+    stream_handler = logging.StreamHandler()
+    stream_handler.setLevel(logging.INFO)
+    #stream_handler.setFormatter(formatter)
+
+    logger.addHandler(stream_handler)
+    logger.addHandler(file_handler)
+
+    #####################START PIPELINE################
+
+    logger.info(args)
+    #CALCULATE MASH DISTANCE
+    logger.info('Obtaining mash distance')
+    mash_file = mash_dist(input_file, output_dir, threads=10)
+    logger.info('Obtaining cluster from distance')
+    #pairwise_distance = mash_dist_to_pairwise(mash_file)
+    #cluster_df = pairwise_to_cluster(pairwise_distance, threshold=args.distance)
+    cluster_df = big_pairwise_to_cluster(mash_file, threshold=args.distance)
+    logger.info('Calculating length')
+    len_df = calculate_seqlen(input_file)
+    final_cluster = cluster_df.merge(len_df, on='id', how='left')
+    logger.info('Filtering representative fasta')
+    retrieve_fasta_cluster(input_file, final_cluster, output_dir, mash_file, args.distance, quantity_id=1, save_clustered=args.output_grouped)
+    logger.info('DONE')
+
+if __name__ == '__main__':
+    try:
+        main()
+    except Exception as e:
+        logger.exception(e)
+        raise
\ No newline at end of file


=====================================
bin/spades_assembly.sh
=====================================
@@ -26,19 +26,17 @@ usage() {
 
 spades_assembly script that assemble illumina sequences using SPAdes
 
-usage : $0 <-p R1_paired file> <-P R2_paired file> [-u <R1_unpaired>] [-U <R2_unpaired>] [-o <directory>]
+usage : $0 <-p R1_paired file> <-P R2_paired file> [-o <directory>]
 		 [-k <int>][-s sample_name] [-g group_name] [-f <file_name>] [-T <int>] [q] [-c] [-v] [-h]
 
 	-p R1_paired file (mandatory)
 	-P R2_paired file (mandatory)
-	-u R1_unpaired file
-	-U R2_unpaired file
+	-k kmers, supplied as comma-separated numbers or as one -k flag per number, default: 21,33,55,77,99,127
 	-o output directory (optional)
 	-f file name
 	-s sample name (mandatory)
 	-g group name (optional). If unset, samples will be gathered in NO_GROUP group
-	-q quick_mode: look for files in a folder SUPPLIED with "paired" and "unpaired" term
+	-q quick_mode: look for files in a folder SUPPLIED with "paired" term
 	-c clean mode: remove unnecessary temporary folders
 	-T threads, default 1
 	-v version
@@ -89,8 +87,6 @@ cwd="$(pwd)"
 group="NO_GROUP"
 r1_paired_file="R1_paired_file"
 r2_paired_file="R2_paired_file"
-r1_unpaired_command=""
-r2_unpaired_command=""
 threads=1
 kmer_values_command="21,33,55,77,99,127"
 kmer_option=false
@@ -108,14 +104,6 @@ while getopts $options opt; do
 		P )
 			r2_paired_file=$OPTARG
 			;;
-		u )
-			r1_unpaired_file=$OPTARG
-			r1_unpaired_command=$(echo "--s1" $r1_unpaired_file)
-			;;
-		U )
-			r2_unpaired_file=$OPTARG
-			r2_unpaired_command=$(echo "--s1" $r2_unpaired_file)
-			;;
 		o )
 			output_dir=$OPTARG
 			;;
@@ -212,10 +200,6 @@ if [ $quick_mode = true ]; then
 	echo "Entering QUICK MODE"
 	r1_paired_file=$(find $directory_reads -name  "*1_paired.fastq.gz" -type f)
 	r2_paired_file=$(find $directory_reads -name  "*2_paired.fastq.gz" -type f)
-	r1_unpaired_file=$(find $directory_reads -name  "*1_unpaired.fastq.gz" -type f)
-	r1_unpaired_command=$(echo "--s1" $r1_unpaired_file)
-	r2_unpaired_file=$(find $directory_reads -name  "*2_unpaired.fastq.gz" -type f)
-	r2_unpaired_command=$(echo "--s2" $r2_unpaired_file)
 fi
 
 
@@ -231,8 +215,6 @@ echo "$(date)"
 echo "Assembly:"
 echo "R1 paired file = " $r1_paired_file
 echo "R2 paired file = " $r2_paired_file
-echo "R1 unpaired file = " $r1_unpaired_file
-echo "R2 unpaired file = " $r2_unpaired_file
 
 
 spades.py \
@@ -241,8 +223,6 @@ spades.py \
 -k $kmer_values_command \
 --pe1-1 $r1_paired_file \
 --pe1-2 $r2_paired_file \
-$r1_unpaired_command \
-$r2_unpaired_command \
 -o $output_dir || error ${LINENO} $(basename $0) "Spades command failed. See $output_dir/logs for more information."
 
 


=====================================
bin/summary_report_pid.py
=====================================
@@ -0,0 +1,468 @@
+#!/usr/bin/env python
+
+# Standard library imports
+import os
+import sys
+import re
+import logging
+import subprocess
+import html
+
+
+# Third party imports
+import argparse
+import datetime
+import pandas as pd
+import numpy as np 
+from Bio import Entrez
+from Bio import SeqIO
+from tabulate import tabulate
+
+logger = logging.getLogger()
+
+"""
+=============================================================
+HEADER
+=============================================================
+FUNCTION: Creates a summary report in tsv and html from a plasmidID execution
+
+INSTITUTION:CNM-ISCIII
+AUTHOR: Pedro J. Sola (pedroscampoy at gmail.com)
+d^v^b
+VERSION=0.1
+CREATED: 04 June 2020
+REVISION: 
+
+TODO:
+
+================================================================
+END_OF_HEADER
+================================================================
+"""
+
+END_FORMATTING = '\033[0m'
+WHITE_BG = '\033[0;30;47m'
+BOLD = '\033[1m'
+UNDERLINE = '\033[4m'
+RED = '\033[31m'
+GREEN = '\033[32m'
+MAGENTA = '\033[35m'
+BLUE =  '\033[34m'
+CYAN = '\033[36m'
+YELLOW = '\033[93m'
+DIM = '\033[2m'
+
+def check_file_exists(file_name):
+    """
+        Check that the file exists; exit the program if not.
+    """
+    if not os.path.isfile(file_name):
+        logger.info(RED + BOLD + "File: %s not found\n" % file_name + END_FORMATTING)
+        sys.exit(1)
+    return os.path.isfile(file_name)
+
+def extract_files(folder):
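+    # Walk the plasmidID sample folder and pick up the three files the report needs: coverage percentages, complete contig alignments and the representative FASTA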
+    percentage_file = ""
+    complete_file = ""
+    representative_file = ""
+    for root, _, files in os.walk(folder):
+        for name in files:
+            if name.endswith(".coverage_adapted_clustered_percentage"):
+                percentage_file = os.path.join(root, name)
+            elif name.endswith(".plasmids.complete"):
+                complete_file = os.path.join(root, name)
+            elif "representative" in name and name.endswith(".fasta"):
+                representative_file = os.path.join(root, name)
+                
+    return percentage_file, complete_file, representative_file
+
+def percentage_to_df(percentage_file):
+    if not percentage_file == "":
+        df = pd.read_csv(percentage_file, sep=" ", names=['id', 'percentage'])
+        df['percentage'] = df['percentage'].round(2)
+        return df
+    else:
+        return pd.DataFrame(columns=['id','percentage'])
+
+def len_description_to_df(representative_file):
+    df = pd.DataFrame(columns=['id','length','species', 'description'])
+    index = 0
+    for seq_record in SeqIO.parse(representative_file, "fasta"):
+        df.loc[index, 'id'] = seq_record.id
+        df.loc[index, 'length'] = len(seq_record)
+        description_split = seq_record.description.split(' ')
+        df.loc[index, 'species'] = (' ').join(description_split[1:3])
+        df.loc[index, 'description'] = (' ').join(description_split[2:])
+        index = index + 1
+    df['length'] = df['length'].astype(int)
+
+    return df
+
+def complete_report_df(complete_file, len_description_df, percentage_df):
+    
+    def set_to_list(row):
+        listed_set = list(row.contig_name)
+        listed_set.sort()
+        return listed_set
+    
+    #CP029217.1	176762	288994	9	id=170244
+    dfc = pd.read_csv(complete_file, sep="\t", names=['id', 'start', 'end', 'contig_name', 'contig_id'])
+    dfc['len_covered'] = dfc.end - dfc.start
+    covered_df = dfc.groupby('id')['len_covered'].sum().reset_index()
+    contigs_df = dfc.groupby('id')['contig_name'].apply(set).reset_index()
+    #Merge all dataframes
+    df = len_description_df.merge(covered_df, on='id', how='left')
+    df['fraction_covered'] = round(df.len_covered / df.length, 2)
+    del df['len_covered']
+    df = df.merge(contigs_df, on='id', how='left')
+    df = df.dropna()
+    df['contig_name'] = df.apply(lambda x: set_to_list(x), axis=1)
+    df = df.merge(percentage_df, on='id', how='left')
+    df = df.sort_values(by=['length'], ascending=False).reset_index(drop=True)
+    df = df.fillna('X')
+    return df
+
+def include_images(sample_folder, summary_df):
+    sample = sample_folder.split("/")[-1]
+    def image_finder(row, sample_folder):
+        for root, _, files in os.walk(sample_folder):
+            for name in files:
+                if 'images' in root and row.id in name and name.endswith('.png'):
+                    return 'file://' + os.path.join(root, name)
+                
+    summary_df['images'] = summary_df.apply(lambda x: image_finder(x, sample_folder), axis=1)
+    summary_df.to_csv(sample_folder + '/' + sample +  '_final_results.tab', sep='\t', index=False)
+    return summary_df
+
+html_template = """
+    <!doctype html>
+
+    <html lang="en">
+    
+    <style type="text/css">
+    
+    body {
+        font: normal 20px "Courier New", Arial, sans-serif;
+        padding: auto;
+        margin: auto;
+    }
+    img {
+        display: block;
+        max-width: 350px;
+        height: auto;
+    }
+    
+    .summary{
+        display: flex;
+        flex-direction: column-reverse;
+    }
+    
+    .numeric-values {
+        display: flex;
+        flex-direction: row;
+        justify-content: space-around;
+    }
+    
+    .numeric-values div {
+        flex-grow: 1;
+    }
+    
+    .neutral {
+        background-color: lightgray;
+    }
+    
+    .likely {
+        background-color: limegreen;
+    }
+    
+    .unlikely {
+        background-color: sandybrown;
+    }
+    
+    .unprobable {
+        background-color: brown;
+    }
+    
+    header {
+        display: flex;
+        flex-direction: row;
+        justify-content: flex-start;
+        align-items: flex-end;
+        font-size: 2.7em;
+    }
+    
+    table {
+        position: relative;
+        border-collapse: collapse; 
+    }
+    
+    header img {
+        width: 70px;
+    }
+    
+    th {
+        font-size: 1.3em;
+        background-color: skyblue;
+        position: sticky;
+        top: 0;
+    }
+
+    tr td {
+        font-size: 1.1em;
+        text-align: center;
+    }
+    
+    footer {
+        height: 3em;
+        padding: 1;
+    }
+    
+    tr:nth-child(even) {background-color: snow;}
+    tr:hover {background-color:azure;}
+
+    </style>
+    
+    <head>
+      <meta charset="utf-8">
+
+      <title>PlasmidID Report</title>
+      <meta name="description" content="https://github.com/BU-ISCIII/plasmidID">
+      <meta name="author" content="pedroscampoy at gmail.com">
+
+      <link rel="stylesheet" href="css/styles.css?v=1.0">
+      <link rel="shortcut icon" type="image/png" href="https://raw.github.com/BU-ISCIII/plasmidID/master/img/plasmidID_logo.png">
+
+    </head>
+
+    <body>
+      <header>
+         <img src="https://raw.github.com/BU-ISCIII/plasmidID/master/img/plasmidID_logo.png" alt="header-image">
+         PlasmidID REPORT
+      </header>
+      <div>
+      TABLESUMMARY
+      </div>
+      <footer>
+          More information at <a href="https://github.com/BU-ISCIII/plasmidID" target="_blank"> https://github.com/BU-ISCIII/plasmidID </a>
+          <br>
+          Author: pedroscampoy at gmail.com (BU-ISCIII)
+      </footer>
+    </body>
+    </html>
+    
+    
+    \n"""
+
+def summary_to_html(sample_folder, final_individual_dataframe, html_template):
+    df = final_individual_dataframe.copy()
+    sample = sample_folder.split("/")[-1]
+    html_filename = os.path.join(sample_folder, sample + '_final_results.html')
+    hidden_filename = os.path.join(sample_folder, '.' + sample + '_final_individual_results.tab')
+    
+    def complete_to_rating(row):
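+        # fraction_covered near 1 (0.8-1.2) rates "likely"; 0.5-0.8 or >1.2 rates "unlikely"; anything else "unprobable" (matching the CSS classes above)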
+        if row.fraction_covered >= 0.8 and row.fraction_covered <= 1.2:
+            return 'likely'
+        elif row.fraction_covered > 1.2 or (row.fraction_covered < 0.8 and row.fraction_covered > 0.5):
+            return 'unlikely'
+        else:
+            return 'unprobable'
+        
+    def mapping_to_rating(row):
+        if row.percentage == 'X':
+            return 'neutral'
+        elif row.percentage >= 80:
+            return 'likely'
+        elif row.percentage < 80 and row.percentage > 60:
+            return 'unlikely'
+        else:
+            return 'unprobable'
+        
+    
+    def apply_img_tag(row):
+        return '<div class=summary>' + '\n' + \
+    '<div class=numeric-values>' + '\n' + \
+    '<div class=\"percentage ' + row.perc_rating + '\">' + 'MAPPING %<br>' + str(row.percentage) + '</div>' + '\n' + \
+    '<div class=\"complete ' + row.complete_rating + '\">' + 'ALIGN FR<br>' + str(row.fraction_covered) + '</div>' + '\n' + \
+    '</div>' + '\n' + \
+    '<a href=' + row.images + ' target=\"_blank\">' + '\n' + \
+    '<img src=' + row.images + ' alt=' + "\"" + row.id + "\"" + '>' + '\n' + \
+    '</a>' + '\n' + \
+    '</div>'
+    
+    def italic_species(row):
+        return '<i>' + row.species + '</i>'
+    
+    df['perc_rating'] = df.apply(lambda x: mapping_to_rating(x), axis=1)
+    
+    df['complete_rating'] = df.apply(lambda x: complete_to_rating(x), axis=1)
+    
+    df['images'] = df.apply(lambda x: apply_img_tag(x), axis=1)
+    
+    df['species'] = df.apply(lambda x: italic_species(x), axis=1)
+    
+    df.drop(['percentage', 'fraction_covered', 'perc_rating', 'complete_rating'], axis = 1, inplace = True) 
+
+    df.rename(columns={'images':sample}, inplace=True)
+    
+    df.to_csv(hidden_filename, sep='\t', index=False)
+    
+    table = tabulate(df, headers='keys', tablefmt='html', showindex=False)
+    table = html.unescape(table)
+    table = table.replace("style=\"text-align: right;\"", "")
+    
+    final_html = html_template.replace('TABLESUMMARY', table)
+    with open(html_filename, 'w+') as f:
+        f.write(final_html)
+
+def summary_to_html_group(group_folder, html_template):
+    group = group_folder.split("/")[-1]
+    html_filename = os.path.join(group_folder, group + '_final_results.html')
+    individual_files = []
+    for root, _, files in os.walk(group_folder):
+        for name in files:
+            if name.endswith("final_individual_results.tab") and name.startswith("."):
+                individual_files.append(os.path.join(root, name))
+    individual_dfs = []
+    sample_list_column = []
+    for file in individual_files:
+        df = pd.read_csv(file, sep='\t')
+        del df['contig_name']
+        individual_dfs.append(df)
+        sample_list_column.append(df.columns.tolist()[-1])
+                
+    dfm = individual_dfs[0]
+    for df_ in individual_dfs[1:]:
+        dfm = dfm.merge(df_, on=['id','length', 'species', 'description'], how='outer')
+    
+    count_df = dfm.drop(['length', 'species', 'description'], axis = 1).groupby('id').count().sum(axis=1).reset_index(name='N')
+    
+    dfm = dfm.merge(count_df, on='id', how='outer')
+    
+    columns_reorder = ['id','length', 'species', 'description', 'N'] + sample_list_column
+    
+    dfm = dfm[columns_reorder]
+    
+    dfm.fillna('-', inplace=True)
+    
+    dfm = dfm.sort_values(by=['N','length'], ascending=[False,False]).reset_index(drop=True)
+    
+    table = tabulate(dfm, headers='keys', tablefmt='html', showindex=False)
+    table = html.unescape(table)
+    
+    final_html = html_template.replace('TABLESUMMARY', table)
+    
+    with open(html_filename, 'w+') as f:
+        f.write(final_html)
+        
+    return dfm
+    
+def summary_to_tab_group(group_folder):
+    group = group_folder.split("/")[-1]
+    tab_filename = os.path.join(group_folder, group + '_final_results.tab')
+    individual_files = []
+    for root, _, files in os.walk(group_folder):
+        for name in files:
+            if name.endswith("final_results.tab"):
+                individual_files.append(os.path.join(root, name))
+    individual_dfs = []
+
+    for file in individual_files:
+        sample = file.split('/')[-1].replace('_final_results.tab', '')
+        df = pd.read_csv(file, sep='\t')
+        df.drop(['contig_name', 'images'], axis = 1, inplace=True)
+        df.rename(columns={'fraction_covered':'Fr_cov_' + sample, 'percentage':'Map%_' + sample}, inplace=True)
+        individual_dfs.append(df)
+        
+    dfm = individual_dfs[0]
+    for df_ in individual_dfs[1:]:
+        dfm = dfm.merge(df_, on=['id','length', 'species', 'description'], how='outer')
+        
+    count_df = dfm.filter(regex='Fr_cov.*|^id$', axis=1).groupby('id').count().sum(axis=1).reset_index(name='N')
+    
+    dfm = dfm.merge(count_df, on='id', how='outer')
+    
+    columns_reorder = dfm.columns.tolist()[0:4]
+    columns_reorder.append(dfm.columns.tolist()[-1])
+    columns_reorder = columns_reorder + dfm.columns.tolist()[4:-1]
+    
+    dfm = dfm[columns_reorder]
+    
+    dfm = dfm.sort_values(by=['N','length'], ascending=[False,False]).reset_index(drop=True)
+    
+    dfm.to_csv(tab_filename, sep='\t', index=False)
+    
+    return
+    
+
+def main():
+
+    def get_arguments():
+
+        parser = argparse.ArgumentParser(prog = 'summary_report_pid.py', description= 'Creates a summary report in tsv and html from a plasmidID execution')
+        
+        parser.add_argument('-i', '--input', dest="input_folder", metavar="input_folder", type=str, required=True, help='REQUIRED. Input pID folder')
+        parser.add_argument('-g', '--group', required=False,  action='store_false', help='Creates an individual sample report instead of the default group report')
+
+
+        arguments = parser.parse_args()
+
+        return arguments
+
+    args = get_arguments()
+
+    input_folder = os.path.abspath(args.input_folder)
+    #output_dir = input_folder
+
+    #LOGGING
+    #Create log file with date and time
+    #right_now = str(datetime.date.today())
+    #right_now_full = "_".join(right_now.split(" "))
+
+    #log_filename = 'logs/summary_pid' + "_" + right_now_full + ".log"
+    #log_full_path = os.path.join(output_dir, log_filename)
+
+    logger = logging.getLogger()
+    logger.setLevel(logging.DEBUG)
+
+    #formatter = logging.Formatter('%(asctime)s:%(message)s')
+
+    #file_handler = logging.FileHandler(log_full_path)
+    #file_handler.setLevel(logging.DEBUG)
+    #file_handler.setFormatter(formatter)
+
+    stream_handler = logging.StreamHandler()
+    stream_handler.setLevel(logging.INFO)
+    #stream_handler.setFormatter(formatter)
+
+    logger.addHandler(stream_handler)
+    #logger.addHandler(file_handler)
+
+    #####################START PIPELINE################
+
+    logger.info(args)
+    #CALCULATE MASH DISTANCE
+    logger.info('Creating summary')
+
+    if args.group == True:
+        summary_to_html_group(input_folder, html_template)
+        summary_to_tab_group(input_folder)
+    else:
+        percentage_file, complete_file, representative_file = extract_files(input_folder)
+        check_file_exists(complete_file)
+        check_file_exists(representative_file)
+        percentage_df = percentage_to_df(percentage_file)
+        len_description_df = len_description_to_df(representative_file)
+        summary_df = complete_report_df(complete_file, len_description_df, percentage_df)
+        final_individual_dataframe = include_images(input_folder, summary_df)
+        summary_to_html(input_folder, final_individual_dataframe, html_template)
+
+    logger.info('DONE')
+
+if __name__ == '__main__':
+    try:
+        main()
+    except Exception as e:
+        logger.exception(e)
+        raise
\ No newline at end of file


=====================================
databases/card.fasta
=====================================
The diff for this file was not included because it is too large.

=====================================
img/pipeline_pID.png
=====================================
Binary files /dev/null and b/img/pipeline_pID.png differ


=====================================
plasmidID
=====================================
@@ -15,7 +15,7 @@ set -e
 #INSTITUTION:ISCIII
 #CENTRE:BU-ISCIII
 #AUTHOR: Pedro J. Sola (pedroscampoy at gmail.com)
-VERSION=1.5.1
+VERSION=1.6.3
 #CREATED: 15 March 2018
 #
 #ACKNOLEDGE: longops2getops.sh: https://gist.github.com/adamhotep/895cebf290e95e613c006afbffef09d7
@@ -52,7 +52,7 @@ usage : $0 <-1 R1> <-2 R2> <-d database(fasta)> <-s sample_name> [-g group_name]
 	--explore	Relaxes default parameters to find less reliable relationships between the data supplied and the database
 	--only-reconstruct	Database supplied will not be filtered and all sequences will be used as scaffold
 						This option does not require R1 and R2, instead a contig file can be supplied
-
+	-w 			Disable the winner-takes-all algorithm when clustering by kmer - QUICKER MODE
 	Trimming:
 	--trimmomatic-directory Indicate directory holding trimmomatic .jar executable
 	--no-trim	Reads supplied will not be quality trimmed
@@ -60,12 +60,12 @@ usage : $0 <-1 R1> <-2 R2> <-d database(fasta)> <-s sample_name> [-g group_name]
 	Coverage and Clustering:
 	-C | --coverage-cutoff	<int>	minimum coverage percentage to select a plasmid as scaffold (0-100), default 80
 	-S | --coverage-summary	<int>	minimum coverage percentage to include plasmids in summary image (0-100), default 90
-	-f | --cluster	<int>	identity percentage to cluster plasmids into the same representative sequence (0-100), default 80
-	-k | --kmer	<int>	identity to filter plasmids from the database with kmer approach (0-1), default 0.9
+	-f | --cluster	<float>	kmer identity to cluster plasmids into the same representative sequence (0 means identical) (0-1), default 0.5
+	-k | --kmer	<float>	identity to filter plasmids from the database with kmer approach (0-1), default 0.95
 
 	Contig local alignment
 	-i | --alignment-identity <int>	minimum identity percentage aligned for a contig to annotate, default 90
-	-l | --alignment-percentage <int>	minimun length percentage aligned for a contig to annotate, default 30
+	-l | --alignment-percentage <int>	minimum length percentage aligned for a contig to annotate, default 20
 	-L | --length-total	<int>	minimum alignment length to filter blast analysis
 	--extend-annotation <int>	look for annotation over regions with no homology found (base pairs), default 500bp
 
@@ -174,22 +174,21 @@ max_memory=4000
 cwd="$(pwd)"
 group="NO_GROUP"
 database="Database"
-kmer_cutoff=0.9
+kmer_cutoff=0.95
 coverage_cutoff=80
 coverage_summary=80
-cluster_cutoff=80
+cluster_cutoff=0.5
 alignment_identity=90
-alignment_percentage=30
+alignment_percentage=20
 extend_annotation=500
 R1="R1_file"
 R2="R2_file"
 no_trim=false
-only_reconstruct=false
 explore=false
 include_assembly=true
 annotation=false
 verbose_option_circos=""
-w_winner=""
+w_winner="-w"
 is_verbose=false
 config_dir="$script_dir/config_files"
 trimmomatic_directory=/opt/Trimmomatic/
@@ -256,7 +255,6 @@ while getopts $options opt; do
 		R )
 			only_reconstruct=true
 			no_trim=true
-			reconstruct_fasta=$database
 			;;
 		C )
 			coverage_cutoff=$OPTARG
@@ -289,7 +287,7 @@ while getopts $options opt; do
 			output_dir=$OPTARG
 			;;
 		w)
-			w_winner="-w"
+			w_winner=""
 			;;
 		V)
 			verbose_option_circos="-V"
@@ -333,19 +331,26 @@ printf "${YELLOW}Starting plasmidID version:${VERSION}${NC}\n"
 printf "%s"
 printf "${YELLOW}------------------${NC}\n\n"
 
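+# Infer the run mode: with no reads supplied but a contigs file present, fall back to reconstruct-only mode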
+if [ -z "$r1_file" -a -z "$r2_file" ]
+then
+	if [ -f $contigs ];then
+		only_reconstruct=true
+	fi
+else
+	only_reconstruct=false
+fi
 
 if [ $only_reconstruct = false ]; then
 	##CHECK DEPENDENCIES, MANDATORY FIELDS, FOLDERS AND ARGUMENTS
 
 	echo -e "\n${CYAN}CHECKING DEPENDENCIES AND MANDATORY FILES${NC}"
 
-	check_dependencies.sh blastn bowtie2-build bowtie2 cd-hit-est bedtools prokka samtools mash circos
+	check_dependencies.sh blastn bowtie2-build bowtie2 bedtools prokka samtools mash circos
 
 	check_mandatory_files.sh $r1_file $r2_file $database
 fi
 
 
-
 if [ ! $sample ]; then
 	echo -e "${RED}ERROR${NC}: please, provide a sample name"
 	usage
@@ -384,8 +389,9 @@ fi
 
 if [ $explore = true ]; then
 	coverage_cutoff=60
+	kmer_cutoff=0.9
 	coverage_summary=70
-	cluster_cutoff=70
+	cluster_cutoff=0.3
 	alignment_percentage=10
 fi
 
@@ -459,39 +465,38 @@ fi
 
 ####TRIMMING#################################################################
 #############################################################################
+if [ $only_reconstruct = false ];then
+	if [ $no_trim = false ]; then
 
-if [ $no_trim = false ]; then
-
+		r1_file_mapping=$r1_file
+		r2_file_mapping=$r2_file
 
-	r1_file_mapping=$r1_file
-	r2_file_mapping=$r2_file
+		echo -e "\n${CYAN}TRIMMING READS${NC} $(date)\n \
+	Reads will be quality trimmed with a window of 4 and an average quality of 20"
+		#echo "R1:" $r1_file
+		#echo "R2:" $r2_file
 
-	echo -e "\n${CYAN}TRIMMING READS${NC} $(date)\n \
-Reads will be quality trimmed with a window of 4 and an average quality of 20"
-	#echo "R1:" $r1_file
-	#echo "R2:" $r2_file
+		check_dependencies.sh trimmomatic
 
-	check_dependencies.sh trimmomatic
+		filestrimmed=$(ls $output_dir/$group/$sample/trimmed 2> /dev/null | grep "paired" | wc -l)
 
-	filestrimmed=$(ls $output_dir/$group/$sample/trimmed 2> /dev/null | grep "paired" | wc -l)
-
-	if [ "$filestrimmed" = "4" ];then
-		echo -e "\nFound trimmed files for this sample" $sample;
-		echo "Omitting trimming"
+		if [ "$filestrimmed" = "4" ];then
+			echo -e "\nFound trimmed files for this sample" $sample;
+			echo "Omitting trimming"
+		else
+			quality_trim.sh -1 $r1_file -2 $r2_file -s $sample -g $group -o $output_dir/$group/$sample/trimmed -d $trimmomatic_directory -T $threads &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information. \ncommand: \n quality_trim.sh -1 $r1_file -2 $r2_file -s $sample -g $group -o $output_dir/$group/$sample/trimmed -d $trimmomatic_directory -T $threads"
+		fi
 	else
-		quality_trim.sh -1 $r1_file -2 $r2_file -s $sample -g $group -o $output_dir/$group/$sample/trimmed -d $trimmomatic_directory -T $threads &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information. \ncommand: \n quality_trim.sh -1 $r1_file -2 $r2_file -s $sample -g $group -o $output_dir/$group/$sample/trimmed -d $trimmomatic_directory -T $threads"
+		echo -e "\nNo trim selected, skipping trimming step"
+		r1_file_mapping=$r1_file
+		r2_file_mapping=$r2_file
 	fi
-else
-	echo -e "\nNo trim selected, skipping trimming step"
-	r1_file_mapping=$r1_file
-	r2_file_mapping=$r2_file
-fi
-
-#group/sample/trimmed/sample_1_paired.fastq.gz
-#group/sample/trimmed/sample_1_unpaired.fastq.gz
-#group/sample/trimmed/sample_2_paired.fastq.gz
-#group/sample/trimmed/sample_2_unpaired.fastq.gz
 
+	#group/sample/trimmed/sample_1_paired.fastq.gz
+	#group/sample/trimmed/sample_1_unpaired.fastq.gz
+	#group/sample/trimmed/sample_2_paired.fastq.gz
+	#group/sample/trimmed/sample_2_unpaired.fastq.gz
+fi
 ####ASSEMLY##################################################################
 #############################################################################
 
@@ -529,7 +534,7 @@ fi
 
 if [ $only_reconstruct = false ]; then
 
-	reconstruct_fasta=$output_dir/$group/$sample/mapping/$sample".coverage_adapted_filtered_"$coverage_cutoff"_term.fasta"_$cluster_cutoff
+	reconstruct_fasta=$output_dir/$group/$sample/mapping/$sample".coverage_adapted_filtered_"$coverage_cutoff"_term.fasta"
 
 
 ####### PREVIOUS SUMMARY OUTPUT########################
@@ -577,17 +582,26 @@ printf '%s\n' "${variables[$i]}"
 done
 
 
-	echo -e "\n${YELLOW} STARTING KMER FILTERING, MAPPING and CLUSTERING${NC}\n"
+	echo -e "\n${YELLOW} STARTING KMER FILTERING, CLUSTERING and MAPPING ${NC}\n"
 
 
 	echo -e "\n${CYAN}SCREENING READS WITH KMERS${NC} ($(date))\n \
 Reads will be screened against database supplied for further filtering and mapping,\n \
 this will reduce the input sequences to map against" $sample
+	if [ -f  $output_dir/$group/$sample/kmer/database.msh -a -f $output_dir/$group/$sample/kmer/database.screen.tab ];then
+		echo -e "\nFound a kmer file for sample" $sample;
+		echo "Omitting mash execution"
+	else
+		mash_screener.sh -d $database -1 $r1_file_mapping -s $sample -o $output_dir/$group/$sample/kmer -f $kmer_cutoff $w_winner &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\n command: \nmash_screener.sh -d $database -1 $r1_file_mapping -s $sample -o $output_dir/$group/$sample/kmer"
+	fi
+	#output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term.fasta"
 
-	mash_screener.sh -d $database -1 $r1_file_mapping -s $sample -o $output_dir/$group/$sample/kmer -f $kmer_cutoff $w_winner &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\n command: \nmash_screener.sh -d $database -1 $r1_file_mapping -s $sample -o $output_dir/$group/$sample/kmer"
-
+	echo -e "\n${CYAN}CLUSTERING SEQUENCES BY KMER DISTANCE${NC} ($(date))\n \
+Sequences obtained after screening will be clustered to reduce redundancy,\n \
+one representative, the largest, will be considered for further analysis" $sample
+	mashclust.py -i $output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term.fasta" -d $cluster_cutoff &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\n command:mashclust.py -i $output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term.fasta" -d $cluster_cutoff"
 
-	screened_ddbb=$output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term.fasta"
+	screened_ddbb=$output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term."$cluster_cutoff".representative.fasta"
 	
 	####MAPPING##################################################################
 	#############################################################################
@@ -646,7 +660,7 @@ will pass to further analysis"
  	#sample.coverage_adapted
  	#sample.coverage_adapted_filtered_80
 
-	filter_fasta.sh -i $database -f  $output_dir/$group/$sample/mapping/$sample".coverage_adapted_filtered_"$coverage_cutoff &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nfilter_fasta.sh -i $database -f  $output_dir/$group/$sample/mapping/$sample\".coverage_adapted_filtered_\"$coverage_cutoff"
+	filter_fasta.sh -i $database -f $output_dir/$group/$sample/mapping/$sample".coverage_adapted_filtered_"$coverage_cutoff &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nfilter_fasta.sh -i $database -f  $output_dir/$group/$sample/mapping/$sample\".coverage_adapted_filtered_\"$coverage_cutoff"
 
 	#sample.coverage_adapted_filtered_50_term.fasta
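[Editor's note: the intermediate files named in the comments above track breadth
of coverage per database sequence. A hedged sketch of how such a filter could be
derived with samtools >= 1.10; plasmidID's own helper scripts may compute it
differently:

    # Per-reference coverage summary; column 6 is the percentage of covered bases
    samtools coverage sample.sorted.bam > sample.coverage.tsv
    # Keep references whose breadth reaches the cutoff, e.g. 80%
    awk -v cut=80 'NR > 1 && $6 >= cut {print $1}' sample.coverage.tsv \
        > sample.coverage_adapted_filtered_80
    # filter_fasta.sh then extracts exactly those sequences from the database
]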
 
@@ -658,20 +672,7 @@ will pass to further analysis"
 		exit 1
 	fi
 
-	echo -e "\n${CYAN}CLUSTERING PUTATIVE PLASMIDS${NC} ($(date))\n \
-Clustering by homology removes database redundancy, taking the longest representative of each group.\n \
-Clusters will be composed by plasmids with an identity of" $cluster_cutoff"% or higher"
-
-	if [ -f  $output_dir/$group/$sample/mapping/$sample".coverage_adapted_filtered_"$coverage_cutoff"_term.fasta_"$cluster_cutoff ];then \
-		echo -e "\nFound a clustered file for sample" $sample;
-		echo "Omitting clustering"
-	else
-		cdhit_cluster.sh -i  $output_dir/$group/$sample/mapping/$sample".coverage_adapted_filtered_"$coverage_cutoff"_term.fasta" -c $cluster_cutoff -M $max_memory -T $threads &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\ncdhit_cluster.sh -i  $output_dir/$group/$sample/mapping/$sample\".coverage_adapted_filtered_\"$coverage_cutoff\"_term.fasta\" -c $cluster_cutoff -M $max_memory -T $threads"
-
-	fi
-
-
-	process_cluster_output.sh -i $reconstruct_fasta -b  $output_dir/$group/$sample/mapping/$sample".coverage_adapted" -c $coverage_cutoff &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nprocess_cluster_output.sh -i $reconstruct_fasta -b  $output_dir/$group/$sample/mapping/$sample\".coverage_adapted\" -c $coverage_cutoff"
+	process_cluster_output.sh -i $reconstruct_fasta -b  $output_dir/$group/$sample/mapping/$sample".coverage_adapted" -c 80 &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nprocess_cluster_output.sh -i $reconstruct_fasta -b $output_dir/$group/$sample/mapping/$sample\".coverage_adapted\" -c 80"
 
 	mkdir -p $output_dir/$group/$sample/data
 	cp $output_dir/$group/$sample/mapping/$sample".coverage_adapted_clustered_ac" $output_dir/$group/$sample/data/$sample".coverage_adapted_clustered_ac"
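[Editor's note: the removed block above delegated redundancy removal to CD-HIT,
which the new k-mer clustering replaces. For reference, a minimal cd-hit-est
call similar to what cdhit_cluster.sh plausibly wrapped; exact flags are
assumptions:

    # Cluster at 80% identity; cd-hit-est sorts by length, so the longest
    # sequence of each cluster becomes its representative. -n 5 is the word
    # size required for identity thresholds in the 0.80-0.85 range.
    cd-hit-est -i sample.coverage_adapted_filtered_80_term.fasta \
               -o sample.coverage_adapted_filtered_80_term.fasta_80 \
               -c 0.80 -n 5 -M 4000 -T 8
]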
@@ -719,25 +720,42 @@ A bedgraph file containing mapping information for filtered plasmids will be gen
 else
 #####################################################################################################################################################
 ##########################ONLY RECONSTRUCT###########################################################################################################
+#####################################################################################################################################################
 #####################################################################################################################################################
 
 
-	echo -e "\n${BLUE}ONLY RECONSTRUCT MODE SELECTED${NC} ($(date))\n \
-${YELLOW}WARNING${NC}:PlasmidID will not filter the database supplied by coverage,\n \
-instead all sequences supplied by user will be used as scaffold.\n \
-Please use a fasta file with limited ammount of sequences."
+	echo -e "\n${BLUE}SMRT/CONTIG MODE SELECTED${NC} ($(date))\n \
+PlasmidID will only use the supplied contigs and will not filter them by mapping coverage\n"
 
 	##CHECK DEPENDENCIES, MANDATORY FIELDS, FOLDERS AND ARGUMENTS
 
 	echo -e "\n${CYAN}CHECKING DEPENDENCIES AND MANDATORY FILES${NC}"
 
-	check_dependencies.sh blastn prokka circos
+	check_dependencies.sh blastn prokka circos mash
 
 	check_mandatory_files.sh $contigs $database
 
 	config_file_individual="OR.conf"
 
-	calculate_seqlen.sh -i $database -r -o $output_dir/$group/$sample/data -n "database_reconstruct_"$sample &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\ncalculate_seqlen.sh -i $database -r -o $output_dir/$group/$sample/data -n "database_reconstruct_"$sample"
+	echo -e "\n${CYAN}SCREENING CONTIGS WITH KMERS${NC} ($(date))\n \
+Contigs will be screened against the supplied database for further filtering,\n \
+this will reduce the number of database sequences considered for sample" $sample
+	if [ -f  $output_dir/$group/$sample/kmer/database.msh -a -f $output_dir/$group/$sample/kmer/database.screen.tab ];then
+		echo -e "\nFound a kmer file for sample" $sample;
+		echo "Omitting mash execution"
+	else
+		mash_screener.sh -d $database -1 $contigs -s $sample -o $output_dir/$group/$sample/kmer -f $kmer_cutoff $w_winner &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nmash_screener.sh -d $database -1 $contigs -s $sample -o $output_dir/$group/$sample/kmer -f $kmer_cutoff $w_winner"
+	fi
+	#output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term.fasta"
+
+	echo -e "\n${CYAN}CLUSTERING SEQUENCES BY KMER DISTANCE${NC} ($(date))\n \
+Sequences retained after screening will be clustered to reduce redundancy;\n \
+the largest representative of each cluster is kept for further analysis" $sample
+	mashclust.py -i $output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term.fasta" -d $cluster_cutoff &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nmashclust.py -i $output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff\"_term.fasta\" -d $cluster_cutoff"
+
+	screened_ddbb=$output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term."$cluster_cutoff".representative.fasta"
+
+	calculate_seqlen.sh -i $screened_ddbb -r -o $output_dir/$group/$sample/data -n "database_reconstruct_"$sample &>> $log_file || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\ncalculate_seqlen.sh -i $screened_ddbb -r -o $output_dir/$group/$sample/data -n \"database_reconstruct_\"$sample"
 	#"database_reconstruct_"$sample.length
 
 	awk '{print $1}' $output_dir/$group/$sample/data/"database_reconstruct_"$sample".length" > $output_dir/$group/$sample/data/$sample".coverage_adapted_clustered_ac" || error ${LINENO} $(basename $0) "AWK command fail when creating $sample\".coverage_adapted_clustered_ac\" file .See $output_dir/logs/plasmidID.log for more information.\ncommand:\nawk '{print $1}' $output_dir/$group/$sample/data/\"database_reconstruct_\"$sample\".length\" > $output_dir/$group/$sample/data/$sample\".coverage_adapted_clustered_ac\""
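[Editor's note: calculate_seqlen.sh produces the .length table consumed by the
awk call above. A self-contained sketch that derives the same accession/length
pairs straight from a fasta, assuming plain ">accession description" headers;
the downstream awk '{print $1}' then picks out the accession column:

    awk '/^>/ {if (name) print name, len; name = substr($1, 2); len = 0; next}
              {len += length($0)}
         END  {if (name) print name, len}' \
        screened.fasta > database_reconstruct_sample.length
]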
@@ -746,7 +764,7 @@ Please use a fasta file with limited ammount of sequences."
 	#sample.karyotype_individual.txt
 	#sample.karyotype_individual.txt
 
-	reconstruct_fasta=$database
+	reconstruct_fasta=$output_dir/$group/$sample/kmer/database.filtered_$kmer_cutoff"_term."$cluster_cutoff".representative.fasta"
 
 	if [ -f $output_dir/$group/$sample/data/$sample".bedgraph_term" ];then
 		rm $output_dir/$group/$sample/data/$sample".bedgraph_term"
@@ -1051,5 +1069,9 @@ draw_circos_images.sh -i $output_dir/$group/$sample \
 -l $output_dir/$group/$sample/logs/draw_circos_images.log \
 -c $verbose_option_circos || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\ndraw_circos_images.sh -i $output/$group/$sample -d $config_dir -C $config_file_individual -o $output_dir/$group/$sample/images -g $group -s $sample -l $output_dir/$group/$sample/logs/draw_circos_images.log -c $verbose_option_circos"
 
+echo -e "\n${CYAN}CREATING SUMMARY REPORT${NC} ($(date))\n \
+An HTML report with image thumbnails and useful statistics will be generated to help determine the correct plasmids in the sample."
+
+summary_report_pid.py -i $output_dir/$group/$sample -g || error ${LINENO} $(basename $0) "See $output_dir/logs/plasmidID.log for more information.\ncommand:\nsummary_report_pid.py -i $output_dir/$group/$sample -g"
 
 echo -e "\n${YELLOW}ALL DONE WITH plasmidID${NC}\nThank you for using plasmidID\n"
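[Editor's note: summary_report_pid.py is new in this release. Conceptually it
walks the images directory and renders an HTML page of thumbnails plus
per-plasmid statistics; a bash sketch of the thumbnail part only, with
illustrative paths:

    {
        echo "<html><body><h1>plasmidID summary: $sample</h1>"
        for img in "$output_dir/$group/$sample/images/"*.png; do
            # Each circos image becomes a clickable thumbnail
            echo "<a href=\"$img\"><img src=\"$img\" width=\"200\"></a>"
        done
        echo "</body></html>"
    } > "$output_dir/$group/$sample/${sample}_summary.html"
]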



View it on GitLab: https://salsa.debian.org/med-team/plasmidid/-/commit/14fc856f1e788a910ed2e01e7fc61bd0a313c3a2
