[med-svn] [Git][med-team/seqsero2][upstream] New upstream version 1.0.2
Andreas Tille
gitlab at salsa.debian.org
Wed Dec 11 15:32:00 GMT 2019
Andreas Tille pushed to branch upstream at Debian Med / seqsero2
Commits:
90603416 by Andreas Tille at 2019-12-11T15:19:04Z
New upstream version 1.0.2
- - - - -
15 changed files:
- − H_and_O_and_specific_genes.fasta
- + MANIFEST.in
- README.md
- + __pycache__/Initial_Conditions.cpython-34.pyc
- + __pycache__/version.cpython-34.pyc
- Initial_Conditions.py → bin/Initial_Conditions.py
- SeqSero2_package.py → bin/SeqSero2_package.py
- SeqSero2_update_kmer_database.py → bin/SeqSero2_update_kmer_database.py
- deinterleave_fastq.sh → bin/deinterleave_fastq.sh
- + seqsero2_db/H_and_O_and_specific_genes.fasta
- antigens.pickle → seqsero2_db/antigens.pickle
- invA_mers_dict → seqsero2_db/invA_mers_dict
- special.pickle → seqsero2_db/special.pickle
- + setup.py
- + version.py
Changes:
=====================================
H_and_O_and_specific_genes.fasta deleted
=====================================
The diff for this file was not included because it is too large.
=====================================
MANIFEST.in
=====================================
@@ -0,0 +1,10 @@
+include LICENSE
+include README.md
+include MANIFEST.in
+include version.py
+include setup.py
+include seqsero2_db/antigens.pickle
+include seqsero2_db/H_and_O_and_specific_genes.fasta
+include seqsero2_db/invA_mers_dict
+include seqsero2_db/special.pickle
+include bin/deinterleave_fastq.sh
=====================================
README.md
=====================================
@@ -1,42 +1,44 @@
-# SeqSero2 alpha-test version
-Salmonella serotyping from genome sequencing data
+# SeqSero2 v1.0.1
+Salmonella serotype prediction from genome sequencing data.
+Online version: http://www.denglab.info/SeqSero2
# Introduction
-SeqSero2 is a pipeline for Salmonella serotype determination from raw sequencing reads or genome assemblies. This is a alpha test version. A web app will be available soon.
-
+SeqSero2 is a pipeline for Salmonella serotype prediction from raw sequencing reads or genome assemblies
# Dependencies
-SeqSero has two modes:
-
+SeqSero has three workflows:
-(A) k-mer based mode (default), which applies unique k-mers of serotype determinant alleles to determine Salmonella serotypes in a fast speed. Special thanks to Dr. Hendrik Den Bakker for his significant contribution to this mode, details can be found in [SeqSeroK](https://github.com/hcdenbakker/SeqSeroK) and [SalmID](https://github.com/hcdenbakker/SalmID).
+(A) Allele micro-assembly (default). This workflow takes raw reads as input and performs targeted assembly of serotype determinant alleles. Assembled alleles are used to predict serotype and flag potential inter-serotype contamination in sequencing data (i.e., presence of reads from multiple serotypes due to, for example, cross or carryover contamination during sequencing).
-K-mer mode is a independant pipeline, it only requires:
+Allele micro-assembly workflow depends on:
1. Python 3;
-2. [SRA Toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software) (optional, just used to fastq-dump sra files);
+2. [Burrows-Wheeler Aligner v0.7.12](http://sourceforge.net/projects/bio-bwa/files/);
-(B) allele based mode (if users want to extract serotype determinant alleles), which applies a hybrid approach of reads-mapping and micro-assembly.
+3. [Samtools v1.8](http://sourceforge.net/projects/samtools/files/samtools/);
-Allele mode depends on:
+4. [NCBI BLAST v2.2.28+](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download);
-1. Python 3;
+5. [SRA Toolkit v2.8.0](http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software);
-2. [Burrows-Wheeler Aligner](http://sourceforge.net/projects/bio-bwa/files/);
+6. [SPAdes v3.9.0](http://bioinf.spbau.ru/spades);
-3. [Samtools](http://sourceforge.net/projects/samtools/files/samtools/);
+7. [Bedtools v2.17.0](http://bedtools.readthedocs.io/en/latest/);
-4. [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download);
+8. [SalmID v0.11](https://github.com/hcdenbakker/SalmID).
-5. [SRA Toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software);
-6. [SPAdes](http://bioinf.spbau.ru/spades);
+(B) Raw reads k-mer. This workflow takes raw reads as input and performs rapid serotype prediction based on unique k-mers of serotype determinants.
+
+Raw reads k-mer workflow (originally SeqSeroK) depends on:
+
+1. Python 3;
+2. [SRA Toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software) (optional, just used to fastq-dump sra files);
-7. [Bedtools](http://bedtools.readthedocs.io/en/latest/);
-8. [SalmID](https://github.com/hcdenbakker/SalmID).
+(C) Genome assembly k-mer. This workflow takes genome assemblies as input and the rest of the workflow largely overlaps with the raw reads k-mer workflow
# Executing the code
@@ -44,7 +46,7 @@ Make sure all SeqSero2 and its dependency executables are added to your path (e.
Usage: SeqSero2_package.py
- -m <string> (which mode to apply, 'k'(kmer mode), 'a'(allele mode), default=k)
+ -m <string> (which workflow to apply, 'a'(raw reads allele micro-assembly), 'k'(raw reads and genome assembly k-mer), default=a)
-t <string> (input data type, '1' for interleaved paired-end reads, '2' for separated paired-end reads, '3' for single reads, '4' for genome assembly, '5' for nanopore fasta, '6'for nanopore fastq)
@@ -56,23 +58,26 @@ Make sure all SeqSero2 and its dependency executables are added to your path (e.
-d <string> (output directory name, if not set, the output directory would be 'SeqSero_result_'+time stamp+one random number)
- -c <flag> (if '-c' was flagged, SeqSero2 will use clean mode and only output serotyping prediction without the directory containing log files)
+ -c <flag> (if '-c' was flagged, SeqSero2 will only output serotype prediction without the directory containing log files)
+
+ --check <flag> (use '--check' flag to check the required dependencies)
+
+ -v, --version (show program's version number and exit)
# Examples
-K-mer mode:
+Allele mode:
- # K-mer (default), for separated paired-end raw reads ("-t 2")
- SeqSero2_package.py -t 2 -i R1.fastq.gz R2.fastq.gz
+ # Allele workflow ("-m a", default), for separated paired-end raw reads ("-t 2"), use 10 threads in mapping and assembly ("-p 10")
+ SeqSero2_package.py -p 10 -t 2 -i R1.fastq.gz R2.fastq.gz
- # K-mer (default), for assemblies ("-t 4", assembly only predcited by K-mer mode)
- SeqSero2_package.py -t 4 -i assembly.fasta
+K-mer mode:
-Allele mode:
+ # Raw reads k-mer ("-m k"), for separated paired-end raw reads ("-t 2")
+ SeqSero2_package.py -m k -t 2 -i R1.fastq.gz R2.fastq.gz
- # Allele mode ("-m a"), for separated paired-end raw reads ("-t 2"), use 10 threads in mapping and assembly ("-p 10")
- SeqSero2_package.py -m a -p 10 -t 2 -i R1.fastq.gz R2.fastq.gz
-
+ # Genome assembly k-mer ("-t 4", genome assemblies only predicted by the k-mer workflow, "-m k")
+ SeqSero2_package.py -m k -t 4 -i assembly.fasta
# Output
Upon executing the command, a directory named 'SeqSero_result_Time_your_run' will be created. Your result will be stored in 'Seqsero_result.txt' in that directory. And the assembled alleles can also be found in the directory if using "-m a" (allele mode).
=====================================
__pycache__/Initial_Conditions.cpython-34.pyc
=====================================
Binary files /dev/null and b/__pycache__/Initial_Conditions.cpython-34.pyc differ
=====================================
__pycache__/version.cpython-34.pyc
=====================================
Binary files /dev/null and b/__pycache__/version.cpython-34.pyc differ
=====================================
Initial_Conditions.py → bin/Initial_Conditions.py
=====================================
The diff for this file was not included because it is too large.
=====================================
SeqSero2_package.py → bin/SeqSero2_package.py
=====================================
@@ -11,20 +11,41 @@ import pickle
import argparse
import itertools
from distutils.version import LooseVersion
+from distutils.spawn import find_executable
+
+try:
+ from .version import SeqSero2_version
+except Exception: #ImportError
+ from version import SeqSero2_version
### SeqSero Kmer
def parse_args():
"Parse the input arguments, use '-h' for help."
- parser = argparse.ArgumentParser(usage='SeqSero2_package.py -t <data_type> -m <mode> -i <input_data> [-p <number of threads>] [-b <BWA_algorithm>]\n\nDevelopper: Shaokang Zhang (zskzsk at uga.edu), Hendrik C Den-Bakker (Hendrik.DenBakker at uga.edu) and Xiangyu Deng (xdeng at uga.edu)\n\nContact email:seqsero at gmail.com')#add "-m <data_type>" in future
- parser.add_argument("-i",nargs="+",help="<string>: path/to/input_data")
- parser.add_argument("-t",choices=['1','2','3','4','5','6'],help="<int>: '1'(pair-end reads, interleaved),'2'(pair-end reads, seperated),'3'(single-end reads), '4'(assembly),'5'(nanopore fasta),'6'(nanopore fastq)")
- parser.add_argument("-b",choices=['sam','mem'],default="mem",help="<string>: mode for mapping, 'sam'(bwa samse/sampe), 'mem'(bwa mem), default=mem")
- parser.add_argument("-p",default="1",help="<int>: threads used for mapping mode, if p>4, only 4 threads will be used for assembly, default=1")
- parser.add_argument("-m",choices=['k','a'],default="k",help="<string>: 'k'(kmer mode), 'a'(allele mode), default=k")
+ parser = argparse.ArgumentParser(usage='SeqSero2_package.py -t <data_type> -m <mode> -i <input_data> [-d <output_directory>] [-p <number of threads>] [-b <BWA_algorithm>]\n\nDevelopper: Shaokang Zhang (zskzsk at uga.edu), Hendrik C Den-Bakker (Hendrik.DenBakker at uga.edu) and Xiangyu Deng (xdeng at uga.edu)\n\nContact email:seqsero at gmail.com\n\nVersion: v1.0.2')#add "-m <data_type>" in future
+ parser.add_argument("-i",nargs="+",help="<string>: path/to/input_data",type=os.path.abspath) ### ed_SL_05282019: add 'type=os.path.abspath' to generate absolute path of input data.
+ parser.add_argument("-t",choices=['1','2','3','4','5','6'],help="<int>: '1' for interleaved paired-end reads, '2' for separated paired-end reads, '3' for single reads, '4' for genome assembly, '5' for nanopore fasta, '6' for nanopore fastq")
+ parser.add_argument("-b",choices=['sam','mem'],default="mem",help="<string>: algorithms for bwa mapping for allele mode; 'mem' for mem, 'sam' for samse/sampe; default=mem; optional; for now we only optimized for default 'mem' mode")
+ parser.add_argument("-p",default="1",help="<int>: number of threads for allele mode, if p >4, only 4 threads will be used for assembly since the amount of extracted reads is small, default=1")
+ parser.add_argument("-m",choices=['k','a'],default="a",help="<string>: which workflow to apply, 'a'(raw reads allele micro-assembly), 'k'(raw reads and genome assembly k-mer), default=a")
parser.add_argument("-d",help="<string>: output directory name, if not set, the output directory would be 'SeqSero_result_'+time stamp+one random number")
- parser.add_argument("-c",action="store_true",help="<flag>: if '-c' was flagged, SeqSero2 will use clean mode and only output serotyping prediction, the directory containing log files will be deleted")
+ parser.add_argument("-c",action="store_true",help="<flag>: if '-c' was flagged, SeqSero2 will only output serotype prediction without the directory containing log files")
+ parser.add_argument("--check",action="store_true",help="<flag>: use '--check' flag to check the required dependencies")
+ parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + SeqSero2_version)
return parser.parse_args()
+### ed_SL_05282019: check paths of dependencies
+check_dependencies = parse_args().check
+dependencies = ['bwa','samtools','blastn','fastq-dump','spades.py','bedtools','SalmID.py']
+if check_dependencies:
+ for item in dependencies:
+ ext_path = find_executable(item)
+ if ext_path is not None:
+ print ("Using "+item+" - "+ext_path)
+ else:
+ print ("ERROR: can not find "+item+" in PATH")
+ sys.exit()
+### end of --check
+
def reverse_complement(sequence):
complement = {
'A': 'T',
@@ -258,11 +279,9 @@ def Combine(b, c):
def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subspecies):
#like test_output_06012017.txt
#can add more varialbles like sdf-type, sub-species-type in future (we can conclude it into a special-gene-list)
- from Initial_Conditions import phase1
- from Initial_Conditions import phase2
- from Initial_Conditions import phaseO
- from Initial_Conditions import sero
- from Initial_Conditions import subs
+ from Initial_Conditions import phase1,phase2,phaseO,sero,subs,remove_list,rename_dict
+ rename_dict_not_anymore=[rename_dict[x] for x in rename_dict]
+ rename_dict_all=rename_dict_not_anymore+list(rename_dict) #used for decide whether to
seronames = []
seronames_none_subspecies=[]
for i in range(len(phase1)):
@@ -320,9 +339,20 @@ def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subsp
if (new_fliC in fliC_combine
or fliC in fliC_combine) and (new_fljB in fljB_combine
or fljB in fljB_combine):
- if subs[i] == subspecies:
- seronames.append(sero[i])
- seronames_none_subspecies.append(sero[i])
+ ######start, remove_list,rename_dict, added on 11/11/2018
+ if sero[i] not in remove_list:
+ temp_sero=sero[i]
+ if temp_sero in rename_dict:
+ temp_sero=rename_dict[temp_sero] #rename if in the rename list
+ if temp_sero not in seronames:#the new sero may already included, if yes, then not consider
+ if subs[i] == subspecies:
+ seronames.append(temp_sero)
+ seronames_none_subspecies.append(temp_sero)
+ else:
+ pass
+ else:
+ pass
+ ######end, added on 11/11/2018
#analyze seronames
subspecies_pointer=""
if len(seronames) == 0 and len(seronames_none_subspecies)!=0:
@@ -336,10 +366,12 @@ def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subsp
star_line = ""
if len(seronames) > 1: #there are two possible predictions for serotypes
star = "*"
- star_line = "The predicted serotypes share the same general formula:\t" + Otype + ":" + fliC + ":" + fljB + "\n"
+ #changed 04072019
+ #star_line = "The predicted serotypes share the same general formula:\t" + Otype + ":" + fliC + ":" + fljB + "\n"
if subspecies_pointer=="1" and len(seronames_none_subspecies)!=0:
star="*"
- star_line="The formula with this subspieces prediction can't get a serotype in KW manual, and the serotyping prediction was made without considering it."+star_line
+ star_line=" The predicted O and H antigens correspond to serotype '"+(" or ").join(seronames)+"' in the Kauffmann-White scheme. The predicted subspecies by SalmID (github.com/hcdenbakker/SalmID) may not be consistent with subspecies designation in the Kauffmann-White scheme." + star_line
+ #star_line="The formula with this subspieces prediction can't get a serotype in KW manual, and the serotyping prediction was made without considering it."+star_line
if Otype=="":
Otype="-"
predict_form = Otype + ":" + fliC + ":" + fljB
@@ -350,23 +382,30 @@ def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subsp
for x in special_gene_list:
if x.startswith("sdf"):
sdf = "+"
- predict_form = predict_form + " Sdf prediction:" + sdf
+ #star_line="Detected sdf gene, a marker to differentiate Gallinarum and Enteritidis"
+ star_line=" sdf gene detected." # ed_SL_04152019: new output format
+ #predict_form = predict_form + " Sdf prediction:" + sdf
+ predict_form = predict_form #changed 04072019
if sdf == "-":
star = "*"
- star_line = "Additional characterization is necessary to assign a serotype to this strain. Commonly circulating strains of serotype Enteritidis are sdf+, although sdf- strains of serotype Enteritidis are known to exist. Serotype Gallinarum is typically sdf- but should be quite rare. Sdf- strains of serotype Enteritidis and serotype Gallinarum can be differentiated by phenotypic profile or genetic criteria.\n"
- predict_sero = "Gallinarum/Enteritidis sdf -"
+ #star_line="Didn't detected sdf gene, a marker to differentiate Gallinarum and Enteritidis"
+ star_line=" sdf gene not detected." # ed_SL_04152019: new output format
+ #changed in 04072019, for new output
+ #star_line = "Additional characterization is necessary to assign a serotype to this strain. Commonly circulating strains of serotype Enteritidis are sdf+, although sdf- strains of serotype Enteritidis are known to exist. Serotype Gallinarum is typically sdf- but should be quite rare. Sdf- strains of serotype Enteritidis and serotype Gallinarum can be differentiated by phenotypic profile or genetic criteria.\n"
+ #predict_sero = "Gallinarum/Enteritidis" #04132019, for new output requirement
+ predict_sero = "Gallinarum or Enteritidis" # ed_SL_04152019: new output format
###end of special test for Enteritidis
elif predict_form == "4:i:-":
- predict_sero = "potential monophasic variant of Typhimurium"
+ predict_sero = "I 4,[5],12:i:-" # ed_SL_09242019: change serotype name
elif predict_form == "4:r:-":
- predict_sero = "potential monophasic variant of Heidelberg"
- elif predict_form == "4:b:-":
- predict_sero = "potential monophasic variant of Paratyphi B"
- elif predict_form == "8:e,h:1,2":
- predict_sero = "Newport"
- star = "*"
- star_line = "Serotype Bardo shares the same antigenic profile with Newport, but Bardo is exceedingly rare."
- claim = "The serotype(s) is/are the only serotype(s) with the indicated antigenic profile currently recognized in the Kauffmann White Scheme. New serotypes can emerge and the possibility exists that this antigenic profile may emerge in a different subspecies. Identification of strains to the subspecies level should accompany serotype determination; the same antigenic profile in different subspecies is considered different serotypes.\n"
+ predict_sero = "4:r:-"
+ elif predict_form == "4:b:-": # ed_SL_09272019: change for new output format
+ predict_sero = "N/A (4:b:-)"
+ #elif predict_form == "8:e,h:1,2": #removed after official merge of newport and bardo
+ #predict_sero = "Newport"
+ #star = "*"
+ #star_line = "Serotype Bardo shares the same antigenic profile with Newport, but Bardo is exceedingly rare."
+ claim = " The serotype(s) is/are the only serotype(s) with the indicated antigenic profile currently recognized in the Kauffmann White Scheme. New serotypes can emerge and the possibility exists that this antigenic profile may emerge in a different subspecies. Identification of strains to the subspecies level should accompany serotype determination; the same antigenic profile in different subspecies is considered different serotypes.\n"
if "N/A" in predict_sero:
claim = ""
#special test for Typhimurium
@@ -381,9 +420,11 @@ def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subsp
if normal > mutation:
pass
elif normal < mutation:
- predict_sero = predict_sero.strip() + "(O5-)"
+ #predict_sero = predict_sero.strip() + "(O5-)"
+ predict_sero = predict_sero.strip() #diable special sero for new output requirement, 04132019
star = "*"
- star_line = "Detected the deletion of O5-."
+ #star_line = "Detected the deletion of O5-."
+ star_line = " Detected a deletion that causes O5- variant of Typhimurium." # ed_SL_04152019: new output format
else:
pass
#special test for Paratyphi B
@@ -397,23 +438,29 @@ def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subsp
mutation = float(special_gene_list[x])
#print(normal,mutation)
if normal > mutation:
- predict_sero = predict_sero.strip() + "(dt+)"
+ #predict_sero = predict_sero.strip() + "(dt+)" #diable special sero for new output requirement, 04132019
+ predict_sero = predict_sero.strip()+' var. L(+) tartrate+' if "Paratyphi B" in predict_sero else predict_sero.strip() # ed_SL_04152019: new output format
star = "*"
- star_line = "Didn't detect the SNP for dt- which means this isolate is a Paratyphi B variant L(+) tartrate(+)."
+ #star_line = "Didn't detect the SNP for dt- which means this isolate is a Paratyphi B variant L(+) tartrate(+)."
+ star_line = " The SNP that causes d-Tartrate nonfermentating phenotype of Paratyphi B was not detected. " # ed_SL_04152019: new output format
elif normal < mutation:
- predict_sero = predict_sero.strip() + "(dt-)"
+ #predict_sero = predict_sero.strip() + "(dt-)" #diable special sero for new output requirement, 04132019
+ predict_sero = predict_sero.strip()
star = "*"
- star_line = "Detected the SNP for dt- which means this isolate is a systemic pathovar of Paratyphi B."
+ #star_line = "Detected the SNP for dt- which means this isolate is a systemic pathovar of Paratyphi B."
+ star_line = " Detected the SNP for d-Tartrate nonfermenting phenotype of Paratyphi B." # ed_SL_04152019: new output format
else:
star = "*"
- star_line = "Failed to detect the SNP for dt-, can't decide it's a Paratyphi B variant L(+) tartrate(+) or not."
+ #star_line = " Failed to detect the SNP for dt-, can't decide it's a Paratyphi B variant L(+) tartrate(+) or not."
+ star_line = " " ## ed_SL_05152019: do not report this situation.
#special test for O13,22 and O13,23
if Otype=="13":
- ex_dir = os.path.dirname(os.path.realpath(__file__))
+ #ex_dir = os.path.dirname(os.path.realpath(__file__))
+ ex_dir = os.path.abspath(os.path.join(os.path.dirname(os.path.dirname(__file__)),'seqsero2_db')) # ed_SL_09152019
f = open(ex_dir + '/special.pickle', 'rb')
special = pickle.load(f)
O22_O23=special['O22_O23']
- if predict_sero.split(" or ")[0] in O22_O23[-1]:
+ if predict_sero.split(" or ")[0] in O22_O23[-1] and predict_sero.split(" or ")[0] not in rename_dict_all:#if in rename_dict_all, then it means already merged, no need to analyze
O22_score=0
O23_score=0
for x in special_gene_list:
@@ -426,26 +473,28 @@ def seqsero_from_formula_to_serotypes(Otype, fliC, fljB, special_gene_list,subsp
if predict_sero.split(" or ")[0] in z:
if O22_score > O23_score:
star = "*"
- star_line = "Detected O22 specific genes to further differenciate '"+predict_sero+"'."
+ #star_line = "Detected O22 specific genes to further differenciate '"+predict_sero+"'." #diabled for new output requirement, 04132019
predict_sero = z[0]
elif O22_score < O23_score:
star = "*"
- star_line = "Detected O23 specific genes to further differenciate '"+predict_sero+"'."
+ #star_line = "Detected O23 specific genes to further differenciate '"+predict_sero+"'." #diabled for new output requirement, 04132019
predict_sero = z[1]
else:
star = "*"
- star_line = "Fail to detect O22 and O23 differences."
+ #star_line = "Fail to detect O22 and O23 differences." #diabled for new output requirement, 04132019
+ if " or " in predict_sero:
+ star_line = star_line + " The predicted serotypes share the same general formula:\t" + Otype + ":" + fliC + ":" + fljB + "\n"
#special test for O6,8
- merge_O68_list=["Blockley","Bovismorbificans","Hadar","Litchfield","Manhattan","Muenchen"]
- for x in merge_O68_list:
- if x in predict_sero:
- predict_sero=x
- star=""
- star_line=""
+ #merge_O68_list=["Blockley","Bovismorbificans","Hadar","Litchfield","Manhattan","Muenchen"] #remove 11/11/2018, because already in merge list
+ #for x in merge_O68_list:
+ # if x in predict_sero:
+ # predict_sero=x
+ # star=""
+ # star_line=""
#special test for Montevideo; most of them are monophasic
- if "Montevideo" in predict_sero and "1,2,7" in predict_form:
- star="*"
- star_line="Montevideo is almost always monophasic, having an antigen called for the fljB position may be a result of Salmonella-Salmonella contamination."
+ #if "Montevideo" in predict_sero and "1,2,7" in predict_form: #remove 11/11/2018, because already in merge list
+ #star="*"
+ #star_line="Montevideo is almost always monophasic, having an antigen called for the fljB position may be a result of Salmonella-Salmonella contamination."
return predict_form, predict_sero, star, star_line, claim
### End of SeqSero Kmer part
@@ -643,21 +692,26 @@ def decide_O_type_and_get_special_genes(Final_list,Final_list_passed):
if potenial_new_gene!="":
print("two differnt genes in same contig, fix it for O antigen")
print(potenial_new_gene[:3])
- final_O.append(potenial_new_gene)
+ pointer=0
+ for y in final_O:
+ if y[0].split("___")[-1]==potenial_new_gene[0].split("___")[-1]:
+ pointer=1
+ if pointer!=0: #changed to consider two genes in same contig
+ final_O.append(potenial_new_gene)
### end of the two genes on same contig test
final_O=sorted(final_O,key=lambda x: x[2], reverse=True)#sorted
if len(final_O)==0 or (len(final_O)==1 and "O-1,3,19_not_in_3,10" in final_O[0][0]):
#print "$$$No Otype, due to no hit"#may need to be changed
O_choice="-"
else:
- highest_O_coverage=max([float(x[0].split("_cov_")[-1]) for x in final_O if "O-1,3,19_not_in_3,10" not in x[0]])
+ highest_O_coverage=max([float(x[0].split("_cov_")[-1].split("_")[0]) for x in final_O if "O-1,3,19_not_in_3,10" not in x[0]])
O_list=[]
O_list_less_contamination=[]
for x in final_O:
if not "O-1,3,19_not_in_3,10__130" in x[0]:#O-1,3,19_not_in_3,10 is too small, which may affect further analysis; to avoid contamination affect, use 0.15 of highest coverage as cut-off
O_list.append(x[0].split("__")[0])
O_nodes_list.append(x[0].split("___")[1])
- if float(x[0].split("_cov_")[-1])>highest_O_coverage*0.15:
+ if float(x[0].split("_cov_")[-1].split("_")[0])>highest_O_coverage*0.15:
O_list_less_contamination.append(x[0].split("__")[0])
### special test for O9,46 and O3,10 family
if ("O-9,46_wbaV" in O_list or "O-9,46_wbaV-from-II-9,12:z29:1,5-SRR1346254" in O_list) and O_list_less_contamination[0].startswith("O-9,"):#not sure should use and float(O9_wbaV)/float(num_1) > 0.1
@@ -695,7 +749,7 @@ def decide_O_type_and_get_special_genes(Final_list,Final_list_passed):
try:
max_score=0
for x in final_O:
- if x[2]>=max_score and float(x[0].split("_cov_")[-1])>highest_O_coverage*0.15:#use x[2],08172018, the "coverage identity = cover_length * identity"; also meet coverage threshold
+ if x[2]>=max_score and float(x[0].split("_cov_")[-1].split("_")[0])>highest_O_coverage*0.15:#use x[2],08172018, the "coverage identity = cover_length * identity"; also meet coverage threshold
max_score=x[2]#change from x[-1] to x[2],08172018
O_choice=x[0].split("_")[0]
if O_choice=="O-1,3,19":
@@ -726,7 +780,7 @@ def decide_O_type_and_get_special_genes(Final_list,Final_list_passed):
contamination_O=""
else:
contamination_O="potential contamination from O antigen signals"
- return O_choice,O_nodes_list,special_genes,final_O,contamination_O
+ return O_choice,O_nodes_list,special_genes,final_O,contamination_O,Otypes_uniq
### End of SeqSero2 allele prediction and output
def get_input_files(make_dir,input_file,data_type,dirpath):
@@ -764,41 +818,60 @@ def get_input_files(make_dir,input_file,data_type,dirpath):
os.chdir("..")
return for_fq,rev_fq
-def predict_O_and_H_types(Final_list,Final_list_passed):
+def predict_O_and_H_types(Final_list,Final_list_passed,new_fasta):
#get O and H types from Final_list from blast parsing; allele mode
+ from Bio import SeqIO
fliC_choice="-"
fljB_choice="-"
fliC_contig="NA"
fljB_contig="NA"
fliC_region=set([0])
fljB_region=set([0,])
- fliC_length=0 #can be changed to coverage in future
- fljB_length=0 #can be changed to coverage in future
+ fliC_length=0 #can be changed to coverage in future; in 03292019, changed to ailgned length
+ fljB_length=0 #can be changed to coverage in future; in 03292019, changed to ailgned length
O_choice="-"#no need to decide O contig for now, should be only one
- O_choice,O_nodes,special_gene_list,O_nodes_roles,contamination_O=decide_O_type_and_get_special_genes(Final_list,Final_list_passed)#decide the O antigen type and also return special-gene-list for further identification
+ O_choice,O_nodes,special_gene_list,O_nodes_roles,contamination_O,Otypes_uniq=decide_O_type_and_get_special_genes(Final_list,Final_list_passed)#decide the O antigen type and also return special-gene-list for further identification
O_choice=O_choice.split("-")[-1].strip()
if (O_choice=="1,3,19" and len(O_nodes_roles)==1 and "1,3,19" in O_nodes_roles[0][0]) or O_choice=="":
O_choice="-"
H_contig_roles=decide_contig_roles_for_H_antigen(Final_list,Final_list_passed)#decide the H antigen contig is fliC or fljB
+ #add alignment locations, used for further selection, 03312019
+ for i in range(len(H_contig_roles)):
+ x=H_contig_roles[i]
+ for y in Final_list_passed:
+ if x[1] in y[0] and y[0].startswith(x[0]):
+ H_contig_roles[i]+=H_contig_roles[i]+(y[-1],)
+ break
log_file=open("SeqSero_log.txt","a")
- print("O_contigs:")
+ extract_file=open("Extracted_antigen_alleles.fasta","a")
+ handle_fasta=list(SeqIO.parse(new_fasta,"fasta"))
+
+ #print("O_contigs:")
log_file.write("O_contigs:\n")
+ extract_file.write("#Sequences with antigen signals (if the micro-assembled contig only covers the flanking region, it will not be used for contamination analysis)\n")
+ extract_file.write("#O_contigs:\n")
for x in O_nodes_roles:
if "O-1,3,19_not_in_3,10" not in x[0]:#O-1,3,19_not_in_3,10 is just a small size marker
- print(x[0].split("___")[-1],x[0].split("__")[0],"blast score:",x[1],"identity%:",str(round(x[2]*100,2))+"%",str(min(x[-1]))+" to "+str(max(x[-1])))
- log_file.write(x[0].split("___")[-1]+" "+x[0].split("__")[0]+" "+"blast score: "+str(x[1])+"identity%:"+str(round(x[2]*100,2))+"% "+str(min(x[-1]))+" to "+str(max(x[-1]))+"\n")
+ #print(x[0].split("___")[-1],x[0].split("__")[0],"blast score:",x[1],"identity%:",str(round(x[2]*100,2))+"%",str(min(x[-1]))+" to "+str(max(x[-1])))
+ log_file.write(x[0].split("___")[-1]+" "+x[0].split("__")[0]+"; "+"blast score: "+str(x[1])+" identity%: "+str(round(x[2]*100,2))+"%; alignment from "+str(min(x[-1]))+" to "+str(max(x[-1]))+" of antigen\n")
+ title=">"+x[0].split("___")[-1]+" "+x[0].split("__")[0]+"; "+"blast score: "+str(x[1])+" identity%: "+str(round(x[2]*100,2))+"%; alignment from "+str(min(x[-1]))+" to "+str(max(x[-1]))+" of antigen\n"
+ seqs=""
+ for z in handle_fasta:
+ if x[0].split("___")[-1]==z.description:
+ seqs=str(z.seq)
+ extract_file.write(title+seqs+"\n")
if len(H_contig_roles)!=0:
- highest_H_coverage=max([float(x[1].split("_cov_")[-1]) for x in H_contig_roles]) #less than highest*0.1 would be regarded as contamination and noises, they will still be considered in contamination detection and logs, but not used as final serotype output
+ highest_H_coverage=max([float(x[1].split("_cov_")[-1].split("_")[0]) for x in H_contig_roles]) #less than highest*0.1 would be regarded as contamination and noises, they will still be considered in contamination detection and logs, but not used as final serotype output
else:
highest_H_coverage=0
for x in H_contig_roles:
#if multiple choices, temporately select the one with longest length for now, will revise in further change
- if "fliC" == x[0] and int(x[1].split("_")[3])>=fliC_length and x[1] not in O_nodes and float(x[1].split("_cov_")[-1])>highest_H_coverage*0.13:#remember to avoid the effect of O-type contig, so should not in O_node list
+ if "fliC" == x[0] and len(x[-1])>=fliC_length and x[1] not in O_nodes and float(x[1].split("_cov_")[-1].split("_")[0])>highest_H_coverage*0.13:#remember to avoid the effect of O-type contig, so should not in O_node list
fliC_contig=x[1]
- fliC_length=int(x[1].split("_")[3])
- elif "fljB" == x[0] and int(x[1].split("_")[3])>=fljB_length and x[1] not in O_nodes and float(x[1].split("_cov_")[-1])>highest_H_coverage*0.13:
+ fliC_length=len(x[-1])
+ elif "fljB" == x[0] and len(x[-1])>=fljB_length and x[1] not in O_nodes and float(x[1].split("_cov_")[-1].split("_")[0])>highest_H_coverage*0.13:
fljB_contig=x[1]
- fljB_length=int(x[1].split("_")[3])
+ fljB_length=len(x[-1])
for x in Final_list_passed:
if fliC_choice=="-" and "fliC_" in x[0] and fliC_contig in x[0]:
fliC_choice=x[0].split("_")[1]
@@ -858,8 +931,9 @@ def predict_O_and_H_types(Final_list,Final_list_passed):
pass
#print(x[:3],contig_length,len(target_region&x[3])/float(len(x[3])),contig_length*0.5,len(x[3]),contig_length*1.5)
#end of removing none-middle contigs
- print("H_contigs:")
+ #print("H_contigs:")
log_file.write("H_contigs:\n")
+ extract_file.write("#H_contigs:\n")
H_contig_stat=[]
H1_cont_stat={}
H2_cont_stat={}
@@ -871,13 +945,25 @@ def predict_O_and_H_types(Final_list,Final_list_passed):
if "first" in y[0] or "last" in y[0]: #this is the final filter to decide it's fliC or fljB, if can't pass, then can't decide
for y in Final_list_passed: #it's impossible to has the "first" and "last" allele as prediction, so re-do it
if x[1] in y[0]:#it's very possible to be third phase allele, so no need to make it must be fliC or fljB
- print(x[1],"can't_decide_fliC_or_fljB",y[0].split("_")[1],"blast_score:",y[1],"identity%:",str(round(y[2]*100,2))+"%",str(min(y[-1]))+" to "+str(max(y[-1])))
- log_file.write(x[1]+" "+x[0]+" "+y[0].split("_")[1]+" "+"blast_score: "+str(y[1])+" identity%:"+str(round(y[2]*100,2))+"% "+str(min(y[-1]))+" to "+str(max(y[-1]))+"\n")
+ #print(x[1],"can't_decide_fliC_or_fljB",y[0].split("_")[1],"blast_score:",y[1],"identity%:",str(round(y[2]*100,2))+"%",str(min(y[-1]))+" to "+str(max(y[-1])))
+ log_file.write(x[1]+" "+x[0]+" "+y[0].split("_")[1]+"; "+"blast score: "+str(y[1])+" identity%: "+str(round(y[2]*100,2))+"%; alignment from "+str(min(y[-1]))+" to "+str(max(y[-1]))+" of antigen\n")
H_contig_roles[i]="can't decide fliC or fljB, may be third phase"
+ title=">"+x[1]+" "+x[0]+" "+y[0].split("_")[1]+"; "+"blast score: "+str(y[1])+" identity%: "+str(round(y[2]*100,2))+"%; alignment from "+str(min(y[-1]))+" to "+str(max(y[-1]))+" of antiten\n"
+ seqs=""
+ for z in handle_fasta:
+ if x[1]==z.description:
+ seqs=str(z.seq)
+ extract_file.write(title+seqs+"\n")
break
else:
- print(x[1],x[0],y[0].split("_")[1],"blast_score:",y[1],"identity%:",str(round(y[2]*100,2))+"%",str(min(y[-1]))+" to "+str(max(y[-1])))
- log_file.write(x[1]+" "+x[0]+" "+y[0].split("_")[1]+" "+"blast_score: "+str(y[1])+" identity%:"+str(round(y[2]*100,2))+"% "+str(min(y[-1]))+" to "+str(max(y[-1]))+"\n")
+ #print(x[1],x[0],y[0].split("_")[1],"blast_score:",y[1],"identity%:",str(round(y[2]*100,2))+"%",str(min(y[-1]))+" to "+str(max(y[-1])))
+ log_file.write(x[1]+" "+x[0]+" "+y[0].split("_")[1]+"; "+"blast score: "+str(y[1])+" identity%: "+str(round(y[2]*100,2))+"%; alignment from "+str(min(y[-1]))+" to "+str(max(y[-1]))+" of antigen\n")
+ title=">"+x[1]+" "+x[0]+" "+y[0].split("_")[1]+"; "+"blast score: "+str(y[1])+" identity%: "+str(round(y[2]*100,2))+"%; alignment from "+str(min(y[-1]))+" to "+str(max(y[-1]))+" of antigen\n"
+ seqs=""
+ for z in handle_fasta:
+ if x[1]==z.description:
+ seqs=str(z.seq)
+ extract_file.write(title+seqs+"\n")
if x[0]=="fliC":
if y[0].split("_")[1] not in H1_cont_stat:
H1_cont_stat[y[0].split("_")[1]]=y[2]
@@ -899,12 +985,41 @@ def predict_O_and_H_types(Final_list,Final_list_passed):
contamination_H="potential contamination from H antigen signals"
elif len(H2_cont_stat_list)==1 and fljB_contig=="NA":
contamination_H="potential contamination from H antigen signals, uncommon weak fljB signals detected"
- print(contamination_O)
- print(contamination_H)
+ #get additional antigens
+ """
+ if ("O-9,46_wbaV" in O_list or "O-9,46_wbaV-from-II-9,12:z29:1,5-SRR1346254" in O_list) and O_list_less_contamination[0].startswith("O-9,"):#not sure should use and float(O9_wbaV)/float(num_1) > 0.1
+ if "O-9,46_wzy" in O_list:#and float(O946_wzy)/float(num_1) > 0.1
+ O_choice="O-9,46"
+ #print "$$$Most possilble Otype: O-9,46"
+ elif "O-9,46,27_partial_wzy" in O_list:#and float(O94627)/float(num_1) > 0.1
+ O_choice="O-9,46,27"
+ #print "$$$Most possilble Otype: O-9,46,27"
+ elif ("O-3,10_wzx" in O_list) and ("O-9,46_wzy" in O_list) and (O_list[0].startswith("O-3,10") or O_list_less_contamination[0].startswith("O-9,46_wzy")):#and float(O310_wzx)/float(num_1) > 0.1 and float(O946_wzy)/float(num_1) > 0.1
+ if "O-3,10_not_in_1,3,19" in O_list:#and float(O310_no_1319)/float(num_1) > 0.1
+ O_choice="O-3,10"
+ #print "$$$Most possilble Otype: O-3,10 (contain O-3,10_not_in_1,3,19)"
+ else:
+ O_choice="O-1,3,19"
+ #print "$$$Most possilble Otype: O-1,3,19 (not contain O-3,10_not_in_1,3,19)"
+ ### end of special test for O9,46 and O3,10 family
+
+ if O_choice=="O-9,46,27" or O_choice=="O-3,10" or O_choice=="O-1,3,19":
+ if len(Otypes_uniq)>2:
+ contamination_O="potential contamination from O antigen signals"
+ else:
+ if len(Otypes_uniq)>1:
+ if O_choice=="O-4" and len(Otypes_uniq)==2 and "O-9,46,27" in Otypes_uniq: #for special 4,12,27 case such as Bredeney and Schwarzengrund
+ contamination_O=""
+ elif O_choice=="O-9,46" and len(Otypes_uniq)==2 and "O-9,46_wbaV" in Otypes_uniq and "O-9,46_wzy" in Otypes_uniq: #for special 4,12,27 case such as Bredeney and Schwarzengrund
+ contamination_O=""
+ """
+ additonal_antigents=[]
+ #print(contamination_O)
+ #print(contamination_H)
log_file.write(contamination_O+"\n")
log_file.write(contamination_H+"\n")
log_file.close()
- return O_choice,fliC_choice,fljB_choice,special_gene_list,contamination_O,contamination_H
+ return O_choice,fliC_choice,fljB_choice,special_gene_list,contamination_O,contamination_H,Otypes_uniq,H1_cont_stat_list,H2_cont_stat_list
def get_input_K(input_file,lib_dict,data_type,k_size):
#kmer mode; get input_Ks from dict and data_type
@@ -1064,6 +1179,8 @@ def map_and_sort(threads,database,fnameA,fnameB,sam,bam,for_sai,rev_sai,sorted_b
def extract_mapped_reads_and_do_assembly_and_blast(current_time,sorted_bam,combined_fq,mapped_fq1,mapped_fq2,threads,fnameA,fnameB,database,mapping_mode):
#seqsero2 -a; extract, assembly and blast
subprocess.check_call("bamToFastq -i "+sorted_bam+" -fq "+combined_fq,shell=True)
+ #print("fnameA:",fnameA)
+ #print("fnameB:",fnameB)
if fnameB!="":
subprocess.check_call("bamToFastq -i "+sorted_bam+" -fq "+mapped_fq1+" -fq2 "+mapped_fq2 + " 2>> data_log.txt",shell=True)#2> /dev/null if want no output
else:
@@ -1074,12 +1191,13 @@ def extract_mapped_reads_and_do_assembly_and_blast(current_time,sorted_bam,combi
t="4"
else:
t=threads
- if os.path.getsize(combined_fq)>100 and os.path.getsize(mapped_fq1)>100:#if not, then it's "-:-:-"
+ if os.path.getsize(combined_fq)>100 and (fnameB=="" or os.path.getsize(mapped_fq1)>100):#if not, then it's "-:-:-"
if fnameB!="":
subprocess.check_call("spades.py --careful --pe1-s "+combined_fq+" --pe1-1 "+mapped_fq1+" --pe1-2 "+mapped_fq2+" -t "+t+" -o "+outdir+ " >> data_log.txt 2>&1",shell=True)
else:
subprocess.check_call("spades.py --careful --pe1-s "+combined_fq+" -t "+t+" -o "+outdir+ " >> data_log.txt 2>&1",shell=True)
- new_fasta=fnameA+"_"+database+"_"+mapping_mode+".fasta"
+ #new_fasta=fnameA+"_"+database+"_"+mapping_mode+".fasta"
+ new_fasta=fnameA+"_"+database.split('/')[-1]+"_"+mapping_mode+".fasta" # ed_SL_09152019: change path to databse for packaging
subprocess.check_call("mv "+outdir+"/contigs.fasta "+new_fasta+ " 2> /dev/null",shell=True)
#os.system("mv "+outdir+"/scaffolds.fasta "+new_fasta+ " 2> /dev/null") contigs.fasta
subprocess.check_call("rm -rf "+outdir+ " 2> /dev/null",shell=True)
@@ -1089,7 +1207,7 @@ def extract_mapped_reads_and_do_assembly_and_blast(current_time,sorted_bam,combi
subprocess.check_call("blastn -query "+database+" -db "+new_fasta+"_db -out "+xmlfile+" -outfmt 5 >> data_log.txt 2>&1",shell=True)###1/27/2015; 08272018, remove "-word_size 10"
else:
xmlfile="NA"
- return xmlfile
+ return xmlfile,new_fasta
def judge_subspecies(fnameA):
#seqsero2 -a; judge subspecies on just forward raw reads fastq
@@ -1139,8 +1257,12 @@ def main():
make_dir=args.d
clean_mode=args.c
k_size=27 #will change for bug fixing
- database="H_and_O_and_specific_genes.fasta"
+ #database="H_and_O_and_specific_genes.fasta"
dirpath = os.path.abspath(os.path.dirname(os.path.realpath(__file__)))
+ ex_dir = os.path.abspath(os.path.join(os.path.dirname(os.path.dirname(__file__)),'seqsero2_db')) # ed_SL_09152019: add ex_dir for packaging
+ database=ex_dir+"/H_and_O_and_specific_genes.fasta" # ed_SL_09152019: change path to database for packaging
+ note="Note:"
+ NA_note=" This predicted serotype is not in the Kauffman-White scheme." # ed_SL_09272019: add for new output format
if len(sys.argv)==1:
subprocess.check_call(dirpath+"/SeqSero2_package.py -h",shell=True)#change name of python file
else:
@@ -1153,7 +1275,9 @@ def main():
else:
subprocess.check_call(["mkdir",make_dir])
#subprocess.check_call("cp "+dirpath+"/"+database+" "+" ".join(input_file)+" "+make_dir,shell=True)
- subprocess.check_call("ln -sr "+dirpath+"/"+database+" "+" ".join(input_file)+" "+make_dir,shell=True)
+ #subprocess.check_call("ln -sr "+dirpath+"/"+database+" "+" ".join(input_file)+" "+make_dir,shell=True)
+ subprocess.check_call("ln -f -s "+database+" "+" ".join(input_file)+" "+make_dir,shell=True) # ed_SL_09152019: change path to database for packaging
+ #subprocess.check_call("ln -f -s "+dirpath+"/"+database+" "+" ".join(input_file)+" "+make_dir,shell=True) ### ed_SL_05282019: use -f option to force the replacement of links, remove -r and use absolute path instead to avoid link issue (use 'type=os.path.abspath' in -i argument).
############################begin the real analysis
if analysis_mode=="a":
if data_type in ["1","2","3"]:#use allele mode
@@ -1165,7 +1289,7 @@ def main():
current_time=time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime())
sam,bam,sorted_bam,mapped_fq1,mapped_fq2,combined_fq,for_sai,rev_sai=get_temp_file_names(fnameA,fnameB) #get temp files id
map_and_sort(threads,database,fnameA,fnameB,sam,bam,for_sai,rev_sai,sorted_bam,mapping_mode) #do mapping and sort
- xmlfile=extract_mapped_reads_and_do_assembly_and_blast(current_time,sorted_bam,combined_fq,mapped_fq1,mapped_fq2,threads,fnameA,fnameB,database,mapping_mode) #extract the mapped reads and do micro assembly and blast
+ xmlfile,new_fasta=extract_mapped_reads_and_do_assembly_and_blast(current_time,sorted_bam,combined_fq,mapped_fq1,mapped_fq2,threads,fnameA,fnameB,database,mapping_mode) #extract the mapped reads and do micro assembly and blast
if xmlfile=="NA":
O_choice,fliC_choice,fljB_choice,special_gene_list,contamination_O,contamination_H=("-","-","-",[],"","")
else:
@@ -1174,36 +1298,89 @@ def main():
for x in Final_list:
file.write("\t".join(str(y) for y in x)+"\n")
file.close()
- Final_list_passed=[x for x in Final_list if float(x[0].split("_cov_")[1])>=0.9 and (x[1]>=int(x[0].split("__")[1]) or x[1]>=int(x[0].split("___")[1].split("_")[3]) or x[1]>1000)]
- O_choice,fliC_choice,fljB_choice,special_gene_list,contamination_O,contamination_H=predict_O_and_H_types(Final_list,Final_list_passed) #predict O, fliC and fljB
+ Final_list_passed=[x for x in Final_list if float(x[0].split("_cov_")[1].split("_")[0])>=0.9 and (x[1]>=int(x[0].split("__")[1]) or x[1]>=int(x[0].split("___")[1].split("_")[3]) or x[1]>1000)]
+ O_choice,fliC_choice,fljB_choice,special_gene_list,contamination_O,contamination_H,Otypes_uniq,H1_cont_stat_list,H2_cont_stat_list=predict_O_and_H_types(Final_list,Final_list_passed,new_fasta) #predict O, fliC and fljB
subspecies=judge_subspecies(fnameA) #predict subspecies
###output
predict_form,predict_sero,star,star_line,claim=seqsero_from_formula_to_serotypes(O_choice,fliC_choice,fljB_choice,special_gene_list,subspecies)
+ claim="" #04132019, disable claim for new report requirement
contamination_report=""
+ H_list=["fliC_"+x for x in H1_cont_stat_list if len(x)>0]+["fljB_"+x for x in H2_cont_stat_list if len(x)>0]
if contamination_O!="" and contamination_H=="":
- contamination_report="#detected potential contamination of mixed serotypes from O antigen signals.\n"
+ contamination_report="#Potential inter-serotype contamination detected from O antigen signals. All O-antigens detected:"+"\t".join(Otypes_uniq)+"."
elif contamination_O=="" and contamination_H!="":
- contamination_report="#detected potential contamination of mixed serotypes or potential thrid H phase from H antigen signals.\n"
+ contamination_report="#Potential inter-serotype contamination detected or potential thrid H phase from H antigen signals. All H-antigens detected:"+"\t".join(H_list)+"."
elif contamination_O!="" and contamination_H!="":
- contamination_report="#detected potential contamination of mixed serotypes from both O and H antigen signals.\n"
+ contamination_report="#Potential inter-serotype contamination detected from both O and H antigen signals.All O-antigens detected:"+"\t".join(Otypes_uniq)+". All H-antigens detected:"+"\t".join(H_list)+"."
+ if contamination_report!="":
+ #contamination_report="potential inter-serotype contamination detected (please refer below antigen signal report for details)." #above contamination_reports are for back-up and bug fixing #web-based mode need to be re-used, 04132019
+ contamination_report=" Co-existence of multiple serotypes detected, indicating potential inter-serotype contamination. See 'Extracted_antigen_alleles.fasta' for detected serotype determinant alleles."
+ #claim="\n"+open("Extracted_antigen_alleles.fasta","r").read()#used to store H and O antigen sequeences #04132019, need to change if using web-version
+ ## ed_SL_09272019: change for new output format
+ #if contamination_report+star_line+claim=="": #0413, new output style
+ # note=""
+ #else:
+ # note="Note:"
if clean_mode:
subprocess.check_call("rm -rf ../"+make_dir,shell=True)
make_dir="none-output-directory due to '-c' flag"
else:
- new_file=open("Seqsero_result.txt","w")
+ new_file=open("SeqSero_result.txt","w")
if O_choice=="":
O_choice="-"
- new_file.write("Output_directory:"+make_dir+"\nInput files:\t"+for_fq+" "+rev_fq+"\n"+"O antigen prediction:\t"+O_choice+"\n"+"H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+"H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+"Predicted antigenic profile:\t"+predict_form+"\n"+"Predicted subspecies:\t"+subspecies+"\n"+"Predicted serotype(s):\t"+predict_sero+star+"\n"+contamination_report+star+star_line+claim+"\n")#+##
+ if "N/A" not in predict_sero:
+ new_file.write("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+"\t".join(input_file)+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+
+ "H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_sero+"\n"+ # ed_SL_04152019: change serotype(s) to serotype
+ note+contamination_report+star_line+claim+"\n")#+##
+ else:
+ #star_line=star_line.strip()+"\tNone such antigenic formula in KW.\n"
+ star_line="" #04132019, for new output requirement, diable star_line if "NA" in output
+ new_file.write("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+"\t".join(input_file)+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+
+ "H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_form+"\n"+ # ed_SL_09242019: add serotype output for "N/A" prediction
+ note+NA_note+contamination_report+star_line+claim+"\n")#+##
new_file.close()
print("\n")
#subprocess.check_call("cat Seqsero_result.txt",shell=True)
#subprocess.call("rm H_and_O_and_specific_genes.fasta* *.sra *.bam *.sam *.fastq *.gz *.fq temp.txt *.xml "+fnameA+"*_db* 2> /dev/null",shell=True)
subprocess.call("rm H_and_O_and_specific_genes.fasta* *.sra *.bam *.sam *.fastq *.gz *.fq temp.txt "+fnameA+"*_db* 2> /dev/null",shell=True)
- print("Output_directory:"+make_dir+"\nInput files:\t"+for_fq+" "+rev_fq+"\n"+"O antigen prediction:\t"+O_choice+"\n"+"H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+"H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+"Predicted antigenic profile:\t"+predict_form+"\n"+"Predicted subspecies:\t"+subspecies+"\n"+"Predicted serotype(s):\t"+predict_sero+star+"\n"+contamination_report+star+star_line+claim+"\n")#+##
+ if "N/A" not in predict_sero:
+ #print("Output_directory:"+make_dir+"\nInput files:\t"+for_fq+" "+rev_fq+"\n"+"O antigen prediction:\t"+O_choice+"\n"+"H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+"H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+"Predicted antigenic profile:\t"+predict_form+"\n"+"Predicted subspecies:\t"+subspecies+"\n"+"Predicted serotype(s):\t"+predict_sero+star+"\nNote:"+contamination_report+star+star_line+claim+"\n")#+##
+ print("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+"\t".join(input_file)+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+
+ "H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_sero+"\n"+ # ed_SL_04152019: change serotype(s) to serotype
+ note+contamination_report+star_line+claim+"\n")#+##
+ else:
+ print("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+"\t".join(input_file)+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+fliC_choice+"\n"+
+ "H2 antigen prediction(fljB):\t"+fljB_choice+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_form+"\n"+ # ed_SL_09242019: add serotype output for "N/A" prediction
+ note+NA_note+contamination_report+star_line+claim+"\n")
else:
print("Allele modes only support raw reads datatype, i.e. '-t 1 or 2 or 3'; please use '-m k'")
elif analysis_mode=="k":
- ex_dir = os.path.dirname(os.path.realpath(__file__))
+ #ex_dir = os.path.dirname(os.path.realpath(__file__))
+ ex_dir = os.path.abspath(os.path.join(os.path.dirname(os.path.dirname(__file__)),'seqsero2_db')) # ed_SL_09152019: change ex_dir for packaging
#output_mode = args.mode
for_fq,rev_fq=get_input_files(make_dir,input_file,data_type,dirpath)
input_file = for_fq #-k will just use forward because not all reads were used
@@ -1219,20 +1396,73 @@ def main():
subspecies="II"
predict_form,predict_sero,star,star_line,claim = seqsero_from_formula_to_serotypes(
highest_O.split('-')[1], highest_fliC, highest_fljB, Special_dict,subspecies)
+ claim="" #no claim any more based on new output requirement
+ ## ed_SL_09272019: change for new output format
+ #if star_line+claim=="": #0413, new output style
+ # note=""
+ #else:
+ # note="Note:"
if clean_mode:
subprocess.check_call("rm -rf ../"+make_dir,shell=True)
make_dir="none-output-directory due to '-c' flag"
+ ### ed_SL_05282019, fix the assignment issue of variable 'O_choice' using "-m k -c"
+ if highest_O.split('-')[-1]=="":
+ O_choice="-"
+ else:
+ O_choice=highest_O.split('-')[-1]
+ ###
else:
if highest_O.split('-')[-1]=="":
O_choice="-"
else:
O_choice=highest_O.split('-')[-1]
#print("Output_directory:"+make_dir+"\tInput_file:"+input_file+"\tPredicted subpecies:"+subspecies + '\tPredicted antigenic profile:' + predict_form + '\tPredicted serotype(s):' + predict_sero)
- new_file=open("Seqsero_result.txt","w")
- new_file.write("Output_directory:"+make_dir+"\nInput files:\t"+input_file+"\n"+"O antigen prediction:\t"+O_choice+"\n"+"H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+"H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+"Predicted antigenic profile:\t"+predict_form+"\n"+"Predicted subspecies:\t"+subspecies+"\n"+"Predicted serotype(s):\t"+predict_sero+star+"\n"+star+star_line+claim+"\n")#+##
+ new_file=open("SeqSero_result.txt","w")
+ #new_file.write("Output_directory:"+make_dir+"\nInput files:\t"+input_file+"\n"+"O antigen prediction:\t"+O_choice+"\n"+"H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+"H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+"Predicted antigenic profile:\t"+predict_form+"\n"+"Predicted subspecies:\t"+subspecies+"\n"+"Predicted serotype(s):\t"+predict_sero+star+"\n"+star+star_line+claim+"\n")#+##
+ if "N/A" not in predict_sero:
+ new_file.write("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+input_file+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+
+ "H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_sero+"\n"+ # ed_SL_04152019: change serotype(s) to serotype
+ note+star_line+claim+"\n")#+##
+ else:
+ #star_line=star_line.strip()+"\tNone such antigenic formula in KW.\n"
+ star_line = "" #changed for new output requirement, 04132019
+ new_file.write("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+input_file+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+
+ "H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_form+"\n"+ # ed_SL_09242019: add serotype output for "N/A" prediction
+ note+NA_note+star_line+claim+"\n")#+##
new_file.close()
subprocess.call("rm *.fasta* *.fastq *.gz *.fq temp.txt *.sra 2> /dev/null",shell=True)
- print("Output_directory:"+make_dir+"\nInput files:\t"+input_file+"\n"+"O antigen prediction:\t"+O_choice+"\n"+"H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+"H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+"Predicted antigenic profile:\t"+predict_form+"\n"+"Predicted subspecies:\t"+subspecies+"\n"+"Predicted serotype(s):\t"+predict_sero+star+"\n"+star+star_line+claim+"\n")#+##
+ if "N/A" not in predict_sero:
+ print("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+input_file+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+
+ "H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_sero+"\n"+ # ed_SL_04152019: change serotype(s) to serotype
+ note+star_line+claim+"\n")#+##
+ else:
+ print("Output_directory:\t"+make_dir+"\n"+
+ "Input files:\t"+input_file+"\n"+
+ "O antigen prediction:\t"+O_choice+"\n"+
+ "H1 antigen prediction(fliC):\t"+highest_fliC+"\n"+
+ "H2 antigen prediction(fljB):\t"+highest_fljB+"\n"+
+ "Predicted subspecies:\t"+subspecies+"\n"+
+ "Predicted antigenic profile:\t"+predict_form+"\n"+
+ "Predicted serotype:\t"+predict_form+"\n"+ # ed_SL_09242019: add serotype output for "N/A" prediction
+ note+NA_note+star_line+claim+"\n")#+##
if __name__ == '__main__':
main()
=====================================
SeqSero2_update_kmer_database.py → bin/SeqSero2_update_kmer_database.py
=====================================
=====================================
deinterleave_fastq.sh → bin/deinterleave_fastq.sh
=====================================
=====================================
seqsero2_db/H_and_O_and_specific_genes.fasta
=====================================
The diff for this file was not included because it is too large.
=====================================
antigens.pickle → seqsero2_db/antigens.pickle
=====================================
=====================================
invA_mers_dict → seqsero2_db/invA_mers_dict
=====================================
=====================================
special.pickle → seqsero2_db/special.pickle
=====================================
=====================================
setup.py
=====================================
@@ -0,0 +1,30 @@
+import os, sys
+from distutils.core import setup
+from setuptools import find_packages
+
+def readme():
+ with open('README.md') as f:
+ return f.read()
+
+setup(name='SeqSero2',
+ version=open("version.py").readlines()[-1].split()[-1].strip("\"'"),
+ description='Salmonella serotyping',
+ long_description=readme(),
+ classifiers=[
+ 'Development Status :: 3 - Alpha',
+ 'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
+ 'Programming Language :: Python :: 3',
+ 'Topic :: Text Processing :: Linguistic',
+ ],
+ keywords='Salmonella serotyping bioinformatics WGS',
+ url='https://github.com/denglab/SeqSero2/',
+ author='Shaokang Zhang, Hendrik C Den-Bakker and Xiangyu Deng',
+ author_email='zskzsk at uga.edu, Hendrik.DenBakker at uga.edu, xdeng at uga.edu',
+ license='GPLv2',
+ scripts=["bin/deinterleave_fastq.sh","bin/Initial_Conditions.py","bin/SeqSero2_package.py","bin/SeqSero2_update_kmer_database.py"],
+ packages=[""],
+ include_package_data = True,
+ install_requires=['biopython==1.73'],
+ data_files=[("seqsero2_db",["seqsero2_db/antigens.pickle","seqsero2_db/H_and_O_and_specific_genes.fasta","seqsero2_db/invA_mers_dict","seqsero2_db/special.pickle"])],
+ zip_safe=False,
+)
=====================================
version.py
=====================================
@@ -0,0 +1 @@
+SeqSero2_version = '1.0.2'
View it on GitLab: https://salsa.debian.org/med-team/seqsero2/commit/906034167260e58401e0d9127fe66d787218ca8e
--
View it on GitLab: https://salsa.debian.org/med-team/seqsero2/commit/906034167260e58401e0d9127fe66d787218ca8e
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20191211/2aca4ac2/attachment-0001.html>
More information about the debian-med-commit
mailing list