[med-svn] [Git][med-team/kma][upstream] New upstream version 1.2.25
Steffen Möller
gitlab at salsa.debian.org
Thu May 14 18:14:15 BST 2020
Steffen Möller pushed to branch upstream at Debian Med / kma
Commits:
2f19ae42 by Steffen Moeller at 2020-05-14T17:49:12+02:00
New upstream version 1.2.25
- - - - -
14 changed files:
- README.md
- ef.c
- hashtable.c
- hashtable.h
- kma.c
- runinput.c
- runinput.h
- runkma.c
- runkma.h
- sparse.c
- sparse.h
- spltdb.c
- spltdb.h
- version.h
Changes:
=====================================
README.md
=====================================
@@ -1,159 +1,159 @@
-# Getting Started #
-
-```
-git clone https://bitbucket.org/genomicepidemiology/kma.git
-cd kma && make
-
-./kma index -i templates.fsa.gz -o templates
-./kma -i reads_se.fq.gz -o output/name -t_db templates
-./kma -ipe reads_1.fq.gz reads_2.fq.gz -o output/name -t_db templates
-```
-
-# Introduction #
-KMA is a mapping method designed to map raw reads directly against redundant databases, in an
-ultra-fast manner using seed and extend. KMA is particulary good at aligning
-high quality reads against highly redundant databases, where unique matches often does
-not exist. It works for long low quality reads as well, such as those from Nanopore.
-Non-unique matches are resolved using the "ConClave" sorting scheme, and a consensus sequence are outputtet
-in addition to other common attributes, based on our users demands.
-
-If you use KMA for your published research, then please cite:
-Philip T.L.C. Clausen, Frank M. Aarestrup & Ole Lund,
-"Rapid and precise alignment of raw reads against redundant databases with KMA",
-BMC Bioinformatics, 2018;19:307.
-
-
-# Usage #
-A more detailed guide on KMA and its options can be found in the pdf "KMAspecification.pdf".
-For practical reasons you might want to add kma to your path, this is usually done with:
-
-```
-mv kma ~/bin/
-```
-
-## Indexing ##
-In order to use KMA for mapping, the databases need to indexed.
-This is done with “kma index”, the most important options are described below:
-
-```
--i Input fasta file(s), space separated. By default kma index reads from stdin.
--o Output name, the name given to the database.
--k kmersize used for indexing the database.
--k_t kmersize used to identify template candidates when running KMA.
--k_i kmersize used when performing alignments between two sequences.
-```
-
-Example of use:
-
-```
-kma index -i templates.fsa.gz -o database/name
-```
-
-## Mapping ##
-The default settings of KMA is set so that it should work on most cases,
-but it is evident to use the right options for the right problems.
-Some of the most important options:
-
-```
--i Inputfile(s), default is read from stdin. All input options takes as many files as you wish in fastq or fasta format, space separated.
--ipe Inputfile(s), paired end. The file pairs should be placed right after each other.
--int Inputfile(s), interleaved.
--o Output destination.
--t_db Database from indexing.
--mem_mode *.index and *.seq are not loaded into memory, which enables one to map against larger databases. Templates are chosen using k-mer counting.
--dense Skip insertions when making the consensus sequence.
--ref_fsa “-” will be substituted with “n” in the consensus sequence.
--matrix Gives the counts all all called bases at each position in each mapped template. Columns are: reference base, A count, C count, G count, T count, N count, - count.
--mp Minimum phred-score.
--Mt1 Match to only one template in the database.
--ID Minimum identity to output template match.
--apm Paired end method, “p”: Reward if pairing the reads, “u”: unite best template matches in each read if possible, “f” force paired reads to pair.
--1t1 One read to one template, no splicing performed. Well suited for short reads and whole genome mapping.
--bc90 Basecalls should be significantly overrepresented, and have at least 90% agreement.
--bcNano Basecalls optimized for nanopore sequencing.
--mrs minimum alignment score normalized to alignment length.
-```
-
-Examples of running KMA:
-
-Short read mapping, one read maps only to one template:
-```
-kma –i singleEndReads.fq.gz –ipe pairedEnd_*.fq.gz –o output/name –t_db database/name -1t1
-```
-
-Long read mapping against a database of genes:
-```
-kma –i someLongReads.fq.gz –o output/name –t_db database/name
-```
-
-Whole genome mapping with nanopore reads:
-```
-kma –i nanoporeReads.fq.gz –o output/name –t_db database/name –mem_mode –mp 20 –mrs 0.0 –bcNano
-```
-
-Whole genome mapping against a single genome in the database (still nanopore), specified by template number (here “2”).
-This is the same as the line number of the wanted sequences in the \*.name file from the indexing.
-```
-kma –i nanoporeReads.fq.gz –o output/name –t_db database/name –mp 20 –mrs 0.0 –bcNano –Mt1 2
-```
-
-
-# Result Explanation #
-When the mapping is done KMA will produce the following files:
-
-1. \*.res A result overview giving the most common statistics for each mapped template.
-2. \*.fsa The consensus sequences drawn from the alignments.
-3. \*.aln The consensus alignment of the reads against their template.
-4. \*.frag.gz Mapping information on each mapped read, columns are: read, number of equally well mapping templates, mapping score, start position, end position (w.r.t. template), the choosen template.
-5. \*.mat.gz Base counts on each position in each template, (only if “-matrix” is enabled)
-
-# Shared memory #
-The databases of KMA can be put into shared memory, this enables you to align several
-samples at ones while only having the database loaded in one place.
-KMA_SHM can be used to put the databases into shared memory, and most importantly take them
-down afterwards. It is of the highest concern that these shared databases are used correctly,
-as they will take up a part of memory and they will exist there until they are taken down again,
-the computer is restarted or computer breaks down.
-
-Example of setting up a database in shared memory.
-```
-kma shm –t_db templates –shmLvl 1
-```
-
-“-shmLvl” specifies how much of the database there should be stored in shared memory, use.
-“-shm-h” to get an overview of the different levels.
-
-Example of taking it down again, always remember to do this then it is no longer needed:
-```
-kma shm –t_db database/name –shmLvl 1 –destroy
-```
-
-# Installation Requirements #
-In order to install KMA, you need to have a C-compiler and zlib development files installed.
-Zlib development files can be installed on unix systems with:
-```
-sudo apt-get install libz-dev
-```
-
-# Help #
-Usage and options are available with the "-h" option on all three programs.
-If in doubt, please mail any concerns or problems to: *plan at dtu.dk*.
-
-# Citation #
-1. Philip T.L.C. Clausen, Frank M. Aarestrup & Ole Lund, "Rapid and precise alignment of raw reads against redundant databases with KMA", BMC Bioinformatics, 2018;19:307.
-
-# License #
-Copyright (c) 2017, Philip Clausen, Technical University of Denmark
-All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+# Getting Started #
+
+```
+git clone https://bitbucket.org/genomicepidemiology/kma.git
+cd kma && make
+
+./kma index -i templates.fsa.gz -o templates
+./kma -i reads_se.fq.gz -o output/name -t_db templates
+./kma -ipe reads_1.fq.gz reads_2.fq.gz -o output/name -t_db templates
+```
+
+# Introduction #
+KMA is a mapping method designed to map raw reads directly against redundant databases, in an
+ultra-fast manner using seed and extend. KMA is particulary good at aligning
+high quality reads against highly redundant databases, where unique matches often does
+not exist. It works for long low quality reads as well, such as those from Nanopore.
+Non-unique matches are resolved using the "ConClave" sorting scheme, and a consensus sequence are outputtet
+in addition to other common attributes, based on our users demands.
+
+If you use KMA for your published research, then please cite:
+Philip T.L.C. Clausen, Frank M. Aarestrup & Ole Lund,
+"Rapid and precise alignment of raw reads against redundant databases with KMA",
+BMC Bioinformatics, 2018;19:307.
+
+
+# Usage #
+A more detailed guide on KMA and its options can be found in the pdf "KMAspecification.pdf".
+For practical reasons you might want to add kma to your path, this is usually done with:
+
+```
+mv kma ~/bin/
+```
+
+## Indexing ##
+In order to use KMA for mapping, the databases need to indexed.
+This is done with kma index, the most important options are described below:
+
+```
+-i Input fasta file(s), space separated. By default kma index reads from stdin.
+-o Output name, the name given to the database.
+-k kmersize used for indexing the database.
+-k_t kmersize used to identify template candidates when running KMA.
+-k_i kmersize used when performing alignments between two sequences.
+```
+
+Example of use:
+
+```
+kma index -i templates.fsa.gz -o database/name
+```
+
+## Mapping ##
+The default settings of KMA is set so that it should work on most cases,
+but it is evident to use the right options for the right problems.
+Some of the most important options:
+
+```
+-i Inputfile(s), default is read from stdin. All input options takes as many files as you wish in fastq or fasta format, space separated.
+-ipe Inputfile(s), paired end. The file pairs should be placed right after each other.
+-int Inputfile(s), interleaved.
+-o Output destination.
+-t_db Database from indexing.
+-mem_mode *.index and *.seq are not loaded into memory, which enables one to map against larger databases. Templates are chosen using k-mer counting.
+-dense Skip insertions when making the consensus sequence.
+-ref_fsa - will be substituted with n in the consensus sequence.
+-matrix Gives the counts all all called bases at each position in each mapped template. Columns are: reference base, A count, C count, G count, T count, N count, - count.
+-mp Minimum phred-score.
+-Mt1 Match to only one template in the database.
+-ID Minimum identity to output template match.
+-apm Paired end method, p: Reward if pairing the reads, u: unite best template matches in each read if possible, f force paired reads to pair.
+-1t1 One read to one template, no splicing performed. Well suited for short reads and whole genome mapping.
+-bc90 Basecalls should be significantly overrepresented, and have at least 90% agreement.
+-bcNano Basecalls optimized for nanopore sequencing.
+-mrs minimum alignment score normalized to alignment length.
+```
+
+Examples of running KMA:
+
+Short read mapping, one read maps only to one template:
+```
+kma -i singleEndReads.fq.gz -ipe pairedEnd_*.fq.gz -o output/name -t_db database/name -1t1
+```
+
+Long read mapping against a database of genes:
+```
+kma -i someLongReads.fq.gz -o output/name -t_db database/name -bcNano -bc 0.7
+```
+
+Whole genome mapping with nanopore reads:
+```
+kma -i nanoporeReads.fq.gz -o output/name -t_db database/name -mem_mode -mp 20 -mrs 0.0 -bcNano -bc 0.7
+```
+
+Whole genome mapping against a single genome in the database (still nanopore), specified by template number (here 2).
+This is the same as the line number of the wanted sequences in the \*.name file from the indexing.
+```
+kma -i nanoporeReads.fq.gz -o output/name -t_db database/name -mp 20 -bcNano -bc 0.7 -Mt1 2
+```
+
+
+# Result Explanation #
+When the mapping is done KMA will produce the following files:
+
+1. \*.res A result overview giving the most common statistics for each mapped template.
+2. \*.fsa The consensus sequences drawn from the alignments.
+3. \*.aln The consensus alignment of the reads against their template.
+4. \*.frag.gz Mapping information on each mapped read, columns are: read, number of equally well mapping templates, mapping score, start position, end position (w.r.t. template), the choosen template.
+5. \*.mat.gz Base counts on each position in each template, (only if -matrix is enabled)
+
+# Shared memory #
+The databases of KMA can be put into shared memory, this enables you to align several
+samples at ones while only having the database loaded in one place.
+KMA_SHM can be used to put the databases into shared memory, and most importantly take them
+down afterwards. It is of the highest concern that these shared databases are used correctly,
+as they will take up a part of memory and they will exist there until they are taken down again,
+the computer is restarted or computer breaks down.
+
+Example of setting up a database in shared memory.
+```
+kma shm -t_db database/name -shmLvl 1
+```
+
+-shmLvl specifies how much of the database there should be stored in shared memory, use.
+-shm-h to get an overview of the different levels.
+
+Example of taking it down again, always remember to do this then it is no longer needed:
+```
+kma shm -t_db database/name -shmLvl 1 -destroy
+```
+
+# Installation Requirements #
+In order to install KMA, you need to have a C-compiler and zlib development files installed.
+Zlib development files can be installed on unix systems with:
+```
+sudo apt-get install libz-dev
+```
+
+# Help #
+Usage and options are available with the "-h" option on all three programs.
+If in doubt, please mail any concerns or problems to: *plan at dtu.dk*.
+
+# Citation #
+1. Philip T.L.C. Clausen, Frank M. Aarestrup & Ole Lund, "Rapid and precise alignment of raw reads against redundant databases with KMA", BMC Bioinformatics, 2018;19:307.
+
+# License #
+Copyright (c) 2017, Philip Clausen, Technical University of Denmark
+All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
=====================================
ef.c
=====================================
@@ -51,6 +51,7 @@ void getExtendedFeatures(char *template_name, AssemInfo *matrix, long unsigned *
long double var, nucHighVar;
Assembly *assembly;
+ fragmentCountAln = (((readCountAln >> 1) + (readCountAln & 1)) <= fragmentCountAln) ? (fragmentCountAln) : ((readCountAln >> 1) + (readCountAln & 1));
if(matrix) {
/* iterate matrix to get:
Nuc_high_depth_variance
=====================================
hashtable.c
=====================================
@@ -34,7 +34,7 @@ int intpos_bin(const unsigned *str1, const int str2) {
}
downLim = 1;
- pos = (upLim + downLim) / 2;
+ pos = (upLim + downLim) >> 1;
while(0 < (upLim - downLim)) {
if(str1[pos] == str2) {
return pos;
@@ -43,7 +43,7 @@ int intpos_bin(const unsigned *str1, const int str2) {
} else {
upLim = pos - 1;
}
- pos = (upLim + downLim) / 2;
+ pos = (upLim + downLim) >> 1;
}
if(str1[pos] == str2) {
return pos;
@@ -51,7 +51,7 @@ int intpos_bin(const unsigned *str1, const int str2) {
return -1;
}
-HashTable * collect_Kmers(const HashMapKMA *templates, unsigned *Scores, unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits) {
+HashTable * collect_Kmers(const HashMapKMA *templates, unsigned *Scores, long unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits) {
int template, SU;
unsigned i, j, *value;
@@ -118,7 +118,7 @@ HashTable * collect_Kmers(const HashMapKMA *templates, unsigned *Scores, unsigne
return kmerList;
}
-HashTable ** collect_Kmers_deCon(const HashMapKMA *templates, unsigned *Scores, unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits, int contamination) {
+HashTable ** collect_Kmers_deCon(const HashMapKMA *templates, unsigned *Scores, long unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits, int contamination) {
int template, SU;
unsigned i, j, n, *value;
@@ -219,7 +219,7 @@ HashTable ** collect_Kmers_deCon(const HashMapKMA *templates, unsigned *Scores,
return Returner;
}
-HashTable * withDraw_Kmers(unsigned *Scores, unsigned *Scores_tot, HashTable *kmerList, int template, Hit *hits) {
+HashTable * withDraw_Kmers(unsigned *Scores, long unsigned *Scores_tot, HashTable *kmerList, int template, Hit *hits) {
unsigned i;
HashTable *node, *prev;
@@ -267,7 +267,7 @@ HashTable * withDraw_Kmers(unsigned *Scores, unsigned *Scores_tot, HashTable *km
return kmerList;
}
-Hit withDraw_Contamination(unsigned *Scores, unsigned *Scores_tot, HashTable *kmerList, HashTable *deConTable, int template, Hit hits) {
+Hit withDraw_Contamination(unsigned *Scores, long unsigned *Scores_tot, HashTable *kmerList, HashTable *deConTable, int template, Hit hits) {
unsigned i, belong;
HashTable *node, *prev;
=====================================
hashtable.h
=====================================
@@ -38,7 +38,7 @@ struct hit {
#endif
int intpos_bin(const unsigned *str1, const int str2);
-HashTable * collect_Kmers(const HashMapKMA *templates, unsigned *Scores, unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits);
-HashTable ** collect_Kmers_deCon(const HashMapKMA *templates, unsigned *Scores, unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits, int contamination);
-HashTable * withDraw_Kmers(unsigned *Scores, unsigned *Scores_tot, HashTable *kmerList, int template, Hit *hits);
-Hit withDraw_Contamination(unsigned *Scores, unsigned *Scores_tot, HashTable *kmerList, HashTable *deConTable, int template, Hit hits);
+HashTable * collect_Kmers(const HashMapKMA *templates, unsigned *Scores, long unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits);
+HashTable ** collect_Kmers_deCon(const HashMapKMA *templates, unsigned *Scores, long unsigned *Scores_tot, HashMap_kmers *foundKmers, Hit *hits, int contamination);
+HashTable * withDraw_Kmers(unsigned *Scores, long unsigned *Scores_tot, HashTable *kmerList, int template, Hit *hits);
+Hit withDraw_Contamination(unsigned *Scores, long unsigned *Scores_tot, HashTable *kmerList, HashTable *deConTable, int template, Hit hits);
=====================================
kma.c
=====================================
@@ -131,6 +131,7 @@ static void helpMessage(int exeStatus) {
fprintf(helpOut, "#\t-a\t\tPrint all best mappings\t\tFalse\n");
fprintf(helpOut, "#\t-mp\t\tMinimum phred score\t\t20\n");
fprintf(helpOut, "#\t-5p\t\tCut a constant number of\n#\t\t\tnucleotides from the 5 prime.\t0\n");
+ fprintf(helpOut, "#\t-3p\t\tCut a constant number of\n#\t\t\tnucleotides from the 3 prime.\t0\n");
fprintf(helpOut, "#\t-Sparse\t\tOnly count kmers\t\tFalse\n");
fprintf(helpOut, "#\t-Mt1\t\tMap only to \"num\" template.\t0 / False\n");
fprintf(helpOut, "#\t-ID\t\tMinimum ID\t\t\t1.0%%\n");
@@ -174,11 +175,12 @@ static void helpMessage(int exeStatus) {
int kma_main(int argc, char *argv[]) {
- static int minPhred, fiveClip, sparse_run, mem_mode, Mt1, ConClave, bcd;
- static int fileCounter, fileCounter_PE, fileCounter_INT, targetNum, vcf;
- static int extendedFeatures, spltDB, mq, thread_num, kmersize, one2one;
- static int ref_fsa, print_matrix, print_all, sam, Ts, Tv, minlen, **d;
- static unsigned nc, nf, shm, exhaustive;
+ static int minPhred, fiveClip, threeClip, ConClave, mem_mode, sparse_run;
+ static int fileCounter, fileCounter_PE, fileCounter_INT, Ts, Tv, minlen;
+ static int extendedFeatures, spltDB, thread_num, kmersize, targetNum, mq;
+ static int ref_fsa, print_matrix, print_all, sam, vcf, Mt1, bcd, one2one;
+ static int **d;
+ static unsigned nc, nf, shm, exhaustive, verbose;
static char *outputfilename, *templatefilename, **templatefilenames;
static char **inputfiles, **inputfiles_PE, **inputfiles_INT, ss;
static double ID_t, scoreT, coverT, evalue;
@@ -203,6 +205,7 @@ int kma_main(int argc, char *argv[]) {
/* SET DEFAULTS */
ConClave = 1;
+ verbose = 0;
vcf = 0;
sam = 0;
nc = 0;
@@ -213,6 +216,7 @@ int kma_main(int argc, char *argv[]) {
status = 0;
minPhred = 20;
fiveClip = 0;
+ threeClip = 0;
sparse_run = 0;
fileCounter = 0;
fileCounter_PE = 0;
@@ -508,6 +512,15 @@ int kma_main(int argc, char *argv[]) {
exit(4);
}
}
+ } else if(strcmp(argv[args], "-3p") == 0) {
+ ++args;
+ if(args < argc) {
+ threeClip = strtoul(argv[args], &exeBasic, 10);
+ if(*exeBasic != 0) {
+ fprintf(stderr, "Invalid argument at \"-3p\".\n");
+ exit(4);
+ }
+ }
} else if(strcmp(argv[args], "-dense") == 0) {
assembly_KMA_Ptr = &assemble_KMA_dense_threaded;
} else if(strcmp(argv[args], "-sasm") == 0) {
@@ -798,6 +811,8 @@ int kma_main(int argc, char *argv[]) {
spltDB = 1;
} else if(strcmp(argv[args], "-status") == 0) {
kmaPipe = &kmaPipeFork;
+ } else if(strcmp(argv[args], "-verbose") == 0) {
+ verbose = 1;
} else if(strcmp(argv[args], "-v") == 0) {
fprintf(stdout, "KMA-%s\n", KMA_VERSION);
exit(0);
@@ -1079,7 +1094,7 @@ int kma_main(int argc, char *argv[]) {
fprintf(stderr, "Interleaved information is not considered in Sparse mode.\n");
}
- run_input_sparse(templates, inputfiles, fileCounter, minPhred, fiveClip, kmersize, to2Bit, ioStream);
+ run_input_sparse(templates, inputfiles, fileCounter, minPhred, fiveClip, threeClip, kmersize, to2Bit, ioStream);
hashMapKMA_destroy(templates);
free(myTemplatefilename);
} else {
@@ -1101,17 +1116,17 @@ int kma_main(int argc, char *argv[]) {
/* SE */
if(fileCounter > 0) {
- totFrags += run_input(inputfiles, fileCounter, minPhred, fiveClip, minlen, to2Bit, ioStream);
+ totFrags += run_input(inputfiles, fileCounter, minPhred, fiveClip, threeClip, minlen, to2Bit, ioStream);
}
/* PE */
if(fileCounter_PE > 0) {
- totFrags += run_input_PE(inputfiles_PE, fileCounter_PE, minPhred, fiveClip, minlen, to2Bit, ioStream);
+ totFrags += run_input_PE(inputfiles_PE, fileCounter_PE, minPhred, fiveClip, threeClip, minlen, to2Bit, ioStream);
}
/* INT */
if(fileCounter_INT > 0) {
- totFrags += run_input_INT(inputfiles_INT, fileCounter_INT, minPhred, fiveClip, minlen, to2Bit, ioStream);
+ totFrags += run_input_INT(inputfiles_INT, fileCounter_INT, minPhred, fiveClip, threeClip, minlen, to2Bit, ioStream);
}
if(Mt1) {
@@ -1149,11 +1164,11 @@ int kma_main(int argc, char *argv[]) {
myTemplatefilename = smalloc(strlen(templatefilename) + 64);
strcpy(myTemplatefilename, templatefilename);
if(spltDB == 0 && targetNum != 1) {
- status = runKMA_spltDB(templatefilenames, targetNum, outputfilename, argc, argv, ConClave, kmersize, minlen, rewards, extendedFeatures, ID_t, mq, scoreT, evalue, bcd, ref_fsa, print_matrix, print_all, vcf, sam, nc, nf, shm, thread_num);
+ status = runKMA_spltDB(templatefilenames, targetNum, outputfilename, argc, argv, ConClave, kmersize, minlen, rewards, extendedFeatures, ID_t, mq, scoreT, evalue, bcd, ref_fsa, print_matrix, print_all, vcf, sam, nc, nf, shm, thread_num, verbose);
} else if(mem_mode) {
- status = runKMA_MEM(myTemplatefilename, outputfilename, exeBasic, ConClave, kmersize, minlen, rewards, extendedFeatures, ID_t, mq, scoreT, evalue, bcd, ref_fsa, print_matrix, print_all, vcf, sam, nc, nf, shm, thread_num);
+ status = runKMA_MEM(myTemplatefilename, outputfilename, exeBasic, ConClave, kmersize, minlen, rewards, extendedFeatures, ID_t, mq, scoreT, evalue, bcd, ref_fsa, print_matrix, print_all, vcf, sam, nc, nf, shm, thread_num, verbose);
} else {
- status = runKMA(myTemplatefilename, outputfilename, exeBasic, ConClave, kmersize, minlen, rewards, extendedFeatures, ID_t, mq, scoreT, evalue, bcd, ref_fsa, print_matrix, print_all, vcf, sam, nc, nf, shm, thread_num);
+ status = runKMA(myTemplatefilename, outputfilename, exeBasic, ConClave, kmersize, minlen, rewards, extendedFeatures, ID_t, mq, scoreT, evalue, bcd, ref_fsa, print_matrix, print_all, vcf, sam, nc, nf, shm, thread_num, verbose);
}
free(myTemplatefilename);
fprintf(stderr, "# Closing files\n");
=====================================
runinput.c
=====================================
@@ -28,7 +28,7 @@
void (*printFsa_ptr)(Qseqs*, Qseqs*, CompDNA*, FILE*) = &printFsa;
void (*printFsa_pair_ptr)(Qseqs*, Qseqs*, Qseqs*, Qseqs*, CompDNA*, FILE*) = &printFsa_pair;
-long unsigned run_input(char **inputfiles, int fileCount, int minPhred, int fiveClip, int minlen, char *trans, FILE *out) {
+long unsigned run_input(char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int minlen, char *trans, FILE *out) {
int fileCounter, phredCut, start, end;
unsigned FASTQ;
@@ -67,7 +67,8 @@ long unsigned run_input(char **inputfiles, int fileCount, int minPhred, int five
/* trim */
seq = qual->seq;
start = fiveClip;
- end = qseq->len - 1;
+ end = qseq->len - 1 - threeClip;
+ end = end < 0 ? 0 : end;
while(end >= 0 && seq[end] < phredCut) {
--end;
}
@@ -133,7 +134,7 @@ long unsigned run_input(char **inputfiles, int fileCount, int minPhred, int five
return count;
}
-long unsigned run_input_PE(char **inputfiles, int fileCount, int minPhred, int fiveClip, int minlen, char *trans, FILE *out) {
+long unsigned run_input_PE(char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int minlen, char *trans, FILE *out) {
int fileCounter, phredCut, start, start2, end;
unsigned FASTQ, FASTQ2;
@@ -192,7 +193,8 @@ long unsigned run_input_PE(char **inputfiles, int fileCount, int minPhred, int f
/* trim forward */
seq = qual->seq;
start = fiveClip;
- end = qseq->len - 1;
+ end = qseq->len - 1 - threeClip;
+ end = end < 0 ? 0 : end;
while(end >= 0 && seq[end] < phredCut) {
--end;
}
@@ -326,7 +328,7 @@ long unsigned run_input_PE(char **inputfiles, int fileCount, int minPhred, int f
return count;
}
-long unsigned run_input_INT(char **inputfiles, int fileCount, int minPhred, int fiveClip, int minlen, char *trans, FILE *out) {
+long unsigned run_input_INT(char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int minlen, char *trans, FILE *out) {
int fileCounter, phredCut, start, start2, end;
unsigned FASTQ;
@@ -372,7 +374,8 @@ long unsigned run_input_INT(char **inputfiles, int fileCount, int minPhred, int
/* trim forward */
seq = qual->seq;
start = fiveClip;
- end = qseq->len - 1;
+ end = qseq->len - 1 - threeClip;
+ end = end < 0 ? 0 : end;
while(end >= 0 && seq[end] < phredCut) {
--end;
}
=====================================
runinput.h
=====================================
@@ -23,9 +23,9 @@
/* pointers determining how to deliver the input */
extern void (*printFsa_ptr)(Qseqs*, Qseqs*, CompDNA*, FILE*);
extern void (*printFsa_pair_ptr)(Qseqs*, Qseqs*, Qseqs*, Qseqs*, CompDNA*, FILE*);
-long unsigned run_input(char **inputfiles, int fileCount, int minPhred, int fiveClip, int minlen, char *trans, FILE *out);
-long unsigned run_input_PE(char **inputfiles, int fileCount, int minPhred, int fiveClip, int minlen, char *trans, FILE *out);
-long unsigned run_input_INT(char **inputfiles, int fileCount, int minPhred, int fiveClip, int minlen, char *trans, FILE *out);
+long unsigned run_input(char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int minlen, char *trans, FILE *out);
+long unsigned run_input_PE(char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int minlen, char *trans, FILE *out);
+long unsigned run_input_INT(char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int minlen, char *trans, FILE *out);
void bootFsa(Qseqs *header, Qseqs *qseq, CompDNA *compressor, FILE *out);
void printFsa(Qseqs *header, Qseqs *qseq, CompDNA *compressor, FILE *out);
void printFsa_pair(Qseqs *header, Qseqs *qseq, Qseqs *header_r, Qseqs *qseq_r, CompDNA *compressor, FILE *out);
=====================================
runkma.c
=====================================
@@ -134,14 +134,14 @@ char * nameLoad(Qseqs *name, FILE *infile) {
return (char *) name->seq;
}
-int runKMA(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num) {
+int runKMA(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num, int verbose) {
int i, j, tmp_template, tmp_tmp_template, file_len, bestTemplate, tot;
int template, bestHits, t_len, start, end, aln_len, status, rand, sparse;
int fragCount, fileCount, maxFrag, coverScore, tmp_start, tmp_end, score;
- int index_in_no, seq_in_no, DB_size, flag, stats[5], *matched_templates;
+ int index_in_no, seq_in_no, DB_size, flag, counter, stats[5];
int *bestTemplates, *bestTemplates_r, *best_start_pos, *best_end_pos;
- int *template_lengths;
+ int *matched_templates, *template_lengths;
unsigned randScore, *fragmentCounts, *readCounts;
long read_score, best_read_score, *index_indexes, *seq_indexes;
long unsigned Nhits, template_tot_ulen, bestNum;
@@ -1164,14 +1164,22 @@ int runKMA(char *templatefilename, char *outputfilename, char *exePrev, int ConC
if(assembly_KMA_Ptr == &skip_assemble_KMA) {
alignLoadPtr = &alignLoad_skip;
}
+ if(verbose) {
+ fprintf(stderr, "# Template\tScore\tProgress\n");
+ }
/* Do local assemblies of fragments mapping to the same template */
depth = 0;
q_id = 0;
cover = 0;
q_cover = 0;
+ counter = 0;
for(template = 1; template < DB_size; ++template) {
if(w_scores[template] > 0) {
+ if(verbose) {
+ counter += w_scores[template];
+ fprintf(stderr, "# %d / %d\t%lu\t%3lu%%\n", template, DB_size, w_scores[template], 100 * counter / Nhits);
+ }
/* make p_value to see whether assembly is feasable */
read_score = w_scores[template];
t_len = template_lengths[template];
@@ -1293,7 +1301,7 @@ int runKMA(char *templatefilename, char *outputfilename, char *exePrev, int ConC
return status;
}
-int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num) {
+int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num, int verbose) {
/* runKMA_MEM is a memory saving version of runKMA,
at the cost it chooses best templates based on kmers
@@ -1456,6 +1464,7 @@ int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int
bestTemplates = (matched_templates + 1);
/* consider printPair */
+ Nhits = 0;
t_len = 0;
read_score = 0;
while((rc_flag = get_ankers(matched_templates, qseq_comp, header, &flag, inputfile)) != 0) {
@@ -1510,8 +1519,21 @@ int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int
updateAllFrag(qseq_r->seq, qseq_r->len, abs(bestHits), read_score, best_start_pos, best_end_pos, bestTemplates, header_r, frag_out_all);
}
}
+
+ /* verbose */
+ if(verbose && verbose++ == 1024) {
+ Nhits += verbose - 1;
+ fprintf(stderr, "# Scored %ld query sequences.\n", Nhits);
+ verbose = 1;
+ }
}
}
+ /* verbose */
+ if(verbose) {
+ Nhits += verbose - 1;
+ fprintf(stderr, "# Scored %ld query sequences.\n", Nhits);
+ verbose = 1;
+ }
kmaPipe(0, 0, inputfile, &i);
status |= i;
i = 0;
@@ -2250,17 +2272,21 @@ int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int
if(progress) {
fprintf(stderr, "# Progress:\t%3d%%\r", 0);
fflush(stderr);
+ } else if(verbose) {
+ fprintf(stderr, "# Template\tScore\tProgress\n");
}
if(assembly_KMA_Ptr == &skip_assemble_KMA) {
alignLoadPtr = &alignLoad_skip;
}
-
for(template = 1; template < DB_size; ++template) {
if(w_scores[template] > 0) {
if(progress) {
counter += w_scores[template];
fprintf(stderr, "# Progress:\t%3lu%%\r", 100 * counter / Nhits);
fflush(stderr);
+ } else if(verbose) {
+ counter += w_scores[template];
+ fprintf(stderr, "# %d / %d\t%lu\t%3lu%%\n", template, DB_size, w_scores[template], 100 * counter / Nhits);
}
/* make p_value to see whether assembly is feasable */
=====================================
runkma.h
=====================================
@@ -26,5 +26,5 @@
unsigned char * ustrdup(unsigned char *src, size_t n);
int load_DBs_KMA(char *templatefilename, long unsigned **alignment_scores, long unsigned **uniq_alignment_scores, int **template_lengths, unsigned shm);
char * nameLoad(Qseqs *name, FILE *infile);
-int runKMA(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num);
-int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num);
+int runKMA(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num, int verbose);
+int runKMA_MEM(char *templatefilename, char *outputfilename, char *exePrev, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num, int verbose);
=====================================
sparse.c
=====================================
@@ -226,7 +226,7 @@ void save_kmers_sparse(const HashMapKMA *templates, HashMap_kmers *foundKmers, C
}
}
-void run_input_sparse(const HashMapKMA *templates, char **inputfiles, int fileCount, int minPhred, int fiveClip, int kmersize, char *trans, FILE *out) {
+void run_input_sparse(const HashMapKMA *templates, char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int kmersize, char *trans, FILE *out) {
int FASTQ, fileCounter, phredCut, start, end;
char *filename;
@@ -263,7 +263,8 @@ void run_input_sparse(const HashMapKMA *templates, char **inputfiles, int fileCo
/* trim */
seq = qual->seq;
start = fiveClip;
- end = qseq->len - 1;
+ end = qseq->len - 1 - threeClip;
+ end = end < 0 ? 0 : end;
while(end >= 0 && seq[end] < phredCut) {
--end;
}
@@ -313,11 +314,11 @@ void run_input_sparse(const HashMapKMA *templates, char **inputfiles, int fileCo
int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *exePrev, int ID_t, double evalue, char ss, unsigned shm) {
- int i, file_len, stop, template, score, tmp_score, status, contamination;
- int score_add, score_tot_add, deCon;
- unsigned *Scores, *Scores_tot, *w_Scores, *w_Scores_tot, *SearchList;
+ int i, file_len, stop, template, status, contamination, score_add, deCon;
+ unsigned *Scores, *w_Scores, *SearchList;
unsigned *template_lengths, *template_ulengths;
- long unsigned Ntot;
+ long unsigned Ntot, score, tmp_score, score_tot_add;
+ long unsigned *Scores_tot, *w_Scores_tot;
char **template_names;
double expected, q_value, p_value, etta, depth, cover, query_cover;
double tot_depth, tot_cover, tot_query_cover, tmp_depth, tmp_cover;
@@ -359,7 +360,7 @@ int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *
} else {
if(hashMapKMA_load(templates, templatefile, templatefilename)) {
fprintf(stderr, "Wrong format of DB.\n");
- exit(2);
+ exit(1);
}
}
fclose(templatefile);
@@ -416,9 +417,9 @@ int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *
t0 = clock();
fprintf(stderr, "# Finding best matches and output results.\n");
- Scores = calloc(templates->DB_size, sizeof(int));
- Scores_tot = calloc(templates->DB_size, sizeof(int));
- SearchList = malloc(templates->DB_size * sizeof(int));
+ Scores = calloc(templates->DB_size, sizeof(unsigned));
+ Scores_tot = calloc(templates->DB_size, sizeof(long unsigned));
+ SearchList = malloc(templates->DB_size * sizeof(unsigned));
if(!Scores || !Scores_tot || !SearchList) {
ERROR();
}
@@ -439,8 +440,8 @@ int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *
fprintf(stderr, "# Total number of matches: %lu of %lu kmers\n", Nhits.tot, Ntot);
/* copy scores */
- w_Scores = smalloc(templates->DB_size * sizeof(int));
- w_Scores_tot = smalloc(templates->DB_size * sizeof(int));
+ w_Scores = smalloc(templates->DB_size * sizeof(unsigned));
+ w_Scores_tot = smalloc(templates->DB_size * sizeof(long unsigned));
for(i = 0; i < templates->DB_size; ++i) {
w_Scores[i] = Scores[i];
@@ -596,7 +597,7 @@ int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *
tot_query_cover = 100.0 * (Scores_tot[template] + score_tot_add) / Ntot;
/* output results */
- fprintf(sparse_out, "%s\t%d\t%d\t%d\t%d\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%f\t%e\n",
+ fprintf(sparse_out, "%s\t%d\t%lu\t%d\t%d\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%f\t%e\n",
template_names[template], template, score, (int) expected, template_lengths[template], query_cover, cover, depth, tot_query_cover, tot_cover, tot_depth, q_value, p_value);
/* update scores */
@@ -743,7 +744,7 @@ int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *
tot_cover = 100.0 * Scores[template] / template_ulengths[template];
tot_depth = 1.0 * Scores_tot[template] / template_lengths[template];
tot_query_cover = 100.0 * Scores_tot[template] / Ntot;
- fprintf(sparse_out, "%s\t%d\t%d\t%d\t%d\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%4.1e\n",
+ fprintf(sparse_out, "%s\t%d\t%lu\t%d\t%d\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%8.2f\t%4.1e\n",
template_names[template], template, score, (int) expected, template_lengths[template], query_cover, cover, depth, tot_query_cover, tot_cover, tot_depth, q_value, p_value);
/* update scores */
=====================================
sparse.h
=====================================
@@ -24,5 +24,5 @@
int translateToKmersAndDump(long unsigned *Kmers, int n, int max, unsigned char *qseq, int seqlen, int kmersize, long unsigned mask, long unsigned prefix, int prefix_len, FILE *out);
char ** load_DBs_Sparse(char *templatefilename, unsigned **template_lengths, unsigned **template_ulengths, unsigned shm);
void save_kmers_sparse(const HashMapKMA *templates, HashMap_kmers *foundKmers, CompKmers *compressor);
-void run_input_sparse(const HashMapKMA *templates, char **inputfiles, int fileCount, int minPhred, int fiveClip, int kmersize, char *trans, FILE *out);
+void run_input_sparse(const HashMapKMA *templates, char **inputfiles, int fileCount, int minPhred, int fiveClip, int threeClip, int kmersize, char *trans, FILE *out);
int save_kmers_sparse_batch(char *templatefilename, char *outputfilename, char *exePrev, int ID_t, double evalue, char ss, unsigned shm);
=====================================
spltdb.c
=====================================
@@ -394,7 +394,7 @@ unsigned get_ankers_spltDB(int *infoSize, int *out_Tem, CompDNA *qseq, Qseqs *he
return num;
}
-int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename, int argc, char **argv, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num) {
+int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename, int argc, char **argv, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num, int verbose) {
/* https://www.youtube.com/watch?v=LtXEMwSG5-8 */
@@ -605,6 +605,7 @@ int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename,
}
}
+ Nhits = 0;
target = 0;
targetScore = 0;
rc_flag = 0;
@@ -675,6 +676,13 @@ int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename,
updateAllFrag(qseq_r->seq, qseq_r->len, abs(bestHits), read_score, best_start_pos, best_end_pos, bestTemplates, header_r, frag_out_all);
}
}
+
+ /* verbose */
+ if(verbose && verbose++ == 1024) {
+ Nhits += verbose - 1;
+ fprintf(stderr, "# Scored %ld query sequences.\n", Nhits);
+ verbose = 1;
+ }
}
/* remove inferior read matches */
@@ -760,6 +768,12 @@ int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename,
++uPtr;
}
}
+ /* verbose */
+ if(verbose) {
+ Nhits += verbose - 1;
+ fprintf(stderr, "# Scored %ld query sequences.\n", Nhits);
+ verbose = 1;
+ }
/* get fragmentCount */
if(extendedFeatures) {
@@ -1490,6 +1504,8 @@ int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename,
if(progress) {
fprintf(stderr, "# Progress:\t%3d%%\r", 0);
fflush(stderr);
+ } else if(verbose) {
+ fprintf(stderr, "# Template\tScore\tProgress\n");
}
/* get index, seq and names on the fly */
@@ -1583,6 +1599,9 @@ int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename,
counter += w_scores[template];
fprintf(stderr, "# Progress:\t%3lu%%\r", 100 * counter / Nhits);
fflush(stderr);
+ } else if(verbose) {
+ counter += w_scores[template];
+ fprintf(stderr, "# %d / %d\t%lu\t%3lu%%\n", template, DB_size, w_scores[template], 100 * counter / Nhits);
}
/* make p_value to see whether assembly is feasable */
=====================================
spltdb.h
=====================================
@@ -35,4 +35,4 @@ struct spltDBbuff {
int print_ankers_spltDB(int *out_Tem, CompDNA *qseq, int rc_flag, const Qseqs *header, const int flag, FILE *out);
int print_ankers_Sparse_spltDB(int *out_Tem, CompDNA *qseq, int rc_flag, const Qseqs *header, const int flag, FILE *out);
unsigned get_ankers_spltDB(int *infoSize, int *out_Tem, CompDNA *qseq, Qseqs *header, int *flag, FILE *inputfile);
-int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename, int argc, char **argv, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num);
+int runKMA_spltDB(char **templatefilenames, int targetNum, char *outputfilename, int argc, char **argv, int ConClave, int kmersize, int minlen, Penalties *rewards, int extendedFeatures, double ID_t, int mq, double scoreT, double evalue, int bcd, int ref_fsa, int print_matrix, int print_all, int vcf, int sam, int nc, int nf, unsigned shm, int thread_num, int verbose);
=====================================
version.h
=====================================
@@ -17,4 +17,4 @@
* limitations under the License.
*/
-#define KMA_VERSION "1.2.22"
+#define KMA_VERSION "1.2.25"
View it on GitLab: https://salsa.debian.org/med-team/kma/-/commit/2f19ae42ee0599842c05907229fbcefcd490ae07
--
View it on GitLab: https://salsa.debian.org/med-team/kma/-/commit/2f19ae42ee0599842c05907229fbcefcd490ae07
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20200514/80017779/attachment-0001.html>
More information about the debian-med-commit
mailing list