[med-svn] [Git][med-team/last-align][upstream] New upstream version 1389
Andreas Tille (@tille)
gitlab at salsa.debian.org
Thu Jul 14 15:46:16 BST 2022
Andreas Tille pushed to branch upstream at Debian Med / last-align
Commits:
8e061678 by Andreas Tille at 2022-07-14T16:34:02+02:00
New upstream version 1389
- - - - -
22 changed files:
- README.rst
- bin/last-train
- bin/maf-convert
- doc/last-cookbook.rst
- doc/lastal.rst
- src/LastalArguments.cc
- src/LastalArguments.hh
- src/LastdbArguments.cc
- src/getoptUtil.hh
- src/lastal.cc
- src/makefile
- src/split/cbrc_split_aligner.cc
- src/split/cbrc_split_aligner.hh
- src/split/last-split-main.cc
- src/split/last-split.cc
- src/split/last-split.hh
- + src/split/last_split_options.cc
- + src/split/last_split_options.hh
- + src/split/mcf_last_splitter.cc
- + src/split/mcf_last_splitter.hh
- test/last-split-test.out
- test/maf-convert-test.out
Changes:
=====================================
README.rst
=====================================
@@ -7,8 +7,8 @@ proteomes). It's especially good at:
* Finding rearrangements and recombinations: we believe last-split_
does that more rigorously than anything else.
-* Finding DNA-versus-protein related regions, especially protein
- fossils (but introns are not considered).
+* Finding DNA-versus-protein related regions, especially protein_
+ fossils_.
* Unusual data, e.g. AT-rich DNA, because we can fit_ parameters to
the data and calculate significance_.
@@ -30,7 +30,9 @@ apply to older versions of LAST! You can see your version with::
Install
-------
-Please download the highest version number from
+You can install it from bioconda_ or `Debian Med`_, or like this...
+
+Download the highest version number from
https://gitlab.com/mcfrith/last/-/tags. Using the command line, go
into the downloaded directory and type::
@@ -63,10 +65,6 @@ up your computer to find them automatically. Some possible ways:
You might have to log out and back in before your computer recognizes
the new programs.
-**Alternative:** Install LAST from bioconda_. But it might not be the
-latest version: usually that doesn't matter, but it might work
-inferiorly or differently from this documentation.
-
Further info
------------
@@ -88,7 +86,10 @@ LAST is brought to you by:
.. _significance: doc/last-evalues.rst
.. _cookbook: doc/last-cookbook.rst
.. _LAST papers: doc/last-papers.rst
+.. _protein: https://doi.org/10.1109/TCBB.2022.3177855
+.. _fossils: https://doi.org/10.1093/molbev/msac068
.. _bioconda: https://bioconda.github.io/
+.. _Debian Med: https://www.debian.org/devel/debian-med/
.. _PATH variable: https://en.wikipedia.org/wiki/PATH_(variable)
.. _Computational Omics Research Team: https://www.airc.aist.go.jp/en/cort/
.. _AIRC: https://www.airc.aist.go.jp/en/
=====================================
bin/last-train
=====================================
@@ -734,6 +734,9 @@ def fixedLastalArgs(opts, lastalProgName):
if opts.verbose: x.append("-" + "v" * opts.verbose)
if opts.codon:
x.append("-K1")
+ else:
+ x.append("--split-n")
+ x.append("--split-m=0.01") # xxx ???
return x
def process(args, inStream):
@@ -745,14 +748,6 @@ def versionFromLastal():
proc = process(args, None)
return proc.stdout.read().split()[1]
-def lastSplitProcess(opts, proc):
- splitArgs = ["last-split", "-n", "-m0.01"] # xxx ???
- proc = process(splitArgs, proc.stdout)
- if opts.postmask:
- maskArgs = ["last-postmask"]
- proc = process(maskArgs, proc.stdout)
- return proc
-
def doTraining(opts, args):
tryToMakeChildProgramsFindable()
lastalProgName = readLastalProgName(args[0])
@@ -781,7 +776,8 @@ def doTraining(opts, args):
if opts.A: lastalArgs.append("-A" + opts.A)
if opts.B: lastalArgs.append("-B" + opts.B)
proc = process(lastalArgs + args, None)
- proc = lastSplitProcess(opts, proc)
+ if opts.postmask:
+ proc = process(["last-postmask"], proc.stdout)
if not opts.scale:
outerScale = scaleFromHeader(proc.stdout)
@@ -836,8 +832,8 @@ def doTraining(opts, args):
writeGapCosts(opts, delCosts, insCosts, True, proc.stdin)
writeScoreMatrixFunc(proc.stdin, matScores, "")
proc.stdin.close()
- if not opts.codon:
- proc = lastSplitProcess(opts, proc)
+ if opts.postmask and not opts.codon:
+ proc = process(["last-postmask"], proc.stdout)
matProbs, gapProbs = probsFromFile(opts, lastalArgs, maxGapGrowProb,
codonMatches, proc.stdout)
=====================================
bin/maf-convert
=====================================
@@ -194,8 +194,10 @@ def mafInput(opts, lines):
r = fields[1].upper()
for c, x in zip(matrixColumnNames, fields[2:]):
opts.scoreMatrix[r, c] = int(x)
- if opts.isKeepComments:
+ if opts.headerMode == 1:
print(line, end="")
+ if opts.headerMode == 2 and line.startswith("# LAST version "):
+ print("@PG\tID:lastal\tPN:lastal\tVN:" + fields[3])
if sLines: yield aLine, sLines, qLines, pLines
def isJoinable(opts, oldMaf, newMaf):
@@ -1168,7 +1170,7 @@ def mafConvert(opts, args):
formatName = args[0].lower()
fileNames = args[1:]
- opts.isKeepComments = False
+ opts.headerMode = 0
opts.bitScoreA = None
opts.bitScoreB = None
@@ -1179,8 +1181,9 @@ def mafConvert(opts, args):
writeHtmlHeader()
if isFormat(formatName, "sam"):
writeSamHeader(opts, fileNames)
+ opts.headerMode = 2
if isFormat(formatName, "tabular"):
- opts.isKeepComments = True
+ opts.headerMode = 1
if not fileNames:
fileNames = ["-"]
=====================================
doc/last-cookbook.rst
=====================================
@@ -150,7 +150,7 @@ align them to a genome like this::
lastdb -P8 -uRY4 mydb genome.fa
last-train -P8 -Q0 mydb reads.fastq > reads.train
- lastal -P8 -p reads.train mydb reads.fastq | last-split > out.maf
+ lastal -P8 --split -p reads.train mydb reads.fastq > out.maf
``-P8`` makes it faster by running 8 parallel threads, adjust as
appropriate for your computer. This has no effect on the results.
@@ -161,14 +161,14 @@ memory use, but also reduces sensitivity.
``-Q0`` makes it discard the fastq_ quality information (or you can
keep-but-ignore it with ``-Qkeep``).
-last-split_ cuts the output of ``lastal`` down to a unique best
-alignment for each part of each read. It gives each alignment a
-`mismap probability`_, which is high if that part of the read is
-almost equally similar to several parts of the genome.
+``--split`` cuts the output down to a unique best alignment for each
+part of each read. It gives each alignment a `mismap probability`_,
+which is high if that part of the read is almost equally similar to
+several parts of the genome.
Here we didn't suppress alignments caused by simple sequence (like
``cacacacacacacacacacacaca``), so as not to hide anything from
-last-split_. You can discard such alignments with last-postmask_
+``--split``. You can discard such alignments with last-postmask_
(though they may help to explain each part of a DNA read).
To make it more sensitive but slow, replace ``RY4`` with ``NEAR``:
@@ -213,10 +213,11 @@ Aligning Illumina DNA reads to a genome
lastdb -P8 -uNEAR mydb genome.fasta
last-train -P8 -Q1 mydb reads.fastq.gz > reads.train
- lastal -P8 -p reads.train mydb reads.fastq.gz | last-split | gzip > out.maf.gz
+ lastal -P8 --split -p reads.train mydb reads.fastq.gz | gzip > out.maf.gz
Most LAST commands accept ``.gz`` compressed files, and you can
-compress output with ``gzip`` as above.
+compress output with ``gzip`` as above. You can get faster but
+slightly worse compression with e.g. ``gzip -5``.
``-Q1`` makes it use the fastq_ quality information to improve the
training and alignment. LAST **assumes** that the qualities reflect
@@ -245,8 +246,8 @@ Aligning paired-end Illumina DNA reads to a genome
You can use last-split-pe_, or the older last-pair-probs_. The
difference is that ``last-split-pe`` allows different parts of one
read (i.e. one "end") to align to different parts of the genome, like
-``last-split``. (Or you could align the reads individually, ignoring
-the pair relationships.)
+``--split``. (Or you could align the reads individually, ignoring the
+pair relationships.)
Aligning potentially-spliced Illumina reads to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -261,7 +262,7 @@ is a slow-and-sensitive recipe::
lastdb -P8 -uNEAR humdb human_no_alt_analysis_set.fa
last-train -P8 --revsym -E0.05 -C2 humdb chimp.fa > humchi.train
- lastal -E0.05 -C2 -p humchi.train humdb chimp.fa | last-split -fMAF+ > humchi1.maf
+ lastal -E0.05 -C2 --split-f=MAF+ -p humchi.train humdb chimp.fa > humchi1.maf
``--revsym`` makes the substitution rates the same on both strands.
For example, it makes A→G equal T→C (because A→G on one strand means
@@ -273,9 +274,10 @@ asymmetric "heavy" and "light" strands).
expected to occur by chance at a rate ≤ 0.05 times per pair of random
sequences of length 1 billion each.
-``-fMAF+`` makes it show `per-base mismap probabilities`_: the
-probability that each query (chimp) base should be aligned to a
-different part of the reference (human).
+``--split-f=MAF+`` has the same effect as ``--split``, and also makes
+it show `per-base mismap probabilities`_: the probability that each
+query (chimp) base should be aligned to a different part of the
+reference (human).
The result so far is asymmetric: each part of the chimp genome is
aligned to at most one part of the human genome, but not vice-versa.
=====================================
doc/lastal.rst
=====================================
@@ -43,6 +43,9 @@ Steps in lastal
7) Redo the alignments to minimize column ambiguity, using either
gamma-centroid or LAMA (OFF by default).
+8) Cut the alignments down to a unique best alignment for each part of
+ each query sequence (OFF by default).
+
Options
-------
@@ -585,6 +588,32 @@ Miscellaneous options
PSSM is "constructed to the same scale" as the match/mismatch
scores (SF Altschul et al. 1997, NAR 25(17):3389-402).
+Split options
+~~~~~~~~~~~~~
+
+--split
+ Cut the alignments down to a unique best alignment for each part
+ of each query sequence. This is useful for DNA queries that cross
+ rearrangement breakpoints. It's the same as running last-split_
+ with default options.
+
+--splice
+ This is similar to ``--split``, but it favors alignments that
+ would be expected due to intron/exon splicing. So it favors
+ alignments where parts of a query sequence are separated by
+ intron-like gaps with GT-AG splice signals. This is the same as
+ running last-split_ with option ``-g``.
+
+--split-f, --split-m, --split-s, --split-n, --split-b
+ These are equivalent to last-split_ options ``-f``, ``-m``,
+ ``-s``, ``-n``, and ``-b``. If you use one of these, you don't
+ need to also specify ``--split``.
+
+--split-d, --split-c, --split-t, --split-M, --split-S
+ These are equivalent to last-split_ options ``-d``, ``-c``,
+ ``-t``, ``-M``, and ``-S``. They imply ``--splice``, so you don't
+ need to also specify ``--splice``.
+
Parallel processes and memory sharing
-------------------------------------
=====================================
src/LastalArguments.cc
=====================================
@@ -40,15 +40,6 @@ static void writeIntList(std::ostream &s, const std::vector<int> &v) {
}
}
-static int myGetopt( int argc, char** argv, const char* optstring ){
- if( optind < argc ){
- std::string nextarg = argv[optind];
- if( nextarg == "--help" ) return 'h';
- if( nextarg == "--version" ) return 'V';
- }
- return getopt( argc, argv, optstring );
-}
-
static char parseOutputFormat( const char* text ){
std::string s = text;
for( size_t i = 0; i < s.size(); ++i ){
@@ -115,7 +106,8 @@ LastalArguments::LastalArguments() :
temperature(-1), // depends on the score matrix
gamma(1),
geneticCodeFile("1"),
- verbosity(0){}
+ verbosity(0),
+ isSplit(false){}
void LastalArguments::fromArgs( int argc, char** argv, bool optionsOnly ){
programName = argv[0];
@@ -124,85 +116,118 @@ void LastalArguments::fromArgs( int argc, char** argv, bool optionsOnly ){
Find and align similar sequences.\n\
\n\
Cosmetic options:\n\
--h, --help: show all options and their default settings, and exit\n\
--V, --version: show version information, and exit\n\
--v: be verbose: write messages about what lastal is doing\n\
--f: output format: TAB, MAF, BlastTab, BlastTab+ (default=MAF)";
+ -h, --help show all options and their default settings, and exit\n\
+ -V, --version show version information, and exit\n\
+ -v be verbose: write messages about what lastal is doing\n\
+ -f output format: TAB, MAF, BlastTab, BlastTab+ (default: MAF)";
std::string help = usage + "\n\
\n\
E-value options (default settings):\n\
--D: query letters per random alignment ("
+ -D query letters per random alignment ("
+ stringify(queryLettersPerRandomAlignment) + ")\n\
--E: maximum expected alignments per square giga (1e+18/D/refSize/numOfStrands)\n\
+ -E maximum expected alignments per square giga (1e+18/D/refSize/numOfStrands)\n\
\n\
Score options (default settings):\n\
--r: match score (2 if -M, else 6 if 1<=Q<=4, else 1 if DNA)\n\
--q: mismatch cost (3 if -M, else 18 if 1<=Q<=4, else 1 if DNA)\n\
--p: match/mismatch score matrix (protein-protein: BL62, DNA-protein: BL80)\n\
--X: N/X is ambiguous in: 0=neither sequence, 1=reference, 2=query, 3=both ("
+ -r match score (2 if -M, else 6 if 1<=Q<=4, else 1 if DNA)\n\
+ -q mismatch cost (3 if -M, else 18 if 1<=Q<=4, else 1 if DNA)\n\
+ -p match/mismatch score matrix (protein-protein: BL62, DNA-protein: BL80)\n\
+ -X N/X is ambiguous in: 0=neither sequence, 1=reference, 2=query, 3=both ("
+ stringify(ambiguousLetterOpt) + ")\n\
--a: gap existence cost (DNA: 7, protein: 11, 1<=Q<=4: 21)\n\
--b: gap extension cost (DNA: 1, protein: 2, 1<=Q<=4: 9)\n\
--A: insertion existence cost (a)\n\
--B: insertion extension cost (b)\n\
--c: unaligned residue pair cost (off)\n\
--F: frameshift cost(s) (off)\n\
--x: maximum score drop for preliminary gapped alignments (z)\n\
--y: maximum score drop for gapless alignments (min[t*10, x])\n\
--z: maximum score drop for final gapped alignments (e-1)\n\
--d: minimum score for gapless alignments (min[e, 2500/n query letters per hit])\n\
--e: minimum score for gapped alignments\n\
+ -a gap existence cost (DNA: 7, protein: 11, 1<=Q<=4: 21)\n\
+ -b gap extension cost (DNA: 1, protein: 2, 1<=Q<=4: 9)\n\
+ -A insertion existence cost (a)\n\
+ -B insertion extension cost (b)\n\
+ -c unaligned residue pair cost (off)\n\
+ -F frameshift cost(s) (off)\n\
+ -x maximum score drop for preliminary gapped alignments (z)\n\
+ -y maximum score drop for gapless alignments (min[t*10, x])\n\
+ -z maximum score drop for final gapped alignments (e-1)\n\
+ -d minimum score for gapless alignments (min[e, 2500/n query letters per hit])\n\
+ -e minimum score for gapped alignments\n\
\n\
Initial-match options (default settings):\n\
--m: maximum initial matches per query position ("
+ -m maximum initial matches per query position ("
+ stringify(oneHitMultiplicity) + ")\n\
--l: minimum length for initial matches ("
+ -l minimum length for initial matches ("
+ stringify(minHitDepth) + ")\n\
--L: maximum length for initial matches (infinity)\n\
--k: use initial matches starting at every k-th position in each query ("
+ -L maximum length for initial matches (infinity)\n\
+ -k use initial matches starting at every k-th position in each query ("
+ stringify(queryStep) + ")\n\
--W: use \"minimum\" positions in sliding windows of W consecutive positions\n\
+ -W use \"minimum\" positions in sliding windows of W consecutive positions\n\
\n\
Miscellaneous options (default settings):\n\
--s: strand: 0=reverse, 1=forward, 2=both (2 for DNA, 1 for protein)\n\
--S: score matrix applies to forward strand of: 0=reference, 1=query ("
+ -s strand: 0=reverse, 1=forward, 2=both (2 for DNA, 1 for protein)\n\
+ -S score matrix applies to forward strand of: 0=reference, 1=query ("
+ stringify(isQueryStrandMatrix) + ")\n\
--K: omit alignments whose query range lies in >= K others with > score (off)\n\
--C: omit gapless alignments in >= C others with > score-per-length (off)\n\
--P: number of parallel threads ("
+ -K omit alignments whose query range lies in >= K others with > score (off)\n\
+ -C omit gapless alignments in >= C others with > score-per-length (off)\n\
+ -P number of parallel threads ("
+ stringify(numOfThreads) + ")\n\
--i: query batch size (64M if multi-volume, else off)\n\
--M: find minimum-difference alignments (faster but cruder)\n\
--T: type of alignment: 0=local, 1=overlap ("
+ -i query batch size (64M if multi-volume, else off)\n\
+ -M find minimum-difference alignments (faster but cruder)\n\
+ -T type of alignment: 0=local, 1=overlap ("
+ stringify(globality) + ")\n\
--n: maximum gapless alignments per query position (infinity if m=0, else m)\n\
--N: stop after the first N alignments per query strand\n\
--R: lowercase & simple-sequence options (the same as was used by lastdb)\n\
--u: mask lowercase during extensions: 0=never, 1=gapless,\n\
- 2=gapless+postmask, 3=always (2 if lastdb -c and Q!=pssm, else 0)\n\
--w: suppress repeats inside exact matches, offset by <= this distance ("
+ -n maximum gapless alignments per query position (infinity if m=0, else m)\n\
+ -N stop after the first N alignments per query strand\n\
+ -R lowercase & simple-sequence options (the same as was used by lastdb)\n\
+ -u mask lowercase during extensions: 0=never, 1=gapless,\n\
+ 2=gapless+postmask, 3=always (2 if lastdb -c and Q!=pssm, else 0)\n\
+ -w suppress repeats inside exact matches, offset by <= this distance ("
+ stringify(maxRepeatDistance) + ")\n\
--G: genetic code (" + geneticCodeFile + ")\n\
--t: 'temperature' for calculating probabilities (1/lambda)\n\
--g: 'gamma' parameter for gamma-centroid and LAMA ("
+ -G genetic code (" + geneticCodeFile + ")\n\
+ -t 'temperature' for calculating probabilities (1/lambda)\n\
+ -g 'gamma' parameter for gamma-centroid and LAMA ("
+ stringify(gamma) + ")\n\
--j: output type: 0=match counts, 1=gapless, 2=redundant gapped, 3=gapped,\n\
- 4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA,\n\
- 7=expected counts ("
+ -j output type: 0=match counts, 1=gapless, 2=redundant gapped, 3=gapped,\n\
+ 4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA,\n\
+ 7=expected counts ("
+ stringify(outputType) + ")\n\
--J: score type: 0=ordinary, 1=full (1 for new-style frameshifts, else 0)\n\
--Q: input format: fastx, keep, sanger, solexa, illumina, prb, pssm\n\
- (default=fasta)\n\
+ -J score type: 0=ordinary, 1=full (1 for new-style frameshifts, else 0)\n\
+ -Q input format: fastx, keep, sanger, solexa, illumina, prb, pssm\n\
+ (default: fasta)\n\
+\n\
+Split options:\n\
+ --split do split alignment\n\
+ --splice do spliced alignment\n\
+ --split-f=FMT " + LastSplitOptions::helpf + "\n\
+ --split-d=D " + LastSplitOptions::helpd + "\n\
+ --split-c=PROB " + LastSplitOptions::helpc + "\n\
+ --split-t=PROB " + LastSplitOptions::helpt + "\n\
+ --split-M=MEAN " + LastSplitOptions::helpM + "\n\
+ --split-S=SDEV " + LastSplitOptions::helpS + "\n\
+ --split-m=PROB " + LastSplitOptions::helpm + "\n\
+ --split-s=INT " + LastSplitOptions::helps + "\n\
+ --split-n " + LastSplitOptions::helpn + "\n\
+ --split-b=B " + LastSplitOptions::helpb + "\n\
";
- int c;
- const char optionString[] =
+ static const char sOpts[] =
"hVvf:"
"r:q:p:X:a:b:A:B:c:F:x:y:z:d:e:"
"D:E:"
"s:S:MT:m:l:L:n:N:C:K:k:W:i:P:R:u:w:t:g:G:j:J:Q:";
- while( (c = myGetopt(argc, argv, optionString)) != -1 ){
+
+ static struct option lOpts[] = {
+ { "help", no_argument, 0, 'h' },
+ { "version", no_argument, 0, 'V' },
+ { "split", no_argument, 0, 128 + 0 },
+ { "splice", no_argument, 0, 128 + 1 },
+ { "split-f", required_argument, 0, 128 + 'f' },
+ { "split-d", required_argument, 0, 128 + 'd' },
+ { "split-c", required_argument, 0, 128 + 'c' },
+ { "split-t", required_argument, 0, 128 + 't' },
+ { "split-M", required_argument, 0, 128 + 'M' },
+ { "split-S", required_argument, 0, 128 + 'S' },
+ { "split-m", required_argument, 0, 128 + 'm' },
+ { "split-s", required_argument, 0, 128 + 's' },
+ { "split-n", no_argument, 0, 128 + 'n' },
+ { "split-b", required_argument, 0, 128 + 'b' },
+ { 0, 0, 0, 0}
+ };
+
+ int c;
+ while ((c = getopt_long(argc, argv, sOpts, lOpts, &c)) != -1) {
switch(c){
case 'h':
std::cout << help;
@@ -373,9 +398,50 @@ Miscellaneous options (default settings):\n\
unstringify( inputFormat, optarg );
break;
+ case 128 + 1:
+ splitOpts.isSplicedAlignment = true;
+ break;
+ case 128 + 'f':
+ splitOpts.format = LastSplitOptions::parseOutputFormat(optarg);
+ break;
+ case 128 + 'd':
+ splitOpts.isSplicedAlignment = true;
+ unstringify(splitOpts.direction, optarg);
+ break;
+ case 128 + 'c':
+ splitOpts.isSplicedAlignment = true;
+ unstringify(splitOpts.cis, optarg);
+ break;
+ case 128 + 't':
+ splitOpts.isSplicedAlignment = true;
+ unstringify(splitOpts.trans, optarg);
+ break;
+ case 128 + 'M':
+ splitOpts.isSplicedAlignment = true;
+ unstringify(splitOpts.mean, optarg);
+ break;
+ case 128 + 'S':
+ splitOpts.isSplicedAlignment = true;
+ unstringify(splitOpts.sdev, optarg);
+ break;
+ case 128 + 'm':
+ unstringify(splitOpts.mismap, optarg);
+ break;
+ case 128 + 's':
+ unstringify(splitOpts.score, optarg);
+ break;
+ case 128 + 'n':
+ splitOpts.no_split = true;
+ break;
+ case 128 + 'b':
+ unstringifySize(splitOpts.bytes, optarg);
+ break;
+
case '?':
ERR( "bad option" );
}
+
+ if (c >= 128) isSplit = true;
}
if( maskLowercase == 1 && inputFormat == 5 )
@@ -416,6 +482,7 @@ Miscellaneous options (default settings):\n\
ERR( "please give me a database name and sequence file(s)\n\n" + usage );
lastdbName = argv[optind++];
inputStart = optind;
+ if (splitOpts.isSplicedAlignment) splitOpts.genome = lastdbName;
}
resetGetopt();
@@ -586,6 +653,17 @@ To proceed without E-values, set a score threshold with option -e.");
else maxDropGapless = int( 10.0 * temperature + 0.5 );
maxDropGapless = std::min( maxDropGapless, maxDropGapped );
}
+
+ if (isSplit) {
+ if (outputFormat != 'm')
+ ERR("can't do split alignment with non-MAF output");
+ if (isTranslated())
+ ERR("can't do split DNA-protein alignment");
+ if (gapPairCost > 0)
+ ERR("can't do split alignment with option -c");
+ if (verbosity > 1) splitOpts.verbose = true;
+ splitOpts.setUnspecifiedValues(minScoreGapped, temperature);
+ }
}
void LastalArguments::writeCommented( std::ostream& stream ) const{
=====================================
src/LastalArguments.hh
=====================================
@@ -6,6 +6,7 @@
#define LASTAL_ARGUMENTS_HH
#include "SequenceFormat.hh"
+#include "split/last_split_options.hh"
#include <limits.h>
#include <stddef.h>
@@ -120,6 +121,9 @@ struct LastalArguments{
std::string geneticCodeFile;
int verbosity;
+ bool isSplit;
+ LastSplitOptions splitOpts;
+
// positional arguments:
const char* programName;
std::string lastdbName;
=====================================
src/LastdbArguments.cc
=====================================
@@ -14,15 +14,6 @@ static void badopt( char opt, const char* arg ){
ERR( std::string("bad option value: -") + opt + ' ' + arg );
}
-static int myGetopt( int argc, char** argv, const char* optstring ){
- if( optind < argc ){
- std::string nextarg = argv[optind];
- if( nextarg == "--help" ) return 'h';
- if( nextarg == "--version" ) return 'V';
- }
- return getopt( argc, argv, optstring );
-}
-
using namespace cbrc;
LastdbArguments::LastdbArguments() :
@@ -56,45 +47,51 @@ Usage: " + std::string(programName) +
Prepare sequences for subsequent alignment with lastal.\n\
\n\
Main Options:\n\
--h, --help: show all options and their default settings, and exit\n\
--p: interpret the sequences as proteins\n\
--c: soft-mask lowercase letters (in reference *and* query sequences)\n\
--u: seeding scheme (default: YASS for DNA, else exact-match seeds)\n\
--P: number of parallel threads (default: " + stringify(numOfThreads) + ")";
+ -h, --help show all options and their default settings, and exit\n\
+ -p interpret the sequences as proteins\n\
+ -c soft-mask lowercase letters (in reference *and* query sequences)\n\
+ -u seeding scheme (default: YASS for DNA, else exact-match seeds)\n\
+ -P number of parallel threads (default: " + stringify(numOfThreads) + ")";
std::string help = usage + "\n\
\n\
Advanced Options (default settings):\n\
--R: lowercase & simple-sequence options (default: 03 for -q, else 01)\n\
--w: use initial matches starting at every w-th position in each sequence ("
+ -R lowercase & simple-sequence options (default: 03 for -q, else 01)\n\
+ -w use initial matches starting at every w-th position in each sequence ("
+ stringify(indexStep) + ")\n\
--W: use \"minimum\" positions in sliding windows of W consecutive positions ("
+ -W use \"minimum\" positions in sliding windows of W consecutive positions ("
+ stringify(minimizerWindow) + ")\n\
--S: strand: 0=reverse, 1=forward, 2=both ("
+ -S strand: 0=reverse, 1=forward, 2=both ("
+ stringify(strand) + ")\n\
--s: volume size (unlimited)\n\
--Q: input format: fastx, keep, sanger, solexa, illumina (default=fasta)\n\
--q: interpret the sequences as proteins and append */STOP\n\
--m: seed patterns (1=match, 0=anything, @=transition)\n\
--d: DNA seed patterns (N=match, n=anything, R=purine match, etc.)\n\
--a: user-defined alphabet\n\
--i: minimum limit on initial matches per query position ("
+ -s volume size (unlimited)\n\
+ -Q input format: fastx, keep, sanger, solexa, illumina (default=fasta)\n\
+ -q interpret the sequences as proteins and append */STOP\n\
+ -m seed patterns (1=match, 0=anything, @=transition)\n\
+ -d DNA seed patterns (N=match, n=anything, R=purine match, etc.)\n\
+ -a user-defined alphabet\n\
+ -i minimum limit on initial matches per query position ("
+ stringify(minSeedLimit) + ")\n\
--b: maximum length for buckets\n\
--B: use max bucket length with memory <= (memory for stored positions) / B ("
+ -b maximum length for buckets\n\
+ -B use max bucket length with memory <= (memory for stored positions) / B ("
+ stringify(minIndexedPositionsPerBucket) + ")\n\
--C: child table type: 0=none, 1=byte-size, 2=short-size, 3=full ("
+ -C child table type: 0=none, 1=byte-size, 2=short-size, 3=full ("
+ stringify(childTableType) + ")\n\
--x: just count sequences and letters\n\
--D: print all sequences in lastdb files\n\
--v: be verbose: write messages about what lastdb is doing\n\
--V, --version: show version information, and exit\n\
+ -x just count sequences and letters\n\
+ -D print all sequences in lastdb files\n\
+ -v be verbose: write messages about what lastdb is doing\n\
+ -V, --version show version information, and exit\n\
";
static const char sOpts[] = "hVpqR:cm:d:S:s:w:W:P:u:a:i:b:B:C:xDvQ:";
+ static struct option lOpts[] = {
+ { "help", no_argument, 0, 'h' },
+ { "version", no_argument, 0, 'V' },
+ { 0, 0, 0, 0}
+ };
+
int c;
- while ((c = myGetopt(argc, argv, sOpts)) != -1) {
+ while ((c = getopt_long(argc, argv, sOpts, lOpts, &c)) != -1) {
switch(c){
case 'h':
std::cout << help;
=====================================
src/getoptUtil.hh
=====================================
@@ -1,17 +1,13 @@
-// Copyright 2017 Martin C. Frith
+// Author: Martin C. Frith 2017
+// SPDX-License-Identifier: GPL-3.0-or-later
#ifndef GETOPT_UTIL_HH
#define GETOPT_UTIL_HH
-#include <unistd.h>
+#include <getopt.h>
-inline void resetGetopt() { // XXX fragile voodoo
-#ifdef __GLIBC__
- optind = 0;
-#else
- optind = 1;
- //optreset = 1; // XXX ???
-#endif
+inline void resetGetopt() {
+ optind = 1; // xxx ???
}
#endif
=====================================
src/lastal.cc
=====================================
@@ -24,6 +24,7 @@
#include "zio.hh"
#include "stringify.hh"
#include "threadUtil.hh"
+#include "split/mcf_last_splitter.hh"
#include <math.h>
#include <signal.h>
@@ -50,6 +51,7 @@ typedef unsigned long long countT;
struct LastAligner { // data that changes between queries
Aligners engines;
+ LastSplitter splitter;
std::vector<int> qualityPssm;
std::vector<AlignmentText> textAlns;
std::vector< std::vector<countT> > matchCounts; // used if outputType == 0
@@ -102,6 +104,7 @@ namespace {
Alphabet queryAlph; // for translated alignment
TantanMasker tantanMasker;
GeneticCode geneticCode;
+ SplitAlignerParams splitParams;
const unsigned maxNumOfIndexes = 16;
SubsetSuffixArray suffixArrays[maxNumOfIndexes];
DnaWordsFinder wordsFinder;
@@ -586,7 +589,7 @@ static void writeAlignment(LastAligner &aligner, const MultiSequence &qrySeqs,
alph, queryAlph,
translationType, geneticCode.getCodonToAmino(),
evaluer, args.outputFormat, extras);
- if (isCollatedAlignments() || aligners.size() > 1) {
+ if (isCollatedAlignments() || aligners.size() > 1 || args.isSplit) {
aligner.textAlns.push_back(a);
} else {
std::cout << a.text;
@@ -1119,6 +1122,35 @@ void translateAndScan(LastAligner &aligner, MultiSequence &qrySeqs,
}
}
+static void splitAlignments(LastSplitter &splitter,
+ std::vector<AlignmentText> &textAlns,
+ bool isQryQual) {
+ if (!args.isSplit) return;
+
+ unsigned linesPerMaf =
+ 3 + isQryQual + (args.outputType > 3) + (args.outputType > 6);
+
+ splitter.reserve(textAlns.size());
+ std::vector<char *> linePtrs((linesPerMaf + 1) * textAlns.size());
+
+ char **beg = linePtrs.empty() ? 0 : &linePtrs[0];
+ for (size_t i = 0; i < textAlns.size(); ++i) {
+ char **end = beg;
+ char *text = textAlns[i].text;
+ for (unsigned j = 0; j < linesPerMaf; ++j) {
+ *end++ = text;
+ text = strchr(text, '\n');
+ *text++ = 0;
+ }
+ *end = text;
+ splitter.addMaf(beg, end, false);
+ beg = end + 1;
+ }
+
+ splitter.split(args.splitOpts, splitParams, false);
+ clearAlignments(textAlns);
+}
+
static void alignOneQuery(LastAligner &aligner, MultiSequence &qrySeqs,
size_t qryNum, size_t chunkQryNum,
size_t finalCullingLimit, bool isFirstVolume) {
@@ -1167,15 +1199,17 @@ static void alignOneQuery(LastAligner &aligner, MultiSequence &qrySeqs,
translateAndScan(aligner, qrySeqs, qryData, chunkQryNum, finalCullingLimit,
args.isQueryStrandMatrix ? revMatrices : fwdMatrices);
- if (isCollatedAlignments() && numOfVolumes < 2) {
- sort(textAlns.begin() + oldNumOfAlns, textAlns.end());
+ if (numOfVolumes < 2) {
+ if (isCollatedAlignments()) {
+ sort(textAlns.begin() + oldNumOfAlns, textAlns.end());
+ }
+ splitAlignments(aligner.splitter, textAlns, qrySeqs.qualsPerLetter());
}
}
static size_t alignSomeQueries(size_t chunkNum, unsigned volume) {
size_t numOfChunks = aligners.size();
LastAligner &aligner = aligners[chunkNum];
- std::vector<AlignmentText> &textAlns = aligner.textAlns;
size_t beg = firstSequenceInChunk(qrySeqsGlobal, numOfChunks, chunkNum);
size_t end = firstSequenceInChunk(qrySeqsGlobal, numOfChunks, chunkNum + 1);
bool isMultiVolume = (numOfVolumes > 1);
@@ -1190,8 +1224,11 @@ static size_t alignSomeQueries(size_t chunkNum, unsigned volume) {
finalCullingLimit, isFirstVolume);
}
if (isMultiVolume && volume + 1 == numOfVolumes) {
+ std::vector<AlignmentText> &textAlns = aligner.textAlns;
cullFinalAlignments(textAlns, 0, args.cullingLimitForFinalAlignments);
sort(textAlns.begin(), textAlns.end());
+ splitAlignments(aligner.splitter, textAlns,
+ qrySeqsGlobal.qualsPerLetter());
}
return beg;
}
@@ -1215,6 +1252,10 @@ static void scanOneVolume(unsigned volume, unsigned numOfThreadsLeft) {
aligner.matchCounts.clear();
printAlignments(aligner.textAlns);
clearAlignments(aligner.textAlns);
+ if (!aligner.splitter.isOutputEmpty()) {
+ aligner.splitter.printOutput();
+ aligner.splitter.clearOutput();
+ }
}
}
@@ -1246,6 +1287,7 @@ static bool readSequenceData(MultiSequence &qrySeqs) {
static void runOneThread(unsigned threadNum) {
LastAligner &aligner = aligners[threadNum];
+ LastSplitter &splitter = aligner.splitter;
std::vector< std::vector<countT> > &matchCounts = aligner.matchCounts;
std::vector<AlignmentText> &textAlns = aligner.textAlns;
MultiSequence qrySeqs;
@@ -1260,7 +1302,13 @@ static void runOneThread(unsigned threadNum) {
alignOneQuery(aligner, qrySeqs, i, i,
args.cullingLimitForFinalAlignments, true);
}
- if (!textAlns.empty()) {
+ if (!splitter.isOutputEmpty()) {
+ {
+ std::lock_guard<std::mutex> lockGuard(outputMutex);
+ splitter.printOutput();
+ }
+ splitter.clearOutput();
+ } else if (!textAlns.empty()) {
{
std::lock_guard<std::mutex> lockGuard(outputMutex);
printAlignments(textAlns);
@@ -1419,6 +1467,12 @@ void writeHeader(countT numOfRefSeqs, countT refLetters, std::ostream &out) {
if( args.outputFormat == 'B' ) out << ", query length, subject length";
out << '\n';
}
+
+ if (args.isSplit) {
+ args.splitOpts.print();
+ splitParams.print();
+ std::cout << "#\n";
+ }
}
}
@@ -1521,6 +1575,15 @@ void lastal( int argc, char** argv ){
if (!isMultiVolume) args.minScoreGapless = minScoreGapless;
if (args.outputType > 0) makeQualityScorers();
+ if (args.isSplit) {
+ setLastSplitParams(splitParams, args.splitOpts, scoreMatrix.cells,
+ scoreMatrix.rowSymbols.c_str(),
+ scoreMatrix.colSymbols.c_str(),
+ args.delOpenCosts[0], args.delGrowCosts[0],
+ args.insOpenCosts[0], args.insGrowCosts[0],
+ args.temperature, refLetters, args.inputFormat);
+ }
+
if (numOfVolumes + 1 == 0) {
readIndex(args.lastdbName, numOfRefSeqs);
numOfVolumes = 1;
=====================================
src/makefile
=====================================
@@ -36,19 +36,20 @@ mcf_frameshift_xdrop_aligner.o mcf_gap_costs.o GeneticCode.o \
GreedyXdropAligner.o LastEvaluer.o OneQualityScoreMatrix.o \
QualityPssmMaker.o TwoQualityScoreMatrix.o gaplessXdrop.o \
gaplessPssmXdrop.o gaplessTwoQualityXdrop.o cbrc_linalg.o \
-mcf_substitution_matrix_stats.o $(alpObj)
+mcf_substitution_matrix_stats.o split/cbrc_unsplit_alignment.o \
+split/last_split_options.o $(alpObj)
alignObj4 = Alignment.o AlignmentPot.o AlignmentWrite.o Centroid.o \
DiagonalTable.o MultiSequence.o MultiSequenceQual.o SegmentPair.o \
SegmentPairPot.o SubsetSuffixArray.o SubsetSuffixArraySearch.o \
-lastal.o
+lastal.o split/cbrc_split_aligner.o split/mcf_last_splitter.o
splitObj0 = Alphabet.o LambdaCalculator.o fileMap.o cbrc_linalg.o \
mcf_substitution_matrix_stats.o split/cbrc_unsplit_alignment.o \
-split/last-split-main.o
+split/last_split_options.o split/last-split-main.o
splitObj4 = MultiSequence.o split/cbrc_split_aligner.o \
-split/last-split.o
+split/mcf_last_splitter.o split/last-split.o
PPOBJ = last-pair-probs.o last-pair-probs-main.o
@@ -143,7 +144,7 @@ ScoreMatrixData.hh: ../data/*.mat
../build/mat-inc.sh ../data/*.mat > $@
VERSION1 = git describe --dirty
-VERSION2 = echo ' (HEAD -> main, tag: 1356) ' | sed -e 's/.*tag: *//' -e 's/[,) ].*//'
+VERSION2 = echo ' (HEAD -> main, tag: 1389) ' | sed -e 's/.*tag: *//' -e 's/[,) ].*//'
VERSION = \"`test -e ../.git && $(VERSION1) || $(VERSION2)`\"
@@ -156,7 +157,7 @@ fix8 = sed 's/.*\.o/& &5 &8/'
depend:
sed '/[m][v]/q' makefile > m
- $(CXX) -MM -std=c++11 *.cc | $(fix8) >> m
+ $(CXX) -MM -I. -std=c++11 *.cc | $(fix8) >> m
$(CC) -MM *.c >> m
$(CXX) -MM alp/*.cpp | sed 's|.*:|alp/&|' >> m
$(CXX) -MM -I. split/*.cc | sed 's|.*:|split/&|' | $(fix8) >> m
@@ -226,20 +227,25 @@ GreedyXdropAligner.o GreedyXdropAligner.o5 GreedyXdropAligner.o8: GreedyXdropAli
LambdaCalculator.o LambdaCalculator.o5 LambdaCalculator.o8: LambdaCalculator.cc LambdaCalculator.hh \
cbrc_linalg.hh
LastalArguments.o LastalArguments.o5 LastalArguments.o8: LastalArguments.cc LastalArguments.hh \
- SequenceFormat.hh stringify.hh getoptUtil.hh version.hh
+ SequenceFormat.hh split/last_split_options.hh stringify.hh getoptUtil.hh \
+ version.hh
lastal.o lastal.o5 lastal.o8: lastal.cc last.hh Alphabet.hh CyclicSubsetSeed.hh \
MultiSequence.hh ScoreMatrixRow.hh VectorOrMmap.hh Mmap.hh fileMap.hh \
stringify.hh SequenceFormat.hh SubsetSuffixArray.hh dna_words_finder.hh \
- qualityScoreUtil.hh LastalArguments.hh QualityPssmMaker.hh \
- OneQualityScoreMatrix.hh mcf_substitution_matrix_stats.hh \
- TwoQualityScoreMatrix.hh LastEvaluer.hh mcf_frameshift_xdrop_aligner.hh \
- mcf_gap_costs.hh alp/sls_alignment_evaluer.hpp alp/sls_pvalues.hpp \
- alp/sls_basic.hpp GeneticCode.hh SubsetMinimizerFinder.hh \
- AlignmentPot.hh Alignment.hh Centroid.hh GappedXdropAligner.hh \
- mcf_contiguous_queue.hh mcf_reverse_queue.hh mcf_simd.hh SegmentPair.hh \
- GreedyXdropAligner.hh SegmentPairPot.hh ScoreMatrix.hh TantanMasker.hh \
- tantan.hh DiagonalTable.hh gaplessXdrop.hh gaplessPssmXdrop.hh \
- gaplessTwoQualityXdrop.hh zio.hh mcf_zstream.hh threadUtil.hh version.hh
+ qualityScoreUtil.hh LastalArguments.hh split/last_split_options.hh \
+ QualityPssmMaker.hh OneQualityScoreMatrix.hh \
+ mcf_substitution_matrix_stats.hh TwoQualityScoreMatrix.hh LastEvaluer.hh \
+ mcf_frameshift_xdrop_aligner.hh mcf_gap_costs.hh \
+ alp/sls_alignment_evaluer.hpp alp/sls_pvalues.hpp alp/sls_basic.hpp \
+ GeneticCode.hh SubsetMinimizerFinder.hh AlignmentPot.hh Alignment.hh \
+ Centroid.hh GappedXdropAligner.hh mcf_contiguous_queue.hh \
+ mcf_reverse_queue.hh mcf_simd.hh SegmentPair.hh GreedyXdropAligner.hh \
+ SegmentPairPot.hh ScoreMatrix.hh TantanMasker.hh tantan.hh \
+ DiagonalTable.hh gaplessXdrop.hh gaplessPssmXdrop.hh \
+ gaplessTwoQualityXdrop.hh zio.hh mcf_zstream.hh threadUtil.hh \
+ split/mcf_last_splitter.hh split/cbrc_split_aligner.hh \
+ split/cbrc_unsplit_alignment.hh split/cbrc_int_exponentiator.hh \
+ Alphabet.hh MultiSequence.hh split/last_split_options.hh version.hh
LastdbArguments.o LastdbArguments.o5 LastdbArguments.o8: LastdbArguments.cc LastdbArguments.hh \
SequenceFormat.hh stringify.hh getoptUtil.hh version.hh
lastdb.o lastdb.o5 lastdb.o8: lastdb.cc last.hh Alphabet.hh CyclicSubsetSeed.hh \
@@ -288,7 +294,7 @@ SubsetSuffixArraySearch.o SubsetSuffixArraySearch.o5 SubsetSuffixArraySearch.o8:
SubsetSuffixArraySort.o SubsetSuffixArraySort.o5 SubsetSuffixArraySort.o8: SubsetSuffixArraySort.cc SubsetSuffixArray.hh \
CyclicSubsetSeed.hh dna_words_finder.hh VectorOrMmap.hh Mmap.hh \
fileMap.hh stringify.hh
-tantan.o tantan.o5 tantan.o8: tantan.cc tantan.hh
+tantan.o tantan.o5 tantan.o8: tantan.cc tantan.hh mcf_simd.hh
TantanMasker.o TantanMasker.o5 TantanMasker.o8: TantanMasker.cc TantanMasker.hh ScoreMatrixRow.hh \
tantan.hh ScoreMatrix.hh mcf_substitution_matrix_stats.hh
TwoQualityScoreMatrix.o TwoQualityScoreMatrix.o5 TwoQualityScoreMatrix.o8: TwoQualityScoreMatrix.cc \
@@ -374,8 +380,16 @@ split/cbrc_split_aligner.o split/cbrc_split_aligner.o5 split/cbrc_split_aligner.
split/cbrc_unsplit_alignment.o split/cbrc_unsplit_alignment.o5 split/cbrc_unsplit_alignment.o8: split/cbrc_unsplit_alignment.cc \
split/cbrc_unsplit_alignment.hh
split/last-split.o split/last-split.o5 split/last-split.o8: split/last-split.cc split/last-split.hh \
+ split/last_split_options.hh split/mcf_last_splitter.hh \
split/cbrc_split_aligner.hh split/cbrc_unsplit_alignment.hh \
split/cbrc_int_exponentiator.hh Alphabet.hh MultiSequence.hh \
ScoreMatrixRow.hh VectorOrMmap.hh Mmap.hh fileMap.hh stringify.hh
split/last-split-main.o split/last-split-main.o5 split/last-split-main.o8: split/last-split-main.cc split/last-split.hh \
- stringify.hh version.hh
+ split/last_split_options.hh stringify.hh version.hh
+split/last_split_options.o split/last_split_options.o5 split/last_split_options.o8: split/last_split_options.cc \
+ split/last_split_options.hh
+split/mcf_last_splitter.o split/mcf_last_splitter.o5 split/mcf_last_splitter.o8: split/mcf_last_splitter.cc \
+ split/mcf_last_splitter.hh split/cbrc_split_aligner.hh \
+ split/cbrc_unsplit_alignment.hh split/cbrc_int_exponentiator.hh \
+ Alphabet.hh MultiSequence.hh ScoreMatrixRow.hh VectorOrMmap.hh Mmap.hh \
+ fileMap.hh stringify.hh split/last_split_options.hh
=====================================
src/split/cbrc_split_aligner.cc
=====================================
@@ -4,26 +4,26 @@
#include "cbrc_split_aligner.hh"
#include "mcf_substitution_matrix_stats.hh"
+#include <assert.h>
+#include <float.h>
+#include <limits.h>
+#include <string.h>
+
#include <algorithm>
-#include <cassert>
#include <cctype>
-#include <cstring> // strcmp
+#include <cmath>
#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
-#include <float.h>
-
static void err(const std::string& s) {
throw std::runtime_error(s);
}
namespace cbrc {
-template<typename T, int N> T arrayMax(T (&array)[N]) {
- return *std::max_element(array, array + N);
-}
+static int myMax(const int *b, int s) { return *std::max_element(b, b + s); }
// Orders candidate alignments by increasing DP start coordinate.
// Breaks ties by decreasing DP end coordinate.
@@ -143,10 +143,10 @@ void mergeInto(unsigned* beg1,
}
}
-int SplitAligner::score_mat[64][64][numQualCodes];
+int SplitAlignerParams::score_mat[64][64][numQualCodes];
// The score for a cis-splice with the given distance (i.e. intron length)
-int SplitAligner::calcSpliceScore(double dist) const {
+int SplitAlignerParams::calcSpliceScore(double dist) const {
double logDist = std::log(dist);
double d = logDist - meanLogDist;
double s = spliceTerm1 + spliceTerm2 * d * d - logDist;
@@ -193,37 +193,43 @@ static unsigned spliceEndSignalRev(const uchar *seqPtr,
return 15 - (n1 * 4 + n2); // reverse-complement
}
-unsigned SplitAligner::findScore(unsigned j, long score) const {
+unsigned SplitAligner::findScore(bool isGenome, unsigned j, long score) const {
for (unsigned i = 0; i < numAlns; ++i) {
if (dpBeg(i) >= j || dpEnd(i) < j) continue;
size_t ij = matrixRowOrigins[i] + j;
- if (Vmat[ij] + spliceBegScore(ij) == score) return i;
+ if (Vmat[ij] + spliceBegScore(isGenome, ij) == score) return i;
}
return numAlns;
}
-unsigned SplitAligner::findSpliceScore(unsigned i, unsigned j,
+unsigned SplitAligner::findSpliceScore(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j,
long score) const {
- assert(splicePrior > 0.0);
+ assert(params.splicePrior > 0.0);
+ const bool isGenome = params.isGenome();
size_t ij = matrixRowOrigins[i] + j;
unsigned iSeq = rnameAndStrandIds[i];
unsigned iEnd = spliceEndCoords[ij];
- int iScore = spliceEndScore(ij);
+ int iScore = spliceEndScore(isGenome, ij);
for (unsigned k = 0; k < numAlns; k++) {
if (rnameAndStrandIds[k] != iSeq) continue;
if (dpBeg(k) >= j || dpEnd(k) < j) continue;
size_t kj = matrixRowOrigins[k] + j;
unsigned kBeg = spliceBegCoords[kj];
if (iEnd <= kBeg) continue;
- int s = iScore + spliceBegScore(kj) + spliceScore(iEnd - kBeg);
+ int s = iScore + spliceBegScore(isGenome, kj) +
+ params.spliceScore(iEnd - kBeg);
if (Vmat[kj] + s == score) return k;
}
return numAlns;
}
-long SplitAligner::scoreFromSplice(unsigned i, unsigned j,
+long SplitAligner::scoreFromSplice(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j,
unsigned oldNumInplay,
unsigned& oldInplayPos) const {
+ const unsigned maxSpliceDist = params.maxSpliceDist;
+ const bool isGenome = params.isGenome();
size_t ij = matrixRowOrigins[i] + j;
long score = LONG_MIN;
unsigned iSeq = rnameAndStrandIds[i];
@@ -245,8 +251,8 @@ long SplitAligner::scoreFromSplice(unsigned i, unsigned j,
unsigned kBeg = spliceBegCoords[kj];
if (iEnd <= kBeg) continue;
if (iEnd - kBeg > maxSpliceDist) continue;
- score = std::max(score,
- Vmat[kj] + spliceBegScore(kj) + spliceScore(iEnd - kBeg));
+ score = std::max(score, Vmat[kj] + spliceBegScore(isGenome, kj) +
+ params.spliceScore(iEnd - kBeg));
}
return score;
@@ -320,7 +326,8 @@ void SplitAligner::updateInplayAlnIndicesB(unsigned& sortedAlnPos,
newNumInplay = (newEnd - newBeg) + (sortedAlnPos - sortedAlnOldPos);
}
-long SplitAligner::viterbiSplit() {
+long SplitAligner::viterbiSplit(const SplitAlignerParams ¶ms) {
+ const int restartScore = params.restartScore;
unsigned *inplayAlnBeg = &newInplayAlnIndices[0];
unsigned *inplayAlnEnd = inplayAlnBeg;
unsigned *sortedAlnPtr = &sortedAlnIndices[0];
@@ -358,7 +365,11 @@ long SplitAligner::viterbiSplit() {
return maxScore;
}
-long SplitAligner::viterbiSplice() {
+long SplitAligner::viterbiSplice(const SplitAlignerParams ¶ms) {
+ const int jumpScore = params.jumpScore;
+ const int restartScore = params.restartScore;
+ const double splicePrior = params.splicePrior;
+ const bool isGenome = params.isGenome();
unsigned sortedAlnPos = 0;
unsigned oldNumInplay = 0;
unsigned newNumInplay = 0;
@@ -381,15 +392,15 @@ long SplitAligner::viterbiSplice() {
long s = scoreFromJump;
if (splicePrior > 0.0)
- s = std::max(s,
- scoreFromSplice(i, j, oldNumInplay, oldInplayPos));
- s += spliceEndScore(ij);
+ s = std::max(s, scoreFromSplice(params, i, j,
+ oldNumInplay, oldInplayPos));
+ s += spliceEndScore(isGenome, ij);
s = std::max(s, Vmat[ij] + Smat[ij*2]);
if (alns[i].qstart == j && s < 0) s = 0;
s += Smat[ij*2+1];
Vmat[ij + 1] = s;
- sMax = std::max(sMax, s + spliceBegScore(ij + 1));
+ sMax = std::max(sMax, s + spliceBegScore(isGenome, ij + 1));
}
maxScore = std::max(sMax, maxScore);
scoreFromJump = std::max(sMax + jumpScore, maxScore + restartScore);
@@ -413,22 +424,24 @@ unsigned SplitAligner::findEndScore(long score) const {
return numAlns;
}
-void SplitAligner::traceBack(long viterbiScore,
+void SplitAligner::traceBack(const SplitAlignerParams ¶ms,
+ long viterbiScore,
std::vector<unsigned>& alnNums,
std::vector<unsigned>& queryBegs,
std::vector<unsigned>& queryEnds) const {
+ const bool isGenome = params.isGenome();
unsigned i, j;
- if (restartProb > 0) {
+ if (params.isSpliced()) {
+ i = findEndScore(viterbiScore);
+ assert(i < numAlns);
+ j = alns[i].qend;
+ } else {
j = maxEnd;
long t = cell(Vvec, j);
if (t == 0) return;
while (t == cell(Vvec, j-1)) --j;
- i = findScore(j, t);
- assert(i < numAlns);
- } else {
- i = findEndScore(viterbiScore);
+ i = findScore(isGenome, j, t);
assert(i < numAlns);
- j = alns[i].qend;
}
alnNums.push_back(i);
@@ -438,7 +451,7 @@ void SplitAligner::traceBack(long viterbiScore,
--j;
size_t ij = matrixRowOrigins[i] + j;
long score = Vmat[ij + 1] - Smat[ij*2+1];
- if (restartProb <= 0 && alns[i].qstart == j && score == 0) {
+ if (params.isSpliced() && alns[i].qstart == j && score == 0) {
queryBegs.push_back(j);
return;
}
@@ -452,18 +465,18 @@ void SplitAligner::traceBack(long viterbiScore,
bool isStay = (score == Vmat[ij] + Smat[ij*2]);
if (isStay && alns[i].isForwardStrand()) continue;
- long s = score - spliceEndScore(ij);
- long t = s - restartScore;
+ long s = score - spliceEndScore(isGenome, ij);
+ long t = s - params.restartScore;
if (t == cell(Vvec, j)) {
queryBegs.push_back(j);
if (t == 0) return;
while (t == cell(Vvec, j-1)) --j;
- i = findScore(j, t);
+ i = findScore(isGenome, j, t);
} else {
if (isStay) continue;
queryBegs.push_back(j);
- unsigned k = findScore(j, s - jumpScore);
- i = (k < numAlns) ? k : findSpliceScore(i, j, score);
+ unsigned k = findScore(isGenome, j, s - params.jumpScore);
+ i = (k < numAlns) ? k : findSpliceScore(params, i, j, score);
}
assert(i < numAlns);
alnNums.push_back(i);
@@ -483,9 +496,12 @@ int SplitAligner::segmentScore(unsigned alnNum,
return score;
}
-double SplitAligner::probFromSpliceF(unsigned i, unsigned j,
+double SplitAligner::probFromSpliceF(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j,
unsigned oldNumInplay,
unsigned& oldInplayPos) const {
+ const unsigned maxSpliceDist = params.maxSpliceDist;
+ const bool isGenome = params.isGenome();
size_t ij = matrixRowOrigins[i] + j;
double sum = 0.0;
unsigned iSeq = rnameAndStrandIds[i];
@@ -507,15 +523,19 @@ double SplitAligner::probFromSpliceF(unsigned i, unsigned j,
unsigned kBeg = spliceBegCoords[kj];
if (iEnd <= kBeg) continue;
if (iEnd - kBeg > maxSpliceDist) continue;
- sum += Fmat[kj] * spliceBegProb(kj) * spliceProb(iEnd - kBeg);
+ sum += Fmat[kj] * spliceBegProb(isGenome, kj) *
+ params.spliceProb(iEnd - kBeg);
}
return sum;
}
-double SplitAligner::probFromSpliceB(unsigned i, unsigned j,
+double SplitAligner::probFromSpliceB(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j,
unsigned oldNumInplay,
unsigned& oldInplayPos) const {
+ const unsigned maxSpliceDist = params.maxSpliceDist;
+ const bool isGenome = params.isGenome();
size_t ij = matrixRowOrigins[i] + j;
double sum = 0.0;
unsigned iSeq = rnameAndStrandIds[i];
@@ -537,13 +557,15 @@ double SplitAligner::probFromSpliceB(unsigned i, unsigned j,
unsigned kEnd = spliceEndCoords[kj];
if (kEnd <= iBeg) continue;
if (kEnd - iBeg > maxSpliceDist) continue;
- sum += Bmat[kj] * spliceEndProb(kj) * spliceProb(kEnd - iBeg);
+ sum += Bmat[kj] * spliceEndProb(isGenome, kj) *
+ params.spliceProb(kEnd - iBeg);
}
return sum;
}
-void SplitAligner::forwardSplit() {
+void SplitAligner::forwardSplit(const SplitAlignerParams ¶ms) {
+ const double restartProb = params.restartProb;
unsigned *inplayAlnBeg = &newInplayAlnIndices[0];
unsigned *inplayAlnEnd = inplayAlnBeg;
unsigned *sortedAlnPtr = &sortedAlnIndices[0];
@@ -585,7 +607,10 @@ void SplitAligner::forwardSplit() {
cell(rescales, maxEnd) = 1 / sumOfProbs; // makes scaled sumOfProbs equal 1
}
-void SplitAligner::forwardSplice() {
+void SplitAligner::forwardSplice(const SplitAlignerParams ¶ms) {
+ const double splicePrior = params.splicePrior;
+ const double jumpProb = params.jumpProb;
+ const bool isGenome = params.isGenome();
unsigned sortedAlnPos = 0;
unsigned oldNumInplay = 0;
unsigned newNumInplay = 0;
@@ -612,15 +637,15 @@ void SplitAligner::forwardSplice() {
double p = probFromJump;
if (splicePrior > 0.0)
- p += probFromSpliceF(i, j, oldNumInplay, oldInplayPos);
- p *= spliceEndProb(ij);
+ p += probFromSpliceF(params, i, j, oldNumInplay, oldInplayPos);
+ p *= spliceEndProb(isGenome, ij);
p += Fmat[ij] * Sexp[ij*2];
if (alns[i].qstart == j) p += begprob;
p = p * Sexp[ij*2+1] * rescale;
Fmat[ij + 1] = p;
if (alns[i].qend == j+1) zF += p;
- pSum += p * spliceBegProb(ij + 1);
+ pSum += p * spliceBegProb(isGenome, ij + 1);
rNew += p;
}
begprob *= rescale;
@@ -631,7 +656,8 @@ void SplitAligner::forwardSplice() {
cell(rescales, maxEnd) = 1 / zF; // this causes scaled zF to equal 1
}
-void SplitAligner::backwardSplit() {
+void SplitAligner::backwardSplit(const SplitAlignerParams ¶ms) {
+ const double restartProb = params.restartProb;
unsigned *inplayAlnBeg = &newInplayAlnIndices[0];
unsigned *inplayAlnEnd = inplayAlnBeg;
unsigned *sortedAlnPtr = &sortedAlnIndices[0];
@@ -667,7 +693,10 @@ void SplitAligner::backwardSplit() {
}
}
-void SplitAligner::backwardSplice() {
+void SplitAligner::backwardSplice(const SplitAlignerParams ¶ms) {
+ const double splicePrior = params.splicePrior;
+ const double jumpProb = params.jumpProb;
+ const bool isGenome = params.isGenome();
unsigned sortedAlnPos = 0;
unsigned oldNumInplay = 0;
unsigned newNumInplay = 0;
@@ -692,8 +721,8 @@ void SplitAligner::backwardSplice() {
double p = probFromJump;
if (splicePrior > 0.0)
- p += probFromSpliceB(i, j, oldNumInplay, oldInplayPos);
- p *= spliceBegProb(ij);
+ p += probFromSpliceB(params, i, j, oldNumInplay, oldInplayPos);
+ p *= spliceBegProb(isGenome, ij);
p += Bmat[ij] * Sexp[ij*2];
if (alns[i].qend == j) p += endprob;
p = p * Sexp[ij*2-1] * rescale;
@@ -705,7 +734,7 @@ void SplitAligner::backwardSplice() {
Bmat[ij - 1] = p;
//if (alns[i].qstart == j-1) zB += p;
- pSum += p * spliceEndProb(ij - 1);
+ pSum += p * spliceEndProb(isGenome, ij - 1);
}
endprob *= rescale;
probFromJump = pSum * jumpProb;
@@ -737,15 +766,21 @@ SplitAligner::marginalProbs(unsigned queryBeg, unsigned alnNum,
// The next routine represents affine gap scores in a cunning way.
// Aij holds scores at query bases, and at every base that is aligned
-// to a gap it gets a score of insExistenceScore + insExtensionScore.
-// Dij holds scores between query bases, and between every pair of
-// bases that are both aligned to gaps it gets a score of
-// -insExistenceScore. This produces suitable affine gap scores, even
-// if we jump from one alignment to another in the middle of a gap.
-
-void SplitAligner::calcBaseScores(unsigned i) {
- const int firstInsScore = insExistenceScore + insExtensionScore;
- const int tweenInsScore = -insExistenceScore;
+// to a gap it gets a score of insOpenScore + insGrowScore. Dij holds
+// scores between query bases, and between every pair of bases that
+// are both aligned to gaps it gets a score of -insOpenScore. This
+// produces suitable affine gap scores, even if we jump from one
+// alignment to another in the middle of a gap.
+
+void SplitAligner::calcBaseScores(const SplitAlignerParams ¶ms,
+ unsigned i) {
+ const int qualityOffset = params.qualityOffset;
+ const int delOpenScore = params.delOpenScore;
+ const int delGrowScore = params.delGrowScore;
+ const int insOpenScore = params.insOpenScore;
+ const int insGrowScore = params.insGrowScore;
+ const int firstInsScore = insOpenScore + insGrowScore;
+ const int tweenInsScore = -insOpenScore;
const UnsplitAlignment& a = alns[i];
const size_t origin = matrixRowOrigins[i];
@@ -772,21 +807,21 @@ void SplitAligner::calcBaseScores(unsigned i) {
while (*qAlign) {
unsigned char x = *rAlign++;
unsigned char y = *qAlign++;
- int q = qQual ? (*qQual++ - qualityOffset) : (numQualCodes - 1);
+ int q = qQual ? (*qQual++ - qualityOffset) : (params.numQualCodes - 1);
if (x == '-') { // gap in reference sequence: insertion
*matBeg++ = delScore + insCompensationScore;
*matBeg++ = firstInsScore;
delScore = 0;
insCompensationScore = tweenInsScore;
} else if (y == '-') { // gap in query sequence: deletion
- if (delScore == 0) delScore = gapExistenceScore;
- delScore += gapExtensionScore;
+ if (delScore == 0) delScore = delOpenScore;
+ delScore += delGrowScore;
insCompensationScore = 0;
} else {
assert(q >= 0);
- if (q >= numQualCodes) q = numQualCodes - 1;
+ if (q >= params.numQualCodes) q = params.numQualCodes - 1;
*matBeg++ = delScore;
- *matBeg++ = score_mat[x % 64][y % 64][q];
+ *matBeg++ = params.score_mat[x % 64][y % 64][q];
delScore = 0;
insCompensationScore = 0;
}
@@ -848,17 +883,25 @@ static const uchar *seqEnd(const MultiSequence &m, size_t sequenceIndex) {
return m.seqReader() + m.seqEnd(sequenceIndex);
}
-void SplitAligner::initSpliceSignals(unsigned i) {
- const uchar *toUnmasked = alphabet.numbersToUppercase;
- const UnsplitAlignment& a = alns[i];
-
- StringNumMap::const_iterator f = chromosomeIndex.find(a.rname);
+void SplitAlignerParams::seqEnds(const uchar *&beg, const uchar *&end,
+ const char *seqName) const {
+ StringNumMap::const_iterator f = chromosomeIndex.find(seqName);
if (f == chromosomeIndex.end())
- err("can't find " + std::string(a.rname) + " in the genome");
+ err("can't find " + std::string(seqName) + " in the genome");
size_t v = f->second % maxGenomeVolumes();
size_t c = f->second / maxGenomeVolumes();
- const uchar *chromBeg = seqBeg(genome[v], c);
- const uchar *chromEnd = seqEnd(genome[v], c);
+ beg = seqBeg(genome[v], c);
+ end = seqEnd(genome[v], c);
+}
+
+void SplitAligner::initSpliceSignals(const SplitAlignerParams ¶ms,
+ unsigned i) {
+ const uchar *toUnmasked = params.alphabet.numbersToUppercase;
+ const UnsplitAlignment &a = alns[i];
+
+ const uchar *chromBeg;
+ const uchar *chromEnd;
+ params.seqEnds(chromBeg, chromEnd, a.rname);
if (a.rend > chromEnd - chromBeg)
err("alignment beyond the end of " + std::string(a.rname));
@@ -912,32 +955,28 @@ static void decodeSpliceSignal(char *out,
}
}
-void SplitAligner::spliceBegSignal(char *out,
- unsigned alnNum, unsigned queryPos,
- bool isSenseStrand) const {
- const UnsplitAlignment& a = alns[alnNum];
- bool isForwardStrand = a.isForwardStrand();
- StringNumMap::const_iterator f = chromosomeIndex.find(a.rname);
+void SplitAlignerParams::spliceBegSignal(char *out, const char *seqName,
+ bool isForwardStrand,
+ bool isSenseStrand,
+ unsigned coord) const {
+ StringNumMap::const_iterator f = chromosomeIndex.find(seqName);
size_t v = f->second % maxGenomeVolumes();
size_t c = f->second / maxGenomeVolumes();
uchar signal[2];
- unsigned coord = cell(spliceBegCoords, alnNum, queryPos);
if (isForwardStrand) getNextSignal(signal, seqBeg(genome[v], c) + coord);
else getPrevSignal(signal, seqEnd(genome[v], c) - coord);
decodeSpliceSignal(out, signal, alphabet.decode, alphabet.complement,
isSenseStrand == isForwardStrand);
}
-void SplitAligner::spliceEndSignal(char *out,
- unsigned alnNum, unsigned queryPos,
- bool isSenseStrand) const {
- const UnsplitAlignment& a = alns[alnNum];
- bool isForwardStrand = a.isForwardStrand();
- StringNumMap::const_iterator f = chromosomeIndex.find(a.rname);
+void SplitAlignerParams::spliceEndSignal(char *out, const char *seqName,
+ bool isForwardStrand,
+ bool isSenseStrand,
+ unsigned coord) const {
+ StringNumMap::const_iterator f = chromosomeIndex.find(seqName);
size_t v = f->second % maxGenomeVolumes();
size_t c = f->second / maxGenomeVolumes();
uchar signal[2];
- unsigned coord = cell(spliceEndCoords, alnNum, queryPos);
if (isForwardStrand) getPrevSignal(signal, seqBeg(genome[v], c) + coord);
else getNextSignal(signal, seqEnd(genome[v], c) - coord);
decodeSpliceSignal(out, signal, alphabet.decode, alphabet.complement,
@@ -950,7 +989,7 @@ struct RnameAndStrandLess {
bool operator()(unsigned a, unsigned b) const {
return
alns[a].qstrand != alns[b].qstrand ? alns[a].qstrand < alns[b].qstrand :
- std::strcmp(alns[a].rname, alns[b].rname) < 0;
+ strcmp(alns[a].rname, alns[b].rname) < 0;
}
const UnsplitAlignment *alns;
@@ -958,7 +997,7 @@ struct RnameAndStrandLess {
void SplitAligner::initRnameAndStrandIds() {
rnameAndStrandIds.resize(numAlns);
- RnameAndStrandLess less(&alns[0]);
+ RnameAndStrandLess less(alns);
stable_sort(sortedAlnIndices.begin(), sortedAlnIndices.end(), less);
unsigned c = 0;
for (unsigned i = 0; i < numAlns; ++i) {
@@ -968,20 +1007,22 @@ void SplitAligner::initRnameAndStrandIds() {
}
}
-void SplitAligner::dpExtensionMinScores(int maxJumpScore,
- size_t& minScore1,
- size_t& minScore2) const {
- if (!chromosomeIndex.empty()) maxJumpScore += maxSpliceBegEndScore;
- assert(maxJumpScore + insExistenceScore <= 0);
- minScore1 = 1 - (maxJumpScore + insExistenceScore);
- minScore2 = 1 - (maxJumpScore + maxJumpScore + insExistenceScore);
+void SplitAlignerParams::dpExtensionMinScores(size_t &minScore1,
+ size_t &minScore2) const {
+ if (jumpProb > 0 || splicePrior > 0) {
+ int maxJumpScore = (splicePrior > 0) ? maxSpliceScore : jumpScore;
+ if (isGenome()) maxJumpScore += maxSpliceBegEndScore;
+ assert(maxJumpScore + insOpenScore <= 0);
+ minScore1 = 1 - (maxJumpScore + insOpenScore);
+ minScore2 = 1 - (maxJumpScore + maxJumpScore + insOpenScore);
+ }
}
static size_t dpExtension(size_t maxScore, size_t minScore, size_t divisor) {
return (maxScore > minScore) ? (maxScore - minScore) / divisor : 0;
}
-void SplitAligner::initDpBounds() {
+void SplitAligner::initDpBounds(const SplitAlignerParams ¶ms) {
minBeg = -1;
for (unsigned i = 0; i < numAlns; ++i)
minBeg = std::min(minBeg, alns[i].qstart);
@@ -1005,20 +1046,18 @@ void SplitAligner::initDpBounds() {
// length * maxMatchScore
// An extension of length x must have a (negative) score <=
- // maxJumpScore + insExistenceScore + insExtensionScore * x
+ // maxJumpScore + insOpenScore + insGrowScore * x
- assert(insExtensionScore < 0);
+ int maxMatchScore = params.maxMatchScore;
+ assert(params.insGrowScore < 0);
assert(maxMatchScore >= 0);
- size_t oldDiv = -insExtensionScore;
- size_t newDiv = maxMatchScore - insExtensionScore;
+ size_t oldDiv = -params.insGrowScore;
+ size_t newDiv = maxMatchScore - params.insGrowScore;
size_t minScore1 = -1;
size_t minScore2 = -1;
- if (jumpProb > 0.0 || splicePrior > 0.0) {
- int m = (splicePrior > 0.0) ? maxSpliceScore : jumpScore;
- dpExtensionMinScores(m, minScore1, minScore2);
- }
+ params.dpExtensionMinScores(minScore1, minScore2);
for (unsigned i = 0; i < numAlns; ++i) {
size_t b = alns[i].qstart;
@@ -1047,8 +1086,9 @@ void SplitAligner::initDpBounds() {
}
}
-void SplitAligner::layout(std::vector<UnsplitAlignment>::const_iterator beg,
- std::vector<UnsplitAlignment>::const_iterator end) {
+void SplitAligner::layout(const SplitAlignerParams ¶ms,
+ const UnsplitAlignment *beg,
+ const UnsplitAlignment *end) {
assert(end > beg);
numAlns = end - beg;
alns = beg;
@@ -1057,30 +1097,31 @@ void SplitAligner::layout(std::vector<UnsplitAlignment>::const_iterator beg,
for (unsigned i = 0; i < numAlns; ++i) sortedAlnIndices[i] = i;
newInplayAlnIndices.resize(numAlns);
- if (restartProb <= 0) {
+ if (params.isSpliced()) {
oldInplayAlnIndices.resize(numAlns);
rBegs.resize(numAlns);
rEnds.resize(numAlns);
- if (splicePrior > 0.0 || !chromosomeIndex.empty()) {
+ if (params.isSpliceCoords()) {
initRbegsAndEnds();
}
initRnameAndStrandIds();
}
- initDpBounds();
+ initDpBounds(params);
}
-size_t SplitAligner::memory(bool isViterbi, bool isBothSpliceStrands) const {
+size_t SplitAligner::memory(const SplitAlignerParams ¶ms,
+ bool isViterbi, bool isBothSpliceStrands) const {
size_t numOfStrands = isBothSpliceStrands ? 2 : 1;
size_t x = 2 * sizeof(int) + 2 * sizeof(float);
- if (splicePrior > 0 || !chromosomeIndex.empty()) x += 2 * sizeof(unsigned);
- if (!chromosomeIndex.empty()) x += 2;
+ if (params.isSpliceCoords()) x += 2 * sizeof(unsigned);
+ if (params.isGenome()) x += 2;
if (isViterbi) x += sizeof(long) * numOfStrands;
x += 2 * sizeof(double) * numOfStrands;
return x * cellsPerDpMatrix();
}
-void SplitAligner::initMatricesForOneQuery() {
+void SplitAligner::initMatricesForOneQuery(const SplitAlignerParams ¶ms) {
size_t doubleMatrixSize = cellsPerDpMatrix() * 2;
// The final cell per row is never used, because there's one less
// Aij than Dij per candidate alignment.
@@ -1089,36 +1130,42 @@ void SplitAligner::initMatricesForOneQuery() {
Sexp.resize(doubleMatrixSize);
}
- for (unsigned i = 0; i < numAlns; i++) calcBaseScores(i);
+ for (unsigned i = 0; i < numAlns; i++) calcBaseScores(params, i);
- if (splicePrior > 0.0 || !chromosomeIndex.empty()) {
+ if (params.isSpliceCoords()) {
resizeMatrix(spliceBegCoords);
resizeMatrix(spliceEndCoords);
for (unsigned i = 0; i < numAlns; ++i) initSpliceCoords(i);
}
- if (!chromosomeIndex.empty()) {
+ if (params.isGenome()) {
+ spliceBegScores = params.spliceBegScores;
+ spliceEndScores = params.spliceEndScores;
+ spliceBegProbs = params.spliceBegProbs;
+ spliceEndProbs = params.spliceEndProbs;
+
resizeMatrix(spliceBegSignals);
resizeMatrix(spliceEndSignals);
- for (unsigned i = 0; i < numAlns; ++i) initSpliceSignals(i);
+ for (unsigned i = 0; i < numAlns; ++i) initSpliceSignals(params, i);
}
- std::transform(&Smat[0], &Smat[0] + doubleMatrixSize, &Sexp[0], scaledExp);
+ std::transform(&Smat[0], &Smat[0] + doubleMatrixSize, &Sexp[0],
+ params.scaledExp);
// if x/scale < about -745, then exp(x/scale) will be exactly 0.0
}
-void SplitAligner::flipSpliceSignals() {
+void SplitAligner::flipSpliceSignals(const SplitAlignerParams ¶ms) {
Vmat.swap(VmatRev);
Vvec.swap(VvecRev);
Fmat.swap(FmatRev);
Bmat.swap(BmatRev);
rescales.swap(rescalesRev);
- for (int i = 0; i < 16; ++i) {
- int j = 15 - ((i%4) * 4 + (i/4)); // reverse-complement
- std::swap(spliceBegScores[i], spliceEndScores[j]);
- std::swap(spliceBegProbs[i], spliceEndProbs[j]);
- }
+ int d = 17 - (spliceBegScores - params.spliceBegScores);
+ spliceBegScores = params.spliceBegScores + d;
+ spliceEndScores = params.spliceEndScores + d;
+ spliceBegProbs = params.spliceBegProbs + d;
+ spliceEndProbs = params.spliceEndProbs + d;
}
double SplitAligner::spliceSignalStrandLogOdds() const {
@@ -1147,9 +1194,9 @@ double SplitAligner::spliceSignalStrandLogOdds() const {
// estimated mean ln[distance] 7.01031
// estimated standard deviation of ln[distance] 1.72269
-void SplitAligner::setSpliceParams(double splicePriorIn,
- double meanLogDistIn,
- double sdevLogDistIn) {
+void SplitAlignerParams::setSpliceParams(double splicePriorIn,
+ double meanLogDistIn,
+ double sdevLogDistIn) {
splicePrior = splicePriorIn;
meanLogDist = meanLogDistIn;
sdevLogDist = sdevLogDistIn;
@@ -1191,14 +1238,14 @@ void SplitAligner::setSpliceParams(double splicePriorIn,
}
}
-void SplitAligner::setParams(int gapExistenceScoreIn, int gapExtensionScoreIn,
- int insExistenceScoreIn, int insExtensionScoreIn,
- int jumpScoreIn, int restartScoreIn,
- double scaleIn, int qualityOffsetIn) {
- gapExistenceScore = gapExistenceScoreIn;
- gapExtensionScore = gapExtensionScoreIn;
- insExistenceScore = insExistenceScoreIn;
- insExtensionScore = insExtensionScoreIn;
+void SplitAlignerParams::setParams(int delOpenScoreIn, int delGrowScoreIn,
+ int insOpenScoreIn, int insGrowScoreIn,
+ int jumpScoreIn, int restartScoreIn,
+ double scaleIn, int qualityOffsetIn) {
+ delOpenScore = delOpenScoreIn;
+ delGrowScore = delGrowScoreIn;
+ insOpenScore = insOpenScoreIn;
+ insGrowScore = insGrowScoreIn;
jumpScore = jumpScoreIn;
restartScore = restartScoreIn;
scale = scaleIn;
@@ -1208,11 +1255,7 @@ void SplitAligner::setParams(int gapExistenceScoreIn, int gapExtensionScoreIn,
restartProb = scaledExp(restartScore);
}
-static int scoreFromProb(double prob, double scale) {
- return std::floor(scale * std::log(prob) + 0.5);
-}
-
-void SplitAligner::setSpliceSignals() {
+void SplitAlignerParams::setSpliceSignals() {
// If an RNA-DNA alignment reaches position i in the DNA, the
// probability of splicing from i to j is:
// P(i & j) = d(i) * a(j) * f(j - i),
@@ -1247,7 +1290,7 @@ void SplitAligner::setSpliceSignals() {
double dAvg = (dGT + dGC + dAT + dNN * 13) / 16;
double aAvg = (aAG + aAC + aNN * 14) / 16;
- for (int i = 0; i < 17; ++i) {
+ for (int i = 0; i < 17 * 2; ++i) {
spliceBegScores[i] = scoreFromProb(dNN / dAvg, scale);
spliceEndScores[i] = scoreFromProb(aNN / aAvg, scale);
}
@@ -1259,15 +1302,22 @@ void SplitAligner::setSpliceSignals() {
spliceEndScores[0 * 4 + 2] = scoreFromProb(aAG / aAvg, scale);
spliceEndScores[0 * 4 + 1] = scoreFromProb(aAC / aAvg, scale);
- for (int i = 0; i < 17; ++i) {
+ for (int i = 0; i < 16; ++i) {
+ int j = 15 - ((i%4) * 4 + (i/4)); // reverse-complement
+ spliceBegScores[17 + i] = spliceEndScores[j];
+ spliceEndScores[17 + i] = spliceBegScores[j];
+ }
+
+ for (int i = 0; i < 17 * 2; ++i) {
spliceBegProbs[i] = scaledExp(spliceBegScores[i]);
spliceEndProbs[i] = scaledExp(spliceEndScores[i]);
}
- maxSpliceBegEndScore = arrayMax(spliceBegScores) + arrayMax(spliceEndScores);
+ maxSpliceBegEndScore =
+ myMax(spliceBegScores, 17) + myMax(spliceEndScores, 17);
}
-void SplitAligner::printParameters() const {
+void SplitAlignerParams::print() const {
if (jumpProb > 0.0) {
std::cout << "# trans=" << jumpScore << "\n";
}
@@ -1276,7 +1326,7 @@ void SplitAligner::printParameters() const {
std::cout << "# cismax=" << maxSpliceDist << "\n";
}
- if (!chromosomeIndex.empty()) {
+ if (isGenome()) {
std::cout << "#"
<< " GT=" << spliceBegScores[2 * 4 + 3]
<< " GC=" << spliceBegScores[2 * 4 + 1]
@@ -1322,9 +1372,9 @@ static void readPrjFile(const std::string& baseName,
}
}
-void SplitAligner::readGenomeVolume(const std::string& baseName,
- size_t seqCount,
- size_t volumeNumber) {
+void SplitAlignerParams::readGenomeVolume(const std::string &baseName,
+ size_t seqCount,
+ size_t volumeNumber) {
if (seqCount + 1 == 0) err("can't read: " + baseName);
genome[volumeNumber].fromFiles(baseName, seqCount, 0);
@@ -1339,7 +1389,7 @@ void SplitAligner::readGenomeVolume(const std::string& baseName,
}
}
-void SplitAligner::readGenome(const std::string& baseName) {
+void SplitAlignerParams::readGenome(const std::string &baseName) {
std::string alphabetLetters;
size_t seqCount, volumes;
readPrjFile(baseName, alphabetLetters, seqCount, volumes);
@@ -1393,19 +1443,16 @@ static int min(const std::vector< std::vector<int> >& matrix) {
}
static int matrixLookup(const std::vector< std::vector<int> >& matrix,
- const std::string& rowNames,
- const std::string& colNames, char x, char y) {
- std::string::size_type row = rowNames.find(x);
- std::string::size_type col = colNames.find(y);
- if (row == std::string::npos || col == std::string::npos)
- return min(matrix);
- else
- return matrix.at(row).at(col);
+ const char *rowNames,
+ const char *colNames, char x, char y) {
+ const char *r = strchr(rowNames, x);
+ const char *c = strchr(colNames, y);
+ return (r && c) ? matrix.at(r - rowNames).at(c - colNames) : min(matrix);
}
-void SplitAligner::setScoreMat(const std::vector< std::vector<int> >& matrix,
- const std::string& rowNames,
- const std::string& colNames) {
+void SplitAlignerParams::setScoreMat(const std::vector< std::vector<int> > &sm,
+ const char *rowNames,
+ const char *colNames) {
const std::string bases = "ACGT";
// Reverse-engineer the abundances of ACGT from the score matrix:
@@ -1415,8 +1462,7 @@ void SplitAligner::setScoreMat(const std::vector< std::vector<int> >& matrix,
for (unsigned i = 0; i < blen; ++i) bmat[i] = &bvec[i * blen];
for (unsigned i = 0; i < blen; ++i)
for (unsigned j = 0; j < blen; ++j)
- bmat[i][j] = matrixLookup(matrix, rowNames, colNames,
- bases[i], bases[j]);
+ bmat[i][j] = matrixLookup(sm, rowNames, colNames, bases[i], bases[j]);
mcf::SubstitutionMatrixStats stats;
stats.calcFromScale(&bmat[0], blen, scale);
@@ -1425,7 +1471,7 @@ void SplitAligner::setScoreMat(const std::vector< std::vector<int> >& matrix,
char x = std::toupper(i);
for (int j = 64; j < 128; ++j) {
char y = std::toupper(j);
- int score = matrixLookup(matrix, rowNames, colNames, x, y);
+ int score = matrixLookup(sm, rowNames, colNames, x, y);
for (int q = 0; q < numQualCodes; ++q) {
std::string::size_type xc = bases.find(x);
std::string::size_type yc = bases.find(y);
@@ -1439,7 +1485,7 @@ void SplitAligner::setScoreMat(const std::vector< std::vector<int> >& matrix,
}
}
- maxMatchScore = max(matrix);
+ maxMatchScore = max(sm);
}
}
=====================================
src/split/cbrc_split_aligner.hh
=====================================
@@ -10,74 +10,142 @@
#include "Alphabet.hh"
#include "MultiSequence.hh"
+#include <math.h>
+#include <stddef.h>
+
+#include <map>
#include <string>
#include <vector>
-#include <cmath>
-#include <climits>
-#include <stddef.h> // size_t
-#include <map>
namespace cbrc {
-class SplitAligner {
-public:
- SplitAligner() {
- setParams(-7, -1, -7, -1, -40, INT_MIN/2, 1.0, 33); // xxx ???
- setSpliceParams(0.0, 7.0, 1.7);
- }
+struct SplitAlignerParams {
+ static int scoreFromProb(double prob, double scale) {
+ return floor(scale * log(prob) + 0.5);
+ }
+
+ // A deletion of length k scores: delOpenScore + k * delGrowScore
+ // An insertion of length k scores: insOpenScore + k * insGrowScore
+
+ // "jumpScore" is the (negative) score for a trans-splice.
+
+ // "restartScore" is the (negative) score for re-starting an
+ // alignment, in the repeated matches algorithm from chapter 2 of
+ // Durbin, Eddy, et al. In that book, it is called "-T".
+
+ // "qualityOffset" is 33 for fastq-sanger or 64 for fastq-illumina
+
+ void setParams(int delOpenScoreIn, int delGrowScoreIn,
+ int insOpenScoreIn, int insGrowScoreIn,
+ int jumpScoreIn, int restartScoreIn, double scaleIn,
+ int qualityOffsetIn);
+
+ void setSpliceParams(double splicePriorIn,
+ double meanLogDistIn, double sdevLogDistIn);
+
+ void setScoreMat(const std::vector< std::vector<int> > &sm,
+ const char *rowNames, const char *colNames);
- // A gap of length k scores: gapExistenceScore + k * gapExtensionScore.
+ void readGenome(const std::string &baseName);
- // We allow for the possibility that insertions get different scores:
- // insExistenceScore + k * insExtensionScore.
+ // XXX this should allow us to specify scores for gt-ag, at-ac, etc.
+ void setSpliceSignals();
- // "jumpScore" is the (negative) score for a trans-splice.
+ // Outputs some algorithm parameters on lines starting with "#"
+ void print() const;
- // "restartScore" is the (negative) score for re-starting an
- // alignment, in the repeated matches algorithm from chapter 2 of
- // Durbin, Eddy, et al. In that book, it is called "-T".
+ static const int numQualCodes = 64;
+ static int score_mat[64][64][numQualCodes];
+ int maxMatchScore;
+ int qualityOffset;
+ int delOpenScore;
+ int delGrowScore;
+ int insOpenScore;
+ int insGrowScore;
+ int jumpScore;
+ int restartScore;
+ double jumpProb;
+ double restartProb;
+ double scale;
+ IntExponentiator scaledExp; // for fast calculation of exp(x / scale)
- // "qualityOffset" is 33 for fastq-sanger or 64 for fastq-illumina
+ double splicePrior;
+ double meanLogDist;
+ double sdevLogDist;
+ double spliceTerm1;
+ double spliceTerm2;
+ unsigned maxSpliceDist;
+ int maxSpliceScore;
+ int maxSpliceBegEndScore;
+ std::vector<int> spliceScoreTable; // lookup table
+ std::vector<double> spliceProbTable; // lookup table
+ unsigned spliceTableSize;
+ MultiSequence genome[32];
+ Alphabet alphabet;
+ typedef std::map<std::string, unsigned long long> StringNumMap;
+ StringNumMap chromosomeIndex;
+ int spliceBegScores[(4 * 4 + 1) * 2]; // donor score for any dinucleotide
+ int spliceEndScores[(4 * 4 + 1) * 2]; // acceptor score for any dinucleotide
+ double spliceBegProbs[(4 * 4 + 1) * 2];
+ double spliceEndProbs[(4 * 4 + 1) * 2];
- void setParams(int gapExistenceScoreIn, int gapExtensionScoreIn,
- int insExistenceScoreIn, int insExtensionScoreIn,
- int jumpScoreIn, int restartScoreIn, double scaleIn,
- int qualityOffsetIn);
+ bool isSpliced() const { return restartProb <= 0; }
- void setSpliceParams(double splicePriorIn,
- double meanLogDistIn, double sdevLogDistIn);
+ bool isGenome() const { return !chromosomeIndex.empty(); }
- void setScoreMat(const std::vector< std::vector<int> >& matrix,
- const std::string& rowNames,
- const std::string& colNames);
+ bool isSpliceCoords() const { return splicePrior > 0 || isGenome(); }
- void readGenome(const std::string& baseName);
+ void dpExtensionMinScores(size_t &minScore1, size_t &minScore2) const;
- // XXX this should allow us to specify scores for gt-ag, at-ac, etc.
- void setSpliceSignals();
+ void seqEnds(const uchar *&beg, const uchar *&end,
+ const char *seqName) const;
- // Outputs some algorithm parameters on lines starting with "#"
- void printParameters() const;
+ int spliceScore(unsigned d) const
+ { return d < spliceTableSize ? spliceScoreTable[d] : calcSpliceScore(d); }
+ double spliceProb(unsigned d) const
+ { return d < spliceTableSize ? spliceProbTable[d] : calcSpliceProb(d); }
+
+ void spliceBegSignal(char *out, const char *seqName, bool isForwardStrand,
+ bool isSenseStrand, unsigned coord) const;
+
+ void spliceEndSignal(char *out, const char *seqName, bool isForwardStrand,
+ bool isSenseStrand, unsigned coord) const;
+
+ int calcSpliceScore(double dist) const;
+
+ double calcSpliceProb(double dist) const
+ { return scaledExp(calcSpliceScore(dist)); }
+
+ size_t maxGenomeVolumes() const { return sizeof genome / sizeof *genome; }
+
+ void readGenomeVolume(const std::string &baseName,
+ size_t seqCount, size_t volumeNumber);
+};
+
+class SplitAligner {
+public:
// Prepares to analyze some candidate alignments for one query
// sequence: sets the number of DP matrix cells (and thus memory)
- void layout(std::vector<UnsplitAlignment>::const_iterator beg,
- std::vector<UnsplitAlignment>::const_iterator end);
+ void layout(const SplitAlignerParams ¶ms,
+ const UnsplitAlignment *beg, const UnsplitAlignment *end);
// The number of cells in each dynamic programming matrix
size_t cellsPerDpMatrix() const
{ return matrixRowOrigins[numAlns-1] + dpEnd(numAlns-1) + 1; }
// Bytes of memory needed for the current query sequence (roughly)
- size_t memory(bool isViterbi, bool isBothSpliceStrands) const;
+ size_t memory(const SplitAlignerParams ¶ms,
+ bool isViterbi, bool isBothSpliceStrands) const;
// Call this before viterbi/forward/backward, and after layout
- void initMatricesForOneQuery();
+ void initMatricesForOneQuery(const SplitAlignerParams ¶ms);
- long viterbi() { // returns the optimal split-alignment score
+ // returns the optimal split-alignment score
+ long viterbi(const SplitAlignerParams ¶ms) {
resizeMatrix(Vmat);
resizeVector(Vvec);
- return (restartProb > 0) ? viterbiSplit() : viterbiSplice();
+ return params.isSpliced() ? viterbiSplice(params) : viterbiSplit(params);
}
// Gets the chunks of an optimal split alignment.
@@ -86,7 +154,7 @@ public:
// 2. The chunk's start coordinate in the query sequence
// 3. The chunk's end coordinate in the query sequence
// It gets the chunks in reverse order, from query end to query start.
- void traceBack(long viterbiScore,
+ void traceBack(const SplitAlignerParams ¶ms, long viterbiScore,
std::vector<unsigned>& alnNums,
std::vector<unsigned>& queryBegs,
std::vector<unsigned>& queryEnds) const;
@@ -95,16 +163,16 @@ public:
int segmentScore(unsigned alnNum,
unsigned queryBeg, unsigned queryEnd) const;
- void forwardBackward() {
+ void forwardBackward(const SplitAlignerParams ¶ms) {
resizeVector(rescales);
resizeMatrix(Fmat);
resizeMatrix(Bmat);
- if (restartProb > 0) {
- forwardSplit();
- backwardSplit();
+ if (params.isSpliced()) {
+ forwardSplice(params);
+ backwardSplice(params);
} else {
- forwardSplice();
- backwardSplice();
+ forwardSplit(params);
+ backwardSplit(params);
}
}
@@ -113,7 +181,7 @@ public:
unsigned alnBeg, unsigned alnEnd) const;
// Toggles between forward and reverse-complement splice signals
- void flipSpliceSignals();
+ void flipSpliceSignals(const SplitAlignerParams ¶ms);
// This returns log(p / (1-p)), where p is the probability that
// the query uses splice signals in the orientation currently set
@@ -122,31 +190,27 @@ public:
// Gets the 2 genome bases immediately downstream of queryPos in
// alnNum, and writes them into the buffer pointed to by "out"
- void spliceBegSignal(char *out, unsigned alnNum, unsigned queryPos,
- bool isSenseStrand) const;
+ void spliceBegSignal(char *out, const SplitAlignerParams ¶ms,
+ unsigned alnNum, unsigned queryPos,
+ bool isSenseStrand) const {
+ const UnsplitAlignment &a = alns[alnNum];
+ params.spliceBegSignal(out, a.rname, a.isForwardStrand(), isSenseStrand,
+ cell(spliceBegCoords, alnNum, queryPos));
+ }
// Gets the 2 genome bases immediately upstream of queryPos in
// alnNum, and writes them into the buffer pointed to by "out"
- void spliceEndSignal(char *out, unsigned alnNum, unsigned queryPos,
- bool isSenseStrand) const;
+ void spliceEndSignal(char *out, const SplitAlignerParams ¶ms,
+ unsigned alnNum, unsigned queryPos,
+ bool isSenseStrand) const {
+ const UnsplitAlignment &a = alns[alnNum];
+ params.spliceEndSignal(out, a.rname, a.isForwardStrand(), isSenseStrand,
+ cell(spliceEndCoords, alnNum, queryPos));
+ }
private:
- static const int numQualCodes = 64;
- static int score_mat[64][64][numQualCodes];
- int maxMatchScore;
- int qualityOffset;
- int gapExistenceScore;
- int gapExtensionScore;
- int insExistenceScore;
- int insExtensionScore;
- int jumpScore;
- int restartScore;
- double jumpProb;
- double restartProb;
- double scale;
- IntExponentiator scaledExp; // for fast calculation of exp(x / scale)
unsigned numAlns; // the number of candidate alignments (for 1 query)
- std::vector<UnsplitAlignment>::const_iterator alns; // the candidates
+ const UnsplitAlignment *alns; // the candidates
unsigned minBeg; // the minimum query start coordinate of any candidate
unsigned maxEnd; // the maximum query end coordinate of any candidate
std::vector<unsigned> dpBegs; // dynamic programming begin coords
@@ -182,14 +246,6 @@ private:
std::vector<unsigned> oldInplayAlnIndices;
std::vector<unsigned> newInplayAlnIndices;
- double splicePrior;
- double meanLogDist;
- double sdevLogDist;
- double spliceTerm1;
- double spliceTerm2;
- unsigned maxSpliceDist;
- int maxSpliceScore;
- int maxSpliceBegEndScore;
std::vector<unsigned> spliceBegCoords;
std::vector<unsigned> spliceEndCoords;
std::vector<unsigned char> spliceBegSignals;
@@ -197,50 +253,26 @@ private:
std::vector<unsigned> rBegs; // genomic beg coordinate of each candidate
std::vector<unsigned> rEnds; // genomic end coordinate of each candidate
std::vector<unsigned> rnameAndStrandIds;
- std::vector<int> spliceScoreTable; // lookup table
- std::vector<double> spliceProbTable; // lookup table
- unsigned spliceTableSize;
- MultiSequence genome[32];
- Alphabet alphabet;
- typedef std::map<std::string, unsigned long long> StringNumMap;
- StringNumMap chromosomeIndex;
- int spliceBegScores[4 * 4 + 1]; // donor score for any dinucleotide
- int spliceEndScores[4 * 4 + 1]; // acceptor score for any dinucleotide
- double spliceBegProbs[4 * 4 + 1];
- double spliceEndProbs[4 * 4 + 1];
- int spliceBegScore(size_t ij) const {
- if (chromosomeIndex.empty()) return 0;
- return spliceBegScores[spliceBegSignals[ij]];
+ const int *spliceBegScores;
+ const int *spliceEndScores;
+ const double *spliceBegProbs;
+ const double *spliceEndProbs;
+ int spliceBegScore(bool isGenome, size_t ij) const {
+ return isGenome ? spliceBegScores[spliceBegSignals[ij]] : 0;
}
- int spliceEndScore(size_t ij) const {
- if (chromosomeIndex.empty()) return 0;
- return spliceEndScores[spliceEndSignals[ij]];
+ int spliceEndScore(bool isGenome, size_t ij) const {
+ return isGenome ? spliceEndScores[spliceEndSignals[ij]] : 0;
}
- double spliceBegProb(size_t ij) const {
- if (chromosomeIndex.empty()) return 1;
- return spliceBegProbs[spliceBegSignals[ij]];
+ double spliceBegProb(bool isGenome, size_t ij) const {
+ return isGenome ? spliceBegProbs[spliceBegSignals[ij]] : 1.0;
}
- double spliceEndProb(size_t ij) const {
- if (chromosomeIndex.empty()) return 1;
- return spliceEndProbs[spliceEndSignals[ij]];
+ double spliceEndProb(bool isGenome, size_t ij) const {
+ return isGenome ? spliceEndProbs[spliceEndSignals[ij]] : 1.0;
}
- int calcSpliceScore(double dist) const;
- int spliceScore(unsigned d) const
- { return d < spliceTableSize ? spliceScoreTable[d] : calcSpliceScore(d); }
- double calcSpliceProb(double dist) const
- { return scaledExp(calcSpliceScore(dist)); }
- double spliceProb(unsigned d) const
- { return d < spliceTableSize ? spliceProbTable[d] : calcSpliceProb(d); }
void initSpliceCoords(unsigned i);
- void initSpliceSignals(unsigned i);
+ void initSpliceSignals(const SplitAlignerParams ¶ms, unsigned i);
void initRnameAndStrandIds();
void initRbegsAndEnds();
- size_t maxGenomeVolumes() const { return sizeof genome / sizeof *genome; }
- void readGenomeVolume(const std::string& baseName,
- size_t seqCount, size_t volumeNumber);
-
- void dpExtensionMinScores(int maxJumpScore,
- size_t& minScore1, size_t& minScore2) const;
void updateInplayAlnIndicesF(unsigned& sortedAlnPos,
unsigned& oldNumInplay,
@@ -250,17 +282,19 @@ private:
unsigned& oldNumInplay,
unsigned& newNumInplay, unsigned j);
- long viterbiSplit();
- long viterbiSplice();
+ long viterbiSplit(const SplitAlignerParams ¶ms);
+ long viterbiSplice(const SplitAlignerParams ¶ms);
- void forwardSplit();
- void backwardSplit();
- void forwardSplice();
- void backwardSplice();
+ void forwardSplit(const SplitAlignerParams ¶ms);
+ void backwardSplit(const SplitAlignerParams ¶ms);
+ void forwardSplice(const SplitAlignerParams ¶ms);
+ void backwardSplice(const SplitAlignerParams ¶ms);
- unsigned findScore(unsigned j, long score) const;
- unsigned findSpliceScore(unsigned i, unsigned j, long score) const;
- long scoreFromSplice(unsigned i, unsigned j, unsigned oldNumInplay,
+ unsigned findScore(bool isGenome, unsigned j, long score) const;
+ unsigned findSpliceScore(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j, long score) const;
+ long scoreFromSplice(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j, unsigned oldNumInplay,
unsigned& oldInplayPos) const;
long endScore() const;
unsigned findEndScore(long score) const;
@@ -300,14 +334,16 @@ private:
if (m.size() < s) m.resize(s);
}
- double probFromSpliceF(unsigned i, unsigned j, unsigned oldNumInplay,
+ double probFromSpliceF(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j, unsigned oldNumInplay,
unsigned& oldInplayPos) const;
- double probFromSpliceB(unsigned i, unsigned j, unsigned oldNumInplay,
+ double probFromSpliceB(const SplitAlignerParams ¶ms,
+ unsigned i, unsigned j, unsigned oldNumInplay,
unsigned& oldInplayPos) const;
- void calcBaseScores(unsigned i);
- void initDpBounds();
+ void calcBaseScores(const SplitAlignerParams ¶ms, unsigned i);
+ void initDpBounds(const SplitAlignerParams ¶ms);
};
}
=====================================
src/split/last-split-main.cc
=====================================
@@ -1,4 +1,5 @@
-// Copyright 2013, 2014 Martin C. Frith
+// Author: Martin C. Frith 2013
+// SPDX-License-Identifier: GPL-3.0-or-later
// Parse the command line and run last-split.
@@ -8,47 +9,12 @@
#include <getopt.h>
-#include <cctype>
#include <cstdlib> // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>
-static size_t defaultBytes(bool isSplicedAlignment) {
- size_t b = isSplicedAlignment ? 8 : 8 * 1024;
- for (int i = 0; i < 3; ++i) {
- size_t n = b * 1024;
- if (n / 1024 != b) return -1;
- b = n;
- }
- return b;
-}
-
-static char parseOutputFormat(const char *text) {
- std::string s = text;
- for (size_t i = 0; i < s.size(); ++i) {
- s[i] = std::tolower(s[i]);
- }
- if (s == "maf") return 'm';
- if (s == "maf+") return 'M';
- return 0;
-}
-
static void run(int argc, char* argv[]) {
LastSplitOptions opts;
- opts.format = 0;
- opts.isTopSeqQuery = false;
- opts.direction = 1;
- opts.cis = 0.004;
- opts.trans = 1e-05;
- opts.mean = 7.0;
- opts.sdev = 1.7;
- opts.mismap = 1.0;
- opts.score = -1;
- opts.no_split = false;
- opts.bytes = 0;
- opts.verbose = false;
- opts.isSplicedAlignment = false;
-
std::string version = "last-split "
#include "version.hh"
"\n";
@@ -62,24 +28,18 @@ come from different parts of the genome.\n\
\n\
Options:\n\
-h, --help show this help message and exit\n\
- -f, --format=FMT output format: MAF, MAF+ (default: depends on input)\n\
- -r, --reverse reverse the roles of the 2 sequences in each alignment\n\
- -g, --genome=NAME lastdb genome name\n\
- -d, --direction=D RNA direction: 0=reverse, 1=forward, 2=mixed (default="
- + cbrc::stringify(opts.direction) + ")\n\
- -c, --cis=PROB cis-splice probability per base (default="
- + cbrc::stringify(opts.cis) + ")\n\
- -t, --trans=PROB trans-splice probability per base (default="
- + cbrc::stringify(opts.trans) + ")\n\
- -M, --mean=MEAN mean of ln[intron length] (default="
- + cbrc::stringify(opts.mean) + ")\n\
- -S, --sdev=SDEV standard deviation of ln[intron length] (default="
- + cbrc::stringify(opts.sdev) + ")\n\
- -m, --mismap=PROB maximum mismap probability (default="
- + cbrc::stringify(opts.mismap) + ")\n\
- -s, --score=INT minimum alignment score (default=e OR e+t*ln[100])\n\
- -n, --no-split write original, not split, alignments\n\
- -b, --bytes=B maximum memory (default=8T for split, 8G for spliced)\n\
+ -f, --format=FMT " + LastSplitOptions::helpf + "\n\
+ -r, --reverse " + LastSplitOptions::helpr + "\n\
+ -g, --genome=NAME " + LastSplitOptions::helpg + "\n\
+ -d, --direction=D " + LastSplitOptions::helpd + "\n\
+ -c, --cis=PROB " + LastSplitOptions::helpc + "\n\
+ -t, --trans=PROB " + LastSplitOptions::helpt + "\n\
+ -M, --mean=MEAN " + LastSplitOptions::helpM + "\n\
+ -S, --sdev=SDEV " + LastSplitOptions::helpS + "\n\
+ -m, --mismap=PROB " + LastSplitOptions::helpm + "\n\
+ -s, --score=INT " + LastSplitOptions::helps + "\n\
+ -n, --no-split " + LastSplitOptions::helpn + "\n\
+ -b, --bytes=B " + LastSplitOptions::helpb + "\n\
-v, --verbose be verbose\n\
-V, --version show version information and exit\n\
";
@@ -112,7 +72,7 @@ Options:\n\
std::cout << help;
return;
case 'f':
- opts.format = parseOutputFormat(optarg);
+ opts.format = LastSplitOptions::parseOutputFormat(optarg);
break;
case 'r':
opts.isTopSeqQuery = true;
@@ -164,8 +124,6 @@ Options:\n\
}
}
- if (!opts.bytes) opts.bytes = defaultBytes(opts.isSplicedAlignment);
-
opts.inputFileNames.assign(argv + optind, argv + argc);
if (opts.inputFileNames.empty()) opts.inputFileNames.push_back("-");
=====================================
src/split/last-split.cc
=====================================
@@ -1,23 +1,21 @@
-// Copyright 2013, 2014 Martin C. Frith
+// Author: Martin C. Frith 2013
+// SPDX-License-Identifier: GPL-3.0-or-later
#include "last-split.hh"
+#include "mcf_last_splitter.hh"
-#include "cbrc_split_aligner.hh"
-
-#include <stdio.h>
#include <string.h>
#include <algorithm>
-#include <cassert>
#include <cctype>
-#include <cmath>
#include <fstream>
-#include <iomanip>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <streambuf>
+using namespace mcf;
+
class MyString {
public:
MyString() : v(1), s(0), e(0) {}
@@ -114,10 +112,6 @@ static bool isSameName(const char *sLine1, const char *sLine2) {
}
}
-static int scoreFromProb(double prob, double scale) {
- return std::floor(scale * std::log(prob) + 0.5);
-}
-
static void transpose(std::vector< std::vector<int> > &matrix) {
size_t rows = matrix.size();
size_t cols = matrix[0].size();
@@ -131,265 +125,29 @@ static void transpose(std::vector< std::vector<int> > &matrix) {
m.swap(matrix);
}
-// Defines an ordering, for sorting.
-static bool less(const cbrc::UnsplitAlignment& a,
- const cbrc::UnsplitAlignment& b) {
- int qnameCmp = strcmp(a.qname, b.qname);
- if (qnameCmp != 0 ) return qnameCmp < 0;
- if (a.qstart != b.qstart ) return a.qstart < b.qstart;
- if (a.qend != b.qend ) return a.qend < b.qend;
- if (a.qstrand != b.qstrand) return a.qstrand < b.qstrand;
- int qalignCmp = strcmp(a.qalign, b.qalign);
- if (qalignCmp != 0 ) return qalignCmp < 0;
- int ralignCmp = strcmp(a.ralign, b.ralign);
- if (ralignCmp != 0 ) return ralignCmp < 0;
- return a.linesBeg < b.linesBeg; // stabilizes the sort
-}
-
-static int printSense(char *out, double senseStrandLogOdds) {
- double b = senseStrandLogOdds / std::log(2.0);
- if (b < 0.1 && b > -0.1) b = 0;
- else if (b > 10) b = std::floor(b + 0.5);
- else if (b < -10) b = std::ceil(b - 0.5);
- int precision = (b < 10 && b > -10) ? 2 : 3;
- return sprintf(out, " sense=%.*g", precision, b);
-}
-
-static void doOneAlignmentPart(cbrc::SplitAligner& sa,
- const cbrc::UnsplitAlignment& a,
- unsigned numOfParts, unsigned partNum,
- unsigned alnNum,
- unsigned qSliceBeg, unsigned qSliceEnd,
- bool isSenseStrand, double senseStrandLogOdds,
- const LastSplitOptions& opts,
- bool isAlreadySplit) {
- std::vector<char> outputText;
-
- if (qSliceBeg >= a.qend || qSliceEnd <= a.qstart) {
- return; // this can happen for spliced alignment!
- }
-
- unsigned qSliceBegTrimmed = qSliceBeg;
- unsigned qSliceEndTrimmed = qSliceEnd;
- unsigned alnBeg, alnEnd;
- cbrc::mafSliceBeg(a.ralign, a.qalign, a.qstart, qSliceBegTrimmed, alnBeg);
- cbrc::mafSliceEnd(a.ralign, a.qalign, a.qend, qSliceEndTrimmed, alnEnd);
-
- if (qSliceBegTrimmed >= qSliceEndTrimmed) {
- return; // I think this can happen for spliced alignment
- }
-
- int score =
- sa.segmentScore(alnNum, qSliceBeg, qSliceEnd) -
- sa.segmentScore(alnNum, qSliceBeg, qSliceBegTrimmed) -
- sa.segmentScore(alnNum, qSliceEndTrimmed, qSliceEnd);
- if (score < opts.score) return;
-
- std::vector<double> p;
- if (opts.direction != 0) {
- p = sa.marginalProbs(qSliceBegTrimmed, alnNum, alnBeg, alnEnd);
- }
- std::vector<double> pRev;
- if (opts.direction != 1) {
- sa.flipSpliceSignals();
- pRev = sa.marginalProbs(qSliceBegTrimmed, alnNum, alnBeg, alnEnd);
- sa.flipSpliceSignals();
- }
- if (opts.direction == 0) p.swap(pRev);
- if (opts.direction == 2) {
- double reverseProb = 1 / (1 + std::exp(senseStrandLogOdds));
- // the exp might overflow to inf, but that should be OK
- double forwardProb = 1 - reverseProb;
- for (unsigned i = 0; i < p.size(); ++i) {
- p[i] = forwardProb * p[i] + reverseProb * pRev[i];
- }
- }
-
- assert(!p.empty());
- double mismap = 1.0 - *std::max_element(p.begin(), p.end());
- mismap = std::max(mismap, 1e-10);
- if (mismap > opts.mismap) return;
- int mismapPrecision = 3;
-
- std::vector<char> slice;
- size_t lineLen = cbrc::mafSlice(slice, a, alnBeg, alnEnd, &p[0]);
- const char *sliceBeg = &slice[0];
- const char *sliceEnd = sliceBeg + slice.size();
- const char *pLine = sliceEnd - lineLen;
- const char *secondLastLine = pLine - lineLen;
-
- if (isAlreadySplit && secondLastLine[0] == 'p') {
- size_t backToSeq = alnEnd - alnBeg + 1;
- mismap = cbrc::pLinesToErrorProb(pLine - backToSeq, sliceEnd - backToSeq);
- if (mismap > opts.mismap) return;
- mismapPrecision = 2;
- }
-
- bool isLastalProbs = (*(secondLastLine - isAlreadySplit * lineLen) == 'p');
- if (opts.format == 'm' || (opts.format == 0 && !isLastalProbs)) {
- while (*(sliceEnd - lineLen) == 'p') sliceEnd -= lineLen;
- }
-
- const char *aLineOld = a.linesBeg[0];
- size_t aLineOldSize = a.linesBeg[1] - a.linesBeg[0] - 1;
- std::vector<char> aLine(aLineOldSize + 128);
- char *out = &aLine[0];
- if (opts.no_split && aLineOld[0] == 'a') {
- memcpy(out, aLineOld, aLineOldSize);
- out += aLineOldSize;
- } else {
- out += sprintf(out, "a score=%d", score);
- }
- out += sprintf(out, " mismap=%.*g", mismapPrecision, mismap);
- if (opts.direction == 2) out += printSense(out, senseStrandLogOdds);
- if (!opts.genome.empty() && !opts.no_split) {
- if (partNum > 0) {
- out = strcpy(out, isSenseStrand ? " acc=" : " don=") + 5;
- sa.spliceEndSignal(out, alnNum, qSliceBeg, isSenseStrand);
- out += 2;
- }
- if (partNum + 1 < numOfParts) {
- out = strcpy(out, isSenseStrand ? " don=" : " acc=") + 5;
- sa.spliceBegSignal(out, alnNum, qSliceEnd, isSenseStrand);
- out += 2;
- }
- }
- *out++ = '\n';
-
- outputText.insert(outputText.end(), &aLine[0], out);
- outputText.insert(outputText.end(), sliceBeg, sliceEnd);
-
- if (opts.no_split && a.linesEnd[-1][0] == 'c') {
- outputText.insert(outputText.end(), a.linesEnd[-1], a.linesEnd[0]);
- outputText.back() = '\n';
- }
-
- outputText.push_back('\n');
-
- std::cout.write(&outputText[0], outputText.size());
-}
-
-static void doOneQuery(std::vector<cbrc::UnsplitAlignment>::const_iterator beg,
- std::vector<cbrc::UnsplitAlignment>::const_iterator end,
- cbrc::SplitAligner& sa, const LastSplitOptions& opts,
- bool isAlreadySplit) {
- if (opts.verbose) std::cerr << beg->qname << "\t" << (end - beg);
- sa.layout(beg, end);
- if (opts.verbose) std::cerr << "\tcells=" << sa.cellsPerDpMatrix();
- size_t bytes = sa.memory(!opts.no_split, opts.direction == 2);
- if (bytes > opts.bytes) {
- if (opts.verbose) std::cerr << "\n";
- std::cerr << "last-split: skipping sequence " << beg->qname
- << " (" << bytes << " bytes)\n";
- return;
- }
- sa.initMatricesForOneQuery();
-
- if (opts.direction != 0) {
- sa.forwardBackward();
- }
- if (opts.direction != 1) {
- sa.flipSpliceSignals();
- sa.forwardBackward();
- sa.flipSpliceSignals();
- }
-
- double senseStrandLogOdds =
- (opts.direction == 2) ? sa.spliceSignalStrandLogOdds() : 0;
-
- if (opts.no_split) {
- if (opts.verbose) std::cerr << "\n";
- unsigned numOfParts = end - beg;
- for (unsigned i = 0; i < numOfParts; ++i) {
- doOneAlignmentPart(sa, beg[i], numOfParts, i, i,
- beg[i].qstart, beg[i].qend,
- true, senseStrandLogOdds, opts, isAlreadySplit);
- }
- } else {
- long viterbiScore = LONG_MIN;
- if (opts.direction != 0) {
- viterbiScore = sa.viterbi();
- if (opts.verbose) std::cerr << "\t" << viterbiScore;
- }
- long viterbiScoreRev = LONG_MIN;
- if (opts.direction != 1) {
- sa.flipSpliceSignals();
- viterbiScoreRev = sa.viterbi();
- sa.flipSpliceSignals();
- if (opts.verbose) std::cerr << "\t" << viterbiScoreRev;
- }
- bool isSenseStrand = (viterbiScore >= viterbiScoreRev);
- std::vector<unsigned> alnNums;
- std::vector<unsigned> queryBegs;
- std::vector<unsigned> queryEnds;
- if (isSenseStrand) {
- sa.traceBack(viterbiScore, alnNums, queryBegs, queryEnds);
- } else {
- sa.flipSpliceSignals();
- sa.traceBack(viterbiScoreRev, alnNums, queryBegs, queryEnds);
- sa.flipSpliceSignals();
- }
- std::reverse(alnNums.begin(), alnNums.end());
- std::reverse(queryBegs.begin(), queryBegs.end());
- std::reverse(queryEnds.begin(), queryEnds.end());
-
- if (opts.verbose) std::cerr << "\n";
- unsigned numOfParts = alnNums.size();
- for (unsigned k = 0; k < numOfParts; ++k) {
- unsigned i = alnNums[k];
- doOneAlignmentPart(sa, beg[i], numOfParts, k, i,
- queryBegs[k], queryEnds[k],
- isSenseStrand, senseStrandLogOdds,
- opts, isAlreadySplit);
- }
- }
-}
-
static void doOneBatch(MyString &inputText,
const std::vector<size_t> &lineEnds,
const std::vector<unsigned> &mafEnds,
- cbrc::SplitAligner &sa, const LastSplitOptions &opts,
+ LastSplitter &splitter, const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms,
bool isAlreadySplit) {
std::vector<char *> linePtrs(lineEnds.size());
for (size_t i = 0; i < lineEnds.size(); ++i) {
linePtrs[i] = &inputText[0] + lineEnds[i];
}
- std::vector<cbrc::UnsplitAlignment> mafs;
- mafs.reserve(mafEnds.size() - 1); // saves memory: no excess capacity
- for (unsigned i = 1; i < mafEnds.size(); ++i)
- mafs.push_back(cbrc::UnsplitAlignment(&linePtrs[0] + mafEnds[i-1],
- &linePtrs[0] + mafEnds[i],
- opts.isTopSeqQuery));
-
- sort(mafs.begin(), mafs.end(), less);
- std::vector<cbrc::UnsplitAlignment>::const_iterator b = mafs.begin();
- std::vector<cbrc::UnsplitAlignment>::const_iterator e = mafs.begin();
- size_t qendMax = 0;
- while (e < mafs.end()) {
- if (e->qend > qendMax) qendMax = e->qend;
- ++e;
- if (e == mafs.end() || strcmp(e->qname, b->qname) != 0 ||
- (e->qstart >= qendMax && !opts.isSplicedAlignment)) {
- doOneQuery(b, e, sa, opts, isAlreadySplit);
- b = e;
- qendMax = 0;
- }
+ splitter.reserve(mafEnds.size() - 1); // saves memory: no excess capacity
+ for (unsigned i = 1; i < mafEnds.size(); ++i) {
+ splitter.addMaf(&linePtrs[0] + mafEnds[i-1],
+ &linePtrs[0] + mafEnds[i], opts.isTopSeqQuery);
}
-}
-static void printParameters(const LastSplitOptions& opts) {
- std::cout << std::setprecision(12) << "#"
- << " m=" << opts.mismap
- << " s=" << opts.score;
- if (opts.isSplicedAlignment) {
- std::cout << " d=" << opts.direction
- << " c=" << opts.cis
- << " t=" << opts.trans
- << " M=" << opts.mean
- << " S=" << opts.sdev;
+ splitter.split(opts, params, isAlreadySplit);
+
+ if (!splitter.isOutputEmpty()) {
+ splitter.printOutput();
+ splitter.clearOutput();
}
- std::cout << "\n" << std::setprecision(6);
}
static void addMaf(std::vector<unsigned> &mafEnds,
@@ -412,12 +170,13 @@ static void eraseOldInput(MyString &inputText,
}
void lastSplit(LastSplitOptions& opts) {
- cbrc::SplitAligner sa;
+ cbrc::SplitAlignerParams params;
+ LastSplitter splitter;
std::vector< std::vector<int> > scoreMatrix;
std::string rowNames, colNames;
std::string word, name, key;
int state = 0;
- int sequenceFormat = -1;
+ int sequenceFormat = 1; // xxx ???
int gapExistenceCost = -1;
int gapExtensionCost = -1;
int insExistenceCost = -1;
@@ -485,36 +244,20 @@ void lastSplit(LastSplitOptions& opts) {
insExistenceCost < 0 || insExtensionCost < 0 ||
lastalScoreThreshold < 0 || scale <= 0 || genomeSize <= 0)
err("can't read the header");
- if (sequenceFormat == 2 || sequenceFormat >= 4)
- err("unsupported Q format");
- if (opts.score < 0)
- opts.score = lastalScoreThreshold +
- (opts.isSplicedAlignment ? scoreFromProb(100, scale) : 0);
- int restartCost =
- opts.isSplicedAlignment ? -(INT_MIN/2) : opts.score - 1;
- double jumpProb = opts.isSplicedAlignment
- ? opts.trans / (2 * genomeSize) // 2 strands
- : 0.0;
- int jumpCost =
- (jumpProb > 0.0) ? -scoreFromProb(jumpProb, scale) : -(INT_MIN/2);
- int qualityOffset =
- (sequenceFormat == 0) ? 0 : (sequenceFormat == 3) ? 64 : 33;
- printParameters(opts);
+ opts.setUnspecifiedValues(lastalScoreThreshold, scale);
if (opts.isTopSeqQuery) {
transpose(scoreMatrix);
std::swap(rowNames, colNames);
std::swap(gapExistenceCost, insExistenceCost);
std::swap(gapExtensionCost, insExtensionCost);
}
- sa.setParams(-gapExistenceCost, -gapExtensionCost,
- -insExistenceCost, -insExtensionCost,
- -jumpCost, -restartCost, scale, qualityOffset);
- double splicePrior = opts.isSplicedAlignment ? opts.cis : 0.0;
- sa.setSpliceParams(splicePrior, opts.mean, opts.sdev);
- sa.setScoreMat(scoreMatrix, rowNames, colNames);
- sa.setSpliceSignals();
- if (!opts.genome.empty()) sa.readGenome(opts.genome);
- sa.printParameters();
+ setLastSplitParams(params, opts,
+ scoreMatrix, rowNames.c_str(), colNames.c_str(),
+ gapExistenceCost, gapExtensionCost,
+ insExistenceCost, insExtensionCost,
+ scale, genomeSize, sequenceFormat);
+ opts.print();
+ params.print();
std::cout << "#\n";
state = 1;
}
@@ -528,7 +271,8 @@ void lastSplit(LastSplitOptions& opts) {
} else if (strchr(opts.no_split ? "asqpc" : "sqp", linePtr[0])) {
if (!opts.isTopSeqQuery && linePtr[0] == 's' && sLineCount++ % 2 &&
!isSameName(&inputText[qNameLineBeg], linePtr)) {
- doOneBatch(inputText, lineEnds, mafEnds, sa, opts, isAlreadySplit);
+ doOneBatch(inputText, lineEnds, mafEnds, splitter, opts, params,
+ isAlreadySplit);
eraseOldInput(inputText, lineEnds, mafEnds);
qNameLineBeg = lineEnds.back();
}
@@ -539,5 +283,6 @@ void lastSplit(LastSplitOptions& opts) {
}
}
addMaf(mafEnds, lineEnds);
- doOneBatch(inputText, lineEnds, mafEnds, sa, opts, isAlreadySplit);
+ doOneBatch(inputText, lineEnds, mafEnds, splitter, opts, params,
+ isAlreadySplit);
}
=====================================
src/split/last-split.hh
=====================================
@@ -1,4 +1,5 @@
-// Copyright 2013 Martin C. Frith
+// Author: Martin C. Frith 2013
+// SPDX-License-Identifier: GPL-3.0-or-later
// This routine reads alignments of query sequences to a genome, and
// estimates which alignment parts represent the source of each query.
@@ -13,28 +14,8 @@
#ifndef LAST_SPLIT_HH
#define LAST_SPLIT_HH
-#include <stddef.h> // size_t
-#include <string>
-#include <vector>
+#include "last_split_options.hh"
-struct LastSplitOptions {
- int format;
- bool isTopSeqQuery;
- std::string genome;
- int direction;
- double cis;
- double trans;
- double mean;
- double sdev;
- double mismap;
- int score;
- bool no_split;
- size_t bytes;
- bool verbose;
- bool isSplicedAlignment;
- std::vector<std::string> inputFileNames;
-};
-
-void lastSplit(LastSplitOptions& opts);
+void lastSplit(LastSplitOptions &opts);
#endif
=====================================
src/split/last_split_options.cc
=====================================
@@ -0,0 +1,115 @@
+// Author: Martin C. Frith 2022
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#include "last_split_options.hh"
+
+#include <ctype.h>
+#include <math.h>
+
+#include <iostream>
+
+#define OPT_d 1
+#define OPT_c 0.004
+#define OPT_t 1e-05
+#define OPT_M 7.0
+#define OPT_S 1.7
+#define OPT_m 1.0
+
+#define MACROS_SUCK(X) #X
+#define STR(X) MACROS_SUCK(X)
+
+LastSplitOptions::LastSplitOptions()
+ : format(0),
+ isTopSeqQuery(false),
+ direction(OPT_d),
+ cis(OPT_c),
+ trans(OPT_t),
+ mean(OPT_M),
+ sdev(OPT_S),
+ mismap(OPT_m),
+ score(-1),
+ no_split(false),
+ bytes(0),
+ verbose(false),
+ isSplicedAlignment(false) {}
+
+const char LastSplitOptions::helpf[] =
+ "output format: MAF, MAF+";
+
+const char LastSplitOptions::helpr[] =
+ "reverse the roles of the 2 sequences in each alignment";
+
+const char LastSplitOptions::helpg[] =
+ "lastdb genome name";
+
+const char LastSplitOptions::helpd[] =
+ "RNA direction: 0=reverse, 1=forward, 2=mixed (default: " STR(OPT_d) ")";
+
+const char LastSplitOptions::helpc[] =
+ "cis-splice probability per base (default: " STR(OPT_c) ")";
+
+const char LastSplitOptions::helpt[] =
+ "trans-splice probability per base (default: " STR(OPT_t) ")";
+
+const char LastSplitOptions::helpM[] =
+ "mean of ln[intron length] (default: " STR(OPT_M) ")";
+
+const char LastSplitOptions::helpS[] =
+ "standard deviation of ln[intron length] (default: " STR(OPT_S) ")";
+
+const char LastSplitOptions::helpm[] =
+ "maximum mismap probability (default: " STR(OPT_m) ")";
+
+const char LastSplitOptions::helps[] =
+ "minimum alignment score (default: e OR e+t*ln[100])";
+
+const char LastSplitOptions::helpn[] =
+ "write original, not split, alignments";
+
+const char LastSplitOptions::helpb[] =
+ "maximum memory (default: 8T for split, 8G for spliced)";
+
+static size_t defaultBytes(bool isSplicedAlignment) {
+ size_t b = isSplicedAlignment ? 8 : 8 * 1024;
+ for (int i = 0; i < 3; ++i) {
+ size_t n = b * 1024;
+ if (n / 1024 != b) return -1;
+ b = n;
+ }
+ return b;
+}
+
+void LastSplitOptions::setUnspecifiedValues(int lastalMinScore, double scale) {
+ if (!bytes) bytes = defaultBytes(isSplicedAlignment);
+
+ if (score < 0) {
+ score = lastalMinScore;
+ if (isSplicedAlignment) score += floor(scale * log(100.0) + 0.5);
+ }
+}
+
+void LastSplitOptions::print() const {
+ std::streamsize p = std::cout.precision(12);
+ std::cout << '#'
+ << " m=" << mismap
+ << " s=" << score;
+ if (isSplicedAlignment) {
+ std::cout << " d=" << direction
+ << " c=" << cis
+ << " t=" << trans
+ << " M=" << mean
+ << " S=" << sdev;
+ }
+ std::cout << '\n';
+ std::cout.precision(p);
+}
+
+char LastSplitOptions::parseOutputFormat(const char *text) {
+ std::string s = text;
+ for (size_t i = 0; i < s.size(); ++i) {
+ s[i] = tolower(s[i]);
+ }
+ if (s == "maf") return 'm';
+ if (s == "maf+") return 'M';
+ return 0;
+}
=====================================
src/split/last_split_options.hh
=====================================
@@ -0,0 +1,51 @@
+// Author: Martin C. Frith 2022
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#ifndef LAST_SPLIT_OPTIONS_HH
+#define LAST_SPLIT_OPTIONS_HH
+
+#include <stddef.h>
+
+#include <string>
+#include <vector>
+
+struct LastSplitOptions {
+ int format;
+ bool isTopSeqQuery;
+ std::string genome;
+ int direction;
+ double cis;
+ double trans;
+ double mean;
+ double sdev;
+ double mismap;
+ int score;
+ bool no_split;
+ size_t bytes;
+ bool verbose;
+ bool isSplicedAlignment;
+ std::vector<std::string> inputFileNames;
+
+ static const char helpf[];
+ static const char helpr[];
+ static const char helpg[];
+ static const char helpd[];
+ static const char helpc[];
+ static const char helpt[];
+ static const char helpM[];
+ static const char helpS[];
+ static const char helpm[];
+ static const char helps[];
+ static const char helpn[];
+ static const char helpb[];
+
+ LastSplitOptions();
+
+ void setUnspecifiedValues(int lastalMinScore, double scale);
+
+ void print() const;
+
+ static char parseOutputFormat(const char *text);
+};
+
+#endif
=====================================
src/split/mcf_last_splitter.cc
=====================================
@@ -0,0 +1,285 @@
+// Author: Martin C. Frith 2013
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#include "mcf_last_splitter.hh"
+
+#include <assert.h>
+#include <limits.h>
+#include <math.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <algorithm>
+#include <stdexcept>
+
+// Defines an ordering, for sorting.
+static bool less(const cbrc::UnsplitAlignment& a,
+ const cbrc::UnsplitAlignment& b) {
+ int qnameCmp = strcmp(a.qname, b.qname);
+ if (qnameCmp != 0 ) return qnameCmp < 0;
+ if (a.qstart != b.qstart ) return a.qstart < b.qstart;
+ if (a.qend != b.qend ) return a.qend < b.qend;
+ if (a.qstrand != b.qstrand) return a.qstrand < b.qstrand;
+ int qalignCmp = strcmp(a.qalign, b.qalign);
+ if (qalignCmp != 0 ) return qalignCmp < 0;
+ int ralignCmp = strcmp(a.ralign, b.ralign);
+ if (ralignCmp != 0 ) return ralignCmp < 0;
+ return a.linesBeg < b.linesBeg; // stabilizes the sort
+}
+
+static int printSense(char *out, double senseStrandLogOdds) {
+ double b = senseStrandLogOdds / log(2.0);
+ if (b < 0.1 && b > -0.1) b = 0;
+ else if (b > 10) b = floor(b + 0.5);
+ else if (b < -10) b = ceil(b - 0.5);
+ int precision = (b < 10 && b > -10) ? 2 : 3;
+ return sprintf(out, " sense=%.*g", precision, b);
+}
+
+namespace mcf {
+
+void LastSplitter::doOneAlignmentPart(const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms,
+ bool isAlreadySplit,
+ const cbrc::UnsplitAlignment &a,
+ unsigned numOfParts, unsigned partNum,
+ unsigned alnNum,
+ unsigned qSliceBeg, unsigned qSliceEnd,
+ bool isSenseStrand,
+ double senseStrandLogOdds) {
+ if (qSliceBeg >= a.qend || qSliceEnd <= a.qstart) {
+ return; // this can happen for spliced alignment!
+ }
+
+ unsigned qSliceBegTrimmed = qSliceBeg;
+ unsigned qSliceEndTrimmed = qSliceEnd;
+ unsigned alnBeg, alnEnd;
+ cbrc::mafSliceBeg(a.ralign, a.qalign, a.qstart, qSliceBegTrimmed, alnBeg);
+ cbrc::mafSliceEnd(a.ralign, a.qalign, a.qend, qSliceEndTrimmed, alnEnd);
+
+ if (qSliceBegTrimmed >= qSliceEndTrimmed) {
+ return; // I think this can happen for spliced alignment
+ }
+
+ int score =
+ sa.segmentScore(alnNum, qSliceBeg, qSliceEnd) -
+ sa.segmentScore(alnNum, qSliceBeg, qSliceBegTrimmed) -
+ sa.segmentScore(alnNum, qSliceEndTrimmed, qSliceEnd);
+ if (score < opts.score) return;
+
+ std::vector<double> p;
+ if (opts.direction != 0) {
+ p = sa.marginalProbs(qSliceBegTrimmed, alnNum, alnBeg, alnEnd);
+ }
+ std::vector<double> pRev;
+ if (opts.direction != 1) {
+ sa.flipSpliceSignals(params);
+ pRev = sa.marginalProbs(qSliceBegTrimmed, alnNum, alnBeg, alnEnd);
+ sa.flipSpliceSignals(params);
+ }
+ if (opts.direction == 0) p.swap(pRev);
+ if (opts.direction == 2) {
+ double reverseProb = 1 / (1 + exp(senseStrandLogOdds));
+ // the exp might overflow to inf, but that should be OK
+ double forwardProb = 1 - reverseProb;
+ for (unsigned i = 0; i < p.size(); ++i) {
+ p[i] = forwardProb * p[i] + reverseProb * pRev[i];
+ }
+ }
+
+ assert(!p.empty());
+ double mismap = 1.0 - *std::max_element(p.begin(), p.end());
+ mismap = std::max(mismap, 1e-10);
+ if (mismap > opts.mismap) return;
+ int mismapPrecision = 3;
+
+ std::vector<char> slice;
+ size_t lineLen = cbrc::mafSlice(slice, a, alnBeg, alnEnd, &p[0]);
+ const char *sliceBeg = &slice[0];
+ const char *sliceEnd = sliceBeg + slice.size();
+ const char *pLine = sliceEnd - lineLen;
+ const char *secondLastLine = pLine - lineLen;
+
+ if (isAlreadySplit && secondLastLine[0] == 'p') {
+ size_t backToSeq = alnEnd - alnBeg + 1;
+ mismap = cbrc::pLinesToErrorProb(pLine - backToSeq, sliceEnd - backToSeq);
+ if (mismap > opts.mismap) return;
+ mismapPrecision = 2;
+ }
+
+ bool isLastalProbs = (*(secondLastLine - isAlreadySplit * lineLen) == 'p');
+ if (opts.format == 'm' || (opts.format == 0 && !isLastalProbs)) {
+ while (*(sliceEnd - lineLen) == 'p') sliceEnd -= lineLen;
+ }
+
+ const char *aLineOld = a.linesBeg[0];
+ size_t aLineOldSize = a.linesBeg[1] - a.linesBeg[0] - 1;
+ std::vector<char> aLine(aLineOldSize + 128);
+ char *out = &aLine[0];
+ if (opts.no_split && aLineOld[0] == 'a') {
+ memcpy(out, aLineOld, aLineOldSize);
+ out += aLineOldSize;
+ } else {
+ out += sprintf(out, "a score=%d", score);
+ }
+ out += sprintf(out, " mismap=%.*g", mismapPrecision, mismap);
+ if (opts.direction == 2) out += printSense(out, senseStrandLogOdds);
+ if (!opts.genome.empty() && !opts.no_split) {
+ if (partNum > 0) {
+ out = strcpy(out, isSenseStrand ? " acc=" : " don=") + 5;
+ sa.spliceEndSignal(out, params, alnNum, qSliceBeg, isSenseStrand);
+ out += 2;
+ }
+ if (partNum + 1 < numOfParts) {
+ out = strcpy(out, isSenseStrand ? " don=" : " acc=") + 5;
+ sa.spliceBegSignal(out, params, alnNum, qSliceEnd, isSenseStrand);
+ out += 2;
+ }
+ }
+ *out++ = '\n';
+
+ outputText.insert(outputText.end(), &aLine[0], out);
+ outputText.insert(outputText.end(), sliceBeg, sliceEnd);
+
+ if (opts.no_split && a.linesEnd[-1][0] == 'c') {
+ outputText.insert(outputText.end(), a.linesEnd[-1], a.linesEnd[0]);
+ outputText.back() = '\n';
+ }
+
+ outputText.push_back('\n');
+}
+
+void LastSplitter::doOneQuery(const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms,
+ bool isAlreadySplit,
+ const cbrc::UnsplitAlignment *beg,
+ const cbrc::UnsplitAlignment *end) {
+ if (opts.verbose) std::cerr << beg->qname << "\t" << (end - beg);
+ sa.layout(params, beg, end);
+ if (opts.verbose) std::cerr << "\tcells=" << sa.cellsPerDpMatrix();
+ size_t bytes = sa.memory(params, !opts.no_split, opts.direction == 2);
+ if (bytes > opts.bytes) {
+ if (opts.verbose) std::cerr << "\n";
+ std::cerr << "last-split: skipping sequence " << beg->qname
+ << " (" << bytes << " bytes)\n";
+ return;
+ }
+ sa.initMatricesForOneQuery(params);
+
+ if (opts.direction != 0) {
+ sa.forwardBackward(params);
+ }
+ if (opts.direction != 1) {
+ sa.flipSpliceSignals(params);
+ sa.forwardBackward(params);
+ sa.flipSpliceSignals(params);
+ }
+
+ double senseStrandLogOdds =
+ (opts.direction == 2) ? sa.spliceSignalStrandLogOdds() : 0;
+
+ if (opts.no_split) {
+ if (opts.verbose) std::cerr << "\n";
+ unsigned numOfParts = end - beg;
+ for (unsigned i = 0; i < numOfParts; ++i) {
+ doOneAlignmentPart(opts, params, isAlreadySplit, beg[i],
+ numOfParts, i, i, beg[i].qstart, beg[i].qend,
+ true, senseStrandLogOdds);
+ }
+ } else {
+ long viterbiScore = LONG_MIN;
+ if (opts.direction != 0) {
+ viterbiScore = sa.viterbi(params);
+ if (opts.verbose) std::cerr << "\t" << viterbiScore;
+ }
+ long viterbiScoreRev = LONG_MIN;
+ if (opts.direction != 1) {
+ sa.flipSpliceSignals(params);
+ viterbiScoreRev = sa.viterbi(params);
+ sa.flipSpliceSignals(params);
+ if (opts.verbose) std::cerr << "\t" << viterbiScoreRev;
+ }
+ bool isSenseStrand = (viterbiScore >= viterbiScoreRev);
+ std::vector<unsigned> alnNums;
+ std::vector<unsigned> queryBegs;
+ std::vector<unsigned> queryEnds;
+ if (isSenseStrand) {
+ sa.traceBack(params, viterbiScore, alnNums, queryBegs, queryEnds);
+ } else {
+ sa.flipSpliceSignals(params);
+ sa.traceBack(params, viterbiScoreRev, alnNums, queryBegs, queryEnds);
+ sa.flipSpliceSignals(params);
+ }
+ std::reverse(alnNums.begin(), alnNums.end());
+ std::reverse(queryBegs.begin(), queryBegs.end());
+ std::reverse(queryEnds.begin(), queryEnds.end());
+
+ if (opts.verbose) std::cerr << "\n";
+ unsigned numOfParts = alnNums.size();
+ for (unsigned k = 0; k < numOfParts; ++k) {
+ unsigned i = alnNums[k];
+ doOneAlignmentPart(opts, params, isAlreadySplit, beg[i],
+ numOfParts, k, i, queryBegs[k], queryEnds[k],
+ isSenseStrand, senseStrandLogOdds);
+ }
+ }
+}
+
+void LastSplitter::split(const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms,
+ bool isAlreadySplit) {
+ sort(mafs.begin(), mafs.end(), less);
+
+ const cbrc::UnsplitAlignment *beg = mafs.empty() ? 0 : &mafs[0];
+ const cbrc::UnsplitAlignment *end = beg + mafs.size();
+ const cbrc::UnsplitAlignment *mid = beg;
+ size_t qendMax = 0;
+
+ while (mid < end) {
+ if (mid->qend > qendMax) qendMax = mid->qend;
+ ++mid;
+ if (mid == end || strcmp(mid->qname, beg->qname) != 0 ||
+ (mid->qstart >= qendMax && !opts.isSplicedAlignment)) {
+ doOneQuery(opts, params, isAlreadySplit, beg, mid);
+ beg = mid;
+ qendMax = 0;
+ }
+ }
+
+ mafs.clear();
+}
+
+void setLastSplitParams(cbrc::SplitAlignerParams ¶ms,
+ const LastSplitOptions &opts,
+ const std::vector< std::vector<int> > &scoreMatrix,
+ const char *rowNames, const char *colNames,
+ int delOpenCost, int delGrowCost,
+ int insOpenCost, int insGrowCost,
+ double scale, double genomeSize, int sequenceFormat) {
+ int restartCost =
+ opts.isSplicedAlignment ? -(INT_MIN/2) : opts.score - 1;
+
+ double jumpProb = opts.isSplicedAlignment
+ ? opts.trans / (2 * genomeSize) // 2 strands
+ : 0.0;
+
+ int jumpCost = (jumpProb > 0.0) ?
+ -params.scoreFromProb(jumpProb, scale) : -(INT_MIN/2);
+
+ if (sequenceFormat == 2 || sequenceFormat == 4 || sequenceFormat == 5) {
+ throw std::runtime_error("unsupported last-split Q format");
+ }
+
+ int qualityOffset =
+ (sequenceFormat == 1) ? 33 : (sequenceFormat == 3) ? 64 : 0;
+
+ params.setParams(-delOpenCost, -delGrowCost, -insOpenCost, -insGrowCost,
+ -jumpCost, -restartCost, scale, qualityOffset);
+ double splicePrior = opts.isSplicedAlignment ? opts.cis : 0.0;
+ params.setSpliceParams(splicePrior, opts.mean, opts.sdev);
+ params.setScoreMat(scoreMatrix, rowNames, colNames);
+ params.setSpliceSignals();
+ if (!opts.genome.empty()) params.readGenome(opts.genome);
+}
+
+}
=====================================
src/split/mcf_last_splitter.hh
=====================================
@@ -0,0 +1,62 @@
+// Author: Martin C. Frith 2022
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#ifndef MCF_LAST_SPLITTER_HH
+#define MCF_LAST_SPLITTER_HH
+
+#include "cbrc_split_aligner.hh"
+#include "last_split_options.hh"
+
+#include <iostream>
+
+namespace mcf {
+
+void setLastSplitParams(cbrc::SplitAlignerParams ¶ms,
+ const LastSplitOptions &opts,
+ const std::vector< std::vector<int> > &scoreMatrix,
+ const char *rowNames, const char *colNames,
+ int delOpenCost, int delGrowCost,
+ int insOpenCost, int insGrowCost,
+ double scale, double genomeSize, int sequenceFormat);
+
+class LastSplitter {
+public:
+ void reserve(size_t s) { mafs.reserve(s); }
+
+ void addMaf(char **linesBeg, char **linesEnd, bool isTopSeqQuery) {
+ mafs.push_back(cbrc::UnsplitAlignment(linesBeg, linesEnd, isTopSeqQuery));
+ }
+
+ // Calculate and store output, and clear MAFs
+ void split(const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms, bool isAlreadySplit);
+
+ bool isOutputEmpty() const { return outputText.empty(); }
+
+ void printOutput() const
+ { std::cout.write(&outputText[0], outputText.size()); }
+
+ void clearOutput() { outputText.clear(); }
+
+private:
+ cbrc::SplitAligner sa;
+ std::vector<cbrc::UnsplitAlignment> mafs;
+ std::vector<char> outputText;
+
+ void doOneQuery(const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms, bool isAlreadySplit,
+ const cbrc::UnsplitAlignment *beg,
+ const cbrc::UnsplitAlignment *end);
+
+ void doOneAlignmentPart(const LastSplitOptions &opts,
+ const cbrc::SplitAlignerParams ¶ms,
+ bool isAlreadySplit,
+ const cbrc::UnsplitAlignment &a, unsigned numOfParts,
+ unsigned partNum, unsigned alnNum,
+ unsigned qSliceBeg, unsigned qSliceEnd,
+ bool isSenseStrand, double senseStrandLogOdds);
+};
+
+}
+
+#endif
=====================================
test/last-split-test.out
=====================================
@@ -6,18 +6,18 @@ come from different parts of the genome.
Options:
-h, --help show this help message and exit
- -f, --format=FMT output format: MAF, MAF+ (default: depends on input)
+ -f, --format=FMT output format: MAF, MAF+
-r, --reverse reverse the roles of the 2 sequences in each alignment
-g, --genome=NAME lastdb genome name
- -d, --direction=D RNA direction: 0=reverse, 1=forward, 2=mixed (default=1)
- -c, --cis=PROB cis-splice probability per base (default=0.004)
- -t, --trans=PROB trans-splice probability per base (default=1e-05)
- -M, --mean=MEAN mean of ln[intron length] (default=7)
- -S, --sdev=SDEV standard deviation of ln[intron length] (default=1.7)
- -m, --mismap=PROB maximum mismap probability (default=1)
- -s, --score=INT minimum alignment score (default=e OR e+t*ln[100])
+ -d, --direction=D RNA direction: 0=reverse, 1=forward, 2=mixed (default: 1)
+ -c, --cis=PROB cis-splice probability per base (default: 0.004)
+ -t, --trans=PROB trans-splice probability per base (default: 1e-05)
+ -M, --mean=MEAN mean of ln[intron length] (default: 7.0)
+ -S, --sdev=SDEV standard deviation of ln[intron length] (default: 1.7)
+ -m, --mismap=PROB maximum mismap probability (default: 1.0)
+ -s, --score=INT minimum alignment score (default: e OR e+t*ln[100])
-n, --no-split write original, not split, alignments
- -b, --bytes=B maximum memory (default=8T for split, 8G for spliced)
+ -b, --bytes=B maximum memory (default: 8T for split, 8G for spliced)
-v, --verbose be verbose
-V, --version show version information and exit
# LAST version 356
=====================================
test/maf-convert-test.out
=====================================
@@ -11827,6 +11827,7 @@ SRR359290.9033 1 CAAAGTGCTGGGATTGCAGGCATAAGCCACTGTGCGTGGCC 41
99 16 chrM 8940 70 1X1=1X3=1X1=4X8=1X9=1X13=1X5=1X4=2X15=1X3=1X4=1X1=1X2=1X * 0 0 ACGCCAAATAGTTTATTATCAAAACCATCAACCTACTCATTCAAACAATAACCCTAACCGTACGCCTAACCGATAATATTATTACAA ############61477?1=/'=';%&*.>B5?:>>B@)+>CBB*B*,.BB*>((;2+B*B39A>*?>(>8B=B>C+=42;'8+ at C: NM:i:18 AS:i:284
@HD VN:1.3 SO:unsorted
@RG ID:1 PL:ILLUMINA SM:x
+ at PG ID:lastal PN:lastal VN:356
SRR359290.9001 16 chr1 160106736 255 6=1X9=1X8=1X1=1X6=1X4=1X11=1X23= * 0 0 TTCTTTCCCAGCATCGGGGTGGTGCGGGGGGCTGCCCTCCTCATCTGCAAGCCCCGCCGCAACTCAGTCTTCCAG ####################################################?EEGGGGGFDGAGGGGGGBGGDF NM:i:7 AS:i:221 RG:Z:1
SRR359290.9002 0 chr1 231468664 255 75= * 0 0 CGATAATTAAACTATAGGTAGTATATTAAACAAAAATGAGTTTTTAGCAATTATGTGAAATAAGGCTTTAACCAA GGGFGGGGFGFGGGGGGGGGGGEGEGGGGGGGGFGGGGGFGFGGGGGGDFGFGGGGFFFFDGGBFGGFFGDGAGG NM:i:0 AS:i:450 RG:Z:1
SRR359290.9002 0 chr15 50775673 255 22H4=1X19=29H * 0 0 ATATTAAACAAAAATGAGTTTTTA EGEGGGGGGGGFGGGGGFGFGGGG NM:i:1 AS:i:120 RG:Z:1
@@ -12068,6 +12069,7 @@ SRR359290.9033 0 chr6_ssto_hap7 945118 255 15=1X20=1X4=34H * 0 0 CAAAGTGCTGGGATT
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
+ at PG ID:lastal PN:lastal VN:356
SRR359290.9001 16 chr1 160106736 255 6=1X9=1X8=1X1=1X6=1X4=1X11=1X23= * 0 0 TTCTTTCCCAGCATCGGGGTGGTGCGGGGGGCTGCCCTCCTCATCTGCAAGCCCCGCCGCAACTCAGTCTTCCAG ####################################################?EEGGGGGFDGAGGGGGGBGGDF NM:i:7 AS:i:221
SRR359290.9002 0 chr1 231468664 255 75= * 0 0 CGATAATTAAACTATAGGTAGTATATTAAACAAAAATGAGTTTTTAGCAATTATGTGAAATAAGGCTTTAACCAA GGGFGGGGFGFGGGGGGGGGGGEGEGGGGGGGGFGGGGGFGFGGGGGGDFGFGGGGFFFFDGGBFGGFFGDGAGG NM:i:0 AS:i:450
SRR359290.9002 0 chr15 50775673 255 22H4=1X19=29H * 0 0 ATATTAAACAAAAATGAGTTTTTA EGEGGGGGGGGFGGGGGFGFGGGG NM:i:1 AS:i:120
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/8e061678b27fe7998f8631ab49970f0027f48354
--
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/8e061678b27fe7998f8631ab49970f0027f48354
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20220714/961a97ec/attachment-0001.htm>
More information about the debian-med-commit
mailing list