[med-svn] [Git][med-team/last-align][upstream] New upstream version 1167
Nilesh Patra
gitlab at salsa.debian.org
Mon Dec 14 17:56:44 GMT 2020
Nilesh Patra pushed to branch upstream at Debian Med / last-align
Commits:
4dc09e39 by Nilesh Patra at 2020-12-14T23:13:59+05:30
New upstream version 1167
- - - - -
30 changed files:
- ChangeLog.txt
- doc/last-train.html
- doc/last-train.txt
- doc/lastdb.html
- doc/lastdb.txt
- scripts/last-train
- src/Alignment.cc
- src/Alignment.hh
- src/AlignmentWrite.cc
- src/Alphabet.cc
- src/Alphabet.hh
- src/Centroid.cc
- src/GappedXdropAligner.hh
- src/GappedXdropAlignerFrame.cc
- src/GeneticCode.cc
- src/GeneticCode.hh
- src/LastEvaluer.cc
- src/LastEvaluer.hh
- src/LastalArguments.cc
- src/LastalArguments.hh
- src/LastdbArguments.cc
- src/LastdbArguments.hh
- src/MultiSequence.cc
- src/MultiSequence.hh
- src/lastal.cc
- src/lastdb.cc
- src/mcf_frameshift_xdrop_aligner.cc
- src/mcf_frameshift_xdrop_aligner.hh
- src/mcf_substitution_matrix_stats.cc
- src/version.hh
Changes:
=====================================
ChangeLog.txt
=====================================
@@ -1,8 +1,123 @@
+2020-12-11 Martin C. Frith <Martin C. Frith>
+
+ * src/GeneticCode.cc, src/GeneticCode.hh, src/lastal.cc, test/last-
+ test.out, test/last-test.sh:
+ Better rep-masking for codons & new frameshifts
+ [c1fbf7da4f68] [tip]
+
+ * src/GappedXdropAlignerFrame.cc, src/GeneticCode.cc,
+ src/GeneticCode.hh, src/lastal.cc:
+ Refactor
+ [f0ef1bb760a6]
+
+2020-12-04 Martin C. Frith <Martin C. Frith>
+
+ * doc/last-train.txt, scripts/last-train:
+ Add last-train --scale option
+ [f8298e7154b0]
+
+2020-12-03 Martin C. Frith <Martin C. Frith>
+
+ * scripts/last-train:
+ Refactor
+ [8d973000becc]
+
+ * src/GappedXdropAligner.hh, src/GappedXdropAlignerFrame.cc:
+ Refactor
+ [f85ac4b37611]
+
+ * src/mcf_frameshift_xdrop_aligner.cc:
+ Change expected-count output for new frameshifts
+ [6295d6a32b28]
+
+ * src/Centroid.cc:
+ Refactor
+ [ee8f8ed3fd4e]
+
+ * scripts/last-train:
+ Refactor
+ [e35d9df1d54c]
+
+ * doc/lastdb.txt, src/Alphabet.cc, src/Alphabet.hh,
+ src/LastdbArguments.cc, src/LastdbArguments.hh,
+ src/MultiSequence.cc, src/MultiSequence.hh, src/lastdb.cc,
+ src/mcf_substitution_matrix_stats.cc, test/last-test.out, test/last-
+ test.sh:
+ Add lastdb -q option
+ [74ad9d8977ec]
+
+2020-12-02 Martin C. Frith <Martin C. Frith>
+
+ * src/lastal.cc, test/last-test.out, test/last-test.sh:
+ Protein-codon alignment: use tantan on the DNA
+ [8ac7c496f9be]
+
+ * src/Alignment.cc, src/Alignment.hh, src/AlignmentWrite.cc,
+ src/LastEvaluer.cc, src/LastEvaluer.hh, src/lastal.cc, test/last-
+ test.out, test/last-test.sh:
+ E-values for new-style frameshifts!
+ [fffb08324c7d]
+
+2020-12-01 Martin C. Frith <Martin C. Frith>
+
+ * src/mcf_frameshift_xdrop_aligner.cc,
+ src/mcf_frameshift_xdrop_aligner.hh:
+ Start implementing frameshift E-value simulation
+ [d73a654fdabb]
+
+ * src/LastalArguments.cc, src/LastalArguments.hh, src/lastal.cc:
+ Allow for non-integer alignment scores
+ [4bc305f33933]
+
+2020-11-30 Martin C. Frith <Martin C. Frith>
+
+ * src/LastEvaluer.cc, src/LastEvaluer.hh, src/lastal.cc:
+ Refactor
+ [f5a5aed6d38e]
+
+ * src/AlignmentWrite.cc, src/LastEvaluer.cc, src/LastalArguments.cc:
+ Refactor
+ [49942e88c59e]
+
+2020-11-27 Martin C. Frith <Martin C. Frith>
+
+ * src/AlignmentWrite.cc:
+ Protein-codon alignment: fix writing X and *
+ [922917aa2dd8]
+
+2020-11-26 Martin C. Frith <Martin C. Frith>
+
+ * src/lastal.cc:
+ Reduce lastal memory use, sometimes
+ [bc682cb4f508]
+
+ * src/LastEvaluer.cc, test/last-test.out:
+ Fix minor bugs in E-value finite-size correction
+ [8d49928d5e10]
+
+ * src/LastEvaluer.cc, src/LastEvaluer.hh:
+ Refactor
+ [814e3863ab23]
+
+ * scripts/last-train:
+ Refactor
+ [a1e4ab7fd745]
+
+ * scripts/last-train:
+ Refactor
+ [70149e6fb67a]
+
+2020-11-25 Martin C. Frith <Martin C. Frith>
+
+ * scripts/last-train, src/lastal.cc:
+ Refactor
+ [249167a86543]
+
2020-11-06 Martin C. Frith <Martin C. Frith>
* src/mcf_frameshift_xdrop_aligner.hh:
Try to appease C++ compilers
- [62a147543f2b] [tip]
+ [62a147543f2b]
* doc/last-split.txt:
Add awk mismap trick to doc
=====================================
doc/last-train.html
=====================================
@@ -392,6 +392,13 @@ shorter, it is never picked. 0 means use everything.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--sample-length=<var>L</var></span></kbd></td>
<td>Use randomly-chosen chunks of length L.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">--scale=<var>S</var></span></kbd></td>
+<td>Output scores in units of 1/S bits. Traditional values
+include 2 for half-bit scores and 3 for 1/3-bit scores.
+(Note that 1/3-bit scores essentially equal Phred scores
+a.k.a. decibans, because log10(2) ≈ 3/10.) The default is to
+infer a scale from the initial score parameters.</td></tr>
</tbody>
</table>
</blockquote>
@@ -524,11 +531,6 @@ output, so it will override lastal's default.</p>
<li><p class="first">last-train (and lastal) converts between path and alignment
parameters as in Supplementary Section 3.1 of <a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>.</p>
</li>
-<li><p class="first">last-train converts probabilities to scores using an arbitrary scale
-factor. (If all the scores are multiplied by any positive constant,
-it makes no difference to alignment.) This scale is determined by
-the initial score parameters.</p>
-</li>
<li><p class="first">last-train uses parameters with "homogeneous letter probabilities"
and "balanced length probability" (<a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>).</p>
</li>
@@ -536,8 +538,8 @@ and "balanced length probability" (<a class="reference external" href=
inaccurate. It then finds an adjusted scale factor (without
changing the scores), which makes the integer-rounded scores
correspond to homogeneous letter probabilities and balanced length
-probability. It writes this adjusted scale as a "-t" option for
-lastal, e.g. "-t4.4363".</p>
+probability. It writes this adjusted scale (in nats, not bits) as a
+"-t" option for lastal, e.g. "-t4.4363".</p>
</li>
<li><p class="first">In rare cases, it may be impossible to find such an adjusted scale
factor. If that happens, last-train doubles the original scale (to
=====================================
doc/last-train.txt
=====================================
@@ -59,6 +59,12 @@ Training options
shorter, it is never picked. 0 means use everything.
--sample-length=L
Use randomly-chosen chunks of length L.
+ --scale=S
+ Output scores in units of 1/S bits. Traditional values
+ include 2 for half-bit scores and 3 for 1/3-bit scores.
+ (Note that 1/3-bit scores essentially equal Phred scores
+ a.k.a. decibans, because log10(2) ≈ 3/10.) The default is to
+ infer a scale from the initial score parameters.
All options below this point are passed to lastal to do the
alignments: they are described in more detail at `<lastal.html>`_.
@@ -142,11 +148,6 @@ Details
* last-train (and lastal) converts between path and alignment
parameters as in Supplementary Section 3.1 of btz576_.
-* last-train converts probabilities to scores using an arbitrary scale
- factor. (If all the scores are multiplied by any positive constant,
- it makes no difference to alignment.) This scale is determined by
- the initial score parameters.
-
* last-train uses parameters with "homogeneous letter probabilities"
and "balanced length probability" (btz576_).
@@ -154,8 +155,8 @@ Details
inaccurate. It then finds an adjusted scale factor (without
changing the scores), which makes the integer-rounded scores
correspond to homogeneous letter probabilities and balanced length
- probability. It writes this adjusted scale as a "-t" option for
- lastal, e.g. "-t4.4363".
+ probability. It writes this adjusted scale (in nats, not bits) as a
+ "-t" option for lastal, e.g. "-t4.4363".
* In rare cases, it may be impossible to find such an adjusted scale
factor. If that happens, last-train doubles the original scale (to
=====================================
doc/lastdb.html
=====================================
@@ -469,6 +469,10 @@ Default fasta
<p class="last">The fastq formats are described in <a class="reference external" href="lastal.html">lastal.html</a>.</p>
</td></tr>
<tr><td class="option-group">
+<kbd><span class="option">-q</span></kbd></td>
+<td>Interpret the sequences as proteins, use a 21-letter alphabet
+with <tt class="docutils literal">*</tt> meaning STOP, and append <tt class="docutils literal">*</tt> to each sequence.</td></tr>
+<tr><td class="option-group">
<kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
<td>Divide the work between this number of threads running in
parallel. 0 means use as many threads as your computer claims
@@ -594,9 +598,12 @@ bytes.</p>
</li>
<li><p class="first">lastdb8: makes the index twice as big.</p>
</li>
-<li><p class="first">-u, -m: Multiple patterns multiply the index size. For example,
+<li><p class="first">-u, -m, -d: Multiple patterns multiply the index size. For example,
<a class="reference external" href="last-seeds.html">MAM8</a> makes it 8 times bigger.</p>
</li>
+<li><p class="first">-u, -d: may reduce the index, e.g. <a class="reference external" href="last-seeds.html">RY32</a> makes
+it 32 times smaller.</p>
+</li>
<li><p class="first">-s: does not change the total size, but splits it into volumes.</p>
</li>
<li><p class="first">-S2: doubles the size of everything.</p>
=====================================
doc/lastdb.txt
=====================================
@@ -136,6 +136,9 @@ Advanced Options
The fastq formats are described in `<lastal.html>`_.
+ -q Interpret the sequences as proteins, use a 21-letter alphabet
+ with ``*`` meaning STOP, and append ``*`` to each sequence.
+
-P THREADS
Divide the work between this number of threads running in
parallel. 0 means use as many threads as your computer claims
@@ -258,9 +261,12 @@ This is modified by several options.
* lastdb8: makes the index twice as big.
-* -u, -m: Multiple patterns multiply the index size. For example,
+* -u, -m, -d: Multiple patterns multiply the index size. For example,
`MAM8 <last-seeds.html>`_ makes it 8 times bigger.
+* -u, -d: may reduce the index, e.g. `RY32 <last-seeds.html>`_ makes
+ it 32 times smaller.
+
* -s: does not change the total size, but splits it into volumes.
* -S2: doubles the size of everything.
=====================================
scripts/last-train
=====================================
@@ -6,7 +6,7 @@
# [Fri19] How sequence alignment scores correspond to probability models,
# MC Frith, Bioinformatics, 2019.
-from __future__ import print_function
+from __future__ import division, print_function
import gzip
import math
@@ -18,6 +18,8 @@ import subprocess
import sys
import tempfile
+proteinAlphabet20 = "ACDEFGHIKLMNPQRSTVWY"
+
def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
@@ -25,7 +27,7 @@ def myOpen(fileName): # faster than fileinput
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
-def rootOfIncreasingFunction(func, lowerBound, upperBound, *args):
+def rootOfIncreasingFunction(func, lowerBound, upperBound, args):
# Find x such that func(x, *args) == 0
while True:
mid = 0.5 * (lowerBound + upperBound)
@@ -36,6 +38,17 @@ def rootOfIncreasingFunction(func, lowerBound, upperBound, *args):
else:
upperBound = mid
+def rootOfDecreasingFunction(func, lowerBound, upperBound, args):
+ # Find x such that func(x, *args) == 0
+ while True:
+ mid = 0.5 * (lowerBound + upperBound)
+ if mid <= lowerBound or mid >= upperBound:
+ return mid
+ if func(mid, *args) < 0:
+ upperBound = mid
+ else:
+ lowerBound = mid
+
def homogeneousLetterFreqs(scale, matScores):
# Solve the simultaneous equations in Section 2.1 of [Fri19]
expMat = [[math.exp(j / scale) for j in i] for i in matScores]
@@ -162,13 +175,6 @@ def getSeqSample(opts, queryFiles, outfile):
x = y
writeSegment(outfile, x)
-def versionFromHeader(lines):
- for line in lines:
- words = line.split()
- if "version" in words:
- return words[-1]
- raise Exception("couldn't read the version")
-
def scaleFromHeader(lines):
for line in lines:
for i in line.split():
@@ -177,41 +183,38 @@ def scaleFromHeader(lines):
raise Exception("couldn't read the scale")
def countsFromLastOutput(lines, opts):
- numTransitionCounts = 5
- matrix = []
- # use +1 pseudocounts as a kludge to mitigate numerical problems:
- matches = 1.0
- deletes = 2.0 # 1 open + 1 extension
- inserts = 2.0 # 1 open + 1 extension
- delOpens = 1.0
- insOpens = 1.0
+ nTransitions = 5
+ tranCounts = [1.0] * nTransitions # +1 pseudocounts
+ tranCounts[1] = 2.0 # deletes: opens + extensions, so 2 pseudocounts
+ tranCounts[2] = 2.0 # inserts: opens + extensions, so 2 pseudocounts
+ countMatrix = None
alignments = 0 # no pseudocount here
for line in lines:
if line[0] == "s":
strand = line.split()[4] # slow?
if line[0] == "c":
- c = [float(i) for i in line.split()[1:]]
- transCounts = c[-numTransitionCounts:]
- if not matrix:
- matrixSize = int(math.sqrt(len(c) - numTransitionCounts))
- matrix = [[1.0] * matrixSize for i in range(matrixSize)]
- identities = sum(c[i * matrixSize + i] for i in range(matrixSize))
- alignmentLength = transCounts[0] + transCounts[1] + transCounts[2]
- if 100 * identities > opts.pid * alignmentLength: continue
- for i in range(matrixSize):
- for j in range(matrixSize):
- if strand == "+" or opts.S == "0":
- matrix[i][j] += c[i * matrixSize + j]
+ counts = [float(i) for i in line.split()[1:]]
+ if not countMatrix:
+ matrixSize = len(counts) - nTransitions
+ nCols = int(math.sqrt(matrixSize))
+ nRows = matrixSize // nCols
+ countMatrix = [[1.0] * nCols for i in range(nRows)]
+ identities = sum(counts[i * nCols + i] for i in range(nRows))
+ alignmentLength = sum(counts[matrixSize + i] for i in range(3))
+ if 100 * identities > opts.pid * alignmentLength:
+ continue
+ for i in range(nRows):
+ for j in range(nCols):
+ if strand == "+" or opts.S != "1":
+ countMatrix[i][j] += counts[i * nCols + j]
else:
- matrix[-1-i][-1-j] += c[i * matrixSize + j]
- matches += transCounts[0]
- deletes += transCounts[1]
- inserts += transCounts[2]
- delOpens += transCounts[3]
- insOpens += transCounts[4]
+ countMatrix[-1-i][-1-j] += counts[i * nCols + j]
+ for i in range(nTransitions):
+ tranCounts[i] += counts[matrixSize + i]
alignments += 1
- gapCounts = matches, deletes, inserts, delOpens, insOpens, alignments
- return matrix, gapCounts
+ if not alignments:
+ raise Exception("no alignments")
+ return countMatrix, tranCounts + [alignments]
def scoreFromProb(scale, prob):
if prob > 0: logProb = math.log(prob)
@@ -223,7 +226,7 @@ def costFromProb(scale, prob):
def guessAlphabet(matrixSize):
if matrixSize == 4: return "ACGT"
- if matrixSize == 20: return "ACDEFGHIKLMNPQRSTVWY"
+ if matrixSize == 20: return proteinAlphabet20
raise Exception("can't handle unusual alphabets")
def writeMatrixHead(outFile, prefix, alphabet, formatString):
@@ -257,7 +260,6 @@ def matProbsFromCounts(counts, opts):
identities = sum(counts[i][i] for i in r)
total = sum(map(sum, counts))
probs = [[j / total for j in i] for i in counts]
-
print("# substitution percent identity: %g" % (100 * identities / total))
print()
print("# count matrix "
@@ -268,7 +270,6 @@ def matProbsFromCounts(counts, opts):
"(query letters = columns, reference letters = rows):")
writeProbMatrix(sys.stdout, probs, "# ")
print()
-
return probs
def probImbalance(endProb, matchProb, firstDelProb, delExtendProb,
@@ -282,12 +283,11 @@ def balancedEndProb(*args):
matchProb, firstDelProb, delExtendProb, firstInsProb, insExtendProb = args
lowerBound = max(delExtendProb, insExtendProb)
upperBound = 1.0
- r = rootOfIncreasingFunction(probImbalance, lowerBound, upperBound, *args)
- return r
+ return rootOfIncreasingFunction(probImbalance,
+ lowerBound, upperBound, args)
-def gapRatiosFromCounts(counts, opts):
+def gapProbsFromCounts(counts, opts):
matches, deletes, inserts, delOpens, insOpens, alignments = counts
- if not alignments: raise Exception("no alignments")
gaps = deletes + inserts
gapOpens = delOpens + insOpens
denominator = matches + gapOpens + (alignments + 1) # +1 pseudocount
@@ -295,14 +295,13 @@ def gapRatiosFromCounts(counts, opts):
if opts.gapsym:
delOpenProb = gapOpens / denominator / 2
insOpenProb = gapOpens / denominator / 2
- delExtendProb = (gaps - gapOpens) / gaps
- insExtendProb = (gaps - gapOpens) / gaps
+ delGrowProb = (gaps - gapOpens) / gaps
+ insGrowProb = (gaps - gapOpens) / gaps
else:
delOpenProb = delOpens / denominator
insOpenProb = insOpens / denominator
- delExtendProb = (deletes - delOpens) / deletes
- insExtendProb = (inserts - insOpens) / inserts
-
+ delGrowProb = (deletes - delOpens) / deletes
+ insGrowProb = (inserts - insOpens) / inserts
print("# aligned letter pairs: %.12g" % matches)
print("# deletes: %.12g" % deletes)
print("# inserts: %.12g" % inserts)
@@ -314,29 +313,34 @@ def gapRatiosFromCounts(counts, opts):
print("# matchProb: %g" % matchProb)
print("# delOpenProb: %g" % delOpenProb)
print("# insOpenProb: %g" % insOpenProb)
- print("# delExtendProb: %g" % delExtendProb)
- print("# insExtendProb: %g" % insExtendProb)
+ print("# delExtendProb: %g" % delGrowProb)
+ print("# insExtendProb: %g" % insGrowProb)
print()
+ return matchProb, (delOpenProb, delGrowProb), (insOpenProb, insGrowProb)
- delCloseProb = 1 - delExtendProb
+def gapRatiosFromProbs(matchProb, delProbs, insProbs):
+ delOpenProb, delGrowProb = delProbs
+ insOpenProb, insGrowProb = insProbs
+
+ delCloseProb = 1 - delGrowProb
firstDelProb = delOpenProb * delCloseProb
- insCloseProb = 1 - insExtendProb
+ insCloseProb = 1 - insGrowProb
firstInsProb = insOpenProb * insCloseProb
- endProb = balancedEndProb(matchProb, firstDelProb, delExtendProb,
- firstInsProb, insExtendProb)
+ endProb = balancedEndProb(matchProb, firstDelProb, delGrowProb,
+ firstInsProb, insGrowProb)
# probably, endProb is negligibly less than 1
matchRatio = matchProb / (endProb * endProb)
firstDelRatio = firstDelProb / endProb
- delExtendRatio = delExtendProb / endProb
- delRatios = firstDelRatio, delExtendRatio
+ delGrowRatio = delGrowProb / endProb
+ delRatios = firstDelRatio, delGrowRatio
firstInsRatio = firstInsProb / endProb
- insExtendRatio = insExtendProb / endProb
- insRatios = firstInsRatio, insExtendRatio
+ insGrowRatio = insGrowProb / endProb
+ insRatios = firstInsRatio, insGrowRatio
return matchRatio, delRatios, insRatios
@@ -346,10 +350,8 @@ def scoreFromLetterProbs(scale, matchRatio, pairProb, rowProb, colProb):
return scoreFromProb(scale, matchRatio * probRatio)
def matScoresFromProbs(scale, matchRatio, matProbs, rowProbs, colProbs):
- r = range(len(matProbs))
- return [[scoreFromLetterProbs(scale, matchRatio, matProbs[i][j],
- rowProbs[i], colProbs[j]) for j in r]
- for i in r]
+ return [[scoreFromLetterProbs(scale, matchRatio, matProbs[i][j], x, y)
+ for j, y in enumerate(colProbs)] for i, x in enumerate(rowProbs)]
def gapCostsFromProbRatios(scale, firstGapRatio, gapExtendRatio):
# The next addition gets the alignment parameter from the path
@@ -373,28 +375,26 @@ def scoreImbalance(scale, matScores, delCosts, insCosts):
i = imbalanceFromGap(scale, *insCosts)
return 1 / sum(homogeneousLetterFreqs(scale, matScores)) + d + i - 1
-def negScoreImbalance(*args):
- return -scoreImbalance(*args)
-
-def balancedScale(nearScale, *args):
+def balancedScale(imbalanceFunc, nearScale, args):
# Find a scale, near nearScale, with balanced length probability
bump = 1.000001
- value = scoreImbalance(nearScale, *args)
+ rootFinders = rootOfDecreasingFunction, rootOfIncreasingFunction
+ value = imbalanceFunc(nearScale, *args)
if abs(value) <= 0:
return nearScale
oldLower = oldUpper = nearScale
while oldUpper < 2 * nearScale: # xxx ???
newLower = oldLower / bump
- lowerValue = scoreImbalance(newLower, *args)
+ lowerValue = imbalanceFunc(newLower, *args)
if (lowerValue < 0) != (value < 0):
- func = scoreImbalance if value > 0 else negScoreImbalance
- return rootOfIncreasingFunction(func, newLower, oldLower, *args)
+ finder = rootFinders[value > 0]
+ return finder(imbalanceFunc, newLower, oldLower, args)
oldLower = newLower
newUpper = oldUpper * bump
- upperValue = scoreImbalance(newUpper, *args)
+ upperValue = imbalanceFunc(newUpper, *args)
if (upperValue < 0) != (value < 0):
- func = scoreImbalance if value < 0 else negScoreImbalance
- return rootOfIncreasingFunction(func, oldUpper, newUpper, *args)
+ finder = rootFinders[value < 0]
+ return finder(imbalanceFunc, oldUpper, newUpper, args)
oldUpper = newUpper
return 0.0
@@ -403,7 +403,8 @@ def scoresAndScale(originalScale, matParams, delRatios, insRatios):
matScores = matScoresFromProbs(originalScale, *matParams)
delCosts = gapCostsFromProbRatios(originalScale, *delRatios)
insCosts = gapCostsFromProbRatios(originalScale, *insRatios)
- scale = balancedScale(originalScale, matScores, delCosts, insCosts)
+ args = matScores, delCosts, insCosts
+ scale = balancedScale(scoreImbalance, originalScale, args)
if scale > 0:
rowFreqs = homogeneousLetterFreqs(scale, zip(*matScores))
colFreqs = homogeneousLetterFreqs(scale, matScores)
@@ -413,20 +414,21 @@ def scoresAndScale(originalScale, matParams, delRatios, insRatios):
"doubling the scale")
originalScale *= 2
-def writeGapCosts(gapCosts, out):
- firstDelCost, delExtendCost, firstInsCost, insExtendCost = gapCosts
- print("#last -a", firstDelCost - delExtendCost, file=out)
- print("#last -A", firstInsCost - insExtendCost, file=out)
- print("#last -b", delExtendCost, file=out)
- print("#last -B", insExtendCost, file=out)
-
-def printGapCosts(gapCosts):
- firstDelCost, delExtendCost, firstInsCost, insExtendCost = gapCosts
- print("# delExistCost:", firstDelCost - delExtendCost)
- print("# insExistCost:", firstInsCost - insExtendCost)
- print("# delExtendCost:", delExtendCost)
- print("# insExtendCost:", insExtendCost)
- print()
+def writeGapCosts(delCosts, insCosts, isLastFormat, outFile):
+ delInit, delGrow = delCosts
+ insInit, insGrow = insCosts
+ delOpen = delInit - delGrow
+ insOpen = insInit - insGrow
+ if isLastFormat:
+ print("#last -a", delOpen, file=outFile)
+ print("#last -A", insOpen, file=outFile)
+ print("#last -b", delGrow, file=outFile)
+ print("#last -B", insGrow, file=outFile)
+ else:
+ print("# delExistCost:", delOpen, file=outFile)
+ print("# insExistCost:", insOpen, file=outFile)
+ print("# delExtendCost:", delGrow, file=outFile)
+ print("# insExtendCost:", insGrow, file=outFile)
def tryToMakeChildProgramsFindable():
d = os.path.dirname(__file__)
@@ -458,29 +460,46 @@ def fixedLastalArgs(opts, lastalProgName):
if opts.verbose: x.append("-" + "v" * opts.verbose)
return x
+def process(args, inStream):
+ return subprocess.Popen(args, stdin=inStream, stdout=subprocess.PIPE,
+ universal_newlines=True)
+
+def versionFromLastal():
+ args = ["lastal", "--version"]
+ proc = process(args, None)
+ return proc.stdout.read().split()[1]
+
+def lastSplitProcess(opts, proc):
+ splitArgs = ["last-split", "-n", "-m0.01"] # xxx ???
+ proc = process(splitArgs, proc.stdout)
+ if opts.postmask:
+ maskArgs = ["last-postmask"]
+ proc = process(maskArgs, proc.stdout)
+ return proc
+
def doTraining(opts, args):
tryToMakeChildProgramsFindable()
lastalProgName = readLastalProgName(args[0])
scaleIncrease = 20 # while training, up-scale the scores by this amount
- x = fixedLastalArgs(opts, lastalProgName)
- if opts.r: x.append("-r" + opts.r)
- if opts.q: x.append("-q" + opts.q)
- if opts.p: x.append("-p" + opts.p)
- if opts.a: x.append("-a" + opts.a)
- if opts.b: x.append("-b" + opts.b)
- if opts.A: x.append("-A" + opts.A)
- if opts.B: x.append("-B" + opts.B)
- x += args
- y = ["last-split", "-n", "-m0.01"] # xxx ???
- z = ["last-postmask"]
- p = subprocess.Popen(x, stdout=subprocess.PIPE, universal_newlines=True)
- q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
- universal_newlines=True)
- if opts.postmask:
- q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE,
- universal_newlines=True)
- lastalVersion = versionFromHeader(q.stdout)
- externalScale = scaleFromHeader(q.stdout)
+ lastalVersion = versionFromLastal()
+
+ lastalArgs = fixedLastalArgs(opts, lastalProgName)
+ if opts.r: lastalArgs.append("-r" + opts.r)
+ if opts.q: lastalArgs.append("-q" + opts.q)
+ if opts.p: lastalArgs.append("-p" + opts.p)
+ if opts.a: lastalArgs.append("-a" + opts.a)
+ if opts.b: lastalArgs.append("-b" + opts.b)
+ if opts.A: lastalArgs.append("-A" + opts.A)
+ if opts.B: lastalArgs.append("-B" + opts.B)
+ lastalArgs += args
+ proc = process(lastalArgs, None)
+ proc = lastSplitProcess(opts, proc)
+
+ if opts.scale:
+ externalScale = opts.scale / math.log(2)
+ else:
+ externalScale = scaleFromHeader(proc.stdout)
+
internalScale = externalScale * scaleIncrease
oldParameters = []
@@ -491,18 +510,20 @@ def doTraining(opts, args):
print()
while True:
- print("#", *x)
+ print("#", *lastalArgs)
print()
sys.stdout.flush()
- matCounts, gapCounts = countsFromLastOutput(q.stdout, opts)
- matchRatio, delRatios, insRatios = gapRatiosFromCounts(gapCounts, opts)
+ matCounts, gapCounts = countsFromLastOutput(proc.stdout, opts)
+ gapProbs = gapProbsFromCounts(gapCounts, opts)
matProbs = matProbsFromCounts(matCounts, opts)
+ matchRatio, delRatios, insRatios = gapRatiosFromProbs(*gapProbs)
rowProbs = [sum(i) for i in matProbs]
colProbs = [sum(i) for i in zip(*matProbs)]
matParams = matchRatio, matProbs, rowProbs, colProbs
sas = scoresAndScale(internalScale, matParams, delRatios, insRatios)
matScores, delCosts, insCosts, scale = sas
- printGapCosts(delCosts + insCosts)
+ writeGapCosts(delCosts, insCosts, False, None)
+ print()
print("# score matrix "
"(query letters = columns, reference letters = rows):")
writeScoreMatrix(sys.stdout, matScores, "# ")
@@ -510,28 +531,22 @@ def doTraining(opts, args):
parameters = delCosts, insCosts, matScores
if parameters in oldParameters: break
oldParameters.append(parameters)
- x = fixedLastalArgs(opts, lastalProgName)
- x.append("-t{0:.6}".format(scale))
- x.append("-p-")
- x += args
- p = subprocess.Popen(x, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
- universal_newlines=True)
- writeGapCosts(delCosts + insCosts, p.stdin)
- writeScoreMatrix(p.stdin, matScores, "")
- p.stdin.close()
- # in python2.6, the next line must come after p.stdin.close()
- q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
- universal_newlines=True)
- if opts.postmask:
- q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE,
- universal_newlines=True)
+ lastalArgs = fixedLastalArgs(opts, lastalProgName)
+ lastalArgs.append("-t{0:.6}".format(scale))
+ lastalArgs.append("-p-")
+ lastalArgs += args
+ proc = process(lastalArgs, subprocess.PIPE)
+ writeGapCosts(delCosts, insCosts, True, proc.stdin)
+ writeScoreMatrix(proc.stdin, matScores, "")
+ proc.stdin.close()
+ proc = lastSplitProcess(opts, proc)
sas = scoresAndScale(externalScale, matParams, delRatios, insRatios)
matScores, delCosts, insCosts, scale = sas
if opts.X: print("#last -X", opts.X)
if opts.Q: print("#last -Q", opts.Q)
print("#last -t{0:.6}".format(scale))
- writeGapCosts(delCosts + insCosts, sys.stdout)
+ writeGapCosts(delCosts, insCosts, True, None)
if opts.s: print("#last -s", opts.s)
if opts.S: print("#last -S", opts.S)
print("# score matrix "
@@ -559,6 +574,7 @@ if __name__ == "__main__":
op = optparse.OptionParser(usage=usage, description=description)
op.add_option("-v", "--verbose", action="count",
help="show more details of intermediate steps")
+
og = optparse.OptionGroup(op, "Training options")
og.add_option("--revsym", action="store_true",
help="force reverse-complement symmetry")
@@ -574,7 +590,10 @@ if __name__ == "__main__":
help="number of random sequence samples (default: %default)")
og.add_option("--sample-length", type="int", default=2000, metavar="L",
help="length of each sample (default: %default)")
+ og.add_option("--scale", type="float", metavar="S",
+ help="output scores in units of 1/S bits")
op.add_option_group(og)
+
og = optparse.OptionGroup(op, "Initial parameter options")
og.add_option("-r", metavar="SCORE",
help="match score (default: 6 if Q>=1, else 5)")
@@ -588,6 +607,7 @@ if __name__ == "__main__":
og.add_option("-A", metavar="COST", help="insertion existence cost")
og.add_option("-B", metavar="COST", help="insertion extension cost")
op.add_option_group(og)
+
og = optparse.OptionGroup(op, "Alignment options")
og.add_option("-D", metavar="LENGTH",
help="query letters per random alignment (default: 1e6)")
@@ -614,6 +634,7 @@ if __name__ == "__main__":
og.add_option("-Q", metavar="NAME",
help="input format: fastx, sanger (default=fasta)")
op.add_option_group(og)
+
(opts, args) = op.parse_args()
if len(args) < 1:
op.error("I need a lastdb index and query sequences")
=====================================
src/Alignment.cc
=====================================
@@ -51,7 +51,12 @@ void Alignment::makeXdrop( Aligners &aligners, bool isGreedy,
const Alphabet& alph, AlignmentExtras& extras,
double gamma, int outputType ){
score = seed.score;
- if( outputType > 3 ) extras.fullScore = seed.score;
+
+ if (gap.isNewFrameshifts()) {
+ extras.fullScore = -1; // means that "score" is (non-integer) fullScore
+ } else if (outputType > 3) {
+ extras.fullScore = seed.score;
+ }
if( outputType == 7 ){
assert( seed.size > 0 ); // makes things easier to understand
@@ -303,21 +308,21 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
if (gap.isNewFrameshifts()) {
size_t frame2 = isForward ? dnaStart + 2 : dnaStart - 2;
- score += aligner.alignFrame(seq1 + start1, seq2 + start2,
- seq2 + dnaToAa(frame1, frameSize),
- seq2 + dnaToAa(frame2, frameSize),
- isForward, sm, gap, maxDrop);
+ aligner.alignFrame(seq1 + start1, seq2 + start2,
+ seq2 + dnaToAa(frame1, frameSize),
+ seq2 + dnaToAa(frame2, frameSize),
+ isForward, sm, gap, maxDrop);
while (aligner.getNextChunkFrame(end1, end2, size, gapCost, gap))
chunks.push_back(SegmentPair(end1 - size, end2 - size * 3, size,
gapCost));
+ FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
+ double probDropLimit = exp(scale * -maxDrop);
+ double s = fxa.forward(seq1 + start1, seq2 + start2,
+ seq2 + dnaToAa(frame1, frameSize),
+ seq2 + dnaToAa(frame2, frameSize),
+ isForward, probMat, gap, probDropLimit);
+ score += s / scale;
if (outputType > 3) {
- FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
- double probDropLimit = exp(scale * -maxDrop);
- double s = fxa.forward(seq1 + start1, seq2 + start2,
- seq2 + dnaToAa(frame1, frameSize),
- seq2 + dnaToAa(frame2, frameSize),
- isForward, probMat, gap, probDropLimit);
- extras.fullScore += s / scale;
fxa.backward(isForward, probMat, gap);
getColumnCodes(fxa, columnCodes, chunks);
if (outputType == 7) {
=====================================
src/Alignment.hh
=====================================
@@ -32,7 +32,7 @@ struct AlignmentText {
SegmentPair::indexT strandNum;
SegmentPair::indexT queryBeg;
SegmentPair::indexT queryEnd;
- int score;
+ double score;
SegmentPair::indexT alnSize;
SegmentPair::indexT matches;
char *text; // seems to be a bit faster than std::vector<char>
@@ -40,7 +40,7 @@ struct AlignmentText {
AlignmentText() {}
AlignmentText(size_t queryNumIn, size_t queryBegIn, size_t queryEndIn,
- char strandIn, int scoreIn,
+ char strandIn, double scoreIn,
size_t alnSizeIn, size_t matchesIn, char *textIn) :
strandNum(queryNumIn * 2 + (strandIn == '-')),
queryBeg(queryBegIn), queryEnd(queryEndIn), score(scoreIn),
@@ -126,7 +126,7 @@ struct Alignment{
// data:
std::vector<SegmentPair> blocks; // the gapless blocks of the alignment
- int score;
+ double score;
SegmentPair seed; // the alignment remembers its seed
size_t beg1() const{ return blocks.front().beg1(); }
@@ -166,6 +166,7 @@ struct Alignment{
const Alphabet& alph, int translationType,
const uchar *codonToAmino,
const LastEvaluer& evaluer,
+ const AlignmentExtras& extras,
bool isExtraColumns) const;
size_t numColumns(size_t frameSize, bool isCodon) const;
=====================================
src/AlignmentWrite.cc
=====================================
@@ -5,8 +5,10 @@
#include "LastEvaluer.hh"
#include "MultiSequence.hh"
#include "Alphabet.hh"
+
+#include <assert.h>
+
#include <algorithm>
-#include <cassert>
#include <cstdio> // sprintf
using namespace cbrc;
@@ -32,7 +34,8 @@ const char aminoTriplets[] =
"Val" "val"
"Trp" "trp"
"Tyr" "tyr"
- "Xxx" "xxx";
+ "***" "***"
+ "Xaa" "xaa";
// This writes a "size_t" integer into a char buffer ending at "end".
// It writes backwards from the end, because that's easier & faster.
@@ -128,7 +131,7 @@ AlignmentText Alignment::write(const MultiSequence& seq1,
return writeTab(seq1, seq2, seqNum2, translationType, evaluer, extras);
else
return writeBlastTab(seq1, seq2, seqNum2, seqData2, alph, translationType,
- codonToAmino, evaluer, format == 'B');
+ codonToAmino, evaluer, extras, format == 'B');
}
static size_t alignedColumnCount(const std::vector<SegmentPair> &blocks) {
@@ -180,8 +183,12 @@ static size_t writeBlocks(std::vector<char> &text,
return end - e;
}
+static const char *scoreFormat(bool isIntegerScores) {
+ return isIntegerScores ? "%.0f" : "%.1f";
+}
+
static char* writeTags( const LastEvaluer& evaluer, double queryLength,
- int score, double fullScore,
+ double score, double fullScore,
char separator, char* out ){
if( evaluer.isGood() ){
double epa = evaluer.evaluePerArea( score );
@@ -217,7 +224,7 @@ AlignmentText Alignment::writeTab(const MultiSequence& seq1,
size_t seqStart2 = seq2.seqBeg(seqNum2) - seq2.padBeg(seqNum2);
size_t seqLen2 = seq2.seqLen(seqNum2);
- IntText sc(score);
+ FloatText sc(scoreFormat(extras.fullScore >= 0), score);
std::string n1 = seq1.seqName(seqNum1);
char strand1 = seq1.strand(seqNum1);
std::string n2 = seq2.seqName(seqNum2);
@@ -266,9 +273,10 @@ static void putRight(Writer &w, const IntText &t, size_t width) {
}
// Write an "a" line
-static char *writeMafLineA(char *out, int score, const LastEvaluer& evaluer,
+static char *writeMafLineA(char *out, double score, const LastEvaluer& evaluer,
double queryLength, double fullScore) {
- out += std::sprintf(out, "a score=%d", score);
+ out += std::sprintf(out, "a score=");
+ out += std::sprintf(out, scoreFormat(fullScore >= 0), score);
return writeTags(evaluer, queryLength, score, fullScore, ' ', out);
}
@@ -444,6 +452,7 @@ AlignmentText Alignment::writeBlastTab(const MultiSequence& seq1,
int translationType,
const uchar *codonToAmino,
const LastEvaluer& evaluer,
+ const AlignmentExtras& extras,
bool isExtraColumns) const {
size_t alnBeg1 = beg1();
size_t alnEnd1 = end1();
@@ -509,13 +518,16 @@ AlignmentText Alignment::writeBlastTab(const MultiSequence& seq1,
}
IntText s1(seqLen1);
IntText s2(seqLen2);
- IntText sc(score);
+ FloatText sc;
size_t s =
n2.size() + n1.size() + mp.size() + as.size() + mm.size() + go.size() +
b2.size() + e2.size() + b1.size() + e1.size() + 10;
if (evaluer.isGood()) s += ev.size() + bs.size() + 2;
- if (isExtraColumns) s += s1.size() + s2.size() + sc.size() + 3;
+ if (isExtraColumns) {
+ sc.set(scoreFormat(extras.fullScore >= 0), score);
+ s += s1.size() + s2.size() + sc.size() + 3;
+ }
char *text = new char[s + 1];
Writer w(text);
@@ -596,7 +608,7 @@ static char *writeAmino(char *dest, const uchar *beg, const uchar *end,
for (const uchar *i = beg; i < end; ++i) {
unsigned x = toUnmasked[*i];
bool isMasked = (x != *i);
- if (x > 20) x = 20;
+ if (x > 21) x = 21;
unsigned y = (x * 2 + isMasked) * 3;
*dest++ = aminoTriplets[y];
*dest++ = aminoTriplets[y + 1];
=====================================
src/Alphabet.cc
=====================================
@@ -19,6 +19,7 @@ const char* Alphabet::dna = "ACGT";
// U=selenocysteine, O=pyrrolysine, *=stop?
const char* Alphabet::protein = "ACDEFGHIKLMNPQRSTVWY";
+const char* Alphabet::proteinWithStop = "ACDEFGHIKLMNPQRSTVWY*";
void Alphabet::fromString( const std::string& alphString ){
letters = alphString;
=====================================
src/Alphabet.hh
=====================================
@@ -23,6 +23,7 @@ struct Alphabet{
static const char* dna;
static const char* protein;
+ static const char* proteinWithStop;
static const unsigned capacity = 256;
=====================================
src/Centroid.cc
=====================================
@@ -584,16 +584,8 @@ namespace cbrc{
int alphabetSizeIncrement = alphabetSize;
if (!isExtendFwd) alphabetSizeIncrement *= -1;
- const double delOpen = gapCosts.delProbPieces[0].openProb;
- const double delGrow = gapCosts.delProbPieces[0].growProb;
- const double insOpen = gapCosts.insProbPieces[0].openProb;
- const double insGrow = gapCosts.insProbPieces[0].growProb;
-
- const double delInit = delOpen * delGrow; // for 1st letter in a deletion
- const double insInit = insOpen * insGrow; // for 1st letter in an insert
-
- const double delNext = delGrow - delInit;
- const double insNext = insGrow - insInit;
+ double delInitCount = 0; double delNextCount = 0;
+ double insInitCount = 0; double insNextCount = 0;
for( size_t k = 0; k < numAntidiagonals; ++k ){ // loop over antidiagonals
const size_t seq1beg = xa.seq1start( k );
@@ -601,9 +593,6 @@ namespace cbrc{
const double scale2 = scale[k];
const double scale1 = scale[k+1];
- const double scaledDelNext = scale1 * delNext;
- const double scaledInsNext = scale1 * insNext;
-
const size_t scoreEnd = xa.scoreEndIndex( k );
const double* bM0 = &bM[ scoreEnd + xdropPadLen ];
const double* bD0 = &bD[ scoreEnd + xdropPadLen ];
@@ -616,6 +605,9 @@ namespace cbrc{
const double* fI1 = &fI[ vertBeg ];
const double* fM2 = &fM[ diagBeg ];
+ double dNextCount = 0;
+ double iNextCount = 0;
+
const double* bM0last = bM0 + xa.numCellsAndPads( k ) - xdropPadLen - 1;
const uchar* s1 = isExtendFwd ? seq1 + seq1beg : seq1 - seq1beg;
@@ -640,10 +632,10 @@ namespace cbrc{
const double alignProb = xSum * yM * matchProb;
c.emit[letter1][letter2] += alignProb;
c.toMatch += alignProb;
- c.delInit += xSum * yD * delInit;
- c.delNext += xD * yD * scaledDelNext;
- c.insInit += xSum * yI * insInit;
- c.insNext += xI * yI * scaledInsNext;
+ delInitCount += xSum * yD;
+ dNextCount += xD * yD;
+ insInitCount += xSum * yI;
+ iNextCount += xI * yI;
if (bM0 == bM0last) break;
fM2++; fD1++; fI1++;
@@ -673,10 +665,10 @@ namespace cbrc{
countUncertainLetters(c.emit[letter1], alignProb,
alphabetSize, match_score[letter1], lp2);
c.toMatch += alignProb;
- c.delInit += xSum * yD * delInit;
- c.delNext += xD * yD * scaledDelNext;
- c.insInit += xSum * yI * insInit;
- c.insNext += xI * yI * scaledInsNext;
+ delInitCount += xSum * yD;
+ dNextCount += xD * yD;
+ insInitCount += xSum * yI;
+ iNextCount += xI * yI;
if (bM0 == bM0last) break;
fM2++; fD1++; fI1++;
@@ -686,7 +678,26 @@ namespace cbrc{
lp2 -= alphabetSizeIncrement;
}
}
+
+ delNextCount += dNextCount * scale1;
+ insNextCount += iNextCount * scale1;
}
+
+ const double delOpen = gapCosts.delProbPieces[0].openProb;
+ const double delGrow = gapCosts.delProbPieces[0].growProb;
+ const double insOpen = gapCosts.insProbPieces[0].openProb;
+ const double insGrow = gapCosts.insProbPieces[0].growProb;
+
+ const double delInit = delOpen * delGrow; // for 1st letter in a deletion
+ const double insInit = insOpen * insGrow; // for 1st letter in an insert
+
+ const double delNext = delGrow - delInit;
+ const double insNext = insGrow - insInit;
+
+ c.delInit += delInitCount * delInit;
+ c.delNext += delNextCount * delNext;
+ c.insInit += insInitCount * insInit;
+ c.insNext += insNextCount * insNext;
}
} // end namespace cbrc
=====================================
src/GappedXdropAligner.hh
=====================================
@@ -205,13 +205,13 @@ class GappedXdropAligner {
int frameshiftCost);
// Like "align3", but does "new style" DNA-protein alignment.
- int alignFrame(const uchar *seq1,
- const uchar *seq2frame0,
- const uchar *seq2frame1, // the +1 frame
- const uchar *seq2frame2, // the +2 frame
+ int alignFrame(const uchar *protein,
+ const uchar *frame0, // translated DNA, 0 frame
+ const uchar *frame1, // translated DNA, +1 frame
+ const uchar *frame2, // translated DNA, +2 frame
bool isForward,
const ScoreMatrixRow *scorer,
- const GapCosts &gap,
+ const GapCosts &gapCosts,
int maxScoreDrop);
// Use this after alignFrame. The cost of the unaligned region
@@ -220,8 +220,8 @@ class GappedXdropAligner {
bool getNextChunkFrame(size_t &end1,
size_t &end2,
size_t &length,
- int &gapCost, // cost of the gap before this chunk
- const GapCosts &gap);
+ int &costOfNearerGap,
+ const GapCosts &gapCosts);
void writeShape(std::ostream &out) const;
=====================================
src/GappedXdropAlignerFrame.cc
=====================================
@@ -22,33 +22,33 @@ void GappedXdropAligner::initFrame() {
bestAntidiagonal = 0;
}
-int GappedXdropAligner::alignFrame(const uchar *seq1,
- const uchar *seq2frame0,
- const uchar *seq2frame1, // the +1 frame
- const uchar *seq2frame2, // the +2 frame
+int GappedXdropAligner::alignFrame(const uchar *protein,
+ const uchar *frame0,
+ const uchar *frame1,
+ const uchar *frame2,
bool isForward,
const ScoreMatrixRow *scorer,
- const GapCosts &gap,
+ const GapCosts &gapCosts,
int maxScoreDrop) {
if (!isForward) {
- --seq1; --seq2frame0; --seq2frame1; --seq2frame2;
+ --protein; --frame0; --frame1; --frame2;
}
- const_uchar_ptr frames[] = {seq2frame0, seq2frame1, seq2frame2};
+ const_uchar_ptr frames[] = {frame0, frame1, frame2};
const int seqIncrement = isForward ? 1 : -1;
- const int delOpenScore = -gap.delPieces[0].openCost;
- const int insOpenScore = -gap.insPieces[0].openCost;
- const int delScore1 = gap.delScore1;
- const int delScore2 = gap.delScore2;
- const int delScore3 = gap.delScore3;
- const int insScore1 = gap.insScore1;
- const int insScore2 = gap.insScore2;
- const int insScore3 = gap.insScore3;
+ const int delOpenScore = -gapCosts.delPieces[0].openCost;
+ const int insOpenScore = -gapCosts.insPieces[0].openCost;
+ const int delScore1 = gapCosts.delScore1;
+ const int delScore2 = gapCosts.delScore2;
+ const int delScore3 = gapCosts.delScore3;
+ const int insScore1 = gapCosts.insScore1;
+ const int insScore2 = gapCosts.insScore2;
+ const int insScore3 = gapCosts.insScore3;
int runOfDrops = 2;
int runOfEdges = 0;
int numCells = 1;
- size_t seq1end = 1;
+ size_t proteinEnd = 1;
size_t diagPos6 = xdropPadLen - 1;
size_t horiPos5 = xdropPadLen * 2 - 1;
size_t horiPos4 = xdropPadLen * 3 - 1;
@@ -63,11 +63,11 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
initFrame();
for (size_t antidiagonal = 0; /* noop */; ++antidiagonal) {
- const uchar *s1 = seq1;
- const uchar *s2 = frames[antidiagonal % 3];
+ const uchar *s1 = protein;
frames[antidiagonal % 3] += seqIncrement;
+ const uchar *s2 = frames[antidiagonal % 3];
- initAntidiagonal(antidiagonal + 6, seq1end, thisPos, numCells);
+ initAntidiagonal(antidiagonal + 6, proteinEnd, thisPos, numCells);
thisPos += xdropPadLen;
Score *X0 = &xScores[thisPos];
Score *Y0 = &yScores[thisPos];
@@ -83,6 +83,7 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
int minScore = bestScore - maxScoreDrop;
for (int i = 0; i < numCells; ++i) {
+ s2 -= seqIncrement;
int s = scorer[*s1][*s2];
int y1 = Y5[i] + delScore1;
int y2 = Y4[i] + delScore2;
@@ -97,7 +98,6 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
Y0[i] = maxValue(b + delOpenScore, y3);
Z0[i] = maxValue(b + insOpenScore, z3);
s1 += seqIncrement;
- s2 -= seqIncrement;
}
if (newBestScore > bestScore) {
@@ -118,7 +118,7 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
++runOfEdges;
if (runOfEdges == 3) {
++numCells;
- ++seq1end;
+ ++proteinEnd;
runOfEdges = 0;
}
}
@@ -130,7 +130,7 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
if (runOfDrops == 3) {
--numCells;
if (numCells == 0) break;
- seq1 += seqIncrement;
+ protein += seqIncrement;
frames[0] -= seqIncrement;
frames[1] -= seqIncrement;
frames[2] -= seqIncrement;
@@ -152,18 +152,18 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
bool GappedXdropAligner::getNextChunkFrame(size_t &end1,
size_t &end2,
size_t &length,
- int &gapCost,
- const GapCosts &gap) {
+ int &costOfNearerGap,
+ const GapCosts &gapCosts) {
if (bestAntidiagonal == 0) return false;
- const int delOpenScore = -gap.delPieces[0].openCost;
- const int insOpenScore = -gap.insPieces[0].openCost;
- const int delScore1 = gap.delScore1;
- const int delScore2 = gap.delScore2;
- const int delScore3 = gap.delScore3;
- const int insScore1 = gap.insScore1;
- const int insScore2 = gap.insScore2;
- const int insScore3 = gap.insScore3;
+ const int delOpenScore = -gapCosts.delPieces[0].openCost;
+ const int insOpenScore = -gapCosts.insPieces[0].openCost;
+ const int delScore1 = gapCosts.delScore1;
+ const int delScore2 = gapCosts.delScore2;
+ const int delScore3 = gapCosts.delScore3;
+ const int insScore1 = gapCosts.insScore1;
+ const int insScore2 = gapCosts.insScore2;
+ const int insScore3 = gapCosts.insScore3;
end1 = bestSeq1position;
end2 = bestAntidiagonal - bestSeq1position * 3;
@@ -187,7 +187,7 @@ bool GappedXdropAligner::getNextChunkFrame(size_t &end1,
length = end1 - bestSeq1position;
if (bestAntidiagonal == 0) {
- gapCost = 0;
+ costOfNearerGap = 0;
return true;
}
@@ -212,7 +212,7 @@ bool GappedXdropAligner::getNextChunkFrame(size_t &end1,
state = std::max_element(opt, opt + 7) - opt;
} while (state != 0);
- gapCost = opt[0] - scoreHere;
+ costOfNearerGap = opt[0] - scoreHere;
return true;
}
=====================================
src/GeneticCode.cc
=====================================
@@ -49,7 +49,8 @@ void GeneticCode::codeTableSet( const Alphabet& aaAlph, const Alphabet& dnaAlph
{
uchar codon[3];
- genome2residue.assign( UNKNOWN, 'X' );
+ int size = maxDnaAlphabetSize * maxDnaAlphabetSize * maxDnaAlphabetSize;
+ genome2residue.assign(size, 'X');
for( size_t i = 0 ; i < AAs.size() ; i++ ){
char aminoAcid = std::toupper( AAs[i] );
@@ -71,13 +72,13 @@ void GeneticCode::codeTableSet( const Alphabet& aaAlph, const Alphabet& dnaAlph
}
// codons containing DNA delimiters, or lowercase bases
- for( int i = 0 ; i < NumMember ; i++ ){
+ for( int i = 0 ; i < maxDnaAlphabetSize ; i++ ){
codon[0]= dnaAlph.decode[i];
- for( int j = 0 ; j < NumMember ; j++ ){
+ for( int j = 0 ; j < maxDnaAlphabetSize ; j++ ){
codon[1]= dnaAlph.decode[j];
- for( int k = 0 ; k < NumMember ; k++ ){
+ for( int k = 0 ; k < maxDnaAlphabetSize ; k++ ){
codon[2]= dnaAlph.decode[k];
int c = codon2number2( codon, dnaAlph );
@@ -99,15 +100,16 @@ void GeneticCode::codeTableSet( const Alphabet& aaAlph, const Alphabet& dnaAlph
return;
}
-void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
- bool isMaskLowercase) {
- const int n = NumMember;
+const unsigned delimiterCodon = 64;
+const unsigned unknownCodon = 65;
+
+static void initCodonLookup(std::vector<uchar> &lookupFromDnaTriplet,
+ const uchar *ntToNumber, bool isMaskLowercase) {
+ const int n = maxDnaAlphabetSize;
const char dna[] = "ACGTacgt";
const unsigned dnaMax = isMaskLowercase ? 4 : 8;
- const unsigned delimiterCodon = 64;
- const unsigned unknownCodon = 65;
- genome2residue.assign(n * n * n, unknownCodon);
+ lookupFromDnaTriplet.assign(n * n * n, unknownCodon);
for (unsigned i = 0; i < dnaMax; ++i) {
for (unsigned j = 0; j < dnaMax; ++j) {
@@ -116,12 +118,21 @@ void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
uchar b = dna[j]; unsigned y = ntToNumber[b];
uchar c = dna[k]; unsigned z = ntToNumber[c];
unsigned codonNumber = (i % 4) * 16 + (j % 4) * 4 + (k % 4);
- genome2residue[x*n*n + y*n + z] = codonNumber;
+ lookupFromDnaTriplet[x*n*n + y*n + z] = codonNumber;
}
}
}
- setDelimiters(&genome2residue[0], n, delimiterCodon);
+ setDelimiters(&lookupFromDnaTriplet[0], n, delimiterCodon);
+}
+
+void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
+ bool isMaskLowercase, bool isUnmaskLowercase) {
+ initCodonLookup(genome2residue, ntToNumber, isMaskLowercase);
+
+ if (isMaskLowercase && isUnmaskLowercase) {
+ initCodonLookup(genome2residueWithoutLowercaseMasking, ntToNumber, false);
+ }
uchar unknown = 'X';
uchar delimiter = ' ';
@@ -136,24 +147,33 @@ void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
codonToAminoAcid[delimiterCodon] = aaToNumber[delimiter];
}
-//
-void GeneticCode::translate( const uchar* beg, const uchar* end,
- uchar* dest ) const{
- unsigned delimiter = genome2residue[4];
- size_t size = end - beg;
-
- for( size_t i = 0 ; i < 3 ; i++ ){
- for( size_t j = i ; j+2 < size ; j+=3 ){
- *dest++ = translation( beg + j );
- }
+static void doTranslate(const uchar *lookupFromDnaTriplet,
+ const uchar *beg, size_t size, uchar *dest) {
+ const int n = maxDnaAlphabetSize;
+ const unsigned delimiter = lookupFromDnaTriplet[4];
+ for (size_t i = 0; i < 3; ++i) {
+ for (size_t j = i; j + 2 < size; j += 3) {
+ const uchar *b = beg + j;
+ *dest++ = lookupFromDnaTriplet[n * n * b[0] + n * b[1] + b[2]];
+ }
// this ensures that each reading frame has exactly the same size:
- if( i > size % 3 ) *dest++ = delimiter;
+ if (i > size % 3) *dest++ = delimiter;
}
// this ensures that the size of the translated sequence is exactly "size":
- if( size % 3 > 0 ) *dest++ = delimiter;
- if( size % 3 > 1 ) *dest++ = delimiter;
+ if (size % 3 > 0) *dest++ = delimiter;
+ if (size % 3 > 1) *dest++ = delimiter;
+}
+
+void GeneticCode::translate(const uchar *beg, const uchar *end,
+ uchar *dest) const {
+ doTranslate(&genome2residue[0], beg, end - beg, dest);
+}
+
+void GeneticCode::translateWithoutMasking(const uchar *beg, const uchar *end,
+ uchar *dest) const {
+ doTranslate(&genome2residueWithoutLowercaseMasking[0], beg, end - beg, dest);
}
//
=====================================
src/GeneticCode.hh
=====================================
@@ -19,17 +19,20 @@ typedef unsigned char uchar;
class Alphabet;
+const int maxDnaAlphabetSize = 54;
+
class GeneticCode{
private:
std::string AAs;
std::string Base[3];
- static const int NumMember = 54; // DNA member
- static const int UNKNOWN = NumMember*NumMember*NumMember; // unknown residue
std::vector<uchar> genome2residue;
+ std::vector<uchar> genome2residueWithoutLowercaseMasking;
uchar codonToAminoAcid[256];
- static int codon2number( const uchar* codon )
- { return codon[0] * NumMember*NumMember + codon[1] * NumMember + codon[2]; }
+ static int codon2number(const uchar *codon) {
+ const int n = maxDnaAlphabetSize;
+ return n * n * codon[0] + n * codon[1] + codon[2];
+ }
static int codon2number2( const uchar* codon, const Alphabet& dnaAlph );
@@ -49,11 +52,16 @@ class GeneticCode{
// 64=delimiter, 65=unknown). Also setup codonToAminoAcid. Any DNA
// triplet with non-ACGT (or with lowercase if isMaskLowercase is
// true) gets translated to 65=unknown.
+ // If isMaskLowercase and isUnmaskLowercase are both true, also
+ // setup translateWithoutMasking.
void initCodons( const uchar *ntToNumber, const uchar *aaToNumber,
- bool isMaskLowercase );
+ bool isMaskLowercase, bool isUnmaskLowercase );
void translate( const uchar* beg, const uchar* end, uchar* dest ) const;
+ void translateWithoutMasking( const uchar* beg, const uchar* end,
+ uchar* dest ) const;
+
uchar translation( const uchar* codon ) const
{ return genome2residue[ codon2number( codon ) ]; }
=====================================
src/LastEvaluer.cc
=====================================
@@ -4,10 +4,14 @@
#include "GeneticCode.hh"
+#include "alp/sls_falp_alignment_evaluer.hpp"
+
#include <algorithm>
#include <cstring>
#include <iostream>
+#include <random>
+
#define COUNTOF(a) (sizeof (a) / sizeof *(a))
namespace cbrc {
@@ -296,7 +300,7 @@ void LastEvaluer::init(const char *matrixName,
const FrameshiftEvalueParameters &p = frameshiftEvalueParameters[i];
if (isHit(p, geneticCodeName, matrixName, delOpen, delEpen,
frameshiftCost))
- return frameshiftEvaluer.initParameters(p.parameters);
+ return evaluer.initParameters(p.parameters);
}
}
}
@@ -317,22 +321,25 @@ void LastEvaluer::init(const char *matrixName,
std::vector<double> aaFreqs(matrixSize);
copy(letterFreqs1, letterFreqs1 + alphabetSize, aaFreqs.begin());
- if (frameshiftCost > 0) { // with frameshifts:
- if (isGapped) {
- frameshiftEvaluer.initFrameshift(4, matrixSize, codonTable,
- &matrix[0], &ntFreqs[0], &aaFreqs[0],
- delOpen, delEpen, insOpen, insEpen,
- frameshiftCost, true,
- lambdaTolerance, kTolerance,
- 0, maxMegabytes, randomSeed);
- } else {
- frameshiftEvaluer.initGapless(4, matrixSize, codonTable,
- &matrix[0], &ntFreqs[0], &aaFreqs[0]);
- }
+ if (isGapped && frameshiftCost > 0) { // with frameshifts:
+ Sls::FrameshiftAlignmentEvaluer frameshiftEvaluer;
+ frameshiftEvaluer.initFrameshift(4, matrixSize, codonTable,
+ &matrix[0], &ntFreqs[0], &aaFreqs[0],
+ delOpen, delEpen, insOpen, insEpen,
+ frameshiftCost, true,
+ lambdaTolerance, kTolerance,
+ 0, maxMegabytes, randomSeed);
+ const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
+ Sls::AlignmentEvaluerParameters q = {p.lambda, p.K,
+ p.a_I, p.b_I, // !!! no flip
+ p.a_J, p.b_J,
+ p.alpha_I, p.beta_I,
+ p.alpha_J, p.beta_J,
+ p.sigma, p.tau};
+ evaluer.initParameters(q);
} else { // without frameshifts:
std::vector<double> txFreqs(matrixSize);
makeTxFreqs(&txFreqs[0], &ntFreqs[0], codonTable);
-
if (isGapped) {
evaluer.set_gapped_computation_parameters_simplified(maxSeconds);
evaluer.initGapped(matrixSize, &matrix[0], &txFreqs[0], &aaFreqs[0],
@@ -342,6 +349,14 @@ void LastEvaluer::init(const char *matrixName,
} else {
evaluer.initGapless(matrixSize, &matrix[0], &txFreqs[0], &aaFreqs[0]);
}
+ const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+ Sls::AlignmentEvaluerParameters q = {p.lambda, p.K,
+ p.a_J * 3, p.b_J * 3,
+ p.a_I, p.b_I, // !!! flip (I, J)
+ p.alpha_J * 9, p.beta_J * 9,
+ p.alpha_I, p.beta_I,
+ p.sigma * 3, p.tau * 3};
+ evaluer.initParameters(q);
}
} else { // ordinary alignment:
if (isGapped && insOpen == delOpen && insEpen == delEpen) {
@@ -402,31 +417,74 @@ void LastEvaluer::init(const char *matrixName,
}
}
-static int theMinScore(double lambda, double k, double evalue, double area) {
- // do log of evalue separately, to reduce the risk of overflow:
- double s = (std::log(k * area) - std::log(evalue)) / lambda;
- return std::ceil(std::max(1.0, s));
-}
+void LastEvaluer::initFrameshift(const const_dbl_ptr *substitutionProbs,
+ const double *proteinLetterFreqs,
+ int numProteinLetters,
+ const double *tranDnaLetterFreqs,
+ int numTranDnaLetters,
+ const GapCosts &gapCosts,
+ double scale, int verbosity) {
+ FrameshiftXdropAligner aligner;
+ int proteinLength = 200; // xxx long enough to avoid edge effects ???
+ int tranDnaLength = proteinLength * 3;
+ int numOfAlignments = 50; // suggested by Y-K Yu, R Bundschuh, T Hwa, 2002
+ if (verbosity > 1) numOfAlignments = 1000;
+
+ std::mt19937_64 randGen;
+ std::discrete_distribution<> pDist(proteinLetterFreqs,
+ proteinLetterFreqs + numProteinLetters);
+ std::discrete_distribution<> tDist(tranDnaLetterFreqs,
+ tranDnaLetterFreqs + numTranDnaLetters);
+
+ std::vector<uchar> seqs(proteinLength + tranDnaLength);
+ uchar *protein = &seqs[0];
+ uchar *tranDna = protein + proteinLength;
+
+ double probRatioSum = 0;
+
+ for (int i = 0; i < numOfAlignments; ++i) {
+ for (int j = 0; j < proteinLength; ++j) protein[j] = pDist(randGen);
+ for (int j = 0; j < tranDnaLength; ++j) tranDna[j] = tDist(randGen);
+ double p = aligner.maxSumOfProbRatios(protein, proteinLength,
+ tranDna, tranDnaLength,
+ substitutionProbs, gapCosts);
+ probRatioSum += 1 / p;
+ if (verbosity > 1) std::cerr << "simScore: " << (log(p) / scale) << "\n";
+ }
-int LastEvaluer::minScore(double evalue, double area) const {
- if (evaluer.isGood()) {
- const Sls::ALP_set_of_parameters &p = evaluer.parameters();
- return theMinScore(p.lambda, p.K, evalue, area);
- } else if (frameshiftEvaluer.isGood()) {
- const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
- return theMinScore(p.lambda, p.K, evalue, area);
- } else {
- return -1;
+ // max likelihood k = 1 / (m * n * avg[exp(-lambda * score)])
+ double k = numOfAlignments / (proteinLength * tranDnaLength * probRatioSum);
+
+ if (verbosity > 1) {
+ std::cerr << "lambda k m n: " << scale << " " << k << " "
+ << proteinLength << " " << tranDnaLength << "\n";
}
+
+ Sls::AlignmentEvaluerParameters p = {scale, k, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+ evaluer.initParameters(p);
+}
+
+double LastEvaluer::minScore(double evalue, double area) const {
+ const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+ // do log of evalue separately, to reduce the risk of overflow:
+ double s = (log(p.K * area) - log(evalue)) / p.lambda;
+ return std::max(0.0, s);
}
-int LastEvaluer::minScore(double queryLettersPerRandomAlignment) const {
- if (!isGood()) return -1;
+double LastEvaluer::minScore(double queryLettersPerRandomAlignment) const {
double huge = 1e9;
- double f = queryLettersPerRandomAlignment / huge;
- for (int score = 1; ; ++score) {
- double evalue = area(score, huge) * evaluePerArea(score);
- if (evalue * f <= 1) return score;
+ const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+ double x = queryLettersPerRandomAlignment * databaseSeqNum;
+ double beg = 0;
+ double end = log(x * databaseSeqLen * p.K) / p.lambda;
+
+ while (1) {
+ double mid = (beg + end) / 2;
+ if (mid <= beg || mid >= end) return mid;
+ if (evaluer.evalue(mid, huge, databaseSeqLen) < huge / x)
+ end = mid;
+ else
+ beg = mid;
}
}
@@ -434,9 +492,6 @@ void LastEvaluer::writeCommented(std::ostream& out) const {
if (evaluer.isGood()) {
const Sls::ALP_set_of_parameters &p = evaluer.parameters();
out << "# lambda=" << p.lambda << " K=" << p.K << "\n";
- } else if (frameshiftEvaluer.isGood()) {
- const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
- out << "# lambda=" << p.lambda << " K=" << p.K << "\n";
}
}
@@ -444,19 +499,12 @@ void LastEvaluer::writeParameters(std::ostream &out) const {
std::streamsize prec = out.precision(17);
if (evaluer.isGood()) {
const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+ // !!! flip (I, J) order, to match AlignmentEvaluer::initParameters
out << p.lambda << ", " << p.K << ",\n"
- << p.a_I << ", " << p.b_I << ",\n"
<< p.a_J << ", " << p.b_J << ",\n"
- << p.alpha_I << ", " << p.beta_I << ",\n"
- << p.alpha_J << ", " << p.beta_J << ",\n"
- << p.sigma << ", " << p.tau << "\n";
- } else if (frameshiftEvaluer.isGood()) {
- const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
- out << p.lambda << ", " << p.K << ",\n"
<< p.a_I << ", " << p.b_I << ",\n"
- << p.a_J << ", " << p.b_J << ",\n"
- << p.alpha_I << ", " << p.beta_I << ",\n"
<< p.alpha_J << ", " << p.beta_J << ",\n"
+ << p.alpha_I << ", " << p.beta_I << ",\n"
<< p.sigma << ", " << p.tau << "\n";
}
out.precision(prec);
=====================================
src/LastEvaluer.hh
=====================================
@@ -18,12 +18,14 @@
#define LAST_EVALUER_HH
#include "ScoreMatrixRow.hh"
+#include "mcf_frameshift_xdrop_aligner.hh"
#include "alp/sls_alignment_evaluer.hpp"
-#include "alp/sls_falp_alignment_evaluer.hpp"
namespace cbrc {
+using namespace mcf;
+
class GeneticCode;
class LastEvaluer {
@@ -54,6 +56,14 @@ public:
const char *geneticCodeName,
int verbosity);
+ // "new-style" frameshifts, sum-of-paths scores
+ // scale=lambda
+ // The Freqs need not sum to 1
+ void initFrameshift(const const_dbl_ptr *substitutionProbs,
+ const double *proteinLetterFreqs, int numProteinLetters,
+ const double *tranDnaLetterFreqs, int numTranDnaLetters,
+ const GapCosts &gapCosts, double scale, int verbosity);
+
void setSearchSpace(double databaseLength, // number of database letters
double databaseMaxSeqLength, // length of longest seq
double numOfStrands) { // 1 or 2
@@ -66,37 +76,26 @@ public:
}
}
- bool isGood() const
- { return evaluer.isGood() || frameshiftEvaluer.isGood(); }
+ bool isGood() const { return evaluer.isGood(); }
// Don't call this in the "bad" state:
- double evaluePerArea(double score) const {
- return evaluer.isGood() ?
- evaluer.evaluePerArea(score) : frameshiftEvaluer.evaluePerArea(score);
- }
+ double evaluePerArea(double score) const
+ { return evaluer.evaluePerArea(score); }
// Don't call this in the "bad" state or before calling setSearchSpace:
- double area(double score, double queryLength) const {
- return databaseSeqNum *
- (evaluer.isGood()
- ? evaluer.area(score, queryLength, databaseSeqLen)
- : frameshiftEvaluer.area(score, queryLength, databaseSeqLen));
- }
+ double area(double score, double queryLength) const
+ { return databaseSeqNum * evaluer.area(score, queryLength, databaseSeqLen); }
// Don't call this in the "bad" state:
- double bitScore(double score) const {
- return evaluer.isGood() ?
- evaluer.bitScore(score) : frameshiftEvaluer.bitScore(score);
- }
+ double bitScore(double score) const { return evaluer.bitScore(score); }
- // In the "good" state: returns the minimum score with E-value <=
- // "evalue", which is always > 0. Otherwise: returns -1.
- int minScore(double evalue, double area) const;
+ // Returns max(0, score with E-value == "evalue").
+ // Don't call this in the "bad" state.
+ double minScore(double evalue, double area) const;
- // In the "good" state, after calling setSearchSpace: returns the
- // minimum positive score with E-value <= 1 per this many query
- // letters. In the "bad" state: returns -1.
- int minScore(double queryLettersPerRandomAlignment) const;
+ // Returns max(0, score with E-value == 1 per this many query letters).
+ // Don't call this in the "bad" state or before calling setSearchSpace..
+ double minScore(double queryLettersPerRandomAlignment) const;
// Writes some parameters preceded by "#". Does nothing in the "bad" state.
void writeCommented(std::ostream& out) const;
@@ -106,7 +105,6 @@ public:
private:
Sls::AlignmentEvaluer evaluer;
- Sls::FrameshiftAlignmentEvaluer frameshiftEvaluer;
double databaseSeqLen;
double databaseSeqNum;
};
=====================================
src/LastalArguments.cc
=====================================
@@ -3,12 +3,14 @@
#include "LastalArguments.hh"
#include "stringify.hh"
#include "getoptUtil.hh"
+
+#include <math.h>
+
#include <algorithm> // max
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <cctype>
-#include <cmath> // log
#include <cstring> // strtok
#include <cstdlib> // EXIT_SUCCESS
@@ -561,7 +563,7 @@ static int percent(int val, int percentage) {
return (percentage == 100) ? val : val * percentage / 100;
}
-void LastalArguments::setDefaultsFromMatrix(double lambda, int minScore,
+void LastalArguments::setDefaultsFromMatrix(double lambda, double minScore,
double maxEvalueDefault) {
if (maxEvalue < 0) maxEvalue = maxEvalueDefault;
if( outputType < 2 && minScoreGapped < 0 ) minScoreGapped = minScoreGapless;
@@ -572,11 +574,15 @@ can't calculate E-values: maybe the mismatch or gap costs are too weak.\n\
To proceed without E-values, set a score threshold with option -e.");
minScoreGapped = minScore;
}
- if( outputType < 2 ) minScoreGapless = minScoreGapped;
+
+ if (minScoreGapped > INT_MAX)
+ ERR("the alignment score threshold is too big");
+
+ if (outputType < 2) minScoreGapless = ceil(minScoreGapped);
if( temperature < 0 ) temperature = 1 / lambda;
- int defaultMaxScoreDrop = std::max(minScoreGapped - 1, 0);
+ int defaultMaxScoreDrop = std::max(ceil(minScoreGapped) - 1, 0.0); // xxx
defaultMaxScoreDrop = std::max(defaultMaxScoreDrop, maxDropGapless);
if (maxDropFinalSuffix == '%') {
=====================================
src/LastalArguments.hh
=====================================
@@ -2,16 +2,17 @@
// This struct holds the command line arguments for lastal.
-#ifndef LASTALARGUMENTS_HH
-#define LASTALARGUMENTS_HH
+#ifndef LASTAL_ARGUMENTS_HH
+#define LASTAL_ARGUMENTS_HH
#include "SequenceFormat.hh"
-#include <climits>
-#include <string>
+#include <limits.h>
+#include <stddef.h>
+
#include <iosfwd>
+#include <string>
#include <vector>
-#include <stddef.h> // size_t
namespace cbrc{
@@ -39,7 +40,7 @@ struct LastalArguments{
bool isCaseSensitiveSeeds, bool isVolumes,
size_t refMinimizerWindow,
unsigned realNumOfThreads );
- void setDefaultsFromMatrix(double lambda, int minScore,
+ void setDefaultsFromMatrix(double lambda, double minScore,
double maxEvalueDefault);
// write the parameter settings, starting each line with "#":
@@ -80,7 +81,7 @@ struct LastalArguments{
int maskLowercase;
double maxEvalue;
double queryLettersPerRandomAlignment;
- int minScoreGapped;
+ double minScoreGapped;
int minScoreGapless;
int matchScore;
int mismatchCost;
@@ -121,5 +122,6 @@ struct LastalArguments{
int inputStart; // index in argv of first input filename
};
-} // end namespace cbrc
-#endif // LASTALARGUMENTS_HH
+}
+
+#endif
=====================================
src/LastdbArguments.cc
=====================================
@@ -27,6 +27,7 @@ using namespace cbrc;
LastdbArguments::LastdbArguments() :
isProtein(false),
+ isAddStops(false),
isKeepLowercase(true),
tantanSetting(0),
isCaseSensitive(false),
@@ -72,6 +73,7 @@ Advanced Options (default settings):\n\
+ stringify(strand) + ")\n\
-s: volume size (unlimited)\n\
-Q: input format: fastx, keep, sanger, solexa, illumina (default=fasta)\n\
+-q: interpret the sequences as proteins and append */STOP\n\
-P: number of parallel threads ("
+ stringify(numOfThreads) + ")\n\
-m: seed patterns (1=match, 0=anything, @=transition)\n\
@@ -92,7 +94,7 @@ Report bugs to: last-align (ATmark) googlegroups (dot) com\n\
LAST home page: http://last.cbrc.jp/\n\
";
- static const char sOpts[] = "hVpR:cm:d:S:s:w:W:P:u:a:i:b:B:C:xvQ:";
+ static const char sOpts[] = "hVpqR:cm:d:S:s:w:W:P:u:a:i:b:B:C:xvQ:";
int c;
while ((c = myGetopt(argc, argv, sOpts)) != -1) {
@@ -108,6 +110,9 @@ LAST home page: http://last.cbrc.jp/\n\
case 'p':
isProtein = true;
break;
+ case 'q':
+ isAddStops = true;
+ break;
case 'R':
if( optarg[0] < '0' || optarg[0] > '1' ) badopt( c, optarg );
if( optarg[1] < '0' || optarg[1] > '2' ) badopt( c, optarg );
=====================================
src/LastdbArguments.hh
=====================================
@@ -31,6 +31,7 @@ struct LastdbArguments{
// options:
bool isProtein;
+ bool isAddStops;
bool isKeepLowercase;
int tantanSetting;
bool isCaseSensitive;
=====================================
src/MultiSequence.cc
=====================================
@@ -9,13 +9,15 @@
using namespace cbrc;
-void MultiSequence::initForAppending( indexT padSizeIn ){
+void MultiSequence::initForAppending(indexT padSizeIn,
+ bool isAppendStopSymbol) {
padSize = padSizeIn;
seq.v.assign( padSize, ' ' );
ends.v.assign( 1, padSize );
names.v.clear();
nameEnds.v.assign( 1, 0 );
qualityScoresPerLetter = 0;
+ isAppendingStopSymbol = isAppendStopSymbol;
}
void MultiSequence::reinitForAppending(){
@@ -98,7 +100,12 @@ MultiSequence::appendFromFasta( std::istream& stream, indexT maxSeqLen ){
c = buf->snextc();
}
- if (isRoomToAppendPad(maxSeqLen)) finish();
+ if (seq.v.size() <= maxSeqLen &&
+ padSize + isAppendingStopSymbol <= maxSeqLen - seq.v.size()) {
+ if (isAppendingStopSymbol) seq.v.push_back('*');
+ finish();
+ }
+
return stream;
}
=====================================
src/MultiSequence.hh
=====================================
@@ -24,7 +24,7 @@ class MultiSequence{
typedef LAST_INT_TYPE indexT;
// initialize with leftmost delimiter pad, ready for appending sequences
- void initForAppending( indexT padSizeIn );
+ void initForAppending(indexT padSizeIn, bool isAppendStopSymbol = false);
// re-initialize, but keep the last sequence if it is unfinished
void reinitForAppending();
@@ -141,6 +141,7 @@ class MultiSequence{
VectorOrMmap<uchar> qualityScores;
size_t qualityScoresPerLetter;
bool isReadingFastq;
+ bool isAppendingStopSymbol;
// Read a fasta/fastq header: read the whole line but store just the
// 1st word
=====================================
src/lastal.cc
=====================================
@@ -311,14 +311,21 @@ static void calculateScoreStatistics(const std::string& matrixName,
try{
gaplessEvaluer.init(0, 0, 0, alph.letters.c_str(), scoreMat, p1, p2, false,
0, 0, 0, 0, fsCost, geneticCode, 0, 0);
- if (gapCosts.isNewFrameshifts() && isGapped) return;
- const mcf::GapCosts::Piece &del = gapCosts.delPieces[0];
- const mcf::GapCosts::Piece &ins = gapCosts.insPieces[0];
- evaluer.init(canonicalMatrixName, args.matchScore, args.mismatchCost,
- alph.letters.c_str(), scoreMat, p1, p2, isGapped,
- del.openCost, del.growCost, ins.openCost, ins.growCost,
- fsCost, geneticCode,
- args.geneticCodeFile.c_str(), args.verbosity);
+ if (gapCosts.isNewFrameshifts() && isGapped) {
+ if (args.temperature < 0) return;
+ unsigned alphSize2 = scoreMatrix.isCodonCols() ? 64 : alph.size;
+ evaluer.initFrameshift(fwdMatrices.ratios, p1, alph.size,
+ stats.letterProbs2(), alphSize2,
+ gapCosts, stats.lambda(), args.verbosity);
+ } else {
+ const mcf::GapCosts::Piece &del = gapCosts.delPieces[0];
+ const mcf::GapCosts::Piece &ins = gapCosts.insPieces[0];
+ evaluer.init(canonicalMatrixName, args.matchScore, args.mismatchCost,
+ alph.letters.c_str(), scoreMat, p1, p2, isGapped,
+ del.openCost, del.growCost, ins.openCost, ins.growCost,
+ fsCost, geneticCode,
+ args.geneticCodeFile.c_str(), args.verbosity);
+ }
countT m = std::min(refMaxSeqLen, refLetters);
evaluer.setSearchSpace(refLetters, m, args.numOfStrands());
if( args.verbosity > 0 ) evaluer.writeParameters( std::cerr );
@@ -453,8 +460,14 @@ static const ScoreMatrixRow *getQueryPssm(const LastAligner &aligner,
namespace Phase{ enum Enum{ gapless, gapped, final }; }
+static bool isFullScoreThreshold() {
+ return gapCosts.isNewFrameshifts();
+}
+
static bool isMaskLowercase(Phase::Enum e) {
- return (e < 1 && args.maskLowercase > 0) || args.maskLowercase > 2;
+ return (e < 1 && args.maskLowercase > 0)
+ || (e < 2 && args.maskLowercase > 1 && isFullScoreThreshold())
+ || args.maskLowercase > 2;
}
struct Dispatcher{
@@ -751,8 +764,9 @@ void alignGapped( LastAligner& aligner,
if( aln.score < args.minScoreGapped ) continue;
- if( !aln.isOptimal( dis.a, dis.b, args.globality, dis.m, dis.d, gapCosts,
- frameSize, dis.p, dis.t, dis.i, dis.j ) ){
+ if (!isFullScoreThreshold() &&
+ !aln.isOptimal(dis.a, dis.b, args.globality, dis.m, dis.d, gapCosts,
+ frameSize, dis.p, dis.t, dis.i, dis.j)) {
// If retained, non-"optimal" alignments can hide "optimal"
// alignments, e.g. during non-redundantization.
continue;
@@ -830,8 +844,8 @@ static void eraseWeakAlignments(LastAligner &aligner, AlignmentPot &gappedAlns,
Dispatcher dis(Phase::gapless, aligner, queryNum, matrices, querySeq);
for (size_t i = 0; i < gappedAlns.size(); ++i) {
Alignment &a = gappedAlns.items[i];
- if (!a.hasGoodSegment(dis.a, dis.b, args.minScoreGapped, dis.m, gapCosts,
- frameSize, dis.p, dis.t, dis.i, dis.j)) {
+ if (!a.hasGoodSegment(dis.a, dis.b, ceil(args.minScoreGapped), dis.m,
+ gapCosts, frameSize, dis.p, dis.t, dis.i, dis.j)) {
AlignmentPot::mark(a);
}
}
@@ -911,44 +925,69 @@ void makeQualityPssm( LastAligner& aligner,
}
}
+static void unmaskLowercase(LastAligner &aligner, size_t queryNum,
+ const SubstitutionMatrices &matrices,
+ uchar *querySeq) {
+ makeQualityPssm(aligner, queryNum, matrices, querySeq, false);
+ if (scoreMatrix.isCodonCols()) {
+ const uchar *q = query.seqReader();
+ geneticCode.translateWithoutMasking(q + query.padBeg(queryNum),
+ q + query.padEnd(queryNum), querySeq);
+ }
+}
+
+static void remaskLowercase(LastAligner &aligner, size_t queryNum,
+ const SubstitutionMatrices &matrices,
+ uchar *querySeq) {
+ makeQualityPssm(aligner, queryNum, matrices, querySeq, true);
+ if (scoreMatrix.isCodonCols()) {
+ const uchar *q = query.seqReader();
+ geneticCode.translate(q + query.padBeg(queryNum),
+ q + query.padEnd(queryNum), querySeq);
+ }
+}
+
// Scan one query sequence against one database volume
void scan(LastAligner& aligner, size_t queryNum,
- const SubstitutionMatrices &matrices, const uchar* querySeq) {
+ const SubstitutionMatrices &matrices, uchar *querySeq) {
if( args.outputType == 0 ){ // we just want match counts
countMatches( queryNum, querySeq );
return;
}
- bool isMask = (args.maskLowercase > 0);
- makeQualityPssm(aligner, queryNum, matrices, querySeq, isMask);
+ const int maskMode = args.maskLowercase;
+ makeQualityPssm(aligner, queryNum, matrices, querySeq, maskMode > 0);
SegmentPairPot gaplessAlns;
alignGapless(aligner, gaplessAlns, queryNum, matrices, querySeq);
if( args.outputType == 1 ) return; // we just want gapless alignments
if( gaplessAlns.size() == 0 ) return;
- if( args.maskLowercase == 1 || args.maskLowercase == 2 )
- makeQualityPssm(aligner, queryNum, matrices, querySeq, false);
+ if (maskMode == 1 || (maskMode == 2 && !isFullScoreThreshold()))
+ unmaskLowercase(aligner, queryNum, matrices, querySeq);
AlignmentPot gappedAlns;
- if( args.maxDropFinal != args.maxDropGapped ){
+ if (args.maxDropFinal != args.maxDropGapped
+ || (maskMode == 2 && isFullScoreThreshold())) {
alignGapped( aligner, gappedAlns, gaplessAlns,
queryNum, matrices, querySeq, Phase::gapped );
erase_if( gaplessAlns.items, SegmentPairPot::isNotMarkedAsGood );
+ if (maskMode == 2 && isFullScoreThreshold())
+ unmaskLowercase(aligner, queryNum, matrices, querySeq);
}
alignGapped( aligner, gappedAlns, gaplessAlns,
queryNum, matrices, querySeq, Phase::final );
if( gappedAlns.size() == 0 ) return;
- if (args.maskLowercase == 2) {
- makeQualityPssm(aligner, queryNum, matrices, querySeq, true);
+ if (maskMode == 2 && !isFullScoreThreshold()) {
+ remaskLowercase(aligner, queryNum, matrices, querySeq);
eraseWeakAlignments(aligner, gappedAlns, queryNum, matrices, querySeq);
LOG2("lowercase-filtered alignments=" << gappedAlns.size());
if (gappedAlns.size() == 0) return;
if (args.outputType > 3)
- makeQualityPssm(aligner, queryNum, matrices, querySeq, false);
+ unmaskLowercase(aligner, queryNum, matrices, querySeq);
}
if( args.outputType > 2 ){ // we want non-redundant alignments
@@ -961,9 +1000,9 @@ void scan(LastAligner& aligner, size_t queryNum,
}
static void tantanMaskOneQuery(size_t queryNum, uchar *querySeq) {
- size_t beg = query.seqBeg(queryNum) - query.padBeg(queryNum);
- size_t end = query.seqEnd(queryNum) - query.padBeg(queryNum);
- tantanMasker.mask(querySeq + beg, querySeq + end, alph.numbersToLowercase);
+ size_t b = query.seqBeg(queryNum) - query.padBeg(queryNum);
+ size_t e = query.seqEnd(queryNum) - query.padBeg(queryNum);
+ tantanMasker.mask(querySeq + b, querySeq + e, queryAlph.numbersToLowercase);
}
static void tantanMaskTranslatedQuery(size_t queryNum, uchar *querySeq) {
@@ -984,28 +1023,43 @@ static void tantanMaskTranslatedQuery(size_t queryNum, uchar *querySeq) {
// after optionally translating and/or masking the query
void translateAndScan(LastAligner& aligner,
size_t queryNum, const SubstitutionMatrices &matrices) {
- const uchar* querySeq = query.seqReader() + query.padBeg(queryNum);
+ uchar *querySeqs = query.seqWriter();
+ uchar *querySeq = querySeqs + query.padBeg(queryNum);
std::vector<uchar> modifiedQuery;
size_t size = query.padLen(queryNum);
- if( args.isTranslated() ){
- modifiedQuery.resize( size );
- geneticCode.translate( querySeq, querySeq + size, &modifiedQuery[0] );
- if( args.tantanSetting && !scoreMatrix.isCodonCols() ){
- tantanMaskTranslatedQuery( queryNum, &modifiedQuery[0] );
+ if (args.isTranslated()) {
+ if (args.tantanSetting && scoreMatrix.isCodonCols()) {
+ if (args.isKeepLowercase) {
+ err("can't keep lowercase & find simple repeats & use codons");
+ }
+ tantanMaskOneQuery(queryNum, querySeq);
}
+ modifiedQuery.resize(size);
+ geneticCode.translate(querySeq, querySeq + size, &modifiedQuery[0]);
querySeq = &modifiedQuery[0];
- }else{
- if( args.tantanSetting ){
- modifiedQuery.assign( querySeq, querySeq + size );
- tantanMaskOneQuery( queryNum, &modifiedQuery[0] );
- querySeq = &modifiedQuery[0];
+ if (args.tantanSetting && !scoreMatrix.isCodonCols()) {
+ tantanMaskTranslatedQuery(queryNum, querySeq);
+ }
+ } else {
+ if (args.tantanSetting) {
+ if (args.isKeepLowercase) {
+ modifiedQuery.assign(querySeq, querySeq + size);
+ querySeq = &modifiedQuery[0];
+ }
+ tantanMaskOneQuery(queryNum, querySeq);
}
}
size_t oldNumOfAlns = aligner.textAlns.size();
scan( aligner, queryNum, matrices, querySeq );
cullFinalAlignments( aligner.textAlns, oldNumOfAlns );
+
+ if (args.tantanSetting && !args.isKeepLowercase) {
+ for (size_t i = query.seqBeg(queryNum); i < query.seqEnd(queryNum); ++i) {
+ querySeqs[i] = queryAlph.numbersToUppercase[querySeqs[i]];
+ }
+ }
}
static void alignOneQuery(LastAligner &aligner,
@@ -1103,7 +1157,7 @@ int calcMinScoreGapless(double numLettersInReference) {
// This attempts to ensure that the gapped alignment phase will be
// reasonably fast relative to the gapless alignment phase.
- if (!gaplessEvaluer.isGood()) return args.minScoreGapped;
+ if (!gaplessEvaluer.isGood()) return ceil(args.minScoreGapped);
double n = args.maxGaplessAlignmentsPerQueryPosition;
if (args.maxGaplessAlignmentsPerQueryPosition + 1 == 0) n = 10; // ?
@@ -1118,8 +1172,9 @@ int calcMinScoreGapless(double numLettersInReference) {
// trial-and-error. It should depend on the relative speeds of
// gapless and gapped extensions.
- int s = gaplessEvaluer.minScore(e, numLettersInReference);
- return std::min(s, args.minScoreGapped);
+ double s = gaplessEvaluer.minScore(e, numLettersInReference);
+ s = std::max(1.0, s);
+ return ceil(std::min(s, args.minScoreGapped));
}
// Read one database volume
@@ -1251,10 +1306,7 @@ void lastal( int argc, char** argv ){
isKeepRefLowercase, refTantanSetting,
isCaseSensitiveSeeds, isMultiVolume,
refMinimizerWindow, aligners.size() );
- if( args.tantanSetting )
- tantanMasker.init( isProtein, args.tantanSetting > 1,
- alph.letters, alph.encode );
- makeScoreMatrix( matrixName, matrixFile );
+ makeScoreMatrix(matrixName, matrixFile);
gapCosts.assign(args.delOpenCosts, args.delGrowCosts,
args.insOpenCosts, args.insGrowCosts,
args.frameshiftCosts, args.gapPairCost,
@@ -1267,25 +1319,38 @@ void lastal( int argc, char** argv ){
geneticCode.fromString(GeneticCode::stringFromName(args.geneticCodeFile));
if (scoreMatrix.isCodonCols()) {
geneticCode.initCodons(queryAlph.encode, alph.encode,
- args.maskLowercase > 2);
+ args.maskLowercase > 0, args.maskLowercase < 3);
} else {
geneticCode.codeTableSet( alph, queryAlph );
}
query.initForAppending(3);
- }
- else{
+ } else {
queryAlph = alph;
query.initForAppending(1);
}
+ if (args.tantanSetting) {
+ bool isAtRichDna = (args.tantanSetting > 1);
+ if (scoreMatrix.isCodonCols()) {
+ tantanMasker.init(0, isAtRichDna, queryAlph.letters, queryAlph.encode);
+ } else {
+ tantanMasker.init(isProtein, isAtRichDna, alph.letters, alph.encode);
+ }
+ }
+
if (args.outputType > 0) {
calculateScoreStatistics(matrixName, refLetters, refMaxSeqLen);
}
- int minScore = (args.maxEvalue > 0) ? evaluer.minScore(args.maxEvalue, 1e18)
- : evaluer.minScore(args.queryLettersPerRandomAlignment);
-
- double eg2 = evaluer.isGood() ? 1e18 * evaluer.evaluePerArea(minScore) : -1;
+ double minScore = -1;
+ double eg2 = -1;
+ if (evaluer.isGood()) {
+ minScore = (args.maxEvalue > 0) ? evaluer.minScore(args.maxEvalue, 1e18)
+ : evaluer.minScore(args.queryLettersPerRandomAlignment);
+ if (!isFullScoreThreshold() || args.outputType < 2)
+ minScore = ceil(std::max(1.0, minScore));
+ eg2 = 1e18 * evaluer.evaluePerArea(minScore);
+ }
args.setDefaultsFromMatrix(fwdMatrices.stats.lambda(), minScore, eg2);
minScoreGapless = calcMinScoreGapless(refLetters);
=====================================
src/lastdb.cc
=====================================
@@ -27,6 +27,7 @@ typedef unsigned long long countT;
// Set up an alphabet (e.g. DNA or protein), based on the user options
void makeAlphabet( Alphabet& alph, const LastdbArguments& args ){
if( !args.userAlphabet.empty() ) alph.fromString( args.userAlphabet );
+ else if( args.isAddStops ) alph.fromString( alph.proteinWithStop );
else if( args.isProtein ) alph.fromString( alph.protein );
else alph.fromString( alph.dna );
}
@@ -343,7 +344,7 @@ void lastdb( int argc, char** argv ){
LOG("wordLength=" << wordsFinder.wordLength);
MultiSequence multi;
- multi.initForAppending(1);
+ multi.initForAppending(1, args.isAddStops);
alph.tr(multi.seqWriter(), multi.seqWriter() + multi.seqBeg(0));
unsigned volumeNumber = 0;
countT sequenceCount = 0;
@@ -365,7 +366,7 @@ void lastdb( int argc, char** argv ){
if (sequenceCount == 0) {
maxSeqLen = maxLettersPerVolume(args, wordsFinder,
multi.qualsPerLetter(), seeds.size());
- if (!args.isProtein && args.userAlphabet.empty() &&
+ if (!args.isProtein && !args.isAddStops && args.userAlphabet.empty() &&
isDubiousDna(alph, multi)) {
std::cerr << args.programName << ": that's some funny-lookin DNA\n";
}
=====================================
src/mcf_frameshift_xdrop_aligner.cc
=====================================
@@ -6,6 +6,7 @@
#include <assert.h>
#include <math.h>
+#include <algorithm>
//#include <iostream> // for debugging
namespace mcf {
@@ -366,14 +367,141 @@ void FrameshiftXdropAligner::count(bool isRightwardExtension,
}
}
- transitionCounts[1] += gapCosts.insProb1 * ins1;
- transitionCounts[2] += gapCosts.insProb2 * ins2;
- transitionCounts[3] += gapCosts.insProb3 * ins3;
- transitionCounts[4] += gapCosts.insProb3 * (insNext - ins3);
+ transitionCounts[1] += gapCosts.delProb3 * delNext; // deleted whole codons
+ transitionCounts[2] += gapCosts.insProb3 * insNext; // inserted whole codons
+ transitionCounts[3] += gapCosts.delProb3 * del3; // in-frame opens/closes
+ transitionCounts[4] += gapCosts.insProb3 * ins3; // in-frame opens/closes
+
transitionCounts[5] += gapCosts.delProb1 * del1;
transitionCounts[6] += gapCosts.delProb2 * del2;
- transitionCounts[7] += gapCosts.delProb3 * del3;
- transitionCounts[8] += gapCosts.delProb3 * (delNext - del3);
+ transitionCounts[7] += gapCosts.insProb1 * ins1;
+ transitionCounts[8] += gapCosts.insProb2 * ins2;
+}
+
+double FrameshiftXdropAligner::maxSumOfProbRatios(const uchar *protein,
+ int proteinLength,
+ const uchar *tranDna,
+ int tranDnaLength,
+ const const_dbl_ptr *substitutionProbs,
+ const GapCosts &gapCosts) {
+ const double delOpenProb = gapCosts.delProbPieces[0].openProb;
+ const double insOpenProb = gapCosts.insProbPieces[0].openProb;
+ const double delProb1 = gapCosts.delProb1;
+ const double delProb2 = gapCosts.delProb2;
+ const double delProb3 = gapCosts.delProb3;
+ const double insProb1 = gapCosts.insProb1;
+ const double insProb2 = gapCosts.insProb2;
+ const double insProb3 = gapCosts.insProb3;
+
+ const int proteinSize = proteinLength + 1;
+ const int tranDnaSize = tranDnaLength + 1;
+ xFwdProbs.resize(tranDnaSize + proteinSize * tranDnaSize);
+ double *delRow = &xFwdProbs[0];
+ double *newRow = delRow + tranDnaSize;
+ double Y1, Y2, Y3, Z1, Z2, Z3;
+
+ Z1 = Z2 = Z3 = 0;
+ for (int j = 0; j < tranDnaSize; ++j) {
+ double z1 = Z1 * insProb1;
+ double z2 = Z2 * insProb2;
+ double z3 = Z3 * insProb3;
+ double b = z1 + z2 + z3 + 1;
+ newRow[j] = b;
+ delRow[j] = b * delOpenProb;
+ Z3 = Z2;
+ Z2 = Z1;
+ Z1 = b * insOpenProb + z3;
+ }
+
+ for (int i = 1; i < proteinSize; ++i) {
+ const double *substitutionRow = substitutionProbs[protein[i-1]];
+ const double *oldRow = newRow;
+ newRow += tranDnaSize;
+ Y2 = Z2 = Z3 = 0;
+
+ {
+ Y3 = delRow[0];
+ double y3 = Y3 * delProb3;
+ double b = y3 + 1;
+ newRow[0] = b;
+ delRow[0] = b * delOpenProb + y3;
+ Z1 = b * insOpenProb;
+ }
+
+ for (int j = 1; j < tranDnaSize; ++j) {
+ Y1 = Y2;
+ Y2 = Y3;
+ Y3 = delRow[j];
+ double x = oldRow[j-1] * substitutionRow[tranDna[j-1]];
+ double y1 = Y1 * delProb1;
+ double y2 = Y2 * delProb2;
+ double y3 = Y3 * delProb3;
+ double z1 = Z1 * insProb1;
+ double z2 = Z2 * insProb2;
+ double z3 = Z3 * insProb3;
+ double b = x + y1 + y2 + y3 + z1 + z2 + z3 + 1;
+ newRow[j] = b;
+ delRow[j] = b * delOpenProb + y3;
+ Z3 = Z2;
+ Z2 = Z1;
+ Z1 = b * insOpenProb + z3;
+ }
+ }
+
+ double maxValue = 0;
+
+ Z1 = Z2 = Z3 = 0;
+ for (int j = tranDnaSize; j-- > 0;) {
+ double z1 = Z1 * insProb1;
+ double z2 = Z2 * insProb2;
+ double z3 = Z3 * insProb3;
+ double b = z1 + z2 + z3 + 1;
+ maxValue = std::max(maxValue, b * newRow[j]);
+ newRow[j] = b;
+ delRow[j] = b * delOpenProb;
+ Z3 = Z2;
+ Z2 = Z1;
+ Z1 = b * insOpenProb + z3;
+ }
+
+ for (int i = proteinLength; i-- > 0;) {
+ const double *substitutionRow = substitutionProbs[protein[i]];
+ const double *oldRow = newRow;
+ newRow -= tranDnaSize;
+ Y2 = Z2 = Z3 = 0;
+
+ {
+ Y3 = delRow[tranDnaLength];
+ double y3 = Y3 * delProb3;
+ double b = y3 + 1;
+ maxValue = std::max(maxValue, b * newRow[tranDnaLength]);
+ newRow[tranDnaLength] = b;
+ delRow[tranDnaLength] = b * delOpenProb + y3;
+ Z1 = b * insOpenProb;
+ }
+
+ for (int j = tranDnaLength; j-- > 0;) {
+ Y1 = Y2;
+ Y2 = Y3;
+ Y3 = delRow[j];
+ double x = oldRow[j+1] * substitutionRow[tranDna[j]];
+ double y1 = Y1 * delProb1;
+ double y2 = Y2 * delProb2;
+ double y3 = Y3 * delProb3;
+ double z1 = Z1 * insProb1;
+ double z2 = Z2 * insProb2;
+ double z3 = Z3 * insProb3;
+ double b = x + y1 + y2 + y3 + z1 + z2 + z3 + 1;
+ maxValue = std::max(maxValue, b * newRow[j]);
+ newRow[j] = b;
+ delRow[j] = b * delOpenProb + y3;
+ Z3 = Z2;
+ Z2 = Z1;
+ Z1 = b * insOpenProb + z3;
+ }
+ }
+
+ return maxValue;
}
}
=====================================
src/mcf_frameshift_xdrop_aligner.hh
=====================================
@@ -54,6 +54,12 @@ public:
return xFwdProbs[i] * xBckProbs[totalNumOfCells - j + padSize*7 - 1];
}
+ // tranDna should point to translated DNA: frame 012012012...
+ double maxSumOfProbRatios(const uchar *protein, int proteinLength,
+ const uchar *tranDna, int tranDnaLength,
+ const const_dbl_ptr *substitutionProbs,
+ const GapCosts &gapCosts);
+
private:
std::vector<Prob> xFwdProbs;
std::vector<Prob> yFwdProbs;
=====================================
src/mcf_substitution_matrix_stats.cc
=====================================
@@ -18,6 +18,8 @@ const struct {
} substitutionMatrixScales[] = {
{"BL62", 3.0861133577701141},
{"BL80", 2.8253295934982616},
+ {"MIQS", 0.0},
+ {"PAM10", 0.0},
{"PAM30", 2.8848596855435313},
};
@@ -100,21 +102,30 @@ void SubstitutionMatrixStats::calcBias(const const_int_ptr *scoreMatrix,
void SubstitutionMatrixStats::calcUnbiased(const char *matrixName,
const const_int_ptr *scoreMatrix,
unsigned size) {
+ double scale = 0;
+ unsigned realSize = size;
+
for (size_t i = 0; i < COUNTOF(substitutionMatrixScales); ++i) {
if (strcmp(matrixName, substitutionMatrixScales[i].name) == 0) {
- calcFromScale(scoreMatrix, size, substitutionMatrixScales[i].scale);
- return;
+ scale = substitutionMatrixScales[i].scale;
+ if (size > 20) realSize = 20;
}
}
- cbrc::LambdaCalculator c;
- c.calculate(scoreMatrix, size);
- mLambda = c.lambda();
- if (!isBad()) {
+ if (scale > 0) {
+ calcFromScale(scoreMatrix, realSize, scale);
+ } else {
+ cbrc::LambdaCalculator c;
+ c.calculate(scoreMatrix, realSize);
+ mLambda = c.lambda();
+ if (isBad()) return;
mBias = 1;
- mLetterProbs1.assign(c.letterProbs1(), c.letterProbs1() + size);
- mLetterProbs2.assign(c.letterProbs2(), c.letterProbs2() + size);
+ mLetterProbs1.assign(c.letterProbs1(), c.letterProbs1() + realSize);
+ mLetterProbs2.assign(c.letterProbs2(), c.letterProbs2() + realSize);
}
+
+ mLetterProbs1.insert(mLetterProbs1.end(), size - realSize, 0.0);
+ mLetterProbs2.insert(mLetterProbs2.end(), size - realSize, 0.0);
}
void SubstitutionMatrixStats::flipDnaStrands(const unsigned char *complement) {
=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"1145"
+"1167"
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/4dc09e39224756822117fb8a9fed8dbdf3619956
--
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/4dc09e39224756822117fb8a9fed8dbdf3619956
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20201214/fe44b595/attachment-0001.html>
More information about the debian-med-commit
mailing list