[med-svn] [Git][med-team/last-align][upstream] New upstream version 1167

Mon Dec 14 17:56:44 GMT 2020


Nilesh Patra pushed to branch upstream at Debian Med / last-align


Commits:
4dc09e39 by Nilesh Patra at 2020-12-14T23:13:59+05:30
New upstream version 1167
- - - - -


30 changed files:

- ChangeLog.txt
- doc/last-train.html
- doc/last-train.txt
- doc/lastdb.html
- doc/lastdb.txt
- scripts/last-train
- src/Alignment.cc
- src/Alignment.hh
- src/AlignmentWrite.cc
- src/Alphabet.cc
- src/Alphabet.hh
- src/Centroid.cc
- src/GappedXdropAligner.hh
- src/GappedXdropAlignerFrame.cc
- src/GeneticCode.cc
- src/GeneticCode.hh
- src/LastEvaluer.cc
- src/LastEvaluer.hh
- src/LastalArguments.cc
- src/LastalArguments.hh
- src/LastdbArguments.cc
- src/LastdbArguments.hh
- src/MultiSequence.cc
- src/MultiSequence.hh
- src/lastal.cc
- src/lastdb.cc
- src/mcf_frameshift_xdrop_aligner.cc
- src/mcf_frameshift_xdrop_aligner.hh
- src/mcf_substitution_matrix_stats.cc
- src/version.hh


Changes:

=====================================
ChangeLog.txt
=====================================
@@ -1,8 +1,123 @@
+2020-12-11  Martin C. Frith  <Martin C. Frith>
+
+	* src/GeneticCode.cc, src/GeneticCode.hh, src/lastal.cc, test/last-
+	test.out, test/last-test.sh:
+	Better rep-masking for codons & new frameshifts
+	[c1fbf7da4f68] [tip]
+
+	* src/GappedXdropAlignerFrame.cc, src/GeneticCode.cc,
+	src/GeneticCode.hh, src/lastal.cc:
+	Refactor
+	[f0ef1bb760a6]
+
+2020-12-04  Martin C. Frith  <Martin C. Frith>
+
+	* doc/last-train.txt, scripts/last-train:
+	Add last-train --scale option
+	[f8298e7154b0]
+
+2020-12-03  Martin C. Frith  <Martin C. Frith>
+
+	* scripts/last-train:
+	Refactor
+	[8d973000becc]
+
+	* src/GappedXdropAligner.hh, src/GappedXdropAlignerFrame.cc:
+	Refactor
+	[f85ac4b37611]
+
+	* src/mcf_frameshift_xdrop_aligner.cc:
+	Change expected-count output for new frameshifts
+	[6295d6a32b28]
+
+	* src/Centroid.cc:
+	Refactor
+	[ee8f8ed3fd4e]
+
+	* scripts/last-train:
+	Refactor
+	[e35d9df1d54c]
+
+	* doc/lastdb.txt, src/Alphabet.cc, src/Alphabet.hh,
+	src/LastdbArguments.cc, src/LastdbArguments.hh,
+	src/MultiSequence.cc, src/MultiSequence.hh, src/lastdb.cc,
+	src/mcf_substitution_matrix_stats.cc, test/last-test.out, test/last-
+	test.sh:
+	Add lastdb -q option
+	[74ad9d8977ec]
+
+2020-12-02  Martin C. Frith  <Martin C. Frith>
+
+	* src/lastal.cc, test/last-test.out, test/last-test.sh:
+	Protein-codon alignment: use tantan on the DNA
+	[8ac7c496f9be]
+
+	* src/Alignment.cc, src/Alignment.hh, src/AlignmentWrite.cc,
+	src/LastEvaluer.cc, src/LastEvaluer.hh, src/lastal.cc, test/last-
+	test.out, test/last-test.sh:
+	E-values for new-style frameshifts!
+	[fffb08324c7d]
+
+2020-12-01  Martin C. Frith  <Martin C. Frith>
+
+	* src/mcf_frameshift_xdrop_aligner.cc,
+	src/mcf_frameshift_xdrop_aligner.hh:
+	Start implementing frameshift E-value simulation
+	[d73a654fdabb]
+
+	* src/LastalArguments.cc, src/LastalArguments.hh, src/lastal.cc:
+	Allow for non-integer alignment scores
+	[4bc305f33933]
+
+2020-11-30  Martin C. Frith  <Martin C. Frith>
+
+	* src/LastEvaluer.cc, src/LastEvaluer.hh, src/lastal.cc:
+	Refactor
+	[f5a5aed6d38e]
+
+	* src/AlignmentWrite.cc, src/LastEvaluer.cc, src/LastalArguments.cc:
+	Refactor
+	[49942e88c59e]
+
+2020-11-27  Martin C. Frith  <Martin C. Frith>
+
+	* src/AlignmentWrite.cc:
+	Protein-codon alignment: fix writing X and *
+	[922917aa2dd8]
+
+2020-11-26  Martin C. Frith  <Martin C. Frith>
+
+	* src/lastal.cc:
+	Reduce lastal memory use, sometimes
+	[bc682cb4f508]
+
+	* src/LastEvaluer.cc, test/last-test.out:
+	Fix minor bugs in E-value finite-size correction
+	[8d49928d5e10]
+
+	* src/LastEvaluer.cc, src/LastEvaluer.hh:
+	Refactor
+	[814e3863ab23]
+
+	* scripts/last-train:
+	Refactor
+	[a1e4ab7fd745]
+
+	* scripts/last-train:
+	Refactor
+	[70149e6fb67a]
+
+2020-11-25  Martin C. Frith  <Martin C. Frith>
+
+	* scripts/last-train, src/lastal.cc:
+	Refactor
+	[249167a86543]
+
 2020-11-06  Martin C. Frith  <Martin C. Frith>
 
 	* src/mcf_frameshift_xdrop_aligner.hh:
 	Try to appease C++ compilers
-	[62a147543f2b] [tip]
+	[62a147543f2b]
 
 	* doc/last-split.txt:
 	Add awk mismap trick to doc


=====================================
doc/last-train.html
=====================================
@@ -392,6 +392,13 @@ shorter, it is never picked.  0 means use everything.</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">--sample-length=<var>L</var></span></kbd></td>
 <td>Use randomly-chosen chunks of length L.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">--scale=<var>S</var></span></kbd></td>
+<td>Output scores in units of 1/S bits.  Traditional values
+include 2 for half-bit scores and 3 for 1/3-bit scores.
+(Note that 1/3-bit scores essentially equal Phred scores
+a.k.a. decibans, because log10(2) ≈ 3/10.)  The default is to
+infer a scale from the initial score parameters.</td></tr>
 </tbody>
 </table>
 </blockquote>
@@ -524,11 +531,6 @@ output, so it will override lastal's default.</p>
 <li><p class="first">last-train (and lastal) converts between path and alignment
 parameters as in Supplementary Section 3.1 of <a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>.</p>
 </li>
-<li><p class="first">last-train converts probabilities to scores using an arbitrary scale
-factor.  (If all the scores are multiplied by any positive constant,
-it makes no difference to alignment.)  This scale is determined by
-the initial score parameters.</p>
-</li>
 <li><p class="first">last-train uses parameters with "homogeneous letter probabilities"
 and "balanced length probability" (<a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>).</p>
 </li>
@@ -536,8 +538,8 @@ and "balanced length probability" (<a class="reference external" href=
 inaccurate.  It then finds an adjusted scale factor (without
 changing the scores), which makes the integer-rounded scores
 correspond to homogeneous letter probabilities and balanced length
-probability.  It writes this adjusted scale as a "-t" option for
-lastal, e.g. "-t4.4363".</p>
+probability.  It writes this adjusted scale (in nats, not bits) as a
+"-t" option for lastal, e.g. "-t4.4363".</p>
 </li>
 <li><p class="first">In rare cases, it may be impossible to find such an adjusted scale
 factor.  If that happens, last-train doubles the original scale (to


=====================================
doc/last-train.txt
=====================================
@@ -59,6 +59,12 @@ Training options
          shorter, it is never picked.  0 means use everything.
   --sample-length=L
          Use randomly-chosen chunks of length L.
+  --scale=S
+         Output scores in units of 1/S bits.  Traditional values
+         include 2 for half-bit scores and 3 for 1/3-bit scores.
+         (Note that 1/3-bit scores essentially equal Phred scores
+         a.k.a. decibans, because log10(2) ≈ 3/10.)  The default is to
+         infer a scale from the initial score parameters.
 
 All options below this point are passed to lastal to do the
 alignments: they are described in more detail at `<lastal.html>`_.
@@ -142,11 +148,6 @@ Details
 * last-train (and lastal) converts between path and alignment
   parameters as in Supplementary Section 3.1 of btz576_.
 
-* last-train converts probabilities to scores using an arbitrary scale
-  factor.  (If all the scores are multiplied by any positive constant,
-  it makes no difference to alignment.)  This scale is determined by
-  the initial score parameters.
-
 * last-train uses parameters with "homogeneous letter probabilities"
   and "balanced length probability" (btz576_).
 
@@ -154,8 +155,8 @@ Details
   inaccurate.  It then finds an adjusted scale factor (without
   changing the scores), which makes the integer-rounded scores
   correspond to homogeneous letter probabilities and balanced length
-  probability.  It writes this adjusted scale as a "-t" option for
-  lastal, e.g. "-t4.4363".
+  probability.  It writes this adjusted scale (in nats, not bits) as a
+  "-t" option for lastal, e.g. "-t4.4363".
 
 * In rare cases, it may be impossible to find such an adjusted scale
   factor.  If that happens, last-train doubles the original scale (to


=====================================
doc/lastdb.html
=====================================
@@ -469,6 +469,10 @@ Default           fasta
 <p class="last">The fastq formats are described in <a class="reference external" href="lastal.html">lastal.html</a>.</p>
 </td></tr>
 <tr><td class="option-group">
+<kbd><span class="option">-q</span></kbd></td>
+<td>Interpret the sequences as proteins, use a 21-letter alphabet
+with <tt class="docutils literal">*</tt> meaning STOP, and append <tt class="docutils literal">*</tt> to each sequence.</td></tr>
+<tr><td class="option-group">
 <kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
 <td>Divide the work between this number of threads running in
 parallel.  0 means use as many threads as your computer claims
@@ -594,9 +598,12 @@ bytes.</p>
 </li>
 <li><p class="first">lastdb8: makes the index twice as big.</p>
 </li>
-<li><p class="first">-u, -m: Multiple patterns multiply the index size.  For example,
+<li><p class="first">-u, -m, -d: Multiple patterns multiply the index size.  For example,
 <a class="reference external" href="last-seeds.html">MAM8</a> makes it 8 times bigger.</p>
 </li>
+<li><p class="first">-u, -d: may reduce the index, e.g. <a class="reference external" href="last-seeds.html">RY32</a> makes
+it 32 times smaller.</p>
+</li>
 <li><p class="first">-s: does not change the total size, but splits it into volumes.</p>
 </li>
 <li><p class="first">-S2: doubles the size of everything.</p>


=====================================
doc/lastdb.txt
=====================================
@@ -136,6 +136,9 @@ Advanced Options
 
       The fastq formats are described in `<lastal.html>`_.
 
+  -q  Interpret the sequences as proteins, use a 21-letter alphabet
+      with ``*`` meaning STOP, and append ``*`` to each sequence.
+
   -P THREADS
       Divide the work between this number of threads running in
       parallel.  0 means use as many threads as your computer claims
@@ -258,9 +261,12 @@ This is modified by several options.
 
 * lastdb8: makes the index twice as big.
 
-* -u, -m: Multiple patterns multiply the index size.  For example,
+* -u, -m, -d: Multiple patterns multiply the index size.  For example,
   `MAM8 <last-seeds.html>`_ makes it 8 times bigger.
 
+* -u, -d: may reduce the index, e.g. `RY32 <last-seeds.html>`_ makes
+  it 32 times smaller.
+
 * -s: does not change the total size, but splits it into volumes.
 
 * -S2: doubles the size of everything.


=====================================
scripts/last-train
=====================================
@@ -6,7 +6,7 @@
 # [Fri19] How sequence alignment scores correspond to probability models,
 #         MC Frith, Bioinformatics, 2019.
 
-from __future__ import print_function
+from __future__ import division, print_function
 
 import gzip
 import math
@@ -18,6 +18,8 @@ import subprocess
 import sys
 import tempfile
 
+proteinAlphabet20 = "ACDEFGHIKLMNPQRSTVWY"
+
 def myOpen(fileName):  # faster than fileinput
     if fileName == "-":
         return sys.stdin
@@ -25,7 +27,7 @@ def myOpen(fileName):  # faster than fileinput
         return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
-def rootOfIncreasingFunction(func, lowerBound, upperBound, *args):
+def rootOfIncreasingFunction(func, lowerBound, upperBound, args):
     # Find x such that func(x, *args) == 0
     while True:
         mid = 0.5 * (lowerBound + upperBound)
@@ -36,6 +38,17 @@ def rootOfIncreasingFunction(func, lowerBound, upperBound, *args):
         else:
             upperBound = mid
 
+def rootOfDecreasingFunction(func, lowerBound, upperBound, args):
+    # Find x such that func(x, *args) == 0
+    while True:
+        mid = 0.5 * (lowerBound + upperBound)
+        if mid <= lowerBound or mid >= upperBound:
+            return mid
+        if func(mid, *args) < 0:
+            upperBound = mid
+        else:
+            lowerBound = mid
+
 def homogeneousLetterFreqs(scale, matScores):
     # Solve the simultaneous equations in Section 2.1 of [Fri19]
     expMat = [[math.exp(j / scale) for j in i] for i in matScores]
@@ -162,13 +175,6 @@ def getSeqSample(opts, queryFiles, outfile):
                 x = y
     writeSegment(outfile, x)
 
-def versionFromHeader(lines):
-    for line in lines:
-        words = line.split()
-        if "version" in words:
-            return words[-1]
-    raise Exception("couldn't read the version")
-
 def scaleFromHeader(lines):
     for line in lines:
         for i in line.split():
@@ -177,41 +183,38 @@ def scaleFromHeader(lines):
     raise Exception("couldn't read the scale")
 
 def countsFromLastOutput(lines, opts):
-    numTransitionCounts = 5
-    matrix = []
-    # use +1 pseudocounts as a kludge to mitigate numerical problems:
-    matches = 1.0
-    deletes = 2.0  # 1 open + 1 extension
-    inserts = 2.0  # 1 open + 1 extension
-    delOpens = 1.0
-    insOpens = 1.0
+    nTransitions = 5
+    tranCounts = [1.0] * nTransitions  # +1 pseudocounts
+    tranCounts[1] = 2.0  # deletes: opens + extensions, so 2 pseudocounts
+    tranCounts[2] = 2.0  # inserts: opens + extensions, so 2 pseudocounts
+    countMatrix = None
     alignments = 0  # no pseudocount here
     for line in lines:
         if line[0] == "s":
             strand = line.split()[4]  # slow?
         if line[0] == "c":
-            c = [float(i) for i in line.split()[1:]]
-            transCounts = c[-numTransitionCounts:]
-            if not matrix:
-                matrixSize = int(math.sqrt(len(c) - numTransitionCounts))
-                matrix = [[1.0] * matrixSize for i in range(matrixSize)]
-            identities = sum(c[i * matrixSize + i] for i in range(matrixSize))
-            alignmentLength = transCounts[0] + transCounts[1] + transCounts[2]
-            if 100 * identities > opts.pid * alignmentLength: continue
-            for i in range(matrixSize):
-                for j in range(matrixSize):
-                    if strand == "+" or opts.S == "0":
-                        matrix[i][j]       += c[i * matrixSize + j]
+            counts = [float(i) for i in line.split()[1:]]
+            if not countMatrix:
+                matrixSize = len(counts) - nTransitions
+                nCols = int(math.sqrt(matrixSize))
+                nRows = matrixSize // nCols
+                countMatrix = [[1.0] * nCols for i in range(nRows)]
+            identities = sum(counts[i * nCols + i] for i in range(nRows))
+            alignmentLength = sum(counts[matrixSize + i] for i in range(3))
+            if 100 * identities > opts.pid * alignmentLength:
+                continue
+            for i in range(nRows):
+                for j in range(nCols):
+                    if strand == "+" or opts.S != "1":
+                        countMatrix[i][j]       += counts[i * nCols + j]
                     else:
-                        matrix[-1-i][-1-j] += c[i * matrixSize + j]
-            matches += transCounts[0]
-            deletes += transCounts[1]
-            inserts += transCounts[2]
-            delOpens += transCounts[3]
-            insOpens += transCounts[4]
+                        countMatrix[-1-i][-1-j] += counts[i * nCols + j]
+            for i in range(nTransitions):
+                tranCounts[i] += counts[matrixSize + i]
             alignments += 1
-    gapCounts = matches, deletes, inserts, delOpens, insOpens, alignments
-    return matrix, gapCounts
+    if not alignments:
+        raise Exception("no alignments")
+    return countMatrix, tranCounts + [alignments]
 
 def scoreFromProb(scale, prob):
     if prob > 0: logProb = math.log(prob)
@@ -223,7 +226,7 @@ def costFromProb(scale, prob):
 
 def guessAlphabet(matrixSize):
     if matrixSize ==  4: return "ACGT"
-    if matrixSize == 20: return "ACDEFGHIKLMNPQRSTVWY"
+    if matrixSize == 20: return proteinAlphabet20
     raise Exception("can't handle unusual alphabets")
 
 def writeMatrixHead(outFile, prefix, alphabet, formatString):
@@ -257,7 +260,6 @@ def matProbsFromCounts(counts, opts):
     identities = sum(counts[i][i] for i in r)
     total = sum(map(sum, counts))
     probs = [[j / total for j in i] for i in counts]
-
     print("# substitution percent identity: %g" % (100 * identities / total))
     print()
     print("# count matrix "
@@ -268,7 +270,6 @@ def matProbsFromCounts(counts, opts):
           "(query letters = columns, reference letters = rows):")
     writeProbMatrix(sys.stdout, probs, "# ")
     print()
-
     return probs
 
 def probImbalance(endProb, matchProb, firstDelProb, delExtendProb,
@@ -282,12 +283,11 @@ def balancedEndProb(*args):
     matchProb, firstDelProb, delExtendProb, firstInsProb, insExtendProb = args
     lowerBound = max(delExtendProb, insExtendProb)
     upperBound = 1.0
-    r = rootOfIncreasingFunction(probImbalance, lowerBound, upperBound, *args)
-    return r
+    return rootOfIncreasingFunction(probImbalance,
+                                    lowerBound, upperBound, args)
 
-def gapRatiosFromCounts(counts, opts):
+def gapProbsFromCounts(counts, opts):
     matches, deletes, inserts, delOpens, insOpens, alignments = counts
-    if not alignments: raise Exception("no alignments")
     gaps = deletes + inserts
     gapOpens = delOpens + insOpens
     denominator = matches + gapOpens + (alignments + 1)  # +1 pseudocount
@@ -295,14 +295,13 @@ def gapRatiosFromCounts(counts, opts):
     if opts.gapsym:
         delOpenProb = gapOpens / denominator / 2
         insOpenProb = gapOpens / denominator / 2
-        delExtendProb = (gaps - gapOpens) / gaps
-        insExtendProb = (gaps - gapOpens) / gaps
+        delGrowProb = (gaps - gapOpens) / gaps
+        insGrowProb = (gaps - gapOpens) / gaps
     else:
         delOpenProb = delOpens / denominator
         insOpenProb = insOpens / denominator
-        delExtendProb = (deletes - delOpens) / deletes
-        insExtendProb = (inserts - insOpens) / inserts
-
+        delGrowProb = (deletes - delOpens) / deletes
+        insGrowProb = (inserts - insOpens) / inserts
     print("# aligned letter pairs: %.12g" % matches)
     print("# deletes: %.12g" % deletes)
     print("# inserts: %.12g" % inserts)
@@ -314,29 +313,34 @@ def gapRatiosFromCounts(counts, opts):
     print("# matchProb: %g" % matchProb)
     print("# delOpenProb: %g" % delOpenProb)
     print("# insOpenProb: %g" % insOpenProb)
-    print("# delExtendProb: %g" % delExtendProb)
-    print("# insExtendProb: %g" % insExtendProb)
+    print("# delExtendProb: %g" % delGrowProb)
+    print("# insExtendProb: %g" % insGrowProb)
     print()
+    return matchProb, (delOpenProb, delGrowProb), (insOpenProb, insGrowProb)
 
-    delCloseProb = 1 - delExtendProb
+def gapRatiosFromProbs(matchProb, delProbs, insProbs):
+    delOpenProb, delGrowProb = delProbs
+    insOpenProb, insGrowProb = insProbs
+
+    delCloseProb = 1 - delGrowProb
     firstDelProb = delOpenProb * delCloseProb
 
-    insCloseProb = 1 - insExtendProb
+    insCloseProb = 1 - insGrowProb
     firstInsProb = insOpenProb * insCloseProb
 
-    endProb = balancedEndProb(matchProb, firstDelProb, delExtendProb,
-                              firstInsProb, insExtendProb)
+    endProb = balancedEndProb(matchProb, firstDelProb, delGrowProb,
+                              firstInsProb, insGrowProb)
     # probably, endProb is negligibly less than 1
 
     matchRatio = matchProb / (endProb * endProb)
 
     firstDelRatio = firstDelProb / endProb
-    delExtendRatio = delExtendProb / endProb
-    delRatios = firstDelRatio, delExtendRatio
+    delGrowRatio = delGrowProb / endProb
+    delRatios = firstDelRatio, delGrowRatio
 
     firstInsRatio = firstInsProb / endProb
-    insExtendRatio = insExtendProb / endProb
-    insRatios = firstInsRatio, insExtendRatio
+    insGrowRatio = insGrowProb / endProb
+    insRatios = firstInsRatio, insGrowRatio
 
     return matchRatio, delRatios, insRatios
 
@@ -346,10 +350,8 @@ def scoreFromLetterProbs(scale, matchRatio, pairProb, rowProb, colProb):
     return scoreFromProb(scale, matchRatio * probRatio)
 
 def matScoresFromProbs(scale, matchRatio, matProbs, rowProbs, colProbs):
-    r = range(len(matProbs))
-    return [[scoreFromLetterProbs(scale, matchRatio, matProbs[i][j],
-                                  rowProbs[i], colProbs[j]) for j in r]
-            for i in r]
+    return [[scoreFromLetterProbs(scale, matchRatio, matProbs[i][j], x, y)
+             for j, y in enumerate(colProbs)] for i, x in enumerate(rowProbs)]
 
 def gapCostsFromProbRatios(scale, firstGapRatio, gapExtendRatio):
     # The next addition gets the alignment parameter from the path
@@ -373,28 +375,26 @@ def scoreImbalance(scale, matScores, delCosts, insCosts):
     i = imbalanceFromGap(scale, *insCosts)
     return 1 / sum(homogeneousLetterFreqs(scale, matScores)) + d + i - 1
 
-def negScoreImbalance(*args):
-    return -scoreImbalance(*args)
-
-def balancedScale(nearScale, *args):
+def balancedScale(imbalanceFunc, nearScale, args):
     # Find a scale, near nearScale, with balanced length probability
     bump = 1.000001
-    value = scoreImbalance(nearScale, *args)
+    rootFinders = rootOfDecreasingFunction, rootOfIncreasingFunction
+    value = imbalanceFunc(nearScale, *args)
     if abs(value) <= 0:
         return nearScale
     oldLower = oldUpper = nearScale
     while oldUpper < 2 * nearScale:  # xxx ???
         newLower = oldLower / bump
-        lowerValue = scoreImbalance(newLower, *args)
+        lowerValue = imbalanceFunc(newLower, *args)
         if (lowerValue < 0) != (value < 0):
-            func = scoreImbalance if value > 0 else negScoreImbalance
-            return rootOfIncreasingFunction(func, newLower, oldLower, *args)
+            finder = rootFinders[value > 0]
+            return finder(imbalanceFunc, newLower, oldLower, args)
         oldLower = newLower
         newUpper = oldUpper * bump
-        upperValue = scoreImbalance(newUpper, *args)
+        upperValue = imbalanceFunc(newUpper, *args)
         if (upperValue < 0) != (value < 0):
-            func = scoreImbalance if value < 0 else negScoreImbalance
-            return rootOfIncreasingFunction(func, oldUpper, newUpper, *args)
+            finder = rootFinders[value < 0]
+            return finder(imbalanceFunc, oldUpper, newUpper, args)
         oldUpper = newUpper
     return 0.0
 
@@ -403,7 +403,8 @@ def scoresAndScale(originalScale, matParams, delRatios, insRatios):
         matScores = matScoresFromProbs(originalScale, *matParams)
         delCosts = gapCostsFromProbRatios(originalScale, *delRatios)
         insCosts = gapCostsFromProbRatios(originalScale, *insRatios)
-        scale = balancedScale(originalScale, matScores, delCosts, insCosts)
+        args = matScores, delCosts, insCosts
+        scale = balancedScale(scoreImbalance, originalScale, args)
         if scale > 0:
             rowFreqs = homogeneousLetterFreqs(scale, zip(*matScores))
             colFreqs = homogeneousLetterFreqs(scale, matScores)
@@ -413,20 +414,21 @@ def scoresAndScale(originalScale, matParams, delRatios, insRatios):
               "doubling the scale")
         originalScale *= 2
 
-def writeGapCosts(gapCosts, out):
-    firstDelCost, delExtendCost, firstInsCost, insExtendCost = gapCosts
-    print("#last -a", firstDelCost - delExtendCost, file=out)
-    print("#last -A", firstInsCost - insExtendCost, file=out)
-    print("#last -b", delExtendCost, file=out)
-    print("#last -B", insExtendCost, file=out)
-
-def printGapCosts(gapCosts):
-    firstDelCost, delExtendCost, firstInsCost, insExtendCost = gapCosts
-    print("# delExistCost:", firstDelCost - delExtendCost)
-    print("# insExistCost:", firstInsCost - insExtendCost)
-    print("# delExtendCost:", delExtendCost)
-    print("# insExtendCost:", insExtendCost)
-    print()
+def writeGapCosts(delCosts, insCosts, isLastFormat, outFile):
+    delInit, delGrow = delCosts
+    insInit, insGrow = insCosts
+    delOpen = delInit - delGrow
+    insOpen = insInit - insGrow
+    if isLastFormat:
+        print("#last -a", delOpen, file=outFile)
+        print("#last -A", insOpen, file=outFile)
+        print("#last -b", delGrow, file=outFile)
+        print("#last -B", insGrow, file=outFile)
+    else:
+        print("# delExistCost:", delOpen, file=outFile)
+        print("# insExistCost:", insOpen, file=outFile)
+        print("# delExtendCost:", delGrow, file=outFile)
+        print("# insExtendCost:", insGrow, file=outFile)
 
 def tryToMakeChildProgramsFindable():
     d = os.path.dirname(__file__)
@@ -458,29 +460,46 @@ def fixedLastalArgs(opts, lastalProgName):
     if opts.verbose: x.append("-" + "v" * opts.verbose)
     return x
 
+def process(args, inStream):
+    return subprocess.Popen(args, stdin=inStream, stdout=subprocess.PIPE,
+                            universal_newlines=True)
+
+def versionFromLastal():
+    args = ["lastal", "--version"]
+    proc = process(args, None)
+    return proc.stdout.read().split()[1]
+
+def lastSplitProcess(opts, proc):
+    splitArgs = ["last-split", "-n", "-m0.01"]  # xxx ???
+    proc = process(splitArgs, proc.stdout)
+    if opts.postmask:
+        maskArgs = ["last-postmask"]
+        proc = process(maskArgs, proc.stdout)
+    return proc
+
 def doTraining(opts, args):
     tryToMakeChildProgramsFindable()
     lastalProgName = readLastalProgName(args[0])
     scaleIncrease = 20  # while training, up-scale the scores by this amount
-    x = fixedLastalArgs(opts, lastalProgName)
-    if opts.r: x.append("-r" + opts.r)
-    if opts.q: x.append("-q" + opts.q)
-    if opts.p: x.append("-p" + opts.p)
-    if opts.a: x.append("-a" + opts.a)
-    if opts.b: x.append("-b" + opts.b)
-    if opts.A: x.append("-A" + opts.A)
-    if opts.B: x.append("-B" + opts.B)
-    x += args
-    y = ["last-split", "-n", "-m0.01"]  # xxx ???
-    z = ["last-postmask"]
-    p = subprocess.Popen(x, stdout=subprocess.PIPE, universal_newlines=True)
-    q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
-                         universal_newlines=True)
-    if opts.postmask:
-        q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE,
-                             universal_newlines=True)
-    lastalVersion = versionFromHeader(q.stdout)
-    externalScale = scaleFromHeader(q.stdout)
+    lastalVersion = versionFromLastal()
+
+    lastalArgs = fixedLastalArgs(opts, lastalProgName)
+    if opts.r: lastalArgs.append("-r" + opts.r)
+    if opts.q: lastalArgs.append("-q" + opts.q)
+    if opts.p: lastalArgs.append("-p" + opts.p)
+    if opts.a: lastalArgs.append("-a" + opts.a)
+    if opts.b: lastalArgs.append("-b" + opts.b)
+    if opts.A: lastalArgs.append("-A" + opts.A)
+    if opts.B: lastalArgs.append("-B" + opts.B)
+    lastalArgs += args
+    proc = process(lastalArgs, None)
+    proc = lastSplitProcess(opts, proc)
+
+    if opts.scale:
+        externalScale = opts.scale / math.log(2)
+    else:
+        externalScale = scaleFromHeader(proc.stdout)
+
     internalScale = externalScale * scaleIncrease
     oldParameters = []
 
@@ -491,18 +510,20 @@ def doTraining(opts, args):
     print()
 
     while True:
-        print("#", *x)
+        print("#", *lastalArgs)
         print()
         sys.stdout.flush()
-        matCounts, gapCounts = countsFromLastOutput(q.stdout, opts)
-        matchRatio, delRatios, insRatios = gapRatiosFromCounts(gapCounts, opts)
+        matCounts, gapCounts = countsFromLastOutput(proc.stdout, opts)
+        gapProbs = gapProbsFromCounts(gapCounts, opts)
         matProbs = matProbsFromCounts(matCounts, opts)
+        matchRatio, delRatios, insRatios = gapRatiosFromProbs(*gapProbs)
         rowProbs = [sum(i) for i in matProbs]
         colProbs = [sum(i) for i in zip(*matProbs)]
         matParams = matchRatio, matProbs, rowProbs, colProbs
         sas = scoresAndScale(internalScale, matParams, delRatios, insRatios)
         matScores, delCosts, insCosts, scale = sas
-        printGapCosts(delCosts + insCosts)
+        writeGapCosts(delCosts, insCosts, False, None)
+        print()
         print("# score matrix "
               "(query letters = columns, reference letters = rows):")
         writeScoreMatrix(sys.stdout, matScores, "# ")
@@ -510,28 +531,22 @@ def doTraining(opts, args):
         parameters = delCosts, insCosts, matScores
         if parameters in oldParameters: break
         oldParameters.append(parameters)
-        x = fixedLastalArgs(opts, lastalProgName)
-        x.append("-t{0:.6}".format(scale))
-        x.append("-p-")
-        x += args
-        p = subprocess.Popen(x, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
-                             universal_newlines=True)
-        writeGapCosts(delCosts + insCosts, p.stdin)
-        writeScoreMatrix(p.stdin, matScores, "")
-        p.stdin.close()
-        # in python2.6, the next line must come after p.stdin.close()
-        q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
-                             universal_newlines=True)
-        if opts.postmask:
-            q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE,
-                                 universal_newlines=True)
+        lastalArgs = fixedLastalArgs(opts, lastalProgName)
+        lastalArgs.append("-t{0:.6}".format(scale))
+        lastalArgs.append("-p-")
+        lastalArgs += args
+        proc = process(lastalArgs, subprocess.PIPE)
+        writeGapCosts(delCosts, insCosts, True, proc.stdin)
+        writeScoreMatrix(proc.stdin, matScores, "")
+        proc.stdin.close()
+        proc = lastSplitProcess(opts, proc)
 
     sas = scoresAndScale(externalScale, matParams, delRatios, insRatios)
     matScores, delCosts, insCosts, scale = sas
     if opts.X: print("#last -X", opts.X)
     if opts.Q: print("#last -Q", opts.Q)
     print("#last -t{0:.6}".format(scale))
-    writeGapCosts(delCosts + insCosts, sys.stdout)
+    writeGapCosts(delCosts, insCosts, True, None)
     if opts.s: print("#last -s", opts.s)
     if opts.S: print("#last -S", opts.S)
     print("# score matrix "
@@ -559,6 +574,7 @@ if __name__ == "__main__":
     op = optparse.OptionParser(usage=usage, description=description)
     op.add_option("-v", "--verbose", action="count",
                   help="show more details of intermediate steps")
+
     og = optparse.OptionGroup(op, "Training options")
     og.add_option("--revsym", action="store_true",
                   help="force reverse-complement symmetry")
@@ -574,7 +590,10 @@ if __name__ == "__main__":
                   help="number of random sequence samples (default: %default)")
     og.add_option("--sample-length", type="int", default=2000, metavar="L",
                   help="length of each sample (default: %default)")
+    og.add_option("--scale", type="float", metavar="S",
+                  help="output scores in units of 1/S bits")
     op.add_option_group(og)
+
     og = optparse.OptionGroup(op, "Initial parameter options")
     og.add_option("-r", metavar="SCORE",
                   help="match score (default: 6 if Q>=1, else 5)")
@@ -588,6 +607,7 @@ if __name__ == "__main__":
     og.add_option("-A", metavar="COST", help="insertion existence cost")
     og.add_option("-B", metavar="COST", help="insertion extension cost")
     op.add_option_group(og)
+
     og = optparse.OptionGroup(op, "Alignment options")
     og.add_option("-D", metavar="LENGTH",
                   help="query letters per random alignment (default: 1e6)")
@@ -614,6 +634,7 @@ if __name__ == "__main__":
     og.add_option("-Q", metavar="NAME",
                   help="input format: fastx, sanger (default=fasta)")
     op.add_option_group(og)
+
     (opts, args) = op.parse_args()
     if len(args) < 1:
         op.error("I need a lastdb index and query sequences")


=====================================
src/Alignment.cc
=====================================
@@ -51,7 +51,12 @@ void Alignment::makeXdrop( Aligners &aligners, bool isGreedy,
 			   const Alphabet& alph, AlignmentExtras& extras,
 			   double gamma, int outputType ){
   score = seed.score;
-  if( outputType > 3 ) extras.fullScore = seed.score;
+
+  if (gap.isNewFrameshifts()) {
+    extras.fullScore = -1;  // means that "score" is (non-integer) fullScore
+  } else if (outputType > 3) {
+    extras.fullScore = seed.score;
+  }
 
   if( outputType == 7 ){
     assert( seed.size > 0 );  // makes things easier to understand
@@ -303,21 +308,21 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
 
     if (gap.isNewFrameshifts()) {
       size_t frame2 = isForward ? dnaStart + 2 : dnaStart - 2;
-      score += aligner.alignFrame(seq1 + start1, seq2 + start2,
-				  seq2 + dnaToAa(frame1, frameSize),
-				  seq2 + dnaToAa(frame2, frameSize),
-				  isForward, sm, gap, maxDrop);
+      aligner.alignFrame(seq1 + start1, seq2 + start2,
+			 seq2 + dnaToAa(frame1, frameSize),
+			 seq2 + dnaToAa(frame2, frameSize),
+			 isForward, sm, gap, maxDrop);
       while (aligner.getNextChunkFrame(end1, end2, size, gapCost, gap))
 	chunks.push_back(SegmentPair(end1 - size, end2 - size * 3, size,
 				     gapCost));
+      FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
+      double probDropLimit = exp(scale * -maxDrop);
+      double s = fxa.forward(seq1 + start1, seq2 + start2,
+			     seq2 + dnaToAa(frame1, frameSize),
+			     seq2 + dnaToAa(frame2, frameSize),
+			     isForward, probMat, gap, probDropLimit);
+      score += s / scale;
       if (outputType > 3) {
-	FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
-	double probDropLimit = exp(scale * -maxDrop);
-	double s = fxa.forward(seq1 + start1, seq2 + start2,
-			       seq2 + dnaToAa(frame1, frameSize),
-			       seq2 + dnaToAa(frame2, frameSize),
-			       isForward, probMat, gap, probDropLimit);
-	extras.fullScore += s / scale;
 	fxa.backward(isForward, probMat, gap);
 	getColumnCodes(fxa, columnCodes, chunks);
 	if (outputType == 7) {


=====================================
src/Alignment.hh
=====================================
@@ -32,7 +32,7 @@ struct AlignmentText {
   SegmentPair::indexT strandNum;
   SegmentPair::indexT queryBeg;
   SegmentPair::indexT queryEnd;
-  int score;
+  double score;
   SegmentPair::indexT alnSize;
   SegmentPair::indexT matches;
   char *text;  // seems to be a bit faster than std::vector<char>
@@ -40,7 +40,7 @@ struct AlignmentText {
   AlignmentText() {}
 
   AlignmentText(size_t queryNumIn, size_t queryBegIn, size_t queryEndIn,
-		char strandIn, int scoreIn,
+		char strandIn, double scoreIn,
 		size_t alnSizeIn, size_t matchesIn, char *textIn) :
     strandNum(queryNumIn * 2 + (strandIn == '-')),
     queryBeg(queryBegIn), queryEnd(queryEndIn), score(scoreIn),
@@ -126,7 +126,7 @@ struct Alignment{
 
   // data:
   std::vector<SegmentPair> blocks;  // the gapless blocks of the alignment
-  int score;
+  double score;
   SegmentPair seed;  // the alignment remembers its seed
 
   size_t beg1() const{ return blocks.front().beg1(); }
@@ -166,6 +166,7 @@ struct Alignment{
 			      const Alphabet& alph, int translationType,
 			      const uchar *codonToAmino,
 			      const LastEvaluer& evaluer,
+			      const AlignmentExtras& extras,
 			      bool isExtraColumns) const;
 
   size_t numColumns(size_t frameSize, bool isCodon) const;


=====================================
src/AlignmentWrite.cc
=====================================
@@ -5,8 +5,10 @@
 #include "LastEvaluer.hh"
 #include "MultiSequence.hh"
 #include "Alphabet.hh"
+
+#include <assert.h>
+
 #include <algorithm>
-#include <cassert>
 #include <cstdio>  // sprintf
 
 using namespace cbrc;
@@ -32,7 +34,8 @@ const char aminoTriplets[] =
   "Val" "val"
   "Trp" "trp"
   "Tyr" "tyr"
-  "Xxx" "xxx";
+  "***" "***"
+  "Xaa" "xaa";
 
 // This writes a "size_t" integer into a char buffer ending at "end".
 // It writes backwards from the end, because that's easier & faster.
@@ -128,7 +131,7 @@ AlignmentText Alignment::write(const MultiSequence& seq1,
     return writeTab(seq1, seq2, seqNum2, translationType, evaluer, extras);
   else
     return writeBlastTab(seq1, seq2, seqNum2, seqData2, alph, translationType,
-			 codonToAmino, evaluer, format == 'B');
+			 codonToAmino, evaluer, extras, format == 'B');
 }
 
 static size_t alignedColumnCount(const std::vector<SegmentPair> &blocks) {
@@ -180,8 +183,12 @@ static size_t writeBlocks(std::vector<char> &text,
   return end - e;
 }
 
+static const char *scoreFormat(bool isIntegerScores) {
+  return isIntegerScores ? "%.0f" : "%.1f";
+}
+
 static char* writeTags( const LastEvaluer& evaluer, double queryLength,
-			int score, double fullScore,
+			double score, double fullScore,
 			char separator, char* out ){
   if( evaluer.isGood() ){
     double epa = evaluer.evaluePerArea( score );
@@ -217,7 +224,7 @@ AlignmentText Alignment::writeTab(const MultiSequence& seq1,
   size_t seqStart2 = seq2.seqBeg(seqNum2) - seq2.padBeg(seqNum2);
   size_t seqLen2 = seq2.seqLen(seqNum2);
 
-  IntText sc(score);
+  FloatText sc(scoreFormat(extras.fullScore >= 0), score);
   std::string n1 = seq1.seqName(seqNum1);
   char strand1 = seq1.strand(seqNum1);
   std::string n2 = seq2.seqName(seqNum2);
@@ -266,9 +273,10 @@ static void putRight(Writer &w, const IntText &t, size_t width) {
 }
 
 // Write an "a" line
-static char *writeMafLineA(char *out, int score, const LastEvaluer& evaluer,
+static char *writeMafLineA(char *out, double score, const LastEvaluer& evaluer,
 			   double queryLength, double fullScore) {
-  out += std::sprintf(out, "a score=%d", score);
+  out += std::sprintf(out, "a score=");
+  out += std::sprintf(out, scoreFormat(fullScore >= 0), score);
   return writeTags(evaluer, queryLength, score, fullScore, ' ', out);
 }
 
@@ -444,6 +452,7 @@ AlignmentText Alignment::writeBlastTab(const MultiSequence& seq1,
 				       int translationType,
 				       const uchar *codonToAmino,
 				       const LastEvaluer& evaluer,
+				       const AlignmentExtras& extras,
 				       bool isExtraColumns) const {
   size_t alnBeg1 = beg1();
   size_t alnEnd1 = end1();
@@ -509,13 +518,16 @@ AlignmentText Alignment::writeBlastTab(const MultiSequence& seq1,
   }
   IntText s1(seqLen1);
   IntText s2(seqLen2);
-  IntText sc(score);
+  FloatText sc;
 
   size_t s =
     n2.size() + n1.size() + mp.size() + as.size() + mm.size() + go.size() +
     b2.size() + e2.size() + b1.size() + e1.size() + 10;
   if (evaluer.isGood()) s += ev.size() + bs.size() + 2;
-  if (isExtraColumns)   s += s1.size() + s2.size() + sc.size() + 3;
+  if (isExtraColumns) {
+    sc.set(scoreFormat(extras.fullScore >= 0), score);
+    s += s1.size() + s2.size() + sc.size() + 3;
+  }
 
   char *text = new char[s + 1];
   Writer w(text);
@@ -596,7 +608,7 @@ static char *writeAmino(char *dest, const uchar *beg, const uchar *end,
   for (const uchar *i = beg; i < end; ++i) {
     unsigned x = toUnmasked[*i];
     bool isMasked = (x != *i);
-    if (x > 20) x = 20;
+    if (x > 21) x = 21;
     unsigned y = (x * 2 + isMasked) * 3;
     *dest++ = aminoTriplets[y];
     *dest++ = aminoTriplets[y + 1];


=====================================
src/Alphabet.cc
=====================================
@@ -19,6 +19,7 @@ const char* Alphabet::dna = "ACGT";
 
 // U=selenocysteine, O=pyrrolysine, *=stop?
 const char* Alphabet::protein = "ACDEFGHIKLMNPQRSTVWY";
+const char* Alphabet::proteinWithStop = "ACDEFGHIKLMNPQRSTVWY*";
 
 void Alphabet::fromString( const std::string& alphString ){
   letters = alphString;


=====================================
src/Alphabet.hh
=====================================
@@ -23,6 +23,7 @@ struct Alphabet{
 
   static const char* dna;
   static const char* protein;
+  static const char* proteinWithStop;
 
   static const unsigned capacity = 256;
 


=====================================
src/Centroid.cc
=====================================
@@ -584,16 +584,8 @@ namespace cbrc{
     int alphabetSizeIncrement = alphabetSize;
     if (!isExtendFwd) alphabetSizeIncrement *= -1;
 
-    const double delOpen = gapCosts.delProbPieces[0].openProb;
-    const double delGrow = gapCosts.delProbPieces[0].growProb;
-    const double insOpen = gapCosts.insProbPieces[0].openProb;
-    const double insGrow = gapCosts.insProbPieces[0].growProb;
-
-    const double delInit = delOpen * delGrow;  // for 1st letter in a deletion
-    const double insInit = insOpen * insGrow;  // for 1st letter in an insert
-
-    const double delNext = delGrow - delInit;
-    const double insNext = insGrow - insInit;
+    double delInitCount = 0; double delNextCount = 0;
+    double insInitCount = 0; double insNextCount = 0;
 
     for( size_t k = 0; k < numAntidiagonals; ++k ){  // loop over antidiagonals
       const size_t seq1beg = xa.seq1start( k );
@@ -601,9 +593,6 @@ namespace cbrc{
       const double scale2 = scale[k];
       const double scale1 = scale[k+1];
 
-      const double scaledDelNext = scale1 * delNext;
-      const double scaledInsNext = scale1 * insNext;
-
       const size_t scoreEnd = xa.scoreEndIndex( k );
       const double* bM0 = &bM[ scoreEnd + xdropPadLen ];
       const double* bD0 = &bD[ scoreEnd + xdropPadLen ];
@@ -616,6 +605,9 @@ namespace cbrc{
       const double* fI1 = &fI[ vertBeg ];
       const double* fM2 = &fM[ diagBeg ];
 
+      double dNextCount = 0;
+      double iNextCount = 0;
+
       const double* bM0last = bM0 + xa.numCellsAndPads( k ) - xdropPadLen - 1;
 
       const uchar* s1 = isExtendFwd ? seq1 + seq1beg : seq1 - seq1beg;
@@ -640,10 +632,10 @@ namespace cbrc{
 	  const double alignProb = xSum * yM * matchProb;
 	  c.emit[letter1][letter2] += alignProb;
 	  c.toMatch += alignProb;
-	  c.delInit += xSum * yD * delInit;
-	  c.delNext += xD * yD * scaledDelNext;
-	  c.insInit += xSum * yI * insInit;
-	  c.insNext += xI * yI * scaledInsNext;
+	  delInitCount += xSum * yD;
+	  dNextCount += xD * yD;
+	  insInitCount += xSum * yI;
+	  iNextCount += xI * yI;
 
 	  if (bM0 == bM0last) break;
 	  fM2++; fD1++; fI1++;
@@ -673,10 +665,10 @@ namespace cbrc{
 	  countUncertainLetters(c.emit[letter1], alignProb,
 				alphabetSize, match_score[letter1], lp2);
 	  c.toMatch += alignProb;
-	  c.delInit += xSum * yD * delInit;
-	  c.delNext += xD * yD * scaledDelNext;
-	  c.insInit += xSum * yI * insInit;
-	  c.insNext += xI * yI * scaledInsNext;
+	  delInitCount += xSum * yD;
+	  dNextCount += xD * yD;
+	  insInitCount += xSum * yI;
+	  iNextCount += xI * yI;
 
 	  if (bM0 == bM0last) break;
 	  fM2++; fD1++; fI1++;
@@ -686,7 +678,26 @@ namespace cbrc{
 	  lp2 -= alphabetSizeIncrement;
 	}
       }
+
+      delNextCount += dNextCount * scale1;
+      insNextCount += iNextCount * scale1;
     }
+
+    const double delOpen = gapCosts.delProbPieces[0].openProb;
+    const double delGrow = gapCosts.delProbPieces[0].growProb;
+    const double insOpen = gapCosts.insProbPieces[0].openProb;
+    const double insGrow = gapCosts.insProbPieces[0].growProb;
+
+    const double delInit = delOpen * delGrow;  // for 1st letter in a deletion
+    const double insInit = insOpen * insGrow;  // for 1st letter in an insert
+
+    const double delNext = delGrow - delInit;
+    const double insNext = insGrow - insInit;
+
+    c.delInit += delInitCount * delInit;
+    c.delNext += delNextCount * delNext;
+    c.insInit += insInitCount * insInit;
+    c.insNext += insNextCount * insNext;
   }
 
 }  // end namespace cbrc


=====================================
src/GappedXdropAligner.hh
=====================================
@@ -205,13 +205,13 @@ class GappedXdropAligner {
                      int frameshiftCost);
 
   // Like "align3", but does "new style" DNA-protein alignment.
-  int alignFrame(const uchar *seq1,
-		 const uchar *seq2frame0,
-		 const uchar *seq2frame1,  // the +1 frame
-		 const uchar *seq2frame2,  // the +2 frame
+  int alignFrame(const uchar *protein,
+		 const uchar *frame0,  // translated DNA,  0 frame
+		 const uchar *frame1,  // translated DNA, +1 frame
+		 const uchar *frame2,  // translated DNA, +2 frame
 		 bool isForward,
 		 const ScoreMatrixRow *scorer,
-		 const GapCosts &gap,
+		 const GapCosts &gapCosts,
 		 int maxScoreDrop);
 
   // Use this after alignFrame.  The cost of the unaligned region
@@ -220,8 +220,8 @@ class GappedXdropAligner {
   bool getNextChunkFrame(size_t &end1,
 			 size_t &end2,
 			 size_t &length,
-			 int &gapCost,  // cost of the gap before this chunk
-			 const GapCosts &gap);
+			 int &costOfNearerGap,
+			 const GapCosts &gapCosts);
 
   void writeShape(std::ostream &out) const;
 


=====================================
src/GappedXdropAlignerFrame.cc
=====================================
@@ -22,33 +22,33 @@ void GappedXdropAligner::initFrame() {
   bestAntidiagonal = 0;
 }
 
-int GappedXdropAligner::alignFrame(const uchar *seq1,
-				   const uchar *seq2frame0,
-				   const uchar *seq2frame1,  // the +1 frame
-				   const uchar *seq2frame2,  // the +2 frame
+int GappedXdropAligner::alignFrame(const uchar *protein,
+				   const uchar *frame0,
+				   const uchar *frame1,
+				   const uchar *frame2,
 				   bool isForward,
 				   const ScoreMatrixRow *scorer,
-				   const GapCosts &gap,
+				   const GapCosts &gapCosts,
 				   int maxScoreDrop) {
   if (!isForward) {
-    --seq1; --seq2frame0; --seq2frame1; --seq2frame2;
+    --protein; --frame0; --frame1; --frame2;
   }
-  const_uchar_ptr frames[] = {seq2frame0, seq2frame1, seq2frame2};
+  const_uchar_ptr frames[] = {frame0, frame1, frame2};
   const int seqIncrement = isForward ? 1 : -1;
 
-  const int delOpenScore = -gap.delPieces[0].openCost;
-  const int insOpenScore = -gap.insPieces[0].openCost;
-  const int delScore1 = gap.delScore1;
-  const int delScore2 = gap.delScore2;
-  const int delScore3 = gap.delScore3;
-  const int insScore1 = gap.insScore1;
-  const int insScore2 = gap.insScore2;
-  const int insScore3 = gap.insScore3;
+  const int delOpenScore = -gapCosts.delPieces[0].openCost;
+  const int insOpenScore = -gapCosts.insPieces[0].openCost;
+  const int delScore1 = gapCosts.delScore1;
+  const int delScore2 = gapCosts.delScore2;
+  const int delScore3 = gapCosts.delScore3;
+  const int insScore1 = gapCosts.insScore1;
+  const int insScore2 = gapCosts.insScore2;
+  const int insScore3 = gapCosts.insScore3;
 
   int runOfDrops = 2;
   int runOfEdges = 0;
   int numCells = 1;
-  size_t seq1end = 1;
+  size_t proteinEnd = 1;
   size_t diagPos6 = xdropPadLen - 1;
   size_t horiPos5 = xdropPadLen * 2 - 1;
   size_t horiPos4 = xdropPadLen * 3 - 1;
@@ -63,11 +63,11 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
   initFrame();
 
   for (size_t antidiagonal = 0; /* noop */; ++antidiagonal) {
-    const uchar *s1 = seq1;
-    const uchar *s2 = frames[antidiagonal % 3];
+    const uchar *s1 = protein;
     frames[antidiagonal % 3] += seqIncrement;
+    const uchar *s2 = frames[antidiagonal % 3];
 
-    initAntidiagonal(antidiagonal + 6, seq1end, thisPos, numCells);
+    initAntidiagonal(antidiagonal + 6, proteinEnd, thisPos, numCells);
     thisPos += xdropPadLen;
     Score *X0 = &xScores[thisPos];
     Score *Y0 = &yScores[thisPos];
@@ -83,6 +83,7 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
     int minScore = bestScore - maxScoreDrop;
 
     for (int i = 0; i < numCells; ++i) {
+      s2 -= seqIncrement;
       int s = scorer[*s1][*s2];
       int y1 = Y5[i] + delScore1;
       int y2 = Y4[i] + delScore2;
@@ -97,7 +98,6 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
       Y0[i] = maxValue(b + delOpenScore, y3);
       Z0[i] = maxValue(b + insOpenScore, z3);
       s1 += seqIncrement;
-      s2 -= seqIncrement;
     }
 
     if (newBestScore > bestScore) {
@@ -118,7 +118,7 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
       ++runOfEdges;
       if (runOfEdges == 3) {
 	++numCells;
-	++seq1end;
+	++proteinEnd;
 	runOfEdges = 0;
       }
     }
@@ -130,7 +130,7 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
       if (runOfDrops == 3) {
 	--numCells;
 	if (numCells == 0) break;
-	seq1 += seqIncrement;
+	protein += seqIncrement;
 	frames[0] -= seqIncrement;
 	frames[1] -= seqIncrement;
 	frames[2] -= seqIncrement;
@@ -152,18 +152,18 @@ int GappedXdropAligner::alignFrame(const uchar *seq1,
 bool GappedXdropAligner::getNextChunkFrame(size_t &end1,
 					   size_t &end2,
 					   size_t &length,
-					   int &gapCost,
-					   const GapCosts &gap) {
+					   int &costOfNearerGap,
+					   const GapCosts &gapCosts) {
   if (bestAntidiagonal == 0) return false;
 
-  const int delOpenScore = -gap.delPieces[0].openCost;
-  const int insOpenScore = -gap.insPieces[0].openCost;
-  const int delScore1 = gap.delScore1;
-  const int delScore2 = gap.delScore2;
-  const int delScore3 = gap.delScore3;
-  const int insScore1 = gap.insScore1;
-  const int insScore2 = gap.insScore2;
-  const int insScore3 = gap.insScore3;
+  const int delOpenScore = -gapCosts.delPieces[0].openCost;
+  const int insOpenScore = -gapCosts.insPieces[0].openCost;
+  const int delScore1 = gapCosts.delScore1;
+  const int delScore2 = gapCosts.delScore2;
+  const int delScore3 = gapCosts.delScore3;
+  const int insScore1 = gapCosts.insScore1;
+  const int insScore2 = gapCosts.insScore2;
+  const int insScore3 = gapCosts.insScore3;
 
   end1 = bestSeq1position;
   end2 = bestAntidiagonal - bestSeq1position * 3;
@@ -187,7 +187,7 @@ bool GappedXdropAligner::getNextChunkFrame(size_t &end1,
 
   length = end1 - bestSeq1position;
   if (bestAntidiagonal == 0) {
-    gapCost = 0;
+    costOfNearerGap = 0;
     return true;
   }
 
@@ -212,7 +212,7 @@ bool GappedXdropAligner::getNextChunkFrame(size_t &end1,
     state = std::max_element(opt, opt + 7) - opt;
   } while (state != 0);
 
-  gapCost = opt[0] - scoreHere;
+  costOfNearerGap = opt[0] - scoreHere;
   return true;
 }
 


=====================================
src/GeneticCode.cc
=====================================
@@ -49,7 +49,8 @@ void GeneticCode::codeTableSet( const Alphabet& aaAlph, const Alphabet& dnaAlph
 {
   uchar codon[3];
 
-  genome2residue.assign( UNKNOWN, 'X' );
+  int size = maxDnaAlphabetSize * maxDnaAlphabetSize * maxDnaAlphabetSize;
+  genome2residue.assign(size, 'X');
 
   for( size_t i = 0 ; i < AAs.size() ; i++ ){
     char aminoAcid = std::toupper( AAs[i] );
@@ -71,13 +72,13 @@ void GeneticCode::codeTableSet( const Alphabet& aaAlph, const Alphabet& dnaAlph
   }
 
   // codons containing DNA delimiters, or lowercase bases
-  for( int i = 0 ; i < NumMember ; i++ ){
+  for( int i = 0 ; i < maxDnaAlphabetSize ; i++ ){
     codon[0]= dnaAlph.decode[i];
 
-    for( int j = 0 ; j < NumMember ; j++ ){
+    for( int j = 0 ; j < maxDnaAlphabetSize ; j++ ){
       codon[1]= dnaAlph.decode[j];
 
-      for( int k = 0 ; k < NumMember ; k++ ){
+      for( int k = 0 ; k < maxDnaAlphabetSize ; k++ ){
 	codon[2]= dnaAlph.decode[k];
 
 	int c = codon2number2( codon, dnaAlph );
@@ -99,15 +100,16 @@ void GeneticCode::codeTableSet( const Alphabet& aaAlph, const Alphabet& dnaAlph
   return;
 }
 
-void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
-			     bool isMaskLowercase) {
-  const int n = NumMember;
+const unsigned delimiterCodon = 64;
+const unsigned unknownCodon = 65;
+
+static void initCodonLookup(std::vector<uchar> &lookupFromDnaTriplet,
+			    const uchar *ntToNumber, bool isMaskLowercase) {
+  const int n = maxDnaAlphabetSize;
   const char dna[] = "ACGTacgt";
   const unsigned dnaMax = isMaskLowercase ? 4 : 8;
-  const unsigned delimiterCodon = 64;
-  const unsigned unknownCodon = 65;
 
-  genome2residue.assign(n * n * n, unknownCodon);
+  lookupFromDnaTriplet.assign(n * n * n, unknownCodon);
 
   for (unsigned i = 0; i < dnaMax; ++i) {
     for (unsigned j = 0; j < dnaMax; ++j) {
@@ -116,12 +118,21 @@ void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
 	uchar b = dna[j];  unsigned y = ntToNumber[b];
 	uchar c = dna[k];  unsigned z = ntToNumber[c];
 	unsigned codonNumber = (i % 4) * 16 + (j % 4) * 4 + (k % 4);
-	genome2residue[x*n*n + y*n + z] = codonNumber;
+	lookupFromDnaTriplet[x*n*n + y*n + z] = codonNumber;
       }
     }
   }
 
-  setDelimiters(&genome2residue[0], n, delimiterCodon);
+  setDelimiters(&lookupFromDnaTriplet[0], n, delimiterCodon);
+}
+
+void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
+			     bool isMaskLowercase, bool isUnmaskLowercase) {
+  initCodonLookup(genome2residue, ntToNumber, isMaskLowercase);
+
+  if (isMaskLowercase && isUnmaskLowercase) {
+    initCodonLookup(genome2residueWithoutLowercaseMasking, ntToNumber, false);
+  }
 
   uchar unknown = 'X';
   uchar delimiter = ' ';
@@ -136,24 +147,33 @@ void GeneticCode::initCodons(const uchar *ntToNumber, const uchar *aaToNumber,
   codonToAminoAcid[delimiterCodon] = aaToNumber[delimiter];
 }
 
-//
-void GeneticCode::translate( const uchar* beg, const uchar* end,
-			     uchar* dest ) const{
-  unsigned delimiter = genome2residue[4];
-  size_t size = end - beg;
-
-  for( size_t i = 0 ; i < 3 ; i++ ){
-    for( size_t j = i ; j+2 < size ; j+=3 ){
-      *dest++ = translation( beg + j );
-    }
+static void doTranslate(const uchar *lookupFromDnaTriplet,
+			const uchar *beg, size_t size, uchar *dest) {
+  const int n = maxDnaAlphabetSize;
+  const unsigned delimiter = lookupFromDnaTriplet[4];
 
+  for (size_t i = 0; i < 3; ++i) {
+    for (size_t j = i; j + 2 < size; j += 3) {
+      const uchar *b = beg + j;
+      *dest++ = lookupFromDnaTriplet[n * n * b[0] + n * b[1] + b[2]];
+    }
     // this ensures that each reading frame has exactly the same size:
-    if( i > size % 3 ) *dest++ = delimiter;
+    if (i > size % 3) *dest++ = delimiter;
   }
 
   // this ensures that the size of the translated sequence is exactly "size":
-  if( size % 3 > 0 ) *dest++ = delimiter;
-  if( size % 3 > 1 ) *dest++ = delimiter;
+  if (size % 3 > 0) *dest++ = delimiter;
+  if (size % 3 > 1) *dest++ = delimiter;
+}
+
+void GeneticCode::translate(const uchar *beg, const uchar *end,
+			    uchar *dest) const {
+  doTranslate(&genome2residue[0], beg, end - beg, dest);
+}
+
+void GeneticCode::translateWithoutMasking(const uchar *beg, const uchar *end,
+					  uchar *dest) const {
+  doTranslate(&genome2residueWithoutLowercaseMasking[0], beg, end - beg, dest);
 }
 
 //


=====================================
src/GeneticCode.hh
=====================================
@@ -19,17 +19,20 @@ typedef unsigned char uchar;
 
 class Alphabet;
 
+const int maxDnaAlphabetSize = 54;
+
 class GeneticCode{
  private:
   std::string AAs;
   std::string Base[3];
-  static const int NumMember = 54;  // DNA member
-  static const int UNKNOWN = NumMember*NumMember*NumMember;  // unknown residue
   std::vector<uchar> genome2residue;
+  std::vector<uchar> genome2residueWithoutLowercaseMasking;
   uchar codonToAminoAcid[256];
 
-  static int codon2number( const uchar* codon )
-  { return codon[0] * NumMember*NumMember + codon[1] * NumMember + codon[2]; }
+  static int codon2number(const uchar *codon) {
+    const int n = maxDnaAlphabetSize;
+    return n * n * codon[0] + n * codon[1] + codon[2];
+  }
 
   static int codon2number2( const uchar* codon, const Alphabet& dnaAlph );
 
@@ -49,11 +52,16 @@ class GeneticCode{
   // 64=delimiter, 65=unknown).  Also setup codonToAminoAcid.  Any DNA
   // triplet with non-ACGT (or with lowercase if isMaskLowercase is
   // true) gets translated to 65=unknown.
+  // If isMaskLowercase and isUnmaskLowercase are both true, also
+  // setup translateWithoutMasking.
   void initCodons( const uchar *ntToNumber, const uchar *aaToNumber,
-		   bool isMaskLowercase );
+		   bool isMaskLowercase, bool isUnmaskLowercase );
 
   void translate( const uchar* beg, const uchar* end, uchar* dest ) const;
 
+  void translateWithoutMasking( const uchar* beg, const uchar* end,
+				uchar* dest ) const;
+
   uchar translation( const uchar* codon ) const
   { return genome2residue[ codon2number( codon ) ]; }
 


=====================================
src/LastEvaluer.cc
=====================================
@@ -4,10 +4,14 @@
 
 #include "GeneticCode.hh"
 
+#include "alp/sls_falp_alignment_evaluer.hpp"
+
 #include <algorithm>
 #include <cstring>
 #include <iostream>
 
+#include <random>
+
 #define COUNTOF(a) (sizeof (a) / sizeof *(a))
 
 namespace cbrc {
@@ -296,7 +300,7 @@ void LastEvaluer::init(const char *matrixName,
 	  const FrameshiftEvalueParameters &p = frameshiftEvalueParameters[i];
 	  if (isHit(p, geneticCodeName, matrixName, delOpen, delEpen,
 		    frameshiftCost))
-	    return frameshiftEvaluer.initParameters(p.parameters);
+	    return evaluer.initParameters(p.parameters);
 	}
       }
     }
@@ -317,22 +321,25 @@ void LastEvaluer::init(const char *matrixName,
     std::vector<double> aaFreqs(matrixSize);
     copy(letterFreqs1, letterFreqs1 + alphabetSize, aaFreqs.begin());
 
-    if (frameshiftCost > 0) {  // with frameshifts:
-      if (isGapped) {
-	frameshiftEvaluer.initFrameshift(4, matrixSize, codonTable,
-					 &matrix[0], &ntFreqs[0], &aaFreqs[0],
-					 delOpen, delEpen, insOpen, insEpen,
-					 frameshiftCost, true,
-					 lambdaTolerance, kTolerance,
-					 0, maxMegabytes, randomSeed);
-      } else {
-	frameshiftEvaluer.initGapless(4, matrixSize, codonTable,
-				      &matrix[0], &ntFreqs[0], &aaFreqs[0]);
-      }
+    if (isGapped && frameshiftCost > 0) {  // with frameshifts:
+      Sls::FrameshiftAlignmentEvaluer frameshiftEvaluer;
+      frameshiftEvaluer.initFrameshift(4, matrixSize, codonTable,
+				       &matrix[0], &ntFreqs[0], &aaFreqs[0],
+				       delOpen, delEpen, insOpen, insEpen,
+				       frameshiftCost, true,
+				       lambdaTolerance, kTolerance,
+				       0, maxMegabytes, randomSeed);
+      const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
+      Sls::AlignmentEvaluerParameters q = {p.lambda, p.K,
+					   p.a_I, p.b_I,  // !!! no flip
+					   p.a_J, p.b_J,
+					   p.alpha_I, p.beta_I,
+					   p.alpha_J, p.beta_J,
+					   p.sigma, p.tau};
+      evaluer.initParameters(q);
     } else {  // without frameshifts:
       std::vector<double> txFreqs(matrixSize);
       makeTxFreqs(&txFreqs[0], &ntFreqs[0], codonTable);
-
       if (isGapped) {
 	evaluer.set_gapped_computation_parameters_simplified(maxSeconds);
 	evaluer.initGapped(matrixSize, &matrix[0], &txFreqs[0], &aaFreqs[0],
@@ -342,6 +349,14 @@ void LastEvaluer::init(const char *matrixName,
       } else {
 	evaluer.initGapless(matrixSize, &matrix[0], &txFreqs[0], &aaFreqs[0]);
       }
+      const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+      Sls::AlignmentEvaluerParameters q = {p.lambda, p.K,
+					   p.a_J * 3, p.b_J * 3,
+					   p.a_I, p.b_I,  // !!! flip (I, J)
+					   p.alpha_J * 9, p.beta_J * 9,
+					   p.alpha_I, p.beta_I,
+					   p.sigma * 3, p.tau * 3};
+      evaluer.initParameters(q);
     }
   } else {  // ordinary alignment:
     if (isGapped && insOpen == delOpen && insEpen == delEpen) {
@@ -402,31 +417,74 @@ void LastEvaluer::init(const char *matrixName,
   }
 }
 
-static int theMinScore(double lambda, double k, double evalue, double area) {
-  // do log of evalue separately, to reduce the risk of overflow:
-  double s = (std::log(k * area) - std::log(evalue)) / lambda;
-  return std::ceil(std::max(1.0, s));
-}
+void LastEvaluer::initFrameshift(const const_dbl_ptr *substitutionProbs,
+				 const double *proteinLetterFreqs,
+				 int numProteinLetters,
+				 const double *tranDnaLetterFreqs,
+				 int numTranDnaLetters,
+				 const GapCosts &gapCosts,
+				 double scale, int verbosity) {
+  FrameshiftXdropAligner aligner;
+  int proteinLength = 200;  // xxx long enough to avoid edge effects ???
+  int tranDnaLength = proteinLength * 3;
+  int numOfAlignments = 50;  // suggested by Y-K Yu, R Bundschuh, T Hwa, 2002
+  if (verbosity > 1) numOfAlignments = 1000;
+
+  std::mt19937_64 randGen;
+  std::discrete_distribution<> pDist(proteinLetterFreqs,
+				     proteinLetterFreqs + numProteinLetters);
+  std::discrete_distribution<> tDist(tranDnaLetterFreqs,
+				     tranDnaLetterFreqs + numTranDnaLetters);
+
+  std::vector<uchar> seqs(proteinLength + tranDnaLength);
+  uchar *protein = &seqs[0];
+  uchar *tranDna = protein + proteinLength;
+
+  double probRatioSum = 0;
+
+  for (int i = 0; i < numOfAlignments; ++i) {
+    for (int j = 0; j < proteinLength; ++j) protein[j] = pDist(randGen);
+    for (int j = 0; j < tranDnaLength; ++j) tranDna[j] = tDist(randGen);
+    double p = aligner.maxSumOfProbRatios(protein, proteinLength,
+					  tranDna, tranDnaLength,
+					  substitutionProbs, gapCosts);
+    probRatioSum += 1 / p;
+    if (verbosity > 1) std::cerr << "simScore: " << (log(p) / scale) << "\n";
+  }
 
-int LastEvaluer::minScore(double evalue, double area) const {
-  if (evaluer.isGood()) {
-    const Sls::ALP_set_of_parameters &p = evaluer.parameters();
-    return theMinScore(p.lambda, p.K, evalue, area);
-  } else if (frameshiftEvaluer.isGood()) {
-    const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
-    return theMinScore(p.lambda, p.K, evalue, area);
-  } else {
-    return -1;
+  // max likelihood k  =  1 / (m * n * avg[exp(-lambda * score)])
+  double k = numOfAlignments / (proteinLength * tranDnaLength * probRatioSum);
+
+  if (verbosity > 1) {
+    std::cerr << "lambda k m n: " << scale << " " << k << " "
+	      << proteinLength << " " << tranDnaLength << "\n";
   }
+
+  Sls::AlignmentEvaluerParameters p = {scale, k, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+  evaluer.initParameters(p);
+}
+
+double LastEvaluer::minScore(double evalue, double area) const {
+  const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+  // do log of evalue separately, to reduce the risk of overflow:
+  double s = (log(p.K * area) - log(evalue)) / p.lambda;
+  return std::max(0.0, s);
 }
 
-int LastEvaluer::minScore(double queryLettersPerRandomAlignment) const {
-  if (!isGood()) return -1;
+double LastEvaluer::minScore(double queryLettersPerRandomAlignment) const {
   double huge = 1e9;
-  double f = queryLettersPerRandomAlignment / huge;
-  for (int score = 1; ; ++score) {
-    double evalue = area(score, huge) * evaluePerArea(score);
-    if (evalue * f <= 1) return score;
+  const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+  double x = queryLettersPerRandomAlignment * databaseSeqNum;
+  double beg = 0;
+  double end = log(x * databaseSeqLen * p.K) / p.lambda;
+
+  while (1) {
+    double mid = (beg + end) / 2;
+    if (mid <= beg || mid >= end) return mid;
+    if (evaluer.evalue(mid, huge, databaseSeqLen) < huge / x)
+      end = mid;
+    else
+      beg = mid;
   }
 }
 
@@ -434,9 +492,6 @@ void LastEvaluer::writeCommented(std::ostream& out) const {
   if (evaluer.isGood()) {
     const Sls::ALP_set_of_parameters &p = evaluer.parameters();
     out << "# lambda=" << p.lambda << " K=" << p.K << "\n";
-  } else if (frameshiftEvaluer.isGood()) {
-    const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
-    out << "# lambda=" << p.lambda << " K=" << p.K << "\n";
   }
 }
 
@@ -444,19 +499,12 @@ void LastEvaluer::writeParameters(std::ostream &out) const {
   std::streamsize prec = out.precision(17);
   if (evaluer.isGood()) {
     const Sls::ALP_set_of_parameters &p = evaluer.parameters();
+    // !!! flip (I, J) order, to match AlignmentEvaluer::initParameters
     out << p.lambda << ", " << p.K << ",\n"
-	<< p.a_I << ", " << p.b_I << ",\n"
 	<< p.a_J << ", " << p.b_J << ",\n"
-	<< p.alpha_I << ", " << p.beta_I << ",\n"
-	<< p.alpha_J << ", " << p.beta_J << ",\n"
-	<< p.sigma << ", " << p.tau << "\n";
-  } else if (frameshiftEvaluer.isGood()) {
-    const Sls::FALP_set_of_parameters &p = frameshiftEvaluer.parameters();
-    out << p.lambda << ", " << p.K << ",\n"
 	<< p.a_I << ", " << p.b_I << ",\n"
-	<< p.a_J << ", " << p.b_J << ",\n"
-	<< p.alpha_I << ", " << p.beta_I << ",\n"
 	<< p.alpha_J << ", " << p.beta_J << ",\n"
+	<< p.alpha_I << ", " << p.beta_I << ",\n"
 	<< p.sigma << ", " << p.tau << "\n";
   }
   out.precision(prec);


=====================================
src/LastEvaluer.hh
=====================================
@@ -18,12 +18,14 @@
 #define LAST_EVALUER_HH
 
 #include "ScoreMatrixRow.hh"
+#include "mcf_frameshift_xdrop_aligner.hh"
 
 #include "alp/sls_alignment_evaluer.hpp"
-#include "alp/sls_falp_alignment_evaluer.hpp"
 
 namespace cbrc {
 
+using namespace mcf;
+
 class GeneticCode;
 
 class LastEvaluer {
@@ -54,6 +56,14 @@ public:
 	    const char *geneticCodeName,
 	    int verbosity);
 
+  // "new-style" frameshifts, sum-of-paths scores
+  // scale=lambda
+  // The Freqs need not sum to 1
+  void initFrameshift(const const_dbl_ptr *substitutionProbs,
+		      const double *proteinLetterFreqs, int numProteinLetters,
+		      const double *tranDnaLetterFreqs, int numTranDnaLetters,
+		      const GapCosts &gapCosts, double scale, int verbosity);
+
   void setSearchSpace(double databaseLength,  // number of database letters
 		      double databaseMaxSeqLength,  // length of longest seq
 		      double numOfStrands) {  // 1 or 2
@@ -66,37 +76,26 @@ public:
     }
   }
 
-  bool isGood() const
-  { return evaluer.isGood() || frameshiftEvaluer.isGood(); }
+  bool isGood() const { return evaluer.isGood(); }
 
   // Don't call this in the "bad" state:
-  double evaluePerArea(double score) const {
-    return evaluer.isGood() ?
-      evaluer.evaluePerArea(score) : frameshiftEvaluer.evaluePerArea(score);
-  }
+  double evaluePerArea(double score) const
+  { return evaluer.evaluePerArea(score); }
 
   // Don't call this in the "bad" state or before calling setSearchSpace:
-  double area(double score, double queryLength) const {
-    return databaseSeqNum *
-      (evaluer.isGood()
-       ?           evaluer.area(score, queryLength, databaseSeqLen)
-       : frameshiftEvaluer.area(score, queryLength, databaseSeqLen));
-  }
+  double area(double score, double queryLength) const
+  { return databaseSeqNum * evaluer.area(score, queryLength, databaseSeqLen); }
 
   // Don't call this in the "bad" state:
-  double bitScore(double score) const {
-    return evaluer.isGood() ?
-      evaluer.bitScore(score) : frameshiftEvaluer.bitScore(score);
-  }
+  double bitScore(double score) const { return evaluer.bitScore(score); }
 
-  // In the "good" state: returns the minimum score with E-value <=
-  // "evalue", which is always > 0.  Otherwise: returns -1.
-  int minScore(double evalue, double area) const;
+  // Returns max(0, score with E-value == "evalue").
+  // Don't call this in the "bad" state.
+  double minScore(double evalue, double area) const;
 
-  // In the "good" state, after calling setSearchSpace: returns the
-  // minimum positive score with E-value <= 1 per this many query
-  // letters.  In the "bad" state: returns -1.
-  int minScore(double queryLettersPerRandomAlignment) const;
+  // Returns max(0, score with E-value == 1 per this many query letters).
+  // Don't call this in the "bad" state or before calling setSearchSpace..
+  double minScore(double queryLettersPerRandomAlignment) const;
 
   // Writes some parameters preceded by "#".  Does nothing in the "bad" state.
   void writeCommented(std::ostream& out) const;
@@ -106,7 +105,6 @@ public:
 
 private:
   Sls::AlignmentEvaluer evaluer;
-  Sls::FrameshiftAlignmentEvaluer frameshiftEvaluer;
   double databaseSeqLen;
   double databaseSeqNum;
 };


=====================================
src/LastalArguments.cc
=====================================
@@ -3,12 +3,14 @@
 #include "LastalArguments.hh"
 #include "stringify.hh"
 #include "getoptUtil.hh"
+
+#include <math.h>
+
 #include <algorithm>  // max
 #include <iostream>
 #include <sstream>
 #include <stdexcept>
 #include <cctype>
-#include <cmath>  // log
 #include <cstring>  // strtok
 #include <cstdlib>  // EXIT_SUCCESS
 
@@ -561,7 +563,7 @@ static int percent(int val, int percentage) {
   return (percentage == 100) ? val : val * percentage / 100;
 }
 
-void LastalArguments::setDefaultsFromMatrix(double lambda, int minScore,
+void LastalArguments::setDefaultsFromMatrix(double lambda, double minScore,
 					    double maxEvalueDefault) {
   if (maxEvalue < 0) maxEvalue = maxEvalueDefault;
   if( outputType < 2 && minScoreGapped < 0 ) minScoreGapped = minScoreGapless;
@@ -572,11 +574,15 @@ can't calculate E-values: maybe the mismatch or gap costs are too weak.\n\
 To proceed without E-values, set a score threshold with option -e.");
     minScoreGapped = minScore;
   }
-  if( outputType < 2 ) minScoreGapless = minScoreGapped;
+
+  if (minScoreGapped > INT_MAX)
+    ERR("the alignment score threshold is too big");
+
+  if (outputType < 2) minScoreGapless = ceil(minScoreGapped);
 
   if( temperature < 0 ) temperature = 1 / lambda;
 
-  int defaultMaxScoreDrop = std::max(minScoreGapped - 1, 0);
+  int defaultMaxScoreDrop = std::max(ceil(minScoreGapped) - 1, 0.0);  // xxx
   defaultMaxScoreDrop = std::max(defaultMaxScoreDrop, maxDropGapless);
 
   if (maxDropFinalSuffix == '%') {


=====================================
src/LastalArguments.hh
=====================================
@@ -2,16 +2,17 @@
 
 // This struct holds the command line arguments for lastal.
 
-#ifndef LASTALARGUMENTS_HH
-#define LASTALARGUMENTS_HH
+#ifndef LASTAL_ARGUMENTS_HH
+#define LASTAL_ARGUMENTS_HH
 
 #include "SequenceFormat.hh"
 
-#include <climits>
-#include <string>
+#include <limits.h>
+#include <stddef.h>
+
 #include <iosfwd>
+#include <string>
 #include <vector>
-#include <stddef.h>  // size_t
 
 namespace cbrc{
 
@@ -39,7 +40,7 @@ struct LastalArguments{
                                 bool isCaseSensitiveSeeds, bool isVolumes,
 				size_t refMinimizerWindow,
 				unsigned realNumOfThreads );
-  void setDefaultsFromMatrix(double lambda, int minScore,
+  void setDefaultsFromMatrix(double lambda, double minScore,
 			     double maxEvalueDefault);
 
   // write the parameter settings, starting each line with "#":
@@ -80,7 +81,7 @@ struct LastalArguments{
   int maskLowercase;
   double maxEvalue;
   double queryLettersPerRandomAlignment;
-  int minScoreGapped;
+  double minScoreGapped;
   int minScoreGapless;
   int matchScore;
   int mismatchCost;
@@ -121,5 +122,6 @@ struct LastalArguments{
   int inputStart;  // index in argv of first input filename
 };
 
-}  // end namespace cbrc
-#endif  // LASTALARGUMENTS_HH
+}
+
+#endif


=====================================
src/LastdbArguments.cc
=====================================
@@ -27,6 +27,7 @@ using namespace cbrc;
 
 LastdbArguments::LastdbArguments() :
   isProtein(false),
+  isAddStops(false),
   isKeepLowercase(true),
   tantanSetting(0),
   isCaseSensitive(false),
@@ -72,6 +73,7 @@ Advanced Options (default settings):\n\
     + stringify(strand) + ")\n\
 -s: volume size (unlimited)\n\
 -Q: input format: fastx, keep, sanger, solexa, illumina (default=fasta)\n\
+-q: interpret the sequences as proteins and append */STOP\n\
 -P: number of parallel threads ("
     + stringify(numOfThreads) + ")\n\
 -m: seed patterns (1=match, 0=anything, @=transition)\n\
@@ -92,7 +94,7 @@ Report bugs to: last-align (ATmark) googlegroups (dot) com\n\
 LAST home page: http://last.cbrc.jp/\n\
 ";
 
-  static const char sOpts[] = "hVpR:cm:d:S:s:w:W:P:u:a:i:b:B:C:xvQ:";
+  static const char sOpts[] = "hVpqR:cm:d:S:s:w:W:P:u:a:i:b:B:C:xvQ:";
 
   int c;
   while ((c = myGetopt(argc, argv, sOpts)) != -1) {
@@ -108,6 +110,9 @@ LAST home page: http://last.cbrc.jp/\n\
     case 'p':
       isProtein = true;
       break;
+    case 'q':
+      isAddStops = true;
+      break;
     case 'R':
       if( optarg[0] < '0' || optarg[0] > '1' ) badopt( c, optarg );
       if( optarg[1] < '0' || optarg[1] > '2' ) badopt( c, optarg );


=====================================
src/LastdbArguments.hh
=====================================
@@ -31,6 +31,7 @@ struct LastdbArguments{
 
   // options:
   bool isProtein;
+  bool isAddStops;
   bool isKeepLowercase;
   int tantanSetting;
   bool isCaseSensitive;


=====================================
src/MultiSequence.cc
=====================================
@@ -9,13 +9,15 @@
 
 using namespace cbrc;
 
-void MultiSequence::initForAppending( indexT padSizeIn ){
+void MultiSequence::initForAppending(indexT padSizeIn,
+				     bool isAppendStopSymbol) {
   padSize = padSizeIn;
   seq.v.assign( padSize, ' ' );
   ends.v.assign( 1, padSize );
   names.v.clear();
   nameEnds.v.assign( 1, 0 );
   qualityScoresPerLetter = 0;
+  isAppendingStopSymbol = isAppendStopSymbol;
 }
 
 void MultiSequence::reinitForAppending(){
@@ -98,7 +100,12 @@ MultiSequence::appendFromFasta( std::istream& stream, indexT maxSeqLen ){
     c = buf->snextc();
   }
 
-  if (isRoomToAppendPad(maxSeqLen)) finish();
+  if (seq.v.size() <= maxSeqLen &&
+      padSize + isAppendingStopSymbol <= maxSeqLen - seq.v.size()) {
+    if (isAppendingStopSymbol) seq.v.push_back('*');
+    finish();
+  }
+
   return stream;
 }
 


=====================================
src/MultiSequence.hh
=====================================
@@ -24,7 +24,7 @@ class MultiSequence{
   typedef LAST_INT_TYPE indexT;
 
   // initialize with leftmost delimiter pad, ready for appending sequences
-  void initForAppending( indexT padSizeIn );
+  void initForAppending(indexT padSizeIn, bool isAppendStopSymbol = false);
 
   // re-initialize, but keep the last sequence if it is unfinished
   void reinitForAppending();
@@ -141,6 +141,7 @@ class MultiSequence{
   VectorOrMmap<uchar> qualityScores;
   size_t qualityScoresPerLetter;
   bool isReadingFastq;
+  bool isAppendingStopSymbol;
 
   // Read a fasta/fastq header: read the whole line but store just the
   // 1st word


=====================================
src/lastal.cc
=====================================
@@ -311,14 +311,21 @@ static void calculateScoreStatistics(const std::string& matrixName,
   try{
     gaplessEvaluer.init(0, 0, 0, alph.letters.c_str(), scoreMat, p1, p2, false,
 			0, 0, 0, 0, fsCost, geneticCode, 0, 0);
-    if (gapCosts.isNewFrameshifts() && isGapped) return;
-    const mcf::GapCosts::Piece &del = gapCosts.delPieces[0];
-    const mcf::GapCosts::Piece &ins = gapCosts.insPieces[0];
-    evaluer.init(canonicalMatrixName, args.matchScore, args.mismatchCost,
-		 alph.letters.c_str(), scoreMat, p1, p2, isGapped,
-		 del.openCost, del.growCost, ins.openCost, ins.growCost,
-		 fsCost, geneticCode,
-		 args.geneticCodeFile.c_str(), args.verbosity);
+    if (gapCosts.isNewFrameshifts() && isGapped) {
+      if (args.temperature < 0) return;
+      unsigned alphSize2 = scoreMatrix.isCodonCols() ? 64 : alph.size;
+      evaluer.initFrameshift(fwdMatrices.ratios, p1, alph.size,
+			     stats.letterProbs2(), alphSize2,
+			     gapCosts, stats.lambda(), args.verbosity);
+    } else {
+      const mcf::GapCosts::Piece &del = gapCosts.delPieces[0];
+      const mcf::GapCosts::Piece &ins = gapCosts.insPieces[0];
+      evaluer.init(canonicalMatrixName, args.matchScore, args.mismatchCost,
+		   alph.letters.c_str(), scoreMat, p1, p2, isGapped,
+		   del.openCost, del.growCost, ins.openCost, ins.growCost,
+		   fsCost, geneticCode,
+		   args.geneticCodeFile.c_str(), args.verbosity);
+    }
     countT m = std::min(refMaxSeqLen, refLetters);
     evaluer.setSearchSpace(refLetters, m, args.numOfStrands());
     if( args.verbosity > 0 ) evaluer.writeParameters( std::cerr );
@@ -453,8 +460,14 @@ static const ScoreMatrixRow *getQueryPssm(const LastAligner &aligner,
 
 namespace Phase{ enum Enum{ gapless, gapped, final }; }
 
+static bool isFullScoreThreshold() {
+  return gapCosts.isNewFrameshifts();
+}
+
 static bool isMaskLowercase(Phase::Enum e) {
-  return (e < 1 && args.maskLowercase > 0) || args.maskLowercase > 2;
+  return (e < 1 && args.maskLowercase > 0)
+    || (e < 2 && args.maskLowercase > 1 && isFullScoreThreshold())
+    || args.maskLowercase > 2;
 }
 
 struct Dispatcher{
@@ -751,8 +764,9 @@ void alignGapped( LastAligner& aligner,
 
     if( aln.score < args.minScoreGapped ) continue;
 
-    if( !aln.isOptimal( dis.a, dis.b, args.globality, dis.m, dis.d, gapCosts,
-			frameSize, dis.p, dis.t, dis.i, dis.j ) ){
+    if (!isFullScoreThreshold() &&
+	!aln.isOptimal(dis.a, dis.b, args.globality, dis.m, dis.d, gapCosts,
+		       frameSize, dis.p, dis.t, dis.i, dis.j)) {
       // If retained, non-"optimal" alignments can hide "optimal"
       // alignments, e.g. during non-redundantization.
       continue;
@@ -830,8 +844,8 @@ static void eraseWeakAlignments(LastAligner &aligner, AlignmentPot &gappedAlns,
   Dispatcher dis(Phase::gapless, aligner, queryNum, matrices, querySeq);
   for (size_t i = 0; i < gappedAlns.size(); ++i) {
     Alignment &a = gappedAlns.items[i];
-    if (!a.hasGoodSegment(dis.a, dis.b, args.minScoreGapped, dis.m, gapCosts,
-			  frameSize, dis.p, dis.t, dis.i, dis.j)) {
+    if (!a.hasGoodSegment(dis.a, dis.b, ceil(args.minScoreGapped), dis.m,
+			  gapCosts, frameSize, dis.p, dis.t, dis.i, dis.j)) {
       AlignmentPot::mark(a);
     }
   }
@@ -911,44 +925,69 @@ void makeQualityPssm( LastAligner& aligner,
   }
 }
 
+static void unmaskLowercase(LastAligner &aligner, size_t queryNum,
+			    const SubstitutionMatrices &matrices,
+			    uchar *querySeq) {
+  makeQualityPssm(aligner, queryNum, matrices, querySeq, false);
+  if (scoreMatrix.isCodonCols()) {
+    const uchar *q = query.seqReader();
+    geneticCode.translateWithoutMasking(q + query.padBeg(queryNum),
+					q + query.padEnd(queryNum), querySeq);
+  }
+}
+
+static void remaskLowercase(LastAligner &aligner, size_t queryNum,
+			    const SubstitutionMatrices &matrices,
+			    uchar *querySeq) {
+  makeQualityPssm(aligner, queryNum, matrices, querySeq, true);
+  if (scoreMatrix.isCodonCols()) {
+    const uchar *q = query.seqReader();
+    geneticCode.translate(q + query.padBeg(queryNum),
+			  q + query.padEnd(queryNum), querySeq);
+  }
+}
+
 // Scan one query sequence against one database volume
 void scan(LastAligner& aligner, size_t queryNum,
-	  const SubstitutionMatrices &matrices, const uchar* querySeq) {
+	  const SubstitutionMatrices &matrices, uchar *querySeq) {
   if( args.outputType == 0 ){  // we just want match counts
     countMatches( queryNum, querySeq );
     return;
   }
 
-  bool isMask = (args.maskLowercase > 0);
-  makeQualityPssm(aligner, queryNum, matrices, querySeq, isMask);
+  const int maskMode = args.maskLowercase;
+  makeQualityPssm(aligner, queryNum, matrices, querySeq, maskMode > 0);
 
   SegmentPairPot gaplessAlns;
   alignGapless(aligner, gaplessAlns, queryNum, matrices, querySeq);
   if( args.outputType == 1 ) return;  // we just want gapless alignments
   if( gaplessAlns.size() == 0 ) return;
 
-  if( args.maskLowercase == 1 || args.maskLowercase == 2 )
-    makeQualityPssm(aligner, queryNum, matrices, querySeq, false);
+  if (maskMode == 1 || (maskMode == 2 && !isFullScoreThreshold()))
+    unmaskLowercase(aligner, queryNum, matrices, querySeq);
 
   AlignmentPot gappedAlns;
 
-  if( args.maxDropFinal != args.maxDropGapped ){
+  if (args.maxDropFinal != args.maxDropGapped
+      || (maskMode == 2 && isFullScoreThreshold())) {
     alignGapped( aligner, gappedAlns, gaplessAlns,
 		 queryNum, matrices, querySeq, Phase::gapped );
     erase_if( gaplessAlns.items, SegmentPairPot::isNotMarkedAsGood );
+    if (maskMode == 2 && isFullScoreThreshold())
+      unmaskLowercase(aligner, queryNum, matrices, querySeq);
   }
 
   alignGapped( aligner, gappedAlns, gaplessAlns,
 	       queryNum, matrices, querySeq, Phase::final );
   if( gappedAlns.size() == 0 ) return;
 
-  if (args.maskLowercase == 2) {
-    makeQualityPssm(aligner, queryNum, matrices, querySeq, true);
+  if (maskMode == 2 && !isFullScoreThreshold()) {
+    remaskLowercase(aligner, queryNum, matrices, querySeq);
     eraseWeakAlignments(aligner, gappedAlns, queryNum, matrices, querySeq);
     LOG2("lowercase-filtered alignments=" << gappedAlns.size());
     if (gappedAlns.size() == 0) return;
     if (args.outputType > 3)
-      makeQualityPssm(aligner, queryNum, matrices, querySeq, false);
+      unmaskLowercase(aligner, queryNum, matrices, querySeq);
   }
 
   if( args.outputType > 2 ){  // we want non-redundant alignments
@@ -961,9 +1000,9 @@ void scan(LastAligner& aligner, size_t queryNum,
 }
 
 static void tantanMaskOneQuery(size_t queryNum, uchar *querySeq) {
-  size_t beg = query.seqBeg(queryNum) - query.padBeg(queryNum);
-  size_t end = query.seqEnd(queryNum) - query.padBeg(queryNum);
-  tantanMasker.mask(querySeq + beg, querySeq + end, alph.numbersToLowercase);
+  size_t b = query.seqBeg(queryNum) - query.padBeg(queryNum);
+  size_t e = query.seqEnd(queryNum) - query.padBeg(queryNum);
+  tantanMasker.mask(querySeq + b, querySeq + e, queryAlph.numbersToLowercase);
 }
 
 static void tantanMaskTranslatedQuery(size_t queryNum, uchar *querySeq) {
@@ -984,28 +1023,43 @@ static void tantanMaskTranslatedQuery(size_t queryNum, uchar *querySeq) {
 // after optionally translating and/or masking the query
 void translateAndScan(LastAligner& aligner,
 		      size_t queryNum, const SubstitutionMatrices &matrices) {
-  const uchar* querySeq = query.seqReader() + query.padBeg(queryNum);
+  uchar *querySeqs = query.seqWriter();
+  uchar *querySeq = querySeqs + query.padBeg(queryNum);
   std::vector<uchar> modifiedQuery;
   size_t size = query.padLen(queryNum);
 
-  if( args.isTranslated() ){
-    modifiedQuery.resize( size );
-    geneticCode.translate( querySeq, querySeq + size, &modifiedQuery[0] );
-    if( args.tantanSetting && !scoreMatrix.isCodonCols() ){
-      tantanMaskTranslatedQuery( queryNum, &modifiedQuery[0] );
+  if (args.isTranslated()) {
+    if (args.tantanSetting && scoreMatrix.isCodonCols()) {
+      if (args.isKeepLowercase) {
+	err("can't keep lowercase & find simple repeats & use codons");
+      }
+      tantanMaskOneQuery(queryNum, querySeq);
     }
+    modifiedQuery.resize(size);
+    geneticCode.translate(querySeq, querySeq + size, &modifiedQuery[0]);
     querySeq = &modifiedQuery[0];
-  }else{
-    if( args.tantanSetting ){
-      modifiedQuery.assign( querySeq, querySeq + size );
-      tantanMaskOneQuery( queryNum, &modifiedQuery[0] );
-      querySeq = &modifiedQuery[0];
+    if (args.tantanSetting && !scoreMatrix.isCodonCols()) {
+      tantanMaskTranslatedQuery(queryNum, querySeq);
+    }
+  } else {
+    if (args.tantanSetting) {
+      if (args.isKeepLowercase) {
+	modifiedQuery.assign(querySeq, querySeq + size);
+	querySeq = &modifiedQuery[0];
+      }
+      tantanMaskOneQuery(queryNum, querySeq);
     }
   }
 
   size_t oldNumOfAlns = aligner.textAlns.size();
   scan( aligner, queryNum, matrices, querySeq );
   cullFinalAlignments( aligner.textAlns, oldNumOfAlns );
+
+  if (args.tantanSetting && !args.isKeepLowercase) {
+    for (size_t i = query.seqBeg(queryNum); i < query.seqEnd(queryNum); ++i) {
+      querySeqs[i] = queryAlph.numbersToUppercase[querySeqs[i]];
+    }
+  }
 }
 
 static void alignOneQuery(LastAligner &aligner,
@@ -1103,7 +1157,7 @@ int calcMinScoreGapless(double numLettersInReference) {
   // This attempts to ensure that the gapped alignment phase will be
   // reasonably fast relative to the gapless alignment phase.
 
-  if (!gaplessEvaluer.isGood()) return args.minScoreGapped;
+  if (!gaplessEvaluer.isGood()) return ceil(args.minScoreGapped);
 
   double n = args.maxGaplessAlignmentsPerQueryPosition;
   if (args.maxGaplessAlignmentsPerQueryPosition + 1 == 0) n = 10;  // ?
@@ -1118,8 +1172,9 @@ int calcMinScoreGapless(double numLettersInReference) {
   // trial-and-error.  It should depend on the relative speeds of
   // gapless and gapped extensions.
 
-  int s = gaplessEvaluer.minScore(e, numLettersInReference);
-  return std::min(s, args.minScoreGapped);
+  double s = gaplessEvaluer.minScore(e, numLettersInReference);
+  s = std::max(1.0, s);
+  return ceil(std::min(s, args.minScoreGapped));
 }
 
 // Read one database volume
@@ -1251,10 +1306,7 @@ void lastal( int argc, char** argv ){
 				isKeepRefLowercase, refTantanSetting,
                                 isCaseSensitiveSeeds, isMultiVolume,
 				refMinimizerWindow, aligners.size() );
-  if( args.tantanSetting )
-    tantanMasker.init( isProtein, args.tantanSetting > 1,
-		       alph.letters, alph.encode );
-  makeScoreMatrix( matrixName, matrixFile );
+  makeScoreMatrix(matrixName, matrixFile);
   gapCosts.assign(args.delOpenCosts, args.delGrowCosts,
 		  args.insOpenCosts, args.insGrowCosts,
 		  args.frameshiftCosts, args.gapPairCost,
@@ -1267,25 +1319,38 @@ void lastal( int argc, char** argv ){
     geneticCode.fromString(GeneticCode::stringFromName(args.geneticCodeFile));
     if (scoreMatrix.isCodonCols()) {
       geneticCode.initCodons(queryAlph.encode, alph.encode,
-			     args.maskLowercase > 2);
+			     args.maskLowercase > 0, args.maskLowercase < 3);
     } else {
       geneticCode.codeTableSet( alph, queryAlph );
     }
     query.initForAppending(3);
-  }
-  else{
+  } else {
     queryAlph = alph;
     query.initForAppending(1);
   }
 
+  if (args.tantanSetting) {
+    bool isAtRichDna = (args.tantanSetting > 1);
+    if (scoreMatrix.isCodonCols()) {
+      tantanMasker.init(0, isAtRichDna, queryAlph.letters, queryAlph.encode);
+    } else {
+      tantanMasker.init(isProtein, isAtRichDna, alph.letters, alph.encode);
+    }
+  }
+
   if (args.outputType > 0) {
     calculateScoreStatistics(matrixName, refLetters, refMaxSeqLen);
   }
 
-  int minScore = (args.maxEvalue > 0) ? evaluer.minScore(args.maxEvalue, 1e18)
-    : evaluer.minScore(args.queryLettersPerRandomAlignment);
-
-  double eg2 = evaluer.isGood() ? 1e18 * evaluer.evaluePerArea(minScore) : -1;
+  double minScore = -1;
+  double eg2 = -1;
+  if (evaluer.isGood()) {
+    minScore = (args.maxEvalue > 0) ? evaluer.minScore(args.maxEvalue, 1e18)
+      : evaluer.minScore(args.queryLettersPerRandomAlignment);
+    if (!isFullScoreThreshold() || args.outputType < 2)
+      minScore = ceil(std::max(1.0, minScore));
+    eg2 = 1e18 * evaluer.evaluePerArea(minScore);
+  }
   args.setDefaultsFromMatrix(fwdMatrices.stats.lambda(), minScore, eg2);
 
   minScoreGapless = calcMinScoreGapless(refLetters);


=====================================
src/lastdb.cc
=====================================
@@ -27,6 +27,7 @@ typedef unsigned long long countT;
 // Set up an alphabet (e.g. DNA or protein), based on the user options
 void makeAlphabet( Alphabet& alph, const LastdbArguments& args ){
   if( !args.userAlphabet.empty() )  alph.fromString( args.userAlphabet );
+  else if( args.isAddStops )        alph.fromString( alph.proteinWithStop );
   else if( args.isProtein )         alph.fromString( alph.protein );
   else                              alph.fromString( alph.dna );
 }
@@ -343,7 +344,7 @@ void lastdb( int argc, char** argv ){
   LOG("wordLength=" << wordsFinder.wordLength);
 
   MultiSequence multi;
-  multi.initForAppending(1);
+  multi.initForAppending(1, args.isAddStops);
   alph.tr(multi.seqWriter(), multi.seqWriter() + multi.seqBeg(0));
   unsigned volumeNumber = 0;
   countT sequenceCount = 0;
@@ -365,7 +366,7 @@ void lastdb( int argc, char** argv ){
       if (sequenceCount == 0) {
 	maxSeqLen = maxLettersPerVolume(args, wordsFinder,
 					multi.qualsPerLetter(), seeds.size());
-	if (!args.isProtein && args.userAlphabet.empty() &&
+	if (!args.isProtein && !args.isAddStops && args.userAlphabet.empty() &&
 	    isDubiousDna(alph, multi)) {
 	  std::cerr << args.programName << ": that's some funny-lookin DNA\n";
 	}


=====================================
src/mcf_frameshift_xdrop_aligner.cc
=====================================
@@ -6,6 +6,7 @@
 #include <assert.h>
 #include <math.h>
 
+#include <algorithm>
 //#include <iostream>  // for debugging
 
 namespace mcf {
@@ -366,14 +367,141 @@ void FrameshiftXdropAligner::count(bool isRightwardExtension,
     }
   }
 
-  transitionCounts[1] += gapCosts.insProb1 * ins1;
-  transitionCounts[2] += gapCosts.insProb2 * ins2;
-  transitionCounts[3] += gapCosts.insProb3 * ins3;
-  transitionCounts[4] += gapCosts.insProb3 * (insNext - ins3);
+  transitionCounts[1] += gapCosts.delProb3 * delNext;  // deleted whole codons
+  transitionCounts[2] += gapCosts.insProb3 * insNext;  // inserted whole codons
+  transitionCounts[3] += gapCosts.delProb3 * del3;  // in-frame opens/closes
+  transitionCounts[4] += gapCosts.insProb3 * ins3;  // in-frame opens/closes
+
   transitionCounts[5] += gapCosts.delProb1 * del1;
   transitionCounts[6] += gapCosts.delProb2 * del2;
-  transitionCounts[7] += gapCosts.delProb3 * del3;
-  transitionCounts[8] += gapCosts.delProb3 * (delNext - del3);
+  transitionCounts[7] += gapCosts.insProb1 * ins1;
+  transitionCounts[8] += gapCosts.insProb2 * ins2;
+}
+
+double FrameshiftXdropAligner::maxSumOfProbRatios(const uchar *protein,
+						  int proteinLength,
+						  const uchar *tranDna,
+						  int tranDnaLength,
+						  const const_dbl_ptr *substitutionProbs,
+						  const GapCosts &gapCosts) {
+  const double delOpenProb = gapCosts.delProbPieces[0].openProb;
+  const double insOpenProb = gapCosts.insProbPieces[0].openProb;
+  const double delProb1 = gapCosts.delProb1;
+  const double delProb2 = gapCosts.delProb2;
+  const double delProb3 = gapCosts.delProb3;
+  const double insProb1 = gapCosts.insProb1;
+  const double insProb2 = gapCosts.insProb2;
+  const double insProb3 = gapCosts.insProb3;
+
+  const int proteinSize = proteinLength + 1;
+  const int tranDnaSize = tranDnaLength + 1;
+  xFwdProbs.resize(tranDnaSize + proteinSize * tranDnaSize);
+  double *delRow = &xFwdProbs[0];
+  double *newRow = delRow + tranDnaSize;
+  double Y1, Y2, Y3, Z1, Z2, Z3;
+
+  Z1 = Z2 = Z3 = 0;
+  for (int j = 0; j < tranDnaSize; ++j) {
+    double z1 = Z1 * insProb1;
+    double z2 = Z2 * insProb2;
+    double z3 = Z3 * insProb3;
+    double b  = z1 + z2 + z3 + 1;
+    newRow[j] = b;
+    delRow[j] = b * delOpenProb;
+    Z3 = Z2;
+    Z2 = Z1;
+    Z1 = b * insOpenProb + z3;
+  }
+
+  for (int i = 1; i < proteinSize; ++i) {
+    const double *substitutionRow = substitutionProbs[protein[i-1]];
+    const double *oldRow = newRow;
+    newRow += tranDnaSize;
+    Y2 = Z2 = Z3 = 0;
+
+    {
+      Y3 = delRow[0];
+      double y3 = Y3 * delProb3;
+      double b  = y3 + 1;
+      newRow[0] = b;
+      delRow[0] = b * delOpenProb + y3;
+      Z1 = b * insOpenProb;
+    }
+
+    for (int j = 1; j < tranDnaSize; ++j) {
+      Y1 = Y2;
+      Y2 = Y3;
+      Y3 = delRow[j];
+      double x  = oldRow[j-1] * substitutionRow[tranDna[j-1]];
+      double y1 = Y1 * delProb1;
+      double y2 = Y2 * delProb2;
+      double y3 = Y3 * delProb3;
+      double z1 = Z1 * insProb1;
+      double z2 = Z2 * insProb2;
+      double z3 = Z3 * insProb3;
+      double b  = x + y1 + y2 + y3 + z1 + z2 + z3 + 1;
+      newRow[j] = b;
+      delRow[j] = b * delOpenProb + y3;
+      Z3 = Z2;
+      Z2 = Z1;
+      Z1 = b * insOpenProb + z3;
+    }
+  }
+
+  double maxValue = 0;
+
+  Z1 = Z2 = Z3 = 0;
+  for (int j = tranDnaSize; j-- > 0;) {
+    double z1 = Z1 * insProb1;
+    double z2 = Z2 * insProb2;
+    double z3 = Z3 * insProb3;
+    double b  = z1 + z2 + z3 + 1;
+    maxValue  = std::max(maxValue, b * newRow[j]);
+    newRow[j] = b;
+    delRow[j] = b * delOpenProb;
+    Z3 = Z2;
+    Z2 = Z1;
+    Z1 = b * insOpenProb + z3;
+  }
+
+  for (int i = proteinLength; i-- > 0;) {
+    const double *substitutionRow = substitutionProbs[protein[i]];
+    const double *oldRow = newRow;
+    newRow -= tranDnaSize;
+    Y2 = Z2 = Z3 = 0;
+
+    {
+      Y3 = delRow[tranDnaLength];
+      double y3 = Y3 * delProb3;
+      double b  = y3 + 1;
+      maxValue  = std::max(maxValue, b * newRow[tranDnaLength]);
+      newRow[tranDnaLength] = b;
+      delRow[tranDnaLength] = b * delOpenProb + y3;
+      Z1 = b * insOpenProb;
+    }
+
+    for (int j = tranDnaLength; j-- > 0;) {
+      Y1 = Y2;
+      Y2 = Y3;
+      Y3 = delRow[j];
+      double x  = oldRow[j+1] * substitutionRow[tranDna[j]];
+      double y1 = Y1 * delProb1;
+      double y2 = Y2 * delProb2;
+      double y3 = Y3 * delProb3;
+      double z1 = Z1 * insProb1;
+      double z2 = Z2 * insProb2;
+      double z3 = Z3 * insProb3;
+      double b  = x + y1 + y2 + y3 + z1 + z2 + z3 + 1;
+      maxValue  = std::max(maxValue, b * newRow[j]);
+      newRow[j] = b;
+      delRow[j] = b * delOpenProb + y3;
+      Z3 = Z2;
+      Z2 = Z1;
+      Z1 = b * insOpenProb + z3;
+    }
+  }
+
+  return maxValue;
 }
 
 }


=====================================
src/mcf_frameshift_xdrop_aligner.hh
=====================================
@@ -54,6 +54,12 @@ public:
     return xFwdProbs[i] * xBckProbs[totalNumOfCells - j + padSize*7 - 1];
   }
 
+  // tranDna should point to translated DNA: frame 012012012...
+  double maxSumOfProbRatios(const uchar *protein, int proteinLength,
+			    const uchar *tranDna, int tranDnaLength,
+			    const const_dbl_ptr *substitutionProbs,
+			    const GapCosts &gapCosts);
+
 private:
   std::vector<Prob> xFwdProbs;
   std::vector<Prob> yFwdProbs;


=====================================
src/mcf_substitution_matrix_stats.cc
=====================================
@@ -18,6 +18,8 @@ const struct {
 } substitutionMatrixScales[] = {
   {"BL62", 3.0861133577701141},
   {"BL80", 2.8253295934982616},
+  {"MIQS", 0.0},
+  {"PAM10", 0.0},
   {"PAM30", 2.8848596855435313},
 };
 
@@ -100,21 +102,30 @@ void SubstitutionMatrixStats::calcBias(const const_int_ptr *scoreMatrix,
 void SubstitutionMatrixStats::calcUnbiased(const char *matrixName,
 					   const const_int_ptr *scoreMatrix,
 					   unsigned size) {
+  double scale = 0;
+  unsigned realSize = size;
+
   for (size_t i = 0; i < COUNTOF(substitutionMatrixScales); ++i) {
     if (strcmp(matrixName, substitutionMatrixScales[i].name) == 0) {
-      calcFromScale(scoreMatrix, size, substitutionMatrixScales[i].scale);
-      return;
+      scale = substitutionMatrixScales[i].scale;
+      if (size > 20) realSize = 20;
     }
   }
 
-  cbrc::LambdaCalculator c;
-  c.calculate(scoreMatrix, size);
-  mLambda = c.lambda();
-  if (!isBad()) {
+  if (scale > 0) {
+    calcFromScale(scoreMatrix, realSize, scale);
+  } else {
+    cbrc::LambdaCalculator c;
+    c.calculate(scoreMatrix, realSize);
+    mLambda = c.lambda();
+    if (isBad()) return;
     mBias = 1;
-    mLetterProbs1.assign(c.letterProbs1(), c.letterProbs1() + size);
-    mLetterProbs2.assign(c.letterProbs2(), c.letterProbs2() + size);
+    mLetterProbs1.assign(c.letterProbs1(), c.letterProbs1() + realSize);
+    mLetterProbs2.assign(c.letterProbs2(), c.letterProbs2() + realSize);
   }
+
+  mLetterProbs1.insert(mLetterProbs1.end(), size - realSize, 0.0);
+  mLetterProbs2.insert(mLetterProbs2.end(), size - realSize, 0.0);
 }
 
 void SubstitutionMatrixStats::flipDnaStrands(const unsigned char *complement) {


=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"1145"
+"1167"



View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/4dc09e39224756822117fb8a9fed8dbdf3619956

-- 
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/4dc09e39224756822117fb8a9fed8dbdf3619956
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20201214/fe44b595/attachment-0001.html>