[med-svn] [Git][med-team/last-align][upstream] New upstream version 1178

Nilesh Patra gitlab at salsa.debian.org
Thu Jan 28 11:59:23 GMT 2021



Nilesh Patra pushed to branch upstream at Debian Med / last-align


Commits:
d1b788bb by Nilesh Patra at 2021-01-28T17:19:22+05:30
New upstream version 1178
- - - - -


25 changed files:

- ChangeLog.txt
- doc/Makefile
- doc/last-evalues.html
- doc/last-evalues.txt
- doc/last-papers.html
- doc/last-papers.txt
- doc/last-postmask.html
- doc/last-postmask.txt
- doc/last-split.html
- doc/last-split.txt
- doc/last-train.html
- doc/last-train.txt
- doc/last-tuning.html
- doc/last-tuning.txt
- doc/last-tutorial.html
- doc/last-tutorial.txt
- doc/last.html
- doc/last.txt
- doc/lastal.html
- doc/lastal.txt
- scripts/last-train
- src/Alignment.cc
- src/LastalArguments.cc
- src/lastal.cc
- src/version.hh


Changes:

=====================================
ChangeLog.txt
=====================================
@@ -1,9 +1,54 @@
+2021-01-27  Martin C. Frith  <Martin C. Frith>
+
+	* doc/Makefile, doc/last-evalues.txt, doc/last-papers.txt, doc/last-
+	postmask.txt, doc/last-split.txt, doc/last-train.txt, doc/last-
+	tuning.txt, doc/last-tutorial.txt, doc/last.txt, doc/lastal.txt:
+	Big doc update
+	[ba6a55ed7287] [tip]
+
+2021-01-25  Martin C. Frith  <Martin C. Frith>
+
+	* doc/lastal.txt, scripts/last-train, src/LastalArguments.cc:
+	train --codon: more data & better pseudocounts
+	[9a6365d5825a]
+
+2021-01-14  Martin C. Frith  <Martin C. Frith>
+
+	* src/Alignment.cc, src/lastal.cc, test/last-test.out, test/last-
+	test.sh:
+	Better repeat-masking for new-style frameshifts
+	[e8001aa6f633]
+
+2021-01-13  Martin C. Frith  <Martin C. Frith>
+
+	* src/lastal.cc:
+	Refactor
+	[3713015a6ad3]
+
+	* src/lastal.cc:
+	Refactor
+	[69de4393c89c]
+
+	* doc/last-papers.txt, src/lastal.cc:
+	Refactor
+	[2ad10ab5b4b1]
+
+2021-01-05  Martin C. Frith  <Martin C. Frith>
+
+	* doc/last-train.txt, scripts/last-train:
+	last-train: add codon & frameshift options!
+	[977a35de8389]
+
+	* scripts/last-train:
+	Refactor
+	[663475110c85]
+
 2020-12-28  Martin C. Frith  <Martin C. Frith>
 
 	* src/LastEvaluer.cc, src/mcf_frameshift_xdrop_aligner.cc,
 	src/mcf_frameshift_xdrop_aligner.hh, test/last-test.out:
 	Fix E-values for new-style frameshifts
-	[a55b81199602] [tip]
+	[a55b81199602]
 
 2020-12-23  Martin C. Frith  <Martin C. Frith>
 


=====================================
doc/Makefile
=====================================
@@ -13,6 +13,8 @@ ${DOCS}: last-doc.css Makefile
 
 .SUFFIXES: .html .txt
 
+RST2HTML = rst2html  # xxx .py ???
+
 # Ugh!  Is there a better way?
 RST_CSS = `locate html4css1.css | tail -n1`
 #RST_CSS = html4css1.css
@@ -21,7 +23,7 @@ RSTFLAGS = --initial-header-level=2 --no-compact-lists	\
 --no-compact-field-lists --option-limit=0 --no-doc-info
 
 .txt.html:
-	rst2html --stylesheet-path=${RST_CSS},last-doc.css ${RSTFLAGS} $< $@
+	${RST2HTML} --stylesheet-path=${RST_CSS},last-doc.css ${RSTFLAGS} $< $@
 
 last-matrices.txt: ../data/*.mat
 	../build/mat-doc.sh ../data/*.mat > $@


=====================================
doc/last-evalues.html
=====================================
@@ -367,9 +367,6 @@ option -v (verbose).</p>
 <p>If your sequences have very different letter frequencies (e.g. very
 AT-rich DNA), the E-values will be misleading.  The best solution is
 to use a suitable score matrix, such as AT77 or ATMAP.</p>
-<p>For DNA-versus-protein alignment, however, only the protein
-frequencies are determined by the score matrix.  The DNA frequencies
-are fixed at 0.25.</p>
 </div>
 <div class="section" id="calculation-of-eg2">
 <h3>Calculation of EG2</h3>
@@ -426,12 +423,12 @@ or gap costs (e.g. match score = 99, mismatch cost = 1).  In such
 cases, lastal requires you to set a score threshold with option -e.</p>
 </li>
 <li><p class="first">There may be a long startup time to calculate the E-value parameters
-(lambda, K, and finite-size correction parameters), especially for
-alignment with frameshifts.  This is avoided, for particular scoring
-schemes, by storing the parameters in LAST's source code.
-Parameters for other scoring schemes can be added on request.  (Or
-you can do it yourself: run lastal with option -v, so it prints the
-E-value parameters, then copy them into LastEvaluer.cc.)</p>
+(lambda, K, and finite-size correction parameters).  This is
+avoided, for particular scoring schemes, by storing the parameters
+in LAST's source code.  Parameters for other scoring schemes can be
+added on request.  (Or you can do it yourself: run lastal with
+option -v, so it prints the E-value parameters, then copy them into
+LastEvaluer.cc.)</p>
 </li>
 <li><p class="first">The E-values do not account for lastal option -c.  If you use this
 option, the E-values will be under-estimates.</p>


=====================================
doc/last-evalues.txt
=====================================
@@ -58,10 +58,6 @@ If your sequences have very different letter frequencies (e.g. very
 AT-rich DNA), the E-values will be misleading.  The best solution is
 to use a suitable score matrix, such as AT77 or ATMAP.
 
-For DNA-versus-protein alignment, however, only the protein
-frequencies are determined by the score matrix.  The DNA frequencies
-are fixed at 0.25.
-
 Calculation of EG2
 ~~~~~~~~~~~~~~~~~~
 
@@ -115,12 +111,12 @@ Limitations
   cases, lastal requires you to set a score threshold with option -e.
 
 * There may be a long startup time to calculate the E-value parameters
-  (lambda, K, and finite-size correction parameters), especially for
-  alignment with frameshifts.  This is avoided, for particular scoring
-  schemes, by storing the parameters in LAST's source code.
-  Parameters for other scoring schemes can be added on request.  (Or
-  you can do it yourself: run lastal with option -v, so it prints the
-  E-value parameters, then copy them into LastEvaluer.cc.)
+  (lambda, K, and finite-size correction parameters).  This is
+  avoided, for particular scoring schemes, by storing the parameters
+  in LAST's source code.  Parameters for other scoring schemes can be
+  added on request.  (Or you can do it yourself: run lastal with
+  option -v, so it prints the E-value parameters, then copy them into
+  LastEvaluer.cc.)
 
 * The E-values do not account for lastal option -c.  If you use this
   option, the E-values will be under-estimates.


=====================================
doc/last-papers.html
=====================================
@@ -377,7 +377,8 @@ Res. 2014 42(7):e59.</p>
 <li><p class="first"><a class="reference external" href="http://bioinformatics.oxfordjournals.org/content/30/24/3575.long">Frameshift alignment: statistics and post-genomic
 applications</a>.  Sheetlin SL, Park Y, Frith MC, Spouge JL.
 Bioinformatics. 2014 30(24):3575-82.</p>
-<p>Describes DNA-versus-protein alignment allowing for frameshifts.</p>
+<p>Describes "old-style" DNA-versus-protein alignment allowing for
+frameshifts (without <tt class="docutils literal"><span class="pre">last-train</span></tt>).</p>
 </li>
 <li><p class="first"><a class="reference external" href="http://www.genomebiology.com/content/16/1/106">Split-alignment of genomes finds orthologies more accurately</a>.
 Frith MC, Kawaguchi R.  Genome Biology. 2015 16:106.</p>
@@ -387,13 +388,22 @@ whole genome alignment.</p>
 <li><p class="first"><a class="reference external" href="https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw742">Training alignment parameters for arbitrary sequencers with
 LAST-TRAIN</a>.  Hamada M, Ono Y, Asai K Frith MC.
 Bioinformatics. 2017 33(6):926-928.</p>
-<p>Describes last-train.</p>
+<p>Describes <tt class="docutils literal"><span class="pre">last-train</span></tt>.</p>
 </li>
 <li><p class="first"><a class="reference external" href="https://ieeexplore.ieee.org/document/8288582/">A Simplified Description of Child Tables for Sequence Similarity
 Search</a>.  Frith MC, Shrestha A.  IEEE/ACM Trans Comput Biol
 Bioinform. 2018.</p>
 <p>Describes how LAST uses child tables.</p>
 </li>
+<li><p class="first"><a class="reference external" href="https://doi.org/10.1093/bioinformatics/btaa1054">Minimally-overlapping words for sequence similarity search</a>.
+Frith MC, Noé L, Kucherov G.  Bioinformatics. 2020</p>
+<p>Describes the <tt class="docutils literal">lastdb <span class="pre">-u</span> RY</tt> sparsity options.</p>
+</li>
+<li><p class="first"><a class="reference external" href="https://doi.org/10.1101/2021.01.25.428050">Improved DNA-versus-protein homology search for protein fossils</a>.
+Yao Y, Frith MC.</p>
+<p>Describes "new-style" DNA-versus-protein search with <tt class="docutils literal"><span class="pre">last-train</span>
+<span class="pre">--codon</span></tt>.</p>
+</li>
 </ol>
 <div class="section" id="external-methods">
 <h2>External methods</h2>


=====================================
doc/last-papers.txt
=====================================
@@ -90,7 +90,8 @@ research to society.
 
     __ http://bioinformatics.oxfordjournals.org/content/30/24/3575.long
 
-    Describes DNA-versus-protein alignment allowing for frameshifts.
+    Describes "old-style" DNA-versus-protein alignment allowing for
+    frameshifts (without ``last-train``).
 
 11. `Split-alignment of genomes finds orthologies more accurately`__.
     Frith MC, Kawaguchi R.  Genome Biology. 2015 16:106.
@@ -106,7 +107,7 @@ research to society.
 
     __ https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw742
 
-    Describes last-train.
+    Describes ``last-train``.
 
 13. `A Simplified Description of Child Tables for Sequence Similarity
     Search`__.  Frith MC, Shrestha A.  IEEE/ACM Trans Comput Biol
@@ -116,6 +117,21 @@ research to society.
 
     Describes how LAST uses child tables.
 
+14. `Minimally-overlapping words for sequence similarity search`__.
+    Frith MC, Noé L, Kucherov G.  Bioinformatics. 2020
+
+    __ https://doi.org/10.1093/bioinformatics/btaa1054
+
+    Describes the ``lastdb -u RY`` sparsity options.
+
+15. `Improved DNA-versus-protein homology search for protein fossils`__.
+    Yao Y, Frith MC.
+
+    __ https://doi.org/10.1101/2021.01.25.428050
+
+    Describes "new-style" DNA-versus-protein search with ``last-train
+    --codon``.
+
 External methods
 ----------------
 


=====================================
doc/last-postmask.html
=====================================
@@ -319,9 +319,9 @@ table.field-list { border: thin solid green }
 <h1 class="title">last-postmask</h1>
 
 <p>This script does post-alignment masking.  It reads pair-wise sequence
-alignments, and writes only those that align a significant amount of
-uppercase sequence.  (Lowercase is often used to indicate undesirable
-repetitive regions.)</p>
+alignments, and only writes alignments that include a significant
+amount of uppercase-to-uppercase alignment.  (Lowercase is often used
+to indicate undesirable "simple" sequence.)</p>
 <p>The input should be in MAF format, with header lines (of the kind
 produced by lastal) describing the alignment score parameters.</p>
 <p>The script discards alignments that lack any segment with score >=


=====================================
doc/last-postmask.txt
=====================================
@@ -2,9 +2,9 @@ last-postmask
 =============
 
 This script does post-alignment masking.  It reads pair-wise sequence
-alignments, and writes only those that align a significant amount of
-uppercase sequence.  (Lowercase is often used to indicate undesirable
-repetitive regions.)
+alignments, and only writes alignments that include a significant
+amount of uppercase-to-uppercase alignment.  (Lowercase is often used
+to indicate undesirable "simple" sequence.)
 
 The input should be in MAF format, with header lines (of the kind
 produced by lastal) describing the alignment score parameters.


=====================================
doc/last-split.html
=====================================
@@ -364,7 +364,7 @@ last-split options.</p>
 </div>
 <div class="section" id="alignment-of-two-whole-genomes">
 <h3>Alignment of two whole genomes</h3>
-<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>.</p>
+<p>See <a class="reference external" href="last-tutorial.html">here</a>.</p>
 </div>
 </div>
 <div class="section" id="faq">


=====================================
doc/last-split.txt
=====================================
@@ -54,7 +54,7 @@ last-split options.
 Alignment of two whole genomes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-See `here <https://github.com/mcfrith/last-genome-alignments>`_.
+See `here <last-tutorial.html>`_.
 
 FAQ
 ---


=====================================
doc/last-train.html
=====================================
@@ -399,6 +399,13 @@ include 2 for half-bit scores and 3 for 1/3-bit scores.
 (Note that 1/3-bit scores essentially equal Phred scores
 a.k.a. decibans, because log10(2) ≈ 3/10.)  The default is to
 infer a scale from the initial score parameters.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">--codon</span></kbd></td>
+<td>Do training for DNA query sequences versus protein reference
+sequences.  These options will be ignored: <tt class="docutils literal"><span class="pre">--revsym</span>
+<span class="pre">--matsym</span> <span class="pre">--gapsym</span> <span class="pre">--pid</span> <span class="pre">--postmask</span> <span class="pre">-q</span> <span class="pre">-p</span> <span class="pre">-S</span></tt>.  If
+<tt class="docutils literal"><span class="pre">--codon</span></tt> is used, the "initial parameter options" are
+initial probabilities, not scores/costs.</td></tr>
 </tbody>
 </table>
 </blockquote>
@@ -433,6 +440,9 @@ alignments: they are described in more detail at <a class="reference external" h
 <tr><td class="option-group">
 <kbd><span class="option">-B <var>COST</var></span></kbd></td>
 <td>Initial insertion extension cost.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-F <var>LIST</var></span></kbd></td>
+<td>Initial frameshift probabilities (only used with <tt class="docutils literal"><span class="pre">--codon</span></tt>).</td></tr>
 </tbody>
 </table>
 </blockquote>
@@ -557,10 +567,7 @@ may be poor.</p>
 </li>
 <li><p class="first">last-train can fail for various reasons, e.g. if the sequences are
 too dissimilar.  If it fails to find any alignments, you could try
-<a class="reference external" href="last-tutorial.html#example-6-find-very-short-dna-alignments">reducing the alignment significance threshold</a> with
-option -D.</p>
-</li>
-<li><p class="first">last-train cannot handle DNA-versus-protein alignment.</p>
+reducing the alignment <a class="reference external" href="last-evalues.html">significance</a> threshold with option <tt class="docutils literal"><span class="pre">-D</span></tt>.</p>
 </li>
 </ul>
 </div>


=====================================
doc/last-train.txt
=====================================
@@ -65,6 +65,12 @@ Training options
          (Note that 1/3-bit scores essentially equal Phred scores
          a.k.a. decibans, because log10(2) ≈ 3/10.)  The default is to
          infer a scale from the initial score parameters.
+  --codon
+         Do training for DNA query sequences versus protein reference
+         sequences.  These options will be ignored: ``--revsym
+         --matsym --gapsym --pid --postmask -q -p -S``.  If
+         ``--codon`` is used, the "initial parameter options" are
+         initial probabilities, not scores/costs.
 
 All options below this point are passed to lastal to do the
 alignments: they are described in more detail at `<lastal.html>`_.
@@ -79,6 +85,7 @@ Initial parameter options
   -b COST    Initial gap extension cost.
   -A COST    Initial insertion existence cost.
   -B COST    Initial insertion extension cost.
+  -F LIST    Initial frameshift probabilities (only used with ``--codon``).
 
 Alignment options
 ~~~~~~~~~~~~~~~~~
@@ -174,8 +181,6 @@ Bugs
 
 * last-train can fail for various reasons, e.g. if the sequences are
   too dissimilar.  If it fails to find any alignments, you could try
-  `reducing the alignment significance threshold
-  <last-tutorial.html#example-6-find-very-short-dna-alignments>`_ with
-  option -D.
+  reducing the alignment significance_ threshold with option ``-D``.
 
-* last-train cannot handle DNA-versus-protein alignment.
+.. _significance: last-evalues.html


=====================================
doc/last-tuning.html
=====================================
@@ -320,13 +320,6 @@ table.field-list { border: thin solid green }
 
 <p>This document tells you how to trade-off <strong>speed</strong>, <strong>sensitivity</strong>,
 and <strong>memory and disk usage</strong>.</p>
-<p>Ideally, the default settings would always work well.  Unfortunately,
-there is too great a variety of challenging alignment tasks, and the
-LAST developers lack experience with most of them.</p>
-<p>LAST must have <em>some</em> defaults, and any choice will displease someone.
-It is wrong to say "LAST is faster but less sensitive than method X",
-or "slower but more sensitive than method Y", without varying the
-defaults.</p>
 <div class="section" id="sparsity-options">
 <h2>Sparsity options</h2>
 <p>It's advisable to use at most one sparsity option: combining them will
@@ -409,7 +402,7 @@ high-identity matches.</p>
 <h3>lastal -C</h3>
 <p>This option (gapless alignment culling) can make lastal <strong>faster</strong> but
 <strong>less sensitive</strong>.  It can also <strong>reduce redundant output</strong>.  For
-example, -C2 makes it discard alignments (before gapped extension)
+example, <tt class="docutils literal"><span class="pre">-C2</span></tt> makes it discard alignments (before gapped extension)
 whose query coordinates lie in those of 2 or more stronger alignments.
 This works well for aligning long, repeat-rich, indel-poor sequences
 (e.g. mammal chromosomes) without repeat-masking.</p>
@@ -421,13 +414,13 @@ sets the maximum score drop in alignments, in the gapped extension
 phase.  Lower values make it faster, by quitting unpromising
 extensions sooner.  The default aims at best accuracy.</p>
 <p>You can set this option in several ways: perhaps the most intuitive is
-via maximum gap length.  For example, -z10g sets the maximum score
+via maximum gap length.  For example, <tt class="docutils literal"><span class="pre">-z10g</span></tt> sets the maximum score
 drop such that the longest possible gap length is 10.</p>
 </div>
 <div class="section" id="lastal-x">
 <h3>lastal -x</h3>
 <p>This option (preliminary gapped extension) can make lastal <strong>faster</strong>
-but <strong>less sensitive</strong>.  For example, -x2g makes it extend gapped
+but <strong>less sensitive</strong>.  For example, <tt class="docutils literal"><span class="pre">-x2g</span></tt> makes it extend gapped
 alignments with a maximum gap length of 2, discard those with score
 below the gapped score threshold, then redo the survivors with the
 final max score drop (z).</p>
@@ -442,17 +435,17 @@ gaps).</p>
 <div class="section" id="lastal-j1">
 <h3>lastal -j1</h3>
 <p>This option requests <strong>gapless</strong> alignment, which is even <strong>faster</strong>.
-(You could get the same effect by using very high gap costs, but -j1
-is faster because it skips the gapping phase entirely.)</p>
+(You could get the same effect by using very high gap costs, but
+<tt class="docutils literal"><span class="pre">-j1</span></tt> is faster because it skips the gapping phase entirely.)</p>
 </div>
 <div class="section" id="lastal-f">
 <h3>lastal -f</h3>
-<p>Option -fTAB <strong>reduces the output size</strong>, which can improve speed.</p>
+<p>Option <tt class="docutils literal"><span class="pre">-fTAB</span></tt> <strong>reduces the output size</strong>, which can improve speed.</p>
 </div>
 <div class="section" id="lastdb-i">
 <h3>lastdb -i</h3>
 <p>This option <strong>makes lastdb faster</strong>, but disables some lastal options.
-If lastdb is too slow, try -i10.</p>
+If lastdb is too slow, try <tt class="docutils literal"><span class="pre">-i10</span></tt>.</p>
 </div>
 <div class="section" id="lastdb-c2">
 <h3>lastdb -C2</h3>


=====================================
doc/last-tuning.txt
=====================================
@@ -4,15 +4,6 @@ LAST Performance Tuning
 This document tells you how to trade-off **speed**, **sensitivity**,
 and **memory and disk usage**.
 
-Ideally, the default settings would always work well.  Unfortunately,
-there is too great a variety of challenging alignment tasks, and the
-LAST developers lack experience with most of them.
-
-LAST must have *some* defaults, and any choice will displease someone.
-It is wrong to say "LAST is faster but less sensitive than method X",
-or "slower but more sensitive than method Y", without varying the
-defaults.
-
 Sparsity options
 ~~~~~~~~~~~~~~~~
 
@@ -110,7 +101,7 @@ lastal -C
 
 This option (gapless alignment culling) can make lastal **faster** but
 **less sensitive**.  It can also **reduce redundant output**.  For
-example, -C2 makes it discard alignments (before gapped extension)
+example, ``-C2`` makes it discard alignments (before gapped extension)
 whose query coordinates lie in those of 2 or more stronger alignments.
 This works well for aligning long, repeat-rich, indel-poor sequences
 (e.g. mammal chromosomes) without repeat-masking.
@@ -124,14 +115,14 @@ phase.  Lower values make it faster, by quitting unpromising
 extensions sooner.  The default aims at best accuracy.
 
 You can set this option in several ways: perhaps the most intuitive is
-via maximum gap length.  For example, -z10g sets the maximum score
+via maximum gap length.  For example, ``-z10g`` sets the maximum score
 drop such that the longest possible gap length is 10.
 
 lastal -x
 ---------
 
 This option (preliminary gapped extension) can make lastal **faster**
-but **less sensitive**.  For example, -x2g makes it extend gapped
+but **less sensitive**.  For example, ``-x2g`` makes it extend gapped
 alignments with a maximum gap length of 2, discard those with score
 below the gapped score threshold, then redo the survivors with the
 final max score drop (z).
@@ -148,19 +139,19 @@ lastal -j1
 ----------
 
 This option requests **gapless** alignment, which is even **faster**.
-(You could get the same effect by using very high gap costs, but -j1
-is faster because it skips the gapping phase entirely.)
+(You could get the same effect by using very high gap costs, but
+``-j1`` is faster because it skips the gapping phase entirely.)
 
 lastal -f
 ---------
 
-Option -fTAB **reduces the output size**, which can improve speed.
+Option ``-fTAB`` **reduces the output size**, which can improve speed.
 
 lastdb -i
 ---------
 
 This option **makes lastdb faster**, but disables some lastal options.
-If lastdb is too slow, try -i10.
+If lastdb is too slow, try ``-i10``.
 
 lastdb -C2
 ----------


=====================================
doc/last-tutorial.html
=====================================
@@ -4,7 +4,7 @@
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <meta name="generator" content="Docutils 0.6: http://docutils.sourceforge.net/" />
-<title>LAST Tutorial</title>
+<title>LAST Cookbook</title>
 <style type="text/css">
 
 /*
@@ -315,206 +315,306 @@ table.field-list { border: thin solid green }
 </style>
 </head>
 <body>
-<div class="document" id="last-tutorial">
-<h1 class="title">LAST Tutorial</h1>
-
-<p>LAST finds similar regions between sequences, and aligns them.</p>
-<div class="section" id="example-1-compare-the-human-and-fugu-mitochondrial-genomes">
-<h2>Example 1: Compare the human and fugu mitochondrial genomes</h2>
-<p>For our first example, we wish to find and align similar regions
-between the human and fugu mitochondrial genomes.  You can find these
-sequences in the examples directory: humanMito.fa and fuguMito.fa.  We
-can compare them like this:</p>
+<div class="document" id="last-cookbook">
+<h1 class="title">LAST Cookbook</h1>
+
+<p><a class="reference external" href="last.html">LAST</a> is used by running commands in a terminal / command line.  It
+has many options: unfortunately, the LAST developers don't know the
+best options for every possible alignment task.  Here are some
+reasonable starting points.  Feel free to optimize (and share!) search
+protocols.</p>
+<div class="section" id="a-minimal-example-compare-human-fugu-mitochondrial-genomes">
+<h2>A minimal example: compare human & fugu mitochondrial genomes</h2>
+<p>Let's find and align similar regions between the human and fugu
+mitochondrial genomes.  These FASTA-format files are in LAST's
+examples directory: humanMito.fa and fuguMito.fa.  The simplest
+possible usage is:</p>
 <pre class="literal-block">
-lastdb -cR01 humdb humanMito.fa
+lastdb humdb humanMito.fa
 lastal humdb fuguMito.fa > myalns.maf
 </pre>
-<p>The lastdb command creates several files whose names begin with
-"humdb".  The lastal command then compares fuguMito.fa to the humdb
+<p>The <a class="reference external" href="lastdb.html">lastdb</a> command creates several files whose names begin with
+"humdb".  The <a class="reference external" href="lastal.html">lastal</a> command then compares fuguMito.fa to the humdb
 files, and writes the alignments to a file called "myalns.maf".</p>
-<p>The "-cR01" option suppresses alignments caused by simple sequence
-such as cacacacacacacacacacacaca.</p>
 </div>
 <div class="section" id="understanding-the-output">
 <h2>Understanding the output</h2>
 <p>The output has very long lines, so you need to view it without
-line-wrapping.  For example, with a Unix/Linux/MacOsX command line,
-you can use:</p>
+line-wrapping.  For example, you can use:</p>
 <pre class="literal-block">
 less -S myalns.maf
 </pre>
-<p>Each alignment looks like this:</p>
+<p>Each alignment looks like this (<a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">MAF</a> format):</p>
 <pre class="literal-block">
 a score=27 EG2=4.7e+04 E=2.6e-05
 s humanMito 2170 145 + 16571 AGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTT...
 s fuguMito  1648 142 + 16447 AGTAGGCTTAGAAGCAGCCACCA--CAAGAAAGCGTT...
 </pre>
 <p>The score is a measure of how significant the similarity is.  EG2 and
-E are explained at <a class="reference external" href="last-evalues.html">last-evalues.html</a>.  Lines starting with "s"
-contain: the sequence name, the start coordinate of the alignment, the
-number of bases spanned by the alignment, the strand, the sequence
-length, and the aligned bases.</p>
+E are explained at <a class="reference external" href="last-evalues.html">last-evalues</a>.  Lines starting with "s" contain:
+the sequence name, the start coordinate of the alignment, the number
+of bases spanned by the alignment, the strand, the sequence length,
+and the aligned bases.</p>
 <p>The start coordinates are zero-based.  This means that, if the
 alignment begins right at the start of a sequence, the coordinate is
-0.  If the strand is "-", the start coordinate is in the reverse
-strand.</p>
-<p>This alignment format is called <a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">MAF (multiple alignment format)</a>.  You can
-convert it to several other formats using <a class="reference external" href="maf-convert.html">maf-convert</a>.  You can make lastal produce a few other formats
-with option -f (see <a class="reference external" href="lastal.html">lastal.html</a>).</p>
+0.  If the strand is "-", the start coordinate is the coordinate in
+the reverse-complemented sequence (the same as if you were to
+reverse-complement the sequence before giving it to LAST).</p>
+<p>You can convert MAF to other formats with <a class="reference external" href="maf-convert.html">maf-convert</a>, or use <a class="reference external" href="lastal.html">lastal</a>
+option <tt class="docutils literal"><span class="pre">-f</span></tt> to get a few other formats.</p>
 </div>
-<div class="section" id="example-2-compare-vertebrate-proteins-to-invertebrate-proteins">
-<h2>Example 2: Compare vertebrate proteins to invertebrate proteins</h2>
-<p>Use the lastdb -p option to indicate that the sequences are proteins:</p>
+<div class="section" id="more-accurate-learn-substitution-gap-rates">
+<h2>More accurate: learn substitution & gap rates</h2>
+<p>We can get more accurate alignments between the human and fugu
+mitochondrial genomes like this:</p>
 <pre class="literal-block">
-lastdb -p -cR01 invdb invertebrate.fa
-lastal invdb vertebrate.fa
+lastdb humdb humanMito.fa
+last-train humdb fuguMito.fa > hufu.train
+lastal -p hufu.train humdb fuguMito.fa > myalns.maf
 </pre>
+<p>The <a class="reference external" href="last-train.html">last-train</a> command finds the rates of deletion, insertion, and
+each kind of substitution between these sequences, and writes them to
+a file "hufu.train".  lastal's <tt class="docutils literal"><span class="pre">-p</span></tt> option uses this file to get
+more-accurate alignments.</p>
 </div>
-<div class="section" id="example-3-compare-dna-sequences-to-protein-sequences">
-<h2>Example 3: Compare DNA sequences to protein sequences</h2>
-<p>Here we use the -F15 option, to specify translated alignment with a
-score penalty of 15 for frameshifts:</p>
+<div class="section" id="comparing-protein-sequences">
+<h2>Comparing protein sequences</h2>
+<p>We can compare some query proteins to some reference proteins like
+this:</p>
 <pre class="literal-block">
-lastdb -p -cR01 protdb proteins.fa
-lastal -F15 protdb dnas.fa
+lastdb -p -c -R01 refdb ref-prots.fa
+lastal refdb query-prots.fa > prot-alns.maf
 </pre>
-</div>
-<div class="section" id="example-4-find-high-similarity-and-short-protein-alignments">
-<h2>Example 4: Find high-similarity, and short, protein alignments</h2>
-<p>LAST uses a <a class="reference external" href="last-matrices.html">scoring scheme</a> to find
-similarities.  Some scoring schemes are tuned for weak similarities,
-others for strong similarities.  The PAM30 scoring scheme finds strong
-protein similarities:</p>
+<p><tt class="docutils literal"><span class="pre">-p</span></tt> tells it the sequences are proteins.  (If you forget <tt class="docutils literal"><span class="pre">-p</span></tt> and
+the sequences look proteinaceous, you'll get a warning message.)</p>
+<p>The other options suppress alignments caused by simple sequence such
+as <tt class="docutils literal">APPSPAPPSPAPPSPAPPSPAP</tt>:</p>
+<ul>
+<li><p class="first"><tt class="docutils literal"><span class="pre">-R01</span></tt> converts the sequence letters to uppercase, then finds
+simple regions and converts them to lowercase.  This will be done
+for both ref-prots and query-prots.</p>
+</li>
+<li><p class="first"><tt class="docutils literal"><span class="pre">-c</span></tt> omits alignments that lack a significant amount of
+uppercase-to-uppercase alignment.</p>
+</li>
+</ul>
+<p>You can also use <a class="reference external" href="last-train.html">last-train</a>, but we've hardly tested it for
+protein-protein alignment, so we're not sure if it helps.</p>
+<div class="section" id="find-high-similarity-and-short-protein-alignments">
+<h3>Find high-similarity, and short, protein alignments</h3>
+<p>If we just want high-similarity alignments, we can use the PAM30 (or
+PAM10) <a class="reference external" href="last-matrices.html">scoring scheme</a>:</p>
 <pre class="literal-block">
-lastdb -p -cR01 invdb invertebrate.fa
-lastal -pPAM30 invdb vertebrate.fa
+lastdb -p -c -R01 refdb ref-prots.fa
+lastal -p PAM30 refdb query-prots.fa > prot-alns.maf
 </pre>
 <p>This has two advantages:</p>
 <ul>
-<li><p class="first">It omits weak alignments, or alignment parts (occasionally a strong
-similarity is flanked by a weak similarity).</p>
+<li><p class="first">It omits low-similarity alignments, or alignment parts.</p>
 </li>
-<li><p class="first">It can find short similarities.  If we seek very short similarities,
-weak ones are hopeless (statistically insignificant), so we had
-better focus on strong ones.  (How short is "very short"?  It
-depends on the amount of sequence data we are searching, but perhaps
-roughly less than 40 amino acids.)</p>
+<li><p class="first">It can find short similarities, which would be deemed insignificant
+(likely to occur by chance between random sequences) unless we focus
+the search on high-similarity.</p>
 </li>
 </ul>
 </div>
-<div class="section" id="example-5-align-human-dna-sequences-to-the-human-genome">
-<h2>Example 5: Align human DNA sequences to the human genome</h2>
-<p>We can align human DNA sequences to the human genome like this:</p>
+</div>
+<div class="section" id="comparing-dna-to-proteins">
+<h2>Comparing DNA to proteins</h2>
+<p>We can find related regions of DNA and proteins, allowing for nonsense
+mutations and frameshifts.  For example, let's find DNA regions
+related to transposon proteins:</p>
+<pre class="literal-block">
+lastdb -q -c -R01 trandb transposon-prots.fa
+last-train --codon trandb dna.fa > codon.train
+lastal -p codon.train -m100 -D1e9 -K1 trandb dna.fa > out.maf
+</pre>
+<p><tt class="docutils literal"><span class="pre">-q</span></tt> appends <tt class="docutils literal">*</tt> meaning STOP to each protein, and treats <tt class="docutils literal">*</tt> as
+a 21st protein letter.</p>
+<p><tt class="docutils literal"><span class="pre">--codon</span></tt> makes it do DNA-versus-protein.  Here, <a class="reference external" href="last-train.html">last-train</a> tries
+to learn 21x64 substitution rates, so it needs a fairly large amount
+of data (e.g. a chromosome).</p>
+<p><tt class="docutils literal"><span class="pre">-m100</span></tt> makes it more slow-and-sensitive than the default (which is
+<tt class="docutils literal"><span class="pre">-m10</span></tt>), see <a class="reference external" href="lastal.html">lastal</a>.</p>
+<p><tt class="docutils literal"><span class="pre">-D1e9</span></tt> sets a strict <a class="reference external" href="last-evalues.html">significance</a> threshold.  It means: only
+report strong similarities that would be expected to occur by chance,
+between random sequences, at most once per 10^9 base-pairs.  The
+default is 1e6.</p>
+<p><tt class="docutils literal"><span class="pre">-K1</span></tt> streamlines the output by omitting any alignment whose DNA
+range lies in that of a higher-scoring alignment.</p>
+<p>Another possibility is to add <a class="reference external" href="last-train.html">last-train</a> option <tt class="docutils literal"><span class="pre">-X1</span></tt>, which treats
+matches to <tt class="docutils literal">X</tt> (unknown) amino acids as neutral instead of
+disfavored.</p>
+</div>
+<div class="section" id="aligning-high-indel-error-long-dna-reads-to-a-genome">
+<h2>Aligning high-indel-error long DNA reads to a genome</h2>
+<p>Suppose we have DNA reads in either FASTA or <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">FASTQ</a> format.  This is
+sensitive but slow:</p>
 <pre class="literal-block">
-lastdb -uNEAR -R01 humandb human/chr*.fa
-lastal humandb queries.fa | last-split > myalns.maf
+lastdb -P8 -uNEAR -R01 mydb genome.fa
+last-train -P8 -Q0 mydb reads.fastq > reads.train
+lastal -P8 -p reads.train mydb reads.fastq | last-split > out.maf
 </pre>
-<p>This will use about 15 gigabytes of memory.</p>
+<p><tt class="docutils literal"><span class="pre">-P8</span></tt> uses 8 parallel threads, adjust as appropriate for your
+computer.  This has no effect on the results.</p>
+<p><tt class="docutils literal"><span class="pre">-uNEAR</span></tt> selects a <a class="reference external" href="last-seeds.html">seeding scheme</a> that's better at finding
+alignments with few substitutions and/or many gaps.</p>
+<p><tt class="docutils literal"><span class="pre">-Q0</span></tt> makes it discard the <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">fastq</a> quality information (or you can
+keep-but-ignore it with <tt class="docutils literal"><span class="pre">-Qkeep</span></tt>).</p>
+<p><a class="reference external" href="last-split.html">last-split</a> finds a unique best alignment for each part of each read.</p>
+<p>Here we used <tt class="docutils literal"><span class="pre">-R01</span></tt> to lowercase simple sequence like
+<tt class="docutils literal">cacacacacacacacacacacaca</tt>.  But we didn't suppress it with <tt class="docutils literal"><span class="pre">-c</span></tt>,
+so as not to hide anything from <a class="reference external" href="last-split.html">last-split</a>.  If desired, you can
+filter lowercase with <a class="reference external" href="last-postmask.html">last-postmask</a>.</p>
+<p>You can go faster by sacrificing a bit of sensitivity.  It depends on
+your aim, e.g. slow-and-sensitive seems necessary to find intricate
+rearrangements of repeats.  Suggested ways to go faster:</p>
 <ul>
-<li><p class="first">-uNEAR selects a <a class="reference external" href="last-seeds.html">seeding scheme</a> that makes it
-better at finding short-and-strong similarities.  (It also changes
-the default scoring scheme.)</p>
+<li><p class="first"><a class="reference external" href="https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md">Mask repeats</a>.</p>
 </li>
-<li><p class="first">-R01 tells it to mark simple sequences (such as cacacacacacacacaca)
-by lowercase, but not suppress them.  This has no effect on the
-alignment, but it allows us to see simple sequences in the output,
-and gives us the option to do <a class="reference external" href="last-postmask.html">post-alignment masking</a>.</p>
+<li><p class="first">Add lastal option <tt class="docutils literal"><span class="pre">-k8</span></tt> (or <tt class="docutils literal"><span class="pre">-k16</span></tt> etc).  This makes it faster,
+by only finding initial matches starting at every 8th (or 16th etc)
+position in the reads.</p>
 </li>
-<li><p class="first">last-split reads the alignments produced by lastal, and looks for a
-unique best alignment for each part of each query.  It allows
-different parts of one query to match different parts of the genome,
-which may happen due to rearrangements.  It has several useful
-options, please see <a class="reference external" href="last-split.html">last-split.html</a>.</p>
+<li><p class="first">Replace <tt class="docutils literal"><span class="pre">-uNEAR</span></tt> with <tt class="docutils literal"><span class="pre">-uRY32</span></tt> (or <tt class="docutils literal"><span class="pre">-uRY16</span></tt>, <tt class="docutils literal"><span class="pre">-uRY8</span></tt>,
+<tt class="docutils literal"><span class="pre">-uRY4</span></tt>).  This makes it check for initial matches starting at
+only ~1/32 (or ~1/16 etc) of positions, in both reads and genome.
+Compared to <tt class="docutils literal"><span class="pre">-k</span></tt>: this harms sensitivity slightly more, but
+reduces memory use and makes lastdb faster.</p>
 </li>
 </ul>
+<div class="section" id="which-genome-version-to-use">
+<h3>Which genome version to use?</h3>
+<p>Some genome versions (e.g. for human) have artificial
+exactly-duplicated regions, which makes it hard to align reads
+uniquely.  To avoid that, look for a genome version called something
+like "analysis set".</p>
 </div>
-<div class="section" id="tuning-speed-sensitivity-memory-and-disk-usage">
-<h2>Tuning speed, sensitivity, memory and disk usage</h2>
-<ul>
-<li><p class="first">You can make LAST faster by <a class="reference external" href="last-parallel.html">using multiple CPUs</a>.</p>
-</li>
-<li><p class="first">You can <a class="reference external" href="last-tuning.html">trade off speed, sensitivity, memory and disk usage</a>.  To go much faster, try either lastal option
-<tt class="docutils literal"><span class="pre">-k32</span></tt> or lastdb option <tt class="docutils literal"><span class="pre">-uRY32</span></tt> (but not both).  <tt class="docutils literal"><span class="pre">-k32</span></tt> is a
-bit more sensitive, but <tt class="docutils literal"><span class="pre">-uRY32</span></tt> uses less memory.</p>
-</li>
-</ul>
+<div class="section" id="aligning-low-error-long-dna-reads-to-a-genome">
+<h3>Aligning low-error long DNA reads to a genome</h3>
+<p>We can do this the same way as for high-error reads, but perhaps
+accelerate more aggressively (e.g. <tt class="docutils literal"><span class="pre">-uRY32</span></tt>).</p>
+<p>If repeats are not masked, <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-C2</span></tt> may reduce run time
+with little effect on accuracy.</p>
+</div>
+<div class="section" id="aligning-potentially-spliced-rna-or-cdna-long-reads-to-a-genome">
+<h3>Aligning potentially-spliced RNA or cDNA long reads to a genome</h3>
+<p>See <a class="reference external" href="https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md">here</a>.  (For low-error reads, you can probably omit <tt class="docutils literal"><span class="pre">-d90</span></tt> and
+<tt class="docutils literal"><span class="pre">-m20</span></tt>.)</p>
 </div>
-<div class="section" id="example-6-find-very-short-dna-alignments">
-<h2>Example 6: Find very short DNA alignments</h2>
-<p>By default, LAST is quite strict, and only reports significant
-alignments that will rarely occur by chance.  In the preceding
-example, the minimum alignment length is about 26 bases (less for
-smaller genomes).  To find shorter alignments, we must down-tune the
-strictness:</p>
+</div>
+<div class="section" id="aligning-illumina-dna-reads-to-a-genome">
+<h2>Aligning Illumina DNA reads to a genome</h2>
 <pre class="literal-block">
-lastdb -uNEAR -R01 humandb human/chr*.fa
-lastal -D100 humandb queries.fa | last-split > myalns.maf
+lastdb -P8 -uNEAR -R01 -C2 mydb genome.fasta
+last-train -P8 -Q1 mydb reads.fastq.gz > reads.train
+lastal -P8 -p reads.train mydb reads.fastq.gz | last-split > out.maf
 </pre>
-<ul>
-<li><p class="first">-D100 makes lastal report alignments that could occur by chance once
-per hundred query letters.  (The default is once per million.)</p>
-</li>
-</ul>
-<p>In this example, the minimum alignment length is about 20 bases (less
-for smaller genomes).</p>
+<p>Most LAST commands accept gzip-compressed (<tt class="docutils literal">.gz</tt>) files.</p>
+<p><a class="reference external" href="lastdb.html">lastdb</a> option <tt class="docutils literal"><span class="pre">-C2</span></tt> makes the alignment a bit faster, but uses more
+memory.  This has no effect on the results.  (You could use it in the
+other examples too, but it might not be faster.)</p>
+<p><tt class="docutils literal"><span class="pre">-Q1</span></tt> makes it use the <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">fastq</a> quality information to improve the
+training and alignment.  LAST <strong>assumes</strong> that the qualities reflect
+substitution errors, not insertion/deletion errors.  (For long
+non-Illumina reads, we suspect this assumption doesn't hold, so we
+didn't use this option.)</p>
+<p>This recipe may be excessively slow-and-sensitive.  Adding <a class="reference external" href="lastal.html">lastal</a>
+option <tt class="docutils literal"><span class="pre">-C2</span></tt> may make it faster with negligible accuracy loss.  You
+can accelerate with e.g. <tt class="docutils literal"><span class="pre">-uRY16</span></tt> or <tt class="docutils literal"><span class="pre">-k16</span></tt> as above.</p>
+<div class="section" id="finding-very-short-dna-alignments">
+<h3>Finding very short DNA alignments</h3>
+<p>By default, LAST only reports <a class="reference external" href="last-evalues.html">significant</a> alignments that will rarely
+occur by chance.  In the preceding example, the minimum alignment
+length is about 26 bases for a human-size genome (less for smaller
+genomes).  To find shorter alignments, add <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-D100</span></tt>
+(say), to get alignments that could occur by chance once per hundred
+query letters (the default is once per million.)  This makes the
+minimum alignment length about 20 bases for a human-size genome.</p>
 </div>
-<div class="section" id="example-7-align-human-fastq-sequences-to-the-human-genome">
-<h2>Example 7: Align human fastq sequences to the human genome</h2>
-<p>DNA sequences are not always perfectly accurate, and they are
-sometimes provided in fastq format, which indicates the reliability of
-each base.  LAST can use this information to improve alignment
-accuracy.  Option -Q1 indicates fastq-sanger format:</p>
+<div class="section" id="aligning-paired-end-illumina-dna-reads-to-a-genome">
+<h3>Aligning paired-end Illumina DNA reads to a genome</h3>
+<p>You could use <a class="reference external" href="last-pair-probs.html">last-pair-probs</a>.  It has a disadvantage: it doesn't
+allow different parts of one read (i.e. one "end") to align to
+different parts of the genome.  Alternatively, you could align the
+reads individually, ignoring the pair relationships:</p>
 <pre class="literal-block">
-lastdb -uNEAR -R01 humandb human/chr*.fa
-lastal -Q1 humandb queries.fastq | last-split > myalns.maf
+fastq-interleave reads1.fq reads2.fq | lastal -P8 -p reads.train mydb | last-split > out.maf
 </pre>
-<p><strong>Assumption:</strong> LAST assumes the reliabilities reflect substitution
-errors, not insertion/deletion errors.  If that is not true, you can
-tell it to ignore the reliability data with -Q0.</p>
+<p><tt class="docutils literal"><span class="pre">fastq-interleave</span></tt> ensures that each read has a unique name (by
+appending "/1" and "/2" if necessary).</p>
 </div>
-<div class="section" id="fastq-format-confusion">
-<h2>Fastq format confusion</h2>
-<p>Unfortunately, there is more than one fastq format (see
-<a class="reference external" href="http://nar.oxfordjournals.org/content/38/6/1767.long">http://nar.oxfordjournals.org/content/38/6/1767.long</a>).  Recently
-(2013) fastq-sanger seems to be dominant, but if you have another
-variant you need to change the -Q option (see <a class="reference external" href="lastal.html">lastal.html</a>).</p>
+<div class="section" id="aligning-potentially-spliced-illumina-reads-to-a-genome">
+<h3>Aligning potentially-spliced Illumina reads to a genome</h3>
+<p>See <a class="reference external" href="last-split.html">last-split</a> (and <a class="reference external" href="last-pair-probs.html">last-pair-probs</a>).</p>
 </div>
-<div class="section" id="paired-dna-reads">
-<h2>Paired DNA reads</h2>
-<p>If you have paired DNA reads, there are two options:</p>
-<ol class="arabic">
-<li><p class="first">Use last-pair-probs (see <a class="reference external" href="last-pair-probs.html">last-pair-probs.html</a>).</p>
-</li>
-<li><p class="first">Ignore the pairing information, and align the reads individually
-(using last-split as above).  This may be useful because
-last-pair-probs does not currently allow different parts of one
-read to match different parts of the genome, though it does allow
-the two reads in a pair to match (e.g.) different chromosomes.</p>
-</li>
-</ol>
 </div>
-<div class="section" id="example-8-compare-the-human-and-mouse-genomes">
-<h2>Example 8: Compare the human and mouse genomes</h2>
-<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>.</p>
+<div class="section" id="aligning-human-chimp-genomes">
+<h2>Aligning human & chimp genomes</h2>
+<p>This is very slow-and-sensitive:</p>
+<pre class="literal-block">
+lastdb -P8 -uNEAR -R01 humdb human_no_alt_analysis_set.fa
+last-train -P8 --revsym -E0.05 -C2 humdb chimp.fa > humchi.train
+lastal -E0.05 -C2 -p humchi.train humdb chimp.fa | last-split > humchi1.maf
+</pre>
+<p><tt class="docutils literal"><span class="pre">--revsym</span></tt> makes the substitution rates the same on both strands.
+For example, it makes A→G equal T→C (because A→G on one strand means
+T→C on the other strand).  This is usually appropriate for
+genome-genome comparison (but maybe not for mitochondria which have
+asymmetric "heavy" and "light" strands).</p>
+<p><tt class="docutils literal"><span class="pre">-E0.05</span></tt> means only get <a class="reference external" href="last-evalues.html">significant</a> alignments that would be
+expected to occur by chance at a rate ≤ 0.05 times per pair of random
+sequences of length 1 billion each.</p>
+<p>The result so far is asymmetric: each part of the chimp genome is
+aligned to at most one part of the human genome, but not vice-versa.
+We can get one-to-one alignments like this:</p>
+<pre class="literal-block">
+maf-swap humchi1.maf | last-split > humchi2.maf
+</pre>
+<p>Then we can discard less-confident alignments, and <a class="reference external" href="maf-convert.html">convert</a> to a
+compact tabular format:</p>
+<pre class="literal-block">
+last-postmask humchi2.maf | maf-convert -n tab | awk -F= '$2 <= 1e-5' > humchi.tab
+</pre>
+<p><a class="reference external" href="last-postmask.html">last-postmask</a> discards alignments caused by simple sequence.  The
+<tt class="docutils literal">awk</tt> command gets alignments with <a class="reference external" href="last-split.html">mismap probability</a> ≤ 10^-5.
+Finally, we can make a <a class="reference external" href="last-dotplot.html">dotplot</a>:</p>
+<pre class="literal-block">
+last-dotplot humchi.tab humchi.png
+</pre>
+<p>To go faster, with minor accuracy loss: replace <tt class="docutils literal"><span class="pre">-uNEAR</span></tt> with
+<tt class="docutils literal"><span class="pre">-uRY32</span></tt> and/or <a class="reference external" href="https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md">mask repeats</a>.</p>
+<p>To squeeze out the last 0.000...1% of accuracy: add <tt class="docutils literal"><span class="pre">-m50</span></tt> to the
+<a class="reference external" href="lastal.html">lastal</a> options.</p>
+<div class="section" id="aligning-human-mouse-genomes">
+<h3>Aligning human & mouse genomes</h3>
+<p>You can do this in the same way as human/chimp, except that <tt class="docutils literal"><span class="pre">-uNEAR</span></tt>
+should be omitted.  To increase sensitivity, but also time and memory
+use, add lastdb <a class="reference external" href="last-seeds.html">seeding</a> option <tt class="docutils literal"><span class="pre">-uMAM4</span></tt> or or <tt class="docutils literal"><span class="pre">-uMAM8</span></tt>.  To
+increase them even more, add <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-m100</span></tt> (or as high as
+you can bear).</p>
 </div>
-<div class="section" id="example-9-compare-the-human-and-chimp-genomes">
-<h2>Example 9: Compare the human and chimp genomes</h2>
-<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>.  But
-that recipe is <em>extremely</em> slow-and-accurate.  You can <a class="reference external" href="last-tuning.html">tune</a> it to compare huge, high-similarity genomes with
-moderate run time and memory use:</p>
+</div>
+<div class="section" id="large-reference-sequences">
+<h2>Large reference sequences</h2>
+<p>If the sequences that you give to lastdb exceed ~4 billion letters,
+consider using 8-byte LAST (<a class="reference external" href="lastdb.html">lastdb8</a> and <a class="reference external" href="lastal.html">lastal8</a>).  Ordinary (4-byte)
+LAST can't handle so much sequence at once, so <a class="reference external" href="lastdb.html">lastdb</a> splits it into
+"volumes", which may be inefficient.  8-byte LAST avoids voluming, but
+uses more memory.  So <a class="reference external" href="lastdb.html">lastdb8</a> works well with a memory-reducing
+option: <tt class="docutils literal"><span class="pre">-uRY</span></tt> or <tt class="docutils literal"><span class="pre">-w</span></tt> or <tt class="docutils literal"><span class="pre">-W</span></tt>.</p>
+</div>
+<div class="section" id="moar-faster">
+<h2>Moar faster</h2>
 <ul>
-<li><p class="first">Omit the sensitivity-boosting lastal -m option.</p>
-</li>
-<li><p class="first">Replace <tt class="docutils literal"><span class="pre">-uNEAR</span></tt> with <tt class="docutils literal"><span class="pre">-uRY32</span></tt>.</p>
+<li><p class="first"><a class="reference external" href="last-parallel.html">Using multiple CPUs / cores</a></p>
 </li>
-<li><p class="first">If the "reference" genome (the one given to lastdb) is > 4 GB, it's
-probably more efficient to use lastdb8 and lastal8, instead of
-lastdb and lastal.</p>
+<li><p class="first"><a class="reference external" href="last-tuning.html">Various speed & memory options</a></p>
 </li>
 </ul>
 </div>
-<div class="section" id="example-10-ambiguity-of-alignment-columns">
-<h2>Example 10: Ambiguity of alignment columns</h2>
+<div class="section" id="ambiguity-of-alignment-columns">
+<h2>Ambiguity of alignment columns</h2>
 <p>Consider this alignment:</p>
 <pre class="literal-block">
 TGAAGTTAAAGGTATATGAATTCCAATTCTTAACCCCCCTATTAAACGAATATCTTG
@@ -522,12 +622,10 @@ TGAAGTTAAAGGTATATGAATTCCAATTCTTAACCCCCCTATTAAACGAATATCTTG
 TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
 </pre>
 <p>The middle section has such weak similarity that its precise alignment
-cannot be confidently inferred.</p>
-<p>It is sometimes useful to estimate the ambiguity of each column in an
-alignment.  We can do that using lastal option -j4:</p>
+cannot be confidently inferred.  We can see the confidence of each
+alignment column with <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-j4</span></tt>:</p>
 <pre class="literal-block">
-lastdb -cR01 humdb humanMito.fa
-lastal -j4 humdb fuguMito.fa > myalns.maf
+lastal -j4 -p hufu.train humdb fuguMito.fa > myalns.maf
 </pre>
 <p>The output looks like this:</p>
 <pre class="literal-block">
@@ -537,7 +635,7 @@ s seqY 0 51 + 51 TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
 p                %*.14442011.(%##"%$$$$###""!!!""""&'(*,340.,,.~~~~~~~~~~~
 </pre>
 <p>The "p" line indicates the probability that each column is wrongly
-aligned, using a compact code (the same as fastq-sanger format):</p>
+aligned, using a compact code (the same as <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">fastq</a> format):</p>
 <blockquote>
 <table border="1" class="docutils">
 <colgroup>
@@ -633,8 +731,6 @@ aligned, using a compact code (the same as fastq-sanger format):</p>
 <p>Note that each alignment is grown from a "core" region, and the
 ambiguity estimates assume that the core is correctly aligned.  The
 core is indicated by "~" symbols, and it contains exact matches only.</p>
-<p>LAST has options to find alignments with optimal column probabilities,
-instead of optimal score: see <a class="reference external" href="lastal.html">lastal.html</a>.</p>
 </div>
 </div>
 </body>


=====================================
doc/last-tutorial.txt
=====================================
@@ -1,211 +1,334 @@
-LAST Tutorial
+LAST Cookbook
 =============
 
-LAST finds similar regions between sequences, and aligns them.
+LAST_ is used by running commands in a terminal / command line.  It
+has many options: unfortunately, the LAST developers don't know the
+best options for every possible alignment task.  Here are some
+reasonable starting points.  Feel free to optimize (and share!) search
+protocols.
 
-Example 1: Compare the human and fugu mitochondrial genomes
------------------------------------------------------------
+A minimal example: compare human & fugu mitochondrial genomes
+-------------------------------------------------------------
 
-For our first example, we wish to find and align similar regions
-between the human and fugu mitochondrial genomes.  You can find these
-sequences in the examples directory: humanMito.fa and fuguMito.fa.  We
-can compare them like this::
+Let's find and align similar regions between the human and fugu
+mitochondrial genomes.  These FASTA-format files are in LAST's
+examples directory: humanMito.fa and fuguMito.fa.  The simplest
+possible usage is::
 
-  lastdb -cR01 humdb humanMito.fa
+  lastdb humdb humanMito.fa
   lastal humdb fuguMito.fa > myalns.maf
 
-The lastdb command creates several files whose names begin with
-"humdb".  The lastal command then compares fuguMito.fa to the humdb
+The lastdb_ command creates several files whose names begin with
+"humdb".  The lastal_ command then compares fuguMito.fa to the humdb
 files, and writes the alignments to a file called "myalns.maf".
 
-The "-cR01" option suppresses alignments caused by simple sequence
-such as cacacacacacacacacacacaca.
-
 Understanding the output
 ------------------------
 
 The output has very long lines, so you need to view it without
-line-wrapping.  For example, with a Unix/Linux/MacOsX command line,
-you can use::
+line-wrapping.  For example, you can use::
 
   less -S myalns.maf
 
-Each alignment looks like this::
+Each alignment looks like this (MAF_ format)::
 
   a score=27 EG2=4.7e+04 E=2.6e-05
   s humanMito 2170 145 + 16571 AGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTT...
   s fuguMito  1648 142 + 16447 AGTAGGCTTAGAAGCAGCCACCA--CAAGAAAGCGTT...
 
 The score is a measure of how significant the similarity is.  EG2 and
-E are explained at `<last-evalues.html>`_.  Lines starting with "s"
-contain: the sequence name, the start coordinate of the alignment, the
-number of bases spanned by the alignment, the strand, the sequence
-length, and the aligned bases.
+E are explained at last-evalues_.  Lines starting with "s" contain:
+the sequence name, the start coordinate of the alignment, the number
+of bases spanned by the alignment, the strand, the sequence length,
+and the aligned bases.
 
 The start coordinates are zero-based.  This means that, if the
 alignment begins right at the start of a sequence, the coordinate is
-0.  If the strand is "-", the start coordinate is in the reverse
-strand.
+0.  If the strand is "-", the start coordinate is the coordinate in
+the reverse-complemented sequence (the same as if you were to
+reverse-complement the sequence before giving it to LAST).
+
+You can convert MAF to other formats with maf-convert_, or use lastal_
+option ``-f`` to get a few other formats.
+
+More accurate: learn substitution & gap rates
+---------------------------------------------
+
+We can get more accurate alignments between the human and fugu
+mitochondrial genomes like this::
+
+  lastdb humdb humanMito.fa
+  last-train humdb fuguMito.fa > hufu.train
+  lastal -p hufu.train humdb fuguMito.fa > myalns.maf
 
-This alignment format is called `MAF (multiple alignment format)
-<http://genome.ucsc.edu/FAQ/FAQformat.html#format5>`_.  You can
-convert it to several other formats using `maf-convert
-<maf-convert.html>`_.  You can make lastal produce a few other formats
-with option -f (see `<lastal.html>`_).
+The last-train_ command finds the rates of deletion, insertion, and
+each kind of substitution between these sequences, and writes them to
+a file "hufu.train".  lastal's ``-p`` option uses this file to get
+more-accurate alignments.
 
-Example 2: Compare vertebrate proteins to invertebrate proteins
----------------------------------------------------------------
+Comparing protein sequences
+---------------------------
 
-Use the lastdb -p option to indicate that the sequences are proteins::
+We can compare some query proteins to some reference proteins like
+this::
 
-  lastdb -p -cR01 invdb invertebrate.fa
-  lastal invdb vertebrate.fa
+  lastdb -p -c -R01 refdb ref-prots.fa
+  lastal refdb query-prots.fa > prot-alns.maf
 
-Example 3: Compare DNA sequences to protein sequences
------------------------------------------------------
+``-p`` tells it the sequences are proteins.  (If you forget ``-p`` and
+the sequences look proteinaceous, you'll get a warning message.)
 
-Here we use the -F15 option, to specify translated alignment with a
-score penalty of 15 for frameshifts::
+The other options suppress alignments caused by simple sequence such
+as ``APPSPAPPSPAPPSPAPPSPAP``:
 
-  lastdb -p -cR01 protdb proteins.fa
-  lastal -F15 protdb dnas.fa
+* ``-R01`` converts the sequence letters to uppercase, then finds
+  simple regions and converts them to lowercase.  This will be done
+  for both ref-prots and query-prots.
 
-Example 4: Find high-similarity, and short, protein alignments
---------------------------------------------------------------
+* ``-c`` omits alignments that lack a significant amount of
+  uppercase-to-uppercase alignment.
 
-LAST uses a `scoring scheme <last-matrices.html>`_ to find
-similarities.  Some scoring schemes are tuned for weak similarities,
-others for strong similarities.  The PAM30 scoring scheme finds strong
-protein similarities::
+You can also use last-train_, but we've hardly tested it for
+protein-protein alignment, so we're not sure if it helps.
 
-  lastdb -p -cR01 invdb invertebrate.fa
-  lastal -pPAM30 invdb vertebrate.fa
+Find high-similarity, and short, protein alignments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If we just want high-similarity alignments, we can use the PAM30 (or
+PAM10) `scoring scheme`_::
+
+  lastdb -p -c -R01 refdb ref-prots.fa
+  lastal -p PAM30 refdb query-prots.fa > prot-alns.maf
 
 This has two advantages:
 
-* It omits weak alignments, or alignment parts (occasionally a strong
-  similarity is flanked by a weak similarity).
+* It omits low-similarity alignments, or alignment parts.
+
+* It can find short similarities, which would be deemed insignificant
+  (likely to occur by chance between random sequences) unless we focus
+  the search on high-similarity.
+
+Comparing DNA to proteins
+-------------------------
+
+We can find related regions of DNA and proteins, allowing for nonsense
+mutations and frameshifts.  For example, let's find DNA regions
+related to transposon proteins::
+
+  lastdb -q -c -R01 trandb transposon-prots.fa
+  last-train --codon trandb dna.fa > codon.train
+  lastal -p codon.train -m100 -D1e9 -K1 trandb dna.fa > out.maf
+
+``-q`` appends ``*`` meaning STOP to each protein, and treats ``*`` as
+a 21st protein letter.
+
+``--codon`` makes it do DNA-versus-protein.  Here, last-train_ tries
+to learn 21x64 substitution rates, so it needs a fairly large amount
+of data (e.g. a chromosome).
+
+``-m100`` makes it more slow-and-sensitive than the default (which is
+``-m10``), see lastal_.
+
+``-D1e9`` sets a strict significance_ threshold.  It means: only
+report strong similarities that would be expected to occur by chance,
+between random sequences, at most once per 10^9 base-pairs.  The
+default is 1e6.
+
+``-K1`` streamlines the output by omitting any alignment whose DNA
+range lies in that of a higher-scoring alignment.
+
+Another possibility is to add last-train_ option ``-X1``, which treats
+matches to ``X`` (unknown) amino acids as neutral instead of
+disfavored.
+
+Aligning high-indel-error long DNA reads to a genome
+----------------------------------------------------
+
+Suppose we have DNA reads in either FASTA or FASTQ_ format.  This is
+sensitive but slow::
 
-* It can find short similarities.  If we seek very short similarities,
-  weak ones are hopeless (statistically insignificant), so we had
-  better focus on strong ones.  (How short is "very short"?  It
-  depends on the amount of sequence data we are searching, but perhaps
-  roughly less than 40 amino acids.)
+  lastdb -P8 -uNEAR -R01 mydb genome.fa
+  last-train -P8 -Q0 mydb reads.fastq > reads.train
+  lastal -P8 -p reads.train mydb reads.fastq | last-split > out.maf
 
-Example 5: Align human DNA sequences to the human genome
---------------------------------------------------------
+``-P8`` uses 8 parallel threads, adjust as appropriate for your
+computer.  This has no effect on the results.
 
-We can align human DNA sequences to the human genome like this::
+``-uNEAR`` selects a `seeding scheme`_ that's better at finding
+alignments with few substitutions and/or many gaps.
 
-  lastdb -uNEAR -R01 humandb human/chr*.fa
-  lastal humandb queries.fa | last-split > myalns.maf
+``-Q0`` makes it discard the fastq_ quality information (or you can
+keep-but-ignore it with ``-Qkeep``).
 
-This will use about 15 gigabytes of memory.
+last-split_ finds a unique best alignment for each part of each read.
 
-* -uNEAR selects a `seeding scheme <last-seeds.html>`_ that makes it
-  better at finding short-and-strong similarities.  (It also changes
-  the default scoring scheme.)
+Here we used ``-R01`` to lowercase simple sequence like
+``cacacacacacacacacacacaca``.  But we didn't suppress it with ``-c``,
+so as not to hide anything from last-split_.  If desired, you can
+filter lowercase with last-postmask_.
 
-* -R01 tells it to mark simple sequences (such as cacacacacacacacaca)
-  by lowercase, but not suppress them.  This has no effect on the
-  alignment, but it allows us to see simple sequences in the output,
-  and gives us the option to do `post-alignment masking
-  <last-postmask.html>`_.
+You can go faster by sacrificing a bit of sensitivity.  It depends on
+your aim, e.g. slow-and-sensitive seems necessary to find intricate
+rearrangements of repeats.  Suggested ways to go faster:
 
-* last-split reads the alignments produced by lastal, and looks for a
-  unique best alignment for each part of each query.  It allows
-  different parts of one query to match different parts of the genome,
-  which may happen due to rearrangements.  It has several useful
-  options, please see `<last-split.html>`_.
+* `Mask repeats`_.
 
-Tuning speed, sensitivity, memory and disk usage
-------------------------------------------------
+* Add lastal option ``-k8`` (or ``-k16`` etc).  This makes it faster,
+  by only finding initial matches starting at every 8th (or 16th etc)
+  position in the reads.
 
-* You can make LAST faster by `using multiple CPUs
-  <last-parallel.html>`_.
+* Replace ``-uNEAR`` with ``-uRY32`` (or ``-uRY16``, ``-uRY8``,
+  ``-uRY4``).  This makes it check for initial matches starting at
+  only ~1/32 (or ~1/16 etc) of positions, in both reads and genome.
+  Compared to ``-k``: this harms sensitivity slightly more, but
+  reduces memory use and makes lastdb faster.
 
-* You can `trade off speed, sensitivity, memory and disk usage
-  <last-tuning.html>`_.  To go much faster, try either lastal option
-  ``-k32`` or lastdb option ``-uRY32`` (but not both).  ``-k32`` is a
-  bit more sensitive, but ``-uRY32`` uses less memory.
+Which genome version to use?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Example 6: Find very short DNA alignments
------------------------------------------
+Some genome versions (e.g. for human) have artificial
+exactly-duplicated regions, which makes it hard to align reads
+uniquely.  To avoid that, look for a genome version called something
+like "analysis set".
 
-By default, LAST is quite strict, and only reports significant
-alignments that will rarely occur by chance.  In the preceding
-example, the minimum alignment length is about 26 bases (less for
-smaller genomes).  To find shorter alignments, we must down-tune the
-strictness::
+Aligning low-error long DNA reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-  lastdb -uNEAR -R01 humandb human/chr*.fa
-  lastal -D100 humandb queries.fa | last-split > myalns.maf
+We can do this the same way as for high-error reads, but perhaps
+accelerate more aggressively (e.g. ``-uRY32``).
 
-* -D100 makes lastal report alignments that could occur by chance once
-  per hundred query letters.  (The default is once per million.)
+If repeats are not masked, lastal_ option ``-C2`` may reduce run time
+with little effect on accuracy.
 
-In this example, the minimum alignment length is about 20 bases (less
-for smaller genomes).
+Aligning potentially-spliced RNA or cDNA long reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Example 7: Align human fastq sequences to the human genome
-----------------------------------------------------------
+See here_.  (For low-error reads, you can probably omit ``-d90`` and
+``-m20``.)
 
-DNA sequences are not always perfectly accurate, and they are
-sometimes provided in fastq format, which indicates the reliability of
-each base.  LAST can use this information to improve alignment
-accuracy.  Option -Q1 indicates fastq-sanger format::
+Aligning Illumina DNA reads to a genome
+---------------------------------------
 
-  lastdb -uNEAR -R01 humandb human/chr*.fa
-  lastal -Q1 humandb queries.fastq | last-split > myalns.maf
+::
 
-**Assumption:** LAST assumes the reliabilities reflect substitution
-errors, not insertion/deletion errors.  If that is not true, you can
-tell it to ignore the reliability data with -Q0.
+  lastdb -P8 -uNEAR -R01 -C2 mydb genome.fasta
+  last-train -P8 -Q1 mydb reads.fastq.gz > reads.train
+  lastal -P8 -p reads.train mydb reads.fastq.gz | last-split > out.maf
 
-Fastq format confusion
-----------------------
+Most LAST commands accept gzip-compressed (``.gz``) files.
 
-Unfortunately, there is more than one fastq format (see
-http://nar.oxfordjournals.org/content/38/6/1767.long).  Recently
-(2013) fastq-sanger seems to be dominant, but if you have another
-variant you need to change the -Q option (see `<lastal.html>`_).
+lastdb_ option ``-C2`` makes the alignment a bit faster, but uses more
+memory.  This has no effect on the results.  (You could use it in the
+other examples too, but it might not be faster.)
 
-Paired DNA reads
-----------------
+``-Q1`` makes it use the fastq_ quality information to improve the
+training and alignment.  LAST **assumes** that the qualities reflect
+substitution errors, not insertion/deletion errors.  (For long
+non-Illumina reads, we suspect this assumption doesn't hold, so we
+didn't use this option.)
 
-If you have paired DNA reads, there are two options:
+This recipe may be excessively slow-and-sensitive.  Adding lastal_
+option ``-C2`` may make it faster with negligible accuracy loss.  You
+can accelerate with e.g. ``-uRY16`` or ``-k16`` as above.
 
-1. Use last-pair-probs (see `<last-pair-probs.html>`_).
+Finding very short DNA alignments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-2. Ignore the pairing information, and align the reads individually
-   (using last-split as above).  This may be useful because
-   last-pair-probs does not currently allow different parts of one
-   read to match different parts of the genome, though it does allow
-   the two reads in a pair to match (e.g.) different chromosomes.
+By default, LAST only reports significant_ alignments that will rarely
+occur by chance.  In the preceding example, the minimum alignment
+length is about 26 bases for a human-size genome (less for smaller
+genomes).  To find shorter alignments, add lastal_ option ``-D100``
+(say), to get alignments that could occur by chance once per hundred
+query letters (the default is once per million.)  This makes the
+minimum alignment length about 20 bases for a human-size genome.
 
-Example 8: Compare the human and mouse genomes
-----------------------------------------------
+Aligning paired-end Illumina DNA reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-See `here <https://github.com/mcfrith/last-genome-alignments>`_.
+You could use last-pair-probs_.  It has a disadvantage: it doesn't
+allow different parts of one read (i.e. one "end") to align to
+different parts of the genome.  Alternatively, you could align the
+reads individually, ignoring the pair relationships::
 
-Example 9: Compare the human and chimp genomes
-----------------------------------------------
+  fastq-interleave reads1.fq reads2.fq | lastal -P8 -p reads.train mydb | last-split > out.maf
 
-See `here <https://github.com/mcfrith/last-genome-alignments>`_.  But
-that recipe is *extremely* slow-and-accurate.  You can `tune
-<last-tuning.html>`_ it to compare huge, high-similarity genomes with
-moderate run time and memory use:
+``fastq-interleave`` ensures that each read has a unique name (by
+appending "/1" and "/2" if necessary).
 
-* Omit the sensitivity-boosting lastal -m option.
+Aligning potentially-spliced Illumina reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-* Replace ``-uNEAR`` with ``-uRY32``.
+See last-split_ (and last-pair-probs_).
 
-* If the "reference" genome (the one given to lastdb) is > 4 GB, it's
-  probably more efficient to use lastdb8 and lastal8, instead of
-  lastdb and lastal.
+Aligning human & chimp genomes
+------------------------------
 
-Example 10: Ambiguity of alignment columns
-------------------------------------------
+This is very slow-and-sensitive::
+
+  lastdb -P8 -uNEAR -R01 humdb human_no_alt_analysis_set.fa
+  last-train -P8 --revsym -E0.05 -C2 humdb chimp.fa > humchi.train
+  lastal -E0.05 -C2 -p humchi.train humdb chimp.fa | last-split > humchi1.maf
+
+``--revsym`` makes the substitution rates the same on both strands.
+For example, it makes A→G equal T→C (because A→G on one strand means
+T→C on the other strand).  This is usually appropriate for
+genome-genome comparison (but maybe not for mitochondria which have
+asymmetric "heavy" and "light" strands).
+
+``-E0.05`` means only get significant_ alignments that would be
+expected to occur by chance at a rate ≤ 0.05 times per pair of random
+sequences of length 1 billion each.
+
+The result so far is asymmetric: each part of the chimp genome is
+aligned to at most one part of the human genome, but not vice-versa.
+We can get one-to-one alignments like this::
+
+  maf-swap humchi1.maf | last-split > humchi2.maf
+
+Then we can discard less-confident alignments, and convert_ to a
+compact tabular format::
+
+  last-postmask humchi2.maf | maf-convert -n tab | awk -F= '$2 <= 1e-5' > humchi.tab
+
+last-postmask_ discards alignments caused by simple sequence.  The
+``awk`` command gets alignments with `mismap probability`_ ≤ 10^-5.
+Finally, we can make a dotplot_::
+
+  last-dotplot humchi.tab humchi.png
+
+To go faster, with minor accuracy loss: replace ``-uNEAR`` with
+``-uRY32`` and/or `mask repeats`_.
+
+To squeeze out the last 0.000...1% of accuracy: add ``-m50`` to the
+lastal_ options.
+
+Aligning human & mouse genomes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can do this in the same way as human/chimp, except that ``-uNEAR``
+should be omitted.  To increase sensitivity, but also time and memory
+use, add lastdb seeding_ option ``-uMAM4`` or or ``-uMAM8``.  To
+increase them even more, add lastal_ option ``-m100`` (or as high as
+you can bear).
+
+Large reference sequences
+-------------------------
+
+If the sequences that you give to lastdb exceed ~4 billion letters,
+consider using 8-byte LAST (lastdb8_ and lastal8_).  Ordinary (4-byte)
+LAST can't handle so much sequence at once, so lastdb_ splits it into
+"volumes", which may be inefficient.  8-byte LAST avoids voluming, but
+uses more memory.  So lastdb8_ works well with a memory-reducing
+option: ``-uRY`` or ``-w`` or ``-W``.
+
+Moar faster
+-----------
+
+* `Using multiple CPUs / cores <last-parallel.html>`_
+* `Various speed & memory options <last-tuning.html>`_
+
+Ambiguity of alignment columns
+------------------------------
 
 Consider this alignment::
 
@@ -214,13 +337,10 @@ Consider this alignment::
   TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
 
 The middle section has such weak similarity that its precise alignment
-cannot be confidently inferred.
-
-It is sometimes useful to estimate the ambiguity of each column in an
-alignment.  We can do that using lastal option -j4::
+cannot be confidently inferred.  We can see the confidence of each
+alignment column with lastal_ option ``-j4``::
 
-  lastdb -cR01 humdb humanMito.fa
-  lastal -j4 humdb fuguMito.fa > myalns.maf
+  lastal -j4 -p hufu.train humdb fuguMito.fa > myalns.maf
 
 The output looks like this::
 
@@ -230,7 +350,7 @@ The output looks like this::
   p                %*.14442011.(%##"%$$$$###""!!!""""&'(*,340.,,.~~~~~~~~~~~
 
 The "p" line indicates the probability that each column is wrongly
-aligned, using a compact code (the same as fastq-sanger format):
+aligned, using a compact code (the same as fastq_ format):
 
   ======  =================   ======  =================
   Symbol  Error probability   Symbol  Error probability
@@ -256,5 +376,26 @@ Note that each alignment is grown from a "core" region, and the
 ambiguity estimates assume that the core is correctly aligned.  The
 core is indicated by "~" symbols, and it contains exact matches only.
 
-LAST has options to find alignments with optimal column probabilities,
-instead of optimal score: see `<lastal.html>`_.
+.. _last: last.html
+.. _lastdb8:
+.. _lastdb: lastdb.html
+.. _lastal8:
+.. _lastal: lastal.html
+.. _dotplot: last-dotplot.html
+.. _last-pair-probs: last-pair-probs.html
+.. _last-postmask: last-postmask.html
+.. _mismap probability:
+.. _last-split: last-split.html
+.. _last-train: last-train.html
+.. _convert:
+.. _maf-convert: maf-convert.html
+.. _scoring scheme: last-matrices.html
+.. _seeding scheme:
+.. _seeding: last-seeds.html
+.. _last-evalues:
+.. _significant:
+.. _significance: last-evalues.html
+.. _fastq: https://doi.org/10.1093/nar/gkp1137
+.. _here:
+.. _mask repeats: https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md
+.. _MAF: http://genome.ucsc.edu/FAQ/FAQformat.html#format5


=====================================
doc/last.html
=====================================
@@ -318,26 +318,25 @@ table.field-list { border: thin solid green }
 <div class="document" id="last-genome-scale-sequence-comparison">
 <h1 class="title">LAST: Genome-Scale Sequence Comparison</h1>
 
-<p>LAST finds similar regions between sequences, and aligns them.  It is
-designed for comparing large datasets to each other (e.g. vertebrate
-genomes and/or large numbers of DNA reads).  It can:</p>
+<p>LAST finds and aligns similar regions between sequences.  It's
+designed for moderately large data (e.g. genomes, DNA reads,
+proteomes).  It's especially good at:</p>
 <ul>
-<li><p class="first">Indicate the (un)ambiguity of each column in an alignment.</p>
+<li><p class="first">Finding sequence rearrangements and recombinations: we believe
+<a class="reference external" href="last-split.html">last-split</a> does that more rigorously than anything else.</p>
 </li>
-<li><p class="first">Use sequence quality data in a rigorous fashion.</p>
+<li><p class="first">Finding related parts of DNA and proteins, especially protein
+fossils (but introns are not considered).</p>
 </li>
-<li><p class="first">Align DNA to proteins with frameshifts.</p>
+<li><p class="first">Unusual data, e.g. AT-rich DNA, because we can <a class="reference external" href="last-train.html">fit</a> parameters to
+the data and calculate <a class="reference external" href="last-evalues.html">significance</a>.</p>
 </li>
-<li><p class="first">Compare PSSMs to sequences.</p>
-</li>
-<li><p class="first">Calculate the likelihood of chance similarities between random
-sequences.</p>
-</li>
-<li><p class="first">Do split and spliced alignment.</p>
-</li>
-<li><p class="first">Train alignment parameters for unusual kinds of sequence (e.g. nanopore).</p>
+<li><p class="first">Sensitive DNA-DNA search, due to <a class="reference external" href="last-train.html">fitting</a>, sensitive <a class="reference external" href="last-seeds.html">seeding</a>, and
+calculating <a class="reference external" href="last-evalues.html">significance</a>.</p>
 </li>
 </ul>
+<p>It can also: indicate the confidence/uncertainty of each column in an
+alignment, and use sequence quality data in a rigorous fashion.</p>
 <div class="section" id="requirements">
 <h2>Requirements</h2>
 <p>To handle mammalian genomes, it's best if you have at least 10-20
@@ -354,17 +353,8 @@ compile the programs, type:</p>
 <pre class="literal-block">
 make
 </pre>
-<p>You might get some harmless warning messages.  If your compiler is
-really old, you might get an error message like this:</p>
-<pre class="literal-block">
-unrecognized command line option "-std=c++11"
-</pre>
-<p>In that case, you can compile like this (which will disable
-multi-threading):</p>
-<pre class="literal-block">
-make CXXFLAGS=-O3
-</pre>
-<p>Or you can specify another compiler like this:</p>
+<p>You might get some harmless warning messages.  It's possible to
+specify a C++ compiler like this:</p>
 <pre class="literal-block">
 make CXX=MyOtherCompiler
 </pre>
@@ -398,13 +388,9 @@ the new programs.</p>
 Public License, either version 3 of the License, or (at your option)
 any later version.</p>
 </div>
-<div class="section" id="website">
-<h2>Website</h2>
-<p>LAST's website is: <a class="reference external" href="http://last.cbrc.jp/">http://last.cbrc.jp/</a></p>
-</div>
 <div class="section" id="contact">
 <h2>Contact</h2>
-<p>Questions and feedback are very welcome.  We like to use a <strong>public</strong>
+<p>Questions and feedback are welcome.  We like to use a <strong>public</strong>
 mailing list, so everyone can benefit from the discussion:</p>
 <blockquote>
 last-align (ATmark) googlegroups (dot) com</blockquote>
@@ -419,9 +405,6 @@ will never send rude or mocking replies.</p>
 error, please try to: tell us the LAST version number and exact error
 message, and give us some input data and exact commands that trigger
 the problem.</strong></p>
-<p>If you do a benchmarking test of LAST, we recommend you contact us to
-check you are using it in a suitable way.  The history of
-bioinformatics benchmarks shows it is all too easy to get this wrong.</p>
 </div>
 </div>
 </body>


=====================================
doc/last.txt
=====================================
@@ -1,18 +1,24 @@
 LAST: Genome-Scale Sequence Comparison
 ======================================
 
-LAST finds similar regions between sequences, and aligns them.  It is
-designed for comparing large datasets to each other (e.g. vertebrate
-genomes and/or large numbers of DNA reads).  It can:
-
-* Indicate the (un)ambiguity of each column in an alignment.
-* Use sequence quality data in a rigorous fashion.
-* Align DNA to proteins with frameshifts.
-* Compare PSSMs to sequences.
-* Calculate the likelihood of chance similarities between random
-  sequences.
-* Do split and spliced alignment.
-* Train alignment parameters for unusual kinds of sequence (e.g. nanopore).
+LAST finds and aligns similar regions between sequences.  It's
+designed for moderately large data (e.g. genomes, DNA reads,
+proteomes).  It's especially good at:
+
+* Finding sequence rearrangements and recombinations: we believe
+  last-split_ does that more rigorously than anything else.
+
+* Finding related parts of DNA and proteins, especially protein
+  fossils (but introns are not considered).
+
+* Unusual data, e.g. AT-rich DNA, because we can fit_ parameters to
+  the data and calculate significance_.
+
+* Sensitive DNA-DNA search, due to fitting_, sensitive seeding_, and
+  calculating significance_.
+
+It can also: indicate the confidence/uncertainty of each column in an
+alignment, and use sequence quality data in a rigorous fashion.
 
 Requirements
 ------------
@@ -33,17 +39,8 @@ compile the programs, type::
 
   make
 
-You might get some harmless warning messages.  If your compiler is
-really old, you might get an error message like this::
-
-  unrecognized command line option "-std=c++11"
-
-In that case, you can compile like this (which will disable
-multi-threading)::
-
-  make CXXFLAGS=-O3
-
-Or you can specify another compiler like this::
+You might get some harmless warning messages.  It's possible to
+specify a C++ compiler like this::
 
   make CXX=MyOtherCompiler
 
@@ -80,15 +77,10 @@ LAST (including the scripts) is distributed under the GNU General
 Public License, either version 3 of the License, or (at your option)
 any later version.
 
-Website
--------
-
-LAST's website is: http://last.cbrc.jp/
-
 Contact
 -------
 
-Questions and feedback are very welcome.  We like to use a **public**
+Questions and feedback are welcome.  We like to use a **public**
 mailing list, so everyone can benefit from the discussion:
 
   last-align (ATmark) googlegroups (dot) com
@@ -106,6 +98,8 @@ error, please try to: tell us the LAST version number and exact error
 message, and give us some input data and exact commands that trigger
 the problem.**
 
-If you do a benchmarking test of LAST, we recommend you contact us to
-check you are using it in a suitable way.  The history of
-bioinformatics benchmarks shows it is all too easy to get this wrong.
+.. _fit:
+.. _fitting: last-train.html
+.. _last-split: last-split.html
+.. _seeding: last-seeds.html
+.. _significance: last-evalues.html


=====================================
doc/lastal.html
=====================================
@@ -435,11 +435,13 @@ MAF.</p>
 <kbd><span class="option">-D <var>LENGTH</var></span></kbd></td>
 <td>Report alignments that are expected by chance at most once per
 LENGTH query letters.  This option only affects the default
-value of -E, so if you specify -E then -D has no effect.</td></tr>
+value of <tt class="docutils literal"><span class="pre">-E</span></tt>, so if you specify <tt class="docutils literal"><span class="pre">-E</span></tt> then <tt class="docutils literal"><span class="pre">-D</span></tt> has no
+effect.</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-E <var>THRESHOLD</var></span></kbd></td>
 <td>Maximum EG2 (<a class="reference external" href="last-evalues.html">expected alignments per square giga</a>).  This option only affects the default
-value of -e, so if you specify -e then -E has no effect.</td></tr>
+value of <tt class="docutils literal"><span class="pre">-e</span></tt>, so if you specify <tt class="docutils literal"><span class="pre">-e</span></tt> then <tt class="docutils literal"><span class="pre">-E</span></tt> has no
+effect.</td></tr>
 </tbody>
 </table>
 </blockquote>
@@ -507,12 +509,19 @@ regions have sizes j and k, where j ≤ k, the cost is: a +
 b⋅(k-j) + c⋅j.  If c ≥ a + 2b (the default), it reduces to
 standard affine gaps.</td></tr>
 <tr><td class="option-group">
-<kbd><span class="option">-F <var>COST</var></span></kbd></td>
+<kbd><span class="option">-F <var>LIST</var></span></kbd></td>
 <td><p class="first">Align DNA queries to protein reference sequences, using the
-specified frameshift cost.  A value of 15 seems to be
-reasonable.  (As a special case, -F0 means DNA-versus-protein
-alignment without frameshifts, which is faster.)  The output
-looks like this:</p>
+specified frameshift cost(s): either one cost (old-style
+frameshifts), or 4 comma-separated costs (new-style
+frameshifts).  As a special case, <tt class="docutils literal"><span class="pre">-F0</span></tt> means
+DNA-versus-protein alignment without frameshifts, which is
+faster.</p>
+<p>The four new-style frameshift costs are for, in order: deletion
+of length k mod 3 = 1 bases, deletion of k mod 3 = 2 bases,
+insertion of k mod 3 = 1 bases, insertion of k mod 3 = 2 bases.
+(You're expected to get them from <a class="reference external" href="last-train.html">last-train</a>, not set them
+manually.)</p>
+<p>The output looks like this:</p>
 <pre class="literal-block">
 a score=108
 s prot 2  40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
@@ -535,12 +544,12 @@ The default value is e-1, which arguably produces the best
 alignments.  Lower values improve speed, by quitting unpromising
 extensions sooner.  You can specify this parameter in 3 ways:</p>
 <ul class="last">
-<li><p class="first">A score (e.g. -z20).</p>
+<li><p class="first">A score (e.g. <tt class="docutils literal"><span class="pre">-z20</span></tt>).</p>
 </li>
-<li><p class="first">A percentage.  For example, -z50% specifies 50% of the default
-value (rounded down to the nearest integer).</p>
+<li><p class="first">A percentage.  For example, <tt class="docutils literal"><span class="pre">-z50%</span></tt> specifies 50% of the
+default value (rounded down to the nearest integer).</p>
 </li>
-<li><p class="first">A maximum gap length.  For example, -z8g sets the maximum
+<li><p class="first">A maximum gap length.  For example, <tt class="docutils literal"><span class="pre">-z8g</span></tt> sets the maximum
 score drop to: min[a+8b, A+8B].  However, this never increases
 the value above the default.</p>
 </li>
@@ -724,8 +733,10 @@ with the following meanings.</p>
 </li>
 </ol>
 <p class="last">Details: Tantan is applied separately to forward and reverse
-strands.  For DNA-versus-protein alignment (option -F), it is
-applied to the DNA after translation, at the protein level.</p>
+strands.  For DNA-versus-protein alignment, if you use a codon
+substitution matrix (e.g. from <tt class="docutils literal"><span class="pre">last-train</span> <span class="pre">--codon</span></tt>), tantan
+is applied to the DNA before translation, else it is applied
+after translation.</p>
 </td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-u <var>NUMBER</var></span></kbd></td>
@@ -766,14 +777,13 @@ alignment</p>
 this core</p>
 </li>
 </ul>
-<p class="last">Use -w0 to turn this off.</p>
+<p class="last">Use <tt class="docutils literal"><span class="pre">-w0</span></tt> to turn this off.</p>
 </td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-G <var>GENETIC-CODE</var></span></kbd></td>
-<td>Specify the genetic code for translating DNA to protein.  This
-option has no effect unless DNA-versus-protein alignment is
-selected with option -F.  Codes are specified by numbers
-(e.g. 1 = standard, 2 = vertebrate mitochondrial), listed here:
+<td>Specify the genetic code for translating DNA to protein.  Codes
+are specified by numbers (e.g. 1 = standard, 2 = vertebrate
+mitochondrial), listed here:
 <a class="reference external" href="https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi">https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi</a>.  Any
 other GENETIC-CODE is assumed to be a file name: for an example
 of the format, see vertebrateMito.gc in the examples directory.</td></tr>


=====================================
doc/lastal.txt
=====================================
@@ -108,12 +108,14 @@ E-value options
   -D LENGTH
       Report alignments that are expected by chance at most once per
       LENGTH query letters.  This option only affects the default
-      value of -E, so if you specify -E then -D has no effect.
+      value of ``-E``, so if you specify ``-E`` then ``-D`` has no
+      effect.
 
   -E THRESHOLD
       Maximum EG2 (`expected alignments per square giga
       <last-evalues.html>`_).  This option only affects the default
-      value of -e, so if you specify -e then -E has no effect.
+      value of ``-e``, so if you specify ``-e`` then ``-E`` has no
+      effect.
 
 Score options
 ~~~~~~~~~~~~~
@@ -175,12 +177,21 @@ Score options
       b⋅(k-j) + c⋅j.  If c ≥ a + 2b (the default), it reduces to
       standard affine gaps.
 
-  -F COST
+  -F LIST
       Align DNA queries to protein reference sequences, using the
-      specified frameshift cost.  A value of 15 seems to be
-      reasonable.  (As a special case, -F0 means DNA-versus-protein
-      alignment without frameshifts, which is faster.)  The output
-      looks like this::
+      specified frameshift cost(s): either one cost (old-style
+      frameshifts), or 4 comma-separated costs (new-style
+      frameshifts).  As a special case, ``-F0`` means
+      DNA-versus-protein alignment without frameshifts, which is
+      faster.
+
+      The four new-style frameshift costs are for, in order: deletion
+      of length k mod 3 = 1 bases, deletion of k mod 3 = 2 bases,
+      insertion of k mod 3 = 1 bases, insertion of k mod 3 = 2 bases..
+      (You're expected to get them from last-train_, not set them
+      manually.)
+
+      The output looks like this::
 
         a score=108
         s prot 2  40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
@@ -202,12 +213,12 @@ Score options
       alignments.  Lower values improve speed, by quitting unpromising
       extensions sooner.  You can specify this parameter in 3 ways:
 
-      * A score (e.g. -z20).
+      * A score (e.g. ``-z20``).
 
-      * A percentage.  For example, -z50% specifies 50% of the default
-        value (rounded down to the nearest integer).
+      * A percentage.  For example, ``-z50%`` specifies 50% of the
+        default value (rounded down to the nearest integer).
 
-      * A maximum gap length.  For example, -z8g sets the maximum
+      * A maximum gap length.  For example, ``-z8g`` sets the maximum
         score drop to: min[a+8b, A+8B].  However, this never increases
         the value above the default.
 
@@ -361,8 +372,10 @@ Miscellaneous options
       2. Convert simple repeats, within AT-rich DNA, to lowercase.
 
       Details: Tantan is applied separately to forward and reverse
-      strands.  For DNA-versus-protein alignment (option -F), it is
-      applied to the DNA after translation, at the protein level.
+      strands.  For DNA-versus-protein alignment, if you use a codon
+      substitution matrix (e.g. from ``last-train --codon``), tantan
+      is applied to the DNA before translation, else it is applied
+      after translation.
 
   -u NUMBER
       Specify treatment of lowercase letters when extending
@@ -398,13 +411,12 @@ Miscellaneous options
       * has start coordinates offset by DISTANCE or less relative to
         this core
 
-      Use -w0 to turn this off.
+      Use ``-w0`` to turn this off.
 
   -G GENETIC-CODE
-      Specify the genetic code for translating DNA to protein.  This
-      option has no effect unless DNA-versus-protein alignment is
-      selected with option -F.  Codes are specified by numbers
-      (e.g. 1 = standard, 2 = vertebrate mitochondrial), listed here:
+      Specify the genetic code for translating DNA to protein.  Codes
+      are specified by numbers (e.g. 1 = standard, 2 = vertebrate
+      mitochondrial), listed here:
       https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi.  Any
       other GENETIC-CODE is assumed to be a file name: for an example
       of the format, see vertebrateMito.gc in the examples directory.
@@ -588,3 +600,5 @@ lastal8
 lastal8 has identical usage to lastal, and is used with `lastdb8
 <lastdb.html>`_.  lastal cannot read the output of lastdb8, and
 lastal8 cannot read the output of lastdb.
+
+.. _last-train: last-train.html


=====================================
scripts/last-train
=====================================
@@ -19,6 +19,7 @@ import sys
 import tempfile
 
 proteinAlphabet20 = "ACDEFGHIKLMNPQRSTVWY"
+proteinAlphabet21 = proteinAlphabet20 + "*"
 
 def myOpen(fileName):  # faster than fileinput
     if fileName == "-":
@@ -183,7 +184,7 @@ def scaleFromHeader(lines):
     raise Exception("couldn't read the scale")
 
 def countsFromLastOutput(lines, opts):
-    nTransitions = 5
+    nTransitions = 9 if opts.codon else 5
     tranCounts = [1.0] * nTransitions  # +1 pseudocounts
     tranCounts[1] = 2.0  # deletes: opens + extensions, so 2 pseudocounts
     tranCounts[2] = 2.0  # inserts: opens + extensions, so 2 pseudocounts
@@ -196,13 +197,15 @@ def countsFromLastOutput(lines, opts):
             counts = [float(i) for i in line.split()[1:]]
             if not countMatrix:
                 matrixSize = len(counts) - nTransitions
-                nCols = int(math.sqrt(matrixSize))
+                nCols = 64 if opts.codon else int(math.sqrt(matrixSize))
                 nRows = matrixSize // nCols
-                countMatrix = [[1.0] * nCols for i in range(nRows)]
-            identities = sum(counts[i * nCols + i] for i in range(nRows))
-            alignmentLength = sum(counts[matrixSize + i] for i in range(3))
-            if 100 * identities > opts.pid * alignmentLength:
-                continue
+                pseudocount = 0.0 if opts.codon else 1.0
+                countMatrix = [[pseudocount] * nCols for i in range(nRows)]
+            if not opts.codon:
+                identities = sum(counts[i * nCols + i] for i in range(nRows))
+                alignmentLength = sum(counts[matrixSize + i] for i in range(3))
+                if 100 * identities > opts.pid * alignmentLength:
+                    continue
             for i in range(nRows):
                 for j in range(nCols):
                     if strand == "+" or opts.S != "1":
@@ -214,6 +217,13 @@ def countsFromLastOutput(lines, opts):
             alignments += 1
     if not alignments:
         raise Exception("no alignments")
+    if opts.codon:
+        pseudocounts = nRows * 32  # xxx ???
+        rowSums = [sum(i) + 1 for i in countMatrix]
+        colSums = [sum(i) + 1 for i in zip(*countMatrix)]
+        mul = pseudocounts / (sum(rowSums) * sum(colSums))
+        countMatrix = [[x + mul * i * j for j, x in zip(colSums, row)]
+                       for i, row in zip(rowSums, countMatrix)]
     return countMatrix, tranCounts + [alignments]
 
 def scoreFromProb(scale, prob):
@@ -414,21 +424,260 @@ def scoresAndScale(originalScale, matParams, delRatios, insRatios):
               "doubling the scale")
         originalScale *= 2
 
-def writeGapCosts(delCosts, insCosts, isLastFormat, outFile):
-    delInit, delGrow = delCosts
-    insInit, insGrow = insCosts
-    delOpen = delInit - delGrow
-    insOpen = insInit - insGrow
+### Routines for codons & frameshifts:
+
+def initialCodonSubstitutionProbs(matchProb):
+    aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
+    b1 = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
+    b2 = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
+    b3 = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"
+
+    p = matchProb / 61
+    q = (1/64 - p) / 19
+    if q <= 0:
+        raise Exception("initial match probability must be < 61/64")
+    matrix = [[q for j in aa] for i in proteinAlphabet20]
+    for a, x, y, z in zip(aa, b1, b2, b3):
+        codon = "ACGT".index(x) * 16 + "ACGT".index(y) * 4 + "ACGT".index(z)
+        if a == "*":
+            for row in matrix:
+                row[codon] = 1 / (64 * 20)
+        else:
+            matrix[proteinAlphabet20.index(a)][codon] = p
+    return matrix
+
+def initialCodonProbs(opts):
+    matProbs = initialCodonSubstitutionProbs(float(opts.r))
+    delOpenProb = float(opts.a)
+    delGrowProb = float(opts.b)
+    insOpenProb = float(opts.A)
+    insGrowProb = float(opts.B)
+    matchProb = 0.99 - delOpenProb - insOpenProb
+    if opts.F:
+        delProb1, delProb2, insProb1, insProb2 = map(float, opts.F.split(","))
+    else:
+        delProb1 = delProb2 = 1 - delGrowProb
+        insProb1 = insProb2 = 1 - insGrowProb
+    delProbs = delOpenProb, delGrowProb, delProb1, delProb2
+    insProbs = insOpenProb, insGrowProb, insProb1, insProb2
+    return matProbs, (matchProb, delProbs, insProbs)
+
+def formattedCodons(spec):
+    a = "acgt"
+    return (format(i + j + k, spec) for i in a for j in a for k in a)
+
+def printCodonCountMatrix(matrix):
+    print("#", " ", *formattedCodons("5"))
+    for x, row in zip(proteinAlphabet21, matrix):
+        print("#", x, *(format(i, "<5.4g") for i in row))
+
+def writeCodonScoreMatrix(outFile, matrixAndProbs, prefix):
+    matrix, rowProbs, colProbs = matrixAndProbs
+    maxLen = max(len(str(x)) for row in matrix for x in row)
+    spec = ">" + str(max(maxLen, 3))
+    print(prefix + " ", *formattedCodons(spec), file=outFile)
+    for x, row, p in zip(proteinAlphabet21, matrix, rowProbs):
+        r = " ".join(format(i, spec) for i in row)
+        print(prefix + x, r, p, file=outFile)
+    print(prefix + " ", *(format(i, spec) for i in colProbs), file=outFile)
+
+def codonMatProbsFromCounts(counts, opts):
+    total = sum(map(sum, counts))
+    probs = [[j / total for j in i] for i in counts]
+    print("# count matrix "
+          "(query letters = columns, reference letters = rows):")
+    printCodonCountMatrix(counts)
+    print()
+    return probs
+
+def freqText(probability):
+    p = 100 * probability
+    t = format(p, ".2")
+    if len(t) > 3:
+        t = format(p, "<3.2g")
+    if len(t) > 3:
+        t = t.lstrip("0")
+    return t
+
+def frameshiftProbImbalance(endProb, matchProb, delProbs, insProbs):
+    insOpenProb, insGrowProb, ins1, ins2 = insProbs
+    delOpenProb, delGrowProb, del1, del2 = delProbs
+    iNum = insOpenProb * (ins1 * endProb ** 2 + ins2 * (1 - ins1) * endProb
+                          + (1 - ins1) * (1 - ins2) * (1 - insGrowProb))
+    iDen = endProb ** 3 - (1 - ins1) * (1 - ins2) * insGrowProb
+    dNum = delOpenProb * (del1 / endProb ** 2 + del2 * (1 - del1) / endProb
+                          + (1 - del1) * (1 - del2) * (1 - delGrowProb))
+    dDen = endProb ** 3 - (1 - del1) * (1 - del2) * delGrowProb
+    return 1 - matchProb / endProb ** 6 - iNum / iDen - dNum / dDen
+
+def balancedFrameshiftEndProb(*args):
+    matchProb, delProbs, insProbs = args
+    insOpenProb, insGrowProb, ins1, ins2 = insProbs
+    delOpenProb, delGrowProb, del1, del2 = delProbs
+    lowerBound = max((1 - ins1) * (1 - ins2) * insGrowProb,
+                     (1 - del1) * (1 - del2) * delGrowProb) ** (1/3)
+    upperBound = 1.0
+    return rootOfIncreasingFunction(frameshiftProbImbalance,
+                                    lowerBound, upperBound, args)
+
+def frameshiftProbsFromCounts(counts, opts):
+    (matches, deletes, inserts, delOpens0, insOpens0,
+     delOpens1, delOpens2, insOpens1, insOpens2, alignments) = counts
+    delOpens = delOpens0 + delOpens1 + delOpens2
+    insOpens = insOpens0 + insOpens1 + insOpens2
+    denominator = matches + insOpens + delOpens + (alignments + 1)
+    matchProb = matches / denominator
+
+    insOpenProb = insOpens / denominator
+    insGrowProb = (inserts - insOpens0) / inserts
+    insProb2 = insOpens2 / (inserts + insOpens2)
+    insProb1 = insOpens1 / (inserts + insOpens2 + insOpens1)
+
+    delOpenProb = delOpens / denominator
+    delGrowProb = (deletes - delOpens0) / deletes
+    delProb2 = delOpens2 / (deletes + delOpens2)
+    delProb1 = delOpens1 / (deletes + delOpens2 + delOpens1)
+
+    print("# aligned residue/codon pairs: %.12g" % matches)
+    print("# whole codon deletes: %.12g" % deletes)
+    print("# whole codon inserts: %.12g" % inserts)
+    print("# delOpens: %.12g" % delOpens)
+    print("# insOpens: %.12g" % insOpens)
+    print("# frameshifts del-1,del-2,ins+1,ins+2: %.12g,%.12g,%.12g,%.12g"
+          % (delOpens1, delOpens2, insOpens1, insOpens2))
+    print("# alignments:", alignments)
+    print("# matchProb: %g" % matchProb)
+    print("# delOpenProb: %g" % delOpenProb)
+    print("# insOpenProb: %g" % insOpenProb)
+    print("# delExtendProb: %g" % delGrowProb)
+    print("# insExtendProb: %g" % insGrowProb)
+    print("# frameshiftProbs del-1,del-2,ins+1,ins+2: %g,%g,%g,%g"
+          % (delProb1, delProb2, insProb1, insProb2))
+    print()
+    delProbs = delOpenProb, delGrowProb, delProb1, delProb2
+    insProbs = insOpenProb, insGrowProb, insProb1, insProb2
+    return matchProb, delProbs, insProbs
+
+def frameshiftRatiosFromProbs(matchProb, delProbs, insProbs):
+    delOpenProb, delGrowProb, del1, del2 = delProbs
+    insOpenProb, insGrowProb, ins1, ins2 = insProbs
+
+    endProb = balancedFrameshiftEndProb(matchProb, delProbs, insProbs)
+
+    matchRatio = matchProb / endProb ** 6
+
+    insAdj = (1 - insGrowProb) / insGrowProb
+    insOpenRatio = insOpenProb * insAdj
+    insMean = ((1 - ins1) * (1 - ins2) * insGrowProb) ** (1/3)
+    insRatio0 = insMean / endProb
+    insRatio1 = ins1 / (insAdj * insMean)
+    insRatio2 = ins2 * (1 - ins1) / (insAdj * insMean ** 2)
+    insRatios = insOpenRatio, insRatio0, insRatio1, insRatio2
+
+    delAdj = (1 - delGrowProb) / delGrowProb
+    delOpenRatio = delOpenProb * delAdj
+    delMean = ((1 - del1) * (1 - del2) * delGrowProb) ** (1/3)
+    delRatio0 = delMean / endProb
+    delRatio1 = del1 / (delAdj * delMean * endProb ** 4)
+    delRatio2 = del2 * (1 - del1) / (delAdj * (delMean * endProb) ** 2)
+    delRatios = delOpenRatio, delRatio0, delRatio1, delRatio2
+
+    return matchRatio, delRatios, insRatios
+
+def frameshiftCostsFromProbRatios(scale, gapRatios):
+    gapOpenRatio, gapRatio0, gapRatio1, gapRatio2 = gapRatios
+    gapOpenCost = costFromProb(scale, gapOpenRatio)
+    gapCost0 = max(costFromProb(scale, gapRatio0), 1)
+    gapCost1 = costFromProb(scale, gapRatio1)
+    gapCost2 = costFromProb(scale, gapRatio2)
+    return gapOpenCost, gapCost0, gapCost1, gapCost2
+
+def frameshiftImbalanceFromGap(scale, gapCosts):
+    gapOpenCost, gapCost0, gapCost1, gapCost2 = gapCosts
+    a = math.exp(-gapOpenCost / scale)
+    b = math.exp(-gapCost0 / scale)
+    f = math.exp(-gapCost1 / scale)
+    g = math.exp(-gapCost2 / scale)
+    return a * b * (f + g * b + b * b) / (1 - b ** 3)
+
+def frameshiftScoreImbalance(scale, matScores, rowProbs, colProbs,
+                             delCosts, insCosts):
+    d = frameshiftImbalanceFromGap(scale, delCosts)
+    i = frameshiftImbalanceFromGap(scale, insCosts)
+    m = sum(x * y * math.exp(matScores[i][j] / scale)
+            for i, x in enumerate(rowProbs) for j, y in enumerate(colProbs))
+    return m + d + i - 1
+
+def normalizedFreqs(freqs):
+    x = list(map(float, freqs))
+    s = sum(x)
+    return [i / s for i in x]
+
+def codonScoresAndScale(originalScale, matParams, delRatios, insRatios):
+    matchRatio, matProbs, rowFreqs, colFreqs = matParams
+    rowProbs = normalizedFreqs(rowFreqs)
+    colProbs = normalizedFreqs(colFreqs)
+    matParams = matchRatio, matProbs, rowProbs, colProbs
+    matScores = matScoresFromProbs(originalScale, *matParams)
+    delCosts = frameshiftCostsFromProbRatios(originalScale, delRatios)
+    insCosts = frameshiftCostsFromProbRatios(originalScale, insRatios)
+    args = matScores, rowProbs, colProbs, delCosts, insCosts
+    scale = balancedScale(frameshiftScoreImbalance, originalScale, args)
+    assert scale > 0  # XXX
+    matScores = matScores, rowFreqs, colFreqs
+    return matScores, delCosts, insCosts, scale
+
+def isCloseEnough(oldParameters, newParameters):
+    delCosts0, insCosts0, substitutionParameters0 = oldParameters
+    m0, rowFreqs0, colFreqs0 = substitutionParameters0
+
+    delCosts1, insCosts1, substitutionParameters1 = newParameters
+    m1, rowFreqs1, colFreqs1 = substitutionParameters1
+
+    return (delCosts0 == delCosts1 and insCosts0 == insCosts1 and
+            all(abs(i - j) < 2 for x, y in zip(m0, m1) for i, j in zip(x, y)))
+
+### End of routines for codons & frameshifts
+
+def writeGapCosts(opts, delCosts, insCosts, isLastFormat, outFile):
+    if opts.codon:
+        delOpen, delGrow, del1, del2 = delCosts
+        insOpen, insGrow, ins1, ins2 = insCosts
+        frameshiftCosts = del1, del2, ins1, ins2
+        frameshiftCosts = ",".join(map(str, frameshiftCosts))
+    else:
+        delInit, delGrow = delCosts
+        insInit, insGrow = insCosts
+        delOpen = delInit - delGrow
+        insOpen = insInit - insGrow
     if isLastFormat:
         print("#last -a", delOpen, file=outFile)
         print("#last -A", insOpen, file=outFile)
         print("#last -b", delGrow, file=outFile)
         print("#last -B", insGrow, file=outFile)
+        if opts.codon:
+            print("#last -F", frameshiftCosts, file=outFile)
     else:
         print("# delExistCost:", delOpen, file=outFile)
         print("# insExistCost:", insOpen, file=outFile)
         print("# delExtendCost:", delGrow, file=outFile)
         print("# insExtendCost:", insGrow, file=outFile)
+        if opts.codon:
+            print("# frameshiftCosts del-1,del-2,ins+1,ins+2:",
+                  frameshiftCosts, file=outFile)
+
+def probsFromFile(opts, lastalArgs, lines):
+    print("#", *lastalArgs)
+    print()
+    sys.stdout.flush()
+    matCounts, gapCounts = countsFromLastOutput(lines, opts)
+    if opts.codon:
+        gapProbs = frameshiftProbsFromCounts(gapCounts, opts)
+        matProbs = codonMatProbsFromCounts(matCounts, opts)
+    else:
+        gapProbs = gapProbsFromCounts(gapCounts, opts)
+        matProbs = matProbsFromCounts(matCounts, opts)
+    return matProbs, gapProbs
 
 def tryToMakeChildProgramsFindable():
     d = os.path.dirname(__file__)
@@ -458,6 +707,8 @@ def fixedLastalArgs(opts, lastalProgName):
     if opts.X: x.append("-X" + opts.X)
     if opts.Q: x.append("-Q" + opts.Q)
     if opts.verbose: x.append("-" + "v" * opts.verbose)
+    if opts.codon:
+        x.append("-K1")
     return x
 
 def process(args, inStream):
@@ -480,78 +731,97 @@ def lastSplitProcess(opts, proc):
 def doTraining(opts, args):
     tryToMakeChildProgramsFindable()
     lastalProgName = readLastalProgName(args[0])
-    scaleIncrease = 20  # while training, up-scale the scores by this amount
     lastalVersion = versionFromLastal()
+    print("# lastal version:", lastalVersion)
+
+    if opts.codon:
+        scaleIncrease = 1
+        gapRatiosFunc = frameshiftRatiosFromProbs
+        scoresAndScaleFunc = codonScoresAndScale
+        writeScoreMatrixFunc = writeCodonScoreMatrix
+    else:
+        scaleIncrease = 20  # while training, upscale the scores by this amount
+        gapRatiosFunc = gapRatiosFromProbs
+        scoresAndScaleFunc = scoresAndScale
+        writeScoreMatrixFunc = writeScoreMatrix
 
-    lastalArgs = fixedLastalArgs(opts, lastalProgName)
-    if opts.r: lastalArgs.append("-r" + opts.r)
-    if opts.q: lastalArgs.append("-q" + opts.q)
-    if opts.p: lastalArgs.append("-p" + opts.p)
-    if opts.a: lastalArgs.append("-a" + opts.a)
-    if opts.b: lastalArgs.append("-b" + opts.b)
-    if opts.A: lastalArgs.append("-A" + opts.A)
-    if opts.B: lastalArgs.append("-B" + opts.B)
-    lastalArgs += args
-    proc = process(lastalArgs, None)
-    proc = lastSplitProcess(opts, proc)
+        print("# maximum percent identity:", opts.pid)
+        lastalArgs = fixedLastalArgs(opts, lastalProgName)
+        if opts.r: lastalArgs.append("-r" + opts.r)
+        if opts.q: lastalArgs.append("-q" + opts.q)
+        if opts.p: lastalArgs.append("-p" + opts.p)
+        if opts.a: lastalArgs.append("-a" + opts.a)
+        if opts.b: lastalArgs.append("-b" + opts.b)
+        if opts.A: lastalArgs.append("-A" + opts.A)
+        if opts.B: lastalArgs.append("-B" + opts.B)
+        lastalArgs += args
+        proc = process(lastalArgs, None)
+        proc = lastSplitProcess(opts, proc)
+        if not opts.scale:
+            externalScale = scaleFromHeader(proc.stdout)
 
     if opts.scale:
         externalScale = opts.scale / math.log(2)
-    else:
-        externalScale = scaleFromHeader(proc.stdout)
 
     internalScale = externalScale * scaleIncrease
     oldParameters = []
 
-    print("# lastal version:", lastalVersion)
-    print("# maximum percent identity:", opts.pid)
     print("# scale of score parameters:", externalScale)
     print("# scale used while training:", internalScale)
     print()
 
+    if opts.codon:
+        matProbs, gapProbs = initialCodonProbs(opts)
+    else:
+        matProbs, gapProbs = probsFromFile(opts, lastalArgs, proc.stdout)
+
     while True:
-        print("#", *lastalArgs)
-        print()
-        sys.stdout.flush()
-        matCounts, gapCounts = countsFromLastOutput(proc.stdout, opts)
-        gapProbs = gapProbsFromCounts(gapCounts, opts)
-        matProbs = matProbsFromCounts(matCounts, opts)
-        matchRatio, delRatios, insRatios = gapRatiosFromProbs(*gapProbs)
+        matchRatio, delRatios, insRatios = gapRatiosFunc(*gapProbs)
         rowProbs = [sum(i) for i in matProbs]
         colProbs = [sum(i) for i in zip(*matProbs)]
+        if opts.codon:
+            rowProbs = [freqText(i) for i in rowProbs]
+            colProbs = [freqText(i) for i in colProbs]
         matParams = matchRatio, matProbs, rowProbs, colProbs
-        sas = scoresAndScale(internalScale, matParams, delRatios, insRatios)
-        matScores, delCosts, insCosts, scale = sas
-        writeGapCosts(delCosts, insCosts, False, None)
+        ss = scoresAndScaleFunc(internalScale, matParams, delRatios, insRatios)
+        matScores, delCosts, insCosts, scale = ss
+        writeGapCosts(opts, delCosts, insCosts, False, None)
         print()
         print("# score matrix "
               "(query letters = columns, reference letters = rows):")
-        writeScoreMatrix(sys.stdout, matScores, "# ")
+        writeScoreMatrixFunc(sys.stdout, matScores, "# ")
         print()
         parameters = delCosts, insCosts, matScores
-        if parameters in oldParameters: break
+        if opts.codon:
+            if any(isCloseEnough(i, parameters) for i in oldParameters):
+                break
+        else:
+            if parameters in oldParameters:
+                break
         oldParameters.append(parameters)
         lastalArgs = fixedLastalArgs(opts, lastalProgName)
         lastalArgs.append("-t{0:.6}".format(scale))
         lastalArgs.append("-p-")
         lastalArgs += args
         proc = process(lastalArgs, subprocess.PIPE)
-        writeGapCosts(delCosts, insCosts, True, proc.stdin)
-        writeScoreMatrix(proc.stdin, matScores, "")
+        writeGapCosts(opts, delCosts, insCosts, True, proc.stdin)
+        writeScoreMatrixFunc(proc.stdin, matScores, "")
         proc.stdin.close()
-        proc = lastSplitProcess(opts, proc)
+        if not opts.codon:
+            proc = lastSplitProcess(opts, proc)
+        matProbs, gapProbs = probsFromFile(opts, lastalArgs, proc.stdout)
 
-    sas = scoresAndScale(externalScale, matParams, delRatios, insRatios)
-    matScores, delCosts, insCosts, scale = sas
+    ss = scoresAndScaleFunc(externalScale, matParams, delRatios, insRatios)
+    matScores, delCosts, insCosts, scale = ss
     if opts.X: print("#last -X", opts.X)
     if opts.Q: print("#last -Q", opts.Q)
     print("#last -t{0:.6}".format(scale))
-    writeGapCosts(delCosts, insCosts, True, None)
+    writeGapCosts(opts, delCosts, insCosts, True, None)
     if opts.s: print("#last -s", opts.s)
     if opts.S: print("#last -S", opts.S)
     print("# score matrix "
           "(query letters = columns, reference letters = rows):")
-    writeScoreMatrix(sys.stdout, matScores, "")
+    writeScoreMatrixFunc(sys.stdout, matScores, "")
 
 def lastTrain(opts, args):
     if opts.sample_number:
@@ -586,12 +856,15 @@ if __name__ == "__main__":
                   "skip alignments with > PID% identity (default: %default)")
     og.add_option("--postmask", type="int", metavar="NUMBER", default=1, help=
                   "skip mostly-lowercase alignments (default=%default)")
-    og.add_option("--sample-number", type="int", default=500, metavar="N",
-                  help="number of random sequence samples (default: %default)")
+    og.add_option("--sample-number", type="int", metavar="N",
+                  help="number of random sequence samples "
+                  "(default: 20000 if --codon else 500)")
     og.add_option("--sample-length", type="int", default=2000, metavar="L",
                   help="length of each sample (default: %default)")
     og.add_option("--scale", type="float", metavar="S",
                   help="output scores in units of 1/S bits")
+    og.add_option("--codon", action="store_true",
+                  help="DNA queries & protein reference, with frameshifts")
     op.add_option_group(og)
 
     og = optparse.OptionGroup(op, "Initial parameter options")
@@ -606,6 +879,8 @@ if __name__ == "__main__":
                   help="gap extension cost (default: 9 if Q>=1, else 3)")
     og.add_option("-A", metavar="COST", help="insertion existence cost")
     og.add_option("-B", metavar="COST", help="insertion extension cost")
+    og.add_option("-F", metavar="LIST", help="frameshift probabilities: "
+                  "del-1,del-2,ins+1,ins+2 (default: 1-b,1-b,1-B,1-B)")
     op.add_option_group(og)
 
     og = optparse.OptionGroup(op, "Alignment options")
@@ -638,8 +913,18 @@ if __name__ == "__main__":
     (opts, args) = op.parse_args()
     if len(args) < 1:
         op.error("I need a lastdb index and query sequences")
+    if opts.sample_number is None:
+        opts.sample_number = 20000 if opts.codon else 500
     if not opts.sample_number and (len(args) < 2 or "-" in args):
         op.error("sorry, can't use stdin when --sample-number=0")
+    if opts.codon:
+        if not opts.scale: opts.scale = 3
+        if not opts.r: opts.r = "0.4"
+        if not opts.a: opts.a = "0.02"
+        if not opts.b: opts.b = "0.5"
+        if not opts.A: opts.A = opts.a
+        if not opts.B: opts.B = opts.b
+        opts.S = None
     if not opts.p and (not opts.Q or opts.Q in ("0", "fastx", "keep")):
         if not opts.r: opts.r = "5"
         if not opts.q: opts.q = "5"


=====================================
src/Alignment.cc
=====================================
@@ -50,7 +50,7 @@ void Alignment::makeXdrop( Aligners &aligners, bool isGreedy,
                            const uchar* qual1, const uchar* qual2,
 			   const Alphabet& alph, AlignmentExtras& extras,
 			   double gamma, int outputType ){
-  score = seed.score;
+  if (probMatrix) score = seed.score;  // else keep the old score
   if (outputType > 3 && !gap.isNewFrameshifts()) extras.fullScore = seed.score;
 
   if( outputType == 7 ){
@@ -62,6 +62,7 @@ void Alignment::makeXdrop( Aligners &aligners, bool isGreedy,
   }
 
   // extend a gapped alignment in the left/reverse direction from the seed:
+  blocks.clear();
   std::vector<char>& columnAmbiguityCodes = extras.columnAmbiguityCodes;
   extend( blocks, columnAmbiguityCodes, aligners, isGreedy,
 	  seq1, seq2, seed.beg1(), seed.beg2(), false, globality,
@@ -310,23 +311,25 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
       while (aligner.getNextChunkFrame(end1, end2, size, gapCost, gap))
 	chunks.push_back(SegmentPair(end1 - size, end2 - size * 3, size,
 				     gapCost));
-      FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
-      double probDropLimit = exp(scale * -maxDrop);
-      double s = fxa.forward(seq1 + start1, seq2 + start2,
-			     seq2 + dnaToAa(frame1, frameSize),
-			     seq2 + dnaToAa(frame2, frameSize),
-			     isForward, probMat, gap, probDropLimit);
-      score += s / scale;
-      if (outputType > 3) {
-	fxa.backward(isForward, probMat, gap);
-	getColumnCodes(fxa, columnCodes, chunks);
-	if (outputType == 7) {
-	  double *ec = &extras.expectedCounts[0];
-	  double *subsCounts[scoreMatrixRowSize];
-	  for (int i = 0; i < scoreMatrixRowSize; ++i)
-	    subsCounts[i] = ec + i * scoreMatrixRowSize;
-	  double *tranCounts = ec + scoreMatrixRowSize * scoreMatrixRowSize;
-	  fxa.count(isForward, gap, subsCounts, tranCounts);
+      if (probMat) {
+	FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
+	double probDropLimit = exp(scale * -maxDrop);
+	double s = fxa.forward(seq1 + start1, seq2 + start2,
+			       seq2 + dnaToAa(frame1, frameSize),
+			       seq2 + dnaToAa(frame2, frameSize),
+			       isForward, probMat, gap, probDropLimit);
+	score += s / scale;
+	if (outputType > 3) {
+	  fxa.backward(isForward, probMat, gap);
+	  getColumnCodes(fxa, columnCodes, chunks);
+	  if (outputType == 7) {
+	    double *ec = &extras.expectedCounts[0];
+	    double *subsCounts[scoreMatrixRowSize];
+	    for (int i = 0; i < scoreMatrixRowSize; ++i)
+	      subsCounts[i] = ec + i * scoreMatrixRowSize;
+	    double *tranCounts = ec + scoreMatrixRowSize * scoreMatrixRowSize;
+	    fxa.count(isForward, gap, subsCounts, tranCounts);
+	  }
 	}
       }
     } else {


=====================================
src/LastalArguments.cc
=====================================
@@ -146,7 +146,7 @@ Score options (default settings):\n\
 -A: insertion existence cost (a)\n\
 -B: insertion extension cost (b)\n\
 -c: unaligned residue pair cost (off)\n\
--F: frameshift cost (off)\n\
+-F: frameshift cost(s) (off)\n\
 -x: maximum score drop for preliminary gapped alignments (z)\n\
 -y: maximum score drop for gapless alignments (min[t*10, x])\n\
 -z: maximum score drop for final gapped alignments (e-1)\n\


=====================================
src/lastal.cc
=====================================
@@ -458,7 +458,7 @@ static const ScoreMatrixRow *getQueryPssm(const LastAligner &aligner,
   return reinterpret_cast<const ScoreMatrixRow *>(&qualityPssm[0]);
 }
 
-namespace Phase{ enum Enum{ gapless, gapped, final }; }
+namespace Phase{ enum Enum{ gapless, pregapped, gapped, postgapped }; }
 
 static bool isFullScoreThreshold() {
   return gapCosts.isNewFrameshifts();
@@ -466,7 +466,7 @@ static bool isFullScoreThreshold() {
 
 static bool isMaskLowercase(Phase::Enum e) {
   return (e < 1 && args.maskLowercase > 0)
-    || (e < 2 && args.maskLowercase > 1 && isFullScoreThreshold())
+    || (e < 3 && args.maskLowercase > 1 && isFullScoreThreshold())
     || args.maskLowercase > 2;
 }
 
@@ -493,7 +493,7 @@ struct Dispatcher{
       r( isMaskLowercase(e) ? matrices.ratiosMasked : matrices.ratios ),
       t( isMaskLowercase(e) ? matrices.twoQualMasked : matrices.twoQual ),
       d( (e == Phase::gapless) ? args.maxDropGapless :
-         (e == Phase::gapped ) ? args.maxDropGapped : args.maxDropFinal ),
+         (e == Phase::pregapped ) ? args.maxDropGapped : args.maxDropFinal ),
       z( t ? 2 : p ? 1 : 0 ){}
 
   int gaplessOverlap( indexT x, indexT y, size_t &rev, size_t &fwd ) const{
@@ -637,10 +637,9 @@ void alignGapless1(LastAligner &aligner, SegmentPairPot &gaplessAlns,
 }
 
 // Find query matches to the suffix array, and do gapless extensions
-void alignGapless( LastAligner& aligner, SegmentPairPot& gaplessAlns,
-		   size_t queryNum, const SubstitutionMatrices &matrices,
-		   const uchar* querySeq ){
-  Dispatcher dis(Phase::gapless, aligner, queryNum, matrices, querySeq);
+void alignGapless(LastAligner &aligner, SegmentPairPot &gaplessAlns,
+		  size_t queryNum, const uchar *querySeq,
+		  const Dispatcher &dis) {
   DiagonalTable dt;  // record already-covered positions on each diagonal
   size_t maxAlignments =
     args.maxAlignmentsPerQueryStrand ? args.maxAlignmentsPerQueryStrand : 1;
@@ -713,9 +712,8 @@ void shrinkToLongestIdenticalRun( SegmentPair& sp, const Dispatcher& dis ){
 void alignGapped( LastAligner& aligner,
 		  AlignmentPot& gappedAlns, SegmentPairPot& gaplessAlns,
                   size_t queryNum, const SubstitutionMatrices &matrices,
-		  const uchar* querySeq, Phase::Enum phase ){
+		  const uchar* querySeq, size_t frameSize, Phase::Enum phase ){
   Dispatcher dis(phase, aligner, queryNum, matrices, querySeq);
-  size_t frameSize = args.isFrameshift() ? (query.padLen(queryNum) / 3) : 0;
   countT gappedExtensionCount = 0, gappedAlignmentCount = 0;
 
   // Redo the gapless extensions, using gapped score parameters.
@@ -755,11 +753,10 @@ void alignGapped( LastAligner& aligner,
     shrinkToLongestIdenticalRun( aln.seed, dis );
 
     // do gapped extension from each end of the seed:
-    aln.makeXdrop(aligner.engines, args.isGreedy,
-		  dis.a, dis.b, args.globality, dis.m,
-		  scoreMatrix.maxScore, scoreMatrix.minScore,
-		  dis.r, matrices.stats.lambda(), gapCosts,
-		  dis.d, frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras);
+    aln.makeXdrop(aligner.engines, args.isGreedy, dis.a, dis.b, args.globality,
+		  dis.m, scoreMatrix.maxScore, scoreMatrix.minScore,
+		  dis.r, matrices.stats.lambda(), gapCosts, dis.d,
+		  frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras);
     ++gappedExtensionCount;
 
     if( aln.score < args.minScoreGapped ) continue;
@@ -775,7 +772,7 @@ void alignGapped( LastAligner& aligner,
     gaplessAlns.markAllOverlaps( aln.blocks );
     gaplessAlns.markTandemRepeats( aln.seed, args.maxRepeatDistance );
 
-    if( phase == Phase::final ) gappedAlns.add(aln);
+    if (phase == Phase::gapped) gappedAlns.add(aln);
     else SegmentPairPot::markAsGood(sp);
 
     ++gappedAlignmentCount;
@@ -786,15 +783,26 @@ void alignGapped( LastAligner& aligner,
   LOG2( "gapped alignments=" << gappedAlignmentCount );
 }
 
+// Redo gapped extensions, but keep the old alignment scores
+static void alignPostgapped(LastAligner &aligner, AlignmentPot &gappedAlns,
+			    size_t frameSize, const Dispatcher &dis) {
+  AlignmentExtras extras;  // not used
+  for (size_t i = 0; i < gappedAlns.size(); ++i) {
+    Alignment &aln = gappedAlns.items[i];
+    aln.makeXdrop(aligner.engines, args.isGreedy, dis.a, dis.b, args.globality,
+		  dis.m, scoreMatrix.maxScore, scoreMatrix.minScore,
+		  0, 0, gapCosts, dis.d,
+		  frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras);
+  }
+}
+
 // Print the gapped alignments, after optionally calculating match
 // probabilities and re-aligning using the gamma-centroid algorithm
 void alignFinish( LastAligner& aligner, const AlignmentPot& gappedAlns,
 		  size_t queryNum, const SubstitutionMatrices &matrices,
-		  const uchar* querySeq ){
+		  size_t frameSize, const Dispatcher &dis ){
   Centroid& centroid = aligner.engines.centroid;
-  Dispatcher dis(Phase::final, aligner, queryNum, matrices, querySeq);
   size_t queryLen = query.padLen(queryNum);
-  size_t frameSize = args.isFrameshift() ? (queryLen / 3) : 0;
 
   if( args.outputType > 3 ){
     if( dis.p ){
@@ -819,28 +827,26 @@ void alignFinish( LastAligner& aligner, const AlignmentPot& gappedAlns,
     AlignmentExtras extras;
     if (isFullScoreThreshold()) extras.fullScore = -1;  // score is fullScore
     if( args.outputType < 4 ){
-      writeAlignment(aligner, aln, queryNum, querySeq, extras);
+      writeAlignment(aligner, aln, queryNum, dis.b, extras);
     } else {  // calculate match probabilities:
       Alignment probAln;
       probAln.seed = aln.seed;
-      probAln.makeXdrop(aligner.engines, args.isGreedy,
-			dis.a, dis.b, args.globality, dis.m,
-			scoreMatrix.maxScore, scoreMatrix.minScore,
-			dis.r, matrices.stats.lambda(), gapCosts,
-			dis.d, frameSize, dis.p, dis.t, dis.i, dis.j, alph,
-			extras, args.gamma, args.outputType);
+      probAln.makeXdrop(aligner.engines, args.isGreedy, dis.a, dis.b,
+			args.globality,
+			dis.m, scoreMatrix.maxScore, scoreMatrix.minScore,
+			dis.r, matrices.stats.lambda(), gapCosts, dis.d,
+			frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras,
+			args.gamma, args.outputType);
       assert(aln.score != -INF);
-      writeAlignment(aligner, probAln, queryNum, querySeq, extras);
+      if (args.maskLowercase == 2 && isFullScoreThreshold())
+	probAln.score = aln.score;
+      writeAlignment(aligner, probAln, queryNum, dis.b, extras);
     }
   }
 }
 
-static void eraseWeakAlignments(LastAligner &aligner, AlignmentPot &gappedAlns,
-				size_t queryNum,
-				const SubstitutionMatrices &matrices,
-				const uchar *querySeq) {
-  size_t frameSize = args.isFrameshift() ? (query.padLen(queryNum) / 3) : 0;
-  Dispatcher dis(Phase::gapless, aligner, queryNum, matrices, querySeq);
+static void eraseWeakAlignments(AlignmentPot &gappedAlns,
+				size_t frameSize, const Dispatcher &dis) {
   for (size_t i = 0; i < gappedAlns.size(); ++i) {
     Alignment &a = gappedAlns.items[i];
     if (!a.hasGoodSegment(dis.a, dis.b, ceil(args.minScoreGapped), dis.m,
@@ -957,32 +963,38 @@ void scan(LastAligner& aligner, size_t queryNum,
   const int maskMode = args.maskLowercase;
   makeQualityPssm(aligner, queryNum, matrices, querySeq, maskMode > 0);
 
+  Dispatcher dis0(Phase::gapless, aligner, queryNum, matrices, querySeq);
   SegmentPairPot gaplessAlns;
-  alignGapless(aligner, gaplessAlns, queryNum, matrices, querySeq);
+  alignGapless(aligner, gaplessAlns, queryNum, querySeq, dis0);
   if( args.outputType == 1 ) return;  // we just want gapless alignments
   if( gaplessAlns.size() == 0 ) return;
 
   if (maskMode == 1 || (maskMode == 2 && !isFullScoreThreshold()))
     unmaskLowercase(aligner, queryNum, matrices, querySeq);
 
+  size_t frameSize = args.isFrameshift() ? (query.padLen(queryNum) / 3) : 0;
   AlignmentPot gappedAlns;
 
-  if (args.maxDropFinal != args.maxDropGapped
-      || (maskMode == 2 && isFullScoreThreshold())) {
-    alignGapped( aligner, gappedAlns, gaplessAlns,
-		 queryNum, matrices, querySeq, Phase::gapped );
+  if (args.maxDropFinal != args.maxDropGapped) {
+    alignGapped(aligner, gappedAlns, gaplessAlns,
+		queryNum, matrices, querySeq, frameSize, Phase::pregapped);
     erase_if( gaplessAlns.items, SegmentPairPot::isNotMarkedAsGood );
-    if (maskMode == 2 && isFullScoreThreshold())
-      unmaskLowercase(aligner, queryNum, matrices, querySeq);
   }
 
-  alignGapped( aligner, gappedAlns, gaplessAlns,
-	       queryNum, matrices, querySeq, Phase::final );
+  alignGapped(aligner, gappedAlns, gaplessAlns,
+	      queryNum, matrices, querySeq, frameSize, Phase::gapped);
   if( gappedAlns.size() == 0 ) return;
 
+  Dispatcher dis3(Phase::postgapped, aligner, queryNum, matrices, querySeq);
+
+  if (maskMode == 2 && isFullScoreThreshold()) {
+    unmaskLowercase(aligner, queryNum, matrices, querySeq);
+    alignPostgapped(aligner, gappedAlns, frameSize, dis3);
+  }
+
   if (maskMode == 2 && !isFullScoreThreshold()) {
     remaskLowercase(aligner, queryNum, matrices, querySeq);
-    eraseWeakAlignments(aligner, gappedAlns, queryNum, matrices, querySeq);
+    eraseWeakAlignments(gappedAlns, frameSize, dis0);
     LOG2("lowercase-filtered alignments=" << gappedAlns.size());
     if (gappedAlns.size() == 0) return;
     if (args.outputType > 3)
@@ -995,7 +1007,7 @@ void scan(LastAligner& aligner, size_t queryNum,
   }
 
   if( !isCollatedAlignments() ) gappedAlns.sort();  // sort by score
-  alignFinish(aligner, gappedAlns, queryNum, matrices, querySeq);
+  alignFinish(aligner, gappedAlns, queryNum, matrices, frameSize, dis3);
 }
 
 static void tantanMaskOneQuery(size_t queryNum, uchar *querySeq) {


=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"1170"
+"1178"



View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/d1b788bbe87c1e5812ec649f0952e53a399c2fc9

-- 
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/d1b788bbe87c1e5812ec649f0952e53a399c2fc9
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210128/5b9b5f1f/attachment-0001.html>


More information about the debian-med-commit mailing list