[med-svn] [Git][med-team/last-align][upstream] New upstream version 1178
Nilesh Patra
gitlab at salsa.debian.org
Thu Jan 28 11:59:23 GMT 2021
Nilesh Patra pushed to branch upstream at Debian Med / last-align
Commits:
d1b788bb by Nilesh Patra at 2021-01-28T17:19:22+05:30
New upstream version 1178
- - - - -
25 changed files:
- ChangeLog.txt
- doc/Makefile
- doc/last-evalues.html
- doc/last-evalues.txt
- doc/last-papers.html
- doc/last-papers.txt
- doc/last-postmask.html
- doc/last-postmask.txt
- doc/last-split.html
- doc/last-split.txt
- doc/last-train.html
- doc/last-train.txt
- doc/last-tuning.html
- doc/last-tuning.txt
- doc/last-tutorial.html
- doc/last-tutorial.txt
- doc/last.html
- doc/last.txt
- doc/lastal.html
- doc/lastal.txt
- scripts/last-train
- src/Alignment.cc
- src/LastalArguments.cc
- src/lastal.cc
- src/version.hh
Changes:
=====================================
ChangeLog.txt
=====================================
@@ -1,9 +1,54 @@
+2021-01-27 Martin C. Frith <Martin C. Frith>
+
+ * doc/Makefile, doc/last-evalues.txt, doc/last-papers.txt, doc/last-
+ postmask.txt, doc/last-split.txt, doc/last-train.txt, doc/last-
+ tuning.txt, doc/last-tutorial.txt, doc/last.txt, doc/lastal.txt:
+ Big doc update
+ [ba6a55ed7287] [tip]
+
+2021-01-25 Martin C. Frith <Martin C. Frith>
+
+ * doc/lastal.txt, scripts/last-train, src/LastalArguments.cc:
+ train --codon: more data & better pseudocounts
+ [9a6365d5825a]
+
+2021-01-14 Martin C. Frith <Martin C. Frith>
+
+ * src/Alignment.cc, src/lastal.cc, test/last-test.out, test/last-
+ test.sh:
+ Better repeat-masking for new-style frameshifts
+ [e8001aa6f633]
+
+2021-01-13 Martin C. Frith <Martin C. Frith>
+
+ * src/lastal.cc:
+ Refactor
+ [3713015a6ad3]
+
+ * src/lastal.cc:
+ Refactor
+ [69de4393c89c]
+
+ * doc/last-papers.txt, src/lastal.cc:
+ Refactor
+ [2ad10ab5b4b1]
+
+2021-01-05 Martin C. Frith <Martin C. Frith>
+
+ * doc/last-train.txt, scripts/last-train:
+ last-train: add codon & frameshift options!
+ [977a35de8389]
+
+ * scripts/last-train:
+ Refactor
+ [663475110c85]
+
2020-12-28 Martin C. Frith <Martin C. Frith>
* src/LastEvaluer.cc, src/mcf_frameshift_xdrop_aligner.cc,
src/mcf_frameshift_xdrop_aligner.hh, test/last-test.out:
Fix E-values for new-style frameshifts
- [a55b81199602] [tip]
+ [a55b81199602]
2020-12-23 Martin C. Frith <Martin C. Frith>
=====================================
doc/Makefile
=====================================
@@ -13,6 +13,8 @@ ${DOCS}: last-doc.css Makefile
.SUFFIXES: .html .txt
+RST2HTML = rst2html # xxx .py ???
+
# Ugh! Is there a better way?
RST_CSS = `locate html4css1.css | tail -n1`
#RST_CSS = html4css1.css
@@ -21,7 +23,7 @@ RSTFLAGS = --initial-header-level=2 --no-compact-lists \
--no-compact-field-lists --option-limit=0 --no-doc-info
.txt.html:
- rst2html --stylesheet-path=${RST_CSS},last-doc.css ${RSTFLAGS} $< $@
+ ${RST2HTML} --stylesheet-path=${RST_CSS},last-doc.css ${RSTFLAGS} $< $@
last-matrices.txt: ../data/*.mat
../build/mat-doc.sh ../data/*.mat > $@
=====================================
doc/last-evalues.html
=====================================
@@ -367,9 +367,6 @@ option -v (verbose).</p>
<p>If your sequences have very different letter frequencies (e.g. very
AT-rich DNA), the E-values will be misleading. The best solution is
to use a suitable score matrix, such as AT77 or ATMAP.</p>
-<p>For DNA-versus-protein alignment, however, only the protein
-frequencies are determined by the score matrix. The DNA frequencies
-are fixed at 0.25.</p>
</div>
<div class="section" id="calculation-of-eg2">
<h3>Calculation of EG2</h3>
@@ -426,12 +423,12 @@ or gap costs (e.g. match score = 99, mismatch cost = 1). In such
cases, lastal requires you to set a score threshold with option -e.</p>
</li>
<li><p class="first">There may be a long startup time to calculate the E-value parameters
-(lambda, K, and finite-size correction parameters), especially for
-alignment with frameshifts. This is avoided, for particular scoring
-schemes, by storing the parameters in LAST's source code.
-Parameters for other scoring schemes can be added on request. (Or
-you can do it yourself: run lastal with option -v, so it prints the
-E-value parameters, then copy them into LastEvaluer.cc.)</p>
+(lambda, K, and finite-size correction parameters). This is
+avoided, for particular scoring schemes, by storing the parameters
+in LAST's source code. Parameters for other scoring schemes can be
+added on request. (Or you can do it yourself: run lastal with
+option -v, so it prints the E-value parameters, then copy them into
+LastEvaluer.cc.)</p>
</li>
<li><p class="first">The E-values do not account for lastal option -c. If you use this
option, the E-values will be under-estimates.</p>
=====================================
doc/last-evalues.txt
=====================================
@@ -58,10 +58,6 @@ If your sequences have very different letter frequencies (e.g. very
AT-rich DNA), the E-values will be misleading. The best solution is
to use a suitable score matrix, such as AT77 or ATMAP.
-For DNA-versus-protein alignment, however, only the protein
-frequencies are determined by the score matrix. The DNA frequencies
-are fixed at 0.25.
-
Calculation of EG2
~~~~~~~~~~~~~~~~~~
@@ -115,12 +111,12 @@ Limitations
cases, lastal requires you to set a score threshold with option -e.
* There may be a long startup time to calculate the E-value parameters
- (lambda, K, and finite-size correction parameters), especially for
- alignment with frameshifts. This is avoided, for particular scoring
- schemes, by storing the parameters in LAST's source code.
- Parameters for other scoring schemes can be added on request. (Or
- you can do it yourself: run lastal with option -v, so it prints the
- E-value parameters, then copy them into LastEvaluer.cc.)
+ (lambda, K, and finite-size correction parameters). This is
+ avoided, for particular scoring schemes, by storing the parameters
+ in LAST's source code. Parameters for other scoring schemes can be
+ added on request. (Or you can do it yourself: run lastal with
+ option -v, so it prints the E-value parameters, then copy them into
+ LastEvaluer.cc.)
* The E-values do not account for lastal option -c. If you use this
option, the E-values will be under-estimates.
=====================================
doc/last-papers.html
=====================================
@@ -377,7 +377,8 @@ Res. 2014 42(7):e59.</p>
<li><p class="first"><a class="reference external" href="http://bioinformatics.oxfordjournals.org/content/30/24/3575.long">Frameshift alignment: statistics and post-genomic
applications</a>. Sheetlin SL, Park Y, Frith MC, Spouge JL.
Bioinformatics. 2014 30(24):3575-82.</p>
-<p>Describes DNA-versus-protein alignment allowing for frameshifts.</p>
+<p>Describes "old-style" DNA-versus-protein alignment allowing for
+frameshifts (without <tt class="docutils literal"><span class="pre">last-train</span></tt>).</p>
</li>
<li><p class="first"><a class="reference external" href="http://www.genomebiology.com/content/16/1/106">Split-alignment of genomes finds orthologies more accurately</a>.
Frith MC, Kawaguchi R. Genome Biology. 2015 16:106.</p>
@@ -387,13 +388,22 @@ whole genome alignment.</p>
<li><p class="first"><a class="reference external" href="https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw742">Training alignment parameters for arbitrary sequencers with
LAST-TRAIN</a>. Hamada M, Ono Y, Asai K Frith MC.
Bioinformatics. 2017 33(6):926-928.</p>
-<p>Describes last-train.</p>
+<p>Describes <tt class="docutils literal"><span class="pre">last-train</span></tt>.</p>
</li>
<li><p class="first"><a class="reference external" href="https://ieeexplore.ieee.org/document/8288582/">A Simplified Description of Child Tables for Sequence Similarity
Search</a>. Frith MC, Shrestha A. IEEE/ACM Trans Comput Biol
Bioinform. 2018.</p>
<p>Describes how LAST uses child tables.</p>
</li>
+<li><p class="first"><a class="reference external" href="https://doi.org/10.1093/bioinformatics/btaa1054">Minimally-overlapping words for sequence similarity search</a>.
+Frith MC, Noé L, Kucherov G. Bioinformatics. 2020</p>
+<p>Describes the <tt class="docutils literal">lastdb <span class="pre">-u</span> RY</tt> sparsity options.</p>
+</li>
+<li><p class="first"><a class="reference external" href="https://doi.org/10.1101/2021.01.25.428050">Improved DNA-versus-protein homology search for protein fossils</a>.
+Yao Y, Frith MC.</p>
+<p>Describes "new-style" DNA-versus-protein search with <tt class="docutils literal"><span class="pre">last-train</span>
+<span class="pre">--codon</span></tt>.</p>
+</li>
</ol>
<div class="section" id="external-methods">
<h2>External methods</h2>
=====================================
doc/last-papers.txt
=====================================
@@ -90,7 +90,8 @@ research to society.
__ http://bioinformatics.oxfordjournals.org/content/30/24/3575.long
- Describes DNA-versus-protein alignment allowing for frameshifts.
+ Describes "old-style" DNA-versus-protein alignment allowing for
+ frameshifts (without ``last-train``).
11. `Split-alignment of genomes finds orthologies more accurately`__.
Frith MC, Kawaguchi R. Genome Biology. 2015 16:106.
@@ -106,7 +107,7 @@ research to society.
__ https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw742
- Describes last-train.
+ Describes ``last-train``.
13. `A Simplified Description of Child Tables for Sequence Similarity
Search`__. Frith MC, Shrestha A. IEEE/ACM Trans Comput Biol
@@ -116,6 +117,21 @@ research to society.
Describes how LAST uses child tables.
+14. `Minimally-overlapping words for sequence similarity search`__.
+ Frith MC, Noé L, Kucherov G. Bioinformatics. 2020
+
+ __ https://doi.org/10.1093/bioinformatics/btaa1054
+
+ Describes the ``lastdb -u RY`` sparsity options.
+
+15. `Improved DNA-versus-protein homology search for protein fossils`__.
+ Yao Y, Frith MC.
+
+ __ https://doi.org/10.1101/2021.01.25.428050
+
+ Describes "new-style" DNA-versus-protein search with ``last-train
+ --codon``.
+
External methods
----------------
=====================================
doc/last-postmask.html
=====================================
@@ -319,9 +319,9 @@ table.field-list { border: thin solid green }
<h1 class="title">last-postmask</h1>
<p>This script does post-alignment masking. It reads pair-wise sequence
-alignments, and writes only those that align a significant amount of
-uppercase sequence. (Lowercase is often used to indicate undesirable
-repetitive regions.)</p>
+alignments, and only writes alignments that include a significant
+amount of uppercase-to-uppercase alignment. (Lowercase is often used
+to indicate undesirable "simple" sequence.)</p>
<p>The input should be in MAF format, with header lines (of the kind
produced by lastal) describing the alignment score parameters.</p>
<p>The script discards alignments that lack any segment with score >=
=====================================
doc/last-postmask.txt
=====================================
@@ -2,9 +2,9 @@ last-postmask
=============
This script does post-alignment masking. It reads pair-wise sequence
-alignments, and writes only those that align a significant amount of
-uppercase sequence. (Lowercase is often used to indicate undesirable
-repetitive regions.)
+alignments, and only writes alignments that include a significant
+amount of uppercase-to-uppercase alignment. (Lowercase is often used
+to indicate undesirable "simple" sequence.)
The input should be in MAF format, with header lines (of the kind
produced by lastal) describing the alignment score parameters.
=====================================
doc/last-split.html
=====================================
@@ -364,7 +364,7 @@ last-split options.</p>
</div>
<div class="section" id="alignment-of-two-whole-genomes">
<h3>Alignment of two whole genomes</h3>
-<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>.</p>
+<p>See <a class="reference external" href="last-tutorial.html">here</a>.</p>
</div>
</div>
<div class="section" id="faq">
=====================================
doc/last-split.txt
=====================================
@@ -54,7 +54,7 @@ last-split options.
Alignment of two whole genomes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-See `here <https://github.com/mcfrith/last-genome-alignments>`_.
+See `here <last-tutorial.html>`_.
FAQ
---
=====================================
doc/last-train.html
=====================================
@@ -399,6 +399,13 @@ include 2 for half-bit scores and 3 for 1/3-bit scores.
(Note that 1/3-bit scores essentially equal Phred scores
a.k.a. decibans, because log10(2) ≈ 3/10.) The default is to
infer a scale from the initial score parameters.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">--codon</span></kbd></td>
+<td>Do training for DNA query sequences versus protein reference
+sequences. These options will be ignored: <tt class="docutils literal"><span class="pre">--revsym</span>
+<span class="pre">--matsym</span> <span class="pre">--gapsym</span> <span class="pre">--pid</span> <span class="pre">--postmask</span> <span class="pre">-q</span> <span class="pre">-p</span> <span class="pre">-S</span></tt>. If
+<tt class="docutils literal"><span class="pre">--codon</span></tt> is used, the "initial parameter options" are
+initial probabilities, not scores/costs.</td></tr>
</tbody>
</table>
</blockquote>
@@ -433,6 +440,9 @@ alignments: they are described in more detail at <a class="reference external" h
<tr><td class="option-group">
<kbd><span class="option">-B <var>COST</var></span></kbd></td>
<td>Initial insertion extension cost.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-F <var>LIST</var></span></kbd></td>
+<td>Initial frameshift probabilities (only used with <tt class="docutils literal"><span class="pre">--codon</span></tt>).</td></tr>
</tbody>
</table>
</blockquote>
@@ -557,10 +567,7 @@ may be poor.</p>
</li>
<li><p class="first">last-train can fail for various reasons, e.g. if the sequences are
too dissimilar. If it fails to find any alignments, you could try
-<a class="reference external" href="last-tutorial.html#example-6-find-very-short-dna-alignments">reducing the alignment significance threshold</a> with
-option -D.</p>
-</li>
-<li><p class="first">last-train cannot handle DNA-versus-protein alignment.</p>
+reducing the alignment <a class="reference external" href="last-evalues.html">significance</a> threshold with option <tt class="docutils literal"><span class="pre">-D</span></tt>.</p>
</li>
</ul>
</div>
=====================================
doc/last-train.txt
=====================================
@@ -65,6 +65,12 @@ Training options
(Note that 1/3-bit scores essentially equal Phred scores
a.k.a. decibans, because log10(2) ≈ 3/10.) The default is to
infer a scale from the initial score parameters.
+ --codon
+ Do training for DNA query sequences versus protein reference
+ sequences. These options will be ignored: ``--revsym
+ --matsym --gapsym --pid --postmask -q -p -S``. If
+ ``--codon`` is used, the "initial parameter options" are
+ initial probabilities, not scores/costs.
All options below this point are passed to lastal to do the
alignments: they are described in more detail at `<lastal.html>`_.
@@ -79,6 +85,7 @@ Initial parameter options
-b COST Initial gap extension cost.
-A COST Initial insertion existence cost.
-B COST Initial insertion extension cost.
+ -F LIST Initial frameshift probabilities (only used with ``--codon``).
Alignment options
~~~~~~~~~~~~~~~~~
@@ -174,8 +181,6 @@ Bugs
* last-train can fail for various reasons, e.g. if the sequences are
too dissimilar. If it fails to find any alignments, you could try
- `reducing the alignment significance threshold
- <last-tutorial.html#example-6-find-very-short-dna-alignments>`_ with
- option -D.
+ reducing the alignment significance_ threshold with option ``-D``.
-* last-train cannot handle DNA-versus-protein alignment.
+.. _significance: last-evalues.html
=====================================
doc/last-tuning.html
=====================================
@@ -320,13 +320,6 @@ table.field-list { border: thin solid green }
<p>This document tells you how to trade-off <strong>speed</strong>, <strong>sensitivity</strong>,
and <strong>memory and disk usage</strong>.</p>
-<p>Ideally, the default settings would always work well. Unfortunately,
-there is too great a variety of challenging alignment tasks, and the
-LAST developers lack experience with most of them.</p>
-<p>LAST must have <em>some</em> defaults, and any choice will displease someone.
-It is wrong to say "LAST is faster but less sensitive than method X",
-or "slower but more sensitive than method Y", without varying the
-defaults.</p>
<div class="section" id="sparsity-options">
<h2>Sparsity options</h2>
<p>It's advisable to use at most one sparsity option: combining them will
@@ -409,7 +402,7 @@ high-identity matches.</p>
<h3>lastal -C</h3>
<p>This option (gapless alignment culling) can make lastal <strong>faster</strong> but
<strong>less sensitive</strong>. It can also <strong>reduce redundant output</strong>. For
-example, -C2 makes it discard alignments (before gapped extension)
+example, <tt class="docutils literal"><span class="pre">-C2</span></tt> makes it discard alignments (before gapped extension)
whose query coordinates lie in those of 2 or more stronger alignments.
This works well for aligning long, repeat-rich, indel-poor sequences
(e.g. mammal chromosomes) without repeat-masking.</p>
@@ -421,13 +414,13 @@ sets the maximum score drop in alignments, in the gapped extension
phase. Lower values make it faster, by quitting unpromising
extensions sooner. The default aims at best accuracy.</p>
<p>You can set this option in several ways: perhaps the most intuitive is
-via maximum gap length. For example, -z10g sets the maximum score
+via maximum gap length. For example, <tt class="docutils literal"><span class="pre">-z10g</span></tt> sets the maximum score
drop such that the longest possible gap length is 10.</p>
</div>
<div class="section" id="lastal-x">
<h3>lastal -x</h3>
<p>This option (preliminary gapped extension) can make lastal <strong>faster</strong>
-but <strong>less sensitive</strong>. For example, -x2g makes it extend gapped
+but <strong>less sensitive</strong>. For example, <tt class="docutils literal"><span class="pre">-x2g</span></tt> makes it extend gapped
alignments with a maximum gap length of 2, discard those with score
below the gapped score threshold, then redo the survivors with the
final max score drop (z).</p>
@@ -442,17 +435,17 @@ gaps).</p>
<div class="section" id="lastal-j1">
<h3>lastal -j1</h3>
<p>This option requests <strong>gapless</strong> alignment, which is even <strong>faster</strong>.
-(You could get the same effect by using very high gap costs, but -j1
-is faster because it skips the gapping phase entirely.)</p>
+(You could get the same effect by using very high gap costs, but
+<tt class="docutils literal"><span class="pre">-j1</span></tt> is faster because it skips the gapping phase entirely.)</p>
</div>
<div class="section" id="lastal-f">
<h3>lastal -f</h3>
-<p>Option -fTAB <strong>reduces the output size</strong>, which can improve speed.</p>
+<p>Option <tt class="docutils literal"><span class="pre">-fTAB</span></tt> <strong>reduces the output size</strong>, which can improve speed.</p>
</div>
<div class="section" id="lastdb-i">
<h3>lastdb -i</h3>
<p>This option <strong>makes lastdb faster</strong>, but disables some lastal options.
-If lastdb is too slow, try -i10.</p>
+If lastdb is too slow, try <tt class="docutils literal"><span class="pre">-i10</span></tt>.</p>
</div>
<div class="section" id="lastdb-c2">
<h3>lastdb -C2</h3>
=====================================
doc/last-tuning.txt
=====================================
@@ -4,15 +4,6 @@ LAST Performance Tuning
This document tells you how to trade-off **speed**, **sensitivity**,
and **memory and disk usage**.
-Ideally, the default settings would always work well. Unfortunately,
-there is too great a variety of challenging alignment tasks, and the
-LAST developers lack experience with most of them.
-
-LAST must have *some* defaults, and any choice will displease someone.
-It is wrong to say "LAST is faster but less sensitive than method X",
-or "slower but more sensitive than method Y", without varying the
-defaults.
-
Sparsity options
~~~~~~~~~~~~~~~~
@@ -110,7 +101,7 @@ lastal -C
This option (gapless alignment culling) can make lastal **faster** but
**less sensitive**. It can also **reduce redundant output**. For
-example, -C2 makes it discard alignments (before gapped extension)
+example, ``-C2`` makes it discard alignments (before gapped extension)
whose query coordinates lie in those of 2 or more stronger alignments.
This works well for aligning long, repeat-rich, indel-poor sequences
(e.g. mammal chromosomes) without repeat-masking.
@@ -124,14 +115,14 @@ phase. Lower values make it faster, by quitting unpromising
extensions sooner. The default aims at best accuracy.
You can set this option in several ways: perhaps the most intuitive is
-via maximum gap length. For example, -z10g sets the maximum score
+via maximum gap length. For example, ``-z10g`` sets the maximum score
drop such that the longest possible gap length is 10.
lastal -x
---------
This option (preliminary gapped extension) can make lastal **faster**
-but **less sensitive**. For example, -x2g makes it extend gapped
+but **less sensitive**. For example, ``-x2g`` makes it extend gapped
alignments with a maximum gap length of 2, discard those with score
below the gapped score threshold, then redo the survivors with the
final max score drop (z).
@@ -148,19 +139,19 @@ lastal -j1
----------
This option requests **gapless** alignment, which is even **faster**.
-(You could get the same effect by using very high gap costs, but -j1
-is faster because it skips the gapping phase entirely.)
+(You could get the same effect by using very high gap costs, but
+``-j1`` is faster because it skips the gapping phase entirely.)
lastal -f
---------
-Option -fTAB **reduces the output size**, which can improve speed.
+Option ``-fTAB`` **reduces the output size**, which can improve speed.
lastdb -i
---------
This option **makes lastdb faster**, but disables some lastal options.
-If lastdb is too slow, try -i10.
+If lastdb is too slow, try ``-i10``.
lastdb -C2
----------
=====================================
doc/last-tutorial.html
=====================================
@@ -4,7 +4,7 @@
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.6: http://docutils.sourceforge.net/" />
-<title>LAST Tutorial</title>
+<title>LAST Cookbook</title>
<style type="text/css">
/*
@@ -315,206 +315,306 @@ table.field-list { border: thin solid green }
</style>
</head>
<body>
-<div class="document" id="last-tutorial">
-<h1 class="title">LAST Tutorial</h1>
-
-<p>LAST finds similar regions between sequences, and aligns them.</p>
-<div class="section" id="example-1-compare-the-human-and-fugu-mitochondrial-genomes">
-<h2>Example 1: Compare the human and fugu mitochondrial genomes</h2>
-<p>For our first example, we wish to find and align similar regions
-between the human and fugu mitochondrial genomes. You can find these
-sequences in the examples directory: humanMito.fa and fuguMito.fa. We
-can compare them like this:</p>
+<div class="document" id="last-cookbook">
+<h1 class="title">LAST Cookbook</h1>
+
+<p><a class="reference external" href="last.html">LAST</a> is used by running commands in a terminal / command line. It
+has many options: unfortunately, the LAST developers don't know the
+best options for every possible alignment task. Here are some
+reasonable starting points. Feel free to optimize (and share!) search
+protocols.</p>
+<div class="section" id="a-minimal-example-compare-human-fugu-mitochondrial-genomes">
+<h2>A minimal example: compare human & fugu mitochondrial genomes</h2>
+<p>Let's find and align similar regions between the human and fugu
+mitochondrial genomes. These FASTA-format files are in LAST's
+examples directory: humanMito.fa and fuguMito.fa. The simplest
+possible usage is:</p>
<pre class="literal-block">
-lastdb -cR01 humdb humanMito.fa
+lastdb humdb humanMito.fa
lastal humdb fuguMito.fa > myalns.maf
</pre>
-<p>The lastdb command creates several files whose names begin with
-"humdb". The lastal command then compares fuguMito.fa to the humdb
+<p>The <a class="reference external" href="lastdb.html">lastdb</a> command creates several files whose names begin with
+"humdb". The <a class="reference external" href="lastal.html">lastal</a> command then compares fuguMito.fa to the humdb
files, and writes the alignments to a file called "myalns.maf".</p>
-<p>The "-cR01" option suppresses alignments caused by simple sequence
-such as cacacacacacacacacacacaca.</p>
</div>
<div class="section" id="understanding-the-output">
<h2>Understanding the output</h2>
<p>The output has very long lines, so you need to view it without
-line-wrapping. For example, with a Unix/Linux/MacOsX command line,
-you can use:</p>
+line-wrapping. For example, you can use:</p>
<pre class="literal-block">
less -S myalns.maf
</pre>
-<p>Each alignment looks like this:</p>
+<p>Each alignment looks like this (<a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">MAF</a> format):</p>
<pre class="literal-block">
a score=27 EG2=4.7e+04 E=2.6e-05
s humanMito 2170 145 + 16571 AGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTT...
s fuguMito 1648 142 + 16447 AGTAGGCTTAGAAGCAGCCACCA--CAAGAAAGCGTT...
</pre>
<p>The score is a measure of how significant the similarity is. EG2 and
-E are explained at <a class="reference external" href="last-evalues.html">last-evalues.html</a>. Lines starting with "s"
-contain: the sequence name, the start coordinate of the alignment, the
-number of bases spanned by the alignment, the strand, the sequence
-length, and the aligned bases.</p>
+E are explained at <a class="reference external" href="last-evalues.html">last-evalues</a>. Lines starting with "s" contain:
+the sequence name, the start coordinate of the alignment, the number
+of bases spanned by the alignment, the strand, the sequence length,
+and the aligned bases.</p>
<p>The start coordinates are zero-based. This means that, if the
alignment begins right at the start of a sequence, the coordinate is
-0. If the strand is "-", the start coordinate is in the reverse
-strand.</p>
-<p>This alignment format is called <a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">MAF (multiple alignment format)</a>. You can
-convert it to several other formats using <a class="reference external" href="maf-convert.html">maf-convert</a>. You can make lastal produce a few other formats
-with option -f (see <a class="reference external" href="lastal.html">lastal.html</a>).</p>
+0. If the strand is "-", the start coordinate is the coordinate in
+the reverse-complemented sequence (the same as if you were to
+reverse-complement the sequence before giving it to LAST).</p>
+<p>You can convert MAF to other formats with <a class="reference external" href="maf-convert.html">maf-convert</a>, or use <a class="reference external" href="lastal.html">lastal</a>
+option <tt class="docutils literal"><span class="pre">-f</span></tt> to get a few other formats.</p>
</div>
-<div class="section" id="example-2-compare-vertebrate-proteins-to-invertebrate-proteins">
-<h2>Example 2: Compare vertebrate proteins to invertebrate proteins</h2>
-<p>Use the lastdb -p option to indicate that the sequences are proteins:</p>
+<div class="section" id="more-accurate-learn-substitution-gap-rates">
+<h2>More accurate: learn substitution & gap rates</h2>
+<p>We can get more accurate alignments between the human and fugu
+mitochondrial genomes like this:</p>
<pre class="literal-block">
-lastdb -p -cR01 invdb invertebrate.fa
-lastal invdb vertebrate.fa
+lastdb humdb humanMito.fa
+last-train humdb fuguMito.fa > hufu.train
+lastal -p hufu.train humdb fuguMito.fa > myalns.maf
</pre>
+<p>The <a class="reference external" href="last-train.html">last-train</a> command finds the rates of deletion, insertion, and
+each kind of substitution between these sequences, and writes them to
+a file "hufu.train". lastal's <tt class="docutils literal"><span class="pre">-p</span></tt> option uses this file to get
+more-accurate alignments.</p>
</div>
-<div class="section" id="example-3-compare-dna-sequences-to-protein-sequences">
-<h2>Example 3: Compare DNA sequences to protein sequences</h2>
-<p>Here we use the -F15 option, to specify translated alignment with a
-score penalty of 15 for frameshifts:</p>
+<div class="section" id="comparing-protein-sequences">
+<h2>Comparing protein sequences</h2>
+<p>We can compare some query proteins to some reference proteins like
+this:</p>
<pre class="literal-block">
-lastdb -p -cR01 protdb proteins.fa
-lastal -F15 protdb dnas.fa
+lastdb -p -c -R01 refdb ref-prots.fa
+lastal refdb query-prots.fa > prot-alns.maf
</pre>
-</div>
-<div class="section" id="example-4-find-high-similarity-and-short-protein-alignments">
-<h2>Example 4: Find high-similarity, and short, protein alignments</h2>
-<p>LAST uses a <a class="reference external" href="last-matrices.html">scoring scheme</a> to find
-similarities. Some scoring schemes are tuned for weak similarities,
-others for strong similarities. The PAM30 scoring scheme finds strong
-protein similarities:</p>
+<p><tt class="docutils literal"><span class="pre">-p</span></tt> tells it the sequences are proteins. (If you forget <tt class="docutils literal"><span class="pre">-p</span></tt> and
+the sequences look proteinaceous, you'll get a warning message.)</p>
+<p>The other options suppress alignments caused by simple sequence such
+as <tt class="docutils literal">APPSPAPPSPAPPSPAPPSPAP</tt>:</p>
+<ul>
+<li><p class="first"><tt class="docutils literal"><span class="pre">-R01</span></tt> converts the sequence letters to uppercase, then finds
+simple regions and converts them to lowercase. This will be done
+for both ref-prots and query-prots.</p>
+</li>
+<li><p class="first"><tt class="docutils literal"><span class="pre">-c</span></tt> omits alignments that lack a significant amount of
+uppercase-to-uppercase alignment.</p>
+</li>
+</ul>
+<p>You can also use <a class="reference external" href="last-train.html">last-train</a>, but we've hardly tested it for
+protein-protein alignment, so we're not sure if it helps.</p>
+<div class="section" id="find-high-similarity-and-short-protein-alignments">
+<h3>Find high-similarity, and short, protein alignments</h3>
+<p>If we just want high-similarity alignments, we can use the PAM30 (or
+PAM10) <a class="reference external" href="last-matrices.html">scoring scheme</a>:</p>
<pre class="literal-block">
-lastdb -p -cR01 invdb invertebrate.fa
-lastal -pPAM30 invdb vertebrate.fa
+lastdb -p -c -R01 refdb ref-prots.fa
+lastal -p PAM30 refdb query-prots.fa > prot-alns.maf
</pre>
<p>This has two advantages:</p>
<ul>
-<li><p class="first">It omits weak alignments, or alignment parts (occasionally a strong
-similarity is flanked by a weak similarity).</p>
+<li><p class="first">It omits low-similarity alignments, or alignment parts.</p>
</li>
-<li><p class="first">It can find short similarities. If we seek very short similarities,
-weak ones are hopeless (statistically insignificant), so we had
-better focus on strong ones. (How short is "very short"? It
-depends on the amount of sequence data we are searching, but perhaps
-roughly less than 40 amino acids.)</p>
+<li><p class="first">It can find short similarities, which would be deemed insignificant
+(likely to occur by chance between random sequences) unless we focus
+the search on high-similarity.</p>
</li>
</ul>
</div>
-<div class="section" id="example-5-align-human-dna-sequences-to-the-human-genome">
-<h2>Example 5: Align human DNA sequences to the human genome</h2>
-<p>We can align human DNA sequences to the human genome like this:</p>
+</div>
+<div class="section" id="comparing-dna-to-proteins">
+<h2>Comparing DNA to proteins</h2>
+<p>We can find related regions of DNA and proteins, allowing for nonsense
+mutations and frameshifts. For example, let's find DNA regions
+related to transposon proteins:</p>
+<pre class="literal-block">
+lastdb -q -c -R01 trandb transposon-prots.fa
+last-train --codon trandb dna.fa > codon.train
+lastal -p codon.train -m100 -D1e9 -K1 trandb dna.fa > out.maf
+</pre>
+<p><tt class="docutils literal"><span class="pre">-q</span></tt> appends <tt class="docutils literal">*</tt> meaning STOP to each protein, and treats <tt class="docutils literal">*</tt> as
+a 21st protein letter.</p>
+<p><tt class="docutils literal"><span class="pre">--codon</span></tt> makes it do DNA-versus-protein. Here, <a class="reference external" href="last-train.html">last-train</a> tries
+to learn 21x64 substitution rates, so it needs a fairly large amount
+of data (e.g. a chromosome).</p>
+<p><tt class="docutils literal"><span class="pre">-m100</span></tt> makes it more slow-and-sensitive than the default (which is
+<tt class="docutils literal"><span class="pre">-m10</span></tt>), see <a class="reference external" href="lastal.html">lastal</a>.</p>
+<p><tt class="docutils literal"><span class="pre">-D1e9</span></tt> sets a strict <a class="reference external" href="last-evalues.html">significance</a> threshold. It means: only
+report strong similarities that would be expected to occur by chance,
+between random sequences, at most once per 10^9 base-pairs. The
+default is 1e6.</p>
+<p><tt class="docutils literal"><span class="pre">-K1</span></tt> streamlines the output by omitting any alignment whose DNA
+range lies in that of a higher-scoring alignment.</p>
+<p>Another possibility is to add <a class="reference external" href="last-train.html">last-train</a> option <tt class="docutils literal"><span class="pre">-X1</span></tt>, which treats
+matches to <tt class="docutils literal">X</tt> (unknown) amino acids as neutral instead of
+disfavored.</p>
+</div>
+<div class="section" id="aligning-high-indel-error-long-dna-reads-to-a-genome">
+<h2>Aligning high-indel-error long DNA reads to a genome</h2>
+<p>Suppose we have DNA reads in either FASTA or <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">FASTQ</a> format. This is
+sensitive but slow:</p>
<pre class="literal-block">
-lastdb -uNEAR -R01 humandb human/chr*.fa
-lastal humandb queries.fa | last-split > myalns.maf
+lastdb -P8 -uNEAR -R01 mydb genome.fa
+last-train -P8 -Q0 mydb reads.fastq > reads.train
+lastal -P8 -p reads.train mydb reads.fastq | last-split > out.maf
</pre>
-<p>This will use about 15 gigabytes of memory.</p>
+<p><tt class="docutils literal"><span class="pre">-P8</span></tt> uses 8 parallel threads, adjust as appropriate for your
+computer. This has no effect on the results.</p>
+<p><tt class="docutils literal"><span class="pre">-uNEAR</span></tt> selects a <a class="reference external" href="last-seeds.html">seeding scheme</a> that's better at finding
+alignments with few substitutions and/or many gaps.</p>
+<p><tt class="docutils literal"><span class="pre">-Q0</span></tt> makes it discard the <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">fastq</a> quality information (or you can
+keep-but-ignore it with <tt class="docutils literal"><span class="pre">-Qkeep</span></tt>).</p>
+<p><a class="reference external" href="last-split.html">last-split</a> finds a unique best alignment for each part of each read.</p>
+<p>Here we used <tt class="docutils literal"><span class="pre">-R01</span></tt> to lowercase simple sequence like
+<tt class="docutils literal">cacacacacacacacacacacaca</tt>. But we didn't suppress it with <tt class="docutils literal"><span class="pre">-c</span></tt>,
+so as not to hide anything from <a class="reference external" href="last-split.html">last-split</a>. If desired, you can
+filter lowercase with <a class="reference external" href="last-postmask.html">last-postmask</a>.</p>
+<p>You can go faster by sacrificing a bit of sensitivity. It depends on
+your aim, e.g. slow-and-sensitive seems necessary to find intricate
+rearrangements of repeats. Suggested ways to go faster:</p>
<ul>
-<li><p class="first">-uNEAR selects a <a class="reference external" href="last-seeds.html">seeding scheme</a> that makes it
-better at finding short-and-strong similarities. (It also changes
-the default scoring scheme.)</p>
+<li><p class="first"><a class="reference external" href="https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md">Mask repeats</a>.</p>
</li>
-<li><p class="first">-R01 tells it to mark simple sequences (such as cacacacacacacacaca)
-by lowercase, but not suppress them. This has no effect on the
-alignment, but it allows us to see simple sequences in the output,
-and gives us the option to do <a class="reference external" href="last-postmask.html">post-alignment masking</a>.</p>
+<li><p class="first">Add lastal option <tt class="docutils literal"><span class="pre">-k8</span></tt> (or <tt class="docutils literal"><span class="pre">-k16</span></tt> etc). This makes it faster,
+by only finding initial matches starting at every 8th (or 16th etc)
+position in the reads.</p>
</li>
-<li><p class="first">last-split reads the alignments produced by lastal, and looks for a
-unique best alignment for each part of each query. It allows
-different parts of one query to match different parts of the genome,
-which may happen due to rearrangements. It has several useful
-options, please see <a class="reference external" href="last-split.html">last-split.html</a>.</p>
+<li><p class="first">Replace <tt class="docutils literal"><span class="pre">-uNEAR</span></tt> with <tt class="docutils literal"><span class="pre">-uRY32</span></tt> (or <tt class="docutils literal"><span class="pre">-uRY16</span></tt>, <tt class="docutils literal"><span class="pre">-uRY8</span></tt>,
+<tt class="docutils literal"><span class="pre">-uRY4</span></tt>). This makes it check for initial matches starting at
+only ~1/32 (or ~1/16 etc) of positions, in both reads and genome.
+Compared to <tt class="docutils literal"><span class="pre">-k</span></tt>: this harms sensitivity slightly more, but
+reduces memory use and makes lastdb faster.</p>
</li>
</ul>
+<div class="section" id="which-genome-version-to-use">
+<h3>Which genome version to use?</h3>
+<p>Some genome versions (e.g. for human) have artificial
+exactly-duplicated regions, which makes it hard to align reads
+uniquely. To avoid that, look for a genome version called something
+like "analysis set".</p>
</div>
-<div class="section" id="tuning-speed-sensitivity-memory-and-disk-usage">
-<h2>Tuning speed, sensitivity, memory and disk usage</h2>
-<ul>
-<li><p class="first">You can make LAST faster by <a class="reference external" href="last-parallel.html">using multiple CPUs</a>.</p>
-</li>
-<li><p class="first">You can <a class="reference external" href="last-tuning.html">trade off speed, sensitivity, memory and disk usage</a>. To go much faster, try either lastal option
-<tt class="docutils literal"><span class="pre">-k32</span></tt> or lastdb option <tt class="docutils literal"><span class="pre">-uRY32</span></tt> (but not both). <tt class="docutils literal"><span class="pre">-k32</span></tt> is a
-bit more sensitive, but <tt class="docutils literal"><span class="pre">-uRY32</span></tt> uses less memory.</p>
-</li>
-</ul>
+<div class="section" id="aligning-low-error-long-dna-reads-to-a-genome">
+<h3>Aligning low-error long DNA reads to a genome</h3>
+<p>We can do this the same way as for high-error reads, but perhaps
+accelerate more aggressively (e.g. <tt class="docutils literal"><span class="pre">-uRY32</span></tt>).</p>
+<p>If repeats are not masked, <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-C2</span></tt> may reduce run time
+with little effect on accuracy.</p>
+</div>
+<div class="section" id="aligning-potentially-spliced-rna-or-cdna-long-reads-to-a-genome">
+<h3>Aligning potentially-spliced RNA or cDNA long reads to a genome</h3>
+<p>See <a class="reference external" href="https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md">here</a>. (For low-error reads, you can probably omit <tt class="docutils literal"><span class="pre">-d90</span></tt> and
+<tt class="docutils literal"><span class="pre">-m20</span></tt>.)</p>
</div>
-<div class="section" id="example-6-find-very-short-dna-alignments">
-<h2>Example 6: Find very short DNA alignments</h2>
-<p>By default, LAST is quite strict, and only reports significant
-alignments that will rarely occur by chance. In the preceding
-example, the minimum alignment length is about 26 bases (less for
-smaller genomes). To find shorter alignments, we must down-tune the
-strictness:</p>
+</div>
+<div class="section" id="aligning-illumina-dna-reads-to-a-genome">
+<h2>Aligning Illumina DNA reads to a genome</h2>
<pre class="literal-block">
-lastdb -uNEAR -R01 humandb human/chr*.fa
-lastal -D100 humandb queries.fa | last-split > myalns.maf
+lastdb -P8 -uNEAR -R01 -C2 mydb genome.fasta
+last-train -P8 -Q1 mydb reads.fastq.gz > reads.train
+lastal -P8 -p reads.train mydb reads.fastq.gz | last-split > out.maf
</pre>
-<ul>
-<li><p class="first">-D100 makes lastal report alignments that could occur by chance once
-per hundred query letters. (The default is once per million.)</p>
-</li>
-</ul>
-<p>In this example, the minimum alignment length is about 20 bases (less
-for smaller genomes).</p>
+<p>Most LAST commands accept gzip-compressed (<tt class="docutils literal">.gz</tt>) files.</p>
+<p><a class="reference external" href="lastdb.html">lastdb</a> option <tt class="docutils literal"><span class="pre">-C2</span></tt> makes the alignment a bit faster, but uses more
+memory. This has no effect on the results. (You could use it in the
+other examples too, but it might not be faster.)</p>
+<p><tt class="docutils literal"><span class="pre">-Q1</span></tt> makes it use the <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">fastq</a> quality information to improve the
+training and alignment. LAST <strong>assumes</strong> that the qualities reflect
+substitution errors, not insertion/deletion errors. (For long
+non-Illumina reads, we suspect this assumption doesn't hold, so we
+didn't use this option.)</p>
+<p>This recipe may be excessively slow-and-sensitive. Adding <a class="reference external" href="lastal.html">lastal</a>
+option <tt class="docutils literal"><span class="pre">-C2</span></tt> may make it faster with negligible accuracy loss. You
+can accelerate with e.g. <tt class="docutils literal"><span class="pre">-uRY16</span></tt> or <tt class="docutils literal"><span class="pre">-k16</span></tt> as above.</p>
+<div class="section" id="finding-very-short-dna-alignments">
+<h3>Finding very short DNA alignments</h3>
+<p>By default, LAST only reports <a class="reference external" href="last-evalues.html">significant</a> alignments that will rarely
+occur by chance. In the preceding example, the minimum alignment
+length is about 26 bases for a human-size genome (less for smaller
+genomes). To find shorter alignments, add <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-D100</span></tt>
+(say), to get alignments that could occur by chance once per hundred
+query letters (the default is once per million.) This makes the
+minimum alignment length about 20 bases for a human-size genome.</p>
</div>
-<div class="section" id="example-7-align-human-fastq-sequences-to-the-human-genome">
-<h2>Example 7: Align human fastq sequences to the human genome</h2>
-<p>DNA sequences are not always perfectly accurate, and they are
-sometimes provided in fastq format, which indicates the reliability of
-each base. LAST can use this information to improve alignment
-accuracy. Option -Q1 indicates fastq-sanger format:</p>
+<div class="section" id="aligning-paired-end-illumina-dna-reads-to-a-genome">
+<h3>Aligning paired-end Illumina DNA reads to a genome</h3>
+<p>You could use <a class="reference external" href="last-pair-probs.html">last-pair-probs</a>. It has a disadvantage: it doesn't
+allow different parts of one read (i.e. one "end") to align to
+different parts of the genome. Alternatively, you could align the
+reads individually, ignoring the pair relationships:</p>
<pre class="literal-block">
-lastdb -uNEAR -R01 humandb human/chr*.fa
-lastal -Q1 humandb queries.fastq | last-split > myalns.maf
+fastq-interleave reads1.fq reads2.fq | lastal -P8 -p reads.train mydb | last-split > out.maf
</pre>
-<p><strong>Assumption:</strong> LAST assumes the reliabilities reflect substitution
-errors, not insertion/deletion errors. If that is not true, you can
-tell it to ignore the reliability data with -Q0.</p>
+<p><tt class="docutils literal"><span class="pre">fastq-interleave</span></tt> ensures that each read has a unique name (by
+appending "/1" and "/2" if necessary).</p>
</div>
-<div class="section" id="fastq-format-confusion">
-<h2>Fastq format confusion</h2>
-<p>Unfortunately, there is more than one fastq format (see
-<a class="reference external" href="http://nar.oxfordjournals.org/content/38/6/1767.long">http://nar.oxfordjournals.org/content/38/6/1767.long</a>). Recently
-(2013) fastq-sanger seems to be dominant, but if you have another
-variant you need to change the -Q option (see <a class="reference external" href="lastal.html">lastal.html</a>).</p>
+<div class="section" id="aligning-potentially-spliced-illumina-reads-to-a-genome">
+<h3>Aligning potentially-spliced Illumina reads to a genome</h3>
+<p>See <a class="reference external" href="last-split.html">last-split</a> (and <a class="reference external" href="last-pair-probs.html">last-pair-probs</a>).</p>
</div>
-<div class="section" id="paired-dna-reads">
-<h2>Paired DNA reads</h2>
-<p>If you have paired DNA reads, there are two options:</p>
-<ol class="arabic">
-<li><p class="first">Use last-pair-probs (see <a class="reference external" href="last-pair-probs.html">last-pair-probs.html</a>).</p>
-</li>
-<li><p class="first">Ignore the pairing information, and align the reads individually
-(using last-split as above). This may be useful because
-last-pair-probs does not currently allow different parts of one
-read to match different parts of the genome, though it does allow
-the two reads in a pair to match (e.g.) different chromosomes.</p>
-</li>
-</ol>
</div>
-<div class="section" id="example-8-compare-the-human-and-mouse-genomes">
-<h2>Example 8: Compare the human and mouse genomes</h2>
-<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>.</p>
+<div class="section" id="aligning-human-chimp-genomes">
+<h2>Aligning human & chimp genomes</h2>
+<p>This is very slow-and-sensitive:</p>
+<pre class="literal-block">
+lastdb -P8 -uNEAR -R01 humdb human_no_alt_analysis_set.fa
+last-train -P8 --revsym -E0.05 -C2 humdb chimp.fa > humchi.train
+lastal -E0.05 -C2 -p humchi.train humdb chimp.fa | last-split > humchi1.maf
+</pre>
+<p><tt class="docutils literal"><span class="pre">--revsym</span></tt> makes the substitution rates the same on both strands.
+For example, it makes A→G equal T→C (because A→G on one strand means
+T→C on the other strand). This is usually appropriate for
+genome-genome comparison (but maybe not for mitochondria which have
+asymmetric "heavy" and "light" strands).</p>
+<p><tt class="docutils literal"><span class="pre">-E0.05</span></tt> means only get <a class="reference external" href="last-evalues.html">significant</a> alignments that would be
+expected to occur by chance at a rate ≤ 0.05 times per pair of random
+sequences of length 1 billion each.</p>
+<p>The result so far is asymmetric: each part of the chimp genome is
+aligned to at most one part of the human genome, but not vice-versa.
+We can get one-to-one alignments like this:</p>
+<pre class="literal-block">
+maf-swap humchi1.maf | last-split > humchi2.maf
+</pre>
+<p>Then we can discard less-confident alignments, and <a class="reference external" href="maf-convert.html">convert</a> to a
+compact tabular format:</p>
+<pre class="literal-block">
+last-postmask humchi2.maf | maf-convert -n tab | awk -F= '$2 <= 1e-5' > humchi.tab
+</pre>
+<p><a class="reference external" href="last-postmask.html">last-postmask</a> discards alignments caused by simple sequence. The
+<tt class="docutils literal">awk</tt> command gets alignments with <a class="reference external" href="last-split.html">mismap probability</a> ≤ 10^-5.
+Finally, we can make a <a class="reference external" href="last-dotplot.html">dotplot</a>:</p>
+<pre class="literal-block">
+last-dotplot humchi.tab humchi.png
+</pre>
+<p>To go faster, with minor accuracy loss: replace <tt class="docutils literal"><span class="pre">-uNEAR</span></tt> with
+<tt class="docutils literal"><span class="pre">-uRY32</span></tt> and/or <a class="reference external" href="https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md">mask repeats</a>.</p>
+<p>To squeeze out the last 0.000...1% of accuracy: add <tt class="docutils literal"><span class="pre">-m50</span></tt> to the
+<a class="reference external" href="lastal.html">lastal</a> options.</p>
+<div class="section" id="aligning-human-mouse-genomes">
+<h3>Aligning human & mouse genomes</h3>
+<p>You can do this in the same way as human/chimp, except that <tt class="docutils literal"><span class="pre">-uNEAR</span></tt>
+should be omitted. To increase sensitivity, but also time and memory
+use, add lastdb <a class="reference external" href="last-seeds.html">seeding</a> option <tt class="docutils literal"><span class="pre">-uMAM4</span></tt> or or <tt class="docutils literal"><span class="pre">-uMAM8</span></tt>. To
+increase them even more, add <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-m100</span></tt> (or as high as
+you can bear).</p>
</div>
-<div class="section" id="example-9-compare-the-human-and-chimp-genomes">
-<h2>Example 9: Compare the human and chimp genomes</h2>
-<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>. But
-that recipe is <em>extremely</em> slow-and-accurate. You can <a class="reference external" href="last-tuning.html">tune</a> it to compare huge, high-similarity genomes with
-moderate run time and memory use:</p>
+</div>
+<div class="section" id="large-reference-sequences">
+<h2>Large reference sequences</h2>
+<p>If the sequences that you give to lastdb exceed ~4 billion letters,
+consider using 8-byte LAST (<a class="reference external" href="lastdb.html">lastdb8</a> and <a class="reference external" href="lastal.html">lastal8</a>). Ordinary (4-byte)
+LAST can't handle so much sequence at once, so <a class="reference external" href="lastdb.html">lastdb</a> splits it into
+"volumes", which may be inefficient. 8-byte LAST avoids voluming, but
+uses more memory. So <a class="reference external" href="lastdb.html">lastdb8</a> works well with a memory-reducing
+option: <tt class="docutils literal"><span class="pre">-uRY</span></tt> or <tt class="docutils literal"><span class="pre">-w</span></tt> or <tt class="docutils literal"><span class="pre">-W</span></tt>.</p>
+</div>
+<div class="section" id="moar-faster">
+<h2>Moar faster</h2>
<ul>
-<li><p class="first">Omit the sensitivity-boosting lastal -m option.</p>
-</li>
-<li><p class="first">Replace <tt class="docutils literal"><span class="pre">-uNEAR</span></tt> with <tt class="docutils literal"><span class="pre">-uRY32</span></tt>.</p>
+<li><p class="first"><a class="reference external" href="last-parallel.html">Using multiple CPUs / cores</a></p>
</li>
-<li><p class="first">If the "reference" genome (the one given to lastdb) is > 4 GB, it's
-probably more efficient to use lastdb8 and lastal8, instead of
-lastdb and lastal.</p>
+<li><p class="first"><a class="reference external" href="last-tuning.html">Various speed & memory options</a></p>
</li>
</ul>
</div>
-<div class="section" id="example-10-ambiguity-of-alignment-columns">
-<h2>Example 10: Ambiguity of alignment columns</h2>
+<div class="section" id="ambiguity-of-alignment-columns">
+<h2>Ambiguity of alignment columns</h2>
<p>Consider this alignment:</p>
<pre class="literal-block">
TGAAGTTAAAGGTATATGAATTCCAATTCTTAACCCCCCTATTAAACGAATATCTTG
@@ -522,12 +622,10 @@ TGAAGTTAAAGGTATATGAATTCCAATTCTTAACCCCCCTATTAAACGAATATCTTG
TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
</pre>
<p>The middle section has such weak similarity that its precise alignment
-cannot be confidently inferred.</p>
-<p>It is sometimes useful to estimate the ambiguity of each column in an
-alignment. We can do that using lastal option -j4:</p>
+cannot be confidently inferred. We can see the confidence of each
+alignment column with <a class="reference external" href="lastal.html">lastal</a> option <tt class="docutils literal"><span class="pre">-j4</span></tt>:</p>
<pre class="literal-block">
-lastdb -cR01 humdb humanMito.fa
-lastal -j4 humdb fuguMito.fa > myalns.maf
+lastal -j4 -p hufu.train humdb fuguMito.fa > myalns.maf
</pre>
<p>The output looks like this:</p>
<pre class="literal-block">
@@ -537,7 +635,7 @@ s seqY 0 51 + 51 TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
p %*.14442011.(%##"%$$$$###""!!!""""&'(*,340.,,.~~~~~~~~~~~
</pre>
<p>The "p" line indicates the probability that each column is wrongly
-aligned, using a compact code (the same as fastq-sanger format):</p>
+aligned, using a compact code (the same as <a class="reference external" href="https://doi.org/10.1093/nar/gkp1137">fastq</a> format):</p>
<blockquote>
<table border="1" class="docutils">
<colgroup>
@@ -633,8 +731,6 @@ aligned, using a compact code (the same as fastq-sanger format):</p>
<p>Note that each alignment is grown from a "core" region, and the
ambiguity estimates assume that the core is correctly aligned. The
core is indicated by "~" symbols, and it contains exact matches only.</p>
-<p>LAST has options to find alignments with optimal column probabilities,
-instead of optimal score: see <a class="reference external" href="lastal.html">lastal.html</a>.</p>
</div>
</div>
</body>
=====================================
doc/last-tutorial.txt
=====================================
@@ -1,211 +1,334 @@
-LAST Tutorial
+LAST Cookbook
=============
-LAST finds similar regions between sequences, and aligns them.
+LAST_ is used by running commands in a terminal / command line. It
+has many options: unfortunately, the LAST developers don't know the
+best options for every possible alignment task. Here are some
+reasonable starting points. Feel free to optimize (and share!) search
+protocols.
-Example 1: Compare the human and fugu mitochondrial genomes
------------------------------------------------------------
+A minimal example: compare human & fugu mitochondrial genomes
+-------------------------------------------------------------
-For our first example, we wish to find and align similar regions
-between the human and fugu mitochondrial genomes. You can find these
-sequences in the examples directory: humanMito.fa and fuguMito.fa. We
-can compare them like this::
+Let's find and align similar regions between the human and fugu
+mitochondrial genomes. These FASTA-format files are in LAST's
+examples directory: humanMito.fa and fuguMito.fa. The simplest
+possible usage is::
- lastdb -cR01 humdb humanMito.fa
+ lastdb humdb humanMito.fa
lastal humdb fuguMito.fa > myalns.maf
-The lastdb command creates several files whose names begin with
-"humdb". The lastal command then compares fuguMito.fa to the humdb
+The lastdb_ command creates several files whose names begin with
+"humdb". The lastal_ command then compares fuguMito.fa to the humdb
files, and writes the alignments to a file called "myalns.maf".
-The "-cR01" option suppresses alignments caused by simple sequence
-such as cacacacacacacacacacacaca.
-
Understanding the output
------------------------
The output has very long lines, so you need to view it without
-line-wrapping. For example, with a Unix/Linux/MacOsX command line,
-you can use::
+line-wrapping. For example, you can use::
less -S myalns.maf
-Each alignment looks like this::
+Each alignment looks like this (MAF_ format)::
a score=27 EG2=4.7e+04 E=2.6e-05
s humanMito 2170 145 + 16571 AGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTT...
s fuguMito 1648 142 + 16447 AGTAGGCTTAGAAGCAGCCACCA--CAAGAAAGCGTT...
The score is a measure of how significant the similarity is. EG2 and
-E are explained at `<last-evalues.html>`_. Lines starting with "s"
-contain: the sequence name, the start coordinate of the alignment, the
-number of bases spanned by the alignment, the strand, the sequence
-length, and the aligned bases.
+E are explained at last-evalues_. Lines starting with "s" contain:
+the sequence name, the start coordinate of the alignment, the number
+of bases spanned by the alignment, the strand, the sequence length,
+and the aligned bases.
The start coordinates are zero-based. This means that, if the
alignment begins right at the start of a sequence, the coordinate is
-0. If the strand is "-", the start coordinate is in the reverse
-strand.
+0. If the strand is "-", the start coordinate is the coordinate in
+the reverse-complemented sequence (the same as if you were to
+reverse-complement the sequence before giving it to LAST).
+
+You can convert MAF to other formats with maf-convert_, or use lastal_
+option ``-f`` to get a few other formats.
+
+More accurate: learn substitution & gap rates
+---------------------------------------------
+
+We can get more accurate alignments between the human and fugu
+mitochondrial genomes like this::
+
+ lastdb humdb humanMito.fa
+ last-train humdb fuguMito.fa > hufu.train
+ lastal -p hufu.train humdb fuguMito.fa > myalns.maf
-This alignment format is called `MAF (multiple alignment format)
-<http://genome.ucsc.edu/FAQ/FAQformat.html#format5>`_. You can
-convert it to several other formats using `maf-convert
-<maf-convert.html>`_. You can make lastal produce a few other formats
-with option -f (see `<lastal.html>`_).
+The last-train_ command finds the rates of deletion, insertion, and
+each kind of substitution between these sequences, and writes them to
+a file "hufu.train". lastal's ``-p`` option uses this file to get
+more-accurate alignments.
-Example 2: Compare vertebrate proteins to invertebrate proteins
----------------------------------------------------------------
+Comparing protein sequences
+---------------------------
-Use the lastdb -p option to indicate that the sequences are proteins::
+We can compare some query proteins to some reference proteins like
+this::
- lastdb -p -cR01 invdb invertebrate.fa
- lastal invdb vertebrate.fa
+ lastdb -p -c -R01 refdb ref-prots.fa
+ lastal refdb query-prots.fa > prot-alns.maf
-Example 3: Compare DNA sequences to protein sequences
------------------------------------------------------
+``-p`` tells it the sequences are proteins. (If you forget ``-p`` and
+the sequences look proteinaceous, you'll get a warning message.)
-Here we use the -F15 option, to specify translated alignment with a
-score penalty of 15 for frameshifts::
+The other options suppress alignments caused by simple sequence such
+as ``APPSPAPPSPAPPSPAPPSPAP``:
- lastdb -p -cR01 protdb proteins.fa
- lastal -F15 protdb dnas.fa
+* ``-R01`` converts the sequence letters to uppercase, then finds
+ simple regions and converts them to lowercase. This will be done
+ for both ref-prots and query-prots.
-Example 4: Find high-similarity, and short, protein alignments
---------------------------------------------------------------
+* ``-c`` omits alignments that lack a significant amount of
+ uppercase-to-uppercase alignment.
-LAST uses a `scoring scheme <last-matrices.html>`_ to find
-similarities. Some scoring schemes are tuned for weak similarities,
-others for strong similarities. The PAM30 scoring scheme finds strong
-protein similarities::
+You can also use last-train_, but we've hardly tested it for
+protein-protein alignment, so we're not sure if it helps.
- lastdb -p -cR01 invdb invertebrate.fa
- lastal -pPAM30 invdb vertebrate.fa
+Find high-similarity, and short, protein alignments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If we just want high-similarity alignments, we can use the PAM30 (or
+PAM10) `scoring scheme`_::
+
+ lastdb -p -c -R01 refdb ref-prots.fa
+ lastal -p PAM30 refdb query-prots.fa > prot-alns.maf
This has two advantages:
-* It omits weak alignments, or alignment parts (occasionally a strong
- similarity is flanked by a weak similarity).
+* It omits low-similarity alignments, or alignment parts.
+
+* It can find short similarities, which would be deemed insignificant
+ (likely to occur by chance between random sequences) unless we focus
+ the search on high-similarity.
+
+Comparing DNA to proteins
+-------------------------
+
+We can find related regions of DNA and proteins, allowing for nonsense
+mutations and frameshifts. For example, let's find DNA regions
+related to transposon proteins::
+
+ lastdb -q -c -R01 trandb transposon-prots.fa
+ last-train --codon trandb dna.fa > codon.train
+ lastal -p codon.train -m100 -D1e9 -K1 trandb dna.fa > out.maf
+
+``-q`` appends ``*`` meaning STOP to each protein, and treats ``*`` as
+a 21st protein letter.
+
+``--codon`` makes it do DNA-versus-protein. Here, last-train_ tries
+to learn 21x64 substitution rates, so it needs a fairly large amount
+of data (e.g. a chromosome).
+
+``-m100`` makes it more slow-and-sensitive than the default (which is
+``-m10``), see lastal_.
+
+``-D1e9`` sets a strict significance_ threshold. It means: only
+report strong similarities that would be expected to occur by chance,
+between random sequences, at most once per 10^9 base-pairs. The
+default is 1e6.
+
+``-K1`` streamlines the output by omitting any alignment whose DNA
+range lies in that of a higher-scoring alignment.
+
+Another possibility is to add last-train_ option ``-X1``, which treats
+matches to ``X`` (unknown) amino acids as neutral instead of
+disfavored.
+
+Aligning high-indel-error long DNA reads to a genome
+----------------------------------------------------
+
+Suppose we have DNA reads in either FASTA or FASTQ_ format. This is
+sensitive but slow::
-* It can find short similarities. If we seek very short similarities,
- weak ones are hopeless (statistically insignificant), so we had
- better focus on strong ones. (How short is "very short"? It
- depends on the amount of sequence data we are searching, but perhaps
- roughly less than 40 amino acids.)
+ lastdb -P8 -uNEAR -R01 mydb genome.fa
+ last-train -P8 -Q0 mydb reads.fastq > reads.train
+ lastal -P8 -p reads.train mydb reads.fastq | last-split > out.maf
-Example 5: Align human DNA sequences to the human genome
---------------------------------------------------------
+``-P8`` uses 8 parallel threads, adjust as appropriate for your
+computer. This has no effect on the results.
-We can align human DNA sequences to the human genome like this::
+``-uNEAR`` selects a `seeding scheme`_ that's better at finding
+alignments with few substitutions and/or many gaps.
- lastdb -uNEAR -R01 humandb human/chr*.fa
- lastal humandb queries.fa | last-split > myalns.maf
+``-Q0`` makes it discard the fastq_ quality information (or you can
+keep-but-ignore it with ``-Qkeep``).
-This will use about 15 gigabytes of memory.
+last-split_ finds a unique best alignment for each part of each read.
-* -uNEAR selects a `seeding scheme <last-seeds.html>`_ that makes it
- better at finding short-and-strong similarities. (It also changes
- the default scoring scheme.)
+Here we used ``-R01`` to lowercase simple sequence like
+``cacacacacacacacacacacaca``. But we didn't suppress it with ``-c``,
+so as not to hide anything from last-split_. If desired, you can
+filter lowercase with last-postmask_.
-* -R01 tells it to mark simple sequences (such as cacacacacacacacaca)
- by lowercase, but not suppress them. This has no effect on the
- alignment, but it allows us to see simple sequences in the output,
- and gives us the option to do `post-alignment masking
- <last-postmask.html>`_.
+You can go faster by sacrificing a bit of sensitivity. It depends on
+your aim, e.g. slow-and-sensitive seems necessary to find intricate
+rearrangements of repeats. Suggested ways to go faster:
-* last-split reads the alignments produced by lastal, and looks for a
- unique best alignment for each part of each query. It allows
- different parts of one query to match different parts of the genome,
- which may happen due to rearrangements. It has several useful
- options, please see `<last-split.html>`_.
+* `Mask repeats`_.
-Tuning speed, sensitivity, memory and disk usage
-------------------------------------------------
+* Add lastal option ``-k8`` (or ``-k16`` etc). This makes it faster,
+ by only finding initial matches starting at every 8th (or 16th etc)
+ position in the reads.
-* You can make LAST faster by `using multiple CPUs
- <last-parallel.html>`_.
+* Replace ``-uNEAR`` with ``-uRY32`` (or ``-uRY16``, ``-uRY8``,
+ ``-uRY4``). This makes it check for initial matches starting at
+ only ~1/32 (or ~1/16 etc) of positions, in both reads and genome.
+ Compared to ``-k``: this harms sensitivity slightly more, but
+ reduces memory use and makes lastdb faster.
-* You can `trade off speed, sensitivity, memory and disk usage
- <last-tuning.html>`_. To go much faster, try either lastal option
- ``-k32`` or lastdb option ``-uRY32`` (but not both). ``-k32`` is a
- bit more sensitive, but ``-uRY32`` uses less memory.
+Which genome version to use?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Example 6: Find very short DNA alignments
------------------------------------------
+Some genome versions (e.g. for human) have artificial
+exactly-duplicated regions, which makes it hard to align reads
+uniquely. To avoid that, look for a genome version called something
+like "analysis set".
-By default, LAST is quite strict, and only reports significant
-alignments that will rarely occur by chance. In the preceding
-example, the minimum alignment length is about 26 bases (less for
-smaller genomes). To find shorter alignments, we must down-tune the
-strictness::
+Aligning low-error long DNA reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- lastdb -uNEAR -R01 humandb human/chr*.fa
- lastal -D100 humandb queries.fa | last-split > myalns.maf
+We can do this the same way as for high-error reads, but perhaps
+accelerate more aggressively (e.g. ``-uRY32``).
-* -D100 makes lastal report alignments that could occur by chance once
- per hundred query letters. (The default is once per million.)
+If repeats are not masked, lastal_ option ``-C2`` may reduce run time
+with little effect on accuracy.
-In this example, the minimum alignment length is about 20 bases (less
-for smaller genomes).
+Aligning potentially-spliced RNA or cDNA long reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Example 7: Align human fastq sequences to the human genome
-----------------------------------------------------------
+See here_. (For low-error reads, you can probably omit ``-d90`` and
+``-m20``.)
-DNA sequences are not always perfectly accurate, and they are
-sometimes provided in fastq format, which indicates the reliability of
-each base. LAST can use this information to improve alignment
-accuracy. Option -Q1 indicates fastq-sanger format::
+Aligning Illumina DNA reads to a genome
+---------------------------------------
- lastdb -uNEAR -R01 humandb human/chr*.fa
- lastal -Q1 humandb queries.fastq | last-split > myalns.maf
+::
-**Assumption:** LAST assumes the reliabilities reflect substitution
-errors, not insertion/deletion errors. If that is not true, you can
-tell it to ignore the reliability data with -Q0.
+ lastdb -P8 -uNEAR -R01 -C2 mydb genome.fasta
+ last-train -P8 -Q1 mydb reads.fastq.gz > reads.train
+ lastal -P8 -p reads.train mydb reads.fastq.gz | last-split > out.maf
-Fastq format confusion
-----------------------
+Most LAST commands accept gzip-compressed (``.gz``) files.
-Unfortunately, there is more than one fastq format (see
-http://nar.oxfordjournals.org/content/38/6/1767.long). Recently
-(2013) fastq-sanger seems to be dominant, but if you have another
-variant you need to change the -Q option (see `<lastal.html>`_).
+lastdb_ option ``-C2`` makes the alignment a bit faster, but uses more
+memory. This has no effect on the results. (You could use it in the
+other examples too, but it might not be faster.)
-Paired DNA reads
-----------------
+``-Q1`` makes it use the fastq_ quality information to improve the
+training and alignment. LAST **assumes** that the qualities reflect
+substitution errors, not insertion/deletion errors. (For long
+non-Illumina reads, we suspect this assumption doesn't hold, so we
+didn't use this option.)
-If you have paired DNA reads, there are two options:
+This recipe may be excessively slow-and-sensitive. Adding lastal_
+option ``-C2`` may make it faster with negligible accuracy loss. You
+can accelerate with e.g. ``-uRY16`` or ``-k16`` as above.
-1. Use last-pair-probs (see `<last-pair-probs.html>`_).
+Finding very short DNA alignments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-2. Ignore the pairing information, and align the reads individually
- (using last-split as above). This may be useful because
- last-pair-probs does not currently allow different parts of one
- read to match different parts of the genome, though it does allow
- the two reads in a pair to match (e.g.) different chromosomes.
+By default, LAST only reports significant_ alignments that will rarely
+occur by chance. In the preceding example, the minimum alignment
+length is about 26 bases for a human-size genome (less for smaller
+genomes). To find shorter alignments, add lastal_ option ``-D100``
+(say), to get alignments that could occur by chance once per hundred
+query letters (the default is once per million.) This makes the
+minimum alignment length about 20 bases for a human-size genome.
-Example 8: Compare the human and mouse genomes
-----------------------------------------------
+Aligning paired-end Illumina DNA reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-See `here <https://github.com/mcfrith/last-genome-alignments>`_.
+You could use last-pair-probs_. It has a disadvantage: it doesn't
+allow different parts of one read (i.e. one "end") to align to
+different parts of the genome. Alternatively, you could align the
+reads individually, ignoring the pair relationships::
-Example 9: Compare the human and chimp genomes
-----------------------------------------------
+ fastq-interleave reads1.fq reads2.fq | lastal -P8 -p reads.train mydb | last-split > out.maf
-See `here <https://github.com/mcfrith/last-genome-alignments>`_. But
-that recipe is *extremely* slow-and-accurate. You can `tune
-<last-tuning.html>`_ it to compare huge, high-similarity genomes with
-moderate run time and memory use:
+``fastq-interleave`` ensures that each read has a unique name (by
+appending "/1" and "/2" if necessary).
-* Omit the sensitivity-boosting lastal -m option.
+Aligning potentially-spliced Illumina reads to a genome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-* Replace ``-uNEAR`` with ``-uRY32``.
+See last-split_ (and last-pair-probs_).
-* If the "reference" genome (the one given to lastdb) is > 4 GB, it's
- probably more efficient to use lastdb8 and lastal8, instead of
- lastdb and lastal.
+Aligning human & chimp genomes
+------------------------------
-Example 10: Ambiguity of alignment columns
-------------------------------------------
+This is very slow-and-sensitive::
+
+ lastdb -P8 -uNEAR -R01 humdb human_no_alt_analysis_set.fa
+ last-train -P8 --revsym -E0.05 -C2 humdb chimp.fa > humchi.train
+ lastal -E0.05 -C2 -p humchi.train humdb chimp.fa | last-split > humchi1.maf
+
+``--revsym`` makes the substitution rates the same on both strands.
+For example, it makes A→G equal T→C (because A→G on one strand means
+T→C on the other strand). This is usually appropriate for
+genome-genome comparison (but maybe not for mitochondria which have
+asymmetric "heavy" and "light" strands).
+
+``-E0.05`` means only get significant_ alignments that would be
+expected to occur by chance at a rate ≤ 0.05 times per pair of random
+sequences of length 1 billion each.
+
+The result so far is asymmetric: each part of the chimp genome is
+aligned to at most one part of the human genome, but not vice-versa.
+We can get one-to-one alignments like this::
+
+ maf-swap humchi1.maf | last-split > humchi2.maf
+
+Then we can discard less-confident alignments, and convert_ to a
+compact tabular format::
+
+ last-postmask humchi2.maf | maf-convert -n tab | awk -F= '$2 <= 1e-5' > humchi.tab
+
+last-postmask_ discards alignments caused by simple sequence. The
+``awk`` command gets alignments with `mismap probability`_ ≤ 10^-5.
+Finally, we can make a dotplot_::
+
+ last-dotplot humchi.tab humchi.png
+
+To go faster, with minor accuracy loss: replace ``-uNEAR`` with
+``-uRY32`` and/or `mask repeats`_.
+
+To squeeze out the last 0.000...1% of accuracy: add ``-m50`` to the
+lastal_ options.
+
+Aligning human & mouse genomes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can do this in the same way as human/chimp, except that ``-uNEAR``
+should be omitted. To increase sensitivity, but also time and memory
+use, add lastdb seeding_ option ``-uMAM4`` or or ``-uMAM8``. To
+increase them even more, add lastal_ option ``-m100`` (or as high as
+you can bear).
+
+Large reference sequences
+-------------------------
+
+If the sequences that you give to lastdb exceed ~4 billion letters,
+consider using 8-byte LAST (lastdb8_ and lastal8_). Ordinary (4-byte)
+LAST can't handle so much sequence at once, so lastdb_ splits it into
+"volumes", which may be inefficient. 8-byte LAST avoids voluming, but
+uses more memory. So lastdb8_ works well with a memory-reducing
+option: ``-uRY`` or ``-w`` or ``-W``.
+
+Moar faster
+-----------
+
+* `Using multiple CPUs / cores <last-parallel.html>`_
+* `Various speed & memory options <last-tuning.html>`_
+
+Ambiguity of alignment columns
+------------------------------
Consider this alignment::
@@ -214,13 +337,10 @@ Consider this alignment::
TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
The middle section has such weak similarity that its precise alignment
-cannot be confidently inferred.
-
-It is sometimes useful to estimate the ambiguity of each column in an
-alignment. We can do that using lastal option -j4::
+cannot be confidently inferred. We can see the confidence of each
+alignment column with lastal_ option ``-j4``::
- lastdb -cR01 humdb humanMito.fa
- lastal -j4 humdb fuguMito.fa > myalns.maf
+ lastal -j4 -p hufu.train humdb fuguMito.fa > myalns.maf
The output looks like this::
@@ -230,7 +350,7 @@ The output looks like this::
p %*.14442011.(%##"%$$$$###""!!!""""&'(*,340.,,.~~~~~~~~~~~
The "p" line indicates the probability that each column is wrongly
-aligned, using a compact code (the same as fastq-sanger format):
+aligned, using a compact code (the same as fastq_ format):
====== ================= ====== =================
Symbol Error probability Symbol Error probability
@@ -256,5 +376,26 @@ Note that each alignment is grown from a "core" region, and the
ambiguity estimates assume that the core is correctly aligned. The
core is indicated by "~" symbols, and it contains exact matches only.
-LAST has options to find alignments with optimal column probabilities,
-instead of optimal score: see `<lastal.html>`_.
+.. _last: last.html
+.. _lastdb8:
+.. _lastdb: lastdb.html
+.. _lastal8:
+.. _lastal: lastal.html
+.. _dotplot: last-dotplot.html
+.. _last-pair-probs: last-pair-probs.html
+.. _last-postmask: last-postmask.html
+.. _mismap probability:
+.. _last-split: last-split.html
+.. _last-train: last-train.html
+.. _convert:
+.. _maf-convert: maf-convert.html
+.. _scoring scheme: last-matrices.html
+.. _seeding scheme:
+.. _seeding: last-seeds.html
+.. _last-evalues:
+.. _significant:
+.. _significance: last-evalues.html
+.. _fastq: https://doi.org/10.1093/nar/gkp1137
+.. _here:
+.. _mask repeats: https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md
+.. _MAF: http://genome.ucsc.edu/FAQ/FAQformat.html#format5
=====================================
doc/last.html
=====================================
@@ -318,26 +318,25 @@ table.field-list { border: thin solid green }
<div class="document" id="last-genome-scale-sequence-comparison">
<h1 class="title">LAST: Genome-Scale Sequence Comparison</h1>
-<p>LAST finds similar regions between sequences, and aligns them. It is
-designed for comparing large datasets to each other (e.g. vertebrate
-genomes and/or large numbers of DNA reads). It can:</p>
+<p>LAST finds and aligns similar regions between sequences. It's
+designed for moderately large data (e.g. genomes, DNA reads,
+proteomes). It's especially good at:</p>
<ul>
-<li><p class="first">Indicate the (un)ambiguity of each column in an alignment.</p>
+<li><p class="first">Finding sequence rearrangements and recombinations: we believe
+<a class="reference external" href="last-split.html">last-split</a> does that more rigorously than anything else.</p>
</li>
-<li><p class="first">Use sequence quality data in a rigorous fashion.</p>
+<li><p class="first">Finding related parts of DNA and proteins, especially protein
+fossils (but introns are not considered).</p>
</li>
-<li><p class="first">Align DNA to proteins with frameshifts.</p>
+<li><p class="first">Unusual data, e.g. AT-rich DNA, because we can <a class="reference external" href="last-train.html">fit</a> parameters to
+the data and calculate <a class="reference external" href="last-evalues.html">significance</a>.</p>
</li>
-<li><p class="first">Compare PSSMs to sequences.</p>
-</li>
-<li><p class="first">Calculate the likelihood of chance similarities between random
-sequences.</p>
-</li>
-<li><p class="first">Do split and spliced alignment.</p>
-</li>
-<li><p class="first">Train alignment parameters for unusual kinds of sequence (e.g. nanopore).</p>
+<li><p class="first">Sensitive DNA-DNA search, due to <a class="reference external" href="last-train.html">fitting</a>, sensitive <a class="reference external" href="last-seeds.html">seeding</a>, and
+calculating <a class="reference external" href="last-evalues.html">significance</a>.</p>
</li>
</ul>
+<p>It can also: indicate the confidence/uncertainty of each column in an
+alignment, and use sequence quality data in a rigorous fashion.</p>
<div class="section" id="requirements">
<h2>Requirements</h2>
<p>To handle mammalian genomes, it's best if you have at least 10-20
@@ -354,17 +353,8 @@ compile the programs, type:</p>
<pre class="literal-block">
make
</pre>
-<p>You might get some harmless warning messages. If your compiler is
-really old, you might get an error message like this:</p>
-<pre class="literal-block">
-unrecognized command line option "-std=c++11"
-</pre>
-<p>In that case, you can compile like this (which will disable
-multi-threading):</p>
-<pre class="literal-block">
-make CXXFLAGS=-O3
-</pre>
-<p>Or you can specify another compiler like this:</p>
+<p>You might get some harmless warning messages. It's possible to
+specify a C++ compiler like this:</p>
<pre class="literal-block">
make CXX=MyOtherCompiler
</pre>
@@ -398,13 +388,9 @@ the new programs.</p>
Public License, either version 3 of the License, or (at your option)
any later version.</p>
</div>
-<div class="section" id="website">
-<h2>Website</h2>
-<p>LAST's website is: <a class="reference external" href="http://last.cbrc.jp/">http://last.cbrc.jp/</a></p>
-</div>
<div class="section" id="contact">
<h2>Contact</h2>
-<p>Questions and feedback are very welcome. We like to use a <strong>public</strong>
+<p>Questions and feedback are welcome. We like to use a <strong>public</strong>
mailing list, so everyone can benefit from the discussion:</p>
<blockquote>
last-align (ATmark) googlegroups (dot) com</blockquote>
@@ -419,9 +405,6 @@ will never send rude or mocking replies.</p>
error, please try to: tell us the LAST version number and exact error
message, and give us some input data and exact commands that trigger
the problem.</strong></p>
-<p>If you do a benchmarking test of LAST, we recommend you contact us to
-check you are using it in a suitable way. The history of
-bioinformatics benchmarks shows it is all too easy to get this wrong.</p>
</div>
</div>
</body>
=====================================
doc/last.txt
=====================================
@@ -1,18 +1,24 @@
LAST: Genome-Scale Sequence Comparison
======================================
-LAST finds similar regions between sequences, and aligns them. It is
-designed for comparing large datasets to each other (e.g. vertebrate
-genomes and/or large numbers of DNA reads). It can:
-
-* Indicate the (un)ambiguity of each column in an alignment.
-* Use sequence quality data in a rigorous fashion.
-* Align DNA to proteins with frameshifts.
-* Compare PSSMs to sequences.
-* Calculate the likelihood of chance similarities between random
- sequences.
-* Do split and spliced alignment.
-* Train alignment parameters for unusual kinds of sequence (e.g. nanopore).
+LAST finds and aligns similar regions between sequences. It's
+designed for moderately large data (e.g. genomes, DNA reads,
+proteomes). It's especially good at:
+
+* Finding sequence rearrangements and recombinations: we believe
+ last-split_ does that more rigorously than anything else.
+
+* Finding related parts of DNA and proteins, especially protein
+ fossils (but introns are not considered).
+
+* Unusual data, e.g. AT-rich DNA, because we can fit_ parameters to
+ the data and calculate significance_.
+
+* Sensitive DNA-DNA search, due to fitting_, sensitive seeding_, and
+ calculating significance_.
+
+It can also: indicate the confidence/uncertainty of each column in an
+alignment, and use sequence quality data in a rigorous fashion.
Requirements
------------
@@ -33,17 +39,8 @@ compile the programs, type::
make
-You might get some harmless warning messages. If your compiler is
-really old, you might get an error message like this::
-
- unrecognized command line option "-std=c++11"
-
-In that case, you can compile like this (which will disable
-multi-threading)::
-
- make CXXFLAGS=-O3
-
-Or you can specify another compiler like this::
+You might get some harmless warning messages. It's possible to
+specify a C++ compiler like this::
make CXX=MyOtherCompiler
@@ -80,15 +77,10 @@ LAST (including the scripts) is distributed under the GNU General
Public License, either version 3 of the License, or (at your option)
any later version.
-Website
--------
-
-LAST's website is: http://last.cbrc.jp/
-
Contact
-------
-Questions and feedback are very welcome. We like to use a **public**
+Questions and feedback are welcome. We like to use a **public**
mailing list, so everyone can benefit from the discussion:
last-align (ATmark) googlegroups (dot) com
@@ -106,6 +98,8 @@ error, please try to: tell us the LAST version number and exact error
message, and give us some input data and exact commands that trigger
the problem.**
-If you do a benchmarking test of LAST, we recommend you contact us to
-check you are using it in a suitable way. The history of
-bioinformatics benchmarks shows it is all too easy to get this wrong.
+.. _fit:
+.. _fitting: last-train.html
+.. _last-split: last-split.html
+.. _seeding: last-seeds.html
+.. _significance: last-evalues.html
=====================================
doc/lastal.html
=====================================
@@ -435,11 +435,13 @@ MAF.</p>
<kbd><span class="option">-D <var>LENGTH</var></span></kbd></td>
<td>Report alignments that are expected by chance at most once per
LENGTH query letters. This option only affects the default
-value of -E, so if you specify -E then -D has no effect.</td></tr>
+value of <tt class="docutils literal"><span class="pre">-E</span></tt>, so if you specify <tt class="docutils literal"><span class="pre">-E</span></tt> then <tt class="docutils literal"><span class="pre">-D</span></tt> has no
+effect.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-E <var>THRESHOLD</var></span></kbd></td>
<td>Maximum EG2 (<a class="reference external" href="last-evalues.html">expected alignments per square giga</a>). This option only affects the default
-value of -e, so if you specify -e then -E has no effect.</td></tr>
+value of <tt class="docutils literal"><span class="pre">-e</span></tt>, so if you specify <tt class="docutils literal"><span class="pre">-e</span></tt> then <tt class="docutils literal"><span class="pre">-E</span></tt> has no
+effect.</td></tr>
</tbody>
</table>
</blockquote>
@@ -507,12 +509,19 @@ regions have sizes j and k, where j ≤ k, the cost is: a +
b⋅(k-j) + c⋅j. If c ≥ a + 2b (the default), it reduces to
standard affine gaps.</td></tr>
<tr><td class="option-group">
-<kbd><span class="option">-F <var>COST</var></span></kbd></td>
+<kbd><span class="option">-F <var>LIST</var></span></kbd></td>
<td><p class="first">Align DNA queries to protein reference sequences, using the
-specified frameshift cost. A value of 15 seems to be
-reasonable. (As a special case, -F0 means DNA-versus-protein
-alignment without frameshifts, which is faster.) The output
-looks like this:</p>
+specified frameshift cost(s): either one cost (old-style
+frameshifts), or 4 comma-separated costs (new-style
+frameshifts). As a special case, <tt class="docutils literal"><span class="pre">-F0</span></tt> means
+DNA-versus-protein alignment without frameshifts, which is
+faster.</p>
+<p>The four new-style frameshift costs are for, in order: deletion
+of length k mod 3 = 1 bases, deletion of k mod 3 = 2 bases,
+insertion of k mod 3 = 1 bases, insertion of k mod 3 = 2 bases.
+(You're expected to get them from <a class="reference external" href="last-train.html">last-train</a>, not set them
+manually.)</p>
+<p>The output looks like this:</p>
<pre class="literal-block">
a score=108
s prot 2 40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
@@ -535,12 +544,12 @@ The default value is e-1, which arguably produces the best
alignments. Lower values improve speed, by quitting unpromising
extensions sooner. You can specify this parameter in 3 ways:</p>
<ul class="last">
-<li><p class="first">A score (e.g. -z20).</p>
+<li><p class="first">A score (e.g. <tt class="docutils literal"><span class="pre">-z20</span></tt>).</p>
</li>
-<li><p class="first">A percentage. For example, -z50% specifies 50% of the default
-value (rounded down to the nearest integer).</p>
+<li><p class="first">A percentage. For example, <tt class="docutils literal"><span class="pre">-z50%</span></tt> specifies 50% of the
+default value (rounded down to the nearest integer).</p>
</li>
-<li><p class="first">A maximum gap length. For example, -z8g sets the maximum
+<li><p class="first">A maximum gap length. For example, <tt class="docutils literal"><span class="pre">-z8g</span></tt> sets the maximum
score drop to: min[a+8b, A+8B]. However, this never increases
the value above the default.</p>
</li>
@@ -724,8 +733,10 @@ with the following meanings.</p>
</li>
</ol>
<p class="last">Details: Tantan is applied separately to forward and reverse
-strands. For DNA-versus-protein alignment (option -F), it is
-applied to the DNA after translation, at the protein level.</p>
+strands. For DNA-versus-protein alignment, if you use a codon
+substitution matrix (e.g. from <tt class="docutils literal"><span class="pre">last-train</span> <span class="pre">--codon</span></tt>), tantan
+is applied to the DNA before translation, else it is applied
+after translation.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-u <var>NUMBER</var></span></kbd></td>
@@ -766,14 +777,13 @@ alignment</p>
this core</p>
</li>
</ul>
-<p class="last">Use -w0 to turn this off.</p>
+<p class="last">Use <tt class="docutils literal"><span class="pre">-w0</span></tt> to turn this off.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-G <var>GENETIC-CODE</var></span></kbd></td>
-<td>Specify the genetic code for translating DNA to protein. This
-option has no effect unless DNA-versus-protein alignment is
-selected with option -F. Codes are specified by numbers
-(e.g. 1 = standard, 2 = vertebrate mitochondrial), listed here:
+<td>Specify the genetic code for translating DNA to protein. Codes
+are specified by numbers (e.g. 1 = standard, 2 = vertebrate
+mitochondrial), listed here:
<a class="reference external" href="https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi">https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi</a>. Any
other GENETIC-CODE is assumed to be a file name: for an example
of the format, see vertebrateMito.gc in the examples directory.</td></tr>
=====================================
doc/lastal.txt
=====================================
@@ -108,12 +108,14 @@ E-value options
-D LENGTH
Report alignments that are expected by chance at most once per
LENGTH query letters. This option only affects the default
- value of -E, so if you specify -E then -D has no effect.
+ value of ``-E``, so if you specify ``-E`` then ``-D`` has no
+ effect.
-E THRESHOLD
Maximum EG2 (`expected alignments per square giga
<last-evalues.html>`_). This option only affects the default
- value of -e, so if you specify -e then -E has no effect.
+ value of ``-e``, so if you specify ``-e`` then ``-E`` has no
+ effect.
Score options
~~~~~~~~~~~~~
@@ -175,12 +177,21 @@ Score options
b⋅(k-j) + c⋅j. If c ≥ a + 2b (the default), it reduces to
standard affine gaps.
- -F COST
+ -F LIST
Align DNA queries to protein reference sequences, using the
- specified frameshift cost. A value of 15 seems to be
- reasonable. (As a special case, -F0 means DNA-versus-protein
- alignment without frameshifts, which is faster.) The output
- looks like this::
+ specified frameshift cost(s): either one cost (old-style
+ frameshifts), or 4 comma-separated costs (new-style
+ frameshifts). As a special case, ``-F0`` means
+ DNA-versus-protein alignment without frameshifts, which is
+ faster.
+
+ The four new-style frameshift costs are for, in order: deletion
+ of length k mod 3 = 1 bases, deletion of k mod 3 = 2 bases,
+ insertion of k mod 3 = 1 bases, insertion of k mod 3 = 2 bases..
+ (You're expected to get them from last-train_, not set them
+ manually.)
+
+ The output looks like this::
a score=108
s prot 2 40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
@@ -202,12 +213,12 @@ Score options
alignments. Lower values improve speed, by quitting unpromising
extensions sooner. You can specify this parameter in 3 ways:
- * A score (e.g. -z20).
+ * A score (e.g. ``-z20``).
- * A percentage. For example, -z50% specifies 50% of the default
- value (rounded down to the nearest integer).
+ * A percentage. For example, ``-z50%`` specifies 50% of the
+ default value (rounded down to the nearest integer).
- * A maximum gap length. For example, -z8g sets the maximum
+ * A maximum gap length. For example, ``-z8g`` sets the maximum
score drop to: min[a+8b, A+8B]. However, this never increases
the value above the default.
@@ -361,8 +372,10 @@ Miscellaneous options
2. Convert simple repeats, within AT-rich DNA, to lowercase.
Details: Tantan is applied separately to forward and reverse
- strands. For DNA-versus-protein alignment (option -F), it is
- applied to the DNA after translation, at the protein level.
+ strands. For DNA-versus-protein alignment, if you use a codon
+ substitution matrix (e.g. from ``last-train --codon``), tantan
+ is applied to the DNA before translation, else it is applied
+ after translation.
-u NUMBER
Specify treatment of lowercase letters when extending
@@ -398,13 +411,12 @@ Miscellaneous options
* has start coordinates offset by DISTANCE or less relative to
this core
- Use -w0 to turn this off.
+ Use ``-w0`` to turn this off.
-G GENETIC-CODE
- Specify the genetic code for translating DNA to protein. This
- option has no effect unless DNA-versus-protein alignment is
- selected with option -F. Codes are specified by numbers
- (e.g. 1 = standard, 2 = vertebrate mitochondrial), listed here:
+ Specify the genetic code for translating DNA to protein. Codes
+ are specified by numbers (e.g. 1 = standard, 2 = vertebrate
+ mitochondrial), listed here:
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi. Any
other GENETIC-CODE is assumed to be a file name: for an example
of the format, see vertebrateMito.gc in the examples directory.
@@ -588,3 +600,5 @@ lastal8
lastal8 has identical usage to lastal, and is used with `lastdb8
<lastdb.html>`_. lastal cannot read the output of lastdb8, and
lastal8 cannot read the output of lastdb.
+
+.. _last-train: last-train.html
=====================================
scripts/last-train
=====================================
@@ -19,6 +19,7 @@ import sys
import tempfile
proteinAlphabet20 = "ACDEFGHIKLMNPQRSTVWY"
+proteinAlphabet21 = proteinAlphabet20 + "*"
def myOpen(fileName): # faster than fileinput
if fileName == "-":
@@ -183,7 +184,7 @@ def scaleFromHeader(lines):
raise Exception("couldn't read the scale")
def countsFromLastOutput(lines, opts):
- nTransitions = 5
+ nTransitions = 9 if opts.codon else 5
tranCounts = [1.0] * nTransitions # +1 pseudocounts
tranCounts[1] = 2.0 # deletes: opens + extensions, so 2 pseudocounts
tranCounts[2] = 2.0 # inserts: opens + extensions, so 2 pseudocounts
@@ -196,13 +197,15 @@ def countsFromLastOutput(lines, opts):
counts = [float(i) for i in line.split()[1:]]
if not countMatrix:
matrixSize = len(counts) - nTransitions
- nCols = int(math.sqrt(matrixSize))
+ nCols = 64 if opts.codon else int(math.sqrt(matrixSize))
nRows = matrixSize // nCols
- countMatrix = [[1.0] * nCols for i in range(nRows)]
- identities = sum(counts[i * nCols + i] for i in range(nRows))
- alignmentLength = sum(counts[matrixSize + i] for i in range(3))
- if 100 * identities > opts.pid * alignmentLength:
- continue
+ pseudocount = 0.0 if opts.codon else 1.0
+ countMatrix = [[pseudocount] * nCols for i in range(nRows)]
+ if not opts.codon:
+ identities = sum(counts[i * nCols + i] for i in range(nRows))
+ alignmentLength = sum(counts[matrixSize + i] for i in range(3))
+ if 100 * identities > opts.pid * alignmentLength:
+ continue
for i in range(nRows):
for j in range(nCols):
if strand == "+" or opts.S != "1":
@@ -214,6 +217,13 @@ def countsFromLastOutput(lines, opts):
alignments += 1
if not alignments:
raise Exception("no alignments")
+ if opts.codon:
+ pseudocounts = nRows * 32 # xxx ???
+ rowSums = [sum(i) + 1 for i in countMatrix]
+ colSums = [sum(i) + 1 for i in zip(*countMatrix)]
+ mul = pseudocounts / (sum(rowSums) * sum(colSums))
+ countMatrix = [[x + mul * i * j for j, x in zip(colSums, row)]
+ for i, row in zip(rowSums, countMatrix)]
return countMatrix, tranCounts + [alignments]
def scoreFromProb(scale, prob):
@@ -414,21 +424,260 @@ def scoresAndScale(originalScale, matParams, delRatios, insRatios):
"doubling the scale")
originalScale *= 2
-def writeGapCosts(delCosts, insCosts, isLastFormat, outFile):
- delInit, delGrow = delCosts
- insInit, insGrow = insCosts
- delOpen = delInit - delGrow
- insOpen = insInit - insGrow
+### Routines for codons & frameshifts:
+
+def initialCodonSubstitutionProbs(matchProb):
+ aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
+ b1 = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
+ b2 = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
+ b3 = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"
+
+ p = matchProb / 61
+ q = (1/64 - p) / 19
+ if q <= 0:
+ raise Exception("initial match probability must be < 61/64")
+ matrix = [[q for j in aa] for i in proteinAlphabet20]
+ for a, x, y, z in zip(aa, b1, b2, b3):
+ codon = "ACGT".index(x) * 16 + "ACGT".index(y) * 4 + "ACGT".index(z)
+ if a == "*":
+ for row in matrix:
+ row[codon] = 1 / (64 * 20)
+ else:
+ matrix[proteinAlphabet20.index(a)][codon] = p
+ return matrix
+
+def initialCodonProbs(opts):
+ matProbs = initialCodonSubstitutionProbs(float(opts.r))
+ delOpenProb = float(opts.a)
+ delGrowProb = float(opts.b)
+ insOpenProb = float(opts.A)
+ insGrowProb = float(opts.B)
+ matchProb = 0.99 - delOpenProb - insOpenProb
+ if opts.F:
+ delProb1, delProb2, insProb1, insProb2 = map(float, opts.F.split(","))
+ else:
+ delProb1 = delProb2 = 1 - delGrowProb
+ insProb1 = insProb2 = 1 - insGrowProb
+ delProbs = delOpenProb, delGrowProb, delProb1, delProb2
+ insProbs = insOpenProb, insGrowProb, insProb1, insProb2
+ return matProbs, (matchProb, delProbs, insProbs)
+
+def formattedCodons(spec):
+ a = "acgt"
+ return (format(i + j + k, spec) for i in a for j in a for k in a)
+
+def printCodonCountMatrix(matrix):
+ print("#", " ", *formattedCodons("5"))
+ for x, row in zip(proteinAlphabet21, matrix):
+ print("#", x, *(format(i, "<5.4g") for i in row))
+
+def writeCodonScoreMatrix(outFile, matrixAndProbs, prefix):
+ matrix, rowProbs, colProbs = matrixAndProbs
+ maxLen = max(len(str(x)) for row in matrix for x in row)
+ spec = ">" + str(max(maxLen, 3))
+ print(prefix + " ", *formattedCodons(spec), file=outFile)
+ for x, row, p in zip(proteinAlphabet21, matrix, rowProbs):
+ r = " ".join(format(i, spec) for i in row)
+ print(prefix + x, r, p, file=outFile)
+ print(prefix + " ", *(format(i, spec) for i in colProbs), file=outFile)
+
+def codonMatProbsFromCounts(counts, opts):
+ total = sum(map(sum, counts))
+ probs = [[j / total for j in i] for i in counts]
+ print("# count matrix "
+ "(query letters = columns, reference letters = rows):")
+ printCodonCountMatrix(counts)
+ print()
+ return probs
+
+def freqText(probability):
+ p = 100 * probability
+ t = format(p, ".2")
+ if len(t) > 3:
+ t = format(p, "<3.2g")
+ if len(t) > 3:
+ t = t.lstrip("0")
+ return t
+
+def frameshiftProbImbalance(endProb, matchProb, delProbs, insProbs):
+ insOpenProb, insGrowProb, ins1, ins2 = insProbs
+ delOpenProb, delGrowProb, del1, del2 = delProbs
+ iNum = insOpenProb * (ins1 * endProb ** 2 + ins2 * (1 - ins1) * endProb
+ + (1 - ins1) * (1 - ins2) * (1 - insGrowProb))
+ iDen = endProb ** 3 - (1 - ins1) * (1 - ins2) * insGrowProb
+ dNum = delOpenProb * (del1 / endProb ** 2 + del2 * (1 - del1) / endProb
+ + (1 - del1) * (1 - del2) * (1 - delGrowProb))
+ dDen = endProb ** 3 - (1 - del1) * (1 - del2) * delGrowProb
+ return 1 - matchProb / endProb ** 6 - iNum / iDen - dNum / dDen
+
+def balancedFrameshiftEndProb(*args):
+ matchProb, delProbs, insProbs = args
+ insOpenProb, insGrowProb, ins1, ins2 = insProbs
+ delOpenProb, delGrowProb, del1, del2 = delProbs
+ lowerBound = max((1 - ins1) * (1 - ins2) * insGrowProb,
+ (1 - del1) * (1 - del2) * delGrowProb) ** (1/3)
+ upperBound = 1.0
+ return rootOfIncreasingFunction(frameshiftProbImbalance,
+ lowerBound, upperBound, args)
+
+def frameshiftProbsFromCounts(counts, opts):
+ (matches, deletes, inserts, delOpens0, insOpens0,
+ delOpens1, delOpens2, insOpens1, insOpens2, alignments) = counts
+ delOpens = delOpens0 + delOpens1 + delOpens2
+ insOpens = insOpens0 + insOpens1 + insOpens2
+ denominator = matches + insOpens + delOpens + (alignments + 1)
+ matchProb = matches / denominator
+
+ insOpenProb = insOpens / denominator
+ insGrowProb = (inserts - insOpens0) / inserts
+ insProb2 = insOpens2 / (inserts + insOpens2)
+ insProb1 = insOpens1 / (inserts + insOpens2 + insOpens1)
+
+ delOpenProb = delOpens / denominator
+ delGrowProb = (deletes - delOpens0) / deletes
+ delProb2 = delOpens2 / (deletes + delOpens2)
+ delProb1 = delOpens1 / (deletes + delOpens2 + delOpens1)
+
+ print("# aligned residue/codon pairs: %.12g" % matches)
+ print("# whole codon deletes: %.12g" % deletes)
+ print("# whole codon inserts: %.12g" % inserts)
+ print("# delOpens: %.12g" % delOpens)
+ print("# insOpens: %.12g" % insOpens)
+ print("# frameshifts del-1,del-2,ins+1,ins+2: %.12g,%.12g,%.12g,%.12g"
+ % (delOpens1, delOpens2, insOpens1, insOpens2))
+ print("# alignments:", alignments)
+ print("# matchProb: %g" % matchProb)
+ print("# delOpenProb: %g" % delOpenProb)
+ print("# insOpenProb: %g" % insOpenProb)
+ print("# delExtendProb: %g" % delGrowProb)
+ print("# insExtendProb: %g" % insGrowProb)
+ print("# frameshiftProbs del-1,del-2,ins+1,ins+2: %g,%g,%g,%g"
+ % (delProb1, delProb2, insProb1, insProb2))
+ print()
+ delProbs = delOpenProb, delGrowProb, delProb1, delProb2
+ insProbs = insOpenProb, insGrowProb, insProb1, insProb2
+ return matchProb, delProbs, insProbs
+
+def frameshiftRatiosFromProbs(matchProb, delProbs, insProbs):
+ delOpenProb, delGrowProb, del1, del2 = delProbs
+ insOpenProb, insGrowProb, ins1, ins2 = insProbs
+
+ endProb = balancedFrameshiftEndProb(matchProb, delProbs, insProbs)
+
+ matchRatio = matchProb / endProb ** 6
+
+ insAdj = (1 - insGrowProb) / insGrowProb
+ insOpenRatio = insOpenProb * insAdj
+ insMean = ((1 - ins1) * (1 - ins2) * insGrowProb) ** (1/3)
+ insRatio0 = insMean / endProb
+ insRatio1 = ins1 / (insAdj * insMean)
+ insRatio2 = ins2 * (1 - ins1) / (insAdj * insMean ** 2)
+ insRatios = insOpenRatio, insRatio0, insRatio1, insRatio2
+
+ delAdj = (1 - delGrowProb) / delGrowProb
+ delOpenRatio = delOpenProb * delAdj
+ delMean = ((1 - del1) * (1 - del2) * delGrowProb) ** (1/3)
+ delRatio0 = delMean / endProb
+ delRatio1 = del1 / (delAdj * delMean * endProb ** 4)
+ delRatio2 = del2 * (1 - del1) / (delAdj * (delMean * endProb) ** 2)
+ delRatios = delOpenRatio, delRatio0, delRatio1, delRatio2
+
+ return matchRatio, delRatios, insRatios
+
+def frameshiftCostsFromProbRatios(scale, gapRatios):
+ gapOpenRatio, gapRatio0, gapRatio1, gapRatio2 = gapRatios
+ gapOpenCost = costFromProb(scale, gapOpenRatio)
+ gapCost0 = max(costFromProb(scale, gapRatio0), 1)
+ gapCost1 = costFromProb(scale, gapRatio1)
+ gapCost2 = costFromProb(scale, gapRatio2)
+ return gapOpenCost, gapCost0, gapCost1, gapCost2
+
+def frameshiftImbalanceFromGap(scale, gapCosts):
+ gapOpenCost, gapCost0, gapCost1, gapCost2 = gapCosts
+ a = math.exp(-gapOpenCost / scale)
+ b = math.exp(-gapCost0 / scale)
+ f = math.exp(-gapCost1 / scale)
+ g = math.exp(-gapCost2 / scale)
+ return a * b * (f + g * b + b * b) / (1 - b ** 3)
+
+def frameshiftScoreImbalance(scale, matScores, rowProbs, colProbs,
+ delCosts, insCosts):
+ d = frameshiftImbalanceFromGap(scale, delCosts)
+ i = frameshiftImbalanceFromGap(scale, insCosts)
+ m = sum(x * y * math.exp(matScores[i][j] / scale)
+ for i, x in enumerate(rowProbs) for j, y in enumerate(colProbs))
+ return m + d + i - 1
+
+def normalizedFreqs(freqs):
+ x = list(map(float, freqs))
+ s = sum(x)
+ return [i / s for i in x]
+
+def codonScoresAndScale(originalScale, matParams, delRatios, insRatios):
+ matchRatio, matProbs, rowFreqs, colFreqs = matParams
+ rowProbs = normalizedFreqs(rowFreqs)
+ colProbs = normalizedFreqs(colFreqs)
+ matParams = matchRatio, matProbs, rowProbs, colProbs
+ matScores = matScoresFromProbs(originalScale, *matParams)
+ delCosts = frameshiftCostsFromProbRatios(originalScale, delRatios)
+ insCosts = frameshiftCostsFromProbRatios(originalScale, insRatios)
+ args = matScores, rowProbs, colProbs, delCosts, insCosts
+ scale = balancedScale(frameshiftScoreImbalance, originalScale, args)
+ assert scale > 0 # XXX
+ matScores = matScores, rowFreqs, colFreqs
+ return matScores, delCosts, insCosts, scale
+
+def isCloseEnough(oldParameters, newParameters):
+ delCosts0, insCosts0, substitutionParameters0 = oldParameters
+ m0, rowFreqs0, colFreqs0 = substitutionParameters0
+
+ delCosts1, insCosts1, substitutionParameters1 = newParameters
+ m1, rowFreqs1, colFreqs1 = substitutionParameters1
+
+ return (delCosts0 == delCosts1 and insCosts0 == insCosts1 and
+ all(abs(i - j) < 2 for x, y in zip(m0, m1) for i, j in zip(x, y)))
+
+### End of routines for codons & frameshifts
+
+def writeGapCosts(opts, delCosts, insCosts, isLastFormat, outFile):
+ if opts.codon:
+ delOpen, delGrow, del1, del2 = delCosts
+ insOpen, insGrow, ins1, ins2 = insCosts
+ frameshiftCosts = del1, del2, ins1, ins2
+ frameshiftCosts = ",".join(map(str, frameshiftCosts))
+ else:
+ delInit, delGrow = delCosts
+ insInit, insGrow = insCosts
+ delOpen = delInit - delGrow
+ insOpen = insInit - insGrow
if isLastFormat:
print("#last -a", delOpen, file=outFile)
print("#last -A", insOpen, file=outFile)
print("#last -b", delGrow, file=outFile)
print("#last -B", insGrow, file=outFile)
+ if opts.codon:
+ print("#last -F", frameshiftCosts, file=outFile)
else:
print("# delExistCost:", delOpen, file=outFile)
print("# insExistCost:", insOpen, file=outFile)
print("# delExtendCost:", delGrow, file=outFile)
print("# insExtendCost:", insGrow, file=outFile)
+ if opts.codon:
+ print("# frameshiftCosts del-1,del-2,ins+1,ins+2:",
+ frameshiftCosts, file=outFile)
+
+def probsFromFile(opts, lastalArgs, lines):
+ print("#", *lastalArgs)
+ print()
+ sys.stdout.flush()
+ matCounts, gapCounts = countsFromLastOutput(lines, opts)
+ if opts.codon:
+ gapProbs = frameshiftProbsFromCounts(gapCounts, opts)
+ matProbs = codonMatProbsFromCounts(matCounts, opts)
+ else:
+ gapProbs = gapProbsFromCounts(gapCounts, opts)
+ matProbs = matProbsFromCounts(matCounts, opts)
+ return matProbs, gapProbs
def tryToMakeChildProgramsFindable():
d = os.path.dirname(__file__)
@@ -458,6 +707,8 @@ def fixedLastalArgs(opts, lastalProgName):
if opts.X: x.append("-X" + opts.X)
if opts.Q: x.append("-Q" + opts.Q)
if opts.verbose: x.append("-" + "v" * opts.verbose)
+ if opts.codon:
+ x.append("-K1")
return x
def process(args, inStream):
@@ -480,78 +731,97 @@ def lastSplitProcess(opts, proc):
def doTraining(opts, args):
tryToMakeChildProgramsFindable()
lastalProgName = readLastalProgName(args[0])
- scaleIncrease = 20 # while training, up-scale the scores by this amount
lastalVersion = versionFromLastal()
+ print("# lastal version:", lastalVersion)
+
+ if opts.codon:
+ scaleIncrease = 1
+ gapRatiosFunc = frameshiftRatiosFromProbs
+ scoresAndScaleFunc = codonScoresAndScale
+ writeScoreMatrixFunc = writeCodonScoreMatrix
+ else:
+ scaleIncrease = 20 # while training, upscale the scores by this amount
+ gapRatiosFunc = gapRatiosFromProbs
+ scoresAndScaleFunc = scoresAndScale
+ writeScoreMatrixFunc = writeScoreMatrix
- lastalArgs = fixedLastalArgs(opts, lastalProgName)
- if opts.r: lastalArgs.append("-r" + opts.r)
- if opts.q: lastalArgs.append("-q" + opts.q)
- if opts.p: lastalArgs.append("-p" + opts.p)
- if opts.a: lastalArgs.append("-a" + opts.a)
- if opts.b: lastalArgs.append("-b" + opts.b)
- if opts.A: lastalArgs.append("-A" + opts.A)
- if opts.B: lastalArgs.append("-B" + opts.B)
- lastalArgs += args
- proc = process(lastalArgs, None)
- proc = lastSplitProcess(opts, proc)
+ print("# maximum percent identity:", opts.pid)
+ lastalArgs = fixedLastalArgs(opts, lastalProgName)
+ if opts.r: lastalArgs.append("-r" + opts.r)
+ if opts.q: lastalArgs.append("-q" + opts.q)
+ if opts.p: lastalArgs.append("-p" + opts.p)
+ if opts.a: lastalArgs.append("-a" + opts.a)
+ if opts.b: lastalArgs.append("-b" + opts.b)
+ if opts.A: lastalArgs.append("-A" + opts.A)
+ if opts.B: lastalArgs.append("-B" + opts.B)
+ lastalArgs += args
+ proc = process(lastalArgs, None)
+ proc = lastSplitProcess(opts, proc)
+ if not opts.scale:
+ externalScale = scaleFromHeader(proc.stdout)
if opts.scale:
externalScale = opts.scale / math.log(2)
- else:
- externalScale = scaleFromHeader(proc.stdout)
internalScale = externalScale * scaleIncrease
oldParameters = []
- print("# lastal version:", lastalVersion)
- print("# maximum percent identity:", opts.pid)
print("# scale of score parameters:", externalScale)
print("# scale used while training:", internalScale)
print()
+ if opts.codon:
+ matProbs, gapProbs = initialCodonProbs(opts)
+ else:
+ matProbs, gapProbs = probsFromFile(opts, lastalArgs, proc.stdout)
+
while True:
- print("#", *lastalArgs)
- print()
- sys.stdout.flush()
- matCounts, gapCounts = countsFromLastOutput(proc.stdout, opts)
- gapProbs = gapProbsFromCounts(gapCounts, opts)
- matProbs = matProbsFromCounts(matCounts, opts)
- matchRatio, delRatios, insRatios = gapRatiosFromProbs(*gapProbs)
+ matchRatio, delRatios, insRatios = gapRatiosFunc(*gapProbs)
rowProbs = [sum(i) for i in matProbs]
colProbs = [sum(i) for i in zip(*matProbs)]
+ if opts.codon:
+ rowProbs = [freqText(i) for i in rowProbs]
+ colProbs = [freqText(i) for i in colProbs]
matParams = matchRatio, matProbs, rowProbs, colProbs
- sas = scoresAndScale(internalScale, matParams, delRatios, insRatios)
- matScores, delCosts, insCosts, scale = sas
- writeGapCosts(delCosts, insCosts, False, None)
+ ss = scoresAndScaleFunc(internalScale, matParams, delRatios, insRatios)
+ matScores, delCosts, insCosts, scale = ss
+ writeGapCosts(opts, delCosts, insCosts, False, None)
print()
print("# score matrix "
"(query letters = columns, reference letters = rows):")
- writeScoreMatrix(sys.stdout, matScores, "# ")
+ writeScoreMatrixFunc(sys.stdout, matScores, "# ")
print()
parameters = delCosts, insCosts, matScores
- if parameters in oldParameters: break
+ if opts.codon:
+ if any(isCloseEnough(i, parameters) for i in oldParameters):
+ break
+ else:
+ if parameters in oldParameters:
+ break
oldParameters.append(parameters)
lastalArgs = fixedLastalArgs(opts, lastalProgName)
lastalArgs.append("-t{0:.6}".format(scale))
lastalArgs.append("-p-")
lastalArgs += args
proc = process(lastalArgs, subprocess.PIPE)
- writeGapCosts(delCosts, insCosts, True, proc.stdin)
- writeScoreMatrix(proc.stdin, matScores, "")
+ writeGapCosts(opts, delCosts, insCosts, True, proc.stdin)
+ writeScoreMatrixFunc(proc.stdin, matScores, "")
proc.stdin.close()
- proc = lastSplitProcess(opts, proc)
+ if not opts.codon:
+ proc = lastSplitProcess(opts, proc)
+ matProbs, gapProbs = probsFromFile(opts, lastalArgs, proc.stdout)
- sas = scoresAndScale(externalScale, matParams, delRatios, insRatios)
- matScores, delCosts, insCosts, scale = sas
+ ss = scoresAndScaleFunc(externalScale, matParams, delRatios, insRatios)
+ matScores, delCosts, insCosts, scale = ss
if opts.X: print("#last -X", opts.X)
if opts.Q: print("#last -Q", opts.Q)
print("#last -t{0:.6}".format(scale))
- writeGapCosts(delCosts, insCosts, True, None)
+ writeGapCosts(opts, delCosts, insCosts, True, None)
if opts.s: print("#last -s", opts.s)
if opts.S: print("#last -S", opts.S)
print("# score matrix "
"(query letters = columns, reference letters = rows):")
- writeScoreMatrix(sys.stdout, matScores, "")
+ writeScoreMatrixFunc(sys.stdout, matScores, "")
def lastTrain(opts, args):
if opts.sample_number:
@@ -586,12 +856,15 @@ if __name__ == "__main__":
"skip alignments with > PID% identity (default: %default)")
og.add_option("--postmask", type="int", metavar="NUMBER", default=1, help=
"skip mostly-lowercase alignments (default=%default)")
- og.add_option("--sample-number", type="int", default=500, metavar="N",
- help="number of random sequence samples (default: %default)")
+ og.add_option("--sample-number", type="int", metavar="N",
+ help="number of random sequence samples "
+ "(default: 20000 if --codon else 500)")
og.add_option("--sample-length", type="int", default=2000, metavar="L",
help="length of each sample (default: %default)")
og.add_option("--scale", type="float", metavar="S",
help="output scores in units of 1/S bits")
+ og.add_option("--codon", action="store_true",
+ help="DNA queries & protein reference, with frameshifts")
op.add_option_group(og)
og = optparse.OptionGroup(op, "Initial parameter options")
@@ -606,6 +879,8 @@ if __name__ == "__main__":
help="gap extension cost (default: 9 if Q>=1, else 3)")
og.add_option("-A", metavar="COST", help="insertion existence cost")
og.add_option("-B", metavar="COST", help="insertion extension cost")
+ og.add_option("-F", metavar="LIST", help="frameshift probabilities: "
+ "del-1,del-2,ins+1,ins+2 (default: 1-b,1-b,1-B,1-B)")
op.add_option_group(og)
og = optparse.OptionGroup(op, "Alignment options")
@@ -638,8 +913,18 @@ if __name__ == "__main__":
(opts, args) = op.parse_args()
if len(args) < 1:
op.error("I need a lastdb index and query sequences")
+ if opts.sample_number is None:
+ opts.sample_number = 20000 if opts.codon else 500
if not opts.sample_number and (len(args) < 2 or "-" in args):
op.error("sorry, can't use stdin when --sample-number=0")
+ if opts.codon:
+ if not opts.scale: opts.scale = 3
+ if not opts.r: opts.r = "0.4"
+ if not opts.a: opts.a = "0.02"
+ if not opts.b: opts.b = "0.5"
+ if not opts.A: opts.A = opts.a
+ if not opts.B: opts.B = opts.b
+ opts.S = None
if not opts.p and (not opts.Q or opts.Q in ("0", "fastx", "keep")):
if not opts.r: opts.r = "5"
if not opts.q: opts.q = "5"
=====================================
src/Alignment.cc
=====================================
@@ -50,7 +50,7 @@ void Alignment::makeXdrop( Aligners &aligners, bool isGreedy,
const uchar* qual1, const uchar* qual2,
const Alphabet& alph, AlignmentExtras& extras,
double gamma, int outputType ){
- score = seed.score;
+ if (probMatrix) score = seed.score; // else keep the old score
if (outputType > 3 && !gap.isNewFrameshifts()) extras.fullScore = seed.score;
if( outputType == 7 ){
@@ -62,6 +62,7 @@ void Alignment::makeXdrop( Aligners &aligners, bool isGreedy,
}
// extend a gapped alignment in the left/reverse direction from the seed:
+ blocks.clear();
std::vector<char>& columnAmbiguityCodes = extras.columnAmbiguityCodes;
extend( blocks, columnAmbiguityCodes, aligners, isGreedy,
seq1, seq2, seed.beg1(), seed.beg2(), false, globality,
@@ -310,23 +311,25 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
while (aligner.getNextChunkFrame(end1, end2, size, gapCost, gap))
chunks.push_back(SegmentPair(end1 - size, end2 - size * 3, size,
gapCost));
- FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
- double probDropLimit = exp(scale * -maxDrop);
- double s = fxa.forward(seq1 + start1, seq2 + start2,
- seq2 + dnaToAa(frame1, frameSize),
- seq2 + dnaToAa(frame2, frameSize),
- isForward, probMat, gap, probDropLimit);
- score += s / scale;
- if (outputType > 3) {
- fxa.backward(isForward, probMat, gap);
- getColumnCodes(fxa, columnCodes, chunks);
- if (outputType == 7) {
- double *ec = &extras.expectedCounts[0];
- double *subsCounts[scoreMatrixRowSize];
- for (int i = 0; i < scoreMatrixRowSize; ++i)
- subsCounts[i] = ec + i * scoreMatrixRowSize;
- double *tranCounts = ec + scoreMatrixRowSize * scoreMatrixRowSize;
- fxa.count(isForward, gap, subsCounts, tranCounts);
+ if (probMat) {
+ FrameshiftXdropAligner &fxa = aligners.frameshiftAligner;
+ double probDropLimit = exp(scale * -maxDrop);
+ double s = fxa.forward(seq1 + start1, seq2 + start2,
+ seq2 + dnaToAa(frame1, frameSize),
+ seq2 + dnaToAa(frame2, frameSize),
+ isForward, probMat, gap, probDropLimit);
+ score += s / scale;
+ if (outputType > 3) {
+ fxa.backward(isForward, probMat, gap);
+ getColumnCodes(fxa, columnCodes, chunks);
+ if (outputType == 7) {
+ double *ec = &extras.expectedCounts[0];
+ double *subsCounts[scoreMatrixRowSize];
+ for (int i = 0; i < scoreMatrixRowSize; ++i)
+ subsCounts[i] = ec + i * scoreMatrixRowSize;
+ double *tranCounts = ec + scoreMatrixRowSize * scoreMatrixRowSize;
+ fxa.count(isForward, gap, subsCounts, tranCounts);
+ }
}
}
} else {
=====================================
src/LastalArguments.cc
=====================================
@@ -146,7 +146,7 @@ Score options (default settings):\n\
-A: insertion existence cost (a)\n\
-B: insertion extension cost (b)\n\
-c: unaligned residue pair cost (off)\n\
--F: frameshift cost (off)\n\
+-F: frameshift cost(s) (off)\n\
-x: maximum score drop for preliminary gapped alignments (z)\n\
-y: maximum score drop for gapless alignments (min[t*10, x])\n\
-z: maximum score drop for final gapped alignments (e-1)\n\
=====================================
src/lastal.cc
=====================================
@@ -458,7 +458,7 @@ static const ScoreMatrixRow *getQueryPssm(const LastAligner &aligner,
return reinterpret_cast<const ScoreMatrixRow *>(&qualityPssm[0]);
}
-namespace Phase{ enum Enum{ gapless, gapped, final }; }
+namespace Phase{ enum Enum{ gapless, pregapped, gapped, postgapped }; }
static bool isFullScoreThreshold() {
return gapCosts.isNewFrameshifts();
@@ -466,7 +466,7 @@ static bool isFullScoreThreshold() {
static bool isMaskLowercase(Phase::Enum e) {
return (e < 1 && args.maskLowercase > 0)
- || (e < 2 && args.maskLowercase > 1 && isFullScoreThreshold())
+ || (e < 3 && args.maskLowercase > 1 && isFullScoreThreshold())
|| args.maskLowercase > 2;
}
@@ -493,7 +493,7 @@ struct Dispatcher{
r( isMaskLowercase(e) ? matrices.ratiosMasked : matrices.ratios ),
t( isMaskLowercase(e) ? matrices.twoQualMasked : matrices.twoQual ),
d( (e == Phase::gapless) ? args.maxDropGapless :
- (e == Phase::gapped ) ? args.maxDropGapped : args.maxDropFinal ),
+ (e == Phase::pregapped ) ? args.maxDropGapped : args.maxDropFinal ),
z( t ? 2 : p ? 1 : 0 ){}
int gaplessOverlap( indexT x, indexT y, size_t &rev, size_t &fwd ) const{
@@ -637,10 +637,9 @@ void alignGapless1(LastAligner &aligner, SegmentPairPot &gaplessAlns,
}
// Find query matches to the suffix array, and do gapless extensions
-void alignGapless( LastAligner& aligner, SegmentPairPot& gaplessAlns,
- size_t queryNum, const SubstitutionMatrices &matrices,
- const uchar* querySeq ){
- Dispatcher dis(Phase::gapless, aligner, queryNum, matrices, querySeq);
+void alignGapless(LastAligner &aligner, SegmentPairPot &gaplessAlns,
+ size_t queryNum, const uchar *querySeq,
+ const Dispatcher &dis) {
DiagonalTable dt; // record already-covered positions on each diagonal
size_t maxAlignments =
args.maxAlignmentsPerQueryStrand ? args.maxAlignmentsPerQueryStrand : 1;
@@ -713,9 +712,8 @@ void shrinkToLongestIdenticalRun( SegmentPair& sp, const Dispatcher& dis ){
void alignGapped( LastAligner& aligner,
AlignmentPot& gappedAlns, SegmentPairPot& gaplessAlns,
size_t queryNum, const SubstitutionMatrices &matrices,
- const uchar* querySeq, Phase::Enum phase ){
+ const uchar* querySeq, size_t frameSize, Phase::Enum phase ){
Dispatcher dis(phase, aligner, queryNum, matrices, querySeq);
- size_t frameSize = args.isFrameshift() ? (query.padLen(queryNum) / 3) : 0;
countT gappedExtensionCount = 0, gappedAlignmentCount = 0;
// Redo the gapless extensions, using gapped score parameters.
@@ -755,11 +753,10 @@ void alignGapped( LastAligner& aligner,
shrinkToLongestIdenticalRun( aln.seed, dis );
// do gapped extension from each end of the seed:
- aln.makeXdrop(aligner.engines, args.isGreedy,
- dis.a, dis.b, args.globality, dis.m,
- scoreMatrix.maxScore, scoreMatrix.minScore,
- dis.r, matrices.stats.lambda(), gapCosts,
- dis.d, frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras);
+ aln.makeXdrop(aligner.engines, args.isGreedy, dis.a, dis.b, args.globality,
+ dis.m, scoreMatrix.maxScore, scoreMatrix.minScore,
+ dis.r, matrices.stats.lambda(), gapCosts, dis.d,
+ frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras);
++gappedExtensionCount;
if( aln.score < args.minScoreGapped ) continue;
@@ -775,7 +772,7 @@ void alignGapped( LastAligner& aligner,
gaplessAlns.markAllOverlaps( aln.blocks );
gaplessAlns.markTandemRepeats( aln.seed, args.maxRepeatDistance );
- if( phase == Phase::final ) gappedAlns.add(aln);
+ if (phase == Phase::gapped) gappedAlns.add(aln);
else SegmentPairPot::markAsGood(sp);
++gappedAlignmentCount;
@@ -786,15 +783,26 @@ void alignGapped( LastAligner& aligner,
LOG2( "gapped alignments=" << gappedAlignmentCount );
}
+// Redo gapped extensions, but keep the old alignment scores
+static void alignPostgapped(LastAligner &aligner, AlignmentPot &gappedAlns,
+ size_t frameSize, const Dispatcher &dis) {
+ AlignmentExtras extras; // not used
+ for (size_t i = 0; i < gappedAlns.size(); ++i) {
+ Alignment &aln = gappedAlns.items[i];
+ aln.makeXdrop(aligner.engines, args.isGreedy, dis.a, dis.b, args.globality,
+ dis.m, scoreMatrix.maxScore, scoreMatrix.minScore,
+ 0, 0, gapCosts, dis.d,
+ frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras);
+ }
+}
+
// Print the gapped alignments, after optionally calculating match
// probabilities and re-aligning using the gamma-centroid algorithm
void alignFinish( LastAligner& aligner, const AlignmentPot& gappedAlns,
size_t queryNum, const SubstitutionMatrices &matrices,
- const uchar* querySeq ){
+ size_t frameSize, const Dispatcher &dis ){
Centroid& centroid = aligner.engines.centroid;
- Dispatcher dis(Phase::final, aligner, queryNum, matrices, querySeq);
size_t queryLen = query.padLen(queryNum);
- size_t frameSize = args.isFrameshift() ? (queryLen / 3) : 0;
if( args.outputType > 3 ){
if( dis.p ){
@@ -819,28 +827,26 @@ void alignFinish( LastAligner& aligner, const AlignmentPot& gappedAlns,
AlignmentExtras extras;
if (isFullScoreThreshold()) extras.fullScore = -1; // score is fullScore
if( args.outputType < 4 ){
- writeAlignment(aligner, aln, queryNum, querySeq, extras);
+ writeAlignment(aligner, aln, queryNum, dis.b, extras);
} else { // calculate match probabilities:
Alignment probAln;
probAln.seed = aln.seed;
- probAln.makeXdrop(aligner.engines, args.isGreedy,
- dis.a, dis.b, args.globality, dis.m,
- scoreMatrix.maxScore, scoreMatrix.minScore,
- dis.r, matrices.stats.lambda(), gapCosts,
- dis.d, frameSize, dis.p, dis.t, dis.i, dis.j, alph,
- extras, args.gamma, args.outputType);
+ probAln.makeXdrop(aligner.engines, args.isGreedy, dis.a, dis.b,
+ args.globality,
+ dis.m, scoreMatrix.maxScore, scoreMatrix.minScore,
+ dis.r, matrices.stats.lambda(), gapCosts, dis.d,
+ frameSize, dis.p, dis.t, dis.i, dis.j, alph, extras,
+ args.gamma, args.outputType);
assert(aln.score != -INF);
- writeAlignment(aligner, probAln, queryNum, querySeq, extras);
+ if (args.maskLowercase == 2 && isFullScoreThreshold())
+ probAln.score = aln.score;
+ writeAlignment(aligner, probAln, queryNum, dis.b, extras);
}
}
}
-static void eraseWeakAlignments(LastAligner &aligner, AlignmentPot &gappedAlns,
- size_t queryNum,
- const SubstitutionMatrices &matrices,
- const uchar *querySeq) {
- size_t frameSize = args.isFrameshift() ? (query.padLen(queryNum) / 3) : 0;
- Dispatcher dis(Phase::gapless, aligner, queryNum, matrices, querySeq);
+static void eraseWeakAlignments(AlignmentPot &gappedAlns,
+ size_t frameSize, const Dispatcher &dis) {
for (size_t i = 0; i < gappedAlns.size(); ++i) {
Alignment &a = gappedAlns.items[i];
if (!a.hasGoodSegment(dis.a, dis.b, ceil(args.minScoreGapped), dis.m,
@@ -957,32 +963,38 @@ void scan(LastAligner& aligner, size_t queryNum,
const int maskMode = args.maskLowercase;
makeQualityPssm(aligner, queryNum, matrices, querySeq, maskMode > 0);
+ Dispatcher dis0(Phase::gapless, aligner, queryNum, matrices, querySeq);
SegmentPairPot gaplessAlns;
- alignGapless(aligner, gaplessAlns, queryNum, matrices, querySeq);
+ alignGapless(aligner, gaplessAlns, queryNum, querySeq, dis0);
if( args.outputType == 1 ) return; // we just want gapless alignments
if( gaplessAlns.size() == 0 ) return;
if (maskMode == 1 || (maskMode == 2 && !isFullScoreThreshold()))
unmaskLowercase(aligner, queryNum, matrices, querySeq);
+ size_t frameSize = args.isFrameshift() ? (query.padLen(queryNum) / 3) : 0;
AlignmentPot gappedAlns;
- if (args.maxDropFinal != args.maxDropGapped
- || (maskMode == 2 && isFullScoreThreshold())) {
- alignGapped( aligner, gappedAlns, gaplessAlns,
- queryNum, matrices, querySeq, Phase::gapped );
+ if (args.maxDropFinal != args.maxDropGapped) {
+ alignGapped(aligner, gappedAlns, gaplessAlns,
+ queryNum, matrices, querySeq, frameSize, Phase::pregapped);
erase_if( gaplessAlns.items, SegmentPairPot::isNotMarkedAsGood );
- if (maskMode == 2 && isFullScoreThreshold())
- unmaskLowercase(aligner, queryNum, matrices, querySeq);
}
- alignGapped( aligner, gappedAlns, gaplessAlns,
- queryNum, matrices, querySeq, Phase::final );
+ alignGapped(aligner, gappedAlns, gaplessAlns,
+ queryNum, matrices, querySeq, frameSize, Phase::gapped);
if( gappedAlns.size() == 0 ) return;
+ Dispatcher dis3(Phase::postgapped, aligner, queryNum, matrices, querySeq);
+
+ if (maskMode == 2 && isFullScoreThreshold()) {
+ unmaskLowercase(aligner, queryNum, matrices, querySeq);
+ alignPostgapped(aligner, gappedAlns, frameSize, dis3);
+ }
+
if (maskMode == 2 && !isFullScoreThreshold()) {
remaskLowercase(aligner, queryNum, matrices, querySeq);
- eraseWeakAlignments(aligner, gappedAlns, queryNum, matrices, querySeq);
+ eraseWeakAlignments(gappedAlns, frameSize, dis0);
LOG2("lowercase-filtered alignments=" << gappedAlns.size());
if (gappedAlns.size() == 0) return;
if (args.outputType > 3)
@@ -995,7 +1007,7 @@ void scan(LastAligner& aligner, size_t queryNum,
}
if( !isCollatedAlignments() ) gappedAlns.sort(); // sort by score
- alignFinish(aligner, gappedAlns, queryNum, matrices, querySeq);
+ alignFinish(aligner, gappedAlns, queryNum, matrices, frameSize, dis3);
}
static void tantanMaskOneQuery(size_t queryNum, uchar *querySeq) {
=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"1170"
+"1178"
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/d1b788bbe87c1e5812ec649f0952e53a399c2fc9
--
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/commit/d1b788bbe87c1e5812ec649f0952e53a399c2fc9
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210128/5b9b5f1f/attachment-0001.html>
More information about the debian-med-commit
mailing list