[med-svn] [Git][med-team/last-align][upstream] New upstream version 961

Andreas Tille gitlab at salsa.debian.org
Mon Dec 17 10:20:33 GMT 2018


Andreas Tille pushed to branch upstream at Debian Med / last-align


Commits:
1d01bfcb by Andreas Tille at 2018-12-17T10:12:55Z
New upstream version 961
- - - - -


27 changed files:

- ChangeLog.txt
- doc/last-dotplot.html
- doc/last-dotplot.txt
- doc/last-train.html
- doc/last-train.txt
- doc/lastal.html
- doc/lastal.txt
- doc/lastdb.html
- doc/lastdb.txt
- scripts/last-dotplot
- scripts/last-map-probs
- scripts/last-postmask
- scripts/last-train
- scripts/maf-convert
- scripts/maf-cut
- scripts/maf-swap
- src/LastalArguments.cc
- src/LastdbArguments.cc
- src/MultiSequence.hh
- src/MultiSequenceQual.cc
- src/SequenceFormat.hh
- src/SubsetSuffixArray.cc
- src/last.hh
- src/lastal.cc
- src/lastdb.cc
- src/split/last-split.cc
- src/version.hh


Changes:

=====================================
ChangeLog.txt
=====================================
@@ -1,8 +1,41 @@
+2018-12-10  Martin C. Frith  <Martin C. Frith>
+
+	* doc/last-dotplot.txt, scripts/last-dotplot:
+	Add last-dotplot --maxseqs option
+	[ed0fb9b1eb40] [tip]
+
+	* src/split/last-split.cc:
+	Make last-split -n keep E-values
+	[2063a5a7d7ac]
+
+2018-10-30  Martin C. Frith  <Martin C. Frith>
+
+	* doc/last-train.txt, doc/lastal.txt, doc/lastdb.txt, scripts/last-
+	train, src/LastalArguments.cc, src/LastdbArguments.cc,
+	src/MultiSequence.hh, src/MultiSequenceQual.cc,
+	src/SequenceFormat.hh, src/last.hh, test/last-test.out, test/last-
+	test.sh:
+	Add fastq-ignore option
+	[c551d04a3928]
+
+2018-10-29  Martin C. Frith  <Martin C. Frith>
+
+	* src/LastalArguments.cc, src/SequenceFormat.hh,
+	src/SubsetSuffixArray.cc, src/lastal.cc, src/lastdb.cc:
+	Refactor
+	[96b45da7bf0f]
+
+	* scripts/last-dotplot, scripts/last-map-probs, scripts/last-postmask,
+	scripts/last-train, scripts/maf-convert, scripts/maf-cut, scripts
+	/maf-swap:
+	Fix gz reading for Python3
+	[c54672c8892e]
+
 2018-09-10  Martin C. Frith  <Martin C. Frith>
 
 	* scripts/last-map-probs:
 	Python3-ify last-map-probs
-	[83191e47259e] [tip]
+	[83191e47259e]
 
 2018-09-07  Martin C. Frith  <Martin C. Frith>
 


=====================================
doc/last-dotplot.html
=====================================
@@ -335,62 +335,6 @@ last-dotplot alns alns.gif
 set of sequences.  This document calls a "set of sequences" a
 "genome", though it need not actually be a genome.</p>
 </div>
-<div class="section" id="choosing-sequences">
-<h2>Choosing sequences</h2>
-<p>If there are too many sequences, the dotplot will be very cluttered,
-or the script might give up with an error message.  You can exclude
-sequences with names like "chrUn_random522" like this:</p>
-<pre class="literal-block">
-last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
-</pre>
-<p>Option "-1" selects sequences from the 1st (horizontal) genome, and
-"-2" selects sequences from the 2nd (vertical) genome.  'chr[!U]*' is
-a <em>pattern</em> that specifies names starting with "chr", followed by any
-character except U, followed by anything.</p>
-<table border="1" class="docutils">
-<colgroup>
-<col width="26%" />
-<col width="74%" />
-</colgroup>
-<tbody valign="top">
-<tr><td>Pattern</td>
-<td>Meaning</td>
-</tr>
-<tr><td><tt class="docutils literal">*</tt></td>
-<td>zero or more of any character</td>
-</tr>
-<tr><td><tt class="docutils literal">?</tt></td>
-<td>any single character</td>
-</tr>
-<tr><td><tt class="docutils literal">[abc]</tt></td>
-<td>any character in abc</td>
-</tr>
-<tr><td><tt class="docutils literal">[!abc]</tt></td>
-<td>any character not in abc</td>
-</tr>
-</tbody>
-</table>
-<p>If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
-compared to both the whole name and the part after the dot.</p>
-<p>You can specify more than one pattern, e.g. this gets sequences with
-names starting in "chr" followed by one or two characters:</p>
-<pre class="literal-block">
-last-dotplot -1 'chr?' -1 'chr??' alns alns.png
-</pre>
-<p>You can also specify a sequence range; for example this gets the first
-1000 bases of chr9:</p>
-<pre class="literal-block">
-last-dotplot -1 chr9:0-1000 alns alns.png
-</pre>
-</div>
-<div class="section" id="fonts">
-<h2>Fonts</h2>
-<p>last-dotplot tries to find a nice font on your computer, but may fail
-and use an ugly font.  You can specify a font like this:</p>
-<pre class="literal-block">
-last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
-</pre>
-</div>
 <div class="section" id="options">
 <h2>Options</h2>
 <blockquote>
@@ -405,18 +349,23 @@ last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns
 <kbd><span class="option">-v</span>, <span class="option">--verbose</span></kbd></td>
 <td>Show progress messages & data about the plot.</td></tr>
 <tr><td class="option-group">
+<kbd><span class="option">-x <var>INT</var></span>, <span class="option">--width=<var>INT</var></span></kbd></td>
+<td>Maximum width in pixels.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-y <var>INT</var></span>, <span class="option">--height=<var>INT</var></span></kbd></td>
+<td>Maximum height in pixels.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-m <var>M</var></span>, <span class="option">--maxseqs=<var>M</var></span></kbd></td>
+<td>Maximum number of horizontal or vertical sequences.  If there
+are >M sequences, the smallest ones (after cutting) will be
+discarded.</td></tr>
+<tr><td class="option-group">
 <kbd><span class="option">-1 <var>PATTERN</var></span>, <span class="option">--seq1=<var>PATTERN</var></span></kbd></td>
 <td>Which sequences to show from the 1st (horizontal) genome.</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-2 <var>PATTERN</var></span>, <span class="option">--seq2=<var>PATTERN</var></span></kbd></td>
 <td>Which sequences to show from the 2nd (vertical) genome.</td></tr>
 <tr><td class="option-group">
-<kbd><span class="option">-x <var>WIDTH</var></span>, <span class="option">--width=<var>WIDTH</var></span></kbd></td>
-<td>Maximum width in pixels.</td></tr>
-<tr><td class="option-group">
-<kbd><span class="option">-y <var>HEIGHT</var></span>, <span class="option">--height=<var>HEIGHT</var></span></kbd></td>
-<td>Maximum height in pixels.</td></tr>
-<tr><td class="option-group">
 <kbd><span class="option">-c <var>COLOR</var></span>, <span class="option">--forwardcolor=<var>COLOR</var></span></kbd></td>
 <td>Color for forward alignments.</td></tr>
 <tr><td class="option-group">
@@ -602,6 +551,65 @@ of unknown sequence.</p>
 pixel.</p>
 </div>
 </div>
+<div class="section" id="choosing-sequences">
+<h2>Choosing sequences</h2>
+<p>For example, you can exclude sequences with names like
+"chrUn_random522" like this:</p>
+<pre class="literal-block">
+last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
+</pre>
+<p>Option "-1" selects sequences from the 1st (horizontal) genome, and
+"-2" selects sequences from the 2nd (vertical) genome.  'chr[!U]*' is
+a <em>pattern</em> that specifies names starting with "chr", followed by any
+character except U, followed by anything.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="26%" />
+<col width="74%" />
+</colgroup>
+<tbody valign="top">
+<tr><td>Pattern</td>
+<td>Meaning</td>
+</tr>
+<tr><td><tt class="docutils literal">*</tt></td>
+<td>zero or more of any character</td>
+</tr>
+<tr><td><tt class="docutils literal">?</tt></td>
+<td>any single character</td>
+</tr>
+<tr><td><tt class="docutils literal">[abc]</tt></td>
+<td>any character in abc</td>
+</tr>
+<tr><td><tt class="docutils literal">[!abc]</tt></td>
+<td>any character not in abc</td>
+</tr>
+</tbody>
+</table>
+<p>If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
+compared to both the whole name and the part after the dot.</p>
+<p>You can specify more than one pattern, e.g. this gets sequences with
+names starting in "chr" followed by one or two characters:</p>
+<pre class="literal-block">
+last-dotplot -1 'chr?' -1 'chr??' alns alns.png
+</pre>
+<p>You can also specify a sequence range; for example this gets the first
+1000 bases of chr9:</p>
+<pre class="literal-block">
+last-dotplot -1 chr9:0-1000 alns alns.png
+</pre>
+</div>
+<div class="section" id="text-font">
+<h2>Text font</h2>
+<p>You can improve the font quality by increasing its size, e.g. to 20
+points:</p>
+<blockquote>
+last-dotplot -s20 my-alignments my-plot.png</blockquote>
+<p>last-dotplot tries to find a nice font on your computer, but may fail
+and use an ugly font.  You can specify a font like this:</p>
+<pre class="literal-block">
+last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
+</pre>
+</div>
 <div class="section" id="colors">
 <h2>Colors</h2>
 <p>Colors can be specified in <a class="reference external" href="http://effbot.org/imagingbook/imagecolor.htm">various ways described here</a>.</p>


=====================================
doc/last-dotplot.txt
=====================================
@@ -19,50 +19,6 @@ last-dotplot shows alignments of one set of sequences against another
 set of sequences.  This document calls a "set of sequences" a
 "genome", though it need not actually be a genome.
 
-Choosing sequences
-------------------
-
-If there are too many sequences, the dotplot will be very cluttered,
-or the script might give up with an error message.  You can exclude
-sequences with names like "chrUn_random522" like this::
-
-  last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
-
-Option "-1" selects sequences from the 1st (horizontal) genome, and
-"-2" selects sequences from the 2nd (vertical) genome.  'chr[!U]*' is
-a *pattern* that specifies names starting with "chr", followed by any
-character except U, followed by anything.
-
-==========  =============================
-Pattern     Meaning
-----------  -----------------------------
-``*``       zero or more of any character
-``?``       any single character
-``[abc]``   any character in abc
-``[!abc]``  any character not in abc
-==========  =============================
-
-If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
-compared to both the whole name and the part after the dot.
-
-You can specify more than one pattern, e.g. this gets sequences with
-names starting in "chr" followed by one or two characters::
-
-  last-dotplot -1 'chr?' -1 'chr??' alns alns.png
-
-You can also specify a sequence range; for example this gets the first
-1000 bases of chr9::
-
-  last-dotplot -1 chr9:0-1000 alns alns.png
-
-Fonts
------
-
-last-dotplot tries to find a nice font on your computer, but may fail
-and use an ugly font.  You can specify a font like this::
-
-  last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
-
 Options
 -------
 
@@ -70,14 +26,18 @@ Options
       Show a help message, with default option values, and exit.
   -v, --verbose
       Show progress messages & data about the plot.
+  -x INT, --width=INT
+      Maximum width in pixels.
+  -y INT, --height=INT
+      Maximum height in pixels.
+  -m M, --maxseqs=M
+      Maximum number of horizontal or vertical sequences.  If there
+      are >M sequences, the smallest ones (after cutting) will be
+      discarded.
   -1 PATTERN, --seq1=PATTERN
       Which sequences to show from the 1st (horizontal) genome.
   -2 PATTERN, --seq2=PATTERN
       Which sequences to show from the 2nd (vertical) genome.
-  -x WIDTH, --width=WIDTH
-      Maximum width in pixels.
-  -y HEIGHT, --height=HEIGHT
-      Maximum height in pixels.
   -c COLOR, --forwardcolor=COLOR
       Color for forward alignments.
   -r COLOR, --reversecolor=COLOR
@@ -207,6 +167,54 @@ of unknown sequence.
 An unsequenced gap will be shown only if it covers at least one whole
 pixel.
 
+Choosing sequences
+------------------
+
+For example, you can exclude sequences with names like
+"chrUn_random522" like this::
+
+  last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
+
+Option "-1" selects sequences from the 1st (horizontal) genome, and
+"-2" selects sequences from the 2nd (vertical) genome.  'chr[!U]*' is
+a *pattern* that specifies names starting with "chr", followed by any
+character except U, followed by anything.
+
+==========  =============================
+Pattern     Meaning
+----------  -----------------------------
+``*``       zero or more of any character
+``?``       any single character
+``[abc]``   any character in abc
+``[!abc]``  any character not in abc
+==========  =============================
+
+If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
+compared to both the whole name and the part after the dot.
+
+You can specify more than one pattern, e.g. this gets sequences with
+names starting in "chr" followed by one or two characters::
+
+  last-dotplot -1 'chr?' -1 'chr??' alns alns.png
+
+You can also specify a sequence range; for example this gets the first
+1000 bases of chr9::
+
+  last-dotplot -1 chr9:0-1000 alns alns.png
+
+Text font
+---------
+
+You can improve the font quality by increasing its size, e.g. to 20
+points:
+
+  last-dotplot -s20 my-alignments my-plot.png
+
+last-dotplot tries to find a nice font on your computer, but may fail
+and use an ugly font.  You can specify a font like this::
+
+  last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
+
 Colors
 ------
 


=====================================
doc/last-train.html
=====================================
@@ -333,7 +333,7 @@ last-train mydb queries.fasta
 <p>last-train prints a summary of each alignment step, followed by the
 final score parameters, in a format that can be read by <a class="reference external" href="lastal.html#score-options">lastal's -p
 option</a>.</p>
-<p>You can also pipe sequences into last-train, for example:</p>
+<p>last-train can read .gz files, or from pipes:</p>
 <pre class="literal-block">
 bzcat queries.fasta.bz2 | last-train mydb
 </pre>
@@ -479,15 +479,21 @@ position in each query.</td></tr>
 <td>Number of parallel threads.</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td><p class="first">Query sequence format: 0=fasta, 1=fastq-sanger.</p>
-<p>last-train assumes that fastq quality codes indicate
-substitution error probabilities, <em>not</em> insertion or
-deletion error probabilities.  If this assumption is
-dubious (e.g. for data with many insertion or deletion
-errors), it may be better to use fasta.</p>
-<p>For fastq, last-train finds the rates of substitutions
-not explained by the quality data (ideally, real
-substitutions as opposed to errors).</p>
+<td><p class="first">How to read the query sequences.  By default, they must
+be in <tt class="docutils literal">fasta</tt> format.  <tt class="docutils literal"><span class="pre">-Q0</span></tt> means <tt class="docutils literal">fasta</tt> or
+<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>.  <tt class="docutils literal"><span class="pre">-Q1</span></tt> means <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>.</p>
+<p>The <tt class="docutils literal">fastq</tt> formats are described here:
+<a class="reference external" href="lastal.html">lastal.html</a>.  <tt class="docutils literal"><span class="pre">fastq-ignore</span></tt> means that the
+quality data is ignored, so the results will be the same
+as for <tt class="docutils literal">fasta</tt>.</p>
+<p>For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train assumes the quality
+codes indicate substitution error probabilities, <em>not</em>
+insertion or deletion error probabilities.  If this
+assumption is dubious (e.g. for data with many insertion
+or deletion errors), it may be better to use
+<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>.  For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train finds
+the rates of substitutions not explained by the quality
+data (ideally, real substitutions as opposed to errors).</p>
 <p class="last">If specified, this parameter is written in last-train's
 output, so it will override lastal's default.</p>
 </td></tr>


=====================================
doc/last-train.txt
=====================================
@@ -19,7 +19,7 @@ last-train prints a summary of each alignment step, followed by the
 final score parameters, in a format that can be read by `lastal's -p
 option <lastal.html#score-options>`_.
 
-You can also pipe sequences into last-train, for example::
+last-train can read .gz files, or from pipes::
 
   bzcat queries.fasta.bz2 | last-train mydb
 
@@ -102,17 +102,23 @@ Alignment options
   -k STEP    Look for initial matches starting only at every STEP-th
              position in each query.
   -P COUNT   Number of parallel threads.
-  -Q NUMBER  Query sequence format: 0=fasta, 1=fastq-sanger.
-
-             last-train assumes that fastq quality codes indicate
-             substitution error probabilities, *not* insertion or
-             deletion error probabilities.  If this assumption is
-             dubious (e.g. for data with many insertion or deletion
-             errors), it may be better to use fasta.
-
-             For fastq, last-train finds the rates of substitutions
-             not explained by the quality data (ideally, real
-             substitutions as opposed to errors).
+  -Q NUMBER  How to read the query sequences.  By default, they must
+             be in ``fasta`` format.  ``-Q0`` means ``fasta`` or
+             ``fastq-ignore``.  ``-Q1`` means ``fastq-sanger``.
+
+             The ``fastq`` formats are described here:
+             `<lastal.html>`_.  ``fastq-ignore`` means that the
+             quality data is ignored, so the results will be the same
+             as for ``fasta``.
+
+             For ``fastq-sanger``, last-train assumes the quality
+             codes indicate substitution error probabilities, *not*
+             insertion or deletion error probabilities.  If this
+             assumption is dubious (e.g. for data with many insertion
+             or deletion errors), it may be better to use
+             ``fastq-ignore``.  For ``fastq-sanger``, last-train finds
+             the rates of substitutions not explained by the quality
+             data (ideally, real substitutions as opposed to errors).
 
              If specified, this parameter is written in last-train's
              output, so it will override lastal's default.


=====================================
doc/lastal.html
=====================================
@@ -849,13 +849,18 @@ generalized affine gap costs.  They are:</p>
 </td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td><p class="first">This option allows lastal to use sequence quality scores, or
-PSSMs, for the queries.  0 means read queries in fasta format
-(without quality scores); 1 means fastq-sanger format; 2 means
-fastq-solexa format; 3 means fastq-illumina format; 4 means prb
-format; 5 means read PSSMs.  (<em>Warning</em>: Illumina data is not
-necessarily in fastq-illumina format; it is often in
-fastq-sanger format.)</p>
+<td><p class="first">Specify how to read the query sequences:</p>
+<pre class="literal-block">
+Default    fasta
+   0       fasta or fastq-ignore
+   1       fastq-sanger
+   2       fastq-solexa
+   3       fastq-illumina
+   4       prb
+   5       PSSM
+</pre>
+<p><em>Warning</em>: Illumina data is not necessarily in fastq-illumina
+format; it is often in fastq-sanger format.</p>
 <p>The fastq formats look like this:</p>
 <pre class="literal-block">
 @mySequenceName
@@ -863,12 +868,18 @@ TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
 +
 55555555*&5-/55*5//5(55,5#&$)$)*+$
 </pre>
-<p>The "+" may optionally be followed by a name (ignored), and the
-sequence and quality codes are allowed to wrap onto more than
-one line.  For fastq-sanger, the quality scores are obtained by
-subtracting 33 from the ASCII values of the characters below the
-"+".  For fastq-solexa and fastq-illumina, they are obtained by
-subtracting 64.</p>
+<p>The "+" may be followed by text (ignored).  The symbols below
+the "+" are quality codes, one per sequence letter.  The
+sequence and quality codes may wrap onto more than one line.</p>
+<p>fastq-ignore means that the quality codes are ignored.  For the
+other fastq variants, lastal assumes the quality codes indicate
+substitution error probabilities, <em>not</em> insertion or deletion
+error probabilities.  If this assumption is dubious (e.g. for
+data with many insertion or deletion errors), it may be better
+to use fastq-ignore.</p>
+<p>For fastq-sanger, quality scores are obtained by subtracting 33
+from the ASCII values of the quality codes.  For fastq-solexa
+and fastq-illumina, they are obtained by subtracting 64.</p>
 <p>prb format stores four quality scores (A, C, G, T) per position,
 with one sequence per line, like this:</p>
 <pre class="literal-block">


=====================================
doc/lastal.txt
=====================================
@@ -474,13 +474,18 @@ Miscellaneous options
       * The count of adjacent pairs of insertions and unaligned letter pairs.
 
   -Q NUMBER
-      This option allows lastal to use sequence quality scores, or
-      PSSMs, for the queries.  0 means read queries in fasta format
-      (without quality scores); 1 means fastq-sanger format; 2 means
-      fastq-solexa format; 3 means fastq-illumina format; 4 means prb
-      format; 5 means read PSSMs.  (*Warning*: Illumina data is not
-      necessarily in fastq-illumina format; it is often in
-      fastq-sanger format.)
+      Specify how to read the query sequences::
+
+        Default    fasta
+           0       fasta or fastq-ignore
+           1       fastq-sanger
+           2       fastq-solexa
+           3       fastq-illumina
+           4       prb
+           5       PSSM
+
+      *Warning*: Illumina data is not necessarily in fastq-illumina
+      format; it is often in fastq-sanger format.
 
       The fastq formats look like this::
 
@@ -489,12 +494,20 @@ Miscellaneous options
         +
         55555555*&5-/55*5//5(55,5#&$)$)*+$
 
-      The "+" may optionally be followed by a name (ignored), and the
-      sequence and quality codes are allowed to wrap onto more than
-      one line.  For fastq-sanger, the quality scores are obtained by
-      subtracting 33 from the ASCII values of the characters below the
-      "+".  For fastq-solexa and fastq-illumina, they are obtained by
-      subtracting 64.
+      The "+" may be followed by text (ignored).  The symbols below
+      the "+" are quality codes, one per sequence letter.  The
+      sequence and quality codes may wrap onto more than one line.
+
+      fastq-ignore means that the quality codes are ignored.  For the
+      other fastq variants, lastal assumes the quality codes indicate
+      substitution error probabilities, *not* insertion or deletion
+      error probabilities.  If this assumption is dubious (e.g. for
+      data with many insertion or deletion errors), it may be better
+      to use fastq-ignore.
+
+      For fastq-sanger, quality scores are obtained by subtracting 33
+      from the ASCII values of the quality codes.  For fastq-solexa
+      and fastq-illumina, they are obtained by subtracting 64.
 
       prb format stores four quality scores (A, C, G, T) per position,
       with one sequence per line, like this::


=====================================
doc/lastdb.html
=====================================
@@ -457,11 +457,10 @@ about 4 billion.</p>
 </td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td>Specify the input format.  0 means fasta, 1 means fastq-sanger,
-2 means fastq-solexa, and 3 means fastq-illumina.  The fastq
-formats provide sequence quality data, which will be stored by
-lastdb and then used by lastal.  These formats are described in
-<a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
+<td>Specify how to read the sequences.  The default is fasta, 0
+means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
+fastq-solexa, and 3 means fastq-illumina.  The fastq formats are
+described in <a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
 <td>Divide the work between this number of threads running in


=====================================
doc/lastdb.txt
=====================================
@@ -125,11 +125,10 @@ Advanced Options
       about 4 billion.
 
   -Q NUMBER
-      Specify the input format.  0 means fasta, 1 means fastq-sanger,
-      2 means fastq-solexa, and 3 means fastq-illumina.  The fastq
-      formats provide sequence quality data, which will be stored by
-      lastdb and then used by lastal.  These formats are described in
-      `<lastal.html>`_.
+      Specify how to read the sequences.  The default is fasta, 0
+      means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
+      fastq-solexa, and 3 means fastq-illumina.  The fastq formats are
+      described in `<lastal.html>`_.
 
   -P THREADS
       Divide the work between this number of threads running in


=====================================
scripts/last-dotplot
=====================================
@@ -13,6 +13,7 @@ import collections
 import functools
 import gzip
 from fnmatch import fnmatchcase
+import logging
 from operator import itemgetter
 import subprocess
 import itertools, optparse, os, re, sys
@@ -34,14 +35,9 @@ def myOpen(fileName):  # faster than fileinput
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
-def warn(message):
-    if opts.verbose:
-        prog = os.path.basename(sys.argv[0])
-        sys.stderr.write(prog + ": " + message + "\n")
-
 def groupByFirstItem(things):
     for k, v in itertools.groupby(things, itemgetter(0)):
         yield k, [i[1:] for i in v]
@@ -414,8 +410,8 @@ def dataFromRanges(sortedRanges, font, fontSize, imageMode, labelOpt, textRot):
         out = [seqName, str(rangeBeg), str(rangeEnd)]
         if strandNum > 0:
             out.append(".+-"[strandNum])
-        warn("\t".join(out))
-    warn("")
+        logging.info("\t".join(out))
+    logging.info("")
     rangeSizes = [e - b for n, b, e, s in sortedRanges]
     labs = list(rangeLabels(sortedRanges, labelOpt, font, fontSize,
                             imageMode, textRot))
@@ -430,7 +426,7 @@ def div_ceil(x, y):
 
 def get_bp_per_pix(rangeSizes, pixTweenRanges, maxPixels):
     '''Get the minimum bp-per-pixel that fits in the size limit.'''
-    warn("choosing bp per pixel...")
+    logging.info("choosing bp per pixel...")
     numOfRanges = len(rangeSizes)
     maxPixelsInRanges = maxPixels - pixTweenRanges * (numOfRanges - 1)
     if maxPixelsInRanges < numOfRanges:
@@ -718,18 +714,37 @@ def getFont(opts):
         out, err = p.communicate()
         fileNames.append(out)
     except OSError as e:
-        warn("fc-match error: " + str(e))
+        logging.info("fc-match error: " + str(e))
     fileNames.append("/Library/Fonts/Arial.ttf")  # for Mac
     for i in fileNames:
         try:
             font = ImageFont.truetype(i, opts.fontsize)
-            warn("font: " + i)
+            logging.info("font: " + i)
             return font
         except IOError as e:
-            warn("font load error: " + str(e))
+            logging.info("font load error: " + str(e))
     return ImageFont.load_default()
 
+def sequenceSizesAndNames(seqRanges):
+    for seqName, ranges in itertools.groupby(seqRanges, itemgetter(0)):
+        size = sum(e - b for n, b, e in ranges)
+        yield size, seqName
+
+def biggestSequences(seqRanges, maxNumOfSequences):
+    s = sorted(sequenceSizesAndNames(seqRanges), reverse=True)
+    if len(s) > maxNumOfSequences:
+        logging.warning("too many sequences - discarding the smallest ones")
+        s = s[:maxNumOfSequences]
+    return set(i[1] for i in s)
+
+def remainingSequenceRanges(seqRanges, alignments, seqIndex):
+    remainingSequences = set(i[seqIndex] for i in alignments)
+    return [i for i in seqRanges if i[0] in remainingSequences]
+
 def lastDotplot(opts, args):
+    logLevel = logging.INFO if opts.verbose else logging.WARNING
+    logging.basicConfig(format="%(filename)s: %(message)s", level=logLevel)
+
     font = getFont(opts)
     image_mode = 'RGB'
     forward_color = ImageColor.getcolor(opts.forwardcolor, image_mode)
@@ -740,11 +755,11 @@ def lastDotplot(opts, args):
     maxGap1, maxGapB1 = twoValuesFromOption(opts.max_gap1, ":")
     maxGap2, maxGapB2 = twoValuesFromOption(opts.max_gap2, ":")
 
-    warn("reading alignments...")
+    logging.info("reading alignments...")
     alnData = readAlignments(args[0], opts)
     alignments, seqRanges1, coverDict1, seqRanges2, coverDict2 = alnData
     if not alignments: raise Exception("there are no alignments")
-    warn("cutting...")
+    logging.info("cutting...")
     coverDict1 = dict(mergedRangesPerSeq(coverDict1))
     coverDict2 = dict(mergedRangesPerSeq(coverDict2))
     minAlignedBases = min(coveredLength(coverDict1), coveredLength(coverDict2))
@@ -754,10 +769,20 @@ def lastDotplot(opts, args):
     cutRanges2 = list(trimmed(seqRanges2, coverDict2, minAlignedBases,
                               maxGap2, pad, pad))
 
-    warn("reading secondary alignments...")
+    biggestSeqs1 = biggestSequences(cutRanges1, opts.maxseqs)
+    cutRanges1 = [i for i in cutRanges1 if i[0] in biggestSeqs1]
+    alignments = [i for i in alignments if i[0] in biggestSeqs1]
+    cutRanges2 = remainingSequenceRanges(cutRanges2, alignments, 1)
+
+    biggestSeqs2 = biggestSequences(cutRanges2, opts.maxseqs)
+    cutRanges2 = [i for i in cutRanges2 if i[0] in biggestSeqs2]
+    alignments = [i for i in alignments if i[1] in biggestSeqs2]
+    cutRanges1 = remainingSequenceRanges(cutRanges1, alignments, 0)
+
+    logging.info("reading secondary alignments...")
     alnDataB = readSecondaryAlignments(opts, cutRanges1, cutRanges2)
     alignmentsB, seqRangesB1, coverDictB1, seqRangesB2, coverDictB2 = alnDataB
-    warn("cutting...")
+    logging.info("cutting...")
     coverDictB1 = dict(mergedRangesPerSeq(coverDictB1))
     coverDictB2 = dict(mergedRangesPerSeq(coverDictB2))
     cutRangesB1 = trimmed(seqRangesB1, coverDictB1, minAlignedBases,
@@ -765,7 +790,7 @@ def lastDotplot(opts, args):
     cutRangesB2 = trimmed(seqRangesB2, coverDictB2, minAlignedBases,
                           maxGapB2, 0, 0)
 
-    warn("sorting...")
+    logging.info("sorting...")
     sortOut = allSortedRanges(opts, alignments, alignmentsB,
                               cutRanges1, cutRangesB1, cutRanges2, cutRangesB2)
     sortedRanges1, sortedRanges2 = sortOut
@@ -785,7 +810,7 @@ def lastDotplot(opts, args):
     bpPerPix1 = get_bp_per_pix(rangeSizes1, opts.border_pixels, maxPixels1)
     bpPerPix2 = get_bp_per_pix(rangeSizes2, opts.border_pixels, maxPixels2)
     bpPerPix = max(bpPerPix1, bpPerPix2)
-    warn("bp per pixel = " + str(bpPerPix))
+    logging.info("bp per pixel = " + str(bpPerPix))
 
     p1 = pixelData(rangeSizes1, bpPerPix, opts.border_pixels, lMargin)
     rangePixBegs1, rangePixLens1, width = p1
@@ -797,14 +822,14 @@ def lastDotplot(opts, args):
     rangeDict2 = dict(rangesPerSeq(sortedRanges2, rangePixBegs2,
                                    rangePixLens2, bpPerPix))
 
-    warn("width:  " + str(width))
-    warn("height: " + str(height))
+    logging.info("width:  " + str(width))
+    logging.info("height: " + str(height))
 
-    warn("processing alignments...")
+    logging.info("processing alignments...")
     hits = alignmentPixels(width, height, alignments + alignmentsB, bpPerPix,
                            rangeDict1, rangeDict2)
 
-    warn("reading annotations...")
+    logging.info("reading annotations...")
 
     rangeDict1 = expandedSeqDict(rangeDict1)
     beds1 = itertools.chain(readBed(opts.bed1, rangeDict1),
@@ -822,7 +847,7 @@ def lastDotplot(opts, args):
 
     boxes = sorted(itertools.chain(b1, b2))
 
-    warn("drawing...")
+    logging.info("drawing...")
 
     image_size = width, height
     im = Image.new(image_mode, image_size, opts.background_color)
@@ -868,17 +893,20 @@ if __name__ == "__main__":
     op = optparse.OptionParser(usage=usage, description=description)
     op.add_option("-v", "--verbose", action="count",
                   help="show progress messages & data about the plot")
+    # Replace "width" & "height" with a single "length" option?
+    op.add_option("-x", "--width", metavar="INT", type="int", default=1000,
+                  help="maximum width in pixels (default: %default)")
+    op.add_option("-y", "--height", metavar="INT", type="int", default=1000,
+                  help="maximum height in pixels (default: %default)")
+    op.add_option("-m", "--maxseqs", type="int", default=100, metavar="M",
+                  help="maximum number of horizontal or vertical sequences "
+                  "(default=%default)")
     op.add_option("-1", "--seq1", metavar="PATTERN", action="append",
                   default=[],
                   help="which sequences to show from the 1st genome")
     op.add_option("-2", "--seq2", metavar="PATTERN", action="append",
                   default=[],
                   help="which sequences to show from the 2nd genome")
-    # Replace "width" & "height" with a single "length" option?
-    op.add_option("-x", "--width", type="int", default=1000,
-                  help="maximum width in pixels (default: %default)")
-    op.add_option("-y", "--height", type="int", default=1000,
-                  help="maximum height in pixels (default: %default)")
     op.add_option("-c", "--forwardcolor", metavar="COLOR", default="red",
                   help="color for forward alignments (default: %default)")
     op.add_option("-r", "--reversecolor", metavar="COLOR", default="blue",


=====================================
scripts/last-map-probs
=====================================
@@ -16,7 +16,7 @@ def myOpen(fileName):
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
 def logsum(numbers):


=====================================
scripts/last-postmask
=====================================
@@ -26,7 +26,7 @@ def myOpen(fileName):
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
 def complement(base):


=====================================
scripts/last-train
=====================================
@@ -10,7 +10,7 @@ def myOpen(fileName):  # faster than fileinput
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
 def randomSample(things, sampleSize):
@@ -499,8 +499,9 @@ if __name__ == "__main__":
                   "every STEP-th position in each query (default: 1)")
     og.add_option("-P", metavar="THREADS",
                   help="number of parallel threads")
-    og.add_option("-Q", metavar="NUMBER",
-                  help="input format: 0=fasta, 1=fastq-sanger")
+    og.add_option("-Q", metavar="NUMBER", help=
+                  "input format: 0=fasta or fastq-ignore, 1=fastq-sanger "
+                  "(default=fasta)")
     op.add_option_group(og)
     (opts, args) = op.parse_args()
     if len(args) < 1:


=====================================
scripts/maf-convert
=====================================
@@ -21,7 +21,7 @@ def myOpen(fileName):
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
 def maxlen(s):


=====================================
scripts/maf-cut
=====================================
@@ -12,7 +12,7 @@ def myOpen(fileName):  # faster than fileinput
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
 def alnBegFromSeqBeg(gappedSequence, seqBeg):


=====================================
scripts/maf-swap
=====================================
@@ -25,7 +25,7 @@ def myOpen(fileName):
     if fileName == "-":
         return sys.stdin
     if fileName.endswith(".gz"):
-        return gzip.open(fileName)
+        return gzip.open(fileName, "rt")  # xxx dubious for Python2
     return open(fileName)
 
 def indexOfNthSequence(mafLines, n):


=====================================
src/LastalArguments.cc
=====================================
@@ -172,9 +172,8 @@ Miscellaneous options (default settings):\n\
                  4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA,\n\
                  7=expected counts ("
     + stringify(outputType) + ")\n\
--Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina,\n\
-                  4=prb, 5=PSSM ("
-    + stringify(inputFormat) + ")\n\
+-Q: input format: 0=fasta or fastq-ignore, 1=fastq-sanger, 2=fastq-solexa,\n\
+                  3=fastq-illumina, 4=prb, 5=PSSM (fasta)\n\
 \n\
 Report bugs to: last-align (ATmark) googlegroups (dot) com\n\
 LAST home page: http://last.cbrc.jp/\n\
@@ -445,7 +444,7 @@ void LastalArguments::setDefaultsFromAlphabet( bool isDna, bool isProtein,
     if( gapExistCost   == INT_MIN ) gapExistCost   =  11;
     if( gapExtendCost  < 0 ) gapExtendCost  =   2;
   }
-  else if( !isQuality( inputFormat ) ){
+  else if( !isUseQuality( inputFormat ) ){
     if( matchScore     < 0 ) matchScore     =   1;
     if( mismatchCost   < 0 ) mismatchCost   =   1;
     if( gapExistCost   == INT_MIN ) gapExistCost   =   7;
@@ -641,7 +640,7 @@ void LastalArguments::writeCommented( std::ostream& stream ) const{
   if( outputType > 4 && outputType < 7 )
     stream << " g=" << gamma;
   stream << " j=" << outputType;
-  stream << " Q=" << inputFormat;
+  stream << " Q=" << (inputFormat % sequenceFormat::fasta);
   stream << '\n';
 
   stream << "# " << lastdbName << '\n';


=====================================
src/LastdbArguments.cc
=====================================
@@ -70,8 +70,8 @@ Advanced Options (default settings):\n\
 -S: strand: 0=reverse, 1=forward, 2=both ("
     + stringify(strand) + ")\n\
 -s: volume size (unlimited)\n\
--Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina ("
-    + stringify(inputFormat) + ")\n\
+-Q: input format: 0=fasta or fastq-ignore,\n\
+                  1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina (fasta)\n\
 -P: number of parallel threads ("
     + stringify(numOfThreads) + ")\n\
 -m: seed pattern\n\


=====================================
src/MultiSequence.hh
=====================================
@@ -47,8 +47,15 @@ class MultiSequence{
   // may not finish reading the sequence.
   std::istream& appendFromFasta( std::istream& stream, indexT maxSeqLen );
 
-  // As above, but read quality scores too.
-  std::istream& appendFromFastq( std::istream& stream, indexT maxSeqLen );
+  // As above, but read FASTQ format.
+  std::istream& appendFromFastq(std::istream& stream, indexT maxSeqLen,
+				bool isKeepQualityData);
+
+  // As above, but read either FASTA or FASTQ format.  The first
+  // sequence may have either format, but subsequent sequences must
+  // have the same format.
+  std::istream& appendFromFastx(std::istream& stream, indexT maxSeqLen,
+				bool isKeepQualityData);
 
   // As above, but read quality scores too.
   std::istream& appendFromPrb( std::istream& stream, indexT maxSeqLen,
@@ -133,6 +140,7 @@ class MultiSequence{
   // Qsolexa = -10*log10(p/(1-p))
   VectorOrMmap<uchar> qualityScores;
   size_t qualityScoresPerLetter;
+  bool isReadingFastq;
 
   // Read a fasta/fastq header: read the whole line but store just the
   // 1st word


=====================================
src/MultiSequenceQual.cc
=====================================
@@ -12,13 +12,35 @@
 using namespace cbrc;
 
 std::istream&
-MultiSequence::appendFromFastq( std::istream& stream, indexT maxSeqLen ){
+MultiSequence::appendFromFastx(std::istream& stream, indexT maxSeqLen,
+			       bool isKeepQualityData) {
+  if (names.empty()) {
+    isReadingFastq = false;
+    char c = '>';
+    stream >> c;
+    if (c == '@') {
+      isReadingFastq = true;
+    } else if (c != '>') {
+      ERR("bad sequence data: missing '>' or '@'");
+    }
+    readFastxName(stream);
+    if (!stream) return stream;
+  }
+
+  return isReadingFastq ?
+    appendFromFastq(stream, maxSeqLen, isKeepQualityData) :
+    appendFromFasta(stream, maxSeqLen);
+}
+
+std::istream&
+MultiSequence::appendFromFastq(std::istream& stream, indexT maxSeqLen,
+			       bool isKeepQualityData) {
   // initForAppending:
-  qualityScoresPerLetter = 1;
+  qualityScoresPerLetter = isKeepQualityData;
   if( qualityScores.v.empty() ) appendQualPad();
 
   if( isFinished() ){
-    uchar c = '@';
+    char c = '@';
     stream >> c;
     if( c != '@' ) ERR( "bad FASTQ data: missing '@'" );
     readFastxName(stream);
@@ -41,11 +63,16 @@ MultiSequence::appendFromFastq( std::istream& stream, indexT maxSeqLen ){
       c = buf->sbumpc();
     } while (c != std::streambuf::traits_type::eof() && c != '\n');
 
-    while (qualityScores.v.size() < seq.v.size()) {
+    for (size_t i = seq.v.size() - ends.v.back(); i > 0; ) {
       c = buf->sbumpc();
       if (c == std::streambuf::traits_type::eof()) ERR("bad FASTQ data");
-      if (c > 126) ERR("non-printable-ASCII in FASTQ quality data");
-      if (c > ' ') qualityScores.v.push_back(c);
+      if (c > ' ') {
+	if (isKeepQualityData) {
+	  if (c > 126) ERR("non-printable-ASCII in FASTQ quality data");
+	  qualityScores.v.push_back(c);
+	}
+	--i;
+      }
     }
 
     finish();


=====================================
src/SequenceFormat.hh
=====================================
@@ -8,7 +8,7 @@
 namespace cbrc {
 
 namespace sequenceFormat {
-enum Enum { fasta, fastqSanger, fastqSolexa, fastqIllumina, prb, pssm };
+enum Enum { fastx, fastqSanger, fastqSolexa, fastqIllumina, prb, pssm, fasta };
 }
 
 inline std::istream &operator>>(std::istream &s, sequenceFormat::Enum &f) {
@@ -19,17 +19,17 @@ inline std::istream &operator>>(std::istream &s, sequenceFormat::Enum &f) {
   return s;
 }
 
-inline bool isQuality(sequenceFormat::Enum f) {
-  return f != sequenceFormat::fasta && f != sequenceFormat::pssm;
-}
-
-inline bool isFastq(sequenceFormat::Enum f) {
+inline bool isUseFastq(sequenceFormat::Enum f) {
   return
       f == sequenceFormat::fastqSanger ||
       f == sequenceFormat::fastqSolexa ||
       f == sequenceFormat::fastqIllumina;
 }
 
+inline bool isUseQuality(sequenceFormat::Enum f) {
+  return isUseFastq(f) || f == sequenceFormat::prb;
+}
+
 inline int qualityOffset(sequenceFormat::Enum f) {
   return (f == sequenceFormat::fastqSanger) ? 33 : 64;
 }  // The result is meaningless for non-quality formats.


=====================================
src/SubsetSuffixArray.cc
=====================================
@@ -66,13 +66,13 @@ void SubsetSuffixArray::fromFiles( const std::string& baseName,
 
   try{
     childTable.m.open( baseName + ".chi", indexedPositions );
-  }catch( std::runtime_error ){
+  }catch( std::runtime_error& ){
     try{
       kiddyTable.m.open( baseName + ".chi2", indexedPositions );
-    }catch( std::runtime_error ){
+    }catch( std::runtime_error& ){
       try{
 	chibiTable.m.open( baseName + ".chi1", indexedPositions );
-      }catch( std::runtime_error ){}
+      }catch( std::runtime_error& ){}
     }
   }
 }


=====================================
src/last.hh
=====================================
@@ -25,12 +25,14 @@ inline std::istream &appendSequence(MultiSequence &m, std::istream &in,
 
   if (f == sequenceFormat::fasta) {
     m.appendFromFasta(in, maxSeqLen);
+  } else if (f == sequenceFormat::fastx) {
+    m.appendFromFastx(in, maxSeqLen, false);
   } else if (f == sequenceFormat::prb) {
     m.appendFromPrb(in, maxSeqLen, a.size, a.decode);
   } else if (f == sequenceFormat::pssm) {
     m.appendFromPssm(in, maxSeqLen, a.encode, isMaskLowercase);
   } else {
-    m.appendFromFastq(in, maxSeqLen);
+    m.appendFromFastq(in, maxSeqLen, true);
   }
 
   if (!m.isFinished() && m.finishedSequences() == 0) {


=====================================
src/lastal.cc
=====================================
@@ -80,7 +80,7 @@ namespace {
   OneQualityExpMatrix oneQualityExpMatrixRev;
   QualityPssmMaker qualityPssmMaker;
   QualityPssmMaker qualityPssmMakerRev;
-  sequenceFormat::Enum referenceFormat;  // defaults to 0
+  sequenceFormat::Enum referenceFormat = sequenceFormat::fasta;
   TwoQualityScoreMatrix twoQualityMatrix;
   TwoQualityScoreMatrix twoQualityMatrixMasked;
   TwoQualityScoreMatrix twoQualityMatrixRev;
@@ -103,7 +103,7 @@ static void calculateSubstitutionScoreMatrixStatistics() {
   lambdaCalculator.calculate( scoreMatrix.caseSensitive, alph.size );
 
   if( lambdaCalculator.isBad() ){
-    if( isQuality( args.inputFormat ) ||
+    if( isUseQuality( args.inputFormat ) ||
         (args.temperature < 0 && args.outputType > 3) )
       ERR( "can't calculate probabilities: "
 	   "maybe the mismatch costs are too weak" );
@@ -179,7 +179,7 @@ void makeQualityScorers(){
   if( args.isGreedy ) return;
 
   if( args.isTranslated() )
-    if( isQuality( args.inputFormat ) || isQuality( referenceFormat ) )
+    if( isUseQuality( args.inputFormat ) || isUseQuality( referenceFormat ) )
       return warn( args.programName,
 		   "quality data not used for DNA-versus-protein alignment" );
 
@@ -199,8 +199,8 @@ void makeQualityScorers(){
   double lp2rev[scoreMatrixRowSize];
   permuteComplement( lp2, lp2rev );
 
-  if( referenceFormat == sequenceFormat::fasta ){
-    if( isFastq( args.inputFormat ) ){
+  if( !isUseFastq( referenceFormat ) ){
+    if( isUseFastq( args.inputFormat ) ){
       LOG( "calculating per-quality scores..." );
       if( args.maskLowercase > 0 )
 	oneQualityMatrixMasked.init( m, alph.size, lambda,
@@ -232,7 +232,7 @@ void makeQualityScorers(){
 	  oneQualityExpMatrixRev.init( qRev, args.temperature );
       }
     }
-    if( isQuality(args.inputFormat) ){
+    if( isUseQuality(args.inputFormat) ){
       qualityPssmMaker.init( m, alph.size, lambda, isMatchMismatch,
                              args.matchScore, -args.mismatchCost,
                              isPhred2, offset2, alph.numbersToUppercase );
@@ -243,7 +243,7 @@ void makeQualityScorers(){
     }
   }
   else{
-    if( isFastq( args.inputFormat ) ){
+    if( isUseFastq( args.inputFormat ) ){
       if( args.maskLowercase > 0 )
 	twoQualityMatrixMasked.init( m, lambda, lp1, lp2,
 				     isPhred1, offset1, isPhred2, offset2,
@@ -336,8 +336,9 @@ void readOuterPrj( const std::string& fileName, unsigned& volumes,
 
   if( f.eof() && !f.bad() ) f.clear();
   if( alph.letters.empty() || refSequences+1 == 0 || refLetters+1 == 0 ||
-      isCaseSensitiveSeeds < 0 || referenceFormat >= sequenceFormat::prb ||
-      numOfIndexes > maxNumOfIndexes ){
+      isCaseSensitiveSeeds < 0 || numOfIndexes > maxNumOfIndexes ||
+      referenceFormat == sequenceFormat::prb ||
+      referenceFormat == sequenceFormat::pssm ){
     f.setstate( std::ios::failbit );
   }
   if( !f ) ERR( "can't read file: " + fileName );
@@ -748,7 +749,7 @@ void alignFinish( LastAligner& aligner, const AlignmentPot& gappedAlns,
       if (args.outputType == 7) {
 	centroid.setScoreMatrix(dis.m, args.temperature);
 	centroid.setLetterProbsPerPosition(alph.size, queryLen, dis.b, dis.j,
-					   isFastq(args.inputFormat),
+					   isUseFastq(args.inputFormat),
 					   qualityPssmMaker.qualToProbRight(),
 					   lambdaCalculator.letterProbs2(),
 					   alph.numbersToUppercase);
@@ -850,7 +851,7 @@ static void printAndClearAll() {
 void makeQualityPssm( LastAligner& aligner,
 		      size_t queryNum, char strand, const uchar* querySeq,
 		      bool isMask ){
-  if( !isQuality( args.inputFormat ) || isQuality( referenceFormat ) ) return;
+  if (!isUseQuality(args.inputFormat) || isUseQuality(referenceFormat)) return;
   if( args.isTranslated() || args.isGreedy ) return;
 
   std::vector<int> &qualityPssm = aligner.qualityPssm;
@@ -1028,7 +1029,7 @@ static void scanOneVolume(unsigned volume, unsigned volumeCount) {
 
 void readIndex( const std::string& baseName, indexT seqCount ) {
   LOG( "reading " << baseName << "..." );
-  text.fromFiles( baseName, seqCount, isFastq( referenceFormat ) );
+  text.fromFiles(baseName, seqCount, referenceFormat != sequenceFormat::fasta);
   for( unsigned x = 0; x < numOfIndexes; ++x ){
     if( numOfIndexes > 1 ){
       suffixArrays[x].fromFiles( baseName + char('a' + x),


=====================================
src/lastdb.cc
=====================================
@@ -98,7 +98,7 @@ void writeLastalOptions( std::ostream& out, const std::string& seedText ){
 void writePrjFile( const std::string& fileName, const LastdbArguments& args,
 		   const Alphabet& alph, countT sequenceCount,
 		   const std::vector<countT>& letterCounts,
-		   unsigned volumes, unsigned numOfIndexes,
+		   bool isFastq, unsigned volumes, unsigned numOfIndexes,
 		   const std::string& seedText ){
   countT letterTotal = std::accumulate( letterCounts.begin(),
                                         letterCounts.end(), countT(0) );
@@ -124,7 +124,7 @@ void writePrjFile( const std::string& fileName, const LastdbArguments& args,
       f << "tantansetting=" << args.tantanSetting << '\n';
     }
     f << "masklowercase=" << args.isCaseSensitive << '\n';
-    if( args.inputFormat != sequenceFormat::fasta ){
+    if( isFastq ){
       f << "sequenceformat=" << args.inputFormat << '\n';
     }
     if( args.minimizerWindow > 1 ){
@@ -201,7 +201,8 @@ void makeVolume( std::vector< CyclicSubsetSeed >& seeds,
 
   LOG( "writing..." );
   writePrjFile( baseName + ".prj", args, alph, numOfSequences,
-		letterCountsInThisVolume, -1, numOfIndexes, seedText );
+		letterCountsInThisVolume, multi.qualsPerLetter(), -1,
+		numOfIndexes, seedText );
   multi.toFiles( baseName );
 
   for( unsigned x = 0; x < numOfIndexes; ++x ){
@@ -240,8 +241,9 @@ void makeVolume( std::vector< CyclicSubsetSeed >& seeds,
 // neglects memory for the sequence names, and the fact that
 // lowercase-masked letters and DNA "N"s aren't indexed.)
 static indexT maxLettersPerVolume( const LastdbArguments& args,
+				   size_t qualityCodesPerLetter,
 				   unsigned numOfIndexes ){
-  size_t bytesPerLetter = isFastq( args.inputFormat ) ? 2 : 1;
+  size_t bytesPerLetter = 1 + qualityCodesPerLetter;
   size_t maxIndexBytesPerPosition = sizeof(indexT) + 1;
   maxIndexBytesPerPosition *= numOfIndexes;
   size_t x = bytesPerLetter * args.indexStep + maxIndexBytesPerPosition;
@@ -286,7 +288,7 @@ void lastdb( int argc, char** argv ){
   unsigned volumeNumber = 0;
   countT sequenceCount = 0;
   std::vector<countT> letterCounts( alph.size );
-  indexT maxSeqLen = maxLettersPerVolume( args, seeds.size() );
+  indexT maxSeqLen = 0;
 
   char defaultInputName[] = "-";
   char* defaultInput[] = { defaultInputName, 0 };
@@ -299,9 +301,13 @@ void lastdb( int argc, char** argv ){
 
     while (appendSequence(multi, in, maxSeqLen, args.inputFormat, alph,
 			  args.isKeepLowercase, 0)) {
-      if( !args.isProtein && args.userAlphabet.empty() &&
-          sequenceCount == 0 && isDubiousDna( alph, multi ) ){
-        std::cerr << args.programName << ": that's some funny-lookin DNA\n";
+      if (sequenceCount == 0) {
+	maxSeqLen =
+	  maxLettersPerVolume(args, multi.qualsPerLetter(), seeds.size());
+	if (!args.isProtein && args.userAlphabet.empty() &&
+	    isDubiousDna(alph, multi)) {
+	  std::cerr << args.programName << ": that's some funny-lookin DNA\n";
+	}
       }
 
       if( multi.isFinished() ){
@@ -345,7 +351,8 @@ void lastdb( int argc, char** argv ){
   }
 
   writePrjFile( args.lastdbName + ".prj", args, alph, sequenceCount,
-		letterCounts, volumeNumber, seeds.size(), seedText );
+		letterCounts, multi.qualsPerLetter(), volumeNumber,
+		seeds.size(), seedText );
 }
 
 int main( int argc, char** argv )


=====================================
src/split/last-split.cc
=====================================
@@ -134,8 +134,12 @@ static void doOneAlignmentPart(cbrc::SplitAligner& sa,
 
   if (opts.format == 'm') s.pop_back();
 
-  std::cout << std::setprecision(mismapPrecision)
-	    << "a score=" << score << " mismap=" << mismap;
+  if (opts.no_split && a.linesBeg[0][0] == 'a') {
+    std::cout << a.linesBeg[0];
+  } else {
+    std::cout << "a score=" << score;
+  }
+  std::cout << std::setprecision(mismapPrecision) << " mismap=" << mismap;
   if (opts.direction == 2) printSense(senseStrandLogOdds);
   if (!opts.genome.empty() && !opts.no_split) {
     char signal[3] = {0};
@@ -383,7 +387,7 @@ void lastSplit(LastSplitOptions& opts) {
 	  mafEnds.resize(1);
 	} else if (isSpace(line)) {
 	  addMaf(mafEnds, mafLines);
-	} else if (std::strchr("sqpc", line[0])) {
+	} else if (std::strchr(opts.no_split ? "asqpc" : "sqpc", line[0])) {
 	  mafLines.push_back(line);
 	}
       }


=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"956"
+"961"



View it on GitLab: https://salsa.debian.org/med-team/last-align/commit/1d01bfcb813030a4be50b9db3df0318d0579323a

-- 
View it on GitLab: https://salsa.debian.org/med-team/last-align/commit/1d01bfcb813030a4be50b9db3df0318d0579323a
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181217/dba5c6b9/attachment-0001.html>


More information about the debian-med-commit mailing list