[med-svn] [Git][med-team/last-align][upstream] New upstream version 961
Andreas Tille
gitlab at salsa.debian.org
Mon Dec 17 10:20:33 GMT 2018
Andreas Tille pushed to branch upstream at Debian Med / last-align
Commits:
1d01bfcb by Andreas Tille at 2018-12-17T10:12:55Z
New upstream version 961
- - - - -
27 changed files:
- ChangeLog.txt
- doc/last-dotplot.html
- doc/last-dotplot.txt
- doc/last-train.html
- doc/last-train.txt
- doc/lastal.html
- doc/lastal.txt
- doc/lastdb.html
- doc/lastdb.txt
- scripts/last-dotplot
- scripts/last-map-probs
- scripts/last-postmask
- scripts/last-train
- scripts/maf-convert
- scripts/maf-cut
- scripts/maf-swap
- src/LastalArguments.cc
- src/LastdbArguments.cc
- src/MultiSequence.hh
- src/MultiSequenceQual.cc
- src/SequenceFormat.hh
- src/SubsetSuffixArray.cc
- src/last.hh
- src/lastal.cc
- src/lastdb.cc
- src/split/last-split.cc
- src/version.hh
Changes:
=====================================
ChangeLog.txt
=====================================
@@ -1,8 +1,41 @@
+2018-12-10 Martin C. Frith <Martin C. Frith>
+
+ * doc/last-dotplot.txt, scripts/last-dotplot:
+ Add last-dotplot --maxseqs option
+ [ed0fb9b1eb40] [tip]
+
+ * src/split/last-split.cc:
+ Make last-split -n keep E-values
+ [2063a5a7d7ac]
+
+2018-10-30 Martin C. Frith <Martin C. Frith>
+
+ * doc/last-train.txt, doc/lastal.txt, doc/lastdb.txt, scripts/last-
+ train, src/LastalArguments.cc, src/LastdbArguments.cc,
+ src/MultiSequence.hh, src/MultiSequenceQual.cc,
+ src/SequenceFormat.hh, src/last.hh, test/last-test.out, test/last-
+ test.sh:
+ Add fastq-ignore option
+ [c551d04a3928]
+
+2018-10-29 Martin C. Frith <Martin C. Frith>
+
+ * src/LastalArguments.cc, src/SequenceFormat.hh,
+ src/SubsetSuffixArray.cc, src/lastal.cc, src/lastdb.cc:
+ Refactor
+ [96b45da7bf0f]
+
+ * scripts/last-dotplot, scripts/last-map-probs, scripts/last-postmask,
+ scripts/last-train, scripts/maf-convert, scripts/maf-cut, scripts
+ /maf-swap:
+ Fix gz reading for Python3
+ [c54672c8892e]
+
2018-09-10 Martin C. Frith <Martin C. Frith>
* scripts/last-map-probs:
Python3-ify last-map-probs
- [83191e47259e] [tip]
+ [83191e47259e]
2018-09-07 Martin C. Frith <Martin C. Frith>
=====================================
doc/last-dotplot.html
=====================================
@@ -335,62 +335,6 @@ last-dotplot alns alns.gif
set of sequences. This document calls a "set of sequences" a
"genome", though it need not actually be a genome.</p>
</div>
-<div class="section" id="choosing-sequences">
-<h2>Choosing sequences</h2>
-<p>If there are too many sequences, the dotplot will be very cluttered,
-or the script might give up with an error message. You can exclude
-sequences with names like "chrUn_random522" like this:</p>
-<pre class="literal-block">
-last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
-</pre>
-<p>Option "-1" selects sequences from the 1st (horizontal) genome, and
-"-2" selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
-a <em>pattern</em> that specifies names starting with "chr", followed by any
-character except U, followed by anything.</p>
-<table border="1" class="docutils">
-<colgroup>
-<col width="26%" />
-<col width="74%" />
-</colgroup>
-<tbody valign="top">
-<tr><td>Pattern</td>
-<td>Meaning</td>
-</tr>
-<tr><td><tt class="docutils literal">*</tt></td>
-<td>zero or more of any character</td>
-</tr>
-<tr><td><tt class="docutils literal">?</tt></td>
-<td>any single character</td>
-</tr>
-<tr><td><tt class="docutils literal">[abc]</tt></td>
-<td>any character in abc</td>
-</tr>
-<tr><td><tt class="docutils literal">[!abc]</tt></td>
-<td>any character not in abc</td>
-</tr>
-</tbody>
-</table>
-<p>If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
-compared to both the whole name and the part after the dot.</p>
-<p>You can specify more than one pattern, e.g. this gets sequences with
-names starting in "chr" followed by one or two characters:</p>
-<pre class="literal-block">
-last-dotplot -1 'chr?' -1 'chr??' alns alns.png
-</pre>
-<p>You can also specify a sequence range; for example this gets the first
-1000 bases of chr9:</p>
-<pre class="literal-block">
-last-dotplot -1 chr9:0-1000 alns alns.png
-</pre>
-</div>
-<div class="section" id="fonts">
-<h2>Fonts</h2>
-<p>last-dotplot tries to find a nice font on your computer, but may fail
-and use an ugly font. You can specify a font like this:</p>
-<pre class="literal-block">
-last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
-</pre>
-</div>
<div class="section" id="options">
<h2>Options</h2>
<blockquote>
@@ -405,18 +349,23 @@ last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns
<kbd><span class="option">-v</span>, <span class="option">--verbose</span></kbd></td>
<td>Show progress messages & data about the plot.</td></tr>
<tr><td class="option-group">
+<kbd><span class="option">-x <var>INT</var></span>, <span class="option">--width=<var>INT</var></span></kbd></td>
+<td>Maximum width in pixels.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-y <var>INT</var></span>, <span class="option">--height=<var>INT</var></span></kbd></td>
+<td>Maximum height in pixels.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-m <var>M</var></span>, <span class="option">--maxseqs=<var>M</var></span></kbd></td>
+<td>Maximum number of horizontal or vertical sequences. If there
+are >M sequences, the smallest ones (after cutting) will be
+discarded.</td></tr>
+<tr><td class="option-group">
<kbd><span class="option">-1 <var>PATTERN</var></span>, <span class="option">--seq1=<var>PATTERN</var></span></kbd></td>
<td>Which sequences to show from the 1st (horizontal) genome.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-2 <var>PATTERN</var></span>, <span class="option">--seq2=<var>PATTERN</var></span></kbd></td>
<td>Which sequences to show from the 2nd (vertical) genome.</td></tr>
<tr><td class="option-group">
-<kbd><span class="option">-x <var>WIDTH</var></span>, <span class="option">--width=<var>WIDTH</var></span></kbd></td>
-<td>Maximum width in pixels.</td></tr>
-<tr><td class="option-group">
-<kbd><span class="option">-y <var>HEIGHT</var></span>, <span class="option">--height=<var>HEIGHT</var></span></kbd></td>
-<td>Maximum height in pixels.</td></tr>
-<tr><td class="option-group">
<kbd><span class="option">-c <var>COLOR</var></span>, <span class="option">--forwardcolor=<var>COLOR</var></span></kbd></td>
<td>Color for forward alignments.</td></tr>
<tr><td class="option-group">
@@ -602,6 +551,65 @@ of unknown sequence.</p>
pixel.</p>
</div>
</div>
+<div class="section" id="choosing-sequences">
+<h2>Choosing sequences</h2>
+<p>For example, you can exclude sequences with names like
+"chrUn_random522" like this:</p>
+<pre class="literal-block">
+last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
+</pre>
+<p>Option "-1" selects sequences from the 1st (horizontal) genome, and
+"-2" selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
+a <em>pattern</em> that specifies names starting with "chr", followed by any
+character except U, followed by anything.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="26%" />
+<col width="74%" />
+</colgroup>
+<tbody valign="top">
+<tr><td>Pattern</td>
+<td>Meaning</td>
+</tr>
+<tr><td><tt class="docutils literal">*</tt></td>
+<td>zero or more of any character</td>
+</tr>
+<tr><td><tt class="docutils literal">?</tt></td>
+<td>any single character</td>
+</tr>
+<tr><td><tt class="docutils literal">[abc]</tt></td>
+<td>any character in abc</td>
+</tr>
+<tr><td><tt class="docutils literal">[!abc]</tt></td>
+<td>any character not in abc</td>
+</tr>
+</tbody>
+</table>
+<p>If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
+compared to both the whole name and the part after the dot.</p>
+<p>You can specify more than one pattern, e.g. this gets sequences with
+names starting in "chr" followed by one or two characters:</p>
+<pre class="literal-block">
+last-dotplot -1 'chr?' -1 'chr??' alns alns.png
+</pre>
+<p>You can also specify a sequence range; for example this gets the first
+1000 bases of chr9:</p>
+<pre class="literal-block">
+last-dotplot -1 chr9:0-1000 alns alns.png
+</pre>
+</div>
+<div class="section" id="text-font">
+<h2>Text font</h2>
+<p>You can improve the font quality by increasing its size, e.g. to 20
+points:</p>
+<blockquote>
+last-dotplot -s20 my-alignments my-plot.png</blockquote>
+<p>last-dotplot tries to find a nice font on your computer, but may fail
+and use an ugly font. You can specify a font like this:</p>
+<pre class="literal-block">
+last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
+</pre>
+</div>
<div class="section" id="colors">
<h2>Colors</h2>
<p>Colors can be specified in <a class="reference external" href="http://effbot.org/imagingbook/imagecolor.htm">various ways described here</a>.</p>
=====================================
doc/last-dotplot.txt
=====================================
@@ -19,50 +19,6 @@ last-dotplot shows alignments of one set of sequences against another
set of sequences. This document calls a "set of sequences" a
"genome", though it need not actually be a genome.
-Choosing sequences
-------------------
-
-If there are too many sequences, the dotplot will be very cluttered,
-or the script might give up with an error message. You can exclude
-sequences with names like "chrUn_random522" like this::
-
- last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
-
-Option "-1" selects sequences from the 1st (horizontal) genome, and
-"-2" selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
-a *pattern* that specifies names starting with "chr", followed by any
-character except U, followed by anything.
-
-========== =============================
-Pattern Meaning
----------- -----------------------------
-``*`` zero or more of any character
-``?`` any single character
-``[abc]`` any character in abc
-``[!abc]`` any character not in abc
-========== =============================
-
-If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
-compared to both the whole name and the part after the dot.
-
-You can specify more than one pattern, e.g. this gets sequences with
-names starting in "chr" followed by one or two characters::
-
- last-dotplot -1 'chr?' -1 'chr??' alns alns.png
-
-You can also specify a sequence range; for example this gets the first
-1000 bases of chr9::
-
- last-dotplot -1 chr9:0-1000 alns alns.png
-
-Fonts
------
-
-last-dotplot tries to find a nice font on your computer, but may fail
-and use an ugly font. You can specify a font like this::
-
- last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
-
Options
-------
@@ -70,14 +26,18 @@ Options
Show a help message, with default option values, and exit.
-v, --verbose
Show progress messages & data about the plot.
+ -x INT, --width=INT
+ Maximum width in pixels.
+ -y INT, --height=INT
+ Maximum height in pixels.
+ -m M, --maxseqs=M
+ Maximum number of horizontal or vertical sequences. If there
+ are >M sequences, the smallest ones (after cutting) will be
+ discarded.
-1 PATTERN, --seq1=PATTERN
Which sequences to show from the 1st (horizontal) genome.
-2 PATTERN, --seq2=PATTERN
Which sequences to show from the 2nd (vertical) genome.
- -x WIDTH, --width=WIDTH
- Maximum width in pixels.
- -y HEIGHT, --height=HEIGHT
- Maximum height in pixels.
-c COLOR, --forwardcolor=COLOR
Color for forward alignments.
-r COLOR, --reversecolor=COLOR
@@ -207,6 +167,54 @@ of unknown sequence.
An unsequenced gap will be shown only if it covers at least one whole
pixel.
+Choosing sequences
+------------------
+
+For example, you can exclude sequences with names like
+"chrUn_random522" like this::
+
+ last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
+
+Option "-1" selects sequences from the 1st (horizontal) genome, and
+"-2" selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
+a *pattern* that specifies names starting with "chr", followed by any
+character except U, followed by anything.
+
+========== =============================
+Pattern Meaning
+---------- -----------------------------
+``*`` zero or more of any character
+``?`` any single character
+``[abc]`` any character in abc
+``[!abc]`` any character not in abc
+========== =============================
+
+If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
+compared to both the whole name and the part after the dot.
+
+You can specify more than one pattern, e.g. this gets sequences with
+names starting in "chr" followed by one or two characters::
+
+ last-dotplot -1 'chr?' -1 'chr??' alns alns.png
+
+You can also specify a sequence range; for example this gets the first
+1000 bases of chr9::
+
+ last-dotplot -1 chr9:0-1000 alns alns.png
+
+Text font
+---------
+
+You can improve the font quality by increasing its size, e.g. to 20
+points:
+
+ last-dotplot -s20 my-alignments my-plot.png
+
+last-dotplot tries to find a nice font on your computer, but may fail
+and use an ugly font. You can specify a font like this::
+
+ last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
+
Colors
------
=====================================
doc/last-train.html
=====================================
@@ -333,7 +333,7 @@ last-train mydb queries.fasta
<p>last-train prints a summary of each alignment step, followed by the
final score parameters, in a format that can be read by <a class="reference external" href="lastal.html#score-options">lastal's -p
option</a>.</p>
-<p>You can also pipe sequences into last-train, for example:</p>
+<p>last-train can read .gz files, or from pipes:</p>
<pre class="literal-block">
bzcat queries.fasta.bz2 | last-train mydb
</pre>
@@ -479,15 +479,21 @@ position in each query.</td></tr>
<td>Number of parallel threads.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td><p class="first">Query sequence format: 0=fasta, 1=fastq-sanger.</p>
-<p>last-train assumes that fastq quality codes indicate
-substitution error probabilities, <em>not</em> insertion or
-deletion error probabilities. If this assumption is
-dubious (e.g. for data with many insertion or deletion
-errors), it may be better to use fasta.</p>
-<p>For fastq, last-train finds the rates of substitutions
-not explained by the quality data (ideally, real
-substitutions as opposed to errors).</p>
+<td><p class="first">How to read the query sequences. By default, they must
+be in <tt class="docutils literal">fasta</tt> format. <tt class="docutils literal"><span class="pre">-Q0</span></tt> means <tt class="docutils literal">fasta</tt> or
+<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>. <tt class="docutils literal"><span class="pre">-Q1</span></tt> means <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>.</p>
+<p>The <tt class="docutils literal">fastq</tt> formats are described here:
+<a class="reference external" href="lastal.html">lastal.html</a>. <tt class="docutils literal"><span class="pre">fastq-ignore</span></tt> means that the
+quality data is ignored, so the results will be the same
+as for <tt class="docutils literal">fasta</tt>.</p>
+<p>For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train assumes the quality
+codes indicate substitution error probabilities, <em>not</em>
+insertion or deletion error probabilities. If this
+assumption is dubious (e.g. for data with many insertion
+or deletion errors), it may be better to use
+<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>. For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train finds
+the rates of substitutions not explained by the quality
+data (ideally, real substitutions as opposed to errors).</p>
<p class="last">If specified, this parameter is written in last-train's
output, so it will override lastal's default.</p>
</td></tr>
=====================================
doc/last-train.txt
=====================================
@@ -19,7 +19,7 @@ last-train prints a summary of each alignment step, followed by the
final score parameters, in a format that can be read by `lastal's -p
option <lastal.html#score-options>`_.
-You can also pipe sequences into last-train, for example::
+last-train can read .gz files, or from pipes::
bzcat queries.fasta.bz2 | last-train mydb
@@ -102,17 +102,23 @@ Alignment options
-k STEP Look for initial matches starting only at every STEP-th
position in each query.
-P COUNT Number of parallel threads.
- -Q NUMBER Query sequence format: 0=fasta, 1=fastq-sanger.
-
- last-train assumes that fastq quality codes indicate
- substitution error probabilities, *not* insertion or
- deletion error probabilities. If this assumption is
- dubious (e.g. for data with many insertion or deletion
- errors), it may be better to use fasta.
-
- For fastq, last-train finds the rates of substitutions
- not explained by the quality data (ideally, real
- substitutions as opposed to errors).
+ -Q NUMBER How to read the query sequences. By default, they must
+ be in ``fasta`` format. ``-Q0`` means ``fasta`` or
+ ``fastq-ignore``. ``-Q1`` means ``fastq-sanger``.
+
+ The ``fastq`` formats are described here:
+ `<lastal.html>`_. ``fastq-ignore`` means that the
+ quality data is ignored, so the results will be the same
+ as for ``fasta``.
+
+ For ``fastq-sanger``, last-train assumes the quality
+ codes indicate substitution error probabilities, *not*
+ insertion or deletion error probabilities. If this
+ assumption is dubious (e.g. for data with many insertion
+ or deletion errors), it may be better to use
+ ``fastq-ignore``. For ``fastq-sanger``, last-train finds
+ the rates of substitutions not explained by the quality
+ data (ideally, real substitutions as opposed to errors).
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
=====================================
doc/lastal.html
=====================================
@@ -849,13 +849,18 @@ generalized affine gap costs. They are:</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td><p class="first">This option allows lastal to use sequence quality scores, or
-PSSMs, for the queries. 0 means read queries in fasta format
-(without quality scores); 1 means fastq-sanger format; 2 means
-fastq-solexa format; 3 means fastq-illumina format; 4 means prb
-format; 5 means read PSSMs. (<em>Warning</em>: Illumina data is not
-necessarily in fastq-illumina format; it is often in
-fastq-sanger format.)</p>
+<td><p class="first">Specify how to read the query sequences:</p>
+<pre class="literal-block">
+Default fasta
+ 0 fasta or fastq-ignore
+ 1 fastq-sanger
+ 2 fastq-solexa
+ 3 fastq-illumina
+ 4 prb
+ 5 PSSM
+</pre>
+<p><em>Warning</em>: Illumina data is not necessarily in fastq-illumina
+format; it is often in fastq-sanger format.</p>
<p>The fastq formats look like this:</p>
<pre class="literal-block">
@mySequenceName
@@ -863,12 +868,18 @@ TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
+
55555555*&5-/55*5//5(55,5#&$)$)*+$
</pre>
-<p>The "+" may optionally be followed by a name (ignored), and the
-sequence and quality codes are allowed to wrap onto more than
-one line. For fastq-sanger, the quality scores are obtained by
-subtracting 33 from the ASCII values of the characters below the
-"+". For fastq-solexa and fastq-illumina, they are obtained by
-subtracting 64.</p>
+<p>The "+" may be followed by text (ignored). The symbols below
+the "+" are quality codes, one per sequence letter. The
+sequence and quality codes may wrap onto more than one line.</p>
+<p>fastq-ignore means that the quality codes are ignored. For the
+other fastq variants, lastal assumes the quality codes indicate
+substitution error probabilities, <em>not</em> insertion or deletion
+error probabilities. If this assumption is dubious (e.g. for
+data with many insertion or deletion errors), it may be better
+to use fastq-ignore.</p>
+<p>For fastq-sanger, quality scores are obtained by subtracting 33
+from the ASCII values of the quality codes. For fastq-solexa
+and fastq-illumina, they are obtained by subtracting 64.</p>
<p>prb format stores four quality scores (A, C, G, T) per position,
with one sequence per line, like this:</p>
<pre class="literal-block">
=====================================
doc/lastal.txt
=====================================
@@ -474,13 +474,18 @@ Miscellaneous options
* The count of adjacent pairs of insertions and unaligned letter pairs.
-Q NUMBER
- This option allows lastal to use sequence quality scores, or
- PSSMs, for the queries. 0 means read queries in fasta format
- (without quality scores); 1 means fastq-sanger format; 2 means
- fastq-solexa format; 3 means fastq-illumina format; 4 means prb
- format; 5 means read PSSMs. (*Warning*: Illumina data is not
- necessarily in fastq-illumina format; it is often in
- fastq-sanger format.)
+ Specify how to read the query sequences::
+
+ Default fasta
+ 0 fasta or fastq-ignore
+ 1 fastq-sanger
+ 2 fastq-solexa
+ 3 fastq-illumina
+ 4 prb
+ 5 PSSM
+
+ *Warning*: Illumina data is not necessarily in fastq-illumina
+ format; it is often in fastq-sanger format.
The fastq formats look like this::
@@ -489,12 +494,20 @@ Miscellaneous options
+
55555555*&5-/55*5//5(55,5#&$)$)*+$
- The "+" may optionally be followed by a name (ignored), and the
- sequence and quality codes are allowed to wrap onto more than
- one line. For fastq-sanger, the quality scores are obtained by
- subtracting 33 from the ASCII values of the characters below the
- "+". For fastq-solexa and fastq-illumina, they are obtained by
- subtracting 64.
+ The "+" may be followed by text (ignored). The symbols below
+ the "+" are quality codes, one per sequence letter. The
+ sequence and quality codes may wrap onto more than one line.
+
+ fastq-ignore means that the quality codes are ignored. For the
+ other fastq variants, lastal assumes the quality codes indicate
+ substitution error probabilities, *not* insertion or deletion
+ error probabilities. If this assumption is dubious (e.g. for
+ data with many insertion or deletion errors), it may be better
+ to use fastq-ignore.
+
+ For fastq-sanger, quality scores are obtained by subtracting 33
+ from the ASCII values of the quality codes. For fastq-solexa
+ and fastq-illumina, they are obtained by subtracting 64.
prb format stores four quality scores (A, C, G, T) per position,
with one sequence per line, like this::
=====================================
doc/lastdb.html
=====================================
@@ -457,11 +457,10 @@ about 4 billion.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td>Specify the input format. 0 means fasta, 1 means fastq-sanger,
-2 means fastq-solexa, and 3 means fastq-illumina. The fastq
-formats provide sequence quality data, which will be stored by
-lastdb and then used by lastal. These formats are described in
-<a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
+<td>Specify how to read the sequences. The default is fasta, 0
+means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
+fastq-solexa, and 3 means fastq-illumina. The fastq formats are
+described in <a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
<td>Divide the work between this number of threads running in
=====================================
doc/lastdb.txt
=====================================
@@ -125,11 +125,10 @@ Advanced Options
about 4 billion.
-Q NUMBER
- Specify the input format. 0 means fasta, 1 means fastq-sanger,
- 2 means fastq-solexa, and 3 means fastq-illumina. The fastq
- formats provide sequence quality data, which will be stored by
- lastdb and then used by lastal. These formats are described in
- `<lastal.html>`_.
+ Specify how to read the sequences. The default is fasta, 0
+ means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
+ fastq-solexa, and 3 means fastq-illumina. The fastq formats are
+ described in `<lastal.html>`_.
-P THREADS
Divide the work between this number of threads running in
=====================================
scripts/last-dotplot
=====================================
@@ -13,6 +13,7 @@ import collections
import functools
import gzip
from fnmatch import fnmatchcase
+import logging
from operator import itemgetter
import subprocess
import itertools, optparse, os, re, sys
@@ -34,14 +35,9 @@ def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
-def warn(message):
- if opts.verbose:
- prog = os.path.basename(sys.argv[0])
- sys.stderr.write(prog + ": " + message + "\n")
-
def groupByFirstItem(things):
for k, v in itertools.groupby(things, itemgetter(0)):
yield k, [i[1:] for i in v]
@@ -414,8 +410,8 @@ def dataFromRanges(sortedRanges, font, fontSize, imageMode, labelOpt, textRot):
out = [seqName, str(rangeBeg), str(rangeEnd)]
if strandNum > 0:
out.append(".+-"[strandNum])
- warn("\t".join(out))
- warn("")
+ logging.info("\t".join(out))
+ logging.info("")
rangeSizes = [e - b for n, b, e, s in sortedRanges]
labs = list(rangeLabels(sortedRanges, labelOpt, font, fontSize,
imageMode, textRot))
@@ -430,7 +426,7 @@ def div_ceil(x, y):
def get_bp_per_pix(rangeSizes, pixTweenRanges, maxPixels):
'''Get the minimum bp-per-pixel that fits in the size limit.'''
- warn("choosing bp per pixel...")
+ logging.info("choosing bp per pixel...")
numOfRanges = len(rangeSizes)
maxPixelsInRanges = maxPixels - pixTweenRanges * (numOfRanges - 1)
if maxPixelsInRanges < numOfRanges:
@@ -718,18 +714,37 @@ def getFont(opts):
out, err = p.communicate()
fileNames.append(out)
except OSError as e:
- warn("fc-match error: " + str(e))
+ logging.info("fc-match error: " + str(e))
fileNames.append("/Library/Fonts/Arial.ttf") # for Mac
for i in fileNames:
try:
font = ImageFont.truetype(i, opts.fontsize)
- warn("font: " + i)
+ logging.info("font: " + i)
return font
except IOError as e:
- warn("font load error: " + str(e))
+ logging.info("font load error: " + str(e))
return ImageFont.load_default()
+def sequenceSizesAndNames(seqRanges):
+ for seqName, ranges in itertools.groupby(seqRanges, itemgetter(0)):
+ size = sum(e - b for n, b, e in ranges)
+ yield size, seqName
+
+def biggestSequences(seqRanges, maxNumOfSequences):
+ s = sorted(sequenceSizesAndNames(seqRanges), reverse=True)
+ if len(s) > maxNumOfSequences:
+ logging.warning("too many sequences - discarding the smallest ones")
+ s = s[:maxNumOfSequences]
+ return set(i[1] for i in s)
+
+def remainingSequenceRanges(seqRanges, alignments, seqIndex):
+ remainingSequences = set(i[seqIndex] for i in alignments)
+ return [i for i in seqRanges if i[0] in remainingSequences]
+
def lastDotplot(opts, args):
+ logLevel = logging.INFO if opts.verbose else logging.WARNING
+ logging.basicConfig(format="%(filename)s: %(message)s", level=logLevel)
+
font = getFont(opts)
image_mode = 'RGB'
forward_color = ImageColor.getcolor(opts.forwardcolor, image_mode)
@@ -740,11 +755,11 @@ def lastDotplot(opts, args):
maxGap1, maxGapB1 = twoValuesFromOption(opts.max_gap1, ":")
maxGap2, maxGapB2 = twoValuesFromOption(opts.max_gap2, ":")
- warn("reading alignments...")
+ logging.info("reading alignments...")
alnData = readAlignments(args[0], opts)
alignments, seqRanges1, coverDict1, seqRanges2, coverDict2 = alnData
if not alignments: raise Exception("there are no alignments")
- warn("cutting...")
+ logging.info("cutting...")
coverDict1 = dict(mergedRangesPerSeq(coverDict1))
coverDict2 = dict(mergedRangesPerSeq(coverDict2))
minAlignedBases = min(coveredLength(coverDict1), coveredLength(coverDict2))
@@ -754,10 +769,20 @@ def lastDotplot(opts, args):
cutRanges2 = list(trimmed(seqRanges2, coverDict2, minAlignedBases,
maxGap2, pad, pad))
- warn("reading secondary alignments...")
+ biggestSeqs1 = biggestSequences(cutRanges1, opts.maxseqs)
+ cutRanges1 = [i for i in cutRanges1 if i[0] in biggestSeqs1]
+ alignments = [i for i in alignments if i[0] in biggestSeqs1]
+ cutRanges2 = remainingSequenceRanges(cutRanges2, alignments, 1)
+
+ biggestSeqs2 = biggestSequences(cutRanges2, opts.maxseqs)
+ cutRanges2 = [i for i in cutRanges2 if i[0] in biggestSeqs2]
+ alignments = [i for i in alignments if i[1] in biggestSeqs2]
+ cutRanges1 = remainingSequenceRanges(cutRanges1, alignments, 0)
+
+ logging.info("reading secondary alignments...")
alnDataB = readSecondaryAlignments(opts, cutRanges1, cutRanges2)
alignmentsB, seqRangesB1, coverDictB1, seqRangesB2, coverDictB2 = alnDataB
- warn("cutting...")
+ logging.info("cutting...")
coverDictB1 = dict(mergedRangesPerSeq(coverDictB1))
coverDictB2 = dict(mergedRangesPerSeq(coverDictB2))
cutRangesB1 = trimmed(seqRangesB1, coverDictB1, minAlignedBases,
@@ -765,7 +790,7 @@ def lastDotplot(opts, args):
cutRangesB2 = trimmed(seqRangesB2, coverDictB2, minAlignedBases,
maxGapB2, 0, 0)
- warn("sorting...")
+ logging.info("sorting...")
sortOut = allSortedRanges(opts, alignments, alignmentsB,
cutRanges1, cutRangesB1, cutRanges2, cutRangesB2)
sortedRanges1, sortedRanges2 = sortOut
@@ -785,7 +810,7 @@ def lastDotplot(opts, args):
bpPerPix1 = get_bp_per_pix(rangeSizes1, opts.border_pixels, maxPixels1)
bpPerPix2 = get_bp_per_pix(rangeSizes2, opts.border_pixels, maxPixels2)
bpPerPix = max(bpPerPix1, bpPerPix2)
- warn("bp per pixel = " + str(bpPerPix))
+ logging.info("bp per pixel = " + str(bpPerPix))
p1 = pixelData(rangeSizes1, bpPerPix, opts.border_pixels, lMargin)
rangePixBegs1, rangePixLens1, width = p1
@@ -797,14 +822,14 @@ def lastDotplot(opts, args):
rangeDict2 = dict(rangesPerSeq(sortedRanges2, rangePixBegs2,
rangePixLens2, bpPerPix))
- warn("width: " + str(width))
- warn("height: " + str(height))
+ logging.info("width: " + str(width))
+ logging.info("height: " + str(height))
- warn("processing alignments...")
+ logging.info("processing alignments...")
hits = alignmentPixels(width, height, alignments + alignmentsB, bpPerPix,
rangeDict1, rangeDict2)
- warn("reading annotations...")
+ logging.info("reading annotations...")
rangeDict1 = expandedSeqDict(rangeDict1)
beds1 = itertools.chain(readBed(opts.bed1, rangeDict1),
@@ -822,7 +847,7 @@ def lastDotplot(opts, args):
boxes = sorted(itertools.chain(b1, b2))
- warn("drawing...")
+ logging.info("drawing...")
image_size = width, height
im = Image.new(image_mode, image_size, opts.background_color)
@@ -868,17 +893,20 @@ if __name__ == "__main__":
op = optparse.OptionParser(usage=usage, description=description)
op.add_option("-v", "--verbose", action="count",
help="show progress messages & data about the plot")
+ # Replace "width" & "height" with a single "length" option?
+ op.add_option("-x", "--width", metavar="INT", type="int", default=1000,
+ help="maximum width in pixels (default: %default)")
+ op.add_option("-y", "--height", metavar="INT", type="int", default=1000,
+ help="maximum height in pixels (default: %default)")
+ op.add_option("-m", "--maxseqs", type="int", default=100, metavar="M",
+ help="maximum number of horizontal or vertical sequences "
+ "(default=%default)")
op.add_option("-1", "--seq1", metavar="PATTERN", action="append",
default=[],
help="which sequences to show from the 1st genome")
op.add_option("-2", "--seq2", metavar="PATTERN", action="append",
default=[],
help="which sequences to show from the 2nd genome")
- # Replace "width" & "height" with a single "length" option?
- op.add_option("-x", "--width", type="int", default=1000,
- help="maximum width in pixels (default: %default)")
- op.add_option("-y", "--height", type="int", default=1000,
- help="maximum height in pixels (default: %default)")
op.add_option("-c", "--forwardcolor", metavar="COLOR", default="red",
help="color for forward alignments (default: %default)")
op.add_option("-r", "--reversecolor", metavar="COLOR", default="blue",
=====================================
scripts/last-map-probs
=====================================
@@ -16,7 +16,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def logsum(numbers):
=====================================
scripts/last-postmask
=====================================
@@ -26,7 +26,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def complement(base):
=====================================
scripts/last-train
=====================================
@@ -10,7 +10,7 @@ def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def randomSample(things, sampleSize):
@@ -499,8 +499,9 @@ if __name__ == "__main__":
"every STEP-th position in each query (default: 1)")
og.add_option("-P", metavar="THREADS",
help="number of parallel threads")
- og.add_option("-Q", metavar="NUMBER",
- help="input format: 0=fasta, 1=fastq-sanger")
+ og.add_option("-Q", metavar="NUMBER", help=
+ "input format: 0=fasta or fastq-ignore, 1=fastq-sanger "
+ "(default=fasta)")
op.add_option_group(og)
(opts, args) = op.parse_args()
if len(args) < 1:
=====================================
scripts/maf-convert
=====================================
@@ -21,7 +21,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def maxlen(s):
=====================================
scripts/maf-cut
=====================================
@@ -12,7 +12,7 @@ def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def alnBegFromSeqBeg(gappedSequence, seqBeg):
=====================================
scripts/maf-swap
=====================================
@@ -25,7 +25,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
- return gzip.open(fileName)
+ return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def indexOfNthSequence(mafLines, n):
=====================================
src/LastalArguments.cc
=====================================
@@ -172,9 +172,8 @@ Miscellaneous options (default settings):\n\
4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA,\n\
7=expected counts ("
+ stringify(outputType) + ")\n\
--Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina,\n\
- 4=prb, 5=PSSM ("
- + stringify(inputFormat) + ")\n\
+-Q: input format: 0=fasta or fastq-ignore, 1=fastq-sanger, 2=fastq-solexa,\n\
+ 3=fastq-illumina, 4=prb, 5=PSSM (fasta)\n\
\n\
Report bugs to: last-align (ATmark) googlegroups (dot) com\n\
LAST home page: http://last.cbrc.jp/\n\
@@ -445,7 +444,7 @@ void LastalArguments::setDefaultsFromAlphabet( bool isDna, bool isProtein,
if( gapExistCost == INT_MIN ) gapExistCost = 11;
if( gapExtendCost < 0 ) gapExtendCost = 2;
}
- else if( !isQuality( inputFormat ) ){
+ else if( !isUseQuality( inputFormat ) ){
if( matchScore < 0 ) matchScore = 1;
if( mismatchCost < 0 ) mismatchCost = 1;
if( gapExistCost == INT_MIN ) gapExistCost = 7;
@@ -641,7 +640,7 @@ void LastalArguments::writeCommented( std::ostream& stream ) const{
if( outputType > 4 && outputType < 7 )
stream << " g=" << gamma;
stream << " j=" << outputType;
- stream << " Q=" << inputFormat;
+ stream << " Q=" << (inputFormat % sequenceFormat::fasta);
stream << '\n';
stream << "# " << lastdbName << '\n';
=====================================
src/LastdbArguments.cc
=====================================
@@ -70,8 +70,8 @@ Advanced Options (default settings):\n\
-S: strand: 0=reverse, 1=forward, 2=both ("
+ stringify(strand) + ")\n\
-s: volume size (unlimited)\n\
--Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina ("
- + stringify(inputFormat) + ")\n\
+-Q: input format: 0=fasta or fastq-ignore,\n\
+ 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina (fasta)\n\
-P: number of parallel threads ("
+ stringify(numOfThreads) + ")\n\
-m: seed pattern\n\
=====================================
src/MultiSequence.hh
=====================================
@@ -47,8 +47,15 @@ class MultiSequence{
// may not finish reading the sequence.
std::istream& appendFromFasta( std::istream& stream, indexT maxSeqLen );
- // As above, but read quality scores too.
- std::istream& appendFromFastq( std::istream& stream, indexT maxSeqLen );
+ // As above, but read FASTQ format.
+ std::istream& appendFromFastq(std::istream& stream, indexT maxSeqLen,
+ bool isKeepQualityData);
+
+ // As above, but read either FASTA or FASTQ format. The first
+ // sequence may have either format, but subsequent sequences must
+ // have the same format.
+ std::istream& appendFromFastx(std::istream& stream, indexT maxSeqLen,
+ bool isKeepQualityData);
// As above, but read quality scores too.
std::istream& appendFromPrb( std::istream& stream, indexT maxSeqLen,
@@ -133,6 +140,7 @@ class MultiSequence{
// Qsolexa = -10*log10(p/(1-p))
VectorOrMmap<uchar> qualityScores;
size_t qualityScoresPerLetter;
+ bool isReadingFastq;
// Read a fasta/fastq header: read the whole line but store just the
// 1st word
=====================================
src/MultiSequenceQual.cc
=====================================
@@ -12,13 +12,35 @@
using namespace cbrc;
std::istream&
-MultiSequence::appendFromFastq( std::istream& stream, indexT maxSeqLen ){
+MultiSequence::appendFromFastx(std::istream& stream, indexT maxSeqLen,
+ bool isKeepQualityData) {
+ if (names.empty()) {
+ isReadingFastq = false;
+ char c = '>';
+ stream >> c;
+ if (c == '@') {
+ isReadingFastq = true;
+ } else if (c != '>') {
+ ERR("bad sequence data: missing '>' or '@'");
+ }
+ readFastxName(stream);
+ if (!stream) return stream;
+ }
+
+ return isReadingFastq ?
+ appendFromFastq(stream, maxSeqLen, isKeepQualityData) :
+ appendFromFasta(stream, maxSeqLen);
+}
+
+std::istream&
+MultiSequence::appendFromFastq(std::istream& stream, indexT maxSeqLen,
+ bool isKeepQualityData) {
// initForAppending:
- qualityScoresPerLetter = 1;
+ qualityScoresPerLetter = isKeepQualityData;
if( qualityScores.v.empty() ) appendQualPad();
if( isFinished() ){
- uchar c = '@';
+ char c = '@';
stream >> c;
if( c != '@' ) ERR( "bad FASTQ data: missing '@'" );
readFastxName(stream);
@@ -41,11 +63,16 @@ MultiSequence::appendFromFastq( std::istream& stream, indexT maxSeqLen ){
c = buf->sbumpc();
} while (c != std::streambuf::traits_type::eof() && c != '\n');
- while (qualityScores.v.size() < seq.v.size()) {
+ for (size_t i = seq.v.size() - ends.v.back(); i > 0; ) {
c = buf->sbumpc();
if (c == std::streambuf::traits_type::eof()) ERR("bad FASTQ data");
- if (c > 126) ERR("non-printable-ASCII in FASTQ quality data");
- if (c > ' ') qualityScores.v.push_back(c);
+ if (c > ' ') {
+ if (isKeepQualityData) {
+ if (c > 126) ERR("non-printable-ASCII in FASTQ quality data");
+ qualityScores.v.push_back(c);
+ }
+ --i;
+ }
}
finish();
=====================================
src/SequenceFormat.hh
=====================================
@@ -8,7 +8,7 @@
namespace cbrc {
namespace sequenceFormat {
-enum Enum { fasta, fastqSanger, fastqSolexa, fastqIllumina, prb, pssm };
+enum Enum { fastx, fastqSanger, fastqSolexa, fastqIllumina, prb, pssm, fasta };
}
inline std::istream &operator>>(std::istream &s, sequenceFormat::Enum &f) {
@@ -19,17 +19,17 @@ inline std::istream &operator>>(std::istream &s, sequenceFormat::Enum &f) {
return s;
}
-inline bool isQuality(sequenceFormat::Enum f) {
- return f != sequenceFormat::fasta && f != sequenceFormat::pssm;
-}
-
-inline bool isFastq(sequenceFormat::Enum f) {
+inline bool isUseFastq(sequenceFormat::Enum f) {
return
f == sequenceFormat::fastqSanger ||
f == sequenceFormat::fastqSolexa ||
f == sequenceFormat::fastqIllumina;
}
+inline bool isUseQuality(sequenceFormat::Enum f) {
+ return isUseFastq(f) || f == sequenceFormat::prb;
+}
+
inline int qualityOffset(sequenceFormat::Enum f) {
return (f == sequenceFormat::fastqSanger) ? 33 : 64;
} // The result is meaningless for non-quality formats.
=====================================
src/SubsetSuffixArray.cc
=====================================
@@ -66,13 +66,13 @@ void SubsetSuffixArray::fromFiles( const std::string& baseName,
try{
childTable.m.open( baseName + ".chi", indexedPositions );
- }catch( std::runtime_error ){
+ }catch( std::runtime_error& ){
try{
kiddyTable.m.open( baseName + ".chi2", indexedPositions );
- }catch( std::runtime_error ){
+ }catch( std::runtime_error& ){
try{
chibiTable.m.open( baseName + ".chi1", indexedPositions );
- }catch( std::runtime_error ){}
+ }catch( std::runtime_error& ){}
}
}
}
=====================================
src/last.hh
=====================================
@@ -25,12 +25,14 @@ inline std::istream &appendSequence(MultiSequence &m, std::istream &in,
if (f == sequenceFormat::fasta) {
m.appendFromFasta(in, maxSeqLen);
+ } else if (f == sequenceFormat::fastx) {
+ m.appendFromFastx(in, maxSeqLen, false);
} else if (f == sequenceFormat::prb) {
m.appendFromPrb(in, maxSeqLen, a.size, a.decode);
} else if (f == sequenceFormat::pssm) {
m.appendFromPssm(in, maxSeqLen, a.encode, isMaskLowercase);
} else {
- m.appendFromFastq(in, maxSeqLen);
+ m.appendFromFastq(in, maxSeqLen, true);
}
if (!m.isFinished() && m.finishedSequences() == 0) {
=====================================
src/lastal.cc
=====================================
@@ -80,7 +80,7 @@ namespace {
OneQualityExpMatrix oneQualityExpMatrixRev;
QualityPssmMaker qualityPssmMaker;
QualityPssmMaker qualityPssmMakerRev;
- sequenceFormat::Enum referenceFormat; // defaults to 0
+ sequenceFormat::Enum referenceFormat = sequenceFormat::fasta;
TwoQualityScoreMatrix twoQualityMatrix;
TwoQualityScoreMatrix twoQualityMatrixMasked;
TwoQualityScoreMatrix twoQualityMatrixRev;
@@ -103,7 +103,7 @@ static void calculateSubstitutionScoreMatrixStatistics() {
lambdaCalculator.calculate( scoreMatrix.caseSensitive, alph.size );
if( lambdaCalculator.isBad() ){
- if( isQuality( args.inputFormat ) ||
+ if( isUseQuality( args.inputFormat ) ||
(args.temperature < 0 && args.outputType > 3) )
ERR( "can't calculate probabilities: "
"maybe the mismatch costs are too weak" );
@@ -179,7 +179,7 @@ void makeQualityScorers(){
if( args.isGreedy ) return;
if( args.isTranslated() )
- if( isQuality( args.inputFormat ) || isQuality( referenceFormat ) )
+ if( isUseQuality( args.inputFormat ) || isUseQuality( referenceFormat ) )
return warn( args.programName,
"quality data not used for DNA-versus-protein alignment" );
@@ -199,8 +199,8 @@ void makeQualityScorers(){
double lp2rev[scoreMatrixRowSize];
permuteComplement( lp2, lp2rev );
- if( referenceFormat == sequenceFormat::fasta ){
- if( isFastq( args.inputFormat ) ){
+ if( !isUseFastq( referenceFormat ) ){
+ if( isUseFastq( args.inputFormat ) ){
LOG( "calculating per-quality scores..." );
if( args.maskLowercase > 0 )
oneQualityMatrixMasked.init( m, alph.size, lambda,
@@ -232,7 +232,7 @@ void makeQualityScorers(){
oneQualityExpMatrixRev.init( qRev, args.temperature );
}
}
- if( isQuality(args.inputFormat) ){
+ if( isUseQuality(args.inputFormat) ){
qualityPssmMaker.init( m, alph.size, lambda, isMatchMismatch,
args.matchScore, -args.mismatchCost,
isPhred2, offset2, alph.numbersToUppercase );
@@ -243,7 +243,7 @@ void makeQualityScorers(){
}
}
else{
- if( isFastq( args.inputFormat ) ){
+ if( isUseFastq( args.inputFormat ) ){
if( args.maskLowercase > 0 )
twoQualityMatrixMasked.init( m, lambda, lp1, lp2,
isPhred1, offset1, isPhred2, offset2,
@@ -336,8 +336,9 @@ void readOuterPrj( const std::string& fileName, unsigned& volumes,
if( f.eof() && !f.bad() ) f.clear();
if( alph.letters.empty() || refSequences+1 == 0 || refLetters+1 == 0 ||
- isCaseSensitiveSeeds < 0 || referenceFormat >= sequenceFormat::prb ||
- numOfIndexes > maxNumOfIndexes ){
+ isCaseSensitiveSeeds < 0 || numOfIndexes > maxNumOfIndexes ||
+ referenceFormat == sequenceFormat::prb ||
+ referenceFormat == sequenceFormat::pssm ){
f.setstate( std::ios::failbit );
}
if( !f ) ERR( "can't read file: " + fileName );
@@ -748,7 +749,7 @@ void alignFinish( LastAligner& aligner, const AlignmentPot& gappedAlns,
if (args.outputType == 7) {
centroid.setScoreMatrix(dis.m, args.temperature);
centroid.setLetterProbsPerPosition(alph.size, queryLen, dis.b, dis.j,
- isFastq(args.inputFormat),
+ isUseFastq(args.inputFormat),
qualityPssmMaker.qualToProbRight(),
lambdaCalculator.letterProbs2(),
alph.numbersToUppercase);
@@ -850,7 +851,7 @@ static void printAndClearAll() {
void makeQualityPssm( LastAligner& aligner,
size_t queryNum, char strand, const uchar* querySeq,
bool isMask ){
- if( !isQuality( args.inputFormat ) || isQuality( referenceFormat ) ) return;
+ if (!isUseQuality(args.inputFormat) || isUseQuality(referenceFormat)) return;
if( args.isTranslated() || args.isGreedy ) return;
std::vector<int> &qualityPssm = aligner.qualityPssm;
@@ -1028,7 +1029,7 @@ static void scanOneVolume(unsigned volume, unsigned volumeCount) {
void readIndex( const std::string& baseName, indexT seqCount ) {
LOG( "reading " << baseName << "..." );
- text.fromFiles( baseName, seqCount, isFastq( referenceFormat ) );
+ text.fromFiles(baseName, seqCount, referenceFormat != sequenceFormat::fasta);
for( unsigned x = 0; x < numOfIndexes; ++x ){
if( numOfIndexes > 1 ){
suffixArrays[x].fromFiles( baseName + char('a' + x),
=====================================
src/lastdb.cc
=====================================
@@ -98,7 +98,7 @@ void writeLastalOptions( std::ostream& out, const std::string& seedText ){
void writePrjFile( const std::string& fileName, const LastdbArguments& args,
const Alphabet& alph, countT sequenceCount,
const std::vector<countT>& letterCounts,
- unsigned volumes, unsigned numOfIndexes,
+ bool isFastq, unsigned volumes, unsigned numOfIndexes,
const std::string& seedText ){
countT letterTotal = std::accumulate( letterCounts.begin(),
letterCounts.end(), countT(0) );
@@ -124,7 +124,7 @@ void writePrjFile( const std::string& fileName, const LastdbArguments& args,
f << "tantansetting=" << args.tantanSetting << '\n';
}
f << "masklowercase=" << args.isCaseSensitive << '\n';
- if( args.inputFormat != sequenceFormat::fasta ){
+ if( isFastq ){
f << "sequenceformat=" << args.inputFormat << '\n';
}
if( args.minimizerWindow > 1 ){
@@ -201,7 +201,8 @@ void makeVolume( std::vector< CyclicSubsetSeed >& seeds,
LOG( "writing..." );
writePrjFile( baseName + ".prj", args, alph, numOfSequences,
- letterCountsInThisVolume, -1, numOfIndexes, seedText );
+ letterCountsInThisVolume, multi.qualsPerLetter(), -1,
+ numOfIndexes, seedText );
multi.toFiles( baseName );
for( unsigned x = 0; x < numOfIndexes; ++x ){
@@ -240,8 +241,9 @@ void makeVolume( std::vector< CyclicSubsetSeed >& seeds,
// neglects memory for the sequence names, and the fact that
// lowercase-masked letters and DNA "N"s aren't indexed.)
static indexT maxLettersPerVolume( const LastdbArguments& args,
+ size_t qualityCodesPerLetter,
unsigned numOfIndexes ){
- size_t bytesPerLetter = isFastq( args.inputFormat ) ? 2 : 1;
+ size_t bytesPerLetter = 1 + qualityCodesPerLetter;
size_t maxIndexBytesPerPosition = sizeof(indexT) + 1;
maxIndexBytesPerPosition *= numOfIndexes;
size_t x = bytesPerLetter * args.indexStep + maxIndexBytesPerPosition;
@@ -286,7 +288,7 @@ void lastdb( int argc, char** argv ){
unsigned volumeNumber = 0;
countT sequenceCount = 0;
std::vector<countT> letterCounts( alph.size );
- indexT maxSeqLen = maxLettersPerVolume( args, seeds.size() );
+ indexT maxSeqLen = 0;
char defaultInputName[] = "-";
char* defaultInput[] = { defaultInputName, 0 };
@@ -299,9 +301,13 @@ void lastdb( int argc, char** argv ){
while (appendSequence(multi, in, maxSeqLen, args.inputFormat, alph,
args.isKeepLowercase, 0)) {
- if( !args.isProtein && args.userAlphabet.empty() &&
- sequenceCount == 0 && isDubiousDna( alph, multi ) ){
- std::cerr << args.programName << ": that's some funny-lookin DNA\n";
+ if (sequenceCount == 0) {
+ maxSeqLen =
+ maxLettersPerVolume(args, multi.qualsPerLetter(), seeds.size());
+ if (!args.isProtein && args.userAlphabet.empty() &&
+ isDubiousDna(alph, multi)) {
+ std::cerr << args.programName << ": that's some funny-lookin DNA\n";
+ }
}
if( multi.isFinished() ){
@@ -345,7 +351,8 @@ void lastdb( int argc, char** argv ){
}
writePrjFile( args.lastdbName + ".prj", args, alph, sequenceCount,
- letterCounts, volumeNumber, seeds.size(), seedText );
+ letterCounts, multi.qualsPerLetter(), volumeNumber,
+ seeds.size(), seedText );
}
int main( int argc, char** argv )
=====================================
src/split/last-split.cc
=====================================
@@ -134,8 +134,12 @@ static void doOneAlignmentPart(cbrc::SplitAligner& sa,
if (opts.format == 'm') s.pop_back();
- std::cout << std::setprecision(mismapPrecision)
- << "a score=" << score << " mismap=" << mismap;
+ if (opts.no_split && a.linesBeg[0][0] == 'a') {
+ std::cout << a.linesBeg[0];
+ } else {
+ std::cout << "a score=" << score;
+ }
+ std::cout << std::setprecision(mismapPrecision) << " mismap=" << mismap;
if (opts.direction == 2) printSense(senseStrandLogOdds);
if (!opts.genome.empty() && !opts.no_split) {
char signal[3] = {0};
@@ -383,7 +387,7 @@ void lastSplit(LastSplitOptions& opts) {
mafEnds.resize(1);
} else if (isSpace(line)) {
addMaf(mafEnds, mafLines);
- } else if (std::strchr("sqpc", line[0])) {
+ } else if (std::strchr(opts.no_split ? "asqpc" : "sqpc", line[0])) {
mafLines.push_back(line);
}
}
=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"956"
+"961"
View it on GitLab: https://salsa.debian.org/med-team/last-align/commit/1d01bfcb813030a4be50b9db3df0318d0579323a
--
View it on GitLab: https://salsa.debian.org/med-team/last-align/commit/1d01bfcb813030a4be50b9db3df0318d0579323a
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181217/dba5c6b9/attachment-0001.html>
More information about the debian-med-commit
mailing list