[med-svn] [Git][med-team/last-align][master] 2 commits: Import Upstream version 1066
Michael R. Crusoe
gitlab at salsa.debian.org
Fri Jun 19 15:43:51 BST 2020
Michael R. Crusoe pushed to branch master at Debian Med / last-align
Commits:
9b12cf2c by Michael R. Crusoe at 2020-06-19T16:43:07+02:00
Import Upstream version 1066
- - - - -
e1d8e7ff by Michael R. Crusoe at 2020-06-19T16:43:07+02:00
Import Debian changes 1066-1
last-align (1066-1) unstable; urgency=medium
* Team upload.
[ Michael R. Crusoe ]
* i386: enable avx2, avx, and sse4.1 specific builds as well
* Demote GNU parallel to a recommends; only used by parallel-fasta
[ Steffen Moeller ]
* New upstream version.
* debhelper-compat 13 (routine-update)
- - - - -
25 changed files:
- ChangeLog.txt
- debian/changelog
- debian/control
- debian/patches/2to3.patch
- doc/last-train.html
- doc/last-train.txt
- doc/lastal.html
- doc/lastal.txt
- doc/lastdb.html
- doc/lastdb.txt
- doc/maf-convert.html
- doc/maf-convert.txt
- scripts/last-dotplot
- scripts/last-train
- scripts/maf-convert
- scripts/maf-cut
- src/LastalArguments.cc
- src/LastdbArguments.cc
- src/SequenceFormat.hh
- src/last.hh
- src/lastal.cc
- src/lastdb.cc
- src/split/cbrc_split_aligner.cc
- src/split/last-split.cc
- src/version.hh
Changes:
=====================================
ChangeLog.txt
=====================================
@@ -1,8 +1,39 @@
+2020-06-11 Martin C. Frith <Martin C. Frith>
+
+ * doc/last-train.txt, scripts/last-train, src/lastal.cc:
+ Make inconsistent last-train/lastal -Q an error
+ [f2071b3c3173] [tip]
+
+ * doc/last-train.txt, doc/lastal.txt, doc/lastdb.txt, scripts/last-
+ train, src/LastalArguments.cc, src/LastdbArguments.cc,
+ src/SequenceFormat.hh, src/last.hh, src/lastdb.cc,
+ src/split/cbrc_split_aligner.cc, src/split/last-split.cc, test/last-
+ test.out, test/last-test.sh:
+ Add option to keep but ignore fastq quals
+ [0315c96a5a9b]
+
+2020-06-08 Martin C. Frith <Martin C. Frith>
+
+ * scripts/maf-convert:
+ Fix small bug in maf-convert -J
+ [cd67e8009c78]
+
+ * doc/maf-convert.txt, scripts/maf-convert, test/maf-convert-test.out,
+ test/maf-convert-test.sh:
+ Add maf-convert -J option
+ [bc7b2cdbcb32]
+
+ * doc/maf-convert.txt, scripts/last-dotplot, scripts/last-train,
+ scripts/maf-convert, scripts/maf-cut, test/maf-convert-test.out,
+ test/maf-convert-test.sh:
+ Add maf-convert to gff (with help from C. Plessy)
+ [6d8d56a74b14]
+
2020-05-07 Martin C. Frith <Martin C. Frith>
* src/tantan.cc:
Make tantan repeat-finding faster
- [543d36d39ce3] [tip]
+ [543d36d39ce3]
2020-03-16 Martin C. Frith <Martin C. Frith>
=====================================
debian/changelog
=====================================
@@ -1,9 +1,15 @@
-last-align (1061-2) UNRELEASED; urgency=medium
+last-align (1066-1) unstable; urgency=medium
* Team upload.
+
+ [ Michael R. Crusoe ]
* i386: enable avx2, avx, and sse4.1 specific builds as well
* Demote GNU parallel to a recommends; only used by parallel-fasta
+ [ Steffen Moeller ]
+ * New upstream version.
+ * debhelper-compat 13 (routine-update)
+
-- Michael R. Crusoe <michael.crusoe at gmail.com> Sun, 17 May 2020 12:32:51 +0200
last-align (1061-1) unstable; urgency=medium
=====================================
debian/control
=====================================
@@ -4,11 +4,11 @@ Uploaders: Charles Plessy <plessy at debian.org>,
Andreas Tille <tille at debian.org>
Section: science
Priority: optional
-Build-Depends: debhelper-compat (= 12),
+Build-Depends: debhelper-compat (= 13),
help2man,
python3-pil,
zlib1g-dev,
- libsimde-dev ( >= 0.0.0.git.20200419 )
+ libsimde-dev
Standards-Version: 4.5.0
Vcs-Browser: https://salsa.debian.org/med-team/last-align
Vcs-Git: https://salsa.debian.org/med-team/last-align.git
@@ -17,12 +17,12 @@ Rules-Requires-Root: no
Package: last-align
Architecture: any
-Built-Using: ${Built-Using}
Depends: ${shlibs:Depends},
- ${misc:Depends},
+ ${misc:Depends}
Recommends: python3,
python3-pil,
parallel
+Built-Using: ${Built-Using}
Description: genome-scale comparison of biological sequences
LAST is software for comparing and aligning sequences, typically DNA or
protein sequences. LAST is similar to BLAST, but it copes better with very
=====================================
debian/patches/2to3.patch
=====================================
@@ -3,15 +3,17 @@ Bug-Debian: https://bugs.debian.org/943148
Author: Andreas Tille <tille at debian.org>
Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
+Index: last-align/scripts/last-dotplot
+===================================================================
--- last-align.orig/scripts/last-dotplot
+++ last-align/scripts/last-dotplot
@@ -1,4 +1,4 @@
-#! /usr/bin/env python
-+#!/usr/bin/python3
++#! /usr/bin/python3
+ # Author: Martin C. Frith 2008
+ # SPDX-License-Identifier: GPL-3.0-or-later
- # Read pair-wise alignments in MAF or LAST tabular format: write an
- # "Oxford grid", a.k.a. dotplot.
-@@ -262,12 +262,12 @@
+@@ -264,12 +264,12 @@ def mergedRanges(ranges):
yield oldBeg, maxEnd
def mergedRangesPerSeq(coverDict):
@@ -26,7 +28,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
def trimmed(seqRanges, coverDict, minAlignedBases, maxGapFrac, endPad, midPad):
maxEndGapFrac, maxMidGapFrac = twoValuesFromOption(maxGapFrac, ",")
-@@ -308,7 +308,7 @@
+@@ -310,7 +310,7 @@ def rangesWithStrandInfo(seqRanges, stra
def natural_sort_key(my_string):
'''Return a sort key for "natural" ordering, e.g. chr9 < chr10.'''
parts = re.split(r'(\d+)', my_string)
@@ -35,7 +37,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
return parts
def nameKey(oneSeqRanges):
-@@ -577,7 +577,7 @@
+@@ -579,7 +579,7 @@ def drawJoins(im, alignments, bpPerPix,
def expandedSeqDict(seqDict):
'''Allow lookup by short sequence names, e.g. chr7 as well as hg19.chr7.'''
newDict = seqDict.copy()
@@ -44,7 +46,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
if "." in name:
base = name.split(".")[-1]
if base in newDict: # an ambiguous case was found:
-@@ -611,7 +611,7 @@
+@@ -613,7 +613,7 @@ def readBed(fileName, rangeDict):
yield layer, color, seqName, beg, end
def commaSeparatedInts(text):
@@ -53,6 +55,8 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
def readGenePred(opts, fileName, rangeDict):
for line in myOpen(fileName):
+Index: last-align/scripts/last-map-probs
+===================================================================
--- last-align.orig/scripts/last-map-probs
+++ last-align/scripts/last-map-probs
@@ -1,4 +1,4 @@
@@ -61,6 +65,8 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
# Copyright 2010, 2011, 2012, 2014 Martin C. Frith
+Index: last-align/scripts/last-postmask
+===================================================================
--- last-align.orig/scripts/last-postmask
+++ last-align/scripts/last-postmask
@@ -1,4 +1,4 @@
@@ -69,7 +75,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
# Copyright 2014 Martin C. Frith
-@@ -37,7 +37,7 @@
+@@ -37,7 +37,7 @@ def complement(base):
def fastScoreMatrix(rowHeads, colHeads, matrix, deleteCost, insertCost):
matrixLen = 128
@@ -78,7 +84,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
fastMatrix = [[defaultScore for i in range(matrixLen)]
for j in range(matrixLen)]
for i, x in enumerate(rowHeads):
-@@ -111,7 +111,7 @@
+@@ -111,7 +111,7 @@ def doOneFile(lines):
if i.startswith("B="): bIns = int(i[2:])
if i.startswith("e="): minScore = int(i[2:])
if i.startswith("S="): strandParam = int(i[2:])
@@ -87,15 +93,17 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
colHeads = fields[1:]
elif nf == len(colHeads) + 2 and len(fields[1]) == 1:
rowHeads.append(fields[1])
+Index: last-align/scripts/last-train
+===================================================================
--- last-align.orig/scripts/last-train
+++ last-align/scripts/last-train
@@ -1,4 +1,4 @@
-#! /usr/bin/env python
+#!/usr/bin/python3
# Copyright 2015 Martin C. Frith
+ # SPDX-License-Identifier: GPL-3.0-or-later
- # References:
-@@ -248,7 +248,7 @@
+@@ -249,7 +249,7 @@ def writeScoreMatrix(outFile, matrix, pr
writeMatrixBody(outFile, prefix, alphabet, matrix, "%6s")
def matProbsFromCounts(counts, opts):
@@ -104,23 +112,27 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
if opts.revsym: # add complement (reverse strand) substitutions
counts = [[counts[i][j] + counts[-1-i][-1-j] for j in r] for i in r]
if opts.matsym: # symmetrize the substitution matrix
+Index: last-align/scripts/maf-convert
+===================================================================
--- last-align.orig/scripts/maf-convert
+++ last-align/scripts/maf-convert
@@ -1,4 +1,4 @@
-#! /usr/bin/env python
+#!/usr/bin/python3
# Copyright 2010, 2011, 2013, 2014 Martin C. Frith
- # Read MAF-format alignments: write them in other formats.
+ # SPDX-License-Identifier: GPL-3.0-or-later
# Seems to work with Python 2.x, x>=6
+Index: last-align/scripts/maf-cut
+===================================================================
--- last-align.orig/scripts/maf-cut
+++ last-align/scripts/maf-cut
@@ -1,4 +1,4 @@
-#! /usr/bin/env python
-+#!/usr/bin/python3
++#! /usr/bin/python3
+ # Author: Martin C. Frith 2018
from __future__ import print_function
-
-@@ -70,12 +70,12 @@
+@@ -71,12 +71,12 @@ def cutMafRecords(mafLines, alnBeg, alnE
def mafFieldWidths(mafRecords):
sRecords = (i for i in mafRecords if i[0] == "s")
@@ -136,6 +148,8 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
print("%*s %-*s %*s %*s %*s %*s %*s" % tuple(formatParams))
def cutOneMaf(cutRange, mafLines):
+Index: last-align/scripts/maf-join
+===================================================================
--- last-align.orig/scripts/maf-join
+++ last-align/scripts/maf-join
@@ -1,4 +1,4 @@
@@ -144,7 +158,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
# Copyright 2009, 2010, 2011 Martin C. Frith
-@@ -229,7 +229,7 @@
+@@ -229,7 +229,7 @@ def nextWindow(window, theInput, referen
while True:
maf = theInput.peek()
if maf.after(referenceMaf): break
@@ -153,6 +167,8 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
if not maf.before(referenceMaf): yield maf
except StopIteration: pass
+Index: last-align/scripts/maf-swap
+===================================================================
--- last-align.orig/scripts/maf-swap
+++ last-align/scripts/maf-swap
@@ -1,4 +1,4 @@
@@ -161,7 +177,7 @@ Last-Update: Fri, 22 Nov 2019 11:23:44 +0100
# Read MAF-format alignments, and write them, after moving the Nth
# sequence to the top in each alignment.
-@@ -69,12 +69,12 @@
+@@ -69,12 +69,12 @@ def flippedMafRecords(mafLines):
def sLineFieldWidths(mafLines):
sLines = (i for i in mafLines if i[0] == "s")
=====================================
doc/last-train.html
=====================================
@@ -491,22 +491,23 @@ Ns/Xs as ambiguous.</p>
output, so it will override lastal's default.</p>
</td></tr>
<tr><td class="option-group">
-<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td><p class="first">How to read the query sequences. By default, they must
-be in <tt class="docutils literal">fasta</tt> format. <tt class="docutils literal"><span class="pre">-Q0</span></tt> means <tt class="docutils literal">fasta</tt> or
-<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>. <tt class="docutils literal"><span class="pre">-Q1</span></tt> means <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>.</p>
+<kbd><span class="option">-Q <var>NAME</var></span></kbd></td>
+<td><p class="first">How to read the query sequences (the NAME is not
+case-sensitive):</p>
+<pre class="literal-block">
+Default fasta
+"0", "fastx" fasta or fastq: discard per-base quality data
+"1", "sanger" fastq-sanger
+</pre>
<p>The <tt class="docutils literal">fastq</tt> formats are described here:
-<a class="reference external" href="lastal.html">lastal.html</a>. <tt class="docutils literal"><span class="pre">fastq-ignore</span></tt> means that the
-quality data is ignored, so the results will be the same
-as for <tt class="docutils literal">fasta</tt>.</p>
-<p>For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train assumes the quality
-codes indicate substitution error probabilities, <em>not</em>
-insertion or deletion error probabilities. If this
+<a class="reference external" href="lastal.html">lastal.html</a>. last-train assumes the per-base
+quality codes indicate substitution error probabilities,
+<em>not</em> insertion or deletion error probabilities. If this
assumption is dubious (e.g. for data with many insertion
-or deletion errors), it may be better to use
-<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>. For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train finds
-the rates of substitutions not explained by the quality
-data (ideally, real substitutions as opposed to errors).</p>
+or deletion errors), it may be better to discard the
+quality data. For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train finds the
+rates of substitutions not explained by the quality data
+(ideally, real substitutions as opposed to errors).</p>
<p class="last">If specified, this parameter is written in last-train's
output, so it will override lastal's default.</p>
</td></tr>
=====================================
doc/last-train.txt
=====================================
@@ -114,23 +114,22 @@ Alignment options
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
- -Q NUMBER How to read the query sequences. By default, they must
- be in ``fasta`` format. ``-Q0`` means ``fasta`` or
- ``fastq-ignore``. ``-Q1`` means ``fastq-sanger``.
+ -Q NAME How to read the query sequences (the NAME is not
+ case-sensitive)::
- The ``fastq`` formats are described here:
- `<lastal.html>`_. ``fastq-ignore`` means that the
- quality data is ignored, so the results will be the same
- as for ``fasta``.
+ Default fasta
+ "0", "fastx" fasta or fastq: discard per-base quality data
+ "1", "sanger" fastq-sanger
- For ``fastq-sanger``, last-train assumes the quality
- codes indicate substitution error probabilities, *not*
- insertion or deletion error probabilities. If this
+ The ``fastq`` formats are described here:
+ `<lastal.html>`_. last-train assumes the per-base
+ quality codes indicate substitution error probabilities,
+ *not* insertion or deletion error probabilities. If this
assumption is dubious (e.g. for data with many insertion
- or deletion errors), it may be better to use
- ``fastq-ignore``. For ``fastq-sanger``, last-train finds
- the rates of substitutions not explained by the quality
- data (ideally, real substitutions as opposed to errors).
+ or deletion errors), it may be better to discard the
+ quality data. For ``fastq-sanger``, last-train finds the
+ rates of substitutions not explained by the quality data
+ (ideally, real substitutions as opposed to errors).
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
=====================================
doc/lastal.html
=====================================
@@ -844,16 +844,18 @@ letters.)</p>
</ul>
</td></tr>
<tr><td class="option-group">
-<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td><p class="first">Specify how to read the query sequences:</p>
+<kbd><span class="option">-Q <var>NAME</var></span></kbd></td>
+<td><p class="first">Specify how to read the query sequences (the NAME is not
+case-sensitive):</p>
<pre class="literal-block">
-Default fasta
- 0 fasta or fastq-ignore
- 1 fastq-sanger
- 2 fastq-solexa
- 3 fastq-illumina
- 4 prb
- 5 PSSM
+Default fasta
+"0", "fastx" fasta or fastq: discard per-base quality data
+"keep" fasta or fastq: keep but ignore per-base quality data
+"1", "sanger" fastq-sanger
+"2", "solexa" fastq-solexa
+"3", "illumina" fastq-illumina
+"4", "prb" prb
+"5", "pssm" PSSM
</pre>
<p><em>Warning</em>: Illumina data is not necessarily in fastq-illumina
format; it is often in fastq-sanger format.</p>
@@ -867,12 +869,10 @@ TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
<p>The "+" may be followed by text (ignored). The symbols below
the "+" are quality codes, one per sequence letter. The
sequence and quality codes may wrap onto more than one line.</p>
-<p>fastq-ignore means that the quality codes are ignored. For the
-other fastq variants, lastal assumes the quality codes indicate
-substitution error probabilities, <em>not</em> insertion or deletion
-error probabilities. If this assumption is dubious (e.g. for
-data with many insertion or deletion errors), it may be better
-to use fastq-ignore.</p>
+<p>lastal assumes the quality codes indicate substitution error
+probabilities, <em>not</em> insertion or deletion error probabilities.
+If this assumption is dubious (e.g. for data with many insertion
+or deletion errors), it may be better to discard or ignore them.</p>
<p>For fastq-sanger, quality scores are obtained by subtracting 33
from the ASCII values of the quality codes. For fastq-solexa
and fastq-illumina, they are obtained by subtracting 64.</p>
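
Editorial illustration (not part of the commit): the ASCII offsets stated above can be sketched in a few lines of Python. The helper name quality_scores and the ASCII_OFFSETS table are assumptions made for this example and do not appear in the LAST sources.

```python
# Hypothetical helper, not part of LAST: decode fastq quality codes into
# integer quality scores using the ASCII offsets stated in the documentation.
ASCII_OFFSETS = {
    "fastq-sanger": 33,
    "fastq-solexa": 64,
    "fastq-illumina": 64,
}

def quality_scores(codes, fmt):
    """Return one integer quality score per quality-code character."""
    offset = ASCII_OFFSETS[fmt]
    return [ord(c) - offset for c in codes]

print(quality_scores("II5!", "fastq-sanger"))    # [40, 40, 20, 0]
print(quality_scores("hhT@", "fastq-illumina"))  # [40, 40, 20, 0]
```

The two calls give the same scores from different code strings, which is why reading fastq-sanger data as fastq-illumina (or the reverse) silently distorts the scores.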
=====================================
doc/lastal.txt
=====================================
@@ -474,16 +474,18 @@ Miscellaneous options
* The count of delete opens (= count of delete closes).
* The count of insert opens (= count of insert closes).
- -Q NUMBER
- Specify how to read the query sequences::
-
- Default fasta
- 0 fasta or fastq-ignore
- 1 fastq-sanger
- 2 fastq-solexa
- 3 fastq-illumina
- 4 prb
- 5 PSSM
+ -Q NAME
+ Specify how to read the query sequences (the NAME is not
+ case-sensitive)::
+
+ Default fasta
+ "0", "fastx" fasta or fastq: discard per-base quality data
+ "keep" fasta or fastq: keep but ignore per-base quality data
+ "1", "sanger" fastq-sanger
+ "2", "solexa" fastq-solexa
+ "3", "illumina" fastq-illumina
+ "4", "prb" prb
+ "5", "pssm" PSSM
*Warning*: Illumina data is not necessarily in fastq-illumina
format; it is often in fastq-sanger format.
@@ -499,12 +501,10 @@ Miscellaneous options
the "+" are quality codes, one per sequence letter. The
sequence and quality codes may wrap onto more than one line.
- fastq-ignore means that the quality codes are ignored. For the
- other fastq variants, lastal assumes the quality codes indicate
- substitution error probabilities, *not* insertion or deletion
- error probabilities. If this assumption is dubious (e.g. for
- data with many insertion or deletion errors), it may be better
- to use fastq-ignore.
+ lastal assumes the quality codes indicate substitution error
+ probabilities, *not* insertion or deletion error probabilities.
+ If this assumption is dubious (e.g. for data with many insertion
+ or deletion errors), it may be better to discard or ignore them.
For fastq-sanger, quality scores are obtained by subtracting 33
from the ASCII values of the quality codes. For fastq-solexa
=====================================
doc/lastdb.html
=====================================
@@ -456,11 +456,18 @@ lastdb will refuse to process any single sequence longer than
about 4 billion.</p>
</td></tr>
<tr><td class="option-group">
-<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
-<td>Specify how to read the sequences. The default is fasta, 0
-means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
-fastq-solexa, and 3 means fastq-illumina. The fastq formats are
-described in <a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
+<kbd><span class="option">-Q <var>NAME</var></span></kbd></td>
+<td><p class="first">Specify how to read the sequences (the NAME is not case-sensitive):</p>
+<pre class="literal-block">
+Default fasta
+"0", "fastx" fasta or fastq: discard per-base quality data
+"keep" fasta or fastq: keep but ignore per-base quality data
+"1", "sanger" fastq-sanger
+"2", "solexa" fastq-solexa
+"3", "illumina" fastq-illumina
+</pre>
+<p class="last">The fastq formats are described in <a class="reference external" href="lastal.html">lastal.html</a>.</p>
+</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
<td>Divide the work between this number of threads running in
=====================================
doc/lastdb.txt
=====================================
@@ -124,11 +124,17 @@ Advanced Options
lastdb will refuse to process any single sequence longer than
about 4 billion.
- -Q NUMBER
- Specify how to read the sequences. The default is fasta, 0
- means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
- fastq-solexa, and 3 means fastq-illumina. The fastq formats are
- described in `<lastal.html>`_.
+ -Q NAME
+ Specify how to read the sequences (the NAME is not case-sensitive)::
+
+ Default fasta
+ "0", "fastx" fasta or fastq: discard per-base quality data
+ "keep" fasta or fastq: keep but ignore per-base quality data
+ "1", "sanger" fastq-sanger
+ "2", "solexa" fastq-solexa
+ "3", "illumina" fastq-illumina
+
+ The fastq formats are described in `<lastal.html>`_.
-P THREADS
Divide the work between this number of threads running in
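
Editorial illustration (not part of the commit): the lastdb -Q table above pairs each legacy number with a name and states that names are not case-sensitive. A minimal sketch of that lookup follows, assuming a plain dictionary. The names LASTDB_Q_NAMES and parse_q_name are hypothetical, and the real parsing in src/LastdbArguments.cc may work differently.

```python
# Hypothetical sketch, not LAST's actual argument parser: resolve a lastdb
# -Q value to the canonical name listed in the documentation table.
LASTDB_Q_NAMES = {
    "0": "fastx", "fastx": "fastx",    # fasta or fastq: discard quality data
    "keep": "keep",                    # fasta or fastq: keep but ignore it
    "1": "sanger", "sanger": "sanger",
    "2": "solexa", "solexa": "solexa",
    "3": "illumina", "illumina": "illumina",
}

def parse_q_name(value):
    """Map a -Q argument to a canonical name, ignoring letter case."""
    try:
        return LASTDB_Q_NAMES[value.lower()]
    except KeyError:
        raise ValueError("unknown -Q value: " + value)

print(parse_q_name("Sanger"))  # sanger
print(parse_q_name("0"))       # fastx
```

Omitting -Q altogether still means plain fasta, so the default is deliberately not a key in this table.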
=====================================
doc/maf-convert.html
=====================================
@@ -320,7 +320,7 @@ table.field-list { border: thin solid green }
<p>This script reads alignments in <a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">maf</a> format, and writes them in
another format. It can write them in these formats: <a class="reference external" href="https://genome.ucsc.edu/goldenPath/help/axt.html">axt</a>, blast,
-blasttab, <a class="reference external" href="https://genome.ucsc.edu/goldenPath/help/chain.html">chain</a>, html, <a class="reference external" href="https://genome.ucsc.edu/FAQ/FAQformat.html#format2">psl</a>, sam, tab. You can use it like this:</p>
+blasttab, <a class="reference external" href="https://genome.ucsc.edu/goldenPath/help/chain.html">chain</a>, gff, html, <a class="reference external" href="https://genome.ucsc.edu/FAQ/FAQformat.html#format2">psl</a>, sam, tab. You can use it like this:</p>
<pre class="literal-block">
maf-convert psl my-alignments.maf > my-alignments.psl
</pre>
@@ -350,9 +350,17 @@ nucleotides. This affects psl format only (the first 4
columns).</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-j <var>N</var></span>, <span class="option">--join=<var>N</var></span></kbd></td>
-<td>Join neighboring alignments if they are co-linear and
-separated by at most N letters. This affects psl format
-only.</td></tr>
+<td>Join alignments that are co-linear (align different parts of
+the same sequences and strands, with the parts being in the
+same order in each sequence), are separated by at most N
+letters in each sequence, and are consecutive in the input.
+This affects psl and gff formats only.</td></tr>
+<tr><td class="option-group">
+<kbd><span class="option">-J <var>N</var></span>, <span class="option">--Join=<var>N</var></span></kbd></td>
+<td>Join alignments that are co-linear, are separated by at most
+N letters in each sequence, and are nearest in each sequence.
+This affects psl and gff formats only, and reads the whole
+input into memory.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-n</span>, <span class="option">--noheader</span></kbd></td>
<td>Omit any header lines from the output. This may be useful if
=====================================
doc/maf-convert.txt
=====================================
@@ -3,7 +3,7 @@ maf-convert
This script reads alignments in maf_ format, and writes them in
another format. It can write them in these formats: axt_, blast,
-blasttab, chain_, html, psl_, sam, tab. You can use it like this::
+blasttab, chain_, gff, html, psl_, sam, tab. You can use it like this::
maf-convert psl my-alignments.maf > my-alignments.psl
@@ -35,9 +35,17 @@ Options
columns).
-j N, --join=N
- Join neighboring alignments if they are co-linear and
- separated by at most N letters. This affects psl format
- only.
+ Join alignments that are co-linear (align different parts of
+ the same sequences and strands, with the parts being in the
+ same order in each sequence), are separated by at most N
+ letters in each sequence, and are consecutive in the input.
+ This affects psl and gff formats only.
+
+ -J N, --Join=N
+ Join alignments that are co-linear, are separated by at most
+ N letters in each sequence, and are nearest in each sequence.
+ This affects psl and gff formats only, and reads the whole
+ input into memory.
-n, --noheader
Omit any header lines from the output. This may be useful if
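
Editorial illustration (not part of the commit): the pairwise criterion shared by -j and -J, as worded above, can be sketched as follows. The Segment layout and the can_join name are assumptions for this example; maf-convert's own records are laid out differently, and the additional "consecutive in the input" (-j) and "nearest in each sequence" (-J) rules are not modelled here.

```python
# Hypothetical sketch of the co-linearity test described in the option text.
from collections import namedtuple

# One aligned sequence within an alignment; coordinates are half-open
# [beg, end) and relative to the stated strand, as in MAF.
Segment = namedtuple("Segment", "name strand beg end")

def can_join(x, y, max_gap):
    """Return True if alignment y may follow alignment x.

    x and y are tuples of Segments, one per aligned sequence.
    """
    if len(x) != len(y):
        return False
    for a, b in zip(x, y):
        # Same sequence and same strand in every row.
        if (a.name, a.strand) != (b.name, b.strand):
            return False
        # y starts at or after the end of x, at most max_gap letters later.
        if not 0 <= b.beg - a.end <= max_gap:
            return False
    return True

first = (Segment("chr1", "+", 100, 200), Segment("read1", "+", 0, 100))
second = (Segment("chr1", "+", 210, 300), Segment("read1", "+", 105, 195))
print(can_join(first, second, 10))  # True
print(can_join(first, second, 5))   # False
```

The second call fails because the chr1 gap is 10 letters, which exceeds the limit of 5 even though the read1 gap is only 5.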
=====================================
scripts/last-dotplot
=====================================
@@ -1,4 +1,6 @@
#! /usr/bin/env python
+# Author: Martin C. Frith 2008
+# SPDX-License-Identifier: GPL-3.0-or-later
# Read pair-wise alignments in MAF or LAST tabular format: write an
# "Oxford grid", a.k.a. dotplot.
=====================================
scripts/last-train
=====================================
@@ -1,5 +1,6 @@
#! /usr/bin/env python
# Copyright 2015 Martin C. Frith
+# SPDX-License-Identifier: GPL-3.0-or-later
# References:
# [Fri19] How sequence alignment scores correspond to probability models,
@@ -576,14 +577,14 @@ if __name__ == "__main__":
op.add_option_group(og)
og = optparse.OptionGroup(op, "Initial parameter options")
og.add_option("-r", metavar="SCORE",
- help="match score (default: 6 if Q>0, else 5)")
+ help="match score (default: 6 if Q>=1, else 5)")
og.add_option("-q", metavar="COST",
- help="mismatch cost (default: 18 if Q>0, else 5)")
+ help="mismatch cost (default: 18 if Q>=1, else 5)")
og.add_option("-p", metavar="NAME", help="match/mismatch score matrix")
og.add_option("-a", metavar="COST",
- help="gap existence cost (default: 21 if Q>0, else 15)")
+ help="gap existence cost (default: 21 if Q>=1, else 15)")
og.add_option("-b", metavar="COST",
- help="gap extension cost (default: 9 if Q>0, else 3)")
+ help="gap extension cost (default: 9 if Q>=1, else 3)")
og.add_option("-A", metavar="COST", help="insertion existence cost")
og.add_option("-B", metavar="COST", help="insertion extension cost")
op.add_option_group(og)
@@ -610,16 +611,15 @@ if __name__ == "__main__":
og.add_option("-X", metavar="NUMBER", help="N/X is ambiguous in: "
"0=neither sequence, 1=reference, 2=query, 3=both "
"(default=0)")
- og.add_option("-Q", metavar="NUMBER", help=
- "input format: 0=fasta or fastq-ignore, 1=fastq-sanger "
- "(default=fasta)")
+ og.add_option("-Q", metavar="NAME",
+ help="input format: fastx, sanger (default=fasta)")
op.add_option_group(og)
(opts, args) = op.parse_args()
if len(args) < 1:
op.error("I need a lastdb index and query sequences")
if not opts.sample_number and (len(args) < 2 or "-" in args):
op.error("sorry, can't use stdin when --sample-number=0")
- if not opts.p and (not opts.Q or opts.Q == "0"):
+ if not opts.p and (not opts.Q or opts.Q in ("0", "fastx", "keep")):
if not opts.r: opts.r = "5"
if not opts.q: opts.q = "5"
if not opts.a: opts.a = "15"
=====================================
scripts/maf-convert
=====================================
@@ -1,6 +1,6 @@
#! /usr/bin/env python
# Copyright 2010, 2011, 2013, 2014 Martin C. Frith
-# Read MAF-format alignments: write them in other formats.
+# SPDX-License-Identifier: GPL-3.0-or-later
# Seems to work with Python 2.x, x>=6
# By "MAF" we mean "multiple alignment format" described in the UCSC
@@ -9,8 +9,15 @@
from __future__ import print_function
from itertools import *
+import collections
+import functools
import gzip
-import math, optparse, os, signal, sys
+import math
+import operator
+import optparse
+import os
+import signal
+import sys
try:
from future_builtins import map, zip
@@ -161,6 +168,41 @@ def mafGroupInput(opts, lines):
fixOrder(x)
yield x
+def linkedMafBegSortKey(sequenceNumber, mafAndLinks):
+ return operator.itemgetter(0, 2, 4)(mafAndLinks[0][1][sequenceNumber])
+
+def colinearMafInput(opts, lines):
+ mafs = list(mafInput(opts, lines))
+ if not mafs:
+ return
+ numOfSeqs = max(len(maf[1]) for maf in mafs)
+ linkedMafs = [(maf, [None] * numOfSeqs) for maf in mafs]
+ for s in range(numOfSeqs):
+ begFunc = functools.partial(linkedMafBegSortKey, s)
+ linkedMafs.sort(key=begFunc)
+ maxEnd = "", "+", 0
+ for j in range(1, len(linkedMafs)):
+ xMaf, xLinks = linkedMafs[j - 1]
+ yMaf, yLinks = linkedMafs[j]
+ x = xMaf[1][s]
+ y = yMaf[1][s]
+ newEnd = x[0], x[2], x[5]
+ if newEnd > maxEnd:
+ maxEnd = newEnd
+ if (x[:4] == y[:4] and 0 <= y[4] - x[5] <= opts.Join):
+ k = j + 1
+ yBeg = y[0], y[2], y[4]
+ if k == len(linkedMafs) or yBeg < begFunc(linkedMafs[k]):
+ yLinks[s] = xMaf
+ colinearMafs = []
+ for maf, links in linkedMafs:
+ if colinearMafs:
+ if any(i is not colinearMafs[-1] for i in links):
+ yield colinearMafs
+ colinearMafs = []
+ colinearMafs.append(maf)
+ yield colinearMafs
+
##### Routines for converting to AXT format: #####
axtCounter = count()
@@ -185,7 +227,7 @@ def writeAxt(maf):
if score:
outWords.append(score)
- print(" ".join(outWords))
+ print(*outWords)
for i in sLines:
print(i[6])
print() # print a blank line at the end
@@ -243,7 +285,7 @@ def writeChain(maf):
outWords.append(str(next(axtCounter) + 1))
- print(" ".join(outWords))
+ print(*outWords)
letterSizes = [i[3] for i in sLines]
rows = [i[6] for i in sLines]
@@ -360,15 +402,16 @@ def writePsl(opts, mafs):
blockCount = len(blocks)
blockSizes, blockStartsA, blockStartsB = map(pslCommaString, zip(*blocks))
- outWords = (matches, mismatches, repMatches, nCount,
- numInsertB, baseInsertB, numInsertA, baseInsertA, strand,
- seqNameB, seqLenB, begB, endB, seqNameA, seqLenA, begA, endA,
- blockCount, blockSizes, blockStartsB, blockStartsA)
-
- print("\t".join(map(str, outWords)))
+ print(matches, mismatches, repMatches, nCount,
+ numInsertB, baseInsertB, numInsertA, baseInsertA, strand,
+ seqNameB, seqLenB, begB, endB, seqNameA, seqLenA, begA, endA,
+ blockCount, blockSizes, blockStartsB, blockStartsA, sep="\t")
def mafConvertToPsl(opts, lines):
- if opts.join:
+ if opts.Join:
+ for i in colinearMafInput(opts, lines):
+ writePsl(opts, i)
+ elif opts.join:
for i in mafGroupInput(opts, lines):
writePsl(opts, i)
else:
@@ -679,8 +722,8 @@ def writeBlastTab(opts, maf):
mismatches = alnSize - matches - rowA.count("-") - rowB.count("-")
gapOpens = gapRunCount(rowA) + gapRunCount(rowB)
- out = [seqNameB, seqNameA, matchPercent, str(alnSize), str(mismatches),
- str(gapOpens), begB, endB, begA, endA]
+ out = [seqNameB, seqNameA, matchPercent, alnSize, mismatches,
+ gapOpens, begB, endB, begA, endA]
score, evalue = scoreAndEvalue(aLine)
if evalue:
@@ -689,12 +732,79 @@ def writeBlastTab(opts, maf):
bitScore = opts.bitScoreA * float(score) - opts.bitScoreB
out.append("%.3g" % bitScore)
- print("\t".join(out))
+ print(*out, sep="\t")
def mafConvertToBlastTab(opts, lines):
for maf in mafInput(opts, lines):
writeBlastTab(opts, maf)
+##### Routines for converting to GFF format: #####
+
+def writeGffHeader():
+    print("##gff-version 3")
+
+def gffFromMaf(maf):
+    aLine, sLines, qLines, pLines = maf
+    fieldsA, fieldsB = pairOrDie(sLines, "GFF")
+    seqNameA, seqLenA, strandA, letterSizeA, begA, endA, rowA = fieldsA
+    seqNameB, seqLenB, strandB, letterSizeB, begB, endB, rowB = fieldsB
+
+    score = "."
+    for i in aLine.split():
+        if i.startswith("score="):
+            score = i[6:]
+
+    if strandA == "-":
+        begA, endA = seqLenA - endA, seqLenA - begA
+    if strandB == "-":
+        begB, endB = seqLenB - endB, seqLenB - begB
+    begA += 1
+    begB += 1
+
+    strand = "+" if strandA == strandB else "-"
+
+    return seqNameA, begA, endA, strand, score, seqNameB, begB, endB
+
+def writeOneGff(gff, typeOfThing, parentId):
+    seqNameA, begA, endA, strand, score, seqNameB, begB, endB = gff
+    target = "Target={0} {1} {2}".format(seqNameB, begB, endB)
+    name = "Name={0}:{1}-{2}".format(seqNameB, begB, endB)
+    attributes = [target, name]
+    if parentId:
+        attributes.append("Parent=" + parentId)
+    print(seqNameA, "maf-convert", typeOfThing, begA, endA, score, strand, ".",
+          ";".join(attributes), sep="\t")
+
+def writeGffGroup(qryNameCounts, mafs):
+    gffs = [gffFromMaf(i) for i in mafs]
+    seqNameA, begA, endA, strand, score, seqNameB, begB, endB = gffs[0]
+    endA = gffs[-1][2]
+    if strand == "+":
+        endB = gffs[-1][7]
+    else:
+        begB = gffs[-1][6]
+    qryNameCounts[seqNameB] += 1
+    parentId = "{0}.{1}".format(seqNameB, qryNameCounts[seqNameB])
+    myId = "ID=" + parentId
+    name = "Name={0}:{1}-{2}".format(seqNameB, begB, endB)
+    attributes = [myId, name]
+    print(seqNameA, "maf-convert", "match", begA, endA, ".", strand, ".",
+          ";".join(attributes), sep="\t")
+    for i in gffs:
+        writeOneGff(i, "match_part", parentId)
+
+def mafConvertToGff(opts, lines):
+    qryNameCounts = collections.defaultdict(int)
+    if opts.Join:
+        for i in colinearMafInput(opts, lines):
+            writeGffGroup(qryNameCounts, i)
+    elif opts.join:
+        for i in mafGroupInput(opts, lines):
+            writeGffGroup(qryNameCounts, i)
+    else:
+        for i in mafInput(opts, lines):
+            writeOneGff(gffFromMaf(i), "match", None)
+
##### Routines for converting to HTML format: #####
def writeHtmlHeader():
@@ -812,6 +922,8 @@ def mafConvertOneFile(opts, formatName, lines):
mafConvertToBlastTab(opts, lines)
elif isFormat(formatName, "chain"):
mafConvertToChain(opts, lines)
+ elif isFormat(formatName, "gff"):
+ mafConvertToGff(opts, lines)
elif isFormat(formatName, "html"):
mafConvertToHtml(opts, lines)
elif isFormat(formatName, "psl"):
@@ -832,6 +944,8 @@ def mafConvert(opts, args):
opts.bitScoreB = None
if not opts.noheader:
+ if isFormat(formatName, "gff"):
+ writeGffHeader()
if isFormat(formatName, "html"):
writeHtmlHeader()
if isFormat(formatName, "sam"):
@@ -859,6 +973,7 @@ if __name__ == "__main__":
%prog blast mafFile(s)
%prog blasttab mafFile(s)
%prog chain mafFile(s)
+ %prog gff mafFile(s)
%prog html mafFile(s)
%prog psl mafFile(s)
%prog sam mafFile(s)
@@ -869,8 +984,10 @@ if __name__ == "__main__":
op = optparse.OptionParser(usage=usage, description=description)
op.add_option("-p", "--protein", action="store_true",
help="assume protein alignments, for psl match counts")
- op.add_option("-j", "--join", type="float", metavar="N",
- help="join co-linear alignments separated by <= N letters")
+ op.add_option("-j", "--join", type="float", metavar="N", help="join "
+ "consecutive co-linear alignments separated by <= N letters")
+ op.add_option("-J", "--Join", type="float", metavar="N", help=
+ "join nearest co-linear alignments separated by <= N letters")
op.add_option("-n", "--noheader", action="store_true",
help="omit any header lines from the output")
op.add_option("-d", "--dictionary", action="store_true",
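
Note on the GFF coordinates in the maf-convert hunk above: gffFromMaf flips reverse-strand intervals onto the forward strand and then adds 1 to the start only. That is consistent with converting 0-based, half-open, strand-relative coordinates into the 1-based, inclusive, forward-strand coordinates that GFF3 uses. The sketch below isolates that arithmetic. It assumes integer inputs, and the helper name maf_to_gff_range is hypothetical; it is not part of maf-convert.

```python
def maf_to_gff_range(beg, end, seq_len, strand):
    """Hypothetical helper mirroring the arithmetic in gffFromMaf.

    Takes a 0-based, half-open interval given relative to `strand`
    and returns a 1-based, inclusive interval on the forward strand.
    """
    if strand == "-":
        # Flip a reverse-strand interval onto the forward strand.
        beg, end = seq_len - end, seq_len - beg
    # Half-open 0-based -> inclusive 1-based: only the start shifts.
    return beg + 1, end


print(maf_to_gff_range(10, 20, 100, "+"))  # (11, 20)
print(maf_to_gff_range(10, 20, 100, "-"))  # (81, 90)
```

The interval length is preserved in both cases (10 letters), which is a quick sanity check on the strand flip.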
=====================================
scripts/maf-cut
=====================================
@@ -1,4 +1,5 @@
#! /usr/bin/env python
+# Author: Martin C. Frith 2018
from __future__ import print_function
=====================================
src/LastalArguments.cc
=====================================
@@ -135,13 +135,13 @@ E-value options (default settings):\n\
-E: maximum expected alignments per square giga (1e+18/D/refSize/numOfStrands)\n\
\n\
Score options (default settings):\n\
--r: match score (2 if -M, else 6 if 0<Q<5, else 1 if DNA)\n\
--q: mismatch cost (3 if -M, else 18 if 0<Q<5, else 1 if DNA)\n\
+-r: match score (2 if -M, else 6 if 1<=Q<=4, else 1 if DNA)\n\
+-q: mismatch cost (3 if -M, else 18 if 1<=Q<=4, else 1 if DNA)\n\
-p: match/mismatch score matrix (protein-protein: BL62, DNA-protein: BL80)\n\
-X: N/X is ambiguous in: 0=neither sequence, 1=reference, 2=query, 3=both ("
+ stringify(ambiguousLetterOpt) + ")\n\
--a: gap existence cost (DNA: 7, protein: 11, 0<Q<5: 21)\n\
--b: gap extension cost (DNA: 1, protein: 2, 0<Q<5: 9)\n\
+-a: gap existence cost (DNA: 7, protein: 11, 1<=Q<=4: 21)\n\
+-b: gap extension cost (DNA: 1, protein: 2, 1<=Q<=4: 9)\n\
-A: insertion existence cost (a)\n\
-B: insertion extension cost (b)\n\
-c: unaligned residue pair cost (off)\n\
@@ -178,7 +178,7 @@ Miscellaneous options (default settings):\n\
-N: stop after the first N alignments per query strand\n\
-R: repeat-marking options (the same as was used for lastdb)\n\
-u: mask lowercase during extensions: 0=never, 1=gapless,\n\
- 2=gapless+postmask, 3=always (2 if lastdb -c and Q<5, else 0)\n\
+ 2=gapless+postmask, 3=always (2 if lastdb -c and Q!=pssm, else 0)\n\
-w: suppress repeats inside exact matches, offset by <= this distance ("
+ stringify(maxRepeatDistance) + ")\n\
-G: genetic code (" + geneticCodeFile + ")\n\
@@ -189,8 +189,8 @@ Miscellaneous options (default settings):\n\
4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA,\n\
7=expected counts ("
+ stringify(outputType) + ")\n\
--Q: input format: 0=fasta or fastq-ignore, 1=fastq-sanger, 2=fastq-solexa,\n\
- 3=fastq-illumina, 4=prb, 5=PSSM (fasta)\n\
+-Q: input format: fastx, keep, sanger, solexa, illumina, prb, pssm\n\
+ (default=fasta)\n\
\n\
Report bugs to: last-align (ATmark) googlegroups (dot) com\n\
LAST home page: http://last.cbrc.jp/\n\
@@ -640,7 +640,7 @@ void LastalArguments::writeCommented( std::ostream& stream ) const{
if( outputType > 4 && outputType < 7 )
stream << " g=" << gamma;
stream << " j=" << outputType;
- stream << " Q=" << (inputFormat % sequenceFormat::fasta);
+ stream << " Q=" << (inputFormat < sequenceFormat::fasta ? inputFormat : 0);
stream << '\n';
stream << "# " << lastdbName << '\n';
=====================================
src/LastdbArguments.cc
=====================================
@@ -70,8 +70,7 @@ Advanced Options (default settings):\n\
-S: strand: 0=reverse, 1=forward, 2=both ("
+ stringify(strand) + ")\n\
-s: volume size (unlimited)\n\
--Q: input format: 0=fasta or fastq-ignore,\n\
- 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina (fasta)\n\
+-Q: input format: fastx, keep, sanger, solexa, illumina (default=fasta)\n\
-P: number of parallel threads ("
+ stringify(numOfThreads) + ")\n\
-m: seed pattern\n\
@@ -160,7 +159,8 @@ LAST home page: http://last.cbrc.jp/\n\
break;
case 'Q':
unstringify( inputFormat, optarg );
- if( inputFormat >= sequenceFormat::prb ) badopt( c, optarg );
+ if (inputFormat == sequenceFormat::prb ||
+ inputFormat == sequenceFormat::pssm) badopt(c, optarg);
break;
case '?':
ERR( "bad option" );
=====================================
src/SequenceFormat.hh
=====================================
@@ -3,19 +3,33 @@
#ifndef SEQUENCE_FORMAT_HH
#define SEQUENCE_FORMAT_HH
+#include <cctype>
#include <istream>
+#include <stddef.h>
+#include <string>
namespace cbrc {
namespace sequenceFormat {
-enum Enum { fastx, fastqSanger, fastqSolexa, fastqIllumina, prb, pssm, fasta };
+enum Enum { fastx, fastqSanger, fastqSolexa, fastqIllumina, prb, pssm, fasta,
+ fastxKeep };
}
inline std::istream &operator>>(std::istream &s, sequenceFormat::Enum &f) {
- int i = 0;
- s >> i;
- if (i < 0 || i > sequenceFormat::pssm) s.setstate(std::ios::failbit);
- if (s) f = static_cast<sequenceFormat::Enum>(i);
+ std::string w;
+ s >> w;
+ if (!s) return s;
+ for (size_t i = 0; i < w.size(); ++i) {
+ w[i] = std::tolower(w[i]);
+ }
+ /**/ if (w == "0" || w == "fastx") f = sequenceFormat::fastx;
+ else if (w == "keep") f = sequenceFormat::fastxKeep;
+ else if (w == "1" || w == "sanger") f = sequenceFormat::fastqSanger;
+ else if (w == "2" || w == "solexa") f = sequenceFormat::fastqSolexa;
+ else if (w == "3" || w == "illumina") f = sequenceFormat::fastqIllumina;
+ else if (w == "4" || w == "prb") f = sequenceFormat::prb;
+ else if (w == "5" || w == "pssm") f = sequenceFormat::pssm;
+ else s.setstate(std::ios::failbit);
return s;
}
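
Note on the -Q parsing in the SequenceFormat.hh hunk above: the new operator>> lower-cases the argument and accepts either the old numeric codes or a format name. The same mapping can be sketched in Python as a lookup table. The function name parse_input_format and the use of plain strings for the enum values are illustrative assumptions, not LAST's API.

```python
# Accepted -Q spellings, mapped to the enum names in SequenceFormat.hh.
_FORMAT_NAMES = {
    "0": "fastx", "fastx": "fastx",
    "keep": "fastxKeep",
    "1": "fastqSanger", "sanger": "fastqSanger",
    "2": "fastqSolexa", "solexa": "fastqSolexa",
    "3": "fastqIllumina", "illumina": "fastqIllumina",
    "4": "prb", "prb": "prb",
    "5": "pssm", "pssm": "pssm",
}


def parse_input_format(word):
    """Hypothetical Python analogue of the C++ operator>> in the diff."""
    try:
        # Matching is case-insensitive, as in the C++ tolower loop.
        return _FORMAT_NAMES[word.lower()]
    except KeyError:
        # The C++ code sets failbit here; raising is the Python analogue.
        raise ValueError("bad -Q value: %r" % word)


print(parse_input_format("Sanger"))  # fastqSanger
print(parse_input_format("0"))       # fastx
```

Going by the diff alone, "keep" has no numeric alias, and "fasta" is not an accepted spelling even though the help text shows it as the default.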
=====================================
src/last.hh
=====================================
@@ -27,6 +27,8 @@ inline std::istream &appendSequence(MultiSequence &m, std::istream &in,
m.appendFromFasta(in, maxSeqLen);
} else if (f == sequenceFormat::fastx) {
m.appendFromFastx(in, maxSeqLen, false);
+ } else if (f == sequenceFormat::fastxKeep) {
+ m.appendFromFastx(in, maxSeqLen, true);
} else if (f == sequenceFormat::prb) {
m.appendFromPrb(in, maxSeqLen, a.size, a.decode);
} else if (f == sequenceFormat::pssm) {
=====================================
src/lastal.cc
=====================================
@@ -1143,7 +1143,11 @@ void lastal( int argc, char** argv ){
matrixFile = ScoreMatrix::stringFromName( matrixName );
args.resetCumulativeOptions();
args.fromString( matrixFile ); // read options from the matrix file
+ sequenceFormat::Enum f = args.inputFormat;
args.fromArgs( argc, argv ); // command line overrides matrix file
+ if (isUseQuality(f) != isUseQuality(args.inputFormat)) {
+ ERR("option -Q is inconsistent with the matrix file");
+ }
}
if( minSeedLimit > 1 ){
=====================================
src/lastdb.cc
=====================================
@@ -126,7 +126,8 @@ void writePrjFile( const std::string& fileName, const LastdbArguments& args,
}
f << "masklowercase=" << args.isCaseSensitive << '\n';
if( isFastq ){
- f << "sequenceformat=" << args.inputFormat << '\n';
+ f << "sequenceformat="
+ << (args.inputFormat % sequenceFormat::fastxKeep) << '\n';
}
if( args.minimizerWindow > 1 ){
// Maybe this should be written (and read) by the indexes, so
=====================================
src/split/cbrc_split_aligner.cc
=====================================
@@ -758,7 +758,7 @@ void SplitAligner::calcBaseScores(unsigned i) {
const char *rAlign = a.ralign;
const char *qAlign = a.qalign;
- const char *qQual = a.qQual;
+ const char *qQual = qualityOffset ? a.qQual : 0;
while (*qAlign) {
unsigned char x = *rAlign++;
=====================================
src/split/last-split.cc
=====================================
@@ -370,7 +370,8 @@ void lastSplit(LastSplitOptions& opts) {
: 0.0;
int jumpCost =
(jumpProb > 0.0) ? -scoreFromProb(jumpProb, scale) : -(INT_MIN/2);
- int qualityOffset = (sequenceFormat == 3) ? 64 : 33;
+ int qualityOffset =
+ (sequenceFormat == 0) ? 0 : (sequenceFormat == 3) ? 64 : 33;
printParameters(opts);
sa.setParams(-gapExistenceCost, -gapExtensionCost,
-insExistenceCost, -insExtensionCost,
=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"1061"
+"1066"
View it on GitLab: https://salsa.debian.org/med-team/last-align/-/compare/bcc8c5f686990ac3beec20d80595b0168ce0d56c...e1d8e7ffb17d07a95396beed25a56b5b330949b3