[med-svn] [Git][med-team/last-align][master] 6 commits: New upstream version 932
Andreas Tille
gitlab at salsa.debian.org
Thu Apr 26 21:31:58 BST 2018
Andreas Tille pushed to branch master at Debian Med / last-align
Commits:
84ec4ca8 by Andreas Tille at 2018-04-26T22:27:11+02:00
New upstream version 932
- - - - -
01411f75 by Andreas Tille at 2018-04-26T22:27:12+02:00
Update upstream source from tag 'upstream/932'
Update to upstream version '932'
with Debian dir ca55603e5f6bfad4fafbad60ea034882fa3de4f9
- - - - -
31755067 by Andreas Tille at 2018-04-26T22:27:13+02:00
New upstream version
- - - - -
8316de43 by Andreas Tille at 2018-04-26T22:27:19+02:00
Point Vcs fields to salsa.debian.org
- - - - -
7c2a5450 by Andreas Tille at 2018-04-26T22:27:19+02:00
Standards-Version: 4.1.4
- - - - -
041dbcfc by Andreas Tille at 2018-04-26T22:30:13+02:00
Upload to unstable
- - - - -
25 changed files:
- ChangeLog.txt
- debian/changelog
- debian/control
- doc/Makefile
- doc/lastal.html
- doc/lastal.txt
- doc/lastdb.html
- doc/lastdb.txt
- + doc/maf-cut.html
- + doc/maf-cut.txt
- scripts/last-dotplot
- + scripts/maf-cut
- src/Alphabet.cc
- src/QualityPssmMaker.cc
- src/ScoreMatrix.cc
- src/ScoreMatrix.hh
- src/SubsetSuffixArray.hh
- src/SubsetSuffixArraySort.cc
- src/TantanMasker.cc
- src/alp/sls_alignment_evaluer.cpp
- src/lastal.cc
- src/lastdb.cc
- src/makefile
- src/qualityScoreUtil.hh
- src/version.hh
Changes:
=====================================
ChangeLog.txt
=====================================
--- a/ChangeLog.txt
+++ b/ChangeLog.txt
@@ -1,9 +1,74 @@
+2018-04-20 Martin C. Frith <Martin C. Frith>
+
+ * doc/Makefile, doc/lastal.txt, doc/maf-cut.txt, scripts/maf-cut:
+ Add maf-cut utility
+ [482845444bcd] [tip]
+
+2018-04-18 Martin C. Frith <Martin C. Frith>
+
+ * src/ScoreMatrix.cc, src/ScoreMatrix.hh, src/lastal.cc, src/makefile,
+ test/last-test.out:
+ Use appropriate scores for ambiguous DNA letters
+ [75785542d5a0]
+
+2018-04-17 Martin C. Frith <Martin C. Frith>
+
+ * src/ScoreMatrix.cc, src/ScoreMatrix.hh, test/last-test.out:
+ Prettify the score matrix in lastal's header
+ [68bd00d2faff]
+
+ * src/ScoreMatrix.cc, src/lastal.cc:
+ Refactor
+ [d1aa91889332]
+
+ * src/ScoreMatrix.cc, src/ScoreMatrix.hh, src/TantanMasker.cc,
+ src/lastal.cc:
+ Refactor
+ [e22cac29168c]
+
+2018-04-16 Martin C. Frith <Martin C. Frith>
+
+ * src/QualityPssmMaker.cc, src/ScoreMatrix.cc, src/ScoreMatrix.hh,
+ src/qualityScoreUtil.hh:
+ Refactor
+ [fb851b8892a5]
+
+2018-03-05 Martin C. Frith <Martin C. Frith>
+
+ * scripts/last-dotplot, test/last-dotplot-test.out, test/last-dotplot-
+ test.sh:
+ Fix dotplot --strands bug with cut sequences
+ [bc08832db1a1]
+
+ * scripts/last-dotplot, test/last-dotplot-test.out, test/last-dotplot-
+ test.sh:
+ Fix dotplot crash for sorting by sequence length
+ [e1861f956f60]
+
+2018-02-26 Martin C. Frith <Martin C. Frith>
+
+ * doc/lastdb.txt, src/Alphabet.cc, test/last-test.out, test/last-
+ test.sh:
+ Convert Us in RNA to Ts
+ [e7d7cb7db0a7]
+
+2018-02-13 Martin C. Frith <Martin C. Frith>
+
+ * doc/lastdb.txt, src/SubsetSuffixArray.hh,
+ src/SubsetSuffixArraySort.cc, src/lastdb.cc:
+ Multi-thread lastdb's main sort
+ [77d8757028c2]
+
+ * src/alp/sls_alignment_evaluer.cpp, test/last-test.out:
+ Fix some E-value & last-train fails
+ [2c57015b1e89]
+
2018-01-05 Martin C. Frith <Martin C. Frith>
* src/Alignment.cc, src/Alignment.hh, src/Centroid.cc,
src/Centroid.hh:
Refactor
- [8430dbf5abe7] [tip]
+ [8430dbf5abe7]
* src/Centroid.cc:
Try to calculate lastal probs a tiny bit faster
=====================================
debian/changelog
=====================================
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,11 @@
+last-align (932-1) unstable; urgency=medium
+
+ * New upstream version
+ * Point Vcs fields to salsa.debian.org
+ * Standards-Version: 4.1.4
+
+ -- Andreas Tille <tille at debian.org> Thu, 26 Apr 2018 22:27:19 +0200
+
last-align (921-1) unstable; urgency=medium
* New upstream version
=====================================
debian/control
=====================================
--- a/debian/control
+++ b/debian/control
@@ -8,9 +8,9 @@ Build-Depends: debhelper (>= 11~),
help2man,
python-pil,
zlib1g-dev
-Standards-Version: 4.1.3
-Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/last-align.git
-Vcs-Git: https://anonscm.debian.org/git/debian-med/last-align.git
+Standards-Version: 4.1.4
+Vcs-Browser: https://salsa.debian.org/med-team/last-align
+Vcs-Git: https://salsa.debian.org/med-team/last-align.git
Homepage: http://last.cbrc.jp/
Package: last-align
=====================================
doc/Makefile
=====================================
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -5,7 +5,7 @@ last-map-probs.html last-matrices.html last-pair-probs.html \
last-papers.html last-parallel.html last-postmask.html \
last-repeats.html last-seeds.html last-split.html last-train.html \
last-tuning.html last-tutorial.html last.html lastal.html lastdb.html \
-maf-convert.html
+maf-convert.html maf-cut.html
all: ${DOCS}
=====================================
doc/lastal.html
=====================================
--- a/doc/lastal.html
+++ b/doc/lastal.html
@@ -462,14 +462,18 @@ value of -e, so if you specify -e then -E has no effect.</td></tr>
<td><p class="first">Specify a match/mismatch score matrix. Options -r and -q will
be ignored. The built-in matrices are described in
<a class="reference external" href="last-matrices.html">last-matrices.html</a>.</p>
-<p class="last">Any other NAME is assumed to be a file name. For an example of
-the format, see the matrix files in the data directory. Any
-letters that aren't in the matrix will get the lowest score in
-the matrix when aligned to anything. Asymmetric scores are
-allowed: query letters correspond to columns and reference
-letters correspond to rows. Other options can be specified on
-lines starting with "#last", but command line options override
-them.</p>
+<p>Any other NAME is assumed to be a file name. For an example of
+the format, see the matrix files in the data directory.
+Asymmetric scores are allowed: query letters correspond to
+columns and reference letters correspond to rows.</p>
+<p>Any letters that aren't in the matrix get default match/mismatch
+scores. For doubly- and triply-ambiguous bases (such as "W"
+meaning A or T), these default scores are derived from the ACGT
+scores, and are shown in the header of lastal's output. Any
+other letters get the lowest score in the matrix when aligned to
+anything.</p>
+<p class="last">Other options can be specified on lines starting with "#last",
+but command line options override them.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-a <var>COST</var></span></kbd></td>
=====================================
doc/lastal.txt
=====================================
--- a/doc/lastal.txt
+++ b/doc/lastal.txt
@@ -130,13 +130,19 @@ Score options
`<last-matrices.html>`_.
Any other NAME is assumed to be a file name. For an example of
- the format, see the matrix files in the data directory. Any
- letters that aren't in the matrix will get the lowest score in
- the matrix when aligned to anything. Asymmetric scores are
- allowed: query letters correspond to columns and reference
- letters correspond to rows. Other options can be specified on
- lines starting with "#last", but command line options override
- them.
+ the format, see the matrix files in the data directory.
+ Asymmetric scores are allowed: query letters correspond to
+ columns and reference letters correspond to rows.
+
+ Any letters that aren't in the matrix get default match/mismatch
+ scores. For doubly- and triply-ambiguous bases (such as "W"
+ meaning A or T), these default scores are derived from the ACGT
+ scores, and are shown in the header of lastal's output. Any
+ other letters get the lowest score in the matrix when aligned to
+ anything.
+
+ Other options can be specified on lines starting with "#last",
+ but command line options override them.
-a COST
Gap existence cost.
=====================================
doc/lastdb.html
=====================================
--- a/doc/lastdb.html
+++ b/doc/lastdb.html
@@ -461,8 +461,7 @@ lastdb and then used by lastal. These formats are described in
<kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
<td>Divide the work between this number of threads running in
parallel. 0 means use as many threads as your computer claims
-it can handle simultaneously. Currently, multi-threading is
-used for tantan masking only.</td></tr>
+it can handle simultaneously.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m <var>PATTERN</var></span></kbd></td>
<td><p class="first">Specify a spaced seed pattern, for example "-m 110101". In this
@@ -490,7 +489,8 @@ alphabet is equivalent to "-a ACGT". The protein alphabet (-p)
is equivalent to "-a ACDEFGHIKLMNPQRSTVWY". Non-alphabet
letters are allowed in sequences, but by default they are
excluded from initial matches and get the mismatch score when
-aligned to anything. If -a is specified, -p is ignored.</td></tr>
+aligned to anything. As a special case, for the DNA alphabet,
+Us are converted to Ts. If -a is specified, -p is ignored.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-i <var>MATCHES</var></span></kbd></td>
<td>This option makes lastdb faster, at the expense of limiting your
=====================================
doc/lastdb.txt
=====================================
--- a/doc/lastdb.txt
+++ b/doc/lastdb.txt
@@ -129,8 +129,7 @@ Advanced Options
-P THREADS
Divide the work between this number of threads running in
parallel. 0 means use as many threads as your computer claims
- it can handle simultaneously. Currently, multi-threading is
- used for tantan masking only.
+ it can handle simultaneously.
-m PATTERN
Specify a spaced seed pattern, for example "-m 110101". In this
@@ -161,7 +160,8 @@ Advanced Options
is equivalent to "-a ACDEFGHIKLMNPQRSTVWY". Non-alphabet
letters are allowed in sequences, but by default they are
excluded from initial matches and get the mismatch score when
- aligned to anything. If -a is specified, -p is ignored.
+ aligned to anything. As a special case, for the DNA alphabet,
+ Us are converted to Ts. If -a is specified, -p is ignored.
-i MATCHES
This option makes lastdb faster, at the expense of limiting your
=====================================
doc/maf-cut.html
=====================================
--- /dev/null
+++ b/doc/maf-cut.html
@@ -0,0 +1,336 @@
+<?xml version="1.0" encoding="utf-8" ?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+<meta name="generator" content="Docutils 0.6: http://docutils.sourceforge.net/" />
+<title>maf-cut</title>
+<style type="text/css">
+
+/*
+:Author: David Goodger (goodger at python.org)
+:Id: $Id: html4css1.css 5951 2009-05-18 18:03:10Z milde $
+:Copyright: This stylesheet has been placed in the public domain.
+
+Default cascading style sheet for the HTML output of Docutils.
+
+See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
+customize this style sheet.
+*/
+
+/* used to remove borders from tables and images */
+.borderless, table.borderless td, table.borderless th {
+ border: 0 }
+
+table.borderless td, table.borderless th {
+ /* Override padding for "table.docutils td" with "! important".
+ The right padding separates the table cells. */
+ padding: 0 0.5em 0 0 ! important }
+
+.first {
+ /* Override more specific margin styles with "! important". */
+ margin-top: 0 ! important }
+
+.last, .with-subtitle {
+ margin-bottom: 0 ! important }
+
+.hidden {
+ display: none }
+
+a.toc-backref {
+ text-decoration: none ;
+ color: black }
+
+blockquote.epigraph {
+ margin: 2em 5em ; }
+
+dl.docutils dd {
+ margin-bottom: 0.5em }
+
+/* Uncomment (and remove this text!) to get bold-faced definition list terms
+dl.docutils dt {
+ font-weight: bold }
+*/
+
+div.abstract {
+ margin: 2em 5em }
+
+div.abstract p.topic-title {
+ font-weight: bold ;
+ text-align: center }
+
+div.admonition, div.attention, div.caution, div.danger, div.error,
+div.hint, div.important, div.note, div.tip, div.warning {
+ margin: 2em ;
+ border: medium outset ;
+ padding: 1em }
+
+div.admonition p.admonition-title, div.hint p.admonition-title,
+div.important p.admonition-title, div.note p.admonition-title,
+div.tip p.admonition-title {
+ font-weight: bold ;
+ font-family: sans-serif }
+
+div.attention p.admonition-title, div.caution p.admonition-title,
+div.danger p.admonition-title, div.error p.admonition-title,
+div.warning p.admonition-title {
+ color: red ;
+ font-weight: bold ;
+ font-family: sans-serif }
+
+/* Uncomment (and remove this text!) to get reduced vertical space in
+ compound paragraphs.
+div.compound .compound-first, div.compound .compound-middle {
+ margin-bottom: 0.5em }
+
+div.compound .compound-last, div.compound .compound-middle {
+ margin-top: 0.5em }
+*/
+
+div.dedication {
+ margin: 2em 5em ;
+ text-align: center ;
+ font-style: italic }
+
+div.dedication p.topic-title {
+ font-weight: bold ;
+ font-style: normal }
+
+div.figure {
+ margin-left: 2em ;
+ margin-right: 2em }
+
+div.footer, div.header {
+ clear: both;
+ font-size: smaller }
+
+div.line-block {
+ display: block ;
+ margin-top: 1em ;
+ margin-bottom: 1em }
+
+div.line-block div.line-block {
+ margin-top: 0 ;
+ margin-bottom: 0 ;
+ margin-left: 1.5em }
+
+div.sidebar {
+ margin: 0 0 0.5em 1em ;
+ border: medium outset ;
+ padding: 1em ;
+ background-color: #ffffee ;
+ width: 40% ;
+ float: right ;
+ clear: right }
+
+div.sidebar p.rubric {
+ font-family: sans-serif ;
+ font-size: medium }
+
+div.system-messages {
+ margin: 5em }
+
+div.system-messages h1 {
+ color: red }
+
+div.system-message {
+ border: medium outset ;
+ padding: 1em }
+
+div.system-message p.system-message-title {
+ color: red ;
+ font-weight: bold }
+
+div.topic {
+ margin: 2em }
+
+h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
+h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
+ margin-top: 0.4em }
+
+h1.title {
+ text-align: center }
+
+h2.subtitle {
+ text-align: center }
+
+hr.docutils {
+ width: 75% }
+
+img.align-left, .figure.align-left{
+ clear: left ;
+ float: left ;
+ margin-right: 1em }
+
+img.align-right, .figure.align-right {
+ clear: right ;
+ float: right ;
+ margin-left: 1em }
+
+.align-left {
+ text-align: left }
+
+.align-center {
+ clear: both ;
+ text-align: center }
+
+.align-right {
+ text-align: right }
+
+/* reset inner alignment in figures */
+div.align-right {
+ text-align: left }
+
+/* div.align-center * { */
+/* text-align: left } */
+
+ol.simple, ul.simple {
+ margin-bottom: 1em }
+
+ol.arabic {
+ list-style: decimal }
+
+ol.loweralpha {
+ list-style: lower-alpha }
+
+ol.upperalpha {
+ list-style: upper-alpha }
+
+ol.lowerroman {
+ list-style: lower-roman }
+
+ol.upperroman {
+ list-style: upper-roman }
+
+p.attribution {
+ text-align: right ;
+ margin-left: 50% }
+
+p.caption {
+ font-style: italic }
+
+p.credits {
+ font-style: italic ;
+ font-size: smaller }
+
+p.label {
+ white-space: nowrap }
+
+p.rubric {
+ font-weight: bold ;
+ font-size: larger ;
+ color: maroon ;
+ text-align: center }
+
+p.sidebar-title {
+ font-family: sans-serif ;
+ font-weight: bold ;
+ font-size: larger }
+
+p.sidebar-subtitle {
+ font-family: sans-serif ;
+ font-weight: bold }
+
+p.topic-title {
+ font-weight: bold }
+
+pre.address {
+ margin-bottom: 0 ;
+ margin-top: 0 ;
+ font: inherit }
+
+pre.literal-block, pre.doctest-block {
+ margin-left: 2em ;
+ margin-right: 2em }
+
+span.classifier {
+ font-family: sans-serif ;
+ font-style: oblique }
+
+span.classifier-delimiter {
+ font-family: sans-serif ;
+ font-weight: bold }
+
+span.interpreted {
+ font-family: sans-serif }
+
+span.option {
+ white-space: nowrap }
+
+span.pre {
+ white-space: pre }
+
+span.problematic {
+ color: red }
+
+span.section-subtitle {
+ /* font-size relative to parent (h1..h6 element) */
+ font-size: 80% }
+
+table.citation {
+ border-left: solid 1px gray;
+ margin-left: 1px }
+
+table.docinfo {
+ margin: 2em 4em }
+
+table.docutils {
+ margin-top: 0.5em ;
+ margin-bottom: 0.5em }
+
+table.footnote {
+ border-left: solid 1px black;
+ margin-left: 1px }
+
+table.docutils td, table.docutils th,
+table.docinfo td, table.docinfo th {
+ padding-left: 0.5em ;
+ padding-right: 0.5em ;
+ vertical-align: top }
+
+table.docutils th.field-name, table.docinfo th.docinfo-name {
+ font-weight: bold ;
+ text-align: left ;
+ white-space: nowrap ;
+ padding-left: 0 }
+
+h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
+h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
+ font-size: 100% }
+
+ul.auto-toc {
+ list-style-type: none }
+
+</style>
+<style type="text/css">
+
+/* Style sheet for LAST HTML documents */
+h1 { color: navy }
+h2 { color: teal }
+div.document { margin-left: auto; margin-right: auto; max-width: 45em }
+strong { color: red }
+.option-list td { padding-bottom: 1em }
+table.field-list { border: thin solid green }
+
+</style>
+</head>
+<body>
+<div class="document" id="maf-cut">
+<h1 class="title">maf-cut</h1>
+
+<p><tt class="docutils literal"><span class="pre">maf-cut</span></tt> cuts out parts of <a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">MAF</a> format alignments. For example:</p>
+<pre class="literal-block">
+maf-cut chr7:1234000-1235000 alignments.maf
+</pre>
+<p>This gets alignment parts within the range 1234000-1235000 of the
+sequence named "chr7".</p>
+<p>The ranges are zero-based. For example, <tt class="docutils literal"><span class="pre">chr7:0-1000</span></tt> means the
+first 1000 bases of chr7.</p>
+<p>You can omit the sequence name:</p>
+<pre class="literal-block">
+maf-cut 1234000-1235000 alignments.maf
+</pre>
+<p>Then, the range applies to the topmost sequence in each alignment.</p>
+</div>
+</body>
+</html>
=====================================
doc/maf-cut.txt
=====================================
--- /dev/null
+++ b/doc/maf-cut.txt
@@ -0,0 +1,20 @@
+maf-cut
+=======
+
+``maf-cut`` cuts out parts of MAF_ format alignments. For example::
+
+ maf-cut chr7:1234000-1235000 alignments.maf
+
+This gets alignment parts within the range 1234000-1235000 of the
+sequence named "chr7".
+
+The ranges are zero-based. For example, ``chr7:0-1000`` means the
+first 1000 bases of chr7.
+
+You can omit the sequence name::
+
+ maf-cut 1234000-1235000 alignments.maf
+
+Then, the range applies to the topmost sequence in each alignment.
+
+.. _MAF: http://genome.ucsc.edu/FAQ/FAQformat.html#format5
=====================================
scripts/last-dotplot
=====================================
--- a/scripts/last-dotplot
+++ b/scripts/last-dotplot
@@ -285,7 +285,7 @@ def nameKey(oneSeqRanges):
return natural_sort_key(oneSeqRanges[0][0])
def sizeKey(oneSeqRanges):
- return sum(b - e for n, b, e in oneSeqRanges), nameKey(oneSeqRanges)
+ return sum(b - e for n, b, e, s in oneSeqRanges), nameKey(oneSeqRanges)
def alignmentKey(seqNamesToLists, oneSeqRanges):
seqName = oneSeqRanges[0][0]
@@ -479,7 +479,7 @@ def strandAndOrigin(ranges, beg, size):
if isReverseStrand:
beg = -(beg + size)
for rangeBeg, rangeEnd, isReverseRange, origin in ranges:
- if rangeEnd > beg:
+ if rangeEnd > beg: # assumes the ranges are sorted
return (isReverseStrand != isReverseRange), origin
def alignmentPixels(width, height, alignments, bp_per_pix,
@@ -698,7 +698,7 @@ def rangesWithOrigins(sortedRanges, rangePixBegs, rangePixLens, bpPerPix):
def rangesPerSeq(sortedRanges, rangePixBegs, rangePixLens, bpPerPix):
a = rangesWithOrigins(sortedRanges, rangePixBegs, rangePixLens, bpPerPix)
for k, v in itertools.groupby(a, itemgetter(0)):
- yield k, [i[1] for i in v]
+ yield k, sorted(i[1] for i in v)
def getFont(opts):
if opts.fontfile:
=====================================
scripts/maf-cut
=====================================
--- /dev/null
+++ b/scripts/maf-cut
@@ -0,0 +1,135 @@
+#! /usr/bin/env python
+
+from __future__ import print_function
+
+import gzip
+import itertools
+import optparse
+import signal
+import sys
+
+def myOpen(fileName): # faster than fileinput
+ if fileName == "-":
+ return sys.stdin
+ if fileName.endswith(".gz"):
+ return gzip.open(fileName)
+ return open(fileName)
+
+def alnBegFromSeqBeg(gappedSequence, seqBeg):
+ for i, x in enumerate(gappedSequence):
+ if x != "-":
+ if seqBeg == 0:
+ return i
+ seqBeg -= 1
+
+def alnEndFromSeqEnd(gappedSequence, seqEnd):
+ for i, x in enumerate(gappedSequence):
+ if x != "-":
+ seqEnd -= 1
+ if seqEnd == 0:
+ return i + 1
+
+def alignmentRange(cutBeg, cutEnd, sLineFields):
+ beg = int(sLineFields[2])
+ if beg >= cutEnd:
+ return 0, 0
+ sequenceWithGaps = sLineFields[6]
+ span = len(sequenceWithGaps) - sequenceWithGaps.count("-")
+ end = beg + span
+ if end <= cutBeg:
+ return 0, 0
+ seqBeg = max(cutBeg - beg, 0)
+ alnBeg = alnBegFromSeqBeg(sequenceWithGaps, seqBeg)
+ seqEnd = min(cutEnd - beg, span)
+ alnEnd = alnEndFromSeqEnd(sequenceWithGaps, seqEnd)
+ return alnBeg, alnEnd
+
+def findTheSpecifiedSequence(seqName, mafLines):
+ for line in mafLines:
+ if line[0] == "s":
+ fields = line.split()
+ if seqName is None or fields[1] == seqName:
+ return fields
+ return None
+
+def cutMafRecords(mafLines, alnBeg, alnEnd):
+ for line in mafLines:
+ fields = line.split()
+ if line[0] == "s":
+ oldSeq = fields[6]
+ newSeq = oldSeq[alnBeg:alnEnd]
+ beg = int(fields[2]) + alnBeg - oldSeq[:alnBeg].count("-")
+ span = len(newSeq) - newSeq.count("-")
+ yield fields[:2] + [str(beg), str(span)] + fields[4:6] + [newSeq]
+ elif line[0] == "q":
+ yield fields[:2] + [fields[2][alnBeg:alnEnd]]
+ elif line[0] == "p":
+ yield fields[:1] + [fields[1][alnBeg:alnEnd]]
+ else:
+ yield fields
+
+def mafFieldWidths(mafRecords):
+ sRecords = (i for i in mafRecords if i[0] == "s")
+ sColumns = zip(*sRecords)
+ for i in sColumns:
+ yield max(map(len, i))
+
+def printMafLine(fieldWidths, fields):
+ formatParams = itertools.chain.from_iterable(zip(fieldWidths, fields))
+ print("%*s %-*s %*s %*s %*s %*s %*s" % tuple(formatParams))
+
+def cutOneMaf(cutRange, mafLines):
+ seqName, cutBeg, cutEnd = cutRange
+ sLineFields = findTheSpecifiedSequence(seqName, mafLines)
+ if not sLineFields:
+ return
+ alnBeg, alnEnd = alignmentRange(cutBeg, cutEnd, sLineFields)
+ if alnBeg >= alnEnd:
+ return
+ mafRecords = list(cutMafRecords(mafLines, alnBeg, alnEnd))
+ fieldWidths = list(mafFieldWidths(mafRecords))
+ for fields in mafRecords:
+ if fields[0] == "s":
+ printMafLine(fieldWidths, fields)
+ elif fields[0] == "q":
+ printMafLine(fieldWidths, fields[:2] + [""] * 4 + fields[2:])
+ elif fields[0] == "p":
+ printMafLine(fieldWidths, fields[:1] + [""] * 5 + fields[1:])
+ else:
+ print(" ".join(fields))
+ print()
+
+def mafCutOneFile(cutRange, lines):
+ mafLines = []
+ for line in lines:
+ if line.isspace():
+ cutOneMaf(cutRange, mafLines)
+ mafLines = []
+ elif line[0] != "#":
+ mafLines.append(line)
+ cutOneMaf(cutRange, mafLines)
+
+def wantedRange(cutSpecification):
+ seqName = None
+ if ":" in cutSpecification:
+ seqName, cutSpecification = cutSpecification.rsplit(":", 1)
+ beg, end = cutSpecification.rsplit("-", 1)
+ return seqName, int(beg), int(end)
+
+def mafCut(opts, args):
+ cutRange = wantedRange(args[0])
+ if len(args) > 1:
+ for i in args[1:]:
+ mafCutOneFile(cutRange, myOpen(i))
+ else:
+ mafCutOneFile(cutRange, sys.stdin)
+
+if __name__ == "__main__":
+ signal.signal(signal.SIGPIPE, signal.SIG_DFL) # avoid silly error message
+ usage = "%prog chrN:start-end alignments.maf"
+ description = "Get parts of MAF-format alignments."
+ op = optparse.OptionParser(usage=usage, description=description)
+ opts, args = op.parse_args()
+ if not args:
+ op.error("please give me a cut specification and MAF alignments")
+ mafCut(opts, args)
=====================================
src/Alphabet.cc
=====================================
--- a/src/Alphabet.cc
+++ b/src/Alphabet.cc
@@ -96,6 +96,11 @@ void Alphabet::initCaseConversions( unsigned codeEnd ) {
numbersToLowercase[i] = i;
}
+ if (letters == dna) { // kludge for RNA:
+ encode['U'] = encode['T'];
+ encode['u'] = encode['t'];
+ }
+
for( unsigned i = 0; i < capacity; ++i ){
lettersToUppercase[i] = numbersToUppercase[ encode[i] ];
}
=====================================
src/QualityPssmMaker.cc
=====================================
--- a/src/QualityPssmMaker.cc
+++ b/src/QualityPssmMaker.cc
@@ -29,8 +29,8 @@ void QualityPssmMaker::init(const ScoreMatrixRow *scoreMatrix,
this->isMatchMismatchMatrix = isMatchMismatchMatrix;
this->toUnmasked = toUnmasked;
- double matchExp = std::exp(lambda * matchScore);
- double mismatchExp = std::exp(lambda * mismatchScore);
+ double matchExp = probFromScore(lambda, matchScore);
+ double mismatchExp = probFromScore(lambda, mismatchScore);
for (int q = 0; q < qualityCapacity; ++q) {
double e = errorProbFromQual(q, qualityOffset, false);
@@ -40,12 +40,12 @@ void QualityPssmMaker::init(const ScoreMatrixRow *scoreMatrix,
double expScore = p * matchExp + e * mismatchExp;
assert(p > 0);
assert(expScore > 0);
- qualityToMatchScore[q] = nearestInt(std::log(expScore) / lambda);
+ qualityToMatchScore[q] = scoreFromProb(lambda, expScore);
}
for (int i = 0; i < scoreMatrixRowSize; ++i)
for (int j = 0; j < scoreMatrixRowSize; ++j)
- expMatrix[i][j] = std::exp(lambda * scoreMatrix[i][j]); // can be 0
+ expMatrix[i][j] = probFromScore(lambda, scoreMatrix[i][j]); // can be 0
}
// This function is a wee bit slow, even if isMatchMismatchMatrix and
@@ -88,7 +88,7 @@ void QualityPssmMaker::make(const uchar *sequenceBeg,
double expScore = std::inner_product(letterProbs, letterProbsEnd,
expMatrix[unmasked1], 0.0);
assert(expScore > 0);
- score = nearestInt(std::log(expScore) / lambda);
+ score = scoreFromProb(lambda, expScore);
}
}
=====================================
src/ScoreMatrix.cc
=====================================
--- a/src/ScoreMatrix.cc
+++ b/src/ScoreMatrix.cc
@@ -2,6 +2,7 @@
#include "ScoreMatrix.hh"
#include "ScoreMatrixData.hh"
+#include "qualityScoreUtil.hh"
#include "zio.hh"
#include <sstream>
#include <iomanip>
@@ -9,17 +10,24 @@
#include <stdexcept>
#include <cassert>
#include <cctype> // toupper, tolower
-#include <cstddef> // size_t
+#include <stddef.h> // size_t
//#include <iostream> // for debugging
#define ERR(x) throw std::runtime_error(x)
#define COUNTOF(a) (sizeof (a) / sizeof *(a))
+static void makeUppercase(std::string& s) {
+ for (size_t i = 0; i < s.size(); ++i) {
+ unsigned char c = s[i];
+ s[i] = std::toupper(c);
+ }
+}
+
namespace cbrc{
const char *ScoreMatrix::canonicalName( const std::string& name ){
- for( std::size_t i = 0; i < COUNTOF(scoreMatrixNicknames); ++i )
+ for( size_t i = 0; i < COUNTOF(scoreMatrixNicknames); ++i )
if( name == scoreMatrixNicknames[i].nickname )
return scoreMatrixNicknames[i].realname;
return name.c_str();
@@ -28,24 +36,24 @@ const char *ScoreMatrix::canonicalName( const std::string& name ){
std::string ScoreMatrix::stringFromName( const std::string& name ){
std::string n = canonicalName( name );
- for( std::size_t i = 0; i < COUNTOF(scoreMatrices); ++i )
+ for( size_t i = 0; i < COUNTOF(scoreMatrices); ++i )
if( n == scoreMatrices[i].name )
return scoreMatrices[i].text;
return slurp( n.c_str() );
}
-void ScoreMatrix::matchMismatch( int match, int mismatch,
- const std::string& letters ){
- rows = letters;
- cols = letters;
- std::size_t size = letters.size();
+void ScoreMatrix::setMatchMismatch(int matchScore, int mismatchCost,
+ const std::string& symbols) {
+ rowSymbols.assign(symbols.begin(), symbols.end());
+ colSymbols.assign(symbols.begin(), symbols.end());
- cells.resize( size );
+ size_t size = symbols.size();
+ cells.resize(size);
- for( std::size_t i = 0; i < size; ++i ){
- cells[i].assign( size, -mismatch );
- cells[i][i] = match;
+ for (size_t i = 0; i < size; ++i) {
+ cells[i].assign(size, -mismatchCost);
+ cells[i][i] = matchScore;
}
}
@@ -55,20 +63,31 @@ void ScoreMatrix::fromString( const std::string& matString ){
if( !iss ) ERR( "can't read the score matrix" );
}
-void ScoreMatrix::init( const uchar encode[] ){
- assert( !rows.empty() && !cols.empty() );
+static unsigned s2i(const uchar symbolToIndex[], uchar c) {
+ return symbolToIndex[c];
+}
+
+static void upperAndLowerIndex(unsigned tooBig, const uchar symbolToIndex[],
+ char symbol, unsigned& upper, unsigned& lower) {
+ uchar s = symbol;
+ upper = symbolToIndex[s];
+ lower = symbolToIndex[std::tolower(s)];
+ if (upper >= tooBig || lower >= tooBig) {
+ ERR(std::string("bad letter in score matrix: ") + symbol);
+ }
+}
- for( std::string::iterator i = rows.begin(); i < rows.end(); ++i )
- *i = std::toupper( *i );
+void ScoreMatrix::init(const uchar symbolToIndex[]) {
+ assert( !rowSymbols.empty() && !colSymbols.empty() );
- for( std::string::iterator i = cols.begin(); i < cols.end(); ++i )
- *i = std::toupper( *i );
+ makeUppercase(rowSymbols);
+ makeUppercase(colSymbols);
minScore = cells[0][0];
maxScore = cells[0][0];
- for( std::size_t i = 0; i < rows.size(); ++i ){
- for( std::size_t j = 0; j < cols.size(); ++j ){
+ for( size_t i = 0; i < rowSymbols.size(); ++i ){
+ for( size_t j = 0; j < colSymbols.size(); ++j ){
minScore = std::min( minScore, cells[i][j] );
maxScore = std::max( maxScore, cells[i][j] );
}
@@ -82,30 +101,25 @@ void ScoreMatrix::init( const uchar encode[] ){
}
}
- for( std::size_t i = 0; i < rows.size(); ++i ){
- for( std::size_t j = 0; j < cols.size(); ++j ){
- uchar x = encode[ uchar(rows[i]) ];
- uchar y = encode[ uchar(cols[j]) ];
- uchar a = encode[ std::tolower( rows[i] ) ];
- uchar b = encode[ std::tolower( cols[j] ) ];
- if( a >= MAT )
- ERR( std::string("bad letter in score matrix: ") + rows[i] );
- if( b >= MAT )
- ERR( std::string("bad letter in score matrix: ") + cols[j] );
- caseSensitive[x][b] = std::min( cells[i][j], 0 );
- caseSensitive[a][y] = std::min( cells[i][j], 0 );
- caseSensitive[a][b] = std::min( cells[i][j], 0 );
- caseSensitive[x][y] = cells[i][j]; // careful: maybe a==x or b==y
- caseInsensitive[x][y] = cells[i][j];
- caseInsensitive[x][b] = cells[i][j];
- caseInsensitive[a][y] = cells[i][j];
- caseInsensitive[a][b] = cells[i][j];
+ for( size_t i = 0; i < rowSymbols.size(); ++i ){
+ for( size_t j = 0; j < colSymbols.size(); ++j ){
+ unsigned iu, il, ju, jl;
+ upperAndLowerIndex(MAT, symbolToIndex, rowSymbols[i], iu, il);
+ upperAndLowerIndex(MAT, symbolToIndex, colSymbols[j], ju, jl);
+ caseSensitive[iu][jl] = std::min( cells[i][j], 0 );
+ caseSensitive[il][ju] = std::min( cells[i][j], 0 );
+ caseSensitive[il][jl] = std::min( cells[i][j], 0 );
+ caseSensitive[iu][ju] = cells[i][j]; // careful: maybe il==iu or jl==ju
+ caseInsensitive[iu][ju] = cells[i][j];
+ caseInsensitive[iu][jl] = cells[i][j];
+ caseInsensitive[il][ju] = cells[i][j];
+ caseInsensitive[il][jl] = cells[i][j];
}
}
// set a hugely negative score for the delimiter symbol:
uchar delimiter = ' ';
- uchar z = encode[delimiter];
+ uchar z = symbolToIndex[delimiter];
assert( z < MAT );
for( unsigned i = 0; i < MAT; ++i ){
caseSensitive[z][i] = -INF;
@@ -116,24 +130,26 @@ void ScoreMatrix::init( const uchar encode[] ){
}
void ScoreMatrix::writeCommented( std::ostream& stream ) const{
+ int colWidth = colSymbols.size() < 20 ? 3 : 2;
+
stream << "# " << ' ';
- for( std::size_t i = 0; i < cols.size(); ++i ){
- stream << ' ' << std::setw(OUTPAD) << cols[i];
+ for( size_t i = 0; i < colSymbols.size(); ++i ){
+ stream << ' ' << std::setw(colWidth) << colSymbols[i];
}
stream << '\n';
- for( std::size_t i = 0; i < rows.size(); ++i ){
- stream << "# " << rows[i];
- for( std::size_t j = 0; j < cols.size(); ++j ){
- stream << ' ' << std::setw(OUTPAD) << cells[i][j];
+ for( size_t i = 0; i < rowSymbols.size(); ++i ){
+ stream << "# " << rowSymbols[i];
+ for( size_t j = 0; j < colSymbols.size(); ++j ){
+ stream << ' ' << std::setw(colWidth) << cells[i][j];
}
stream << '\n';
}
}
std::istream& operator>>( std::istream& stream, ScoreMatrix& m ){
- std::string tmpRows;
- std::string tmpCols;
+ std::string tmpRowSymbols;
+ std::string tmpColSymbols;
std::vector< std::vector<int> > tmpCells;
std::string line;
int state = 0;
@@ -145,46 +161,131 @@ std::istream& operator>>( std::istream& stream, ScoreMatrix& m ){
if( state == 0 ){
if( c == '#' ) continue; // skip comment lines at the top
do{
- tmpCols.push_back(c);
+ tmpColSymbols.push_back(c);
}while( iss >> c );
state = 1;
}
else{
- tmpRows.push_back(c);
+ tmpRowSymbols.push_back(c);
tmpCells.resize( tmpCells.size() + 1 );
int score;
while( iss >> score ){
tmpCells.back().push_back(score);
}
- if( tmpCells.back().size() != tmpCols.size() ) ERR( "bad score matrix" );
+ if (tmpCells.back().size() != tmpColSymbols.size()) {
+ ERR("bad score matrix");
+ }
}
}
- if( stream.eof() && !stream.bad() && !tmpRows.empty() ){
+ if( stream.eof() && !stream.bad() && !tmpRowSymbols.empty() ){
stream.clear();
- m.rows.swap(tmpRows);
- m.cols.swap(tmpCols);
+ m.rowSymbols.swap(tmpRowSymbols);
+ m.colSymbols.swap(tmpColSymbols);
m.cells.swap(tmpCells);
}
return stream;
}
-std::ostream& operator<<( std::ostream& stream, const ScoreMatrix& m ){
- for( std::size_t i = 0; i < m.cols.size(); ++i ){
- stream << ' ' << std::setw(m.OUTPAD) << m.cols[i];
+const char *ambiguities[] = {
+ "M" "AC",
+ "S" "CG",
+ "K" "GT",
+ "W" "TA",
+ "R" "AG",
+ "Y" "CT",
+ "B" "CGT",
+ "D" "AGT",
+ "H" "ACT",
+ "V" "ACG"
+};
+
+const size_t numOfAmbiguousSymbols = COUNTOF(ambiguities);
+
+static bool isIn(const std::string& s, char x) {
+ return find(s.begin(), s.end(), x) != s.end();
+}
+
+static const char *ambiguityList(char symbol, char scratch[]) {
+ for (size_t i = 0; i < numOfAmbiguousSymbols; ++i) {
+ if (ambiguities[i][0] == symbol) return ambiguities[i] + 1;
}
- stream << '\n';
+ scratch[0] = symbol;
+ return scratch;
+}
+
+static double symbolProbSum(const uchar symbolToIndex[],
+ const char *symbols, const double probs[]) {
+ if (symbols[1] == 0) return 1;
+ double p = 0;
+ for (size_t i = 0; symbols[i]; ++i) {
+ unsigned x = s2i(symbolToIndex, symbols[i]);
+ p += probs[x];
+ }
+ return p;
+}
- for( std::size_t i = 0; i < m.rows.size(); ++i ){
- stream << m.rows[i];
- for( std::size_t j = 0; j < m.cols.size(); ++j ){
- stream << ' ' << std::setw(m.OUTPAD) << m.cells[i][j];
+static int jointScore(const uchar symbolToIndex[], int **fastMatrix,
+ double scale,
+ const double rowSymbolProbs[],
+ const double colSymbolProbs[],
+ const char *rSymbols, const char *cSymbols) {
+ bool isOneRowSymbol = (rSymbols[1] == 0);
+ bool isOneColSymbol = (cSymbols[1] == 0);
+
+ double p = 0;
+ for (size_t i = 0; rSymbols[i]; ++i) {
+ for (size_t j = 0; cSymbols[j]; ++j) {
+ unsigned x = s2i(symbolToIndex, rSymbols[i]);
+ unsigned y = s2i(symbolToIndex, cSymbols[j]);
+ double r = isOneRowSymbol ? 1 : rowSymbolProbs[x];
+ double c = isOneColSymbol ? 1 : colSymbolProbs[y];
+ p += r * c * probFromScore(scale, fastMatrix[x][y]);
}
- stream << '\n';
}
- return stream;
+ double rowProbSum = symbolProbSum(symbolToIndex, rSymbols, rowSymbolProbs);
+ double colProbSum = symbolProbSum(symbolToIndex, cSymbols, colSymbolProbs);
+
+ return scoreFromProb(scale, p / (rowProbSum * colProbSum));
+}
+
+void ScoreMatrix::addAmbiguousScores(const uchar symbolToIndex[],
+ double scale,
+ const double rowSymbolProbs[],
+ const double colSymbolProbs[]) {
+ int *fastMatrix[MAT];
+ std::copy(caseInsensitive, caseInsensitive + MAT, fastMatrix);
+
+ char scratch[2] = {0};
+
+ for (size_t k = 0; k < numOfAmbiguousSymbols; ++k) {
+ char ambiguousSymbol = ambiguities[k][0];
+ if (isIn(colSymbols, ambiguousSymbol)) continue;
+ colSymbols.push_back(ambiguousSymbol);
+ for (size_t i = 0; i < rowSymbols.size(); ++i) {
+ const char *rSymbols = ambiguityList(rowSymbols[i], scratch);
+ const char *cSymbols = ambiguities[k] + 1;
+ int s = jointScore(symbolToIndex, fastMatrix, scale,
+ rowSymbolProbs, colSymbolProbs, rSymbols, cSymbols);
+ cells[i].push_back(s);
+ }
+ }
+
+ for (size_t k = 0; k < numOfAmbiguousSymbols; ++k) {
+ char ambiguousSymbol = ambiguities[k][0];
+ if (isIn(rowSymbols, ambiguousSymbol)) continue;
+ rowSymbols.push_back(ambiguousSymbol);
+ cells.resize(cells.size() + 1);
+ for (size_t j = 0; j < colSymbols.size(); ++j) {
+ const char *rSymbols = ambiguities[k] + 1;
+ const char *cSymbols = ambiguityList(colSymbols[j], scratch);
+ int s = jointScore(symbolToIndex, fastMatrix, scale,
+ rowSymbolProbs, colSymbolProbs, rSymbols, cSymbols);
+ cells.back().push_back(s);
+ }
+ }
}
-} // end namespace cbrc
+}
=====================================
src/ScoreMatrix.hh
=====================================
--- a/src/ScoreMatrix.hh
+++ b/src/ScoreMatrix.hh
@@ -15,23 +15,33 @@
namespace cbrc{
-struct ScoreMatrix{
- typedef unsigned char uchar;
+typedef unsigned char uchar;
+struct ScoreMatrix{
enum { INF = INT_MAX / 2 }; // big, but try to avoid overflow
enum { MAT = 64 }; // matrix size = MAT x MAT
- enum { OUTPAD = 2 }; // cell-padding for output
static const char *canonicalName( const std::string& name );
static std::string stringFromName( const std::string& name );
- void matchMismatch( int match, int mismatch, const std::string& letters );
+ void setMatchMismatch(int matchScore, // usually > 0
+ int mismatchCost, // usually > 0
+ const std::string& symbols); // case is preserved
+
void fromString( const std::string& s );
- void init( const uchar encode[] ); // unspecified letters get minScore
+
+ void init(const uchar symbolToIndex[]); // unspecified letters get minScore
+
+ // add scores for e.g. "W" meaning A or T
+ void addAmbiguousScores(const uchar symbolToIndex[],
+ double scale, // "lambda" for getting probabilities
+ const double rowSymbolProbs[],
+ const double colSymbolProbs[]);
+
void writeCommented( std::ostream& stream ) const; // write preceded by "#"
- std::string rows; // row headings (letters)
- std::string cols; // column headings (letters)
+ std::string rowSymbols; // row headings (letters)
+ std::string colSymbols; // column headings (letters)
std::vector< std::vector<int> > cells; // scores
int caseSensitive[MAT][MAT];
int caseInsensitive[MAT][MAT];
@@ -42,5 +52,6 @@ struct ScoreMatrix{
std::istream& operator>>( std::istream& stream, ScoreMatrix& mat );
std::ostream& operator<<( std::ostream& stream, const ScoreMatrix& mat );
-} // end namespace cbrc
-#endif // SCOREMATRIX_HH
+}
+
+#endif
=====================================
src/SubsetSuffixArray.hh
=====================================
--- a/src/SubsetSuffixArray.hh
+++ b/src/SubsetSuffixArray.hh
@@ -32,6 +32,8 @@ class SubsetSuffixArray{
public:
typedef LAST_INT_TYPE indexT;
+ struct Range {indexT* beg; indexT* end; indexT depth;};
+
CyclicSubsetSeed& getSeed() { return seed; }
const CyclicSubsetSeed& getSeed() const { return seed; }
@@ -44,8 +46,8 @@ public:
size_t step, size_t minimizerWindow );
// Sort the suffix array (but don't make the buckets).
- void sortIndex( const uchar* text,
- size_t maxUnsortedInterval, int childTableType );
+ void sortIndex( const uchar* text, size_t maxUnsortedInterval,
+ int childTableType, size_t numOfThreads );
// Make the buckets. If bucketDepth+1 == 0, then a default
// bucketDepth is used. The default is: the maximum possible
@@ -121,17 +123,26 @@ private:
void sort2( const uchar* text, indexT* beg, const uchar* subsetMap );
- void radixSort1( const uchar* text, const uchar* subsetMap,
+ void radixSort1( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
- void radixSort2( const uchar* text, const uchar* subsetMap,
+ void radixSort2( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
- void radixSort3( const uchar* text, const uchar* subsetMap,
+ void radixSort3( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
- void radixSort4( const uchar* text, const uchar* subsetMap,
+ void radixSort4( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
- void radixSortN( const uchar* text, const uchar* subsetMap,
+ void radixSortN( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth,
- unsigned subsetCount );
+ unsigned subsetCount, indexT* bucketSize );
+
+ void sortRanges( std::vector<Range>* stacks, indexT* bucketSizes,
+ const uchar* text, size_t maxUnsortedInterval,
+ size_t numOfThreads );
// Same as the 1st equalRange, but uses more info and may be faster:
void equalRange( indexT& beg, indexT& end, const uchar* textBase,
=====================================
src/SubsetSuffixArraySort.cc
=====================================
--- a/src/SubsetSuffixArraySort.cc
+++ b/src/SubsetSuffixArraySort.cc
@@ -7,17 +7,24 @@
#include <algorithm> // iter_swap, min
//#include <iostream> // for debugging
+#ifdef HAS_CXX_THREADS
+#include <thread>
+#endif
+
using namespace cbrc;
namespace{
typedef SubsetSuffixArray::indexT indexT;
- struct Stack{ indexT* beg; indexT* end; indexT depth; };
- Stack stack[1048576]; // big enough???
- Stack* sp = stack;
+ typedef SubsetSuffixArray::Range Range;
}
-#define PUSH(b, e, d) if( e-b > 1 ) sp->beg = b, sp->end = e, (sp++)->depth = d
-#define POP(b, e, d) b = (--sp)->beg, e = sp->end, d = sp->depth
+static void pushRange(std::vector<Range> &v,
+ indexT *beg, indexT *end, indexT depth) {
+ if (end - beg > 1) {
+ Range r = {beg, end, depth};
+ v.push_back(r);
+ }
+}
static void insertionSort( const uchar* text, const CyclicSubsetSeed& seed,
indexT* beg, indexT* end, const uchar* subsetMap ){
@@ -60,7 +67,8 @@ void SubsetSuffixArray::sort2( const uchar* text, indexT* beg,
// Specialized sort for 1 symbol + 1 delimiter.
// E.g. wildcard positions in spaced seeds.
-void SubsetSuffixArray::radixSort1( const uchar* text, const uchar* subsetMap,
+void SubsetSuffixArray::radixSort1( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* begN = end; // beginning of delimiters
@@ -78,7 +86,7 @@ void SubsetSuffixArray::radixSort1( const uchar* text, const uchar* subsetMap,
}
}
- PUSH( beg, end0, depth ); // the '0's
+ pushRange( rangeStack, beg, end0, depth ); // the '0's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
@@ -91,7 +99,8 @@ void SubsetSuffixArray::radixSort1( const uchar* text, const uchar* subsetMap,
// Specialized sort for 2 symbols + 1 delimiter.
// E.g. transition-constrained positions in subset seeds.
-void SubsetSuffixArray::radixSort2( const uchar* text, const uchar* subsetMap,
+void SubsetSuffixArray::radixSort2( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* end1 = beg; // end of '1's
@@ -114,8 +123,8 @@ void SubsetSuffixArray::radixSort2( const uchar* text, const uchar* subsetMap,
}
}
- PUSH( beg, end0, depth ); // the '0's
- PUSH( end0, end1, depth ); // the '1's
+ pushRange( rangeStack, beg, end0, depth ); // the '0's
+ pushRange( rangeStack, end0, end1, depth ); // the '1's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
@@ -132,7 +141,8 @@ void SubsetSuffixArray::radixSort2( const uchar* text, const uchar* subsetMap,
// Specialized sort for 3 symbols + 1 delimiter.
// E.g. subset seeds for bisulfite-converted DNA.
-void SubsetSuffixArray::radixSort3( const uchar* text, const uchar* subsetMap,
+void SubsetSuffixArray::radixSort3( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* end1 = beg; // end of '1's
@@ -161,9 +171,9 @@ void SubsetSuffixArray::radixSort3( const uchar* text, const uchar* subsetMap,
}
}
- PUSH( beg, end0, depth ); // the '0's
- PUSH( end0, end1, depth ); // the '1's
- PUSH( beg2, begN, depth ); // the '2's
+ pushRange( rangeStack, beg, end0, depth ); // the '0's
+ pushRange( rangeStack, end0, end1, depth ); // the '1's
+ pushRange( rangeStack, beg2, begN, depth ); // the '2's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
@@ -183,7 +193,8 @@ void SubsetSuffixArray::radixSort3( const uchar* text, const uchar* subsetMap,
}
// Specialized sort for 4 symbols + 1 delimiter. E.g. DNA.
-void SubsetSuffixArray::radixSort4( const uchar* text, const uchar* subsetMap,
+void SubsetSuffixArray::radixSort4( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* end1 = beg; // end of '1's
@@ -218,10 +229,10 @@ void SubsetSuffixArray::radixSort4( const uchar* text, const uchar* subsetMap,
}
}
- PUSH( beg, end0, depth ); // the '0's
- PUSH( end0, end1, depth ); // the '1's
- PUSH( end1, end2, depth ); // the '2's
- PUSH( beg3, begN, depth ); // the '3's
+ pushRange( rangeStack, beg, end0, depth ); // the '0's
+ pushRange( rangeStack, end0, end1, depth ); // the '1's
+ pushRange( rangeStack, end1, end2, depth ); // the '2's
+ pushRange( rangeStack, beg3, begN, depth ); // the '3's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
@@ -244,11 +255,13 @@ void SubsetSuffixArray::radixSort4( const uchar* text, const uchar* subsetMap,
}
}
-void SubsetSuffixArray::radixSortN( const uchar* text, const uchar* subsetMap,
+const unsigned numOfBuckets = 256;
+
+void SubsetSuffixArray::radixSortN( std::vector<Range>& rangeStack,
+ const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth,
- unsigned subsetCount ){
- static indexT bucketSize[256]; // initialized to zero at startup
- /* */ indexT* bucketEnd[256]; // "static" makes little difference to speed
+ unsigned subsetCount, indexT* bucketSize ){
+ indexT* bucketEnd[numOfBuckets];
// get bucket sizes (i.e. letter counts):
// The intermediate oracle array makes it faster (see "Engineering
@@ -268,7 +281,7 @@ void SubsetSuffixArray::radixSortN( const uchar* text, const uchar* subsetMap,
indexT* pos = beg;
for( unsigned i = 0; i < subsetCount; ++i ){
indexT* nextPos = pos + bucketSize[i];
- PUSH( pos, nextPos, depth );
+ pushRange( rangeStack, pos, nextPos, depth );
pos = nextPos;
bucketEnd[i] = pos;
}
@@ -306,31 +319,91 @@ void SubsetSuffixArray::radixSortN( const uchar* text, const uchar* subsetMap,
}
}
-void SubsetSuffixArray::sortIndex( const uchar* text,
- size_t maxUnsortedInterval,
- int childTableType ){
- const indexT minLength = 1;
+static size_t rangeSize(const Range &r) {
+ return r.end - r.beg;
+}
- if( childTableType == 1 ) chibiTable.v.assign( suffixArray.v.size(), -1 );
- if( childTableType == 2 ) kiddyTable.v.assign( suffixArray.v.size(), -1 );
- if( childTableType == 3 ) childTable.v.assign( suffixArray.v.size(), 0 );
+static size_t nextRangeSize(const std::vector<Range> &ranges) {
+ return rangeSize(ranges.back());
+}
- PUSH( &suffixArray.v.front(), &suffixArray.v.back() + 1, 0 );
- setChildReverse( &suffixArray.v.back() + 1, &suffixArray.v.front() );
+static size_t rangeSizeSum(const std::vector<Range> &ranges) {
+ size_t s = 0;
+ for (size_t i = 0; i < ranges.size(); ++i) {
+ s += rangeSize(ranges[i]);
+ }
+ return s;
+}
+
+static size_t numOfThreadsForOneRange(size_t numOfThreads,
+ size_t sizeOfThisRange,
+ size_t sizeOfAllRanges) {
+ // We want:
+ // min(t | sizeOfThisRange / t < sizeOfOtherRanges / (numOfThreads - (t+1)))
+ // Or equivalently:
+ // max(t | sizeOfThisRange / (t-1) >= sizeOfOtherRanges / (numOfThreads - t))
+ double x = numOfThreads - 1; // use double to avoid overflow
+ return (x * sizeOfThisRange + sizeOfAllRanges) / sizeOfAllRanges;
+}
- while( sp > stack ){
- indexT* beg;
- indexT* end;
- indexT depth;
- POP( beg, end, depth );
+void SubsetSuffixArray::sortRanges(std::vector<Range> *stacks,
+ indexT *bucketSizes,
+ const uchar *text,
+ size_t maxUnsortedInterval,
+ size_t numOfThreads) {
+ std::vector<Range> &myStack = stacks[0];
+
+ while( !myStack.empty() ){
+#ifdef HAS_CXX_THREADS
+ size_t numOfChunks = std::min(numOfThreads, myStack.size());
+ if (numOfChunks > 1) {
+ size_t totalSize = rangeSizeSum(myStack);
+ size_t numOfNewThreads = numOfChunks - 1;
+ std::vector<std::thread> threads(numOfNewThreads);
+
+ for (size_t i = 0; i < numOfNewThreads; ++i) {
+ size_t thisSize = nextRangeSize(myStack);
+ size_t t = numOfThreadsForOneRange(numOfThreads, thisSize, totalSize);
+ size_t maxThreads = numOfThreads - (numOfNewThreads - i);
+ size_t thisThreads = std::min(t, maxThreads);
+ numOfThreads -= thisThreads;
+
+ do {
+ totalSize -= nextRangeSize(myStack);
+ stacks[numOfThreads].push_back(myStack.back());
+ myStack.pop_back();
+ thisSize += nextRangeSize(myStack);
+ } while (myStack.size() > numOfThreads &&
+ thisSize <= totalSize / numOfThreads);
+ // We want:
+ // max(r | sizeSum(r) <= (totalSize - sizeSum(r-1)) / newNumOfThreads)
+
+ threads[i] = std::thread(&SubsetSuffixArray::sortRanges, this,
+ stacks + numOfThreads,
+ bucketSizes + numOfThreads * numOfBuckets,
+ text, maxUnsortedInterval, thisThreads);
+ }
+ sortRanges(stacks, bucketSizes, text, maxUnsortedInterval, numOfThreads);
+ for (size_t i = 0; i < numOfNewThreads; ++i) {
+ threads[i].join();
+ }
+ return;
+ }
+#endif
+
+ indexT* beg = myStack.back().beg;
+ indexT* end = myStack.back().end;
+ indexT depth = myStack.back().depth;
+ myStack.pop_back();
size_t interval = end - beg;
+ const indexT minLength = 1;
if( interval <= maxUnsortedInterval && depth >= minLength ) continue;
const uchar* textBase = text + depth;
const uchar* subsetMap = seed.subsetMap(depth);
- if( childTableType == 0 ){
+ if( childTable.v.empty() && kiddyTable.v.empty() && chibiTable.v.empty() ){
if( interval < 10 ){ // ???
insertionSort( textBase, seed, beg, end, subsetMap );
continue;
@@ -347,11 +420,30 @@ void SubsetSuffixArray::sortIndex( const uchar* text,
++depth;
switch( subsetCount ){
- case 1: radixSort1( textBase, subsetMap, beg, end, depth ); break;
- case 2: radixSort2( textBase, subsetMap, beg, end, depth ); break;
- case 3: radixSort3( textBase, subsetMap, beg, end, depth ); break;
- case 4: radixSort4( textBase, subsetMap, beg, end, depth ); break;
- default: radixSortN( textBase, subsetMap, beg, end, depth, subsetCount );
+ case 1: radixSort1(myStack, textBase, subsetMap, beg, end, depth); break;
+ case 2: radixSort2(myStack, textBase, subsetMap, beg, end, depth); break;
+ case 3: radixSort3(myStack, textBase, subsetMap, beg, end, depth); break;
+ case 4: radixSort4(myStack, textBase, subsetMap, beg, end, depth); break;
+ default: radixSortN(myStack, textBase, subsetMap, beg, end, depth,
+ subsetCount, bucketSizes);
}
}
}
+
+void SubsetSuffixArray::sortIndex( const uchar* text,
+ size_t maxUnsortedInterval,
+ int childTableType,
+ size_t numOfThreads ){
+ if( childTableType == 1 ) chibiTable.v.assign( suffixArray.v.size(), -1 );
+ if( childTableType == 2 ) kiddyTable.v.assign( suffixArray.v.size(), -1 );
+ if( childTableType == 3 ) childTable.v.assign( suffixArray.v.size(), 0 );
+
+ std::vector< std::vector<Range> > stacks(numOfThreads);
+ std::vector<indexT> bucketSizes(numOfThreads * numOfBuckets);
+
+ pushRange( stacks[0], &suffixArray.v.front(), &suffixArray.v.back() + 1, 0 );
+ setChildReverse( &suffixArray.v.back() + 1, &suffixArray.v.front() );
+
+ sortRanges(&stacks[0], &bucketSizes[0], text, maxUnsortedInterval,
+ numOfThreads);
+}
=====================================
src/TantanMasker.cc
=====================================
--- a/src/TantanMasker.cc
+++ b/src/TantanMasker.cc
@@ -37,7 +37,7 @@ T -5 -5 -5 2\n\
s.fromString(ScoreMatrix::stringFromName("BL62"));
} else {
alphabetSize = alphabet.size();
- s.matchMismatch(1, 1, alphabet);
+ s.setMatchMismatch(1, 1, alphabet);
}
s.init(letterToIndex);
=====================================
src/alp/sls_alignment_evaluer.cpp
=====================================
--- a/src/alp/sls_alignment_evaluer.cpp
+++ b/src/alp/sls_alignment_evaluer.cpp
@@ -463,8 +463,12 @@ double temperature_)
double GaplessPreliminaryTime=CurrentTimeGaplessPreliminary-CurrentTime1;
//the choice for the importance sampling
- long int gapOpen=alp_data::Tmin(gapOpen1_,gapOpen2_);
- long int gapEpen=alp_data::Tmin(gapEpen1_,gapEpen2_);
+ //long int gapOpen=alp_data::Tmin(gapOpen1_,gapOpen2_);
+ //long int gapEpen=alp_data::Tmin(gapEpen1_,gapEpen2_);
+
+ long int gapEpen = alp_data::Tmin(gapEpen1_, gapEpen2_);
+ long int gapOpen = alp_data::Tmin(gapOpen1_ + gapEpen1_, gapOpen2_ + gapEpen2_) - gapEpen;
+
if(max_time_<=0)
{
=====================================
src/lastal.cc
=====================================
--- a/src/lastal.cc
+++ b/src/lastal.cc
@@ -96,6 +96,43 @@ void complementMatrix(const ScoreMatrixRow *from, ScoreMatrixRow *to) {
to[i][j] = from[alph.complement[i]][alph.complement[j]];
}
+// Meaningless for PSSMs, unless they have the same scale as the score matrix
+static void calculateSubstitutionScoreMatrixStatistics() {
+ LOG( "calculating matrix probabilities..." );
+ // the case-sensitivity of the matrix makes no difference here
+ lambdaCalculator.calculate( scoreMatrix.caseSensitive, alph.size );
+
+ if( lambdaCalculator.isBad() ){
+ if( isQuality( args.inputFormat ) ||
+ (args.temperature < 0 && args.outputType > 3) )
+ ERR( "can't calculate probabilities: "
+ "maybe the mismatch costs are too weak" );
+ else
+ LOG( "can't calculate probabilities: "
+ "maybe the mismatch costs are too weak" );
+ return;
+ }
+
+ const double *p1 = lambdaCalculator.letterProbs1();
+ const double *p2 = lambdaCalculator.letterProbs2();
+
+ LOG( "matrix lambda=" << lambdaCalculator.lambda() );
+ LOG( "matrix letter frequencies (upper=reference, lower=query):" );
+ if( args.verbosity > 0 ){
+ std::cerr << std::left;
+ std::streamsize p = std::cerr.precision(2);
+ unsigned e = alph.size;
+ for( unsigned i = 0; i < e; ++i )
+ std::cerr << std::setw(3) << alph.letters[i] << (i + 1 < e ? " " : "\n");
+ for( unsigned i = 0; i < e; ++i )
+ std::cerr << std::setw(3) << 100 * p1[i] << (i + 1 < e ? " " : "\n");
+ for( unsigned i = 0; i < e; ++i )
+ std::cerr << std::setw(3) << 100 * p2[i] << (i + 1 < e ? " " : "\n");
+ std::cerr.precision(p);
+ std::cerr << std::right;
+ }
+}
+
// Set up a scoring matrix, based on the user options
void makeScoreMatrix( const std::string& matrixName,
const std::string& matrixFile ){
@@ -103,11 +140,20 @@ void makeScoreMatrix( const std::string& matrixName,
scoreMatrix.fromString( matrixFile );
}
else{
- scoreMatrix.matchMismatch( args.matchScore, args.mismatchCost,
- alph.letters );
+ scoreMatrix.setMatchMismatch( args.matchScore, args.mismatchCost,
+ alph.letters );
}
- scoreMatrix.init( alph.encode );
+ scoreMatrix.init(alph.encode);
+ if (args.outputType > 0) {
+ calculateSubstitutionScoreMatrixStatistics();
+ if (alph.letters == alph.dna) {
+ scoreMatrix.addAmbiguousScores(alph.encode, lambdaCalculator.lambda(),
+ lambdaCalculator.letterProbs1(),
+ lambdaCalculator.letterProbs2());
+ scoreMatrix.init(alph.encode);
+ }
+ }
// If the input is a PSSM, the score matrix is not used, and its
// maximum score should not be used. Here, we try to set it to a
@@ -231,43 +277,8 @@ void makeQualityScorers(){
}
// Calculate statistical parameters for the alignment scoring scheme
-// Meaningless for PSSMs, unless they have the same scale as the score matrix
-void calculateScoreStatistics( const std::string& matrixName,
- countT refLetters ){
- LOG( "calculating matrix probabilities..." );
- // the case-sensitivity of the matrix makes no difference here
- lambdaCalculator.calculate( scoreMatrix.caseSensitive, alph.size );
-
- if( lambdaCalculator.isBad() ){
- if( isQuality( args.inputFormat ) ||
- (args.temperature < 0 && args.outputType > 3) )
- ERR( "can't calculate probabilities: "
- "maybe the mismatch costs are too weak" );
- else
- LOG( "can't calculate probabilities: "
- "maybe the mismatch costs are too weak" );
- return;
- }
-
- const double *p1 = lambdaCalculator.letterProbs1();
- const double *p2 = lambdaCalculator.letterProbs2();
-
- LOG( "matrix lambda=" << lambdaCalculator.lambda() );
- LOG( "matrix letter frequencies (upper=reference, lower=query):" );
- if( args.verbosity > 0 ){
- std::cerr << std::left;
- std::streamsize p = std::cerr.precision(2);
- unsigned e = alph.size;
- for( unsigned i = 0; i < e; ++i )
- std::cerr << std::setw(3) << alph.letters[i] << (i + 1 < e ? " " : "\n");
- for( unsigned i = 0; i < e; ++i )
- std::cerr << std::setw(3) << 100 * p1[i] << (i + 1 < e ? " " : "\n");
- for( unsigned i = 0; i < e; ++i )
- std::cerr << std::setw(3) << 100 * p2[i] << (i + 1 < e ? " " : "\n");
- std::cerr.precision(p);
- std::cerr << std::right;
- }
-
+static void calculateScoreStatistics(const std::string& matrixName,
+ countT refLetters) {
const char *canonicalMatrixName = ScoreMatrix::canonicalName( matrixName );
bool isGapped = (args.outputType > 1);
bool isStandardGeneticCode = args.geneticCodeFile.empty();
@@ -275,7 +286,8 @@ void calculateScoreStatistics( const std::string& matrixName,
try{
evaluer.init( canonicalMatrixName, args.matchScore, args.mismatchCost,
alph.letters.c_str(), scoreMatrix.caseSensitive,
- p1, p2, isGapped,
+ lambdaCalculator.letterProbs1(),
+ lambdaCalculator.letterProbs2(), isGapped,
gapCosts.delExist, gapCosts.delExtend,
gapCosts.insExist, gapCosts.insExtend,
args.frameshiftCost, geneticCode, isStandardGeneticCode,
=====================================
src/lastdb.cc
=====================================
--- a/src/lastdb.cc
+++ b/src/lastdb.cc
@@ -215,7 +215,8 @@ void makeVolume( std::vector< CyclicSubsetSeed >& seeds,
}
LOG( "sorting..." );
- myIndex.sortIndex( seq, args.minSeedLimit, args.childTableType );
+ myIndex.sortIndex( seq, args.minSeedLimit, args.childTableType,
+ numOfThreads );
LOG( "bucketing..." );
myIndex.makeBuckets( seq, args.bucketDepth );
=====================================
src/makefile
=====================================
--- a/src/makefile
+++ b/src/makefile
@@ -184,8 +184,8 @@ OneQualityScoreMatrix.o OneQualityScoreMatrix.o8: OneQualityScoreMatrix.cc \
stringify.hh
QualityPssmMaker.o QualityPssmMaker.o8: QualityPssmMaker.cc QualityPssmMaker.hh \
ScoreMatrixRow.hh qualityScoreUtil.hh stringify.hh
-ScoreMatrix.o ScoreMatrix.o8: ScoreMatrix.cc ScoreMatrix.hh ScoreMatrixData.hh zio.hh \
- mcf_zstream.hh
+ScoreMatrix.o ScoreMatrix.o8: ScoreMatrix.cc ScoreMatrix.hh ScoreMatrixData.hh \
+ qualityScoreUtil.hh stringify.hh zio.hh mcf_zstream.hh
SegmentPair.o SegmentPair.o8: SegmentPair.cc SegmentPair.hh
SegmentPairPot.o SegmentPairPot.o8: SegmentPairPot.cc SegmentPairPot.hh SegmentPair.hh
SubsetMinimizerFinder.o SubsetMinimizerFinder.o8: SubsetMinimizerFinder.cc \
=====================================
src/qualityScoreUtil.hh
=====================================
--- a/src/qualityScoreUtil.hh
+++ b/src/qualityScoreUtil.hh
@@ -18,6 +18,14 @@ inline int nearestInt(double x) {
else return -static_cast<int>(0.5 - x);
}
+inline double probFromScore(double lambda, int score) {
+ return std::exp(lambda * score);
+}
+
+inline int scoreFromProb(double lambda, double prob) {
+ return nearestInt(std::log(prob) / lambda);
+}
+
inline double phredErrorProb(int qualityScore) {
if (qualityScore < 0) qualityScore = 0;
return std::pow(10.0, -0.1 * qualityScore);
@@ -51,7 +59,7 @@ inline double qualityCertainty(double errorProb, double letterProb) {
inline int qualityPairScore(double expScore, double certainty1,
double certainty2, double lambda) {
double x = certainty1 * certainty2 * (expScore - 1) + 1;
- return nearestInt(std::log(x) / lambda);
+ return scoreFromProb(lambda, x);
}
inline void checkQualityCodes(const uchar *qualityBeg,
=====================================
src/version.hh
=====================================
--- a/src/version.hh
+++ b/src/version.hh
@@ -1 +1 @@
-"921"
+"932"
View it on GitLab: https://salsa.debian.org/med-team/last-align/compare/b77d677b0334a9c1d55946ec391106c21f9e246b...041dbcfce9f072a318f56c6f30d1b682d512abb5
---
View it on GitLab: https://salsa.debian.org/med-team/last-align/compare/b77d677b0334a9c1d55946ec391106c21f9e246b...041dbcfce9f072a318f56c6f30d1b682d512abb5
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180426/b949f072/attachment-0001.html>
More information about the debian-med-commit
mailing list