[med-svn] [Git][med-team/tantan][upstream] New upstream version 22
Sascha Steinbiss
gitlab at salsa.debian.org
Sat Dec 22 20:48:07 GMT 2018
Sascha Steinbiss pushed to branch upstream at Debian Med / tantan
Commits:
1063526e by Sascha Steinbiss at 2018-12-22T20:36:53Z
New upstream version 22
- - - - -
13 changed files:
- ChangeLog.txt
- README.html
- README.txt
- src/mcf_tantan_options.cc
- src/mcf_tantan_options.hh
- src/tantan.cc
- src/tantan_app.cc
- + src/tantan_repeat_finder.cc
- + src/tantan_repeat_finder.hh
- src/version.hh
- + test/hard.fa
- test/tantan_test.out
- test/tantan_test.sh
Changes:
=====================================
ChangeLog.txt
=====================================
@@ -1,9 +1,31 @@
+2018-12-19 Martin C. Frith <Martin C. Frith>
+
+ * src/tantan.cc:
+ Make it faster
+ [3523060bcfb9] [tip]
+
+ * src/tantan_repeat_finder.cc, src/tantan_repeat_finder.hh:
+ Make -f4 a bit faster
+ [1fccdf6e21be]
+
+ * README.txt, src/mcf_tantan_options.cc, src/mcf_tantan_options.hh,
+ src/tantan_app.cc, src/tantan_repeat_finder.cc,
+ src/tantan_repeat_finder.hh, test/hard.fa, test/tantan_test.out,
+ test/tantan_test.sh:
+ Add option to find straightforward tandem repeats
+ [1bae60144712]
+
+ * README.txt, src/mcf_tantan_options.cc, src/tantan.cc,
+ src/tantan_app.cc:
+ Tweak the help message
+ [fc1ca32a72aa]
+
2018-12-10 Martin C. Frith <Martin C. Frith>
* README.txt, src/mcf_tantan_options.cc, src/mcf_tantan_options.hh,
src/tantan_app.cc, test/tantan_test.out, test/tantan_test.sh:
Add match score, mismatch cost options
- [2383404c795a] [tip]
+ [2383404c795a]
* src/tantan.cc:
Refactor
=====================================
README.html
=====================================
@@ -451,7 +451,7 @@ repeats, so it's easy to lift the masking after determining homology.</p>
<td>preserve uppercase/lowercase in non-masked regions</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m</span></kbd></td>
-<td>file for letter pair scores (scoring matrix)</td></tr>
+<td>file for letter-pair score matrix</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-r</span></kbd></td>
<td>probability of a repeat starting per position</td></tr>
@@ -480,9 +480,12 @@ repeats, so it's easy to lift the masking after determining homology.</p>
<kbd><span class="option">-s</span></kbd></td>
<td>minimum repeat probability for masking</td></tr>
<tr><td class="option-group">
+<kbd><span class="option">-n</span></kbd></td>
+<td>minimum copy number, affects -f4 only</td></tr>
+<tr><td class="option-group">
<kbd><span class="option">-f</span></kbd></td>
<td>output type: 0=masked sequence, 1=repeat probabilities,
-2=repeat counts, 3=BED</td></tr>
+2=repeat counts, 3=BED, 4=tandem repeats</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-h</span>, <span class="option">--help</span></kbd></td>
<td>show help message, then exit</td></tr>
@@ -512,6 +515,29 @@ TGCTAGCAA ttaggcttaggtcagtgc TTAGGCTTAGGTCAGTGC AACGTA
</pre>
<p>(My thanks to Junko Tsuji and Paul Horton for finding these issues.)</p>
</div>
+<div class="section" id="finding-straightforward-tandem-repeats">
+<h1>Finding straightforward tandem repeats</h1>
+<p>Option -f4 runs tantan in a different mode, where it finds
+straightforward tandem repeats only. (Technically, it uses a Viterbi
+algorithm instead of a Forward-Backward algorithm.) This is <em>not</em>
+recommended for avoiding false homologs! But it might be useful for
+studying tandem repeats. The output looks like this:</p>
+<pre class="literal-block">
+mySeq 14765 14780 6 2.5 GTCATG GTCATG,GTCATG,GTC
+mySeq 632362 632377 2 6 GC GC,GC,GC,GCt,GCT,GCT
+mySeq 1278353 1278369 3 6.5 TCA TCA,TCA,TCA,TC-,TC,TC,T
+mySeq 3616084 3616100 3 5.33333 TGG TGA,TGA,TGG,TGG,TGG,T
+</pre>
+<p>The first 3 columns show the start and end coordinates of the
+repetitive region, in <a class="reference external" href="https://genome.ucsc.edu/FAQ/FAQformat.html#format1">BED</a> format. Column
+4 shows the length of the repeating unit (which might vary due to
+insertions and deletions, so this column shows the most common
+length). Column 5 shows the number of repeat units. Column 6 shows
+the repeating unit (which again might vary, so this is just a
+representative). Column 7 shows the whole repeat: lowercase letters
+are insertions relative to the previous repeat unit, and dashes are
+deletions relative to the previous repeat unit.</p>
+</div>
<div class="section" id="miscellaneous">
<h1>Miscellaneous</h1>
<p>tantan is distributed under the GNU General Public License, either
=====================================
README.txt
=====================================
@@ -136,7 +136,7 @@ Options
-p interpret the sequences as proteins
-x letter to use for masking, instead of lowercase
-c preserve uppercase/lowercase in non-masked regions
--m file for letter pair scores (scoring matrix)
+-m file for letter-pair score matrix
-r probability of a repeat starting per position
-e probability of a repeat ending per position
-w maximum tandem repeat period to consider
@@ -146,8 +146,9 @@ Options
-a gap existence cost
-b gap extension cost
-s minimum repeat probability for masking
+-n minimum copy number, affects -f4 only
-f output type: 0=masked sequence, 1=repeat probabilities,
- 2=repeat counts, 3=BED
+ 2=repeat counts, 3=BED, 4=tandem repeats
-h, --help show help message, then exit
--version show version information, then exit
@@ -172,6 +173,31 @@ align it on the other strand::
(My thanks to Junko Tsuji and Paul Horton for finding these issues.)
+Finding straightforward tandem repeats
+--------------------------------------
+
+Option -f4 runs tantan in a different mode, where it finds
+straightforward tandem repeats only. (Technically, it uses a Viterbi
+algorithm instead of a Forward-Backward algorithm.) This is *not*
+recommended for avoiding false homologs! But it might be useful for
+studying tandem repeats. The output looks like this::
+
+ mySeq 14765 14780 6 2.5 GTCATG GTCATG,GTCATG,GTC
+ mySeq 632362 632377 2 6 GC GC,GC,GC,GCt,GCT,GCT
+ mySeq 1278353 1278369 3 6.5 TCA TCA,TCA,TCA,TC-,TC,TC,T
+ mySeq 3616084 3616100 3 5.33333 TGG TGA,TGA,TGG,TGG,TGG,T
+
+The first 3 columns show the start and end coordinates of the
+repetitive region, in `BED
+<https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_ format. Column
+4 shows the length of the repeating unit (which might vary due to
+insertions and deletions, so this column shows the most common
+length). Column 5 shows the number of repeat units. Column 6 shows
+the repeating unit (which again might vary, so this is just a
+representative). Column 7 shows the whole repeat: lowercase letters
+are insertions relative to the previous repeat unit, and dashes are
+deletions relative to the previous repeat unit.
+
Miscellaneous
-------------
=====================================
src/mcf_tantan_options.cc
=====================================
@@ -36,7 +36,7 @@ static int myGetopt(int argc, char **argv, const char *optstring,
std::istream &operator>>(std::istream &s, TantanOptions::OutputType &x) {
int i = 0;
s >> i;
- if (i < 0 || i > 3)
+ if (i < 0 || i > 4)
s.setstate(std::ios::failbit);
if (s)
x = static_cast<TantanOptions::OutputType>(i);
@@ -57,6 +57,7 @@ TantanOptions::TantanOptions() :
gapExistenceCost(0),
gapExtensionCost(-1), // means: no gaps
minMaskProb(0.5),
+ minCopyNumber(2.0),
outputType(maskOut),
indexOfFirstNonOptionArgument(-1) {}
@@ -69,7 +70,7 @@ Options (default settings):\n\
-p interpret the sequences as proteins\n\
-x letter to use for masking, instead of lowercase\n\
-c preserve uppercase/lowercase in non-masked regions\n\
- -m file for letter pair scores (+1/-1, but -p selects BLOSUM62)\n\
+ -m file for letter-pair score matrix\n\
-r probability of a repeat starting per position ("
+ stringify(repeatProb) + ")\n\
-e probability of a repeat ending per position ("
@@ -77,15 +78,17 @@ Options (default settings):\n\
-w maximum tandem repeat period to consider (100, but -p selects 50)\n\
-d probability decay per period ("
+ stringify(repeatOffsetProbDecay) + ")\n\
- -i match score (1, but -p selects BLOSUM62)\n\
- -j mismatch cost (1, but -p selects BLOSUM62)\n\
+ -i match score (BLOSUM62 if -p, else 2 if -f4, else 1)\n\
+ -j mismatch cost (BLOSUM62 if -p, else 7 if -f4, else 1)\n\
-a gap existence cost ("
+ stringify(gapExistenceCost) + ")\n\
- -b gap extension cost (infinite: no gaps)\n\
+ -b gap extension cost (7 if -f4, else infinity)\n\
-s minimum repeat probability for masking ("
+ stringify(minMaskProb) + ")\n\
+ -n minimum copy number, affects -f4 only ("
+ + stringify(minCopyNumber) + ")\n\
-f output type: 0=masked sequence, 1=repeat probabilities,\n\
- 2=repeat counts, 3=BED ("
+ 2=repeat counts, 3=BED, 4=tandem repeats ("
+ stringify(outputType) + ")\n\
-h, --help show help message, then exit\n\
--version show version information, then exit\n\
@@ -93,12 +96,13 @@ Options (default settings):\n\
Report bugs to: tantan at cbrc.jp\n\
Home page: http://www.cbrc.jp/tantan/\n\
";
+ // -k for transition cost?
std::string version = "tantan "
#include "version.hh"
"\n";
- const char *optstring = "px:cm:r:e:w:d:i:j:a:b:s:f:h";
+ const char *optstring = "px:cm:r:e:w:d:i:j:a:b:s:n:f:h";
int i;
while ((i = myGetopt(argc, argv, optstring, help, version)) != -1) {
@@ -158,6 +162,9 @@ Home page: http://www.cbrc.jp/tantan/\n\
unstringify(minMaskProb, optarg);
// don't bother checking for stupid values?
break;
+ case 'n':
+ unstringify(minCopyNumber, optarg);
+ break;
case 'f':
unstringify(outputType, optarg);
break;
@@ -168,11 +175,18 @@ Home page: http://www.cbrc.jp/tantan/\n\
}
}
+ if (gapExtensionCost < 0 && outputType == repOut) gapExtensionCost = 7;
+
if (gapExtensionCost > 0 && gapExistenceCost + gapExtensionCost <= 0)
throw Error("gap existence + extension cost is too low");
if (maxCycleLength < 0) maxCycleLength = (isProtein ? 50 : 100);
+ if (!isProtein || matchScore || mismatchCost) {
+ if (matchScore == 0) matchScore = (outputType == repOut ? 2 : 1);
+ if (mismatchCost == 0) mismatchCost = (outputType == repOut ? 7 : 1);
+ }
+
indexOfFirstNonOptionArgument = optind;
}
=====================================
src/mcf_tantan_options.hh
=====================================
@@ -23,7 +23,8 @@ struct TantanOptions {
int gapExistenceCost;
int gapExtensionCost;
double minMaskProb;
- enum OutputType { maskOut, probOut, countOut, bedOut } outputType;
+ double minCopyNumber;
+ enum OutputType { maskOut, probOut, countOut, bedOut, repOut } outputType;
int indexOfFirstNonOptionArgument;
};
=====================================
src/tantan.cc
=====================================
@@ -20,10 +20,10 @@ void multiplyAll(std::vector<double> &v, double factor) {
}
double firstRepeatOffsetProb(double probMult, int maxRepeatOffset) {
- if (probMult < 1 || probMult > 1)
+ if (probMult < 1 || probMult > 1) {
return (1 - probMult) / (1 - std::pow(probMult, maxRepeatOffset));
- else
- return 1.0 / maxRepeatOffset;
+ }
+ return 1.0 / maxRepeatOffset;
}
void checkForwardAndBackwardTotals(double fTot, double bTot) {
@@ -64,6 +64,7 @@ struct Tantan {
double b2fLast; // background state to last foreground state
double backgroundProb;
+ std::vector<double> b2fProbs; // background state to each foreground state
std::vector<double> foregroundProbs;
std::vector<double> insertionProbs;
@@ -110,9 +111,16 @@ struct Tantan {
b2fFirst = repeatProb * firstRepeatOffsetProb(b2fDecay, maxRepeatOffset);
b2fLast = repeatProb * firstRepeatOffsetProb(b2fGrowth, maxRepeatOffset);
+ b2fProbs.resize(maxRepeatOffset);
foregroundProbs.resize(maxRepeatOffset);
insertionProbs.resize(maxRepeatOffset - 1);
+ double p = b2fFirst;
+ for (int i = 0; i < maxRepeatOffset; ++i) {
+ b2fProbs[i] = p;
+ p *= b2fDecay;
+ }
+
scaleFactors.resize((seqEnd - seqBeg) / scaleStepSize);
}
@@ -211,20 +219,17 @@ struct Tantan {
void calcForwardTransitionProbs() {
if (endGapProb > 0) return calcForwardTransitionProbsWithGaps();
- double fromBackground = backgroundProb * b2fLast;
+ double b = backgroundProb;
double fromForeground = 0;
- double *foregroundPtr = END(foregroundProbs);
double *foregroundBeg = BEG(foregroundProbs);
- while (foregroundPtr > foregroundBeg) {
- --foregroundPtr;
- double f = *foregroundPtr;
+ for (int i = 0; i < maxRepeatOffset; ++i) {
+ double f = foregroundBeg[i];
fromForeground += f;
- *foregroundPtr = fromBackground + f * f2f0;
- fromBackground *= b2fGrowth;
+ foregroundBeg[i] = b * b2fProbs[i] + f * f2f0;
}
- backgroundProb = backgroundProb * b2b + fromForeground * f2b;
+ backgroundProb = b * b2b + fromForeground * f2b;
}
void calcBackwardTransitionProbs() {
@@ -232,18 +237,15 @@ struct Tantan {
double toBackground = f2b * backgroundProb;
double toForeground = 0;
- double *foregroundPtr = BEG(foregroundProbs);
- double *foregroundEnd = END(foregroundProbs);
+ double *foregroundBeg = BEG(foregroundProbs);
- while (foregroundPtr < foregroundEnd) {
- toForeground *= b2fGrowth;
- double f = *foregroundPtr;
- toForeground += f;
- *foregroundPtr = toBackground + f2f0 * f;
- ++foregroundPtr;
+ for (int i = 0; i < maxRepeatOffset; ++i) {
+ double f = foregroundBeg[i];
+ toForeground += b2fProbs[i] * f;
+ foregroundBeg[i] = toBackground + f2f0 * f;
}
- backgroundProb = b2b * backgroundProb + b2fLast * toForeground;
+ backgroundProb = b2b * backgroundProb + toForeground;
}
void addEndCounts(double forwardProb,
=====================================
src/tantan_app.cc
=====================================
@@ -9,17 +9,18 @@
#include "mcf_tantan_options.hh"
#include "mcf_util.hh"
#include "tantan.hh"
+#include "tantan_repeat_finder.hh"
#include "LambdaCalculator.hh"
#include <algorithm> // copy, fill_n
#include <cassert>
#include <cmath>
-#include <cstring> // strchr
#include <cstdlib> // EXIT_SUCCESS, EXIT_FAILURE
#include <exception> // exception
#include <fstream>
#include <iostream>
#include <new> // bad_alloc
+#include <string.h>
#define BEG(v) ((v).empty() ? 0 : &(v).front())
#define END(v) ((v).empty() ? 0 : &(v).back() + 1)
@@ -33,7 +34,7 @@ bool isDubiousDna(const uchar *beg, const uchar *end) {
int badLetters = 0;
if (end - beg > 100) end = beg + 100; // just check the first 100 letters
while (beg < end) {
- if (!std::strchr("AaCcGgTtNnUu", *beg)) {
+ if (!strchr("AaCcGgTtNnUu", *beg)) {
++badLetters;
if (badLetters > 10) return true;
}
@@ -45,6 +46,7 @@ bool isDubiousDna(const uchar *beg, const uchar *end) {
namespace {
TantanOptions options;
Alphabet alphabet;
+tantan::RepeatFinder repeatFinder;
enum { scoreMatrixSize = 64 };
int fastMatrix[scoreMatrixSize][scoreMatrixSize];
@@ -76,16 +78,14 @@ void initScoresAndProbabilities() {
if (options.scoreMatrixFileName) {
unfilify(scoreMatrix, options.scoreMatrixFileName);
} else if (options.isProtein) {
- if (options.matchScore || options.mismatchCost) {
- scoreMatrix.initMatchMismatch(std::max(options.matchScore, 1),
- std::max(options.mismatchCost, 1),
+ if (options.matchScore && options.mismatchCost) {
+ scoreMatrix.initMatchMismatch(options.matchScore, options.mismatchCost,
Alphabet::protein);
} else {
unstringify(scoreMatrix, ScoreMatrix::blosum62);
}
} else {
- scoreMatrix.initMatchMismatch(std::max(options.matchScore, 1),
- std::max(options.mismatchCost, 1),
+ scoreMatrix.initMatchMismatch(options.matchScore, options.mismatchCost,
"ACGTU"); // allow for RNA
}
@@ -101,7 +101,9 @@ void initScoresAndProbabilities() {
for (int i = 0; i < scoreMatrixSize; ++i) {
for (int j = 0; j < scoreMatrixSize; ++j) {
- probMatrix[i][j] = std::exp(matrixLambda * fastMatrix[i][j]);
+ double x = matrixLambda * fastMatrix[i][j];
+ if (options.outputType != options.repOut) x = std::exp(x);
+ probMatrix[i][j] = x;
}
}
@@ -116,6 +118,10 @@ void initScoresAndProbabilities() {
// XXX check if firstGapProb is too high
}
+ repeatFinder.init(options.maxCycleLength, probMatrixPointers,
+ options.repeatProb, options.repeatEndProb,
+ options.repeatOffsetProbDecay, firstGapProb, otherGapProb);
+
//std::cerr << "lambda: " << matrixLambda << "\n";
//std::cerr << "firstGapProb: " << firstGapProb << "\n";
//std::cerr << "otherGapProb: " << otherGapProb << "\n";
@@ -157,6 +163,138 @@ void writeBed(const float *probBeg, const float *probEnd,
if (maskBeg) writeBedLine(seqName, probBeg, maskBeg, probEnd, output);
}
+void storeSequence(const uchar *beg, const uchar *end, std::string &out) {
+ out.clear();
+ for (const uchar *i = beg; i < end; ++i) {
+ out.push_back(std::toupper(alphabet.numbersToLetters[*i]));
+ }
+}
+
+struct RepeatUnit {
+ const uchar *beg;
+ int len;
+};
+
+bool less(const RepeatUnit &x, const RepeatUnit &y) {
+ if (x.len != y.len) return x.len < y.len;
+ int c = memcmp(x.beg, y.beg, x.len);
+ if (c != 0) return c < 0;
+ return x.beg > y.beg;
+}
+
+int mainLen(const std::vector<RepeatUnit> &repUnits) {
+ int bestLen;
+ size_t bestCount = 0;
+ size_t count = 0;
+ for (size_t i = 0; i < repUnits.size(); ++i) {
+ if (i && repUnits[i].len > repUnits[i - 1].len) {
+ count = 0;
+ }
+ ++count;
+ if (count > bestCount) {
+ bestCount = count;
+ bestLen = repUnits[i].len;
+ }
+ }
+ return bestLen;
+}
+
+const uchar *mainBeg(const std::vector<RepeatUnit> &repUnits, int len) {
+ const uchar *bestBeg;
+ size_t bestCount = 0;
+ size_t count = 0;
+ for (size_t i = 0; i < repUnits.size(); ++i) {
+ if (repUnits[i].len != len) continue;
+ if (count && memcmp(repUnits[i - 1].beg, repUnits[i].beg, len) != 0) {
+ count = 0;
+ }
+ ++count;
+ if (count < bestCount) continue;
+ if (count > bestCount || repUnits[i].beg < bestBeg) {
+ bestCount = count;
+ bestBeg = repUnits[i].beg;
+ }
+ }
+ return bestBeg;
+}
+
+void writeRepeat(const FastaSequence &f,
+ const uchar *repBeg, const uchar *repEnd,
+ const std::string &repText, std::vector<RepeatUnit> &repUnits,
+ const uchar *commaPos, int finalOffset) {
+ double repeatCount = count(repText.begin(), repText.end(), ',');
+ double copyNumber = repeatCount + (repEnd - commaPos) * 1.0 / finalOffset;
+ if (copyNumber < options.minCopyNumber) return;
+
+ sort(repUnits.begin(), repUnits.end(), less);
+ int bestLen = mainLen(repUnits);
+ const uchar *bestBeg = mainBeg(repUnits, bestLen);
+
+ const uchar *beg = BEG(f.sequence);
+ std::cout << firstWord(f.title) << '\t'
+ << (repBeg - beg) << '\t' << (repEnd - beg) << '\t'
+ << bestLen << '\t' << copyNumber << '\t';
+ for (int i = 0; i < bestLen; ++i) {
+ char c = std::toupper(alphabet.numbersToLetters[bestBeg[i]]);
+ std::cout << c;
+ }
+ std::cout << '\t';
+ std::cout << repText << '\n';
+}
+
+void findRepeatsInOneSequence(const FastaSequence &f) {
+ const uchar *beg = BEG(f.sequence);
+ const uchar *end = END(f.sequence);
+
+ repeatFinder.calcBestPathScore(beg, end);
+
+ std::vector<RepeatUnit> repUnits;
+ std::string repText;
+ const uchar *repBeg;
+ const uchar *commaPos;
+ int state = 0;
+
+ for (const uchar *seqPtr = beg; seqPtr < end; ++seqPtr) {
+ int newState = repeatFinder.nextState();
+
+ if (newState == 0) {
+ if (state > 0) {
+ writeRepeat(f, repBeg, seqPtr, repText, repUnits, commaPos, state);
+ }
+ } else if (newState <= options.maxCycleLength) {
+ if (state == 0) {
+ repUnits.clear();
+ repBeg = seqPtr - newState;
+ storeSequence(repBeg, seqPtr, repText);
+ commaPos = repBeg;
+ } else if (state <= options.maxCycleLength) {
+ for (int i = state; i > newState; --i) {
+ if (seqPtr - commaPos >= i) {
+ repText.push_back(',');
+ commaPos = seqPtr;
+ }
+ repText.push_back('-');
+ }
+ }
+ RepeatUnit unit = {seqPtr - newState, newState};
+ repUnits.push_back(unit);
+ if (seqPtr - commaPos >= newState) {
+ repText.push_back(',');
+ commaPos = seqPtr;
+ }
+ repText.push_back(std::toupper(alphabet.numbersToLetters[*seqPtr]));
+ } else {
+ repText.push_back(std::tolower(alphabet.numbersToLetters[*seqPtr]));
+ }
+
+ state = newState;
+ }
+
+ if (state > 0) {
+ writeRepeat(f, repBeg, end, repText, repUnits, commaPos, state);
+ }
+}
+
void processOneSequence(FastaSequence &f, std::ostream &output) {
uchar *beg = BEG(f.sequence);
uchar *end = END(f.sequence);
@@ -181,6 +319,8 @@ void processOneSequence(FastaSequence &f, std::ostream &output) {
BEG(transitionCounts));
double sequenceLength = static_cast<double>(f.sequence.size());
transitionTotal += sequenceLength + 1;
+ } else if (options.outputType == options.repOut) {
+ findRepeatsInOneSequence(f);
} else {
std::vector<float> probabilities(end - beg);
float *probBeg = BEG(probabilities);
=====================================
src/tantan_repeat_finder.cc
=====================================
@@ -0,0 +1,278 @@
+// Copyright 2018 Martin C. Frith
+
+#include "tantan_repeat_finder.hh"
+
+#include <algorithm>
+#include <assert.h>
+#include <limits.h>
+#include <math.h>
+//#include <iostream> // for debugging
+
+namespace tantan {
+
+static double max3(double x, double y, double z) {
+ return std::max(std::max(x, y), z);
+}
+
+static double myLog(double x) {
+ return x > 0 ? log(x) : -HUGE_VAL;
+}
+
+static double firstRepeatOffsetProb(double probMult, int maxRepeatOffset) {
+ if (probMult < 1 || probMult > 1) {
+ return (1 - probMult) / (1 - pow(probMult, 1.0 * maxRepeatOffset));
+ }
+ return 1.0 / maxRepeatOffset;
+}
+
+static int numOfDpScoresPerLetter(int maxRepeatOffset, double endGapScore) {
+ if (endGapScore > -HUGE_VAL) {
+ assert(maxRepeatOffset <= INT_MAX / 2);
+ return maxRepeatOffset * 2;
+ }
+ assert(maxRepeatOffset < INT_MAX);
+ return maxRepeatOffset + 1;
+}
+
+static unsigned minStoredPositions(const uchar *beg, const uchar *end) {
+ // We will do a dynamic programming algorithm along the sequence.
+ // To save memory, we will keep only some DP values, and recalculate
+ // the others later. We keep the values in an array x of size s.
+ // We will store DP initial values in x[0]. We will put
+ // the DP values after the first s-1 sequence positions in x[1],
+ // the DP values after the next s-2 positions in x[2],
+ // the DP values after the next s-3 positions in x[3], etc.
+ // This function returns the minimum possible value of s.
+ unsigned t = 0;
+ while (t < end - beg) {
+ beg += t;
+ ++t;
+ }
+ return t + 1;
+}
+
+void RepeatFinder::init(int maxRepeatOffset,
+ const const_double_ptr *substitutionMatrix,
+ double repeatProb,
+ double repeatEndProb,
+ double repeatOffsetProbDecay,
+ double firstGapProb,
+ double otherGapProb) {
+ assert(maxRepeatOffset > 0);
+ this->maxRepeatOffset = maxRepeatOffset;
+ this->substitutionMatrix = substitutionMatrix;
+
+ b2b = myLog(1 - repeatProb);
+ f2b = myLog(repeatEndProb);
+ g2g = myLog(otherGapProb);
+ oneGapScore = myLog(firstGapProb * (1 - otherGapProb));
+ endGapScore = myLog(firstGapProb * (maxRepeatOffset > 1));
+ f2f0 = myLog(1 - repeatEndProb);
+ f2f1 = myLog(1 - repeatEndProb - firstGapProb);
+ f2f2 = myLog(1 - repeatEndProb - firstGapProb * 2);
+
+ double x = 1 / repeatOffsetProbDecay;
+ b2fGrowth = myLog(x);
+ b2fLast = myLog(repeatProb * firstRepeatOffsetProb(x, maxRepeatOffset));
+
+ dpScoresPerLetter = numOfDpScoresPerLetter(maxRepeatOffset, endGapScore);
+}
+
+void RepeatFinder::initializeBackwardScores() {
+ scoresPtr[0] = b2b;
+ std::fill_n(scoresPtr + 1, maxRepeatOffset, f2b);
+ if (endGapScore > -HUGE_VAL) {
+ std::fill_n(scoresPtr + 1 + maxRepeatOffset, maxRepeatOffset-1, -HUGE_VAL);
+ }
+}
+
+void RepeatFinder::calcBackwardTransitionScoresWithGaps() {
+ double toBackground = f2b + scoresPtr[0];
+ double *foregroundPtr = scoresPtr + 1;
+ double f = *foregroundPtr;
+ double toForeground = f;
+
+ double *insertionPtr = scoresPtr + 1 + maxRepeatOffset;
+ double i = *insertionPtr;
+ *foregroundPtr = max3(toBackground, f2f1 + f, i);
+ double d = endGapScore + f;
+ ++foregroundPtr;
+ toForeground += b2fGrowth;
+
+ while (foregroundPtr < scoresPtr + maxRepeatOffset) {
+ f = *foregroundPtr;
+ toForeground = std::max(toForeground, f);
+ i = *(insertionPtr + 1);
+ *foregroundPtr = max3(toBackground, f2f2 + f, std::max(i, d));
+ double oneGapScore_f = oneGapScore + f;
+ *insertionPtr = std::max(oneGapScore_f, g2g + i);
+ d = std::max(oneGapScore_f, g2g + d);
+ ++foregroundPtr;
+ ++insertionPtr;
+ toForeground += b2fGrowth;
+ }
+
+ f = *foregroundPtr;
+ toForeground = std::max(toForeground, f);
+ *foregroundPtr = max3(toBackground, f2f1 + f, d);
+ *insertionPtr = endGapScore + f;
+
+ scoresPtr[0] = std::max(b2b + scoresPtr[0], b2fLast + toForeground);
+}
+
+void RepeatFinder::calcBackwardTransitionScores() {
+ if (endGapScore > -HUGE_VAL) return calcBackwardTransitionScoresWithGaps();
+
+ double toBackground = f2b + scoresPtr[0];
+ double toForeground = -HUGE_VAL;
+ double *foregroundPtr = scoresPtr + 1;
+ double *foregroundEnd = foregroundPtr + maxRepeatOffset;
+
+ while (foregroundPtr < foregroundEnd) {
+ toForeground += b2fGrowth;
+ double f = *foregroundPtr;
+ toForeground = std::max(toForeground, f);
+ *foregroundPtr = std::max(toBackground, f2f0 + f);
+ ++foregroundPtr;
+ }
+
+ scoresPtr[0] = std::max(b2b + scoresPtr[0], b2fLast + toForeground);
+}
+
+void RepeatFinder::calcEmissionScores() {
+ const double *matrixRow = substitutionMatrix[*seqPtr];
+ double *oldScores = scoresPtr - dpScoresPerLetter;
+ int maxOffset = maxOffsetInTheSequence();
+ int i = 1;
+
+ scoresPtr[0] = oldScores[0];
+
+ for (; i <= maxOffset; ++i) {
+ scoresPtr[i] = oldScores[i] + matrixRow[seqPtr[-i]];
+ }
+
+ for (; i <= maxRepeatOffset; ++i) {
+ scoresPtr[i] = -HUGE_VAL;
+ }
+
+ std::copy(oldScores + i, scoresPtr, scoresPtr + i);
+}
+
+void RepeatFinder::makeCheckpoint() {
+ checkpoint += dpScoresPerLetter;
+ std::copy(scoresPtr - dpScoresPerLetter, scoresPtr, checkpoint);
+ scoresPtr = checkpoint + dpScoresPerLetter;
+ assert(scoresPtr < scoresEnd);
+}
+
+void RepeatFinder::redoCheckpoint() {
+ seqPtr += (scoresEnd - scoresPtr) / dpScoresPerLetter;
+ while (scoresPtr < scoresEnd) {
+ --seqPtr;
+ calcScoresForOneSequencePosition();
+ scoresPtr += dpScoresPerLetter;
+ }
+ scoresPtr -= dpScoresPerLetter;
+ checkpoint -= dpScoresPerLetter;
+}
+
+double RepeatFinder::calcBestPathScore(const uchar *seqBeg,
+ const uchar *seqEnd) {
+ this->seqBeg = seqBeg;
+ this->seqEnd = seqEnd;
+
+ unsigned long numOfStoredPositions = minStoredPositions(seqBeg, seqEnd);
+ unsigned long numOfScores = numOfStoredPositions * dpScoresPerLetter;
+ assert(numOfStoredPositions > 0);
+ assert(numOfScores > 0);
+ dpScores.resize(numOfScores);
+ scoresPtr = &dpScores[0];
+ scoresEnd = scoresPtr + numOfScores;
+ checkpoint = scoresPtr;
+ seqPtr = seqEnd;
+
+ initializeBackwardScores();
+
+ while (seqPtr > seqBeg) {
+ --seqPtr;
+ scoresPtr += dpScoresPerLetter;
+ if (scoresPtr == scoresEnd) makeCheckpoint();
+ calcScoresForOneSequencePosition();
+ }
+
+ state = 0;
+ return scoresPtr[0];
+}
+
+int RepeatFinder::offsetWithMaxScore() const {
+ const double *matrixRow = substitutionMatrix[*seqPtr];
+ int maxOffset = maxOffsetInTheSequence();
+ int bestOffset = 0;
+ double toForeground = -HUGE_VAL;
+
+ for (int i = 1; i <= maxOffset; ++i) {
+ toForeground += b2fGrowth;
+ double f = scoreWithEmission(matrixRow, i);
+ if (f > toForeground) {
+ toForeground = f;
+ bestOffset = i;
+ }
+ }
+
+ return bestOffset;
+}
+
+int RepeatFinder::deletionWithMaxScore() const {
+ const double *matrixRow = substitutionMatrix[*seqPtr];
+ int bestOffset = 1;
+ double f = scoreWithEmission(matrixRow, 1);
+ double d = endGapScore + f;
+
+ for (int i = 2; i < state; ++i) {
+ d += g2g;
+ f = scoreWithEmission(matrixRow, i);
+ if (oneGapScore + f > d) {
+ d = oneGapScore + f;
+ bestOffset = i;
+ }
+ }
+
+ return bestOffset;
+}
+
+int RepeatFinder::nextState() {
+ double maxScore = scoresPtr[state];
+ if (scoresPtr == checkpoint) redoCheckpoint();
+ scoresPtr -= dpScoresPerLetter;
+
+ if (state == 0) {
+ if (b2b + scoresPtr[0] < maxScore) state = offsetWithMaxScore();
+ } else if (state <= maxRepeatOffset) {
+ if (f2b + scoresPtr[0] >= maxScore) {
+ state = 0;
+ } else if (endGapScore > -HUGE_VAL) {
+ double f = scoreWithEmission(substitutionMatrix[*seqPtr], state);
+ if (state == 1) {
+ if (f2f1 + f < maxScore) state += maxRepeatOffset;
+ } else if (state == maxRepeatOffset) {
+ if (f2f1 + f < maxScore) state = deletionWithMaxScore();
+ } else if (f2f2 + f < maxScore) {
+ if (scoresPtr[state + maxRepeatOffset] >= maxScore) {
+ state += maxRepeatOffset;
+ } else {
+ state = deletionWithMaxScore();
+ }
+ }
+ }
+ } else {
+ ++state;
+ if (state == dpScoresPerLetter || g2g + scoresPtr[state] < maxScore) {
+ state -= maxRepeatOffset;
+ }
+ }
+
+ ++seqPtr;
+ return state;
+}
+
+}
=====================================
src/tantan_repeat_finder.hh
=====================================
@@ -0,0 +1,99 @@
+// Copyright 2018 Martin C. Frith
+
+// This class finds tandem repeats in a sequence, using a Viterbi
+// algorithm.
+
+// The input parameters are as in tantan.hh, with one difference: the
+// substitutionMatrix should be a *log* likelihood ratio matrix:
+// substitutionMatrix[x][y] = lambda * scoringMatrix[x][y].
+
+// Usage: first call init. Then call calcBestPathScore, which runs
+// the Viterbi algorithm backwards from the end to the start of the
+// sequence. Finally, call nextState once per sequence letter, to get
+// the "state" of each letter from start to end.
+
+// state = 0: non-repeat.
+// 0 < state <= maxRepeatOffset: tandem repeat with period = state.
+// maxRepeatOffset < state < 2*maxRepeatOffset: insertion in repeat.
+
+#ifndef TANTAN_REPEAT_FINDER_HH
+#define TANTAN_REPEAT_FINDER_HH
+
+#include <vector>
+
+namespace tantan {
+
+typedef unsigned char uchar;
+typedef const double *const_double_ptr;
+
+class RepeatFinder {
+public:
+ void init(int maxRepeatOffset,
+ const const_double_ptr *substitutionMatrix,
+ double repeatProb,
+ double repeatEndProb,
+ double repeatOffsetProbDecay,
+ double firstGapProb,
+ double otherGapProb);
+
+ double calcBestPathScore(const uchar *seqBeg, const uchar *seqEnd);
+
+ int nextState();
+
+private:
+ const const_double_ptr *substitutionMatrix;
+
+ double b2b;
+ double f2b;
+ double g2g;
+ double oneGapScore;
+ double endGapScore;
+ double f2f0;
+ double f2f1;
+ double f2f2;
+ double b2fGrowth;
+ double b2fLast;
+
+ int maxRepeatOffset;
+ int dpScoresPerLetter;
+
+ std::vector<double> dpScores;
+ double *scoresPtr;
+ double *scoresEnd;
+ double *checkpoint;
+
+ const uchar *seqBeg;
+ const uchar *seqEnd;
+ const uchar *seqPtr;
+ int state;
+
+ void initializeBackwardScores();
+ void calcBackwardTransitionScoresWithGaps();
+ void calcBackwardTransitionScores();
+ void calcEmissionScores();
+ void makeCheckpoint();
+ void redoCheckpoint();
+ int offsetWithMaxScore() const;
+ int deletionWithMaxScore() const;
+
+ bool isNearSeqBeg() const {
+ return seqPtr - seqBeg < maxRepeatOffset;
+ }
+
+ int maxOffsetInTheSequence() const {
+ return isNearSeqBeg() ? (seqPtr - seqBeg) : maxRepeatOffset;
+ }
+
+ double scoreWithEmission(const double *matrixRow, int offset) const {
+ return scoresPtr[offset] + matrixRow[seqPtr[-offset]];
+ }
+
+ void calcScoresForOneSequencePosition() {
+ calcEmissionScores();
+ calcBackwardTransitionScores();
+ }
+};
+
+}
+
+#endif
=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"18"
+"22"
=====================================
test/hard.fa
=====================================
@@ -0,0 +1,3 @@
+>hard
+catcatcatcatcat
+atcatatcatatcatatcatatcat
=====================================
test/tantan_test.out
=====================================
@@ -1489,3 +1489,27 @@ GTGGTCCACATATACAATGGAATATTACTCAGCTATCAGAAAGAA
IIIIIIIIIIIIIIIIDIII2BDHIIHCF=/-6A1921*/1-*-7
205
+
+SRR019778.4 6 27 7 3 TTGTGTG TGTGTAT,TGTGTGT,TGTGTGT
+SRR019778.4 28 45 2 12 GT GT,GT,GT,GT,GT,-T,T,T,T,T,T,T
+SRR019778.42 3 44 4 11 TGTG TGTG,TGTG,TGTG,TGTG,TGTG,TGTG,TGTT,TGTT,T-TT,TTT,TTT
+SRR019778.45 9 37 8 3.33333 GTGTGTGG GTGTGTGG,GTGTGTGT,GTtGTGTGT,GTT
+SRR019778.50 0 21 7 3 GTGTGTT GTGTGTT,GTGTGTT,GAGTGTT
+SRR019778.50 22 33 2 5.5 TG TG,TG,TG,TG,TG,T
+SRR019778.64 30 44 3 4.66667 GAG GAG,GAG,GAG,GAG,GA
+SRR019778.65 2 13 2 5.5 TG TG,TG,TG,TG,TG,T
+SRR019778.65 4 37 16 2.0625 TGTGTGTGTTTTCTGT TGTGTGTGTTTTCTGT,TGTGTGTGTTTTTTGT,T
+SRR019778.78 9 31 10 2.2 TGCCTTACTA TGCCTTACTA,TGCCTTACTA,TG
+SRR019778.95 2 18 4 4 TGAT TGAT,TGAT,TGAT,TGAT
+SRR019778.95 22 45 4 5.75 ATAG ATAG,ATAG,ATAG,ATAG,ATAG,ATA
+
+hard 0 14 3 4.66667 CAT CAT,CAT,CAT,CAT,CA
+hard 10 40 5 6 ATCAT ATCAT,ATCAT,ATCAT,ATCAT,ATCAT,ATCAT
+
+chrM 207 220 4 3.25 TTAA TTAA,TTAA,TTAA,T
+chrM 515 526 2 5.5 CA CA,CA,CA,CA,CA,C
+chrM 5305 5326 12 1.75 CCACCATCATAG CCACCATCATAG,CCACCATCA
+chrM 8271 8290 9 2.11111 ACCCCCTCT ACCCCCTCT,ACCCCCTCT,A
+chrM 13647 13691 30 1.48276 CCCCACCCTTACTAACATTAACGAAAATAA CCCCACCCTTACTAACATTAACGAAAATAA,CCCCACCCT-ACTAA
+chrM 15829 15844 6 2.5 AATCCT AATCCT,AATCCT,AAT
+chrM 16183 16195 1 12 C C,C,C,C,C,C,C,C,C,C,C,C
=====================================
test/tantan_test.sh
=====================================
@@ -29,5 +29,11 @@ countLowercaseLetters () {
tantan panda.fastq
echo
tantan -i2 -j3 hg19_chrM.fa | countLowercaseLetters
+ echo
+ tantan -f4 panda.fastq
+ echo
+ tantan -f4 -b12 hard.fa
+ echo
+ tantan -f4 -n1 hg19_chrM.fa
} 2>&1 |
diff -u tantan_test.out -
View it on GitLab: https://salsa.debian.org/med-team/tantan/commit/1063526ec91bd472dd347b5a39d44052ce6f3049
--
View it on GitLab: https://salsa.debian.org/med-team/tantan/commit/1063526ec91bd472dd347b5a39d44052ce6f3049
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181222/150d33e9/attachment-0001.html>
More information about the debian-med-commit
mailing list