[med-svn] [Git][med-team/tantan][upstream] New upstream version 22

Sascha Steinbiss gitlab at salsa.debian.org
Sat Dec 22 20:48:07 GMT 2018


Sascha Steinbiss pushed to branch upstream at Debian Med / tantan


Commits:
1063526e by Sascha Steinbiss at 2018-12-22T20:36:53Z
New upstream version 22
- - - - -


13 changed files:

- ChangeLog.txt
- README.html
- README.txt
- src/mcf_tantan_options.cc
- src/mcf_tantan_options.hh
- src/tantan.cc
- src/tantan_app.cc
- + src/tantan_repeat_finder.cc
- + src/tantan_repeat_finder.hh
- src/version.hh
- + test/hard.fa
- test/tantan_test.out
- test/tantan_test.sh


Changes:

=====================================
ChangeLog.txt
=====================================
@@ -1,9 +1,31 @@
+2018-12-19  Martin C. Frith  <Martin C. Frith>
+
+	* src/tantan.cc:
+	Make it faster
+	[3523060bcfb9] [tip]
+
+	* src/tantan_repeat_finder.cc, src/tantan_repeat_finder.hh:
+	Make -f4 a bit faster
+	[1fccdf6e21be]
+
+	* README.txt, src/mcf_tantan_options.cc, src/mcf_tantan_options.hh,
+	src/tantan_app.cc, src/tantan_repeat_finder.cc,
+	src/tantan_repeat_finder.hh, test/hard.fa, test/tantan_test.out,
+	test/tantan_test.sh:
+	Add option to find straightforward tandem repeats
+	[1bae60144712]
+
+	* README.txt, src/mcf_tantan_options.cc, src/tantan.cc,
+	src/tantan_app.cc:
+	Tweak the help message
+	[fc1ca32a72aa]
+
 2018-12-10  Martin C. Frith  <Martin C. Frith>
 
 	* README.txt, src/mcf_tantan_options.cc, src/mcf_tantan_options.hh,
 	src/tantan_app.cc, test/tantan_test.out, test/tantan_test.sh:
 	Add match score, mismatch cost options
-	[2383404c795a] [tip]
+	[2383404c795a]
 
 	* src/tantan.cc:
 	Refactor


=====================================
README.html
=====================================
@@ -451,7 +451,7 @@ repeats, so it's easy to lift the masking after determining homology.</p>
 <td>preserve uppercase/lowercase in non-masked regions</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-m</span></kbd></td>
-<td>file for letter pair scores (scoring matrix)</td></tr>
+<td>file for letter-pair score matrix</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-r</span></kbd></td>
 <td>probability of a repeat starting per position</td></tr>
@@ -480,9 +480,12 @@ repeats, so it's easy to lift the masking after determining homology.</p>
 <kbd><span class="option">-s</span></kbd></td>
 <td>minimum repeat probability for masking</td></tr>
 <tr><td class="option-group">
+<kbd><span class="option">-n</span></kbd></td>
+<td>minimum copy number, affects -f4 only</td></tr>
+<tr><td class="option-group">
 <kbd><span class="option">-f</span></kbd></td>
 <td>output type: 0=masked sequence, 1=repeat probabilities,
-2=repeat counts, 3=BED</td></tr>
+2=repeat counts, 3=BED, 4=tandem repeats</td></tr>
 <tr><td class="option-group">
 <kbd><span class="option">-h</span>, <span class="option">--help</span></kbd></td>
 <td>show help message, then exit</td></tr>
@@ -512,6 +515,29 @@ TGCTAGCAA ttaggcttaggtcagtgc TTAGGCTTAGGTCAGTGC AACGTA
 </pre>
 <p>(My thanks to Junko Tsuji and Paul Horton for finding these issues.)</p>
 </div>
+<div class="section" id="finding-straightforward-tandem-repeats">
+<h1>Finding straightforward tandem repeats</h1>
+<p>Option -f4 runs tantan in a different mode, where it finds
+straightforward tandem repeats only.  (Technically, it uses a Viterbi
+algorithm instead of a Forward-Backward algorithm.)  This is <em>not</em>
+recommended for avoiding false homologs!  But it might be useful for
+studying tandem repeats.  The output looks like this:</p>
+<pre class="literal-block">
+mySeq   14765   14780   6       2.5     GTCATG  GTCATG,GTCATG,GTC
+mySeq   632362  632377  2       6       GC      GC,GC,GC,GCt,GCT,GCT
+mySeq   1278353 1278369 3       6.5     TCA     TCA,TCA,TCA,TC-,TC,TC,T
+mySeq   3616084 3616100 3       5.33333 TGG     TGA,TGA,TGG,TGG,TGG,T
+</pre>
+<p>The first 3 columns show the start and end coordinates of the
+repetitive region, in <a class="reference external" href="https://genome.ucsc.edu/FAQ/FAQformat.html#format1">BED</a> format.  Column
+4 shows the length of the repeating unit (which might vary due to
+insertions and deletions, so this column shows the most common
+length).  Column 5 shows the number of repeat units.  Column 6 shows
+the repeating unit (which again might vary, so this is just a
+representative).  Column 7 shows the whole repeat: lowercase letters
+are insertions relative to the previous repeat unit, and dashes are
+deletions relative to the previous repeat unit.</p>
+</div>
 <div class="section" id="miscellaneous">
 <h1>Miscellaneous</h1>
 <p>tantan is distributed under the GNU General Public License, either


=====================================
README.txt
=====================================
@@ -136,7 +136,7 @@ Options
 -p  interpret the sequences as proteins
 -x  letter to use for masking, instead of lowercase
 -c  preserve uppercase/lowercase in non-masked regions
--m  file for letter pair scores (scoring matrix)
+-m  file for letter-pair score matrix
 -r  probability of a repeat starting per position
 -e  probability of a repeat ending per position
 -w  maximum tandem repeat period to consider
@@ -146,8 +146,9 @@ Options
 -a  gap existence cost
 -b  gap extension cost
 -s  minimum repeat probability for masking
+-n  minimum copy number, affects -f4 only
 -f  output type: 0=masked sequence, 1=repeat probabilities,
-                 2=repeat counts, 3=BED
+                 2=repeat counts, 3=BED, 4=tandem repeats
 -h, --help  show help message, then exit
 --version   show version information, then exit
 
@@ -172,6 +173,31 @@ align it on the other strand::
 
 (My thanks to Junko Tsuji and Paul Horton for finding these issues.)
 
+Finding straightforward tandem repeats
+--------------------------------------
+
+Option -f4 runs tantan in a different mode, where it finds
+straightforward tandem repeats only.  (Technically, it uses a Viterbi
+algorithm instead of a Forward-Backward algorithm.)  This is *not*
+recommended for avoiding false homologs!  But it might be useful for
+studying tandem repeats.  The output looks like this::
+
+  mySeq   14765   14780   6       2.5     GTCATG  GTCATG,GTCATG,GTC
+  mySeq   632362  632377  2       6       GC      GC,GC,GC,GCt,GCT,GCT
+  mySeq   1278353 1278369 3       6.5     TCA     TCA,TCA,TCA,TC-,TC,TC,T
+  mySeq   3616084 3616100 3       5.33333 TGG     TGA,TGA,TGG,TGG,TGG,T
+
+The first 3 columns show the start and end coordinates of the
+repetitive region, in `BED
+<https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_ format.  Column
+4 shows the length of the repeating unit (which might vary due to
+insertions and deletions, so this column shows the most common
+length).  Column 5 shows the number of repeat units.  Column 6 shows
+the repeating unit (which again might vary, so this is just a
+representative).  Column 7 shows the whole repeat: lowercase letters
+are insertions relative to the previous repeat unit, and dashes are
+deletions relative to the previous repeat unit.
+
 Miscellaneous
 -------------
 


=====================================
src/mcf_tantan_options.cc
=====================================
@@ -36,7 +36,7 @@ static int myGetopt(int argc, char **argv, const char *optstring,
 std::istream &operator>>(std::istream &s, TantanOptions::OutputType &x) {
   int i = 0;
   s >> i;
-  if (i < 0 || i > 3)
+  if (i < 0 || i > 4)
     s.setstate(std::ios::failbit);
   if (s)
     x = static_cast<TantanOptions::OutputType>(i);
@@ -57,6 +57,7 @@ TantanOptions::TantanOptions() :
     gapExistenceCost(0),
     gapExtensionCost(-1),  // means: no gaps
     minMaskProb(0.5),
+    minCopyNumber(2.0),
     outputType(maskOut),
     indexOfFirstNonOptionArgument(-1) {}
 
@@ -69,7 +70,7 @@ Options (default settings):\n\
  -p  interpret the sequences as proteins\n\
  -x  letter to use for masking, instead of lowercase\n\
  -c  preserve uppercase/lowercase in non-masked regions\n\
- -m  file for letter pair scores (+1/-1, but -p selects BLOSUM62)\n\
+ -m  file for letter-pair score matrix\n\
  -r  probability of a repeat starting per position ("
       + stringify(repeatProb) + ")\n\
  -e  probability of a repeat ending per position ("
@@ -77,15 +78,17 @@ Options (default settings):\n\
  -w  maximum tandem repeat period to consider (100, but -p selects 50)\n\
  -d  probability decay per period ("
       + stringify(repeatOffsetProbDecay) + ")\n\
- -i  match score (1, but -p selects BLOSUM62)\n\
- -j  mismatch cost (1, but -p selects BLOSUM62)\n\
+ -i  match score (BLOSUM62 if -p, else 2 if -f4, else 1)\n\
+ -j  mismatch cost (BLOSUM62 if -p, else 7 if -f4, else 1)\n\
  -a  gap existence cost ("
       + stringify(gapExistenceCost) + ")\n\
- -b  gap extension cost (infinite: no gaps)\n\
+ -b  gap extension cost (7 if -f4, else infinity)\n\
  -s  minimum repeat probability for masking ("
       + stringify(minMaskProb) + ")\n\
+ -n  minimum copy number, affects -f4 only ("
+      + stringify(minCopyNumber) + ")\n\
  -f  output type: 0=masked sequence, 1=repeat probabilities,\n\
-                  2=repeat counts, 3=BED ("
+                  2=repeat counts, 3=BED, 4=tandem repeats ("
       + stringify(outputType) + ")\n\
  -h, --help  show help message, then exit\n\
  --version   show version information, then exit\n\
@@ -93,12 +96,13 @@ Options (default settings):\n\
 Report bugs to: tantan at cbrc.jp\n\
 Home page: http://www.cbrc.jp/tantan/\n\
 ";
+  // -k for transition cost?
 
   std::string version = "tantan "
 #include "version.hh"
       "\n";
 
-  const char *optstring = "px:cm:r:e:w:d:i:j:a:b:s:f:h";
+  const char *optstring = "px:cm:r:e:w:d:i:j:a:b:s:n:f:h";
 
   int i;
   while ((i = myGetopt(argc, argv, optstring, help, version)) != -1) {
@@ -158,6 +162,9 @@ Home page: http://www.cbrc.jp/tantan/\n\
         unstringify(minMaskProb, optarg);
         // don't bother checking for stupid values?
         break;
+      case 'n':
+	unstringify(minCopyNumber, optarg);
+	break;
       case 'f':
         unstringify(outputType, optarg);
         break;
@@ -168,11 +175,18 @@ Home page: http://www.cbrc.jp/tantan/\n\
     }
   }
 
+  if (gapExtensionCost < 0 && outputType == repOut) gapExtensionCost = 7;
+
   if (gapExtensionCost > 0 && gapExistenceCost + gapExtensionCost <= 0)
     throw Error("gap existence + extension cost is too low");
 
   if (maxCycleLength < 0) maxCycleLength = (isProtein ? 50 : 100);
 
+  if (!isProtein || matchScore || mismatchCost) {
+    if (matchScore   == 0) matchScore   = (outputType == repOut ? 2 : 1);
+    if (mismatchCost == 0) mismatchCost = (outputType == repOut ? 7 : 1);
+  }
+
   indexOfFirstNonOptionArgument = optind;
 }
 


=====================================
src/mcf_tantan_options.hh
=====================================
@@ -23,7 +23,8 @@ struct TantanOptions {
   int gapExistenceCost;
   int gapExtensionCost;
   double minMaskProb;
-  enum OutputType { maskOut, probOut, countOut, bedOut } outputType;
+  double minCopyNumber;
+  enum OutputType { maskOut, probOut, countOut, bedOut, repOut } outputType;
 
   int indexOfFirstNonOptionArgument;
 };


=====================================
src/tantan.cc
=====================================
@@ -20,10 +20,10 @@ void multiplyAll(std::vector<double> &v, double factor) {
 }
 
 double firstRepeatOffsetProb(double probMult, int maxRepeatOffset) {
-  if (probMult < 1 || probMult > 1)
+  if (probMult < 1 || probMult > 1) {
     return (1 - probMult) / (1 - std::pow(probMult, maxRepeatOffset));
-  else
-    return 1.0 / maxRepeatOffset;
+  }
+  return 1.0 / maxRepeatOffset;
 }
 
 void checkForwardAndBackwardTotals(double fTot, double bTot) {
@@ -64,6 +64,7 @@ struct Tantan {
   double b2fLast;  // background state to last foreground state
 
   double backgroundProb;
+  std::vector<double> b2fProbs;  // background state to each foreground state
   std::vector<double> foregroundProbs;
   std::vector<double> insertionProbs;
 
@@ -110,9 +111,16 @@ struct Tantan {
     b2fFirst = repeatProb * firstRepeatOffsetProb(b2fDecay, maxRepeatOffset);
     b2fLast = repeatProb * firstRepeatOffsetProb(b2fGrowth, maxRepeatOffset);
 
+    b2fProbs.resize(maxRepeatOffset);
     foregroundProbs.resize(maxRepeatOffset);
     insertionProbs.resize(maxRepeatOffset - 1);
 
+    double p = b2fFirst;
+    for (int i = 0; i < maxRepeatOffset; ++i) {
+      b2fProbs[i] = p;
+      p *= b2fDecay;
+    }
+
     scaleFactors.resize((seqEnd - seqBeg) / scaleStepSize);
   }
 
@@ -211,20 +219,17 @@ struct Tantan {
   void calcForwardTransitionProbs() {
     if (endGapProb > 0) return calcForwardTransitionProbsWithGaps();
 
-    double fromBackground = backgroundProb * b2fLast;
+    double b = backgroundProb;
     double fromForeground = 0;
-    double *foregroundPtr = END(foregroundProbs);
     double *foregroundBeg = BEG(foregroundProbs);
 
-    while (foregroundPtr > foregroundBeg) {
-      --foregroundPtr;
-      double f = *foregroundPtr;
+    for (int i = 0; i < maxRepeatOffset; ++i) {
+      double f = foregroundBeg[i];
       fromForeground += f;
-      *foregroundPtr = fromBackground + f * f2f0;
-      fromBackground *= b2fGrowth;
+      foregroundBeg[i] = b * b2fProbs[i] + f * f2f0;
     }
 
-    backgroundProb = backgroundProb * b2b + fromForeground * f2b;
+    backgroundProb = b * b2b + fromForeground * f2b;
   }
 
   void calcBackwardTransitionProbs() {
@@ -232,18 +237,15 @@ struct Tantan {
 
     double toBackground = f2b * backgroundProb;
     double toForeground = 0;
-    double *foregroundPtr = BEG(foregroundProbs);
-    double *foregroundEnd = END(foregroundProbs);
+    double *foregroundBeg = BEG(foregroundProbs);
 
-    while (foregroundPtr < foregroundEnd) {
-      toForeground *= b2fGrowth;
-      double f = *foregroundPtr;
-      toForeground += f;
-      *foregroundPtr = toBackground + f2f0 * f;
-      ++foregroundPtr;
+    for (int i = 0; i < maxRepeatOffset; ++i) {
+      double f = foregroundBeg[i];
+      toForeground += b2fProbs[i] * f;
+      foregroundBeg[i] = toBackground + f2f0 * f;
     }
 
-    backgroundProb = b2b * backgroundProb + b2fLast * toForeground;
+    backgroundProb = b2b * backgroundProb + toForeground;
   }
 
   void addEndCounts(double forwardProb,


=====================================
src/tantan_app.cc
=====================================
@@ -9,17 +9,18 @@
 #include "mcf_tantan_options.hh"
 #include "mcf_util.hh"
 #include "tantan.hh"
+#include "tantan_repeat_finder.hh"
 #include "LambdaCalculator.hh"
 
 #include <algorithm>  // copy, fill_n
 #include <cassert>
 #include <cmath>
-#include <cstring>  // strchr
 #include <cstdlib>  // EXIT_SUCCESS, EXIT_FAILURE
 #include <exception>  // exception
 #include <fstream>
 #include <iostream>
 #include <new>  // bad_alloc
+#include <string.h>
 
 #define BEG(v) ((v).empty() ? 0 : &(v).front())
 #define END(v) ((v).empty() ? 0 : &(v).back() + 1)
@@ -33,7 +34,7 @@ bool isDubiousDna(const uchar *beg, const uchar *end) {
   int badLetters = 0;
   if (end - beg > 100) end = beg + 100;  // just check the first 100 letters
   while (beg < end) {
-    if (!std::strchr("AaCcGgTtNnUu", *beg)) {
+    if (!strchr("AaCcGgTtNnUu", *beg)) {
       ++badLetters;
       if (badLetters > 10) return true;
     }
@@ -45,6 +46,7 @@ bool isDubiousDna(const uchar *beg, const uchar *end) {
 namespace {
 TantanOptions options;
 Alphabet alphabet;
+tantan::RepeatFinder repeatFinder;
 
 enum { scoreMatrixSize = 64 };
 int fastMatrix[scoreMatrixSize][scoreMatrixSize];
@@ -76,16 +78,14 @@ void initScoresAndProbabilities() {
   if (options.scoreMatrixFileName) {
     unfilify(scoreMatrix, options.scoreMatrixFileName);
   } else if (options.isProtein) {
-    if (options.matchScore || options.mismatchCost) {
-      scoreMatrix.initMatchMismatch(std::max(options.matchScore, 1),
-				    std::max(options.mismatchCost, 1),
+    if (options.matchScore && options.mismatchCost) {
+      scoreMatrix.initMatchMismatch(options.matchScore, options.mismatchCost,
 				    Alphabet::protein);
     } else {
       unstringify(scoreMatrix, ScoreMatrix::blosum62);
     }
   } else {
-    scoreMatrix.initMatchMismatch(std::max(options.matchScore, 1),
-				  std::max(options.mismatchCost, 1),
+    scoreMatrix.initMatchMismatch(options.matchScore, options.mismatchCost,
 				  "ACGTU");  // allow for RNA
   }
 
@@ -101,7 +101,9 @@ void initScoresAndProbabilities() {
 
   for (int i = 0; i < scoreMatrixSize; ++i) {
     for (int j = 0; j < scoreMatrixSize; ++j) {
-      probMatrix[i][j] = std::exp(matrixLambda * fastMatrix[i][j]);
+      double x = matrixLambda * fastMatrix[i][j];
+      if (options.outputType != options.repOut) x = std::exp(x);
+      probMatrix[i][j] = x;
     }
   }
 
@@ -116,6 +118,10 @@ void initScoresAndProbabilities() {
     // XXX check if firstGapProb is too high
   }
 
+  repeatFinder.init(options.maxCycleLength, probMatrixPointers,
+		    options.repeatProb, options.repeatEndProb,
+		    options.repeatOffsetProbDecay, firstGapProb, otherGapProb);
+
   //std::cerr << "lambda: " << matrixLambda << "\n";
   //std::cerr << "firstGapProb: " << firstGapProb << "\n";
   //std::cerr << "otherGapProb: " << otherGapProb << "\n";
@@ -157,6 +163,138 @@ void writeBed(const float *probBeg, const float *probEnd,
   if (maskBeg) writeBedLine(seqName, probBeg, maskBeg, probEnd, output);
 }
 
+void storeSequence(const uchar *beg, const uchar *end, std::string &out) {
+  out.clear();
+  for (const uchar *i = beg; i < end; ++i) {
+    out.push_back(std::toupper(alphabet.numbersToLetters[*i]));
+  }
+}
+
+struct RepeatUnit {
+  const uchar *beg;
+  int len;
+};
+
+bool less(const RepeatUnit &x, const RepeatUnit &y) {
+  if (x.len != y.len) return x.len < y.len;
+  int c = memcmp(x.beg, y.beg, x.len);
+  if (c != 0) return c < 0;
+  return x.beg > y.beg;
+}
+
+int mainLen(const std::vector<RepeatUnit> &repUnits) {
+  int bestLen;
+  size_t bestCount = 0;
+  size_t count = 0;
+  for (size_t i = 0; i < repUnits.size(); ++i) {
+    if (i && repUnits[i].len > repUnits[i - 1].len) {
+      count = 0;
+    }
+    ++count;
+    if (count > bestCount) {
+      bestCount = count;
+      bestLen = repUnits[i].len;
+    }
+  }
+  return bestLen;
+}
+
+const uchar *mainBeg(const std::vector<RepeatUnit> &repUnits, int len) {
+  const uchar *bestBeg;
+  size_t bestCount = 0;
+  size_t count = 0;
+  for (size_t i = 0; i < repUnits.size(); ++i) {
+    if (repUnits[i].len != len) continue;
+    if (count && memcmp(repUnits[i - 1].beg, repUnits[i].beg, len) != 0) {
+      count = 0;
+    }
+    ++count;
+    if (count < bestCount) continue;
+    if (count > bestCount || repUnits[i].beg < bestBeg) {
+      bestCount = count;
+      bestBeg = repUnits[i].beg;
+    }
+  }
+  return bestBeg;
+}
+
+void writeRepeat(const FastaSequence &f,
+		 const uchar *repBeg, const uchar *repEnd,
+		 const std::string &repText, std::vector<RepeatUnit> &repUnits,
+		 const uchar *commaPos, int finalOffset) {
+  double repeatCount = count(repText.begin(), repText.end(), ',');
+  double copyNumber = repeatCount + (repEnd - commaPos) * 1.0 / finalOffset;
+  if (copyNumber < options.minCopyNumber) return;
+
+  sort(repUnits.begin(), repUnits.end(), less);
+  int bestLen = mainLen(repUnits);
+  const uchar *bestBeg = mainBeg(repUnits, bestLen);
+
+  const uchar *beg = BEG(f.sequence);
+  std::cout << firstWord(f.title) << '\t'
+	    << (repBeg - beg) << '\t' << (repEnd - beg) << '\t'
+	    << bestLen << '\t' << copyNumber << '\t';
+  for (int i = 0; i < bestLen; ++i) {
+    char c = std::toupper(alphabet.numbersToLetters[bestBeg[i]]);
+    std::cout << c;
+  }
+  std::cout << '\t';
+  std::cout << repText << '\n';
+}
+
+void findRepeatsInOneSequence(const FastaSequence &f) {
+  const uchar *beg = BEG(f.sequence);
+  const uchar *end = END(f.sequence);
+
+  repeatFinder.calcBestPathScore(beg, end);
+
+  std::vector<RepeatUnit> repUnits;
+  std::string repText;
+  const uchar *repBeg;
+  const uchar *commaPos;
+  int state = 0;
+
+  for (const uchar *seqPtr = beg; seqPtr < end; ++seqPtr) {
+    int newState = repeatFinder.nextState();
+
+    if (newState == 0) {
+      if (state > 0) {
+	writeRepeat(f, repBeg, seqPtr, repText, repUnits, commaPos, state);
+      }
+    } else if (newState <= options.maxCycleLength) {
+      if (state == 0) {
+	repUnits.clear();
+	repBeg = seqPtr - newState;
+	storeSequence(repBeg, seqPtr, repText);
+	commaPos = repBeg;
+      } else if (state <= options.maxCycleLength) {
+	for (int i = state; i > newState; --i) {
+	  if (seqPtr - commaPos >= i) {
+	    repText.push_back(',');
+	    commaPos = seqPtr;
+	  }
+	  repText.push_back('-');
+	}
+      }
+      RepeatUnit unit = {seqPtr - newState, newState};
+      repUnits.push_back(unit);
+      if (seqPtr - commaPos >= newState) {
+	repText.push_back(',');
+	commaPos = seqPtr;
+      }
+      repText.push_back(std::toupper(alphabet.numbersToLetters[*seqPtr]));
+    } else {
+      repText.push_back(std::tolower(alphabet.numbersToLetters[*seqPtr]));
+    }
+
+    state = newState;
+  }
+
+  if (state > 0) {
+    writeRepeat(f, repBeg, end, repText, repUnits, commaPos, state);
+  }
+}
+
 void processOneSequence(FastaSequence &f, std::ostream &output) {
   uchar *beg = BEG(f.sequence);
   uchar *end = END(f.sequence);
@@ -181,6 +319,8 @@ void processOneSequence(FastaSequence &f, std::ostream &output) {
                              BEG(transitionCounts));
     double sequenceLength = static_cast<double>(f.sequence.size());
     transitionTotal += sequenceLength + 1;
+  } else if (options.outputType == options.repOut) {
+    findRepeatsInOneSequence(f);
   } else {
     std::vector<float> probabilities(end - beg);
     float *probBeg = BEG(probabilities);


=====================================
src/tantan_repeat_finder.cc
=====================================
@@ -0,0 +1,278 @@
+// Copyright 2018 Martin C. Frith
+
+#include "tantan_repeat_finder.hh"
+
+#include <algorithm>
+#include <assert.h>
+#include <limits.h>
+#include <math.h>
+//#include <iostream>  // for debugging
+
+namespace tantan {
+
+static double max3(double x, double y, double z) {
+  return std::max(std::max(x, y), z);
+}
+
+static double myLog(double x) {
+  return x > 0 ? log(x) : -HUGE_VAL;
+}
+
+static double firstRepeatOffsetProb(double probMult, int maxRepeatOffset) {
+  if (probMult < 1 || probMult > 1) {
+    return (1 - probMult) / (1 - pow(probMult, 1.0 * maxRepeatOffset));
+  }
+  return 1.0 / maxRepeatOffset;
+}
+
+static int numOfDpScoresPerLetter(int maxRepeatOffset, double endGapScore) {
+  if (endGapScore > -HUGE_VAL) {
+    assert(maxRepeatOffset <= INT_MAX / 2);
+    return maxRepeatOffset * 2;
+  }
+  assert(maxRepeatOffset < INT_MAX);
+  return maxRepeatOffset + 1;
+}
+
+static unsigned minStoredPositions(const uchar *beg, const uchar *end) {
+  // We will do a dynamic programming algorithm along the sequence.
+  // To save memory, we will keep only some DP values, and recalculate
+  // the others later.  We keep the values in an array x of size s.
+  // We will store DP initial values in x[0].  We will put
+  // the DP values after the first s-1 sequence positions in x[1],
+  // the DP values after the next s-2 positions in x[2],
+  // the DP values after the next s-3 positions in x[3], etc.
+  // This function returns the minimum possible value of s.
+  unsigned t = 0;
+  while (t < end - beg) {
+    beg += t;
+    ++t;
+  }
+  return t + 1;
+}
+
+void RepeatFinder::init(int maxRepeatOffset,
+			const const_double_ptr *substitutionMatrix,
+			double repeatProb,
+			double repeatEndProb,
+			double repeatOffsetProbDecay,
+			double firstGapProb,
+			double otherGapProb) {
+  assert(maxRepeatOffset > 0);
+  this->maxRepeatOffset = maxRepeatOffset;
+  this->substitutionMatrix = substitutionMatrix;
+
+  b2b = myLog(1 - repeatProb);
+  f2b = myLog(repeatEndProb);
+  g2g = myLog(otherGapProb);
+  oneGapScore = myLog(firstGapProb * (1 - otherGapProb));
+  endGapScore = myLog(firstGapProb * (maxRepeatOffset > 1));
+  f2f0 = myLog(1 - repeatEndProb);
+  f2f1 = myLog(1 - repeatEndProb - firstGapProb);
+  f2f2 = myLog(1 - repeatEndProb - firstGapProb * 2);
+
+  double x = 1 / repeatOffsetProbDecay;
+  b2fGrowth = myLog(x);
+  b2fLast = myLog(repeatProb * firstRepeatOffsetProb(x, maxRepeatOffset));
+
+  dpScoresPerLetter = numOfDpScoresPerLetter(maxRepeatOffset, endGapScore);
+}
+
+void RepeatFinder::initializeBackwardScores() {
+  scoresPtr[0] = b2b;
+  std::fill_n(scoresPtr + 1, maxRepeatOffset, f2b);
+  if (endGapScore > -HUGE_VAL) {
+    std::fill_n(scoresPtr + 1 + maxRepeatOffset, maxRepeatOffset-1, -HUGE_VAL);
+  }
+}
+
+void RepeatFinder::calcBackwardTransitionScoresWithGaps() {
+  double toBackground = f2b + scoresPtr[0];
+  double *foregroundPtr = scoresPtr + 1;
+  double f = *foregroundPtr;
+  double toForeground = f;
+
+  double *insertionPtr = scoresPtr + 1 + maxRepeatOffset;
+  double i = *insertionPtr;
+  *foregroundPtr = max3(toBackground, f2f1 + f, i);
+  double d = endGapScore + f;
+  ++foregroundPtr;
+  toForeground += b2fGrowth;
+
+  while (foregroundPtr < scoresPtr + maxRepeatOffset) {
+    f = *foregroundPtr;
+    toForeground = std::max(toForeground, f);
+    i = *(insertionPtr + 1);
+    *foregroundPtr = max3(toBackground, f2f2 + f, std::max(i, d));
+    double oneGapScore_f = oneGapScore + f;
+    *insertionPtr = std::max(oneGapScore_f, g2g + i);
+    d = std::max(oneGapScore_f, g2g + d);
+    ++foregroundPtr;
+    ++insertionPtr;
+    toForeground += b2fGrowth;
+  }
+
+  f = *foregroundPtr;
+  toForeground = std::max(toForeground, f);
+  *foregroundPtr = max3(toBackground, f2f1 + f, d);
+  *insertionPtr = endGapScore + f;
+
+  scoresPtr[0] = std::max(b2b + scoresPtr[0], b2fLast + toForeground);
+}
+
+void RepeatFinder::calcBackwardTransitionScores() {
+  if (endGapScore > -HUGE_VAL) return calcBackwardTransitionScoresWithGaps();
+
+  double toBackground = f2b + scoresPtr[0];
+  double toForeground = -HUGE_VAL;
+  double *foregroundPtr = scoresPtr + 1;
+  double *foregroundEnd = foregroundPtr + maxRepeatOffset;
+
+  while (foregroundPtr < foregroundEnd) {
+    toForeground += b2fGrowth;
+    double f = *foregroundPtr;
+    toForeground = std::max(toForeground, f);
+    *foregroundPtr = std::max(toBackground, f2f0 + f);
+    ++foregroundPtr;
+  }
+
+  scoresPtr[0] = std::max(b2b + scoresPtr[0], b2fLast + toForeground);
+}
+
+void RepeatFinder::calcEmissionScores() {
+  const double *matrixRow = substitutionMatrix[*seqPtr];
+  double *oldScores = scoresPtr - dpScoresPerLetter;
+  int maxOffset = maxOffsetInTheSequence();
+  int i = 1;
+
+  scoresPtr[0] = oldScores[0];
+
+  for (; i <= maxOffset; ++i) {
+    scoresPtr[i] = oldScores[i] + matrixRow[seqPtr[-i]];
+  }
+
+  for (; i <= maxRepeatOffset; ++i) {
+    scoresPtr[i] = -HUGE_VAL;
+  }
+
+  std::copy(oldScores + i, scoresPtr, scoresPtr + i);
+}
+
+void RepeatFinder::makeCheckpoint() {
+  checkpoint += dpScoresPerLetter;
+  std::copy(scoresPtr - dpScoresPerLetter, scoresPtr, checkpoint);
+  scoresPtr = checkpoint + dpScoresPerLetter;
+  assert(scoresPtr < scoresEnd);
+}
+
+void RepeatFinder::redoCheckpoint() {
+  seqPtr += (scoresEnd - scoresPtr) / dpScoresPerLetter;
+  while (scoresPtr < scoresEnd) {
+    --seqPtr;
+    calcScoresForOneSequencePosition();
+    scoresPtr += dpScoresPerLetter;
+  }
+  scoresPtr -= dpScoresPerLetter;
+  checkpoint -= dpScoresPerLetter;
+}
+
+double RepeatFinder::calcBestPathScore(const uchar *seqBeg,
+				       const uchar *seqEnd) {
+  this->seqBeg = seqBeg;
+  this->seqEnd = seqEnd;
+
+  unsigned long numOfStoredPositions = minStoredPositions(seqBeg, seqEnd);
+  unsigned long numOfScores = numOfStoredPositions * dpScoresPerLetter;
+  assert(numOfStoredPositions > 0);
+  assert(numOfScores > 0);
+  dpScores.resize(numOfScores);
+  scoresPtr = &dpScores[0];
+  scoresEnd = scoresPtr + numOfScores;
+  checkpoint = scoresPtr;
+  seqPtr = seqEnd;
+
+  initializeBackwardScores();
+
+  while (seqPtr > seqBeg) {
+    --seqPtr;
+    scoresPtr += dpScoresPerLetter;
+    if (scoresPtr == scoresEnd) makeCheckpoint();
+    calcScoresForOneSequencePosition();
+  }
+
+  state = 0;
+  return scoresPtr[0];
+}
+
+int RepeatFinder::offsetWithMaxScore() const {
+  const double *matrixRow = substitutionMatrix[*seqPtr];
+  int maxOffset = maxOffsetInTheSequence();
+  int bestOffset = 0;
+  double toForeground = -HUGE_VAL;
+
+  for (int i = 1; i <= maxOffset; ++i) {
+    toForeground += b2fGrowth;
+    double f = scoreWithEmission(matrixRow, i);
+    if (f > toForeground) {
+      toForeground = f;
+      bestOffset = i;
+    }
+  }
+
+  return bestOffset;
+}
+
+int RepeatFinder::deletionWithMaxScore() const {
+  const double *matrixRow = substitutionMatrix[*seqPtr];
+  int bestOffset = 1;
+  double f = scoreWithEmission(matrixRow, 1);
+  double d = endGapScore + f;
+
+  for (int i = 2; i < state; ++i) {
+    d += g2g;
+    f = scoreWithEmission(matrixRow, i);
+    if (oneGapScore + f > d) {
+      d = oneGapScore + f;
+      bestOffset = i;
+    }
+  }
+
+  return bestOffset;
+}
+
+int RepeatFinder::nextState() {
+  double maxScore = scoresPtr[state];
+  if (scoresPtr == checkpoint) redoCheckpoint();
+  scoresPtr -= dpScoresPerLetter;
+
+  if (state == 0) {
+    if (b2b + scoresPtr[0] < maxScore) state = offsetWithMaxScore();
+  } else if (state <= maxRepeatOffset) {
+    if (f2b + scoresPtr[0] >= maxScore) {
+      state = 0;
+    } else if (endGapScore > -HUGE_VAL) {
+      double f = scoreWithEmission(substitutionMatrix[*seqPtr], state);
+      if (state == 1) {
+	if (f2f1 + f < maxScore) state += maxRepeatOffset;
+      } else if (state == maxRepeatOffset) {
+	if (f2f1 + f < maxScore) state = deletionWithMaxScore();
+      } else if (f2f2 + f < maxScore) {
+	if (scoresPtr[state + maxRepeatOffset] >= maxScore) {
+	  state += maxRepeatOffset;
+	} else {
+	  state = deletionWithMaxScore();
+	}
+      }
+    }
+  } else {
+    ++state;
+    if (state == dpScoresPerLetter || g2g + scoresPtr[state] < maxScore) {
+      state -= maxRepeatOffset;
+    }
+  }
+
+  ++seqPtr;
+  return state;
+}
+
+}


=====================================
src/tantan_repeat_finder.hh
=====================================
@@ -0,0 +1,99 @@
+// Copyright 2018 Martin C. Frith
+
+// This class finds tandem repeats in a sequence, using a Viterbi
+// algorithm.
+
+// The input parameters are as in tantan.hh, with one difference: the
+// substitutionMatrix should be a *log* likelihood ratio matrix:
+// substitutionMatrix[x][y] = lambda * scoringMatrix[x][y].
+
+// Usage: first call init.  Then call calcBestPathScore, which runs
+// the Viterbi algorithm backwards from the end to the start of the
+// sequence.  Finally, call nextState once per sequence letter, to get
+// the "state" of each letter from start to end.
+
+// state = 0: non-repeat.
+// 0 < state <= maxRepeatOffset: tandem repeat with period = state.
+// maxRepeatOffset < state < 2*maxRepeatOffset: insertion in repeat.
+
+#ifndef TANTAN_REPEAT_FINDER_HH
+#define TANTAN_REPEAT_FINDER_HH
+
+#include <vector>
+
+namespace tantan {
+
+typedef unsigned char uchar;
+typedef const double *const_double_ptr;
+
+class RepeatFinder {
+public:
+  void init(int maxRepeatOffset,
+	    const const_double_ptr *substitutionMatrix,
+	    double repeatProb,
+	    double repeatEndProb,
+	    double repeatOffsetProbDecay,
+	    double firstGapProb,
+	    double otherGapProb);
+
+  double calcBestPathScore(const uchar *seqBeg, const uchar *seqEnd);
+
+  int nextState();
+
+private:
+  const const_double_ptr *substitutionMatrix;
+
+  double b2b;
+  double f2b;
+  double g2g;
+  double oneGapScore;
+  double endGapScore;
+  double f2f0;
+  double f2f1;
+  double f2f2;
+  double b2fGrowth;
+  double b2fLast;
+
+  int maxRepeatOffset;
+  int dpScoresPerLetter;
+
+  std::vector<double> dpScores;
+  double *scoresPtr;
+  double *scoresEnd;
+  double *checkpoint;
+
+  const uchar *seqBeg;
+  const uchar *seqEnd;
+  const uchar *seqPtr;
+  int state;
+
+  void initializeBackwardScores();
+  void calcBackwardTransitionScoresWithGaps();
+  void calcBackwardTransitionScores();
+  void calcEmissionScores();
+  void makeCheckpoint();
+  void redoCheckpoint();
+  int offsetWithMaxScore() const;
+  int deletionWithMaxScore() const;
+
+  bool isNearSeqBeg() const {
+    return seqPtr - seqBeg < maxRepeatOffset;
+  }
+
+  int maxOffsetInTheSequence() const {
+    return isNearSeqBeg() ? (seqPtr - seqBeg) : maxRepeatOffset;
+  }
+
+  double scoreWithEmission(const double *matrixRow, int offset) const {
+    return scoresPtr[offset] + matrixRow[seqPtr[-offset]];
+  }
+
+  void calcScoresForOneSequencePosition() {
+    calcEmissionScores();
+    calcBackwardTransitionScores();
+  }
+};
+
+}
+
+#endif


=====================================
src/version.hh
=====================================
@@ -1 +1 @@
-"18"
+"22"


=====================================
test/hard.fa
=====================================
@@ -0,0 +1,3 @@
+>hard
+catcatcatcatcat
+atcatatcatatcatatcatatcat


=====================================
test/tantan_test.out
=====================================
@@ -1489,3 +1489,27 @@ GTGGTCCACATATACAATGGAATATTACTCAGCTATCAGAAAGAA
 IIIIIIIIIIIIIIIIDIII2BDHIIHCF=/-6A1921*/1-*-7
 
 205
+
+SRR019778.4	6	27	7	3	TTGTGTG	TGTGTAT,TGTGTGT,TGTGTGT
+SRR019778.4	28	45	2	12	GT	GT,GT,GT,GT,GT,-T,T,T,T,T,T,T
+SRR019778.42	3	44	4	11	TGTG	TGTG,TGTG,TGTG,TGTG,TGTG,TGTG,TGTT,TGTT,T-TT,TTT,TTT
+SRR019778.45	9	37	8	3.33333	GTGTGTGG	GTGTGTGG,GTGTGTGT,GTtGTGTGT,GTT
+SRR019778.50	0	21	7	3	GTGTGTT	GTGTGTT,GTGTGTT,GAGTGTT
+SRR019778.50	22	33	2	5.5	TG	TG,TG,TG,TG,TG,T
+SRR019778.64	30	44	3	4.66667	GAG	GAG,GAG,GAG,GAG,GA
+SRR019778.65	2	13	2	5.5	TG	TG,TG,TG,TG,TG,T
+SRR019778.65	4	37	16	2.0625	TGTGTGTGTTTTCTGT	TGTGTGTGTTTTCTGT,TGTGTGTGTTTTTTGT,T
+SRR019778.78	9	31	10	2.2	TGCCTTACTA	TGCCTTACTA,TGCCTTACTA,TG
+SRR019778.95	2	18	4	4	TGAT	TGAT,TGAT,TGAT,TGAT
+SRR019778.95	22	45	4	5.75	ATAG	ATAG,ATAG,ATAG,ATAG,ATAG,ATA
+
+hard	0	14	3	4.66667	CAT	CAT,CAT,CAT,CAT,CA
+hard	10	40	5	6	ATCAT	ATCAT,ATCAT,ATCAT,ATCAT,ATCAT,ATCAT
+
+chrM	207	220	4	3.25	TTAA	TTAA,TTAA,TTAA,T
+chrM	515	526	2	5.5	CA	CA,CA,CA,CA,CA,C
+chrM	5305	5326	12	1.75	CCACCATCATAG	CCACCATCATAG,CCACCATCA
+chrM	8271	8290	9	2.11111	ACCCCCTCT	ACCCCCTCT,ACCCCCTCT,A
+chrM	13647	13691	30	1.48276	CCCCACCCTTACTAACATTAACGAAAATAA	CCCCACCCTTACTAACATTAACGAAAATAA,CCCCACCCT-ACTAA
+chrM	15829	15844	6	2.5	AATCCT	AATCCT,AATCCT,AAT
+chrM	16183	16195	1	12	C	C,C,C,C,C,C,C,C,C,C,C,C


=====================================
test/tantan_test.sh
=====================================
@@ -29,5 +29,11 @@ countLowercaseLetters () {
     tantan panda.fastq
     echo
     tantan -i2 -j3 hg19_chrM.fa | countLowercaseLetters
+    echo
+    tantan -f4 panda.fastq
+    echo
+    tantan -f4 -b12 hard.fa
+    echo
+    tantan -f4 -n1 hg19_chrM.fa
 } 2>&1 |
 diff -u tantan_test.out -



View it on GitLab: https://salsa.debian.org/med-team/tantan/commit/1063526ec91bd472dd347b5a39d44052ce6f3049

-- 
View it on GitLab: https://salsa.debian.org/med-team/tantan/commit/1063526ec91bd472dd347b5a39d44052ce6f3049
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181222/150d33e9/attachment-0001.html>


More information about the debian-med-commit mailing list