[med-svn] [bowtie2] 02/05: Imported Upstream version 2.2.7

Andreas Tille tille at debian.org
Wed Feb 17 21:13:59 UTC 2016


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository bowtie2.

commit 8549af184cec2cc84e40c2e331f01664716cf57d
Author: Andreas Tille <tille at debian.org>
Date:   Wed Feb 17 22:09:49 2016 +0100

    Imported Upstream version 2.2.7
---
 MANUAL                   |   28 +-
 MANUAL.markdown          |   28 +-
 Makefile                 |    6 +-
 NEWS                     |   11 +-
 VERSION                  |    2 +-
 alphabet.cpp             |   43 +-
 blockwise_sa.h           | 1017 +++++++++++++++++++++++++++-------------------
 bt2_build.cpp            |    9 +
 bt2_dp.cpp               |    2 -
 bt2_idx.h                |    8 +-
 bt2_search.cpp           |   36 +-
 diff_sample.h            |  317 ++++++++++-----
 ds.h                     |   72 +++-
 multikey_qsort.h         |  428 ++++++++++---------
 pat.h                    |   13 +-
 processor_support.h      |    4 +-
 scripts/debug_wrapper.pl |   21 -
 sse_util.h               |    4 +
 threading.h              |  109 ++++-
 19 files changed, 1337 insertions(+), 821 deletions(-)

diff --git a/MANUAL b/MANUAL
index ffcc98d..947e8ce 100644
--- a/MANUAL
+++ b/MANUAL
@@ -304,11 +304,11 @@ reference.
 For an alignment to be considered "valid" (i.e. "good enough") by Bowtie 2, it
 must have an alignment score no less than the minimum score threshold.  The
 threshold is configurable and is expressed as a function of the read length. In
-end-to-end alignment mode, the default minimum score threhsold is `-0.6 + -0.6 *
-L`, where `L` is the read length.  In local alignment mdoe, the default minimum
+end-to-end alignment mode, the default minimum score threshold is `-0.6 + -0.6 *
+L`, where `L` is the read length.  In local alignment mode, the default minimum
 score threshold is `20 + 8.0 * ln(L)`, where L is the read length.  This can be
 configured with the `--score-min` option.  For details on how to set options
-like `--score-min` that correpond to functions, see the section on [setting
+like `--score-min` that correspond to functions, see the section on [setting
 function options].
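
 [Editor's note] As a minimal illustrative sketch of the two default threshold
 functions quoted above (only the formulas are taken from the MANUAL text; the
 helper names and the example read length are hypothetical):

     // Sketch only: evaluate the default minimum score thresholds described
     // above for a read of length L.
     #include <cmath>
     #include <cstdio>

     static double minScoreEndToEnd(double L) { return -0.6 + -0.6 * L; }         // --end-to-end default
     static double minScoreLocal(double L)    { return 20.0 + 8.0 * std::log(L); } // --local default

     int main() {
         double L = 100.0;  // hypothetical read length
         std::printf("end-to-end: %.1f, local: %.1f\n",
                     minScoreEndToEnd(L), minScoreLocal(L));  // -60.6 and 56.8
         return 0;
     }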
 
 Mapping quality: higher = more unique
@@ -419,16 +419,16 @@ alignment status of pairs per se, not individual mates.
 
 The SAM `FLAGS` field, the second field in a SAM record, has multiple bits that
 describe the paired-end nature of the read and alignment.  The first (least
-significant) bit (1 in decimal, 0x1 in hexidecimal) is set if the read is part
-of a pair.  The second bit (2 in decimal, 0x2 in hexidecimal) is set if the read
+significant) bit (1 in decimal, 0x1 in hexadecimal) is set if the read is part
+of a pair.  The second bit (2 in decimal, 0x2 in hexadecimal) is set if the read
 is part of a pair that aligned in a paired-end fashion.  The fourth bit (8 in
-decimal, 0x8 in hexidecimal) is set if the read is part of a pair and the other
+decimal, 0x8 in hexadecimal) is set if the read is part of a pair and the other
 mate in the pair had at least one valid alignment.  The sixth bit (32 in
-decimal, 0x20 in hexidecimal) is set if the read is part of a pair and the other
+decimal, 0x20 in hexadecimal) is set if the read is part of a pair and the other
 mate in the pair aligned to the Crick strand (or, equivalently, if the reverse
 complement of the other mate aligned to the Watson strand).  The seventh bit (64
-in decimal, 0x40 in hexidecimal) is set if the read is mate 1 in a pair.  The
-eighth bit (128 in decimal, 0x80 in hexidecimal) is set if the read is mate 2 in
+in decimal, 0x40 in hexadecimal) is set if the read is mate 1 in a pair.  The
+eighth bit (128 in decimal, 0x80 in hexadecimal) is set if the read is mate 2 in
 a pair.  See the [SAM specification] for a more detailed description of the
 `FLAGS` field.
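
 [Editor's note] To make the bit arithmetic concrete, here is a small sketch
 (not part of this commit) that tests several of the paired-end FLAGS bits
 described above; the example value is hypothetical, and the SAM specification
 remains the authoritative reference for the bit definitions:

     // Sketch only: test a few paired-end bits of a SAM FLAGS value.
     #include <cstdio>

     int main() {
         int flags = 0x63;  // hypothetical FLAGS value 99 = 0x1 | 0x2 | 0x20 | 0x40
         std::printf("part of a pair (0x1):        %d\n", (flags & 0x1)  != 0);
         std::printf("aligned as a pair (0x2):     %d\n", (flags & 0x2)  != 0);
         std::printf("mate on Crick strand (0x20): %d\n", (flags & 0x20) != 0);
         std::printf("read is mate 1 (0x40):       %d\n", (flags & 0x40) != 0);
         std::printf("read is mate 2 (0x80):       %d\n", (flags & 0x80) != 0);
         return 0;
     }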
 
@@ -523,7 +523,7 @@ because it exceeded a limit placed on search effort (see `-D` and `-R`) or
 because it already knows all it needs to know to report an alignment.
 Information from the best alignments are used to estimate mapping quality (the
 `MAPQ` [SAM] field) and to set SAM optional fields, such as `AS:i` and
-`XS:i`.  Bowtie 2 does not garantee that the alignment reported is the best
+`XS:i`.  Bowtie 2 does not guarantee that the alignment reported is the best
 possible in terms of alignment score.
 
 See also: `-D`, which puts an upper limit on the number of dynamic programming
@@ -678,7 +678,7 @@ ambiguous nucleotides.  Bowtie 2 will still print a SAM record for such a read,
 but no alignment will be reported and and the `YF:i` SAM optional field will be
 set to indicate the reason the read was filtered.
 
-* `YF:Z:LN`: the read was filtered becuase it had length less than or equal to
+* `YF:Z:LN`: the read was filtered because it had length less than or equal to
 the number of seed mismatches set with the `-N` option.
 * `YF:Z:NS`: the read was filtered because it contains a number of ambiguous
 characters (usually `N` or `.`) greater than the ceiling specified with
@@ -978,14 +978,14 @@ alignment].  Can be set to 0 or 1. Setting this higher makes alignment slower
     -L <int>
 
 Sets the length of the seed substrings to align during [multiseed alignment].
-Smaller values make alignment slower but more senstive. Default: the
+Smaller values make alignment slower but more sensitive. Default: the
 `--sensitive` preset is used by default, which sets `-L` to 20 both in
 `--end-to-end` mode and in `--local` mode.
 
     -i <func>
 
 Sets a function governing the interval between seed substrings to use during
-[multiseed alignment].  For instance, if the read has 30 characers, and seed
+[multiseed alignment].  For instance, if the read has 30 characters, and seed
 length is 10, and the seed interval is 6, the seeds extracted will be:
 
     Read:      TAGCTACGCTCTACGCTATCATGCATAAAC
@@ -1462,7 +1462,7 @@ When one or more `--rg` arguments are specified, `bowtie2` will also print
 an `@RG` line that includes all user-specified `--rg` tokens separated by
 tabs.
 
-Each subsequnt line describes an alignment or, if the read failed to align, a
+Each subsequent line describes an alignment or, if the read failed to align, a
 read.  Each line is a collection of at least 12 fields separated by tabs; from
 left to right, the fields are:
 
diff --git a/MANUAL.markdown b/MANUAL.markdown
index 403ced5..fd07cfa 100644
--- a/MANUAL.markdown
+++ b/MANUAL.markdown
@@ -315,11 +315,11 @@ reference.
 For an alignment to be considered "valid" (i.e. "good enough") by Bowtie 2, it
 must have an alignment score no less than the minimum score threshold.  The
 threshold is configurable and is expressed as a function of the read length. In
-end-to-end alignment mode, the default minimum score threhsold is `-0.6 + -0.6 *
-L`, where `L` is the read length.  In local alignment mdoe, the default minimum
+end-to-end alignment mode, the default minimum score threshold is `-0.6 + -0.6 *
+L`, where `L` is the read length.  In local alignment mode, the default minimum
 score threshold is `20 + 8.0 * ln(L)`, where L is the read length.  This can be
 configured with the [`--score-min`] option.  For details on how to set options
-like `--score-min` that correpond to functions, see the section on [setting
+like `--score-min` that correspond to functions, see the section on [setting
 function options].
 
 [setting function options]: #setting-function-options
@@ -432,16 +432,16 @@ alignment status of pairs per se, not individual mates.
 
 The SAM `FLAGS` field, the second field in a SAM record, has multiple bits that
 describe the paired-end nature of the read and alignment.  The first (least
-significant) bit (1 in decimal, 0x1 in hexidecimal) is set if the read is part
-of a pair.  The second bit (2 in decimal, 0x2 in hexidecimal) is set if the read
+significant) bit (1 in decimal, 0x1 in hexadecimal) is set if the read is part
+of a pair.  The second bit (2 in decimal, 0x2 in hexadecimal) is set if the read
 is part of a pair that aligned in a paired-end fashion.  The fourth bit (8 in
-decimal, 0x8 in hexidecimal) is set if the read is part of a pair and the other
+decimal, 0x8 in hexadecimal) is set if the read is part of a pair and the other
 mate in the pair had at least one valid alignment.  The sixth bit (32 in
-decimal, 0x20 in hexidecimal) is set if the read is part of a pair and the other
+decimal, 0x20 in hexadecimal) is set if the read is part of a pair and the other
 mate in the pair aligned to the Crick strand (or, equivalently, if the reverse
 complement of the other mate aligned to the Watson strand).  The seventh bit (64
-in decimal, 0x40 in hexidecimal) is set if the read is mate 1 in a pair.  The
-eighth bit (128 in decimal, 0x80 in hexidecimal) is set if the read is mate 2 in
+in decimal, 0x40 in hexadecimal) is set if the read is mate 1 in a pair.  The
+eighth bit (128 in decimal, 0x80 in hexadecimal) is set if the read is mate 2 in
 a pair.  See the [SAM specification] for a more detailed description of the
 `FLAGS` field.
 
@@ -537,7 +537,7 @@ because it exceeded a limit placed on search effort (see [`-D`] and [`-R`]) or
 because it already knows all it needs to know to report an alignment.
 Information from the best alignments are used to estimate mapping quality (the
 `MAPQ` [SAM] field) and to set SAM optional fields, such as [`AS:i`] and
-[`XS:i`].  Bowtie 2 does not garantee that the alignment reported is the best
+[`XS:i`].  Bowtie 2 does not guarantee that the alignment reported is the best
 possible in terms of alignment score.
 
 See also: [`-D`], which puts an upper limit on the number of dynamic programming
@@ -697,7 +697,7 @@ ambiguous nucleotides.  Bowtie 2 will still print a SAM record for such a read,
 but no alignment will be reported and and the `YF:i` SAM optional field will be
 set to indicate the reason the read was filtered.
 
-* `YF:Z:LN`: the read was filtered becuase it had length less than or equal to
+* `YF:Z:LN`: the read was filtered because it had length less than or equal to
 the number of seed mismatches set with the [`-N`] option.
 * `YF:Z:NS`: the read was filtered because it contains a number of ambiguous
 characters (usually `N` or `.`) greater than the ceiling specified with
@@ -1205,7 +1205,7 @@ alignment].  Can be set to 0 or 1. Setting this higher makes alignment slower
 </td><td>
 
 Sets the length of the seed substrings to align during [multiseed alignment].
-Smaller values make alignment slower but more senstive. Default: the
+Smaller values make alignment slower but more sensitive. Default: the
 [`--sensitive`] preset is used by default, which sets `-L` to 20 both in
 [`--end-to-end`] mode and in [`--local`] mode.
 
@@ -1221,7 +1221,7 @@ Smaller values make alignment slower but more senstive. Default: the
 </td><td>
 
 Sets a function governing the interval between seed substrings to use during
-[multiseed alignment].  For instance, if the read has 30 characers, and seed
+[multiseed alignment].  For instance, if the read has 30 characters, and seed
 length is 10, and the seed interval is 6, the seeds extracted will be:
 
     Read:      TAGCTACGCTCTACGCTATCATGCATAAAC
@@ -2109,7 +2109,7 @@ When one or more [`--rg`] arguments are specified, `bowtie2` will also print
 an `@RG` line that includes all user-specified [`--rg`] tokens separated by
 tabs.
 
-Each subsequnt line describes an alignment or, if the read failed to align, a
+Each subsequent line describes an alignment or, if the read failed to align, a
 read.  Each line is a collection of at least 12 fields separated by tabs; from
 left to right, the fields are:
 
diff --git a/Makefile b/Makefile
index 8711bc6..6c93f04 100644
--- a/Makefile
+++ b/Makefile
@@ -83,7 +83,7 @@ else
 endif
 
 ifeq (1,$(NO_SPINLOCK))
-	override EXTRA_FLAGS += -DNO_SPIN_LOCK
+	override EXTRA_FLAGS += -DNO_SPINLOCK
 endif
 
 ifeq (1,$(WITH_TBB))
@@ -105,6 +105,10 @@ ifeq (1,$(WITH_THREAD_PROFILING))
 	override EXTRA_FLAGS += -DPER_THREAD_TIMING=1
 endif
 
+ifeq (1,$(WITH_AFFINITY))
+	override EXTRA_FLAGS += -DWITH_AFFINITY=1
+endif
+
 SHARED_CPPS = ccnt_lut.cpp ref_read.cpp alphabet.cpp shmem.cpp \
               edit.cpp bt2_idx.cpp bt2_io.cpp bt2_util.cpp \
               reference.cpp ds.cpp multikey_qsort.cpp limit.cpp \
diff --git a/NEWS b/NEWS
index 4c00344..105f0a5 100644
--- a/NEWS
+++ b/NEWS
@@ -16,6 +16,15 @@ Please report any issues using the Sourceforge bug tracker:
 Version Release History
 =======================
 
+Version 2.2.7 - Feb 10, 2016
+   * Added a parallel index build option: bowtie2-build --threads <# threads>.
+   * Fixed an issue whereby IUPAC codes (other than A/C/G/T/N) in reads were
+     converted to As. Now all non-A/C/G/T characters in reads become Ns.
+   * Fixed some compilation issues, including for the Intel C++ Compiler.
+   * Removed debugging code that could impede performance for many alignment
+     threads.
+   * Fixed a few typos in documentation.
+
 Version 2.2.6 - Jul 22, 2015
    * Switched to a stable sort to avoid some potential reproducibility confusions.
    * Added 'install' target for *nix platforms.
@@ -25,7 +34,7 @@ Version 2.2.6 - Jul 22, 2015
    * Fixed a bug that caused seed lenght to be dependent of the -L and -N parameters order.
    * Fixed a bug that caused --local followed by -N to reset seed lenght to 22 which is
      actually the default value for global.
-   * Enable compilation on FreeBDS and clang, although gmake port is still required.
+   * Enable compilation on FreeBSD and clang, although gmake port is still required.
    * Fixed an issue that made bowtie2 compilation process to fail on Snow Leopard.
 
 
diff --git a/VERSION b/VERSION
index bda8fbe..5bc1cc4 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-2.2.6
+2.2.7
diff --git a/alphabet.cpp b/alphabet.cpp
index 7613557..630a3bb 100644
--- a/alphabet.cpp
+++ b/alphabet.cpp
@@ -295,27 +295,30 @@ void setIupacsCat(uint8_t cat) {
 
 /// For converting from ASCII to the Dna5 code where A=0, C=1, G=2,
 /// T=3, N=4
+/// According to the manual all the other characters, including
+/// IUPAC codes are being converted to N
 uint8_t asc2dna[] = {
-	/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/*  48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/*  64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
-	       /*    A     C           G                    N */
-	/*  80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	       /*             T */
-	/*  96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
-	       /*    a     c           g                    n */
-	/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	       /*             t */
-	/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-	/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+	/*   0 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/*  16 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/*  32 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/*                                               - */
+	/*  48 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/*  64 */ 4, 0, 4, 1, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4,
+	       /*    A  B  C  D        G  H        K     M  N */
+	/*  80 */ 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	       /*       R  S  T  U  V  W     Y */
+	/*  96 */ 4, 0, 4, 1, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4,
+	       /*    a  b  c  d        g  h        k     m  n */
+	/* 112 */ 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	       /*       r  s  t  u  v  w     y */
+	/* 128 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 144 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 160 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 176 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 192 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 208 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 224 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	/* 240 */ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
 };
 
 /// Convert an ascii char representing a base or a color to a 2-bit
diff --git a/blockwise_sa.h b/blockwise_sa.h
index 6402160..9cae3f2 100644
--- a/blockwise_sa.h
+++ b/blockwise_sa.h
@@ -35,6 +35,7 @@
 #include "timer.h"
 #include "ds.h"
 #include "mem_ids.h"
+#include "word_io.h"
 
 using namespace std;
 
@@ -86,23 +87,7 @@ public:
 	/**
 	 * Get the next suffix; compute the next bucket if necessary.
 	 */
-	TIndexOffU nextSuffix() {
-		if(_itrPushedBackSuffix != OFF_MASK) {
-			TIndexOffU tmp = _itrPushedBackSuffix;
-			_itrPushedBackSuffix = OFF_MASK;
-			return tmp;
-		}
-		while(_itrBucketPos >= _itrBucket.size() ||
-		      _itrBucket.size() == 0)
-		{
-			if(!hasMoreBlocks()) {
-				throw out_of_range("No more suffixes");
-			}
-			nextBlock();
-			_itrBucketPos = 0;
-		}
-		return _itrBucket[_itrBucketPos++];
-	}
+    virtual TIndexOffU nextSuffix() = 0;
 
 	/**
 	 * Return true iff the next call to nextSuffix will succeed.
@@ -158,7 +143,7 @@ protected:
 	 * Grab the next block of sorted suffixes.  The block is guaranteed
 	 * to have at most _bucketSz elements.
 	 */
-	virtual void nextBlock() = 0;
+	virtual void nextBlock(int cur_block, int tid = 0) = 0;
 	/// Return true iff more blocks are available
 	virtual bool hasMoreBlocks() const = 0;
 	/// Optionally output a verbose message
@@ -206,48 +191,154 @@ class KarkkainenBlockwiseSA : public InorderBlockwiseSA<TStr> {
 public:
 	typedef DifferenceCoverSample<TStr> TDC;
 
-	KarkkainenBlockwiseSA(const TStr& __text,
-	                      TIndexOffU __bucketSz,
-	                      uint32_t __dcV,
-	                      uint32_t __seed = 0,
-	      	              bool __sanityCheck = false,
-	   	                  bool __passMemExc = false,
-	      	              bool __verbose = false,
-	      	              ostream& __logger = cout) :
-	InorderBlockwiseSA<TStr>(__text, __bucketSz, __sanityCheck, __passMemExc, __verbose, __logger),
-	_sampleSuffs(EBWTB_CAT), _cur(0), _dcV(__dcV), _dc(EBWTB_CAT), _built(false)
-	{ _randomSrc.init(__seed); reset(); }
-
-	~KarkkainenBlockwiseSA() { }
-
-	/**
-	 * Allocate an amount of memory that simulates the peak memory
-	 * usage of the DifferenceCoverSample with the given text and v.
-	 * Throws bad_alloc if it's not going to fit in memory.  Returns
-	 * the approximate number of bytes the Cover takes at all times.
-	 */
-	static size_t simulateAllocs(const TStr& text, TIndexOffU bucketSz) {
-		size_t len = text.length();
-		// _sampleSuffs and _itrBucket are in memory at the peak
-		size_t bsz = bucketSz;
-		size_t sssz = len / max<TIndexOffU>(bucketSz-1, 1);
-		AutoArray<TIndexOffU> tmp(bsz + sssz + (1024 * 1024 /*out of caution*/), EBWT_CAT);
-		return bsz;
-	}
-
-	/// Defined in blockwise_sa.cpp
-	virtual void nextBlock();
+    KarkkainenBlockwiseSA(const TStr& __text,
+                          TIndexOffU __bucketSz,
+                          int __nthreads,
+                          uint32_t __dcV,
+                          uint32_t __seed = 0,
+                          bool __sanityCheck = false,
+                          bool __passMemExc = false,
+                          bool __verbose = false,
+                          string base_fname = "",
+                          ostream& __logger = cout) :
+    InorderBlockwiseSA<TStr>(__text, __bucketSz, __sanityCheck, __passMemExc, __verbose, __logger),
+    _sampleSuffs(EBWTB_CAT), _nthreads(__nthreads), _itrBucketIdx(0), _cur(0), _dcV(__dcV), _dc(EBWTB_CAT), _built(false), _base_fname(base_fname), _bigEndian(currentlyBigEndian())
+    { _randomSrc.init(__seed); reset(); }
+    
+    ~KarkkainenBlockwiseSA()
+    {
+        if(_threads.size() > 0) {
+            for (size_t tid = 0; tid < _threads.size(); tid++) {
+                _threads[tid]->join();
+                delete _threads[tid];
+            }
+        }
+    }
+    
+    /**
+     * Allocate an amount of memory that simulates the peak memory
+     * usage of the DifferenceCoverSample with the given text and v.
+     * Throws bad_alloc if it's not going to fit in memory.  Returns
+     * the approximate number of bytes the Cover takes at all times.
+     */
+    static size_t simulateAllocs(const TStr& text, TIndexOffU bucketSz) {
+        size_t len = text.length();
+        // _sampleSuffs and _itrBucket are in memory at the peak
+        size_t bsz = bucketSz;
+        size_t sssz = len / max<TIndexOffU>(bucketSz-1, 1);
+        AutoArray<TIndexOffU> tmp(bsz + sssz + (1024 * 1024 /*out of caution*/), EBWT_CAT);
+        return bsz;
+    }
 
-	/// Defined in blockwise_sa.cpp
-	virtual void qsort(EList<TIndexOffU>& bucket);
-
-	/// Return true iff more blocks are available
-	virtual bool hasMoreBlocks() const {
-		return _cur <= _sampleSuffs.size();
-	}
-
-	/// Return the difference-cover period
-	uint32_t dcV() const { return _dcV; }
+    static void nextBlock_Worker(void *vp) {
+        pair<KarkkainenBlockwiseSA*, int> param = *(pair<KarkkainenBlockwiseSA*, int>*)vp;
+        KarkkainenBlockwiseSA* sa = param.first;
+        int tid = param.second;
+        while(true) {
+            size_t cur = 0;
+            {
+                ThreadSafe ts(&sa->_mutex, sa->_nthreads > 1);
+                cur = sa->_cur;
+                if(cur > sa->_sampleSuffs.size()) break;
+                sa->_cur++;
+            }
+            sa->nextBlock((int)cur, tid);
+            // Write suffixes into a file
+            std::ostringstream number; number << cur;
+            const string fname = sa->_base_fname + "." + number.str() + ".sa";
+            ofstream sa_file(fname.c_str(), ios::binary);
+            if(!sa_file.good()) {
+                cerr << "Could not open file for writing a reference graph: \"" << fname << "\"" << endl;
+                throw 1;
+            }
+            const EList<TIndexOffU>& bucket = sa->_itrBuckets[tid];
+            writeU<TIndexOffU>(sa_file, (TIndexOffU)bucket.size(), sa->_bigEndian);
+            for(size_t i = 0; i < bucket.size(); i++) {
+                writeU<TIndexOffU>(sa_file, bucket[i], sa->_bigEndian);
+            }
+            sa_file.close();
+            sa->_itrBuckets[tid].clear();
+            sa->_done[cur] = true;
+        }
+    }
+    
+    /**
+     * Get the next suffix; compute the next bucket if necessary.
+     */
+    virtual TIndexOffU nextSuffix() {
+        // Launch threads if not
+        if(this->_nthreads > 1) {
+            if(_threads.size() == 0) {
+                _done.resize(_sampleSuffs.size() + 1);
+                _done.fill(false);
+                _itrBuckets.resize(this->_nthreads);
+                for(int tid = 0; tid < this->_nthreads; tid++) {
+                    _tparams.expand();
+                    _tparams.back().first = this;
+                    _tparams.back().second = tid;
+                    _threads.push_back(new tthread::thread(nextBlock_Worker, (void*)&_tparams.back()));
+                }
+                assert_eq(_threads.size(), (size_t)this->_nthreads);
+            }
+        }
+        if(this->_itrPushedBackSuffix != OFF_MASK) {
+            TIndexOffU tmp = this->_itrPushedBackSuffix;
+            this->_itrPushedBackSuffix = OFF_MASK;
+            return tmp;
+        }
+        while(this->_itrBucketPos >= this->_itrBucket.size() ||
+              this->_itrBucket.size() == 0)
+        {
+            if(!hasMoreBlocks()) {
+                throw out_of_range("No more suffixes");
+            }
+            if(this->_nthreads == 1) {
+                nextBlock((int)_cur);
+                _cur++;
+            } else {
+                while(!_done[this->_itrBucketIdx]) {
+#if defined(_TTHREAD_WIN32_)
+                    Sleep(1);
+#elif defined(_TTHREAD_POSIX_)
+                    const static timespec ts = {0, 1000000};  // 1 millisecond
+                    nanosleep(&ts, NULL);
+#endif
+                }
+                // Read suffixes from a file
+                std::ostringstream number; number << this->_itrBucketIdx;
+                const string fname = _base_fname + "." + number.str() + ".sa";
+                ifstream sa_file(fname.c_str(), ios::binary);
+                if(!sa_file.good()) {
+                    cerr << "Could not open file for reading a reference graph: \"" << fname << "\"" << endl;
+                    throw 1;
+                }
+                size_t numSAs = readU<TIndexOffU>(sa_file, _bigEndian);
+                this->_itrBucket.resizeExact(numSAs);
+                for(size_t i = 0; i < numSAs; i++) {
+                    this->_itrBucket[i] = readU<TIndexOffU>(sa_file, _bigEndian);
+                }
+                sa_file.close();
+                std::remove(fname.c_str());
+            }
+            this->_itrBucketIdx++;
+            this->_itrBucketPos = 0;
+        }
+        return this->_itrBucket[this->_itrBucketPos++];
+    }
+    
+    /// Defined in blockwise_sa.cpp
+    virtual void nextBlock(int cur_block, int tid = 0);
+    
+    /// Defined in blockwise_sa.cpp
+    virtual void qsort(EList<TIndexOffU>& bucket);
+    
+    /// Return true iff more blocks are available
+    virtual bool hasMoreBlocks() const {
+        return this->_itrBucketIdx <= _sampleSuffs.size();
+    }
+    
+    /// Return the difference-cover period
+    uint32_t dcV() const { return _dcV; }
 
 protected:
 
@@ -272,27 +363,27 @@ protected:
 
 private:
 
-	/**
-	 * Calculate the difference-cover sample and sample suffixes.
-	 */
-	void build() {
-		// Calculate difference-cover sample
-		assert(_dc.get() == NULL);
-		if(_dcV != 0) {
-			_dc.init(new TDC(this->text(), _dcV, this->verbose(), this->sanityCheck()));
-			_dc.get()->build();
-		}
-		// Calculate sample suffixes
-		if(this->bucketSz() <= this->text().length()) {
-			VMSG_NL("Building samples");
-			buildSamples();
-		} else {
-			VMSG_NL("Skipping building samples since text length " <<
-			        this->text().length() << " is less than bucket size: " <<
-			        this->bucketSz());
-		}
-		_built = true;
-	}
+    /**
+     * Calculate the difference-cover sample and sample suffixes.
+     */
+    void build() {
+        // Calculate difference-cover sample
+        assert(_dc.get() == NULL);
+        if(_dcV != 0) {
+            _dc.init(new TDC(this->text(), _dcV, this->verbose(), this->sanityCheck()));
+            _dc.get()->build(this->_nthreads);
+        }
+        // Calculate sample suffixes
+        if(this->bucketSz() <= this->text().length()) {
+            VMSG_NL("Building samples");
+            buildSamples();
+        } else {
+            VMSG_NL("Skipping building samples since text length " <<
+                    this->text().length() << " is less than bucket size: " <<
+                    this->bucketSz());
+        }
+        _built = true;
+    }
 
 	/**
 	 * Calculate the lcp between two suffixes using the difference
@@ -318,12 +409,22 @@ private:
 
 	void buildSamples();
 
-	EList<TIndexOffU>  _sampleSuffs; /// sample suffixes
-	TIndexOffU         _cur;         /// offset to 1st elt of next block
-	const uint32_t   _dcV;         /// difference-cover periodicity
-	PtrWrap<TDC>     _dc;          /// queryable difference-cover data
-	bool             _built;       /// whether samples/DC have been built
-	RandomSource     _randomSrc;   /// source of pseudo-randoms
+    EList<TIndexOffU>  _sampleSuffs; /// sample suffixes
+    int                _nthreads;    /// # of threads
+    TIndexOffU         _itrBucketIdx;
+    TIndexOffU         _cur;         /// offset to 1st elt of next block
+    const uint32_t   _dcV;         /// difference-cover periodicity
+    PtrWrap<TDC>     _dc;          /// queryable difference-cover data
+    bool             _built;       /// whether samples/DC have been built
+    RandomSource     _randomSrc;   /// source of pseudo-randoms
+    
+    MUTEX_T                 _mutex;       /// synchronization of output message
+    string                  _base_fname;  /// base file name for storing SA blocks
+    bool                    _bigEndian;   /// bigEndian?
+    EList<tthread::thread*> _threads;     /// thread list
+    EList<pair<KarkkainenBlockwiseSA*, int> > _tparams;
+    ELList<TIndexOffU>      _itrBuckets;  /// buckets
+    EList<bool>             _done;        /// is a block processed?
 };
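
 [Editor's note] The worker/consumer handoff added above (workers claim block
 indices under _mutex, write each sorted block to a "<base>.<N>.sa" file, and
 flag _done[N]; nextSuffix() polls _done and reads the blocks back in order)
 follows a standard work-claiming pattern. Below is a minimal standalone sketch
 of the claiming step using std::thread and std::mutex rather than the tthread
 wrapper used in the source; the block count and thread count are hypothetical:

     // Sketch only: each worker locks a mutex, claims the next block index,
     // and releases the lock before doing the expensive per-block work.
     #include <cstdio>
     #include <mutex>
     #include <thread>
     #include <vector>

     static std::mutex curMutex;
     static size_t cur = 0;              // next unclaimed block
     static const size_t numBlocks = 8;  // hypothetical block count

     static void worker(int tid) {
         while(true) {
             size_t myBlock;
             {
                 std::lock_guard<std::mutex> lock(curMutex);
                 if(cur >= numBlocks) break;
                 myBlock = cur++;
             }
             // ... sort block 'myBlock' and write it to "<base>.<myBlock>.sa" ...
             std::printf("thread %d processed block %zu\n", tid, myBlock);
         }
     }

     int main() {
         std::vector<std::thread> threads;
         for(int tid = 0; tid < 4; tid++) threads.emplace_back(worker, tid);
         for(auto& t : threads) t.join();
         return 0;
     }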
 
 /**
@@ -383,6 +484,47 @@ inline void KarkkainenBlockwiseSA<S2bDnaString>::qsort(
 	}
 }
 
+template<typename TStr>
+struct BinarySortingParam {
+    const TStr*              t;
+    const EList<TIndexOffU>* sampleSuffs;
+    EList<TIndexOffU>        bucketSzs;
+    EList<TIndexOffU>        bucketReps;
+    size_t                   begin;
+    size_t                   end;
+};
+
+template<typename TStr>
+static void BinarySorting_worker(void *vp)
+{
+    BinarySortingParam<TStr>* param = (BinarySortingParam<TStr>*)vp;
+    const TStr& t = *(param->t);
+    size_t len = t.length();
+    const EList<TIndexOffU>& sampleSuffs = *(param->sampleSuffs);
+    EList<TIndexOffU>& bucketSzs = param->bucketSzs;
+    EList<TIndexOffU>& bucketReps = param->bucketReps;
+    ASSERT_ONLY(size_t numBuckets = bucketSzs.size());
+    size_t begin = param->begin;
+    size_t end = param->end;
+    // Iterate through every suffix in the text, determine which
+    // bucket it falls into by doing a binary search across the
+    // sorted list of samples, and increment a counter associated
+    // with that bucket.  Also, keep one representative for each
+    // bucket so that we can split it later.  We loop in ten
+    // stretches so that we can print out a helpful progress
+    // message.  (This step can take a long time.)
+    for(TIndexOffU i = (TIndexOffU)begin; i < end && i < len; i++) {
+        TIndexOffU r = binarySASearch(t, i, sampleSuffs);
+        if(r == std::numeric_limits<TIndexOffU>::max()) continue; // r was one of the samples
+        assert_lt(r, numBuckets);
+        bucketSzs[r]++;
+        assert_lt(bucketSzs[r], len);
+        if(bucketReps[r] == OFF_MASK || (i & 100) == 0) {
+            bucketReps[r] = i; // clobbers previous one, but that's OK
+        }
+    }
+}
+
 /**
  * Select a set of bucket-delineating sample suffixes such that no
  * bucket is greater than the requested upper limit.  Some care is
@@ -391,186 +533,186 @@ inline void KarkkainenBlockwiseSA<S2bDnaString>::qsort(
  */
 template<typename TStr>
 void KarkkainenBlockwiseSA<TStr>::buildSamples() {
-	const TStr& t = this->text();
-	TIndexOffU bsz = this->bucketSz()-1; // subtract 1 to leave room for sample
-	size_t len = this->text().length();
-	// Prepare _sampleSuffs array
-	_sampleSuffs.clear();
-	TIndexOffU numSamples = (TIndexOffU)((len/bsz)+1)<<1; // ~len/bsz x 2
-	assert_gt(numSamples, 0);
-	VMSG_NL("Reserving space for " << numSamples << " sample suffixes");
-	if(this->_passMemExc) {
-		_sampleSuffs.resizeExact(numSamples);
-		// Randomly generate samples.  Allow duplicates for now.
-		VMSG_NL("Generating random suffixes");
-		for(size_t i = 0; i < numSamples; i++) {
-#ifdef BOWTIE_64BIT_INDEX         
-			_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU64() % len); 
+    const TStr& t = this->text();
+    TIndexOffU bsz = this->bucketSz()-1; // subtract 1 to leave room for sample
+    size_t len = this->text().length();
+    // Prepare _sampleSuffs array
+    _sampleSuffs.clear();
+    TIndexOffU numSamples = (TIndexOffU)((len/bsz)+1)<<1; // ~len/bsz x 2
+    assert_gt(numSamples, 0);
+    VMSG_NL("Reserving space for " << numSamples << " sample suffixes");
+    if(this->_passMemExc) {
+        _sampleSuffs.resizeExact(numSamples);
+        // Randomly generate samples.  Allow duplicates for now.
+        VMSG_NL("Generating random suffixes");
+        for(size_t i = 0; i < numSamples; i++) {
+#ifdef BOWTIE_64BIT_INDEX
+            _sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU64() % len);
 #else
-			_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU32() % len); 
+            _sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU32() % len);
 #endif
-		}
-	} else {
-		try {
-			_sampleSuffs.resizeExact(numSamples);
-			// Randomly generate samples.  Allow duplicates for now.
-			VMSG_NL("Generating random suffixes");
-			for(size_t i = 0; i < numSamples; i++) {
+        }
+    } else {
+        try {
+            _sampleSuffs.resizeExact(numSamples);
+            // Randomly generate samples.  Allow duplicates for now.
+            VMSG_NL("Generating random suffixes");
+            for(size_t i = 0; i < numSamples; i++) {
 #ifdef BOWTIE_64BIT_INDEX
-				_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU64() % len); 
+                _sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU64() % len);
 #else
-				_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU32() % len); 
-#endif                
-			}
-		} catch(bad_alloc &e) {
-			if(this->_passMemExc) {
-				throw e; // rethrow immediately
-			} else {
-				cerr << "Could not allocate sample suffix container of " << (numSamples * OFF_SIZE) << " bytes." << endl
-				     << "Please try using a smaller number of blocks by specifying a larger --bmax or" << endl
-				     << "a smaller --bmaxdivn" << endl;
-				throw 1;
-			}
-		}
-	}
-	// Remove duplicates; very important to do this before the call to
-	// mkeyQSortSuf so that it doesn't try to calculate lexicographical
-	// relationships between very long, identical strings, which takes
-	// an extremely long time in general, and causes the stack to grow
-	// linearly with the size of the input
-	{
-		Timer timer(cout, "QSorting sample offsets, eliminating duplicates time: ", this->verbose());
-		VMSG_NL("QSorting " << _sampleSuffs.size() << " sample offsets, eliminating duplicates");
-		_sampleSuffs.sort();
-		size_t sslen = _sampleSuffs.size();
-		for(size_t i = 0; i < sslen-1; i++) {
-			if(_sampleSuffs[i] == _sampleSuffs[i+1]) {
-				_sampleSuffs.erase(i--);
-				sslen--;
-			}
-		}
-	}
-	// Multikey quicksort the samples
-	{
-		Timer timer(cout, "  Multikey QSorting samples time: ", this->verbose());
-		VMSG_NL("Multikey QSorting " << _sampleSuffs.size() << " samples");
-		this->qsort(_sampleSuffs);
-	}
-	// Calculate bucket sizes
-	VMSG_NL("Calculating bucket sizes");
-	int limit = 5;
-	// Iterate until all buckets are less than
-	while(--limit >= 0) {
-		// Calculate bucket sizes by doing a binary search for each
-		// suffix and noting where it lands
-		TIndexOffU numBuckets = (TIndexOffU)_sampleSuffs.size()+1;
-		EList<TIndexOffU> bucketSzs(EBWTB_CAT); // holds computed bucket sizes
-		EList<TIndexOffU> bucketReps(EBWTB_CAT); // holds 1 member of each bucket (for splitting)
-		try {
-			// Allocate and initialize containers for holding bucket
-			// sizes and representatives.
-			bucketSzs.resizeExact(numBuckets);
-			bucketReps.resizeExact(numBuckets);
-			bucketSzs.fillZero();
-			bucketReps.fill(OFF_MASK);
-		} catch(bad_alloc &e) {
-			if(this->_passMemExc) {
-				throw e; // rethrow immediately
-			} else {
-				cerr << "Could not allocate sizes, representatives (" << ((numBuckets*8)>>10) << " KB) for blocks." << endl
-				     << "Please try using a smaller number of blocks by specifying a larger --bmax or a" << endl
-				     << "smaller --bmaxdivn." << endl;
-				throw 1;
-			}
-		}
-		// Iterate through every suffix in the text, determine which
-		// bucket it falls into by doing a binary search across the
-		// sorted list of samples, and increment a counter associated
-		// with that bucket.  Also, keep one representative for each
-		// bucket so that we can split it later.  We loop in ten
-		// stretches so that we can print out a helpful progress
-		// message.  (This step can take a long time.)
-		{
-			VMSG_NL("  Binary sorting into buckets");
-			Timer timer(cout, "  Binary sorting into buckets time: ", this->verbose());
-			TIndexOffU lenDiv10 = (TIndexOffU)((len + 9) / 10);
-			for(TIndexOffU iten = 0, ten = 0; iten < len; iten += lenDiv10, ten++) {
-				TIndexOffU itenNext = iten + lenDiv10;
-				if(ten > 0) VMSG_NL("  " << (ten * 10) << "%");
-				for(TIndexOffU i = iten; i < itenNext && i < len; i++) {
-					TIndexOffU r = binarySASearch(t, i, _sampleSuffs);
-					if(r == std::numeric_limits<TIndexOffU>::max()) continue; // r was one of the samples
-					assert_lt(r, numBuckets);
-					bucketSzs[r]++;
-					assert_lt(bucketSzs[r], len);
-					if(bucketReps[r] == OFF_MASK ||
-					   (_randomSrc.nextU32() & 100) == 0)
-					{
-						bucketReps[r] = i; // clobbers previous one, but that's OK
-					}
-				}
-			}
-			VMSG_NL("  100%");
-		}
-		// Check for large buckets and mergeable pairs of small buckets
-		// and split/merge as necessary
-		TIndexOff added = 0;
-		TIndexOff merged = 0;
-		assert_eq(bucketSzs.size(), numBuckets);
-		assert_eq(bucketReps.size(), numBuckets);
-		{
-			Timer timer(cout, "  Splitting and merging time: ", this->verbose());
-			VMSG_NL("Splitting and merging");
-			for(TIndexOffU i = 0; i < numBuckets; i++) {
-				TIndexOffU mergedSz = bsz + 1;
-				assert(bucketSzs[(size_t)i] == 0 || bucketReps[(size_t)i] != OFF_MASK);
-				if(i < numBuckets-1) {
-					mergedSz = bucketSzs[(size_t)i] + bucketSzs[(size_t)i+1] + 1;
-				}
-				// Merge?
-				if(mergedSz <= bsz) {
-					bucketSzs[(size_t)i+1] += (bucketSzs[(size_t)i]+1);
-					// The following may look strange, but it's necessary
-					// to ensure that the merged bucket has a representative
-					bucketReps[(size_t)i+1] = _sampleSuffs[(size_t)i+added];
-					_sampleSuffs.erase((size_t)i+added);
-					bucketSzs.erase((size_t)i);
-					bucketReps.erase((size_t)i);
-					i--; // might go to -1 but ++ will overflow back to 0
-					numBuckets--;
-					merged++;
-					assert_eq(numBuckets, _sampleSuffs.size()+1-added);
-					assert_eq(numBuckets, bucketSzs.size());
-				}
-				// Split?
-				else if(bucketSzs[(size_t)i] > bsz) {
-					// Add an additional sample from the bucketReps[]
-					// set accumulated in the binarySASearch loop; this
-					// effectively splits the bucket
-					_sampleSuffs.insert(bucketReps[(size_t)i], (TIndexOffU)(i + (added++)));
-				}
-			}
-		}
-		if(added == 0) {
-			//if(this->verbose()) {
-			//	cout << "Final bucket sizes:" << endl;
-			//	cout << "  (begin): " << bucketSzs[0] << " (" << (int)(bsz - bucketSzs[0]) << ")" << endl;
-			//	for(uint32_t i = 1; i < numBuckets; i++) {
-			//		cout << "  " << bucketSzs[i] << " (" << (int)(bsz - bucketSzs[i]) << ")" << endl;
-			//	}
-			//}
-			break;
-		}
-		// Otherwise, continue until no more buckets need to be
-		// split
-		VMSG_NL("Split " << added << ", merged " << merged << "; iterating...");
-	}
-	// Do *not* force a do-over
-//	if(limit == 0) {
-//		VMSG_NL("Iterated too many times; trying again...");
-//		buildSamples();
-//	}
-	VMSG_NL("Avg bucket size: " << ((double)(len-_sampleSuffs.size()) / (_sampleSuffs.size()+1)) << " (target: " << bsz << ")");
+                _sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU32() % len);
+#endif
+            }
+        } catch(bad_alloc &e) {
+            if(this->_passMemExc) {
+                throw e; // rethrow immediately
+            } else {
+                cerr << "Could not allocate sample suffix container of " << (numSamples * OFF_SIZE) << " bytes." << endl
+                << "Please try using a smaller number of blocks by specifying a larger --bmax or" << endl
+                << "a smaller --bmaxdivn" << endl;
+                throw 1;
+            }
+        }
+    }
+    // Remove duplicates; very important to do this before the call to
+    // mkeyQSortSuf so that it doesn't try to calculate lexicographical
+    // relationships between very long, identical strings, which takes
+    // an extremely long time in general, and causes the stack to grow
+    // linearly with the size of the input
+    {
+        Timer timer(cout, "QSorting sample offsets, eliminating duplicates time: ", this->verbose());
+        VMSG_NL("QSorting " << _sampleSuffs.size() << " sample offsets, eliminating duplicates");
+        _sampleSuffs.sort();
+        size_t sslen = _sampleSuffs.size();
+        for(size_t i = 0; i < sslen-1; i++) {
+            if(_sampleSuffs[i] == _sampleSuffs[i+1]) {
+                _sampleSuffs.erase(i--);
+                sslen--;
+            }
+        }
+    }
+    // Multikey quicksort the samples
+    {
+        Timer timer(cout, "  Multikey QSorting samples time: ", this->verbose());
+        VMSG_NL("Multikey QSorting " << _sampleSuffs.size() << " samples");
+        this->qsort(_sampleSuffs);
+    }
+    // Calculate bucket sizes
+    VMSG_NL("Calculating bucket sizes");
+    int limit = 5;
+    // Iterate until all buckets are less than
+    while(--limit >= 0) {
+        TIndexOffU numBuckets = (TIndexOffU)_sampleSuffs.size()+1;
+        AutoArray<tthread::thread*> threads(this->_nthreads);
+        EList<BinarySortingParam<TStr> > tparams;
+        for(int tid = 0; tid < this->_nthreads; tid++) {
+            // Calculate bucket sizes by doing a binary search for each
+            // suffix and noting where it lands
+            tparams.expand();
+            try {
+                // Allocate and initialize containers for holding bucket
+                // sizes and representatives.
+                tparams.back().bucketSzs.resizeExact(numBuckets);
+                tparams.back().bucketReps.resizeExact(numBuckets);
+                tparams.back().bucketSzs.fillZero();
+                tparams.back().bucketReps.fill(OFF_MASK);
+            } catch(bad_alloc &e) {
+                if(this->_passMemExc) {
+                    throw e; // rethrow immediately
+                } else {
+                    cerr << "Could not allocate sizes, representatives (" << ((numBuckets*8)>>10) << " KB) for blocks." << endl
+                    << "Please try using a smaller number of blocks by specifying a larger --bmax or a" << endl
+                    << "smaller --bmaxdivn." << endl;
+                    throw 1;
+                }
+            }
+            tparams.back().t = &t;
+            tparams.back().sampleSuffs = &_sampleSuffs;
+            tparams.back().begin = (tid == 0 ? 0 : len / this->_nthreads * tid);
+            tparams.back().end = (tid + 1 == this->_nthreads ? len : len / this->_nthreads * (tid + 1));
+            if(this->_nthreads == 1) {
+                BinarySorting_worker<TStr>((void*)&tparams.back());
+            } else {
+                threads[tid] = new tthread::thread(BinarySorting_worker<TStr>, (void*)&tparams.back());
+            }
+        }
+        
+        if(this->_nthreads > 1) {
+            for (int tid = 0; tid < this->_nthreads; tid++) {
+                threads[tid]->join();
+            }
+        }
+        
+        EList<TIndexOffU>& bucketSzs = tparams[0].bucketSzs;
+        EList<TIndexOffU>& bucketReps = tparams[0].bucketReps;
+        for(int tid = 1; tid < this->_nthreads; tid++) {
+            for(size_t j = 0; j < numBuckets; j++) {
+                bucketSzs[j] += tparams[tid].bucketSzs[j];
+                if(bucketReps[j] == OFF_MASK) {
+                    bucketReps[j] = tparams[tid].bucketReps[j];
+                }
+            }
+        }
+        // Check for large buckets and mergeable pairs of small buckets
+        // and split/merge as necessary
+        TIndexOff added = 0;
+        TIndexOff merged = 0;
+        assert_eq(bucketSzs.size(), numBuckets);
+        assert_eq(bucketReps.size(), numBuckets);
+        {
+            Timer timer(cout, "  Splitting and merging time: ", this->verbose());
+            VMSG_NL("Splitting and merging");
+            for(TIndexOffU i = 0; i < numBuckets; i++) {
+                TIndexOffU mergedSz = bsz + 1;
+                assert(bucketSzs[(size_t)i] == 0 || bucketReps[(size_t)i] != OFF_MASK);
+                if(i < numBuckets-1) {
+                    mergedSz = bucketSzs[(size_t)i] + bucketSzs[(size_t)i+1] + 1;
+                }
+                // Merge?
+                if(mergedSz <= bsz) {
+                    bucketSzs[(size_t)i+1] += (bucketSzs[(size_t)i]+1);
+                    // The following may look strange, but it's necessary
+                    // to ensure that the merged bucket has a representative
+                    bucketReps[(size_t)i+1] = _sampleSuffs[(size_t)i+added];
+                    _sampleSuffs.erase((size_t)i+added);
+                    bucketSzs.erase((size_t)i);
+                    bucketReps.erase((size_t)i);
+                    i--; // might go to -1 but ++ will overflow back to 0
+                    numBuckets--;
+                    merged++;
+                    assert_eq(numBuckets, _sampleSuffs.size()+1-added);
+                    assert_eq(numBuckets, bucketSzs.size());
+                }
+                // Split?
+                else if(bucketSzs[(size_t)i] > bsz) {
+                    // Add an additional sample from the bucketReps[]
+                    // set accumulated in the binarySASearch loop; this
+                    // effectively splits the bucket
+                    _sampleSuffs.insert(bucketReps[(size_t)i], (TIndexOffU)(i + (added++)));
+                }
+            }
+        }
+        if(added == 0) {
+            //if(this->verbose()) {
+            //	cout << "Final bucket sizes:" << endl;
+            //	cout << "  (begin): " << bucketSzs[0] << " (" << (int)(bsz - bucketSzs[0]) << ")" << endl;
+            //	for(uint32_t i = 1; i < numBuckets; i++) {
+            //		cout << "  " << bucketSzs[i] << " (" << (int)(bsz - bucketSzs[i]) << ")" << endl;
+            //	}
+            //}
+            break;
+        }
+        // Otherwise, continue until no more buckets need to be
+        // split
+        VMSG_NL("Split " << added << ", merged " << merged << "; iterating...");
+    }
+    // Do *not* force a do-over
+    //	if(limit == 0) {
+    //		VMSG_NL("Iterated too many times; trying again...");
+    //		buildSamples();
+    //	}
+    VMSG_NL("Avg bucket size: " << ((double)(len-_sampleSuffs.size()) / (_sampleSuffs.size()+1)) << " (target: " << bsz << ")");
 }
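
 [Editor's note] For clarity, a small standalone sketch (not part of the
 commit) of how the loop above partitions the suffix positions [0, len) across
 threads before launching BinarySorting_worker; the last thread absorbs any
 remainder, and the length and thread count below are hypothetical:

     // Sketch only: reproduce the begin/end computation used when launching
     // BinarySorting_worker for each thread.
     #include <cstdio>

     int main() {
         const size_t len = 1000003;  // hypothetical text length
         const int nthreads = 4;      // hypothetical thread count
         for(int tid = 0; tid < nthreads; tid++) {
             size_t begin = (tid == 0 ? 0 : len / nthreads * tid);
             size_t end   = (tid + 1 == nthreads ? len : len / nthreads * (tid + 1));
             std::printf("thread %d: [%zu, %zu)\n", tid, begin, end);
         }
         return 0;
     }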
 
 /**
@@ -774,166 +916,197 @@ bool KarkkainenBlockwiseSA<TStr>::suffixCmp(
  * of the blockwise suffix sorting process.
  */
 template<typename TStr>
-void KarkkainenBlockwiseSA<TStr>::nextBlock() {
-	EList<TIndexOffU>& bucket = this->_itrBucket;
-	VMSG_NL("Getting block " << (_cur+1) << " of " << _sampleSuffs.size()+1);
-	assert(_built);
-	assert_gt(_dcV, 3);
-	assert_leq(_cur, _sampleSuffs.size());
-	const TStr& t = this->text();
-	TIndexOffU len = (TIndexOffU)t.length();
-	// Set up the bucket
-	bucket.clear();
-	TIndexOffU lo = OFF_MASK, hi = OFF_MASK;
-	if(_sampleSuffs.size() == 0) {
-		// Special case: if _sampleSuffs is 0, then multikey-quicksort
-		// everything
-		VMSG_NL("  No samples; assembling all-inclusive block");
-		assert_eq(0, _cur);
-		try {
-			if(bucket.capacity() < this->bucketSz()) {
-				bucket.reserveExact(len+1);
-			}
-			bucket.resize(len);
-			for(TIndexOffU i = 0; i < len; i++) {
-				bucket[i] = i;
-			}
-		} catch(bad_alloc &e) {
-			if(this->_passMemExc) {
-				throw e; // rethrow immediately
-			} else {
-				cerr << "Could not allocate a master suffix-array block of " << ((len+1) * 4) << " bytes" << endl
-				     << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
-				     << "a larger --bmaxdivn" << endl;
-				throw 1;
-			}
-		}
-	} else {
-		try {
-			VMSG_NL("  Reserving size (" << this->bucketSz() << ") for bucket");
-			// BTL: Add a +100 fudge factor; there seem to be instances
-			// where a bucket ends up having one more elt than bucketSz()
-			if(bucket.size() < this->bucketSz()+100) {
-				bucket.reserveExact(this->bucketSz()+100);
-			}
-		} catch(bad_alloc &e) {
-			if(this->_passMemExc) {
-				throw e; // rethrow immediately
-			} else {
-				cerr << "Could not allocate a suffix-array block of " << ((this->bucketSz()+1) * 4) << " bytes" << endl;
-				cerr << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
-				     << "a larger --bmaxdivn" << endl;
-				throw 1;
-			}
-		}
-		// Select upper and lower bounds from _sampleSuffs[] and
-		// calculate the Z array up to the difference-cover periodicity
-		// for both.  Be careful about first/last buckets.
-		EList<TIndexOffU> zLo(EBWTB_CAT), zHi(EBWTB_CAT);
-		assert_geq(_cur, 0);
-		assert_leq(_cur, _sampleSuffs.size());
-		bool first = (_cur == 0);
-		bool last  = (_cur == _sampleSuffs.size());
-		try {
-			Timer timer(cout, "  Calculating Z arrays time: ", this->verbose());
-			VMSG_NL("  Calculating Z arrays");
-			if(!last) {
-				// Not the last bucket
-				assert_lt(_cur, _sampleSuffs.size());
-				hi = _sampleSuffs[_cur];
-				zHi.resizeExact(_dcV);
-				zHi.fillZero();
-				assert_eq(zHi[0], 0);
-				calcZ(t, hi, zHi, this->verbose(), this->sanityCheck());
-			}
-			if(!first) {
-				// Not the first bucket
-				assert_gt(_cur, 0);
-				assert_leq(_cur, _sampleSuffs.size());
-				lo = _sampleSuffs[_cur-1];
-				zLo.resizeExact(_dcV);
-				zLo.fillZero();
-				assert_gt(_dcV, 3);
-				assert_eq(zLo[0], 0);
-				calcZ(t, lo, zLo, this->verbose(), this->sanityCheck());
-			}
-		} catch(bad_alloc &e) {
-			if(this->_passMemExc) {
-				throw e; // rethrow immediately
-			} else {
-				cerr << "Could not allocate a z-array of " << (_dcV * 4) << " bytes" << endl;
-				cerr << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
-				     << "a larger --bmaxdivn" << endl;
-				throw 1;
-			}
-		}
-
-		// This is the most critical loop in the algorithm; this is where
-		// we iterate over all suffixes in the text and pick out those that
-		// fall into the current bucket.
-		//
-		// This loop is based on the SMALLERSUFFIXES function outlined on
-		// p7 of the "Fast BWT" paper
-		//
-		int64_t kHi = -1, kLo = -1;
-		int64_t jHi = -1, jLo = -1;
-		bool kHiSoft = false, kLoSoft = false;
-		assert_eq(0, bucket.size());
-		{
-			Timer timer(cout, "  Block accumulator loop time: ", this->verbose());
-			VMSG_NL("  Entering block accumulator loop:");
-			TIndexOffU lenDiv10 = (len + 9) / 10;
-			for(TIndexOffU iten = 0, ten = 0; iten < len; iten += lenDiv10, ten++) {
-			TIndexOffU itenNext = iten + lenDiv10;
-			if(ten > 0) VMSG_NL("  " << (ten * 10) << "%");
-			for(TIndexOffU i = iten; i < itenNext && i < len; i++) {
-				assert_lt(jLo, (TIndexOff)i); assert_lt(jHi, (TIndexOff)i);
-				// Advance the upper-bound comparison by one character
-				if(i == hi || i == lo) continue; // equal to one of the bookends
-				if(hi != OFF_MASK && !suffixCmp(hi, i, jHi, kHi, kHiSoft, zHi)) {
-					continue; // not in the bucket
-				}
-				if(lo != OFF_MASK && suffixCmp(lo, i, jLo, kLo, kLoSoft, zLo)) {
-					continue; // not in the bucket
-				}
-				// In the bucket! - add it
-				assert_lt(i, len);
-				try {
-					bucket.push_back(i);
-				} catch(bad_alloc &e) {
-					cerr << "Could not append element to block of " << ((bucket.size()) * OFF_SIZE) << " bytes" << endl;
-					if(this->_passMemExc) {
-						throw e; // rethrow immediately
-					} else {
-						cerr << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
-						     << "a larger --bmaxdivn" << endl;
-						throw 1;
-					}
-				}
-				// Not necessarily true; we allow overflowing buckets
-				// since we can't guarantee that a good set of sample
-				// suffixes can be found in a reasonable amount of time
-				//assert_lt(bucket.size(), this->bucketSz());
-			}
-			} // end loop over all suffixes of t
-			VMSG_NL("  100%");
-		}
-	} // end else clause of if(_sampleSuffs.size() == 0)
-	// Sort the bucket
-	if(bucket.size() > 0) {
-		Timer timer(cout, "  Sorting block time: ", this->verbose());
-		VMSG_NL("  Sorting block of length " << bucket.size());
-		this->qsort(bucket);
-	}
-	if(hi != OFF_MASK) {
-		// Not the final bucket; throw in the sample on the RHS
-		bucket.push_back(hi);
-	} else {
-		// Final bucket; throw in $ suffix
-		bucket.push_back(len);
-	}
-	VMSG_NL("Returning block of " << bucket.size());
-	_cur++; // advance to next bucket
+void KarkkainenBlockwiseSA<TStr>::nextBlock(int cur_block, int tid) {
+#ifndef NDEBUG
+    if(this->_nthreads > 1) {
+        assert_lt(tid, this->_itrBuckets.size());
+    }
+#endif
+    EList<TIndexOffU>& bucket = (this->_nthreads > 1 ? this->_itrBuckets[tid] : this->_itrBucket);
+    {
+        ThreadSafe ts(&_mutex, this->_nthreads > 1);
+        VMSG_NL("Getting block " << (cur_block+1) << " of " << _sampleSuffs.size()+1);
+    }
+    assert(_built);
+    assert_gt(_dcV, 3);
+    assert_leq(cur_block, _sampleSuffs.size());
+    const TStr& t = this->text();
+    TIndexOffU len = (TIndexOffU)t.length();
+    // Set up the bucket
+    bucket.clear();
+    TIndexOffU lo = OFF_MASK, hi = OFF_MASK;
+    if(_sampleSuffs.size() == 0) {
+        // Special case: if _sampleSuffs is 0, then multikey-quicksort
+        // everything
+        {
+            ThreadSafe ts(&_mutex, this->_nthreads > 1);
+            VMSG_NL("  No samples; assembling all-inclusive block");
+        }
+        assert_eq(0, cur_block);
+        try {
+            if(bucket.capacity() < this->bucketSz()) {
+                bucket.reserveExact(len+1);
+            }
+            bucket.resize(len);
+            for(TIndexOffU i = 0; i < len; i++) {
+                bucket[i] = i;
+            }
+        } catch(bad_alloc &e) {
+            if(this->_passMemExc) {
+                throw e; // rethrow immediately
+            } else {
+                cerr << "Could not allocate a master suffix-array block of " << ((len+1) * 4) << " bytes" << endl
+                << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
+                << "a larger --bmaxdivn" << endl;
+                throw 1;
+            }
+        }
+    } else {
+        try {
+            {
+                ThreadSafe ts(&_mutex, this->_nthreads > 1);
+                VMSG_NL("  Reserving size (" << this->bucketSz() << ") for bucket " << (cur_block+1));
+            }
+            // BTL: Add a +100 fudge factor; there seem to be instances
+            // where a bucket ends up having one more elt than bucketSz()
+            if(bucket.size() < this->bucketSz()+100) {
+                bucket.reserveExact(this->bucketSz()+100);
+            }
+        } catch(bad_alloc &e) {
+            if(this->_passMemExc) {
+                throw e; // rethrow immediately
+            } else {
+                cerr << "Could not allocate a suffix-array block of " << ((this->bucketSz()+1) * 4) << " bytes" << endl;
+                cerr << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
+                << "a larger --bmaxdivn" << endl;
+                throw 1;
+            }
+        }
+        // Select upper and lower bounds from _sampleSuffs[] and
+        // calculate the Z array up to the difference-cover periodicity
+        // for both.  Be careful about first/last buckets.
+        EList<TIndexOffU> zLo(EBWTB_CAT), zHi(EBWTB_CAT);
+        assert_geq(cur_block, 0);
+        assert_leq((size_t)cur_block, _sampleSuffs.size());
+        bool first = (cur_block == 0);
+        bool last  = ((size_t)cur_block == _sampleSuffs.size());
+        try {
+            // Timer timer(cout, "  Calculating Z arrays time: ", this->verbose());
+            {
+                ThreadSafe ts(&_mutex, this->_nthreads > 1);
+                VMSG_NL("  Calculating Z arrays for bucket " << (cur_block+1));
+            }
+            if(!last) {
+                // Not the last bucket
+                assert_lt(cur_block, _sampleSuffs.size());
+                hi = _sampleSuffs[cur_block];
+                zHi.resizeExact(_dcV);
+                zHi.fillZero();
+                assert_eq(zHi[0], 0);
+                calcZ(t, hi, zHi, this->verbose(), this->sanityCheck());
+            }
+            if(!first) {
+                // Not the first bucket
+                assert_gt(cur_block, 0);
+                assert_leq(cur_block, _sampleSuffs.size());
+                lo = _sampleSuffs[cur_block-1];
+                zLo.resizeExact(_dcV);
+                zLo.fillZero();
+                assert_gt(_dcV, 3);
+                assert_eq(zLo[0], 0);
+                calcZ(t, lo, zLo, this->verbose(), this->sanityCheck());
+            }
+        } catch(bad_alloc &e) {
+            if(this->_passMemExc) {
+                throw e; // rethrow immediately
+            } else {
+                cerr << "Could not allocate a z-array of " << (_dcV * 4) << " bytes" << endl;
+                cerr << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
+                << "a larger --bmaxdivn" << endl;
+                throw 1;
+            }
+        }
+        
+        // This is the most critical loop in the algorithm; this is where
+        // we iterate over all suffixes in the text and pick out those that
+        // fall into the current bucket.
+        //
+        // This loop is based on the SMALLERSUFFIXES function outlined on
+        // p7 of the "Fast BWT" paper
+        //
+        int64_t kHi = -1, kLo = -1;
+        int64_t jHi = -1, jLo = -1;
+        bool kHiSoft = false, kLoSoft = false;
+        assert_eq(0, bucket.size());
+        {
+            // Timer timer(cout, "  Block accumulator loop time: ", this->verbose());
+            {
+                ThreadSafe ts(&_mutex, this->_nthreads > 1);
+                VMSG_NL("  Entering block accumulator loop for bucket " << (cur_block+1) << ":");
+            }
+            TIndexOffU lenDiv10 = (len + 9) / 10;
+            for(TIndexOffU iten = 0, ten = 0; iten < len; iten += lenDiv10, ten++) {
+                TIndexOffU itenNext = iten + lenDiv10;
+                {
+                    ThreadSafe ts(&_mutex, this->_nthreads > 1);
+                    if(ten > 0) VMSG_NL("  bucket " << (cur_block+1) << ": " << (ten * 10) << "%");
+                }
+                for(TIndexOffU i = iten; i < itenNext && i < len; i++) {
+                    assert_lt(jLo, (TIndexOff)i); assert_lt(jHi, (TIndexOff)i);
+                    // Advance the upper-bound comparison by one character
+                    if(i == hi || i == lo) continue; // equal to one of the bookends
+                    if(hi != OFF_MASK && !suffixCmp(hi, i, jHi, kHi, kHiSoft, zHi)) {
+                        continue; // not in the bucket
+                    }
+                    if(lo != OFF_MASK && suffixCmp(lo, i, jLo, kLo, kLoSoft, zLo)) {
+                        continue; // not in the bucket
+                    }
+                    // In the bucket! - add it
+                    assert_lt(i, len);
+                    try {
+                        bucket.push_back(i);
+                    } catch(bad_alloc &e) {
+                        cerr << "Could not append element to block of " << ((bucket.size()) * OFF_SIZE) << " bytes" << endl;
+                        if(this->_passMemExc) {
+                            throw e; // rethrow immediately
+                        } else {
+                            cerr << "Please try using a larger number of blocks by specifying a smaller --bmax or" << endl
+                            << "a larger --bmaxdivn" << endl;
+                            throw 1;
+                        }
+                    }
+                    // Not necessarily true; we allow overflowing buckets
+                    // since we can't guarantee that a good set of sample
+                    // suffixes can be found in a reasonable amount of time
+                    //assert_lt(bucket.size(), this->bucketSz());
+                }
+            } // end loop over all suffixes of t
+            {
+                ThreadSafe ts(&_mutex, this->_nthreads > 1);
+                VMSG_NL("  bucket " << (cur_block+1) << ": 100%");
+            }
+        }
+    } // end else clause of if(_sampleSuffs.size() == 0)
+    // Sort the bucket
+    if(bucket.size() > 0) {
+        Timer timer(cout, "  Sorting block time: ", this->verbose());
+        {
+            ThreadSafe ts(&_mutex, this->_nthreads > 1);
+            VMSG_NL("  Sorting block of length " << bucket.size() << " for bucket " << (cur_block+1));
+        }
+        this->qsort(bucket);
+    }
+    if(hi != OFF_MASK) {
+        // Not the final bucket; throw in the sample on the RHS
+        bucket.push_back(hi);
+    } else {
+        // Final bucket; throw in $ suffix
+        bucket.push_back(len);
+    }
+    {
+        ThreadSafe ts(&_mutex, this->_nthreads > 1);
+        VMSG_NL("Returning block of " << bucket.size() << " for bucket " << (cur_block+1));
+    }
 }
 
 #endif /*BLOCKWISE_SA_H_*/
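
The threaded code above guards each progress message with ThreadSafe ts(&_mutex, this->_nthreads > 1), a scoped lock that is only taken when more than one thread is building the index. A minimal sketch of that pattern (std::mutex is used here only for illustration; the real ThreadSafe class in threading.h wraps MUTEX_T):

    // Conditional scoped lock in the spirit of ThreadSafe: the mutex is
    // acquired only when locking is requested, so a single-threaded build
    // pays no synchronization cost.
    #include <mutex>

    class ScopedMaybeLock {
    public:
        ScopedMaybeLock(std::mutex* m, bool locked) : m_(m), locked_(locked) {
            if (locked_) m_->lock();
        }
        ~ScopedMaybeLock() {
            if (locked_) m_->unlock();
        }
    private:
        std::mutex* m_;
        bool locked_;
    };

    // Usage mirroring the call sites above:
    //   ScopedMaybeLock ts(&mutex, nthreads > 1);
    //   ... emit progress message while the lock is held ...
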
diff --git a/bt2_build.cpp b/bt2_build.cpp
index 8bcbee0..e0d87ad 100644
--- a/bt2_build.cpp
+++ b/bt2_build.cpp
@@ -64,6 +64,7 @@ static bool packed;
 static bool writeRef;
 static bool justRef;
 static bool reverseEach;
+static int nthreads;
 static string wrapper;
 
 static void resetOptions() {
@@ -92,6 +93,7 @@ static void resetOptions() {
 	writeRef     = true;  // write compact reference to .3.gEbwt_ext/.4.gEbwt_ext
 	justRef      = false; // *just* write compact reference, don't index
 	reverseEach  = false;
+    nthreads     = 1;
 	wrapper.clear();
 }
 
@@ -108,6 +110,7 @@ enum {
 	ARG_USAGE,
 	ARG_REVERSE_EACH,
 	ARG_SA,
+    ARG_THREADS,
 	ARG_WRAPPER
 };
 
@@ -150,6 +153,7 @@ static void printUsage(ostream& out) {
 	    << "    -3/--justref            just build .3/.4 index files" << endl
 	    << "    -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)" << endl
 	    << "    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)" << endl
+        << "    --threads <int>         # of threads" << endl
 	    //<< "    --ntoa                  convert Ns in reference to As" << endl
 	    //<< "    --big --little          endianness (default: little, this host: "
 	    //<< (currentlyBigEndian()? "big":"little") << ")" << endl
@@ -197,6 +201,7 @@ static struct option long_options[] = {
 	{(char*)"color",        no_argument,       0,            'C'},
 	{(char*)"sa",           no_argument,       0,            ARG_SA},
 	{(char*)"reverse-each", no_argument,       0,            ARG_REVERSE_EACH},
+    {(char*)"threads",      required_argument, 0,            ARG_THREADS},
 	{(char*)"usage",        no_argument,       0,            ARG_USAGE},
 	{(char*)"wrapper",      required_argument, 0,            ARG_WRAPPER},
 	{(char*)0, 0, 0, 0} // terminator
@@ -299,6 +304,9 @@ static bool parseOptions(int argc, const char **argv) {
 				doSaFile = true;
 				break;
 			case ARG_NTOA: nsToAs = true; break;
+            case ARG_THREADS:
+                nthreads = parseNumber<int>(0, "--threads arg must be at least 1");
+                break;
 			case 'a': autoMem = false; break;
 			case 'q': verbose = false; break;
 			case 's': sanityCheck = true; break;
@@ -433,6 +441,7 @@ static void driver(
 		lineRate,
 		offRate,      // suffix-array sampling rate
 		ftabChars,    // number of chars in initial arrow-pair calc
+              nthreads,
 		outfile,      // basename for .?.ebwt files
 		reverse == 0, // fw
 		!entireSA,    // useBlockwise
diff --git a/bt2_dp.cpp b/bt2_dp.cpp
index 565518d..e7da53d 100644
--- a/bt2_dp.cpp
+++ b/bt2_dp.cpp
@@ -35,7 +35,6 @@ static int seed;            // srandom() seed
 static bool showVersion;    // just print version and quit?
 static uint32_t qUpto;      // max # of queries to read
 static int nthreads;        // number of pthreads operating concurrently
-static bool useSpinlock;    // false -> don't use of spinlocks even if they're #defines
 static uint32_t skipReads;  // # reads/read pairs to skip
 int gGapBarrier;            // # diags on top/bot only to be entered diagonally
 static int bonusMatchType;  // how to reward matches
@@ -70,7 +69,6 @@ static void resetOptions() {
 	showVersion				= false; // just print version and quit?
 	qUpto					= 0xffffffff; // max # of queries to read
 	nthreads				= 1;     // number of pthreads operating concurrently
-	useSpinlock				= true;  // false -> don't use of spinlocks even if they're #defines
 	skipReads				= 0;     // # reads/read pairs to skip
 	gGapBarrier				= 4;     // disallow gaps within this many chars of either end of alignment
 	bonusMatchType  = DEFAULT_MATCH_BONUS_TYPE;
diff --git a/bt2_idx.h b/bt2_idx.h
index 752cbd7..0b63148 100644
--- a/bt2_idx.h
+++ b/bt2_idx.h
@@ -580,6 +580,7 @@ public:
 		int32_t lineRate,
 		int32_t offRate,
 		int32_t ftabChars,
+        int nthreads,
 		const string& file,   // base filename for EBWT files
 		bool fw,
 		bool useBlockwise,
@@ -658,8 +659,10 @@ public:
 		    refparams,
 		    fout1,
 		    fout2,
+                             file,
 			saOut,
 			bwtOut,
+            nthreads,
 		    useBlockwise,
 		    bmax,
 		    bmaxSqrtMult,
@@ -918,8 +921,10 @@ public:
 	                    const RefReadInParams& refparams,
 	                    ofstream& out1,
 	                    ofstream& out2,
+                        const string& outfile,
 	                    ofstream* saOut,
 	                    ofstream* bwtOut,
+                        int nthreads,
 	                    bool useBlockwise,
 	                    TIndexOffU bmax,
 	                    TIndexOffU bmaxSqrtMult,
@@ -1050,6 +1055,7 @@ public:
 					// constructing the DifferenceCoverSample
 					dcv <<= 1;
 					TIndexOffU sz = (TIndexOffU)DifferenceCoverSample<TStr>::simulateAllocs(s, dcv >> 1);
+                    if(nthreads > 1) sz *= (nthreads + 1);
 					AutoArray<uint8_t> tmp(sz, EBWT_CAT);
 					dcv >>= 1;
 					// Likewise with the KarkkainenBlockwiseSA
@@ -1070,7 +1076,7 @@ public:
 					VMSG_NL("");
 				}
 				VMSG_NL("Constructing suffix-array element generator");
-				KarkkainenBlockwiseSA<TStr> bsa(s, bmax, dcv, seed, _sanity, _passMemExc, _verbose);
+				KarkkainenBlockwiseSA<TStr> bsa(s, bmax, nthreads, dcv, seed, _sanity, _passMemExc, _verbose, outfile);
 				assert(bsa.suffixItrIsReset());
 				assert_eq(bsa.size(), s.length()+1);
 				VMSG_NL("Converting suffix-array elements to index image");
diff --git a/bt2_search.cpp b/bt2_search.cpp
index 751367b..d28df28 100644
--- a/bt2_search.cpp
+++ b/bt2_search.cpp
@@ -90,7 +90,6 @@ static bool noRefNames;   // true -> print reference indexes; not names
 static uint32_t khits;    // number of hits per read; >1 is much slower
 static uint32_t mhits;    // don't report any hits if there are > mhits
 static int partitionSz;   // output a partitioning key in first field
-static bool useSpinlock;  // false -> don't use of spinlocks even if they're #defines
 static bool fileParallel; // separate threads read separate input files in parallel
 static bool useShmem;     // use shared memory to hold the index
 static bool useMm;        // use memory-mapped files to hold the index
@@ -280,7 +279,6 @@ static void resetOptions() {
 	khits					= 1;     // number of hits per read; >1 is much slower
 	mhits					= 50;    // stop after finding this many alignments+1
 	partitionSz				= 0;     // output a partitioning key in first field
-	useSpinlock				= true;  // false -> don't use of spinlocks even if they're #defines
 	fileParallel			= false; // separate threads read separate input files in parallel
 	useShmem				= false; // use shared memory to hold the index
 	useMm					= false; // use memory-mapped files to hold the index
@@ -785,9 +783,9 @@ static void printUsage(ostream& out) {
 		<< "  --met-stderr       send metrics to stderr (off)" << endl
 		<< "  --met <int>        report internal counters & metrics every <int> secs (1)" << endl
 	// Following is supported in the wrapper instead
-	    << "  --no-unal          supppress SAM records for unaligned reads" << endl
-	    << "  --no-head          supppress header lines, i.e. lines starting with @" << endl
-	    << "  --no-sq            supppress @SQ header lines" << endl
+	    << "  --no-unal          suppress SAM records for unaligned reads" << endl
+	    << "  --no-head          suppress header lines, i.e. lines starting with @" << endl
+	    << "  --no-sq            suppress @SQ header lines" << endl
 	    << "  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field" << endl
 	    << "  --rg <text>        add <text> (\"lab:value\") to @RG line of SAM header." << endl
 	    << "                     Note: @RG line only printed when --rg-id is set." << endl
@@ -1162,10 +1160,9 @@ static void parseOption(int next_option, const char *arg) {
 		case ARG_SAM_PRINT_YI: sam_print_yi = true; break;
 		case ARG_REORDER: reorder = true; break;
 		case ARG_MAPQ_EX: {
-			sam_print_zp = true;
-			// TODO: remove next line
-			sam_print_xss = true;
-			sam_print_yn = true;
+			//sam_print_zp = true;
+			//sam_print_xss = true;
+			//sam_print_yn = true;
 			sam_print_zt = true;
 			break;
 		}
@@ -1919,7 +1916,7 @@ struct PerfMetrics {
 				/* 118 */ "DPBtFiltStart"  "\t"
 				/* 119 */ "DPBtFiltScore"  "\t"
 				/* 120 */ "DpBtFiltDom"    "\t"
-
+#ifdef USE_MEM_TALLY
 				/* 121 */ "MemPeak"        "\t"
 				/* 122 */ "UncatMemPeak"   "\t" // 0
 				/* 123 */ "EbwtMemPeak"    "\t" // EBWT_CAT
@@ -1929,7 +1926,7 @@ struct PerfMetrics {
 				/* 127 */ "DPMemPeak"      "\t" // DP_CAT
 				/* 128 */ "MiscMemPeak"    "\t" // MISC_CAT
 				/* 129 */ "DebugMemPeak"   "\t" // DEBUG_CAT
-				
+#endif
 				"\n";
 			
 			if(name != NULL) {
@@ -2468,6 +2465,7 @@ struct PerfMetrics {
 		if(metricsStderr) stderrSs << buf << '\t';
 		if(o != NULL) { o->writeChars(buf); o->write('\t'); }
 		
+#ifdef USE_MEM_TALLY
 		// 121. Overall memory peak
 		itoa10<size_t>(gMemTally.peak() >> 20, buf);
 		if(metricsStderr) stderrSs << buf << '\t';
@@ -2504,6 +2502,7 @@ struct PerfMetrics {
 		itoa10<size_t>(gMemTally.peak(DEBUG_CAT) >> 20, buf);
 		if(metricsStderr) stderrSs << buf;
 		if(o != NULL) { o->writeChars(buf); }
+#endif
 
 		if(o != NULL) { o->write('\n'); }
 		if(metricsStderr) cerr << stderrSs.str().c_str() << endl;
@@ -4358,7 +4357,6 @@ static void driver(
 		format,        // file format
 		fileParallel,  // true -> wrap files with separate PairedPatternSources
 		seed,          // pseudo-random seed
-		useSpinlock,   // use spin locks instead of pthreads
 		solexaQuals,   // true -> qualities are on solexa64 scale
 		phred64Quals,  // true -> qualities are on phred64 scale
 		integerQuals,  // true -> qualities are space-separated numbers
@@ -4624,6 +4622,13 @@ extern "C" {
  */
 int bowtie(int argc, const char **argv) {
 	try {
+	#ifdef WITH_TBB
+	#ifdef WITH_AFFINITY
+		//CWILKS: adjust this depending on # of hyperthreads per core
+		pinning_observer pinner( 2 /* the number of hyper threads on each core */ );
+        	pinner.observe( true );
+	#endif
+	#endif
 		// Reset all global state, including getopt state
 		opterr = optind = 1;
 		resetOptions();
@@ -4736,6 +4741,13 @@ int bowtie(int argc, const char **argv) {
 			}
 			driver<SString<char> >("DNA", bt2index, outfile);
 		}
+	#ifdef WITH_TBB
+	#ifdef WITH_AFFINITY
+		// Always disable observation before observers destruction
+    		//tracker.observe( false );
+    		pinner.observe( false );
+	#endif
+	#endif
 		return 0;
 	} catch(std::exception& e) {
 		cerr << "Error: Encountered exception: '" << e.what() << "'" << endl;
diff --git a/diff_sample.h b/diff_sample.h
index a0ebd31..6099421 100644
--- a/diff_sample.h
+++ b/diff_sample.h
@@ -496,7 +496,7 @@ public:
 	const EList<uint32_t>& dmap() const  { return _dmap; }
 	ostream& log() const                 { return _logger; }
 
-	void     build();
+	void     build(int nthreads);
 	uint32_t tieBreakOff(TIndexOffU i, TIndexOffU j) const;
 	int64_t  breakTie(TIndexOffU i, TIndexOffU j) const;
 	bool     isCovered(TIndexOffU i) const;
@@ -670,126 +670,219 @@ static inline bool suffixSameUpTo(
 	return true;
 }
 
+template<typename TStr>
+struct VSortingParam {
+    DifferenceCoverSample<TStr>* dcs;
+    TIndexOffU*                  sPrimeArr;
+    size_t                       sPrimeSz;
+    TIndexOffU*                  sPrimeOrderArr;
+    size_t                       depth;
+    const EList<size_t>*         boundaries;
+    size_t*                      cur;
+    MUTEX_T*                     mutex;
+};
+
+template<typename TStr>
+static void VSorting_worker(void *vp)
+{
+    VSortingParam<TStr>* param = (VSortingParam<TStr>*)vp;
+    DifferenceCoverSample<TStr>* dcs = param->dcs;
+    const TStr& host = dcs->text();
+    const size_t hlen = host.length();
+    uint32_t v = dcs->v();
+    while(true) {
+        size_t cur = 0;
+        {
+            ThreadSafe ts(param->mutex, true);
+            cur = *(param->cur);
+            (*param->cur)++;
+        }
+        if(cur >= param->boundaries->size()) return;
+        size_t begin = (cur == 0 ? 0 : (*param->boundaries)[cur-1]);
+        size_t end = (*param->boundaries)[cur];
+        assert_leq(begin, end);
+        if(end - begin <= 1) continue;
+        mkeyQSortSuf2(
+                      host,
+                      hlen,
+                      param->sPrimeArr,
+                      param->sPrimeSz,
+                      param->sPrimeOrderArr,
+                      4,
+                      begin,
+                      end,
+                      param->depth,
+                      v);
+    }
+}
+
 /**
  * Calculates a ranking of all suffixes in the sample and stores them,
  * packed according to the mu mapping, in _isaPrime.
  */
 template <typename TStr>
-void DifferenceCoverSample<TStr>::build() {
-	// Local names for relevant types
-	VMSG_NL("Building DifferenceCoverSample");
-	// Local names for relevant data
-	const TStr& t = this->text();
-	uint32_t v = this->v();
-	assert_gt(v, 2);
-	// Build s'
-	EList<TIndexOffU> sPrime;
-	// Need to allocate 2 extra elements at the end of the sPrime and _isaPrime
-	// arrays.  One element that's less than all others, and another that acts
-	// as needed padding for the Larsson-Sadakane sorting code.
-	size_t padding = 1;
-	VMSG_NL("  Building sPrime");
-	buildSPrime(sPrime, padding);
-	size_t sPrimeSz = sPrime.size() - padding;
-	assert_gt(sPrime.size(), padding);
-	assert_leq(sPrime.size(), t.length() + padding + 1);
-	TIndexOffU nextRank = 0;
-	{
-		VMSG_NL("  Building sPrimeOrder");
-		EList<TIndexOffU> sPrimeOrder;
-		sPrimeOrder.resizeExact(sPrimeSz);
-		for(TIndexOffU i = 0; i < sPrimeSz; i++) {
-			sPrimeOrder[i] = i;
-		}
-		// sPrime now holds suffix-offsets for DC samples.
-		{
-			Timer timer(cout, "  V-Sorting samples time: ", this->verbose());
-			VMSG_NL("  V-Sorting samples");
-			// Extract backing-store array from sPrime and sPrimeOrder;
-			// the mkeyQSortSuf2 routine works on the array for maximum
-			// efficiency
-			TIndexOffU *sPrimeArr = (TIndexOffU*)sPrime.ptr();
-			assert_eq(sPrimeArr[0], sPrime[0]);
-			assert_eq(sPrimeArr[sPrimeSz-1], sPrime[sPrimeSz-1]);
-			TIndexOffU *sPrimeOrderArr = (TIndexOffU*)sPrimeOrder.ptr();
-			assert_eq(sPrimeOrderArr[0], sPrimeOrder[0]);
-			assert_eq(sPrimeOrderArr[sPrimeSz-1], sPrimeOrder[sPrimeSz-1]);
-			// Sort sample suffixes up to the vth character using a
-			// multikey quicksort.  Sort time is proportional to the
-			// number of samples times v.  It isn't quadratic.
-			// sPrimeOrder is passed in as a swapping partner for
-			// sPrimeArr, i.e., every time the multikey qsort swaps
-			// elements in sPrime, it swaps the same elements in
-			// sPrimeOrder too.  This allows us to easily reconstruct
-			// what the sort did.
-			mkeyQSortSuf2(t, sPrimeArr, sPrimeSz, sPrimeOrderArr, 4,
-			              this->verbose(), this->sanityCheck(), v);
-			// Make sure sPrime and sPrimeOrder are consistent with
-			// their respective backing-store arrays
-			assert_eq(sPrimeArr[0], sPrime[0]);
-			assert_eq(sPrimeArr[sPrimeSz-1], sPrime[sPrimeSz-1]);
-			assert_eq(sPrimeOrderArr[0], sPrimeOrder[0]);
-			assert_eq(sPrimeOrderArr[sPrimeSz-1], sPrimeOrder[sPrimeSz-1]);
-		}
-		// Now assign the ranking implied by the sorted sPrime/sPrimeOrder
-		// arrays back into sPrime.
-		VMSG_NL("  Allocating rank array");
-		_isaPrime.resizeExact(sPrime.size());
-		ASSERT_ONLY(_isaPrime.fill(OFF_MASK));
-		assert_gt(_isaPrime.size(), 0);
-		{
-			Timer timer(cout, "  Ranking v-sort output time: ", this->verbose());
-			VMSG_NL("  Ranking v-sort output");
-			for(size_t i = 0; i < sPrimeSz-1; i++) {
-				// Place the appropriate ranking
-				_isaPrime[sPrimeOrder[i]] = nextRank;
-				// If sPrime[i] and sPrime[i+1] are identical up to v, then we
-				// should give the next suffix the same rank
-				if(!suffixSameUpTo(t, sPrime[i], sPrime[i+1], v)) nextRank++;
-			}
-			_isaPrime[sPrimeOrder[sPrimeSz-1]] = nextRank; // finish off
+void DifferenceCoverSample<TStr>::build(int nthreads) {
+    // Local names for relevant types
+    VMSG_NL("Building DifferenceCoverSample");
+    // Local names for relevant data
+    const TStr& t = this->text();
+    uint32_t v = this->v();
+    assert_gt(v, 2);
+    // Build s'
+    EList<TIndexOffU> sPrime;
+    // Need to allocate 2 extra elements at the end of the sPrime and _isaPrime
+    // arrays.  One element that's less than all others, and another that acts
+    // as needed padding for the Larsson-Sadakane sorting code.
+    size_t padding = 1;
+    VMSG_NL("  Building sPrime");
+    buildSPrime(sPrime, padding);
+    size_t sPrimeSz = sPrime.size() - padding;
+    assert_gt(sPrime.size(), padding);
+    assert_leq(sPrime.size(), t.length() + padding + 1);
+    TIndexOffU nextRank = 0;
+    {
+        VMSG_NL("  Building sPrimeOrder");
+        EList<TIndexOffU> sPrimeOrder;
+        sPrimeOrder.resizeExact(sPrimeSz);
+        for(TIndexOffU i = 0; i < sPrimeSz; i++) {
+            sPrimeOrder[i] = i;
+        }
+        // sPrime now holds suffix-offsets for DC samples.
+        {
+            Timer timer(cout, "  V-Sorting samples time: ", this->verbose());
+            VMSG_NL("  V-Sorting samples");
+            // Extract backing-store array from sPrime and sPrimeOrder;
+            // the mkeyQSortSuf2 routine works on the array for maximum
+            // efficiency
+            TIndexOffU *sPrimeArr = (TIndexOffU*)sPrime.ptr();
+            assert_eq(sPrimeArr[0], sPrime[0]);
+            assert_eq(sPrimeArr[sPrimeSz-1], sPrime[sPrimeSz-1]);
+            TIndexOffU *sPrimeOrderArr = (TIndexOffU*)sPrimeOrder.ptr();
+            assert_eq(sPrimeOrderArr[0], sPrimeOrder[0]);
+            assert_eq(sPrimeOrderArr[sPrimeSz-1], sPrimeOrder[sPrimeSz-1]);
+            // Sort sample suffixes up to the vth character using a
+            // multikey quicksort.  Sort time is proportional to the
+            // number of samples times v.  It isn't quadratic.
+            // sPrimeOrder is passed in as a swapping partner for
+            // sPrimeArr, i.e., every time the multikey qsort swaps
+            // elements in sPrime, it swaps the same elements in
+            // sPrimeOrder too.  This allows us to easily reconstruct
+            // what the sort did.
+            if(nthreads == 1) {
+                mkeyQSortSuf2(t, sPrimeArr, sPrimeSz, sPrimeOrderArr, 4,
+                              this->verbose(), this->sanityCheck(), v);
+            } else {
+                int query_depth = 0;
+                int tmp_nthreads = nthreads;
+                while(tmp_nthreads > 0) {
+                    query_depth++;
+                    tmp_nthreads >>= 1;
+                }
+                EList<size_t> boundaries; // bucket boundaries for parallelization
+                TIndexOffU *sOrig = NULL;
+                if(this->sanityCheck()) {
+                    sOrig = new TIndexOffU[sPrimeSz];
+                    memcpy(sOrig, sPrimeArr, OFF_SIZE * sPrimeSz);
+                }
+                mkeyQSortSuf2(t, sPrimeArr, sPrimeSz, sPrimeOrderArr, 4,
+                              this->verbose(), false, query_depth, &boundaries);
+                if(boundaries.size() > 0) {
+                    AutoArray<tthread::thread*> threads(nthreads);
+                    EList<VSortingParam<TStr> > tparams;
+                    size_t cur = 0;
+                    MUTEX_T mutex;
+                    for(int tid = 0; tid < nthreads; tid++) {
+                        // Calculate bucket sizes by doing a binary search for each
+                        // suffix and noting where it lands
+                        tparams.expand();
+                        tparams.back().dcs = this;
+                        tparams.back().sPrimeArr = sPrimeArr;
+                        tparams.back().sPrimeSz = sPrimeSz;
+                        tparams.back().sPrimeOrderArr = sPrimeOrderArr;
+                        tparams.back().depth = query_depth;
+                        tparams.back().boundaries = &boundaries;
+                        tparams.back().cur = &cur;
+                        tparams.back().mutex = &mutex;
+                        threads[tid] = new tthread::thread(VSorting_worker<TStr>, (void*)&tparams.back());
+                    }
+                    for (int tid = 0; tid < nthreads; tid++) {
+                        threads[tid]->join();
+                    }
+                }
+                if(this->sanityCheck()) {
+                    sanityCheckOrderedSufs(t, t.length(), sPrimeArr, sPrimeSz, v);
+                    for(size_t i = 0; i < sPrimeSz; i++) {
+                        assert_eq(sPrimeArr[i], sOrig[sPrimeOrderArr[i]]);
+                    }
+                    delete[] sOrig;
+                }
+            }
+            // Make sure sPrime and sPrimeOrder are consistent with
+            // their respective backing-store arrays
+            assert_eq(sPrimeArr[0], sPrime[0]);
+            assert_eq(sPrimeArr[sPrimeSz-1], sPrime[sPrimeSz-1]);
+            assert_eq(sPrimeOrderArr[0], sPrimeOrder[0]);
+            assert_eq(sPrimeOrderArr[sPrimeSz-1], sPrimeOrder[sPrimeSz-1]);
+        }
+        // Now assign the ranking implied by the sorted sPrime/sPrimeOrder
+        // arrays back into sPrime.
+        VMSG_NL("  Allocating rank array");
+        _isaPrime.resizeExact(sPrime.size());
+        ASSERT_ONLY(_isaPrime.fill(OFF_MASK));
+        assert_gt(_isaPrime.size(), 0);
+        {
+            Timer timer(cout, "  Ranking v-sort output time: ", this->verbose());
+            VMSG_NL("  Ranking v-sort output");
+            for(size_t i = 0; i < sPrimeSz-1; i++) {
+                // Place the appropriate ranking
+                _isaPrime[sPrimeOrder[i]] = nextRank;
+                // If sPrime[i] and sPrime[i+1] are identical up to v, then we
+                // should give the next suffix the same rank
+                if(!suffixSameUpTo(t, sPrime[i], sPrime[i+1], v)) nextRank++;
+            }
+            _isaPrime[sPrimeOrder[sPrimeSz-1]] = nextRank; // finish off
 #ifndef NDEBUG
-			for(size_t i = 0; i < sPrimeSz; i++) {
-				assert_neq(OFF_MASK, _isaPrime[i]);
-				assert_lt(_isaPrime[i], sPrimeSz);
-			}
+            for(size_t i = 0; i < sPrimeSz; i++) {
+                assert_neq(OFF_MASK, _isaPrime[i]);
+                assert_lt(_isaPrime[i], sPrimeSz);
+            }
 #endif
-		}
-		// sPrimeOrder is destroyed
-		// All the information we need is now in _isaPrime
-	}
-	_isaPrime[_isaPrime.size()-1] = (TIndexOffU)sPrimeSz;
-	sPrime[sPrime.size()-1] = (TIndexOffU)sPrimeSz;
-	// _isaPrime[_isaPrime.size()-1] and sPrime[sPrime.size()-1] are just
-	// spacer for the Larsson-Sadakane routine to use
-	{
-		Timer timer(cout, "  Invoking Larsson-Sadakane on ranks time: ", this->verbose());
-		VMSG_NL("  Invoking Larsson-Sadakane on ranks");
-		if(sPrime.size() >= LS_SIZE) {
-			cerr << "Error; sPrime array has so many elements that it can't be converted to a signed array without overflow." << endl;
-			throw 1;
-		}
-		LarssonSadakane<TIndexOff> ls;
-		ls.suffixsort(
-			(TIndexOff*)_isaPrime.ptr(),
-			(TIndexOff*)sPrime.ptr(),
-			(TIndexOff)sPrimeSz,
-			(TIndexOff)sPrime.size(),
-			0);
-	}
-	// chop off final character of _isaPrime
-	_isaPrime.resizeExact(sPrimeSz);
-	for(size_t i = 0; i < _isaPrime.size(); i++) {
-		_isaPrime[i]--;
-	}
+        }
+        // sPrimeOrder is destroyed
+        // All the information we need is now in _isaPrime
+    }
+    _isaPrime[_isaPrime.size()-1] = (TIndexOffU)sPrimeSz;
+    sPrime[sPrime.size()-1] = (TIndexOffU)sPrimeSz;
+    // _isaPrime[_isaPrime.size()-1] and sPrime[sPrime.size()-1] are just
+    // spacer for the Larsson-Sadakane routine to use
+    {
+        Timer timer(cout, "  Invoking Larsson-Sadakane on ranks time: ", this->verbose());
+        VMSG_NL("  Invoking Larsson-Sadakane on ranks");
+        if(sPrime.size() >= LS_SIZE) {
+            cerr << "Error: sPrime array has too many elements to be converted to a signed array without overflow." << endl;
+            throw 1;
+        }
+        LarssonSadakane<TIndexOff> ls;
+        ls.suffixsort(
+                      (TIndexOff*)_isaPrime.ptr(),
+                      (TIndexOff*)sPrime.ptr(),
+                      (TIndexOff)sPrimeSz,
+                      (TIndexOff)sPrime.size(),
+                      0);
+    }
+    // chop off final character of _isaPrime
+    _isaPrime.resizeExact(sPrimeSz);
+    for(size_t i = 0; i < _isaPrime.size(); i++) {
+        _isaPrime[i]--;
+    }
 #ifndef NDEBUG
-	for(size_t i = 0; i < sPrimeSz-1; i++) {
-		assert_lt(_isaPrime[i], sPrimeSz);
-		assert(i == 0 || _isaPrime[i] != _isaPrime[i-1]);
-	}
+    for(size_t i = 0; i < sPrimeSz-1; i++) {
+        assert_lt(_isaPrime[i], sPrimeSz);
+        assert(i == 0 || _isaPrime[i] != _isaPrime[i-1]);
+    }
 #endif
-	VMSG_NL("  Sanity-checking and returning");
-	if(this->sanityCheck()) doBuiltSanityCheck();
+    VMSG_NL("  Sanity-checking and returning");
+    if(this->sanityCheck()) doBuiltSanityCheck();
 }
 
 /**
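
The parallel V-sorting path above works in two phases: a single-threaded multikey quicksort to a shallow depth derived from the thread count, which records bucket boundaries, followed by nthreads workers that each claim the next bucket from a shared counter and sort it independently. A rough sketch of that shared-cursor distribution (hypothetical names; the real worker is VSorting_worker above):

    // Each thread claims the next bucket index under a mutex, then sorts the
    // half-open range [boundaries[cur-1], boundaries[cur]) without further
    // synchronization.
    #include <cstddef>
    #include <mutex>
    #include <vector>

    static void vsortWorker(const std::vector<std::size_t>& boundaries,
                            std::size_t* cur,
                            std::mutex* mu) {
        while (true) {
            std::size_t mine;
            {
                std::lock_guard<std::mutex> lock(*mu);
                mine = (*cur)++;
            }
            if (mine >= boundaries.size()) return;  // no buckets left
            std::size_t begin = (mine == 0) ? 0 : boundaries[mine - 1];
            std::size_t end   = boundaries[mine];
            if (end - begin <= 1) continue;         // already sorted
            // ... multikey-quicksort the suffixes in [begin, end) here ...
        }
    }
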
diff --git a/ds.h b/ds.h
index 7e322c4..8e8045a 100644
--- a/ds.h
+++ b/ds.h
@@ -99,7 +99,9 @@ protected:
 	uint64_t peak_;
 };
 
+#ifdef USE_MEM_TALLY
 extern MemoryTally gMemTally;
+#endif
 
 /**
  * A simple fixed-length array of type T, automatically freed in the
@@ -112,7 +114,11 @@ public:
 	AutoArray(size_t sz, int cat = 0) : cat_(cat) {
 		t_ = NULL;
 		t_ = new T[sz];
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#else
+		(void)cat_;
+#endif
 		memset(t_, 0, sz * sizeof(T));
 		sz_ = sz;
 	}
@@ -120,7 +126,9 @@ public:
 	~AutoArray() {
 		if(t_ != NULL) {
 			delete[] t_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 		}
 	}
 	
@@ -177,16 +185,22 @@ public:
 		assert(p_ == NULL);
 		p_ = p;
 		freeable_ = freeable;
+#ifdef USE_MEM_TALLY
 		if(p != NULL && freeable_) {
 			gMemTally.add(cat_, sizeof(T));
 		}
+#else
+		(void)cat_;
+#endif
 	}
 	
 	void free() {
 		if(p_ != NULL) {
 			if(freeable_) {
 				delete p_;
+#ifdef USE_MEM_TALLY
 				gMemTally.del(cat_, sizeof(T));
+#endif
 			}
 			p_ = NULL;
 		}
@@ -240,16 +254,22 @@ public:
 		p_ = p;
 		sz_ = sz;
 		freeable_ = freeable;
+#ifdef USE_MEM_TALLY
 		if(p != NULL && freeable_) {
 			gMemTally.add(cat_, sizeof(T) * sz_);
 		}
+#else
+		(void)cat_;
+#endif
 	}
 	
 	void free() {
 		if(p_ != NULL) {
 			if(freeable_) {
 				delete[] p_;
+#ifdef USE_MEM_TALLY
 				gMemTally.del(cat_, sizeof(T) * sz_);
+#endif
 			}
 			p_ = NULL;
 		}
@@ -319,7 +339,12 @@ public:
 	 * Allocate initial default of S elements.
 	 */
 	explicit EList() :
-		cat_(0), allocCat_(-1), list_(NULL), sz_(S), cur_(0) { }
+		cat_(0), allocCat_(-1), list_(NULL), sz_(S), cur_(0)
+	{
+#ifndef USE_MEM_TALLY
+		(void)cat_;
+#endif
+	}
 
 	/**
 	 * Allocate initial default of S elements.
@@ -897,7 +922,9 @@ private:
 	T *alloc(size_t sz) {
 		T* tmp = new T[sz];
 		assert(tmp != NULL);
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		allocCat_ = cat_;
 		return tmp;
 	}
@@ -911,7 +938,9 @@ private:
 			assert_neq(-1, allocCat_);
 			assert_eq(allocCat_, cat_);
 			delete[] list_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 			sz_ = cur_ = 0;
 		}
@@ -1015,6 +1044,9 @@ public:
 	explicit ELList(int cat = 0) :
 		cat_(cat), list_(NULL), sz_(S2), cur_(0)
 	{
+#ifndef USE_MEM_TALLY
+		(void)cat_;
+#endif
 		assert_geq(cat, 0);
 	}
 
@@ -1256,7 +1288,9 @@ protected:
 	EList<T, S1> *alloc(size_t sz) {
 		assert_gt(sz, 0);
 		EList<T, S1> *tmp = new EList<T, S1>[sz];
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		if(cat_ != 0) {
 			for(size_t i = 0; i < sz; i++) {
 				assert(tmp[i].ptr() == NULL);
@@ -1273,7 +1307,9 @@ protected:
 	void free() {
 		if(list_ != NULL) {
 			delete[] list_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 		}
 	}
@@ -1592,7 +1628,9 @@ protected:
 	ELList<T, S1, S2> *alloc(size_t sz) {
 		assert_gt(sz, 0);
 		ELList<T, S1, S2> *tmp = new ELList<T, S1, S2>[sz];
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		if(cat_ != 0) {
 			for(size_t i = 0; i < sz; i++) {
 				assert(tmp[i].ptr() == NULL);
@@ -1609,7 +1647,9 @@ protected:
 	void free() {
 		if(list_ != NULL) {
 			delete[] list_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 		}
 	}
@@ -1679,6 +1719,9 @@ public:
 		sz_(0),
 		cur_(0)
 	{
+#ifndef USE_MEM_TALLY
+		(void)cat_;
+#endif
 		if(sz_ > 0) {
 			list_ = alloc(sz_);
 		}
@@ -1880,7 +1923,9 @@ private:
 	T *alloc(size_t sz) {
 		assert_gt(sz, 0);
 		T *tmp = new T[sz];
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		return tmp;
 	}
 
@@ -1891,7 +1936,9 @@ private:
 	void free() {
 		if(list_ != NULL) {
 			delete[] list_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 		}
 	}
@@ -2026,6 +2073,9 @@ public:
 	explicit ELSet(int cat = 0) :
 		cat_(cat), list_(NULL), sz_(S), cur_(0)
 	{
+#ifndef USE_MEM_TALLY
+		(void)cat_;
+#endif
 		assert_geq(cat, 0);
 	}
 
@@ -2272,7 +2322,9 @@ protected:
 	ESet<T> *alloc(size_t sz) {
 		assert_gt(sz, 0);
 		ESet<T> *tmp = new ESet<T>[sz];
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		if(cat_ != 0) {
 			for(size_t i = 0; i < sz; i++) {
 				assert(tmp[i].ptr() == NULL);
@@ -2289,7 +2341,9 @@ protected:
 	void free() {
 		if(list_ != NULL) {
 			delete[] list_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 		}
 	}
@@ -2360,6 +2414,9 @@ public:
 		sz_(128),
 		cur_(0)
 	{
+#ifndef USE_MEM_TALLY
+		(void)cat_;
+#endif
 		list_ = alloc(sz_);
 	}
 
@@ -2539,7 +2596,9 @@ private:
 	std::pair<K, V> *alloc(size_t sz) {
 		assert_gt(sz, 0);
 		std::pair<K, V> *tmp = new std::pair<K, V>[sz];
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		return tmp;
 	}
 
@@ -2550,7 +2609,9 @@ private:
 	void free() {
 		if(list_ != NULL) {
 			delete[] list_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 		}
 	}
@@ -3015,7 +3076,12 @@ public:
 	{
 		for(size_t i = 0; i < ((bytes+pagesz-1)/pagesz); i++) {
 			pages_.push_back(new uint8_t[pagesz]);
+#ifdef USE_MEM_TALLY
 			gMemTally.add(cat, pagesz);
+#else
+			(void)cat_;
+			(void)pagesz_;
+#endif
 			assert(pages_.back() != NULL);
 		}
 		assert(repOk());
@@ -3028,7 +3094,11 @@ public:
 		for(size_t i = 0; i < pages_.size(); i++) {
 			assert(pages_[i] != NULL);
 			delete[] pages_[i];
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, pagesz_);
+#else
+			(void)cat_;
+#endif
 		}
 	}
 
diff --git a/multikey_qsort.h b/multikey_qsort.h
index 5c2e041..eea455f 100644
--- a/multikey_qsort.h
+++ b/multikey_qsort.h
@@ -496,100 +496,134 @@ void mkeyQSortSuf(
 	if(sanityCheck) sanityCheckOrderedSufs(host, hlen, s, slen, upto);
 }
 
-/**
- * Just like mkeyQSortSuf but all swaps are applied to s2 as well as s.
- * This is a helpful variant if, for example, the caller would like to
- * see how their input was permuted by the sort routine (in that case,
- * the caller would let s2 be an array s2[] where s2 is the same length
- * as s and s2[i] = i).
- */
+struct QSortRange {
+    size_t begin;
+    size_t end;
+    size_t depth;
+};
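+/**
+ * Iterative variant of mkeyQSortSuf that applies every swap to s2 as well as
+ * s.  The old recursion is replaced by an explicit list of pending ranges;
+ * when 'boundaries' is non-NULL, the end offsets of the buckets produced up
+ * to depth 'upto' are recorded so callers can finish sorting those buckets
+ * in parallel (see the threaded path in DifferenceCoverSample::build).
+ */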
 template<typename T>
 void mkeyQSortSuf2(
-	const T& host,
-	size_t hlen,
-	TIndexOffU *s,
-	size_t slen,
-	TIndexOffU *s2,
-	int hi,
-	size_t begin,
-	size_t end,
-	size_t depth,
-	size_t upto = OFF_MASK)
+                   const T& host,
+                   size_t hlen,
+                   TIndexOffU *s,
+                   size_t slen,
+                   TIndexOffU *s2,
+                   int hi,
+                   size_t _begin,
+                   size_t _end,
+                   size_t _depth,
+                   size_t upto = OFF_MASK,
+                   EList<size_t>* boundaries = NULL)
 {
-	// Helper for making the recursive call; sanity-checks arguments to
-	// make sure that the problem actually got smaller.
-	#define MQS_RECURSE_SUF_DS(nbegin, nend, ndepth) { \
-		assert(nbegin > begin || nend < end || ndepth > depth); \
-		if(ndepth < upto) { /* don't exceed depth of 'upto' */ \
-			mkeyQSortSuf2(host, hlen, s, slen, s2, hi, nbegin, nend, ndepth, upto); \
-		} \
-	}
-	assert_leq(begin, slen);
-	assert_leq(end, slen);
-	size_t a, b, c, d, /*e,*/ r;
-	size_t n = end - begin;
-	if(n <= 1) return;                 // 1-element list already sorted
-	CHOOSE_AND_SWAP_PIVOT(SWAP2, CHAR_AT_SUF); // pick pivot, swap it into [begin]
-	int v = CHAR_AT_SUF(begin, depth); // v <- randomly-selected pivot value
-	#ifndef NDEBUG
-	{
-		bool stillInBounds = false;
-		for(size_t i = begin; i < end; i++) {
-			if(depth < (hlen-s[i])) {
-				stillInBounds = true;
-				break;
-			} else { /* already fell off this suffix */ }
-		}
-		assert(stillInBounds); // >=1 suffix must still be in bounds
-	}
-	#endif
-	a = b = begin;
-	c = d = /*e =*/ end-1;
-	while(true) {
-		// Invariant: everything before a is = pivot, everything
-		// between a and b is <
-		int bc = 0; // shouldn't have to init but gcc on Mac complains
-		while(b <= c && v >= (bc = CHAR_AT_SUF(b, depth))) {
-			if(v == bc) {
-				SWAP2(s, s2, a, b); a++;
-			}
-			b++;
-		}
-		// Invariant: everything after d is = pivot, everything
-		// between c and d is >
-		int cc = 0; // shouldn't have to init but gcc on Mac complains
-		while(b <= c && v <= (cc = CHAR_AT_SUF(c, depth))) {
-			if(v == cc) {
-				SWAP2(s, s2, c, d); d--; /*e--;*/
-			}
-			//else if(c == e && v == hi) e--;
-			c--;
-		}
-		if(b > c) break;
-		SWAP2(s, s2, b, c);
-		b++;
-		c--;
-	}
-	assert(a > begin || c < end-1);                      // there was at least one =s
-	assert_lt(/*e*/d-c, n); // they can't all have been > pivot
-	assert_lt(b-a, n); // they can't all have been < pivot
-	assert(assertPartitionedSuf(host, s, slen, hi, v, begin, end, depth));  // check pre-=-swap invariant
-	r = min(a-begin, b-a); VECSWAP2(s, s2, begin, b-r,   r);  // swap left = to center
-	r = min(d-c, end-d-1); VECSWAP2(s, s2, b,     end-r, r);  // swap right = to center
-	assert(assertPartitionedSuf2(host, s, slen, hi, v, begin, end, depth)); // check post-=-swap invariant
-	r = b-a; // r <- # of <'s
-	if(r > 0) {
-		MQS_RECURSE_SUF_DS(begin, begin + r, depth); // recurse on <'s
-	}
-	// Do not recurse on ='s if the pivot was the off-the-end value;
-	// they're already fully sorted
-	if(v != hi) {
-		MQS_RECURSE_SUF_DS(begin + r, begin + r + (a-begin) + (end-d-1), depth+1); // recurse on ='s
-	}
-	r = d-c;   // r <- # of >'s excluding those exhausted
-	if(r > 0 && v < hi-1) {
-		MQS_RECURSE_SUF_DS(end-r, end, depth); // recurse on >'s
-	}
+    ELList<QSortRange, 3, 1024> block_list;
+    while(true) {
+        size_t begin = 0, end = 0, depth = 0;
+        if(block_list.size() == 0) {
+            begin = _begin;
+            end = _end;
+            depth = _depth;
+        } else {
+            if(block_list.back().size() > 0) {
+                begin = block_list.back()[0].begin;
+                end = block_list.back()[0].end;
+                depth = block_list.back()[0].depth;
+                block_list.back().erase(0);
+            } else {
+                block_list.resize(block_list.size() - 1);
+                if(block_list.size() == 0) {
+                    break;
+                }
+            }
+        }
+        if(depth == upto) {
+            if(boundaries != NULL) {
+                (*boundaries).push_back(end);
+            }
+            continue;
+        }
+        assert_leq(begin, slen);
+        assert_leq(end, slen);
+        size_t a, b, c, d, /*e,*/ r;
+        size_t n = end - begin;
+        if(n <= 1) { // 1-element list already sorted
+            if(n == 1 && boundaries != NULL) {
+                boundaries->push_back(end);
+            }
+            continue;
+        }
+        CHOOSE_AND_SWAP_PIVOT(SWAP2, CHAR_AT_SUF); // pick pivot, swap it into [begin]
+        int v = CHAR_AT_SUF(begin, depth); // v <- randomly-selected pivot value
+#ifndef NDEBUG
+        {
+            bool stillInBounds = false;
+            for(size_t i = begin; i < end; i++) {
+                if(depth < (hlen-s[i])) {
+                    stillInBounds = true;
+                    break;
+                } else { /* already fell off this suffix */ }
+            }
+            assert(stillInBounds); // >=1 suffix must still be in bounds
+        }
+#endif
+        a = b = begin;
+        c = d = /*e =*/ end-1;
+        while(true) {
+            // Invariant: everything before a is = pivot, everything
+            // between a and b is <
+            int bc = 0; // shouldn't have to init but gcc on Mac complains
+            while(b <= c && v >= (bc = CHAR_AT_SUF(b, depth))) {
+                if(v == bc) {
+                    SWAP2(s, s2, a, b); a++;
+                }
+                b++;
+            }
+            // Invariant: everything after d is = pivot, everything
+            // between c and d is >
+            int cc = 0; // shouldn't have to init but gcc on Mac complains
+            while(b <= c && v <= (cc = CHAR_AT_SUF(c, depth))) {
+                if(v == cc) {
+                    SWAP2(s, s2, c, d); d--; /*e--;*/
+                }
+                //else if(c == e && v == hi) e--;
+                c--;
+            }
+            if(b > c) break;
+            SWAP2(s, s2, b, c);
+            b++;
+            c--;
+        }
+        assert(a > begin || c < end-1);                      // there was at least one =s
+        assert_lt(/*e*/d-c, n); // they can't all have been > pivot
+        assert_lt(b-a, n); // they can't all have been < pivot
+        assert(assertPartitionedSuf(host, s, slen, hi, v, begin, end, depth));  // check pre-=-swap invariant
+        r = min(a-begin, b-a); VECSWAP2(s, s2, begin, b-r,   r);  // swap left = to center
+        r = min(d-c, end-d-1); VECSWAP2(s, s2, b,     end-r, r);  // swap right = to center
+        assert(assertPartitionedSuf2(host, s, slen, hi, v, begin, end, depth)); // check post-=-swap invariant
+        r = b-a; // r <- # of <'s
+        block_list.expand();
+        block_list.back().clear();
+        if(r > 0) { // recurse on <'s
+            block_list.back().expand();
+            block_list.back().back().begin = begin;
+            block_list.back().back().end = begin + r;
+            block_list.back().back().depth = depth;
+        }
+        // Do not recurse on ='s if the pivot was the off-the-end value;
+        // they're already fully sorted
+        if(v != hi) { // recurse on ='s
+            block_list.back().expand();
+            block_list.back().back().begin = begin + r;
+            block_list.back().back().end = begin + r + (a-begin) + (end-d-1);
+            block_list.back().back().depth = depth + 1;
+        }
+        r = d-c;   // r <- # of >'s excluding those exhausted
+        if(r > 0 && v < hi-1) { // recurse on >'s
+            block_list.back().expand();
+            block_list.back().back().begin = end - r;
+            block_list.back().back().end = end;
+            block_list.back().back().depth = depth;
+        }
+    }
 }
 
 /**
@@ -598,30 +632,31 @@ void mkeyQSortSuf2(
  */
 template<typename T>
 void mkeyQSortSuf2(
-	const T& host,
-	TIndexOffU *s,
-	size_t slen,
-	TIndexOffU *s2,
-	int hi,
-	bool verbose = false,
-	bool sanityCheck = false,
-	size_t upto = OFF_MASK)
+                   const T& host,
+                   TIndexOffU *s,
+                   size_t slen,
+                   TIndexOffU *s2,
+                   int hi,
+                   bool verbose = false,
+                   bool sanityCheck = false,
+                   size_t upto = OFF_MASK,
+                   EList<size_t>* boundaries = NULL)
 {
-	size_t hlen = host.length();
-	if(sanityCheck) sanityCheckInputSufs(s, slen);
-	TIndexOffU *sOrig = NULL;
-	if(sanityCheck) {
-		sOrig = new TIndexOffU[slen];
-		memcpy(sOrig, s, OFF_SIZE * slen);
-	}
-	mkeyQSortSuf2(host, hlen, s, slen, s2, hi, (size_t)0, slen, (size_t)0, upto);
-	if(sanityCheck) {
-		sanityCheckOrderedSufs(host, hlen, s, slen, upto);
-		for(size_t i = 0; i < slen; i++) {
-			assert_eq(s[i], sOrig[s2[i]]);
-		}
-		delete[] sOrig;
-	}
+    size_t hlen = host.length();
+    if(sanityCheck) sanityCheckInputSufs(s, slen);
+    TIndexOffU *sOrig = NULL;
+    if(sanityCheck) {
+        sOrig = new TIndexOffU[slen];
+        memcpy(sOrig, s, OFF_SIZE * slen);
+    }
+    mkeyQSortSuf2(host, hlen, s, slen, s2, hi, (size_t)0, slen, (size_t)0, upto, boundaries);
+    if(sanityCheck) {
+        sanityCheckOrderedSufs(host, hlen, s, slen, upto);
+        for(size_t i = 0; i < slen; i++) {
+            assert_eq(s[i], sOrig[s2[i]]);
+        }
+        delete[] sOrig;
+    }
 }
 
 // Ugly but necessary; otherwise the compiler chokes dramatically on
@@ -971,77 +1006,102 @@ static void selectionSortSufDcU8(
 
 template<typename T1, typename T2>
 static void bucketSortSufDcU8(
-		const T1& host1,
-		const T2& host,
-        size_t hlen,
-        TIndexOffU* s,
-        size_t slen,
-        const DifferenceCoverSample<T1>& dc,
-        uint8_t hi,
-        size_t begin,
-        size_t end,
-        size_t depth,
-        bool sanityCheck = false)
+                              const T1& host1,
+                              const T2& host,
+                              size_t hlen,
+                              TIndexOffU* s,
+                              size_t slen,
+                              const DifferenceCoverSample<T1>& dc,
+                              uint8_t hi,
+                              size_t _begin,
+                              size_t _end,
+                              size_t _depth,
+                              bool sanityCheck = false)
 {
-	size_t cnts[] = { 0, 0, 0, 0, 0 };
-	#define BKT_RECURSE_SUF_DC_U8(nbegin, nend) { \
-		bucketSortSufDcU8<T1,T2>(host1, host, hlen, s, slen, dc, hi, \
-		                         (nbegin), (nend), depth+1, sanityCheck); \
-	}
-	assert_gt(end, begin);
-	assert_leq(end-begin, BUCKET_SORT_CUTOFF);
-	assert_eq(hi, 4);
-	if(end == begin+1) return; // 1-element list already sorted
-	if(depth > dc.v()) {
-		// Quicksort the remaining suffixes using difference cover
-		// for constant-time comparisons; this is O(k*log(k)) where
-		// k=(end-begin)
-		qsortSufDcU8<T1,T2>(host1, host, hlen, s, slen, dc, begin, end, sanityCheck);
-		return;
-	}
-	if(end-begin <= SELECTION_SORT_CUTOFF) {
-		// Bucket sort remaining items
-		selectionSortSufDcU8(host1, host, hlen, s, slen, dc, hi,
-		                     begin, end, depth, sanityCheck);
-		if(sanityCheck) {
-			sanityCheckOrderedSufs(host1, hlen, s, slen,
-			                       OFF_MASK, begin, end);
-		}
-		return;
-	}
-	for(size_t i = begin; i < end; i++) {
-		size_t off = depth + s[i];
-		uint8_t c = (off < hlen) ? get_uint8(host, off) : hi;
-		assert_leq(c, 4);
-		if(c == 0) {
-			s[begin + cnts[0]++] = s[i];
-		} else {
-			bkts[c-1][cnts[c]++] = s[i];
-		}
-	}
-	assert_eq(cnts[0] + cnts[1] + cnts[2] + cnts[3] + cnts[4], end - begin);
-	size_t cur = begin + cnts[0];
-	if(cnts[1] > 0) { memcpy(&s[cur], bkts[0], cnts[1] << (OFF_SIZE/4 + 1)); cur += cnts[1]; }
-	if(cnts[2] > 0) { memcpy(&s[cur], bkts[1], cnts[2] << (OFF_SIZE/4 + 1)); cur += cnts[2]; }
-	if(cnts[3] > 0) { memcpy(&s[cur], bkts[2], cnts[3] << (OFF_SIZE/4 + 1)); cur += cnts[3]; }
-	if(cnts[4] > 0) { memcpy(&s[cur], bkts[3], cnts[4] << (OFF_SIZE/4 + 1)); }
-	// This frame is now totally finished with bkts[][], so recursive
-	// callees can safely clobber it; we're not done with cnts[], but
-	// that's local to the stack frame.
-	cur = begin;
-	if(cnts[0] > 0) {
-		BKT_RECURSE_SUF_DC_U8(cur, cur + cnts[0]); cur += cnts[0];
-	}
-	if(cnts[1] > 0) {
-		BKT_RECURSE_SUF_DC_U8(cur, cur + cnts[1]); cur += cnts[1];
-	}
-	if(cnts[2] > 0) {
-		BKT_RECURSE_SUF_DC_U8(cur, cur + cnts[2]); cur += cnts[2];
-	}
-	if(cnts[3] > 0) {
-		BKT_RECURSE_SUF_DC_U8(cur, cur + cnts[3]);
-	}
-	// Done
+    // Four heap-allocated buckets (4M entries each) for bucket-sorting C, G, T, $;
+    // suffixes starting with A (c == 0) are written back into s[] in place
+    TIndexOffU* bkts[4];
+    for(size_t i = 0; i < 4; i++) {
+        bkts[i] = new TIndexOffU[4 * 1024 * 1024];
+    }
+    ELList<size_t, 5, 1024> block_list;
+    bool first = true;
+    while(true) {
+        size_t begin = 0, end = 0;
+        if(first) {
+            begin = _begin;
+            end = _end;
+            first = false;
+        } else {
+            if(block_list.size() == 0) {
+                break;
+            }
+            if(block_list.back().size() > 1) {
+                end = block_list.back().back(); block_list.back().pop_back();
+                begin = block_list.back().back();
+            } else {
+                block_list.resize(block_list.size() - 1);
+                if(block_list.size() == 0) {
+                    break;
+                }
+            }
+        }
+        size_t depth = block_list.size() + _depth;
+        assert_leq(end-begin, BUCKET_SORT_CUTOFF);
+        assert_eq(hi, 4);
+        if(end <= begin + 1) { // 1-element list already sorted
+            continue;
+        }
+        if(depth > dc.v()) {
+            // Quicksort the remaining suffixes using difference cover
+            // for constant-time comparisons; this is O(k*log(k)) where
+            // k=(end-begin)
+            qsortSufDcU8<T1,T2>(host1, host, hlen, s, slen, dc, begin, end, sanityCheck);
+            continue;
+        }
+        if(end-begin <= SELECTION_SORT_CUTOFF) {
+            // Selection sort remaining items
+            selectionSortSufDcU8(host1, host, hlen, s, slen, dc, hi,
+                                 begin, end, depth, sanityCheck);
+            if(sanityCheck) {
+                sanityCheckOrderedSufs(host1, hlen, s, slen,
+                                       OFF_MASK, begin, end);
+            }
+            continue;
+        }
+        size_t cnts[] = { 0, 0, 0, 0, 0 };
+        for(size_t i = begin; i < end; i++) {
+            size_t off = depth + s[i];
+            uint8_t c = (off < hlen) ? get_uint8(host, off) : hi;
+            assert_leq(c, 4);
+            if(c == 0) {
+                s[begin + cnts[0]++] = s[i];
+            } else {
+                bkts[c-1][cnts[c]++] = s[i];
+            }
+        }
+        assert_eq(cnts[0] + cnts[1] + cnts[2] + cnts[3] + cnts[4], end - begin);
+        size_t cur = begin + cnts[0];
+        if(cnts[1] > 0) { memcpy(&s[cur], bkts[0], cnts[1] << (OFF_SIZE/4 + 1)); cur += cnts[1]; }
+        if(cnts[2] > 0) { memcpy(&s[cur], bkts[1], cnts[2] << (OFF_SIZE/4 + 1)); cur += cnts[2]; }
+        if(cnts[3] > 0) { memcpy(&s[cur], bkts[2], cnts[3] << (OFF_SIZE/4 + 1)); cur += cnts[3]; }
+        if(cnts[4] > 0) { memcpy(&s[cur], bkts[3], cnts[4] << (OFF_SIZE/4 + 1)); }
+        // This frame is now totally finished with bkts[][], so recursive
+        // callees can safely clobber it; we're not done with cnts[], but
+        // that's local to the stack frame.
+        block_list.expand();
+        block_list.back().clear();
+        block_list.back().push_back(begin);
+        for(size_t i = 0; i < 4; i++) {
+            if(cnts[i] > 0) {
+                block_list.back().push_back(block_list.back().back() + cnts[i]);
+            }
+        }
+    }
+    // Done
+    
+    for(size_t i = 0; i < 4; i++) {
+        delete [] bkts[i];
+    }
 }
 
 /**
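
Both mkeyQSortSuf2 and bucketSortSufDcU8 above were rewritten from recursive form into a loop over an explicit work list, which bounds stack depth and makes it easy to stop partitioning at a fixed depth. The skeleton of that transformation, with hypothetical names:

    // Each partitioning step pushes the sub-ranges it would have recursed on;
    // the outer loop pops ranges until none remain.
    #include <cstddef>
    #include <vector>

    struct Range { std::size_t begin, end, depth; };

    template <typename PartitionFn>
    void sortIteratively(std::size_t begin, std::size_t end, PartitionFn partition) {
        std::vector<Range> work;
        work.push_back({begin, end, 0});
        while (!work.empty()) {
            Range r = work.back();
            work.pop_back();
            if (r.end - r.begin <= 1) continue;  // trivially sorted
            // partition() reorders [r.begin, r.end) in place and returns the
            // sub-ranges (<, =, >) that still need sorting.
            for (const Range& sub : partition(r)) {
                work.push_back(sub);
            }
        }
    }
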
diff --git a/pat.h b/pat.h
index d260724..cf3e92b 100644
--- a/pat.h
+++ b/pat.h
@@ -98,7 +98,6 @@ struct PatternParams {
 		int format_,
 		bool fileParallel_,
 		uint32_t seed_,
-		bool useSpinlock_,
 		bool solexa64_,
 		bool phred64_,
 		bool intQuals_,
@@ -109,7 +108,6 @@ struct PatternParams {
 		format(format_),
 		fileParallel(fileParallel_),
 		seed(seed_),
-		useSpinlock(useSpinlock_),
 		solexa64(solexa64_),
 		phred64(phred64_),
 		intQuals(intQuals_),
@@ -121,7 +119,6 @@ struct PatternParams {
 	int format;           // file format
 	bool fileParallel;    // true -> wrap files with separate PairedPatternSources
 	uint32_t seed;        // pseudo-random seed
-	bool useSpinlock;     // use spin locks instead of pthreads
 	bool solexa64;        // true -> qualities are on solexa64 scale
 	bool phred64;         // true -> qualities are on phred64 scale
 	bool intQuals;        // true -> qualities are space-separated numbers
@@ -147,7 +144,6 @@ public:
 		readCnt_(0),
 		numWrappers_(0),
 		doLocking_(true),
-		useSpinlock_(p.useSpinlock),
 		mutex()
 	{
 	}
@@ -264,10 +260,6 @@ protected:
 
 	int numWrappers_;      /// # threads that own a wrapper for this PatternSource
 	bool doLocking_;       /// override whether to lock (true = don't override)
-	/// User can ask to use the normal pthreads-style lock even if
-	/// spinlocks is enabled and compiled in.  This is sometimes better
-	/// if we expect bad I/O latency on some reads.
-	bool useSpinlock_;
 	MUTEX_T mutex;
 };
 
@@ -938,7 +930,7 @@ public:
 	FastaPatternSource(const EList<string>& infiles,
 	                   const PatternParams& p) :
 		BufferedFilePatternSource(infiles, p),
-		first_(true), solexa64_(p.solexa64), phred64_(p.phred64), intQuals_(p.intQuals)
+		first_(true)
 	{ }
 	virtual void reset() {
 		first_ = true;
@@ -993,9 +985,6 @@ protected:
 	
 private:
 	bool first_;
-	bool solexa64_;
-	bool phred64_;
-	bool intQuals_;
 };
 
 
diff --git a/processor_support.h b/processor_support.h
index f68ee65..b07d8dd 100644
--- a/processor_support.h
+++ b/processor_support.h
@@ -44,8 +44,8 @@ public:
 
     try {
 #if ( defined(USING_INTEL_COMPILER) || defined(USING_MSC_COMPILER) )
-        __cpuid((void *) &regs,0); // test if __cpuid() works, if not catch the exception
-        __cpuid((void *) &regs,0x1); // POPCNT bit is bit 23 in ECX
+        __cpuid((int *) &regs,0); // test if __cpuid() works, if not catch the exception
+        __cpuid((int *) &regs,0x1); // POPCNT bit is bit 23 in ECX
 #elif defined(USING_GCC_COMPILER)
         __get_cpuid(0x1, &regs.EAX, &regs.EBX, &regs.ECX, &regs.EDX);
 #else
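
For reference, the feature test these casts feed into reads CPUID leaf 1, where POPCNT support is reported in bit 23 of ECX. A self-contained sketch of that check on the GCC path:

    // Detect POPCNT via CPUID leaf 1 (ECX bit 23) using GCC's <cpuid.h>
    // helper; returns false if the leaf is unsupported.
    #include <cpuid.h>

    static bool hasPopcnt() {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid(0x1, &eax, &ebx, &ecx, &edx)) {
            return false;
        }
        return (ecx & (1u << 23)) != 0;
    }
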
diff --git a/scripts/debug_wrapper.pl b/scripts/debug_wrapper.pl
deleted file mode 100755
index b970c07..0000000
--- a/scripts/debug_wrapper.pl
+++ /dev/null
@@ -1,21 +0,0 @@
-#!/usr/bin/env perl
-use strict;
-use File::Spec;
-use Path::Class;
-
-# A helper for debugging
-my ($vol,$script_path,$prog);
-$prog = File::Spec->rel2abs( __FILE__ );
-($vol,$script_path,$prog) = File::Spec->splitpath($prog);
-my $bw_path = dir(File::Spec->splitdir($script_path))->parent();
-
-my $bw = File::Spec->catpath(
-            $vol,
-            $bw_path,
-            'bowtie2'
-            );
-
-exec $bw, @ARGV or die ("Fail to exec! $bw\n");
-
-
-
diff --git a/sse_util.h b/sse_util.h
index 929b4f6..8edb1a8 100644
--- a/sse_util.h
+++ b/sse_util.h
@@ -233,7 +233,9 @@ private:
 		}
 		assert_eq(0, (tmpint & 0xf)); // should be 16-byte aligned
 		assert(tmp != NULL);
+#ifdef USE_MEM_TALLY
 		gMemTally.add(cat_, sz);
+#endif
 		return tmp;
 	}
 
@@ -244,7 +246,9 @@ private:
 	void free() {
 		if(list_ != NULL) {
 			delete[] last_alloc_;
+#ifdef USE_MEM_TALLY
 			gMemTally.del(cat_, sz_);
+#endif
 			list_ = NULL;
 			sz_ = cur_ = 0;
 		}
diff --git a/threading.h b/threading.h
index 25356c2..ec9b504 100644
--- a/threading.h
+++ b/threading.h
@@ -25,7 +25,14 @@
 #ifdef WITH_TBB
 # include <tbb/mutex.h>
 # include <tbb/spin_mutex.h>
-# include <tbb/task_group.h>
+# ifdef WITH_AFFINITY
+#  include <cstdlib>
+#  include <sched.h>
+#  include <tbb/task_group.h>
+#  include <tbb/task_scheduler_observer.h>
+#  include <tbb/atomic.h>
+#  include <tbb/task_scheduler_init.h>
+# endif
 #else
 # include "tinythread.h"
 # include "fast_mutex.h"
@@ -69,4 +76,104 @@ private:
 	MUTEX_T *ptr_mutex;
 };
 
+#ifdef WITH_TBB
+#ifdef WITH_AFFINITY
+// Ripped entirely from:
+// https://software.intel.com/en-us/blogs/2013/10/31/applying-intel-threading-building-blocks-observers-for-thread-affinity-on-intel
+class concurrency_tracker: public tbb::task_scheduler_observer {
+    tbb::atomic<int> num_threads;
+public:
+    concurrency_tracker() : num_threads() { observe(true); }
+    /*override*/ void on_scheduler_entry( bool ) { ++num_threads; }
+    /*override*/ void on_scheduler_exit( bool ) { --num_threads; }
+
+    int get_concurrency() { return num_threads; }
+};
+
+class pinning_observer: public tbb::task_scheduler_observer {
+    cpu_set_t *mask;
+    int ncpus;
+
+    const int pinning_step;
+    tbb::atomic<int> thread_index;
+public:
+    pinning_observer( int pinning_step=1 ) : pinning_step(pinning_step), thread_index() {
+        for ( ncpus = sizeof(cpu_set_t)/CHAR_BIT; ncpus < 16*1024 /* some reasonable limit */; ncpus <<= 1 ) {
+            mask = CPU_ALLOC( ncpus );
+            if ( !mask ) break;
+            const size_t size = CPU_ALLOC_SIZE( ncpus );
+            CPU_ZERO_S( size, mask );
+            const int err = sched_getaffinity( 0, size, mask );
+            if ( !err ) break;
+
+            CPU_FREE( mask );
+            mask = NULL;
+            if ( errno != EINVAL )  break;
+        }
+        if ( !mask )
+            std::cout << "Warning: Failed to obtain process affinity mask. Thread affinitization is disabled." << std::endl;
+    }
+
+/*override*/ void on_scheduler_entry( bool ) {
+    if ( !mask ) return;
+
+    const size_t size = CPU_ALLOC_SIZE( ncpus );
+    const int num_cpus = CPU_COUNT_S( size, mask );
+    int thr_idx =
+//cwilks: we're one interface version lower than what
+//is required for task arena (7000 vs. 7001)
+#if USE_TASK_ARENA_CURRENT_SLOT
+        tbb::task_arena::current_slot();
+#else
+        thread_index++;
+#endif
+#if __MIC__
+    thr_idx += 1; // To avoid logical thread zero for the master thread on Intel(R) Xeon Phi(tm)
+#endif
+    thr_idx %= num_cpus; // To limit unique number in [0; num_cpus-1] range
+
+        // Place threads with specified step
+        int cpu_idx = 0;
+        for ( int i = 0, offset = 0; i<thr_idx; ++i ) {
+            cpu_idx += pinning_step;
+            if ( cpu_idx >= num_cpus )
+                cpu_idx = ++offset;
+        }
+
+        // Find index of 'cpu_idx'-th bit equal to 1
+        int mapped_idx = -1;
+        while ( cpu_idx >= 0 ) {
+            if ( CPU_ISSET_S( ++mapped_idx, size, mask ) )
+                --cpu_idx;
+        }
+
+        cpu_set_t *target_mask = CPU_ALLOC( ncpus );
+        CPU_ZERO_S( size, target_mask );
+        CPU_SET_S( mapped_idx, size, target_mask );
+        const int err = sched_setaffinity( 0, size, target_mask );
+
+        //std::cout << "Just set affinity for thread " << thr_idx << "\n";
+        if ( err ) {
+            std::cout << "Failed to set thread affinity!\n";
+            exit( EXIT_FAILURE );
+        }
+#if LOG_PINNING
+        else {
+            std::stringstream ss;
+            ss << "Set thread affinity: Thread " << thr_idx << ": CPU " << mapped_idx << std::endl;
+            std::cerr << ss.str();
+        }
+#endif
+        CPU_FREE( target_mask );
+    }
+
+    ~pinning_observer() {
+        if ( mask )
+            CPU_FREE( mask );
+    }
+};
+
+#endif
+#endif
+
 #endif

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/bowtie2.git


