[med-svn] [rtax] 01/01: Imported Upstream version 0.984
Andreas Tille
tille at debian.org
Thu Nov 28 10:05:20 UTC 2013
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository rtax.
commit cb0bfb7e562f2a17dac9959f78bf91419150a90f
Author: Andreas Tille <tille at debian.org>
Date: Thu Nov 28 11:08:34 2013 +0100
Imported Upstream version 0.984
---
README | 139 ++++
changelog | 61 ++
greengenes/greengenes.README | 69 ++
greengenes/prepare-greengenes | 76 +++
rtax | 285 ++++++++
scripts/FastaIndex.pm | 136 ++++
scripts/LevelPrediction.pm | 71 ++
scripts/classificationQualityFilter.pl | 51 ++
scripts/classificationQualityStripper.pl | 7 +
scripts/fastaidx.pl | 22 +
scripts/greengenesExtract.pl | 49 ++
scripts/joinDelimitedFasta.pl | 38 ++
scripts/parseUclustClusters.pl | 52 ++
scripts/rtaxSearch.pl | 1062 ++++++++++++++++++++++++++++++
scripts/rtaxVote.pl | 286 ++++++++
scripts/splitDelimitedFasta.pl | 46 ++
16 files changed, 2450 insertions(+)
diff --git a/README b/README
new file mode 100644
index 0000000..d9be133
--- /dev/null
+++ b/README
@@ -0,0 +1,139 @@
+RTAX: Rapid and accurate taxonomic classification of short paired-end
+ sequence reads from the 16S ribosomal RNA gene.
+
+David A. W. Soergel (1*), Rob Knight (2), and Steven E. Brenner (1)
+
+1 Department of Plant and Microbial Biology,
+ University of California, Berkeley
+2 Howard Hughes Medical Institute and Department of Chemistry
+ and Biochemistry, University of Colorado at Boulder
+* Corresponding author (current address): soergel at cs.umass.edu
+
+Version 0.984 (August 7, 2013)
+
+http://dev.davidsoergel.com/rtax
+
+
+Requirements
+------------
+
+ * Hardware: memory requirements vary with the size of the reference database;
+ for the version of GreenGenes we provide, ~4GB is needed.
+
+ * USEARCH (http://www.drive5.com/usearch/). RTAX has been tested with
+ v4.1.93 and v5.0.144. Make sure that the program is in your search
+ path as "usearch", or provide its path explicitly in the "rtax" script.
+ Note that v5.x uses quite a bit more memory, so 4.x may be preferred.
+
+ * A reference database consisting of sequences for which taxonomy assignments
+ are known. We provide prepared versions of the GreenGenes database on the
+ RTAX web site; see the GreenGenes section below for details.
+
+ Two files are needed:
+ 1. a FASTA file containing the sequences, and
+ 2. a file listing taxonomy strings for each sequence ID found in the
+ FASTA file. The format of this file is
+
+sequenceid pcidwidth kingdom; phylum; class; order; ...
+
+ one entry per line, where sequenceid, pcidwidth, and the taxonomy
+ string are separated by a single tab, and the taxonomy string itself
+ is delimited either by semicolons or by tabs. The pcidwidth column
+ is optional, and (if present) is ignored in the current version of
+ RTAX anyway. (In our usage, we cluster the reference database into
+ 99% id OTUs, representing each by a single "seed" sequence. This
+ column then lists the largest pairwise %id difference between the
+ cluster seed and any cluster member.)
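+
+   For example, a single hypothetical entry (columns separated by tabs):
+
+SEQ4567	0.71	Bacteria; Proteobacteria; Gammaproteobacteria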
+
+
+Installation
+------------
+
+RTAX consists of a set of perl scripts and a shell script ("rtax") to wire
+them together. No compilation is needed; just extract the package to a
+convenient location.
+
+The perl scripts must remain in the "scripts" directory below the "rtax" shell
+script, but the latter can be symlinked anywhere for convenience. A common
+pattern might be to place the whole package at /usr/local/rtax, and then
+symlink the script into your path (e.g., "ln -s /usr/local/rtax/rtax
+/usr/local/bin").
+
+
+Running RTAX
+------------
+
+Sequence classification is done as follows:
+
+ rtax -r gg.nr.fasta -t gg.nr.taxonomy -a queryA.fasta [-b queryB.fasta] -o result.out
+
+Substitute a different reference database for the GreenGenes files if desired,
+of course.
+
+If two query files are provided, then they must contain mate-paired reads
+with exactly matching identifiers (though they need not be in the same order).
+Any ids present in one file but not the other are reported as NOMATEPAIR by
+default, or classified in single-ended mode when the -f fallback option is
+given. Alternate naming schemes for the paired reads (e.g., "SOMEID.a" paired
+with "SOMEID.b", and so forth) are handled via the -i option.
+
+RTAX considers query sequences in the forward sense with respect to the
+reference database. Thus, in a paired-end experiment, the reverse reads will
+likely need to be reverse-complemented with the -y option.
+
+RTAX may be run on single-ended reads as well; simply omit the -b parameter
+in this case.
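+
+For example:
+
+   rtax -r gg.nr.fasta -t gg.nr.taxonomy -a query.fasta -o result.out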
+
+Run "rtax" with no arguments for a help message describing additional options.
+
+Various progress messages are provided on stderr, and the predicted
+classification for each query is provided in the file given by the -o option.
+The output format is essentially the same as the taxonomy file input format
+above. The second column indicates the best %id observed between the query
+sequence and any reference sequence, and the taxonomy string is tab-delimited.
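+
+For example, a single (hypothetical) line of output, with tab-separated
+columns:
+
+SEQ4567	99.2	Bacteria	Proteobacteria	Gammaproteobacteria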
+
+Temporary files are created in /tmp/rtax.SOMENUMBER (or under the temp
+directory given by -m). These can be quite large; please use -m to select a
+location with sufficient free space, as needed. The temporary directory is
+deleted upon successful termination, but may need to be manually cleaned up in
+the event of an error of some sort.
+
+
+Using the GreenGenes reference database
+---------------------------------------
+
+An RTAX-formatted reference database based on GreenGenes (version of March 11,
+2011) is available at
+http://dev.davidsoergel.com/rtax/greengenes.reference.20110311.tgz. That
+contains gg.nr.fasta and gg.nr.taxonomy, which are the result of clustering
+the GreenGenes input file at 99% identity, finding consensus taxonomy strings
+for each cluster, and formatting the result as needed for use with RTAX.
+
+Please see the "greengenes" subdirectory in the RTAX distribution for
+information on using newer versions of GreenGenes.
+
+
+Preparing a non-GreenGenes reference database
+---------------------------------------------
+
+You can certainly use different databases, or GreenGenes clustered by
+different means or with different parameters (or not clustered at all, though
+performance will suffer). If you have any trouble producing the reference
+fasta and taxonomy files in the required format, examine the
+prepare-greengenes script for hints, and feel free to contact us for help.
+
+
+Plumbing
+--------
+
+Taxonomy assignment proceeds in three phases, which we implement separately:
+
+1. For each query sequence (or mate pair), find an appropriate set of
+   matching hits in a reference database. This is implemented in
+   rtaxSearch.pl, which handles both single and paired-end reads.
+
+2. Find consensus taxonomy among each set of reference hits. This is
+   implemented as rtaxVote.
+
+3. Finally, the detailed rtaxVote results are filtered and cleaned up for
+   output (classificationQualityFilter and classificationQualityStripper).
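+
+In terms of the bundled scripts, the pipeline run by the "rtax" wrapper is
+roughly the following (simplified; see the rtax script itself for the exact
+flags):
+
+   scripts/rtaxSearch.pl --database ref.fasta --queryA a.fasta \
+       --queryB b.fasta --usearch usearch > rawhits
+   scripts/rtaxVote.pl ref.taxonomy rawhits \
+       | scripts/classificationQualityFilter.pl 1 0.0 0.5 \
+       | scripts/classificationQualityStripper.pl > result.out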
diff --git a/changelog b/changelog
new file mode 100644
index 0000000..0fb4c8f
--- /dev/null
+++ b/changelog
@@ -0,0 +1,61 @@
+Version 0.984 (2013-08-07)
+===========================
+
+* Fix tempdir calls for wider compatibility.
+* Add NOMATEPAIR classification when --single_ok is not selected, so that every input line gets an output line.
+
+
+
+Version 0.983 (2012-09-05)
+===========================
+
+* add options -x and -y to reverse-complement reads 1 and 2, respectively
+* fixed completely broken fallback to single-ended classification, and miscounting of NOHIT cases
+* fix idList command-line parsing bug
+
+
+Version 0.982 (2012-07-02)
+===========================
+
+* getopts bug fix
+* try to fix issues with TOOMANYHITS cases
+
+
+Version 0.981 (2012-05-08)
+===========================
+
+* change all references to "/usr/bin/perl" to "/usr/bin/env perl" for wider compatibility
+
+
+Version 0.98 (2012-05-07)
+==========================
+
+* ensured global alignment with iddef 2, i.e. ignoring terminal gaps in computing percent identity, which was the intention all along
+* Fix intermittent hanging due to squirrely interaction of usearch with named pipes
+
+
+Version 0.96 (2012-03-01)
+==========================
+
+* Make fallback optional. Store FASTA indexes as DBM. Add fastaidx tool
+* Combined pair + single script with fallback
+
+
+Version 0.94 (2012-01-12)
+==========================
+
+* add tempdir option
+* add idList option
+* add id regex option
+
+
+Version 0.93 (2012-01-09)
+=========================
+
+* Ditch Bio::Index::Fasta
+
+
+Version 0.92 (2011-12-28)
+=========================
+
+* Prep Qiime integration
\ No newline at end of file
diff --git a/greengenes/greengenes.README b/greengenes/greengenes.README
new file mode 100644
index 0000000..b158a70
--- /dev/null
+++ b/greengenes/greengenes.README
@@ -0,0 +1,69 @@
+Caution re. using newer versions of GreenGenes
+----------------------------------------------
+
+As of September 2011, we do not recommend using a version of GreenGenes newer
+than March 11, 2011. The reason is that GreenGenes has expanded dramatically in
+this time--from about half a million to about one million sequences--but the
+annotation process has not kept up. Thus, newer versions contain much higher
+proportions of sequences that either carry no taxonomic annotation at all or
+that list "unclassified" at many or even all ranks.
+
+At present, RTAX assumes that a sequence that is either not annotated or
+"unclassified" at some rank does not belong to any known taxonomic group at that
+rank. That is, we assume that the lack of an annotation constitutes an assertion
+that the sequence is novel. This assumption makes our taxonomic assignment
+process more conservative. Our interpretation is very different from the
+alternative, that the sequence could perhaps be fully annotated with known
+names, but that this simply has not been done yet.
+
+This question manifests itself in the voting procedure. We require that half of
+_all_ hits to a query sequence agree on a taxonomic group (at whatever rank) in
+order to make an assignment. We include hits with no annotation in the
+denominator, corresponding to the idea that those sequences do not belong to the
+group being considered in the numerator.
+
+We reasoned that the alternative procedure--considering the proportion of
+concordant hits out of only those carrying any annotation at all (at a given
+rank)--risks overly biasing taxonomic assignments towards previously named
+groups.
+
+On the other hand, our inclusion of unannotated sequences in the denominator
+means that, when those sequences are highly prevalent, the annotated sequences
+in the numerator rarely reach the 50% threshold required to make an
+assignment--even if they are in perfect agreement.
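+
+For example, with made-up numbers: if a query has 10 hits of which 4 are
+annotated as phylum X and the other 6 carry no phylum annotation, the vote
+for X is 4/10 = 40%, short of the 50% threshold, so no phylum is assigned
+even though every annotated hit agrees.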
+
+We can hope that GreenGenes taxonomic assignments will become more comprehensive
+in the future. Alternatively, we should carefully consider whether it is
+appropriate to switch to the more permissive approach of ignoring unannotated
+sequences when voting at each rank, or if some alternative voting procedure
+better balances the risks of misannotation versus low classification rate. Until
+this is resolved, it seems prudent to simply use the March 2011 database since
+it is known to work well.
+
+Preparing the Greengenes reference database
+-------------------------------------------
+
+To regenerate the GreenGenes reference database files for use with RTAX (e.g.
+with a newer GreenGenes version):
+
+Download the complete GreenGenes database to a disk with at least 20GB free
+space (as of Sep 2011, but GreenGenes is growing rapidly):
+
+ wget http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz
+
+Then, while in the directory containing the downloaded .txt.gz file, run
+prepare-greengenes. (If usearch is not in your path, use "prepare-greengenes
+/path/to/usearch"). This takes a couple of hours and about 4GB of memory. Go out
+for coffee and whatnot. When you get back, you'll find two files have been
+produced: gg.nr.fasta and gg.nr.taxonomy.
+
+Note that the free (32-bit) version of USEARCH can use at most 2GB memory on
+some systems and at most 4GB memory on others. 2GB memory is not sufficient to
+cluster the current version of GreenGenes, and 4GB barely suffices. Thus,
+GreenGenes will soon be too large to cluster at 99% id using USEARCH without a
+paid 64-bit license. (Note that USEARCH reports "2.0 GB available" even if in
+fact 4GB are available, so don't be discouraged until it actually crashes.) When
+4GB is also insufficient, and lacking a paid USEARCH license, a possible
+strategy will be to cluster at lower %id thresholds. This requires less memory
+and produces a smaller, less fine-grained reference database. We have not
+explored the impact of doing this on classification performance.
diff --git a/greengenes/prepare-greengenes b/greengenes/prepare-greengenes
new file mode 100755
index 0000000..f2e8f18
--- /dev/null
+++ b/greengenes/prepare-greengenes
@@ -0,0 +1,76 @@
+#!/bin/bash
+
+cur=`pwd`
+
+# we want to call subsidiary scripts in the "scripts" subdir, but we don't know where we're installed or what the working directory is.
+# solution from http://hintsforums.macworld.com/archive/index.php/t-73839.html
+
+IFS=$' \t\n'
+declare -x PATH=/bin:/usr/bin
+arg=$0; [[ -L $0 ]] && arg=$(stat -f '%Y' "$0")
+pth=$(2>/dev/null cd "${arg%/*}" >&2; echo "`pwd -P`/${arg##*/}")
+par=$(dirname "$pth")
+scripts="$par/../scripts"
+
+if [[ ( $# -ne 1 ) && ($# -ne 0) ]]
+then
+    echo "Usage: `basename $0` [/path/to/usearch] (run in a directory containing greengenes16SrRNAgenes.txt.gz)"
+ exit 1
+fi
+
+# Grab the command-line arguments
+
+#gg=$1
+usearch=`which usearch`
+if [[ ($# -ne 0) ]]
+then
+ usearch=$1
+fi
+echo using usearch: $usearch
+
+# create a temporary working directory
+
+#tempdir=/tmp/rtax.$RANDOM
+tempdir=./rtax.$RANDOM
+mkdir $tempdir
+cd $tempdir
+
+
+# Extract the sequence and taxonomy information:
+#echo gunzip -c -S .gz $cur/greengenes16SrRNAgenes.txt.gz \| $scripts/greengenesExtract.pl
+echo Extracting raw fasta and taxonomy files
+gunzip -c -S .gz $cur/greengenes16SrRNAgenes.txt.gz | $scripts/greengenesExtract.pl
+
+# That will write greengenes.fasta and greengenes.taxonomy into the current directory.
+
+# In our experiments, we cluster the reference sequences into 99% id OTUs, and find a consensus taxonomy string for each resulting cluster.
+
+# sort by length
+echo sorting fasta file by length
+$usearch --sort greengenes.fasta --output greengenes.fasta.sorted
+
+# cluster
+echo clustering at 99% identity
+$usearch --cluster greengenes.fasta.sorted --usersort --iddef 2 --uc gg.nr.uc --seedsout $cur/gg.nr.fasta --id 0.99 --maxrejects 128 --nowordcountreject
+
+# parse clusters
+echo parsing clusters
+$scripts/parseUclustClusters.pl gg.nr.uc > gg.nr.cl
+
+# find consensus taxonomy per cluster
+echo finding consensus taxonomy per cluster
+$scripts/rtaxVote.pl greengenes.taxonomy gg.nr.cl > gg.nr.taxonomy.raw
+
+echo cleaning up taxonomy output
+$scripts/classificationQualityFilter.pl 1 0.25 0.1 < gg.nr.taxonomy.raw > gg.nr.taxonomy.filtered
+$scripts/classificationQualityStripper.pl < gg.nr.taxonomy.filtered > $cur/gg.nr.taxonomy
+
+# Note gg.nr.taxonomy resulting from this process includes the pcid-width column and a tab-delimited tax string
+
+echo wrote $cur/gg.nr.fasta and $cur/gg.nr.taxonomy
+
+
+# remove the temp directory
+# (comment out to debug or grab intermediate results)
+
+rm -rf $tempdir
\ No newline at end of file
diff --git a/rtax b/rtax
new file mode 100755
index 0000000..c8c1c36
--- /dev/null
+++ b/rtax
@@ -0,0 +1,285 @@
+#!/bin/bash -eu
+
+# RTAX: Rapid and accurate taxonomic classification of short paired-end
+# sequence reads from the 16S ribosomal RNA gene.
+#
+# David A. W. Soergel 1*, Rob Knight 2, and Steven E. Brenner 1
+#
+# 1 Department of Plant and Microbial Biology,
+# University of California, Berkeley
+# 2 Howard Hughes Medical Institute and Department of Chemistry
+# and Biochemistry, University of Colorado at Boulder
+# * Corresponding author: soergel at cs.umass.edu
+#
+# http://www.davidsoergel.com/rtax
+#
+# Version 0.984 (August 7, 2013)
+#
+# For usage instructions: just run the script with no arguments
+#
+#
+# Copyright (c) 2011 Regents of the University of California
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice,
+# this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name of the University of California, Berkeley nor the
+# names of its contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+function echoerr() { echo "$@" 1>&2; }
+
+echoerr
+echoerr 'RTAX: Rapid and accurate taxonomic classification of short paired-end'
+echoerr ' sequence reads from the 16S ribosomal RNA gene.'
+echoerr ''
+echoerr 'David A. W. Soergel (1), Rob Knight (2), and Steven E. Brenner (1)'
+echoerr ''
+echoerr '1 Department of Plant and Microbial Biology,'
+echoerr ' University of California, Berkeley '
+echoerr '2 Howard Hughes Medical Institute and Department of Chemistry '
+echoerr ' and Biochemistry, University of Colorado at Boulder'
+echoerr '* Corresponding author (current address): soergel at cs.umass.edu'
+echoerr ''
+echoerr 'Version 0.984 (August 7, 2013)'
+echoerr ''
+echoerr 'http://dev.davidsoergel.com/rtax'
+echoerr ''
+echoerr 'Please cite:'
+echoerr 'Soergel D.A.W., Dey N., Knight R., and Brenner S.E. 2012.'
+echoerr 'Selection of primers for optimal taxonomic classification of'
+echoerr 'environmental 16S rRNA gene sequences. ISME J (6), 1440–1444'
+echoerr ''
+
+
+
+# we want to call subsidiary scripts in the "scripts" subdir, but we don't know where we're installed or what the working directory is.
+# solution from http://hintsforums.macworld.com/archive/index.php/t-73839.html
+
+IFS=$' \t\n'
+#declare -x PATH=/bin:/usr/bin:/usr/local/bin
+arg=$0; [[ -L $0 ]] && arg=$(stat -f '%Y' "$0")
+pth=$(2>/dev/null cd "${arg%/*}" >&2; echo "`pwd -P`/${arg##*/}")
+par=$(dirname "$pth")
+scripts="$par/scripts"
+
+# The above solution works with the widespread GNU Coreutils 5.97. Thanks to Bernd Brandt for this alternate solution for Coreutils 8.5.
+
+#IFS=$' \t\n'
+#arg=$0; [[ -L $0 ]] && arg=$(stat -c '%N' "$0")
+#pth=${arg##*-> \`};
+#par=$(dirname "$pth")
+#scripts="$par/scripts"
+
+function abspath {
+ if [[ -d "$1" ]]
+ then
+ pushd "$1" >/dev/null
+ pwd
+ popd >/dev/null
+ #elif [[ -e $1 ]]
+ #then
+ else
+ pushd $(dirname $1) >/dev/null
+ echo $(pwd)/$(basename $1)
+ popd >/dev/null
+ #else
+ # echo $1 does not exist! >&2
+ # return 127
+ fi
+}
+
+# Grab the command-line arguments
+
+usageHelp="Usage: ${0##*/} -r <refdb> -t <taxonomy> -a <queryA> -b <queryB> -d <delimiter> -i <regex> -o <classifications.out>"
+refdbHelp="-r reference database in FASTA format"
+taxonomyHelp="-t taxonomy file with sequence IDs matching the reference database"
+queryAHelp="-a FASTA file containing query sequences (single-ended or read 1)"
+queryBHelp="-b FASTA file containing query sequences (read b, with matching IDs)"
+revcompAHelp="-x Reverse-complement query A sequences (required if they are provided in the reverse sense)"
+revcompBHelp="-y Reverse-complement query B sequences (required if they are provided in the reverse sense)"
+idListHelp="-l text file containing sequence IDs to process, one per line"
+delimiterHelp="-d delimiter separating the two reads when provided in a single file"
+idRegexHelp="-i regular expression used to select part of the fasta header to use as the sequence id. Default: \"(\\\\S+)\""
+outputHelp="-o output path"
+tmpdirHelp="-m temporary directory. Will be removed on successful completion, but likely not if there is an error."
+fallbackHelp="-f for sequences where only one read is available, fall back to single-ended classification. Default: drop these sequences."
+noGenericFallbackHelp="-g for sequences where one read is overly generic, do not fall back to single-ended classification. Default: classify these sequences based on only the more specific read."
+badOptionHelp="Option not recognised"
+printHelpAndExit()
+{
+ echoerr "$usageHelp"
+ echoerr "$refdbHelp"
+ echoerr "$taxonomyHelp"
+ echoerr "$queryAHelp"
+ echoerr "$queryBHelp"
+ echoerr "$revcompAHelp"
+ echoerr "$revcompBHelp"
+ echoerr "$idRegexHelp"
+ echoerr "$idListHelp"
+ echoerr "$delimiterHelp"
+ echoerr "$tmpdirHelp"
+ echoerr "$fallbackHelp"
+ echoerr "$noGenericFallbackHelp"
+ echoerr "$outputHelp"
+ exit $1
+}
+printErrorHelpAndExit()
+{
+ echoerr "-------------------------------------------------------------"
+ echoerr "ERROR : $@"
+ echoerr "-------------------------------------------------------------"
+ echoerr
+ printHelpAndExit 1
+}
+
+refdb=""
+taxonomy=""
+queryA=""
+queryB=""
+revcompA=""
+revcompB=""
+delimiter=""
+idList=""
+idRegex=""
+verbose=""
+fallback=""
+noGenericFallback=""
+outpath=""
+tempdir=""
+
+while getopts "hr:t:a:b:i:l:d:m:o:vfgxy" optionName; do
+case "$optionName" in
+h) printHelpAndExit 0;;
+r) refdb=`abspath "$OPTARG"`;;
+t) taxonomy=`abspath "$OPTARG"`;;
+a) queryA=`abspath "$OPTARG"`;;
+b) queryB=`abspath "$OPTARG"`;;
+x) revcompA="--revcompA";;
+y) revcompB="--revcompB";;
+d) delimiter="$OPTARG";;
+i) idRegex="$OPTARG";;
+l) idList=`abspath "$OPTARG"`;;
+o) outpath=`abspath "$OPTARG"`;;
+m) tempdir=`abspath "$OPTARG"`;;
+v) verbose="0";;
+f) fallback="--singleOK";;
+g) noGenericFallback="--nosingleOKgeneric";;
+[?]) printErrorHelpAndExit "$badOptionHelp";;
+esac
+done
+
+
+
+if [[ -n "$verbose" ]]
+then
+echoerr
+echoerr Reference database : $refdb
+echoerr Taxonomy file : $taxonomy
+echoerr ID List : $idList
+echoerr Query read A : $queryA
+echoerr RevComp read A : $revcompA
+echoerr Query read B : $queryB
+echoerr RevComp read B : $revcompB
+# echoerr Header Regex : $idRegex
+echoerr Read delimiter : $delimiter
+echoerr Output path : $outpath
+echoerr Temporary dir : $tempdir
+echoerr Fallback : $fallback
+echoerr No Generic Fallback: $noGenericFallback
+echoerr
+fi
+
+
+
+if [[ -z "$refdb" || -z "$taxonomy" || -z "$queryA" || -z "$outpath" ]]; then
+ printErrorHelpAndExit "-r, -t, -a, and -o options are required"
+fi
+
+if [[ -n "$queryB" && -n "$delimiter" ]]; then
+ printErrorHelpAndExit "Paired ends can be input using -b or -d, but not both"
+fi
+
+# find USEARCH, or hardcode if needed (could use an environment variable...)
+# usearch=/path/to/usearch
+set +e
+usearch=`which usearch`
+if [ $? -eq 0 ]
+then
+ echoerr using usearch: $usearch
+ echoerr
+else
+ echoerr "usearch not found in path. Please correct, or just hardcode the path in the rtax script."
+ exit 1
+fi
+
+set -e
+
+# create a temporary working directory
+
+if [ ! $tempdir ]; then
+ tempdir=/tmp/rtax.$RANDOM
+ #tempdir=./rtax.$RANDOM
+fi
+mkdir -p $tempdir
+cd $tempdir
+
+# perform the search
+
+if [ $idList ]; then
+ idList="--idList $idList"
+fi
+
+
+if [ $delimiter ]; then
+ $scripts/splitDelimitedFasta.pl $delimiter $queryA
+ queryAa=`basename $queryA`.a
+ queryAb=`basename $queryA`.b
+ $scripts/rtaxSearch.pl --database $refdb --queryA $queryAa --queryB $queryAb $idList $revcompA $revcompB --usearch $usearch --idRegex "$idRegex" $fallback $noGenericFallback > rawhits
+else
+ if [ $queryB ]; then
+ $scripts/rtaxSearch.pl --database $refdb --queryA $queryA --queryB $queryB $idList $revcompA $revcompB --usearch $usearch --idRegex "$idRegex" $fallback $noGenericFallback > rawhits
+ else
+ fallback="--singleOK"
+ $scripts/rtaxSearch.pl --database $refdb --queryA $queryA $idList $revcompA --usearch $usearch --idRegex "$idRegex" $fallback > rawhits
+ fi
+fi
+
+
+# choose the "best" taxonomy assignment by walking down the tree, including info on how many ref sequences agree at each level
+# and pipe directly into the cleanup scripts instead of using temp files
+
+$scripts/rtaxVote.pl $taxonomy rawhits | $scripts/classificationQualityFilter.pl 1 0.0 0.5 | $scripts/classificationQualityStripper.pl > $outpath
+
+# The output of this process includes the pcid-width column and a tab-delimited tax string
+
+# filter the taxonomy strings to the finest rank where half of the hits agree (out of all the hits, i.e. including in the denominator those hits with no annotation)
+#$scripts/classificationQualityFilter.pl 1 0.0 0.5 < rawtaxonomy > filteredtaxonomy
+
+# clean up the resulting output and provide it to STDOUT
+#$scripts/classificationQualityStripper.pl < filteredtaxonomy
+
+# remove the temp directory
+# (comment out to debug or grab intermediate results)
+
+rm -rf $tempdir
\ No newline at end of file
diff --git a/scripts/FastaIndex.pm b/scripts/FastaIndex.pm
new file mode 100644
index 0000000..cb9b91d
--- /dev/null
+++ b/scripts/FastaIndex.pm
@@ -0,0 +1,136 @@
+package FastaIndex;
+
+use strict;
+use warnings;
+
+use IO::File;
+
+
+BEGIN {
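+# Generate a simple get/set accessor for each field by installing a closure
+# into this package's symbol table.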
+for my $field (qw( fastaFileName startpos lines fh ))
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ no strict "refs"; # So symbolic ref to typeglob works.
+
+ *$field = sub
+ {
+ my $self = shift;
+ $self->{$slot} = shift if @_;
+ return $self->{$slot};
+ };
+ }
+}
+
+sub new {
+ my ($this) = @_;
+ my $class = ref($this) || $this;
+ my $self = {};
+
+ bless $self, $class;
+
+ return $self;
+}
+
+sub make_index {
+
+ my ($this, $fastaFileName, $idregex, $dbmFileName) = @_;
+ if(!defined $idregex) { $idregex = "(\\S+)"; }
+
+ $this->fastaFileName($fastaFileName);
+
+ #open(IN, "$fastaFileName") or die "could not open $fastaFileName";
+ my $in = IO::File->new("$fastaFileName") or die "could not open $fastaFileName";
+ $this->fh($in);
+
+ if($dbmFileName)
+ {
+ my %openpos = ();
+ dbmopen( %openpos, $dbmFileName.".pos", 0666) || die ("Could not open DBM file $dbmFileName.pos");
+ $this->startpos(\%openpos);
+
+ my %openlines = ();
+ dbmopen( %openlines, $dbmFileName.".lines", 0666) || die ("Could not open DBM file $dbmFileName.lines");
+ $this->lines(\%openlines);
+
+ my $sizepos = scalar keys %openpos;
+ my $sizelines = scalar keys %openlines;
+ if($sizepos != $sizelines) { die "DBM files broken $dbmFileName\n"; }
+ if($sizepos > 0) {
+ print STDERR ("Index file $dbmFileName.pos already exists; assuming valid\n");
+ return;
+ };
+ }
+ else
+ {
+ # just store a hash in memory.
+ $this->startpos({});
+ $this->lines({});
+ }
+
+ my $lastpos = 0;
+ my $pos = 0;
+ #my $seq = "";
+ my $id = "";
+ my $numlines = 0;
+
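+    # Single pass over the FASTA file: record each record's starting byte
+    # offset and its line count so that fetch() can seek() straight to it.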
+ while(<$in>)
+ {
+ my $line = $_;
+ if($line =~ /^>$idregex/)
+ {
+ if(defined $id && $id ne "") {
+ # write the previous record
+ #print STDERR "$id -> $lastpos\n";
+ $this->startpos()->{$id} = $lastpos;
+ $this->lines()->{$id} = $numlines;
+ }
+
+ # start a new record
+ $id = $1;
+ #$seq = $line;
+ $lastpos = $pos;
+ $numlines = 0;
+ }
+ # else { $seq .= $line; }
+
+ $pos += length($line);
+ $numlines++;
+ }
+    # write the final record, guarding against empty input
+    # (otherwise a spurious "" entry would be created)
+    if ( defined $id && $id ne "" ) {
+        $this->startpos()->{$id} = $lastpos;
+        $this->lines()->{$id}    = $numlines;
+    }
+}
+
+sub count_records()
+ {
+ my ($this) = @_;
+ return scalar(keys %{$this->startpos()});
+ }
+
+sub fetch {
+
+ my ($this, $id) = @_;
+
+ my $pos = $this->startpos()->{$id};
+ if(! defined $pos) { die "no sequence with id: $id in " . $this->fastaFileName(); }
+
+ my $numlines = $this->lines()->{$id};
+
+ my $result = "";
+ my $in = $this->fh();
+ #print $this->fastaFileName() . ": " . $in . "\n";
+ seek($in,$pos,0);
+ for (1..$numlines) { $result .= <$in> }
+ return $result;
+}
+
+sub close {
+ my ($this) = @_;
+ my $in = $this->fh();
+ close($in);
+ $this->startpos(undef);
+ $this->lines(undef);
+}
+1;
\ No newline at end of file
diff --git a/scripts/LevelPrediction.pm b/scripts/LevelPrediction.pm
new file mode 100644
index 0000000..03dea87
--- /dev/null
+++ b/scripts/LevelPrediction.pm
@@ -0,0 +1,71 @@
+package LevelPrediction;
+
+use strict;
+use warnings;
+
+sub new {
+ my ($this) = @_;
+ my $class = ref($this) || $this;
+ my $self = {};
+
+ bless $self, $class;
+
+ return $self;
+}
+
+for my $field (
+ qw(label numChildren count localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion)
+ )
+{
+ my $slot = __PACKAGE__ . "::$field";
+ no strict "refs"; # So symbolic ref to typeglob works.
+
+ *$field = sub {
+ my $self = shift;
+ $self->{$slot} = shift if @_;
+ return $self->{$slot};
+ };
+}
+
+sub print_tab_delimited {
+ my $self = shift;
+ for my $field (
+ qw(label numChildren count localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ print "\t" . ( $self->{$slot} ? $self->{$slot} : "undef" );
+ }
+}
+
+sub toString {
+ my $self = shift;
+ my $result = "";
+
+ for my $field (
+ qw(label)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ $result .= ( $self->{$slot} ? $self->{$slot} : "" );
+ }
+
+ for my $field (
+ qw(numChildren count)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ $result .= ";" . ( $self->{$slot} ? $self->{$slot} : "" );
+ }
+
+ for my $fieldb (
+ qw(localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$fieldb";
+ $result .= $self->{$slot} ? (sprintf ";%.2f", $self->{$slot}) : ";" ;
+ }
+ return $result;
+}
+
+1;
diff --git a/scripts/classificationQualityFilter.pl b/scripts/classificationQualityFilter.pl
new file mode 100755
index 0000000..a6a1e53
--- /dev/null
+++ b/scripts/classificationQualityFilter.pl
@@ -0,0 +1,51 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+my $countThreshold               = $ARGV[0];
+my $localMinProportionThreshold  = $ARGV[1];
+my $globalMinProportionThreshold = $ARGV[2];
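+
+# The rtax wrapper invokes this as "classificationQualityFilter.pl 1 0.0 0.5":
+# keep ranks supported by at least 1 hit and by at least half of all hits
+# globally; everything below the first failing rank is dropped.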
+
+while (<STDIN>) {
+ chomp;
+ my ( $id, $bestPcid, @taxonomyEntries ) = split /\t/;
+
+ if ( $taxonomyEntries[0] eq "NOPRIMER" ) {
+ print "$id\t\tNOPRIMER\n";
+ }
+ elsif ( $taxonomyEntries[0] eq "NOHIT" ) {
+ print "$id\t\tNOHIT\n";
+ }
+ elsif ( $taxonomyEntries[0] eq "NOMATEPAIR" ) {
+ print "$id\t\tNOMATEPAIR\n";
+ }
+ elsif ( $taxonomyEntries[0] =~ /TOOMANYHITS/ ) {
+ print "$id\t\tTOOMANYHITS\n";
+ }
+ else {
+ my @newTaxonomyEntries = ();
+ for my $entry (@taxonomyEntries) {
+
+ my (
+ $label, $numChildren, $count,
+ $localMinProportion, $localMaxProportion, $globalMinProportion,
+ $globalMaxProportion, $alternateLocalProportion, $alternateGlobalProportion
+ ) = split /;/, $entry;
+
+ if ( $count >= $countThreshold
+ && $localMinProportion >= $localMinProportionThreshold
+ && $globalMinProportion >= $globalMinProportionThreshold )
+ {
+ push @newTaxonomyEntries, $entry;
+ }
+ else { last; }
+ }
+ #if ( @newTaxonomyEntries == 0 ) {
+ # push @newTaxonomyEntries, "Root";
+ #}
+
+        # if the trailing entries of the filtered taxonomy string are "SKIP", strip them
+        while ( @newTaxonomyEntries && substr( $newTaxonomyEntries[-1], 0, 4 ) eq "SKIP" ) { pop @newTaxonomyEntries; }
+
+ print "$id\t$bestPcid\t" . ( join "\t", @newTaxonomyEntries ) . "\n";
+ }
+
+}
diff --git a/scripts/classificationQualityStripper.pl b/scripts/classificationQualityStripper.pl
new file mode 100755
index 0000000..7765117
--- /dev/null
+++ b/scripts/classificationQualityStripper.pl
@@ -0,0 +1,7 @@
+#!/usr/bin/env perl
+
+while (<STDIN>) {
+ my $line = $_;
+ $line =~ s/;[^\t\n]*//g;
+ print $line;
+}
diff --git a/scripts/fastaidx.pl b/scripts/fastaidx.pl
new file mode 100755
index 0000000..a5ace78
--- /dev/null
+++ b/scripts/fastaidx.pl
@@ -0,0 +1,22 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+use Getopt::Long;
+use File::Temp;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use FastaIndex;
+
+my ($fastaFileName, $idRegex) = @ARGV;
+
+my $indexA = FastaIndex->new(); # '-filename' => "A.idx", '-write_flag' => 1 );
+print STDERR "Indexing $fastaFileName...\n";
+$indexA->make_index( $fastaFileName, $idRegex, $fastaFileName );
+
+while(<STDIN>)
+{
+ chomp;
+ print $indexA->fetch($_);
+}
\ No newline at end of file
diff --git a/scripts/greengenesExtract.pl b/scripts/greengenesExtract.pl
new file mode 100755
index 0000000..5e15120
--- /dev/null
+++ b/scripts/greengenesExtract.pl
@@ -0,0 +1,49 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+open( FASTA, ">greengenes.fasta" ) || die("Can't write to greengenes.fasta");
+open( TAX, ">greengenes.taxonomy" ) || die("Can't write to greengenes.taxonomy");
+
+my $fieldname = "gg_norm_tax_string";
+
+while (<STDIN>) {
+
+    # read in by blocks
+    if ( $_ =~ /^BEGIN$/ ) {
+        my $prokMSAid = "NONE";
+        my $tax       = "NONE";
+        my $seq       = "NONE";
+
+        until ( $_ =~ /^END$/ ) {
+            if (/^prokMSA_id=(.+)/) {
+                $prokMSAid = $1;
+            }
+            elsif (/^$fieldname=(.+)/) {
+                $tax = $1;
+            }
+            elsif (/aligned_seq=(.+)/) {
+                $seq = $1;
+            }
+
+            $_ = <STDIN>;
+            last if !defined $_;    # truncated block: don't loop forever at EOF
+        }
+
+        # write each record once, after the whole BEGIN/END block has been read
+        if ( $seq ne "NONE" && $seq ne "unaligned" ) {
+            print FASTA ">$prokMSAid\n$seq\n";
+
+            # don't include taxonomy info if there is no sequence anyway
+            # and certainly not if there is no taxonomy data
+            if ( $tax ne "" && $tax ne "NONE" ) {
+                print TAX "$prokMSAid\t$tax\n";
+            }
+        }
+    }
+
+    # else ignore anything outside a BEGIN/END block
+
+}
+
+close FASTA;
+close TAX;
diff --git a/scripts/joinDelimitedFasta.pl b/scripts/joinDelimitedFasta.pl
new file mode 100755
index 0000000..664b286
--- /dev/null
+++ b/scripts/joinDelimitedFasta.pl
@@ -0,0 +1,38 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+# make sure delimiter is properly escaped, e.g. for "+" we need "\+".
+my $delimiter = $ARGV[0];
+my $infileA = $ARGV[1];
+my $infileB = $ARGV[2];
+my $rc2 = $ARGV[3];
+
+open(INA, $infileA) or die "Can't read $infileA\n";
+open(INB, $infileB) or die "Can't read $infileB\n";
+
+my $fastaHeader = "";
+
+#sub revcompl {
+# return ((my $rev = reverse $_) =~ tr/ACGTacgt/TGCAtgca/);
+#}
+
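+# Assumes strictly line-aligned, one-line-per-sequence FASTA: both inputs list
+# the same records in the same order, so corresponding lines (headers
+# included) are joined with the delimiter.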
+while(<INA>)
+ {
+ my $a = $_;
+ my $b = <INB>;
+ chomp $a;
+ chomp $b;
+ if($rc2)
+ {
+ # $b = revcompl($b);
+ $b = reverse $b;
+ $b =~ tr/ACGTacgt/TGCAtgca/;
+ }
+
+ print $a . $delimiter . $b . "\n";
+
+ }
+close INA;
+close INB;
\ No newline at end of file
diff --git a/scripts/parseUclustClusters.pl b/scripts/parseUclustClusters.pl
new file mode 100755
index 0000000..0ed23c1
--- /dev/null
+++ b/scripts/parseUclustClusters.pl
@@ -0,0 +1,52 @@
+#!/usr/bin/env perl
+
+# read a uclust result file; for each query sequence, collect the prokMSAids of the hits.
+
+use strict;
+use warnings;
+
+my %clusters = ();
+my %worstPcids = ();
+
+parseUclust( $ARGV[0] );
+
+for my $key ( sort ( keys %clusters ) ) {
+ my $worstPcid = $worstPcids{$key};
+
+ #print STDERR "$key -> $clusters{$key}\n";
+ print "$key\t$worstPcid\t" . ( join "\t", @{ $clusters{$key} } ) . "\n";
+}
+
+sub parseUclust {
+ my ($ucFileName) = @_;
+ open( UC, $ucFileName ) || die("Could not open $ucFileName");
+
+ while (<UC>) {
+
+ if (/^\s*#/) { next; }
+
+ my ( $type, $cluster, $size, $percentid, $strand, $querystart, $targetstart, $alignment, $querylabel, $targetlabel ) = split /\t/;
+ chomp $querylabel;
+ chomp $targetlabel;
+
+ if ( $type eq "S" ) {
+
+ #print STDERR "S $querylabel\n";
+ $clusters{$querylabel} = [$querylabel];
+ $worstPcids{$querylabel} = 100.0;
+ }
+ elsif ( $type eq "H" ) {
+
+ #print STDERR "H $targetlabel $querylabel\n";
+ push @{ $clusters{$targetlabel} }, $querylabel;
+
+ if ( $percentid < $worstPcids{$targetlabel} ) {
+ $worstPcids{$targetlabel} = $percentid;
+ }
+
+ }
+
+ # ignore other types
+ }
+ close UC;
+}
diff --git a/scripts/rtaxSearch.pl b/scripts/rtaxSearch.pl
new file mode 100755
index 0000000..cfdd180
--- /dev/null
+++ b/scripts/rtaxSearch.pl
@@ -0,0 +1,1062 @@
+#!/usr/bin/env perl
+
+# RTAX: Rapid and accurate taxonomic classification of short paired-end
+# sequence reads from the 16S ribosomal RNA gene.
+#
+# David A. W. Soergel 1*, Rob Knight 2, and Steven E. Brenner 1
+#
+# 1 Department of Plant and Microbial Biology,
+# University of California, Berkeley
+# 2 Howard Hughes Medical Institute and Department of Chemistry
+# and Biochemistry, University of Colorado at Boulder
+# * Corresponding author: soergel at cs.umass.edu
+#
+# http://www.davidsoergel.com/rtax
+#
+# Version 0.984 (August 7, 2013)
+#
+# For usage instructions: just run the script with no arguments
+#
+#
+# Copyright (c) 2011 Regents of the University of California
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice,
+# this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name of the University of California, Berkeley nor the
+# names of its contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+use strict;
+use warnings;
+use Getopt::Long;
+use File::Temp qw/ tempdir /;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use FastaIndex;
+
+### command-line arguments get stored in globals
+
+use vars qw (
+ $usearch
+ $slop
+ $minMaxAccepts
+ $maxMaxAccepts
+ $maxPercentDifferenceThreshold
+ $databaseFile
+ $readAFileAll
+ $readBFileAll
+ $revCompReadA
+ $revCompReadB
+ $idRegex
+ $idList
+ $singleOK
+ $singleOKgeneric
+);
+
+# just store these as globals for now-- they're basically like command-line args
+my $indexA;
+my $indexB;
+
+sub init {
+ $singleOKgeneric = 1; # default value; GetOptions may override
+
+ Getopt::Long::Configure("bundling");
+ GetOptions(
+ "usearch=s" => \$usearch,
+ "slop=s" => \$slop,
+ "minMaxAccepts=s" => \$minMaxAccepts,
+ "maxMaxAccepts=s" => \$maxMaxAccepts,
+ "maxPercentDifferenceThreshold=s" => \$maxPercentDifferenceThreshold,
+ "databaseFile=s" => \$databaseFile,
+ "idRegex=s" => \$idRegex,
+ "queryA=s" => \$readAFileAll,
+ "queryB=s" => \$readBFileAll,
+ "revcompA" => \$revCompReadA,
+ "revcompB" => \$revCompReadB,
+ "idList=s" => \$idList,
+ "singleOK" => \$singleOK,
+ "singleOKgeneric!" => \$singleOKgeneric
+ );
+
+ if ( !defined $databaseFile
+ || !defined $readAFileAll )
+ {
+ print STDERR "Missing required argument!\n\n";
+ usage();
+ }
+
+ if ( !defined $usearch ) {
+ $usearch = `whereis usearch`;
+ chomp $usearch;
+ if ( !defined $usearch || $usearch eq "" ) {
+ print STDERR "Could not find usearch.\n\n";
+ usage();
+ }
+ }
+
+ if ( !defined $slop ) { $slop = 0.005; }
+ if ( !defined $minMaxAccepts ) { $minMaxAccepts = 1000; }
+ if ( !defined $maxMaxAccepts ) { $maxMaxAccepts = 16000; }
+ if ( !defined $maxPercentDifferenceThreshold ) {
+ $maxPercentDifferenceThreshold = 0.2;
+ }
+ if ( !defined $idRegex || $idRegex eq "" ) { $idRegex = "(\\S+)"; }
+ print STDERR "Header Regex: $idRegex\n";
+
+ # these are redundant between multiple runs, oh well
+ # but the DBM indexes should persist, in the same directory as the original files
+ $indexA = FastaIndex->new(); # '-filename' => "A.idx", '-write_flag' => 1 );
+ if ($revCompReadA) {
+ $readAFileAll = revcompFile($readAFileAll);
+ }
+ $indexA->make_index( $readAFileAll, $idRegex, $readAFileAll );
+
+ $indexB = FastaIndex->new();
+ if ( defined $readBFileAll ) { # '-filename' => "B.idx", '-write_flag' => 1 );
+ if ($revCompReadB) {
+ $readBFileAll = revcompFile($readBFileAll);
+ }
+ $indexB->make_index( $readBFileAll, $idRegex, $readBFileAll );
+ }
+}
+
+sub usage {
+ print STDERR << "EOF";
+
+RTAX: Rapid and accurate taxonomic classification of short paired-end
+ sequence reads from the 16S ribosomal RNA gene.
+
+David A. W. Soergel (1*), Rob Knight (2), and Steven E. Brenner (1)
+1 Department of Plant and Microbial Biology, University of California, Berkeley
+2 Howard Hughes Medical Institute and Department of Chemistry and Biochemistry,
+ University of Colorado at Boulder
+* Corresponding author: soergel\@berkeley.edu
+
+http://www.davidsoergel.com/rtax
+
+Version 0.984 (August 7, 2013)
+
+OPTIONS:
+
+ --usearch <path> location of usearch program
+ (defaults to result of "whereis usearch")
+
+ --slop <number> %id difference from maximum to accept
+ (default 0.005, i.e. accept hits to a 200nt query that
+ has 1 more mismatch than the best match, or 2 extra
+ mismatches out of 400nt.)
+
+ --minMaxAccepts <number>
+ The initial maxaccepts value to pass to USEARCH,
+ controlling the maximum number of hits that will be
+ returned in the first iteration. If this number of hits
+ is reached, the search is repeated with a doubled
+ maxaccepts value. Thus, a smaller value causes the
+ first iteration to run faster, but increases the
+ probability of needing more iterations.
+ (Default 1000)
+
+ --maxMaxAccepts <number>
+ The highest escalated maxaccepts value to allow. The
+ maxaccepts value is doubled on every iteration, so this
+ value is effectively rounded down to a power of two
+ factor of minMaxAccepts. USEARCH runs with large
+ maxaccepts values are very slow, so this controls when
+ RTAX gives up on a query because there are too many
+ hits (i.e., a query that is too short and/or too highly
+ conserved to be informative). (Default 16000)
+
+ --maxPercentDifferenceThreshold <number>
+ The largest percent difference to allow between a query
+ (considering the read pair jointly) and a database hit.
+ (Default 0.2, i.e. reference sequences of less than 80%
+ identity with the query sequence are never matched)
+
+ --databaseFile <file>
+ A FASTA file containing a reference database.
+
+ --queryA <file> A FASTA file containing one set of query reads.
+
+ --queryB <file> A FASTA file containing the other set of query reads
+ (if any). Must be provided in the forward sense
+ unless --revcompB is used.
+
+ --revcompB Reverse-complement read B. Required if the queryB file
+ is provided in the reverse sense, as is typical with
+ paired-end experiments.
+
+ --singleOK Classify a sequence based on only one read when the
+ other read is missing. Required when queryB is absent.
+ Default: false, so sequences present in only one of the
+ two input files are dropped. When enabled, paired-end
+ sequences are classified using both reads as usual, but
+ any remaining single-ended reads are also classified.
+
+ --singleOKgeneric Classify a sequence based on only one read when the
+ other read is present but uninformative. This occurs
+ when one read is so generic (i.e., short, or highly
+ conserved) that more than maxMaxAccepts hits are
+ found. Default: true; use --nosingleOKgeneric to
+ disable.
+
+ --idRegex <regex> A regular expression for extracting IDs from fasta
+ headers. The first capture group (aka \$1) will be
+ used as the sequence ID; these should be unique.
+ This is useful when the ID needed to match mate
+ pairs is embedded in the header in some way.
+ "^>" is automatically prepended to the regex
+ provided here. Take care that double escaping may
+ be required. Default: "(\\S+)" (the first field)
+
+ --idList <file> A file containing a list of IDs to process (one per
+ line). By default all IDs found in the query files
+ are used.
+
+  Note that the two query files must provide mate-paired reads with exactly
+  matching identifiers (though they need not be in the same order). Any ids
+  present in one file but not the other are classified in single-ended mode
+  when --singleOK is given, and are otherwise reported as NOMATEPAIR.
+  Alternate naming schemes for the paired reads (e.g., "SOMEID.a" paired with
+  "SOMEID.b", and so forth) are handled via the --idRegex option.
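+  For example, a hypothetical regex for that scheme: --idRegex "(\\S+)\\.[ab]"
+  (the first capture group, i.e. the ID without its .a/.b suffix, is used).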
+
+ Various indexes and derived FASTA files will be created in a temporary
+ directory and are cleaned up afterwards.
+
+ The output is tab-delimited text, one line per query pair, in the form
+
+ <query ID> <%id> <list of reference IDs>
+
+ where the second column gives the %id between the query and the best match,
+ and the reference IDs provided are all those matches within "slop" of this
+ best value.
+
+ This output is suitable for piping directly into rtaxVote for taxonomy
+ assignment.
+
+EOF
+
+ exit;
+}
+
+sub main {
+
+ init();
+
+ my ( $pairedReadAFile, $pairedReadBFile, $pairedBothCount, $singleReadAFile, $singleReadACount, $singleReadBFile, $singleReadBCount, $allSingleIDs ) =
+ partitionReadFiles();
+
+ processPairs( $pairedReadAFile, $pairedReadBFile, $pairedBothCount );
+
+ if ($singleOK) {
+ processSingle( $singleReadAFile, $indexA, $singleReadACount );
+ processSingle( $singleReadBFile, $indexB, $singleReadBCount );
+ }
+ else {
+ for my $queryLabel (@$allSingleIDs) {
+ print "$queryLabel\t\tNOMATEPAIR\n";
+ }
+ }
+}
+
+sub partitionReadFiles {
+
+ my @idList = ();
+ if ($idList) {
+ open( IDLIST, "$idList" ) or die "Could not open $idList";
+ while (<IDLIST>) { chomp; push @idList, $_; }
+ close IDLIST;
+ }
+
+ my @idsA = keys %{ $indexA->startpos() };
+ my @idsB =
+ defined $indexB->startpos() ? keys %{ $indexB->startpos() } : (); # in the single-read case, there should still be an empty index
+
+ my @bothAandB = ();
+ my @aOnly = ();
+ my @bOnly = ();
+
+ # encode which of the input files contain which IDs in three bits
+
+ my %count = ();
+
+ foreach my $element (@idsA) {
+        if ( ( $element =~ /[\t ]/ ) > 0 )    # keys of $indexA->startpos() are parsed IDs and should contain no whitespace
+ {
+ die "Invalid FASTA id: $element\n";
+ }
+ $count{$element} += 1;
+ }
+
+ foreach my $element (@idsB) {
+        if ( ( $element =~ /[\t ]/ ) > 0 )    # keys of $indexB->startpos() are parsed IDs and should contain no whitespace
+ {
+ die "Invalid FASTA id: $element\n";
+ }
+ $count{$element} += 2;
+ }
+
+ foreach my $element (@idList) {
+ if ( ( $element =~ /[\t ]/ ) > 0 ) { die "Invalid FASTA id: $element\n"; }
+ $count{$element} += 4;
+ }
+
+ if ($idList) {
+ foreach my $element ( keys %count ) {
+ $count{$element} -= 4;
+ }
+ }
+
+ foreach my $element ( keys %count ) {
+ if ( !( $element =~ /^__/ ) ) {
+ if ( $count{$element} == 1 ) { push @aOnly, $element }
+ elsif ( $count{$element} == 2 ) { push @bOnly, $element }
+ elsif ( $count{$element} == 3 ) { push @bothAandB, $element }
+ else {
+
+ # no problem; these are sequences not in the idList
+ }
+ }
+ }
+
+ print STDERR "file a only = " . scalar(@aOnly) . " sequences\n";
+ print STDERR "file b only = " . scalar(@bOnly) . " sequences\n";
+ print STDERR "both = " . scalar(@bothAandB) . " sequences\n";
+
+ my $pairedReadAFile = extractFasta( $indexA, \@bothAandB );
+ my $pairedReadBFile = extractFasta( $indexB, \@bothAandB );
+
+ my $singleReadAFile = extractFasta( $indexA, \@aOnly );
+ my $singleReadBFile = extractFasta( $indexB, \@bOnly );
+
+ my @allSingleIDs = (@aOnly, @bOnly);
+
+ return ( $pairedReadAFile, $pairedReadBFile, scalar(@bothAandB), $singleReadAFile, scalar(@aOnly), $singleReadBFile, scalar(@bOnly), \@allSingleIDs );
+}
+
+sub processPairs {
+ my ( $pairedReadAFile, $pairedReadBFile, $pairedBothCount ) = @_;
+ if ( $pairedBothCount == 0 ) { return; }
+
+ my $nohitQueryIds = [];
+ push @$nohitQueryIds, "ALL";
+ my $tooManyHitQueryIds = [];
+
+ my $percentDifferenceThreshold = 0.005; # this gets doubled to 1% before being used the first time
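+    # with the defaults, rounds thus run at 1%, 2%, 4%, 8%, and 16% pair
+    # difference before giving up (maxPercentDifferenceThreshold = 0.2)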
+
+ while ( @$nohitQueryIds > 0 ) {
+
+ # double the allowed %diff on every round
+ $percentDifferenceThreshold *= 2;
+ if ( $percentDifferenceThreshold > $maxPercentDifferenceThreshold ) {
+ last;
+ }
+
+ if ( $nohitQueryIds->[0] ne "ALL" ) {
+
+ # prepare input files with the remaining sequences
+ $pairedReadAFile = extractFasta( $indexA, $nohitQueryIds );
+ $pairedReadBFile = extractFasta( $indexB, $nohitQueryIds );
+ }
+
+ # within doPairSearch we escalate maxAccepts, so if a queryLabel is still marked tooManyHits at this point,
+ # that means that there are more than maxMaxAccepts hits for this threshold--
+ # so there's really no point in testing this query again at an even lower threshold
+ my $tooManyHitQueryIdsThisRound;
+
+ my $numRemaining = ( $nohitQueryIds->[0] eq "ALL" ) ? $pairedBothCount : scalar(@$nohitQueryIds);
+
+ print STDERR
+ "doPairSearch $pairedReadAFile, $pairedReadBFile: $numRemaining query sequences remaining\n searching with pair %id "
+ . $percentDifferenceThreshold
+ . " and maxAccepts "
+ . $minMaxAccepts . "\n";
+
+ ( $nohitQueryIds, $tooManyHitQueryIdsThisRound ) =
+ doPairSearch( $pairedReadAFile, $pairedReadBFile, $percentDifferenceThreshold, $minMaxAccepts );
+
+ print STDERR "MAIN: Finished round at threshold $percentDifferenceThreshold; "
+ . scalar(@$nohitQueryIds)
+ . " NOHIT, "
+ . scalar(@$tooManyHitQueryIdsThisRound)
+ . " TOOMANYHIT.\n";
+
+ push @$tooManyHitQueryIds, @$tooManyHitQueryIdsThisRound;
+ }
+
+ print STDERR "MAIN: " . scalar(@$nohitQueryIds) . " query sequences remaining with NOHIT\n";
+ for my $queryLabel (@$nohitQueryIds) {
+ print "$queryLabel\t\tNOHIT\n";
+ }
+
+ print STDERR "MAIN: " . scalar(@$tooManyHitQueryIds) . " query sequences remaining with TOOMANYHITS\n";
+ for my $queryLabel (@$tooManyHitQueryIds) {
+ print "$queryLabel\t\tTOOMANYHITS\n";
+ }
+}
+
+sub processSingle {
+ my ( $singleReadFile, $singleIndex, $singleReadCount ) = @_;
+ if ( !defined $singleReadFile || $singleReadFile eq "" || $singleReadCount == 0 ) { return; }
+
+ my $nohitQueryIds = [];
+ push @$nohitQueryIds, "ALL";
+ my $tooManyHitQueryIds = [];
+
+ my $singlePercentDifferenceThreshold = 0.005; # this gets doubled to 1% before being used the first time
+
+ while ( @$nohitQueryIds > 0 ) {
+
+ # double the allowed %diff on every round
+ $singlePercentDifferenceThreshold *= 2;
+ if ( $singlePercentDifferenceThreshold > $maxPercentDifferenceThreshold ) {
+ last;
+ }
+
+ if ( $nohitQueryIds->[0] ne "ALL" ) {
+
+ # prepare input files with the remaining sequences
+ $singleReadFile = extractFasta( $singleIndex, $nohitQueryIds );
+ }
+
+ my $numRemaining = ( $nohitQueryIds->[0] eq "ALL" ) ? $singleReadCount : scalar(@$nohitQueryIds);
+
+ # within doSearch we escalate maxAccepts, so if a queryLabel is still marked tooManyHits at this point,
+ # that means that there are more than maxMaxAccepts hits for this threshold--
+ # so there's really no point in testing this query again at an even lower threshold
+ my $tooManyHitQueryIdsThisRound;
+
+ print STDERR "doSingleSearch $singleReadFile: $numRemaining query sequences remaining\n searching with %id "
+ . $singlePercentDifferenceThreshold
+ . " and maxAccepts "
+ . $minMaxAccepts . "\n";
+
+ ( $nohitQueryIds, $tooManyHitQueryIdsThisRound ) =
+ doSingleSearch( $singleReadFile, $singleIndex, $singlePercentDifferenceThreshold, $minMaxAccepts );
+
+ print STDERR "Finished round at threshold $singlePercentDifferenceThreshold; "
+ . scalar(@$nohitQueryIds)
+ . " NOHIT, "
+ . scalar(@$tooManyHitQueryIdsThisRound)
+ . " TOOMANYHIT.\n";
+
+ push @$tooManyHitQueryIds, @$tooManyHitQueryIdsThisRound;
+ }
+
+ print STDERR scalar(@$nohitQueryIds) . " query sequences remaining with NOHIT\n";
+ for my $queryLabel (@$nohitQueryIds) {
+ print "$queryLabel\t\tNOHIT\n";
+ }
+
+ print STDERR scalar(@$tooManyHitQueryIds) . " query sequences remaining with TOOMANYHITS\n";
+ for my $queryLabel (@$tooManyHitQueryIds) {
+ print "$queryLabel\t\tTOOMANYHITS\n";
+ }
+
+}
+
+sub doSingleSearch {
+ my $singleReadFile = shift;
+ my $singleIndex = shift;
+ my $singlePercentDifferenceThreshold = shift;
+ my $maxAccepts = shift;
+
+ my $singlePercentIdThreshold = 1. - $singlePercentDifferenceThreshold;
+
+ my $nohitQueryIds = [];
+ my $tooManyHitQueryIds = [];
+
+# open the USEARCH streams
+# print STDERR "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n";
+# open( UCA,
+# "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |"
+# ) || die "can't fork usearch: $!";
+
+ my $dir = tempdir(CLEANUP => 1);
+ if ( system( 'mknod', "$dir/a", 'p' ) && system( 'mkfifo', "$dir/a" ) ) { die "mk{nod,fifo} $dir/a failed"; }
+ if ( !fork() ) {
+
+ # see paired end case for explanation
+ open( UCAW, ">$dir/a" ) || die("Couldn't write named pipe: $dir/a");
+ print UCAW "# pipe open!\n";
+ close UCAW;
+
+ my $cmd =
+"$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc $dir/a --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject";
+ print STDERR $cmd . "\n";
+ exec $cmd || die "can't fork usearch: $!";
+ }
+
+ open( UCA, "$dir/a" ) || die("Couldn't read named pipe from usearch: $dir/a");
+
+ #print STDERR "Reading named pipe from usearch: $dir/a\n";
+
+ # Load the first non-comment line from each stream
+ my $nextLineA;
+ my $pipeARetryCount = 0;
+ while ( !defined $nextLineA ) {
+
+ # keep trying to read from the pipe even if the writer disconnects
+ while (<UCA>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ $nextLineA = $_;
+ last;
+ }
+
+ #print STDERR "Waiting for named pipe $dir/a\n";
+ sleep 1;
+ $pipeARetryCount++;
+ if ( $pipeARetryCount > 10 ) { die "Named pipe communication with usearch failed: $dir/a\n"; }
+ }
+
+ # read synchronized blocks from each stream
+ while (1) {
+ my ( $queryLabelA, $idsA );
+
+ #print STDERR "reading next block...\n";
+
+ # idsA is a reference to a hash from targetIds to %ids
+ ( $queryLabelA, $idsA, $nextLineA ) = collectIds( *UCA, $nextLineA );
+
+ if ( ( scalar keys %$idsA ) >= $maxAccepts ) {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+
+ elsif ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsA, $singlePercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+
+ if ( !$nextLineA ) {
+
+ #print STDERR "End of stream, close\n";
+ last;
+ }
+
+ }
+
+ close(UCA) || die "can't close usearch: $!";
+
+ #print STDERR "Closed usearch stream.\n";
+
+ # for the TOOMANYHITS cases, escalate maxAccepts and try again
+ # Note this recurses, so no need to iterate here
+ if ( scalar(@$tooManyHitQueryIds) && ( $maxAccepts * 2 <= $maxMaxAccepts ) ) {
+ my $nohitQueryIdsB;
+
+ print STDERR "Escalating maxAccepts to " . ( $maxAccepts * 2 ) . " for " . scalar(@$tooManyHitQueryIds) . " sequences.\n";
+
+ # prepare input files with the remaining sequences
+ $singleReadFile = extractFasta( $singleIndex, $tooManyHitQueryIds );
+
+ ( $nohitQueryIdsB, $tooManyHitQueryIds ) =
+ doSingleSearch( $singleReadFile, $singleIndex, $singlePercentDifferenceThreshold, $maxAccepts * 2 );
+ if (@$nohitQueryIdsB) {
+ die "A TOOMANYHITS case can't turn into a NOHITS case";
+ }
+ }
+
+ print STDERR
+ "doSingleSearch $singleReadFile: Finished at pair threshold $singlePercentDifferenceThreshold and maxAccepts $maxAccepts\n";
+ print STDERR " NOHITS: " . scalar(@$nohitQueryIds) . "\n";
+ print STDERR " TOOMANYHITS: " . scalar(@$tooManyHitQueryIds) . "\n";
+
+ # print STDERR " NOHITS: " . ( join ", ", @$nohitQueryIds ) . "\n";
+ # print STDERR " TOOMANYHITS: " . ( join ", ", @$tooManyHitQueryIds ) . "\n";
+
+ # any tooManyHitQueryIds that remain had more than maxMaxAccepts hits
+ return ( $nohitQueryIds, $tooManyHitQueryIds );
+}
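+
+# For illustration (hypothetical settings, not part of the original script):
+# with maxAccepts = 100 and maxMaxAccepts = 1000, the recursive escalation in
+# doSingleSearch retries a TOOMANYHITS query at 100 -> 200 -> 400 -> 800 and
+# then stops, since 1600 > 1000; any query still saturating 800 accepts is
+# reported as TOOMANYHITS in the final tally above.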
+
+sub reconcileSingleHitsAndPrint {
+ my ( $queryLabel, $idsA, $singlePercentIdThreshold ) = @_;
+
+ my $idsByIdentity = {};
+ for my $targetLabel ( keys %$idsA ) {
+ my $percentIdA = $idsA->{$targetLabel};
+ if ( !defined $idsByIdentity->{$percentIdA} ) {
+ $idsByIdentity->{$percentIdA} = [];
+ }
+ push @{ $idsByIdentity->{$percentIdA} }, $targetLabel;
+
+ }
+
+ if ( !%$idsByIdentity ) {
+
+ #print STDERR "$queryLabel -> no reconciled hits at $singlePercentIdThreshold%\n";
+
+ # this is the NOHIT case, but we'll back off and try again
+ return 0;
+ }
+ else {
+
+ #print STDERR "$queryLabel -> printing reconciled hits at $singlePercentIdThreshold%\n";
+ printLine( $queryLabel, $idsByIdentity );
+ return 1;
+ }
+
+}
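+
+# Sketch of the structure built above (labels and values invented): the
+# $idsByIdentity hash groups target labels by their exact %id, e.g.
+#   { '98.5' => [ 'targetA', 'targetB' ], '97.2' => [ 'targetC' ] }
+# and printLine() below walks those keys in descending order.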
+
+sub doPairSearch {
+ my $readAFile = shift;
+ my $readBFile = shift;
+ my $pairPercentDifferenceThreshold = shift;
+ my $maxAccepts = shift;
+
+ my $pairPercentIdThreshold = 1. - $pairPercentDifferenceThreshold;
+
+ # because we're going to average the two %ids, we have to search each read with twice the %diff for the pair
+ my $singlePercentDifferenceThreshold = $pairPercentDifferenceThreshold * 2;
+ my $singlePercentIdThreshold = 1. - $singlePercentDifferenceThreshold;
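+
+ # worked example (threshold value invented): a pair threshold of 0.005
+ # (99.5% pair identity) searches each single read at 0.01, i.e. a 99%
+ # identity floor, since a weak hit on one read can still average out
+ # against a strong hit on its mate.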
+
+ my $nohitQueryIds = [];
+ my $tooManyHitQueryIds = [];
+
+# open the USEARCH streams
+# print STDERR
+#"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n";
+
+# open( UCA,
+#"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |"
+# ) || die "can't fork usearch: $!";
+
+# print STDERR
+#"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n";
+
+# open( UCB,
+#"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |"
+# ) || die "can't fork usearch: $!";
+
+ my $dir = tempdir(CLEANUP => 1);
+ if ( system( 'mknod', "$dir/a", 'p' ) && system( 'mkfifo', "$dir/a" ) ) { die "mk{nod,fifo} $dir/a failed"; }
+ if ( system( 'mknod', "$dir/b", 'p' ) && system( 'mkfifo', "$dir/b" ) ) { die "mk{nod,fifo} $dir/b failed"; }
+
+ # try to avoid mysterious intermittent condition where opening a named pipe blocks forever
+ #while ( !-p "$dir/a" ) {
+ # print STDERR "Waiting for named pipe $dir/a\n";
+ # sleep 1;
+ #}
+ #while ( !-p "$dir/b" ) {
+ # print STDERR "Waiting for named pipe $dir/b\n";
+ # sleep 1;
+ #}
+
+ if ( !fork() ) {
+
+# there is a mysterious intermittent condition where opening a named pipe blocks forever.
+# I think what is happening may be:
+# if the reader side of the named pipe is not already connected when usearch tries to write to it, usearch gets confused and never writes anything
+# thus when the reader side does connect, it blocks.
+
+ # one hack is just to sleep here for a while in hopes that the reader thread gets all hooked up
+ # sleep 5;
+
+ # let's try writing to it, so we block here until we know it works, and then continue on to usearch
+ open( UCAW, ">$dir/a" ) || die("Couldn't write named pipe: $dir/a");
+ print UCAW "# pipe open!\n";
+ close UCAW;
+
+ my $cmd =
+"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc $dir/a --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject";
+ print STDERR $cmd . "\n";
+ exec($cmd) or die("can't exec usearch: $!");
+ }
+
+ if ( !fork() ) {
+ open( UCBW, ">$dir/b" ) || die("Couldn't write named pipe: $dir/b");
+ print UCBW "# pipe open!\n";
+ close UCBW;
+
+ my $cmd =
+"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc $dir/b --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject";
+ print STDERR $cmd . "\n";
+ exec($cmd) or die("can't exec usearch: $!");
+ }
+ open( UCA, "$dir/a" ) || die("Couldn't read named pipe from usearch: $dir/a");
+
+ #print STDERR "Reading named pipe from usearch: $dir/a\n";
+
+ open( UCB, "$dir/b" ) || die("Couldn't read named pipe from usearch: $dir/b");
+
+ #print STDERR "Reading named pipe from usearch: $dir/b\n";
+
+ # Load the first non-comment line from each stream
+ my $nextLineA;
+ my $pipeARetryCount = 0;
+ while ( !defined $nextLineA ) {
+
+ # keep trying to read from the pipe even if the writer disconnects
+ while (<UCA>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ $nextLineA = $_;
+ last;
+ }
+
+ if ( defined $nextLineA ) { last; }
+
+ #print STDERR "Waiting for named pipe $dir/a\n";
+ sleep 1;
+ $pipeARetryCount++;
+ if ( $pipeARetryCount > 10 ) { die("Named pipe communication with usearch failed: $dir/a\n"); }
+ }
+
+ my $nextLineB;
+ my $pipeBRetryCount = 0;
+ while ( !defined $nextLineB ) {
+
+ # keep trying to read from the pipe even if the writer disconnects
+ while (<UCB>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ $nextLineB = $_;
+ last;
+ }
+
+ if ( defined $nextLineB ) { last; }
+
+ #print STDERR "Waiting for named pipe $dir/b\n";
+ sleep 1;
+ $pipeBRetryCount++;
+ if ( $pipeBRetryCount > 10 ) { die("Named pipe communication with usearch failed: $dir/b\n"); }
+ }
+
+ # read synchronized blocks from each stream
+ while (1) {
+ my ( $queryLabelA, $idsA, $queryLabelB, $idsB );
+
+ # idsA is a reference to a hash from targetIds to %ids
+ ( $queryLabelA, $idsA, $nextLineA ) = collectIds( *UCA, $nextLineA );
+ ( $queryLabelB, $idsB, $nextLineB ) = collectIds( *UCB, $nextLineB );
+
+ if ( $queryLabelA ne $queryLabelB ) {
+ die("Usearch results desynchronized: $queryLabelA neq $queryLabelB");
+ }
+
+ my $numHitsA = ( scalar keys %$idsA );
+ my $numHitsB = ( scalar keys %$idsB );
+
+ # if either read got NOHITS, then it's definitely NOHITS for the pair. This trumps TOOMANYHITS for the other read.
+ # don't bother trying to reconcile hits in this case
+
+ if ( ( $numHitsA == 0 ) || ( $numHitsB == 0 ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+
+ # if both reads got TOOMANYHITS, then it's definitely TOOMANYHITS for the pair.
+
+ elsif ( ( $numHitsA >= $maxAccepts ) && ( $numHitsB >= $maxAccepts ) ) {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+
+ # if neither read got TOOMANYHITS, then we're good to go
+
+ elsif ( ( $numHitsA < $maxAccepts ) && ( $numHitsB < $maxAccepts ) ) {
+ if ( !reconcilePairedHitsAndPrint( $queryLabelA, $idsA, $idsB, $pairPercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+
+ }
+
+ # if only one read got TOOMANYHITS...
+
+ else {
+
+ # escalate if possible
+ if ( $maxAccepts < $maxMaxAccepts ) {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+
+ # if we're already at maxMaxAccepts and we're allowed, fall back to the info provided by the other read.
+
+ # This is a little tricky: which percent ID threshold do we use?
+ # For consistency, I think we have to assume that the overly generic read is a 100% match.
+ # the cleanest way to say this is to say that the "overly generic" read just matches everything, so
+ # reconcilePairedHitsAndPrint( $queryLabelA, $idsA, $allIds, $pairPercentIdThreshold )
+ # but that would require a hash %allIds mapping every ID to 100, just because reconcilePairedHitsAndPrint says $idsB->{$targetLabel}
+ # it's equivalent to just use reconcileSingleHitsAndPrint with singlePercentIdThreshold.
+
+ elsif ($singleOKgeneric) {
+ if ( $numHitsA < $maxAccepts ) {
+ if ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsA, $singlePercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+ }
+ elsif ( $numHitsB < $maxAccepts ) {
+ if ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsB, $singlePercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+ }
+ else { die("impossible"); }
+ }
+
+ # if we're already at maxMaxAccepts, but not allowed to rely on the other read, just report TOOMANYHITS for the pair
+ else {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+ }
+ if ( !$nextLineA || !$nextLineB ) {
+ if ( !( !$nextLineA && !$nextLineB ) ) {
+ die("Usearch results desynchronized at end:\nA: $nextLineA\nB: $nextLineB");
+ }
+ last;
+ }
+
+ }
+
+ close(UCA) || die "can't close usearch: $!";
+ close(UCB) || die "can't close usearch: $!";
+
+ # for the TOOMANYHITS cases, escalate maxAccepts and try again
+ # Note this recurses, so no need to iterate here
+ if ( scalar(@$tooManyHitQueryIds) && ( $maxAccepts * 2 <= $maxMaxAccepts ) ) {
+ my $nohitQueryIdsB;
+
+ print STDERR "doPairSearch $readAFile, $readBFile: Escalating maxAccepts to "
+ . ( $maxAccepts * 2 ) . " for "
+ . scalar(@$tooManyHitQueryIds)
+ . " sequences.\n";
+
+ # prepare input files with the remaining sequences
+ my $readAFileEsc = extractFasta( $indexA, $tooManyHitQueryIds );
+ my $readBFileEsc = extractFasta( $indexB, $tooManyHitQueryIds );
+
+ ( $nohitQueryIdsB, $tooManyHitQueryIds ) =
+ doPairSearch( $readAFileEsc, $readBFileEsc, $pairPercentDifferenceThreshold, $maxAccepts * 2 );
+
+ # A TOOMANYHITS case can certainly turn into a NOHITS case:
+ # once one read is no longer TOOMANYHITS, it may turn out that nothing can be reconciled with the other read.
+
+ #if (@$nohitQueryIdsB) {
+ # die( "A TOOMANYHITS case can't turn into a NOHITS case: "
+ # . ( join ", ", @$nohitQueryIdsB )
+ # . " at pair threshold $pairPercentDifferenceThreshold and maxAccepts "
+ # . ( $maxAccepts * 2 ) );
+ # }
+
+ push @$nohitQueryIds, @$nohitQueryIdsB;
+ }
+
+ print STDERR
+ "doPairSearch $readAFile, $readBFile: Finished at pair threshold $pairPercentDifferenceThreshold and maxAccepts $maxAccepts\n";
+ print STDERR " NOHITS: " . scalar(@$nohitQueryIds) . "\n";
+ print STDERR " TOOMANYHITS: " . scalar(@$tooManyHitQueryIds) . "\n";
+
+ # print STDERR " NOHITS: " . ( join ", ", @$nohitQueryIds ) . "\n";
+ # print STDERR " TOOMANYHITS: " . ( join ", ", @$tooManyHitQueryIds ) . "\n";
+
+ # any tooManyHitQueryIds that remain had more than maxMaxAccepts hits
+ return ( $nohitQueryIds, $tooManyHitQueryIds );
+}
+
+sub reconcilePairedHitsAndPrint {
+ my ( $queryLabel, $idsA, $idsB, $pairPercentIdThreshold ) = @_;
+
+ my $idsByIdentity = {};
+ for my $targetLabel ( keys %$idsA ) {
+ my $percentIdA = $idsA->{$targetLabel};
+ my $percentIdB = $idsB->{$targetLabel};
+ if ($percentIdB) {
+
+ # both reads hit the same target
+ # compute the average percent id
+ my $pairPercentId = ( $percentIdA + $percentIdB ) / 2.0;
+ if ( $pairPercentId >= $pairPercentIdThreshold ) {
+ if ( !defined $idsByIdentity->{$pairPercentId} ) {
+ $idsByIdentity->{$pairPercentId} = [];
+ }
+ push @{ $idsByIdentity->{$pairPercentId} }, $targetLabel;
+ }
+ }
+ }
+
+ if ( !%$idsByIdentity ) {
+
+ #print STDERR "$queryLabel -> no reconciled hits at $pairPercentIdThreshold%\n";
+
+ # this is the NOHIT case, but we'll back off and try again
+ return 0;
+ }
+ else {
+
+ #print STDERR "$queryLabel -> printing reconciled hits at $pairPercentIdThreshold%\n";
+ printLine( $queryLabel, $idsByIdentity );
+ return 1;
+ }
+
+}
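+
+# Worked example (numbers invented, in the units of the .uc stream): if read A
+# matches target X at 98 and read B matches the same X at 96, the averaged pair
+# score is (98 + 96) / 2 = 97, and X survives only if 97 meets the pair
+# threshold. A target hit by just one of the two reads has no entry in the
+# other read's hash and is dropped outright.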
+
+##### Read a block of target IDs for one query from a USEARCH stream
+sub collectIds {
+
+ # assume that the records for a given label are all grouped together, so we can collect the hits for each label and process them in turn
+ # need to jump through some hoops to save the first line of the next block (since we can't rewind a pipe with seek()).
+
+ my $fh = shift;
+ my $firstLine = shift;
+
+ my %hits = ();
+
+ # process the first line (cached from the previous call)
+ #print STDERR "Processing firstLine: $firstLine\n";
+ my ( $typeF, $clusterF, $sizeF, $percentIdF, $strandF, $queryStartF, $targetStartF, $alignmentF, $queryLabelF, $targetLabelF ) =
+ split /\t/, $firstLine;
+ chomp $targetLabelF;
+
+ if ( $queryLabelF =~ /$idRegex/ ) { $queryLabelF = $1; }
+
+ #$queryLabelF =~ s/\s.*//; # primary ID is only the portion before whitespace
+
+ if ( $typeF eq "N" ) {
+
+ #print STDERR "$queryLabelF -> N\n";
+
+ # NOHIT represented as empty hash
+ # note we still have to cache the next line, after filtering comments
+ while (<$fh>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ my $line = $_;
+ chomp $line;
+
+ #print STDERR "$line\n";
+ return ( $queryLabelF, \%hits, $line );
+ }
+ }
+ elsif ( $typeF eq "H" ) {
+ $hits{$targetLabelF} = $percentIdF;
+ }
+
+ # process the remaining lines
+ while (<$fh>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ my $line = $_;
+ chomp $line;
+
+ #print STDERR "$line\n";
+
+ my ( $type, $cluster, $size, $percentId, $strand, $queryStart, $targetStart, $alignment, $queryLabel, $targetLabel ) =
+ split /\t/;
+ chomp $targetLabel;
+
+ if ( $queryLabel =~ /$idRegex/ ) { $queryLabel = $1; }
+
+ #$queryLabel =~ s/\s.*//; # primary ID is only the portion before whitespace
+
+ if ( $queryLabel ne $queryLabelF ) {
+
+ # advanced past the end of the current block
+ #print STDERR "End of block $queryLabelF -> block of " . scalar( keys %hits ) . " hits\n";
+
+ #print STDERR "End of block, return\n";
+ return ( $queryLabelF, \%hits, $line );
+ }
+ else {
+ if ( $type eq "N" ) {
+ die "impossible: $queryLabel reports no hit after it already had a hit";
+ }
+ elsif ( $type eq "H" ) {
+ $hits{$targetLabel} = $percentId;
+ }
+ }
+ }
+
+ # end of the stream (the last block)
+ #print STDERR "End of stream $queryLabelF -> block of " . scalar( keys %hits ) . " hits\n";
+
+ #print STDERR "End of stream, return\n";
+ return ( $queryLabelF, \%hits, "" );
+}
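+
+# For reference, the tab-separated .uc fields consumed above are, in order:
+# type, cluster, size, %id, strand, query start, target start, alignment,
+# query label, target label; so a hit record looks roughly like (values
+# invented):
+#   H  42  253  97.6  +  0  0  253M  someQueryId  someTargetId
+# while a type "N" record marks a query with no hit at the current threshold.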
+
+my $fastaNum = 0;
+
+sub extractFasta {
+ my $index = shift;
+ my $ids = shift;
+ my $name = $fastaNum++;
+
+ # print STDERR "Extracting " . scalar(@$ids) . " to file $name\n";
+
+ open( OUT, ">$name" ) or die "Can't write $name: $!";
+
+ # my $out = Bio::SeqIO->new( '-format' => 'Fasta', '-fh' => \*OUT );
+ for my $id (@$ids) {
+ my $seqobj = $index->fetch($id);
+ if ( !defined $seqobj ) {
+ print STDERR "Extracting from " . $index->fastaFileName() . ": Undefined: $id\n";
+ }
+
+ # elsif ($seqobj->primary_id() ne $id)
+ # {
+ # print STDERR "Extracting from " . $index->filename() . ": ID problem: " . ($seqobj->primary_id()) . " ne $id\n";
+ # }
+ else {
+
+ #$out->write_seq($seqobj);
+ print OUT $seqobj;
+ }
+ }
+ close OUT;
+
+ return $name;
+}
+
+sub printLine {
+ my ( $label, $idsByIdentity ) = @_;
+
+ my @ids = ();
+ my @pcids = sort { $b <=> $a } keys %$idsByIdentity;
+ my $bestPcid = $pcids[0];
+ my $acceptPcid = $bestPcid - $slop;
+ for my $pcid (@pcids) {
+ if ( $pcid < $acceptPcid ) { last; }
+ push @ids, @{ $idsByIdentity->{$pcid} };
+ }
+
+ my $idString = ( join "\t", @ids );
+ print "$label\t$bestPcid\t" . $idString . "\n";
+
+ # print STDERR "$label\t$bestPcid\t" . $idString . "\n";
+}
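+
+# Illustration of the $slop window (numbers invented; $slop is set elsewhere
+# in this script): with a best %id of 98.6 and $slop = 0.5, every target at
+# %id >= 98.1 is printed on the output line, and lower-scoring targets are
+# dropped even though they passed the search threshold.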
+
+sub revcompFile {
+ my $infile = shift;
+
+ my $outfile = $infile . ".rc";
+
+ open( IN, $infile ) or die "Can't read $infile\n";
+ open( OUT, ">$outfile" ) or die "Can't write $outfile\n";
+
+ while (<IN>) {
+ my $a = $_;
+ chomp $a;
+ if ( !( $a =~ /^>/ ) ) {
+ $a = reverse $a;
+ $a =~ tr/ACGTacgt/TGCAtgca/;
+ }
+ print OUT $a . "\n";
+ }
+ close IN;
+ close OUT;
+ return $outfile;
+}
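+
+# Worked example: a sequence line "AACGT" is reversed to "TGCAA" and then
+# complemented to "ACGTT"; header lines (">...") pass through unchanged.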
+
+main();
diff --git a/scripts/rtaxVote.pl b/scripts/rtaxVote.pl
new file mode 100755
index 0000000..a81bba3
--- /dev/null
+++ b/scripts/rtaxVote.pl
@@ -0,0 +1,286 @@
+#!/usr/bin/env perl
+
+# read the prokMSAids of the hits, as produced by exactMatchIds.
+# Then grab the taxonomy info for those prokMSAids and make some classification info from the set.
+
+use strict;
+use warnings;
+
+#use lib '.';
+use FindBin;
+use lib "$FindBin::Bin";
+
+use LevelPrediction;
+
+my %taxonomies = ();
+
+# my %taxonomyPrior = ();
+
+loadTaxonomy( $ARGV[0] );
+processHits( $ARGV[1] );
+
+sub loadTaxonomy {
+ my ($taxFileName) = @_;
+ print STDERR "loading taxonomy...\n";
+ open( TAX, $taxFileName ) or die "Cannot open input taxonomy file $taxFileName: $!";
+ my $lines = 0;
+ while ( my $line = <TAX> ) {
+ chomp $line;
+
+ # try to accept several formats:
+ # semicolon or tab-delimited taxonomy strings
+ # and including the "pcid-width" column between the sequence id and
+ # the taxonomy string, or not
+
+ my ( $prokMSAid, @taxonomy ) = split( /[;\t]/, $line );
+ #my ( @taxonomy ) = split( /; /, $taxString );
+
+ if ( @taxonomy && ($taxonomy[0] eq "" || $taxonomy[0] =~ /^(\d+\.?\d*|\.\d+)$/ )) {
+ # value is numeric or empty, must be the pcid width of the target cluster
+ my $taxPcid = shift @taxonomy;
+ #print STDERR "Ignoring target cluster width: $taxPcid\n";
+ }
+
+ $taxonomies{$prokMSAid} = \@taxonomy;
+
+# my $taxString = "";
+# for my $taxElem (@taxonomy) {
+# $taxString .= "$taxElem; ";
+#
+# # $taxonomyPrior{$taxString}++;
+# }
+ $lines++;
+ }
+ close TAX;
+ print STDERR "...done loading $lines taxonomy lines\n";
+
+ # print STDERR "normalizing prior...\n";
+ # for my $taxKey (keys %taxonomyPrior)
+ # {
+ # $taxonomyPrior{$taxKey} /= $lines;
+ # }
+ # $taxonomyPrior{""} = 1;
+ # print STDERR "...done normalizing prior\n";
+}
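+
+# Example input lines the parser above accepts (IDs and taxa invented):
+#   12345<TAB>0.97<TAB>Bacteria; Proteobacteria; Gammaproteobacteria
+#   67890<TAB>Archaea<TAB>Crenarchaeota
+# In the first form the numeric second column (the cluster pcid width) is
+# detected and discarded; in the second the taxonomy string itself is
+# tab-delimited, so every field after the id is kept as a taxon.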
+
+sub printLine {
+ my ( $label, $bestPcid, @ids ) = @_;
+
+ my $levelPredictions = makeTaxonomyPrediction(@ids);
+ print "$label\t$bestPcid";
+ for my $levelPrediction (@$levelPredictions) {
+ print "\t" . $levelPrediction->toString();
+ }
+
+ #if(scalar(@$levelPredictions) == 0) {$unclassified++}
+ #if(scalar(@$levelPredictions) == 0) { print "\tUNCLASSIFIED"; }
+
+ print "\n";
+
+ return ( scalar(@$levelPredictions) == 0 );
+}
+
+sub processHits {
+ my ($hitsFileName) = @_;
+ open( HITS, $hitsFileName ) || die("Could not open $hitsFileName");
+
+ my $hit = 0;
+ my $noprimer = 0;
+ my $nohit = 0;
+ my $toomanyhits = 0;
+ my $nomatepair = 0;
+ my $unclassified = 0;
+
+ while (<HITS>) {
+ chomp;
+ my ( $label, $bestPcid, @ids ) = split /\t/;
+
+ if ( ( !@ids ) || ( $ids[0] eq "" ) ) # an empty id list
+ {
+ print "$label\t\tNOHIT\n";
+ $nohit++;
+ }
+ elsif ( $ids[0] eq "NOHIT" ) # an explicit NOHIT token; historically this case was represented only by an empty id list
+ {
+ print "$label\t\tNOHIT\n";
+ $nohit++;
+ }
+ elsif ( ( $ids[0] eq "NOPRIMER" ) ) {
+ print "$label\t\tNOPRIMER\n";
+ $noprimer++;
+ }
+ elsif ( ( $ids[0] eq "TOOMANYHITS" ) ) {
+ print "$label\t\tTOOMANYHITS\n";
+ $toomanyhits++;
+ }
+ elsif ( ( $ids[0] eq "NOMATEPAIR" ) ) {
+ print "$label\t\tNOMATEPAIR\n";
+ $nomatepair++;
+ }
+ else {
+ $unclassified += printLine( $label, $bestPcid, @ids );
+ $hit++;
+ }
+ }
+
+ my $samples = $hit + $nohit + $toomanyhits + $nomatepair + $noprimer;
+
+ print STDERR "$samples items\n";
+ if ( $samples != 0 ) {
+ print STDERR "$noprimer (" . sprintf( "%.1f", ( 100 * $noprimer / $samples ) ) . "%) had no primer match\n";
+ print STDERR "$nohit (" . sprintf( "%.1f", ( 100 * $nohit / $samples ) ) . "%) had no hits\n";
+ print STDERR "$toomanyhits (" . sprintf( "%.1f", ( 100 * $toomanyhits / $samples ) ) . "%) had too many hits\n";
+ print STDERR "$nomatepair (" . sprintf( "%.1f", ( 100 * $nomatepair / $samples ) ) . "%) had no mate pair\n";
+ print STDERR "$unclassified (" . sprintf( "%.1f", ( 100 * $unclassified / $samples ) ) . "%) had hits but no classification\n";
+ print STDERR "" . ( $hit - $unclassified ) . " (" . sprintf( "%.1f", ( 100 * ( $hit - $unclassified ) / $samples ) ) . "%) were classified\n";
+ }
+
+ # classifications per level? That depends on the stringency filter, which is downstream
+
+ close HITS;
+}
+
+#sub normalizeByPrior {
+# my ($taxString, $taxonCounts) = @_;
+#
+# print STDERR "Normalizing: \n";
+# while( my ($k, $v) = each %$taxonCounts) {
+# print STDERR "$k = $v ; ";
+# }
+# print STDERR "\n";
+#
+# my $normalizer = $taxonomyPrior{$taxString};
+# if(!defined $normalizer)
+# {
+# print STDERR ("No prior: $taxString\n");
+# }
+#
+# for my $taxon (keys %$taxonCounts)
+# {
+# $taxonCounts->{$taxon} = ($taxonCounts->{$taxon} / $taxonomyPrior{$taxString . $taxon . "; "}) * $normalizer;
+# }
+# print STDERR " Done: \n";
+# while( my ($k, $v) = each %$taxonCounts) {
+# print STDERR "$k = $v ; ";
+# }
+# print STDERR "\n";
+#}
+
+sub makeTaxonomyPrediction {
+ my (@ids) = @_;
+
+ my @levelPredictions = ();
+
+# TOOMANYHITS is now added in ucFilterBest.pl, so it should already be caught in processHits above
+
+# if(scalar(@ids) == 1000) {
+# # when we did the uclust at exactMatchOneReadSet:50, we used " --allhits --maxaccepts 1000 ".
+# # after that we filtered for the best pcid set (using ucFilterBest.pl).
+# # if 1000 hits remain, then the real set of best-pcid hits is larger, and we missed some.
+# # In that case we should bail because the set of 1000 hits we do have may not be representative.
+# # I think this is the reason why matching the E517F primer only (17nt reads) produced predictions, and at different levels to boot.
+# # That also depends on the classification thresholds.
+#
+# my $levelPrediction = LevelPrediction->new();
+# $levelPrediction->label("TOOMANYHITS");
+# push @levelPredictions, $levelPrediction;
+# return \@levelPredictions;
+# }
+
+ my @taxonomyVectors = map { $taxonomies{$_} } @ids;
+# my @taxonomyClusterSizes = map { $taxonomyWorstPcids{$_} } @ids;
+
+ my $globalTotal = @taxonomyVectors;
+
+ my $predictedTaxa = "";
+
+ my $globalUnknowns = 0; # at all levels, incrementing as we go
+
+ for my $level ( 0 .. 10 ) {
+
+ my $levelPrediction = LevelPrediction->new();
+
+ # assert the remaining taxonomyVectors are equal at higher levels
+
+ my %taxonCounts = ();
+ my $localUnknowns = 0;
+ my $localTotal = @taxonomyVectors;
+
+ # count up how often each label is seen descending from this node
+
+ for my $vec (@taxonomyVectors) {
+ my $taxon = $vec->[$level];
+
+ # "Unknown" should occur only towards the leaves; an unspecified intermediate level followed by a populated level is a "skip".
+ # Here, "skip" is counted as just another taxon.
+ if ( !defined $taxon || $taxon eq "" || $taxon =~ /unknown/i ) { $localUnknowns++; }
+ else { $taxonCounts{$taxon}++; }
+ }
+
+ if ( $localUnknowns == $localTotal ) { last; }
+
+ # this normalization makes no sense, because we don't want a uniform prior either
+ # e.g., one Archaeal hit among dozens of Bacterial hits will win, because there are so few Archaea in GreenGenes to begin with
+ # normalizeByPrior( $predictedTaxa, \%taxonCounts );
+
+ # get the best label and check for ties
+
+ $levelPrediction->numChildren( scalar( keys %taxonCounts ) );
+
+ my @taxaInOrder = sort { $taxonCounts{$b} <=> $taxonCounts{$a} } keys %taxonCounts;
+ my $bestTaxon = $taxaInOrder[0];
+
+ # print STDERR "$bestLabel\n";
+ $levelPrediction->label($bestTaxon);
+ $predictedTaxa .= "$bestTaxon; ";
+
+ my $bestCount = $taxonCounts{$bestTaxon};
+ $levelPrediction->count($bestCount);
+
+ my $secondBestTaxon = $taxaInOrder[1];
+ if ( defined $secondBestTaxon ) {
+ my $secondBestCount = $taxonCounts{$secondBestTaxon};
+
+ if ( $levelPrediction->count() < 2 * $secondBestCount ) {
+ # Declare a tie if the winning taxon doesn't have at least twice as many votes as the runner-up.
+ # just consider this level unknown and bail
+ last;
+ }
+ }
+
+ # compute various ratios of the prediction vs. alternatives
+
+ $levelPrediction->localMinProportion( $bestCount / $localTotal );
+ $levelPrediction->localMaxProportion( ( $bestCount + $localUnknowns ) / $localTotal );
+
+ # accumulate unknowns across all levels seen so far
+ $globalUnknowns += $localUnknowns;
+ $levelPrediction->globalMinProportion( $bestCount / $globalTotal );
+ $levelPrediction->globalMaxProportion( ( $bestCount + $globalUnknowns ) / $globalTotal );
+
+ # what if all of the "unknown" matches should have been the same? Then an "alternate" classification might have won
+ $levelPrediction->alternateLocalProportion( $localUnknowns / $localTotal );
+
+ # note we already added the local unknowns to the global unknowns
+ $levelPrediction->alternateGlobalProportion( $globalUnknowns / $globalTotal );
+
+ # it's possible that a completely different path has a higher global probability than this one,
+ # but we'd never know because we pick the max _per level_ and never explore the other paths.
+
+ # decide whether to bother continuing
+ # if ( $bestLocalMinProportion < 0.5 ) { last; }
+ # for now, print all the best labels until everything is unknown or there is a tie; sort out the confidence later
+
+ push @levelPredictions, $levelPrediction;
+
+ # remove any non-matching paths from consideration at the next level
+
+ my @newTaxonomyVectors = ();
+ for my $vec (@taxonomyVectors) {
+ my $taxon = $vec->[$level];
+ if ( defined $taxon && $taxon eq $bestTaxon ) { push @newTaxonomyVectors, $vec; }
+ }
+ @taxonomyVectors = @newTaxonomyVectors;
+
+ }
+ return \@levelPredictions;
+
+}
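+
+# Tie-rule illustration (counts invented): if the votes at some level are
+# { Firmicutes => 10, Proteobacteria => 6 }, then 10 < 2 * 6, so the winner
+# lacks a 2x margin over the runner-up and prediction stops at the previous
+# level; with { Firmicutes => 13, Proteobacteria => 6 } it continues down
+# the Firmicutes branch.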
diff --git a/scripts/splitDelimitedFasta.pl b/scripts/splitDelimitedFasta.pl
new file mode 100755
index 0000000..d85c230
--- /dev/null
+++ b/scripts/splitDelimitedFasta.pl
@@ -0,0 +1,46 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+use File::Basename;
+
+# make sure the delimiter is properly escaped, e.g. for "+" we need "\+".
+my $delimiter = $ARGV[0];
+my $infile    = $ARGV[1];
+
+my $infilename = basename($infile);
+
+open(IN, $infile) or die "Can't read $infile\n";
+open(A,">$infilename.a") or die "Can't write $infilename.a\n";
+open(B,">$infilename.b") or die "Can't write $infilename.b\n";
+
+my $fastaHeader = "";
+my $outfile = *A;
+
+while(<IN>)
+ {
+ chomp;
+ if(/^>/)
+ {
+ $outfile = *A;
+ $fastaHeader = $_;
+ print A "$fastaHeader\n";
+ print B "$fastaHeader\n";
+ }
+ else
+ {
+ if ( /(.*)$delimiter(.*)/)
+ {
+ print A "$1\n";
+ print B "$2\n";
+ $outfile = *B;
+ }
+ else
+ {
+ print $outfile "$_\n";
+ }
+ }
+ }
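+
+# Worked example (delimiter "\+", sequences invented): the record
+#   >seq1
+#   AAAA+TTTT
+#   CCCC
+# writes ">seq1" to both outputs, "AAAA" to file .a, and "TTTT" then "CCCC"
+# to file .b, since lines after the delimited one keep going to .b.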
+close IN;
+close A;
+close B;
\ No newline at end of file
--
Alioth's /git/debian-med/git-commit-notice on /srv/git.debian.org/git/debian-med/rtax.git