[med-svn] [rtax] 01/01: Imported Upstream version 0.984
Andreas Tille
tille at debian.org
Thu Nov 28 10:05:20 UTC 2013
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository rtax.
commit cb0bfb7e562f2a17dac9959f78bf91419150a90f
Author: Andreas Tille <tille at debian.org>
Date: Thu Nov 28 11:08:34 2013 +0100
Imported Upstream version 0.984
---
README | 139 ++++
changelog | 61 ++
greengenes/greengenes.README | 69 ++
greengenes/prepare-greengenes | 76 +++
rtax | 285 ++++++++
scripts/FastaIndex.pm | 136 ++++
scripts/LevelPrediction.pm | 71 ++
scripts/classificationQualityFilter.pl | 51 ++
scripts/classificationQualityStripper.pl | 7 +
scripts/fastaidx.pl | 22 +
scripts/greengenesExtract.pl | 49 ++
scripts/joinDelimitedFasta.pl | 38 ++
scripts/parseUclustClusters.pl | 52 ++
scripts/rtaxSearch.pl | 1062 ++++++++++++++++++++++++++++++
scripts/rtaxVote.pl | 286 ++++++++
scripts/splitDelimitedFasta.pl | 46 ++
16 files changed, 2450 insertions(+)
diff --git a/README b/README
new file mode 100644
index 0000000..d9be133
--- /dev/null
+++ b/README
@@ -0,0 +1,139 @@
+RTAX: Rapid and accurate taxonomic classification of short paired-end
+ sequence reads from the 16S ribosomal RNA gene.
+
+David A. W. Soergel (1*), Rob Knight (2), and Steven E. Brenner (1)
+
+1 Department of Plant and Microbial Biology,
+ University of California, Berkeley
+2 Howard Hughes Medical Institute and Department of Chemistry
+ and Biochemistry, University of Colorado at Boulder
+* Corresponding author (current address): soergel at cs.umass.edu
+
+Version 0.984 (August 7, 2013)
+
+http://dev.davidsoergel.com/rtax
+
+
+Requirements
+------------
+
+ * Hardware: memory requirements vary with the size of the reference database;
+ for the version of GreenGenes we provide, ~4GB is needed.
+
+ * USEARCH (http://www.drive5.com/usearch/). RTAX has been tested with
+ v4.1.93 and v5.0.144. Make sure that the program is in your search
+ path as "usearch", or provide its path explicitly in the "rtax" script.
+ Note that v5.x uses quite a bit more memory, so 4.x may be preferred.
+
+ * A reference database consisting of sequences for which taxonomy assignments
+ are known. We provide prepared versions of the GreenGenes database on the
+ RTAX web site; see the GreenGenes section below for details.
+
+ Two files are needed:
+ 1. a FASTA file containing the sequences, and
+ 2. a file listing taxonomy strings for each sequence ID found in the
+ FASTA file. The format of this file is
+
+sequenceid pcidwidth kingdom; phylum; class; order; ...
+
+ one entry per line, where sequenceid, pcidwidth, and the taxonomy
+ string are separated by a single tab, and the taxonomy string itself
+ is delimited either by semicolons or by tabs. The pcidwidth column
+ is optional, and (if present) is ignored in the current version of
+ RTAX anyway. (In our usage, we cluster the reference database into
+ 99% id OTUs, representing each by a single "seed" sequence. This
+ column then lists the largest pairwise %id difference between the
+ cluster seed and any cluster member.)
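+
+   For example, a single hypothetical entry (columns separated by tabs):
+
+SEQ4567	0.71	Bacteria; Proteobacteria; Gammaproteobacteria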
+
+
+Installation
+------------
+
+RTAX consists of a set of perl scripts and a shell script ("rtax") to wire
+them together. No compilation is needed; just extract the package to a
+convenient location.
+
+The perl scripts must remain in the "scripts" directory below the "rtax" shell
+script, but the latter can be symlinked anywhere for convenience. A common
+pattern might be to place the whole package at /usr/local/rtax, and then
+symlink the script into your path (e.g., "ln -s /usr/local/rtax/rtax
+/usr/local/bin").
+
+
+Running RTAX
+------------
+
+Sequence classification is done as follows:
+
+ rtax -r gg.nr.fasta -t gg.nr.taxonomy -a queryA.fasta [-b queryB.fasta] -o result.out
+
+Substitute a different reference database for the GreenGenes files if desired,
+of course.
+
+If two query files are provided, then they must contain mate-paired reads
+with exactly matching identifiers (though they need not be in the same order).
+Any ids present in one file but not the other are reported as NOMATEPAIR by
+default, or classified in single-ended mode when the -f fallback option is
+given. Alternate naming schemes for the paired reads (e.g., "SOMEID.a" paired
+with "SOMEID.b", and so forth) are handled via the -i option.
+
+RTAX considers query sequences in the forward sense with respect to the
+reference database. Thus, in a paired-end experiment, the reverse reads will
+likely need to be reverse-complemented with the -y option.
+
+RTAX may be run on single-ended reads as well; simply omit the -b parameter
+in this case.
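+
+For example:
+
+   rtax -r gg.nr.fasta -t gg.nr.taxonomy -a query.fasta -o result.out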
+
+Run "rtax" with no arguments for a help message describing additional options.
+
+Various progress messages are provided on stderr, and the predicted
+classification for each query is provided in the file given by the -o option.
+The output format is essentially the same as the taxonomy file input format
+above. The second column indicates the best %id observed between the query
+sequence and any reference sequence, and the taxonomy string is tab-delimited.
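+
+For example, a single (hypothetical) line of output, with tab-separated
+columns:
+
+SEQ4567	99.2	Bacteria	Proteobacteria	Gammaproteobacteria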
+
+Temporary files are created in /tmp/rtax.SOMENUMBER (or under the temp
+directory given by -m). These can be quite large; please use -m to select a
+location with sufficient free space, as needed. The temporary directory is
+deleted upon successful termination, but may need to be manually cleaned up in
+the event of an error of some sort.
+
+
+Using the GreenGenes reference database
+---------------------------------------
+
+An RTAX-formatted reference database based on GreenGenes (version of March 11,
+2011) is available at
+http://dev.davidsoergel.com/rtax/greengenes.reference.20110311.tgz. That
+contains gg.nr.fasta and gg.nr.taxonomy, which are the result of clustering
+the GreenGenes input file at 99% identity, finding consensus taxonomy strings
+for each cluster, and formatting the result as needed for use with RTAX.
+
+Please see the "greengenes" subdirectory in the RTAX distribution for
+information on using newer versions of GreenGenes.
+
+
+Preparing a non-GreenGenes reference database
+---------------------------------------------
+
+You can certainly use different databases, or GreenGenes clustered by
+different means or with different parameters (or not clustered at all, though
+performance will suffer). If you have any trouble producing the reference
+fasta and taxonomy files in the required format, examine the
+prepare-greengenes script for hints, and feel free to contact us for help.
+
+
+Plumbing
+--------
+
+Taxonomy assignment proceeds in three phases, which we implement separately:
+
+1. For each query sequence (or mate pair), find an appropriate set of
+   matching hits in a reference database. This is implemented in
+   rtaxSearch.pl, which handles both single and paired-end reads.
+
+2. Find consensus taxonomy among each set of reference hits. This is
+   implemented as rtaxVote.
+
+3. Finally, the detailed rtaxVote results are filtered and cleaned up for
+   output (classificationQualityFilter and classificationQualityStripper).
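+
+In terms of the bundled scripts, the pipeline run by the "rtax" wrapper is
+roughly the following (simplified; see the rtax script itself for the exact
+flags):
+
+   scripts/rtaxSearch.pl --database ref.fasta --queryA a.fasta \
+       --queryB b.fasta --usearch usearch > rawhits
+   scripts/rtaxVote.pl ref.taxonomy rawhits \
+       | scripts/classificationQualityFilter.pl 1 0.0 0.5 \
+       | scripts/classificationQualityStripper.pl > result.out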
diff --git a/changelog b/changelog
new file mode 100644
index 0000000..0fb4c8f
--- /dev/null
+++ b/changelog
@@ -0,0 +1,61 @@
+Version 0.984 (2013-08-07)
+===========================
+
+* Fix tempdir calls for wider compatibility.
+* Add NOMATEPAIR classification when --single_ok is not selected, so that every input line gets an output line.
+
+
+
+Version 0.983 (2012-09-05)
+===========================
+
+* add options -x and -y to reverse-complement reads 1 and 2, respectively
+* fixed completely broken fallback to single-ended classification, and miscounting of NOHIT cases
+* fix idList command-line parsing bug
+
+
+Version 0.982 (2012-07-02)
+===========================
+
+* getopts bug fix
+* try to fix issues with TOOMANYHITS cases
+
+
+Version 0.981 (2012-05-08)
+===========================
+
+* change all references to "/usr/bin/perl" to "/usr/bin/env perl" for wider compatibility
+
+
+Version 0.98 (2012-05-07)
+==========================
+
+* ensured global alignment with iddef 2, i.e. ignoring terminal gaps in computing percent identity, which was the intention all along
+* Fix intermittent hanging due to squirrely interaction of usearch with named pipes
+
+
+Version 0.96 (2012-03-01)
+==========================
+
+* Make fallback optional. Store FASTA indexes as DBM. Add fastaidx tool
+* Combined pair + single script with fallback
+
+
+Version 0.94 (2012-01-12)
+==========================
+
+* add tempdir option
+* add idList option
+* add id regex option
+
+
+Version 0.93 (2012-01-09)
+=========================
+
+* Ditch Bio::Index::Fasta
+
+
+Version 0.92 (2011-12-28)
+=========================
+
+* Prep Qiime integration
\ No newline at end of file
diff --git a/greengenes/greengenes.README b/greengenes/greengenes.README
new file mode 100644
index 0000000..b158a70
--- /dev/null
+++ b/greengenes/greengenes.README
@@ -0,0 +1,69 @@
+Caution re. using newer versions of GreenGenes
+----------------------------------------------
+
+As of September 2011, we do not recommend using a version of GreenGenes newer
+than March 11, 2011. The reason is that GreenGenes has expanded dramatically in
+this time--from about half a million to about one million sequences--but the
+annotation process has not kept up. Thus, newer versions contain much higher
+proportions of sequences that either carry no taxonomic annotation at all or
+that list "unclassified" at many or even all ranks.
+
+At present, RTAX assumes that a sequence that is either not annotated or
+"unclassified" at some rank does not belong to any known taxonomic group at that
+rank. That is, we assume that the lack of an annotation constitutes an assertion
+that the sequence is novel. This assumption makes our taxonomic assignment
+process more conservative. Our interpretation is very different from the
+alternative, that the sequence could perhaps be fully annotated with known
+names, but that this simply has not been done yet.
+
+This question manifests itself in the voting procedure. We require that half of
+_all_ hits to a query sequence agree on a taxonomic group (at whatever rank) in
+order to make an assignment. We include hits with no annotation in the
+denominator, corresponding to the idea that those sequences do not belong to the
+group being considered in the numerator.
+
+We reasoned that the alternative procedure--considering the proportion of
+concordant hits out of only those carrying any annotation at all (at a given
+rank)--risks overly biasing taxonomic assignments towards previously named
+groups.
+
+On the other hand, our inclusion of unannotated sequences in the denominator
+means that, when those sequences are highly prevalent, the annotated sequences
+in the numerator rarely reach the 50% threshold required to make an
+assignment--even if they are in perfect agreement.
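+
+For example, with made-up numbers: if a query has 10 hits of which 4 are
+annotated as phylum X and the other 6 carry no phylum annotation, the vote
+for X is 4/10 = 40%, short of the 50% threshold, so no phylum is assigned
+even though every annotated hit agrees.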
+
+We can hope that GreenGenes taxonomic assignments will become more comprehensive
+in the future. Alternatively, we should carefully consider whether it is
+appropriate to switch to the more permissive approach of ignoring unannotated
+sequences when voting at each rank, or if some alternative voting procedure
+better balances the risks of misannotation versus low classification rate. Until
+this is resolved, it seems prudent to simply use the March 2011 database since
+it is known to work well.
+
+Preparing the Greengenes reference database
+-------------------------------------------
+
+To regenerate the GreenGenes reference database files for use with RTAX (e.g.
+with a newer GreenGenes version):
+
+Download the complete GreenGenes database to a disk with at least 20GB free
+space (as of Sep 2011, but GreenGenes is growing rapidly):
+
+ wget http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz
+
+Then, while in the directory containing the downloaded .txt.gz file, run
+prepare-greengenes. (If usearch is not in your path, use "prepare-greengenes
+/path/to/usearch"). This takes a couple of hours and about 4GB of memory. Go out
+for coffee and whatnot. When you get back, you'll find two files have been
+produced: gg.nr.fasta and gg.nr.taxonomy.
+
+Note that the free (32-bit) version of USEARCH can use at most 2GB memory on
+some systems and at most 4GB memory on others. 2GB memory is not sufficient to
+cluster the current version of GreenGenes, and 4GB barely suffices. Thus,
+GreenGenes will soon be too large to cluster at 99% id using USEARCH without a
+paid 64-bit license. (Note that USEARCH reports "2.0 GB available" even if in
+fact 4GB are available, so don't be discouraged until it actually crashes.) When
+4GB is also insufficient, and lacking a paid USEARCH license, a possible
+strategy will be to cluster at lower %id thresholds. This requires less memory
+and produces a smaller, less fine-grained reference database. We have not
+explored the impact of doing this on classification performance.
diff --git a/greengenes/prepare-greengenes b/greengenes/prepare-greengenes
new file mode 100755
index 0000000..f2e8f18
--- /dev/null
+++ b/greengenes/prepare-greengenes
@@ -0,0 +1,76 @@
+#!/bin/bash
+
+cur=`pwd`
+
+# we want to call subsidiary scripts in the "scripts" subdir, but we don't know where we're installed or what the working directory is.
+# solution from http://hintsforums.macworld.com/archive/index.php/t-73839.html
+
+IFS=$' \t\n'
+declare -x PATH=/bin:/usr/bin
+arg=$0; [[ -L $0 ]] && arg=$(stat -f '%Y' "$0")
+pth=$(2>/dev/null cd "${arg%/*}" >&2; echo "`pwd -P`/${arg##*/}")
+par=$(dirname "$pth")
+scripts="$par/../scripts"
+
+if [[ ( $# -ne 1 ) && ($# -ne 0) ]]
+then
+    echo "Usage: `basename $0` [/path/to/usearch] (run in a directory containing greengenes16SrRNAgenes.txt.gz)"
+ exit 1
+fi
+
+# Grab the command-line arguments
+
+#gg=$1
+usearch=`which usearch`
+if [[ ($# -ne 0) ]]
+then
+ usearch=$1
+fi
+echo using usearch: $usearch
+
+# create a temporary working directory
+
+#tempdir=/tmp/rtax.$RANDOM
+tempdir=./rtax.$RANDOM
+mkdir $tempdir
+cd $tempdir
+
+
+# Extract the sequence and taxonomy information:
+#echo gunzip -c -S .gz $cur/greengenes16SrRNAgenes.txt.gz \| $scripts/greengenesExtract.pl
+echo Extracting raw fasta and taxonomy files
+gunzip -c -S .gz $cur/greengenes16SrRNAgenes.txt.gz | $scripts/greengenesExtract.pl
+
+# That will write greengenes.fasta and greengenes.taxonomy into the current directory.
+
+# In our experiments, we cluster the reference sequences into 99% id OTUs, and find a consensus taxonomy string for each resulting cluster.
+
+# sort by length
+echo sorting fasta file by length
+$usearch --sort greengenes.fasta --output greengenes.fasta.sorted
+
+# cluster
+echo clustering at 99% identity
+$usearch --cluster greengenes.fasta.sorted --usersort --iddef 2 --uc gg.nr.uc --seedsout $cur/gg.nr.fasta --id 0.99 --maxrejects 128 --nowordcountreject
+
+# parse clusters
+echo parsing clusters
+$scripts/parseUclustClusters.pl gg.nr.uc > gg.nr.cl
+
+# find consensus taxonomy per cluster
+echo finding consensus taxonomy per cluster
+$scripts/rtaxVote.pl greengenes.taxonomy gg.nr.cl > gg.nr.taxonomy.raw
+
+echo cleaning up taxonomy output
+$scripts/classificationQualityFilter.pl 1 0.25 0.1 < gg.nr.taxonomy.raw > gg.nr.taxonomy.filtered
+$scripts/classificationQualityStripper.pl < gg.nr.taxonomy.filtered > $cur/gg.nr.taxonomy
+
+# Note gg.nr.taxonomy resulting from this process includes the pcid-width column and a tab-delimited tax string
+
+echo wrote $cur/gg.nr.fasta and $cur/gg.nr.taxonomy
+
+
+# remove the temp directory
+# (comment out to debug or grab intermediate results)
+
+rm -rf $tempdir
\ No newline at end of file
diff --git a/rtax b/rtax
new file mode 100755
index 0000000..c8c1c36
--- /dev/null
+++ b/rtax
@@ -0,0 +1,285 @@
+#!/bin/bash -eu
+
+# RTAX: Rapid and accurate taxonomic classification of short paired-end
+# sequence reads from the 16S ribosomal RNA gene.
+#
+# David A. W. Soergel 1*, Rob Knight 2, and Steven E. Brenner 1
+#
+# 1 Department of Plant and Microbial Biology,
+# University of California, Berkeley
+# 2 Howard Hughes Medical Institute and Department of Chemistry
+# and Biochemistry, University of Colorado at Boulder
+# * Corresponding author: soergel at cs.umass.edu
+#
+# http://www.davidsoergel.com/rtax
+#
+# Version 0.984 (August 7, 2013)
+#
+# For usage instructions: just run the script with no arguments
+#
+#
+# Copyright (c) 2011 Regents of the University of California
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice,
+# this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name of the University of California, Berkeley nor the
+# names of its contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+function echoerr() { echo "$@" 1>&2; }
+
+echoerr
+echoerr 'RTAX: Rapid and accurate taxonomic classification of short paired-end'
+echoerr ' sequence reads from the 16S ribosomal RNA gene.'
+echoerr ''
+echoerr 'David A. W. Soergel (1), Rob Knight (2), and Steven E. Brenner (1)'
+echoerr ''
+echoerr '1 Department of Plant and Microbial Biology,'
+echoerr ' University of California, Berkeley '
+echoerr '2 Howard Hughes Medical Institute and Department of Chemistry '
+echoerr ' and Biochemistry, University of Colorado at Boulder'
+echoerr '* Corresponding author (current address): soergel at cs.umass.edu'
+echoerr ''
+echoerr 'Version 0.984 (August 7, 2013)'
+echoerr ''
+echoerr 'http://dev.davidsoergel.com/rtax'
+echoerr ''
+echoerr 'Please cite:'
+echoerr 'Soergel D.A.W., Dey N., Knight R., and Brenner S.E. 2012.'
+echoerr 'Selection of primers for optimal taxonomic classification of'
+echoerr 'environmental 16S rRNA gene sequences. ISME J (6), 1440–1444'
+echoerr ''
+
+
+
+# we want to call subsidiary scripts in the "scripts" subdir, but we don't know where we're installed or what the working directory is.
+# solution from http://hintsforums.macworld.com/archive/index.php/t-73839.html
+
+IFS=$' \t\n'
+#declare -x PATH=/bin:/usr/bin:/usr/local/bin
+arg=$0; [[ -L $0 ]] && arg=$(stat -f '%Y' "$0")
+pth=$(2>/dev/null cd "${arg%/*}" >&2; echo "`pwd -P`/${arg##*/}")
+par=$(dirname "$pth")
+scripts="$par/scripts"
+
+# The above solution works with the widespread GNU Coreutils 5.97. Thanks to Bernd Brandt for this alternate solution for Coreutils 8.5.
+
+#IFS=$' \t\n'
+#arg=$0; [[ -L $0 ]] && arg=$(stat -c '%N' "$0")
+#pth=${arg##*-> \`};
+#par=$(dirname "$pth")
+#scripts="$par/scripts"
+
+function abspath {
+ if [[ -d "$1" ]]
+ then
+ pushd "$1" >/dev/null
+ pwd
+ popd >/dev/null
+ #elif [[ -e $1 ]]
+ #then
+ else
+ pushd $(dirname $1) >/dev/null
+ echo $(pwd)/$(basename $1)
+ popd >/dev/null
+ #else
+ # echo $1 does not exist! >&2
+ # return 127
+ fi
+}
+
+# Grab the command-line arguments
+
+usageHelp="Usage: ${0##*/} -r <refdb> -t <taxonomy> -a <queryA> -b <queryB> -d <delimiter> -i <regex> -o <classifications.out>"
+refdbHelp="-r reference database in FASTA format"
+taxonomyHelp="-t taxonomy file with sequence IDs matching the reference database"
+queryAHelp="-a FASTA file containing query sequences (single-ended or read 1)"
+queryBHelp="-b FASTA file containing query sequences (read b, with matching IDs)"
+revcompAHelp="-x Reverse-complement query A sequences (required if they are provided in the reverse sense)"
+revcompBHelp="-y Reverse-complement query B sequences (required if they are provided in the reverse sense)"
+idListHelp="-l text file containing sequence IDs to process, one per line"
+delimiterHelp="-d delimiter separating the two reads when provided in a single file"
+idRegexHelp="-i regular expression used to select part of the fasta header to use as the sequence id. Default: \"(\\\\S+)\""
+outputHelp="-o output path"
+tmpdirHelp="-m temporary directory. Will be removed on successful completion, but likely not if there is an error."
+fallbackHelp="-f for sequences where only one read is available, fall back to single-ended classification. Default: drop these sequences."
+noGenericFallbackHelp="-g for sequences where one read is overly generic, do not fall back to single-ended classification. Default: classify these sequences based on only the more specific read."
+badOptionHelp="Option not recognised"
+printHelpAndExit()
+{
+ echoerr "$usageHelp"
+ echoerr "$refdbHelp"
+ echoerr "$taxonomyHelp"
+ echoerr "$queryAHelp"
+ echoerr "$queryBHelp"
+ echoerr "$revcompAHelp"
+ echoerr "$revcompBHelp"
+ echoerr "$idRegexHelp"
+ echoerr "$idListHelp"
+ echoerr "$delimiterHelp"
+ echoerr "$tmpdirHelp"
+ echoerr "$fallbackHelp"
+ echoerr "$noGenericFallbackHelp"
+ echoerr "$outputHelp"
+ exit $1
+}
+printErrorHelpAndExit()
+{
+ echoerr "-------------------------------------------------------------"
+ echoerr "ERROR : $@"
+ echoerr "-------------------------------------------------------------"
+ echoerr
+ printHelpAndExit 1
+}
+
+refdb=""
+taxonomy=""
+queryA=""
+queryB=""
+revcompA=""
+revcompB=""
+delimiter=""
+idList=""
+idRegex=""
+verbose=""
+fallback=""
+noGenericFallback=""
+outpath=""
+tempdir=""
+
+while getopts "hr:t:a:b:i:l:d:m:o:vfgxy" optionName; do
+case "$optionName" in
+h) printHelpAndExit 0;;
+r) refdb=`abspath "$OPTARG"`;;
+t) taxonomy=`abspath "$OPTARG"`;;
+a) queryA=`abspath "$OPTARG"`;;
+b) queryB=`abspath "$OPTARG"`;;
+x) revcompA="--revcompA";;
+y) revcompB="--revcompB";;
+d) delimiter="$OPTARG";;
+i) idRegex="$OPTARG";;
+l) idList=`abspath "$OPTARG"`;;
+o) outpath=`abspath "$OPTARG"`;;
+m) tempdir=`abspath "$OPTARG"`;;
+v) verbose="0";;
+f) fallback="--singleOK";;
+g) noGenericFallback="--nosingleOKgeneric";;
+[?]) printErrorHelpAndExit "$badOptionHelp";;
+esac
+done
+
+
+
+if [[ -n "$verbose" ]]
+then
+echoerr
+echoerr Reference database : $refdb
+echoerr Taxonomy file : $taxonomy
+echoerr ID List : $idList
+echoerr Query read A : $queryA
+echoerr RevComp read A : $revcompA
+echoerr Query read B : $queryB
+echoerr RevComp read B : $revcompB
+# echoerr Header Regex : $idRegex
+echoerr Read delimiter : $delimiter
+echoerr Output path : $outpath
+echoerr Temporary dir : $tempdir
+echoerr Fallback : $fallback
+echoerr No Generic Fallback: $noGenericFallback
+echoerr
+fi
+
+
+
+if [[ -z "$refdb" || -z "$taxonomy" || -z "$queryA" || -z "$outpath" ]]; then
+ printErrorHelpAndExit "-r, -t, -a, and -o options are required"
+fi
+
+if [[ -n "$queryB" && -n "$delimiter" ]]; then
+ printErrorHelpAndExit "Paired ends can be input using -b or -d, but not both"
+fi
+
+# find USEARCH, or hardcode if needed (could use an environment variable...)
+# usearch=/path/to/usearch
+set +e
+usearch=`which usearch`
+if [ $? -eq 0 ]
+then
+ echoerr using usearch: $usearch
+ echoerr
+else
+ echoerr "usearch not found in path. Please correct, or just hardcode the path in the rtax script."
+ exit 1
+fi
+
+set -e
+
+# create a temporary working directory
+
+if [ ! $tempdir ]; then
+ tempdir=/tmp/rtax.$RANDOM
+ #tempdir=./rtax.$RANDOM
+fi
+mkdir -p $tempdir
+cd $tempdir
+
+# perform the search
+
+if [ $idList ]; then
+ idList="--idList $idList"
+fi
+
+
+if [ $delimiter ]; then
+ $scripts/splitDelimitedFasta.pl $delimiter $queryA
+ queryAa=`basename $queryA`.a
+ queryAb=`basename $queryA`.b
+ $scripts/rtaxSearch.pl --database $refdb --queryA $queryAa --queryB $queryAb $idList $revcompA $revcompB --usearch $usearch --idRegex "$idRegex" $fallback $noGenericFallback > rawhits
+else
+ if [ $queryB ]; then
+ $scripts/rtaxSearch.pl --database $refdb --queryA $queryA --queryB $queryB $idList $revcompA $revcompB --usearch $usearch --idRegex "$idRegex" $fallback $noGenericFallback > rawhits
+ else
+ fallback="--singleOK"
+ $scripts/rtaxSearch.pl --database $refdb --queryA $queryA $idList $revcompA --usearch $usearch --idRegex "$idRegex" $fallback > rawhits
+ fi
+fi
+
+
+# choose the "best" taxonomy assignment by walking down the tree, including info on how many ref sequences agree at each level
+# and pipe directly into the cleanup scripts instead of using temp files
+
+$scripts/rtaxVote.pl $taxonomy rawhits | $scripts/classificationQualityFilter.pl 1 0.0 0.5 | $scripts/classificationQualityStripper.pl > $outpath
+
+# The output of this process includes the pcid-width column and a tab-delimited tax string
+
+# filter the taxonomy strings to the finest rank where half of the hits agree (out of all the hits, i.e. including in the denominator those hits with no annotation)
+#$scripts/classificationQualityFilter.pl 1 0.0 0.5 < rawtaxonomy > filteredtaxonomy
+
+# clean up the resulting output and provide it to STDOUT
+#$scripts/classificationQualityStripper.pl < filteredtaxonomy
+
+# remove the temp directory
+# (comment out to debug or grab intermediate results)
+
+rm -rf $tempdir
\ No newline at end of file
diff --git a/scripts/FastaIndex.pm b/scripts/FastaIndex.pm
new file mode 100644
index 0000000..cb9b91d
--- /dev/null
+++ b/scripts/FastaIndex.pm
@@ -0,0 +1,136 @@
+package FastaIndex;
+
+use strict;
+use warnings;
+
+use IO::File;
+
+
+BEGIN {
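+# Generate a simple get/set accessor for each field by installing a closure
+# into this package's symbol table.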
+for my $field (qw( fastaFileName startpos lines fh ))
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ no strict "refs"; # So symbolic ref to typeglob works.
+
+ *$field = sub
+ {
+ my $self = shift;
+ $self->{$slot} = shift if @_;
+ return $self->{$slot};
+ };
+ }
+}
+
+sub new {
+ my ($this) = @_;
+ my $class = ref($this) || $this;
+ my $self = {};
+
+ bless $self, $class;
+
+ return $self;
+}
+
+sub make_index {
+
+ my ($this, $fastaFileName, $idregex, $dbmFileName) = @_;
+ if(!defined $idregex) { $idregex = "(\\S+)"; }
+
+ $this->fastaFileName($fastaFileName);
+
+ #open(IN, "$fastaFileName") or die "could not open $fastaFileName";
+ my $in = IO::File->new("$fastaFileName") or die "could not open $fastaFileName";
+ $this->fh($in);
+
+ if($dbmFileName)
+ {
+ my %openpos = ();
+ dbmopen( %openpos, $dbmFileName.".pos", 0666) || die ("Could not open DBM file $dbmFileName.pos");
+ $this->startpos(\%openpos);
+
+ my %openlines = ();
+ dbmopen( %openlines, $dbmFileName.".lines", 0666) || die ("Could not open DBM file $dbmFileName.lines");
+ $this->lines(\%openlines);
+
+ my $sizepos = scalar keys %openpos;
+ my $sizelines = scalar keys %openlines;
+ if($sizepos != $sizelines) { die "DBM files broken $dbmFileName\n"; }
+ if($sizepos > 0) {
+ print STDERR ("Index file $dbmFileName.pos already exists; assuming valid\n");
+ return;
+ };
+ }
+ else
+ {
+ # just store a hash in memory.
+ $this->startpos({});
+ $this->lines({});
+ }
+
+ my $lastpos = 0;
+ my $pos = 0;
+ #my $seq = "";
+ my $id = "";
+ my $numlines = 0;
+
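+    # Single pass over the FASTA file: record each record's starting byte
+    # offset and its line count so that fetch() can seek() straight to it.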
+ while(<$in>)
+ {
+ my $line = $_;
+ if($line =~ /^>$idregex/)
+ {
+ if(defined $id && $id ne "") {
+ # write the previous record
+ #print STDERR "$id -> $lastpos\n";
+ $this->startpos()->{$id} = $lastpos;
+ $this->lines()->{$id} = $numlines;
+ }
+
+ # start a new record
+ $id = $1;
+ #$seq = $line;
+ $lastpos = $pos;
+ $numlines = 0;
+ }
+ # else { $seq .= $line; }
+
+ $pos += length($line);
+ $numlines++;
+ }
+    # write the final record, guarding against empty input
+    # (otherwise a spurious "" entry would be created)
+    if ( defined $id && $id ne "" ) {
+        $this->startpos()->{$id} = $lastpos;
+        $this->lines()->{$id}    = $numlines;
+    }
+}
+
+sub count_records()
+ {
+ my ($this) = @_;
+ return scalar(keys %{$this->startpos()});
+ }
+
+sub fetch {
+
+ my ($this, $id) = @_;
+
+ my $pos = $this->startpos()->{$id};
+ if(! defined $pos) { die "no sequence with id: $id in " . $this->fastaFileName(); }
+
+ my $numlines = $this->lines()->{$id};
+
+ my $result = "";
+ my $in = $this->fh();
+ #print $this->fastaFileName() . ": " . $in . "\n";
+ seek($in,$pos,0);
+ for (1..$numlines) { $result .= <$in> }
+ return $result;
+}
+
+sub close {
+ my ($this) = @_;
+ my $in = $this->fh();
+ close($in);
+ $this->startpos(undef);
+ $this->lines(undef);
+}
+1;
\ No newline at end of file
diff --git a/scripts/LevelPrediction.pm b/scripts/LevelPrediction.pm
new file mode 100644
index 0000000..03dea87
--- /dev/null
+++ b/scripts/LevelPrediction.pm
@@ -0,0 +1,71 @@
+package LevelPrediction;
+
+use strict;
+use warnings;
+
+sub new {
+ my ($this) = @_;
+ my $class = ref($this) || $this;
+ my $self = {};
+
+ bless $self, $class;
+
+ return $self;
+}
+
+for my $field (
+ qw(label numChildren count localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion)
+ )
+{
+ my $slot = __PACKAGE__ . "::$field";
+ no strict "refs"; # So symbolic ref to typeglob works.
+
+ *$field = sub {
+ my $self = shift;
+ $self->{$slot} = shift if @_;
+ return $self->{$slot};
+ };
+}
+
+sub print_tab_delimited {
+ my $self = shift;
+ for my $field (
+ qw(label numChildren count localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ print "\t" . ( $self->{$slot} ? $self->{$slot} : "undef" );
+ }
+}
+
+sub toString {
+ my $self = shift;
+ my $result = "";
+
+ for my $field (
+ qw(label)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ $result .= ( $self->{$slot} ? $self->{$slot} : "" );
+ }
+
+ for my $field (
+ qw(numChildren count)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$field";
+ $result .= ";" . ( $self->{$slot} ? $self->{$slot} : "" );
+ }
+
+ for my $fieldb (
+ qw(localMinProportion localMaxProportion globalMinProportion globalMaxProportion alternateLocalProportion alternateGlobalProportion)
+ )
+ {
+ my $slot = __PACKAGE__ . "::$fieldb";
+ $result .= $self->{$slot} ? (sprintf ";%.2f", $self->{$slot}) : ";" ;
+ }
+ return $result;
+}
+
+1;
diff --git a/scripts/classificationQualityFilter.pl b/scripts/classificationQualityFilter.pl
new file mode 100755
index 0000000..a6a1e53
--- /dev/null
+++ b/scripts/classificationQualityFilter.pl
@@ -0,0 +1,51 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+my $countThreshold               = $ARGV[0];
+my $localMinProportionThreshold  = $ARGV[1];
+my $globalMinProportionThreshold = $ARGV[2];
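+
+# The rtax wrapper invokes this as "classificationQualityFilter.pl 1 0.0 0.5":
+# keep ranks supported by at least 1 hit and by at least half of all hits
+# globally; everything below the first failing rank is dropped.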
+
+while (<STDIN>) {
+ chomp;
+ my ( $id, $bestPcid, @taxonomyEntries ) = split /\t/;
+
+ if ( $taxonomyEntries[0] eq "NOPRIMER" ) {
+ print "$id\t\tNOPRIMER\n";
+ }
+ elsif ( $taxonomyEntries[0] eq "NOHIT" ) {
+ print "$id\t\tNOHIT\n";
+ }
+ elsif ( $taxonomyEntries[0] eq "NOMATEPAIR" ) {
+ print "$id\t\tNOMATEPAIR\n";
+ }
+ elsif ( $taxonomyEntries[0] =~ /TOOMANYHITS/ ) {
+ print "$id\t\tTOOMANYHITS\n";
+ }
+ else {
+ my @newTaxonomyEntries = ();
+ for my $entry (@taxonomyEntries) {
+
+ my (
+ $label, $numChildren, $count,
+ $localMinProportion, $localMaxProportion, $globalMinProportion,
+ $globalMaxProportion, $alternateLocalProportion, $alternateGlobalProportion
+ ) = split /;/, $entry;
+
+ if ( $count >= $countThreshold
+ && $localMinProportion >= $localMinProportionThreshold
+ && $globalMinProportion >= $globalMinProportionThreshold )
+ {
+ push @newTaxonomyEntries, $entry;
+ }
+ else { last; }
+ }
+ #if ( @newTaxonomyEntries == 0 ) {
+ # push @newTaxonomyEntries, "Root";
+ #}
+
+        # if the trailing entries of the filtered taxonomy string are "SKIP", strip them
+        while ( @newTaxonomyEntries && substr( $newTaxonomyEntries[-1], 0, 4 ) eq "SKIP" ) { pop @newTaxonomyEntries; }
+
+ print "$id\t$bestPcid\t" . ( join "\t", @newTaxonomyEntries ) . "\n";
+ }
+
+}
diff --git a/scripts/classificationQualityStripper.pl b/scripts/classificationQualityStripper.pl
new file mode 100755
index 0000000..7765117
--- /dev/null
+++ b/scripts/classificationQualityStripper.pl
@@ -0,0 +1,7 @@
+#!/usr/bin/env perl
+
+while (<STDIN>) {
+ my $line = $_;
+ $line =~ s/;[^\t\n]*//g;
+ print $line;
+}
diff --git a/scripts/fastaidx.pl b/scripts/fastaidx.pl
new file mode 100755
index 0000000..a5ace78
--- /dev/null
+++ b/scripts/fastaidx.pl
@@ -0,0 +1,22 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+use Getopt::Long;
+use File::Temp;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use FastaIndex;
+
+my ($fastaFileName, $idRegex) = @ARGV;
+
+my $indexA = FastaIndex->new(); # '-filename' => "A.idx", '-write_flag' => 1 );
+print STDERR "Indexing $fastaFileName...\n";
+$indexA->make_index( $fastaFileName, $idRegex, $fastaFileName );
+
+while(<STDIN>)
+{
+ chomp;
+ print $indexA->fetch($_);
+}
\ No newline at end of file
diff --git a/scripts/greengenesExtract.pl b/scripts/greengenesExtract.pl
new file mode 100755
index 0000000..5e15120
--- /dev/null
+++ b/scripts/greengenesExtract.pl
@@ -0,0 +1,49 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+open( FASTA, ">greengenes.fasta" ) || die("Can't write to greengenes.fasta");
+open( TAX, ">greengenes.taxonomy" ) || die("Can't write to greengenes.taxonomy");
+
+my $fieldname = "gg_norm_tax_string";
+
+while (<STDIN>) {
+
+    # read in by blocks
+    if ( $_ =~ /^BEGIN$/ ) {
+        my $prokMSAid = "NONE";
+        my $tax       = "NONE";
+        my $seq       = "NONE";
+
+        until ( $_ =~ /^END$/ ) {
+            if (/^prokMSA_id=(.+)/) {
+                $prokMSAid = $1;
+            }
+            elsif (/^$fieldname=(.+)/) {
+                $tax = $1;
+            }
+            elsif (/aligned_seq=(.+)/) {
+                $seq = $1;
+            }
+
+            $_ = <STDIN>;
+            last if !defined $_;    # truncated block: don't loop forever at EOF
+        }
+
+        # write each record once, after the whole BEGIN/END block has been read
+        if ( $seq ne "NONE" && $seq ne "unaligned" ) {
+            print FASTA ">$prokMSAid\n$seq\n";
+
+            # don't include taxonomy info if there is no sequence anyway
+            # and certainly not if there is no taxonomy data
+            if ( $tax ne "" && $tax ne "NONE" ) {
+                print TAX "$prokMSAid\t$tax\n";
+            }
+        }
+    }
+
+    # else ignore anything outside a BEGIN/END block
+
+}
+
+close FASTA;
+close TAX;
diff --git a/scripts/joinDelimitedFasta.pl b/scripts/joinDelimitedFasta.pl
new file mode 100755
index 0000000..664b286
--- /dev/null
+++ b/scripts/joinDelimitedFasta.pl
@@ -0,0 +1,38 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+# make sure delimiter is properly escaped, e.g. for "+" we need "\+".
+my $delimiter = $ARGV[0];
+my $infileA = $ARGV[1];
+my $infileB = $ARGV[2];
+my $rc2 = $ARGV[3];
+
+open(INA, $infileA) or die "Can't read $infileA\n";
+open(INB, $infileB) or die "Can't read $infileB\n";
+
+my $fastaHeader = "";
+
+#sub revcompl {
+# return ((my $rev = reverse $_) =~ tr/ACGTacgt/TGCAtgca/);
+#}
+
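+# Assumes strictly line-aligned, one-line-per-sequence FASTA: both inputs list
+# the same records in the same order, so corresponding lines (headers
+# included) are joined with the delimiter.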
+while(<INA>)
+ {
+ my $a = $_;
+ my $b = <INB>;
+ chomp $a;
+ chomp $b;
+ if($rc2)
+ {
+ # $b = revcompl($b);
+ $b = reverse $b;
+ $b =~ tr/ACGTacgt/TGCAtgca/;
+ }
+
+ print $a . $delimiter . $b . "\n";
+
+ }
+close INA;
+close INB;
\ No newline at end of file
diff --git a/scripts/parseUclustClusters.pl b/scripts/parseUclustClusters.pl
new file mode 100755
index 0000000..0ed23c1
--- /dev/null
+++ b/scripts/parseUclustClusters.pl
@@ -0,0 +1,52 @@
+#!/usr/bin/env perl
+
+# read a uclust result file; for each query sequence, collect the prokMSAids of the hits.
+
+use strict;
+use warnings;
+
+my %clusters = ();
+my %worstPcids = ();
+
+parseUclust( $ARGV[0] );
+
+for my $key ( sort ( keys %clusters ) ) {
+ my $worstPcid = $worstPcids{$key};
+
+ #print STDERR "$key -> $clusters{$key}\n";
+ print "$key\t$worstPcid\t" . ( join "\t", @{ $clusters{$key} } ) . "\n";
+}
+
+sub parseUclust {
+ my ($ucFileName) = @_;
+ open( UC, $ucFileName ) || die("Could not open $ucFileName");
+
+ while (<UC>) {
+
+ if (/^\s*#/) { next; }
+
+ my ( $type, $cluster, $size, $percentid, $strand, $querystart, $targetstart, $alignment, $querylabel, $targetlabel ) = split /\t/;
+ chomp $querylabel;
+ chomp $targetlabel;
+
+ if ( $type eq "S" ) {
+
+ #print STDERR "S $querylabel\n";
+ $clusters{$querylabel} = [$querylabel];
+ $worstPcids{$querylabel} = 100.0;
+ }
+ elsif ( $type eq "H" ) {
+
+ #print STDERR "H $targetlabel $querylabel\n";
+ push @{ $clusters{$targetlabel} }, $querylabel;
+
+ if ( $percentid < $worstPcids{$targetlabel} ) {
+ $worstPcids{$targetlabel} = $percentid;
+ }
+
+ }
+
+ # ignore other types
+ }
+ close UC;
+}
diff --git a/scripts/rtaxSearch.pl b/scripts/rtaxSearch.pl
new file mode 100755
index 0000000..cfdd180
--- /dev/null
+++ b/scripts/rtaxSearch.pl
@@ -0,0 +1,1062 @@
+#!/usr/bin/env perl
+
+# RTAX: Rapid and accurate taxonomic classification of short paired-end
+# sequence reads from the 16S ribosomal RNA gene.
+#
+# David A. W. Soergel 1*, Rob Knight 2, and Steven E. Brenner 1
+#
+# 1 Department of Plant and Microbial Biology,
+# University of California, Berkeley
+# 2 Howard Hughes Medical Institute and Department of Chemistry
+# and Biochemistry, University of Colorado at Boulder
+# * Corresponding author: soergel at cs.umass.edu
+#
+# http://www.davidsoergel.com/rtax
+#
+# Version 0.984 (August 7, 2013)
+#
+# For usage instructions: just run the script with no arguments
+#
+#
+# Copyright (c) 2011 Regents of the University of California
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice,
+# this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name of the University of California, Berkeley nor the
+# names of its contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+use strict;
+use warnings;
+use Getopt::Long;
+use File::Temp qw/ tempdir /;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use FastaIndex;
+
+### command-line arguments get stored in globals
+
+use vars qw (
+ $usearch
+ $slop
+ $minMaxAccepts
+ $maxMaxAccepts
+ $maxPercentDifferenceThreshold
+ $databaseFile
+ $readAFileAll
+ $readBFileAll
+ $revCompReadA
+ $revCompReadB
+ $idRegex
+ $idList
+ $singleOK
+ $singleOKgeneric
+);
+
+# just store these as globals for now-- they're basically like command-line args
+my $indexA;
+my $indexB;
+
+sub init {
+ $singleOKgeneric = 1; # default value; GetOptions may override
+
+ Getopt::Long::Configure("bundling");
+ GetOptions(
+ "usearch=s" => \$usearch,
+ "slop=s" => \$slop,
+ "minMaxAccepts=s" => \$minMaxAccepts,
+ "maxMaxAccepts=s" => \$maxMaxAccepts,
+ "maxPercentDifferenceThreshold=s" => \$maxPercentDifferenceThreshold,
+ "databaseFile=s" => \$databaseFile,
+ "idRegex=s" => \$idRegex,
+ "queryA=s" => \$readAFileAll,
+ "queryB=s" => \$readBFileAll,
+ "revcompA" => \$revCompReadA,
+ "revcompB" => \$revCompReadB,
+ "idList=s" => \$idList,
+ "singleOK" => \$singleOK,
+ "singleOKgeneric!" => \$singleOKgeneric
+ );
+
+ if ( !defined $databaseFile
+ || !defined $readAFileAll )
+ {
+ print STDERR "Missing required argument!\n\n";
+ usage();
+ }
+
+ if ( !defined $usearch ) {
+ $usearch = `whereis usearch`;
+ chomp $usearch;
+ if ( !defined $usearch || $usearch eq "" ) {
+ print STDERR "Could not find usearch.\n\n";
+ usage();
+ }
+ }
+
+ if ( !defined $slop ) { $slop = 0.005; }
+ if ( !defined $minMaxAccepts ) { $minMaxAccepts = 1000; }
+ if ( !defined $maxMaxAccepts ) { $maxMaxAccepts = 16000; }
+ if ( !defined $maxPercentDifferenceThreshold ) {
+ $maxPercentDifferenceThreshold = 0.2;
+ }
+ if ( !defined $idRegex || $idRegex eq "" ) { $idRegex = "(\\S+)"; }
+ print STDERR "Header Regex: $idRegex\n";
+
+ # these are redundant between multiple runs, oh well
+ # but the DBM indexes should persist, in the same directory as the original files
+ $indexA = FastaIndex->new(); # '-filename' => "A.idx", '-write_flag' => 1 );
+ if ($revCompReadA) {
+ $readAFileAll = revcompFile($readAFileAll);
+ }
+ $indexA->make_index( $readAFileAll, $idRegex, $readAFileAll );
+
+ $indexB = FastaIndex->new();
+ if ( defined $readBFileAll ) { # '-filename' => "B.idx", '-write_flag' => 1 );
+ if ($revCompReadB) {
+ $readBFileAll = revcompFile($readBFileAll);
+ }
+ $indexB->make_index( $readBFileAll, $idRegex, $readBFileAll );
+ }
+}
+
+sub usage {
+ print STDERR << "EOF";
+
+RTAX: Rapid and accurate taxonomic classification of short paired-end
+ sequence reads from the 16S ribosomal RNA gene.
+
+David A. W. Soergel (1*), Rob Knight (2), and Steven E. Brenner (1)
+1 Department of Plant and Microbial Biology, University of California, Berkeley
+2 Howard Hughes Medical Institute and Department of Chemistry and Biochemistry,
+ University of Colorado at Boulder
+* Corresponding author: soergel\@berkeley.edu
+
+http://www.davidsoergel.com/rtax
+
+Version 0.984 (August 7, 2013)
+
+OPTIONS:
+
+ --usearch <path> location of usearch program
+ (defaults to result of "whereis usearch")
+
+ --slop <number> %id difference from maximum to accept
+ (default 0.005, i.e. accept hits to a 200nt query that
+ has 1 more mismatch than the best match, or 2 extra
+ mismatches out of 400nt.)
+
+ --minMaxAccepts <number>
+ The initial maxaccepts value to pass to USEARCH,
+ controlling the maximum number of hits that will be
+ returned in the first iteration. If this number of hits
+ is reached, the search is repeated with a doubled
+ maxaccepts value. Thus, a smaller value causes the
+ first iteration to run faster, but increases the
+ probability of needing more iterations.
+ (Default 1000)
+
+ --maxMaxAccepts <number>
+ The highest escalated maxaccepts value to allow. The
+ maxaccepts value is doubled on every iteration, so this
+ value is effectively rounded down to a power of two
+ factor of minMaxAccepts. USEARCH runs with large
+ maxaccepts values are very slow, so this controls when
+ RTAX gives up on a query because there are too many
+ hits (i.e., a query that is too short and/or too highly
+ conserved to be informative). (Default 16000)
+
+ --maxPercentDifferenceThreshold <number>
+ The largest percent difference to allow between a query
+ (considering the read pair jointly) and a database hit.
+ (Default 0.2, i.e. reference sequences of less than 80%
+ identity with the query sequence are never matched)
+
+ --databaseFile <file>
+ A FASTA file containing a reference database.
+
+ --queryA <file> A FASTA file containing one set of query reads.
+
+ --queryB <file> A FASTA file containing the other set of query reads
+ (if any). Must be provided in the forward sense
+ unless --revcompB is used.
+
+ --revcompB Reverse-complement read B. Required if the queryB file
+ is provided in the reverse sense, as is typical with
+ paired-end experiments.
+
+ --singleOK Classify a sequence based on only one read when the
+ other read is missing. Required when queryB is absent.
+ Default: false, so sequences present in only one of the
+ two input files are dropped. When enabled, paired-end
+ sequences are classified using both reads as usual, but
+ any remaining single-ended reads are also classified.
+
+ --singleOKgeneric Classify a sequence based on only one read when the
+ other read is present but uninformative. This occurs
+ when one read is so generic (i.e., short, or highly
+ conserved) that more than maxMaxAccepts hits are
+ found. Default: true; use --nosingleOKgeneric to
+ disable.
+
+ --idRegex <regex> A regular expression for extracting IDs from fasta
+ headers. The first capture group (aka \$1) will be
+ used as the sequence ID; these should be unique.
+ This is useful when the ID needed to match mate
+ pairs is embedded in the header in some way.
+ "^>" is automatically prepended to the regex
+ provided here. Take care that double escaping may
+ be required. Default: "(\\S+)" (the first field)
+
+ --idList <file> A file containing a list of IDs to process (one per
+ line). By default all IDs found in the query files
+ are used.
+
+  Note that the two query files must provide mate-paired reads with exactly
+  matching identifiers (though they need not be in the same order). Any ids
+  present in one file but not the other are classified in single-ended mode
+  when --singleOK is given, and are otherwise reported as NOMATEPAIR.
+  Alternate naming schemes for the paired reads (e.g., "SOMEID.a" paired with
+  "SOMEID.b", and so forth) are handled via the --idRegex option.
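+  For example, a hypothetical regex for that scheme: --idRegex "(\\S+)\\.[ab]"
+  (the first capture group, i.e. the ID without its .a/.b suffix, is used).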
+
+ Various indexes and derived FASTA files will be created in a temporary
+ directory and are cleaned up afterwards.
+
+ The output is tab-delimited text, one line per query pair, in the form
+
+ <query ID> <%id> <list of reference IDs>
+
+ where the second column gives the %id between the query and the best match,
+ and the reference IDs provided are all those matches within "slop" of this
+ best value.
+
+ This output is suitable for piping directly into rtaxVote for taxonomy
+ assignment.
+
+EOF
+
+ exit;
+}
+
+sub main {
+
+ init();
+
+ my ( $pairedReadAFile, $pairedReadBFile, $pairedBothCount, $singleReadAFile, $singleReadACount, $singleReadBFile, $singleReadBCount, $allSingleIDs ) =
+ partitionReadFiles();
+
+ processPairs( $pairedReadAFile, $pairedReadBFile, $pairedBothCount );
+
+ if ($singleOK) {
+ processSingle( $singleReadAFile, $indexA, $singleReadACount );
+ processSingle( $singleReadBFile, $indexB, $singleReadBCount );
+ }
+ else {
+ for my $queryLabel (@$allSingleIDs) {
+ print "$queryLabel\t\tNOMATEPAIR\n";
+ }
+ }
+}
+
+sub partitionReadFiles {
+
+ my @idList = ();
+ if ($idList) {
+ open( IDLIST, "$idList" ) or die "Could not open $idList";
+ while (<IDLIST>) { chomp; push @idList, $_; }
+ close IDLIST;
+ }
+
+ my @idsA = keys %{ $indexA->startpos() };
+ my @idsB =
+ defined $indexB->startpos() ? keys %{ $indexB->startpos() } : (); # in the single-read case, there should still be an empty index
+
+ my @bothAandB = ();
+ my @aOnly = ();
+ my @bOnly = ();
+
+ # encode which of the input files contain which IDs in three bits
+
+ my %count = ();
+
+ foreach my $element (@idsA) {
+        if ( ( $element =~ /[\t ]/ ) > 0 )    # keys of $indexA->startpos() are parsed IDs and should contain no whitespace
+ {
+ die "Invalid FASTA id: $element\n";
+ }
+ $count{$element} += 1;
+ }
+
+ foreach my $element (@idsB) {
+        if ( ( $element =~ /[\t ]/ ) > 0 )    # keys of $indexB->startpos() are parsed IDs and should contain no whitespace
+ {
+ die "Invalid FASTA id: $element\n";
+ }
+ $count{$element} += 2;
+ }
+
+ foreach my $element (@idList) {
+ if ( ( $element =~ /[\t ]/ ) > 0 ) { die "Invalid FASTA id: $element\n"; }
+ $count{$element} += 4;
+ }
+
+ if ($idList) {
+ foreach my $element ( keys %count ) {
+ $count{$element} -= 4;
+ }
+ }
+
+ foreach my $element ( keys %count ) {
+ if ( !( $element =~ /^__/ ) ) {
+ if ( $count{$element} == 1 ) { push @aOnly, $element }
+ elsif ( $count{$element} == 2 ) { push @bOnly, $element }
+ elsif ( $count{$element} == 3 ) { push @bothAandB, $element }
+ else {
+
+ # no problem; these are sequences not in the idList
+ }
+ }
+ }
+
+ print STDERR "file a only = " . scalar(@aOnly) . " sequences\n";
+ print STDERR "file b only = " . scalar(@bOnly) . " sequences\n";
+ print STDERR "both = " . scalar(@bothAandB) . " sequences\n";
+
+ my $pairedReadAFile = extractFasta( $indexA, \@bothAandB );
+ my $pairedReadBFile = extractFasta( $indexB, \@bothAandB );
+
+ my $singleReadAFile = extractFasta( $indexA, \@aOnly );
+ my $singleReadBFile = extractFasta( $indexB, \@bOnly );
+
+ my @allSingleIDs = (@aOnly, @bOnly);
+
+ return ( $pairedReadAFile, $pairedReadBFile, scalar(@bothAandB), $singleReadAFile, scalar(@aOnly), $singleReadBFile, scalar(@bOnly), \@allSingleIDs );
+}
+
+sub processPairs {
+ my ( $pairedReadAFile, $pairedReadBFile, $pairedBothCount ) = @_;
+ if ( $pairedBothCount == 0 ) { return; }
+
+ my $nohitQueryIds = [];
+ push @$nohitQueryIds, "ALL";
+ my $tooManyHitQueryIds = [];
+
+ my $percentDifferenceThreshold = 0.005; # this gets doubled to 1% before being used the first time
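+    # with the defaults, rounds thus run at 1%, 2%, 4%, 8%, and 16% pair
+    # difference before giving up (maxPercentDifferenceThreshold = 0.2)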
+
+ while ( @$nohitQueryIds > 0 ) {
+
+ # double the allowed %diff on every round
+ $percentDifferenceThreshold *= 2;
+ if ( $percentDifferenceThreshold > $maxPercentDifferenceThreshold ) {
+ last;
+ }
+
+ if ( $nohitQueryIds->[0] ne "ALL" ) {
+
+ # prepare input files with the remaining sequences
+ $pairedReadAFile = extractFasta( $indexA, $nohitQueryIds );
+ $pairedReadBFile = extractFasta( $indexB, $nohitQueryIds );
+ }
+
+ # within doPairSearch we escalate maxAccepts, so if a queryLabel is still marked tooManyHits at this point,
+ # that means that there are more than maxMaxAccepts hits for this threshold--
+ # so there's really no point in testing this query again at an even lower threshold
+ my $tooManyHitQueryIdsThisRound;
+
+ my $numRemaining = ( $nohitQueryIds->[0] eq "ALL" ) ? $pairedBothCount : scalar(@$nohitQueryIds);
+
+ print STDERR
+ "doPairSearch $pairedReadAFile, $pairedReadBFile: $numRemaining query sequences remaining\n searching with pair %id "
+ . $percentDifferenceThreshold
+ . " and maxAccepts "
+ . $minMaxAccepts . "\n";
+
+ ( $nohitQueryIds, $tooManyHitQueryIdsThisRound ) =
+ doPairSearch( $pairedReadAFile, $pairedReadBFile, $percentDifferenceThreshold, $minMaxAccepts );
+
+ print STDERR "MAIN: Finished round at threshold $percentDifferenceThreshold; "
+ . scalar(@$nohitQueryIds)
+ . " NOHIT, "
+ . scalar(@$tooManyHitQueryIdsThisRound)
+ . " TOOMANYHIT.\n";
+
+ push @$tooManyHitQueryIds, @$tooManyHitQueryIdsThisRound;
+ }
+
+ print STDERR "MAIN: " . scalar(@$nohitQueryIds) . " query sequences remaining with NOHIT\n";
+ for my $queryLabel (@$nohitQueryIds) {
+ print "$queryLabel\t\tNOHIT\n";
+ }
+
+ print STDERR "MAIN: " . scalar(@$tooManyHitQueryIds) . " query sequences remaining with TOOMANYHITS\n";
+ for my $queryLabel (@$tooManyHitQueryIds) {
+ print "$queryLabel\t\tTOOMANYHITS\n";
+ }
+}
+
+sub processSingle {
+ my ( $singleReadFile, $singleIndex, $singleReadCount ) = @_;
+ if ( !defined $singleReadFile || $singleReadFile eq "" || $singleReadCount == 0 ) { return; }
+
+ my $nohitQueryIds = [];
+ push @$nohitQueryIds, "ALL";
+ my $tooManyHitQueryIds = [];
+
+ my $singlePercentDifferenceThreshold = 0.005; # this gets doubled to 1% before being used the first time
+
+ while ( @$nohitQueryIds > 0 ) {
+
+ # double the allowed %diff on every round
+ $singlePercentDifferenceThreshold *= 2;
+ if ( $singlePercentDifferenceThreshold > $maxPercentDifferenceThreshold ) {
+ last;
+ }
+
+ if ( $nohitQueryIds->[0] ne "ALL" ) {
+
+ # prepare input files with the remaining sequences
+ $singleReadFile = extractFasta( $singleIndex, $nohitQueryIds );
+ }
+
+ my $numRemaining = ( $nohitQueryIds->[0] eq "ALL" ) ? $singleReadCount : scalar(@$nohitQueryIds);
+
+ # within doSearch we escalate maxAccepts, so if a queryLabel is still marked tooManyHits at this point,
+ # that means that there are more than maxMaxAccepts hits for this threshold--
+ # so there's really no point in testing this query again at an even lower threshold
+ my $tooManyHitQueryIdsThisRound;
+
+ print STDERR "doSingleSearch $singleReadFile: $numRemaining query sequences remaining\n searching with %id "
+ . $singlePercentDifferenceThreshold
+ . " and maxAccepts "
+ . $minMaxAccepts . "\n";
+
+ ( $nohitQueryIds, $tooManyHitQueryIdsThisRound ) =
+ doSingleSearch( $singleReadFile, $singleIndex, $singlePercentDifferenceThreshold, $minMaxAccepts );
+
+ print STDERR "Finished round at threshold $singlePercentDifferenceThreshold; "
+ . scalar(@$nohitQueryIds)
+ . " NOHIT, "
+ . scalar(@$tooManyHitQueryIdsThisRound)
+ . " TOOMANYHIT.\n";
+
+ push @$tooManyHitQueryIds, @$tooManyHitQueryIdsThisRound;
+ }
+
+ print STDERR scalar(@$nohitQueryIds) . " query sequences remaining with NOHIT\n";
+ for my $queryLabel (@$nohitQueryIds) {
+ print "$queryLabel\t\tNOHIT\n";
+ }
+
+ print STDERR scalar(@$tooManyHitQueryIds) . " query sequences remaining with TOOMANYHITS\n";
+ for my $queryLabel (@$tooManyHitQueryIds) {
+ print "$queryLabel\t\tTOOMANYHITS\n";
+ }
+
+}
+
+sub doSingleSearch {
+ my $singleReadFile = shift;
+ my $singleIndex = shift;
+ my $singlePercentDifferenceThreshold = shift;
+ my $maxAccepts = shift;
+
+ my $singlePercentIdThreshold = 1. - $singlePercentDifferenceThreshold;
+
+ my $nohitQueryIds = [];
+ my $tooManyHitQueryIds = [];
+
+# open the USEARCH streams
+# print STDERR "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n";
+# open( UCA,
+# "$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |"
+# ) || die "can't fork usearch: $!";
+
+ my $dir = tempdir(CLEANUP => 1);
+ if ( system( 'mknod', "$dir/a", 'p' ) && system( 'mkfifo', "$dir/a" ) ) { die "mk{nod,fifo} $dir/a failed"; }
+ if ( !fork() ) {
+
+ # see paired end case for explanation
+ open( UCAW, ">$dir/a" ) || die("Couldn't write named pipe: $dir/a");
+ print UCAW "# pipe open!\n";
+ close UCAW;
+
+ my $cmd =
+"$usearch --quiet --global --iddef 2 --query $singleReadFile --db $databaseFile --uc $dir/a --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject";
+ print STDERR $cmd . "\n";
+ exec $cmd || die "can't fork usearch: $!";
+ }
+
+ open( UCA, "$dir/a" ) || die("Couldn't read named pipe from usearch: $dir/a");
+
+ #print STDERR "Reading named pipe from usearch: $dir/a\n";
+
+ # Load the first non-comment line from each stream
+ my $nextLineA;
+ my $pipeARetryCount = 0;
+ while ( !defined $nextLineA ) {
+
+ # keep trying to read from the pipe even if the writer disconnects
+ while (<UCA>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ $nextLineA = $_;
+ last;
+ }
+
+ #print STDERR "Waiting for named pipe $dir/a\n";
+ sleep 1;
+ $pipeARetryCount++;
+ if ( $pipeARetryCount > 10 ) { die "Named pipe communication with usearch failed: $dir/a\n"; }
+ }
+
+ # read synchronized blocks from each stream
+ while (1) {
+ my ( $queryLabelA, $idsA );
+
+ #print STDERR "reading next block...\n";
+
+ # idsA is a reference to a hash from targetIds to %ids
+ ( $queryLabelA, $idsA, $nextLineA ) = collectIds( *UCA, $nextLineA );
+
+ if ( ( scalar keys %$idsA ) >= $maxAccepts ) {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+
+ elsif ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsA, $singlePercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+
+ if ( !$nextLineA ) {
+
+ #print STDERR "End of stream, close\n";
+ last;
+ }
+
+ }
+
+ close(UCA) || die "can't close usearch: $!";
+
+ #print STDERR "Closed usearch stream.\n";
+
+ # for the TOOMANYHITS cases, escalate maxAccepts and try again
+ # Note this recurses, so no need to iterate here
+ if ( scalar(@$tooManyHitQueryIds) && ( $maxAccepts * 2 <= $maxMaxAccepts ) ) {
+ my $nohitQueryIdsB;
+
+ print STDERR "Escalating maxAccepts to " . ( $maxAccepts * 2 ) . " for " . scalar(@$tooManyHitQueryIds) . " sequences.\n";
+
+ # prepare input files with the remaining sequences
+ $singleReadFile = extractFasta( $singleIndex, $tooManyHitQueryIds );
+
+ ( $nohitQueryIdsB, $tooManyHitQueryIds ) =
+ doSingleSearch( $singleReadFile, $singleIndex, $singlePercentDifferenceThreshold, $maxAccepts * 2 );
+ if (@$nohitQueryIdsB) {
+ die "A TOOMANYHITS case can't turn into a NOHITS case";
+ }
+ }
+
+ print STDERR
+ "doSingleSearch $singleReadFile: Finished at pair threshold $singlePercentDifferenceThreshold and maxAccepts $maxAccepts\n";
+ print STDERR " NOHITS: " . scalar(@$nohitQueryIds) . "\n";
+ print STDERR " TOOMANYHITS: " . scalar(@$tooManyHitQueryIds) . "\n";
+
+ # print STDERR " NOHITS: " . ( join ", ", @$nohitQueryIds ) . "\n";
+ # print STDERR " TOOMANYHITS: " . ( join ", ", @$tooManyHitQueryIds ) . "\n";
+
+ # any tooManyHitQueryIds that remain had more than maxMaxAccepts hits
+ return ( $nohitQueryIds, $tooManyHitQueryIds );
+}
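+
+# For illustration (hypothetical settings, not part of the original script):
+# with maxAccepts = 100 and maxMaxAccepts = 1000, the recursive escalation in
+# doSingleSearch retries a TOOMANYHITS query at 100 -> 200 -> 400 -> 800 and
+# then stops, since 1600 > 1000; any query still saturating 800 accepts is
+# reported as TOOMANYHITS in the final tally above.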
+
+sub reconcileSingleHitsAndPrint {
+ my ( $queryLabel, $idsA, $singlePercentIdThreshold ) = @_;
+
+ my $idsByIdentity = {};
+ for my $targetLabel ( keys %$idsA ) {
+ my $percentIdA = $idsA->{$targetLabel};
+ if ( !defined $idsByIdentity->{$percentIdA} ) {
+ $idsByIdentity->{$percentIdA} = [];
+ }
+ push @{ $idsByIdentity->{$percentIdA} }, $targetLabel;
+
+ }
+
+ if ( !%$idsByIdentity ) {
+
+ #print STDERR "$queryLabel -> no reconciled hits at $singlePercentIdThreshold%\n";
+
+ # this is the NOHIT case, but we'll back off and try again
+ return 0;
+ }
+ else {
+
+ #print STDERR "$queryLabel -> printing reconciled hits at $singlePercentIdThreshold%\n";
+ printLine( $queryLabel, $idsByIdentity );
+ return 1;
+ }
+
+}
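+
+# Sketch of the structure built above (labels and values invented): the
+# $idsByIdentity hash groups target labels by their exact %id, e.g.
+#   { '98.5' => [ 'targetA', 'targetB' ], '97.2' => [ 'targetC' ] }
+# and printLine() below walks those keys in descending order.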
+
+sub doPairSearch {
+ my $readAFile = shift;
+ my $readBFile = shift;
+ my $pairPercentDifferenceThreshold = shift;
+ my $maxAccepts = shift;
+
+ my $pairPercentIdThreshold = 1. - $pairPercentDifferenceThreshold;
+
+ # because we're going to average the two %ids, we have to search each read with twice the %diff for the pair
+ my $singlePercentDifferenceThreshold = $pairPercentDifferenceThreshold * 2;
+ my $singlePercentIdThreshold = 1. - $singlePercentDifferenceThreshold;
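+
+ # worked example (threshold value invented): a pair threshold of 0.005
+ # (99.5% pair identity) searches each single read at 0.01, i.e. a 99%
+ # identity floor, since a weak hit on one read can still average out
+ # against a strong hit on its mate.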
+
+ my $nohitQueryIds = [];
+ my $tooManyHitQueryIds = [];
+
+# open the USEARCH streams
+# print STDERR
+#"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n";
+
+# open( UCA,
+#"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |"
+# ) || die "can't fork usearch: $!";
+
+# print STDERR
+#"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject\n";
+
+# open( UCB,
+#"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc /dev/stdout --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject |"
+# ) || die "can't fork usearch: $!";
+
+ my $dir = tempdir(CLEANUP => 1);
+ if ( system( 'mknod', "$dir/a", 'p' ) && system( 'mkfifo', "$dir/a" ) ) { die "mk{nod,fifo} $dir/a failed"; }
+ if ( system( 'mknod', "$dir/b", 'p' ) && system( 'mkfifo', "$dir/b" ) ) { die "mk{nod,fifo} $dir/b failed"; }
+
+ # try to avoid mysterious intermittent condition where opening a named pipe blocks forever
+ #while ( !-p "$dir/a" ) {
+ # print STDERR "Waiting for named pipe $dir/a\n";
+ # sleep 1;
+ #}
+ #while ( !-p "$dir/b" ) {
+ # print STDERR "Waiting for named pipe $dir/b\n";
+ # sleep 1;
+ #}
+
+ if ( !fork() ) {
+
+# there is a mysterious intermittent condition where opening a named pipe blocks forever.
+# I think what is happening may be:
+# if the reader side of the named pipe is not already connected when usearch tries to write to it, usearch gets confused and never writes anything
+# thus when the reader side does connect, it blocks.
+
+ # one hack is just to sleep here for a while in hopes that the reader thread gets all hooked up
+ # sleep 5;
+
+ # let's try writing to it, so we block here until we know it works, and then continue on to usearch
+ open( UCAW, ">$dir/a" ) || die("Couldn't write named pipe: $dir/a");
+ print UCAW "# pipe open!\n";
+ close UCAW;
+
+ my $cmd =
+"$usearch --quiet --global --iddef 2 --query $readAFile --db $databaseFile --uc $dir/a --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject";
+ print STDERR $cmd . "\n";
+ exec($cmd) or die("can't exec usearch: $!");
+ }
+
+ if ( !fork() ) {
+ open( UCBW, ">$dir/b" ) || die("Couldn't write named pipe: $dir/b");
+ print UCBW "# pipe open!\n";
+ close UCBW;
+
+ my $cmd =
+"$usearch --quiet --global --iddef 2 --query $readBFile --db $databaseFile --uc $dir/b --id $singlePercentIdThreshold --maxaccepts $maxAccepts --maxrejects 128 --nowordcountreject";
+ print STDERR $cmd . "\n";
+ exec($cmd) or die("can't exec usearch: $!");
+ }
+ open( UCA, "$dir/a" ) || die("Couldn't read named pipe from usearch: $dir/a");
+
+ #print STDERR "Reading named pipe from usearch: $dir/a\n";
+
+ open( UCB, "$dir/b" ) || die("Couldn't read named pipe from usearch: $dir/b");
+
+ #print STDERR "Reading named pipe from usearch: $dir/b\n";
+
+ # Load the first non-comment line from each stream
+ my $nextLineA;
+ my $pipeARetryCount = 0;
+ while ( !defined $nextLineA ) {
+
+ # keep trying to read from the pipe even if the writer disconnects
+ while (<UCA>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ $nextLineA = $_;
+ last;
+ }
+
+ if ( defined $nextLineA ) { last; }
+
+ #print STDERR "Waiting for named pipe $dir/a\n";
+ sleep 1;
+ $pipeARetryCount++;
+ if ( $pipeARetryCount > 10 ) { die("Named pipe communication with usearch failed: $dir/a\n"); }
+ }
+
+ my $nextLineB;
+ my $pipeBRetryCount = 0;
+ while ( !defined $nextLineB ) {
+
+ # keep trying to read from the pipe even if the writer disconnects
+ while (<UCB>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ $nextLineB = $_;
+ last;
+ }
+
+ if ( defined $nextLineB ) { last; }
+
+ #print STDERR "Waiting for named pipe $dir/b\n";
+ sleep 1;
+ $pipeBRetryCount++;
+ if ( $pipeBRetryCount > 10 ) { die("Named pipe communication with usearch failed: $dir/b\n"); }
+ }
+
+ # read synchronized blocks from each stream
+ while (1) {
+ my ( $queryLabelA, $idsA, $queryLabelB, $idsB );
+
+ # idsA is a reference to a hash from targetIds to %ids
+ ( $queryLabelA, $idsA, $nextLineA ) = collectIds( *UCA, $nextLineA );
+ ( $queryLabelB, $idsB, $nextLineB ) = collectIds( *UCB, $nextLineB );
+
+ if ( $queryLabelA ne $queryLabelB ) {
+ die("Usearch results desynchronized: $queryLabelA neq $queryLabelB");
+ }
+
+ my $numHitsA = ( scalar keys %$idsA );
+ my $numHitsB = ( scalar keys %$idsB );
+
+ # if either read got NOHITS, then it's definitely NOHITS for the pair. This trumps TOOMANYHITS for the other read.
+ # don't bother trying to reconcile hits in this case
+
+ if ( ( $numHitsA == 0 ) || ( $numHitsB == 0 ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+
+ # if both reads got TOOMANYHITS, then it's definitely TOOMANYHITS for the pair.
+
+ elsif ( ( $numHitsA >= $maxAccepts ) && ( $numHitsB >= $maxAccepts ) ) {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+
+ # if neither read got TOOMANYHITS, then we're good to go
+
+ elsif ( ( $numHitsA < $maxAccepts ) && ( $numHitsB < $maxAccepts ) ) {
+ if ( !reconcilePairedHitsAndPrint( $queryLabelA, $idsA, $idsB, $pairPercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+
+ }
+
+ # if only one read got TOOMANYHITS...
+
+ else {
+
+ # escalate if possible
+ if ( $maxAccepts < $maxMaxAccepts ) {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+
+ # if we're already at maxMaxAccepts and we're allowed, fall back to the info provided by the other read.
+
+ # This is a little tricky: which percent ID threshold do we use?
+ # For consistency, I think we have to assume that the overly generic read is a 100% match.
+ # the cleanest way to say this is to say that the "overly generic" read just matches everything, so
+ # reconcilePairedHitsAndPrint( $queryLabelA, $idsA, $allIds, $pairPercentIdThreshold )
+ # but that would require a hash %allIds mapping every ID to 100, just because reconcilePairedHitsAndPrint says $idsB->{$targetLabel}
+ # it's equivalent to just use reconcileSingleHitsAndPrint with singlePercentIdThreshold.
+
+ elsif ($singleOKgeneric) {
+ if ( $numHitsA < $maxAccepts ) {
+ if ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsA, $singlePercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+ }
+ elsif ( $numHitsB < $maxAccepts ) {
+ if ( !reconcileSingleHitsAndPrint( $queryLabelA, $idsB, $singlePercentIdThreshold ) ) {
+ push @$nohitQueryIds, $queryLabelA;
+ }
+ }
+ else { die("impossible"); }
+ }
+
+ # if we're already at maxMaxAccepts, but not allowed to rely on the other read, just report TOOMANYHITS for the pair
+ else {
+ push @$tooManyHitQueryIds, $queryLabelA;
+ }
+ }
+ if ( !$nextLineA || !$nextLineB ) {
+ if ( !( !$nextLineA && !$nextLineB ) ) {
+ die("Usearch results desynchronized at end:\nA: $nextLineA\nB: $nextLineB");
+ }
+ last;
+ }
+
+ }
+
+ close(UCA) || die "can't close usearch: $!";
+ close(UCB) || die "can't close usearch: $!";
+
+ # for the TOOMANYHITS cases, escalate maxAccepts and try again
+ # Note this recurses, so no need to iterate here
+ if ( scalar(@$tooManyHitQueryIds) && ( $maxAccepts * 2 <= $maxMaxAccepts ) ) {
+ my $nohitQueryIdsB;
+
+ print STDERR "doPairSearch $readAFile, $readBFile: Escalating maxAccepts to "
+ . ( $maxAccepts * 2 ) . " for "
+ . scalar(@$tooManyHitQueryIds)
+ . " sequences.\n";
+
+ # prepare input files with the remaining sequences
+ my $readAFileEsc = extractFasta( $indexA, $tooManyHitQueryIds );
+ my $readBFileEsc = extractFasta( $indexB, $tooManyHitQueryIds );
+
+ ( $nohitQueryIdsB, $tooManyHitQueryIds ) =
+ doPairSearch( $readAFileEsc, $readBFileEsc, $pairPercentDifferenceThreshold, $maxAccepts * 2 );
+
+ # A TOOMANYHITS case can certainly turn into a NOHITS case:
+ # once one read is no longer TOOMANYHITS, it may turn out that nothing can be reconciled with the other read.
+
+ #if (@$nohitQueryIdsB) {
+ # die( "A TOOMANYHITS case can't turn into a NOHITS case: "
+ # . ( join ", ", @$nohitQueryIdsB )
+ # . " at pair threshold $pairPercentDifferenceThreshold and maxAccepts "
+ # . ( $maxAccepts * 2 ) );
+ # }
+
+ push @$nohitQueryIds, @$nohitQueryIdsB;
+ }
+
+ print STDERR
+ "doPairSearch $readAFile, $readBFile: Finished at pair threshold $pairPercentDifferenceThreshold and maxAccepts $maxAccepts\n";
+ print STDERR " NOHITS: " . scalar(@$nohitQueryIds) . "\n";
+ print STDERR " TOOMANYHITS: " . scalar(@$tooManyHitQueryIds) . "\n";
+
+ # print STDERR " NOHITS: " . ( join ", ", @$nohitQueryIds ) . "\n";
+ # print STDERR " TOOMANYHITS: " . ( join ", ", @$tooManyHitQueryIds ) . "\n";
+
+ # any tooManyHitQueryIds that remain had more than maxMaxAccepts hits
+ return ( $nohitQueryIds, $tooManyHitQueryIds );
+}
+
+sub reconcilePairedHitsAndPrint {
+ my ( $queryLabel, $idsA, $idsB, $pairPercentIdThreshold ) = @_;
+
+ my $idsByIdentity = {};
+ for my $targetLabel ( keys %$idsA ) {
+ my $percentIdA = $idsA->{$targetLabel};
+ my $percentIdB = $idsB->{$targetLabel};
+ if ($percentIdB) {
+
+ # both reads hit the same target
+ # compute the average percent id
+ my $pairPercentId = ( $percentIdA + $percentIdB ) / 2.0;
+ if ( $pairPercentId >= $pairPercentIdThreshold ) {
+ if ( !defined $idsByIdentity->{$pairPercentId} ) {
+ $idsByIdentity->{$pairPercentId} = [];
+ }
+ push @{ $idsByIdentity->{$pairPercentId} }, $targetLabel;
+ }
+ }
+ }
+
+ if ( !%$idsByIdentity ) {
+
+ #print STDERR "$queryLabel -> no reconciled hits at $pairPercentIdThreshold%\n";
+
+ # this is the NOHIT case, but we'll back off and try again
+ return 0;
+ }
+ else {
+
+ #print STDERR "$queryLabel -> printing reconciled hits at $pairPercentIdThreshold%\n";
+ printLine( $queryLabel, $idsByIdentity );
+ return 1;
+ }
+
+}
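+
+# Worked example (numbers invented, in the units of the .uc stream): if read A
+# matches target X at 98 and read B matches the same X at 96, the averaged pair
+# score is (98 + 96) / 2 = 97, and X survives only if 97 meets the pair
+# threshold. A target hit by just one of the two reads has no entry in the
+# other read's hash and is dropped outright.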
+
+##### Read a block of target IDs for one query from a USEARCH stream
+sub collectIds {
+
+ # assume that the records for a given label are all grouped together, so we can collect the hits for each label and process them in turn
+ # need to jump through some hoops to save the first line of the next block (since we can't rewind a pipe with seek()).
+
+ my $fh = shift;
+ my $firstLine = shift;
+
+ my %hits = ();
+
+ # process the first line (cached from the previous call)
+ #print STDERR "Processing firstLine: $firstLine\n";
+ my ( $typeF, $clusterF, $sizeF, $percentIdF, $strandF, $queryStartF, $targetStartF, $alignmentF, $queryLabelF, $targetLabelF ) =
+ split /\t/, $firstLine;
+ chomp $targetLabelF;
+
+ if ( $queryLabelF =~ /$idRegex/ ) { $queryLabelF = $1; }
+
+ #$queryLabelF =~ s/\s.*//; # primary ID is only the portion before whitespace
+
+ if ( $typeF eq "N" ) {
+
+ #print STDERR "$queryLabelF -> N\n";
+
+ # NOHIT represented as empty hash
+ # note we still have to cache the next line, after filtering comments
+ while (<$fh>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ my $line = $_;
+ chomp $line;
+
+ #print STDERR "$line\n";
+ return ( $queryLabelF, \%hits, $line );
+ }
+ }
+ elsif ( $typeF eq "H" ) {
+ $hits{$targetLabelF} = $percentIdF;
+ }
+
+ # process the remaining lines
+ while (<$fh>) {
+ if (/^\s*#/) { next; }
+ if (/^\s*$/) { next; }
+ my $line = $_;
+ chomp $line;
+
+ #print STDERR "$line\n";
+
+ my ( $type, $cluster, $size, $percentId, $strand, $queryStart, $targetStart, $alignment, $queryLabel, $targetLabel ) =
+ split /\t/;
+ chomp $targetLabel;
+
+ if ( $queryLabel =~ /$idRegex/ ) { $queryLabel = $1; }
+
+ #$queryLabel =~ s/\s.*//; # primary ID is only the portion before whitespace
+
+ if ( $queryLabel ne $queryLabelF ) {
+
+ # advanced past the end of the current block
+ #print STDERR "End of block $queryLabelF -> block of " . scalar( keys %hits ) . " hits\n";
+
+ #print STDERR "End of block, return\n";
+ return ( $queryLabelF, \%hits, $line );
+ }
+ else {
+ if ( $type eq "N" ) {
+ die "impossible: $queryLabel reports no hit after it already had a hit";
+ }
+ elsif ( $type eq "H" ) {
+ $hits{$targetLabel} = $percentId;
+ }
+ }
+ }
+
+ # end of the stream (the last block)
+ #print STDERR "End of stream $queryLabelF -> block of " . scalar( keys %hits ) . " hits\n";
+
+ #print STDERR "End of stream, return\n";
+ return ( $queryLabelF, \%hits, "" );
+}
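+
+# For reference, the tab-separated .uc fields consumed above are, in order:
+# type, cluster, size, %id, strand, query start, target start, alignment,
+# query label, target label; so a hit record looks roughly like (values
+# invented):
+#   H  42  253  97.6  +  0  0  253M  someQueryId  someTargetId
+# while a type "N" record marks a query with no hit at the current threshold.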
+
+my $fastaNum = 0;
+
+sub extractFasta {
+ my $index = shift;
+ my $ids = shift;
+ my $name = $fastaNum++;
+
+ # print STDERR "Extracting " . scalar(@$ids) . " to file $name\n";
+
+ open( OUT, ">$name" ) or die "Can't write $name: $!";
+
+ # my $out = Bio::SeqIO->new( '-format' => 'Fasta', '-fh' => \*OUT );
+ for my $id (@$ids) {
+ my $seqobj = $index->fetch($id);
+ if ( !defined $seqobj ) {
+ print STDERR "Extracting from " . $index->fastaFileName() . ": Undefined: $id\n";
+ }
+
+ # elsif ($seqobj->primary_id() ne $id)
+ # {
+ # print STDERR "Extracting from " . $index->filename() . ": ID problem: " . ($seqobj->primary_id()) . " ne $id\n";
+ # }
+ else {
+
+ #$out->write_seq($seqobj);
+ print OUT $seqobj;
+ }
+ }
+ close OUT;
+
+ return $name;
+}
+
+sub printLine {
+ my ( $label, $idsByIdentity ) = @_;
+
+ my @ids = ();
+ my @pcids = sort { $b <=> $a } keys %$idsByIdentity;
+ my $bestPcid = $pcids[0];
+ my $acceptPcid = $bestPcid - $slop;
+ for my $pcid (@pcids) {
+ if ( $pcid < $acceptPcid ) { last; }
+ push @ids, @{ $idsByIdentity->{$pcid} };
+ }
+
+ my $idString = ( join "\t", @ids );
+ print "$label\t$bestPcid\t" . $idString . "\n";
+
+ # print STDERR "$label\t$bestPcid\t" . $idString . "\n";
+}
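+
+# Illustration of the $slop window (numbers invented; $slop is set elsewhere
+# in this script): with a best %id of 98.6 and $slop = 0.5, every target at
+# %id >= 98.1 is printed on the output line, and lower-scoring targets are
+# dropped even though they passed the search threshold.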
+
+sub revcompFile {
+ my $infile = shift;
+
+ my $outfile = $infile . ".rc";
+
+ open( IN, $infile ) or die "Can't read $infile\n";
+ open( OUT, ">$outfile" ) or die "Can't write $outfile\n";
+
+ while (<IN>) {
+ my $a = $_;
+ chomp $a;
+ if ( !( $a =~ /^>/ ) ) {
+ $a = reverse $a;
+ $a =~ tr/ACGTacgt/TGCAtgca/;
+ }
+ print OUT $a . "\n";
+ }
+ close IN;
+ close OUT;
+ return $outfile;
+}
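+
+# Worked example: a sequence line "AACGT" is reversed to "TGCAA" and then
+# complemented to "ACGTT"; header lines (">...") pass through unchanged.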
+
+main();
diff --git a/scripts/rtaxVote.pl b/scripts/rtaxVote.pl
new file mode 100755
index 0000000..a81bba3
--- /dev/null
+++ b/scripts/rtaxVote.pl
@@ -0,0 +1,286 @@
+#!/usr/bin/env perl
+
+# read the prokMSAids of the hits, as produced by exactMatchIds.
+# Then grab the taxonomy info for those prokMSAids and make some classification info from the set.
+
+use strict;
+use warnings;
+
+#use lib '.';
+use FindBin;
+use lib "$FindBin::Bin";
+
+use LevelPrediction;
+
+my %taxonomies = ();
+
+# my %taxonomyPrior = ();
+
+loadTaxonomy( $ARGV[0] );
+processHits( $ARGV[1] );
+
+sub loadTaxonomy {
+ my ($taxFileName) = @_;
+ print STDERR "loading taxonomy...\n";
+ open( TAX, $taxFileName ) or die "Cannot open input taxonomy file $taxFileName: $!";
+ my $lines = 0;
+ while ( my $line = <TAX> ) {
+ chomp $line;
+
+ # try to accept several formats:
+ # semicolon or tab-delimited taxonomy strings
+ # and including the "pcid-width" column between the sequence id and
+ # the taxonomy string, or not
+
+ my ( $prokMSAid, @taxonomy ) = split( /[;\t]/, $line );
+ #my ( @taxonomy ) = split( /; /, $taxString );
+
+ if ( @taxonomy && ($taxonomy[0] eq "" || $taxonomy[0] =~ /^(\d+\.?\d*|\.\d+)$/ )) {
+ # value is numeric or empty, must be the pcid width of the target cluster
+ my $taxPcid = shift @taxonomy;
+ #print STDERR "Ignoring target cluster width: $taxPcid\n";
+ }
+
+ $taxonomies{$prokMSAid} = \@taxonomy;
+
+# my $taxString = "";
+# for my $taxElem (@taxonomy) {
+# $taxString .= "$taxElem; ";
+#
+# # $taxonomyPrior{$taxString}++;
+# }
+ $lines++;
+ }
+ close TAX;
+ print STDERR "...done loading $lines taxonomy lines\n";
+
+ # print STDERR "normalizing prior...\n";
+ # for my $taxKey (keys %taxonomyPrior)
+ # {
+ # $taxonomyPrior{$taxKey} /= $lines;
+ # }
+ # $taxonomyPrior{""} = 1;
+ # print STDERR "...done normalizing prior\n";
+}
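+
+# Example input lines the parser above accepts (IDs and taxa invented):
+#   12345<TAB>0.97<TAB>Bacteria; Proteobacteria; Gammaproteobacteria
+#   67890<TAB>Archaea<TAB>Crenarchaeota
+# In the first form the numeric second column (the cluster pcid width) is
+# detected and discarded; in the second the taxonomy string itself is
+# tab-delimited, so every field after the id is kept as a taxon.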
+
+sub printLine {
+ my ( $label, $bestPcid, @ids ) = @_;
+
+ my $levelPredictions = makeTaxonomyPrediction(@ids);
+ print "$label\t$bestPcid";
+ for my $levelPrediction (@$levelPredictions) {
+ print "\t" . $levelPrediction->toString();
+ }
+
+ #if(scalar(@$levelPredictions) == 0) {$unclassified++}
+ #if(scalar(@$levelPredictions) == 0) { print "\tUNCLASSIFIED"; }
+
+ print "\n";
+
+ return ( scalar(@$levelPredictions) == 0 );
+}
+
+sub processHits {
+ my ($hitsFileName) = @_;
+ open( HITS, $hitsFileName ) || die("Could not open $hitsFileName");
+
+ my $hit = 0;
+ my $noprimer = 0;
+ my $nohit = 0;
+ my $toomanyhits = 0;
+ my $nomatepair = 0;
+ my $unclassified = 0;
+
+ while (<HITS>) {
+ chomp;
+ my ( $label, $bestPcid, @ids ) = split /\t/;
+
+ if ( ( !@ids ) || ( $ids[0] eq "" ) ) # an empty id list
+ {
+ print "$label\t\tNOHIT\n";
+ $nohit++;
+ }
+ elsif ( $ids[0] eq "NOHIT" ) # an explicit NOHIT token; historically this case was represented only by an empty id list
+ {
+ print "$label\t\tNOHIT\n";
+ $nohit++;
+ }
+ elsif ( ( $ids[0] eq "NOPRIMER" ) ) {
+ print "$label\t\tNOPRIMER\n";
+ $noprimer++;
+ }
+ elsif ( ( $ids[0] eq "TOOMANYHITS" ) ) {
+ print "$label\t\tTOOMANYHITS\n";
+ $toomanyhits++;
+ }
+ elsif ( ( $ids[0] eq "NOMATEPAIR" ) ) {
+ print "$label\t\tNOMATEPAIR\n";
+ $nomatepair++;
+ }
+ else {
+ $unclassified += printLine( $label, $bestPcid, @ids );
+ $hit++;
+ }
+ }
+
+ my $samples = $hit + $nohit + $toomanyhits + $nomatepair + $noprimer;
+
+ print STDERR "$samples items\n";
+ if ( $samples != 0 ) {
+ print STDERR "$noprimer (" . sprintf( "%.1f", ( 100 * $noprimer / $samples ) ) . "%) had no primer match\n";
+ print STDERR "$nohit (" . sprintf( "%.1f", ( 100 * $nohit / $samples ) ) . "%) had no hits\n";
+ print STDERR "$toomanyhits (" . sprintf( "%.1f", ( 100 * $toomanyhits / $samples ) ) . "%) had too many hits\n";
+ print STDERR "$nomatepair (" . sprintf( "%.1f", ( 100 * $nomatepair / $samples ) ) . "%) had no mate pair\n";
+ print STDERR "$unclassified (" . sprintf( "%.1f", ( 100 * $unclassified / $samples ) ) . "%) had hits but no classification\n";
+ print STDERR "" . ( $hit - $unclassified ) . " (" . sprintf( "%.1f", ( 100 * ( $hit - $unclassified ) / $samples ) ) . "%) were classified\n";
+ }
+
+ # classifications per level? That depends on the stringency filter, which is downstream
+
+ close HITS;
+}
+
+#sub normalizeByPrior {
+# my ($taxString, $taxonCounts) = @_;
+#
+# print STDERR "Normalizing: \n";
+# while( my ($k, $v) = each %$taxonCounts) {
+# print STDERR "$k = $v ; ";
+# }
+# print STDERR "\n";
+#
+# my $normalizer = $taxonomyPrior{$taxString};
+# if(!defined $normalizer)
+# {
+# print STDERR ("No prior: $taxString\n");
+# }
+#
+# for my $taxon (keys %$taxonCounts)
+# {
+# $taxonCounts->{$taxon} = ($taxonCounts->{$taxon} / $taxonomyPrior{$taxString . $taxon . "; "}) * $normalizer;
+# }
+# print STDERR " Done: \n";
+# while( my ($k, $v) = each %$taxonCounts) {
+# print STDERR "$k = $v ; ";
+# }
+# print STDERR "\n";
+#}
+
+sub makeTaxonomyPrediction {
+ my (@ids) = @_;
+
+ my @levelPredictions = ();
+
+# TOOMANYHITS is now added in ucFilterBest.pl, so it should already be caught in processHits above
+
+# if(scalar(@ids) == 1000) {
+# # when we did the uclust at exactMatchOneReadSet:50, we used " --allhits --maxaccepts 1000 ".
+# # after that we filtered for the best pcid set (using ucFilterBest.pl).
+# # if 1000 hits remain, then the real set of best-pcid hits is larger, and we missed some.
+# # In that case we should bail because the set of 1000 hits we do have may not be representative.
+# # I think this is the reason why matching the E517F primer only (17nt reads) produced predictions, and at different levels to boot.
+# # That also depends on the classification thresholds.
+#
+# my $levelPrediction = LevelPrediction->new();
+# $levelPrediction->label("TOOMANYHITS");
+# push @levelPredictions, $levelPrediction;
+# return \@levelPredictions;
+# }
+
+ my @taxonomyVectors = map { $taxonomies{$_} } @ids;
+# my @taxonomyClusterSizes = map { $taxonomyWorstPcids{$_} } @ids;
+
+ my $globalTotal = @taxonomyVectors;
+
+ my $predictedTaxa = "";
+
+ my $globalUnknowns = 0; # at all levels, incrementing as we go
+
+ for my $level ( 0 .. 10 ) {
+
+ my $levelPrediction = LevelPrediction->new();
+
+ # assert the remaining taxonomyVectors are equal at higher levels
+
+ my %taxonCounts = ();
+ my $localUnknowns = 0;
+ my $localTotal = @taxonomyVectors;
+
+ # count up how often each label is seen descending from this node
+
+ for my $vec (@taxonomyVectors) {
+ my $taxon = $vec->[$level];
+
+ # "Unknown" should occur only towards the leaves; an unspecified intermediate level followed by a populated level is a "skip".
+ # Here, "skip" is counted as just another taxon.
+ if ( !defined $taxon || $taxon eq "" || $taxon =~ /unknown/i ) { $localUnknowns++; }
+ else { $taxonCounts{$taxon}++; }
+ }
+
+ if ( $localUnknowns == $localTotal ) { last; }
+
+ # this normalization makes no sense, because we don't want a uniform prior either
+ # e.g., one Archaeal hit among dozens of Bacterial hits will win, because there are so few Archaea in GreenGenes to begin with
+ # normalizeByPrior( $predictedTaxa, \%taxonCounts );
+
+ # get the best label and check for ties
+
+ $levelPrediction->numChildren( scalar( keys %taxonCounts ) );
+
+ my @taxaInOrder = sort { $taxonCounts{$b} <=> $taxonCounts{$a} } keys %taxonCounts;
+ my $bestTaxon = $taxaInOrder[0];
+
+ # print STDERR "$bestLabel\n";
+ $levelPrediction->label($bestTaxon);
+ $predictedTaxa .= "$bestTaxon; ";
+
+ my $bestCount = $taxonCounts{$bestTaxon};
+ $levelPrediction->count($bestCount);
+
+ my $secondBestTaxon = $taxaInOrder[1];
+ if ( defined $secondBestTaxon ) {
+ my $secondBestCount = $taxonCounts{$secondBestTaxon};
+
+ if ( $levelPrediction->count() < 2 * $secondBestCount ) {
+ # Declare a tie if the winning taxon doesn't have at least twice as many votes as the runner-up.
+ # just consider this level unknown and bail
+ last;
+ }
+ }
+
+ # compute various ratios of the prediction vs. alternatives
+
+ $levelPrediction->localMinProportion( $bestCount / $localTotal );
+ $levelPrediction->localMaxProportion( ( $bestCount + $localUnknowns ) / $localTotal );
+
+ # accumulate unknowns across all levels seen so far
+ $globalUnknowns += $localUnknowns;
+ $levelPrediction->globalMinProportion( $bestCount / $globalTotal );
+ $levelPrediction->globalMaxProportion( ( $bestCount + $globalUnknowns ) / $globalTotal );
+
+ # what if all of the "unknown" matches should have been the same? Then an "alternate" classification might have won
+ $levelPrediction->alternateLocalProportion( $localUnknowns / $localTotal );
+
+ # note we already added the local unknowns to the global unknowns
+ $levelPrediction->alternateGlobalProportion( $globalUnknowns / $globalTotal );
+
+ # it's possible that a completely different path has a higher global probability than this one,
+ # but we'd never know because we pick the max _per level_ and never explore the other paths.
+
+ # decide whether to bother continuing
+ # if ( $bestLocalMinProportion < 0.5 ) { last; }
+ # for now, print all the best labels until everything is unknown or there is a tie; sort out the confidence later
+
+ push @levelPredictions, $levelPrediction;
+
+ # remove any non-matching paths from consideration at the next level
+
+ my @newTaxonomyVectors = ();
+ for my $vec (@taxonomyVectors) {
+ my $taxon = $vec->[$level];
+ if ( defined $taxon && $taxon eq $bestTaxon ) { push @newTaxonomyVectors, $vec; }
+ }
+ @taxonomyVectors = @newTaxonomyVectors;
+
+ }
+ return \@levelPredictions;
+
+}
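+
+# Tie-rule illustration (counts invented): if the votes at some level are
+# { Firmicutes => 10, Proteobacteria => 6 }, then 10 < 2 * 6, so the winner
+# lacks a 2x margin over the runner-up and prediction stops at the previous
+# level; with { Firmicutes => 13, Proteobacteria => 6 } it continues down
+# the Firmicutes branch.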
diff --git a/scripts/splitDelimitedFasta.pl b/scripts/splitDelimitedFasta.pl
new file mode 100755
index 0000000..d85c230
--- /dev/null
+++ b/scripts/splitDelimitedFasta.pl
@@ -0,0 +1,46 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+use File::Basename;
+
+# make sure the delimiter is properly escaped, e.g. for "+" we need "\+".
+my $delimiter = $ARGV[0];
+my $infile    = $ARGV[1];
+
+my $infilename = basename($infile);
+
+open(IN, $infile) or die "Can't read $infile\n";
+open(A,">$infilename.a") or die "Can't write $infilename.a\n";
+open(B,">$infilename.b") or die "Can't write $infilename.b\n";
+
+my $fastaHeader = "";
+my $outfile = *A;
+
+while(<IN>)
+ {
+ chomp;
+ if(/^>/)
+ {
+ $outfile = *A;
+ $fastaHeader = $_;
+ print A "$fastaHeader\n";
+ print B "$fastaHeader\n";
+ }
+ else
+ {
+ if ( /(.*)$delimiter(.*)/)
+ {
+ print A "$1\n";
+ print B "$2\n";
+ $outfile = *B;
+ }
+ else
+ {
+ print $outfile "$_\n";
+ }
+ }
+ }
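+
+# Worked example (delimiter "\+", sequences invented): the record
+#   >seq1
+#   AAAA+TTTT
+#   CCCC
+# writes ">seq1" to both outputs, "AAAA" to file .a, and "TTTT" then "CCCC"
+# to file .b, since lines after the delimited one keep going to .b.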
+close IN;
+close A;
+close B;
\ No newline at end of file
--
Alioth's /git/debian-med/git-commit-notice on /srv/git.debian.org/git/debian-med/rtax.git