[med-svn] [vsearch] 05/10: Imported Upstream version 2.3.2
Andreas Tille
tille at debian.org
Mon Dec 5 11:49:39 UTC 2016
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository vsearch.
commit 34bca89b567d3536fa36def013d4778fddc8a151
Author: Andreas Tille <tille at debian.org>
Date: Mon Dec 5 12:02:47 2016 +0100
Imported Upstream version 2.3.2
---
 README.md            | 110 +++++++-------------------
 configure.ac         |   2 +-
 man/vsearch.1        | 217 +++++++++++++++++++++++++++------------------------
 src/results.cc       |   2 +-
 src/vsearch.cc       |  16 ++--
 test/unclassified.sh |  48 +++++++++++-
6 files changed, 198 insertions(+), 197 deletions(-)
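
Note on the man/vsearch.1 hunks in this commit: they touch the paragraph describing how sample identifiers are taken from sequence headers when building OTU tables (a ';sample=abc123;' or ';barcodelabel=abc123;' annotation, with a fallback to the initial part of the header). The rule can be sketched in Python roughly as follows. This is an illustration of the documented behaviour, not vsearch's own code: the function name and regular expressions are hypothetical, only the two keys named in the man page are handled (the "or a similar string" case is not), and the fallback is read as "the leading run of letters, digits and underscores", which is an assumption.

```python
import re

# Hypothetical helper; not part of vsearch.
# Matches 'sample=...' or 'barcodelabel=...' at the start of the header or
# after a semicolon. The value runs up to the next semicolon or the end of
# the header, since the man page says the identifier may contain any
# printable character except semicolons.
_SAMPLE_RE = re.compile(r"(?:^|;)(?:sample|barcodelabel)=([^;]+)")

# Assumed fallback: the leading run of letters, digits and underscores.
_PREFIX_RE = re.compile(r"^[A-Za-z0-9_]+")


def sample_id(header):
    """Return the sample identifier for a FASTA header (without '>'), or None."""
    match = _SAMPLE_RE.search(header)
    if match:
        return match.group(1)
    match = _PREFIX_RE.match(header)
    return match.group(0) if match else None
```

For example, sample_id("seq1;sample=abc123;size=5;") yields "abc123", while a header without such an annotation, such as "S1_read7 extra", falls back to "S1_read7".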
diff --git a/README.md b/README.md
index 070093a..105d87c 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ The aim of this project is to create an alternative to the [USEARCH](http://www.
* be as accurate or more accurate than usearch
* be as fast or faster than usearch
-We have implemented a tool called VSEARCH which supports *de novo* and reference based chimera detection, clustering, full-length and prefix dereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering and conversion.
+We have implemented a tool called VSEARCH which supports *de novo* and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads.
VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.
@@ -35,9 +35,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
```
-wget https://github.com/torognes/vsearch/archive/v1.11.1.tar.gz
-tar xzf v1.11.1.tar.gz
-cd vsearch-1.11.1
+wget https://github.com/torognes/vsearch/archive/v2.3.2.tar.gz
+tar xzf v2.3.2.tar.gz
+cd vsearch-2.3.2
./autogen.sh
./configure
make
@@ -60,18 +60,18 @@ make install # as root or sudo make install
**Binary distribution** Starting with version 1.4.0, binary distribution files (.tar.gz) for GNU/Linux on x86-64 and Apple Mac OS X on x86-64 containing pre-compiled binaries as well as the documentation (man and pdf files) will be made available as part of each [release](https://github.com/torognes/vsearch/releases). The included executables include support for input files compressed by zlib and bzip2 (with files usually ending in `.gz` or `.bz2`). Download the appropriate executable fo [...]
```sh
-wget https://github.com/torognes/vsearch/releases/download/v1.11.1/vsearch-1.11.1-linux-x86_64.tar.gz
-tar xzf vsearch-1.11.1-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.3.2/vsearch-2.3.2-linux-x86_64.tar.gz
+tar xzf vsearch-2.3.2-linux-x86_64.tar.gz
```
Or these commands if you are using a Mac:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v1.11.1/vsearch-1.11.1-osx-x86_64.tar.gz
-tar xzf vsearch-1.11.1-osx-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.3.2/vsearch-2.3.2-osx-x86_64.tar.gz
+tar xzf vsearch-2.3.2-osx-x86_64.tar.gz
```
-You will now have the binary distribution in a folder called something like `vsearch-1.11.1-linux-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+You will now have the binary distribution in a folder called something like `vsearch-2.3.2-linux-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
**Old binaries** Older VSEARCH binaries (until version 1.1.3) are available [here](https://github.com/torognes/vsearch/releases) for [GNU/Linux on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.1.3-linux-x86_64) and [Apple Mac OS X on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.1.3-osx-x86_64). These executables include support for input files compressed by zlib and bzip2 (with files usually ending in `.gz` or `.bz2`). Down [...]
@@ -98,60 +98,15 @@ ln -s vsearch-1.1.3-linux-x86_64 vsearch
With the `from-uc`command in [biom](http://biom-format.org/) 2.1.5 or later, it is possible to convert data in a `.uc` file produced by vsearch into a biom file that can be read by QIIME and other software. It is described [here](https://gist.github.com/gregcaporaso/f3c042e5eb806349fa18).
+Please note that vsearch version 2.2.0 and later are able to directly output OTU tables in biom 1.0 format as well as the classic and mothur formats.
-## Implementation details and initial assessment
-
-**Please note: The vsearch code has evolved substantially over time and the numbers below may not be accurate any more.**
-
-**Search algorithm:** VSEARCH indexes the unique kmers in the database in a way similar to USEARCH, but is currently limited to continuous words (non-spaced seeds) of length 7-15 (the length can be specified with the `--wordlength` option, default 8). It samples every unique kmer from each query sequence and identifies the number of matching kmers in each database sequence. It then examines the database sequences in order of decreasing number of kmer matches. A full global alignment is c [...]
-
-**Kmer selection:** How many and which kmers USEARCH chooses from the query sequence is not well documented. It is also not known exactly which database sequences are examined, and in which order. We have therefore experimented with various strategies in order to obtain good performance. Our procedure seems to give results equal to or better than USEARCH.
-
-We have chosen to select all kmers occuring at least once from the query. At least 10 (by default, can be specified with `--miwordmatches`) of these kmers must be present in the database sequence before it will be considered. If the query has fewer than 10 kmers, all must be present in the database sequence. Furthermore, if several database sequences have the same number of kmer matches, they will be examined in order of decreasing sequence length.
-
-It appears that there are differences in usearch between the searches performed by the `--usearch_global` command and the clustering commands. Notably, it appears like `--usearch_global` simply ignores the options `--wordlength`, `--slots` and `--pattern`, while the clustering commands takes them into account. VSEARCH supports the `--wordlength` option for kmer lengths from 7 to 15, but the options `--slots` and `--pattern` are ignored.
-
-**Alignment:** VSEARCH uses a 8-way 16-bit SIMD vectorized implementation of the full dynamic programming algorithm (Needleman-Wunsch) for global sequence alignment. It is an adaptation of the method described by Rognes (2011). Due to the extreme memory requirements of this method when aligning two long sequences (e.g. more than 5000bp long), an alternative algorithm described by Hirschberg (1975) and Myers and Miller (1988) is used when aligning a pair of long sequences. This alternativ [...]
-
-**Search Accuracy:** The accuracy of VSEARCH searches has been assessed and compared to USEARCH version 7.0.1090. The Rfam 11.0 database was used for the assessment, as described on the [USEARCH website](http://drive5.com/usearch/benchmark_rfam.html). A similar procedure was described in the USEARCH paper using the Rfam 9.1 database.
-
-The database was initially shuffled. Then the first sequence from each of the 2085 Rfam families with at least two members was selected as queries while the rest was used as the database. The ability of VSEARCH and USEARCH to identify another member of the same family as the top hit was measured, and then recall and precision was calculated.
-
-When USEARCH was run without the `--fulldp` option, VSEARCH had much better recall than USEARCH, but the precision was lower. The [F<sub>1</sub>-score](http://en.wikipedia.org/wiki/F1_score) was considerably higher for VSEARCH. When USEARCH was run with `--fulldp`, VSEARCH had slightly better recall, precision and F-score than USEARCH.
-
-The recall of VSEARCH was usually about 92.3-93.5% and the precision was usually 93.0-94.1%. When run without the `--fulldp` option the recall of USEARCH was usually about 83.0-85.3% while precision was 98.5-99.0%. When run with the `--fulldp` option the recall of USEARCH was usually about 92.0-92.8% and the precision was about 92.2-93.0%.
-
-Please see the files in the `eval` folder for the scripts used for this assessment.
-
-**Search speed:** The speed of VSEARCH searches appears to be somewhat faster than USEARCH when USEARCH is run without the `--fulldp` option. When USEARCH is run with the `--fulldp` option, VSEARCH may be considerable faster, but it depends on the options and sequences used.
-
-For the accuracy assessment searches in Rfam 11.0 with 100 replicates of the query sequences, VSEARCH needed 46 seconds, whereas USEARCH needed 60 seconds without the `--fulldp` option and 70 seconds with `--fulldp`. This includes time for loading, masking and indexing the database (about 2 seconds for VSEARCH, 5 seconds for USEARCH). The measurements were made on a Apple MacBook Pro Retina 2013 with four 2.3GHz Intel Core i7 cores (8 virtual cores) using the default number of threads (8).
-
-**Memory:** VSEARCH is a 64-bit program and supports very large databases if you have enough memory. Search and clustering might use a lot of memory, especially if run with many threads. Memory usage has not been compared with USEARCH yet.
-**Clustering:** The clustering commands `--cluster_smallmem` and `--cluster_fast` have been implemented. These commands support multiple threads. The only difference between `--cluster_smallmem` and `--cluster_fast` is that `--cluster_fast` will sort the sequences by length before clustering, while `--cluster_smallmem` require the sequences to be in length-sorted order unless the `--usersort` option is specified. An additional clustering command called `--cluster_size` has been added tha [...]
-
-The speed of clustering with VSEARCH relative to USEARCH depends on how many threads are used. Running with a single thread VSEARCH currently seems to be 2-4 times slower than with USEARCH, depending on parameters. Speed also depends on sequence length, and vsearch is relatively slower with longer sequences compared to usearch.
-
-**Chimera detection:** Chimera detection using the algorithm described by Edgar *et al.* (2011) has been implemented in VSEARCH. Both the `--uchime_ref` and `--uchime_denovo` commands and all their options are supported.
-
-A preliminary assessment of the accuracy of VSEARCH on chimera detection has been performed using the SIMM dataset described in the UCHIME paper. See the `eval/chimeval.sh` script and the results in `eval/chimeval.txt` for details. On the datasets with 1-5% substitutions, VSEARCH is generally on par with the original UCHIME implementation (version 4.2.40), and a bit more accurate than the implementation in USEARCH (version 7.0.1090). On the datasets with 1-5% indels, VSEARCH is clearly m [...]
-
-**Dereplication and sorting:** The dereplication and sorting commands seems to be considerably faster in VSEARCH than in USEARCH.
-
-**Masking:** VSEARCH by default uses an optimized multithreaded re-implementation of the well-known DUST algorithm by Tatusov and Lipman (source: ftp://ftp.ncbi.nlm.nih.gov/pub/tatusov/dust/version1/src/) to mask simple repeats and low-complexity regions in the sequences before searching and clustering. USEARCH by default uses an undocumented rapid masking method called "fastnucleo" that seems to mask fewer and smaller regions than dust. USEARCH may also be run with the DUST masking meth [...]
-
-**Extensions:** A shuffle command has been added. By specifying a FASTA file using the `--shuffle` option, and an output file with the `--output` option, VSEARCH will shuffle the sequences in a pseudo-random order. An integer may be specified as the seed with the `--seed` option to generate the same shuffling several times. By default, or when `--seed 0` is specified, the pseudo-random number generator will be initialized with pseudo-random data from the machine to give different numbers [...]
-
-Another extension implemented is that `--derep_fulllength` and `--cluster_fast` will honour the `--sizein` option and add together the abundances for the sequences that are clustered.
-
-An additional clustering command called `--cluster_size` has been added that will sort sequences by abundance before clustering.
-
-The commands `--sortbylength` and `--sortbysize` supports the `--topn` option to output no more than the given number of sequences.
+## Implementation details and initial assessment
-The width of FASTA formatted output files may be specified with the `--fasta_width` option and the width of alignments produced with the `--alnout` and `--uchimealn` options may be specified with the `--rowlen` and `--alignwidth` options, respectively. When an argument of zero (0) is specified for these options, sequences and alignments will not be wrapped.
+Please see the paper for details:
-VSEARCH implements the old USEARCH option `--iddef` to specify the definition of identity used to rank the hits. Values accepted are 0 (CD-HIT definition using shortest sequence as numerator), 1 (edit distance), 2 (edit distance excluding terminal gaps, default), 3 (Marine Biological Lab definition where entire gaps are considered a single difference) or 4 (BLAST, same as 2). See the [USEARCH User Guide 4.1](http://drive5.com/usearch/UsearchUserGuide4.1.pdf) page 42-44 for details. Also [...]
+Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584
+doi: [10.7717/peerj.2584](https://doi.org/10.7717/peerj.2584)
## Dependencies
@@ -178,8 +133,6 @@ VSEARCH includes public domain code written by Alexander Peslyak for the MD5 mes
VSEARCH includes public domain code written by Steve Reid and others for the SHA1 message digest algorithm.
-VSEARCH includes statistical data from [PEAR](https://github.com/xflouris/PEAR) by Zhang, Kobert, Flouri & Stamatakis. Used with permission.
-
The VSEARCH distribution includes code from GNU Autoconf which normally is available under the GNU General Public License, but may be distributed with the special autoconf configure script exception.
VSEARCH may include code from the [zlib](http://www.zlib.net) library copyright Jean-loup Gailly and Mark Adler, distributed under the [zlib license](http://www.zlib.net/zlib_license.html).
@@ -207,7 +160,7 @@ File | Description
**dbhash.cc** | Database hashing for exact searches
**dbindex.cc** | Indexes the database by identifying unique kmers in the sequences
**derep.cc** | Dereplication
-**dynlib.cc** | Dynamic loading of compression libraries
+**dynlibs.cc** | Dynamic loading of compression libraries
**eestats.cc** | Produce statistics for fastq_eestats command
**fasta.cc** | FASTA file parser
**fastq.cc** | FASTQ file parser
@@ -220,6 +173,7 @@ File | Description
**mergepairs.cc** | Paired-end read merging
**minheap.cc** | A minheap implementation for the list of top kmer matches
**msa.cc** | Simple multiple sequence alignment and consensus sequence computation for clusters
+**otutable.cc** | Generate OTU tables in various formats
**rerep.cc** | Rereplication
**results.cc** | Output results in various formats (alnout, userout, blast6, uc)
**search.cc** | Implements search using global alignment
@@ -283,19 +237,20 @@ Thanks to the following people for patches and other suggestions for improvement
## Citing VSEARCH
-The vsearch manuscript is now published in PeerJ Preprints:
+Please cite the following publication if you use VSEARCH:
-Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ Preprints 4:e2409v1 https://doi.org/10.7287/peerj.preprints.2409v1
+Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584.
+doi: [10.7717/peerj.2584](https://doi.org/10.7717/peerj.2584)
+Please note that citing any of the underlying algorithms, e.g. UCHIME, may also be appropriate.
## Test datasets
Test datasets (found in the separate vsearch-data repository) were
-obtained from the [BioMarks project](http://biomarks.eu/) (Logares et
-al. 2014), the [TARA OCEANS
-project](http://oceans.taraexpeditions.org/) (Karsenti et al. 2011)
-and the [Protist Ribosomal Database](http://ssu-rrna.org/) (Guillou et
-al. 2012).
+obtained from
+the [BioMarks project](http://biomarks.eu/) (Logares et al. 2014),
+the [TARA OCEANS project](http://oceans.taraexpeditions.org/) (Karsenti et al. 2011)
+and the [Protist Ribosomal Database](http://ssu-rrna.org/) (Guillou et al. 2012).
## References
@@ -310,31 +265,20 @@ doi:[10.1093/bioinformatics/btq461](http://dx.doi.org/10.1093/bioinformatics/btq
*Bioinformatics*, 27 (16): 2194-2200.
doi:[10.1093/bioinformatics/btr381](http://dx.doi.org/10.1093/bioinformatics/btr381)
-* Farrar M (2007)
-**Striped Smith-Waterman speeds database searches six times over other SIMD implementations.**
-*Bioinformatics* (2007) 23 (2): 156-161.
-doi:[10.1093/bioinformatics/btl582](http://dx.doi.org/10.1093/bioinformatics/btl582)
-
-* Guillou L., Bachar D., Audic S., Bass D., Berney C., Bittner L., Boutte C., Burgaud G., de Vargas C., Decelle J., del Campo J., Dolan J., Dunthorn M., Edvardsen B., Holzmann M., Kooistra W., Lara E., Lebescot N., Logares R., Mahé F., Massana R., Montresor M., Morard R., Not F., Pawlowski J., Probert I., Sauvadet A.-L., Siano R., Stoeck T., Vaulot D., Zimmermann P. & Christen R. (2013)
+* Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud G, de Vargas C, Decelle J, del Campo J, Dolan J, Dunthorn M, Edvardsen B, Holzmann M, Kooistra W, Lara E, Lebescot N, Logares R, Mahé F, Massana R, Montresor M, Morard R, Not F, Pawlowski J, Probert I, Sauvadet A-L, Siano R, Stoeck T, Vaulot D, Zimmermann P & Christen R (2013)
**The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy.**
*Nucleic Acids Research*, 41 (D1), D597-D604.
doi:[10.1093/nar/gks1160](http://dx.doi.org/10.1093/nar/gks1160)
-* Hirschberg D.S (1975) **A linear space algorithm for computing maximal common subsequences.** *Comm ACM*, 18(6), 341-343. doi:[10.1145/360825.360861](http://dx.doi.org/10.1145/360825.360861)
-
-* Karsenti E., González Acinas S., Bork P., Bowler C., de Vargas C., Raes J., Sullivan M. B., Arendt D., Benzoni F., Claverie J.-M., Follows M., Jaillon O., Gorsky G., Hingamp P., Iudicone D., Kandels-Lewis S., Krzic U., Not F., Ogata H., Pesant S., Reynaud E. G., Sardet C., Sieracki M. E., Speich S., Velayoudon D., Weissenbach J., Wincker P. & the Tara Oceans Consortium (2011)
+* Karsenti E, González Acinas S, Bork P, Bowler C, de Vargas C, Raes J, Sullivan M B, Arendt D, Benzoni F, Claverie J-M, Follows M, Jaillon O, Gorsky G, Hingamp P, Iudicone D, Kandels-Lewis S, Krzic U, Not F, Ogata H, Pesant S, Reynaud E G, Sardet C, Sieracki M E, Speich S, Velayoudon D, Weissenbach J, Wincker P & the Tara Oceans Consortium (2011)
**A holistic approach to marine eco-systems biology.**
*PLoS Biology*, 9(10), e1001177.
doi:[10.1371/journal.pbio.1001177](http://dx.doi.org/10.1371/journal.pbio.1001177)
-* Logares R., Audic S., Bass D., Bittner L., Boutte C., Christen R., Claverie J.-M., Decelle J., Dolan J. R., Dunthorn M., Edvardsen B., Gobet A., Kooistra W. H. C. F., Mahé F., Not F., Ogata H., Pawlowski J., Pernice M. C., Romac S., Shalchian-Tabrizi K., Simon N., Stoeck T., Santini S., Siano R., Wincker P., Zingone A., Richards T., de Vargas C. & Massana R. (2014) **The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes.**
+* Logares R, Audic S, Bass D, Bittner L, Boutte C, Christen R, Claverie J-M, Decelle J, Dolan J R, Dunthorn M, Edvardsen B, Gobet A, Kooistra W H C F, Mahé F, Not F, Ogata H, Pawlowski J, Pernice M C, Romac S, Shalchian-Tabrizi K, Simon N, Stoeck T, Santini S, Siano R, Wincker P, Zingone A, Richards T, de Vargas C & Massana R (2014) **The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes.**
*Current Biology*, 24(8), 813-821.
doi:[10.1016/j.cub.2014.02.050](http://dx.doi.org/10.1016/j.cub.2014.02.050)
-* Myers E.W., & Miller W. (1988) **Optimal alignments in linear space.**
-*Comput Appl Biosci*, 4(1), 11-17.
-doi:[10.1093/bioinformatics/4.1.11](http://dx.doi.org/10.1093/bioinformatics/4.1.11)
-
* Rognes T (2011)
**Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation.**
*BMC Bioinformatics*, 12: 221.
diff --git a/configure.ac b/configure.ac
index ff32971..794cb18 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.3.0], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.3.2], [torognes at ifi.uio.no])
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
AC_CONFIG_SRCDIR([src/vsearch.cc])
diff --git a/man/vsearch.1 b/man/vsearch.1
index 41cdac6..52b9ad9 100644
--- a/man/vsearch.1
+++ b/man/vsearch.1
@@ -1,5 +1,5 @@
.\" ============================================================================
-.TH vsearch 1 "October 10, 2016" "version 2.3.0" "USER COMMANDS"
+.TH vsearch 1 "November 18, 2016" "version 2.3.2" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em chimera detection, clustering, dereplication and
@@ -156,7 +156,7 @@ ending before the next identifier line, or the file end. \fBvsearch\fR
silently ignores ascii characters 9 to 13, and exits with an error
message if ascii characters 0 to 8, 14 to 31, '.' or '-' are
present. All other ascii or non-ascii characters are stripped and
-complained about in a non-blocking warning message.
+complained about in a warning message.
.PP
In fastq files, each entry is made of sequence header starting with a
symbol '@', a nucleotidic sequence (same rules as for fasta
@@ -185,7 +185,7 @@ example, R vs R) also receives a score of zero.
.PP
\fBvsearch\fR can read data from standard files and write to standard
files, but it can also read from pipes and write to pipes! For
-example, multiple fasta files can be piped into vsearch for
+example, multiple fasta files can be piped into \fBvsearch\fR for
dereplication. To do so, file names can be replaced with:
.RS
.IP - 2
@@ -196,6 +196,7 @@ a named pipe created with the command mkfifo,
.IP -
a process substitution '<(command)' as input or '>(command)' as output.
.RE
+.PP
\fBvsearch\fR can automatically read compressed gzip or bzip2 files if
the appropriate libraries are present during the
compilation. \fBvsearch\fR can also read pipes streaming compressed
@@ -504,16 +505,16 @@ respectively, and is recommended at least for large tables. The OTUs
are represented by the cluster centroids. Taxonomy information will be
included for the OTUs if available. Sample identifiers will be
extracted from the headers of all sequences in the input file. If the
-header contains ";sample=abc123;" or ";barcodelabel=abc123;" or a
-similar string somewhere, then the given sample identifier (here
-"abc123") will be used. The semicolon is not mandatory at the
+header contains ';sample=abc123;' or ';barcodelabel=abc123;' or a
+similar string somewhere, then the given sample identifier
+(here 'abc123') will be used. The semicolon is not mandatory at the
beginning or end of the header. The sample identifier may contain any
printable character except semicolons. If no such sample label is
found, the identifier in the initial part of the header will be used,
but only letters, digits and underscores are allowed. OTU identifiers
will be extracted from the headers of the cluster centroid
-sequences. If the header contains ";otu=def789;" or a similar string
-somewhere, then the given OTU identifier (here "def789") will be
+sequences. If the header contains ';otu=def789;' or a similar string
+somewhere, then the given OTU identifier (here 'def789') will be
used. The semicolon is not mandatory at the beginning or end of the
header. The OTU identifier may contain any printable character except
semicolons. If no such OTU label is found, the identifier in the
@@ -522,8 +523,8 @@ semicolons are allowed. Alternatively, OTU identifers can be generated
using the relabelling options (\-\-relabel, \-\-relabel_sha1 or
\-\-relabel_md5). Taxonomy information, if present, will also be
extracted from the headers of the centroid sequences. If the header
-contains ";tax=Homo_sapiens;" or a similar string somewhere, then the
-given taxonomy information (here "Homo_sapiens") will be used. The
+contains ';tax=Homo_sapiens;' or a similar string somewhere, then the
+given taxonomy information (here 'Homo_sapiens') will be used. The
semicolon is not mandatory at the beginning or end of the header. The
taxonomy information may contain any printable character except
semicolons. If an OTU table in the biom version 2.1 HDF5 file format
@@ -617,13 +618,13 @@ alignment. Columns containing a majority of gaps are skipped, except
for terminal gaps.
.TP
.BI \-\-mothur_shared_out \0filename
-Output an OTU table in the mothur "shared" tab-separated plain text
+Output an OTU table in the mothur 'shared' tab-separated plain text
format as described at http://www.mothur.org/wiki/Shared_file. The
format describes how a matrix containing the abundances of the OTUs in
the different samples is stored. The first line will start with the
-strings "label", "group" and "numOtus" and is followed by a list of
+strings 'label', 'group' and 'numOtus' and is followed by a list of
all OTU identifiers. The following lines, one for each sample, starts
-with the string "vsearch" followed by the sample identifier, the total
+with the string 'vsearch' followed by the sample identifier, the total
number of OTUs, and a list of abundances for each OTU in that sample,
in the order given on the first line. The OTU and sample identifiers
are extracted from the FASTA headers of the sequences. The OTUs are
@@ -633,7 +634,7 @@ further details.
.BI \-\-otutabout \0filename
Output an OTU table in the classic tab-separated plain text format as
a matrix containing the abundances of the OTUs in the different
-samples. The first line will start with the string "#OTU ID" and is
+samples. The first line will start with the string '#OTU ID' and is
followed by a tab-separated list of all sample identifiers. The
following lines, one for each OTU, starts with the OTU identifier and
is followed by a tab-separated list of abundances for that OTU in each
@@ -641,7 +642,7 @@ sample, in the order given on the first line. The OTU and sample
identifiers are extracted from the FASTA headers of the sequences. The
OTUs are represented by the cluster centroids. An extra column is
added to the right of the table if taxonomy information is available
-for at least one of the OTUs. This column will be labelled "taxonomy"
+for at least one of the OTUs. This column will be labelled 'taxonomy'
and each row will then contain the taxonomy information extracted for
that OTU. See the \-\-biomout option for further details.
.TP
@@ -650,11 +651,11 @@ Output a sequence profile to a text file with the frequency of each
nucleotide in each position in the multiple alignment for each
cluster. There is a FASTA-like header line for each cluster, followed
by the profile information in a tab-separated format. The columns are:
-position (1-based), consensus nucleotide, number of A's, number of
-C's, number of G's, number of Ts or Us, and finally the number of
-gaps. If ambiguous nucleotide symbols are present, the numbers may be
-floating point numbers, otherwise they are integers. For instance,
-an 'R' counts 0.5 towards an A and 0.5 towards a G.
+position (1-based), consensus nucleotide, number of As, number of Cs,
+number of Gs, number of Ts or Us, and finally the number of gaps. If
+ambiguous nucleotide symbols are present, the numbers may be floating
+point numbers, otherwise they are integers. For instance, an 'R'
+counts 0.5 towards an A and 0.5 towards a G.
.TP
.BI \-\-qmask\~ "none|dust|soft"
Mask regions in sequences using the
@@ -682,7 +683,7 @@ description of the same option under Chimera detection for details.
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
-file (search for the pattern "[>;]size=\fIinteger\fR[;]" in sequence
+file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
headers).
.TP
.B \-\-sizeorder
@@ -697,7 +698,7 @@ greedy clustering (DGC).
.TP
.B \-\-sizeout
Add abundance annotations to the output fasta files (add the pattern
-";size=\fIinteger\fR;" to sequence headers). If \-\-sizein is
+';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
specified, abundance annotations are reported to output files, and
each cluster centroid receives a new abundance value corresponding to
the total abundance of the amplicons included in the cluster
@@ -724,27 +725,28 @@ C):
.IP \n[step]. 4
Record type: S, H, or C.
.IP \n+[step].
-Cluster number (0-based).
+Cluster number (zero-based).
.IP \n+[step].
Centroid length (S), query length (H), or cluster size (C).
.IP \n+[step].
-Percentage of similarity with the centroid sequence (H), or set to "*"
+Percentage of similarity with the centroid sequence (H), or set to '*'
(S, C).
.IP \n+[step].
-Match orientation + or - (H), or set to "*" (S, C).
+Match orientation + or - (H), or set to '*' (S, C).
.IP \n+[step].
-Not used, always set to "*" (S, C) or to zero (H).
+Not used, always set to '*' (S, C) or to zero (H).
.IP \n+[step].
-Not used, always set to "*" (S, C) or to zero (H).
+Not used, always set to '*' (S, C) or to zero (H).
.IP \n+[step].
-Compact representation of the pairwise alignment using the CIGAR
-format (Compact Idiosyncratic Gapped Alignment Report): M (match), D
-(deletion) and I (insertion). The equal sign "=" indicates that the
-query is identical to the centroid sequence.
+set to '*' (S, C) or, for H, compact representation of the pairwise
+alignment using the CIGAR format (Compact Idiosyncratic Gapped
+Alignment Report): M (match), D (deletion) and I (insertion). The
+equal sign '=' indicates that the query is identical to the centroid
+sequence.
.IP \n+[step].
Label of the query sequence (H), or of the centroid sequence (S, C).
.IP \n+[step].
-Label of the centroid sequence (H), or set to "*" (S, C).
+Label of the centroid sequence (H), or set to '*' (S, C).
.RE
.RE
.TP
@@ -796,7 +798,7 @@ and sorted by decreasing abundance. Identical sequences receive the
header of the first sequence of their group. If \-\-sizeout is used,
the number of occurrences (i.e. abundance) of each sequence is
indicated at the end of their fasta header using the pattern
-";size=\fIinteger\fR;".
+';size=\fIinteger\fR;'.
.TP
.BI \-\-relabel \0string
Please see the description of the same option under Chimera detection
@@ -824,12 +826,12 @@ is specified, in which case an abundance of 1 is used.
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
-file (search for the pattern "[>;]size=\fIinteger\fR[;]" in sequence
+file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
headers).
.TP
.B \-\-sizeout
Add abundance annotations to the output fasta file (add the pattern
-";size=\fIinteger\fR;" to sequence headers). If \-\-sizein is
+';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
specified, each unique sequence receives a new abundance value
corresponding to its total abundance (sum of the abundances of its
occurrences). If \-\-sizein is not specified, input abundances are set
@@ -858,24 +860,24 @@ content varies with the type of entry (S, H or C):
.IP \n[step]. 4
Record type: S, H, or C.
.IP \n+[step].
-Cluster number (0-based).
+Cluster number (zero-based).
.IP \n+[step].
Sequence length (S, H), or cluster size (C).
.IP \n+[step].
-Percentage of similarity with the centroid sequence (H), or set to "*"
+Percentage of similarity with the centroid sequence (H), or set to '*'
(S, C).
.IP \n+[step].
-Match orientation + or - (H), or set to "*" (S, C).
+Match orientation + or - (H), or set to '*' (S, C).
.IP \n+[step].
-Not used, always set to "*" (S, C) or 0 (H).
+Not used, always set to '*' (S, C) or 0 (H).
.IP \n+[step].
-Not used, always set to "*" (S, C) or 0 (H).
+Not used, always set to '*' (S, C) or 0 (H).
.IP \n+[step].
-Not used, always set to "*".
+Not used, always set to '*'.
.IP \n+[step].
Label of the query sequence (H), or of the centroid sequence (S, C).
.IP \n+[step].
-Label of the centroid sequence (H), or set to "*" (S, C).
+Label of the centroid sequence (H), or set to '*' (S, C).
.RE
.RE
.TP
@@ -1163,19 +1165,19 @@ Rate: growth rate of AvgEE between this position and position - 1.
RatePct: Rate (as explained above) expressed as a percentage.
.RE
.TP
-Effect of expected error and length filtering
+Effect of expected error and length filtering:
.RS
The first column indicates read lengths (\fIL\fR). The next four
columns indicate the number of reads that would be retained by the
\-\-fastq_filter command if the reads were truncated at length \fIL\fR
(option \-\-fastq_trunclen \fIL\fR) and filtered to have a maximum
-expected error of 1.0, 0.5, 0.25 or 0.1 (option \-\-fastq_maxee
-\fIfloat\fR). The last four columns indicate the fraction of reads
-that would be retained by the \-\-fastq_filter command using the same
-length and maximum expected error parameters.
+expected error of 1.0, 0.5, 0.25 or 0.1 (with the option
+\-\-fastq_maxee \fIfloat\fR). The last four columns indicate the
+fraction of reads that would be retained by the \-\-fastq_filter
+command using the same length and maximum expected error parameters.
.RE
.TP
-Effect of minimum quality and length filtering
+Effect of minimum quality and length filtering:
.RS
The first column indicates read lengths (\fILen\fR). The next four
columns indicate the fraction of reads that would be retained by the
@@ -1298,42 +1300,42 @@ An input sequence can be composed of lower- or uppercase letters. When
soft masking is specified, lower case letters are treated as symbols
that should be masked. Otherwise the case of the input sequences is
ignored.
-
-Masking is performed by the commands
-for chimera detection (uchime_denovo, uchime_ref), clustering
-(cluster_fast, cluster_smallmem, cluster_size), masking (maskfasta,
-fastx_mask), pairwise alignment (allpairs_global) and searching
-(search_exact, usearch_global).
-
+.br
+Masking is performed by the commands for chimera detection
+(uchime_denovo, uchime_ref), clustering (cluster_fast,
+cluster_smallmem, cluster_size), masking (maskfasta, fastx_mask),
+pairwise alignment (allpairs_global) and searching (search_exact,
+usearch_global).
+.br
Masking is usually specified with the \-\-qmask option, while the
\-\-dbmask option is used for the database sequences specified with
-the \-\-db option with the \-\-usearch_global,
-\-\-search_exact and \-\-uchime_ref commands.
-
+the \-\-db option with the \-\-usearch_global, \-\-search_exact and
+\-\-uchime_ref commands.
+.br
The argument to the \-\-qmask and \-\-dbmask option may be none, soft
or dust. If the argument is none, the no masking is performed. If the
argument is soft the lower case symbols are masked. Finally, if the
argument is dust, the sequence is masked using the DUST algorithm by
Tatusov and Lipman to mask low-complexity regions.
-
+.br
If the \-\-hardmask option is specified, all masked regions are
converted to N's, otherwise masked regions are indicated by lower case
letters.
-
+.br
If any sequence is masked, the masked version of the sequence (with
lower case letters or N's) is used in all output files. Otherwise the
sequence is unmodified. The exception is the sequences in the output
file specified with the \-\-uchimealns option, where the input
sequences are converted to upper case first and lower case letters
indicate disagreement between the aligned sequences.
-
+.br
When a sequence region is masked, words in the region are not included
in the indices used in the heuristic search algorithm. In all other
aspects, the region is treated as other regions.
-
+.br
Regions in sequences that are hardmasked (with N's) have a zero
alignment score and do not contribute to an alignment.
-
+.br
Here are the results of combined masking options \-\-qmask (or
\-\-dbmask for database sequences) and \-\-hardmask, assuming each
input sequence contains both lower and uppercase nucleotides:
@@ -1445,7 +1447,7 @@ comparison per line:
.RS
.nr step 1 1
.IP \n[step]. 4
-Record type, always set to "H".
+Record type, always set to 'H'.
.IP \n+[step].
Ordinal number of the target sequence (based on input order, starting
from zero).
@@ -1454,7 +1456,7 @@ Sequence length.
.IP \n+[step].
Percentage of similarity with the target sequence.
.IP \n+[step].
-Match orientation, always set to "+".
+Match orientation, always set to '+'.
.IP \n+[step].
Not used, always set to zero.
.IP \n+[step].
@@ -1462,7 +1464,7 @@ Not used, always set to zero.
.IP \n+[step].
Compact representation of the pairwise alignment using the CIGAR
format (Compact Idiosyncratic Gapped Alignment Report): M (match), D
-(deletion) and I (insertion). The equal sign "=" indicates that the
+(deletion) and I (insertion). The equal sign '=' indicates that the
query is identical to the centroid sequence.
.IP \n+[step].
Label of the query sequence.
@@ -1499,7 +1501,7 @@ differently. Output order may vary when using multiple threads. A
similar output can be obtain with \-\-userout \fIfilename\fR and
\-\-userfields
query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits. A
-complete list and description is available in the section "Userfields"
+complete list and description is available in the section 'Userfields'
of this manual.
.RS
.RS
@@ -1508,7 +1510,7 @@ of this manual.
\fIquery\fR: query label.
.IP \n+[step].
\fItarget\fR: target (database sequence) label. The field is set to
-"*" if there is no alignment.
+'*' if there is no alignment.
.IP \n+[step].
\fIid\fR: percentage of identity (real value ranging from 0.0 to
100.0). The percentage identity is defined as 100 * (matching columns)
@@ -1593,7 +1595,7 @@ the left (L) or right (R) extremity of the sequence, or inside the
sequence (I). Sequence symbols (Q and T) can be combined with location
symbols (L, I, and R), and numerical values to declare penalties for
all possible contexts: aQL/bQI/cQR/dTL/eTI/fTR, where abcdef are zero
-or positive integers, and "/" is used as a separator.
+or positive integers, and '/' is used as a separator.
.br
To simplify declarations, the location symbols (L, I, and R) can be
combined, the symbol (E) can be used to treat both extremities (L and
@@ -1604,7 +1606,7 @@ terminal gaps (left or right), in both query and target sequences
(i.e. 20I/2E). If only a numerical value is given, without any
sequence or location symbol, then the penalty applies to all gap
openings. To forbid gap-opening, an infinite penalty value can be
-declared with the symbol "*". To use \fBvsearch\fR as a semi-global
+declared with the symbol '*'. To use \fBvsearch\fR as a semi-global
aligner, a null-penalty can be applied to the left (L) or right (R)
gaps.
.br
@@ -1775,14 +1777,15 @@ considered further. Default value is 12 for the default word length
8. For word lengths 3-15, the default minimum word matches are 18, 17,
16, 15, 14, 12, 11, 10, 9, 8, 7, 5 and 3, respectively. If the query
sequence has fewer unique words than the number specified, all words
-in the query must match.
+in the query must match. If the argument is 0, no word matches are
+required.
.TP
.BI \-\-mismatch\~ "integer"
Score assigned to a mismatch (i.e. different nucleotides) in the
pairwise alignment. The default value is -4.
.TP
.BI \-\-mothur_shared_out \0filename
-Write search results to an OTU table in the mothur "shared"
+Write search results to an OTU table in the mothur 'shared'
tab-separated plain text file format. The query file contains the
samples, while the database file contains the OTUs. Sample and OTU
identifiers are extracted from the header of these sequences. See the
@@ -1802,7 +1805,7 @@ Clustering section for further details.
.B \-\-output_no_hits
Write both matching and non-matching queries to \-\-alnout,
\-\-blast6out, \-\-samout or \-\-userout output files. Non-matching
-queries are labelled "No hits" in \-\-alnout files.
+queries are labelled 'No hits' in \-\-alnout files.
.TP
.B \-\-pattern \fIstring\fR
This option is ignored. It is provided for compatibility with usearch.
@@ -1855,7 +1858,7 @@ strictly identical.
.TP
.B \-\-sizeout
Add abundance annotations to the output of the option \-\-dbmatched
-(using the pattern ";size=\fIinteger\fR;"), to report the number of
+(using the pattern ';size=\fIinteger\fR;'), to report the number of
queries that matched each target.
.TP
.BI \-\-strand\~ "plus|both"
@@ -1884,11 +1887,11 @@ Output searching results in \fIfilename\fR using a tab-separated
uclust-like format with 10 columns. When using the \-\-search_exact
command, the table layout is the same than with the
\-\-allpairs_global. When using the \-\-usearch_global command, the
-table present 2 different type of entries: hit (H) or no hit (N). Each
-query sequence is compared to all other sequences, and the best hit
-(\-\-maxaccept 1) or several hits (\-\-maxaccept >1) are reported
-(H). Output order may vary when using multiple threads. Column content
-varies with the type of entry (H or N):
+table presents two different types of entries: hit (H) or no hit
+(N). Each query sequence is compared to all other sequences, and the
+best hit (\-\-maxaccept 1) or several hits (\-\-maxaccept > 1) are
+reported (H). Output order may vary when using multiple
+threads. Column content varies with the type of entry (H or N):
.RS
.RS
.nr step 1 1
@@ -1896,26 +1899,26 @@ varies with the type of entry (H or N):
Record type: H, or N.
.IP \n+[step].
Ordinal number of the target sequence (based on input order, starting
-from zero). Set to "*" for N.
+from zero). Set to '*' for N.
.IP \n+[step].
-Sequence length. Set to "*" for N.
+Sequence length. Set to '*' for N.
.IP \n+[step].
-Percentage of similarity with the target sequence. Set to "*" for N.
+Percentage of similarity with the target sequence. Set to '*' for N.
.IP \n+[step].
-Match orientation + or -. . Set to "." for N.
+Match orientation + or -. Set to '.' for N.
.IP \n+[step].
-Not used, always set to zero for H, or "*" for N.
+Not used, always set to zero for H, or '*' for N.
.IP \n+[step].
-Not used, always set to zero for H, or "*" for N.
+Not used, always set to zero for H, or '*' for N.
.IP \n+[step].
Compact representation of the pairwise alignment using the CIGAR
format (Compact Idiosyncratic Gapped Alignment Report): M (match), D
-(deletion) and I (insertion). The equal sign "=" indicates that the
-query is identical to the centroid sequence. Set to "*" for N.
+(deletion) and I (insertion). The equal sign '=' indicates that the
+query is identical to the centroid sequence. Set to '*' for N.
.IP \n+[step].
Label of the query sequence.
.IP \n+[step].
-Label of the target centroid sequence. Set to "*" for N.
+Label of the target centroid sequence. Set to '*' for N.
.RE
.RE
.TP
@@ -1930,8 +1933,8 @@ alignment.
.TP
.BI \-\-userfields \0string
When using \-\-userout, select and order the fields written to the
-output file. Fields are separated by "+" (e.g. query+target+id). See
-the "Userfields" section for a complete list of fields.
+output file. Fields are separated by '+' (e.g. query+target+id). See
+the 'Userfields' section for a complete list of fields.
.TP
.BI \-\-userout \0filename
Write user-defined tab-separated output to \fIfilename\fR. Select the
@@ -2010,7 +2013,7 @@ conserve the abundance annotations.
.B \-\-sizeout
When using \-\-relabel, \-\-relabel_md5 or \-\-relabel_sha1, preserve
and report abundance annotations to the output fasta file (using the
-pattern ";size=\fIinteger\fR;").
+pattern ';size=\fIinteger\fR;').
.TP
.BI \-\-shuffle \0filename
Pseudo-randomly shuffle the order of sequences contained in
@@ -2067,7 +2070,7 @@ for details.
.TP
.B \-\-sizeout
When using \-\-relabel, report abundance annotations to the output
-fasta file (using the pattern ";size=\fIinteger\fR;").
+fasta file (using the pattern ';size=\fIinteger\fR;').
.TP
.BI \-\-sortbylength \0filename
Sort by decreasing length the sequences contained in
@@ -2076,7 +2079,7 @@ Sort by decreasing length the sequences contained in
.TP
.BI \-\-sortbysize \0filename
Sort by decreasing abundance the sequences contained in \fIfilename\fR
-(the pattern "[>;]size=\fIinteger\fR[;]" has to be present). See the
+(the pattern '[>;]size=\fIinteger\fR[;]' has to be present). See the
options \-\-minsize and \-\-maxsize to eliminate rare and dominant
sequences.
.TP
@@ -2351,7 +2354,7 @@ openings and gap extensions. The field is set to 0 if there is no
alignment.
.TP
.B target
-Target label. The field is set to "*" if there is no alignment.
+Target label. The field is set to '*' if there is no alignment.
.TP
.B tcov
Fraction of the target sequence that is aligned with the query
@@ -2399,8 +2402,8 @@ field is set to 0 if there is no alignment.
.TP
.B tstrand
Target strand orientation (+ or - for nucleotide sequences). Always
-set to "+", so reverse strand matches have tstrand "+" and qstrand
-"-". Empty field if there is no alignment.
+set to '+', so reverse strand matches have tstrand '+' and qstrand
+'-'. Empty field if there is no alignment.
.RE
.PP
.\" ============================================================================
@@ -2451,7 +2454,7 @@ be more consistent.
.\" ============================================================================
.SH NOVELTIES
\fBvsearch\fR introduces new commands and new options not present in
-usearch 7. They are described in the "Options" section of this
+usearch 7. They are described in the 'Options' section of this
manual. Here is a short list:
.RS
.IP - 2
@@ -2557,7 +2560,7 @@ to the output file:
.RE
.PP
Sort by decreasing abundance the sequences contained in
-\fIqueries.fas\fR (using the "size=\fIinteger\fR" information),
+\fIqueries.fas\fR (using the 'size=\fIinteger\fR' information),
relabel the sequences while preserving the abundance information (with
\-\-sizeout), keep only sequences with an abundance equal to or
greater than 2:
@@ -2740,7 +2743,7 @@ with usearch,
new userfields qilo, qihi, tilo, tihi give alignment coordinates
ignoring terminal gaps,
.IP -
-in \-\-uc output files, a perfect alignment is indicated with a "="
+in \-\-uc output files, a perfect alignment is indicated with a '='
sign,
.IP -
the option \-\-cluster_fast now sorts sequences by decreasing length,
@@ -2813,7 +2816,7 @@ Changed to autotools build system.
Several new commands and options. Bug fixes.
.TP
.BR v1.3.2\~ "released September 15th, 2015"
-Fixed memory leaks. Added "-h" shortcut for help. Removed extra "v" in
+Fixed memory leaks. Added '-h' shortcut for help. Removed extra 'v' in
version number.
.TP
.BR v1.3.3\~ "released September 15th, 2015"
@@ -2956,7 +2959,7 @@ output (no matter if \-\-notrunclabels is in effect or not).
Fixed a bug causing a segmentation fault when running
\-\-usearch_global with an empty query sequence. Also fixed a bug
causing imperfect alignments to be reported with an alignment string
-of "=" in uc output files. Fixed typos in man file. Fixed fasta/fastq
+of '=' in uc output files. Fixed typos in man file. Fixed fasta/fastq
processing code regarding presence or absence of compression library
header files.
.TP
@@ -3028,6 +3031,18 @@ and \-\-otutabout to the clustering and searching commands.
Allowed zero-length sequences in FASTA and FASTQ files. Added
\-\-fastq_trunclen_keep option. Fixed bug with output of OTU tables to
pipes.
+.TP
+.BR v2.3.1\~ "released November 16th, 2016"
+Fixed bug where \-\-minwordmatches 0 was interpreted as the default
+minimum word matches for the given word length instead of zero. When
+used in combination with \-\-maxaccepts 0 and \-\-maxrejects 0 it will
+allow complete bypass of kmer-based heuristics.
+.TP
+.BR v2.3.2\~ "released November 18th, 2016"
+Fixed bug where vsearch reported the ordinal number of the target
+sequence instead of the cluster number in column 2 on H-lines in the
+uc output file after clustering. For search and alignment commands
+both usearch and vsearch report the target sequence number here.
.RE
.LP
.\" ============================================================================
diff --git a/src/results.cc b/src/results.cc
index 82b1c79..e35184a 100644
--- a/src/results.cc
+++ b/src/results.cc
@@ -193,7 +193,7 @@ void results_show_uc_one(FILE * fp,
fprintf(fp,
"H\t%d\t%ld\t%.1f\t%c\t0\t0\t%s\t%s\t%s\n",
- hp->target,
+ clusterno,
qseqlen,
hp->id,
hp->strand ? '-' : '+',
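
The H record written by the fprintf call above has ten tab-separated columns, and after this change column 2 carries the cluster number when clustering. A minimal sketch of how a reader might use that column, assuming a uc file on standard input; the function name uc_cluster_sizes is hypothetical and not part of vsearch:

```shell
# Hypothetical helper: count members per cluster from a --uc file on stdin.
# Columns used: 1 = record type (S, H or C), 2 = cluster number (zero-based).
# S records (centroids) and H records (hits) each count as one member.
uc_cluster_sizes() {
    awk 'BEGIN { FS = "\t" }
         $1 == "S" || $1 == "H" { n[$2]++ }
         END { for (c in n) print c "\t" n[c] }' | sort -n
}
```

It could be fed from a clustering run, for instance by piping the output of --cluster_fast with --uc - into uc_cluster_sizes. The counts should agree with the cluster sizes reported in column 3 of the C records only if column 2 of the H lines is a cluster number rather than a target ordinal.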
diff --git a/src/vsearch.cc b/src/vsearch.cc
index 2b0962b..291b9af 100644
--- a/src/vsearch.cc
+++ b/src/vsearch.cc
@@ -646,7 +646,7 @@ void args_init(int argc, char **argv)
opt_minsl = 0.0;
opt_mintsize = 0;
opt_minuniquesize = 0;
- opt_minwordmatches = 0;
+ opt_minwordmatches = -1;
opt_mismatch = -4;
opt_mothur_shared_out = 0;
opt_msaout = 0;
@@ -1047,6 +1047,8 @@ void args_init(int argc, char **argv)
case 32:
opt_minseqlength = args_getlong(optarg);
+ if (opt_minseqlength < 0)
+ fatal("The argument to --minseqlength must not be negative");
break;
case 33:
@@ -1504,6 +1506,8 @@ void args_init(int argc, char **argv)
case 142:
opt_minwordmatches = args_getlong(optarg);
+ if (opt_minwordmatches < 0)
+ fatal("The argument to --minwordmatches must not be negative");
break;
case 143:
@@ -1750,9 +1754,6 @@ void args_init(int argc, char **argv)
opt_maxrejects = 32;
}
- if (opt_minseqlength < -1)
- fatal("The argument to --minseqlength must not be negative");
-
if (opt_maxaccepts < 0)
fatal("The argument to --maxaccepts must not be negative");
@@ -1801,9 +1802,6 @@ void args_init(int argc, char **argv)
if (opt_fastq_tail < 1)
fatal("The argument to --fastq_tail must be positive");
- if (opt_minwordmatches < 0)
- fatal("The argument to --minwordmatches must not be negative");
-
if ((opt_min_unmasked_pct < 0.0) && (opt_min_unmasked_pct > 100.0))
fatal("The argument to --min_unmasked_pct must be between 0.0 and 100.0");
@@ -1849,7 +1847,9 @@ void args_init(int argc, char **argv)
#endif
- if (opt_minwordmatches == 0)
+ /* set default parameters, if not specified */
+
+ if (opt_minwordmatches < 0)
opt_minwordmatches = minwordmatches_defaults[opt_wordlength];
if (opt_threads == 0)
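
The hunks above switch the "not specified" marker for --minwordmatches from 0 to -1, so that an explicit 0 is kept instead of being replaced by the default. A minimal sketch of that sentinel logic in shell, assuming word lengths 3 to 15 and the default table listed in the man page; the function name resolve_minwordmatches is hypothetical:

```shell
# Hypothetical helper mirroring the sentinel logic: a missing value (-1)
# selects the default for the given word length, an explicit 0 is kept.
# Usage: resolve_minwordmatches WORDLENGTH [VALUE]
resolve_minwordmatches() {
    wordlength=$1
    value=${2:--1}
    if [ "$value" -lt 0 ]; then
        # Defaults for word lengths 3..15, as listed in the man page.
        set -- 18 17 16 15 14 12 11 10 9 8 7 5 3
        shift $((wordlength - 3))
        value=$1
    fi
    printf '%s\n' "$value"
}
```

With the default word length of 8 and no explicit value, the sketch prints 12. With an explicit value of 0 it prints 0, which is the case the v2.3.1 fix addresses.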
diff --git a/test/unclassified.sh b/test/unclassified.sh
index f9196b9..3ed03d7 100644
--- a/test/unclassified.sh
+++ b/test/unclassified.sh
@@ -31,8 +31,8 @@ DESCRIPTION="check if vsearch is in the PATH"
# #
#*****************************************************************************#
-## usearch 6, 7 and 8 output a "=" when the sequences are identical
-DESCRIPTION="CIGAR alignment is \"=\" when the sequences are identical"
+## usearch 6, 7 and 8 output a "=" when the sequences are strictly identical
+DESCRIPTION="CIGAR string is \'=\' when the sequences are identical"
UC_OUT=$("${VSEARCH}" \
--cluster_fast <(printf ">seq1\nACGT\n>seq2\nACGT\n") \
--id 0.97 \
@@ -48,8 +48,26 @@ UC_OUT=$("${VSEARCH}" \
unset UC_OUT
+## usearch 6 and 7 output a "=" when the sequences are identical
+## (terminal gaps ignored), usearch 8 seems to behave differently
+DESCRIPTION="CIGAR string is \'=\' when the sequences are identical (terminal gaps ignored)"
+UC_OUT=$("${VSEARCH}" \
+ --cluster_fast <(printf ">seq1\nACGT\n>seq2\nACG\n") \
+ --id 0.97 \
+ --quiet \
+ --minseqlength 1 \
+ --uc - | grep "^H" | cut -f 8)
+
+[[ "${UC_OUT}" == "=" ]] && \
+ success "${DESCRIPTION}" || \
+ failure "${DESCRIPTION}"
+
+## clean
+unset UC_OUT
+
+
## is the 3rd column of H the query length or the alignment length?
-DESCRIPTION="3rd column of H is the query length"
+DESCRIPTION="when clustering, 3rd column of H in --uc is the query length"
UC_OUT=$("${VSEARCH}" \
--cluster_fast <(printf ">seq1\nACGT\n>seq2\nACAGT\n") \
--id 0.5 \
@@ -65,6 +83,30 @@ awk 'BEGIN {FS = "\t"} {$3 == 4 && $9 == "seq1"}' <<< "${UC_OUT}" && \
unset UC_OUT
+## when clustering, in --uc output, the highest number in the 2nd
+## column of H entries is smaller or equal to the number of input
+## sequences (number of S and H lines, minus one)
+DESCRIPTION="when clustering (--uc output), the 2nd column of H is the centroid's ordinal number"
+INPUT=">seq1\nAAAA\n>seq2\nAAAT\n>seq3\nGGGG\n>seq4\nGGGC\n"
+
+UC_OUT=$("${VSEARCH}" \
+ --cluster_fast <(printf ${INPUT}) \
+ --id 0.75 \
+ --quiet \
+ --minseqlength 1 \
+ --uc -)
+
+awk 'BEGIN {FS = "\t" ; H = 0 ; seq = -1}
+ {if (/^S/ || /^H/) {seq += 1}
+ if (/^H/) {if (H < $2) {H = $2}}}
+ END {if (H > seq - 1) {exit 1}}' <<< "${UC_OUT}" && \
+ success "${DESCRIPTION}" || \
+ failure "${DESCRIPTION}"
+
+## clean
+unset UC_OUT
+
+
#*****************************************************************************#
# #
# UC format when dereplicating #
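
The new clustering test above bounds column 2 of H lines by the number of input sequences. A tighter bound uses the number of S records, since cluster numbers are zero-based and each cluster has one S line. A minimal sketch of such a check, assuming uc output on standard input; the function name uc_h_clusters_valid is hypothetical and not part of the test suite:

```shell
# Hypothetical stricter check: succeeds when every H record refers to an
# existing cluster, i.e. when the largest value in column 2 of H lines is
# smaller than the number of S lines (cluster numbers are zero-based).
uc_h_clusters_valid() {
    awk 'BEGIN { FS = "\t"; max = -1; centroids = 0 }
         $1 == "S" { centroids++ }
         $1 == "H" && $2 + 0 > max { max = $2 + 0 }
         END { if (max >= centroids) exit 1 }'
}
```

With two clusters, H lines numbered 0 and 1 pass, while an H line carrying 2 (for example, a target ordinal number reported in place of the cluster number) is rejected.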
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/vsearch.git