[med-svn] [Git][med-team/vsearch][master] 4 commits: Fix watch URL
Nilesh Patra
gitlab at salsa.debian.org
Fri Apr 23 19:43:22 BST 2021
Nilesh Patra pushed to branch master at Debian Med / vsearch
Commits:
7f4a1ea2 by Nilesh Patra at 2021-04-24T00:06:18+05:30
Fix watch URL
- - - - -
73d823fa by Nilesh Patra at 2021-04-24T00:06:31+05:30
New upstream version 2.17.0
- - - - -
fac6c5f6 by Nilesh Patra at 2021-04-24T00:06:33+05:30
Update upstream source from tag 'upstream/2.17.0'
Update to upstream version '2.17.0'
with Debian dir 2e3d3a27ef0393fa383d4f4bb2d6d517078247c9
- - - - -
ec00cc2b by Nilesh Patra at 2021-04-24T00:08:00+05:30
Interim changelog entry
- - - - -
27 changed files:
- .travis.yml
- README.md
- configure.ac
- debian/changelog
- debian/watch
- man/Makefile.am
- man/vsearch.1
- src/Makefile.am
- src/allpairs.cc
- src/chimera.cc
- src/eestats.cc
- src/fasta.cc
- src/fastqjoin.cc
- src/fastqops.cc
- src/fastx.cc
- src/fastx.h
- src/filter.cc
- src/getseq.cc
- src/mergepairs.cc
- + src/orient.cc
- + src/orient.h
- src/otutable.cc
- src/search.cc
- src/searchexact.cc
- src/sintax.cc
- src/vsearch.cc
- src/vsearch.h
Changes:
=====================================
.travis.yml
=====================================
@@ -3,14 +3,13 @@ language:
os:
- linux
-- osx
compiler:
- g++
- clang
install:
-- if [ $TRAVIS_OS_NAME = linux ]; then sudo apt-get install -y ghostscript groff valgrind ; else brew install ghostscript; brew install valgrind ; fi
+- sudo apt-get install -y ghostscript groff valgrind
script:
- ./autogen.sh
=====================================
README.md
=====================================
@@ -4,7 +4,7 @@
## Introduction
-The aim of this project is to create an alternative to the [USEARCH](http://www.drive5.com/usearch/) tool developed by Robert C. Edgar (2010). The new tool should:
+The aim of this project is to create an alternative to the [USEARCH](https://www.drive5.com/usearch/) tool developed by Robert C. Edgar (2010). The new tool should:
* have open source code with an appropriate open source license
* be free of charge, gratis
@@ -16,7 +16,7 @@ We have implemented a tool called VSEARCH which supports *de novo* and reference
VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.
-[VSEARCH binaries](https://github.com/torognes/vsearch/releases/latest) are provided for GNU/Linux on three 64-bit processor architectures: x86-64, POWER8 (ppc64le) and ARMv8 (aarch64). Binaries are also provided for MacOS (version 10.9 Mavericks or later) on Intel (x86-64) and Apple Silicon (ARMv8), as well as Windows (64-bit, version 7 or higher, on x86_64). VSEARCH contains dedicated SIMD code for the three processors architectures (SSE2/SSSE3, AltiVec/VMX/VSX, Neon).
+[VSEARCH binaries](https://github.com/torognes/vsearch/releases/latest) are provided for GNU/Linux on three 64-bit processor architectures: x86-64, POWER8 (ppc64le) and ARMv8 (aarch64). Binaries are also provided for MacOS (version 10.9 Mavericks or later) on Intel (x86-64) and Apple Silicon (ARMv8), as well as Windows (64-bit, version 7 or higher, on x86_64). VSEARCH contains dedicated SIMD code for the three processor architectures (SSE2/SSSE3, AltiVec/VMX/VSX, Neon).
| CPU \ OS | GNU/Linux | MacOS | Windows |
| ------------- | :-----------: | :----: | :-------: |
@@ -26,7 +26,10 @@ VSEARCH stands for vectorized search, as the tool takes advantage of parallelism
Various packages, plugins and wrappers are also available from other sources - see [below](https://github.com/torognes/vsearch#packages-plugins-and-wrappers).
-The source code compiles correctly with gcc (versions 4.8 to 9.1) and llvm-clang (3.8 to 9.0). The source code should also compile on [FreeBSD](https://www.freebsd.org/) and [NetBSD](https://www.netbsd.org/) systems.
+The source code compiles correctly with `gcc` (versions 4.8.5 to 10.2)
+and `llvm-clang` (3.8 to 13.0). The source code should also compile on
+[FreeBSD](https://www.freebsd.org/) and
+[NetBSD](https://www.netbsd.org/) systems.
VSEARCH can directly read input query and database files that are compressed using gzip and bzip2 (.gz and .bz2) if the zlib and bzip2 libraries are available.
@@ -34,7 +37,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
## Getting Help
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
## Example
@@ -47,18 +50,18 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
```
-wget https://github.com/torognes/vsearch/archive/v2.15.2.tar.gz
-tar xzf v2.15.2.tar.gz
-cd vsearch-2.15.2
+wget https://github.com/torognes/vsearch/archive/v2.17.0.tar.gz
+tar xzf v2.17.0.tar.gz
+cd vsearch-2.17.0
./autogen.sh
./configure
make
make install # as root or sudo make install
```
-You may customize the installation directory using the `--prefix=DIR` option to `configure`. If the compression libraries [zlib](http://www.zlib.net) and/or [bzip2](http://www.bzip.org) are installed on the system, they will be detected automatically and support for compressed files will be included in vsearch. Support for compressed files may be disabled using the `--disable-zlib` and `--disable-bzip2` options to `configure`. A PDF version of the manual will be created from the `vsearch.1` manual file if `ps2pdf` is available, unless disabled using the `--disable-pdfman` option to `configure`. Other options may also be applied to `configure`, please run `configure -h` to see them all. GNU autoconf (version 2.63 or later), automake and the GCC C++ compiler is required to build vsearch.
+You may customize the installation directory using the `--prefix=DIR` option to `configure`. If the compression libraries [zlib](https://www.zlib.net) and/or [bzip2](https://www.sourceware.org/bzip2/) are installed on the system, they will be detected automatically and support for compressed files will be included in vsearch. Support for compressed files may be disabled using the `--disable-zlib` and `--disable-bzip2` options to `configure`. A PDF version of the manual will be created from the `vsearch.1` manual file if `ps2pdf` is available, unless disabled using the `--disable-pdfman` option to `configure`. Other options may also be applied to `configure`, please run `configure -h` to see them all. GNU autoconf (version 2.63 or later), automake and the GCC C++ compiler is required to build vsearch.
-The Windows binary was compiled using the [Mingw-w64](https://mingw-w64.org/) C++ cross-compiler.
+The Windows binary was compiled using the [Mingw-w64](http://mingw-w64.org/) C++ cross-compiler.
**Cloning the repo** Instead of downloading the source distribution as a compressed archive, you could clone the repo and build it as shown below. The options to `configure` as described above are still valid.
@@ -78,43 +81,43 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch-2.15.2-linux-x86_64.tar.gz
-tar xzf vsearch-2.15.2-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch-2.17.0-linux-x86_64.tar.gz
+tar xzf vsearch-2.17.0-linux-x86_64.tar.gz
```
Or these commands if you are using a Linux ppc64le system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch-2.15.2-linux-ppc64le.tar.gz
-tar xzf vsearch-2.15.2-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch-2.17.0-linux-ppc64le.tar.gz
+tar xzf vsearch-2.17.0-linux-ppc64le.tar.gz
```
Or these commands if you are using a Linux aarch64 system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch-2.15.2-linux-aarch64.tar.gz
-tar xzf vsearch-2.15.2-linux-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch-2.17.0-linux-aarch64.tar.gz
+tar xzf vsearch-2.17.0-linux-aarch64.tar.gz
```
Or these commands if you are using a Mac:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch-2.15.2-macos-x86_64.tar.gz
-tar xzf vsearch-2.15.2-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch-2.17.0-macos-x86_64.tar.gz
+tar xzf vsearch-2.17.0-macos-x86_64.tar.gz
```
Or if you are using Windows, download and extract (unzip) the contents of this file:
```
-https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch-2.15.2-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch-2.17.0-win-x86_64.zip
```
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.15.2-linux-x86_64` or `vsearch-2.15.2-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.17.0-linux-x86_64` or `vsearch-2.17.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
-Windows: You will now have the binary distribution in a folder called `vsearch-2.15.2-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.17.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.15.2/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.17.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
## Packages, plugins, and wrappers
@@ -127,7 +130,7 @@ Windows: You will now have the binary distribution in a folder called `vsearch-2
**Galaxy wrapper** Thanks to the work of the [Intergalactic Utilities Commission](https://wiki.galaxyproject.org/IUC) members, vsearch is now part of the [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/view/iuc/vsearch/).
-**Homebrew package** Thanks to [Torsten Seeman](https://github.com/tseemann), a [vsearch package](https://github.com/Homebrew/homebrew-science/pull/2409) for [Homebrew](http://brew.sh/) has been made.
+**Homebrew package** Thanks to [Torsten Seeman](https://github.com/tseemann), a [vsearch package](https://formulae.brew.sh/formula/vsearch) for [Homebrew](http://brew.sh/) has been made.
**Pkgsrc package** Thanks to [Jason Bacon](https://github.com/outpaddling), a vsearch [pkgsrc](https://www.pkgsrc.org) package is available for NetBSD and other UNIX-like systems. Install the binary package with `pkgin install vsearch`, or build from source with additional optimizations.
@@ -171,7 +174,7 @@ The VSEARCH code is dual-licensed either under the GNU General Public License ve
VSEARCH includes code from several other projects. We thank the authors for making their source code available.
-VSEARCH includes code from Google's [CityHash project](http://code.google.com/p/cityhash/) by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under a MIT license.
+VSEARCH includes code from Google's [CityHash project](https://github.com/google/cityhash) by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under a MIT license.
VSEARCH includes code derived from Tatusov and Lipman's DUST program that is in the public domain.
@@ -181,9 +184,9 @@ VSEARCH includes public domain code written by Steve Reid and others for the SHA
The VSEARCH distribution includes code from GNU Autoconf which normally is available under the GNU General Public License, but may be distributed with the special autoconf configure script exception.
-VSEARCH may include code from the [zlib](http://www.zlib.net) library copyright Jean-loup Gailly and Mark Adler, distributed under the [zlib license](http://www.zlib.net/zlib_license.html).
+VSEARCH may include code from the [zlib](https://www.zlib.net) library copyright Jean-loup Gailly and Mark Adler, distributed under the [zlib license](https://www.zlib.net/zlib_license.html).
-VSEARCH may include code from the [bzip2](http://www.bzip.org) library copyright Julian R. Seward, distributed under a BSD-style license.
+VSEARCH may include code from the [bzip2](https://www.sourceware.org/bzip2/) library copyright Julian R. Seward, distributed under a BSD-style license.
## Code
@@ -223,6 +226,7 @@ File | Description
**mergepairs.cc** | Paired-end read merging
**minheap.cc** | A minheap implementation for the list of top kmer matches
**msa.cc** | Simple multiple sequence alignment and consensus sequence computation for clusters
+**orient.cc** | Orient direction of sequences based on reference database
**otutable.cc** | Generate OTU tables in various formats
**rerep.cc** | Rereplication
**results.cc** | Output results in various formats (alnout, userout, blast6, uc)
@@ -243,7 +247,7 @@ File | Description
**vsearch.cc** | Main program file, general initialization, reads arguments and parses options, writes info.
**xstring.h** | Code for a simple string class
-VSEARCH may be compiled with zlib or bzip2 integration that allows it to read compressed FASTA files. The [zlib](http://www.zlib.net/) and the [bzip2](http://www.bzip.org/) libraries are needed for this.
+VSEARCH may be compiled with zlib or bzip2 integration that allows it to read compressed FASTA files. The [zlib](http://www.zlib.net/) and the [bzip2](https://www.sourceware.org/bzip2/) libraries are needed for this.
## Bugs
@@ -294,9 +298,9 @@ Please note that citing any of the underlying algorithms, e.g. UCHIME, may also
Test datasets (found in the separate vsearch-data repository) were
obtained from
-the [BioMarks project](http://biomarks.eu/) (Logares et al. 2014),
-the [TARA OCEANS project](http://oceans.taraexpeditions.org/) (Karsenti et al. 2011)
-and the [Protist Ribosomal Database](http://ssu-rrna.org/) (Guillou et al. 2012).
+the BioMarks project (Logares et al. 2014),
+the [TARA OCEANS project](https://oceans.taraexpeditions.org/en/) (Karsenti et al. 2011)
+and the [Protist Ribosomal Reference Database (PR<sup>2</sup>)](https://github.com/pr2database/pr2database) (Guillou et al. 2013).
## References
@@ -304,28 +308,28 @@ and the [Protist Ribosomal Database](http://ssu-rrna.org/) (Guillou et al. 2012)
* Edgar RC (2010)
**Search and clustering orders of magnitude faster than BLAST.**
*Bioinformatics*, 26 (19): 2460-2461.
-doi:[10.1093/bioinformatics/btq461](http://dx.doi.org/10.1093/bioinformatics/btq461)
+doi:[10.1093/bioinformatics/btq461](https://doi.org/10.1093/bioinformatics/btq461)
* Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R (2011)
**UCHIME improves sensitivity and speed of chimera detection.**
*Bioinformatics*, 27 (16): 2194-2200.
-doi:[10.1093/bioinformatics/btr381](http://dx.doi.org/10.1093/bioinformatics/btr381)
+doi:[10.1093/bioinformatics/btr381](https://doi.org/10.1093/bioinformatics/btr381)
* Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud G, de Vargas C, Decelle J, del Campo J, Dolan J, Dunthorn M, Edvardsen B, Holzmann M, Kooistra W, Lara E, Lebescot N, Logares R, Mahé F, Massana R, Montresor M, Morard R, Not F, Pawlowski J, Probert I, Sauvadet A-L, Siano R, Stoeck T, Vaulot D, Zimmermann P & Christen R (2013)
**The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy.**
*Nucleic Acids Research*, 41 (D1), D597-D604.
-doi:[10.1093/nar/gks1160](http://dx.doi.org/10.1093/nar/gks1160)
+doi:[10.1093/nar/gks1160](https://doi.org/10.1093/nar/gks1160)
* Karsenti E, González Acinas S, Bork P, Bowler C, de Vargas C, Raes J, Sullivan M B, Arendt D, Benzoni F, Claverie J-M, Follows M, Jaillon O, Gorsky G, Hingamp P, Iudicone D, Kandels-Lewis S, Krzic U, Not F, Ogata H, Pesant S, Reynaud E G, Sardet C, Sieracki M E, Speich S, Velayoudon D, Weissenbach J, Wincker P & the Tara Oceans Consortium (2011)
**A holistic approach to marine eco-systems biology.**
*PLoS Biology*, 9(10), e1001177.
-doi:[10.1371/journal.pbio.1001177](http://dx.doi.org/10.1371/journal.pbio.1001177)
+doi:[10.1371/journal.pbio.1001177](https://doi.org/10.1371/journal.pbio.1001177)
* Logares R, Audic S, Bass D, Bittner L, Boutte C, Christen R, Claverie J-M, Decelle J, Dolan J R, Dunthorn M, Edvardsen B, Gobet A, Kooistra W H C F, Mahé F, Not F, Ogata H, Pawlowski J, Pernice M C, Romac S, Shalchian-Tabrizi K, Simon N, Stoeck T, Santini S, Siano R, Wincker P, Zingone A, Richards T, de Vargas C & Massana R (2014) **The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes.**
*Current Biology*, 24(8), 813-821.
-doi:[10.1016/j.cub.2014.02.050](http://dx.doi.org/10.1016/j.cub.2014.02.050)
+doi:[10.1016/j.cub.2014.02.050](https://doi.org/10.1016/j.cub.2014.02.050)
* Rognes T (2011)
**Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation.**
*BMC Bioinformatics*, 12: 221.
-doi:[10.1186/1471-2105-12-221](http://dx.doi.org/10.1186/1471-2105-12-221)
+doi:[10.1186/1471-2105-12-221](https://doi.org/10.1186/1471-2105-12-221)
=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.15.2], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
+AC_INIT([vsearch], [2.17.0], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
AC_CANONICAL_TARGET
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
=====================================
debian/changelog
=====================================
@@ -1,3 +1,14 @@
+vsearch (2.17.0-1) UNRELEASED; urgency=medium
+
+ [ Steffen Möller ]
+ * Update metadata - added guix, fixed indent
+
+ [ Nilesh Patra ]
+ * Fix watch URL
+ * New upstream version 2.17.0
+
+ -- Nilesh Patra <nilesh at debian.org> Sat, 24 Apr 2021 00:07:29 +0530
+
vsearch (2.15.2-3) unstable; urgency=medium
* Team upload.
=====================================
debian/watch
=====================================
@@ -1,4 +1,4 @@
version=4
opts="downloadurlmangle=s/\/tree\/(.*)/\/archive\/$1.tar.gz/" \
- https://github.com/torognes/vsearch/releases .*/archive/v([0-9.rc-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz)
+ https://github.com/torognes/vsearch/releases .*/archive/.*/v([0-9.rc-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz)
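The effect of the watch pattern change above can be checked with a small sketch. It assumes that release links now carry an extra path component between `/archive/` and the version tag. The example link `/torognes/vsearch/archive/refs/tags/v2.17.0.tar.gz` is illustrative and not taken from the commit. uscan evaluates Perl regular expressions, so `std::regex` (ECMAScript grammar) is only a stand-in here, and `extract_version` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Pattern from the old debian/watch line: the version tag must follow
// "/archive/" directly.
static const std::regex old_watch_pattern(
    R"(.*/archive/v([0-9.rc-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz))");

// Pattern from the new debian/watch line: any path component may sit
// between "/archive/" and the version tag.
static const std::regex new_watch_pattern(
    R"(.*/archive/.*/v([0-9.rc-]+)\.(?:tar(?:\.gz|\.bz2)?|tgz))");

// Hypothetical helper: returns the captured upstream version, or an
// empty string if the whole link does not match the pattern.
std::string extract_version(const std::regex & pattern,
                            const std::string & link)
{
  std::smatch m;
  if (std::regex_match(link, m, pattern))
    return m[1].str();
  return "";
}
```

Under these assumptions, only the new pattern extracts `2.17.0` from the illustrative link, while the old pattern still matches links of the form `/archive/v2.15.2.tar.gz`.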
=====================================
man/Makefile.am
=====================================
@@ -11,11 +11,7 @@ doc_DATA += vsearch_manual.html
vsearch_manual.html : vsearch.1
sed -e 's/\\-/-/g' $< | \
- if [ $$(uname) == "Darwin" ] ; then \
- iconv -f UTF-8 -t ISO-8859-1 ; \
- else \
- cat ; \
- fi | \
+ iconv -f UTF-8 -t ISO-8859-1 | \
groff -t -m mandoc -m www -Thtml > $@
CLEANFILES += vsearch_manual.html
@@ -29,11 +25,7 @@ doc_DATA += vsearch_manual.pdf
vsearch_manual.pdf : vsearch.1
sed -e 's/\\-/-/g' $< | \
- if [ $$(uname) == "Darwin" ] ; then \
- iconv -f UTF-8 -t ISO-8859-1 ; \
- else \
- cat ; \
- fi | \
+ iconv -f UTF-8 -t ISO-8859-1 | \
groff -W space -t -m mandoc -T ps -P -pa4 | ps2pdf - $@
CLEANFILES += vsearch_manual.pdf
=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
.\" ============================================================================
-.TH vsearch 1 "January 26, 2021" "version 2.15.2" "USER COMMANDS"
+.TH vsearch 1 "March 29, 2021" "version 2.17.0" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em chimera detection, clustering, dereplication and
@@ -109,6 +109,13 @@ Masking:
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
+Orienting:
+.RS
+\fBvsearch\fR \-\-orient \fIfastxfile\fR \-\-db \fIfastafile\fR
+(\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-tabbedout)
+\fIoutputfile\fR [\fIoptions\fR]
+.PP
+.RE
Pairwise alignment:
.RS
\fBvsearch\fR \-\-allpairs_global \fIfastafile\fR (\-\-alnout |
@@ -1562,12 +1569,12 @@ ambiguous bases (N's), as specified with the \-\-fastq_maxns are also
discarded (no limit by default). Staggered reads are not merged unless
the \-\-fastq_allowmergestagger option is specified. The minimum
length of the overlap region between the reads may be specified with
-the \-\-fastq_minovlen option (default 10). The overlap region may not
-include more mismatches than specified with the \-\-fastq_maxdiffs
-option (10 by default) or a higher percentage of mismatches than
-specified with the \-\-fastq_maxdiffpct option (100.0% by default),
-otherwise the read pair is discarded. Additional rules will avoid
-merging of reads that cannot be aligned reliably and
+the \-\-fastq_minovlen option (at least 5, default 10). The overlap
+region may not include more mismatches than specified with the
+\-\-fastq_maxdiffs option (10 by default) or a higher percentage of
+mismatches than specified with the \-\-fastq_maxdiffpct option (100.0%
+by default), otherwise the read pair is discarded. Additional rules
+will avoid merging of reads that cannot be aligned reliably and
unambiguously. The mimimum and maximum length of the merged sequence
may be specified with the \-\-fastq_minmergelen and
\-\-fastq_maxmergelen options, respectively. The quality value limits
@@ -1591,7 +1598,7 @@ merged sequence. The default is 1.
.TP
.BI \-\-fastq_minovlen\~ "positive integer"
When using \-\-fastq_mergepairs, specify the minimum overlap between
-the merged reads. The default is 10.
+the merged reads. The default is 10. Must be at least 5.
.TAG fastq_nostagger
.TP
.B \-\-fastq_nostagger
@@ -2057,6 +2064,67 @@ mask.
.RE
.PP
.\" ----------------------------------------------------------------------------
+.TAG orienting-options
+Orienting options:
+.RS
+.PP
+The \-\-orient command can be used to orient the sequences in a given
+file in either the forward or the reverse complementary direction
+based on a reference database specified with the \-\-db option. The
+two strands of each input sequence are compared to the reference
+database using nucleotide words. If one of the strands share many more
+words with at least one sequence in the database than the other, that
+strand is chosen. The correctly oriented sequences may be written to a
+FASTA file specified with the \-\-fastaout, and to a FASTQ file
+specified with the \-\-fastqout option (as long as the input was also
+in FASTA format). If the result is uncertain, because the number of
+matching words is too similar, the original sequence is written to the
+file specified with the \-\-notmatched option. The results may also be
+written to a tab-delimited text file specified with the \-\-tabbedout
+option. This file will contain the query label, the direction (+, - or
+?), the number of matching words on the forward strand, and the number
+of matching words on the reverse complementary strand. By default, a
+word length of 12 is used for this command. The word length may be
+adjusted using the \-\-wordlength option. There has to be at least 4
+times as many matches on one strand than the other for a strand to be
+selected. In addition to the common options, the following options may
+also be specified for this command: \-\-dbmask, \-\-qmask,
+\-\-relabel, \-\-relabel_keep, \-\-relabel_md5, \-\-relabel_self,
+\-\-relabel_sha1, \-\-sizein, and \-\-sizeout.
+.PP
+.TAG db
+.TP 9
+.BI \-\-db \0filename
+Read the reference database from the given file. It may be in FASTA,
+FASTQ or UDB format. If an UDB file is used it should have been
+created with a wordlength of 12.
+.TAG fastaout
+.TP
+.BI \-\-fastaout \0filename
+Write the correctly oriented sequences to \fIfilename\fR, in fasta format.
+.TAG fastqout
+.TP
+.BI \-\-fastqout \0filename
+Write the correctly oriented sequences to \fIfilename\fR, in fastq format.
+.TAG notmatched
+.TP
+.BI \-\-notmatched \0filename
+Write the sequences with undetermined direction to \fIfilename\fR, in
+the orginal format.
+.TAG orient
+.TP
+.BI \-\-orient \0filename
+Orient the sequences in the given file.
+.TAG tabbedout
+.TP
+.BI \-\-tabbedout \0 filename
+Write the resuls to a tab-delimited text file with the specified
+filename. This file will contain the query label, the direction (+, -
+or ?), the number of matching words on the forward strand, and the
+number of matching words on the reverse complementary strand.
+.RE
+.PP
+.\" ----------------------------------------------------------------------------
.TAG restriction-site-cutting-options
Restriction site cutting options:
.RS
@@ -3233,9 +3301,10 @@ Userfields (fields accepted by the \-\-userfields option):
.RS
.TP 9
.B aln
-Print a string of M (match), D (delete, i.e. a gap in the query) and I
-(insert, i.e. a gap in the target) representing the pairwise
-alignment. Empty field if there is no alignment.
+Print a string of M (match/mismatch, i.e. not a gap), D (delete,
+i.e. a gap in the query) and I (insert, i.e. a gap in the target)
+representing the pairwise alignment. Empty field if there is no
+alignment.
.TP
.B alnlen
Print the length of the query-target alignment (number of
@@ -4307,6 +4376,18 @@ changes. Compiles successfully on macOS running on Apple Silicon
adaptations for Windows compatibility, including the use of the C++
standard library for regular expressions. Minor changes for
compatibility with Power8. Switch to C++ header files.
+.TP
+.BR v2.16.0\~ "released March 22nd, 2021"
+This version adds the orient command. It also handles empty input
+files properly. Documentation has been updated.
+.TP
+.BR v2.17.0\~ "released March 29nd, 2021"
+The fastq_mergepairs command has been changed. It now allows merging
+of sequences with overlaps as short as 5 bp if the \-\-fastq_minovlen
+option has been adjusted down from the default 10. In addition, much
+fewer pairs of reads should now be rejected with the reason 'multiple
+potential alignments' as the algorithm for detecting those have been
+changed.
.LP
.\" ============================================================================
.\" TODO:
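The strand selection rule described in the orienting options above (at least 4 times as many matching words on one strand as on the other) can be sketched as follows. This is a minimal illustration, not the code in src/orient.cc. The names `choose_strand` and `tabbed_line` are hypothetical. Requiring at least one matching word on the selected strand is an assumption, made so that two zero counts are reported as undetermined.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not taken from src/orient.cc.
// Returns '+' (forward), '-' (reverse complement) or '?' (undetermined).
// Assumption: the selected strand needs at least one matching word.
char choose_strand(uint64_t fwd_matches, uint64_t rev_matches)
{
  const uint64_t factor = 4; // "at least 4 times as many matches"
  if (fwd_matches > 0 && fwd_matches >= factor * rev_matches)
    return '+';
  if (rev_matches > 0 && rev_matches >= factor * fwd_matches)
    return '-';
  return '?';
}

// Builds one line in the column order documented for --tabbedout:
// query label, direction, forward matches, reverse matches.
std::string tabbed_line(const std::string & label,
                        uint64_t fwd_matches,
                        uint64_t rev_matches)
{
  return label + '\t'
    + choose_strand(fwd_matches, rev_matches) + '\t'
    + std::to_string(fwd_matches) + '\t'
    + std::to_string(rev_matches) + '\n';
}
```

With counts of 40 and 10 the forward strand is selected. With 39 and 10 neither strand reaches the factor of 4, so the sequence would go to the file given with `--notmatched`.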
=====================================
src/Makefile.am
=====================================
@@ -48,6 +48,7 @@ md5.h \
mergepairs.h \
minheap.h \
msa.h \
+orient.h \
otutable.h \
rerep.h \
results.h \
@@ -140,6 +141,7 @@ md5.c \
mergepairs.cc \
minheap.cc \
msa.cc \
+orient.cc \
otutable.cc \
rerep.cc \
results.cc \
=====================================
src/allpairs.cc
=====================================
@@ -629,13 +629,21 @@ void allpairs_global(char * cmdline, char * progheader)
progress_done();
if (!opt_quiet)
- fprintf(stderr, "Matching query sequences: %d of %d (%.2f%%)\n",
- qmatches, queries, 100.0 * qmatches / queries);
+ {
+ fprintf(stderr, "Matching query sequences: %d of %d",
+ qmatches, queries);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(stderr, "\n");
+ }
if (opt_log)
{
- fprintf(fp_log, "Matching query sequences: %d of %d (%.2f%%)\n\n",
- qmatches, queries, 100.0 * qmatches / queries);
+ fprintf(fp_log, "Matching query sequences: %d of %d",
+ qmatches, queries);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(fp_log, "\n\n");
}
xpthread_mutex_destroy(&mutex_output);
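The guard introduced in the hunk above can be isolated in a small sketch. With empty input, `queries` is 0 and `100.0 * qmatches / queries` is a floating-point division of zero by zero, which yields NaN rather than a meaningful percentage. The function below is a hypothetical helper (`match_summary` does not exist in vsearch) that builds the same message as a string, so the two cases are easy to compare.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper mirroring the patched output logic: the percentage
// is appended only when there is at least one query sequence.
std::string match_summary(int qmatches, int queries)
{
  char buf[128];
  std::snprintf(buf, sizeof buf,
                "Matching query sequences: %d of %d",
                qmatches, queries);
  std::string line(buf);

  if (queries > 0)
    {
      // Safe: the denominator is known to be non-zero here.
      std::snprintf(buf, sizeof buf,
                    " (%.2f%%)", 100.0 * qmatches / queries);
      line += buf;
    }

  line += "\n";
  return line;
}
```

For 1 match out of 4 queries this gives `Matching query sequences: 1 of 4 (25.00%)`, and for an empty input it gives `Matching query sequences: 0 of 0` with no percentage.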
=====================================
src/chimera.cc
=====================================
@@ -1590,26 +1590,59 @@ void chimera()
progress_done();
if (!opt_quiet)
- fprintf(stderr,
- "Found %d (%.1f%%) chimeras, %d (%.1f%%) non-chimeras,\n"
- "and %d (%.1f%%) borderline sequences in %u unique sequences.\n"
- "Taking abundance information into account, this corresponds to\n"
- "%" PRId64 " (%.1f%%) chimeras, %" PRId64 " (%.1f%%) non-chimeras,\n"
- "and %" PRId64 " (%.1f%%) borderline sequences in %" PRId64 " total sequences.\n",
- chimera_count,
- 100.0 * chimera_count / total_count,
- nonchimera_count,
- 100.0 * nonchimera_count / total_count,
- borderline_count,
- 100.0 * borderline_count / total_count,
- total_count,
- chimera_abundance,
- 100.0 * chimera_abundance / total_abundance,
- nonchimera_abundance,
- 100.0 * nonchimera_abundance / total_abundance,
- borderline_abundance,
- 100.0 * borderline_abundance / total_abundance,
- total_abundance);
+ {
+ if (total_count > 0)
+ fprintf(stderr,
+ "Found %d (%.1f%%) chimeras, "
+ "%d (%.1f%%) non-chimeras,\n"
+ "and %d (%.1f%%) borderline sequences "
+ "in %u unique sequences.\n",
+ chimera_count,
+ 100.0 * chimera_count / total_count,
+ nonchimera_count,
+ 100.0 * nonchimera_count / total_count,
+ borderline_count,
+ 100.0 * borderline_count / total_count,
+ total_count);
+ else
+ fprintf(stderr,
+ "Found %d chimeras, "
+ "%d non-chimeras,\n"
+ "and %d borderline sequences "
+ "in %u unique sequences.\n",
+ chimera_count,
+ nonchimera_count,
+ borderline_count,
+ total_count);
+
+ if (total_abundance > 0)
+ fprintf(stderr,
+ "Taking abundance information into account, "
+ "this corresponds to\n"
+ "%" PRId64 " (%.1f%%) chimeras, "
+ "%" PRId64 " (%.1f%%) non-chimeras,\n"
+ "and %" PRId64 " (%.1f%%) borderline sequences "
+ "in %" PRId64 " total sequences.\n",
+ chimera_abundance,
+ 100.0 * chimera_abundance / total_abundance,
+ nonchimera_abundance,
+ 100.0 * nonchimera_abundance / total_abundance,
+ borderline_abundance,
+ 100.0 * borderline_abundance / total_abundance,
+ total_abundance);
+ else
+ fprintf(stderr,
+ "Taking abundance information into account, "
+ "this corresponds to\n"
+ "%" PRId64 " chimeras, "
+ "%" PRId64 " non-chimeras,\n"
+ "and %" PRId64 " borderline sequences "
+ "in %" PRId64 " total sequences.\n",
+ chimera_abundance,
+ nonchimera_abundance,
+ borderline_abundance,
+ total_abundance);
+ }
if (opt_log)
{
@@ -1617,10 +1650,16 @@ void chimera()
fprintf(fp_log, "%s", opt_uchime_ref);
else
fprintf(fp_log, "%s", denovo_dbname);
- fprintf(fp_log, ": %d/%u chimeras (%.1f%%)\n",
- chimera_count,
- seqno,
- 100.0 * chimera_count / seqno);
+
+ if (seqno > 0)
+ fprintf(fp_log, ": %d/%u chimeras (%.1f%%)\n",
+ chimera_count,
+ seqno,
+ 100.0 * chimera_count / seqno);
+ else
+ fprintf(fp_log, ": %d/%u chimeras\n",
+ chimera_count,
+ seqno);
}
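
Reviewer note: the hunks above, like several later ones in this commit, print a percentage only when its denominator is non-zero, so empty input no longer produces a division by zero in the summary. The sketch below shows the same pattern as one helper. `format_ratio` is a hypothetical name used for illustration only; vsearch itself repeats the `if (total > 0)` test inline at each call site.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of vsearch: build "part of total" and
// append " (xx.x%)" only when total is non-zero, mirroring the guards
// added in this commit.
std::string format_ratio(long long part, long long total)
{
  assert(part >= 0 && total >= 0);   // counts are never negative

  char buf[96];
  int n = std::snprintf(buf, sizeof buf, "%lld of %lld", part, total);

  // Skip the percentage entirely for an empty input (total == 0).
  if (total > 0 && n > 0 && n < static_cast<int>(sizeof buf))
    std::snprintf(buf + n, sizeof buf - n, " (%.1f%%)", 100.0 * part / total);

  return std::string(buf);
}
```

With this helper, `format_ratio(1, 4)` gives `1 of 4 (25.0%)`, while `format_ratio(0, 0)` gives just `0 of 0`.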
=====================================
src/eestats.cc
=====================================
@@ -482,8 +482,14 @@ void fastq_eestats2()
progress_done();
fprintf(fp_output,
- "%" PRIu64 " reads, max len %" PRIu64 ", avg %.1f\n\n",
- seq_count, longest, 1.0 * symbols / seq_count);
+ "%" PRIu64 " reads",
+ seq_count);
+
+ if (seq_count > 0)
+ fprintf(fp_output,
+ ", max len %" PRIu64 ", avg %.1f",
+ longest, 1.0 * symbols / seq_count);
+ fprintf(fp_output, "\n\n");
fprintf(fp_output, "Length");
for (int y = 0; y < opt_ee_cutoffs_count; y++)
=====================================
src/fasta.cc
=====================================
@@ -64,7 +64,7 @@ fastx_handle fasta_open(const char * filename)
{
fastx_handle h = fastx_open(filename);
- if (fastx_is_fastq(h))
+ if (fastx_is_fastq(h) && ! h->is_empty)
fatal("FASTA file expected, FASTQ file found (%s)", filename);
return h;
=====================================
src/fastqjoin.cc
=====================================
@@ -238,8 +238,10 @@ void fastq_join()
fastq_close(fastq_fwd);
fastq_fwd = 0;
- xfree(seq);
- xfree(qual);
+ if (seq)
+ xfree(seq);
+ if (qual)
+ xfree(qual);
xfree(padgap);
xfree(padgapq);
}
=====================================
src/fastqops.cc
=====================================
@@ -185,66 +185,69 @@ void fastq_chars()
{
fprintf(stderr, "Read %" PRIu64 " sequences.\n", seq_count);
- fprintf(stderr, "Qmin %d, QMax %d, Range %d\n",
- qmin, qmax, qmax-qmin+1);
+ if (seq_count > 0)
+ {
+ fprintf(stderr, "Qmin %d, QMax %d, Range %d\n",
+ qmin, qmax, qmax-qmin+1);
- fprintf(stderr, "Guess: -fastq_qmin %d -fastq_qmax %d -fastq_ascii %d\n",
- fastq_qmin, fastq_qmax, fastq_ascii);
+ fprintf(stderr, "Guess: -fastq_qmin %d -fastq_qmax %d -fastq_ascii %d\n",
+ fastq_qmin, fastq_qmax, fastq_ascii);
- if (fastq_ascii == 64)
- {
- if (qmin < 64)
- fprintf(stderr, "Guess: Solexa format (phred+64)\n");
- else if (qmin < 66)
- fprintf(stderr, "Guess: Illumina 1.3+ format (phred+64)\n");
- else
- fprintf(stderr, "Guess: Illumina 1.5+ format (phred+64)\n");
- }
- else
- {
- if (qmax > 73)
- fprintf(stderr, "Guess: Illumina 1.8+ format (phred+33)\n");
+ if (fastq_ascii == 64)
+ {
+ if (qmin < 64)
+ fprintf(stderr, "Guess: Solexa format (phred+64)\n");
+ else if (qmin < 66)
+ fprintf(stderr, "Guess: Illumina 1.3+ format (phred+64)\n");
+ else
+ fprintf(stderr, "Guess: Illumina 1.5+ format (phred+64)\n");
+ }
else
- fprintf(stderr, "Guess: Original Sanger format (phred+33)\n");
- }
+ {
+ if (qmax > 73)
+ fprintf(stderr, "Guess: Illumina 1.8+ format (phred+33)\n");
+ else
+ fprintf(stderr, "Guess: Original Sanger format (phred+33)\n");
+ }
- fprintf(stderr, "\n");
- fprintf(stderr, "Letter N Freq MaxRun\n");
- fprintf(stderr, "------ ---------- ------ ------\n");
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Letter N Freq MaxRun\n");
+ fprintf(stderr, "------ ---------- ------ ------\n");
- for(int c=0; c<256; c++)
- {
- if (sequence_chars[c] > 0)
+ for(int c=0; c<256; c++)
{
- fprintf(stderr, " %c %10" PRIu64 " %5.1f%% %6d",
- c,
- sequence_chars[c],
- 100.0 * sequence_chars[c] / total_chars,
- maxrun[c]);
- if ((c == 'N') || (c == 'n'))
+ if (sequence_chars[c] > 0)
{
- if (qmin_n < qmax_n)
- fprintf(stderr, " Q=%c..%c", qmin_n, qmax_n);
- else
- fprintf(stderr, " Q=%c", qmin_n);
+ fprintf(stderr, " %c %10" PRIu64 " %5.1f%% %6d",
+ c,
+ sequence_chars[c],
+ 100.0 * sequence_chars[c] / total_chars,
+ maxrun[c]);
+ if ((c == 'N') || (c == 'n'))
+ {
+ if (qmin_n < qmax_n)
+ fprintf(stderr, " Q=%c..%c", qmin_n, qmax_n);
+ else
+ fprintf(stderr, " Q=%c", qmin_n);
+ }
+ fprintf(stderr, "\n");
}
- fprintf(stderr, "\n");
}
- }
- fprintf(stderr, "\n");
- fprintf(stderr, "Char ASCII Freq Tails\n");
- fprintf(stderr, "---- ----- ------ ----------\n");
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Char ASCII Freq Tails\n");
+ fprintf(stderr, "---- ----- ------ ----------\n");
- for(int c=qmin; c<=qmax; c++)
- {
- if (quality_chars[c] > 0)
+ for(int c=qmin; c<=qmax; c++)
{
- fprintf(stderr, " '%c' %5d %5.1f%% %10" PRIu64 "\n",
- c,
- c,
- 100.0 * quality_chars[c] / total_chars,
- tail_chars[c]);
+ if (quality_chars[c] > 0)
+ {
+ fprintf(stderr, " '%c' %5d %5.1f%% %10" PRIu64 "\n",
+ c,
+ c,
+ 100.0 * quality_chars[c] / total_chars,
+ tail_chars[c]);
+ }
}
}
}
@@ -253,66 +256,69 @@ void fastq_chars()
{
fprintf(fp_log, "Read %" PRIu64 " sequences.\n", seq_count);
- fprintf(fp_log, "Qmin %d, QMax %d, Range %d\n",
- qmin, qmax, qmax-qmin+1);
+ if (seq_count > 0)
+ {
+ fprintf(fp_log, "Qmin %d, QMax %d, Range %d\n",
+ qmin, qmax, qmax-qmin+1);
- fprintf(fp_log, "Guess: -fastq_qmin %d -fastq_qmax %d -fastq_ascii %d\n",
- fastq_qmin, fastq_qmax, fastq_ascii);
+ fprintf(fp_log, "Guess: -fastq_qmin %d -fastq_qmax %d -fastq_ascii %d\n",
+ fastq_qmin, fastq_qmax, fastq_ascii);
- if (fastq_ascii == 64)
- {
- if (qmin < 64)
- fprintf(fp_log, "Guess: Solexa format (phred+64)\n");
- else if (qmin < 66)
- fprintf(fp_log, "Guess: Illumina 1.3+ format (phred+64)\n");
- else
- fprintf(fp_log, "Guess: Illumina 1.5+ format (phred+64)\n");
- }
- else
- {
- if (qmax > 73)
- fprintf(fp_log, "Guess: Illumina 1.8+ format (phred+33)\n");
+ if (fastq_ascii == 64)
+ {
+ if (qmin < 64)
+ fprintf(fp_log, "Guess: Solexa format (phred+64)\n");
+ else if (qmin < 66)
+ fprintf(fp_log, "Guess: Illumina 1.3+ format (phred+64)\n");
+ else
+ fprintf(fp_log, "Guess: Illumina 1.5+ format (phred+64)\n");
+ }
else
- fprintf(fp_log, "Guess: Original Sanger format (phred+33)\n");
- }
+ {
+ if (qmax > 73)
+ fprintf(fp_log, "Guess: Illumina 1.8+ format (phred+33)\n");
+ else
+ fprintf(fp_log, "Guess: Original Sanger format (phred+33)\n");
+ }
- fprintf(fp_log, "\n");
- fprintf(fp_log, "Letter N Freq MaxRun\n");
- fprintf(fp_log, "------ ---------- ------ ------\n");
+ fprintf(fp_log, "\n");
+ fprintf(fp_log, "Letter N Freq MaxRun\n");
+ fprintf(fp_log, "------ ---------- ------ ------\n");
- for(int c=0; c<256; c++)
- {
- if (sequence_chars[c] > 0)
+ for(int c=0; c<256; c++)
{
- fprintf(fp_log, " %c %10" PRIu64 " %5.1f%% %6d",
- c,
- sequence_chars[c],
- 100.0 * sequence_chars[c] / total_chars,
- maxrun[c]);
- if ((c == 'N') || (c == 'n'))
+ if (sequence_chars[c] > 0)
{
- if (qmin_n < qmax_n)
- fprintf(fp_log, " Q=%c..%c", qmin_n, qmax_n);
- else
- fprintf(fp_log, " Q=%c", qmin_n);
+ fprintf(fp_log, " %c %10" PRIu64 " %5.1f%% %6d",
+ c,
+ sequence_chars[c],
+ 100.0 * sequence_chars[c] / total_chars,
+ maxrun[c]);
+ if ((c == 'N') || (c == 'n'))
+ {
+ if (qmin_n < qmax_n)
+ fprintf(fp_log, " Q=%c..%c", qmin_n, qmax_n);
+ else
+ fprintf(fp_log, " Q=%c", qmin_n);
+ }
+ fprintf(fp_log, "\n");
}
- fprintf(fp_log, "\n");
}
- }
- fprintf(fp_log, "\n");
- fprintf(fp_log, "Char ASCII Freq Tails\n");
- fprintf(fp_log, "---- ----- ------ ----------\n");
+ fprintf(fp_log, "\n");
+ fprintf(fp_log, "Char ASCII Freq Tails\n");
+ fprintf(fp_log, "---- ----- ------ ----------\n");
- for(int c=qmin; c<=qmax; c++)
- {
- if (quality_chars[c] > 0)
+ for(int c=qmin; c<=qmax; c++)
{
- fprintf(fp_log, " '%c' %5d %5.1f%% %10" PRIu64 "\n",
- c,
- c,
- 100.0 * quality_chars[c] / total_chars,
- tail_chars[c]);
+ if (quality_chars[c] > 0)
+ {
+ fprintf(fp_log, " '%c' %5d %5.1f%% %10" PRIu64 "\n",
+ c,
+ c,
+ 100.0 * quality_chars[c] / total_chars,
+ tail_chars[c]);
+ }
}
}
}
@@ -622,7 +628,8 @@ void fastq_stats()
fprintf(fp_log, "\n");
fprintf(fp_log, "%10" PRIu64 " Recs (%.1lfM), 0 too long\n",
seq_count, seq_count / 1.0e6);
- fprintf(fp_log, "%10.1lf Avg length\n", 1.0 * symbols / seq_count);
+ if (seq_count > 0)
+ fprintf(fp_log, "%10.1lf Avg length\n", 1.0 * symbols / seq_count);
fprintf(fp_log, "%9.1lfM Bases\n", symbols / 1.0e6);
}
@@ -658,7 +665,7 @@ void fastx_revcomp()
if (!h)
fatal("Unrecognized file type (not proper FASTA or FASTQ format)");
- if (opt_fastqout && ! h->is_fastq)
+ if (opt_fastqout && ! (h->is_fastq || h->is_empty))
fatal("Cannot write FASTQ output with a FASTA input file, lacking quality scores");
uint64_t filesize = fastx_get_size(h);
=====================================
src/fastx.cc
=====================================
@@ -275,13 +275,20 @@ fastx_handle fastx_open(const char * filename)
unsigned char magic[2];
h->format = FORMAT_PLAIN;
- if (fread(&magic, 1, 2, h->fp) < 2)
- fatal("Unable to read from file (%s)", filename);
- if (memcmp(magic, MAGIC_GZIP, 2) == 0)
- h->format = FORMAT_GZIP;
- else if (memcmp(magic, MAGIC_BZIP, 2) == 0)
- h->format = FORMAT_BZIP;
+ size_t bytes_read = fread(&magic, 1, 2, h->fp);
+
+ if (bytes_read >= 2)
+ {
+ if (memcmp(magic, MAGIC_GZIP, 2) == 0)
+ h->format = FORMAT_GZIP;
+ else if (memcmp(magic, MAGIC_BZIP, 2) == 0)
+ h->format = FORMAT_BZIP;
+ }
+ else
+ {
+ /* consider it an empty file or a tiny fasta file, uncompressed */
+ }
/* close and reopen to avoid problems with gzip library */
/* rewind was not enough */
@@ -330,65 +337,71 @@ fastx_handle fastx_open(const char * filename)
uint64_t rest = fastx_file_fill_buffer(h);
- if (rest < 2)
- fatal("File too small");
-
-
/* examine first char and see if it starts with > or @ */
int filetype = 0;
- char * first = h->file_buffer.data;
+ h->is_empty = 1;
+ h->is_fastq = 0;
- if (*first == '>')
- {
- filetype = 1;
- h->is_fastq = 0;
- }
- else if (*first == '@')
+ if (rest > 0)
{
- filetype = 2;
- h->is_fastq = 1;
- }
+ h->is_empty = 0;
- if (filetype == 0)
- {
- /* close files if unrecognized file type */
+ char * first = h->file_buffer.data;
- switch(h->format)
+ if (*first == '>')
{
- case FORMAT_PLAIN:
- break;
+ filetype = 1;
+ }
+ else if (*first == '@')
+ {
+ filetype = 2;
+ h->is_fastq = 1;
+ }
- case FORMAT_GZIP:
+ if (filetype == 0)
+ {
+ /* close files if unrecognized file type */
+
+ switch(h->format)
+ {
+ case FORMAT_PLAIN:
+ break;
+
+ case FORMAT_GZIP:
#ifdef HAVE_ZLIB_H
- (*gzclose_p)(h->fp_gz);
- h->fp_gz = 0;
- break;
+ (*gzclose_p)(h->fp_gz);
+ h->fp_gz = 0;
+ break;
#endif
- case FORMAT_BZIP:
+ case FORMAT_BZIP:
#ifdef HAVE_BZLIB_H
- (*BZ2_bzReadClose_p)(&bzError, h->fp_bz);
- h->fp_bz = 0;
- break;
+ (*BZ2_bzReadClose_p)(&bzError, h->fp_bz);
+ h->fp_bz = 0;
+ break;
#endif
- default:
- fatal("Internal error");
- }
+ default:
+ fatal("Internal error");
+ }
- fclose(h->fp);
- h->fp = 0;
+ fclose(h->fp);
+ h->fp = 0;
- if (memcmp(first, MAGIC_GZIP, 2) == 0)
- fatal("File appears to be gzip compressed. Please use --gzip_decompress");
+ if (rest >= 2)
+ {
+ if (memcmp(first, MAGIC_GZIP, 2) == 0)
+ fatal("File appears to be gzip compressed. Please use --gzip_decompress");
- if (memcmp(first, MAGIC_BZIP, 2) == 0)
- fatal("File appears to be bzip2 compressed. Please use --bzip2_decompress");
+ if (memcmp(first, MAGIC_BZIP, 2) == 0)
+ fatal("File appears to be bzip2 compressed. Please use --bzip2_decompress");
+ }
- fatal("File type not recognized.");
+ fatal("File type not recognized.");
- return 0;
+ return 0;
+ }
}
/* more initialization */
@@ -412,7 +425,7 @@ fastx_handle fastx_open(const char * filename)
bool fastx_is_fastq(fastx_handle h)
{
- return h->is_fastq;
+ return h->is_fastq || h->is_empty;
}
void fastx_close(fastx_handle h)
@@ -426,6 +439,7 @@ void fastx_close(fastx_handle h)
if (h->stripped[i])
fprintf(stderr, " %c(%" PRIu64 ")", i, h->stripped[i]);
fprintf(stderr, "\n");
+ fprintf(stderr, "REMINDER: vsearch does not support amino acid sequences\n");
if (opt_log)
{
@@ -434,6 +448,7 @@ void fastx_close(fastx_handle h)
if (h->stripped[i])
fprintf(fp_log, " %c(%" PRIu64 ")", i, h->stripped[i]);
fprintf(fp_log, "\n");
+ fprintf(fp_log, "REMINDER: vsearch does not support amino acid sequences\n");
}
}
@@ -661,4 +676,3 @@ int64_t fastx_get_abundance(fastx_handle h)
else
return fasta_get_abundance(h);
}
-
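
Reviewer note: the fastx.cc changes above stop treating an empty input as a fatal error. They add an `is_empty` flag, and `fastx_is_fastq()` now also returns true for empty input, so an empty file passes as either format. The sketch below models that classification on an in-memory buffer. `InputKind`, `classify_input` and `acceptable_as_fastq` are hypothetical names for illustration; the real code works on `fastx_handle` and also handles gzip and bzip2 detection.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of an input by its first byte.
enum class InputKind { Empty, Fasta, Fastq, Unrecognized };

InputKind classify_input(const std::string & buffer)
{
  if (buffer.empty())
    return InputKind::Empty;          // corresponds to h->is_empty = 1

  switch (buffer[0])
  {
    case '>': return InputKind::Fasta; // FASTA header line
    case '@': return InputKind::Fastq; // FASTQ header line
    default:  return InputKind::Unrecognized;
  }
}

// Mirrors the new fastx_is_fastq(): empty input is accepted where
// FASTQ input is required.
bool acceptable_as_fastq(InputKind kind)
{
  return kind == InputKind::Fastq || kind == InputKind::Empty;
}
```

This is why call sites such as `filter()` and `getseq()` now test `is_fastq || is_empty` before rejecting a file as FASTA.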
=====================================
src/fastx.h
=====================================
@@ -77,6 +77,7 @@ struct fastx_s
{
bool is_pipe;
bool is_fastq;
+ bool is_empty;
FILE * fp;
=====================================
src/filter.cc
=====================================
@@ -223,7 +223,7 @@ void filter(bool fastq_only, char * filename)
if (!h1)
fatal("Unrecognized file type (not proper FASTA or FASTQ format)");
- if (! h1->is_fastq)
+ if (! (h1->is_fastq || h1->is_empty))
{
if (fastq_only)
{
@@ -259,7 +259,7 @@ void filter(bool fastq_only, char * filename)
if (h1->is_fastq != h2->is_fastq)
fatal("The forward and reverse input sequence must in the same format, either FASTA or FASTQ");
- if (! h2->is_fastq)
+ if (! (h2->is_fastq || h2->is_empty))
{
if (fastq_only)
{
=====================================
src/getseq.cc
=====================================
@@ -318,7 +318,7 @@ void getseq(char * filename)
if (!h1)
fatal("Unrecognized file type (not proper FASTA or FASTQ format)");
- if ((opt_fastqout || opt_notmatchedfq) && ! h1->is_fastq)
+ if ((opt_fastqout || opt_notmatchedfq) && ! (h1->is_fastq || h1->is_empty))
fatal("Cannot write FASTQ output from FASTA input");
uint64_t filesize = fastx_get_size(h1);
@@ -448,18 +448,30 @@ void getseq(char * filename)
progress_done();
if (! opt_quiet)
- fprintf(stderr,
- "%" PRId64 " of %" PRId64 " sequences extracted (%.1lf%%)\n",
- kept,
- kept + discarded,
- 100.0 * kept / (kept + discarded));
+ {
+ fprintf(stderr,
+ "%" PRId64 " of %" PRId64 " sequences extracted",
+ kept,
+ kept + discarded);
+ if (kept + discarded > 0)
+ fprintf(stderr,
+ " (%.1lf%%)",
+ 100.0 * kept / (kept + discarded));
+ fprintf(stderr, "\n");
+ }
if (opt_log)
- fprintf(fp_log,
- "%" PRId64 " of %" PRId64 " sequences extracted (%.1lf%%)\n",
- kept,
- kept + discarded,
- 100.0 * kept / (kept + discarded));
+ {
+ fprintf(fp_log,
+ "%" PRId64 " of %" PRId64 " sequences extracted",
+ kept,
+ kept + discarded);
+ if (kept + discarded > 0)
+ fprintf(fp_log,
+ " (%.1lf%%)",
+ 100.0 * kept / (kept + discarded));
+ fprintf(fp_log, "\n");
+ }
if (opt_fastaout)
fclose(fp_fastaout);
=====================================
src/mergepairs.cc
=====================================
@@ -68,10 +68,9 @@ static const int chunk_factor = 2; /* chunks per thread */
/* scores in bits */
static const int k = 5;
-static const double merge_minscore = 16.0;
+static int merge_mindiagcount = 4;
+static double merge_minscore = 16.0;
static const double merge_dropmax = 16.0;
-static const int merge_mindiagcount = 4;
-static const int merge_minrepeatdiagcount = 12;
static const double merge_mismatchmax = -4.0;
/* static variables */
@@ -686,13 +685,6 @@ int64_t optimize(merge_data_t * ip,
{
kmers = 1;
- if (diagcount >= merge_minrepeatdiagcount)
- {
- hits++;
- if (hits > 1)
- break;
- }
-
/* for each interesting diagonal */
int64_t fwd_3prime_overhang
@@ -747,6 +739,9 @@ int64_t optimize(merge_data_t * ip,
if (dropmax >= merge_dropmax)
score = 0.0;
+ if (score >= merge_minscore)
+ hits++;
+
if (score > best_score)
{
best_score = score;
@@ -1280,6 +1275,18 @@ void pair_all()
void fastq_mergepairs()
{
+ /* fatal error if specified overlap is too small */
+
+ if (opt_fastq_minovlen < 5)
+ fatal("Overlap specified with --fastq_minovlen must be at least 5");
+
+ /* relax default parameters in case of short overlaps */
+
+ if (opt_fastq_minovlen < 9)
+ {
+ merge_mindiagcount = opt_fastq_minovlen - 4;
+ merge_minscore = 1.6 * opt_fastq_minovlen;
+ }
/* open input files */
@@ -1312,7 +1319,8 @@ void fastq_mergepairs()
uint64_t filesize = fastq_get_size(fastq_fwd);
progress_init("Merging reads", filesize);
- pair_all();
+ if (! fastq_fwd->is_empty)
+ pair_all();
progress_done();
@@ -1324,14 +1332,22 @@ void fastq_mergepairs()
total);
fprintf(stderr,
- "%10" PRIu64 " Merged (%.1lf%%)\n",
- merged,
- 100.0 * merged / total);
+ "%10" PRIu64 " Merged",
+ merged);
+ if (total > 0)
+ fprintf(stderr,
+ " (%.1lf%%)",
+ 100.0 * merged / total);
+ fprintf(stderr, "\n");
fprintf(stderr,
- "%10" PRIu64 " Not merged (%.1lf%%)\n",
- notmerged,
- 100.0 * notmerged / total);
+ "%10" PRIu64 " Not merged",
+ notmerged);
+ if (total > 0)
+ fprintf(stderr,
+ " (%.1lf%%)",
+ 100.0 * notmerged / total);
+ fprintf(stderr, "\n");
if (notmerged > 0)
fprintf(stderr, "\nPairs that failed merging due to various reasons:\n");
@@ -1413,13 +1429,16 @@ void fastq_mergepairs()
fprintf(stderr, "\n");
- fprintf(stderr, "Statistics of all reads:\n");
+ if (total > 0)
+ {
+ fprintf(stderr, "Statistics of all reads:\n");
- double mean_read_length = sum_read_length / (2.0 * pairs_read);
+ double mean_read_length = sum_read_length / (2.0 * pairs_read);
- fprintf(stderr,
- "%10.2f Mean read length\n",
- mean_read_length);
+ fprintf(stderr,
+ "%10.2f Mean read length\n",
+ mean_read_length);
+ }
if (merged > 0)
{
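
Reviewer note: `fastq_mergepairs()` now rejects `--fastq_minovlen` values below 5 and relaxes two thresholds for overlaps shorter than 9. The sketch below restates that rule as a pure function, with the defaults (4 and 16.0) and the formulas taken from the hunks above. `MergeParams` and `derive_merge_params` are hypothetical names; the real code assigns to the file-level statics `merge_mindiagcount` and `merge_minscore` and calls `fatal()` instead of returning false.

```cpp
#include <cassert>

// Hypothetical container for the two thresholds adjusted in this commit.
struct MergeParams
{
  int mindiagcount;
  double minscore;
};

// Returns false where vsearch would abort with a fatal error (overlap < 5).
bool derive_merge_params(int minovlen, MergeParams & p)
{
  if (minovlen < 5)
    return false;

  // Defaults, as initialised at file scope in mergepairs.cc.
  p.mindiagcount = 4;
  p.minscore = 16.0;

  // Relax the defaults for short overlaps.
  if (minovlen < 9)
  {
    p.mindiagcount = minovlen - 4;
    p.minscore = 1.6 * minovlen;
  }

  assert(p.mindiagcount >= 1);  // holds for every accepted minovlen
  return true;
}
```

For example, a minimum overlap of 5 yields a diagonal count of 1 and a minimum score of about 8.0, while any value of 9 or more keeps the defaults.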
=====================================
src/orient.cc
=====================================
@@ -0,0 +1,436 @@
+/*
+
+ VSEARCH: a versatile open source tool for metagenomics
+
+ Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+ All rights reserved.
+
+ Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+ Department of Informatics, University of Oslo,
+ PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+ This software is dual-licensed and available under a choice
+ of one of two licenses, either under the terms of the GNU
+ General Public License version 3 or the BSD 2-Clause License.
+
+
+ GNU General Public License version 3
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+
+ The BSD 2-Clause License
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "vsearch.h"
+
+unsigned int rc_kmer(unsigned int kmer)
+{
+ /* reverse complement a kmer where k = opt_wordlength */
+
+ unsigned int fwd = kmer;
+ unsigned int rev = 0;
+
+ for (int i = 0; i < opt_wordlength; i++)
+ {
+ unsigned int x = (fwd & 3U) ^ 3U;
+ fwd = fwd >> 2U;
+ rev = rev << 2U;
+ rev |= x;
+ }
+
+ return rev;
+}
+
+
+void orient()
+{
+ fastx_handle query_h;
+
+ FILE * fp_fastaout = 0;
+ FILE * fp_fastqout = 0;
+ FILE * fp_tabbedout = 0;
+ FILE * fp_notmatched = 0;
+
+ int queries = 0;
+ int qmatches = 0;
+ int matches_fwd = 0;
+ int matches_rev = 0;
+ int notmatched = 0;
+
+ /* check arguments */
+
+ if (! opt_db)
+ fatal("Database not specified with --db");
+
+ if (! (opt_fastaout || opt_fastqout || opt_notmatched || opt_tabbedout))
+ fatal("Output file not specified with --fastaout, --fastqout, --notmatched or --tabbedout");
+
+ /* prepare reading of queries */
+
+ query_h = fastx_open(opt_orient);
+
+ /* open output files */
+
+ if (opt_fastaout)
+ {
+ fp_fastaout = fopen_output(opt_fastaout);
+ if (! fp_fastaout)
+ fatal("Unable to open fasta output file for writing");
+ }
+
+ if (opt_fastqout)
+ {
+ if (! fastx_is_fastq(query_h))
+ fatal("Cannot write FASTQ output with FASTA input");
+
+ fp_fastqout = fopen_output(opt_fastqout);
+ if (! fp_fastqout)
+ fatal("Unable to open fastq output file for writing");
+ }
+
+ if (opt_notmatched)
+ {
+ fp_notmatched = fopen_output(opt_notmatched);
+ if (! fp_notmatched)
+ fatal("Unable to open notmatched output file for writing");
+ }
+
+ if (opt_tabbedout)
+ {
+ fp_tabbedout = fopen_output(opt_tabbedout);
+ if (! fp_tabbedout)
+ fatal("Unable to open tabbedout output file for writing");
+ }
+
+ /* check if it may be an UDB file */
+
+ bool is_udb = udb_detect_isudb(opt_db);
+
+ if (is_udb)
+ udb_read(opt_db, 1, 1);
+ else
+ db_read(opt_db, 0);
+
+ if (!is_udb)
+ {
+ if (opt_dbmask == MASK_DUST)
+ dust_all();
+ else if ((opt_dbmask == MASK_SOFT) && (opt_hardmask))
+ hardmask_all();
+ }
+
+ if (!is_udb)
+ {
+ dbindex_prepare(1, opt_dbmask);
+ dbindex_addallsequences(opt_dbmask);
+ }
+
+ uhandle_s * uh_fwd = unique_init();
+
+ size_t alloc = 0;
+ char * qseq_rev = 0;
+ char * query_qual_rev = 0;
+
+ progress_init("Orienting sequences", fasta_get_size(query_h));
+
+ while (fastx_next(query_h,
+ ! opt_notrunclabels,
+ chrmap_no_change))
+ {
+ char * query_head = fastx_get_header(query_h);
+ int query_head_len = fastx_get_header_length(query_h);
+ char * qseq_fwd = fastx_get_sequence(query_h);
+ int qseqlen = fastx_get_sequence_length(query_h);
+ int qsize = fastx_get_abundance(query_h);
+ char * query_qual_fwd = fastx_get_quality(query_h);
+
+ /* find kmers in query sequence */
+
+ unsigned int kmer_count_fwd;
+ unsigned int * kmer_list_fwd;
+
+ unique_count(uh_fwd, opt_wordlength, qseqlen, qseq_fwd,
+ & kmer_count_fwd, & kmer_list_fwd, opt_qmask);
+
+ /* count kmers matching on each strand */
+
+ unsigned int count_fwd = 0;
+ unsigned int count_rev = 0;
+ const unsigned int hits_factor = 8;
+
+ for(unsigned int i = 0; i < kmer_count_fwd; i++)
+ {
+ unsigned int kmer_fwd = kmer_list_fwd[i];
+ unsigned int kmer_rev = rc_kmer(kmer_fwd);
+
+ unsigned int hits_fwd = dbindex_getmatchcount(kmer_fwd);
+ unsigned int hits_rev = dbindex_getmatchcount(kmer_rev);
+
+ /* require 8 times as many matches on one strand as on the other */
+
+ if (hits_fwd > hits_factor * hits_rev)
+ count_fwd++;
+ else if (hits_rev > hits_factor * hits_fwd)
+ count_rev++;
+ }
+
+ /* get progress as amount of input file read */
+
+ uint64_t progress = fasta_get_position(query_h);
+
+ /* update stats */
+
+ queries++;
+
+ int strand = 2;
+ unsigned int min_count = 1;
+ unsigned int min_factor = 4;
+
+ if ((count_fwd >= min_count) && (count_fwd >= min_factor * count_rev))
+ {
+ /* fwd */
+
+ strand = 0;
+ matches_fwd++;
+ qmatches++;
+
+ if (opt_fastaout)
+ fasta_print_general(fp_fastaout,
+ 0,
+ qseq_fwd,
+ qseqlen,
+ query_head,
+ query_head_len,
+ qsize,
+ qmatches,
+ -1.0,
+ -1,
+ -1,
+ 0,
+ 0.0);
+
+ if (opt_fastqout)
+ fastq_print_general(fp_fastqout,
+ qseq_fwd,
+ qseqlen,
+ query_head,
+ query_head_len,
+ query_qual_fwd,
+ qsize,
+ qmatches,
+ -1.0);
+ }
+ else if ((count_rev >= min_count) && (count_rev >= min_factor * count_fwd))
+ {
+ /* rev */
+
+ strand = 1;
+ matches_rev++;
+ qmatches++;
+
+ /* alloc more mem if necessary to keep reverse sequence and qual */
+
+ if ((size_t)(qseqlen + 1) > alloc)
+ {
+ alloc = qseqlen + 1;
+ qseq_rev = (char*) xrealloc(qseq_rev, alloc);
+ if (fastx_is_fastq(query_h))
+ query_qual_rev = (char*) xrealloc(query_qual_rev, alloc);
+ }
+
+ /* get reverse complementary sequence */
+
+ reverse_complement(qseq_rev, qseq_fwd, qseqlen);
+
+ if (opt_fastaout)
+ fasta_print_general(fp_fastaout,
+ 0,
+ qseq_rev,
+ qseqlen,
+ query_head,
+ query_head_len,
+ qsize,
+ qmatches,
+ -1.0,
+ -1,
+ -1,
+ 0,
+ 0.0);
+
+ if (opt_fastqout)
+ {
+ /* reverse quality scores */
+
+ if (fastx_is_fastq(query_h))
+ {
+ for(int i = 0; i < qseqlen; i++)
+ query_qual_rev[i] = query_qual_fwd[qseqlen-1-i];
+ query_qual_rev[qseqlen] = 0;
+ }
+
+ fastq_print_general(fp_fastqout,
+ qseq_rev,
+ qseqlen,
+ query_head,
+ query_head_len,
+ query_qual_rev,
+ qsize,
+ qmatches,
+ -1.0);
+ }
+ }
+ else
+ {
+ /* undecided */
+
+ strand = 2;
+ notmatched++;
+
+ if (opt_notmatched)
+ {
+ if (fastx_is_fastq(query_h))
+ fastq_print_general(fp_notmatched,
+ qseq_fwd,
+ qseqlen,
+ query_head,
+ query_head_len,
+ query_qual_fwd,
+ qsize,
+ notmatched,
+ -1.0);
+ else
+ fasta_print_general(fp_notmatched,
+ 0,
+ qseq_fwd,
+ qseqlen,
+ query_head,
+ query_head_len,
+ qsize,
+ notmatched,
+ -1.0,
+ -1,
+ -1,
+ 0,
+ 0.0);
+ }
+ }
+
+ if (opt_tabbedout)
+ {
+ fprintf(fp_tabbedout,
+ "%s\t%c\t%d\t%d\n",
+ query_head,
+ strand == 0 ? '+' : (strand == 1 ? '-' : '?'),
+ count_fwd,
+ count_rev);
+ }
+
+ /* show progress */
+
+ progress_update(progress);
+ }
+
+ progress_done();
+
+ /* clean up */
+
+ if (qseq_rev)
+ xfree(qseq_rev);
+ if (query_qual_rev)
+ xfree(query_qual_rev);
+
+ unique_exit(uh_fwd);
+
+ dbindex_free();
+ db_free();
+
+ if (opt_tabbedout)
+ fclose(fp_tabbedout);
+ if (opt_notmatched)
+ fclose(fp_notmatched);
+ if (opt_fastqout)
+ fclose(fp_fastqout);
+ if (opt_fastaout)
+ fclose(fp_fastaout);
+
+ fasta_close(query_h);
+
+ if (!opt_quiet)
+ {
+ fprintf(stderr, "Forward oriented sequences: %d", matches_fwd);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * matches_fwd / queries);
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Reverse oriented sequences: %d", matches_rev);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * matches_rev / queries);
+ fprintf(stderr, "\n");
+ fprintf(stderr, "All oriented sequences: %d", qmatches);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Not oriented sequences: %d", notmatched);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * notmatched / queries);
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Total number of sequences: %d\n", queries);
+ }
+
+ if (opt_log)
+ {
+ fprintf(fp_log, "Forward oriented sequences: %d", matches_fwd);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * matches_fwd / queries);
+ fprintf(fp_log, "\n");
+ fprintf(fp_log, "Reverse oriented sequences: %d", matches_rev);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * matches_rev / queries);
+ fprintf(fp_log, "\n");
+ fprintf(fp_log, "All oriented sequences: %d", qmatches);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(fp_log, "\n");
+ fprintf(fp_log, "Not oriented sequences: %d", notmatched);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * notmatched / queries);
+ fprintf(fp_log, "\n");
+ fprintf(fp_log, "Total number of sequences: %d\n", queries);
+ }
+}
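
Reviewer note: the new `orient()` command depends on two small pieces of logic: reverse-complementing a 2-bit packed k-mer, and choosing a strand from the per-strand k-mer counts. The sketch below reproduces both in self-contained form. It assumes the encoding A=0, C=1, G=2, T=3, which is what makes `^ 3` a complement; the diff itself does not show the encoding. The word length is passed as a parameter instead of being read from `opt_wordlength`. `encode_kmer`, `rc_kmer_sketch` and `decide_strand` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Pack a short DNA word into 2 bits per base (assumed A=0, C=1, G=2, T=3).
unsigned int encode_kmer(const std::string & s)
{
  unsigned int kmer = 0;
  for (char c : s)
  {
    unsigned int code = 0;
    switch (c)
    {
      case 'A': code = 0; break;
      case 'C': code = 1; break;
      case 'G': code = 2; break;
      case 'T': code = 3; break;
      default: assert(false && "only A, C, G, T expected");
    }
    kmer = (kmer << 2U) | code;
  }
  return kmer;
}

// Same loop as rc_kmer() in orient.cc: take bases from the low end,
// complement each with XOR 3, and push it onto the reversed word.
unsigned int rc_kmer_sketch(unsigned int kmer, int wordlength)
{
  assert(wordlength >= 1 && wordlength <= 16);
  unsigned int fwd = kmer;
  unsigned int rev = 0;
  for (int i = 0; i < wordlength; i++)
  {
    unsigned int x = (fwd & 3U) ^ 3U;
    fwd >>= 2U;
    rev = (rev << 2U) | x;
  }
  return rev;
}

// Strand decision with min_count = 1 and min_factor = 4 as in orient():
// '+' forward, '-' reverse, '?' undecided.
char decide_strand(unsigned int count_fwd, unsigned int count_rev)
{
  const unsigned int min_count = 1;
  const unsigned int min_factor = 4;
  if (count_fwd >= min_count && count_fwd >= min_factor * count_rev)
    return '+';
  if (count_rev >= min_count && count_rev >= min_factor * count_fwd)
    return '-';
  return '?';
}
```

Under the assumed encoding, the reverse complement of `AACG` is `CGTT`, and the palindromic word `ACGT` maps to itself. A query with 3 forward and 1 reverse k-mer votes stays undecided because 3 is less than 4 times 1.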
=====================================
src/orient.h
=====================================
@@ -0,0 +1,61 @@
+/*
+
+ VSEARCH: a versatile open source tool for metagenomics
+
+ Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+ All rights reserved.
+
+ Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+ Department of Informatics, University of Oslo,
+ PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+ This software is dual-licensed and available under a choice
+ of one of two licenses, either under the terms of the GNU
+ General Public License version 3 or the BSD 2-Clause License.
+
+
+ GNU General Public License version 3
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+
+ The BSD 2-Clause License
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+void orient();
=====================================
src/otutable.cc
=====================================
@@ -392,7 +392,7 @@ void otutable_print_biomout(FILE * fp)
fprintf(fp, "null");
else
{
- fprintf(fp, "{\"taxonomy\":\"");
+ fprintf(fp, R"({"taxonomy":")");
otu_tax_map_t::iterator it
= otutable->otu_tax_map.find(otu_name);
if (it != otutable->otu_tax_map.end())
=====================================
src/search.cc
=====================================
@@ -689,24 +689,40 @@ void usearch_global(char * cmdline, char * progheader)
if (!opt_quiet)
{
- fprintf(stderr, "Matching unique query sequences: %d of %d (%.2f%%)\n",
- qmatches, queries, 100.0 * qmatches / queries);
+ fprintf(stderr, "Matching unique query sequences: %d of %d",
+ qmatches, queries);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(stderr, "\n");
if (opt_sizein)
- fprintf(stderr, "Matching total query sequences: %" PRIu64 " of %"
- PRIu64 " (%.2f%%)\n",
- qmatches_abundance, queries_abundance,
- 100.0 * qmatches_abundance / queries_abundance);
+ {
+ fprintf(stderr, "Matching total query sequences: %" PRIu64 " of %"
+ PRIu64,
+ qmatches_abundance, queries_abundance);
+ if (queries_abundance > 0)
+ fprintf(stderr, " (%.2f%%)",
+ 100.0 * qmatches_abundance / queries_abundance);
+ fprintf(stderr, "\n");
+ }
}
if (opt_log)
{
- fprintf(fp_log, "Matching unique query sequences: %d of %d (%.2f%%)\n",
- qmatches, queries, 100.0 * qmatches / queries);
+ fprintf(fp_log, "Matching unique query sequences: %d of %d",
+ qmatches, queries);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(fp_log, "\n");
if (opt_sizein)
- fprintf(fp_log, "Matching total query sequences: %" PRIu64 " of %"
- PRIu64 " (%.2f%%)\n",
- qmatches_abundance, queries_abundance,
- 100.0 * qmatches_abundance / queries_abundance);
+ {
+ fprintf(fp_log, "Matching total query sequences: %" PRIu64 " of %"
+ PRIu64,
+ qmatches_abundance, queries_abundance);
+ if (queries_abundance > 0)
+ fprintf(fp_log, " (%.2f%%)",
+ 100.0 * qmatches_abundance / queries_abundance);
+ fprintf(fp_log, "\n");
+ }
}
if (opt_biomout)
=====================================
src/searchexact.cc
=====================================
@@ -709,12 +709,20 @@ void search_exact(char * cmdline, char * progheader)
fasta_close(query_fasta_h);
if (!opt_quiet)
- fprintf(stderr, "Matching query sequences: %d of %d (%.2f%%)\n",
- qmatches, queries, 100.0 * qmatches / queries);
+ {
+ fprintf(stderr, "Matching query sequences: %d of %d", qmatches, queries);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(stderr, "\n");
+ }
if (opt_log)
- fprintf(fp_log, "Matching query sequences: %d of %d (%.2f%%)\n",
- qmatches, queries, 100.0 * qmatches / queries);
+ {
+ fprintf(fp_log, "Matching query sequences: %d of %d", qmatches, queries);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries);
+ fprintf(fp_log, "\n");
+ }
if (fp_biomout)
{
=====================================
src/sintax.cc
=====================================
@@ -319,17 +319,9 @@ void sintax_analyse(char * query_head,
void sintax_query(int64_t t)
{
int all_seqno[2][bootstrap_count];
- int best_seqno[2];
- int boot_count[2];
- unsigned int best_count[2];
-
- best_count[0] = 0;
- best_count[1] = 0;
- best_seqno[0] = 0;
- best_seqno[1] = 0;
- boot_count[0] = 0;
- boot_count[1] = 0;
-
+ int best_seqno[2] = {0, 0};
+ int boot_count[2] = {0, 0};
+ unsigned int best_count[2] = {0, 0};
int qseqlen = si_plus[t].qseqlen;
char * query_head = si_plus[t].query_head;
@@ -628,12 +620,20 @@ void sintax()
progress_done();
if (! opt_quiet)
- fprintf(stderr, "Classified %d of %d sequences (%.2f%%)\n",
- classified, queries, 100.0 * classified / queries);
+ {
+ fprintf(stderr, "Classified %d of %d sequences", classified, queries);
+ if (queries > 0)
+ fprintf(stderr, " (%.2f%%)", 100.0 * classified / queries);
+ fprintf(stderr, "\n");
+ }
if (opt_log)
- fprintf(fp_log, "Classified %d of %d sequences (%.2f%%)\n",
- classified, queries, 100.0 * classified / queries);
+ {
+ fprintf(fp_log, "Classified %d of %d sequences", classified, queries);
+ if (queries > 0)
+ fprintf(fp_log, " (%.2f%%)", 100.0 * classified / queries);
+ fprintf(fp_log, "\n");
+ }
/* clean up */
=====================================
src/vsearch.cc
=====================================
@@ -150,6 +150,7 @@ char * opt_msaout;
char * opt_nonchimeras;
char * opt_notmatched;
char * opt_notmatchedfq;
+char * opt_orient;
char * opt_otutabout;
char * opt_output;
char * opt_pattern;
@@ -817,6 +818,7 @@ void args_init(int argc, char **argv)
opt_notmatched = 0;
opt_notmatched = 0;
opt_notrunclabels = 0;
+ opt_orient = 0;
opt_otutabout = 0;
opt_output = 0;
opt_output_no_hits = 0;
@@ -879,7 +881,7 @@ void args_init(int argc, char **argv)
opt_usersort = 0;
opt_version = 0;
opt_weak_id = 10.0;
- opt_wordlength = 8;
+ opt_wordlength = 0;
opt_xn = 8.0;
opt_xsize = 0;
opt_xee = 0;
@@ -1044,6 +1046,7 @@ void args_init(int argc, char **argv)
option_notmatched,
option_notmatchedfq,
option_notrunclabels,
+ option_orient,
option_otutabout,
option_output,
option_output_no_hits,
@@ -1273,6 +1276,7 @@ void args_init(int argc, char **argv)
{"notmatched", required_argument, 0, 0 },
{"notmatchedfq", required_argument, 0, 0 },
{"notrunclabels", no_argument, 0, 0 },
+ {"orient", required_argument, 0, 0 },
{"otutabout", required_argument, 0, 0 },
{"output", required_argument, 0, 0 },
{"output_no_hits", no_argument, 0, 0 },
@@ -2295,6 +2299,10 @@ void args_init(int argc, char **argv)
opt_derep_id = optarg;
break;
+ case option_orient:
+ opt_orient = optarg;
+ break;
+
default:
fatal("Internal error in option parsing");
}
@@ -2340,6 +2348,7 @@ void args_init(int argc, char **argv)
option_help,
option_makeudb_usearch,
option_maskfasta,
+ option_orient,
option_rereplicate,
option_search_exact,
option_sff_convert,
@@ -3398,6 +3407,34 @@ void args_init(int argc, char **argv)
option_xsize,
-1 },
+ { option_orient,
+ option_bzip2_decompress,
+ option_db,
+ option_dbmask,
+ option_fasta_width,
+ option_fastaout,
+ option_fastqout,
+ option_gzip_decompress,
+ option_log,
+ option_no_progress,
+ option_notmatched,
+ option_notrunclabels,
+ option_qmask,
+ option_quiet,
+ option_relabel,
+ option_relabel_keep,
+ option_relabel_md5,
+ option_relabel_self,
+ option_relabel_sha1,
+ option_sizein,
+ option_sizeout,
+ option_tabbedout,
+ option_threads,
+ option_wordlength,
+ option_xee,
+ option_xsize,
+ -1 },
+
{ option_rereplicate,
option_bzip2_decompress,
option_fasta_width,
@@ -4004,6 +4041,15 @@ void args_init(int argc, char **argv)
if (opt_maxrejects < 0)
fatal("The argument to --maxrejects must not be negative");
+ if (opt_wordlength == 0)
+ {
+ /* set default word length */
+ if (opt_orient)
+ opt_wordlength = 12;
+ else
+ opt_wordlength = 8;
+ }
+
if ((opt_wordlength < 3) || (opt_wordlength > 15))
fatal("The argument to --wordlength must be in the range 3 to 15");
@@ -4410,6 +4456,19 @@ void cmd_help()
" Output\n"
" --output FILENAME output to specified FASTA file\n"
"\n"
+ "Orient sequences in forward or reverse direction\n"
+ " --orient FILENAME orient sequences in given FASTA/FASTQ file\n"
+ " Data\n"
+ " --db FILENAME database of sequences in correct orientation\n"
+ " --dbmask none|dust|soft mask db seqs with dust, soft or no method (dust)\n"
+ " --qmask none|dust|soft mask query with dust, soft or no method (dust)\n"
+ " --wordlength INT length of words used for matching 3-15 (12)\n"
+ " Output\n"
+ " --fastaout FILENAME FASTA output filename for oriented sequences\n"
+ " --fastqout FILENAME FASTQ output filename for oriented sequences\n"
+ " --notmatched FILENAME output filename for undetermined sequences\n"
+ " --tabbedout FILENAME output filename for result information\n"
+ "\n"
"Paired-end reads joining\n"
" --fastq_join FILENAME join paired-end reads into one sequence with gap\n"
" Data\n"
@@ -5097,6 +5156,8 @@ int main(int argc, char** argv)
fastx_getsubseq();
else if (opt_cut)
cut();
+ else if (opt_orient)
+ orient();
else
cmd_none();
=====================================
src/vsearch.h
=====================================
@@ -255,6 +255,7 @@
#include "sffconvert.h"
#include "getseq.h"
#include "cut.h"
+#include "orient.h"
/* options */
@@ -346,6 +347,7 @@ extern char * opt_msaout;
extern char * opt_nonchimeras;
extern char * opt_notmatched;
extern char * opt_notmatchedfq;
+extern char * opt_orient;
extern char * opt_otutabout;
extern char * opt_output;
extern char * opt_pattern;
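
Two patterns recur in the hunks above: the percentage in each summary line is now printed only when the denominator is positive, and opt_wordlength starts at 0 so that the default can be chosen later (12 when --orient is given, 8 otherwise). The following is a minimal standalone sketch of both ideas, not code from vsearch. The helper names match_summary and resolve_wordlength are hypothetical and exist only for illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of vsearch): builds a summary line and
// appends the percentage only when the denominator is positive, mirroring
// the `if (queries > 0)` guard added in the diff.
std::string match_summary(const char * label, int matches, int total)
{
  char buf[64];
  std::string line = std::string(label) + ": ";
  std::snprintf(buf, sizeof buf, "%d of %d", matches, total);
  line += buf;
  if (total > 0)
    {
      // Only reached with total > 0, so no division by zero and no "nan".
      std::snprintf(buf, sizeof buf, " (%.2f%%)", 100.0 * matches / total);
      line += buf;
    }
  return line;
}

// Hypothetical helper: 0 acts as a "not set" sentinel, so the default can
// depend on the selected command, as in the args_init() hunk.
int resolve_wordlength(int requested, bool orient_mode)
{
  if (requested == 0)
    return orient_mode ? 12 : 8;
  return requested;
}
```

For example, match_summary("Matching query sequences", 0, 0) yields a line without a percentage, and resolve_wordlength(0, true) yields 12. One consequence of the sentinel approach, as far as this diff shows, is that an explicit value of 0 is indistinguishable from "not given" and is replaced by the default before the 3 to 15 range check runs.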
View it on GitLab: https://salsa.debian.org/med-team/vsearch/-/compare/87aafc8e9fce863b18df0eb6ace9584d0d947be4...ec00cc2b87f375288fb74e899705cf65eff113c0