[med-svn] [Git][med-team/vsearch][upstream] New upstream version 2.22.1
Andreas Tille (@tille)
gitlab at salsa.debian.org
Mon Sep 26 20:13:07 BST 2022
Andreas Tille pushed to branch upstream at Debian Med / vsearch
Commits:
8fddc675 by Andreas Tille at 2022-09-26T20:19:22+02:00
New upstream version 2.22.1
- - - - -
19 changed files:
- + .travis.yml
- README.md
- configure.ac
- man/vsearch.1
- src/Makefile.am
- src/cluster.cc
- src/derep.cc
- + src/derepsmallmem.cc
- + src/derepsmallmem.h
- src/eestats.cc
- src/fastx.cc
- src/fastx.h
- src/results.cc
- src/searchcore.cc
- src/sintax.cc
- src/util.cc
- src/util.h
- src/vsearch.cc
- src/vsearch.h
Changes:
=====================================
.travis.yml
=====================================
@@ -0,0 +1,40 @@
+language:
+- cpp
+
+arch:
+#- amd64
+- arm64
+- ppc64le
+
+os:
+- linux
+#- osx
+
+dist:
+- xenial
+
+osx_image:
+- xcode12.5
+
+addons:
+ apt:
+ packages:
+ - ghostscript
+ - valgrind
+ - groff
+ homebrew:
+ packages:
+ - ghostscript
+
+compiler:
+- g++
+- clang
+
+script:
+- ./autogen.sh
+- ./configure
+- make
+- export PATH=$PWD/bin:$PATH
+- git clone https://github.com/frederic-mahe/vsearch-tests.git
+- cd vsearch-tests
+- bash ./run_all_tests.sh
=====================================
README.md
=====================================
@@ -26,8 +26,8 @@ VSEARCH stands for vectorized search, as the tool takes advantage of parallelism
Various packages, plugins and wrappers are also available from other sources - see [below](https://github.com/torognes/vsearch#packages-plugins-and-wrappers).
-The source code compiles correctly with `gcc` (versions 4.8.5 to 10.2)
-and `llvm-clang` (3.8 to 13.0). The source code should also compile on
+The source code compiles correctly with `gcc` (versions 4.8.5 to 12.0)
+and `llvm-clang` (3.8 to 15.0). The source code should also compile on
[FreeBSD](https://www.freebsd.org/) and
[NetBSD](https://www.netbsd.org/) systems.
@@ -37,7 +37,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
## Getting Help
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
## Example
@@ -50,9 +50,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
```
-wget https://github.com/torognes/vsearch/archive/v2.21.1.tar.gz
-tar xzf v2.21.1.tar.gz
-cd vsearch-2.21.1
+wget https://github.com/torognes/vsearch/archive/v2.22.1.tar.gz
+tar xzf v2.22.1.tar.gz
+cd vsearch-2.22.1
./autogen.sh
./configure CFLAGS="-O3" CXXFLAGS="-O3"
make
@@ -69,7 +69,7 @@ The distributed Linux ppc64le and aarch64 binaries and the Windows binary were c
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
-./configure
+./configure CFLAGS="-O3" CXXFLAGS="-O3"
make
make install # as root or sudo make install
```
@@ -81,43 +81,43 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-x86_64.tar.gz
-tar xzf vsearch-2.21.1-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-linux-x86_64.tar.gz
+tar xzf vsearch-2.22.1-linux-x86_64.tar.gz
```
Or these commands if you are using a Linux ppc64le system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-ppc64le.tar.gz
-tar xzf vsearch-2.21.1-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-linux-ppc64le.tar.gz
+tar xzf vsearch-2.22.1-linux-ppc64le.tar.gz
```
Or these commands if you are using a Linux aarch64 (arm64) system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-aarch64.tar.gz
-tar xzf vsearch-2.21.1-linux-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-linux-aarch64.tar.gz
+tar xzf vsearch-2.22.1-linux-aarch64.tar.gz
```
Or these commands if you are using a Mac:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-macos-x86_64.tar.gz
-tar xzf vsearch-2.21.1-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-macos-x86_64.tar.gz
+tar xzf vsearch-2.22.1-macos-x86_64.tar.gz
```
Or if you are using Windows, download and extract (unzip) the contents of this file:
```
-https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-win-x86_64.zip
```
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.21.1-linux-x86_64` or `vsearch-2.21.1-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.22.1-linux-x86_64` or `vsearch-2.22.1-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
-Windows: You will now have the binary distribution in a folder called `vsearch-2.21.1-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.22.1-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
## Packages, plugins, and wrappers
=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.21.1], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
+AC_INIT([vsearch], [2.22.1], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
AC_CANONICAL_TARGET
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
.\" ============================================================================
-.TH vsearch 1 "January 18, 2022" "version 2.21.1" "USER COMMANDS"
+.TH vsearch 1 "September 19, 2022" "version 2.22.1" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em a versatile open-source tool for microbiome analysis,
@@ -43,6 +43,9 @@ Dereplication and rereplication:
\fBvsearch\fR (\-\-derep_fulllength | \-\-derep_id | \-\-derep_prefix)
\fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
.PP
+\fBvsearch\fR \-\-derep_smallmem (\fIfastafile\fR | \fIfastqfile\fR)
+\-\-fastaout \fIoutputfile\fR [\fIoptions\fR]
+.PP
\fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output
\fIoutputfile\fR [\fIoptions\fR]
.PP
@@ -1044,25 +1047,31 @@ Dereplication and rereplication options:
.PP
.RS
VSEARCH can dereplicate sequences with the commands
-\-\-derep_fulllength, \-\-derep_id, \-\-derep_prefix and
-\-\-fastx_uniques. The \-\-derep_fulllength command is depreciated and
-is replaced by the new \-\-fastx_uniques command that can also handle
-FASTQ files in addition to FASTA files. The \-\-derep_fulllength and
+\-\-derep_fulllength, \-\-derep_id, \-\-derep_smallmem,
+\-\-derep_prefix and \-\-fastx_uniques. The \-\-derep_fulllength
+command is deprecated and is replaced by the new \-\-fastx_uniques
+command that can also handle FASTQ files in addition to FASTA
+files. The \-\-derep_fulllength, \-\-derep_smallmem, and
\-\-fastx_uniques commands require strictly identical sequences of
the same length, but ignores upper/lower case and treats T and U as
identical symbols. The \-\-derep_id command requires both identical
sequences and identical headers/labels. The \-\-derep_prefix command
will group sequences with a common prefix and does not require them to
-be equally long. The \-\-fastx_uniques command can write FASTQ output
-(specified with \-\-fastqout) or FASTA output (specified with
-\-\-fastaout) as well as a special tab-separated column text format
-(with \-\-tabbedout). The other commands can write FASTA output to the
-file specified with the \-\-output option. All dereplication commands
-can write output to a special UCLUST-like file specified with the
-\-\-uc option. The \-\-rereplicate command can duplicate sequences in
-the input file according to the abundance of each input
-sequence. Other valid options are \-\-fastq_ascii, \-\-fastq_asciiout,
-\-\-fastq_qmax, \-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout,
+be equally long. The \-\-derep_smallmem command uses much less
+memory when dereplicating than the other commands, but may be a bit
+slower and cannot read its input from a pipe. It takes both FASTA and
+FASTQ files as input but only writes FASTA output to the file
+specified with the \-\-fastaout option. The \-\-fastx_uniques command
+can write FASTQ output (specified with \-\-fastqout) or FASTA output
+(specified with \-\-fastaout) as well as a special tab-separated
+column text format (with \-\-tabbedout). The other commands can write
+FASTA output to the file specified with the \-\-output option. All
+dereplication commands, except \-\-derep_smallmem, can write output to
+a special UCLUST-like file specified with the \-\-uc option. The
+\-\-rereplicate command can duplicate sequences in the input file
+according to the abundance of each input sequence. Other valid options
+are \-\-fastq_ascii, \-\-fastq_asciiout, \-\-fastq_qmax,
+\-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout,
\-\-fastq_qout_max, \-\-maxuniquesize, \-\-minuniquesize, \-\-relabel,
\-\-relabel_keep, \-\-relabel_md5, \-\-relabel_self, \-\-relabel_sha1,
\-\-sizein, \-\-sizeout, \-\-strand, \-\-topn, and \-\-xsize.
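
A brief usage sketch of the new --derep_smallmem command described above; the file names are hypothetical, and the extra options shown (--sizeout, --minuniquesize) are among those listed as valid for this command:

```sh
# Dereplicate a large FASTQ file using little memory (FASTA output only),
# annotate each unique sequence with its abundance, and drop singletons.
vsearch --derep_smallmem big.fastq \
        --fastaout uniques.fasta \
        --sizeout \
        --minuniquesize 2
```
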
@@ -1082,6 +1091,22 @@ not support multithreading.
Merge strictly identical sequences contained in \fIfilename\fR, as
with the \-\-derep_fulllength command, but the sequence labels
(identifiers) on the header line need to be identical too.
+.TAG derep_smallmem
+.TP
+.BI \-\-derep_smallmem \0filename
+Merge strictly identical sequences contained in \fIfilename\fR, as
+with the \-\-derep_fulllength command, but using much less memory. The
+output is written to a FASTA file specified with the \-\-fastaout
+option. The output is written in the order that the sequences first
+appear in the input, and not in descending abundance order as with the
+other dereplication commands. It can read, but not write, FASTQ
+files. This command cannot read from a pipe; the input must be a proper
+file, as it is read twice. Dereplication is performed with a 128-bit
+hash function, and it is not verified that grouped sequences are
+identical; however, the probability that two different sequences are
+grouped in a dataset of 1 000 000 000 unique sequences is
+approximately 1e-21.
+Multithreading and the options \-\-topn, \-\-uc, or \-\-tabbedout are
+not supported.
.TAG derep_prefix
.TP
.BI \-\-derep_prefix \0filename
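
As a side note on the collision probability quoted in the --derep_smallmem entry above: under the idealized assumption of a uniform 128-bit hash, the standard birthday bound reproduces the stated figure. A small standalone sketch (not vsearch code):

```c++
// Birthday-bound estimate of the collision probability mentioned above,
// assuming an idealized uniform 128-bit hash (not part of vsearch).
#include <cmath>
#include <cstdio>

int main() {
  const double n = 1e9;        // 1 000 000 000 unique sequences
  const double bits = 128.0;   // hash width used by --derep_smallmem
  // P(any two distinct sequences share a hash) is roughly n^2 / 2^(bits+1)
  const double p = n * n / std::pow(2.0, bits + 1.0);
  std::printf("approximate collision probability: %.1e\n", p);  // ~1.5e-21
  return 0;
}
```
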
@@ -1103,7 +1128,7 @@ header of the first sequence of their group. If \-\-sizeout is used,
the number of occurrences (i.e. abundance) of each sequence is
indicated at the end of their fasta header using the
pattern ';size=\fIinteger\fR;'. This option is only valid for
-\-\-fastx_uniques.
+\-\-fastx_uniques and \-\-derep_smallmem.
.TAG fastqout
.TP
.BI \-\-fastqout \0filename
@@ -1195,7 +1220,7 @@ header of the first sequence of their group. If \-\-sizeout is used,
the number of occurrences (i.e. abundance) of each sequence is
indicated at the end of their fasta header using the
pattern ';size=\fIinteger\fR;'. This option is not allowed for
-fastx_uniques.
+\-\-fastx_uniques or \-\-derep_smallmem.
.TP
.TAG relabel
.BI \-\-relabel \0string
@@ -1458,7 +1483,7 @@ at the end of the sequence if this option is not specified.
.TAG subseq_start
.TP
.BI \-\-subseq_start\~ "positive integer"
-Specifiy the starting position in the sequences when extracting
+Specify the starting position in the sequences when extracting
subsequences using the \-\-fastx_getsubseq command. Positions are
1-based, so the sequences start at position 1. The default is to start
at the beginning of the sequence (position 1), if this option is not
@@ -1694,13 +1719,21 @@ mismatches.
.TP
.BI \-\-fastq_maxee\~ real
When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
-discard sequences with more than the specified number of expected
-errors.
+discard sequences with an expected error greater than the specified
+number (value ranging from 0.0 to infinity). For a given sequence, the
+expected error is the sum of error probabilities for all the positions
+in the sequence. In practice, the expected error is greater than zero
+(error probabilities can be small but never zero), and at most equal to
+the length of the sequence (when all positions have an error
+probability of 1.0).
.TAG fastq_maxee_rate
.TP
.BI \-\-fastq_maxee_rate\~ real
When using \-\-fastq_filter or \-\-fastx_filter, discard sequences
-with more than the specified number of expected errors per base.
+with an average expected error greater than the specified number
+(value ranging from 0.0 to 1.0, inclusive). For a given sequence, the
+average expected error is the sum of error probabilities for all the
+positions in the sequence, divided by the length of the sequence.
.TAG fastq_maxlen
.TP
.BI \-\-fastq_maxlen\~ "positive integer"
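
To make the expected-error definitions for --fastq_maxee and --fastq_maxee_rate above concrete, here is a small standalone sketch (not vsearch code; Phred+33 quality encoding is assumed) that sums the per-position error probabilities:

```c++
// Expected error of a read from its quality string, as described above:
// sum over positions of p = 10^(-Q/10), with Q = ASCII value - 33 (assumed).
#include <cmath>
#include <cstdio>
#include <cstring>

double expected_error(const char *qual) {
  double ee = 0.0;
  for (size_t i = 0; i < std::strlen(qual); i++) {
    const int q = qual[i] - 33;          // Phred quality score
    ee += std::pow(10.0, -q / 10.0);     // per-position error probability
  }
  return ee;
}

int main() {
  // Four positions at Q30 ('?' is ASCII 63): 4 * 0.001 = 0.004 expected errors;
  // the corresponding --fastq_maxee_rate value would be 0.004 / 4 = 0.001.
  std::printf("EE = %.4f\n", expected_error("????"));
  return 0;
}
```
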
@@ -1721,10 +1754,15 @@ discard sequences with more than the specified number of N's.
.BI \-\-fastq_mergepairs\0 filename
Merge paired-end sequence reads into one sequence. The forward reads
are specified as the argument to this option and the reverse reads are
-specified with the \-\-reverse option. The merged sequences are output
-to the file(s) specified with the \-\-fastaout or \-\-fastqout
-options. The non-merged reads can be output to the files specified
-with the \-\-fastaout_notmerged_fwd, \-\-fastaout_notmerged_rev,
+specified with the \-\-reverse option. Reads with the same
+index/position in the forward and reverse files are considered to form
+a pair, even if their labels are different. Thus, forward and reverse
+reads \fBmust\fR appear in the same order and total number in both
+files. A warning is emitted if the forward and reverse files contain
+different numbers of reads. The merged sequences are written to the
+file(s) specified with the \-\-fastaout or \-\-fastqout options. The
+non-merged reads can be output to the files specified with the
+\-\-fastaout_notmerged_fwd, \-\-fastaout_notmerged_rev,
\-\-fastqout_notmerged_fwd and \-\-fastqout_notmerged_rev
options. Statistics may be output to the file specified with the
\-\-eetabbedout option. Sequences are truncated as specified with the
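
A brief usage sketch of --fastq_mergepairs as described above (file names are hypothetical); the forward and reverse files must contain the same reads, in the same order:

```sh
# Merge paired-end reads and keep the reads that could not be merged.
vsearch --fastq_mergepairs R1.fastq \
        --reverse R2.fastq \
        --fastqout merged.fastq \
        --fastqout_notmerged_fwd unmerged_R1.fastq \
        --fastqout_notmerged_rev unmerged_R2.fastq \
        --eetabbedout merge_stats.txt
```
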
@@ -2273,7 +2311,7 @@ Write the correctly oriented sequences to \fIfilename\fR, in fastq format.
.TP
.BI \-\-notmatched \0filename
Write the sequences with undetermined direction to \fIfilename\fR, in
-the orginal format.
+the original format.
.TAG orient
.TP
.BI \-\-orient \0filename
@@ -2378,7 +2416,7 @@ are written to the file specified with the \-\-fastaout file and the
fragments on the reverse strand are written to the file specified with
the \-\-fastaout_rev option. Input sequences that do not match are
written to the file specified with the option \-\-fastaout_discarded,
-and their reverse complement are also written to the file specfied
+and their reverse complement are also written to the file specified
with the \-\-fastaout_discarded_rev option. The relabel options
(\-\-relabel, \-\-relabel_self, \-\-relabel_keep, \-\-relabel_md5, and
\-\-relabel_sha1) may be used to relabel the output sequences).
@@ -2663,7 +2701,7 @@ taxonomic information in the same format as used with the \-\-sintax
command, e.g. "tax=k:Archaea,p:Euryarchaeota,c:Halobacteria". Only the
initial parts of the taxonomy that are common to a large fraction of
the hits of each query will be output. It is necessary to set the
-\-\-maxaccepts option to a value differrent from 1 for this
+\-\-maxaccepts option to a value different from 1 for this
information to be useful. The \-\-top_hits_only option may also be
useful. The fraction of matching hits required may be adjusted by the
\-\-lca_cutoff option (default 1.0).
@@ -2845,9 +2883,9 @@ FASTA format.
.TP
.BI \-\-query_cov \0real
Reject if the fraction of the query aligned to the target sequence is
-lower than \fIreal\fR. The query coverage is computed as
-(matches + mismatches) / query sequence length. Internal or terminal
-gaps are not taken into account.
+lower than \fIreal\fR (value ranging from 0.0 to 1.0, inclusive). The
+query coverage is computed as (matches + mismatches) / query sequence
+length. Internal or terminal gaps are not taken into account.
.TAG rightjust
.TP
.B \-\-rightjust
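
A small worked example of the query coverage formula given for --query_cov above (the numbers are hypothetical):

```c++
// Query coverage as defined above: (matches + mismatches) / query length.
// Alignment columns falling in internal or terminal gaps do not count.
#include <cstdio>

int main() {
  const int matches = 180;        // aligned, identical columns (hypothetical)
  const int mismatches = 10;      // aligned, differing columns (hypothetical)
  const int query_length = 220;   // 30 query bases fall in gaps or overhangs
  const double cov = static_cast<double>(matches + mismatches) / query_length;
  std::printf("query coverage = %.3f\n", cov);  // 0.864, passes --query_cov 0.8
  return 0;
}
```
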
@@ -2912,6 +2950,7 @@ as usearch does).
.IP \n+[step].
quality string (ignored, always set to '*').
.RE
+.TP
Optional fields for query-target matches (number and order of fields may vary):
.RS
.nr step 12 1
@@ -3415,7 +3454,7 @@ allowing spaces in the taxonomic identifiers.
.TP 9
.BI \-\-db \0filename
Read the reference sequences from \fIfilename\fR, in FASTA, FASTQ or
-UDB format. These sequences needs to be annotated with taxonomy.
+UDB format. These sequences need to be annotated with taxonomy.
.TAG randseed
.TP
.BI \-\-randseed\~ "positive integer"
@@ -4631,6 +4670,17 @@ the use of UDB databases with uchime_ref.
.BR v2.21.1\~ "released January 18th, 2022"
Fix a problem with dereplication of empty input files. Update Altivec
code on ppc64le for improved compiler compatibility (vector->__vector).
+.TP
+.BR v2.21.2\~ "released September 12th, 2022"
+Fix problems with the lcaout option when using maxaccepts above 1 and
+either lca_cutoff below 1 or top_hits_only enabled. Update
+documentation. Update code to avoid compiler warnings.
+.TP
+.BR v2.22.0\~ "released September 19th, 2022"
+Add the derep_smallmem command for dereplication using little memory.
+.TP
+.BR v2.22.1\~ "released September 19th, 2022"
+Fix compiler warning.
.LP
.\" ============================================================================
.\" TODO:
=====================================
src/Makefile.am
=====================================
@@ -31,6 +31,7 @@ db.h \
dbhash.h \
dbindex.h \
derep.h \
+derepsmallmem.h \
dynlibs.h \
eestats.h \
fa2fq.h \
@@ -126,6 +127,7 @@ db.cc \
dbhash.cc \
dbindex.cc \
derep.cc \
+derepsmallmem.cc \
dynlibs.cc \
eestats.cc \
fa2fq.cc \
=====================================
src/cluster.cc
=====================================
@@ -632,8 +632,6 @@ void cluster_core_parallel()
opt_gap_extension_query_right,
opt_gap_extension_target_right);
- int aligncount = 0;
-
int lastlength = INT_MAX;
int seqno = 0;
@@ -799,8 +797,6 @@ void cluster_core_parallel()
unsigned int target = hit->target;
if (search_acceptable_unaligned(si, target))
{
- aligncount++;
-
/* perform vectorized alignment */
/* but only using 1 sequence ! */
@@ -1017,11 +1013,6 @@ void cluster_core_parallel()
}
progress_done();
-#if 0
- if (!opt_quiet)
- fprintf(stderr, "Extra alignments computed: %d\n", aligncount);
-#endif
-
/* clean up search info */
for(int i = 0; i < max_queries; i++)
{
=====================================
src/derep.cc
=====================================
@@ -1103,11 +1103,9 @@ void derep_prefix()
/* adjust size of hash table for 2/3 fill rate */
int64_t hashtablesize = 1;
- int hash_shift = 0;
while (3 * dbsequencecount > 2 * hashtablesize)
{
hashtablesize <<= 1;
- hash_shift++;
}
int hash_mask = hashtablesize - 1;
=====================================
src/derepsmallmem.cc
=====================================
@@ -0,0 +1,657 @@
+/*
+
+ VSEARCH: a versatile open source tool for metagenomics
+
+ Copyright (C) 2014-2022, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+ All rights reserved.
+
+ Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+ Department of Informatics, University of Oslo,
+ PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+ This software is dual-licensed and available under a choice
+ of one of two licenses, either under the terms of the GNU
+ General Public License version 3 or the BSD 2-Clause License.
+
+
+ GNU General Public License version 3
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+
+ The BSD 2-Clause License
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "vsearch.h"
+
+#define HASH hash_cityhash128
+
+struct bucket
+{
+ uint128 hash;
+ uint64_t size;
+};
+
+static struct bucket * hashtable = nullptr;
+static uint64_t hashtablesize = 0;
+
+double find_median()
+{
+ /* find the median size, based on an iterative search starting at e.g. 1 */
+
+ uint64_t cand = 1; /* candidate for the median */
+ uint64_t below = 0; /* closest value below the candidate */
+ uint64_t above = 0; /* closest value above the candidate */
+
+ uint64_t cand_count; /* number of clusters with same size as cand */
+ uint64_t below_count; /* number of clusters with smaller size than cand */
+ uint64_t above_count; /* number of clusters with larger size than cand */
+
+ while (true)
+ {
+ cand_count = 0;
+ below_count = 0;
+ above_count = 0;
+
+ for(uint64_t i = 0; i < hashtablesize; i++)
+ {
+ uint64_t v = hashtable[i].size;
+ if (v > 0)
+ {
+ if (v > cand)
+ {
+ if ((above_count == 0) || (v < above))
+ {
+ above = v;
+ }
+ above_count++;
+ }
+ else if (v < cand)
+ {
+ if ((below_count == 0) || (v > below))
+ {
+ below = v;
+ }
+ below_count++;
+ }
+ else
+ {
+ cand_count++;
+ }
+ }
+ }
+
+ double mid = below_count + cand_count + above_count;
+ if (mid == 0)
+ return 0;
+ mid = mid / 2.0;
+
+ if (mid >= below_count)
+ {
+ if (mid <= below_count + cand_count)
+ {
+ if (mid == below_count + cand_count)
+ {
+ return (cand + above) / 2.0;
+ }
+ else if (mid == below_count)
+ {
+ return (below + cand) / 2.0;
+ }
+ else
+ {
+ return cand;
+ }
+ }
+ else
+ {
+ cand = above;
+ }
+ }
+ else
+ {
+ cand = below;
+ }
+ }
+}
+
+uint64_t inline hash2bucket(uint128 hash, uint64_t htsize)
+{
+ return Uint128Low64(hash) % htsize;
+}
+
+uint64_t inline next_bucket(uint64_t prev_bucket, uint64_t htsize)
+{
+ return (prev_bucket + 1) % htsize;
+}
+
+void rehash_smallmem()
+{
+ /* allocate new hash table, 50% larger */
+ uint64_t new_hashtablesize = 3 * hashtablesize / 2;
+ auto * new_hashtable =
+ (struct bucket *) xmalloc(sizeof(bucket) * new_hashtablesize);
+
+ /* zero new hash table */
+ for(uint64_t j = 0; j < new_hashtablesize; j++)
+ {
+ new_hashtable[j].hash.first = 0;
+ new_hashtable[j].hash.second = 0;
+ new_hashtable[j].size = 0;
+ }
+
+ /* rehash all from old to new */
+ for(uint64_t i = 0; i < hashtablesize; i++)
+ {
+ struct bucket * old_bp = hashtable + i;
+ if (old_bp->size)
+ {
+ uint64_t k = hash2bucket(old_bp->hash, new_hashtablesize);
+ while (new_hashtable[k].size)
+ {
+ k = next_bucket(k, new_hashtablesize);
+ }
+ struct bucket * new_bp = new_hashtable + k;
+ * new_bp = * old_bp;
+ }
+ }
+
+ /* free old table */
+ xfree(hashtable);
+
+ /* update variables */
+ hashtable = new_hashtable;
+ hashtablesize = new_hashtablesize;
+}
+
+void derep_smallmem(char * input_filename)
+{
+ /*
+ dereplicate full length sequences using a small amount of memory
+ output options: --fastaout
+ */
+
+ show_rusage();
+
+ fastx_handle h = fastx_open(input_filename);
+
+ if (!h)
+ {
+ fatal("Unrecognized input file type (not proper FASTA or FASTQ format).");
+ }
+
+ if (h->is_pipe)
+ {
+ fatal("The derep_smallmem command does not support input from a pipe.");
+ }
+
+ FILE * fp_fastaout = nullptr;
+
+ if (opt_fastaout)
+ {
+ fp_fastaout = fopen_output(opt_fastaout);
+ if (!fp_fastaout)
+ {
+ fatal("Unable to open FASTA output file for writing");
+ }
+ }
+ else
+ {
+ fatal("Output file for dereplication must be specified with --fastaout");
+ }
+
+ uint64_t filesize = fastx_get_size(h);
+
+ /* allocate initial memory for sequences of length up to 1023 chars */
+ int64_t alloc_seqlen = 1024;
+
+ /* allocate initial hashtable with 1024 buckets */
+
+ hashtablesize = 1024;
+ hashtable = (struct bucket *) xmalloc(sizeof(bucket) * hashtablesize);
+
+ /* zero hash table */
+ for(uint64_t j = 0; j < hashtablesize; j++)
+ {
+ hashtable[j].hash.first = 0;
+ hashtable[j].hash.second = 0;
+ hashtable[j].size = 0;
+ }
+
+ show_rusage();
+
+ char * seq_up = (char*) xmalloc(alloc_seqlen + 1);
+ char * rc_seq_up = (char*) xmalloc(alloc_seqlen + 1);
+
+ char * prompt = nullptr;
+ if (xsprintf(& prompt, "Dereplicating file %s", input_filename) == -1)
+ {
+ fatal("Out of memory");
+ }
+
+ progress_init(prompt, filesize);
+
+ uint64_t sequencecount = 0;
+ uint64_t nucleotidecount = 0;
+ int64_t shortest = INT64_MAX;
+ int64_t longest = 0;
+ uint64_t discarded_short = 0;
+ uint64_t discarded_long = 0;
+ uint64_t clusters = 0;
+ int64_t sumsize = 0;
+ uint64_t maxsize = 0;
+ double median = 0.0;
+ double average = 0.0;
+
+ /* first pass */
+
+ while(fastx_next(h, ! opt_notrunclabels, chrmap_no_change))
+ {
+ int64_t seqlen = fastx_get_sequence_length(h);
+
+ if (seqlen < opt_minseqlength)
+ {
+ discarded_short++;
+ continue;
+ }
+
+ if (seqlen > opt_maxseqlength)
+ {
+ discarded_long++;
+ continue;
+ }
+
+ nucleotidecount += seqlen;
+ if (seqlen > longest)
+ {
+ longest = seqlen;
+ }
+ if (seqlen < shortest)
+ {
+ shortest = seqlen;
+ }
+
+ /* check allocations */
+
+ if (seqlen > alloc_seqlen)
+ {
+ alloc_seqlen = seqlen;
+ seq_up = (char*) xrealloc(seq_up, alloc_seqlen + 1);
+ rc_seq_up = (char*) xrealloc(rc_seq_up, alloc_seqlen + 1);
+
+ show_rusage();
+ }
+
+ if (100 * (clusters + 1) > 95 * hashtablesize)
+ {
+ /* keep hash table fill rate at max 95% */
+ rehash_smallmem();
+ show_rusage();
+ }
+
+ char * seq = fastx_get_sequence(h);
+
+ /* normalize sequence: uppercase and replace U by T */
+ string_normalize(seq_up, seq, seqlen);
+
+ /* reverse complement if necessary */
+ if (opt_strand > 1)
+ {
+ reverse_complement(rc_seq_up, seq_up, seqlen);
+ }
+
+ /*
+ Find free bucket or bucket for identical sequence.
+ Make sure sequences are exactly identical
+ in case of any hash collision.
+ With 64-bit hashes, there is about 50% chance of a
+ collision when the number of sequences is about 5e9.
+ */
+
+ uint128 hash = HASH(seq_up, seqlen);
+ uint64_t j = hash2bucket(hash, hashtablesize);
+ struct bucket * bp = hashtable + j;
+
+ while ((bp->size) && (hash != bp->hash))
+ {
+ j = next_bucket(j, hashtablesize);
+ bp = hashtable + j;
+ }
+
+ if ((opt_strand > 1) && !bp->size)
+ {
+ /* no match on plus strand */
+ /* check minus strand as well */
+
+ uint128 rc_hash = HASH(rc_seq_up, seqlen);
+ uint64_t k = hash2bucket(rc_hash, hashtablesize);
+ struct bucket * rc_bp = hashtable + k;
+
+ while ((rc_bp->size) && (rc_hash != rc_bp->hash))
+ {
+ k = next_bucket(k, hashtablesize);
+ rc_bp = hashtable + k;
+ }
+
+ if (rc_bp->size)
+ {
+ bp = rc_bp;
+ j = k;
+ }
+ }
+
+ int abundance = fastx_get_abundance(h);
+ int64_t ab = opt_sizein ? abundance : 1;
+ sumsize += ab;
+
+ if (bp->size)
+ {
+ /* at least one identical sequence already */
+ bp->size += ab;
+ }
+ else
+ {
+ /* no identical sequences yet */
+ bp->size = ab;
+ bp->hash = hash;
+ clusters++;
+ }
+
+ if (bp->size > maxsize)
+ {
+ maxsize = bp->size;
+ }
+
+ sequencecount++;
+ progress_update(fastx_get_position(h));
+ }
+ progress_done();
+ xfree(prompt);
+ fastx_close(h);
+
+ show_rusage();
+
+ if (!opt_quiet)
+ {
+ if (sequencecount > 0)
+ {
+ fprintf(stderr,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64
+ ", max %'" PRIu64 ", avg %'.0f\n",
+ nucleotidecount,
+ sequencecount,
+ shortest,
+ longest,
+ nucleotidecount * 1.0 / sequencecount);
+ }
+ else
+ {
+ fprintf(stderr,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs\n",
+ nucleotidecount,
+ sequencecount);
+ }
+ }
+
+ if (opt_log)
+ {
+ if (sequencecount > 0)
+ {
+ fprintf(fp_log,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64
+ ", max %'" PRIu64 ", avg %'.0f\n",
+ nucleotidecount,
+ sequencecount,
+ shortest,
+ longest,
+ nucleotidecount * 1.0 / sequencecount);
+ }
+ else
+ {
+ fprintf(fp_log,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs\n",
+ nucleotidecount,
+ sequencecount);
+ }
+ }
+
+ if (discarded_short)
+ {
+ fprintf(stderr,
+ "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n",
+ opt_minseqlength,
+ discarded_short,
+ (discarded_short == 1 ? "sequence" : "sequences"));
+
+ if (opt_log)
+ {
+ fprintf(fp_log,
+ "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n",
+ opt_minseqlength,
+ discarded_short,
+ (discarded_short == 1 ? "sequence" : "sequences"));
+ }
+ }
+
+ if (discarded_long)
+ {
+ fprintf(stderr,
+ "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n",
+ opt_maxseqlength,
+ discarded_long,
+ (discarded_long == 1 ? "sequence" : "sequences"));
+
+ if (opt_log)
+ {
+ fprintf(fp_log,
+ "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n",
+ opt_maxseqlength,
+ discarded_long,
+ (discarded_long == 1 ? "sequence" : "sequences"));
+ }
+ }
+
+
+ show_rusage();
+
+ average = 1.0 * sumsize / clusters;
+ median = find_median();
+
+ if (clusters < 1)
+ {
+ if (!opt_quiet)
+ {
+ fprintf(stderr,
+ "0 unique sequences\n");
+ }
+ if (opt_log)
+ {
+ fprintf(fp_log,
+ "0 unique sequences\n\n");
+ }
+ }
+ else
+ {
+ if (!opt_quiet)
+ {
+ fprintf(stderr,
+ "%" PRId64
+ " unique sequences, avg cluster %.1lf, median %.0f, max %"
+ PRIu64 "\n",
+ clusters, average, median, maxsize);
+ }
+ if (opt_log)
+ {
+ fprintf(fp_log,
+ "%" PRId64
+ " unique sequences, avg cluster %.1lf, median %.0f, max %"
+ PRIu64 "\n\n",
+ clusters, average, median, maxsize);
+ }
+ }
+
+ show_rusage();
+
+ /* second pass with output */
+
+ fastx_handle h2 = fastx_open(input_filename);
+ if (!h2)
+ {
+ fatal("Cannot open and read from the input file.");
+ }
+
+ progress_init("Writing FASTA output file", filesize);
+
+ uint64_t selected = 0;
+
+ while(fastx_next(h2, ! opt_notrunclabels, chrmap_no_change))
+ {
+ int64_t seqlen = fastx_get_sequence_length(h2);
+
+ if ((seqlen < opt_minseqlength) || (seqlen > opt_maxseqlength))
+ {
+ continue;
+ }
+
+ char * seq = fastx_get_sequence(h2);
+
+ /* normalize sequence: uppercase and replace U by T */
+ string_normalize(seq_up, seq, seqlen);
+
+ /* reverse complement if necessary */
+ if (opt_strand > 1)
+ {
+ reverse_complement(rc_seq_up, seq_up, seqlen);
+ }
+
+ uint128 hash = HASH(seq_up, seqlen);
+ uint64_t j = hash2bucket(hash, hashtablesize);
+ struct bucket * bp = hashtable + j;
+
+ while ((bp->size) && (hash != bp->hash))
+ {
+ j = next_bucket(j, hashtablesize);
+ bp = hashtable + j;
+ }
+
+ if ((opt_strand > 1) && ! bp->size)
+ {
+ /* no match on plus strand */
+ /* check minus strand as well */
+
+ uint128 rc_hash = HASH(rc_seq_up, seqlen);
+ uint64_t k = hash2bucket(rc_hash, hashtablesize);
+ struct bucket * rc_bp = hashtable + k;
+
+ while ((rc_bp->size) && (rc_hash != rc_bp->hash))
+ {
+ k = next_bucket(k, hashtablesize);
+ rc_bp = hashtable + k;
+ }
+
+ if (rc_bp->size)
+ {
+ bp = rc_bp;
+ j = k;
+ }
+ }
+
+ int64_t size = bp->size;
+
+ if (size > 0)
+ {
+ /* print sequence */
+
+ char * header = fastx_get_header(h2);
+ int headerlen = fastx_get_header_length(h2);
+
+ if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize))
+ {
+ selected++;
+ fasta_print_general(fp_fastaout,
+ nullptr,
+ seq,
+ seqlen,
+ header,
+ headerlen,
+ size,
+ selected,
+ -1.0,
+ -1, -1, nullptr, 0.0);
+ }
+ bp->size = -1;
+ }
+
+ progress_update(fastx_get_position(h2));
+ }
+ progress_done();
+ fastx_close(h2);
+ fclose(fp_fastaout);
+
+ show_rusage();
+
+ if (selected < clusters)
+ {
+ if (!opt_quiet)
+ {
+ fprintf(stderr,
+ "%" PRId64 " uniques written, %"
+ PRId64 " clusters discarded (%.1f%%)\n",
+ selected, clusters - selected,
+ 100.0 * (clusters - selected) / clusters);
+ }
+
+ if (opt_log)
+ {
+ fprintf(fp_log,
+ "%" PRId64 " uniques written, %"
+ PRId64 " clusters discarded (%.1f%%)\n\n",
+ selected, clusters - selected,
+ 100.0 * (clusters - selected) / clusters);
+ }
+ }
+
+ show_rusage();
+
+ xfree(seq_up);
+ xfree(rc_seq_up);
+ xfree(hashtable);
+
+ show_rusage();
+}
=====================================
src/derepsmallmem.h
=====================================
@@ -0,0 +1,61 @@
+/*
+
+ VSEARCH: a versatile open source tool for metagenomics
+
+ Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+ All rights reserved.
+
+ Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+ Department of Informatics, University of Oslo,
+ PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+ This software is dual-licensed and available under a choice
+ of one of two licenses, either under the terms of the GNU
+ General Public License version 3 or the BSD 2-Clause License.
+
+
+ GNU General Public License version 3
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+
+ The BSD 2-Clause License
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+void derep_smallmem(char * input_filename);
=====================================
src/eestats.cc
=====================================
@@ -138,7 +138,6 @@ void fastq_eestats()
progress_init("Reading FASTQ file", filesize);
uint64_t seq_count = 0;
- uint64_t symbols = 0;
int64_t len_alloc = 10;
@@ -221,8 +220,6 @@ void fastq_eestats()
/* update quality statistics */
- symbols += len;
-
double ee = 0.0;
for(int64_t i=0; i < len; i++)
=====================================
src/fastx.cc
=====================================
@@ -469,6 +469,11 @@ bool fastx_is_empty(fastx_handle h)
return h->is_empty;
}
+bool fastx_is_pipe(fastx_handle h)
+{
+ return h->is_pipe;
+}
+
void fastx_close(fastx_handle h)
{
/* Warn about stripped chars */
=====================================
src/fastx.h
=====================================
@@ -116,6 +116,7 @@ typedef struct fastx_s * fastx_handle;
bool fastx_is_fastq(fastx_handle h);
bool fastx_is_empty(fastx_handle h);
+bool fastx_is_pipe(fastx_handle h);
void fastx_filter_header(fastx_handle h, bool truncateatspace);
fastx_handle fastx_open(const char * filename);
void fastx_close(fastx_handle h);
=====================================
src/results.cc
=====================================
@@ -525,63 +525,131 @@ void results_show_lcaout(FILE * fp,
/* Output last common ancestor (LCA) of the hits,
in a similar way to the Sintax command */
- int first_level_start[tax_levels];
- int first_level_len[tax_levels];
- int level_match[tax_levels];
- char * first_h = nullptr;
+ /* Use a modified Boyer-Moore majority voting algorithm at each taxonomic
+ level to find the most common name at each level */
fprintf(fp, "%s\t", query_head);
- if (hitcount > 0)
+ int votes[tax_levels];
+ int cand[tax_levels];
+ int cand_level_start[tax_levels][tax_levels];
+ int cand_level_len[tax_levels][tax_levels];
+ int level_match[tax_levels];
+
+ for (int k = 0; k < tax_levels; k++)
+ {
+ votes[k] = 0;
+ cand[k] = -1;
+ level_match[k] = 0;
+ }
+
+ double top_hit_id = hits[0].id;
+ int tophitcount = 0;
+
+ for (int t = 0; t < hitcount; t++)
{
- for (int t = 0; t < hitcount; t++)
+ struct hit * hp = hits + t;
+
+ if (opt_top_hits_only && (hp->id < top_hit_id))
{
- int seqno = hits[t].target;
- if (t == 0)
+ break;
+ }
+
+ tophitcount++;
+
+ int seqno = hp->target;
+ int new_level_start[tax_levels];
+ int new_level_len[tax_levels];
+ tax_split(seqno, new_level_start, new_level_len);
+
+ for (int k = 0; k < tax_levels; k++)
+ {
+ if (votes[k] == 0)
{
- tax_split(seqno, first_level_start, first_level_len);
- first_h = db_getheader(seqno);
+ cand[k] = seqno;
+ votes[k] = 1;
for (int j = 0; j < tax_levels; j++)
{
- level_match[j] = 1;
+ cand_level_start[k][j] = new_level_start[j];
+ cand_level_len[k][j] = new_level_len[j];
}
}
else
{
- int level_start[tax_levels];
- int level_len[tax_levels];
- tax_split(seqno, level_start, level_len);
- char * h = db_getheader(seqno);
- for (int j = 0; j < tax_levels; j++)
+ bool match = true;
+ for (int j = 0; j <= k; j++)
{
- /* For each taxonomic level */
- if ((level_len[j] == first_level_len[j]) &&
- (strncmp(first_h + first_level_start[j],
- h + level_start[j],
- level_len[j]) == 0))
+ if ((new_level_len[j] != cand_level_len[k][j]) ||
+ (strncmp(db_getheader(cand[k]) + cand_level_start[k][j],
+ db_getheader(seqno) + new_level_start[j],
+ new_level_len[j]) != 0))
{
- level_match[j]++;
+ match = false;
+ break;
}
}
+ if (match)
+ {
+ votes[k]++;
+ }
+ else
+ {
+ votes[k]--;
+ }
+ }
+ }
+ }
+
+ /* count actual matches to the candidate at each level */
+
+ for (int t = 0; t < tophitcount; t++)
+ {
+ int seqno = hits[t].target;
+ int new_level_start[tax_levels];
+ int new_level_len[tax_levels];
+ tax_split(seqno, new_level_start, new_level_len);
+
+ for (int k = 0; k < tax_levels; k++)
+ {
+ bool match = true;
+ for (int j = 0; j <= k; j++)
+ {
+ if ((new_level_len[j] != cand_level_len[k][j]) ||
+ (strncmp(db_getheader(cand[k]) + cand_level_start[k][j],
+ db_getheader(seqno) + new_level_start[j],
+ new_level_len[j]) != 0))
+ {
+ match = false;
+ break;
+ }
+ }
+ if (match)
+ {
+ level_match[k]++;
}
}
+ }
+ /* output results */
+
+ if (tophitcount > 0)
+ {
bool comma = false;
for (int j = 0; j < tax_levels; j++)
{
- if (1.0 * level_match[j] / hitcount < opt_lca_cutoff)
+ if (1.0 * level_match[j] / tophitcount < opt_lca_cutoff)
{
break;
}
- if (first_level_len[j] > 0)
+ if (cand_level_len[j][j] > 0)
{
fprintf(fp,
"%s%c:%.*s",
(comma ? "," : ""),
tax_letters[j],
- first_level_len[j],
- first_h + first_level_start[j]);
+ cand_level_len[j][j],
+ db_getheader(cand[j]) + cand_level_start[j][j]);
comma = true;
}
}
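
The results.cc change above switches the LCA output to a modified Boyer-Moore majority vote per taxonomic level, followed by a pass that counts the actual matches to each candidate. For readers unfamiliar with the algorithm, a generic standalone sketch (not the vsearch implementation, which compares header substrings level by level):

```c++
// Generic Boyer-Moore majority vote sketch (not vsearch code): one pass keeps
// a single candidate and a vote counter, a second pass counts how many items
// actually match the surviving candidate.
#include <cstdio>
#include <string>
#include <vector>

int main() {
  const std::vector<std::string> names = {"Bacteria", "Bacteria", "Archaea", "Bacteria"};

  std::string cand;
  int votes = 0;
  for (const auto &n : names) {          // voting pass
    if (votes == 0) { cand = n; votes = 1; }
    else if (n == cand) { ++votes; }
    else { --votes; }
  }

  int matched = 0;                       // verification pass
  for (const auto &n : names) {
    if (n == cand) { ++matched; }
  }

  std::printf("candidate: %s (%d of %zu hits)\n", cand.c_str(), matched, names.size());
  return 0;
}
```
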
=====================================
src/searchcore.cc
=====================================
@@ -749,7 +749,6 @@ void search_onequery(struct searchinfo_s * si, int seqmask)
int delayed = 0;
- int t = 0;
while ((si->finalized + delayed < opt_maxaccepts + opt_maxrejects - 1) &&
(si->rejects < opt_maxrejects) &&
(si->accepts < opt_maxaccepts) &&
@@ -785,7 +784,6 @@ void search_onequery(struct searchinfo_s * si, int seqmask)
align_delayed(si);
delayed = 0;
}
- t++;
}
if (delayed > 0)
{
=====================================
src/sintax.cc
=====================================
@@ -199,11 +199,11 @@ void sintax_analyse(char * query_head,
{
if (opt_sintax_cutoff > 0.0)
{
- fprintf(fp_tabbedout, "\t\t\t");
+ fprintf(fp_tabbedout, "\t\t");
}
else
{
- fprintf(fp_tabbedout, "\t\t");
+ fprintf(fp_tabbedout, "\t");
}
}
=====================================
src/util.cc
=====================================
@@ -194,6 +194,11 @@ uint64_t hash_cityhash64(char * s, uint64_t n)
return CityHash64((const char*)s, n);
}
+uint128 hash_cityhash128(char * s, uint64_t n)
+{
+ return CityHash128((const char*)s, n);
+}
+
int64_t getusec()
{
struct timeval tv;
=====================================
src/util.h
=====================================
@@ -84,6 +84,7 @@ char * xstrdup(const char *s);
char * xstrchrnul(char *s, int c);
int xsprintf(char * * ret, const char * format, ...);
uint64_t hash_cityhash64(char * s, uint64_t n);
+uint128 hash_cityhash128(char * s, uint64_t n);
int64_t getusec();
void show_rusage();
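
The new hash_cityhash128() wrapper above exposes CityHash's 128-bit hash; derepsmallmem.cc uses the low 64 bits for bucket placement and keeps the full 128-bit value for comparison. A standalone sketch, assuming the CityHash header (city.h) bundled with the vsearch sources is on the include path:

```c++
// How the 128-bit hash is used for bucket placement (assumes CityHash's city.h).
#include <cstdio>
#include <cstring>
#include "city.h"

int main() {
  const char *seq = "ACGTACGTACGT";                 // normalized sequence
  const uint128 h = CityHash128(seq, std::strlen(seq));
  const uint64_t bucket = Uint128Low64(h) % 1024;   // cf. hash2bucket()
  std::printf("low64=%llu high64=%llu bucket=%llu\n",
              (unsigned long long) Uint128Low64(h),
              (unsigned long long) Uint128High64(h),
              (unsigned long long) bucket);
  return 0;
}
```
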
=====================================
src/vsearch.cc
=====================================
@@ -105,6 +105,7 @@ char * opt_dbnotmatched;
char * opt_derep_fulllength;
char * opt_derep_id;
char * opt_derep_prefix;
+char * opt_derep_smallmem;
char * opt_eetabbedout;
char * opt_fasta2fastq;
char * opt_fastaout;
@@ -764,6 +765,7 @@ void args_init(int argc, char **argv)
opt_derep_fulllength = nullptr;
opt_derep_id = nullptr;
opt_derep_prefix = nullptr;
+ opt_derep_smallmem = nullptr;
opt_dn = 1.4;
opt_ee_cutoffs_count = 3;
opt_ee_cutoffs_values = (double*) xmalloc(opt_ee_cutoffs_count * sizeof(double));
@@ -1012,6 +1014,7 @@ void args_init(int argc, char **argv)
option_derep_fulllength,
option_derep_id,
option_derep_prefix,
+ option_derep_smallmem,
option_dn,
option_ee_cutoffs,
option_eeout,
@@ -1250,6 +1253,7 @@ void args_init(int argc, char **argv)
{"derep_fulllength", required_argument, nullptr, 0 },
{"derep_id", required_argument, nullptr, 0 },
{"derep_prefix", required_argument, nullptr, 0 },
+ {"derep_smallmem", required_argument, nullptr, 0 },
{"dn", required_argument, nullptr, 0 },
{"ee_cutoffs", required_argument, nullptr, 0 },
{"eeout", no_argument, nullptr, 0 },
@@ -2475,6 +2479,10 @@ void args_init(int argc, char **argv)
opt_tsegout = optarg;
break;
+ case option_derep_smallmem:
+ opt_derep_smallmem = optarg;
+ break;
+
default:
fatal("Internal error in option parsing");
}
@@ -2505,6 +2513,7 @@ void args_init(int argc, char **argv)
option_derep_fulllength,
option_derep_id,
option_derep_prefix,
+ option_derep_smallmem,
option_fasta2fastq,
option_fastq_chars,
option_fastq_convert,
@@ -3137,6 +3146,37 @@ void args_init(int argc, char **argv)
option_xsize,
-1 },
+ { option_derep_smallmem,
+ option_bzip2_decompress,
+ option_fasta_width,
+ option_fastaout,
+ option_fastq_ascii,
+ option_fastq_qmax,
+ option_fastq_qmin,
+ option_gzip_decompress,
+ option_label_suffix,
+ option_log,
+ option_maxseqlength,
+ option_maxuniquesize,
+ option_minseqlength,
+ option_minuniquesize,
+ option_no_progress,
+ option_notrunclabels,
+ option_quiet,
+ option_relabel,
+ option_relabel_keep,
+ option_relabel_md5,
+ option_relabel_self,
+ option_relabel_sha1,
+ option_sample,
+ option_sizein,
+ option_sizeout,
+ option_strand,
+ option_threads,
+ option_xee,
+ option_xsize,
+ -1 },
+
{ option_fasta2fastq,
option_bzip2_decompress,
option_fastq_asciiout,
@@ -3873,6 +3913,8 @@ void args_init(int argc, char **argv)
option_gzip_decompress,
option_label_suffix,
option_log,
+ option_maxseqlength,
+ option_minseqlength,
option_no_progress,
option_notrunclabels,
option_quiet,
@@ -4829,6 +4871,7 @@ void cmd_help()
" --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n"
" --derep_id FILENAME dereplicate using both identifiers and sequences\n"
" --derep_prefix FILENAME dereplicate sequences in file based on prefixes\n"
+ " --derep_smallmem FILENAME dereplicate sequences in file using less memory\n"
" --fastx_uniques FILENAME dereplicate sequences in the FASTA/FASTQ file\n"
" --rereplicate FILENAME rereplicate sequences in the given FASTA file\n"
" Parameters\n"
@@ -5269,7 +5312,7 @@ void cmd_none()
if (! opt_quiet)
{
fprintf(stderr,
- "For help, please enter: %s --help | more\n"
+ "For more help, please enter: %s --help\n"
"For further details, please consult the manual by entering: man vsearch\n"
"\n"
"Selected command examples:\n"
@@ -5301,9 +5344,9 @@ void cmd_none()
"vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME\n"
"\n"
"Other commands: cluster_fast, cluster_smallmem, cluster_unoise, cut,\n"
- " derep_id, derep_fulllength, derep_prefix, fasta2fastq,\n"
- " fastq_filter, fastq_join, fastx_getseqs, fastx_getsubseqs,\n"
- " maskfasta, orient, rereplicate, uchime2_denovo,\n"
+ " derep_id, derep_fulllength, derep_prefix, derep_smallmem,\n"
+ " fasta2fastq, fastq_filter, fastq_join, fastx_getseqs,\n"
+ " fastx_getsubseq, maskfasta, orient, rereplicate, uchime2_denovo,\n"
" uchime3_denovo, udb2fasta, udbinfo, udbstats, version\n"
"\n",
progname);
@@ -5529,6 +5572,10 @@ int main(int argc, char** argv)
{
derep_prefix();
}
+ else if (opt_derep_smallmem)
+ {
+ derep_smallmem(opt_derep_smallmem);
+ }
else if (opt_derep_id)
{
derep(opt_derep_id, true);
=====================================
src/vsearch.h
=====================================
@@ -258,6 +258,7 @@
#include "cut.h"
#include "orient.h"
#include "fa2fq.h"
+#include "derepsmallmem.h"
/* options */
View it on GitLab: https://salsa.debian.org/med-team/vsearch/-/commit/8fddc675f40a794cb43d1ca5878354ec1b6a2d4d