[med-svn] [Git][med-team/vsearch][upstream] New upstream version 2.22.1

Andreas Tille (@tille) gitlab at salsa.debian.org
Mon Sep 26 20:13:07 BST 2022



Andreas Tille pushed to branch upstream at Debian Med / vsearch


Commits:
8fddc675 by Andreas Tille at 2022-09-26T20:19:22+02:00
New upstream version 2.22.1
- - - - -


19 changed files:

- + .travis.yml
- README.md
- configure.ac
- man/vsearch.1
- src/Makefile.am
- src/cluster.cc
- src/derep.cc
- + src/derepsmallmem.cc
- + src/derepsmallmem.h
- src/eestats.cc
- src/fastx.cc
- src/fastx.h
- src/results.cc
- src/searchcore.cc
- src/sintax.cc
- src/util.cc
- src/util.h
- src/vsearch.cc
- src/vsearch.h


Changes:

=====================================
.travis.yml
=====================================
@@ -0,0 +1,40 @@
+language:
+- cpp
+
+arch:
+#- amd64
+- arm64
+- ppc64le
+
+os:
+- linux
+#- osx
+
+dist:
+- xenial
+
+osx_image:
+- xcode12.5
+
+addons:
+  apt:
+    packages:
+    - ghostscript
+    - valgrind
+    - groff
+  homebrew:
+    packages:
+    - ghostscript
+
+compiler:
+- g++
+- clang
+
+script:
+- ./autogen.sh
+- ./configure
+- make
+- export PATH=$PWD/bin:$PATH
+- git clone https://github.com/frederic-mahe/vsearch-tests.git
+- cd vsearch-tests
+- bash ./run_all_tests.sh


=====================================
README.md
=====================================
@@ -26,8 +26,8 @@ VSEARCH stands for vectorized search, as the tool takes advantage of parallelism
 
 Various packages, plugins and wrappers are also available from other sources - see [below](https://github.com/torognes/vsearch#packages-plugins-and-wrappers).
 
-The source code compiles correctly with `gcc` (versions 4.8.5 to 10.2)
-and `llvm-clang` (3.8 to 13.0). The source code should also compile on
+The source code compiles correctly with `gcc` (versions 4.8.5 to 12.0)
+and `llvm-clang` (3.8 to 15.0). The source code should also compile on
 [FreeBSD](https://www.freebsd.org/) and
 [NetBSD](https://www.netbsd.org/) systems.
 
@@ -37,7 +37,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
 
 ## Getting Help
 
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
 
 ## Example
 
@@ -50,9 +50,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.21.1.tar.gz
-tar xzf v2.21.1.tar.gz
-cd vsearch-2.21.1
+wget https://github.com/torognes/vsearch/archive/v2.22.1.tar.gz
+tar xzf v2.22.1.tar.gz
+cd vsearch-2.22.1
 ./autogen.sh
 ./configure CFLAGS="-O3" CXXFLAGS="-O3"
 make
@@ -69,7 +69,7 @@ The distributed Linux ppc64le and aarch64 binaries and the Windows binary were c
 git clone https://github.com/torognes/vsearch.git
 cd vsearch
 ./autogen.sh
-./configure
+./configure CFLAGS="-O3" CXXFLAGS="-O3"
 make
 make install  # as root or sudo make install
 ```
@@ -81,43 +81,43 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
 Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-x86_64.tar.gz
-tar xzf vsearch-2.21.1-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-linux-x86_64.tar.gz
+tar xzf vsearch-2.22.1-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Linux ppc64le system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-ppc64le.tar.gz
-tar xzf vsearch-2.21.1-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-linux-ppc64le.tar.gz
+tar xzf vsearch-2.22.1-linux-ppc64le.tar.gz
 ```
 
 Or these commands if you are using a Linux aarch64 (arm64) system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-aarch64.tar.gz
-tar xzf vsearch-2.21.1-linux-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-linux-aarch64.tar.gz
+tar xzf vsearch-2.22.1-linux-aarch64.tar.gz
 ```
 
 Or these commands if you are using a Mac:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-macos-x86_64.tar.gz
-tar xzf vsearch-2.21.1-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-macos-x86_64.tar.gz
+tar xzf vsearch-2.22.1-macos-x86_64.tar.gz
 ```
 
 Or if you are using Windows, download and extract (unzip) the contents of this file:
 
 ```
-https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch-2.22.1-win-x86_64.zip
 ```
 
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.21.1-linux-x86_64` or `vsearch-2.21.1-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.22.1-linux-x86_64` or `vsearch-2.22.1-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
 
-Windows: You will now have the binary distribution in a folder called `vsearch-2.21.1-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.22.1-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
 
 
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.22.1/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
 
 
 ## Packages, plugins, and wrappers


=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.21.1], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
+AC_INIT([vsearch], [2.22.1], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])


=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "January 18, 2022" "version 2.21.1" "USER COMMANDS"
+.TH vsearch 1 "September 19, 2022" "version 2.22.1" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em a versatile open-source tool for microbiome analysis,
@@ -43,6 +43,9 @@ Dereplication and rereplication:
 \fBvsearch\fR (\-\-derep_fulllength | \-\-derep_id | \-\-derep_prefix)
 \fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
 .PP
+\fBvsearch\fR \-\-derep_smallmem (\fIfastafile\fR | \fIfastqfile\fR)
+\-\-fastaout \fIoutputfile\fR [\fIoptions\fR]
+.PP
 \fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output
 \fIoutputfile\fR [\fIoptions\fR]
 .PP
@@ -1044,25 +1047,31 @@ Dereplication and rereplication options:
 .PP
 .RS
 VSEARCH can dereplicate sequences with the commands
-\-\-derep_fulllength, \-\-derep_id, \-\-derep_prefix and
-\-\-fastx_uniques. The \-\-derep_fulllength command is depreciated and
-is replaced by the new \-\-fastx_uniques command that can also handle
-FASTQ files in addition to FASTA files. The \-\-derep_fulllength and
+\-\-derep_fulllength, \-\-derep_id, \-\-derep_smallmem,
+\-\-derep_prefix and \-\-fastx_uniques. The \-\-derep_fulllength
+command is deprecated and is replaced by the new \-\-fastx_uniques
+command that can also handle FASTQ files in addition to FASTA
+files. The \-\-derep_fulllength, \-\-derep_smallmem, and
 \-\-fastx_uniques commands requires strictly identical sequences of
 the same length, but ignores upper/lower case and treats T and U as
 identical symbols. The \-\-derep_id command requires both identical
 sequences and identical headers/labels. The \-\-derep_prefix command
 will group sequences with a common prefix and does not require them to
-be equally long. The \-\-fastx_uniques command can write FASTQ output
-(specified with \-\-fastqout) or FASTA output (specified with
-\-\-fastaout) as well as a special tab-separated column text format
-(with \-\-tabbedout). The other commands can write FASTA output to the
-file specified with the \-\-output option. All dereplication commands
-can write output to a special UCLUST-like file specified with the
-\-\-uc option. The \-\-rereplicate command can duplicate sequences in
-the input file according to the abundance of each input
-sequence. Other valid options are \-\-fastq_ascii, \-\-fastq_asciiout,
-\-\-fastq_qmax, \-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout,
+be equally long. The \-\-derep_smallmem command uses much less memory
+when dereplicating than the other commands, but may be a bit slower
+and cannot read its input from a pipe. It takes both FASTA and
+FASTQ files as input but only writes FASTA output to the file
+specified with the \-\-fastaout option. The \-\-fastx_uniques command
+can write FASTQ output (specified with \-\-fastqout) or FASTA output
+(specified with \-\-fastaout) as well as a special tab-separated
+column text format (with \-\-tabbedout). The other commands can write
+FASTA output to the file specified with the \-\-output option. All
+dereplication commands, except \-\-derep_smallmem, can write output to
+a special UCLUST-like file specified with the \-\-uc option. The
+\-\-rereplicate command can duplicate sequences in the input file
+according to the abundance of each input sequence. Other valid options
+are \-\-fastq_ascii, \-\-fastq_asciiout, \-\-fastq_qmax,
+\-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout,
 \-\-fastq_qout_max, \-\-maxuniquesize, \-\-minuniquesize, \-\-relabel,
 \-\-relabel_keep, \-\-relabel_md5, \-\-relabel_self, \-\-relabel_sha1,
 \-\-sizein, \-\-sizeout, \-\-strand, \-\-topn, and \-\-xsize.
@@ -1082,6 +1091,22 @@ not support multithreading.
 Merge strictly identical sequences contained in \fIfilename\fR, as
 with the \-\-derep_fulllength command, but the sequence labels
 (identifiers) on the header line need to be identical too.
+.TAG derep_smallmem
+.TP
+.BI \-\-derep_smallmem \0filename
+Merge strictly identical sequences contained in \fIfilename\fR, as
+with the \-\-derep_fulllength command, but using much less memory. The
+output is written to a FASTA file specified with the \-\-fastaout
+option. The output is written in the order that the sequences first
+appear in the input, and not in descending abundance order as with the
+other dereplication commands. It can read, but not write, FASTQ
+files. This command cannot read from a pipe; the input must be a
+proper file, as it is read twice. Dereplication is performed with a
+128-bit hash function and it is not verified that grouped sequences
+are identical; however, the probability that two different sequences
+are grouped in a
+dataset of 1 000 000 000 unique sequences is approximately 1e-21.
+Multithreading and the options \-\-topn, \-\-uc, or \-\-tabbedout are
+not supported.
 .TAG derep_prefix
 .TP
 .BI \-\-derep_prefix \0filename
@@ -1103,7 +1128,7 @@ header of the first sequence of their group. If \-\-sizeout is used,
 the number of occurrences (i.e. abundance) of each sequence is
 indicated at the end of their fasta header using the
 pattern ';size=\fIinteger\fR;'. This option is only valid for
-\-\-fastx_uniques.
+\-\-fastx_uniques and \-\-derep_smallmem.
 .TAG fastqout
 .TP
 .BI \-\-fastqout \0filename
@@ -1195,7 +1220,7 @@ header of the first sequence of their group. If \-\-sizeout is used,
 the number of occurrences (i.e. abundance) of each sequence is
 indicated at the end of their fasta header using the
 pattern ';size=\fIinteger\fR;'. This option is not allowed for
-fastx_uniques.
+\-\-fastx_uniques or \-\-derep_smallmem.
 .TP
 .TAG relabel
 .BI \-\-relabel \0string
@@ -1458,7 +1483,7 @@ at the end of the sequence if this option is not specified.
 .TAG subseq_start
 .TP
 .BI \-\-subseq_start\~ "positive integer"
-Specifiy the starting position in the sequences when extracting
+Specify the starting position in the sequences when extracting
 subsequences using the \-\-fastx_getsubseq command. Positions are
 1-based, so the sequences start at position 1. The default is to start
 at the beginning of the sequence (position 1), if this option is not
@@ -1694,13 +1719,21 @@ mismatches.
 .TP
 .BI \-\-fastq_maxee\~ real
 When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
-discard sequences with more than the specified number of expected
-errors.
+discard sequences with an expected error greater than the specified
+number (value ranging from 0.0 to infinity). For a given sequence, the
+expected error is the sum of error probabilities for all the positions
+in the sequence. In practice, the expected error is greater than zero
+(error probabilities can be small but never zero), and at most equal to
+the length of the sequence (when all positions have an error
+probability of 1.0).
 .TAG fastq_maxee_rate
 .TP
 .BI \-\-fastq_maxee_rate\~ real
 When using \-\-fastq_filter or \-\-fastx_filter, discard sequences
-with more than the specified number of expected errors per base.
+with an average expected error greater than the specified number
+(value ranging from 0.0 to 1.0, inclusive). For a given sequence, the
+average expected error is the sum of error probabilities for all the
+positions in the sequence, divided by the length of the sequence.
 .TAG fastq_maxlen
 .TP
 .BI \-\-fastq_maxlen\~ "positive integer"
@@ -1721,10 +1754,15 @@ discard sequences with more than the specified number of N's.
 .BI \-\-fastq_mergepairs\0 filename
 Merge paired-end sequence reads into one sequence. The forward reads
 are specified as the argument to this option and the reverse reads are
-specified with the \-\-reverse option. The merged sequences are output
-to the file(s) specified with the \-\-fastaout or \-\-fastqout
-options. The non-merged reads can be output to the files specified
-with the \-\-fastaout_notmerged_fwd, \-\-fastaout_notmerged_rev,
+specified with the \-\-reverse option. Reads with the same
+index/position in the forward and reverse files are considered to form
+a pair, even if their labels are different. Thus, forward and reverse
+reads \fBmust\fR appear in the same order and total number in both
+files. A warning is emitted if the forward and reverse files contain
+different numbers of reads. The merged sequences are written to the
+file(s) specified with the \-\-fastaout or \-\-fastqout options. The
+non-merged reads can be output to the files specified with the
+\-\-fastaout_notmerged_fwd, \-\-fastaout_notmerged_rev,
 \-\-fastqout_notmerged_fwd and \-\-fastqout_notmerged_rev
 options. Statistics may be output to the file specified with the
 \-\-eetabbedout option. Sequences are truncated as specified with the
@@ -2273,7 +2311,7 @@ Write the correctly oriented sequences to \fIfilename\fR, in fastq format.
 .TP
 .BI \-\-notmatched \0filename
 Write the sequences with undetermined direction to \fIfilename\fR, in
-the orginal format.
+the original format.
 .TAG orient
 .TP
 .BI \-\-orient \0filename
@@ -2378,7 +2416,7 @@ are written to the file specified with the \-\-fastaout file and the
 fragments on the reverse strand are written to the file specified with
 the \-\-fastaout_rev option. Input sequences that do not match are
 written to the file specified with the option \-\-fastaout_discarded,
-and their reverse complement are also written to the file specfied
+and their reverse complement are also written to the file specified
 with the \-\-fastaout_discarded_rev option. The relabel options
 (\-\-relabel, \-\-relabel_self, \-\-relabel_keep, \-\-relabel_md5, and
 \-\-relabel_sha1) may be used to relabel the output sequences).
@@ -2663,7 +2701,7 @@ taxonomic information in the same format as used with the \-\-sintax
 command, e.g. "tax=k:Archaea,p:Euryarchaeota,c:Halobacteria". Only the
 initial parts of the taxonomy that are common to a large fraction of
 the hits of each query will be output. It is necessary to set the
-\-\-maxaccepts option to a value differrent from 1 for this
+\-\-maxaccepts option to a value different from 1 for this
 information to be useful. The \-\-top_hits_only option may also be
 useful. The fraction of matching hits required may be adjusted by the
 \-\-lca_cutoff option (default 1.0).
@@ -2845,9 +2883,9 @@ FASTA format.
 .TP
 .BI \-\-query_cov \0real
 Reject if the fraction of the query aligned to the target sequence is
-lower than \fIreal\fR. The query coverage is computed as
-(matches + mismatches) / query sequence length. Internal or terminal
-gaps are not taken into account.
+lower than \fIreal\fR (value ranging from 0.0 to 1.0, inclusive). The
+query coverage is computed as (matches + mismatches) / query sequence
+length. Internal or terminal gaps are not taken into account.
 .TAG rightjust
 .TP
 .B \-\-rightjust
@@ -2912,6 +2950,7 @@ as usearch does).
 .IP \n+[step].
 quality string (ignored, always set to '*').
 .RE
+.TP
 Optional fields for query-target matches (number and order of fields may vary):
 .RS
 .nr step 12 1
@@ -3415,7 +3454,7 @@ allowing spaces in the taxonomic identifiers.
 .TP 9
 .BI \-\-db \0filename
 Read the reference sequences from \fIfilename\fR, in FASTA, FASTQ or
-UDB format. These sequences needs to be annotated with taxonomy.
+UDB format. These sequences need to be annotated with taxonomy.
 .TAG randseed
 .TP
 .BI \-\-randseed\~ "positive integer"
@@ -4631,6 +4670,17 @@ the use of UDB databases with uchime_ref.
 .BR v2.21.1\~ "released January 18th, 2022"
 Fix a problem with dereplication of empty input files. Update Altivec
 code on ppc64le for improved compiler compatibility (vector->__vector).
+.TP
+.BR v2.21.2\~ "released September 12th, 2022"
+Fix problems with the lcaout option when using maxaccepts above 1 and
+either lca_cutoff below 1 or top_hits_only enabled. Update
+documentation. Update code to avoid compiler warnings.
+.TP
+.BR v2.22.0\~ "released September 19th, 2022"
+Add the derep_smallmem command for dereplication using little memory.
+.TP
+.BR v2.22.1\~ "released September 19th, 2022"
+Fix compiler warning.
 .LP
 .\" ============================================================================
 .\" TODO:

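A couple of figures in the manual additions above can be made explicit. For --fastq_maxee and --fastq_maxee_rate, the "sum of error probabilities" corresponds to the standard Phred conversion (a sketch of the definition, assuming Phred-scaled quality scores Q_i over a read of length L):

```
\mathrm{EE} = \sum_{i=1}^{L} 10^{-Q_i/10},
\qquad
\mathrm{EE}_{\mathrm{rate}} = \frac{\mathrm{EE}}{L}
```

For the collision figure quoted in the new --derep_smallmem entry, a birthday-bound estimate (not taken from the upstream manual) with n = 10^9 unique sequences and a 128-bit hash gives

```
P(\mathrm{collision}) \approx \frac{n^{2}}{2 \cdot 2^{128}}
 = \frac{(10^{9})^{2}}{2^{129}} \approx 1.5 \times 10^{-21}
```

which agrees with the "approximately 1e-21" stated above.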

=====================================
src/Makefile.am
=====================================
@@ -31,6 +31,7 @@ db.h \
 dbhash.h \
 dbindex.h \
 derep.h \
+derepsmallmem.h \
 dynlibs.h \
 eestats.h \
 fa2fq.h \
@@ -126,6 +127,7 @@ db.cc \
 dbhash.cc \
 dbindex.cc \
 derep.cc \
+derepsmallmem.cc \
 dynlibs.cc \
 eestats.cc \
 fa2fq.cc \


=====================================
src/cluster.cc
=====================================
@@ -632,8 +632,6 @@ void cluster_core_parallel()
                      opt_gap_extension_query_right,
                      opt_gap_extension_target_right);
 
-  int aligncount = 0;
-
   int lastlength = INT_MAX;
 
   int seqno = 0;
@@ -799,8 +797,6 @@ void cluster_core_parallel()
                           unsigned int target = hit->target;
                           if (search_acceptable_unaligned(si, target))
                             {
-                              aligncount++;
-
                               /* perform vectorized alignment */
                               /* but only using 1 sequence ! */
 
@@ -1017,11 +1013,6 @@ void cluster_core_parallel()
     }
   progress_done();
 
-#if 0
-  if (!opt_quiet)
-    fprintf(stderr, "Extra alignments computed: %d\n", aligncount);
-#endif
-
   /* clean up search info */
   for(int i = 0; i < max_queries; i++)
     {


=====================================
src/derep.cc
=====================================
@@ -1103,11 +1103,9 @@ void derep_prefix()
   /* adjust size of hash table for 2/3 fill rate */
 
   int64_t hashtablesize = 1;
-  int hash_shift = 0;
   while (3 * dbsequencecount > 2 * hashtablesize)
     {
       hashtablesize <<= 1;
-      hash_shift++;
     }
   int hash_mask = hashtablesize - 1;
 


=====================================
src/derepsmallmem.cc
=====================================
@@ -0,0 +1,657 @@
+/*
+
+  VSEARCH: a versatile open source tool for metagenomics
+
+  Copyright (C) 2014-2022, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  All rights reserved.
+
+  Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+  Department of Informatics, University of Oslo,
+  PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+  This software is dual-licensed and available under a choice
+  of one of two licenses, either under the terms of the GNU
+  General Public License version 3 or the BSD 2-Clause License.
+
+
+  GNU General Public License version 3
+
+  This program is free software: you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation, either version 3 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+
+  The BSD 2-Clause License
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+
+  1. Redistributions of source code must retain the above copyright
+  notice, this list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright
+  notice, this list of conditions and the following disclaimer in the
+  documentation and/or other materials provided with the distribution.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+  POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "vsearch.h"
+
+#define HASH hash_cityhash128
+
+struct bucket
+{
+  uint128 hash;
+  uint64_t size;
+};
+
+static struct bucket * hashtable = nullptr;
+static uint64_t hashtablesize = 0;
+
+double find_median()
+{
+  /* find the median size, based on an iterative search starting at e.g. 1 */
+
+  uint64_t cand = 1;    /* candidate for the median */
+  uint64_t below = 0;   /* closest value below the candidate */
+  uint64_t above = 0;   /* closest value above the candidate */
+
+  uint64_t cand_count;  /* number of clusters with same size as cand */
+  uint64_t below_count; /* number of clusters with smaller size than cand */
+  uint64_t above_count; /* number of clusters with larger size than cand */
+
+  while (true)
+    {
+      cand_count = 0;
+      below_count = 0;
+      above_count = 0;
+
+      for(uint64_t i = 0; i < hashtablesize; i++)
+        {
+          uint64_t v = hashtable[i].size;
+          if (v > 0)
+            {
+              if (v > cand)
+                {
+                  if ((above_count == 0) || (v < above))
+                    {
+                      above = v;
+                    }
+                  above_count++;
+                }
+              else if (v < cand)
+                {
+                  if ((below_count == 0) || (v > below))
+                    {
+                      below = v;
+                    }
+                  below_count++;
+                }
+              else
+                {
+                  cand_count++;
+                }
+            }
+        }
+
+      double mid = below_count + cand_count + above_count;
+      if (mid == 0)
+        return 0;
+      mid = mid / 2.0;
+
+      if (mid >= below_count)
+        {
+          if (mid <= below_count + cand_count)
+            {
+              if (mid == below_count + cand_count)
+                {
+                  return (cand + above) / 2.0;
+                }
+              else if (mid == below_count)
+                {
+                  return (below + cand) / 2.0;
+                }
+              else
+                {
+                  return cand;
+                }
+            }
+          else
+            {
+              cand = above;
+            }
+        }
+      else
+        {
+          cand = below;
+        }
+    }
+}
+
+uint64_t inline hash2bucket(uint128 hash, uint64_t htsize)
+{
+  return Uint128Low64(hash) % htsize;
+}
+
+uint64_t inline next_bucket(uint64_t prev_bucket, uint64_t htsize)
+{
+  return (prev_bucket + 1) % htsize;
+}
+
+void rehash_smallmem()
+{
+  /* allocate new hash table, 50% larger */
+  uint64_t new_hashtablesize = 3 * hashtablesize / 2;
+  auto * new_hashtable =
+    (struct bucket *) xmalloc(sizeof(bucket) * new_hashtablesize);
+
+  /* zero new hash table */
+  for(uint64_t j = 0; j < new_hashtablesize; j++)
+    {
+      new_hashtable[j].hash.first = 0;
+      new_hashtable[j].hash.second = 0;
+      new_hashtable[j].size = 0;
+    }
+
+  /* rehash all from old to new */
+  for(uint64_t i = 0; i < hashtablesize; i++)
+    {
+      struct bucket * old_bp = hashtable + i;
+      if (old_bp->size)
+        {
+          uint64_t k = hash2bucket(old_bp->hash, new_hashtablesize);
+          while (new_hashtable[k].size)
+            {
+              k = next_bucket(k, new_hashtablesize);
+            }
+          struct bucket * new_bp = new_hashtable + k;
+          * new_bp = * old_bp;
+        }
+    }
+
+  /* free old table */
+  xfree(hashtable);
+
+  /* update variables */
+  hashtable = new_hashtable;
+  hashtablesize = new_hashtablesize;
+}
+
+void derep_smallmem(char * input_filename)
+{
+  /*
+    dereplicate full length sequences using a small amount of memory
+    output options: --fastaout
+  */
+
+  show_rusage();
+
+  fastx_handle h = fastx_open(input_filename);
+
+  if (!h)
+    {
+      fatal("Unrecognized input file type (not proper FASTA or FASTQ format).");
+    }
+
+  if (h->is_pipe)
+    {
+      fatal("The derep_smallmem command does not support input from a pipe.");
+    }
+
+  FILE * fp_fastaout = nullptr;
+
+  if (opt_fastaout)
+    {
+      fp_fastaout = fopen_output(opt_fastaout);
+      if (!fp_fastaout)
+        {
+          fatal("Unable to open FASTA output file for writing");
+        }
+    }
+  else
+    {
+      fatal("Output file for dereplication must be specified with --fastaout");
+    }
+
+  uint64_t filesize = fastx_get_size(h);
+
+  /* allocate initial memory for sequences of length up to 1023 chars */
+  int64_t alloc_seqlen = 1024;
+
+  /* allocate initial hashtable with 1024 buckets */
+
+  hashtablesize = 1024;
+  hashtable = (struct bucket *) xmalloc(sizeof(bucket) * hashtablesize);
+
+  /* zero hash table */
+  for(uint64_t j = 0; j < hashtablesize; j++)
+    {
+      hashtable[j].hash.first = 0;
+      hashtable[j].hash.second = 0;
+      hashtable[j].size = 0;
+    }
+
+  show_rusage();
+
+  char * seq_up = (char*) xmalloc(alloc_seqlen + 1);
+  char * rc_seq_up = (char*) xmalloc(alloc_seqlen + 1);
+
+  char * prompt = nullptr;
+  if (xsprintf(& prompt, "Dereplicating file %s", input_filename) == -1)
+    {
+      fatal("Out of memory");
+    }
+
+  progress_init(prompt, filesize);
+
+  uint64_t sequencecount = 0;
+  uint64_t nucleotidecount = 0;
+  int64_t shortest = INT64_MAX;
+  int64_t longest = 0;
+  uint64_t discarded_short = 0;
+  uint64_t discarded_long = 0;
+  uint64_t clusters = 0;
+  int64_t sumsize = 0;
+  uint64_t maxsize = 0;
+  double median = 0.0;
+  double average = 0.0;
+
+  /* first pass */
+
+  while(fastx_next(h, ! opt_notrunclabels, chrmap_no_change))
+    {
+      int64_t seqlen = fastx_get_sequence_length(h);
+
+      if (seqlen < opt_minseqlength)
+        {
+          discarded_short++;
+          continue;
+        }
+
+      if (seqlen > opt_maxseqlength)
+        {
+          discarded_long++;
+          continue;
+        }
+
+      nucleotidecount += seqlen;
+      if (seqlen > longest)
+        {
+          longest = seqlen;
+        }
+      if (seqlen < shortest)
+        {
+          shortest = seqlen;
+        }
+
+      /* check allocations */
+
+      if (seqlen > alloc_seqlen)
+        {
+          alloc_seqlen = seqlen;
+          seq_up = (char*) xrealloc(seq_up, alloc_seqlen + 1);
+          rc_seq_up = (char*) xrealloc(rc_seq_up, alloc_seqlen + 1);
+
+          show_rusage();
+        }
+
+      if (100 * (clusters + 1) > 95 * hashtablesize)
+        {
+          /* keep hash table fill rate at max 95% */
+          rehash_smallmem();
+          show_rusage();
+        }
+
+      char * seq = fastx_get_sequence(h);
+
+      /* normalize sequence: uppercase and replace U by T  */
+      string_normalize(seq_up, seq, seqlen);
+
+      /* reverse complement if necessary */
+      if (opt_strand > 1)
+        {
+          reverse_complement(rc_seq_up, seq_up, seqlen);
+        }
+
+      /*
+        Find a free bucket or the bucket of an identical sequence.
+        Sequences are not compared directly; only their 128-bit
+        hashes are, so two different sequences would be grouped
+        in the unlikely event of a hash collision. With 128-bit
+        hashes, the probability of a collision among 1e9 unique
+        sequences is approximately 1e-21.
+      */
+
+      uint128 hash = HASH(seq_up, seqlen);
+      uint64_t j =  hash2bucket(hash, hashtablesize);
+      struct bucket * bp = hashtable + j;
+
+      while ((bp->size) && (hash != bp->hash))
+        {
+          j = next_bucket(j, hashtablesize);
+          bp = hashtable + j;
+        }
+
+      if ((opt_strand > 1) && !bp->size)
+        {
+          /* no match on plus strand */
+          /* check minus strand as well */
+
+          uint128 rc_hash = HASH(rc_seq_up, seqlen);
+          uint64_t k =  hash2bucket(rc_hash, hashtablesize);
+          struct bucket * rc_bp = hashtable + k;
+
+          while ((rc_bp->size) && (rc_hash != rc_bp->hash))
+            {
+              k = next_bucket(k, hashtablesize);
+              rc_bp = hashtable + k;
+            }
+
+          if (rc_bp->size)
+            {
+              bp = rc_bp;
+              j = k;
+            }
+        }
+
+      int abundance = fastx_get_abundance(h);
+      int64_t ab = opt_sizein ? abundance : 1;
+      sumsize += ab;
+
+      if (bp->size)
+        {
+          /* at least one identical sequence already */
+          bp->size += ab;
+        }
+      else
+        {
+          /* no identical sequences yet */
+          bp->size = ab;
+          bp->hash = hash;
+          clusters++;
+        }
+
+      if (bp->size > maxsize)
+        {
+          maxsize = bp->size;
+        }
+
+      sequencecount++;
+      progress_update(fastx_get_position(h));
+    }
+  progress_done();
+  xfree(prompt);
+  fastx_close(h);
+
+  show_rusage();
+
+  if (!opt_quiet)
+    {
+      if (sequencecount > 0)
+        {
+          fprintf(stderr,
+                  "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64
+                  ", max %'" PRIu64 ", avg %'.0f\n",
+                  nucleotidecount,
+                  sequencecount,
+                  shortest,
+                  longest,
+                  nucleotidecount * 1.0 / sequencecount);
+        }
+      else
+        {
+          fprintf(stderr,
+                  "%'" PRIu64 " nt in %'" PRIu64 " seqs\n",
+                  nucleotidecount,
+                  sequencecount);
+        }
+    }
+
+  if (opt_log)
+    {
+      if (sequencecount > 0)
+        {
+          fprintf(fp_log,
+                  "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64
+                  ", max %'" PRIu64 ", avg %'.0f\n",
+                  nucleotidecount,
+                  sequencecount,
+                  shortest,
+                  longest,
+                  nucleotidecount * 1.0 / sequencecount);
+        }
+      else
+        {
+          fprintf(fp_log,
+                  "%'" PRIu64 " nt in %'" PRIu64 " seqs\n",
+                  nucleotidecount,
+                  sequencecount);
+        }
+    }
+
+  if (discarded_short)
+    {
+      fprintf(stderr,
+              "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n",
+              opt_minseqlength,
+              discarded_short,
+              (discarded_short == 1 ? "sequence" : "sequences"));
+
+      if (opt_log)
+        {
+          fprintf(fp_log,
+                  "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n",
+                  opt_minseqlength,
+                  discarded_short,
+                  (discarded_short == 1 ? "sequence" : "sequences"));
+        }
+    }
+
+  if (discarded_long)
+    {
+      fprintf(stderr,
+              "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n",
+              opt_maxseqlength,
+              discarded_long,
+              (discarded_long == 1 ? "sequence" : "sequences"));
+
+      if (opt_log)
+        {
+          fprintf(fp_log,
+                  "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n",
+                  opt_maxseqlength,
+                  discarded_long,
+                  (discarded_long == 1 ? "sequence" : "sequences"));
+        }
+    }
+
+
+  show_rusage();
+
+  average = 1.0 * sumsize / clusters;
+  median = find_median();
+
+  if (clusters < 1)
+    {
+      if (!opt_quiet)
+        {
+          fprintf(stderr,
+                  "0 unique sequences\n");
+        }
+      if (opt_log)
+        {
+          fprintf(fp_log,
+                  "0 unique sequences\n\n");
+        }
+    }
+  else
+    {
+      if (!opt_quiet)
+        {
+          fprintf(stderr,
+                  "%" PRId64
+                  " unique sequences, avg cluster %.1lf, median %.0f, max %"
+                  PRIu64 "\n",
+                  clusters, average, median, maxsize);
+        }
+      if (opt_log)
+        {
+          fprintf(fp_log,
+                  "%" PRId64
+                  " unique sequences, avg cluster %.1lf, median %.0f, max %"
+                  PRIu64 "\n\n",
+                  clusters, average, median, maxsize);
+        }
+    }
+
+  show_rusage();
+
+  /* second pass with output */
+
+  fastx_handle h2 = fastx_open(input_filename);
+  if (!h2)
+    {
+      fatal("Cannot open and read from the input file.");
+    }
+
+  progress_init("Writing FASTA output file", filesize);
+
+  uint64_t selected = 0;
+
+  while(fastx_next(h2, ! opt_notrunclabels, chrmap_no_change))
+    {
+      int64_t seqlen = fastx_get_sequence_length(h2);
+
+      if ((seqlen < opt_minseqlength) || (seqlen > opt_maxseqlength))
+        {
+          continue;
+        }
+
+      char * seq = fastx_get_sequence(h2);
+
+      /* normalize sequence: uppercase and replace U by T  */
+      string_normalize(seq_up, seq, seqlen);
+
+      /* reverse complement if necessary */
+      if (opt_strand > 1)
+        {
+          reverse_complement(rc_seq_up, seq_up, seqlen);
+        }
+
+      uint128 hash = HASH(seq_up, seqlen);
+      uint64_t j =  hash2bucket(hash, hashtablesize);
+      struct bucket * bp = hashtable + j;
+
+      while ((bp->size) && (hash != bp->hash))
+        {
+          j = next_bucket(j, hashtablesize);
+          bp = hashtable + j;
+        }
+
+      if ((opt_strand > 1) && ! bp->size)
+        {
+          /* no match on plus strand */
+          /* check minus strand as well */
+
+          uint128 rc_hash = HASH(rc_seq_up, seqlen);
+          uint64_t k =  hash2bucket(rc_hash, hashtablesize);
+          struct bucket * rc_bp = hashtable + k;
+
+          while ((rc_bp->size) && (rc_hash != rc_bp->hash))
+            {
+              k = next_bucket(k, hashtablesize);
+              rc_bp = hashtable + k;
+            }
+
+          if (rc_bp->size)
+            {
+              bp = rc_bp;
+              j = k;
+            }
+        }
+
+      int64_t size = bp->size;
+
+      if (size > 0)
+        {
+          /* print sequence */
+
+          char * header = fastx_get_header(h2);
+          int headerlen = fastx_get_header_length(h2);
+
+          if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize))
+            {
+              selected++;
+              fasta_print_general(fp_fastaout,
+                                  nullptr,
+                                  seq,
+                                  seqlen,
+                                  header,
+                                  headerlen,
+                                  size,
+                                  selected,
+                                  -1.0,
+                                  -1, -1, nullptr, 0.0);
+            }
+          bp->size = -1;
+        }
+
+      progress_update(fastx_get_position(h2));
+    }
+  progress_done();
+  fastx_close(h2);
+  fclose(fp_fastaout);
+
+  show_rusage();
+
+  if (selected < clusters)
+    {
+      if (!opt_quiet)
+        {
+          fprintf(stderr,
+                  "%" PRId64 " uniques written, %"
+                  PRId64 " clusters discarded (%.1f%%)\n",
+                  selected, clusters - selected,
+                  100.0 * (clusters - selected) / clusters);
+        }
+
+      if (opt_log)
+        {
+          fprintf(fp_log,
+                  "%" PRId64 " uniques written, %"
+                  PRId64 " clusters discarded (%.1f%%)\n\n",
+                  selected, clusters - selected,
+                  100.0 * (clusters - selected) / clusters);
+        }
+    }
+
+  show_rusage();
+
+  xfree(seq_up);
+  xfree(rc_seq_up);
+  xfree(hashtable);
+
+  show_rusage();
+}
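
A rough memory estimate for the table above (a back-of-the-envelope sketch, assuming sizeof(struct bucket) is 24 bytes, i.e. a 16-byte uint128 hash plus an 8-byte size with no padding, and the 95% maximum fill rate enforced in the first pass):

```
\mathrm{memory} \gtrsim \frac{n_{\mathrm{unique}} \times 24\ \mathrm{B}}{0.95}
 \approx 2.5\ \mathrm{GB}
 \quad \mathrm{for}\ n_{\mathrm{unique}} = 10^{8}
```

During rehash_smallmem() a table 50% larger is allocated before the old one is freed, so the transient peak is roughly 2.5 times a single bucket array. Only the hash and abundance of each cluster are kept between the two passes, never the sequences themselves, which is what keeps memory use far below that of the other dereplication commands.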


=====================================
src/derepsmallmem.h
=====================================
@@ -0,0 +1,61 @@
+/*
+
+  VSEARCH: a versatile open source tool for metagenomics
+
+  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  All rights reserved.
+
+  Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+  Department of Informatics, University of Oslo,
+  PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+  This software is dual-licensed and available under a choice
+  of one of two licenses, either under the terms of the GNU
+  General Public License version 3 or the BSD 2-Clause License.
+
+
+  GNU General Public License version 3
+
+  This program is free software: you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation, either version 3 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+
+  The BSD 2-Clause License
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+
+  1. Redistributions of source code must retain the above copyright
+  notice, this list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright
+  notice, this list of conditions and the following disclaimer in the
+  documentation and/or other materials provided with the distribution.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+  POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+void derep_smallmem(char * input_filename);


=====================================
src/eestats.cc
=====================================
@@ -138,7 +138,6 @@ void fastq_eestats()
   progress_init("Reading FASTQ file", filesize);
 
   uint64_t seq_count = 0;
-  uint64_t symbols = 0;
 
   int64_t len_alloc = 10;
 
@@ -221,8 +220,6 @@ void fastq_eestats()
 
       /* update quality statistics */
 
-      symbols += len;
-
       double ee = 0.0;
 
       for(int64_t i=0; i < len; i++)


=====================================
src/fastx.cc
=====================================
@@ -469,6 +469,11 @@ bool fastx_is_empty(fastx_handle h)
   return h->is_empty;
 }
 
+bool fastx_is_pipe(fastx_handle h)
+{
+  return h->is_pipe;
+}
+
 void fastx_close(fastx_handle h)
 {
   /* Warn about stripped chars */


=====================================
src/fastx.h
=====================================
@@ -116,6 +116,7 @@ typedef struct fastx_s * fastx_handle;
 
 bool fastx_is_fastq(fastx_handle h);
 bool fastx_is_empty(fastx_handle h);
+bool fastx_is_pipe(fastx_handle h);
 void fastx_filter_header(fastx_handle h, bool truncateatspace);
 fastx_handle fastx_open(const char * filename);
 void fastx_close(fastx_handle h);


=====================================
src/results.cc
=====================================
@@ -525,63 +525,131 @@ void results_show_lcaout(FILE * fp,
   /* Output last common ancestor (LCA) of the hits,
      in a similar way to the Sintax command */
 
-  int first_level_start[tax_levels];
-  int first_level_len[tax_levels];
-  int level_match[tax_levels];
-  char * first_h = nullptr;
+  /* Use a modified Boyer-Moore majority voting algorithm to find the
+     most common name at each taxonomic level */
 
   fprintf(fp, "%s\t", query_head);
 
-  if (hitcount > 0)
+  int votes[tax_levels];
+  int cand[tax_levels];
+  int cand_level_start[tax_levels][tax_levels];
+  int cand_level_len[tax_levels][tax_levels];
+  int level_match[tax_levels];
+
+  for (int k = 0; k < tax_levels; k++)
+    {
+      votes[k] = 0;
+      cand[k] = -1;
+      level_match[k] = 0;
+    }
+
+  double top_hit_id = hits[0].id;
+  int tophitcount = 0;
+
+  for (int t = 0; t < hitcount; t++)
     {
-      for (int t = 0; t < hitcount; t++)
+      struct hit * hp = hits + t;
+
+      if (opt_top_hits_only && (hp->id < top_hit_id))
         {
-          int seqno = hits[t].target;
-          if (t == 0)
+          break;
+        }
+
+      tophitcount++;
+
+      int seqno = hp->target;
+      int new_level_start[tax_levels];
+      int new_level_len[tax_levels];
+      tax_split(seqno, new_level_start, new_level_len);
+
+      for (int k = 0; k < tax_levels; k++)
+        {
+          if (votes[k] == 0)
             {
-              tax_split(seqno, first_level_start, first_level_len);
-              first_h = db_getheader(seqno);
+              cand[k] = seqno;
+              votes[k] = 1;
               for (int j = 0; j < tax_levels; j++)
                 {
-                  level_match[j] = 1;
+                  cand_level_start[k][j] = new_level_start[j];
+                  cand_level_len[k][j] = new_level_len[j];
                 }
             }
           else
             {
-              int level_start[tax_levels];
-              int level_len[tax_levels];
-              tax_split(seqno, level_start, level_len);
-              char * h = db_getheader(seqno);
-              for (int j = 0; j < tax_levels; j++)
+              bool match = true;
+              for (int j = 0; j <= k; j++)
                 {
-                  /* For each taxonomic level */
-                  if ((level_len[j] == first_level_len[j]) &&
-                      (strncmp(first_h + first_level_start[j],
-                               h + level_start[j],
-                               level_len[j]) == 0))
+                  if ((new_level_len[j] != cand_level_len[k][j]) ||
+                      (strncmp(db_getheader(cand[k]) + cand_level_start[k][j],
+                               db_getheader(seqno) + new_level_start[j],
+                               new_level_len[j]) != 0))
                     {
-                      level_match[j]++;
+                      match = false;
+                      break;
                     }
                 }
+              if (match)
+                {
+                  votes[k]++;
+                }
+              else
+                {
+                  votes[k]--;
+                }
+            }
+        }
+    }
+
+  /* count actual matches to the candidate at each level */
+
+  for (int t = 0; t < tophitcount; t++)
+    {
+      int seqno = hits[t].target;
+      int new_level_start[tax_levels];
+      int new_level_len[tax_levels];
+      tax_split(seqno, new_level_start, new_level_len);
+
+      for (int k = 0; k < tax_levels; k++)
+        {
+          bool match = true;
+          for (int j = 0; j <= k; j++)
+            {
+              if ((new_level_len[j] != cand_level_len[k][j]) ||
+                  (strncmp(db_getheader(cand[k]) + cand_level_start[k][j],
+                           db_getheader(seqno) + new_level_start[j],
+                           new_level_len[j]) != 0))
+                {
+                  match = false;
+                  break;
+                }
+            }
+          if (match)
+            {
+              level_match[k]++;
             }
         }
+    }
 
+  /* output results */
+
+  if (tophitcount > 0)
+    {
       bool comma = false;
       for (int j = 0; j < tax_levels; j++)
         {
-          if (1.0 * level_match[j] / hitcount < opt_lca_cutoff)
+          if (1.0 * level_match[j] / tophitcount < opt_lca_cutoff)
             {
               break;
             }
 
-          if (first_level_len[j] > 0)
+          if (cand_level_len[j][j] > 0)
             {
               fprintf(fp,
                       "%s%c:%.*s",
                       (comma ? "," : ""),
                       tax_letters[j],
-                      first_level_len[j],
-                      first_h + first_level_start[j]);
+                      cand_level_len[j][j],
+                      db_getheader(cand[j]) + cand_level_start[j][j]);
               comma = true;
             }
         }
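
For reference, the classic single-candidate form of the Boyer-Moore majority vote mentioned in the comment above looks roughly like this (a self-contained sketch, not code from vsearch; the change above runs one such vote per taxonomic level and then re-counts matches so that --lca_cutoff can be applied):

```cpp
#include <string>
#include <utility>
#include <vector>

/* Boyer-Moore majority vote: pass 1 picks a candidate, pass 2 counts
   how often the candidate actually occurs. If a strict majority
   exists, pass 1 is guaranteed to select it; pass 2 makes the result
   safe to use when no majority exists. */
std::pair<std::string, int> majority_vote(const std::vector<std::string> & names)
{
  std::string cand;
  int votes = 0;
  for (const auto & name : names)          /* pass 1: candidate selection */
    {
      if (votes == 0)
        {
          cand = name;
          votes = 1;
        }
      else if (name == cand)
        {
          votes++;
        }
      else
        {
          votes--;
        }
    }
  int count = 0;
  for (const auto & name : names)          /* pass 2: verify the candidate */
    {
      if (name == cand)
        {
          count++;
        }
    }
  return {cand, count};
}
```

In the change above, the "names" are the taxonomy strings truncated at a given level, one vote runs per level, and the verified count (level_match divided by tophitcount) is compared against opt_lca_cutoff before that level is printed.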


=====================================
src/searchcore.cc
=====================================
@@ -749,7 +749,6 @@ void search_onequery(struct searchinfo_s * si, int seqmask)
 
   int delayed = 0;
 
-  int t = 0;
   while ((si->finalized + delayed < opt_maxaccepts + opt_maxrejects - 1) &&
          (si->rejects < opt_maxrejects) &&
          (si->accepts < opt_maxaccepts) &&
@@ -785,7 +784,6 @@ void search_onequery(struct searchinfo_s * si, int seqmask)
           align_delayed(si);
           delayed = 0;
         }
-      t++;
     }
   if (delayed > 0)
     {


=====================================
src/sintax.cc
=====================================
@@ -199,11 +199,11 @@ void sintax_analyse(char * query_head,
     {
       if (opt_sintax_cutoff > 0.0)
         {
-          fprintf(fp_tabbedout, "\t\t\t");
+          fprintf(fp_tabbedout, "\t\t");
         }
       else
         {
-          fprintf(fp_tabbedout, "\t\t");
+          fprintf(fp_tabbedout, "\t");
         }
     }
 


=====================================
src/util.cc
=====================================
@@ -194,6 +194,11 @@ uint64_t hash_cityhash64(char * s, uint64_t n)
   return CityHash64((const char*)s, n);
 }
 
+uint128 hash_cityhash128(char * s, uint64_t n)
+{
+  return CityHash128((const char*)s, n);
+}
+
 int64_t getusec()
 {
   struct timeval tv;


=====================================
src/util.h
=====================================
@@ -84,6 +84,7 @@ char * xstrdup(const char *s);
 char * xstrchrnul(char *s, int c);
 int xsprintf(char * * ret, const char * format, ...);
 uint64_t hash_cityhash64(char * s, uint64_t n);
+uint128 hash_cityhash128(char * s, uint64_t n);
 int64_t getusec();
 void show_rusage();
 


=====================================
src/vsearch.cc
=====================================
@@ -105,6 +105,7 @@ char * opt_dbnotmatched;
 char * opt_derep_fulllength;
 char * opt_derep_id;
 char * opt_derep_prefix;
+char * opt_derep_smallmem;
 char * opt_eetabbedout;
 char * opt_fasta2fastq;
 char * opt_fastaout;
@@ -764,6 +765,7 @@ void args_init(int argc, char **argv)
   opt_derep_fulllength = nullptr;
   opt_derep_id = nullptr;
   opt_derep_prefix = nullptr;
+  opt_derep_smallmem = nullptr;
   opt_dn = 1.4;
   opt_ee_cutoffs_count = 3;
   opt_ee_cutoffs_values = (double*) xmalloc(opt_ee_cutoffs_count * sizeof(double));
@@ -1012,6 +1014,7 @@ void args_init(int argc, char **argv)
       option_derep_fulllength,
       option_derep_id,
       option_derep_prefix,
+      option_derep_smallmem,
       option_dn,
       option_ee_cutoffs,
       option_eeout,
@@ -1250,6 +1253,7 @@ void args_init(int argc, char **argv)
       {"derep_fulllength",      required_argument, nullptr, 0 },
       {"derep_id",              required_argument, nullptr, 0 },
       {"derep_prefix",          required_argument, nullptr, 0 },
+      {"derep_smallmem",        required_argument, nullptr, 0 },
       {"dn",                    required_argument, nullptr, 0 },
       {"ee_cutoffs",            required_argument, nullptr, 0 },
       {"eeout",                 no_argument,       nullptr, 0 },
@@ -2475,6 +2479,10 @@ void args_init(int argc, char **argv)
           opt_tsegout = optarg;
           break;
 
+        case option_derep_smallmem:
+          opt_derep_smallmem = optarg;
+          break;
+
         default:
           fatal("Internal error in option parsing");
         }
@@ -2505,6 +2513,7 @@ void args_init(int argc, char **argv)
       option_derep_fulllength,
       option_derep_id,
       option_derep_prefix,
+      option_derep_smallmem,
       option_fasta2fastq,
       option_fastq_chars,
       option_fastq_convert,
@@ -3137,6 +3146,37 @@ void args_init(int argc, char **argv)
         option_xsize,
         -1 },
 
+      { option_derep_smallmem,
+        option_bzip2_decompress,
+        option_fasta_width,
+        option_fastaout,
+        option_fastq_ascii,
+        option_fastq_qmax,
+        option_fastq_qmin,
+        option_gzip_decompress,
+        option_label_suffix,
+        option_log,
+        option_maxseqlength,
+        option_maxuniquesize,
+        option_minseqlength,
+        option_minuniquesize,
+        option_no_progress,
+        option_notrunclabels,
+        option_quiet,
+        option_relabel,
+        option_relabel_keep,
+        option_relabel_md5,
+        option_relabel_self,
+        option_relabel_sha1,
+        option_sample,
+        option_sizein,
+        option_sizeout,
+        option_strand,
+        option_threads,
+        option_xee,
+        option_xsize,
+        -1 },
+
       { option_fasta2fastq,
         option_bzip2_decompress,
         option_fastq_asciiout,
@@ -3873,6 +3913,8 @@ void args_init(int argc, char **argv)
         option_gzip_decompress,
         option_label_suffix,
         option_log,
+        option_maxseqlength,
+        option_minseqlength,
         option_no_progress,
         option_notrunclabels,
         option_quiet,
@@ -4829,6 +4871,7 @@ void cmd_help()
               "  --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n"
               "  --derep_id FILENAME         dereplicate using both identifiers and sequences\n"
               "  --derep_prefix FILENAME     dereplicate sequences in file based on prefixes\n"
+              "  --derep_smallmem FILENAME   dereplicate sequences in file using less memory\n"
               "  --fastx_uniques FILENAME    dereplicate sequences in the FASTA/FASTQ file\n"
               "  --rereplicate FILENAME      rereplicate sequences in the given FASTA file\n"
               " Parameters\n"
@@ -5269,7 +5312,7 @@ void cmd_none()
   if (! opt_quiet)
     {
       fprintf(stderr,
-              "For help, please enter: %s --help | more\n"
+              "For more help, please enter: %s --help\n"
               "For further details, please consult the manual by entering: man vsearch\n"
               "\n"
               "Selected command examples:\n"
@@ -5301,9 +5344,9 @@ void cmd_none()
               "vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME\n"
               "\n"
               "Other commands: cluster_fast, cluster_smallmem, cluster_unoise, cut,\n"
-              "                derep_id, derep_fulllength, derep_prefix, fasta2fastq,\n"
-              "                fastq_filter, fastq_join, fastx_getseqs, fastx_getsubseqs,\n"
-              "                maskfasta, orient, rereplicate, uchime2_denovo,\n"
+              "                derep_id, derep_fulllength, derep_prefix, derep_smallmem,\n"
+              "                fasta2fastq, fastq_filter, fastq_join, fastx_getseqs,\n"
+              "                fastx_getsubseq, maskfasta, orient, rereplicate, uchime2_denovo,\n"
               "                uchime3_denovo, udb2fasta, udbinfo, udbstats, version\n"
               "\n",
               progname);
@@ -5529,6 +5572,10 @@ int main(int argc, char** argv)
     {
       derep_prefix();
     }
+  else if (opt_derep_smallmem)
+    {
+      derep_smallmem(opt_derep_smallmem);
+    }
   else if (opt_derep_id)
     {
       derep(opt_derep_id, true);


=====================================
src/vsearch.h
=====================================
@@ -258,6 +258,7 @@
 #include "cut.h"
 #include "orient.h"
 #include "fa2fq.h"
+#include "derepsmallmem.h"
 
 /* options */
 



View it on GitLab: https://salsa.debian.org/med-team/vsearch/-/commit/8fddc675f40a794cb43d1ca5878354ec1b6a2d4d
