[med-svn] [Git][med-team/vsearch][master] 5 commits: New upstream version 2.15.0

Sun Sep 20 16:12:21 BST 2020


Nilesh Patra pushed to branch master at Debian Med / vsearch


Commits:
43f7534f by Nilesh Patra at 2020-09-20T19:47:16+05:30
New upstream version 2.15.0
- - - - -
97c0d457 by Nilesh Patra at 2020-09-20T19:47:17+05:30
Update upstream source from tag 'upstream/2.15.0'

Update to upstream version '2.15.0'
with Debian dir 3ccf7154ef6af28f7c58dcfd03ae8be50c701f59
- - - - -
9815f033 by Nilesh Patra at 2020-09-20T20:34:50+05:30
Remove reference data, compare checksum

- - - - -
1a58654f by Nilesh Patra at 2020-09-20T20:36:56+05:30
Update changelog

- - - - -
91780227 by Nilesh Patra at 2020-09-20T20:37:25+05:30
routine-update: Ready to upload to unstable

- - - - -


23 changed files:

- README.md
- configure.ac
- debian/changelog
- − debian/tests/expected-output/test1-expected.out
- − debian/tests/expected-output/test2-expected.out
- debian/tests/run-unit-test
- debian/vsearch-examples.docs
- man/Makefile.am
- man/vsearch.1
- src/cluster.cc
- src/derep.cc
- src/derep.h
- src/fasta.cc
- src/fastq.cc
- src/fastqops.cc
- src/fastx.cc
- src/maps.cc
- src/mergepairs.cc
- src/sintax.cc
- src/sortbylength.cc
- src/sortbysize.cc
- src/vsearch.cc
- src/vsearch.h


Changes:

=====================================
README.md
=====================================
@@ -34,7 +34,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
 
 ## Getting Help
 
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
 
 ## Example
 
@@ -47,9 +47,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.14.2.tar.gz
-tar xzf v2.14.2.tar.gz
-cd vsearch-2.14.2
+wget https://github.com/torognes/vsearch/archive/v2.15.0.tar.gz
+tar xzf v2.15.0.tar.gz
+cd vsearch-2.15.0
 ./autogen.sh
 ./configure
 make
@@ -78,43 +78,43 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
 Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch-2.14.2-linux-x86_64.tar.gz
-tar xzf vsearch-2.14.2-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch-2.15.0-linux-x86_64.tar.gz
+tar xzf vsearch-2.15.0-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Linux ppc64le system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch-2.14.2-linux-ppc64le.tar.gz
-tar xzf vsearch-2.14.2-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch-2.15.0-linux-ppc64le.tar.gz
+tar xzf vsearch-2.15.0-linux-ppc64le.tar.gz
 ```
 
 Or these commands if you are using a Linux aarch64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch-2.14.2-linux-aarch64.tar.gz
-tar xzf vsearch-2.14.2-linux-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch-2.15.0-linux-aarch64.tar.gz
+tar xzf vsearch-2.15.0-linux-aarch64.tar.gz
 ```
 
 Or these commands if you are using a Mac:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch-2.14.2-macos-x86_64.tar.gz
-tar xzf vsearch-2.14.2-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch-2.15.0-macos-x86_64.tar.gz
+tar xzf vsearch-2.15.0-macos-x86_64.tar.gz
 ```
 
 Or if you are using Windows, download and extract (unzip) the contents of this file:
 
 ```
-https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch-2.14.2-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch-2.15.0-win-x86_64.zip
 ```
 
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.14.2-linux-x86_64` or `vsearch-2.14.2-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.15.0-linux-x86_64` or `vsearch-2.15.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
 
-Windows: You will now have the binary distribution in a folder called `vsearch-2.14.2-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.15.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
 
 
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.14.2/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.15.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
 
 
 ## Packages, plugins, and wrappers


=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.14.2], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.15.0], [torognes at ifi.uio.no])
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])


=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+vsearch (2.15.0-1) unstable; urgency=medium
+
+  * New upstream version 2.15.0
+  * Remove reference data, compare checksum
+
+ -- Nilesh Patra <npatra974 at gmail.com>  Sun, 20 Sep 2020 20:37:25 +0530
+
 vsearch (2.14.2-3) unstable; urgency=medium
 
   * Source-only upload


=====================================
debian/tests/expected-output/test1-expected.out deleted
=====================================
The diff for this file was not included because it is too large.

=====================================
debian/tests/expected-output/test2-expected.out deleted
=====================================
@@ -1,35 +0,0 @@
-vsearch --usearch_global query.fsa --db BioMarKs50k.fsa --id 0.9 --alnout test2.out
-vsearch v2.14.2_linux_x86_64, 7.5GB RAM, 8 cores
-
-Query >60caa38f93eb4a7ef8c0fa4d96a5a5f8;size=24
- %Id   TLen  Target
-100%    380  60caa38f93eb4a7ef8c0fa4d96a5a5f8;size=24
-
- Query 380nt >60caa38f93eb4a7ef8c0fa4d96a5a5f8;size=24
-Target 380nt >60caa38f93eb4a7ef8c0fa4d96a5a5f8;size=24
-
-Qry   1 + AGCTCCAATAGCGTATATTAAAATTGTTGCGGTTAAAACGCTCGTAGTTGGATATCTGCTAAGG 64
-          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-Tgt   1 + AGCTCCAATAGCGTATATTAAAATTGTTGCGGTTAAAACGCTCGTAGTTGGATATCTGCTAAGG 64
-
-Qry  65 + GGTTCCGGTCCTTCCCAGTGAAGAATACGCGGAACTCTTCTTGGCATTTATTCAGGGAAGGTGT 128
-          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-Tgt  65 + GGTTCCGGTCCTTCCCAGTGAAGAATACGCGGAACTCTTCTTGGCATTTATTCAGGGAAGGTGT 128
-
-Qry 129 + TTGCACTTTGTTGTGTGTCACATGATCTGAATTTTTACTTTGAGGAAATGAGAGTGTTTCAAGC 192
-          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-Tgt 129 + TTGCACTTTGTTGTGTGTCACATGATCTGAATTTTTACTTTGAGGAAATGAGAGTGTTTCAAGC 192
-
-Qry 193 + AGGCTTTCGCCGTGAATATGATAGCATGGAATAATAGCACAGGACCCCTTTCCAAAGCTGTTGG 256
-          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-Tgt 193 + AGGCTTTCGCCGTGAATATGATAGCATGGAATAATAGCACAGGACCCCTTTCCAAAGCTGTTGG 256
-
-Qry 257 + TTTTTTGGAACGAGGTAATCAGAATAAGGATAGTTGGGGGTATTCGTATTTAACTGTCAGAGGT 320
-          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-Tgt 257 + TTTTTTGGAACGAGGTAATCAGAATAAGGATAGTTGGGGGTATTCGTATTTAACTGTCAGAGGT 320
-
-Qry 321 + GAAATTCTTGGATTTTTTAAAGACGAACTATTGCGAAGGCATCTGCCCAGGATGTTTTTA 380
-          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-Tgt 321 + GAAATTCTTGGATTTTTTAAAGACGAACTATTGCGAAGGCATCTGCCCAGGATGTTTTTA 380
-
-380 cols, 380 ids (100.0%), 0 gaps (0.0%)


=====================================
debian/tests/run-unit-test
=====================================
@@ -13,14 +13,27 @@ cp /usr/share/doc/${pkg}-examples/* -a "${AUTOPKGTEST_TMP}"
 cd "${AUTOPKGTEST_TMP}"
 gunzip -r *
 
+function compare_cksum()
+{
+	# $1: file whose md5sum needs to be tested
+	# $2: expected checksum
+	if [ "$(cat $1 | md5sum | awk '{print $1}')" != "$2" ]
+	then
+		echo "Checksums do not match"
+		exit 1
+	fi
+
+}
+
 echo 'Test 1'
 vsearch --cluster_fast BioMarKs50k.fsa --id 0.97 --centroids test1.out
-diff -u test1.out test1-expected.out 
+compare_cksum test1.out f83633b2f3d9891122056f62b7f4d863
 echo 'PASS'
 echo
 
 echo 'Test 2'
 vsearch --usearch_global query.fsa --db BioMarKs50k.fsa --id 0.9 --alnout test2.out
-diff -u <(tail -n +3 test2.out) <(tail -n +3 test2-expected.out)
+(tail -n +3 test2.out) > test2_.out
+compare_cksum test2_.out a84f0fe81dfac71bef52eb715bfb9404
 echo 'PASS'
 

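A minimal sketch of the checksum comparison used in run-unit-test above, written so that it can be tried on its own. It assumes md5sum and awk are available. The function name check_md5 and the temporary file are illustrative and not part of the package. Unlike compare_cksum in the diff, this variant quotes the file name and returns a status instead of calling exit; that is a choice made for the sketch, not a description of the packaged script.

```sh
#!/bin/sh
# Hypothetical helper, not part of debian/tests/run-unit-test.
# $1: file whose MD5 digest is checked
# $2: expected digest (32 lowercase hex characters)
check_md5() {
    actual=$(md5sum < "$1" | awk '{print $1}')
    if [ "$actual" != "$2" ]; then
        echo "Checksums do not match: got $actual, expected $2" >&2
        return 1
    fi
}

# Usage on a throwaway file
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
check_md5 "$tmp" b1946ac92492d2347c6235b8d2611184 && echo 'PASS'
rm -f "$tmp"
```

Comparing a digest rather than a full expected-output file keeps the reference data out of the package, at the cost of a less informative failure: a mismatch reports that the output differs, but not where.
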

=====================================
debian/vsearch-examples.docs
=====================================
@@ -1,2 +1 @@
 debian/tests/data/*
-debian/tests/expected-output/*


=====================================
man/Makefile.am
=====================================
@@ -34,7 +34,7 @@ vsearch_manual.pdf : vsearch.1
 	else \
 		cat ; \
 	fi | \
-	groff -t -m mandoc -T ps -P -pa4 | ps2pdf - $@
+	groff -W space -t -m mandoc -T ps -P -pa4 | ps2pdf - $@
 
 CLEANFILES += vsearch_manual.pdf
 


=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "January 28, 2020" "version 2.14.2" "USER COMMANDS"
+.TH vsearch 1 "June 19, 2020" "version 2.15.0" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication and
@@ -34,7 +34,7 @@ Clustering:
 .RE
 Dereplication and rereplication:
 .RS
-\fBvsearch\fR (\-\-derep_fulllength | \-\-derep_prefix)
+\fBvsearch\fR (\-\-derep_fulllength | \-\-derep_id | \-\-derep_prefix)
 \fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
 .PP
 \fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output
@@ -198,17 +198,21 @@ alignment method described by Hirschberg (1975) and Myers and Miller
 .\" ----------------------------------------------------------------------------
 .SS Input
 \fBvsearch\fR accept as input fasta or fastq files containing one or
-several nucleotidic entries. In fasta files, each nucleotidic entry is
-made of a header and a sequence. The header is defined as the string
-comprised between the '>' symbol and the first space, tab or the end
-of the line, whichever comes first. Additionally, if the header
-matches
-'>[;]size=\fIinteger\fR;label', '>label;size=\fIinteger\fR;label' or
-'>label;size=\fIinteger\fR[;]', \fBvsearch\fR interpret
-\fIinteger\fR as the number of occurrences (or abundance) of the
-sequence in the study. That abundance information is used or created
-during chimera detection, clustering, dereplication, sorting and
-searching.
+several nucleotidic entries. In fasta files, each entry is made of a
+header and a sequence. The header is defined as the string comprised
+between the initial '>' symbol and the first space, tab or the end of
+the line, unless the \-\-notrunclabels option is in effect, in which
+case the entire line is included. The header should contain printable
+ascii characters (33-126). The program will terminate with a fatal
+error if there are unprintable ascii characters. A warning will be
+issued if non-ascii characters (128-255) are encountered.
+.PP
+If the header matches '>[;]size=\fIinteger\fR;label',
+'>label;size=\fIinteger\fR;label' or '>label;size=\fIinteger\fR[;]',
+\fBvsearch\fR interpret \fIinteger\fR as the number of occurrences (or
+abundance) of the sequence in the study. That abundance information is
+used or created during chimera detection, clustering, dereplication,
+sorting and searching.
 .PP
 The sequence is defined as a string of IUPAC symbols
 (ACGTURYSWKMDBHVN), starting after the end of the identifier line and
@@ -349,7 +353,8 @@ greater than \fIinteger\fR (50,000 nucleotides by default).
 .BI \-\-minseqlength\~ "positive integer"
 All \fBvsearch\fR operations discard sequences of length smaller than
 \fIinteger\fR: 1 nucleotide by default for sorting or shuffling, 32
-nucleotides for clustering, dereplication or searching.
+nucleotides for clustering and dereplication as well as the commands
+\-\-makeudb_usearch, \-\-sintax, and \-\-usearch_global.
 .TAG no_progress
 .TP
 .B \-\-no_progress
@@ -358,7 +363,8 @@ Do not show the gradually increasing progress indicator.
 .TP
 .B \-\-notrunclabels
 Do not truncate sequence labels at first space or tab, use the full
-header in output files.
+header in output files. Turned off by default for all commands except
+the sintax command.
 .TAG quiet
 .TP
 .B \-\-quiet
@@ -392,7 +398,7 @@ Input sequences are masked as specified with the \-\-qmask and
 \-\-hardmask options. Masking of the database for reference based
 chimera detection is specified with the \-\-dbmask option.
 .PP
-In \fIde novo\fR mode, input fasta file should present abundance
+In \fIde novo\fR mode, input fasta file must present abundance
 annotations (i.e. a pattern [;]size=\fIinteger\fR[;] in the fasta
 header). Input order matters for chimera detection, so we recommend to
 sort sequences by decreasing abundance (default of
@@ -437,15 +443,17 @@ parents, or sufficiently close relatives, are not present in the
 database.
 .TAG dn
 .TP
-.BI \-\-dn \0real
-No vote pseudo-count, corresponding to the parameter \fIn\fR in the
-chimera scoring function (default value is 1.4).
+.BI \-\-dn\~ "strictly positive real number"
+pseudo-count prior on the number of no votes, corresponding to the
+parameter \fIn\fR in the chimera scoring function (default value is
+1.4). Increasing \-\-dn reduces the likelihood of tagging a sequence
+as a chimera (less false positives, but also more false negatives).
 .TAG fasta_score
 .TP
 .B \-\-fasta_score
 Add the chimera score to the headers in the fasta output files for
-chimeras, non-chimeras and borderline sequences, using the format
-';uchime_denovo=\fIfloat\fR;'.
+chimeras, non-chimeras and borderline sequences, using the
+format ';uchime_denovo=\fIfloat\fR;'.
 .TAG mindiffs
 .TP
 .BI \-\-mindiffs\~ "positive integer"
@@ -521,8 +529,14 @@ false-positive rate in reference sequences).
 When using \-\-uchime_ref, ignore a reference sequence when its
 nucleotide sequence is strictly identical to the nucleotidic sequence
 of the query.
-.TAG sizeout
 .TP
+.B \-\-sizein
+In \fIde novo\fR mode, abundance annotations
+(pattern '[>;]size=\fIinteger\fR[;]') present in sequence headers are
+taken into account by default (\-\-sizein is always implied). This
+option is ignored by \-\-uchime_ref.
+.TP
+.TAG sizeout
 .B \-\-sizeout
 When relabelling, add abundance annotations to fasta headers (using
 the format ';size=\fIinteger\fR;').
@@ -619,13 +633,15 @@ When using \-\-uchimeout, write chimera detection results using a
 17\-field, tab\-separated uchime\-like format (drop the 5th field of
 \-\-uchimeout), compatible with usearch version 5 and earlier
 versions.
+.TP
 .TAG xn
+.BI \-\-xn\~ "strictly positive real number"
+weight of no votes, corresponding to the parameter \fIbeta\fR in the
+scoring function (default value is 8.0). Increasing \-\-xn reduces the
+likelihood of tagging a sequence as a chimera (less false positives,
+but also more false negatives).
 .TP
-.BI \-\-xn \0real
-No vote weight, corresponding to the parameter \fIbeta\fR in the
-scoring function (default value is 8.0).
 .TAG xsize
-.TP
 .B \-\-xsize
 Strip abundance information from the headers when writing the output
 file.
@@ -696,12 +712,14 @@ first sequence of the cluster).
 .TP
 .BI \-\-clusterout_id
 Add cluster identifier information to the output files
-when using the \-\-consout and \-\-profile options.
+when using the \-\-centroids, \-\-consout and \-\-profile options.
 .TAG clusterout_sort
 .TP
 .BI \-\-clusterout_sort
-Sort output files by decreasing abundance
-when using the \-\-consout, \-\-msaout and \-\-profile options.
+Sort some output files by decreasing abundance instead of input
+order. It applies to the \-\-consout, \-\-msaout, \-\-profile,
+\-\-centroids, and \-\-uc options. For \-\-uc, the sorting applies
+only to the centroid information part (the C lines).
 .TAG cluster_fast
 .TP
 .BI \-\-cluster_fast \0filename
@@ -901,8 +919,8 @@ greedy clustering (DGC).
 .TAG sizeout
 .TP
 .B \-\-sizeout
-Add abundance annotations to the output fasta files (add the pattern
-';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
+Add abundance annotations to the output fasta files (add the
+pattern ';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
 specified, abundance annotations are reported to output files, and
 each cluster centroid receives a new abundance value corresponding to
 the total abundance of the amplicons included in the cluster
@@ -992,6 +1010,12 @@ length and the same string of nucleotides (case insensitive, T and U
 are considered the same). See the options \-\-sizein and \-\-sizeout
 to take into account and compute abundance values. This command does
 not support multithreading.
+.TAG derep_id
+.TP 9
+.BI \-\-derep_id \0filename
+Merge strictly identical sequences contained in \fIfilename\fR, as
+with the \-\-derep_fulllength command, but the sequence labels
+(identifiers) on the header line need to be identical too.
 .TAG derep_prefix
 .TP
 .BI \-\-derep_prefix \0filename
@@ -1021,8 +1045,8 @@ Write the dereplicated sequences to \fIfilename\fR, in fasta format
 and sorted by decreasing abundance. Identical sequences receive the
 header of the first sequence of their group. If \-\-sizeout is used,
 the number of occurrences (i.e. abundance) of each sequence is
-indicated at the end of their fasta header using the pattern
-';size=\fIinteger\fR;'.
+indicated at the end of their fasta header using the
+pattern ';size=\fIinteger\fR;'.
 .TP
 .TAG relabel
 .BI \-\-relabel \0string
@@ -1067,8 +1091,8 @@ headers). That option is active by default when rereplicating.
 .TAG sizeout
 .TP
 .B \-\-sizeout
-Add abundance annotations to the output fasta file (add the pattern
-';size=\fIinteger\fR;' to sequence headers).  If \-\-sizein is
+Add abundance annotations to the output fasta file (add the
+pattern ';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is
 specified, each unique sequence receives a new abundance value
 corresponding to its total abundance (sum of the abundances of its
 occurrences). If \-\-sizein is not specified, input abundances are set
@@ -1537,18 +1561,20 @@ ambiguous bases (N's), as specified with the \-\-fastq_maxns are also
 discarded (no limit by default). Staggered reads are not merged unless
 the \-\-fastq_allowmergestagger option is specified. The minimum
 length of the overlap region between the reads may be specified with
-the \-\-fastq_minovlen option (default 10). The overlap region may
-not include more mismatches than specified with the \-\-fastq_maxdiffs
+the \-\-fastq_minovlen option (default 10). The overlap region may not
+include more mismatches than specified with the \-\-fastq_maxdiffs
 option (10 by default) or a higher percentage of mismatches than
 specified with the \-\-fastq_maxdiffpct option (100.0% by default),
 otherwise the read pair is discarded. Additional rules will avoid
 merging of reads that cannot be aligned reliably and
 unambiguously. The mimimum and maximum length of the merged sequence
 may be specified with the \-\-fastq_minmergelen and
-\-\-fastq_maxmergelen options, respectively. Other relevant options
-are: \-\-fastq_ascii, \-\-fastq_maxee, \-\-fastq_nostagger,
-\-\-fastq_qmax, \-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout,
-and \-\-label_suffix.
+\-\-fastq_maxmergelen options, respectively. The quality value limits
+for output files may be specied with the \-\-fastq_qminout and
+\-\-fastq_qmaxout options, but they apply only to the merged region.
+Other relevant options are: \-\-fastq_ascii, \-\-fastq_maxee,
+\-\-fastq_nostagger, \-\-fastq_qmax, \-\-fastq_qmin, and
+\-\-label_suffix.
 .TAG fastq_minlen
 .TP
 .BI \-\-fastq_minlen\~ "positive integer"
@@ -1580,10 +1606,11 @@ files. The default is 41, which is usual for recent Sanger/Illumina
 .TAG fastq_qmaxout
 .TP
 .BI \-\-fastq_qmaxout\~ "positive integer"
-When using \-\-fastq_convert or \-\-sff_convert, specify the maximum
-quality score used when writing FASTQ files. The default is 41, which
-is usual for recent Sanger/Illumina 1.8+ files. Older formats may use
-a maximum quality score of 40.
+When using \-\-fastq_mergepairs, \-\-fastq_convert or \-\-sff_convert,
+specify the maximum quality score used when writing FASTQ files. The
+default is 41, which is usual for recent Sanger/Illumina 1.8+
+files. Older formats may use a maximum quality score of 40. The limit
+only applies to the merged region when using \-\-fastq_mergepairs.
 .TAG fastq_qmin
 .TP
 .BI \-\-fastq_qmin\~ "positive integer"
@@ -1593,10 +1620,11 @@ files. Older formats may use scores between -5 and 2.
 .TAG fastq_qminout
 .TP
 .BI \-\-fastq_qminout\~ "positive integer"
-When using \-\-fastq_convert or \-\-sff_convert, specify the minimum
-quality score used when writing FASTQ files. The default is 0, which
-is usual for Sanger/Illumina 1.8+ files. Older versions of the format
-may use scores between -5 and 2.
+When using \-\-fastq_mergepairs, \-\-fastq_convert or \-\-sff_convert,
+specify the minimum quality score used when writing FASTQ files. The
+default is 0, which is usual for Sanger/Illumina 1.8+ files. Older
+versions of the format may use scores between -5 and 2. The limit
+applies only to the merged region when using \-\-fastq_mergepairs.
 .TAG fastq_stats
 .TP
 .BI \-\-fastq_stats \0filename
@@ -2411,11 +2439,11 @@ Reject the sequence match if the alignment contains at least
 \fIinteger\fR insertions or deletions.
 .TAG maxhits
 .TP
-.BI \-\-maxhits\~ "positive integer"
+.BI \-\-maxhits\~ "non-negative integer"
 Maximum number of hits to show once the search is terminated (hits are
-sorted by decreasing identity). Unlimited by default. That option
-applies to \-\-alnout, \-\-blast6out, \-\-fastapairs, \-\-samout,
-\-\-uc, or \-\-userout output files.
+sorted by decreasing identity). Unlimited by default or if the
+argument it zero. This option applies to \-\-alnout, \-\-blast6out,
+\-\-fastapairs, \-\-samout, \-\-uc, or \-\-userout output files.
 .TAG maxid
 .TP
 .BI \-\-maxid \0real
@@ -3099,6 +3127,9 @@ rank. Commas and semicolons are not allowed in the name of the rank.
 .PP
 Example: ">X80725_S000004313;\:tax=d:Bacteria,\:p:Proteobacteria,\:c:Gammaproteobacteria,\:o:Enterobacteriales,\:f:Enterobacteriaceae,\:g:Escherichia/Shigella,\:s:Escherichia_coli".
 .PP
+The option \-\-notrunclabels is turned on by default for this command,
+allowing spaces in the taxonomic identifiers.
+.PP
 .TAG db
 .TP 9
 .BI \-\-db \0filename
@@ -3391,8 +3422,8 @@ field is set to 0 if there is no alignment.
 .TP
 .B tstrand
 Target strand orientation (+ or - for nucleotide sequences). Always
-set to '+', so reverse strand matches have tstrand '+' and qstrand
-'-'. Empty field if there is no alignment.
+set to '+', so reverse strand matches have tstrand '+' and
+qstrand '\-'. Empty field if there is no alignment.
 .RE
 .PP
 .\" ============================================================================
@@ -4240,9 +4271,20 @@ relabelling options valid for certain commands.
 Fixed bug with sequences written to file specified with fastaout_rev
 for commands fastx_filter and fastq_filter.
 .TP
-.BR v2.14.2\~ "released January 28th, 20202"
+.BR v2.14.2\~ "released January 28th, 2020"
 Fixed some issues with the cut, fastx_revcomp, fastq_convert,
 fastq_mergepairs, and makeudb_usearch commands. Updated manual.
+.TP
+.BR v2.15.0\~ "released June 19th, 2020"
+Update manual and documentation. Turn on notrunclabels option for
+sintax command by default. Change maxhits 0 to mean unlimited hits,
+like the default. Allow non-ascii characters in headers, with a
+warning. Sort centroids and uc too when clusterout_sort specified. Add
+cluster id to centroids output when clusterout_id specified. Improve
+error messages when parsing FASTQ files. Add missing fastq_qminout
+option and fix label_suffix option for fastq_mergepairs. Add derep_id
+command that dereplicates based on both label and sequence. Remove
+compilation warnings.
 .LP
 .\" ============================================================================
 .\" TODO:


=====================================
src/cluster.cc
=====================================
@@ -1197,12 +1197,17 @@ void cluster(char * dbname,
     }
 
 
-  /* Sort clusters */
+  /* Sort sequences in clusters by their abundance or ordinal number */
   /* Sequences in same cluster must always come right after each other. */
   /* The centroid sequence must be the first in each cluster. */
 
   progress_init("Sorting clusters", clusters);
-  qsort(clusterinfo, seqcount, sizeof(clusterinfo_t), compare_byclusterno);
+  if (opt_clusterout_sort)
+    qsort(clusterinfo, seqcount, sizeof(clusterinfo_t),
+          compare_byclusterabundance);
+  else
+    qsort(clusterinfo, seqcount, sizeof(clusterinfo_t),
+          compare_byclusterno);
   progress_done();
 
   progress_init("Writing clusters", seqcount);
@@ -1237,7 +1242,9 @@ void cluster(char * dbname,
                                 cluster_abundance[clusterno],
                                 clusterno+1,
                                 -1.0,
-                                -1, -1, 0, 0.0);
+                                -1,
+                                opt_clusterout_id ? clusterno : -1,
+                                0, 0.0);
 
           if (opt_uc)
             fprintf(fp_uc, "C\t%d\t%" PRId64 "\t*\t*\t*\t*\t*\t%s\t*\n",
@@ -1332,15 +1339,6 @@ void cluster(char * dbname,
         }
     }
 
-  if (opt_clusterout_sort)
-    {
-      /* Optionally sort clusters by abundance */
-      progress_init("Sorting clusters by abundance", clusters);
-      qsort(clusterinfo, seqcount, sizeof(clusterinfo_t),
-            compare_byclusterabundance);
-      progress_done();
-    }
-
   if (opt_msaout || opt_consout || opt_profile)
     {
       int msa_target_count = 0;


=====================================
src/derep.cc
=====================================
@@ -204,9 +204,10 @@ void rehash(struct bucket * * hashtableref, int64_t alloc_clusters)
   * hashtableref = new_hashtable;
 }
 
-
-void derep_fulllength()
+void derep(char * input_filename, bool use_header)
 {
+  /* dereplicate full length sequences, optionally require identical headers */
+
   show_rusage();
 
   FILE * fp_output = 0;
@@ -226,7 +227,7 @@ void derep_fulllength()
         fatal("Unable to open output (uc) file for writing");
     }
 
-  fastx_handle h = fastx_open(opt_derep_fulllength);
+  fastx_handle h = fastx_open(input_filename);
 
   show_rusage();
 
@@ -281,7 +282,7 @@ void derep_fulllength()
   char * rc_seq_up = (char*) xmalloc(alloc_seqlen + 1);
 
   char * prompt = 0;
-  if (xsprintf(& prompt, "Dereplicating file %s", opt_derep_fulllength) == -1)
+  if (xsprintf(& prompt, "Dereplicating file %s", input_filename) == -1)
     fatal("Out of memory");
 
   progress_init(prompt, filesize);
@@ -368,6 +369,8 @@ void derep_fulllength()
         }
 
       char * seq = fastx_get_sequence(h);
+      char * header = fastx_get_header(h);
+      int64_t headerlen = fastx_get_header_length(h);
 
       /* normalize sequence: uppercase and replace U by T  */
       string_normalize(seq_up, seq, seqlen);
@@ -385,13 +388,16 @@ void derep_fulllength()
       */
 
       uint64_t hash = HASH(seq_up, seqlen);
+      if (use_header)
+        hash ^= HASH(header, headerlen);
       uint64_t j = hash & hash_mask;
       struct bucket * bp = hashtable + j;
 
       while ((bp->size)
              &&
              ((hash != bp->hash) ||
-              (seqcmp(seq_up, bp->seq, seqlen))))
+              (seqcmp(seq_up, bp->seq, seqlen)) ||
+              (use_header && strcmp(header, bp->header))))
         {
           j = (j+1) & hash_mask;
           bp = hashtable + j;
@@ -409,7 +415,8 @@ void derep_fulllength()
           while ((rc_bp->size)
                  &&
                  ((rc_hash != rc_bp->hash) ||
-                  (seqcmp(rc_seq_up, rc_bp->seq, seqlen))))
+                  (seqcmp(rc_seq_up, rc_bp->seq, seqlen)) ||
+                  (use_header && strcmp(header, bp->header))))
             {
               k = (k+1) & hash_mask;
               rc_bp = hashtable + k;
@@ -424,7 +431,6 @@ void derep_fulllength()
             }
         }
 
-      char * header = fastx_get_header(h);
       int abundance = fastx_get_abundance(h);
       int64_t ab = opt_sizein ? abundance : 1;
       sumsize += ab;
@@ -1086,3 +1092,13 @@ void derep_prefix()
   xfree(hashtable);
   db_free();
 }
+
+void derep_fulllength()
+{
+  derep(opt_derep_fulllength, false);
+}
+
+void derep_id()
+{
+  derep(opt_derep_id, true);
+}

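The difference between derep_fulllength() and derep_id() above comes down to the key that decides whether two entries are duplicates: the normalized sequence alone, or the header together with the normalized sequence. A rough awk illustration follows, assuming a FASTA file with one sequence line per entry. The function name count_unique and the sample data are hypothetical. This is not how vsearch implements it: vsearch uses a hash table, and by default discards sequences shorter than 32 nucleotides when dereplicating, which this sketch ignores.

```sh
#!/bin/sh
# Hypothetical helper: count unique entries in a single-line FASTA file.
# $1: FASTA file
# $2: "seq" (key is the sequence only) or "id" (key is label plus sequence)
count_unique() {
    awk -v mode="$2" '
        /^>/ { label = substr($0, 2); next }
        {
            seq = toupper($0)      # case-insensitive comparison
            gsub(/U/, "T", seq)    # treat U and T as the same
            key = (mode == "id") ? label SUBSEP seq : seq
            if (!(key in seen)) { seen[key] = 1; n++ }
        }
        END { print n + 0 }
    ' "$1"
}

fa=$(mktemp)
printf '>a\nACGT\n>b\nACGT\n>a\nacgu\n' > "$fa"
count_unique "$fa" seq   # sequence-only key
count_unique "$fa" id    # label-plus-sequence key
rm -f "$fa"
```

With the sample data, the three entries share one normalized sequence but carry two distinct labels, so the two modes report different counts.
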

=====================================
src/derep.h
=====================================
@@ -59,4 +59,5 @@
 */
 
 void derep_fulllength();
+void derep_id();
 void derep_prefix();


=====================================
src/fasta.cc
=====================================
@@ -366,6 +366,9 @@ void fasta_print_general(FILE * fp,
                                   xee);
     }
 
+  if (opt_label_suffix)
+    fprintf(fp, "%s", opt_label_suffix);
+
   if (clustersize > 0)
     fprintf(fp, ";seqs=%d", clustersize);
 


=====================================
src/fastq.cc
=====================================
@@ -84,19 +84,18 @@ void buffer_filter_extend(fastx_handle h,
                           uint64_t len,
                           unsigned int * char_action,
                           const unsigned char * char_mapping,
-                          uint64_t lineno_start)
+                          bool * ok,
+                          char * illegal_char)
 {
   buffer_makespace(dest_buffer, len+1);
 
   /* Strip unwanted characters from the string and raise warnings or
      errors on certain characters. */
 
-  uint64_t lineno = lineno_start;
-
   char * p = source_buf;
   char * d = dest_buffer->data + dest_buffer->length;
   char * q = d;
-  char msg[200];
+  * ok = true;
 
   for(uint64_t i = 0; i < len; i++)
     {
@@ -118,17 +117,9 @@ void buffer_filter_extend(fastx_handle h,
 
         case 2:
           /* fatal character */
-          if ((c>=32) && (c<127))
-            snprintf(msg,
-                     200,
-                     "Illegal character '%c'",
-                     c);
-          else
-            snprintf(msg,
-                     200,
-                     "Illegal unprintable ASCII character no %d",
-                     (unsigned char) c);
-          fastq_fatal(lineno, msg);
+          if (*ok)
+            * illegal_char = c;
+          * ok = false;
           break;
 
         case 3:
@@ -137,7 +128,6 @@ void buffer_filter_extend(fastx_handle h,
 
         case 4:
           /* newline (silently stripped) */
-          lineno++;
           break;
         }
     }
@@ -177,6 +167,10 @@ bool fastq_next(fastx_handle h,
 
   h->lineno_start = h->lineno;
 
+  char msg[200];
+  bool ok = true;
+  char illegal_char = 0;
+
   uint64_t rest = fastx_file_fill_buffer(h);
 
   /* check end of file */
@@ -221,8 +215,6 @@ bool fastq_next(fastx_handle h,
       rest -= len;
     }
 
-  uint64_t lineno_seq = h->lineno;
-
   /* read sequence line(s) */
   lf = 0;
   while (1)
@@ -255,17 +247,26 @@ bool fastq_next(fastx_handle h,
                            & h->sequence_buffer,
                            h->file_buffer.data + h->file_buffer.position,
                            len,
-                           char_fq_action_seq, char_mapping, lineno_seq);
+                           char_fq_action_seq, char_mapping,
+                           & ok, & illegal_char);
       h->file_buffer.position += len;
       rest -= len;
-    }
 
-#if 0
-  if (h->sequence_buffer.length == 0)
-    fastq_fatal(lineno_seq, "Empty sequence line");
-#endif
-
-  uint64_t lineno_plus = h->lineno;
+      if (!ok)
+        {
+          if ((illegal_char >= 32) && (illegal_char < 127))
+            snprintf(msg,
+                     200,
+                     "Illegal sequence character '%c'",
+                     illegal_char);
+          else
+            snprintf(msg,
+                     200,
+                     "Illegal sequence character (unprintable, no %d)",
+                     (unsigned char) illegal_char);
+          fastq_fatal(h->lineno - (lf ? 1 : 0), msg);
+        }
+    }
 
   /* read + line */
 
@@ -319,13 +320,11 @@ bool fastq_next(fastx_handle h,
         plusline_invalid = 1;
     }
   if (plusline_invalid)
-    fastq_fatal(lineno_plus,
+    fastq_fatal(h->lineno - (lf ? 1 : 0),
                 "'+' line must be empty or identical to header");
 
   /* read quality line(s) */
 
-  uint64_t lineno_qual = h->lineno;
-
   lf = 0;
   while (1)
     {
@@ -354,22 +353,38 @@ bool fastq_next(fastx_handle h,
           len = lf - (h->file_buffer.data + h->file_buffer.position) + 1;
           h->lineno++;
         }
+
       buffer_filter_extend(h,
                            & h->quality_buffer,
                            h->file_buffer.data + h->file_buffer.position,
                            len,
-                           char_fq_action_qual, chrmap_identity, lineno_qual);
+                           char_fq_action_qual, chrmap_identity,
+                           & ok, & illegal_char);
       h->file_buffer.position += len;
       rest -= len;
-    }
 
-#if 0
-  if (h->quality_buffer.length == 0)
-    fastq_fatal(lineno_seq, "Empty quality line");
-#endif
+      /* break if quality line already too long */
+      if (h->quality_buffer.length > h->sequence_buffer.length)
+        break;
+
+      if (!ok)
+        {
+          if ((illegal_char >= 32) && (illegal_char < 127))
+            snprintf(msg,
+                     200,
+                     "Illegal quality character '%c'",
+                     illegal_char);
+          else
+            snprintf(msg,
+                     200,
+                     "Illegal quality character (unprintable, no %d)",
+                     (unsigned char) illegal_char);
+          fastq_fatal(h->lineno - (lf ? 1 : 0), msg);
+        }
+    }
 
   if (h->sequence_buffer.length != h->quality_buffer.length)
-    fastq_fatal(lineno_qual,
+    fastq_fatal(h->lineno - (lf ? 1 : 0),
                 "Sequence and quality lines must be equally long");
 
   fastx_filter_header(h, truncateatspace);
@@ -483,6 +498,9 @@ void fastq_print_general(FILE * fp,
                                   xee);
     }
 
+  if (opt_label_suffix)
+    fprintf(fp, "%s", opt_label_suffix);
+
   if (opt_sizeout && (abundance > 0))
     fprintf(fp, ";size=%u", abundance);
 

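Note on the fastq.cc hunks above: the filter routine no longer formats and raises the error itself. It records the first illegal character through out-parameters, and the caller builds the message, so the same filter can report "sequence" or "quality" errors with the caller's own line number. The sketch below illustrates that split under simplifying assumptions: `filter_append` and `illegal_char_message` are hypothetical names, and the rule "letters are legal, newlines are stripped, everything else is fatal" stands in for the `char_action` tables used in vsearch. The message formats are copied from the diff.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for buffer_filter_extend: appends the letters of src
// to dest, silently drops newlines, and stores the first other character in
// *illegal_char. Returns false if any such character was seen.
bool filter_append(std::string & dest, const std::string & src, char * illegal_char)
{
  assert(illegal_char != nullptr);
  bool ok = true;
  for (char c : src)
    {
      if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        dest.push_back(c);
      else if (c != '\n' && c != '\r')
        {
          if (ok)
            *illegal_char = c;   // remember only the first offender
          ok = false;
        }
    }
  return ok;
}

// Builds the text the caller would hand to its fatal-error routine.
// kind is "sequence" or "quality", matching the two call sites in the diff.
std::string illegal_char_message(const char * kind, char c)
{
  char msg[200];
  if ((c >= 32) && (c < 127))
    std::snprintf(msg, sizeof msg, "Illegal %s character '%c'", kind, c);
  else
    std::snprintf(msg, sizeof msg, "Illegal %s character (unprintable, no %d)",
                  kind, (unsigned char) c);
  return msg;
}
```

A caller would invoke `filter_append` once per buffered chunk and, on a false return, pass `illegal_char_message(...)` together with the current line number to its error routine, which is the role `fastq_fatal` plays in the diff.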

=====================================
src/fastqops.cc
=====================================
@@ -653,11 +653,6 @@ void fastx_revcomp()
   char * seq_buffer = (char*) xmalloc(buffer_alloc);
   char * qual_buffer = (char*) xmalloc(buffer_alloc);
 
-  uint64_t header_alloc = 512;
-  char * header = (char*) xmalloc(header_alloc);
-
-  uint64_t suffix_length = opt_label_suffix ? strlen(opt_label_suffix) : 0;
-
   fastx_handle h = fastx_open(opt_fastx_revcomp);
 
   if (!h)
@@ -699,22 +694,7 @@ void fastx_revcomp()
       /* header */
 
       uint64_t hlen = fastx_get_header_length(h);
-
-      if (hlen + suffix_length + 1 > header_alloc)
-        {
-          header_alloc = hlen + suffix_length + 1;
-          header = (char*) xrealloc(header, header_alloc);
-        }
-
-      char * d = fastx_get_header(h);
-
-      if (opt_label_suffix)
-        snprintf(header, header_alloc, "%s%s", d, opt_label_suffix);
-      else
-        snprintf(header, header_alloc, "%s", d);
-
-      hlen += suffix_length;
-
+      char * header = fastx_get_header(h);
       int64_t abundance = fastx_get_abundance(h);
 
 
@@ -780,7 +760,6 @@ void fastx_revcomp()
 
   fastx_close(h);
 
-  xfree(header);
   xfree(seq_buffer);
   xfree(qual_buffer);
 }


=====================================
src/fastx.cc
=====================================
@@ -169,6 +169,28 @@ void fastx_filter_header(fastx_handle h, bool truncateatspace)
 
           exit(EXIT_FAILURE);
 
+        case 7:
+          /* Non-ASCII but acceptable */
+          fprintf(stderr,
+                  "\n"
+                  "WARNING: Non-ASCII character encountered in FASTA/FASTQ header.\n"
+                  "Character no %d (0x%2x) on or right before line %"
+                  PRIu64 ".\n",
+                  c, c,
+                  h->lineno);
+
+          if (fp_log)
+            fprintf(fp_log,
+                    "\n"
+                    "WARNING: Non-ASCII character encountered in FASTA/FASTQ header.\n"
+                    "Character no %d (0x%2x) on or right before line %"
+                    PRIu64 ".\n",
+                    c, c,
+                    h->lineno);
+
+          *q++ = c;
+          break;
+
         case 5:
         case 6:
           /* tab or space */


=====================================
src/maps.cc
=====================================
@@ -84,6 +84,7 @@ unsigned int char_header_action[256] =
       4 = lf
       5 = tab
       6 = space
+      7 = non-ascii, legal, but warn
 
     @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
     P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
@@ -97,14 +98,14 @@ unsigned int char_header_action[256] =
     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
     1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
-    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
+    7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7
   };
 
 unsigned int char_fasta_action[256] =


=====================================
src/mergepairs.cc
=====================================
@@ -188,11 +188,9 @@ typedef struct merge_data_s
   int64_t fwd_trunc;
   int64_t rev_trunc;
   int64_t pair_no;
-  char * merged_header;
   char * merged_sequence;
   char * merged_quality;
   int64_t merged_length;
-  int64_t merged_header_alloc;
   int64_t merged_seq_alloc;
   double ee_merged;
   double ee_fwd;
@@ -310,13 +308,17 @@ void precompute_qual()
 
           /* Match */
           p = px * py / 3.0 / (1.0 - px - py + 4.0 * px * py / 3.0);
-          q = opt_fastq_ascii + MIN(round(-10.0*log10(p)), opt_fastq_qmaxout);
-          merge_qual_same[x][y] = q;
+          q = round(-10.0 * log10(p));
+          q = MIN(q, opt_fastq_qmaxout);
+          q = MAX(q, opt_fastq_qminout);
+          merge_qual_same[x][y] = opt_fastq_ascii + q;
 
           /* Mismatch, x is highest quality */
           p = px * (1.0 - py / 3.0) / (px + py - 4.0 * px * py / 3.0);
-          q = opt_fastq_ascii + MIN(round(-10.0*log10(p)), opt_fastq_qmaxout);
-          merge_qual_diff[x][y] = q;
+          q = round(-10.0 * log10(p));
+          q = MIN(q, opt_fastq_qmaxout);
+          q = MAX(q, opt_fastq_qminout);
+          merge_qual_diff[x][y] = opt_fastq_ascii + q;
 
           /*
             observed match,
@@ -392,8 +394,8 @@ void keep(merge_data_t * ip)
       fastq_print_general(fp_fastqout,
                           ip->merged_sequence,
                           ip->merged_length,
-                          ip->merged_header,
-                          strlen(ip->merged_header),
+                          ip->fwd_header,
+                          strlen(ip->fwd_header),
                           ip->merged_quality,
                           0,
                           merged,
@@ -406,8 +408,8 @@ void keep(merge_data_t * ip)
                           0,
                           ip->merged_sequence,
                           ip->merged_length,
-                          ip->merged_header,
-                          strlen(ip->merged_header),
+                          ip->fwd_header,
+                          strlen(ip->fwd_header),
                           0,
                           merged,
                           ip->ee_merged,
@@ -645,12 +647,6 @@ void merge(merge_data_t * ip)
 
   if (ip->ee_merged <= opt_fastq_maxee)
     {
-      if (opt_label_suffix)
-        (void) sprintf(ip->merged_header, "%s%s",
-                       ip->fwd_header, opt_label_suffix);
-      else
-        strcpy(ip->merged_header, ip->fwd_header);
-
       ip->reason = ok;
       ip->merged = 1;
     }
@@ -932,8 +928,6 @@ void process(merge_data_t * ip,
 
 bool read_pair(merge_data_t * ip)
 {
-  int64_t suffix_len = opt_label_suffix ? strlen(opt_label_suffix) : 0;
-
   if (fastq_next(fastq_fwd, 0, chrmap_upcase))
     {
       if (! fastq_next(fastq_rev, 0, chrmap_upcase))
@@ -979,15 +973,6 @@ bool read_pair(merge_data_t * ip)
                                                 merged_seq_needed);
         }
 
-      int64_t merged_header_needed = fwd_header_len + suffix_len + 1;
-
-      if (merged_header_needed > ip->merged_header_alloc)
-        {
-          ip->merged_header_alloc = merged_header_needed;
-          ip->merged_header = (char*) xrealloc(ip->merged_header,
-                                               merged_header_needed);
-        }
-
       /* make local copies of the seq, header and qual */
 
       strcpy(ip->fwd_header,   fastq_get_header(fastq_fwd));
@@ -997,7 +982,6 @@ bool read_pair(merge_data_t * ip)
       strcpy(ip->fwd_quality,  fastq_get_quality(fastq_fwd));
       strcpy(ip->rev_quality,  fastq_get_quality(fastq_rev));
 
-      ip->merged_header[0] = 0;
       ip->merged_sequence[0] = 0;
       ip->merged_quality[0] = 0;
       ip->merged = 0;
@@ -1035,8 +1019,6 @@ void init_merge_data(merge_data_t * ip)
   ip->reason = undefined;
   ip->merged_seq_alloc = 0;
   ip->merged_sequence = 0;
-  ip->merged_header = 0;
-  ip->merged_header_alloc = 0;
   ip->merged_quality = 0;
   ip->merged_length = 0;
 }
@@ -1056,8 +1038,6 @@ void free_merge_data(merge_data_t * ip)
   if (ip->rev_quality)
     xfree(ip->rev_quality);
 
-  if (ip->merged_header)
-    xfree(ip->merged_header);
   if (ip->merged_sequence)
     xfree(ip->merged_sequence);
   if (ip->merged_quality)


=====================================
src/sintax.cc
=====================================
@@ -323,6 +323,13 @@ void sintax_query(int64_t t)
   int boot_count[2];
   unsigned int best_count[2];
 
+  best_count[0] = 0;
+  best_count[1] = 0;
+  best_seqno[0] = 0;
+  best_seqno[1] = 0;
+  boot_count[0] = 0;
+  boot_count[1] = 0;
+
   int qseqlen = si_plus[t].qseqlen;
   char * query_head = si_plus[t].query_head;
 
@@ -344,10 +351,6 @@ void sintax_query(int64_t t)
 
       /* perform 100 bootstraps */
 
-      best_count[s] = 0;
-      best_seqno[s] = 0;
-      boot_count[s] = 0;
-
       if (kmersamplecount >= subset_size)
         {
           for (int i = 0; i < bootstrap_count ; i++)


=====================================
src/sortbylength.cc
=====================================
@@ -60,7 +60,7 @@
 
 #include "vsearch.h"
 
-static struct sortinfo_s
+static struct sortinfo_length_s
 {
   unsigned int length;
   unsigned int size;
@@ -69,8 +69,8 @@ static struct sortinfo_s
 
 int sortbylength_compare(const void * a, const void * b)
 {
-  struct sortinfo_s * x = (struct sortinfo_s *) a;
-  struct sortinfo_s * y = (struct sortinfo_s *) b;
+  struct sortinfo_length_s * x = (struct sortinfo_length_s *) a;
+  struct sortinfo_length_s * y = (struct sortinfo_length_s *) b;
 
   /* longest first, then most abundant, then by label, otherwise keep order */
 
@@ -110,7 +110,8 @@ void sortbylength()
   show_rusage();
 
   int dbsequencecount = db_getsequencecount();
-  sortinfo = (struct sortinfo_s *) xmalloc(dbsequencecount * sizeof(sortinfo_s));
+  sortinfo = (struct sortinfo_length_s *)
+    xmalloc(dbsequencecount * sizeof(sortinfo_length_s));
 
   int passed = 0;
 
@@ -127,7 +128,7 @@ void sortbylength()
   show_rusage();
 
   progress_init("Sorting", 100);
-  qsort(sortinfo, passed, sizeof(sortinfo_s), sortbylength_compare);
+  qsort(sortinfo, passed, sizeof(sortinfo_length_s), sortbylength_compare);
   progress_done();
 
   double median = 0.0;


=====================================
src/sortbysize.cc
=====================================
@@ -60,7 +60,7 @@
 
 #include "vsearch.h"
 
-static struct sortinfo_s
+static struct sortinfo_size_s
 {
   unsigned int size;
   unsigned int seqno;
@@ -68,8 +68,8 @@ static struct sortinfo_s
 
 int sortbysize_compare(const void * a, const void * b)
 {
-  struct sortinfo_s * x = (struct sortinfo_s *) a;
-  struct sortinfo_s * y = (struct sortinfo_s *) b;
+  struct sortinfo_size_s * x = (struct sortinfo_size_s *) a;
+  struct sortinfo_size_s * y = (struct sortinfo_size_s *) b;
 
   /* highest abundance first, then by label, otherwise keep order */
 
@@ -108,7 +108,8 @@ void sortbysize()
 
   progress_init("Getting sizes", dbsequencecount);
 
-  sortinfo = (struct sortinfo_s*) xmalloc(dbsequencecount * sizeof(sortinfo_s));
+  sortinfo = (struct sortinfo_size_s*)
+    xmalloc(dbsequencecount * sizeof(sortinfo_size_s));
 
   int passed = 0;
 
@@ -130,7 +131,7 @@ void sortbysize()
   show_rusage();
 
   progress_init("Sorting", 100);
-  qsort(sortinfo, passed, sizeof(sortinfo_s), sortbysize_compare);
+  qsort(sortinfo, passed, sizeof(sortinfo_size_s), sortbysize_compare);
   progress_done();
 
   double median = 0.0;


=====================================
src/vsearch.cc
=====================================
@@ -102,6 +102,7 @@ char * opt_db;
 char * opt_dbmatched;
 char * opt_dbnotmatched;
 char * opt_derep_fulllength;
+char * opt_derep_id;
 char * opt_derep_prefix;
 char * opt_eetabbedout;
 char * opt_fastaout;
@@ -675,6 +676,7 @@ void args_init(int argc, char **argv)
   opt_dbmatched = 0;
   opt_dbnotmatched = 0;
   opt_derep_fulllength = 0;
+  opt_derep_id = 0;
   opt_derep_prefix = 0;
   opt_dn = 1.4;
   opt_ee_cutoffs_count = 3;
@@ -782,7 +784,7 @@ void args_init(int argc, char **argv)
   opt_maxaccepts = 1;
   opt_maxdiffs = INT_MAX;
   opt_maxgaps = INT_MAX;
-  opt_maxhits = LONG_MAX;
+  opt_maxhits = 0;
   opt_maxid = 1.0;
   opt_maxqsize = INT_MAX;
   opt_maxqt = DBL_MAX;
@@ -914,6 +916,7 @@ void args_init(int argc, char **argv)
     option_dbmatched,
     option_dbnotmatched,
     option_derep_fulllength,
+    option_derep_id,
     option_derep_prefix,
     option_dn,
     option_ee_cutoffs,
@@ -1142,6 +1145,7 @@ void args_init(int argc, char **argv)
     {"dbmatched",             required_argument, 0, 0 },
     {"dbnotmatched",          required_argument, 0, 0 },
     {"derep_fulllength",      required_argument, 0, 0 },
+    {"derep_id",              required_argument, 0, 0 },
     {"derep_prefix",          required_argument, 0, 0 },
     {"dn",                    required_argument, 0, 0 },
     {"ee_cutoffs",            required_argument, 0, 0 },
@@ -2287,6 +2291,10 @@ void args_init(int argc, char **argv)
           opt_relabel_self = 1;
           break;
 
+        case option_derep_id:
+          opt_derep_id = optarg;
+          break;
+
         default:
           fatal("Internal error in option parsing");
         }
@@ -2311,6 +2319,7 @@ void args_init(int argc, char **argv)
       option_cluster_unoise,
       option_cut,
       option_derep_fulllength,
+      option_derep_id,
       option_derep_prefix,
       option_fastq_chars,
       option_fastq_convert,
@@ -2858,6 +2867,34 @@ void args_init(int argc, char **argv)
         option_xsize,
         -1 },
 
+      { option_derep_id,
+        option_bzip2_decompress,
+        option_fasta_width,
+        option_gzip_decompress,
+        option_log,
+        option_maxseqlength,
+        option_maxuniquesize,
+        option_minseqlength,
+        option_minuniquesize,
+        option_no_progress,
+        option_notrunclabels,
+        option_output,
+        option_quiet,
+        option_relabel,
+        option_relabel_keep,
+        option_relabel_md5,
+        option_relabel_self,
+        option_relabel_sha1,
+        option_sizein,
+        option_sizeout,
+        option_strand,
+        option_threads,
+        option_topn,
+        option_uc,
+        option_xee,
+        option_xsize,
+        -1 },
+
       { option_derep_prefix,
         option_bzip2_decompress,
         option_fasta_width,
@@ -3046,6 +3083,7 @@ void args_init(int argc, char **argv)
         option_fastq_qmax,
         option_fastq_qmaxout,
         option_fastq_qmin,
+        option_fastq_qminout,
         option_fastq_truncqual,
         option_fastqout,
         option_fastqout_notmerged_fwd,
@@ -4056,6 +4094,9 @@ void args_init(int argc, char **argv)
   if (opt_maxsize < 1)
     fatal("The argument to maxsize must be at least 1");
 
+  if (opt_maxhits < 0)
+    fatal("The argument to maxhits cannot be negative");
+
 
   /* TODO: check valid range of gap penalties */
 
@@ -4086,6 +4127,9 @@ void args_init(int argc, char **argv)
 
   /* set defaults parameters, if not specified */
 
+  if (opt_maxhits == 0)
+    opt_maxhits = LONG_MAX;
+
   if (opt_minwordmatches < 0)
     opt_minwordmatches = minwordmatches_defaults[opt_wordlength];
 
@@ -4111,13 +4155,23 @@ void args_init(int argc, char **argv)
 
   if (opt_minseqlength < 0)
     {
-      if (opt_cluster_smallmem || opt_cluster_fast || opt_cluster_size ||
-          opt_usearch_global || opt_derep_fulllength || opt_derep_prefix ||
-          opt_makeudb_usearch || opt_cluster_unoise || opt_sintax)
+      if (opt_cluster_fast ||
+          opt_cluster_size ||
+          opt_cluster_smallmem ||
+          opt_cluster_unoise ||
+          opt_derep_fulllength ||
+          opt_derep_id ||
+          opt_derep_prefix ||
+          opt_makeudb_usearch ||
+          opt_sintax ||
+          opt_usearch_global)
         opt_minseqlength = 32;
       else
         opt_minseqlength = 1;
     }
+
+  if (opt_sintax)
+    opt_notrunclabels = 1;
 }
 
 void show_publication()
@@ -4267,6 +4321,7 @@ void cmd_help()
               "\n"
               "Dereplication and rereplication\n"
               "  --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n"
+              "  --derep_id FILENAME         dereplicate using both identifiers and sequences\n"
               "  --derep_prefix FILENAME     dereplicate sequences in file based on prefixes\n"
               "  --rereplicate FILENAME      rereplicate sequences in the given FASTA file\n"
               " Parameters\n"
@@ -4379,7 +4434,7 @@ void cmd_help()
               "  --fastqout FILENAME         FASTQ output filename for merged sequences\n"
               "  --fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences\n"
               "  --fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences\n"
-              "  --label_suffix              suffix to append to label of merged sequences\n"
+              "  --label_suffix STRING       suffix to append to label of merged sequences\n"
               "  --xee                       remove expected errors (ee) info from output\n"
               "\n"
               "Pairwise alignment\n"
@@ -4673,6 +4728,8 @@ void cmd_derep()
 
   if (opt_derep_fulllength)
     derep_fulllength();
+  else if (opt_derep_id)
+    derep_id();
   else
     {
       if (opt_strand > 1)
@@ -4965,7 +5022,7 @@ int main(int argc, char** argv)
     cmd_sortbysize();
   else if (opt_sortbylength)
     cmd_sortbylength();
-  else if (opt_derep_fulllength || opt_derep_prefix)
+  else if (opt_derep_fulllength || opt_derep_id || opt_derep_prefix)
     cmd_derep();
   else if (opt_shuffle)
     cmd_shuffle();


=====================================
src/vsearch.h
=====================================
@@ -291,6 +291,7 @@ extern char * opt_db;
 extern char * opt_dbmatched;
 extern char * opt_dbnotmatched;
 extern char * opt_derep_fulllength;
+extern char * opt_derep_id;
 extern char * opt_derep_prefix;
 extern char * opt_eetabbedout;
 extern char * opt_fastaout;



View it on GitLab: https://salsa.debian.org/med-team/vsearch/-/compare/647be73e11398303a4ca69bc41a4ba2669634548...91780227165a2cd116ba9198a8bf8990fca730d9


