[med-svn] [Git][med-team/vsearch][master] 6 commits: New upstream version 2.8.5
Andreas Tille
gitlab at salsa.debian.org
Wed Sep 26 20:15:23 BST 2018
Andreas Tille pushed to branch master at Debian Med / vsearch
Commits:
c89cbc05 by Andreas Tille at 2018-09-26T18:58:13Z
New upstream version 2.8.5
- - - - -
c3ec18fe by Andreas Tille at 2018-09-26T18:58:14Z
Update upstream source from tag 'upstream/2.8.5'
Update to upstream version '2.8.5'
with Debian dir c2d06d4e6763e50513738b683958b5211f0eff72
- - - - -
aa8bb7e5 by Andreas Tille at 2018-09-26T18:58:14Z
New upstream version
- - - - -
d51114b3 by Andreas Tille at 2018-09-26T18:58:16Z
Standards-Version: 4.2.1
- - - - -
781a1516 by Andreas Tille at 2018-09-26T18:59:46Z
Update patches
- - - - -
2223cdf9 by Andreas Tille at 2018-09-26T19:14:33Z
Upload to unstable
- - - - -
14 changed files:
- README.md
- configure.ac
- debian/changelog
- debian/control
- debian/patches/compilerflags.patch
- man/vsearch.1
- src/abundance.cc
- src/align.cc
- src/derep.cc
- src/eestats.cc
- src/mergepairs.cc
- src/subsample.cc
- src/util.cc
- src/vsearch.h
Changes:
=====================================
README.md
=====================================
@@ -24,7 +24,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
## Getting Help
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
## Example
@@ -37,9 +37,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
```
-wget https://github.com/torognes/vsearch/archive/v2.8.0.tar.gz
-tar xzf v2.8.0.tar.gz
-cd vsearch-2.8.0
+wget https://github.com/torognes/vsearch/archive/v2.8.5.tar.gz
+tar xzf v2.8.5.tar.gz
+cd vsearch-2.8.5
./autogen.sh
./configure
make
@@ -70,36 +70,36 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch-2.8.0-linux-x86_64.tar.gz
-tar xzf vsearch-2.8.0-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch-2.8.5-linux-x86_64.tar.gz
+tar xzf vsearch-2.8.5-linux-x86_64.tar.gz
```
Or these commands if you are using a Linux ppc64le system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch-2.8.0-linux-ppc64le.tar.gz
-tar xzf vsearch-2.8.0-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch-2.8.5-linux-ppc64le.tar.gz
+tar xzf vsearch-2.8.5-linux-ppc64le.tar.gz
```
Or these commands if you are using a Mac:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch-2.8.0-macos-x86_64.tar.gz
-tar xzf vsearch-2.8.0-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch-2.8.5-macos-x86_64.tar.gz
+tar xzf vsearch-2.8.5-macos-x86_64.tar.gz
```
Or if you are using Windows, download and extract (unzip) the contents of this file:
```
-https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch-2.8.0-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch-2.8.5-win-x86_64.zip
```
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.8.0-linux-x86_64` or `vsearch-2.8.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.8.5-linux-x86_64` or `vsearch-2.8.5-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
-Windows: You will now have the binary distribution in a folder called `vsearch-2.8.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.8.5-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.8.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.8.5/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
## Plugins, packages, and wrappers
=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.8.0], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.8.5], [torognes at ifi.uio.no])
AC_CANONICAL_TARGET
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
@@ -14,8 +14,8 @@ MACOSX_DEPLOYMENT_TARGET="10.7"
# Set default gcc and g++ options
-CFLAGS='-g'
-CXXFLAGS='-g'
+CFLAGS='-g -march=native'
+CXXFLAGS='-g -march=native'
# Checks for programs.
AC_PROG_CXX
=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+vsearch (2.8.5-1) unstable; urgency=medium
+
+ * New upstream version
+ * Standards-Version: 4.2.1
+
+ -- Andreas Tille <tille at debian.org> Wed, 26 Sep 2018 21:00:00 +0200
+
vsearch (2.8.0-1) unstable; urgency=medium
* New upstream version
=====================================
debian/control
=====================================
@@ -10,7 +10,7 @@ Build-Depends: debhelper (>= 11~),
python-markdown,
ghostscript,
time
-Standards-Version: 4.1.4
+Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/vsearch
Vcs-Git: https://salsa.debian.org/med-team/vsearch.git
Homepage: https://github.com/torognes/vsearch/
=====================================
debian/patches/compilerflags.patch
=====================================
@@ -6,10 +6,10 @@ Author: Sascha Steinbiss <satta at debian.org>
# Set default gcc and g++ options
--CFLAGS='-g'
--CXXFLAGS='-g'
-+#CFLAGS='-g'
-+#CXXFLAGS='-g'
+-CFLAGS='-g -march=native'
+-CXXFLAGS='-g -march=native'
++#CFLAGS='-g -march=native'
++#CXXFLAGS='-g -march=native'
# Checks for programs.
AC_PROG_CXX
=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
.\" ============================================================================
-.TH vsearch 1 "April 24, 2018" "version 2.8.0" "USER COMMANDS"
+.TH vsearch 1 "September 26, 2018" "version 2.8.5" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em chimera detection, clustering, dereplication and
@@ -828,7 +828,8 @@ Dereplication and rereplication options:
Merge strictly identical sequences contained in
\fIfilename\fR. Identical sequences are defined as having the same
length and the same string of nucleotides (case insensitive, T and U
-are considered the same).
+are considered the same). See the options \-\-sizein and \-\-sizeout
+to take into account and compute abundance values.
.TP
.BI \-\-derep_prefix \0filename
Merge sequences with identical prefixes contained in \fIfilename\fR.
@@ -3394,6 +3395,29 @@ incorrect FASTA headers of consensus sequences after clustering.
.TP
.BR v2.8.0\~ "released April 24th, 2018"
Added the fastq_maxdiffpct option to the fastq_mergepairs command.
+.TP
+.BR v2.8.1\~ "released June 22nd, 2018"
+Fixes for compilation warnings with GCC 8.
+.TP
+.BR v2.8.2\~ "released August 21st, 2018"
+Fix for wrong placement of semicolons in header lines in some cases
+when using the sizeout or xsize options. Reduced memory requirements
+for full-length dereplication in cases with many duplicate sequences.
+Improved wording of fastq_mergepairs report. Updated manual regarding
+use of sizein and sizeout with dereplication. Changed a compiler
+option.
+.TP
+.BR v2.8.3\~ "released August 31st, 2018"
+Fix for segmentation fault for \-\-derep_fulllength with \-\-uc.
+.TP
+.BR v2.8.4\~ "released September 3rd, 2018"
+Further reduce memory requirements for dereplication when not using
+the uc option. Fix output during subsampling when quiet or log options
+are in effect.
+.TP
+.BR v2.8.5\~ "released September 26th, 2018"
+Fixed a bug in fastq_eestats2 that caused the values for large lengths
+to be much too high when the input sequences had varying lengths.
.RE
.LP
.\" ============================================================================
@@ -3403,7 +3427,3 @@ Added the fastq_maxdiffpct option to the fastq_mergepairs command.
.\" visualize and output to pdf
.\" man -l vsearch.1
.\" man -t ./doc/vsearch.1 | ps2pdf - > ./doc/vsearch_manual.pdf
-.\"
-.\" INSTALL (sysadmin)
-.\" gzip -c vsearch.1 > vsearch.1.gz
-.\" mv vsearch.1.gz /usr/share/man/man1/
=====================================
src/abundance.cc
=====================================
@@ -149,12 +149,12 @@ void abundance_fprint_header_strip_size(FILE * fp,
{
if (start <= 1)
{
- if (end < header_length)
+ if (end < header_length - 1)
fprintf(fp, "%s", header + end + 1);
}
else
{
- if (end == header_length)
+ if (end >= header_length - 1)
fprintf(fp, "%.*s", start - 1, header);
else
fprintf(fp, "%.*s;%.*s",
=====================================
src/align.cc
=====================================
@@ -85,7 +85,7 @@ inline void pushop(char newop, char ** cigarendp, char * op, int * count)
char buf[25];
int len = sprintf(buf, "%d", *count);
*cigarendp -= len;
- strncpy(*cigarendp, buf, (size_t)len);
+ memcpy(*cigarendp, buf, (size_t)len);
}
*op = newop;
*count = 1;
@@ -102,7 +102,7 @@ inline void finishop(char ** cigarendp, char * op, int * count)
char buf[25];
int len = sprintf(buf, "%d", *count);
*cigarendp -= len;
- strncpy(*cigarendp, buf, (size_t)len);
+ memcpy(*cigarendp, buf, (size_t)len);
}
*op = 0;
*count = 0;
=====================================
src/derep.cc
=====================================
@@ -60,8 +60,6 @@
#include "vsearch.h"
-//#define BITMAP
-
#define HASH hash_cityhash64
struct bucket
@@ -71,23 +69,47 @@ struct bucket
unsigned int seqno_last;
unsigned int size;
bool deleted;
+ char * header;
+ char * seq;
};
-#ifdef BITMAP
-unsigned char * hash_occupied = 0;
-
-void hash_set_occupied(uint64_t hashindex)
+int derep_compare_prefix(const void * a, const void * b)
{
- hash_occupied[(hashindex) >> 3] |= 1 << (hashindex & 7);
-}
+ struct bucket * x = (struct bucket *) a;
+ struct bucket * y = (struct bucket *) b;
-int hash_is_occupied(uint64_t hashindex)
-{
- return hash_occupied[(hashindex) >> 3] & (1 << (hashindex & 7));
+ /* highest abundance first, then by label, otherwise keep order */
+
+ if (x->deleted > y->deleted)
+ return +1;
+ else if (x->deleted < y->deleted)
+ return -1;
+ else
+ {
+ if (x->size < y->size)
+ return +1;
+ else if (x->size > y->size)
+ return -1;
+ else
+ {
+ int r = strcmp(db_getheader(x->seqno_first),
+ db_getheader(y->seqno_first));
+ if (r != 0)
+ return r;
+ else
+ {
+ if (x->seqno_first < y->seqno_first)
+ return -1;
+ else if (x->seqno_first > y->seqno_first)
+ return +1;
+ else
+ return 0;
+ }
+ }
+ }
}
-#endif
-int derep_compare(const void * a, const void * b)
+int derep_compare_full(const void * a, const void * b)
{
struct bucket * x = (struct bucket *) a;
struct bucket * y = (struct bucket *) b;
@@ -106,8 +128,9 @@ int derep_compare(const void * a, const void * b)
return -1;
else
{
- int r = strcmp(db_getheader(x->seqno_first),
- db_getheader(y->seqno_first));
+ if (x->size == 0)
+ return 0;
+ int r = strcmp(x->header, y->header);
if (r != 0)
return r;
else
@@ -142,8 +165,50 @@ int seqcmp(char * a, char * b, int n)
return chrmap_4bit[(int)(*p)] - chrmap_4bit[(int)(*q)];
}
+void rehash(struct bucket * * hashtableref, int64_t alloc_clusters)
+{
+ /*
+ double the size of the hash table:
+
+ - allocate the new hash table
+ - rehash all entries from the old to the new table
+ - free the old table
+ - update variables
+ */
+
+ struct bucket * old_hashtable = * hashtableref;
+ uint64_t old_hashtablesize = 2 * alloc_clusters;
+ uint64_t new_hashtablesize = 2 * old_hashtablesize;
+ uint64_t new_hash_mask = new_hashtablesize - 1;
+
+ struct bucket * new_hashtable =
+ (struct bucket *) xmalloc(sizeof(bucket) * new_hashtablesize);
+ memset(new_hashtable, 0, sizeof(bucket) * new_hashtablesize);
+
+ /* rehash all */
+ for(uint64_t i = 0; i < old_hashtablesize; i++)
+ {
+ struct bucket * old_bp = old_hashtable + i;
+ if (old_bp->size)
+ {
+ uint64_t k = old_bp->hash & new_hash_mask;
+ while (new_hashtable[k].size)
+ k = (k + 1) & new_hash_mask;
+ struct bucket * new_bp = new_hashtable + k;
+
+ * new_bp = * old_bp;
+ }
+ }
+
+ xfree(old_hashtable);
+ * hashtableref = new_hashtable;
+}
+
+
void derep_fulllength()
{
+ show_rusage();
+
FILE * fp_output = 0;
FILE * fp_uc = 0;
@@ -161,57 +226,148 @@ void derep_fulllength()
fatal("Unable to open output (uc) file for writing");
}
- db_read(opt_derep_fulllength, 0);
+ fastx_handle h = fastx_open(opt_derep_fulllength);
show_rusage();
- int64_t dbsequencecount = db_getsequencecount();
+ if (!h)
+ fatal("Unrecognized input file type (not proper FASTA or FASTQ format)");
- /* adjust size of hash table for 2/3 fill rate */
+ uint64_t filesize = fastx_get_size(h);
- int64_t hashtablesize = 1;
- int hash_shift = 0;
- while (3 * dbsequencecount > 2 * hashtablesize)
- {
- hashtablesize <<= 1;
- hash_shift++;
- }
- int hash_mask = hashtablesize - 1;
+ /* allocate initial memory for 1024 clusters
+ with sequences of length 1023 */
+
+ uint64_t alloc_clusters = 1024;
+ uint64_t alloc_seqs = 1024;
+ int64_t alloc_seqlen = 1023;
+
+ uint64_t hashtablesize = 2 * alloc_clusters;
+ uint64_t hash_mask = hashtablesize - 1;
struct bucket * hashtable =
(struct bucket *) xmalloc(sizeof(bucket) * hashtablesize);
-
memset(hashtable, 0, sizeof(bucket) * hashtablesize);
-#ifdef BITMAP
- hash_occupied =
- (unsigned char *) xmalloc(hashtablesize / 8);
- memset(hash_occupied, 0, hashtablesize / 8);
-#endif
+ show_rusage();
- int64_t clusters = 0;
+ unsigned int * nextseqtab = 0;
+ const unsigned int terminal = (unsigned int)(-1);
+ char ** headertab = 0;
+ char * match_strand = 0;
+
+ if (opt_uc)
+ {
+ /* If the uc option is in effect we need to keep some extra info.
+ Allocate and init memory for this. */
+
+ /* Links to other sequences in cluster */
+ nextseqtab =
+ (unsigned int*) xmalloc(sizeof(unsigned int) * alloc_seqs);
+ memset(nextseqtab, terminal, sizeof(unsigned int) * alloc_seqs);
+
+ /* Pointers to the header strings */
+ headertab = (char **) xmalloc(sizeof(char*) * alloc_seqs);
+ memset(headertab, 0, sizeof(char*) * alloc_seqs);
+
+ /* Matching strand */
+ match_strand = (char *) xmalloc(alloc_seqs);
+ memset(match_strand, 0, alloc_seqs);
+ }
+
+ show_rusage();
+
+ char * seq_up = (char*) xmalloc(alloc_seqlen + 1);
+ char * rc_seq_up = (char*) xmalloc(alloc_seqlen + 1);
+
+ char * prompt = 0;
+ if (xsprintf(& prompt, "Dereplicating file %s", opt_derep_fulllength) == -1)
+ fatal("Out of memory");
+
+ progress_init(prompt, filesize);
+
+ uint64_t sequencecount = 0;
+ uint64_t nucleotidecount = 0;
+ int64_t shortest = INT64_MAX;
+ int64_t longest = 0;
+ uint64_t discarded_short = 0;
+ uint64_t discarded_long = 0;
+ uint64_t clusters = 0;
int64_t sumsize = 0;
uint64_t maxsize = 0;
double median = 0.0;
double average = 0.0;
- /* alloc and init table of links to other sequences in cluster */
+ while(fastx_next(h, 0, chrmap_no_change))
+ {
+ int64_t seqlen = fastx_get_sequence_length(h);
- unsigned int * nextseqtab = (unsigned int*) xmalloc(sizeof(unsigned int) * dbsequencecount);
- const unsigned int terminal = (unsigned int)(-1);
- memset(nextseqtab, -1, sizeof(unsigned int) * dbsequencecount);
+ if (seqlen < opt_minseqlength)
+ {
+ discarded_short++;
+ continue;
+ }
+
+ if (seqlen > opt_maxseqlength)
+ {
+ discarded_long++;
+ continue;
+ }
- char * match_strand = (char *) xmalloc(dbsequencecount);
- memset(match_strand, 0, dbsequencecount);
+ nucleotidecount += seqlen;
+ if (seqlen > longest)
+ longest = seqlen;
+ if (seqlen < shortest)
+ shortest = seqlen;
- char * seq_up = (char*) xmalloc(db_getlongestsequence() + 1);
- char * rc_seq_up = (char*) xmalloc(db_getlongestsequence() + 1);
+ /* check allocations */
- progress_init("Dereplicating", dbsequencecount);
- for(int64_t i=0; i<dbsequencecount; i++)
- {
- unsigned int seqlen = db_getsequencelen(i);
- char * seq = db_getsequence(i);
+ if (seqlen > alloc_seqlen)
+ {
+ alloc_seqlen = seqlen;
+ seq_up = (char*) xrealloc(seq_up, alloc_seqlen + 1);
+ rc_seq_up = (char*) xrealloc(rc_seq_up, alloc_seqlen + 1);
+
+ show_rusage();
+ }
+
+ if (opt_uc && (sequencecount + 1 > alloc_seqs))
+ {
+ uint64_t new_alloc_seqs = 2 * alloc_seqs;
+
+ nextseqtab =
+ (unsigned int*) xrealloc(nextseqtab,
+ sizeof(unsigned int) * new_alloc_seqs);
+ memset(nextseqtab + alloc_seqs,
+ terminal,
+ sizeof(unsigned int) * alloc_seqs);
+
+ headertab = (char**) xrealloc(headertab,
+ sizeof(char*) * new_alloc_seqs);
+ memset(headertab + alloc_seqs, 0, sizeof(char*) * alloc_seqs);
+
+ match_strand = (char *) xrealloc(match_strand, new_alloc_seqs);
+ memset(match_strand + alloc_seqs, 0, alloc_seqs);
+
+ alloc_seqs = new_alloc_seqs;
+
+ show_rusage();
+ }
+
+ if (clusters + 1 > alloc_clusters)
+ {
+ uint64_t new_alloc_clusters = 2 * alloc_clusters;
+
+ rehash(& hashtable, alloc_clusters);
+
+ alloc_clusters = new_alloc_clusters;
+ hashtablesize = 2 * alloc_clusters;
+ hash_mask = hashtablesize - 1;
+
+ show_rusage();
+ }
+
+ char * seq = fastx_get_sequence(h);
/* normalize sequence: uppercase and replace U by T */
string_normalize(seq_up, seq, seqlen);
@@ -232,24 +388,13 @@ void derep_fulllength()
uint64_t j = hash & hash_mask;
struct bucket * bp = hashtable + j;
- while (
-#ifdef BITMAP
- (hash_is_occupied(j))
-#else
- (bp->size)
-#endif
+ while ((bp->size)
&&
- ((bp->hash != hash) ||
- (seqlen != db_getsequencelen(bp->seqno_first)) ||
- (seqcmp(seq_up, db_getsequence(bp->seqno_first), seqlen))))
+ ((hash != bp->hash) ||
+ (seqcmp(seq_up, bp->seq, seqlen))))
{
- bp++;
- j++;
- if (bp >= hashtable + hashtablesize)
- {
- bp = hashtable;
- j = 0;
- }
+ j = (j+1) & hash_mask;
+ bp = hashtable + j;
}
if ((opt_strand > 1) && !bp->size)
@@ -258,81 +403,148 @@ void derep_fulllength()
/* check minus strand as well */
uint64_t rc_hash = HASH(rc_seq_up, seqlen);
- struct bucket * rc_bp = hashtable + rc_hash % hashtablesize;
uint64_t k = rc_hash & hash_mask;
+ struct bucket * rc_bp = hashtable + k;
- while (
-#ifdef BITMAP
- (hash_is_occupied(j))
-#else
- (rc_bp->size)
-#endif
+ while ((rc_bp->size)
&&
- ((rc_bp->hash != rc_hash) ||
- (seqlen != db_getsequencelen(rc_bp->seqno_first)) ||
- (seqcmp(rc_seq_up,
- db_getsequence(rc_bp->seqno_first),
- seqlen))))
+ ((rc_hash != rc_bp->hash) ||
+ (seqcmp(rc_seq_up, rc_bp->seq, seqlen))))
{
- rc_bp++;
- k++;
- if (rc_bp >= hashtable + hashtablesize)
- {
- rc_bp = hashtable;
- k++;
- }
+ k = (k+1) & hash_mask;
+ rc_bp = hashtable + k;
}
if (rc_bp->size)
{
bp = rc_bp;
j = k;
- match_strand[i] = 1;
+ if (opt_uc)
+ match_strand[sequencecount] = 1;
}
}
- int64_t ab = opt_sizein ? db_getabundance(i) : 1;
+ char * header = fastx_get_header(h);
+ int abundance = fastx_get_abundance(h);
+ int64_t ab = opt_sizein ? abundance : 1;
sumsize += ab;
if (bp->size)
{
/* at least one identical sequence already */
bp->size += ab;
- unsigned int last = bp->seqno_last;
- nextseqtab[last] = i;
- bp->seqno_last = i;
+
+ if (opt_uc)
+ {
+ unsigned int last = bp->seqno_last;
+ nextseqtab[last] = sequencecount;
+ bp->seqno_last = sequencecount;
+ headertab[sequencecount] = xstrdup(header);
+ }
}
else
{
/* no identical sequences yet */
bp->size = ab;
bp->hash = hash;
- bp->seqno_first = i;
- bp->seqno_last = i;
+ bp->seqno_first = sequencecount;
+ bp->seqno_last = sequencecount;
+ bp->seq = xstrdup(seq);
+ bp->header = xstrdup(header);
clusters++;
}
if (bp->size > maxsize)
maxsize = bp->size;
-#ifdef BITMAP
- hash_set_occupied(j);
-#endif
+ sequencecount++;
- progress_update(i);
+ progress_update(fastx_get_position(h));
}
progress_done();
+ xfree(prompt);
+ fastx_close(h);
+
+ show_rusage();
+
+ if (!opt_quiet)
+ {
+ if (sequencecount > 0)
+ fprintf(stderr,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64
+ ", max %'" PRIu64 ", avg %'.0f\n",
+ nucleotidecount,
+ sequencecount,
+ shortest,
+ longest,
+ nucleotidecount * 1.0 / sequencecount);
+ else
+ fprintf(stderr,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs\n",
+ nucleotidecount,
+ sequencecount);
+ }
+
+ if (opt_log)
+ {
+ if (sequencecount > 0)
+ fprintf(fp_log,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64
+ ", max %'" PRIu64 ", avg %'.0f\n",
+ nucleotidecount,
+ sequencecount,
+ shortest,
+ longest,
+ nucleotidecount * 1.0 / sequencecount);
+ else
+ fprintf(fp_log,
+ "%'" PRIu64 " nt in %'" PRIu64 " seqs\n",
+ nucleotidecount,
+ sequencecount);
+ }
+
+ if (discarded_short)
+ {
+ fprintf(stderr,
+ "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n",
+ opt_minseqlength,
+ discarded_short,
+ (discarded_short == 1 ? "sequence" : "sequences"));
+
+ if (opt_log)
+ fprintf(fp_log,
+ "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n",
+ opt_minseqlength,
+ discarded_short,
+ (discarded_short == 1 ? "sequence" : "sequences"));
+ }
+
+ if (discarded_long)
+ {
+ fprintf(stderr,
+ "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n",
+ opt_maxseqlength,
+ discarded_long,
+ (discarded_long == 1 ? "sequence" : "sequences"));
+
+ if (opt_log)
+ fprintf(fp_log,
+ "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n",
+ opt_maxseqlength,
+ discarded_long,
+ (discarded_long == 1 ? "sequence" : "sequences"));
+ }
xfree(seq_up);
xfree(rc_seq_up);
show_rusage();
-
progress_init("Sorting", 1);
- qsort(hashtable, hashtablesize, sizeof(bucket), derep_compare);
+ qsort(hashtable, hashtablesize, sizeof(struct bucket), derep_compare_full);
progress_done();
+ show_rusage();
if (clusters > 0)
{
@@ -358,32 +570,34 @@ void derep_fulllength()
{
if (!opt_quiet)
fprintf(stderr,
- "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n",
+ "%" PRId64
+ " unique sequences, avg cluster %.1lf, median %.0f, max %"
+ PRIu64 "\n",
clusters, average, median, maxsize);
if (opt_log)
fprintf(fp_log,
- "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n\n",
+ "%" PRId64
+ " unique sequences, avg cluster %.1lf, median %.0f, max %"
+ PRIu64 "\n\n",
clusters, average, median, maxsize);
}
- show_rusage();
-
-
/* count selected */
- int64_t selected = 0;
- for (int64_t i=0; i<clusters; i++)
+ uint64_t selected = 0;
+ for (uint64_t i=0; i<clusters; i++)
{
struct bucket * bp = hashtable + i;
int64_t size = bp->size;
if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize))
{
selected++;
- if (selected == opt_topn)
+ if (selected == (uint64_t) opt_topn)
break;
}
}
+ show_rusage();
/* write output */
@@ -392,7 +606,7 @@ void derep_fulllength()
progress_init("Writing output file", clusters);
int64_t relabel_count = 0;
- for (int64_t i=0; i<clusters; i++)
+ for (uint64_t i=0; i<clusters; i++)
{
struct bucket * bp = hashtable + i;
int64_t size = bp->size;
@@ -401,10 +615,10 @@ void derep_fulllength()
relabel_count++;
fasta_print_general(fp_output,
0,
- db_getsequence(bp->seqno_first),
- db_getsequencelen(bp->seqno_first),
- db_getheader(bp->seqno_first),
- db_getheaderlen(bp->seqno_first),
+ bp->seq,
+ strlen(bp->seq),
+ bp->header,
+ strlen(bp->header),
size,
relabel_count,
-1, -1, 0, 0.0);
@@ -423,11 +637,11 @@ void derep_fulllength()
if (opt_uc)
{
progress_init("Writing uc file, first part", clusters);
- for (int64_t i=0; i<clusters; i++)
+ for (uint64_t i=0; i<clusters; i++)
{
struct bucket * bp = hashtable + i;
- char * h = db_getheader(bp->seqno_first);
- int64_t len = db_getsequencelen(bp->seqno_first);
+ char * h = bp->header;
+ int64_t len = strlen(bp->seq);
fprintf(fp_uc, "S\t%" PRId64 "\t%" PRId64 "\t*\t*\t*\t*\t*\t%s\t*\n",
i, len, h);
@@ -439,47 +653,73 @@ void derep_fulllength()
"H\t%" PRId64 "\t%" PRId64 "\t%.1f\t%s\t0\t0\t*\t%s\t%s\n",
i, len, 100.0,
(match_strand[next] ? "-" : "+"),
- db_getheader(next), h);
+ headertab[next], h);
progress_update(i);
}
progress_done();
- show_rusage();
progress_init("Writing uc file, second part", clusters);
- for (int64_t i=0; i<clusters; i++)
+ for (uint64_t i=0; i<clusters; i++)
{
struct bucket * bp = hashtable + i;
fprintf(fp_uc, "C\t%" PRId64 "\t%u\t*\t*\t*\t*\t*\t%s\t*\n",
- i, bp->size, db_getheader(bp->seqno_first));
+ i, bp->size, bp->header);
progress_update(i);
}
fclose(fp_uc);
progress_done();
- show_rusage();
}
+ show_rusage();
+
if (selected < clusters)
{
if (!opt_quiet)
fprintf(stderr,
- "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n",
+ "%" PRId64 " uniques written, %"
+ PRId64 " clusters discarded (%.1f%%)\n",
selected, clusters - selected,
100.0 * (clusters - selected) / clusters);
if (opt_log)
fprintf(fp_log,
- "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n\n",
+ "%" PRId64 " uniques written, %"
+ PRId64 " clusters discarded (%.1f%%)\n\n",
selected, clusters - selected,
100.0 * (clusters - selected) / clusters);
}
- xfree(match_strand);
- xfree(nextseqtab);
+ show_rusage();
+
+ /* Free all seqs and headers */
+
+ for (uint64_t i=0; i<clusters; i++)
+ {
+ struct bucket * bp = hashtable + i;
+ if (bp->size)
+ {
+ xfree(bp->seq);
+ xfree(bp->header);
+ }
+ }
+
+ if (opt_uc)
+ {
+ for (uint64_t i=0; i<alloc_seqs; i++)
+ if (headertab[i])
+ xfree(headertab[i]);
+ xfree(headertab);
+ xfree(nextseqtab);
+ xfree(match_strand);
+ }
+
+ show_rusage();
+
xfree(hashtable);
- db_free();
-}
+ show_rusage();
+}
void derep_prefix()
{
@@ -532,7 +772,8 @@ void derep_prefix()
/* alloc and init table of links to other sequences in cluster */
- unsigned int * nextseqtab = (unsigned int*) xmalloc(sizeof(unsigned int) * dbsequencecount);
+ unsigned int * nextseqtab
+ = (unsigned int*) xmalloc(sizeof(unsigned int) * dbsequencecount);
const unsigned int terminal = (unsigned int)(-1);
memset(nextseqtab, -1, sizeof(unsigned int) * dbsequencecount);
@@ -693,7 +934,7 @@ void derep_prefix()
show_rusage();
progress_init("Sorting", 1);
- qsort(hashtable, hashtablesize, sizeof(bucket), derep_compare);
+ qsort(hashtable, hashtablesize, sizeof(bucket), derep_compare_prefix);
progress_done();
if (clusters > 0)
@@ -720,11 +961,15 @@ void derep_prefix()
{
if (!opt_quiet)
fprintf(stderr,
- "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n",
+ "%" PRId64
+ " unique sequences, avg cluster %.1lf, median %.0f, max %"
+ PRIu64 "\n",
clusters, average, median, maxsize);
if (opt_log)
fprintf(fp_log,
- "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n\n",
+ "%" PRId64
+ " unique sequences, avg cluster %.1lf, median %.0f, max %"
+ PRIu64 "\n\n",
clusters, average, median, maxsize);
}
@@ -822,13 +1067,15 @@ void derep_prefix()
{
if (!opt_quiet)
fprintf(stderr,
- "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n",
+ "%" PRId64 " uniques written, %" PRId64
+ " clusters discarded (%.1f%%)\n",
selected, clusters - selected,
100.0 * (clusters - selected) / clusters);
if (opt_log)
fprintf(fp_log,
- "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n\n",
+ "%" PRId64 " uniques written, %" PRId64
+ " clusters discarded (%.1f%%)\n\n",
selected, clusters - selected,
100.0 * (clusters - selected) / clusters);
}
=====================================
src/eestats.cc
=====================================
@@ -465,15 +465,6 @@ void fastq_eestats2()
}
}
- for (int x = 0; x < len_steps; x++)
- {
- uint64_t len_cutoff = opt_length_cutoffs_shortest + x * opt_length_cutoffs_increment;
- if (len < len_cutoff)
- for (int y = 0; y < opt_ee_cutoffs_count; y++)
- if (ee <= opt_ee_cutoffs_values[y])
- count_table[x * opt_ee_cutoffs_count + y]++;
- }
-
progress_update(fastq_get_position(h));
}
progress_done();
=====================================
src/mergepairs.cc
=====================================
@@ -1366,7 +1366,7 @@ void fastq_mergepairs()
if (failed_repeat)
fprintf(stderr,
- "%10" PRIu64 " potential tandem repeat\n",
+ "%10" PRIu64 " multiple potential alignments\n",
failed_repeat);
if (failed_maxdiffs)
=====================================
src/subsample.cc
=====================================
@@ -111,8 +111,13 @@ void subsample()
for(int i=0; i<dbsequencecount; i++)
mass_total += db_getabundance(i);
- fprintf(stderr, "Got %" PRIu64 " reads from %d amplicons\n",
- mass_total, dbsequencecount);
+ if (! opt_quiet)
+ fprintf(stderr, "Got %" PRIu64 " reads from %d amplicons\n",
+ mass_total, dbsequencecount);
+
+ if (opt_log)
+ fprintf(fp_log, "Got %" PRIu64 " reads from %d amplicons\n",
+ mass_total, dbsequencecount);
int * abundance = (int*) xmalloc(dbsequencecount * sizeof(int));
@@ -229,7 +234,10 @@ void subsample()
xfree(abundance);
- fprintf(stderr, "Subsampled %" PRIu64 " reads from %d amplicons\n", n, samples);
+ if (! opt_quiet)
+ fprintf(stderr, "Subsampled %" PRIu64 " reads from %d amplicons\n", n, samples);
+ if (opt_log)
+ fprintf(fp_log, "Subsampled %" PRIu64 " reads from %d amplicons\n", n, samples);
db_free();
=====================================
src/util.cc
=====================================
@@ -60,6 +60,8 @@
#include "vsearch.h"
+//#define SHOW_RUSAGE
+
static const char * progress_prompt;
static uint64_t progress_next;
static uint64_t progress_size;
@@ -186,7 +188,7 @@ int64_t getusec(void)
void show_rusage()
{
-#if 0
+#ifdef SHOW_RUSAGE
double user_time = 0.0;
double system_time = 0.0;
=====================================
src/vsearch.h
=====================================
@@ -58,7 +58,9 @@
*/
+#define __STDC_CONSTANT_MACROS 1
#define __STDC_FORMAT_MACROS 1
+#define __STDC_LIMIT_MACROS 1
#define __restrict
#ifdef HAVE_CONFIG_H
@@ -87,6 +89,7 @@
#include <fcntl.h>
#include <unistd.h>
#include <float.h>
+#include <stdint.h>
#include <inttypes.h>
#include <stdarg.h>
View it on GitLab: https://salsa.debian.org/med-team/vsearch/compare/0f98cccabf4840158918c5a5f4e82606b85fbaf4...2223cdf9733c2b04e81016491b35b75c1dff0880
--
View it on GitLab: https://salsa.debian.org/med-team/vsearch/compare/0f98cccabf4840158918c5a5f4e82606b85fbaf4...2223cdf9733c2b04e81016491b35b75c1dff0880
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180926/9f62b830/attachment-0001.html>
More information about the debian-med-commit
mailing list