[med-svn] [vsearch] 01/01: New upstream version 2.3.4

Thu Dec 22 22:44:53 UTC 2016

This is an automated email from the git hooks/post-receive script.

tille pushed a commit to annotated tag upstream/2.3.4
in repository vsearch.

commit c46e44d6b9d610e040c31738dcde840c249e9d91
Author: Andreas Tille <tille at debian.org>
Date:   Thu Dec 22 23:38:19 2016 +0100

    New upstream version 2.3.4
---
 README.md         | 18 +++++++++---------
 configure.ac      |  2 +-
 man/vsearch.1     | 19 ++++++++++++-------
 src/minheap.cc    | 54 +++++++++++++++++++++++++++++++++++++++++++-----------
 src/msa.cc        | 19 ++++++-------------
 src/searchcore.cc | 47 ++++++++++++++++++-----------------------------
 6 files changed, 89 insertions(+), 70 deletions(-)

diff --git a/README.md b/README.md
index 105d87c..c393373 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ We have implemented a tool called VSEARCH which supports *de novo* and reference
 
 VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.
 
-VSEARCH binaries are provided for x86-64 systems running GNU/Linux or OS X (10.7 or higher).
+VSEARCH binaries are provided for x86-64 systems running GNU/Linux or OS X (10.7 or higher). A beta version for 64-bit Windows (version 7 or higher) is also available.
 
 VSEARCH can directly read input query and database files that are compressed using gzip and bzip2 (.gz and .bz2) if the zlib and bzip2 libraries are available.
 
@@ -35,9 +35,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.3.2.tar.gz
-tar xzf v2.3.2.tar.gz
-cd vsearch-2.3.2
+wget https://github.com/torognes/vsearch/archive/v2.3.4.tar.gz
+tar xzf v2.3.4.tar.gz
+cd vsearch-2.3.4
 ./autogen.sh
 ./configure
 make
@@ -60,18 +60,18 @@ make install  # as root or sudo make install
 **Binary distribution** Starting with version 1.4.0, binary distribution files (.tar.gz) for GNU/Linux on x86-64 and Apple Mac OS X on x86-64 containing pre-compiled binaries as well as the documentation (man and pdf files) will be made available as part of each [release](https://github.com/torognes/vsearch/releases). The included executables include support for input files compressed by zlib and bzip2 (with files usually ending in `.gz` or `.bz2`). Download the appropriate executable fo [...]
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.3.2/vsearch-2.3.2-linux-x86_64.tar.gz
-tar xzf vsearch-2.3.2-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.3.4/vsearch-2.3.4-linux-x86_64.tar.gz
+tar xzf vsearch-2.3.4-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Mac:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.3.2/vsearch-2.3.2-osx-x86_64.tar.gz
-tar xzf vsearch-2.3.2-osx-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.3.4/vsearch-2.3.4-osx-x86_64.tar.gz
+tar xzf vsearch-2.3.4-osx-x86_64.tar.gz
 ```
 
-You will now have the binary distribution in a folder called something like `vsearch-2.3.2-linux-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+You will now have the binary distribution in a folder called something like `vsearch-2.3.4-linux-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
 
 
 **Old binaries** Older VSEARCH binaries (until version 1.1.3) are available [here](https://github.com/torognes/vsearch/releases) for [GNU/Linux on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.1.3-linux-x86_64) and [Apple Mac OS X on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.1.3-osx-x86_64). These executables include support for input files compressed by zlib and bzip2 (with files usually ending in `.gz` or `.bz2`). Down [...]
diff --git a/configure.ac b/configure.ac
index 794cb18..97da8b5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.3.2], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.3.4], [torognes at ifi.uio.no])
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])
 AC_CONFIG_SRCDIR([src/vsearch.cc])
diff --git a/man/vsearch.1 b/man/vsearch.1
index 52b9ad9..ae702df 100644
--- a/man/vsearch.1
+++ b/man/vsearch.1
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "November 18, 2016" "version 2.3.2" "USER COMMANDS"
+.TH vsearch 1 "December 9, 2016" "version 2.3.4" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication and
@@ -650,12 +650,11 @@ that OTU. See the \-\-biomout option for further details.
 Output a sequence profile to a text file with the frequency of each
 nucleotide in each position in the multiple alignment for each
 cluster. There is a FASTA-like header line for each cluster, followed
-by the profile information in a tab-separated format. The columns are:
-position (1-based), consensus nucleotide, number of As, number of Cs,
-number of Gs, number of Ts or Us, and finally the number of gaps. If
-ambiguous nucleotide symbols are present, the numbers may be floating
-point numbers, otherwise they are integers. For instance, an 'R'
-counts 0.5 towards an A and 0.5 towards a G.
+by the profile information in a tab-separated format. The eight
+columns are: position (0-based), consensus nucleotide, number of As,
+number of Cs, number of Gs, number of Ts or Us, number of gap symbols,
+and finally the total number of ambiguous nucleotide symbols (B, D, H,
+K, M, N, R, S, Y, V or W). All numbers are integers.
 .TP
 .BI \-\-qmask\~ "none|dust|soft"
 Mask regions in sequences using the
@@ -3043,6 +3042,12 @@ Fixed bug where vsearch reported the ordinal number of the target
 sequence instead of the cluster number in column 2 on H-lines in the
 uc output file after clustering. For search and alignment commands
 both usearch and vsearch reports the target sequence number here.
+.TP
+.BR v2.3.3\~ "released December 5th, 2016"
+A minor speed improvement.
+.TP
+.BR v2.3.4\~ "released December 9th, 2016"
+Fixed bug in output of sequence profiles and updated documentation.
 .RE
 .LP
 .\" ============================================================================
diff --git a/src/minheap.cc b/src/minheap.cc
index 545eca1..fd7a18e 100644
--- a/src/minheap.cc
+++ b/src/minheap.cc
@@ -63,22 +63,36 @@
 /* implement a priority queue with a min heap binary array structure */
 /* elements with the lowest count should be at the top (root) */
 
+/*
+  To keep track of the n best potential target sequences, we store
+  them in a min heap. The root element corresponds to the least good
+  target, while the best elements are found at the leaf nodes. This
+  makes it simple to decide whether a new target should be included or
+  not, because it just needs to be compared to the root node.  The
+  list will be fully sorted before use when we want to find the best
+  element and then the second best and so on.
+*/
+
 int
 elem_smaller(elem_t * a, elem_t * b)
 {
   /* return 1 if a is smaller than b, 0 if equal or greater */
   if (a->count < b->count)
     return 1;
-  else if (a->count > b->count)
-    return 0;
-  else if (a->length > b->length)
-    return 1;
-  else if (a->length < b->length)
-    return 0;
-  else if (a->seqno > b->seqno)
-    return 1;
   else
-    return 0;
+    if (a->count > b->count)
+      return 0;
+    else
+      if (a->length > b->length)
+        return 1;
+      else
+        if (a->length < b->length)
+          return 0;
+        else
+          if (a->seqno > b->seqno)
+            return 1;
+          else
+            return 0;
 }
 
 int minheap_compare(const void * a, const void * b)
@@ -86,10 +100,28 @@ int minheap_compare(const void * a, const void * b)
   elem_t * x = (elem_t*) a;
   elem_t * y = (elem_t*) b;
 
-  if (elem_smaller(x, y))
+  /* return -1 if a is smaller than b, +1 if greater, otherwise 0 */
+  /* first: lower count, larger length, lower seqno */
+
+  if (x->count < y->count)
     return -1;
   else
-    return +1;
+    if (x->count > y->count)
+      return +1;
+    else
+      if (x->length > y->length)
+        return -1;
+      else
+        if (x->length < y->length)
+          return +1;
+        else
+          if (x->seqno > y->seqno)
+            return -1;
+          else
+            if (x->seqno < y->seqno)
+              return +1;
+            else
+              return 0;
 }
 
 minheap_t *
diff --git a/src/msa.cc b/src/msa.cc
index 959fc9b..6814aa0 100644
--- a/src/msa.cc
+++ b/src/msa.cc
@@ -350,20 +350,13 @@ void msa(FILE * fp_msaout, FILE * fp_consout, FILE * fp_profile,
       for (int i=0; i<alnlen; i++)
         {
           fprintf(fp_profile, "%d\t%c", i, aln[i]);
-          int nongap_count = 0;
+          // A, C, G and T
           for (int c=0; c<4; c++)
-            {
-              int count = profile[4*i+c];
-              nongap_count += count;
-              if (count % 12 == 0)
-                fprintf(fp_profile, "\t%d", count / 12);
-              else
-                fprintf(fp_profile, "\t%.2f", 1.0 * count / 12.0);
-            }
-          if (nongap_count % 12 == 0)
-            fprintf(fp_profile, "\t%d", target_count - nongap_count / 12);
-          else
-            fprintf(fp_profile, "\t%.2f", 1.0 * target_count - 1.0 * nongap_count / 12.0);
+            fprintf(fp_profile, "\t%d", profile[PROFSIZE*i+c]);
+          // Gap symbol
+          fprintf(fp_profile, "\t%d", profile[PROFSIZE*i+5]);
+          // Ambiguous nucleotide (Ns and others)
+          fprintf(fp_profile, "\t%d", profile[PROFSIZE*i+4]);
           fprintf(fp_profile, "\n");
         }
       fprintf(fp_profile, "\n");
diff --git a/src/searchcore.cc b/src/searchcore.cc
index 2500db1..17c9eba 100644
--- a/src/searchcore.cc
+++ b/src/searchcore.cc
@@ -161,33 +161,6 @@ bool search_enough_kmers(struct searchinfo_s * si,
   return (count >= opt_minwordmatches) || (count >= si->kmersamplecount);
 }
 
-inline void topscore_insert(int i, struct searchinfo_s * si)
-{
-  count_t count = si->kmers[i];
-  
-  /* ignore sequences with very few kmer matches */
-
-  if (!search_enough_kmers(si, count))
-    return;
-  
-  unsigned int seqno = dbindex_getmapping(i);
-  unsigned int length = db_getsequencelen(seqno);
-
-  elem_t novel;
-  novel.count = count;
-  novel.seqno = seqno;
-  novel.length = length;
-  
-  minheap_add(si->m, & novel);
-}
-
-void _mm_print_epi8(__m128i x)
-{
-  unsigned char * y = (unsigned char*)&x;
-  for (int i=0; i<16; i++)
-    printf("%s%02x", (i>0?" ":""), y[15-i]);
-}
-
 void search_topscores(struct searchinfo_s * si)
 {
   /*
@@ -228,9 +201,25 @@ void search_topscores(struct searchinfo_s * si)
         }
     }
 
+  int minmatches = MIN(opt_minwordmatches, si->kmersamplecount);
+
   for(int i=0; i < indexed_count; i++)
-    topscore_insert(i, si);
-  
+    {
+      count_t count = si->kmers[i];
+      if (count >= minmatches)
+        {
+          unsigned int seqno = dbindex_getmapping(i);
+          unsigned int length = db_getsequencelen(seqno);
+
+          elem_t novel;
+          novel.count = count;
+          novel.seqno = seqno;
+          novel.length = length;
+
+          minheap_add(si->m, & novel);
+        }
+    }
+
   minheap_sort(si->m);
 }
 

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/vsearch.git