[med-svn] [Git][med-team/vsearch][upstream] New upstream version 2.25.0

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Wed Nov 15 14:58:38 GMT 2023



Étienne Mollier pushed to branch upstream at Debian Med / vsearch


Commits:
88399a1e by Étienne Mollier at 2023-11-15T15:36:38+01:00
New upstream version 2.25.0
- - - - -


8 changed files:

- README.md
- configure.ac
- man/vsearch.1
- src/chimera.cc
- src/fasta.cc
- src/searchexact.cc
- src/vsearch.cc
- src/vsearch.h


Changes:

=====================================
README.md
=====================================
@@ -37,7 +37,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
 
 ## Getting Help
 
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
 
 ## Example
 
@@ -50,9 +50,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.24.0.tar.gz
-tar xzf v2.24.0.tar.gz
-cd vsearch-2.24.0
+wget https://github.com/torognes/vsearch/archive/v2.25.0.tar.gz
+tar xzf v2.25.0.tar.gz
+cd vsearch-2.25.0
 ./autogen.sh
 ./configure CFLAGS="-O3" CXXFLAGS="-O3"
 make
@@ -81,48 +81,48 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
 Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-linux-x86_64.tar.gz
-tar xzf vsearch-2.24.0-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-linux-x86_64.tar.gz
+tar xzf vsearch-2.25.0-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Linux ppc64le system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-linux-ppc64le.tar.gz
-tar xzf vsearch-2.24.0-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-linux-ppc64le.tar.gz
+tar xzf vsearch-2.25.0-linux-ppc64le.tar.gz
 ```
 
 Or these commands if you are using a Linux aarch64 (arm64) system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-linux-aarch64.tar.gz
-tar xzf vsearch-2.24.0-linux-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-linux-aarch64.tar.gz
+tar xzf vsearch-2.25.0-linux-aarch64.tar.gz
 ```
 
 Or these commands if you are using a Mac with an Apple Silicon CPU:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-macos-aarch64.tar.gz
-tar xzf vsearch-2.24.0-macos-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-macos-aarch64.tar.gz
+tar xzf vsearch-2.25.0-macos-aarch64.tar.gz
 ```
 
 Or these commands if you are using a Mac with an Intel CPU:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-macos-x86_64.tar.gz
-tar xzf vsearch-2.24.0-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-macos-x86_64.tar.gz
+tar xzf vsearch-2.25.0-macos-x86_64.tar.gz
 ```
 
 Or if you are using Windows, download and extract (unzip) the contents of this file:
 
 ```
-https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-win-x86_64.zip
 ```
 
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.24.0-linux-x86_64` or `vsearch-2.24.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.25.0-linux-x86_64` or `vsearch-2.25.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
 
 **Windows**: You will now have the binary distribution in a folder
-called `vsearch-2.24.0-win-x86_64`. The vsearch executable is called
+called `vsearch-2.25.0-win-x86_64`. The vsearch executable is called
 `vsearch.exe`. The manual in PDF format is called
 `vsearch_manual.pdf`. If you want to be able to call `vsearch.exe`
 from any command prompt window, you can put the vsearch executable in
@@ -133,7 +133,7 @@ searching for it in the Start menu, `Edit` user variables, add
 your changes.
 
 
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
 
 
 ## Packages, plugins, and wrappers


=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.24.0], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
+AC_INIT([vsearch], [2.25.0], [torognes at ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])


=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "October 26, 2023" "version 2.24.0" "USER COMMANDS"
+.TH vsearch 1 "November 10, 2023" "version 2.25.0" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em a versatile open-source tool for microbiome analysis,
@@ -4830,6 +4830,10 @@ with clustering. Add more references.
 Update documentation. Improve code. Allow up to 20 parents for the
 undocumented and experimental chimeras_denovo command. Fix compilation
 warnings for sha1.c. Compile for release (not debug) by default.
+.TP
+.BR v2.25.0\~ "released November 10th, 2023"
+Allow a given percentage of mismatches between chimeras and parents
+for the experimental chimeras_denovo command.
 .LP
 .\" ============================================================================
 .\" TODO:


=====================================
src/chimera.cc
=====================================
@@ -139,6 +139,9 @@ struct chimera_info_s
   int * smooth;
   int * maxsmooth;
 
+  double * scan_p;
+  double * scan_q;
+
   int parents_found;
   int best_parents[maxparents];
   int best_start[maxparents];
@@ -213,6 +216,11 @@ void realloc_arrays(struct chimera_info_s * ci)
       ci->smooth = (int*) xrealloc(ci->smooth,
                                    maxcandidates * maxqlen * sizeof(int));
 
+      ci->scan_p = (double *) xrealloc(ci->scan_p,
+                                       (maxqlen + 1) * sizeof(double));
+      ci->scan_q = (double *) xrealloc(ci->scan_q,
+                                       (maxqlen + 1) * sizeof(double));
+
       int maxalnlen = maxqlen + 2 * db_getlongestsequence();
       for (int f = 0; f < maxparents ; f++)
         {
@@ -306,10 +314,75 @@ int compare_positions(const void * a, const void * b)
     return 0;
 }
 
+bool scan_matches(struct chimera_info_s * ci,
+                  int * matches,
+                  int len,
+                  double percentage,
+                  int * best_start,
+                  int * best_len)
+{
+  /*
+    Scan matches array of zeros and ones, and find the longest subsequence
+    having a match fraction above or equal to the given percentage (e.g. 2%).
+    Based on an idea of finding the longest positive sum substring:
+    https://stackoverflow.com/questions/28356453/longest-positive-sum-substring
+    If the percentage is 2%, matches are given a score of 2 and mismatches -98.
+  */
+
+  double score_match = percentage;
+  double score_mismatch = percentage - 100.0;
+
+  double * p = ci->scan_p;
+  double * q = ci->scan_q;
+
+  p[0] = 0.0;
+  for (int i = 0; i < len; i++)
+    p[i + 1] = p[i] + (matches[i] ? score_match : score_mismatch);
+
+  q[len] = p[len];
+  for (int i = len - 1; i >= 0; i--)
+    q[i] = MAX(q[i + 1], p[i]);
+
+  int best_i = 0;
+  int best_d = -1;
+  double best_c = -1.0;
+  int i = 1;
+  int j = 1;
+  while (j <= len)
+    {
+      double c = q[j] - p[i - 1];
+      if (c >= 0.0)
+        {
+          int d = j - i + 1;
+          if (d > best_d)
+            {
+              best_i = i;
+              best_d = d;
+              best_c = c;
+            }
+          j += 1;
+        }
+      else
+        {
+          i += 1;
+        }
+    }
+
+  if (best_c >= 0.0)
+    {
+      * best_start = best_i - 1;
+      * best_len = best_d;
+      return true;
+    }
+  else
+    return false;
+}
+
 int find_best_parents_long(struct chimera_info_s * ci)
 {
-  /* find parents with longest perfect match regions,
-     excluding regions matched by previously identified parents */
+  /* Find parents with longest matching regions, without indels, allowing
+     a given percentage of mismatches (specified with --chimeras_diff_pct),
+     and excluding regions matched by previously identified parents. */
 
   find_matches(ci);
 
@@ -342,29 +415,38 @@ int find_best_parents_long(struct chimera_info_s * ci)
         {
           int start = 0;
           int len = 0;
-
-          for (int j = 0; j < ci->query_len; j++)
+          int j = 0;
+          while (j < ci->query_len)
             {
-              if ((position_used[j] == false) &&
-                  (ci->match[i * ci->query_len + j] == 1) &&
-                  ((len == 0) || (ci->insert[i * ci->query_len + j] == 0)))
+              start = j;
+              len = 0;
+              while ((j < ci->query_len) &&
+                     (! position_used[j]) &&
+                     ((len == 0) || (ci->insert[i * ci->query_len + j] == 0)))
                 {
-                  if (len == 0)
-                    {
-                      start = j;
-                    }
                   len++;
-                  if (len > best_len)
-                    {
-                      best_cand = i;
-                      best_start = start;
-                      best_len = len;
-                    }
+                  j++;
                 }
-              else
+              if (len > best_len)
                 {
-                  len = 0;
+                  int scan_best_start = 0;
+                  int scan_best_len = 0;
+                  if (scan_matches(ci,
+                                   ci->match + i*ci->query_len + start,
+                                   len,
+                                   opt_chimeras_diff_pct,
+                                   & scan_best_start,
+                                   & scan_best_len))
+                    {
+                      if (scan_best_len > best_len)
+                        {
+                          best_cand = i;
+                          best_start = start + scan_best_start;
+                          best_len = scan_best_len;
+                        }
+                    }
                 }
+              j++;
             }
         }
 
@@ -727,7 +809,7 @@ int eval_parents_long(struct chimera_info_s * ci)
       unsigned int qsym = chrmap_4bit[(int)(ci->qaln[i])];
       unsigned int psym[maxparents];
       for (int f = 0; f < maxparents; f++)
-	psym[f] = 0;
+        psym[f] = 0;
       for (int f = 0; f < ci->parents_found; f++)
         psym[f] = chrmap_4bit[(int)(ci->paln[f][i])];
 
@@ -748,13 +830,15 @@ int eval_parents_long(struct chimera_info_s * ci)
 
       if (all_defined)
         {
-          bool parents_equal = true;
-          for (int f = 1; f < ci->parents_found; f++)
-            if (psym[f] != psym[0])
-              parents_equal = false;
-
-          if (! parents_equal)
-            diff = ci->model[i];
+          int z = 0;
+          for (int f = 0; f < ci->parents_found; f++)
+            if (psym[f] == qsym)
+              {
+                diff = 'A' + f;
+                z++;
+              }
+          if (z > 1)
+            diff = ' ';
         }
 
       ci->diffs[i] = diff;
@@ -778,11 +862,11 @@ int eval_parents_long(struct chimera_info_s * ci)
       char qsym = chrmap_4bit[(int)(ci->qaln[i])];
 
       for(int f = 0; f < ci->parents_found; f++)
-	{
-	  char psym = chrmap_4bit[(int)(ci->paln[f][i])];
-	  if (qsym == psym)
-	    match_QP[f]++;
-	}
+        {
+          char psym = chrmap_4bit[(int)(ci->paln[f][i])];
+          if (qsym == psym)
+            match_QP[f]++;
+        }
     }
 
 
@@ -1632,6 +1716,8 @@ void chimera_thread_init(struct chimera_info_s * ci)
   ci->votes = nullptr;
   ci->model = nullptr;
   ci->ignore = nullptr;
+  ci->scan_p = nullptr;
+  ci->scan_q = nullptr;
 
   for (int f = 0; f < maxparents; f++)
     {
@@ -1716,6 +1802,14 @@ void chimera_thread_exit(struct chimera_info_s * ci)
     {
       xfree(ci->query_head);
     }
+  if (ci->scan_p)
+    {
+      xfree(ci->scan_p);
+    }
+  if (ci->scan_q)
+    {
+      xfree(ci->scan_q);
+    }
 
   for (int f = 0; f < maxparents; f++)
     if (ci->paln[f])
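
Note on the `scan_matches` hunk above: with scores of `percentage` per match and `percentage - 100` per mismatch, a region has a non-negative total score exactly when its share of mismatches is at most `percentage` percent. The search therefore reduces to finding the longest substring with a non-negative sum, done with prefix sums and a two-pointer sweep. Below is a minimal standalone sketch of that idea, not the upstream code: the name `longest_region`, the `std::vector` interface, and the `{0, 0}` return value for "nothing found" are assumptions made for illustration.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper (not part of vsearch): returns {start, length} of the
// longest run in `matches` (1 = match, 0 = mismatch) whose share of
// mismatches is at most `pct` percent. Returns {0, 0} if no run qualifies.
std::pair<int, int> longest_region(const std::vector<int> & matches,
                                   double pct)
{
  const int len = static_cast<int>(matches.size());
  const double score_match = pct;             // e.g. +2 when pct is 2
  const double score_mismatch = pct - 100.0;  // e.g. -98 when pct is 2

  // p[k] = total score of the first k positions (prefix sums).
  std::vector<double> p(len + 1, 0.0);
  for (int k = 0; k < len; k++)
    p[k + 1] = p[k] + (matches[k] ? score_match : score_mismatch);

  // q[k] = max(p[k], ..., p[len]) (suffix maxima of the prefix sums).
  std::vector<double> q(len + 1, 0.0);
  q[len] = p[len];
  for (int k = len - 1; k >= 0; k--)
    q[k] = std::max(q[k + 1], p[k]);

  int best_start = 0;
  int best_len = 0;
  int i = 1;  // 1-based first position of the candidate run
  int j = 1;  // 1-based last position of the candidate run
  while (j <= len)
    {
      if (q[j] - p[i - 1] >= 0.0)
        {
          // Some run starting at i and ending at j or later scores >= 0,
          // so a run of at least j - i + 1 positions is feasible.
          if (j - i + 1 > best_len)
            {
              best_start = i - 1;  // convert to a 0-based offset
              best_len = j - i + 1;
            }
          j++;
        }
      else
        {
          i++;  // no feasible run starts at i; try the next start
        }
    }
  return {best_start, best_len};
}
```

For the input `{1, 1, 0, 1, 1, 1, 1}`, a tolerance of 0 or 10 percent selects the four trailing matches (start 3, length 4), while 20 percent admits the single mismatch and selects the whole array (start 0, length 7). Each pointer only moves forward, so the sweep is linear in the region length.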


=====================================
src/fasta.cc
=====================================
@@ -426,23 +426,23 @@ void fasta_print_general(FILE * fp,
   if ((opt_eeout || opt_fastq_eeout) && (ee >= 0.0))
     {
       if (ee < 0.000000001)
-	fprintf(fp, ";ee=%.13lf", ee);
+        fprintf(fp, ";ee=%.13lf", ee);
       else if (ee < 0.00000001)
-	fprintf(fp, ";ee=%.12lf", ee);
+        fprintf(fp, ";ee=%.12lf", ee);
       else if (ee < 0.0000001)
-	fprintf(fp, ";ee=%.11lf", ee);
+        fprintf(fp, ";ee=%.11lf", ee);
       else if (ee < 0.000001)
-	fprintf(fp, ";ee=%.10lf", ee);
+        fprintf(fp, ";ee=%.10lf", ee);
       else if (ee < 0.00001)
-	fprintf(fp, ";ee=%.9lf", ee);
+        fprintf(fp, ";ee=%.9lf", ee);
       else if (ee < 0.0001)
-	fprintf(fp, ";ee=%.8lf", ee);
+        fprintf(fp, ";ee=%.8lf", ee);
       else if (ee < 0.001)
-	fprintf(fp, ";ee=%.7lf", ee);
+        fprintf(fp, ";ee=%.7lf", ee);
       else if (ee < 0.01)
-	fprintf(fp, ";ee=%.6lf", ee);
+        fprintf(fp, ";ee=%.6lf", ee);
       else if (ee < 0.1)
-	fprintf(fp, ";ee=%.5lf", ee);
+        fprintf(fp, ";ee=%.5lf", ee);
       else
         fprintf(fp, ";ee=%.4lf", ee);
     }


=====================================
src/searchexact.cc
=====================================
@@ -495,13 +495,13 @@ void search_exact_thread_run(int64_t t)
 
           /* update stats */
           queries++;
-	  queries_abundance += qsize;
+          queries_abundance += qsize;
 
           if (match)
             {
               qmatches++;
-	      qmatches_abundance += qsize;
-	    }
+              qmatches_abundance += qsize;
+            }
 
           /* show progress */
           progress_update(progress);
@@ -876,9 +876,9 @@ void search_exact(char * cmdline, char * progheader)
         }
       fprintf(stderr, "\n");
       if (opt_sizein)
-	{
+        {
           fprintf(stderr, "Matching total query sequences: %" PRIu64 " of %"
-	          PRIu64,
+                  PRIu64,
                   qmatches_abundance, queries_abundance);
           if (queries_abundance > 0)
             {


=====================================
src/vsearch.cc
=====================================
@@ -194,6 +194,7 @@ char * opt_usearch_global;
 char * opt_userout;
 double * opt_ee_cutoffs_values;
 double opt_abskew;
+double opt_chimeras_diff_pct;
 double opt_dn;
 double opt_fastq_maxdiffpct;
 double opt_fastq_maxee;
@@ -755,6 +756,7 @@ void args_init(int argc, char **argv)
   opt_centroids = nullptr;
   opt_chimeras = nullptr;
   opt_chimeras_denovo = nullptr;
+  opt_chimeras_diff_pct = 0.0;
   opt_chimeras_length_min = 10;
   opt_chimeras_parents_max = 3;
   opt_chimeras_parts = 0;
@@ -1010,6 +1012,7 @@ void args_init(int argc, char **argv)
       option_centroids,
       option_chimeras,
       option_chimeras_denovo,
+      option_chimeras_diff_pct,
       option_chimeras_length_min,
       option_chimeras_parents_max,
       option_chimeras_parts,
@@ -1255,6 +1258,7 @@ void args_init(int argc, char **argv)
       {"centroids",             required_argument, nullptr, 0 },
       {"chimeras",              required_argument, nullptr, 0 },
       {"chimeras_denovo",       required_argument, nullptr, 0 },
+      {"chimeras_diff_pct",     required_argument, nullptr, 0 },
       {"chimeras_length_min",   required_argument, nullptr, 0 },
       {"chimeras_parents_max",  required_argument, nullptr, 0 },
       {"chimeras_parts",        required_argument, nullptr, 0 },
@@ -2536,6 +2540,10 @@ void args_init(int argc, char **argv)
           opt_chimeras_parents_max = args_getlong(optarg);
           break;
 
+        case option_chimeras_diff_pct:
+          opt_chimeras_diff_pct = args_getdouble(optarg);
+          break;
+
         default:
           fatal("Internal error in option parsing");
         }
@@ -2709,6 +2717,7 @@ void args_init(int argc, char **argv)
         option_alignwidth,
         option_alnout,
         option_chimeras,
+        option_chimeras_diff_pct,
         option_chimeras_length_min,
         option_chimeras_parents_max,
         option_chimeras_parts,
@@ -4782,6 +4791,11 @@ void args_init(int argc, char **argv)
       fatal("The argument to chimeras_parents_max must be in the range 2 to %s.\n", maxparents_string);
     }
 
+  if ((opt_chimeras_diff_pct < 0.0) || (opt_chimeras_diff_pct > 50.0))
+    {
+      fatal("The argument to chimeras_diff_pct must be in the range 0.0 to 50.0");
+    }
+
   if (options_selected[option_chimeras_parts] &&
       ((opt_chimeras_parts < 2) || (opt_chimeras_parts > 100)))
     {
@@ -4986,6 +5000,7 @@ void cmd_help()
               "  --chimeras_denovo FILENAME  detect chimeras de novo in long exact sequences\n"
               " Parameters\n"
               "  --abskew REAL               minimum abundance ratio (1.0)\n"
+              "  --chimeras_diff_pct         mismatch %% allowed in each chimeric region (0.0)\n"
               "  --chimeras_length_min       minimum length of each chimeric region (10)\n"
               "  --chimeras_parents_max      maximum number of parent sequences (3)\n"
               "  --chimeras_parts            number of parts to divide sequences (length/100)\n"


=====================================
src/vsearch.h
=====================================
@@ -393,6 +393,7 @@ extern char * opt_usearch_global;
 extern char * opt_userout;
 extern double * opt_ee_cutoffs_values;
 extern double opt_abskew;
+extern double opt_chimeras_diff_pct;
 extern double opt_dn;
 extern double opt_fastq_maxdiffpct;
 extern double opt_fastq_maxee;



View it on GitLab: https://salsa.debian.org/med-team/vsearch/-/commit/88399a1e38abf1a5ebff27623d60e6e59b8b54d5
