[med-svn] [Git][med-team/vsearch][upstream] New upstream version 2.10.2

Andreas Tille gitlab at salsa.debian.org
Tue Dec 18 15:59:57 GMT 2018


Andreas Tille pushed to branch upstream at Debian Med / vsearch


Commits:
0d029e68 by Andreas Tille at 2018-12-18T15:54:23Z
New upstream version 2.10.2
- - - - -


10 changed files:

- README.md
- configure.ac
- man/vsearch.1
- src/Makefile.am
- + src/sffconvert.cc
- + src/sffconvert.h
- src/sintax.cc
- src/util.cc
- src/vsearch.cc
- src/vsearch.h


Changes:

=====================================
README.md
=====================================
@@ -24,7 +24,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
 
 ## Getting Help
 
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
 
 ## Example
 
@@ -37,9 +37,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.9.0.tar.gz
-tar xzf v2.9.0.tar.gz
-cd vsearch-2.9.0
+wget https://github.com/torognes/vsearch/archive/v2.10.2.tar.gz
+tar xzf v2.10.2.tar.gz
+cd vsearch-2.10.2
 ./autogen.sh
 ./configure
 make
@@ -70,36 +70,36 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
 Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-linux-x86_64.tar.gz
-tar xzf vsearch-2.9.0-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-x86_64.tar.gz
+tar xzf vsearch-2.10.2-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Linux ppc64le system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-linux-ppc64le.tar.gz
-tar xzf vsearch-2.9.0-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-ppc64le.tar.gz
+tar xzf vsearch-2.10.2-linux-ppc64le.tar.gz
 ```
 
 Or these commands if you are using a Mac:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-macos-x86_64.tar.gz
-tar xzf vsearch-2.9.0-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-macos-x86_64.tar.gz
+tar xzf vsearch-2.10.2-macos-x86_64.tar.gz
 ```
 
 Or if you are using Windows, download and extract (unzip) the contents of this file:
 
 ```
-https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-win-x86_64.zip
 ```
 
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.9.0-linux-x86_64` or `vsearch-2.9.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.10.2-linux-x86_64` or `vsearch-2.10.2-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
 
-Windows: You will now have the binary distribution in a folder called `vsearch-2.9.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.10.2-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
 
 
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
 
 
 ## Plugins, packages, and wrappers


=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.9.0], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.10.2], [torognes at ifi.uio.no])
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])
@@ -14,8 +14,8 @@ MACOSX_DEPLOYMENT_TARGET="10.7"
 
 # Set default gcc and g++ options
 
-CFLAGS='-g -march=native'
-CXXFLAGS='-g -march=native'
+CFLAGS='-g'
+CXXFLAGS='-g'
 
 # Checks for programs.
 AC_PROG_CXX


=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "October 10, 2018" "version 2.9.0" "USER COMMANDS"
+.TH vsearch 1 "December 10, 2018" "version 2.10.2" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication and
@@ -71,6 +71,9 @@ FASTA/FASTQ file processing:
 \fBvsearch\fR \-\-fastx_revcomp \fIfastxfile\fR (\-\-fastaout |
 \-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
 .PP
+\fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout
+\fIoutputfile\fR [\fIoptions\fR]
+.PP
 .RE
 Masking:
 .RS
@@ -878,17 +881,18 @@ for details.
 .TP
 .BI \-\-rereplicate \0filename
 Duplicate each sequence the number of times indicated by the abundance
-of each sequence in the specified file. The sequence labels are
-identical for the same sequence, unless \-\-relabel, \-\-relabel_sha1
-or \-\-relabel_md5 is used to create unique labels. Output is written
-to the file specified with the \-\-output option, in FASTA format. The
-output file does not contain abundance information unless \-\-sizeout
-is specified, in which case an abundance of 1 is used.
+of each sequence in the specified file (option \-\-sizein is always
+implied). The sequence labels are identical for the same sequence,
+unless \-\-relabel, \-\-relabel_sha1 or \-\-relabel_md5 is used to
+create unique labels. Output is written to the file specified with the
+\-\-output option, in FASTA format. The output file does not contain
+abundance information unless \-\-sizeout is specified, in which case
+an abundance of 1 is used.
 .TP
 .B \-\-sizein
 Take into account the abundance annotations present in the input fasta
 file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
-headers).
+headers). That option is active by default when rereplicating.
 .TP
 .B \-\-sizeout
 Add abundance annotations to the output fasta file (add the pattern
@@ -961,8 +965,9 @@ length of the sequences in a FASTQ file may be performed with the
 \-\-fastq_stats, \-\-fastq_eestats, and \-\-fastq_eestats2
 commands. Sequences may be shortened, filtered and converted by the
 \-\-fastq_filter or \-\-fastx_filter commands. Paired-end reads can be
-merged using the \-\-fastq_mergepairs command. Finally, the
-\-\-fastx_revcomp command reverse-complements sequences.
+merged using the \-\-fastq_mergepairs command. The \-\-fastx_revcomp
+command reverse-complements sequences. Finally, the \-\-sff_convert
+command can be used to convert SFF files to FASTQ.
 .PP
 .TP 9
 .B \-\-eeout
@@ -1013,9 +1018,9 @@ Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
 Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64).
 .TP
 .BI \-\-fastq_asciiout\~ "positive integer"
-When using \-\-fastq_convert, define the ASCII character number used
-as the basis for the FASTQ quality score when writing FASTQ output
-files. The default is 33.
+When using \-\-fastq_convert or \-\-sff_convert, define the ASCII
+character number used as the basis for the FASTQ quality score when
+writing FASTQ output files. The default is 33.
 .TP
 .BI \-\-fastq_chars \0filename
 Summarize the composition of sequence and quality strings contained in
@@ -1039,7 +1044,7 @@ Convert between the different variants of the FASTQ file format. The
 quality encoding of the input file must be specified with the
 \-\-fastq_ascii option (either 33 or 64, the default is 33), and the
 output quality encoding must be specified with the \-\-fastq_asciiout
-option (default 33). The mimimum and maximum output quality scores may
+option (default 33). The minimum and maximum output quality scores may
 be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout
 options. The output file is specified with the \-\-fastqout option.
 .TP
@@ -1199,10 +1204,10 @@ files. The default is 41, which is usual for recent Sanger/Illumina
 1.8+ files.
 .TP
 .BI \-\-fastq_qmaxout\~ "positive integer"
-When using \-\-fastq_convert, specify the maximum quality score used
-when writing FASTQ files. The default is 41, which is usual for recent
-Sanger/Illumina 1.8+ files. Older formats may use a maximum quality
-score of 40.
+When using \-\-fastq_convert or \-\-sff_convert, specify the maximum
+quality score used when writing FASTQ files. The default is 41, which
+is usual for recent Sanger/Illumina 1.8+ files. Older formats may use
+a maximum quality score of 40.
 .TP
 .BI \-\-fastq_qmin\~ "positive integer"
 Specify the minimum quality score accepted for FASTQ files. The
@@ -1210,10 +1215,10 @@ default is 0, which is usual for recent Sanger/Illumina 1.8+
 files. Older formats may use scores between -5 and 2.
 .TP
 .BI \-\-fastq_qminout\~ "positive integer"
-When using \-\-fastq_convert, specify the minimum quality score used
-when writing FASTQ files. The default is 0, which is usual for
-Sanger/Illumina 1.8+ files. Older versions of the format may use
-scores between -5 and 2.
+When using \-\-fastq_convert or \-\-sff_convert, specify the minimum
+quality score used when writing FASTQ files. The default is 0, which
+is usual for Sanger/Illumina 1.8+ files. Older versions of the format
+may use scores between -5 and 2.
 .TP
 .BI \-\-fastq_stats \0filename
 Analyze a FASTQ file and report the number of reads it contains. The
@@ -1422,6 +1427,21 @@ for details.
 When using \-\-fastq_mergepairs or \-\-fastq_join, specify the FASTQ
 file containing containing the reverse reads.
 .TP
+.BI \-\-sff_convert \0filename
+Convert the given SFF file to FASTQ. The FASTQ output file is
+specified with the \-\-fastqout option. The sequence may be clipped as
+specied in the SFF file if the option \-\-sff_clip is specified,
+otherwise no clipping occurs. Bases that would have been clipped are
+converted to lower case, while the rest is in upper case. The output
+quality encoding may be specified with the \-\-fastq_asciiout option
+(default 33). The minimum and maximum output quality scores may be
+limited using the \-\-fastq_qminout and \-\-fastq_qmaxout options.
+.TP
+.BI \-\-sff_clip
+Specifies that the sequences converted by the \-\-sff_convert command
+should be clipped in both ends as indicated in the SFF file. By
+default no clipping is performed.
+.TP
 .B \-\-xsize
 Strip abundance information from the headers when writing the output
 file.
@@ -3456,6 +3476,26 @@ the first space (unless the notrunclabels option is in effect).
 .TP
 .BR v2.9.0\~ "released October 10th, 2018"
 Added the fastq_join command.
+.TP
+.BR v2.9.1\~ "released October 29th, 2018"
+Changed compiler options that select the target cpu and tuning to
+allow the software to run on any 64-bit x86 system, while tuning for
+more modern variants. Avoid illegal instruction error on some
+architectures. Update documentation of rereplicate command.
+.TP
+.BR v2.10.0\~ "released December 6th, 2018"
+Added the sff_convert commmand to convert SFF files to FASTQ. Added
+some additional option argument checks. Fixed segmentation fault bug
+after some fatal errors when a log file was specified.
+.TP
+.BR v2.10.1\~ "released December 7th, 2018"
+Improved sff_convert command. It will now read several variants of the
+SFF format. It is also able to read from a pipe. Warnings are given if
+there are minor problems. Errors messages have been improved. Minor
+speed and memory usage improvements.
+.TP
+.BR v2.10.2\~ "released December 10th, 2018"
+Fixed bug in sintax with reversed order of domain and kingdom.
 .RE
 .LP
 .\" ============================================================================


=====================================
src/Makefile.am
=====================================
@@ -3,7 +3,7 @@ bin_PROGRAMS = $(top_builddir)/bin/vsearch
 if TARGET_PPC
 AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -mcpu=power8
 else
-AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g
+AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -march=x86-64 -mtune=generic
 endif
 
 AM_CFLAGS=$(AM_CXXFLAGS)
@@ -47,6 +47,7 @@ results.h \
 search.h \
 searchcore.h \
 searchexact.h \
+sffconvert.h \
 showalign.h \
 sha1.h \
 shuffle.h \
@@ -76,13 +77,13 @@ libcityhash_a_SOURCES = city.cc city.h
 
 if TARGET_WIN
 
-libcityhash_a_CXXFLAGS = -Wall -Wno-sign-compare -O3 -g -D_MSC_VER
+libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare -D_MSC_VER
 __top_builddir__bin_vsearch_LDFLAGS = -static
 __top_builddir__bin_vsearch_LDADD = libregex.a libcityhash.a libcpu_ssse3.a libcpu_sse2.a
 
 else
 
-libcityhash_a_CXXFLAGS = -Wall -Wno-sign-compare -O3 -g
+libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare
 
 if TARGET_PPC
 __top_builddir__bin_vsearch_LDADD = libcityhash.a libcpu.a
@@ -126,6 +127,7 @@ results.cc \
 search.cc \
 searchcore.cc \
 searchexact.cc \
+sffconvert.cc \
 sha1.c \
 showalign.cc \
 shuffle.cc \


=====================================
src/sffconvert.cc
=====================================
@@ -0,0 +1,438 @@
+/*
+
+  VSEARCH: a versatile open source tool for metagenomics
+
+  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  All rights reserved.
+
+  Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+  Department of Informatics, University of Oslo,
+  PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+  This software is dual-licensed and available under a choice
+  of one of two licenses, either under the terms of the GNU
+  General Public License version 3 or the BSD 2-Clause License.
+
+
+  GNU General Public License version 3
+
+  This program is free software: you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation, either version 3 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+
+  The BSD 2-Clause License
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+
+  1. Redistributions of source code must retain the above copyright
+  notice, this list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright
+  notice, this list of conditions and the following disclaimer in the
+  documentation and/or other materials provided with the distribution.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+  POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "vsearch.h"
+
+uint32_t sff_magic = 0x2e736666;
+
+struct sff_header_s
+{
+  uint32_t magic_number; /* .sff */
+  uint32_t version;
+  uint64_t index_offset;
+  uint32_t index_length;
+  uint32_t number_of_reads;
+  uint16_t header_length;
+  uint16_t key_length;
+  uint16_t flows_per_read;
+  uint8_t  flowgram_format_code;
+} sff_header;
+
+struct sff_read_header_s
+{
+  uint16_t read_header_length;
+  uint16_t name_length;
+  uint32_t number_of_bases;
+  uint16_t clip_qual_left;
+  uint16_t clip_qual_right;
+  uint16_t clip_adapter_left;
+  uint16_t clip_adapter_right;
+} read_header;
+
+uint64_t fskip(FILE * fp, uint64_t length)
+{
+  /* read given amount of data from a stream and ignore it */
+  /* used instead of seeking in order to work with pipes   */
+#define BLOCKSIZE 4096
+  char buffer[BLOCKSIZE];
+
+  uint64_t skipped = 0;
+  uint64_t rest = length;
+
+  while (rest > 0)
+    {
+      uint64_t want = ((rest > BLOCKSIZE) ? BLOCKSIZE : rest);
+      uint64_t got = fread(buffer, 1, want, fp);
+      skipped += got;
+      rest -= got;
+      if (got < want)
+        break;
+    }
+  return skipped;
+}
+
+void sff_convert()
+{
+  if (! opt_fastqout)
+    fatal("No output file for sff_convert specified with --fastqout.");
+
+  FILE * fp_fastqout = fopen_output(opt_fastqout);
+  if (!fp_fastqout)
+    fatal("Unable to open FASTQ output file for writing.");
+
+  FILE * fp_sff = fopen_input(opt_sff_convert);
+  if (!fp_sff)
+    fatal("Unable to open SFF input file for reading.");
+
+  /* read and check header */
+
+  uint64_t filepos = 0;
+
+  if (fread(&sff_header, 1, 31, fp_sff) < 31)
+    fatal("Unable to read from SFF file. File may be truncated.");
+  filepos += 31;
+
+  sff_header.magic_number    = bswap_32(sff_header.magic_number);
+  sff_header.version         = bswap_32(sff_header.version);
+  sff_header.index_offset    = bswap_64(sff_header.index_offset);
+  sff_header.index_length    = bswap_32(sff_header.index_length);
+  sff_header.number_of_reads = bswap_32(sff_header.number_of_reads);
+  sff_header.header_length   = bswap_16(sff_header.header_length);
+  sff_header.key_length      = bswap_16(sff_header.key_length);
+  sff_header.flows_per_read  = bswap_16(sff_header.flows_per_read);
+
+  if (sff_header.magic_number != sff_magic)
+    fatal("Invalid SFF file. Incorrect magic number. Must be 0x2e736666 (.sff).");
+
+  if (sff_header.version != 1)
+    fatal("Invalid SFF file. Incorrect version. Must be 1.");
+
+  if (sff_header.flowgram_format_code != 1)
+    fatal("Invalid SFF file. Incorrect flowgram format code. Must be 1.");
+
+  if (sff_header.header_length != 8 * ((31 + sff_header.flows_per_read + sff_header.key_length + 7) / 8))
+    fatal("Invalid SFF file. Incorrect header length.");
+
+  if (sff_header.key_length != 4)
+    fatal("Invalid SFF file. Incorrect key length. Must be 4.");
+
+  if ((sff_header.index_length > 0) && (sff_header.index_length < 8))
+    fatal("Invalid SFF file. Incorrect index size. Must be at least 8.");
+
+  /* read and check flow chars, key and padding */
+
+  if (fskip(fp_sff, sff_header.flows_per_read) < sff_header.flows_per_read)
+    fatal("Invalid SFF file. Unable to read flow characters. File may be truncated.");
+  filepos += sff_header.flows_per_read;
+
+  char * key_sequence = (char *) xmalloc(sff_header.key_length + 1);
+  if (fread(key_sequence, 1, sff_header.key_length, fp_sff) < sff_header.key_length)
+    fatal("Invalid SFF file. Unable to read key sequence. File may be truncated.");
+  key_sequence[sff_header.key_length] = 0;
+  filepos += sff_header.key_length;
+
+  uint32_t padding_length = sff_header.header_length - sff_header.flows_per_read - sff_header.key_length - 31;
+  if (fskip(fp_sff, padding_length) < padding_length)
+    fatal("Invalid SFF file. Unable to read padding. File may be truncated.");
+  filepos += padding_length;
+
+  double totallength = 0.0;
+  uint32_t minimum = UINT_MAX;
+  uint32_t maximum = 0;
+
+  bool index_done = (sff_header.index_offset == 0) || (sff_header.index_length == 0);
+  bool index_odd = false;
+  char index_kind[9];
+
+  uint32_t index_padding = 0;
+  if ((sff_header.index_length & 7) > 0)
+    index_padding = 8 - (sff_header.index_length & 7);
+
+  if (! opt_quiet)
+    {
+      fprintf(stderr, "Number of reads: %d\n", sff_header.number_of_reads);
+      fprintf(stderr, "Flows per read:  %d\n", sff_header.flows_per_read);
+      fprintf(stderr, "Key sequence:    %s\n", key_sequence);
+    }
+
+  if (opt_log)
+    {
+      fprintf(fp_log, "Number of reads: %d\n", sff_header.number_of_reads);
+      fprintf(fp_log, "Flows per read:  %d\n", sff_header.flows_per_read);
+      fprintf(fp_log, "Key sequence:    %s\n", key_sequence);
+    }
+
+  progress_init("Converting SFF: ", sff_header.number_of_reads);
+
+  for(uint32_t read_no = 0; read_no < sff_header.number_of_reads; read_no++)
+    {
+      /* check if the index block is here */
+
+      if (! index_done)
+        {
+          if (filepos == sff_header.index_offset)
+            {
+              if (fread(index_kind, 1, 8, fp_sff) < 8)
+                fatal("Invalid SFF file. Unable to read index header. File may be truncated.");
+              filepos += 8;
+              index_kind[8] = 0;
+
+              uint64 index_size = sff_header.index_length - 8 + index_padding;
+              if (fskip(fp_sff, index_size) != index_size)
+                fatal("Invalid SFF file. Unable to read entire index. File may be truncated.");
+
+              filepos += index_size;
+              index_done = true;
+              index_odd = true;
+            }
+        }
+
+      /* read and check each read header */
+
+      if (fread(&read_header, 1, 16, fp_sff) < 16)
+        fatal("Invalid SFF file. Unable to read read header. File may be truncated.");
+      filepos += 16;
+
+      read_header.read_header_length = bswap_16(read_header.read_header_length);
+      read_header.name_length = bswap_16(read_header.name_length);
+      read_header.number_of_bases = bswap_32(read_header.number_of_bases);
+      read_header.clip_qual_left = bswap_16(read_header.clip_qual_left);
+      read_header.clip_qual_right = bswap_16(read_header.clip_qual_right);
+      read_header.clip_adapter_left = bswap_16(read_header.clip_adapter_left);
+      read_header.clip_adapter_right = bswap_16(read_header.clip_adapter_right);
+
+      if (read_header.read_header_length != 8 * ((16 + read_header.name_length + 7) / 8))
+        fatal("Invalid SFF file. Incorrect read header length.");
+      if (read_header.clip_qual_left > read_header.number_of_bases)
+        fatal("Invalid SFF file. Incorrect clip_qual_left value.");
+      if (read_header.clip_adapter_left > read_header.number_of_bases)
+        fatal("Invalid SFF file. Incorrect clip_adapter_left value.");
+      if (read_header.clip_qual_right > read_header.number_of_bases)
+        fatal("Invalid SFF file. Incorrect clip_qual_right value.");
+      if (read_header.clip_adapter_right > read_header.number_of_bases)
+        fatal("Invalid SFF file. Incorrect clip_adapter_right value.");
+
+      char * read_name = (char *) xmalloc(read_header.name_length + 1);
+      if (fread(read_name, 1, read_header.name_length, fp_sff) < read_header.name_length)
+        fatal("Invalid SFF file. Unable to read read name. File may be truncated.");
+      filepos += read_header.name_length;
+      read_name[read_header.name_length] = 0;
+
+      uint32_t read_header_padding_length = read_header.read_header_length - read_header.name_length - 16;
+      if (fskip(fp_sff, read_header_padding_length) < read_header_padding_length)
+        fatal("Invalid SFF file. Unable to read read header padding. File may be truncated.");
+      filepos += read_header_padding_length;
+
+      /* read and check the flowgram and sequence */
+
+      if (fskip(fp_sff, 2 * sff_header.flows_per_read) < sff_header.flows_per_read)
+        fatal("Invalid SFF file. Unable to read flowgram values. File may be truncated.");
+      filepos += 2 * sff_header.flows_per_read;
+
+      if (fskip(fp_sff, read_header.number_of_bases) < read_header.number_of_bases)
+        fatal("Invalid SFF file. Unable to read flow indices. File may be truncated.");
+      filepos += read_header.number_of_bases;
+
+      char * bases = (char *) xmalloc(read_header.number_of_bases + 1);
+      if (fread(bases, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases)
+        fatal("Invalid SFF file. Unable to read read length. File may be truncated.");
+      bases[read_header.number_of_bases] = 0;
+      filepos += read_header.number_of_bases;
+
+      char * qual = (char *) xmalloc(read_header.number_of_bases + 1);
+      if (fread(qual, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases)
+        fatal("Invalid SFF file. Unable to read quality scores. File may be truncated.");
+      filepos += read_header.number_of_bases;
+
+      /* convert quality scores to ascii characters */
+      for(uint32_t base_no = 0; base_no < read_header.number_of_bases; base_no++)
+        {
+          int q = qual[base_no];
+          if (q < opt_fastq_qminout)
+            q = opt_fastq_qminout;
+          if (q > opt_fastq_qmaxout)
+            q = opt_fastq_qmaxout;
+          qual[base_no] = opt_fastq_asciiout + q;
+        }
+      qual[read_header.number_of_bases] = 0;
+
+      uint32_t read_data_length = (2 * sff_header.flows_per_read + 3 * read_header.number_of_bases);
+      uint32_t read_data_padded_length = 8 * ((read_data_length + 7) / 8);
+      uint32_t read_data_padding_length = read_data_padded_length - read_data_length;
+
+      if (fskip(fp_sff, read_data_padding_length) < read_data_padding_length)
+        fatal("Invalid SFF file. Unable to read read data padding. File may be truncated.");
+      filepos += read_data_padding_length;
+
+      uint32_t clip_start = 0;
+      clip_start = MAX(1, MAX(read_header.clip_qual_left, read_header.clip_adapter_left)) - 1;
+
+      uint32_t clip_end = read_header.number_of_bases;
+      clip_end = MIN((read_header.clip_qual_right == 0 ? read_header.number_of_bases : read_header.clip_qual_right), (read_header.clip_adapter_right == 0 ? read_header.number_of_bases : read_header.clip_adapter_right));
+
+      /* make the clipped bases lowercase and the rest uppercase */
+      for (uint32_t i = 0; i < read_header.number_of_bases; i++)
+        {
+          if ((i < clip_start) || (i >= clip_end))
+            bases[i] = tolower(bases[i]);
+          else
+            bases[i] = toupper(bases[i]);
+        }
+
+      if (opt_sff_clip)
+        {
+          bases[clip_end] = 0;
+          qual[clip_end] = 0;
+        }
+      else
+        {
+          clip_start = 0;
+          clip_end = read_header.number_of_bases;
+        }
+
+      uint32_t length = clip_end - clip_start;
+
+      fastq_print_general(fp_fastqout,
+                          bases + clip_start,
+                          length,
+                          read_name,
+                          strlen(read_name),
+                          qual + clip_start,
+                          0, 0, 0, 0);
+
+      xfree(read_name);
+      xfree(bases);
+      xfree(qual);
+
+      totallength += length;
+      if (length < minimum)
+        minimum = length;
+      if (length > maximum)
+        maximum = length;
+
+      progress_update(read_no + 1);
+    }
+  progress_done();
+
+  /* check if the index block is here */
+
+  if (! index_done)
+    {
+      if (filepos == sff_header.index_offset)
+        {
+          if (fread(index_kind, 1, 8, fp_sff) < 8)
+            fatal("Invalid SFF file. Unable to read index header. File may be truncated.");
+          filepos += 8;
+          index_kind[8] = 0;
+
+          uint64 index_size = sff_header.index_length - 8;
+          if (fskip(fp_sff, index_size) != index_size)
+            fatal("Invalid SFF file. Unable to read entire index. File may be truncated.");
+
+          filepos += index_size;
+          index_done = true;
+
+          /* try to skip padding, if any */
+
+          if (index_padding > 0)
+            {
+              uint64_t got = fskip(fp_sff, index_padding);
+              if ((got < index_padding) && (got != 0))
+                fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n");
+            }
+        }
+    }
+
+  if (! index_done)
+    {
+      fprintf(stderr, "WARNING: SFF index missing\n");
+      if (opt_log)
+        fprintf(fp_log, "WARNING: SFF index missing\n");
+    }
+
+  if (index_odd)
+    {
+      fprintf(stderr, "WARNING: Index at unusual position in file\n");
+      if (opt_log)
+        fprintf(fp_log, "WARNING: Index at unusual position in file\n");
+    }
+
+  /* ignore the rest of file */
+
+  /* try reading just another byte */
+
+  if (fskip(fp_sff, 1) > 0)
+    {
+      fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n");
+      if (opt_log)
+        fprintf(fp_log, "WARNING: Additional data at end of SFF file ignored\n");
+    }
+
+  fclose(fp_sff);
+  fclose(fp_fastqout);
+
+  double average = totallength / sff_header.number_of_reads;
+
+  if (! opt_quiet)
+    {
+      if (sff_header.index_length > 0)
+        fprintf(stderr, "Index type:      %s\n", index_kind);
+      fprintf(stderr, "\nSFF file read successfully.\n");
+      if (sff_header.number_of_reads > 0)
+        fprintf(stderr, "Sequence length: minimum %d, average %.1f, maximum %d\n",
+                minimum,
+                average,
+                maximum);
+    }
+
+  if (opt_log)
+    {
+      if (sff_header.index_length > 0)
+        fprintf(fp_log, "Index type:      %s\n", index_kind);
+      fprintf(fp_log, "\nSFF file read successfully.\n");
+      if (sff_header.number_of_reads > 0)
+        fprintf(fp_log, "Sequence length: minimum %d, average %.1f, maximum %d\n",
+                minimum,
+                average,
+                maximum);
+    }
+
+  xfree(key_sequence);
+}


=====================================
src/sffconvert.h
=====================================
@@ -0,0 +1,61 @@
+/*
+
+  VSEARCH: a versatile open source tool for metagenomics
+
+  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  All rights reserved.
+
+  Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+  Department of Informatics, University of Oslo,
+  PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+  This software is dual-licensed and available under a choice
+  of one of two licenses, either under the terms of the GNU
+  General Public License version 3 or the BSD 2-Clause License.
+
+
+  GNU General Public License version 3
+
+  This program is free software: you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation, either version 3 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+
+  The BSD 2-Clause License
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+
+  1. Redistributions of source code must retain the above copyright
+  notice, this list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright
+  notice, this list of conditions and the following disclaimer in the
+  documentation and/or other materials provided with the distribution.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+  POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+void sff_convert();


=====================================
src/sintax.cc
=====================================
@@ -86,7 +86,7 @@ static pthread_attr_t attr;
 static fastx_handle query_fastx_h;
 
 const int tax_levels = 8;
-const char * tax_letters = "kdpcofgs";
+const char * tax_letters = "dkpcofgs";
 const int subset_size = 32;
 const int bootstrap_count = 100;
 
@@ -150,8 +150,8 @@ bool sintax_parse_tax(const char * header,
 void sintax_split(int seqno, int * level_start, int * level_len)
 {
   /* Parse taxonomy string into the following parts
-     k kingdom
      d domain
+     k kingdom
      p phylum
      c class
      o order


=====================================
src/util.cc
=====================================
@@ -114,7 +114,7 @@ void  __attribute__((noreturn)) fatal(const char * msg)
   fprintf(stderr, "\n\n");
   fprintf(stderr, "Fatal error: %s\n", msg);
 
-  if (opt_log)
+  if (fp_log)
     {
       fprintf(fp_log, "\n\n");
       fprintf(fp_log, "Fatal error: %s\n", msg);


=====================================
src/vsearch.cc
=====================================
@@ -77,6 +77,7 @@ bool opt_relabel_keep;
 bool opt_relabel_md5;
 bool opt_relabel_sha1;
 bool opt_samheader;
+bool opt_sff_clip;
 bool opt_sizeorder;
 bool opt_xsize;
 char * opt_allpairs_global;
@@ -139,6 +140,7 @@ char * opt_rereplicate;
 char * opt_reverse;
 char * opt_samout;
 char * opt_search_exact;
+char * opt_sff_convert;
 char * opt_shuffle;
 char * opt_sintax;
 char * opt_sortbylength;
@@ -789,6 +791,8 @@ void args_init(int argc, char **argv)
   opt_search_exact = 0;
   opt_self = 0;
   opt_selfid = 0;
+  opt_sff_convert = 0;
+  opt_sff_clip = 0;
   opt_shuffle = 0;
   opt_sintax = 0;
   opt_sintax_cutoff = 0.0;
@@ -1033,7 +1037,9 @@ void args_init(int argc, char **argv)
     {"fastq_join",            required_argument, 0, 0 },
     {"join_padgap",           required_argument, 0, 0 },
     {"join_padgapq",          required_argument, 0, 0 },
-    { 0, 0, 0, 0 }
+    {"sff_convert",           required_argument, 0, 0 },
+    {"sff_clip",              no_argument,       0, 0 },
+    { 0,                      0,                 0, 0 }
   };
 
   int option_count = sizeof(long_options) / sizeof(struct option);
@@ -1891,6 +1897,14 @@ void args_init(int argc, char **argv)
           opt_join_padgapq = optarg;
           break;
 
+        case 202:
+          opt_sff_convert = optarg;
+          break;
+
+        case 203:
+          opt_sff_clip = 1;
+          break;
+
         default:
           fatal("Internal error in option parsing");
         }
@@ -1980,6 +1994,8 @@ void args_init(int argc, char **argv)
     commands++;
   if (opt_fastq_join)
     commands++;
+  if (opt_sff_convert)
+    commands++;
 
 
   if (commands > 1)
@@ -2061,9 +2077,21 @@ void args_init(int argc, char **argv)
   if (opt_fastq_qmin > opt_fastq_qmax)
     fatal("The argument to --fastq_qmin cannot be larger than to --fastq_qmax");
 
+  if (opt_fastq_ascii + opt_fastq_qmin < 33)
+    fatal("Sum of arguments to --fastq_ascii and --fastq_qmin must be no less than 33");
+
+  if (opt_fastq_ascii + opt_fastq_qmax > 126)
+    fatal("Sum of arguments to --fastq_ascii and --fastq_qmax must be no more than 126");
+
   if (opt_fastq_qminout > opt_fastq_qmaxout)
     fatal("The argument to --fastq_qminout cannot be larger than to --fastq_qmaxout");
 
+  if (opt_fastq_asciiout + opt_fastq_qminout < 33)
+    fatal("Sum of arguments to --fastq_asciiout and --fastq_qminout must be no less than 33");
+
+  if (opt_fastq_asciiout + opt_fastq_qmaxout > 126)
+    fatal("Sum of arguments to --fastq_asciiout and --fastq_qmaxout must be no more than 126");
+
   if (opt_gzip_decompress && opt_bzip2_decompress)
     fatal("Specify either --gzip_decompress or --bzip2_decompress, not both");
 
@@ -2243,11 +2271,21 @@ void cmd_help()
               "  --relabel_keep              keep the old label after the new when relabelling\n"
               "  --relabel_md5               relabel with md5 digest of normalized sequence\n"
               "  --relabel_sha1              relabel with sha1 digest of normalized sequence\n"
-              "  --sizeorder                 sort accepted centroids by abundance (AGC)\n"
+              "  --sizeorder                 sort accepted centroids by abundance, AGC\n"
               "  --sizeout                   write cluster abundances to centroid file\n"
               "  --uc FILENAME               specify filename for UCLUST-like output\n"
               "  --xsize                     strip abundance information in output\n"
               "\n"
+              "Convert SFF to FASTQ\n"
+              "  --sff_convert FILENAME      convert given SFF file to FASTQ format\n"
+              " Parameters\n"
+              "  --sff_clip                  clip ends of sequences as indicated in file (no)\n"
+              "  --fastq_asciiout INT        FASTQ output quality score ASCII base char (33)\n"
+              "  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)\n"
+              "  --fastq_qminout INT         minimum base quality value for FASTQ output (0)\n"
+              " Output\n"
+              "  --fastqout FILENAME         output converted sequences to given FASTQ file\n"
+              "\n"
               "Dereplication and rereplication\n"
               "  --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n"
               "  --derep_prefix FILENAME     dereplicate sequences in file based on prefixes\n"
@@ -2989,6 +3027,8 @@ int main(int argc, char** argv)
     udb_stats();
   else if (opt_sintax)
     sintax();
+  else if (opt_sff_convert)
+    sff_convert();
   else
     cmd_none();
 


=====================================
src/vsearch.h
=====================================
@@ -125,6 +125,9 @@
 #define PROG_OS "win"
 #include <windows.h>
 #include <psapi.h>
+#define bswap_16(x) _byteswap_ushort(x)
+#define bswap_32(x) _byteswap_ulong(x)
+#define bswap_64(x) _byteswap_uint64(x)
 
 #else
 
@@ -132,6 +135,10 @@
 
 #define PROG_OS "macos"
 #include <sys/sysctl.h>
+#include <libkern/OSByteOrder.h>
+#define bswap_16(x) OSSwapInt16(x)
+#define bswap_32(x) OSSwapInt32(x)
+#define bswap_64(x) OSSwapInt64(x)
 
 #else
 
@@ -142,6 +149,7 @@
 #endif
 
 #include <sys/sysinfo.h>
+#include <byteswap.h>
 
 #endif
 
@@ -212,6 +220,7 @@
 #include "kmerhash.h"
 #include "sintax.h"
 #include "fastqjoin.h"
+#include "sffconvert.h"
 
 /* options */
 
@@ -230,6 +239,7 @@ extern bool opt_relabel_keep;
 extern bool opt_relabel_md5;
 extern bool opt_relabel_sha1;
 extern bool opt_samheader;
+extern bool opt_sff_clip;
 extern bool opt_sizeorder;
 extern bool opt_xsize;
 extern char * opt_allpairs_global;
@@ -292,6 +302,7 @@ extern char * opt_rereplicate;
 extern char * opt_reverse;
 extern char * opt_samout;
 extern char * opt_search_exact;
+extern char * opt_sff_convert;
 extern char * opt_shuffle;
 extern char * opt_sintax;
 extern char * opt_sortbylength;



View it on GitLab: https://salsa.debian.org/med-team/vsearch/commit/0d029e68b5adc51e3c578c65230c4c5b9c4a1f3b

-- 
View it on GitLab: https://salsa.debian.org/med-team/vsearch/commit/0d029e68b5adc51e3c578c65230c4c5b9c4a1f3b
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181218/aef0356b/attachment-0001.html>


More information about the debian-med-commit mailing list