[med-svn] [Git][med-team/vsearch][upstream] New upstream version 2.10.2
Andreas Tille
gitlab at salsa.debian.org
Tue Dec 18 15:59:57 GMT 2018
Andreas Tille pushed to branch upstream at Debian Med / vsearch
Commits:
0d029e68 by Andreas Tille at 2018-12-18T15:54:23Z
New upstream version 2.10.2
- - - - -
10 changed files:
- README.md
- configure.ac
- man/vsearch.1
- src/Makefile.am
- + src/sffconvert.cc
- + src/sffconvert.h
- src/sintax.cc
- src/util.cc
- src/vsearch.cc
- src/vsearch.h
Changes:
=====================================
README.md
=====================================
@@ -24,7 +24,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
## Getting Help
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
## Example
@@ -37,9 +37,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
```
-wget https://github.com/torognes/vsearch/archive/v2.9.0.tar.gz
-tar xzf v2.9.0.tar.gz
-cd vsearch-2.9.0
+wget https://github.com/torognes/vsearch/archive/v2.10.2.tar.gz
+tar xzf v2.10.2.tar.gz
+cd vsearch-2.10.2
./autogen.sh
./configure
make
@@ -70,36 +70,36 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-linux-x86_64.tar.gz
-tar xzf vsearch-2.9.0-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-x86_64.tar.gz
+tar xzf vsearch-2.10.2-linux-x86_64.tar.gz
```
Or these commands if you are using a Linux ppc64le system:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-linux-ppc64le.tar.gz
-tar xzf vsearch-2.9.0-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-ppc64le.tar.gz
+tar xzf vsearch-2.10.2-linux-ppc64le.tar.gz
```
Or these commands if you are using a Mac:
```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-macos-x86_64.tar.gz
-tar xzf vsearch-2.9.0-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-macos-x86_64.tar.gz
+tar xzf vsearch-2.10.2-macos-x86_64.tar.gz
```
Or if you are using Windows, download and extract (unzip) the contents of this file:
```
-https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-win-x86_64.zip
```
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.9.0-linux-x86_64` or `vsearch-2.9.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.10.2-linux-x86_64` or `vsearch-2.10.2-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
-Windows: You will now have the binary distribution in a folder called `vsearch-2.9.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.10.2-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
## Plugins, packages, and wrappers
=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.9.0], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.10.2], [torognes at ifi.uio.no])
AC_CANONICAL_TARGET
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
@@ -14,8 +14,8 @@ MACOSX_DEPLOYMENT_TARGET="10.7"
# Set default gcc and g++ options
-CFLAGS='-g -march=native'
-CXXFLAGS='-g -march=native'
+CFLAGS='-g'
+CXXFLAGS='-g'
# Checks for programs.
AC_PROG_CXX
=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
.\" ============================================================================
-.TH vsearch 1 "October 10, 2018" "version 2.9.0" "USER COMMANDS"
+.TH vsearch 1 "December 10, 2018" "version 2.10.2" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em chimera detection, clustering, dereplication and
@@ -71,6 +71,9 @@ FASTA/FASTQ file processing:
\fBvsearch\fR \-\-fastx_revcomp \fIfastxfile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
.PP
+\fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout
+\fIoutputfile\fR [\fIoptions\fR]
+.PP
.RE
Masking:
.RS
@@ -878,17 +881,18 @@ for details.
.TP
.BI \-\-rereplicate \0filename
Duplicate each sequence the number of times indicated by the abundance
-of each sequence in the specified file. The sequence labels are
-identical for the same sequence, unless \-\-relabel, \-\-relabel_sha1
-or \-\-relabel_md5 is used to create unique labels. Output is written
-to the file specified with the \-\-output option, in FASTA format. The
-output file does not contain abundance information unless \-\-sizeout
-is specified, in which case an abundance of 1 is used.
+of each sequence in the specified file (option \-\-sizein is always
+implied). The sequence labels are identical for the same sequence,
+unless \-\-relabel, \-\-relabel_sha1 or \-\-relabel_md5 is used to
+create unique labels. Output is written to the file specified with the
+\-\-output option, in FASTA format. The output file does not contain
+abundance information unless \-\-sizeout is specified, in which case
+an abundance of 1 is used.
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
-headers).
+headers). That option is active by default when rereplicating.
.TP
.B \-\-sizeout
Add abundance annotations to the output fasta file (add the pattern
@@ -961,8 +965,9 @@ length of the sequences in a FASTQ file may be performed with the
\-\-fastq_stats, \-\-fastq_eestats, and \-\-fastq_eestats2
commands. Sequences may be shortened, filtered and converted by the
\-\-fastq_filter or \-\-fastx_filter commands. Paired-end reads can be
-merged using the \-\-fastq_mergepairs command. Finally, the
-\-\-fastx_revcomp command reverse-complements sequences.
+merged using the \-\-fastq_mergepairs command. The \-\-fastx_revcomp
+command reverse-complements sequences. Finally, the \-\-sff_convert
+command can be used to convert SFF files to FASTQ.
.PP
.TP 9
.B \-\-eeout
@@ -1013,9 +1018,9 @@ Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64).
.TP
.BI \-\-fastq_asciiout\~ "positive integer"
-When using \-\-fastq_convert, define the ASCII character number used
-as the basis for the FASTQ quality score when writing FASTQ output
-files. The default is 33.
+When using \-\-fastq_convert or \-\-sff_convert, define the ASCII
+character number used as the basis for the FASTQ quality score when
+writing FASTQ output files. The default is 33.
.TP
.BI \-\-fastq_chars \0filename
Summarize the composition of sequence and quality strings contained in
@@ -1039,7 +1044,7 @@ Convert between the different variants of the FASTQ file format. The
quality encoding of the input file must be specified with the
\-\-fastq_ascii option (either 33 or 64, the default is 33), and the
output quality encoding must be specified with the \-\-fastq_asciiout
-option (default 33). The mimimum and maximum output quality scores may
+option (default 33). The minimum and maximum output quality scores may
be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout
options. The output file is specified with the \-\-fastqout option.
.TP
@@ -1199,10 +1204,10 @@ files. The default is 41, which is usual for recent Sanger/Illumina
1.8+ files.
.TP
.BI \-\-fastq_qmaxout\~ "positive integer"
-When using \-\-fastq_convert, specify the maximum quality score used
-when writing FASTQ files. The default is 41, which is usual for recent
-Sanger/Illumina 1.8+ files. Older formats may use a maximum quality
-score of 40.
+When using \-\-fastq_convert or \-\-sff_convert, specify the maximum
+quality score used when writing FASTQ files. The default is 41, which
+is usual for recent Sanger/Illumina 1.8+ files. Older formats may use
+a maximum quality score of 40.
.TP
.BI \-\-fastq_qmin\~ "positive integer"
Specify the minimum quality score accepted for FASTQ files. The
@@ -1210,10 +1215,10 @@ default is 0, which is usual for recent Sanger/Illumina 1.8+
files. Older formats may use scores between -5 and 2.
.TP
.BI \-\-fastq_qminout\~ "positive integer"
-When using \-\-fastq_convert, specify the minimum quality score used
-when writing FASTQ files. The default is 0, which is usual for
-Sanger/Illumina 1.8+ files. Older versions of the format may use
-scores between -5 and 2.
+When using \-\-fastq_convert or \-\-sff_convert, specify the minimum
+quality score used when writing FASTQ files. The default is 0, which
+is usual for Sanger/Illumina 1.8+ files. Older versions of the format
+may use scores between -5 and 2.
.TP
.BI \-\-fastq_stats \0filename
Analyze a FASTQ file and report the number of reads it contains. The
@@ -1422,6 +1427,21 @@ for details.
When using \-\-fastq_mergepairs or \-\-fastq_join, specify the FASTQ
file containing containing the reverse reads.
.TP
+.BI \-\-sff_convert \0filename
+Convert the given SFF file to FASTQ. The FASTQ output file is
+specified with the \-\-fastqout option. The sequence may be clipped as
+specied in the SFF file if the option \-\-sff_clip is specified,
+otherwise no clipping occurs. Bases that would have been clipped are
+converted to lower case, while the rest is in upper case. The output
+quality encoding may be specified with the \-\-fastq_asciiout option
+(default 33). The minimum and maximum output quality scores may be
+limited using the \-\-fastq_qminout and \-\-fastq_qmaxout options.
+.TP
+.BI \-\-sff_clip
+Specifies that the sequences converted by the \-\-sff_convert command
+should be clipped in both ends as indicated in the SFF file. By
+default no clipping is performed.
+.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
@@ -3456,6 +3476,26 @@ the first space (unless the notrunclabels option is in effect).
.TP
.BR v2.9.0\~ "released October 10th, 2018"
Added the fastq_join command.
+.TP
+.BR v2.9.1\~ "released October 29th, 2018"
+Changed compiler options that select the target cpu and tuning to
+allow the software to run on any 64-bit x86 system, while tuning for
+more modern variants. Avoid illegal instruction error on some
+architectures. Update documentation of rereplicate command.
+.TP
+.BR v2.10.0\~ "released December 6th, 2018"
+Added the sff_convert commmand to convert SFF files to FASTQ. Added
+some additional option argument checks. Fixed segmentation fault bug
+after some fatal errors when a log file was specified.
+.TP
+.BR v2.10.1\~ "released December 7th, 2018"
+Improved sff_convert command. It will now read several variants of the
+SFF format. It is also able to read from a pipe. Warnings are given if
+there are minor problems. Errors messages have been improved. Minor
+speed and memory usage improvements.
+.TP
+.BR v2.10.2\~ "released December 10th, 2018"
+Fixed bug in sintax with reversed order of domain and kingdom.
.RE
.LP
.\" ============================================================================
=====================================
src/Makefile.am
=====================================
@@ -3,7 +3,7 @@ bin_PROGRAMS = $(top_builddir)/bin/vsearch
if TARGET_PPC
AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -mcpu=power8
else
-AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g
+AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -march=x86-64 -mtune=generic
endif
AM_CFLAGS=$(AM_CXXFLAGS)
@@ -47,6 +47,7 @@ results.h \
search.h \
searchcore.h \
searchexact.h \
+sffconvert.h \
showalign.h \
sha1.h \
shuffle.h \
@@ -76,13 +77,13 @@ libcityhash_a_SOURCES = city.cc city.h
if TARGET_WIN
-libcityhash_a_CXXFLAGS = -Wall -Wno-sign-compare -O3 -g -D_MSC_VER
+libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare -D_MSC_VER
__top_builddir__bin_vsearch_LDFLAGS = -static
__top_builddir__bin_vsearch_LDADD = libregex.a libcityhash.a libcpu_ssse3.a libcpu_sse2.a
else
-libcityhash_a_CXXFLAGS = -Wall -Wno-sign-compare -O3 -g
+libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare
if TARGET_PPC
__top_builddir__bin_vsearch_LDADD = libcityhash.a libcpu.a
@@ -126,6 +127,7 @@ results.cc \
search.cc \
searchcore.cc \
searchexact.cc \
+sffconvert.cc \
sha1.c \
showalign.cc \
shuffle.cc \
=====================================
src/sffconvert.cc
=====================================
@@ -0,0 +1,438 @@
+/*
+
+ VSEARCH: a versatile open source tool for metagenomics
+
+ Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+ All rights reserved.
+
+ Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+ Department of Informatics, University of Oslo,
+ PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+ This software is dual-licensed and available under a choice
+ of one of two licenses, either under the terms of the GNU
+ General Public License version 3 or the BSD 2-Clause License.
+
+
+ GNU General Public License version 3
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+
+ The BSD 2-Clause License
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+#include "vsearch.h"
+
+uint32_t sff_magic = 0x2e736666;
+
+struct sff_header_s
+{
+ uint32_t magic_number; /* .sff */
+ uint32_t version;
+ uint64_t index_offset;
+ uint32_t index_length;
+ uint32_t number_of_reads;
+ uint16_t header_length;
+ uint16_t key_length;
+ uint16_t flows_per_read;
+ uint8_t flowgram_format_code;
+} sff_header;
+
+struct sff_read_header_s
+{
+ uint16_t read_header_length;
+ uint16_t name_length;
+ uint32_t number_of_bases;
+ uint16_t clip_qual_left;
+ uint16_t clip_qual_right;
+ uint16_t clip_adapter_left;
+ uint16_t clip_adapter_right;
+} read_header;
+
+uint64_t fskip(FILE * fp, uint64_t length)
+{
+ /* read given amount of data from a stream and ignore it */
+ /* used instead of seeking in order to work with pipes */
+#define BLOCKSIZE 4096
+ char buffer[BLOCKSIZE];
+
+ uint64_t skipped = 0;
+ uint64_t rest = length;
+
+ while (rest > 0)
+ {
+ uint64_t want = ((rest > BLOCKSIZE) ? BLOCKSIZE : rest);
+ uint64_t got = fread(buffer, 1, want, fp);
+ skipped += got;
+ rest -= got;
+ if (got < want)
+ break;
+ }
+ return skipped;
+}
+
+void sff_convert()
+{
+ if (! opt_fastqout)
+ fatal("No output file for sff_convert specified with --fastqout.");
+
+ FILE * fp_fastqout = fopen_output(opt_fastqout);
+ if (!fp_fastqout)
+ fatal("Unable to open FASTQ output file for writing.");
+
+ FILE * fp_sff = fopen_input(opt_sff_convert);
+ if (!fp_sff)
+ fatal("Unable to open SFF input file for reading.");
+
+ /* read and check header */
+
+ uint64_t filepos = 0;
+
+ if (fread(&sff_header, 1, 31, fp_sff) < 31)
+ fatal("Unable to read from SFF file. File may be truncated.");
+ filepos += 31;
+
+ sff_header.magic_number = bswap_32(sff_header.magic_number);
+ sff_header.version = bswap_32(sff_header.version);
+ sff_header.index_offset = bswap_64(sff_header.index_offset);
+ sff_header.index_length = bswap_32(sff_header.index_length);
+ sff_header.number_of_reads = bswap_32(sff_header.number_of_reads);
+ sff_header.header_length = bswap_16(sff_header.header_length);
+ sff_header.key_length = bswap_16(sff_header.key_length);
+ sff_header.flows_per_read = bswap_16(sff_header.flows_per_read);
+
+ if (sff_header.magic_number != sff_magic)
+ fatal("Invalid SFF file. Incorrect magic number. Must be 0x2e736666 (.sff).");
+
+ if (sff_header.version != 1)
+ fatal("Invalid SFF file. Incorrect version. Must be 1.");
+
+ if (sff_header.flowgram_format_code != 1)
+ fatal("Invalid SFF file. Incorrect flowgram format code. Must be 1.");
+
+ if (sff_header.header_length != 8 * ((31 + sff_header.flows_per_read + sff_header.key_length + 7) / 8))
+ fatal("Invalid SFF file. Incorrect header length.");
+
+ if (sff_header.key_length != 4)
+ fatal("Invalid SFF file. Incorrect key length. Must be 4.");
+
+ if ((sff_header.index_length > 0) && (sff_header.index_length < 8))
+ fatal("Invalid SFF file. Incorrect index size. Must be at least 8.");
+
+ /* read and check flow chars, key and padding */
+
+ if (fskip(fp_sff, sff_header.flows_per_read) < sff_header.flows_per_read)
+ fatal("Invalid SFF file. Unable to read flow characters. File may be truncated.");
+ filepos += sff_header.flows_per_read;
+
+ char * key_sequence = (char *) xmalloc(sff_header.key_length + 1);
+ if (fread(key_sequence, 1, sff_header.key_length, fp_sff) < sff_header.key_length)
+ fatal("Invalid SFF file. Unable to read key sequence. File may be truncated.");
+ key_sequence[sff_header.key_length] = 0;
+ filepos += sff_header.key_length;
+
+ uint32_t padding_length = sff_header.header_length - sff_header.flows_per_read - sff_header.key_length - 31;
+ if (fskip(fp_sff, padding_length) < padding_length)
+ fatal("Invalid SFF file. Unable to read padding. File may be truncated.");
+ filepos += padding_length;
+
+ double totallength = 0.0;
+ uint32_t minimum = UINT_MAX;
+ uint32_t maximum = 0;
+
+ bool index_done = (sff_header.index_offset == 0) || (sff_header.index_length == 0);
+ bool index_odd = false;
+ char index_kind[9];
+
+ uint32_t index_padding = 0;
+ if ((sff_header.index_length & 7) > 0)
+ index_padding = 8 - (sff_header.index_length & 7);
+
+ if (! opt_quiet)
+ {
+ fprintf(stderr, "Number of reads: %d\n", sff_header.number_of_reads);
+ fprintf(stderr, "Flows per read: %d\n", sff_header.flows_per_read);
+ fprintf(stderr, "Key sequence: %s\n", key_sequence);
+ }
+
+ if (opt_log)
+ {
+ fprintf(fp_log, "Number of reads: %d\n", sff_header.number_of_reads);
+ fprintf(fp_log, "Flows per read: %d\n", sff_header.flows_per_read);
+ fprintf(fp_log, "Key sequence: %s\n", key_sequence);
+ }
+
+ progress_init("Converting SFF: ", sff_header.number_of_reads);
+
+ for(uint32_t read_no = 0; read_no < sff_header.number_of_reads; read_no++)
+ {
+ /* check if the index block is here */
+
+ if (! index_done)
+ {
+ if (filepos == sff_header.index_offset)
+ {
+ if (fread(index_kind, 1, 8, fp_sff) < 8)
+ fatal("Invalid SFF file. Unable to read index header. File may be truncated.");
+ filepos += 8;
+ index_kind[8] = 0;
+
+ uint64 index_size = sff_header.index_length - 8 + index_padding;
+ if (fskip(fp_sff, index_size) != index_size)
+ fatal("Invalid SFF file. Unable to read entire index. File may be truncated.");
+
+ filepos += index_size;
+ index_done = true;
+ index_odd = true;
+ }
+ }
+
+ /* read and check each read header */
+
+ if (fread(&read_header, 1, 16, fp_sff) < 16)
+ fatal("Invalid SFF file. Unable to read read header. File may be truncated.");
+ filepos += 16;
+
+ read_header.read_header_length = bswap_16(read_header.read_header_length);
+ read_header.name_length = bswap_16(read_header.name_length);
+ read_header.number_of_bases = bswap_32(read_header.number_of_bases);
+ read_header.clip_qual_left = bswap_16(read_header.clip_qual_left);
+ read_header.clip_qual_right = bswap_16(read_header.clip_qual_right);
+ read_header.clip_adapter_left = bswap_16(read_header.clip_adapter_left);
+ read_header.clip_adapter_right = bswap_16(read_header.clip_adapter_right);
+
+ if (read_header.read_header_length != 8 * ((16 + read_header.name_length + 7) / 8))
+ fatal("Invalid SFF file. Incorrect read header length.");
+ if (read_header.clip_qual_left > read_header.number_of_bases)
+ fatal("Invalid SFF file. Incorrect clip_qual_left value.");
+ if (read_header.clip_adapter_left > read_header.number_of_bases)
+ fatal("Invalid SFF file. Incorrect clip_adapter_left value.");
+ if (read_header.clip_qual_right > read_header.number_of_bases)
+ fatal("Invalid SFF file. Incorrect clip_qual_right value.");
+ if (read_header.clip_adapter_right > read_header.number_of_bases)
+ fatal("Invalid SFF file. Incorrect clip_adapter_right value.");
+
+ char * read_name = (char *) xmalloc(read_header.name_length + 1);
+ if (fread(read_name, 1, read_header.name_length, fp_sff) < read_header.name_length)
+ fatal("Invalid SFF file. Unable to read read name. File may be truncated.");
+ filepos += read_header.name_length;
+ read_name[read_header.name_length] = 0;
+
+ uint32_t read_header_padding_length = read_header.read_header_length - read_header.name_length - 16;
+ if (fskip(fp_sff, read_header_padding_length) < read_header_padding_length)
+ fatal("Invalid SFF file. Unable to read read header padding. File may be truncated.");
+ filepos += read_header_padding_length;
+
+ /* read and check the flowgram and sequence */
+
+ if (fskip(fp_sff, 2 * sff_header.flows_per_read) < sff_header.flows_per_read)
+ fatal("Invalid SFF file. Unable to read flowgram values. File may be truncated.");
+ filepos += 2 * sff_header.flows_per_read;
+
+ if (fskip(fp_sff, read_header.number_of_bases) < read_header.number_of_bases)
+ fatal("Invalid SFF file. Unable to read flow indices. File may be truncated.");
+ filepos += read_header.number_of_bases;
+
+ char * bases = (char *) xmalloc(read_header.number_of_bases + 1);
+ if (fread(bases, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases)
+ fatal("Invalid SFF file. Unable to read read length. File may be truncated.");
+ bases[read_header.number_of_bases] = 0;
+ filepos += read_header.number_of_bases;
+
+ char * qual = (char *) xmalloc(read_header.number_of_bases + 1);
+ if (fread(qual, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases)
+ fatal("Invalid SFF file. Unable to read quality scores. File may be truncated.");
+ filepos += read_header.number_of_bases;
+
+ /* convert quality scores to ascii characters */
+ for(uint32_t base_no = 0; base_no < read_header.number_of_bases; base_no++)
+ {
+ int q = qual[base_no];
+ if (q < opt_fastq_qminout)
+ q = opt_fastq_qminout;
+ if (q > opt_fastq_qmaxout)
+ q = opt_fastq_qmaxout;
+ qual[base_no] = opt_fastq_asciiout + q;
+ }
+ qual[read_header.number_of_bases] = 0;
+
+ uint32_t read_data_length = (2 * sff_header.flows_per_read + 3 * read_header.number_of_bases);
+ uint32_t read_data_padded_length = 8 * ((read_data_length + 7) / 8);
+ uint32_t read_data_padding_length = read_data_padded_length - read_data_length;
+
+ if (fskip(fp_sff, read_data_padding_length) < read_data_padding_length)
+ fatal("Invalid SFF file. Unable to read read data padding. File may be truncated.");
+ filepos += read_data_padding_length;
+
+ uint32_t clip_start = 0;
+ clip_start = MAX(1, MAX(read_header.clip_qual_left, read_header.clip_adapter_left)) - 1;
+
+ uint32_t clip_end = read_header.number_of_bases;
+ clip_end = MIN((read_header.clip_qual_right == 0 ? read_header.number_of_bases : read_header.clip_qual_right), (read_header.clip_adapter_right == 0 ? read_header.number_of_bases : read_header.clip_adapter_right));
+
+ /* make the clipped bases lowercase and the rest uppercase */
+ for (uint32_t i = 0; i < read_header.number_of_bases; i++)
+ {
+ if ((i < clip_start) || (i >= clip_end))
+ bases[i] = tolower(bases[i]);
+ else
+ bases[i] = toupper(bases[i]);
+ }
+
+ if (opt_sff_clip)
+ {
+ bases[clip_end] = 0;
+ qual[clip_end] = 0;
+ }
+ else
+ {
+ clip_start = 0;
+ clip_end = read_header.number_of_bases;
+ }
+
+ uint32_t length = clip_end - clip_start;
+
+ fastq_print_general(fp_fastqout,
+ bases + clip_start,
+ length,
+ read_name,
+ strlen(read_name),
+ qual + clip_start,
+ 0, 0, 0, 0);
+
+ xfree(read_name);
+ xfree(bases);
+ xfree(qual);
+
+ totallength += length;
+ if (length < minimum)
+ minimum = length;
+ if (length > maximum)
+ maximum = length;
+
+ progress_update(read_no + 1);
+ }
+ progress_done();
+
+ /* check if the index block is here */
+
+ if (! index_done)
+ {
+ if (filepos == sff_header.index_offset)
+ {
+ if (fread(index_kind, 1, 8, fp_sff) < 8)
+ fatal("Invalid SFF file. Unable to read index header. File may be truncated.");
+ filepos += 8;
+ index_kind[8] = 0;
+
+ uint64 index_size = sff_header.index_length - 8;
+ if (fskip(fp_sff, index_size) != index_size)
+ fatal("Invalid SFF file. Unable to read entire index. File may be truncated.");
+
+ filepos += index_size;
+ index_done = true;
+
+ /* try to skip padding, if any */
+
+ if (index_padding > 0)
+ {
+ uint64_t got = fskip(fp_sff, index_padding);
+ if ((got < index_padding) && (got != 0))
+ fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n");
+ }
+ }
+ }
+
+ if (! index_done)
+ {
+ fprintf(stderr, "WARNING: SFF index missing\n");
+ if (opt_log)
+ fprintf(fp_log, "WARNING: SFF index missing\n");
+ }
+
+ if (index_odd)
+ {
+ fprintf(stderr, "WARNING: Index at unusual position in file\n");
+ if (opt_log)
+ fprintf(fp_log, "WARNING: Index at unusual position in file\n");
+ }
+
+ /* ignore the rest of file */
+
+ /* try reading just another byte */
+
+ if (fskip(fp_sff, 1) > 0)
+ {
+ fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n");
+ if (opt_log)
+ fprintf(fp_log, "WARNING: Additional data at end of SFF file ignored\n");
+ }
+
+ fclose(fp_sff);
+ fclose(fp_fastqout);
+
+ double average = totallength / sff_header.number_of_reads;
+
+ if (! opt_quiet)
+ {
+ if (sff_header.index_length > 0)
+ fprintf(stderr, "Index type: %s\n", index_kind);
+ fprintf(stderr, "\nSFF file read successfully.\n");
+ if (sff_header.number_of_reads > 0)
+ fprintf(stderr, "Sequence length: minimum %d, average %.1f, maximum %d\n",
+ minimum,
+ average,
+ maximum);
+ }
+
+ if (opt_log)
+ {
+ if (sff_header.index_length > 0)
+ fprintf(fp_log, "Index type: %s\n", index_kind);
+ fprintf(fp_log, "\nSFF file read successfully.\n");
+ if (sff_header.number_of_reads > 0)
+ fprintf(fp_log, "Sequence length: minimum %d, average %.1f, maximum %d\n",
+ minimum,
+ average,
+ maximum);
+ }
+
+ xfree(key_sequence);
+}
=====================================
src/sffconvert.h
=====================================
@@ -0,0 +1,61 @@
+/*
+
+ VSEARCH: a versatile open source tool for metagenomics
+
+ Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+ All rights reserved.
+
+ Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
+ Department of Informatics, University of Oslo,
+ PO Box 1080 Blindern, NO-0316 Oslo, Norway
+
+ This software is dual-licensed and available under a choice
+ of one of two licenses, either under the terms of the GNU
+ General Public License version 3 or the BSD 2-Clause License.
+
+
+ GNU General Public License version 3
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+
+ The BSD 2-Clause License
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+*/
+
+void sff_convert();
=====================================
src/sintax.cc
=====================================
@@ -86,7 +86,7 @@ static pthread_attr_t attr;
static fastx_handle query_fastx_h;
const int tax_levels = 8;
-const char * tax_letters = "kdpcofgs";
+const char * tax_letters = "dkpcofgs";
const int subset_size = 32;
const int bootstrap_count = 100;
@@ -150,8 +150,8 @@ bool sintax_parse_tax(const char * header,
void sintax_split(int seqno, int * level_start, int * level_len)
{
/* Parse taxonomy string into the following parts
- k kingdom
d domain
+ k kingdom
p phylum
c class
o order
=====================================
src/util.cc
=====================================
@@ -114,7 +114,7 @@ void __attribute__((noreturn)) fatal(const char * msg)
fprintf(stderr, "\n\n");
fprintf(stderr, "Fatal error: %s\n", msg);
- if (opt_log)
+ if (fp_log)
{
fprintf(fp_log, "\n\n");
fprintf(fp_log, "Fatal error: %s\n", msg);
=====================================
src/vsearch.cc
=====================================
@@ -77,6 +77,7 @@ bool opt_relabel_keep;
bool opt_relabel_md5;
bool opt_relabel_sha1;
bool opt_samheader;
+bool opt_sff_clip;
bool opt_sizeorder;
bool opt_xsize;
char * opt_allpairs_global;
@@ -139,6 +140,7 @@ char * opt_rereplicate;
char * opt_reverse;
char * opt_samout;
char * opt_search_exact;
+char * opt_sff_convert;
char * opt_shuffle;
char * opt_sintax;
char * opt_sortbylength;
@@ -789,6 +791,8 @@ void args_init(int argc, char **argv)
opt_search_exact = 0;
opt_self = 0;
opt_selfid = 0;
+ opt_sff_convert = 0;
+ opt_sff_clip = 0;
opt_shuffle = 0;
opt_sintax = 0;
opt_sintax_cutoff = 0.0;
@@ -1033,7 +1037,9 @@ void args_init(int argc, char **argv)
{"fastq_join", required_argument, 0, 0 },
{"join_padgap", required_argument, 0, 0 },
{"join_padgapq", required_argument, 0, 0 },
- { 0, 0, 0, 0 }
+ {"sff_convert", required_argument, 0, 0 },
+ {"sff_clip", no_argument, 0, 0 },
+ { 0, 0, 0, 0 }
};
int option_count = sizeof(long_options) / sizeof(struct option);
@@ -1891,6 +1897,14 @@ void args_init(int argc, char **argv)
opt_join_padgapq = optarg;
break;
+ case 202:
+ opt_sff_convert = optarg;
+ break;
+
+ case 203:
+ opt_sff_clip = 1;
+ break;
+
default:
fatal("Internal error in option parsing");
}
@@ -1980,6 +1994,8 @@ void args_init(int argc, char **argv)
commands++;
if (opt_fastq_join)
commands++;
+ if (opt_sff_convert)
+ commands++;
if (commands > 1)
@@ -2061,9 +2077,21 @@ void args_init(int argc, char **argv)
if (opt_fastq_qmin > opt_fastq_qmax)
fatal("The argument to --fastq_qmin cannot be larger than to --fastq_qmax");
+ if (opt_fastq_ascii + opt_fastq_qmin < 33)
+ fatal("Sum of arguments to --fastq_ascii and --fastq_qmin must be no less than 33");
+
+ if (opt_fastq_ascii + opt_fastq_qmax > 126)
+ fatal("Sum of arguments to --fastq_ascii and --fastq_qmax must be no more than 126");
+
if (opt_fastq_qminout > opt_fastq_qmaxout)
fatal("The argument to --fastq_qminout cannot be larger than to --fastq_qmaxout");
+ if (opt_fastq_asciiout + opt_fastq_qminout < 33)
+ fatal("Sum of arguments to --fastq_asciiout and --fastq_qminout must be no less than 33");
+
+ if (opt_fastq_asciiout + opt_fastq_qmaxout > 126)
+ fatal("Sum of arguments to --fastq_asciiout and --fastq_qmaxout must be no more than 126");
+
if (opt_gzip_decompress && opt_bzip2_decompress)
fatal("Specify either --gzip_decompress or --bzip2_decompress, not both");
@@ -2243,11 +2271,21 @@ void cmd_help()
" --relabel_keep keep the old label after the new when relabelling\n"
" --relabel_md5 relabel with md5 digest of normalized sequence\n"
" --relabel_sha1 relabel with sha1 digest of normalized sequence\n"
- " --sizeorder sort accepted centroids by abundance (AGC)\n"
+ " --sizeorder sort accepted centroids by abundance, AGC\n"
" --sizeout write cluster abundances to centroid file\n"
" --uc FILENAME specify filename for UCLUST-like output\n"
" --xsize strip abundance information in output\n"
"\n"
+ "Convert SFF to FASTQ\n"
+ " --sff_convert FILENAME convert given SFF file to FASTQ format\n"
+ " Parameters\n"
+ " --sff_clip clip ends of sequences as indicated in file (no)\n"
+ " --fastq_asciiout INT FASTQ output quality score ASCII base char (33)\n"
+ " --fastq_qmaxout INT maximum base quality value for FASTQ output (41)\n"
+ " --fastq_qminout INT minimum base quality value for FASTQ output (0)\n"
+ " Output\n"
+ " --fastqout FILENAME output converted sequences to given FASTQ file\n"
+ "\n"
"Dereplication and rereplication\n"
" --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n"
" --derep_prefix FILENAME dereplicate sequences in file based on prefixes\n"
@@ -2989,6 +3027,8 @@ int main(int argc, char** argv)
udb_stats();
else if (opt_sintax)
sintax();
+ else if (opt_sff_convert)
+ sff_convert();
else
cmd_none();
=====================================
src/vsearch.h
=====================================
@@ -125,6 +125,9 @@
#define PROG_OS "win"
#include <windows.h>
#include <psapi.h>
+#define bswap_16(x) _byteswap_ushort(x)
+#define bswap_32(x) _byteswap_ulong(x)
+#define bswap_64(x) _byteswap_uint64(x)
#else
@@ -132,6 +135,10 @@
#define PROG_OS "macos"
#include <sys/sysctl.h>
+#include <libkern/OSByteOrder.h>
+#define bswap_16(x) OSSwapInt16(x)
+#define bswap_32(x) OSSwapInt32(x)
+#define bswap_64(x) OSSwapInt64(x)
#else
@@ -142,6 +149,7 @@
#endif
#include <sys/sysinfo.h>
+#include <byteswap.h>
#endif
@@ -212,6 +220,7 @@
#include "kmerhash.h"
#include "sintax.h"
#include "fastqjoin.h"
+#include "sffconvert.h"
/* options */
@@ -230,6 +239,7 @@ extern bool opt_relabel_keep;
extern bool opt_relabel_md5;
extern bool opt_relabel_sha1;
extern bool opt_samheader;
+extern bool opt_sff_clip;
extern bool opt_sizeorder;
extern bool opt_xsize;
extern char * opt_allpairs_global;
@@ -292,6 +302,7 @@ extern char * opt_rereplicate;
extern char * opt_reverse;
extern char * opt_samout;
extern char * opt_search_exact;
+extern char * opt_sff_convert;
extern char * opt_shuffle;
extern char * opt_sintax;
extern char * opt_sortbylength;
View it on GitLab: https://salsa.debian.org/med-team/vsearch/commit/0d029e68b5adc51e3c578c65230c4c5b9c4a1f3b
--
View it on GitLab: https://salsa.debian.org/med-team/vsearch/commit/0d029e68b5adc51e3c578c65230c4c5b9c4a1f3b
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181218/aef0356b/attachment-0001.html>
More information about the debian-med-commit
mailing list