[med-svn] [Git][med-team/vsearch][master] 7 commits: New upstream version 2.10.4

Andreas Tille gitlab at salsa.debian.org
Fri Jan 11 22:03:36 GMT 2019


Andreas Tille pushed to branch master at Debian Med / vsearch


Commits:
a2403d5a by Andreas Tille at 2019-01-11T21:54:32Z
New upstream version 2.10.4
- - - - -
baba9c99 by Andreas Tille at 2019-01-11T21:54:33Z
Update upstream source from tag 'upstream/2.10.4'

Update to upstream version '2.10.4'
with Debian dir 3bd1892beea2fd7bf28f0378691eb9f94e3a6efd
- - - - -
f2859ca7 by Andreas Tille at 2019-01-11T21:54:33Z
New upstream version

- - - - -
9975266b by Andreas Tille at 2019-01-11T21:54:33Z
debhelper 12

- - - - -
ea6ece50 by Andreas Tille at 2019-01-11T21:54:35Z
Standards-Version: 4.3.0

- - - - -
e46ae418 by Andreas Tille at 2019-01-11T21:54:35Z
Secure URI in copyright format

- - - - -
03a0889a by Andreas Tille at 2019-01-11T21:55:29Z
Upload to unstable

- - - - -


17 changed files:

- LICENSE.txt
- README.md
- configure.ac
- debian/changelog
- debian/compat
- debian/control
- debian/copyright
- man/vsearch.1
- src/Makefile.am
- src/align_simd.cc
- src/cpu.cc
- src/cpu.h
- src/fastqops.cc
- src/results.cc
- src/searchcore.cc
- src/vsearch.cc
- src/vsearch.h


Changes:

=====================================
LICENSE.txt
=====================================
@@ -1,6 +1,6 @@
   VSEARCH: a versatile open source tool for metagenomics
 
-  Copyright (C) 2014-2017, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  Copyright (C) 2014-2019, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
   All rights reserved.
 
   Contact: Torbjorn Rognes <torognes at ifi.uio.no>,


=====================================
README.md
=====================================
@@ -16,7 +16,7 @@ We have implemented a tool called VSEARCH which supports *de novo* and reference
 
 VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.
 
-VSEARCH binaries are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) and Windows (64-bit, version 7 or higher), as well as ppc64le systems running GNU/Linux.
+VSEARCH binaries are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) and Windows (64-bit, version 7 or higher), as well as for 64-bit little-endian POWER8 (ppc64le) and 64-bit ARMv8 (aarch64) systems running GNU/Linux. VSEARCH contains dedicated SIMD code for these three processor architectures (SSE2, AltiVec/VMX/VSX, Neon).
 
 VSEARCH can directly read input query and database files that are compressed using gzip and bzip2 (.gz and .bz2) if the zlib and bzip2 libraries are available.
 
@@ -24,7 +24,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
 
 ## Getting Help
 
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
 
 ## Example
 
@@ -37,9 +37,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.10.2.tar.gz
-tar xzf v2.10.2.tar.gz
-cd vsearch-2.10.2
+wget https://github.com/torognes/vsearch/archive/v2.10.4.tar.gz
+tar xzf v2.10.4.tar.gz
+cd vsearch-2.10.4
 ./autogen.sh
 ./configure
 make
@@ -48,8 +48,6 @@ make install  # as root or sudo make install
 
 You may customize the installation directory using the `--prefix=DIR` option to `configure`. If the compression libraries [zlib](http://www.zlib.net) and/or [bzip2](http://www.bzip.org) are installed on the system, they will be detected automatically and support for compressed files will be included in vsearch. Support for compressed files may be disabled using the `--disable-zlib` and `--disable-bzip2` options to `configure`. A PDF version of the manual will be created from the `vsearch.1` manual file if `ps2pdf` is available, unless disabled using the `--disable-pdfman` option to `configure`. Other options may also be applied to `configure`; please run `configure -h` to see them all. GNU autotools (version 2.63 or later) and the gcc compiler are required to build vsearch.
 
-The IBM XL C++ compiler is recommended on ppc64le systems.
-
 The Windows binary was compiled using the [Mingw-w64](https://mingw-w64.org/) C++ cross-compiler.
 
 **Cloning the repo** Instead of downloading the source distribution as a compressed archive, you could clone the repo and build it as shown below. The options to `configure` as described above are still valid.
@@ -65,47 +63,56 @@ make install  # as root or sudo make install
 
 **Binary distribution** Starting with version 1.4.0, binary distribution files containing pre-compiled binaries as well as the documentation will be made available as part of each [release](https://github.com/torognes/vsearch/releases). The included executables support input files compressed with zlib and bzip2 (with files usually ending in `.gz` or `.bz2`).
 
-Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) and Windows (64-bit, version 7 or higher), as well as ppc64le systems running GNU/Linux.
+Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) and Windows (64-bit, version 7 or higher), as well as POWER8 (ppc64le) and 64-bit ARMv8 (aarch64) systems running GNU/Linux.
 
 Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-x86_64.tar.gz
-tar xzf vsearch-2.10.2-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-linux-x86_64.tar.gz
+tar xzf vsearch-2.10.4-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Linux ppc64le system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-ppc64le.tar.gz
-tar xzf vsearch-2.10.2-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-linux-ppc64le.tar.gz
+tar xzf vsearch-2.10.4-linux-ppc64le.tar.gz
+```
+
+Or these commands if you are using a Linux aarch64 system:
+
+```sh
+wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-linux-aarch64.tar.gz
+tar xzf vsearch-2.10.4-linux-aarch64.tar.gz
 ```
 
 Or these commands if you are using a Mac:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-macos-x86_64.tar.gz
-tar xzf vsearch-2.10.2-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-macos-x86_64.tar.gz
+tar xzf vsearch-2.10.4-macos-x86_64.tar.gz
 ```
 
 Or if you are using Windows, download and extract (unzip) the contents of this file:
 
 ```
-https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-win-x86_64.zip
 ```
 
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.10.2-linux-x86_64` or `vsearch-2.10.2-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.10.4-linux-x86_64` or `vsearch-2.10.4-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
 
-Windows: You will now have the binary distribution in a folder called `vsearch-2.10.2-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.10.4-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
 
 
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A PDF version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
 
 
 ## Plugins, packages, and wrappers
 
 **QIIME 2 plugin** Thanks to the [QIIME 2](https://github.com/qiime2) team, there is now a plugin called [q2-vsearch](https://github.com/qiime2/q2-vsearch) for [QIIME 2](https://qiime2.org).
 
+**Conda package** Thanks to the [BioConda](https://bioconda.github.io/) team, there is now a [vsearch package](https://anaconda.org/bioconda/vsearch) in [Conda](https://conda.io/).
+
 **Homebrew package** Thanks to [Torsten Seeman](https://github.com/tseemann), a [vsearch package](https://github.com/Homebrew/homebrew-science/pull/2409) for [Homebrew](http://brew.sh/) has been made.
 
 **Debian package** Thanks to the [Debian Med](https://www.debian.org/devel/debian-med/) team, there is now a [vsearch](https://packages.debian.org/sid/vsearch) package in [Debian](https://www.debian.org/).
@@ -187,6 +194,7 @@ File | Description
 **eestats.cc** | Produce statistics for fastq_eestats command
 **fasta.cc** | FASTA file parser
 **fastq.cc** | FASTQ file parser
+**fastqjoin.cc** | FASTQ paired-end reads joining
 **fastqops.cc** | FASTQ file statistics etc
 **fastx.cc** | Detection of FASTA and FASTQ files, wrapper for FASTA and FASTQ parsers
 **kmerhash.cc** | Hash for kmers used by paired-end read merger
@@ -203,6 +211,7 @@ File | Description
 **search.cc** | Implements search using global alignment
 **searchcore.cc** | Core search functions for searching, clustering and chimera detection
 **searchexact.cc** | Exact search functions
+**sffconvert.cc** | SFF to FASTQ file conversion
 **sha1.c** | SHA1 message digest
 **showalign.cc** | Output an alignment in a human-readable way given a CIGAR-string and the sequences
 **shuffle.cc** | Shuffle sequences
@@ -232,14 +241,6 @@ or you could send an email to [torognes at ifi.uio.no](mailto:torognes at ifi.uio.no?s
 VSEARCH is designed for rather short sequences, and will be slow when sequences are longer than about 5,000 bp. This is because it always performs optimal global alignment on selected sequences.
 
 
-## Future work
-
-Some issues to work on:
-
-* testing and debugging
-* heuristics for alignment of long sequences (e.g. banded alignment around selected diagonals)?
-
-
 ## The VSEARCH team
 
 The main contributors to VSEARCH:
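
The per-architecture SIMD support mentioned in the README hunk above comes down to compile-time dispatch on predefined macros, as the align_simd.cc changes further down show. A minimal sketch of that pattern (illustration only, not the vsearch sources):

```
// Compile-time selection of the 128-bit vector type, mirroring the pattern
// in the src/align_simd.cc hunks below. Illustration only, not vsearch code.
#include <cstdio>

#if defined(__PPC__)
  #include <altivec.h>                        /* AltiVec/VMX/VSX */
  typedef vector signed short VECTOR_SHORT;
#elif defined(__aarch64__)
  #include <arm_neon.h>                       /* Neon */
  typedef int16x8_t VECTOR_SHORT;
#elif defined(__x86_64__)
  #include <emmintrin.h>                      /* SSE2 */
  typedef __m128i VECTOR_SHORT;
#else
  #error Unknown Architecture
#endif

int main()
{
  /* all three types hold 8 lanes of signed 16-bit integers */
  printf("%zu bytes per vector\n", sizeof(VECTOR_SHORT));
  return 0;
}
```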


=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.10.2], [torognes at ifi.uio.no])
+AC_INIT([vsearch], [2.10.4], [torognes at ifi.uio.no])
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])
@@ -86,6 +86,7 @@ if test "x${have_zlib}" = "xyes"; then
 fi
 
 case $target in
+     aarch64*) target_aarch64="yes" ;;
      powerpc64*) target_ppc="yes" ;;
 esac
 
@@ -96,6 +97,7 @@ AM_CONDITIONAL(HAVE_ZLIB, test "x${have_zlib}" = "xyes")
 AM_CONDITIONAL(HAVE_PTHREADS, test "x${have_pthreads}" = "xyes")
 AM_CONDITIONAL(HAVE_PS2PDF, test "x${have_ps2pdf}" = "xyes")
 AM_CONDITIONAL(TARGET_PPC, test "x${target_ppc}" = "xyes")
+AM_CONDITIONAL(TARGET_AARCH64, test "x${target_aarch64}" = "xyes")
 AM_PROG_CC_C_O
 
 AC_CONFIG_FILES([Makefile


=====================================
debian/changelog
=====================================
@@ -1,3 +1,12 @@
+vsearch (2.10.4-1) unstable; urgency=medium
+
+  * New upstream version
+  * debhelper 12
+  * Standards-Version: 4.3.0
+  * Secure URI in copyright format
+
+ -- Andreas Tille <tille at debian.org>  Fri, 11 Jan 2019 22:54:35 +0100
+
 vsearch (2.10.2-1) unstable; urgency=medium
 
   * New upstream version


=====================================
debian/compat
=====================================
@@ -1 +1 @@
-11
+12


=====================================
debian/control
=====================================
@@ -4,13 +4,13 @@ Uploaders: Tim Booth <tbooth at ceh.ac.uk>,
            Andreas Tille <tille at debian.org>
 Section: science
 Priority: optional
-Build-Depends: debhelper (>= 11~),
+Build-Depends: debhelper (>= 12~),
                zlib1g-dev,
                libbz2-dev,
                python-markdown,
                ghostscript,
                time
-Standards-Version: 4.2.1
+Standards-Version: 4.3.0
 Vcs-Browser: https://salsa.debian.org/med-team/vsearch
 Vcs-Git: https://salsa.debian.org/med-team/vsearch.git
 Homepage: https://github.com/torognes/vsearch/


=====================================
debian/copyright
=====================================
@@ -1,4 +1,4 @@
-Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
+Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
 Upstream-Name: VSEARCH
 Upstream-Contact: Torbjørn Rognes <torognes at ifi.uio.no>
 Source: https://github.com/torognes/vsearch/


=====================================
man/vsearch.1
=====================================
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "December 10, 2018" "version 2.10.2" "USER COMMANDS"
+.TH vsearch 1 "January 4, 2019" "version 2.10.4" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication and
@@ -1430,7 +1430,7 @@ file containing the reverse reads.
 .BI \-\-sff_convert \0filename
 Convert the given SFF file to FASTQ. The FASTQ output file is
 specified with the \-\-fastqout option. The sequence may be clipped as
-specied in the SFF file if the option \-\-sff_clip is specified,
+specified in the SFF file if the option \-\-sff_clip is specified,
 otherwise no clipping occurs. Bases that would have been clipped are
 converted to lower case, while the rest is in upper case. The output
 quality encoding may be specified with the \-\-fastq_asciiout option
@@ -3496,6 +3496,16 @@ speed and memory usage improvements.
 .TP
 .BR v2.10.2\~ "released December 10th, 2018"
 Fixed bug in sintax with reversed order of domain and kingdom.
+.TP
+.BR v2.10.3\~ "released December 19th, 2018"
+Ported to Linux on ARMv8 (aarch64). Fixed compilation warning with gcc
+versions 8.1.0 and 8.2.0.
+.TP
+.BR v2.10.4\~ "released January 4th, 2019"
+Fixed serious bug in x86_64 SIMD alignment code introduced in version
+2.10.3. Added link to BioConda in README. Fixed bug in fastq_stats
+with sequence length 1. Fixed use of equals symbol in UC files for
+identical sequences with cluster_fast.
 .RE
 .LP
 .\" ============================================================================


=====================================
src/Makefile.am
=====================================
@@ -3,8 +3,12 @@ bin_PROGRAMS = $(top_builddir)/bin/vsearch
 if TARGET_PPC
 AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -mcpu=power8
 else
+if TARGET_AARCH64
+AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -march=armv8-a+simd -mtune=generic
+else
 AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -march=x86-64 -mtune=generic
 endif
+endif
 
 AM_CFLAGS=$(AM_CXXFLAGS)
 
@@ -66,12 +70,17 @@ if TARGET_PPC
 libcpu_a_SOURCES = cpu.cc $(VSEARCHHEADERS)
 noinst_LIBRARIES = libcpu.a libcityhash.a
 else
+if TARGET_AARCH64
+libcpu_a_SOURCES = cpu.cc $(VSEARCHHEADERS)
+noinst_LIBRARIES = libcpu.a libcityhash.a
+else
 libcpu_sse2_a_SOURCES = cpu.cc $(VSEARCHHEADERS)
 libcpu_sse2_a_CXXFLAGS = $(AM_CXXFLAGS) -msse2
 libcpu_ssse3_a_SOURCES = cpu.cc $(VSEARCHHEADERS)
 libcpu_ssse3_a_CXXFLAGS = $(AM_CXXFLAGS) -mssse3 -DSSSE3
 noinst_LIBRARIES = libcpu_sse2.a libcpu_ssse3.a libcityhash.a
 endif
+endif
 
 libcityhash_a_SOURCES = city.cc city.h
 
@@ -88,8 +97,12 @@ libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare
 if TARGET_PPC
 __top_builddir__bin_vsearch_LDADD = libcityhash.a libcpu.a
 else
+if TARGET_AARCH64
+__top_builddir__bin_vsearch_LDADD = libcityhash.a libcpu.a
+else
 __top_builddir__bin_vsearch_LDADD = libcityhash.a libcpu_ssse3.a libcpu_sse2.a
 endif
+endif
 
 endif
 


=====================================
src/align_simd.cc
=====================================
@@ -2,7 +2,7 @@
 
   VSEARCH: a versatile open source tool for metagenomics
 
-  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  Copyright (C) 2014-2019, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
   All rights reserved.
 
   Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
@@ -69,8 +69,6 @@
   maximize score
 */
 
-//#define DEBUG
-
 #define CHANNELS 8
 #define CDEPTH 4
 
@@ -89,14 +87,105 @@
 */
 
 #define MAXSEQLENPRODUCT 25000000
-//#define MAXSEQLENPRODUCT 160000
 
 static int64_t scorematrix[16][16];
 
+/*
+   The macros below usually operate on 128-bit vectors of 8 signed
+   short 16-bit integers. Additions and subtractions should be
+   saturated.  The shift operation should shift left by 2 bytes (one
+   short int) and shift in zeros. The v_mask_gt operation should
+   compare two vectors of signed shorts and return a 16-bit bitmask
+   with pairs of 2 bits set for each element greater in the first than
+   in the second argument.
+*/
+
 #ifdef __PPC__
+
 typedef vector signed short VECTOR_SHORT;
-#else
+
+const vector unsigned char perm_merge_long_low =
+  {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+   0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+
+const vector unsigned char perm_merge_long_high =
+  {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+   0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+
+#define v_init(a,b,c,d,e,f,g,h) (const VECTOR_SHORT){a,b,c,d,e,f,g,h}
+#define v_load(a) vec_ld(0, (VECTOR_SHORT *)(a))
+#define v_store(a, b) vec_st((vector unsigned char)(b), 0,              \
+                             (vector unsigned char *)(a))
+#define v_add(a, b) vec_adds((a), (b))
+#define v_sub(a, b) vec_subs((a), (b))
+#define v_sub_unsigned(a, b) ((VECTOR_SHORT)                            \
+                              vec_subs((vector unsigned short) (a),     \
+                                       (vector unsigned short) (b)))
+#define v_max(a, b) vec_max((a), (b))
+#define v_min(a, b) vec_min((a), (b))
+#define v_dup(a) vec_splat((VECTOR_SHORT){(short)(a), 0, 0, 0, 0, 0, 0, 0}, 0)
+#define v_zero vec_splat_s16(0)
+#define v_and(a, b) vec_and((a), (b))
+#define v_xor(a, b) vec_xor((a), (b))
+#define v_shift_left(a) vec_sld((a), v_zero, 2)
+
+#elif __aarch64__
+
+typedef int16x8_t VECTOR_SHORT;
+
+const uint16x8_t neon_mask =
+  {0x0003, 0x000c, 0x0030, 0x00c0, 0x0300, 0x0c00, 0x3000, 0xc000};
+
+#define v_init(a,b,c,d,e,f,g,h) (const VECTOR_SHORT){a,b,c,d,e,f,g,h}
+#define v_load(a) vld1q_s16((const int16_t *)(a))
+#define v_store(a, b) vst1q_s16((int16_t *)(a), (b))
+#define v_merge_lo_16(a, b) vzip1q_s16((a),(b))
+#define v_merge_hi_16(a, b) vzip2q_s16((a),(b))
+#define v_merge_lo_32(a, b) vreinterpretq_s16_s32(vzip1q_s32(vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b)))
+#define v_merge_hi_32(a, b) vreinterpretq_s16_s32(vzip2q_s32(vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b)))
+#define v_merge_lo_64(a, b) vreinterpretq_s16_s64(vcombine_s64(vget_low_s64(vreinterpretq_s64_s16(a)), vget_low_s64(vreinterpretq_s64_s16(b))))
+#define v_merge_hi_64(a, b) vreinterpretq_s16_s64(vcombine_s64(vget_high_s64(vreinterpretq_s64_s16(a)), vget_high_s64(vreinterpretq_s64_s16(b))))
+#define v_add(a, b) vqaddq_s16((a), (b))
+#define v_sub(a, b) vqsubq_s16((a), (b))
+#define v_sub_unsigned(a, b) vreinterpretq_s16_u16(vqsubq_u16(vreinterpretq_u16_s16(a), vreinterpretq_u16_s16(b)))
+#define v_max(a, b) vmaxq_s16((a), (b))
+#define v_min(a, b) vminq_s16((a), (b))
+#define v_dup(a) vdupq_n_s16(a)
+#define v_zero v_dup(0)
+#define v_and(a, b) vandq_s16((a), (b))
+#define v_xor(a, b) veorq_s16((a), (b))
+#define v_shift_left(a) vextq_s16((v_zero), (a), 7)
+#define v_mask_gt(a, b) vaddvq_u16(vandq_u16((vcgtq_s16((a), (b))), neon_mask))
+
+#elif __x86_64__
+
 typedef __m128i VECTOR_SHORT;
+
+#define v_init(a,b,c,d,e,f,g,h) _mm_set_epi16(h,g,f,e,d,c,b,a)
+#define v_load(a) _mm_load_si128((VECTOR_SHORT *)(a))
+#define v_store(a, b) _mm_store_si128((VECTOR_SHORT *)(a), (b))
+#define v_merge_lo_16(a, b) _mm_unpacklo_epi16((a),(b))
+#define v_merge_hi_16(a, b) _mm_unpackhi_epi16((a),(b))
+#define v_merge_lo_32(a, b) _mm_unpacklo_epi32((a),(b))
+#define v_merge_hi_32(a, b) _mm_unpackhi_epi32((a),(b))
+#define v_merge_lo_64(a, b) _mm_unpacklo_epi64((a),(b))
+#define v_merge_hi_64(a, b) _mm_unpackhi_epi64((a),(b))
+#define v_add(a, b) _mm_adds_epi16((a), (b))
+#define v_sub(a, b) _mm_subs_epi16((a), (b))
+#define v_sub_unsigned(a, b) _mm_subs_epu16((a), (b))
+#define v_max(a, b) _mm_max_epi16((a), (b))
+#define v_min(a, b) _mm_min_epi16((a), (b))
+#define v_dup(a) _mm_set1_epi16(a)
+#define v_zero v_dup(0)
+#define v_and(a, b) _mm_and_si128((a), (b))
+#define v_xor(a, b) _mm_xor_si128((a), (b))
+#define v_shift_left(a) _mm_slli_si128((a), 2)
+#define v_mask_gt(a, b) _mm_movemask_epi8(_mm_cmpgt_epi16((a), (b)))
+
+#else
+
+#error Unknown Architecture
+
 #endif
 
 struct s16info_s
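
The comment block introduced in this hunk pins down the contract of the new v_* macros. A tiny stand-alone SSE2 check of the v_mask_gt part of that contract (illustration only, not part of the patch):

```
// SSE2 sanity check of the v_mask_gt contract quoted above: every 16-bit
// lane where a > b contributes a pair of adjacent set bits to the result.
#include <emmintrin.h>
#include <cstdio>

int main()
{
  /* _mm_set_epi16 lists lanes high-to-low, so the last argument is lane 0 */
  __m128i a = _mm_set_epi16(0, 0, 0, 0, 0, 0, 9, 5);
  __m128i b = _mm_set_epi16(0, 0, 0, 0, 0, 0, 1, 7);
  /* lane 0: 5 > 7 is false; lane 1: 9 > 1 is true, so bits 2 and 3 are set */
  int mask = _mm_movemask_epi8(_mm_cmpgt_epi16(a, b));
  printf("mask = 0x%04x\n", mask);   /* prints mask = 0x000c */
  return 0;
}
```
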
@@ -174,35 +263,10 @@ void dumpscorematrix(CELL * m)
     }
 }
 
-#ifdef __PPC__
-  const vector unsigned char perm_merge_long_low =
-    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
-      0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17 };
-
-  const vector unsigned char perm_merge_long_high =
-    { 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
-      0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f };
-#endif
-
 void dprofile_fill16(CELL * dprofile_word,
                      CELL * score_matrix_word,
                      BYTE * dseq)
 {
-#ifdef __PPC__
-  vector signed short reg0, reg1, reg2, reg3, reg4, reg5, reg6, reg7;
-  vector signed int   reg8, reg9, reg10,reg11,reg12,reg13,reg14,reg15;
-  vector signed long long  reg16,reg17,reg18,reg19,reg20,reg21,reg22,reg23;
-  vector signed long long  reg24,reg25,reg26,reg27,reg28,reg29,reg30,reg31;
-#else
-  VECTOR_SHORT reg0,  reg1,  reg2,  reg3,  reg4,  reg5,  reg6,  reg7;
-  VECTOR_SHORT reg8,  reg9,  reg10, reg11, reg12, reg13, reg14, reg15;
-  VECTOR_SHORT reg16, reg17, reg18, reg19, reg20, reg21, reg22, reg23;
-  VECTOR_SHORT reg24, reg25, reg26, reg27, reg28, reg29, reg30, reg31;
-#endif
-
-  /* does not require ssse3 */
-  /* approx 4*(5*8+2*40)=480 instructions */
-
 #if 0
   dumpscorematrix(score_matrix_word);
 
@@ -222,16 +286,29 @@ void dprofile_fill16(CELL * dprofile_word,
 
     for(int i=0; i<16; i += 8)
     {
+
 #ifdef __PPC__
-      reg0 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[0] + i));
-      reg1 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[1] + i));
-      reg2 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[2] + i));
-      reg3 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[3] + i));
-      reg4 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[4] + i));
-      reg5 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[5] + i));
-      reg6 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[6] + i));
-      reg7 = vec_ld(0, (VECTOR_SHORT*)(score_matrix_word + d[7] + i));
+      vector signed short     reg0, reg1, reg2, reg3, reg4, reg5, reg6, reg7;
+      vector signed int       reg8, reg9, reg10,reg11,reg12,reg13,reg14,reg15;
+      vector signed long long reg16,reg17,reg18,reg19,reg20,reg21,reg22,reg23;
+      vector signed long long reg24,reg25,reg26,reg27,reg28,reg29,reg30,reg31;
+#else
+      VECTOR_SHORT reg0,  reg1,  reg2,  reg3,  reg4,  reg5,  reg6,  reg7;
+      VECTOR_SHORT reg8,  reg9,  reg10, reg11, reg12, reg13, reg14, reg15;
+      VECTOR_SHORT reg16, reg17, reg18, reg19, reg20, reg21, reg22, reg23;
+      VECTOR_SHORT reg24, reg25, reg26, reg27, reg28, reg29, reg30, reg31;
+#endif
 
+      reg0 = v_load(score_matrix_word + d[0] + i);
+      reg1 = v_load(score_matrix_word + d[1] + i);
+      reg2 = v_load(score_matrix_word + d[2] + i);
+      reg3 = v_load(score_matrix_word + d[3] + i);
+      reg4 = v_load(score_matrix_word + d[4] + i);
+      reg5 = v_load(score_matrix_word + d[5] + i);
+      reg6 = v_load(score_matrix_word + d[6] + i);
+      reg7 = v_load(score_matrix_word + d[7] + i);
+
+#ifdef __PPC__
       reg8  = (vector signed int) vec_mergeh(reg0, reg1);
       reg9  = (vector signed int) vec_mergel(reg0, reg1);
       reg10 = (vector signed int) vec_mergeh(reg2, reg3);
@@ -266,79 +343,43 @@ void dprofile_fill16(CELL * dprofile_word,
         (reg21, reg23, perm_merge_long_low);
       reg31 = (vector signed long long) vec_perm
         (reg21, reg23, perm_merge_long_high);
-
-      vec_st((vector unsigned char)reg24, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+0) + CHANNELS*j));
-      vec_st((vector unsigned char)reg25, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+1) + CHANNELS*j));
-      vec_st((vector unsigned char)reg26, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+2) + CHANNELS*j));
-      vec_st((vector unsigned char)reg27, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+3) + CHANNELS*j));
-      vec_st((vector unsigned char)reg28, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+4) + CHANNELS*j));
-      vec_st((vector unsigned char)reg29, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+5) + CHANNELS*j));
-      vec_st((vector unsigned char)reg30, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+6) + CHANNELS*j));
-      vec_st((vector unsigned char)reg31, 0, (vector unsigned char *)
-             (dprofile_word + CDEPTH*CHANNELS*(i+7) + CHANNELS*j));
-
 #else
-
-      reg0  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[0] + i));
-      reg1  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[1] + i));
-      reg2  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[2] + i));
-      reg3  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[3] + i));
-      reg4  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[4] + i));
-      reg5  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[5] + i));
-      reg6  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[6] + i));
-      reg7  = _mm_load_si128((VECTOR_SHORT*)(score_matrix_word + d[7] + i));
-
-      reg8  = _mm_unpacklo_epi16(reg0,  reg1);
-      reg9  = _mm_unpackhi_epi16(reg0,  reg1);
-      reg10 = _mm_unpacklo_epi16(reg2,  reg3);
-      reg11 = _mm_unpackhi_epi16(reg2,  reg3);
-      reg12 = _mm_unpacklo_epi16(reg4,  reg5);
-      reg13 = _mm_unpackhi_epi16(reg4,  reg5);
-      reg14 = _mm_unpacklo_epi16(reg6,  reg7);
-      reg15 = _mm_unpackhi_epi16(reg6,  reg7);
-
-      reg16 = _mm_unpacklo_epi32(reg8,  reg10);
-      reg17 = _mm_unpackhi_epi32(reg8,  reg10);
-      reg18 = _mm_unpacklo_epi32(reg12, reg14);
-      reg19 = _mm_unpackhi_epi32(reg12, reg14);
-      reg20 = _mm_unpacklo_epi32(reg9,  reg11);
-      reg21 = _mm_unpackhi_epi32(reg9,  reg11);
-      reg22 = _mm_unpacklo_epi32(reg13, reg15);
-      reg23 = _mm_unpackhi_epi32(reg13, reg15);
-
-      reg24 = _mm_unpacklo_epi64(reg16, reg18);
-      reg25 = _mm_unpackhi_epi64(reg16, reg18);
-      reg26 = _mm_unpacklo_epi64(reg17, reg19);
-      reg27 = _mm_unpackhi_epi64(reg17, reg19);
-      reg28 = _mm_unpacklo_epi64(reg20, reg22);
-      reg29 = _mm_unpackhi_epi64(reg20, reg22);
-      reg30 = _mm_unpacklo_epi64(reg21, reg23);
-      reg31 = _mm_unpackhi_epi64(reg21, reg23);
-
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+0) + CHANNELS*j), reg24);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+1) + CHANNELS*j), reg25);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+2) + CHANNELS*j), reg26);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+3) + CHANNELS*j), reg27);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+4) + CHANNELS*j), reg28);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+5) + CHANNELS*j), reg29);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+6) + CHANNELS*j), reg30);
-      _mm_store_si128((VECTOR_SHORT*)(dprofile_word +
-                                 CDEPTH*CHANNELS*(i+7) + CHANNELS*j), reg31);
+      reg8  = v_merge_lo_16(reg0,  reg1);
+      reg9  = v_merge_hi_16(reg0,  reg1);
+      reg10 = v_merge_lo_16(reg2,  reg3);
+      reg11 = v_merge_hi_16(reg2,  reg3);
+      reg12 = v_merge_lo_16(reg4,  reg5);
+      reg13 = v_merge_hi_16(reg4,  reg5);
+      reg14 = v_merge_lo_16(reg6,  reg7);
+      reg15 = v_merge_hi_16(reg6,  reg7);
+
+      reg16 = v_merge_lo_32(reg8,  reg10);
+      reg17 = v_merge_hi_32(reg8,  reg10);
+      reg18 = v_merge_lo_32(reg12, reg14);
+      reg19 = v_merge_hi_32(reg12, reg14);
+      reg20 = v_merge_lo_32(reg9,  reg11);
+      reg21 = v_merge_hi_32(reg9,  reg11);
+      reg22 = v_merge_lo_32(reg13, reg15);
+      reg23 = v_merge_hi_32(reg13, reg15);
+
+      reg24 = v_merge_lo_64(reg16, reg18);
+      reg25 = v_merge_hi_64(reg16, reg18);
+      reg26 = v_merge_lo_64(reg17, reg19);
+      reg27 = v_merge_hi_64(reg17, reg19);
+      reg28 = v_merge_lo_64(reg20, reg22);
+      reg29 = v_merge_hi_64(reg20, reg22);
+      reg30 = v_merge_lo_64(reg21, reg23);
+      reg31 = v_merge_hi_64(reg21, reg23);
 #endif
+
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+0) + CHANNELS*j, reg24);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+1) + CHANNELS*j, reg25);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+2) + CHANNELS*j, reg26);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+3) + CHANNELS*j, reg27);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+4) + CHANNELS*j, reg28);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+5) + CHANNELS*j, reg29);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+6) + CHANNELS*j, reg30);
+      v_store(dprofile_word + CDEPTH*CHANNELS*(i+7) + CHANNELS*j, reg31);
     }
   }
 #if 0
@@ -370,6 +411,9 @@ void dprofile_fill16(CELL * dprofile_word,
 #define VECTORBYTEPERMUTE vec_vbpermq
 #endif
 
+/* The VSX vec_bperm instruction puts the 16 selected bits of the first
+   source into bits 48-63 of the destination. */
+
 const vector unsigned char perm  = { 120, 112, 104,  96,  88,  80,  72,  64,
                                       56,  48,  40,  32,  24,  16,   8,   0 };
 
@@ -378,27 +422,27 @@ const vector unsigned char perm  = { 120, 112, 104,  96,  88,  80,  72,  64,
     vector unsigned short W, X, Y, Z;                                   \
     vector unsigned int WX, YZ;                                         \
     vector short VV;                                                    \
-    VV = vec_ld(0, &(V));                                               \
-    H = vec_adds(H, VV);                                                \
+    VV = v_load(&V);                                                    \
+    H = v_add(H, VV);                                                   \
     W = (vector unsigned short) VECTORBYTEPERMUTE                       \
       ((vector unsigned char) vec_cmpgt(F, H), perm);                   \
-    H = vec_max(H, F);                                                  \
+    H = v_max(H, F);                                                    \
     X = (vector unsigned short) VECTORBYTEPERMUTE                       \
       ((vector unsigned char) vec_cmpgt(E, H), perm);                   \
-    H = vec_max(H, E);                                                  \
-    H_MIN = vec_min(H_MIN, H);                                          \
-    H_MAX = vec_max(H_MAX, H);                                          \
+    H = v_max(H, E);                                                    \
+    H_MIN = v_min(H_MIN, H);                                            \
+    H_MAX = v_max(H_MAX, H);                                            \
     N = H;                                                              \
-    HF = vec_subs(H, QR_t);                                             \
-    F = vec_subs(F, R_t);                                               \
+    HF = v_sub(H, QR_t);                                                \
+    F = v_sub(F, R_t);                                                  \
     Y = (vector unsigned short) VECTORBYTEPERMUTE                       \
       ((vector unsigned char) vec_cmpgt(F, HF), perm);                  \
-    F = vec_max(F, HF);                                                 \
-    HE = vec_subs(H, QR_q);                                             \
-    E = vec_subs(E, R_q);                                               \
+    F = v_max(F, HF);                                                   \
+    HE = v_sub(H, QR_q);                                                \
+    E = v_sub(E, R_q);                                                  \
     Z = (vector unsigned short) VECTORBYTEPERMUTE                       \
       ((vector unsigned char) vec_cmpgt(E, HE), perm);                  \
-    E = vec_max(E, HE);                                                 \
+    E = v_max(E, HE);                                                   \
     WX = (vector unsigned int) vec_mergel(W, X);                        \
     YZ = (vector unsigned int) vec_mergel(Y, Z);                        \
     RES = (vector unsigned long long) vec_mergeh(WX, YZ);               \
@@ -406,23 +450,25 @@ const vector unsigned char perm  = { 120, 112, 104,  96,  88,  80,  72,  64,
 
 #else
 
+/* x86_64 & aarch64 */
+
 #define ALIGNCORE(H, N, F, V, PATH, QR_q, R_q, QR_t, R_t, H_MIN, H_MAX) \
-  H = _mm_adds_epi16(H, V);                                             \
-  *(PATH+0) = _mm_movemask_epi8(_mm_cmpgt_epi16(F, H));                 \
-  H = _mm_max_epi16(H, F);                                              \
-  *(PATH+1) = _mm_movemask_epi8(_mm_cmpgt_epi16(E, H));                 \
-  H = _mm_max_epi16(H, E);                                              \
-  H_MIN = _mm_min_epi16(H_MIN, H);                                      \
-  H_MAX = _mm_max_epi16(H_MAX, H);                                      \
+  H = v_add(H, V);                                                      \
+  *(PATH+0) = v_mask_gt(F, H);                                          \
+  H = v_max(H, F);                                                      \
+  *(PATH+1) = v_mask_gt(E, H);                                          \
+  H = v_max(H, E);                                                      \
+  H_MIN = v_min(H_MIN, H);                                              \
+  H_MAX = v_max(H_MAX, H);                                              \
   N = H;                                                                \
-  HF = _mm_subs_epi16(H, QR_t);                                         \
-  F = _mm_subs_epi16(F, R_t);                                           \
-  *(PATH+2) = _mm_movemask_epi8(_mm_cmpgt_epi16(F, HF));                \
-  F = _mm_max_epi16(F, HF);                                             \
-  HE = _mm_subs_epi16(H, QR_q);                                         \
-  E = _mm_subs_epi16(E, R_q);                                           \
-  *(PATH+3) = _mm_movemask_epi8(_mm_cmpgt_epi16(E, HE));                \
-  E = _mm_max_epi16(E, HE);
+  HF = v_sub(H, QR_t);                                                  \
+  F = v_sub(F, R_t);                                                    \
+  *(PATH+2) = v_mask_gt(F, HF);                                         \
+  F = v_max(F, HF);                                                     \
+  HE = v_sub(H, QR_q);                                                  \
+  E = v_sub(E, R_q);                                                    \
+  *(PATH+3) = v_mask_gt(E, HE);                                         \
+  E = v_max(E, HE);
 
 #endif
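
The rewritten ALIGNCORE above is one dynamic-programming cell update vectorized across 8 channels. Per lane it reduces to the following scalar sketch (same names as the macro; the vector code additionally uses saturating arithmetic and stores bitmasks rather than the 0/1 flags shown here):

```
// Scalar analogue of one ALIGNCORE step for a single channel (sketch).
// H: best score via match/mismatch, E: gap in query, F: gap in target;
// QR/R are gap open+extend and extend penalties; PATH records which
// choices won, for the later backtracking pass.
#include <algorithm>

void aligncore_scalar(short &H, short &N, short &E, short &F, short V,
                      short QR_q, short R_q, short QR_t, short R_t,
                      unsigned short *PATH, short &h_min, short &h_max)
{
  H = H + V;                           // diagonal: previous H plus score
  PATH[0] = F > H;  H = std::max(H, F);
  PATH[1] = E > H;  H = std::max(H, E);
  h_min = std::min(h_min, H);          // track extremes for overflow checks
  h_max = std::max(h_max, H);
  N = H;                               // hand H down to the next row
  short HF = H - QR_t;                 // open a target gap from H ...
  F = F - R_t;                         // ... or extend the existing one
  PATH[2] = F > HF;  F = std::max(F, HF);
  short HE = H - QR_q;                 // same choice for query gaps
  E = E - R_q;
  PATH[3] = E > HE;  E = std::max(E, HE);
}
```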
 
@@ -459,22 +505,24 @@ void aligncolumns_first(VECTOR_SHORT * Sm,
                         int64_t ql,
                         unsigned short * dir)
 {
-#ifdef __PPC__
 
   VECTOR_SHORT h4, h5, h6, h7, h8, E, HE, HF;
   VECTOR_SHORT * vp;
 
-  VECTOR_SHORT h_min = vec_splat_s16(0);
-  VECTOR_SHORT h_max = vec_splat_s16(0);
+  VECTOR_SHORT h_min = v_zero;
+  VECTOR_SHORT h_max = v_zero;
 
+#ifdef __PPC__
   vector unsigned long long RES1, RES2, RES;
+#endif
 
   int64_t i;
 
-  f0 = vec_subs(f0, QR_t_0);
-  f1 = vec_subs(f1, QR_t_1);
-  f2 = vec_subs(f2, QR_t_2);
-  f3 = vec_subs(f3, QR_t_3);
+  f0 = v_sub(f0, QR_t_0);
+  f1 = v_sub(f1, QR_t_1);
+  f2 = v_sub(f2, QR_t_2);
+  f3 = v_sub(f3, QR_t_3);
+
 
   for(i=0; i < ql - 1; i++)
     {
@@ -492,32 +540,38 @@ void aligncolumns_first(VECTOR_SHORT * Sm,
          Then use signed subtraction to obtain the correct value.
       */
 
-      h4 = (vector short) vec_subs((vector unsigned short) h4,
-                                   (vector unsigned short) Mm);
-      h4 = vec_subs(h4, M_QR_t_left);
+      h4 = v_sub_unsigned(h4, Mm);
+      h4 = v_sub(h4, M_QR_t_left);
 
-      E  = (vector short) vec_subs((vector unsigned short) E,
-                                   (vector unsigned short) Mm);
-      E  = vec_subs(E, M_QR_t_left);
-      E  = vec_subs(E, M_QR_q_interior);
+      E  = v_sub_unsigned(E, Mm);
+      E  = v_sub(E, M_QR_t_left);
+      E  = v_sub(E, M_QR_q_interior);
 
-      M_QR_t_left = vec_adds(M_QR_t_left, M_R_t_left);
+      M_QR_t_left = v_add(M_QR_t_left, M_R_t_left);
 
+#ifdef __PPC__
       ALIGNCORE(h0, h5, f0, vp[0], RES1,
                 QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max);
       ALIGNCORE(h1, h6, f1, vp[1], RES2,
                 QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max);
       RES = vec_perm(RES1, RES2, perm_merge_long_low);
-      vec_st((vector unsigned char) RES, 0,
-             (vector unsigned char *)(dir+16*i+0));
-
+      v_store((dir + 16*i + 0), RES);
       ALIGNCORE(h2, h7, f2, vp[2], RES1,
                 QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max);
       ALIGNCORE(h3, h8, f3, vp[3], RES2,
                 QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max);
       RES = vec_perm(RES1, RES2, perm_merge_long_low);
-      vec_st((vector unsigned char) RES, 0,
-             (vector unsigned char *)(dir+16*i+8));
+      v_store((dir + 16*i + 8), RES);
+#else
+      ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+0,
+                QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max);
+      ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+4,
+                QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max);
+      ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+8,
+                QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max);
+      ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12,
+                QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max);
+#endif
 
       hep[2*i+0] = h8;
       hep[2*i+1] = E;
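
The v_sub_unsigned/v_sub pair above implements the trick described in the comment: clamp restarted channels to zero with an unsigned saturating subtraction, then apply the real initial value with a signed one. A simplified SSE2 demonstration (the real code masks the penalty per channel, which is omitted here):

```
// Simplified demo of the channel-restart trick above (SSE2, illustration
// only): an unsigned saturating subtraction of 0xffff clamps the restarted
// channel to zero, then a signed subtraction installs the initial penalty.
#include <emmintrin.h>
#include <cstdio>

int main()
{
  __m128i h  = _mm_set1_epi16(1234);   /* previous scores in all 8 channels */
  __m128i Mm = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, -1);  /* channel 0 restarts */
  h = _mm_subs_epu16(h, Mm);           /* channel 0 -> 0, others unchanged */
  h = _mm_subs_epi16(h, _mm_set1_epi16(5));  /* e.g. a gap penalty of 5 */
  short out[8];
  _mm_storeu_si128((__m128i *) out, h);
  printf("%d %d\n", out[0], out[1]);   /* prints -5 1229 */
  return 0;
}
```
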
@@ -534,107 +588,25 @@ void aligncolumns_first(VECTOR_SHORT * Sm,
 
   E  = hep[2*i+1];
 
-  E = (vector short) vec_subs((vector unsigned short) E,
-                              (vector unsigned short) Mm);
-  E = vec_subs(E, M_QR_t_left);
-  E = vec_subs(E, M_QR_q_right);
+  E  = v_sub_unsigned(E, Mm);
+  E  = v_sub(E, M_QR_t_left);
+  E  = v_sub(E, M_QR_q_right);
+
 
+#ifdef __PPC__
   ALIGNCORE(h0, h5, f0, vp[0], RES1,
             QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max);
   ALIGNCORE(h1, h6, f1, vp[1], RES2,
             QR_q_r, R_q_r, QR_t_1, R_t_1, h_min, h_max);
   RES = vec_perm(RES1, RES2, perm_merge_long_low);
-  vec_st((vector unsigned char) RES, 0,
-         (vector unsigned char *)(dir+16*i+0));
-
+  v_store((dir + 16*i + 0), RES);
   ALIGNCORE(h2, h7, f2, vp[2], RES1,
             QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max);
   ALIGNCORE(h3, h8, f3, vp[3], RES2,
             QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max);
   RES = vec_perm(RES1, RES2, perm_merge_long_low);
-  vec_st((vector unsigned char) RES, 0,
-         (vector unsigned char *)(dir+16*i+8));
-
-  hep[2*i+0] = h8;
-  hep[2*i+1] = E;
-
-  Sm[0] = h5;
-  Sm[1] = h6;
-  Sm[2] = h7;
-  Sm[3] = h8;
-
-  *_h_min = h_min;
-  *_h_max = h_max;
-
+  v_store((dir + 16*i + 8), RES);
 #else
-
-  VECTOR_SHORT h4, h5, h6, h7, h8, E, HE, HF;
-  VECTOR_SHORT * vp;
-
-  VECTOR_SHORT h_min = _mm_setzero_si128();
-  VECTOR_SHORT h_max = _mm_setzero_si128();
-
-  int64_t i;
-
-  f0 = _mm_subs_epi16(f0, QR_t_0);
-  f1 = _mm_subs_epi16(f1, QR_t_1);
-  f2 = _mm_subs_epi16(f2, QR_t_2);
-  f3 = _mm_subs_epi16(f3, QR_t_3);
-
-  for(i=0; i < ql - 1; i++)
-    {
-      vp = qp[i+0];
-
-      h4 = hep[2*i+0];
-
-      E  = hep[2*i+1];
-
-      /*
-         Initialize selected h and e values for next/this round.
-         First zero those cells where a new sequence starts
-         by using an unsigned saturated subtraction of a huge value to
-         set it to zero.
-         Then use signed subtraction to obtain the correct value.
-      */
-
-      h4 = _mm_subs_epu16(h4, Mm);
-      h4 = _mm_subs_epi16(h4, M_QR_t_left);
-
-      E  = _mm_subs_epu16(E, Mm);
-      E  = _mm_subs_epi16(E, M_QR_t_left);
-      E  = _mm_subs_epi16(E, M_QR_q_interior);
-
-      M_QR_t_left = _mm_adds_epi16(M_QR_t_left, M_R_t_left);
-
-      ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+0,
-                QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max);
-      ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+4,
-                QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max);
-      ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+8,
-                QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max);
-      ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12,
-                QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max);
-
-      hep[2*i+0] = h8;
-      hep[2*i+1] = E;
-
-      h0 = h4;
-      h1 = h5;
-      h2 = h6;
-      h3 = h7;
-    }
-
-  /* the final round - using query gap penalties for right end */
-
-  vp = qp[i+0];
-
-  E  = hep[2*i+1];
-
-
-  E  = _mm_subs_epu16(E, Mm);
-  E  = _mm_subs_epi16(E, M_QR_t_left);
-  E  = _mm_subs_epi16(E, M_QR_q_right);
-
   ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0,
             QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max);
   ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4,
@@ -643,6 +615,8 @@ void aligncolumns_first(VECTOR_SHORT * Sm,
             QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max);
   ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12,
             QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max);
+#endif
+
 
   hep[2*i+0] = h8;
   hep[2*i+1] = E;
@@ -654,8 +628,6 @@ void aligncolumns_first(VECTOR_SHORT * Sm,
 
   *_h_min = h_min;
   *_h_max = h_max;
-
-#endif
 }
 
 void aligncolumns_rest(VECTOR_SHORT * Sm,
@@ -686,22 +658,22 @@ void aligncolumns_rest(VECTOR_SHORT * Sm,
                        int64_t ql,
                        unsigned short * dir)
 {
-#ifdef __PPC__
-
   VECTOR_SHORT h4, h5, h6, h7, h8, E, HE, HF;
   VECTOR_SHORT * vp;
 
-  VECTOR_SHORT h_min = vec_splat_s16(0);
-  VECTOR_SHORT h_max = vec_splat_s16(0);
+  VECTOR_SHORT h_min = v_zero;
+  VECTOR_SHORT h_max = v_zero;
 
+#ifdef __PPC__
   vector unsigned long long RES1, RES2, RES;
+#endif
 
   int64_t i;
 
-  f0 = vec_subs(f0, QR_t_0);
-  f1 = vec_subs(f1, QR_t_1);
-  f2 = vec_subs(f2, QR_t_2);
-  f3 = vec_subs(f3, QR_t_3);
+  f0 = v_sub(f0, QR_t_0);
+  f1 = v_sub(f1, QR_t_1);
+  f2 = v_sub(f2, QR_t_2);
+  f3 = v_sub(f3, QR_t_3);
 
   for(i=0; i < ql - 1; i++)
     {
@@ -711,21 +683,29 @@ void aligncolumns_rest(VECTOR_SHORT * Sm,
 
       E  = hep[2*i+1];
 
+#ifdef __PPC__
       ALIGNCORE(h0, h5, f0, vp[0], RES1,
                 QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max);
       ALIGNCORE(h1, h6, f1, vp[1], RES2,
                 QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max);
       RES = vec_perm(RES1, RES2, perm_merge_long_low);
-      vec_st((vector unsigned char) RES, 0,
-             (vector unsigned char *)(dir+16*i+0));
-
+      v_store((dir + 16*i + 0), RES);
       ALIGNCORE(h2, h7, f2, vp[2], RES1,
                 QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max);
       ALIGNCORE(h3, h8, f3, vp[3], RES2,
                 QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max);
       RES = vec_perm(RES1, RES2, perm_merge_long_low);
-      vec_st((vector unsigned char) RES, 0,
-             (vector unsigned char *)(dir+16*i+8));
+      v_store((dir + 16*i + 8), RES);
+#else
+      ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0,
+                QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max);
+      ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4,
+                QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max);
+      ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+ 8,
+                QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max);
+      ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12,
+                QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max);
+#endif
 
       hep[2*i+0] = h8;
       hep[2*i+1] = E;
@@ -742,81 +722,20 @@ void aligncolumns_rest(VECTOR_SHORT * Sm,
 
   E  = hep[2*i+1];
 
+#ifdef __PPC__
   ALIGNCORE(h0, h5, f0, vp[0], RES1,
             QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max);
   ALIGNCORE(h1, h6, f1, vp[1], RES2,
             QR_q_r, R_q_r, QR_t_1, R_t_1, h_min, h_max);
   RES = vec_perm(RES1, RES2, perm_merge_long_low);
-  vec_st((vector unsigned char) RES, 0,
-         (vector unsigned char *)(dir+16*i+0));
-
+  v_store((dir + 16*i + 0), RES);
   ALIGNCORE(h2, h7, f2, vp[2], RES1,
             QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max);
   ALIGNCORE(h3, h8, f3, vp[3], RES2,
             QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max);
   RES = vec_perm(RES1, RES2, perm_merge_long_low);
-  vec_st((vector unsigned char) RES, 0,
-         (vector unsigned char *)(dir+16*i+8));
-
-  hep[2*i+0] = h8;
-  hep[2*i+1] = E;
-
-  Sm[0] = h5;
-  Sm[1] = h6;
-  Sm[2] = h7;
-  Sm[3] = h8;
-
-  *_h_min = h_min;
-  *_h_max = h_max;
-
+  v_store((dir + 16*i + 8), RES);
 #else
-
-  VECTOR_SHORT h4, h5, h6, h7, h8, E, HE, HF;
-  VECTOR_SHORT * vp;
-
-  VECTOR_SHORT h_min = _mm_setzero_si128();
-  VECTOR_SHORT h_max = _mm_setzero_si128();
-
-  int64_t i;
-
-  f0 = _mm_subs_epi16(f0, QR_t_0);
-  f1 = _mm_subs_epi16(f1, QR_t_1);
-  f2 = _mm_subs_epi16(f2, QR_t_2);
-  f3 = _mm_subs_epi16(f3, QR_t_3);
-
-
-  for(i=0; i < ql - 1; i++)
-    {
-      vp = qp[i+0];
-
-      h4 = hep[2*i+0];
-
-      E  = hep[2*i+1];
-
-      ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0,
-                QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max);
-      ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4,
-                QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max);
-      ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+ 8,
-                QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max);
-      ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12,
-                QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max);
-
-      hep[2*i+0] = h8;
-      hep[2*i+1] = E;
-
-      h0 = h4;
-      h1 = h5;
-      h2 = h6;
-      h3 = h7;
-    }
-
-  /* the final round - using query gap penalties for right end */
-
-  vp = qp[i+0];
-
-  E  = hep[2*i+1];
-
   ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0,
             QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max);
   ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4,
@@ -825,6 +744,7 @@ void aligncolumns_rest(VECTOR_SHORT * Sm,
             QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max);
   ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12,
             QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max);
+#endif
 
   hep[2*i+0] = h8;
   hep[2*i+1] = E;
@@ -836,8 +756,6 @@ void aligncolumns_rest(VECTOR_SHORT * Sm,
 
   *_h_min = h_min;
   *_h_max = h_max;
-
-#endif
 }
 
 inline void pushop(s16info_s * s, char newop)
@@ -852,7 +770,7 @@ inline void pushop(s16info_s * s, char newop)
           char buf[11];
           int len = sprintf(buf, "%d", s->opcount);
           s->cigarend -= len;
-          strncpy(s->cigarend, buf, len);
+          memcpy(s->cigarend, buf, len);
         }
       s->op = newop;
       s->opcount = 1;
@@ -869,7 +787,7 @@ inline void finishop(s16info_s * s)
           char buf[11];
           int len = sprintf(buf, "%d", s->opcount);
           s->cigarend -= len;
-          strncpy(s->cigarend, buf, len);
+          memcpy(s->cigarend, buf, len);
         }
       s->op = 0;
       s->opcount = 0;
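
The strncpy-to-memcpy swaps in pushop and finishop are most likely the fix for the gcc 8.1.0/8.2.0 warning noted in the changelog: the digits are written into the middle of a preallocated CIGAR buffer and no NUL terminator is wanted, a pattern gcc 8 flags with -Wstringop-truncation when expressed as strncpy. A stand-alone sketch with hypothetical buffer sizes:

```
// Sketch of the pushop/finishop copy above: write the opcount digits just
// before the current end of a pre-filled CIGAR buffer, without a NUL.
// memcpy with an exact length says so explicitly; strncpy here would draw
// gcc 8's -Wstringop-truncation.
#include <cstdio>
#include <cstring>

int main()
{
  char cigar[16];
  memset(cigar, 'M', sizeof(cigar));   /* pretend ops were already written */
  char *cigarend = cigar + 8;
  char buf[11];
  int len = sprintf(buf, "%d", 128);   /* digits of the operation count */
  cigarend -= len;
  memcpy(cigarend, buf, len);          /* copy exactly len bytes, no NUL */
  fwrite(cigar, 1, sizeof(cigar), stdout);
  putchar('\n');                       /* prints MMMMM128MMMMMMMM */
  return 0;
}
```
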
@@ -905,8 +823,8 @@ void backtrack16(s16info_s * s,
     for(uint64_t j=0; j<dlen; j++)
     {
       uint64_t d = *((uint64_t *) (dirbuffer +
-                                             (offset + 16*s->qlen*(j/4) +
-                                              16*i + 4*(j&3)) % dirbuffersize));
+                                   (offset + 16*s->qlen*(j/4) +
+                                    16*i + 4*(j&3)) % dirbuffersize));
       if (d & maskup)
       {
         if (d & maskleft)
@@ -933,8 +851,8 @@ void backtrack16(s16info_s * s,
     for(uint64_t j=0; j<dlen; j++)
     {
       uint64_t d = *((uint64_t *) (dirbuffer +
-                                             (offset + 16*s->qlen*(j/4) +
-                                              16*i + 4*(j&3)) % dirbuffersize));
+                                   (offset + 16*s->qlen*(j/4) +
+                                    16*i + 4*(j&3)) % dirbuffersize));
       if (d & maskextup)
       {
         if (d & maskextleft)
@@ -973,8 +891,8 @@ void backtrack16(s16info_s * s,
     aligned++;
 
     uint64_t d = *((uint64_t *) (dirbuffer +
-                                           (offset + 16*s->qlen*(j/4) +
-                                            16*i + 4*(j&3)) % dirbuffersize));
+                                 (offset + 16*s->qlen*(j/4) +
+                                  16*i + 4*(j&3)) % dirbuffersize));
 
     if ((s->op == 'I') && (d & maskextleft))
     {
@@ -1240,7 +1158,7 @@ void search16(s16info_s * s,
       s->cigar = (char *) xmalloc(s->cigaralloc);
     }
 
-  VECTOR_SHORT T, M, T0;
+  VECTOR_SHORT M, T0;
 
   VECTOR_SHORT M_QR_target_left, M_R_target_left;
   VECTOR_SHORT M_QR_query_interior;
@@ -1273,58 +1191,29 @@ void search16(s16info_s * s,
   uint64_t next_id = 0;
   uint64_t done = 0;
 
+  T0 = v_init(-1, 0, 0, 0, 0, 0, 0, 0);
 
-#ifdef __PPC__
-
-  const vector short T0_init = { -1, 0, 0, 0, 0, 0, 0, 0 };
-  T0 = T0_init;
-
-  R_query_left = vec_splat((VECTOR_SHORT){s->penalty_gap_extension_query_left, 0, 0, 0, 0, 0, 0, 0}, 0);
-
-  QR_query_interior = vec_splat((VECTOR_SHORT){(short)(s->penalty_gap_open_query_interior + s->penalty_gap_extension_query_interior), 0, 0, 0, 0, 0, 0, 0}, 0);
-  R_query_interior  = vec_splat((VECTOR_SHORT){s->penalty_gap_extension_query_interior, 0, 0, 0, 0, 0, 0, 0}, 0);
-
-  QR_query_right  = vec_splat((VECTOR_SHORT){(short)(s->penalty_gap_open_query_right + s->penalty_gap_extension_query_right), 0, 0, 0, 0, 0, 0, 0}, 0);
-  R_query_right  = vec_splat((VECTOR_SHORT){s->penalty_gap_extension_query_right, 0, 0, 0, 0, 0, 0, 0}, 0);
-
-  QR_target_left  = vec_splat((VECTOR_SHORT){(short)(s->penalty_gap_open_target_left + s->penalty_gap_extension_target_left), 0, 0, 0, 0, 0, 0, 0}, 0);
-  R_target_left  = vec_splat((VECTOR_SHORT){s->penalty_gap_extension_target_left, 0, 0, 0, 0, 0, 0, 0}, 0);
-
-  QR_target_interior = vec_splat((VECTOR_SHORT){(short)(s->penalty_gap_open_target_interior + s->penalty_gap_extension_target_interior), 0, 0, 0, 0, 0, 0, 0}, 0);
-  R_target_interior = vec_splat((VECTOR_SHORT){s->penalty_gap_extension_target_interior, 0, 0, 0, 0, 0, 0, 0}, 0);
-
-  QR_target_right  = vec_splat((VECTOR_SHORT){(short)(s->penalty_gap_open_target_right + s->penalty_gap_extension_target_right), 0, 0, 0, 0, 0, 0, 0}, 0);
-  R_target_right  = vec_splat((VECTOR_SHORT){s->penalty_gap_extension_target_right, 0, 0, 0, 0, 0, 0, 0}, 0);
-
-#else
-
-  T0 = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, 0xffff);
-
-  R_query_left = _mm_set1_epi16(s->penalty_gap_extension_query_left);
+  R_query_left = v_dup(s->penalty_gap_extension_query_left);
 
-  QR_query_interior = _mm_set1_epi16(s->penalty_gap_open_query_interior +
-                                     s->penalty_gap_extension_query_interior);
-  R_query_interior  = _mm_set1_epi16(s->penalty_gap_extension_query_interior);
+  QR_query_interior = v_dup((s->penalty_gap_open_query_interior +
+                             s->penalty_gap_extension_query_interior));
+  R_query_interior  = v_dup(s->penalty_gap_extension_query_interior);
 
-  QR_query_right  = _mm_set1_epi16(s->penalty_gap_open_query_right +
-                                   s->penalty_gap_extension_query_right);
-  R_query_right  = _mm_set1_epi16(s->penalty_gap_extension_query_right);
+  QR_query_right  = v_dup((s->penalty_gap_open_query_right +
+                           s->penalty_gap_extension_query_right));
+  R_query_right  = v_dup(s->penalty_gap_extension_query_right);
 
-  QR_target_left  = _mm_set1_epi16(s->penalty_gap_open_target_left +
-                                   s->penalty_gap_extension_target_left);
-  R_target_left  = _mm_set1_epi16(s->penalty_gap_extension_target_left);
-
-  QR_target_interior = _mm_set1_epi16(s->penalty_gap_open_target_interior +
-                                     s->penalty_gap_extension_target_interior);
-  R_target_interior = _mm_set1_epi16(s->penalty_gap_extension_target_interior);
-
-  QR_target_right  = _mm_set1_epi16(s->penalty_gap_open_target_right +
-                                   s->penalty_gap_extension_target_right);
-  R_target_right  = _mm_set1_epi16(s->penalty_gap_extension_target_right);
-
-#endif
+  QR_target_left  = v_dup((s->penalty_gap_open_target_left +
+                           s->penalty_gap_extension_target_left));
+  R_target_left  = v_dup(s->penalty_gap_extension_target_left);
 
+  QR_target_interior = v_dup((s->penalty_gap_open_target_interior +
+                              s->penalty_gap_extension_target_interior));
+  R_target_interior = v_dup(s->penalty_gap_extension_target_interior);
 
+  QR_target_right  = v_dup((s->penalty_gap_open_target_right +
+                            s->penalty_gap_extension_target_right));
+  R_target_right  = v_dup(s->penalty_gap_extension_target_right);
 
   hep = (VECTOR_SHORT*) hearray;
   qp = (VECTOR_SHORT**) q_start;
@@ -1364,47 +1253,26 @@ void search16(s16info_s * s,
   short score_min = SHRT_MIN + gap_penalty_max;
   short score_max = SHRT_MAX;
 
-#ifdef __PPC__
-  const VECTOR_SHORT VZERO = vec_splat_s16(0);
-
   for(int i=0; i<4; i++)
     {
-      S[i] = vec_splat_s16(0);
-      dseqalloc[i] = vec_splat_s16(0);
+      S[i] = v_zero;
+      dseqalloc[i] = v_zero;
     }
 
-  VECTOR_SHORT H0 = vec_splat_s16(0);
-  VECTOR_SHORT H1 = vec_splat_s16(0);
-  VECTOR_SHORT H2 = vec_splat_s16(0);
-  VECTOR_SHORT H3 = vec_splat_s16(0);
+  VECTOR_SHORT H0 = v_zero;
+  VECTOR_SHORT H1 = v_zero;
+  VECTOR_SHORT H2 = v_zero;
+  VECTOR_SHORT H3 = v_zero;
 
-  VECTOR_SHORT F0 = vec_splat_s16(0);
-  VECTOR_SHORT F1 = vec_splat_s16(0);
-  VECTOR_SHORT F2 = vec_splat_s16(0);
-  VECTOR_SHORT F3 = vec_splat_s16(0);
-#else
-  for(int i=0; i<4; i++)
-    {
-      S[i] = _mm_setzero_si128();
-      dseqalloc[i] = _mm_setzero_si128();
-    }
-
-  VECTOR_SHORT H0 = _mm_setzero_si128();
-  VECTOR_SHORT H1 = _mm_setzero_si128();
-  VECTOR_SHORT H2 = _mm_setzero_si128();
-  VECTOR_SHORT H3 = _mm_setzero_si128();
-
-  VECTOR_SHORT F0 = _mm_setzero_si128();
-  VECTOR_SHORT F1 = _mm_setzero_si128();
-  VECTOR_SHORT F2 = _mm_setzero_si128();
-  VECTOR_SHORT F3 = _mm_setzero_si128();
-#endif
+  VECTOR_SHORT F0 = v_zero;
+  VECTOR_SHORT F1 = v_zero;
+  VECTOR_SHORT F2 = v_zero;
+  VECTOR_SHORT F3 = v_zero;
 
   int easy = 0;
 
   unsigned short * dir = dirbuffer;
 
-
   while(1)
   {
     if (easy)
@@ -1441,53 +1309,28 @@ void search16(s16info_s * s,
         {
           /* one or more sequences ended */
 
-#ifdef __PPC__
-          VECTOR_SHORT QR_diff = vec_subs(QR_target_right,
-                                           QR_target_interior);
-          VECTOR_SHORT R_diff  = vec_subs(R_target_right,
-                                           R_target_interior);
-          for(unsigned int j=0; j<CDEPTH; j++)
-            {
-              VECTOR_SHORT M = vec_splat_s16(0);
-              VECTOR_SHORT T = T0;
-              for(int c=0; c<CHANNELS; c++)
-                {
-                  if ((d_begin[c] == d_end[c]) &&
-                      (j >= ((d_length[c]+3) % 4)))
-                    {
-                      M = vec_xor(M, T);
-                    }
-                  T = vec_sld(T, VZERO, 2);
-                }
-              QR_target[j] = vec_adds(QR_target_interior,
-                                      vec_and(QR_diff, M));
-              R_target[j]  = vec_adds(R_target_interior,
-                                      vec_and(R_diff, M));
-            }
-#else
-          VECTOR_SHORT QR_diff = _mm_subs_epi16(QR_target_right,
-                                                QR_target_interior);
-          VECTOR_SHORT R_diff  = _mm_subs_epi16(R_target_right,
-                                                R_target_interior);
+          VECTOR_SHORT QR_diff = v_sub(QR_target_right,
+                                       QR_target_interior);
+          VECTOR_SHORT R_diff  = v_sub(R_target_right,
+                                       R_target_interior);
           for(unsigned int j=0; j<CDEPTH; j++)
             {
-              VECTOR_SHORT M = _mm_setzero_si128();
-              VECTOR_SHORT T = T0;
+              VECTOR_SHORT M = v_zero;
+              VECTOR_SHORT TT = T0;
               for(int c=0; c<CHANNELS; c++)
                 {
                   if ((d_begin[c] == d_end[c]) &&
                       (j >= ((d_length[c]+3) % 4)))
                     {
-                      M = _mm_xor_si128(M, T);
+                      M = v_xor(M, TT);
                     }
-                  T = _mm_slli_si128(T, 2);
+                  TT = v_shift_left(TT);
                 }
-              QR_target[j] = _mm_adds_epi16(QR_target_interior,
-                                            _mm_and_si128(QR_diff, M));
-              R_target[j]  = _mm_adds_epi16(R_target_interior,
-                                            _mm_and_si128(R_diff, M));
+              QR_target[j] = v_add(QR_target_interior,
+                                   v_and(QR_diff, M));
+              R_target[j]  = v_add(R_target_interior,
+                                   v_and(R_diff, M));
             }
-#endif
         }
 
       VECTOR_SHORT h_min, h_max;
@@ -1504,22 +1347,18 @@ void search16(s16info_s * s,
                         & h_min, & h_max,
                         qlen, dir);
 
+      VECTOR_SHORT h_min_vector;
+      VECTOR_SHORT h_max_vector;
+      v_store(& h_min_vector, h_min);
+      v_store(& h_max_vector, h_max);
       for(int c=0; c<CHANNELS; c++)
         {
           if (! overflow[c])
             {
-              signed short h_min_array[8];
-              signed short h_max_array[8];
-#ifdef __PPC__
-              *(VECTOR_SHORT*)h_min_array = h_min;
-              *(VECTOR_SHORT*)h_max_array = h_max;
-#else
-              _mm_storeu_si128((VECTOR_SHORT*)h_min_array, h_min);
-              _mm_storeu_si128((VECTOR_SHORT*)h_max_array, h_max);
-#endif
-              signed short h_min_c = h_min_array[c];
-              signed short h_max_c = h_max_array[c];
-              if ((h_min_c <= score_min) || (h_max_c >= score_max))
+              signed short h_min_c = ((signed short *)(& h_min_vector))[c];
+              signed short h_max_c = ((signed short *)(& h_max_vector))[c];
+              if ((h_min_c <= score_min) ||
+                  (h_max_c >= score_max))
                 overflow[c] = true;
             }
         }
@@ -1531,12 +1370,9 @@ void search16(s16info_s * s,
 
       easy = 1;
 
-#ifdef __PPC__
-      M = vec_splat_s16(0);
-#else
-      M = _mm_setzero_si128();
-#endif
-      T = T0;
+      M = v_zero;
+
+      VECTOR_SHORT T = T0;
       for (int c=0; c<CHANNELS; c++)
       {
         if (d_begin[c] < d_end[c])
@@ -1557,11 +1393,7 @@ void search16(s16info_s * s,
         {
           /* sequence in channel c ended. change of sequence */
 
-#ifdef __PPC__
-          M = vec_xor(M, T);
-#else
-          M = _mm_xor_si128(M, T);
-#endif
+          M = v_xor(M, T);
 
           int64_t cand_id = seq_id[c];
 
@@ -1673,11 +1505,7 @@ void search16(s16info_s * s,
                 dseq[CHANNELS*j+c] = 0;
             }
         }
-#ifdef __PPC__
-        T = vec_sld(T, VZERO, 2);
-#else
-        T = _mm_slli_si128(T, 2);
-#endif
+        T = v_shift_left(T);
       }
 
       if (done == sequences)
@@ -1685,23 +1513,13 @@ void search16(s16info_s * s,
 
       /* make masked versions of QR and R for gaps in target */
 
-#ifdef __PPC__
-      M_QR_target_left = vec_and(M, QR_target_left);
-      M_R_target_left = vec_and(M, R_target_left);
-#else
-      M_QR_target_left = _mm_and_si128(M, QR_target_left);
-      M_R_target_left = _mm_and_si128(M, R_target_left);
-#endif
+      M_QR_target_left = v_and(M, QR_target_left);
+      M_R_target_left = v_and(M, R_target_left);
 
       /* make masked versions of QR for gaps in query at target left end */
 
-#ifdef __PPC__
-      M_QR_query_interior = vec_and(M, QR_query_interior);
-      M_QR_query_right = vec_and(M, QR_query_right);
-#else
-      M_QR_query_interior = _mm_and_si128(M, QR_query_interior);
-      M_QR_query_right = _mm_and_si128(M, QR_query_right);
-#endif
+      M_QR_query_interior = v_and(M, QR_query_interior);
+      M_QR_query_right = v_and(M, QR_query_right);
 
       dprofile_fill16(dprofile, (CELL*) s->matrix, dseq);
 
@@ -1719,54 +1537,29 @@ void search16(s16info_s * s,
       else
         {
           /* one or more sequences ended */
-#ifdef __PPC__
-          VECTOR_SHORT QR_diff = vec_subs(QR_target_right,
-                                          QR_target_interior);
-          VECTOR_SHORT R_diff  = vec_subs(R_target_right,
-                                          R_target_interior);
-          for(unsigned int j=0; j<CDEPTH; j++)
-            {
-              VECTOR_SHORT M = vec_splat_s16(0);
-              VECTOR_SHORT T = T0;
-              for(int c=0; c<CHANNELS; c++)
-                {
-                  if ((d_begin[c] == d_end[c]) &&
-                      (j >= ((d_length[c]+3) % 4)))
-                    {
-                      M = vec_xor(M, T);
-                    }
-                  T = vec_sld(T, VZERO, 2);
-                }
-              QR_target[j] = vec_adds(QR_target_interior,
-                                      vec_and(QR_diff, M));
-              R_target[j]  = vec_adds(R_target_interior,
-                                      vec_and(R_diff, M));
-            }
-#else
-          VECTOR_SHORT QR_diff = _mm_subs_epi16(QR_target_right,
-                                                QR_target_interior);
 
-          VECTOR_SHORT R_diff  = _mm_subs_epi16(R_target_right,
-                                                R_target_interior);
+          VECTOR_SHORT QR_diff = v_sub(QR_target_right,
+                                       QR_target_interior);
+          VECTOR_SHORT R_diff  = v_sub(R_target_right,
+                                       R_target_interior);
           for(unsigned int j=0; j<CDEPTH; j++)
             {
-              VECTOR_SHORT M = _mm_setzero_si128();
-              VECTOR_SHORT T = T0;
+              VECTOR_SHORT M = v_zero;
+              VECTOR_SHORT TT = T0;
               for(int c=0; c<CHANNELS; c++)
                 {
                   if ((d_begin[c] == d_end[c]) &&
                       (j >= ((d_length[c]+3) % 4)))
                     {
-                      M = _mm_xor_si128(M, T);
+                      M = v_xor(M, TT);
                     }
-                  T = _mm_slli_si128(T, 2);
+                  TT = v_shift_left(TT);
                 }
-              QR_target[j] = _mm_adds_epi16(QR_target_interior,
-                                            _mm_and_si128(QR_diff, M));
-              R_target[j]  = _mm_adds_epi16(R_target_interior,
-                                            _mm_and_si128(R_diff, M));
+              QR_target[j] = v_add(QR_target_interior,
+                                   v_and(QR_diff, M));
+              R_target[j]  = v_add(R_target_interior,
+                                   v_and(R_diff, M));
             }
-#endif
         }
 
       VECTOR_SHORT h_min, h_max;
@@ -1787,21 +1580,16 @@ void search16(s16info_s * s,
                          M_QR_query_right,
                          qlen, dir);
 
+      VECTOR_SHORT h_min_vector;
+      VECTOR_SHORT h_max_vector;
+      v_store(& h_min_vector, h_min);
+      v_store(& h_max_vector, h_max);
       for(int c=0; c<CHANNELS; c++)
         {
           if (! overflow[c])
             {
-              signed short h_min_array[8];
-              signed short h_max_array[8];
-#ifdef __PPC__
-              *(VECTOR_SHORT*)h_min_array = h_min;
-              *(VECTOR_SHORT*)h_max_array = h_max;
-#else
-              _mm_storeu_si128((VECTOR_SHORT*)h_min_array, h_min);
-              _mm_storeu_si128((VECTOR_SHORT*)h_max_array, h_max);
-#endif
-              signed short h_min_c = h_min_array[c];
-              signed short h_max_c = h_max_array[c];
+              signed short h_min_c = ((signed short *)(& h_min_vector))[c];
+              signed short h_max_c = ((signed short *)(& h_max_vector))[c];
               if ((h_min_c <= score_min) ||
                   (h_max_c >= score_max))
                 overflow[c] = true;
@@ -1809,27 +1597,15 @@ void search16(s16info_s * s,
         }
     }
 
-#ifdef __PPC__
-    H0 = vec_subs(H3, R_query_left);
-    H1 = vec_subs(H0, R_query_left);
-    H2 = vec_subs(H1, R_query_left);
-    H3 = vec_subs(H2, R_query_left);
-
-    F0 = vec_subs(F3, R_query_left);
-    F1 = vec_subs(F0, R_query_left);
-    F2 = vec_subs(F1, R_query_left);
-    F3 = vec_subs(F2, R_query_left);
-#else
-    H0 = _mm_subs_epi16(H3, R_query_left);
-    H1 = _mm_subs_epi16(H0, R_query_left);
-    H2 = _mm_subs_epi16(H1, R_query_left);
-    H3 = _mm_subs_epi16(H2, R_query_left);
-
-    F0 = _mm_subs_epi16(F3, R_query_left);
-    F1 = _mm_subs_epi16(F0, R_query_left);
-    F2 = _mm_subs_epi16(F1, R_query_left);
-    F3 = _mm_subs_epi16(F2, R_query_left);
-#endif
+    H0 = v_sub(H3, R_query_left);
+    H1 = v_sub(H0, R_query_left);
+    H2 = v_sub(H1, R_query_left);
+    H3 = v_sub(H2, R_query_left);
+
+    F0 = v_sub(F3, R_query_left);
+    F1 = v_sub(F0, R_query_left);
+    F2 = v_sub(F1, R_query_left);
+    F3 = v_sub(F2, R_query_left);
 
     dir += 4 * 4 * s->qlen;
 


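A note on the refactoring above: the architecture-specific intrinsics on the removed lines (_mm_set1_epi16/_mm_setzero_si128 on x86-64, vec_splat/vec_splat_s16 on POWER) are folded into a single set of v_* wrappers so that search16() no longer needs per-architecture #ifdef blocks. The wrappers themselves are defined earlier in align_simd.cc and are not part of this excerpt; the sketch below of such a portability layer is inferred from the intrinsics visible in the hunks, and the actual 2.10.4 definitions may differ.

/* Sketch of the v_* portability layer implied by this hunk; mappings
   follow the intrinsics on the removed lines. AltiVec mappings
   (vec_splat, vec_adds, vec_subs, ...) are omitted for brevity. */
#ifdef __x86_64__
#include <emmintrin.h>
typedef __m128i VECTOR_SHORT;
#define v_zero          _mm_setzero_si128()
#define v_dup(a)        _mm_set1_epi16(a)
#define v_add(a, b)     _mm_adds_epi16((a), (b))   /* saturating add */
#define v_sub(a, b)     _mm_subs_epi16((a), (b))   /* saturating sub */
#define v_and(a, b)     _mm_and_si128((a), (b))
#define v_xor(a, b)     _mm_xor_si128((a), (b))
#define v_shift_left(a) _mm_slli_si128((a), 2)     /* shift in one zero lane */
#define v_store(p, a)   _mm_storeu_si128((__m128i *)(p), (a))
#elif defined(__aarch64__)
#include <arm_neon.h>
typedef int16x8_t VECTOR_SHORT;
#define v_zero          vdupq_n_s16(0)
#define v_dup(a)        vdupq_n_s16(a)
#define v_add(a, b)     vqaddq_s16((a), (b))
#define v_sub(a, b)     vqsubq_s16((a), (b))
#define v_and(a, b)     vandq_s16((a), (b))
#define v_xor(a, b)     veorq_s16((a), (b))
#define v_shift_left(a) vextq_s16(vdupq_n_s16(0), (a), 7)
#define v_store(p, a)   vst1q_s16((int16_t *)(p), (a))
#endif

With this layer in place the body of search16() is identical for SSE2, AltiVec and Neon, which is what makes the large deletions in this hunk possible.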
=====================================
src/cpu.cc
=====================================
@@ -63,7 +63,59 @@
 /* This file contains code dependent on special cpu features. */
 /* The file may be compiled several times with different cpu options. */
 
-#ifdef __PPC__
+#ifdef __aarch64__
+
+void increment_counters_from_bitmap(unsigned short * counters,
+                                    unsigned char * bitmap,
+                                    unsigned int totalbits)
+{
+  const uint8x16_t c1 =
+    { 0x01, 0x01, 0x02, 0x02, 0x04, 0x04, 0x08, 0x08,
+      0x10, 0x10, 0x20, 0x20, 0x40, 0x40, 0x80, 0x80 };
+
+  unsigned short * p = (unsigned short *)(bitmap);
+  int16x8_t * q = (int16x8_t *)(counters);
+  int r = (totalbits + 15) / 16;
+
+  for(int j=0; j<r; j++)
+    {
+      uint16x8_t r0;
+      uint8x16_t r1, r2, r3, r4;
+      int16x8_t r5, r6;
+
+      // load and duplicate short
+      r0 = vdupq_n_u16(*p);
+      p++;
+
+      // cast to bytes
+      r1 = vreinterpretq_u8_u16(r0);
+
+      // bit test with mask giving 0x00 or 0xff
+      r2 = vtstq_u8(r1, c1);
+
+      // transpose to duplicate even bytes
+      r3 = vtrn1q_u8(r2, r2);
+
+      // transpose to duplicate odd bytes
+      r4 = vtrn2q_u8(r2, r2);
+
+      // cast to signed 0x0000 or 0xffff
+      r5 = vreinterpretq_s16_u8(r3);
+
+      // cast to signed 0x0000 or 0xffff
+      r6 = vreinterpretq_s16_u8(r4);
+
+      // subtract signed 0 or -1 (i.e. add 0 or 1) with saturation to counter
+      *q = vqsubq_s16(*q, r5);
+      q++;
+
+      // subtract signed 0 or -1 (i.e. add 0 or 1) with saturation to counter
+      *q = vqsubq_s16(*q, r6);
+      q++;
+    }
+}
+
+#elif __PPC__
 
 void increment_counters_from_bitmap(unsigned short * counters,
                                     unsigned char * bitmap,
@@ -102,7 +154,7 @@ void increment_counters_from_bitmap(unsigned short * counters,
     }
 }
 
-#else
+#elif __x86_64__
 
 #ifdef SSSE3
 void increment_counters_from_bitmap_ssse3(unsigned short * counters,
@@ -170,4 +222,8 @@ void increment_counters_from_bitmap_sse2(unsigned short * counters,
     }
 }
 
+#else
+
+#error Unknown architecture
+
 #endif


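The new aarch64 implementation above processes 16 bitmap bits per iteration: vtstq_u8 expands each bit into a 0x00/0xFF byte, the two transposes split the even and odd bytes into two vectors of 16-bit 0/-1 masks, and vqsubq_s16 subtracts those masks (adding 0 or 1 with signed saturation) to eight counters at a time. A scalar reference for the same operation, as a sketch (not part of the patch):

#include <limits.h>

/* Scalar reference for the NEON loop above: for every bit i set in
   the bitmap (LSB first within each byte), add 1 to counters[i],
   saturating at SHRT_MAX to match the signed saturation of
   vqsubq_s16. */
static void increment_counters_from_bitmap_scalar(unsigned short * counters,
                                                  const unsigned char * bitmap,
                                                  unsigned int totalbits)
{
  short * c = (short *) counters;  /* the SIMD code counts in signed 16-bit */
  for (unsigned int i = 0; i < totalbits; i++)
    if (bitmap[i / 8] & (1u << (i % 8)))
      if (c[i] < SHRT_MAX)
        c[i]++;
}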
=====================================
src/cpu.h
=====================================
@@ -58,16 +58,15 @@
 
 */
 
-#ifdef __PPC__
-void increment_counters_from_bitmap(unsigned short * counters,
-                                    unsigned char * bitmap,
-                                    unsigned int totalbits);
-#else
+#ifdef __x86_64__
 void increment_counters_from_bitmap_sse2(unsigned short * counters,
                                          unsigned char * bitmap,
                                          unsigned int totalbits);
-
 void increment_counters_from_bitmap_ssse3(unsigned short * counters,
                                           unsigned char * bitmap,
                                           unsigned int totalbits);
+#else
+void increment_counters_from_bitmap(unsigned short * counters,
+                                    unsigned char * bitmap,
+                                    unsigned int totalbits);
 #endif


=====================================
src/fastqops.cc
=====================================
@@ -889,7 +889,7 @@ void fastq_stats()
       fprintf(fp_log, "  Len     Q=5    Q=10    Q=15    Q=20\n");
       fprintf(fp_log, "-----  ------  ------  ------  ------\n");
 
-      for(int64_t i = len_max; i >= len_max/2; i--)
+      for(int64_t i = len_max; i >= MAX(1, len_max/2); i--)
         {
           double read_percentage[4];
 


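The fastqops.cc change above clamps the lower bound of the per-length report loop. Previously, a dataset whose longest read was a single base (so len_max/2 truncates to 0) would let i reach 0 and presumably emit a bogus row for read length 0. A sketch of the corrected boundary, assuming vsearch's MAX macro has the usual semantics:

#include <inttypes.h>
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Boundary sketch for the fix above: with len_max == 1, the old lower
   bound len_max/2 == 0 included i == 0; clamping the bound to 1 keeps
   only valid read lengths in the table. */
static void print_length_rows(int64_t len_max)
{
  for (int64_t i = len_max; i >= MAX(1, len_max / 2); i--)
    printf("%5" PRId64 "  ...\n", i);  /* one table row per read length */
}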
=====================================
src/results.cc
=====================================
@@ -2,7 +2,7 @@
 
   VSEARCH: a versatile open source tool for metagenomics
 
-  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  Copyright (C) 2014-2019, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
   All rights reserved.
 
   Contact: Torbjorn Rognes <torognes at ifi.uio.no>,
@@ -197,15 +197,28 @@ void results_show_uc_one(FILE * fp,
     strand: + or -
     0
     0
-    compressed alignment, e.g. 9I92M14D, or "=" if prefect alignment
+    compressed alignment, e.g. 9I92M14D, or "=" if perfect alignment
     query label
     target label
   */
 
   if (hp)
     {
-      bool perfect = (hp->matches == qseqlen) &&
-        ((uint64_t)(qseqlen) == db_getsequencelen(hp->target));
+      bool perfect;
+
+      if (opt_cluster_fast)
+        {
+          /* cluster_fast */
+          /* use = for identical sequences ignoring terminal gaps */
+          perfect = (hp->matches == hp->internal_alignmentlength);
+        }
+      else
+        {
+          /* cluster_size, cluster_smallmem, cluster_unoise */
+          /* usearch_global, search_exact, allpairs_global */
+          /* use = for strictly identical sequences */
+          perfect = (hp->matches == hp->nwalignmentlength);
+        }
 
       fprintf(fp,
               "H\t%d\t%" PRId64 "\t%.1f\t%c\t0\t0\t%s\t%s\t%s\n",


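Besides fixing the typo, the rewritten test above changes what "perfect" means: instead of comparing the match count against the query and target lengths, it compares against alignment lengths, and the criterion now depends on the command. --cluster_fast writes "=" when the sequences are identical apart from terminal gaps, while the other clustering and search commands write "=" only for strictly identical sequences. A condensed sketch (field names follow the patch; the reduced struct is illustrative only):

#include <cstdint>

struct hit_sketch
{
  int64_t matches;                  /* matching alignment columns      */
  int64_t internal_alignmentlength; /* columns excluding terminal gaps */
  int64_t nwalignmentlength;        /* all columns incl. terminal gaps */
};

static bool alignment_is_perfect(const hit_sketch & hp, bool cluster_fast)
{
  if (cluster_fast)
    return hp.matches == hp.internal_alignmentlength; /* ignore end gaps */
  else
    return hp.matches == hp.nwalignmentlength;        /* strict identity */
}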
=====================================
src/searchcore.cc
=====================================
@@ -185,15 +185,15 @@ void search_topscores(struct searchinfo_s * si)
 
       if (bitmap)
         {
-#ifdef __PPC__
-          increment_counters_from_bitmap(si->kmers, bitmap, indexed_count);
-#else
+#ifdef __x86_64__
           if (ssse3_present)
             increment_counters_from_bitmap_ssse3(si->kmers,
                                                  bitmap, indexed_count);
           else
             increment_counters_from_bitmap_sse2(si->kmers,
                                                 bitmap, indexed_count);
+#else
+          increment_counters_from_bitmap(si->kmers, bitmap, indexed_count);
 #endif
         }
       else


=====================================
src/vsearch.cc
=====================================
@@ -278,6 +278,7 @@ int64_t opt_wordlength;
 /* cpu features available */
 
 int64_t altivec_present = 0;
+int64_t neon_present = 0;
 int64_t mmx_present = 0;
 int64_t sse_present = 0;
 int64_t sse2_present = 0;
@@ -300,7 +301,7 @@ FILE * fp_log = 0;
 char * STDIN_NAME = (char*) "/dev/stdin";
 char * STDOUT_NAME = (char*) "/dev/stdout";
 
-#ifndef __PPC__
+#ifdef __x86_64__
 #define cpuid(f1, f2, a, b, c, d)                                \
   __asm__ __volatile__ ("cpuid"                                  \
                         : "=a" (a), "=b" (b), "=c" (c), "=d" (d) \
@@ -309,9 +310,16 @@ char * STDOUT_NAME = (char*) "/dev/stdout";
 
 void cpu_features_detect()
 {
-#ifdef __PPC__
-  altivec_present = 1;
+#ifdef __aarch64__
+#ifdef __ARM_NEON
+  /* could additionally check /proc/cpuinfo for the asimd or neon flags */
+  neon_present = 1;
 #else
+#error ARM Neon not present
+#endif
+#elif __PPC__
+  altivec_present = 1;
+#elif __x86_64__
   unsigned int a, b, c, d;
 
   cpuid(0, 0, a, b, c, d);
@@ -336,12 +344,16 @@ void cpu_features_detect()
       avx2_present = (b >>  5) & 1;
     }
   }
+#else
+#error Unknown architecture
 #endif
 }
 
 void cpu_features_show()
 {
   fprintf(stderr, "CPU features:");
+  if (neon_present)
+    fprintf(stderr, " neon");
   if (altivec_present)
     fprintf(stderr, " altivec");
   if (mmx_present)
@@ -2962,7 +2974,7 @@ int main(int argc, char** argv)
 
   dynlibs_open();
 
-#ifndef __PPC__
+#ifdef __x86_64__
   if (!sse2_present)
     fatal("Sorry, this program requires a cpu with SSE2.");
 #endif


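With these changes, cpu_features_detect() compiles the cpuid inline assembly only on x86-64, hard-codes the guaranteed SIMD feature on POWER (AltiVec) and aarch64 (Neon), and turns any other architecture into a compile-time error. For reference, the same leaf-1 feature bits can be read with the <cpuid.h> helper shipped by GCC and Clang; a sketch, not the patch's code:

#include <cpuid.h>
#include <cstdio>

/* Equivalent leaf-1 feature queries via __get_cpuid (x86-64 only).
   Per Intel's CPUID documentation, EDX bit 26 is SSE2 and ECX bit 9
   is SSSE3. */
int main()
{
  unsigned int a, b, c, d;
  if (__get_cpuid(1, &a, &b, &c, &d))
    printf("sse2=%u ssse3=%u\n", (d >> 26) & 1, (c >> 9) & 1);
  return 0;
}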
=====================================
src/vsearch.h
=====================================
@@ -96,7 +96,17 @@
 #define PROG_NAME PACKAGE
 #define PROG_VERSION PACKAGE_VERSION
 
-#ifdef __PPC__
+#ifdef __x86_64__
+
+#define PROG_CPU "x86_64"
+#ifdef __SSE2__
+#include <emmintrin.h>
+#endif
+#ifdef __SSSE3__
+#include <tmmintrin.h>
+#endif
+
+#elif __PPC__
 
 #ifdef __LITTLE_ENDIAN__
 #define PROG_CPU "ppc64le"
@@ -105,17 +115,14 @@
 #error Big endian ppc64 CPUs not supported
 #endif
 
-#else
+#elif __aarch64__
 
-#define PROG_CPU "x86_64"
+#define PROG_CPU "aarch64"
+#include <arm_neon.h>
 
-#ifdef __SSE2__
-#include <emmintrin.h>
-#endif
+#else
 
-#ifdef __SSSE3__
-#include <tmmintrin.h>
-#endif
+#error Unknown architecture (not ppc64le, aarch64 or x86_64)
 
 #endif
 



View it on GitLab: https://salsa.debian.org/med-team/vsearch/compare/5345c3ee2741bbbf920f3de4509abd07f074be2b...03a0889a8328c78e2a9c45ec67db539e3bee1268
