[med-svn] [Git][med-team/bwa-mem2][master] 8 commits: New upstream version 2.2.1
Michael R. Crusoe (@crusoe)
gitlab@salsa.debian.org
Thu Feb 27 11:18:23 GMT 2025
Michael R. Crusoe pushed to branch master at Debian Med / bwa-mem2
Commits:
d40c35ba by Michael R. Crusoe at 2021-11-18T18:55:27+01:00
New upstream version 2.2.1
- - - - -
dde9762d by Michael R. Crusoe at 2021-11-18T18:55:27+01:00
routine-update: New upstream version
- - - - -
f66536d2 by Michael R. Crusoe at 2021-11-18T18:55:28+01:00
Update upstream source from tag 'upstream/2.2.1'
Update to upstream version '2.2.1'
with Debian dir ea51a5e5c08b1396f08d49919b9fb87cf60c5379
- - - - -
0dd1f6f2 by Michael R. Crusoe at 2021-11-18T18:55:29+01:00
routine-update: Standards-Version: 4.6.0
- - - - -
53f902e6 by Michael R. Crusoe at 2021-11-18T18:55:34+01:00
routine-update: Ready to upload to unstable
- - - - -
238a2ec0 by Michael R. Crusoe at 2025-02-27T10:56:32+01:00
cherry-pick all patches since 2.2.1 from upstream
- - - - -
0e4faf01 by Michael R. Crusoe at 2025-02-27T12:08:48+01:00
Add patch to support newer GCC
- - - - -
cabf1480 by Michael R. Crusoe at 2025-02-27T12:13:53+01:00
Update SIMDe patch
- - - - -
17 changed files:
- NEWS.md
- README.md
- debian/changelog
- debian/control
- + debian/patches/0001-Fix-typo.patch
- + debian/patches/0002-fixed-return-status-due-incorrect-input-files.patch
- + debian/patches/0003-fixed-smem-memory-rellocation-issue-issue167.patch
- + debian/patches/0004-Update-to-reflect-2.2.1-release-and-small-typo.patch
- + debian/patches/0005-GCC-11-now-defines-__rdtsc-so-definition-test-condit.patch
- + debian/patches/0006-fixed-segfault-due-to-memory-shortage-in-smem.patch
- + debian/patches/0007-Merge-pull-request-229-from-gh-jphan-fix_n_processed.patch
- + debian/patches/0010-newer-GCC-support.patch
- debian/patches/hardening
- debian/patches/series
- debian/patches/simde
- src/fastmap.cpp
- src/main.cpp
Changes:
=====================================
NEWS.md
=====================================
@@ -1,3 +1,32 @@
+Release 2.2.1 (17 March 2021)
+---------------------------------
+Hotfix for v2.2: Fixed the bug mentioned in #135.
+
+
+Release 2.2 (8 March 2021)
+---------------------------------
+Changes since the last release (2.1):
+
+* Passed the validation test on ~88 billions reads (Credits: Keiran Raine, CASM division, Sanger Institute)
+* Fixed bugs reported in #109 causing mismatch between bwa-mem and bwa-mem2
+* Fixed the issue (# 112) causing crash due to corrupted thread id
+* Using all the SSE flags to create optimized SSE41 and SSE42 binaries
+
+
+Release 2.1 (16 October 2020)
+---------------------------------
+Release 2.1 of BWA-MEM2.
+
+Changes since the last release (2.0):
+* *Smaller index*: the index size on disk is down by 8 times and in memory by 4 times due to moving to only one type of FM-index (2bit.64 instead of 2bit.64 and 8bit.32) and 8x compression of suffix array. For example, for human genome, index size on disk is down to ~10GB from ~80GB and memory footprint is down to ~10GB from ~40GB. There is a substantial decrease in index IO time due to the reduction and hardly any performance impact on read mapping.
+
+* Added support for 2 more execution modes: sse4.2 and avx.
+
+* Fixed multiple bugs including those reported in Issues #71, #80 and #85.
+
+* Merged multiple pull requests.
+
+
Release 2.0 (9 July 2020)
---------------------------------
This is the first production release of BWA-MEM2.
=====================================
README.md
=====================================
@@ -51,10 +51,10 @@ For general users, it is recommended to use the precompiled binaries from the
runs faster than gcc-compiled binaries. The precompiled binaries also
indirectly support CPU dispatch. The `bwa-mem2` binary can automatically choose
the most efficient implementation based on the SIMD instruction set available
-on the running machine. Precompiled binaries were generated on a CentOS6
+on the running machine. Precompiled binaries were generated on a CentOS7
machine using the following command line:
```sh
-make CXX=icpc
+make CXX=icpc multi
```
[bwa]: https://github.com/lh3/bwa
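The README hunk above documents two things: the 2.2.x precompiled bundle is built with `make CXX=icpc multi`, and the installed `bwa-mem2` front end picks the most capable implementation for the host CPU at run time. A generic sketch of that style of runtime SIMD dispatch using GCC's `__builtin_cpu_supports`; the feature checks and per-flavour binary names are illustrative and not taken from the actual bwa-mem2 launcher:

```cpp
#include <cstdio>

// Pick the most capable SIMD flavour available on the running CPU
// (generic illustration of CPU dispatch, not the bwa-mem2 launcher itself).
static const char *pick_simd_flavour(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512bw")) return "avx512bw";
    if (__builtin_cpu_supports("avx2"))     return "avx2";
    if (__builtin_cpu_supports("sse4.2"))   return "sse42";
    if (__builtin_cpu_supports("sse4.1"))   return "sse41";
    return "scalar";
}

int main(void)
{
    // A real launcher would exec the matching binary, e.g. "bwa-mem2.avx2".
    std::printf("would run: bwa-mem2.%s\n", pick_simd_flavour());
    return 0;
}
```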
=====================================
debian/changelog
=====================================
@@ -1,5 +1,7 @@
-bwa-mem2 (2.2-1) UNRELEASED; urgency=medium
+bwa-mem2 (2.2.1-1) unstable; urgency=medium
* Initial release. (Closes: #950607)
+ * New upstream version
+ * Standards-Version: 4.6.0 (routine-update)
- -- Michael R. Crusoe <crusoe@debian.org> Mon, 08 Mar 2021 15:51:37 +0100
+ -- Michael R. Crusoe <crusoe@debian.org> Thu, 18 Nov 2021 18:55:34 +0100
=====================================
debian/control
=====================================
@@ -8,7 +8,7 @@ Build-Depends: debhelper-compat (= 13),
zlib1g-dev,
libsimde-dev,
help2man
-Standards-Version: 4.5.1
+Standards-Version: 4.6.0
Vcs-Browser: https://salsa.debian.org/med-team/bwa-mem2
Vcs-Git: https://salsa.debian.org/med-team/bwa-mem2.git
Homepage: https://github.com/bwa-mem2/bwa-mem2
=====================================
debian/patches/0001-Fix-typo.patch
=====================================
@@ -0,0 +1,23 @@
+From: teepean <tnatkinn@gmail.com>
+Date: Wed, 17 Mar 2021 09:28:22 +0200
+Subject: Fix typo
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/a5df52284fff63266ff31310711e657a40479a4a
+Forwarded: not-needed
+---
+ src/kthread.cpp | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/src/kthread.cpp b/src/kthread.cpp
+index 4418b9c..44b7abf 100644
+--- a/src/kthread.cpp
++++ b/src/kthread.cpp
+@@ -56,7 +56,7 @@ static void *ktf_worker(void *data)
+ long i;
+ int tid = w->i;
+
+-#if AFF && (__liunx__)
++#if AFF && (__linux__)
+ fprintf(stderr, "i: %d, CPU: %d\n", tid , sched_getcpu());
+ #endif
+
=====================================
debian/patches/0002-fixed-return-status-due-incorrect-input-files.patch
=====================================
@@ -0,0 +1,67 @@
+From: Vasimuddin <wasim.mzr@gmail.com>
+Date: Wed, 30 Jun 2021 00:04:58 -0700
+Subject: fixed return status due incorrect input files
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/966cd1e0e7869c45ae7a93287612be6697406051
+Forwarded: not-needed
+---
+ src/fastmap.cpp | 4 ++--
+ src/main.cpp | 24 +++++++++++++-----------
+ 2 files changed, 15 insertions(+), 13 deletions(-)
+
+diff --git a/src/fastmap.cpp b/src/fastmap.cpp
+index 3a7e62e..bc245b9 100644
+--- a/src/fastmap.cpp
++++ b/src/fastmap.cpp
+@@ -896,7 +896,7 @@ int main_mem(int argc, char *argv[])
+ if (is_o)
+ fclose(aux.fp);
+ delete aux.fmi;
+- kclose(ko);
++ // kclose(ko);
+ return 1;
+ }
+ // fp = gzopen(argv[optind + 1], "r");
+@@ -924,7 +924,7 @@ int main_mem(int argc, char *argv[])
+ fclose(aux.fp);
+ delete aux.fmi;
+ kclose(ko);
+- kclose(ko2);
++ // kclose(ko2);
+ return 1;
+ }
+ // fp2 = gzopen(argv[optind + 2], "r");
+diff --git a/src/main.cpp b/src/main.cpp
+index 3a269e1..abee98d 100644
+--- a/src/main.cpp
++++ b/src/main.cpp
+@@ -111,16 +111,18 @@ int main(int argc, char* argv[])
+ fprintf(stderr, "ERROR: unknown command '%s'\n", argv[1]);
+ return 1;
+ }
+-
+- fprintf(stderr, "\nImportant parameter settings: \n");
+- fprintf(stderr, "\tBATCH_SIZE: %d\n", BATCH_SIZE);
+- fprintf(stderr, "\tMAX_SEQ_LEN_REF: %d\n", MAX_SEQ_LEN_REF);
+- fprintf(stderr, "\tMAX_SEQ_LEN_QER: %d\n", MAX_SEQ_LEN_QER);
+- fprintf(stderr, "\tMAX_SEQ_LEN8: %d\n", MAX_SEQ_LEN8);
+- fprintf(stderr, "\tSEEDS_PER_READ: %d\n", SEEDS_PER_READ);
+- fprintf(stderr, "\tSIMD_WIDTH8 X: %d\n", SIMD_WIDTH8);
+- fprintf(stderr, "\tSIMD_WIDTH16 X: %d\n", SIMD_WIDTH16);
+- fprintf(stderr, "\tAVG_SEEDS_PER_READ: %d\n", AVG_SEEDS_PER_READ);
++
++ if (ret == 0) {
++ fprintf(stderr, "\nImportant parameter settings: \n");
++ fprintf(stderr, "\tBATCH_SIZE: %d\n", BATCH_SIZE);
++ fprintf(stderr, "\tMAX_SEQ_LEN_REF: %d\n", MAX_SEQ_LEN_REF);
++ fprintf(stderr, "\tMAX_SEQ_LEN_QER: %d\n", MAX_SEQ_LEN_QER);
++ fprintf(stderr, "\tMAX_SEQ_LEN8: %d\n", MAX_SEQ_LEN8);
++ fprintf(stderr, "\tSEEDS_PER_READ: %d\n", SEEDS_PER_READ);
++ fprintf(stderr, "\tSIMD_WIDTH8 X: %d\n", SIMD_WIDTH8);
++ fprintf(stderr, "\tSIMD_WIDTH16 X: %d\n", SIMD_WIDTH16);
++ fprintf(stderr, "\tAVG_SEEDS_PER_READ: %d\n", AVG_SEEDS_PER_READ);
++ }
+
+- return 0;
++ return ret;
+ }
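The hunk above boils down to one pattern: keep the subcommand's return status, print the parameter summary only when it succeeded, and return that status instead of a hard-coded 0. A minimal, self-contained sketch of the pattern with illustrative names (not the actual bwa-mem2 dispatch code):

```cpp
#include <cstdio>

// Hypothetical subcommand: fails (non-zero) when its input file is missing.
static int run_mem(int argc, char *argv[]) {
    if (argc < 2) {
        std::fprintf(stderr, "ERROR: no input file given\n");
        return 1;
    }
    std::fprintf(stderr, "aligning %s\n", argv[1]);
    return 0;
}

int main(int argc, char *argv[]) {
    int ret = run_mem(argc, argv);      // keep the subcommand's status
    if (ret == 0) {                     // report settings only on success
        std::fprintf(stderr, "\nImportant parameter settings:\n");
        std::fprintf(stderr, "\tBATCH_SIZE: %d\n", 512);
    }
    return ret;                         // propagate, do not hard-code 0
}
```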
=====================================
debian/patches/0003-fixed-smem-memory-rellocation-issue-issue167.patch
=====================================
@@ -0,0 +1,55 @@
+From: Vasimuddin <wasim.mzr@gmail.com>
+Date: Sun, 12 Dec 2021 21:56:48 -0800
+Subject: fixed smem memory rellocation issue (issue167)
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/edc703f883e8aaed83067100d8e54e0e9e810ef5
+Forwarded: not-needed
+---
+ src/bwamem.cpp | 8 ++++----
+ src/macro.h | 2 +-
+ 2 files changed, 5 insertions(+), 5 deletions(-)
+
+diff --git a/src/bwamem.cpp b/src/bwamem.cpp
+index 4a18aff..76d4148 100644
+--- a/src/bwamem.cpp
++++ b/src/bwamem.cpp
+@@ -924,13 +924,13 @@ int mem_kernel1_core(FMI_search *fmi,
+ for (i = 0; i < len; ++i)
+ seq[i] = seq[i] < 4? seq[i] : nst_nt4_table[(int)seq[i]]; //nst_nt4??
+ }
+-
++ tot_len *= N_SMEM_KERNEL;
+ // This covers enc_qdb/SMEM reallocs
+ if (tot_len >= mmc->wsize_mem[tid])
+ {
+ fprintf(stderr, "[%0.4d] Re-allocating SMEM data structures due to enc_qdb\n", tid);
+ int64_t tmp = mmc->wsize_mem[tid];
+- mmc->wsize_mem[tid] *= 2;
++ mmc->wsize_mem[tid] = tot_len;
+ mmc->matchArray[tid] = (SMEM *) _mm_realloc(mmc->matchArray[tid],
+ tmp, mmc->wsize_mem[tid], sizeof(SMEM));
+ //realloc(mmc->matchArray[tid], mmc->wsize_mem[tid] * sizeof(SMEM));
+@@ -966,8 +966,8 @@ int mem_kernel1_core(FMI_search *fmi,
+ num_smem);
+
+ if (num_smem >= *wsize_mem){
+- fprintf(stderr, "num_smem: %ld\n", num_smem);
+- assert(num_smem < *wsize_mem);
++ fprintf(stderr, "Error [bug]: num_smem: %ld are more than allocated space.\n", num_smem);
++ exit(EXIT_FAILURE);
+ }
+ printf_(VER, "6. Done! mem_collect_smem, num_smem: %ld\n", num_smem);
+ tprof[MEM_COLLECT][tid] += __rdtsc() - tim;
+diff --git a/src/macro.h b/src/macro.h
+index d2da0cc..4adc980 100644
+--- a/src/macro.h
++++ b/src/macro.h
+@@ -48,7 +48,7 @@ Authors: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra at i
+ #define BATCH_SIZE 512 /* Block of reads alloacted to a thread for processing*/
+ #define BATCH_MUL 20
+ #define SEEDS_PER_CHAIN 1
+-
++#define N_SMEM_KERNEL 3
+ #define READ_LEN 151
+
+ #define SEQ_LEN8 128 // redundant??
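The bwamem.cpp hunk above changes the growth policy for the per-thread SMEM buffers from "double the current size" to "grow to at least the size this batch actually needs" (tot_len scaled by the new N_SMEM_KERNEL factor), and replaces the over-run assert with a fatal error. A minimal sketch of that grow-to-required-size pattern using plain realloc; the names are illustrative and stand in for the _mm_realloc-based code in bwamem.cpp:

```cpp
#include <cstdio>
#include <cstdlib>

// Grow a buffer to at least the size the current batch needs, instead of
// doubling blindly, so one oversized batch cannot outrun the allocation.
static int *ensure_capacity(int *buf, long *capacity, long needed)
{
    if (needed < *capacity) return buf;                  // still fits
    std::fprintf(stderr, "Re-allocating: %ld -> %ld elements\n", *capacity, needed);
    int *grown = (int *) std::realloc(buf, needed * sizeof(int));
    if (grown == NULL) {
        std::fprintf(stderr, "ERROR: out of memory\n");  // fail loudly, no assert
        std::exit(EXIT_FAILURE);
    }
    *capacity = needed;
    return grown;
}
```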
=====================================
debian/patches/0004-Update-to-reflect-2.2.1-release-and-small-typo.patch
=====================================
@@ -0,0 +1,63 @@
+From: Jason Stajich <jasonstajich.phd@gmail.com>
+Date: Mon, 6 Jun 2022 09:15:34 -0700
+Subject: Update to reflect 2.2.1 release and small typo
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/320f0c365a86e2576250f87399812c19710d21ad
+Forwarded: not-needed
+
+- Add direct links to 2.2.1 release
+- The example bwa-mem2 run is for the d2 dataset not d3 so indicate that with output file
+- Add DOI and link to publication for easier citing
+---
+ README.md | 14 +++++++-------
+ 1 file changed, 7 insertions(+), 7 deletions(-)
+
+diff --git a/README.md b/README.md
+index 5b2bbf6..21d30ac 100644
+--- a/README.md
++++ b/README.md
+@@ -15,10 +15,10 @@
+ ## Getting Started
+ ```sh
+ # Use precompiled binaries (recommended)
+-curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre2/bwa-mem2-2.0pre2_x64-linux.tar.bz2 \
++curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.2.1/bwa-mem2-2.2.1_x64-linux.tar.bz2 \
+ | tar jxf -
+-bwa-mem2-2.0pre2_x64-linux/bwa-mem2 index ref.fa
+-bwa-mem2-2.0pre2_x64-linux/bwa-mem2 mem ref.fa read1.fq read2.fq > out.sam
++bwa-mem2-2.2.1_x64-linux/bwa-mem2 index ref.fa
++bwa-mem2-2.2.1_x64-linux/bwa-mem2 mem ref.fa read1.fq read2.fq > out.sam
+
+ # Compile from source (not recommended for general users)
+ # Get the source
+@@ -36,13 +36,13 @@ make
+
+ ## Introduction
+
+-Bwa-mem2 is the next version of the bwa-mem algorithm in [bwa][bwa]. It
++The tool bwa-mem2 is the next version of the bwa-mem algorithm in [bwa][bwa]. It
+ produces alignment identical to bwa and is ~1.3-3.1x faster depending on the use-case, dataset and the running machine.
+
+ The original bwa was developed by Heng Li (@lh3). Performance enhancement in
+ bwa-mem2 was primarily done by Vasimuddin Md (@yuk12) and Sanchit Misra (@sanchit-misra)
+ from Parallel Computing Lab, Intel.
+-Bwa-mem2 is distributed under the MIT license.
++bwa-mem2 is distributed under the MIT license.
+
+ ## Installation
+
+@@ -118,7 +118,7 @@ or ```make``` (using gcc compiler)
+ For example, in our double socket (56 threads each) and double numa compute node, we used the following command line to align D2 to human_g1k_v37.fasta reference genome.
+ ```
+ numactl -m 0 -C 0-27,56-83 ./bwa-mem2 index human_g1k_v37.fasta
+-numactl -m 0 -C 0-27,56-83 ./bwa-mem2 mem -t 56 human_g1k_v37.fasta SRR7733443_1.fastq SRR7733443_2.fastq > d3_align.sam
++numactl -m 0 -C 0-27,56-83 ./bwa-mem2 mem -t 56 human_g1k_v37.fasta SRR7733443_1.fastq SRR7733443_2.fastq > d2_align.sam
+ ```
+
+ <p align="center">
+@@ -144,4 +144,4 @@ The following are the highlights of the ert based bwa-mem2 tool:
+
+ Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru.
+ <b> Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. </b>
+-<i> IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. </i>
++<i> IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. [10.1109/IPDPS.2019.00041](https://doi.org/10.1109/IPDPS.2019.00041) </i>
=====================================
debian/patches/0005-GCC-11-now-defines-__rdtsc-so-definition-test-condit.patch
=====================================
@@ -0,0 +1,24 @@
+From: "Martin O. Pollard" <mp15 at sanger.ac.uk>
+Date: Mon, 13 Jun 2022 11:04:32 +0000
+Subject: GCC 11 now defines __rdtsc so definition test conditional on GCC
+ version
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/d48e1dec3bc305124e0e2764a8824af29084dd66
+Forwarded: not-needed
+---
+ src/utils.h | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/src/utils.h b/src/utils.h
+index 54a062a..fbf8439 100644
+--- a/src/utils.h
++++ b/src/utils.h
+@@ -48,7 +48,7 @@
+
+ #define xassert(cond, msg) if ((cond) == 0) _err_fatal_simple_core(__func__, msg)
+
+-#if defined(__GNUC__) && !defined(__clang__)
++#if defined(__GNUC__) && __GNUC__ < 11 && !defined(__clang__)
+ #if defined(__i386__)
+ static inline unsigned long long __rdtsc(void)
+ {
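The guard above only compiles the local `__rdtsc` fallback when the toolchain does not already provide the intrinsic (GCC 11 started shipping one, so a second definition breaks the build). A sketch of the same pattern; the fallback body here is a generic rdtsc inline-asm version, assumed rather than copied from bwa-mem2's utils.h:

```cpp
// Define a fallback __rdtsc only for older GCC that lacks the intrinsic;
// GCC >= 11 and Clang already expose __rdtsc via their x86 intrinsics headers.
#if defined(__GNUC__) && __GNUC__ < 11 && !defined(__clang__)
#if defined(__i386__) || defined(__x86_64__)
static inline unsigned long long __rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  // time-stamp counter in edx:eax
    return ((unsigned long long)hi << 32) | lo;
}
#endif
#endif
```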
=====================================
debian/patches/0006-fixed-segfault-due-to-memory-shortage-in-smem.patch
=====================================
@@ -0,0 +1,2067 @@
+From: Vasimuddin <wasim.mzr@gmail.com>
+Date: Mon, 18 Dec 2023 06:25:34 -0800
+Subject: fixed segfault due to memory shortage in smem
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/7f3a4db2e5e419b0802902cd83ab120362db657b
+Forwarded: not-needed
+---
+ src/bwamem.cpp | 547 ++++++++++++++++++++++++++++++++------------------------
+ src/bwamem.h | 2 +
+ src/fastmap.cpp | 144 +++++++--------
+ 3 files changed, 387 insertions(+), 306 deletions(-)
+ mode change 100644 => 100755 src/bwamem.cpp
+ mode change 100644 => 100755 src/bwamem.h
+
+diff --git a/src/bwamem.cpp b/src/bwamem.cpp
+old mode 100644
+new mode 100755
+index 76d4148..cfaa6ec
+--- a/src/bwamem.cpp
++++ b/src/bwamem.cpp
+@@ -39,17 +39,17 @@ extern uint64_t tprof[LIM_R][LIM_C];
+
+ #define chain_cmp(a, b) (((b).pos < (a).pos) - ((a).pos < (b).pos))
+ KBTREE_INIT(chn, mem_chain_t, chain_cmp)
+-
++
+ #define intv_lt(a, b) ((a).info < (b).info)
+ KSORT_INIT(mem_intv, bwtintv_t, intv_lt)
+ #define intv_lt1(a, b) ((((uint64_t)(a).m) <<32 | ((uint64_t)(a).n)) < (((uint64_t)(b).m) <<32 | ((uint64_t)(b).n))) // trial
+ KSORT_INIT(mem_intv1, SMEM, intv_lt1) // debug
+-
++
+ #define max_(x, y) ((x)>(y)?(x):(y))
+ #define min_(x, y) ((x)>(y)?(y):(x))
+-
++
+ #define MAX_BAND_TRY 2
+-
++
+ int tcnt = 0;
+ /********************
+ * Filtering chains *
+@@ -180,7 +180,7 @@ int mem_patch_reg(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac,
+ double r;
+ if (bns == 0 || pac == 0 || query == 0) return 0;
+ assert(a->rid == b->rid && a->rb <= b->rb);
+-
++
+ if (a->rb < bns->l_pac && b->rb >= bns->l_pac) return 0; // on different strands
+ if (a->qb >= b->qb || a->qe >= b->qe || a->re >= b->re) return 0; // not colinear
+ w = (a->re - b->rb) - (a->qe - b->qb); // required bandwidth
+@@ -193,32 +193,32 @@ int mem_patch_reg(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac,
+ "[%d,%d)<=>[%ld,%ld), @ %s; w=%d, r=%.4g\n",
+ a->qb, a->qe, (long)a->rb, (long)a->re, b->qb, b->qe,
+ (long)b->rb, (long)b->re, bns->anns[a->rid].name, w, r);
+-
++
+ if (a->re < b->rb || a->qe < b->qb) // no overlap on query or on ref
+ {
+ if (w > opt->w<<1 || r >= PATCH_MAX_R_BW) return 0; // the bandwidth or the relative bandwidth is too large
+ } else if (w > opt->w<<2 || r >= PATCH_MAX_R_BW*2) return 0; // more permissive if overlapping on both ref and query
+-
++
+ // global alignment
+ w += a->w + b->w;
+ w = w < opt->w<<2? w : opt->w<<2;
+
+ if (bwa_verbose >= 4)
+ fprintf(stderr, "* test potential hit merge with global alignment; w=%d\n", w);
+-
++
+ bwa_gen_cigar2(opt->mat, opt->o_del, opt->e_del, opt->o_ins, opt->e_ins, w,
+ bns->l_pac, pac, b->qe - a->qb, query + a->qb, a->rb, b->re,
+ &score, 0, 0);
+-
++
+ q_s = (int)((double)(b->qe - a->qb) / ((b->qe - b->qb) + (a->qe - a->qb)) *
+ (b->score + a->score) + .499); // predicted score from query
+-
++
+ r_s = (int)((double)(b->re - a->rb) / ((b->re - b->rb) + (a->re - a->rb)) *
+ (b->score + a->score) + .499); // predicted score from ref
+-
++
+ if (bwa_verbose >= 4)
+ fprintf(stderr, "* score=%d;(%d,%d)\n", score, q_s, r_s);
+-
++
+ if ((double)score / (q_s > r_s? q_s : r_s) < PATCH_MIN_SC_RATIO) return 0;
+ *_w = w;
+ return score;
+@@ -296,14 +296,14 @@ int mem_sort_dedup_patch(const mem_opt_t *opt, const bntseq_t *bns,
+ int m, i, j;
+ if (n <= 1) return n;
+ ks_introsort(mem_ars2, n, a); // sort by the END position, not START!
+-
++
+ for (i = 0; i < n; ++i) a[i].n_comp = 1;
+ for (i = 1; i < n; ++i)
+ {
+ mem_alnreg_t *p = &a[i];
+ if (p->rid != a[i-1].rid || p->rb >= a[i-1].re + opt->max_chain_gap)
+ continue; // then no need to go into the loop below
+-
++
+ for (j = i - 1; j >= 0 && p->rid == a[j].rid && p->rb < a[j].re + opt->max_chain_gap; --j) {
+ mem_alnreg_t *q = &a[j];
+ int64_t or_, oq, mr, mq;
+@@ -366,10 +366,10 @@ static int test_and_merge(const mem_opt_t *opt, int64_t l_pac, mem_chain_t *c,
+ if (p->qbeg >= c->seeds[0].qbeg && p->qbeg + p->len <= qend &&
+ p->rbeg >= c->seeds[0].rbeg && p->rbeg + p->len <= rend)
+ return 1; // contained seed; do nothing
+-
++
+ if ((last->rbeg < l_pac || c->seeds[0].rbeg < l_pac) &&
+ p->rbeg >= l_pac) return 0; // don't chain if on different strand
+-
++
+ x = p->qbeg - last->qbeg; // always non-negtive
+ y = p->rbeg - last->rbeg;
+ if (y >= 0 && x - y <= opt->w && y - x <= opt->w &&
+@@ -378,7 +378,7 @@ static int test_and_merge(const mem_opt_t *opt, int64_t l_pac, mem_chain_t *c,
+ if (c->n == c->m)
+ {
+ mem_seed_t *auxSeedBuf = NULL;
+- int pm = c->m;
++ int pm = c->m;
+ c->m <<= 1;
+ if (pm == SEEDS_PER_CHAIN) { // re-new memory
+ if ((auxSeedBuf = (mem_seed_t *) calloc(c->m, sizeof(mem_seed_t))) == NULL) { fprintf(stderr, "ERROR: out of memory auxSeedBuf\n"); exit(1); }
+@@ -390,7 +390,7 @@ static int test_and_merge(const mem_opt_t *opt, int64_t l_pac, mem_chain_t *c,
+ if ((auxSeedBuf = (mem_seed_t *) realloc(c->seeds, c->m * sizeof(mem_seed_t))) == NULL) { fprintf(stderr, "ERROR: out of memory auxSeedBuf\n"); exit(1); }
+ c->seeds = auxSeedBuf;
+ }
+- memset((char*) (c->seeds + c->n), 0, (c->m - c->n) * sizeof(mem_seed_t));
++ memset((char*) (c->seeds + c->n), 0, (c->m - c->n) * sizeof(mem_seed_t));
+ }
+ c->seeds[c->n++] = *p;
+ return 1;
+@@ -474,7 +474,7 @@ void mem_flt_chained_seeds(const mem_opt_t *opt, const bntseq_t *bns, const uint
+ {
+ //double min_l = opt->min_chain_weight?
+ // MEM_HSP_COEF * opt->min_chain_weight : MEM_MINSC_COEF * log(l_query);
+-
++
+ int i, j, k;// min_HSP_score = (int)(opt->a * min_l + .499);
+ //if (min_l > MEM_SEEDSW_COEF * l_query) return; // don't run the following for short reads
+
+@@ -488,7 +488,7 @@ void mem_flt_chained_seeds(const mem_opt_t *opt, const bntseq_t *bns, const uint
+ MEM_HSP_COEF * opt->min_chain_weight : MEM_MINSC_COEF * log(l_query);
+ int min_HSP_score = (int)(opt->a * min_l + .499);
+ if (min_l > MEM_SEEDSW_COEF * l_query) continue;
+-
++
+ for (j = k = 0; j < c->n; ++j)
+ {
+ mem_seed_t *s = &c->seeds[j];
+@@ -601,7 +601,7 @@ int mem_chain_flt(const mem_opt_t *opt, int n_chn_, mem_chain_t *a_, int tid)
+
+ for (; i < n_chn; ++i)
+ if (a[i].kept < 3) a[i].kept = 0;
+-
++
+ for (i = k = 0; i < n_chn; ++i) // free discarded chains
+ {
+ mem_chain_t *c = &a[i];
+@@ -631,23 +631,39 @@ SMEM *mem_collect_smem(FMI_search *fmi, const mem_opt_t *opt,
+ int16_t *query_pos_ar,
+ uint8_t *enc_qdb,
+ int32_t *rid,
+- int64_t &tot_smem)
++ mem_cache *mmc,
++ int64_t &tot_smem,
++ int tid)
+ {
+ int64_t pos = 0;
+ int split_len = (int)(opt->min_seed_len * opt->split_factor + .499);
+ int64_t num_smem1 = 0, num_smem2 = 0, num_smem3 = 0;
+ int max_readlength = -1;
+-
++
+ int32_t *query_cum_len_ar = (int32_t *)_mm_malloc(nseq * sizeof(int32_t), 64);
+-
+- int offset = 0;
++
++ // New addition: checks
++ int64_t tot_seq_len = 0;
++ for (int l=0; l<nseq; l++)
++ tot_seq_len += seq_[l].l_seq;
++
++ if (tot_seq_len > mmc->wsize_mem_s[tid]) {
++ fprintf(stderr, "[%0.4d] REA Re-allocating enc_qdb only\n", tid);
++ mmc->wsize_mem_s[tid] = tot_seq_len + 1024;
++ mmc->enc_qdb[tid] = (uint8_t *) realloc(mmc->enc_qdb[tid],
++ mmc->wsize_mem_s[tid] * sizeof(uint8_t));
++ enc_qdb = mmc->enc_qdb[tid];
++ }
++
++ int offset = 0, max_read_len = 0;
+ for (int l=0; l<nseq; l++)
+ {
+ min_intv_ar[l] = 1;
+- for (int j=0; j<seq_[l].l_seq; j++)
++ for (int j=0; j<seq_[l].l_seq; j++) {
+ enc_qdb[offset + j] = seq_[l].seq[j];
+-
+- offset += seq_[l].l_seq;
++ // max_read_len = std::max(max_read_len, seq_[l].seq[j]);
++ }
++ offset += seq_[l].l_seq;
+ rid[l] = l;
+ }
+
+@@ -662,27 +678,68 @@ SMEM *mem_collect_smem(FMI_search *fmi, const mem_opt_t *opt,
+ fmi->getSMEMsAllPosOneThread(enc_qdb, min_intv_ar, rid, nseq, nseq,
+ seq_, query_cum_len_ar, max_readlength, opt->min_seed_len,
+ matchArray, &num_smem1);
++ // New addition: checks
++ if (num_smem1 > mmc->wsize_mem_r[tid]) {
++ fprintf(stderr, "[%0.4d] REA Re-allocating min_intv, query_pos, & rid\n", tid);
++ mmc->wsize_mem_r[tid] = num_smem1 + 1024;
++ mmc->min_intv_ar[tid] = (int32_t *) realloc(mmc->min_intv_ar[tid],
++ mmc->wsize_mem_r[tid] * sizeof(int32_t));
++ mmc->query_pos_ar[tid] = (int16_t *) realloc(mmc->query_pos_ar[tid],
++ mmc->wsize_mem_r[tid] * sizeof(int16_t));
++ mmc->rid[tid] = (int32_t *) realloc(mmc->rid[tid],
++ mmc->wsize_mem_r[tid] * sizeof(int32_t));
++ }
+
+-
++ // fprintf(stderr, "num_smem1: %d\n", num_smem1);
++ int64_t mem_lim = 0;
+ for (int64_t i=0; i<num_smem1; i++)
+ {
+ SMEM *p = &matchArray[i];
+ int start = p->m, end = p->n +1;
+ if (end - start < split_len || p->s > opt->split_width)
+ continue;
+-
++
+ int len = seq_[p->rid].l_seq;
+
+ rid[pos] = p->rid;
+-
++
+ query_pos_ar[pos] = (end + start)>>1;
+
+ assert(query_pos_ar[pos] < len);
+
+ min_intv_ar[pos] = p->s + 1;
+ pos ++;
++ mem_lim += seq_[p->rid].l_seq;
+ }
++ #if 1
++ // fprintf(stderr, "num_smem2: %d, lim: %d\n", num_smem2, mem_lim + num_smem1 + offset);
++ { // experimental, pcl
++ // int64_t avg_read_len = static_cast<int64_t>(offset*1.0/nseq);
++ // if (mmc->wsize_mem[tid] < mem_lim + num_smem1 + offset) {
++ if (mmc->wsize_mem[tid] < mem_lim + num_smem1) {
++ fprintf(stderr, "[%0.4d] REA Re-allocating2 SMEM data structures due to enc_qdb\n", tid);
++ int64_t tmp = mmc->wsize_mem[tid];
++ mmc->wsize_mem[tid] = mem_lim + num_smem1 + offset;
+
++ SMEM* ptr1 = mmc->matchArray[tid];
++ // ptr2 = w.mmc.min_intv_ar;
++ // ptr3 =
++ fprintf(stderr, "Realloc size from: %d to :%d\n", tmp, mmc->wsize_mem[tid]);
++ mmc->matchArray[tid] = (SMEM *) _mm_malloc(mmc->wsize_mem[tid] * sizeof(SMEM), 64);
++ assert(mmc->matchArray[tid] != NULL);
++ matchArray = mmc->matchArray[tid];
++ // w.mmc.min_intv_ar[l] = (int32_t *) malloc(w.mmc.wsize_mem[l] * sizeof(int32_t));
++ // w.mmc.query_pos_ar[tid] = (int16_t *) malloc(w.mmc.wsize_mem[l] * sizeof(int16_t));
++ // w.mmc.enc_qdb[l] = (uint8_t *) malloc(w.mmc.wsize_mem[l] * sizeof(uint8_t));
++ // w.mmc.rid[l] = (int32_t *) malloc(w.mmc.wsize_mem[l] * sizeof(int32_t));
++ //w.mmc.lim[l] = (int32_t *) _mm_malloc((BATCH_SIZE + 32) * sizeof(int32_t), 64);
++ for (int i=0; i<num_smem1; i++) {
++ mmc->matchArray[tid][i] = ptr1[i];
++ }
++ _mm_free(ptr1);
++ }
++ }
++ #endif
+ fmi->getSMEMsOnePosOneThread(enc_qdb,
+ query_pos_ar,
+ min_intv_ar,
+@@ -695,19 +752,33 @@ SMEM *mem_collect_smem(FMI_search *fmi, const mem_opt_t *opt,
+ opt->min_seed_len,
+ matchArray + num_smem1,
+ &num_smem2);
+-
++ // fprintwsize_mem_rf(stderr, "num_smem2: %d\n", num_smem2);
+ if (opt->max_mem_intv > 0)
+ {
++ if (mmc->wsize_mem[tid] < num_smem2 + offset*1.0/(opt->min_seed_len + 1) + num_smem1) {
++ int64_t tmp = mmc->wsize_mem[tid];
++ mmc->wsize_mem[tid] = num_smem2 + offset*1.0/(opt->min_seed_len + 1) + num_smem1;
++ SMEM* ptr1 = mmc->matchArray[tid];
++ fprintf(stderr, "Realloc2 size from: %d to :%d\n", tmp, mmc->wsize_mem[tid]);
++ mmc->matchArray[tid] = (SMEM *) _mm_malloc(mmc->wsize_mem[tid] * sizeof(SMEM), 64);
++ assert(mmc->matchArray[tid] != NULL);
++ matchArray = mmc->matchArray[tid];
++ for (int i=0; i<num_smem1; i++) {
++ mmc->matchArray[tid][i] = ptr1[i];
++ }
++ _mm_free(ptr1);
++ }
++
+ for (int l=0; l<nseq; l++)
+ min_intv_ar[l] = opt->max_mem_intv;
+
+ num_smem3 = fmi->bwtSeedStrategyAllPosOneThread(enc_qdb, min_intv_ar,
+- nseq, seq_, query_cum_len_ar,
++ nseq, seq_, query_cum_len_ar,
+ opt->min_seed_len + 1,
+- matchArray + num_smem1 + num_smem2);
++ matchArray + num_smem1 + num_smem2);
+ }
+ tot_smem = num_smem1 + num_smem2 + num_smem3;
+-
++ // fprintf(stderr, "num_smems: %d %d %d, %d\n", num_smem1, num_smem2, num_smem3, tot_smem);
+ fmi->sortSMEMs(matchArray, &tot_smem, nseq, seq_[0].l_seq, 1); // seq_[0].l_seq - only used for blocking when using nthreads
+
+ pos = 0;
+@@ -718,7 +789,7 @@ SMEM *mem_collect_smem(FMI_search *fmi, const mem_opt_t *opt,
+ pos++;
+ } while (pos < tot_smem - 1 && matchArray[pos].rid == matchArray[pos + 1].rid);
+ int64_t n = pos + 1 - smem_ptr;
+-
++
+ if (n > 0)
+ ks_introsort(mem_intv1, n, &matchArray[smem_ptr]);
+ smem_ptr = pos + 1;
+@@ -744,19 +815,19 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ int64_t i, pos = 0;
+ int64_t smem_ptr = 0;
+ int64_t l_pac = bns->l_pac;
+-
++
+ int num[nseq];
+ memset(num, 0, nseq*sizeof(int));
+ int smem_buf_size = 6000;
+ int64_t *sa_coord = (int64_t *) _mm_malloc(sizeof(int64_t) * opt->max_occ * smem_buf_size, 64);
+ int64_t seedBufCount = 0;
+-
++
+ for (int l=0; l<nseq; l++)
+ kv_init(chain_ar[l]);
+-
++
+ // filter seq at early stage than this!, shifted to collect!!!
+ // if (len < opt->min_seed_len) return chain; // if the query is shorter than the seed length, no match
+-
++
+ uint64_t tim = __rdtsc();
+ for (int l=0; l<nseq && pos < num_smem - 1; l++)
+ {
+@@ -771,7 +842,7 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ tree = kb_init(chn, KB_DEFAULT_SIZE + 8); // +8, due to addition of counters in chain
+ mem_chain_v *chain = &chain_ar[l];
+ size = 0;
+-
++
+ b = e = l_rep = 0;
+ pos = smem_ptr - 1;
+ //for (i = 0, b = e = l_rep = 0; i < aux_->mem.n; ++i) // compute frac_rep
+@@ -803,7 +874,7 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ pos - smem_ptr + 1, opt->max_occ, tid, id); // sa compressed prefetch
+ tprof[MEM_SA][tid] += __rdtsc() - tim;
+ #endif
+-
++
+ for (i = smem_ptr; i <= pos; i++)
+ {
+ SMEM *p = &matchArray[i];
+@@ -818,8 +889,8 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ fmi->get_sa_entries(p, sa_coord, &cnt, 1, opt->max_occ);
+ tprof[MEM_SA][tid] += __rdtsc() - tim;
+ #endif
+-
+- cnt = 0;
++
++ cnt = 0;
+ for (k = count = 0; k < p->s && count < opt->max_occ; k += step, ++count)
+ {
+ mem_chain_t tmp, *lower, *upper;
+@@ -831,18 +902,18 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ #else
+ s.rbeg = tmp.pos = sa_coord[cnt++];
+ #endif
+-
++
+ s.qbeg = p->m;
+ s.score= s.len = slen;
+- if (s.rbeg < 0 || s.len < 0)
++ if (s.rbeg < 0 || s.len < 0)
+ fprintf(stderr, "rbeg: %ld, slen: %d, cnt: %d, n: %d, m: %d, num_smem: %ld\n",
+ s.rbeg, s.len, cnt-1, p->n, p->m, num_smem);
+-
++
+ rid = bns_intv2rid(bns, s.rbeg, s.rbeg + s.len);
+ // bridging multiple reference sequences or the
+ // forward-reverse boundary; TODO: split the seed;
+ // don't discard it!!!
+- if (rid < 0) continue;
++ if (rid < 0) continue;
+ if (kb_size(tree))
+ {
+ kb_intervalp(chn, tree, &tmp, &lower, &upper); // find the closest chain
+@@ -878,10 +949,10 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ }
+ } // seeds
+
+- smem_ptr = pos + 1;
++ smem_ptr = pos + 1;
+ size = kb_size(tree);
+ // tprof[PE21][0] += kb_size(tree) * sizeof(mem_chain_t);
+-
++
+ kv_resize(mem_chain_t, *chain, size);
+
+ #define traverse_func(p_) (chain->a[chain->n++] = *(p_))
+@@ -891,8 +962,8 @@ void mem_chain_seeds(FMI_search *fmi, const mem_opt_t *opt,
+ for (i = 0; i < chain->n; ++i)
+ chain->a[i].frac_rep = (float)l_rep / seq_[l].l_seq;
+
+- kb_destroy(chn, tree);
+-
++ kb_destroy(chn, tree);
++
+ } // iterations over input reads
+ tprof[MEM_SA_BLOCK][tid] += __rdtsc() - tim;
+
+@@ -908,11 +979,11 @@ int mem_kernel1_core(FMI_search *fmi,
+ int64_t seedBufSize,
+ mem_cache *mmc,
+ int tid)
+-{
++{
+ int i;
+ int64_t num_smem = 0, tot_len = 0;
+ mem_chain_v *chn;
+-
++
+ uint64_t tim;
+ /* convert to 2-bit encoding if we have not done so */
+ for (int l=0; l<nseq; l++)
+@@ -920,17 +991,20 @@ int mem_kernel1_core(FMI_search *fmi,
+ char *seq = seq_[l].seq;
+ int len = seq_[l].l_seq;
+ tot_len += len;
+-
++
+ for (i = 0; i < len; ++i)
+- seq[i] = seq[i] < 4? seq[i] : nst_nt4_table[(int)seq[i]]; //nst_nt4??
++ seq[i] = seq[i] < 4? seq[i] : nst_nt4_table[(int)seq[i]]; //nst_nt4??
+ }
+ tot_len *= N_SMEM_KERNEL;
++ // fprintf(stderr, "wsize: %d, tot_len: %d\n", mmc->wsize_mem[tid], tot_len);
+ // This covers enc_qdb/SMEM reallocs
+ if (tot_len >= mmc->wsize_mem[tid])
+ {
+ fprintf(stderr, "[%0.4d] Re-allocating SMEM data structures due to enc_qdb\n", tid);
+ int64_t tmp = mmc->wsize_mem[tid];
+ mmc->wsize_mem[tid] = tot_len;
++ mmc->wsize_mem_s[tid] = tot_len;
++ mmc->wsize_mem_r[tid] = tot_len;
+ mmc->matchArray[tid] = (SMEM *) _mm_realloc(mmc->matchArray[tid],
+ tmp, mmc->wsize_mem[tid], sizeof(SMEM));
+ //realloc(mmc->matchArray[tid], mmc->wsize_mem[tid] * sizeof(SMEM));
+@@ -951,26 +1025,29 @@ int mem_kernel1_core(FMI_search *fmi,
+ uint8_t *enc_qdb = mmc->enc_qdb[tid];
+ int32_t *rid = mmc->rid[tid];
+ int64_t *wsize_mem = &mmc->wsize_mem[tid];
+-
+- tim = __rdtsc();
++
++ tim = __rdtsc();
+ /********************** Kernel 1: FM+SMEMs *************************/
+ printf_(VER, "6. Calling mem_collect_smem.., tid: %d\n", tid);
+- mem_collect_smem(fmi, opt,
+- seq_,
+- nseq,
+- matchArray,
+- min_intv_ar,
+- query_pos_ar,
+- enc_qdb,
+- rid,
+- num_smem);
++ matchArray = mem_collect_smem(fmi, opt,
++ seq_,
++ nseq,
++ matchArray,
++ min_intv_ar,
++ query_pos_ar,
++ enc_qdb,
++ rid,
++ mmc,
++ num_smem,
++ tid);
+
+ if (num_smem >= *wsize_mem){
+- fprintf(stderr, "Error [bug]: num_smem: %ld are more than allocated space.\n", num_smem);
++ fprintf(stderr, "Error [bug]: num_smem: %ld are more than allocated space %ld.\n",
++ num_smem, *wsize_mem);
+ exit(EXIT_FAILURE);
+ }
+ printf_(VER, "6. Done! mem_collect_smem, num_smem: %ld\n", num_smem);
+- tprof[MEM_COLLECT][tid] += __rdtsc() - tim;
++ tprof[MEM_COLLECT][tid] += __rdtsc() - tim;
+
+
+ /********************* Kernel 1.1: SA2REF **********************/
+@@ -982,7 +1059,7 @@ int mem_kernel1_core(FMI_search *fmi,
+ seedBufSize,
+ matchArray,
+ num_smem);
+-
++
+ printf_(VER, "5. Done mem_chain..\n");
+ // tprof[MEM_CHAIN][tid] += __rdtsc() - tim;
+
+@@ -997,7 +1074,7 @@ int mem_kernel1_core(FMI_search *fmi,
+ printf_(VER, "7. Done mem_chain_flt..\n");
+ // tprof[MEM_ALN_M1][tid] += __rdtsc() - tim;
+
+-
++
+ printf_(VER, "8. Calling mem_flt_chained_seeds..\n");
+ for (int l=0; l<nseq; l++) {
+ chn = &chain_ar[l];
+@@ -1017,7 +1094,7 @@ int mem_kernel2_core(FMI_search *fmi,
+ int nseq,
+ mem_chain_v *chain_ar,
+ mem_cache *mmc,
+- uint8_t *ref_string,
++ uint8_t *ref_string,
+ int tid)
+ {
+ int i;
+@@ -1057,7 +1134,7 @@ int mem_kernel2_core(FMI_search *fmi,
+ }
+ free(chain_ar[l].a);
+ }
+-
++
+ int m = 0;
+ for (int l=0; l<nseq; l++)
+ {
+@@ -1071,7 +1148,7 @@ int mem_kernel2_core(FMI_search *fmi,
+ regs[l].n = m;
+ }
+
+- for (int l=0; l<nseq; l++) {
++ for (int l=0; l<nseq; l++) {
+ regs[l].n = mem_sort_dedup_patch(opt, fmi->idx->bns,
+ fmi->idx->pac,
+ (uint8_t*) seq_[l].seq,
+@@ -1088,16 +1165,16 @@ int mem_kernel2_core(FMI_search *fmi,
+ }
+ }
+ // tprof[POST_SWA][tid] += __rdtsc() - tim;
+-
++
+ return 1;
+ }
+
+ static void worker_aln(void *data, int seq_id, int batch_size, int tid)
+ {
+ worker_t *w = (worker_t*) data;
+-
+- printf_(VER, "11. Calling mem_kernel2_core..\n");
+- mem_kernel2_core(w->fmi, w->opt,
++
++ printf_(VER, "11. Calling mem_kernel2_core..\n");
++ mem_kernel2_core(w->fmi, w->opt,
+ w->seqs + seq_id,
+ w->regs + seq_id,
+ batch_size,
+@@ -1116,7 +1193,7 @@ static void worker_bwt(void *data, int seq_id, int batch_size, int tid)
+ printf_(VER, "4. Calling mem_kernel1_core..%d %d\n", seq_id, tid);
+ int seedBufSz = w->seedBufSize;
+
+- int memSize = w->nreads;
++ int memSize = w->nreads;
+ if (batch_size < BATCH_SIZE) {
+ seedBufSz = (memSize - seq_id) * AVG_SEEDS_PER_READ;
+ // fprintf(stderr, "[%0.4d] Info: adjusted seedBufSz %d\n", tid, seedBufSz);
+@@ -1154,7 +1231,7 @@ int64_t sort_classify(mem_cache *mmc, int64_t pcnt, int tid)
+ }
+ }
+ assert(pos8 + pos16 == pcnt);
+-
++
+ for (int i=pos8; i<pcnt; i++) {
+ seqPairArray[i] = seqPairArrayAux[i-pos8];
+ }
+@@ -1165,15 +1242,15 @@ int64_t sort_classify(mem_cache *mmc, int64_t pcnt, int tid)
+ static void worker_sam(void *data, int seqid, int batch_size, int tid)
+ {
+ worker_t *w = (worker_t*) data;
+-
++
+ if (w->opt->flag & MEM_F_PE)
+ {
+ int64_t pcnt = 0;
+ int start = seqid;
+ int end = seqid + batch_size;
+ int pos = start >> 1;
+-
+-#if (((!__AVX512BW__) && (!__AVX2__)) || ((!__AVX512BW__) && (__AVX2__)))
++
++#if (((!__AVX512BW__) && (!__AVX2__)) || ((!__AVX512BW__) && (__AVX2__)))
+ for (int i=start; i< end; i+=2)
+ {
+ // orig mem_sam_pe() function
+@@ -1182,7 +1259,7 @@ static void worker_sam(void *data, int seqid, int batch_size, int tid)
+ (w->n_processed >> 1) + pos++, // check!
+ &w->seqs[i],
+ &w->regs[i]);
+-
++
+ free(w->regs[i].a);
+ free(w->regs[i+1].a);
+ }
+@@ -1198,13 +1275,13 @@ static void worker_sam(void *data, int seqid, int batch_size, int tid)
+ (w->n_processed >> 1) + pos++, // check!
+ &w->seqs[i],
+ &w->regs[i],
+- &w->mmc,
++ &w->mmc,
+ pcnt, gcnt,
+ maxRefLen,
+ maxQerLen,
+ tid);
+ }
+-
++
+ // tprof[SAM1][tid] += __rdtsc() - tim;
+ int64_t pcnt8 = sort_classify(&w->mmc, pcnt, tid);
+
+@@ -1212,7 +1289,7 @@ static void worker_sam(void *data, int seqid, int batch_size, int tid)
+ assert(aln != NULL);
+
+ // processing
+- mem_sam_pe_batch(w->opt, &w->mmc, pcnt, pcnt8, aln, maxRefLen, maxQerLen, tid);
++ mem_sam_pe_batch(w->opt, &w->mmc, pcnt, pcnt8, aln, maxRefLen, maxQerLen, tid);
+
+ // post-processing
+ // tim = __rdtsc();
+@@ -1234,7 +1311,7 @@ static void worker_sam(void *data, int seqid, int batch_size, int tid)
+ free(w->regs[i].a);
+ free(w->regs[i+1].a);
+ }
+- //tprof[SAM3][tid] += __rdtsc() - tim;
++ //tprof[SAM3][tid] += __rdtsc() - tim;
+ _mm_free(aln); // kswr_t
+ #endif
+ }
+@@ -1246,7 +1323,7 @@ static void worker_sam(void *data, int seqid, int batch_size, int tid)
+ w->regs[i].a,
+ w->n_processed + i);
+ #if V17 // Feature from v0.7.17 of orig. bwa-mem
+- if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]);
++ if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]);
+ #endif
+ mem_reg2sam(w->opt, w->fmi->idx->bns, w->fmi->idx->pac, &w->seqs[i],
+ &w->regs[i], 0, 0);
+@@ -1264,7 +1341,7 @@ void mem_process_seqs(mem_opt_t *opt,
+ {
+ mem_pestat_t pes[4];
+ double ctime, rtime;
+-
++
+ ctime = cputime(); rtime = realtime();
+ w.opt = opt;
+ w.seqs = seqs; w.n_processed = n_processed;
+@@ -1272,16 +1349,16 @@ void mem_process_seqs(mem_opt_t *opt,
+
+ //int n_ = (opt->flag & MEM_F_PE) ? n : n; // this requires n%2==0
+ int n_ = n;
+-
+- uint64_t tim = __rdtsc();
++
++ uint64_t tim = __rdtsc();
+ fprintf(stderr, "[0000] 1. Calling kt_for - worker_bwt\n");
+-
++
+ kt_for(worker_bwt, &w, n_); // SMEMs (+SAL)
+
+ fprintf(stderr, "[0000] 2. Calling kt_for - worker_aln\n");
+-
++
+ kt_for(worker_aln, &w, n_); // BSW
+- tprof[WORKER10][0] += __rdtsc() - tim;
++ tprof[WORKER10][0] += __rdtsc() - tim;
+
+
+ // PAIRED_END
+@@ -1296,11 +1373,11 @@ void mem_process_seqs(mem_opt_t *opt,
+ // distribution from data
+ }
+ }
+-
++
+ tim = __rdtsc();
+ fprintf(stderr, "[0000] 3. Calling kt_for - worker_sam\n");
+-
+- kt_for(worker_sam, &w, n_); // SAM
++
++ kt_for(worker_sam, &w, n_); // SAM
+ tprof[WORKER20][0] += __rdtsc() - tim;
+
+ fprintf(stderr, "\t[0000][ M::%s] Processed %d reads in %.3f "
+@@ -1342,7 +1419,7 @@ int mem_mark_primary_se(const mem_opt_t *opt, int n, mem_alnreg_t *a, int64_t id
+ int i, n_pri;
+ int_v z = {0,0,0};
+ if (n == 0) return 0;
+-
++
+ for (i = n_pri = 0; i < n; ++i)
+ {
+ a[i].sub = a[i].alt_sc = 0, a[i].secondary = a[i].secondary_all = -1, a[i].hash = hash_64(id+i);
+@@ -1448,7 +1525,7 @@ void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac,
+
+ if (!(opt->flag & MEM_F_ALL))
+ XA = mem_gen_alt(opt, bns, pac, a, s->l_seq, s->seq);
+-
++
+ kv_init(aa);
+ str.l = str.m = 0; str.s = 0;
+ for (k = l = 0; k < a->n; ++k)
+@@ -1465,7 +1542,7 @@ void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac,
+
+ *q = mem_reg2aln(opt, bns, pac, s->l_seq, s->seq, p);
+ assert(q->rid >= 0); // this should not happen with the new code
+- q->XA = XA? XA[k] : 0;
++ q->XA = XA? XA[k] : 0;
+ q->flag |= extra_flag; // flag secondary
+ if (p->secondary >= 0) q->sub = -1; // don't output sub-optimal score
+ if (l && p->secondary < 0) // if supplementary
+@@ -1482,7 +1559,7 @@ void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac,
+ mem_aln_t t;
+ t = mem_reg2aln(opt, bns, pac, s->l_seq, s->seq, 0);
+ t.flag |= extra_flag;
+- mem_aln2sam(opt, bns, &str, s, 1, &t, 0, m);
++ mem_aln2sam(opt, bns, &str, s, 1, &t, 0, m);
+ } else {
+ for (k = 0; k < aa.n; ++k)
+ mem_aln2sam(opt, bns, &str, s, aa.n, aa.a, k, m);
+@@ -1511,7 +1588,7 @@ static inline void add_cigar(const mem_opt_t *opt, mem_aln_t *p, kstring_t *str,
+
+ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str,
+ bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m_)
+-{
++{
+ int i, l_name;
+ mem_aln_t ptmp = list[which], *p = &ptmp, mtmp, *m = 0; // make a copy of the alignment to convert
+
+@@ -1599,7 +1676,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str,
+ str->s[str->l] = 0;
+ } else kputc('*', str);
+ }
+-
++
+ // print optional tags
+ if (p->n_cigar) {
+ kputsn("\tNM:i:", 6, str); kputw(p->NM, str);
+@@ -1607,7 +1684,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str,
+ }
+ #if V17
+ if (m && m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); }
+-#endif
++#endif
+ if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); }
+ if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); }
+ if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); }
+@@ -1636,7 +1713,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str,
+ }
+
+ if (p->XA) { kputsn("\tXA:Z:", 6, str); kputs(p->XA, str); }
+-
++
+ if (s->comment) { kputc('\t', str); kputs(s->comment, str); }
+ if ((opt->flag&MEM_F_REF_HDR) && p->rid >= 0 && bns->anns[p->rid].anno != 0 && bns->anns[p->rid].anno[0] != 0) {
+ int tmp;
+@@ -1761,7 +1838,7 @@ void* _mm_realloc(void *ptr, int64_t csize, int64_t nsize, int16_t dsize) {
+ assert(nptr != NULL);
+ memcpy_bwamem(nptr, nsize * dsize, ptr, csize, __FILE__, __LINE__);
+ _mm_free(ptr);
+-
++
+ return nptr;
+ }
+
+@@ -1778,11 +1855,11 @@ uint8_t *bns_get_seq_v2(int64_t l_pac, const uint8_t *pac, int64_t beg, int64_t
+ if (beg >= l_pac || end <= l_pac) {
+ *len = end - beg;
+ assert(end-beg < BATCH_SIZE * SEEDS_PER_READ * sizeof(SeqPair));
+-
++
+ //seq = (uint8_t*) malloc(end - beg);
+ // seq = seqb;
+ if (beg >= l_pac) { // reverse strand
+-#if 0 // orig
++#if 0 // orig
+ int64_t beg_f = (l_pac<<1) - 1 - end;
+ int64_t end_f = (l_pac<<1) - 1 - beg;
+ for (k = end_f; k > beg_f; --k) {
+@@ -1800,7 +1877,7 @@ uint8_t *bns_get_seq_v2(int64_t l_pac, const uint8_t *pac, int64_t beg, int64_t
+ }
+ #else
+ seq = ref_string + beg;
+-#endif
++#endif
+ }
+
+ } else *len = 0; // if bridging the forward-reverse boundary, return nothing
+@@ -1819,7 +1896,7 @@ uint8_t *bns_fetch_seq_v2(const bntseq_t *bns, const uint8_t *pac,
+ // if (*beg > mid || mid >= *end)
+ // fprintf(Error: stderr, "%ld %ld %ld\n", *beg, mid, *end);
+ assert(*beg <= mid && mid < *end);
+-
++
+ *rid = bns_pos2rid(bns, bns_depos(bns, mid, &is_rev));
+ far_beg = bns->anns[*rid].offset;
+ far_end = far_beg + bns->anns[*rid].len;
+@@ -1832,7 +1909,7 @@ uint8_t *bns_fetch_seq_v2(const bntseq_t *bns, const uint8_t *pac,
+ *end = *end < far_end? *end : far_end;
+
+ seq = bns_get_seq_v2(bns->l_pac, pac, *beg, *end, &len, ref_string, seqb);
+-
++
+ if (seq == 0 || *end - *beg != len) {
+ fprintf(stderr, "[E::%s] begin=%ld, mid=%ld, end=%ld, len=%ld, seq=%p, rid=%d, far_beg=%ld, far_end=%ld\n",
+ __func__, (long)*beg, (long)mid, (long)*end, (long)len, seq, *rid, (long)far_beg, (long)far_end);
+@@ -1852,29 +1929,29 @@ inline void sortPairsLenExt(SeqPair *pairArray, int32_t count, SeqPair *tempArra
+
+ int32_t *hist2 = hist + MAX_SEQ_LEN8;
+ int32_t *hist3 = hist + MAX_SEQ_LEN8 + MAX_SEQ_LEN16;
+-
++
+ for(i = 0; i <= MAX_SEQ_LEN8 + MAX_SEQ_LEN16; i+=1)
+ //_mm256_store_si256((__m256i *)(hist + i), zero256);
+ hist[i] = 0;
+-
++
+ int *arr = (int*) calloc (count, sizeof(int));
+ assert(arr != NULL);
+-
++
+ for(i = 0; i < count; i++)
+ {
+ SeqPair sp = pairArray[i];
+ // int minval = sp.h0 + max_(sp.len1, sp.len2);
+ int minval = sp.h0 + min_(sp.len1, sp.len2) * score_a;
+- if (sp.len1 < MAX_SEQ_LEN8 && sp.len2 < MAX_SEQ_LEN8 && minval < MAX_SEQ_LEN8)
++ if (sp.len1 < MAX_SEQ_LEN8 && sp.len2 < MAX_SEQ_LEN8 && minval < MAX_SEQ_LEN8)
+ hist[minval]++;
+ else if(sp.len1 < MAX_SEQ_LEN16 && sp.len2 < MAX_SEQ_LEN16 && minval < MAX_SEQ_LEN16)
+ hist2[minval] ++;
+ else
+ hist3[0] ++;
+-
++
+ arr[i] = 0;
+ }
+-
++
+ int32_t cumulSum = 0;
+ for(i = 0; i < MAX_SEQ_LEN8; i++)
+ {
+@@ -1895,8 +1972,8 @@ inline void sortPairsLenExt(SeqPair *pairArray, int32_t count, SeqPair *tempArra
+ SeqPair sp = pairArray[i];
+ // int minval = sp.h0 + max_(sp.len1, sp.len2);
+ int minval = sp.h0 + min_(sp.len1, sp.len2) * score_a;
+-
+- if (sp.len1 < MAX_SEQ_LEN8 && sp.len2 < MAX_SEQ_LEN8 && minval < MAX_SEQ_LEN8)
++
++ if (sp.len1 < MAX_SEQ_LEN8 && sp.len2 < MAX_SEQ_LEN8 && minval < MAX_SEQ_LEN8)
+ {
+ int32_t pos = hist[minval];
+ tempArray[pos] = sp;
+@@ -1906,7 +1983,7 @@ inline void sortPairsLenExt(SeqPair *pairArray, int32_t count, SeqPair *tempArra
+ {
+ fprintf(stderr, "[%s] Error log: repeat, pos: %d, arr: %d, minval: %d, (%d %d)\n",
+ __func__, pos, arr[pos], minval, sp.len1, sp.len2);
+- exit(EXIT_FAILURE);
++ exit(EXIT_FAILURE);
+ }
+ arr[pos] = 1;
+ }
+@@ -1922,8 +1999,8 @@ inline void sortPairsLenExt(SeqPair *pairArray, int32_t count, SeqPair *tempArra
+ "i: %d, pos: %d, arr: %d, hist2: %d, minval: %d, (%d %d %d) (%d %d %d)\n",
+ __func__, i, pos, arr[pos], hist2[minval], minval, sp.h0, sp.len1, sp.len2,
+ spt.h0, spt.len1, spt.len2);
+- exit(EXIT_FAILURE);
+- }
++ exit(EXIT_FAILURE);
++ }
+ arr[pos] = i + 1;
+ }
+ else {
+@@ -1934,7 +2011,7 @@ inline void sortPairsLenExt(SeqPair *pairArray, int32_t count, SeqPair *tempArra
+ numPairs1 ++;
+ }
+ }
+-
++
+ for(i = 0; i < count; i++) {
+ pairArray[i] = tempArray[i];
+ }
+@@ -1948,14 +2025,14 @@ inline void sortPairsLen(SeqPair *pairArray, int32_t count, SeqPair *tempArray,
+ int32_t i;
+ #if ((!__AVX512BW__) & (__AVX2__ | __SSE2__))
+ for(i = 0; i <= MAX_SEQ_LEN16; i++) hist[i] = 0;
+-#else
++#else
+ __m512i zero512 = _mm512_setzero_si512();
+ for(i = 0; i <= MAX_SEQ_LEN16; i+=16)
+ {
+ _mm512_store_si512((__m512i *)(hist + i), zero512);
+ }
+ #endif
+-
++
+ for(i = 0; i < count; i++)
+ {
+ SeqPair sp = pairArray[i];
+@@ -1996,20 +2073,20 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ SeqPair *seqPairArrayRight128 = mmc->seqPairArrayRight128[tid];
+ int64_t *wsize_pair = &(mmc->wsize[tid]);
+
+- uint8_t *seqBufLeftRef = mmc->seqBufLeftRef[tid*CACHE_LINE];
++ uint8_t *seqBufLeftRef = mmc->seqBufLeftRef[tid*CACHE_LINE];
+ uint8_t *seqBufRightRef = mmc->seqBufRightRef[tid*CACHE_LINE];
+- uint8_t *seqBufLeftQer = mmc->seqBufLeftQer[tid*CACHE_LINE];
++ uint8_t *seqBufLeftQer = mmc->seqBufLeftQer[tid*CACHE_LINE];
+ uint8_t *seqBufRightQer = mmc->seqBufRightQer[tid*CACHE_LINE];
+ int64_t *wsize_buf_ref = &(mmc->wsize_buf_ref[tid*CACHE_LINE]);
+ int64_t *wsize_buf_qer = &(mmc->wsize_buf_qer[tid*CACHE_LINE]);
+-
++
+ // int32_t *lim_g = mmc->lim + (BATCH_SIZE + 32) * tid;
+ int32_t *lim_g = mmc->lim[tid];
+-
++
+ mem_seed_t *s;
+ int64_t l_pac = bns->l_pac, rmax[8] __attribute__((aligned(64)));
+ // std::vector<int8_t> nexitv(nseq, 0);
+-
++
+ int numPairsLeft = 0, numPairsRight = 0;
+ int numPairsLeft1 = 0, numPairsRight1 = 0;
+ int numPairsLeft128 = 0, numPairsRight128 = 0;
+@@ -2023,25 +2100,25 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ uint32_t *srtgg = (uint32_t*) malloc(nseq * SEEDS_PER_READ * fac * sizeof(uint32_t));
+
+ int spos = 0;
+-
++
+ // uint64_t timUP = __rdtsc();
+ for (int l=0; l<nseq; l++)
+ {
+ int max = 0;
+ uint8_t *rseq = 0;
+-
++
+ uint32_t *srtg = srtgg;
+ lim_g[l+1] = 0;
+-
++
+ const uint8_t *query = (uint8_t *) seq_[l].seq;
+ int l_query = seq_[l].l_seq;
+-
+- mem_chain_v *chn = &chain_ar[l];
++
++ mem_chain_v *chn = &chain_ar[l];
+ mem_alnreg_v *av = &av_v[l]; // alignment
+ mem_chain_t *c;
+
+ _mm_prefetch((const char*) query, _MM_HINT_NTA);
+-
++
+ // aln mem allocation
+ av->m = 0;
+ for (int j=0; j<chn->n; j++) {
+@@ -2054,13 +2131,13 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ {
+ c = &chn->a[j];
+ assert(c->seqid == l);
+-
++
+ int64_t tmp = 0;
+ if (c->n == 0) continue;
+-
++
+ _mm_prefetch((const char*) (srtg + spos + 64), _MM_HINT_NTA);
+ _mm_prefetch((const char*) (lim_g), _MM_HINT_NTA);
+-
++
+ // get the max possible span
+ rmax[0] = l_pac<<1; rmax[1] = 0;
+
+@@ -2076,7 +2153,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ rmax[1] = (rmax[1] > e)? rmax[1] : e;
+ if (t->len > max) max = t->len;
+ }
+-
++
+ rmax[0] = rmax[0] > 0? rmax[0] : 0;
+ rmax[1] = rmax[1] < l_pac<<1? rmax[1] : l_pac<<1;
+ if (rmax[0] < l_pac && l_pac < rmax[1])
+@@ -2098,30 +2175,30 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+
+ _mm_prefetch((const char*) rseq, _MM_HINT_NTA);
+ // _mm_prefetch((const char*) rseq + 64, _MM_HINT_NTA);
+-
++
+ // assert(c->n < MAX_SEEDS_PER_READ); // temp
+ if (c->n > srt_size) {
+ srt_size = c->n + 10;
+ srt = (uint64_t *) realloc(srt, srt_size * 8);
+ }
+-
+- for (int i = 0; i < c->n; ++i)
++
++ for (int i = 0; i < c->n; ++i)
+ srt[i] = (uint64_t)c->seeds[i].score<<32 | i;
+
+- if (c->n > 1)
++ if (c->n > 1)
+ ks_introsort_64(c->n, srt);
+-
++
+ // assert((spos + c->n) < SEEDS_PER_READ * FAC * nseq);
+ if ((spos + c->n) > SEEDS_PER_READ * fac * nseq) {
+ fac <<= 1;
+ srtgg = (uint32_t *) realloc(srtgg, nseq * SEEDS_PER_READ * fac * sizeof(uint32_t));
+ }
+-
++
+ for (int i = 0; i < c->n; ++i)
+ srtg[spos++] = srt[i];
+
+ lim_g[l+1] += c->n;
+-
++
+ // uint64_t tim = __rdtsc();
+ for (int k=c->n-1; k >= 0; k--)
+ {
+@@ -2131,9 +2208,9 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ // a = kv_pushp(mem_alnreg_t, *av);
+ a = &av->a[av->n++];
+ memset(a, 0, sizeof(mem_alnreg_t));
+-
++
+ s->aln = av->n-1;
+-
++
+ a->w = opt->w;
+ a->score = a->truesc = -1;
+ a->rid = c->rid;
+@@ -2141,9 +2218,9 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ a->seedlen0 = s->len;
+ a->c = c; //ptr
+ a->rb = a->qb = a->re = a->qe = H0_;
+-
++
+ tprof[PE19][tid] ++;
+-
++
+ int flag = 0;
+ std::pair<int, int> pr;
+ if (s->qbeg) // left extension
+@@ -2152,7 +2229,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ sp.h0 = s->len * opt->a;
+ sp.seqid = c->seqid;
+ sp.regid = av->n - 1;
+-
++
+ if (numPairsLeft >= *wsize_pair) {
+ fprintf(stderr, "[0000][%0.4d] Re-allocating seqPairArrays, in Left\n", tid);
+ *wsize_pair += 1024;
+@@ -2170,10 +2247,10 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ mmc->seqPairArrayRight128[tid] = seqPairArrayRight128;
+ }
+
+-
++
+ sp.idq = leftQerOffset;
+ sp.idr = leftRefOffset;
+-
++
+ leftQerOffset += s->qbeg;
+ if (leftQerOffset >= *wsize_buf_qer)
+ {
+@@ -2182,17 +2259,17 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ int64_t tmp = *wsize_buf_qer;
+ *wsize_buf_qer *= 2;
+ uint8_t *seqBufQer_ = (uint8_t*)
+- _mm_realloc(seqBufLeftQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
++ _mm_realloc(seqBufLeftQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
+ mmc->seqBufLeftQer[tid*CACHE_LINE] = seqBufLeftQer = seqBufQer_;
+-
++
+ seqBufQer_ = (uint8_t*)
+- _mm_realloc(seqBufRightQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
+- mmc->seqBufRightQer[tid*CACHE_LINE] = seqBufRightQer = seqBufQer_;
++ _mm_realloc(seqBufRightQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
++ mmc->seqBufRightQer[tid*CACHE_LINE] = seqBufRightQer = seqBufQer_;
+ }
+-
++
+ uint8_t *qs = seqBufLeftQer + sp.idq;
+ for (int i = 0; i < s->qbeg; ++i) qs[i] = query[s->qbeg - 1 - i];
+-
++
+ tmp = s->rbeg - rmax[0];
+ leftRefOffset += tmp;
+ if (leftRefOffset >= *wsize_buf_ref)
+@@ -2202,31 +2279,31 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ int64_t tmp = *wsize_buf_ref;
+ *wsize_buf_ref *= 2;
+ uint8_t *seqBufRef_ = (uint8_t*)
+- _mm_realloc(seqBufLeftRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
++ _mm_realloc(seqBufLeftRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
+ mmc->seqBufLeftRef[tid*CACHE_LINE] = seqBufLeftRef = seqBufRef_;
+-
++
+ seqBufRef_ = (uint8_t*)
+- _mm_realloc(seqBufRightRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
+- mmc->seqBufRightRef[tid*CACHE_LINE] = seqBufRightRef = seqBufRef_;
++ _mm_realloc(seqBufRightRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
++ mmc->seqBufRightRef[tid*CACHE_LINE] = seqBufRightRef = seqBufRef_;
+ }
+-
+- uint8_t *rs = seqBufLeftRef + sp.idr;
++
++ uint8_t *rs = seqBufLeftRef + sp.idr;
+ for (int64_t i = 0; i < tmp; ++i) rs[i] = rseq[tmp - 1 - i]; //seq1
+-
++
+ sp.len2 = s->qbeg;
+ sp.len1 = tmp;
+ int minval = sp.h0 + min_(sp.len1, sp.len2) * opt->a;
+-
++
+ if (sp.len1 < MAX_SEQ_LEN8 && sp.len2 < MAX_SEQ_LEN8 && minval < MAX_SEQ_LEN8) {
+ numPairsLeft128++;
+ }
+ else if (sp.len1 < MAX_SEQ_LEN16 && sp.len2 < MAX_SEQ_LEN16 && minval < MAX_SEQ_LEN16){
+ numPairsLeft16++;
+ }
+- else {
++ else {
+ numPairsLeft1++;
+ }
+-
++
+ seqPairArrayLeft128[numPairsLeft] = sp;
+ numPairsLeft ++;
+ a->qb = s->qbeg; a->rb = s->rbeg;
+@@ -2241,7 +2318,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ {
+ int64_t qe = s->qbeg + s->len;
+ int64_t re = s->rbeg + s->len - rmax[0];
+- assert(re >= 0);
++ assert(re >= 0);
+ SeqPair sp;
+
+ sp.h0 = H0_; //random number
+@@ -2265,13 +2342,13 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ * sizeof(SeqPair));
+ mmc->seqPairArrayRight128[tid] = seqPairArrayRight128;
+ }
+-
++
+ sp.len2 = l_query - qe;
+ sp.len1 = rmax[1] - rmax[0] - re;
+
+ sp.idq = rightQerOffset;
+ sp.idr = rightRefOffset;
+-
++
+ rightQerOffset += sp.len2;
+ if (rightQerOffset >= *wsize_buf_qer)
+ {
+@@ -2280,12 +2357,12 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ int64_t tmp = *wsize_buf_qer;
+ *wsize_buf_qer *= 2;
+ uint8_t *seqBufQer_ = (uint8_t*)
+- _mm_realloc(seqBufLeftQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
++ _mm_realloc(seqBufLeftQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
+ mmc->seqBufLeftQer[tid*CACHE_LINE] = seqBufLeftQer = seqBufQer_;
+-
++
+ seqBufQer_ = (uint8_t*)
+- _mm_realloc(seqBufRightQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
+- mmc->seqBufRightQer[tid*CACHE_LINE] = seqBufRightQer = seqBufQer_;
++ _mm_realloc(seqBufRightQer, tmp, *wsize_buf_qer, sizeof(uint8_t));
++ mmc->seqBufRightQer[tid*CACHE_LINE] = seqBufRightQer = seqBufQer_;
+ }
+
+ rightRefOffset += sp.len1;
+@@ -2296,25 +2373,25 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ int64_t tmp = *wsize_buf_ref;
+ *wsize_buf_ref *= 2;
+ uint8_t *seqBufRef_ = (uint8_t*)
+- _mm_realloc(seqBufLeftRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
++ _mm_realloc(seqBufLeftRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
+ mmc->seqBufLeftRef[tid*CACHE_LINE] = seqBufLeftRef = seqBufRef_;
+-
++
+ seqBufRef_ = (uint8_t*)
+- _mm_realloc(seqBufRightRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
+- mmc->seqBufRightRef[tid*CACHE_LINE] = seqBufRightRef = seqBufRef_;
++ _mm_realloc(seqBufRightRef, tmp, *wsize_buf_ref, sizeof(uint8_t));
++ mmc->seqBufRightRef[tid*CACHE_LINE] = seqBufRightRef = seqBufRef_;
+ }
+-
++
+ tprof[PE23][tid] += sp.len1 + sp.len2;
+
+ uint8_t *qs = seqBufRightQer + sp.idq;
+ uint8_t *rs = seqBufRightRef + sp.idr;
+-
++
+ for (int i = 0; i < sp.len2; ++i) qs[i] = query[qe + i];
+
+ for (int i = 0; i < sp.len1; ++i) rs[i] = rseq[re + i]; //seq1
+
+ int minval = sp.h0 + min_(sp.len1, sp.len2) * opt->a;
+-
++
+ if (sp.len1 < MAX_SEQ_LEN8 && sp.len2 < MAX_SEQ_LEN8 && minval < MAX_SEQ_LEN8) {
+ numPairsRight128++;
+ }
+@@ -2349,16 +2426,16 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ }
+ }
+ // tprof[MEM_ALN2_UP][tid] += __rdtsc() - timUP;
+-
++
+
+ int32_t *hist = (int32_t *)_mm_malloc((MAX_SEQ_LEN8 + MAX_SEQ_LEN16 + 32) *
+ sizeof(int32_t), 64);
+-
++
+ /* Sorting based score is required as that affects the use of SIMD lanes */
+ sortPairsLenExt(seqPairArrayLeft128, numPairsLeft, seqPairArrayAux, hist,
+ numPairsLeft128, numPairsLeft16, numPairsLeft1, opt->a);
+ assert(numPairsLeft == (numPairsLeft128 + numPairsLeft16 + numPairsLeft1));
+-
++
+
+ // SWA
+ // uint64_t timL = __rdtsc();
+@@ -2369,17 +2446,17 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ BandedPairWiseSW bswLeft(opt->o_del, opt->e_del, opt->o_ins,
+ opt->e_ins, opt->zdrop, opt->pen_clip5,
+ opt->mat, opt->a, opt->b, nthreads);
+-
++
+ BandedPairWiseSW bswRight(opt->o_del, opt->e_del, opt->o_ins,
+ opt->e_ins, opt->zdrop, opt->pen_clip3,
+ opt->mat, opt->a, opt->b, nthreads);
+-
++
+ int i;
+ // Left
+ SeqPair *pair_ar = seqPairArrayLeft128 + numPairsLeft128 + numPairsLeft16;
+ SeqPair *pair_ar_aux = seqPairArrayAux;
+ int nump = numPairsLeft1;
+-
++
+ // scalar
+ for ( i=0; i<MAX_BAND_TRY; i++)
+ {
+@@ -2394,11 +2471,11 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ // tprof[PE5][0] += nump;
+ // tprof[PE6][0] ++;
+ // tprof[MEM_ALN2_B][tid] += __rdtsc() - tim;
+-
++
+ int num = 0;
+ for (int l=0; l<nump; l++)
+ {
+- mem_alnreg_t *a;
++ mem_alnreg_t *a;
+ SeqPair *sp = &pair_ar[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+ int prev = a->score;
+@@ -2426,7 +2503,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ a->seedcov += t->len;
+ }
+ }
+-
++
+ } else {
+ pair_ar_aux[num++] = *sp;
+ }
+@@ -2443,7 +2520,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+
+ pair_ar = seqPairArrayLeft128 + numPairsLeft128;
+ pair_ar_aux = seqPairArrayAux;
+-
++
+ nump = numPairsLeft16;
+ for ( i=0; i<MAX_BAND_TRY; i++)
+ {
+@@ -2460,14 +2537,14 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ nthreads,
+ w);
+ #endif
+-
++
+ tprof[PE5][0] += nump;
+- tprof[PE6][0] ++;
++ tprof[PE6][0] ++;
+ // tprof[MEM_ALN2_B][tid] += __rdtsc() - tim;
+
+ int num = 0;
+ for (int l=0; l<nump; l++)
+- {
++ {
+ mem_alnreg_t *a;
+ SeqPair *sp = &pair_ar[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+@@ -2475,7 +2552,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ int prev = a->score;
+ a->score = sp->score;
+
+-
++
+ if (a->score == prev || sp->max_off < (w >> 1) + (w >> 2) ||
+ i+1 == MAX_BAND_TRY)
+ {
+@@ -2511,13 +2588,13 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ //****************** Left - vector int8 ***********************
+ pair_ar = seqPairArrayLeft128;
+ pair_ar_aux = seqPairArrayAux;
+-
++
+ nump = numPairsLeft128;
+ for ( i=0; i<MAX_BAND_TRY; i++)
+ {
+ int32_t w = opt->w << i;
+ // int64_t tim = __rdtsc();
+-
++
+ #if ((!__AVX512BW__) && (!__AVX2__) && (!__SSE2__))
+ bswLeft.scalarBandedSWAWrapper(pair_ar, seqBufLeftRef, seqBufLeftQer, nump, nthreads, w);
+ #else
+@@ -2528,22 +2605,22 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ nump,
+ nthreads,
+ w);
+-#endif
+-
++#endif
++
+ tprof[PE1][0] += nump;
+ tprof[PE2][0] ++;
+ // tprof[MEM_ALN2_D][tid] += __rdtsc() - tim;
+
+ int num = 0;
+ for (int l=0; l<nump; l++)
+- {
++ {
+ mem_alnreg_t *a;
+ SeqPair *sp = &pair_ar[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+
+ int prev = a->score;
+ a->score = sp->score;
+-
++
+ if (a->score == prev || sp->max_off < (w >> 1) + (w >> 2) ||
+ i+1 == MAX_BAND_TRY)
+ {
+@@ -2577,12 +2654,12 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ }
+
+ // tprof[CLEFT][tid] += __rdtsc() - timL;
+-
++
+ // uint64_t timR = __rdtsc();
+ // **********************************************************
+ // Right, scalar
+ for (int l=0; l<numPairsRight; l++) {
+- mem_alnreg_t *a;
++ mem_alnreg_t *a;
+ SeqPair *sp = &seqPairArrayRight128[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+ sp->h0 = a->score;
+@@ -2600,7 +2677,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ for ( i=0; i<MAX_BAND_TRY; i++)
+ {
+ int32_t w = opt->w << i;
+- // tim = __rdtsc();
++ // tim = __rdtsc();
+ bswRight.scalarBandedSWAWrapper(pair_ar,
+ seqBufRightRef,
+ seqBufRightQer,
+@@ -2614,12 +2691,12 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+
+ for (int l=0; l<nump; l++)
+ {
+- mem_alnreg_t *a;
++ mem_alnreg_t *a;
+ SeqPair *sp = &pair_ar[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+ int prev = a->score;
+ a->score = sp->score;
+-
++
+ // no further banding
+ if (a->score == prev || sp->max_off < (w >> 1) + (w >> 2) ||
+ i+1 == MAX_BAND_TRY)
+@@ -2657,7 +2734,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ pair_ar = seqPairArrayRight128 + numPairsRight128;
+ pair_ar_aux = seqPairArrayAux;
+ nump = numPairsRight16;
+-
++
+ for ( i=0; i<MAX_BAND_TRY; i++)
+ {
+ int32_t w = opt->w << i;
+@@ -2677,18 +2754,18 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ tprof[PE7][0] += nump;
+ tprof[PE8][0] ++;
+ // tprof[MEM_ALN2_C][tid] += __rdtsc() - tim;
+-
++
+ int num = 0;
+
+ for (int l=0; l<nump; l++)
+ {
+- mem_alnreg_t *a;
++ mem_alnreg_t *a;
+ SeqPair *sp = &pair_ar[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+ //OutScore *o = outScoreArray + l;
+ int prev = a->score;
+ a->score = sp->score;
+-
++
+ // no further banding
+ if (a->score == prev || sp->max_off < (w >> 1) + (w >> 2) ||
+ i+1 == MAX_BAND_TRY)
+@@ -2727,14 +2804,14 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ pair_ar = seqPairArrayRight128;
+ pair_ar_aux = seqPairArrayAux;
+ nump = numPairsRight128;
+-
++
+ for ( i=0; i<MAX_BAND_TRY; i++)
+ {
+ int32_t w = opt->w << i;
+ // uint64_t tim = __rdtsc();
+-
++
+ #if ((!__AVX512BW__) && (!__AVX2__) && (!__SSE2__))
+- bswRight.scalarBandedSWAWrapper(pair_ar, seqBufRightRef, seqBufRightQer, nump, nthreads, w);
++ bswRight.scalarBandedSWAWrapper(pair_ar, seqBufRightRef, seqBufRightQer, nump, nthreads, w);
+ #else
+ sortPairsLen(pair_ar, nump, seqPairArrayAux, hist);
+ bswRight.getScores8(pair_ar,
+@@ -2743,8 +2820,8 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ nump,
+ nthreads,
+ w);
+-#endif
+-
++#endif
++
+ tprof[PE3][0] += nump;
+ tprof[PE4][0] ++;
+ // tprof[MEM_ALN2_E][tid] += __rdtsc() - tim;
+@@ -2752,7 +2829,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+
+ for (int l=0; l<nump; l++)
+ {
+- mem_alnreg_t *a;
++ mem_alnreg_t *a;
+ SeqPair *sp = &pair_ar[l];
+ a = &(av_v[sp->seqid].a[sp->regid]); // prev
+ //OutScore *o = outScoreArray + l;
+@@ -2793,7 +2870,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+
+ _mm_free(hist);
+ // tprof[CRIGHT][tid] += __rdtsc() - timR;
+-
++
+ if (numPairsLeft >= *wsize_pair || numPairsRight >= *wsize_pair)
+ { // refine it!
+ fprintf(stderr, "Error: Unexpected behaviour!!!\n");
+@@ -2803,12 +2880,12 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ exit(EXIT_FAILURE);
+ }
+ /* Discard seeds and hence their alignemnts */
+-
++
+ lim_g[0] = 0;
+ for (int l=1; l<nseq; l++)
+ lim_g[l] += lim_g[l-1];
+-
+- // uint64_t tim = __rdtsc();
++
++ // uint64_t tim = __rdtsc();
+ int *lim = (int *) calloc(BATCH_SIZE, sizeof(int));
+ assert(lim != NULL);
+
+@@ -2816,12 +2893,12 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ {
+ int s_start = 0, s_end = 0;
+ uint32_t *srtg = srtgg + lim_g[l];
+-
++
+ int l_query = seq_[l].l_seq;
+- mem_chain_v *chn = &chain_ar[l];
++ mem_chain_v *chn = &chain_ar[l];
+ mem_alnreg_v *av = &av_v[l]; // alignment
+ mem_chain_t *c;
+-
++
+ for (int j=0; j<chn->n; j++)
+ {
+ c = &chn->a[j];
+@@ -2831,7 +2908,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+
+ uint32_t *srt2 = srtg + s_start;
+ s_start += c->n;
+-
++
+ int k = 0;
+ for (k = c->n-1; k >= 0; k--)
+ {
+@@ -2851,12 +2928,12 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ || s->qbeg + s->len > p->qe) {
+ v++; continue; // not fully contained
+ }
+-
++
+ if (s->len - p->seedlen0 > .1 * l_query) { v++; continue;}
+ // qd: distance ahead of the seed on query; rd: on reference
+ qd = s->qbeg - p->qb; rd = s->rbeg - p->rb;
+ // the maximal gap allowed in regions ahead of the seed
+- max_gap = cal_max_gap(opt, qd < rd? qd : rd);
++ max_gap = cal_max_gap(opt, qd < rd? qd : rd);
+ w = max_gap < p->w? max_gap : p->w; // bounded by the band width
+ if (qd - rd < w && rd - qd < w) break; // the seed is "around" a previous hit
+ // similar to the previous four lines, but this time we look at the region behind
+@@ -2864,10 +2941,10 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ max_gap = cal_max_gap(opt, qd < rd? qd : rd);
+ w = max_gap < p->w? max_gap : p->w;
+ if (qd - rd < w && rd - qd < w) break;
+-
++
+ v++;
+ }
+-
++
+ // the seed is (almost) contained in an existing alignment;
+ // further testing is needed to confirm it is not leading
+ // to a different aln
+@@ -2881,7 +2958,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ //if (t->done == H0_) continue; //check for interferences!!!
+ // only check overlapping if t is long enough;
+ // TODO: more efficient by early stopping
+- if (t->len < s->len * .95) continue;
++ if (t->len < s->len * .95) continue;
+ if (s->qbeg <= t->qbeg && s->qbeg + s->len - t->qbeg >= s->len>>2 &&
+ t->qbeg - s->qbeg != t->rbeg - s->rbeg) break;
+ if (t->qbeg <= s->qbeg && t->qbeg + t->len - s->qbeg >= s->len>>2 &&
+@@ -2894,7 +2971,7 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ tprof[PE18][tid]++;
+ continue;
+ }
+- }
++ }
+ lim[l]++;
+ }
+ }
+@@ -2902,5 +2979,5 @@ void mem_chain2aln_across_reads_V2(const mem_opt_t *opt, const bntseq_t *bns,
+ free(srtgg);
+ free(srt);
+ free(lim);
+- // tprof[MEM_ALN2_DOWN][tid] += __rdtsc() - tim;
++ // tprof[MEM_ALN2_DOWN][tid] += __rdtsc() - tim;
+ }
+diff --git a/src/bwamem.h b/src/bwamem.h
+old mode 100644
+new mode 100755
+index 8189ad6..12329cb
+--- a/src/bwamem.h
++++ b/src/bwamem.h
+@@ -206,6 +206,8 @@ typedef struct
+ uint8_t *enc_qdb[MAX_THREADS];
+
+ int64_t wsize_mem[MAX_THREADS];
++ int64_t wsize_mem_s[MAX_THREADS];
++ int64_t wsize_mem_r[MAX_THREADS];
+ } mem_cache;
+
+ // chain moved to .h
+diff --git a/src/fastmap.cpp b/src/fastmap.cpp
+index bc245b9..baf4a27 100644
+--- a/src/fastmap.cpp
++++ b/src/fastmap.cpp
+@@ -99,7 +99,7 @@ int HTStatus()
+ /*** Memory pre-allocations ***/
+ void memoryAlloc(ktp_aux_t *aux, worker_t &w, int32_t nreads, int32_t nthreads)
+ {
+- mem_opt_t *opt = aux->opt;
++ mem_opt_t *opt = aux->opt;
+ int32_t memSize = nreads;
+ int32_t readLen = READ_LEN;
+
+@@ -123,7 +123,7 @@ void memoryAlloc(ktp_aux_t *aux, worker_t &w, int32_t nreads, int32_t nthreads)
+ fprintf(stderr, "------------------------------------------\n");
+ fprintf(stderr, "1. Memory pre-allocation for Chaining: %0.4lf MB\n", allocMem/1e6);
+
+-
++
+ /* SWA mem allocation */
+ int64_t wsize = BATCH_SIZE * SEEDS_PER_READ;
+ for(int l=0; l<nthreads; l++)
+@@ -136,16 +136,16 @@ void memoryAlloc(ktp_aux_t *aux, worker_t &w, int32_t nreads, int32_t nthreads)
+ _mm_malloc(wsize * MAX_SEQ_LEN_REF * sizeof(int8_t) + MAX_LINE_LEN, 64);
+ w.mmc.seqBufRightQer[l*CACHE_LINE] = (uint8_t *)
+ _mm_malloc(wsize * MAX_SEQ_LEN_QER * sizeof(int8_t) + MAX_LINE_LEN, 64);
+-
++
+ w.mmc.wsize_buf_ref[l*CACHE_LINE] = wsize * MAX_SEQ_LEN_REF;
+ w.mmc.wsize_buf_qer[l*CACHE_LINE] = wsize * MAX_SEQ_LEN_QER;
+-
++
+ assert(w.mmc.seqBufLeftRef[l*CACHE_LINE] != NULL);
+ assert(w.mmc.seqBufLeftQer[l*CACHE_LINE] != NULL);
+ assert(w.mmc.seqBufRightRef[l*CACHE_LINE] != NULL);
+ assert(w.mmc.seqBufRightQer[l*CACHE_LINE] != NULL);
+ }
+-
++
+ for(int l=0; l<nthreads; l++) {
+ w.mmc.seqPairArrayAux[l] = (SeqPair *) malloc((wsize + MAX_LINE_LEN)* sizeof(SeqPair));
+ w.mmc.seqPairArrayLeft128[l] = (SeqPair *) malloc((wsize + MAX_LINE_LEN)* sizeof(SeqPair));
+@@ -155,17 +155,20 @@ void memoryAlloc(ktp_aux_t *aux, worker_t &w, int32_t nreads, int32_t nthreads)
+ assert(w.mmc.seqPairArrayAux[l] != NULL);
+ assert(w.mmc.seqPairArrayLeft128[l] != NULL);
+ assert(w.mmc.seqPairArrayRight128[l] != NULL);
+- }
++ }
+
+
+ allocMem = (wsize * MAX_SEQ_LEN_REF * sizeof(int8_t) + MAX_LINE_LEN) * opt->n_threads * 2+
+- (wsize * MAX_SEQ_LEN_QER * sizeof(int8_t) + MAX_LINE_LEN) * opt->n_threads * 2 +
+- wsize * sizeof(SeqPair) * opt->n_threads * 3;
++ (wsize * MAX_SEQ_LEN_QER * sizeof(int8_t) + MAX_LINE_LEN) * opt->n_threads * 2 +
++ wsize * sizeof(SeqPair) * opt->n_threads * 3;
+ fprintf(stderr, "2. Memory pre-allocation for BSW: %0.4lf MB\n", allocMem/1e6);
+
+ for (int l=0; l<nthreads; l++)
+ {
++ // BATCH_MUL accounts for smems per reads in smem kernel
+ w.mmc.wsize_mem[l] = BATCH_MUL * BATCH_SIZE * readLen;
++ w.mmc.wsize_mem_s[l] = BATCH_MUL * BATCH_SIZE * readLen;
++ w.mmc.wsize_mem_r[l] = BATCH_MUL * BATCH_SIZE * readLen;
+ w.mmc.matchArray[l] = (SMEM *) _mm_malloc(w.mmc.wsize_mem[l] * sizeof(SMEM), 64);
+ w.mmc.min_intv_ar[l] = (int32_t *) malloc(w.mmc.wsize_mem[l] * sizeof(int32_t));
+ w.mmc.query_pos_ar[l] = (int16_t *) malloc(w.mmc.wsize_mem[l] * sizeof(int16_t));
+@@ -202,9 +205,9 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ &sz);
+
+ tprof[READ_IO][0] += __rdtsc() - tim;
+-
++
+ fprintf(stderr, "[0000] read_chunk: %ld, work_chunk_size: %ld, nseq: %d\n",
+- aux->task_size, sz, ret->n_seqs);
++ aux->task_size, sz, ret->n_seqs);
+
+ if (ret->seqs == 0) {
+ free(ret);
+@@ -223,9 +226,9 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ fprintf(stderr, "\t[0000][ M::%s] read %d sequences (%ld bp)...\n",
+ __func__, ret->n_seqs, (long)size);
+ }
+-
++
+ return ret;
+- } // Step 0
++ } // Step 0
+ else if (step == 1) /* Step 2: Main processing-engine */
+ {
+ static int task = 0;
+@@ -238,8 +241,8 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ w.chain_ar = (mem_chain_v*) malloc (w.nreads * sizeof(mem_chain_v));
+ w.seedBuf = (mem_seed_t *) calloc(sizeof(mem_seed_t), w.nreads * AVG_SEEDS_PER_READ);
+ assert(w.regs != NULL); assert(w.chain_ar != NULL); assert(w.seedBuf != NULL);
+- }
+-
++ }
++
+ fprintf(stderr, "[0000] Calling mem_process_seqs.., task: %d\n", task++);
+
+ uint64_t tim = __rdtsc();
+@@ -253,7 +256,7 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+
+ fprintf(stderr, "[M::%s] %d single-end sequences; %d paired-end sequences.....\n",
+ __func__, n_sep[0], n_sep[1]);
+-
++
+ if (n_sep[0]) {
+ tmp_opt.flag &= ~MEM_F_PE;
+ /* single-end sequences, in the mixture */
+@@ -263,7 +266,7 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ sep[0],
+ 0,
+ w);
+-
++
+ for (int i = 0; i < n_sep[0]; ++i)
+ ret->seqs[sep[0][i].id].sam = sep[0][i].sam;
+ }
+@@ -276,7 +279,7 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ sep[1],
+ aux->pes0,
+ w);
+-
++
+ for (int i = 0; i < n_sep[1]; ++i)
+ ret->seqs[sep[1][i].id].sam = sep[1][i].sam;
+ }
+@@ -290,11 +293,11 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ ret->seqs,
+ aux->pes0,
+ w);
+- }
++ }
+ tprof[MEM_PROCESS2][0] += __rdtsc() - tim;
+-
++
+ return ret;
+- }
++ }
+ /* Step 3: Write output */
+ else if (step == 2)
+ {
+@@ -317,15 +320,15 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+
+ return 0;
+ } // step 2
+-
++
+ return 0;
+ }
+
+ static void *ktp_worker(void *data)
+-{
++{
+ ktp_worker_t *w = (ktp_worker_t*) data;
+ ktp_t *p = w->pl;
+-
++
+ while (w->step < p->n_steps) {
+ // test whether we can kick off the job with this worker
+ int pthread_ret = pthread_mutex_lock(&p->mutex);
+@@ -369,7 +372,7 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ mem_opt_t *opt = aux->opt;
+ int32_t nthreads = opt->n_threads; // global variable for profiling!
+ w.nthreads = opt->n_threads;
+-
++
+ #if NUMA_ENABLED
+ int deno = 1;
+ int tc = numa_num_task_cpus();
+@@ -378,7 +381,7 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ fprintf(stderr, "num_cpus: %d, num_numas: %d, configured cpus: %d\n", tc, tn, tcc);
+ int ht = HTStatus();
+ if (ht) deno = 2;
+-
++
+ if (nthreads < tcc/tn/deno) {
+ fprintf(stderr, "Enabling single numa domain...\n\n");
+ // numa_set_preferred(0);
+@@ -406,7 +409,7 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ int num_sockets = num_total_logical_cpus / num_logical_cpus;
+ fprintf(stderr, "#sockets: %d, #cores/socket: %d, #logical_cpus: %d, #ht/core: %d\n",
+ num_sockets, num_logical_cpus/num_ht, num_total_logical_cpus, num_ht);
+-
++
+ for (int i=0; i<num_total_logical_cpus; i++) affy[i] = i;
+ int slookup[256] = {-1};
+
+@@ -452,23 +455,23 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ }
+ }
+ #endif
+-
++
+ int32_t nreads = aux->actual_chunk_size/ READ_LEN + 10;
+-
++
+ /* All memory allocation */
+ memoryAlloc(aux, w, nreads, nthreads);
+ fprintf(stderr, "* Threads used (compute): %d\n", nthreads);
+-
++
+ /* pipeline using pthreads */
+ ktp_t aux_;
+ int p_nt = pipe_threads; // 2;
+ int n_steps = 3;
+-
++
+ w.ref_string = aux->ref_string;
+ w.fmi = aux->fmi;
+ w.nreads = nreads;
+ // w.memSize = nreads;
+-
++
+ aux_.n_workers = p_nt;
+ aux_.n_steps = n_steps;
+ aux_.shared = aux;
+@@ -481,7 +484,7 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ fprintf(stderr, "* No. of pipeline threads: %d\n\n", p_nt);
+ aux_.workers = (ktp_worker_t*) malloc(p_nt * sizeof(ktp_worker_t));
+ assert(aux_.workers != NULL);
+-
++
+ for (int i = 0; i < p_nt; ++i) {
+ ktp_worker_t *wr = &aux_.workers[i];
+ wr->step = 0; wr->pl = &aux_; wr->data = 0;
+@@ -490,13 +493,13 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ wr->opt = opt;
+ wr->w = &w;
+ }
+-
++
+ pthread_t *ptid = (pthread_t *) calloc(p_nt, sizeof(pthread_t));
+ assert(ptid != NULL);
+-
++
+ for (int i = 0; i < p_nt; ++i)
+ pthread_create(&ptid[i], 0, ktp_worker, (void*) &aux_.workers[i]);
+-
++
+ for (int i = 0; i < p_nt; ++i)
+ pthread_join(ptid[i], 0);
+
+@@ -508,14 +511,14 @@ static int process(void *shared, gzFile gfp, gzFile gfp2, int pipe_threads)
+ free(ptid);
+ free(aux_.workers);
+ /***** pipeline ends ******/
+-
++
+ fprintf(stderr, "[0000] Computation ends..\n");
+-
+- /* Dealloc memory allcoated in the header section */
++
++ /* Dealloc memory allcoated in the header section */
+ free(w.chain_ar);
+ free(w.regs);
+ free(w.seedBuf);
+-
++
+ for(int l=0; l<nthreads; l++) {
+ _mm_free(w.mmc.seqBufLeftRef[l*CACHE_LINE]);
+ _mm_free(w.mmc.seqBufRightRef[l*CACHE_LINE]);
+@@ -594,7 +597,7 @@ static void usage(const mem_opt_t *opt)
+ fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)\n");
+ fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordinate as primary\n");
+ fprintf(stderr, " -q don't modify mapQ of supplementary alignments\n");
+- fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n");
++ fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n");
+ fprintf(stderr, " -v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose);
+ fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T);
+ fprintf(stderr, " -h INT[,INT] if there are <INT hits with score >80%% of the max score, output all in XA [%d,%d]\n", opt->max_XA_hits, opt->max_XA_hits_alt);
+@@ -616,7 +619,7 @@ int main_mem(int argc, char *argv[])
+ int fixed_chunk_size = -1;
+ char *p, *rg_line = 0, *hdr_line = 0;
+ const char *mode = 0;
+-
++
+ mem_opt_t *opt, opt0;
+ gzFile fp, fp2 = 0;
+ void *ko = 0, *ko2 = 0;
+@@ -625,16 +628,16 @@ int main_mem(int argc, char *argv[])
+ ktp_aux_t aux;
+ bool is_o = 0;
+ uint8_t *ref_string;
+-
++
+ memset(&aux, 0, sizeof(ktp_aux_t));
+ memset(pes, 0, 4 * sizeof(mem_pestat_t));
+ for (i = 0; i < 4; ++i) pes[i].failed = 1;
+-
++
+ // opterr = 0;
+ aux.fp = stdout;
+ aux.opt = opt = mem_opt_init();
+ memset(&opt0, 0, sizeof(mem_opt_t));
+-
++
+ /* Parse input arguments */
+ // comment: added option '5' in the list
+ while ((c = getopt(argc, argv, "51qpaMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:W:x:G:h:y:K:X:H:o:f:")) >= 0)
+@@ -777,7 +780,7 @@ int main_mem(int argc, char *argv[])
+ return 1;
+ }
+ }
+-
++
+ /* Check output file name */
+ if (rg_line)
+ {
+@@ -789,7 +792,7 @@ int main_mem(int argc, char *argv[])
+ if (optind + 2 != argc && optind + 3 != argc) {
+ usage(opt);
+ free(opt);
+- if (is_o)
++ if (is_o)
+ fclose(aux.fp);
+ return 1;
+ }
+@@ -838,52 +841,52 @@ int main_mem(int argc, char *argv[])
+ return 1;
+ }
+ } else update_a(opt, &opt0);
+-
++
+ /* Matrix for SWA */
+ bwa_fill_scmat(opt->a, opt->b, opt->mat);
+-
++
+ /* Load bwt2/FMI index */
+ uint64_t tim = __rdtsc();
+-
+- fprintf(stderr, "* Ref file: %s\n", argv[optind]);
++
++ fprintf(stderr, "* Ref file: %s\n", argv[optind]);
+ aux.fmi = new FMI_search(argv[optind]);
+ aux.fmi->load_index();
+ tprof[FMI][0] += __rdtsc() - tim;
+-
++
+ // reading ref string from the file
+ tim = __rdtsc();
+ fprintf(stderr, "* Reading reference genome..\n");
+-
++
+ char binary_seq_file[PATH_MAX];
+ strcpy_s(binary_seq_file, PATH_MAX, argv[optind]);
+ strcat_s(binary_seq_file, PATH_MAX, ".0123");
+ //sprintf(binary_seq_file, "%s.0123", argv[optind]);
+-
++
+ fprintf(stderr, "* Binary seq file = %s\n", binary_seq_file);
+ FILE *fr = fopen(binary_seq_file, "r");
+-
++
+ if (fr == NULL) {
+ fprintf(stderr, "Error: can't open %s input file\n", binary_seq_file);
+ exit(EXIT_FAILURE);
+ }
+-
++
+ int64_t rlen = 0;
+- fseek(fr, 0, SEEK_END);
++ fseek(fr, 0, SEEK_END);
+ rlen = ftell(fr);
+ ref_string = (uint8_t*) _mm_malloc(rlen, 64);
+ aux.ref_string = ref_string;
+ rewind(fr);
+-
++
+ /* Reading ref. sequence */
+ err_fread_noeof(ref_string, 1, rlen, fr);
+-
++
+ uint64_t timer = __rdtsc();
+ tprof[REF_IO][0] += timer - tim;
+-
++
+ fclose(fr);
+ fprintf(stderr, "* Reference genome size: %ld bp\n", rlen);
+ fprintf(stderr, "* Done reading reference genome !!\n\n");
+-
++
+ if (ignore_alt)
+ for (i = 0; i < aux.fmi->idx->bns->n_seqs; ++i)
+ aux.fmi->idx->bns->anns[i].is_alt = 0;
+@@ -893,7 +896,7 @@ int main_mem(int argc, char *argv[])
+ if (ko == 0) {
+ fprintf(stderr, "[E::%s] fail to open file `%s'.\n", __func__, argv[optind + 1]);
+ free(opt);
+- if (is_o)
++ if (is_o)
+ fclose(aux.fp);
+ delete aux.fmi;
+ // kclose(ko);
+@@ -902,7 +905,7 @@ int main_mem(int argc, char *argv[])
+ // fp = gzopen(argv[optind + 1], "r");
+ fp = gzdopen(fd, "r");
+ aux.ks = kseq_init(fp);
+-
++
+ // PAIRED_END
+ /* Handling Paired-end reads */
+ aux.ks2 = 0;
+@@ -920,13 +923,13 @@ int main_mem(int argc, char *argv[])
+ free(ko);
+ err_gzclose(fp);
+ kseq_destroy(aux.ks);
+- if (is_o)
+- fclose(aux.fp);
++ if (is_o)
++ fclose(aux.fp);
+ delete aux.fmi;
+ kclose(ko);
+ // kclose(ko2);
+ return 1;
+- }
++ }
+ // fp2 = gzopen(argv[optind + 2], "r");
+ fp2 = gzdopen(fd2, "r");
+ aux.ks2 = kseq_init(fp2);
+@@ -949,7 +952,7 @@ int main_mem(int argc, char *argv[])
+
+ /* Relay process function */
+ process(&aux, fp, fp2, no_mt_io? 1:2);
+-
++
+ tprof[PROCESS][0] += __rdtsc() - tim;
+
+ // free memory
+@@ -957,7 +960,7 @@ int main_mem(int argc, char *argv[])
+ _mm_free(ref_string);
+ free(hdr_line);
+ free(opt);
+- kseq_destroy(aux.ks);
++ kseq_destroy(aux.ks);
+ err_gzclose(fp); kclose(ko);
+
+ // PAIRED_END
+@@ -965,18 +968,17 @@ int main_mem(int argc, char *argv[])
+ kseq_destroy(aux.ks2);
+ err_gzclose(fp2); kclose(ko2);
+ }
+-
++
+ if (is_o) {
+ fclose(aux.fp);
+ }
+
+ // new bwt/FMI
+- delete(aux.fmi);
++ delete(aux.fmi);
+
+ /* Display runtime profiling stats */
+ tprof[MEM][0] = __rdtsc() - tprof[MEM][0];
+ display_stats(nt);
+-
++
+ return 0;
+ }
+-
=====================================
debian/patches/0007-Merge-pull-request-229-from-gh-jphan-fix_n_processed.patch
=====================================
@@ -0,0 +1,32 @@
+From: Vasimuddin Md <wasim.mzr at gmail.com>
+Date: Thu, 9 May 2024 00:19:51 +0530
+Subject: Merge pull request #229 from gh-jphan/fix_n_processed
+
+Origin: upstream, https://github.com/bwa-mem2/bwa-mem2/commit/7aa5ff6c3330490e5629ab9b7327683d2dce02d6
+Forwarded: not-needed
+
+Fix threading reproducibility issue by moving n_processed update line from step 2 to step 1
+---
+ src/fastmap.cpp | 4 ++--
+ 1 file changed, 2 insertions(+), 2 deletions(-)
+
+diff --git a/src/fastmap.cpp b/src/fastmap.cpp
+index baf4a27..5cbb2dc 100644
+--- a/src/fastmap.cpp
++++ b/src/fastmap.cpp
+@@ -295,13 +295,13 @@ ktp_data_t *kt_pipeline(void *shared, int step, void *data, mem_opt_t *opt, work
+ w);
+ }
+ tprof[MEM_PROCESS2][0] += __rdtsc() - tim;
+-
++
++ aux->n_processed += ret->n_seqs;
+ return ret;
+ }
+ /* Step 3: Write output */
+ else if (step == 2)
+ {
+- aux->n_processed += ret->n_seqs;
+ uint64_t tim = __rdtsc();
+
+ for (int i = 0; i < ret->n_seqs; ++i)
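For readers following the patch above, a minimal sketch of the ordering issue it fixes -- assuming a ktp-style pipeline in which step 1 (compute) runs in batch order while step 2 (output) may overlap with later batches; the struct and function names below are hypothetical, not the upstream code:

    // Hypothetical sketch, not the bwa-mem2 data structures: why updating the
    // running read counter at the end of the ordered compute step keeps batch
    // offsets deterministic regardless of how the output step is scheduled.
    #include <cstdint>
    #include <cstdio>

    struct Batch { int n_seqs; int64_t first_id; };
    struct Aux   { int64_t n_processed = 0; };

    // Step 1 is executed in batch order, so bumping the running total here
    // gives every batch a stable first_id under any thread count.
    Batch step1_compute(Aux &aux, int n_seqs) {
        Batch b{n_seqs, aux.n_processed};
        aux.n_processed += b.n_seqs;   // the update the patch moves into step 1
        return b;
    }

    // Step 2 may overlap with step 1 of later batches; updating the counter
    // here made downstream ids depend on thread scheduling.
    void step2_write(const Batch &b) {
        std::printf("batch of %d reads starting at id %lld\n",
                    b.n_seqs, (long long)b.first_id);
    }

    int main() {
        Aux aux;
        int batches[] = {100, 250, 80};
        for (int n : batches) step2_write(step1_compute(aux, n));
        return 0;
    }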
=====================================
debian/patches/0010-newer-GCC-support.patch
=====================================
@@ -0,0 +1,21 @@
+From: "Michael R. Crusoe" <michael.crusoe at gmail.com>
+Date: Thu, 27 Feb 2025 11:00:54 +0100
+Subject: newer GCC support
+
+---
+ ext/safestringlib/makefile | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/ext/safestringlib/makefile b/ext/safestringlib/makefile
+index 23ad139..1993818 100644
+--- a/ext/safestringlib/makefile
++++ b/ext/safestringlib/makefile
+@@ -1,7 +1,7 @@
+ IDIR = include
+ MKDIR_P = mkdir -p
+ CC=gcc
+-CFLAGS+=-I$(IDIR) -fstack-protector-strong -fPIE -fPIC -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security
++CFLAGS+=-I$(IDIR) -fstack-protector-strong -fPIE -fPIC -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -DSTDC_HEADERS
+ LDFLAGS+=-z noexecstack -z relo -z now
+
+ ODIR=obj
=====================================
debian/patches/hardening
=====================================
@@ -1,9 +1,18 @@
-Author: Michael R. Crusoe <crusoe at debian.org>
-Description: Enable proper flag appending so that Debian can harden the build.
+From: "Michael R. Crusoe" <crusoe at debian.org>
+Date: Thu, 9 Jul 2020 16:05:44 +0200
+Subject: Enable proper flag appending so that Debian can harden the build.
+
Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/69
---- bwa-mem2.orig/Makefile
-+++ bwa-mem2/Makefile
-@@ -115,7 +115,7 @@
+---
+ Makefile | 2 +-
+ ext/safestringlib/makefile | 8 ++++----
+ 2 files changed, 5 insertions(+), 5 deletions(-)
+
+diff --git a/Makefile b/Makefile
+index 359585f..da91f57 100644
+--- a/Makefile
++++ b/Makefile
+@@ -115,7 +115,7 @@ multi:
$(EXE):$(BWA_LIB) $(SAFE_STR_LIB) src/main.o
@@ -12,8 +21,10 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/69
$(BWA_LIB):$(OBJS)
ar rcs $(BWA_LIB) $(OBJS)
---- bwa-mem2.orig/ext/safestringlib/makefile
-+++ bwa-mem2/ext/safestringlib/makefile
+diff --git a/ext/safestringlib/makefile b/ext/safestringlib/makefile
+index f7cc268..23ad139 100644
+--- a/ext/safestringlib/makefile
++++ b/ext/safestringlib/makefile
@@ -1,8 +1,8 @@
IDIR = include
MKDIR_P = mkdir -p
@@ -25,7 +36,7 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/69
ODIR=obj
OTDIR=objtest
-@@ -30,7 +30,7 @@
+@@ -30,7 +30,7 @@ CLIB =$(addprefix $(SRCDIR)/,$(_CLIB))
$(ODIR)/%.o: $(SRCDIR)/%.c $(DEPS) $(ODEPS)
@@ -34,7 +45,7 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/69
libsafestring.a: $(OBJ)
ar rcs $@ $^
-@@ -44,7 +44,7 @@
+@@ -44,7 +44,7 @@ TCLIB =$(addprefix $(TESTDIR)/,$(_TESTFUNCS))
$(OTDIR)/%.o: $(TESTDIR)/%.c $(TESTDIR)/test_private.h
=====================================
debian/patches/series
=====================================
@@ -1,2 +1,10 @@
+0001-Fix-typo.patch
+0002-fixed-return-status-due-incorrect-input-files.patch
+0003-fixed-smem-memory-rellocation-issue-issue167.patch
+0004-Update-to-reflect-2.2.1-release-and-small-typo.patch
+0005-GCC-11-now-defines-__rdtsc-so-definition-test-condit.patch
+0006-fixed-segfault-due-to-memory-shortage-in-smem.patch
+0007-Merge-pull-request-229-from-gh-jphan-fix_n_processed.patch
hardening
simde
+0010-newer-GCC-support.patch
=====================================
debian/patches/simde
=====================================
@@ -1,10 +1,48 @@
-From: Michael R. Crusoe <crusoe at debian.org>
+From: "Michael R. Crusoe" <crusoe at debian.org>
+Date: Thu, 27 Feb 2025 10:06:00 +0100
Subject: Enable building on non-x86
+
Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
+---
+ src/FMI_search.cpp | 4 ++++
+ src/FMI_search.h | 3 ++-
+ src/bandedSWA.cpp | 7 -------
+ src/bandedSWA.h | 9 ++++++++-
+ src/bwa.h | 7 +++++++
+ src/bwamem.cpp | 5 ++++-
+ src/fastmap.cpp | 10 +++++++++-
+ src/ksw.cpp | 1 -
+ src/ksw.h | 3 ++-
+ src/kswv.h | 3 ++-
+ src/main.cpp | 5 +++++
+ src/runsimd.cpp | 2 +-
+ src/utils.h | 38 ++++++++++++++++++++++++++++++++++++++
+ 13 files changed, 82 insertions(+), 15 deletions(-)
---- bwa-mem2.orig/src/FMI_search.h
-+++ bwa-mem2/src/FMI_search.h
-@@ -34,7 +34,8 @@
+diff --git a/src/FMI_search.cpp b/src/FMI_search.cpp
+index 5f0ff48..4609fde 100644
+--- a/src/FMI_search.cpp
++++ b/src/FMI_search.cpp
+@@ -29,10 +29,14 @@ Authors: Sanchit Misra <sanchit.misra at intel.com>; Vasimuddin Md <vasimuddin.md at i
+
+ #include <stdio.h>
+ #include "sais.h"
++#if !(defined(__GNUC__) && __GNUC__ < 11 && !defined(__clang__))
++#include <immintrin.h> // For __rdtsc
++#endif
+ #include "FMI_search.h"
+ #include "memcpy_bwamem.h"
+ #include "profiling.h"
+
++
+ #ifdef __cplusplus
+ extern "C" {
+ #endif
+diff --git a/src/FMI_search.h b/src/FMI_search.h
+index 25c4d0d..92ac03f 100644
+--- a/src/FMI_search.h
++++ b/src/FMI_search.h
+@@ -34,7 +34,8 @@ Authors: Sanchit Misra <sanchit.misra at intel.com>; Vasimuddin Md <vasimuddin.md at i
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
@@ -14,9 +52,29 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
#include <limits.h>
#include <fstream>
---- bwa-mem2.orig/src/bandedSWA.h
-+++ bwa-mem2/src/bandedSWA.h
-@@ -36,10 +36,17 @@
+diff --git a/src/bandedSWA.cpp b/src/bandedSWA.cpp
+index dfd81bc..c99e072 100644
+--- a/src/bandedSWA.cpp
++++ b/src/bandedSWA.cpp
+@@ -4148,13 +4148,6 @@ void BandedPairWiseSW::smithWaterman128_16(uint16_t seq1SoA[],
+
+ /********************************************************************************/
+ /* SSE2 - 8 bit version */
+-#ifndef __SSE4_1__
+-static inline __m128i _mm_blendv_epi8 (__m128i x, __m128i y, __m128i mask)
+-{
+- // Replace bit in x with bit in y when matching bit in mask is set:
+- return _mm_or_si128(_mm_andnot_si128(mask, x), _mm_and_si128(mask, y));
+-}
+-#endif
+
+ #define ZSCORE8(i4_128, y4_128) \
+ { \
+diff --git a/src/bandedSWA.h b/src/bandedSWA.h
+index c81552b..1a9fd1f 100644
+--- a/src/bandedSWA.h
++++ b/src/bandedSWA.h
+@@ -36,10 +36,17 @@ Authors: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra at i
#include <assert.h>
#include "macro.h"
@@ -35,8 +93,96 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
#define __mmask8 uint8_t
#define __mmask16 uint16_t
#endif
---- bwa-mem2.orig/src/ksw.cpp
-+++ bwa-mem2/src/ksw.cpp
+diff --git a/src/bwa.h b/src/bwa.h
+index 877f00c..aa36083 100644
+--- a/src/bwa.h
++++ b/src/bwa.h
+@@ -37,6 +37,13 @@ Authors: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra at i
+ #include "bwt.h"
+ #include "macro.h"
+
++#if !defined(__SSE__)
++#define _mm_malloc(size, align) aligned_alloc(align, size)
++#define _mm_free free
++#define _MM_HINT_NTA 0
++#define _MM_HINT_T0 0
++#endif
++
+ #define BWA_IDX_BWT 0x1
+ #define BWA_IDX_BNS 0x2
+ #define BWA_IDX_PAC 0x4
+diff --git a/src/bwamem.cpp b/src/bwamem.cpp
+index cfaa6ec..3f05377 100755
+--- a/src/bwamem.cpp
++++ b/src/bwamem.cpp
+@@ -28,6 +28,9 @@ Authors: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra at i
+ Heng Li <hli at jimmy.harvard.edu>
+ *****************************************************************************************/
+
++#if !(defined(__GNUC__) && __GNUC__ < 11 && !defined(__clang__))
++#include <immintrin.h> // For __rdtsc
++#endif
+ #include "bwamem.h"
+ #include "FMI_search.h"
+ #include "memcpy_bwamem.h"
+@@ -2023,7 +2026,7 @@ inline void sortPairsLen(SeqPair *pairArray, int32_t count, SeqPair *tempArray,
+ {
+
+ int32_t i;
+-#if ((!__AVX512BW__) & (__AVX2__ | __SSE2__))
++#if (!__AVX512BW__)
+ for(i = 0; i <= MAX_SEQ_LEN16; i++) hist[i] = 0;
+ #else
+ __m512i zero512 = _mm512_setzero_si512();
+diff --git a/src/fastmap.cpp b/src/fastmap.cpp
+index 5cbb2dc..1318f90 100644
+--- a/src/fastmap.cpp
++++ b/src/fastmap.cpp
+@@ -36,9 +36,13 @@ Authors: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra at i
+ #include <numa.h>
+ #endif
+ #include <sstream>
++#if !(defined(__GNUC__) && __GNUC__ < 11 && !defined(__clang__))
++#include <immintrin.h> // For __rdtsc
++#endif
+ #include "fastmap.h"
+ #include "FMI_search.h"
+
++
+ #if AFF && (__linux__)
+ #include <sys/sysinfo.h>
+ int affy[256];
+@@ -52,7 +56,7 @@ void __cpuid(unsigned int i, unsigned int cpuid[4]) {
+ #ifdef _WIN32
+ __cpuid((int *) cpuid, (int)i);
+
+-#else
++#elif defined(__x86_64__) || defined(__i386__)
+ asm volatile
+ ("cpuid" : "=a" (cpuid[0]), "=b" (cpuid[1]), "=c" (cpuid[2]), "=d" (cpuid[3])
+ : "0" (i), "2" (0));
+@@ -62,6 +66,7 @@ void __cpuid(unsigned int i, unsigned int cpuid[4]) {
+
+ int HTStatus()
+ {
++#if defined(__x86_64__) || defined(__i386__)
+ unsigned int cpuid[4];
+ char platform_vendor[12];
+ __cpuid(0, cpuid);
+@@ -93,6 +98,9 @@ int HTStatus()
+ fprintf(stderr, "CPUs support hyperThreading !!\n");
+
+ return ht;
++#else
++ return 0;
++#endif
+ }
+
+
+diff --git a/src/ksw.cpp b/src/ksw.cpp
+index ad9bc50..9369713 100644
+--- a/src/ksw.cpp
++++ b/src/ksw.cpp
@@ -30,7 +30,6 @@
#include <stdlib.h>
#include <stdint.h>
@@ -45,8 +191,10 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
#include "ksw.h"
#include "macro.h"
---- bwa-mem2.orig/src/ksw.h
-+++ bwa-mem2/src/ksw.h
+diff --git a/src/ksw.h b/src/ksw.h
+index 54bcef8..3179fc4 100644
+--- a/src/ksw.h
++++ b/src/ksw.h
@@ -26,7 +26,8 @@
#define __AC_KSW_H
@@ -57,9 +205,11 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
#define KSW_XBYTE 0x10000
#define KSW_XSTOP 0x20000
---- bwa-mem2.orig/src/kswv.h
-+++ bwa-mem2/src/kswv.h
-@@ -39,7 +39,8 @@
+diff --git a/src/kswv.h b/src/kswv.h
+index 11da4d7..aed1d1a 100644
+--- a/src/kswv.h
++++ b/src/kswv.h
+@@ -39,7 +39,8 @@ Authors: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra at i
#include "ksw.h"
#include "bandedSWA.h"
#else
@@ -69,28 +219,58 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
#endif
#ifdef __GNUC__
---- bwa-mem2.orig/src/bwa.h
-+++ bwa-mem2/src/bwa.h
-@@ -37,6 +37,13 @@
- #include "bwt.h"
- #include "macro.h"
+diff --git a/src/main.cpp b/src/main.cpp
+index abee98d..0b4f07f 100644
+--- a/src/main.cpp
++++ b/src/main.cpp
+@@ -29,6 +29,9 @@ Contacts: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra@
+ *****************************************************************************************/
-+#if !defined(__SSE__)
-+#define _mm_malloc(size, align) aligned_alloc(align, size)
-+#define _mm_free free
-+#define _MM_HINT_NTA 0
-+#define _MM_HINT_T0 0
+ // ----------------------------------
++#if !(defined(__GNUC__) && __GNUC__ < 11 && !defined(__clang__))
++#include <immintrin.h> // For __rdtsc
+#endif
-+
- #define BWA_IDX_BWT 0x1
- #define BWA_IDX_BNS 0x2
- #define BWA_IDX_PAC 0x4
---- bwa-mem2.orig/src/utils.h
-+++ bwa-mem2/src/utils.h
-@@ -63,6 +63,43 @@
- __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
+ #include "main.h"
+
+ #ifndef PACKAGE_VERSION
+@@ -85,6 +88,8 @@ int main(int argc, char* argv[])
+ fprintf(stderr, "Executing in SSE4.2 mode!!\n");
+ #elif __SSE4_1__
+ fprintf(stderr, "Executing in SSE4.1 mode!!\n");
++#elif __SSE2__
++ fprintf(stderr, "Executing in SSE2!!\n");
+ #endif
+ fprintf(stderr, "-----------------------------\n");
+
+diff --git a/src/runsimd.cpp b/src/runsimd.cpp
+index 03c0e91..e0798f2 100644
+--- a/src/runsimd.cpp
++++ b/src/runsimd.cpp
+@@ -61,7 +61,7 @@ void __cpuidex(int cpuid[4], int func_id, int subfunc_id)
+ __asm__ volatile ("cpuid"
+ : "=a" (cpuid[0]), "=b" (cpuid[1]), "=c" (cpuid[2]), "=d" (cpuid[3])
+ : "0" (func_id), "2" (subfunc_id));
+-#else // on 32bit, ebx can NOT be used as PIC code
++#elif defined(__i386__) // on 32bit, ebx can NOT be used as PIC code
+ __asm__ volatile ("xchgl %%ebx, %1; cpuid; xchgl %%ebx, %1"
+ : "=a" (cpuid[0]), "=r" (cpuid[1]), "=c" (cpuid[2]), "=d" (cpuid[3])
+ : "0" (func_id), "2" (subfunc_id));
+diff --git a/src/utils.h b/src/utils.h
+index fbf8439..f07fee2 100644
+--- a/src/utils.h
++++ b/src/utils.h
+@@ -30,6 +30,7 @@
+
+ #include <stdint.h>
+ #include <stdio.h>
++#include <sys/select.h>
+ #include <zlib.h>
+
+ #ifdef __GNUC__
+@@ -64,6 +65,43 @@ static inline unsigned long long __rdtsc(void)
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
+ #endif
+// From https://github.com/google/benchmark/blob/37177a84b7e8d33696ea1e1854513cb0de3b4dc3/src/cycleclock.h
+// Apache 2.0 license
+#elif defined(__aarch64__)
@@ -104,12 +284,12 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
+ asm volatile("mrs %0, cntvct_el0" : "=r"(virtual_timer_value));
+ return virtual_timer_value;
+}
-+#elif defined(__ARM_ARCH)
-+ // V6 is the earliest arch that has a standard cyclecount
-+ // Native Client validator doesn't allow MRC instructions.
-+#if (__ARM_ARCH >= 6)
++#elif !(defined(__GNUC__) && __GNUC__ >= 11)
+static inline unsigned long long __rdtsc(void)
+{
++#if defined(__ARM_ARCH) && (__ARM_ARCH >= 6)
++ // V6 is the earliest arch that has a standard cyclecount
++ // Native Client validator doesn't allow MRC instructions.
+ uint32_t pmccntr;
+ uint32_t pmuseren;
+ uint32_t pmcntenset;
@@ -129,83 +309,5 @@ Forwarded: https://github.com/bwa-mem2/bwa-mem2/pull/84
+ return static_cast<int64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
+}
#endif
- #endif
-
---- bwa-mem2.orig/src/fastmap.cpp
-+++ bwa-mem2/src/fastmap.cpp
-@@ -52,7 +52,7 @@
- #ifdef _WIN32
- __cpuid((int *) cpuid, (int)i);
-
--#else
-+#elif defined(__x86_64__) || defined(__i386__)
- asm volatile
- ("cpuid" : "=a" (cpuid[0]), "=b" (cpuid[1]), "=c" (cpuid[2]), "=d" (cpuid[3])
- : "0" (i), "2" (0));
-@@ -62,6 +62,7 @@
-
- int HTStatus()
- {
-+#if defined(__x86_64__) || defined(__i386__)
- unsigned int cpuid[4];
- char platform_vendor[12];
- __cpuid(0, cpuid);
-@@ -93,6 +94,9 @@
- fprintf(stderr, "CPUs support hyperThreading !!\n");
-
- return ht;
-+#else
-+ return 0;
-+#endif
- }
-
-
---- bwa-mem2.orig/src/runsimd.cpp
-+++ bwa-mem2/src/runsimd.cpp
-@@ -61,7 +61,7 @@
- __asm__ volatile ("cpuid"
- : "=a" (cpuid[0]), "=b" (cpuid[1]), "=c" (cpuid[2]), "=d" (cpuid[3])
- : "0" (func_id), "2" (subfunc_id));
--#else // on 32bit, ebx can NOT be used as PIC code
-+#elif defined(__i386__) // on 32bit, ebx can NOT be used as PIC code
- __asm__ volatile ("xchgl %%ebx, %1; cpuid; xchgl %%ebx, %1"
- : "=a" (cpuid[0]), "=r" (cpuid[1]), "=c" (cpuid[2]), "=d" (cpuid[3])
- : "0" (func_id), "2" (subfunc_id));
---- bwa-mem2.orig/src/bwamem.cpp
-+++ bwa-mem2/src/bwamem.cpp
-@@ -1946,7 +1946,7 @@
- {
-
- int32_t i;
--#if ((!__AVX512BW__) & (__AVX2__ | __SSE2__))
-+#if (!__AVX512BW__)
- for(i = 0; i <= MAX_SEQ_LEN16; i++) hist[i] = 0;
- #else
- __m512i zero512 = _mm512_setzero_si512();
---- bwa-mem2.orig/src/main.cpp
-+++ bwa-mem2/src/main.cpp
-@@ -85,6 +85,8 @@
- fprintf(stderr, "Executing in SSE4.2 mode!!\n");
- #elif __SSE4_1__
- fprintf(stderr, "Executing in SSE4.1 mode!!\n");
-+#elif __SSE2__
-+ fprintf(stderr, "Executing in SSE2!!\n");
- #endif
- fprintf(stderr, "-----------------------------\n");
---- bwa-mem2.orig/src/bandedSWA.cpp
-+++ bwa-mem2/src/bandedSWA.cpp
-@@ -4148,13 +4148,6 @@
-
- /********************************************************************************/
- /* SSE2 - 8 bit version */
--#ifndef __SSE4_1__
--static inline __m128i _mm_blendv_epi8 (__m128i x, __m128i y, __m128i mask)
--{
-- // Replace bit in x with bit in y when matching bit in mask is set:
-- return _mm_or_si128(_mm_andnot_si128(mask, x), _mm_and_si128(mask, y));
--}
--#endif
-
- #define ZSCORE8(i4_128, y4_128) \
- { \
+ typedef struct {
=====================================
src/fastmap.cpp
=====================================
@@ -953,6 +953,7 @@ int main_mem(int argc, char *argv[])
tprof[PROCESS][0] += __rdtsc() - tim;
// free memory
+ int32_t nt = aux.opt->n_threads;
_mm_free(ref_string);
free(hdr_line);
free(opt);
@@ -974,7 +975,7 @@ int main_mem(int argc, char *argv[])
/* Display runtime profiling stats */
tprof[MEM][0] = __rdtsc() - tprof[MEM][0];
- display_stats(aux.opt->n_threads);
+ display_stats(nt);
return 0;
}
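The hunk above keeps the thread count in a local before opt is freed; the old code called display_stats(aux.opt->n_threads) after free(opt), and aux.opt aliases opt (aux.opt = opt = mem_opt_init() earlier in main_mem), so that read went through freed memory. A minimal sketch of the same pattern, with hypothetical types:

    #include <cstdio>
    #include <cstdlib>

    struct Opts { int n_threads; };
    struct Aux  { Opts *opt; };

    static void display_stats(int nt) { std::printf("threads: %d\n", nt); }

    int main() {
        Opts *opt = (Opts *)std::malloc(sizeof(Opts));
        if (opt == NULL) return 1;
        opt->n_threads = 8;
        Aux aux{opt};                  // aux.opt aliases opt

        int nt = aux.opt->n_threads;   // copy the value while opt is still live
        std::free(opt);                // owning pointer released here
        display_stats(nt);             // old code read aux.opt->n_threads at this point
        return 0;
    }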
=====================================
src/main.cpp
=====================================
@@ -32,7 +32,7 @@ Contacts: Vasimuddin Md <vasimuddin.md at intel.com>; Sanchit Misra <sanchit.misra@
#include "main.h"
#ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "2.2"
+#define PACKAGE_VERSION "2.2.1"
#endif
View it on GitLab: https://salsa.debian.org/med-team/bwa-mem2/-/compare/a7f38675c6c0d9ec41cd660b2eb38165cc9af080...cabf1480df85263495f8a3e7d5279ccb453ad89c