[med-svn] [bwa] 01/05: Imported Upstream version 0.7.6a

Fri Feb 21 12:00:47 UTC 2014

This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository bwa.

commit 73da463b67138306a28b34341eb877c29693cd02
Author: Andreas Tille <tille at debian.org>
Date:   Fri Feb 21 12:31:50 2014 +0100

    Imported Upstream version 0.7.6a
---
 Makefile      |   2 +-
 NEWS          |  56 ++++++++++++++
 README.md     |   4 +-
 bwa.1         |  14 ++--
 bwa.c         |  34 +++++++--
 bwamem.c      | 241 +++++++++++++++++++++++++++++++++-------------------------
 bwamem.h      |   8 +-
 bwamem_pair.c |  12 ++-
 bwape.c       |   3 +-
 bwase.c       |   5 +-
 fastmap.c     |  25 ++++--
 ksw.c         |  30 +++++---
 kthread.c     |  53 +++++++++++++
 main.c        |  42 ++++------
 14 files changed, 351 insertions(+), 178 deletions(-)

diff --git a/Makefile b/Makefile
index 69b2ccd..ff48a20 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ CFLAGS=		-g -Wall -O2
 WRAP_MALLOC=-DUSE_MALLOC_WRAPPERS
 AR=			ar
 DFLAGS=		-DHAVE_PTHREAD $(WRAP_MALLOC)
-LOBJS=		utils.o kstring.o ksw.o bwt.o bntseq.o bwa.o bwamem.o bwamem_pair.o malloc_wrap.o
+LOBJS=		utils.o kthread.o kstring.o ksw.o bwt.o bntseq.o bwa.o bwamem.o bwamem_pair.o malloc_wrap.o
 AOBJS=		QSufSort.o bwt_gen.o bwase.o bwaseqio.o bwtgap.o bwtaln.o bamlite.o \
 			is.o bwtindex.o bwape.o kopen.o pemerge.o \
 			bwtsw2_core.o bwtsw2_main.o bwtsw2_aux.o bwt_lite.o \
diff --git a/NEWS b/NEWS
index 29627f0..eb9c37a 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,59 @@
+Release 0.7.6 (31 Januaray, 2014)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Changes in BWA-MEM:
+
+ * Changed the way mapping quality is estimated. The new method tends to give
+   the same alignment a higher mapping quality. On paired-end reads, the change
+   is minor as with pairing, the mapping quality is usually high. For short
+   single-end reads, the difference is considerable.
+
+ * Improved load balance when many threads are spawned. However, bwa-mem is
+   still not very thread efficient, probably due to the frequent heap memory
+   allocation. Further improvement is a little difficult and may affect the
+   code stability.
+
+ * Allow to use different clipping penalties for 5'- and 3'-ends. This helps
+   when we do not want to clip one end.
+
+ * Print the @PG line, including the command line options.
+
+ * Improved the band width estimate: a) fixed a bug causing the band
+   width extimated from extension not used in the final global alignment; b)
+   try doubled band width if the global alignment score is smaller.
+   Insufficient band width leads to wrong CIGAR and spurious mismatches/indels.
+
+ * Added a new option -D to fine tune a heuristic on dropping suboptimal hits.
+   Reducing -D increases accuracy but decreases the mapping speed. If unsure,
+   leave it to the default.
+
+ * Bugfix: for a repetitive single-end read, the reported hit is not randomly
+   distributed among equally best hits.
+
+ * Bugfix: missing paired-end hits due to unsorted list of SE hits.
+
+ * Bugfix: incorrect CIGAR caused by a defect in the global alignment.
+
+ * Bugfix: incorrect CIGAR caused by failed SW rescue.
+
+ * Bugfix: alignments largely mapped to the same position are regarded to be
+   distinct from each other, which leads to underestimated mapping quality.
+
+ * Added the MD tag.
+
+There are no changes to BWA-backtrack in this release. However, it has a few
+known issues yet to be fixed. If you prefer BWA-track, It is still advised to
+use bwa-0.6.x.
+
+While I developed BWA-MEM, I also found a few issues with BWA-SW. It is now
+possible to improve BWA-SW with the lessons learned from BWA-MEM. However, as
+BWA-MEM is usually better, I will not improve BWA-SW until I find applications
+where BWA-SW may excel.
+
+(0.7.6: 31 January 2014, r432)
+
+
+
 Release 0.7.5a (30 May, 2013)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/README.md b/README.md
index d903e38..009a4ca 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,8 @@ different sub-commands: **aln/samse/sampe** for BWA-backtrack,
 BWA is released under [GPLv3][1]. The latest souce code is [freely
 available][2] at github. Released packages can [be downloaded ][3] at
 SourceForge. After you acquire the source code, simply use `make` to compile
-and copy the single executable `bwa` to the destination you want.
+and copy the single executable `bwa` to the destination you want. The only
+dependency of BWA is [zlib][14].
 
 ###Seeking helps
 
@@ -71,3 +72,4 @@ do not have plan to submit it to a peer-reviewed journal in the near future.
 [11]: http://www.ncbi.nlm.nih.gov/pubmed/20080505
 [12]: http://arxiv.org/abs/1303.3997
 [13]: http://arxiv.org/
+[14]: http://zlib.net/
diff --git a/bwa.1 b/bwa.1
index e63fe8d..5949a1b 100644
--- a/bwa.1
+++ b/bwa.1
@@ -1,4 +1,4 @@
-.TH bwa 1 "24 May 2013" "bwa-0.7.5" "Bioinformatics tools"
+.TH bwa 1 "31 January 2014" "bwa-0.7.6" "Bioinformatics tools"
 .SH NAME
 .PP
 bwa - Burrows-Wheeler Alignment Tool
@@ -158,7 +158,7 @@ Number of threads [1]
 Minimum seed length. Matches shorter than
 .I INT
 will be missed. The alignment speed is usually insensitive to this value unless
-it significantly deviates 20. [19]
+it significantly deviates from 20. [19]
 .TP
 .BI -w \ INT
 Band width. Essentially, gaps longer than
@@ -210,12 +210,13 @@ Gap extension penalty. A gap of length k costs O + k*E (i.e.
 .B -O
 is for opening a zero-length gap). [1]
 .TP
-.BI -L \ INT
+.BI -L \ INT[,INT]
 Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best
 score reaching the end of query. If this score is larger than the best SW score
 minus the clipping penalty, clipping will not be applied. Note that in this
 case, the SAM AS tag reports the best SW score; clipping penalty is not
-deducted. [5]
+deduced. If two numbers are provided, the first is for 5'-end clipping and
+second for 3'-end clipping. [5]
 .TP
 .BI -U \ INT
 Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as
@@ -250,10 +251,6 @@ transfer read meta information (e.g. barcode) to the SAM output. Note that the
 FASTA/Q comment (the string after a space in the header line) must conform the SAM
 spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.
 .TP
-.B -H
-Use hard clipping 'H' in the SAM output. This option may dramatically reduce
-the redundancy of output when mapping long contig or BAC sequences.
-.TP
 .B -M
 Mark shorter split hits as secondary (for Picard compatibility).
 .TP
@@ -569,6 +566,7 @@ NM	Edit distance
 MD	Mismatching positions/bases
 AS	Alignment score
 BC	Barcode sequence
+SA	Supplementary alignments
 _
 X0	Number of best hits
 X1	Number of suboptimal hits found by BWA
diff --git a/bwa.c b/bwa.c
index f8949f7..aec04d8 100644
--- a/bwa.c
+++ b/bwa.c
@@ -6,6 +6,7 @@
 #include "bwa.h"
 #include "ksw.h"
 #include "utils.h"
+#include "kstring.h"
 
 #ifdef USE_MALLOC_WRAPPERS
 #  include "malloc_wrap.h"
@@ -91,6 +92,8 @@ uint32_t *bwa_gen_cigar(const int8_t mat[25], int q, int r, int w_, int64_t l_pa
 	uint8_t tmp, *rseq;
 	int i;
 	int64_t rlen;
+	kstring_t str;
+
 	*n_cigar = 0; *NM = -1;
 	if (l_query <= 0 || rb >= re || (rb < l_pac && re > l_pac)) return 0; // reject if negative length or bridging the forward and reverse strand
 	rseq = bns_get_seq(l_pac, pac, rb, re, &rlen);
@@ -122,18 +125,31 @@ uint32_t *bwa_gen_cigar(const int8_t mat[25], int q, int r, int w_, int64_t l_pa
 		*score = ksw_global(l_query, query, rlen, rseq, 5, mat, q, r, w, n_cigar, &cigar);
 	}
 	{// compute NM
-		int k, x, y, n_mm = 0, n_gap = 0;
-		for (k = 0, x = y = 0; k < *n_cigar; ++k) {
-			int op  = cigar[k]&0xf;
-			int len = cigar[k]>>4;
+		int k, x, y, u, n_mm = 0, n_gap = 0;
+		str.l = str.m = *n_cigar * 4; str.s = (char*)cigar; // append MD to CIGAR
+		for (k = 0, x = y = u = 0; k < *n_cigar; ++k) {
+			int op, len;
+			cigar = (uint32_t*)str.s;
+			op  = cigar[k]&0xf, len = cigar[k]>>4;
 			if (op == 0) { // match
-				for (i = 0; i < len; ++i)
-					if (query[x + i] != rseq[y + i]) ++n_mm;
+				for (i = 0; i < len; ++i) {
+					if (query[x + i] != rseq[y + i]) {
+						kputw(u, &str); kputc("ACGTN"[rseq[y+i]], &str);
+						++n_mm; u = 0;
+					} else ++u;
+				}
 				x += len; y += len;
-			} else if (op == 1) x += len, n_gap += len;
-			else if (op == 2) y += len, n_gap += len;
+			} else if (op == 2) { // deletion
+				kputw(u, &str); kputc('^', &str);
+				for (i = 0; i < len; ++i)
+					kputc("ACGTN"[rseq[y+i]], &str);
+				u = 0;
+				y += len, n_gap += len;
+			} else if (op == 1) x += len, n_gap += len; // insertion
 		}
+		kputw(u, &str); kputc(0, &str);
 		*NM = n_mm + n_gap;
+		cigar = (uint32_t*)str.s;
 	}
 	if (rb >= l_pac) // reverse back query
 		for (i = 0; i < l_query>>1; ++i)
@@ -277,9 +293,11 @@ void bwa_idx_destroy(bwaidx_t *idx)
 void bwa_print_sam_hdr(const bntseq_t *bns, const char *rg_line)
 {
 	int i;
+	extern char *bwa_pg;
 	for (i = 0; i < bns->n_seqs; ++i)
 		err_printf("@SQ\tSN:%s\tLN:%d\n", bns->anns[i].name, bns->anns[i].len);
 	if (rg_line) err_printf("%s\n", rg_line);
+	err_printf("%s\n", bwa_pg);
 }
 
 static char *bwa_escape(char *s)
diff --git a/bwamem.c b/bwamem.c
index 801ea7e..6f77064 100644
--- a/bwamem.c
+++ b/bwamem.c
@@ -51,7 +51,7 @@ mem_opt_t *mem_opt_init()
 	o->T = 30;
 	o->zdrop = 100;
 	o->pen_unpaired = 17;
-	o->pen_clip = 5;
+	o->pen_clip5 = o->pen_clip3 = 5;
 	o->min_seed_len = 19;
 	o->split_width = 10;
 	o->max_occ = 10000;
@@ -63,6 +63,9 @@ mem_opt_t *mem_opt_init()
 	o->chunk_size = 10000000;
 	o->n_threads = 1;
 	o->max_matesw = 100;
+	o->mask_level_redun = 0.95;
+	o->mapQ_coef_len = 50; o->mapQ_coef_fac = log(o->mapQ_coef_len);
+//	o->mapQ_coef_len = o->mapQ_coef_fac = 0;
 	bwa_fill_scmat(o->a, o->b, o->mat);
 	return o;
 }
@@ -212,6 +215,7 @@ static void mem_insert_seed(const mem_opt_t *opt, int64_t l_pac, kbtree_t(chn) *
 				s.rbeg = tmp.pos = bwt_sa(itr->bwt, p->x[0] + k); // this is the base coordinate in the forward-reverse reference
 				s.qbeg = p->info>>32;
 				s.len  = slen;
+				if (bwa_verbose >= 5) printf("SEED l=%d,qb=%d,rb=%ld\n", s.len, s.qbeg, (long)s.rbeg);
 				if (s.rbeg < l_pac && l_pac < s.rbeg + s.len) continue; // bridging forward-reverse boundary; skip
 				if (kb_size(tree)) {
 					kb_intervalp(chn, tree, &tmp, &lower, &upper); // find the closest chain
@@ -228,12 +232,32 @@ static void mem_insert_seed(const mem_opt_t *opt, int64_t l_pac, kbtree_t(chn) *
 	}
 }
 
+int mem_chain_weight(const mem_chain_t *c)
+{
+	int64_t end;
+	int j, w = 0, tmp;
+	for (j = 0, end = 0; j < c->n; ++j) {
+		const mem_seed_t *s = &c->seeds[j];
+		if (s->qbeg >= end) w += s->len;
+		else if (s->qbeg + s->len > end) w += s->qbeg + s->len - end;
+		end = end > s->qbeg + s->len? end : s->qbeg + s->len;
+	}
+	tmp = w;
+	for (j = 0, end = 0; j < c->n; ++j) {
+		const mem_seed_t *s = &c->seeds[j];
+		if (s->rbeg >= end) w += s->len;
+		else if (s->rbeg + s->len > end) w += s->rbeg + s->len - end;
+		end = end > s->qbeg + s->len? end : s->qbeg + s->len;
+	}
+	return w < tmp? w : tmp;
+}
+
 void mem_print_chain(const bntseq_t *bns, mem_chain_v *chn)
 {
 	int i, j;
 	for (i = 0; i < chn->n; ++i) {
 		mem_chain_t *p = &chn->a[i];
-		err_printf("%d", p->n);
+		err_printf("CHAIN(%d) n=%d w=%d", i, p->n, mem_chain_weight(p));
 		for (j = 0; j < p->n; ++j) {
 			bwtint_t pos;
 			int is_rev, ref_id;
@@ -290,22 +314,8 @@ int mem_chain_flt(const mem_opt_t *opt, int n_chn, mem_chain_t *chains)
 	a = malloc(sizeof(flt_aux_t) * n_chn);
 	for (i = 0; i < n_chn; ++i) {
 		mem_chain_t *c = &chains[i];
-		int64_t end;
-		int w = 0, tmp;
-		for (j = 0, end = 0; j < c->n; ++j) {
-			const mem_seed_t *s = &c->seeds[j];
-			if (s->qbeg >= end) w += s->len;
-			else if (s->qbeg + s->len > end) w += s->qbeg + s->len - end;
-			end = end > s->qbeg + s->len? end : s->qbeg + s->len;
-		}
-		tmp = w;
-		for (j = 0, end = 0; j < c->n; ++j) {
-			const mem_seed_t *s = &c->seeds[j];
-			if (s->rbeg >= end) w += s->len;
-			else if (s->rbeg + s->len > end) w += s->rbeg + s->len - end;
-			end = end > s->qbeg + s->len? end : s->qbeg + s->len;
-		}
-		w = w < tmp? w : tmp;
+		int w;
+		w = mem_chain_weight(c);
 		a[i].beg = c->seeds[0].qbeg;
 		a[i].end = c->seeds[c->n-1].qbeg + c->seeds[c->n-1].len;
 		a[i].w = w; a[i].p = c; a[i].p2 = 0;
@@ -363,13 +373,45 @@ int mem_chain_flt(const mem_opt_t *opt, int n_chn, mem_chain_t *chains)
  * De-overlap single-end hits *
  ******************************/
 
+#define alnreg_slt2(a, b) ((a).re < (b).re)
+KSORT_INIT(mem_ars2, mem_alnreg_t, alnreg_slt2)
+
 #define alnreg_slt(a, b) ((a).score > (b).score || ((a).score == (b).score && ((a).rb < (b).rb || ((a).rb == (b).rb && (a).qb < (b).qb))))
 KSORT_INIT(mem_ars, mem_alnreg_t, alnreg_slt)
 
-int mem_sort_and_dedup(int n, mem_alnreg_t *a)
+#define alnreg_hlt(a, b) ((a).score > (b).score || ((a).score == (b).score && (a).hash < (b).hash))
+KSORT_INIT(mem_ars_hash, mem_alnreg_t, alnreg_hlt)
+
+int mem_sort_and_dedup(int n, mem_alnreg_t *a, float mask_level_redun)
 {
-	int m, i;
+	int m, i, j;
 	if (n <= 1) return n;
+	ks_introsort(mem_ars2, n, a);
+	for (i = 1; i < n; ++i) {
+		mem_alnreg_t *p = &a[i];
+		if (p->rb >= a[i-1].re) continue;
+		for (j = i - 1; j >= 0 && p->rb < a[j].re; --j) {
+			mem_alnreg_t *q = &a[j];
+			int64_t or, oq, mr, mq;
+			if (q->qe == q->qb) continue; // a[j] has been excluded
+			or = q->re - p->rb; // overlap length on the reference
+			oq = q->qb < p->qb? q->qe - p->qb : p->qe - q->qb; // overlap length on the query
+			mr = q->re - q->rb < p->re - p->rb? q->re - q->rb : p->re - p->rb; // min ref len in alignment
+			mq = q->qe - q->qb < p->qe - p->qb? q->qe - q->qb : p->qe - p->qb; // min qry len in alignment
+			if (or > mask_level_redun * mr && oq > mask_level_redun * mq) { // one of the hits is redundant
+				if (p->score < q->score) {
+					p->qe = p->qb;
+					break;
+				} else q->qe = q->qb;
+			}
+		}
+	}
+	for (i = 0, m = 0; i < n; ++i) // exclude identical hits
+		if (a[i].qe > a[i].qb) {
+			if (m != i) a[m++] = a[i];
+			else ++m;
+		}
+	n = m;
 	ks_introsort(mem_ars, n, a);
 	for (i = 1; i < n; ++i) { // mark identical hits
 		if (a[i].score == a[i-1].score && a[i].rb == a[i-1].rb && a[i].qb == a[i-1].qb)
@@ -383,13 +425,14 @@ int mem_sort_and_dedup(int n, mem_alnreg_t *a)
 	return m;
 }
 
-void mem_mark_primary_se(const mem_opt_t *opt, int n, mem_alnreg_t *a) // IMPORTANT: must run mem_sort_and_dedup() before calling this function
+void mem_mark_primary_se(const mem_opt_t *opt, int n, mem_alnreg_t *a, int64_t id) // IMPORTANT: must run mem_sort_and_dedup() before calling this function
 { // similar to the loop in mem_chain_flt()
 	int i, k, tmp;
 	kvec_t(int) z;
 	if (n == 0) return;
 	kv_init(z);
-	for (i = 0; i < n; ++i) a[i].sub = 0, a[i].secondary = -1;
+	for (i = 0; i < n; ++i) a[i].sub = 0, a[i].secondary = -1, a[i].hash = hash_64(id+i);
+	ks_introsort(mem_ars_hash, n, a);
 	tmp = opt->a + opt->b > opt->q + opt->r? opt->a + opt->b : opt->q + opt->r;
 	kv_push(int, z, 0);
 	for (i = 1; i < n; ++i) {
@@ -475,7 +518,7 @@ int mem_chain2aln_short(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac,
 	a.score = x.score;
 	a.csub = x.score2;
 	kv_push(mem_alnreg_t, *av, a);
-	if (bwa_verbose >= 4) printf("SHORT: [%d,%d) <=> [%ld,%ld)\n", a.qb, a.qe, (long)a.rb, (long)a.re);
+	if (bwa_verbose >= 4) printf("chain2aln(short): [%d,%d) <=> [%ld,%ld)\n", a.qb, a.qe, (long)a.rb, (long)a.re);
 	return 0;
 }
 
@@ -561,6 +604,7 @@ void mem_chain2aln(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, int
 		a->w = aw[0] = aw[1] = opt->w;
 		a->score = a->truesc = -1;
 
+		if (bwa_verbose >= 4) err_printf("Extending from seed [%ld,%ld,%ld]\n", (long)s->len, (long)s->qbeg, (long)s->rbeg);
 		if (s->qbeg) { // left extension
 			uint8_t *rs, *qs;
 			int qle, tle, gtle, gscore;
@@ -572,12 +616,12 @@ void mem_chain2aln(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, int
 			for (i = 0; i < MAX_BAND_TRY; ++i) {
 				int prev = a->score;
 				aw[0] = opt->w << i;
-				a->score = ksw_extend(s->qbeg, qs, tmp, rs, 5, opt->mat, opt->q, opt->r, aw[0], opt->pen_clip, opt->zdrop, s->len * opt->a, &qle, &tle, &gtle, &gscore, &max_off[0]);
+				a->score = ksw_extend(s->qbeg, qs, tmp, rs, 5, opt->mat, opt->q, opt->r, aw[0], opt->pen_clip5, opt->zdrop, s->len * opt->a, &qle, &tle, &gtle, &gscore, &max_off[0]);
 				if (bwa_verbose >= 4) { printf("L\t%d < %d; w=%d; max_off=%d\n", prev, a->score, aw[0], max_off[0]); fflush(stdout); }
 				if (a->score == prev || max_off[0] < (aw[0]>>1) + (aw[0]>>2)) break;
 			}
 			// check whether we prefer to reach the end of the query
-			if (gscore <= 0 || gscore <= a->score - opt->pen_clip) { // local extension
+			if (gscore <= 0 || gscore <= a->score - opt->pen_clip5) { // local extension
 				a->qb = s->qbeg - qle, a->rb = s->rbeg - tle;
 				a->truesc = a->score;
 			} else { // to-end extension
@@ -595,12 +639,12 @@ void mem_chain2aln(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, int
 			for (i = 0; i < MAX_BAND_TRY; ++i) {
 				int prev = a->score;
 				aw[1] = opt->w << i;
-				a->score = ksw_extend(l_query - qe, query + qe, rmax[1] - rmax[0] - re, rseq + re, 5, opt->mat, opt->q, opt->r, aw[1], opt->pen_clip, opt->zdrop, sc0, &qle, &tle, &gtle, &gscore, &max_off[1]);
+				a->score = ksw_extend(l_query - qe, query + qe, rmax[1] - rmax[0] - re, rseq + re, 5, opt->mat, opt->q, opt->r, aw[1], opt->pen_clip3, opt->zdrop, sc0, &qle, &tle, &gtle, &gscore, &max_off[1]);
 				if (bwa_verbose >= 4) { printf("R\t%d < %d; w=%d; max_off=%d\n", prev, a->score, aw[1], max_off[1]); fflush(stdout); }
 				if (a->score == prev || max_off[1] < (aw[1]>>1) + (aw[1]>>2)) break;
 			}
 			// similar to the above
-			if (gscore <= 0 || gscore <= a->score - opt->pen_clip) { // local extension
+			if (gscore <= 0 || gscore <= a->score - opt->pen_clip3) { // local extension
 				a->qe = qe + qle, a->re = rmax[0] + re + tle;
 				a->truesc += a->score - sc0;
 			} else { // to-end extension
@@ -728,7 +772,10 @@ void mem_aln2sam(const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const m
 	}
 
 	// print optional tags
-	if (p->n_cigar) { kputsn("\tNM:i:", 6, str); kputw(p->NM, str); }
+	if (p->n_cigar) {
+		kputsn("\tNM:i:", 6, str); kputw(p->NM, str);
+		kputsn("\tMD:Z:", 6, str); kputs((char*)(p->cigar + p->n_cigar), str);
+	}
 	if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); }
 	if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); }
 	if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); }
@@ -768,9 +815,18 @@ int mem_approx_mapq_se(const mem_opt_t *opt, const mem_alnreg_t *a)
 	sub = a->csub > sub? a->csub : sub;
 	if (sub >= a->score) return 0;
 	l = a->qe - a->qb > a->re - a->rb? a->qe - a->qb : a->re - a->rb;
-	mapq = a->score? (int)(MEM_MAPQ_COEF * (1. - (double)sub / a->score) * log(a->seedcov) + .499) : 0;
 	identity = 1. - (double)(l * opt->a - a->score) / (opt->a + opt->b) / l;
-	mapq = identity < 0.95? (int)(mapq * identity * identity + .499) : mapq;
+	if (a->score == 0) {
+		mapq = 0;
+	} else if (opt->mapQ_coef_len > 0) {
+		double tmp;
+		tmp = l < opt->mapQ_coef_len? 1. : opt->mapQ_coef_fac / log(l);
+		tmp *= identity * identity;
+		mapq = (int)(6.02 * (a->score - sub) / opt->a * tmp * tmp + .499);
+	} else {
+		mapq = (int)(MEM_MAPQ_COEF * (1. - (double)sub / a->score) * log(a->seedcov) + .499);
+		mapq = identity < 0.95? (int)(mapq * identity * identity + .499) : mapq;
+	}
 	if (a->sub_n > 0) mapq -= (int)(4.343 * log(a->sub_n+1) + .499);
 	if (mapq > 60) mapq = 60;
 	if (mapq < 0) mapq = 0;
@@ -831,12 +887,13 @@ mem_alnreg_v mem_align1_core(const mem_opt_t *opt, const bwt_t *bwt, const bntse
 	for (i = 0; i < chn.n; ++i) {
 		mem_chain_t *p = &chn.a[i];
 		int ret;
+		if (bwa_verbose >= 4) err_printf("===> Processing chain(%d) <===\n", i);
 		ret = mem_chain2aln_short(opt, bns->l_pac, pac, l_seq, (uint8_t*)seq, p, &regs);
 		if (ret > 0) mem_chain2aln(opt, bns->l_pac, pac, l_seq, (uint8_t*)seq, p, &regs);
 		free(chn.a[i].seeds);
 	}
 	free(chn.a);
-	regs.n = mem_sort_and_dedup(regs.n, regs.a);
+	regs.n = mem_sort_and_dedup(regs.n, regs.a, opt->mask_level_redun);
 	return regs;
 }
 
@@ -847,7 +904,7 @@ mem_alnreg_v mem_align1(const mem_opt_t *opt, const bwt_t *bwt, const bntseq_t *
 	seq = malloc(l_seq);
 	memcpy(seq, seq_, l_seq); // makes a copy of seq_
 	ar = mem_align1_core(opt, bwt, bns, pac, l_seq, seq);
-	mem_mark_primary_se(opt, ar.n, ar.a);
+	mem_mark_primary_se(opt, ar.n, ar.a, lrand48());
 	free(seq);
 	return ar;
 }
@@ -856,7 +913,7 @@ mem_alnreg_v mem_align1(const mem_opt_t *opt, const bwt_t *bwt, const bntseq_t *
 mem_aln_t mem_reg2aln(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, int l_query, const char *query_, const mem_alnreg_t *ar)
 {
 	mem_aln_t a;
-	int i, w2, qb, qe, NM, score, is_rev;
+	int i, w2, qb, qe, NM, score, is_rev, last_sc = -(1<<30), l_MD;
 	int64_t pos, rb, re;
 	uint8_t *query;
 
@@ -877,29 +934,46 @@ mem_aln_t mem_reg2aln(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *
 		exit(1);
 	}
 	w2 = infer_bw(qe - qb, re - rb, ar->truesc, opt->a, opt->q, opt->r);
-	w2 = w2 < opt->w? w2 : opt->w;
-	a.cigar = bwa_gen_cigar(opt->mat, opt->q, opt->r, w2, bns->l_pac, pac, qe - qb, (uint8_t*)&query[qb], rb, re, &score, &a.n_cigar, &NM);
+	if (bwa_verbose >= 4) printf("Band width: infer=%d, opt=%d, alnreg=%d\n", w2, opt->w, ar->w);
+	if (w2 > opt->w) w2 = w2 < ar->w? w2 : ar->w;
+	else w2 = opt->w;
+	i = 0; a.cigar = 0;
+	do {
+		free(a.cigar);
+		a.cigar = bwa_gen_cigar(opt->mat, opt->q, opt->r, w2, bns->l_pac, pac, qe - qb, (uint8_t*)&query[qb], rb, re, &score, &a.n_cigar, &NM);
+		if (bwa_verbose >= 4) printf("Final alignment: w2=%d, global_sc=%d, local_sc=%d\n", w2, score, ar->truesc);
+		if (score == last_sc) break; // it is possible that global alignment and local alignment give different scores
+		last_sc = score;
+		w2 <<= 1;
+	} while (++i < 3 && score < ar->truesc - opt->a);
+	l_MD = strlen((char*)(a.cigar + a.n_cigar)) + 1;
 	a.NM = NM;
 	pos = bns_depos(bns, rb < bns->l_pac? rb : re - 1, &is_rev);
 	a.is_rev = is_rev;
-	if (a.n_cigar > 0) {
+	if (a.n_cigar > 0) { // squeeze out leading or trailing deletions
 		if ((a.cigar[0]&0xf) == 2) {
 			pos += a.cigar[0]>>4;
 			--a.n_cigar;
-			memmove(a.cigar, a.cigar + 1, a.n_cigar * 4);
-		} else if ((a.cigar[a.n_cigar-1]&0xf) == 2) --a.n_cigar;
+			memmove(a.cigar, a.cigar + 1, a.n_cigar * 4 + l_MD);
+		} else if ((a.cigar[a.n_cigar-1]&0xf) == 2) {
+			--a.n_cigar;
+			memmove(a.cigar + a.n_cigar, a.cigar + a.n_cigar + 1, l_MD); // MD needs to be moved accordingly
+		}
 	}
 	if (qb != 0 || qe != l_query) { // add clipping to CIGAR
 		int clip5, clip3;
 		clip5 = is_rev? l_query - qe : qb;
 		clip3 = is_rev? qb : l_query - qe;
-		a.cigar = realloc(a.cigar, 4 * (a.n_cigar + 2));
+		a.cigar = realloc(a.cigar, 4 * (a.n_cigar + 2) + l_MD);
 		if (clip5) {
-			memmove(a.cigar+1, a.cigar, a.n_cigar * 4);
+			memmove(a.cigar+1, a.cigar, a.n_cigar * 4 + l_MD); // make room for 5'-end clipping
 			a.cigar[0] = clip5<<4 | 3;
 			++a.n_cigar;
 		}
-		if (clip3) a.cigar[a.n_cigar++] = clip3<<4 | 3;
+		if (clip3) {
+			memmove(a.cigar + a.n_cigar + 1, a.cigar + a.n_cigar, l_MD); // make room for 3'-end clipping
+			a.cigar[a.n_cigar++] = clip3<<4 | 3;
+		}
 	}
 	a.rid = bns_pos2rid(bns, pos);
 	a.pos = pos - bns->anns[a.rid].offset;
@@ -909,7 +983,6 @@ mem_aln_t mem_reg2aln(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *
 }
 
 typedef struct {
-	int start, step, n;
 	const mem_opt_t *opt;
 	const bwt_t *bwt;
 	const bntseq_t *bns;
@@ -917,86 +990,50 @@ typedef struct {
 	const mem_pestat_t *pes;
 	bseq1_t *seqs;
 	mem_alnreg_v *regs;
+	int64_t n_processed;
 } worker_t;
 
-static void *worker1(void *data)
+static void worker1(void *data, int i, int tid)
 {
 	worker_t *w = (worker_t*)data;
-	int i;
 	if (!(w->opt->flag&MEM_F_PE)) {
-		for (i = w->start; i < w->n; i += w->step)
-			w->regs[i] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i].l_seq, w->seqs[i].seq);
-	} else { // for PE we align the two ends in the same thread in case the 2nd read is of worse quality, in which case some threads may be faster/slower
-		for (i = w->start; i < w->n>>1; i += w->step) {
-			w->regs[i<<1|0] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|0].l_seq, w->seqs[i<<1|0].seq);
-			w->regs[i<<1|1] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|1].l_seq, w->seqs[i<<1|1].seq);
-		}
+		w->regs[i] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i].l_seq, w->seqs[i].seq);
+	} else {
+		w->regs[i<<1|0] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|0].l_seq, w->seqs[i<<1|0].seq);
+		w->regs[i<<1|1] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|1].l_seq, w->seqs[i<<1|1].seq);
 	}
-	return 0;
 }
 
-static void *worker2(void *data)
+static void worker2(void *data, int i, int tid)
 {
 	extern int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, const mem_pestat_t pes[4], uint64_t id, bseq1_t s[2], mem_alnreg_v a[2]);
 	worker_t *w = (worker_t*)data;
-	int i;
 	if (!(w->opt->flag&MEM_F_PE)) {
-		for (i = w->start; i < w->n; i += w->step) {
-			mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a);
-			mem_reg2sam_se(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0);
-			free(w->regs[i].a);
-		}
+		mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a, w->n_processed + i);
+		mem_reg2sam_se(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0);
+		free(w->regs[i].a);
 	} else {
-		int n = 0;
-		for (i = w->start; i < w->n>>1; i += w->step) { // not implemented yet
-			n += mem_sam_pe(w->opt, w->bns, w->pac, w->pes, i, &w->seqs[i<<1], &w->regs[i<<1]);
-			free(w->regs[i<<1|0].a); free(w->regs[i<<1|1].a);
-		}
-		fprintf(stderr, "[M::%s@%d] performed mate-SW for %d reads\n", __func__, w->start, n);
+		mem_sam_pe(w->opt, w->bns, w->pac, w->pes, (w->n_processed>>1) + i, &w->seqs[i<<1], &w->regs[i<<1]);
+		free(w->regs[i<<1|0].a); free(w->regs[i<<1|1].a);
 	}
-	return 0;
 }
 
-void mem_process_seqs(const mem_opt_t *opt, const bwt_t *bwt, const bntseq_t *bns, const uint8_t *pac, int n, bseq1_t *seqs, const mem_pestat_t *pes0)
+void mem_process_seqs(const mem_opt_t *opt, const bwt_t *bwt, const bntseq_t *bns, const uint8_t *pac, int64_t n_processed, int n, bseq1_t *seqs, const mem_pestat_t *pes0)
 {
-	int i;
-	worker_t *w;
+	extern void kt_for(int n_threads, void (*func)(void*,int,int), void *data, int n);
+	worker_t w;
 	mem_alnreg_v *regs;
 	mem_pestat_t pes[4];
 
-	w = calloc(opt->n_threads, sizeof(worker_t));
 	regs = malloc(n * sizeof(mem_alnreg_v));
-	for (i = 0; i < opt->n_threads; ++i) {
-		worker_t *p = &w[i];
-		p->start = i; p->step = opt->n_threads; p->n = n;
-		p->opt = opt; p->bwt = bwt; p->bns = bns; p->pac = pac;
-		p->seqs = seqs; p->regs = regs;
-		p->pes = &pes[0];
-	}
-
-#ifdef HAVE_PTHREAD
-	if (opt->n_threads == 1) {
-#endif
-		worker1(w);
-		if (opt->flag&MEM_F_PE) { // paired-end mode
-			if (pes0) memcpy(pes, pes0, 4 * sizeof(mem_pestat_t)); // if pes0 != NULL, set the insert-size distribution as pes0
-			else mem_pestat(opt, bns->l_pac, n, regs, pes); // otherwise, infer the insert size distribution from data
-		}
-		worker2(w);
-#ifdef HAVE_PTHREAD
-	} else {
-		pthread_t *tid;
-		tid = (pthread_t*)calloc(opt->n_threads, sizeof(pthread_t));
-		for (i = 0; i < opt->n_threads; ++i) pthread_create(&tid[i], 0, worker1, &w[i]);
-		for (i = 0; i < opt->n_threads; ++i) pthread_join(tid[i], 0);
-		if (opt->flag&MEM_F_PE) {
-			if (pes0) memcpy(pes, pes0, 4 * sizeof(mem_pestat_t));
-			else mem_pestat(opt, bns->l_pac, n, regs, pes);
-		}
-		for (i = 0; i < opt->n_threads; ++i) pthread_create(&tid[i], 0, worker2, &w[i]);
-		for (i = 0; i < opt->n_threads; ++i) pthread_join(tid[i], 0);
-		free(tid);
+	w.opt = opt; w.bwt = bwt; w.bns = bns; w.pac = pac;
+	w.seqs = seqs; w.regs = regs; w.n_processed = n_processed;
+	w.pes = &pes[0];
+	kt_for(opt->n_threads, worker1, &w, (opt->flag&MEM_F_PE)? n>>1 : n); // find mapping positions
+	if (opt->flag&MEM_F_PE) { // infer insert sizes if not provided
+		if (pes0) memcpy(pes, pes0, 4 * sizeof(mem_pestat_t)); // if pes0 != NULL, set the insert-size distribution as pes0
+		else mem_pestat(opt, bns->l_pac, n, regs, pes); // otherwise, infer the insert size distribution from data
 	}
-#endif
-	free(regs); free(w);
+	kt_for(opt->n_threads, worker2, &w, (opt->flag&MEM_F_PE)? n>>1 : n); // generate alignment
+	free(regs);
 }
diff --git a/bwamem.h b/bwamem.h
index be1862a..8b24c51 100644
--- a/bwamem.h
+++ b/bwamem.h
@@ -20,7 +20,7 @@ typedef struct __smem_i smem_i;
 typedef struct {
 	int a, b, q, r;         // match score, mismatch penalty and gap open/extension penalty. A gap of size k costs q+k*r
 	int pen_unpaired;       // phred-scaled penalty for unpaired reads
-	int pen_clip;           // clipping penalty. This score is not deducted from the DP score.
+	int pen_clip5,pen_clip3;// clipping penalty. This score is not deducted from the DP score.
 	int w;                  // band width
 	int zdrop;              // Z-dropoff
 
@@ -35,6 +35,9 @@ typedef struct {
 	int chunk_size;         // process chunk_size-bp sequences in a batch
 	float mask_level;       // regard a hit as redundant if the overlap with another better hit is over mask_level times the min length of the two hits
 	float chain_drop_ratio; // drop a chain if its seed coverage is below chain_drop_ratio times the seed coverage of a better chain overlapping with the small chain
+	float mask_level_redun;
+	float mapQ_coef_len;
+	int mapQ_coef_fac;
 	int max_ins;            // when estimating insert size distribution, skip pairs with insert longer than this value
 	int max_matesw;         // perform maximally max_matesw rounds of mate-SW for each end
 	int8_t mat[25];         // scoring matrix; mat[0] == 0 if unset
@@ -51,6 +54,7 @@ typedef struct {
 	int w;          // actual band width used in extension
 	int seedcov;    // length of regions coverged by seeds
 	int secondary;  // index of the parent hit shadowing the current hit; <0 if primary
+	uint64_t hash;
 } mem_alnreg_t;
 
 typedef struct { size_t n, m; mem_alnreg_t *a; } mem_alnreg_v;
@@ -106,7 +110,7 @@ extern "C" {
 	 * @param pes0   insert-size info; if NULL, infer from data; if not NULL, it should be an array with 4 elements,
 	 *               corresponding to each FF, FR, RF and RR orientation. See mem_pestat() for more info.
 	 */
-	void mem_process_seqs(const mem_opt_t *opt, const bwt_t *bwt, const bntseq_t *bns, const uint8_t *pac, int n, bseq1_t *seqs, const mem_pestat_t *pes0);
+	void mem_process_seqs(const mem_opt_t *opt, const bwt_t *bwt, const bntseq_t *bns, const uint8_t *pac, int64_t n_processed, int n, bseq1_t *seqs, const mem_pestat_t *pes0);
 
 	/**
 	 * Find the aligned regions for one query sequence
diff --git a/bwamem_pair.c b/bwamem_pair.c
index 06aacff..f1aa73a 100644
--- a/bwamem_pair.c
+++ b/bwamem_pair.c
@@ -108,6 +108,7 @@ void mem_pestat(const mem_opt_t *opt, int64_t l_pac, int n, const mem_alnreg_v *
 
 int mem_matesw(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, const mem_pestat_t pes[4], const mem_alnreg_t *a, int l_ms, const uint8_t *ms, mem_alnreg_v *ma)
 {
+	extern int mem_sort_and_dedup(int n, mem_alnreg_t *a, float mask_level_redun);
 	int i, r, skip[4], n = 0;
 	for (r = 0; r < 4; ++r)
 		skip[r] = pes[r].failed? 1 : 0;
@@ -146,7 +147,7 @@ int mem_matesw(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, const me
 			int tmp, xtra = KSW_XSUBO | KSW_XSTART | (l_ms * opt->a < 250? KSW_XBYTE : 0) | opt->min_seed_len;
 			aln = ksw_align(l_ms, seq, len, ref, 5, opt->mat, opt->q, opt->r, xtra, 0);
 			memset(&b, 0, sizeof(mem_alnreg_t));
-			if (aln.score >= opt->min_seed_len) {
+			if (aln.score >= opt->min_seed_len && aln.qb >= 0) { // something goes wrong if aln.qb < 0
 				b.qb = is_rev? l_ms - (aln.qe + 1) : aln.qb;                                                                                                                                                                              
 				b.qe = is_rev? l_ms - aln.qb : aln.qe + 1; 
 				b.rb = is_rev? (l_pac<<1) - (rb + aln.te + 1) : rb + aln.tb;
@@ -166,6 +167,7 @@ int mem_matesw(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, const me
 			}
 			++n;
 		}
+		if (n) ma->n = mem_sort_and_dedup(ma->n, ma->a, opt->mask_level_redun);
 		if (rev) free(rev);
 		free(ref);
 	}
@@ -233,7 +235,7 @@ int mem_pair(const mem_opt_t *opt, int64_t l_pac, const uint8_t *pac, const mem_
 
 int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, const mem_pestat_t pes[4], uint64_t id, bseq1_t s[2], mem_alnreg_v a[2])
 {
-	extern void mem_mark_primary_se(const mem_opt_t *opt, int n, mem_alnreg_t *a);
+	extern void mem_mark_primary_se(const mem_opt_t *opt, int n, mem_alnreg_t *a, int64_t id);
 	extern int mem_approx_mapq_se(const mem_opt_t *opt, const mem_alnreg_t *a);
 	extern void mem_reg2sam_se(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, bseq1_t *s, mem_alnreg_v *a, int extra_flag, const mem_aln_t *m);
 	extern void mem_aln2sam(const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m);
@@ -255,8 +257,8 @@ int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, co
 				n += mem_matesw(opt, bns->l_pac, pac, pes, &b[i].a[j], s[!i].l_seq, (uint8_t*)s[!i].seq, &a[!i]);
 		free(b[0].a); free(b[1].a);
 	}
-	mem_mark_primary_se(opt, a[0].n, a[0].a);
-	mem_mark_primary_se(opt, a[1].n, a[1].a);
+	mem_mark_primary_se(opt, a[0].n, a[0].a, id<<1|0);
+	mem_mark_primary_se(opt, a[1].n, a[1].a, id<<1|1);
 	if (opt->flag&MEM_F_NOPAIRING) goto no_pairing;
 	// pairing single-end hits
 	if (a[0].n && a[1].n && (o = mem_pair(opt, bns->l_pac, pac, pes, s, a, id, &subo, &n_sub, z)) > 0) {
@@ -301,6 +303,7 @@ int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, co
 		h[1] = mem_reg2aln(opt, bns, pac, s[1].l_seq, s[1].seq, &a[1].a[z[1]]); h[1].mapq = q_se[1]; h[1].flag |= 0x80 | extra_flag;
 		mem_aln2sam(bns, &str, &s[0], 1, &h[0], 0, &h[1]); s[0].sam = strdup(str.s); str.l = 0;
 		mem_aln2sam(bns, &str, &s[1], 1, &h[1], 0, &h[0]); s[1].sam = str.s;
+		if (strcmp(s[0].name, s[1].name) != 0) err_fatal(__func__, "paired reads have different names: \"%s\", \"%s\"\n", s[0].name, s[1].name);
 		free(h[0].cigar); free(h[1].cigar);
 	} else goto no_pairing;
 	return n;
@@ -319,6 +322,7 @@ no_pairing:
 	}
 	mem_reg2sam_se(opt, bns, pac, &s[0], &a[0], 0x41|extra_flag, &h[1]);
 	mem_reg2sam_se(opt, bns, pac, &s[1], &a[1], 0x81|extra_flag, &h[0]);
+	if (strcmp(s[0].name, s[1].name) != 0) err_fatal(__func__, "paired reads have different names: \"%s\", \"%s\"\n", s[0].name, s[1].name);
 	free(h[0].cigar); free(h[1].cigar);
 	return n;
 }
diff --git a/bwape.c b/bwape.c
index 08490e7..82fc50b 100644
--- a/bwape.c
+++ b/bwape.c
@@ -49,7 +49,6 @@ int bwa_approx_mapQ(const bwa_seq_t *p, int mm);
 void bwa_print_sam1(const bntseq_t *bns, bwa_seq_t *p, const bwa_seq_t *mate, int mode, int max_top2);
 bntseq_t *bwa_open_nt(const char *prefix);
 void bwa_print_sam_SQ(const bntseq_t *bns);
-void bwa_print_sam_PG();
 
 pe_opt_t *bwa_init_pe_opt()
 {
@@ -671,7 +670,6 @@ void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const f
 
 	// core loop
 	bwa_print_sam_hdr(bns, rg_line);
-	bwa_print_sam_PG();
 	while ((seqs[0] = bwa_read_seq(ks[0], 0x40000, &n_seqs, opt0.mode, opt0.trim_qual)) != 0) {
 		int cnt_chg;
 		isize_info_t ii;
@@ -706,6 +704,7 @@ void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const f
 			}
 			bwa_print_sam1(bns, p[0], p[1], opt.mode, opt.max_top2);
 			bwa_print_sam1(bns, p[1], p[0], opt.mode, opt.max_top2);
+			if (strcmp(p[0]->name, p[1]->name) != 0) err_fatal(__func__, "paired reads have different names: \"%s\", \"%s\"\n", p[0]->name, p[1]->name);
 		}
 		fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); t = clock();
 
diff --git a/bwase.c b/bwase.c
index 746add5..30f306e 100644
--- a/bwase.c
+++ b/bwase.c
@@ -19,8 +19,6 @@
 
 int g_log_n[256];
 
-void bwa_print_sam_PG();
-
 void bwa_aln2seq_core(int n_aln, const bwt_aln1_t *aln, bwa_seq_t *s, int set_main, int n_multi)
 {
 	int i, cnt, best;
@@ -180,7 +178,7 @@ bwa_cigar_t *bwa_refine_gapped_core(bwtint_t l_pac, const ubyte_t *pacseq, int l
 	assert(re <= l_pac);
 	rseq = bns_get_seq(l_pac, pacseq, rb, re, &rlen);
 	assert(re - rb == rlen);
-	ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW, n_cigar, &cigar32);
+	ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > abs(rlen - len) * 1.5? SW_BW : abs(rlen - len) * 1.5, n_cigar, &cigar32);
 	assert(*n_cigar > 0);
 	if ((cigar32[*n_cigar - 1]&0xf) == 1) cigar32[*n_cigar - 1] = (cigar32[*n_cigar - 1]>>4<<4) | 3; // change endding ins to soft clipping
 	if ((cigar32[0]&0xf) == 1) cigar32[0] = (cigar32[0]>>4<<4) | 3; // change beginning ins to soft clipping
@@ -530,7 +528,6 @@ void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_f
 	}
 	err_fread_noeof(&opt, sizeof(gap_opt_t), 1, fp_sa);
 	bwa_print_sam_hdr(bns, rg_line);
-	//bwa_print_sam_PG();
 	// set ks
 	ks = bwa_open_reads(opt.mode, fn_fa);
 	// core loop
diff --git a/fastmap.c b/fastmap.c
index b00ec00..40cea8c 100644
--- a/fastmap.c
+++ b/fastmap.c
@@ -2,6 +2,8 @@
 #include <stdio.h>
 #include <unistd.h>
 #include <stdlib.h>
+#include <ctype.h>
+#include <math.h>
 #include "bwa.h"
 #include "bwamem.h"
 #include "kvec.h"
@@ -25,9 +27,10 @@ int main_mem(int argc, char *argv[])
 	bwaidx_t *idx;
 	char *rg_line = 0;
 	void *ko = 0, *ko2 = 0;
+	int64_t n_processed = 0;
 
 	opt = mem_opt_init();
-	while ((c = getopt(argc, argv, "paMCSPHk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:")) >= 0) {
+	while ((c = getopt(argc, argv, "paMCSPHk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:")) >= 0) {
 		if (c == 'k') opt->min_seed_len = atoi(optarg);
 		else if (c == 'w') opt->w = atoi(optarg);
 		else if (c == 'A') opt->a = atoi(optarg);
@@ -35,7 +38,6 @@ int main_mem(int argc, char *argv[])
 		else if (c == 'O') opt->q = atoi(optarg);
 		else if (c == 'E') opt->r = atoi(optarg);
 		else if (c == 'T') opt->T = atoi(optarg);
-		else if (c == 'L') opt->pen_clip = atoi(optarg);
 		else if (c == 'U') opt->pen_unpaired = atoi(optarg);
 		else if (c == 't') opt->n_threads = atoi(optarg), opt->n_threads = opt->n_threads > 1? opt->n_threads : 1;
 		else if (c == 'P') opt->flag |= MEM_F_NOPAIRING;
@@ -47,14 +49,23 @@ int main_mem(int argc, char *argv[])
 		else if (c == 'd') opt->zdrop = atoi(optarg);
 		else if (c == 'v') bwa_verbose = atoi(optarg);
 		else if (c == 'r') opt->split_factor = atof(optarg);
+		else if (c == 'D') opt->chain_drop_ratio = atof(optarg);
 		else if (c == 'C') copy_comment = 1;
-		else if (c == 'R') {
+		else if (c == 'Q') {
+			opt->mapQ_coef_len = atoi(optarg);
+			opt->mapQ_coef_fac = opt->mapQ_coef_len > 0? log(opt->mapQ_coef_len) : 0;
+		} else if (c == 'L') {
+			char *p;
+			opt->pen_clip5 = opt->pen_clip3 = strtol(optarg, &p, 10);
+			if (*p != 0 && ispunct(*p) && isdigit(p[1]))
+				opt->pen_clip3 = strtol(p+1, &p, 10);
+		} else if (c == 'R') {
 			if ((rg_line = bwa_set_rg(optarg)) == 0) return 1; // FIXME: memory leak
 		} else if (c == 's') opt->split_width = atoi(optarg);
 		else return 1;
 	}
 	if (opt->n_threads < 1) opt->n_threads = 1;
-	if (optind + 1 >= argc) {
+	if (optind + 1 >= argc || optind + 3 < argc) {
 		fprintf(stderr, "\n");
 		fprintf(stderr, "Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]\n\n");
 		fprintf(stderr, "Algorithm options:\n\n");
@@ -65,13 +76,14 @@ int main_mem(int argc, char *argv[])
 		fprintf(stderr, "       -r FLOAT   look for internal seeds inside a seed longer than {-k} * FLOAT [%g]\n", opt->split_factor);
 //		fprintf(stderr, "       -s INT     look for internal seeds inside a seed with less than INT occ [%d]\n", opt->split_width);
 		fprintf(stderr, "       -c INT     skip seeds with more than INT occurrences [%d]\n", opt->max_occ);
+		fprintf(stderr, "       -D FLOAT   drop chains shorter than FLOAT fraction of the longest overlapping chain [%.2f]\n", opt->chain_drop_ratio);
 		fprintf(stderr, "       -S         skip mate rescue\n");
 		fprintf(stderr, "       -P         skip pairing; mate rescue performed unless -S also in use\n");
 		fprintf(stderr, "       -A INT     score for a sequence match [%d]\n", opt->a);
 		fprintf(stderr, "       -B INT     penalty for a mismatch [%d]\n", opt->b);
 		fprintf(stderr, "       -O INT     gap open penalty [%d]\n", opt->q);
 		fprintf(stderr, "       -E INT     gap extension penalty; a gap of size k cost {-O} + {-E}*k [%d]\n", opt->r);
-		fprintf(stderr, "       -L INT     penalty for clipping [%d]\n", opt->pen_clip);
+		fprintf(stderr, "       -L INT     penalty for clipping [%d]\n", opt->pen_clip5);
 		fprintf(stderr, "       -U INT     penalty for an unpaired read pair [%d]\n", opt->pen_unpaired);
 		fprintf(stderr, "\nInput/output options:\n\n");
 		fprintf(stderr, "       -p         first query file consists of interleaved paired-end sequences\n");
@@ -128,7 +140,8 @@ int main_mem(int argc, char *argv[])
 		for (i = 0; i < n; ++i) size += seqs[i].l_seq;
 		if (bwa_verbose >= 3)
 			fprintf(stderr, "[M::%s] read %d sequences (%ld bp)...\n", __func__, n, (long)size);
-		mem_process_seqs(opt, idx->bwt, idx->bns, idx->pac, n, seqs, 0);
+		mem_process_seqs(opt, idx->bwt, idx->bns, idx->pac, n_processed, n, seqs, 0);
+		n_processed += n;
 		for (i = 0; i < n; ++i) {
 			err_fputs(seqs[i].sam, stdout);
 			free(seqs[i].name); free(seqs[i].comment); free(seqs[i].seq); free(seqs[i].qual); free(seqs[i].sam);
diff --git a/ksw.c b/ksw.c
index 454c49d..db018fa 100644
--- a/ksw.c
+++ b/ksw.c
@@ -502,26 +502,34 @@ int ksw_global(int qlen, const uint8_t *query, int tlen, const uint8_t *target,
 		end = i + w + 1 < qlen? i + w + 1 : qlen; // only loop through [beg,end) of the query sequence
 		h1 = beg == 0? -(gapo + gape * (i + 1)) : MINUS_INF;
 		for (j = beg; LIKELY(j < end); ++j) {
-			// This loop is organized in a similar way to ksw_extend() and ksw_sse2(), except:
-			// 1) not checking h>0; 2) recording direction for backtracking
+			// At the beginning of the loop: eh[j] = { H(i-1,j-1), E(i,j) }, f = F(i,j) and h1 = H(i,j-1)
+			// Cells are computed in the following order:
+			//   M(i,j)   = H(i-1,j-1) + S(i,j)
+			//   H(i,j)   = max{M(i,j), E(i,j), F(i,j)}
+			//   E(i+1,j) = max{M(i,j)-gapo, E(i,j)} - gape
+			//   F(i,j+1) = max{M(i,j)-gapo, F(i,j)} - gape
+			// We have to separate M(i,j); otherwise the direction may not be recorded correctly.
+			// However, a CIGAR like "10M3I3D10M" allowed by local() and extend() is disallowed by global().
+			// Such a CIGAR may occur, in theory, if mismatch_penalty > 2*gap_ext_penalty + 2*gap_open_penalty/k.
+			// In practice, this should happen very rarely given a reasonable scoring system.
 			eh_t *p = &eh[j];
-			int32_t h = p->h, e = p->e;
+			int32_t h, m = p->h, e = p->e;
 			uint8_t d; // direction
 			p->h = h1;
-			h += q[j];
-			d = h >= e? 0 : 1;
-			h = h >= e? h : e;
+			m += q[j];
+			d = m >= e? 0 : 1;
+			h = m >= e? m : e;
 			d = h >= f? d : 2;
 			h = h >= f? h : f;
 			h1 = h;
-			h -= gapoe;
+			m -= gapoe;
 			e -= gape;
-			d |= e > h? 1<<2 : 0;
-			e = e > h? e : h;
+			d |= e > m? 1<<2 : 0;
+			e  = e > m? e    : m;
 			p->e = e;
 			f -= gape;
-			d |= f > h? 2<<4 : 0; // if we want to halve the memory, use one bit only, instead of two
-			f = f > h? f : h;
+			d |= f > m? 2<<4 : 0; // if we want to halve the memory, use one bit only, instead of two
+			f  = f > m? f    : m;
 			zi[j - beg] = d; // z[i,j] keeps h for the current cell and e/f for the next cell
 		}
 		eh[end].h = h1; eh[end].e = MINUS_INF;
diff --git a/kthread.c b/kthread.c
new file mode 100644
index 0000000..a44426b
--- /dev/null
+++ b/kthread.c
@@ -0,0 +1,53 @@
+#include <pthread.h>
+#include <stdlib.h>
+
+struct kt_for_t;
+
+typedef struct {
+	struct kt_for_t *t;
+	int i;
+} ktf_worker_t;
+
+typedef struct kt_for_t {
+	int n_threads, n;
+	ktf_worker_t *w;
+	void (*func)(void*,int,int);
+	void *data;
+} kt_for_t;
+
+static inline int steal_work(kt_for_t *t)
+{
+	int i, k, min = 0x7fffffff, min_i = -1;
+	for (i = 0; i < t->n_threads; ++i)
+		if (min > t->w[i].i) min = t->w[i].i, min_i = i;
+	k = __sync_fetch_and_add(&t->w[min_i].i, t->n_threads);
+	return k >= t->n? -1 : k;
+}
+
+static void *ktf_worker(void *data)
+{
+	ktf_worker_t *w = (ktf_worker_t*)data;
+	int i;
+	for (;;) {
+		i = __sync_fetch_and_add(&w->i, w->t->n_threads);
+		if (i >= w->t->n) break;
+		w->t->func(w->t->data, i, w - w->t->w);
+	}
+	while ((i = steal_work(w->t)) >= 0)
+		w->t->func(w->t->data, i, w - w->t->w);
+	pthread_exit(0);
+}
+
+void kt_for(int n_threads, void (*func)(void*,int,int), void *data, int n)
+{
+	int i;
+	kt_for_t t;
+	pthread_t *tid;
+	t.func = func, t.data = data, t.n_threads = n_threads, t.n = n;
+	t.w = (ktf_worker_t*)alloca(n_threads * sizeof(ktf_worker_t));
+	tid = (pthread_t*)alloca(n_threads * sizeof(pthread_t));
+	for (i = 0; i < n_threads; ++i)
+		t.w[i].t = &t, t.w[i].i = i;
+	for (i = 0; i < n_threads; ++i) pthread_create(&tid[i], 0, ktf_worker, &t.w[i]);
+	for (i = 0; i < n_threads; ++i) pthread_join(tid[i], 0);
+}
diff --git a/main.c b/main.c
index 662d54f..f872917 100644
--- a/main.c
+++ b/main.c
@@ -1,9 +1,10 @@
 #include <stdio.h>
 #include <string.h>
+#include "kstring.h"
 #include "utils.h"
 
 #ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "0.7.5a-r405"
+#define PACKAGE_VERSION "0.7.6a-r433"
 #endif
 
 int bwa_fa2pac(int argc, char *argv[]);
@@ -13,16 +14,10 @@ int bwa_bwt2sa(int argc, char *argv[]);
 int bwa_index(int argc, char *argv[]);
 int bwt_bwtgen_main(int argc, char *argv[]);
 
-int bwa_aln(int argc, char *argv[]);
-int bwa_sai2sam_se(int argc, char *argv[]);
-int bwa_sai2sam_pe(int argc, char *argv[]);
-
-int bwa_bwtsw2(int argc, char *argv[]);
-
 int main_fastmap(int argc, char *argv[]);
 int main_mem(int argc, char *argv[]);
 
-int main_pemerge(int argc, char *argv[]);
+char *bwa_pg;
 
 static int usage()
 {
@@ -34,11 +29,6 @@ static int usage()
 	fprintf(stderr, "Command: index         index sequences in the FASTA format\n");
 	fprintf(stderr, "         mem           BWA-MEM algorithm\n");
 	fprintf(stderr, "         fastmap       identify super-maximal exact matches\n");
-	fprintf(stderr, "         pemerge       merge overlapping paired ends (EXPERIMENTAL)\n");
-	fprintf(stderr, "         aln           gapped/ungapped alignment\n");
-	fprintf(stderr, "         samse         generate alignment (single ended)\n");
-	fprintf(stderr, "         sampe         generate alignment (paired ended)\n");
-	fprintf(stderr, "         bwasw         BWA-SW for long queries\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "         fa2pac        convert FASTA to PAC format\n");
 	fprintf(stderr, "         pac2bwt       generate BWT from PAC\n");
@@ -46,23 +36,23 @@ static int usage()
 	fprintf(stderr, "         bwtupdate     update .bwt to the new format\n");
 	fprintf(stderr, "         bwt2sa        generate SA from BWT and Occ\n");
 	fprintf(stderr, "\n");
-	fprintf(stderr, "Note: To use BWA, you need to first index the genome with `bwa index'. There are\n");
-	fprintf(stderr, "      three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If\n");
-	fprintf(stderr, "      you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for\n");
-	fprintf(stderr, "      for the manual.\n\n");
+	fprintf(stderr,
+"Note: To use BWA, you need to first index the genome with `bwa index'.\n"
+"      There are three alignment algorithms in BWA: `mem', `bwasw', and\n"
+"      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'\n"
+"      first. Please `man ./bwa.1' for the manual.\n\n");
 	return 1;
 }
 
-void bwa_print_sam_PG()
-{
-	err_printf("@PG\tID:bwa\tPN:bwa\tVN:%s\n", PACKAGE_VERSION);
-}
-
 int main(int argc, char *argv[])
 {
 	int i, ret;
 	double t_real;
+	kstring_t pg = {0,0,0};
 	t_real = realtime();
+	ksprintf(&pg, "@PG\tID:bwa\tPN:bwa\tVN:%s\tCL:%s", PACKAGE_VERSION, argv[0]);
+	for (i = 1; i < argc; ++i) ksprintf(&pg, " %s", argv[i]);
+	bwa_pg = pg.s;
 	if (argc < 2) return usage();
 	if (strcmp(argv[1], "fa2pac") == 0) ret = bwa_fa2pac(argc-1, argv+1);
 	else if (strcmp(argv[1], "pac2bwt") == 0) ret = bwa_pac2bwt(argc-1, argv+1);
@@ -70,15 +60,8 @@ int main(int argc, char *argv[])
 	else if (strcmp(argv[1], "bwtupdate") == 0) ret = bwa_bwtupdate(argc-1, argv+1);
 	else if (strcmp(argv[1], "bwt2sa") == 0) ret = bwa_bwt2sa(argc-1, argv+1);
 	else if (strcmp(argv[1], "index") == 0) ret = bwa_index(argc-1, argv+1);
-	else if (strcmp(argv[1], "aln") == 0) ret = bwa_aln(argc-1, argv+1);
-	else if (strcmp(argv[1], "samse") == 0) ret = bwa_sai2sam_se(argc-1, argv+1);
-	else if (strcmp(argv[1], "sampe") == 0) ret = bwa_sai2sam_pe(argc-1, argv+1);
-	else if (strcmp(argv[1], "bwtsw2") == 0) ret = bwa_bwtsw2(argc-1, argv+1);
-	else if (strcmp(argv[1], "dbwtsw") == 0) ret = bwa_bwtsw2(argc-1, argv+1);
-	else if (strcmp(argv[1], "bwasw") == 0) ret = bwa_bwtsw2(argc-1, argv+1);
 	else if (strcmp(argv[1], "fastmap") == 0) ret = main_fastmap(argc-1, argv+1);
 	else if (strcmp(argv[1], "mem") == 0) ret = main_mem(argc-1, argv+1);
-	else if (strcmp(argv[1], "pemerge") == 0) ret = main_pemerge(argc-1, argv+1);
 	else {
 		fprintf(stderr, "[main] unrecognized command '%s'\n", argv[1]);
 		return 1;
@@ -92,5 +75,6 @@ int main(int argc, char *argv[])
 			fprintf(stderr, " %s", argv[i]);
 		fprintf(stderr, "\n[%s] Real time: %.3f sec; CPU: %.3f sec\n", __func__, realtime() - t_real, cputime());
 	}
+	free(bwa_pg);
 	return ret;
 }

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/bwa.git