[med-svn] [bwa] 01/03: New upstream version 0.7.16
Steffen Möller
moeller at moszumanska.debian.org
Mon Sep 11 13:54:07 UTC 2017
This is an automated email from the git hooks/post-receive script.
moeller pushed a commit to branch master
in repository bwa.
commit 6b5aa1323449ac10783d202441289b5bc99abaea
Author: Steffen Moeller <moeller at debian.org>
Date: Mon Sep 11 15:40:17 2017 +0200
New upstream version 0.7.16
---
NEWS.md | 19 ++++++++++++++++++-
README.md | 27 +++++++++++++--------------
bntseq.c | 6 +++---
bwa.1 | 49 +++++++++++++++++++++++++++++++++++++++++++++----
bwa.c | 30 ++++++++++++++++++++++--------
bwakit/run-bwamem | 7 ++++---
bwakit/run-gen-ref | 2 +-
bwamem.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
bwamem.h | 1 +
bwamem_pair.c | 5 +++++
bwape.c | 5 +++--
bwase.c | 9 ++++++---
bwt_gen.c | 21 +++++++++------------
bwt_lite.c | 2 +-
bwtaln.c | 5 +++--
bwtgap.c | 2 +-
bwtgap.h | 6 +++---
bwtindex.c | 4 ++--
bwtsw2_pair.c | 8 +++++++-
fastmap.c | 11 ++++++++---
kthread.c | 1 +
main.c | 2 +-
malloc_wrap.c | 8 ++++----
23 files changed, 201 insertions(+), 77 deletions(-)
diff --git a/NEWS.md b/NEWS.md
index 9a63bef..02a1c0c 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,7 +1,24 @@
+Release 0.7.16 (30 July 2017)
+-----------------------------
+
+This release added a couple of minor features and incorporated multiple pull
+requests, including:
+
+ * Added option -5, which is useful to some Hi-C pipelines.
+
+ * Fixed an error with samtools sorting (#129). Updated download link for
+ GRCh38 (#123). Fixed README MarkDown formatting (#70). Addressed multiple
+ issues via a collected pull request #139 by @jmarshall. Avoid malformatted
+ SAM header when -R is used with TAB (#84). Output mate CIGAR (#138).
+
+(0.7.16: 30 July 2017, r1180)
+
+
+
Release 0.7.15 (31 May 2016)
----------------------------
-Fixed a long existing bug which potentially leads underestimated insert size
+Fixed a long existing bug which potentially leads to underestimated insert size
upper bound. This bug should have little effect in practice.
(0.7.15: 31 May 2016, r1140)
diff --git a/README.md b/README.md
index 9ac49bb..5f673a2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
[![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa)
-[![Build Status](https://drone.io/github.com/lh3/bwa/status.png)](https://drone.io/github.com/lh3/bwa/latest)
-##Getting started
+## Getting started
git clone https://github.com/lh3/bwa.git
cd bwa; make
@@ -8,7 +7,7 @@
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
-##Introduction
+## Introduction
BWA is a software package for mapping DNA sequences against a large reference
genome, such as the human genome. It consists of three algorithms:
@@ -24,7 +23,7 @@ reference genome (the **index** command). Alignment algorithms are invoked with
different sub-commands: **aln/samse/sampe** for BWA-backtrack,
**bwasw** for BWA-SW and **mem** for the BWA-MEM algorithm.
-##Availability
+## Availability
BWA is released under [GPLv3][1]. The latest source code is [freely
available at github][2]. Released packages can [be downloaded][3] at
@@ -37,7 +36,7 @@ In addition to BWA, this self-consistent package also comes with bwa-associated
and 3rd-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs,
adapter triming, duplicate marking, HLA typing and associated data files.
-##Seeking helps
+## Seeking help
The detailed usage is described in the man page available together with the
source code. You can use `man ./bwa.1` to view the man page in a terminal. The
@@ -46,7 +45,7 @@ have questions about BWA, you may [sign up the mailing list][6] and then send
the questions to [bio-bwa-help at sourceforge.net][7]. You may also ask questions
in forums such as [BioStar][8] and [SEQanswers][9].
-##Citing BWA
+## Citing BWA
* Li H. and Durbin R. (2009) Fast and accurate short read alignment with
Burrows-Wheeler transform. *Bioinformatics*, **25**, 1754-1760. [PMID:
@@ -63,7 +62,7 @@ in forums such as [BioStar][8] and [SEQanswers][9].
Please note that the last reference is a preprint hosted at [arXiv.org][13]. I
do not have plan to submit it to a peer-reviewed journal in the near future.
-##Frequently asked questions (FAQs)
+## Frequently asked questions (FAQs)
1. [What types of data does BWA work with?](#type)
2. [Why does a read appear multiple times in the output SAM?](#multihit)
@@ -73,7 +72,7 @@ do not have plan to submit it to a peer-reviewed journal in the near future.
6. [Does BWA work with ALT contigs in the GRCh38 release?](#altctg)
7. [Can I just run BWA-MEM against GRCh38+ALT without post-processing?](#postalt)
-####<a name="type"></a>1. What types of data does BWA work with?
+#### <a name="type"></a>1. What types of data does BWA work with?
BWA works with a variety types of DNA sequence data, though the optimal
algorithm and setting may vary. The following list gives the recommended
@@ -108,7 +107,7 @@ errors given longer query sequences as the chance of missing all seeds is small.
As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore
reads with a sequencing error rate over 20%.
-####<a name="multihit"></a>2. Why does a read appear multiple times in the output SAM?
+#### <a name="multihit"></a>2. Why does a read appear multiple times in the output SAM?
BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene
fusion or a long deletion, a read bridging the break point may have two hits,
@@ -116,18 +115,18 @@ occupying two lines in the SAM output. With the default setting of BWA-MEM, one
and only one line is primary and is soft clipped; other lines are tagged with
0x800 SAM flag (supplementary alignment) and are hard clipped.
-####<a name="4gb"></a>3. Does BWA work on reference sequences longer than 4GB in total?
+#### <a name="4gb"></a>3. Does BWA work on reference sequences longer than 4GB in total?
Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over
4GB. However, individual chromosome should not be longer than 2GB.
-####<a name="pe0"></a>4. Why can one read in a pair has high mapping quality but the other has zero?
+#### <a name="pe0"></a>4. Why can one read in a pair have a high mapping quality but the other has zero?
This is correct. Mapping quality is assigned for individual read, not for a read
pair. It is possible that one read can be mapped unambiguously, but its mate
falls in a tandem repeat and thus its accurate position cannot be determined.
-####<a name="endref"></a>5. How can a BWA-backtrack alignment stands out of the end of a chromosome?
+#### <a name="endref"></a>5. How can a BWA-backtrack alignment stand out of the end of a chromosome?
Internally BWA concatenates all reference sequences into one long sequence. A
read may be mapped to the junction of two adjacent reference sequences. In this
@@ -135,7 +134,7 @@ case, BWA-backtrack will flag the read as unmapped (0x4), but you will see
position, CIGAR and all the tags. A similar issue may occur to BWA-SW alignment
as well. BWA-MEM does not have this problem.
-####<a name="altctg"></a>6. Does BWA work with ALT contigs in the GRCh38 release?
+#### <a name="altctg"></a>6. Does BWA work with ALT contigs in the GRCh38 release?
Yes, since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT.
BWA-backtrack and BWA-SW don't properly support ALT mapping as of now. Please
@@ -143,7 +142,7 @@ see [README-alt.md][18] for details. Briefly, it is recommended to use
[bwakit][17], the binary release of BWA, for generating the reference genome
and for mapping.
-####<a name="postalt"></a>7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?
+#### <a name="postalt"></a>7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?
If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM
without post-processing. The alignments produced this way are very close to
diff --git a/bntseq.c b/bntseq.c
index 465e383..65f7e93 100644
--- a/bntseq.c
+++ b/bntseq.c
@@ -299,9 +299,9 @@ int64_t bns_fasta2bntseq(gzFile fp_fa, const char *prefix, int for_only)
// read sequences
while (kseq_read(seq) >= 0) pac = add1(seq, bns, pac, &m_pac, &m_seqs, &m_holes, &q);
if (!for_only) { // add the reverse complemented sequence
- m_pac = (bns->l_pac * 2 + 3) / 4 * 4;
- pac = realloc(pac, m_pac/4);
- memset(pac + (bns->l_pac+3)/4, 0, (m_pac - (bns->l_pac+3)/4*4) / 4);
+ int64_t ll_pac = (bns->l_pac * 2 + 3) / 4 * 4;
+ if (ll_pac > m_pac) pac = realloc(pac, ll_pac/4);
+ memset(pac + (bns->l_pac+3)/4, 0, (ll_pac - (bns->l_pac+3)/4*4) / 4);
for (l = bns->l_pac - 1; l >= 0; --l, ++bns->l_pac)
_set_pac(pac, bns->l_pac, 3-_get_pac(pac, l));
}
diff --git a/bwa.1 b/bwa.1
index b9a65f1..94c2c58 100644
--- a/bwa.1
+++ b/bwa.1
@@ -1,4 +1,4 @@
-.TH bwa 1 "31 May 2016" "bwa-0.7.15-r1140" "Bioinformatics tools"
+.TH bwa 1 "30 July 2017" "bwa-0.7.16-r1180" "Bioinformatics tools"
.SH NAME
.PP
bwa - Burrows-Wheeler Alignment Tool
@@ -107,6 +107,8 @@ appropriate algorithm will be chosen automatically.
.IR clipPen ]
.RB [ -U
.IR unpairPen ]
+.RB [ -x
+.IR readType ]
.RB [ -R
.IR RGline ]
.RB [ -H
@@ -256,7 +258,23 @@ Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as
and scores a paired as scoreRead1+scoreRead2-insertPenalty. It compares these
two scores to determine whether we should force pairing. A larger value leads to
more aggressive read pair. [17]
-
+.TP
+.BI -x \ STR
+Read type. Changes multiple parameters unless overriden [null]
+.RS
+.TP 10
+.BR pacbio :
+.B -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0
+(PacBio reads to ref)
+.TP
+.BR ont2d :
+.B -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0
+(Oxford Nanopore 2D-reads to ref)
+.TP
+.BR intractg :
+.B -B9 -O16 -L5
+(intra-species contigs to ref)
+.RE
.TP
.B INPUT/OUTPUT OPTIONS:
.TP
@@ -277,6 +295,29 @@ If ARG starts with @, it is interpreted as a string and gets inserted into the
output SAM header; otherwise, ARG is interpreted as a file with all lines
starting with @ in the file inserted into the SAM header. [null]
.TP
+.BI -o \ FILE
+Write the output SAM file to
+.IR FILE .
+For compatibility with other BWA commands, this option may also be given as
+.B -f
+.IR FILE .
+[standard ouptut]
+.TP
+.B -5
+For split alignment, mark the segment with the smallest coordinate as the
+primary. This option may help some Hi-C pipelines. By default, BWA-MEM marks
+highest scoring segment as primary.
+.TP
+.B -K \ INT
+Process
+.I INT
+input bases in each batch regardless of the number of threads in use
+.RI [10000000* nThreads ].
+By default, the batch size is proportional to the number of threads in use.
+Because the inferred insert size distribution slightly depends on the batch
+size, using different number of threads may produce different output.
+Specifying this option helps reproducibility.
+.TP
.BI -T \ INT
Don't output alignment with score lower than
.IR INT .
@@ -302,7 +343,7 @@ Output all found alignments for single-end or unpaired paired-end reads. These
alignments will be flagged as secondary alignments.
.TP
.B -C
-Append append FASTA/Q comment to SAM output. This option can be used to
+Append FASTA/Q comment to SAM output. This option can be used to
transfer read meta information (e.g. barcode) to the SAM output. Note that the
FASTA/Q comment (the string after a space in the header line) must conform the SAM
spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.
@@ -316,7 +357,7 @@ supplementary alignments.
Mark shorter split hits as secondary (for Picard compatibility).
.TP
.BI -v \ INT
-Control the verbose level of the output. This option has not been fully
+Control the verbosity level of the output. This option has not been fully
supported throughout BWA. Ideally, a value 0 for disabling all the output to
stderr; 1 for outputting errors only; 2 for warnings and errors; 3 for
all normal messages; 4 or higher for debugging. When this option takes value
diff --git a/bwa.c b/bwa.c
index c78ebb6..75ccdf4 100644
--- a/bwa.c
+++ b/bwa.c
@@ -30,13 +30,23 @@ static inline void trim_readno(kstring_t *s)
s->l -= 2, s->s[s->l] = 0;
}
+static inline char *dupkstring(const kstring_t *str, int dupempty)
+{
+ char *s = (str->l > 0 || dupempty)? malloc(str->l + 1) : NULL;
+ if (!s) return NULL;
+
+ memcpy(s, str->s, str->l);
+ s[str->l] = '\0';
+ return s;
+}
+
static inline void kseq2bseq1(const kseq_t *ks, bseq1_t *s)
{ // TODO: it would be better to allocate one chunk of memory, but probably it does not matter in practice
- s->name = strdup(ks->name.s);
- s->comment = ks->comment.l? strdup(ks->comment.s) : 0;
- s->seq = strdup(ks->seq.s);
- s->qual = ks->qual.l? strdup(ks->qual.s) : 0;
- s->l_seq = strlen(s->seq);
+ s->name = dupkstring(&ks->name, 1);
+ s->comment = dupkstring(&ks->comment, 0);
+ s->seq = dupkstring(&ks->seq, 1);
+ s->qual = dupkstring(&ks->qual, 0);
+ s->l_seq = ks->seq.l;
}
bseq1_t *bseq_read(int chunk_size, int *n_, void *ks1_, void *ks2_)
@@ -144,9 +154,9 @@ uint32_t *bwa_gen_cigar2(const int8_t mat[25], int o_del, int e_del, int o_ins,
max_del = (int)((double)(((l_query+1)>>1) * mat[0] - o_del) / e_del + 1.);
max_gap = max_ins > max_del? max_ins : max_del;
max_gap = max_gap > 1? max_gap : 1;
- w = (max_gap + abs(rlen - l_query) + 1) >> 1;
+ w = (max_gap + abs((int)rlen - l_query) + 1) >> 1;
w = w < w_? w : w_;
- min_w = abs(rlen - l_query) + 3;
+ min_w = abs((int)rlen - l_query) + 3;
w = w > min_w? w : min_w;
// NW alignment
if (bwa_verbose >= 4) {
@@ -414,10 +424,14 @@ char *bwa_set_rg(const char *s)
if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] the read group line is not started with @RG\n", __func__);
goto err_set_rg;
}
+ if (strstr(s, "\t") != NULL) {
+ if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] the read group line contained literal <tab> characters -- replace with escaped tabs: \\t\n", __func__);
+ goto err_set_rg;
+ }
rg_line = strdup(s);
bwa_escape(rg_line);
if ((p = strstr(rg_line, "\tID:")) == 0) {
- if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] no ID at the read group line\n", __func__);
+ if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] no ID within the read group line\n", __func__);
goto err_set_rg;
}
p += 4;
diff --git a/bwakit/run-bwamem b/bwakit/run-bwamem
index 462bafe..4c0e8d0 100755
--- a/bwakit/run-bwamem
+++ b/bwakit/run-bwamem
@@ -5,7 +5,7 @@ use warnings;
use Getopt::Std;
my %opts = (t=>1);
-getopts("PSadskHo:R:x:t:", \%opts);
+getopts("MPSadskHo:R:x:t:", \%opts);
die('
Usage: run-bwamem [options] <idxbase> <file1> [file2]
@@ -24,6 +24,7 @@ Options: -o STR prefix for output files [inferred from
-S for BAM input, don\'t shuffle
-s sort the output alignment (via samtools; requring more RAM)
-k keep temporary files generated by typeHLA
+ -M mark shorter split hits as secondary
Examples:
@@ -143,7 +144,7 @@ if ($is_bam) {
$cmd = "cat $ARGV[1] \\\n";
}
-my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : "");
+my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : "") . (defined($opts{M})? "-M " : "");
$bwa_opts .= join(" ", @RG_lines) . " -C " if @RG_lines > 0;
$cmd .= " | $root/trimadap 2> $prefix.log.trim \\\n" if defined($opts{a});
@@ -163,7 +164,7 @@ if (-f "$ARGV[0].alt" && !defined($opts{P})) {
}
my $t_sort = $opts{t} < 4? $opts{t} : 4;
-$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
+$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - -o $prefix.aln.bam;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
diff --git a/bwakit/run-gen-ref b/bwakit/run-gen-ref
index 3ed63b2..58fab68 100755
--- a/bwakit/run-gen-ref
+++ b/bwakit/run-gen-ref
@@ -2,7 +2,7 @@
root=`dirname $0`
-url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz"
+url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz"
url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
if [ $# -eq 0 ]; then
diff --git a/bwamem.c b/bwamem.c
index 99a6d88..d7c00fe 100644
--- a/bwamem.c
+++ b/bwamem.c
@@ -809,6 +809,19 @@ static inline int get_rlen(int n_cigar, const uint32_t *cigar)
return l;
}
+static inline void add_cigar(const mem_opt_t *opt, mem_aln_t *p, kstring_t *str, int which)
+{
+ int i;
+ if (p->n_cigar) { // aligned
+ for (i = 0; i < p->n_cigar; ++i) {
+ int c = p->cigar[i]&0xf;
+ if (!(opt->flag&MEM_F_SOFTCLIP) && !p->is_alt && (c == 3 || c == 4))
+ c = which? 4 : 3; // use hard clipping for supplementary alignments
+ kputw(p->cigar[i]>>4, str); kputc("MIDSH"[c], str);
+ }
+ } else kputc('*', str); // having a coordinate but unaligned (e.g. when copy_mate is true)
+}
+
void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m_)
{
int i, l_name;
@@ -835,14 +848,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq
kputs(bns->anns[p->rid].name, str); kputc('\t', str); // RNAME
kputl(p->pos + 1, str); kputc('\t', str); // POS
kputw(p->mapq, str); kputc('\t', str); // MAPQ
- if (p->n_cigar) { // aligned
- for (i = 0; i < p->n_cigar; ++i) {
- int c = p->cigar[i]&0xf;
- if (!(opt->flag&MEM_F_SOFTCLIP) && !p->is_alt && (c == 3 || c == 4))
- c = which? 4 : 3; // use hard clipping for supplementary alignments
- kputw(p->cigar[i]>>4, str); kputc("MIDSH"[c], str);
- }
- } else kputc('*', str); // having a coordinate but unaligned (e.g. when copy_mate is true)
+ add_cigar(opt, p, str, which);
} else kputsn("*\t0\t0\t*", 7, str); // without coordinte
kputc('\t', str);
@@ -899,6 +905,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq
kputsn("\tNM:i:", 6, str); kputw(p->NM, str);
kputsn("\tMD:Z:", 6, str); kputs((char*)(p->cigar + p->n_cigar), str);
}
+ if (m && m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); }
if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); }
if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); }
if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); }
@@ -968,6 +975,30 @@ int mem_approx_mapq_se(const mem_opt_t *opt, const mem_alnreg_t *a)
return mapq;
}
+void mem_reorder_primary5(int T, mem_alnreg_v *a)
+{
+ int k, n_pri = 0, left_st = INT_MAX, left_k = -1;
+ mem_alnreg_t t;
+ for (k = 0; k < a->n; ++k)
+ if (a->a[k].secondary < 0 && !a->a[k].is_alt && a->a[k].score >= T) ++n_pri;
+ if (n_pri <= 1) return; // only one alignment
+ for (k = 0; k < a->n; ++k) {
+ mem_alnreg_t *p = &a->a[k];
+ if (p->secondary >= 0 || p->is_alt || p->score < T) continue;
+ if (p->qb < left_st) left_st = p->qb, left_k = k;
+ }
+ assert(a->a[0].secondary < 0);
+ if (left_k == 0) return; // no need to reorder
+ t = a->a[0], a->a[0] = a->a[left_k], a->a[left_k] = t;
+ for (k = 1; k < a->n; ++k) { // update secondary and secondary_all
+ mem_alnreg_t *p = &a->a[k];
+ if (p->secondary == 0) p->secondary = left_k;
+ else if (p->secondary == left_k) p->secondary = 0;
+ if (p->secondary_all == 0) p->secondary_all = left_k;
+ else if (p->secondary_all == left_k) p->secondary_all = 0;
+ }
+}
+
// TODO (future plan): group hits into a uint64_t[] array. This will be cleaner and more flexible
void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, bseq1_t *s, mem_alnreg_v *a, int extra_flag, const mem_aln_t *m)
{
@@ -1160,6 +1191,7 @@ static void worker2(void *data, int i, int tid)
if (!(w->opt->flag&MEM_F_PE)) {
if (bwa_verbose >= 4) printf("=====> Finalizing read '%s' <=====\n", w->seqs[i].name);
mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a, w->n_processed + i);
+ if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]);
mem_reg2sam(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0);
free(w->regs[i].a);
} else {
diff --git a/bwamem.h b/bwamem.h
index 730b603..4e2b1d8 100644
--- a/bwamem.h
+++ b/bwamem.h
@@ -19,6 +19,7 @@ typedef struct __smem_i smem_i;
#define MEM_F_REF_HDR 0x100
#define MEM_F_SOFTCLIP 0x200
#define MEM_F_SMARTPE 0x400
+#define MEM_F_PRIMARY5 0x800
typedef struct {
int a, b; // match score and mismatch penalty
diff --git a/bwamem_pair.c b/bwamem_pair.c
index b812cc0..9beb84b 100644
--- a/bwamem_pair.c
+++ b/bwamem_pair.c
@@ -243,6 +243,7 @@ int mem_pair(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, cons
}
void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m);
+void mem_reorder_primary5(int T, mem_alnreg_v *a);
#define raw_mapq(diff, a) ((int)(6.02 * (diff) / (a) + .499))
@@ -275,6 +276,10 @@ int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, co
}
n_pri[0] = mem_mark_primary_se(opt, a[0].n, a[0].a, id<<1|0);
n_pri[1] = mem_mark_primary_se(opt, a[1].n, a[1].a, id<<1|1);
+ if (opt->flag & MEM_F_PRIMARY5) {
+ mem_reorder_primary5(opt->T, &a[0]);
+ mem_reorder_primary5(opt->T, &a[1]);
+ }
if (opt->flag&MEM_F_NOPAIRING) goto no_pairing;
// pairing single-end hits
if (n_pri[0] && n_pri[1] && (o = mem_pair(opt, bns, pac, pes, s, a, id, &subo, &n_sub, z, n_pri)) > 0) {
diff --git a/bwape.c b/bwape.c
index f2d3d8e..fa4f7f7 100644
--- a/bwape.c
+++ b/bwape.c
@@ -624,7 +624,8 @@ ubyte_t *bwa_paired_sw(const bntseq_t *bns, const ubyte_t *_pacseq, int n_seqs,
void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const fn_fa[2], pe_opt_t *popt, const char *rg_line)
{
extern bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa);
- int i, j, n_seqs, tot_seqs = 0;
+ int i, j, n_seqs;
+ long long tot_seqs = 0;
bwa_seq_t *seqs[2];
bwa_seqio_t *ks[2];
clock_t t;
@@ -711,7 +712,7 @@ void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const f
for (j = 0; j < 2; ++j)
bwa_free_read_seq(n_seqs, seqs[j]);
- fprintf(stderr, "[bwa_sai2sam_pe_core] %d sequences have been processed.\n", tot_seqs);
+ fprintf(stderr, "[bwa_sai2sam_pe_core] %lld sequences have been processed.\n", tot_seqs);
last_ii = ii;
}
diff --git a/bwase.c b/bwase.c
index 77c50db..18e8671 100644
--- a/bwase.c
+++ b/bwase.c
@@ -173,13 +173,15 @@ bwa_cigar_t *bwa_refine_gapped_core(bwtint_t l_pac, const ubyte_t *pacseq, int l
ubyte_t *rseq;
int64_t k, rb, re, rlen;
int8_t mat[25];
+ int w;
bwa_fill_scmat(1, 3, mat);
rb = *_rb; re = rb + len + ref_shift;
assert(re <= l_pac);
rseq = bns_get_seq(l_pac, pacseq, rb, re, &rlen);
assert(re - rb == rlen);
- ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > abs(rlen - len) * 1.5? SW_BW : abs(rlen - len) * 1.5, n_cigar, &cigar32);
+ w = abs((int)rlen - len) * 1.5;
+ ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > w? SW_BW : w, n_cigar, &cigar32);
assert(*n_cigar > 0);
if ((cigar32[*n_cigar - 1]&0xf) == 1) cigar32[*n_cigar - 1] = (cigar32[*n_cigar - 1]>>4<<4) | 3; // change endding ins to soft clipping
if ((cigar32[0]&0xf) == 1) cigar32[0] = (cigar32[0]>>4<<4) | 3; // change beginning ins to soft clipping
@@ -505,7 +507,8 @@ void bwase_initialize()
void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_fa, int n_occ, const char *rg_line)
{
extern bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa);
- int i, n_seqs, tot_seqs = 0, m_aln;
+ int i, n_seqs, m_aln;
+ long long tot_seqs = 0;
bwt_aln1_t *aln = 0;
bwa_seq_t *seqs;
bwa_seqio_t *ks;
@@ -563,7 +566,7 @@ void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_f
fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC);
bwa_free_read_seq(n_seqs, seqs);
- fprintf(stderr, "[bwa_aln_core] %d sequences have been processed.\n", tot_seqs);
+ fprintf(stderr, "[bwa_aln_core] %lld sequences have been processed.\n", tot_seqs);
}
// destroy
diff --git a/bwt_gen.c b/bwt_gen.c
index 76f28c9..3adcc22 100644
--- a/bwt_gen.c
+++ b/bwt_gen.c
@@ -338,6 +338,7 @@ BWT *BWTCreate(const bgint_t textLength, unsigned int *decodeTable)
bwt->decodeTable = (unsigned*)calloc(DNA_OCC_CNT_TABLE_SIZE_IN_WORD, sizeof(unsigned int));
GenerateDNAOccCountTable(bwt->decodeTable);
} else {
+ // FIXME Prevent BWTFree() from freeing decodeTable in this case
bwt->decodeTable = decodeTable;
}
@@ -1538,25 +1539,21 @@ BWTInc *BWTIncConstructFromPacked(const char *inputFileName, bgint_t initialMaxB
(long)bwtInc->numberOfIterationDone, (long)processedTextLength);
}
}
- return bwtInc;
-}
-void BWTFree(BWT *bwt)
-{
- if (bwt == 0) return;
- free(bwt->cumulativeFreq);
- free(bwt->bwtCode);
- free(bwt->occValue);
- free(bwt->occValueMajor);
- free(bwt->decodeTable);
- free(bwt);
+ fclose(packedFile);
+ return bwtInc;
}
void BWTIncFree(BWTInc *bwtInc)
{
if (bwtInc == 0) return;
+ free(bwtInc->bwt->cumulativeFreq);
+ free(bwtInc->bwt->occValueMajor);
+ free(bwtInc->bwt->decodeTable);
free(bwtInc->bwt);
free(bwtInc->workingMemory);
+ free(bwtInc->cumulativeCountInCurrentBuild);
+ free(bwtInc->packedShift);
free(bwtInc);
}
@@ -1602,7 +1599,7 @@ void bwt_bwtgen2(const char *fn_pac, const char *fn_bwt, int block_size)
{
BWTInc *bwtInc;
bwtInc = BWTIncConstructFromPacked(fn_pac, block_size, block_size);
- printf("[bwt_gen] Finished constructing BWT in %u iterations.\n", bwtInc->numberOfIterationDone);
+ fprintf(stderr, "[bwt_gen] Finished constructing BWT in %u iterations.\n", bwtInc->numberOfIterationDone);
BWTSaveBwtCodeAndOcc(bwtInc->bwt, fn_bwt, 0);
BWTIncFree(bwtInc);
}
diff --git a/bwt_lite.c b/bwt_lite.c
index f7946f5..eb16da6 100644
--- a/bwt_lite.c
+++ b/bwt_lite.c
@@ -48,7 +48,7 @@ bwtl_t *bwtl_seq2bwtl(int len, const uint8_t *seq)
}
{ // generate cnt_table
for (i = 0; i != 256; ++i) {
- u_int32_t j, x = 0;
+ uint32_t j, x = 0;
for (j = 0; j != 4; ++j)
x |= (((i&3) == j) + ((i>>2&3) == j) + ((i>>4&3) == j) + (i>>6 == j)) << (j<<3);
b->cnt_table[i] = x;
diff --git a/bwtaln.c b/bwtaln.c
index 20b01cd..a348fdf 100644
--- a/bwtaln.c
+++ b/bwtaln.c
@@ -158,7 +158,8 @@ bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa)
void bwa_aln_core(const char *prefix, const char *fn_fa, const gap_opt_t *opt)
{
- int i, n_seqs, tot_seqs = 0;
+ int i, n_seqs;
+ long long tot_seqs = 0;
bwa_seq_t *seqs;
bwa_seqio_t *ks;
clock_t t;
@@ -218,7 +219,7 @@ void bwa_aln_core(const char *prefix, const char *fn_fa, const gap_opt_t *opt)
fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC);
bwa_free_read_seq(n_seqs, seqs);
- fprintf(stderr, "[bwa_aln_core] %d sequences have been processed.\n", tot_seqs);
+ fprintf(stderr, "[bwa_aln_core] %lld sequences have been processed.\n", tot_seqs);
}
// destroy
diff --git a/bwtgap.c b/bwtgap.c
index 08bc1f4..7dde443 100644
--- a/bwtgap.c
+++ b/bwtgap.c
@@ -58,7 +58,7 @@ static inline void gap_push(gap_stack_t *stack, int i, bwtint_t k, bwtint_t l, i
q->stack = (gap_entry_t*)realloc(q->stack, sizeof(gap_entry_t) * q->m_entries);
}
p = q->stack + q->n_entries;
- p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l;
+ p->info = (uint32_t)score<<21 | i; p->k = k; p->l = l;
p->n_mm = n_mm; p->n_gapo = n_gapo; p->n_gape = n_gape;
p->n_ins = n_ins; p->n_del = n_del;
p->state = state;
diff --git a/bwtgap.h b/bwtgap.h
index 7dd6165..9085b62 100644
--- a/bwtgap.h
+++ b/bwtgap.h
@@ -5,9 +5,9 @@
#include "bwtaln.h"
typedef struct { // recursion stack
- u_int32_t info; // score<<21 | i
- u_int32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6;
- u_int32_t n_ins:16, n_del:16;
+ uint32_t info; // score<<21 | i
+ uint32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6;
+ uint32_t n_ins:16, n_del:16;
int last_diff_pos;
bwtint_t k, l; // (k,l) is the SA region of [i,n-1]
} gap_entry_t;
diff --git a/bwtindex.c b/bwtindex.c
index fb32231..e1322f8 100644
--- a/bwtindex.c
+++ b/bwtindex.c
@@ -119,7 +119,7 @@ bwt_t *bwt_pac2bwt(const char *fn_pac, int use_is)
}
rope_destroy(r);
}
- bwt->bwt = (u_int32_t*)calloc(bwt->bwt_size, 4);
+ bwt->bwt = (uint32_t*)calloc(bwt->bwt_size, 4);
for (i = 0; i < bwt->seq_len; ++i)
bwt->bwt[i>>4] |= buf[i] << ((15 - (i&15)) << 1);
free(buf);
@@ -175,7 +175,7 @@ void bwt_bwtupdate_core(bwt_t *bwt)
int bwa_bwtupdate(int argc, char *argv[]) // the "bwtupdate" command
{
bwt_t *bwt;
- if (argc < 2) {
+ if (argc != 2) {
fprintf(stderr, "Usage: bwa bwtupdate <the.bwt>\n");
return 1;
}
diff --git a/bwtsw2_pair.c b/bwtsw2_pair.c
index b945128..f50b25b 100644
--- a/bwtsw2_pair.c
+++ b/bwtsw2_pair.c
@@ -70,6 +70,12 @@ bsw2pestat_t bsw2_stat(int n, bwtsw2_t **buf, kstring_t *msg, int max_ins)
for (i = x = 0, r.avg = 0; i < k; ++i)
if (isize[i] >= r.low && isize[i] <= r.high)
r.avg += isize[i], ++x;
+ if (x == 0) {
+ ksprintf(msg, "[%s] fail to infer the insert size distribution: no pairs within boundaries.\n", __func__);
+ free(isize);
+ r.failed = 1;
+ return r;
+ }
r.avg /= x;
for (i = 0, r.std = 0; i < k; ++i)
if (isize[i] >= r.low && isize[i] <= r.high)
@@ -236,7 +242,7 @@ void bsw2_pair(const bsw2opt_t *opt, int64_t l_pac, const uint8_t *pac, int n, b
double diff;
G[0] = hits[i]->hits[0].G + a[1].G;
G[1] = hits[i+1]->hits[0].G + a[0].G;
- diff = fabs(G[0] - G[1]) / (opt->a + opt->b) / ((hits[i]->hits[0].len + a[1].len + hits[i+1]->hits[0].len + a[0].len) / 2.);
+ diff = fabs((double)(G[0] - G[1])) / (opt->a + opt->b) / ((hits[i]->hits[0].len + a[1].len + hits[i+1]->hits[0].len + a[0].len) / 2.);
if (diff > 0.05) a[G[0] > G[1]? 0 : 1].G = 0;
}
if (a[0].G == 0 || a[1].G == 0) { // one proper pair only
diff --git a/fastmap.c b/fastmap.c
index 1e49ccb..d89c495 100644
--- a/fastmap.c
+++ b/fastmap.c
@@ -130,7 +130,7 @@ int main_mem(int argc, char *argv[])
aux.opt = opt = mem_opt_init();
memset(&opt0, 0, sizeof(mem_opt_t));
- while ((c = getopt(argc, argv, "1paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:W:x:G:h:y:K:X:H:")) >= 0) {
+ while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) {
if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1;
else if (c == '1') no_mt_io = 1;
else if (c == 'x') mode = optarg;
@@ -147,6 +147,7 @@ int main_mem(int argc, char *argv[])
else if (c == 'S') opt->flag |= MEM_F_NO_RESCUE;
else if (c == 'Y') opt->flag |= MEM_F_SOFTCLIP;
else if (c == 'V') opt->flag |= MEM_F_REF_HDR;
+ else if (c == '5') opt->flag |= MEM_F_PRIMARY5;
else if (c == 'c') opt->max_occ = atoi(optarg), opt0.max_occ = 1;
else if (c == 'd') opt->zdrop = atoi(optarg), opt0.zdrop = 1;
else if (c == 'v') bwa_verbose = atoi(optarg);
@@ -157,6 +158,7 @@ int main_mem(int argc, char *argv[])
else if (c == 's') opt->split_width = atoi(optarg), opt0.split_width = 1;
else if (c == 'G') opt->max_chain_gap = atoi(optarg), opt0.max_chain_gap = 1;
else if (c == 'N') opt->max_chain_extend = atoi(optarg), opt0.max_chain_extend = 1;
+ else if (c == 'o' || c == 'f') xreopen(optarg, "wb", stdout);
else if (c == 'W') opt->min_chain_weight = atoi(optarg), opt0.min_chain_weight = 1;
else if (c == 'y') opt->max_mem_intv = atol(optarg), opt0.max_mem_intv = 1;
else if (c == 'C') aux.copy_comment = 1;
@@ -256,7 +258,7 @@ int main_mem(int argc, char *argv[])
fprintf(stderr, " -E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [%d,%d]\n", opt->e_del, opt->e_ins);
fprintf(stderr, " -L INT[,INT] penalty for 5'- and 3'-end clipping [%d,%d]\n", opt->pen_clip5, opt->pen_clip3);
fprintf(stderr, " -U INT penalty for an unpaired read pair [%d]\n\n", opt->pen_unpaired);
- fprintf(stderr, " -x STR read type. Setting -x changes multiple parameters unless overriden [null]\n");
+ fprintf(stderr, " -x STR read type. Setting -x changes multiple parameters unless overridden [null]\n");
fprintf(stderr, " pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)\n");
fprintf(stderr, " ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)\n");
fprintf(stderr, " intractg: -B9 -O16 -L5 (intra-species contigs to ref)\n");
@@ -264,9 +266,12 @@ int main_mem(int argc, char *argv[])
fprintf(stderr, " -p smart pairing (ignoring in2.fq)\n");
fprintf(stderr, " -R STR read group header line such as '@RG\\tID:foo\\tSM:bar' [null]\n");
fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n");
+ fprintf(stderr, " -o FILE sam file to output results to [stdout]\n");
fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)\n");
+ fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordiate as primary\n");
+ fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n");
fprintf(stderr, "\n");
- fprintf(stderr, " -v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose);
+ fprintf(stderr, " -v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose);
fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T);
fprintf(stderr, " -h INT[,INT] if there are <INT hits with score >80%% of the max score, output all in XA [%d,%d]\n", opt->max_XA_hits, opt->max_XA_hits_alt);
fprintf(stderr, " -a output all alignments for SE or unpaired PE\n");
diff --git a/kthread.c b/kthread.c
index 1b13494..780de19 100644
--- a/kthread.c
+++ b/kthread.c
@@ -1,4 +1,5 @@
#include <pthread.h>
+#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
diff --git a/main.c b/main.c
index f936a81..e2b7cf1 100644
--- a/main.c
+++ b/main.c
@@ -4,7 +4,7 @@
#include "utils.h"
#ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "0.7.15-r1140"
+#define PACKAGE_VERSION "0.7.16a-r1181"
#endif
int bwa_fa2pac(int argc, char *argv[]);
diff --git a/malloc_wrap.c b/malloc_wrap.c
index 100b8cb..6165e04 100644
--- a/malloc_wrap.c
+++ b/malloc_wrap.c
@@ -13,7 +13,7 @@ void *wrap_calloc(size_t nmemb, size_t size,
void *p = calloc(nmemb, size);
if (NULL == p) {
fprintf(stderr,
- "[%s] Failed to allocate %zd bytes at %s line %u: %s\n",
+ "[%s] Failed to allocate %zu bytes at %s line %u: %s\n",
func, nmemb * size, file, line, strerror(errno));
exit(EXIT_FAILURE);
}
@@ -25,7 +25,7 @@ void *wrap_malloc(size_t size,
void *p = malloc(size);
if (NULL == p) {
fprintf(stderr,
- "[%s] Failed to allocate %zd bytes at %s line %u: %s\n",
+ "[%s] Failed to allocate %zu bytes at %s line %u: %s\n",
func, size, file, line, strerror(errno));
exit(EXIT_FAILURE);
}
@@ -37,7 +37,7 @@ void *wrap_realloc(void *ptr, size_t size,
void *p = realloc(ptr, size);
if (NULL == p) {
fprintf(stderr,
- "[%s] Failed to allocate %zd bytes at %s line %u: %s\n",
+ "[%s] Failed to allocate %zu bytes at %s line %u: %s\n",
func, size, file, line, strerror(errno));
exit(EXIT_FAILURE);
}
@@ -49,7 +49,7 @@ char *wrap_strdup(const char *s,
char *p = strdup(s);
if (NULL == p) {
fprintf(stderr,
- "[%s] Failed to allocate %zd bytes at %s line %u: %s\n",
+ "[%s] Failed to allocate %zu bytes at %s line %u: %s\n",
func, strlen(s), file, line, strerror(errno));
exit(EXIT_FAILURE);
}
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/bwa.git
More information about the debian-med-commit
mailing list