[med-svn] [Git][med-team/bowtie][upstream] New upstream version 1.2.3+dfsg
Alexandre Mestiashvili
gitlab at salsa.debian.org
Thu Jul 11 15:11:28 BST 2019
Alexandre Mestiashvili pushed to branch upstream at Debian Med / bowtie
Commits:
8fdd41ca by Alexandre Mestiashvili at 2019-07-10T15:14:53Z
New upstream version 1.2.3+dfsg
- - - - -
25 changed files:
- MANUAL
- MANUAL.markdown
- Makefile
- NEWS
- VERSION
- blockwise_sa.h
- bowtie
- bowtie-inspect
- doc/manual.html
- doc/website/latest-version.txt
- doc/website/manual.ssi
- doc/website/recent_news.ssi
- doc/website/rhsidebar.ssi
- ebwt.cpp
- ebwt.h
- ebwt_build.cpp
- ebwt_search.cpp
- filebuf.h
- hit.h
- pat.cpp
- processor_support.h
- sam.cpp
- + scripts/bowtie-hbb.sh
- + scripts/run-hbb.sh
- shmem.h
Changes:
=====================================
MANUAL
=====================================
@@ -199,7 +199,7 @@ Note that [Maq] internally rounds base qualities to the nearest 10 and
rounds qualities greater than 30 to 30. To maintain compatibility,
Bowtie does the same. Rounding can be suppressed with the
`--nomaqround` option.
-
+
Bowtie is not fully sensitive in `-n` 2 and `-n` 3 modes by default.
In these modes Bowtie imposes a "backtracking limit" to limit effort
spent trying to find valid alignments for low-quality reads unlikely to
@@ -352,7 +352,7 @@ output no alignments.
Specifying `-m` 5 instructs bowtie to refrain from reporting any
alignments for reads having more than 5 reportable alignments. Since
the read has exactly 5 reportable alignments, the `-m` 5 limit allows
-`bowtie` to print them as usual.
+`bowtie` to print them as usual.
Example 9: `-a -m 3 --best --strata`
@@ -383,7 +383,7 @@ A valid paired-end alignment satisfies these criteria:
defined by the `-v`/`-n`/`-e`/`-l` options.
2. The relative orientation and position of the mates satisfy the
constraints defined by the `-I`/`-X`/`--fr`/`--rf`/`--ff`
- options.
+ options.
Policies governing which paired-end alignments are reported for a
given read are specified using the `-k`, `-a` and `-m` options as
@@ -705,6 +705,18 @@ The query input files (specified either as `<m1>` and `<m2>`, or as
or similar). All quality values are assumed to be 40 on the [Phred
quality] scale.
+ -F
+
+Reads are substrings (k-mers) extracted from a FASTA file `s`.
+Specifically, for every reference sequence in FASTA file `s`, Bowtie
+2 aligns the k-mers at offsets 1, 1+i, 1+2i, ... until reaching the
+end of the reference. Each k-mer is aligned as a separate read.
+Quality values are set to all Is (40 on Phred scale). Each k-mer
+(read) is given a name like `sequence`_`offset`, where `sequence`
+is the name of the FASTA sequence it was drawn from and `offset`
+is its 0-based offset of origin with respect to the sequence. Only
+single k-mers, i.e. unpaired reads, can be aligned in this way.
+
-r
The query input files (specified either as `<m1>` and `<m2>`, or as
@@ -786,7 +798,7 @@ versions prior to 1.3. Default: off.
--solexa1.3-quals
Same as `--phred64-quals`. This is usually the right option for use
-with (unconverted) reads emitted by GA Pipeline version 1.3 or later.
+with (unconverted) reads emitted by GA Pipeline version 1.3 or later.
Default: off.
--integer-quals
@@ -1025,7 +1037,7 @@ disallowed via the `-k` option) and they fall into more than one
alignment "stratum", report only those alignments that fall into the
best stratum. By default, Bowtie reports all reportable alignments
regardless of whether they fall into multiple strata. When
-`--strata` is specified, `--best` must also be specified.
+`--strata` is specified, `--best` must also be specified.
Output
@@ -1271,7 +1283,7 @@ Default `bowtie` output
character of the alignment occurs
5. Read sequence (reverse-complemented if orientation is `-`).
-
+
If the read was in colorspace, then the sequence shown in this
column is the sequence of *decoded nucleotides*, not the original
colors. See the [Colorspace alignment] section for details about
@@ -1280,7 +1292,7 @@ Default `bowtie` output
6. ASCII-encoded read qualities (reversed if orientation is `-`). The
encoded quality values are on the Phred scale and the encoding is
ASCII-offset by 33 (ASCII char `!`).
-
+
If the read was in colorspace, then the qualities shown in this
column are the *decoded qualities*, not the original qualities.
See the [Colorspace alignment] section for details about decoding.
@@ -1290,7 +1302,7 @@ Default `bowtie` output
this read, this column contains the value of the ceiling,
indicating that at least that many valid alignments were found in
addition to the one reported.
-
+
Otherwise, this column contains the number of other instances where
the same sequence aligned against the same reference characters as
were aligned against in the reported alignment. This is *not* the
@@ -1304,7 +1316,7 @@ Default `bowtie` output
mismatches in the alignment, this field is empty. A single
descriptor has the format offset:reference-base>read-base. The
offset is expressed as a 0-based offset from the high-quality (5')
- end of the read.
+ end of the read.
SAM `bowtie` output
-------------------
@@ -1390,7 +1402,7 @@ right, the fields are:
11. ASCII-encoded read qualities (reverse-complemented if the read
aligned to the reverse strand). The encoded quality values are on
the [Phred quality] scale and the encoding is ASCII-offset by 33
- (ASCII char `!`), similarly to a [FASTQ] file.
+ (ASCII char `!`), similarly to a [FASTQ] file.
12. Optional fields. Fields are tab-separated. For descriptions of
all possible optional fields, see the SAM format specification.
@@ -1441,10 +1453,10 @@ The `bowtie-build` indexer
`bowtie-build` outputs a set of 6 files with suffixes `.1.ebwt`,
`.2.ebwt`, `.3.ebwt`, `.4.ebwt`, `.rev.1.ebwt`, and `.rev.2.ebwt`. (If
the total length of all the input sequences is greater than about 4
-billion, then the index files will end in `ebwtl` instead of `ebwt`.)
+billion, then the index files will end in `ebwtl` instead of `ebwt`.)
These files together constitute the index: they are all that is needed
to align reads to that reference. The original sequence files are no
-longer used by Bowtie once the index is built.
+longer used by Bowtie once the index is built.
Use of Karkkainen's [blockwise algorithm] allows `bowtie-build` to
trade off between running time and memory usage. `bowtie-build` has
@@ -1580,7 +1592,7 @@ governs how many rows get marked: the indexer will mark every 2^`<int>`
rows. Marking more rows makes reference-position lookups faster, but
requires more memory to hold the annotations at runtime. The default
is 5 (every 32nd row is marked; for human genome, annotations occupy
-about 340 megabytes).
+about 340 megabytes).
-t/--ftabchars <int>
=====================================
MANUAL.markdown
=====================================
@@ -207,7 +207,7 @@ Note that [Maq] internally rounds base qualities to the nearest 10 and
rounds qualities greater than 30 to 30. To maintain compatibility,
Bowtie does the same. Rounding can be suppressed with the
[`--nomaqround`] option.
-
+
Bowtie is not fully sensitive in [`-n`] 2 and [`-n`] 3 modes by default.
In these modes Bowtie imposes a "backtracking limit" to limit effort
spent trying to find valid alignments for low-quality reads unlikely to
@@ -362,7 +362,7 @@ output no alignments.
Specifying [`-m`] 5 instructs bowtie to refrain from reporting any
alignments for reads having more than 5 reportable alignments. Since
the read has exactly 5 reportable alignments, the [`-m`] 5 limit allows
-`bowtie` to print them as usual.
+`bowtie` to print them as usual.
### Example 9: `-a -m 3 --best --strata`
@@ -393,7 +393,7 @@ A valid paired-end alignment satisfies these criteria:
defined by the [`-v`]/[`-n`]/[`-e`]/[`-l`] options.
2. The relative orientation and position of the mates satisfy the
constraints defined by the [`-I`]/[`-X`]/[`--fr`]/[`--rf`]/[`--ff`]
- options.
+ options.
Policies governing which paired-end alignments are reported for a
given read are specified using the [`-k`], [`-a`] and [`-m`] options as
@@ -767,6 +767,24 @@ The query input files (specified either as `<m1>` and `<m2>`, or as
or similar). All quality values are assumed to be 40 on the [Phred
quality] scale.
+</td></tr><tr><td id="bowtie-options-F">
+
+[`-F`]: #bowtie-options-F
+
+ -F
+
+</td><td>
+
+Reads are substrings (k-mers) extracted from a FASTA file `s`.
+Specifically, for every reference sequence in FASTA file `s`, Bowtie
+2 aligns the k-mers at offsets 1, 1+i, 1+2i, ... until reaching the
+end of the reference. Each k-mer is aligned as a separate read.
+Quality values are set to all Is (40 on Phred scale). Each k-mer
+(read) is given a name like `sequence`_`offset`, where `sequence`
+is the name of the FASTA sequence it was drawn from and `offset`
+is its 0-based offset of origin with respect to the sequence. Only
+single k-mers, i.e. unpaired reads, can be aligned in this way.
+
</td></tr><tr><td id="bowtie-options-r">
[`-r`]: #bowtie-options-r
@@ -938,7 +956,7 @@ versions prior to 1.3. Default: off.
</td><td>
Same as [`--phred64-quals`]. This is usually the right option for use
-with (unconverted) reads emitted by GA Pipeline version 1.3 or later.
+with (unconverted) reads emitted by GA Pipeline version 1.3 or later.
Default: off.
</td></tr><tr><td id="bowtie-options-integer-quals">
@@ -1333,7 +1351,7 @@ disallowed via the [`-k`] option) and they fall into more than one
alignment "stratum", report only those alignments that fall into the
best stratum. By default, Bowtie reports all reportable alignments
regardless of whether they fall into multiple strata. When
-[`--strata`] is specified, [`--best`] must also be specified.
+[`--strata`] is specified, [`--best`] must also be specified.
</td></tr>
</table>
@@ -1781,7 +1799,7 @@ Default `bowtie` output
character of the alignment occurs
5. Read sequence (reverse-complemented if orientation is `-`).
-
+
If the read was in colorspace, then the sequence shown in this
column is the sequence of *decoded nucleotides*, not the original
colors. See the [Colorspace alignment] section for details about
@@ -1790,7 +1808,7 @@ Default `bowtie` output
6. ASCII-encoded read qualities (reversed if orientation is `-`). The
encoded quality values are on the Phred scale and the encoding is
ASCII-offset by 33 (ASCII char `!`).
-
+
If the read was in colorspace, then the qualities shown in this
column are the *decoded qualities*, not the original qualities.
See the [Colorspace alignment] section for details about decoding.
@@ -1800,7 +1818,7 @@ Default `bowtie` output
this read, this column contains the value of the ceiling,
indicating that at least that many valid alignments were found in
addition to the one reported.
-
+
Otherwise, this column contains the number of other instances where
the same sequence aligned against the same reference characters as
were aligned against in the reported alignment. This is *not* the
@@ -1814,7 +1832,7 @@ Default `bowtie` output
mismatches in the alignment, this field is empty. A single
descriptor has the format offset:reference-base>read-base. The
offset is expressed as a 0-based offset from the high-quality (5')
- end of the read.
+ end of the read.
SAM `bowtie` output
-------------------
@@ -1934,7 +1952,7 @@ right, the fields are:
11. ASCII-encoded read qualities (reverse-complemented if the read
aligned to the reverse strand). The encoded quality values are on
the [Phred quality] scale and the encoding is ASCII-offset by 33
- (ASCII char `!`), similarly to a [FASTQ] file.
+ (ASCII char `!`), similarly to a [FASTQ] file.
12. Optional fields. Fields are tab-separated. For descriptions of
all possible optional fields, see the SAM format specification.
@@ -2011,10 +2029,10 @@ The `bowtie-build` indexer
`bowtie-build` outputs a set of 6 files with suffixes `.1.ebwt`,
`.2.ebwt`, `.3.ebwt`, `.4.ebwt`, `.rev.1.ebwt`, and `.rev.2.ebwt`. (If
the total length of all the input sequences is greater than about 4
-billion, then the index files will end in `ebwtl` instead of `ebwt`.)
+billion, then the index files will end in `ebwtl` instead of `ebwt`.)
These files together constitute the index: they are all that is needed
to align reads to that reference. The original sequence files are no
-longer used by Bowtie once the index is built.
+longer used by Bowtie once the index is built.
Use of Karkkainen's [blockwise algorithm] allows `bowtie-build` to
trade off between running time and memory usage. `bowtie-build` has
@@ -2221,7 +2239,7 @@ governs how many rows get marked: the indexer will mark every 2^`<int>`
rows. Marking more rows makes reference-position lookups faster, but
requires more memory to hold the annotations at runtime. The default
is 5 (every 32nd row is marked; for human genome, annotations occupy
-about 340 megabytes).
+about 340 megabytes).
</td></tr><tr><td>
@@ -2425,4 +2443,3 @@ Print version information and quit.
Print usage information and quit.
</td></tr></table>
-
=====================================
Makefile
=====================================
@@ -420,9 +420,10 @@ doc: doc/manual.html MANUAL
doc/manual.html: MANUAL.markdown
echo "<h1>Table of Contents</h1>" > .tmp.head
- pandoc -T "Bowtie Manual" -B .tmp.head \
+ pandoc -B .tmp.head \
--css style.css -o $@ \
--from markdown --to HTML \
+ --metadata title:"Bowtie Manual" \
--table-of-contents $^
MANUAL: MANUAL.markdown
=====================================
NEWS
=====================================
@@ -5,14 +5,15 @@ Bowtie NEWS
Bowtie is now available for download. 0.9.0 is the first version to
be released under the OSI Artistic License (see `LICENSE') and freely
-available to the public for download. The current version is 1.2.1.1.
+available to the public for download. The current version is 1.2.3.
Reporting Issues
================
Please report any issues using the Sourceforge bug tracker:
- https://sourceforge.net/tracker/?group_id=236897&atid=1101606
+ * https://github.com/BenLangmead/bowtie/issues
+ * https://sourceforge.net/tracker/?group_id=236897&atid=1101606
Announcements
=============
@@ -26,7 +27,20 @@ subscribe to our mailing list:
Version Release History
=======================
+Version 1.2.3 - Jul 5, 2019
+ * Added support for reading and inspecting Bowtie 2 indexes.
+ Bowtie 2 indexes can now be used with either Bowtie or Bowtie 2.
+ * Added support for building an index from a gzipped-compressed
+ FASTA.
+ * Fixed issue preventing bowtie from reporting repeated alignments
+ when -M is specified.
+ * Fixed issue with -F mode omitting final base of each read.
+ * Fixed clipping of first letter of first read in batches after first.
+ * Fixed an issue preventing bowtie wrapper script from finding indexes.
+
Version 1.2.2 - Dec 11, 2017
+Update (12/12/2017): We have had to re-release this version of bowtie to address an issue when compiling with pthreads (make NO_TBB=1).
+
* Fixed major issue causing corrupt SAM output when using many threads (-p/--threads) on certain systems
* Fixed major issue with incorrect alignment offsets being reported in --large-index mode
* Fixed major issue with reads files being skipped when multiple inputs were specified together with -p/--threads
=====================================
VERSION
=====================================
@@ -1 +1 @@
-1.2.2
+1.2.3
=====================================
blockwise_sa.h
=====================================
@@ -74,7 +74,11 @@ class BlockwiseSA {
_logger(__logger)
{ }
- virtual ~BlockwiseSA() { }
+ virtual ~BlockwiseSA()
+#if __cplusplus > 199711L
+ noexcept(false)
+#endif
+ { }
/**
* Get the next suffix; compute the next bucket if necessary.
@@ -200,7 +204,11 @@ class KarkkainenBlockwiseSA : public InorderBlockwiseSA<TStr> {
#endif
{ _randomSrc.init(__seed); reset(); }
- ~KarkkainenBlockwiseSA() {
+ ~KarkkainenBlockwiseSA()
+#if __cplusplus > 199711L
+ noexcept(false)
+#endif
+ {
if(_dc != NULL) delete _dc; _dc = NULL; // difference cover sample
if (_done != NULL) {
delete[] _done;
=====================================
bowtie
=====================================
@@ -69,6 +69,7 @@ def main():
bin_spec = os.path.join(ex_path, bin_s)
if args.verbose:
+ bowtie_args.append('--verbose')
logging.getLogger().setLevel(logging.INFO)
if args.large_index:
@@ -85,14 +86,16 @@ def main():
if tot_size > small_index_max_size:
bin_spec = os.path.join(ex_path, bin_l)
else:
- if os.path.exists(args.index + idx_ext_l):
+ if not os.path.exists(arg + idx_ext_s)\
+ and os.path.exists(args.index + idx_ext_l):
bin_spec = os.path.join(ex_path, bin_l)
bowtie_args.insert(0, args.index)
else:
for arg in bowtie_args:
if arg[0] == '-':
continue
- if os.path.exists(arg + idx_ext_l):
+ if not os.path.exists(arg + idx_ext_s)\
+ and os.path.exists(arg + idx_ext_l):
bin_spec = os.path.join(ex_path, bin_l)
if args.debug:
=====================================
bowtie-inspect
=====================================
@@ -34,8 +34,8 @@ def main():
inspect_bin_name = "bowtie-inspect"
inspect_bin_s = "bowtie-inspect-s"
inspect_bin_l = "bowtie-inspect-l"
- idx_ext_l = '.1.ebwtl';
- idx_ext_s = '.1.ebwt';
+ idx_ext_l = '.1.ebwtl'
+ idx_ext_s = '.1.ebwt'
curr_script = os.path.realpath(inspect.getsourcefile(main))
ex_path = os.path.dirname(curr_script)
inspect_bin_spec = os.path.join(ex_path,inspect_bin_s)
@@ -43,12 +43,13 @@ def main():
options,arguments = bld.build_args()
if '--verbose' in options:
+ arguments.append('--verbose')
logging.getLogger().setLevel(logging.INFO)
-
+
if '--debug' in options:
inspect_bin_spec += '-debug'
inspect_bin_l += '-debug'
-
+
if '--large-index' in options:
inspect_bin_spec = os.path.join(ex_path,inspect_bin_l)
elif len(arguments) >= 1:
@@ -58,18 +59,13 @@ def main():
if large_idx_exists and not small_idx_exists:
logging.info("No small index but a large one is present. Inspecting the large index.")
inspect_bin_spec = os.path.join(ex_path,inspect_bin_l)
-
+
arguments[0] = inspect_bin_name
arguments.insert(1, 'basic-0')
arguments.insert(1, '--wrapper')
logging.info('Command: %s %s' % (inspect_bin_spec,' '.join(arguments[1:])))
- os.execv(inspect_bin_spec, arguments)
-
-
-if __name__ == "__main__":
- main()
-
-
-
+ os.execv(inspect_bin_spec, arguments)
+if __name__ == "__main__":
+ main()
=====================================
doc/manual.html
=====================================
The diff for this file was not included because it is too large.
=====================================
doc/website/latest-version.txt
=====================================
@@ -1 +1 @@
-0.12.8
+i.2.3
=====================================
doc/website/manual.ssi
=====================================
The diff for this file was not included because it is too large.
=====================================
doc/website/recent_news.ssi
=====================================
@@ -1,13 +1,35 @@
+<h2>1.2.3 - 07/05/2019</h2>
+<ul>
+ <li>Added support for reading and inspecting Bowtie 2 indexes. Bowtie 2 indexes can now be used with either Bowtie or Bowtie 2.</li>
+ <li>Added support for building an index from a gzipped-compressed FASTA.</li>
+ <li>Fixed issue preventing bowtie from reporting repeated alignments when <tt><a href="manual.shtml#bowtie-options-M">-M</a></tt> is specified.</li>
+ <li>Fixed issue with <tt><a href="manual.shtml#bowtie-options-F">-F</a></tt> mode omitting final base of each read.</li>
+ <li>Fixed clipping of first letter of first read in batches after first.</li>
+ <li>Fixed an issue preventing bowtie wrapper script from finding indexes.</li>
+</ul>
+
+
+<h2>1000-Genomes major-allele SNP references -- 4/26/2019</h2>
+<ul>
+ <li>For each base where the typical reference has the non-majority allele (according to the <a href="http://www.internationalgenome.org">1000 Genomes Project</a>, we substituted in the majority allele instead
+ <li>Links for indexes added to sidebar, as are links for the edited FASTA files
+ <li>We made versions both for GRCh38 primary assembly and hg19 assembly
+ <li>See <a href="https://github.com/BenLangmead/bowtie-majref">how we created them</a>
+ <li>Only SNPs (single-base substitutions) are considered for now; indels are future work
+ <li>Because only SNPs are considered, coordinates (e.g. gene annotations) are the same as for typical GRCh38 and hg19 assemblies. Most downstream tools are unaffected as long as major-allele-edited FASTAs are used wherever genome sequences are required.
+</ul>
+
<h2>1.2.2 - 12/11/2017</h2>
+<p>Update (12/12/2017): We have had to re-release this version of bowtie to address an issue when compiling with pthreads (<tt>make NO_TBB=1</tt>).</p>
<ul>
- <li>Fixed major issue causing corrupt SAM output when using many threads (-p/--threads) on certain systems</li>
- <li>Fixed major issue with incorrect alignment offsets being reported in --large-index mode</li>
- <li>Fixed major issue with reads files being skipped when multiple inputs were specified together with -p/--threads</li>
+ <li>Fixed major issue causing corrupt SAM output when using many threads (<tt><a href="manual.shtml#bowtie-options-p">-p/--threads</a></tt>) on certain systems</li>
+ <li>Fixed major issue with incorrect alignment offsets being reported in <tt><a href="manual.shtml#bowtie-options-large-index">--large-index</a></tt> mode</li>
+ <li>Fixed major issue with reads files being skipped when multiple inputs were specified together with <tt><a href="manual.shtml#bowtie-options-p">-p/--threads</a></tt></li>
<li>The official LICENSE of Bowtie was changed to Artistic License 2.0. This fixes an issue with the previous LICENSE, which mistakenly combined elements of different open-source licenses.</li>
- <li>Fixed issue where bowtie would still run for a long time even when -u was set to a small number.</li>
+ <li>Fixed issue where bowtie would still run for a long time even when <tt><a href="manual.shtml#bowtie-options-u">-u</a></tt> was set to a small number.</li>
<li>Fixed spurious "Reads file contained a pattern with more than 1024 quality values" error for some colorspace inputs.</li>
- <li>Fixed issue with --strata sometimes failing to suppress alignments at lower strata.</li>
- <li>Fixed issue with ends of paired-end reads sometimes appearing in non-adjacent lines of the SAM output with -p/--threads >1</li>
+ <li>Fixed issue with <tt><a href="manual.shtml#bowtie-options-strata">--strata</a></tt> sometimes failing to suppress alignments at lower strata.</li>
+ <li>Fixed issue with ends of paired-end reads sometimes appearing in non-adjacent lines of the SAM output with <tt><a href="manual.shtml#bowtie-options-p">-p/--threads</a></tt> >1</li>
<li>Fixed issue whereby the read name of end #2 was not always truncated at the first whitespace character</li>
<li>Code simplifications</li>
</ul>
=====================================
doc/website/rhsidebar.ssi
=====================================
@@ -18,10 +18,10 @@
</tr>
<tr>
<td>
- <a href="https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.2.2">Bowtie 1.2.2</a>
+ <a href="https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.2.3">Bowtie 1.2.3</a>
</td>
<td align="right">
- 12/11/17
+ 07/05/19
</td>
</tr>
<tr>
@@ -105,37 +105,67 @@
<b>2.7 GB</b>
</td></tr>
+ <!-- GRCh38 major-allele -->
+ <tr><td>
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/grch38_1kgmaj_bt.zip"><i>H. sapiens</i>, NCBI GRCh38 <br/> with 1KGenomes major SNPs</a>
+ </td><td align="right">
+ <b>2.7 GB</b>
+ </td></tr>
+ <tr><td colspan="2" style="font-size: x-small"> <a href="https://github.com/BenLangmead/bowtie-majref">How we built this</a>, <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/grch38_1kgmaj.fa.gz">FASTA</a>
+ </td></tr>
+
<!-- hg19 -->
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip"><i>H. sapiens</i>, UCSC hg19</a>
</td><td align="right">
<b>2.7 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.1.zip">part 1</a> - 1.7 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.2.zip">part 2</a> - 1.0 GB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
+ <!-- hg19 major-allele -->
+ <tr><td>
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19_1kgmaj_bt.zip"><i>H. sapiens</i>, UCSC hg19 <br/> with 1KGenomes major SNPs</a>
+ </td><td align="right">
+ <b>2.7 GB</b>
+ </td></tr>
+ <tr><td colspan="2" style="font-size: x-small"> <a href="https://github.com/BenLangmead/bowtie-majref">How we built this</a>, <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19_1kgmaj.fa.gz">FASTA</a>
+ </td></tr>
+
<!-- hg18 -->
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18.ebwt.zip"><i>H. sapiens</i>, UCSC hg18</a>
</td><td align="right">
<b>2.7 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18.ebwt.1.zip">part 1</a> - 1.7 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18.ebwt.2.zip">part 2</a> - 1.0 GB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg18_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -145,14 +175,20 @@
</td><td align="right">
<b>2.7 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_37_asm.ebwt.1.zip">part 1</a> - 1.7 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_37_asm.ebwt.2.zip">part 2</a> - 1.0 GB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_37_asm_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_37_asm_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_37_asm_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_37_asm_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -162,14 +198,20 @@
</td><td align="right">
<b>2.7 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_asm.ebwt.1.zip">part 1</a> - 1.7 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_asm.ebwt.2.zip">part 2</a> - 1.0 GB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_asm_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_asm_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_asm_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/h_sapiens_asm_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -179,14 +221,20 @@
</td><td align="right">
<b>2.4 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm8.ebwt.1.zip">part 1</a> - 1.5 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm8.ebwt.2.zip">part 2</a> - 900 MB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm8_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm8_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm8_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm8_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -196,14 +244,20 @@
</td><td align="right">
<b>2.4 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9.ebwt.1.zip">part 1</a> - 1.5 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9.ebwt.2.zip">part 2</a> - 900 MB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -213,14 +267,20 @@
</td><td align="right">
<b>2.4 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/m_musculus_ncbi37.ebwt.1.zip">part 1</a> - 1.5 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/m_musculus_ncbi37.ebwt.2.zip">part 2</a> - 900 MB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/m_musculus_ncbi37_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/m_musculus_ncbi37_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/m_musculus_ncbi37_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/m_musculus_ncbi37_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -230,14 +290,20 @@
</td><td align="right">
<b>2.4 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/rn4.ebwt.1.zip">part 1</a> - 1.5 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/rn4.ebwt.2.zip">part 2</a> - 900 MB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/rn4_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/rn4_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/rn4_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/rn4_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -247,14 +313,20 @@
</td><td align="right">
<b>2.4 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/b_taurus.ebwt.1.zip">part 1</a> - 1.5 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/b_taurus.ebwt.2.zip">part 2</a> - 900 MB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/b_taurus_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/b_taurus_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/b_taurus_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/b_taurus_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -264,14 +336,20 @@
</td><td align="right">
<b>2.4 GB</b>
</td></tr>
- <tr><td colspan=2><table width="100%"><tr>
+ <tr><td colspan=2><table width="100%">
+ <!--
+ <tr>
<td> or: <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/canFam2.ebwt.1.zip">part 1</a> - 1.5 GB,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/canFam2.ebwt.2.zip">part 2</a> - 900 MB</td>
- </tr><tr>
+ </tr>
+ -->
+ <tr>
<td> colorspace:
- <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/canFam2_c.ebwt.zip">full</a>, or
+ <a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/canFam2_c.ebwt.zip">full</a>
+ <!-- , or
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/canFam2_c.ebwt.1.zip">part 1</a>,
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/canFam2_c.ebwt.2.zip">part 2</a></td>
+ -->
</tr></table>
</td></tr>
@@ -279,7 +357,7 @@
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/galGal3.ebwt.zip"><i>G. gallus</i>, UCSC, galGal3</a>
</td><td align="right">
- 1.1 GB
+ <b>1.1 GB</b>
</td></tr>
<tr><td colspan=2><table width="100%"><tr>
<td> colorspace:
@@ -291,7 +369,7 @@
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/d_melanogaster_fb5_22.ebwt.zip"><i>D. melanogaster</i>, Flybase, r5.22</a>
</td><td align="right">
- 150 MB
+ <b>150 MB</b>
</td></tr>
<tr><td colspan=2><table width="100%"><tr>
<td> colorspace:
@@ -303,7 +381,7 @@
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/a_thaliana.ebwt.zip"><i>A. thaliana</i>, TAIR, TAIR9</a>
</td><td align="right">
- 120 MB
+ <b>120 MB</b>
</td></tr>
<tr><td colspan=2><table width="100%"><tr>
<td> colorspace:
@@ -315,7 +393,7 @@
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/c_elegans_ws200.ebwt.zip"><i>C. elegans</i>, Wormbase, WS200</a>
</td><td align="right">
- 75 MB
+ <b>75 MB</b>
</td></tr>
<tr><td colspan=2><table width="100%"><tr>
<td> colorspace:
@@ -327,7 +405,7 @@
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/s_cerevisiae.ebwt.zip"><i>S. cerevisiae</i>, CYGD</a>
</td><td align="right">
- 15 MB
+ <b>15 MB</b>
</td></tr>
<tr><td colspan=2><table width="100%"><tr>
<td> colorspace:
@@ -339,7 +417,7 @@
<tr><td>
<a href="ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/e_coli.ebwt.zip"><i>E. coli</i>, NCBI, st. 536</a>
</td><td align="right">
- 5 MB
+ <b>5 MB</b>
</td></tr>
<tr><td colspan=2><table width="100%"><tr>
<td> colorspace:
=====================================
ebwt.cpp
=====================================
@@ -10,16 +10,17 @@
#include <iostream>
#include <fstream>
#include <stdlib.h>
+#include <unistd.h>
using namespace std;
#ifdef BOWTIE_64BIT_INDEX
-
std::string gEbwt_ext("ebwtl");
+std::string gBt2_ext("bt2l");
#else
-
std::string gEbwt_ext("ebwt");
+std::string gBt2_ext("bt2");
#endif // BOWTIE_64BIT_INDEX
@@ -33,12 +34,15 @@ string gLastIOErrMsg;
* "$BOWTIE_INDEXES/".
*/
string adjustEbwtBase(const string& cmdline,
- const string& ebwtFileBase,
- bool verbose = false)
+ const string& ebwtFileBase,
+ bool verbose = false)
{
string str = ebwtFileBase;
ifstream in;
if(verbose) cout << "Trying " << str << endl;
+ if (access((str + ".1." + gBt2_ext).c_str(), R_OK) == 0) {
+ gEbwt_ext = gBt2_ext;
+ }
in.open((str + ".1." + gEbwt_ext).c_str(), ios_base::in | ios::binary);
if(!in.is_open()) {
if(verbose) cout << " didn't work" << endl;
@@ -60,7 +64,10 @@ string adjustEbwtBase(const string& cmdline,
if(getenv("BOWTIE_INDEXES") != NULL) {
str = string(getenv("BOWTIE_INDEXES")) + "/" + ebwtFileBase;
if(verbose) cout << "Trying " << str << endl;
- in.open((str + ".1.ebwt").c_str(), ios_base::in | ios::binary);
+ if (access((str + ".1." + gBt2_ext).c_str(), R_OK) == 0) {
+ gEbwt_ext = gBt2_ext;
+ }
+ in.open((str + ".1." + gEbwt_ext).c_str(), ios_base::in | ios::binary);
if(!in.is_open()) {
if(verbose) cout << " didn't work" << endl;
in.close();
=====================================
ebwt.h
=====================================
@@ -126,28 +126,30 @@ public:
int32_t isaRate,
int32_t ftabChars,
bool color,
- bool entireReverse)
+ bool entireReverse,
+ bool isBt2Index)
{
- init(len, lineRate, linesPerSide, offRate, isaRate, ftabChars, color, entireReverse);
+ init(len, lineRate, linesPerSide, offRate, isaRate, ftabChars, color, entireReverse, isBt2Index);
}
EbwtParams(const EbwtParams& eh) {
init(eh._len, eh._lineRate, eh._linesPerSide, eh._offRate,
- eh._isaRate, eh._ftabChars, eh._color, eh._entireReverse);
+ eh._isaRate, eh._ftabChars, eh._color, eh._entireReverse, eh._isBt2Index);
}
void init(TIndexOffU len, int32_t lineRate, int32_t linesPerSide,
int32_t offRate, int32_t isaRate, int32_t ftabChars,
- bool color, bool entireReverse)
+ bool color, bool entireReverse, bool isBt2Index)
{
_color = color;
+ _isBt2Index = isBt2Index;
_entireReverse = entireReverse;
_len = len;
_bwtLen = _len + 1;
_sz = (len+3)/4;
_bwtSz = (len/4 + 1);
_lineRate = lineRate;
- _linesPerSide = linesPerSide;
+ _linesPerSide = _isBt2Index ? 1 : linesPerSide;
_origOffRate = offRate;
_offRate = offRate;
_offMask = OFF_MASK << _offRate;
@@ -170,6 +172,16 @@ public:
_numSides = _numSidePairs*2;
_numLines = _numSides * _linesPerSide;
_ebwtTotLen = _numSidePairs * (2*_sideSz);
+
+ if (_isBt2Index) {
+ _sideBwtSz = _sideSz - 4*OFF_SIZE;
+ _sideBwtLen = _sideBwtSz*4;
+ _numSides = (_bwtSz+(_sideBwtSz)-1)/(_sideBwtSz);
+ _numSidePairs = _numSides / 2;
+ _numLines = _numSides * _linesPerSide;
+ _ebwtTotLen = _numSides * _sideSz;
+ }
+
_ebwtTotSz = _ebwtTotLen;
assert(repOk());
}
@@ -205,6 +217,7 @@ public:
TIndexOffU ebwtTotSz() const { return _ebwtTotSz; }
bool color() const { return _color; }
bool entireReverse() const { return _entireReverse; }
+ bool isBt2Index() const { return _isBt2Index; }
/**
* Set a new suffix-array sampling rate, which involves updating
@@ -238,7 +251,7 @@ public:
assert_lt(_lineRate, 32);
assert_lt(_linesPerSide, 32);
assert_lt(_ftabChars, 32);
- assert_eq(0, _ebwtTotSz % (2*_lineSz));
+ assert_eq(0, _ebwtTotSz % (_isBt2Index ? _lineSz : 2*_lineSz));
return true;
}
@@ -309,6 +322,7 @@ public:
TIndexOffU _ebwtTotSz;
bool _color;
bool _entireReverse;
+ bool _isBt2Index;
};
/**
@@ -358,6 +372,7 @@ public:
_verbose(verbose), \
_passMemExc(passMemExc), \
_sanity(sanityCheck), \
+ _isBt2Index(isBt2Index), \
_fw(__fw), \
_in1(NULL), \
_in2(NULL), \
@@ -405,7 +420,8 @@ public:
bool verbose = false,
bool startVerbose = false,
bool passMemExc = false,
- bool sanityCheck = false) :
+ bool sanityCheck = false,
+ bool isBt2Index = false) :
Ebwt_INITS
Ebwt_STAT_INITS
{
@@ -468,7 +484,8 @@ public:
int32_t __overrideIsaRate = -1,
bool verbose = false,
bool passMemExc = false,
- bool sanityCheck = false) :
+ bool sanityCheck = false,
+ bool isBt2Index = false) :
Ebwt_INITS
Ebwt_STAT_INITS,
_eh(joinedLen(szs),
@@ -478,7 +495,8 @@ public:
isaRate,
ftabChars,
color,
- refparams.reverse == REF_READ_REVERSE)
+ refparams.reverse == REF_READ_REVERSE,
+ false /* is this a bt2 index? */)
{
#ifdef POPCNT_CAPABILITY
ProcessorSupport ps;
@@ -862,7 +880,7 @@ public:
TIndexOffU* ftab() const { return _ftab; }
TIndexOffU* eftab() const { return _eftab; }
TIndexOffU* offs() const { return _offs; }
- uint32_t* isa() const { return _isa; } /* check */
+ TIndexOffU* isa() const { return _isa; } /* check */
TIndexOffU* plen() const { return _plen; }
TIndexOffU* rstarts() const { return _rstarts; }
uint8_t* ebwt() const { return _ebwt; }
@@ -1142,6 +1160,8 @@ public:
inline void countFwSideEx(const SideLocus& l, TIndexOffU *pairs) const;
inline TIndexOffU countBwSide(const SideLocus& l, int c) const;
inline void countBwSideEx(const SideLocus& l, TIndexOffU *pairs) const;
+ inline TIndexOffU countBt2Side(const SideLocus& l, int c) const;
+ inline void countBt2SideEx(const SideLocus& l, TIndexOffU *pairs) const;
inline TIndexOffU mapLF(const SideLocus& l ASSERT_ONLY(, bool overrideSanity = false)) const;
inline void mapLFEx(const SideLocus& l, TIndexOffU *pairs ASSERT_ONLY(, bool overrideSanity = false)) const;
inline void mapLFEx(const SideLocus& ltop, const SideLocus& lbot, TIndexOffU *tops, TIndexOffU *bots ASSERT_ONLY(, bool overrideSanity = false)) const;
@@ -1188,6 +1208,7 @@ public:
bool _verbose;
bool _passMemExc;
bool _sanity;
+ bool _isBt2Index;
bool _fw; // true iff this is a forward index
FILE *_in1; // input fd for primary index file
FILE *_in2; // input fd for secondary index file
@@ -1570,7 +1591,7 @@ struct SideLocus {
* from one call to initFromRow to possibly avoid a second call.
*/
static void initFromTopBot(TIndexOffU top,
- TIndexOffU bot,
+ TIndexOffU bot,
const EbwtParams& ep,
const uint8_t* ebwt,
SideLocus& ltop,
@@ -1585,7 +1606,7 @@ struct SideLocus {
lbot._charOff = (uint32_t)(ltop._charOff + spread);
lbot._sideNum = ltop._sideNum;
lbot._sideByteOff = ltop._sideByteOff;
- lbot._fw = ltop._fw;
+ lbot._fw = ep._isBt2Index ? true : ltop._fw;
lbot._by = lbot._charOff >> 2;
assert_lt(lbot._by, (int)sideBwtSz);
if(!lbot._fw) lbot._by = sideBwtSz - lbot._by - 1;
@@ -1604,9 +1625,14 @@ struct SideLocus {
const uint32_t sideSz = ep._sideSz;
// Side length is hard-coded for now; this allows the compiler
// to do clever things to accelerate / and %.
- _sideNum = row / (56*OFF_SIZE);
- _charOff = row % (56*OFF_SIZE);
- _sideByteOff = _sideNum * sideSz;
+ if (ep._isBt2Index) {
+ _sideNum = row / (48*OFF_SIZE);
+ _charOff = row % (48*OFF_SIZE);
+ } else {
+ _sideNum = row / (56*OFF_SIZE);
+ _charOff = row % (56*OFF_SIZE);
+ }
+ _sideByteOff = _sideNum * sideSz;
assert_leq(row, ep._len);
assert_leq(_sideByteOff + sideSz, ep._ebwtTotSz);
#ifndef NO_PREFETCH
@@ -1615,7 +1641,7 @@ struct SideLocus {
PREFETCH_LOCALITY);
#endif
// prefetch this side too
- _fw = (_sideNum & 1) != 0; // odd-numbered sides are forward
+ _fw = ep._isBt2Index ? true : ((_sideNum & 1) != 0); // odd-numbered sides are forward
_by = _charOff >> 2; // byte within side
assert_lt(_by, (int)ep._sideBwtSz);
_bp = _charOff & 3; // bit-pair within byte
@@ -2259,10 +2285,10 @@ inline void Ebwt<TStr>::countFwSideEx(const SideLocus& l, TIndexOffU* arrs) cons
arrs[2] += (gt[0] + this->_fchr[2]);
arrs[3] += (gt[1] + this->_fchr[3]);
#ifndef NDEBUG
- assert_leq(arrs[0], this->_fchr[1]); // can't have jumpded into next char's section
- assert_leq(arrs[1], this->_fchr[2]); // can't have jumpded into next char's section
- assert_leq(arrs[2], this->_fchr[3]); // can't have jumpded into next char's section
- assert_leq(arrs[3], this->_fchr[4]); // can't have jumpded into next char's section
+ assert_leq(arrs[0], this->_fchr[1]); // can't have jumped into next char's section
+ assert_leq(arrs[1], this->_fchr[2]); // can't have jumped into next char's section
+ assert_leq(arrs[2], this->_fchr[3]); // can't have jumped into next char's section
+ assert_leq(arrs[3], this->_fchr[4]); // can't have jumped into next char's section
#endif
}
@@ -2365,6 +2391,110 @@ inline void Ebwt<TStr>::countBwSideEx(const SideLocus& l, TIndexOffU* arrs) cons
#endif
}
+#define WITHIN_BWT_LEN(x) \
+ assert_leq(x[0], this->_eh._sideBwtLen); \
+ assert_leq(x[1], this->_eh._sideBwtLen); \
+ assert_leq(x[2], this->_eh._sideBwtLen); \
+ assert_leq(x[3], this->_eh._sideBwtLen)
+
+#define WITHIN_FCHR(x) \
+ assert_leq(x[0], this->fchr()[1]); \
+ assert_leq(x[1], this->fchr()[2]); \
+ assert_leq(x[2], this->fchr()[3]); \
+ assert_leq(x[3], this->fchr()[4])
+
+template<typename TStr>
+inline TIndexOffU Ebwt<TStr>::countBt2Side(const SideLocus& l, int c) const {
+ assert_range(0, 3, c);
+ assert_range(0, (int)this->_eh._sideBwtSz-1, (int)l._by);
+ assert_range(0, 3, (int)l._bp);
+ const uint8_t *side = l.side(this->ebwt());
+ TIndexOffU cCnt = countUpTo(l, c);
+ assert_leq(cCnt, this->_eh._sideBwtLen);
+ if(c == 0 && l._sideByteOff <= _zEbwtByteOff && l._sideByteOff + l._by >= _zEbwtByteOff) {
+ // Adjust for the fact that we represented $ with an 'A', but
+ // shouldn't count it as an 'A' here
+ if((l._sideByteOff + l._by > _zEbwtByteOff) ||
+ (l._sideByteOff + l._by == _zEbwtByteOff && l._bp > _zEbwtBpOff))
+ {
+ cCnt--; // Adjust for '$' looking like an 'A'
+ }
+ }
+ TIndexOffU ret;
+ // Now factor in the occ[] count at the side break
+ const uint8_t *acgt8 = side + _eh._sideBwtSz;
+ const TIndexOffU *acgt = reinterpret_cast<const TIndexOffU*>(acgt8);
+ assert_leq(acgt[0], this->_eh._numSides * this->_eh._sideBwtLen); // b/c it's used as padding
+ assert_leq(acgt[1], this->_eh._len);
+ assert_leq(acgt[2], this->_eh._len);
+ assert_leq(acgt[3], this->_eh._len);
+ ret = acgt[c] + cCnt + this->fchr()[c];
+#ifndef NDEBUG
+ assert_leq(ret, this->fchr()[c+1]); // can't have jumpded into next char's section
+ if(c == 0) {
+ assert_leq(cCnt, this->_eh._sideBwtLen);
+ } else {
+ assert_leq(ret, this->_eh._bwtLen);
+ }
+#endif
+ return ret;
+}
+
+/**
+ * Count all occurrences of character c from the beginning of the
+ * forward side to <by,bp> and add in the occ[] count up to the side
+ * break just prior to the side.
+ *
+ * A forward side is shaped like:
+ *
+ * [A] [C] XXXXXXXXXXXXXXXX
+ * -4- -4- --------56------ (numbers in bytes)
+ * ^
+ * Side ptr (result from SideLocus.side())
+ *
+ * And following it is a reverse side shaped like:
+ *
+ * [G] [T] XXXXXXXXXXXXXXXX
+ * -4- -4- --------56------ (numbers in bytes)
+ * ^
+ * Side ptr (result from SideLocus.side())
+ *
+ */
+template<typename TStr>
+inline void Ebwt<TStr>::countBt2SideEx(const SideLocus& l, TIndexOffU* arrs) const {
+ assert_range(0, (int)this->_eh._sideBwtSz-1, (int)l._by);
+ assert_range(0, 3, (int)l._bp);
+ countUpToEx(l, arrs);
+ if(l._sideByteOff <= _zEbwtByteOff && l._sideByteOff + l._by >= _zEbwtByteOff) {
+ // Adjust for the fact that we represented $ with an 'A', but
+ // shouldn't count it as an 'A' here
+ if((l._sideByteOff + l._by > _zEbwtByteOff) ||
+ (l._sideByteOff + l._by == _zEbwtByteOff && l._bp > _zEbwtBpOff))
+ {
+ arrs[0]--; // Adjust for '$' looking like an 'A'
+ }
+ }
+ WITHIN_FCHR(arrs);
+ WITHIN_BWT_LEN(arrs);
+ // Now factor in the occ[] count at the side break
+ const uint8_t *side = l.side(this->ebwt());
+ const uint8_t *acgt16 = side + this->_eh._sideSz - OFF_SIZE*4;
+ const TIndexOffU *acgt = reinterpret_cast<const TIndexOffU*>(acgt16);
+ assert_leq(acgt[0], this->fchr()[1] + this->_eh.sideBwtLen());
+ assert_leq(acgt[1], this->fchr()[2]-this->fchr()[1]);
+ assert_leq(acgt[2], this->fchr()[3]-this->fchr()[2]);
+ assert_leq(acgt[3], this->fchr()[4]-this->fchr()[3]);
+ assert_leq(acgt[0], this->_eh._len + this->_eh.sideBwtLen());
+ assert_leq(acgt[1], this->_eh._len);
+ assert_leq(acgt[2], this->_eh._len);
+ assert_leq(acgt[3], this->_eh._len);
+ arrs[0] += (acgt[0] + this->fchr()[0]);
+ arrs[1] += (acgt[1] + this->fchr()[1]);
+ arrs[2] += (acgt[2] + this->fchr()[2]);
+ arrs[3] += (acgt[3] + this->fchr()[3]);
+ WITHIN_FCHR(arrs);
+}
+
/**
* Given top and bot loci, calculate counts of all four DNA chars up to
* those loci. Used for more advanced backtracking-search.
@@ -2386,10 +2516,21 @@ inline void Ebwt<TStr>::mapLFEx(const SideLocus& ltop,
assert_eq(0, tops[1]); assert_eq(0, bots[1]);
assert_eq(0, tops[2]); assert_eq(0, bots[2]);
assert_eq(0, tops[3]); assert_eq(0, bots[3]);
- if(ltop._fw) countFwSideEx(ltop, tops); // Forward side
- else countBwSideEx(ltop, tops); // Backward side
- if(lbot._fw) countFwSideEx(lbot, bots); // Forward side
- else countBwSideEx(lbot, bots); // Backward side
+ if(ltop._fw) {
+ // Forward side
+ !_eh._isBt2Index ? countFwSideEx(ltop, tops)
+ : countBt2SideEx(ltop, tops);
+ } else {
+ countBwSideEx(ltop, tops); // Backward side
+ }
+
+ if(lbot._fw) {
+ // Forward side
+ !_eh._isBt2Index ? countFwSideEx(lbot, bots)
+ : countBt2SideEx(lbot, bots);
+ } else {
+ countBwSideEx(lbot, bots); // Backward side
+ }
#ifndef NDEBUG
if(_sanity && !overrideSanity) {
// Make sure results match up with individual calls to mapLF;
@@ -2422,8 +2563,13 @@ inline void Ebwt<TStr>::mapLFEx(const SideLocus& l,
assert_eq(0, arrs[1]);
assert_eq(0, arrs[2]);
assert_eq(0, arrs[3]);
- if(l._fw) countFwSideEx(l, arrs); // Forward side
- else countBwSideEx(l, arrs); // Backward side
+ if(l._fw) {
+ // Forward side
+ !_eh._isBt2Index ? countFwSideEx(l, arrs)
+ : countBt2SideEx(l, arrs);
+ } else {
+ countBwSideEx(l, arrs); // Backward side
+ }
#ifndef NDEBUG
if(_sanity && !overrideSanity) {
// Make sure results match up with individual calls to mapLF;
@@ -2454,8 +2600,14 @@ inline TIndexOffU Ebwt<TStr>::mapLF(const SideLocus& l
int c = rowL(l);
assert_lt(c, 4);
assert_geq(c, 0);
- if(l._fw) ret = countFwSide(l, c); // Forward side
- else ret = countBwSide(l, c); // Backward side
+ if(l._fw) {
+ // Forward side
+ ret = !_eh._isBt2Index ? countFwSide(l, c)
+ : countBt2Side(l, c);
+ }
+ else {
+ ret = countBwSide(l, c); // Backward side
+ }
assert_lt(ret, this->_eh._bwtLen);
#ifndef NDEBUG
if(_sanity && !overrideSanity) {
@@ -2485,8 +2637,14 @@ inline TIndexOffU Ebwt<TStr>::mapLF(const SideLocus& l, int c
TIndexOffU ret;
assert_lt(c, 4);
assert_geq(c, 0);
- if(l._fw) ret = countFwSide(l, c); // Forward side
- else ret = countBwSide(l, c); // Backward side
+ if(l._fw) {
+ // Forward side
+ ret = !_eh._isBt2Index ? countFwSide(l, c)
+ : countBt2Side(l, c);
+ }
+ else {
+ ret = countBwSide(l, c); // Backward side
+ }
assert_lt(ret, this->_eh._bwtLen);
#ifndef NDEBUG
if(_sanity && !overrideSanity) {
@@ -2517,8 +2675,13 @@ inline TIndexOffU Ebwt<TStr>::mapLF1(TIndexOffU row, const SideLocus& l, int c
TIndexOffU ret;
assert_lt(c, 4);
assert_geq(c, 0);
- if(l._fw) ret = countFwSide(l, c); // Forward side
- else ret = countBwSide(l, c); // Backward side
+ if(l._fw) {
+ // Forward side
+ ret = !_eh._isBt2Index ? countFwSide(l, c)
+ : countBt2Side(l, c);
+ } else {
+ ret = countBwSide(l, c); // Backward side
+ }
assert_lt(ret, this->_eh._bwtLen);
#ifndef NDEBUG
if(_sanity && !overrideSanity) {
@@ -2549,8 +2712,13 @@ inline int Ebwt<TStr>::mapLF1(TIndexOffU& row, const SideLocus& l
int c = rowL(l);
assert_lt(c, 4);
assert_geq(c, 0);
- if(l._fw) row = countFwSide(l, c); // Forward side
- else row = countBwSide(l, c); // Backward side
+ if(l._fw) {
+ // Forward side
+ row = !_eh._isBt2Index ? countFwSide(l, c)
+ : countBt2Side(l, c);
+ } else {
+ row = countBwSide(l, c); // Backward side
+ }
assert_lt(row, this->_eh._bwtLen);
#ifndef NDEBUG
if(_sanity && !overrideSanity) {
@@ -2829,8 +2997,13 @@ inline bool Ebwt<TStr>::reportReconstruct(const String<Dna5>& query,
TIndexOffU newi;
assert_lt(c, 4);
assert_geq(c, 0);
- if(l->_fw) newi = countFwSide(*l, c); // Forward side
- else newi = countBwSide(*l, c); // Backward side
+ if(l->_fw) {
+ // Forward side
+ newi = !_eh._isBt2Index ? countFwSide(*l, c)
+ : countBt2Side(*l, c);
+ } else {
+ newi = countBwSide(*l, c); // Backward side
+ }
assert_lt(newi, this->_eh._bwtLen);
assert_neq(newi, i);
i = newi; // update row
@@ -2883,8 +3056,13 @@ inline bool Ebwt<TStr>::reportReconstruct(const String<Dna5>& query,
TIndexOffU newi;
assert_lt(c, 4);
assert_geq(c, 0);
- if(l->_fw) newi = countFwSide(*l, c); // Forward side
- else newi = countBwSide(*l, c); // Backward side
+ if(l->_fw) {
+ // Forward side
+ newi = !_eh._isBt2Index ? countFwSide(*l, c)
+ : countBt2Side(*l, c);
+ } else {
+ newi = countBwSide(*l, c); // Backward side
+ }
assert_lt(newi, this->_eh._bwtLen);
assert_neq(newi, i);
i = newi; // update row
@@ -2922,8 +3100,14 @@ inline bool Ebwt<TStr>::reportReconstruct(const String<Dna5>& query,
appendValue(rbuf, (Dna5)c);
TIndexOffU newi;
assert_lt(c, 4); assert_geq(c, 0);
- if(l->_fw) newi = countFwSide(*l, c); // Forward side
- else newi = countBwSide(*l, c); // Backward side
+ if(l->_fw) {
+ // Forward side
+ newi = !_eh._isBt2Index ? countFwSide(*l, c)
+ : countBt2Side(*l, c);
+ }
+ else {
+ newi = countBwSide(*l, c); // Backward side
+ }
assert_lt(newi, this->_eh._bwtLen);
assert_neq(newi, i);
i = newi; // update row
@@ -3208,11 +3392,11 @@ void Ebwt<TStr>::readIntoMemory(
EbwtParams *eh;
bool deleteEh = false;
if(params != NULL) {
- params->init(len, lineRate, linesPerSide, offRate, isaRate, ftabChars, color, entireRev);
+ params->init(len, lineRate, linesPerSide, offRate, isaRate, ftabChars, color, entireRev, _isBt2Index);
if(_verbose || startVerbose) params->print(cerr);
eh = params;
} else {
- eh = new EbwtParams(len, lineRate, linesPerSide, offRate, isaRate, ftabChars, color, entireRev);
+ eh = new EbwtParams(len, lineRate, linesPerSide, offRate, isaRate, ftabChars, color, entireRev, _isBt2Index);
deleteEh = true;
}
@@ -3678,6 +3862,7 @@ void Ebwt<TStr>::readIntoMemory(
/**
* Read reference names from an input stream 'in' for an Ebwt primary
* file and store them in 'refnames'.
+ * TODO: revisit this function
*/
static inline void
readEbwtRefnames(FILE* fin, vector<string>& refnames) {
@@ -3710,7 +3895,11 @@ readEbwtRefnames(FILE* fin, vector<string>& refnames) {
}
// Create a new EbwtParams from the entries read from primary stream
- EbwtParams eh(len, lineRate, linesPerSide, offRate, -1, ftabChars, color, entireReverse);
+ bool isBt2Index = false;
+ if (gEbwt_ext == "bt2" || gEbwt_ext == "bt2") {
+ isBt2Index = true;
+ }
+ EbwtParams eh(len, lineRate, linesPerSide, offRate, -1, ftabChars, color, entireReverse, isBt2Index);
TIndexOffU nPat = readI<TIndexOffU>(fin, switchEndian); // nPat
fseeko(fin, nPat*OFF_SIZE, SEEK_CUR);
=====================================
ebwt_build.cpp
=====================================
@@ -333,12 +333,22 @@ static void driver(const string& infile,
} else {
// Adapt sequence files to ifstreams
for(size_t i = 0; i < infiles.size(); i++) {
- FILE *f = fopen(infiles[i].c_str(), "rb");
- if (f == NULL) {
- cerr << "Error: could not open "<< infiles[i] << endl;
- throw 1;
+ FileBuf *fb = NULL;
+ if (FileBuf::isGzippedFile(infiles[i].c_str())) {
+ gzFile zFp = gzopen(infiles[i].c_str(), "rb");
+ if (zFp == NULL) {
+ std::cerr << "Error: could not open " << infiles[i].c_str() << std::endl;
+ throw 1;
+ }
+ fb = new FileBuf(zFp);
+ } else {
+ FILE *f = fopen(infiles[i].c_str(), "rb");
+ if (f == NULL) {
+ cerr << "Error: could not open "<< infiles[i] << endl;
+ throw 1;
+ }
+ fb = new FileBuf(f);
}
- FileBuf *fb = new FileBuf(f);
assert(fb != NULL);
assert(!fb->eof());
assert(fb->get() == '>');
@@ -472,7 +482,8 @@ static void driver(const string& infile,
-1, // override isaRate
verbose, // be talkative
autoMem, // pass exceptions up to the toplevel so that we can adjust memory settings automatically
- sanityCheck); // verify results and internal consistency
+ sanityCheck, // verify results and internal consistency
+ false); // are we building a bt2 index?
// Note that the Ebwt is *not* resident in memory at this time. To
// load it into memory, call ebwt.loadIntoMemory()
if(verbose) {
=====================================
ebwt_search.cpp
=====================================
@@ -480,6 +480,9 @@ static void printUsage(ostream& out) {
<< "Input:" << endl
<< " -q query input files are FASTQ .fq/.fastq (default)" << endl
<< " -f query input files are (multi-)FASTA .fa/.mfa" << endl
+ << " -F k:<int>,i:<int> query input files are continuous FASTA where reads" << endl
+ << " are substrings (k-mers) extracted from a FASTA file <s>" << endl
+ << " and aligned at offsets 1, 1+i, 1+2i ... end of reference" << endl
<< " -r query input files are raw one-sequence-per-line" << endl
<< " -c query sequences given on cmd line (as <mates>, <singles>)" << endl
<< " -C reads and index are in colorspace" << endl
@@ -1459,8 +1462,7 @@ static void exactSearch(PatternComposer& _patsrc,
if(!refs->loaded()) throw 1;
}
exactSearch_refs = refs;
- vector<int> tids;
- tids.reserve(max(nthreads, thread_ceiling));
+ int tids[max(nthreads, thread_ceiling)];
#ifdef WITH_TBB
vector<std::thread*> threads;
vector<thread_tracking_pair> tps;
@@ -1567,10 +1569,9 @@ static void exactSearch(PatternComposer& _patsrc,
SLEEP(10);
}
#else
- for (int i = 0; i < nthreads; i++) {
+ for (int i = 0; i < nthreads - 1; i++) {
threads[i]->join();
}
- delete[] tids;
#endif
#ifndef _WIN32
@@ -1844,8 +1845,7 @@ static void mismatchSearchFull(PatternComposer& _patsrc,
}
mismatchSearch_refs = refs;
- vector<int> tids;
- tids.reserve(max(nthreads, thread_ceiling));
+ int tids[max(nthreads, thread_ceiling)];
#ifdef WITH_TBB
vector<std::thread*> threads;
vector<thread_tracking_pair> tps;
@@ -1953,10 +1953,9 @@ static void mismatchSearchFull(PatternComposer& _patsrc,
SLEEP(10);
}
#else
- for (int i = 0; i < nthreads; i++) {
+ for (int i = 0; i < nthreads - 1; i++) {
threads[i]->join();
}
- delete[] tids;
#endif
#ifndef _WIN32
@@ -2349,8 +2348,7 @@ static void twoOrThreeMismatchSearchFull(
twoOrThreeMismatchSearch_hitMask = NULL;
twoOrThreeMismatchSearch_two = two;
- vector<int> tids;
- tids.reserve(max(nthreads, thread_ceiling));
+ int tids[max(nthreads, thread_ceiling)];
#ifdef WITH_TBB
vector<std::thread*> threads;
vector<thread_tracking_pair> tps;
@@ -2458,10 +2456,9 @@ static void twoOrThreeMismatchSearchFull(
SLEEP(10);
}
#else
- for (int i = 0; i < nthreads; i++) {
+ for (int i = 0; i < nthreads - 1; i++) {
threads[i]->join();
}
- delete[] tids;
#endif
#ifndef _WIN32
@@ -2901,8 +2898,7 @@ static void seededQualCutoffSearchFull(
}
seededQualSearch_refs = refs;
- vector<int> tids;
- tids.reserve(max(nthreads, thread_ceiling));
+ int tids[max(nthreads, thread_ceiling)];
#ifdef WITH_TBB
vector<std::thread*> threads;
vector<thread_tracking_pair> tps;
@@ -3018,10 +3014,9 @@ static void seededQualCutoffSearchFull(
SLEEP(10);
}
#else
- for (int i = 0; i < nthreads; i++) {
+ for (int i = 0; i < nthreads - 1; i++) {
threads[i]->join();
}
- delete[] tids;
#endif
#ifndef _WIN32
@@ -3127,6 +3122,11 @@ static void driver(const char * type,
}
// Adjust
adjustedEbwtFileBase = adjustEbwtBase(argv0, ebwtFileBase, verbose);
+ bool isBt2Index = false;
+ if (gEbwt_ext == "bt2" || gEbwt_ext == "bt2l") {
+ isBt2Index = true;
+ }
+
vector<PatternSource*> patsrcs_a;
vector<PatternSource*> patsrcs_b;
@@ -3266,7 +3266,8 @@ static void driver(const char * type,
verbose, // whether to be talkative
startVerbose, // talkative during initialization
false /*passMemExc*/,
- sanityCheck);
+ sanityCheck,
+ isBt2Index);
Ebwt<TStr>* ebwtBw = NULL;
// We need the mirror index if mismatches are allowed
if(mismatches > 0 || maqLike) {
@@ -3287,7 +3288,8 @@ static void driver(const char * type,
verbose, // whether to be talkative
startVerbose, // talkative during initialization
false /*passMemExc*/,
- sanityCheck);
+ sanityCheck,
+ isBt2Index);
}
if(!os.empty()) {
for(size_t i = 0; i < os.size(); i++) {
=====================================
filebuf.h
=====================================
@@ -13,6 +13,8 @@
#include <string.h>
#include <stdint.h>
#include <stdexcept>
+#include <fcntl.h>
+#include <zlib.h>
#include "assert_helpers.h"
/**
@@ -49,6 +51,12 @@ public:
assert(_in != NULL);
}
+ FileBuf(gzFile in) {
+ init();
+ _zIn = in;
+ assert(_zIn != NULL);
+ }
+
FileBuf(std::ifstream *inf) {
init();
_inf = inf;
@@ -65,7 +73,7 @@ public:
* Return true iff there is a stream ready to read.
*/
bool isOpen() {
- return _in != NULL || _inf != NULL || _ins != NULL;
+ return _in != NULL || _zIn != NULL || _inf != NULL || _ins != NULL;
}
/**
@@ -76,7 +84,9 @@ public:
fclose(_in);
} else if(_inf != NULL) {
_inf->close();
- } else {
+ } else if(_zIn != NULL) {
+ gzclose(_zIn);
+ } else {
// can't close _ins
}
}
@@ -85,7 +95,7 @@ public:
* Get the next character of input and advance.
*/
int get() {
- assert(_in != NULL || _inf != NULL || _ins != NULL);
+ assert(_in != NULL || _zIn != NULL || _inf != NULL || _ins != NULL);
int c = peek();
if(c != -1) {
_cur++;
@@ -113,6 +123,19 @@ public:
_done = false;
}
+ /**
+ * Initialize the buffer with a new gz file.
+ */
+ void newFile(gzFile in) {
+ _in = NULL;
+ _zIn = in;
+ _inf = NULL;
+ _ins = NULL;
+ _cur = BUF_SZ;
+ _buf_sz = BUF_SZ;
+ _done = false;
+ }
+
/**
* Initialize the buffer with a new ifstream.
*/
@@ -148,6 +171,8 @@ public:
} else if(_ins != NULL) {
_ins->clear();
_ins->seekg(0, std::ios::beg);
+ } else if (_zIn != NULL) {
+ gzrewind(_zIn);
} else {
rewind(_in);
}
@@ -162,7 +187,7 @@ public:
* Occasionally we'll need to read in a new buffer's worth of data.
*/
int peek() {
- assert(_in != NULL || _inf != NULL || _ins != NULL);
+ assert(_in != NULL || _zIn != NULL || _inf != NULL || _ins != NULL);
assert_leq(_cur, _buf_sz);
if(_cur == _buf_sz) {
if(_done) {
@@ -175,6 +200,8 @@ public:
if(_inf != NULL) {
_inf->read((char*)_buf, BUF_SZ);
_buf_sz = _inf->gcount();
+ } else if (_zIn != NULL) {
+ _buf_sz = gzread(_zIn, (void *)_buf, BUF_SZ);
} else if(_ins != NULL) {
_ins->read((char*)_buf, BUF_SZ);
_buf_sz = _ins->gcount();
@@ -333,10 +360,44 @@ public:
return _lastn_cur;
}
+ static bool isGzippedFile(const char *filename) {
+ if (filename == NULL) {
+ return false;
+ }
+
+ int fd = open(filename, O_RDONLY);
+ if (fd == -1) {
+ std::cerr << "Unable to open " << filename << std::endl;
+ throw 1;
+ }
+
+ uint8_t byte1, byte2;
+
+ ssize_t r1 = read(fd, &byte1, sizeof(uint8_t));
+ ssize_t r2 = read(fd, &byte2, sizeof(uint8_t));
+
+ if (::close(fd) == -1) {
+ std::cerr << "Unable to close " << filename << std::endl;
+ throw 1;
+ }
+
+ if (r1 == 0 || r2 == 0) {
+ std::cerr << "Unable to read file magic number" << std::endl;
+ return false;
+ }
+
+ if (byte1 == 0x1f && byte2 == 0x8b) {
+ return true;
+ }
+
+ return false;
+ }
+
private:
void init() {
_in = NULL;
+ _zIn = NULL;
_inf = NULL;
_ins = NULL;
_cur = _buf_sz = BUF_SZ;
@@ -347,6 +408,7 @@ private:
static const size_t BUF_SZ = 256 * 1024;
FILE *_in;
+ gzFile _zIn;
std::ifstream *_inf;
std::istream *_ins;
size_t _cur;
=====================================
hit.h
=====================================
@@ -169,7 +169,6 @@ public:
nthreads_((nthreads > 0) ? nthreads : 1),
ptBufs_(),
ptCounts_(nthreads_),
- batchIds_(nthreads_),
perThreadBufSize_(perThreadBufSize),
ptNumAligned_(NULL),
reorder_(reorder)
@@ -183,8 +182,6 @@ public:
ptNumMaxed_ = ptNumUnaligned_ + nthreads_;
ptBufs_.resize(nthreads_);
ptCounts_.resize(nthreads_, 0);
- batchIds_.assign(nthreads_, 0);
- lastBatchIdSeen = 0;
initDumps();
}
@@ -253,9 +250,6 @@ public:
}
}
ptCounts_[threadId]++;
- if (reorder_) {
- batchIds_[threadId] = rdid / perThreadBufSize_ + 1;
- }
if(tally) {
tallyAlignments(threadId, end - start, paired);
}
@@ -268,10 +262,10 @@ public:
void finish(bool hadoopOut) {
// Flush all per-thread buffers
flushAll();
-
+
// Close all output streams
closeOuts();
-
+
// Print information about how many unpaired and/or paired
// reads were aligned.
if(!quiet_) {
@@ -285,7 +279,7 @@ public:
numUnaligned += ptNumUnaligned_[i];
numMaxed += ptNumMaxed_[i];
}
-
+
uint64_t tot = numAligned + numUnaligned + numMaxed;
double alPct = 0.0, unalPct = 0.0, maxPct = 0.0;
if(tot > 0) {
@@ -560,44 +554,21 @@ protected:
/**
* Flush thread's output buffer and reset both buffer and count.
*/
- void flush(size_t threadId, bool finalBatch) {
+ void flush(size_t threadId, bool force) {
{
ThreadSafe _ts(&mutex_); // flush
- if (reorder_) {
- nchars += ptBufs_[threadId].length();
- batch b(ptBufs_[threadId], batchIds_[threadId], false /* has batch been written */);
- unwrittenBatches_.push_back(b);
- // consider writing if we have enough data to fill the buffer
- // or we're ready to output the final batch
- if (finalBatch || nchars >= OutFileBuf::BUF_SZ) {
- // sort by batch ID
- std::sort(unwrittenBatches_.begin(), unwrittenBatches_.end());
- for (std::vector<batch>::size_type i = 0; i < unwrittenBatches_.size(); i++) {
- if (unwrittenBatches_[i].batchId - lastBatchIdSeen == 1) {
- nchars -= out_.writeString(unwrittenBatches_[i].btString);
- lastBatchIdSeen = unwrittenBatches_[i].batchId;
- unwrittenBatches_[i].isWritten = true;
- }
- }
- unwrittenBatches_.erase(std::remove_if(unwrittenBatches_.begin(),
- unwrittenBatches_.end(), batch::remove_written_batches),
- unwrittenBatches_.end());
- }
- }
- else {
- out_.writeString(ptBufs_[threadId]);
- }
+ out_.writeString(ptBufs_[threadId]);
}
ptCounts_[threadId] = 0;
ptBufs_[threadId].clear();
}
-
+
/**
* Flush all output buffers.
*/
void flushAll() {
for(size_t i = 0; i < nthreads_; i++) {
- flush(i, i == nthreads_ - 1);
+ flush(i, true);
}
}
@@ -610,7 +581,7 @@ protected:
flush(threadId, false /* final batch? */);
}
}
-
+
/**
* Close (and flush) all OutFileBufs.
*/
@@ -621,49 +592,14 @@ protected:
OutFileBuf& out_; /// the alignment output stream(s)
vector<string>* _refnames; /// map from reference indexes to names
MUTEX_T mutex_; /// pthreads mutexes for per-file critical sections
-
- // used for output read buffer
+
+ // used for output read buffer
size_t nthreads_;
std::vector<BTString> ptBufs_;
std::vector<size_t> ptCounts_;
int perThreadBufSize_;
bool reorder_;
- struct batch {
- BTString btString;
- size_t batchId;
- bool isWritten;
-
- batch(BTString& s, size_t id, bool b)
- : batchId(id), isWritten(b)
- {
- s.moveTo(btString);
- }
-
- bool operator<(const batch& other) const {
- return batchId < other.batchId;
- }
-
- batch& operator=(batch& other) {
- if (&other != this) {
- batchId = other.batchId;
- isWritten = other.isWritten;
- other.btString.moveTo(btString);
- }
- return *this;
- }
-
- static bool remove_written_batches(const batch& b) {
- return b.isWritten;
- }
- };
-
-
- std::vector<batch> unwrittenBatches_;
- std::vector<size_t> batchIds_;
- size_t lastBatchIdSeen;
- size_t nchars;
-
// Output filenames for dumping
std::string dumpAlBase_;
std::string dumpUnalBase_;
@@ -1100,7 +1036,7 @@ public:
virtual HitSinkPerThread* create() const {
return new NGoodHitSinkPerThread(sink_, n_, max_, defaultMapq_, threadId_);
}
-
+
virtual HitSinkPerThread* createMult(uint32_t m) const {
uint32_t max = max_ * (max_ == 0xffffffff ? 1 : m);
uint32_t n = n_ * (n_ == 0xffffffff ? 1 : m);
@@ -1243,7 +1179,7 @@ public:
virtual HitSinkPerThread* create() const {
return new NBestFirstStratHitSinkPerThread(sink_, n_, max_, 1, defaultMapq_, threadId_);
}
-
+
virtual HitSinkPerThread* createMult(uint32_t m) const {
uint32_t max = max_ * (max_ == 0xffffffff ? 1 : m);
uint32_t n = n_ * (n_ == 0xffffffff ? 1 : m);
=====================================
pat.cpp
=====================================
@@ -765,6 +765,7 @@ pair<bool, int> FastaContinuousPatternSource::nextBatchFromFile(
// can re-use this prefix for all the reads
// that are substrings of this FASTA sequence
name_prefix_buf_[nameoff++] = c;
+ subReadCnt_ = 0;
}
c = getc_wrapper();
}
@@ -794,13 +795,14 @@ pair<bool, int> FastaContinuousPatternSource::nextBatchFromFile(
// the sampling gaps are by looking at the read
// name
if(!beginning_) {
- readCnt_++;
+ subReadCnt_++;
}
continue;
}
// install name
+ nameoff = strlen(name_prefix_buf_);
memcpy(readbuf[readi].readOrigBuf, name_prefix_buf_, nameoff);
- itoa10<TReadId>(readCnt_ - subReadCnt_, name_int_buf_);
+ itoa10<TReadId>(subReadCnt_, name_int_buf_);
size_t bufoff = nameoff;
memcpy(readbuf[readi].readOrigBuf + bufoff, name_int_buf_, strlen(name_int_buf_));
bufoff += strlen(name_int_buf_);
@@ -817,8 +819,8 @@ pair<bool, int> FastaContinuousPatternSource::nextBatchFromFile(
}
readbuf[readi].readOrigBufLen = bufoff;
eat_ = freq_-1;
- readCnt_++;
beginning_ = false;
+ subReadCnt_++;
readi++;
}
}
@@ -861,9 +863,9 @@ bool FastaContinuousPatternSource::parse(
// Parse sequence
assert_eq(0, seqan::length(ra.patFw));
- c = ra.readOrigBuf[cur++];
int nchar = 0, seqoff = 0;
while(cur < buflen) {
+ c = ra.readOrigBuf[cur++];
if(isalpha(c)) {
assert_in(toupper(c), "ACGTN");
if(nchar++ >= this->trim5_) {
@@ -871,7 +873,6 @@ bool FastaContinuousPatternSource::parse(
ra.patBufFw[seqoff++] = charToDna5[c]; // ascii to int
}
}
- c = ra.readOrigBuf[cur++];
}
ra.patBufFw[seqoff] = '\0';
_setBegin(ra.patFw, (Dna5*)ra.patBufFw);
@@ -1329,14 +1330,18 @@ pair<bool, int> RawPatternSource::nextBatchFromFile(
// Read until we run out of input or until we've filled the buffer
for(; readi < pt.max_buf_ && c >= 0; readi++) {
readbuf[readi].readOrigBufLen = 0;
- while(c >= 0 && c != '\n' && c != '\r') {
- readbuf[readi].readOrigBuf[readbuf[readi].readOrigBufLen++] = c;
+ while(c >= 0 && (c == '\n' || c == '\r')) {
c = getc_wrapper();
}
- while(c >= 0 && (c == '\n' || c == '\r')) {
+ while(c >= 0 && (c != '\n' && c != '\r')) {
+ readbuf[readi].readOrigBuf[readbuf[readi].readOrigBufLen++] = c;
c = getc_wrapper();
}
}
+ while (readi > 0 && readbuf[readi-1].readOrigBufLen == 0) {
+ readi--;
+ }
+
return make_pair(c < 0, readi);
}
=====================================
processor_support.h
=====================================
@@ -4,7 +4,7 @@
// Utility class ProcessorSupport provides POPCNTenabled() to determine
// processor support for POPCNT instruction. It uses CPUID to
// retrieve the processor capabilities.
-// for Intel ICC compiler __cpuid() is an intrinsic
+// for Intel ICC compiler __cpuid() is an intrinsic
// for Microsoft compiler __cpuid() is provided by #include <intrin.h>
// for GCC compiler __get_cpuid() is provided by #include <cpuid.h>
@@ -25,10 +25,10 @@ struct regs_t {unsigned int EAX, EBX, ECX, EDX;};
class ProcessorSupport {
-#ifdef POPCNT_CAPABILITY
+#ifdef POPCNT_CAPABILITY
-public:
- ProcessorSupport() { }
+public:
+ ProcessorSupport() { }
bool POPCNTenabled()
{
// from: Intel® 64 and IA-32 Architectures Software Developer’s Manual, 325462-036US,March 2013
@@ -44,12 +44,12 @@ public:
try {
#if ( defined(USING_INTEL_COMPILER) || defined(USING_MSC_COMPILER) )
- __cpuid((void *) ®s,0); // test if __cpuid() works, if not catch the exception
- __cpuid((void *) ®s,0x1); // POPCNT bit is bit 23 in ECX
+ __cpuid((int *) ®s,0); // test if __cpuid() works, if not catch the exception
+ __cpuid((int *) ®s,0x1); // POPCNT bit is bit 23 in ECX
#elif defined(USING_GCC_COMPILER)
__get_cpuid(0x1, ®s.EAX, ®s.EBX, ®s.ECX, ®s.EDX);
#else
- std::cerr << “ERROR: please define __cpuid() for this build.\n”;
+ std::cerr << “ERROR: please define __cpuid() for this build.\n”;
assert(0);
#endif
if( !( (regs.ECX & BIT(20)) && (regs.ECX & BIT(23)) ) ) return false;
@@ -64,7 +64,3 @@ public:
};
#endif /*PROCESSOR_SUPPORT_H_*/
-
-
-
-
=====================================
sam.cpp
=====================================
@@ -139,9 +139,6 @@ void SAMHitSink::reportUnOrMax(
o << '\n';
}
ptCounts_[threadId]++;
- if (reorder_) {
- batchIds_[threadId] = p.rdid() / perThreadBufSize_ + 1;
- }
}
/**
@@ -337,7 +334,7 @@ void SAMHitSink::reportMaxed(
int strat = min(hs[i].stratum, hs[i+1].stratum);
if(strat == bestStratum) {
if(num == r) {
- reportHits(NULL, &hs, i, i+2, threadId, 0, (int)(hs.size()/2)+1, false, p.rdid());
+ reportHits(NULL, &hs, i, i+2, threadId, 0, (int)(hs.size()/2)+1, true, p.rdid());
break;
}
num++;
@@ -352,7 +349,7 @@ void SAMHitSink::reportMaxed(
}
assert_leq(num, hs.size());
uint32_t r = rand.nextU32() % num;
- reportHits(&hs[r], NULL, 0, 1, threadId, 0, (int)hs.size()+1, false, p.rdid());
+ reportHits(&hs[r], NULL, 0, 1, threadId, 0, (int)hs.size()+1, true, p.rdid());
}
} else {
reportUnOrMax(p, &hs, threadId, false);
=====================================
scripts/bowtie-hbb.sh
=====================================
@@ -0,0 +1,69 @@
+#!/bin/bash
+
+set -e
+
+while getopts "sb:" opt; do
+ case $opt in
+ s) use_sra=1 ;;
+ b) branch="$OPTARG" ;;
+ *) echo "Usage: $0 [-s] [-b <branch_name>]" && exit 1
+ esac
+done
+shift $(($OPTIND - 1))
+
+if [ "$branch" == "" ] ; then
+ branch="master"
+fi
+
+set -x
+yum install -y git zip unzip pandoc
+
+git clone https://github.com/BenLangmead/bowtie.git
+if [ $? -ne 0 ] ; then
+ echo "Unable to clone bowtie repo"
+ exit 1
+fi
+
+cd bowtie
+
+pwd && ls
+git branch -a | grep "$branch" 2>&1 > /dev/null
+if [ $? -ne 0 ] ; then
+ echo "branch '$branch' does not exist"
+ exit 1
+else
+ git checkout "$branch"
+fi
+
+if [ $use_sra -eq 1 ] ; then
+ # this variant is needed to compile ncbi-vdb
+ source /hbb/activate
+ make sra-deps
+fi
+
+# this variant creates static binaries with PIC
+source /hbb_exe_gc_hardened/activate
+
+mkdir /mybin
+echo 'res=`echo $@ | sed "s/-L.*$//"`; echo $res; echo $res; /opt/rh/devtoolset-7/root/usr/bin/ar $res;' > /mybin/ar
+chmod +x /mybin/ar && export PATH=/mybin:$PATH
+
+# make static-libs
+# if [ $? -ne 0 ] ; then
+# echo "Unable to build tbb and/or zlib static dependencies"
+# exit 1
+# fi
+
+make bowtie-bin.zip RELEASE_BUILD=1 EXTRA_FLAGS="--std=c++98"
+if [ $? -ne 0 ] ; then
+ echo "Unable to create bowtie package"
+ exit 1
+fi
+echo "Running libcheck..."
+libcheck bowtie-{align,build,inspect}-*
+
+echo "Running hardening check..."
+hardening-check -b bowtie-{align,build,inspect}-*
+
+echo "Copying binary package"
+cp /bowtie/*.zip /io
=====================================
scripts/run-hbb.sh
=====================================
@@ -0,0 +1,8 @@
+#!/bin/sh
+
+bowtie_root="${PWD%%bowtie*}/bowtie"
+
+docker run -t -i --rm \
+ -v $bowtie_root:/io \
+ phusion/holy-build-box-64:latest \
+ bash /io/scripts/bowtie-hbb.sh
=====================================
shmem.h
=====================================
@@ -67,7 +67,7 @@ bool allocSharedMem(std::string fname,
cerr << "EEXIST" << endl;
} else if(errno == EINVAL) {
cerr << "Warning: shared-memory chunk's segment size doesn't match expected size (" << (shmemLen) << ")" << endl
- << "Deleteing old shared memory block and trying again." << endl;
+ << "Deleting old shared memory block and trying again." << endl;
shmid = shmget(key, 0, 0);
if((ret = shmctl(shmid, IPC_RMID, &ds)) < 0) {
cerr << "shmctl returned " << ret
@@ -105,7 +105,7 @@ bool allocSharedMem(std::string fname,
if(ds.shm_segsz != shmemLen) {
cerr << "Warning: shared-memory chunk's segment size (" << ds.shm_segsz
<< ") doesn't match expected size (" << shmemLen << ")" << endl
- << "Deleteing old shared memory block and trying again." << endl;
+ << "Deleting old shared memory block and trying again." << endl;
if((ret = shmctl(shmid, IPC_RMID, &ds)) < 0) {
cerr << "shmctl returned " << ret << " for IPC_RMID and errno is " << errno << endl;
throw 1;
View it on GitLab: https://salsa.debian.org/med-team/bowtie/commit/8fdd41caa10ed179fa67b07803e754b6faca52a5
--
View it on GitLab: https://salsa.debian.org/med-team/bowtie/commit/8fdd41caa10ed179fa67b07803e754b6faca52a5
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20190711/ed6587ae/attachment-0001.html>
More information about the debian-med-commit
mailing list