[med-svn] [kraken] 01/06: New upstream version 1.0
Andreas Tille
tille at debian.org
Mon Nov 13 15:43:30 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository kraken.
commit 1e43ce849ccff40607124deea75990ba67a25ecc
Author: Andreas Tille <tille at debian.org>
Date: Mon Nov 13 16:33:30 2017 +0100
New upstream version 1.0
---
CHANGELOG | 5 ++
docs/MANUAL.html | 116 ++++++++++++++---------------
docs/MANUAL.markdown | 60 ++++++++-------
docs/top.html | 2 +-
install_kraken.sh | 7 +-
scripts/add_to_library.sh | 34 ++++++---
scripts/build_kraken_db.sh | 49 +++++++------
scripts/convert_gb_to_fa.pl | 58 +++++++++++++++
scripts/cp_into_tempfile.pl | 2 +-
scripts/download_genomic_library.sh | 120 +++++++++++-------------------
scripts/download_taxonomy.sh | 46 +++++-------
scripts/kraken | 2 +-
scripts/kraken-build | 10 +--
scripts/kraken-filter | 2 +-
scripts/kraken-mpa-report | 2 +-
scripts/kraken-report | 9 ++-
scripts/kraken-translate | 2 +-
scripts/krakenlib.pm | 29 +++++++-
scripts/lookup_accession_numbers.pl | 65 +++++++++++++++++
scripts/read_merger.pl | 2 +-
scripts/report_gi_numbers.pl | 2 +-
scripts/rsync_from_ncbi.pl | 134 ++++++++++++++++++++++++++++++++++
scripts/scan_fasta_file.pl | 50 +++++++++++++
scripts/standard_installation.sh | 5 +-
scripts/verify_gi_numbers.pl | 2 +-
src/Makefile | 4 +-
src/classify.cpp | 22 ++++--
src/db_shrink.cpp | 8 +-
src/db_sort.cpp | 8 +-
src/kmer_estimator.cpp | 141 ++++++++++++++++++++++++++++++++++++
src/set_lcas.cpp | 28 +++++--
31 files changed, 761 insertions(+), 265 deletions(-)
diff --git a/CHANGELOG b/CHANGELOG
index db8ea12..c1aaed1 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,8 @@
+v0.10.6-beta:
+* fixed overflow bug in command line parsing
+* fixed GRCh38.p2 bug in human genome downloads
+* fixed installation exit code bug
+
v0.10.5-beta:
* fix bug in GRCh38 download to handle multi-fasta files
* add --header-line and --intermediate-ranks options to kraken-mpa-report
diff --git a/docs/MANUAL.html b/docs/MANUAL.html
index 6829d86..4ac90d2 100644
--- a/docs/MANUAL.html
+++ b/docs/MANUAL.html
@@ -4,7 +4,7 @@
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
- <title>Kraken Manual - </title>
+ <title>Kraken Manual – </title>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet" href="kraken.css" type="text/css" />
<link href='http://fonts.googleapis.com/css?family=Ubuntu:400,700,400italic' rel='stylesheet' type='text/css'>
@@ -13,7 +13,7 @@
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>
- <p class="version">Version 0.10.5-beta</p>
+ <p class="version">Version 1.0</p>
<p>Operating Manual</p>
</div>
@@ -36,23 +36,23 @@
<li><a href="#upgrading-databases-to-v0.10">Upgrading Databases to v0.10+</a></li>
</ul>
</div>
-<h1 id="introduction"><a href="#introduction">Introduction</a></h1>
-<p><a href="http://ccb.jhu.edu/software/kraken/">Kraken</a> is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It does this by examining the <span class="math"><em>k</em></span>-mers within a read and querying a database with those <span class="math"><em>k</em></span>-mers. This database contains a mapping of every <span class="math"><em>k</em></span>-mer in <a href="http://ccb.jhu.edu/software/kraken/">Kraken</a>'s genomic library to the lowest common a [...]
+<h1 id="introduction">Introduction</h1>
+<p><a href="http://ccb.jhu.edu/software/kraken/">Kraken</a> is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It does this by examining the <span class="math inline"><em>k</em></span>-mers within a read and querying a database with those <span class="math inline"><em>k</em></span>-mers. This database contains a mapping of every <span class="math inline"><em>k</em></span>-mer in <a href="http://ccb.jhu.edu/software/kraken/">Kraken</a>'s genomic library t [...]
<p>The latest released version of Kraken will be available at the <a href="http://ccb.jhu.edu/software/kraken/">Kraken website</a>, and the latest updates to the Kraken source code are available at the <a href="https://github.com/DerrickWood/kraken">Kraken GitHub repository</a>.</p>
<p>If you use <a href="http://ccb.jhu.edu/software/kraken/">Kraken</a> in your research, please cite the <a href="http://genomebiology.com/2014/15/3/R46">Kraken paper</a>. Thank you!</p>
-<h1 id="system-requirements"><a href="#system-requirements">System Requirements</a></h1>
+<h1 id="system-requirements">System Requirements</h1>
<p>Note: Users concerned about the disk or memory requirements should read the paragraph about MiniKraken, below.</p>
<ul>
-<li><p><strong>Disk space</strong>: Construction of Kraken's standard database will require at least 160 GB of disk space. Customized databases may require more or less space. Disk space used is linearly proportional to the number of distinct <span class="math"><em>k</em></span>-mers; as of Feb. 2015, Kraken's default database contains just under 6 billion (6e9) distinct <span class="math"><em>k</em></span>-mers.</p>
+<li><p><strong>Disk space</strong>: Construction of Kraken's standard database will require at least 500 GB of disk space. Customized databases may require more or less space. Disk space used is linearly proportional to the number of distinct <span class="math inline"><em>k</em></span>-mers; as of Oct. 2017, Kraken's default database contains just over 14.4 billion (1.44e10) distinct <span class="math inline"><em>k</em></span>-mers.</p>
<p>In addition, the disk used to store the database should be locally-attached storage. Storing the database on a network filesystem (NFS) partition can cause Kraken's operation to be very slow, or to be stopped completely. As NFS accesses are much slower than local disk accesses, both preloading and database building will be slowed by use of NFS.</p></li>
-<li><p><strong>Memory</strong>: To run efficiently, Kraken requires enough free memory to hold the database in RAM. While this can be accomplished using a ramdisk, Kraken supplies a utility for loading the database into RAM via the OS cache. The default database size is 75 GB (as of Feb. 2015), and so you will need at least that much RAM if you want to build or run with the default database.</p></li>
+<li><p><strong>Memory</strong>: To run efficiently, Kraken requires enough free memory to hold the database in RAM. While this can be accomplished using a ramdisk, Kraken supplies a utility for loading the database into RAM via the OS cache. The default database size is 170 GB (as of Oct. 2017), and so you will need at least that much RAM if you want to build or run with the default database.</p></li>
<li><p><strong>Dependencies</strong>: Kraken currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++, and need to be compiled using g++. Multithreading is handled using OpenMP. Downloads of NCBI data are performed by wget and in some cases, by rsync. Most Linux systems that have any sort of d [...]
-<p>Finally, if you want to build your own database, you will need to install the <a href="http://www.cbcb.umd.edu/software/jellyfish/">Jellyfish</a> <span class="math"><em>k</em></span>-mer counter. Note that Kraken only supports use of Jellyfish version 1. Jellyfish version 2 is not yet compatible with Kraken.</p></li>
+<p>Finally, if you want to build your own database, you will need to install the <a href="http://www.cbcb.umd.edu/software/jellyfish/">Jellyfish</a> <span class="math inline"><em>k</em></span>-mer counter. Note that Kraken only supports use of Jellyfish version 1. Jellyfish version 2 is not yet compatible with Kraken.</p></li>
<li><p><strong>Network connectivity</strong>: Kraken's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as <code>ftp_proxy</code> or <code>RSYNC_PROXY</code>) in order to get these commands to work properly.</p></li>
<li><p><strong>MiniKraken</strong>: To allow users with low-memory computing environments to use Kraken, we supply a reduced standard database that can be downloaded from the Kraken web site. When Kraken is run with a reduced database, we call it MiniKraken.</p>
<p>The database we make available is only 4 GB in size, and should run well on computers with as little as 8 GB of RAM. Disk space required for this database is also only 4 GB.</p></li>
</ul>
-<h1 id="installation"><a href="#installation">Installation</a></h1>
+<h1 id="installation">Installation</h1>
<p>To begin using Kraken, you will first need to install it, and then either download or create a database.</p>
<p>Kraken consists of two main scripts ("<code>kraken</code>" and "<code>kraken-build</code>"), along with several programs and smaller scripts. As part of the installation process, all scripts and programs are installed in the same directory. After installation, you can move the main scripts elsewhere, but moving the other scripts and programs requires editing the scripts and changing the "<code>$KRAKEN_DIR</code>" variables.</p>
<p>Once a directory is selected, you need to run the following command in the directory where you extracted the Kraken source:</p>
@@ -63,31 +63,32 @@
<pre><code>cp $KRAKEN_DIR/bin/kraken $HOME/bin
cp $KRAKEN_DIR/bin/kraken-build $HOME/bin</code></pre>
<p>After installation, you're ready to either create or download a database.</p>
-<h1 id="kraken-databases"><a href="#kraken-databases">Kraken Databases</a></h1>
+<h1 id="kraken-databases">Kraken Databases</h1>
<p>A Kraken database is a directory containing at least 4 files:</p>
<ul>
-<li><code>database.kdb</code>: Contains the <span class="math"><em>k</em></span>-mer to taxon mappings</li>
+<li><code>database.kdb</code>: Contains the <span class="math inline"><em>k</em></span>-mer to taxon mappings</li>
<li><code>database.idx</code>: Contains minimizer offset locations in database.kdb</li>
<li><code>taxonomy/nodes.dmp</code>: Taxonomy tree structure + ranks</li>
<li><code>taxonomy/names.dmp</code>: Taxonomy names</li>
</ul>
<p>Other files may be present as part of the database build process.</p>
<p>In interacting with Kraken, you should not have to directly reference any of these files, but rather simply provide the name of the directory in which they are stored. Kraken allows both the use of a standard database as well as custom databases; these are described in the sections <a href="#standard-kraken-database">Standard Kraken Database</a> and <a href="#custom-databases">Custom Databases</a> below, respectively.</p>
-<h1 id="standard-kraken-database"><a href="#standard-kraken-database">Standard Kraken Database</a></h1>
+<h1 id="standard-kraken-database">Standard Kraken Database</h1>
<p>To create the standard Kraken database, you can use the following command:</p>
<pre><code>kraken-build --standard --db $DBNAME</code></pre>
-<p>(Replace "<code>$DBNAME</code>" above with your preferred database name/location. Please note that the database will use approximately 160 GB of disk space during creation.)</p>
+<p>(Replace "<code>$DBNAME</code>" above with your preferred database name/location. Please note that the database will use approximately 500 GB of disk space during creation.)</p>
<p>This will download NCBI taxonomic information, as well as the complete genomes in RefSeq for the bacterial, archaeal, and viral domains. After downloading all this data, the build process begins; this is the most time-consuming step. If you have multiple processing cores, you can run this process with multiple threads, e.g.:</p>
-<pre><code>kraken-build --standard --threads 16 --db $DBNAME</code></pre>
-<p>Using 16 threads on a computer with 122 GB of RAM, the build process took approximately an hour and a half (steps with an asterisk have some multi-threading enabled) in February 2015:</p>
-<pre><code> 7m48s *Step 1 (create set)
- n/a Step 2 (reduce database, optional and skipped)
-53m16s *Step 3 (sort set)
- 1m04s Step 4 (GI number to sequence ID map)
- 0m27s Step 5 (Sequence ID to taxon map)
-29m20s *Step 6 (set LCA values)
-------
-91m55s Total build time</code></pre>
+<pre><code>kraken-build --standard --threads 24 --db $DBNAME</code></pre>
+<p>Using 24 threads on a computer (an AWS r4.8xlarge instance) with 244 GB of RAM, the build process took approximately 5 hours (steps with an asterisk have some multi-threading enabled) in October 2017:</p>
+<pre><code> 24m50s *Step 1 (create set)
+ n/a Step 2 (reduce database, optional and skipped)
+154m53s *Step 3 (sort set)
+ n/a Step 4 (GI number to sequence ID map - now obsolete)
+ <1s Step 5 (Sequence ID to taxon map)
+127m28s *Step 6 (set LCA values)
+-------
+5h7m11s Total build time</code></pre>
+<p>This process used the automatically estimated Jellyfish hash size of 20170976000.</p>
<p>Note that if any step (including the initial downloads) fails, the build process will abort. However, <code>kraken-build</code> will produce checkpoints throughout the installation process, and will restart the build at the last incomplete step if you attempt to run the same command again on a partially-built database.</p>
<p>To create a custom database, or to use a database from another source, see <a href="#custom-databases">Custom Databases</a>.</p>
<p>Notes for users with lower amounts of RAM:</p>
@@ -95,7 +96,7 @@ cp $KRAKEN_DIR/bin/kraken-build $HOME/bin</code></pre>
<li><p>If you encounter problems with Jellyfish not being able to allocate enough memory on your system to run the build process, you can supply a smaller hash size to Jellyfish using <code>kraken-build</code>'s <code>--jellyfish-hash-size</code> switch. Each space in the hash table uses approximately 6.9 bytes, so using "<code>--jellyfish-hash-size 6400M</code>" will use a hash table size of 6.4 billion spaces and require 44.3 GB of RAM.</p></li>
<li><p>Kraken's build process will normally attempt to minimize disk writing by allocating large blocks of RAM and operating within them until data needs to be written to disk. However, this extra RAM usage may exceed your capacity. In such cases, you may want to use <code>kraken-build</code>'s <code>--work-on-disk</code> switch. This will minimize the amount of RAM usage and cause Kraken's build programs to perform most operations off of disk files. This switch can also be useful for pe [...]
</ol>
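The 6.9-bytes-per-space figure above lends itself to a quick back-of-the-envelope check. The sketch below assumes the `M` suffix denotes 10^6 hash spaces (an assumption about how Jellyfish parses the size argument; the manual does not spell it out):

```shell
# Rough RAM estimate for "--jellyfish-hash-size 6400M", assuming the M
# suffix means 10^6 spaces and ~6.9 bytes per hash-table space.
spaces=6400000000
awk -v s="$spaces" 'BEGIN { printf "%.1f GB\n", s * 6.9 / 1e9 }'
# prints "44.2 GB", in line with the manual's 44.3 GB figure
```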
-<h1 id="classification"><a href="#classification">Classification</a></h1>
+<h1 id="classification">Classification</h1>
<p>To classify a set of sequences (reads), use the <code>kraken</code> command:</p>
<pre><code>kraken --db $DBNAME seqs.fa</code></pre>
<p>Output will be sent to standard output by default. The files containing the sequences to be classified should be specified on the command line. Sequences can also be provided through standard input using the special filename <code>/dev/fd/0</code>.</p>
@@ -105,29 +106,29 @@ cp $KRAKEN_DIR/bin/kraken-build $HOME/bin</code></pre>
<p>The <code>kraken</code> program allows several different options:</p>
<ul>
<li><p><strong>Multithreading</strong>: Use the <code>--threads NUM</code> switch to use multiple threads.</p></li>
-<li><p><strong>Quick operation</strong>: Rather than searching all <span class="math"><em>k</em></span>-mers in a sequence, stop classification after the first database hit; use <code>--quick</code> to enable this mode. Note that <code>--min-hits</code> will allow you to require multiple hits before declaring a sequence classified, which can be especially useful with custom databases when testing to see if sequences either do or do not belong to a particular genome.</p></li>
+<li><p><strong>Quick operation</strong>: Rather than searching all <span class="math inline"><em>k</em></span>-mers in a sequence, stop classification after the first database hit; use <code>--quick</code> to enable this mode. Note that <code>--min-hits</code> will allow you to require multiple hits before declaring a sequence classified, which can be especially useful with custom databases when testing to see if sequences either do or do not belong to a particular genome.</p></li>
<li><p><strong>Sequence filtering</strong>: Classified or unclassified sequences can be sent to a file for later processing, using the <code>--classified-out</code> and <code>--unclassified-out</code> switches, respectively.</p></li>
<li><p><strong>Output redirection</strong>: Output can be directed using standard shell redirection (<code>|</code> or <code>></code>), or using the <code>--output</code> switch.</p></li>
<li><p><strong>FASTQ input</strong>: Input is normally expected to be in FASTA format, but you can classify FASTQ data using the <code>--fastq-input</code> switch.</p></li>
<li><p><strong>Compressed input</strong>: Kraken can handle gzip and bzip2 compressed files as input by specifying the proper switch of <code>--gzip-compressed</code> or <code>--bzip2-compressed</code>.</p></li>
<li><p><strong>Input format auto-detection</strong>: If regular files are specified on the command line as input, Kraken will attempt to determine the format of your input prior to classification. You can disable this by explicitly specifying <code>--fasta-input</code>, <code>--fastq-input</code>, <code>--gzip-compressed</code>, and/or <code>--bzip2-compressed</code> as appropriate. Note that use of the character device file <code>/dev/fd/0</code> to read from standard input (aka <code>s [...]
-<li><p><strong>Paired reads</strong>: Kraken does not query <span class="math"><em>k</em></span>-mers containing ambiguous nucleotides (non-ACGT). If you have paired reads, you can use this fact to your advantage and increase Kraken's accuracy by concatenating the pairs together with a single <code>N</code> between the sequences. Using the <code>--paired</code> option when running <code>kraken</code> will automatically do this for you; simply specify the two mate pair files on the comman [...]
+<li><p><strong>Paired reads</strong>: Kraken does not query <span class="math inline"><em>k</em></span>-mers containing ambiguous nucleotides (non-ACGT). If you have paired reads, you can use this fact to your advantage and increase Kraken's accuracy by concatenating the pairs together with a single <code>N</code> between the sequences. Using the <code>--paired</code> option when running <code>kraken</code> will automatically do this for you; simply specify the two mate pair files on the [...]
</ul>
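The paired-read trick described above can be sketched in a couple of lines of shell. The read name and sequences here are made up for illustration; Kraken's own `read_merger.pl` performs the real merging:

```shell
# Join two hypothetical mates with a single N; k-mers containing non-ACGT
# characters are never queried, so no k-mer spans the junction.
r1="ACGTACGTACGT"   # mate 1 sequence (hypothetical)
r2="TTGGCCAATTGG"   # mate 2 sequence (hypothetical)
printf '>read1_merged\n%s\n' "${r1}N${r2}"
# emits ACGTACGTACGTNTTGGCCAATTGG under a single FASTA header
```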
<p>To get a full list of options, use <code>kraken --help</code>.</p>
-<h1 id="output-format"><a href="#output-format">Output Format</a></h1>
+<h1 id="output-format">Output Format</h1>
<p>Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:</p>
<ol style="list-style-type: decimal">
<li>"C"/"U": one letter code indicating that the sequence was either classified or unclassified.</li>
<li>The sequence ID, obtained from the FASTA/FASTQ header.</li>
<li>The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.</li>
<li>The length of the sequence in bp.</li>
-<li>A space-delimited list indicating the LCA mapping of each <span class="math"><em>k</em></span>-mer in the sequence. For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
+<li>A space-delimited list indicating the LCA mapping of each <span class="math inline"><em>k</em></span>-mer in the sequence. For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
<ul>
-<li>the first 13 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
-<li>the next 4 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #561</li>
-<li>the next 31 <span class="math"><em>k</em></span>-mers contained an ambiguous nucleotide</li>
-<li>the next <span class="math"><em>k</em></span>-mer was not in the database</li>
-<li>the last 3 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
+<li>the first 13 <span class="math inline"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
+<li>the next 4 <span class="math inline"><em>k</em></span>-mers mapped to taxonomy ID #561</li>
+<li>the next 31 <span class="math inline"><em>k</em></span>-mers contained an ambiguous nucleotide</li>
+<li>the next <span class="math inline"><em>k</em></span>-mer was not in the database</li>
+<li>the last 3 <span class="math inline"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
</ul></li>
</ol>
<p>For users who want the full taxonomic name associated with each input sequence, we provide a script named <code>kraken-translate</code> that produces two different output formats for classified sequences. The script operates on the output of <code>kraken</code>, like so:</p>
@@ -141,18 +142,19 @@ kraken-translate --db $DBNAME sequences.kraken > sequences.labels</code></pre
<p>Alternatively, <code>kraken-translate</code> accepts the option <code>--mpa-format</code> which will report only levels of the taxonomy with standard rank assignments (superkingdom, kingdom, phylum, class, order, family, genus, species), and uses pipes to delimit the various levels of the taxonomy. For example, <code>kraken-translate --mpa-format --db $DBNAME</code> with the above example output from <code>kraken</code> would result in the following line of output:</p>
<pre><code>SEQ1 d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli</code></pre>
<p>Taxonomy assignments above the superkingdom (<code>d__</code>) rank are represented as just "root" when using the <code>--mpa-format</code> option with <code>kraken-translate</code>.</p>
-<h1 id="custom-databases"><a href="#custom-databases">Custom Databases</a></h1>
+<h1 id="custom-databases">Custom Databases</h1>
<p>We realize the standard database may not suit everyone's needs. Kraken also allows creation of customized databases.</p>
<p>To build a custom database:</p>
<ol style="list-style-type: decimal">
<li><p>Install a taxonomy. Usually, you will just use the NCBI taxonomy, which you can easily download using:</p>
<pre><code>kraken-build --download-taxonomy --db $DBNAME</code></pre>
-<p>This will download the GI number to taxon map, as well as the taxonomic name and tree information from NCBI. These files can be found in <code>$DBNAME/taxonomy/</code> . If you need to modify the taxonomy, edits can be made to the <code>names.dmp</code> and <code>nodes.dmp</code> files in this directory; the <code>gi_taxid_nucl.dmp</code> file will also need to be updated appropriately.</p></li>
+<p>This will download the accession number to taxon map, as well as the taxonomic name and tree information from NCBI. These files can be found in <code>$DBNAME/taxonomy/</code> . If you need to modify the taxonomy, edits can be made to the <code>names.dmp</code> and <code>nodes.dmp</code> files in this directory; the <code>gi_taxid_nucl.dmp</code> file will also need to be updated appropriately.</p></li>
<li><p>Install a genomic library. Four sets of standard genomes are made easily available through <code>kraken-build</code>:</p>
<ul>
-<li>bacteria: RefSeq complete bacterial/archaeal genomes</li>
-<li>plasmids: RefSeq plasmid sequences</li>
-<li>viruses: RefSeq complete viral genomes</li>
+<li>archaea: RefSeq complete archaeal genomes</li>
+<li>bacteria: RefSeq complete bacterial genomes</li>
+<li>plasmid: RefSeq plasmid sequences</li>
+<li>viral: RefSeq complete viral genomes</li>
<li>human: GRCh38 human genome</li>
</ul>
<p>To download and install any one of these, use the <code>--download-library</code> switch, e.g.:</p>
@@ -160,12 +162,12 @@ kraken-translate --db $DBNAME sequences.kraken > sequences.labels</code></pre
Other genomes can also be added, but such genomes must meet certain requirements:
<ul>
<li>Sequences must be in a FASTA file (multi-FASTA is allowed)</li>
-<li>Each sequence's ID (the string between the <code>></code> and the first whitespace character on the header line) must contain either a GI number to allow Kraken to lookup the correct taxa, or an explicit assignment of the taxonomy ID using <code>kraken:taxid</code> (see below).</li>
+<li>Each sequence's ID (the string between the <code>></code> and the first whitespace character on the header line) must contain either an NCBI accession number to allow Kraken to look up the correct taxa, or an explicit assignment of the taxonomy ID using <code>kraken:taxid</code> (see below).</li>
</ul>
<p>Replicons not downloaded from NCBI may need their taxonomy information assigned explicitly. This can be done using the string <code>kraken:taxid|XXX</code> in the sequence ID, with <code>XXX</code> replaced by the desired taxon ID. For example, to put a known adapter sequence in taxon 32630 ("synthetic construct"), you could use the following:</p>
<pre><code>>sequence16|kraken:taxid|32630 Adapter sequence
CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA</code></pre>
-<p>The <code>kraken:taxid</code> string must begin the sequence ID or be immediately preceded by a pipe character (<code>|</code>). Explicit assignment of taxonomy IDs in this manner will override the GI number mapping provided by NCBI.</p>
+<p>The <code>kraken:taxid</code> string must begin the sequence ID or be immediately preceded by a pipe character (<code>|</code>). Explicit assignment of taxonomy IDs in this manner will override the accession number mapping provided by NCBI.</p>
<p>If your genomes meet the requirements above, then you can add each replicon to your database's genomic library using the <code>--add-to-library</code> switch, e.g.:</p>
<pre><code>kraken-build --add-to-library chr1.fa --db $DBNAME
kraken-build --add-to-library chr2.fa --db $DBNAME</code></pre>
@@ -178,19 +180,19 @@ done</code></pre>
<pre><code>find genomes/ -name '*.fa' -print0 | \
xargs -0 -I{} -n1 kraken-build --add-to-library {} --db $DBNAME</code></pre>
<p>(You may also find the <code>-P</code> option to <code>xargs</code> useful to add many files in parallel if you have multiple processors.)</p></li>
-<li><p>Once your library is finalized, you need to build the database. Depending on your size requirements, you may want to adjust the <span class="math"><em>k</em></span>-mer and/or minimizer lengths from the defaults. Except for some small bookkeeping fields, a Kraken database will use <span class="math"><em>s</em><em>D</em></span> + <span class="math">8(4<sup><em>M</em></sup>)</span> bytes, where <span class="math"><em>s</em></span> is the number of bytes used to store the <span class [...]
-<p>The minimizers serve to keep <span class="math"><em>k</em></span>-mers that are adjacent in query sequences close to each other in the database, which allows Kraken to exploit the CPU cache. Changing the value of <span class="math"><em>M</em></span> can significantly affect the speed of Kraken, and neither increasing or decreasing <span class="math"><em>M</em></span> will guarantee faster or slower speed.</p>
+<li><p>Once your library is finalized, you need to build the database. Depending on your size requirements, you may want to adjust the <span class="math inline"><em>k</em></span>-mer and/or minimizer lengths from the defaults. Except for some small bookkeeping fields, a Kraken database will use <span class="math inline"><em>s</em><em>D</em></span> + <span class="math inline">8(4<sup><em>M</em></sup>)</span> bytes, where <span class="math inline"><em>s</em></span> is the number of bytes u [...]
+<p>The minimizers serve to keep <span class="math inline"><em>k</em></span>-mers that are adjacent in query sequences close to each other in the database, which allows Kraken to exploit the CPU cache. Changing the value of <span class="math inline"><em>M</em></span> can significantly affect the speed of Kraken, and neither increasing nor decreasing <span class="math inline"><em>M</em></span> will guarantee faster or slower speed.</p>
<p>To build the database, you'll use the <code>--build</code> switch:</p>
<pre><code>kraken-build --build --db $DBNAME</code></pre>
<p>As noted above, you may want to also use any of <code>--threads</code>, <code>--kmer-len</code>, or <code>--minimizer-len</code> to adjust the database build time and/or final size.</p></li>
-<li><p>Shrinking the database: The "--shrink" task allows you to take an existing Kraken database and create a smaller MiniKraken database from it. The use of this option removes all but a specified number of <span class="math"><em>k</em></span>-mer/taxon pairs to create a new, smaller database. For example:</p>
+<li><p>Shrinking the database: The "--shrink" task allows you to take an existing Kraken database and create a smaller MiniKraken database from it. The use of this option removes all but a specified number of <span class="math inline"><em>k</em></span>-mer/taxon pairs to create a new, smaller database. For example:</p>
<pre><code>kraken-build --shrink 10000 --db $DBNAME --new-db minikraken</code></pre>
-<p>This will create a new database named <code>minikraken</code> that contains 10000 <span class="math"><em>k</em></span>-mers selected from across the original database (<code>$DBNAME</code>).</p>
+<p>This will create a new database named <code>minikraken</code> that contains 10000 <span class="math inline"><em>k</em></span>-mers selected from across the original database (<code>$DBNAME</code>).</p>
<p>The <code>--shrink</code> task is only meant to be run on a completed database. However, if you know before you create a database that you will only be able to use a certain amount of memory, you can use the <code>--max-db-size</code> switch for the <code>--build</code> task to provide a maximum size (in GB) for the database. This allows you to create a MiniKraken database without having to create a full Kraken database first.</p></li>
</ol>
<p>A full list of options for <code>kraken-build</code> can be obtained using <code>kraken-build --help</code>.</p>
<p>After building a database, if you want to reduce the disk usage of the database you can use <code>kraken-build</code>'s <code>--clean</code> switch to remove all intermediate files from the database directory.</p>
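The database-size formula quoted in step 3 (sD + 8(4^M) bytes) can be made concrete. The values below are illustrative assumptions, not documented defaults: s = 12 bytes per k-mer/taxon pair and M = 15, with D taken from the 14.4 billion distinct k-mer count mentioned earlier:

```shell
# Worked instance of sD + 8(4^M) with assumed s = 12, M = 15 and the
# manual's D = 14.4e9 distinct k-mers; the k-mer records dominate.
awk 'BEGIN {
  s = 12; D = 14.4e9; M = 15
  printf "%.0f GB\n", (s * D + 8 * 4^M) / 1e9
}'
# prints "181 GB"; the 8*4^M minimizer index contributes only ~8.6 GB
```

Any gap between such an estimate and the 170 GB figure stated earlier comes down to the actual s, M, and D of the default build; the point here is only how each term scales.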
-<h1 id="memory-usage-and-efficiency"><a href="#memory-usage-and-efficiency">Memory Usage and Efficiency</a></h1>
+<h1 id="memory-usage-and-efficiency">Memory Usage and Efficiency</h1>
<p>Kraken's execution requires many random accesses to a very large file. To obtain maximal speed, these accesses need to be made as quickly as possible. This means that the database must be in physical memory during execution. Although we provide the <code>--preload</code> option to Kraken for users who cannot use a ramdisk, the ramdisk is likely the simplest option, and is well-suited for installations on computers where Kraken is to be run a majority of the time. In addition, using a [...]
<p>We also note that in some cases, <code>--preload</code> may not be needed (or even advisable). If you know that your database is already in memory (for example, if it has been recently read or unzipped, then it should be in your operating system cache, which resides in physical memory), then there is no need to perform this step. We have noticed that in low-memory (~8 GB) situations, preloading a MiniKraken DB is actually much slower than simply using <code>cat minikraken/database.* & [...]
<p>To create a ramdisk, you will need to have superuser (root) permission. As root, you can use the following commands to create a ramdisk:</p>
@@ -203,7 +205,7 @@ mount -t ramfs none /ramdisk</code></pre>
<pre><code>kraken --db /ramdisk/$DBNAME seqs.fa</code></pre>
<p>Note that anything copied into a ramdisk will be deleted if the ramdisk is unmounted or the computer is restarted, so make sure that you have a copy of the database on a hard disk (or other non-volatile storage).</p>
<p>Note that when using the <code>--paired</code> option, Kraken will not (by default) make any attempt to ensure that the two files you specify are indeed matching sets of paired-end reads. To verify that the names of each read do indeed match, you can use the <code>--check-names</code> option in combination with the <code>--paired</code> option.</p>
-<h1 id="sample-reports"><a href="#sample-reports">Sample Reports</a></h1>
+<h1 id="sample-reports">Sample Reports</h1>
<p>To get an idea as to Kraken's results across an entire sample, we provide the <code>kraken-report</code> script. It is used like this:</p>
<pre><code>kraken-report --db $DBNAME kraken.output</code></pre>
<p>Note that the database used must be the same as the one used to generate the output file, or the report script may encounter problems. Output is sent to standard output.</p>
@@ -219,18 +221,18 @@ mount -t ramfs none /ramdisk</code></pre>
<p>The scientific names are indented using spaces, according to the tree structure specified by the taxonomy.</p>
<p>By default, taxa with no reads assigned to (or under) them will not have any output produced. However, if you wish to have all taxa displayed, you can use the <code>--show-zeros</code> switch to do so. This can be useful if you are looking to do further downstream analysis of the reports, and want to compare samples. Sorting by the taxonomy ID (using <code>sort -k5,5 -n</code>) can provide a consistent line ordering between reports.</p>
<p>In addition, we also provide the program <code>kraken-mpa-report</code>; this program provides output in a format similar to MetaPhlAn's tab-delimited output. For <code>kraken-mpa-report</code>, multiple Kraken output files can be specified on the command line and each will be treated as a separate sample. For each taxon at the standard ranks (from domain to species), the count of reads in each sample assigned to any node in the clade rooted at that taxon is displayed. <code>kraken-mp [...]
-<h1 id="confidence-scoring"><a href="#confidence-scoring">Confidence Scoring</a></h1>
+<h1 id="confidence-scoring">Confidence Scoring</h1>
<p>At present, we have not yet developed a confidence score with a solid probabilistic interpretation for Kraken. However, we have developed a simple scoring scheme that has yielded good results for us, and we've made that available in the <code>kraken-filter</code> script. The approach we use allows a user to specify a threshold score in the [0,1] interval; the <code>kraken-filter</code> script then will adjust labels up the tree until the label's score (described below) meets or exceed [...]
-<p>A sequence label's score is a fraction <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span>, where <span class="math"><em>C</em></span> is the number of <span class="math"><em>k</em></span>-mers mapped to LCA values in the clade rooted at the label, and <span class="math"><em>Q</em></span> is the number of <span class="math"><em>k</em></span>-mers in the sequence that lack an ambiguous nucleotide (i.e., they were queried against the database). Consider the example [...]
+<p>A sequence label's score is a fraction <span class="math inline"><em>C</em></span>/<span class="math inline"><em>Q</em></span>, where <span class="math inline"><em>C</em></span> is the number of <span class="math inline"><em>k</em></span>-mers mapped to LCA values in the clade rooted at the label, and <span class="math inline"><em>Q</em></span> is the number of <span class="math inline"><em>k</em></span>-mers in the sequence that lack an ambiguous nucleotide (i.e., they were queried a [...]
<p>"562:13 561:4 A:31 0:1 562:3" would indicate that:</p>
<ul>
-<li>the first 13 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
-<li>the next 4 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #561</li>
-<li>the next 31 <span class="math"><em>k</em></span>-mers contained an ambiguous nucleotide</li>
-<li>the next <span class="math"><em>k</em></span>-mer was not in the database</li>
-<li>the last 3 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
+<li>the first 13 <span class="math inline"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
+<li>the next 4 <span class="math inline"><em>k</em></span>-mers mapped to taxonomy ID #561</li>
+<li>the next 31 <span class="math inline"><em>k</em></span>-mers contained an ambiguous nucleotide</li>
+<li>the next <span class="math inline"><em>k</em></span>-mer was not in the database</li>
+<li>the last 3 <span class="math inline"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
</ul>
-<p>In this case, ID #561 is the parent node of #562. Here, a label of #562 for this sequence would have a score of <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span> = (13+3)/(13+4+1+3) = 16/21. A label of #561 would have a score of <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span> = (13+4+3)/(13+4+1+3) = 20/21. If a user specified a threshold over 16/21, kraken-filter would adjust the original label from #562 to #561; if the threshold was g [...]
+<p>In this case, ID #561 is the parent node of #562. Here, a label of #562 for this sequence would have a score of <span class="math inline"><em>C</em></span>/<span class="math inline"><em>Q</em></span> = (13+3)/(13+4+1+3) = 16/21. A label of #561 would have a score of <span class="math inline"><em>C</em></span>/<span class="math inline"><em>Q</em></span> = (13+4+3)/(13+4+1+3) = 20/21. If a user specified a threshold over 16/21, kraken-filter would adjust the original label from #562 to [...]
<p><code>kraken-filter</code> is used like this:</p>
<pre><code>kraken-filter --db $DBNAME [--threshold NUM] kraken.output</code></pre>
<p>If not specified, the threshold will be 0. <code>kraken-filter</code>'s output is similar to <code>kraken</code>'s, but a new field between the length and LCA mapping list is present, indicating the new label's score (or the root label's score if the sequence has become unclassified).</p>
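The C/Q arithmetic above can be reproduced directly from an LCA mapping string. The sketch below uses the example string from this section and assumes, as in the worked example, that the clade rooted at #562 contains only #562:

```shell
# Compute the confidence score for label #562 from the example
# LCA mapping string "562:13 561:4 A:31 0:1 562:3".
# C = k-mers mapping into the clade rooted at the label;
# Q = k-mers actually queried (ambiguous "A" k-mers are excluded).
score_for_562=$(echo "562:13 561:4 A:31 0:1 562:3" | awk '{
  C = 0; Q = 0
  for (i = 1; i <= NF; i++) {
    split($i, kv, ":")
    if (kv[1] == "A") continue        # ambiguous k-mers are not counted in Q
    Q += kv[2]
    if (kv[1] == "562") C += kv[2]    # clade of #562 is just #562 here
  }
  printf "%d/%d", C, Q
}')
echo "$score_for_562"   # 16/21, matching the worked example
```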
@@ -316,7 +318,7 @@ mount -t ramfs none /ramdisk</code></pre>
</table>
</div>
<p>As can be seen, with no threshold (i.e., Kraken's original labels), Kraken's precision is already fairly high, and it increases further with the threshold. Diminishing returns apply, however, and there is a loss in sensitivity that must be taken into account when deciding on the threshold to use for your own project.</p>
-<h1 id="kraken-environment-variables"><a href="#kraken-environment-variables">Kraken Environment Variables</a></h1>
+<h1 id="kraken-environment-variables">Kraken Environment Variables</h1>
<p>The Kraken programs (with the exception of <code>kraken-build</code>) support the use of some environment variables to help in reducing command line lengths:</p>
<ul>
<li><p><strong><code>KRAKEN_NUM_THREADS</code></strong>: this variable is only used by <code>kraken</code>; if the <code>--threads</code> option is not supplied to <code>kraken</code>, then the value of this variable (if it is set) will be used as the number of threads to run <code>kraken</code>.</p></li>
@@ -343,8 +345,8 @@ kraken sequences.fa | kraken-report > sequences.kreport</code></pre>
kraken sequences.fa</code></pre>
<p>will use <code>/data/kraken_dbs/mainDB</code> to classify <code>sequences.fa</code>.</p></li>
</ul>
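The variables above can be combined in a single session; a minimal sketch with placeholder paths (the `kraken` invocation is shown as a comment):

```shell
# Hypothetical values; substitute your own database locations.
export KRAKEN_NUM_THREADS=8
export KRAKEN_DB_PATH="/data/kraken_dbs"
export KRAKEN_DEFAULT_DB="mainDB"

# With these set, the following would be equivalent to
#   kraken --threads 8 --db /data/kraken_dbs/mainDB sequences.fa
# kraken sequences.fa

echo "kraken would use $KRAKEN_DB_PATH/$KRAKEN_DEFAULT_DB with $KRAKEN_NUM_THREADS threads"
```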
-<h1 id="upgrading-databases-to-v0.10"><a href="#upgrading-databases-to-v0.10">Upgrading Databases to v0.10+</a></h1>
-<p>The minimizer ordering in Kraken versions prior to v0.10.0-beta was a simple lexicographical ordering that provided a suboptimal distribution of k-mers within the bins. Ideally, the bin sizes would be uniform, but simple lexicographical ordering creates a bias toward low-complexity minimizers. To resolve this, the ordering is now "scrambled" by XORing all minimizers with a predefined constant to toggle half of each minimizer's bits before sorting. The more evenly distributed [...]
+<h1 id="upgrading-databases-to-v0.10">Upgrading Databases to v0.10+</h1>
+<p>The minimizer ordering in Kraken versions prior to v0.10.0-beta was a simple lexicographical ordering that provided a suboptimal distribution of k-mers within the bins. Ideally, the bin sizes would be uniform, but simple lexicographical ordering creates a bias toward low-complexity minimizers. To resolve this, the ordering is now "scrambled" by XORing all minimizers with a predefined constant to toggle half of each minimizer's bits before sorting. The more evenly distributed [...]
<ol style="list-style-type: decimal">
<li><p>Build a new database. This is the preferred option, as a newly-created database will have the latest genomes and NCBI taxonomy information.</p></li>
<li><p>Re-sort an existing database. If you have a custom database, you may want to simply reformat the database to provide you with Kraken's increased speed. To do so, you'll need to do the following:</p>
diff --git a/docs/MANUAL.markdown b/docs/MANUAL.markdown
index 903a1ea..1980a3b 100644
--- a/docs/MANUAL.markdown
+++ b/docs/MANUAL.markdown
@@ -36,10 +36,11 @@ Note: Users concerned about the disk or memory requirements should
read the paragraph about MiniKraken, below.
* **Disk space**: Construction of Kraken's standard database will require at
- least 160 GB of disk space. Customized databases may require
+ least 500 GB of disk space. Customized databases may require
more or less space. Disk space used is linearly proportional
- to the number of distinct $k$-mers; as of Feb. 2015, Kraken's
- default database contains just under 6 billion (6e9) distinct $k$-mers.
+ to the number of distinct $k$-mers; as of Oct. 2017, Kraken's
+ default database contains just over 14.4 billion (1.44e10)
+ distinct $k$-mers.
In addition, the disk used to store the database should be
locally-attached storage. Storing the database on a network
@@ -51,8 +52,8 @@ read the paragraph about MiniKraken, below.
* **Memory**: To run efficiently, Kraken requires enough free memory to
hold the database in RAM. While this can be accomplished using a
ramdisk, Kraken supplies a utility for loading the database into
- RAM via the OS cache. The default database size is 75 GB (as of
- Feb. 2015), and so you will need at least that much RAM if you want
+ RAM via the OS cache. The default database size is 170 GB (as of
+ Oct. 2017), and so you will need at least that much RAM if you want
to build or run with the default database.
* **Dependencies**: Kraken currently makes extensive use of Linux utilities
@@ -152,7 +153,7 @@ To create the standard Kraken database, you can use the following command:
kraken-build --standard --db $DBNAME
(Replace "`$DBNAME`" above with your preferred database name/location.
-Please note that the database will use approximately 160 GB of
+Please note that the database will use approximately 500 GB of
disk space during creation.)
This will download NCBI taxonomic information, as well as the
@@ -162,20 +163,24 @@ process begins; this is the most time-consuming step. If you
have multiple processing cores, you can run this process with
multiple threads, e.g.:
- kraken-build --standard --threads 16 --db $DBNAME
-
-Using 16 threads on a computer with 122 GB of RAM, the build
-process took approximately an hour and a half (steps with an asterisk
-have some multi-threading enabled) in February 2015:
-
- 7m48s *Step 1 (create set)
- n/a Step 2 (reduce database, optional and skipped)
- 53m16s *Step 3 (sort set)
- 1m04s Step 4 (GI number to sequence ID map)
- 0m27s Step 5 (Sequence ID to taxon map)
- 29m20s *Step 6 (set LCA values)
- ------
- 91m55s Total build time
+ kraken-build --standard --threads 24 --db $DBNAME
+
+Using 24 threads on a computer (an AWS r4.8xlarge instance)
+with 244 GB of RAM, the build process took approximately 5 hours
+(steps with an asterisk have some multi-threading enabled) in
+October 2017:
+
+ 24m50s *Step 1 (create set)
+ n/a Step 2 (reduce database, optional and skipped)
+ 154m53s *Step 3 (sort set)
+ n/a Step 4 (GI number to sequence ID map - now obsolete)
+ <1s Step 5 (Sequence ID to taxon map)
+ 127m28s *Step 6 (set LCA values)
+ -------
+ 5h7m11s Total build time
+
+This process used the automatically estimated jellyfish hash size
+of 20170976000.
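If the automatic estimate is unsuitable, the hash size can be supplied explicitly via `kraken-build`'s `--jellyfish-hash-size` option; a hedged sketch (the value shown is a placeholder, and the invocation is a comment):

```shell
# Hypothetical hash size; jellyfish accepts suffixed sizes such as "6400M".
HASH_SIZE="6400M"
# kraken-build --standard --threads 24 --db $DBNAME \
#              --jellyfish-hash-size $HASH_SIZE
echo "would request a jellyfish hash of $HASH_SIZE"
```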
Note that if any step (including the initial downloads) fails,
the build process will abort. However, `kraken-build` will
@@ -352,7 +357,7 @@ To build a custom database:
kraken-build --download-taxonomy --db $DBNAME
- This will download the GI number to taxon map, as well as the
+ This will download the accession number to taxon map, as well as the
taxonomic name and tree information from NCBI. These files can
be found in `$DBNAME/taxonomy/` . If you need to modify the taxonomy,
edits can be made to the `names.dmp` and `nodes.dmp` files in this directory;
@@ -361,9 +366,10 @@ To build a custom database:
2) Install a genomic library. Four sets of standard genomes are
made easily available through `kraken-build`:
- - bacteria: RefSeq complete bacterial/archaeal genomes
- - plasmids: RefSeq plasmid sequences
- - viruses: RefSeq complete viral genomes
+ - archaea: RefSeq complete archaeal genomes
+ - bacteria: RefSeq complete bacterial genomes
+ - plasmid: RefSeq plasmid sequences
+ - viral: RefSeq complete viral genomes
- human: GRCh38 human genome
To download and install any one of these, use the `--download-library`
@@ -376,7 +382,7 @@ To build a custom database:
- Sequences must be in a FASTA file (multi-FASTA is allowed)
- Each sequence's ID (the string between the `>` and the first
whitespace character on the header line) must contain either
- a GI number to allow Kraken to lookup the correct taxa, or an
+      an NCBI accession number to allow Kraken to look up the correct taxa, or an
explicit assignment of the taxonomy ID using `kraken:taxid` (see below).
Replicons not downloaded from NCBI may need their taxonomy information
@@ -390,7 +396,7 @@ To build a custom database:
The `kraken:taxid` string must begin the sequence ID or be immediately
preceded by a pipe character (`|`). Explicit assignment of taxonomy IDs
- in this manner will override the GI number mapping provided by NCBI.
+ in this manner will override the accession number mapping provided by NCBI.
If your genomes meet the requirements above, then you can add each
replicon to your database's genomic library using the `--add-to-library`
@@ -724,7 +730,7 @@ minimizers with a predefined constant to toggle half of each minimizer's
bits before sorting. The more evenly distributed bins provide better
caching performance, but databases created in this way are not compatible
with earlier versions of Kraken. Kraken versions from v0.10.0-beta up to
-(but not including) v1.0 will support the use of the older databases, but
+(and including) v1.0 will support the use of the older databases, but
we nonetheless recommend one of the two following options:
1) Build a new database. This is the preferred option, as a newly-created
diff --git a/docs/top.html b/docs/top.html
index a189458..1cb42aa 100644
--- a/docs/top.html
+++ b/docs/top.html
@@ -1,7 +1,7 @@
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>
- <p class="version">Version 0.10.5-beta</p>
+ <p class="version">Version 1.0</p>
<p>Operating Manual</p>
</div>
diff --git a/install_kraken.sh b/install_kraken.sh
index fff418d..d13f6ed 100755
--- a/install_kraken.sh
+++ b/install_kraken.sh
@@ -19,7 +19,7 @@
set -e
-VERSION="0.10.5-beta"
+VERSION="1.0"
if [ -z "$1" ] || [ -n "$2" ]
then
@@ -59,5 +59,8 @@ echo "To make things easier for you, you may want to copy/symlink the following"
echo "files into a directory in your PATH:"
for file in $KRAKEN_DIR/kraken*
do
- [ -x "$file" ] && echo " $file"
+ if [ -x "$file" ]
+ then
+ echo " $file"
+ fi
done
diff --git a/scripts/add_to_library.sh b/scripts/add_to_library.sh
index ee50323..db180a5 100755
--- a/scripts/add_to_library.sh
+++ b/scripts/add_to_library.sh
@@ -1,6 +1,6 @@
#!/bin/bash
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
#
@@ -24,26 +24,38 @@ set -e # Stop on error
LIBRARY_DIR="$KRAKEN_DB_NAME/library"
-if [ ! -e "$1" ]
+input_file=$1
+
+if [ ! -e "$input_file" ]
then
- echo "Can't add \"$1\": file does not exist"
+ echo "Can't add \"$input_file\": file does not exist"
exit 1
fi
-if [ ! -f "$1" ]
+if [ ! -f "$input_file" ]
then
- echo "Can't add \"$1\": not a regular file"
+ echo "Can't add \"$input_file\": not a regular file"
exit 1
fi
-if ! verify_gi_numbers.pl "$1"
+add_dir="$LIBRARY_DIR/added"
+mkdir -p "$add_dir"
+
+if [[ $input_file == *.gbff || $input_file == *.gbff.gz || $input_file == *.gbk || $input_file == *.gbk.gz ]]
then
- echo "Can't add \"$1\": sequence is missing GI number"
- exit 1
+ convert_gb_to_fa.pl $input_file > "$add_dir/temp.fna"
+ input_file="$add_dir/temp.fna"
fi
+
+scan_fasta_file.pl "$input_file" > "$add_dir/temp_map.txt"
-add_dir="$LIBRARY_DIR/added"
-mkdir -p "$add_dir"
+filename=$(cp_into_tempfile.pl -t "XXXXXXXXXX" -d "$add_dir" -s fna "$input_file")
+
+cat "$add_dir/temp_map.txt" >> "$add_dir/prelim_map.txt"
+rm "$add_dir/temp_map.txt"
-filename=$(cp_into_tempfile.pl -t "XXXXXXXXXX" -d "$add_dir" -s fna "$1")
+if [ -e "$add_dir/temp.fna" ]
+then
+ rm "$add_dir/temp.fna"
+fi
echo "Added \"$1\" to library ($KRAKEN_DB_NAME)"
diff --git a/scripts/build_kraken_db.sh b/scripts/build_kraken_db.sh
index 7df4d0b..453678c 100755
--- a/scripts/build_kraken_db.sh
+++ b/scripts/build_kraken_db.sh
@@ -1,6 +1,6 @@
#!/bin/bash
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
#
@@ -69,14 +69,14 @@ else
start_time1=$(date "+%s.%N")
check_for_jellyfish.sh
- # Estimate hash size as 1.15 * chars in library FASTA files
+ # Estimate hash size as 1.25 * estimated k-mer count
if [ -z "$KRAKEN_HASH_SIZE" ]
then
- KRAKEN_HASH_SIZE=$(find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -printf '%s\n' | perl -nle '$sum += $_; END {print int(1.15 * $sum)}')
+ KRAKEN_HASH_SIZE=$(find library/ -name '*.fna' -print0 | xargs -0 cat | kmer_estimator -m 1.25 -t $KRAKEN_THREAD_CT -k $KRAKEN_KMER_LEN)
echo "Hash size not specified, using '$KRAKEN_HASH_SIZE'"
fi
- find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0 | \
+ find library/ -name '*.fna' -print0 | \
xargs -0 cat | \
jellyfish count -m $KRAKEN_KMER_LEN -s $KRAKEN_HASH_SIZE -C -t $KRAKEN_THREAD_CT \
-o database /dev/fd/0
@@ -154,29 +154,36 @@ else
echo "K-mer set sorted. [$(report_time_elapsed $start_time1)]"
fi
-if [ -e "gi2seqid.map" ]
-then
- echo "Skipping step 4, GI number to seqID map already complete."
-else
- echo "Creating GI number to seqID map (step 4 of 6)..."
- start_time1=$(date "+%s.%N")
- find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0 | \
- xargs -0 cat | report_gi_numbers.pl > gi2seqid.map.tmp
- mv gi2seqid.map.tmp gi2seqid.map
-
- echo "GI number to seqID map created. [$(report_time_elapsed $start_time1)]"
-fi
+echo "Skipping step 4, GI number to seqID map now obsolete."
-if [ -e "seqid2taxid.map" ]
+seqid2taxid_map_file="seqid2taxid.map"
+if [ -e "$seqid2taxid_map_file" ]
then
echo "Skipping step 5, seqID to taxID map already complete."
else
echo "Creating seqID to taxID map (step 5 of 6)..."
start_time1=$(date "+%s.%N")
- make_seqid_to_taxid_map taxonomy/gi_taxid_nucl.dmp gi2seqid.map \
- > seqid2taxid.map.tmp
- mv seqid2taxid.map.tmp seqid2taxid.map
- line_ct=$(wc -l seqid2taxid.map | awk '{print $1}')
+
+ find library/ -maxdepth 2 -name prelim_map.txt | xargs cat > taxonomy/prelim_map.txt
+ if [ ! -s "taxonomy/prelim_map.txt" ]; then
+ echo "No preliminary seqid/taxid mapping files found, aborting."
+ exit 1
+ fi
+
+ grep "^TAXID" taxonomy/prelim_map.txt | cut -f 2- > $seqid2taxid_map_file.tmp || true
+ if grep "^ACCNUM" taxonomy/prelim_map.txt | cut -f 2- > accmap_file.tmp; then
+ if compgen -G "taxonomy/*.accession2taxid" > /dev/null; then
+ lookup_accession_numbers.pl accmap_file.tmp taxonomy/*.accession2taxid > seqid2taxid_acc.tmp
+ cat seqid2taxid_acc.tmp >> $seqid2taxid_map_file.tmp
+ rm seqid2taxid_acc.tmp
+ else
+ echo "Accession to taxid map files are required to build this DB."
+ echo "Run 'kraken-build --db $KRAKEN_DB_NAME --download-taxonomy' again?"
+ exit 1
+ fi
+ fi
+ mv $seqid2taxid_map_file.tmp $seqid2taxid_map_file
+ line_ct=$(wc -l $seqid2taxid_map_file | awk '{print $1}')
echo "$line_ct sequences mapped to taxa. [$(report_time_elapsed $start_time1)]"
fi
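The step-5 logic above splits `prelim_map.txt` into two record types. The sketch below fabricates a tiny two-record map to show the effect of the `grep`/`cut` calls; the record layout (`TAXID`/`ACCNUM` tag, sequence ID, then taxid or accession) is inferred from those calls and should be checked against `scan_fasta_file.pl`:

```shell
# Fabricated prelim_map.txt with one record of each type.
printf 'TAXID\tseq1|kraken:taxid|562\t562\nACCNUM\tNC_000913.3\tNC_000913\n' > prelim_map.txt

# Records with an explicit taxid go straight into the seqid->taxid map:
taxid_pairs=$(grep "^TAXID" prelim_map.txt | cut -f 2-)

# Accession records are resolved later against the *.accession2taxid files:
acc_pairs=$(grep "^ACCNUM" prelim_map.txt | cut -f 2-)

echo "$taxid_pairs"
echo "$acc_pairs"
rm -f prelim_map.txt
```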
diff --git a/scripts/convert_gb_to_fa.pl b/scripts/convert_gb_to_fa.pl
new file mode 100755
index 0000000..762c2e7
--- /dev/null
+++ b/scripts/convert_gb_to_fa.pl
@@ -0,0 +1,58 @@
+#!/usr/bin/env perl
+
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
+#
+# This file is part of the Kraken taxonomic sequence classification system.
+#
+# Kraken is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Kraken is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Kraken. If not, see <http://www.gnu.org/licenses/>.
+
+# Pull sequence data with accession and taxid from Genbank file and output in fasta format
+# Adapted from @tseemann https://github.com/MDU-PHL/mdu-tools/blob/master/bin/genbank-to-kraken_fasta.pl
+
+use strict;
+use warnings;
+
+@ARGV or die "Usage: $0 <file.gbk[.gz]> ...";
+
+my $wrote=0;
+my($seqid, $in_seq, $taxid);
+
+my $input_file = $ARGV[0];
+
+open(IN, "gunzip -c -f \Q$input_file\E |") or die "can't open pipe to $input_file";
+
+while (<IN>) {
+ if (m/^VERSION\s+(\S+)/) {
+ $seqid = $1;
+ }
+ elsif (m/taxon:(\d+)/) {
+ $taxid = $1;
+ }
+ elsif (m/^ORIGIN/) {
+ $in_seq = 1;
+ print ">$seqid|kraken:taxid|$taxid\n";
+ }
+ elsif (m{^//}) {
+ $in_seq = $taxid = $seqid = undef;
+ $wrote++;
+ }
+ elsif ($in_seq) {
+ substr $_, 0, 10, '';
+ s/\s//g;
+ print uc($_), "\n";
+ }
+}
+
+close IN;
+
diff --git a/scripts/cp_into_tempfile.pl b/scripts/cp_into_tempfile.pl
index 4e24ff2..c502d2e 100755
--- a/scripts/cp_into_tempfile.pl
+++ b/scripts/cp_into_tempfile.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/download_genomic_library.sh b/scripts/download_genomic_library.sh
index ddf2ef1..add3c18 100755
--- a/scripts/download_genomic_library.sh
+++ b/scripts/download_genomic_library.sh
@@ -1,6 +1,6 @@
#!/bin/bash
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
#
@@ -19,101 +19,65 @@
# Download specific genomic libraries for use with Kraken.
# Supported choices are:
-# bacteria - NCBI RefSeq complete bacterial/archaeal genomes
+# archaea - NCBI RefSeq complete archaeal genomes
+# bacteria - NCBI RefSeq complete bacterial genomes
# plasmids - NCBI RefSeq plasmid sequences
-# viruses - NCBI RefSeq complete viral DNA and RNA genomes
+# viral - NCBI RefSeq complete viral DNA and RNA genomes
# human - NCBI RefSeq GRCh38 human reference genome
set -u # Protect against uninitialized vars.
set -e # Stop on error
LIBRARY_DIR="$KRAKEN_DB_NAME/library"
-NCBI_SERVER="ftp.ncbi.nih.gov"
+NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
FTP_SERVER="ftp://$NCBI_SERVER"
RSYNC_SERVER="rsync://$NCBI_SERVER"
THIS_DIR=$PWD
+library_name="$1"
+library_file="library.fna"
+if [ -e "$LIBRARY_DIR/$library_name/.completed" ]; then
+ echo "Skipping $library_name, already completed library download"
+ exit 0
+fi
case "$1" in
- "bacteria")
- mkdir -p $LIBRARY_DIR/Bacteria
- cd $LIBRARY_DIR/Bacteria
- if [ ! -e "lib.complete" ]
- then
- rm -f all.fna.tar.gz
- wget $FTP_SERVER/genomes/Bacteria/all.fna.tar.gz
- echo -n "Unpacking..."
- tar zxf all.fna.tar.gz
- rm all.fna.tar.gz
- echo " complete."
- touch "lib.complete"
- else
- echo "Skipping download of bacterial genomes, already downloaded here."
+ "archaea" | "bacteria" | "viral" | "human" )
+ mkdir -p $LIBRARY_DIR/$library_name
+ cd $LIBRARY_DIR/$library_name
+ rm -f assembly_summary.txt
+ remote_dir_name=$library_name
+ if [ "$library_name" = "human" ]; then
+ remote_dir_name="vertebrate_mammalian/Homo_sapiens"
fi
- ;;
- "plasmids")
- mkdir -p $LIBRARY_DIR/Plasmids
- cd $LIBRARY_DIR/Plasmids
- if [ ! -e "lib.complete" ]
- then
- rm -f plasmids.all.fna.tar.gz
- wget $FTP_SERVER/genomes/Plasmids/plasmids.all.fna.tar.gz
- echo -n "Unpacking..."
- tar zxf plasmids.all.fna.tar.gz
- rm plasmids.all.fna.tar.gz
- echo " complete."
- touch "lib.complete"
- else
- echo "Skipping download of plasmids, already downloaded here."
+ if ! wget -q $FTP_SERVER/genomes/refseq/$remote_dir_name/assembly_summary.txt; then
+ echo "Error downloading assembly summary file for $library_name, exiting." >/dev/fd/2
+ exit 1
fi
- ;;
- "viruses")
- mkdir -p $LIBRARY_DIR/Viruses
- cd $LIBRARY_DIR/Viruses
- if [ ! -e "lib.complete" ]
- then
- rm -f all.fna.tar.gz
- rm -f all.ffn.tar.gz
- wget $FTP_SERVER/genomes/Viruses/all.fna.tar.gz
- wget $FTP_SERVER/genomes/Viruses/all.ffn.tar.gz
- echo -n "Unpacking..."
- tar zxf all.fna.tar.gz
- tar zxf all.ffn.tar.gz
- rm all.fna.tar.gz
- rm all.ffn.tar.gz
- echo " complete."
- touch "lib.complete"
- else
- echo "Skipping download of viral genomes, already downloaded here."
+ if [ "$library_name" = "human" ]; then
+ grep "Genome Reference Consortium" assembly_summary.txt > x
+ mv x assembly_summary.txt
fi
+ rm -rf all/ library.f* manifest.txt rsync.err
+ rsync_from_ncbi.pl assembly_summary.txt
+ scan_fasta_file.pl $library_file > prelim_map.txt
+ touch .completed
;;
- "human")
- mkdir -p $LIBRARY_DIR/Human
- cd $LIBRARY_DIR/Human
- if [ ! -e "lib.complete" ]
- then
- # get list of CHR_* directories
- wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/
- directories=$(perl -nle '/^d/ and /(CHR_\w+)\s*$/ and print $1' .listing)
- rm .listing
-
- # For each CHR_* directory, get GRCh* fasta gzip file name, d/l, unzip, and add
- for directory in $directories
- do
- wget --spider --no-remove-listing $FTP_SERVER/genomes/H_sapiens/$directory/
- file=$(perl -nle '/^-/ and /\b(hs_ref_GRCh\w+\.fa\.gz)\s*$/ and print $1' .listing)
- [ -z "$file" ] && exit 1
- rm .listing
- wget $FTP_SERVER/genomes/H_sapiens/$directory/$file
- gunzip "$file"
- done
-
- touch "lib.complete"
- else
- echo "Skipping download of human genome, already downloaded here."
- fi
+ "plasmid")
+ mkdir -p $LIBRARY_DIR/plasmid
+ cd $LIBRARY_DIR/plasmid
+ rm -f library.f* plasmid.*
+ echo -n "Downloading plasmid files from FTP..."
+ wget -q --no-remove-listing --spider $FTP_SERVER/genomes/refseq/plasmid/
+ awk '{ print $NF }' .listing | perl -ple 'tr/\r//d' | grep '\.fna\.gz' > manifest.txt
+ cat manifest.txt | xargs -n1 -I{} wget -q $FTP_SERVER/genomes/refseq/plasmid/{}
+ cat manifest.txt | xargs -n1 -I{} gunzip -c {} > $library_file
+ rm -f plasmid.* .listing
+ scan_fasta_file.pl $library_file > prelim_map.txt
+ touch .completed
+ echo " done."
;;
*)
echo "Unsupported library. Valid options are: "
- echo " bacteria plasmids virus human"
+ echo " archaea bacteria plasmid viral human"
;;
esac
diff --git a/scripts/download_taxonomy.sh b/scripts/download_taxonomy.sh
index fa73616..8d538fa 100755
--- a/scripts/download_taxonomy.sh
+++ b/scripts/download_taxonomy.sh
@@ -1,41 +1,30 @@
#!/bin/bash
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
-#
-# Kraken is free software: you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation, either version 3 of the License, or
-# (at your option) any later version.
-#
-# Kraken is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-# GNU General Public License for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with Kraken. If not, see <http://www.gnu.org/licenses/>.
# Download NCBI taxonomy information for Kraken.
-# Designed to be called by kraken_build
+# Designed to be called by kraken-build
set -u # Protect against uninitialized vars.
set -e # Stop on error
TAXONOMY_DIR="$KRAKEN_DB_NAME/taxonomy"
-NCBI_SERVER="ftp.ncbi.nih.gov"
+NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
FTP_SERVER="ftp://$NCBI_SERVER"
-THIS_DIR=$PWD
mkdir -p "$TAXONOMY_DIR"
cd "$TAXONOMY_DIR"
-if [ ! -e "gimap.dlflag" ]
+if [ ! -e "accmap.dlflag" ]
then
- wget $FTP_SERVER/pub/taxonomy/gi_taxid_nucl.dmp.gz
- touch gimap.dlflag
- echo "Downloaded GI to taxon map"
+ wget $FTP_SERVER/pub/taxonomy/accession2taxid/nucl_est.accession2taxid.gz
+ wget $FTP_SERVER/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
+ wget $FTP_SERVER/pub/taxonomy/accession2taxid/nucl_gss.accession2taxid.gz
+ wget $FTP_SERVER/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
+ touch accmap.dlflag
+ echo "Downloaded accession to taxon map(s)"
fi
if [ ! -e "taxdump.dlflag" ]
@@ -45,16 +34,17 @@ then
echo "Downloaded taxonomy tree data"
fi
-if [ ! -e "gimap.flag" ]
+if ls | grep -q 'accession2taxid\.gz$'
then
- gunzip gi_taxid_nucl.dmp.gz
- touch gimap.flag
- echo "Uncompressed GI to taxon map"
+ echo -n "Uncompressing taxonomy data... "
+ gunzip *accession2taxid.gz
+ echo "done."
fi
-if [ ! -e "taxdump.flag" ]
+if [ ! -e "taxdump.untarflag" ]
then
+ echo -n "Untarring taxonomy tree data... "
tar zxf taxdump.tar.gz
- touch taxdump.flag
- echo "Uncompressed taxonomy tree data"
+ touch taxdump.untarflag
+ echo "done."
fi
diff --git a/scripts/kraken b/scripts/kraken
index 57cc717..af5f67b 100755
--- a/scripts/kraken
+++ b/scripts/kraken
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/kraken-build b/scripts/kraken-build
index bf36ae6..31a8d30 100755
--- a/scripts/kraken-build
+++ b/scripts/kraken-build
@@ -1,6 +1,6 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
#
@@ -40,7 +40,7 @@ my $DEF_MINIMIZER_LEN = 15;
my $DEF_KMER_LEN = 31;
my $DEF_THREAD_CT = 1;
-my @VALID_LIBRARY_TYPES = qw/bacteria plasmids viruses human/;
+my @VALID_LIBRARY_TYPES = qw/archaea bacteria plasmid viral human/;
# Option/task option variables
my (
@@ -199,8 +199,8 @@ Usage: $PROG [task option] [options]
Task options (exactly one must be selected):
--download-taxonomy Download NCBI taxonomic information
--download-library TYPE Download partial library
- (TYPE = one of "bacteria", "plasmids",
- "viruses", "human")
+ (TYPE = one of "archaea", "bacteria", "plasmid",
+ "viral", "human")
--add-to-library FILE Add FILE to library
--build Create DB from library
(requires taxonomy d/l'ed and at least one file
diff --git a/scripts/kraken-filter b/scripts/kraken-filter
index 04dcb7c..5ab01df 100755
--- a/scripts/kraken-filter
+++ b/scripts/kraken-filter
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/kraken-mpa-report b/scripts/kraken-mpa-report
index 7813569..526a167 100755
--- a/scripts/kraken-mpa-report
+++ b/scripts/kraken-mpa-report
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/kraken-report b/scripts/kraken-report
index 8351593..582a819 100755
--- a/scripts/kraken-report
+++ b/scripts/kraken-report
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
@@ -91,8 +91,13 @@ for (keys %name_map) {
$clade_counts{$_} ||= 0;
}
+my $unclassified_percent = 100;
+if ($seq_count) {
+ $unclassified_percent = $clade_counts{0} * 100 / $seq_count;
+}
+
printf "%6.2f\t%d\t%d\t%s\t%d\t%s%s\n",
- $clade_counts{0} * 100 / $seq_count,
+ $unclassified_percent,
$clade_counts{0}, $taxo_counts{0}, "U",
0, "", "unclassified";
dfs_report(1, 0);
diff --git a/scripts/kraken-translate b/scripts/kraken-translate
index 89a067a..46c9102 100755
--- a/scripts/kraken-translate
+++ b/scripts/kraken-translate
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/krakenlib.pm b/scripts/krakenlib.pm
index 943d6c6..f7e01b2 100644
--- a/scripts/krakenlib.pm
+++ b/scripts/krakenlib.pm
@@ -1,6 +1,6 @@
package krakenlib;
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
#
@@ -73,4 +73,31 @@ sub find_db {
return $db_prefix;
}
+# Input: a FASTA sequence ID
+# Output: (a) a taxonomy ID number found in the sequence ID,
+#   (b) an NCBI accession number found in the sequence ID, or (c) undef
+sub check_seqid {
+ my $seqid = shift;
+ my $taxid = undef;
+ # Note all regexes here use ?: to avoid capturing the ^ or | character
+ if ($seqid =~ /(?:^|\|)kraken:taxid\|(\d+)/) {
+ $taxid = $1; # OK, has explicit taxid
+ }
+ elsif ($seqid =~ /^(\d+)$/) {
+ $taxid = $1; # OK, has explicit taxid (w/o token)
+ }
+ # Accession number check
+ elsif ($seqid =~ /(?:^|\|) # Begins seqid or immediately follows pipe
+ ([A-Z]+ # Starts with one or more UC alphas
+ _? # Might have an underscore next
+ [A-Z0-9]+) # Ends with UC alphas or digits
+ (?:\||\b|\.)/x # Followed by pipe, word boundary, or period
+ )
+ {
+ $taxid = $1; # A bit misleading - failure to pass /^\d+$/ means this is
+ # OK, but requires accession -> taxid mapping
+ }
+ return $taxid;
+}
+
1;
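For readers following the new check_seqid logic, here is a rough Python rendering of the same three-way match (an illustrative sketch, not part of the patch; the regexes mirror the Perl above):

```python
import re

def check_seqid(seqid):
    """Return a taxid string, an accession number, or None,
    following the same precedence as krakenlib::check_seqid."""
    # Explicit "kraken:taxid|NNN" token, at start of ID or after a pipe
    m = re.search(r'(?:^|\|)kraken:taxid\|(\d+)', seqid)
    if m:
        return m.group(1)
    # A bare numeric sequence ID is treated as a taxid
    if re.fullmatch(r'\d+', seqid):
        return seqid
    # Otherwise look for something shaped like an NCBI accession number:
    # uppercase letters, optional underscore, uppercase alphanumerics,
    # followed by a pipe, word boundary, or version dot
    m = re.search(r'(?:^|\|)([A-Z]+_?[A-Z0-9]+)(?:\||\b|\.)', seqid)
    if m:
        return m.group(1)  # still needs an accession -> taxid lookup later
    return None
```

As in the Perl, a non-numeric return value means the caller must map the accession to a taxid via the accession2taxid files.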
diff --git a/scripts/lookup_accession_numbers.pl b/scripts/lookup_accession_numbers.pl
new file mode 100755
index 0000000..376ebf2
--- /dev/null
+++ b/scripts/lookup_accession_numbers.pl
@@ -0,0 +1,65 @@
+#!/usr/bin/env perl
+
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
+#
+# This file is part of the Kraken taxonomic sequence classification system.
+
+# Looks up accession numbers and reports associated taxonomy IDs
+#
+# Input is (a) a 2-column TSV file w/ sequence IDs and accession numbers,
+# and (b) a list of accession2taxid files from NCBI.
+# Output is tab-delimited lines, with sequence IDs in first
+# column and taxonomy IDs in second.
+
+use strict;
+use warnings;
+use File::Basename;
+use Getopt::Std;
+
+my $PROG = basename $0;
+
+my $lookup_list_file = shift @ARGV;
+
+my %target_lists;
+open FILE, "<", $lookup_list_file
+ or die "$PROG: can't open $lookup_list_file: $!\n";
+while (<FILE>) {
+ chomp;
+ my ($seqid, $accnum) = split /\t/;
+ $target_lists{$accnum} ||= [];
+ push @{ $target_lists{$accnum} }, $seqid;
+}
+close FILE;
+
+my $initial_target_count = scalar keys %target_lists;
+
+my @accession_map_files = @ARGV;
+for my $file (@accession_map_files) {
+ open FILE, "<", $file
+ or die "$PROG: can't open $file: $!\n";
+ scalar(<FILE>); # discard header line
+ while (<FILE>) {
+ chomp;
+ my ($accession, $with_version, $taxid, $gi) = split /\t/;
+ if ($target_lists{$accession}) {
+ my @list = @{ $target_lists{$accession} };
+ delete $target_lists{$accession};
+ for my $seqid (@list) {
+ print "$seqid\t$taxid\n";
+ }
+ last if ! %target_lists;
+ }
+ }
+ close FILE;
+ last if ! %target_lists;
+}
+
+if (%target_lists) {
+ warn "$PROG: @{[scalar keys %target_lists]}/$initial_target_count accession numbers remain unmapped, see unmapped.txt in DB directory\n";
+ open LIST, ">", "unmapped.txt"
+ or die "$PROG: can't write unmapped.txt: $!\n";
+ for my $target (keys %target_lists) {
+ print LIST "$target\n";
+ }
+ close LIST;
+}
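The script above streams each accession2taxid file once and stops as soon as every requested accession is resolved. The same idea in a short Python sketch (function name is mine; the column order follows NCBI's accession2taxid layout: accession, accession.version, taxid, gi):

```python
def lookup_taxids(targets, map_files):
    """targets: dict mapping accession -> list of sequence IDs.
    Yields (seqid, taxid) pairs; accessions left in `targets`
    afterwards are the unmapped ones."""
    for path in map_files:
        with open(path) as fh:
            next(fh)  # discard the header line
            for line in fh:
                accession, _version, taxid, _gi = line.rstrip("\n").split("\t")
                seqids = targets.pop(accession, None)
                if seqids:
                    for seqid in seqids:
                        yield seqid, taxid
                if not targets:  # everything resolved; stop early
                    return
```

Deleting resolved accessions as they are found is what makes the early exit cheap: the loop only has to test the dict for emptiness.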
diff --git a/scripts/read_merger.pl b/scripts/read_merger.pl
index 2d32477..166f21f 100755
--- a/scripts/read_merger.pl
+++ b/scripts/read_merger.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/report_gi_numbers.pl b/scripts/report_gi_numbers.pl
index ce6a0bc..102a5d9 100755
--- a/scripts/report_gi_numbers.pl
+++ b/scripts/report_gi_numbers.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/scripts/rsync_from_ncbi.pl b/scripts/rsync_from_ncbi.pl
new file mode 100755
index 0000000..7392991
--- /dev/null
+++ b/scripts/rsync_from_ncbi.pl
@@ -0,0 +1,134 @@
+#!/usr/bin/env perl
+
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
+#
+# This file is part of the Kraken taxonomic sequence classification system.
+
+# Reads an assembly_summary.txt file, which indicates taxids and FTP paths for
+# genome/protein data. Performs the download of the complete genomes from
+# that file, decompresses, and explicitly assigns taxonomy as needed.
+
+use strict;
+use warnings;
+use File::Basename;
+use Getopt::Std;
+use List::Util qw/max/;
+
+my $PROG = basename $0;
+
+my $suffix = "_genomic.fna.gz";
+
+# Manifest hash maps filenames (keys) to taxids (values)
+my %manifest;
+while (<>) {
+ next if /^#/;
+ chomp;
+ my @fields = split /\t/;
+ my ($taxid, $asm_level, $ftp_path) = @fields[5, 11, 19];
+ # Possible TODO - make the list here configurable by user-supplied flags
+ next unless grep {$asm_level eq $_} ("Complete Genome", "Chromosome");
+
+ my $full_path = $ftp_path . "/" . basename($ftp_path) . $suffix;
+ # strip off server/leading dir name to allow --files-from= to work w/ rsync
+ # also allows filenames to just start with "all/", which is nice
+ if (! ($full_path =~ s#^ftp://ftp\.ncbi\.nlm\.nih\.gov/genomes/##)) {
+ die "$PROG: unexpected FTP path (new server?) for $ftp_path\n";
+ }
+ $manifest{$full_path} = $taxid;
+}
+
+open MANIFEST, ">", "manifest.txt"
+ or die "$PROG: can't write manifest: $!\n";
+print MANIFEST "$_\n" for keys %manifest;
+close MANIFEST;
+
+print STDERR "Step 1/3: performing rsync dry run...\n";
+# Some files listed in assembly_summary.txt aren't present on the server,
+# so we use this two-rsync-run workaround.
+# First, do a dry run to find non-existent files, then delete them from the
+# manifest; after this, execution can proceed as usual.
+system("rsync --dry-run --no-motd --files-from=manifest.txt rsync://ftp.ncbi.nlm.nih.gov/genomes/ . 2> rsync.err");
+open ERR_FILE, "<", "rsync.err"
+ or die "$PROG: can't read rsync.err file: $!\n";
+while (<ERR_FILE>) {
+ chomp;
+ # I really doubt this will work across every version of rsync. :(
+ if (/failed: No such file or directory/ && /^rsync: link_stat "\/([^"]+)"/) {
+ warn "$PROG: \"$1\" (taxid: $manifest{$1}) was not found on rsync server, will remove from manifest\n";
+ delete $manifest{$1};
+ }
+}
+close ERR_FILE;
+print STDERR "Rsync dry run complete, removing any non-existent files from manifest.\n";
+
+# Rewrite manifest
+open MANIFEST, ">", "manifest.txt"
+ or die "$PROG: can't re-write manifest: $!\n";
+print MANIFEST "$_\n" for keys %manifest;
+close MANIFEST;
+
+print STDERR "Step 2/3: Performing rsync file transfer of requested files\n";
+if (system("rsync --no-motd --files-from=manifest.txt rsync://ftp.ncbi.nlm.nih.gov/genomes/ .") != 0) {
+ die "$PROG: rsync error, exited with code @{[$? >> 8]}\n";
+}
+print STDERR "Rsync file transfer complete.\n";
+print STDERR "Step 3/3: Assigning taxonomic IDs to sequences\n";
+my $output_file = "library.fna";
+open OUT, ">", $output_file
+ or die "$PROG: can't write $output_file: $!\n";
+my $projects_added = 0;
+my $sequences_added = 0;
+my $ch_added = 0;
+my $ch = "bp";
+my $max_out_chars = 0;
+for my $in_filename (keys %manifest) {
+ my $taxid = $manifest{$in_filename};
+ open IN, "zcat $in_filename |" or die "$PROG: can't read $in_filename: $!\n";
+ while (<IN>) {
+ if (/^>/) {
+ s/^>/>kraken:taxid|$taxid|/;
+ $sequences_added++;
+ }
+ else {
+ $ch_added += length($_) - 1;
+ }
+ print OUT;
+ }
+ close IN;
+ unlink $in_filename;
+ $projects_added++;
+ my $out_line = progress_line($projects_added, scalar keys %manifest, $sequences_added, $ch_added) . "...";
+ $max_out_chars = max(length($out_line), $max_out_chars);
+ my $space_line = " " x $max_out_chars;
+ print STDERR "\r$space_line\r$out_line" if -t STDERR;
+ if (! -t STDERR && $projects_added == keys %manifest) {
+ print STDERR "$out_line";
+ }
+}
+close OUT;
+print STDERR " done.\n";
+
+print STDERR "All files processed, cleaning up extra sequence files...";
+system("rm -rf all/") == 0
+ or die "$PROG: can't clean up all/ directory: $?\n";
+print STDERR " done, library complete.\n";
+
+sub progress_line {
+ my ($projs, $total_projs, $seqs, $chs) = @_;
+ my $line = "Processed ";
+ $line .= ($projs == $total_projs) ? "$projs" : "$projs/$total_projs";
+ $line .= " project" . ($total_projs > 1 ? 's' : '') . " ";
+ $line .= "($seqs sequence" . ($seqs > 1 ? 's' : '') . ", ";
+ my $prefix;
+ my @prefixes = qw/k M G T P E/;
+ while (@prefixes && $chs >= 1000) {
+ $prefix = shift @prefixes;
+ $chs /= 1000;
+ }
+ if (defined $prefix) {
+ $line .= sprintf '%.2f %s%s)', $chs, $prefix, $ch;
+ }
+ else {
+ $line .= "$chs $ch)";
+ }
+ return $line;
+}
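The progress_line helper scales the running base-pair count through metric prefixes so the status line stays compact. A minimal Python sketch of that formatting step (function name is mine):

```python
def format_bp(count, unit="bp"):
    """Scale a base-pair count through k/M/G/T/P/E prefixes,
    as progress_line does, dividing by 1000 until it fits."""
    prefixes = ["k", "M", "G", "T", "P", "E"]
    prefix = None
    while prefixes and count >= 1000:
        prefix = prefixes.pop(0)
        count /= 1000
    if prefix is not None:
        return "%.2f %s%s" % (count, prefix, unit)
    return "%s %s" % (count, unit)
```

Counts under 1000 are printed unscaled, matching the Perl's fallback branch.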
diff --git a/scripts/scan_fasta_file.pl b/scripts/scan_fasta_file.pl
new file mode 100755
index 0000000..4f69ee4
--- /dev/null
+++ b/scripts/scan_fasta_file.pl
@@ -0,0 +1,50 @@
+#!/usr/bin/env perl
+
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
+#
+# This file is part of the Kraken taxonomic sequence classification system.
+
+# Reads multi-FASTA input and examines each sequence header. Headers are
+# OK if a taxonomy ID is found (as either the entire sequence ID or as part
+# of a "kraken:taxid" token), or if something looking like an accession
+# number is found. Headers that are not OK cause fatal errors unless
+# "--lenient" is used.
+#
+# Each sequence header results in a line with three tab-separated values:
+# the first indicates whether the third column is a taxonomy ID ("TAXID") or
+# an accession number ("ACCNUM") for the sequence ID listed in the second
+# column.
+
+use strict;
+use warnings;
+use File::Basename;
+use Getopt::Long;
+require "$ENV{KRAKEN_DIR}/krakenlib.pm";
+
+my $PROG = basename $0;
+
+my $lenient = 0;
+GetOptions("lenient" => \$lenient)
+ or die "Usage: $PROG [--lenient] <fasta filename(s)>\n";
+
+while (<>) {
+ next unless /^>/;
+ # while (/.../g) needed because non-redundant DBs sometimes have multiple
+ # sequence IDs in the header; extra sequence IDs are prefixed by
+ # '\x01' characters (if downloaded in FASTA format from NCBI FTP directly).
+ while (/(?:^>|\x01)(\S+)/g) {
+ my $seqid = $1;
+ my $taxid = krakenlib::check_seqid($seqid);
+ if (! defined $taxid) {
+ next if $lenient;
+ die "$PROG: unable to determine taxonomy ID for sequence $seqid\n";
+ }
+ # "$taxid" may actually be an accession number
+ if ($taxid =~ /^\d+$/) {
+ print "TAXID\t$seqid\t$taxid\n";
+ }
+ else {
+ print "ACCNUM\t$seqid\t$taxid\n";
+ }
+ }
+}
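Non-redundant NCBI FASTA can pack several sequence IDs into one header, the extras prefixed with "\x01" bytes, which is why the scan loops over the header rather than taking just the first ID. A Python sketch of the same scan (my function names; I exclude "\x01" from the ID characters so each packed ID stops at the separator, and take the taxid checker as a callable to keep the sketch self-contained):

```python
import re

def scan_header(header_line, check_seqid, lenient=False):
    """Return (kind, seqid, value) triples for every ID in a FASTA header,
    where kind is "TAXID" if value is numeric, else "ACCNUM"."""
    results = []
    # IDs start at ">" or immediately after a "\x01" separator byte
    for m in re.finditer(r'(?:^>|\x01)([^\s\x01]+)', header_line):
        seqid = m.group(1)
        value = check_seqid(seqid)
        if value is None:
            if lenient:
                continue  # --lenient: silently skip unresolvable IDs
            raise ValueError("unable to determine taxonomy ID for %s" % seqid)
        kind = "TAXID" if value.isdigit() else "ACCNUM"
        results.append((kind, seqid, value))
    return results
```

As in the Perl, a numeric result is usable directly, while an accession-shaped result is deferred to the accession2taxid lookup stage.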
diff --git a/scripts/standard_installation.sh b/scripts/standard_installation.sh
index b542a4f..731cd7f 100755
--- a/scripts/standard_installation.sh
+++ b/scripts/standard_installation.sh
@@ -1,6 +1,6 @@
#!/bin/bash
-# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
+# Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
#
# This file is part of the Kraken taxonomic sequence classification system.
#
@@ -32,8 +32,9 @@ fi
check_for_jellyfish.sh
kraken-build --db $KRAKEN_DB_NAME --download-taxonomy
+kraken-build --db $KRAKEN_DB_NAME --download-library archaea
kraken-build --db $KRAKEN_DB_NAME --download-library bacteria
-kraken-build --db $KRAKEN_DB_NAME --download-library viruses
+kraken-build --db $KRAKEN_DB_NAME --download-library viral
kraken-build --db $KRAKEN_DB_NAME --build --threads $KRAKEN_THREAD_CT \
--jellyfish-hash-size "$KRAKEN_HASH_SIZE" \
--max-db-size "$KRAKEN_MAX_DB_SIZE" \
diff --git a/scripts/verify_gi_numbers.pl b/scripts/verify_gi_numbers.pl
index ec616f5..0bb5cdf 100755
--- a/scripts/verify_gi_numbers.pl
+++ b/scripts/verify_gi_numbers.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# Copyright 2013-2015, Derrick Wood <dwood at cs.jhu.edu>
#
diff --git a/src/Makefile b/src/Makefile
index 2f927f6..c3e1ef5 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -1,6 +1,6 @@
CXX = g++
CXXFLAGS = -Wall -fopenmp -O3
-PROGS = db_sort set_lcas classify make_seqid_to_taxid_map db_shrink
+PROGS = db_sort set_lcas classify make_seqid_to_taxid_map db_shrink kmer_estimator
.PHONY: all install clean
@@ -18,6 +18,8 @@ db_sort: krakendb.o quickfile.o
set_lcas: krakendb.o quickfile.o krakenutil.o seqreader.o
+kmer_estimator: krakenutil.o seqreader.o
+
classify: krakendb.o quickfile.o krakenutil.o seqreader.o
make_seqid_to_taxid_map: quickfile.o
diff --git a/src/classify.cpp b/src/classify.cpp
index b3efb42..957c8be 100644
--- a/src/classify.cpp
+++ b/src/classify.cpp
@@ -133,7 +133,8 @@ void report_stats(struct timeval time1, struct timeval time2) {
seconds /= 1e6;
seconds += time2.tv_sec;
- cerr << "\r";
+ if (isatty(fileno(stderr)))
+ cerr << "\r";
fprintf(stderr,
"%llu sequences (%.2f Mbp) processed in %.3fs (%.1f Kseq/m, %.2f Mbp/m).\n",
(unsigned long long) total_sequences, total_bases / 1.0e6, seconds,
@@ -194,12 +195,19 @@ void process_file(char *filename) {
(*Unclassified_output) << unclassified_output_ss.str();
total_sequences += work_unit.size();
total_bases += total_nt;
- cerr << "\rProcessed " << total_sequences << " sequences (" << total_bases << " bp) ...";
+ if (isatty(fileno(stderr)))
+ cerr << "\rProcessed " << total_sequences << " sequences (" << total_bases << " bp) ...";
}
}
} // end parallel section
delete reader;
+ if (Print_kraken)
+ (*Kraken_output) << std::flush;
+ if (Print_classified)
+ (*Classified_output) << std::flush;
+ if (Print_unclassified)
+ (*Unclassified_output) << std::flush;
}
void classify_sequence(DNASequence &dna, ostringstream &koss,
@@ -342,7 +350,7 @@ set<uint32_t> get_ancestry(uint32_t taxon) {
void parse_command_line(int argc, char **argv) {
int opt;
- int sig;
+ long long sig;
if (argc > 1 && strcmp(argv[1], "-h") == 0)
usage(0);
@@ -355,10 +363,12 @@ void parse_command_line(int argc, char **argv) {
Index_filename = optarg;
break;
case 't' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig <= 0)
errx(EX_USAGE, "can't use nonpositive thread count");
#ifdef _OPENMP
+ if (sig > omp_get_num_procs())
+ errx(EX_USAGE, "thread count exceeds number of processors");
Num_threads = sig;
omp_set_num_threads(Num_threads);
#endif
@@ -370,7 +380,7 @@ void parse_command_line(int argc, char **argv) {
Quick_mode = true;
break;
case 'm' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig <= 0)
errx(EX_USAGE, "can't use nonpositive minimum hit count");
Minimum_hit_count = sig;
@@ -393,7 +403,7 @@ void parse_command_line(int argc, char **argv) {
Kraken_output_file = optarg;
break;
case 'u' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig <= 0)
errx(EX_USAGE, "can't use nonpositive work unit size");
Work_unit_size = sig;
diff --git a/src/db_shrink.cpp b/src/db_shrink.cpp
index 79feeb8..164073f 100644
--- a/src/db_shrink.cpp
+++ b/src/db_shrink.cpp
@@ -121,16 +121,16 @@ int main(int argc, char **argv) {
void parse_command_line(int argc, char **argv) {
int opt;
- int sig;
+ long long sig;
if (argc > 1 && strcmp(argv[1], "-h") == 0)
usage(0);
while ((opt = getopt(argc, argv, "d:o:n:O:")) != -1) {
switch (opt) {
case 'n' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig < 1)
- errx(EX_USAGE, "output count cannot be negative");
+ errx(EX_USAGE, "output count must be positive");
Output_count = sig;
break;
case 'd' :
@@ -140,7 +140,7 @@ void parse_command_line(int argc, char **argv) {
Output_DB_filename = optarg;
break;
case 'O' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig < 1)
errx(EX_USAGE, "offset count must be positive");
Offset = sig;
diff --git a/src/db_sort.cpp b/src/db_sort.cpp
index f4b8aca..1bafef3 100644
--- a/src/db_sort.cpp
+++ b/src/db_sort.cpp
@@ -124,14 +124,14 @@ static int pair_cmp(const void *a, const void *b) {
void parse_command_line(int argc, char **argv) {
int opt;
- int sig;
+ long long sig;
if (argc > 1 && strcmp(argv[1], "-h") == 0)
usage(0);
while ((opt = getopt(argc, argv, "n:d:o:i:t:zM")) != -1) {
switch (opt) {
case 'n' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig < 1 || sig > 31)
errx(EX_USAGE, "bin key length out of range");
Bin_key_nt = (uint8_t) sig;
@@ -149,10 +149,12 @@ void parse_command_line(int argc, char **argv) {
Operate_in_RAM = true;
break;
case 't' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig <= 0)
errx(EX_USAGE, "can't use nonpositive thread count");
#ifdef _OPENMP
+ if (sig > omp_get_num_procs())
+ errx(EX_USAGE, "thread count exceeds number of processors");
Num_threads = sig;
omp_set_num_threads(Num_threads);
#endif
diff --git a/src/kmer_estimator.cpp b/src/kmer_estimator.cpp
new file mode 100644
index 0000000..9562893
--- /dev/null
+++ b/src/kmer_estimator.cpp
@@ -0,0 +1,141 @@
+/*
+ * Copyright 2013-2017, Derrick Wood <dwood at cs.jhu.edu>
+ *
+ * This file is part of the Kraken taxonomic sequence classification system.
+ *
+ * Kraken is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Kraken is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Kraken. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "kraken_headers.hpp"
+#include "krakenutil.hpp"
+#include "seqreader.hpp"
+
+#define SKIP_LEN 50000
+
+using namespace std;
+using namespace kraken;
+
+void parse_command_line(int argc, char **argv);
+void usage(int exit_code=EX_USAGE);
+uint64_t obtain_estimated_kmer_ct();
+
+int Num_threads = 1;
+int k = 0;
+double multiplier = 1.0;
+
+// MurmurHash3 finalizer
+uint64_t hash_code(uint64_t h) {
+ h ^= h >> 33;
+ h *= 0xff51afd7ed558ccd;
+ h ^= h >> 33;
+ h *= 0xc4ceb9fe1a85ec53;
+ h ^= h >> 33;
+ return h;
+}
+
+int main(int argc, char **argv) {
+ #ifdef _OPENMP
+ omp_set_num_threads(1);
+ #endif
+
+ parse_command_line(argc, argv);
+
+ KmerScanner::set_k(k);
+ uint64_t estimate = multiplier * obtain_estimated_kmer_ct();
+ cout << estimate << endl;
+
+ return 0;
+}
+
+uint64_t obtain_estimated_kmer_ct() {
+ FastaReader reader("/dev/fd/0");
+ DNASequence dna;
+ set<uint64_t> qualified_kmers;
+ uint64_t qualification_limit = 4;
+ uint64_t mod_value = 4096;
+
+ while (true) {
+ dna = reader.next_sequence();
+ if (! reader.is_valid())
+ break;
+ #pragma omp parallel for schedule(dynamic)
+ for (size_t i = 0; i < dna.seq.size(); i += SKIP_LEN) {
+ KmerScanner scanner(dna.seq, i, i + SKIP_LEN + k - 1);
+ uint64_t *kmer_ptr;
+
+ while ((kmer_ptr = scanner.next_kmer()) != NULL) {
+ if (scanner.ambig_kmer())
+ continue;
+ uint64_t hc = hash_code(*kmer_ptr);
+ if (hc % mod_value < qualification_limit) {
+ #pragma omp critical(set_access)
+ qualified_kmers.insert(*kmer_ptr);
+ }
+ }
+ }
+ }
+ double estimate = (qualified_kmers.size() + 2) * ((double) mod_value / qualification_limit);
+ return (uint64_t) estimate;
+}
+
+void parse_command_line(int argc, char **argv) {
+ int opt;
+ long long sig;
+
+ if (argc > 1 && strcmp(argv[1], "-h") == 0)
+ usage(0);
+ while ((opt = getopt(argc, argv, "t:k:m:")) != -1) {
+ switch (opt) {
+ case 't' :
+ sig = atoll(optarg);
+ if (sig <= 0)
+ errx(EX_USAGE, "can't use nonpositive thread count");
+ #ifdef _OPENMP
+ if (sig > omp_get_num_procs())
+ errx(EX_USAGE, "thread count exceeds number of processors");
+ Num_threads = sig;
+ omp_set_num_threads(Num_threads);
+ #endif
+ break;
+ case 'k' :
+ sig = atoll(optarg);
+ if (sig <= 0)
+ errx(EX_USAGE, "k can't be <= 0");
+ if (sig > 31)
+ errx(EX_USAGE, "k can't be > 31");
+ k = sig;
+ break;
+ case 'm' :
+ multiplier = atof(optarg);
+ break;
+ default:
+ usage();
+ break;
+ }
+ }
+
+ if (k == 0)
+ usage(EX_USAGE);
+}
+
+void usage(int exit_code) {
+ cerr << "Usage: estimator [options]" << endl
+ << endl
+ << "Options: (*mandatory)" << endl
+ << "* -k # Length of k-mers" << endl
+ << " -t # Number of threads" << endl
+ << " -m FLOAT Multiplier" << endl
+ << " -h Print this message" << endl;
+ exit(exit_code);
+}
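The estimator never counts every k-mer: it keeps only k-mers whose hash lands in a fixed 4-in-4096 slice of the hash space, exact-counts the distinct members of that slice, and scales back up by 1024. The idea in a Python sketch, using the same MurmurHash3 64-bit finalizer (arithmetic masked to 64 bits by hand, since Python ints are unbounded):

```python
MASK64 = (1 << 64) - 1

def murmur_fmix64(h):
    """MurmurHash3 64-bit finalizer, as used in kmer_estimator.cpp."""
    h ^= h >> 33
    h = (h * 0xff51afd7ed558ccd) & MASK64
    h ^= h >> 33
    h = (h * 0xc4ceb9fe1a85ec53) & MASK64
    h ^= h >> 33
    return h

def estimate_distinct(kmers, mod_value=4096, qualification_limit=4):
    """Estimate the distinct count by exact-counting a hashed sample.
    A k-mer "qualifies" when its hash falls in a fixed slice of the
    hash space; duplicates collapse in the set, so scaling the slice's
    distinct count approximates the total distinct count."""
    qualified = {k for k in kmers
                 if murmur_fmix64(k) % mod_value < qualification_limit}
    return int((len(qualified) + 2) * (mod_value / qualification_limit))
```

Because the hash is deterministic, a repeated k-mer either always qualifies or never does, which is what lets the sampled set stand in for the full distinct set.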
diff --git a/src/set_lcas.cpp b/src/set_lcas.cpp
index 24bf0f3..027a386 100644
--- a/src/set_lcas.cpp
+++ b/src/set_lcas.cpp
@@ -120,10 +120,15 @@ void process_single_file() {
for (size_t i = 0; i < dna.seq.size(); i += SKIP_LEN)
set_lcas(taxid, dna.seq, i, i + SKIP_LEN + Database.get_k() - 1);
}
- cerr << "\rProcessed " << ++seqs_processed << " sequences";
+ if (isatty(fileno(stderr)))
+ cerr << "\rProcessed " << ++seqs_processed << " sequences";
+ else if (++seqs_processed % 500 == 0)
+ cerr << "Processed " << seqs_processed << " sequences.\n";
}
- cerr << "\r ";
- cerr << "\rFinished processing " << seqs_processed << " sequences" << endl;
+ if (isatty(fileno(stderr))) {
+ cerr << "\r \r";
+ }
+ cerr << "Finished processing " << seqs_processed << " sequences" << endl;
}
void process_files() {
@@ -144,10 +149,15 @@ void process_files() {
iss >> filename;
iss >> taxid;
process_file(filename, taxid);
- cerr << "\rProcessed " << ++seqs_processed << " sequences";
+ if (isatty(fileno(stderr)))
+ cerr << "\rProcessed " << ++seqs_processed << " sequences";
+ else if (++seqs_processed % 500 == 0)
+ cerr << "Processed " << seqs_processed << " sequences.\n";
+ }
+ if (isatty(fileno(stderr))) {
+ cerr << "\r \r";
}
- cerr << "\r ";
- cerr << "\rFinished processing " << seqs_processed << " sequences" << endl;
+ cerr << "Finished processing " << seqs_processed << " sequences" << endl;
}
void process_file(string filename, uint32_t taxid) {
@@ -186,7 +196,7 @@ void set_lcas(uint32_t taxid, string &seq, size_t start, size_t finish) {
void parse_command_line(int argc, char **argv) {
int opt;
- int sig;
+ long long sig;
if (argc > 1 && strcmp(argv[1], "-h") == 0)
usage(0);
@@ -208,10 +218,12 @@ void parse_command_line(int argc, char **argv) {
ID_to_taxon_map_filename = optarg;
break;
case 't' :
- sig = atoi(optarg);
+ sig = atoll(optarg);
if (sig <= 0)
errx(EX_USAGE, "can't use nonpositive thread count");
#ifdef _OPENMP
+ if (sig > omp_get_num_procs())
+ errx(EX_USAGE, "thread count exceeds number of processors");
Num_threads = sig;
omp_set_num_threads(Num_threads);
#endif
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/kraken.git