[med-svn] [centrifuge] 01/03: New upstream version 1.0.3~beta

Andreas Tille tille at debian.org
Sat Nov 25 21:00:20 UTC 2017


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository centrifuge.

commit 64b022e19e136c0975c53a4f285a25778f8f02ec
Author: Andreas Tille <tille at debian.org>
Date:   Sat Nov 25 21:58:02 2017 +0100

    New upstream version 1.0.3~beta
---
 MANUAL                                       |  14 +-
 MANUAL.markdown                              |  20 ++-
 Makefile                                     |  35 ++++-
 README.md                                    |   9 +-
 VERSION                                      |   2 +-
 aln_sink.h                                   | 213 ++++++++++++++++----------
 bt2_idx.h                                    |   2 +-
 centrifuge-BuildSharedSequence.pl            |   2 +-
 centrifuge-RemoveEmptySequence.pl            |   2 +-
 centrifuge-RemoveN.pl                        |   2 +-
 centrifuge-compress.pl                       |  22 ++-
 centrifuge-download                          |  88 +++++++----
 centrifuge-kreport                           |  36 +++--
 centrifuge-sort-nt.pl                        | 119 ++++++++++-----
 centrifuge.cpp                               | 136 ++++++++++++++---
 centrifuge.xcodeproj/project.pbxproj         |   8 +-
 classifier.h                                 |  57 ++++---
 doc/index.shtml                              |  13 ++
 doc/manual.inc.html                          |  15 +-
 doc/sidebar.inc.shtml                        |  46 +++---
 doc/strip_markdown.pl                        |   2 +-
 evaluation/centrifuge_evaluate.py            |  18 +--
 evaluation/test/abundance.Rmd                |  53 +++++++
 evaluation/test/centrifuge_evaluate_mason.py | 215 ++++++++++++++++++++++++---
 fast_mutex.h                                 |   0
 indices/Makefile                             |  76 ++++++----
 opts.h                                       |   2 +
 ref_read.cpp                                 |  17 ++-
 taxonomy.h                                   |  31 ++--
 tinythread.cpp                               |   0
 tinythread.h                                 |   0
 31 files changed, 953 insertions(+), 302 deletions(-)

diff --git a/MANUAL b/MANUAL
index c139009..183e6c8 100644
--- a/MANUAL
+++ b/MANUAL
@@ -214,7 +214,7 @@ The following example shows classification assignments for a read.  The assignme
     The fifth column is the score for the next best classification (e.g., 0).
     The sixth column is an approximate number of base pairs of the read that match the genomic sequence, i.e., the hit length (e.g., 80).
     The seventh column is the length of the read or, for a pair, the combined length of both mates, i.e., the query length (e.g., 80). 
-    The eighth column is the number of classifications, indicating how many assignments were made (e.g.,1).
+    The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).
 
 ### Centrifuge summary output (the default filename is centrifuge_report.tsv)
 
@@ -231,6 +231,14 @@ The following example shows a classification summary for each genome or taxonomi
     The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).
     The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).
 
+As the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments.  To perform more conservative analyses, users may want to discard assignments for reads having a matching length (6th column in the output of Centrifuge) of 40% or lower.  It may also be helpful to use the score (4th column) for filtering out some assignments.   Our fut [...]
+
+### Kraken-style report
+
+`centrifuge-kreport` can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:
+
+`centrifuge-kreport -x <centrifuge index> <centrifuge out file>`
+
 Inspecting the Centrifuge index
 -----------------------
 
@@ -413,6 +421,10 @@ integers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....
 
 #### Classification
 
+    --min-hitlen <int>
+
+Minimum length of partial hits, which must be greater than 15 (default: 22)
+
     -k <int>
 
 It searches for at most `<int>` distinct, primary assignments for each read or pair.  
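As an aside, the conservative filtering advice in the manual hunk above (drop assignments whose matching fraction or score is too low) can be sketched in Python. The column order follows the eight-column tabular output documented in this MANUAL; `keep_assignment` and the example thresholds are illustrative, not part of Centrifuge:

```python
# Sketch: conservative filtering of Centrifuge tabular output, per the
# manual's advice.  Column layout (0-based): readID, seqID, taxID, score,
# secondBestScore, hitLength, queryLength, numMatches.
def keep_assignment(line, min_score=0, min_match_frac=0.4):
    fields = line.rstrip("\n").split("\t")
    score = int(fields[3])
    hit_len = int(fields[5])
    query_len = int(fields[6])
    if query_len == 0:
        return False
    # keep only assignments whose matching fraction exceeds the threshold
    return score >= min_score and hit_len / query_len > min_match_frac

row = "read1\tNC_000913.3\t511145\t4225\t0\t80\t80\t1"
print(keep_assignment(row))  # → True (a full-length match passes)
```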
diff --git a/MANUAL.markdown b/MANUAL.markdown
index 714e4fc..f59dae0 100644
--- a/MANUAL.markdown
+++ b/MANUAL.markdown
@@ -228,7 +228,7 @@ The following example shows classification assignments for a read.  The assignme
     The fifth column is the score for the next best classification (e.g., 0).
     The sixth column is an approximate number of base pairs of the read that match the genomic sequence, i.e., the hit length (e.g., 80).
     The seventh column is the length of the read or, for a pair, the combined length of both mates, i.e., the query length (e.g., 80). 
-    The eighth column is the number of classifications, indicating how many assignments were made (e.g.,1).
+    The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).
 
 ### Centrifuge summary output (the default filename is centrifuge_report.tsv)
 
@@ -245,7 +245,13 @@ The following example shows a classification summary for each genome or taxonomi
     The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).
     The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).
 
+As the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments.  To perform more conservative analyses, users may want to discard assignments for reads having a matching length (6th column in the output of Centrifuge) of 40% or lower.  It may also be helpful to use the score (4th column) for filtering out some assignments.   Our fut [...]
 
+### Kraken-style report
+
+`centrifuge-kreport` can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:
+
+`centrifuge-kreport -x <centrifuge index> <centrifuge out file>`
 
 
 Inspecting the Centrifuge index
@@ -575,6 +581,18 @@ integers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....
 
 <table>
 
+<tr><td id="centrifuge-options-min-hitlen">
+
+[`--min-hitlen`]: #centrifuge-options-min-hitlen
+
+    --min-hitlen <int>
+
+</td><td>
+
+Minimum length of partial hits, which must be greater than 15 (default: 22)
+
+</td></tr>
+
 <tr><td id="centrifuge-options-k">
 
 [`-k`]: #centrifuge-options-k
diff --git a/Makefile b/Makefile
index d6de0bf..8d66b3c 100644
--- a/Makefile
+++ b/Makefile
@@ -137,7 +137,8 @@ CENTRIFUGE_REPORT_CPPS_MAIN=$(BUILD_CPPS)
 
 SEARCH_FRAGMENTS = $(wildcard search_*_phase*.c)
 VERSION = $(shell cat VERSION)
-GIT_VERSION = $(shell command -v git 2>&1 > /dev/null && git describe --long --tags --dirty --always --abbrev=10 || cat VERSION)
+GIT_VERSION = $(VERSION)
+#GIT_VERSION = $(shell command -v git 2>&1 > /dev/null && git describe --long --tags --dirty --always --abbrev=10 || cat VERSION)
 
 # Convert BITS=?? to a -m flag
 BITS=32
@@ -186,6 +187,13 @@ CENTRIFUGE_BIN_LIST_AUX = centrifuge-build-bin-debug \
 	centrifuge-class-debug \
 	centrifuge-inspect-bin-debug
 
+CENTRIFUGE_SCRIPT_LIST = 	centrifuge \
+	centrifuge-build \
+	centrifuge-inspect \
+	centrifuge-download \
+	$(wildcard centrifuge-*.pl)
+
+
 GENERAL_LIST = $(wildcard scripts/*.sh) \
 	$(wildcard scripts/*.pl) \
 	$(wildcard *.py) \
@@ -198,9 +206,7 @@ GENERAL_LIST = $(wildcard scripts/*.sh) \
 	$(wildcard example/reference/*) \
 	indices/Makefile \
 	$(PTHREAD_PKG) \
-	centrifuge \
-	centrifuge-build \
-	centrifuge-inspect \
+	$(CENTRIFUGE_SCRIPT_LIST) \
 	AUTHORS \
 	LICENSE \
 	NEWS \
@@ -402,6 +408,27 @@ doc/manual.inc.html: MANUAL.markdown
 MANUAL: MANUAL.markdown
 	perl doc/strip_markdown.pl < $^ > $@
 
+prefix=/usr/local
+
+.PHONY: install
+install: all
+	mkdir -p $(prefix)/bin
+	mkdir -p $(prefix)/share/centrifuge/indices
+	install -m 0644 indices/Makefile $(prefix)/share/centrifuge/indices
+	install -d -m 0755 $(prefix)/share/centrifuge/doc
+	install -m 0644 doc/* $(prefix)/share/centrifuge/doc
+	for file in $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_SCRIPT_LIST); do \
+		install -m 0755 $$file $(prefix)/bin ; \
+	done
+
+.PHONY: uninstall
+uninstall:
+	for file in $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_SCRIPT_LIST); do \
+		rm -v $(prefix)/bin/$$file ; \
+	done
+	rm -rvf $(prefix)/share/centrifuge
+
+
 .PHONY: clean
 clean:
 	rm -f $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX) \
diff --git a/README.md b/README.md
index bb4ce51..a81fb6c 100644
--- a/README.md
+++ b/README.md
@@ -16,6 +16,8 @@ desktop computers
 
 The Centrifuge homepage is http://www.ccb.jhu.edu/software/centrifuge
 
+The Centrifuge preprint is available at http://biorxiv.org/content/early/2016/05/25/054965.abstract
+
 The Centrifuge poster is available at http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf
 
 For more details on installing and running Centrifuge, look at MANUAL
@@ -26,6 +28,7 @@ For more details on installing and running Centrifuge, look at MANUAL
     git clone https://github.com/infphilo/centrifuge
     cd centrifuge
     make
+    sudo make install prefix=/usr/local
 
 ### Building indexes
 
@@ -35,6 +38,6 @@ See the MANUAL files for details. We provide a Makefile that simplifies the buil
 standard and custom indices
 
     cd indices
-    make b+h+v                   # bacterial, human, and viral genomes [~12G]
-    make b_compressed            # bacterial genomes compressed at the species level [~4.2G]
-    make b_compressed+h+v        # combination of the two above [~8G]
+    make p+h+v                   # bacterial, human, and viral genomes [~12G]
+    make p_compressed            # bacterial genomes compressed at the species level [~4.2G]
+    make p_compressed+h+v        # combination of the two above [~8G]
diff --git a/VERSION b/VERSION
index ed69ddf..22068e3 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.0.2-beta
+1.0.3-beta
diff --git a/aln_sink.h b/aln_sink.h
index b76eec7..b46cdb1 100644
--- a/aln_sink.h
+++ b/aln_sink.h
@@ -156,6 +156,7 @@ struct SpeciesMetrics {
 		}
 
         // Only consider good hits for abundance analysis
+        // DK - for debugging purposes
         if(score >= max_score) {
             cur_ids.ids.push_back(taxID);
             if(cur_ids.ids.size() == nresult) {
@@ -204,9 +205,10 @@ struct SpeciesMetrics {
         
         // E step
         p_next.fill(0.0);
+        // for each assigned read set
         for(map<IDs, uint64_t>::const_iterator itr = observed.begin(); itr != observed.end(); itr++) {
-            const EList<uint64_t, 5>& ids = itr->first.ids;
-            uint64_t count = itr->second;
+            const EList<uint64_t, 5>& ids = itr->first.ids; // all ids assigned to the read set
+            uint64_t count = itr->second; // number of reads in the read set
             double psum = 0.0;
             for(size_t i = 0; i < ids.size(); i++) {
                 uint64_t tid = ids[i];
@@ -764,9 +766,11 @@ public:
 	explicit AlnSink(
                      OutputQueue& oq,
                      const StrList& refnames,
+                     const EList<uint32_t>& tab_fmt_cols,
                      bool quiet) :
     oq_(oq),
     refnames_(refnames),
+    tab_fmt_cols_(tab_fmt_cols),
     quiet_(quiet)
 	{
 	}
@@ -966,6 +970,7 @@ protected:
 	OutputQueue&       oq_;           // output queue
 	int                numWrappers_;  // # threads owning a wrapper for this HitSink
 	const StrList&     refnames_;     // reference names
+	const EList<uint32_t>&     tab_fmt_cols_; // Columns that are printed in tabular format
 	bool               quiet_;        // true -> don't print alignment stats at the end
 	ReportingMetrics   met_;          // global repository of reporting metrics
 };
@@ -1367,9 +1372,11 @@ public:
                Ebwt<index_t>*   ebwt,
                OutputQueue&     oq,           // output queue
                const StrList&   refnames,     // reference names
+			   const EList<uint32_t>&   tab_fmt_cols, // columns to output in the tabular format
                bool             quiet) :
     AlnSink<index_t>(oq,
                      refnames,
+					 tab_fmt_cols,
                      quiet),
     ebwt_(ebwt)
     { }
@@ -1982,6 +1989,17 @@ itoa10(x, buf); \
 o.append(buf); \
 }
 
+#define WRITE_STRING(o, x) { \
+o.append(x.c_str()); \
+}
+
+template <typename T> inline 
+void appendNumber(BTString& o, const T x, char* buf) {
+	itoa10<T>(x, buf);
+	o.append(buf);
+}
+
+
 /**
  * Print a seed summary to the first output stream in the outs_ list.
  */
@@ -2180,6 +2198,80 @@ void AlnSink<index_t>::appendSeedSummary(
 	o.append('\n');
 }
 
+
+inline
+void appendReadID(BTString& o, const BTString & rd_name) {
+    size_t namelen = rd_name.length();
+    if(namelen >= 2 &&
+       rd_name[namelen-2] == '/' &&
+       (rd_name[namelen-1] == '1' || rd_name[namelen-1] == '2' || rd_name[namelen-1] == '3'))
+    {
+        namelen -= 2;
+    }
+    for(size_t i = 0; i < namelen; i++) {
+        if(isspace(rd_name[i])) {
+            break;
+        }
+        o.append(rd_name[i]);
+    }
+}
+
+inline
+void appendSeqID(BTString& o, const AlnRes* rs, const std::map<uint64_t, TaxonomyNode>& tree) { 
+    bool leaf = true;
+    std::map<uint64_t, TaxonomyNode>::const_iterator itr = tree.find(rs->taxID());
+    if(itr != tree.end()) {
+        const TaxonomyNode& node = itr->second;
+        leaf = node.leaf;
+    }
+
+    // unique ID
+    if(leaf) {
+        o.append(rs->uid().c_str());
+    } else {
+        o.append(get_tax_rank_string(rs->taxRank()));
+    }
+}
+
+inline 
+void appendTaxID(BTString& o, uint64_t tid) {
+
+	char buf[1024];
+    uint64_t tid1 = tid & 0xffffffff;
+    uint64_t tid2 = tid >> 32;
+    itoa10<int64_t>(tid1, buf);
+    o.append(buf);
+
+    if(tid2 > 0) {
+        o.append(".");
+        itoa10<int64_t>(tid2, buf);
+        o.append(buf);
+    }
+}
+
+
+enum FIELD_DEF {
+	PLACEHOLDER=0,
+	PLACEHOLDER_STAR,
+	PLACEHOLDER_ZERO,
+    READ_ID,
+	SEQ_ID,
+	TAX_ID,
+	TAX_RANK,
+	TAX_NAME,
+	SCORE,
+	SCORE2,
+	HIT_LENGTH,
+	QUERY_LENGTH,
+	NUM_MATCHES,
+	SEQ,
+	SEQ1,
+	SEQ2,
+	QUAL,
+	QUAL1,
+	QUAL2
+};
+
 /**
  * Append a single hit to the given output stream in Bowtie's
  * verbose-mode format.
@@ -2201,24 +2293,51 @@ void AlnSinkSam<index_t>::appendMate(
 	if(rs == NULL) {
 		return;
 	}
+
 	char buf[1024];
+	bool firstfield = true;
+	uint64_t taxid =  rs->taxID();
+	const basic_string<char> empty_string = "";
+
+	for (size_t i = 0; i < this->tab_fmt_cols_.size(); ++i) {
+		BEGIN_FIELD;
+		switch (this->tab_fmt_cols_[i]) {
+			case READ_ID:      appendReadID(o, rd.name); break;
+			case SEQ_ID:       appendSeqID(o, rs, ebwt.tree()); break;
+			case SEQ:          o.append((string(rd.patFw.toZBuf()) + 
+										(rdo == NULL? "" : "N" + string(rdo->patFw.toZBuf()))).c_str()); break;
+			case QUAL:         o.append((string(rd.qual.toZBuf()) + 
+										(rdo == NULL? "" : "I" + string(rdo->qual.toZBuf()))).c_str()); break;
+
+			case SEQ1:         o.append(rd.patFw.toZBuf()); break;
+			case QUAL1:        o.append(rd.qual.toZBuf()); break;
+			case SEQ2:         o.append(rdo == NULL? empty_string.c_str() : rdo->patFw.toZBuf()); break;
+			case QUAL2:        o.append(rdo == NULL? empty_string.c_str() : rdo->qual.toZBuf()); break;
+
+			case TAX_ID:       appendTaxID(o, taxid); break;
+			case TAX_RANK:     o.append(get_tax_rank_string(rs->taxRank())); break;
+			case TAX_NAME:     o.append(find_or_use_default(ebwt.name(), taxid, empty_string).c_str()); break;
+
+			case SCORE:        appendNumber<uint64_t>(o, rs->score(), buf); break;
+			case SCORE2:       appendNumber<uint64_t>(o,
+									   (summ.secbest().valid()? summ.secbest().score() : 0),
+									   buf); break;
+			case HIT_LENGTH:   appendNumber<uint64_t>(o, rs->summedHitLen(), buf); break;
+			case QUERY_LENGTH: appendNumber<uint64_t>(o, 
+									   (rd.patFw.length() + (rdo != NULL ? rdo->patFw.length() : 0)),
+									   buf); break;
+			case NUM_MATCHES:  appendNumber<uint64_t>(o, n_results, buf); break;
+			case PLACEHOLDER:  o.append(""); break;
+			case PLACEHOLDER_STAR:  o.append("*"); break;
+			case PLACEHOLDER_ZERO:  o.append("0"); break;
+			default: ;
+		}
+
+	}
+	o.append("\n");
 
-    // QNAME
-    size_t namelen = rd.name.length();
-    if(namelen >= 2 &&
-       rd.name[namelen-2] == '/' &&
-       (rd.name[namelen-1] == '1' || rd.name[namelen-1] == '2' || rd.name[namelen-1] == '3'))
-    {
-        namelen -= 2;
-    }
-    for(size_t i = 0; i < namelen; i++) {
-        if(isspace(rd.name[i])) {
-            break;
-        }
-        o.append(rd.name[i]);
-    }
-    o.append('\t');
 
+	// species counting
 	sm.addSpeciesCounts(
                         rs->taxID(),
                         rs->score(),
@@ -2238,65 +2357,7 @@ void AlnSinkSam<index_t>::appendMate(
 	}
 
 //    (sc[rs->speciesID_])++;
-    
-    const std::map<uint64_t, TaxonomyNode>& tree = ebwt.tree();
-    bool leaf = true;
-    std::map<uint64_t, TaxonomyNode>::const_iterator itr = tree.find(rs->taxID());
-    if(itr != tree.end()) {
-        const TaxonomyNode& node = itr->second;
-        leaf = node.leaf;
-    }
-
-    // unique ID
-    if(leaf) {
-        o.append(rs->uid().c_str());
-    } else {
-        o.append(get_tax_rank_string(rs->taxRank()));
-    }
-    o.append('\t');
-    
-    // tax ID
-    uint64_t tid = rs->taxID();
-    uint64_t tid1 = tid & 0xffffffff;
-    uint64_t tid2 = tid >> 32;
-    itoa10<int64_t>(tid1, buf);
-    o.append(buf);
-    if(tid2 > 0) {
-        o.append(".");
-        itoa10<int64_t>(tid2, buf);
-        o.append(buf);
-    }
-    o.append('\t');
-    
-    // score
-    itoa10<int64_t>(rs->score(), buf);
-    o.append(buf);
-    o.append('\t');
-    
-    // second best score
-    if(summ.secbest().valid()) {
-        itoa10<int64_t>(summ.secbest().score(), buf);
-    } else {
-        itoa10<int64_t>(0, buf);
-    }
-    o.append(buf);
-    o.append('\t');
-    
-    // hit length
-    itoa10<int64_t>(rs->summedHitLen(), buf);
-    o.append(buf);
-    o.append('\t');
-    
-    size_t rdlen = rd.patFw.length() + (rdo != NULL ? rdo->patFw.length() : 0);
-    itoa10<size_t>(rdlen, buf);
-    o.append(buf);
-    o.append('\t');
-
-    // number of results
-    itoa10<int64_t>(n_results, buf);
-    o.append(buf);
-    o.append('\n');
-
+   
 }
 
 // #include <iomanip>
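The new `appendTaxID()` helper above splits a packed 64-bit ID into two 32-bit halves and prints them dot-separated when the high half is non-zero. A minimal Python sketch of the same formatting (the function name is mine):

```python
def format_tax_id(tid):
    # Low 32 bits are the primary taxonomy ID; a non-zero high half is
    # appended after a dot, mirroring appendTaxID() in aln_sink.h.
    tid1 = tid & 0xFFFFFFFF
    tid2 = tid >> 32
    return f"{tid1}.{tid2}" if tid2 > 0 else str(tid1)

print(format_tax_id(9606))              # → 9606
print(format_tax_id((7 << 32) | 9606))  # → 9606.7
```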
diff --git a/bt2_idx.h b/bt2_idx.h
index b265361..6728620 100644
--- a/bt2_idx.h
+++ b/bt2_idx.h
@@ -1244,7 +1244,7 @@ public:
                     uint64_t tid = get_tid(stid);
                     if(uids.find(uid) == uids.end()) continue;
                     if(uid_to_tid.find(uid) != uid_to_tid.end()) {
-						if(uid_to_tid.find(uid) != uid_to_tid.end()) {
+						if(uid_to_tid[uid] != tid) {
 							cerr << "Warning: Diverging taxonomy IDs for " << uid << " in " << conversion_table_fname << ": "
                                  << uid_to_tid[uid] << " and " << tid << ". Taking first. " << endl;
 						}
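The bt2_idx.h fix above makes the warning fire only when a UID maps to a genuinely different taxonomy ID (the old code repeated the same `find` check, so the divergence test was a no-op). A sketch of the intended first-wins merge, with an illustrative function name:

```python
def merge_uid_to_tid(pairs):
    # First mapping wins; a diverging later mapping only produces a
    # warning, mirroring the corrected check in bt2_idx.h.
    uid_to_tid = {}
    for uid, tid in pairs:
        if uid in uid_to_tid:
            if uid_to_tid[uid] != tid:
                print(f"Warning: Diverging taxonomy IDs for {uid}: "
                      f"{uid_to_tid[uid]} and {tid}. Taking first.")
            continue
        uid_to_tid[uid] = tid
    return uid_to_tid

merged = merge_uid_to_tid([("GCF_1", 9606), ("GCF_1", 10090)])
print(merged)  # first mapping wins
```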
diff --git a/centrifuge-BuildSharedSequence.pl b/centrifuge-BuildSharedSequence.pl
index 6456c50..b7fcb67 100755
--- a/centrifuge-BuildSharedSequence.pl
+++ b/centrifuge-BuildSharedSequence.pl
@@ -1,4 +1,4 @@
-#!/bin/perl
+#!/usr/bin/env perl
 
 use strict ;
 use Getopt::Long;
diff --git a/centrifuge-RemoveEmptySequence.pl b/centrifuge-RemoveEmptySequence.pl
old mode 100644
new mode 100755
index 5f55d4b..8b3d0fd
--- a/centrifuge-RemoveEmptySequence.pl
+++ b/centrifuge-RemoveEmptySequence.pl
@@ -1,4 +1,4 @@
-#!/bin/perl
+#!/usr/bin/env perl
 
 # remove the headers with empty sequences. possible introduced by dustmask
 
diff --git a/centrifuge-RemoveN.pl b/centrifuge-RemoveN.pl
old mode 100644
new mode 100755
index 323110d..49a940e
--- a/centrifuge-RemoveN.pl
+++ b/centrifuge-RemoveN.pl
@@ -1,4 +1,4 @@
-#/bin/perl
+#!/usr/bin/env perl
 
 use strict ;
 use warnings ;
diff --git a/centrifuge-compress.pl b/centrifuge-compress.pl
index 6804df1..1e42e03 100755
--- a/centrifuge-compress.pl
+++ b/centrifuge-compress.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
 
 # Read and merge the sequence for the chosen level
 
@@ -72,6 +72,26 @@ my $idMapLock : shared ;
 
 my $step = 1 ;
 
+# check for the required external software
+if ( `which jellyfish` eq "" )
+{
+	die "Could not find Jellyfish.\n" ;
+}
+else
+{
+	my $version = `jellyfish --version` ;
+	chomp $version ;
+	if ( ( split /\s+/, $version )[1] lt 2 )
+	{
+		die "Jellyfish on your system is $version, centrifuge-compress.pl requires at least 2.0.0\n" ;
+	}
+}
+
+if ( `which nucmer` eq "" )
+{
+	die "Could not find Nucmer.\n"
+}
+
 if ( `which dustmasker` eq "" ) 
 {
 	print STDERR "Could not find dustmasker. And will turn on -noDustmasker option.\n" ;
diff --git a/centrifuge-download b/centrifuge-download
index c131003..7bff927 100755
--- a/centrifuge-download
+++ b/centrifuge-download
@@ -6,12 +6,17 @@ exists() {
   command -v "$1" >/dev/null 2>&1
 }
 
-if hash wget 2>/dev/null; then
+if hash rsync 2>/dev/null; then
+	DL_PROG="rsync --no-motd"
+        DL_MODE="rsync"
+elif hash wget 2>/dev/null; then
 	DL_PROG="wget -N --reject=index.html -qO"
+        DL_MODE="ftp"
 else 
 	DL_PROG="curl -s -o"
+        DL_MODE="ftp"
 fi
-export DL_PROG
+export DL_PROG DL_MODE
 
 cut_after_first_space_or_second_pipe() {
 	grep '^>' | sed 's/ .*//' | sed 's/\([^|]*|[^|]*\).*/\1/'
@@ -32,9 +37,17 @@ function download_n_process() {
     IFS=$'\t' read -r TAXID FILEPATH <<< "$1"
 
     NAME=`basename $FILEPATH .gz`
-    $DL_PROG "$LIBDIR/$DOMAIN/$NAME.gz" "$FILEPATH" || { printf "\nError downloading $FILEPATH!\n" >&2 && exit 1; }
+
+    if [ $DL_MODE = "rsync" ]; then
+        FILEPATH=${FILEPATH/ftp/rsync}
+        $DL_PROG "$FILEPATH" "$LIBDIR/$DOMAIN/$NAME.gz" || \
+        { printf "\nError downloading $FILEPATH!\n" >&2 && exit 1; }
+    else
+        $DL_PROG "$LIBDIR/$DOMAIN/$NAME.gz" "$FILEPATH" || { printf "\nError downloading $FILEPATH!\n" >&2 && exit 1; }
+    fi
+
     [[ -f "$LIBDIR/$DOMAIN/$NAME.gz" ]] || return;
-    gunzip "$LIBDIR/$DOMAIN/$NAME.gz" ||{ printf "\nError gunzipping $LIBDIR/$DOMAIN/$NAME.gz [ downloaded from $FILEPATH ]!\n" >&2 &&  exit 255; }
+    gunzip -f "$LIBDIR/$DOMAIN/$NAME.gz" ||{ printf "\nError gunzipping $LIBDIR/$DOMAIN/$NAME.gz [ downloaded from $FILEPATH ]!\n" >&2 &&  exit 255; }
 
 	
     if [[ "$CHANGE_HEADER" == "1" ]]; then
@@ -58,6 +71,14 @@ function download_n_process() {
 }
 export -f download_n_process
 
+
+
+function download_n_process_nofail() {
+    download_n_process "$@" || true
+}
+export -f download_n_process_nofail
+
+
 ceol=`tput el` # terminfo clr_eol
 
 function count {
@@ -82,7 +103,7 @@ function count {
 function check_or_mkdir_no_fail {
     #echo -n "Creating $1 ... " >&2
     if [[ -d $1 && ! -n `find $1 -prune -empty -type d` ]]; then
-        echo "Directory exists already! Continuing" >&2
+        echo "Directory $1 exists.  Continuing" >&2
         return `true`
     else 
         #echo "Done" >&2
@@ -110,7 +131,6 @@ DATABASE="refseq"
 ASSEMBLY_LEVEL="Complete Genome"
 REFSEQ_CATEGORY=""
 TAXID=""
-DOWNLOAD_GI_MAP=0
 
 BASE_DIR="."
 N_PROC=1
@@ -145,7 +165,7 @@ WHEN USING database refseq OR genbank:
 "
 
 # arguments: $OPTFIND (current index), $OPTARG (argument for option), $OPTERR (bash-specific)
-while getopts "o:P:d:a:c:t:urlmg" OPT "$@"; do
+while getopts "o:P:d:a:c:t:urlm" OPT "$@"; do
     case $OPT in
         o) BASE_DIR="$OPTARG" ;;
         P) N_PROC="$OPTARG" ;;
@@ -157,7 +177,6 @@ while getopts "o:P:d:a:c:t:urlmg" OPT "$@"; do
 		u) FILTER_UNPLACED=1 ;;
         m) DO_DUST=1 ;;
         l) CHANGE_HEADER=1 ;;
-        g) DOWNLOAD_GI_MAP=1 ;;
         \?) echo "Invalid option: -$OPTARG" >&2 
             exit 1 
         ;;
@@ -177,15 +196,14 @@ if [[ "$DATABASE" == "taxonomy" ]]; then
   echo "Downloading NCBI taxonomy ... " >&2
   if check_or_mkdir_no_fail "$BASE_DIR"; then
     cd "$BASE_DIR" > /dev/null
-    if [[ "$DOWNLOAD_GI_MAP" == "1" ]]; then
-        $DL_PROG gi_taxid_nucl.dmp.gz $FTP/pub/taxonomy/gi_taxid_nucl.dmp.gz
-        gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map
-	else
+    if [ $DL_MODE = "rsync" ]; then
+        $DL_PROG ${FTP/ftp/rsync}/pub/taxonomy/taxdump.tar.gz taxdump.tar.gz
+    else
         $DL_PROG taxdump.tar.gz $FTP/pub/taxonomy/taxdump.tar.gz
-        tar -zxvf taxdump.tar.gz nodes.dmp
-        tar -zxvf taxdump.tar.gz names.dmp
-        rm taxdump.tar.gz
     fi
+    tar -zxvf taxdump.tar.gz nodes.dmp
+    tar -zxvf taxdump.tar.gz names.dmp
+    rm taxdump.tar.gz
     cd - > /dev/null
   fi
   exit 0
@@ -204,8 +222,13 @@ if [[ "$DATABASE" == "contaminants" ]]; then
     cd "$CONTAMINANT_DIR" > /dev/null
 
     # download UniVec and EmVec database
-    $DL_PROG UniVec.fna ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec
-    $DL_PROG emvec.dat.gz ftp://ftp.ebi.ac.uk/pub/databases/emvec/emvec.dat.gz
+    if [ $DL_MODE = "rsync" ]; then
+        $DL_PROG rsync://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec UniVec.fna
+        $DL_PROG rsync://ftp.ebi.ac.uk/pub/databases/emvec/emvec.dat.gz emvec.dat.gz
+    else
+        $DL_PROG UniVec.fna ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec
+        $DL_PROG emvec.dat.gz ftp://ftp.ebi.ac.uk/pub/databases/emvec/emvec.dat.gz
+    fi
     gunzip -c emvec.dat.gz | dat_to_fna > EmVec.fna
  
     if [[ "$CHANGE_HEADER" == "1" ]]; then
@@ -252,7 +275,7 @@ TAXID=${TAXID//,/|}
 if exists wget; then
 	wget -qO- --no-remove-listing ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/ > /dev/null
 else
-	curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/ > .listing
+	curl --disable-epsv -s ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/ > .listing
 fi
 
 if [[ "$CHANGE_HEADER" == "1" ]]; then
@@ -261,10 +284,12 @@ fi
 
 
 for DOMAIN in $DOMAINS; do
-    if [[ ! `grep " $DOMAIN" .listing` ]]; then
-        c_echo "$DOMAIN is not a valid domain - use one of the following:" >&2
-        grep '^d' .listing  | sed 's/.* //'
-        exit 1
+    if [[ -f .listing ]]; then
+        if [[ ! `grep "^d.* $DOMAIN" .listing` ]]; then
+            c_echo "$DOMAIN is not a valid domain - use one of the following:" >&2
+            grep '^d' .listing  | sed 's/.* //' | sed 's/^/  /' 1>&2
+            exit 1
+        fi
     fi
     
     #if [[ "$CHANGE_HEADER" != "1" ]]; then
@@ -278,21 +303,32 @@ for DOMAIN in $DOMAINS; do
     FULL_ASSEMBLY_SUMMARY_FILE="$LIBDIR/$DOMAIN/assembly_summary.txt"
     ASSEMBLY_SUMMARY_FILE="$LIBDIR/$DOMAIN/assembly_summary_filtered.txt"
 
-    #echo "Downloading and filtering the assembly_summary.txt file ..." >&2
-    $DL_PROG "$FULL_ASSEMBLY_SUMMARY_FILE" ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt > "$FULL_ASSEMBLY_SUMMARY_FILE"
+    echo "Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt ..." >&2
+    if [ $DL_MODE = "rsync" ]; then
+        $DL_PROG rsync://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt "$FULL_ASSEMBLY_SUMMARY_FILE" || {
+            c_echo "rsync Download failed! Have a look at valid domains at ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE ." >&2
+            exit 1;
+        }
+    else
+        $DL_PROG "$FULL_ASSEMBLY_SUMMARY_FILE" ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt > "$FULL_ASSEMBLY_SUMMARY_FILE" || {
+            c_echo "Download failed! Have a look at valid domains at ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE ." >&2
+            exit 1;
+        }
+    fi
+
     awk -F "\t" "BEGIN {OFS=\"\t\"} $AWK_QUERY" "$FULL_ASSEMBLY_SUMMARY_FILE" > "$ASSEMBLY_SUMMARY_FILE"
 
     N_EXPECTED=`cat "$ASSEMBLY_SUMMARY_FILE" | wc -l`
     [[ $N_EXPECTED -gt 0 ]] || { echo "Domain $DOMAIN has no genomes with specified filter." >&2; exit 1; }
     echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2
     cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
-       tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process "$@"' _ | count $N_EXPECTED
+       tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
     echo >&2
 
     if [[ "$DOWNLOAD_RNA" == "1" && ! `echo $DOMAIN | egrep 'bacteria|viral|archaea'` ]]; then
    	echo "Downloading RNA sequence files" >&2
         cut -f $TAXID_FIELD,$FTP_PATH_FIELD  "$ASSEMBLY_SUMMARY_FILE"| sed 's#\([^/]*\)$#\1/\1_rna.fna.gz#' |\
-            tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process "$@"' _ | count $N_EXPECTED
+            tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
         echo >&2
     fi
 done
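The centrifuge-download changes above make the script prefer rsync and fall back to wget, then curl. That selection can be modeled as a small pure function; `pick_downloader` and its injected `which` parameter are illustrative, while the command strings follow the script:

```python
import shutil

def pick_downloader(which=shutil.which):
    # Mirrors the precedence in centrifuge-download: rsync > wget > curl.
    if which("rsync"):
        return "rsync --no-motd", "rsync"
    if which("wget"):
        return "wget -N --reject=index.html -qO", "ftp"
    return "curl -s -o", "ftp"

# Injecting availability makes the precedence easy to test:
prog, mode = pick_downloader(which=lambda c: c in ("wget", "curl"))
print(mode)  # → ftp (rsync missing, falls back to wget)
```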
diff --git a/centrifuge-kreport b/centrifuge-kreport
index b51afd4..8ab54d7 100755
--- a/centrifuge-kreport
+++ b/centrifuge-kreport
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
 
 # Give a Kraken-style report from a Centrifuge output
 #
@@ -11,7 +11,7 @@ use warnings;
 use Getopt::Long;
 
 my ($centrifuge_index, $min_score, $min_length);
-my $only_unique = 1;
+my $only_unique = 0;
 my $show_zeros = 0;
 my $is_cnts_table = 0;
 my $PROG = "centrifuge-kreport";
@@ -20,18 +20,36 @@ GetOptions(
   "help" => \&display_help,
   "x=s" => \$centrifuge_index,
   "show-zeros" => \$show_zeros,
-  "is-report-file" => \$is_cnts_table,
+  "is-count-table" => \$is_cnts_table,
   "min-score=i" => \$min_score,
   "min-length=i"=> \$min_length,
   "only-unique" => \$only_unique
 ) or usage();
 
 usage() unless defined $centrifuge_index;
-usage() unless defined $ARGV[0];
+if (!defined $ARGV[0]) {
+    print STDERR "Reading centrifuge out file from STDIN ... \n";
+}
 
 sub usage {
   my $exit_code = @_ ? shift : 64;
-  print STDERR "Usage: centrifuge-kreport -x <index name> [--show-zeros] [--min-score=SCORE] [--is-counts] [--min-length=LENGTH] <centrifuge output file(s)>\n";
+  print STDERR "
+Usage: centrifuge-kreport -x <index name> OPTIONS <centrifuge output file(s)>
+
+centrifuge-kreport creates Kraken-style reports from centrifuge out files.
+
+Options:
+    -x INDEX            (REQUIRED) Centrifuge index
+
+    --only-unique        Only count reads that were uniquely assigned to one taxon
+    --show-zeros         Show clades that have zero reads, too
+    --is-count-table     The format of the file is 'TAXID<tab>COUNT' instead of the standard
+                         Centrifuge output format
+
+    --min-score SCORE    Require a minimum score for reads to be counted
+    --min-length LENGTH  Require a minimum alignment length for the read to be counted
+  
+  ";
   exit $exit_code;
 }
 
@@ -55,11 +73,9 @@ if ($is_cnts_table) {
 } else {
   <>;
   while (<>) {
-    my (undef,$seqID,$taxid,$score, undef,$hitLength,$numMatches) = split /\t/;
-    my ($l,$rl) = split(/ \/ /,$hitLength);
-    #print STDERR "undef,$seqID,$taxid,$score, undef,$hitLength,$numMatches: $l || $rl\n";
+    my (undef,$seqID,$taxid,$score, undef, $hitLength, $queryLength, $numMatches) = split /\t/;
     next if $only_unique && $numMatches > 1;
-    next if defined $min_length && $l < $min_length;
+    next if defined $min_length && $hitLength < $min_length;
     next if defined $min_score && $score < $min_score;
     $taxo_counts{$taxid} += 1/$numMatches;
     $seq_count++;
@@ -74,6 +90,8 @@ for (keys %name_map) {
   $clade_counts{$_} ||= 0;
 }
 
+die "No sequence matched with the given settings\n" unless $seq_count > 0;
+
 printf "%6.2f\t%d\t%d\t%s\t%d\t%s%s\n",
   $clade_counts{0} * 100 / $seq_count,
   $clade_counts{0}, $taxo_counts{0}, "U",
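[Editor's note: for readers following the centrifuge-kreport changes above, the new counting logic can be sketched as follows. This is an illustrative Python rendering, not part of the package; it shows how the eight-column output (readID, seqID, taxID, score, 2ndBestScore, hitLength, queryLength, numMatches) is reduced to fractional per-taxon counts, with thresholds chosen here only for demonstration.]

```python
from collections import defaultdict

def count_taxa(lines, min_score=None, min_length=None, only_unique=False):
    """Aggregate Centrifuge tab output into fractional per-taxID read counts.

    Each multi-matching read contributes 1/numMatches to every taxon it hit,
    mirroring the updated centrifuge-kreport logic.
    """
    taxo_counts = defaultdict(float)
    seq_count = 0
    for line in lines:
        (_read_id, _seq_id, taxid, score, _score2,
         hit_length, _query_length, num_matches) = line.rstrip("\n").split("\t")
        score, hit_length, num_matches = int(score), int(hit_length), int(num_matches)
        if only_unique and num_matches > 1:
            continue  # --only-unique: skip multi-classified reads
        if min_length is not None and hit_length < min_length:
            continue  # --min-length filter on the alignment length
        if min_score is not None and score < min_score:
            continue  # --min-score filter
        taxo_counts[taxid] += 1.0 / num_matches
        seq_count += 1
    return taxo_counts, seq_count
```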
diff --git a/centrifuge-sort-nt.pl b/centrifuge-sort-nt.pl
index d4124eb..022158b 100755
--- a/centrifuge-sort-nt.pl
+++ b/centrifuge-sort-nt.pl
@@ -1,59 +1,110 @@
 #! /usr/bin/env perl
 #
-# Short description for sort-nt.pl
+# Sort nt file sequences according to their taxonomy ID
+# Uses the new mapping file format available at
+# ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
 #
 # Author fbreitwieser <fbreitwieser at sherman>
-# Version 0.1
+# Version 0.2
 # Copyright (C) 2016 fbreitwieser <fbreitwieser at sherman>
-# Modified On 2016-02-28 12:56
+# Modified On 2016-09-26
 # Created  2016-02-28 12:56
 #
 use strict;
 use warnings;
 use File::Basename;
+use Getopt::Long;
 
-$#ARGV==1 or die "USAGE: ".basename($0)." <sequence file> <mapping file>\n";
-my ($nt_file,$gi_taxid_file) = @ARGV;
+my $new_map_file;
+my $ac_wo_mapping_file;
+my $opt_help;
 
-my %gi_to_pos;
-my %taxid_to_gi;
-my %gi_to_taxid;
+my $USAGE = 
+"USAGE: ".basename($0)." OPTIONS <sequence file> <mapping file> [<mapping file>*]\n
 
-print STDERR "Reading headers from $nt_file ...\n";
+OPTIONS:
+  -m str      Write the mappings that are present in the sequence file to file str
+  -a str      Write accessions (ACs) without a taxonomy mapping to file str
+  -h          This message
+";
+
+GetOptions(
+	"m|map=s" => \$new_map_file,
+	"a=s" => \$ac_wo_mapping_file,
+	"h|help" => \$opt_help) or die "Error in command line arguments";
+
+scalar(@ARGV) >= 2 or die $USAGE;
+if (defined $opt_help) {
+	print STDERR $USAGE;
+	exit(0);
+}
+
+my ($nt_file, @ac_taxid_files) = @ARGV;
+
+my %ac_to_pos;
+my %taxid_to_ac;
+my %ac_to_taxid;
+
+print STDERR "Reading headers from $nt_file ... ";
 open(my $NT, "<", $nt_file) or die $!;
 while (<$NT>) {
-    if (/(^>(gi\|[0-9]*).*)/) {
-        #print STDERR "\$gi_to_pos{$2} = [$1,".tell($NT)."]\n";
-        $gi_to_pos{$2} = [tell($NT),$1];
+    # match headers by their full accession, including the version number
+    if (/(^>([^ ]*).*)/) {
+        # record the position of this AC
+        $ac_to_pos{$2} = [tell($NT),$1];
     }
 }
+print STDERR "found ", scalar(keys %ac_to_pos), " ACs\n";
+
+foreach my $ac_taxid_file (@ac_taxid_files) {
+    print STDERR "Reading ac to taxid mapping from $ac_taxid_file ...\n";
+    my $FP1;
+    if ($ac_taxid_file =~ /.gz$/) {
+        open($FP1, "-|", "gunzip -c '$ac_taxid_file'") or die $!;
+    } else {
+        open($FP1, "<", $ac_taxid_file) or die $!;
+    }
 
-print STDERR "Reading gi to taxid mapping $gi_taxid_file ...\n";
-my $FP1;
-if ($gi_taxid_file =~ /.zip$/) {
-    open($FP1, "-|", "unzip -c '$gi_taxid_file'") or die $!;
-} else {
-    open($FP1, "<", $gi_taxid_file) or die $!;
+    # format: accession <tab> accession.version <tab> taxid <tab> gi
+    # currently we look for a mapping with the version number
+    while ( <$FP1> ) {
+        my (undef, $ac, $taxid) = split;
+        next unless defined $taxid;
+        if ( defined( $ac_to_pos{ $ac } ) ) {
+            push @{ $taxid_to_ac{ $taxid } }, $ac;
+            $ac_to_taxid{ $ac } = $taxid;
+        }
+    }
+    close $FP1;
 }
-while ( <$FP1> ) {
-	chomp;
-	my ($gi,$taxid) = split;
-    next unless defined $taxid;
-	if ( defined( $gi_to_pos{ $gi } ) )
-	{
-		push @{ $taxid_to_gi{ $taxid } }, $gi;
-		$gi_to_taxid{ $gi } = $taxid;
-	}
+print STDERR "Got taxonomy mappings for ", scalar(keys %ac_to_taxid), " ACs\n";
+if (defined $ac_wo_mapping_file && scalar(keys %ac_to_taxid) < scalar(keys %ac_to_pos)) {
+    print STDERR "Writing ACs without taxonomy mapping to $ac_wo_mapping_file\n";
+    open(my $FP2, ">", $ac_wo_mapping_file) or die $!;
+    foreach my $ac (keys %ac_to_pos) {
+        next if defined $ac_to_taxid{$ac};
+        print $FP2 $ac, "\n";
+    }
+    close($FP2);
 }
-close $FP1;
+
+if (defined $new_map_file) {
+    print STDERR "Writing taxonomy ID mapping to $new_map_file\n";
+    open(my $FP3, ">", $new_map_file) or die $!;
+    foreach my $ac (keys %ac_to_taxid) {
+        print $FP3 $ac,"\t",$ac_to_taxid{$ac},"\n";
+    }
+    close($FP3);
+}
+
 
 print STDERR "Outputting sorted FASTA ...\n";
-foreach my $taxid (keys %taxid_to_gi) {
-    my @gis = @{$taxid_to_gi{$taxid}};
-    my @sorted_gis = sort { $gi_to_pos{$a}->[0] <=> $gi_to_pos{$b}->[0] } @gis;
-    foreach (@sorted_gis) {
-        print $gi_to_pos{$_}->[1],"\n";
-        seek($NT, $gi_to_pos{$_}->[0], 0);
+foreach my $taxid (sort {$a <=> $b} keys %taxid_to_ac) {
+    my @acs = @{$taxid_to_ac{$taxid}};
+    my @sorted_acs = sort { $ac_to_pos{$a}->[0] <=> $ac_to_pos{$b}->[0] } @acs;
+    foreach (@sorted_acs) {
+        print $ac_to_pos{$_}->[1],"\n";
+        seek($NT, $ac_to_pos{$_}->[0], 0);
         while (<$NT>) {
             last if (/^>/);
             print $_;
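[Editor's note: the rewritten centrifuge-sort-nt.pl above boils down to three passes — index the FASTA headers by versioned accession, intersect them with one or more accession2taxid mapping files, then emit records grouped by numeric taxID. A minimal in-memory Python sketch of that idea follows; it is illustrative only (the real script seeks within the file rather than holding everything in memory).]

```python
def sort_fasta_by_taxid(fasta_text, mapping_lines):
    """Group FASTA records by taxonomy ID, as centrifuge-sort-nt.pl does.

    mapping_lines follow the accession2taxid layout:
        accession <tab> accession.version <tab> taxid <tab> gi
    The versioned accession (second field) is matched against the part of
    each FASTA header before the first space.
    """
    records = {}                       # versioned accession -> (position, record)
    for pos, chunk in enumerate(fasta_text.strip().split("\n>")):
        record = chunk if chunk.startswith(">") else ">" + chunk
        ac = record[1:].split()[0]
        records[ac] = (pos, record)

    taxid_to_acs = {}                  # taxID -> accessions present in the FASTA
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        ac, taxid = fields[1], int(fields[2])
        if ac in records:
            taxid_to_acs.setdefault(taxid, []).append(ac)

    out = []
    for taxid in sorted(taxid_to_acs):             # numeric taxID order
        # within a taxID, keep the original file order
        for ac in sorted(taxid_to_acs[taxid], key=lambda a: records[a][0]):
            out.append(records[ac][1])
    return "\n".join(out)
```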
diff --git a/centrifuge.cpp b/centrifuge.cpp
index f1e0e1c..a2a1c68 100644
--- a/centrifuge.cpp
+++ b/centrifuge.cpp
@@ -1,20 +1,20 @@
 /*
  * Copyright 2014, Daehwan Kim <infphilo at gmail.com>
  *
- * This file is part of HISAT.
+ * This file is part of Centrifuge.
  *
- * HISAT is free software: you can redistribute it and/or modify
+ * Centrifuge is free software: you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation, either version 3 of the License, or
  * (at your option) any later version.
  *
- * HISAT is distributed in the hope that it will be useful,
+ * Centrifuge is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
  * You should have received a copy of the GNU General Public License
- * along with HISAT.  If not, see <http://www.gnu.org/licenses/>.
+ * along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.
  */
 
 #include <stdlib.h>
@@ -249,12 +249,37 @@ static string classification_rank;
 static EList<uint64_t> host_taxIDs;
 static EList<uint64_t> excluded_taxIDs;
 
+
+static string tab_fmt_col_def;
+static bool sam_format;
+static EList<uint32_t> tab_fmt_cols;
+static EList<string> tab_fmt_cols_str;
+static map<string, uint32_t> col_name_map;
+
 #ifdef USE_SRA
 static EList<string> sra_accs;
 #endif
 
 #define DMAX std::numeric_limits<double>::max()
 
+
+static void parse_col_fmt(const string arg, EList<string>& tab_fmt_cols_str, EList<uint32_t>& tab_fmt_cols) {
+    tab_fmt_cols.clear();
+    tab_fmt_cols_str.clear();
+    tokenize(arg, ",", tab_fmt_cols_str);
+    for(size_t i = 0; i < tab_fmt_cols_str.size(); i++) {
+        map<string, uint32_t>::iterator it = col_name_map.find(tab_fmt_cols_str[i]);
+        if (it == col_name_map.end()) {
+            cerr << "Column definition "  << tab_fmt_cols_str[i] << " invalid." << endl;
+            exit(1);
+        }
+        tab_fmt_cols.push_back(it->second);
+    }
+
+}
+
+
+
 static void resetOptions() {
 
 #ifndef NDEBUG
@@ -451,6 +476,47 @@ static void resetOptions() {
     host_taxIDs.clear();
     classification_rank = "strain";
     excluded_taxIDs.clear();
+	sam_format = false;
+
+    col_name_map["readID"] = READ_ID;
+    col_name_map["seqID"] = SEQ_ID;
+    col_name_map["taxLevel"] = TAX_RANK;
+    col_name_map["taxRank"] = TAX_RANK;
+    col_name_map["taxID"] = TAX_ID;
+    col_name_map["taxName"] = TAX_NAME;
+    col_name_map["score"] = SCORE;
+    col_name_map["2ndBestScore"] = SCORE2;
+    col_name_map["hitLength"] = HIT_LENGTH;
+    col_name_map["queryLength"] = QUERY_LENGTH;
+    col_name_map["numMatches"] = NUM_MATCHES;
+    col_name_map["readSeq"] = SEQ;
+    col_name_map["readQual"] = QUAL;
+
+    // SAM names
+    col_name_map["QNAME"] = READ_ID;
+    col_name_map["FLAG"] = PLACEHOLDER_ZERO;
+    col_name_map["RNAME"] = TAX_ID;
+    col_name_map["POS"] = PLACEHOLDER_ZERO;
+    col_name_map["MAPQ"] = PLACEHOLDER_ZERO;
+    col_name_map["CIGAR"] = PLACEHOLDER;
+    col_name_map["RNEXT"] = SEQ_ID;
+    col_name_map["PNEXT"] = PLACEHOLDER_ZERO;
+    col_name_map["TLEN"] = PLACEHOLDER_ZERO;
+    col_name_map["SEQ"] = SEQ;
+    col_name_map["QUAL"] = QUAL;
+
+    // further columns
+    col_name_map["SEQ1"] = SEQ1;
+    col_name_map["SEQ2"] = SEQ2;
+    col_name_map["QUAL1"] = QUAL1;
+    col_name_map["QUAL2"] = QUAL2;
+    col_name_map["readSeq1"] = SEQ1;
+    col_name_map["readSeq2"] = SEQ2;
+    col_name_map["readQual1"] = QUAL1;
+    col_name_map["readQual2"] = QUAL2;
+
+    tab_fmt_col_def = "readID,seqID,taxID,score,2ndBestScore,hitLength,queryLength,numMatches";
+    parse_col_fmt(tab_fmt_col_def, tab_fmt_cols_str, tab_fmt_cols);
     
 #ifdef USE_SRA
     sra_accs.clear();
@@ -614,7 +680,9 @@ static struct option long_options[] = {
     {(char*)"no-abundance",     no_argument,       0,        ARG_NO_ABUNDANCE},
     {(char*)"no-traverse",      no_argument,       0,        ARG_NO_TRAVERSE},
     {(char*)"classification-rank", required_argument,    0,  ARG_CLASSIFICATION_RANK},
-    {(char*)"exclude-taxids",      required_argument,    0,  ARG_EXCLUDE_TAXIDS},
+    {(char*)"exclude-taxids",   required_argument, 0,  ARG_EXCLUDE_TAXIDS},
+    {(char*)"out-fmt",          required_argument, 0,  ARG_OUT_FMT},
+    {(char*)"tab-fmt-cols",     required_argument, 0,  ARG_TAB_FMT_COLS},
 #ifdef USE_SRA
     {(char*)"sra-acc",   required_argument, 0,        ARG_SRA_ACC},
 #endif
@@ -662,16 +730,16 @@ static void printArgDesc(ostream& out) {
  * Print a summary usage message to the provided output stream.
  */
 static void printUsage(ostream& out) {
-	out << "Centrifuge version " << string(CENTRIFUGE_VERSION).c_str() << " by Daehwan Kim (infphilo at gmail.com, www.ccb.jhu.edu/people/infphilo)" << endl;
+	out << "Centrifuge version " << string(CENTRIFUGE_VERSION).c_str() << " by the Centrifuge developer team (centrifuge.metagenomics at gmail.com)" << endl;
 	string tool_name = "centrifuge-class";
 	if(wrapper == "basic-0") {
-		tool_name = "hisat";
+		tool_name = "centrifuge";
 	}
     out << "Usage: " << endl
 #ifdef USE_SRA
-    << "  " << tool_name.c_str() << " [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <filename>] [--report-file <report>]" << endl
+    << "  " << tool_name.c_str() << " [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <filename>] [--report-file <report>]" << endl
 #else
-    << "  " << tool_name.c_str() << " [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <filename>] [--report-file <report>]" << endl
+    << "  " << tool_name.c_str() << " [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <filename>] [--report-file <report>]" << endl
 #endif
 	    << endl
 		<<     "  <cf-idx>   Index filename prefix (minus trailing .X." << gEbwt_ext << ")." << endl
@@ -715,7 +783,6 @@ static void printUsage(ostream& out) {
 		<< "  --ignore-quals     treat all quality values as 30 on Phred scale (off)" << endl
 	    << "  --nofw             do not align forward (original) version of read (off)" << endl
 	    << "  --norc             do not align reverse-complement version of read (off)" << endl
-        << "  --min-hitlen       " << endl
 #ifdef USE_SRA
         << "  --sra-acc          SRA accession ID" << endl
 #endif
@@ -730,7 +797,10 @@ static void printUsage(ostream& out) {
 	//if(wrapper == "basic-0") {
 	//	out << "  --bam              output directly to BAM (by piping through 'samtools view')" << endl;
 	//}
-	out << "  -t/--time          print wall-clock time taken by search phases" << endl;
+	out << "  --out-fmt <str>       define output format, either 'tab' or 'sam' (tab)" << endl
+		<< "  --tab-fmt-cols <str>  columns in tabular format, comma separated " << endl 
+        << "                          default: " << tab_fmt_col_def << endl;
+	out << "  -t/--time             print wall-clock time taken by search phases" << endl;
 	if(wrapper == "basic-0") {
 	out << "  --un <path>           write unpaired reads that didn't align to <path>" << endl
 	    << "  --al <path>           write unpaired reads that aligned at least once to <path>" << endl
@@ -739,17 +809,17 @@ static void printUsage(ostream& out) {
 	    << "  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g." << endl
 		<< "  --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)" << endl;
 	}
-	out << "  --quiet            print nothing to stderr except serious errors" << endl
-	//  << "  --refidx           refer to ref. seqs by 0-based index rather than name" << endl
-		<< "  --met-file <path>  send metrics to file at <path> (off)" << endl
-		<< "  --met-stderr       send metrics to stderr (off)" << endl
-		<< "  --met <int>        report internal counters & metrics every <int> secs (1)" << endl
+	out << "  --quiet               print nothing to stderr except serious errors" << endl
+	//  << "  --refidx              refer to ref. seqs by 0-based index rather than name" << endl
+		<< "  --met-file <path>     send metrics to file at <path> (off)" << endl
+		<< "  --met-stderr          send metrics to stderr (off)" << endl
+		<< "  --met <int>           report internal counters & metrics every <int> secs (1)" << endl
 		<< endl
 	    << " Performance:" << endl
 	    << "  -o/--offrate <int> override offrate of index; must be >= index's offrate" << endl
 	    << "  -p/--threads <int> number of alignment threads to launch (1)" << endl
 #ifdef BOWTIE_MM
-	    << "  --mm               use memory-mapped I/O for index; many 'bowtie's can share" << endl
+	    << "  --mm               use memory-mapped I/O for index; many instances can share" << endl
 #endif
 		<< endl
 	    << " Other:" << endl
@@ -1361,6 +1431,24 @@ static void parseOption(int next_option, const char *arg) {
             }
             break;
         }
+	    case ARG_OUT_FMT: {
+            if (strcmp(arg, "sam") == 0) {
+                sam_format = true;
+                parse_col_fmt("QNAME,FLAG,RNAME,POS,MAPQ,CIGAR,RNEXT,PNEXT,TLEN,SEQ,QUAL",
+                              tab_fmt_cols_str, tab_fmt_cols);
+            } else if (strcmp(arg, "default") == 0 || strcmp(arg, "tab") == 0) {
+
+            } else {
+                cerr << "Invalid output format " << arg << "!" << endl;
+                exit(1);
+            }
+			break;
+	    }
+		case ARG_TAB_FMT_COLS: {
+            parse_col_fmt(arg, tab_fmt_cols_str, tab_fmt_cols);
+    		break;
+		}
+
 #ifdef USE_SRA
         case ARG_SRA_ACC: {
             tokenize(arg, ",", sra_accs); format = SRA_FASTA;
@@ -2893,13 +2981,22 @@ static void driver(
                                                  &ebwt,
                                                  oq,           // output queue
                                                  refnames,     // reference names
+                                                 tab_fmt_cols, // columns in tab format
                                                  gQuiet);      // don't print alignment summary at end
                 if(!samNoHead) {
 					BTString buf;
+                    // TODO: write @SQ\tSN:<name>\tLN:<length> header fields
 					fout->writeString(buf);
 				}
 				// Write header for read-results file
-				fout->writeChars("readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n");
+                if (!sam_format) {
+                fout->writeChars(tab_fmt_cols_str[0].c_str());
+                for (size_t i = 1; i < tab_fmt_cols_str.size(); ++i) {
+                    fout->writeChars("\t");
+                    fout->writeChars(tab_fmt_cols_str[i].c_str());
+                }
+                fout->writeChars("\n");
+                }
 				break;
 			}
 			default:
@@ -2976,8 +3073,10 @@ static void driver(
             }
             reportOfb << endl;
 			for(map<uint64_t, ReadCounts>::const_iterator it = spm.species_counts.begin(); it != spm.species_counts.end(); ++it) {
+
                 uint64_t taxid = it->first;
                 if(taxid == 0) continue;
+
                 std::map<uint64_t, string>::const_iterator name_itr = name_map.find(taxid);
                 if(name_itr != name_map.end()) {
                     reportOfb << name_itr->second;
@@ -2985,6 +3084,7 @@ static void driver(
                     reportOfb << taxid;
                 }
                 reportOfb << '\t' << taxid << '\t';
+
                 uint8_t rank = 0;
                 bool leaf = false;
                 std::map<uint64_t, TaxonomyNode>::const_iterator tree_itr = tree.find(taxid);
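[Editor's note: the new --tab-fmt-cols option above validates each requested column against col_name_map before classification starts. A small Python sketch of that lookup — the map below is only a subset of the one in centrifuge.cpp, and the string codes stand in for the C++ enum values.]

```python
# Subset of centrifuge.cpp's col_name_map; strings stand in for the C++ enums.
COL_NAME_MAP = {
    "readID": "READ_ID", "seqID": "SEQ_ID", "taxID": "TAX_ID",
    "taxRank": "TAX_RANK", "taxName": "TAX_NAME", "score": "SCORE",
    "2ndBestScore": "SCORE2", "hitLength": "HIT_LENGTH",
    "queryLength": "QUERY_LENGTH", "numMatches": "NUM_MATCHES",
    # SAM aliases reuse the same internal columns or zero placeholders
    "QNAME": "READ_ID", "RNAME": "TAX_ID", "FLAG": "PLACEHOLDER_ZERO",
}

def parse_col_fmt(arg):
    """Split a comma-separated column list and validate every name,
    mirroring parse_col_fmt() in centrifuge.cpp (which exits on error)."""
    cols = []
    for name in arg.split(","):
        if name not in COL_NAME_MAP:
            raise ValueError("Column definition %s invalid." % name)
        cols.append(COL_NAME_MAP[name])
    return cols
```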
diff --git a/centrifuge.xcodeproj/project.pbxproj b/centrifuge.xcodeproj/project.pbxproj
index d32bbe4..666a084 100644
--- a/centrifuge.xcodeproj/project.pbxproj
+++ b/centrifuge.xcodeproj/project.pbxproj
@@ -216,6 +216,7 @@
 		E869A04A1C2095A8007600C2 /* centrifugex */ = {isa = PBXFileReference; explicitFileType = "compiled.mach-o.executable"; includeInIndex = 0; path = centrifugex; sourceTree = BUILT_PRODUCTS_DIR; };
 		E869A0551C2095B5007600C2 /* centrifuge-inspectx */ = {isa = PBXFileReference; explicitFileType = "compiled.mach-o.executable"; includeInIndex = 0; path = "centrifuge-inspectx"; sourceTree = BUILT_PRODUCTS_DIR; };
 		E869A0601C2095CA007600C2 /* centrifuge-compressx */ = {isa = PBXFileReference; explicitFileType = "compiled.mach-o.executable"; includeInIndex = 0; path = "centrifuge-compressx"; sourceTree = BUILT_PRODUCTS_DIR; };
+		E8D181651D64DB4D00CC784A /* taxonomy.h */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.c.h; path = taxonomy.h; sourceTree = "<group>"; };
 /* End PBXFileReference section */
 
 /* Begin PBXFrameworksBuildPhase section */
@@ -335,6 +336,7 @@
 				E86143771C20833200D5C240 /* sse_util.h */,
 				E86143781C20833200D5C240 /* sstring.h */,
 				E86143791C20833200D5C240 /* str_util.h */,
+				E8D181651D64DB4D00CC784A /* taxonomy.h */,
 				E861437A1C20833200D5C240 /* threading.h */,
 				E861437B1C20833200D5C240 /* timer.h */,
 				E861437C1C20833200D5C240 /* tinythread.h */,
@@ -475,7 +477,7 @@
 		E8485E0B1C207EF000F225FA /* Project object */ = {
 			isa = PBXProject;
 			attributes = {
-				LastUpgradeCheck = 0720;
+				LastUpgradeCheck = 0800;
 				ORGANIZATIONNAME = "Daehwan Kim";
 				TargetAttributes = {
 					E8485E121C207EF000F225FA = {
@@ -606,8 +608,10 @@
 				CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
 				CLANG_WARN_EMPTY_BODY = YES;
 				CLANG_WARN_ENUM_CONVERSION = YES;
+				CLANG_WARN_INFINITE_RECURSION = YES;
 				CLANG_WARN_INT_CONVERSION = YES;
 				CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
+				CLANG_WARN_SUSPICIOUS_MOVE = YES;
 				CLANG_WARN_UNREACHABLE_CODE = YES;
 				CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
 				CODE_SIGN_IDENTITY = "-";
@@ -660,8 +664,10 @@
 				CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
 				CLANG_WARN_EMPTY_BODY = YES;
 				CLANG_WARN_ENUM_CONVERSION = YES;
+				CLANG_WARN_INFINITE_RECURSION = YES;
 				CLANG_WARN_INT_CONVERSION = YES;
 				CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
+				CLANG_WARN_SUSPICIOUS_MOVE = YES;
 				CLANG_WARN_UNREACHABLE_CODE = YES;
 				CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
 				CODE_SIGN_IDENTITY = "-";
diff --git a/classifier.h b/classifier.h
index 4c47b6d..f61a7f0 100644
--- a/classifier.h
+++ b/classifier.h
@@ -380,10 +380,23 @@ public:
         for(size_t i = 0; i < _hitMap.size(); i++) {
             _hitMap[i].finalize(this->_paired, this->_mate1fw, this->_mate2fw);
         }
+
+        // See if some of the assignments correspond to host taxIDs
+        int64_t best_score = 0;
+        bool only_host_taxIDs = false;
+        for(size_t gi = 0; gi < _hitMap.size(); gi++) {
+            if(_hitMap[gi].score > best_score) {
+                best_score = _hitMap[gi].score;
+                only_host_taxIDs = (_host_taxIDs.find(_hitMap[gi].taxID) != _host_taxIDs.end());
+            } else if(_hitMap[gi].score == best_score) {
+                only_host_taxIDs |= (_host_taxIDs.find(_hitMap[gi].taxID) != _host_taxIDs.end());
+            }
+        }
+ 
         
         // If the number of hits is more than -k,
         //   traverse up the taxonomy tree to reduce the number
-        if(_hitMap.size() > (size_t)rp.khits) {
+        if (!only_host_taxIDs && _hitMap.size() > (size_t)rp.khits) {
             // Count the number of the best hits
             uint32_t best_score = _hitMap[0].score;
             for(size_t i = 1; i < _hitMap.size(); i++) {
@@ -422,10 +435,9 @@ public:
                         _hitMap[i].leaf = false;
                     }
                     if(_hitMap[i].rank > rank) continue;
-                    if(rank + 1 >= _hitMap[i].path.size()) continue;
-                    uint64_t parent_taxID = _hitMap[i].path[rank + 1];
                     
-                    // Daehwan: we may want to traverse up the tree more until we get non-zero taxID.
+                    uint64_t parent_taxID = (rank + 1 >= _hitMap[i].path.size() ? 1 : _hitMap[i].path[rank + 1]);
+                    // Traverse up the tree further until we get a non-zero taxID.
                     if(parent_taxID == 0) continue;
                     
                     size_t j = 0;
@@ -441,7 +453,14 @@ public:
                         _hitTaxCount.back().second = parent_taxID;
                     }
                 }
-                if(_hitTaxCount.size() <= 0) break;
+                if(_hitTaxCount.size() <= 0) {
+                    if(rank < _hitMap[0].path.size()) {
+                        rank++;
+                        continue;
+                    } else {
+                        break;
+                    }
+                }
                 _hitTaxCount.sort();
                 size_t j = _hitTaxCount.size();
                 while(j-- > 0) {
@@ -450,9 +469,9 @@ public:
                     for(size_t i = 0; i < _hitMap.size(); i++) {
                         assert_geq(_hitMap[i].rank, rank);
                         if(_hitMap[i].rank != rank) continue;
-                        if(rank + 1 >= _hitMap[i].path.size()) continue;
-                        if(parent_taxID == _hitMap[i].path[rank + 1]) {
-                            _hitMap[i].uniqueID = 0;
+                        uint64_t cur_parent_taxID = (rank + 1 >= _hitMap[i].path.size() ? 1 : _hitMap[i].path[rank + 1]);
+                        if(parent_taxID == cur_parent_taxID) {
+                            _hitMap[i].uniqueID = std::numeric_limits<uint64_t>::max();
                             _hitMap[i].rank = rank + 1;
                             _hitMap[i].taxID = parent_taxID;
                             _hitMap[i].leaf = false;
@@ -486,10 +505,12 @@ public:
                     if(_hitMap.size() <= (size_t)rp.khits)
                         break;
                 }
-                rank += 1;
+                rank++;
+                if(rank > _hitMap[0].path.size())
+                    break;
             }
         }
-        if(_hitMap.size() > (size_t)rp.khits)
+        if(!only_host_taxIDs && _hitMap.size() > (size_t)rp.khits)
             return 0;
        
 #if 0
@@ -507,18 +528,7 @@ public:
             max_score += (rdlen > 15 ? (rdlen - 15) * (rdlen - 15) : 0);
         }
         
-        // See if some of the assignments corresponde to host taxIDs
-        int64_t best_score = 0;
-        bool only_host_taxIDs = false;
-        for(size_t gi = 0; gi < _hitMap.size(); gi++) {
-            if(_hitMap[gi].score > best_score) {
-                best_score = _hitMap[gi].score;
-                only_host_taxIDs = (_host_taxIDs.find(_hitMap[gi].taxID) != _host_taxIDs.end());
-            } else if(_hitMap[gi].score == best_score) {
-                only_host_taxIDs |= (_host_taxIDs.find(_hitMap[gi].taxID) != _host_taxIDs.end());
-            }
-        }
-        
+       
         for(size_t gi = 0; gi < _hitMap.size(); gi++) {
             assert_gt(_hitMap[gi].score, 0);
             HitCount<index_t>& hitCount = _hitMap[gi];
@@ -527,7 +537,6 @@ public:
                     continue;
             }
             const EList<pair<string, uint64_t> >& uid_to_tid = ebwtFw.uid_to_tid();
-            assert_lt(hitCount.uniqueID, uid_to_tid.size());
             const std::map<uint64_t, TaxonomyNode>& tree = ebwtFw.tree();
             uint8_t taxRank = RANK_UNKNOWN;
             std::map<uint64_t, TaxonomyNode>::const_iterator itr = tree.find(hitCount.taxID);
@@ -539,7 +548,7 @@ public:
             rs.init(
                     hitCount.score,
                     max_score,
-                    uid_to_tid[hitCount.uniqueID].first,
+                    hitCount.uniqueID < uid_to_tid.size() ? uid_to_tid[hitCount.uniqueID].first : get_tax_rank_string(taxRank),
                     hitCount.taxID,
                     taxRank,
                     hitCount.summedHitLen,
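[Editor's note: the classifier.h changes above adjust how assignments are promoted up the taxonomy tree when more than -k taxa are hit, now falling back to the root (taxID 1) when a path is exhausted instead of stalling. A toy Python sketch of the promotion idea — it deliberately ignores scores and the host-taxID special case.]

```python
def reduce_hits(paths, k):
    """Promote hits to ancestors until at most k distinct taxa remain.

    Each path lists taxIDs from leaf to root (e.g. [562, 543, 1]); once a
    path runs out, the hit stays at its last element (the root).  This is
    only a simplified model of the classifier.h reduction loop.
    """
    rank = 0
    max_rank = max(len(p) for p in paths) - 1
    taxa = {p[0] for p in paths}
    while len(taxa) > k and rank < max_rank:
        rank += 1
        # hits whose path is shorter than the current rank stay at the root
        taxa = {p[min(rank, len(p) - 1)] for p in paths}
    return sorted(taxa)
```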
diff --git a/doc/index.shtml b/doc/index.shtml
index cf078b8..6957ee3 100644
--- a/doc/index.shtml
+++ b/doc/index.shtml
@@ -20,6 +20,19 @@
 	<!--#include virtual="sidebar.inc.shtml"-->
 	</div> <!-- End of "rightside" -->
 	<div id="leftside">
+          <h2>Have a look at <a href="https://github.com/fbreitwieser/pavian">Pavian</a> for visual analysis of results generated with Centrifuge!</h2>
+
+	  <h2>Centrifuge 1.0.3-beta release 12/06/2016</h2>
+          <ul>
+              <li>Fixed Perl hash bangs (thanks to Andreas Sjödin / @druvus).</li>
+              <li>Updated nt database building to work with new accession code scheme.</li>
+              <li>Added option <tt>--tab-fmt-cols</tt> to specify output format columns.</li>
+              <li>A minor fix for traversing the hitmap up the taxonomy tree.</li>
+          </ul>
+          <br/>
+
+          <h2><a href="http://genome.cshlp.org/content/early/2016/11/16/gr.210641.116.abstract">Centrifuge</a> paper published at Genome Research 11/16/2016</h2>  
+
 	  <h2>Centrifuge 1.0.2-beta release 5/25/2016</h2>
           <ul>
 	    <li>Fixed a runtime error during abundance analysis.</li>
diff --git a/doc/manual.inc.html b/doc/manual.inc.html
index d778a78..2b257b9 100644
--- a/doc/manual.inc.html
+++ b/doc/manual.inc.html
@@ -16,6 +16,7 @@
 <li><a href="#custom-database">Custom database</a></li>
 <li><a href="#centrifuge-classification-output">Centrifuge classification output</a></li>
 <li><a href="#centrifuge-summary-output-the-default-filename-is-centrifuge_report.tsv">Centrifuge summary output (the default filename is centrifuge_report.tsv)</a></li>
+<li><a href="#kraken-style-report">Kraken-style report</a></li>
 </ul></li>
 <li><a href="#inspecting-the-centrifuge-index">Inspecting the Centrifuge index</a></li>
 <li><a href="#wrapper">Wrapper</a></li>
@@ -127,7 +128,7 @@ The fourth column is the score for the classification, which is the weighted sum
 The fifth column is the score for the next best classification (e.g., 0).
 The sixth column is a pair of two numbers: (1) an approximate number of base pairs of the read that match the genomic sequence and (2) the length of a read or the combined length of mate pairs (e.g., 80 / 80).
 The seventh column is a pair of two numbers: (1) an approximate number of base pairs of the read that match the genomic sequence and (2) the length of a read or the combined length of mate pairs (e.g., 80 / 80). 
-The eighth column is the number of classifications, indicating how many assignments were made (e.g.,1).</code></pre>
+The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).</code></pre>
 <h3 id="centrifuge-summary-output-the-default-filename-is-centrifuge_report.tsv">Centrifuge summary output (the default filename is centrifuge_report.tsv)</h3>
 <p>The following example shows a classification summary for each genome or taxonomic unit. The assignment output has 7 columns.</p>
 <pre><code>name                                                            taxID   taxRank    genomeSize   numReads   numUniqueReads   abundance
@@ -140,6 +141,10 @@ The fourth column is the length of the genome sequence (e.g., 703004).
 The fifth column is the number of reads classified to this genomic sequence including multi-classified reads (e.g., 5981).
 The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).
 The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).</code></pre>
+<p>As the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments. In order to perform more conservative analyses, users may want to discard assignments for reads having a matching length (8th column in the output of Centrifuge) of 40% or lower. It may be also helpful to use a score (4th column) for filtering out some assignments. Our futu [...]
+<h3 id="kraken-style-report">Kraken-style report</h3>
+<p><code>centrifuge-kreport</code> can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:</p>
+<p><code>centrifuge-kreport -x <centrifuge index> <centrifuge out file></code></p>
 <h2 id="inspecting-the-centrifuge-index">Inspecting the Centrifuge index</h2>
 <p>The index can be inspected with <code>centrifuge-inspect</code>. To extract raw sequences:</p>
 <pre><code>centrifuge-inspect <centrifuge index></code></pre>
@@ -304,6 +309,14 @@ The seventh column is the proportion of this genome normalized by its genomic le
 <h4 id="classification">Classification</h4>
 <table>
 
+<tr><td id="centrifuge-options-min-hitlen">
+
+<pre><code>--min-hitlen <int></code></pre>
+</td><td>
+
+<p>Minimum length of partial hits, which must be greater than 15 (default: 22)</p>
+</td></tr>
+
 <tr><td id="centrifuge-options-k">
 
 <pre><code>-k <int></code></pre>
diff --git a/doc/sidebar.inc.shtml b/doc/sidebar.inc.shtml
index 5942da2..85d4217 100644
--- a/doc/sidebar.inc.shtml
+++ b/doc/sidebar.inc.shtml
@@ -7,6 +7,7 @@
  </ul>
 </div>
 
+<!--
 <h2>News and Updates</h2>
 <div class="box">
  <ul>
@@ -16,16 +17,15 @@
    </tbody></table>
  </ul>
 </div>
+-->
 
 <h2>Getting Help</h2>
 <div class="box">
  <ul>
    <table width="100%">
      <tbody><tr><td>
-		<!-- Questions and comments about HISAT2 can be posted on the 
-		<a href="https://groups.google.com/forum/#!forum/hisat-tools-users"><b>HISAT Tools Users Google Group</b></a>. -->
-		Please use <a href="mailto:centrifuge.metagenomics at gmail.com">centrifuge.metagenomics at gmail.com</a> for 
-		private communications only. Please do not email technical questions to Centrifuge contributors directly.</td></tr>
+        Please submit an issue on <a href="https://github.com/infphilo/centrifuge/issues">GitHub</a>, or send an E-Mail to
+        <a href="mailto:centrifuge.metagenomics at gmail.com">centrifuge.metagenomics at gmail.com</a> for private communications.</td></tr>
    </tbody></table>
  </ul>
 </div>
@@ -33,53 +33,53 @@
 <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads"><h2><u>Releases</u></h2></a>
 <div class="box">
  <ul>
-   <table width="100%"><tbody><tr><td>version 1.0.2-beta</td> <td align="right">5/25/2016</td></tr>
+   <table width="100%"><tbody><tr><td>version 1.0.3-beta</td> <td align="right">12/06/2016</td></tr>
        <tr>
-	 <td><a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.2-beta-source.zip" onclick="javascript: pageTracker._trackPageview('/downloads/centrifuge'); ">   Source code</a></td>
+	 <td><a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-source.zip" onclick="javascript: pageTracker._trackPageview('/downloads/centrifuge'); ">   Source code</a></td>
        </tr>
        <tr>
-	 <td><a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.2-beta-Linux_x86_64.zip" onclick="javascript: pageTracker._trackPageview('/downloads/centrifuge'); ">   Linux x86_64 binary</a></td>
+	 <td><a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-Linux_x86_64.zip" onclick="javascript: pageTracker._trackPageview('/downloads/centrifuge'); ">   Linux x86_64 binary</a></td>
        </tr>
        <tr>
-	 <td><a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.2-beta-OSX_x86_64.zip" onclick="javascript: pageTracker._trackPageview('/downloads/centrifuge'); ">   Mac OS X x86_64 binary</a></td>
+	 <td><a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-OSX_x86_64.zip" onclick="javascript: pageTracker._trackPageview('/downloads/centrifuge'); ">   Mac OS X x86_64 binary</a></td>
        </tr>
    </tbody></table>
  </ul>
 </div>
 
-<a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data"><h2><u>Indexes</u></h2></a>
+<a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data"><h2><u>Indexes</u></h2></a> 
   <div class="box">
-    <table width="100%">
+    <table width="100%"><tr><td>last updated:</td> <td align="right">12/06/2016</td></tr>
       <tr>
         <td>
-	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/b_compressed.tar.gz"><i>Bacteria (compressed)</i></a>
+	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz"><i>   Bacteria (compressed)</i></a>
         </td>
 	<td align="right" style="font-size: x-small">
-	  <b>3.4 GB</b>
+	  <b>4.4 GB</b>
         </td>
       </tr>
       <tr>
         <td>
-	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/b_compressed+h+v.tar.gz"><i>Bacteria, Viruses, Human (compressed)</i></a>
+	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz"><i>   Bacteria, Viruses, Human (compressed)</i></a>
         </td>
 	<td align="right" style="font-size: x-small">
-	  <b>3.9 GB</b>
+	  <b>5.4 GB</b>
         </td>
       </tr>
       <tr>
         <td>
-	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/b+h+v.tar.gz"><i>Bacteria, Viruses, Human </i></a>
+	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz"><i>   Bacteria, Viruses, Human </i></a>
         </td>
 	<td align="right" style="font-size: x-small">
-	  <b>6.3 GB</b>
+	  <b>7.9 GB</b>
         </td>
       </tr>
       <tr>
         <td>
-	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz"><i>NCBI nucleotide non-redundant sequences </i></a>
+	  <a href="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz"><i>   NCBI nucleotide non-redundant sequences </i></a>
         </td>
 	<td align="right" style="font-size: x-small">
-	  <b>44.2 GB</b>
+	  <b>50 GB</b>
         </td>
       </tr>
     </table>
@@ -88,20 +88,16 @@
 <h2>Related Tools</h2>
 <div class="box">
  <ul>
-   <!--
-   <li><a href="http://www.ccb.jhu.edu/software/hisat">HISAT</a>: Fast and sensitive spliced alignment</li>
+   <li><a href="https://github.com/fbreitwieser/pavian">Pavian</a>: Tool for interactive analysis of pathogen and metagenomics data</li>
+   <li><a href="http://www.ccb.jhu.edu/software/hisat2">HISAT2</a>: Graph-based alignment to a population of genomes</li>
    <li><a href="http://bowtie-bio.sourceforge.net/bowtie2">Bowtie2</a>: Ultrafast read alignment</li>
-   <li><a href="http://www.ccb.jhu.edu/software/tophat">TopHat2</a>: Spliced read mapper for RNA-Seq</li>
-   <li><a href="http://cufflinks.cbcb.umd.edu">Cufflinks</a>: Isoform assembly and quantitation for RNA-Seq</li>
-   <li><a href="http://www.ccb.jhu.edu/software/stringtie">StringTie</a>: Transcript assembly and quantification for RNA-Seq</li>
-   -->
  </ul>
 </div>
 
 <h2>Publications</h2>
 <div class="box">
  <ul>
-   <li><p>Kim D, Song L, Breitwieser F, and Salzberg SL. <a href="http://biorxiv.org/content/early/2016/05/25/054965"><b>Centrifuge: rapid and sensitive classification of metagenomic sequences</b></a>. <i>bioRxiv</i> 2016</p></li>
+   <li><p>Kim D, Song L, Breitwieser FP, and Salzberg SL. <a href="http://genome.cshlp.org/content/early/2016/11/16/gr.210641.116.abstract"><b>Centrifuge: rapid and sensitive classification of metagenomic sequences</b></a>. <i>Genome Research</i> 2016</p></li>
  </ul>
 </div>
 
diff --git a/doc/strip_markdown.pl b/doc/strip_markdown.pl
index 0ecc595..13a7529 100644
--- a/doc/strip_markdown.pl
+++ b/doc/strip_markdown.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl -w
+#!/usr/bin/env perl -w
 
 ##
 # strip_markdown.pl
diff --git a/evaluation/centrifuge_evaluate.py b/evaluation/centrifuge_evaluate.py
index 212d2d8..e8fd268 100755
--- a/evaluation/centrifuge_evaluate.py
+++ b/evaluation/centrifuge_evaluate.py
@@ -534,26 +534,26 @@ def evaluate(index_base,
 if __name__ == "__main__":
     parser = ArgumentParser(
         description='Centrifuge evaluation')
-    parser.add_argument('index_base',
+    parser.add_argument("index_base",
                         nargs='?',
                         type=str,
                         help='Centrifuge index')
-    parser.add_argument('--index-base-for-read',
+    parser.add_argument("--index-base-for-read",
                         dest="index_base_for_read",
                         type=str,
                         default="",
                         help='index base for read (default same as index base)')    
-    parser.add_argument('--num-fragment',
+    parser.add_argument("--num-fragment",
                         dest="num_fragment",
                         action='store',
                         type=int,
                         default=1,
                         help='Number of fragments in millions (default: 1)')
-    parser.add_argument('--paired',
+    parser.add_argument("--paired",
                         dest='paired',
                         action='store_true',
                         help='Paired-end reads')
-    parser.add_argument('--error-rate',
+    parser.add_argument("--error-rate",
                         dest='error_rate',
                         action='store',
                         type=float,
@@ -570,19 +570,19 @@ if __name__ == "__main__":
                         type=str,
                         default="centrifuge",
                         help="A comma-separated list of aligners (default: centrifuge)")
-    parser.add_argument('--runtime-only',
+    parser.add_argument("--runtime-only",
                         dest='runtime_only',
                         action='store_true',
                         help='Just check runtime without evaluation')    
-    parser.add_argument('--no-sql',
+    parser.add_argument("--no-sql",
                         dest='sql',
                         action='store_false',
                         help='Do not write results into a sqlite database')
-    parser.add_argument('-v', '--verbose',
+    parser.add_argument("-v", "--verbose",
                         dest='verbose',
                         action='store_true',
                         help='also print some statistics to stderr')
-    parser.add_argument('--debug',
+    parser.add_argument("--debug",
                         dest='debug',
                         action='store_true',
                         help='Debug')
diff --git a/evaluation/test/abundance.Rmd b/evaluation/test/abundance.Rmd
new file mode 100644
index 0000000..321ae41
--- /dev/null
+++ b/evaluation/test/abundance.Rmd
@@ -0,0 +1,53 @@
+---
+title: "Centrifuge abundance"
+# author: "Daehwan Kim"
+date: "August 15, 2016"
+output: html_document
+---
+
+```{r setup}
+library(ggplot2)
+```
+
+```{r}
+
+abundance.cmp <- read.delim("abundance_k5.cmp", stringsAsFactors = FALSE)
+levels <- c("species", "genus", "family", "order", "class", "phylum")
+abundance.cmp$rank <- factor(abundance.cmp$rank, levels = levels)
+abundance.cmp$log_true <- log(abundance.cmp$true)
+abundance.cmp$log_calc <- log(abundance.cmp$calc)
+head(abundance.cmp)
+
+ggplot(abundance.cmp) + 
+  geom_bar(aes(x = log_true), binwidth = 0.2) +
+  xlab("abundance (log_truth)") +
+  facet_wrap(~ rank, scales = "free_y")
+
+ggplot(abundance.cmp) + 
+  geom_bar(aes(x = log_calc), binwidth = 0.2) +
+  xlab("abundance (log_calc)") +
+  facet_wrap(~ rank, scales = "free_y")
+
+ggplot(abundance.cmp) + 
+  geom_point(aes(x = true, y = calc), size = 0.7) +
+  xlab("abundance (truth)") + ylab("abundance (centrifuge)") +
+  facet_wrap(~ rank)  + 
+  # geom_text(aes(x = true, y = calc, label = name),
+  #          check_overlap = TRUE, hjust = 0, nudge_x = 0.01) +
+  geom_abline(slope = 1, color = "red")
+
+ggplot(abundance.cmp) + 
+  geom_point(aes(x = log_true, y = log_calc), size = 0.7) +
+  xlab("abundance (log_truth)") + ylab("abundance (log_centrifuge)") +
+  facet_wrap(~ rank)  + 
+  geom_abline(slope = 1, color = "red")
+
+for(level in levels) {
+  print(level)
+  abundance.cmp.rank <- abundance.cmp[abundance.cmp$rank==level,]
+  for(method in c("pearson", "spearman", "kendall")) {
+    print(paste(' ', method))
+    print(paste('  ', cor(abundance.cmp.rank$true, abundance.cmp.rank$calc, method=method)))
+  }
+}
+```
\ No newline at end of file
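The new Rmd above closes with a per-rank correlation loop over the true-vs-calculated abundance table. For readers without R at hand, the same Pearson check can be sketched in Python (illustrative only; the file name and the `taxID/true/calc/rank` column layout are taken from the `abundance.cmp` output written by the evaluation script):

```python
import csv
from math import sqrt

def pearson(xs, ys):
    # Plain Pearson correlation; mirrors cor(..., method = "pearson") in the Rmd.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlations_by_rank(path):
    # Read the tab-delimited comparison file (taxID, true, calc, rank)
    # and compute one correlation per taxonomic rank.
    rows = list(csv.DictReader(open(path), delimiter="\t"))
    ranks = sorted({r["rank"] for r in rows})
    return {rank: pearson([float(r["true"]) for r in rows if r["rank"] == rank],
                          [float(r["calc"]) for r in rows if r["rank"] == rank])
            for rank in ranks}
```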
diff --git a/evaluation/test/centrifuge_evaluate_mason.py b/evaluation/test/centrifuge_evaluate_mason.py
index 2e63799..998a681 100755
--- a/evaluation/test/centrifuge_evaluate_mason.py
+++ b/evaluation/test/centrifuge_evaluate_mason.py
@@ -45,7 +45,7 @@ def compare_scm(centrifuge_out, true_out, taxonomy_tree, rank):
         if first:
             first = False
             continue
-        read_name, seq_id, tax_id, score, _, _, _ = line.strip().split('\t')
+        read_name, seq_id, tax_id, score, _, _, _, _ = line.strip().split('\t')
 
         # Traverse up taxonomy tree to match the given rank parameter
         rank_tax_id = tax_id
@@ -87,6 +87,21 @@ def compare_scm(centrifuge_out, true_out, taxonomy_tree, rank):
         read_name, tax_id = fields[1:3] 
         # Traverse up taxonomy tree to match the given rank parameter
         rank_tax_id = tax_id
+        if rank != "strain":
+            while True:
+                if tax_id not in taxonomy_tree:
+                    rank_tax_id = ""
+                    break
+                parent_tax_id, cur_rank = taxonomy_tree[tax_id]
+                if cur_rank == rank:
+                    rank_tax_id = tax_id
+                    break
+                if tax_id == parent_tax_id:
+                    rank_tax_id = ""
+                    break
+                tax_id = parent_tax_id
+        if rank_tax_id == "":
+            continue
         if read_name not in db_dic:
             unclassified += 1
             continue
@@ -110,15 +125,14 @@ def compare_scm(centrifuge_out, true_out, taxonomy_tree, rank):
 
 """
 """
-def evaluate():
+def evaluate(index_base,
+             ranks,
+             verbose,
+             debug):
     # Current script directory
     curr_script = os.path.realpath(inspect.getsourcefile(evaluate))
     path_base = os.path.dirname(curr_script) + "/.."
 
-    # index_base = "b_compressed"
-    index_base = "b+h+v"
-    # index_base = "centrifuge_Dec_Bonly"
-
     def check_files(fnames):
         for fname in fnames:
             if not os.path.exists(fname):
@@ -140,7 +154,7 @@ def evaluate():
     tax_ids = set()
     tax_cmd = [centrifuge_inspect,
                "--conversion-table",
-               "%s/%s" % (index_path, index_base)]
+               "%s/%s" % (index_path, "b+h+v")]
     tax_proc = subprocess.Popen(tax_cmd, stdout=subprocess.PIPE)
     for line in tax_proc.stdout:
         _, tax_id = line.strip().split()
@@ -150,12 +164,14 @@ def evaluate():
     # Read taxonomic tree
     tax_tree_cmd = [centrifuge_inspect,
                     "--taxonomy-tree",
-                    "%s/%s" % (index_path, index_base)]    
+                    "%s/%s" % (index_path, "b+h+v")]    
     tax_tree_proc = subprocess.Popen(tax_tree_cmd, stdout=subprocess.PIPE, stderr=open("/dev/null", 'w'))
     taxonomy_tree = read_taxonomy_tree(tax_tree_proc.stdout)
 
-    read_fname = "bacteria_sim10K.fa"
-    scm_fname = "bacteria_sim10K.truth"
+    compressed = (index_base.find("compressed") != -1) or (index_base == "centrifuge_Dec_Bonly")
+
+    read_fname = "centrifuge_data/bacteria_sim10K/bacteria_sim10K.fa"
+    scm_fname = "centrifuge_data/bacteria_sim10K/bacteria_sim10K.truth_species"
     read_fnames = [read_fname, scm_fname]
 
     program_bin_base = "%s/.." % path_base
@@ -167,21 +183,182 @@ def evaluate():
                       "%s/%s" % (index_path, index_base),
                       read_fname]
 
-    print >> sys.stderr, '\t'.join(centrifuge_cmd)
+    if verbose:
+        print >> sys.stderr, ' '.join(centrifuge_cmd)
+
+    out_fname = "centrifuge.output"
+    proc = subprocess.Popen(centrifuge_cmd, stdout=open(out_fname, "w"), stderr=subprocess.PIPE)
+    proc.communicate()
+
+    results = {"strain"  : [0, 0, 0],
+               "species" : [0, 0, 0],
+               "genus"   : [0, 0, 0],
+               "family"  : [0, 0, 0],
+               "order"   : [0, 0, 0],
+               "class"   : [0, 0, 0],
+               "phylum"  : [0, 0, 0]}
+    for rank in ranks:
+        if compressed and rank == "strain":
+            continue
+
+        classified, unique_classified, unclassified, raw_classified, raw_unique_classified = \
+            compare_scm(out_fname, scm_fname, taxonomy_tree, rank)
+        results[rank] = [classified, unique_classified, unclassified]
+        num_cases = classified + unclassified
+        # if rank == "strain":
+        #    assert num_cases == num_fragment
+
+        print >> sys.stderr, "\t\t%s" % rank
+        print >> sys.stderr, "\t\t\tsensitivity: {:,} / {:,} ({:.2%})".format(classified, num_cases, float(classified) / num_cases)
+        print >> sys.stderr, "\t\t\tprecision  : {:,} / {:,} ({:.2%})".format(classified, raw_classified, float(classified) / raw_classified)
+        print >> sys.stderr, "\n\t\t\tfor uniquely classified "
+        print >> sys.stderr, "\t\t\t\t\tsensitivity: {:,} / {:,} ({:.2%})".format(unique_classified, num_cases, float(unique_classified) / num_cases)
+        print >> sys.stderr, "\t\t\t\t\tprecision  : {:,} / {:,} ({:.2%})".format(unique_classified, raw_unique_classified, float(unique_classified) / raw_unique_classified)
+
+        # Calculate sum of squared residuals in abundance
+        """
+        if rank == "strain":
+            abundance_SSR = compare_abundance("centrifuge_report.tsv", truth_fname, taxonomy_tree, debug)
+            print >> sys.stderr, "\t\t\tsum of squared residuals in abundance: {}".format(abundance_SSR)
+        """
+
+    # calculate true abundance
+    true_abundance = {}
+    total_sum = 0.0
+    num_genomes, num_species = 0, 0
+    for line in open("abundance.txt"):
+        seqID, taxID, genomeSize, reads, reads10K, genomeName = line.strip().split(',')[:6]
+        genomeSize, reads, reads10K = int(genomeSize), int(reads), int(reads10K)
+        if reads <= 0:
+            continue
+        num_genomes += 1
+        while True:
+            if taxID not in taxonomy_tree:
+                rank_taxID = ""
+                break
+            parent_taxID, rank = taxonomy_tree[taxID]
+            if rank == "species":
+                rank_taxID = taxID
+                break
+            if taxID == parent_taxID:
+                rank_taxID = ""
+                break
+            taxID = parent_taxID
+        if rank_taxID == "":
+            continue
+        assert rank == "species"
+        num_species += 1
+        total_sum += (reads / float(genomeSize))
+        if rank_taxID not in true_abundance:
+            true_abundance[rank_taxID] = 0.0
+        true_abundance[rank_taxID] += (reads / float(genomeSize))
+    for taxID, reads in true_abundance.items():
+        true_abundance[taxID] /= total_sum
+
+    print >> sys.stderr, "number of genomes:", num_genomes
+    print >> sys.stderr, "number of species:", num_species
+    print >> sys.stderr, "number of uniq species:", len(true_abundance)
+
+    read_fname = "centrifuge_data/bacteria_sim10M/bacteria_sim10M.fa"
+    summary_fname = "centrifuge.summary"
+    centrifuge_cmd = ["%s/centrifuge" % program_bin_base,
+                      # "-k", "20",
+                      # "--min-hitlen", "15",
+                      "--report-file", summary_fname,
+                      "-f",
+                      "-p", "3",
+                      "%s/%s" % (index_path, index_base),
+                      read_fname]
+
+    if verbose:
+        print >> sys.stderr, ' '.join(centrifuge_cmd)
 
     out_fname = "centrifuge.output"
     proc = subprocess.Popen(centrifuge_cmd, stdout=open(out_fname, "w"), stderr=subprocess.PIPE)
     proc.communicate()
 
-    classified, unique_classified, unclassified, raw_classified, raw_unique_classified = \
-        compare_scm(out_fname, scm_fname, taxonomy_tree, "genus")
-    num_cases = classified + unclassified
+    calc_abundance = {}
+    for taxID in true_abundance.keys():
+        calc_abundance[taxID] = 0.0
+    first = True
+    for line in open(summary_fname):
+        if first:
+            first = False
+            continue
+        name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance = line.strip().split("\t")
+        genomeSize, numReads, numUniqueReads, abundance = int(genomeSize), int(numReads), int(numUniqueReads), float(abundance)
+        calc_abundance[taxID] = abundance
+
+        # DK - for debugging purposes
+        """
+        if taxID in true_abundance:
+            print >> sys.stderr, "%s: %.6f vs. %.6f" % (taxID, abundance, true_abundance[taxID])
+        """
+
+    abundance_file = open("abundance.cmp", 'w')
+    print >> abundance_file, "taxID\ttrue\tcalc\trank"
+    for rank in ranks:
+        if rank == "strain":
+            continue
+        true_abundance_rank, calc_abundance_rank = {}, {}
+        for taxID in true_abundance.keys():
+            assert taxID in calc_abundance
+            rank_taxID = taxID
+            while True:
+                if rank_taxID not in taxonomy_tree:
+                    rank_taxID = ""
+                    break
+                parent_taxID, cur_rank = taxonomy_tree[rank_taxID]
+                if cur_rank == rank:
+                    break
+                if rank_taxID == parent_taxID:
+                    rank_taxID = ""
+                    break
+                rank_taxID = parent_taxID
+            if rank_taxID not in true_abundance_rank:
+                true_abundance_rank[rank_taxID] = 0.0
+                calc_abundance_rank[rank_taxID] = 0.0
+            true_abundance_rank[rank_taxID] += true_abundance[taxID]
+            calc_abundance_rank[rank_taxID] += calc_abundance[taxID]
+
+        ssr = 0.0 # Sum of Squared Residuals
+        for taxID in true_abundance_rank.keys():
+            assert taxID in calc_abundance_rank
+            ssr += (true_abundance_rank[taxID] - calc_abundance_rank[taxID]) ** 2
+            print >> abundance_file, "%s\t%.6f\t%.6f\t%s" % (taxID, true_abundance_rank[taxID], calc_abundance_rank[taxID], rank)
+        print >> sys.stderr, "%s) Sum of squared residuals: %.6f" % (rank, ssr)
+    abundance_file.close()
 
-    print >> sys.stderr, "\t\t\tsensitivity: {:,} / {:,} ({:.2%})".format(classified, num_cases, float(classified) / num_cases)
-    print >> sys.stderr, "\t\t\tprecision  : {:,} / {:,} ({:.2%})".format(classified, raw_classified, float(classified) / raw_classified)
-    print >> sys.stderr, "\t\t\t\t\tsensitivity: {:,} / {:,} ({:.2%})".format(unique_classified, num_cases, float(unique_classified) / num_cases)
-    print >> sys.stderr, "\t\t\t\t\tprecision  : {:,} / {:,} ({:.2%})".format(unique_classified, raw_unique_classified, float(unique_classified) / raw_unique_classified)
 
 
 if __name__ == "__main__":
-    evaluate()
+    parser = ArgumentParser(
+        description='Centrifuge evaluation on Mason simulated reads')
+    parser.add_argument("--index-base",
+                        type=str,
+                        default="b_compressed",
+                        help='Centrifuge index such as b_compressed, b+h+v, and centrifuge_Dec_Bonly (default: b_compressed)')
+    rank_list_default = "strain,species,genus,family,order,class,phylum"
+    parser.add_argument("--rank-list",
+                        dest="ranks",
+                        type=str,
+                        default=rank_list_default,
+                        help="A comma-separated list of ranks (default: %s)" % rank_list_default)
+    parser.add_argument('-v', '--verbose',
+                        dest='verbose',
+                        action='store_true',
+                        help='also print some statistics to stderr')
+    parser.add_argument('--debug',
+                        dest='debug',
+                        action='store_true',
+                        help='Debug')
+
+    args = parser.parse_args()
+    if not args.index_base:
+        parser.print_help()
+        exit(1)
+    ranks = args.ranks.split(',')
+    evaluate(args.index_base,
+             ranks,
+             args.verbose,
+             args.debug)
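The patch above repeats the same walk-up-the-tree loop in three places (read classification, true abundance, and per-rank abundance aggregation). The shared logic can be factored into one helper; a sketch using the same tree representation the script already uses (`tax_id -> (parent_tax_id, rank)`), not part of the committed code:

```python
def taxid_at_rank(tax_id, rank, taxonomy_tree):
    """Walk up the taxonomy tree until a node of the requested rank is found.

    taxonomy_tree maps tax_id -> (parent_tax_id, rank); returns "" when the
    walk falls off the tree or reaches the root without hitting the rank.
    """
    while True:
        if tax_id not in taxonomy_tree:
            return ""
        parent_tax_id, cur_rank = taxonomy_tree[tax_id]
        if cur_rank == rank:
            return tax_id
        if tax_id == parent_tax_id:  # the root points to itself
            return ""
        tax_id = parent_tax_id
```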
diff --git a/fast_mutex.h b/fast_mutex.h
old mode 100755
new mode 100644
diff --git a/indices/Makefile b/indices/Makefile
index f120e55..95cfb35 100644
--- a/indices/Makefile
+++ b/indices/Makefile
@@ -34,14 +34,14 @@ OPTIONS:
 
 STANDARD TARGETS:
 
-    b_compressed        Download all bacteria genomes from RefSeq,
+    p_compressed        Download all bacteria genomes from RefSeq,
                         and compresses them at the species level
 
-    b_compressed+h+v    b_compressed + human genome and transcripts,
+    p_compressed+h+v    p_compressed + human genome and transcripts,
                         contaminant sequences from UniVec and EmVec,
                         and all viral genomes
 
-    b+h+v               As above, but with uncompressed bacterial genomes
+    p+h+v               As above, but with uncompressed bacterial genomes
 
 Alternatively, a IDX_NAME and one or more genomes may be specified as
 options to build a custom database.
@@ -57,10 +57,10 @@ EXTENDED OPTIONS:
 EXAMPLES:
 	# Make an index with all complete bacterial and archaeal genomes, and compress
 	# the bacterial genomes to the species level
-	make b_compressed
+	make p_compressed
 
 	# same as:
-	make COMPLETE_GENOMES=archaea COMPLETE_GENOMES_COMPRESSED=bacteria IDX_NAME=b_compressed
+	make COMPLETE_GENOMES=archaea COMPLETE_GENOMES_COMPRESSED=bacteria IDX_NAME=p_compressed
 	
 	# Make an index with just the human genome
 	make IDX_NAME=h MAMMALIAN_TAXIDS=9606
@@ -79,22 +79,22 @@ all:
 
 IDX_NAME?=$(shell basename $(shell dirname $(abspath $(lastword $(MAKEFILE_LIST)))))
 
-INDICES=b+h+v b_compressed b_compressed+h+v refseq_microbial refseq_full nt
+INDICES=p+h+v p_compressed p_compressed+h+v refseq_microbial refseq_full nt
 
-b+h+v: export COMPLETE_GENOMES:=archaea bacteria viral
-b+h+v: export MAMMALIAN_TAXIDS:=9606
-b+h+v: export INCLUDE_CONTAMINANTS:=1
-b+h+v: export IDX_NAME:=b+h+v
+p+h+v: export COMPLETE_GENOMES:=archaea bacteria viral
+p+h+v: export MAMMALIAN_TAXIDS:=9606
+p+h+v: export INCLUDE_CONTAMINANTS:=1
+p+h+v: export IDX_NAME:=p+h+v
 
-b_compressed: export COMPLETE_GENOMES:=
-b_compressed: export COMPLETE_GENOMES_COMPRESSED:=archaea bacteria
-b_compressed: export IDX_NAME:=b_compressed
+p_compressed: export COMPLETE_GENOMES:=
+p_compressed: export COMPLETE_GENOMES_COMPRESSED:=archaea bacteria
+p_compressed: export IDX_NAME:=p_compressed
 
-b_compressed+h+v: export COMPLETE_GENOMES:=viral
-b_compressed+h+v: export COMPLETE_GENOMES_COMPRESSED:=archaea bacteria
-b_compressed+h+v: export MAMMALIAN_TAXIDS:=9606
-b_compressed+h+v: export INCLUDE_CONTAMINANTS:=1
-b_compressed+h+v: export IDX_NAME:=b_compressed+h+v
+p_compressed+h+v: export COMPLETE_GENOMES:=viral
+p_compressed+h+v: export COMPLETE_GENOMES_COMPRESSED:=archaea bacteria
+p_compressed+h+v: export MAMMALIAN_TAXIDS:=9606
+p_compressed+h+v: export INCLUDE_CONTAMINANTS:=1
+p_compressed+h+v: export IDX_NAME:=p_compressed+h+v
 
 refseq_microbial: export COMPLETE_GENOMES:=archaea bacteria fungi protozoa viral
 refseq_microbial: export CHROMOSOME_LEVEL_GENOMES:=$(COMPLETE_GENOMES)
@@ -128,9 +128,12 @@ TAXID_MAPS=$(call get_ref_file_names,$(TAXID_SUFFIX))
 CF_BUILD_OPTS?= 
 
 ifeq (nt,$(IDX_NAME))
+ifeq ($(strip $(DONT_DUSTMASK)),)
+REFERENCE_SEQUENCES+=nt-dusted.fna
+else
 REFERENCE_SEQUENCES+=nt-sorted.fna
+endif
 TAXID_MAPS+=nt.map
-TAXONOMY_DOWNLOAD_OPTS+=-g
 CF_BUILD_OPTS+=--ftabchars=14
 endif
 
@@ -166,16 +169,28 @@ $(REFERENCE_SEQUENCES_DIR):
 #	$(MAKE) -f $(THIS_FILE) $(patsubst %$(TAXID_SUFFIX),%.fna, $@)
 
 nt.gz:
-	wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
+	curl -o nt.gz ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
 
 nt.fna: nt.gz
 	gunzip -c nt.gz > nt.fna
 
-nt.map:
-	wget -qO- ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz | gunzip -c | sed 's/^/gi|/' > nt.map
+accession2taxid/nucl_gb.accession2taxid.gz:
+	mkdir -p accession2taxid
+	curl ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz > accession2taxid/nucl_gb.accession2taxid.gz
+
+accession2taxid/nucl_wgs.accession2taxid.gz:
+	mkdir -p accession2taxid
+	curl ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz > accession2taxid/nucl_wgs.accession2taxid.gz
 
-nt-sorted.fna: nt.fna nt.map
-	centrifuge-sort-nt.pl nt.fna nt.map > nt-sorted.fna
+nt.map: nt-sorted.fna
+
+nt-sorted.fna: nt.fna accession2taxid/nucl_gb.accession2taxid.gz accession2taxid/nucl_wgs.accession2taxid.gz
+	centrifuge-sort-nt.pl -m nt.map -a nt-acs-wo-mapping.txt \
+		nt.fna accession2taxid/nucl_gb.accession2taxid.gz accession2taxid/nucl_wgs.accession2taxid.gz \
+		> nt-sorted.fna
+
+nt-dusted.fna: nt-sorted.fna | .dustmasker-ok
+	 dustmasker -infmt fasta -in nt-sorted.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt-dusted.fna
 
 $(REFERENCE_SEQUENCES_DIR)/mammalian-reference-%.fna: | $(REFERENCE_SEQUENCES_DIR)
 	@[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)
@@ -184,8 +199,8 @@ $(REFERENCE_SEQUENCES_DIR)/mammalian-reference-%.fna: | $(REFERENCE_SEQUENCES_DI
 	cat $(TMP_DIR)/vertebrate_mammalian/*.fna > $@.tmp && mv $@.tmp $@
 	mv $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@)) $(patsubst %.fna,%$(TAXID_SUFFIX),$@)
 ifeq (1,$(KEEP_FILES))
-	[[ -d $(DL_DIR)/verterbrate_mammalian ]] || mkdir -p $(DL_DIR)/verterbrate_mammalian
-	mv $(TMP_DIR)/verterbrate_mammalian/* $(DL_DIR)/vertebrate_mammalian
+	[[ -d $(DL_DIR)/vertebrate_mammalian ]] || mkdir -p $(DL_DIR)/vertebrate_mammalian
+	mv $(TMP_DIR)/vertebrate_mammalian/* $(DL_DIR)/vertebrate_mammalian
 else
 	rm -rf $(TMP_DIR)
 endif
@@ -255,13 +270,12 @@ endif
 endif
 	
 
-taxonomy/names.dmp: | taxonomy
-taxonomy/nodes.dmp: | taxonomy
-
-taxonomy: | .path-ok
+taxonomy/names.dmp: taxonomy/nodes.dmp
+taxonomy/nodes.dmp: | .path-ok
 	[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)
 	centrifuge-download $(TAXONOMY_DOWNLOAD_OPTS) -o $(TMP_DIR)/taxonomy taxonomy
-	mv $(TMP_DIR)/taxonomy . && rmdir $(TMP_DIR)
+	mkdir -p taxonomy
+	mv $(TMP_DIR)/taxonomy/* taxonomy && rmdir $(TMP_DIR)/taxonomy && rmdir $(TMP_DIR)
 
 $(IDX_NAME).1.cf: $(REFERENCE_SEQUENCES) $(SIZE_TABLES) $(TAXID_MAPS) taxonomy/nodes.dmp taxonomy/names.dmp | .path-ok
 	@echo Index building prerequisites: $^
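The Makefile changes above replace the retired GI-based `gi_taxid_nucl.dmp` mapping with NCBI's accession2taxid dumps, which `centrifuge-sort-nt.pl` now joins against the nt FASTA to produce `nt.map`. A minimal sketch of that join step in Python (illustrative; the real work is done by the Perl script, and only the standard four-column `accession / accession.version / taxid / gi` dump layout is assumed):

```python
import gzip

def load_accession2taxid(paths, wanted):
    """Build an accession.version -> taxID map restricted to accessions
    actually present in nt, from gzipped NCBI accession2taxid dumps."""
    mapping = {}
    for path in paths:
        with gzip.open(path, "rt") as fh:
            next(fh)  # skip the "accession accession.version taxid gi" header
            for line in fh:
                _acc, acc_ver, taxid, _gi = line.rstrip("\n").split("\t")
                if acc_ver in wanted:
                    mapping[acc_ver] = taxid
    return mapping
```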
diff --git a/opts.h b/opts.h
index cd00d40..8d06c87 100644
--- a/opts.h
+++ b/opts.h
@@ -170,6 +170,8 @@ enum {
     ARG_NO_TRAVERSE,             // --no-traverse
     ARG_CLASSIFICATION_RANK,
     ARG_EXCLUDE_TAXIDS,
+    ARG_OUT_FMT,
+    ARG_TAB_FMT_COLS,
 #ifdef USE_SRA
     ARG_SRA_ACC,
 #endif
diff --git a/ref_read.cpp b/ref_read.cpp
index 40dd454..81451cf 100644
--- a/ref_read.cpp
+++ b/ref_read.cpp
@@ -29,7 +29,8 @@ RefRecord fastaRefReadSize(
 	FileBuf& in,
 	const RefReadInParams& rparms,
 	bool first,
-	BitpairOutFileBuf* bpout)
+	BitpairOutFileBuf* bpout,
+	int& n_empty_ref_sequences)
 {
 	int c;
 	static int lastc = '>'; // last character seen
@@ -62,12 +63,13 @@ RefRecord fastaRefReadSize(
 		do {
 			if((c = in.getPastNewline()) == -1) {
 				// No more input
-				cerr << "Warning: Encountered empty reference sequence" << endl;
+				cerr << "Warning: Encountered empty reference sequence at end of file" << endl;
 				lastc = -1;
 				return RefRecord(0, 0, true);
 			}
 			if(c == '>') {
-				cerr << "Warning: Encountered empty reference sequence" << endl;
+				++ n_empty_ref_sequences;
+				//cerr << "Warning: Encountered empty reference sequence" << endl;
 			}
 			// continue until a non-name, non-comment line
 		} while (c == '>');
@@ -187,6 +189,7 @@ RefRecord fastaRefReadSize(
 		}
 		c = in.get();
 	}
+
 	lastc = c;
 	return RefRecord((TIndexOffU)off, (TIndexOffU)len, first);
 }
@@ -282,6 +285,8 @@ fastaRefReadSizes(
 	TIndexOffU unambigTot = 0;
 	size_t bothTot = 0;
 	assert_gt(in.size(), 0);
+	int n_empty_ref_sequences = 0;
+
 	// For each input istream
 	for(size_t i = 0; i < in.size(); i++) {
 		bool first = true;
@@ -290,7 +295,7 @@ fastaRefReadSizes(
 		while(!in[i]->eof()) {
 			RefRecord rec;
 			try {
-				rec = fastaRefReadSize(*in[i], rparms, first, bpout);
+				rec = fastaRefReadSize(*in[i], rparms, first, bpout, n_empty_ref_sequences);
 				if((unambigTot + rec.len) < unambigTot) {
 					throw RefTooLongException();
 				}
@@ -319,6 +324,10 @@ fastaRefReadSizes(
 		assert(!in[i]->eof());
 #endif
 	}
+	if (n_empty_ref_sequences > 0) {
+		cerr << "Warning: Encountered " << n_empty_ref_sequences << " empty reference sequence(s)" << endl;
+	}
+
 	assert_geq(bothTot, 0);
 	assert_geq(unambigTot, 0);
 	return make_pair(
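The ref_read.cpp change above stops warning once per empty reference sequence and instead tallies them, emitting a single summary warning after all inputs are read. The detection logic (a header immediately followed by another header, or by end of file, with no sequence lines in between) can be sketched outside the C++ parser like so (illustrative Python, not the committed code):

```python
def count_empty_fasta_records(lines):
    """Count FASTA records that have a header but no sequence, as the
    patched fastaRefReadSize() now tallies instead of warning per record."""
    empty = 0
    have_header, have_seq = False, False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if have_header and not have_seq:
                empty += 1  # previous record had a name but no bases
            have_header, have_seq = True, False
        elif line:
            have_seq = True
    if have_header and not have_seq:
        empty += 1  # last record in the file was empty
    return empty
```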
diff --git a/taxonomy.h b/taxonomy.h
index 8fe2545..c08dc5d 100644
--- a/taxonomy.h
+++ b/taxonomy.h
@@ -42,6 +42,7 @@ enum {
     RANK_SUPER_PHYLUM,
     RANK_TRIBE,
     RANK_VARIETAS,
+    RANK_LIFE,
     RANK_MAX
 };
 
@@ -59,7 +60,7 @@ struct TaxonomyNode {
 };
 
 struct TaxonomyPathTable {
-    static const size_t nranks = 7;
+    static const size_t nranks = 10;
 
     map<uint64_t, uint32_t> tid_to_pid;  // from taxonomic ID to path ID
     ELList<uint64_t> paths;
@@ -81,6 +82,12 @@ struct TaxonomyPathTable {
                 return 5;
             case RANK_PHYLUM:
                 return 6;
+            case RANK_KINGDOM:
+                return 7;
+            case RANK_SUPER_KINGDOM:
+                return 8;
+            case RANK_DOMAIN:
+                return 9;
             default:
                 return std::numeric_limits<uint8_t>::max();
         }
@@ -90,14 +97,17 @@ struct TaxonomyPathTable {
                     const std::map<uint64_t, TaxonomyNode>& tree)
     {
         map<uint32_t, uint32_t> rank_map;
-        rank_map[RANK_STRAIN]      = 0;
-        rank_map[RANK_SUB_SPECIES] = 0;
-        rank_map[RANK_SPECIES]     = 1;
-        rank_map[RANK_GENUS]       = 2;
-        rank_map[RANK_FAMILY]      = 3;
-        rank_map[RANK_ORDER]       = 4;
-        rank_map[RANK_CLASS]       = 5;
-        rank_map[RANK_PHYLUM]      = 6;
+        rank_map[RANK_STRAIN]        = 0;
+        rank_map[RANK_SUB_SPECIES]   = 0;
+        rank_map[RANK_SPECIES]       = 1;
+        rank_map[RANK_GENUS]         = 2;
+        rank_map[RANK_FAMILY]        = 3;
+        rank_map[RANK_ORDER]         = 4;
+        rank_map[RANK_CLASS]         = 5;
+        rank_map[RANK_PHYLUM]        = 6;
+        rank_map[RANK_KINGDOM]       = 7;
+        rank_map[RANK_SUPER_KINGDOM] = 8;
+        rank_map[RANK_DOMAIN]        = 9;
 
         tid_to_pid.clear();
         paths.clear();
@@ -223,6 +233,7 @@ inline static const char* get_tax_rank_string(uint8_t rank) {
         case RANK_SUPER_PHYLUM:  return "superphylum";
         case RANK_TRIBE:         return "tribe";
         case RANK_VARIETAS:      return "varietas";
+        case RANK_LIFE:          return "life";
         default:                 return "no rank";
     };
 }
@@ -282,6 +293,8 @@ inline static uint8_t get_tax_rank_id(const char* rank) {
         return RANK_TRIBE;
     } else if(strcmp(rank, "varietas") == 0) {
         return RANK_VARIETAS;
+    } else if(strcmp(rank, "life") == 0) {
+        return RANK_LIFE;
     } else {
         return RANK_UNKNOWN;
     }
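The taxonomy.h changes above widen `TaxonomyPathTable` from 7 to 10 fixed rank columns: strain and subspecies still collapse into column 0, and kingdom, superkingdom, and domain gain new columns 7-9, with everything else falling through to a sentinel. The resulting layout, expressed as a plain table (Python for illustration; the rank-name strings follow `get_tax_rank_string`):

```python
# Column layout of the extended TaxonomyPathTable (nranks = 10).
RANK_TO_PATH_COL = {
    "strain": 0, "subspecies": 0,
    "species": 1, "genus": 2, "family": 3,
    "order": 4, "class": 5, "phylum": 6,
    "kingdom": 7, "superkingdom": 8, "domain": 9,
}

def path_column(rank):
    # Ranks outside the table map to a sentinel, matching the
    # uint8_t max returned by rank2pathID's default case.
    return RANK_TO_PATH_COL.get(rank, 255)
```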
diff --git a/tinythread.cpp b/tinythread.cpp
old mode 100755
new mode 100644
diff --git a/tinythread.h b/tinythread.h
old mode 100755
new mode 100644

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/centrifuge.git


