[med-svn] [Git][med-team/mash][upstream] New upstream version 2.2+dfsg
Sascha Steinbiss
gitlab at salsa.debian.org
Wed Jul 31 16:13:42 BST 2019
Sascha Steinbiss pushed to branch upstream at Debian Med / mash
Commits:
5e14db80 by Sascha Steinbiss at 2019-07-31T13:48:06Z
New upstream version 2.2+dfsg
- - - - -
13 changed files:
- README.md
- doc/sphinx/conf.py
- doc/sphinx/data.rst
- doc/sphinx/index.rst
- src/mash/CommandBounds.cpp
- src/mash/CommandDistance.cpp
- src/mash/CommandDistance.h
- src/mash/CommandScreen.cpp
- src/mash/CommandScreen.h
- src/mash/CommandTriangle.cpp
- src/mash/CommandTriangle.h
- src/mash/version.h
- test/ref/screen
Changes:
=====================================
README.md
=====================================
@@ -1,6 +1,6 @@
Mash is normally distributed as a dependency-free binary for Linux or OSX (see
https://github.com/marbl/Mash/releases). This source distribution is intended
-for other operating systems or for development. Mash requires c++11 to build,
-which is available in and GCC >= 4.8 and OSX >= 10.7.
+for other operating systems or for development. Mash requires c++14 to build,
+which is available in and GCC >= 5 and XCode >= 6.
See http://mash.readthedocs.org for more information.
=====================================
doc/sphinx/conf.py
=====================================
@@ -43,7 +43,7 @@ source_suffix = '.rst'
master_doc = 'index'
# General information about the project.
-project = u'mash'
+project = u'Mash'
copyright = u'2015, Brian Ondov, Todd Treangen, Adam Phillippy'
# The version info for the project you're documenting, acts as replacement for
=====================================
doc/sphinx/data.rst
=====================================
@@ -37,24 +37,6 @@ Figure 5:
* `fig5.html <https://obj.umiacs.umd.edu/mash/screen/fig5/fig5.html>`_: Interactive version
* `fig5.tsv <https://obj.umiacs.umd.edu/mash/screen/fig5/fig5.tsv>`_: Source data
-Public data sources
-~~~~~~~~~~~~~~~~~~~
-
-The BLAST ``nr`` database was downloaded from ``ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*``.
-
-HMP data were downloaded from ``ftp://public-ftp.ihmpdcc.org/``, reads from the ``Ilumina/`` directory
-and coding sequences from the ``HMGI/`` directory. Within these folders, sample SRS015937 resides in
-``tongue_dorsum/`` and SRS020263 in ``right_retroauricular_crease/``.
-
-SRA runs downloaded with the `SRA Toolkit <https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/>`_.
-
-RefSeq genomes downloaded from the ``genomes/refseq/`` directory of ``ftp.ncbi.nlm.nih.gov``.
-
-Public data products
-~~~~~~~~~~~~~~~~~~~~
-
-Quebec Polyomavirus is submitted to GenBank as BK010702.
-
Screen of SRA metagenomes vs. RefSeq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -83,3 +65,24 @@ The files are tab separated, with each line beginning with a RefSeq assembly acc
GCF_000001215.4 SRR3401361 SRR3540373
GCF_000001405.36 SRR5127794 ERR1539652 SRR413753 ERR206081
GCF_000001405.38 SRR5127794 ERR1539652 ERR1711677 SRR413753 ERR206081
+
+We also provide simple scripts for searching these files: `search.tar <https://obj.umiacs.umd.edu/mash/screen/search.tar>`_
+
+Public data sources
+~~~~~~~~~~~~~~~~~~~
+
+The BLAST ``nr`` database was downloaded from ``ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*``.
+
+HMP data were downloaded from ``ftp://public-ftp.ihmpdcc.org/``, reads from the ``Ilumina/`` directory
+and coding sequences from the ``HMGI/`` directory. Within these folders, sample SRS015937 resides in
+``tongue_dorsum/`` and SRS020263 in ``right_retroauricular_crease/``.
+
+SRA runs downloaded with the `SRA Toolkit <https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/>`_.
+
+RefSeq genomes downloaded from the ``genomes/refseq/`` directory of ``ftp.ncbi.nlm.nih.gov``.
+
+Public data products
+~~~~~~~~~~~~~~~~~~~~
+
+Quebec Polyomavirus is submitted to GenBank as BK010702.
+
=====================================
doc/sphinx/index.rst
=====================================
@@ -17,6 +17,8 @@ Publication
===========
`Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x. <http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x>`_
+`Mash Screen: High-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. BioRxiv. 2019 Mar. doi: 10.1101/557314 <https://doi.org/10.1101/557314>`_
+
.. toctree::
:maxdepth: 1
=====================================
src/mash/CommandBounds.cpp
=====================================
@@ -85,7 +85,8 @@ int CommandBounds::run() const
if ( cont )
{
- m2j = exp(-k * dists[j]);
+ //m2j = exp(-k * dists[j]);
+ m2j = pow(1.0 - dists[j], k); // binomial model
}
else
{
@@ -114,7 +115,8 @@ int CommandBounds::run() const
if ( cont )
{
- j2m = -1.0 / k * log(je);
+ //j2m = -1.0 / k * log(je);
+ j2m = 1.0 - pow(je, 1. / k);
}
else
{
=====================================
src/mash/CommandDistance.cpp
=====================================
@@ -37,6 +37,7 @@ CommandDistance::CommandDistance()
//addOption("log", Option(Option::Boolean, "L", "Output", "Log scale distances and divide by k-mer size to provide a better analog to phylogenetic distance. The special case of zero shared min-hashes will result in a distance of 1.", ""));
addOption("pvalue", Option(Option::Number, "v", "Output", "Maximum p-value to report.", "1.0", 0., 1.));
addOption("distance", Option(Option::Number, "d", "Output", "Maximum distance to report.", "1.0", 0., 1.));
+ addOption("comment", Option(Option::Boolean, "C", "Output", "Show comment fields with reference/query names (denoted with ':').", "1.0", 0., 1.));
useSketchOptions();
}
@@ -51,6 +52,7 @@ int CommandDistance::run() const
int threads = options.at("threads").getArgumentAsNumber();
bool list = options.at("list").active;
bool table = options.at("table").active;
+ bool comment = options.at("comment").active;
//bool log = options.at("log").active;
double pValueMax = options.at("pvalue").getArgumentAsNumber();
double distanceMax = options.at("distance").getArgumentAsNumber();
@@ -225,13 +227,13 @@ int CommandDistance::run() const
while ( threadPool.outputAvailable() )
{
- writeOutput(threadPool.popOutputWhenAvailable(), table);
+ writeOutput(threadPool.popOutputWhenAvailable(), table, comment);
}
}
while ( threadPool.running() )
{
- writeOutput(threadPool.popOutputWhenAvailable(), table);
+ writeOutput(threadPool.popOutputWhenAvailable(), table, comment);
}
if ( warningCount > 0 && ! parameters.reads )
@@ -242,7 +244,7 @@ int CommandDistance::run() const
return 0;
}
-void CommandDistance::writeOutput(CompareOutput * output, bool table) const
+void CommandDistance::writeOutput(CompareOutput * output, bool table, bool comment) const
{
uint64_t i = output->indexQuery;
uint64_t j = output->indexRef;
@@ -267,7 +269,21 @@ void CommandDistance::writeOutput(CompareOutput * output, bool table) const
}
else if ( pair->pass )
{
- cout << output->sketchRef.getReference(j).name << '\t' << output->sketchQuery.getReference(i).name << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << endl;
+ cout << output->sketchRef.getReference(j).name;
+
+ if ( comment )
+ {
+ cout << ':' << output->sketchRef.getReference(j).comment;
+ }
+
+ cout << '\t' << output->sketchQuery.getReference(i).name;
+
+ if ( comment )
+ {
+ cout << ':' << output->sketchQuery.getReference(i).comment;
+ }
+
+ cout << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << endl;
}
j++;
=====================================
src/mash/CommandDistance.h
=====================================
@@ -85,7 +85,7 @@ public:
private:
- void writeOutput(CompareOutput * output, bool table) const;
+ void writeOutput(CompareOutput * output, bool table, bool comment) const;
};
CommandDistance::CompareOutput * compare(CommandDistance::CompareInput * input);
=====================================
src/mash/CommandScreen.cpp
=====================================
@@ -39,9 +39,9 @@ CommandScreen::CommandScreen()
: Command()
{
name = "screen";
- summary = "Determine whether query sequences are within a larger pool of sequences.";
- description = "Determine how well query sequences are contained within a pool of sequences. The queries must be formatted as a single Mash sketch file (.msh), created with the `mash sketch` command. The <pool> files can be contigs or reads, in fasta or fastq, gzipped or not, and \"-\" can be given for <pool> to read from standard input. The <pool> sequences are assumed to be nucleotides, and will be 6-frame translated if the <queries> are amino acids. The output fields are [identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment], where median-multiplicity is computed for shared hashes, based on the number of observations of those hashes within the pool.";
- argumentString = "<queries>.msh <pool> [<pool>] ...";
+ summary = "Determine whether query sequences are within a larger mixture of sequences.";
+ description = "Determine how well query sequences are contained within a mixture of sequences. The queries must be formatted as a single Mash sketch file (.msh), created with the `mash sketch` command. The <mixture> files can be contigs or reads, in fasta or fastq, gzipped or not, and \"-\" can be given for <mixture> to read from standard input. The <mixture> sequences are assumed to be nucleotides, and will be 6-frame translated if the <queries> are amino acids. The output fields are [identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment], where median-multiplicity is computed for shared hashes, based on the number of observations of those hashes within the mixture.";
+ argumentString = "<queries>.msh <mixture> [<mixture>] ...";
useOption("help");
useOption("threads");
@@ -322,7 +322,7 @@ int CommandScreen::run() const
*/
uint64_t setSize = minHashHeap.estimateSetSize();
- cerr << " Estimated distinct" << (trans ? " (translated)" : "") << " k-mers in pool: " << setSize << endl;
+ cerr << " Estimated distinct" << (trans ? " (translated)" : "") << " k-mers in mixture: " << setSize << endl;
if ( setSize == 0 )
{
@@ -606,7 +606,7 @@ double pValueWithin(uint64_t x, uint64_t setSize, double kmerSpace, uint64_t ske
return 1.;
}
- double r = 1. / (1. + kmerSpace / setSize);
+ double r = double(setSize) / kmerSpace;
#ifdef USE_BOOST
return cdf(complement(binomial(sketchSize, r), x - 1));
=====================================
src/mash/CommandScreen.h
=====================================
@@ -19,6 +19,15 @@
namespace mash {
+struct HashTableEntry
+{
+ HashTableEntry() : count(0) {}
+
+ uint32_t count;
+ std::unordered_set<uint64_t> indices;
+};
+
+//typedef std::unordered_map< uint64_t, HashTableEntry > HashTable;
typedef std::unordered_map< uint64_t, std::unordered_set<uint64_t> > HashTable;
static const std::unordered_map< std::string, char > codons =
=====================================
src/mash/CommandTriangle.cpp
=====================================
@@ -35,6 +35,9 @@ CommandTriangle::CommandTriangle()
useOption("help");
addOption("list", Option(Option::Boolean, "l", "Input", "List input. Lines in each <query> specify paths to sequence files, one per line. The reference file is not affected.", ""));
addOption("comment", Option(Option::Boolean, "C", "Output", "Use comment fields for sequence names instead of IDs.", ""));
+ addOption("edge", Option(Option::Boolean, "E", "Output", "Output edge list instead of Phylip matrix, with fields [seq1, seq2, dist, p-val, shared-hashes].", ""));
+ addOption("pvalue", Option(Option::Number, "v", "Output", "Maximum p-value to report in edge list. Implies -" + getOption("edge").identifier + ".", "1.0", 0., 1.));
+ addOption("distance", Option(Option::Number, "d", "Output", "Maximum distance to report in edge list. Implies -" + getOption("edge").identifier + ".", "1.0", 0., 1.));
//addOption("log", Option(Option::Boolean, "L", "Output", "Log scale distances and divide by k-mer size to provide a better analog to phylogenetic distance. The special case of zero shared min-hashes will result in a distance of 1.", ""));
useSketchOptions();
}
@@ -50,8 +53,16 @@ int CommandTriangle::run() const
int threads = options.at("threads").getArgumentAsNumber();
bool list = options.at("list").active;
//bool log = options.at("log").active;
- double pValueMax = 0;
bool comment = options.at("comment").active;
+ bool edge = options.at("edge").active;
+ double pValueMax = options.at("pvalue").getArgumentAsNumber();
+ double distanceMax = options.at("distance").getArgumentAsNumber();
+ double pValuePeakToSet = 0;
+
+ if ( options.at("pvalue").active || options.at("distance").active )
+ {
+ edge = true;
+ }
Sketch::Parameters parameters;
@@ -109,27 +120,33 @@ int CommandTriangle::run() const
}
}
- cout << '\t' << sketch.getReferenceCount() << endl;
- cout << (comment ? sketch.getReference(0).comment : sketch.getReference(0).name) << endl;
+ if ( !edge )
+ {
+ cout << '\t' << sketch.getReferenceCount() << endl;
+ cout << (comment ? sketch.getReference(0).comment : sketch.getReference(0).name) << endl;
+ }
ThreadPool<TriangleInput, TriangleOutput> threadPool(compare, threads);
for ( uint64_t i = 1; i < sketch.getReferenceCount(); i++ )
{
- threadPool.runWhenThreadAvailable(new TriangleInput(sketch, i, parameters));
+ threadPool.runWhenThreadAvailable(new TriangleInput(sketch, i, parameters, distanceMax, pValueMax));
while ( threadPool.outputAvailable() )
{
- writeOutput(threadPool.popOutputWhenAvailable(), comment, pValueMax);
+ writeOutput(threadPool.popOutputWhenAvailable(), comment, edge, pValuePeakToSet);
}
}
while ( threadPool.running() )
{
- writeOutput(threadPool.popOutputWhenAvailable(), comment, pValueMax);
+ writeOutput(threadPool.popOutputWhenAvailable(), comment, edge, pValuePeakToSet);
}
- cerr << "Max p-value: " << pValueMax << endl;
+ if ( !edge )
+ {
+ cerr << "Max p-value: " << pValuePeakToSet << endl;
+ }
if ( warningCount > 0 && ! parameters.reads )
{
@@ -139,23 +156,43 @@ int CommandTriangle::run() const
return 0;
}
-void CommandTriangle::writeOutput(TriangleOutput * output, bool comment, double & pValueMax) const
+void CommandTriangle::writeOutput(TriangleOutput * output, bool comment, bool edge, double & pValuePeakToSet) const
{
- const Sketch & sketch = output->sketch;
- const Sketch::Reference & ref = sketch.getReference(output->index);
-
- cout << (comment ? ref.comment : ref.name);
-
+ const Sketch & sketch = output->sketch;
+ const Sketch::Reference & ref = sketch.getReference(output->index);
+
+ if ( !edge )
+ {
+ cout << (comment ? ref.comment : ref.name);
+ }
+
for ( uint64_t i = 0; i < output->index; i++ )
{
const CommandDistance::CompareOutput::PairOutput * pair = &output->pairs[i];
- cout << '\t' << pair->distance;
- if ( pair->pValue > pValueMax )
- {
- pValueMax = pair->pValue;
- }
+
+ if ( edge )
+ {
+ if ( pair->pass )
+ {
+ const Sketch::Reference & qry = sketch.getReference(i);
+ cout << (comment ? ref.comment : ref.name) << '\t'<< (comment ? qry.comment : qry.name) << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << endl;
+ }
+ }
+ else
+ {
+ cout << '\t' << pair->distance;
+ }
+
+ if ( pair->pValue > pValuePeakToSet )
+ {
+ pValuePeakToSet = pair->pValue;
+ }
+ }
+
+ if ( !edge )
+ {
+ cout << endl;
}
- cout << endl;
delete output;
}
@@ -170,7 +207,7 @@ CommandTriangle::TriangleOutput * compare(CommandTriangle::TriangleInput * input
for ( uint64_t i = 0; i < input->index; i++ )
{
- compareSketches(&output->pairs[i], sketch.getReference(input->index), sketch.getReference(i), sketchSize, sketch.getKmerSize(), sketch.getKmerSpace(), -1., -1.);
+ compareSketches(&output->pairs[i], sketch.getReference(input->index), sketch.getReference(i), sketchSize, sketch.getKmerSize(), sketch.getKmerSpace(), input->maxDistance, input->maxPValue);
}
return output;
=====================================
src/mash/CommandTriangle.h
=====================================
@@ -19,16 +19,20 @@ public:
struct TriangleInput
{
- TriangleInput(const Sketch & sketchNew, uint64_t indexNew, const Sketch::Parameters & parametersNew)
+ TriangleInput(const Sketch & sketchNew, uint64_t indexNew, const Sketch::Parameters & parametersNew, double maxDistanceNew, double maxPValueNew)
:
sketch(sketchNew),
index(indexNew),
- parameters(parametersNew)
+ parameters(parametersNew),
+ maxDistance(maxDistanceNew),
+ maxPValue(maxPValueNew)
{}
const Sketch & sketch;
uint64_t index;
const Sketch::Parameters & parameters;
+ double maxDistance;
+ double maxPValue;
};
struct TriangleOutput
@@ -61,7 +65,7 @@ private:
double pValueMax;
bool comment;
- void writeOutput(TriangleOutput * output, bool comment, double & pValueMax) const;
+ void writeOutput(TriangleOutput * output, bool comment, bool edge, double & pValuePeakToSet) const;
};
CommandTriangle::TriangleOutput * compare(CommandTriangle::TriangleInput * input);
=====================================
src/mash/version.h
=====================================
@@ -4,4 +4,4 @@
//
// See the LICENSE.txt file included with this software for license information.
-static const char * version = "2.1.1";
+static const char * version = "2.2";
=====================================
test/ref/screen
=====================================
@@ -1,3 +1,3 @@
-0.861792 44/1000 1 5.00739e-229 genome1.fna gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome
-0.853596 36/1000 1 1.70479e-184 genome2.fna gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
-0.861792 44/1000 1 5.00739e-229 genome3.fna gi|682117612|gb|CP009273.1| Escherichia coli BW25113, complete genome
+0.861792 44/1000 1 5.00742e-229 genome1.fna gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome
+0.853596 36/1000 1 1.7048e-184 genome2.fna gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
+0.861792 44/1000 1 5.00742e-229 genome3.fna gi|682117612|gb|CP009273.1| Escherichia coli BW25113, complete genome
View it on GitLab: https://salsa.debian.org/med-team/mash/commit/5e14db80cf1008378399b328c70a9ff7b5c730e4
--
View it on GitLab: https://salsa.debian.org/med-team/mash/commit/5e14db80cf1008378399b328c70a9ff7b5c730e4
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20190731/69ae46eb/attachment-0001.html>
More information about the debian-med-commit
mailing list