[med-svn] [mhap] 02/04: Imported Upstream version 2.1+dfsg
Afif Elghraoui
afif at moszumanska.debian.org
Thu Jul 14 05:43:26 UTC 2016
This is an automated email from the git hooks/post-receive script.
afif pushed a commit to branch master
in repository mhap.
commit bcd23e534d91781ee6d52ac4e7a4ebefaa3394a0
Author: Afif Elghraoui <afif at ghraoui.name>
Date: Wed Jul 13 21:15:40 2016 -0700
Imported Upstream version 2.1+dfsg
---
README.md | 6 +-
docs/source/conf.py | 4 +-
docs/source/installation.rst | 22 +-
docs/source/quickstart.rst | 98 +++--
docs/source/utilities.rst | 14 +-
pom.xml | 6 +-
.../edu/umd/marbl/mhap/impl/MinHashSearch.java | 13 +-
.../edu/umd/marbl/mhap/impl/SequenceSketch.java | 6 +-
.../marbl/mhap/impl/SequenceSketchStreamer.java | 33 +-
.../java/edu/umd/marbl/mhap/main/EstimateROC.java | 98 +++--
.../edu/umd/marbl/mhap/main/GetHistogramStats.java | 2 +-
.../edu/umd/marbl/mhap/main/KmerStatSimulator.java | 6 +-
.../java/edu/umd/marbl/mhap/main/MhapMain.java | 115 ++++--
.../edu/umd/marbl/mhap/sketch/FrequencyCounts.java | 226 ++++++++--
.../edu/umd/marbl/mhap/sketch/MinHashSketch.java | 35 +-
.../umd/marbl/mhap/sketch/OrderedNGramHashes.java | 4 +-
.../marbl/mhap/sketch/OrderedNGramHashesOld.java | 460 ---------------------
.../edu/umd/marbl/mhap/utils/ParseOptions.java | 2 +-
src/main/java/edu/umd/marbl/mhap/utils/Utils.java | 59 +--
19 files changed, 512 insertions(+), 697 deletions(-)
diff --git a/README.md b/README.md
index 51b300a..c54debd 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
# MHAP
-MinHash alignment process (MHAP pronounced MAP): locality sensitive hashing to detect overlaps and utilities. This is the development branch, please use the [latest tagged](https://github.com/marbl/MHAP/releases/tag/v2.0).
+MinHash alignment process (MHAP pronounced MAP): locality sensitive hashing to detect overlaps and utilities. This is the development branch, please use the [latest tagged](https://github.com/marbl/MHAP/releases/tag/v2.1).
## Build
@@ -13,10 +13,10 @@ You must have a recent [JDK](http://www.oracle.com/technetwork/java/javase/down
For a quick user-quide, run:
cd target
- java -jar mhap-2.0.jar
+ java -jar mhap-2.1.jar
## Docs
-For the full documentation information please see http://mhap.readthedocs.org/en/
+For the full documentation information please see http://mhap.readthedocs.io/en/latest/
## Cite
- Berlin K, Koren S, Chin CS, Drake PJ, Landolin JM, Phillippy AM [Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing](http://www.nature.com/nbt/journal/v33/n6/abs/nbt.3238.html "nb"). Nature Biotechnology. (2015).
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 0da72ed..ae113f8 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -54,9 +54,9 @@ copyright = u'2014, Sergey Koren and Konstantin Berlin'
# built documents.
#
# The short X.Y version.
-version = '2.0'
+version = '2.1'
# The full version, including alpha/beta/rc tags.
-release = '2.0'
+release = '2.1'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index 283bc04..0ff9a46 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -28,19 +28,19 @@ The pre-compiled version is recommended to users who want to run MHAP, without d
.. code-block:: bash
- $ wget https://github.com/marbl/MHAP/releases/download/v2.0/mhap-2.0.tar.gz
+ $ wget https://github.com/marbl/MHAP/releases/download/v2.1/mhap-2.1.tar.gz
And if ``wget`` not available, you can use ``curl`` instead:
.. code-block:: bash
- $ curl -L https://github.com/marbl/MHAP/releases/download/v2.0/mhap-2.0.tar.gz > mhap-2.0.tar.gz
+ $ curl -L https://github.com/marbl/MHAP/releases/download/v2.1/mhap-2.1.tar.gz > mhap-2.1.tar.gz
Then run
.. code-block:: bash
- $ tar xvzf mhap-2.0.tar.gz
+ $ tar xvzf mhap-2.1.tar.gz
Source
-----------------
@@ -49,7 +49,7 @@ To build the code from the release:
.. code-block:: bash
- $ wget https://github.com/marbl/MHAP/archive/v2.0.zip
+ $ wget https://github.com/marbl/MHAP/archive/v2.1.zip
If you see a certificate not trusted error, you can add the following option to wget:
@@ -61,27 +61,27 @@ And if ``wget`` not available, you can use ``curl`` instead:
.. code-block:: bash
- $ curl -L https://github.com/marbl/MHAP/archive/v2.0.zip > v2.0.zip
+ $ curl -L https://github.com/marbl/MHAP/archive/v2.1.zip > v2.1.zip
-You can also browse the https://github.com/marbl/MHAP/tree/v2.0
+You can also browse the https://github.com/marbl/MHAP/tree/v2.1
and click on Downloads.
Once downloaded, extract to unpack:
.. code-block:: bash
- $ unzip v2.0.zip
+ $ unzip v2.1.zip
-Change to MetAMOS directory:
+Change to MASH directory:
.. code-block:: bash
- $ cd MHAP-2.0
+ $ cd MHAP-2.1
-Once inside the MetAMOS directory, run:
+Once inside the directory, run:
.. code-block:: bash
$ maven install
-This will compile the program and create a target/mhap-2.0.jar file which you can use to run MHAP. The quick-start instructions assume you are in the target directory when running the program. You can also use the target/mhap-2.0.tar file to copy MHAP to a different system or directory. If you would like to run the `validation utilties <utilities.html>`_ you must also download and build the `SSW Library <https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library>`_. Follow the in [...]
+This will compile the program and create a target/mhap-2.1.jar file which you can use to run MHAP. The quick-start instructions assume you are in the target directory when running the program. You can also use the target/mhap-2.1.jar file to copy MHAP to a different system or directory. If you would like to run the `validation utilties <utilities.html>`_ you must also download and build the `SSW Library <https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library>`_. Follow the in [...]
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
index 8b95733..2492065 100644
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
@@ -9,7 +9,7 @@ Running MHAP provides command-line documenation if you run it without parameters
.. code-block:: bash
- $ java -jar mhap-2.0.jar
+ $ java -jar mhap-2.1.jar
MHAP has two main usage modes, the main finds all overlaps between the input sequences. The second only constructs an index which can be subsequently reused.
@@ -18,25 +18,40 @@ Finding overlaps
.. code-block:: bash
- $ java -Xmx32g -server -jar mhap-2.0.jar -s<fasta/dat from/self file> [-q<fasta/dat to file or directory>] [-f<kmer filter list, must be sorted>]
+ $ java -Xmx32g -server -jar mhap-2.1.jar -s<fasta/dat from/self file> [-q<fasta/dat to file or directory>] [-f<kmer filter list, must be sorted>]
-Both the -s and -q options can accept either FastA sequences or binary dat files (generated as described below). The -q option can accept either a file or a directory, in which case all FastA/dat files in the specified directory will be used. By default, only the sequences specified by -s are indexed and the sequences in -q are streamed against the constructed index. Since MHAP is written in Java, the memory usage can be high. Generally, 32GB of RAM is sufficient to index 40K sequences. [...]
+Both the -s and -q options can accept either FastA sequences or binary dat files (generated as described below). The -q option can accept either a file or a directory, in which case all FastA/dat files in the specified directory will be used. By default, only the sequences specified by -s are indexed and the sequences in -q are streamed against the constructed index. Generally, 32GB of RAM is sufficient to index 40K sequences. If you have more sequences, you can partition your data and r [...]
-The optional -f flag provides a file of repetitive k-mers which should not be selected as min-mers. The file is a two-column tab-delimited input specifying the kmer and the fraction of total kmers the k-mer comprises. For example:
+The optional -f flag provides a file of repetitive k-mers which should be biased against selected as min-mers. The file is a two-column tab-delimited input specifying the kmer and the fraction of total kmers the k-mer comprises. For example:
.. code-block:: bash
$ head kmers.ignore
+ 464
GGGGGGGGGGGGG 0.0005
-means the k-mer GGGGGGGGGGG represents 0.05% of the k-mers in the dataset (so if there are 100,000 total k-mers, it occurs 50 times).
+means the k-mer GGGGGGGGGGG represents 0.05% of the k-mers in the dataset (so if there are 100,000 total k-mers, it occurs 50 times). The first line specifies the total number of k-mer entries in the file.
+
+It is also possible to use the k-mer list as a positive selection as was used in `Carvalho et. al. <http://biorxiv.org/content/biorxiv/early/2016/05/14/053256.full.pdf>`_. Specify the k-mer list as above and the flag:
+
+.. code-block:: bash
+
+ --supress-noise 2
+
+which will not allow any k-mer not in in the input file to be a minmer. The k-mers above --filter-threshold will be ignored as repeats.
+
+.. code-block:: bash
+
+ --supress-noise 1
+
+will downweight any k-mer not in the input file to bias against its selection as a minmer. The k-mers above --filter-threshold will be downeighted as repeats.
Constructing binary index
-----------------
.. code-block:: bash
- $ java -Xmx32g -server -jar mhap-2.0.jar -p<directory of fasta files> -q <output directory> [-f<kmer filter list, must be sorted>]
+ $ java -Xmx32g -server -jar mhap-2.1.jar -p<directory of fasta files> -q <output directory> [-f<kmer filter list, must be sorted>]
In this use case, files in the -p directory will be converted to binary sketch files in the -q directory. Subsequent runs using these files (instead of FastA files) will be faster as the sequences no longer need to be sketched, only loaded into memory.
@@ -58,22 +73,55 @@ Options
-----------------
The full list of options is available via command-line help (--help or -h). Below is a list of commonly used options.
- -h Displays the help menu.
- --version Displays the version.
- --pacbio-fast Set all the parameters for the PacBio fast setting. This is the current best guidance, and could change at any time without warning, default = false.
- --pacbio-sensitive Set all the parameters for the PacBio sensitive settings. This is the current best guidance, and could change at any time without warning, default = false.
- --nanopore Set all the parameters for the Nanopore settings. This is the current best guidance, and could change at any time without warning, default = false.
- -k [int], k-mer size used for MinHashing. The k-mer size for second stage filter is seperate, default = 16.
- --num-hashes [int], number of min-mers to be used in MinHashing, default = 512.
- --num-min-matches [int], minimum # min-mer that must be shared before computing second stage filter. Any sequences below that value are considered non-overlapping, default = 3.
- --threshold [double], the threshold cutoff for the second stage sort-merge filter. This is based on the identity score computed from the Jaccard distance of k-mers (size given by ordered-kmer-size) in the overlapping regions, default = 0.78.
- --filter-threshold [double], the cutoff at which the k-mer in the k-mer filter file is considered repetitive. This value for a specific k-mer is specified in the second column in the filter file. If no filter file is provided, this option is ignored, default = 1.0E-5.
- --weighted Perform weighted MinHashing using tf-idf scaling which biases repetitive k-mers to higher hash values. default=false.
- --max-shift [double], region size to the left and right of the estimated overlap, as derived from the median shift and sequence length, where a k-mer matches are still considered valid. Second stage filter only, default = 0.2.
- --min-store-length [int], The minimum length of the read that is stored in the box. Used to filter out short reads from FASTA file, default = 0.
- --no-self Do not compute the overlaps between sequences inside a box. Should be used when the to and from sequences are coming from different files, default = false.
- --num-threads [int], number of threads to use for computation. Typically set to #cores, , default = 8.
- --ordered-kmer-size [int], The size of k-mers used in the ordered second stage filter, , default = 12.
- --ordered-sketch-size [int], The sketch size for second stage filter, default = 1536.
- --store-full-id Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file. Some FASTA files have long IDS, slowing output of results. This options is ignored when using compressed file format, default = false.
- -f [string], k-mer filter file used for filtering out highly repetative k-mers. Must be sorted in descending order of frequency (second column), default = "".
+ Usage 1 (direct execution): java -server -Xmx<memory> -jar <MHAP jar> -s<fasta/dat from/self file> [-q<fasta/dat to file>] [-f<kmer filter list, must be sorted>]
+
+ Usage 2 (generate precomputed binaries): java -server -Xmx<memory> -jar <MHAP jar> -p<directory of fasta files> -q <output directory> [-f<kmer filter list, must be sorted>]
+
+ --filter-threshold, default = 1.0E-5
+ [double], the cutoff at which the k-mer in the k-mer filter file is considered repetitive. This value for a specific k-mer is specified in the second column in the filter file. If no filter file is provided, this option is ignored.
+ --help, default = false
+ Displays the help menu.
+ --max-shift, default = 0.2
+ [double], region size to the left and right of the estimated overlap, as derived from the median shift and sequence length, where a k-mer matches are still considered valid. Second stage filter only.
+ --min-olap-length, default = 116
+ [int], The minimum length of the read that used for overlapping. Used to filter out short reads from FASTA file.
+ --min-store-length, default = 0
+ [int], The minimum length of the read that is stored in the box. Used to filter out short reads from FASTA file.
+ --no-self, default = false
+ Do not compute the overlaps between sequences inside a box. Should be used when the to and from sequences are coming from different files.
+ --no-tf, default = false
+ Do not perform the tf weighing, in the tf-idf weighing.
+ --num-hashes, default = 512
+ [int], number of min-mers to be used in MinHashing.
+ --num-min-matches, default = 3
+ [int], minimum # min-mer that must be shared before computing second stage filter. Any sequences below that value are considered non-overlapping.
+ --num-threads, default = 8
+ [int], number of threads to use for computation. Typically set to #cores.
+ --ordered-kmer-size, default = 12
+ [int] The size of k-mers used in the ordered second stage filter.
+ --ordered-sketch-size, default = 1536
+ [int] The sketch size for second stage filter.
+ --repeat-weight, default = 0.9
+ [double] Repeat suppression strength for tf-idf weighing. <0.0 do unweighted MinHash (version 1.0), >=1.0 do only the tf weighing. To perform no idf weighting, do no supply -f option.
+ --settings, default = 0
+ Set all unset parameters for the default settings. Same defaults are applied to Nanopore and Pacbio reads. 0) None, 1) Default, 2) Fast, 3) Sensitive.
+ --store-full-id, default = false
+ Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file. Some FASTA files have long IDS, slowing output of results. This options is ignored when using compressed file format.
+ --supress-noise, default = 0
+ [int] 0) Does nothing, 1) completely removes any k-mers not specified in the filter file, 2) supresses k-mers not specified in the filter file, similar to repeats.
+ --threshold, default = 0.78
+ [double], the threshold cutoff for the second stage sort-merge filter. This is based on the identity score computed from the Jaccard distance of k-mers (size given by ordered-kmer-size) in the overlapping regions.
+ --version, default = false
+ Displays the version and build time.
+ -f, default = ""
+ k-mer filter file used for filtering out highly repetative k-mers. Must be sorted in descending order of frequency (second column).
+ -h, default = false
+ Displays the help menu.
+ -k, default = 16
+ [int], k-mer size used for MinHashing. The k-mer size for second stage filter is seperate, and cannot be modified.
+ -p, default = ""
+ Usage 2 only. The directory containing FASTA files that should be converted to binary format for storage.
+ -q, default = ""
+ Usage 1: The FASTA file of reads, or a directory of files, that will be compared to the set of reads in the box (see -s). Usage 2: The output directory for the binary formatted dat files.
+ -s, default = ""
+ Usage 1 only. The FASTA or binary dat file (see Usage 2) of reads that will be stored in a box, and that all subsequent reads will be compared to.
diff --git a/docs/source/utilities.rst b/docs/source/utilities.rst
index 59b9470..f041afb 100644
--- a/docs/source/utilities.rst
+++ b/docs/source/utilities.rst
@@ -14,11 +14,11 @@ Assuming you have a mapping of sequences to a truth (such as a reference genome)
.. code-block:: bash
- $ java -cp mhap-2.0.jar edu.umd.marbl.mhap.main.EstimateROC <reference mapping M4> <overlaps M4/MHAP> <fasta of sequences> [minimum overlap length to evaluate] [number of random trials] [use dynamic programming] [verbose]
+ $ java -cp mhap-2.1.jar edu.umd.marbl.mhap.main.EstimateROC <reference mapping M4> <overlaps M4/MHAP> <fasta of sequences> [minimum overlap length to evaluate] [number of random trials] [use dynamic programming] [verbose] [minimum identity of overlap] [maximum different between expected overlap and reported] [load all overlaps]
The default minimum overlap length is 2000 and default number of trials is 10000. This will estimate sensitivity/specificity to within 1%. It can be increased at the expense of runtime. Specifying 0 will examine all possible N^2 overlap pairs.
-If the dynamic programming is turned on (by typing true for the parameter), overlaps not present in the reference mapping will be confirmed if a Smith-Watermann alignment can identify the overlap specified. This step requires the `SSW Library <https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library>`_ to be separately compiled and installed:
+The dynamic programming flag (true/false) will check overlaps not present in the reference mapping by running a Smith-Watermann alignment to identify the overlap specified. This step requires the `SSW Library <https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library>`_ to be separately compiled and installed:
.. code-block:: bash
@@ -28,7 +28,11 @@ If the dynamic programming is turned on (by typing true for the parameter), over
$ cd /full/path/to/mhap/target/lib
$ ln -s /full/path/to/Complete-Striped-Smith-Waterman-Library-master/src/libsswjni.so
-Now, you can run the EstimateROC command above.
+The verbose flag (true/false) enables logging to report true overlaps missing from the result and false-positives where no alignments could be found matching the required thresholds.
+
+The minimum identity of the overlap (0.7 by default) is the lower bound for the sensitivity of an overlapper to evaluate. It is used to select matches to the reference that could be found by the overlapper. It is also used to threshold the minimum identity found by the Smith-Waterman alignment above.
+
+The load all overlaps flag (true/false) will evaluate the specificity and PPV on all overlaps reported by the overlapper if enabled, not only those for good reads (where both reads were mapped to the reference in the truth set).
Simulating Data
-----------------
@@ -37,12 +41,12 @@ MHAP includes a tool to simulate sequencing data with random error as well as es
.. code-block:: bash
- $ java -cp mhap-2.0.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# sequences> <sequence length (bp)> <insertion error rate> <deletion error rate> <substitution error rate> [reference genome]
+ $ java -cp mhap-2.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# sequences> <sequence length (bp)> <insertion error rate> <deletion error rate> <substitution error rate> [reference genome]
The error rates must be between 0 and 1 and are additive. Specifying 10% insertion, 2% deletion, and 1% substitution will result in sequences with a 13% error rate. If no reference sequence is given, completely random sequences are generated and errors added. Otherwise, random sequences are drawn from the reference and errors added. Errors are added randomly with no bias.
.. code-block:: bash
- $ java -cp mhap-2.0.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# trials> <kmer size> <sequence length> <overlap length> <insertion error rate> <deletion error rate> <substitution error rate> [one-sided error] [reference genome] [kmer filter]
+ $ java -cp mhap-2.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# trials> <kmer size> <sequence length> <overlap length> <insertion error rate> <deletion error rate> <substitution error rate> [one-sided error] [reference genome] [kmer filter]
This usage will output a distribution of Jaccard similarity between a pair of overlapping sequences with the specified error rate (when using the specified k-mer size) and two random sequences of the same length. If no reference sequence is given, completely random sequences are generated and errors added, otherwise sequences are drawn from the reference. When one-sided error is specified (by typing true for the parameter), only one of the two sequences will have error simulated, matchin [...]
diff --git a/pom.xml b/pom.xml
index 363549c..008edef 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3,7 +3,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>mhap</groupId>
<artifactId>mhap</artifactId>
- <version>2.0</version>
+ <version>2.1</version>
<name>MinHash Alignment Process</name>
<build>
<resources>
@@ -109,12 +109,12 @@
<dependency>
<groupId>it.unimi.dsi</groupId>
<artifactId>fastutil</artifactId>
- <version>7.0.9</version>
+ <version>7.0.12</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-compress</artifactId>
- <version>1.10</version>
+ <version>1.11</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java b/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java
index 1ea3f2b..5929561 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/MinHashSearch.java
@@ -61,7 +61,7 @@ public final class MinHashSearch extends AbstractMatchSearch
private final Map<SequenceId, SequenceSketch> sequenceVectorsHash;
public MinHashSearch(SequenceSketchStreamer data, int numHashes, int numMinMatches, int numThreads,
- boolean storeResults, int minStoreLength, double maxShift, double acceptScore, double alignmentOffset, double alignmentScore) throws IOException
+ boolean storeResults, int minStoreLength, double maxShift, double acceptScore) throws IOException
{
super(numThreads, storeResults);
@@ -161,12 +161,13 @@ public final class MinHashSearch extends AbstractMatchSearch
//estimate size
long numLookups = this.getNumberSequencesSearched();
long numProcessed = this.numberElementsProcessed.get();
- int mapSize = Math.max(256, (int)(4.0*(double)numLookups/(double)numProcessed));
+ int mapSize = Math.max(256, (int)(4.0*(double)numProcessed/(double)numLookups));
Map<SequenceId, HitCounter> bestSequenceHit = new Object2ObjectOpenHashMap<>(mapSize);
int[] minHashes = minHash.getMinHashArray();
int hashIndex = 0;
+ long additionalProcessed = 0L;
for (Map<Integer,ArrayList<SequenceId>> currHash : this.hashes)
{
ArrayList<SequenceId> currentHashMatchList = currHash.get(minHashes[hashIndex]);
@@ -174,8 +175,7 @@ public final class MinHashSearch extends AbstractMatchSearch
// if some matches exist add them
if (currentHashMatchList != null)
{
- this.numberElementsProcessed.getAndAdd(currentHashMatchList.size());
-
+ additionalProcessed += currentHashMatchList.size();
for (SequenceId sequenceId : currentHashMatchList)
{
bestSequenceHit.compute(sequenceId, (k,v)-> (v==null) ? new HitCounter(1) : v.addHit());
@@ -188,8 +188,9 @@ public final class MinHashSearch extends AbstractMatchSearch
//record the search time
long minHashEndTime = System.nanoTime();
this.minhashSearchTime.getAndAdd(minHashEndTime - startTime);
-
- //record number of hash matches processed
+
+ //record the procssed statistic
+ this.numberElementsProcessed.getAndAdd(additionalProcessed);
this.numberSequencesHit.getAndAdd(bestSequenceHit.size());
// compute the proper counts for all sets and remove below threshold
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java
index 9c9dfcc..d304cdf 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java
@@ -57,7 +57,7 @@ public final class SequenceSketch implements Serializable
public final static int SUBSEQUENCE_SIZE = 50;
public final static int BIT_KMER_SIZE = 7;
- public static SequenceSketch fromByteStream(DataInputStream input, int offset, boolean useAlignment) throws IOException
+ public static SequenceSketch fromByteStream(DataInputStream input, int offset) throws IOException
{
try
{
@@ -100,11 +100,11 @@ public final class SequenceSketch implements Serializable
this.orderedHashes = orderedHashes;
}
- public SequenceSketch(Sequence seq, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize, FrequencyCounts kmerFilter, boolean weighted, boolean useAlignment)
+ public SequenceSketch(Sequence seq, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize, FrequencyCounts kmerFilter, double repeatWeight)
{
this.sequenceLength = seq.length();
this.id = seq.getId();
- this.mainHashes = new MinHashSketch(seq.getSquenceString(), kmerSize, numHashes, kmerFilter, weighted);
+ this.mainHashes = new MinHashSketch(seq.getSquenceString(), kmerSize, numHashes, kmerFilter, repeatWeight);
this.orderedHashes = new OrderedNGramHashes(seq.getSquenceString(), orderedKmerSize, orderedSketchSize);
}
diff --git a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java
index 5b1f603..61fa704 100644
--- a/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java
+++ b/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java
@@ -60,8 +60,8 @@ public class SequenceSketchStreamer
private final AtomicLong numberProcessed;
private final int numHashes;
private final int offset;
- private final boolean weighted;
- private final boolean useAlignment;
+ private final double repeatWeight;
+ private final int minOlapLength;
private final int orderedKmerSize;
private final int orderedSketchSize;
@@ -69,14 +69,15 @@ public class SequenceSketchStreamer
private final boolean readingFasta;
private final ConcurrentLinkedQueue<SequenceSketch> sequenceHashList;
- public SequenceSketchStreamer(String file, int offset, boolean useAlignment) throws FileNotFoundException
+ public SequenceSketchStreamer(String file, int minOlapLength, int offset) throws FileNotFoundException
{
this.fastaData = null;
this.readingFasta = false;
this.sequenceHashList = new ConcurrentLinkedQueue<SequenceSketch>();
this.numberProcessed = new AtomicLong();
this.kmerFilter = null;
- this.weighted = true;
+ this.repeatWeight = 0;
+ this.minOlapLength = minOlapLength;
this.kmerSize = 0;
this.numHashes = 0;
@@ -84,19 +85,19 @@ public class SequenceSketchStreamer
this.orderedSketchSize = 0;
this.readClosed = false;
this.offset = offset;
- this.useAlignment = useAlignment;
this.buffInput = new DataInputStream(new BufferedInputStream(new FileInputStream(file), Utils.BUFFER_BYTE_SIZE));
}
- public SequenceSketchStreamer(String file, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize,
- FrequencyCounts kmerFilter, boolean weighted, int offset, boolean useAlignment) throws IOException
+ public SequenceSketchStreamer(String file, int minOlapLength, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize,
+ FrequencyCounts kmerFilter, double repeatWeight, int offset) throws IOException
{
this.fastaData = new FastaData(file, offset);
this.readingFasta = true;
this.sequenceHashList = new ConcurrentLinkedQueue<SequenceSketch>();
this.numberProcessed = new AtomicLong();
- this.weighted = weighted;
+ this.repeatWeight = repeatWeight;
+ this.minOlapLength = minOlapLength;
this.kmerFilter = kmerFilter;
this.kmerSize = kmerSize;
@@ -106,7 +107,6 @@ public class SequenceSketchStreamer
this.buffInput = null;
this.readClosed = false;
this.offset = offset;
- this.useAlignment = useAlignment;
}
public SequenceSketch dequeue(boolean fwdOnly, ReadBuffer buf) throws IOException
@@ -121,7 +121,12 @@ public class SequenceSketchStreamer
SequenceSketch seqHashes;
if (this.readingFasta)
{
- Sequence seq = this.fastaData.dequeue();
+ Sequence seq;
+ do
+ {
+ seq = this.fastaData.dequeue();
+ }
+ while (seq!=null && seq.length()<this.minOlapLength);
// compute the hashes
seqHashes = null;
@@ -130,6 +135,7 @@ public class SequenceSketchStreamer
if (seqHashes == null)
return false;
+
processAddition(seqHashes);
this.sequenceHashList.add(seqHashes);
@@ -148,13 +154,12 @@ public class SequenceSketchStreamer
{
// read the binary file
seqHashes = readFromBinary(buf, fwdOnly);
- while (seqHashes != null && fwdOnly && !seqHashes.getSequenceId().isForward())
+ while (seqHashes != null && fwdOnly && !seqHashes.getSequenceId().isForward() && seqHashes.getSequenceLength()<this.minOlapLength)
{
seqHashes = readFromBinary(buf, fwdOnly);
}
// do nothing and return
- // record
if (seqHashes == null)
return false;
@@ -228,7 +233,7 @@ public class SequenceSketchStreamer
public SequenceSketch getSketch(Sequence seq)
{
// compute the hashes
- return new SequenceSketch(seq, this.kmerSize, this.numHashes, this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.weighted, this.useAlignment);
+ return new SequenceSketch(seq, this.kmerSize, this.numHashes, this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.repeatWeight);
}
public int getNumberProcessed()
@@ -285,7 +290,7 @@ public class SequenceSketchStreamer
// get as byte array stream
SequenceSketch seqHashes = SequenceSketch.fromByteStream(new DataInputStream(
- new ByteArrayInputStream(byteArray)), this.offset, this.useAlignment);
+ new ByteArrayInputStream(byteArray)), this.offset);
return seqHashes;
}
diff --git a/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java b/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java
index a55be7d..9ad26b7 100755
--- a/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java
@@ -44,6 +44,8 @@ import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
import java.util.stream.Stream;
import edu.umd.marbl.mhap.impl.FastaData;
@@ -65,10 +67,12 @@ public class EstimateROC {
private static double MIN_IDENTITY = 0.70;
private static final double REF_IDENTITY_ADJUSTMENT = 0.1;
private static double MIN_REF_IDENTITY = MIN_IDENTITY + REF_IDENTITY_ADJUSTMENT;
+ private static double MIN_ALIGNMENT_IDENTITY = MIN_IDENTITY - REF_IDENTITY_ADJUSTMENT;
private static double MIN_OVERLAP_DIFFERENCE = 0.30;
private static final int DEFAULT_NUM_TRIALS = 10000;
private static final int DEFAULT_MIN_OVL = 2000;
private static final boolean DEFAULT_DO_DP = false;
+ private static boolean LOAD_ALL = false;
private static boolean DEBUG = false;
private static class Pair {
@@ -187,10 +191,14 @@ public class EstimateROC {
if (args.length > 7) {
MIN_IDENTITY = Double.parseDouble(args[7]);
MIN_REF_IDENTITY = MIN_IDENTITY + REF_IDENTITY_ADJUSTMENT;
+ MIN_ALIGNMENT_IDENTITY = MIN_IDENTITY - REF_IDENTITY_ADJUSTMENT/2;
}
if (args.length > 8) {
MIN_OVERLAP_DIFFERENCE = Double.parseDouble(args[8]);
}
+ if (args.length > 9) {
+ LOAD_ALL = Boolean.parseBoolean(args[9]);
+ }
System.err.println("Running, reference: " + args[0] + " matches: " + args[1]);
System.err.println("Number trials: " + (g.numTrials == 0 ? "all" : g.numTrials));
@@ -198,6 +206,7 @@ public class EstimateROC {
System.err.println("Minimum acceptable %" + MIN_IDENTITY);
System.err.println("Minimum acceptable shift " + MIN_OVERLAP_DIFFERENCE);
System.err.println("Minimum overlap to ref %" + MIN_REF_IDENTITY);
+ System.err.println("Minimum acceptable overlap for dp check %" + MIN_ALIGNMENT_IDENTITY);
// load and cluster reference
System.err.print("Loading reference...");
@@ -293,8 +302,8 @@ public class EstimateROC {
// now initialize matrix
for (int i = 0; i < 128; i++) {
for (int j = 0; j < 128; j++) {
- if (i == j) MATCH_MATRIX[i][j] = 1;
- else MATCH_MATRIX[i][j] = -1;
+ if (i == j) MATCH_MATRIX[i][j] = 2;
+ else MATCH_MATRIX[i][j] = -2;
}
}
} catch (Exception e) {
@@ -338,6 +347,8 @@ public class EstimateROC {
private HashSet<String> getSequenceMatches(String id, int min) {
String chr = this.seqToChr.get(id);
Pair p1 = this.seqToPosition.get(id);
+ if (chr == null || p1 == null) { return null; }
+
List<Integer> intersect = this.clusters.get(chr).get(p1.first,
p1.second);
HashSet<String> result = new HashSet<String>();
@@ -492,7 +503,7 @@ public class EstimateROC {
if (id.equalsIgnoreCase(id2)) {
continue;
}
- if (this.seqToChr.get(id) == null || this.seqToChr.get(id2) == null) {
+ if ((EstimateROC.LOAD_ALL != true) && (this.seqToChr.get(id) == null || this.seqToChr.get(id2) == null)) {
continue;
}
String ovlName = getOvlName(id, id2);
@@ -651,7 +662,7 @@ public class EstimateROC {
}
}
- private static double getScore(jaligner.Alignment alignment, int ovlLen) {
+ private static double getScore(jaligner.Alignment alignment, String qry, String ref) {
char[] sequence1 = alignment.getSequence1();
char[] sequence2 = alignment.getSequence2();
int length = Math.max(sequence1.length, sequence2.length);
@@ -675,25 +686,61 @@ public class EstimateROC {
matches++;
}
}
- return (matches / (double)ovlLen);
+ return (matches / (double)length);
}
- private static double getScore(ssw.Alignment alignment, int ovlLen) {
- // the result is a cigar string of the format 3M1I9M1I9M1D10M1I13M1I9M1D1M2D45M1D6M1D5
- int matches = 0;
- for (int i = 0; i < alignment.cigar.length(); i++) {
- if (alignment.cigar.charAt(i) == 'M') {
- // get the match length
- int s = i-1;
- while (s >= 0 && Character.isDigit(alignment.cigar.charAt(s))) {
- s--;
- }
- s++;
- matches += Integer.parseInt(alignment.cigar.substring(s, i));
- }
- }
- return matches/ (double) ovlLen;
+ private static double getScore(ssw.Alignment alignment, String qry, String ref) {
+ // the result is a cigar string of the format
+ // 3M1I9M1I9M1D10M1I13M1I9M1D1M2D45M1D6M1D50
+ Pattern cigarPattern = Pattern.compile("[\\d]+[a-zA-Z|=]");
+ Matcher matcher = cigarPattern.matcher(alignment.cigar);
+ int errors = 0;
+ int len = 0;
+ int qryPos = alignment.read_begin1;
+ int refPos = alignment.ref_begin1;
+ while (matcher.find()) {
+ String cVal = matcher.group();
+ int cLen = Integer.parseInt(cVal.substring(0, cVal.length() - 1));
+ char cLetter = cVal.toUpperCase().charAt(cVal.length() - 1);
+ switch (cLetter) {
+ case 'H':
+ break;
+ case 'S':
+ case '=':
+ // ignore, not an error
+ len+=cLen;
+ break;
+ case 'M':
+ for (int i = 0; i < cLen; i++) {
+ if (ref.charAt(refPos) != qry.charAt(qryPos)) {
+ errors++;
+ } else {
+ // do nothing
+ }
+ refPos++;
+ qryPos++;
+ }
+ len+=cLen;
+ break;
+ case 'I':
+ errors += cLen;
+ qryPos += cLen;
+ len+=cLen;
+ break;
+ case 'D':
+ errors += cLen;
+ refPos += cLen;
+ len+=cLen;
+ break;
+ default:
+ System.err.println("Error, unknown base " + cLetter);
+ System.exit(1);
+ break;
+ }
+ }
+
+ return 1 - (errors / (double) len);
}
private boolean computeDP(String id, String id2) {
@@ -702,7 +749,7 @@ public class EstimateROC {
}
Overlap ovl = this.ovlInfo.get(getOvlName(id, id2));
- System.err.println("Aligning sequence " + ovl.id1 + " to " + ovl.id2 + " " + ovl.bfirst + " to " + ovl.bsecond + " and " + ovl.isFwd + " and " + ovl.afirst + " " + ovl.asecond);
+ if (DEBUG) System.err.println("Aligning sequence " + ovl.id1 + " to " + ovl.id2 + " " + ovl.bfirst + " to " + ovl.bsecond + " and " + ovl.isFwd + " and " + ovl.afirst + " " + ovl.asecond);
String s1 = this.dataSeq[getSequenceId(ovl.id1)].getSquenceString().substring(ovl.afirst, ovl.asecond);
String s2 = null;
@@ -730,8 +777,8 @@ public class EstimateROC {
} catch (MatrixLoaderException e) {
return false;
}
- score = getScore(alignment, ovlLen);
length = alignment.getLength();
+ score = getScore(alignment, s1, s2);
if (DEBUG) {
System.err.println(alignment.getSummary());
System.err.println("My score: " + score);
@@ -740,15 +787,16 @@ public class EstimateROC {
}
else {
ssw.Alignment alignment = Aligner.align(s1.getBytes(), s2.getBytes(), MATCH_MATRIX, 2, 1, true);
- score = getScore(alignment, ovlLen);
length = Math.max(alignment.read_end1-alignment.read_begin1, alignment.ref_end1 - alignment.ref_begin1);
+ score = getScore(alignment, s1, s2);
if (DEBUG) {
System.err.println(alignment.toString());
+ System.err.println(alignment.read_end1 + " " + alignment.read_begin1 + " " + alignment.ref_end1 + " " + alignment.ref_begin1 + " " + length);
System.err.println("My score: " + score);
}
}
- return (score > MIN_IDENTITY && length > this.minOvlLen);
+ return (score > MIN_ALIGNMENT_IDENTITY && length > this.minOvlLen && 1-((float)length/ovlLen) < MIN_OVERLAP_DIFFERENCE);
}
private void estimateSensitivity() {
@@ -817,7 +865,7 @@ public class EstimateROC {
String id2 = ovl[1];
HashSet<String> matches = getSequenceMatches(id, 0);
- if (matches.contains(id2)) {
+ if (matches != null && matches.contains(id2)) {
numTP.getAndIncrement();
} else {
if (computeDP(id, id2)) {
diff --git a/src/main/java/edu/umd/marbl/mhap/main/GetHistogramStats.java b/src/main/java/edu/umd/marbl/mhap/main/GetHistogramStats.java
index a6224e8..f296263 100755
--- a/src/main/java/edu/umd/marbl/mhap/main/GetHistogramStats.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/GetHistogramStats.java
@@ -44,7 +44,7 @@ public class GetHistogramStats {
public GetHistogramStats(String fileName, double p) {
try {
- BufferedReader bf = Utils.getFile(fileName, "hist");
+ BufferedReader bf = Utils.getFile(fileName, null);
String line = null;
while ((line = bf.readLine()) != null) {
diff --git a/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java b/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
index d873465..6d11188 100644
--- a/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
@@ -139,7 +139,7 @@ public class KmerStatSimulator {
}
private void loadSkipMers(String file) throws Exception {
- BufferedReader bf = Utils.getFile(file, "repeats");
+ BufferedReader bf = Utils.getFile(file, null);
String line = null;
while ((line = bf.readLine()) != null) {
@@ -193,8 +193,8 @@ public class KmerStatSimulator {
}
public double compareMinHash2(String first, String second) {
- MinHashSketch h1 = new MinHashSketch(first, this.kmer, 1256, null, true);
- MinHashSketch h2 = new MinHashSketch(second, this.kmer, 1256, null, true);
+ MinHashSketch h1 = new MinHashSketch(first, this.kmer, 1256, null, 1.0);
+ MinHashSketch h2 = new MinHashSketch(second, this.kmer, 1256, null, 1.0);
return h1.jaccard(h2);
}
diff --git a/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java b/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java
index 1d5b470..43e2d08 100644
--- a/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java
+++ b/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java
@@ -29,6 +29,7 @@
*/
package edu.umd.marbl.mhap.main;
+import java.io.BufferedReader;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
@@ -46,13 +47,12 @@ import edu.umd.marbl.mhap.utils.Utils;
public final class MhapMain
{
private final double acceptScore;
- private final double alignmentOffset;
- private final double alignmentScore;
private final String inFile;
private final FrequencyCounts kmerFilter;
private final int kmerSize;
private final double maxShift;
private final int minStoreLength;
+ private final int minOlapLength;
private final boolean noSelf;
private final int numHashes;
private final int numMinMatches;
@@ -61,15 +61,12 @@ public final class MhapMain
private final int orderedSketchSize;
private final String processFile;
private final String toFile;
- private final boolean useAlignment;
- private final boolean weighted;
+ private final double repeatWeight;
//private static final double DEFAULT_OVERLAP_ACCEPT_SCORE = 0.024;
private static final double DEFAULT_OVERLAP_ACCEPT_SCORE = 0.78;
- private static final double DEFAULT_BIT_ALIGNMENT_MISMATCH_PANELTY = -0.530;
-
- private static final double DEFAULT_BIT_ALIGNMENT_SCORE = 1.0e-6;
+ private static final double DEFAULT_REPEAT_WEIGHT= 0.0;
private static final double DEFAULT_FILTER_CUTOFF = 1.0e-5;
@@ -78,6 +75,8 @@ public final class MhapMain
private static final double DEFAULT_MAX_SHIFT_PERCENT = 0.2;
private static final int DEFAULT_MIN_STORE_LENGTH = 0;
+
+ private static final int DEFAULT_MIN_OVL_LENGTH = DEFAULT_KMER_SIZE+100;
private static final int DEFAULT_NUM_MIN_MATCHES = 3;
@@ -105,36 +104,33 @@ public final class MhapMain
options.addOption("-f", "k-mer filter file used for filtering out highly repetative k-mers. Must be sorted in descending order of frequency (second column).", "");
options.addOption("-k", "[int], k-mer size used for MinHashing. The k-mer size for second stage filter is seperate, and cannot be modified.", DEFAULT_KMER_SIZE);
options.addOption("--num-hashes", "[int], number of min-mers to be used in MinHashing.", DEFAULT_NUM_WORDS);
- //options.addOption("--threshold", "[double], the threshold cutoff for the second stage sort-merge filter. This is based on the Jaccard distance of k-mers (size given by ordered-kmer-size) in the overlapping regions.", DEFAULT_OVERLAP_ACCEPT_SCORE);
options.addOption("--threshold", "[double], the threshold cutoff for the second stage sort-merge filter. This is based on the identity score computed from the Jaccard distance of k-mers (size given by ordered-kmer-size) in the overlapping regions.", DEFAULT_OVERLAP_ACCEPT_SCORE);
options.addOption("--filter-threshold", "[double], the cutoff at which the k-mer in the k-mer filter file is considered repetitive. This value for a specific k-mer is specified in the second column in the filter file. If no filter file is provided, this option is ignored.", DEFAULT_FILTER_CUTOFF);
options.addOption("--max-shift", "[double], region size to the left and right of the estimated overlap, as derived from the median shift and sequence length, where a k-mer matches are still considered valid. Second stage filter only.", DEFAULT_MAX_SHIFT_PERCENT);
options.addOption("--num-min-matches", "[int], minimum # min-mer that must be shared before computing second stage filter. Any sequences below that value are considered non-overlapping.", DEFAULT_NUM_MIN_MATCHES);
options.addOption("--num-threads", "[int], number of threads to use for computation. Typically set to #cores.", DEFAULT_NUM_THREADS);
- options.addOption("--weighted", "Perform weighted MinHashing.", false);
- options.addOption("--alignment", "Experimental option.", false);
- options.addOption("--alignment-offset", "The offset to account for the variance in the alignment match score. Experimental option.", DEFAULT_BIT_ALIGNMENT_MISMATCH_PANELTY);
- options.addOption("--alignment-score", "The cutoff score for alignment matches. Experimental option.", DEFAULT_BIT_ALIGNMENT_SCORE);
- options.addOption("--ordered-kmer-size", "The size of k-mers used in the ordered second stage filter.", DEFAULT_ORDERED_KMER_SIZE);
- options.addOption("--ordered-sketch-size", "The sketch size for second stage filter.", DEFAULT_ORDERED_SKETCH_SIZE);
+ options.addOption("--repeat-weight", "[double] Repeat suppression strength for tf-idf weighing. <0.0 do unweighted MinHash (version 1.0), >=1.0 do only the tf weighing. To perform no idf weighting, do no supply -f option. ", DEFAULT_REPEAT_WEIGHT);
+ options.addOption("--ordered-kmer-size", "[int] The size of k-mers used in the ordered second stage filter.", DEFAULT_ORDERED_KMER_SIZE);
+ options.addOption("--ordered-sketch-size", "[int] The sketch size for second stage filter.", DEFAULT_ORDERED_SKETCH_SIZE);
options.addOption("--min-store-length", "[int], The minimum length of the read that is stored in the box. Used to filter out short reads from FASTA file.", DEFAULT_MIN_STORE_LENGTH);
+ options.addOption("--min-olap-length", "[int], The minimum length of the read that used for overlapping. Used to filter out short reads from FASTA file.", DEFAULT_MIN_OVL_LENGTH);
options.addOption("--no-self", "Do not compute the overlaps between sequences inside a box. Should be used when the to and from sequences are coming from different files.", false);
options.addOption("--store-full-id", "Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file. Some FASTA files have long IDS, slowing output of results. This options is ignored when using compressed file format.", false);
- options.addOption("--pacbio-fast", "Set all the parameters for the PacBio fast setting. This is the current best guidance, and could change at any time without warning.", false);
- options.addOption("--pacbio-sensitive", "Set all the parameters for the PacBio sensitive settings. This is the current best guidance, and could change at any time without warning.", false);
- options.addOption("--nanopore", "Set all the parameters for the Nanopore settings. This is the current best guidance, and could change at any time without warning.", false);
+ options.addOption("--supress-noise", "[int] 0) Does nothing, 1) completely removes any k-mers not specified in the filter file, 2) supresses k-mers not specified in the filter file, similar to repeats. ", 0);
+ options.addOption("--no-tf", "Do not perform the tf weighing, of the tf-idf weighing.", false);
+ options.addOption("--settings", "Set all unset parameters for the default settings. Same defaults are applied to Nanopore and Pacbio reads. 0) None, 1) Default, 2) Fast, 3) Sensitive.", 0);
if (!options.process(args))
System.exit(0);
-
- if (options.get("--pacbio-fast").getBoolean() && options.get("--pacbio-sensitive").getBoolean())
+
+ if (options.get("--settings").getInteger()<0 || options.get("--settings").getInteger()>3)
{
- System.out.println("Two default sequence type parameters cannot be set at the same time.");
+ System.out.println("Please enter valid --settings flag. See options below:");
System.out.println(options.helpMenuString());
System.exit(1);
}
- if (options.get("--pacbio-fast").getBoolean())
+ if (options.get("--settings").getInteger()==1) //default
{
if (!options.get("-k").isSet())
options.setOptions("-k", 16);
@@ -144,21 +140,39 @@ public final class MhapMain
if (!options.get("--num-hashes").isSet())
options.setOptions("--num-hashes", 512);
+
+ if (!options.get("--threshold").isSet())
+ options.setOptions("--threshold", .78);
+
+ if (!options.get("--ordered-sketch-size").isSet())
+ options.setOptions("--ordered-sketch-size", 1536);
+
+ if (!options.get("--ordered-kmer-size").isSet())
+ options.setOptions("--ordered-kmer-size", 12);
}
else
- if (options.get("--pacbio-sensitive").getBoolean())
+ if (options.get("--settings").getInteger()==2) //fast
{
if (!options.get("-k").isSet())
options.setOptions("-k", 16);
if (!options.get("--num-min-matches").isSet())
- options.setOptions("--num-min-matches", 2);
+ options.setOptions("--num-min-matches", 3);
if (!options.get("--num-hashes").isSet())
- options.setOptions("--num-hashes", 768);
- }
+ options.setOptions("--num-hashes", 256);
+
+ if (!options.get("--threshold").isSet())
+ options.setOptions("--threshold", .80);
+
+ if (!options.get("--ordered-sketch-size").isSet())
+ options.setOptions("--ordered-sketch-size", 1000);
+
+ if (!options.get("--ordered-kmer-size").isSet())
+ options.setOptions("--ordered-kmer-size", 14);
+ }
else
- if (options.get("--nanopore").getBoolean())
+ if (options.get("--settings").getInteger()==3) //sensitive
{
if (!options.get("-k").isSet())
options.setOptions("-k", 16);
@@ -170,8 +184,15 @@ public final class MhapMain
options.setOptions("--num-hashes", 768);
if (!options.get("--threshold").isSet())
- options.setOptions("--threshold", 0.70);
- }
+ options.setOptions("--threshold", .73);
+
+ if (!options.get("--ordered-sketch-size").isSet())
+ options.setOptions("--ordered-sketch-size", 1536);
+
+ if (!options.get("--ordered-kmer-size").isSet())
+ options.setOptions("--ordered-kmer-size", 12);
+ }
+
if (options.get("-s").getString().isEmpty() && options.get("-p").getString().isEmpty())
{
@@ -257,6 +278,13 @@ public final class MhapMain
System.exit(1);
}
+ //check range
+ if (options.get("--supress-noise").getInteger()<0 || options.get("--supress-noise").getInteger()>2)
+ {
+ System.out.println("The --supress-noise parameter must be in [0,2].");
+ System.exit(1);
+ }
+
//check other options
//TODO move into the class
if (options.get("--store-full-id").getBoolean())
@@ -289,15 +317,13 @@ public final class MhapMain
this.kmerSize = options.get("-k").getInteger();
this.numMinMatches = options.get("--num-min-matches").getInteger();
this.minStoreLength = options.get("--min-store-length").getInteger();
+ this.minOlapLength = options.get("--min-olap-length").getInteger();
this.maxShift = options.get("--max-shift").getDouble();
this.acceptScore = options.get("--threshold").getDouble();
- this.weighted = options.get("--weighted").getBoolean();
- this.useAlignment = options.get("--alignment").getBoolean();
- this.alignmentOffset = options.get("--alignment-offset").getDouble();
- this.alignmentScore = options.get("--alignment-score").getDouble();
+ this.repeatWeight = options.get("--repeat-weight").getDouble();
this.orderedKmerSize = options.get("--ordered-kmer-size").getInteger();
this.orderedSketchSize = options.get("--ordered-sketch-size").getInteger();
-
+
// read in the kmer filter set
String filterFile = options.get("-f").getString();
@@ -307,13 +333,26 @@ public final class MhapMain
System.err.println("Reading in filter file " + filterFile + ".");
try
{
- this.kmerFilter = Utils.createKmerFilter(filterFile, options.get("--filter-threshold").getDouble(), this.kmerSize, 0);
+ double offset = 0.0;
+ if (this.repeatWeight>=0.0 && this.repeatWeight<1.0)
+ offset = this.repeatWeight;
+
+ double maxFraction = options.get("--filter-threshold").getDouble();
+ int removeUnique = options.get("--supress-noise").getInteger();
+ boolean noTf = options.get("--no-tf").getBoolean();
+
+ try (BufferedReader bf = Utils.getFile(filterFile, null))
+ {
+ this.kmerFilter = new FrequencyCounts(bf, maxFraction, offset, removeUnique, noTf, this.numThreads);
+ }
}
catch (Exception e)
{
throw new MhapRuntimeException("Could not parse k-mer filter file.", e);
}
System.err.println("Time (s) to read filter file: " + (System.nanoTime() - startTime) * 1.0e-9);
+ if (this.kmerFilter!=null)
+ System.err.println("Read in k-mer filter for sizes: " + this.kmerFilter.getKmerSizes());
}
else
{
@@ -499,17 +538,17 @@ public final class MhapMain
public MinHashSearch getMatchSearch(SequenceSketchStreamer hashStreamer) throws IOException
{
return new MinHashSearch(hashStreamer, this.numHashes, this.numMinMatches, this.numThreads, false,
- this.minStoreLength, this.maxShift, this.acceptScore, this.alignmentOffset, this.alignmentScore);
+ this.minStoreLength, this.maxShift, this.acceptScore);
}
public SequenceSketchStreamer getSequenceHashStreamer(String file, int offset) throws IOException
{
SequenceSketchStreamer seqStreamer;
if (file.endsWith(".dat"))
- seqStreamer = new SequenceSketchStreamer(file, offset, this.useAlignment);
+ seqStreamer = new SequenceSketchStreamer(file, this.minOlapLength, offset);
else
- seqStreamer = new SequenceSketchStreamer(file, this.kmerSize, this.numHashes,
- this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.weighted, offset, this.useAlignment);
+ seqStreamer = new SequenceSketchStreamer(file, this.minOlapLength, this.kmerSize, this.numHashes,
+ this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.repeatWeight, offset);
return seqStreamer;
}
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java b/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java
index 523601b..4ba06f4 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java
@@ -28,40 +28,173 @@
*/
package edu.umd.marbl.mhap.sketch;
-import java.util.HashMap;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.ThreadPoolExecutor;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import com.google.common.hash.BloomFilter;
+
+import edu.umd.marbl.mhap.impl.MhapRuntimeException;
+import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
+import it.unimi.dsi.fastutil.longs.Long2DoubleOpenHashMap;
public final class FrequencyCounts
{
private final double filterCutoff;
private final Map<Long,Double> fractionCounts;
+ private final Set<Integer> kmerSizes;
private final double maxIdfValue;
private final double maxValue;
private final double minIdfValue;
private final double minValue;
+ private final boolean noTf;
+ private final double offset;
+ private final int removeUnique;
+ private final BloomFilter<Long> validMers;
+
+ public static final double REPEAT_SCALE = 3.0;
- public FrequencyCounts(Map<Long,Double> fractionCounts, double filterCutoff)
+ public FrequencyCounts(BufferedReader bf, double filterCutoff, double offset, int removeUnique, boolean noTf, int numThreads) throws IOException
{
- this.fractionCounts = new HashMap<>(fractionCounts);
- this.filterCutoff = filterCutoff;
+ if (offset<0.0 || offset>=1.0)
+ throw new MhapRuntimeException("Offset can only be between 0 and 1.0.");
+
+ this.kmerSizes = new IntOpenHashSet();
+ this.removeUnique = removeUnique;
+ this.noTf = noTf;
+
+ // generate hashset
+ Long2DoubleOpenHashMap validMap = new Long2DoubleOpenHashMap();
+ BloomFilter<Long> validMers = null;
+
+ //the max value observed in the list
+ AtomicReference<Double> maxValue = new AtomicReference<Double>(Double.NEGATIVE_INFINITY);
+
+ String line = bf.readLine();
+ try
+ {
+ long size;
+ if (line==null)
+ {
+ System.err.println("Warning, k-mer filter file is empty. Assuming zero entries.");
+ size = 1L;
+ }
+ else
+ {
+ size = Long.parseLong(line);
+
+ if (size<0L)
+ throw new MhapRuntimeException("K-mer filter file size line must have positive long value.");
+ else
+ if (size==0L)
+ {
+ System.err.println("Warning, k-mer filter file has zero elements.");
+ size = 1L;
+ }
+ }
+
+ if (removeUnique>1)
+ validMers = BloomFilter.create((value, sink) -> sink.putLong(value), size, 1.0e-5);
+ }
+ catch (Exception e)
+ {
+ throw new MhapRuntimeException("K-mer filter file first line must contain estimated number of k-mers in the file (long).");
+ }
+
+ final ThreadPoolExecutor executor = new ThreadPoolExecutor(numThreads, numThreads, 100L, TimeUnit.MILLISECONDS,
+ new LinkedBlockingQueue<Runnable>(10000), new ThreadPoolExecutor.CallerRunsPolicy());
- double maxValue = Double.NEGATIVE_INFINITY;
- for (double val : this.fractionCounts.values())
- maxValue = Math.max(maxValue, val);
+ BloomFilter<Long> currValidMers = validMers;
+
+ line = bf.readLine();
+ while (line != null)
+ {
+ String currLine = line;
+
+ executor.submit(() ->
+ {
+ try
+ {
+ String[] str = currLine.split("\\s+", 3);
+
+ if (str.length < 1)
+ throw new MhapRuntimeException("K-mer filter file must have at least one column [k-mer]. Line="+currLine);
+
+ //store the kmer sizes in the list
+ synchronized (this.kmerSizes)
+ {
+ this.kmerSizes.add(str[0].length());
+ }
+
+ long[] hash = HashUtils.computeSequenceHashesLong(str[0], str[0].length(), 0);
+
+ if (str.length >= 2)
+ {
+ double percent = Double.parseDouble(str[1]);
+
+ // if greater, add to hashset
+ if (percent > filterCutoff)
+ {
+ maxValue.getAndUpdate(v -> Math.max(v, percent));
+
+ //store the max percent
+ synchronized (validMap)
+ {
+ validMap.put(hash[0], percent);
+ }
+ }
+ }
+
+ //store in the bloom filter
+ if (removeUnique>0)
+ synchronized (currValidMers)
+ {
+ currValidMers.put(hash[0]);
+ }
+ }
+ catch (Exception e)
+ {
+ System.err.println(e);
+ }
+
+ });
+
+ // read the next line
+ line = bf.readLine();
+ }
- this.maxValue = maxValue;
+ executor.shutdown();
+ try
+ {
+ executor.awaitTermination(5L, TimeUnit.DAYS);
+ }
+ catch (InterruptedException e)
+ {
+ executor.shutdownNow();
+ throw new RuntimeException("Unable to finish all tasks.");
+ }
+
+ //trim the hashtable to the right size
+ validMap.trim();
+
+ this.validMers = validMers;
+ this.fractionCounts = validMap;
+ this.filterCutoff = filterCutoff;
+ this.offset = offset;
+ this.maxValue = maxValue.get();
this.minValue = this.filterCutoff;
this.minIdfValue = idf(this.maxValue);
this.maxIdfValue = idf(this.minValue);
}
- public boolean contains(long hash)
- {
- return this.fractionCounts.containsKey(hash);
- }
-
public double documentFrequencyRatio(long hash)
{
Double val = this.fractionCounts.get(hash);
@@ -76,9 +209,14 @@ public final class FrequencyCounts
return this.filterCutoff;
}
+ public List<Integer> getKmerSizes()
+ {
+ return new ArrayList<>(this.kmerSizes);
+ }
+
public double idf(double freq)
{
- return Math.log(this.maxValue/freq);
+ return Math.log(this.maxValue/freq-offset);
//return Math.log1p(this.maxValue/freq);
}
@@ -88,24 +226,25 @@ public final class FrequencyCounts
return idf(freq);
}
- public double idfDiscrete(long hash, int maxValue)
+ public double inverseDocumentFrequency(long hash)
{
- Double val = this.fractionCounts.get(hash);
- if (val == null)
- return maxValue;
-
- //get the true value
- double idf = idf(val);
-
- //scale it to match max
- double scale = (maxIdf()-minIdf())/(double)(maxValue-1.0);
-
- return 1.0+(idf-minIdf())/scale;
+ return 1.0/documentFrequencyRatio(hash);
}
- public double inverseDocumentFrequency(long hash)
+ public boolean isPopular(long hash)
{
- return 1.0/documentFrequencyRatio(hash);
+ return this.fractionCounts.containsKey(hash);
+ }
+
+ public boolean keepKmer(long hash)
+ {
+ if (this.removeUnique==0 || this.removeUnique==2)
+ return true;
+
+ if (this.validMers==null)
+ return false;
+
+ return this.validMers.mightContain(hash);
}
public double maxIdf()
@@ -117,4 +256,35 @@ public final class FrequencyCounts
{
return this.minIdfValue;
}
+
+ public double scaledIdf(long hash)
+ {
+ return scaledIdf(hash, REPEAT_SCALE);
+ }
+
+ public double scaledIdf(long hash, double maxValue)
+ {
+ if (this.removeUnique==2 && this.validMers!=null && !this.validMers.mightContain(hash))
+ return 1.0;
+
+ Double val = this.fractionCounts.get(hash);
+ if (val == null)
+ return maxValue;
+
+ //get the true value
+ double idf = idf(val);
+
+ //scale it to match max
+ double scale = (maxIdf()-minIdf())/(double)(maxValue-1.0);
+
+ return 1.0+(idf-minIdf())/scale;
+ }
+
+ public double tfWeight(int weight)
+ {
+ if (this.noTf)
+ return 1.0;
+
+ return (double)weight;
+ }
}
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java b/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java
index 22711b2..12d3d05 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/MinHashSketch.java
@@ -49,7 +49,7 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
private static final long serialVersionUID = 8846482698636860862L;
private final static int[] computeNgramMinHashesWeighted(String seq, final int nGramSize, final int numHashes,
- FrequencyCounts kmerFilter, boolean weighted)
+ FrequencyCounts kmerFilter, double repeatWeight)
{
final int numberNGrams = seq.length() - nGramSize + 1;
@@ -64,6 +64,10 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
int maxCount = 0;
for (long kmer : kmerHashes)
{
+ //do not add unique kmers to the sketch
+ if (kmerFilter!=null && !kmerFilter.keepKmer(kmer))
+ continue;
+
HitCounter counter = hitMap.get(kmer);
if (counter==null)
{
@@ -76,6 +80,10 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
if (maxCount<counter.count)
maxCount = counter.count;
}
+
+ //make sure don't create a non-zero value
+ if (hitMap.isEmpty())
+ hitMap.put(kmerHashes[0], new HitCounter(1));
//allocate the space
int[] hashes = new int[Math.max(1,numHashes)];
@@ -88,19 +96,23 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
long key = kmer.getKey();
int weight = kmer.getValue().count;
- if (!weighted)
+ if (repeatWeight<0.0)
+ {
weight = 1;
+ if (kmerFilter!=null && kmerFilter.isPopular(key))
+ weight = 0;
+ }
+ else
if (kmerFilter!=null)
{
- if (weighted)
+ if (repeatWeight>=0.0 && repeatWeight<1.0)
{
//compute the td part
- double td = (double)weight;
- //td = Math.log1p(td)*3.4;
+ double td = (double)kmerFilter.tfWeight(weight);
- //compute the idf part
- double idf = kmerFilter.idfDiscrete(key, 3);
+ //compute the idf part, 1-3
+ double idf = kmerFilter.scaledIdf(key);
//compute td-idf
weight = (int)Math.round(td*idf);
@@ -108,8 +120,9 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
weight = 1;
}
else
+ if (repeatWeight>=1.0)
{
- if (kmerFilter.contains(key))
+ if (kmerFilter.isPopular(key))
weight = 0;
}
}
@@ -191,12 +204,12 @@ public final class MinHashSketch implements Sketch<MinHashSketch>
public MinHashSketch(String str, int nGramSize, int numHashes)
{
- this.minHashes = MinHashSketch.computeNgramMinHashesWeighted(str, nGramSize, numHashes, null, true);
+ this.minHashes = MinHashSketch.computeNgramMinHashesWeighted(str, nGramSize, numHashes, null, -1.0);
}
- public MinHashSketch(String seq, int nGramSize, int numHashes, FrequencyCounts freqFilter, boolean weighted)
+ public MinHashSketch(String seq, int nGramSize, int numHashes, FrequencyCounts freqFilter, double repeatWeight)
{
- this.minHashes = MinHashSketch.computeNgramMinHashesWeighted(seq, nGramSize, numHashes, freqFilter, weighted);
+ this.minHashes = MinHashSketch.computeNgramMinHashesWeighted(seq, nGramSize, numHashes, freqFilter, repeatWeight);
}
public byte[] getAsByteArray()
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java b/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java
index db8591e..de35494 100644
--- a/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java
+++ b/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashes.java
@@ -276,9 +276,9 @@ public final class OrderedNGramHashes
}
}
+ private final int kmerSize;
private final int[][] orderedHashes;
private final int seqLength;
- private final int kmerSize;
private static double computeKBottomSketchJaccard(int[][] seq1Hashes, int[][] seq2Hashes, int medianShift, int absMaxShiftInOverlap, int a1, int a2, int b1, int b2)
{
@@ -455,7 +455,7 @@ public final class OrderedNGramHashes
this.seqLength = seq.length() - kmerSize + 1;
if (this.seqLength<=0)
- throw new SketchRuntimeException("Sequence length must be greater or equal to kmerSize.");
+ throw new SketchRuntimeException("Sequence length must be greater or equal to n-gram size.");
// compute just direct hash of sequence
int[] hashes = HashUtils.computeSequenceHashes(seq, kmerSize);
diff --git a/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashesOld.java b/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashesOld.java
deleted file mode 100644
index cdb6470..0000000
--- a/src/main/java/edu/umd/marbl/mhap/sketch/OrderedNGramHashesOld.java
+++ /dev/null
@@ -1,460 +0,0 @@
-/*
- * MHAP package
- *
- * This software is distributed "as is", without any warranty, including
- * any implied warranty of merchantability or fitness for a particular
- * use. The authors assume no responsibility for, and shall not be liable
- * for, any special, indirect, or consequential damages, or any damages
- * whatsoever, arising out of or in connection with the use of this
- * software.
- *
- * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren
- * University Of Maryland
- *
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- *
- */
-package edu.umd.marbl.mhap.sketch;
-
-import it.unimi.dsi.fastutil.ints.IntArrays;
-
-import java.io.ByteArrayOutputStream;
-import java.io.DataInputStream;
-import java.io.DataOutputStream;
-import java.io.EOFException;
-import java.io.IOException;
-import java.io.Serializable;
-import java.util.Arrays;
-
-import edu.umd.marbl.mhap.impl.OverlapInfo;
-import edu.umd.marbl.mhap.utils.Utils;
-
-public final class OrderedNGramHashesOld
-{
- private static final class SortableIntPair implements Comparable<SortableIntPair>, Serializable
- {
- public final int x;
- public final int y;
- /**
- *
- */
- private static final long serialVersionUID = 2525278831423582446L;
-
- public SortableIntPair(int x, int y)
- {
- this.x = x;
- this.y = y;
- }
-
- @Override
- public int compareTo(SortableIntPair p)
- {
- return Integer.compare(this.x, p.x);
- }
-
- /* (non-Javadoc)
- * @see java.lang.Object#toString()
- */
- @Override
- public String toString()
- {
- return "["+x + ", " + y + "]";
- }
-
- }
-
- private final int[][] orderedHashes;
- private final int seqLength;
-
- public final static int REDUCTION = 4;
-
- private final static int[][] allocateMemory(int size)
- {
- // allocate the memory
- int[][] completeHash = new int[size][2];
-
- return completeHash;
- }
-
- public final static OrderedNGramHashesOld fromByteStream(DataInputStream input) throws IOException
- {
- try
- {
- // dos.writeInt(this.seqLength);
- // dos.writeInt(size());
- int seqLength = input.readInt();
- int hashLength = input.readInt();
-
- int[][] orderedHashes = allocateMemory(hashLength);
-
- for (int iter = 0; iter < hashLength; iter++)
- {
- // dos.writeInt(this.completeHash[iter][iter2]);
- orderedHashes[iter][0] = input.readInt();
- orderedHashes[iter][1] = input.readInt();
- }
-
- return new OrderedNGramHashesOld(seqLength, orderedHashes);
-
- }
- catch (EOFException e)
- {
- return null;
- }
- }
-
- private OrderedNGramHashesOld(int seqLength, int[][] orderedHashes)
- {
- this.seqLength = seqLength;
- this.orderedHashes = orderedHashes;
- }
-
- public OrderedNGramHashesOld(String seq, int kmerSize)
- {
- this.seqLength = seq.length() - kmerSize + 1;
-
- if (this.seqLength<=0)
- throw new SketchRuntimeException("Sequence length must be greater or equal to kmerSize.");
-
- this.orderedHashes = getFullHashes(seq, kmerSize);
- }
-
- public byte[] getAsByteArray()
- {
- ByteArrayOutputStream bos = new ByteArrayOutputStream(size() * 2);
- DataOutputStream dos = new DataOutputStream(bos);
-
- try
- {
- dos.writeInt(this.seqLength);
- dos.writeInt(size());
- for (int iter = 0; iter < this.orderedHashes.length; iter++)
- {
- dos.writeInt(this.orderedHashes[iter][0]);
- dos.writeInt(this.orderedHashes[iter][1]);
- }
-
- dos.flush();
- return bos.toByteArray();
- }
- catch (IOException e)
- {
- throw new SketchRuntimeException("Unexpected IO error.");
- }
- }
-
- public int getHash(int index)
- {
- return this.orderedHashes[index][0];
- }
-
- private int[][] storeAsArray(SortableIntPair[] completeHashAsPair)
- {
- // allocate the memory
- int[][] completeHash = allocateMemory(completeHashAsPair.length);
-
- for (int iter = 0; iter < completeHashAsPair.length; iter++)
- {
- completeHash[iter][0] = completeHashAsPair[iter].x;
- completeHash[iter][1] = completeHashAsPair[iter].y;
- }
-
- return completeHash;
- }
-
- private int[][] getFullHashes(String seq, int subKmerSize)
- {
- int cutoff = (int) ((long) Integer.MIN_VALUE + ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE)
- / (long) REDUCTION);
-
- // compute just direct hash of sequence
- int[] hashes = HashUtils.computeSequenceHashes(seq, subKmerSize);
-
- int count = 0;
- for (int val : hashes)
- if (val <= cutoff)
- count++;
-
- int[] cutHashes = new int[count];
- int[] perm = new int[count];
- int[] pos = new int[count];
-
- count = 0;
- for (int iter = 0; iter < hashes.length; iter++)
- if (hashes[iter] <= cutoff)
- {
- cutHashes[count] = hashes[iter];
- perm[count] = count;
- pos[count] = iter;
-
- count++;
- }
-
- //sort the array
- IntArrays.radixSortIndirect(perm, cutHashes, true);
-
- SortableIntPair[] completeHashAsPair = new SortableIntPair[count];
- for (int iter=0; iter<count; iter++)
- {
- int index = perm[iter];
- completeHashAsPair[iter] = new SortableIntPair(cutHashes[index], pos[index]);
- }
-
- //System.err.println(Arrays.toString(completeHashAsPair));
- // sort the results, sort in place so no need to look at second
- //Arrays.sort(completeHashAsPair);
-
- return storeAsArray(completeHashAsPair);
- }
-
- public OverlapInfo getOverlapInfo(OrderedNGramHashesOld s, double maxShiftPercent)
- {
- int[][] allKmerHashes = this.orderedHashes;
-
- // get the kmers of the second sequence
- int[][] sAllKmerHashes = s.orderedHashes;
-
- // get sizes
- int size1 = this.size();
- int size2 = s.size();
-
- int kmerSize1 = this.seqLength;
- int kmerSize2 = s.seqLength;
-
- // init the ok regions
- int valid1Lower = 0;
- int valid1Upper = kmerSize1;
- int valid2Lower = 0;
- int valid2Upper = kmerSize2;
-
- int medianShift = 0;
- int overlapSize = Math.min(kmerSize1, kmerSize2);
- int absMaxShiftInOverlap = Math.max(kmerSize1, kmerSize2);
-
- int count = 0;
- int[] posShift = new int[Math.min(size1, size2) / 8 + 1];
- int[] pos1Index = new int[posShift.length];
- int[] pos2Index = new int[posShift.length];
-
- // check the repeat flag
- int numScoringRepeats = 2;
- if (maxShiftPercent <= 0)
- {
- numScoringRepeats = 1;
- maxShiftPercent = Math.abs(maxShiftPercent);
- }
-
- // refine multiple times to get better interval estimate
- for (int repeat = 0; repeat < numScoringRepeats; repeat++)
- {
- // init counters
- count = 0;
- int i1 = 0;
- int i2 = 0;
-
- // init the loop storage
- int hash1 = 0;
- int hash2 = 0;
- int pos1;
- int pos2;
-
- // perform merge operation to get the shift and the kmer count
- while (true)
- {
- if (i1>=allKmerHashes.length)
- break;
- if (i2>=sAllKmerHashes.length)
- break;
-
- // get the values in the array
- hash1 = allKmerHashes[i1][0];
- pos1 = allKmerHashes[i1][1];
-
- hash2 = sAllKmerHashes[i2][0];
- pos2 = sAllKmerHashes[i2][1];
-
- if (hash1 < hash2 || pos1 < valid1Lower || pos1 >= valid1Upper)
- i1++;
- else if (hash2 < hash1 || pos2 < valid2Lower || pos2 >= valid2Upper)
- i2++;
- else
- {
- // check if current shift makes sense positionally
- int currShift = pos2 - pos1;
- if (Math.abs(currShift - medianShift) > absMaxShiftInOverlap)
- {
- // do not record this shift and increase counter
- i2++;
- continue;
- }
-
- // adjust array size if needed
- if (posShift.length <= count)
- {
- posShift = Arrays.copyOf(posShift, posShift.length * 2);
- pos1Index = Arrays.copyOf(pos1Index, pos1Index.length * 2);
- pos2Index = Arrays.copyOf(pos2Index, pos2Index.length * 2);
- }
-
- // compute the shift
- posShift[count] = currShift;
- pos1Index[count] = pos1;
- pos2Index[count] = pos2;
-
- // if first round, store only first hit
- if (repeat == 0)
- i1++;
- i2++;
-
- count++;
- }
- }
-
- if (count <= 0)
- return new OverlapInfo(0.0, 0, 0, 0, 0, 0);
-
- // pick out only the matches that are best
- if (repeat > 0)
- {
- int reducedCount = -1;
-
- // copy over only the best values
- for (int iter = 0; iter < count; iter++)
- {
- if (reducedCount >= 0 && pos1Index[reducedCount] == pos1Index[iter])
- {
- // if better, record it
- if (Math.abs(posShift[reducedCount] - medianShift) > Math.abs(posShift[iter] - medianShift))
- {
- pos1Index[reducedCount] = pos1Index[iter];
- pos2Index[reducedCount] = pos2Index[iter];
- posShift[reducedCount] = posShift[iter];
- }
- }
- else
- {
- // add the new data
- reducedCount++;
- pos1Index[reducedCount] = pos1Index[iter];
- pos2Index[reducedCount] = pos2Index[iter];
- posShift[reducedCount] = posShift[iter];
- }
- }
-
- count = reducedCount + 1;
- }
-
- if (count <= 0)
- medianShift = 0;
- else
- medianShift = Utils.quickSelect(Arrays.copyOf(posShift, count), count / 2, count);
-
- // get the actual overlap size
- int leftPosition = Math.max(0, -medianShift);
- int rightPosition = Math.min(kmerSize1, kmerSize2 - medianShift);
- overlapSize = Math.max(this.seqLength - kmerSize1, rightPosition - leftPosition);
-
- // compute the max possible allowed shift in kmers
- absMaxShiftInOverlap = Math.min(Math.max(kmerSize1, kmerSize2),
- (int) ((double) overlapSize * maxShiftPercent));
-
- // get the updated borders
- valid1Lower = Math.max(0, -medianShift - absMaxShiftInOverlap);
- valid1Upper = Math.min(kmerSize1, kmerSize2 - medianShift + absMaxShiftInOverlap);
- valid2Lower = Math.max(0, medianShift - absMaxShiftInOverlap);
- valid2Upper = Math.min(kmerSize2, kmerSize1 + medianShift + absMaxShiftInOverlap);
-
- /*
- * System.err.println(overlapSize);
- * System.err.println("Size1= "+size1+" Lower:"+
- * valid1Lower+" Upper:"+valid1Upper+" Shift="+shift);
- * System.err.println("Size2= "+size2+" Lower:"+
- * valid2Lower+" Upper:"+valid2Upper);
- */
- }
-
- // storage for edge computation
- int leftEdge1 = Integer.MAX_VALUE;
- int leftEdge2 = Integer.MAX_VALUE;
- int rightEdge1 = Integer.MIN_VALUE;
- int rightEdge2 = Integer.MIN_VALUE;
-
- // count only the shifts in the correct place
- int validCount = 0;
- for (int iter = 0; iter < count; iter++)
- {
- int pos1 = pos1Index[iter];
- int pos2 = pos2Index[iter];
-
- // take only valid values
- if (Math.abs(posShift[iter] - medianShift) > absMaxShiftInOverlap)
- continue;
-
- // get the edges
- if (pos1 < leftEdge1)
- leftEdge1 = pos1;
- if (pos2 < leftEdge2)
- leftEdge2 = pos2;
- if (pos1 > rightEdge1)
- rightEdge1 = pos1;
- if (pos2 > rightEdge2)
- rightEdge2 = pos2;
-
- validCount++;
- }
-
- if (validCount <= 1)
- return new OverlapInfo(0.0, 0, 0, 0, 0, 0);
-
- // compute the score
- double score = (double) validCount / (double) (overlapSize);
-
- // get edge info uniformly minimum variance unbiased (UMVU) estimators
- // a = (n*a-b)/(n-1)
- // b = (n*b-a)/(n-1)
- int a1 = Math.max(0, (int) Math.round((validCount * leftEdge1 - rightEdge1) / (double) (validCount - 1)));
- int b1 = Math.max(0, (int) Math.round((validCount * leftEdge2 - rightEdge2) / (double) (validCount - 1)));
- int a2 = Math.min(this.seqLength,
- (int) Math.round((validCount * rightEdge1 - leftEdge1) / (double) (validCount - 1)));
- int b2 = Math.min(s.seqLength,
- (int) Math.round((validCount * rightEdge2 - leftEdge2) / (double) (validCount - 1)));
-
- // int ahang = a1-a2;
- // int bhang = (this.size()-b1>s.size()-b2) ? b1-this.size() : s.size()
- // - b2;
-
- // if (score>0.06)
- // {
- // int[] test = Arrays.copyOf(posShift, count);
- // int[] test2 = Arrays.copyOf(pos1Index, count);
-
- // System.err.println("Start = "+Math.max(0,
- // -medianShift)+", Overlap="+overlapSize+" Maxshift="+absMaxShiftInOverlap+": ["+Arrays.toString(test)+"; "+Arrays.toString(test2)+"];");
- // System.err.println("Overlap="+overlapSize+", Shift/overlap="+(double)(test[test.length-10]-test[10])/(double)overlapSize);
- // }
-
- // the hangs are adjusted by the rate of slide*distance traveled
- // relative to median, -medianShift-(a1-a2)
- // return new OverlapInfo(score, ahang, bhang);
-
- return new OverlapInfo(score * (double) REDUCTION, validCount, a1, a2, b1, b2);
- }
-
- public int size()
- {
- return this.orderedHashes.length;
- }
-}
diff --git a/src/main/java/edu/umd/marbl/mhap/utils/ParseOptions.java b/src/main/java/edu/umd/marbl/mhap/utils/ParseOptions.java
index f8c606b..82712a2 100644
--- a/src/main/java/edu/umd/marbl/mhap/utils/ParseOptions.java
+++ b/src/main/java/edu/umd/marbl/mhap/utils/ParseOptions.java
@@ -340,7 +340,7 @@ public class ParseOptions
throw new RuntimeException("Unknown flag \"" + flag + "\".");
else if (option.isBoolean())
option.setValue(true);
- else if (iter + 1 < args.length && !args[iter + 1].startsWith("-"))
+ else if (iter + 1 < args.length)
{
if (option.isDouble())
{
diff --git a/src/main/java/edu/umd/marbl/mhap/utils/Utils.java b/src/main/java/edu/umd/marbl/mhap/utils/Utils.java
index 5e31022..b62d2a3 100644
--- a/src/main/java/edu/umd/marbl/mhap/utils/Utils.java
+++ b/src/main/java/edu/umd/marbl/mhap/utils/Utils.java
@@ -35,7 +35,6 @@ import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;
import java.io.BufferedInputStream;
-import java.io.File;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
@@ -45,10 +44,6 @@ import java.io.FileReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
-import edu.umd.marbl.mhap.impl.MhapRuntimeException;
-import edu.umd.marbl.mhap.sketch.FrequencyCounts;
-import edu.umd.marbl.mhap.sketch.HashUtils;
-
public final class Utils
{
@@ -195,49 +190,6 @@ public final class Utils
return count;
}
- public final static FrequencyCounts createKmerFilter(String fileName, double maxFraction, int kmerSize, int seed)
- throws IOException
- {
- File file = new File(fileName);
-
- // make sure don't leak resources
- try (BufferedReader bf = new BufferedReader(new FileReader(file), BUFFER_BYTE_SIZE);)
- {
- // generate hashset
- HashMap<Long, Double> values = new HashMap<>();
-
- String line = bf.readLine();
- while (line != null)
- {
- String[] str = line.split("\\s+", 3);
-
- if (str.length < 2)
- throw new MhapRuntimeException(
- "K-mer filter file must have at least two column [k-mer k-mer_fraction].");
-
- double percent = Double.parseDouble(str[1]);
-
- // if greater, add to hashset
- if (percent > maxFraction)
- {
- long[] minHash = HashUtils.computeSequenceHashesLong(str[0], kmerSize, seed);
-
- if (minHash.length > 1)
- throw new MhapRuntimeException("K-mer filter file size greater than the specified k-mer size.");
-
- for (long val : minHash)
- values.put(val, percent);
- }
- else
- break;
-
- // read the next line
- line = bf.readLine();
- }
- return new FrequencyCounts(values, maxFraction);
- }
- }
-
public final static int[] errorString(int[] s, double readError)
{
int[] snew = s.clone();
@@ -253,14 +205,6 @@ public final class Utils
return snew;
}
- public final static BufferedReader getFile(String fileName, String postfix) throws IOException
- {
- String[] array = new String[1];
- array[0] = postfix;
-
- return getFile(fileName, array);
- }
-
public final static BufferedReader getFile(String fileName, String[] postfix) throws IOException
{
if (fileName.endsWith("bz2"))
@@ -287,6 +231,9 @@ public final class Utils
}
else
{
+ if (postfix==null)
+ return new BufferedReader(new FileReader(fileName), BUFFER_BYTE_SIZE);
+
int i = 0;
for (i = 0; i < postfix.length; i++)
{
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/mhap.git
More information about the debian-med-commit
mailing list