[med-svn] [Git][med-team/uncalled][upstream] New upstream version 2.3+ds
Andreas Tille (@tille)
gitlab at salsa.debian.org
Thu Jul 13 16:29:31 BST 2023
Andreas Tille pushed to branch upstream at Debian Med / uncalled
Commits:
62b3012c by Andreas Tille at 2023-07-13T15:38:18+02:00
New upstream version 2.3+ds
- - - - -
22 changed files:
- Makefile
- README.md
- scripts/uncalled
- setup.py
- src/bwa_index.hpp
- src/conf.hpp
- src/dtw.hpp
- src/dtw_test.cpp
- src/event_detector.cpp
- src/event_profiler.hpp
- src/map_pool_ord.cpp
- src/mapper.cpp
- src/mapper.hpp
- src/pore_model.hpp
- src/pybinder.cpp
- src/range.cpp
- src/read_buffer.cpp
- src/uncalled_map_ord.cpp
- uncalled/__init__.py
- uncalled/args.py
- uncalled/conf/defaults.toml
- uncalled/debug.py
Changes:
=====================================
Makefile
=====================================
@@ -23,7 +23,7 @@ _COMMON_OBJS=mapper.o seed_tracker.o range.o event_detector.o normalizer.o chunk
_MAP_ORD_OBJS=$(_COMMON_OBJS) realtime_pool.o map_pool_ord.o uncalled_map_ord.o
_MAP_OBJS=$(_COMMON_OBJS) map_pool.o uncalled_map.o
_SIM_OBJS=$(_COMMON_OBJS) realtime_pool.o client_sim.o uncalled_sim.o
-_DTW_OBJS=dtw_test.o fast5_reader.o read_buffer.o normalizer.o chunk.o event_detector.o range.o
+_DTW_OBJS=dtw_test.o fast5_reader.o read_buffer.o normalizer.o chunk.o event_detector.o range.o event_profiler.o
_ALL_OBJS=$(_COMMON_OBJS) realtime_pool.o map_pool.o uncalled_map.o uncalled_map_ord.o client_sim.o uncalled_sim.o dtw_test.o
=====================================
README.md
=====================================
@@ -4,18 +4,22 @@ A **U**tility for **N**anopore **C**urrent **Al**ignment to **L**arge **E**xpans
![UNCALLED logo](uncalled_logo_small.png "UNCALLED logo")
-A streaming algorithm for mapping raw nanopore signal to DNA references
+A read mapper which rapidly aligns raw nanopore signal to DNA references
-Enables real-time enrichment or depletion on Oxford Nanopore Technologies (ONT) MinION runs via ReadUntil
+Enables software-based targeted sequenceing on Oxford Nanopore (ONT) MinION or GridION [via adaptive sampling](https://nanoporetech.com/about-us/news/towards-real-time-targeting-enrichment-or-other-sampling-nanopore-sequencing-devices)
-Also supports standalone signal mapping of fast5 reads
+Note that UNCALLED can only be applied to legacy r9.4.1 data. For r10.4.1 data try [ReadFish](https://github.com/LooseLab/readfish) or ONT's builtin adaptive sampling option.
-Read the [preprint on BioRxiv](https://www.biorxiv.org/content/10.1101/2020.02.03.931923v1)
+[
+**Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED** \
+Sam Kovaka, Yunfan Fan, Bohan Ni, Winston Timp, Michael C. Schatz \
+Nature Biotechnology (2020)
+](https://www.nature.com/articles/s41587-020-0731-9)
## Installation
```
-pip3 install git+https://github.com/skovaka/UNCALLED.git --user
+> pip3 install git+https://github.com/skovaka/UNCALLED.git --user
```
OR
@@ -23,12 +27,12 @@ OR
```
> git clone --recursive https://github.com/skovaka/UNCALLED.git
> cd UNCALLED
-> python3 setup.py install --user
+> pip3 install .
```
Requires python >= 3.6, read-until == 3.0.0, pybind11 >= 2.5.0, and GCC >= 4.8.1 (all except GCC are automatically downloaded and installed)
-Other dependecies are included via submodules, so be sure to clone with `git --recursive`
+Other dependencies are included via submodules, so be sure to clone with `git --recursive`
We recommend running on a Linux machine. UNCALLED has been successfully installed and run on Mac computers, but real-time ReadUntil has not been tested on a Mac. Installing UNCALLED has not been attempted on Windows.
@@ -36,18 +40,21 @@ We recommend running on a Linux machine. UNCALLED has been successfully installe
**Example:**
+
```
> uncalled index -o E.coli E.coli.fasta
```
+**Positional arguments:**
-Optional arguments:
+- `fasta-file` reference genome(s) or other target sequences in the FASTA format
-- `-o/--bwa_prefix` output index prefix (default: same as input fasta)
+**Optional arguments:**
+- `-o/--bwa_prefix` output index prefix (default: same as input fasta)
-Note that this command will use a previously built BWA index if all the required files exist with the specified prefix. Otherwise, a new BWA index will be automatically built.
+Note that UNCALLED uses the [BWA](https://github.com/lh3/bwa) FM Index to encode the reference, and this command will use a previously built BWA index if all the required files exist with the specified prefix. Otherwise, a new BWA index will be automatically built.
-We recommend applying repeat masking your reference if it contains eukaryotic sequences. See [masking](masking/) for more details.
+**We recommend applying repeat masking your reference if it contains eukaryotic sequences.** See [masking](masking/) for more details.
## Fast5 Mapping
@@ -65,16 +72,16 @@ eee4b762-25dd-4d4a-8a59-be47065029be 2905 * * * * *
e175c87b-a426-4a3f-8dc1-8e7ab5fdd30d 8052 84 154 + Escherichia_coli_chromosome 4765434 1064550 1064614 41 65 255 ch:i:182 st:i:452368 mt:f:38.611683
```
-Positional arguments:
+**Positional arguments:**
-- `bwa-prefix` the prefix of the index to align to. Should be a BWA index that `uncalled index` was run on
-- `fast5-files` a text file containing the path to one fast5 file per line
+- `bwa-prefix` the BWA reference index prefix generated by `uncalled map`
+- `fast5-files` Reads to be mapped. Can be a directory which will be recursively searched for all files with the ".fast5" extension, a text file containing one fast5 filename per line, or a comma-separated list of fast5 file names.
-Optional arguments:
+**Optional arguments:**
-- `-t/--threads` number of threads to use for mapping (default: 1)
+- `-l/--read-list` text file containing a list of read IDs. Only these reads will be mapped if specified
- `-n/--read-count` maximum number of reads to map
-- `-f/--filter` text file containing subset of read IDs (one per line) to map from the fast5 files (will map all by default)
+ - `-t/--threads` number of threads to use for mapping (default: 1)
- `-e/--max-events-proc` number of events to attempt mapping before giving up on a read (default 30,000). Note that there are approximately two events per nucleotide on average.
@@ -82,6 +89,8 @@ See [example/](example/) for a simple read and reference example.
## Real-Time ReadUntil
+**Warning:** in the latest MinKNOW version, an API bug may prevent UNCALLED from properly ejecting reads. You can identify this bug if you do not see a peak of small "adaptive sampling" reads in read length histogram. If this occurs you should stop your sequencing run, briefly start a new sequencing run with MinKNOW's builtin version of adaptive sampling enabled, then stop that run and restart your UNCALLED run. We have found that this may initialize something in MinKNOW which allows UNCALLED to function properly.
+
**Example:**
```
@@ -99,26 +108,30 @@ d9acafe3-23dd-4a0f-83db-efe299ee59a4 1355 * * * * * *
We recommend that you try mapping fast5s via `uncalled map` before real-time enrichment, as runtime issues could occur if UNCALLED is not installed properly.
-The command can generally be run at any time before or during a sequencing run, although an error may occur if UNCALLED is run before any sequencing run has been started in the current MinKNOW session. If this is happens you should start UNCALLED after the run begins, ideally during the first mux scan. If you want to change the chunk size you must run the commend *before* starting the run (see below).
+The command can generally be run at any time before or during a sequencing run, although an error may occur if UNCALLED is run before any sequencing run has been started in the current MinKNOW session. If this is happens you should start UNCALLED after the run begins, ideally during the first mux scan. If you want to change the chunk size you must run the command *before* starting the run (see below).
-Arguments:
+**Positional arguments:**
+- `bwa-prefix` the BWA reference index prefix generated by `uncalled map`
-- `bwa-prefix` the prefix of the index to align to. Should be a BWA index that `uncalled index` was run on
-- `-t/--threads` number of threads to use for mapping (default: 1)
-- `-c/--max-chunks-proc` number of chunks to attempt mapping before giving up on a read (default: 10).
+**Required arguments:**
+- `--enrich` will *keep* reads that map to the reference if included OR
+- `--deplete` will *eject* reads that map to the reference if included
+Exactly one of `--deplete` or `--enrich` must be specified
+
+**Optional Arguments:**
+
+- `-c/--max-chunks` number of chunks to attempt mapping before giving up on a read (default: 10).
- `--chunk-size` size of chunks in seconds (default: 1). Note: this is a new feature and may not work as intended (see below)
+- `-t/--threads` number of threads to use for mapping (default: 1)
- `--port` MinION device port.
-- `--enrich` will *keep* reads that map to the reference if included
-- `--deplete` will *eject* reads that map to the reference if included
- `--even` will only eject reads from even channels if included
- `--odd` will only eject reads from odd channels if included
-- `--duration` expected duration of sequencing run in hours (default: 48)
+- `--duration` expected duration of sequencing run in hours (default: 72)
-Note exactly one of `--deplete` or `--enrich` must be specified
### Altering Chunk Size
-The ReadUntil API recieves signal is "chunks", which by default are one second's worth of signal. This can be changed using the `--chunk-size` parameter. Note that `--max-chunks-proc` should also be changed to compensate for changes to chunks size. *If the chunk size is changed, you must start running UNCALLED before sequencing begins.* UNCALLED is unable change the chunk size mid-seqencing-run. In general reducing the chunk size should improve enrichment, although [previous work](http://dx.doi.org/10.1101/2020.02.03.926956) has found that the API becomes unreliable with chunks sizes less than 0.4 seconds. We have not thoroughly tested this feature, and recommend using the default 1 second chunk size for most cases. In the future this default size may be reduced.
+The ReadUntil API receives signal is "chunks", which by default are one second's worth of signal. This can be changed using the `--chunk-size` parameter. Note that `--max-chunks-proc` should also be changed to compensate for changes to chunk sizes. *If the chunk size is changed, you must start running UNCALLED before sequencing begins.* UNCALLED is unable to change the chunk size mid-seqencing-run. In general reducing the chunk size should improve enrichment, although [previous work](http://dx.doi.org/10.1101/2020.02.03.926956) has found that the API becomes unreliable with chunks sizes less than 0.4 seconds. We have not thoroughly tested this feature, and recommend using the default 1 second chunk size for most cases. In the future this default size may be reduced.
## Simulator
@@ -135,7 +148,7 @@ cnt_on_bp 33.145022
cnt_total_bp 8271.651331
```
-The simulator simulates a real-time run using data from two real runs: one control run and one UNCALLED run. Reads are simulated from the control run, and the pattern of channel activity of modeled after the control run. The simulator outputs a PAF file similar to the realtime mode, which can be interperted using scripts found in [sim_scripts/](sim_scripts/).
+The simulator simulates a real-time run using data from two real runs: one control run and one UNCALLED run. Reads are simulated from the control run, and the pattern of channel activity of modeled after the control run. The simulator outputs a PAF file similar to the real-time mode, which can be interperted using scripts found in [sim_scripts/](sim_scripts/).
Example files which can be used as template UNCALLED sequencing summary and PAF files for the simulator can be found [here](http://labshare.cshl.edu/shares/schatzlab/www-data/UNCALLED/simulator_files/). The control reads/sequencing summary can be from any sequencing run of your sample of interest, and it does not have to match the sample used in the provided examples.
@@ -162,9 +175,9 @@ Exactly one of `--deplete` or `--enrich` must be specified
UNCALLED outputs to stdout in a format similar to [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md). Unmapped reads are output with reference-location-dependent fields replaced with \*s. Lines that begin with "#" are comments that useful for debugging.
-Query coordinates, residue matches, and block lengths are estimated assuming 450bp sequenced per second. This estimate can be significantly off depending on the sequencing run. UNCALLED attempts to map a read as early as possible, so the "query end" field corresponds to the leftmost position where UNCALLED was able to confidently map the read. In many cases this may only be 450bp or 900bp into the read, even if the read is many times longer than this. This differs from aligners such as [minimap2](https://github.com/lh3/minimap2), which attempt to map the full length of the read.
+Query coordinates, residue matches, and block lengths are estimated assuming 450bp sequenced per second. This estimate can be significantly off depending on the sequencing run. UNCALLED attempts to map a read as early as possible, so the "query sequence length" and "query end" fields correspond to the leftmost position where UNCALLED was able to confidently map the read. In many cases this may only be 450bp or 900bp into the read, even if the read is many times longer than this. This differs from aligners such as [minimap2](https://github.com/lh3/minimap2), which attempt to map the full length of the read.
-For real-time mapping, read lengths are estimated by how much signal UNCALLED received, which does not necessarily correspond to how much signal was actually sequenced.
+The "query sequence length" field currently does not correspond to the actual read length, rather an estimate of the number of bases that UNCALLED attempted to align. In most cases this will be equal to "query end". This may be changed to better reflect the full read length in future versions.
Both modes include the following custom attributes in each PAF entry:
@@ -182,11 +195,11 @@ Both modes include the following custom attributes in each PAF entry:
### pafstats
-We have included a functionality called `uncalled pafstats` which computes speed statistcs from a PAF file output by UNCALLED. Accuracy statistics can also be included if provided a ground truth PAF file, for example based on minimap2 alignments of basecalled reads. There is also an option to output the original UNCALLED PAF annotated with comparisions to the ground truth.
+We have included a functionality called `uncalled pafstats` which computes speed statistics from a PAF file output by UNCALLED. Accuracy statistics can also be included if provided a ground truth PAF file, for example based on [minimap2](https://github.com/lh3/minimap2 alignments of basecalled reads. There is also an option to output the original UNCALLED PAF annotated with comparisons to the ground truth.
**Example:**
```
-> uncalled pafstats -r minimap2_alns.paf -n 5000 -a uncalled_out.paf > uncalled_ann.paf
+> uncalled pafstats -r minimap2_alns.paf -n 5000 uncalled_out.paf
Summary: 5000 reads, 4373 mapped (89.46%)
Comparing to reference PAF
@@ -201,12 +214,20 @@ BP mapped: 636.29 362.00
MS to map: 140.99 89.96
```
+Positional arguments
+- `infile` PAF file output by UNCALLED
+
+Optional arguments
+- `-n/--max-reads` maximum number of reads to parse
+- `-r/--ref-paf` ground-truth alignments (from minimap2) to compute TP/TN/FP/FN rates
+- `-a/--annotate` if used with `-r`, will output PAF with "rf:" tag indicating TP, TN, FP, or FN
+
Accuracy statistics:
- TP: true positive - percent infile reads that overlap reference read locations
- FP: false positive - percent infile reads that do not overlap reference read locations
- TN: true negative - percent of reads which were not aligned in reference or infile
- FN: false negative - percent of reads which were aligned in the reference but not in the infile
-- NA: not aligned/not applicable - percent of reads aligned in infile but not in reference. Could be considered a false positive, but the truth is unkown.
+- NA: not aligned/not applicable - percent of reads aligned in infile but not in reference. Could be considered a false positive, but the truth is unknown.
## Practical Considerations
@@ -219,23 +240,19 @@ Note that enrichment necessitates a quick decision as to whether or not a read m
UNCALLED currently does not support large (> ~1Gbp) or highly repetitive references.
The speed and mapping rate both progressively drop as references become larger and more repetitive.
Bacterial genomes or small collections of divergent bacterial genomes typically work well.
-Small segments of eukaryotic genomes can also be used, however the presence of any repetitvie elements will harm the performance.
+Small segments of eukaryotic genomes can also be used, however the presence of any repetitve elements will harm the performance.
Collections of highly similar genomes wil not work well, as conserved sequences introduce repeats.
-See [masking](masking/) for repeat masking scripts and guidlines.
-
-ReadUntil works best with longer reads. Maximize your read lengths for best results. You may also need to perform a nuclease flush and reloading to achive the highest yield of on-target bases.
-
-UNCALLED currently only supports reads sequenced with r9.4 chemistry.
+See [masking](masking/) for repeat masking scripts and guidelines.
-## Undocumented Features
+ReadUntil works best with longer reads. Maximize your read lengths for best results. You may also need to perform a nuclease flush and reloading to achieve the highest yield of on-target bases.
-UNCALLED is a work in progress. Many parameters exist that are not documented here, but can be seen on the command line help information. Most users should leave these unchanged. They may be removed in future versions, or be replaced with hyperparameters that adjust the accuracy and speed of UNCALLED.
+UNCALLED currently only supports reads sequenced with r9.4.1/r9.4 chemistry.
## Release notes
- v2.2: added event profiler which masks out pore stalls, and added compile-time debug options
- v2.1: updated ReadUntil client for latest MinKNOW version, made `uncalled index` automatically build the BWA index, added hdf5 submodule, further automated installation by auto-building hdf5, switched to using setuptools, moved submodules to submods/
-- v2.0: released the ReadUntil simulator `uncalled sim`, which can predict how much enrichment UNCALLED could provide on a given reference, using a control and UNCALLED run as a template. Also CHANGED THE FORMAT OF CERTAIN ARGUMENTS. Index prefix and fast5 list are now positional, and some flags have changed names. See below for details.
+- v2.0: released the ReadUntil simulator `uncalled sim`, which can predict how much enrichment UNCALLED could provide on a given reference, using a control and UNCALLED run as a template. _Also changed the format of certain arguments_: index prefix and fast5 list are now positional, and some flags have changed names.
- v1.2: fixed indexing for particularly large or small reference
- v1.1: added support for altering chunk size
- v1.0: pre-print release
=====================================
scripts/uncalled
=====================================
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
# MIT License
#
@@ -22,8 +22,6 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
-#TODO try to improve module names
-#from uncalled import *
import sys
import os
import argparse
@@ -35,8 +33,6 @@ import traceback
import uncalled as unc
import numpy as np
-#from uncalled import mapping, index, pafstats, minknow_client, sim_utils, args
-
MAX_SLEEP = 0.01
def index_cmd(args):
@@ -87,7 +83,7 @@ def fast5_path(fname):
path = os.path.abspath(fname)
if not os.path.isfile(path):
- sys.stderr.write("Warning: \"%s\" is not a fast5 file.\n")
+ sys.stderr.write("Warning: \"%s\" is not a fast5 file.\n" % fname)
return None
return path
@@ -95,6 +91,11 @@ def fast5_path(fname):
def load_fast5s(fast5s, recursive):
for path in fast5s:
path = path.strip()
+
+ if not os.path.exists(path):
+ sys.stderr.write("Error: \"%s\" does not exist\n" % path)
+ sys.exit(1)
+
isdir = os.path.isdir(path)
#Recursive directory search
@@ -109,7 +110,7 @@ def load_fast5s(fast5s, recursive):
yield fast5_path(os.path.join(path, fname))
#Read fast5 name directly
- elif path.endswith("fast5"):
+ elif path.endswith(".fast5"):
yield fast5_path(path)
#Read fast5 filenames from text file
@@ -332,7 +333,7 @@ def list_ports_cmd(args):
if __name__ == "__main__":
- conf, args = unc.args.load_conf(sys.argv)
+ parser, conf, args = unc.args.load_conf(sys.argv)
if args.subcmd == "index":
index_cmd(args)
=====================================
setup.py
=====================================
@@ -4,7 +4,7 @@ import os
import subprocess
import sys
-__version__ = "2.2"
+__version__ = "2.3.0"
ROOT_DIR = os.getcwd()
@@ -83,8 +83,8 @@ uncalled = Extension(
"_uncalled",
sources = [
- "src/event_profiler.cpp",
"src/pybinder.cpp",
+ "src/event_profiler.cpp",
"src/client_sim.cpp",
"src/fast5_reader.cpp",
"src/mapper.cpp",
@@ -100,7 +100,7 @@ uncalled = Extension(
],
include_dirs = [
- "./submods", #TODO: consistent incl paths?
+ "./submods",
"./submods/hdf5/include",
"./submods/fast5/include",
"./submods/pdqsort",
@@ -109,11 +109,14 @@ uncalled = Extension(
],
library_dirs = [
- "./submods/bwa",
- "./submods/hdf5/lib"
+ "./submods/bwa"
],
+
+ extra_objects = [
+ "./submods/hdf5/lib/libhdf5.a"
+ ],
- libraries = ["bwa", "hdf5", "z", "dl", "m"],
+ libraries = ["bwa", "z", "dl", "m"],
extra_compile_args = ["-std=c++11", "-O3"],
@@ -135,8 +138,12 @@ setup(
python_requires=">=3.6",
setup_requires=[
- 'pybind11>=2.5.0',
- 'read-until==3.0.0'
+ 'pybind11>=2.5.0'
+ ],
+
+ install_requires=[
+ 'read-until==3.0.0',
+ 'minknow-api>=5.0'
],
packages=find_packages(),
=====================================
src/bwa_index.hpp
=====================================
@@ -165,6 +165,10 @@ class BwaIndex {
return kmer_ranges_[kmer];
}
+ u64 get_kmer_count(u16 kmer) const {
+ return kmer_ranges_[kmer].length();
+ }
+
Range get_base_range(u8 base) const {
return Range(index_->L2[base], index_->L2[base+1]);
}
@@ -342,6 +346,7 @@ class BwaIndex {
PY_BWA_INDEX_METH(destroy);
PY_BWA_INDEX_METH(get_neighbor);
PY_BWA_INDEX_METH(get_kmer_range);
+ PY_BWA_INDEX_METH(get_kmer_count);
PY_BWA_INDEX_METH(get_base_range);
PY_BWA_INDEX_METH(sa);
PY_BWA_INDEX_METH(size);
@@ -356,6 +361,8 @@ class BwaIndex {
PY_BWA_INDEX_METH(get_ref_name);
PY_BWA_INDEX_METH(get_ref_len);
PY_BWA_INDEX_METH(range_to_fms);
+ c.def("get_kmers", static_cast< std::vector<u16> (BwaIndex::*)(u64, u64)> (&BwaIndex::get_kmers) );
+ c.def("get_kmers", static_cast< std::vector<u16> (BwaIndex::*)(std::string, u64, u64)> (&BwaIndex::get_kmers) );
}
#endif
=====================================
src/conf.hpp
=====================================
@@ -64,6 +64,7 @@ class Conf {
u16 threads;
Mapper::Params &mapper_prms = Mapper::PRMS;
+ Normalizer::Params &norm_prms = mapper_prms.norm_prms;
EventDetector::Params &event_prms = mapper_prms.event_prms;
EventProfiler::Params &evt_prof_prms = mapper_prms.evt_prof_prms;
SeedTracker::Params &seed_prms = mapper_prms.seed_prms;
@@ -180,7 +181,7 @@ class Conf {
GET_TOML_EXTERN(float, min_seed_prob, mapper_prms);
GET_TOML_EXTERN(std::string, bwa_prefix, mapper_prms);
GET_TOML_EXTERN(std::string, idx_preset, mapper_prms);
- GET_TOML_EXTERN(u32, evt_buffer_len, mapper_prms);
+ GET_TOML_EXTERN(std::string, model_path, mapper_prms);
GET_TOML_EXTERN(u16, evt_batch_size, mapper_prms);
GET_TOML_EXTERN(float, evt_timeout, mapper_prms);
GET_TOML_EXTERN(float, chunk_timeout, mapper_prms);
@@ -198,6 +199,13 @@ class Conf {
GET_TOML_EXTERN(u32, min_map_len, seed_prms);
}
+ if (conf.contains("normalizer")) {
+ const auto subconf = toml::find(conf, "normalizer");
+ GET_TOML_EXTERN(u32, len, norm_prms);
+ GET_TOML_EXTERN(float, tgt_mean, norm_prms);
+ GET_TOML_EXTERN(float, tgt_stdv, norm_prms);
+ }
+
if (conf.contains("event_detector")) {
const auto subconf = toml::find(conf, "event_detector");
GET_TOML_EXTERN(float, min_mean, event_prms);
@@ -250,6 +258,7 @@ class Conf {
GET_SET_EXTERN(std::string, mapper_prms, bwa_prefix)
GET_SET_EXTERN(std::string, mapper_prms, idx_preset)
+ GET_SET_EXTERN(std::string, mapper_prms, model_path)
GET_SET_EXTERN(u32, mapper_prms, max_events)
GET_SET_EXTERN(u32, mapper_prms, seed_len);
@@ -293,6 +302,7 @@ class Conf {
DEFPRP(bwa_prefix)
DEFPRP(idx_preset)
+ DEFPRP(model_path);
DEFPRP(max_events)
DEFPRP(seed_len);
DEFPRP(chunk_time)
=====================================
src/dtw.hpp
=====================================
@@ -6,23 +6,38 @@
#include <cfloat>
#include "util.hpp"
-enum DTWSubSeq {NONE, ROW, COL};
-
-
-template < class RowT, class ColT, typename Func >
+enum class DTWSubSeq {NONE, ROW, COL};
+typedef struct {
+ DTWSubSeq subseq;
+ float dw, hw, vw;
+} DTWParams;
+
+const DTWParams
+ DTW_EVENT_GLOB = {
+ DTWSubSeq::NONE, 2, 1, 100
+ }, DTW_EVENT_QSUB = {
+ DTWSubSeq::COL, 2, 1, 100
+ }, DTW_EVENT_RSUB = {
+ DTWSubSeq::ROW, 2, 1, 100
+ }, DTW_RAW_QSUB = {
+ DTWSubSeq::COL, 10, 1, 1000
+ }, DTW_RAW_RSUB = {
+ DTWSubSeq::ROW, 10, 1, 1000
+ }, DTW_RAW_GLOB = {
+ DTWSubSeq::NONE, 10, 1, 1000
+ };
+
+
+template < class ColT, class RowT, typename Func >
class DTW {
public:
- typedef struct {
- DTWSubSeq subseq;
- float dw, hw, vw;
- const Func &fn;
- } Prms;
-
- DTW(const std::vector<RowT> &row_vals,
- const std::vector<ColT> &col_vals,
- const Prms &p) :
- prms(p),
+ DTW(const std::vector<ColT> &col_vals,
+ const std::vector<RowT> &row_vals,
+ const DTWParams &p,
+ const Func &fn) :
+ PRMS(p),
+ fn_(fn),
rvals_(row_vals),
cvals_(col_vals) {
@@ -33,25 +48,16 @@ class DTW {
traceback();
}
- DTW(const std::vector<RowT> &row_vals,
- const std::vector<ColT> &col_vals,
- DTWSubSeq subseq,
- float dw,
- float hw,
- float vw,
- const Func &fn) :
- DTW ( row_vals, col_vals, {subseq,dw,hw,vw,fn} ) {}
-
void compute_matrix() {
u64 k = 0;
float cost, ds, hs, vs;
for (u64 i = 0; i < rvals_.size(); i++) {
for (u64 j = 0; j < cvals_.size(); j++) {
- cost = prms.fn(rvals_[i], cvals_[j]);
- ds = dscore(i,j) + (prms.dw * cost);
- hs = hscore(i,j) + (prms.hw * cost);
- vs = vscore(i,j) + (prms.vw * cost);
+ cost = fn_(rvals_[i], cvals_[j]);
+ ds = dscore(i,j) + (PRMS.dw * cost);
+ hs = hscore(i,j) + (PRMS.hw * cost);
+ vs = vscore(i,j) + (PRMS.vw * cost);
if (ds <= hs && ds <= vs) {
mat_[k] = ds;
@@ -70,7 +76,7 @@ class DTW {
void traceback() {
u64 i = rvals_.size()-1, j = cvals_.size()-1;
- switch (prms.subseq) {
+ switch (PRMS.subseq) {
case DTWSubSeq::ROW:
for (u64 k = 0; k < rvals_.size(); k++) {
if (mat_[k*cvals_.size() + j] < mat_[i*cvals_.size() + j]) {
@@ -91,11 +97,11 @@ class DTW {
score_sum_ = mat_[i*cvals_.size() + j];
- path_.push_back(std::pair<u64, u64>(i, j));
+ path_.push_back(std::pair<u64, u64>(j, i));
u64 k = i*cvals_.size() + j;
- while(!(i == 0 || prms.subseq == DTWSubSeq::ROW) ||
- !(j == 0 || prms.subseq == DTWSubSeq::COL)) {
+ while(!(i == 0 || PRMS.subseq == DTWSubSeq::ROW) ||
+ !(j == 0 || PRMS.subseq == DTWSubSeq::COL)) {
if (i == 0 || bcrumbs_[k] == Move::H) {
k--;
@@ -109,7 +115,7 @@ class DTW {
j--;
}
- path_.push_back(std::pair<u64, u64>(i, j));
+ path_.push_back(std::pair<u64, u64>(j, i));
}
}
@@ -132,7 +138,7 @@ class DTW {
<< p->second << "\t"
<< rvals_[p->first] << "\t"
<< cvals_[p->second] << "\t"
- << prms.fn(rvals_[p->first], cvals_[p->second]) << "\t"
+ << fn_(rvals_[p->first], cvals_[p->second]) << "\t"
<< "\n";
}
}
@@ -141,7 +147,8 @@ class DTW {
enum Move {D, H, V}; //Horizontal, vertical, diagonal
static constexpr float MAX_COST = FLT_MAX / 2.0;
- const Prms prms;
+ const DTWParams PRMS;
+ const Func &fn_;
const std::vector<RowT> &rvals_;
const std::vector<ColT> &cvals_;
@@ -153,24 +160,76 @@ class DTW {
inline float hscore(u64 i, u64 j) {
if (j > 0) return mat_[cvals_.size()*i + j-1];
- else if (prms.subseq == DTWSubSeq::ROW) return 0;
+ else if (PRMS.subseq == DTWSubSeq::ROW) return 0;
return MAX_COST;
}
inline float vscore(u64 i, u64 j) {
if (i > 0) return mat_[cvals_.size()*(i-1) + j];
- else if (prms.subseq == DTWSubSeq::COL) return 0;
+ else if (PRMS.subseq == DTWSubSeq::COL) return 0;
else return MAX_COST;
}
inline float dscore(u64 i, u64 j) {
if (j > 0 && i > 0) return mat_[cvals_.size()*(i-1) + j-1];
else if ((j == i) ||
- (i == 0 && prms.subseq == DTWSubSeq::COL) ||
- (j == 0 && prms.subseq == DTWSubSeq::ROW)) return 0;
+ (i == 0 && PRMS.subseq == DTWSubSeq::COL) ||
+ (j == 0 && PRMS.subseq == DTWSubSeq::ROW)) return 0;
return MAX_COST;
}
+
+ #ifdef PYBIND
+
+ static void pybind_defs(pybind11::class_<DTW> &c) {
+ }
+ #endif
};
+float dtwcost_r94p(u16 k, float e) {
+ return -pmodel_r94_template.match_prob(e,k);
+}
+class DTWr94p : public DTW<float, u16, decltype(dtwcost_r94p)> {
+ public:
+ DTWr94p(const std::vector<float> &means,
+ const std::vector<u16> &kmers,
+ const DTWParams &prms)
+ : DTW(means, kmers, prms, dtwcost_r94p) {}
+
+ #ifdef PYBIND
+ #define PY_DTW_R94P_METH(P) c.def(#P, &DTWr94p::P);
+ static void pybind_defs(pybind11::class_<DTWr94p> &c) {
+ c.def(pybind11::init<const std::vector<float>&,
+ const std::vector<u16>&,
+ const DTWParams&>());
+ PY_DTW_R94P_METH(get_path)
+ PY_DTW_R94P_METH(score)
+ PY_DTW_R94P_METH(mean_score)
+ }
+ #endif
+
+};
+
+float dtwcost_r94d(u16 k, float e) {
+ return abs(e-pmodel_r94_template.get_mean(k));
+}
+class DTWr94d : public DTW<float, u16, decltype(dtwcost_r94d)> {
+ public:
+ DTWr94d(const std::vector<float> &means,
+ const std::vector<u16> &kmers,
+ const DTWParams &prms)
+ : DTW(means, kmers, prms, dtwcost_r94d) {}
+
+ #ifdef PYBIND
+ #define PY_DTW_R94D_METH(P) c.def(#P, &DTWr94d::P);
+ static void pybind_defs(pybind11::class_<DTWr94d> &c) {
+ c.def(pybind11::init<const std::vector<float>&,
+ const std::vector<u16>&,
+ const DTWParams&>());
+ PY_DTW_R94D_METH(get_path)
+ PY_DTW_R94D_METH(score)
+ PY_DTW_R94D_METH(mean_score)
+ }
+ #endif
+};
#endif
=====================================
src/dtw_test.cpp
=====================================
@@ -2,18 +2,20 @@
#include <math.h>
#include <unordered_map>
+#include "event_profiler.hpp"
#include "normalizer.hpp"
+#include "model_r94.inl"
#include "pore_model.hpp"
#include "fast5_reader.hpp"
#include "bwa_index.hpp"
#include "dtw.hpp"
-const std::string CONF_DIR(std::getenv("UNCALLED_CONF")),
- DEF_MODEL = CONF_DIR + "/r94_5mers.txt",
- DEF_CONF = CONF_DIR + "/defaults.toml";
-
-//bool load_conf(int argc, char** argv, Conf &conf);
+//const std::string CONF_DIR(std::getenv("UNCALLED_CONF")),
+// DEF_MODEL = CONF_DIR + "/r94_5mers.txt",
+// DEF_CONF = CONF_DIR + "/defaults.toml";
//
+//bool load_conf(int argc, char** argv, Conf &conf);
+
typedef struct {
std::string rd_name, rf_name;
@@ -69,17 +71,14 @@ int main(int argc, char** argv) {
out_prefix = std::string(argv[4]);
}
- const PoreModel<KLEN> model(DEF_MODEL, false);
-
- auto costfn = [&model](float e, u16 k) {return -model.match_prob(e,k);};
- //auto costfn = [&model](float e, u16 k) {return abs(e-model.get_mean(k));};
+ auto model = pmodel_r94_template;
- DTW<float, u16, decltype(costfn)>::Prms dtwp = {
+ DTWParams dtwp = {
DTWSubSeq::NONE, //substr
- 1,1,1, //d,v,h
- costfn}; //fn
+ 1,1,1}; //d,v,h
EventDetector evdt;
+ EventProfiler evpr;
BwaIndex<KLEN> idx(index_prefix);
idx.load_pacseq();
@@ -134,10 +133,16 @@ int main(int argc, char** argv) {
//Create events if needed
if (create_events) {
- norm.set_signal(evdt.get_means(signal));
- //std::cout << "scale=" << norm.get_scale() << "\t"
- // << "shift=" << norm.get_shift() << "\n";
- //continue;
+ auto events = evdt.get_events(signal);
+ auto mask = evpr.get_full_mask(events);
+ signal.clear();
+
+ for (u32 i = 0; i < events.size(); i++) {
+ if (mask[i]) signal.push_back(events[i].mean);
+ }
+
+ norm.set_signal(signal);
+
} else {
norm.set_signal(signal);
}
@@ -148,12 +153,13 @@ int main(int argc, char** argv) {
while (!norm.empty()) signal.push_back(norm.pop());
//Takes up too much space :(
- if (signal.size() > 500000) {
- return 1;
+ if (signal.size() > 50000) {
+ std::cerr << "Skipping " << read.get_id() << "\n";
+ continue;
}
- DTW<float, u16, decltype(costfn)> dtw(signal, kmers, dtwp);
+ DTWr94d dtw(signal, kmers, dtwp);
if (!out_prefix.empty()) {
std::string path_fname = out_prefix+read.get_id()+".txt";
=====================================
src/event_detector.cpp
=====================================
@@ -21,9 +21,8 @@ const EventDetector::Params EventDetector::PRMS_DEF = {
threshold2 : 9.0,
peak_height : 0.2,
- //TODO need to adjust once calibration figured out
min_mean : 0,
- max_mean : 4000
+ max_mean : 400
};
EventDetector::EventDetector(Params prms) :
=====================================
src/event_profiler.hpp
=====================================
@@ -126,34 +126,48 @@ class EventProfiler {
return next_evt_.mean;
}
- //#ifdef PYBIND
-
- //#define PY_CHUNK_METH(P) c.def(#P, &Chunk::P);
- //#define PY_CHUNK_PROP(P) c.def_property(#P, &Chunk::get_##P, &Chunk::set_##P);
- //#define PY_CHUNK_RPROP(P) c.def_property_readonly(#P, &Chunk::get_##P);
-
- //static void pybind_defs(pybind11::class_<Chunk> &c) {
- // c.def(pybind11::init<
- // const std::string &, //id,
- // u16, u32, u64, //channel, number, start
- // const std::string &, //dtype
- // const std::string & //raw_str
- // >());
- // c.def(pybind11::init<
- // const std::string &, //id,
- // u16, u32, u64, //channel, number, start
- // const std::vector<float> &, //raw_data,
- // u32, u32 //raw_st, raw_len
- // >());
- // PY_CHUNK_METH(pop);
- // PY_CHUNK_METH(swap);
- // PY_CHUNK_METH(empty);
- // PY_CHUNK_METH(print);
- // PY_CHUNK_METH(size);
- // PY_CHUNK_RPROP(channel);
- // PY_CHUNK_RPROP(number);
- //}
+ std::vector<bool> get_full_mask(const std::vector<Event> &events) {
+ std::vector<bool> mask;
+ mask.reserve(events.size());
+
+ reset();
+ for (auto &e : events) {
+ add_event(e);
+ if (is_full()) {
+ mask.push_back(to_mask_ == 0);
+ }
+ }
+
+ while (mask.size() < events.size()) {
+ if (to_mask_ == 0) {
+ mask.push_back(true);
+ } else {
+ mask.push_back(false);
+ to_mask_--;
+ }
+ }
-};
+ return mask;
+ }
+ #ifdef PYBIND
+
+ #define PY_EVENT_PROFILER_METH(P) c.def(#P, &EventProfiler::P);
+ #define PY_EVENT_PROFILER_PROP(P) c.def_property(#P, &EventProfiler::get_##P, &EventProfiler::set_##P);
+ #define PY_EVENT_PROFILER_RPROP(P) c.def_property_readonly(#P, &EventProfiler::get_##P);
+
+ static void pybind_defs(pybind11::class_<EventProfiler> &c) {
+ c.def(pybind11::init());
+ PY_EVENT_PROFILER_METH(reset)
+ PY_EVENT_PROFILER_METH(set_norm)
+ PY_EVENT_PROFILER_METH(add_event)
+ PY_EVENT_PROFILER_METH(is_full)
+ PY_EVENT_PROFILER_METH(event_ready)
+ PY_EVENT_PROFILER_METH(anno_event)
+ PY_EVENT_PROFILER_METH(next_mean)
+ PY_EVENT_PROFILER_METH(get_full_mask)
+ }
+ #endif
+
+};
#endif
=====================================
src/map_pool_ord.cpp
=====================================
@@ -63,7 +63,6 @@ std::vector<Paf> MapPoolOrd::update() {
channels_empty_ = true;
-
for (u32 i = 0; i < channels_.size(); i++) {
if (channels_[i].empty()) continue;
channels_empty_ = false;
@@ -74,10 +73,11 @@ std::vector<Paf> MapPoolOrd::update() {
continue;
}
+
//Get next chunk
ReadBuffer &r = channels_[i].front();
Chunk chunk = r.get_chunk(chunk_idx_[i]);
-
+
//Try adding to pool
//If sucessfful, move to next chunk
if (pool_.try_add_chunk(chunk)) {
=====================================
src/mapper.cpp
=====================================
@@ -35,12 +35,12 @@ Mapper::Params Mapper::PRMS {
max_events : 30000,
max_stay_frac : 0.5,
min_seed_prob : -3.75,
- evt_buffer_len : 6000,
evt_batch_size : 5,
evt_timeout : 10.0,
chunk_timeout : 4000.0,
bwa_prefix : "",
idx_preset : "default",
+ model_path : "",
seed_prms : SeedTracker::PRMS_DEF,
norm_prms : Normalizer::PRMS_DEF,
event_prms : EventDetector::PRMS_DEF,
@@ -66,6 +66,7 @@ u32 Mapper::PATH_TAIL_MOVE = 0;
Mapper::Mapper() :
evdt_(PRMS.event_prms),
evt_prof_(PRMS.evt_prof_prms),
+ norm_(PRMS.norm_prms),
seed_tracker_(PRMS.seed_prms),
state_(State::INACTIVE) {
@@ -109,6 +110,10 @@ void Mapper::load_static() {
if (fmi.is_loaded()) return;
+ if (!PRMS.model_path.empty()) {
+ model = PoreModel<KLEN>(PRMS.model_path, true);
+ }
+
fmi.load_index(PRMS.bwa_prefix);
if (!fmi.is_loaded()) {
std::cerr << "Error: failed to load BWA index\n";
@@ -218,6 +223,7 @@ void Mapper::reset() {
last_chunk_ = false;
state_ = State::MAPPING;
norm_.skip_unread();
+ //norm_.reset();
seed_tracker_.reset();
evdt_.reset();
@@ -375,10 +381,12 @@ bool Mapper::chunk_mapped() {
bool Mapper::map_chunk() {
wait_time_ += map_timer_.lap();
- if (reset_ || chunk_timer_.get() > PRMS.chunk_timeout) {
+ if (reset_ ||
+ chunk_timer_.get() > PRMS.chunk_timeout ||
+ event_i_ >= PRMS.max_events) {
+
set_failed();
read_.loc_.set_ended();
- //std::cerr << "# END timer or reset\n";
return true;
} else if (norm_.empty() &&
@@ -646,7 +654,7 @@ bool Mapper::map_next() {
#endif
}
- dbg_conf_out();
+ //dbg_conf_out();
//Update event index
event_i_++;
@@ -897,11 +905,11 @@ void Mapper::dbg_open_all() {
<< "win_mask\n";
#endif
- #ifdef DEBUG_CONFIDENCE
- dbg_open(conf_out_, "_conf.tsv");
- conf_out_ << "top_conf\t"
- << "mean_conf\n";
- #endif
+ //#ifdef DEBUG_CONFIDENCE
+ //dbg_open(conf_out_, "_conf.tsv");
+ //conf_out_ << "top_conf\t"
+ // << "mean_conf\n";
+ //#endif
dbg_opened_ = true;
}
@@ -936,26 +944,26 @@ void Mapper::dbg_close_all() {
if (events_out_.is_open()) events_out_.close();
#endif
- #ifdef DEBUG_CONFIDENCE
- if (conf_out_.is_open()) conf_out_.close();
- #endif
+ ///#ifdef DEBUG_CONFIDENCE
+ ///if (conf_out_.is_open()) conf_out_.close();
+ ///#endif
dbg_opened_ = false;
}
#endif
}
-void Mapper::dbg_conf_out() {
- #ifdef DEBUG_CONFIDENCE
- if (seed_tracker_.empty() || seed_tracker_.get_top_conf() == 0) return;
- conf_out_ << evt_prof_.mask_idx_map_[event_i_] << "\t"
- << seed_tracker_.get_best().id_ << "\t"
- << seed_tracker_.get_top_conf() << "\t"
- << seed_tracker_.get_mean_conf() << "\n";
-
- conf_out_.flush();
- #endif
-}
+//void Mapper::dbg_conf_out() {
+// #ifdef DEBUG_CONFIDENCE
+// if (seed_tracker_.empty() || seed_tracker_.get_top_conf() == 0) return;
+// conf_out_ << evt_prof_.mask_idx_map_[event_i_] << "\t"
+// << seed_tracker_.get_best().id_ << "\t"
+// << seed_tracker_.get_top_conf() << "\t"
+// << seed_tracker_.get_mean_conf() << "\n";
+//
+// conf_out_.flush();
+// #endif
+//}
void Mapper::dbg_events_out() {
#ifdef DEBUG_EVENTS
@@ -1007,14 +1015,6 @@ void Mapper::dbg_seeds_out(
u64 ref_st = 0;
fmi.translate_loc(sa_half, rf_name, ref_st);
- if (ref_st == 0) {
- std::cerr << rf_name << "\t"
- << sa_start << "\t"
- << sa_half << "\t"
- << static_cast<int>(path.type_head()) << "\t"
- << static_cast<int>(path.type_tail()) << "\n";
- }
-
seeds_out_ << rf_name << "\t"
<< ref_st << "\t"
<< (ref_st + ref_len) << "\t"
=====================================
src/mapper.hpp
=====================================
@@ -58,13 +58,13 @@ class Mapper {
float min_seed_prob;
//realtime only
- u32 evt_buffer_len;
u16 evt_batch_size;
float evt_timeout;
float chunk_timeout;
std::string bwa_prefix;
std::string idx_preset;
+ std::string model_path;
SeedTracker::Params seed_prms;
Normalizer::Params norm_prms;
=====================================
src/pore_model.hpp
=====================================
@@ -105,7 +105,7 @@ class PoreModel {
//TODO: clean up IO
//maybe load from toml and/or header file
//make scripts for model to toml and header?
- PoreModel(std::string model_fname, bool cmpl) : PoreModel () {
+ PoreModel(const std::string &model_fname, bool cmpl) : PoreModel () {
std::ifstream model_in(model_fname);
@@ -190,6 +190,7 @@ class PoreModel {
#define PY_PORE_MODEL_METH(P) c.def(#P, &PoreModel<KLEN>::P);
static void pybind_defs(pybind11::class_<PoreModel<KLEN>> &c) {
+ c.def(pybind11::init<const std::string &, bool>());
PY_PORE_MODEL_METH(match_prob);
PY_PORE_MODEL_METH(get_means_mean);
PY_PORE_MODEL_METH(get_means_stdv);
=====================================
src/pybinder.cpp
=====================================
@@ -6,6 +6,7 @@
#include "realtime_pool.hpp"
#include "client_sim.hpp"
#include "model_r94.inl"
+#include "dtw.hpp"
namespace py = pybind11;
using namespace pybind11::literals;
@@ -44,6 +45,9 @@ PYBIND11_MODULE(_uncalled, m) {
py::class_<EventDetector> event_detector(m, "EventDetector");
EventDetector::pybind_defs(event_detector, event);
+ py::class_<EventProfiler> event_profiler(m, "EventProfiler");
+ EventProfiler::pybind_defs(event_profiler);
+
py::class_<Normalizer> norm(m, "Normalizer");
Normalizer::pybind_defs(norm);
@@ -65,5 +69,24 @@ PYBIND11_MODULE(_uncalled, m) {
m.def("seq_to_kmers", &seq_to_kmers<KLEN>);
m.def("kmer_neighbor", &kmer_neighbor<KLEN>);
+ //py::class_<DTW> dtw(m, "DTW");
+ //DTW::pybind_defs(dtw);
+
+ py::class_<DTWr94p> dtw_r94p(m, "DTWr94p");
+ DTWr94p::pybind_defs(dtw_r94p);
+
+ py::class_<DTWr94d> dtw_r94d(m, "DTWr94d");
+ DTWr94d::pybind_defs(dtw_r94d);
+
+ py::class_<DTWParams>dtwp(m, "DTWParams");
+ dtwp.def_readwrite("dw", &DTWParams::dw);
+ dtwp.def_readwrite("hw", &DTWParams::hw);
+ dtwp.def_readwrite("vw", &DTWParams::vw);
+ m.attr("DTW_EVENT_GLOB") = py::cast(DTW_EVENT_GLOB);
+ m.attr("DTW_RAW_GLOB") = py::cast(DTW_RAW_GLOB);
+ m.attr("DTW_EVENT_QSUB") = py::cast(DTW_EVENT_QSUB);
+ m.attr("DTW_EVENT_RSUB") = py::cast(DTW_EVENT_RSUB);
+ m.attr("DTW_RAW_QSUB") = py::cast(DTW_EVENT_QSUB);
+ m.attr("DTW_RAW_RSUB") = py::cast(DTW_EVENT_RSUB);
}
=====================================
src/range.cpp
=====================================
@@ -23,11 +23,11 @@
#include "range.hpp"
-size_t max(size_t a, size_t b) {
+u64 max(u64 a, u64 b) {
return a > b ? a : b;
}
-size_t min(size_t a, size_t b) {
+u64 min(u64 a, u64 b) {
return a < b ? a : b;
}
@@ -35,7 +35,7 @@ Range::Range(const Range &prev)
: start_(prev.start_),
end_(prev.end_) {}
-Range::Range(size_t start, size_t end) : start_(start), end_(end) {}
+Range::Range(u64 start, u64 end) : start_(start), end_(end) {}
Range::Range() : start_(1), end_(0) {}
@@ -46,7 +46,7 @@ bool Range::intersects(const Range &q) const {
!(q.start_ > end_ || q.end_ < start_);
}
-size_t Range::length() const {
+u64 Range::length() const {
return end_ - start_ + 1;
}
=====================================
src/read_buffer.cpp
=====================================
@@ -237,7 +237,7 @@ ReadBuffer::ReadBuffer(const hdf5_tools::File &file,
//full_signal_.assign(int_data.begin(), int_data.end());
for (u16 raw : int_data) {
- float calibrated = (cal_range * raw / cal_digit) + cal_offset;
+ float calibrated = cal_range * (raw + cal_offset) / cal_digit;
full_signal_.push_back(calibrated);
}
=====================================
src/uncalled_map_ord.cpp
=====================================
@@ -60,7 +60,7 @@ int main(int argc, char** argv) {
void load_conf(int argc, char** argv, Conf &conf) {
int opt;
- std::string flagstr = "C:t:n:r:R:c:l:s:w:";
+ std::string flagstr = "C:t:n:r:R:c:l:s:w:p:";
#ifdef DEBUG_OUT
flagstr += "D:";
@@ -78,6 +78,7 @@ void load_conf(int argc, char** argv, Conf &conf) {
FLAG_TO_CONF('l', std::string, read_list)
FLAG_TO_CONF('s', atof, win_stdv_min)
FLAG_TO_CONF('w', atof, win_len)
+ FLAG_TO_CONF('p', std::string, idx_preset)
#ifdef DEBUG_OUT
FLAG_TO_CONF('D', std::string, dbg_prefix);
#endif
=====================================
uncalled/__init__.py
=====================================
@@ -1,5 +1,2 @@
from _uncalled import *
-#from uncalled.minknow_client import Client
-from uncalled import index, pafstats, sim_utils, args, debug
-
-#TODO further flatten module structure, then fix uncalled script
+from uncalled import index, pafstats, sim_utils, args, debug, minknow_client
=====================================
uncalled/args.py
=====================================
@@ -159,18 +159,6 @@ def add_bwa_opt(p, conf):
)
def add_ru_opts(p, conf):
- #TODO: selectively enrich or deplete refs in index
- p.add_argument(
- "-c", "--max-chunks",
- type=int, default=conf.max_chunks, required=True,
- help="Will give up on a read after this many chunks have been processed. Only has effect when --unblock is set"
- )
- p.add_argument(
- "--chunk-time",
- type=float, default=1, required=False,
- help="Length of chunks in seconds"
- )
-
modes = p.add_mutually_exclusive_group(required=True)
modes.add_argument(
"-D", "--deplete", action='store_const',
@@ -289,6 +277,16 @@ def add_map_opts(p, conf):
type=int, default=conf.max_events,
help="Will give up on a read after this many events have been processed"
)
+ p.add_argument(
+ "-c", "--max-chunks",
+ type=int, default=conf.max_chunks,
+ help="Will give up on a read after this many chunks have been processed. Only has effect when --unblock is set"
+ )
+ p.add_argument(
+ "--chunk-time",
+ type=float, default=1, required=False,
+ help="Length of chunks in seconds"
+ )
def load_conf(argv):
conf = unc.Conf()
@@ -303,4 +301,4 @@ def load_conf(argv):
if hasattr(conf, a):
setattr(conf, a, v)
- return conf, args
+ return parser, conf, args
=====================================
uncalled/conf/defaults.toml
=====================================
@@ -44,11 +44,13 @@ max_paths = 10000
max_stay_frac = 0.5
min_seed_prob = -3.75
-evt_buffer_len = 6000
evt_batch_size = 5
evt_timeout = 1000000.0
max_chunk_wait = 4000000.0
+[normalizer]
+len = 6000
+
[seed_tracker]
min_aln_len = 25
min_mean_conf = 6.00
=====================================
uncalled/debug.py
=====================================
@@ -4,8 +4,6 @@ import sys, os
import numpy as np
import argparse
import bisect
-from scipy.stats import linregress
-from file_read_backwards import FileReadBackwards
from collections import defaultdict
import re
@@ -115,7 +113,7 @@ class DebugParser:
else:
self.min_ref = -mm2_paf.rf_en
else:
- self.min_ref = None
+ self.min_ref = self.ref_name = None
self.max_ref = None
#TODO clean these up
@@ -140,12 +138,17 @@ class DebugParser:
self.parse_bc_aln(bce_moves)
self.max_path_fm = max_path_fm
+ #TODO handle
if max_path_fm > 0:
if self.idx is None:
sys.stderr.write("Error: must include BWA index to include paths with FM overlaps\n")
sys.exit(1)
- if self.conf_clust.fwd:
+ #if self.conf_clust.fwd:
+ fwd = (mm2_paf is not None and mm2_paf.is_fwd or
+ mm2_paf is None and self.conf_clust.fwd)
+
+ if fwd:
st = self.min_ref
en = self.max_ref
else:
@@ -155,7 +158,7 @@ class DebugParser:
fwd_fms, rev_fms = self.idx.range_to_fms(
self.ref_name, st, en
)
- fms = fwd_fms if self.mm2_paf.is_fwd else rev_fms
+ fms = fwd_fms if fwd else rev_fms
self.fm_to_ref = {fms[i] : i for i in range(len(fms))}
self.range_fms = np.sort(fms)
@@ -216,9 +219,7 @@ class DebugParser:
continue
if self.min_evt is None:
- print(en, self.min_samp)
self.min_evt = evt
- print(self.min_evt, "MIN")
if self.max_samp is not None and st > self.max_samp:
break
@@ -238,7 +239,7 @@ class DebugParser:
if self.max_evt is None or self.max_evt >= evt:
self.max_evt = evt
- if self.max_samp is None:
+ if self.max_samp is None or st+ln < self.max_samp:
self.max_samp = st+ln
self.max_chunk = (self.max_samp-1) // CHUNK_LEN
@@ -312,8 +313,12 @@ class DebugParser:
if self.max_clust is not None:
conf_clust = self.max_clust
self.conf_evt = conf_clust.evts[-1]
+ print("tHERE", self.max_evt)
else:
self.conf_evt = self.max_evt
+ print("HERE", self.max_evt)
+ else:
+ print("MAPPED?")
self._set_conf_clust(conf_clust)
self._parse_expired(clust_ids)
@@ -327,6 +332,8 @@ class DebugParser:
if cc is None: return
+ print(cc.evts, self.conf_evt)
+
self.conf_idx = np.searchsorted(cc.evts, self.conf_evt, side='right')-1
self.conf_len = cc.lens[self.conf_idx]
@@ -361,7 +368,7 @@ class DebugParser:
if self.max_ref is None:
self.max_ref = ren
if self.ref_name is None:
- self.ref_name = cc.ref
+ self.ref_name = cc.rf
#finds clusters that should expire
@@ -386,10 +393,6 @@ class DebugParser:
head_tabs = paths_fwd.readline().split()
C = {head_tabs[i] : i for i in range(len(head_tabs))}
- #paths_fwd.close()
- #paths_rev = FileReadBackwards(self.paths_fname)
- #for line in paths_rev:
-
for line in paths_fwd:
tabs = line.split()
if tabs[0] == head_tabs[0]: continue
@@ -510,6 +513,8 @@ class DebugParser:
self.bce_samps = np.array(bce_samps)
self.bce_refs = np.array(bce_refs) - BCE_K + 1
self.max_ref = self.min_ref + max(bce_refs)
+
+ self.bc_loaded = True
def _cig_query_to_refs(self, paf):
View it on GitLab: https://salsa.debian.org/med-team/uncalled/-/commit/62b3012c245df085bf58a8e786631b98dbfe36db
--
View it on GitLab: https://salsa.debian.org/med-team/uncalled/-/commit/62b3012c245df085bf58a8e786631b98dbfe36db
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20230713/61383026/attachment-0001.htm>
More information about the debian-med-commit
mailing list