[med-svn] [Git][med-team/atropos][upstream] 4 commits: New upstream version 1.1.22+dfsg
Steffen Möller
gitlab at salsa.debian.org
Thu Nov 21 16:08:22 GMT 2019
Steffen Möller pushed to branch upstream at Debian Med / atropos
Commits:
276ab632 by Steffen Moeller at 2019-11-21T15:03:19Z
New upstream version 1.1.22+dfsg
- - - - -
19ac1406 by Steffen Moeller at 2019-11-21T15:09:00Z
New upstream version 1.1.22+dfsg
- - - - -
97f343a5 by Steffen Moeller at 2019-11-21T15:16:29Z
New upstream version 1.1.22+dfsg
- - - - -
dbfa561b by Steffen Moeller at 2019-11-21T15:24:31Z
New upstream version 1.1.22+dfsg
- - - - -
10 changed files:
- CHANGES.md
- Makefile
- README.md
- atropos/_version.py
- atropos/align/__init__.py
- atropos/commands/detect/__init__.py
- doc/conf.py
- doc/guide.rst
- doc/index.rst
- setup.py
Changes:
=====================================
CHANGES.md
=====================================
@@ -1,5 +1,10 @@
# Changes
+v1.1.22 (dev)
+-------------
+* Documentation fixes (#79)
+* Fix for index error when running `detect` command (#80)
+
v1.1.21 (2018.11.24)
--------------------
* Added long options for paired adapter parameters.
=====================================
Makefile
=====================================
@@ -19,8 +19,7 @@ test:
$(TEST)
docs:
- make -C docs api
- make -C docs html
+ make -C doc html
readme:
pandoc --from=markdown --to=rst --output=README.rst README.md
@@ -58,7 +57,7 @@ release:
$(TEST)
python setup.py sdist bdist_wheel
# release
- python setup.py sdist upload
+ python setup.py sdist upload -r pypi
git push origin --tags
$(github_release)
$(docker)
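
(For reference: the added -r pypi argument pins the legacy "setup.py upload" command to the [pypi] section of the uploader's ~/.pypirc, which matters when that file defines more than one index server.)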
=====================================
README.md
=====================================
@@ -1,4 +1,4 @@
-[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=master)](https://travis-ci.org/jdidion/atropos])
+[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=1.1)](https://travis-ci.org/jdidion/atropos)
[![PyPi](https://img.shields.io/pypi/v/atropos.svg)](https://pypi.python.org/pypi/atropos)
[![DOI](https://zenodo.org/badge/61393086.svg)](https://zenodo.org/badge/latestdoi/61393086)
@@ -88,7 +88,7 @@ Please cite as:
The results in the paper can be fully reproduced using the workflow defined in the [paper](paper/README.md) directory.
The citation for the original Cutadapt paper is:
-
+
> Marcel Martin. "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet.Journal, 17(1):10-12, May 2011. http://dx.doi.org/10.14806/ej.17.1.200
@@ -127,12 +127,15 @@ The citation for the original Cutadapt paper is:
* Provide option for RNA-seq data that will trim polyA sequence.
* Add formal config file support (#53)
* Automate crash reporting using [sentry](https://github.com/getsentry/raven-python).
+* Look at [NGMerge] for improving read merging: https://github.com/harvardinformatics/NGmerge
+* Look at replacing pysam with [pybam](https://github.com/JohnLonginotto/pybam)
### 1.4
* Currently, InsertAligner requires a single 3' adapter for each end. Adapter trimming will later be generalized so that A) the InsertAligner can handle multiple matched pairs of adapters and/or B) multiple different aligners can be used for different adapters.
* Integrate with [AdapterBase](https://github.com/NCBI-Hackathons/OnlineAdapterDatabase) for improved matching of detected contaminants to known adapters, automated trimming of datasets with known adapters, and (opt-in) submission of adapter information for novel datasets.
* Migrate to seqio (https://github.com/jdidion/seqio) for reading/writing sequence files.
+ * Also look at using HTSeq instead: https://github.com/simon-anders/htseq
* General-purpose read filtering based on read ID: https://github.com/marcelm/cutadapt/issues/107.
* Currently, SAM/BAM input files must be name sorted; add an option to 1) pre-sort reads inline using samtools or sambamba, or 2) cache each read in memory until its mate is found.
@@ -169,8 +172,10 @@ The citation for the original Cutadapt paper is:
### 2.0
* Simplification of command line options, perhaps by further splitting functionality up into different sub-commands, but also by more intelligent choices for default option values based on context.
+
* Consider adding additional report formats
- * https://github.com/marcelm/cutadapt/issues/112
+* Add fuzzy matching: https://github.com/Daniel-Liu-c0deb0t/Java-Fuzzy-Search
+* https://github.com/marcelm/cutadapt/issues/112
* Performance enhancements using
* http://numba.pydata.org/
* https://github.com/undefx/vecpy
=====================================
atropos/_version.py
=====================================
@@ -23,8 +23,8 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
- git_refnames = " (tag: 1.1.21)"
- git_full = "1b7273b082551e80652a45df2736e5ffff8ce7f9"
+ git_refnames = " (tag: 1.1.22)"
+ git_full = "2b15c778f0ccf1d0fb753e4334fa6dc0048a9ee6"
keywords = {"refnames": git_refnames, "full": git_full}
return keywords
=====================================
atropos/align/__init__.py
=====================================
@@ -212,6 +212,12 @@ MatchInfo = namedtuple("MatchInfo", (
# * http://www.combio.pl/alfree/tools/
# * http://www.combio.pl/alfree
# * http://bioinformatics.org.au/tools/decaf+py/
+# 12. https://github.com/sdu-hpcl/BGSA
+# 13. Train a NN to approximate the pairwise alignment distance
+# https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty887/5140215?redirectedFrom=fulltext
+# 14. Nucl2vec: https://github.com/prakharg24/Nucl2vec (local alignment only - could
+# it be adapted to semi-global?)
+# * Can re-implement in SpaCy? https://spacy.io/usage/vectors-similarity
class InsertAligner(object):
"""Implementation of an insert matching algorithm.
=====================================
atropos/commands/detect/__init__.py
=====================================
@@ -8,7 +8,7 @@ from atropos.align import Aligner, SEMIGLOBAL
from atropos.commands.base import (
BaseCommandRunner, Pipeline, SingleEndPipelineMixin, PairedEndPipelineMixin)
from atropos.util import (
- reverse_complement, sequence_complexity, enumerate_range, run_interruptible)
+ reverse_complement, sequence_complexity, run_interruptible)
# TODO: Test whether using rc=True in parse_known_contaminants is as fast
# and as accurate as testing both the forward and reverse complement
@@ -22,8 +22,8 @@ from atropos.util import (
# strategy is to use CountTable and, upon each kmer addition, test whether its
# count exceeds some threshold (perhaps by maintaining a running distribution of
# all kmer counts). These high-abundance kmers could then be profiled using a
-# second pass. Another strategy is to compute summary metrics for reads based on
-# kmer abundance, as per
+# second pass. Another strategy is to compute summary metrics for reads based on
+# kmer abundance, as per
# http://khmer-recipes.readthedocs.io/en/latest/001-extract-reads-by-coverage/index.html
# Then take reads in the furthest-right peak and extract common sequences from
# those.
@@ -51,7 +51,7 @@ class CommandRunner(BaseCommandRunner):
"""
name = 'detect'
-
+
def __call__(self):
kmer_size = self.kmer_size or 12
n_reads = self.max_reads
@@ -60,7 +60,7 @@ class CommandRunner(BaseCommandRunner):
known_contaminants = None
if include != 'unknown':
known_contaminants = self.load_known_adapters()
-
+
detector = self.detector
if not detector:
if known_contaminants and include == 'known':
@@ -69,9 +69,9 @@ class CommandRunner(BaseCommandRunner):
detector = 'heuristic'
else:
detector = 'khmer'
-
+
detector_args = dict(known_contaminants=known_contaminants)
-
+
if detector == 'known':
logging.getLogger().debug(
"Detecting contaminants using the known-only algorithm")
@@ -88,34 +88,37 @@ class CommandRunner(BaseCommandRunner):
logging.getLogger().debug(
"Detecting contaminants using the kmer-based algorithm")
detector_class = KhmerDetector
-
+ else:
+ raise ValueError("Invalid value for 'detector': {}".format(detector))
+
summary_args = dict(
- kmer_size=kmer_size, n_reads=n_reads,
+ kmer_size=kmer_size, n_reads=n_reads,
overrep_cutoff=overrep_cutoff, include=include,
past_end_bases=self.past_end_bases)
detector_args.update(summary_args)
-
+
if self.paired:
detector = PairedDetector(detector_class, **detector_args)
else:
detector = detector_class(**detector_args)
-
+
self.summary['detect'] = summary_args
if known_contaminants:
self.summary['detect']['known_contaminants'] = \
known_contaminants.summarize()
-
+
logging.getLogger().info(
"Detecting adapters and other potential contaminant "
"sequences based on %d-mers in %d reads", kmer_size, n_reads)
-
+
# currently only single-threaded operation is supported
self.summary.update(mode='serial', threads=1)
return run_interruptible(detector, self, raise_on_error=True)
+
class Match(object):
"""A contaminant match.
-
+
Args:
seq_or_contam: The matched sequence.
count: The number of matches.
@@ -143,34 +146,34 @@ class Match(object):
self.longest_match = None
if reads:
self.set_longest_match(reads)
-
+
def __len__(self):
return len(self.seq)
-
+
def __repr__(self):
if self.is_known:
return "{} => {} ({}))".format(
self.seq, self.names, self.known_seqs)
else:
return self.seq
-
+
@property
def seq_complexity(self):
"""The complexity of the sequence (0=homopolymer, 2=random).
"""
return sequence_complexity(self.seq)
-
+
@property
def count_is_frequency(self):
"""Whether `self.count` represents a frequency.
"""
return isinstance(self.count, float)
-
+
def set_contaminant(self, contam, match_frac, match_frac2=None):
"""Set the known contaminant from a :class:`ContaminantMatcher`.
"""
self.set_known(contam.names, [contam.seq], match_frac, match_frac2)
-
+
def set_known(self, names, seqs, match_frac, match_frac2=None):
"""Set the known contaminant.
"""
@@ -178,13 +181,13 @@ class Match(object):
self.known_seqs = seqs
self.match_frac = match_frac
self.match_frac2 = match_frac2
-
+
@property
def is_known(self):
"""Whether the matching sequence is a known contaminant.
"""
return self.known_seqs is not None
-
+
def set_longest_match(self, sequences):
"""Set the longest matching sequence from a set of sequences.
"""
@@ -193,7 +196,7 @@ class Match(object):
seqlen = len(self.seq) - idx
if self.longest_match is None or self.longest_match[1] < seqlen:
self.longest_match = (seq[idx:], seqlen)
-
+
def estimate_abundance(self, read_sequences):
"""Determine whether this match's sequence is within 'seq' by simple
exact string comparison.
@@ -201,7 +204,7 @@ class Match(object):
self.abundance = sum(
1 for read_seq in read_sequences
if self.seq in read_seq)
-
+
def summarize(self):
summary = dict(
longest_kmer=self.seq,
@@ -224,9 +227,10 @@ class Match(object):
known_seqs=self.known_seqs)
return summary
+
class ContaminantMatcher(object):
"""Matches a known contaminant against other sequences.
-
+
Args:
seq: The contaminant sequence.
names: Sequence of names for the contaminant.
@@ -241,14 +245,14 @@ class ContaminantMatcher(object):
self.n_kmers = len(self.kmers)
self.kmer_size = kmer_size
self.matches = 0
-
+
def match(self, seq, seqrc):
"""Returns (num_matches, num_contam_kmers, num_seq_kmers).
-
+
Args:
seq: The sequence to match.
seqrc: The reverse complement of `seq`.
-
+
Returns:
Tuple (f1, f2, seq), where f1 is the fraction of contaminant kmers
that match, f2 is the fraction of sequence kmers that match, and
@@ -258,12 +262,12 @@ class ContaminantMatcher(object):
seq[i:(i+self.kmer_size)]
for i in range(len(seq) - self.kmer_size + 1))
fw_matches = float(len(self.kmers & fw_kmers))
-
+
rv_kmers = set(
seqrc[i:(i+self.kmer_size)]
for i in range(len(seqrc) - self.kmer_size + 1))
rv_matches = float(len(self.kmers & rv_kmers))
-
+
if fw_matches >= rv_matches:
n_matches = fw_matches
kmers = fw_kmers
@@ -272,22 +276,23 @@ class ContaminantMatcher(object):
n_matches = rv_matches
kmers = rv_kmers
compare_seq = seqrc
-
+
self.matches += n_matches
match_frac1 = match_frac2 = 0
if self.n_kmers > 0:
match_frac1 = n_matches / self.n_kmers
if len(kmers) > 0:
match_frac2 = n_matches / len(kmers)
- return (match_frac1, match_frac2, compare_seq)
+ return match_frac1, match_frac2, compare_seq
+
def create_contaminant_matchers(contaminants, kmer_size):
"""Create :class:`ContaminantMatcher`s from sequences.
-
+
Args:
contaminants: A dict of {seq:names}.
kmer_size: The kmer size.
-
+
Returns:
A list of :class:`ContaminantMatcher`s.
"""
@@ -296,19 +301,20 @@ def create_contaminant_matchers(contaminants, kmer_size):
for seq, names in contaminants.iter_sequences()
]
+
class Detector(SingleEndPipelineMixin, Pipeline):
"""Base class for contaminant detectors.
-
+
Args:
kmer_size: Size of kmers to match.
n_reads: Number of reads to sample.
overrep_cutoff: Degree of overrepresentation required for a kmer to be
considered as a contaminant.
- known_contaminant: :class:`ContaminantMatcher`s to match against.
+ known_contaminants: :class:`ContaminantMatcher`s to match against.
past_end_bases: On Illumina, long runs of A (and sometimes other bases)
can signify that the sequencer has read past the end of a fragment
that is shorter than the read length + adapter length. Those
- bases will be removed from any sequencers before looking for
+ bases will be removed from any sequencers before looking for
matching contaminants.
"""
def __init__(
@@ -329,32 +335,32 @@ class Detector(SingleEndPipelineMixin, Pipeline):
self._past_end_regexp = re.compile(past_end_bases[0])
else:
self._past_end_regexp = re.compile('|'.join(
- base + '{8,}.*|' + base + '{2,}$'
+ base + '{8,}.*|' + base + '{2,}$'
for base in past_end_bases))
-
+
@property
def min_report_freq(self):
"""The minimum contaminant frequency required for a contaminant to be
reported.
"""
raise NotImplementedError()
-
+
def set_read_length(self, record):
assert self._read_length is None
self._read_length = len(record.sequence)
-
+
def handle_records(self, context, records):
if context['size'] == 0:
return
if self._read_length is None:
self.set_read_length(records[0])
super().handle_records(context, records)
-
+
def handle_reads(self, context, read1, read2=None):
seq = self._filter_seq(read1.sequence)
if seq:
self._read_sequences.add(seq)
-
+
def _filter_seq(self, seq):
if sequence_complexity(seq) <= 1.0:
return None
@@ -365,19 +371,19 @@ class Detector(SingleEndPipelineMixin, Pipeline):
if len(seq) < self.kmer_size:
return None
return seq
-
+
def matches(self, **kwargs):
"""Returns the current set of matches.
"""
if self._matches is None or len(kwargs) > 0:
self._filter_and_sort(**kwargs)
return self._matches
-
+
def _filter_and_sort(
- self, min_len=None, min_complexity=1.1, min_match_frac=0.1,
+ self, min_len=None, min_complexity=1.1, min_match_frac=0.1,
limit=20):
"""Identify, filter, and sort contaminants.
-
+
Args:
min_len: Minimum contaminant length.
min_complexity: Minimum sequence complexity.
@@ -386,64 +392,64 @@ class Detector(SingleEndPipelineMixin, Pipeline):
"""
if min_len is None:
min_len = self.kmer_size
-
+
matches = self._get_contaminants()
-
+
for match in matches:
match.estimate_abundance(self._read_sequences)
-
- def _filter(match):
- if match.count < self.min_report_freq:
+
+ def _filter(_match):
+ if _match.count < self.min_report_freq:
logging.getLogger().debug(
"Filtering {} because frequency {} < {}".format(
- match.seq, match.count, self.min_report_freq))
+ _match.seq, _match.count, self.min_report_freq))
return False
- if min_len and len(match) < min_len:
+ if min_len and len(_match) < min_len:
logging.getLogger().debug(
"Filtering {} because it's too short (< {} bp)".format(
- match.seq, min_len))
+ _match.seq, min_len))
print('too short')
return False
- if min_complexity and match.seq_complexity < min_complexity:
+ if min_complexity and _match.seq_complexity < min_complexity:
logging.getLogger().debug(
"Filtering {} because its complexity {} < {}".format(
- match.seq, match.seq_complexity, min_complexity))
+ _match.seq, _match.seq_complexity, min_complexity))
return False
- if self.include == 'known' and not match.is_known:
+ if self.include == 'known' and not _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's not known".format(
- match.seq))
+ _match.seq))
return False
- elif self.include == 'unknown' and match.is_known:
+ elif self.include == 'unknown' and _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's known".format(
- match.seq))
+ _match.seq))
return False
if (
- min_match_frac and match.is_known and
- match.match_frac < min_match_frac):
+ min_match_frac and _match.is_known and
+ _match.match_frac < min_match_frac):
logging.getLogger().debug(
"Filtering {} because its match_frac {} < {}".format(
- match.seq, match.match_frac, min_match_frac))
+ _match.seq, _match.match_frac, min_match_frac))
return False
return True
-
+
matches = list(filter(_filter, matches))
matches.sort(key=lambda x: len(x) * math.log(x.count), reverse=True)
-
+
if limit is not None:
matches = matches[:limit]
-
+
self._matches = matches
-
+
def _get_contaminants(self):
"""Implemention of contaminant matching.
-
+
Returns:
A list of :class:`Match`es.
"""
raise NotImplementedError()
-
+
def finish(self, summary, **kwargs):
super().finish(summary)
summary['detect']['matches'] = ([
@@ -451,6 +457,7 @@ class Detector(SingleEndPipelineMixin, Pipeline):
for match in self.matches(**kwargs)
],)
+
class PairedDetector(PairedEndPipelineMixin, Pipeline):
"""Detector for paired-end reads.
"""
@@ -459,7 +466,7 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
self.read1_detector = detector_class(**kwargs)
self.read2_detector = detector_class(**kwargs)
self._read_length_set = False
-
+
def handle_records(self, context, records):
if context['size'] == 0:
return
@@ -469,11 +476,11 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
self.read2_detector.set_read_length(read2)
self._read_length_set = True
super().handle_records(context, records)
-
+
def handle_reads(self, context, read1, read2):
self.read1_detector.handle_reads(context, read1)
self.read2_detector.handle_reads(context, read2)
-
+
def finish(self, summary, **kwargs):
super().finish(summary)
summary['detect']['matches'] = ([
@@ -484,12 +491,13 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
for match in self.read2_detector.matches(**kwargs)
])
+
class KnownContaminantDetector(Detector):
"""Test known contaminants against reads. This has linear complexity and is
more specific than the khmer matcher, but less specific than the heuristic
matcher. It's also less sensitive since it does not try to detect unknown
contaminants.
-
+
Args:
known_contaminants: List of :class:`ContaminantMatcher`s.
min_kmer_match_frac: Minimum fraction of matching kmers required.
@@ -500,23 +508,23 @@ class KnownContaminantDetector(Detector):
super().__init__(known_contaminants=known_contaminants, **kwargs)
self.min_kmer_match_frac = min_kmer_match_frac
self._min_k = min(len(s) for s in known_contaminants.sequences)
-
+
@property
def min_report_freq(self):
return 0.1
-
+
def _filter_seq(self, seq):
seq = super()._filter_seq(seq)
if seq and len(seq) >= self._min_k:
return seq
return None
-
+
def _get_contaminants(self):
contaminant_matchers = create_contaminant_matchers(
self.known_contaminants, self.kmer_size)
counts = defaultdict(int)
max_match_fracs = defaultdict(int)
-
+
for seq in self._read_sequences:
seqrc = reverse_complement(seq)
for contam in contaminant_matchers:
@@ -525,14 +533,14 @@ class KnownContaminantDetector(Detector):
counts[contam] += 1
if match[0] > max_match_fracs[contam]:
max_match_fracs[contam] = match[0]
-
+
min_count = math.ceil(
self.n_reads * (self._read_length - self._min_k + 1) *
self.overrep_cutoff / float(4**self._min_k))
-
+
return [
Match(
- c[0], match_frac=max_match_fracs[c[0]],
+ c[0], match_frac=max_match_fracs[c[0]],
abundance=float(c[1]) / self.n_reads)
for c in filter(
lambda x: x[1] >= min_count,
@@ -540,44 +548,45 @@ class KnownContaminantDetector(Detector):
)
]
+
class HeuristicDetector(Detector):
"""Use a heuristic iterative algorithm to arrive at likely contaminants.
This is the most accurate algorithm overall, but it has quadratic complexity
and becomes too slow/memory-intensive when n_reads > 50k.
"""
def __init__(
- self, min_frequency=0.001, min_contaminant_match_frac=0.9,
+ self, min_frequency=0.001, min_contaminant_match_frac=0.9,
**kwargs):
super(HeuristicDetector, self).__init__(**kwargs)
self.min_frequency = min_frequency
self.min_contaminant_match_frac = min_contaminant_match_frac
-
+
@property
def min_report_freq(self):
return 0.1 * self.n_reads
-
+
def _get_contaminants(self):
- def _min_count(kmer_size):
+ def _min_count(_kmer_size):
return math.ceil(self.n_reads * max(
self.min_frequency,
- (self._read_length - kmer_size + 1) * self.overrep_cutoff /
- float(4**kmer_size)))
-
+ (self._read_length - _kmer_size + 1) * self.overrep_cutoff /
+ float(4 ** _kmer_size)))
+
kmer_size = self.kmer_size
kmers = defaultdict(lambda: [0, set()])
-
+
for seq in self._read_sequences:
for i in range(len(seq) - kmer_size + 1):
kmer = seq[i:(i+kmer_size)]
kmers[kmer][0] += 1
kmers[kmer][1].add(seq)
-
+
prev = None
cur = {}
results = {}
result_seqs = defaultdict(set)
min_count = _min_count(kmer_size)
-
+
# Identify candidate kmers for increasing values of k
while True:
all_seqs = set()
@@ -585,10 +594,10 @@ class HeuristicDetector(Detector):
if count > min_count:
cur[kmer] = (count, seqs)
all_seqs.update(seqs)
-
+
if len(all_seqs) == 0:
break
-
+
if prev:
for kmer, (count, seqs) in prev.items():
if (
@@ -596,7 +605,7 @@ class HeuristicDetector(Detector):
sequence_complexity(kmer) > 1.0):
results[kmer] = count
result_seqs[kmer].update(seqs)
-
+
kmer_size += 1
kmers = defaultdict(lambda: [0, set()])
for seq in all_seqs:
@@ -604,17 +613,16 @@ class HeuristicDetector(Detector):
kmer = seq[i:(i+kmer_size)]
kmers[kmer][0] += 1
kmers[kmer][1].add(seq)
-
+
min_count = _min_count(kmer_size)
prev = cur
cur = {}
-
+
results = list(results.items())
-
+
# Now merge overlapping sequences by length and frequency to eliminate
# redundancy in the set of candidate kmers.
- results.sort(key=lambda i: len(i[0]) * math.log(i[1]), reverse=True)
- cur = results[0]
+ results.sort(key=lambda r: len(r[0]) * math.log(r[1]), reverse=True)
merged = []
unmerged = []
while len(results) > 1:
@@ -634,76 +642,81 @@ class HeuristicDetector(Detector):
results = unmerged
unmerged = []
results = merged + results
-
+
if len(results) == 0:
return []
-
+
# TODO: For each retained match, pull out the longest sequence that
# matches to have a better shot of identifying long adapters that
# appear in full very infrequently
-
+
# Re-sort by frequency
- results.sort(key=lambda i: i[1], reverse=True)
+ results.sort(key=lambda r: r[1], reverse=True)
# Keep anything that's within 50% of the top hit
# TODO: make this user-configurable?
min_count = int(results[0][1] * 0.5)
results = (x for x in results if x[1] >= min_count)
# Convert to matches
matches = [
- Match(x[0], count=x[1], reads=result_seqs[x[0]])
+ Match(x[0], count=x[1], reads=result_seqs[x[0]])
for x in results]
-
+
if self.known_contaminants:
# Match to known sequences
contaminants = create_contaminant_matchers(
self.known_contaminants, self.kmer_size)
known = {}
unknown = []
-
- def find_best_match(seq, best_matches, best_match_frac):
+
+ def find_best_match(_seq, _best_matches, _best_match_frac):
"""Find best contaminant matches to `seq`.
"""
- seqrc = reverse_complement(seq)
- for contam in contaminants:
- match_frac1, match_frac2, compare_seq = contam.match(
- seq, seqrc)
- if match_frac1 < best_match_frac[0]:
+ seqrc = reverse_complement(_seq)
+
+ for _contam in contaminants:
+ match_frac1, match_frac2, compare_seq = _contam.match(
+ _seq, seqrc)
+ if match_frac1 < _best_match_frac[0]:
continue
if (
- contam.seq in compare_seq or
+ _contam.seq in compare_seq or
align(
- compare_seq, contam.seq,
+ compare_seq, _contam.seq,
self.min_contaminant_match_frac)):
- if (match_frac1 > best_match_frac[0] or (
- match_frac1 == best_match_frac[0] and
- match_frac2 > best_match_frac[1])):
- best_matches = {}
- best_match_frac = (match_frac1, match_frac2)
- best_matches[contam] = (
+ if (
+ match_frac1 > _best_match_frac[0] or (
+ match_frac1 == _best_match_frac[0] and
+ match_frac2 > _best_match_frac[1]
+ )
+ ):
+ _best_matches = {}
+ _best_match_frac = (match_frac1, match_frac2)
+ _best_matches[_contam] = (
match, (match_frac1, match_frac2))
- return (best_matches, best_match_frac)
-
+
+ return _best_matches, _best_match_frac
+
for match in matches:
best_matches, best_match_frac = find_best_match(
match.seq, {}, (self.min_contaminant_match_frac, 0))
-
+
if match.longest_match:
best_matches, best_match_frac = find_best_match(
match.longest_match[0], best_matches, best_match_frac)
-
+
if best_matches:
- for contam, match in best_matches.items():
- if contam not in known or match[1] > known[contam][1]:
- known[contam] = match
+ for contam, _match in best_matches.items():
+ if contam not in known or _match[1] > known[contam][1]:
+ known[contam] = _match
else:
unknown.append(match)
-
+
# resolve many-many relationships
-
+
new_matches = defaultdict(lambda: [])
for contam, (match, match_frac) in known.items():
new_matches[match].append((contam, match_frac))
-
+
known = []
for match, contams in new_matches.items():
if len(contams) == 1:
@@ -720,17 +733,18 @@ class HeuristicDetector(Detector):
match.set_contaminant(contam, *match_frac)
else:
names = set(contam.names)
- seqs = set((contam.seq,))
+ seqs = {(contam.seq,)}
for other_contam in equiv:
names.update(other_contam[0].names)
seqs.add(other_contam[0].seq)
match.set_known(list(names), list(seqs), *match_frac)
known.append(match)
-
+
matches = known + unknown
-
+
return matches
+
class KhmerDetector(Detector):
"""Identify contaminants based on kmer frequency using a fast kmer counting
approach (as implemented in the khmer library). This approach is fast but
@@ -739,7 +753,7 @@ class KhmerDetector(Detector):
@property
def min_report_freq(self):
return 0.0001
-
+
def _get_contaminants(self):
from khmer import Countgraph, khmer_args
# assuming all sequences are same length
@@ -748,43 +762,43 @@ class KhmerDetector(Detector):
countgraph = Countgraph(
self.kmer_size, tablesize, khmer_args.DEFAULT_N_TABLES)
countgraph.set_use_bigcount(True)
-
+
for seq in self._read_sequences:
countgraph.consume_and_tag(seq)
-
+
n_expected = math.ceil(tablesize / float(4**self.kmer_size))
min_count = n_expected * self.overrep_cutoff
if min_count >= 2**16:
raise ValueError(
"The minimum count for an over-represented k-kmer {} is "
"greater than the max khmer count (2^16)".format(min_count))
-
+
candidates = {}
-
+
for tag in countgraph.get_tagset():
count = countgraph.get(tag)
if count >= min_count:
candidates[tag] = count
-
+
if self.known_contaminants:
matches = []
seen = set()
-
- def match(kmer):
+
+ def match(_kmer):
"""Returns the frequency of `kmer` in `candidates`.
"""
- freq = candidates.get(kmer, 0)
+ freq = candidates.get(_kmer, 0)
if freq > 0:
- seen.add(kmer)
+ seen.add(_kmer)
return freq
-
+
for seq, names in self.known_contaminants.iter_sequences():
seqlen = len(seq)
if seqlen < self.kmer_size:
print("Cannot check {}; sequence is shorter than {}".format(
list(names)[0], self.kmer_size))
continue
-
+
n_kmers = seqlen - self.kmer_size + 1
num_matches = 0
match_counts = []
@@ -797,34 +811,36 @@ class KhmerDetector(Detector):
if kmer_count > 0:
num_matches += 1
match_counts.append(kmer_count)
-
+
if num_matches > 0:
# not sure what the correct metric is to use here
overall_count = sum(match_counts) / float(n_kmers)
matches.append(Match(
- seq, count=overall_count / float(tablesize),
+ seq, count=overall_count / float(tablesize),
names=names, match_frac=float(num_matches) / n_kmers))
-
+
# Add remaining tags
for tag in set(candidates.keys()) - seen:
matches.append(Match(
tag, count=candidates[tag] / float(tablesize)))
-
+
else:
matches = [
Match(tag, count=count / float(tablesize))
for tag, count in candidates.items()]
-
+
return matches
+
def align(seq1, seq2, min_overlap_frac=0.9):
"""Align two sequences.
-
+
Args:
- seq1, seq2: The sequences to align.
+ seq1: The second sequence to align.
+ seq2: The first sequence to align.
min_overlap_frac: Minimum fraction of overlapping bases required for a
match.
-
+
Returns:
The matching portion of the sequence.
"""
=====================================
doc/conf.py
=====================================
@@ -46,7 +46,7 @@ master_doc = 'index'
# General information about the project.
project = u'atropos'
-copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2016, John P Didion'
+copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2018, John P Didion'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -208,8 +208,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
- ('index', 'atropos.tex', u'atropos Documentation',
- u'Marcel Martin', 'manual'),
+ ('index', 'atropos.tex', u'Atropos Documentation', u'John Didion', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
@@ -238,8 +237,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
- ('index', 'atropos', u'atropos Documentation',
- [u'Marcel Martin'], 1)
+ ('index', 'atropos', u'Atropos Documentation', [u'John Didion'], 1)
]
# If true, show URL addresses after external links.
@@ -252,9 +250,8 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
- ('index', 'atropos', u'atropos Documentation',
- u'Marcel Martin', 'atropos', 'One line description of project.',
- 'Miscellaneous'),
+ ('index', 'atropos', u'atropos Documentation', u'John Didion', 'atropos',
+ 'General purpose NGS read manipulation tool.', 'Read manipulation'),
]
# Documents to append as an appendix to all manuals.
=====================================
doc/guide.rst
=====================================
@@ -657,10 +657,10 @@ been removed. For example, if the following sequence is in reads.fastq::
ACGTACGTACGTADAP
The following command will first trim the ``ADAP`` part of the adapter
-off the end. Then, since only 4 bases were trimmed, the ``-i 5`` option
+off the end. Then, since only 4 bases were trimmed, the ``-i -5`` option
will cause a 5th base to be removed.
- atropos -A ADAPTER -i 5 -o trimmed.fastq -se reads.fastq
+ atropos -A ADAPTER -i -5 -o trimmed.fastq -se reads.fastq
.. _quality-trimming:
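
Worked through on the example read (assuming ADAPTER matches the trailing ADAP at its 3' end), the corrected command does the following:

    ACGTACGTACGTADAP    input read
    ACGTACGTACGT        after adapter trimming (4 bases removed)
    ACGTACGTACG         after -i -5 removes one more base, for 5 total
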
=====================================
doc/index.rst
=====================================
@@ -1,5 +1,3 @@
-.. include:: ./readme.rst
-
=================
Table of contents
=================
=====================================
setup.py
=====================================
@@ -146,7 +146,7 @@ setup(
url="https://atropos.readthedocs.org/",
description="trim adapters from high-throughput sequencing reads",
long_description=long_description,
- log_description_content_type="text/markdown",
+ long_description_content_type="text/markdown",
license="Original Cutadapt code is under MIT license; improvements and additions are in the Public Domain",
ext_modules=extensions,
packages=find_packages(),
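
The keyword fix above matters because setuptools only warns about unrecognized options, so the misspelled key was silently ignored and PyPI fell back to rendering the long description as plain text. A minimal illustration of the corrected keyword (a generic example, not the full atropos setup.py):

    from setuptools import setup

    setup(
        name="example",
        version="0.1",
        long_description=open("README.md").read(),
        # the misspelled 'log_description_content_type' only produced
        # an "Unknown distribution option" warning and had no effect
        long_description_content_type="text/markdown",
    )
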
View it on GitLab: https://salsa.debian.org/med-team/atropos/compare/c903af9bf83e294c37a75f5eb3c4a06e38c1e2cd...dbfa561b04043199fbaf24ea48f33f0e88285651
--
You're receiving this email because of your account on salsa.debian.org.