[med-svn] [Git][med-team/atropos][upstream] 4 commits: New upstream version 1.1.22+dfsg
Steffen Möller
gitlab at salsa.debian.org
Thu Nov 21 16:08:22 GMT 2019
Steffen Möller pushed to branch upstream at Debian Med / atropos
Commits:
276ab632 by Steffen Moeller at 2019-11-21T15:03:19Z
New upstream version 1.1.22+dfsg
- - - - -
19ac1406 by Steffen Moeller at 2019-11-21T15:09:00Z
New upstream version 1.1.22+dfsg
- - - - -
97f343a5 by Steffen Moeller at 2019-11-21T15:16:29Z
New upstream version 1.1.22+dfsg
- - - - -
dbfa561b by Steffen Moeller at 2019-11-21T15:24:31Z
New upstream version 1.1.22+dfsg
- - - - -
10 changed files:
- CHANGES.md
- Makefile
- README.md
- atropos/_version.py
- atropos/align/__init__.py
- atropos/commands/detect/__init__.py
- doc/conf.py
- doc/guide.rst
- doc/index.rst
- setup.py
Changes:
=====================================
CHANGES.md
=====================================
@@ -1,5 +1,10 @@
# Changes
+v1.1.22 (dev)
+-------------
+* Documentation fixes (#79)
+* Fix for index error when running `detect` command (#80)
+
v1.1.21 (2018.11.24)
--------------------
* Added long options for paired adapter parameters.
=====================================
Makefile
=====================================
@@ -19,8 +19,7 @@ test:
$(TEST)
docs:
- make -C docs api
- make -C docs html
+ make -C doc html
readme:
pandoc --from=markdown --to=rst --output=README.rst README.md
@@ -58,7 +57,7 @@ release:
$(TEST)
python setup.py sdist bdist_wheel
# release
- python setup.py sdist upload
+ python setup.py sdist upload -r pypi
git push origin --tags
$(github_release)
$(docker)
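
(For reference: the added -r pypi argument pins the legacy "setup.py upload" command to the [pypi] section of the uploader's ~/.pypirc, which matters when that file defines more than one index server.)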
=====================================
README.md
=====================================
@@ -1,4 +1,4 @@
-[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=master)](https://travis-ci.org/jdidion/atropos])
+[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=1.1)](https://travis-ci.org/jdidion/atropos)
[![PyPi](https://img.shields.io/pypi/v/atropos.svg)](https://pypi.python.org/pypi/atropos)
[![DOI](https://zenodo.org/badge/61393086.svg)](https://zenodo.org/badge/latestdoi/61393086)
@@ -88,7 +88,7 @@ Please cite as:
The results in the paper can be fully reproduced using the workflow defined in the [paper](paper/README.md) directory.
The citation for the original Cutadapt paper is:
-
+
> Marcel Martin. "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet.Journal, 17(1):10-12, May 2011. http://dx.doi.org/10.14806/ej.17.1.200
@@ -127,12 +127,15 @@ The citation for the original Cutadapt paper is:
* Provide option for RNA-seq data that will trim polyA sequence.
* Add formal config file support (#53)
* Automate crash reporting using [sentry](https://github.com/getsentry/raven-python).
+* Look at [NGMerge] for improving read merging: https://github.com/harvardinformatics/NGmerge
+* Look at replacing pysam with [pybam](https://github.com/JohnLonginotto/pybam)
### 1.4
* Currently, InsertAligner requires a single 3' adapter for each end. Adapter trimming will later be generalized so that A) the InsertAligner can handle multiple matched pairs of adapters and/or B) multiple different aligners can be used for different adapters.
* Integrate with [AdapterBase](https://github.com/NCBI-Hackathons/OnlineAdapterDatabase) for improved matching of detected contaminants to known adapters, automated trimming of datasets with known adapters, and (opt-in) submission of adapter information for novel datasets.
* Migrate to seqio (https://github.com/jdidion/seqio) for reading/writing sequence files.
+ * Also look at using HTSeq instead: https://github.com/simon-anders/htseq
* General-purpose read filtering based on read ID: https://github.com/marcelm/cutadapt/issues/107.
* Currently, SAM/BAM input files must be name sorted; add an option to 1) pre-sort reads inline using samtools or sambamba, or 2) cache each read in memory until its mate is found.
@@ -169,8 +172,10 @@ The citation for the original Cutadapt paper is:
### 2.0
* Simplification of command line options, perhaps by further splitting functionality up into different sub-commands, but also by more intelligent choices for default option values based on context.
+
* Consider adding additional report formats
- * https://github.com/marcelm/cutadapt/issues/112
+* Add fuzzy matching: https://github.com/Daniel-Liu-c0deb0t/Java-Fuzzy-Search
+* https://github.com/marcelm/cutadapt/issues/112
* Performance enhancements using
* http://numba.pydata.org/
* https://github.com/undefx/vecpy
=====================================
atropos/_version.py
=====================================
@@ -23,8 +23,8 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
- git_refnames = " (tag: 1.1.21)"
- git_full = "1b7273b082551e80652a45df2736e5ffff8ce7f9"
+ git_refnames = " (tag: 1.1.22)"
+ git_full = "2b15c778f0ccf1d0fb753e4334fa6dc0048a9ee6"
keywords = {"refnames": git_refnames, "full": git_full}
return keywords
=====================================
atropos/align/__init__.py
=====================================
@@ -212,6 +212,12 @@ MatchInfo = namedtuple("MatchInfo", (
# * http://www.combio.pl/alfree/tools/
# * http://www.combio.pl/alfree
# * http://bioinformatics.org.au/tools/decaf+py/
+# 12. https://github.com/sdu-hpcl/BGSA
+# 13. Train a NN to approximate the pairwise alignment distance
+# https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty887/5140215?redirectedFrom=fulltext
+# 14. Nucl2vec: https://github.com/prakharg24/Nucl2vec (local alignment only - could
+# it be adapted to semi-global?)
+# * Can re-implement in SpaCy? https://spacy.io/usage/vectors-similarity
class InsertAligner(object):
"""Implementation of an insert matching algorithm.
=====================================
atropos/commands/detect/__init__.py
=====================================
@@ -8,7 +8,7 @@ from atropos.align import Aligner, SEMIGLOBAL
from atropos.commands.base import (
BaseCommandRunner, Pipeline, SingleEndPipelineMixin, PairedEndPipelineMixin)
from atropos.util import (
- reverse_complement, sequence_complexity, enumerate_range, run_interruptible)
+ reverse_complement, sequence_complexity, run_interruptible)
# TODO: Test whether using rc=True in parse_known_contaminants is as fast
# and as accurate as testing both the forward and reverse complement
@@ -22,8 +22,8 @@ from atropos.util import (
# strategy is to use CountTable and, upon each kmer addition, test whether its
# count exceeds some threshold (perhaps by maintaining a running distribution of
# all kmer counts). These high-abundance kmers could then be profiled using a
-# second pass. Another strategy is to compute summary metrics for reads based on
-# kmer abundance, as per
+# second pass. Another strategy is to compute summary metrics for reads based on
+# kmer abundance, as per
# http://khmer-recipes.readthedocs.io/en/latest/001-extract-reads-by-coverage/index.html
# Then take reads in the furthest-right peak and extract common sequences from
# those.
@@ -51,7 +51,7 @@ class CommandRunner(BaseCommandRunner):
"""
name = 'detect'
-
+
def __call__(self):
kmer_size = self.kmer_size or 12
n_reads = self.max_reads
@@ -60,7 +60,7 @@ class CommandRunner(BaseCommandRunner):
known_contaminants = None
if include != 'unknown':
known_contaminants = self.load_known_adapters()
-
+
detector = self.detector
if not detector:
if known_contaminants and include == 'known':
@@ -69,9 +69,9 @@ class CommandRunner(BaseCommandRunner):
detector = 'heuristic'
else:
detector = 'khmer'
-
+
detector_args = dict(known_contaminants=known_contaminants)
-
+
if detector == 'known':
logging.getLogger().debug(
"Detecting contaminants using the known-only algorithm")
@@ -88,34 +88,37 @@ class CommandRunner(BaseCommandRunner):
logging.getLogger().debug(
"Detecting contaminants using the kmer-based algorithm")
detector_class = KhmerDetector
-
+ else:
+ raise ValueError("Invalid value for 'detector': {}".format(detector))
+
summary_args = dict(
- kmer_size=kmer_size, n_reads=n_reads,
+ kmer_size=kmer_size, n_reads=n_reads,
overrep_cutoff=overrep_cutoff, include=include,
past_end_bases=self.past_end_bases)
detector_args.update(summary_args)
-
+
if self.paired:
detector = PairedDetector(detector_class, **detector_args)
else:
detector = detector_class(**detector_args)
-
+
self.summary['detect'] = summary_args
if known_contaminants:
self.summary['detect']['known_contaminants'] = \
known_contaminants.summarize()
-
+
logging.getLogger().info(
"Detecting adapters and other potential contaminant "
"sequences based on %d-mers in %d reads", kmer_size, n_reads)
-
+
# currently only single-threaded operation is supported
self.summary.update(mode='serial', threads=1)
return run_interruptible(detector, self, raise_on_error=True)
+
class Match(object):
"""A contaminant match.
-
+
Args:
seq_or_contam: The matched sequence.
count: The number of matches.
@@ -143,34 +146,34 @@ class Match(object):
self.longest_match = None
if reads:
self.set_longest_match(reads)
-
+
def __len__(self):
return len(self.seq)
-
+
def __repr__(self):
if self.is_known:
return "{} => {} ({}))".format(
self.seq, self.names, self.known_seqs)
else:
return self.seq
-
+
@property
def seq_complexity(self):
"""The complexity of the sequence (0=homopolymer, 2=random).
"""
return sequence_complexity(self.seq)
-
+
@property
def count_is_frequency(self):
"""Whether `self.count` represents a frequency.
"""
return isinstance(self.count, float)
-
+
def set_contaminant(self, contam, match_frac, match_frac2=None):
"""Set the known contaminant from a :class:`ContaminantMatcher`.
"""
self.set_known(contam.names, [contam.seq], match_frac, match_frac2)
-
+
def set_known(self, names, seqs, match_frac, match_frac2=None):
"""Set the known contaminant.
"""
@@ -178,13 +181,13 @@ class Match(object):
self.known_seqs = seqs
self.match_frac = match_frac
self.match_frac2 = match_frac2
-
+
@property
def is_known(self):
"""Whether the matching sequence is a known contaminant.
"""
return self.known_seqs is not None
-
+
def set_longest_match(self, sequences):
"""Set the longest matching sequence from a set of sequences.
"""
@@ -193,7 +196,7 @@ class Match(object):
seqlen = len(self.seq) - idx
if self.longest_match is None or self.longest_match[1] < seqlen:
self.longest_match = (seq[idx:], seqlen)
-
+
def estimate_abundance(self, read_sequences):
"""Determine whether this match's sequence is within 'seq' by simple
exact string comparison.
@@ -201,7 +204,7 @@ class Match(object):
self.abundance = sum(
1 for read_seq in read_sequences
if self.seq in read_seq)
-
+
def summarize(self):
summary = dict(
longest_kmer=self.seq,
@@ -224,9 +227,10 @@ class Match(object):
known_seqs=self.known_seqs)
return summary
+
class ContaminantMatcher(object):
"""Matches a known contaminant against other sequences.
-
+
Args:
seq: The contaminant sequence.
names: Sequence of names for the contaminant.
@@ -241,14 +245,14 @@ class ContaminantMatcher(object):
self.n_kmers = len(self.kmers)
self.kmer_size = kmer_size
self.matches = 0
-
+
def match(self, seq, seqrc):
"""Returns (num_matches, num_contam_kmers, num_seq_kmers).
-
+
Args:
seq: The sequence to match.
seqrc: The reverse complement of `seq`.
-
+
Returns:
Tuple (f1, f2, seq), where f1 is the fraction of contaminant kmers
that match, f2 is the fraction of sequence kmers that match, and
@@ -258,12 +262,12 @@ class ContaminantMatcher(object):
seq[i:(i+self.kmer_size)]
for i in range(len(seq) - self.kmer_size + 1))
fw_matches = float(len(self.kmers & fw_kmers))
-
+
rv_kmers = set(
seqrc[i:(i+self.kmer_size)]
for i in range(len(seqrc) - self.kmer_size + 1))
rv_matches = float(len(self.kmers & rv_kmers))
-
+
if fw_matches >= rv_matches:
n_matches = fw_matches
kmers = fw_kmers
@@ -272,22 +276,23 @@ class ContaminantMatcher(object):
n_matches = rv_matches
kmers = rv_kmers
compare_seq = seqrc
-
+
self.matches += n_matches
match_frac1 = match_frac2 = 0
if self.n_kmers > 0:
match_frac1 = n_matches / self.n_kmers
if len(kmers) > 0:
match_frac2 = n_matches / len(kmers)
- return (match_frac1, match_frac2, compare_seq)
+ return match_frac1, match_frac2, compare_seq
+
def create_contaminant_matchers(contaminants, kmer_size):
"""Create :class:`ContaminantMatcher`s from sequences.
-
+
Args:
contaminants: A dict of {seq:names}.
kmer_size: The kmer size.
-
+
Returns:
A list of :class:`ContaminantMatcher`s.
"""
@@ -296,19 +301,20 @@ def create_contaminant_matchers(contaminants, kmer_size):
for seq, names in contaminants.iter_sequences()
]
+
class Detector(SingleEndPipelineMixin, Pipeline):
"""Base class for contaminant detectors.
-
+
Args:
kmer_size: Size of kmers to match.
n_reads: Number of reads to sample.
overrep_cutoff: Degree of overrepresentation required for a kmer to be
considered as a contaminant.
- known_contaminant: :class:`ContaminantMatcher`s to match against.
+ known_contaminants: :class:`ContaminantMatcher`s to match against.
past_end_bases: On Illumina, long runs of A (and sometimes other bases)
can signify that the sequencer has read past the end of a fragment
that is shorter than the read length + adapter length. Those
- bases will be removed from any sequencers before looking for
+ bases will be removed from any sequencers before looking for
matching contaminants.
"""
def __init__(
@@ -329,32 +335,32 @@ class Detector(SingleEndPipelineMixin, Pipeline):
self._past_end_regexp = re.compile(past_end_bases[0])
else:
self._past_end_regexp = re.compile('|'.join(
- base + '{8,}.*|' + base + '{2,}$'
+ base + '{8,}.*|' + base + '{2,}$'
for base in past_end_bases))
-
+
@property
def min_report_freq(self):
"""The minimum contaminant frequency required for a contaminant to be
reported.
"""
raise NotImplementedError()
-
+
def set_read_length(self, record):
assert self._read_length is None
self._read_length = len(record.sequence)
-
+
def handle_records(self, context, records):
if context['size'] == 0:
return
if self._read_length is None:
self.set_read_length(records[0])
super().handle_records(context, records)
-
+
def handle_reads(self, context, read1, read2=None):
seq = self._filter_seq(read1.sequence)
if seq:
self._read_sequences.add(seq)
-
+
def _filter_seq(self, seq):
if sequence_complexity(seq) <= 1.0:
return None
@@ -365,19 +371,19 @@ class Detector(SingleEndPipelineMixin, Pipeline):
if len(seq) < self.kmer_size:
return None
return seq
-
+
def matches(self, **kwargs):
"""Returns the current set of matches.
"""
if self._matches is None or len(kwargs) > 0:
self._filter_and_sort(**kwargs)
return self._matches
-
+
def _filter_and_sort(
- self, min_len=None, min_complexity=1.1, min_match_frac=0.1,
+ self, min_len=None, min_complexity=1.1, min_match_frac=0.1,
limit=20):
"""Identify, filter, and sort contaminants.
-
+
Args:
min_len: Minimum contaminant length.
min_complexity: Minimum sequence complexity.
@@ -386,64 +392,64 @@ class Detector(SingleEndPipelineMixin, Pipeline):
"""
if min_len is None:
min_len = self.kmer_size
-
+
matches = self._get_contaminants()
-
+
for match in matches:
match.estimate_abundance(self._read_sequences)
-
- def _filter(match):
- if match.count < self.min_report_freq:
+
+ def _filter(_match):
+ if _match.count < self.min_report_freq:
logging.getLogger().debug(
"Filtering {} because frequency {} < {}".format(
- match.seq, match.count, self.min_report_freq))
+ _match.seq, _match.count, self.min_report_freq))
return False
- if min_len and len(match) < min_len:
+ if min_len and len(_match) < min_len:
logging.getLogger().debug(
"Filtering {} because it's too short (< {} bp)".format(
- match.seq, min_len))
+ _match.seq, min_len))
print('too short')
return False
- if min_complexity and match.seq_complexity < min_complexity:
+ if min_complexity and _match.seq_complexity < min_complexity:
logging.getLogger().debug(
"Filtering {} because its complexity {} < {}".format(
- match.seq, match.seq_complexity, min_complexity))
+ _match.seq, _match.seq_complexity, min_complexity))
return False
- if self.include == 'known' and not match.is_known:
+ if self.include == 'known' and not _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's not known".format(
- match.seq))
+ _match.seq))
return False
- elif self.include == 'unknown' and match.is_known:
+ elif self.include == 'unknown' and _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's known".format(
- match.seq))
+ _match.seq))
return False
if (
- min_match_frac and match.is_known and
- match.match_frac < min_match_frac):
+ min_match_frac and _match.is_known and
+ _match.match_frac < min_match_frac):
logging.getLogger().debug(
"Filtering {} because its match_frac {} < {}".format(
- match.seq, match.match_frac, min_match_frac))
+ _match.seq, _match.match_frac, min_match_frac))
return False
return True
-
+
matches = list(filter(_filter, matches))
matches.sort(key=lambda x: len(x) * math.log(x.count), reverse=True)
-
+
if limit is not None:
matches = matches[:limit]
-
+
self._matches = matches
-
+
def _get_contaminants(self):
"""Implemention of contaminant matching.
-
+
Returns:
A list of :class:`Match`es.
"""
raise NotImplementedError()
-
+
def finish(self, summary, **kwargs):
super().finish(summary)
summary['detect']['matches'] = ([
@@ -451,6 +457,7 @@ class Detector(SingleEndPipelineMixin, Pipeline):
for match in self.matches(**kwargs)
],)
+
class PairedDetector(PairedEndPipelineMixin, Pipeline):
"""Detector for paired-end reads.
"""
@@ -459,7 +466,7 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
self.read1_detector = detector_class(**kwargs)
self.read2_detector = detector_class(**kwargs)
self._read_length_set = False
-
+
def handle_records(self, context, records):
if context['size'] == 0:
return
@@ -469,11 +476,11 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
self.read2_detector.set_read_length(read2)
self._read_length_set = True
super().handle_records(context, records)
-
+
def handle_reads(self, context, read1, read2):
self.read1_detector.handle_reads(context, read1)
self.read2_detector.handle_reads(context, read2)
-
+
def finish(self, summary, **kwargs):
super().finish(summary)
summary['detect']['matches'] = ([
@@ -484,12 +491,13 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
for match in self.read2_detector.matches(**kwargs)
])
+
class KnownContaminantDetector(Detector):
"""Test known contaminants against reads. This has linear complexity and is
more specific than the khmer matcher, but less specific than the heuristic
matcher. It's also less sensitive since it does not try to detect unknown
contaminants.
-
+
Args:
known_contaminants: List of :class:`ContaminantMatcher`s.
min_kmer_match_frac: Minimum fraction of matching kmers required.
@@ -500,23 +508,23 @@ class KnownContaminantDetector(Detector):
super().__init__(known_contaminants=known_contaminants, **kwargs)
self.min_kmer_match_frac = min_kmer_match_frac
self._min_k = min(len(s) for s in known_contaminants.sequences)
-
+
@property
def min_report_freq(self):
return 0.1
-
+
def _filter_seq(self, seq):
seq = super()._filter_seq(seq)
if seq and len(seq) >= self._min_k:
return seq
return None
-
+
def _get_contaminants(self):
contaminant_matchers = create_contaminant_matchers(
self.known_contaminants, self.kmer_size)
counts = defaultdict(int)
max_match_fracs = defaultdict(int)
-
+
for seq in self._read_sequences:
seqrc = reverse_complement(seq)
for contam in contaminant_matchers:
@@ -525,14 +533,14 @@ class KnownContaminantDetector(Detector):
counts[contam] += 1
if match[0] > max_match_fracs[contam]:
max_match_fracs[contam] = match[0]
-
+
min_count = math.ceil(
self.n_reads * (self._read_length - self._min_k + 1) *
self.overrep_cutoff / float(4**self._min_k))
-
+
return [
Match(
- c[0], match_frac=max_match_fracs[c[0]],
+ c[0], match_frac=max_match_fracs[c[0]],
abundance=float(c[1]) / self.n_reads)
for c in filter(
lambda x: x[1] >= min_count,
@@ -540,44 +548,45 @@ class KnownContaminantDetector(Detector):
)
]
+
class HeuristicDetector(Detector):
"""Use a heuristic iterative algorithm to arrive at likely contaminants.
This is the most accurate algorithm overall, but it has quadratic complexity
and becomes too slow/memory-intensive when n_reads > 50k.
"""
def __init__(
- self, min_frequency=0.001, min_contaminant_match_frac=0.9,
+ self, min_frequency=0.001, min_contaminant_match_frac=0.9,
**kwargs):
super(HeuristicDetector, self).__init__(**kwargs)
self.min_frequency = min_frequency
self.min_contaminant_match_frac = min_contaminant_match_frac
-
+
@property
def min_report_freq(self):
return 0.1 * self.n_reads
-
+
def _get_contaminants(self):
- def _min_count(kmer_size):
+ def _min_count(_kmer_size):
return math.ceil(self.n_reads * max(
self.min_frequency,
- (self._read_length - kmer_size + 1) * self.overrep_cutoff /
- float(4**kmer_size)))
-
+ (self._read_length - _kmer_size + 1) * self.overrep_cutoff /
+ float(4 ** _kmer_size)))
+
kmer_size = self.kmer_size
kmers = defaultdict(lambda: [0, set()])
-
+
for seq in self._read_sequences:
for i in range(len(seq) - kmer_size + 1):
kmer = seq[i:(i+kmer_size)]
kmers[kmer][0] += 1
kmers[kmer][1].add(seq)
-
+
prev = None
cur = {}
results = {}
result_seqs = defaultdict(set)
min_count = _min_count(kmer_size)
-
+
# Identify candidate kmers for increasing values of k
while True:
all_seqs = set()
@@ -585,10 +594,10 @@ class HeuristicDetector(Detector):
if count > min_count:
cur[kmer] = (count, seqs)
all_seqs.update(seqs)
-
+
if len(all_seqs) == 0:
break
-
+
if prev:
for kmer, (count, seqs) in prev.items():
if (
@@ -596,7 +605,7 @@ class HeuristicDetector(Detector):
sequence_complexity(kmer) > 1.0):
results[kmer] = count
result_seqs[kmer].update(seqs)
-
+
kmer_size += 1
kmers = defaultdict(lambda: [0, set()])
for seq in all_seqs:
@@ -604,17 +613,16 @@ class HeuristicDetector(Detector):
kmer = seq[i:(i+kmer_size)]
kmers[kmer][0] += 1
kmers[kmer][1].add(seq)
-
+
min_count = _min_count(kmer_size)
prev = cur
cur = {}
-
+
results = list(results.items())
-
+
# Now merge overlapping sequences by length and frequency to eliminate
# redundancy in the set of candidate kmers.
- results.sort(key=lambda i: len(i[0]) * math.log(i[1]), reverse=True)
- cur = results[0]
+ results.sort(key=lambda r: len(r[0]) * math.log(r[1]), reverse=True)
merged = []
unmerged = []
while len(results) > 1:
@@ -634,76 +642,81 @@ class HeuristicDetector(Detector):
results = unmerged
unmerged = []
results = merged + results
-
+
if len(results) == 0:
return []
-
+
# TODO: For each retained match, pull out the longest sequence that
# matches to have a better shot of identifying long adapters that
# appear in full very infrequently
-
+
# Re-sort by frequency
- results.sort(key=lambda i: i[1], reverse=True)
+ results.sort(key=lambda r: r[1], reverse=True)
# Keep anything that's within 50% of the top hit
# TODO: make this user-configurable?
min_count = int(results[0][1] * 0.5)
results = (x for x in results if x[1] >= min_count)
# Convert to matches
matches = [
- Match(x[0], count=x[1], reads=result_seqs[x[0]])
+ Match(x[0], count=x[1], reads=result_seqs[x[0]])
for x in results]
-
+
if self.known_contaminants:
# Match to known sequences
contaminants = create_contaminant_matchers(
self.known_contaminants, self.kmer_size)
known = {}
unknown = []
-
- def find_best_match(seq, best_matches, best_match_frac):
+
+ def find_best_match(_seq, _best_matches, _best_match_frac):
"""Find best contaminant matches to `seq`.
"""
- seqrc = reverse_complement(seq)
- for contam in contaminants:
- match_frac1, match_frac2, compare_seq = contam.match(
- seq, seqrc)
- if match_frac1 < best_match_frac[0]:
+ seqrc = reverse_complement(_seq)
+
+ for _contam in contaminants:
+ match_frac1, match_frac2, compare_seq = _contam.match(
+ _seq, seqrc)
+ if match_frac1 < _best_match_frac[0]:
continue
if (
- contam.seq in compare_seq or
+ _contam.seq in compare_seq or
align(
- compare_seq, contam.seq,
+ compare_seq, _contam.seq,
self.min_contaminant_match_frac)):
- if (match_frac1 > best_match_frac[0] or (
- match_frac1 == best_match_frac[0] and
- match_frac2 > best_match_frac[1])):
- best_matches = {}
- best_match_frac = (match_frac1, match_frac2)
- best_matches[contam] = (
+ if (
+ match_frac1 > _best_match_frac[0] or (
+ match_frac1 == _best_match_frac[0] and
+ match_frac2 > _best_match_frac[1]
+ )
+ ):
+ _best_matches = {}
+ _best_match_frac = (match_frac1, match_frac2)
+ _best_matches[_contam] = (
match, (match_frac1, match_frac2))
- return (best_matches, best_match_frac)
-
+
+ return _best_matches, _best_match_frac
+
for match in matches:
best_matches, best_match_frac = find_best_match(
match.seq, {}, (self.min_contaminant_match_frac, 0))
-
+
if match.longest_match:
best_matches, best_match_frac = find_best_match(
match.longest_match[0], best_matches, best_match_frac)
-
+
if best_matches:
- for contam, match in best_matches.items():
- if contam not in known or match[1] > known[contam][1]:
- known[contam] = match
+ for contam, _match in best_matches.items():
+ if contam not in known or _match[1] > known[contam][1]:
+ known[contam] = _match
else:
unknown.append(match)
-
+
# resolve many-many relationships
-
+
new_matches = defaultdict(lambda: [])
for contam, (match, match_frac) in known.items():
new_matches[match].append((contam, match_frac))
-
+
known = []
for match, contams in new_matches.items():
if len(contams) == 1:
@@ -720,17 +733,18 @@ class HeuristicDetector(Detector):
match.set_contaminant(contam, *match_frac)
else:
names = set(contam.names)
- seqs = set((contam.seq,))
+ seqs = {(contam.seq,)}
for other_contam in equiv:
names.update(other_contam[0].names)
seqs.add(other_contam[0].seq)
match.set_known(list(names), list(seqs), *match_frac)
known.append(match)
-
+
matches = known + unknown
-
+
return matches
+
class KhmerDetector(Detector):
"""Identify contaminants based on kmer frequency using a fast kmer counting
approach (as implemented in the khmer library). This approach is fast but
@@ -739,7 +753,7 @@ class KhmerDetector(Detector):
@property
def min_report_freq(self):
return 0.0001
-
+
def _get_contaminants(self):
from khmer import Countgraph, khmer_args
# assuming all sequences are same length
@@ -748,43 +762,43 @@ class KhmerDetector(Detector):
countgraph = Countgraph(
self.kmer_size, tablesize, khmer_args.DEFAULT_N_TABLES)
countgraph.set_use_bigcount(True)
-
+
for seq in self._read_sequences:
countgraph.consume_and_tag(seq)
-
+
n_expected = math.ceil(tablesize / float(4**self.kmer_size))
min_count = n_expected * self.overrep_cutoff
if min_count >= 2**16:
raise ValueError(
"The minimum count for an over-represented k-kmer {} is "
"greater than the max khmer count (2^16)".format(min_count))
-
+
candidates = {}
-
+
for tag in countgraph.get_tagset():
count = countgraph.get(tag)
if count >= min_count:
candidates[tag] = count
-
+
if self.known_contaminants:
matches = []
seen = set()
-
- def match(kmer):
+
+ def match(_kmer):
"""Returns the frequency of `kmer` in `candidates`.
"""
- freq = candidates.get(kmer, 0)
+ freq = candidates.get(_kmer, 0)
if freq > 0:
- seen.add(kmer)
+ seen.add(_kmer)
return freq
-
+
for seq, names in self.known_contaminants.iter_sequences():
seqlen = len(seq)
if seqlen < self.kmer_size:
print("Cannot check {}; sequence is shorter than {}".format(
list(names)[0], self.kmer_size))
continue
-
+
n_kmers = seqlen - self.kmer_size + 1
num_matches = 0
match_counts = []
@@ -797,34 +811,36 @@ class KhmerDetector(Detector):
if kmer_count > 0:
num_matches += 1
match_counts.append(kmer_count)
-
+
if num_matches > 0:
# not sure what the correct metric is to use here
overall_count = sum(match_counts) / float(n_kmers)
matches.append(Match(
- seq, count=overall_count / float(tablesize),
+ seq, count=overall_count / float(tablesize),
names=names, match_frac=float(num_matches) / n_kmers))
-
+
# Add remaining tags
for tag in set(candidates.keys()) - seen:
matches.append(Match(
tag, count=candidates[tag] / float(tablesize)))
-
+
else:
matches = [
Match(tag, count=count / float(tablesize))
for tag, count in candidates.items()]
-
+
return matches
+
def align(seq1, seq2, min_overlap_frac=0.9):
"""Align two sequences.
-
+
Args:
- seq1, seq2: The sequences to align.
+ seq1: The second sequence to align.
+ seq2: The first sequence to align.
min_overlap_frac: Minimum fraction of overlapping bases required for a
match.
-
+
Returns:
The matching portion of the sequence.
"""
=====================================
doc/conf.py
=====================================
@@ -46,7 +46,7 @@ master_doc = 'index'
# General information about the project.
project = u'atropos'
-copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2016, John P Didion'
+copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2018, John P Didion'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -208,8 +208,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
- ('index', 'atropos.tex', u'atropos Documentation',
- u'Marcel Martin', 'manual'),
+ ('index', 'atropos.tex', u'Atropos Documentation', u'John Didion', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
@@ -238,8 +237,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
- ('index', 'atropos', u'atropos Documentation',
- [u'Marcel Martin'], 1)
+ ('index', 'atropos', u'Atropos Documentation', [u'John Didion'], 1)
]
# If true, show URL addresses after external links.
@@ -252,9 +250,8 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
- ('index', 'atropos', u'atropos Documentation',
- u'Marcel Martin', 'atropos', 'One line description of project.',
- 'Miscellaneous'),
+ ('index', 'atropos', u'atropos Documentation', u'John Didion', 'atropos',
+ 'General purpose NGS read manipulation tool.', 'Read manipulation'),
]
# Documents to append as an appendix to all manuals.
=====================================
doc/guide.rst
=====================================
@@ -657,10 +657,10 @@ been removed. For example, if the following sequence is in reads.fastq::
ACGTACGTACGTADAP
The following command will first trim the ``ADAP`` part of the adapter
-off the end. Then, since only 4 bases were trimmed, the ``-i 5`` option
+off the end. Then, since only 4 bases were trimmed, the ``-i -5`` option
will cause a 5th base to be removed.
- atropos -A ADAPTER -i 5 -o trimmed.fastq -se reads.fastq
+ atropos -A ADAPTER -i -5 -o trimmed.fastq -se reads.fastq
.. _quality-trimming:
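
Worked through on the example read (assuming ADAPTER matches the trailing ADAP at its 3' end), the corrected command does the following:

    ACGTACGTACGTADAP    input read
    ACGTACGTACGT        after adapter trimming (4 bases removed)
    ACGTACGTACG         after -i -5 removes one more base, for 5 total
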
=====================================
doc/index.rst
=====================================
@@ -1,5 +1,3 @@
-.. include:: ./readme.rst
-
=================
Table of contents
=================
=====================================
setup.py
=====================================
@@ -146,7 +146,7 @@ setup(
url="https://atropos.readthedocs.org/",
description="trim adapters from high-throughput sequencing reads",
long_description=long_description,
- log_description_content_type="text/markdown",
+ long_description_content_type="text/markdown",
license="Original Cutadapt code is under MIT license; improvements and additions are in the Public Domain",
ext_modules=extensions,
packages=find_packages(),
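
The keyword fix above matters because setuptools only warns about unrecognized options, so the misspelled key was silently ignored and PyPI fell back to rendering the long description as plain text. A minimal illustration of the corrected keyword (a generic example, not the full atropos setup.py):

    from setuptools import setup

    setup(
        name="example",
        version="0.1",
        long_description=open("README.md").read(),
        # the misspelled 'log_description_content_type' only produced
        # an "Unknown distribution option" warning and had no effect
        long_description_content_type="text/markdown",
    )
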
View it on GitLab: https://salsa.debian.org/med-team/atropos/compare/c903af9bf83e294c37a75f5eb3c4a06e38c1e2cd...dbfa561b04043199fbaf24ea48f33f0e88285651
--
You're receiving this email because of your account on salsa.debian.org.