[med-svn] [Git][med-team/mapdamage][upstream] New upstream version 2.2.3+dfsg

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Mon Oct 20 22:36:36 BST 2025



Étienne Mollier pushed to branch upstream at Debian Med / mapdamage


Commits:
73e4fcdb by Étienne Mollier at 2025-10-20T22:56:36+02:00
New upstream version 2.2.3+dfsg
- - - - -


22 changed files:

- − MANIFEST
- MANIFEST.in
- README.md
- mapdamage/Rscripts/stats/function.R
- mapdamage/__init__.py
- + mapdamage/__main__.py
- mapdamage/align.py
- mapdamage/composition.py
- bin/mapDamage → mapdamage/main.py
- mapdamage/mp_test.py
- mapdamage/parseoptions.py
- mapdamage/rescale.py
- mapdamage/rscript.py
- mapdamage/seq.py
- + mapdamage/seqtk/README.md
- + mapdamage/seqtk/kseq.h
- + mapdamage/seqtk/seqtk.c
- mapdamage/tables.py
- + mapdamage/tests/fake1.fasta.fai
- mapdamage/version.py
- + pyproject.toml
- setup.py


Changes:

=====================================
MANIFEST deleted
=====================================
@@ -1,13 +0,0 @@
-# file GENERATED by distutils, do NOT edit
-README.txt
-setup.py
-bin/mapDamage
-mapdamage/__init__.py
-mapdamage/align.py
-mapdamage/composition.py
-mapdamage/parseoptions.py
-mapdamage/rscript.py
-mapdamage/seq.py
-mapdamage/tables.py
-mapdamage/version.py
-mapdamage/Rscripts/mapDamage.R


=====================================
MANIFEST.in
=====================================
@@ -1,2 +1,5 @@
 include mapdamage/Rscripts/mapDamage.R
 include mapdamage/Rscripts/stats/*R
+include mapdamage/seqtk/seqtk.c
+include mapdamage/seqtk/kseq.h
+include mapdamage/seqtk/README.md


=====================================
README.md
=====================================
@@ -1,30 +1,18 @@
 ## mapDamage
 
-[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mapdamage2/README.html) [![Conda](https://img.shields.io/conda/dn/bioconda/mapdamage2.svg)](https://anaconda.org/bioconda/mapdamage2/files) 
+[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mapdamage2/README.html) [![Conda](https://img.shields.io/conda/dn/bioconda/mapdamage2.svg)](https://anaconda.org/bioconda/mapdamage2/files)
 
 ![Conda](https://anaconda.org/bioconda/mapdamage2/badges/latest_release_date.svg) ![Conda](https://anaconda.org/bioconda/mapdamage2/badges/version.svg) [![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](https://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/#inactive)
 
-#### `bioconda` installation
+### Installation
 
-* python3 version **2.2.1**
+mapDamage can be installed via pip:
 
-``` 
-conda install -c bioconda mapdamage2=2.2.1
+```bash
+pip install mapdamage
 ```
 
-* python3 version **2.2.1** **with** R and 4 mandatory packages for the Bayesian inference:
-
-``` 
-conda install -c bioconda mapdamage2=2.2.1=pyr36_1
-```
-
----
-
-### Important
-
-* From version `2.2.1` the `master` branch is requiring **python3** as `python2` is not supported from 2020-01-01.
-
-* Users with versions dating prior to June 12 2013 please update. A nasty bug that caused the statistical part of `mapDamage` to use half of the data for estimation of the damage parameters, sorry for the inconvenience.
+R and several packages are *post mortem* damage plotting, modeling, and rescaling. Refers to the detailed instructions on the [dedicated page](http://ginolhac.github.io/mapDamage/).
 
 ### Introduction
 
@@ -33,35 +21,35 @@ Complete documentation, instructions, examples, screenshots and FAQ are availabl
 [mapDamage2](https://geogenetics.ku.dk/publications/mapdamage2.0/) is a computational framework written in **Python3** and **R**, which tracks and quantifies DNA damage patterns
 among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms.
 
-`mapDamage` was developed at the [Centre for GeoGenetics](https://geogenetics.ku.dk/) by the [Orlando Group ](https://geogenetics.ku.dk/research_groups/palaeomix_group/).
+`mapDamage` was developed at the [Centre for GeoGenetics](https://geogenetics.ku.dk/) by the [Orlando Group](https://geogenetics.ku.dk/research_groups/palaeomix_group/).
 
 ### Citation
 
-If you use this program, please cite the following publication:  
+If you use this program, please cite the following publication:
 Jónsson H, Ginolhac A, Schubert M, Johnson P, Orlando L.
 [mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters.
-_Bioinformatics_ 23rd April 2013. doi: 10.1093/bioinformatics/btt193](http://bioinformatics.oxfordjournals.org/content/early/2013/05/17/bioinformatics.btt193)
+*Bioinformatics* 23rd April 2013. doi: 10.1093/bioinformatics/btt193](http://bioinformatics.oxfordjournals.org/content/early/2013/05/17/bioinformatics.btt193)
 
-The original `mapDamage1` was published in the following article:  
+The original `mapDamage1` was published in the following article:
 Ginolhac A, Rasmussen M, Gilbert MT, Willerslev E, Orlando L.
-[mapDamage: testing for damage patterns in ancient DNA sequences. _Bioinformatics_ 2011 **27**(15):2153-5
+[mapDamage: testing for damage patterns in ancient DNA sequences. *Bioinformatics* 2011 **27**(15):2153-5
 http://bioinformatics.oxfordjournals.org/content/27/15/2153](http://bioinformatics.oxfordjournals.org/content/27/15/2153)
 
 ### Test the no-stats part and rescaling
 
 you can test `mapDamage` by running:
 
-``` 
+```bash
 cd mapDamage/mapdamage/
 python3 mp_test.py
 ```
 
 should return
 
-``` 
+```
 Started with the command: /usr/local/bin/mapDamage -i tests/test.bam -r tests/fake1.fasta -d tests/results --no-stats
-	Reading from 'tests/test.bam'
-	Writing results to 'tests/results/'
+ Reading from 'tests/test.bam'
+ Writing results to 'tests/results/'
 pdf tests/results/Fragmisincorporation_plot.pdf generated
 additional tests/results/Length_plot.pdf generated
 Successful run
@@ -74,5 +62,5 @@ OK
 
 ### Contact
 
-Please report bugs and suggest possible improvements to Aurélien Ginolhac, Mikkel Schubert or Hákon Jónsson by email:
-ginolhac at gmail.com, mikkelsch at gmail.com or jonsson.hakon at gmail.com.
+Please report bugs and suggest possible improvements on GitHub:
+<https://github.com/ginolhac/mapDamage/issues/new>


=====================================
mapdamage/Rscripts/stats/function.R
=====================================
@@ -10,7 +10,7 @@ getPmat <- function(tmu,tv_ti_ratio,acgt){
         stop()
     }
     if (tv_ti_ratio<=0){
-        write("The transversion and transtition ratio cannot go under 0",stderr())
+        write("The transversion and transition ratio cannot go under 0",stderr())
         stop()
     }
     #Returns the substitution probability matrix.


=====================================
mapdamage/__init__.py
=====================================
@@ -1,10 +1,10 @@
 #!/usr/bin/env python3
 
-import mapdamage.parseoptions
-import mapdamage.seq
 import mapdamage.align
-import mapdamage.tables
 import mapdamage.composition
-import mapdamage.rscript
+import mapdamage.parseoptions
 import mapdamage.rescale
+import mapdamage.rscript
+import mapdamage.seq
+import mapdamage.tables
 import mapdamage.version


=====================================
mapdamage/__main__.py
=====================================
@@ -0,0 +1,3 @@
+import mapdamage.main
+
+mapdamage.main.main()


=====================================
mapdamage/align.py
=====================================
@@ -1,127 +1,127 @@
-import itertools
 import mapdamage
 
 # from Martin Kircher, description of CIGAR operations
-#O BAM Description
-#M 0 alignment match (can be a sequence match or mismatch)
-#I 1 insertion to the reference
-#D 2 deletion from the reference
-#N 3 skipped region from the reference
-#S 4 soft clipping (clipped sequences present in SEQ)
-#H 5 hard clipping (clipped sequences NOT present in SEQ)
-#P 6 padding (silent deletion from padded reference)
-#= 7 sequence match
-#X 8 sequence mismatch
+# O BAM Description
+# M 0 alignment match (can be a sequence match or mismatch)
+# I 1 insertion to the reference
+# D 2 deletion from the reference
+# N 3 skipped region from the reference
+# S 4 soft clipping (clipped sequences present in SEQ)
+# H 5 hard clipping (clipped sequences NOT present in SEQ)
+# P 6 padding (silent deletion from padded reference)
+# = 7 sequence match
+# X 8 sequence mismatch
 
 
 def get_coordinates(read):
-  """ return external coordinates of aligned read bases """
-  fivep = read.aend if read.is_reverse else read.pos
-  threep = read.pos if read.is_reverse else read.aend
+    """return external coordinates of aligned read bases"""
+    fivep = read.aend if read.is_reverse else read.pos
+    threep = read.pos if read.is_reverse else read.aend
 
-  return fivep, threep
+    return fivep, threep
 
 
 def get_around(coord, chrom, reflengths, length, ref):
-  """ return reference sequences before and after the read
-  check for extremities and return what is available """
-  coord_min = min(coord)
-  coord_max = max(coord)
+    """return reference sequences before and after the read
+    check for extremities and return what is available"""
+    coord_min = min(coord)
+    coord_max = max(coord)
 
-  pos_before = max(0, coord_min - length)
-  pos_after  = min(reflengths[chrom], coord_max + length)
+    pos_before = max(0, coord_min - length)
+    pos_after = min(reflengths[chrom], coord_max + length)
 
-  # Uppercased, to be sure that we don't compare A,C,G,T and a,c,g,t
-  before = ref.fetch(chrom, pos_before, coord_min).upper()
-  after = ref.fetch(chrom, coord_max, pos_after).upper()
+    # Uppercased, to be sure that we don't compare A,C,G,T and a,c,g,t
+    before = ref.fetch(chrom, pos_before, coord_min).upper()
+    after = ref.fetch(chrom, coord_max, pos_after).upper()
 
-  return before, after
+    return before, after
 
 
 def align(cigarlist, seq, ref):
-  """ insert gaps according to the cigar string
-  deletion: gaps to be inserted into read sequences,
-  insertions: gaps to be inserted into reference sequence """
-  lref = list(ref)
-  for nbr, idx in parse_cigar(cigarlist, 1):
-    lref[idx:idx] = ["-"] * nbr
+    """insert gaps according to the cigar string
+    deletion: gaps to be inserted into read sequences,
+    insertions: gaps to be inserted into reference sequence"""
+    lref = list(ref)
+    for nbr, idx in parse_cigar(cigarlist, 1):
+        lref[idx:idx] = ["-"] * nbr
 
-  lread = list(seq)
-  for nbr, idx in parse_cigar(cigarlist, 2):
-    lread[idx:idx] = ["-"] * nbr
+    lread = list(seq)
+    for nbr, idx in parse_cigar(cigarlist, 2):
+        lread[idx:idx] = ["-"] * nbr
 
-  return "".join(lread), "".join(lref)
+    return "".join(lread), "".join(lref)
 
 
 def align_with_qual(cigarlist, seq, qual, threshold, ref):
-  """ insert gaps according to the cigar string
-  deletion: gaps to be inserted into read sequences and qualities,
-  insertions: gaps to be inserted into reference sequence """
-  lref = list(ref)
-  for nbr, idx in parse_cigar(cigarlist, 1):
-    lref[idx:idx] = ["-"] * nbr
+    """insert gaps according to the cigar string
+    deletion: gaps to be inserted into read sequences and qualities,
+    insertions: gaps to be inserted into reference sequence"""
+    lref = list(ref)
+    for nbr, idx in parse_cigar(cigarlist, 1):
+        lref[idx:idx] = ["-"] * nbr
 
-  lread = list(seq)
-  lqual = list(qual)
-  for nbr, idx in parse_cigar(cigarlist, 2):
-    lread[idx:idx] = ["-"] * nbr
-    lqual[idx:idx] = ["-"] * nbr
+    lread = list(seq)
+    lqual = list(qual)
+    for nbr, idx in parse_cigar(cigarlist, 2):
+        lread[idx:idx] = ["-"] * nbr
+        lqual[idx:idx] = ["-"] * nbr
 
-  for idx, score in enumerate(lqual):
-    # mask read and reference nucleotides (so not gaps) if below desired threshold
-    if (ord(score)-33) < threshold and lread[idx] != "-":
-      lread[idx] = "N"
-      lref[idx]  = "N"
+    for idx, score in enumerate(lqual):
+        # mask read and reference nucleotides (so not gaps) if below desired threshold
+        if (ord(score) - 33) < threshold and lread[idx] != "-":
+            lread[idx] = "N"
+            lref[idx] = "N"
 
-  return "".join(lread), "".join(lqual), "".join(lref)
+    return "".join(lread), "".join(lqual), "".join(lref)
 
 
 def get_mis(read, seq, refseq, ref, length, tab, end):
-  """ count mismatches using aligned reference and read,
-  must be redone since N in reference were randomly replaced by any bases """
-  std = '-' if read.is_reverse else '+'
-  subtable = tab[ref][end][std]
+    """count mismatches using aligned reference and read,
+    must be redone since N in reference were randomly replaced by any bases"""
+    std = "-" if read.is_reverse else "+"
+    subtable = tab[ref][end][std]
 
-  for (i, nt_seq, nt_ref) in zip(range(length), seq, refseq):
-    if (nt_seq in "ACGT-") and (nt_ref in "ACGT-"):
-      if nt_ref != "-":
-        # record base composition in the reference, only A, C, G, T
-        subtable[nt_ref][i] += 1
+    for i, nt_seq, nt_ref in zip(range(length), seq, refseq):
+        if (nt_seq in "ACGT-") and (nt_ref in "ACGT-"):
+            if nt_ref != "-":
+                # record base composition in the reference, only A, C, G, T
+                subtable[nt_ref][i] += 1
 
-      # Most ref/seq pairs will be identical
-      if (nt_ref != nt_seq):
-        mut = "%s>%s" % (nt_ref, nt_seq)
-        subtable[mut][i] += 1
+            # Most ref/seq pairs will be identical
+            if nt_ref != nt_seq:
+                mut = "%s>%s" % (nt_ref, nt_seq)
+                subtable[mut][i] += 1
 
 
 def parse_cigar(cigarlist, ope):
-  """ for a specific operation (mismach, match, insertion, deletion... see above)
-  return occurences and index in the alignment """
-  tlength = 0
-  coordinate = []
-  # count matches, indels and mismatches
-  oplist = (0, 1, 2, 7, 8)
-  for operation, length in cigarlist:
-    if operation == ope:
-        coordinate.append([length, tlength])
-    if operation in oplist:
-        tlength += length
-  return coordinate
+    """for a specific operation (mismatch, match, insertion, deletion... see above)
+    return occurrences and index in the alignment"""
+    tlength = 0
+    coordinate = []
+    # count matches, indels and mismatches
+    oplist = (0, 1, 2, 7, 8)
+    for operation, length in cigarlist:
+        if operation == ope:
+            coordinate.append([length, tlength])
+        if operation in oplist:
+            tlength += length
+    return coordinate
 
 
 def record_soft_clipping(read, tab, length):
-  """ record soft clipped bases at extremities """
-  def update_table(end, std, bases):
-    for i in range(0, min(bases, length)):
-      tab[end][std]['S'][i] += 1
-
-  strand = '-' if read.is_reverse else '+'
-  for (nbases, idx) in mapdamage.align.parse_cigar(read.cigar, 4):
-    if idx == 0:
-      # Soft-clipping at the left side of the alignment
-      end = '3p' if read.is_reverse else '5p'
-    else:
-      # Soft-clipping at the right side of the alignment
-      end = '5p' if read.is_reverse else '3p'
-
-    update_table(end, strand, nbases)
+    """record soft clipped bases at extremities"""
+
+    def update_table(end, std, bases):
+        for i in range(0, min(bases, length)):
+            tab[end][std]["S"][i] += 1
+
+    strand = "-" if read.is_reverse else "+"
+    for nbases, idx in mapdamage.align.parse_cigar(read.cigar, 4):
+        if idx == 0:
+            # Soft-clipping at the left side of the alignment
+            end = "3p" if read.is_reverse else "5p"
+        else:
+            # Soft-clipping at the right side of the alignment
+            end = "5p" if read.is_reverse else "3p"
+
+        update_table(end, strand, nbases)


=====================================
mapdamage/composition.py
=====================================
@@ -1,82 +1,60 @@
-import mapdamage
-import itertools
 import csv
-import subprocess
-import sys
+
+import mapdamage
+import mapdamage.seqtk
+
 
 def count_ref_comp(read, chrom, before, after, comp):
-  """ record basae composition in external genomic regions """
-  std = '-' if read.is_reverse else '+'
+    """record basae composition in external genomic regions"""
+    std = "-" if read.is_reverse else "+"
 
-  _update_table(comp[chrom]['5p'][std], before, range(-len(before), 0))
-  _update_table(comp[chrom]['3p'][std], after,  range(1, len(after) + 1))
+    _update_table(comp[chrom]["5p"][std], before, range(-len(before), 0))
+    _update_table(comp[chrom]["3p"][std], after, range(1, len(after) + 1))
 
 
 def count_read_comp(read, chrom, length, comp):
-  """ record base composition of read, discard marked nucleotides """
-  std, seq = '+', read.query
-  if read.is_reverse:
-    std, seq = '-', mapdamage.seq.revcomp(seq)
+    """record base composition of read, discard marked nucleotides"""
+    std, seq = "+", read.query
+    if read.is_reverse:
+        std, seq = "-", mapdamage.seq.revcomp(seq)
 
-  _update_table(comp[chrom]['5p'][std], seq,           range(1, length + 1))
-  _update_table(comp[chrom]['3p'][std], reversed(seq), range(-1, - length - 1, -1))
+    _update_table(comp[chrom]["5p"][std], seq, range(1, length + 1))
+    _update_table(comp[chrom]["3p"][std], reversed(seq), range(-1, -length - 1, -1))
 
 
 def _update_table(table, sequence, indices):
-  for (index, nt) in zip(indices, sequence):
-    if nt in table:
-      table[nt][index] += 1
+    for index, nt in zip(indices, sequence):
+        if nt in table:
+            table[nt][index] += 1
 
 
-def get_base_comp(filename,destination=False):
-    """
-    Gets the basecomposition of all the sequences in filename
-    and returns the value to destination if given.
+def write_base_comp(fasta, destination):
+    """Calculates the total base composition across all sequences in 'fasta'
+    and writes them to 'destination' as CSV.
     """
-    path_to_seqtk = mapdamage.rscript.construct_path("seqtk",folder="seqtk")
-    bases = {"A":0,"C":0,"G":0,"T":0}
-    alp = ["A","C","G","T"]
-    try:
-        proc = subprocess.Popen([path_to_seqtk,"comp",filename],stdout=subprocess.PIPE)
-        out = proc.communicate()[0]
-        for li in out.splitlines():
-            tabs = li.split() # 1 is the ref, 2 is the total and then the base counts A, C, G and T.
-            bases["A"] = bases["A"] + int(tabs[2])
-            bases["C"] = bases["C"] + int(tabs[3])
-            bases["G"] = bases["G"] + int(tabs[4])
-            bases["T"] = bases["T"] + int(tabs[5])
-    except (OSError, ValueError) as error:
-        sys.stderr.write("Error: Seqtk failed: %s\n" % (error,))
-        sys.exit(1)
-    # get the base frequencies
+    bases = {"A": 0, "C": 0, "G": 0, "T": 0}
+    for stats in mapdamage.seqtk.comp(str(fasta)):
+        for key in bases:
+            bases[key] += stats[key]
+
+    # calculate the base frequencies
     ba_su = sum(bases.values())
-    for ba in alp:
-        bases[ba] = float(bases[ba])/float(ba_su)
-    if (destination==False):
-        return bases
-    else:
-        # write the results
-        fo = open(destination,"w")
-        vals = [str(bases[i]) for i in alp]
-        fo.write(",".join(alp)+"\n")
-        fo.write(",".join(vals)+"\n")
-        fo.close()
+    for key in bases:
+        bases[key] = bases[key] / ba_su
+
+    with open(destination, "wt", newline="") as handle:
+        writer = csv.writer(handle)
+
+        header = ["A", "C", "G", "T"]
+        writer.writerow(header)
+        writer.writerow(bases[key] for key in header)
+
 
 def read_base_comp(filename):
-    """
-    Read the base compition from a file created by get_base_comp
-    """
-    fh = open(filename)
-    first_line = True
-    for li in fh:
-        li = li.rstrip()
-        lp = li.split()
-        if first_line:
-            header = lp
-            first_line = False
-        else:
-            body = lp
-    bases = {}
-    for ba,per in zip(header,body):
-        bases[ba] = per
-    return bases
+    """Read the base compition from a file created by write_base_comp"""
+    with open(filename, newline="") as csvfile:
+        reader = csv.DictReader(csvfile)
+        for row in reader:
+            return row
+
+    raise csv.Error("No rows found in %r" % (filename,))


=====================================
bin/mapDamage → mapdamage/main.py
=====================================
@@ -2,10 +2,14 @@
 # -*- coding: utf-8 -*-
 
 import logging
+import os
 import random
-import time
 import sys
-import os
+import time
+
+import pysam
+
+import mapdamage
 
 """ Copyright (c) 2012  Aurélien Ginolhac, Mikkel Schubert, Hákon Jónsson
 and Ludovic Orlando
@@ -38,33 +42,17 @@ plot and quantify damage patterns from a SAM/BAM file
 :Output: tabulated tables, pdf
 """
 
-# check if pysam if available
-MODULE = "pysam"
-URL = "http://code.google.com/p/pysam/"
-try:
-    __import__(MODULE)
-except ImportError as e:
-    sys.stderr.write("Error: Could not import required module '%s':\n\t- %s\n" % (MODULE, e))
-    sys.stderr.write("       If module is not installed, please download from '%s'.\n" % URL)
-    sys.stderr.write("       A local install may be performed using the following command:\n")
-    sys.stderr.write("       $ python setup.py install --user\n\n")
-    sys.exit(1)
-
-import pysam
-
-
-_BAM_UNMAPPED  = 0x4
+_BAM_UNMAPPED = 0x4
 _BAM_SECONDARY = 0x100
 _BAM_FAILED_QC = 0x200
-_BAM_PCR_DUPE  = 0x400
-_BAM_CHIMERIC  = 0x800
+_BAM_PCR_DUPE = 0x400
+_BAM_CHIMERIC = 0x800
+
 
 def _filter_reads(bamfile):
-    filtered_flags = _BAM_UNMAPPED | \
-      _BAM_SECONDARY | \
-      _BAM_FAILED_QC | \
-      _BAM_PCR_DUPE | \
-      _BAM_CHIMERIC
+    filtered_flags = (
+        _BAM_UNMAPPED | _BAM_SECONDARY | _BAM_FAILED_QC | _BAM_PCR_DUPE | _BAM_CHIMERIC
+    )
 
     for read in bamfile:
         if not (read.flag & filtered_flags):
@@ -74,7 +62,9 @@ def _filter_reads(bamfile):
 def _downsample_to_fraction(bamfile, options):
     if options.verbose:
         print("\tDownsampling %.2f fraction of the original file" % options.downsample)
-    assert options.downsample > 0, "Downsample fraction must be positive, not %s" % options.downsample
+    assert options.downsample > 0, (
+        "Downsample fraction must be positive, not %s" % options.downsample
+    )
 
     rand = random.Random(options.downsample_seed)
     for read in _filter_reads(bamfile):
@@ -84,13 +74,15 @@ def _downsample_to_fraction(bamfile, options):
 
 def _downsample_to_fixed_number(bamfile, options):
     if options.verbose:
-        print("\tDownsampling %d random reads from the original file" % options.downsample)
+        print(
+            "\tDownsampling %d random reads from the original file" % options.downsample
+        )
 
     # use reservoir sampling
     downsample_to = int(options.downsample)
     rand = random.Random(options.downsample_seed)
     sample = [None] * downsample_to
-    for (index, record) in enumerate(_filter_reads(bamfile)):
+    for index, record in enumerate(_filter_reads(bamfile)):
         if index >= downsample_to:
             index = rand.randint(0, index)
             if index >= downsample_to:
@@ -112,59 +104,20 @@ def _read_bamfile(bamfile, options):
     else:
         return _downsample_to_fixed_number(bamfile, options)
 
-def _check_mm_option():
-    """
-    As the user can override the system wide mapdamage modules with
-    the --mapdamage-modules, it has to happen before option parsing
-    in mapdamage.parseoptions
-    """
-    path_to_mm = None
-    for nr,arg in zip(list(range(len(sys.argv))),sys.argv):
-        if arg.startswith("--mapdamage-modules"):
-            try:
-                if "=" in arg:
-                    # the option is of the format --mapdamage-modules=AAAA
-                    arg_p = arg.split("=")
-                    path_to_mm = arg_p[1]
-                else:
-                    # the option is of the format --mapdamage-modules AAAA
-                    path_to_mm = sys.argv[nr+1]
-                break
-            except IndexError as e:
-                raise SystemExit("Must specify a path to --mapdamage-modules")
-    if path_to_mm != None:
-        if not os.path.isdir(path_to_mm):
-            raise SystemExit("The --mapdamage-modules option must be a valid path (path=%s)" % path_to_mm)
-        if not os.path.isdir(os.path.join(path_to_mm,"mapdamage")):
-            raise SystemExit("The --mapdamage-modules path (path=%s) must contain the mapdamage module" % path_to_mm)
-    return path_to_mm
-
 
 def main():
     start_time = time.time()
 
-    # the user can override the system wide mapdamage modules
-    path_to_mm = _check_mm_option()
-    if path_to_mm != None:
-        sys.path.insert(0,path_to_mm)
-    import mapdamage
-
-    fpath_seqtk = mapdamage.rscript.construct_path("seqtk", folder="seqtk")
-    if not (os.path.isfile(fpath_seqtk) and os.access(fpath_seqtk, os.X_OK)):
-        sys.stderr.write("Seqtk executable not accessible; mapDamage has not\n"
-                         "been intalled properly or current user does not\n"
-                         "sufficient permissions. Expected executable at\n   "
-                         "'%s'\n" % (fpath_seqtk,))
-        return 1
-
     options = mapdamage.parseoptions.options()
-    if options == None:
+    if options is None:
         sys.stderr.write("Option parsing failed, terminating the program\n")
         return 1
 
-    logging.basicConfig(filename = os.path.join(options.folder, "Runtime_log.txt"),
-                        format   = '%(asctime)s\t%(levelname)s\t%(name)s: %(message)s',
-                        level    = logging.DEBUG)
+    logging.basicConfig(
+        filename=os.path.join(options.folder, "Runtime_log.txt"),
+        format="%(asctime)s\t%(levelname)s\t%(name)s: %(message)s",
+        level=logging.DEBUG,
+    )
     handler = logging.StreamHandler()
     handler.setLevel(logging.INFO)
     logging.getLogger().addHandler(handler)
@@ -175,7 +128,9 @@ def main():
     # plot using R if results folder already done
     if options.plot_only:
         if options.no_r:
-            logger.error("Cannot use plot damage patterns if R is missing, terminating the program")
+            logger.error(
+                "Cannot use plot damage patterns if R is missing, terminating the program"
+            )
             return 1
         else:
             mapdamage.rscript.plot(options)
@@ -190,13 +145,13 @@ def main():
             return 1
         else:
             # before running the Bayesian estimation get the base composition
-            path_to_basecomp = os.path.join(options.folder,"dnacomp_genome.csv")
+            path_to_basecomp = os.path.join(options.folder, "dnacomp_genome.csv")
             if os.path.isfile(path_to_basecomp):
-                #Try to read the base composition file
+                # Try to read the base composition file
                 mapdamage.composition.read_base_comp(path_to_basecomp)
             else:
-                #Construct the base composition file
-                mapdamage.composition.get_base_comp(options.ref,path_to_basecomp)
+                # Construct the base composition file
+                mapdamage.composition.write_base_comp(options.ref, path_to_basecomp)
             mapdamage.rscript.run_stats(options)
             return 0
 
@@ -205,8 +160,13 @@ def main():
         ref = pysam.Fastafile(options.ref)
     except IOError as e:
         # Couldn't open the reference file
-        logger.error("Couldn't open the reference file "+options.ref+\
-            " or there are permission problems writing the "+options.ref+".fai file")
+        logger.error(
+            "Couldn't open the reference file "
+            + options.ref
+            + " or there are permission problems writing the "
+            + options.ref
+            + ".fai file"
+        )
         raise e
 
     # rescale the qualities
@@ -218,7 +178,7 @@ def main():
 
     # open SAM/BAM file
     if options.filename == "-":
-        in_bam = pysam.Samfile("-", 'rb')
+        in_bam = pysam.Samfile("-", "rb")
         # disabling rescaling if reading from standard input since we need
         # to read it twice
         if options.rescale:
@@ -235,38 +195,43 @@ def main():
     elif not mapdamage.seq.compare_sequence_dicts(fai_lengths, reflengths):
         return 1
     elif (len(reflengths) >= 1000) and not options.merge_reference_sequences:
-        print("WARNING: Alignment contains a large number of reference sequences (%i)!" % len(reflengths))
+        print(
+            "WARNING: Alignment contains a large number of reference sequences (%i)!"
+            % len(reflengths)
+        )
         print("  This may lead to excessive memory/disk usage.")
         print("  Consider using --merge-reference-sequences")
         print()
 
     refnames = in_bam.references
     if options.merge_reference_sequences:
-        refnames = ['*']
+        refnames = ["*"]
 
     # for misincorporation patterns, record mismatches
     misincorp = mapdamage.tables.initialize_mut(refnames, options.length)
     # for fragmentation patterns, record base compositions
-    dnacomp =  mapdamage.tables.initialize_comp(refnames, options.around, options.length)
+    dnacomp = mapdamage.tables.initialize_comp(refnames, options.around, options.length)
     # for length distributions
-    lgdistrib =  mapdamage.tables.initialize_lg()
+    lgdistrib = mapdamage.tables.initialize_lg()
 
     if not options.quiet:
         print("\tReading from '%s'" % options.filename)
-        if  options.minqual != 0:
+        if options.minqual != 0:
             print("\tFiltering out bases with a Phred score < %d" % options.minqual)
         if options.verbose:
-            print("\t%d references are assumed in SAM/BAM file, for a total of %d nucleotides" % (len(reflengths), sum(reflengths.values())))
+            print(
+                "\t%d references are assumed in SAM/BAM file, for a total of %d nucleotides"
+                % (len(reflengths), sum(reflengths.values()))
+            )
         print("\tWriting results to '%s/'" % options.folder)
 
-
     # open file handler to write alignments in fasta format
     if options.fasta:
         # use name of the SAM/BAM filename without extension
-        ffasta = os.path.splitext(os.path.basename(options.filename))[0]+'.fasta'
+        ffasta = os.path.splitext(os.path.basename(options.filename))[0] + ".fasta"
         if not options.quiet:
             print("\tWriting alignments in '%s'" % ffasta)
-        fhfasta = open(options.folder+"/"+ffasta,"w")
+        fhfasta = open(options.folder + "/" + ffasta, "w")
 
     counter = 0
 
@@ -281,7 +246,9 @@ def main():
         # fetch reference name, chromosome or contig names
         chrom = in_bam.getrname(read.tid)
 
-        (before, after) = mapdamage.align.get_around(coordinate, chrom, reflengths, options.around, ref)
+        (before, after) = mapdamage.align.get_around(
+            coordinate, chrom, reflengths, options.around, ref
+        )
         refseq = ref.fetch(chrom, min(coordinate), max(coordinate)).upper()
         # read.query contains aligned sequences while read.seq is the read itself
         seq = read.query
@@ -289,12 +256,17 @@ def main():
         # add gaps according to the cigar string, do it for qualities if filtering options is on
         if not (options.minqual and read.qual):
             if options.minqual:
-                logger.warning("Cannot filter bases by PHRED scores for read '%s'; no scores available." % read.qname)
+                logger.warning(
+                    "Cannot filter bases by PHRED scores for read '%s'; no scores available."
+                    % read.qname
+                )
 
             (seq, refseq) = mapdamage.align.align(read.cigar, seq, refseq)
         else:
             # add gaps to qualities and mask read and reference nucleotides if below desired threshold
-            (seq, _, refseq) = mapdamage.align.align_with_qual(read.cigar, seq, read.qqual, options.minqual, refseq)
+            (seq, _, refseq) = mapdamage.align.align_with_qual(
+                read.cigar, seq, read.qqual, options.minqual, refseq
+            )
 
         # reverse complement read and reference when mapped reverse strand
         if read.is_reverse:
@@ -305,15 +277,19 @@ def main():
             before = beforerev
 
         if options.merge_reference_sequences:
-            chrom = '*'
+            chrom = "*"
 
         # record soft clipping when present
         mapdamage.align.record_soft_clipping(read, misincorp[chrom], options.length)
 
         # count misincorparations by comparing read and reference base by base
-        mapdamage.align.get_mis(read, seq, refseq, chrom, options.length, misincorp, '5p')
+        mapdamage.align.get_mis(
+            read, seq, refseq, chrom, options.length, misincorp, "5p"
+        )
         # do the same with sequences align to 3'-ends
-        mapdamage.align.get_mis(read, seq[::-1], refseq[::-1], chrom, options.length, misincorp, '3p')
+        mapdamage.align.get_mis(
+            read, seq[::-1], refseq[::-1], chrom, options.length, misincorp, "3p"
+        )
         # compute base composition for reads
         mapdamage.composition.count_read_comp(read, chrom, options.length, dnacomp)
 
@@ -321,7 +297,17 @@ def main():
         mapdamage.composition.count_ref_comp(read, chrom, before, after, dnacomp)
 
         if options.fasta:
-            mapdamage.seq.write_fasta(read, chrom, seq, refseq, min(coordinate), max(coordinate), before, after, fhfasta)
+            mapdamage.seq.write_fasta(
+                read,
+                chrom,
+                seq,
+                refseq,
+                min(coordinate),
+                max(coordinate),
+                before,
+                after,
+                fhfasta,
+            )
 
         if options.verbose:
             if counter % 50000 == 0:
@@ -337,14 +323,13 @@ def main():
         fhfasta.close()
 
     # output results, write summary tables to disk
-    with open(options.folder+"/"+"misincorporation.txt", 'w') as fmut:
+    with open(options.folder + "/" + "misincorporation.txt", "w") as fmut:
         mapdamage.tables.print_mut(misincorp, options, fmut)
-    with open(options.folder+"/"+"dnacomp.txt", 'w') as fcomp:
+    with open(options.folder + "/" + "dnacomp.txt", "w") as fcomp:
         mapdamage.tables.print_comp(dnacomp, options, fcomp)
-    with open(options.folder+"/"+"lgdistribution.txt", 'w') as flg:
+    with open(options.folder + "/" + "lgdistribution.txt", "w") as flg:
         mapdamage.tables.print_lg(lgdistrib, options, flg)
 
-
     # plot using R
     if not options.no_r:
         mapdamage.rscript.plot(options)
@@ -354,11 +339,12 @@ def main():
     if mapdamage.tables.check_table_and_warn_if_dmg_freq_is_low(options.folder):
         options.no_stats = True
 
-
     # run the Bayesian estimation
     if not options.no_stats:
         # before running the Bayesian estimation get the base composition
-        mapdamage.composition.get_base_comp(options.ref,os.path.join(options.folder,"dnacomp_genome.csv"))
+        mapdamage.composition.write_base_comp(
+            options.ref, os.path.join(options.folder, "dnacomp_genome.csv")
+        )
         mapdamage.rscript.run_stats(options)
 
     # rescale the qualities
@@ -373,7 +359,3 @@ def main():
     logger.debug("Run completed in %f seconds" % (time.time() - start_time,))
 
     return 0
-
-
-if __name__ == '__main__':
-    sys.exit(main())


=====================================
mapdamage/mp_test.py
=====================================
@@ -1,60 +1,85 @@
-import unittest
+import filecmp
+import optparse
 import subprocess
+import unittest
+
 import pysam
+
 import mapdamage
-import optparse
-import filecmp
+
 
 def read_nocomment(dnacomp):
-    with open(dnacomp, 'r') as f:
+    with open(dnacomp, "r") as f:
         a = f.readlines()
-        return [x for x in a if not x.startswith('#')]
+        return [x for x in a if not x.startswith("#")]
 
-def mock_options(filename,rescale_out,folder):
+
+def mock_options(filename, rescale_out, folder):
     """Make the options object with nice values for testing"""
-    return optparse.Values({
-        "filename":filename,
-        "rescale_out":rescale_out,
-        "verbose":True,
-        "folder":folder,
-        "rescale_length_5p":12, # default values as in --seq-length
-        "rescale_length_3p":12, # default values as in --seq-length
-        "quiet":True
-        })
+    return optparse.Values(
+        {
+            "filename": filename,
+            "rescale_out": rescale_out,
+            "verbose": True,
+            "folder": folder,
+            "rescale_length_5p": 12,  # default values as in --seq-length
+            "rescale_length_3p": 12,  # default values as in --seq-length
+            "quiet": True,
+        }
+    )
 
 
 class testCases(unittest.TestCase):
-
     def test_no_stats(self):
         """regular mapDamage"""
-        subprocess.run(["mapDamage",  "-i",  "tests/test.bam", "-r", "tests/fake1.fasta", "-d", "tests/results", "--no-stats"], check = True)
-        self.assertTrue(read_nocomment("tests/dnacomp.txt") == read_nocomment("tests/results/dnacomp.txt"))
+        subprocess.run(
+            [
+                "mapDamage",
+                "-i",
+                "tests/test.bam",
+                "-r",
+                "tests/fake1.fasta",
+                "-d",
+                "tests/results",
+                "--no-stats",
+            ],
+            check=True,
+        )
+        self.assertTrue(
+            read_nocomment("tests/dnacomp.txt")
+            == read_nocomment("tests/results/dnacomp.txt")
+        )
+
 
 class testRescaling(unittest.TestCase):
     def test_single_end_file(self):
         """Test, rescaling BAM file"""
         #
-	    # The expected substition frequencies before and after scaling using the scaled qualities as probalities:
-	    # CT	0.06226411977920493		0.04163524443356556
-	    # TC	0.020395286584806528		0.020395286584806528
-	    # GA	0.04400459948304954		0.03794905109091021
-	    # AG	0.05355350777642983		0.05355350777642983
-	    # Quality metrics before and after scaling
-	    # CT-Q0 	5		5
-	    # CT-Q10 	5		2
-	    # CT-Q20 	3		2
-	    # CT-Q30 	3		2
-	    # CT-Q40 	0		0
-	    # GA-Q0 	5		5
-	    # GA-Q10 	5		4
-	    # GA-Q20 	1		0
-	    # GA-Q30 	1		0
-	    # GA-Q40 	0		0
-        options = mock_options("tests/test.bam","tests/test.rescaled.sam","tests/probs/")
+        # The expected substitution frequencies before and after scaling using the scaled qualities as probalities:
+        # CT	0.06226411977920493		0.04163524443356556
+        # TC	0.020395286584806528		0.020395286584806528
+        # GA	0.04400459948304954		0.03794905109091021
+        # AG	0.05355350777642983		0.05355350777642983
+        # Quality metrics before and after scaling
+        # CT-Q0 	5		5
+        # CT-Q10 	5		2
+        # CT-Q20 	3		2
+        # CT-Q30 	3		2
+        # CT-Q40 	0		0
+        # GA-Q0 	5		5
+        # GA-Q10 	5		4
+        # GA-Q20 	1		0
+        # GA-Q30 	1		0
+        # GA-Q40 	0		0
+        options = mock_options(
+            "tests/test.bam", "tests/test.rescaled.sam", "tests/probs/"
+        )
         ref = pysam.Fastafile("tests/fake1.fasta")
-        mapdamage.rescale.rescale_qual(ref,options,debug=True)
-        self.assertTrue(filecmp.cmp("tests/test.rescaled.sam","tests/test.rescaled.correct.sam"))
+        mapdamage.rescale.rescale_qual(ref, options, debug=True)
+        self.assertTrue(
+            filecmp.cmp("tests/test.rescaled.sam", "tests/test.rescaled.correct.sam")
+        )
 
 
-if  __name__=='__main__':
+if __name__ == "__main__":
     unittest.main()


=====================================
mapdamage/parseoptions.py
=====================================
@@ -1,9 +1,10 @@
-from optparse import OptionParser, OptionGroup, SUPPRESS_HELP
 import os
 import sys
+from optparse import SUPPRESS_HELP, OptionGroup, OptionParser
 
-from mapdamage.version import __version__
 from mapdamage.rscript import check_R_lib
+from mapdamage.version import __version__
+
 
 def file_exist(filename):
     if os.path.exists(filename) and not os.path.isdir(filename):
@@ -16,14 +17,14 @@ def file_exist(filename):
 
 
 def whereis(program):
-    for path in os.environ.get('PATH', '').split(':'):
-        if os.path.exists(os.path.join(path, program)) and \
-            not os.path.isdir(os.path.join(path, program)):
+    for path in os.environ.get("PATH", "").split(":"):
+        if os.path.exists(os.path.join(path, program)) and not os.path.isdir(
+            os.path.join(path, program)
+        ):
             return os.path.join(path, program)
     return None
 
 
-
 def check_py_version():
     req_version = (2, 6)
     cur_version = sys.version_info
@@ -31,120 +32,353 @@ def check_py_version():
     if cur_version >= req_version:
         return True
     else:
-        sys.stderr.write("Your Python interpreter is too old."\
-            "Please consider upgrading to at least %d.%d\n" % (req_version[0], req_version[1]))
+        sys.stderr.write(
+            "Your Python interpreter is too old."
+            "Please consider upgrading to at least %d.%d\n"
+            % (req_version[0], req_version[1])
+        )
         return None
 
 
 def options():
-    parser = OptionParser("%prog [options] -i BAMfile -r reference.fasta\n\nUse option -h or --help for help", version=__version__, \
-            epilog="report bugs to aginolhac at snm.ku.dk, MSchubert at snm.ku.dk or jonsson.hakon at gmail.com")
+    parser = OptionParser(
+        "%prog [options] -i BAMfile -r reference.fasta\n\nUse option -h or --help for help",
+        version=__version__,
+        epilog="report bugs to aginolhac at snm.ku.dk, MSchubert at snm.ku.dk or jonsson.hakon at gmail.com",
+    )
 
     args = OptionGroup(parser, "Input files")
-    args.add_option("-i", "--input", help="SAM/BAM file, must contain a valid header, use '-' for reading a BAM from stdin", \
-          action="store", type="string", dest="filename")
-    args.add_option("-r", "--reference", help="Reference file in FASTA format", \
-          action="store", dest="ref")
+    args.add_option(
+        "-i",
+        "--input",
+        help="SAM/BAM file, must contain a valid header, use '-' for reading a BAM from stdin",
+        action="store",
+        type="string",
+        dest="filename",
+    )
+    args.add_option(
+        "-r",
+        "--reference",
+        help="Reference file in FASTA format",
+        action="store",
+        dest="ref",
+    )
 
     parser.add_option_group(args)
     group = OptionGroup(parser, "General options")
-    group.add_option("-n", "--downsample", help = "Downsample to a randomly selected fraction of the reads (if 0 < DOWNSAMPLE < 1), or " \
-                     "a fixed number of randomly selected reads (if DOWNSAMPLE >= 1). By default, no downsampling is performed.",
-                     type = float, default = None)
-    group.add_option("--downsample-seed", help = "Seed value to use for downsampling. See documentation for py module 'random' for default behavior.",
-                     type = int, default = None)
-    group.add_option("--merge-reference-sequences", help = "Ignore referece sequence names when tabulating reads (using '*' instead). "
-                     "Useful for alignments with a large number of reference sequnces, which may otherwise result in excessive "
-                     "memory or disk usage due to the number of tables generated.",
-                     default = False, action = "store_true")
-    group.add_option("-l", "--length", dest="length", help="read length, in nucleotides to consider [%default]", \
-            type = int, default=70,action="store")
-    group.add_option("-a", "--around", dest="around", help="nucleotides to retrieve before/after reads [%default]", \
-            type = int, default=10,action="store")
-    group.add_option("-Q", "--min-basequal", dest="minqual", help="minimun base quality Phred score considered, Phred-33 assumed [%default]", \
-            type = int, default=0, action="store")
-    group.add_option("-d", "--folder", help="folder name to store results [results_FILENAME]", \
-          action="store", type="string", dest="folder")
-    group.add_option("-f", "--fasta", dest="fasta", help="Write alignments in a FASTA file", \
-          default=False,action="store_true")
-    group.add_option("--plot-only", dest="plot_only", help="Run only plotting from a valid result folder", \
-          default=False,action="store_true")
-    group.add_option("-q", "--quiet", dest="quiet", help="Disable any output to stdout", \
-          default=False,action="store_true")
-    group.add_option("-v", "--verbose", dest="verbose", help="Display progression information during parsing", \
-          default=False,action="store_true")
-    group.add_option("--mapdamage-modules", dest="mapdamage_modules", help="Override the system wide installed mapDamage module", \
-          default=None)
-    group.add_option("--no-plot", dest="no_r", help=SUPPRESS_HELP, default=False, action="store_true")
+    group.add_option(
+        "-n",
+        "--downsample",
+        help="Downsample to a randomly selected fraction of the reads (if 0 < DOWNSAMPLE < 1), or "
+        "a fixed number of randomly selected reads (if DOWNSAMPLE >= 1). By default, no downsampling is performed.",
+        type=float,
+        default=None,
+    )
+    group.add_option(
+        "--downsample-seed",
+        help="Seed value to use for downsampling. See documentation for py module 'random' for default behavior.",
+        type=int,
+        default=None,
+    )
+    group.add_option(
+        "--merge-reference-sequences",
+        help="Ignore reference sequence names when tabulating reads (using '*' instead). "
+        "Useful for alignments with a large number of reference sequences, which may otherwise result in excessive "
+        "memory or disk usage due to the number of tables generated.",
+        default=False,
+        action="store_true",
+    )
+    group.add_option(
+        "-l",
+        "--length",
+        dest="length",
+        help="read length, in nucleotides to consider [%default]",
+        type=int,
+        default=70,
+        action="store",
+    )
+    group.add_option(
+        "-a",
+        "--around",
+        dest="around",
+        help="nucleotides to retrieve before/after reads [%default]",
+        type=int,
+        default=10,
+        action="store",
+    )
+    group.add_option(
+        "-Q",
+        "--min-basequal",
+        dest="minqual",
+        help="minimum base quality Phred score considered, Phred-33 assumed [%default]",
+        type=int,
+        default=0,
+        action="store",
+    )
+    group.add_option(
+        "-d",
+        "--folder",
+        help="folder name to store results [results_FILENAME]",
+        action="store",
+        type="string",
+        dest="folder",
+    )
+    group.add_option(
+        "-f",
+        "--fasta",
+        dest="fasta",
+        help="Write alignments in a FASTA file",
+        default=False,
+        action="store_true",
+    )
+    group.add_option(
+        "--plot-only",
+        dest="plot_only",
+        help="Run only plotting from a valid result folder",
+        default=False,
+        action="store_true",
+    )
+    group.add_option(
+        "-q",
+        "--quiet",
+        dest="quiet",
+        help="Disable any output to stdout",
+        default=False,
+        action="store_true",
+    )
+    group.add_option(
+        "-v",
+        "--verbose",
+        dest="verbose",
+        help="Display progression information during parsing",
+        default=False,
+        action="store_true",
+    )
+    group.add_option(
+        "--no-plot", dest="no_r", help=SUPPRESS_HELP, default=False, action="store_true"
+    )
     parser.add_option_group(group)
 
     # options for plotting damage patterns
     group2 = OptionGroup(parser, "Options for graphics")
-    group2.add_option("-y", "--ymax", dest="ymax", \
-           help="graphical y-axis limit for nucleotide misincorporation frequencies [%default]", type = float, \
-           default=0.3,action="store")
-    group2.add_option("-m", "--readplot", dest="readplot", \
-           help="read length, in nucleotides, considered for plotting nucleotide misincorporations [%default]", \
-           type = int, default=25, action="store")
-    group2.add_option("-b", "--refplot", dest="refplot", \
-          help="the number of reference nucleotides to consider for ploting base composition in the region located upstream "
-          "and downstream of every read [%default]", type= int, default=10, action="store")
-    group2.add_option("-t", "--title", dest="title", \
-          help="title used for plots [%default]", \
-          type="string", default="",action="store")
+    group2.add_option(
+        "-y",
+        "--ymax",
+        dest="ymax",
+        help="graphical y-axis limit for nucleotide misincorporation frequencies [%default]",
+        type=float,
+        default=0.3,
+        action="store",
+    )
+    group2.add_option(
+        "-m",
+        "--readplot",
+        dest="readplot",
+        help="read length, in nucleotides, considered for plotting nucleotide misincorporations [%default]",
+        type=int,
+        default=25,
+        action="store",
+    )
+    group2.add_option(
+        "-b",
+        "--refplot",
+        dest="refplot",
+        help="the number of reference nucleotides to consider for plotting base composition in the region located upstream "
+        "and downstream of every read [%default]",
+        type=int,
+        default=10,
+        action="store",
+    )
+    group2.add_option(
+        "-t",
+        "--title",
+        dest="title",
+        help="title used for plots [%default]",
+        type="string",
+        default="",
+        action="store",
+    )
     parser.add_option_group(group2)
 
     # Then the plethora of optional options for the statistical estimation ..
-    group3 = OptionGroup(parser,"Options for the statistical estimation")
-    group3.add_option("", "--rand", dest="rand", \
-            help="Number of random starting points for the likelihood optimization  [%default]", type = int, default=30, action="store")
-    group3.add_option("", "--burn", dest="burn", \
-            help="Number of burnin iterations  [%default]", type = int, default=10000,action="store")
-    group3.add_option("", "--adjust", dest="adjust", \
-            help="Number of adjust proposal variance parameters iterations  [%default]", type = int, default=10, action="store")
-    group3.add_option("", "--iter", dest="iter", \
-            help="Number of final MCMC iterations  [%default]", type = int, default=50000, action="store")
-    group3.add_option("", "--forward", dest="forward", \
-            help="Using only the 5' end of the seqs  [%default]", default=False, action="store_true")
-    group3.add_option("", "--reverse", dest="reverse", \
-            help="Using only the 3' end of the seqs  [%default]", default=False, action="store_true")
-    group3.add_option("", "--var-disp", dest="var_disp", \
-            help="Variable dispersion in the overhangs  [%default]", default=False,action="store_true")
-    group3.add_option("", "--jukes-cantor", dest="jukes_cantor", \
-            help="Use Jukes Cantor instead of HKY85  [%default]", default=False,action="store_true")
-    group3.add_option("", "--diff-hangs", dest="diff_hangs", \
-            help="The overhangs are different for 5' and 3'  [%default]", default=False, action="store_true")
-    group3.add_option("", "--fix-nicks" , dest="fix_nicks", \
-            help="Fix the nick frequency vector (Only C.T from the 5' end and G.A from the 3' end)  [%default]", default=False, action="store_true")
-    group3.add_option("", "--use-raw-nick-freq" , dest="use_raw_nick_freq", \
-            help="Use the raw nick frequency vector without smoothing  [%default]", default=False, action="store_true")
-    group3.add_option("", "--single-stranded", dest="single_stranded", \
-            help="Single stranded protocol [%default]", default=False, action="store_true")
-    group3.add_option("", "--theme-bw", dest="theme_bw", \
-            help="Use black and white theme in post. pred. plot [%default]", default=False, action="store_true")
-    group3.add_option("", "--seq-length", dest="seq_length", \
-            help="How long sequence to use from each side [%default]", type = int, default=12, action="store")
-    group3.add_option("--stats-only", dest="stats_only", help="Run only statistical estimation from a valid result folder", \
-          default=False, action="store_true")
-    group3.add_option("--no-stats", help="Disabled statistical estimation, active by default", default=False, action="store_true")
-    group3.add_option("--check-R-packages", help="Check if the R modules are working", default=False, action="store_true")
+    group3 = OptionGroup(parser, "Options for the statistical estimation")
+    group3.add_option(
+        "",
+        "--rand",
+        dest="rand",
+        help="Number of random starting points for the likelihood optimization  [%default]",
+        type=int,
+        default=30,
+        action="store",
+    )
+    group3.add_option(
+        "",
+        "--burn",
+        dest="burn",
+        help="Number of burn-in iterations  [%default]",
+        type=int,
+        default=10000,
+        action="store",
+    )
+    group3.add_option(
+        "",
+        "--adjust",
+        dest="adjust",
+        help="Number of adjust proposal variance parameters iterations  [%default]",
+        type=int,
+        default=10,
+        action="store",
+    )
+    group3.add_option(
+        "",
+        "--iter",
+        dest="iter",
+        help="Number of final MCMC iterations  [%default]",
+        type=int,
+        default=50000,
+        action="store",
+    )
+    group3.add_option(
+        "",
+        "--forward",
+        dest="forward",
+        help="Using only the 5' end of the seqs  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--reverse",
+        dest="reverse",
+        help="Using only the 3' end of the seqs  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--var-disp",
+        dest="var_disp",
+        help="Variable dispersion in the overhangs  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--jukes-cantor",
+        dest="jukes_cantor",
+        help="Use Jukes Cantor instead of HKY85  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--diff-hangs",
+        dest="diff_hangs",
+        help="The overhangs are different for 5' and 3'  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--fix-nicks",
+        dest="fix_nicks",
+        help="Fix the nick frequency vector (Only C.T from the 5' end and G.A from the 3' end)  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--use-raw-nick-freq",
+        dest="use_raw_nick_freq",
+        help="Use the raw nick frequency vector without smoothing  [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--single-stranded",
+        dest="single_stranded",
+        help="Single stranded protocol [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--theme-bw",
+        dest="theme_bw",
+        help="Use black and white theme in post. pred. plot [%default]",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "",
+        "--seq-length",
+        dest="seq_length",
+        help="How long sequence to use from each side [%default]",
+        type=int,
+        default=12,
+        action="store",
+    )
+    group3.add_option(
+        "--stats-only",
+        dest="stats_only",
+        help="Run only statistical estimation from a valid result folder",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "--no-stats",
+        help="Disabled statistical estimation, active by default",
+        default=False,
+        action="store_true",
+    )
+    group3.add_option(
+        "--check-R-packages",
+        help="Check if the R modules are working",
+        default=False,
+        action="store_true",
+    )
     parser.add_option_group(group3)
 
-    group4 = OptionGroup(parser,"Options for rescaling of BAM files")
-    group4.add_option("--rescale", dest="rescale", help="Rescale the quality scores in the BAM file using the output from the statistical estimation",
-          default=False, action="store_true")
-    group4.add_option("--rescale-only", dest="rescale_only", help="Run only rescaling from a valid result folder",
-          default=False, action="store_true")
-    group4.add_option("--rescale-out", dest="rescale_out", help="Write the rescaled BAM to this file",
-          default=None, action="store")
-    group4.add_option("--rescale-length-5p", dest="rescale_length_5p",
-                      help="How many bases to rescale at the 5' termini; defaults to --seq-length.", type=int, action="store")
-    group4.add_option("--rescale-length-3p", dest="rescale_length_3p",
-                      help="How many bases to rescale at the 5' termini; defaults to --seq-length.", type=int, action="store")
+    group4 = OptionGroup(parser, "Options for rescaling of BAM files")
+    group4.add_option(
+        "--rescale",
+        dest="rescale",
+        help="Rescale the quality scores in the BAM file using the output from the statistical estimation",
+        default=False,
+        action="store_true",
+    )
+    group4.add_option(
+        "--rescale-only",
+        dest="rescale_only",
+        help="Run only rescaling from a valid result folder",
+        default=False,
+        action="store_true",
+    )
+    group4.add_option(
+        "--rescale-out",
+        dest="rescale_out",
+        help="Write the rescaled BAM to this file",
+        default=None,
+        action="store",
+    )
+    group4.add_option(
+        "--rescale-length-5p",
+        dest="rescale_length_5p",
+        help="How many bases to rescale at the 5' termini; defaults to --seq-length.",
+        type=int,
+        action="store",
+    )
+    group4.add_option(
+        "--rescale-length-3p",
+        dest="rescale_length_3p",
+        help="How many bases to rescale at the 5' termini; defaults to --seq-length.",
+        type=int,
+        action="store",
+    )
     parser.add_option_group(group4)
 
-
-    #Parse the arguments
+    # Parse the arguments
     (options, args) = parser.parse_args()
 
     # check python version
@@ -161,9 +395,9 @@ def options():
 
     # check general arguments
     if not (options.plot_only or options.stats_only) and not options.filename:
-        parser.error('SAM/BAM file not given (-i)')
+        parser.error("SAM/BAM file not given (-i)")
     if not (options.plot_only or options.ref):
-        parser.error('Reference file not given (-r)')
+        parser.error("Reference file not given (-r)")
     if not options.plot_only and not options.stats_only:
         if not file_exist(options.filename) or not file_exist(options.ref):
             return None
@@ -174,57 +408,65 @@ def options():
             options.downsample = int(options.downsample)
 
     if options.plot_only and not options.folder:
-        parser.error('Folder not provided, required with --plot-only')
+        parser.error("Folder not provided, required with --plot-only")
     if options.stats_only and not options.folder:
-        parser.error('Folder not provided, required with --stats-only')
+        parser.error("Folder not provided, required with --stats-only")
     if options.rescale_only and not options.folder:
-        parser.error('Folder not provided, required with --rescale-only')
+        parser.error("Folder not provided, required with --rescale-only")
     if options.rescale_only and not options.filename:
-        parser.error('Input bam not provided, required with --rescale-only')
+        parser.error("Input bam not provided, required with --rescale-only")
     if options.rescale_only and not options.ref:
-        parser.error('Reference not provided, required with --rescale-only')
+        parser.error("Reference not provided, required with --rescale-only")
 
     if options.verbose and options.quiet:
-        parser.error('Cannot use verbose and quiet option at the same time')
+        parser.error("Cannot use verbose and quiet option at the same time")
 
     # check options
     if options.length < 0:
-        parser.error('length (-l) must be a positive integrer')
+        parser.error("length (-l) must be a positive integrer")
     if options.around < 0:
-        parser.error('around (-a) must be a positive integrer')
+        parser.error("around (-a) must be a positive integrer")
     if options.ymax <= 0 or options.ymax > 1:
-        parser.error('ymax (-b) must be an real number beetween 0 and 1')
+        parser.error("ymax (-b) must be an real number between 0 and 1")
     if options.readplot < 0:
-        parser.error('readplot (-m) must be a positive integrer')
+        parser.error("readplot (-m) must be a positive integrer")
     if options.refplot < 0:
-        parser.error('refplot (-b) must be a positive integrer')
+        parser.error("refplot (-b) must be a positive integrer")
     if options.refplot > options.around and not options.plot_only:
-        parser.error('refplot (-b) must be inferior to around (-a)')
+        parser.error("refplot (-b) must be inferior to around (-a)")
     if options.readplot > options.length:
-        parser.error('readplot (-m) must be inferior to length (-l)')
-    if options.minqual < 0 or  options.minqual > 41:
-        parser.error('minimal base quality, Phred score, must be within this range: 0 - 41')
+        parser.error("readplot (-m) must be inferior to length (-l)")
+    if options.minqual < 0 or options.minqual > 41:
+        parser.error(
+            "minimal base quality, Phred score, must be within this range: 0 - 41"
+        )
 
     # check statistic options
     if options.forward and options.reverse:
-        parser.error('Cannot use only forward end and only reverse end for the statistics')
+        parser.error(
+            "Cannot use only forward end and only reverse end for the statistics"
+        )
 
     # use filename as default for plot titles if not set
     if options.title == "" and options.filename:
         options.title = os.path.splitext(os.path.basename(options.filename))[0]
     # for --plot-only, use the folder name, without results_ as title
     if options.title == "" and not options.filename and options.folder:
-        options.title = os.path.splitext(os.path.basename(options.folder))[0].replace("results_", "")
+        options.title = os.path.splitext(os.path.basename(options.folder))[0].replace(
+            "results_", ""
+        )
 
     # check folder
     if not options.folder and options.filename:
-        options.folder = "results_"+os.path.splitext(os.path.basename(options.filename))[0]
+        options.folder = (
+            "results_" + os.path.splitext(os.path.basename(options.filename))[0]
+        )
 
     # check destination for rescaled bam
     if not options.rescale_out and (options.rescale or options.rescale_only):
-        # if there are mulitiple bam files to rescale then pick first one as
+        # if there are multiple bam files to rescale then pick first one as
         # the name of the rescaled file
-        if isinstance(options.filename,list):
+        if isinstance(options.filename, list):
             basename = os.path.basename(options.filename[0])
         else:
             basename = os.path.basename(options.filename)
@@ -235,34 +477,45 @@ def options():
         if not options.quiet and not options.plot_only:
             print(("Warning, %s already exists" % options.folder))
         if options.plot_only:
-            if not file_exist(options.folder+"/dnacomp.txt") or not file_exist(options.folder+"/misincorporation.txt"):
-                parser.error('folder %s is not a valid result folder' % options.folder)
+            if not file_exist(options.folder + "/dnacomp.txt") or not file_exist(
+                options.folder + "/misincorporation.txt"
+            ):
+                parser.error("folder %s is not a valid result folder" % options.folder)
     else:
-        os.makedirs(options.folder, mode = 0o750)
+        os.makedirs(options.folder, mode=0o750)
         if options.plot_only or options.stats_only or options.rescale_only:
-            sys.stderr.write("Error, %s does not exist while plot/stats/rescale only was used\n" % options.folder)
+            sys.stderr.write(
+                "Error, %s does not exist while plot/stats/rescale only was used\n"
+                % options.folder
+            )
             return None
 
     if options.rescale_length_3p is None:
         options.rescale_length_3p = options.seq_length
     elif not (0 <= options.rescale_length_3p <= options.seq_length):
-        parser.error("--rescale-length-3p must be less than or equal to "
-                     "--seq-length and greater than zero")
+        parser.error(
+            "--rescale-length-3p must be less than or equal to "
+            "--seq-length and greater than zero"
+        )
 
     if options.rescale_length_5p is None:
         options.rescale_length_5p = options.seq_length
     elif not (0 <= options.rescale_length_5p <= options.seq_length):
-        parser.error("--rescale-length-5p must be less than or equal to "
-                     "--seq-length and greater than zero")
+        parser.error(
+            "--rescale-length-5p must be less than or equal to "
+            "--seq-length and greater than zero"
+        )
 
     # check if the Rscript executable is present on the system
-    if not whereis('Rscript'):
+    if not whereis("Rscript"):
         print("Warning, Rscript is not in your PATH, plotting is disabled")
         options.no_r = True
 
     # check the nick frequencies options
-    if (options.use_raw_nick_freq + options.fix_nicks + options.single_stranded)>1:
-        parser.error("The options --use-raw-nick-freq, --fix-nicks and --single-stranded are mutually exclusive.")
+    if (options.use_raw_nick_freq + options.fix_nicks + options.single_stranded) > 1:
+        parser.error(
+            "The options --use-raw-nick-freq, --fix-nicks and --single-stranded are mutually exclusive."
+        )
 
     if check_R_lib():
         # check for R libraries
@@ -275,5 +528,4 @@ def options():
         if options.rescale_only:
             sys.exit("Cannot use --rescale-only with missing R libraries")
 
-
     return options


=====================================
mapdamage/rescale.py
=====================================
@@ -1,21 +1,24 @@
 import csv
-import sys
-import os
-import mapdamage
-import pysam
-import itertools
-import math
 import logging
+import math
+import os
+import sys
 import time
 
+import pysam
+
+import mapdamage
+
 
 def phred_pval_to_char(pval):
     """Transforming error rate to ASCII character using the Phred scale"""
-    return chr(int(round(-10*math.log10(abs(pval)))+33))
+    return chr(int(round(-10 * math.log10(abs(pval))) + 33))
+
 
 def phred_char_to_pval(ch):
     """Transforming ASCII character in the Phred scale to the error rate"""
-    return 10**(-(float(ord(ch))-float(33))/10)
+    return 10 ** (-(float(ord(ch)) - float(33)) / 10)
+
 
 def get_corr_prob(folder, rescale_length_5p, rescale_length_3p):
     """
@@ -24,19 +27,28 @@ def get_corr_prob(folder, rescale_length_5p, rescale_length_3p):
     position (one based)  -  CT  -  probability
                           -  GA  -  probability
     """
-    full_path = os.path.join(folder,"Stats_out_MCMC_correct_prob.csv")
+    full_path = os.path.join(folder, "Stats_out_MCMC_correct_prob.csv")
     if not os.path.isfile(full_path):
-        sys.exit("Missing file, the file \n\tStats_out_MCMC_correct_prob.csv\nshould be in the folder\n\t"+folder+"\nDid you run the MCMC estimates of the parameters?")
+        sys.exit(
+            "Missing file, the file \n\tStats_out_MCMC_correct_prob.csv\nshould be in the folder\n\t"
+            + folder
+            + "\nDid you run the MCMC estimates of the parameters?"
+        )
     try:
         with open(full_path) as fi:
             fi_handle = csv.DictReader(fi)
             corr_prob = {}
             for line in fi_handle:
-                if (line["Position"] in corr_prob):
-                    sys.exit('This file has multiple position definitions %s, line %d: %s' % \
-                        (folder, fi_handle.line_num, corr_prob[line["Position"]]))
+                if line["Position"] in corr_prob:
+                    sys.exit(
+                        "This file has multiple position definitions %s, line %d: %s"
+                        % (folder, fi_handle.line_num, corr_prob[line["Position"]])
+                    )
                 else:
-                    corr_prob[int(line["Position"])] = {'C.T':float(line["C.T"]), 'G.A':float(line["G.A"])}
+                    corr_prob[int(line["Position"])] = {
+                        "C.T": float(line["C.T"]),
+                        "G.A": float(line["G.A"]),
+                    }
 
             # Exclude probabilities for positions outside of user-specified region
             for key in list(corr_prob.keys()):
@@ -44,11 +56,17 @@ def get_corr_prob(folder, rescale_length_5p, rescale_length_3p):
                     corr_prob.pop(key)
             return corr_prob
     except csv.Error as e:
-        sys.exit('File %s, line %d: %s' % (os.path.join(folder,"Stats_out_MCMC_correct_prob.csv"), \
-            fi_handle.line_num, e))
+        sys.exit(
+            "File %s, line %d: %s"
+            % (
+                os.path.join(folder, "Stats_out_MCMC_correct_prob.csv"),
+                fi_handle.line_num,
+                e,
+            )
+        )
 
 
-def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length,direction="both"):
+def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length, direction="both"):
     """
     The position specific damaging correction, using the input
     corr_prob dictionary holding the damage correcting values
@@ -59,20 +77,20 @@ def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length,direction="both"):
     direction which end to consider the rescaling
     returns the correction probability for this particular set
     """
-    if (pos == 0):
+    if pos == 0:
         # not using 0 based indexing
         raise SystemError
-    if ( nt_seq == "T" and nt_ref == "C" ):
+    if nt_seq == "T" and nt_ref == "C":
         # an C to T transition
         subs = "C.T"
-    elif( nt_seq == "A" and nt_ref == "G" ):
+    elif nt_seq == "A" and nt_ref == "G":
         # an G to A transition
         subs = "G.A"
     else:
         # other transitions/transversions are not affected by damage
         return 0
 
-    back_pos = pos-length-1
+    back_pos = pos - length - 1
     # position from 3' end
 
     if pos in corr_prob:
@@ -92,123 +110,177 @@ def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length,direction="both"):
     elif direction == "backward":
         return p3_corr
     elif direction == "both":
-        if pos < abs(back_pos) :
+        if pos < abs(back_pos):
             # then we use the forward correction
             return p5_corr
-        else :
+        else:
             # else the backward correction
             return p3_corr
     else:
         # this should not happen
         raise SystemExit("Abnormal direction in the rescaling procedure")
 
+
 def initialize_subs():
     """Initialize a substitution table, to track the expected substitution counts"""
-    per_qual = dict(list(zip(list(range(0,130)),[0]*130)))
-    subs = {"CT-before":per_qual.copy(),\
-            "TC-before":per_qual.copy(),\
-            "GA-before":per_qual.copy(),\
-            "AG-before":per_qual.copy(),\
-            "CT-after":per_qual.copy(),\
-            "TC-after":per_qual.copy(),\
-            "GA-after":per_qual.copy(),\
-            "AG-after":per_qual.copy(),\
-            "A":0,\
-            "C":0,\
-            "G":0,\
-            "T":0,\
-            "CT-pvals":0.0,\
-            "CT-pvals_before":0.0,\
-            "TC-pvals":0.0,\
-            "GA-pvals":0.0,\
-            "GA-pvals_before":0.0,\
-            "AG-pvals":0.0,\
-            }
+    per_qual = dict(list(zip(list(range(0, 130)), [0] * 130)))
+    subs = {
+        "CT-before": per_qual.copy(),
+        "TC-before": per_qual.copy(),
+        "GA-before": per_qual.copy(),
+        "AG-before": per_qual.copy(),
+        "CT-after": per_qual.copy(),
+        "TC-after": per_qual.copy(),
+        "GA-after": per_qual.copy(),
+        "AG-after": per_qual.copy(),
+        "A": 0,
+        "C": 0,
+        "G": 0,
+        "T": 0,
+        "CT-pvals": 0.0,
+        "CT-pvals_before": 0.0,
+        "TC-pvals": 0.0,
+        "GA-pvals": 0.0,
+        "GA-pvals_before": 0.0,
+        "AG-pvals": 0.0,
+    }
     return subs
 
 
-
-def record_subs(subs,nt_seq,nt_ref,nt_qual,nt_newqual,prob_corr):
-    """ record the expected substitution change, prob_corr is the excact version for nt_qual"""
-    if ( nt_seq == "T" and nt_ref == "C"):
+def record_subs(subs, nt_seq, nt_ref, nt_qual, nt_newqual, prob_corr):
+    """record the expected substitution change, prob_corr is the exact version for nt_qual"""
+    if nt_seq == "T" and nt_ref == "C":
         sub_type = "CT"
         subs["CT-pvals"] += prob_corr
-        subs["CT-pvals_before"] += 1-phred_char_to_pval(nt_qual)
-    elif ( nt_seq == "A" and nt_ref == "G"):
+        subs["CT-pvals_before"] += 1 - phred_char_to_pval(nt_qual)
+    elif nt_seq == "A" and nt_ref == "G":
         sub_type = "GA"
         subs["GA-pvals"] += prob_corr
-        subs["GA-pvals_before"] += 1-phred_char_to_pval(nt_qual)
-    elif ( nt_seq == "C" and nt_ref == "T"):
+        subs["GA-pvals_before"] += 1 - phred_char_to_pval(nt_qual)
+    elif nt_seq == "C" and nt_ref == "T":
         sub_type = "TC"
-        subs["TC-pvals"] += 1-phred_char_to_pval(nt_qual)
-        if (nt_qual != nt_newqual):
-            raise SystemError("Internal error: rescaling qualities for the wrong transitions")
-    elif ( nt_seq == "G" and nt_ref == "A"):
+        subs["TC-pvals"] += 1 - phred_char_to_pval(nt_qual)
+        if nt_qual != nt_newqual:
+            raise SystemError(
+                "Internal error: rescaling qualities for the wrong transitions"
+            )
+    elif nt_seq == "G" and nt_ref == "A":
         sub_type = "AG"
-        subs["AG-pvals"] += 1-phred_char_to_pval(nt_qual)
-        if (nt_qual != nt_newqual):
-            raise SystemError("Internal error: rescaling qualities for the wrong transitions")
+        subs["AG-pvals"] += 1 - phred_char_to_pval(nt_qual)
+        if nt_qual != nt_newqual:
+            raise SystemError(
+                "Internal error: rescaling qualities for the wrong transitions"
+            )
     else:
         sub_type = "NN"
-    if (sub_type != "NN"):
+    if sub_type != "NN":
         # record only transitions
-        subs[sub_type+"-before"][int(ord(nt_qual))-33] += 1
-        subs[sub_type+"-after"][int(ord(nt_newqual))-33] += 1
-    if (nt_ref in ["A","C","G","T"]):
+        subs[sub_type + "-before"][int(ord(nt_qual)) - 33] += 1
+        subs[sub_type + "-after"][int(ord(nt_newqual)) - 33] += 1
+    if nt_ref in ["A", "C", "G", "T"]:
         subs[nt_ref] += 1
 
+
 def qual_summary_subs(subs):
-    """Calculates summary statistics for the substition table subs"""
-    for i in ["CT-before","TC-before","GA-before","AG-before","CT-after","TC-after","GA-after","AG-after"]:
-        for lv in [0,10,20,30,40]:
+    """Calculates summary statistics for the substitution table subs"""
+    for i in [
+        "CT-before",
+        "TC-before",
+        "GA-before",
+        "AG-before",
+        "CT-after",
+        "TC-after",
+        "GA-after",
+        "AG-after",
+    ]:
+        for lv in [0, 10, 20, 30, 40]:
             for qv in subs[i]:
-                if qv >= lv :
-                    key = i+"-Q"+str(lv)
+                if qv >= lv:
+                    key = i + "-Q" + str(lv)
                     if key in subs:
                         subs[key] += subs[i][qv]
                     else:
                         subs[key] = subs[i][qv]
 
+
 def print_subs(subs):
-    """Print the substition table"""
-    print("\tThe expected substition frequencies before and after scaling using the scaled qualities as probalities:")
-    if subs["C"]!=0:
+    """Print the substitution table"""
+    print(
+        "\tThe expected substitution frequencies before and after scaling using the scaled qualities as probalities:"
+    )
+    if subs["C"] != 0:
         # the special case of no substitutions
-        print(("\tCT\t"+str(subs["CT-pvals_before"]/subs["C"])+"\t\t"+str(subs["CT-pvals"]/subs["C"])))
+        print(
+            "\tCT\t"
+            + str(subs["CT-pvals_before"] / subs["C"])
+            + "\t\t"
+            + str(subs["CT-pvals"] / subs["C"])
+        )
     else:
         print("\tCT\tNA\t\tNA")
-    if subs["T"]!=0:
-        print(("\tTC\t"+str(subs["TC-pvals"]/subs["T"])+"\t\t"+str(subs["TC-pvals"]/subs["T"])))
+    if subs["T"] != 0:
+        print(
+            "\tTC\t"
+            + str(subs["TC-pvals"] / subs["T"])
+            + "\t\t"
+            + str(subs["TC-pvals"] / subs["T"])
+        )
     else:
         print("\tTC\tNA\t\tNA")
-    if subs["G"]!=0:
-        print(("\tGA\t"+str(subs["GA-pvals_before"]/subs["G"])+"\t\t"+str(subs["GA-pvals"]/subs["G"])))
+    if subs["G"] != 0:
+        print(
+            "\tGA\t"
+            + str(subs["GA-pvals_before"] / subs["G"])
+            + "\t\t"
+            + str(subs["GA-pvals"] / subs["G"])
+        )
     else:
         print("\tGA\tNA\t\tNA")
-    if subs["A"]!=0:
-        print(("\tAG\t"+str(subs["AG-pvals"]/subs["A"])+"\t\t"+str(subs["AG-pvals"]/subs["A"])))
+    if subs["A"] != 0:
+        print(
+            "\tAG\t"
+            + str(subs["AG-pvals"] / subs["A"])
+            + "\t\t"
+            + str(subs["AG-pvals"] / subs["A"])
+        )
     else:
         print("\tAG\tNA\t\tNA")
     print("\tQuality metrics before and after scaling")
-    print(("\tCT-Q0 \t"+str(subs["CT-before-Q0"])+"\t\t"+str(subs["CT-after-Q0"])))
-    print(("\tCT-Q10 \t"+str(subs["CT-before-Q10"])+"\t\t"+str(subs["CT-after-Q10"])))
-    print(("\tCT-Q20 \t"+str(subs["CT-before-Q20"])+"\t\t"+str(subs["CT-after-Q20"])))
-    print(("\tCT-Q30 \t"+str(subs["CT-before-Q30"])+"\t\t"+str(subs["CT-after-Q30"])))
-    print(("\tCT-Q40 \t"+str(subs["CT-before-Q40"])+"\t\t"+str(subs["CT-after-Q40"])))
-    print(("\tGA-Q0 \t"+str(subs["GA-before-Q0"])+"\t\t"+str(subs["GA-after-Q0"])))
-    print(("\tGA-Q10 \t"+str(subs["GA-before-Q10"])+"\t\t"+str(subs["GA-after-Q10"])))
-    print(("\tGA-Q20 \t"+str(subs["GA-before-Q20"])+"\t\t"+str(subs["GA-after-Q20"])))
-    print(("\tGA-Q30 \t"+str(subs["GA-before-Q30"])+"\t\t"+str(subs["GA-after-Q30"])))
-    print(("\tGA-Q40 \t"+str(subs["GA-before-Q40"])+"\t\t"+str(subs["GA-after-Q40"])))
-
-
-def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="both"):
+    print("\tCT-Q0 \t" + str(subs["CT-before-Q0"]) + "\t\t" + str(subs["CT-after-Q0"]))
+    print(
+        "\tCT-Q10 \t" + str(subs["CT-before-Q10"]) + "\t\t" + str(subs["CT-after-Q10"])
+    )
+    print(
+        "\tCT-Q20 \t" + str(subs["CT-before-Q20"]) + "\t\t" + str(subs["CT-after-Q20"])
+    )
+    print(
+        "\tCT-Q30 \t" + str(subs["CT-before-Q30"]) + "\t\t" + str(subs["CT-after-Q30"])
+    )
+    print(
+        "\tCT-Q40 \t" + str(subs["CT-before-Q40"]) + "\t\t" + str(subs["CT-after-Q40"])
+    )
+    print("\tGA-Q0 \t" + str(subs["GA-before-Q0"]) + "\t\t" + str(subs["GA-after-Q0"]))
+    print(
+        "\tGA-Q10 \t" + str(subs["GA-before-Q10"]) + "\t\t" + str(subs["GA-after-Q10"])
+    )
+    print(
+        "\tGA-Q20 \t" + str(subs["GA-before-Q20"]) + "\t\t" + str(subs["GA-after-Q20"])
+    )
+    print(
+        "\tGA-Q30 \t" + str(subs["GA-before-Q30"]) + "\t\t" + str(subs["GA-after-Q30"])
+    )
+    print(
+        "\tGA-Q40 \t" + str(subs["GA-before-Q40"]) + "\t\t" + str(subs["GA-after-Q40"])
+    )
+
+
+def rescale_qual_read(bam, read, ref, corr_prob, subs, debug=False, direction="both"):
     """
     bam              a pysam bam object
     read             a pysam read object
     ref              a pysam fasta ref file
     reflengths       a dictionary holding the length of the references
-    subs             a dictionary holding the corrected number of substition before and after scaling
+    subs             a dictionary holding the corrected number of substitution before and after scaling
     corr_prob dictionary from get_corr_prob
     returns a read with rescaled quality score
 
@@ -225,8 +297,9 @@ def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="b
     chrom = bam.getrname(read.tid)
     refseq = ref.fetch(chrom, min(coordinate), max(coordinate)).upper()
     # add gaps to qualities and mask read and reference nucleotides if below desired threshold
-    (seq, qual, refseq) = mapdamage.align.align_with_qual(read.cigar, \
-        raw_seq, read.qqual, -100, refseq)
+    (seq, qual, refseq) = mapdamage.align.align_with_qual(
+        read.cigar, raw_seq, read.qqual, -100, refseq
+    )
     length_read = len(raw_seq)
     length_align = len(seq)
     # reverse complement read and reference when mapped reverse strand
@@ -234,43 +307,53 @@ def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="b
         refseq = mapdamage.seq.revcomp(refseq)
         seq = mapdamage.seq.revcomp(seq)
         qual = qual[::-1]
-    new_qual = [-100]*length_read
+    new_qual = [-100] * length_read
     pos_on_read = 0
     number_of_rescaled_bases = 0.0
-    for (i, nt_seq, nt_ref, nt_qual) in zip(range(length_align), seq, refseq, qual):
+    for i, nt_seq, nt_ref, nt_qual in zip(range(length_align), seq, refseq, qual):
         # rescale the quality according to the triplet position,
         # pair of the reference and the sequence
-        if ((nt_seq == "T" and nt_ref =="C") or (nt_seq == "A" and nt_ref =="G")):
+        if (nt_seq == "T" and nt_ref == "C") or (nt_seq == "A" and nt_ref == "G"):
             # need to rescale this subs.
-            pdam = 1 - corr_this_base(corr_prob, nt_seq, nt_ref, pos_on_read + 1, length_read,direction=direction)
+            pdam = 1 - corr_this_base(
+                corr_prob,
+                nt_seq,
+                nt_ref,
+                pos_on_read + 1,
+                length_read,
+                direction=direction,
+            )
             pseq = 1 - phred_char_to_pval(nt_qual)
-            newp = pdam*pseq # this could be numerically unstable
-            newq = phred_pval_to_char(1-newp)
-            number_of_rescaled_bases += 1-pdam
+            newp = pdam * pseq  # this could be numerically unstable
+            newq = phred_pval_to_char(1 - newp)
+            number_of_rescaled_bases += 1 - pdam
         else:
             # don't rescale, other bases
             newp = 1 - phred_char_to_pval(nt_qual)
             newq = nt_qual
         if pos_on_read < length_read:
             new_qual[pos_on_read] = newq
-            record_subs(subs,nt_seq,nt_ref,nt_qual,new_qual[pos_on_read],newp)
+            record_subs(subs, nt_seq, nt_ref, nt_qual, new_qual[pos_on_read], newp)
             if nt_seq != "-":
                 pos_on_read += 1
             # done with the aligned portion of the read
         else:
             if not debug:
-                logger.warning("Warning: The aligment of the read is longer than the actual read %s",(read.qname))
+                logger.warning(
+                    "Warning: The alignment of the read is longer than the actual read %s",
+                    (read.qname),
+                )
             break
     new_qual = "".join(new_qual)
 
     if read.is_reverse:
         new_qual = new_qual[::-1]
-    if (read.cigar[0][0] == 4):
+    if read.cigar[0][0] == 4:
         # check for soft clipping at forward end
-        new_qual = read.qual[0:read.cigar[0][1]] + new_qual
-    if (read.cigar[-1][0] == 4):
+        new_qual = read.qual[0 : read.cigar[0][1]] + new_qual
+    if read.cigar[-1][0] == 4:
         # the same backwards
-        new_qual = new_qual + read.qual[-read.cigar[-1][1]:]
+        new_qual = new_qual + read.qual[-read.cigar[-1][1] :]
 
     read.qual = new_qual
     # truncate this to 5 digits
@@ -279,12 +362,12 @@ def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="b
     if read.has_tag("MR"):
         raise SystemExit("Read: %s already has a MR tag, can't rescale" % read)
 
-    read.set_tag("MR", number_of_rescaled_bases, 'f')
+    read.set_tag("MR", number_of_rescaled_bases, "f")
 
     return read
 
 
-def rescale_qual(ref, options,debug=False):
+def rescale_qual(ref, options, debug=False):
     """
     ref                a pysam fasta ref file
     bam_filename       name of a BAM/SAM file to read
@@ -297,7 +380,9 @@ def rescale_qual(ref, options,debug=False):
     if not debug:
         # no need to log when unit testing
         logger = logging.getLogger(__name__)
-        logger.info("Rescaling BAM: '%s' -> '%s'" % (options.filename, options.rescale_out))
+        logger.info(
+            "Rescaling BAM: '%s' -> '%s'" % (options.filename, options.rescale_out)
+        )
     start_time = time.time()
 
     # open SAM/BAM files
@@ -306,10 +391,12 @@ def rescale_qual(ref, options,debug=False):
         write_mode = "wh"
     else:
         write_mode = "wb"
-    bam_out = pysam.Samfile(options.rescale_out,write_mode, template = bam)
-    corr_prob = get_corr_prob(options.folder,
-                              rescale_length_5p=options.rescale_length_5p,
-                              rescale_length_3p=options.rescale_length_3p)
+    bam_out = pysam.Samfile(options.rescale_out, write_mode, template=bam)
+    corr_prob = get_corr_prob(
+        options.folder,
+        rescale_length_5p=options.rescale_length_5p,
+        rescale_length_3p=options.rescale_length_3p,
+    )
     subs = initialize_subs()
     first_pair = True
     number_of_non_proper_pairs = 0
@@ -318,40 +405,64 @@ def rescale_qual(ref, options,debug=False):
         if hit.is_unmapped:
             pass
         elif not hit.qual and not debug:
-            logger.warning("Cannot rescale base PHRED scores for read '%s'; no scores assigned." % hit.qname)
-        elif hit.is_paired :
+            logger.warning(
+                "Cannot rescale base PHRED scores for read '%s'; no scores assigned."
+                % hit.qname
+            )
+        elif hit.is_paired:
             if first_pair and not debug:
                 # assuming the ends are non-overlapping
-                logger.warning("Warning! Assuming the pairs are non-overlapping, facing inwards and correctly paired.")
-                first_pair=False
-            #5p --------------> 3p
-            #3p <-------------- 5p
+                logger.warning(
+                    "Warning! Assuming the pairs are non-overlapping, facing inwards and correctly paired."
+                )
+                first_pair = False
+            # 5p --------------> 3p
+            # 3p <-------------- 5p
             # pair 1 (inwards)
-            #5p ---->
+            # 5p ---->
             #             <---- 5p
             #     A         B
             # pair 2 (outwards), this happens if the reference is RC this is not supported
             #             ----> 3p
-            #3p <----
+            # 3p <----
             #     A         B
             # Correct outwards pairs from the 3p and inwards pairs with the 5p end
-            if ((not hit.is_reverse) and hit.mate_is_reverse and (hit.pnext>hit.pos) and hit.tid==hit.mrnm):
+            if (
+                (not hit.is_reverse)
+                and hit.mate_is_reverse
+                and (hit.pnext > hit.pos)
+                and hit.tid == hit.mrnm
+            ):
                 # the inwards case mate A
-                hit = rescale_qual_read(bam, hit, ref, corr_prob,subs,direction="forward",debug=debug)
-            elif (hit.is_reverse and (not hit.mate_is_reverse) and (hit.pnext<hit.pos) and hit.tid==hit.mrnm):
+                hit = rescale_qual_read(
+                    bam, hit, ref, corr_prob, subs, direction="forward", debug=debug
+                )
+            elif (
+                hit.is_reverse
+                and (not hit.mate_is_reverse)
+                and (hit.pnext < hit.pos)
+                and hit.tid == hit.mrnm
+            ):
                 # the inwards case mate B
-                hit = rescale_qual_read(bam, hit, ref, corr_prob,subs,direction="forward",debug=debug)
+                hit = rescale_qual_read(
+                    bam, hit, ref, corr_prob, subs, direction="forward", debug=debug
+                )
             else:
                 number_of_non_proper_pairs += 1
                 # cannot do much with conflicting pairing information
         else:
-            hit = rescale_qual_read(bam, hit, ref, corr_prob,subs,debug=debug)
+            hit = rescale_qual_read(bam, hit, ref, corr_prob, subs, debug=debug)
 
         bam_out.write(hit)
-    if  number_of_non_proper_pairs!=0 and not debug:
-        logger.warning("Number of non-rescaled reads due to improper pairing:  %d" % number_of_non_proper_pairs)
-    if (subs["TC-before"] != subs["TC-after"] or subs["AG-before"] != subs["AG-after"]):
-        sys.exit("Qualities for T.C and A.G transitions should not change in the rescaling, please contact the authors.")
+    if number_of_non_proper_pairs != 0 and not debug:
+        logger.warning(
+            "Number of non-rescaled reads due to improper pairing:  %d"
+            % number_of_non_proper_pairs
+        )
+    if subs["TC-before"] != subs["TC-after"] or subs["AG-before"] != subs["AG-after"]:
+        sys.exit(
+            "Qualities for T.C and A.G transitions should not change in the rescaling, please contact the authors."
+        )
     qual_summary_subs(subs)
     bam.close()
     bam_out.close()


=====================================
mapdamage/rscript.py
=====================================
@@ -1,76 +1,97 @@
+import logging
 import os
 import subprocess
+import time
 from subprocess import CalledProcessError, check_call
-from mapdamage.version import __version__
+
 import mapdamage
-import logging
-import time
+from mapdamage.version import __version__
 
 
-def construct_path(name,folder="Rscripts"):
+def construct_path(name, folder="Rscripts"):
     """Construct a path to the mapdamage package data given the name"""
-    return(os.path.join(mapdamage.__path__[0], folder, name))
+    return os.path.join(mapdamage.__path__[0], folder, name)
 
-def plot(opt):
-  """
-  Calls the R script to make the plots, takes in the arguments
-  from parameter processing
-  """
-  # comp<-args[1]
-  # pdfout<-args[2]
-  # around<-as.numeric(args[3])
-  # misincorp<-args[4]
-  # lg<-as.numeric(args[5])
-  # ymax<-as.numeric(args[6])
-  # folder<-args[7]
-  # title<-args[8]
-  # version<-args[9]
-
-  fmut = opt.folder+"/"+"misincorporation.txt"
-  fcomp = opt.folder+"/"+"dnacomp.txt"
-  title = opt.folder+"/"+"Fragmisincorporation_plot.pdf"
-
-  script = construct_path("mapDamage.R")
-  call = ["Rscript", script, fcomp, title, opt.refplot, fmut, opt.readplot, \
-      opt.ymax, opt.folder, opt.title, __version__]
-  code = subprocess.call(list(map(str, call)))
-
-  if code == 0:
-    if not opt.quiet:
-      print(("pdf %s generated" % title))
-    return 0
-  else:
-    print("Error: plotting with R failed")
-    return 1
 
+def plot(opt):
+    """
+    Calls the R script to make the plots, takes in the arguments
+    from parameter processing
+    """
+    # comp<-args[1]
+    # pdfout<-args[2]
+    # around<-as.numeric(args[3])
+    # misincorp<-args[4]
+    # lg<-as.numeric(args[5])
+    # ymax<-as.numeric(args[6])
+    # folder<-args[7]
+    # title<-args[8]
+    # version<-args[9]
+
+    fmut = opt.folder + "/" + "misincorporation.txt"
+    fcomp = opt.folder + "/" + "dnacomp.txt"
+    title = opt.folder + "/" + "Fragmisincorporation_plot.pdf"
+
+    script = construct_path("mapDamage.R")
+    call = [
+        "Rscript",
+        script,
+        fcomp,
+        title,
+        opt.refplot,
+        fmut,
+        opt.readplot,
+        opt.ymax,
+        opt.folder,
+        opt.title,
+        __version__,
+    ]
+    code = subprocess.call(list(map(str, call)))
+
+    if code == 0:
+        if not opt.quiet:
+            print(("pdf %s generated" % title))
+        return 0
+    else:
+        print("Error: plotting with R failed")
+        return 1
 
 
 def opt_plots(opt):
-  """ optional plots for length distribution
-  and cumulative C>T mutations, per strand """
-
-  fmut = opt.folder+"/"+"misincorporation.txt"
-  flength = opt.folder+"/"+"lgdistribution.txt"
-  output = opt.folder+"/"+"Length_plot.pdf"
-
-  script = construct_path("lengths.R")
-  call = ["Rscript", script, flength, output, fmut, opt.length, \
-      opt.title, __version__, bool_to_int(opt.quiet)]
-  code = subprocess.call(list(map(str, call)))
-  if code == 0:
-    return 0
-  else:
-    print("Error: plotting with R failed")
-    return 1
+    """optional plots for length distribution
+    and cumulative C>T mutations, per strand"""
+
+    fmut = opt.folder + "/" + "misincorporation.txt"
+    flength = opt.folder + "/" + "lgdistribution.txt"
+    output = opt.folder + "/" + "Length_plot.pdf"
+
+    script = construct_path("lengths.R")
+    call = [
+        "Rscript",
+        script,
+        flength,
+        output,
+        fmut,
+        opt.length,
+        opt.title,
+        __version__,
+        bool_to_int(opt.quiet),
+    ]
+    code = subprocess.call(list(map(str, call)))
+    if code == 0:
+        return 0
+    else:
+        print("Error: plotting with R failed")
+        return 1
 
 
 def check_R_one_lib(name):
     """Checks if a necessary R library is here."""
     with open(os.devnull, "w") as devnull:
         rpa = construct_path("stats/checkLibraries.R")
-        return subprocess.call(["Rscript", rpa, "--args", name],
-                               stdout = devnull,
-                               stderr = devnull)
+        return subprocess.call(
+            ["Rscript", rpa, "--args", name], stdout=devnull, stderr=devnull
+        )
 
 
 def check_R_lib():
@@ -83,18 +104,25 @@ def check_R_lib():
     for lib in libs:
         # Check the libraries
         if check_R_one_lib(lib):
-            #found a missing library
+            # found a missing library
             missing_lib.append(lib)
     if len(missing_lib) > 1:
         # Grammar Nazi has arrived
         last_ele = missing_lib.pop()
-        print(("Missing the following R libraries '" + "', '".join(missing_lib) \
-            + "' and '" + last_ele + "'"))
+        print(
+            (
+                "Missing the following R libraries '"
+                + "', '".join(missing_lib)
+                + "' and '"
+                + last_ele
+                + "'"
+            )
+        )
         return 1
     elif len(missing_lib) == 1:
-        print(("Missing the following R library "+missing_lib[0]))
+        print(("Missing the following R library " + missing_lib[0]))
         return 1
-    else :
+    else:
         # No missing libraries
         return 0
 
@@ -103,31 +131,32 @@ def run_stats(opt):
     """
     Runs the Bayesian estimation program, using the options opt
     """
-    arg = ["Rscript",                            \
-         construct_path("stats/runGeneral.R"), \
-         "--args",                               \
-         opt.rand,                               \
-         opt.burn,                               \
-         opt.adjust,                             \
-         opt.iter,                               \
-         bool_to_int(opt.forward),               \
-         bool_to_int(opt.reverse),               \
-         bool_to_int(not opt.var_disp),              \
-         bool_to_int(not opt.diff_hangs),            \
-         0,                                      \
-         bool_to_int(opt.fix_nicks),             \
-         bool_to_int(not opt.single_stranded),       \
-         opt.seq_length,                         \
-         opt.folder+"/",                         \
-         construct_path("stats/"),             \
-         opt.folder+"/Stats_out",                \
-         int(opt.verbose),                       \
-         int(opt.quiet),                         \
-         int(opt.jukes_cantor),                       \
-         opt.folder+"/acgt_ratio.csv",            \
-         int(opt.use_raw_nick_freq),                   \
-         int(opt.theme_bw)                   \
-         ]
+    arg = [
+        "Rscript",
+        construct_path("stats/runGeneral.R"),
+        "--args",
+        opt.rand,
+        opt.burn,
+        opt.adjust,
+        opt.iter,
+        bool_to_int(opt.forward),
+        bool_to_int(opt.reverse),
+        bool_to_int(not opt.var_disp),
+        bool_to_int(not opt.diff_hangs),
+        0,
+        bool_to_int(opt.fix_nicks),
+        bool_to_int(not opt.single_stranded),
+        opt.seq_length,
+        opt.folder + "/",
+        construct_path("stats/"),
+        opt.folder + "/Stats_out",
+        int(opt.verbose),
+        int(opt.quiet),
+        int(opt.jukes_cantor),
+        opt.folder + "/acgt_ratio.csv",
+        int(opt.use_raw_nick_freq),
+        int(opt.theme_bw),
+    ]
     arg = [str(i) for i in arg]
 
     logger = logging.getLogger(__name__)
@@ -141,11 +170,13 @@ def run_stats(opt):
         logger.error("The Bayesian statistics program failed to finish")
         raise e
 
-    logger.debug("Bayesian estimates completed in %f seconds" % (time.time() - start_time,))
+    logger.debug(
+        "Bayesian estimates completed in %f seconds" % (time.time() - start_time,)
+    )
     return 0
 
 
 def bool_to_int(boolean):
-    """ return 0 for False and 1 for True """
+    """return 0 for False and 1 for True"""
     result = 1 if boolean else 0
     return result


=====================================
mapdamage/seq.py
=====================================
@@ -1,113 +1,146 @@
 import sys
-import string
 
 # from Martin Kircher, to complement DNA
-TABLE = str.maketrans('TGCAMRWSYKVHDBtgcamrwsykvhdb', \
-    'ACGTKYWSRMBDHVacgtkywsrmbdhv')
+TABLE = str.maketrans("TGCAMRWSYKVHDBtgcamrwsykvhdb", "ACGTKYWSRMBDHVacgtkywsrmbdhv")
 
 LETTERS = ("A", "C", "G", "T")
-MUTATIONS = ('G>A', 'C>T', 'A>G', 'T>C', 'A>C', 'A>T', 'C>G', 'C>A', 'T>G',
-             'T>A', 'G>C', 'G>T', 'A>-', 'T>-', 'C>-', 'G>-', '->A', '->T',
-             '->C', '->G', 'S')
-HEADER = LETTERS + ("Total", ) + MUTATIONS
+MUTATIONS = (
+    "G>A",
+    "C>T",
+    "A>G",
+    "T>C",
+    "A>C",
+    "A>T",
+    "C>G",
+    "C>A",
+    "T>G",
+    "T>A",
+    "G>C",
+    "G>T",
+    "A>-",
+    "T>-",
+    "C>-",
+    "G>-",
+    "->A",
+    "->T",
+    "->C",
+    "->G",
+    "S",
+)
+HEADER = LETTERS + ("Total",) + MUTATIONS
 
 
 def write_fasta(read, ref, seq, refseq, start, end, before, after, fout):
-  std = '-' if read.is_reverse else '+'
+    std = "-" if read.is_reverse else "+"
 
-  # output coordinate in 1-based offset for 5'
-  if len(before) > 0:
-    fout.write(">%s\n%s\n" % (ref, before))
-  fout.write(">%s:%d-%d\n%s\n>%s_%s\n%s\n" % (ref, start+1, end, refseq, read.qname, std, seq))
-  if len(after) > 0:
-    fout.write(">%s\n%s\n" % (ref, after))
+    # output coordinate in 1-based offset for 5'
+    if len(before) > 0:
+        fout.write(">%s\n%s\n" % (ref, before))
+    fout.write(
+        ">%s:%d-%d\n%s\n>%s_%s\n%s\n"
+        % (ref, start + 1, end, refseq, read.qname, std, seq)
+    )
+    if len(after) > 0:
+        fout.write(">%s\n%s\n" % (ref, after))
 
 
 def revcomp(seq):
-  """ return reverse complemented string """
-  return seq.translate(TABLE)[::-1]
+    """return reverse complemented string"""
+    return seq.translate(TABLE)[::-1]
 
 
 def record_lg(read, coordinate, tab):
-  """ record global length distribution
-  don't record paired reads as they are normally not used for aDNA """
-  std = '-' if read.is_reverse else '+'
+    """record global length distribution
+    don't record paired reads as they are normally not used for aDNA"""
+    std = "-" if read.is_reverse else "+"
 
-  length = (max(coordinate) - min(coordinate))
-  if not read.is_paired:
-    tab[std][length] = tab[std][length] + 1
+    length = max(coordinate) - min(coordinate)
+    if not read.is_paired:
+        tab[std][length] = tab[std][length] + 1
 
-  return tab
+    return tab
 
 
 def read_fasta_index(filename):
-  """ from a fasta index file, fai, return dictionary of references:lengths """
-  def print_err(msg, filename, line):
-    sys.stderr.write("Error: %s\n" % msg)
-    sys.stderr.write("       Filename: %s\n" % filename)
-    sys.stderr.write("       Line:     %s\n" % repr(line))
-
-  fai = {}
-  with open(filename, 'r') as handle:
-    for line in handle:
-      ref = line.split("\t")
-      if len(ref) != 5:
-        print_err("Line in fasta index contains wrong number of fields, found %i, expected 5:" \
-            % len(ref), filename, line)
+    """from a fasta index file, fai, return dictionary of references:lengths"""
+
+    def print_err(msg, filename, line):
+        sys.stderr.write("Error: %s\n" % msg)
+        sys.stderr.write("       Filename: %s\n" % filename)
+        sys.stderr.write("       Line:     %s\n" % repr(line))
+
+    fai = {}
+    with open(filename, "r") as handle:
+        for line in handle:
+            ref = line.split("\t")
+            if len(ref) != 5:
+                print_err(
+                    "Line in fasta index contains wrong number of fields, found %i, expected 5:"
+                    % len(ref),
+                    filename,
+                    line,
+                )
+                return None
+
+            try:
+                fai[ref[0]] = int(ref[1])
+            except ValueError:
+                print_err(
+                    "Column 2 in FASTA index did not contain a number, found '%s':"
+                    % ref[1],
+                    filename,
+                    line,
+                )
+                return None
+
+    if not fai:
+        sys.stderr.write(
+            "Error: Index for %r does contain any sequences.\n" % (filename,)
+        )
+        sys.stderr.write("       Please ensure that FASTA file is valid, and\n")
+        sys.stderr.write("       re-index file using 'samtool faidx'.\n")
         return None
 
-      try:
-        fai[ref[0]] = int(ref[1])
-      except ValueError:
-        print_err("Column 2 in FASTA index did not contain a number, found '%s':" % ref[1], filename, line)
-        return None
-
-  if not fai:
-    sys.stderr.write("Error: Index for %r does contain any sequences.\n" % (filename,))
-    sys.stderr.write("       Please ensure that FASTA file is valid, and\n")
-    sys.stderr.write("       re-index file using 'samtool faidx'.\n")
-    return None
-
-  return fai
+    return fai
 
 
 def compare_sequence_dicts(fasta_dict, bam_dict):
-  """Compares a FASTA and BAM sequence dictionary, and prints any differences.
-  Returns true if all required sequences are found, false otherwise."""
-  if fasta_dict == bam_dict:
-    return True
-
-  sys.stderr.write("Sequence dictionaries in FASTA/BAM files differ:\n")
-  common = set(fasta_dict) & set(bam_dict)
-  if not common:
-    sys.stderr.write("FATAL ERROR: No sequences in common!\n")
-    return False
-
-  # Check that the lengths of required sequences match (fatal error)
-  different = []
-  for key in sorted(common):
-    if fasta_dict[key] != bam_dict[key]:
-      different.append((key, fasta_dict[key], bam_dict[key]))
-
-  if different:
-    sys.stderr.write("FATAL ERROR: Length of required sequences differ:\n")
-    for values in different:
-      sys.stderr.write("    - %s: %i bp vs %i bp\n" % values)
-
-  # Check for sequences only found in the BAM file (fatal errors)
-  bam_only = set(bam_dict) - common
-  if bam_only:
-    sys.stderr.write("FATAL ERROR: Sequences missing from FASTA dictionary:\n")
-    for key in bam_only:
-      sys.stderr.write("    - %s = %i bp\n" % (key, bam_dict[key]))
-
-  # Check for sequences only found in the BAM file (fatal errors)
-  fasta_only = set(fasta_dict) - common
-  if fasta_only:
-    sys.stderr.write("WARNING: FASTA file contains extra sequences:\n")
-    for key in fasta_only:
-      sys.stderr.write("    - %s = %i bp\n" % (key, fasta_dict[key]))
-
-  sys.stderr.write("\n")
-
-  return not (different or bam_only)
+    """Compares a FASTA and BAM sequence dictionary, and prints any differences.
+    Returns true if all required sequences are found, false otherwise."""
+    if fasta_dict == bam_dict:
+        return True
+
+    sys.stderr.write("Sequence dictionaries in FASTA/BAM files differ:\n")
+    common = set(fasta_dict) & set(bam_dict)
+    if not common:
+        sys.stderr.write("FATAL ERROR: No sequences in common!\n")
+        return False
+
+    # Check that the lengths of required sequences match (fatal error)
+    different = []
+    for key in sorted(common):
+        if fasta_dict[key] != bam_dict[key]:
+            different.append((key, fasta_dict[key], bam_dict[key]))
+
+    if different:
+        sys.stderr.write("FATAL ERROR: Length of required sequences differ:\n")
+        for values in different:
+            sys.stderr.write("    - %s: %i bp vs %i bp\n" % values)
+
+    # Check for sequences only found in the BAM file (fatal errors)
+    bam_only = set(bam_dict) - common
+    if bam_only:
+        sys.stderr.write("FATAL ERROR: Sequences missing from FASTA dictionary:\n")
+        for key in bam_only:
+            sys.stderr.write("    - %s = %i bp\n" % (key, bam_dict[key]))
+
+    # Check for sequences only found in the BAM file (fatal errors)
+    fasta_only = set(fasta_dict) - common
+    if fasta_only:
+        sys.stderr.write("WARNING: FASTA file contains extra sequences:\n")
+        for key in fasta_only:
+            sys.stderr.write("    - %s = %i bp\n" % (key, fasta_dict[key]))
+
+    sys.stderr.write("\n")
+
+    return not (different or bam_only)


=====================================
mapdamage/seqtk/README.md
=====================================
@@ -0,0 +1,13 @@
+Introduction
+------------
+
+Seqtk is a fast and lightweight tool for processing sequences in the FASTA or
+FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be
+optionally compressed by gzip.
+
+Written by Heng Li.
+
+This version was reduced to perform only the `comp` function and integrated to mapDamage.
+The original version can be found here  
+https://github.com/lh3/seqtk
+


=====================================
mapdamage/seqtk/kseq.h
=====================================
@@ -0,0 +1,248 @@
+/* The MIT License
+
+   Copyright (c) 2008, 2009, 2011 Attractive Chaos <attractor at live.co.uk>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+
+/* Last Modified: 2017-02-11 */
+
+#ifndef AC_KSEQ_H
+#define AC_KSEQ_H
+
+#include <ctype.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#define KS_SEP_SPACE 0 // isspace(): \t, \n, \v, \f, \r
+#define KS_SEP_TAB   1 // isspace() && !' '
+#define KS_SEP_LINE  2 // line separator: "\n" (Unix) or "\r\n" (Windows)
+#define KS_SEP_MAX   2
+
+#define __KS_TYPE(type_t) \
+	typedef struct __kstream_t { \
+		unsigned char *buf; \
+		int begin, end, is_eof; \
+		type_t f; \
+	} kstream_t;
+
+#define ks_err(ks) ((ks)->end < 0)
+#define ks_eof(ks) ((ks)->is_eof && (ks)->begin >= (ks)->end)
+#define ks_rewind(ks) ((ks)->is_eof = (ks)->begin = (ks)->end = 0)
+
+#define __KS_BASIC(type_t, __bufsize) \
+	static inline kstream_t *ks_init(type_t f) \
+	{ \
+		kstream_t *ks = (kstream_t*)calloc(1, sizeof(kstream_t)); \
+		ks->f = f; \
+		ks->buf = (unsigned char*)malloc(__bufsize); \
+		return ks; \
+	} \
+	static inline void ks_destroy(kstream_t *ks) \
+	{ \
+		if (ks) { \
+			free(ks->buf); \
+			free(ks); \
+		} \
+	}
+
+#define __KS_GETC(__read, __bufsize) \
+	static inline int ks_getc(kstream_t *ks) \
+	{ \
+		if (ks_err(ks)) return -3; \
+		if (ks_eof(ks)) return -1; \
+		if (ks->begin >= ks->end) { \
+			ks->begin = 0; \
+			ks->end = __read(ks->f, ks->buf, __bufsize); \
+			if (ks->end == 0) { ks->is_eof = 1; return -1; } \
+			else if (ks->end < 0) { ks->is_eof = 1; return -3; } \
+		} \
+		return (int)ks->buf[ks->begin++]; \
+	}
+
+#ifndef KSTRING_T
+#define KSTRING_T kstring_t
+typedef struct __kstring_t {
+	size_t l, m;
+	char *s;
+} kstring_t;
+#endif
+
+#ifndef kroundup32
+#define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
+#endif
+
+#ifndef kroundup64
+#define kroundup64(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, (x)|=(x)>>32, ++(x))
+#endif
+
+#define __KS_GETUNTIL(__read, __bufsize) \
+	static int64_t ks_getuntil2(kstream_t *ks, int delimiter, kstring_t *str, int *dret, int append) \
+	{ \
+		int gotany = 0; \
+		if (dret) *dret = 0; \
+		str->l = append? str->l : 0; \
+		for (;;) { \
+			int i; \
+			if (ks_err(ks)) return -3; \
+			if (ks->begin >= ks->end) { \
+				if (!ks->is_eof) { \
+					ks->begin = 0; \
+					ks->end = __read(ks->f, ks->buf, __bufsize); \
+					if (ks->end == 0) { ks->is_eof = 1; break; } \
+					if (ks->end == -1) { ks->is_eof = 1; return -3; } \
+				} else break; \
+			} \
+			if (delimiter == KS_SEP_LINE) { \
+				unsigned char *sep = (unsigned char*)memchr(ks->buf + ks->begin, '\n', ks->end - ks->begin); \
+				i = sep != NULL ? sep - ks->buf : ks->end; \
+			} else if (delimiter > KS_SEP_MAX) { \
+				for (i = ks->begin; i < ks->end; ++i) \
+					if (ks->buf[i] == delimiter) break; \
+			} else if (delimiter == KS_SEP_SPACE) { \
+				for (i = ks->begin; i < ks->end; ++i) \
+					if (isspace(ks->buf[i])) break; \
+			} else if (delimiter == KS_SEP_TAB) { \
+				for (i = ks->begin; i < ks->end; ++i) \
+					if (isspace(ks->buf[i]) && ks->buf[i] != ' ') break; \
+			} else i = 0; /* never come to here! */ \
+			if (str->m - str->l < (size_t)(i - ks->begin + 1)) { \
+				str->m = str->l + (i - ks->begin) + 1; \
+				kroundup64(str->m); \
+				str->s = (char*)realloc(str->s, str->m); \
+			} \
+			gotany = 1; \
+			memcpy(str->s + str->l, ks->buf + ks->begin, i - ks->begin); \
+			str->l = str->l + (i - ks->begin); \
+			ks->begin = i + 1; \
+			if (i < ks->end) { \
+				if (dret) *dret = ks->buf[i]; \
+				break; \
+			} \
+		} \
+		if (!gotany && ks_eof(ks)) return -1; \
+		if (str->s == 0) { \
+			str->m = 1; \
+			str->s = (char*)calloc(1, 1); \
+		} else if (delimiter == KS_SEP_LINE && str->l > 1 && str->s[str->l-1] == '\r') --str->l; \
+		str->s[str->l] = '\0'; \
+		return str->l; \
+	} \
+	static inline int ks_getuntil(kstream_t *ks, int delimiter, kstring_t *str, int *dret) \
+	{ return ks_getuntil2(ks, delimiter, str, dret, 0); }
+
+#define KSTREAM_INIT(type_t, __read, __bufsize) \
+	__KS_TYPE(type_t) \
+	__KS_BASIC(type_t, __bufsize) \
+	__KS_GETC(__read, __bufsize) \
+	__KS_GETUNTIL(__read, __bufsize)
+
+#define kseq_rewind(ks) ((ks)->last_char = (ks)->f->is_eof = (ks)->f->begin = (ks)->f->end = 0)
+
+#define __KSEQ_BASIC(SCOPE, type_t) \
+	SCOPE kseq_t *kseq_init(type_t fd) \
+	{ \
+		kseq_t *s = (kseq_t*)calloc(1, sizeof(kseq_t)); \
+		s->f = ks_init(fd); \
+		return s; \
+	} \
+	SCOPE void kseq_destroy(kseq_t *ks) \
+	{ \
+		if (!ks) return; \
+		free(ks->name.s); free(ks->comment.s); free(ks->seq.s);	free(ks->qual.s); \
+		ks_destroy(ks->f); \
+		free(ks); \
+	}
+
+/* Return value:
+   >=0  length of the sequence (normal)
+   -1   end-of-file
+   -2   truncated quality string
+   -3   error reading stream
+ */
+#define __KSEQ_READ(SCOPE) \
+	SCOPE int64_t kseq_read(kseq_t *seq) \
+	{ \
+		int c,r; \
+		kstream_t *ks = seq->f; \
+		if (seq->last_char == 0) { /* then jump to the next header line */ \
+			while ((c = ks_getc(ks)) >= 0 && c != '>' && c != '@'); \
+			if (c < 0) return c; /* end of file or error*/ \
+			seq->last_char = c; \
+		} /* else: the first header char has been read in the previous call */ \
+		seq->comment.l = seq->seq.l = seq->qual.l = 0; /* reset all members */ \
+		if ((r=ks_getuntil(ks, 0, &seq->name, &c)) < 0) return r;  /* normal exit: EOF or error */ \
+		if (c != '\n') ks_getuntil(ks, KS_SEP_LINE, &seq->comment, 0); /* read FASTA/Q comment */ \
+		if (seq->seq.s == 0) { /* we can do this in the loop below, but that is slower */ \
+			seq->seq.m = 256; \
+			seq->seq.s = (char*)malloc(seq->seq.m); \
+		} \
+		while ((c = ks_getc(ks)) >= 0 && c != '>' && c != '+' && c != '@') { \
+			if (c == '\n') continue; /* skip empty lines */ \
+			seq->seq.s[seq->seq.l++] = c; /* this is safe: we always have enough space for 1 char */ \
+			ks_getuntil2(ks, KS_SEP_LINE, &seq->seq, 0, 1); /* read the rest of the line */ \
+		} \
+		if (c == '>' || c == '@') seq->last_char = c; /* the first header char has been read */ \
+		if (seq->seq.l + 1 >= seq->seq.m) { /* seq->seq.s[seq->seq.l] below may be out of boundary */ \
+			seq->seq.m = seq->seq.l + 2; \
+			kroundup64(seq->seq.m); /* rounded to the next closest 2^k */ \
+			seq->seq.s = (char*)realloc(seq->seq.s, seq->seq.m); \
+		} \
+		seq->seq.s[seq->seq.l] = 0;	/* null terminated string */ \
+		seq->is_fastq = (c == '+'); \
+		if (!seq->is_fastq) return seq->seq.l; /* FASTA */ \
+		if (seq->qual.m < seq->seq.m) {	/* allocate memory for qual in case insufficient */ \
+			seq->qual.m = seq->seq.m; \
+			seq->qual.s = (char*)realloc(seq->qual.s, seq->qual.m); \
+		} \
+		while ((c = ks_getc(ks)) >= 0 && c != '\n'); /* skip the rest of '+' line */ \
+		if (c == -1) return -2; /* error: no quality string */ \
+		while ((c = ks_getuntil2(ks, KS_SEP_LINE, &seq->qual, 0, 1) >= 0 && seq->qual.l < seq->seq.l)); \
+		if (c == -3) return -3; /* stream error */ \
+		seq->last_char = 0;	/* we have not come to the next header line */ \
+		if (seq->seq.l != seq->qual.l) return -2; /* error: qual string is of a different length */ \
+		return seq->seq.l; \
+	}
+
+#define __KSEQ_TYPE(type_t) \
+	typedef struct { \
+		kstring_t name, comment, seq, qual; \
+		int last_char, is_fastq; \
+		kstream_t *f; \
+	} kseq_t;
+
+#define KSEQ_INIT2(SCOPE, type_t, __read) \
+	KSTREAM_INIT(type_t, __read, 16384) \
+	__KSEQ_TYPE(type_t) \
+	__KSEQ_BASIC(SCOPE, type_t) \
+	__KSEQ_READ(SCOPE)
+
+#define KSEQ_INIT(type_t, __read) KSEQ_INIT2(static, type_t, __read)
+
+#define KSEQ_DECLARE(type_t) \
+	__KS_TYPE(type_t) \
+	__KSEQ_TYPE(type_t) \
+	extern kseq_t *kseq_init(type_t fd); \
+	void kseq_destroy(kseq_t *ks); \
+	int64_t kseq_read(kseq_t *seq);
+
+#endif


=====================================
mapdamage/seqtk/seqtk.c
=====================================
@@ -0,0 +1,169 @@
+/* The MIT License
+
+   Copyright (c) 20082-2012 by Heng Li <lh3 at me.com>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+#define PY_SSIZE_T_CLEAN
+#include <Python.h>
+#include <ctype.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <zlib.h>
+
+#include "kseq.h"
+KSEQ_INIT(gzFile, gzread)
+
+int
+mapdamage_set(PyObject *dict, const char *key, PyObject *value)
+{
+    if (value == NULL) {
+        return 1;
+    }
+
+    int result = PyDict_SetItemString(dict, key, value);
+    Py_DECREF(value);
+
+    return result;
+}
+
+int
+mapdamage_set_long(PyObject *dict, const char *key, long value)
+{
+    return mapdamage_set(dict, key, PyLong_FromLong(value));
+}
+
+static PyObject *
+mapdamage_stk_comp(PyObject *self, PyObject *args)
+{
+    (void)self;
+
+    char *filename = NULL;
+    if (PyArg_ParseTuple(args, "s", &filename) == 0) {
+        return NULL;
+    }
+
+    gzFile fp = gzopen(filename, "r");
+    if (fp == NULL) {
+        PyErr_SetFromErrnoWithFilename(PyExc_OSError, filename);
+        return NULL;
+    }
+
+    // Cannot fail; kseq_init will segfault if calloc fails
+    kseq_t *seq = kseq_init(fp);
+
+    long int l = 0;
+    int sig = 1;
+    PyObject *result = PyList_New(0);
+    if (!result) {
+        goto Cleanup;
+    }
+
+    long counts[256];
+    while (!(sig = PyErr_CheckSignals()) && (l = kseq_read(seq)) >= 0) {
+        memset(counts, 0, 256 * sizeof(long));
+        for (long i = 0; i < l; ++i) {
+            counts[(unsigned char)seq->seq.s[i]]++;
+        }
+
+        // If we break it will have been due to an Py* failure
+        sig = 1;
+
+        PyObject *stats = PyDict_New();
+        if (stats == NULL || PyList_Append(result, stats) != 0) {
+            Py_XDECREF(stats);
+            break;
+        }
+
+        PyObject *name = PyUnicode_FromString(seq->name.s);
+        if (mapdamage_set(stats, "name", name) != 0 ||
+            mapdamage_set_long(stats, "len", l) != 0 ||
+            mapdamage_set_long(stats, "A", counts['A'] + counts['a']) != 0 ||
+            mapdamage_set_long(stats, "C", counts['C'] + counts['c']) != 0 ||
+            mapdamage_set_long(stats, "G", counts['G'] + counts['g']) != 0 ||
+            mapdamage_set_long(stats, "T", counts['T'] + counts['t']) != 0) {
+            Py_DECREF(stats);
+            break;
+        }
+    }
+
+Cleanup:
+    kseq_destroy(seq);
+    const int gzerr = gzclose(fp);
+
+    if (gzerr != Z_OK || sig || l != -1) {
+        Py_XDECREF(result);
+
+        if (gzerr != Z_OK) {
+            switch (gzerr) {
+                case Z_STREAM_ERROR:
+                    PyErr_SetString(PyExc_RuntimeError,
+                                    "Stream error while reading gzip file");
+                    break;
+                case Z_ERRNO:
+                    PyErr_SetFromErrnoWithFilename(PyExc_OSError, filename);
+                    break;
+                case Z_BUF_ERROR:
+                    PyErr_SetString(PyExc_RuntimeError,
+                                    "Buffer error while reading gzip file");
+                    break;
+                default:
+                    PyErr_SetString(PyExc_RuntimeError,
+                                    "Unexpected zlib error");
+            }
+        }
+        else if (l == -3) {
+            PyErr_SetFromErrnoWithFilename(PyExc_OSError, filename);
+        }
+        else if (l == -2) {
+            PyErr_SetString(PyExc_RuntimeError, "Malformed file");
+        }
+
+        return NULL;
+    }
+
+    return result;
+}
+
+static PyMethodDef SeqtkMethods[] = {
+    {"comp", mapdamage_stk_comp, METH_VARARGS,
+     "Wrapper around seqtk stk_comp function"},
+    {NULL, NULL, 0, NULL},
+};
+
+static struct PyModuleDef seqtkmodule = {
+    .m_base = PyModuleDef_HEAD_INIT,
+    .m_name = "seqtk",
+    .m_doc = "Python interface to seqtk functions",
+    .m_size = 0,
+    .m_methods = SeqtkMethods,
+    .m_slots = NULL,
+    .m_traverse = NULL,
+    .m_clear = NULL,
+    .m_free = NULL,
+};
+
+PyMODINIT_FUNC
+PyInit_seqtk(void)
+{
+    return PyModule_Create(&seqtkmodule);
+}


=====================================
mapdamage/tables.py
=====================================
@@ -1,114 +1,135 @@
-import os
 import collections
+import os
 
 import mapdamage
 from mapdamage.version import __version__
 
 
 def initialize_mut(ref, length):
-  tab = {}
-  for contig in ref:
-    tab_contig = tab[contig] = {}
-    for end in ('5p','3p'):
-      tab_end = tab_contig[end] = {}
-      for std in ('+','-'):
-        tab_std = tab_end[std] = {}
-        for mut in mapdamage.seq.HEADER:
-          tab_std[mut] = dict.fromkeys(range(length), 0)
+    tab = {}
+    for contig in ref:
+        tab_contig = tab[contig] = {}
+        for end in ("5p", "3p"):
+            tab_end = tab_contig[end] = {}
+            for std in ("+", "-"):
+                tab_std = tab_end[std] = {}
+                for mut in mapdamage.seq.HEADER:
+                    tab_std[mut] = dict.fromkeys(range(length), 0)
 
-  return tab
+    return tab
 
 
 def print_mut(mut, opt, out):
-  _print_freq_table(mut, mapdamage.seq.HEADER, opt, out, offset = 1)
+    _print_freq_table(mut, mapdamage.seq.HEADER, opt, out, offset=1)
 
 
 def initialize_comp(ref, around, length):
-  keys = {"3p" : list(range(-length, 0)) + list(range(1, around + 1)),
-          "5p" : list(range(-around, 0)) + list(range(1, length + 1))}
+    keys = {
+        "3p": list(range(-length, 0)) + list(range(1, around + 1)),
+        "5p": list(range(-around, 0)) + list(range(1, length + 1)),
+    }
 
-  tab = {}
-  for contig in ref:
-    tab_contig = tab[contig] = {}
-    for end in ('5p','3p'):
-      tab_end = tab_contig[end] = {}
-      for std in ('+','-'):
-        tab_std = tab_end[std] = {}
-        for letters in mapdamage.seq.LETTERS:
-          tab_std[letters] = dict.fromkeys(keys[end], 0)
+    tab = {}
+    for contig in ref:
+        tab_contig = tab[contig] = {}
+        for end in ("5p", "3p"):
+            tab_end = tab_contig[end] = {}
+            for std in ("+", "-"):
+                tab_std = tab_end[std] = {}
+                for letters in mapdamage.seq.LETTERS:
+                    tab_std[letters] = dict.fromkeys(keys[end], 0)
 
-  return tab
+    return tab
 
 
 def print_comp(comp, opt, out):
-  columns = mapdamage.seq.LETTERS + ("Total",)
-  _print_freq_table(comp, columns, opt, out)
+    columns = mapdamage.seq.LETTERS + ("Total",)
+    _print_freq_table(comp, columns, opt, out)
 
 
 def initialize_lg():
-  tab = {}
-  for std in ('+','-'):
-    tab[std] = collections.defaultdict(int)
+    tab = {}
+    for std in ("+", "-"):
+        tab[std] = collections.defaultdict(int)
 
-  return tab
+    return tab
 
 
 def print_lg(tab, opt, out):
-  out.write("# table produced by mapDamage version %s\n" % __version__)
-  out.write("# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref))
-  if opt.minqual != 0:
-    out.write("# Quality filtering of bases with a Phred score < %d\n" % opt.minqual)
-  out.write("# Std: strand of reads\n")
-  out.write("Std\tLength\tOccurences \n")
-  for std in tab:
-    for i in tab[std]:
-      out.write("%s\t%d\t%d\n" % (std, i, tab[std][i]))
+    out.write("# table produced by mapDamage version %s\n" % __version__)
+    out.write(
+        "# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref)
+    )
+    if opt.minqual != 0:
+        out.write(
+            "# Quality filtering of bases with a Phred score < %d\n" % opt.minqual
+        )
+    out.write("# Std: strand of reads\n")
+    out.write("Std\tLength\tOccurences \n")
+    for std in tab:
+        for i in tab[std]:
+            out.write("%s\t%d\t%d\n" % (std, i, tab[std][i]))
 
 
 def check_table_and_warn_if_dmg_freq_is_low(folder):
-  """ Returns true if the damage frequencies are too low to allow
-  Bayesian estimation of DNA damages, i.e < 1% at first position.
-  """
-  total = 0.0
-  for filename in ("5pCtoT_freq.txt", "3pGtoA_freq.txt"):
-    if not os.path.exists(folder+"/"+filename):
-      print(("Error: Required table has not been created ('%s'), bayesian computation cannot be performed" \
-            % filename))
-      return True
-
-    with open(os.path.join(folder, filename)) as handle:
-      for line in handle:
-        freq = line.strip().split('\t')
-        if freq[0] == "1":
-          total += float(freq[1])
-          break
-      else:
-        print(("Error: Could not find pos = 1 in table '%s', bayesian computation cannot be performed" \
-              % filename))
-        return True
-
-  if total < 0.01:
-    print(("Warning: DNA damage levels are too low, the Bayesian computation should not be performed (%f < 0.01)\n" % total))
-
-  return False
-
-
-def _print_freq_table(table, columns, opt, out, offset = 0):
-  out.write("# table produced by mapDamage version %s\n" % __version__)
-  out.write("# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref))
-  if opt.minqual != 0:
-    out.write("# Bases with a Phred score < %d were filtered out\n" % opt.minqual)
-  out.write("# Chr: reference from sam/bam header, End: from which termini of DNA sequences, Std: strand of reads\n")
-  out.write("Chr\tEnd\tStd\tPos\t%s\n" % ("\t".join(columns)))
-
-  for (reference, ends) in sorted(table.items()):
-    for (end, strands) in sorted(ends.items()):
-      for (strand, subtable) in sorted(strands.items()):
-        subtable["Total"] = {}
-        for index in sorted(subtable[columns[0]]):
-          subtable["Total"][index] = sum(subtable[letter][index] for letter in mapdamage.seq.LETTERS)
-
-          out.write("%s\t%s\t%s\t%d" % (reference, end, strand, index + offset))
-          for base in columns:
-            out.write("\t%d" % subtable[base][index])
-          out.write("\n")
+    """Returns true if the damage frequencies are too low to allow
+    Bayesian estimation of DNA damages, i.e < 1% at first position.
+    """
+    total = 0.0
+    for filename in ("5pCtoT_freq.txt", "3pGtoA_freq.txt"):
+        if not os.path.exists(folder + "/" + filename):
+            print(
+                "Error: Required table has not been created ('%s'), bayesian computation cannot be performed"
+                % filename
+            )
+            return True
+
+        with open(os.path.join(folder, filename)) as handle:
+            for line in handle:
+                freq = line.strip().split("\t")
+                if freq[0] == "1":
+                    total += float(freq[1])
+                    break
+            else:
+                print(
+                    "Error: Could not find pos = 1 in table '%s', bayesian computation cannot be performed"
+                    % filename
+                )
+                return True
+
+    if total < 0.01:
+        print(
+            "Warning: DNA damage levels are too low, the Bayesian computation should not be performed (%f < 0.01)\n"
+            % total
+        )
+
+    return False
+
+
+def _print_freq_table(table, columns, opt, out, offset=0):
+    out.write("# table produced by mapDamage version %s\n" % __version__)
+    out.write(
+        "# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref)
+    )
+    if opt.minqual != 0:
+        out.write("# Bases with a Phred score < %d were filtered out\n" % opt.minqual)
+    out.write(
+        "# Chr: reference from sam/bam header, End: from which termini of DNA sequences, Std: strand of reads\n"
+    )
+    out.write("Chr\tEnd\tStd\tPos\t%s\n" % ("\t".join(columns)))
+
+    for reference, ends in sorted(table.items()):
+        for end, strands in sorted(ends.items()):
+            for strand, subtable in sorted(strands.items()):
+                subtable["Total"] = {}
+                for index in sorted(subtable[columns[0]]):
+                    subtable["Total"][index] = sum(
+                        subtable[letter][index] for letter in mapdamage.seq.LETTERS
+                    )
+
+                    out.write(
+                        "%s\t%s\t%s\t%d" % (reference, end, strand, index + offset)
+                    )
+                    for base in columns:
+                        out.write("\t%d" % subtable[base][index])
+                    out.write("\n")


=====================================
mapdamage/tests/fake1.fasta.fai
=====================================
@@ -0,0 +1 @@
+fake1	201	7	60	61


=====================================
mapdamage/version.py
=====================================
@@ -1,5 +1 @@
-#!/usr/bin/env python3
-try:
-    from mapdamage._version import __version__
-except ImportError:
-    __version__ = "2.2.2"
+__version__ = "2.2.3"


=====================================
pyproject.toml
=====================================
@@ -0,0 +1,49 @@
+[build-system]
+requires = ["setuptools>=42", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "mapdamage"
+dynamic = ["version"]
+
+authors = [
+    { name = "Aurélien Ginolhac" },
+    { name = "Mikkel Schubert" },
+    { name = "Hákon Jónsson" },
+]
+# To be replaced with `license = "MIT"` when support for py <= 3.9 is dropped
+license = { "text" = "MIT" }
+description = "mapDamage tracks and quantifies DNA damage patterns in ancient DNA sequencing reads generated by Next-Generation Sequencing platforms"
+readme = { file = "README.md", content-type = "text/markdown" }
+
+requires-python = ">=3.7"
+dependencies = ["pysam"]
+
+classifiers = [
+    "Development Status :: 5 - Production/Stable",
+    "Intended Audience :: Science/Research",
+    "Topic :: Scientific/Engineering :: Bio-Informatics",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.7",
+    "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+]
+
+[project.urls]
+Homepage = "https://github.com/ginolhac/mapDamage"
+Documentation = "https://ginolhac.github.io/mapDamage/"
+Repository = "https://github.com/ginolhac/mapDamage.git"
+Issues = "https://github.com/ginolhac/mapDamage/issues"
+
+[project.scripts]
+mapDamage = "mapdamage.main:main"
+
+[tool.setuptools.package-data]
+mapdamage = ["Rscripts/*.R", "Rscripts/stats/*.R", "tests/**"]
+
+[tool.setuptools.dynamic]
+version = { attr = "mapdamage.version.__version__" }


=====================================
setup.py
=====================================
@@ -1,73 +1,12 @@
 #!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-
-from distutils.core import setup
-from distutils.command.install import install as DistutilsInstall
-import os
-import sys
-import subprocess
-
-if sys.version_info < (3, 5):
-    print("At least Python 3.5 is required.\n", file=sys.stderr)
-    exit(1)
-
-def compile_seqtk():
-    """Compiling the seqtk toolkit"""
-    old_wd = os.getcwd()
-    new_wd = os.path.join(old_wd,"mapdamage","seqtk")
-    os.chdir(new_wd)
-    if not os.path.isfile("seqtk.c"):
-        raise SystemExit("Cannot find seqtk.c")
-    if (os.path.isfile("seqtk")):
-        os.system("rm seqtk")
-    xs = os.system("make -f Makefile")
-    os.chdir(old_wd)
-    if (xs != 0):
-        raise SystemExit("Cannot compile seqtk")
-
-
-def setup_version():
-    if not os.path.exists(".git"):
-        # Release version, no .git folder
-        return
-
-    try:
-        version = subprocess.check_output(("git", "describe", "--always", "--tags", "--dirty"))
-        with open(os.path.join("mapdamage", "_version.py"), "w") as handle:
-            handle.write("#!/usr/bin/env python\n")
-            handle.write("__version__ = %r\n" % (version.decode("utf-8").strip(),))
-    except (subprocess.CalledProcessError, OSError) as error:
-        raise SystemExit("Could not determine mapDamage version: %s" % (error,))
-
-
-class compileInstall(DistutilsInstall):
-    # extension of the class to account for an extra compiling step
-    def run(self):
-        self.record=""
-        setup_version()
-        compile_seqtk()
-        DistutilsInstall.run(self)
-        # fixing the permission problem of seqtk
-        files = self.get_outputs()
-        for fi in files:
-            if fi.endswith("seqtk/seqtk"):
-                os.chmod(fi,0o755)
-
-
-
-
+from setuptools import Extension, setup
 
 setup(
-    cmdclass={'install': compileInstall},
-    name='mapdamage',
-    version='2.2.2',
-    author='Aurélien Ginolhac, Mikkel Schubert, Hákon Jónsson',
-    author_email='MSchubert at snm.ku.dk, jonsson.hakon at gmail.com',
-    packages=['mapdamage'],
-    package_data={'mapdamage': ['Rscripts/*.R','Rscripts/stats/*.R','tests/*','seqtk/seqtk']},
-    scripts=['bin/mapDamage'],
-    url='https://github.com/ginolhac/mapDamage',
-    license='LICENSE.txt',
-    description='mapDamage tracks and quantify DNA damage pattern among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms',
-    long_description=open('README.md').read()
+    ext_modules=[
+        Extension(
+            "mapdamage.seqtk",
+            sources=["mapdamage/seqtk/seqtk.c"],
+            libraries=["z"],
+        )
+    ],
 )



View it on GitLab: https://salsa.debian.org/med-team/mapdamage/-/commit/73e4fcdb81df37d20ac00bdf4fa63b25f68b8fba

-- 
View it on GitLab: https://salsa.debian.org/med-team/mapdamage/-/commit/73e4fcdb81df37d20ac00bdf4fa63b25f68b8fba
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20251020/273efeeb/attachment-0001.htm>


More information about the debian-med-commit mailing list