[med-svn] [Git][med-team/mapdamage][upstream] New upstream version 2.2.3+dfsg
Étienne Mollier (@emollier)
gitlab at salsa.debian.org
Mon Oct 20 22:36:36 BST 2025
Étienne Mollier pushed to branch upstream at Debian Med / mapdamage
Commits:
73e4fcdb by Étienne Mollier at 2025-10-20T22:56:36+02:00
New upstream version 2.2.3+dfsg
- - - - -
22 changed files:
- − MANIFEST
- MANIFEST.in
- README.md
- mapdamage/Rscripts/stats/function.R
- mapdamage/__init__.py
- + mapdamage/__main__.py
- mapdamage/align.py
- mapdamage/composition.py
- bin/mapDamage → mapdamage/main.py
- mapdamage/mp_test.py
- mapdamage/parseoptions.py
- mapdamage/rescale.py
- mapdamage/rscript.py
- mapdamage/seq.py
- + mapdamage/seqtk/README.md
- + mapdamage/seqtk/kseq.h
- + mapdamage/seqtk/seqtk.c
- mapdamage/tables.py
- + mapdamage/tests/fake1.fasta.fai
- mapdamage/version.py
- + pyproject.toml
- setup.py
Changes:
=====================================
MANIFEST deleted
=====================================
@@ -1,13 +0,0 @@
-# file GENERATED by distutils, do NOT edit
-README.txt
-setup.py
-bin/mapDamage
-mapdamage/__init__.py
-mapdamage/align.py
-mapdamage/composition.py
-mapdamage/parseoptions.py
-mapdamage/rscript.py
-mapdamage/seq.py
-mapdamage/tables.py
-mapdamage/version.py
-mapdamage/Rscripts/mapDamage.R
=====================================
MANIFEST.in
=====================================
@@ -1,2 +1,5 @@
include mapdamage/Rscripts/mapDamage.R
include mapdamage/Rscripts/stats/*R
+include mapdamage/seqtk/seqtk.c
+include mapdamage/seqtk/kseq.h
+include mapdamage/seqtk/README.md
=====================================
README.md
=====================================
@@ -1,30 +1,18 @@
## mapDamage
-[](http://bioconda.github.io/recipes/mapdamage2/README.html) [](https://anaconda.org/bioconda/mapdamage2/files)
+[](http://bioconda.github.io/recipes/mapdamage2/README.html) [](https://anaconda.org/bioconda/mapdamage2/files)
  [](https://www.repostatus.org/#inactive)
-#### `bioconda` installation
+### Installation
-* python3 version **2.2.1**
+mapDamage can be installed via pip:
-```
-conda install -c bioconda mapdamage2=2.2.1
+```bash
+pip install mapdamage
```
-* python3 version **2.2.1** **with** R and 4 mandatory packages for the Bayesian inference:
-
-```
-conda install -c bioconda mapdamage2=2.2.1=pyr36_1
-```
-
----
-
-### Important
-
-* From version `2.2.1` the `master` branch is requiring **python3** as `python2` is not supported from 2020-01-01.
-
-* Users with versions dating prior to June 12 2013 please update. A nasty bug that caused the statistical part of `mapDamage` to use half of the data for estimation of the damage parameters, sorry for the inconvenience.
+R and several packages are *post mortem* damage plotting, modeling, and rescaling. Refers to the detailed instructions on the [dedicated page](http://ginolhac.github.io/mapDamage/).
### Introduction
@@ -33,35 +21,35 @@ Complete documentation, instructions, examples, screenshots and FAQ are availabl
[mapDamage2](https://geogenetics.ku.dk/publications/mapdamage2.0/) is a computational framework written in **Python3** and **R**, which tracks and quantifies DNA damage patterns
among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms.
-`mapDamage` was developed at the [Centre for GeoGenetics](https://geogenetics.ku.dk/) by the [Orlando Group ](https://geogenetics.ku.dk/research_groups/palaeomix_group/).
+`mapDamage` was developed at the [Centre for GeoGenetics](https://geogenetics.ku.dk/) by the [Orlando Group](https://geogenetics.ku.dk/research_groups/palaeomix_group/).
### Citation
-If you use this program, please cite the following publication:
+If you use this program, please cite the following publication:
Jónsson H, Ginolhac A, Schubert M, Johnson P, Orlando L.
[mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters.
-_Bioinformatics_ 23rd April 2013. doi: 10.1093/bioinformatics/btt193](http://bioinformatics.oxfordjournals.org/content/early/2013/05/17/bioinformatics.btt193)
+*Bioinformatics* 23rd April 2013. doi: 10.1093/bioinformatics/btt193](http://bioinformatics.oxfordjournals.org/content/early/2013/05/17/bioinformatics.btt193)
-The original `mapDamage1` was published in the following article:
+The original `mapDamage1` was published in the following article:
Ginolhac A, Rasmussen M, Gilbert MT, Willerslev E, Orlando L.
-[mapDamage: testing for damage patterns in ancient DNA sequences. _Bioinformatics_ 2011 **27**(15):2153-5
+[mapDamage: testing for damage patterns in ancient DNA sequences. *Bioinformatics* 2011 **27**(15):2153-5
http://bioinformatics.oxfordjournals.org/content/27/15/2153](http://bioinformatics.oxfordjournals.org/content/27/15/2153)
### Test the no-stats part and rescaling
you can test `mapDamage` by running:
-```
+```bash
cd mapDamage/mapdamage/
python3 mp_test.py
```
should return
-```
+```
Started with the command: /usr/local/bin/mapDamage -i tests/test.bam -r tests/fake1.fasta -d tests/results --no-stats
- Reading from 'tests/test.bam'
- Writing results to 'tests/results/'
+ Reading from 'tests/test.bam'
+ Writing results to 'tests/results/'
pdf tests/results/Fragmisincorporation_plot.pdf generated
additional tests/results/Length_plot.pdf generated
Successful run
@@ -74,5 +62,5 @@ OK
### Contact
-Please report bugs and suggest possible improvements to Aurélien Ginolhac, Mikkel Schubert or Hákon Jónsson by email:
-ginolhac at gmail.com, mikkelsch at gmail.com or jonsson.hakon at gmail.com.
+Please report bugs and suggest possible improvements on GitHub:
+<https://github.com/ginolhac/mapDamage/issues/new>
=====================================
mapdamage/Rscripts/stats/function.R
=====================================
@@ -10,7 +10,7 @@ getPmat <- function(tmu,tv_ti_ratio,acgt){
stop()
}
if (tv_ti_ratio<=0){
- write("The transversion and transtition ratio cannot go under 0",stderr())
+ write("The transversion and transition ratio cannot go under 0",stderr())
stop()
}
#Returns the substitution probability matrix.
=====================================
mapdamage/__init__.py
=====================================
@@ -1,10 +1,10 @@
#!/usr/bin/env python3
-import mapdamage.parseoptions
-import mapdamage.seq
import mapdamage.align
-import mapdamage.tables
import mapdamage.composition
-import mapdamage.rscript
+import mapdamage.parseoptions
import mapdamage.rescale
+import mapdamage.rscript
+import mapdamage.seq
+import mapdamage.tables
import mapdamage.version
=====================================
mapdamage/__main__.py
=====================================
@@ -0,0 +1,3 @@
+import mapdamage.main
+
+mapdamage.main.main()
=====================================
mapdamage/align.py
=====================================
@@ -1,127 +1,127 @@
-import itertools
import mapdamage
# from Martin Kircher, description of CIGAR operations
-#O BAM Description
-#M 0 alignment match (can be a sequence match or mismatch)
-#I 1 insertion to the reference
-#D 2 deletion from the reference
-#N 3 skipped region from the reference
-#S 4 soft clipping (clipped sequences present in SEQ)
-#H 5 hard clipping (clipped sequences NOT present in SEQ)
-#P 6 padding (silent deletion from padded reference)
-#= 7 sequence match
-#X 8 sequence mismatch
+# O BAM Description
+# M 0 alignment match (can be a sequence match or mismatch)
+# I 1 insertion to the reference
+# D 2 deletion from the reference
+# N 3 skipped region from the reference
+# S 4 soft clipping (clipped sequences present in SEQ)
+# H 5 hard clipping (clipped sequences NOT present in SEQ)
+# P 6 padding (silent deletion from padded reference)
+# = 7 sequence match
+# X 8 sequence mismatch
def get_coordinates(read):
- """ return external coordinates of aligned read bases """
- fivep = read.aend if read.is_reverse else read.pos
- threep = read.pos if read.is_reverse else read.aend
+ """return external coordinates of aligned read bases"""
+ fivep = read.aend if read.is_reverse else read.pos
+ threep = read.pos if read.is_reverse else read.aend
- return fivep, threep
+ return fivep, threep
def get_around(coord, chrom, reflengths, length, ref):
- """ return reference sequences before and after the read
- check for extremities and return what is available """
- coord_min = min(coord)
- coord_max = max(coord)
+ """return reference sequences before and after the read
+ check for extremities and return what is available"""
+ coord_min = min(coord)
+ coord_max = max(coord)
- pos_before = max(0, coord_min - length)
- pos_after = min(reflengths[chrom], coord_max + length)
+ pos_before = max(0, coord_min - length)
+ pos_after = min(reflengths[chrom], coord_max + length)
- # Uppercased, to be sure that we don't compare A,C,G,T and a,c,g,t
- before = ref.fetch(chrom, pos_before, coord_min).upper()
- after = ref.fetch(chrom, coord_max, pos_after).upper()
+ # Uppercased, to be sure that we don't compare A,C,G,T and a,c,g,t
+ before = ref.fetch(chrom, pos_before, coord_min).upper()
+ after = ref.fetch(chrom, coord_max, pos_after).upper()
- return before, after
+ return before, after
def align(cigarlist, seq, ref):
- """ insert gaps according to the cigar string
- deletion: gaps to be inserted into read sequences,
- insertions: gaps to be inserted into reference sequence """
- lref = list(ref)
- for nbr, idx in parse_cigar(cigarlist, 1):
- lref[idx:idx] = ["-"] * nbr
+ """insert gaps according to the cigar string
+ deletion: gaps to be inserted into read sequences,
+ insertions: gaps to be inserted into reference sequence"""
+ lref = list(ref)
+ for nbr, idx in parse_cigar(cigarlist, 1):
+ lref[idx:idx] = ["-"] * nbr
- lread = list(seq)
- for nbr, idx in parse_cigar(cigarlist, 2):
- lread[idx:idx] = ["-"] * nbr
+ lread = list(seq)
+ for nbr, idx in parse_cigar(cigarlist, 2):
+ lread[idx:idx] = ["-"] * nbr
- return "".join(lread), "".join(lref)
+ return "".join(lread), "".join(lref)
def align_with_qual(cigarlist, seq, qual, threshold, ref):
- """ insert gaps according to the cigar string
- deletion: gaps to be inserted into read sequences and qualities,
- insertions: gaps to be inserted into reference sequence """
- lref = list(ref)
- for nbr, idx in parse_cigar(cigarlist, 1):
- lref[idx:idx] = ["-"] * nbr
+ """insert gaps according to the cigar string
+ deletion: gaps to be inserted into read sequences and qualities,
+ insertions: gaps to be inserted into reference sequence"""
+ lref = list(ref)
+ for nbr, idx in parse_cigar(cigarlist, 1):
+ lref[idx:idx] = ["-"] * nbr
- lread = list(seq)
- lqual = list(qual)
- for nbr, idx in parse_cigar(cigarlist, 2):
- lread[idx:idx] = ["-"] * nbr
- lqual[idx:idx] = ["-"] * nbr
+ lread = list(seq)
+ lqual = list(qual)
+ for nbr, idx in parse_cigar(cigarlist, 2):
+ lread[idx:idx] = ["-"] * nbr
+ lqual[idx:idx] = ["-"] * nbr
- for idx, score in enumerate(lqual):
- # mask read and reference nucleotides (so not gaps) if below desired threshold
- if (ord(score)-33) < threshold and lread[idx] != "-":
- lread[idx] = "N"
- lref[idx] = "N"
+ for idx, score in enumerate(lqual):
+ # mask read and reference nucleotides (so not gaps) if below desired threshold
+ if (ord(score) - 33) < threshold and lread[idx] != "-":
+ lread[idx] = "N"
+ lref[idx] = "N"
- return "".join(lread), "".join(lqual), "".join(lref)
+ return "".join(lread), "".join(lqual), "".join(lref)
def get_mis(read, seq, refseq, ref, length, tab, end):
- """ count mismatches using aligned reference and read,
- must be redone since N in reference were randomly replaced by any bases """
- std = '-' if read.is_reverse else '+'
- subtable = tab[ref][end][std]
+ """count mismatches using aligned reference and read,
+ must be redone since N in reference were randomly replaced by any bases"""
+ std = "-" if read.is_reverse else "+"
+ subtable = tab[ref][end][std]
- for (i, nt_seq, nt_ref) in zip(range(length), seq, refseq):
- if (nt_seq in "ACGT-") and (nt_ref in "ACGT-"):
- if nt_ref != "-":
- # record base composition in the reference, only A, C, G, T
- subtable[nt_ref][i] += 1
+ for i, nt_seq, nt_ref in zip(range(length), seq, refseq):
+ if (nt_seq in "ACGT-") and (nt_ref in "ACGT-"):
+ if nt_ref != "-":
+ # record base composition in the reference, only A, C, G, T
+ subtable[nt_ref][i] += 1
- # Most ref/seq pairs will be identical
- if (nt_ref != nt_seq):
- mut = "%s>%s" % (nt_ref, nt_seq)
- subtable[mut][i] += 1
+ # Most ref/seq pairs will be identical
+ if nt_ref != nt_seq:
+ mut = "%s>%s" % (nt_ref, nt_seq)
+ subtable[mut][i] += 1
def parse_cigar(cigarlist, ope):
- """ for a specific operation (mismach, match, insertion, deletion... see above)
- return occurences and index in the alignment """
- tlength = 0
- coordinate = []
- # count matches, indels and mismatches
- oplist = (0, 1, 2, 7, 8)
- for operation, length in cigarlist:
- if operation == ope:
- coordinate.append([length, tlength])
- if operation in oplist:
- tlength += length
- return coordinate
+ """for a specific operation (mismatch, match, insertion, deletion... see above)
+ return occurrences and index in the alignment"""
+ tlength = 0
+ coordinate = []
+ # count matches, indels and mismatches
+ oplist = (0, 1, 2, 7, 8)
+ for operation, length in cigarlist:
+ if operation == ope:
+ coordinate.append([length, tlength])
+ if operation in oplist:
+ tlength += length
+ return coordinate
def record_soft_clipping(read, tab, length):
- """ record soft clipped bases at extremities """
- def update_table(end, std, bases):
- for i in range(0, min(bases, length)):
- tab[end][std]['S'][i] += 1
-
- strand = '-' if read.is_reverse else '+'
- for (nbases, idx) in mapdamage.align.parse_cigar(read.cigar, 4):
- if idx == 0:
- # Soft-clipping at the left side of the alignment
- end = '3p' if read.is_reverse else '5p'
- else:
- # Soft-clipping at the right side of the alignment
- end = '5p' if read.is_reverse else '3p'
-
- update_table(end, strand, nbases)
+ """record soft clipped bases at extremities"""
+
+ def update_table(end, std, bases):
+ for i in range(0, min(bases, length)):
+ tab[end][std]["S"][i] += 1
+
+ strand = "-" if read.is_reverse else "+"
+ for nbases, idx in mapdamage.align.parse_cigar(read.cigar, 4):
+ if idx == 0:
+ # Soft-clipping at the left side of the alignment
+ end = "3p" if read.is_reverse else "5p"
+ else:
+ # Soft-clipping at the right side of the alignment
+ end = "5p" if read.is_reverse else "3p"
+
+ update_table(end, strand, nbases)
=====================================
mapdamage/composition.py
=====================================
@@ -1,82 +1,60 @@
-import mapdamage
-import itertools
import csv
-import subprocess
-import sys
+
+import mapdamage
+import mapdamage.seqtk
+
def count_ref_comp(read, chrom, before, after, comp):
- """ record basae composition in external genomic regions """
- std = '-' if read.is_reverse else '+'
+ """record basae composition in external genomic regions"""
+ std = "-" if read.is_reverse else "+"
- _update_table(comp[chrom]['5p'][std], before, range(-len(before), 0))
- _update_table(comp[chrom]['3p'][std], after, range(1, len(after) + 1))
+ _update_table(comp[chrom]["5p"][std], before, range(-len(before), 0))
+ _update_table(comp[chrom]["3p"][std], after, range(1, len(after) + 1))
def count_read_comp(read, chrom, length, comp):
- """ record base composition of read, discard marked nucleotides """
- std, seq = '+', read.query
- if read.is_reverse:
- std, seq = '-', mapdamage.seq.revcomp(seq)
+ """record base composition of read, discard marked nucleotides"""
+ std, seq = "+", read.query
+ if read.is_reverse:
+ std, seq = "-", mapdamage.seq.revcomp(seq)
- _update_table(comp[chrom]['5p'][std], seq, range(1, length + 1))
- _update_table(comp[chrom]['3p'][std], reversed(seq), range(-1, - length - 1, -1))
+ _update_table(comp[chrom]["5p"][std], seq, range(1, length + 1))
+ _update_table(comp[chrom]["3p"][std], reversed(seq), range(-1, -length - 1, -1))
def _update_table(table, sequence, indices):
- for (index, nt) in zip(indices, sequence):
- if nt in table:
- table[nt][index] += 1
+ for index, nt in zip(indices, sequence):
+ if nt in table:
+ table[nt][index] += 1
-def get_base_comp(filename,destination=False):
- """
- Gets the basecomposition of all the sequences in filename
- and returns the value to destination if given.
+def write_base_comp(fasta, destination):
+ """Calculates the total base composition across all sequences in 'fasta'
+ and writes them to 'destination' as CSV.
"""
- path_to_seqtk = mapdamage.rscript.construct_path("seqtk",folder="seqtk")
- bases = {"A":0,"C":0,"G":0,"T":0}
- alp = ["A","C","G","T"]
- try:
- proc = subprocess.Popen([path_to_seqtk,"comp",filename],stdout=subprocess.PIPE)
- out = proc.communicate()[0]
- for li in out.splitlines():
- tabs = li.split() # 1 is the ref, 2 is the total and then the base counts A, C, G and T.
- bases["A"] = bases["A"] + int(tabs[2])
- bases["C"] = bases["C"] + int(tabs[3])
- bases["G"] = bases["G"] + int(tabs[4])
- bases["T"] = bases["T"] + int(tabs[5])
- except (OSError, ValueError) as error:
- sys.stderr.write("Error: Seqtk failed: %s\n" % (error,))
- sys.exit(1)
- # get the base frequencies
+ bases = {"A": 0, "C": 0, "G": 0, "T": 0}
+ for stats in mapdamage.seqtk.comp(str(fasta)):
+ for key in bases:
+ bases[key] += stats[key]
+
+ # calculate the base frequencies
ba_su = sum(bases.values())
- for ba in alp:
- bases[ba] = float(bases[ba])/float(ba_su)
- if (destination==False):
- return bases
- else:
- # write the results
- fo = open(destination,"w")
- vals = [str(bases[i]) for i in alp]
- fo.write(",".join(alp)+"\n")
- fo.write(",".join(vals)+"\n")
- fo.close()
+ for key in bases:
+ bases[key] = bases[key] / ba_su
+
+ with open(destination, "wt", newline="") as handle:
+ writer = csv.writer(handle)
+
+ header = ["A", "C", "G", "T"]
+ writer.writerow(header)
+ writer.writerow(bases[key] for key in header)
+
def read_base_comp(filename):
- """
- Read the base compition from a file created by get_base_comp
- """
- fh = open(filename)
- first_line = True
- for li in fh:
- li = li.rstrip()
- lp = li.split()
- if first_line:
- header = lp
- first_line = False
- else:
- body = lp
- bases = {}
- for ba,per in zip(header,body):
- bases[ba] = per
- return bases
+ """Read the base compition from a file created by write_base_comp"""
+ with open(filename, newline="") as csvfile:
+ reader = csv.DictReader(csvfile)
+ for row in reader:
+ return row
+
+ raise csv.Error("No rows found in %r" % (filename,))
=====================================
bin/mapDamage → mapdamage/main.py
=====================================
@@ -2,10 +2,14 @@
# -*- coding: utf-8 -*-
import logging
+import os
import random
-import time
import sys
-import os
+import time
+
+import pysam
+
+import mapdamage
""" Copyright (c) 2012 Aurélien Ginolhac, Mikkel Schubert, Hákon Jónsson
and Ludovic Orlando
@@ -38,33 +42,17 @@ plot and quantify damage patterns from a SAM/BAM file
:Output: tabulated tables, pdf
"""
-# check if pysam if available
-MODULE = "pysam"
-URL = "http://code.google.com/p/pysam/"
-try:
- __import__(MODULE)
-except ImportError as e:
- sys.stderr.write("Error: Could not import required module '%s':\n\t- %s\n" % (MODULE, e))
- sys.stderr.write(" If module is not installed, please download from '%s'.\n" % URL)
- sys.stderr.write(" A local install may be performed using the following command:\n")
- sys.stderr.write(" $ python setup.py install --user\n\n")
- sys.exit(1)
-
-import pysam
-
-
-_BAM_UNMAPPED = 0x4
+_BAM_UNMAPPED = 0x4
_BAM_SECONDARY = 0x100
_BAM_FAILED_QC = 0x200
-_BAM_PCR_DUPE = 0x400
-_BAM_CHIMERIC = 0x800
+_BAM_PCR_DUPE = 0x400
+_BAM_CHIMERIC = 0x800
+
def _filter_reads(bamfile):
- filtered_flags = _BAM_UNMAPPED | \
- _BAM_SECONDARY | \
- _BAM_FAILED_QC | \
- _BAM_PCR_DUPE | \
- _BAM_CHIMERIC
+ filtered_flags = (
+ _BAM_UNMAPPED | _BAM_SECONDARY | _BAM_FAILED_QC | _BAM_PCR_DUPE | _BAM_CHIMERIC
+ )
for read in bamfile:
if not (read.flag & filtered_flags):
@@ -74,7 +62,9 @@ def _filter_reads(bamfile):
def _downsample_to_fraction(bamfile, options):
if options.verbose:
print("\tDownsampling %.2f fraction of the original file" % options.downsample)
- assert options.downsample > 0, "Downsample fraction must be positive, not %s" % options.downsample
+ assert options.downsample > 0, (
+ "Downsample fraction must be positive, not %s" % options.downsample
+ )
rand = random.Random(options.downsample_seed)
for read in _filter_reads(bamfile):
@@ -84,13 +74,15 @@ def _downsample_to_fraction(bamfile, options):
def _downsample_to_fixed_number(bamfile, options):
if options.verbose:
- print("\tDownsampling %d random reads from the original file" % options.downsample)
+ print(
+ "\tDownsampling %d random reads from the original file" % options.downsample
+ )
# use reservoir sampling
downsample_to = int(options.downsample)
rand = random.Random(options.downsample_seed)
sample = [None] * downsample_to
- for (index, record) in enumerate(_filter_reads(bamfile)):
+ for index, record in enumerate(_filter_reads(bamfile)):
if index >= downsample_to:
index = rand.randint(0, index)
if index >= downsample_to:
@@ -112,59 +104,20 @@ def _read_bamfile(bamfile, options):
else:
return _downsample_to_fixed_number(bamfile, options)
-def _check_mm_option():
- """
- As the user can override the system wide mapdamage modules with
- the --mapdamage-modules, it has to happen before option parsing
- in mapdamage.parseoptions
- """
- path_to_mm = None
- for nr,arg in zip(list(range(len(sys.argv))),sys.argv):
- if arg.startswith("--mapdamage-modules"):
- try:
- if "=" in arg:
- # the option is of the format --mapdamage-modules=AAAA
- arg_p = arg.split("=")
- path_to_mm = arg_p[1]
- else:
- # the option is of the format --mapdamage-modules AAAA
- path_to_mm = sys.argv[nr+1]
- break
- except IndexError as e:
- raise SystemExit("Must specify a path to --mapdamage-modules")
- if path_to_mm != None:
- if not os.path.isdir(path_to_mm):
- raise SystemExit("The --mapdamage-modules option must be a valid path (path=%s)" % path_to_mm)
- if not os.path.isdir(os.path.join(path_to_mm,"mapdamage")):
- raise SystemExit("The --mapdamage-modules path (path=%s) must contain the mapdamage module" % path_to_mm)
- return path_to_mm
-
def main():
start_time = time.time()
- # the user can override the system wide mapdamage modules
- path_to_mm = _check_mm_option()
- if path_to_mm != None:
- sys.path.insert(0,path_to_mm)
- import mapdamage
-
- fpath_seqtk = mapdamage.rscript.construct_path("seqtk", folder="seqtk")
- if not (os.path.isfile(fpath_seqtk) and os.access(fpath_seqtk, os.X_OK)):
- sys.stderr.write("Seqtk executable not accessible; mapDamage has not\n"
- "been intalled properly or current user does not\n"
- "sufficient permissions. Expected executable at\n "
- "'%s'\n" % (fpath_seqtk,))
- return 1
-
options = mapdamage.parseoptions.options()
- if options == None:
+ if options is None:
sys.stderr.write("Option parsing failed, terminating the program\n")
return 1
- logging.basicConfig(filename = os.path.join(options.folder, "Runtime_log.txt"),
- format = '%(asctime)s\t%(levelname)s\t%(name)s: %(message)s',
- level = logging.DEBUG)
+ logging.basicConfig(
+ filename=os.path.join(options.folder, "Runtime_log.txt"),
+ format="%(asctime)s\t%(levelname)s\t%(name)s: %(message)s",
+ level=logging.DEBUG,
+ )
handler = logging.StreamHandler()
handler.setLevel(logging.INFO)
logging.getLogger().addHandler(handler)
@@ -175,7 +128,9 @@ def main():
# plot using R if results folder already done
if options.plot_only:
if options.no_r:
- logger.error("Cannot use plot damage patterns if R is missing, terminating the program")
+ logger.error(
+ "Cannot use plot damage patterns if R is missing, terminating the program"
+ )
return 1
else:
mapdamage.rscript.plot(options)
@@ -190,13 +145,13 @@ def main():
return 1
else:
# before running the Bayesian estimation get the base composition
- path_to_basecomp = os.path.join(options.folder,"dnacomp_genome.csv")
+ path_to_basecomp = os.path.join(options.folder, "dnacomp_genome.csv")
if os.path.isfile(path_to_basecomp):
- #Try to read the base composition file
+ # Try to read the base composition file
mapdamage.composition.read_base_comp(path_to_basecomp)
else:
- #Construct the base composition file
- mapdamage.composition.get_base_comp(options.ref,path_to_basecomp)
+ # Construct the base composition file
+ mapdamage.composition.write_base_comp(options.ref, path_to_basecomp)
mapdamage.rscript.run_stats(options)
return 0
@@ -205,8 +160,13 @@ def main():
ref = pysam.Fastafile(options.ref)
except IOError as e:
# Couldn't open the reference file
- logger.error("Couldn't open the reference file "+options.ref+\
- " or there are permission problems writing the "+options.ref+".fai file")
+ logger.error(
+ "Couldn't open the reference file "
+ + options.ref
+ + " or there are permission problems writing the "
+ + options.ref
+ + ".fai file"
+ )
raise e
# rescale the qualities
@@ -218,7 +178,7 @@ def main():
# open SAM/BAM file
if options.filename == "-":
- in_bam = pysam.Samfile("-", 'rb')
+ in_bam = pysam.Samfile("-", "rb")
# disabling rescaling if reading from standard input since we need
# to read it twice
if options.rescale:
@@ -235,38 +195,43 @@ def main():
elif not mapdamage.seq.compare_sequence_dicts(fai_lengths, reflengths):
return 1
elif (len(reflengths) >= 1000) and not options.merge_reference_sequences:
- print("WARNING: Alignment contains a large number of reference sequences (%i)!" % len(reflengths))
+ print(
+ "WARNING: Alignment contains a large number of reference sequences (%i)!"
+ % len(reflengths)
+ )
print(" This may lead to excessive memory/disk usage.")
print(" Consider using --merge-reference-sequences")
print()
refnames = in_bam.references
if options.merge_reference_sequences:
- refnames = ['*']
+ refnames = ["*"]
# for misincorporation patterns, record mismatches
misincorp = mapdamage.tables.initialize_mut(refnames, options.length)
# for fragmentation patterns, record base compositions
- dnacomp = mapdamage.tables.initialize_comp(refnames, options.around, options.length)
+ dnacomp = mapdamage.tables.initialize_comp(refnames, options.around, options.length)
# for length distributions
- lgdistrib = mapdamage.tables.initialize_lg()
+ lgdistrib = mapdamage.tables.initialize_lg()
if not options.quiet:
print("\tReading from '%s'" % options.filename)
- if options.minqual != 0:
+ if options.minqual != 0:
print("\tFiltering out bases with a Phred score < %d" % options.minqual)
if options.verbose:
- print("\t%d references are assumed in SAM/BAM file, for a total of %d nucleotides" % (len(reflengths), sum(reflengths.values())))
+ print(
+ "\t%d references are assumed in SAM/BAM file, for a total of %d nucleotides"
+ % (len(reflengths), sum(reflengths.values()))
+ )
print("\tWriting results to '%s/'" % options.folder)
-
# open file handler to write alignments in fasta format
if options.fasta:
# use name of the SAM/BAM filename without extension
- ffasta = os.path.splitext(os.path.basename(options.filename))[0]+'.fasta'
+ ffasta = os.path.splitext(os.path.basename(options.filename))[0] + ".fasta"
if not options.quiet:
print("\tWriting alignments in '%s'" % ffasta)
- fhfasta = open(options.folder+"/"+ffasta,"w")
+ fhfasta = open(options.folder + "/" + ffasta, "w")
counter = 0
@@ -281,7 +246,9 @@ def main():
# fetch reference name, chromosome or contig names
chrom = in_bam.getrname(read.tid)
- (before, after) = mapdamage.align.get_around(coordinate, chrom, reflengths, options.around, ref)
+ (before, after) = mapdamage.align.get_around(
+ coordinate, chrom, reflengths, options.around, ref
+ )
refseq = ref.fetch(chrom, min(coordinate), max(coordinate)).upper()
# read.query contains aligned sequences while read.seq is the read itself
seq = read.query
@@ -289,12 +256,17 @@ def main():
# add gaps according to the cigar string, do it for qualities if filtering options is on
if not (options.minqual and read.qual):
if options.minqual:
- logger.warning("Cannot filter bases by PHRED scores for read '%s'; no scores available." % read.qname)
+ logger.warning(
+ "Cannot filter bases by PHRED scores for read '%s'; no scores available."
+ % read.qname
+ )
(seq, refseq) = mapdamage.align.align(read.cigar, seq, refseq)
else:
# add gaps to qualities and mask read and reference nucleotides if below desired threshold
- (seq, _, refseq) = mapdamage.align.align_with_qual(read.cigar, seq, read.qqual, options.minqual, refseq)
+ (seq, _, refseq) = mapdamage.align.align_with_qual(
+ read.cigar, seq, read.qqual, options.minqual, refseq
+ )
# reverse complement read and reference when mapped reverse strand
if read.is_reverse:
@@ -305,15 +277,19 @@ def main():
before = beforerev
if options.merge_reference_sequences:
- chrom = '*'
+ chrom = "*"
# record soft clipping when present
mapdamage.align.record_soft_clipping(read, misincorp[chrom], options.length)
# count misincorparations by comparing read and reference base by base
- mapdamage.align.get_mis(read, seq, refseq, chrom, options.length, misincorp, '5p')
+ mapdamage.align.get_mis(
+ read, seq, refseq, chrom, options.length, misincorp, "5p"
+ )
# do the same with sequences align to 3'-ends
- mapdamage.align.get_mis(read, seq[::-1], refseq[::-1], chrom, options.length, misincorp, '3p')
+ mapdamage.align.get_mis(
+ read, seq[::-1], refseq[::-1], chrom, options.length, misincorp, "3p"
+ )
# compute base composition for reads
mapdamage.composition.count_read_comp(read, chrom, options.length, dnacomp)
@@ -321,7 +297,17 @@ def main():
mapdamage.composition.count_ref_comp(read, chrom, before, after, dnacomp)
if options.fasta:
- mapdamage.seq.write_fasta(read, chrom, seq, refseq, min(coordinate), max(coordinate), before, after, fhfasta)
+ mapdamage.seq.write_fasta(
+ read,
+ chrom,
+ seq,
+ refseq,
+ min(coordinate),
+ max(coordinate),
+ before,
+ after,
+ fhfasta,
+ )
if options.verbose:
if counter % 50000 == 0:
@@ -337,14 +323,13 @@ def main():
fhfasta.close()
# output results, write summary tables to disk
- with open(options.folder+"/"+"misincorporation.txt", 'w') as fmut:
+ with open(options.folder + "/" + "misincorporation.txt", "w") as fmut:
mapdamage.tables.print_mut(misincorp, options, fmut)
- with open(options.folder+"/"+"dnacomp.txt", 'w') as fcomp:
+ with open(options.folder + "/" + "dnacomp.txt", "w") as fcomp:
mapdamage.tables.print_comp(dnacomp, options, fcomp)
- with open(options.folder+"/"+"lgdistribution.txt", 'w') as flg:
+ with open(options.folder + "/" + "lgdistribution.txt", "w") as flg:
mapdamage.tables.print_lg(lgdistrib, options, flg)
-
# plot using R
if not options.no_r:
mapdamage.rscript.plot(options)
@@ -354,11 +339,12 @@ def main():
if mapdamage.tables.check_table_and_warn_if_dmg_freq_is_low(options.folder):
options.no_stats = True
-
# run the Bayesian estimation
if not options.no_stats:
# before running the Bayesian estimation get the base composition
- mapdamage.composition.get_base_comp(options.ref,os.path.join(options.folder,"dnacomp_genome.csv"))
+ mapdamage.composition.write_base_comp(
+ options.ref, os.path.join(options.folder, "dnacomp_genome.csv")
+ )
mapdamage.rscript.run_stats(options)
# rescale the qualities
@@ -373,7 +359,3 @@ def main():
logger.debug("Run completed in %f seconds" % (time.time() - start_time,))
return 0
-
-
-if __name__ == '__main__':
- sys.exit(main())
=====================================
mapdamage/mp_test.py
=====================================
@@ -1,60 +1,85 @@
-import unittest
+import filecmp
+import optparse
import subprocess
+import unittest
+
import pysam
+
import mapdamage
-import optparse
-import filecmp
+
def read_nocomment(dnacomp):
- with open(dnacomp, 'r') as f:
+ with open(dnacomp, "r") as f:
a = f.readlines()
- return [x for x in a if not x.startswith('#')]
+ return [x for x in a if not x.startswith("#")]
-def mock_options(filename,rescale_out,folder):
+
+def mock_options(filename, rescale_out, folder):
"""Make the options object with nice values for testing"""
- return optparse.Values({
- "filename":filename,
- "rescale_out":rescale_out,
- "verbose":True,
- "folder":folder,
- "rescale_length_5p":12, # default values as in --seq-length
- "rescale_length_3p":12, # default values as in --seq-length
- "quiet":True
- })
+ return optparse.Values(
+ {
+ "filename": filename,
+ "rescale_out": rescale_out,
+ "verbose": True,
+ "folder": folder,
+ "rescale_length_5p": 12, # default values as in --seq-length
+ "rescale_length_3p": 12, # default values as in --seq-length
+ "quiet": True,
+ }
+ )
class testCases(unittest.TestCase):
-
def test_no_stats(self):
"""regular mapDamage"""
- subprocess.run(["mapDamage", "-i", "tests/test.bam", "-r", "tests/fake1.fasta", "-d", "tests/results", "--no-stats"], check = True)
- self.assertTrue(read_nocomment("tests/dnacomp.txt") == read_nocomment("tests/results/dnacomp.txt"))
+ subprocess.run(
+ [
+ "mapDamage",
+ "-i",
+ "tests/test.bam",
+ "-r",
+ "tests/fake1.fasta",
+ "-d",
+ "tests/results",
+ "--no-stats",
+ ],
+ check=True,
+ )
+ self.assertTrue(
+ read_nocomment("tests/dnacomp.txt")
+ == read_nocomment("tests/results/dnacomp.txt")
+ )
+
class testRescaling(unittest.TestCase):
def test_single_end_file(self):
"""Test, rescaling BAM file"""
#
- # The expected substition frequencies before and after scaling using the scaled qualities as probalities:
- # CT 0.06226411977920493 0.04163524443356556
- # TC 0.020395286584806528 0.020395286584806528
- # GA 0.04400459948304954 0.03794905109091021
- # AG 0.05355350777642983 0.05355350777642983
- # Quality metrics before and after scaling
- # CT-Q0 5 5
- # CT-Q10 5 2
- # CT-Q20 3 2
- # CT-Q30 3 2
- # CT-Q40 0 0
- # GA-Q0 5 5
- # GA-Q10 5 4
- # GA-Q20 1 0
- # GA-Q30 1 0
- # GA-Q40 0 0
- options = mock_options("tests/test.bam","tests/test.rescaled.sam","tests/probs/")
+ # The expected substitution frequencies before and after scaling using the scaled qualities as probalities:
+ # CT 0.06226411977920493 0.04163524443356556
+ # TC 0.020395286584806528 0.020395286584806528
+ # GA 0.04400459948304954 0.03794905109091021
+ # AG 0.05355350777642983 0.05355350777642983
+ # Quality metrics before and after scaling
+ # CT-Q0 5 5
+ # CT-Q10 5 2
+ # CT-Q20 3 2
+ # CT-Q30 3 2
+ # CT-Q40 0 0
+ # GA-Q0 5 5
+ # GA-Q10 5 4
+ # GA-Q20 1 0
+ # GA-Q30 1 0
+ # GA-Q40 0 0
+ options = mock_options(
+ "tests/test.bam", "tests/test.rescaled.sam", "tests/probs/"
+ )
ref = pysam.Fastafile("tests/fake1.fasta")
- mapdamage.rescale.rescale_qual(ref,options,debug=True)
- self.assertTrue(filecmp.cmp("tests/test.rescaled.sam","tests/test.rescaled.correct.sam"))
+ mapdamage.rescale.rescale_qual(ref, options, debug=True)
+ self.assertTrue(
+ filecmp.cmp("tests/test.rescaled.sam", "tests/test.rescaled.correct.sam")
+ )
-if __name__=='__main__':
+if __name__ == "__main__":
unittest.main()
=====================================
mapdamage/parseoptions.py
=====================================
@@ -1,9 +1,10 @@
-from optparse import OptionParser, OptionGroup, SUPPRESS_HELP
import os
import sys
+from optparse import SUPPRESS_HELP, OptionGroup, OptionParser
-from mapdamage.version import __version__
from mapdamage.rscript import check_R_lib
+from mapdamage.version import __version__
+
def file_exist(filename):
if os.path.exists(filename) and not os.path.isdir(filename):
@@ -16,14 +17,14 @@ def file_exist(filename):
def whereis(program):
- for path in os.environ.get('PATH', '').split(':'):
- if os.path.exists(os.path.join(path, program)) and \
- not os.path.isdir(os.path.join(path, program)):
+ for path in os.environ.get("PATH", "").split(":"):
+ if os.path.exists(os.path.join(path, program)) and not os.path.isdir(
+ os.path.join(path, program)
+ ):
return os.path.join(path, program)
return None
-
def check_py_version():
req_version = (2, 6)
cur_version = sys.version_info
@@ -31,120 +32,353 @@ def check_py_version():
if cur_version >= req_version:
return True
else:
- sys.stderr.write("Your Python interpreter is too old."\
- "Please consider upgrading to at least %d.%d\n" % (req_version[0], req_version[1]))
+ sys.stderr.write(
+ "Your Python interpreter is too old."
+ "Please consider upgrading to at least %d.%d\n"
+ % (req_version[0], req_version[1])
+ )
return None
def options():
- parser = OptionParser("%prog [options] -i BAMfile -r reference.fasta\n\nUse option -h or --help for help", version=__version__, \
- epilog="report bugs to aginolhac at snm.ku.dk, MSchubert at snm.ku.dk or jonsson.hakon at gmail.com")
+ parser = OptionParser(
+ "%prog [options] -i BAMfile -r reference.fasta\n\nUse option -h or --help for help",
+ version=__version__,
+ epilog="report bugs to aginolhac at snm.ku.dk, MSchubert at snm.ku.dk or jonsson.hakon at gmail.com",
+ )
args = OptionGroup(parser, "Input files")
- args.add_option("-i", "--input", help="SAM/BAM file, must contain a valid header, use '-' for reading a BAM from stdin", \
- action="store", type="string", dest="filename")
- args.add_option("-r", "--reference", help="Reference file in FASTA format", \
- action="store", dest="ref")
+ args.add_option(
+ "-i",
+ "--input",
+ help="SAM/BAM file, must contain a valid header, use '-' for reading a BAM from stdin",
+ action="store",
+ type="string",
+ dest="filename",
+ )
+ args.add_option(
+ "-r",
+ "--reference",
+ help="Reference file in FASTA format",
+ action="store",
+ dest="ref",
+ )
parser.add_option_group(args)
group = OptionGroup(parser, "General options")
- group.add_option("-n", "--downsample", help = "Downsample to a randomly selected fraction of the reads (if 0 < DOWNSAMPLE < 1), or " \
- "a fixed number of randomly selected reads (if DOWNSAMPLE >= 1). By default, no downsampling is performed.",
- type = float, default = None)
- group.add_option("--downsample-seed", help = "Seed value to use for downsampling. See documentation for py module 'random' for default behavior.",
- type = int, default = None)
- group.add_option("--merge-reference-sequences", help = "Ignore referece sequence names when tabulating reads (using '*' instead). "
- "Useful for alignments with a large number of reference sequnces, which may otherwise result in excessive "
- "memory or disk usage due to the number of tables generated.",
- default = False, action = "store_true")
- group.add_option("-l", "--length", dest="length", help="read length, in nucleotides to consider [%default]", \
- type = int, default=70,action="store")
- group.add_option("-a", "--around", dest="around", help="nucleotides to retrieve before/after reads [%default]", \
- type = int, default=10,action="store")
- group.add_option("-Q", "--min-basequal", dest="minqual", help="minimun base quality Phred score considered, Phred-33 assumed [%default]", \
- type = int, default=0, action="store")
- group.add_option("-d", "--folder", help="folder name to store results [results_FILENAME]", \
- action="store", type="string", dest="folder")
- group.add_option("-f", "--fasta", dest="fasta", help="Write alignments in a FASTA file", \
- default=False,action="store_true")
- group.add_option("--plot-only", dest="plot_only", help="Run only plotting from a valid result folder", \
- default=False,action="store_true")
- group.add_option("-q", "--quiet", dest="quiet", help="Disable any output to stdout", \
- default=False,action="store_true")
- group.add_option("-v", "--verbose", dest="verbose", help="Display progression information during parsing", \
- default=False,action="store_true")
- group.add_option("--mapdamage-modules", dest="mapdamage_modules", help="Override the system wide installed mapDamage module", \
- default=None)
- group.add_option("--no-plot", dest="no_r", help=SUPPRESS_HELP, default=False, action="store_true")
+ group.add_option(
+ "-n",
+ "--downsample",
+ help="Downsample to a randomly selected fraction of the reads (if 0 < DOWNSAMPLE < 1), or "
+ "a fixed number of randomly selected reads (if DOWNSAMPLE >= 1). By default, no downsampling is performed.",
+ type=float,
+ default=None,
+ )
+ group.add_option(
+ "--downsample-seed",
+ help="Seed value to use for downsampling. See documentation for py module 'random' for default behavior.",
+ type=int,
+ default=None,
+ )
+ group.add_option(
+ "--merge-reference-sequences",
+ help="Ignore reference sequence names when tabulating reads (using '*' instead). "
+ "Useful for alignments with a large number of reference sequences, which may otherwise result in excessive "
+ "memory or disk usage due to the number of tables generated.",
+ default=False,
+ action="store_true",
+ )
+ group.add_option(
+ "-l",
+ "--length",
+ dest="length",
+ help="read length, in nucleotides to consider [%default]",
+ type=int,
+ default=70,
+ action="store",
+ )
+ group.add_option(
+ "-a",
+ "--around",
+ dest="around",
+ help="nucleotides to retrieve before/after reads [%default]",
+ type=int,
+ default=10,
+ action="store",
+ )
+ group.add_option(
+ "-Q",
+ "--min-basequal",
+ dest="minqual",
+ help="minimum base quality Phred score considered, Phred-33 assumed [%default]",
+ type=int,
+ default=0,
+ action="store",
+ )
+ group.add_option(
+ "-d",
+ "--folder",
+ help="folder name to store results [results_FILENAME]",
+ action="store",
+ type="string",
+ dest="folder",
+ )
+ group.add_option(
+ "-f",
+ "--fasta",
+ dest="fasta",
+ help="Write alignments in a FASTA file",
+ default=False,
+ action="store_true",
+ )
+ group.add_option(
+ "--plot-only",
+ dest="plot_only",
+ help="Run only plotting from a valid result folder",
+ default=False,
+ action="store_true",
+ )
+ group.add_option(
+ "-q",
+ "--quiet",
+ dest="quiet",
+ help="Disable any output to stdout",
+ default=False,
+ action="store_true",
+ )
+ group.add_option(
+ "-v",
+ "--verbose",
+ dest="verbose",
+ help="Display progression information during parsing",
+ default=False,
+ action="store_true",
+ )
+ group.add_option(
+ "--no-plot", dest="no_r", help=SUPPRESS_HELP, default=False, action="store_true"
+ )
parser.add_option_group(group)
# options for plotting damage patterns
group2 = OptionGroup(parser, "Options for graphics")
- group2.add_option("-y", "--ymax", dest="ymax", \
- help="graphical y-axis limit for nucleotide misincorporation frequencies [%default]", type = float, \
- default=0.3,action="store")
- group2.add_option("-m", "--readplot", dest="readplot", \
- help="read length, in nucleotides, considered for plotting nucleotide misincorporations [%default]", \
- type = int, default=25, action="store")
- group2.add_option("-b", "--refplot", dest="refplot", \
- help="the number of reference nucleotides to consider for ploting base composition in the region located upstream "
- "and downstream of every read [%default]", type= int, default=10, action="store")
- group2.add_option("-t", "--title", dest="title", \
- help="title used for plots [%default]", \
- type="string", default="",action="store")
+ group2.add_option(
+ "-y",
+ "--ymax",
+ dest="ymax",
+ help="graphical y-axis limit for nucleotide misincorporation frequencies [%default]",
+ type=float,
+ default=0.3,
+ action="store",
+ )
+ group2.add_option(
+ "-m",
+ "--readplot",
+ dest="readplot",
+ help="read length, in nucleotides, considered for plotting nucleotide misincorporations [%default]",
+ type=int,
+ default=25,
+ action="store",
+ )
+ group2.add_option(
+ "-b",
+ "--refplot",
+ dest="refplot",
+ help="the number of reference nucleotides to consider for plotting base composition in the region located upstream "
+ "and downstream of every read [%default]",
+ type=int,
+ default=10,
+ action="store",
+ )
+ group2.add_option(
+ "-t",
+ "--title",
+ dest="title",
+ help="title used for plots [%default]",
+ type="string",
+ default="",
+ action="store",
+ )
parser.add_option_group(group2)
# Then the plethora of optional options for the statistical estimation ..
- group3 = OptionGroup(parser,"Options for the statistical estimation")
- group3.add_option("", "--rand", dest="rand", \
- help="Number of random starting points for the likelihood optimization [%default]", type = int, default=30, action="store")
- group3.add_option("", "--burn", dest="burn", \
- help="Number of burnin iterations [%default]", type = int, default=10000,action="store")
- group3.add_option("", "--adjust", dest="adjust", \
- help="Number of adjust proposal variance parameters iterations [%default]", type = int, default=10, action="store")
- group3.add_option("", "--iter", dest="iter", \
- help="Number of final MCMC iterations [%default]", type = int, default=50000, action="store")
- group3.add_option("", "--forward", dest="forward", \
- help="Using only the 5' end of the seqs [%default]", default=False, action="store_true")
- group3.add_option("", "--reverse", dest="reverse", \
- help="Using only the 3' end of the seqs [%default]", default=False, action="store_true")
- group3.add_option("", "--var-disp", dest="var_disp", \
- help="Variable dispersion in the overhangs [%default]", default=False,action="store_true")
- group3.add_option("", "--jukes-cantor", dest="jukes_cantor", \
- help="Use Jukes Cantor instead of HKY85 [%default]", default=False,action="store_true")
- group3.add_option("", "--diff-hangs", dest="diff_hangs", \
- help="The overhangs are different for 5' and 3' [%default]", default=False, action="store_true")
- group3.add_option("", "--fix-nicks" , dest="fix_nicks", \
- help="Fix the nick frequency vector (Only C.T from the 5' end and G.A from the 3' end) [%default]", default=False, action="store_true")
- group3.add_option("", "--use-raw-nick-freq" , dest="use_raw_nick_freq", \
- help="Use the raw nick frequency vector without smoothing [%default]", default=False, action="store_true")
- group3.add_option("", "--single-stranded", dest="single_stranded", \
- help="Single stranded protocol [%default]", default=False, action="store_true")
- group3.add_option("", "--theme-bw", dest="theme_bw", \
- help="Use black and white theme in post. pred. plot [%default]", default=False, action="store_true")
- group3.add_option("", "--seq-length", dest="seq_length", \
- help="How long sequence to use from each side [%default]", type = int, default=12, action="store")
- group3.add_option("--stats-only", dest="stats_only", help="Run only statistical estimation from a valid result folder", \
- default=False, action="store_true")
- group3.add_option("--no-stats", help="Disabled statistical estimation, active by default", default=False, action="store_true")
- group3.add_option("--check-R-packages", help="Check if the R modules are working", default=False, action="store_true")
+ group3 = OptionGroup(parser, "Options for the statistical estimation")
+ group3.add_option(
+ "",
+ "--rand",
+ dest="rand",
+ help="Number of random starting points for the likelihood optimization [%default]",
+ type=int,
+ default=30,
+ action="store",
+ )
+ group3.add_option(
+ "",
+ "--burn",
+ dest="burn",
+ help="Number of burn-in iterations [%default]",
+ type=int,
+ default=10000,
+ action="store",
+ )
+ group3.add_option(
+ "",
+ "--adjust",
+ dest="adjust",
+ help="Number of adjust proposal variance parameters iterations [%default]",
+ type=int,
+ default=10,
+ action="store",
+ )
+ group3.add_option(
+ "",
+ "--iter",
+ dest="iter",
+ help="Number of final MCMC iterations [%default]",
+ type=int,
+ default=50000,
+ action="store",
+ )
+ group3.add_option(
+ "",
+ "--forward",
+ dest="forward",
+ help="Using only the 5' end of the seqs [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--reverse",
+ dest="reverse",
+ help="Using only the 3' end of the seqs [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--var-disp",
+ dest="var_disp",
+ help="Variable dispersion in the overhangs [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--jukes-cantor",
+ dest="jukes_cantor",
+ help="Use Jukes Cantor instead of HKY85 [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--diff-hangs",
+ dest="diff_hangs",
+ help="The overhangs are different for 5' and 3' [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--fix-nicks",
+ dest="fix_nicks",
+ help="Fix the nick frequency vector (Only C.T from the 5' end and G.A from the 3' end) [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--use-raw-nick-freq",
+ dest="use_raw_nick_freq",
+ help="Use the raw nick frequency vector without smoothing [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--single-stranded",
+ dest="single_stranded",
+ help="Single stranded protocol [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--theme-bw",
+ dest="theme_bw",
+ help="Use black and white theme in post. pred. plot [%default]",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "",
+ "--seq-length",
+ dest="seq_length",
+ help="How long sequence to use from each side [%default]",
+ type=int,
+ default=12,
+ action="store",
+ )
+ group3.add_option(
+ "--stats-only",
+ dest="stats_only",
+ help="Run only statistical estimation from a valid result folder",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "--no-stats",
+ help="Disabled statistical estimation, active by default",
+ default=False,
+ action="store_true",
+ )
+ group3.add_option(
+ "--check-R-packages",
+ help="Check if the R modules are working",
+ default=False,
+ action="store_true",
+ )
parser.add_option_group(group3)
- group4 = OptionGroup(parser,"Options for rescaling of BAM files")
- group4.add_option("--rescale", dest="rescale", help="Rescale the quality scores in the BAM file using the output from the statistical estimation",
- default=False, action="store_true")
- group4.add_option("--rescale-only", dest="rescale_only", help="Run only rescaling from a valid result folder",
- default=False, action="store_true")
- group4.add_option("--rescale-out", dest="rescale_out", help="Write the rescaled BAM to this file",
- default=None, action="store")
- group4.add_option("--rescale-length-5p", dest="rescale_length_5p",
- help="How many bases to rescale at the 5' termini; defaults to --seq-length.", type=int, action="store")
- group4.add_option("--rescale-length-3p", dest="rescale_length_3p",
- help="How many bases to rescale at the 5' termini; defaults to --seq-length.", type=int, action="store")
+ group4 = OptionGroup(parser, "Options for rescaling of BAM files")
+ group4.add_option(
+ "--rescale",
+ dest="rescale",
+ help="Rescale the quality scores in the BAM file using the output from the statistical estimation",
+ default=False,
+ action="store_true",
+ )
+ group4.add_option(
+ "--rescale-only",
+ dest="rescale_only",
+ help="Run only rescaling from a valid result folder",
+ default=False,
+ action="store_true",
+ )
+ group4.add_option(
+ "--rescale-out",
+ dest="rescale_out",
+ help="Write the rescaled BAM to this file",
+ default=None,
+ action="store",
+ )
+ group4.add_option(
+ "--rescale-length-5p",
+ dest="rescale_length_5p",
+ help="How many bases to rescale at the 5' termini; defaults to --seq-length.",
+ type=int,
+ action="store",
+ )
+ group4.add_option(
+ "--rescale-length-3p",
+ dest="rescale_length_3p",
+ help="How many bases to rescale at the 5' termini; defaults to --seq-length.",
+ type=int,
+ action="store",
+ )
parser.add_option_group(group4)
-
- #Parse the arguments
+ # Parse the arguments
(options, args) = parser.parse_args()
# check python version
@@ -161,9 +395,9 @@ def options():
# check general arguments
if not (options.plot_only or options.stats_only) and not options.filename:
- parser.error('SAM/BAM file not given (-i)')
+ parser.error("SAM/BAM file not given (-i)")
if not (options.plot_only or options.ref):
- parser.error('Reference file not given (-r)')
+ parser.error("Reference file not given (-r)")
if not options.plot_only and not options.stats_only:
if not file_exist(options.filename) or not file_exist(options.ref):
return None
@@ -174,57 +408,65 @@ def options():
options.downsample = int(options.downsample)
if options.plot_only and not options.folder:
- parser.error('Folder not provided, required with --plot-only')
+ parser.error("Folder not provided, required with --plot-only")
if options.stats_only and not options.folder:
- parser.error('Folder not provided, required with --stats-only')
+ parser.error("Folder not provided, required with --stats-only")
if options.rescale_only and not options.folder:
- parser.error('Folder not provided, required with --rescale-only')
+ parser.error("Folder not provided, required with --rescale-only")
if options.rescale_only and not options.filename:
- parser.error('Input bam not provided, required with --rescale-only')
+ parser.error("Input bam not provided, required with --rescale-only")
if options.rescale_only and not options.ref:
- parser.error('Reference not provided, required with --rescale-only')
+ parser.error("Reference not provided, required with --rescale-only")
if options.verbose and options.quiet:
- parser.error('Cannot use verbose and quiet option at the same time')
+ parser.error("Cannot use verbose and quiet option at the same time")
# check options
if options.length < 0:
- parser.error('length (-l) must be a positive integrer')
+ parser.error("length (-l) must be a positive integrer")
if options.around < 0:
- parser.error('around (-a) must be a positive integrer')
+ parser.error("around (-a) must be a positive integrer")
if options.ymax <= 0 or options.ymax > 1:
- parser.error('ymax (-b) must be an real number beetween 0 and 1')
+ parser.error("ymax (-b) must be an real number between 0 and 1")
if options.readplot < 0:
- parser.error('readplot (-m) must be a positive integrer')
+ parser.error("readplot (-m) must be a positive integrer")
if options.refplot < 0:
- parser.error('refplot (-b) must be a positive integrer')
+ parser.error("refplot (-b) must be a positive integrer")
if options.refplot > options.around and not options.plot_only:
- parser.error('refplot (-b) must be inferior to around (-a)')
+ parser.error("refplot (-b) must be inferior to around (-a)")
if options.readplot > options.length:
- parser.error('readplot (-m) must be inferior to length (-l)')
- if options.minqual < 0 or options.minqual > 41:
- parser.error('minimal base quality, Phred score, must be within this range: 0 - 41')
+ parser.error("readplot (-m) must be inferior to length (-l)")
+ if options.minqual < 0 or options.minqual > 41:
+ parser.error(
+ "minimal base quality, Phred score, must be within this range: 0 - 41"
+ )
# check statistic options
if options.forward and options.reverse:
- parser.error('Cannot use only forward end and only reverse end for the statistics')
+ parser.error(
+ "Cannot use only forward end and only reverse end for the statistics"
+ )
# use filename as default for plot titles if not set
if options.title == "" and options.filename:
options.title = os.path.splitext(os.path.basename(options.filename))[0]
# for --plot-only, use the folder name, without results_ as title
if options.title == "" and not options.filename and options.folder:
- options.title = os.path.splitext(os.path.basename(options.folder))[0].replace("results_", "")
+ options.title = os.path.splitext(os.path.basename(options.folder))[0].replace(
+ "results_", ""
+ )
# check folder
if not options.folder and options.filename:
- options.folder = "results_"+os.path.splitext(os.path.basename(options.filename))[0]
+ options.folder = (
+ "results_" + os.path.splitext(os.path.basename(options.filename))[0]
+ )
# check destination for rescaled bam
if not options.rescale_out and (options.rescale or options.rescale_only):
- # if there are mulitiple bam files to rescale then pick first one as
+ # if there are multiple bam files to rescale then pick first one as
# the name of the rescaled file
- if isinstance(options.filename,list):
+ if isinstance(options.filename, list):
basename = os.path.basename(options.filename[0])
else:
basename = os.path.basename(options.filename)
@@ -235,34 +477,45 @@ def options():
if not options.quiet and not options.plot_only:
print(("Warning, %s already exists" % options.folder))
if options.plot_only:
- if not file_exist(options.folder+"/dnacomp.txt") or not file_exist(options.folder+"/misincorporation.txt"):
- parser.error('folder %s is not a valid result folder' % options.folder)
+ if not file_exist(options.folder + "/dnacomp.txt") or not file_exist(
+ options.folder + "/misincorporation.txt"
+ ):
+ parser.error("folder %s is not a valid result folder" % options.folder)
else:
- os.makedirs(options.folder, mode = 0o750)
+ os.makedirs(options.folder, mode=0o750)
if options.plot_only or options.stats_only or options.rescale_only:
- sys.stderr.write("Error, %s does not exist while plot/stats/rescale only was used\n" % options.folder)
+ sys.stderr.write(
+ "Error, %s does not exist while plot/stats/rescale only was used\n"
+ % options.folder
+ )
return None
if options.rescale_length_3p is None:
options.rescale_length_3p = options.seq_length
elif not (0 <= options.rescale_length_3p <= options.seq_length):
- parser.error("--rescale-length-3p must be less than or equal to "
- "--seq-length and greater than zero")
+ parser.error(
+ "--rescale-length-3p must be less than or equal to "
+ "--seq-length and greater than zero"
+ )
if options.rescale_length_5p is None:
options.rescale_length_5p = options.seq_length
elif not (0 <= options.rescale_length_5p <= options.seq_length):
- parser.error("--rescale-length-5p must be less than or equal to "
- "--seq-length and greater than zero")
+ parser.error(
+ "--rescale-length-5p must be less than or equal to "
+ "--seq-length and greater than zero"
+ )
# check if the Rscript executable is present on the system
- if not whereis('Rscript'):
+ if not whereis("Rscript"):
print("Warning, Rscript is not in your PATH, plotting is disabled")
options.no_r = True
# check the nick frequencies options
- if (options.use_raw_nick_freq + options.fix_nicks + options.single_stranded)>1:
- parser.error("The options --use-raw-nick-freq, --fix-nicks and --single-stranded are mutually exclusive.")
+ if (options.use_raw_nick_freq + options.fix_nicks + options.single_stranded) > 1:
+ parser.error(
+ "The options --use-raw-nick-freq, --fix-nicks and --single-stranded are mutually exclusive."
+ )
if check_R_lib():
# check for R libraries
@@ -275,5 +528,4 @@ def options():
if options.rescale_only:
sys.exit("Cannot use --rescale-only with missing R libraries")
-
return options
=====================================
mapdamage/rescale.py
=====================================
@@ -1,21 +1,24 @@
import csv
-import sys
-import os
-import mapdamage
-import pysam
-import itertools
-import math
import logging
+import math
+import os
+import sys
import time
+import pysam
+
+import mapdamage
+
def phred_pval_to_char(pval):
"""Transforming error rate to ASCII character using the Phred scale"""
- return chr(int(round(-10*math.log10(abs(pval)))+33))
+ return chr(int(round(-10 * math.log10(abs(pval))) + 33))
+
def phred_char_to_pval(ch):
"""Transforming ASCII character in the Phred scale to the error rate"""
- return 10**(-(float(ord(ch))-float(33))/10)
+ return 10 ** (-(float(ord(ch)) - float(33)) / 10)
+
def get_corr_prob(folder, rescale_length_5p, rescale_length_3p):
"""
@@ -24,19 +27,28 @@ def get_corr_prob(folder, rescale_length_5p, rescale_length_3p):
position (one based) - CT - probability
- GA - probability
"""
- full_path = os.path.join(folder,"Stats_out_MCMC_correct_prob.csv")
+ full_path = os.path.join(folder, "Stats_out_MCMC_correct_prob.csv")
if not os.path.isfile(full_path):
- sys.exit("Missing file, the file \n\tStats_out_MCMC_correct_prob.csv\nshould be in the folder\n\t"+folder+"\nDid you run the MCMC estimates of the parameters?")
+ sys.exit(
+ "Missing file, the file \n\tStats_out_MCMC_correct_prob.csv\nshould be in the folder\n\t"
+ + folder
+ + "\nDid you run the MCMC estimates of the parameters?"
+ )
try:
with open(full_path) as fi:
fi_handle = csv.DictReader(fi)
corr_prob = {}
for line in fi_handle:
- if (line["Position"] in corr_prob):
- sys.exit('This file has multiple position definitions %s, line %d: %s' % \
- (folder, fi_handle.line_num, corr_prob[line["Position"]]))
+ if line["Position"] in corr_prob:
+ sys.exit(
+ "This file has multiple position definitions %s, line %d: %s"
+ % (folder, fi_handle.line_num, corr_prob[line["Position"]])
+ )
else:
- corr_prob[int(line["Position"])] = {'C.T':float(line["C.T"]), 'G.A':float(line["G.A"])}
+ corr_prob[int(line["Position"])] = {
+ "C.T": float(line["C.T"]),
+ "G.A": float(line["G.A"]),
+ }
# Exclude probabilities for positions outside of user-specified region
for key in list(corr_prob.keys()):
@@ -44,11 +56,17 @@ def get_corr_prob(folder, rescale_length_5p, rescale_length_3p):
corr_prob.pop(key)
return corr_prob
except csv.Error as e:
- sys.exit('File %s, line %d: %s' % (os.path.join(folder,"Stats_out_MCMC_correct_prob.csv"), \
- fi_handle.line_num, e))
+ sys.exit(
+ "File %s, line %d: %s"
+ % (
+ os.path.join(folder, "Stats_out_MCMC_correct_prob.csv"),
+ fi_handle.line_num,
+ e,
+ )
+ )
-def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length,direction="both"):
+def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length, direction="both"):
"""
The position specific damaging correction, using the input
corr_prob dictionary holding the damage correcting values
@@ -59,20 +77,20 @@ def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length,direction="both"):
direction which end to consider the rescaling
returns the correction probability for this particular set
"""
- if (pos == 0):
+ if pos == 0:
# not using 0 based indexing
raise SystemError
- if ( nt_seq == "T" and nt_ref == "C" ):
+ if nt_seq == "T" and nt_ref == "C":
# an C to T transition
subs = "C.T"
- elif( nt_seq == "A" and nt_ref == "G" ):
+ elif nt_seq == "A" and nt_ref == "G":
# an G to A transition
subs = "G.A"
else:
# other transitions/transversions are not affected by damage
return 0
- back_pos = pos-length-1
+ back_pos = pos - length - 1
# position from 3' end
if pos in corr_prob:
@@ -92,123 +110,177 @@ def corr_this_base(corr_prob, nt_seq, nt_ref, pos, length,direction="both"):
elif direction == "backward":
return p3_corr
elif direction == "both":
- if pos < abs(back_pos) :
+ if pos < abs(back_pos):
# then we use the forward correction
return p5_corr
- else :
+ else:
# else the backward correction
return p3_corr
else:
# this should not happen
raise SystemExit("Abnormal direction in the rescaling procedure")
+
def initialize_subs():
"""Initialize a substitution table, to track the expected substitution counts"""
- per_qual = dict(list(zip(list(range(0,130)),[0]*130)))
- subs = {"CT-before":per_qual.copy(),\
- "TC-before":per_qual.copy(),\
- "GA-before":per_qual.copy(),\
- "AG-before":per_qual.copy(),\
- "CT-after":per_qual.copy(),\
- "TC-after":per_qual.copy(),\
- "GA-after":per_qual.copy(),\
- "AG-after":per_qual.copy(),\
- "A":0,\
- "C":0,\
- "G":0,\
- "T":0,\
- "CT-pvals":0.0,\
- "CT-pvals_before":0.0,\
- "TC-pvals":0.0,\
- "GA-pvals":0.0,\
- "GA-pvals_before":0.0,\
- "AG-pvals":0.0,\
- }
+ per_qual = dict(list(zip(list(range(0, 130)), [0] * 130)))
+ subs = {
+ "CT-before": per_qual.copy(),
+ "TC-before": per_qual.copy(),
+ "GA-before": per_qual.copy(),
+ "AG-before": per_qual.copy(),
+ "CT-after": per_qual.copy(),
+ "TC-after": per_qual.copy(),
+ "GA-after": per_qual.copy(),
+ "AG-after": per_qual.copy(),
+ "A": 0,
+ "C": 0,
+ "G": 0,
+ "T": 0,
+ "CT-pvals": 0.0,
+ "CT-pvals_before": 0.0,
+ "TC-pvals": 0.0,
+ "GA-pvals": 0.0,
+ "GA-pvals_before": 0.0,
+ "AG-pvals": 0.0,
+ }
return subs
-
-def record_subs(subs,nt_seq,nt_ref,nt_qual,nt_newqual,prob_corr):
- """ record the expected substitution change, prob_corr is the excact version for nt_qual"""
- if ( nt_seq == "T" and nt_ref == "C"):
+def record_subs(subs, nt_seq, nt_ref, nt_qual, nt_newqual, prob_corr):
+ """record the expected substitution change, prob_corr is the exact version for nt_qual"""
+ if nt_seq == "T" and nt_ref == "C":
sub_type = "CT"
subs["CT-pvals"] += prob_corr
- subs["CT-pvals_before"] += 1-phred_char_to_pval(nt_qual)
- elif ( nt_seq == "A" and nt_ref == "G"):
+ subs["CT-pvals_before"] += 1 - phred_char_to_pval(nt_qual)
+ elif nt_seq == "A" and nt_ref == "G":
sub_type = "GA"
subs["GA-pvals"] += prob_corr
- subs["GA-pvals_before"] += 1-phred_char_to_pval(nt_qual)
- elif ( nt_seq == "C" and nt_ref == "T"):
+ subs["GA-pvals_before"] += 1 - phred_char_to_pval(nt_qual)
+ elif nt_seq == "C" and nt_ref == "T":
sub_type = "TC"
- subs["TC-pvals"] += 1-phred_char_to_pval(nt_qual)
- if (nt_qual != nt_newqual):
- raise SystemError("Internal error: rescaling qualities for the wrong transitions")
- elif ( nt_seq == "G" and nt_ref == "A"):
+ subs["TC-pvals"] += 1 - phred_char_to_pval(nt_qual)
+ if nt_qual != nt_newqual:
+ raise SystemError(
+ "Internal error: rescaling qualities for the wrong transitions"
+ )
+ elif nt_seq == "G" and nt_ref == "A":
sub_type = "AG"
- subs["AG-pvals"] += 1-phred_char_to_pval(nt_qual)
- if (nt_qual != nt_newqual):
- raise SystemError("Internal error: rescaling qualities for the wrong transitions")
+ subs["AG-pvals"] += 1 - phred_char_to_pval(nt_qual)
+ if nt_qual != nt_newqual:
+ raise SystemError(
+ "Internal error: rescaling qualities for the wrong transitions"
+ )
else:
sub_type = "NN"
- if (sub_type != "NN"):
+ if sub_type != "NN":
# record only transitions
- subs[sub_type+"-before"][int(ord(nt_qual))-33] += 1
- subs[sub_type+"-after"][int(ord(nt_newqual))-33] += 1
- if (nt_ref in ["A","C","G","T"]):
+ subs[sub_type + "-before"][int(ord(nt_qual)) - 33] += 1
+ subs[sub_type + "-after"][int(ord(nt_newqual)) - 33] += 1
+ if nt_ref in ["A", "C", "G", "T"]:
subs[nt_ref] += 1
+
def qual_summary_subs(subs):
- """Calculates summary statistics for the substition table subs"""
- for i in ["CT-before","TC-before","GA-before","AG-before","CT-after","TC-after","GA-after","AG-after"]:
- for lv in [0,10,20,30,40]:
+ """Calculates summary statistics for the substitution table subs"""
+ for i in [
+ "CT-before",
+ "TC-before",
+ "GA-before",
+ "AG-before",
+ "CT-after",
+ "TC-after",
+ "GA-after",
+ "AG-after",
+ ]:
+ for lv in [0, 10, 20, 30, 40]:
for qv in subs[i]:
- if qv >= lv :
- key = i+"-Q"+str(lv)
+ if qv >= lv:
+ key = i + "-Q" + str(lv)
if key in subs:
subs[key] += subs[i][qv]
else:
subs[key] = subs[i][qv]
+
def print_subs(subs):
- """Print the substition table"""
- print("\tThe expected substition frequencies before and after scaling using the scaled qualities as probalities:")
- if subs["C"]!=0:
+ """Print the substitution table"""
+ print(
+ "\tThe expected substitution frequencies before and after scaling using the scaled qualities as probalities:"
+ )
+ if subs["C"] != 0:
# the special case of no substitutions
- print(("\tCT\t"+str(subs["CT-pvals_before"]/subs["C"])+"\t\t"+str(subs["CT-pvals"]/subs["C"])))
+ print(
+ "\tCT\t"
+ + str(subs["CT-pvals_before"] / subs["C"])
+ + "\t\t"
+ + str(subs["CT-pvals"] / subs["C"])
+ )
else:
print("\tCT\tNA\t\tNA")
- if subs["T"]!=0:
- print(("\tTC\t"+str(subs["TC-pvals"]/subs["T"])+"\t\t"+str(subs["TC-pvals"]/subs["T"])))
+ if subs["T"] != 0:
+ print(
+ "\tTC\t"
+ + str(subs["TC-pvals"] / subs["T"])
+ + "\t\t"
+ + str(subs["TC-pvals"] / subs["T"])
+ )
else:
print("\tTC\tNA\t\tNA")
- if subs["G"]!=0:
- print(("\tGA\t"+str(subs["GA-pvals_before"]/subs["G"])+"\t\t"+str(subs["GA-pvals"]/subs["G"])))
+ if subs["G"] != 0:
+ print(
+ "\tGA\t"
+ + str(subs["GA-pvals_before"] / subs["G"])
+ + "\t\t"
+ + str(subs["GA-pvals"] / subs["G"])
+ )
else:
print("\tGA\tNA\t\tNA")
- if subs["A"]!=0:
- print(("\tAG\t"+str(subs["AG-pvals"]/subs["A"])+"\t\t"+str(subs["AG-pvals"]/subs["A"])))
+ if subs["A"] != 0:
+ print(
+ "\tAG\t"
+ + str(subs["AG-pvals"] / subs["A"])
+ + "\t\t"
+ + str(subs["AG-pvals"] / subs["A"])
+ )
else:
print("\tAG\tNA\t\tNA")
print("\tQuality metrics before and after scaling")
- print(("\tCT-Q0 \t"+str(subs["CT-before-Q0"])+"\t\t"+str(subs["CT-after-Q0"])))
- print(("\tCT-Q10 \t"+str(subs["CT-before-Q10"])+"\t\t"+str(subs["CT-after-Q10"])))
- print(("\tCT-Q20 \t"+str(subs["CT-before-Q20"])+"\t\t"+str(subs["CT-after-Q20"])))
- print(("\tCT-Q30 \t"+str(subs["CT-before-Q30"])+"\t\t"+str(subs["CT-after-Q30"])))
- print(("\tCT-Q40 \t"+str(subs["CT-before-Q40"])+"\t\t"+str(subs["CT-after-Q40"])))
- print(("\tGA-Q0 \t"+str(subs["GA-before-Q0"])+"\t\t"+str(subs["GA-after-Q0"])))
- print(("\tGA-Q10 \t"+str(subs["GA-before-Q10"])+"\t\t"+str(subs["GA-after-Q10"])))
- print(("\tGA-Q20 \t"+str(subs["GA-before-Q20"])+"\t\t"+str(subs["GA-after-Q20"])))
- print(("\tGA-Q30 \t"+str(subs["GA-before-Q30"])+"\t\t"+str(subs["GA-after-Q30"])))
- print(("\tGA-Q40 \t"+str(subs["GA-before-Q40"])+"\t\t"+str(subs["GA-after-Q40"])))
-
-
-def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="both"):
+ print("\tCT-Q0 \t" + str(subs["CT-before-Q0"]) + "\t\t" + str(subs["CT-after-Q0"]))
+ print(
+ "\tCT-Q10 \t" + str(subs["CT-before-Q10"]) + "\t\t" + str(subs["CT-after-Q10"])
+ )
+ print(
+ "\tCT-Q20 \t" + str(subs["CT-before-Q20"]) + "\t\t" + str(subs["CT-after-Q20"])
+ )
+ print(
+ "\tCT-Q30 \t" + str(subs["CT-before-Q30"]) + "\t\t" + str(subs["CT-after-Q30"])
+ )
+ print(
+ "\tCT-Q40 \t" + str(subs["CT-before-Q40"]) + "\t\t" + str(subs["CT-after-Q40"])
+ )
+ print("\tGA-Q0 \t" + str(subs["GA-before-Q0"]) + "\t\t" + str(subs["GA-after-Q0"]))
+ print(
+ "\tGA-Q10 \t" + str(subs["GA-before-Q10"]) + "\t\t" + str(subs["GA-after-Q10"])
+ )
+ print(
+ "\tGA-Q20 \t" + str(subs["GA-before-Q20"]) + "\t\t" + str(subs["GA-after-Q20"])
+ )
+ print(
+ "\tGA-Q30 \t" + str(subs["GA-before-Q30"]) + "\t\t" + str(subs["GA-after-Q30"])
+ )
+ print(
+ "\tGA-Q40 \t" + str(subs["GA-before-Q40"]) + "\t\t" + str(subs["GA-after-Q40"])
+ )
+
+
+def rescale_qual_read(bam, read, ref, corr_prob, subs, debug=False, direction="both"):
"""
bam a pysam bam object
read a pysam read object
ref a pysam fasta ref file
reflengths a dictionary holding the length of the references
- subs a dictionary holding the corrected number of substition before and after scaling
+ subs a dictionary holding the corrected number of substitution before and after scaling
corr_prob dictionary from get_corr_prob
returns a read with rescaled quality score
@@ -225,8 +297,9 @@ def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="b
chrom = bam.getrname(read.tid)
refseq = ref.fetch(chrom, min(coordinate), max(coordinate)).upper()
# add gaps to qualities and mask read and reference nucleotides if below desired threshold
- (seq, qual, refseq) = mapdamage.align.align_with_qual(read.cigar, \
- raw_seq, read.qqual, -100, refseq)
+ (seq, qual, refseq) = mapdamage.align.align_with_qual(
+ read.cigar, raw_seq, read.qqual, -100, refseq
+ )
length_read = len(raw_seq)
length_align = len(seq)
# reverse complement read and reference when mapped reverse strand
@@ -234,43 +307,53 @@ def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="b
refseq = mapdamage.seq.revcomp(refseq)
seq = mapdamage.seq.revcomp(seq)
qual = qual[::-1]
- new_qual = [-100]*length_read
+ new_qual = [-100] * length_read
pos_on_read = 0
number_of_rescaled_bases = 0.0
- for (i, nt_seq, nt_ref, nt_qual) in zip(range(length_align), seq, refseq, qual):
+ for i, nt_seq, nt_ref, nt_qual in zip(range(length_align), seq, refseq, qual):
# rescale the quality according to the triplet position,
# pair of the reference and the sequence
- if ((nt_seq == "T" and nt_ref =="C") or (nt_seq == "A" and nt_ref =="G")):
+ if (nt_seq == "T" and nt_ref == "C") or (nt_seq == "A" and nt_ref == "G"):
# need to rescale this subs.
- pdam = 1 - corr_this_base(corr_prob, nt_seq, nt_ref, pos_on_read + 1, length_read,direction=direction)
+ pdam = 1 - corr_this_base(
+ corr_prob,
+ nt_seq,
+ nt_ref,
+ pos_on_read + 1,
+ length_read,
+ direction=direction,
+ )
pseq = 1 - phred_char_to_pval(nt_qual)
- newp = pdam*pseq # this could be numerically unstable
- newq = phred_pval_to_char(1-newp)
- number_of_rescaled_bases += 1-pdam
+ newp = pdam * pseq # this could be numerically unstable
+ newq = phred_pval_to_char(1 - newp)
+ number_of_rescaled_bases += 1 - pdam
else:
# don't rescale, other bases
newp = 1 - phred_char_to_pval(nt_qual)
newq = nt_qual
if pos_on_read < length_read:
new_qual[pos_on_read] = newq
- record_subs(subs,nt_seq,nt_ref,nt_qual,new_qual[pos_on_read],newp)
+ record_subs(subs, nt_seq, nt_ref, nt_qual, new_qual[pos_on_read], newp)
if nt_seq != "-":
pos_on_read += 1
# done with the aligned portion of the read
else:
if not debug:
- logger.warning("Warning: The aligment of the read is longer than the actual read %s",(read.qname))
+ logger.warning(
+ "Warning: The alignment of the read is longer than the actual read %s",
+ (read.qname),
+ )
break
new_qual = "".join(new_qual)
if read.is_reverse:
new_qual = new_qual[::-1]
- if (read.cigar[0][0] == 4):
+ if read.cigar[0][0] == 4:
# check for soft clipping at forward end
- new_qual = read.qual[0:read.cigar[0][1]] + new_qual
- if (read.cigar[-1][0] == 4):
+ new_qual = read.qual[0 : read.cigar[0][1]] + new_qual
+ if read.cigar[-1][0] == 4:
# the same backwards
- new_qual = new_qual + read.qual[-read.cigar[-1][1]:]
+ new_qual = new_qual + read.qual[-read.cigar[-1][1] :]
read.qual = new_qual
# truncate this to 5 digits
@@ -279,12 +362,12 @@ def rescale_qual_read(bam, read, ref, corr_prob,subs, debug = False,direction="b
if read.has_tag("MR"):
raise SystemExit("Read: %s already has a MR tag, can't rescale" % read)
- read.set_tag("MR", number_of_rescaled_bases, 'f')
+ read.set_tag("MR", number_of_rescaled_bases, "f")
return read
-def rescale_qual(ref, options,debug=False):
+def rescale_qual(ref, options, debug=False):
"""
ref a pysam fasta ref file
bam_filename name of a BAM/SAM file to read
@@ -297,7 +380,9 @@ def rescale_qual(ref, options,debug=False):
if not debug:
# no need to log when unit testing
logger = logging.getLogger(__name__)
- logger.info("Rescaling BAM: '%s' -> '%s'" % (options.filename, options.rescale_out))
+ logger.info(
+ "Rescaling BAM: '%s' -> '%s'" % (options.filename, options.rescale_out)
+ )
start_time = time.time()
# open SAM/BAM files
@@ -306,10 +391,12 @@ def rescale_qual(ref, options,debug=False):
write_mode = "wh"
else:
write_mode = "wb"
- bam_out = pysam.Samfile(options.rescale_out,write_mode, template = bam)
- corr_prob = get_corr_prob(options.folder,
- rescale_length_5p=options.rescale_length_5p,
- rescale_length_3p=options.rescale_length_3p)
+ bam_out = pysam.Samfile(options.rescale_out, write_mode, template=bam)
+ corr_prob = get_corr_prob(
+ options.folder,
+ rescale_length_5p=options.rescale_length_5p,
+ rescale_length_3p=options.rescale_length_3p,
+ )
subs = initialize_subs()
first_pair = True
number_of_non_proper_pairs = 0
@@ -318,40 +405,64 @@ def rescale_qual(ref, options,debug=False):
if hit.is_unmapped:
pass
elif not hit.qual and not debug:
- logger.warning("Cannot rescale base PHRED scores for read '%s'; no scores assigned." % hit.qname)
- elif hit.is_paired :
+ logger.warning(
+ "Cannot rescale base PHRED scores for read '%s'; no scores assigned."
+ % hit.qname
+ )
+ elif hit.is_paired:
if first_pair and not debug:
# assuming the ends are non-overlapping
- logger.warning("Warning! Assuming the pairs are non-overlapping, facing inwards and correctly paired.")
- first_pair=False
- #5p --------------> 3p
- #3p <-------------- 5p
+ logger.warning(
+ "Warning! Assuming the pairs are non-overlapping, facing inwards and correctly paired."
+ )
+ first_pair = False
+ # 5p --------------> 3p
+ # 3p <-------------- 5p
# pair 1 (inwards)
- #5p ---->
+ # 5p ---->
# <---- 5p
# A B
# pair 2 (outwards), this happens if the reference is RC this is not supported
# ----> 3p
- #3p <----
+ # 3p <----
# A B
# Correct outwards pairs from the 3p and inwards pairs with the 5p end
- if ((not hit.is_reverse) and hit.mate_is_reverse and (hit.pnext>hit.pos) and hit.tid==hit.mrnm):
+ if (
+ (not hit.is_reverse)
+ and hit.mate_is_reverse
+ and (hit.pnext > hit.pos)
+ and hit.tid == hit.mrnm
+ ):
# the inwards case mate A
- hit = rescale_qual_read(bam, hit, ref, corr_prob,subs,direction="forward",debug=debug)
- elif (hit.is_reverse and (not hit.mate_is_reverse) and (hit.pnext<hit.pos) and hit.tid==hit.mrnm):
+ hit = rescale_qual_read(
+ bam, hit, ref, corr_prob, subs, direction="forward", debug=debug
+ )
+ elif (
+ hit.is_reverse
+ and (not hit.mate_is_reverse)
+ and (hit.pnext < hit.pos)
+ and hit.tid == hit.mrnm
+ ):
# the inwards case mate B
- hit = rescale_qual_read(bam, hit, ref, corr_prob,subs,direction="forward",debug=debug)
+ hit = rescale_qual_read(
+ bam, hit, ref, corr_prob, subs, direction="forward", debug=debug
+ )
else:
number_of_non_proper_pairs += 1
# cannot do much with conflicting pairing information
else:
- hit = rescale_qual_read(bam, hit, ref, corr_prob,subs,debug=debug)
+ hit = rescale_qual_read(bam, hit, ref, corr_prob, subs, debug=debug)
bam_out.write(hit)
- if number_of_non_proper_pairs!=0 and not debug:
- logger.warning("Number of non-rescaled reads due to improper pairing: %d" % number_of_non_proper_pairs)
- if (subs["TC-before"] != subs["TC-after"] or subs["AG-before"] != subs["AG-after"]):
- sys.exit("Qualities for T.C and A.G transitions should not change in the rescaling, please contact the authors.")
+ if number_of_non_proper_pairs != 0 and not debug:
+ logger.warning(
+ "Number of non-rescaled reads due to improper pairing: %d"
+ % number_of_non_proper_pairs
+ )
+ if subs["TC-before"] != subs["TC-after"] or subs["AG-before"] != subs["AG-after"]:
+ sys.exit(
+ "Qualities for T.C and A.G transitions should not change in the rescaling, please contact the authors."
+ )
qual_summary_subs(subs)
bam.close()
bam_out.close()
=====================================
mapdamage/rscript.py
=====================================
@@ -1,76 +1,97 @@
+import logging
import os
import subprocess
+import time
from subprocess import CalledProcessError, check_call
-from mapdamage.version import __version__
+
import mapdamage
-import logging
-import time
+from mapdamage.version import __version__
-def construct_path(name,folder="Rscripts"):
+def construct_path(name, folder="Rscripts"):
"""Construct a path to the mapdamage package data given the name"""
- return(os.path.join(mapdamage.__path__[0], folder, name))
+ return os.path.join(mapdamage.__path__[0], folder, name)
-def plot(opt):
- """
- Calls the R script to make the plots, takes in the arguments
- from parameter processing
- """
- # comp<-args[1]
- # pdfout<-args[2]
- # around<-as.numeric(args[3])
- # misincorp<-args[4]
- # lg<-as.numeric(args[5])
- # ymax<-as.numeric(args[6])
- # folder<-args[7]
- # title<-args[8]
- # version<-args[9]
-
- fmut = opt.folder+"/"+"misincorporation.txt"
- fcomp = opt.folder+"/"+"dnacomp.txt"
- title = opt.folder+"/"+"Fragmisincorporation_plot.pdf"
-
- script = construct_path("mapDamage.R")
- call = ["Rscript", script, fcomp, title, opt.refplot, fmut, opt.readplot, \
- opt.ymax, opt.folder, opt.title, __version__]
- code = subprocess.call(list(map(str, call)))
-
- if code == 0:
- if not opt.quiet:
- print(("pdf %s generated" % title))
- return 0
- else:
- print("Error: plotting with R failed")
- return 1
+def plot(opt):
+ """
+ Calls the R script to make the plots, takes in the arguments
+ from parameter processing
+ """
+ # comp<-args[1]
+ # pdfout<-args[2]
+ # around<-as.numeric(args[3])
+ # misincorp<-args[4]
+ # lg<-as.numeric(args[5])
+ # ymax<-as.numeric(args[6])
+ # folder<-args[7]
+ # title<-args[8]
+ # version<-args[9]
+
+ fmut = opt.folder + "/" + "misincorporation.txt"
+ fcomp = opt.folder + "/" + "dnacomp.txt"
+ title = opt.folder + "/" + "Fragmisincorporation_plot.pdf"
+
+ script = construct_path("mapDamage.R")
+ call = [
+ "Rscript",
+ script,
+ fcomp,
+ title,
+ opt.refplot,
+ fmut,
+ opt.readplot,
+ opt.ymax,
+ opt.folder,
+ opt.title,
+ __version__,
+ ]
+ code = subprocess.call(list(map(str, call)))
+
+ if code == 0:
+ if not opt.quiet:
+ print(("pdf %s generated" % title))
+ return 0
+ else:
+ print("Error: plotting with R failed")
+ return 1
def opt_plots(opt):
- """ optional plots for length distribution
- and cumulative C>T mutations, per strand """
-
- fmut = opt.folder+"/"+"misincorporation.txt"
- flength = opt.folder+"/"+"lgdistribution.txt"
- output = opt.folder+"/"+"Length_plot.pdf"
-
- script = construct_path("lengths.R")
- call = ["Rscript", script, flength, output, fmut, opt.length, \
- opt.title, __version__, bool_to_int(opt.quiet)]
- code = subprocess.call(list(map(str, call)))
- if code == 0:
- return 0
- else:
- print("Error: plotting with R failed")
- return 1
+ """optional plots for length distribution
+ and cumulative C>T mutations, per strand"""
+
+ fmut = opt.folder + "/" + "misincorporation.txt"
+ flength = opt.folder + "/" + "lgdistribution.txt"
+ output = opt.folder + "/" + "Length_plot.pdf"
+
+ script = construct_path("lengths.R")
+ call = [
+ "Rscript",
+ script,
+ flength,
+ output,
+ fmut,
+ opt.length,
+ opt.title,
+ __version__,
+ bool_to_int(opt.quiet),
+ ]
+ code = subprocess.call(list(map(str, call)))
+ if code == 0:
+ return 0
+ else:
+ print("Error: plotting with R failed")
+ return 1
def check_R_one_lib(name):
"""Checks if a necessary R library is here."""
with open(os.devnull, "w") as devnull:
rpa = construct_path("stats/checkLibraries.R")
- return subprocess.call(["Rscript", rpa, "--args", name],
- stdout = devnull,
- stderr = devnull)
+ return subprocess.call(
+ ["Rscript", rpa, "--args", name], stdout=devnull, stderr=devnull
+ )
def check_R_lib():
@@ -83,18 +104,25 @@ def check_R_lib():
for lib in libs:
# Check the libraries
if check_R_one_lib(lib):
- #found a missing library
+ # found a missing library
missing_lib.append(lib)
if len(missing_lib) > 1:
# Grammar Nazi has arrived
last_ele = missing_lib.pop()
- print(("Missing the following R libraries '" + "', '".join(missing_lib) \
- + "' and '" + last_ele + "'"))
+ print(
+ (
+ "Missing the following R libraries '"
+ + "', '".join(missing_lib)
+ + "' and '"
+ + last_ele
+ + "'"
+ )
+ )
return 1
elif len(missing_lib) == 1:
- print(("Missing the following R library "+missing_lib[0]))
+ print(("Missing the following R library " + missing_lib[0]))
return 1
- else :
+ else:
# No missing libraries
return 0
@@ -103,31 +131,32 @@ def run_stats(opt):
"""
Runs the Bayesian estimation program, using the options opt
"""
- arg = ["Rscript", \
- construct_path("stats/runGeneral.R"), \
- "--args", \
- opt.rand, \
- opt.burn, \
- opt.adjust, \
- opt.iter, \
- bool_to_int(opt.forward), \
- bool_to_int(opt.reverse), \
- bool_to_int(not opt.var_disp), \
- bool_to_int(not opt.diff_hangs), \
- 0, \
- bool_to_int(opt.fix_nicks), \
- bool_to_int(not opt.single_stranded), \
- opt.seq_length, \
- opt.folder+"/", \
- construct_path("stats/"), \
- opt.folder+"/Stats_out", \
- int(opt.verbose), \
- int(opt.quiet), \
- int(opt.jukes_cantor), \
- opt.folder+"/acgt_ratio.csv", \
- int(opt.use_raw_nick_freq), \
- int(opt.theme_bw) \
- ]
+ arg = [
+ "Rscript",
+ construct_path("stats/runGeneral.R"),
+ "--args",
+ opt.rand,
+ opt.burn,
+ opt.adjust,
+ opt.iter,
+ bool_to_int(opt.forward),
+ bool_to_int(opt.reverse),
+ bool_to_int(not opt.var_disp),
+ bool_to_int(not opt.diff_hangs),
+ 0,
+ bool_to_int(opt.fix_nicks),
+ bool_to_int(not opt.single_stranded),
+ opt.seq_length,
+ opt.folder + "/",
+ construct_path("stats/"),
+ opt.folder + "/Stats_out",
+ int(opt.verbose),
+ int(opt.quiet),
+ int(opt.jukes_cantor),
+ opt.folder + "/acgt_ratio.csv",
+ int(opt.use_raw_nick_freq),
+ int(opt.theme_bw),
+ ]
arg = [str(i) for i in arg]
logger = logging.getLogger(__name__)
@@ -141,11 +170,13 @@ def run_stats(opt):
logger.error("The Bayesian statistics program failed to finish")
raise e
- logger.debug("Bayesian estimates completed in %f seconds" % (time.time() - start_time,))
+ logger.debug(
+ "Bayesian estimates completed in %f seconds" % (time.time() - start_time,)
+ )
return 0
def bool_to_int(boolean):
- """ return 0 for False and 1 for True """
+ """return 0 for False and 1 for True"""
result = 1 if boolean else 0
return result
=====================================
mapdamage/seq.py
=====================================
@@ -1,113 +1,146 @@
import sys
-import string
# from Martin Kircher, to complement DNA
-TABLE = str.maketrans('TGCAMRWSYKVHDBtgcamrwsykvhdb', \
- 'ACGTKYWSRMBDHVacgtkywsrmbdhv')
+TABLE = str.maketrans("TGCAMRWSYKVHDBtgcamrwsykvhdb", "ACGTKYWSRMBDHVacgtkywsrmbdhv")
LETTERS = ("A", "C", "G", "T")
-MUTATIONS = ('G>A', 'C>T', 'A>G', 'T>C', 'A>C', 'A>T', 'C>G', 'C>A', 'T>G',
- 'T>A', 'G>C', 'G>T', 'A>-', 'T>-', 'C>-', 'G>-', '->A', '->T',
- '->C', '->G', 'S')
-HEADER = LETTERS + ("Total", ) + MUTATIONS
+MUTATIONS = (
+ "G>A",
+ "C>T",
+ "A>G",
+ "T>C",
+ "A>C",
+ "A>T",
+ "C>G",
+ "C>A",
+ "T>G",
+ "T>A",
+ "G>C",
+ "G>T",
+ "A>-",
+ "T>-",
+ "C>-",
+ "G>-",
+ "->A",
+ "->T",
+ "->C",
+ "->G",
+ "S",
+)
+HEADER = LETTERS + ("Total",) + MUTATIONS
def write_fasta(read, ref, seq, refseq, start, end, before, after, fout):
- std = '-' if read.is_reverse else '+'
+ std = "-" if read.is_reverse else "+"
- # output coordinate in 1-based offset for 5'
- if len(before) > 0:
- fout.write(">%s\n%s\n" % (ref, before))
- fout.write(">%s:%d-%d\n%s\n>%s_%s\n%s\n" % (ref, start+1, end, refseq, read.qname, std, seq))
- if len(after) > 0:
- fout.write(">%s\n%s\n" % (ref, after))
+ # output coordinate in 1-based offset for 5'
+ if len(before) > 0:
+ fout.write(">%s\n%s\n" % (ref, before))
+ fout.write(
+ ">%s:%d-%d\n%s\n>%s_%s\n%s\n"
+ % (ref, start + 1, end, refseq, read.qname, std, seq)
+ )
+ if len(after) > 0:
+ fout.write(">%s\n%s\n" % (ref, after))
def revcomp(seq):
- """ return reverse complemented string """
- return seq.translate(TABLE)[::-1]
+ """return reverse complemented string"""
+ return seq.translate(TABLE)[::-1]
def record_lg(read, coordinate, tab):
- """ record global length distribution
- don't record paired reads as they are normally not used for aDNA """
- std = '-' if read.is_reverse else '+'
+ """record global length distribution
+ don't record paired reads as they are normally not used for aDNA"""
+ std = "-" if read.is_reverse else "+"
- length = (max(coordinate) - min(coordinate))
- if not read.is_paired:
- tab[std][length] = tab[std][length] + 1
+ length = max(coordinate) - min(coordinate)
+ if not read.is_paired:
+ tab[std][length] = tab[std][length] + 1
- return tab
+ return tab
def read_fasta_index(filename):
- """ from a fasta index file, fai, return dictionary of references:lengths """
- def print_err(msg, filename, line):
- sys.stderr.write("Error: %s\n" % msg)
- sys.stderr.write(" Filename: %s\n" % filename)
- sys.stderr.write(" Line: %s\n" % repr(line))
-
- fai = {}
- with open(filename, 'r') as handle:
- for line in handle:
- ref = line.split("\t")
- if len(ref) != 5:
- print_err("Line in fasta index contains wrong number of fields, found %i, expected 5:" \
- % len(ref), filename, line)
+ """from a fasta index file, fai, return dictionary of references:lengths"""
+
+ def print_err(msg, filename, line):
+ sys.stderr.write("Error: %s\n" % msg)
+ sys.stderr.write(" Filename: %s\n" % filename)
+ sys.stderr.write(" Line: %s\n" % repr(line))
+
+ fai = {}
+ with open(filename, "r") as handle:
+ for line in handle:
+ ref = line.split("\t")
+ if len(ref) != 5:
+ print_err(
+ "Line in fasta index contains wrong number of fields, found %i, expected 5:"
+ % len(ref),
+ filename,
+ line,
+ )
+ return None
+
+ try:
+ fai[ref[0]] = int(ref[1])
+ except ValueError:
+ print_err(
+ "Column 2 in FASTA index did not contain a number, found '%s':"
+ % ref[1],
+ filename,
+ line,
+ )
+ return None
+
+ if not fai:
+ sys.stderr.write(
+ "Error: Index for %r does contain any sequences.\n" % (filename,)
+ )
+ sys.stderr.write(" Please ensure that FASTA file is valid, and\n")
+ sys.stderr.write(" re-index file using 'samtool faidx'.\n")
return None
- try:
- fai[ref[0]] = int(ref[1])
- except ValueError:
- print_err("Column 2 in FASTA index did not contain a number, found '%s':" % ref[1], filename, line)
- return None
-
- if not fai:
- sys.stderr.write("Error: Index for %r does contain any sequences.\n" % (filename,))
- sys.stderr.write(" Please ensure that FASTA file is valid, and\n")
- sys.stderr.write(" re-index file using 'samtool faidx'.\n")
- return None
-
- return fai
+ return fai
def compare_sequence_dicts(fasta_dict, bam_dict):
- """Compares a FASTA and BAM sequence dictionary, and prints any differences.
- Returns true if all required sequences are found, false otherwise."""
- if fasta_dict == bam_dict:
- return True
-
- sys.stderr.write("Sequence dictionaries in FASTA/BAM files differ:\n")
- common = set(fasta_dict) & set(bam_dict)
- if not common:
- sys.stderr.write("FATAL ERROR: No sequences in common!\n")
- return False
-
- # Check that the lengths of required sequences match (fatal error)
- different = []
- for key in sorted(common):
- if fasta_dict[key] != bam_dict[key]:
- different.append((key, fasta_dict[key], bam_dict[key]))
-
- if different:
- sys.stderr.write("FATAL ERROR: Length of required sequences differ:\n")
- for values in different:
- sys.stderr.write(" - %s: %i bp vs %i bp\n" % values)
-
- # Check for sequences only found in the BAM file (fatal errors)
- bam_only = set(bam_dict) - common
- if bam_only:
- sys.stderr.write("FATAL ERROR: Sequences missing from FASTA dictionary:\n")
- for key in bam_only:
- sys.stderr.write(" - %s = %i bp\n" % (key, bam_dict[key]))
-
- # Check for sequences only found in the BAM file (fatal errors)
- fasta_only = set(fasta_dict) - common
- if fasta_only:
- sys.stderr.write("WARNING: FASTA file contains extra sequences:\n")
- for key in fasta_only:
- sys.stderr.write(" - %s = %i bp\n" % (key, fasta_dict[key]))
-
- sys.stderr.write("\n")
-
- return not (different or bam_only)
+ """Compares a FASTA and BAM sequence dictionary, and prints any differences.
+ Returns true if all required sequences are found, false otherwise."""
+ if fasta_dict == bam_dict:
+ return True
+
+ sys.stderr.write("Sequence dictionaries in FASTA/BAM files differ:\n")
+ common = set(fasta_dict) & set(bam_dict)
+ if not common:
+ sys.stderr.write("FATAL ERROR: No sequences in common!\n")
+ return False
+
+ # Check that the lengths of required sequences match (fatal error)
+ different = []
+ for key in sorted(common):
+ if fasta_dict[key] != bam_dict[key]:
+ different.append((key, fasta_dict[key], bam_dict[key]))
+
+ if different:
+ sys.stderr.write("FATAL ERROR: Length of required sequences differ:\n")
+ for values in different:
+ sys.stderr.write(" - %s: %i bp vs %i bp\n" % values)
+
+ # Check for sequences only found in the BAM file (fatal errors)
+ bam_only = set(bam_dict) - common
+ if bam_only:
+ sys.stderr.write("FATAL ERROR: Sequences missing from FASTA dictionary:\n")
+ for key in bam_only:
+ sys.stderr.write(" - %s = %i bp\n" % (key, bam_dict[key]))
+
+ # Check for sequences only found in the BAM file (fatal errors)
+ fasta_only = set(fasta_dict) - common
+ if fasta_only:
+ sys.stderr.write("WARNING: FASTA file contains extra sequences:\n")
+ for key in fasta_only:
+ sys.stderr.write(" - %s = %i bp\n" % (key, fasta_dict[key]))
+
+ sys.stderr.write("\n")
+
+ return not (different or bam_only)
=====================================
mapdamage/seqtk/README.md
=====================================
@@ -0,0 +1,13 @@
+Introduction
+------------
+
+Seqtk is a fast and lightweight tool for processing sequences in the FASTA or
+FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be
+optionally compressed by gzip.
+
+Written by Heng Li.
+
+This version was reduced to perform only the `comp` function and integrated to mapDamage.
+The original version can be found here
+https://github.com/lh3/seqtk
+
=====================================
mapdamage/seqtk/kseq.h
=====================================
@@ -0,0 +1,248 @@
+/* The MIT License
+
+ Copyright (c) 2008, 2009, 2011 Attractive Chaos <attractor at live.co.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+*/
+
+/* Last Modified: 2017-02-11 */
+
+#ifndef AC_KSEQ_H
+#define AC_KSEQ_H
+
+#include <ctype.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#define KS_SEP_SPACE 0 // isspace(): \t, \n, \v, \f, \r
+#define KS_SEP_TAB 1 // isspace() && !' '
+#define KS_SEP_LINE 2 // line separator: "\n" (Unix) or "\r\n" (Windows)
+#define KS_SEP_MAX 2
+
+#define __KS_TYPE(type_t) \
+ typedef struct __kstream_t { \
+ unsigned char *buf; \
+ int begin, end, is_eof; \
+ type_t f; \
+ } kstream_t;
+
+#define ks_err(ks) ((ks)->end < 0)
+#define ks_eof(ks) ((ks)->is_eof && (ks)->begin >= (ks)->end)
+#define ks_rewind(ks) ((ks)->is_eof = (ks)->begin = (ks)->end = 0)
+
+#define __KS_BASIC(type_t, __bufsize) \
+ static inline kstream_t *ks_init(type_t f) \
+ { \
+ kstream_t *ks = (kstream_t*)calloc(1, sizeof(kstream_t)); \
+ ks->f = f; \
+ ks->buf = (unsigned char*)malloc(__bufsize); \
+ return ks; \
+ } \
+ static inline void ks_destroy(kstream_t *ks) \
+ { \
+ if (ks) { \
+ free(ks->buf); \
+ free(ks); \
+ } \
+ }
+
+#define __KS_GETC(__read, __bufsize) \
+ static inline int ks_getc(kstream_t *ks) \
+ { \
+ if (ks_err(ks)) return -3; \
+ if (ks_eof(ks)) return -1; \
+ if (ks->begin >= ks->end) { \
+ ks->begin = 0; \
+ ks->end = __read(ks->f, ks->buf, __bufsize); \
+ if (ks->end == 0) { ks->is_eof = 1; return -1; } \
+ else if (ks->end < 0) { ks->is_eof = 1; return -3; } \
+ } \
+ return (int)ks->buf[ks->begin++]; \
+ }
+
+#ifndef KSTRING_T
+#define KSTRING_T kstring_t
+typedef struct __kstring_t {
+ size_t l, m;
+ char *s;
+} kstring_t;
+#endif
+
+#ifndef kroundup32
+#define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
+#endif
+
+#ifndef kroundup64
+#define kroundup64(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, (x)|=(x)>>32, ++(x))
+#endif
+
+#define __KS_GETUNTIL(__read, __bufsize) \
+ static int64_t ks_getuntil2(kstream_t *ks, int delimiter, kstring_t *str, int *dret, int append) \
+ { \
+ int gotany = 0; \
+ if (dret) *dret = 0; \
+ str->l = append? str->l : 0; \
+ for (;;) { \
+ int i; \
+ if (ks_err(ks)) return -3; \
+ if (ks->begin >= ks->end) { \
+ if (!ks->is_eof) { \
+ ks->begin = 0; \
+ ks->end = __read(ks->f, ks->buf, __bufsize); \
+ if (ks->end == 0) { ks->is_eof = 1; break; } \
+ if (ks->end == -1) { ks->is_eof = 1; return -3; } \
+ } else break; \
+ } \
+ if (delimiter == KS_SEP_LINE) { \
+ unsigned char *sep = (unsigned char*)memchr(ks->buf + ks->begin, '\n', ks->end - ks->begin); \
+ i = sep != NULL ? sep - ks->buf : ks->end; \
+ } else if (delimiter > KS_SEP_MAX) { \
+ for (i = ks->begin; i < ks->end; ++i) \
+ if (ks->buf[i] == delimiter) break; \
+ } else if (delimiter == KS_SEP_SPACE) { \
+ for (i = ks->begin; i < ks->end; ++i) \
+ if (isspace(ks->buf[i])) break; \
+ } else if (delimiter == KS_SEP_TAB) { \
+ for (i = ks->begin; i < ks->end; ++i) \
+ if (isspace(ks->buf[i]) && ks->buf[i] != ' ') break; \
+ } else i = 0; /* never come to here! */ \
+ if (str->m - str->l < (size_t)(i - ks->begin + 1)) { \
+ str->m = str->l + (i - ks->begin) + 1; \
+ kroundup64(str->m); \
+ str->s = (char*)realloc(str->s, str->m); \
+ } \
+ gotany = 1; \
+ memcpy(str->s + str->l, ks->buf + ks->begin, i - ks->begin); \
+ str->l = str->l + (i - ks->begin); \
+ ks->begin = i + 1; \
+ if (i < ks->end) { \
+ if (dret) *dret = ks->buf[i]; \
+ break; \
+ } \
+ } \
+ if (!gotany && ks_eof(ks)) return -1; \
+ if (str->s == 0) { \
+ str->m = 1; \
+ str->s = (char*)calloc(1, 1); \
+ } else if (delimiter == KS_SEP_LINE && str->l > 1 && str->s[str->l-1] == '\r') --str->l; \
+ str->s[str->l] = '\0'; \
+ return str->l; \
+ } \
+ static inline int ks_getuntil(kstream_t *ks, int delimiter, kstring_t *str, int *dret) \
+ { return ks_getuntil2(ks, delimiter, str, dret, 0); }
+
+#define KSTREAM_INIT(type_t, __read, __bufsize) \
+ __KS_TYPE(type_t) \
+ __KS_BASIC(type_t, __bufsize) \
+ __KS_GETC(__read, __bufsize) \
+ __KS_GETUNTIL(__read, __bufsize)
+
+#define kseq_rewind(ks) ((ks)->last_char = (ks)->f->is_eof = (ks)->f->begin = (ks)->f->end = 0)
+
+#define __KSEQ_BASIC(SCOPE, type_t) \
+ SCOPE kseq_t *kseq_init(type_t fd) \
+ { \
+ kseq_t *s = (kseq_t*)calloc(1, sizeof(kseq_t)); \
+ s->f = ks_init(fd); \
+ return s; \
+ } \
+ SCOPE void kseq_destroy(kseq_t *ks) \
+ { \
+ if (!ks) return; \
+ free(ks->name.s); free(ks->comment.s); free(ks->seq.s); free(ks->qual.s); \
+ ks_destroy(ks->f); \
+ free(ks); \
+ }
+
+/* Return value:
+ >=0 length of the sequence (normal)
+ -1 end-of-file
+ -2 truncated quality string
+ -3 error reading stream
+ */
+#define __KSEQ_READ(SCOPE) \
+ SCOPE int64_t kseq_read(kseq_t *seq) \
+ { \
+ int c,r; \
+ kstream_t *ks = seq->f; \
+ if (seq->last_char == 0) { /* then jump to the next header line */ \
+ while ((c = ks_getc(ks)) >= 0 && c != '>' && c != '@'); \
+ if (c < 0) return c; /* end of file or error*/ \
+ seq->last_char = c; \
+ } /* else: the first header char has been read in the previous call */ \
+ seq->comment.l = seq->seq.l = seq->qual.l = 0; /* reset all members */ \
+ if ((r=ks_getuntil(ks, 0, &seq->name, &c)) < 0) return r; /* normal exit: EOF or error */ \
+ if (c != '\n') ks_getuntil(ks, KS_SEP_LINE, &seq->comment, 0); /* read FASTA/Q comment */ \
+ if (seq->seq.s == 0) { /* we can do this in the loop below, but that is slower */ \
+ seq->seq.m = 256; \
+ seq->seq.s = (char*)malloc(seq->seq.m); \
+ } \
+ while ((c = ks_getc(ks)) >= 0 && c != '>' && c != '+' && c != '@') { \
+ if (c == '\n') continue; /* skip empty lines */ \
+ seq->seq.s[seq->seq.l++] = c; /* this is safe: we always have enough space for 1 char */ \
+ ks_getuntil2(ks, KS_SEP_LINE, &seq->seq, 0, 1); /* read the rest of the line */ \
+ } \
+ if (c == '>' || c == '@') seq->last_char = c; /* the first header char has been read */ \
+ if (seq->seq.l + 1 >= seq->seq.m) { /* seq->seq.s[seq->seq.l] below may be out of boundary */ \
+ seq->seq.m = seq->seq.l + 2; \
+ kroundup64(seq->seq.m); /* rounded to the next closest 2^k */ \
+ seq->seq.s = (char*)realloc(seq->seq.s, seq->seq.m); \
+ } \
+ seq->seq.s[seq->seq.l] = 0; /* null terminated string */ \
+ seq->is_fastq = (c == '+'); \
+ if (!seq->is_fastq) return seq->seq.l; /* FASTA */ \
+ if (seq->qual.m < seq->seq.m) { /* allocate memory for qual in case insufficient */ \
+ seq->qual.m = seq->seq.m; \
+ seq->qual.s = (char*)realloc(seq->qual.s, seq->qual.m); \
+ } \
+ while ((c = ks_getc(ks)) >= 0 && c != '\n'); /* skip the rest of '+' line */ \
+ if (c == -1) return -2; /* error: no quality string */ \
+ while ((c = ks_getuntil2(ks, KS_SEP_LINE, &seq->qual, 0, 1) >= 0 && seq->qual.l < seq->seq.l)); \
+ if (c == -3) return -3; /* stream error */ \
+ seq->last_char = 0; /* we have not come to the next header line */ \
+ if (seq->seq.l != seq->qual.l) return -2; /* error: qual string is of a different length */ \
+ return seq->seq.l; \
+ }
+
+#define __KSEQ_TYPE(type_t) \
+ typedef struct { \
+ kstring_t name, comment, seq, qual; \
+ int last_char, is_fastq; \
+ kstream_t *f; \
+ } kseq_t;
+
+#define KSEQ_INIT2(SCOPE, type_t, __read) \
+ KSTREAM_INIT(type_t, __read, 16384) \
+ __KSEQ_TYPE(type_t) \
+ __KSEQ_BASIC(SCOPE, type_t) \
+ __KSEQ_READ(SCOPE)
+
+#define KSEQ_INIT(type_t, __read) KSEQ_INIT2(static, type_t, __read)
+
+#define KSEQ_DECLARE(type_t) \
+ __KS_TYPE(type_t) \
+ __KSEQ_TYPE(type_t) \
+ extern kseq_t *kseq_init(type_t fd); \
+ void kseq_destroy(kseq_t *ks); \
+ int64_t kseq_read(kseq_t *seq);
+
+#endif
=====================================
mapdamage/seqtk/seqtk.c
=====================================
@@ -0,0 +1,169 @@
+/* The MIT License
+
+ Copyright (c) 20082-2012 by Heng Li <lh3 at me.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+*/
+#define PY_SSIZE_T_CLEAN
+#include <Python.h>
+#include <ctype.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <zlib.h>
+
+#include "kseq.h"
+KSEQ_INIT(gzFile, gzread)
+
+int
+mapdamage_set(PyObject *dict, const char *key, PyObject *value)
+{
+ if (value == NULL) {
+ return 1;
+ }
+
+ int result = PyDict_SetItemString(dict, key, value);
+ Py_DECREF(value);
+
+ return result;
+}
+
+int
+mapdamage_set_long(PyObject *dict, const char *key, long value)
+{
+ return mapdamage_set(dict, key, PyLong_FromLong(value));
+}
+
+static PyObject *
+mapdamage_stk_comp(PyObject *self, PyObject *args)
+{
+ (void)self;
+
+ char *filename = NULL;
+ if (PyArg_ParseTuple(args, "s", &filename) == 0) {
+ return NULL;
+ }
+
+ gzFile fp = gzopen(filename, "r");
+ if (fp == NULL) {
+ PyErr_SetFromErrnoWithFilename(PyExc_OSError, filename);
+ return NULL;
+ }
+
+ // Cannot fail; kseq_init will segfault if calloc fails
+ kseq_t *seq = kseq_init(fp);
+
+ long int l = 0;
+ int sig = 1;
+ PyObject *result = PyList_New(0);
+ if (!result) {
+ goto Cleanup;
+ }
+
+ long counts[256];
+ while (!(sig = PyErr_CheckSignals()) && (l = kseq_read(seq)) >= 0) {
+ memset(counts, 0, 256 * sizeof(long));
+ for (long i = 0; i < l; ++i) {
+ counts[(unsigned char)seq->seq.s[i]]++;
+ }
+
+ // If we break it will have been due to an Py* failure
+ sig = 1;
+
+ PyObject *stats = PyDict_New();
+ if (stats == NULL || PyList_Append(result, stats) != 0) {
+ Py_XDECREF(stats);
+ break;
+ }
+
+ PyObject *name = PyUnicode_FromString(seq->name.s);
+ if (mapdamage_set(stats, "name", name) != 0 ||
+ mapdamage_set_long(stats, "len", l) != 0 ||
+ mapdamage_set_long(stats, "A", counts['A'] + counts['a']) != 0 ||
+ mapdamage_set_long(stats, "C", counts['C'] + counts['c']) != 0 ||
+ mapdamage_set_long(stats, "G", counts['G'] + counts['g']) != 0 ||
+ mapdamage_set_long(stats, "T", counts['T'] + counts['t']) != 0) {
+ Py_DECREF(stats);
+ break;
+ }
+ }
+
+Cleanup:
+ kseq_destroy(seq);
+ const int gzerr = gzclose(fp);
+
+ if (gzerr != Z_OK || sig || l != -1) {
+ Py_XDECREF(result);
+
+ if (gzerr != Z_OK) {
+ switch (gzerr) {
+ case Z_STREAM_ERROR:
+ PyErr_SetString(PyExc_RuntimeError,
+ "Stream error while reading gzip file");
+ break;
+ case Z_ERRNO:
+ PyErr_SetFromErrnoWithFilename(PyExc_OSError, filename);
+ break;
+ case Z_BUF_ERROR:
+ PyErr_SetString(PyExc_RuntimeError,
+ "Buffer error while reading gzip file");
+ break;
+ default:
+ PyErr_SetString(PyExc_RuntimeError,
+ "Unexpected zlib error");
+ }
+ }
+ else if (l == -3) {
+ PyErr_SetFromErrnoWithFilename(PyExc_OSError, filename);
+ }
+ else if (l == -2) {
+ PyErr_SetString(PyExc_RuntimeError, "Malformed file");
+ }
+
+ return NULL;
+ }
+
+ return result;
+}
+
+static PyMethodDef SeqtkMethods[] = {
+ {"comp", mapdamage_stk_comp, METH_VARARGS,
+ "Wrapper around seqtk stk_comp function"},
+ {NULL, NULL, 0, NULL},
+};
+
+static struct PyModuleDef seqtkmodule = {
+ .m_base = PyModuleDef_HEAD_INIT,
+ .m_name = "seqtk",
+ .m_doc = "Python interface to seqtk functions",
+ .m_size = 0,
+ .m_methods = SeqtkMethods,
+ .m_slots = NULL,
+ .m_traverse = NULL,
+ .m_clear = NULL,
+ .m_free = NULL,
+};
+
+PyMODINIT_FUNC
+PyInit_seqtk(void)
+{
+ return PyModule_Create(&seqtkmodule);
+}
=====================================
mapdamage/tables.py
=====================================
@@ -1,114 +1,135 @@
-import os
import collections
+import os
import mapdamage
from mapdamage.version import __version__
def initialize_mut(ref, length):
- tab = {}
- for contig in ref:
- tab_contig = tab[contig] = {}
- for end in ('5p','3p'):
- tab_end = tab_contig[end] = {}
- for std in ('+','-'):
- tab_std = tab_end[std] = {}
- for mut in mapdamage.seq.HEADER:
- tab_std[mut] = dict.fromkeys(range(length), 0)
+ tab = {}
+ for contig in ref:
+ tab_contig = tab[contig] = {}
+ for end in ("5p", "3p"):
+ tab_end = tab_contig[end] = {}
+ for std in ("+", "-"):
+ tab_std = tab_end[std] = {}
+ for mut in mapdamage.seq.HEADER:
+ tab_std[mut] = dict.fromkeys(range(length), 0)
- return tab
+ return tab
def print_mut(mut, opt, out):
- _print_freq_table(mut, mapdamage.seq.HEADER, opt, out, offset = 1)
+ _print_freq_table(mut, mapdamage.seq.HEADER, opt, out, offset=1)
def initialize_comp(ref, around, length):
- keys = {"3p" : list(range(-length, 0)) + list(range(1, around + 1)),
- "5p" : list(range(-around, 0)) + list(range(1, length + 1))}
+ keys = {
+ "3p": list(range(-length, 0)) + list(range(1, around + 1)),
+ "5p": list(range(-around, 0)) + list(range(1, length + 1)),
+ }
- tab = {}
- for contig in ref:
- tab_contig = tab[contig] = {}
- for end in ('5p','3p'):
- tab_end = tab_contig[end] = {}
- for std in ('+','-'):
- tab_std = tab_end[std] = {}
- for letters in mapdamage.seq.LETTERS:
- tab_std[letters] = dict.fromkeys(keys[end], 0)
+ tab = {}
+ for contig in ref:
+ tab_contig = tab[contig] = {}
+ for end in ("5p", "3p"):
+ tab_end = tab_contig[end] = {}
+ for std in ("+", "-"):
+ tab_std = tab_end[std] = {}
+ for letters in mapdamage.seq.LETTERS:
+ tab_std[letters] = dict.fromkeys(keys[end], 0)
- return tab
+ return tab
def print_comp(comp, opt, out):
- columns = mapdamage.seq.LETTERS + ("Total",)
- _print_freq_table(comp, columns, opt, out)
+ columns = mapdamage.seq.LETTERS + ("Total",)
+ _print_freq_table(comp, columns, opt, out)
def initialize_lg():
- tab = {}
- for std in ('+','-'):
- tab[std] = collections.defaultdict(int)
+ tab = {}
+ for std in ("+", "-"):
+ tab[std] = collections.defaultdict(int)
- return tab
+ return tab
def print_lg(tab, opt, out):
- out.write("# table produced by mapDamage version %s\n" % __version__)
- out.write("# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref))
- if opt.minqual != 0:
- out.write("# Quality filtering of bases with a Phred score < %d\n" % opt.minqual)
- out.write("# Std: strand of reads\n")
- out.write("Std\tLength\tOccurences \n")
- for std in tab:
- for i in tab[std]:
- out.write("%s\t%d\t%d\n" % (std, i, tab[std][i]))
+ out.write("# table produced by mapDamage version %s\n" % __version__)
+ out.write(
+ "# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref)
+ )
+ if opt.minqual != 0:
+ out.write(
+ "# Quality filtering of bases with a Phred score < %d\n" % opt.minqual
+ )
+ out.write("# Std: strand of reads\n")
+ out.write("Std\tLength\tOccurences \n")
+ for std in tab:
+ for i in tab[std]:
+ out.write("%s\t%d\t%d\n" % (std, i, tab[std][i]))
def check_table_and_warn_if_dmg_freq_is_low(folder):
- """ Returns true if the damage frequencies are too low to allow
- Bayesian estimation of DNA damages, i.e < 1% at first position.
- """
- total = 0.0
- for filename in ("5pCtoT_freq.txt", "3pGtoA_freq.txt"):
- if not os.path.exists(folder+"/"+filename):
- print(("Error: Required table has not been created ('%s'), bayesian computation cannot be performed" \
- % filename))
- return True
-
- with open(os.path.join(folder, filename)) as handle:
- for line in handle:
- freq = line.strip().split('\t')
- if freq[0] == "1":
- total += float(freq[1])
- break
- else:
- print(("Error: Could not find pos = 1 in table '%s', bayesian computation cannot be performed" \
- % filename))
- return True
-
- if total < 0.01:
- print(("Warning: DNA damage levels are too low, the Bayesian computation should not be performed (%f < 0.01)\n" % total))
-
- return False
-
-
-def _print_freq_table(table, columns, opt, out, offset = 0):
- out.write("# table produced by mapDamage version %s\n" % __version__)
- out.write("# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref))
- if opt.minqual != 0:
- out.write("# Bases with a Phred score < %d were filtered out\n" % opt.minqual)
- out.write("# Chr: reference from sam/bam header, End: from which termini of DNA sequences, Std: strand of reads\n")
- out.write("Chr\tEnd\tStd\tPos\t%s\n" % ("\t".join(columns)))
-
- for (reference, ends) in sorted(table.items()):
- for (end, strands) in sorted(ends.items()):
- for (strand, subtable) in sorted(strands.items()):
- subtable["Total"] = {}
- for index in sorted(subtable[columns[0]]):
- subtable["Total"][index] = sum(subtable[letter][index] for letter in mapdamage.seq.LETTERS)
-
- out.write("%s\t%s\t%s\t%d" % (reference, end, strand, index + offset))
- for base in columns:
- out.write("\t%d" % subtable[base][index])
- out.write("\n")
+ """Returns true if the damage frequencies are too low to allow
+ Bayesian estimation of DNA damages, i.e < 1% at first position.
+ """
+ total = 0.0
+ for filename in ("5pCtoT_freq.txt", "3pGtoA_freq.txt"):
+ if not os.path.exists(folder + "/" + filename):
+ print(
+ "Error: Required table has not been created ('%s'), bayesian computation cannot be performed"
+ % filename
+ )
+ return True
+
+ with open(os.path.join(folder, filename)) as handle:
+ for line in handle:
+ freq = line.strip().split("\t")
+ if freq[0] == "1":
+ total += float(freq[1])
+ break
+ else:
+ print(
+ "Error: Could not find pos = 1 in table '%s', bayesian computation cannot be performed"
+ % filename
+ )
+ return True
+
+ if total < 0.01:
+ print(
+ "Warning: DNA damage levels are too low, the Bayesian computation should not be performed (%f < 0.01)\n"
+ % total
+ )
+
+ return False
+
+
+def _print_freq_table(table, columns, opt, out, offset=0):
+ out.write("# table produced by mapDamage version %s\n" % __version__)
+ out.write(
+ "# using mapped file %s and %s as reference file\n" % (opt.filename, opt.ref)
+ )
+ if opt.minqual != 0:
+ out.write("# Bases with a Phred score < %d were filtered out\n" % opt.minqual)
+ out.write(
+ "# Chr: reference from sam/bam header, End: from which termini of DNA sequences, Std: strand of reads\n"
+ )
+ out.write("Chr\tEnd\tStd\tPos\t%s\n" % ("\t".join(columns)))
+
+ for reference, ends in sorted(table.items()):
+ for end, strands in sorted(ends.items()):
+ for strand, subtable in sorted(strands.items()):
+ subtable["Total"] = {}
+ for index in sorted(subtable[columns[0]]):
+ subtable["Total"][index] = sum(
+ subtable[letter][index] for letter in mapdamage.seq.LETTERS
+ )
+
+ out.write(
+ "%s\t%s\t%s\t%d" % (reference, end, strand, index + offset)
+ )
+ for base in columns:
+ out.write("\t%d" % subtable[base][index])
+ out.write("\n")
=====================================
mapdamage/tests/fake1.fasta.fai
=====================================
@@ -0,0 +1 @@
+fake1 201 7 60 61
=====================================
mapdamage/version.py
=====================================
@@ -1,5 +1 @@
-#!/usr/bin/env python3
-try:
- from mapdamage._version import __version__
-except ImportError:
- __version__ = "2.2.2"
+__version__ = "2.2.3"
=====================================
pyproject.toml
=====================================
@@ -0,0 +1,49 @@
+[build-system]
+requires = ["setuptools>=42", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "mapdamage"
+dynamic = ["version"]
+
+authors = [
+ { name = "Aurélien Ginolhac" },
+ { name = "Mikkel Schubert" },
+ { name = "Hákon Jónsson" },
+]
+# To be replaced with `license = "MIT"` when support for py <= 3.9 is dropped
+license = { "text" = "MIT" }
+description = "mapDamage tracks and quantifies DNA damage patterns in ancient DNA sequencing reads generated by Next-Generation Sequencing platforms"
+readme = { file = "README.md", content-type = "text/markdown" }
+
+requires-python = ">=3.7"
+dependencies = ["pysam"]
+
+classifiers = [
+ "Development Status :: 5 - Production/Stable",
+ "Intended Audience :: Science/Research",
+ "Topic :: Scientific/Engineering :: Bio-Informatics",
+ "Programming Language :: Python :: 3 :: Only",
+ "Programming Language :: Python :: 3.7",
+ "Programming Language :: Python :: 3.8",
+ "Programming Language :: Python :: 3.9",
+ "Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
+ "Programming Language :: Python :: 3.12",
+ "Programming Language :: Python :: 3.13",
+]
+
+[project.urls]
+Homepage = "https://github.com/ginolhac/mapDamage"
+Documentation = "https://ginolhac.github.io/mapDamage/"
+Repository = "https://github.com/ginolhac/mapDamage.git"
+Issues = "https://github.com/ginolhac/mapDamage/issues"
+
+[project.scripts]
+mapDamage = "mapdamage.main:main"
+
+[tool.setuptools.package-data]
+mapdamage = ["Rscripts/*.R", "Rscripts/stats/*.R", "tests/**"]
+
+[tool.setuptools.dynamic]
+version = { attr = "mapdamage.version.__version__" }
=====================================
setup.py
=====================================
@@ -1,73 +1,12 @@
#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-
-from distutils.core import setup
-from distutils.command.install import install as DistutilsInstall
-import os
-import sys
-import subprocess
-
-if sys.version_info < (3, 5):
- print("At least Python 3.5 is required.\n", file=sys.stderr)
- exit(1)
-
-def compile_seqtk():
- """Compiling the seqtk toolkit"""
- old_wd = os.getcwd()
- new_wd = os.path.join(old_wd,"mapdamage","seqtk")
- os.chdir(new_wd)
- if not os.path.isfile("seqtk.c"):
- raise SystemExit("Cannot find seqtk.c")
- if (os.path.isfile("seqtk")):
- os.system("rm seqtk")
- xs = os.system("make -f Makefile")
- os.chdir(old_wd)
- if (xs != 0):
- raise SystemExit("Cannot compile seqtk")
-
-
-def setup_version():
- if not os.path.exists(".git"):
- # Release version, no .git folder
- return
-
- try:
- version = subprocess.check_output(("git", "describe", "--always", "--tags", "--dirty"))
- with open(os.path.join("mapdamage", "_version.py"), "w") as handle:
- handle.write("#!/usr/bin/env python\n")
- handle.write("__version__ = %r\n" % (version.decode("utf-8").strip(),))
- except (subprocess.CalledProcessError, OSError) as error:
- raise SystemExit("Could not determine mapDamage version: %s" % (error,))
-
-
-class compileInstall(DistutilsInstall):
- # extension of the class to account for an extra compiling step
- def run(self):
- self.record=""
- setup_version()
- compile_seqtk()
- DistutilsInstall.run(self)
- # fixing the permission problem of seqtk
- files = self.get_outputs()
- for fi in files:
- if fi.endswith("seqtk/seqtk"):
- os.chmod(fi,0o755)
-
-
-
-
+from setuptools import Extension, setup
setup(
- cmdclass={'install': compileInstall},
- name='mapdamage',
- version='2.2.2',
- author='Aurélien Ginolhac, Mikkel Schubert, Hákon Jónsson',
- author_email='MSchubert at snm.ku.dk, jonsson.hakon at gmail.com',
- packages=['mapdamage'],
- package_data={'mapdamage': ['Rscripts/*.R','Rscripts/stats/*.R','tests/*','seqtk/seqtk']},
- scripts=['bin/mapDamage'],
- url='https://github.com/ginolhac/mapDamage',
- license='LICENSE.txt',
- description='mapDamage tracks and quantify DNA damage pattern among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms',
- long_description=open('README.md').read()
+ ext_modules=[
+ Extension(
+ "mapdamage.seqtk",
+ sources=["mapdamage/seqtk/seqtk.c"],
+ libraries=["z"],
+ )
+ ],
)
View it on GitLab: https://salsa.debian.org/med-team/mapdamage/-/commit/73e4fcdb81df37d20ac00bdf4fa63b25f68b8fba
--
View it on GitLab: https://salsa.debian.org/med-team/mapdamage/-/commit/73e4fcdb81df37d20ac00bdf4fa63b25f68b8fba
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20251020/273efeeb/attachment-0001.htm>
More information about the debian-med-commit
mailing list