[med-svn] [Git][med-team/freebayes][upstream] New upstream version 1.3.2

Andreas Tille gitlab at salsa.debian.org
Tue Dec 17 13:58:29 GMT 2019



Andreas Tille pushed to branch upstream at Debian Med / freebayes


Commits:
0d52de8e by Andreas Tille at 2019-12-17T09:46:24Z
New upstream version 1.3.2
- - - - -


8 changed files:

- README.md
- python/allelebayes.py
- python/dirichlet.py
- python/hwe.py
- scripts/coverage_to_regions.py
- scripts/fasta_generate_regions.py
- src/AlleleParser.cpp
- src/Parameters.cpp


Changes:

=====================================
README.md
=====================================
@@ -7,7 +7,7 @@
 
 ## Overview
 
-[*FreeBayes*](http://arxiv.org/abs/1207.3907) is a 
+[*freebayes*](http://arxiv.org/abs/1207.3907) is a 
 [Bayesian](http://en.wikipedia.org/wiki/Bayesian_inference) genetic variant 
 detector designed to find small polymorphisms, specifically SNPs 
 (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs 
@@ -15,7 +15,7 @@ detector designed to find small polymorphisms, specifically SNPs
 substitution events) smaller than the length of a short-read sequencing 
 alignment.
 
-*FreeBayes* is haplotype-based, in the sense that it calls variants based on 
+*freebayes* is haplotype-based, in the sense that it calls variants based on 
 the literal sequences of reads aligned to a particular target, not their 
 precise alignment.  This model is a straightforward generalization of previous 
 ones (e.g. PolyBayes, samtools, GATK) which detect or report variants based on 
@@ -25,7 +25,7 @@ alignments:
 
 <img src="https://github.com/ekg/freebayes/raw/v1.3.0/paper/haplotype_calling.png" width=500/>
 
-*FreeBayes* uses short-read alignments 
+*freebayes* uses short-read alignments 
 ([BAM](http://samtools.sourceforge.net/SAMv1.pdf) files with 
 [Phred+33](http://en.wikipedia.org/wiki/Phred_quality_score) encoded quality 
 scores, now standard) for any number of individuals from a population and a 
@@ -41,123 +41,70 @@ variation across the samples under analysis.
 
 ## Citing freebayes
 
-A preprint [Haplotype-based variant detection from short-read 
-sequencing](http://arxiv.org/abs/1207.3907) provides an overview of the 
-statistical models
-used in FreeBayes.  We ask that you cite this paper if you use FreeBayes in
-work that leads to publication.
+A preprint [Haplotype-based variant detection from short-read sequencing](http://arxiv.org/abs/1207.3907) provides an overview of the 
+statistical models used in freebayes.
+We ask that you cite this paper if you use freebayes in work that leads to publication.
+This preprint is used for documentation and citation.
+freebayes was never submitted for review, but has been used in over 1000 publications.
 
 Please use this citation format:
 
-Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing.
-*arXiv preprint arXiv:1207.3907 [q-bio.GN]* 2012
+Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. *arXiv preprint arXiv:1207.3907 [q-bio.GN]* 2012
 
-If possible, please also refer to the version number provided by freebayes when 
-it is run without arguments or with the `--help` option.  For example, you 
-should see something like this:
+If possible, please also refer to the version number provided by freebayes when it is run without arguments or with the `--help` option.
+For example, you should see something like this:
 
-    version:  v0.9.10-3-g47a713e
+    version:  v1.3.1-1-g5eb71a3-dirty
 
-This provides both a point release number and a git commit id, which will 
-ensure precise reproducibility of results.
+This provides both a point release number and a git commit id, which will ensure precise reproducibility of results.
 
+## Download
 
-## Obtaining
+Precompiled static binaries are available for [freebayes relases](https://github.com/ekg/freebayes/releases).
+Most users should simply download these and run them.
 
-To download FreeBayes, please use git to download the most recent development
-tree.  Currently, the tree is hosted on github, and can be obtained via:
+Other packages are available from various sources, including conda and Debian, but these are relatively old (at least as of 2019) and do not include important updates from the past few years.
 
-    git clone --recursive git://github.com/ekg/freebayes.git
-
-Note the use of --recursive.  This is required in order to download all 
-nested git submodules for external repositories.
-
-### Resolving proxy issues with git
-
-Depending on your local network configuration, you may have problems obtaining
-freebayes via git.  If you see something like this you may be behind a proxy
-that blocks access to standard git:// port (9418).
-
-    $ git clone --recursive git://github.com/ekg/freebayes.git
-    Cloning into 'freebayes'...
-    fatal: Unable to look up github.com (port 9418) (Name or service not known)
-
-Luckily, if you have access to https:// on port 443, then you can use this
-'magic' command as a workaround to enable download of the submodules:
-
-    git config --global url.https://github.com/.insteadOf git://github.com/
-
-
-## Compilation
-
-FreeBayes requires g++ and the standard C and C++ development libraries.
-
-    make
-
-Will build the executable freebayes, as well as the utilities bamfiltertech and 
-bamleftalign.  These executables can be found in the `bin/` directory in the 
-repository.
-
-Users may wish to install to e.g. /usr/local/bin (default), which is 
-accomplished via
-
-    sudo make install
+## Support
 
-Note that the freebayes-parallel script and the programs on which it depends are
-not installed by this command.
+Please report any issues or questions to the [freebayes mailing list](https://groups.google.com/forum/#!forum/freebayes), [freebayes issue tracker](https://github.com/ekg/freebayes/issues), or by email to <erik.garrison at gmail.com>.
 
-Users can optionally build with [BamTools](https://github.com/pezmaster31/bamtools) instead of [SeqLib](https://github.com/walaj/SeqLib). Building with BamTools requires CMake.
+## Usage
 
-    make wbamtools
+In its simplest operation, freebayes requires only two inputs: a FASTA reference sequence, and a BAM-format alignment file sorted by reference position.
+For instance:
 
-## Usage
+    freebayes -f ref.fa aln.bam >var.vcf
 
-In its simplest operation, freebayes requires only two inputs: a FASTA reference
-sequence, and a BAM-format alignment file sorted by reference position.  For
-instance:
+... will produce a VCF file describing all SNPs, INDELs, and haplotype variants between the reference and aln.bam.
 
-    freebayes --fasta-reference h.sapiens.fasta NA20504.bam
+Multiple BAM files may be given for joint calling.
 
-... produce (on standard output) a VCF file on standard out describing
-all SNPs, INDELs, MNPs, and Complex events between the reference and the
-alignments in NA20504.bam.  In order to produce correct output, the reference
-supplied must be the reference to which NA20504.bam was aligned.
+Typically, we might consider two additional parameters.
+GVCF output allows us to have coverage information about non-called sites, and we can enable it with `--gcvcf`.
+For performance reasons we may want to skip regions of extremely high coverage in the reference using the `--skip-limit` parameter `-g`.
+These can greatly increase runtime but do not produce meaningful results.
+For instance, if we wanted to exclude regions of 1000X coverage, we would run:
 
-Users may specify any number of BAM files on the command line.  FreeBayes uses 
-the [BamTools API](http://github.com/pezmaster31/bamtools) to open and parse 
-these files in parallel, virtually merging them at runtime into one logical 
-file with a merged header.
+    freebayes -f ref.fa --gvcf -g 1000 >var.vcf
 
 For a description of available command-line options and their defaults, run:
 
     freebayes --help
 
-## Getting the best results
-
-freebayes provides many options to configure its operation.
-These may be important in certain use cases, but they are not meant for standard use.
-It is strongly recommended that you run freebayes with parameters that are as close to default as possible unless you can validate that the chosen parameters improve performance.
-To achieve the desired tradeoff between sensitivity and specificity, the best approach is to filter the output using the `QUAL` field, which encodes the posterior estimate of the probability of variation.
-
-Filtering the input with the provided options demonstrably hurts performance where truth sets can be used to evaluate results.
-By removing information from the input, you can confuse the Bayesian model by making it appear that certain alleles frequently indicative of context specific error (such as indels in homopolymers) don't exist.
-
-Many users apply these filters to force freebayes to not make haplotype calls.
-This is almost always a mistake, as the haplotype calling process greatly improves the method's signal to noise ratio and normalizes differential alignment in complex regions.
+## Examples
 
-If you wish to obtain a VCF that does not contain haplotype calls or complex alleles, first call with default parameters and then decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, and Picard.
-Here we use a tool in vcflib that normalizes the haplotype calls into pointwise SNPs and indels:
+Call variants assuming a diploid sample:
 
-    freebayes ... | vcfallelicprimitives -kg >calls.vcf
+    freebayes -f ref.fa aln.bam >var.vcf
 
-Note that this is not done by default as it makes it difficult to determine which variant calls freebayes completed.
-The raw output faithfully describes exactly the calls that were made.
+Call variants on only chrQ:
 
-## Examples
+    freebayes -f ref.fa -r chrQ aln.bam >var.vcf
 
-Call variants assuming a diploid sample:
+Call variants on only chrQ, from position 1000 to 2000:
 
-    freebayes -f ref.fa aln.bam >var.vcf
+    freebayes -f ref.fa -r chrQ:1000-2000 aln.bam >var.vcf
 
 Require at least 5 supporting observations to consider a variant:
 
@@ -208,7 +155,6 @@ should probably be running it in parallel using this script to run on a single
 host, or generating a series of scripts, one per region, and run them on a
 cluster. Be aware that the freebayes-parallel script contains calls to other programs  using relative paths from the scripts subdirectory; the easiest way to ensure a successful run is to invoke the freebayes-parallel script from within the scripts subdirectory.
 
-
 ## Calling variants: from fastq to VCF
 
 You've sequenced some samples.  You have a reference genome or assembled set of 
@@ -221,10 +167,8 @@ samples.  You can use freebayes to detect the variants, following these steps:
 * Ensure your alignments have **read groups** attached so their sample may be 
 identified by freebayes.  Aligners allow you to do this, but you can also use 
 [bamaddrg](http://github.com/ekg/bamaddrg) to do so post-alignment.
-* **Sort** the alignments (e.g. bamtools sort).
-* **Mark duplicates**, for instance with [samtools 
-rmdup](http://samtools.sourceforge.net/) (if PCR was used in the preparation of 
-your sequencing library)
+* **Sort** the alignments (e.g. [sambamba sort](https://github.com/biod/sambamba)).
+* **Mark duplicates**, for instance with [sambamba markdup](https://github.com/biod/sambamba) (if PCR was used in the preparation of your sequencing library)
 * ***Run freebayes*** on all your alignment data simultaneously, generating a 
 VCF.  The default settings should work for most use cases, but if your samples 
 are not diploid, set the `--ploidy` and adjust the `--min-alternate-fraction` 
@@ -235,7 +179,7 @@ observation count (AO).
 * (possibly, **Iterate** the variant detection process in response to insight 
 gained from your interpretation)
 
-FreeBayes emits a standard VCF 4.1 output stream.  This format is designed for the
+freebayes emits a standard VCF 4.1 output stream.  This format is designed for the
 probabilistic description of allelic variants within a population of samples,
 but it is equally suited to describing the probability of variation in a single
 sample.
@@ -269,7 +213,7 @@ can also be done by filtering on the DP flag.
 
 ## Calling variants in a population
 
-FreeBayes is designed to be run on many individuals from the same population
+freebayes is designed to be run on many individuals from the same population
 (e.g. many human individuals) simultaneously.  The algorithm exploits a neutral
 model of allele diffusion to impute most-confident genotypings
 across the entire population.  In practice, the discriminant power of the 
@@ -327,41 +271,46 @@ more flexible than setting a hard count on the number of observations.
 ## Observation filters and qualities
 
 ### Input filters
-FreeBayes filters its input so as to ignore low-confidence alignments and
-alleles which are only supported by low-quality sequencing observations (see
-`--min-mapping-quality` and `--min-base-quality`).  It also will only evaluate a
-position if at least one read has mapping quality of
-`--min-supporting-mapping-quality` and one allele has quality of at least
-`--min-supporting-base-quality`.
 
-Reads with more than a fixed number of high-quality mismatches can be excluded
-by specifying `--read-mismatch-limit`.  This is meant as a workaround when 
-mapping quality estimates are not appropriately calibrated.
+By default, freebayes doesn't 
+
+freebayes may be configured to filter its input so as to ignore low-confidence alignments and alleles which are only supported by low-quality sequencing observations (see `--min-mapping-quality` and `--min-base-quality`).
+It also will only evaluate a position if at least one read has mapping quality of `--min-supporting-mapping-quality` and one allele has quality of at least `--min-supporting-base-quality`.
+
+Reads with more than a fixed number of high-quality mismatches can be excluded by specifying `--read-mismatch-limit`.
+This is meant as a workaround when mapping quality estimates are not appropriately calibrated.
 
-Reads marked as duplicates in the BAM file are ignored, but this can be 
-disabled for testing purposes by providing `--use-duplicate-reads`.  FreeBayes 
-does not mark duplicates on its own, you must use another process to do this.
+Reads marked as duplicates in the BAM file are ignored, but this can be disabled for testing purposes by providing `--use-duplicate-reads`.
+freebayes does not mark duplicates on its own, you must use another process to do this, such as that in [sambamba](https://github.com/biod/sambamba).
 
 ### Observation thresholds
-As a guard against spurious variation caused by sequencing artifacts, positions
-are skipped when no more than `--min-alternate-count` or 
-`--min-alternate-fraction`
-non-clonal observations of an alternate are found in one sample.  These default 
-to 2 and 0.2 respectively.  The default setting of `--min-alternate-fraction 
-0.2` is suitable for diploid samples but should be changed for ploidy > 2.
+
+As a guard against spurious variation caused by sequencing artifacts, positions are skipped when no more than `--min-alternate-count` or `--min-alternate-fraction` non-clonal observations of an alternate are found in one sample.
+These default to 2 and 0.05 respectively.
+The default setting of `--min-alternate-fraction 0.05` is suitable for diploid samples but may need to be changed for higher ploidy.
 
 ### Allele type exclusion
-FreeBayes provides a few methods to ignore certain classes of allele, e.g. 
-`--no-indels` and `--no-mnps`.  Users are *strongly cautioned against using 
+freebayes provides a few methods to ignore certain classes of allele, e.g. 
+`--throw-away-indels-obs` and `--throw-awary-mnps-obs`.  Users are *strongly cautioned against using 
 these*, because removing this information is very likely to reduce detection 
 power.  To generate a report only including SNPs, use vcffilter post-call as 
 such:
 
     freebayes ... | vcffilter -f "TYPE = snp"
 
+### Normalizing variant representation
+
+If you wish to obtain a VCF that does not contain haplotype calls or complex alleles, first call with default parameters and then decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, or Picard.
+Here we use a tool in vcflib that normalizes the haplotype calls into pointwise SNPs and indels:
+
+    freebayes ... | vcfallelicprimitives -kg >calls.vcf
+
+Note that this is not done by default as it makes it difficult to determine which variant calls freebayes completed.
+The raw output faithfully describes exactly the calls that were made.
+
 ### Observation qualities
 
-FreeBayes estimates observation quality using several simple heuristics based 
+freebayes estimates observation quality using several simple heuristics based 
 on manipulations of the phred-scaled base qualities:
 
 * For single-base observations, *mismatches* and *reference observations*: the 
@@ -373,21 +322,13 @@ deleted sequence.
 * For *haplotypes*: the mean quality of allele observations within the 
 haplotype.
 
-### Effective base depth
-
-By default, filters are left completely open.
-Use `--experimental-gls` if you would like to integrate both base and mapping 
-quality are into the reported site quality (QUAL in the VCF) and 
-genotype quality (GQ, when supplying `--genotype-qualities`).  This integration 
-is driven by the "Effective Base Depth" metric first developed in 
-[snpTools](http://www.hgsc.bcm.edu/software/snptools), which scales observation 
-quality by mapping quality.  When `--experimental-gls` is given, *P(Obs|Genotype) ~ 
-P(MappedCorrectly(Obs))P(SequencedCorrectly(Obs))*.
-
+By default, both base and mapping quality are into the reported site quality (QUAL in the VCF) and genotype quality (GQ, when supplying `--genotype-qualities`).
+This integration is driven by the "Effective Base Depth" metric first developed in [snpTools](http://www.hgsc.bcm.edu/software/snptools), which scales observation quality by mapping quality: *P(Obs|Genotype) ~ P(MappedCorrectly(Obs))P(SequencedCorrectly(Obs))*.
+Set `--standard-gls` to use the model described in the freebayes preprint.
 
 ## Stream processing
 
-FreeBayes can read BAM from standard input `--stdin` instead of directly from
+freebayes can read BAM from standard input `--stdin` instead of directly from
 files.  This allows the application of any number of streaming BAM filters and
 calibrators to its input.
 
@@ -487,67 +428,58 @@ is combined with high a sequencing error rate.
 
 ## Best practices and design philosophy
 
-FreeBayes follows the patterns suggested by the [Unix 
-philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which promotes the 
-development of simple, modular systems that perform a single function, and can 
-be combined into more complex systems using stream processing of common 
-interchange formats.
-
-FreeBayes incorporates a number of features in order to reduce the complexity 
-of variant detection for researchers and developers:
-
-* **Indel realignment is accomplished internally** using a read-independent 
-method, and issues resulting from discordant alignments are dramatically 
-reducedy through the direct detection of haplotypes.
-* The need for **base quality recalibration is avoided** through the direct 
-detection of haplotypes. Sequencing platform errors tend to cluster (e.g. at 
-the ends of reads), and generate unique, non-repeating haplotypes at a given 
-locus.
-* **Variant quality recalibration is avoided** by incorporating a number of 
-metrics, such as read placement bias and allele balance, directly into the 
-Bayesian model.  (Our upcoming publication will discuss this in more detail.)
-
-A minimal pre-processing pipeline similar to that described in "Calling 
-variants" should be sufficient for most uses.  For more information, please 
-refer to a recent post by Brad Chapman [on minimal BAM preprocessing 
-methods](http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/).
-
-For a push-button solution to variant detection, from reads to variant calls, 
-look no further than the [gkno genome analysis platform](http://gkno.me/).
-
-## Contributors
-
-FreeBayes is made by:
-
-- Erik Garrison 
-- Thomas Sibley 
-- Dillon Lee 
-- Patrick Marks 
-- Noah Spies 
-- Joshua Randall 
-- Jeremy Anderson
+freebayes follows the patterns suggested by the [Unix philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which promotes the development of simple, modular systems that perform a single function, and can be combined into more complex systems using stream processing of common interchange formats.
 
-## Support
+freebayes incorporates a number of features in order to reduce the complexity of variant detection for researchers and developers:
 
-### email
+* **Indel realignment is accomplished internally** using a read-independent method, and issues resulting from discordant alignments are dramatically reducedy through the direct detection of haplotypes.
+* The need for **base quality recalibration is avoided** through the direct detection of haplotypes. Sequencing platform errors tend to cluster (e.g. at the ends of reads), and generate unique, non-repeating haplotypes at a given locus.
+* **Variant quality recalibration is avoided** by incorporating a number of metrics, such as read placement bias and allele balance, directly into the Bayesian model.  (Our upcoming publication will discuss this in more detail.)
 
-Please report any issues or questions to the [freebayes mailing 
-list](https://groups.google.com/forum/#!forum/freebayes), [freebayes issue 
-tracker](https://github.com/ekg/freebayes/issues), or by email to 
-<erik.garrison at gmail.com>.
+A minimal pre-processing pipeline similar to that described in "Calling variants" should be sufficient for most uses.
+For more information, please refer to a recent post by Brad Chapman [on minimal BAM preprocessing methods](http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/).
 
-### IRC
+## Development
+
+To download freebayes, please use git to download the most recent development tree:
+
+    git clone --recursive git://github.com/ekg/freebayes.git
 
-If you would like to chat real-time about freebayes, join #freebayes on
-freenode. A gittr.im chat is also available.
+Note the use of --recursive.  This is required in order to download all nested git submodules for external repositories.
 
-### reversion
+### Resolving proxy issues with git
 
-Note that if you encounter issues with the development HEAD and you would like 
-a quick workaround for an issue that is likely to have been reintroduced 
-recently, you can use `git checkout` to step back a few revisions.
+Depending on your local network configuration, you may have problems obtaining freebayes via git.
+If you see something like this you may be behind a proxy that blocks access to standard git:// port (9418).
 
-    git checkout [git-commit-id]
+    $ git clone --recursive git://github.com/ekg/freebayes.git
+    Cloning into 'freebayes'...
+    fatal: Unable to look up github.com (port 9418) (Name or service not known)
 
-It will also help with debugging to know if a problem has arisen in recent 
-commits!
+Luckily, if you have access to https:// on port 443, then you can use this
+'magic' command as a workaround to enable download of the submodules:
+
+    git config --global url.https://github.com/.insteadOf git://github.com/
+
+
+## Compilation
+
+freebayes requires g++ and the standard C and C++ development libraries.
+
+    make
+
+Will build the executable freebayes, as well as the utilities bamfiltertech and 
+bamleftalign.  These executables can be found in the `bin/` directory in the 
+repository.
+
+Users may wish to install to e.g. /usr/local/bin (default), which is 
+accomplished via
+
+    sudo make install
+
+Note that the freebayes-parallel script and the programs on which it depends are
+not installed by this command.
+
+Users can optionally build with [BamTools](https://github.com/pezmaster31/bamtools) instead of [SeqLib](https://github.com/walaj/SeqLib). Building with BamTools requires CMake.
+
+    make wbamtools


=====================================
python/allelebayes.py
=====================================
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 
 # calculates data likelihoods for sets of alleles
-
+from __future__ import print_function, division
 import multiset
 import sys
 import cjson
@@ -73,8 +73,16 @@ def alleles_quality_to_lnprob(alleles):
         allele['quality'] = phred.phred2ln(allele['quality'])
     return alleles
 
-def product(l):
-    return reduce(operator.mul, l)
+def fold(func, iterable, initial=None, reverse=False):
+    x=initial
+    if reverse:
+        iterable=reversed(iterable)
+    for e in iterable:
+        x=func(x,e) if x is not None else e
+    return x
+
+def product(listy):
+    return fold(operator.mul, listy)
 
 def observed_alleles_in_genotype(genotype, allele_groups):
     in_genotype = {}
@@ -125,12 +133,12 @@ def sampling_prob(genotype, alleles):
     genotype, follows the multinomial probability distribution."""
     allele_groups = group_alleles(alleles)
     multiplicity = sum([x[1] for x in genotype])
-    print genotype, multiplicity, alleles
+    print(genotype, multiplicity, alleles)
     for allele, count in genotype:
         if allele_groups.has_key(allele):
             print allele, count, math.pow(float(count) / multiplicity, len(allele_groups[allele]))
-    print product([math.factorial(len(obs)) for obs in allele_groups.values()])
-    print allele_groups.values()
+    print(product([math.factorial(len(obs)) for obs in allele_groups.values()]))
+    print(allele_groups.values())
     return float(math.factorial(len(alleles))) \
         / product([math.factorial(len(obs)) for obs in allele_groups.values()]) \
         * product([math.pow(float(count) / multiplicity, len(allele_groups[allele])) \
@@ -271,7 +279,7 @@ def banded_genotype_combinations(sample_genotypes, bandwidth, band_depth):
                 yield [(sample, genotypes[index]) for index, (sample, genotypes) in zip(index_permutation, sample_genotypes)]
 
 def genotype_str(genotype):
-    return reduce(operator.add, [allele * count for allele, count in genotype])
+    return fold(operator.add, [allele * count for allele, count in genotype])
 
 if __name__ == '__main__':
 
@@ -371,5 +379,5 @@ if __name__ == '__main__':
             sample['genotypes'] = sorted([[genotype_str(genotype), math.exp(prob)] for genotype, prob in sample['genotypes']], 
                                             key=lambda c: c[1], reverse=True)
 
-        print cjson.encode(position)
+        print(cjson.encode(position))
         #print position['position']


=====================================
python/dirichlet.py
=====================================
@@ -1,11 +1,21 @@
+#!/usr/bin/env python
+from __future__ import division
 from scipy.special import gamma, gammaln
 import operator
 import math
 from logsumexp import logsumexp
 from factorialln import factorialln
 
-def product(l):
-    return reduce(operator.mul, l)
+def fold(func, iterable, initial=None, reverse=False):
+    x=initial
+    if reverse:
+        iterable=reversed(iterable)
+    for e in iterable:
+        x=func(x,e) if x is not None else e
+    return x
+
+def product(listy):
+    return fold(operator.mul, listy)
 
 def beta(alphas):
     """Multivariate beta function"""


=====================================
python/hwe.py
=====================================
@@ -1,7 +1,17 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
 from dirichlet import multinomial, multinomialln, multinomial_coefficient, multinomial_coefficientln
 import math
 import operator
 
+def fold(func, iterable, initial=None, reverse=False):
+    x=initial
+    if reverse:
+        iterable=reversed(iterable)
+    for e in iterable:
+        x=func(x,e) if x is not None else e
+    return x
+
 def hwe_expectation(genotype, allele_counts):
     """@genotype is counts of A,B,C etc. alleles, e.g. (2,0) is AA and (1,1) is AB
     @allele_counts is the counts of the alleles in the population"""
@@ -9,7 +19,7 @@ def hwe_expectation(genotype, allele_counts):
     ploidy = sum(genotype)
     genotype_coeff = multinomial_coefficient(ploidy, genotype)
     allele_frequencies = [count / float(population_total_alleles) for count in allele_counts]
-    genotype_expected_frequency = genotype_coeff * reduce(operator.mul, [math.pow(freq, p) for freq, p in zip(allele_frequencies, genotype)])
+    genotype_expected_frequency = genotype_coeff * fold(operator.mul, [math.pow(freq, p) for freq, p in zip(allele_frequencies, genotype)])
     return genotype_expected_frequency
 
 
@@ -28,7 +38,7 @@ def hwe_sampling_probln(genotype, genotypes, ploidy):
              population as suggested by the genotype counts, given HWE"""
     population_total_alleles = sum([sum(g[0]) * g[1] for g in genotypes])
     #print "population_total_alleles", population_total_alleles
-    allele_counts = reduce(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
+    allele_counts = fold(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
     #print "allele_counts", allele_counts
     genotype_counts = [g[1] for g in genotypes]
     #print "genotype_counts", genotype_counts
@@ -49,7 +59,7 @@ def hwe_sampling_probln(genotype, genotypes, ploidy):
 
 def inbreeding_coefficient(genotype, genotypes):
     population_total_alleles = sum([sum(g[0]) * g[1] for g in genotypes])
-    allele_counts = reduce(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
+    allele_counts = fold(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
     genotype_counts = [g[1] for g in genotypes]
     population_total_genotypes = sum([gtc[1] for gtc in genotypes])
     expected = hwe_expectation(genotype, allele_counts) * population_total_genotypes
@@ -59,6 +69,6 @@ def inbreeding_coefficient(genotype, genotypes):
             observed = g[1]
             break
     if observed == 0:
-        print "error, no observations of genotype, cannot calculate inbreeding coefficient"
+        print("error, no observations of genotype, cannot calculate inbreeding coefficient")
         return 0
     return 1 - (observed / expected)


=====================================
scripts/coverage_to_regions.py
=====================================
@@ -1,13 +1,13 @@
 #!/usr/bin/env python
-
+from __future__ import print_function, division
 import sys
 
 if len(sys.argv) < 3:
-    print "usage: <bamtools_coverage_output ", sys.argv[0], " fasta_index num_regions >regions.bed"
-    print "Generates regions with even sequencing coverage, provided an input of coverage per-position as"
-    print "generated by bamtools coverage.  In other words, generates regions such that the integral of"
-    print "coverage is approximately equal for each.  These can be used when variant calling to reduce"
-    print "variance in job runtime."
+    print("usage: <bamtools_coverage_output {} fasta_index num_regions >regions.bed").format(sys.argv[0])
+    print("Generates regions with even sequencing coverage, provided an input of coverage per-position as")
+    print("generated by bamtools coverage.  In other words, generates regions such that the integral of")
+    print("coverage is approximately equal for each.  These can be used when variant calling to reduce")
+    print("variance in job runtime.")
     exit(1)
 
 lengths = {}
@@ -35,7 +35,7 @@ for line in positions:
     chrom, pos, depth = line #line.strip().split("\t")
     if lchrom != chrom:
         if lchrom:
-            print lchrom+":"+str(lpos)+"-"+str(lengths[lchrom])
+            print(lchrom+":"+str(lpos)+"-"+str(lengths[lchrom]))
             lpos = 0
             lchrom = chrom
             bp_in_region = 0
@@ -43,9 +43,8 @@ for line in positions:
             lchrom = chrom
     bp_in_region += depth
     if bp_in_region > bp_per_region:
-        print chrom+":"+str(lpos)+"-"+str(pos) #, pos - lpos
+        print(chrom+":"+str(lpos)+"-"+str(pos)) #, pos - lpos
         lpos = pos
         bp_in_region = 0
 
-
-print lchrom+":"+str(lpos)+"-"+str(lengths[lchrom])
+print(lchrom+":"+str(lpos)+"-"+str(lengths[lchrom]))


=====================================
scripts/fasta_generate_regions.py
=====================================
@@ -1,12 +1,11 @@
 #!/usr/bin/env python
-
+from __future__ import print_function
 import sys
 
-
 if len(sys.argv) == 1:
-    print "usage: ", sys.argv[0], " <fasta file or index file> <region size>"
-    print "generates a list of freebayes/bamtools region specifiers on stdout"
-    print "intended for use in creating cluster jobs"
+    print("usage: {} <fasta file or index file> <region size>").format(sys.argv[0])
+    print("generates a list of freebayes/bamtools region specifiers on stdout")
+    print("intended for use in creating cluster jobs")
     exit(1)
 
 fasta_index_file = sys.argv[1]
@@ -17,19 +16,14 @@ fasta_index_file = open(fasta_index_file)
 region_size = int(sys.argv[2])
 
 for line in fasta_index_file:
-
     fields = line.strip().split("\t")
-
     chrom_name = fields[0]
     chrom_length = int(fields[1])
     region_start = 0
-
     while region_start < chrom_length:
         start = region_start
         end = region_start + region_size
         if end > chrom_length:
             end = chrom_length
-        print chrom_name + ":" + str(region_start) + "-" + str(end)
+        print(chrom_name + ":" + str(region_start) + "-" + str(end))
         region_start = end
-
-


=====================================
src/AlleleParser.cpp
=====================================
@@ -1317,6 +1317,13 @@ RegisteredAlignment& AlleleParser::registerAlignment(BAMALIGN& alignment, Regist
 
     string rDna = alignment.QUERYBASES;
     string rQual = alignment.QUALITIES;
+    if (qualityChar2LongDouble(rQual.at(0)) == -1) {
+        // force rQual to be 0
+        char q0 = qualityInt2Char(0);
+        for (size_t i = 0; i < rQual.size(); ++i) {
+            rQual[i] = q0;
+        }
+    }
     int rp = 0;  // read position, 0-based relative to read
     int csp = currentSequencePosition(alignment); // current sequence position, 0-based relative to currentSequence
     int sp = alignment.POSITION;  // sequence position
@@ -1979,12 +1986,8 @@ void AlleleParser::updateAlignmentQueue(long int position,
             }
 
             // skip alignments which are non-primary
-#ifdef HAVE_BAMTOOLS
-            if (!currentAlignment.IsPrimaryAlignment()) {
-#else
             if (currentAlignment.SecondaryFlag()) {
-#endif
-                //DEBUG("skipping alignment " << currentAlignment.Name << " because it is not marked primary");
+                DEBUG("skipping alignment " << currentAlignment.QNAME << " because it is not marked primary");
                 continue;
             }
 
@@ -2878,8 +2881,6 @@ bool AlleleParser::toNextPosition(void) {
     // remove past registered alleles
     DEBUG2("marking previous alleles as processed and removing from registered alleles");
     removePreviousAlleles(registeredAlleles, currentPosition);
-    sort(registeredAlleles.begin(), registeredAlleles.end());
-    registeredAlleles.erase(unique(registeredAlleles.begin(), registeredAlleles.end()), registeredAlleles.end());
 
     // if we have alignments which ended at the previous base, erase them and their alleles
     DEBUG2("erasing old registered alignments");
@@ -3341,10 +3342,13 @@ void AlleleParser::buildHaplotypeAlleles(
         groupAlleles(samples, alleleGroups);
 
         /*
-        for (Samples::iterator s = samples.begin(); s != samples.end(); ++s) {
-            cerr << s->first << endl;
-            for (Sample::iterator t = s->second.begin(); t != s->second.end(); ++t) {
-                cerr << t->first << " " << t->second << endl << endl;
+        if (parameters.debug) {
+            DEBUG("after re-grouping alleles");
+            for (Samples::iterator s = samples.begin(); s != samples.end(); ++s) {
+                cerr << s->first << endl;
+                for (Sample::iterator t = s->second.begin(); t != s->second.end(); ++t) {
+                    cerr << t->first << " " << t->second << endl << endl;
+                }
             }
         }
         */


=====================================
src/Parameters.cpp
=====================================
@@ -17,7 +17,7 @@ void Parameters::simpleUsage(char ** argv) {
         << "          \"Haplotype-based variant detection from short-read sequencing\"" << endl
         << "          arXiv:1207.3907 (http://arxiv.org/abs/1207.3907)" << endl
         << endl
-        << "author:   Erik Garrison <erik.garrison at bc.edu>, Marth Lab, Boston College, 2010-2014" << endl
+        << "author:   Erik Garrison <erik.garrison at gmail.com>" << endl
         << "version:  " << VERSION_GIT << endl;
 
 }
@@ -381,7 +381,7 @@ void Parameters::usage(char** argv) {
         << "   -dd             Print more verbose debugging output (requires \"make DEBUG\")" << endl
         << endl
         << endl
-        << "author:   Erik Garrison <erik.garrison at bc.edu>, Marth Lab, Boston College, 2010-2014" << endl
+        << "author:   Erik Garrison <erik.garrison at gmail.com>" << endl
         << "version:  " << VERSION_GIT << endl;
 
 }



View it on GitLab: https://salsa.debian.org/med-team/freebayes/commit/0d52de8e8d2b68699328b04ef5139bd9bb0ef440

-- 
View it on GitLab: https://salsa.debian.org/med-team/freebayes/commit/0d52de8e8d2b68699328b04ef5139bd9bb0ef440
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20191217/3be5e77b/attachment-0001.html>


More information about the debian-med-commit mailing list