[med-svn] [Git][med-team/freebayes][upstream] New upstream version 1.3.1
Andreas Tille
gitlab at salsa.debian.org
Wed Jul 31 16:33:27 BST 2019
Andreas Tille pushed to branch upstream at Debian Med / freebayes
Commits:
3b14acd3 by Andreas Tille at 2019-07-31T13:47:14Z
New upstream version 1.3.1
- - - - -
10 changed files:
- Makefile
- README.md
- src/AlleleParser.cpp
- src/AlleleParser.h
- src/Makefile
- src/Parameters.cpp
- src/Parameters.h
- src/ResultData.cpp
- src/freebayes.cpp
- test/t/01_call_variants.t
Changes:
=====================================
Makefile
=====================================
@@ -1,3 +1,6 @@
+#CFLAGS=-O3 -D_FILE_OFFSET_BITS=64 -g
+#CXXFLAGS=-O3 -D_FILE_OFFSET_BITS=64 -g
+
all: vcflib/Makefile log
cd src && $(MAKE)
=====================================
README.md
=====================================
@@ -2,7 +2,6 @@
## user manual and guide
[![Build Status](https://travis-ci.org/ekg/freebayes.svg)](https://travis-ci.org/ekg/freebayes)
-[![Gitter](https://badges.gitter.im/Join Chat.svg)](https://gitter.im/ekg/freebayes?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
--------
@@ -24,8 +23,7 @@ alignments. This method avoids one of the core problems with alignment-based
variant detection--- that identical sequences may have multiple possible
alignments:
-<img src="http://hypervolu.me/freebayes/figures/haplotype_calling.png"
-width=500/>
+<img src="https://github.com/ekg/freebayes/raw/v1.3.0/paper/haplotype_calling.png" width=500/>
*FreeBayes* uses short-read alignments
([BAM](http://samtools.sourceforge.net/SAMv1.pdf) files with
@@ -134,6 +132,26 @@ For a description of available command-line options and their defaults, run:
freebayes --help
+## Getting the best results
+
+freebayes provides many options to configure its operation.
+These may be important in certain use cases, but they are not meant for standard use.
+It is strongly recommended that you run freebayes with parameters that are as close to default as possible unless you can validate that the chosen parameters improve performance.
+To achieve the desired tradeoff between sensitivity and specificity, the best approach is to filter the output using the `QUAL` field, which encodes the posterior estimate of the probability of variation.
+
+Filtering the input with the provided options demonstrably hurts performance where truth sets can be used to evaluate results.
+By removing information from the input, you can confuse the Bayesian model by making it appear that certain alleles, frequently indicative of context-specific error (such as indels in homopolymers), don't exist.
+
+Many users apply these filters to force freebayes to not make haplotype calls.
+This is almost always a mistake, as haplotype calling greatly improves the method's signal-to-noise ratio and normalizes differential alignment in complex regions.
+
+If you wish to obtain a VCF that does not contain haplotype calls or complex alleles, first call with default parameters and then decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, and Picard.
+Here we use a tool in vcflib that normalizes the haplotype calls into pointwise SNPs and indels:
+
+ freebayes ... | vcfallelicprimitives -kg >calls.vcf
+
+Note that this is not done by default, as it makes it difficult to determine which variant calls freebayes actually made.
+The raw output faithfully describes exactly the calls that were made.
## Examples
@@ -145,6 +163,10 @@ Require at least 5 supporting observations to consider a variant:
freebayes -f ref.fa -C 5 aln.bam >var.vcf
+Skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than 200:
+
+ freebayes -f ref.fa -g 200 aln.bam >var.vcf
+
Use a different ploidy:
freebayes -f ref.fa -p 4 aln.bam >var.vcf
=====================================
src/AlleleParser.cpp
=====================================
@@ -1266,19 +1266,16 @@ Allele AlleleParser::makeAllele(RegisteredAlignment& ra,
// a dangerous game
int start = pos - currentSequenceStart;
double minEntropy = parameters.minRepeatEntropy;
- // check first that' wer'e actually ina repeat... TODO
- //cerr << "entropy of " << entropy(currentSequence.substr(start, repeatRightBoundary - pos)) << " is too low, " << endl;
while (minEntropy > 0 && // ignore if turned off
- repeatRightBoundary - currentSequenceStart < currentSequence.size() && //guard
+ // don't run off the end of the current sequence
+ repeatRightBoundary - currentSequenceStart < currentSequence.size() &&
+ // there is no point in going past the alignment end
+ // because we won't make a haplotype call unless we have a covering observation from a read
+ repeatRightBoundary < alignment.ENDPOSITION &&
entropy(currentSequence.substr(start, repeatRightBoundary - pos)) < minEntropy) {
- //cerr << "entropy of " << entropy(currentSequence.substr(start, repeatRightBoundary - pos)) << " is too low, ";
- //cerr << "increasing rought boundary to ";
++repeatRightBoundary;
- //cerr << repeatRightBoundary << endl;
}
- // now we
- //cachedRepeatCounts[pos] = repeatCounts(pos - currentSequenceStart, currentSequence, 12);
// edge case, the indel is an insertion and matches the reference to the right
// this means there is a repeat structure in the read, but not the ref
if (currentSequence.substr(pos - currentSequenceStart, length) == readSequence) {
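The loop above extends the repeat's right boundary while the windowed sequence entropy stays below `--min-repeat-entropy`, stopping at the alignment end. A standalone sketch of that check (hypothetical helper names, not the freebayes source):

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Shannon entropy (bits) of a sequence's base composition.
double entropy(const std::string& seq) {
    std::map<char, int> counts;
    for (char c : seq) ++counts[c];
    double h = 0;
    for (const auto& kv : counts) {
        double p = (double)kv.second / seq.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Widen the window [start, boundary) until it is complex enough,
// never running past maxEnd (e.g. the alignment's end position).
size_t extendRepeatBoundary(const std::string& seq, size_t start,
                            size_t boundary, double minEntropy, size_t maxEnd) {
    while (boundary < maxEnd &&
           entropy(seq.substr(start, boundary - start)) < minEntropy) {
        ++boundary;
    }
    return boundary;
}
```

A homopolymer run has zero entropy, so the boundary grows until enough distinct bases enter the window.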
@@ -1992,9 +1989,9 @@ void AlleleParser::updateAlignmentQueue(long int position,
}
if (!gettingPartials && currentAlignment.ENDPOSITION < position) {
- cerr << currentAlignment.QNAME << " at " << currentSequenceName << ":" << currentAlignment.POSITION << " is out of order!"
- << " expected after " << position << endl;
- continue;
+ cerr << currentAlignment.QNAME << " at " << currentSequenceName << ":" << currentAlignment.POSITION << " is out of order!"
+ << " expected after " << position << endl;
+ continue;
}
// otherwise, get the sample name and register the alignment to generate a sequence of alleles
@@ -2022,29 +2019,53 @@ void AlleleParser::updateAlignmentQueue(long int position,
if (parameters.baseQualityCap != 0) {
capBaseQuality(currentAlignment, parameters.baseQualityCap);
}
+ // do we exceed coverage anywhere?
+ // do we touch anything where we had exceeded coverage?
+ // if so skip this read, and mark and remove processed alignments and registered alleles overlapping the coverage capped position
+ bool considerAlignment = true;
+ if (parameters.skipCoverage > 0) {
+ for (unsigned long int i = currentAlignment.POSITION; i < currentAlignment.ENDPOSITION; ++i) {
+ unsigned long int x = ++coverage[i];
+ if (x > parameters.skipCoverage && !gettingPartials) {
+ considerAlignment = false;
+ // we're exceeding coverage at this position for the first time, so clean up
+ if (!coverageSkippedPositions.count(i)) {
+ // clean up reads overlapping this position
+ removeCoverageSkippedAlleles(registeredAlleles, i);
+ removeCoverageSkippedAlleles(newAlleles, i);
+ // remove the alignments overlapping this position
+ removeRegisteredAlignmentsOverlappingPosition(i);
+ // record that the position is capped
+ coverageSkippedPositions.insert(i);
+ }
+ }
+ }
+ }
// decomposes alignment into a set of alleles
// here we get the deque of alignments ending at this alignment's end position
deque<RegisteredAlignment>& rq = registeredAlignments[currentAlignment.ENDPOSITION];
-
- // and insert the registered alignment into that deque
- rq.push_front(RegisteredAlignment(currentAlignment, parameters));
- RegisteredAlignment& ra = rq.front();
- registerAlignment(currentAlignment, ra, sampleName, sequencingTech);
- // backtracking if we have too many mismatches
- // or if there are no recorded alleles
- if (ra.alleles.empty()
- || ((float) ra.mismatches / (float) currentAlignment.SEQLEN) > parameters.readMaxMismatchFraction
- || ra.mismatches > parameters.RMU
- || ra.snpCount > parameters.readSnpLimit
- || ra.indelCount > parameters.readIndelLimit) {
- rq.pop_front(); // backtrack
- } else {
- // push the alleles into our new alleles vector
- for (vector<Allele>::iterator allele = ra.alleles.begin(); allele != ra.alleles.end(); ++allele) {
- newAlleles.push_back(&*allele);
+ //cerr << "parameters capcoverage " << parameters.capCoverage << " " << rq.size() << endl;
+ if (considerAlignment) {
+ // and insert the registered alignment into that deque
+ rq.push_front(RegisteredAlignment(currentAlignment));
+ RegisteredAlignment& ra = rq.front();
+ registerAlignment(currentAlignment, ra, sampleName, sequencingTech);
+ // backtracking if we have too many mismatches
+ // or if there are no recorded alleles
+ if (ra.alleles.empty()
+ || ((float) ra.mismatches / (float) currentAlignment.SEQLEN) > parameters.readMaxMismatchFraction
+ || ra.mismatches > parameters.RMU
+ || ra.snpCount > parameters.readSnpLimit
+ || ra.indelCount > parameters.readIndelLimit) {
+ rq.pop_front(); // backtrack
+ } else {
+ // push the alleles into our new alleles vector
+ for (vector<Allele>::iterator allele = ra.alleles.begin(); allele != ra.alleles.end(); ++allele) {
+ newAlleles.push_back(&*allele);
+ }
}
}
- }
+ }
} while ((hasMoreAlignments = GETNEXT(bamMultiReader, currentAlignment))
&& currentAlignment.POSITION <= position
&& currentAlignment.REFID == currentRefID);
@@ -2054,6 +2075,44 @@ void AlleleParser::updateAlignmentQueue(long int position,
}
+void AlleleParser::removeRegisteredAlignmentsOverlappingPosition(long unsigned int pos) {
+ map<long unsigned int, deque<RegisteredAlignment> >::iterator f = registeredAlignments.begin();
+ map<long unsigned int, set<deque<RegisteredAlignment>::iterator> > alignmentsToErase;
+ set<Allele*> allelesToErase;
+ while (f != registeredAlignments.end()) {
+ for (deque<RegisteredAlignment>::iterator d = f->second.begin(); d != f->second.end(); ++d) {
+ if (d->start <= pos && d->end > pos) {
+ alignmentsToErase[f->first].insert(d);
+ for (vector<Allele>::iterator a = d->alleles.begin(); a != d->alleles.end(); ++a) {
+ allelesToErase.insert(&*a);
+ }
+ }
+ }
+ ++f;
+ }
+ // clean up registered alleles--- maybe this should be done externally?
+ for (vector<Allele*>::iterator a = registeredAlleles.begin(); a != registeredAlleles.end(); ++a) {
+ if (allelesToErase.count(*a)) {
+ *a = NULL;
+ }
+ }
+ registeredAlleles.erase(remove(registeredAlleles.begin(), registeredAlleles.end(), (Allele*)NULL), registeredAlleles.end());
+ if (alignmentsToErase.size()) {
+ for (map<long unsigned int, set<deque<RegisteredAlignment>::iterator> >::iterator e = alignmentsToErase.begin();
+ e != alignmentsToErase.end(); ++e) {
+ deque<RegisteredAlignment> updated;
+ map<long unsigned int, deque<RegisteredAlignment> >::iterator f = registeredAlignments.find(e->first);
+ assert(f != registeredAlignments.end());
+ for (deque<RegisteredAlignment>::iterator d = f->second.begin(); d != f->second.end(); ++d) {
+ if (!e->second.count(d)) {
+ updated.push_back(*d);
+ }
+ }
+ f->second = updated;
+ }
+ }
+}
+
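The coverage-cap bookkeeping added in this change (a per-position `coverage` map plus a `coverageSkippedPositions` set) can be sketched in isolation; the names below are illustrative, not the actual freebayes members:

```cpp
#include <cassert>
#include <map>
#include <set>

struct CoverageCap {
    unsigned long cap;
    std::map<unsigned long, unsigned long> coverage;
    std::set<unsigned long> skippedPositions;

    // Count the read's span into the per-position coverage map.
    // Returns false if any covered position exceeds the cap, recording
    // that position so overlapping observations can be discarded.
    bool consider(unsigned long start, unsigned long end) {
        bool ok = true;
        for (unsigned long i = start; i < end; ++i) {
            if (++coverage[i] > cap) {
                ok = false;
                skippedPositions.insert(i); // set dedups repeat hits
            }
        }
        return ok;
    }
};
```

As in the parser, the first read that pushes a position over the cap triggers cleanup of everything already registered there, and later reads touching that position are skipped.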
void AlleleParser::addToRegisteredAlleles(vector<Allele*>& alleles) {
registeredAlleles.insert(registeredAlleles.end(),
alleles.begin(),
@@ -2505,6 +2564,17 @@ void AlleleParser::removePreviousAlleles(vector<Allele*>& alleles, long int posi
alleles.erase(remove(alleles.begin(), alleles.end(), (Allele*)NULL), alleles.end());
}
+void AlleleParser::removeCoverageSkippedAlleles(vector<Allele*>& alleles, long int position) {
+ for (vector<Allele*>::iterator a = alleles.begin(); a != alleles.end(); ++a) {
+ Allele* allele = *a;
+ if (*a != NULL && allele->alignmentStart <= position && allele->alignmentEnd > position) {
+ allele->processed = true;
+ *a = NULL;
+ }
+ }
+ alleles.erase(remove(alleles.begin(), alleles.end(), (Allele*)NULL), alleles.end());
+}
+
// steps our position/beddata/reference pointers through all positions in all
// targets, returns false when we are finished
//
@@ -2518,6 +2588,9 @@ bool AlleleParser::toNextTarget(void) {
DEBUG("to next target");
clearRegisteredAlignments();
+ coverageSkippedPositions.clear();
+ cachedRepeatCounts.clear();
+ coverage.clear();
// reset haplotype length; there is no last call in this sequence; it isn't relevant
lastHaplotypeLength = 0;
@@ -2763,6 +2836,9 @@ bool AlleleParser::toNextPosition(void) {
|| (registeredAlignments.empty() && currentRefID != currentAlignment.REFID)) {
DEBUG("at end of sequence");
clearRegisteredAlignments();
+ coverageSkippedPositions.clear();
+ cachedRepeatCounts.clear();
+ coverage.clear();
loadNextPositionWithAlignmentOrInputVariant(currentAlignment);
justSwitchedTargets = true;
}
@@ -2799,11 +2875,6 @@ bool AlleleParser::toNextPosition(void) {
// done typically at each new read, but this handles the case where there is no data for a while
//updateInputVariants(currentPosition, 1);
- //DEBUG2("updating registered alleles");
- //updateRegisteredAlleles(); // this removes unused left-flanking sequence
- //DEBUG2("updating prior variant alleles");
- //updatePriorAlleles();
-
// remove past registered alleles
DEBUG2("marking previous alleles as processed and removing from registered alleles");
removePreviousAlleles(registeredAlleles, currentPosition);
@@ -2861,6 +2932,17 @@ bool AlleleParser::toNextPosition(void) {
cachedRepeatCounts.erase(rc++);
}
+ DEBUG2("erasing old coverage cap");
+ while (coverageSkippedPositions.size() && *coverageSkippedPositions.begin() < currentPosition) {
+ coverageSkippedPositions.erase(coverageSkippedPositions.begin());
+ }
+
+ DEBUG2("erasing old coverage counts");
+ map<long unsigned int, long unsigned int>::iterator cov = coverage.begin();
+ while (cov != coverage.end() && cov->first < currentPosition) {
+ coverage.erase(cov++);
+ }
+
return true;
}
@@ -3139,7 +3221,7 @@ void AlleleParser::buildHaplotypeAlleles(
for (deque<RegisteredAlignment>::iterator r = ras.begin(); r != ras.end(); ++r) {
RegisteredAlignment& ra = *r;
if ((ra.start > currentPosition && ra.start < currentPosition + haplotypeLength)
- || (ra.end > currentPosition && ra.end < currentPosition + haplotypeLength)) {
+ || (ra.end > currentPosition && ra.end < currentPosition + haplotypeLength)) {
Allele* aptr;
bool allowPartials = true;
ra.fitHaplotype(currentPosition, haplotypeLength, aptr, allowPartials);
@@ -3553,11 +3635,12 @@ bool AlleleParser::getNextAlleles(Samples& samples, int allowedAlleleTypes) {
void AlleleParser::getAlleles(Samples& samples, int allowedAlleleTypes,
int haplotypeLength, bool getAllAllelesInHaplotype,
bool ignoreProcessedFlag) {
-
+ Samples gvcf_held; // hold samples that bypass filtering, for gVCF lines
DEBUG2("getting alleles");
-
- for (Samples::iterator s = samples.begin(); s != samples.end(); ++s)
- s->second.clear();
+ samples.clear();
+ // Replaced the per-sample clear below with .clear() to really empty the map; more allocation, but no major change
+ //for (Samples::iterator s = samples.begin(); s != samples.end(); ++s)
+ // s->second.clear();
// TODO ^^^ this should be optimized for better scanning performance
// if we have targets and are outside of the current target, don't return anything
@@ -3600,6 +3683,9 @@ void AlleleParser::getAlleles(Samples& samples, int allowedAlleleTypes,
(allele.position == currentPosition)))
) ) {
allele.update(haplotypeLength);
+ if(parameters.gVCFout){
+ gvcf_held[allele.sampleID][allele.currentBase].push_back(*a); // stash the allele in case gVCF output needs it
+ }
if (allele.quality >= parameters.BQL0 && allele.currentBase != "N"
&& (allele.isReference() || !allele.alternateSequence.empty())) { // filters haplotype construction chaff
//cerr << "keeping allele " << allele << endl;
@@ -3619,6 +3705,9 @@ void AlleleParser::getAlleles(Samples& samples, int allowedAlleleTypes,
}
}
}
+ if(samples.size() == 0 && parameters.gVCFout){
+ samples = gvcf_held; // no non-reference values: recover any aligned values for gVCF output
+ }
vector<string> samplesToErase;
// now remove empty alleles from our return so as to not confuse processing
=====================================
src/AlleleParser.h
=====================================
@@ -57,10 +57,8 @@ public:
int snpCount;
int indelCount;
int alleleTypes;
- Parameters parameters;
- RegisteredAlignment(BAMALIGN& alignment, Parameters parameters)
- //: alignment(alignment)
+ RegisteredAlignment(BAMALIGN& alignment)
: start(alignment.POSITION)
, end(alignment.ENDPOSITION)
, refid(alignment.REFID)
@@ -69,7 +67,6 @@ public:
, snpCount(0)
, indelCount(0)
, alleleTypes(0)
- , parameters(parameters)
{
FILLREADGROUP(readgroup, alignment);
}
@@ -195,6 +192,8 @@ public:
vector<Allele*> registeredAlleles;
map<long unsigned int, deque<RegisteredAlignment> > registeredAlignments;
+ set<long unsigned int> coverageSkippedPositions;
+ map<long unsigned int, long unsigned int> coverage;
map<int, map<long int, vector<Allele> > > inputVariantAlleles; // all variants present in the input VCF, as 'genotype' alleles
pair<int, long int> nextInputVariantPosition(void);
void getInputVariantsInRegion(string& seq, long start = 0, long end = 0);
@@ -251,6 +250,8 @@ public:
int haplotypeLength = 1,
bool getAllAllelesInHaplotype = false);
void removePreviousAlleles(vector<Allele*>& alleles, long int position);
+ void removeCoverageSkippedAlleles(vector<Allele*>& alleles, long int position);
+ void removeRegisteredAlignmentsOverlappingPosition(long unsigned int pos);
void removeFilteredAlleles(vector<Allele*>& alleles);
void removeDuplicateAlleles(Samples& samples, map<string, vector<Allele*> >& alleleGroups,
int allowedAlleleTypes, int haplotypeLength, Allele& refallele);
=====================================
src/Makefile
=====================================
@@ -5,6 +5,8 @@
################################################################################
# Compiler
+
+CXXFLAGS=-O3 -D_FILE_OFFSET_BITS=64 -g -ggdb
CXX=g++ ${CXXFLAGS}
C=gcc
@@ -17,7 +19,6 @@ export LIBFLAGS
# Compiler flags
-CFLAGS=-O3 -D_FILE_OFFSET_BITS=64 -g
#CFLAGS=-O3 -static -D VERBOSE_DEBUG # enables verbose debugging via --debug2
SEQLIB_ROOT=../SeqLib
@@ -26,13 +27,13 @@ TABIX_ROOT=$(VCFLIB_ROOT)/tabixpp
HTSLIB_ROOT=$(SEQLIB_ROOT)/htslib
LIBS = -lz -llzma -lbz2 -lm -lpthread
-INCLUDE = -I../src -I../ttmath -I$(VCFLIB_ROOT)/src/ -I$(VCFLIB_ROOT)/smithwaterman/ -I$(VCFLIB_ROOT)/multichoose/ -I$(VCFLIB_ROOT)/filevercmp/ -I$(VCFLIB_ROOT)/fastahack/ -I$(HTSLIB_ROOT) -I$(SEQLIB_ROOT)
+INCLUDE = -I../src -I../ttmath -I$(VCFLIB_ROOT)/src/ -I$(VCFLIB_ROOT)/smithwaterman/ -I$(VCFLIB_ROOT)/multichoose/ -I$(VCFLIB_ROOT)/filevercmp/ -I$(VCFLIB_ROOT)/fastahack/ -I$(HTSLIB_ROOT) -I$(SEQLIB_ROOT)
#INCLUDE = -I../ttmath -I$(BAMTOOLS_ROOT)/src/ -I$(VCFLIB_ROOT)/src/ -I$(TABIX_ROOT)/ -I$(VCFLIB_ROOT)/smithwaterman/ -I$(VCFLIB_ROOT)/multichoose/ -I$(VCFLIB_ROOT)/filevercmp/ -I$(VCFLIB_ROOT)/fastahack/ -I$(HTSLIB_ROOT) -I$(SEQLIB_ROOT) -I$(SEQLIB_ROOT)/htslib
all: autoversion ../bin/freebayes ../bin/bamleftalign
static:
- $(MAKE) CXXFLAGS="$(CXXFLAGS) -static" all
+ $(MAKE) CXXFLAGS="$(CXXFLAGS) -static -static-libstdc++ -static-libgcc -Wl,--allow-multiple-definition" all
debug:
$(MAKE) CXXFLAGS="$(CXXFLAGS) -D VERBOSE_DEBUG -g -rdynamic" all
=====================================
src/Parameters.cpp
=====================================
@@ -72,6 +72,9 @@ void Parameters::usage(char** argv) {
<< " # require at least 5 supporting observations to consider a variant" << endl
<< " freebayes -f ref.fa -C 5 aln.bam >var.vcf" << endl
<< endl
+ << " # discard alignments overlapping positions where total read depth is greater than 200" << endl
+ << " freebayes -f ref.fa -g 200 aln.bam >var.vcf" << endl
+ << endl
<< " # use a different ploidy" << endl
<< " freebayes -f ref.fa -p 4 aln.bam >var.vcf" << endl
<< endl
@@ -126,11 +129,12 @@ void Parameters::usage(char** argv) {
<< " -A --cnv-map FILE" << endl
<< " Read a copy number map from the BED file FILE, which has" << endl
<< " either a sample-level ploidy:" << endl
- << " sample name, copy number" << endl
+ << " sample_name copy_number" << endl
<< " or a region-specific format:" << endl
- << " reference sequence, start, end, sample name, copy number" << endl
+ << " seq_name start end sample_name copy_number" << endl
<< " ... for each region in each sample which does not have the" << endl
- << " default copy number as set by --ploidy." << endl
+ << " default copy number as set by --ploidy. These fields can be delimited" << endl
+ << " by space or tab." << endl
<< endl
<< "output:" << endl
<< endl
@@ -139,6 +143,10 @@ void Parameters::usage(char** argv) {
<< " Write gVCF output, which indicates coverage in uncalled regions." << endl
<< " --gvcf-chunk NUM" << endl
<< " When writing gVCF output emit a record for every NUM bases." << endl
<< " -& --gvcf-dont-use-chunk BOOL" << endl
<< " When writing gVCF output, emit a record for every base if" << endl
<< " set to \"true\". An integer argument is instead routed to" << endl
<< " --gvcf-chunk (similar to --output-mode EMIT_ALL_SITES in GATK)." << endl
<< " -@ --variant-input VCF" << endl
<< " Use variants reported in VCF file as input to the algorithm." << endl
<< " Variants in this file will included in the output even if" << endl
@@ -194,10 +202,6 @@ void Parameters::usage(char** argv) {
<< endl
<< "allele scope:" << endl
<< endl
- << " -I --no-snps Ignore SNP alleles." << endl
- << " -i --no-indels Ignore insertion and deletion alleles." << endl
- << " -X --no-mnps Ignore multi-nuceotide polymorphisms, MNPs." << endl
- << " -u --no-complex Ignore complex events (composites of other classes)." << endl
<< " -n --use-best-n-alleles N" << endl
<< " Evaluate only the best N SNP alleles, ranked by sum of" << endl
<< " supporting quality scores. (Set to 0 to use all; default: all)" << endl
@@ -216,6 +220,23 @@ void Parameters::usage(char** argv) {
<< " detection window. (default, use all observations, dividing partial" << endl
<< " support across matching haplotypes when generating haplotypes.)" << endl
<< endl
+ << " These flags are meant for testing." << endl
+ << " They are not meant for filtering the output." << endl
+ << " They actually filter the input to the algorithm by throwing away alignments." << endl
+ << " This hurts performance by hiding information from the Bayesian model." << endl
+ << " Do not use them unless you can validate that they improve results!" << endl
+ << endl
+ << " -I --throw-away-snp-obs Remove SNP observations from input." << endl
<< " -i --throw-away-indel-obs Remove indel observations from input." << endl
<< " -X --throw-away-mnps-obs Remove MNP observations from input." << endl
+ << " -u --throw-away-complex-obs Remove complex allele observations from input." << endl
+ << endl
+ << " If you need to break apart haplotype calls to obtain one class of alleles," << endl
+ << " run the call with default parameters, then normalize and subset the VCF:" << endl
+ << " freebayes ... | vcfallelicprimitives -kg >calls.vcf" << endl
<< " For example, to retain only biallelic SNPs:" << endl
+ << " <calls.vcf vcfsnps | vcfbiallelic >biallelic_snp_calls.vcf" << endl
+ << endl
<< "indel realignment:" << endl
<< endl
<< " -O --dont-left-align-indels" << endl
@@ -276,8 +297,13 @@ void Parameters::usage(char** argv) {
<< " to use the allele in analysis. default: 1" << endl
<< " --min-coverage N" << endl
<< " Require at least this coverage to process a site. default: 0" << endl
- << " --max-coverage N" << endl
- << " Do not process sites with greater than this coverage. default: no limit" << endl
+ << " --limit-coverage N" << endl
+ << " Downsample per-sample coverage to this level if greater than this coverage." << endl
+ << " default: no limit" << endl
+ << " -g --skip-coverage N" << endl
+ << " Skip processing of alignments overlapping positions with coverage >N." << endl
+ << " This filters sites above this coverage, but will also reduce data nearby." << endl
+ << " default: no limit" << endl
<< endl
<< "population priors:" << endl
<< endl
@@ -388,6 +414,7 @@ Parameters::Parameters(int argc, char** argv) {
outputFile = "";
gVCFout = false;
gVCFchunk = 0;
+ gVCFNoChunk = false; // --gvcf-dont-use-chunk "true" sets this to true
alleleObservationBiasFile = "";
// operation parameters
@@ -461,7 +488,8 @@ Parameters::Parameters(int argc, char** argv) {
probContamination = 10e-9;
//minAltQSumTotal = 0;
minCoverage = 0;
- maxCoverage = 0;
+ limitCoverage = 0;
+ skipCoverage = 0;
debuglevel = 0;
debug = false;
debug2 = false;
@@ -486,6 +514,7 @@ Parameters::Parameters(int argc, char** argv) {
{"vcf", required_argument, 0, 'v'},
{"gvcf", no_argument, 0, '8'},
{"gvcf-chunk", required_argument, 0, '&'},
+ {"gvcf-dont-use-chunk", required_argument, 0, '&'},
{"use-duplicate-reads", no_argument, 0, '4'},
{"no-partial-observations", no_argument, 0, '['},
{"use-best-n-alleles", required_argument, 0, 'n'},
@@ -507,15 +536,15 @@ Parameters::Parameters(int argc, char** argv) {
{"read-max-mismatch-fraction", required_argument, 0, 'z'},
{"read-snp-limit", required_argument, 0, '$'},
{"read-indel-limit", required_argument, 0, 'e'},
- {"no-indels", no_argument, 0, 'i'},
+ {"throw-away-indel-obs", no_argument, 0, 'i'},
{"dont-left-align-indels", no_argument, 0, 'O'},
- {"no-mnps", no_argument, 0, 'X'},
- {"no-complex", no_argument, 0, 'u'},
+ {"throw-away-mnps-obs", no_argument, 0, 'X'},
+ {"throw-away-complex-obs", no_argument, 0, 'u'},
{"max-complex-gap", required_argument, 0, 'E'},
{"haplotype-length", required_argument, 0, 'E'},
{"min-repeat-size", required_argument, 0, 'E'},
{"min-repeat-entropy", required_argument, 0, 'E'},
- {"no-snps", no_argument, 0, 'I'},
+ {"throw-away-snp-obs", no_argument, 0, 'I'},
{"indel-exclusion-window", required_argument, 0, 'x'},
{"theta", required_argument, 0, 'T'},
{"pvar", required_argument, 0, 'P'},
@@ -533,7 +562,8 @@ Parameters::Parameters(int argc, char** argv) {
//{"min-alternate-mean-mapq", required_argument, 0, 'k'},
{"min-alternate-qsum", required_argument, 0, '3'},
{"min-coverage", required_argument, 0, '!'},
- {"max-coverage", required_argument, 0, '+'},
+ {"limit-coverage", required_argument, 0, '+'},
+ {"skip-coverage", required_argument, 0, 'g'},
{"genotype-qualities", no_argument, 0, '='},
{"variant-input", required_argument, 0, '@'},
{"only-use-input-alleles", no_argument, 0, 'l'},
@@ -559,7 +589,7 @@ Parameters::Parameters(int argc, char** argv) {
while (true) {
int option_index = 0;
- c = getopt_long(argc, argv, "hcO4ZKjH[0diN5a)Ik=wl6#uVXJY:b:G:M:x:@:A:f:t:r:s:v:n:B:p:m:q:R:Q:U:$:e:T:P:D:^:S:W:F:C:&:L:8z:1:3:E:7:2:9:%:_:,:(:!:+:",
+ c = getopt_long(argc, argv, "hcO4ZKjH[0diN5a)Ik=wl6#uVXJY:b:G:M:x:@:A:f:t:r:s:v:n:B:p:m:q:R:Q:U:$:e:T:P:D:^:S:W:F:C:&:L:8z:1:3:E:7:2:9:%:_:,:(:!:+:g:",
long_options, &option_index);
if (c == -1) // end of options
@@ -635,8 +665,16 @@ Parameters::Parameters(int argc, char** argv) {
gVCFout = true;
break;
+ // -& --gvcf-dont-use-chunk BOOL, or --gvcf-chunk INT (they share the short option)
case '&':
- gVCFchunk = atoi(optarg);
+ //cerr << "optarg:\t" << optarg << endl;
+ if(optarg[0] == 't'){
+ gVCFNoChunk = true;
+ } else if (optarg[0] == 'f'){
+ gVCFNoChunk = false;
+ } else {
+ gVCFchunk = atoi(optarg);
+ }
break;
// -4 --use-duplicate-reads
@@ -668,10 +706,18 @@ Parameters::Parameters(int argc, char** argv) {
}
break;
- // -+ --max-coverage
+ // -+ --limit-coverage
case '+':
- if (!convert(optarg, maxCoverage)) {
- cerr << "could not parse max-coverage" << endl;
+ if (!convert(optarg, limitCoverage)) {
+ cerr << "could not parse limit-coverage" << endl;
+ exit(1);
+ }
+ break;
+
+ // -g --skip-coverage
+ case 'g':
+ if (!convert(optarg, skipCoverage)) {
+ cerr << "could not parse skip-coverage" << endl;
exit(1);
}
break;
=====================================
src/Parameters.h
=====================================
@@ -37,6 +37,7 @@ public:
string outputFile;
bool gVCFout; // -l --gvcf
int gVCFchunk;
+ bool gVCFNoChunk;
string variantPriorsFile;
string haplotypeVariantFile;
bool reportAllHaplotypeAlleles;
@@ -120,7 +121,8 @@ public:
int minAltCount; // -C --min-alternate-count
int minAltTotal; // -G --min-alternate-total
int minCoverage; // -! --min-coverage
- int maxCoverage; // -+ --max-coverage
+ int limitCoverage; // -+ --limit-coverage
+ int skipCoverage; // -g --skip-coverage
int debuglevel; // -d --debug increments
bool debug; // set if debuglevel >=1
bool debug2; // set if debuglevel >=2
=====================================
src/ResultData.cpp
=====================================
@@ -656,7 +656,10 @@ vcflib::Variant& Results::gvcf(
endPos = parser->currentPosition;
}
long numSites = endPos - startPos;
- assert(numSites > 0);
+ if(numSites <= 0){
+ std::cerr << "Hit end of chromosome but still attempted to call a location; exiting." << std::endl;
+ exit(1);
+ }
// set up site call
var.ref = parser->referenceSubstr(startPos, 1);
@@ -673,17 +676,15 @@ vcflib::Variant& Results::gvcf(
var.format.push_back("DP");
var.format.push_back("MIN_DP");
var.format.push_back("QR");
+ var.format.push_back("RO");
var.format.push_back("QA");
+ var.format.push_back("AO");
NonCall total = nonCalls.aggregateAll();
/* This resets min depth to zero if nonCalls is less than numSites. */
-
- int minDepth = total.minDepth;
-
- if(numSites != total.nCount){
- minDepth = 0;
- }
+
+ int minDepth = (numSites != total.nCount) ? 0 : total.minDepth;
var.info["DP"].push_back(convert((total.refCount+total.altCount) / numSites));
var.info["MIN_DP"].push_back(convert(minDepth));
@@ -707,17 +708,16 @@ vcflib::Variant& Results::gvcf(
/* This resets min depth to zero if nonCalls is less than numSites. */
-
- int minDepth = nc.minDepth;
-
- if(numSites != nc.nCount){
- minDepth = 0;
- }
-
+
+ int minDepth = (numSites != nc.nCount) ? 0 : nc.minDepth;
+
+
sampleOutput["DP"].push_back(convert((nc.refCount+nc.altCount) / numSites));
sampleOutput["MIN_DP"].push_back(convert(minDepth));
sampleOutput["QR"].push_back(convert(ln2phred(nc.reflnQ)));
+ sampleOutput["RO"].push_back(convert((nc.refCount/numSites)));
sampleOutput["QA"].push_back(convert(ln2phred(nc.altlnQ)));
+ sampleOutput["AO"].push_back(convert((nc.altCount/numSites)));
}
return var;
=====================================
src/freebayes.cpp
=====================================
@@ -106,7 +106,7 @@ int main (int argc, char *argv[]) {
out << parser->variantCallFile.header << endl;
}
- if (0 < parameters.maxCoverage) {
+ if (0 < parameters.limitCoverage) {
srand(13);
}
@@ -120,15 +120,19 @@ int main (int argc, char *argv[]) {
++total_sites;
DEBUG2("at start of main loop");
-
- // did we switch chromosomes or exceed our gVCF chunk size?
+
+ // did we switch chromosomes or exceed our gVCF chunk size, or do we not want to use chunks?
// if so, we may need to output a gVCF record
Results results;
- if (parameters.gVCFout && !nonCalls.empty() &&
- ( nonCalls.begin()->first != parser->currentSequenceName
- || (parameters.gVCFchunk &&
- nonCalls.lastPos().second - nonCalls.firstPos().second
- > parameters.gVCFchunk))) {
+ if (parameters.gVCFout
+ && !(nonCalls.empty())
+ && ( (parameters.gVCFNoChunk)
+ || (nonCalls.begin()->first != parser->currentSequenceName)
+ || (parameters.gVCFchunk
+ && nonCalls.lastPos().second - nonCalls.firstPos().second >= parameters.gVCFchunk
+ )
+ )
+ ){
vcflib::Variant var(parser->variantCallFile);
out << results.gvcf(var, nonCalls, parser) << endl;
nonCalls.clear();
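The reworked flush condition above can be read as a small predicate; a sketch with illustrative names (not the actual freebayes code):

```cpp
#include <cassert>
#include <string>

// Emit a buffered gVCF record when output is enabled, something is buffered,
// and either chunking is disabled (emit per base), the buffer belongs to a
// previous chromosome, or the buffered span has reached the chunk size.
bool shouldFlushGvcf(bool gvcfOut, bool bufferEmpty, bool noChunk,
                     const std::string& bufferSeq, const std::string& currentSeq,
                     long bufferedSpan, long chunkSize) {
    return gvcfOut && !bufferEmpty &&
           (noChunk
            || bufferSeq != currentSeq
            || (chunkSize && bufferedSpan >= chunkSize));
}
```

Note the change from `>` to `>=` on the span comparison, so a chunk of exactly `--gvcf-chunk` bases is flushed.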
@@ -157,7 +161,7 @@ int main (int argc, char *argv[]) {
} else if (parameters.onlyUseInputAlleles) {
DEBUG("no input alleles, but using only input alleles for analysis, skipping position");
skip = true;
- } else if (0 < parameters.maxCoverage) {
+ } else if (0 < parameters.limitCoverage) {
// go through each sample
for (Samples::iterator s = samples.begin(); s != samples.end(); ++s) {
string sampleName = s->first;
@@ -167,14 +171,14 @@ int main (int argc, char *argv[]) {
for (Sample::iterator sg = sample.begin(); sg != sample.end(); ++sg) {
sampleCoverage += sg->second.size();
}
- if (sampleCoverage <= parameters.maxCoverage) {
+ if (sampleCoverage <= parameters.limitCoverage) {
continue;
}
- DEBUG("coverage " << sampleCoverage << " for sample " << sampleName << " was > " << parameters.maxCoverage << ", so we will remove " << (sampleCoverage - parameters.maxCoverage) << " genotypes");
+ DEBUG("coverage " << sampleCoverage << " for sample " << sampleName << " was > " << parameters.limitCoverage << ", so we will remove " << (sampleCoverage - parameters.limitCoverage) << " genotypes");
vector<string> genotypesToErase;
do {
- double probRemove = (sampleCoverage - parameters.maxCoverage) / (double)sampleCoverage;
+ double probRemove = (sampleCoverage - parameters.limitCoverage) / (double)sampleCoverage;
vector<string> genotypesToErase;
// iterate through the genotypes
for (Sample::iterator sg = sample.begin(); sg != sample.end(); ++sg) {
@@ -182,7 +186,7 @@ int main (int argc, char *argv[]) {
// iterate through each allele
for (int alleleIndex = 0; alleleIndex < sg->second.size(); alleleIndex++) {
// only if we have more alleles to remove
- if (parameters.maxCoverage < sampleCoverage) {
+ if (parameters.limitCoverage < sampleCoverage) {
double r = rand() / (double)RAND_MAX;
if (r < probRemove) { // skip over this allele
sampleCoverage--;
@@ -205,7 +209,7 @@ int main (int argc, char *argv[]) {
for (vector<string>::iterator gt = genotypesToErase.begin(); gt != genotypesToErase.end(); ++gt) {
sample.erase(*gt);
}
- } while (parameters.maxCoverage < sampleCoverage);
+ } while (parameters.limitCoverage < sampleCoverage);
sampleCoverage = 0;
for (Sample::iterator sg = sample.begin(); sg != sample.end(); ++sg) {
sampleCoverage += sg->second.size();
@@ -679,7 +683,8 @@ int main (int argc, char *argv[]) {
}
// write the last gVCF record
- if (parameters.gVCFout && !nonCalls.empty()) {
+ // NOTE: for some reason this is only needed when using gVCF chunks; with minimal chunking it is not required, and in fact it breaks...
+ if (parameters.gVCFout && !nonCalls.empty() && !parameters.gVCFNoChunk) {
Results results;
vcflib::Variant var(parser->variantCallFile);
out << results.gvcf(var, nonCalls, parser) << endl;
=====================================
test/t/01_call_variants.t
=====================================
@@ -7,7 +7,7 @@ PATH=../bin:$PATH # for freebayes
PATH=../scripts:$PATH # for freebayes-parallel
PATH=../vcflib/bin:$PATH # for vcf binaries used by freebayes-parallel
-plan tests 21
+plan tests 24
is $(echo "$(comm -12 <(cat tiny/NA12878.chr22.tiny.giab.vcf | grep -v "^#" | cut -f 2 | sort) <(freebayes -f tiny/q.fa tiny/NA12878.chr22.tiny.bam | grep -v "^#" | cut -f 2 | sort) | wc -l) >= 13" | bc) 1 "variant calling recovers most of the GiAB variants in a test region"
@@ -111,8 +111,13 @@ is $(freebayes -f splice/1:883884-887618.fa splice/1:883884-887618.bam -F 0.05 -
is $(freebayes -f tiny/q.fa -F 0.2 tiny/NA12878.chr22.tiny.bam --gvcf | grep '<\*>' | wc -l) 20 "freebayes produces the expected number of lines of gVCF output"
-is $(freebayes -f tiny/q.fa -F 0.2 tiny/NA12878.chr22.tiny.bam --gvcf --gvcf-chunk 50 | grep '<\*>' | wc -l) 245 "freebayes produces the expected number of lines of gVCF output"
+is $(freebayes -f tiny/q.fa -F 0.2 tiny/NA12878.chr22.tiny.bam --gvcf --gvcf-chunk 50 | grep '<\*>' | wc -l) 250 "freebayes produces the expected number of lines of gVCF output"
+
+is $(freebayes -f tiny/q.fa -F 0.2 tiny/NA12878.chr22.tiny.bam --gvcf --gvcf-dont-use-chunk true | grep '<\*>' | wc -l) 12250 "freebayes produces the expected number of lines of gVCF output"
samtools view -h tiny/NA12878.chr22.tiny.bam | sed s/NA12878D_HiSeqX_R1.fastq.gz/222.NA12878D_HiSeqX_R1.fastq.gz/ | sed s/SM:1/SM:2/ >x.sam
is $(freebayes -f tiny/q.fa -F 0.2 tiny/NA12878.chr22.tiny.bam x.sam -A <(echo 1 8; echo 2 13) | grep 'AN=21' | wc -l) 19 "the CNV map may be used to specify per-sample copy numbers"
rm -f x.sam
+
+is $(freebayes -f tiny/q.fa --skip-coverage 30 tiny/NA12878.chr22.tiny.bam | grep -v '^#' | wc -l) 22 "freebayes makes the expected number of calls when capping coverage"
+is $(freebayes -f tiny/q.fa -g 30 tiny/NA12878.chr22.tiny.bam | vcfkeepinfo - DP | vcf2tsv | cut -f 8 | tail -n+2 | awk '$1 <= 30 { print }' | wc -l) 22 "all coverage capped calls are below the coverage threshold"
View it on GitLab: https://salsa.debian.org/med-team/freebayes/commit/3b14acd35f5feafcc42097dc92ee130c36e5537c