[med-svn] [sga] annotated tag upstream/0.10.13 created (now dffa83c)
Andreas Tille
tille at debian.org
Mon Jan 11 12:36:47 UTC 2016
This is an automated email from the git hooks/post-receive script.
tille pushed a change to annotated tag upstream/0.10.13
in repository sga.
at dffa83c (tag)
tagging 3cae268067c8985f1d57b3ea8d407e0db5458dd3 (commit)
replaces v0.9.4
tagged by Andreas Tille
on Wed May 28 07:39:13 2014 +0200
- Log -----------------------------------------------------------------
Upstream version 0.10.13
Albert Vilella (8):
a bit rough, but just to have an idea of what is needed
adding prefix to walk method for convenience
Merge github.com:jts/sga
reverting to original walk
adding prefix to walk for convenience to the pinball pipeline
walk with prefix tweak
deleting INSTALL for now
fixed typo --component-paths should be --component-walks in cerr
Andreas Tille (1):
Imported Upstream version 0.10.13
Cornelis Arnout Albers (32):
Integrated Dindel
it compiles!
Integrated DindelRealignWindow into HapgenProcess
Added VCFFile output
Makes calls now, did some initial checking and debugging
Changed calling to ModelSelection. Added EM haplotype frequency estimation. Seems to work.
Fixed addSNP to candidate haplotypes.
fixed a couple of bugs. Still debug version but seems to do sensible things.
tested on 167924_A1 exome and seems to give already nice results.
Fixed position bug. Fixed output of uncalled variants
Fixed strand issue.
Fixed quality score issue for haplotypes mapping with low quality scores to places in the reference. Added ID output in VCF
Fixed getDistance: it now only uses reads that match the haplotype sequence without mismatch at the position of the variant. Also fixed outputAsVCF: it now combines frequencies from haplotypes mapping to the same position.
Fixed incorrect averaging in outputAsVCF
fixed silly bug in computation of penalties for haplotype alignments
Optimized and added INFO tag outputs.
March 20 debug version
Added SingleRead as replacement for MatePairs
Fixed variant frequency.
Changed FLANKING_SIZE to zero and DINDEL_DEBUG_3->1
Merge branch 'graph-diff-v4' of /nfs/users/nfs_j/js18/work/git_repository/sga into graph-diff-v4
Fixed -MAX_INT varqual bug. Made sure flankingHaplotypes is unique. SET MAPPING QUALITY TO 1000 for all candidate alignments
Fixed mapping bug and -MAX_INT QUAL bug. realignMatePairs can be used to choose mate pair alignment, automatically sets FLANKING_SIZE to 1000.
Added MultiSample EM and caller. Fixed homopolymer, added AmbiMap filter tag
Added multisample EM caller.
Fixed homopolymer error when variant is in last column
Added genotyping.
Genotyping in multisample caller seems to work.
Fixed inf bug in genotyping
fixing alignment bug
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Fixed DindelHaplotype::extractVariants() incorrect assertion
David Rio Deiros (2):
Warning the user about the fact he will have issues merging
Adding feature in preprocess step to remove adapter from reads.
Jared Simpson (627):
Added extra information to sga stats --run-lengths
Capped max run length in --run-lengths option to make the output more readable
Refactored the BWT Markers into their own file. Also moved accumulation code out of RLBWT into the RLUnit
Refactored RLUnit into its own class
Changed name of FULL_COUNT define to RL_FULL_COUNT
Re-implemented connect pipeline. Cleaned up sga-assemble
Changed input parameters in sga-pipeline
Skeleton of fm-merge subprogram
Added skeleton code for the FMMerge processes
Added BitVector class for storing large arrays of bits. Stubbed in some functionality of FM-merge.
Started implementation of FMMergeProcess logic
Rewrote overlapReadExact to fix the very rare case where a read has a proper prefix/suffix overlap to itself. This case
Working version of fm-merge. Currently requires a remove duplicate edges operation, which is suboptimal.
Fixed bug in conflictConsensus algorithm where the error correction would not filter out true conflicts if the root base is the 3rd (or 4th) most frequent base but still above the cutoff.
Implemented interleaved mode for sga preprocess
Merge branch 'fm-merge'
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Implemented 1-bit sparse gap array for rmdup/qc
Added a simple kmer caching scheme to the kmer-based corrector to avoid duplicate lookups in the fm-index.
Modified k-mer corrector to take quality values into account when determining thresholds for correction.
Changed default sample rate and gap array size for sga merge
Implemented quality-aware overlap correction. The quality scores are used to select the cutoffs for the number of times a base needs to be seen to avoid being corrected away.
Changed error corrector to print out a masked multi-overlap
Fixed bug where fm-merge would crash if the graph contains a simple cycle
Revised dead-end trim function to only remove if the branch is less than a minimum length. This makes the trimmer function properly on a graph constructed by fm-merge.
Fixed very subtle bug in overlap computation when reads have multiple valid overlaps to each other.
Added some extra comments
Removed errant print from SGA/index
Modified sga-rmdup to determine which read to keep based on the index of the reads, not their full id/name.
Implemented mutex lock on BitVector to allow threads to update the bitvector atomically in fm-merge.
Removed warning
Improved the speed of sga-qc by an order of magnitude.
Implemented more complex variation (bubble) removal algorithm.
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Implemented writing out variants to a file in fasta format.
sga-qc: Moved the delete call for the BWTs to occur before the indices are rebuilt to avoid having two copies loaded at once needlessly.
Removed unnecessary to-do warning from variant smoother
Fixed bug where variant removal would assert if there was a degenerate bubble
Added exit condition to SGSearch::findVariantWalks to avoid infinite loops in the case that the graph contains a non-branching cycle
Implemented new SGSearchTree class for more efficient graph searching. All functionality
Integrated new search tree code into the variation smoother. The results are subtly
Modified output files to have a common prefix, which can be specified on the command line with the -o option.
Integrated SGSearchTree into SGSearch::findWalks
Updated configure.ac and the README to support use of the BamTools library.
First implementation of sga connect using a BAM file. This version is functional but could be more efficient.
Much faster version of BAM-based connect. Now uses substring operations when extracting the fragment instead of copying potentially very large strings constantly.
Added statistics to sga connect to describe why the connection failed
Cleaned up the connection code.
Fixed sga-correct to search for a path with the correct length by subtracting the amount of the fragment that is present in the current contig.
Added extra output to sga-scaffold
Added ScaffoldGroup class to handle ordering a set of contigs.
Added functions to compute the probability that two scaffold links are incorrectly ordered.
Made the SGSearchTree a generic templated class so that it can also be used for the scaffolding module.
Continued to make the graph search functions more generic.
Started to implement searching for walks on a scaffold graph.
Simplified construction of variation paths in the graph. More work on making the searching code more generic.
Removed some dead code.
Separated input of paired end and mate pair libraries for scaffolder.
Added new function to remove transitive edges from the scaffold graph. Needs more work.
Experimental layout algorithm for scaffolding
Cleaned up code
Removed unused object to remove warning.
Added function to infer secondary links between nodes in a putative scaffold.
Implemented connected components algorithm.
Added new ScaffoldAlgorithms files and refactored some code into these.
Wrote algorithm to compute a layout of the connected component for a scaffold starting from a terminal vertex. This function is the backbone of the scaffolding algorithm.
The scaffolder now writes out scaffold statistics at the end of the program.
Minor update to scaffolding output message formatting.
First implementation of new scaffolding algorithm.
Removed some prints
Integrated Heng Li's stdaln dynamic programming library into ThirdParty. This is used in the variation removal algorithm to set an upper bound on how different two sequences can be and still be removed.
Changed name of cigar line in variants file to make it clear it's an internally used field and not for the fasta sequence that is output.
Wrote compare-and-swap updates for the BitChar/BitVector data structures.
Rewrote FMMergeProcess to use the compare-and-swap functionality in the bit vector. Removed the locks.
Fixed scaffold2fasta to search for a path of the correct length (was using end to start distance instead of end to end).
Wrote function to find and remove cycles in the scaffold graph
Added status output message to the link validator
Added filterBAM subprogram to attempt to get rid of bad MP reads.
Imposed a max indel size on the variant resolution.
Minor formatting tweaks.
Implemented SV detection and removal for the scaffolder. Turned off by default.
Changed scaffold2fasta to write out unplaced scaffolds and use a gap with a minimum length
Added function to break the scaffold graph at positions that have conflicting distance estimates.
Removed dot file output from scaffolder
Updated version to 0.9.5
Updated OverlapAlgorithm::_processIrreducibleBlocksExact to work in the current overlapping framework. It is now used by default.
Moved astat.py from sgatools repository to sga/src/bin/sga-astat.py
Implemented loading contigs from a fasta file for scaffold2fasta
Cleaned up sga-scaffold to only print link validation messages if -v flag is given.
Implemented dust filter for low complexity sequences in sga-preprocess as suggested by Albert Vilella.
Updated README to reference the python modules that the pipeline scripts require.
Added parameter to specificy the maximum number of bases to correct with the kmer corrector.
Added parameter to specify the maximum number of bases to correct with the kmer corrector.
Added warning to the overlap computation for when a substring read is found.
Added genome size option to sga-astat.py as alternative to performing the bootstrap estimate
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Changed overlap correction QC. Now requires at least one overlap supporting each base in the read after correction.
sga-preprocess will now append /1 or /2 to read names in pe-mode if the paired reads have the exact same name.
Added flag to SGSearch::findWalks to specify whether any walks should be returned if the search was aborted. All uses of findWalks require an exhaustive search to be performed except the utility sga-walk.
Switched the irreducible block algorithm back to inexact mode.
[Issue GH-1] Changed shebang in python scripts to use /usr/bin/env so user's environment python is used. Reported and fix suggested by John St. John.
[Issue GH-3]: sga-preprocess will stop reading the file if there is a fastq record with no sequence or quality values.
Implemented quality score conversion from phred64 to phred33 for preprocess.
Moved sga-mergeDriver.pl from sgatools into main repository.
Wrote help message for sga-mergeDriver.pl
Cleaned up handling of cycles in fm-merge.
Added --kmer-distribution function to sga-stats. Re-enable the -x option for sga correct to set the min kmer coverage required.
Rewrote the CorrectionThresholds to be a proper (singleton) class instead of a namespace.
Fixed bug in OverlapAlgorithm::_processIrreducibleBlocksExact where the assertion was checking the wrong condition. Added --exact option to overlap to force the use of the exact irreducible algorithm.
Rewrote the exact-mode irreducible block algorithm to be iterative instead of recursive.
Made the exact-mode irreducible algorithm the default again for overlap/fm-merge.
Added --no-overlap and --branch-cutoff options to sga-stats.
Wrote experimental program sga-cluster to write out the connected components of a graph. Requested by Albert Vilella.
Refactored KmerDistribution code into its own class.
First pass at learning the kmer correction threshold.
Added the contigs filename to the temporary output of bwa aln so the same reads can be mapped to different contigs at the same time without having a filename clash.
Added -w parameter to sga-walk to allow the exact sequence of the walk to be specified.
Changed the error correction metrics to use wider integers to avoid wrap arounds for very large data sets.
Made the kmer corrector the default algorithm to use for sga-correct. Changed the default kmer size to 31.
Made the temp files of the bwa aln step avoid using relative paths.
Modified sga-astat.py so that it does not require a bam index file.
Added exit statement to sga-bam2de.pl when the command line arguments are incorrect.
added optional -k parameter to sga-bam2de.pl
Fixed usage message for scaffold.
Added a command line argument to change the minimum walk distance in sga-connect
Fixed tripped assertion in makeScaffolds for the very rare case that a terminal vertex cannot be found. Fixed stats output to avoid 32-bit integer wraparound.
Changed help text for --max-distance option to filterBAM
Rewrote sga-cluster to use the FM-index instead of an asqg file.
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Fixed warning in OverlapBlock
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Added new filtering modes to sga-filterBAM. Can now filter out pairs based on error rate, mapping quality and kmer depth.
Fixed bug in filterBAM where an extra read pair would be erroneously output
Removed the default seed stride parameter for sga-cluster as it would lead to some overlaps being missed.
Made sga-cluster emit an error when a substring read is found.
Bug fix in ScaffoldRecord to avoid outputting duplicate records for singleton scaffolds.
Changed default gap storage parameter in sga/index
Changed sga-scaffold arguments. Unified repeat/unique a-stat so any contig that does not meet the unique cutoff is deemed to be a repeat. Added --min-copy-number parameter to discard contigs that have a low (<0.3) estimated copy number.
Changed sga-cluster so the temp file uses the same prefix as the -o parameter to prevent name clashes.
Changed default gap array parameter back to 8.
[Github issue 4] Made the error message for the case where a substring read is found during string graph construction to be more informative.
Bumped version to 0.9.6
Made the maximum distance estimate error when resolving gaps a command line parameter
bumped version number
Disabled RLBWT validation
New mode to sga-walk which exhaustively finds all walks through the largest connected component of the input graph. Used in sga-cluster workflow.
Modified k-mer corrector to only use the forward bwt when looking up k-mer counts. This effectively halves the memory usage of the correction step.
Merged the duplicate removal and qc checks into a single process: sga-filter.
Increased version to 0.9.8
Rewrote the gap array to use compare and swap instructions when updating the base counts. This allows much better concurrency when merging/removing reads from a bwt.
Added comments to the SparseGapArray
More comment improvement
Re-enabled -Wall as bamtools now compiles without warnings.
Implemented caching of BWTIntervals for all strings of a given length. Currently used in the k-mer corrector.
Changed BWTIntervalCache::lookup to take in a c-string to avoid an extra copy
Code cleanup: created a parameter object for setting the error correction options in place of a constructor with many arguments
Added option to sga-index to suppress constructing the reverse BWT. Modified sga-merge to avoid attempting to merge RBWTs if they don't exist.
Added a directory with sga examples. Currently holds a script for a C. elegans assembly.
Incremented version to 0.9.9.
Added short README to the top level directory.
Tweaked wording in top-level README
Updated main README file.
sga scaffold: allow loading sequences with ambiguity codes, disable the requirement that an a-statistic file is provided.
Changed wording in a-stat warning
Rewrote some ScaffoldRecord functions to take in a parameter object
Factored out the scaffold input sequence container into ScaffoldSequenceCollection so that it does not require using a StringGraph
Added std::map-based ScaffoldSequenceCollection for scaffolding sequences that do not belong to a graph.
Handle Ns in Util::complement
Fixed gcc 4.6 compile warnings
Created files and framework for implementing bwa-sw algorithm. Not functional or usable.
Updates to bwa-sw algorithm
Implemented most of core bwasw algorithm. Output verified to match that of Heng Li's implementation. Does not restrict the number of nodes to track (z-best heuristic) or output the alignments. Lots of debug information in this version, not usable.
Implemented saving found hits. Still very much development, not for use.
First implementation of sampled suffix array data structure and gen-ssa subprogram.
Implemented O(N) algorithm for constructing the sampled suffix array from a BWT. Implemented reading/writing SSA.
Created new subprogram correct-long to hold bwa-sw porting work. Added generateCIGAR function to LRAlignment
Macro'd out the bwa compatibility print statements
Modified to read query sequences from a file.
Implemented gapped MultiAlignment class
Implemented function to transform a multiple alignment into a consensus string.
Implemented cutTail function from bwasw to remove possibly erroneous cells from a stack. Currently it discards too much and removes a lot of valid hits.
Improved version of cutTail that does not choose the cells to keep based purely on the highest score, but uses the fraction of the maximum possible score for the cell. This will not be the final version of cutTail, however, as it still discards useful hits.
long read aligner: Added two new methods of cutting the tails of a cell array
Added --cut argument to sga correct-long to choose cell reduction heuristic
Added extra output to MultiAlignment::print
Added extra debugging info to the long read aligner.
Added function to save the terminal hits in long-read alignment mode
Removed some debug output from the long read aligner
Removed a bunch of debug info and refactored some code from the main bwaswAlignment function
More refactoring of bwaswAlgorithm. Changed saveHits to keep all good hits, instead of just the top 2 for every position.
Code cleanup in LRAlignment
Added new LRCorrection module to implement the long-read correction code
Renamed LRHit data members to have more sensible names
First pass at extending LRHits to be full-length alignments
Further refinements of LRHit extension. Not complete
Turned off hit extension for now
Added simple function to find new LRHits based on overlapping reads.
Check sem_init return codes to catch errors when this function is not implemented (on OSX). This is a stop-gap measure until I have time to switch to named semaphores.
New experimental long read correction algorithm based on threading the read through a de Bruijn graph
Resurrected bwa-like saveHits function. Removed targetString member from LRCell/LRHit.
Refactored a bunch of code that uses stdaln into StdAlnTools. Started work on the new graph-based correction code.
Started string-threading extension code.
Added core extension functionality to StringThreader
First implementation of extension dynamic programming code.
Changed ExtensionDP to use an edit distance scoring system. Fixed off-by-one in makePaddedStrings
Implemented function in ExtensionDP to print the full alignment.
Added function to ExtensionDP to calculate the local error rate of the alignment.
Integrating ExtensionDP into StringThreader. Initialization of alignment for root node complete.
Added code to calculate the extended alignments for the StringThreaderNodes
Full extension algorithm complete, including culling leaves once their error rate is too high.
Added a few more helper functions to the string threading code. Currently, the search explodes when threading through repeats that are slightly
New culling heuristic for StringThreader
Implemented writing out the corrected reads to a file.
Wrapped the long read correction process in a class to use in the SequenceProcessFramework.
Implemented outputting corrected sequences from the StringThreader. Still a bit hacky.
tweaks to graph-based correction algorithm
Merge branch 'master' of git@github.com:jts/sga
Made the extension termination condition more robust. StringThreader now returns a trimmed alignment to avoid bad-tails of the long reads.
Added early exit to kmerCorrect when the read sequence is shorter than the kmer length.
Added read length check to k-mer filter.
Refactored cluster generation code into ReadCluster class. Added stub subprogram for cluster-extend.
Implemented extension mode for sga cluster. This required refactoring the SequenceProcessFramework to use a generic generator object.
Enabled control of the max cluster size
Fixed formatting of sga cluster help
Added some defines to sga cluster to hide away the hideous template function calls
Changed description in sga-cluster boilerplate
Added missing shortopt for --min-branch-length to sga assemble
Merge branch 'long-align'
Increased default minimum branch length in sga assemble to better handle long (150+) reads.
Cleaned up build. Added a Makefile to install the scripts the sga scaffold pipeline requires (astat, bam2de).
Deleting deprecated sga-pipeline script
Added example script for the Illumina MiSeq example data. Includes the scaffolding component.
Added ability to cluster based on seed sequences that are not present in the FM-index.
Fixed cluster size computation
Temporarily disabled the threaded version of multi-key quicksort.
Added ability to compute overlaps between two disjoint sets of reads.
Increased version number to 0.9.10
Refactored sga filter. Moved arguments to QCProcess into a parameter object.
Added function to compute an interval pair using cached intervals.
Removed hard-coded threading option to bwa aln
Added filter for homopolymer sequencing errors and very low complexity sequence
Cleaned up comments
Version 0.9.11
Now sga walk will write out the reads making up the walk string in SAM format if the --sam option is given. This replaces --description-file.
In sga filter, exit the homopolymer check if the read length is shorter than the k-mer size. The homopolymer and complexity checks are now disabled by default.
Updated version to 0.9.12
Cleaned up unused files
Fixed a bug in the suffix array validation code. The validator assumed that suffixes with the same string were sorted by read name when they are actually sorted by position in the file. Thanks to Tomas Larsson for the bug report and test data.
In-progress checkin of subprogram to convert a BEETL index file into SGA's format
Fixed rare bug in the scaffold builder where a contig could be added with the wrong orientation if a unique walk is found between the contig pair with orientation opposite to that of the link.
Version v0.9.13
Removed debug code from convert-beetl
convert-beetl now writes out an .sai file
Created script to generate a BWT using beetl and convert it to SGA's format.
Fixed bug where the reverse index would not be used when the overlap method of correction is specified.
Added human genome assembly instructions and updated C. elegans script with the parameters used in the sga paper.
Update human assembly instructions.
Merge branch 'beetl'
Added new subprogram to extract the set of sequences from a bwt
Changed the sample rate in bwt2fa to use less memory
Rewrote the small repeat resolution algorithm to be much faster. New algorithm is slightly more aggressive.
Track the number of vertices that have been merged into each vertex to properly decide which walks to retain when removing bubbles.
Version 0.9.14
Fixed usage message for correct and correct-long
Changed beetl index to use a named version of sga. Not for release yet
modified SampledSuffixArray to optionally work over the lexicographic index only (no samples). gen-ssa now avoids loading the read names to save memory.
Added skeleton of graph-diff program
Enabled graph-diff, cleaned up help message
Finished skeleton of graph-diff
Initial k-mer traversal code implemented
Made branch code detection in GraphCompare more efficient
Added code to build the sequence of the bubbles once a differing k-mer has been found.
Implemented bitvector marking of used k-mers to avoid outputting duplicate variants
Added better reporting metrics for the success rate of the bubble discovery process.
GraphCompare now writes out the variants that it finds in fasta format.
Refactored BubbleBuilder code into its own file.
More refactoring.
Refactoring
Added parameter object to GraphCompare
Changed GraphCompare status print condition
Implemented threaded mode for GraphCompare.
Removed unnecessary assertion when a loop is found when building the target bubble
Moved from the substring graph traversal algorithm to a more standard SequenceProcessFramework-based kmer traversal.
Added logic to filter out low-frequency kmers when attempting the process on an uncorrected data set.
Re-enabled writing out variants to a file.
Added extra variant kmer marking to avoid double counting the bubble construction failure reasons
Added coverage output to variants.fa
Fixed boundary check for ignoring low-coverage edges
Whitespace change only
Created program and initial parsing code for var2vcf convertor.
Reenabled sanity check insertion in var2vcf
Added proper substring function to DNAString
Implemented var2vcf to turn variants found by graph-diff into vcf records.
Added quality filter and fixed VCF coordinate calculation in case where an insertion occurs along with a second variant.
Added additional sanity checks in var2vcf to allow the processing of a real human genome call set to go through.
Removed print
Allowed target portion of the bubble to branch. Controlled with the -y command line parameter.
Added interval cache to graph-diff to speed up computation.
Fixed -o/--outfile option to graph-diff
Sort VCF file by the order of the reference sequences in the input BAM file instead of strict lexicographic ordering.
Fixed variant file output name.
Removed dust check TODO message, which was implemented in a previous commit
Changed variation bubble builder to better support uncorrected sequence graphs.
Added metagenome assembler program skeleton
added parallel processing framework to the metagenome assembly subprogram
Added skeleton of new hapgen program
Added initial haplotype generation functionality
Tweaked hapgen debug output
hapgen: properly handle cases where the anchors cannot be found
Wrote utility to build a simple multiple alignment from an array of strings
First pass at extracting the reads mapping to haplotypes in hapgen
correct-long: use sai file instead of ssa
Merge branch 'small_repeat_rewrite'
Fixed bug in small repeat resolution algorithm
Brand new long read error correction based on haplotype generation code. The algorithm finds kmer anchors on the long reads, then builds putative haplotypes through a de Bruijn graph between them.
Added skeleton of new sga gapfill program
First pass at implementing gap filling logic
Added more informative results stats to gapfill
gapfill: the gap sequence is now placed into the scaffold and the new scaffolds written. First functional build
Rewrote processGap/processScaffold to be cleaner and more robust
When patching a scaffold gap, remove the overlapping input sequence instead of the gap sequence.
Added descending kmer mode to gap filler and implemented first pass at choosing the assembled sequence which best fits the gap
Do not correct a read unless two unique anchors can be found
Changed long read error correction kmer size
Implemented first pass at metagenomic assembly logic
sga metagenome: implement compare and swap logic to avoid outputting duplicate contigs when two threads assemble the same sequence simultaneously
Fixed metagenomics assembly logic when there is an in-branch into a repeat
Now use BWTIntervalCache when calculating de Bruijn extensions
Output start time for main beetl processes
Testing a local coverage based coverage cutoff for the metagenomics prototype
Added more output statistics to the gapfill module
Merge branch 'metagenomics' into gapfill
Write out beetl progress to a file.
hapgen now extracts the piece of the reference that is being reassembled.
hapgen: added function to MultiAlignment to construct an MA from local alignments. Added code to hapgen to pull out read pairs.
Added threading options to scaffold driver scripts
Merge branch 'gapfill' into hapgen
Fixed error from merging gapfill branch
Merge branch 'hapgen'
Reverted back to the old repeat resolver code
Made the gap fill start/end kmer sizes command line arguments
Removed prints from scaffold sv resolver
Added new filter to filterBAM to aggressively get rid of FR contamination in a mate pair library. Added new output to the Scaffold
Added strict mode to scaffolder to only keep unambiguous connections in the scaffold graph
v0.9.15
Refined version of local coverage based metagenomic assembly
Better implementation of coverage-cutoff based de Bruijn assembler for metagenomics
v0.9.16
Added explicit cast to avoid warning on some versions of gcc
v0.9.17
Implemented loading reference fm-index for graph-diff
Stubbed in bwasw alignment of constructed haplotypes to reference
Further implemented realignment of haplotypes to reference after discovery.
Refactored code into HapgenUtil
Implemented more helper functions for the hapgen process
Initial merge and integration of Kees' dindel code
Set the bamtools link flag in the case that bamtools is installed in a standard directory and --with-bamtools is not needed.
Moved dindel code into a new function in GraphCompare
Implemented testing variants with dindel separately for the normal/tumour
Implemented function to get the edges of the de Bruijn graph from the FM-index using a single (forward) index. This can be used to cut the memory usage of some subprograms in half.
Removed the reverse FM-index from graph-diff which is no longer needed.
Revised SampledSuffixArray to use a uint32_t to store the ids of the lexicographically sorted reads. This cuts the memory of the data structure in half but limits it to 2**32 strings.
Reformatted some dindel code to fit the style of the codebase
Added sga-asqg2dot.pl helper script to bin directory
Converted tabs to spaces in sga-asqg2dot
Removed some cruft from sga-asqg2dot
Fixed bug in read pair extraction
Split tumour/normal calls into separate vcf files.
Added code to perform a fairly basic selection of the best alignment position from a set of possibilities
Adding functionality to graph-diff to test existing variants passed in via a VCF file
Refactored dindel calling code into DindelUtil
The previous commits that changed the representation of the lexicographic index in the SampledSuffixArray broke binary compatibility with previous files. Updated the magic number to catch these old binaries.
Refactored more of the dindel wrapper code
Added extra stats reporting to vcf tester
Added extra information in the vcf testing mode of graph-diff to help understand why some variants were not found
DindelRealign code now outputs to a stream instead of a file. Also, added more debug output in VCFTester
More debugging output in VCFTester
Removed some dindel assertions and changed the branch logic
Made new dindel integration code thread safe and removed a bunch of prints
Changed dindel assertion to a throw; modified the post-assembly walk finding algorithm to avoid performing enormous walks in the case that the graph has loops
Changed another assertion to a warning/return code
Merged Albert Vilella's change which allows adding a suffix to each read ID in sga preprocess
Throw an error when the homopolymer length check fails instead of printing a warning
If the best candidate haplotype is on the reverse strand of the reference, reverse-complement everything so the variants are on the right strand
Trying version of code that relies on haplotype builder - kinda hacky
Reverted back to using variation bubble builder in graph-diff
Changed sga walk to remove contained reads and lingering transitive edges when --component-walks is specified.
Fixed usage message for sga filter
Skeleton code for new indexer
In configure, specify -lbamtools in LIBS instead of LDFLAGS. This corrects the library link ordering and fixes the build in the case --as-needed is used in ld, as in newer versions of gcc.
First semi-functional implementation of BCR for testing
BCR cleanup
BCR-constructed bwt is now written to disk.
Started to integrate BCR into index. Made it more efficient by using 2-bit encoded strings everywhere
BCR algorithm now writes out the reverse index. Made the algorithm choice a command line argument
Fully integrated BCR with the BWT disk algorithm
Removed print from BCR
Fixed bug where the wrong data types were being written/read in the ssa files.
Implemented a number of debug/development functions for graph-diff
Enabled reference based calling, fixed memory errors
Was using the wrong base BWT in non-reference mode
Added new debug mode to graph-diff
Fixed two performance issues in sga assemble.
Removed many debugging print statements
Added a new exception to Dindel to handle the case where the variant found lies at the beginning of one of the haplotypes
Fixed memory stomp in DindelHaplotype constructor
Merging latest changes to the dindel haplotype model by Kees Albers.
Temporarily re-enabled some debug prints
Fixed configure script to properly handle bamtools include/lib paths in the case where it was installed with make install
Bumped version to v0.9.18
Added debug code for Kees
Turned off debug mode
Merge Kees' branch with bug fixes for variants that were not being called
Added error message when a sequence with a given ID cannot be found in the input scaffold/contig collection
Merge Kees' branch, with a fix so that variants are output with respect to the correct reference strand
Added overdepth filter to avoid running dindel on super deep regions
Implemented a simple counting-based variant caller for debugging
Investigating poor variant calling performance on mouse genome data. Added a lightweight profiler.
Revised profiler
New heuristics to improve the running time when calling variants versus a reference genome
Fixed bug in extractHaplotypeReads where it would incorrectly flag some haplotypes as being too deep.
Merge branch 'graph-diff-v2' of /nfs/users/nfs_c/caa/source/sga_merge into graph-diff-v2
Merging bug fix from Kees
More debugging code
More debugging code
Modified sga-cluster-extend to warn instead of exit when some seed that is passed in is a substring of a read.
graph-diff can now output multiple candidate haplotypes. Added a min-depth parameter to avoid traversing low-coverage k-mers.
Implemented --longest-n parameter for sga-walk --component-walks.
Integrate code from github.com/jts/misc
Removed long-unused Algorithm/ErrorCorrect code
First implementation of new correction algorithm, which allows arbitrary overlaps between reads. Not for production use
Refined the kmer matching portion of the overlap calculation for the new corrector.
Updated 3rd party code with improvements to the overlapper and multiple alignment
Changed parameters in consensus algorithm in new overlapper
Updated third party code
Integrated new overlap method which extends an existing alignment. Considerably faster than previous method.
Updated third party code
New overlap corrector will use the -r/--rounds parameter to iteratively correct reads. This can lead to better correction accuracy but decreases correction throughput
Keep leading directories when parsing the reference filename
sga-rmdup now writes out the number of copies of each sequence in the header line of the fasta file.
Removed unused parameter in sga walk
Changed the ASQG parser to only warn if the TE tag is not present instead of aborting
Emit an error and exit when a vertex record is truncated.
Make sure that edge records have the correct number of fields
Changed warning message when the operating system is OSX
Merge branch 'new-indexer'
Added Illumina's notice regarding the rights to the BCR algorithm
Fixed assertion tripped by short filenames when checking for gzip extension
Filter the abyss-generated insert size histogram to avoid very long DistanceEst runtime.
Added option to sga-filter to remove substring sequences only.
v0.9.19
Huge debugging hacks to investigate why we are missing some variants.
Started to implement coherency-based haplotype builder.
Merge branch 'master' into graph-diff-v3
Merge branch 'it-correct' into graph-diff-v3
First implementation of read-coherent haplotype generation. Way too slow so far.
New "kmer witness" algorithm
Removed hacked-in hardcoded path
Removed debug output
New method of deriving haplotypes from all the reads sharing a new kmer.
More conservative generation of haplotypes.
Rewrote method of inferring haplotypes from read coherent kmers
Read coherent haplotype builder now extends to new variant kmers
Parameter tweaks
Merged Kees' latest code.
Improved version of the haplotype generator. Lots of testing/debug code still in this version
Allow singleton kmers to recruit new reads
Check if quality string exists before adjusting for removed adapters
Integrated new multiple alignment code
Aggressively collapse conflicting bases during haplotype construction. This is only temporary.
First pass at overlap-based haplotype builder. This version is not functional
Extremely crude version of inexact string graph haplotyping code
Reverted to old overlap code. Not functional
New version of the variant algorithm that is based on inexact overlaps
Fixed bug where empty initial haplotypes caused a crash
Tweaks to RCHB
First pass of string-graph based haplotype builder. Slow!
Refactored the big k-mer based overlapper into a new file. Optimized some functions in OverlapHaplotypeBuilder
Abort graph construction if no corrected read contains the initial kmer.
Temporarily disabled constraint requiring complete construction of parallel bubble
Set up a kmer->vertex map to avoid huge computation inserting reads into the graph.
Stop the graph extension if there are too many tips in the graph.
various optimizations to improve the running time of the overlap constructor
Changed std::map/std::set into hashmap/hashset
Save kmer indices instead of actual kmer sequences
Moved overlap parameter into the parameters object
New k-mer based haplotype to reference alignment, more restrictive haplotype assembly
Suppress construction of parallel haplotype
Extension vertices now labelled with the direction of extension
Separated walk candidates into left/right join positions
Avoid attempting to build covering paths when there are unambiguous chains of join vertices
Printing changes only.
Recursively trim tips from the graph
Min overlap length is now a command line parameter
Only extend the graph in one direction - this removes the effect of "back bubbles" making the graph too complex to resolve
Revised trimming logic so it does not iteratively trim the whole branch. Testing lower correction kmer.
More restrictive alignment
Cap the number of differences to the reference genome at 8
sga-cluster extend mode can now be limited to a maximum number of iterations
Allow the user to define a set of sequences that are used to stop extension in sga-cluster
Enabled the de Bruijn graph based QC check of candidate haplotypes
Debug code, synching with Kees
Refined the StringThread correction method
Implemented a less restrictive check for when the graphs converge
Re-enabled haplotype QC
Set a positive default value for the minimum contig length in sga-bam2de.pl
Do not pass -s to DistanceEst twice
New haplotype QC which counts the number of branches off a haplotype. Not used, just for information
Fixed infinite loop in haplotype QC for short haplotypes
Merge branch 'graph-diff-v4' of /nfs/users/nfs_c/caa/source/sga-graph-diff-v4-copy into graph-diff-v4
Fixed assertion when a haplotype aligned to the end of a chromosome
Fixed error in the warning when a kmer threshold cannot be found for error correction
Refactoring the variant calling code into its own directory
Removed abandoned class
More refactoring
Removed dead code
Renamed VariationBubbleBuilder to VariationBuilderCommon
Refactored BuilderCommon code into VariationBuilderCommon
Allow scaffolds to contain full IUPAC ambiguity codes
Added new flag to SeqReader to avoid changing lower case bases to upper
Skip all IUPAC codes when finding anchors for gap filling
Refactored kmer masking code into a separate function
Refactoring GraphCompare
Added BWTIndexSet container. Massive refactoring of code to use it.
Refactored de Bruijn haplotype builder into a new file
Removed old debug code
Refactored haplotype QC into its own function
Moved HapgenUtil
The algorithm used during haplotype assembly (dbg vs string graph) is a command line option
Cleaned up parameters
More option cleanup
Allow sga-deinterleave.pl to read gzipped files.
--debruijn should not take a parameter
If one haplotype fails QC, do not attempt to assemble a variant
Cleaned up prints in OverlapHaplotypeBuilder
Started to implement multi-sample calling
Integrated the read groups into DindelUtil code
Added assertion warning for mate-pair mode
Minor formatting change
Clean up assertions so empty BWTs can be written after filtering
When resolving scaffold gaps over ambiguity codes, the flanking sequence of the filled gap may not match that of the scaffold anymore.
Version v0.9.20
Fix substring assertion when null strings are passed to calculateDustScore
Removed debug prints
Use the new version of BEETL. Rewrote convert-beetl to use far less memory.
sga-beetl-index.pl now converts fastq to fasta
sga-beetl-index now has a no-convert option
sga-merge will not merge population indices
sga-merge uses the full path to the input files so they do not need to be in the working directory
Merge /nfs/users/nfs_c/caa/source/sga-graph-diff-refactor into graph-diff-refactor
Use full path to indices
Use correct file status for popidx
Attempt to fix assertion when counting homopolymer lengths
Tweaked read extraction parameters, cleaned up some debug output
Fix the way that homopolymer runs are counted
Lowered default storage level for merging BWTs during sga-index
Move out of bounds check to inside loop
Early exit from the overlap function when the input read is shorter than min_overlap
Added subprogram to evaluate how well we can detect mutations with k-mers
Build parallel haplotypes and reduce the mapping kmer size
Added homopolymer filter. Dindel code now outputs VCFRecords
sga-graphdiff now directly outputs the final set of calls.
Fixed compiler warning
Made haplotype QC less strict
Disable homopolymer filter
Reverted to using a 31-mer for mapping. Lowered MAX_READS.
Fixed cluster extend --iterations option so it properly extends for N rounds, not to N reads
added new option to scaffold2fasta --write-names, which outputs the names of the contigs that make up the scaffold
Reversed order of contigs when building scaffolds from right-to-left
Fixed missing ID for singleton scaffolds
Apply a minimum coverage of 2 during haplotype QC in non-ref mode
Require at least two occurrences in the base sequence when making comparative calls
temporary hack to fix crash when trying to extract pairs of reads from an unpaired index
scaffold2fasta: write the orientation of the contigs when --write-names is specified
No longer extracts read mates. Cleaned up prints
Merge branch 'graph-diff-refactor'
v0.9.3
Fixed how the name of the BWT file is computed when the fastq file is gzipped
When merging, if input reads are fastq/gzipped, write to the same
Accept .fq as a fastq file extension when choosing the output name for sga-merge
Fixed unused variable warning
issue 21: merge fails when filename contains a '.' that is not part of the file extension. fixed by more careful handling of gzipped suffixes
Fixed GCC 4.6 warnings
Started implementation of quality scores for variant calling pipeline
Integrated quality scores into the variant calling pipeline
Set GraphCompare verbosity to the user's requested value
Integrating Heng Li's ropebwt code
Wrote a new VCFCollections wrapper to pass sample names to Dindel
BWT writing now functions in ropebwt
Merge branch 'master' into quality-scores
ropebwt: .sai file is now written, reversed index can be constructed
Merge branch 'ropebwt'
Ropebwt algorithm now uses the command line threading parameter
v0.9.31
Merge branch 'master' of github.com:jts/sga
Updated sga-index help text
Update README to credit Heng's ropebwt implementation
Updated examples to use ropebwt
Merge branch 'master' of /nfs/users/nfs_c/caa/source/sga-basequals
Perform semi-global haplotype realignment within dindel
Merge branch 'quality-scores'
Using quality scores is now a command line option
Merge branch 'master' of /nfs/users/nfs_c/caa/source/sga-basequals
Removed print
Reverted to global alignment for haplotype-haplotype alignments
v0.9.32
When removing transitive edges from the scaffold graph, check the orientation of contigs in the layout
More aggressive cycle detection/removal in --strict mode of sga scaffolder
Cleanup code and comments
Merge branch 'master' of github.com:jts/sga
v0.9.33
Explicitly construct the .sai file when using ropebwt since you cannot get the lexicographic index from ropebwt when read lengths vary
v0.9.34
Whitespace changes
Fixed compilation warnings pointed out by Zhang Feng.
Add a dependency check to bam2de.pl and set the --mind parameter
Change default min distance to -99 bases
Set the --mina option to abyss
Fix divide-by-zero in sga-astat
github issue 25: implement writing orphaned pairs to a file during preprocess
github issue 27: added --no-primer-check option to preprocess. Also, cleaned up help message.
github issue 26: Removed references to old --quality-scale parameter
github issue 14: sga index should exit gracefully when the input file is empty.
When building the FM-index using ropebwt the lexicographic index is built using openmp if the compiler supports it.
Implemented an upper limit on the number of edges we allow a vertex to have before giving up on using it in the assembly graph.
v0.9.35
Added namespace to fix compile error on OSX
Jason Stajich (1):
Seems like sga-align could be run with threads so that bwa uses multithreading to be faster. Is there any reason not to do this?
Nathan S. Watson-Haigh (8):
Just need to specify base name of the FASTA files.
Merge branch 'master' of git://github.com/jts/sga
Merge branch 'master' of git://github.com/jts/sga
Consistently display help when no command arguments are given.
Added support for .f and .r read pair suffixes found in reads output by sff_extract version < 0.3.0.
Send info about which file is being processed to STDOUT.
Additional info (algorithm used) sent to STDOUT when using SAIS - this is to be consistent with the output when BCR is used.
Merge remote-tracking branch 'upstream/master'
Shaun Jackman (1):
ld_set is static inline. Closes #29
jts (8):
Merge pull request #6 from avilella/master
Merge pull request #7 from avilella/master
Merge pull request #11 from hyphaltip/patch-1
Merge pull request #12 from drio/master
Merge pull request #19 from mh11/77a4bc7aacd7b838fc3097af515e2f27ffecde3e
Merge pull request #20 from nathanhaigh/master
Merge pull request #22 from nathanhaigh/master
Merge pull request #30 from sjackman/patch-1
mh11 (3):
Allow building the FWD and REV indexes separately to improve speed
Suppress merging of indexes / sequence files
Enable suppression of index creation also for memory only run + code formatting (spaces)
-----------------------------------------------------------------------
No new revisions were added by this update.
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/sga.git