[med-svn] [sga] annotated tag upstream/0.10.13 created (now dffa83c)

Andreas Tille tille at debian.org
Mon Jan 11 12:36:47 UTC 2016


This is an automated email from the git hooks/post-receive script.

tille pushed a change to annotated tag upstream/0.10.13
in repository sga.

        at  dffa83c   (tag)
   tagging  3cae268067c8985f1d57b3ea8d407e0db5458dd3 (commit)
  replaces  v0.9.4
 tagged by  Andreas Tille
        on  Wed May 28 07:39:13 2014 +0200

- Log -----------------------------------------------------------------
Upstream version 0.10.13

Albert Vilella (8):
      a bit rough, but just to have an idea of what is needed
      adding prefix to walk method for convenience
      Merge github.com:jts/sga
      reverting to original walk
      adding prefix to walk for convenience to the pinball pipeline
      walk with prefix tweak
      deleting INSTALL for now
      fixed typo --component-paths should be --component-walks in cerr

Andreas Tille (1):
      Imported Upstream version 0.10.13

Cornelis Arnout Albers (32):
      Integrated Dindel
      it compiles!
      Integrated DindelRealignWindow into HapgenProcess
      Added VCFFile output
      Makes calls now, did some initial checking and debugging
      Changed calling to ModelSelection. Added EM haplotype frequency estimation. Seems to work.
      Fixed addSNP to candidate haplotypes.
      fixed a couple of bugs. Still debug version but seems to do sensible things.
      tested on 167924_A1 exome and seems to give already nice results.
      Fixed position bug. Fixed output of uncalled variants
      Fixed strand issue.
      Fixed quality score issue for haplotypes mapping with low quality scores to places in the reference. Added ID output in VCF
      Fixed getDistance: it now only uses reads that match the haplotype sequence without mismatch at the position of variant. Also fixed outputAsVCF: it now combines frequencies from haplotypes mapping to the same position.
      Fixed incorrect averaging in outputAsVCF
      fixed silly bug in computation of penalties for haplotype alignments
      Optimized and added INFO tag outputs.
      March 20 debug version
      Added SingleRead as replacement for MatePairs
      Fixed variant frequency.
      Changed FLANKING_SIZE to zero and DINDEL_DEBUG_3->1
      Merge branch 'graph-diff-v4' of /nfs/users/nfs_j/js18/work/git_repository/sga into graph-diff-v4
      Fixed -MAX_INT varqual bug. Made sure flankingHaplotypes is unique. SET MAPPING QUALITY TO 1000 for all candidate alignments
      Fixed mapping bug and -MAX_INT QUAL bug. realignMatePairs can be used to choose mate pair alignment, automatically sets FLANKING_SIZE to 1000.
      Added MultiSample EM and caller. Fixed homopolymer, added AmbiMap filter tag
      Added multisample EM caller.
      Fixed homopolymer error when variant is in last column
      Added genotyping.
      Genotyping in multisample caller seems to work.
      Fixed inf bug in genotyping
      fixing alignment bug
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Fixed DindelHaplotype::extractVariants() incorrect assertion

David Rio Deiros (2):
      Warning the user about the fact he will have issues merging
      Adding feature in preprocess step to remove adapter from reads.

Jared Simpson (627):
      Added extra information to sga stats --run-lengths
      Capped max run length in --run-lengths option to make the output more readable
      Refactored the BWT Markers into their own file. Also moved accumulation code out of RLBWT into the RLUnit
      Refactored RLUnit into its own class
      Changed name of FULL_COUNT define to RL_FULL_COUNT
      Re-implemented connect pipeline. Cleaned up sga-assemble
      Changed input parameters in sga-pipeline
      Skeleton of fm-merge subprogram
      Added skeleton code for the FMMerge processes
      Added BitVector class for storing large arrays of bits. Stubbed in some functionality of FM-merge.
      Started implementation of FMMergeProcess logic
      Rewrote overlapReadExact to fix the very rare case where a read has a proper prefix/suffix overlap to itself. This case
      Working version of fm-merge. Currently requires a remove duplicate edges operation which is sub optimal.
      Fixed bug in conflictConsensus algorithm where the error correction would not filter out true conflicts if the root base is the 3rd (or 4th) most frequent base but still above the cutoff.
      Implemented interleaved mode for sga preprocess
      Merge branch 'fm-merge'
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Implemented 1-bit sparse gap array for rmdup/qc
      Added a simple kmer caching scheme to the kmer-based corrector to avoid duplicate lookups in the fm-index.
      Modified k-mer corrector to take quality values into account when determining thresholds for correction.
      Changed default sample rate and gap array size for sga merge
      Implemented quality-aware overlap correction. The quality scores are used to select the cutoffs for the number of times a base needs to be seen to avoid being corrected away.
      Changed error corrector to print out a masked multi-overlap
      Fixed bug where fm-merge would crash if the graph contains a simple cycle
      Revised dead-end trim function to only remove if the branch is less than a minimum length. This makes the trimmer function properly on a graph constructed by fm-merge.
      Fixed very subtle bug in overlap computation when reads have multiple valid overlaps to each other.
      Added some extra comments
      Removed errant print from SGA/index
      Modified sga-rmdup to determine which read to keep based on the index of the reads, not their full id/name.
      Implemented mutex lock on BitVector to allow threads to update the bitvector atomically in fm-merge.
      Removed warning
      Improved the speed of sga-qc by an order of magnitude.
      Implemented more complex variation (bubble) removal algorithm.
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Implemented writing out variants to a file in fasta format.
      sga-qc: Moved the delete call for the BWTs to occur before the indices are rebuilt to avoid having two copies loaded at once needlessly.
      Removed unnecessary to-do warning from variant smoother
      Fixed bug where variant removal would assert if there was a degenerate bubble
      Added exit condition to SGSearch::findVariantWalks to avoid infinite loops in the case that the graph contains a non-branching cycle
      Implemented new SGSearchTree class for more efficient graph searching. All functionality
      Integrated new search tree code into the variation smoother. The results are subtly
      Modified output files to have a common prefix, which can be specified on the command line with the -o option.
      Integrated SGSearchTree into SGSearch::findWalks
      Updated configure.ac and the README to support use of the BamTools library.
      First implementation of sga connect using a BAM file. This version is functional but could be more efficient.
      Much faster version of BAM-based connect. Now uses substring operations when extracting the fragment instead of copying potentially very large strings constantly.
      Added statistics to sga connect to describe why the connection failed
      Cleaned up the connection code.
      Fixed sga-correct to search for a path with the correct length by subtracting the amount of the fragment that is present in the current contig.
      Added extra output to sga-scaffold
      Added ScaffoldGroup class to handle ordering a set of contigs.
      Added functions to compute the probability that two scaffold links are incorrectly ordered.
      Made the SGSearchTree a generic templated class so that it can also be used for the scaffolding module.
      Continued to make the graph search functions more generic.
      Started to implement searching for walks on a scaffold graph.
      Simplified construction of variation paths in the graph. More work on making the searching code more generic.
      Removed some dead code.
      Separated input of paired end and mate pair libraries for scaffolder.
      Added new function to remove transitive edges from the scaffold graph. Needs more work.
      Experimental layout algorithm for scaffolding
      Cleaned up code
      Removed unused object to remove warning.
      Added function to infer secondary links between nodes in a putative scaffold.
      Implemented connected components algorithm.
      Added new ScaffoldAlgorithms files and refactored some code into these.
      Wrote algorithm to compute a layout of the connected component for a scaffold starting from a terminal vertex. This function is the backbone of the scaffolding algorithm.
      The scaffolder now writes out scaffold statistics at the end of the program.
      Minor update to scaffolding output message formatting.
      First implementation of new scaffolding algorithm.
      Removed some prints
      Integrated Heng Li's stdaln dynamic programming library into ThirdParty. This is used in the variation removal algorithm to set an upper bound on how different two sequences can be and still be removed.
      Changed name of cigar line in variants file to make it clear it's an internally-used field and not for the fasta sequence that is output.
      Wrote compare-and-swap updates for the BitChar/BitVector data structures.
      Rewrote FMMergeProcess to use the compare-and-swap functionality in the bit vector. Removed the locks.
      Fixed scaffold2fasta to search for a path of the correct length (was using end to start distance instead of end to end).
      Wrote function to find and remove cycles in the scaffold graph
      Added status output message to the link validator
      Added filterBAM subprogram to attempt to get rid of bad MP reads.
      Imposed a max indel size on the variant resolution.
      Minor formatting tweaks.
      Implemented SV detection and removal for the scaffolder. Turned off by default.
      Changed scaffold2fasta to write out unplaced scaffolds and use a gap with a minimum length
      Added function to break the scaffold graph at positions that have conflicting distance estimates.
      Removed dot file output from scaffolder
      Updated version to 0.9.5
      Updated OverlapAlgorithm::_processIrreducibleBlocksExact to work in the current overlapping framework. It is now used by default.
      Moved astat.py from sgatools repository to sga/src/bin/sga-astat.py
      Implemented loading contigs from a fasta file for scaffold2fasta
      Cleaned up sga-scaffold to only print link validation messages if -v flag is given.
      Implemented dust filter for low complexity sequences in sga-preprocess as suggested by Albert Vilella.
      Updated README to reference the python modules that the pipeline scripts require.
      Added parameter to specificy the maximum number of bases to correct with the kmer corrector.
      Added parameter to specify the maximum number of bases to correct with the kmer corrector.
      Added warning to the overlap computation for when a substring read is found.
      Added genome size option to sga-astat.py as alternative to performing the bootstrap estimate
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Changed overlap correction QC. Now requires at least one overlap supporting each base in the read after correction.
      sga-preprocess will now append /1 or /2 to read names in pe-mode if the paired reads have the exact same name.
      Added flag to SGSearch::findWalks to specify whether any walks should be returned if the search was aborted. All uses of findWalks require an exhaustive search to be performed except the utility sga-walk.
      Switched the irreducible block algorithm back to inexact mode.
      [Issue GH-1] Changed shebang in python scripts to use /usr/bin/env so user's environment python is used. Reported and fix suggested by John St. John.
      [Issue GH-3]: sga-preprocess will stop reading the file if there is a fastq record with no sequence or quality values.
      Implemented quality score conversion from phred64 to phred33 for preprocess.
      Moved sga-mergeDriver.pl from sgatools into main repository.
      Wrote help message for sga-mergeDriver.pl
      Cleaned up handling of cycles in fm-merge.
      Added --kmer-distribution function to sga-stats. Re-enable the -x option for sga correct to set the min kmer coverage required.
      Rewrote the CorrectionThresholds to be a proper (singleton) class instead of a namespace.
      Fixed bug in OverlapAlgorithm::_processIrreducibleBlocksExact where the assertion was checking the wrong condition. Added --exact option to overlap to force the use of the exact irreducible algorithm.
      Rewrote the exact-mode irreducible block algorithm to be iterative instead of recursive.
      Made the exact-mode irreducible algorithm the default again for overlap/fm-merge.
      Added --no-overlap and --branch-cutoff options to sga-stats.
      Wrote experimental program sga-cluster to write out the connected components of a graph. Requested by Albert Vilella.
      Refactored KmerDistribution code into its own class.
      First pass at learning the kmer correction threshold.
      Added the contigs filename to the temporary output of bwa aln so the same reads can be mapped to different contigs at the same time without having a filename clash.
      Added -w parameter to sga-walk to allow the exact sequence of the walk to be specified.
      Changed the error correction metrics to use wider integers to avoid wrap arounds for very large data sets.
      Made the kmer corrector the default algorithm to use for sga-correct. Changed the default kmer size to 31.
      Made the temp files of the bwa aln step avoid using relative paths.
      Modified sga-astat.py so that it does not require a bam index file.
      Added exit statement to sga-bam2de.pl when the command line arguments are incorrect.
      added optional -k parameter to sga-bam2de.pl
      Fixed usage message for scaffold.
      Made command line argument to change the minimum walk distance in sga-connect
      Fixed tripped assertion in makeScaffolds for the very rare case that a terminal vertex cannot be found. Fixed stats output to avoid 32-bit integer wraparound.
      Changed help text for --max-distance option to filterBAM
      Rewrote sga-cluster to use the FM-index instead of an asqg file.
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Fixed warning in OverlapBlock
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Added new filtering modes to sga-filterBAM. Can now filter out pairs based on error rate, mapping quality and kmer depth.
      Fixed bug in filterBAM where an extra read pair would be erroneously output
      Removed the default seed stride parameter for sga-cluster as it would lead to some overlaps being missed.
      Made sga-cluster emit an error when a substring read is found.
      Bug fix in ScaffoldRecord to avoid outputting duplicate records for singleton scaffolds.
      Changed default gap storage parameter in sga/index
      Changed sga-scaffold arguments. Unified repeat/unique a-stat so any contig that does not meet the unique cutoff is deemed to be a repeat. Added --min-copy-number parameter to discard contigs that have a low (<0.3) estimated copy number.
      Changed sga-cluster so the temp file uses the same prefix as the -o parameter to prevent name clashes.
      Changed default gap array parameter back to 8.
      [Github issue 4] Made the error message for the case where a substring read is found during string graph construction to be more informative.
      Bumped version to 0.9.6
      Made the maximum distance estimate error when resolving gaps a command line parameter
      bumped version number
      Disabled RLBWT validation
      New mode to sga-walk which exhaustively finds all walks through the largest connected component of the input graph. Used in sga-cluster workflow.
      Modified k-mer corrector to only use the forward bwt when looking up k-mer counts. This effectively halves the memory usage of the correction step.
      Merged the duplicate removal and qc checks into a single process: sga-filter.
      Increased version to 0.9.8
      Rewrote the gap array to use compare and swap instructions when updating the base counts. This allows much better concurrency when merging/removing reads from a bwt.
      Added comments to the SparseGapArray
      More comment improvement
      Re-enabled -Wall as bamtools now compiles without warnings.
      Implemented caching of BWTIntervals for all strings of a given length. Currently used in the k-mer corrector.
      Changed BWTIntervalCache::lookup to take in a c-string to avoid an extra copy
      Code cleanup: created a parameter object for setting the error correction options in place of a constructor with many arguments
      Added option to sga-index to suppress constructing the reverse BWT. Modified sga-merge to avoid attempting to merge RBWTs if they don't exist.
      Added a directory with sga examples. Currently holds a script for a c. elegans assembly.
      Incremented version to 0.9.9.
      Added short README to the top level directory.
      Tweaked wording in top-level README
      Updated main README file.
      sga scaffold: allow loading sequences with ambiguity codes, disable the requirement that an a-statistic file is provided.
      Changed wording in a-stat warning
      Rewrote some ScaffoldRecord functions to take in a parameter object
      Factored out the scaffold input sequence container into ScaffoldSequenceCollection so that it does not require using a StringGraph
      Added std::map-based ScaffoldSequenceCollection for scaffolding sequences that do not belong to a graph.
      Handle Ns in Util::complement
      Fixed gcc 4.6 compile warnings
      Created files and framework for implementing bwa-sw algorithm. Not functional or usable.
      Updates to bwa-sw algorithm
      Implemented most of core bwasw algorithm. Output verified to match that of Heng Li's implementation. Does not restrict the number of nodes to track (z-best heuristic) or output the alignments. Lots of debug information in this version, not usable.
      Implemented saving found hits. Still very much development, not for use.
      First implementation of sampled suffix array data structure and gen-ssa subprogram.
      Implemented O(N) algorithm for constructing the sampled suffix array from a BWT. Implemented reading/writing SSA.
      Created new subprogram correct-long to hold bwa-sw porting work. Added generateCIGAR function to LRAlignment
      Macro'd out the bwa compatibility print statements
      Modified to read query sequences from a file.
      Implemented gapped MultiAlignment class
      Implemented function to transform a multiple alignment into a consensus string.
      Implemented cutTail function from bwasw to remove possibly erroneous cells from a stack. Currently it discards too much and removes a lot of valid hits.
      Improved version of cutTail that does not choose the cells to keep based purely on the highest score, but using the fraction of the maximum possible score for the cell. This will not be the final version of cutTail however as it still discards useful hits.
      long read aligner: Added two new methods of cutting the tails of a cell array
      Added --cut argument to sga correct-long to choose cell reduction heuristic
      Added extra output to MultiAlignment::print
      Added extra debugging info to the long read aligner.
      Added function to save the terminal hits in long-read alignment mode
      Removed some debug output from the long read aligner
      Removed a bunch of debug info and refactored some code from the main bwaswAlignment function
      More refactoring of bwaswAlgorithm. Changed saveHits to keep all good hits, instead of just the top 2 for every position.
      Code cleanup in LRAlignment
      Added new LRCorrection module to implement the long-read correction code
      Renamed LRHit data members to have more sensible names
      First pass at extending LRHits to be full-length alignments
      Further refinements of LRHit extension. Not complete
      Turned off hit extension for now
      Added simple function to find new LRHits based on overlapping reads.
      Check sem_init return codes to catch errors when this function is not implemented (on OSX). This is a stop-gap measure until I have time to switch to named semaphores.
      New experimental long read correction algorithm based on threading the read through a de Bruijn graph
      Resurrected bwa-like saveHits function. Removed targetString member from LRCell/LRHit.
      Refactored a bunch of code that uses stdaln into StdAlnTools. Started work on the new graph-based correction code.
      Started string-threading extension code.
      Added core extension functionality to StringThreader
      First implementation of extension dynamic programming code.
      Changed ExtensionDP to use an edit distance scoring system. Fixed off-by-one in makePaddedStrings
      Implemented function in ExtensionDP to print the full alignment.
      Added function to ExtensionDP to calculate the local error rate of the alignment.
      Integrating ExtensionDP into StringThreader. Initialization of alignment for root node complete.
      Added code to calculate the extended alignments for the StringThreaderNodes
      Full extension algorithm complete, including culling leaves once their error rate is too high.
      Added a few more helper functions to the string threading code. Currently, the search explodes when threading through repeats that are slightly
      New culling heuristic for StringThreader
      Implemented writing out the corrected reads to a file.
      Wrapped the long read correction process in a class to use in the SequenceProcessFramework.
      Implemented outputting corrected sequences from the StringThreader. Still a bit hacky.
      tweaks to graph-based correction algorithm
      Merge branch 'master' of git@github.com:jts/sga
      Made the extension termination condition more robust. StringThreader now returns a trimmed alignment to avoid bad-tails of the long reads.
      Added early exit to kmerCorrect when the read sequence is shorter than the kmer length.
      Added read length check to k-mer filter.
      Refactored cluster generation code into ReadCluster class. Added stub subprogram for cluster-extend.
      Implemented extension mode for sga cluster. This required refactoring the SequenceProcessFramework to use a generic generator object.
      Enabled control of the max cluster size
      Fixed formatting of sga cluster help
      Added some defines to sga cluster to hide away the hideous template function calls
      Changed description in sga-cluster boilerplate
      Added missing shortopt for --min-branch-length to sga assemble
      Merge branch 'long-align'
      Increased default minimum branch length in sga assemble to better handle long (150+) reads.
      Cleaned up build. Added a Makefile to install the scripts the sga scaffold pipeline requires (astat, bam2de).
      Deleting deprecated sga-pipeline script
      Added example script for the Illumina MiSeq example data. Includes the scaffolding component.
      Added ability to cluster based on seed sequences that are not present in the FM-index.
      Fixed cluster size computation
      Temporarily disabled the threaded version of multi-key quicksort.
      Added ability to compute overlaps between two disjoint sets of reads.
      Increased version number to 0.9.10
      Refactored sga filter. Moved arguments to QCProcess into a parameter object.
      Added function to compute an interval pair using cached intervals.
      Removed hard-coded threading option to bwa aln
      Added filter for homopolymer sequencing errors and very low complexity sequence
      Cleaned up comments
      Version 0.9.11
      Now sga walk will write out the reads making up the walk string in SAM format if the --sam option is given. This replaces --description-file.
      In sga filter, exit the homopolymer check if the read length is shorter than the k-mer size. The homopolymer and complexity checks are now disabled by default.
      Updated version to 0.9.12
      Cleaned up unused files
      Fixed a bug in the suffix array validation code. The validator assumed that suffixes with the same string were sorted by read name when they are actually sorted by position in the file. Thanks to Tomas Larsson for the bug report and test data.
      In-progress checkin of subprogram to convert a BEETL index file into SGA's format
      Fixed rare bug in the scaffold builder where a contig could be added with the wrong orientation if a unique walk is found between the contig pair with orientation opposite to that of the link.
      Version v0.9.13
      Removed debug code from convert-beetl
      convert-beetl now writes out an .sai file
      Created script to generate a BWT using beetl and convert it to SGA's format.
      Fixed bug where the reverse index would not be used when the overlap method of correction is specified.
      Added human genome assembly instructions and updated c. elegans script with the parameters used in the sga paper.
      Update human assembly instructions.
      Merge branch 'beetl'
      Added new subprogram to extract the set of sequences from a bwt
      Changed the sample rate in bwt2fa to use less memory
      Rewrote the small repeat resolution algorithm to be much faster. New algorithm is slightly more aggressive.
      Track the number of vertices that have been merged into each vertex to properly decide which walks to retain when removing bubbles.
      Version 0.9.14
      Fixed usage message for correct and correct-long
      Changed beetl index to use a named version of sga. Not for release yet
      modified SampledSuffixArray to optionally work over the lexicographic index only (no samples). gen-ssa now avoids loading the read names to save memory.
      Added skeleton of graph-diff program
      Enabled graph-diff, cleaned up help message
      Finished skeleton of graph-diff
      Initial k-mer traversal code implemented
      Made branch code detection in GraphCompare more efficient
      Added code to build the sequence of the bubbles once a differing k-mer has been found.
      Implemented bitvector marking of used k-mers to avoid outputting duplicate variants
      Added better reporting metrics for the success rate of the bubble discovery process.
      GraphCompare now writes out the variants that it finds in fasta format.
      Refactored BubbleBuilder code into its own file.
      More refactoring.
      Refactoring
      Added parameter object to GraphCompare
      Changed GraphCompare status print condition
      Implemented threaded mode for GraphCompare.
      Removed unnecessary assertion when a loop is found when building the target bubble
      Moved from the substring graph traversal algorithm to a more standard SequenceProcessFramework-based kmer traversal.
      Added logic to filter out low-frequency kmers when attempting the process on an uncorrected data set.
      Re-enabled writing out variants to a file.
      Added extra variant kmer marking to avoid double counting the bubble construction failure reasons
      Added coverage output to variants.fa
      Fixed boundary check for ignoring low-coverage edges
      Whitespace change only
      Created program and initial parsing code for var2vcf convertor.
      Reenabled sanity check insertion in var2vcf
      Added proper substring function to DNAString
      Implemented var2vcf to turn variants found by graph-diff into vcf records.
      Added quality filter and fixed VCF coordinate calculation in case where an insertion occurs along with a second variant.
      Added additional sanity checks in var2vcf to allow the processing a real human genome call set to go through.
      Removed print
      Allowed target portion of the bubble to branch. Controlled with the -y command line parameter.
      Added interval cache to graph-diff to speed up computation.
      Fixed -o/--outfile option to graph-diff
      Sort VCF file by the order of the reference sequences in the input BAM file instead of strict lexicographic ordering.
      Fixed variant file output name.
      Removed dust check TODO message, which was implemented in a previous commit
      Changed variation bubble builder to better support uncorrected sequence graphs.
      Added metagenome assembler program skeleton
      added parallel processing framework to the metagenome assembly subprogram
      Added skeleton of new hapgen program
      Added initial haplotype generation functionality
      Tweaked hapgen debug output
      hapgen: properly handle cases where the anchors cannot be found
      Wrote utility to build a simple multiple alignment from an array of strings
      First pass at extracting the reads mapping to haplotypes in hapgen
      correct-long: use sai file instead of ssa
      Merge branch 'small_repeat_rewrite'
      Fixed bug in small repeat resolution algorithm
      Brand new long read error correction based on haplotype generation code. The algorithm finds kmer anchors on the long reads, then builds putative haplotypes through a de Bruijn graph between them.
      Added skeleton of new sga gapfill program
      First pass at implementing gap filling logic
      Added more informative results stats to gapfill
      gapfill: the gap sequence is now placed into the scaffold and the new scaffolds written. First functional build
      Rewrote processGap/processScaffold to be cleaner and more robust
      When patching a scaffold gap, remove the overlapping input sequence instead of the gap sequence.
      Added descending kmer mode to gap filler and implemented first pass at choosing the assembled sequence which best fits the gap
      Do not correct a read unless two unique anchors can be found
      Changed long read error correction kmer size
      Implemented first pass at metagenomic assembly logic
      sga metagenome: implement compare and swap logic to avoid outputting duplicate contigs when two threads assemble the same sequence simultaneously
      Fixed metagenomics assembly logic when there is an in-branch into a repeat
      Now use BWTIntervalCache when calculating de Bruijn extensions
      Output start time for main beetl processes
      Testing a local coverage based coverage cutoff for the metagenomics prototype
      Added more output statistics to the gapfill module
      Merge branch 'metagenomics' into gapfill
      Write out beetl progress to a file.
      hapgen now extracts the piece of the reference that is being reassembled.
      hapgen: added function to MultiAlignment to construct an MA from local alignments. Added code to hapgen to pull out read pairs.
      Added threading options to scaffold driver scripts
      Merge branch 'gapfill' into hapgen
      Fixed error from merging gapfill branch
      Merge branch 'hapgen'
      Reverted back to the old repeat resolver code
      Made the gap fill start/end kmer sizes command line arguments
      Removed prints from scaffold sv resolver
      Added new filter to filterBAM to aggressively get rid of FR contamination in a mate pair library. Added new output to the Scaffold
      Added strict mode to scaffolder to only keep unambiguous connections in the scaffold graph
      v0.9.15
      Refined version of local coverage based metagenomic assembly
      Better implementation of coverage-cutoff based de Bruijn assembler for metagenomics
      v0.9.16
      Added explicit cast to avoid warning on some versions of gcc
      v0.9.17
      Implemented loading reference fm-index for graph-diff
      Stubbed in bwasw alignment of constructed haplotypes to reference
      Further implemented realignment of haplotypes to reference after discovery.
      Refactored code into HapgenUtil
      Implemented more helper functions for the hapgen process
      Initial merge and integration of Kees' dindel code
      Set the bamtools link flag in the case that bamtools is installed in a standard directory and --with-bamtools is not needed.
      Moved dindel code into a new function in GraphCompare
      Implemented testing variants with dindel separately for the normal/tumour
      Implemented function to get the edges of the de Bruijn graph from the FM-index using a single (forward) index. This can be used to cut the memory usage of some subprograms in half.
      Removed the reverse FM-index from graph-diff which is no longer needed.
      Revised SampledSuffixArray to use a uint32_t to store the ids of the lexicographically sorted reads. This cuts the memory of the data structure in half but limits it to 2**32 strings.
      Reformatted some dindel code to fit the style of the codebase
      Added sga-asqg2dot.pl helper script to bin directory
      Converted tabs to spaces in sga-asqg2dot
      Removed some cruft from sga-asqg2dot
      Fixed bug in read pair extraction
      Split tumour/normal calls into separate vcf files.
      Added code to perform a fairly basic selection of the best alignment position from a set of possibilities
      Adding functionality to graph-diff to test existing variants passed in via a VCF file
      Refactored dindel calling code into DindelUtil
      The previous commits that changed the representation of the lexicographic index in the SampledSuffixArray broke binary compatibility with previous files. Updated the magic number to catch these old binaries.
      Refactored more of the dindel wrapper code
      Added extra stats reporting to vcf tester
      Added extra information in the vcf testing mode of graph-diff to help understand why some variants were not found
      DindelRealign code now outputs to a stream instead of a file. Also, added more debug output in VCFTester
      More debugging output in VCFTester
      Removed some dindel assertions and changed the branch logic
      Made new dindel integration code thread safe and removed a bunch of prints
      Changed dindel assertion to a throw; modified the post-assembly walk finding algorithm to avoid performing enormous walks in the case that the graph has loops
      Changed another assertion to a warning/return code
      Merged Albert Vilella's change which allows adding a suffix to each read ID in sga preprocess
      Throw an error when the homopolymer length check fails instead of printing a warning
      If the best candidate haplotype is on the reverse strand of the reference, reverse-complement everything so the variants are on the right strand
      Trying version of code that relies on haplotype builder - kinda hacky
      Reverted back to using variation bubble builder in graph-diff
      Changed sga walk to remove contained reads and lingering transitive edges when --component-walks is specified.
      Fixed usage message for sga filter
      Skeleton code for new indexer
      In configure, specify -lbamtools in LIBS instead of LDFLAGS. This corrects the library link ordering and fixes the build in the case --as-needed is used in ld, as in newer versions of gcc.
      First semi-functional implementation of BCR for testing
      BCR cleanup
      BCR-constructed bwt is now written to disk.
      Started to integrate BCR into index. Made it more efficient by using 2-bit encoded strings everywhere
      BCR algorithm now writes out the reverse index. Made the algorithm choice a command line argument
      Fully integrated BCR with the BWT disk algorithm
      Removed print from BCR
      Fixed bug where the wrong data types were being written/read in the ssa files.
      Implemented a number of debug/development functions for graph-diff
      Enabled reference based calling, fixed memory errors
      Was using the wrong base BWT in non-reference mode
      Added new debug mode to graph-diff
      Fixed two performance issues in sga assemble.
      Removed many debugging print statements
      Added a new exception to Dindel to handle the case where the variant found lies at the beginning of one of the haplotypes
      Fixed memory stomp in DindelHaplotype constructor
      Merging latest changes to the dindel haplotype model by Kees Albers.
      Temporarily re-enabled some debug prints
      Fixed configure script to properly handle bamtools include/lib paths in the case where it was installed with make install
      Bumped version to v0.9.18
      Added debug code for Kees
      Turned off debug mode
      Merge Kees' branch with bug fixes for variants that are not being called
      Added error message when a sequence with a given ID cannot be found in the input scaffold/contig collection
      Merge Kees' branch, with a fix so that variants are output with respect to the correct reference strand
      Added overdepth filter to avoid running dindel on super deep regions
      Implemented a simple counting-based variant caller for debugging
      Investigating poor variant calling performance on mouse genome data. Added a lightweight profiler.
      Revised profiler
      New heuristics to improve the running time when calling variants versus a reference genome
      Fixed bug in extractHaplotypeReads where it would incorrectly flag some haplotypes as being too deep.
      Merge branch 'graph-diff-v2' of /nfs/users/nfs_c/caa/source/sga_merge into graph-diff-v2
      Merging bug fix from Kees
      More debugging code
      More debugging code
      Modified sga-cluster-extend to warn instead of exit when some seed that is passed in is a substring of a read.
      graph-diff can now output multiple candidate haplotypes. Added a min-depth parameter to avoid traversing low-coverage k-mers.
      Implemented --longest-n parameter for sga-walk --component-walk.
      Integrate code from github.com/jts/misc
      Removed long-used Algorithm/ErrorCorrect code
      First implementation of new correction algorithm, which allows arbitrary overlaps between reads. Not for production use
      Refined the kmer matching portion of the overlap calculation for the new corrector.
      Updated 3rd party code with improvements to the overlapper and multiple alignment
      Changed parameters in consensus algorithm in new overlapper
      Updated third party code
      Integrated new overlap method which extends an existing alignment. Considerably faster than previous method.
      Updated third party code
      New overlap corrector will use the -r/--rounds parameter to iteratively correct reads. This can lead to better correction accuracy but decreases correction throughput
      Keep leading directories when parsing the reference filename
      sga-rmdup now writes out the number of copies of each sequence in the header line of the fasta file.
      Removed unused parameter in sga walk
      Changed the ASQG parser to only warn if the TE tag is not present instead of aborting
      Emit an error and exit when a vertex record is truncated.
      Make sure that edge records have the correct number of fields
      Changed warning message when the operating system is OSX
      Merge branch 'new-indexer'
      Added Illumina's notice regarding the rights to the BCR algorithm
      Fixed assertion tripped by short filenames when checking for gzip extension
      Filter the abyss-generated insert size histogram to avoid very long DistanceEst runtime.
      Added option to sga-filter to remove substring sequences only.
      v0.9.19
      Huge debugging hacks to investigate why we are missing some variants.
      Started to implement coherency-based haplotype builder.
      Merge branch 'master' into graph-diff-v3
      Merge branch 'it-correct' into graph-diff-v3
      First implementation of read-coherent haplotype generation. Way too slow so far.
      New "kmer witness" algorithm
      Removed hacked-in hardcoded path
      Removed debug output
      New method of deriving haplotypes from all the reads sharing a new kmer.
      More conservative generation of haplotypes.
      Rewrote method of inferring haplotypes from read coherent kmers
      Read coherent haplotype builder now extends to new variant kmers
      Parameter tweaks
      Merged Kees' latest code.
      Improved version of the haplotype generator. Lots of testing/debug code still in this version
      Allow singleton kmers to recruit new reads
      Check if quality string exists before adjusting for removed adapters
      Integrated new multiple alignment code
      Aggressively collapse conflicting bases during haplotype construction. This is only temporary.
      First pass at overlap-based haplotype builder. This version is not functional
      Extremely crude version of inexact string graph haplotyping code
      Reverted to old overlap code. Not functional
      New version of the variant algorithm that is based on inexact overlaps
      Fixed bug where empty initial haplotypes caused a crash
      Tweaks to RCHB
      First pass of string-graph based haplotype builder. Slow!
      Refactored the big k-mer based overlapper into a new file. Optimized some functions in OverlapHaplotypeBuilder
      Abort graph construction if no corrected read contains the initial kmer.
      Temporarily disabled constraint requiring complete construction of parallel bubble
      Set up a kmer->vertex map to avoid huge computation inserting reads into the graph.
      Stop the graph extension if there are too many tips in the graph.
      various optimizations to improve the running time of the overlap constructor
      Changed std::map/std::set into hashmap/hashset
      Save kmer indices instead of actual kmers sequences
      Moved overlap parameter into the parameters object
      New k-mer based haplotype to reference alignment, more restrictive haplotype assembly
      Suppress construction of parallel haplotype
      Extension vertices now labelled with the direction of extension
      Separated walk candidates into left/right join positions
      Avoid attempting to build covering paths when there are unambiguous chains of join vertices
      Printing changes only.
      Recursively trim tips from the graph
      Min overlap length is now a command line parameter
      Only extend the graph in one direction - this removes the effect of "back bubbles" making the graph too complex to resolve
      Revised trimming logic so it does not iteratively trim the whole branch. Testing lower correction kmer.
      More restrictive alignment
      Cap the number of differences to the reference genome at 8
      sga-cluster extend mode can now be limited to a maximum number of iterations
      Allow the user to define a set of sequences that are used to stop extension in sga-cluster
      Enabled the de Bruijn graph based QC check of candidate haplotypes
      Debug code, synching with Kees
      Refined the StringThread correction method
      Implemented a less restrictive check for when the graphs converge
      Re-enabled haplotype QC
      Set a positive default value for the minimum contig length in sga-bam2de.pl
      Do not pass -s to DistanceEst twice
      New haplotype QC which counts the number of branches off a haplotype. Not used, just for information
      Fixed infinite loop in haplotype QC for short haplotypes
      Merge branch 'graph-diff-v4' of /nfs/users/nfs_c/caa/source/sga-graph-diff-v4-copy into graph-diff-v4
      Fixed assertion when a haplotype aligned to the end of a chromosome
      Fixed error in the warning when a kmer threshold cannot be found for error correction
      Refactoring the variant calling code into its own directory
      Removed abandoned class
      More refactoring
      Removed dead code
      Renamed VariationBubbleBuilder to VariationBuilderCommon
      Refactored BuilderCommon code into VariationBuilderCommon
      Allow scaffolds to contain full IUPAC ambiguity codes
      Added new flag to SeqReader to avoid changing lower case bases to upper
      Skip all IUPAC codes when finding anchors for gap filling
      Refactored kmer masking code into a separate function
      Refactoring GraphCompare
      Added BWTIndexSet container. Massive refactoring of code to use it.
      Refactored de Bruijn haplotype builder into a new file
      Removed old debug code
      Refactored haplotype QC into its own function
      Moved HapgenUtil
      The algorithm used during haplotype assembly (dbg vs string graph) is a command line option
      Cleaned up parameters
      More option cleanup
      Allow sga-deinterleave.pl to read gzipped files.
      --debruijn should not take a parameter
      If one haplotype fails QC, do not attempt to assemble a variant
      Cleaned up prints in OverlapHaplotypeBuilder
      Started to implement multi-sample calling
      Integrated the read groups into DindelUtil code
      Added assertion warning for mate-pair mode
      Minor formatting change
      Clean up assertions so empty BWTs can be written after filtering
      When resolving scaffold gaps over ambiguity codes, the flanking sequence of the filled gap may not match that of the scaffold anymore.
      Version v0.9.20
      Fix substring assertion when null strings are passed to calculateDustScore
      Removed debug prints
      Use the new version of BEETL. Rewrote convert-beetl to use far less memory.
      sga-beetl-index.pl now converts fastq to fasta
      sga-beetl-index now has a no-convert option
      sga-merge will not merge population indices
      sga-merge uses the full path to the input files so they do not need to be in the working directory
      Merge /nfs/users/nfs_c/caa/source/sga-graph-diff-refactor into graph-diff-refactor
      Use full path to indices
      Use correct file status for popidx
      Attempt to fix assertion when counting homopolymer lengths
      Tweaked read extraction parameters, cleaned up some debug output
      Fix the way that homopolymer runs are counted
      Lowered default storage level for merging BWTs during sga-index
      Move out of bounds check to inside loop
      Early exit from the overlap function when the input read is shorter than min_overlap
      Added subprogram to evaluate how well we can detect mutations with k-mers
      Build parallel haplotypes and reduce the mapping kmer size
      Added homopolymer filter. Dindel code now outputs VCFRecords
      sga-graphdiff now directly outputs the final set of calls.
      Fixed compiler warning
      Made haplotype QC less strict
      Disable homopolymer filter
      Reverted to using a 31-mer for mapping. Lowered MAX_READS.
      Fixed cluster extend --iterations option so it properly extends for N rounds, not to N reads
      added new option to scaffold2fasta --write-names, which outputs the names of the contigs that make up the scaffold
      Reversed order of contigs when building scaffolds from right-to-left
      Fixed missing ID for singleton scaffolds
      Apply a minimum coverage of 2 during haplotype QC in non-ref mode
      Require at least two occurrences in the base sequence when making comparative calls
      temporary hack to fix crash when trying to extract pairs of reads from an unpaired index
      scaffold2fasta: write the orientation of the contigs when --write-names is specified
      No longer extracts read mates. Cleaned up prints
      Merge branch 'graph-diff-refactor'
      v0.9.3
      Fixed how the name of the BWT file is computed when the fastq file is gzipped
      When merging, if input reads are fastq/gzipped, write to the same
      Accept .fq as a fastq file extension when choosing the output name for sga-merge
      Fixed unused variable warning
      issue 21: merge fails when filename contains a '.' that is not part of the file extension. fixed by more careful handling of gzipped suffixes
      Fixed GCC 4.6 warnings
      Started implementation of quality scores for variant calling pipeline
      Integrated quality scores into the variant calling pipeline
      Set GraphCompare verbosity to the user's requested value
      Integrating Heng Li's ropebwt code
      Wrote a new VCFCollections wrapper to pass sample names to Dindel
      BWT writing now functional in ropebwt
      Merge branch 'master' into quality-scores
      ropebwt: .sai file is now written, reversed index can be constructed
      Merge branch 'ropebwt'
      Ropebwt algorithm now uses the command line threading parameter
      v0.9.31
      Merge branch 'master' of github.com:jts/sga
      Updated sga-index help text
      Update README to credit Heng's ropebwt implementation
      Updated examples to use ropebwt
      Merge branch 'master' of /nfs/users/nfs_c/caa/source/sga-basequals
      Perform semi-global haplotype realignment within dindel
      Merge branch 'quality-scores'
      Using quality scores is now a command line option
      Merge branch 'master' of /nfs/users/nfs_c/caa/source/sga-basequals
      Removed print
      Reverted to global alignment for haplotype-haplotype alignments
      v0.9.32
      When removing transitive edges from the scaffold graph, check the orientation of contigs in the layout
      More aggressive cycle detection/removal in --strict mode of sga scaffolder
      Cleanup code and comments
      Merge branch 'master' of github.com:jts/sga
      v0.9.33
      Explicitly construct the .sai file when using ropebwt since you cannot get the lexicographic index from ropebwt when read lengths vary
      v0.9.34
      Whitespace changes
      Fixed compilation warnings pointed out by Zhang Feng.
      Add a dependency check to bam2de.pl and set the --mind parameter
      Change default min distance to -99 bases
      Set the --mina option to abyss
      Fix divide-by-zero in sga-astat
      github issue 25: implement writing orphaned pairs to a file during preprocess
      github issue 27: added --no-primer-check option to preprocess. Also, cleaned up help message.
      github issue 26: Removed references to old --quality-scale parameter
      github issue 14: sga index should exit gracefully when the input file is empty.
      When building the FM-index using ropebwt the lexicographic index is built using openmp if the compiler supports it.
      Implemented an upper limit on the number of edges we allow a vertex to have before giving up on using it in the assembly graph.
      v0.9.35
      Added namespace to fix compile error on OSX

Jason Stajich (1):
      Seems like sga-align could be run with threads so that bwa runs multithreaded and is faster. Is there any reason not to do this?

Nathan S. Watson-Haigh (8):
      Just need to specify base name of the FASTA files.
      Merge branch 'master' of git://github.com/jts/sga
      Merge branch 'master' of git://github.com/jts/sga
      Consistently display help when no command arguments are given.
      Added support for .f and .r read pair suffixes found in reads output by sff_extract version < 0.3.0.
      Send info about which file is being processed to STDOUT.
      Additional info (algorithm used) sent to STDOUT when using SAIS - this is to be consistent with the output when BCR is used.
      Merge remote-tracking branch 'upstream/master'

Shaun Jackman (1):
      ld_set is static inline. Closes #29

jts (8):
      Merge pull request #6 from avilella/master
      Merge pull request #7 from avilella/master
      Merge pull request #11 from hyphaltip/patch-1
      Merge pull request #12 from drio/master
      Merge pull request #19 from mh11/77a4bc7aacd7b838fc3097af515e2f27ffecde3e
      Merge pull request #20 from nathanhaigh/master
      Merge pull request #22 from nathanhaigh/master
      Merge pull request #30 from sjackman/patch-1

mh11 (3):
      Allow building FWD and REV indexes separately to improve speed
      Suppress merging of indexes / sequence files
      Enable suppression of index creation also for memory only run + code formatting (spaces)

-----------------------------------------------------------------------

No new revisions were added by this update.

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/sga.git


