[med-svn] [SCM] sga annotated tag, v0.9.4, created. v0.9.4

Jared Simpson js18 at sanger.ac.uk
Thu Nov 8 08:10:55 UTC 2012


The annotated tag, v0.9.4 has been created
        at  93a16eba40529d103a35af3c18565a04f35d6a07 (tag)
   tagging  235cf4366ff3b3244a1121b6b5910ca004830f46 (commit)
 tagged by  Jared Simpson
        on  Fri Nov 19 15:12:15 2010 +0000

- Shortlog ------------------------------------------------------------
Tagging v0.9.4

Jared Simpson (603):
      adding new project
      Importing stub files
      Importing configuration files
      Added edge labels to dotty output
      Renamed IVertex to Vertex
      Added vertex merging and removal
      added simplify() function which removes transitive edges from the seqgraph
      Added README
      added function to load edges into the graph
      Initial import of UniEst, Util directories
      First version of UniEst is complete
      In progress check-in of scaffolding code
      Scaffolding in-progress checkin.
      Fixed horrible bug in UniEst and added some better command line parsing
      - Implemented contig uniqueness estimator by overhanging pairs, the performance is similar to depth estimation for long (>= 100bp) contigs but worse for small
      - UniEst: reworked command line arguments. The align file is now required but inference over depth can be disabled with the --no_depth flag. The --no_pair flag is removed, it is automatically set when a paired/hist file are passed in.
      UniEst: Added graph-based uniqueness inference. It does not work particularly well.
      Cleaned up resolve
      Committing test code stub, unit tests will go here
      Added automake files
      Refactored the SeqGraph to be a template.
      More refactoring.
      More refactoring, renamed the SeqGraph module to Bigraph which is more general
      Refactored scaffold code to use new templated Bigraph class.
      In-progress checkin of experimental distance estimation code. Lots of testing/debug hooks in BDE.cpp
      Initial checkin of development suffixtree and bwt code
      Added suffix tree
      Big refactoring of BWT code
      Refactored code out of BWT class into SuffixArray class
      Bug-fixes
      Merge function fixed. Too slow. In-place construction is needed
      Checking prior to refactor
      Added main program, refactored
      Added overlapper
      Added SeqReader
      Added overlap data structure and HitData class
      Added getopt to index
      Added assemble program and laid out skeleton
      Implemented initial string graph construction algorithm
      Massive refactor
      Added proper destructor to Vertex, fixed memleak in Bigraph::merge
      Added StringGraph classes which implement Myers' formulation of a string graph
      Refactored the vertex merge logic to be more intuitive
      Added validation function to stringvertex and vertex
      Simplified overlap representation
      Added generation of reverse suffix array to index
      Updated assemble parameters
      Fixed bug where the orientation of edges that were being merged were incorrect
      Rewrote overlap processing logic so all hits for a given read are processed at the same time
      Added checks to overlap detection to ensure overlaps are sane
      Moved stringgraph construction functions to SGUtil
      Implemented redundant read detection and removal algorithms
      Renamed SAID to SAElem
      Minor renaming
      Added early exit to SuffixArray::removeReads if the id list to remove is empty
      Implemented exact prefix/suffix matching
      Added matching string to oview output
      Fixed bug in oview which was not displaying reverse complement alignments correctly
      Renamed BWT::getHits to BWT::getPrefixHits to more accurately reflect its purpose
      Fixed bug in SuffixArray::extractPrefixSuffixOverlaps which could output multiple hits per read instead of only the optimal hit
      Added verbose guard around prints
      Switched Vertex implementation to use a list instead of an STL map, for space reasons and to allow easier sorting for the transitive removal algorithm
      Added edge sorting functions to bigraph and vertex classes.
      Implemented Myers' transitive removal algorithm
      Fixed bug in transred algorithm - twin edges were not being removed
      Fixed bug in TR algorithm. Edges must be marked for removal and removed in a single pass afterwards or else some reducible edges may be missed.
      Implemented bucket sort as an improvement over using std::sort across the whole suffix array. Should modify this to use histogram sort or another variant to drop the memory usage
      Implemented histogram sort
      Refactored SuffixCompare into its own files
      Tweaked parameters
      Removed prints
      Removed some prints
      Swapped order of conditions for terminating loop in histogram sort so that valgrind doesn't complain about out of bounds access
      Removed print of contigs before transitive removal/compaction
      Added DNAString class which is a wrapper for a c-string
      Ported Bentley/Sedgewick's multikey quicksort
      Implemented Nong-Zhang-Chan induced copying suffix array construction algorithm. In tests it only has to sort 30% of the suffixes using MKQS/histogram sort which is a large improvement. The algorithm can probably be modified further. The code is in need of a cleanup as well.
      Restored writing the SA out after indexing
      Inlined some frequently called functions to avoid function call overhead
      Refactor SuffixCompare class to distinguish comparing by sequence (in a radix sort) and ID
      Added mkqs which was forgotten
      Refactored SA construction code out of index and into class
      Much refactoring
      Moved read/write functions into SuffixArray/BWT classes
      Fixed macro.
      Implemented sampling of the occurrence array to lower memory
      macroed out calls in BWT for readability reasons
      Substantial rewrite of the overlap program
      Deleted unnecessary HitData.cpp
      Rewrote oview to draw all the alignments for a particular read at the same time
      Implemented inexact matching to BWT, there is currently no limit on the amount of backtracking so it is significantly slower
      Implemented seeded bwt alignment algorithm for inexact suffix/prefix matching.
      Portability fixes, added includes and fixed printfs so that it would compile on my home machine (32-bit Ubuntu 9.04)
      Slight reworking of the structure of the alignment algorithm in bwt_algorithms.cpp
      Fixed oview bug
      Refactored the Overlap struct to include a sub-struct which holds the matching coordinates
      Major refactoring to StringGraph. Overhangs are now stored as intervals instead of actual strings. The bookkeeping is a bit messy and could probably be cleaned up but the checked in version works. It simplified the merging logic somewhat.
      Inlining
      Added trimming algorithm, sweepVertex, remove duplicate hits
      Added error correction algorithm to oview, added bubble popping algorithm to StringGraph (in progress)
      Fixed bug in duplicate hit removal, hits to BWT and RevBWT must be considered differently to avoid stomping IDs
      Added much better, block-wise bwt alignment algorithm
      Added vertex removal program which eliminates vertices that have a high error rate.
      Factored Match into its own file
      More changes to coordinate system. Now all the changes of frame happen internally to the Match class greatly simplifying client code.
      Progress towards inferring transitive closure edges from consistent overlaps. Edges that reveal containments are causing problems.
      Working implementation of transitive closure algorithm.
      Inlined AlphaCount constructor
      Refactored the interval data out of BWTAlign
      First pass at exact assembly/string graph construction algorithm. slow.
      Refactored BWTAlgorithms module, made it a proper namespace and changed functions to be more generic
      Added AssembleExact functions and implemented initial version of exact string graph algorithm
      Factored out the extension gathering logic for AssembleExact so the same function can find left and right extensions
      Working version of exact extension assembly algorithm. Needs cleaning up.
      Added fast method to get the smallest consistent extension for a given sequence
      Full implementation of irreducible overlap extraction algorithm. It now outputs all irreducible overlaps instead of just the unique one. It will skip short substring that are contained within some other string but in general substrings in the data set are not handled well. This should be improved.
      Refactored irreducible algorithm into the BWTAlgorithms collection.
      Minor formatting changes
      Added code to pair vertices based on read ids
      propedit
      Added algorithm to output all sequences of length k from a BWT
      Added output code and debug printing. The extraction is (understandably) slow for large l.
      Removed unused exact executable
      Removed unused filehandle
      Test commit, no change
      Removed autogenerated files
      Added .gitignore
      Removed *.in files, updated .gitignore
      Updated .gitignore
      Added tools directory with useful scripts
      Added data directory
      Removed "exact" line from sga help text
      Removed comment
      Removed call to basename so code compiles on OSX which does not supply GNU basename function and POSIX version is unsuitable.
      First pass at bwt to string graph algorithm.
      Committing changes to work from home.
      Changed command line parameter names, rewrote to use strings as buffers instead of arrays. Much less memory required.
      Committing test code using hash_map for transfer to home.
      Added size tracking.
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Removed hash test code as it's not smaller or faster than using std::map. The vertex finds are not a bottleneck.
      Reordered members in Edge class and changed GraphColor from an enum to uint8_t. The Edge class (and classes deriving from it) are very heavy and push the memory usage way up for big assemblies. It might be worth removing the start pointer from the Edge class to save 8 bytes.
      Generalized the irreducible overlap algorithm to handle reverse complement alignments simultaneously with regular alignments. The code is in need of a cleanup/simplification, particularly with how contained reads are handled.
      Minor change to assert
      Modified simplify to preferentially merge in the ED_SENSE direction so appending to strings is preferred to prepending
      Minor formatting change in a print
      Changed simplify output frequency
      Removed malloc.h include as OSX does not have this header
      Added ability to output RC reads to SE sampler
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      added -o and -m flags to specify output file and the minimum overlap size to accept, respectively.
      Enabled -o and -m flags in sga assemble
      Refactored Vertex class to keep edges in a vector instead of a list. Many edges must be removed from the vector but the erase() calls are not a bottleneck.
      Added functions to ensure that all edges for a vertex in a given direction are unique
      Refactored StringEdge/Vertex into its own file.
      Merged StringVertex/StringEdge into Vertex/Edge to simplify code and avoid (unused) inheritance overhead.
      Removed m_pStart member from Edge as it can be found from the m_pTwin member. This saves 1 pointer.
      Added BitChar class containing a simple bitset of 8 bits
      Updated Vertex::getMemSize to include the size of the string
      Added SimplePool and SimpleAllocator, an implementation of a zero-overhead memory pool for objects that do not need to be freed.
      Edge.h: Removed boost memory pool code, passed delete calls to memory pool (which do nothing by design)
      Factored Interval/SeqCoord classes out of the Util files
      Moved EdgeDir/EdgeComp definitions to GraphCommon from Util
      Large in-progress refactor of overlap stage. Now instead of outputting hits to each element of the suffix array the initial
      Changed overlap hit output mode to ascii for consistency with other programs.
      Fixed spelling error in Occurrence class
      Added missing files.
      Cleaned up overlap computation, removed dependency on hits
      Cleanup.
      Adding interval set stub code.
      Implemented the removal of sub-maximal overlap blocks (as produced in rare cases by BWTAlgorithms::findOverlapBlocks). The complete list of overlaps is sorted to find overlapping blocks which are split apart by OverlapBlocks::resolveOverlap. This replaces the IntervalSet idea that was never implemented. The algorithm could be slightly improved but the triggering case is so rare it isn't worth the extra complexity.
      Refactored OverlapBlock to remove the need for a separate OverlapBlockRecord class.
      Formatting changes.
      Created LockedQueue class
      Added destructor to LockedQueue to destroy the mutex.
      Added stub OverlapThread class. Currently not compiled in.
      Implemented some of the threading code, fixed configure/makefiles
      Added warning as a note to self
      Refactored overlap module in preparation for adding threads.
      Output formatting changes
      Moved defaults from parseArgs to declarations.
      Modified GPL boilerplate and added COPYING file to source tree.
      First pass at threaded overlapper. Lock contention currently kills performance.
      Implementation of threading that isn't clean but works. Will be cleaned up.
      Better threaded code, still in progress.
      Better but more complicated semaphore usage. Current version uses multiple buffers but it is probably more complex
      Stable version of threading overlap module. Output is not properly processed yet but the thread logic is more or less complete.
      Removed one of the semaphores from OverlapThread as it was redundant
      Completed threading work. Removed LockedQueue class as it is not used in the threading module.
      Minor cleanup.
      Committing experimental multi-input buffer threading code to transfer to work
      Committing experimental batch-scheduling algorithm for overlap threads.
      Refactored the batch model parallelization algorithm. Currently gives better performance
      Experimental paired end resolution code.
      Very experimental code for paired end resolution. Not at all a stable version - transferring to sanger to run on the farm to generate numbers.
      Enabled pairedoverlap visit
      More experimental PE code, transferring to work, will be reverted later.
      -Manually unrolled very frequently used loop in AlphaCount
      -Removed redundant calls to get the occurrence counts from the FM-index in BWTAlgorithm::updateBothR and updateBothL. Big improvement in speed.
      Refactored functions out of SGAlgorithms into SGPairedAlgorithms
      Refactored some visit algorithms into SGDebugAlgorithms
      Removed some debug code.
      Moved functions from BWTAlgorithms to OverlapAlgorithm
      Heavy refactoring. Moved the inexact overlap code to OverlapAlgorithms. Broke the huge, ugly _alignBlock function into more manageable chunks. Still needs some cleanup.
      Refactored the multi-alignment printing code from the oview program into a class (MultiOverlap)
      Added pileup functionality to multi-overlap.
      Added debug functions for detecting when edges are missed due to base calling errors and inexact overlaps.
      Added TransitiveGroup/TransitiveGroupCollection classes and a method to Vertex for constructing these.
      Added code to infer matches between different transitive groups
      Refactored out the Alphabet stuff from the SuffixTools dir into its own file in Alphabet.
      Refactored some logic from MultiOverlap to Pileup
      First go at base probability calculations.
      Added new quality/probability calculations for overlaps
      First go at performing transitive closure. Very heuristic and a bit hacky in places. Notably generated containments aren't handled well.
      More experimental code for detecting missing edges in the graph. Needs cleaning up.
      In-progress checkin. More robust algorithm for computing missing edges but not perfect yet.
      Fixed bug in missing edge inference where duplicate edges would be generated
      Removed some debug prints
      Fixed bug where vertex colors weren't being reset correctly in SGRealignVisitor::getMissingCandidates
      In-progress checkin. Added ability to include containment relationships in the graph.
      Re-worked seqcoord logic for clarity and to handle containment seqcoords.
      Fixed bug in OverlapAlgorithms where containments would be output many times
      Added function to write out the overlaps present in the graph.
      Started refactor to refactor alphabet data structures into their own class
      Re-enabled the realign visitor as the default operation to perform in debug mode.
      Added two experimental likelihood maximization functions to MultiOverlap
      Added error correction code that uses the partitions calculated in MultiOverlap
      Implemented a new partitioning method based on improving a global likelihood
      Added actual read correction and visitor to remove edges that have an error rate above a threshold
      Fixed bug in partitionLI and tweaked params
      Added field to output in debug visitor
      Added more experimental partitioning functions.
      Tweaks to partitioning code
      Better partitioning function based on splitting the overlap set via discrepant bases
      Tweaks to previous, wrapping up coding for the night and shifting working copy to sanger
      Added SeqTrie class
      Removed debug code
      Added functionality to the SeqTrie. Added function to Bigraph::vertex to construct it from overlaps.
      Worked on SeqTrie-based error correction. It performs much better than the previous partitioning based methods but it is unusable because the memory usage explodes because of the insertAtDepth() which cause a combinatorial increase in memory use.
      Tweaks to previous.
      Transferring code to home, slight modifications to previous
      Added samQC.py which parses a SAM/BAM file to output some error rate metrics. Mostly used to learn Python
      Adding incomplete and non-functional SeqDAVG class to switch to sanger
      Implemented basic functionality of SeqDAVG
      SeqDAVG insert at depth working.
      More work on SeqDAVG
      Checking in exploratory code to work on it from work tomorrow.
      More experimental conflict resolution code
      saca.h/saca.cpp: Changed bucket data from int to int64_t to prevent wrap around for very large suffix arrays.
      SeqTrie: removed inefficient insertAtDepth function
      Created SGA/preprocess program which processes read files to remove low-quality subsequences and reads with ambiguous bases.
      Removed print statement from preprocess
      More changes to the experimental error correction code. Current method can resolve repeats quite well. Must be refactored
      Refactored error correction code into its own namespace in the new Algorithms directory
      Very good version of error correction
      Added ability to sub-sample reads to preprocess
      Added missing includes
      Cleaned up some includes, made a stub class for the graph remodeling visitor
      Started work on graph remodelling code - added functions to discover the complete set of overlaps for a given vertex.
      Added error correction mode to assemble.
      Added output counter to error correct visitor
      Re-wrote Vertex::makeUnique
      Tweaked trimming
      Now checking error codes from pthreads creation routines
      Fixed bug in Match::infer when sequences are not the same length. The coordinates must be translated before setting the .seqlen property of the SeqCoord or else the isValid() assert will blow because the start/end may be out of range
      Added SQG format stub directory/files
      Wrote a file format to hold the assembly graph. It is implemented in the SQG/ subdirectory. Modelled after the SAM format.
      Fixed bug in preprocess where sub-sampling was not working properly.
      Added a pe-aware mode to preprocess. PE reads will now be discarded/kept together.
      Fixed bug in OverlapAlgorithm introduced during previous refactoring. Overlaps were being output for non-terminal right overlapblocks.
      Modified the inexact overlap detection algorithm to remove redundant seeds
      Integrated gzstream wrapper for zlib. Used it in the overlap step for the final ASQG output and the temporary hits files.
      Refactored the way ASQG records are output.
      Removed old unused TagValue code in SQG
      Wrote ASQG parser in SGUtil. Now used to read in the graph.
      Fixed parsing bugs, string graph with substring verts now loads and builds cleanly.
      Fixed oview to use ASQG input. Removed unused functions.
      Created wrapper for opening a gzip or non-gzip file.
      Created createWriter wrapper in Util to open a gzip or plaintext file writer. Used it in SGA/overlap
      Use createWriter/createReader in BWT
      Added new OverlapAlgorithmNew as a re-worked OverlapAlgorithm. This is temporary and will be merged soon
      Replaced OverlapAlgorithm with the new, faster seeded algorithm that was developed in OverlapAlgorithmNew.
      Added --edge-stats command to assemble which outputs the distribution of overlap lengths and number of differences
      Added options to SGA/overlap to explicitly set the seed length and stride. These allow for more aggressive seeding (and lower computational time) but break the guarantee that all overlaps within epsilon are found. They are not used by default and fairly experimental.
      Fixed incorrect timing of collapsing seeds
      Fixed output in transitive reduction to accurately report the number of edges and vertices marked.
      Cleaned up OverlapAlgorithm for irreducible overlaps. Preparing to implement full, inexact irreducible algorithm.
      Fixed missing include.
      Added SearchHistory classes, transferring code to work
      First pass at inexact irreducible algorithm. Some transitive edges in the test set I am using remain but the majority are culled.
      Added function to write out an ASQG from Bigraph
      Fixed fencepost error in SearchHistory compare
      Fixed bug in SearchHistory calculation
      Fixed careless bug in SearchHistory
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Fixed bug in inexact irreducible object. If multiple overlap blocks are the same length, some transitive blocks may not get marked.
      Preliminary implementation of contained vertex resolving algorithm. This is a debug version and will be changed in a subsequent commit
      Working version of transitive-aware contain algorithm. This algorithm is much cleaner than the previous version but the implementation must be cleaned up.
      Separated EdgeDesc into its own file.
      Modified EdgeDesc to use a pointer to a vertex instead of a vertex ID
      Rewrote overlap/edge inference algorithms to use EdgeDesc instead of Vertices. It is important to track the directionality of the edges as weird palindromic
      Returned FUZZ parameter in SGTransRedVisitor to default value of 10.
      Cleaned up interface to enqueueEdges
      Added function to Util to make the floating point comparison between two error rates while allowing for a small tolerance.
      Do not allow containment edges in Vertex::getEdges(dir)
      Renamed the SGTransRedVisitor to SGTransitiveReductionVisitor
      Added graph structure validation visitor to find cases where the irreducible edges are missing from a vertex or erroneously found.
      Added oview2fa.pl tool
      Changed the order in which vertices are remodelled in the ContainRemove visitor to visit the neighbors in order of length. This
      Removed missed print statement
      Rewrite of the findOverlapBlocksInexact algorithm. This is somewhat cleaner and a bit faster than the previous method. More cleanup/improvement is possible.
      Big improvement to inexact overlap, only branch the search seed after its interval is valid to avoid a big unnecessary copy.
      Added ability to randomly change Ns to bases in preprocess so that discarding reads can be avoided. It is turned off by default.
      Implemented reference-counted search tree
      Refactored all the search history classes into one file. Added function to get the history from a SearchHistoryLink
      Integrated new SearchHistory tracker into the SearchSeeds.
      Re-enabled the list version of the inexact overlap algorithm instead of the queue version.
      Removed dead code.
      Tweaked preprocess GC filter.
      Fixed output in overlap align loop
      Added BWTDiskConstruction stub and command line arguments to index
      Implemented the control flow for the bwtdisk algorithm
      Refactored BWT class, moved the reader/writer logic into separate classes to allow them to be used by the BWTDisk construction algorithm.
      Removed some more dead code from BWT
      Implemented merging of a bwt in memory with a bwt on disk.
      BWT merging now working, still need to merge the sai and track the relative ordering of read ids.
      Changed constant in disk algo
      Minor cleanup.
      Changed the ordering of equal strings from ID comparison to index comparison. This makes it far simpler to merge BWTs on disk.
      Cleanup of BWTDiskConstruction code
      More cleanup.
      Added merging of suffix array index to disk construction. Now fully functional.
      Factored the Reader/Writer logic out of the SuffixArray class to use it in the disk construction.
      Changed constant.
      Added flag to disk construction to build the reverse index. This completes the algorithm -
      Re-formatted entire source tree to use spaces instead of tabs.
      Factored the visitor algorithms out of SGAlgorithms into their own file.
      Merged SGAlgorithms::_discoverOverlaps and SGAlgorithms::addOverlapsToSet
      Refactoring.
      Rewrote the remodelAfterExcision function to use the newly developed EdgeDescOvermapMap code. It needs refactoring
      Fixed bug introduced to OverlapAlgorithm a few checkins ago. The seed_length should be clamped at minOverlap.
      Started to refactor the overlap collection logic out of SGAlgorithms into CompleteOverlapSet
      Fixed a bug in CompleteOverlapSet, it now behaves exactly as if all overlaps within the parameters were found using the FM-index (as desired). Changed the remodel visitor to use it.
      More refactoring, all the overlap discovery algorithms have been moved to CompleteOverlapSet.
      Fixed bug where the graph error rate parameter was not being set after remodelling.
      Fixed potential memory leak in irreducible algorithm.
      Re-implemented a cleaner version of the inexact irreducible algorithm in OverlapAlgorithm
      Started work on handling substrings in irreducible algorithm. Unfortunately it seems that we will have to load substrings into the graph and then remove them - they can't be deterministically removed at
      Added a default value for the minimum read length
      Wrote core code for resolving the path between the ends of a PE fragment
      Added function to write result of fragment completion algorithm to file
      Added new graph parameters to specify whether the graph has containments and/or transitive edges
      Write out containment/transitive tags in Bigraph::writeASQG
      Progress on handling substring vertices.
      More work on substring containments, closer to giving the same results as exhaustive algorithm but not perfect.
      Added isContainment property vertex to signal that it needs to be removed from the graph instead of setting a color.
      Tweaked setting for sampled reads
      large refactoring, remodelling the graph properly handles generated containment and substring edges.
      Cleaned up some code, added new visitor to (trivially) remove identical reads.
      Resurrected recursive overlap map construction for debugging long running time in yeast case
      Began the implementation of the rmdup subprogram. Refactored OverlapAlgorithm so minOverlap is not a member variable but passed into the relevant algorithm to run.
      More refactoring.
      Refactored the hit computation code into its own file
      More refactoring and the first working version of rmdup
      Fixed bug in Vertex where the containment flag was not being set in the constructor.
      Created files for merge subprogram to merge multiple BWTs.
      Removed test code from read sampler that should not have been checked in.
      Implemented merging of indices from two different read files.
      Added flag to merge reverse indices
      Wrote function to merge two read files together
      Cleaned up outfile naming in SGA/merge, it is now complete in the case of merging two indices
      Modified read sampler to add a prefix to each readname
      Implemented new overlap detection algorithm in CompleteOverlapSet
      Started work on 2 bit per base encoded string class
      Implemented the rest of the EncodedString class
      Implemented append and swap functions in EncodedString. Ported the Vertex class to use this class to store the sequence.
      Added BWTCodec to encode an alphabet of ACGT$
      Changed the BWT class to use the EncodedString representation of the BWT string.
      Added lookup table for shift values and changed value in mask from decimal to hex
      Added NoCodec which can be used by EncodedString to avoid doing any actual encoding. Useful for testing.
      Changed NoCodec to use a similar get/store function as the real codecs.
      Added 4-bit BWT codec. It uses half the memory compared to not encoding the string for roughly the same speed. It is faster than the 3-bit encoder for unknown reasons.
      Added option to merge to clean up original files.
      Don't load the reverse read table in SGA/overlap, only use the forward read table.
      Changed interface to parseHits so that the reverse read table is not used.
      Changed comment
      Added check to SGA/merge and revised mergeDriver tool
      Reduced the memory usage of Vertex by using the SimplePool allocator and removing two data members that are not used currently.
      Major refactoring of how a sequence file is processed in parallel. Wrote the generic SequenceProcessFramework to handle reading the file
      Refactored rmdup to use the new concurrency framework. Removed OverlapThread which is now obsolete
      Moved some print messages to SequenceProcessFramework
      Generalized the SequenceProcessFramework to take in a SeqReader and an optional parameter n which limits the number of
      Removed some prints
      Modified BWTDiskConstruction to use SequenceProcessFramework. This involved refactoring the GapArray into its own file.
      Reverted the number of reads per group back to 2M
      Removed double-construction of overlap block
      Fixed a few incorrect forward declares that Clang picked up.
      Implemented new, much faster remodel algorithm. The implementation is not perfect yet.
      Fixed last issue with the new remodel algorithm, it now gives the same result as the old algorithm but is much much faster.
      Refactored CompleteOverlapSet to use new partitioning code
      Added some void casts so the program compiles without warnings if DNDEBUG is specified
      Added skeleton for error correction subprogram
      Added skeleton of ErrorCorrectProcess, implemented control flow for error correct subprogram
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      Added methods to SearchHistoryVector and OverlapBlock for extracting the string corresponding to a match
      Implemented the rest of the error correction subprogram. It uses the simple correction algorithm at the moment but gives good results on simulated data.
      Began implementation of run-length encoded BWT class. It reads from a .bwt file and compresses the string into runs as it is read.
      Rewrote RLBWT printInfo
      Started to implement the marker placement code
      Implemented setting the markers in the RLBWT and random accessing of elements.
      Implemented occurrence counting for RLBWT. Some efficiency gains can still be made
      Renamed the old BWT class to SBWT ("simple" BWT). The BWT identifier is now a typedef to switch between using the RLE version and the regular version
      Implemented forward-search of the Marker array
      Implemented forward search in getFullOcc as well. The code could be cleaned up a bit.
      Fixed bug in RLBWT::initializeFMIndex where the last marker would not be placed correctly.
      Added ReadInfoTable to load an index of id,length pairs. This is used to construct overlaps from hits in overlap and rmdup. The benefit here
      Added hidden argument to overlap to use exact mode.
      Fixed bug in RemovalAlgorithm where cycles in the graph would cause an infinite loop.
      Re-enabled rmdup by writing the id and sequence out to the hits file.
      Force the suffix array to BWT conversion methods to use SBWT for now. This should eventually change to writing the RLBWT
      Added hacky BubbleEdge removal visitor. Currently not in use.
      Made the number of reads to process in a batch a parameter to the BWT disk construction algorithm
      Added subgraph subprogram, to extract a specified portion of the graph
      Cleaned up subgraph, it now removes containments and properly handles the vertex visit logic
      Changed tabs to spaces in samQC, modified so the summary stats can be printed in every mode.
      Implemented -o, --outfile option to SGA/correct
      -Added option to perform multiple rounds of error correction
      Added quality filtering option to remove reads with a substantial amount of low-quality bases.
      Fixed semantics of quality filter
      Fix: the number of times the trim/bubble popping is performed did not match the command line parameter
      Removed print that was checked in by error
      Added some extra information to the break writer
      Added small-repeat resolution code. Remove edges that join together two sequences with a sub-read length repeat unit if there are
      Added method to MultiOverlap to generate SeqTries.
      Re-enabled seqtrie correction.
      Added quick and dirty PrimerScreen class and enabled the screen in the preprocess. This just checks for
      Tweaked PrimerScreen settings. Now matches over the first 14 bases of the sequence.
      Added tool scripts to revision control
      Added command line arguments to correct to take in the algorithm to use and the conflictCutoff
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Removed print statements from SGRepeatResolveVisitor
      First pass at the scaffold driver. This version is based on bwa but this will be replaced by bowtie
      BWA-based distance estimation calculation is complete.
      Added new conversion scripts:
      Added some metrics to the error correction.
      Adding additional development/analysis scripts to revision control
      Removed hardcoded paths from analyzeCorrect, run_bwa.sh and samQC
      Added function to calculate the amount of a read that is covered
      Changed OverlapAlgorithm to remove submaximal overlap blocks for containments and proper overlaps at the same time
      Refactored BWTAlgorithms::updateBothL/R to take in the AlphaCount for the lower and upper interval.
      Working implementation of binary .bwt file. Uses run-length encoding.
      Forgot to add RLBWT* files
      heavy refactoring of BWT I/O. Now the binary and ascii output files are subclasses of IBWTReader/Writer. The rest of
      Fixed a crash in the contain removal algorithm found in the yeast data. An assertion would blow if multiple valid overlaps
      Added sparse hash check to configure
      Bumped version number, removed define from SGA.cpp
      Fixed Makefiles/includes so that make dist works
      Removed Tests directory from standard build.
      Initial implementation of in-place removal of strings from the FM-index. Not working in this version.
      Working version of the removal of reads from the FM-index in the rmdup program.
      Fixed parallel mode for rmdup. Now working as designed.
      Changed output BWTs back to binary
      Fixed rmdup index rebuild. Substring reads were not being written to the dup file.
      Added checks to the StringGraph construction and oview functions to ensure that each
      Refactored HashMap includes to check for the presence of tr1, ext/hash_map, etc.
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      Added ability to load distance estimate edges to the ScaffoldGraph.
      Improved dot output
      Removed unused line of code
      Abstracted out the GapArray functions.
      Implemented 4-bit storage SparseGapArray
      Added arguments to index and merge to control the size of the gap array.
      Fixed order of arguments when creating a ScaffoldEdge.
      Cleaned up MultiOverlap code and removed dead code from other classes
      Added Metrics classes and ability to track statistics about what positions in
      Removing Scaffold/Makefile.in which shouldn't have been added to the tracking
      Updated the output after error correction
      Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
      Fixed help message for merge to indicate it can only take 2 files.
      Added script to compute a-statistic for contigs from a bam file
      Better estimate of expected number of reads per contig by using the number of positions in the read that
      Minor update to calculation for expected arrival rate
      Added ability to load a-statistic data from a file to sga-scaffold.
      In-progress checkin of scaffolding code. It compiles but should not be used.
      Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
      Fixed terrible bug in scaffolder
      Added ability to write out scaffolds to a file after processing. Still in development.
      Fixed bug where an istream* reader was not cleaned up.
      Added script to evaluate the scaffold output
      Added ability to output singletons from the scaffolder
      Added -o,--outfile option to sga-scaffold to specify output file.
      Ported sga-scaffold into the main SGA program as a subcommand
      Added -a,--asqg-outfile option to assemble to write out the final graph.
      evalScaffolds: output number of gaps and mean gap size
      Factored the link data out of the ScaffoldEdge class
      More refactoring. Created the ScaffoldRecord class to hold the output of the scaffolding process. It can be read from/written to a file.
      Implemented simple scaffolding output where sequences are truncated if an overlap is predicted.
      Modified scaffold to perform reductions until no more reductions can be made
      Implemented edit distance calculation for two strings using dynamic programming in OverlapTools.
      Refactored dynamic programming algorithm into its own class.
      Finished code to join contigs that are predicted to overlap
      Added perl script to break up a set of scaffolds into contigs
      Added a driver script for the scaffold evaluation.
      Turned off prints in Overlapper
      Fixed bug in scaffold evaluation
      Refactored vertex to vertex search algorithms
      Fixed major performance bug in the irreducible extension algorithm. Every right extension was performing 4 branches,
      Implemented scaffold resolution using the string graph.
      Finished graph resolving work. Added command line parameter to choose the stringency of the resolution step.
      Made the sequence process framework more generic by using an input generator
      Added --no-discard flag to sga correct to suppress discarding reads.
      Fixed int to double conversion warnings.
      Started implementation of local string graph construction. It currently generates duplicate edges
      Threaded the mkqs portion of the indexing step. Not a huge decrease in running time, around 20%.
      Removed unnecessary print.
      Changed default sample rate for merging to 1024
      Completed connect subprogram.
      Fixed bug in the connect subprogram where the program would abort if the first and second reads had identical sequences
      Added experimental bubble-popping "smoothing" algorithm. Not in a state that it is usable for production work.
      Added some parameters to merge and correct. Moved the smoothing task in assemble to occur before simplification. Smoothing is still experimental.
      Fixed seg fault in search where the m_pWalkIndex member of an SGWalk was not initialized in the copy constructor
      Added depth filter to the error correct process to avoid correcting very deep sequences, which takes a lot of time.
      Added function to SGSearch to calculate the coverage spanning a given edge.
      Made command line parameters for the coverage removal algorithm
      Added sampleRate parameter to rmdup
      Added parameter to the correct subprogram to limit the amount of branching for complex reads.
      Fixed bad memory leak in the branch cutoff code for the overlap algorithm
      Started work on kmer-based error correction
      Initial checkin of sga2afg script
      Fixed bugs in the sga2afg and sga2contig scripts
      Added development script for computing an FM-index from a polymorphic genome
      Updates on the sga2afg convertor script and the testing graphical fm-index python script
      First pass at k-mer error corrector
      Improved kmer-correction. Gives very close results to the overlap correction.
      Rewrote vertex/edge allocation logic so that a global memory pool is not used. The pool now belongs to the graph that creates the vertex/edge. This allows multiple graphs to be created in different threads without stomping over each other's memory. The global new for Vertex/Edges is disabled, the allocations must go through a pool.
      Re-enabled threading in the connect process since the memory pool issues are fixed.
      Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
      Changed the read discarding logic for the kmer error corrector
      Fixed bug in error corrector where no sequence would be output for uncorrected reads in the kmer algorithm.
      Refactored repository to not contain data files and tools/analysis scripts. These are moved to the sgatools repo
      Updated the README
      Removed dead code from repository
      More dead code removal
      More README updates
      Implemented hybrid mode error correction which first performs a kmer correction pass, then overlap correction.
      Made the irreducible-edge only algorithm the default for sga overlap. All overlaps can be generated using the -x/--exhaustive option.
      Bumped version to 0.92
      Added assert ScaffoldRecord::introduceGap to catch case where the expected overlap between scaffold components is not sane.
      Minor changes to the README
      Merge branch 'gh-pages' of github.com:jts/sga
      Rewrote sga main webpage.
      Obscured email address
      Fixed formatting
      Added bin directory with first version of sga-pipeline script
      Added pipeline script information to the README
      Corrected file extension handling in sga-pipeline
      Rewrote sga-pipeline to be more modular and flexible
      add rmdup-pe workflow to sga-pipeline to remove duplicated paired-end reads
      Added logging to the sga-pipeline script
      sga-pipeline: fixed formatting issue for rmdup and correct wrappers
      Modified SeqReader to read compressed fasta/fastq files
      Modified SeqReader to automatically uppercase all input sequences.
      Extended --permuteN option in preprocess to handle the full IUPAC ambiguity code set as suggested by Shaun Jackman.
      Cleaned up help message for many subprograms, mostly by adding default parameters.
      Changed version numbering to a conventional x.x.x scheme and bumped version to v0.9.3
      Updated README with new name of the --trim option
      Added sga connect workflow to sga-pipeline
      Added --skip-preprocess option to sga-pipeline
      Added --version option to sga main program
      Implemented sga qc subprogram. This program looks for, and discards, problematic reads. Right now, the qc check requires each read to have a tiling of high confidence k-mers (with a short kmer length).
      Added new output file to sga-connect to record the pe reads that could not be connected
      Added new subprogram sga-stats which prints out a histogram of the kmer counts for a read set.
      Implemented gmap subprogram which is a very basic read-read mapper.
      Added flag to gmap output to indicate reverse complement alignments
      Rewrote sga-connect to work from the graph instead of the FM-index.
      Update the new sga-connect program to mark vertices in the graph that are covered by a pe-walk
      Rewrote Util/HashMap.h logic to explicitly define the StringHasher function. This is to fix a problem where tr1::unordered_map was available but the sparsehash was still trying to use __gnu_cxx::hash<std::string> which does not exist.
      Implemented edge link update function in scaffold module
      Cleaned up output in bigraph and assemble.
      Added sga-align and sga-deinterleave helper scripts
      Added new statistics to sga-stats. Now outputs the estimated error rate in the reads and the mean overlap depth.
      Rewrote portions of the MultiOverlap correction code for efficiency
      Added structural variation detection options to sga-connect
      Fixed bug in the bubble popper. The counter would never be incremented so it would always be reported that no bubbles were popped.
      Fixed string initialization error spotted by valgrind
      Added --with-hoard=PATH option to configure to allow the use of the Hoard memory allocator.
      Minor formatting change in configure
      Added --run-lengths parameter to sga-stats to print the run length distribution of the BWT
      Fixed typo in README spotted by Matthias Haimel. Added instruction for running autogen.sh
      Added a numReads field to the header of the sga-connect output
      Rewrote AlphaCount class to take in a template parameter indicating the storage size. Replaced all existing uses of AlphaCount in the code with AlphaCount64, the 64-bit storage version.
      Complete re-write of how the BWT occurrence array markers are represented.
      Removed old marker code and cleaned up.
      Fixed error in SmallMarker - was using size_t to hold the unitCount when it will be at most 128. Changed to uint8_t, which gives a huge memory saving.
      Cleaned up two-tier code.
      More clean up of two-tier code.
      Removed unused print statements in getInterpolatedMarkers
      Removed gcc force-inline attributes
      Implemented second version of two-tier occurrence array markers.
      Fixed bug in two-tier implementation where the count for the last SmallBlock placed was incorrect.
      Changed default sample rate for merging bwts
      Added a method to read in the non-RLE BWT from a binary bwt file.
      Updated version to 0.9.4. The main difference in this version is an improved strategy for managing the Occurrence array in the BWT, which requires substantially less memory.

jts (1):
      github generated gh-pages branch

-----------------------------------------------------------------------

-- 
Debian packaging for sga


