[med-svn] [SCM] sga annotated tag, v0.9.4, created. v0.9.4
Jared Simpson
js18 at sanger.ac.uk
Thu Nov 8 08:10:55 UTC 2012
The annotated tag, v0.9.4 has been created
at 93a16eba40529d103a35af3c18565a04f35d6a07 (tag)
tagging 235cf4366ff3b3244a1121b6b5910ca004830f46 (commit)
tagged by Jared Simpson
on Fri Nov 19 15:12:15 2010 +0000
- Shortlog ------------------------------------------------------------
Tagging v0.9.4
Jared Simpson (603):
adding new project
Importing stub files
Importing configuration files
Added edge labels to dotty output
Renamed IVertex to Vertex
Added vertex merging and removal
added simplify() function which removes transitive edges from the seqgraph
Added README
added function to load edges into the graph
Initial import of UniEst, Util directories
First version of UniEst is complete
In progress check-in of scaffolding code
Scaffolding in-progress checkin.
Fixed horrible bug in UniEst and added some better command line parsing
- Implemented contig uniqueness estimator by overhanging pairs, the performance is similar to depth estimation for long (>= 100bp) contigs but worse for small
- UniEst: reworked command line arguments. The align file is now required but inference over depth can be disabled with the --no_depth flag. The --no_pair flag is removed, it is automatically set when a paired/hist file are passed in.
UniEst: Added graph-based uniqueness inference. It does not work particularly well.
Cleaned up resolve
Committing test code stub, unit tests will go here
Added automake files
Refactored the SeqGraph to be a template.
More refactoring.
More refactoring, renamed the SeqGraph module to Bigraph which is more general
Refactored scaffold code to use new templated Bigraph class.
In-progress checkin of experimental distance estimation code. Lots of testing/debug hooks in BDE.cpp
Initial checkin of development suffixtree and bwt code
Added suffix tree
Big refactoring of BWT code
Refactored code out of BWT class into SuffixArray class
Bug-fixes
Merge function fixed. Too slow. In-place construction is needed
Checking prior to refactor
Added main program, refactored
Added overlapper
Added SeqReader
Added overlap data structure and HitData class
Added getopt to index
Added assemble program and laid out skeleton
Implemented initial string graph construction algorithm
Massive refactor
Added proper destructor to Vertex, fixed memleak in Bigraph::merge
Added StringGraph classes which implement Myers' formulation of a string graph
Refactored the vertex merge logic to be more intuitive
Added validation function to stringvertex and vertex
Simplified overlap representation
Added generation of reverse suffix array to index
Updated assemble parameters
Fixed bug where the orientation of edges that were being merged was incorrect
Rewrote overlap processing logic so all hits for a given read are processed at the same time
Added checks to overlap detection to ensure overlaps are sane
Moved stringgraph construction functions to SGUtil
Implemented redundant read detection and removal algorithms
Renamed SAID to SAElem
Minor renaming
Added early exit to SuffixArray::removeReads if the id list to remove is empty
Implemented exact prefix/suffix matching
Added matching string to oview output
Fixed bug in oview which was not displaying reverse complement alignments correctly
Renamed BWT::getHits to BWT::getPrefixHits to more accurately reflect its purpose
Fixed bug in SuffixArray::extractPrefixSuffixOverlaps which could output multiple hits per read instead of only the optimal hit
Added verbose guard around prints
Switched Vertex implementation to use a list instead of an STL map, for space reasons and to allow easier sorting for the transitive removal algorithm
Added edge sorting functions to bigraph and vertex classes.
Implemented Myers' transitive removal algorithm
Fixed bug in transred algorithm - twin edges were not being removed
Fixed bug in TR algorithm. Edges must be marked for removal and removed in a single pass afterwards or else some reducible edges may be missed.
Implemented bucket sort as an improvement over using std::sort across the whole suffix array. Should modify this to use histogram sort or another variant to drop the memory usage
Implemented histogram sort
Refactored SuffixCompare into its own files
Tweaked parameters
Removed prints
Removed some prints
Swapped order of conditions for terminating loop in histogram sort so that valgrind doesn't complain about out of bounds access
Removed print of contigs before transitive removal/compaction
Added DNAString class which is a wrapper for a c-string
Ported Bentley/Sedgewick's multikey quicksort
Implemented Nong-Zhang-Chan induced copying suffix array construction algorithm. In tests it only has to sort 30% of the suffixes using MKQS/histogram sort which is a large improvement. The algorithm can probably be modified further. The code is in need of a cleanup as well.
Restored writing the SA out after indexing
Inlined some frequently called functions to avoid function call overhead
Refactor SuffixCompare class to distinguish comparing by sequence (in a radix sort) and ID
Added mkqs which was forgotten
Refactored SA construction code out of index and into class
Much refactoring
Moved read/write functions into SuffixArray/BWT classes
Fixed macro.
Implemented sampling of the occurrence array to lower memory
macroed out calls in BWT for readability reasons
Substantial rewrite of the overlap program
Deleted unnecessary HitData.cpp
Rewrote oview to draw all the alignments for a particular read at the same time
Implemented inexact matching to BWT, there is currently no limit on the amount of backtracking so it is significantly slower
Implemented seeded bwt alignment algorithm for inexact suffix/prefix matching.
Portability fixes, added includes and fixed printfs so that it would compile on my home machine (32-bit Ubuntu 9.04)
Slight reworking of the structure of the alignment algorithm in bwt_algorithms.cpp
Fixed oview bug
Refactored the Overlap struct to include a sub-struct which holds the matching coordinates
Major refactoring to StringGraph. Overhangs are now stored as intervals instead of actual strings. The bookkeeping is a bit messy and could probably be cleaned up but the checked in version works. It simplified the merging logic somewhat.
Inlining
Added trimming algorithm, sweepVertex, remove duplicate hits
Added error correction algorithm to oview, added bubble popping algorithm to StringGraph (in progress)
Fixed bug in duplicate hit removal, hits to BWT and RevBWT must be considered differently to avoid stomping IDs
Added much better, block-wise bwt alignment algorithm
Added vertex removal program which eliminates vertices that have a high error rate.
Factored Match into its own file
More changes to coordinate system. Now all the changes of frame happen internally to the Match class greatly simplifying client code.
Progress towards inferring transitive closure edges from consistent overlaps. Edges that reveal containments are causing problems.
Working implementation of transitive closure algorithm.
Inlined AlphaCount constructor
Refactored the interval data out of BWTAlign
First pass at exact assembly/string graph construction algorithm. slow.
Refactored BWTAlgorithms module, made it a proper namespace and changed functions to be more generic
Added AssembleExact functions and implemented initial version of exact string graph algorithm
Factored out the extension gathering logic for AssembleExact so the same function can find left and right extensions
Working version of exact extension assembly algorithm. Needs cleaning up.
Added fast method to get the smallest consistent extension for a given sequence
Full implementation of irreducible overlap extraction algorithm. It now outputs all irreducible overlaps instead of just the unique one. It will skip short substrings that are contained within some other string but in general substrings in the data set are not handled well. This should be improved.
Refactored irreducible algorithm into the BWTAlgorithms collection.
Minor formatting changes
Added code to pair vertices based on read ids
propedit
Added algorithm to output all sequences of length k from a BWT
Added output code and debug printing. The extraction is (understandably) slow for large l.
Removed unused exact executable
Removed unused filehandle
Test commit, no change
Removed autogenerated files
Added .gitignore
Removed *.in files, updated .gitignore
Updated .gitignore
Added tools directory with useful scripts
Added data directory
Removed "exact" line from sga help text
Removed comment
Removed call to basename so code compiles on OSX which does not supply GNU basename function and POSIX version is unsuitable.
First pass at bwt to string graph algorithm.
Committing changes to work from home.
Changed command line parameter names, rewrote to use strings as buffers instead of arrays. Much less memory required.
Committing test code using hash_map for transfer to home.
Added size tracking.
Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
Removed hash test code as it's not smaller or faster than using std::map. The vertex finds are not a bottleneck.
Reordered members in Edge class and changed GraphColor from an enum to uint8_t. The Edge class (and classes deriving from it) are very heavy and push the memory usage way up for big assemblies. It might be worth removing the start pointer from the Edge class to save 8 bytes.
Generalized the irreducible overlap algorithm to handle reverse complement alignments simultaneously with regular alignments. The code is in need of a cleanup/simplification, particularly with how contained reads are handled.
Minor change to assert
Modified simplify to preferentially merge in the ED_SENSE direction so appending to strings is preferred to prepending
Minor formatting change in a print
Changed simplify output frequency
Removed malloc.h include as OSX does not have this header
Added ability to output RC reads to SE sampler
Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
added -o and -m flags to specify output file and the minimum overlap size to accept, respectively.
Enabled -o and -m flags in sga assemble
Refactored Vertex class to keep edges in a vector instead of a list. Many edges must be removed from the vector but the erase() calls are not a bottleneck.
Added functions to ensure that all edges for a vertex in a given direction are unique
Refactored StringEdge/Vertex into its own file.
Merged StringVertex/StringEdge into Vertex/Edge to simplify code and avoid (unused) inheritance overhead.
Removed m_pStart member from Edge as it can be found from the m_pTwin member. This saves 1 pointer.
Added BitChar class containing a simple bitset of 8 bits
Updated Vertex::getMemSize to include the size of the string
Added SimplePool and SimpleAllocator, an implementation of a zero-overhead memory pool for objects that do not need to be freed.
Edge.h: Removed boost memory pool code, passed delete calls to memory pool (which do nothing by design)
Factored Interval/SeqCoord classes out of the Util files
Moved EdgeDir/EdgeComp definitions to GraphCommon from Util
Large in-progress refactor of overlap stage. Now instead of outputting hits to each element of the suffix array the initial
Changed overlap hit output mode to ascii for consistency with other programs.
Fixed spelling error in Occurrence class
Added missing files.
Cleaned up overlap computation, removed dependency on hits
Cleanup.
Adding interval set stub code.
Implemented the removal of sub-maximal overlap blocks (as produced in rare cases by BWTAlgorithms::findOverlapBlocks). The complete list of overlaps is sorted to find overlapping blocks which are split apart by OverlapBlocks::resolveOverlap. This replaces the IntervalSet idea that was never implemented. The algorithm could be slightly improved but the triggering case is so rare it isn't worth the extra complexity.
Refactored OverlapBlock to remove the need for a separate OverlapBlockRecord class.
Formatting changes.
Created LockedQueue class
Added destructor to LockedQueue to destroy the mutex.
Added stub OverlapThread class. Currently not compiled in.
Implemented some of the threading code, fixed configure/makefiles
Added warning as a note to self
Refactored overlap module in preparation for adding threads.
Output formatting changes
Moved defaults from parseArgs to declarations.
Modified GPL boilerplate and added COPYING file to source tree.
First pass at threaded overlapper. Lock contention currently kills performance.
Implementation of threading that isn't clean but works. Will be cleaned up.
Better threaded code, still in progress.
Better but more complicated semaphore usage. Current version uses multiple buffers but it is probably more complex
Stable version of threading overlap module. Output is not properly processed yet but the thread logic is more or less complete.
Removed one of the semaphores from OverlapThread as it was redundant
Completed threading work. Removed LockedQueue class as it is not used in the threading module.
Minor cleanup.
Committing experimental multi-input buffer threading code to transfer to work
Committing experimental batch-scheduling algorithm for overlap threads.
Refactored the batch model parallelization algorithm. Currently gives better performance
Experimental paired end resolution code.
Very experimental code for paired end resolution. Not at all a stable version - transferring to sanger to run on the farm to generate numbers.
Enabled pairedoverlap visit
More experimental PE code, transferring to work, will be reverted later.
-Manually unrolled very frequently used loop in AlphaCount
-Removed redundant calls to get the occurrence counts from the FM-index in BWTAlgorithm::updateBothR and updateBothL. Big improvement in speed.
Refactored functions out of SGAlgorithms into SGPairedAlgorithms
Refactored some visit algorithms into SGDebugAlgorithms
Removed some debug code.
Moved functions from BWTAlgorithms to OverlapAlgorithm
Heavy refactoring. Moved the inexact overlap code to OverlapAlgorithms. Broke the huge, ugly _alignBlock function into more manageable chunks. Still needs some cleanup.
Refactored the multi-alignment printing code from the oview program into a class (MultiOverlap)
Added pileup functionality to multi-overlap.
Added debug functions for detecting when edges are missed due to base calling errors and inexact overlaps.
Added TransitiveGroup/TransitiveGroupCollection classes and a method to Vertex for constructing these.
Added code to infer matches between different transitive groups
Refactored out the Alphabet stuff from the SuffixTools dir into its own file in Alphabet.
Refactored some logic from MultiOverlap to Pileup
First go at base probability calculations.
Added new quality/probability calculations for overlaps
First go at performing transitive closure. Very heuristic and a bit hacky in places. Notably generated containments aren't handled well.
More experimental code for detecting missing edges in the graph. Needs cleaning up.
In-progress checkin. More robust algorithm for computing missing edges but not perfect yet.
Fixed bug in missing edge inference where duplicate edges would be generated
Removed some debug prints
Fixed bug where vertex colors weren't being reset correctly in SGRealignVisitor::getMissingCandidates
In-progress checkin. Added ability to include containment relationships in the graph.
Re-worked seqcoord logic for clarity and to handle containment seqcoords.
Fixed bug in OverlapAlgorithms where containments would be output many times
Added function to write out the overlaps present in the graph.
Started refactoring alphabet data structures into their own class
Re-enabled the realign visitor as the default operation to perform in debug mode.
Added two experimental likelihood maximization functions to MultiOverlap
Added error correction code that uses the partitions calculated in MultiOverlap
Implemented a new partitioning method based on improving a global likelihood
Added actual read correction and visitor to remove edges that have an error rate above a threshold
Fixed bug in partitionLI and tweaked params
Added field to output in debug visitor
Added more experimental partitioning functions.
Tweaks to partitioning code
Better partitioning function based on splitting the overlap set via discrepant bases
Tweaks to previous, wrapping up coding for the night and shifting working copy to sanger
Added SeqTrie class
Removed debug code
Added functionality to the SeqTrie. Added function to Bigraph::vertex to construct it from overlaps.
Worked on SeqTrie-based error correction. It performs much better than the previous partitioning based methods but it is unusable because the memory usage explodes: insertAtDepth() causes a combinatorial increase in memory use.
Tweaks to previous.
Transferring code to home, slight modifications to previous
Added samQC.py which parses a SAM/BAM file to output some error rate metrics. Mostly used to learn python
Adding incomplete and non-functional SeqDAVG class to switch to sanger
Implemented basic functionality of SeqDAVG
SeqDAVG insert at depth working.
More work on SeqDAVG
Checking in exploratory code to work on it from work tomorrow.
More experimental conflict resolution code
saca.h/saca.cpp: Changed bucket data from int to int64_t to prevent wrap around for very large suffix arrays.
SeqTrie: removed inefficient insertAtDepth function
Created SGA/preprocess program which processes read files to remove low-quality subsequences and reads with ambiguous bases.
Removed print statement from preprocess
More changes to the experimental error correction code. Current method can resolve repeats quite well. Must be refactored
Refactored error correction code into its own namespace in the new Algorithms directory
Very good version of error correction
Added ability to sub-sample reads to preprocess
Added missing includes
Cleaned up some includes, made a stub class for the graph remodeling visitor
Started work on graph remodelling code - added functions to discover the complete set of overlaps for a given vertex.
Added error correction mode to assemble.
Added output counter to error correct visitor
Re-wrote Vertex::makeUnique
Tweaked trimming
Now checking error codes from pthreads creation routines
Fixed bug in Match::infer when sequences are not the same length. The coordinates must be translated before setting the .seqlen property of the SeqCoord or else the isValid() assert will blow because the start/end may be out of range
Added SQG format stub directory/files
Wrote a file format to hold the assembly graph. It is implemented in the SQG/ subdirectory. Modelled after the SAM format.
Fixed bug in preprocess where sub-sampling was not working properly.
Added a pe-aware mode to preprocess. PE reads will now be discarded/kept together.
Fixed bug in OverlapAlgorithm introduced during previous refactoring. Overlaps were being output for non-terminal right overlapblocks.
Modified the inexact overlap detection algorithm to remove redundant seeds
Integrated gzstream wrapper for zlib. Used it in the overlap step for the final ASQG output and the temporary hits files.
Refactored the way ASQG records are output.
Removed old unused TagValue code in SQG
Wrote ASQG parser in SGUtil. Now used to read in the graph.
Fixed parsing bugs, string graph with substring verts now loads and builds cleanly.
Fixed oview to use ASQG input. Removed unused functions.
Created wrapper for opening a gzip or non-gzip file.
Created createWriter wrapper in Util to open a gzip or plaintext file writer. Used it in SGA/overlap
Use createWriter/createReader in BWT
Added new OverlapAlgorithmNew as a re-worked OverlapAlgorithm. This is temporary and will be merged soon
Replaced OverlapAlgorithm with the new, faster seeded algorithm that was developed in OverlapAlgorithmNew.
Added --edge-stats command to assemble which outputs the distribution of overlap lengths and number of differences
Added options to SGA/overlap to explicitly set the seed length and stride. These allow for more aggressive seeding (and lower computational time) but break the guarantee that all overlaps within epsilon are found. They are not used by default and fairly experimental.
Fixed incorrect timing of collapsing seeds
Fixed output in transitive reduction to accurately report the number of edges and vertices marked.
Cleaned up OverlapAlgorithm for irreducible overlaps. Preparing to implement full, inexact irreducible algorithm.
Fixed missing include.
Added SearchHistory classes, transferring code to work
First pass at inexact irreducible algorithm. Some transitive edges in the test set I am using remain but the majority are culled.
Added function to write out an ASQG from Bigraph
Fixed fencepost error in SearchHistory compare
Fixed bug in SearchHistory calculation
Fixed careless bug in SearchHistory
Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
Fixed bug in inexact irreducible object. If multiple overlap blocks are the same length, some transitive blocks may not get marked.
Preliminary implementation of contained vertex resolving algorithm. This is a debug version and will be changed in a subsequent commit
Working version of transitive-aware contain algorithm. This algorithm is much cleaner than the previous version but the implementation must be cleaned up.
Separated EdgeDesc into its own file.
Modified EdgeDesc to use a pointer to a vertex instead of a vertex ID
Rewrote overlap/edge inference algorithms to use EdgeDesc instead of Vertices. It is important to track the directionality of the edges as weird palindromic
Returned FUZZ parameter in SGTransRedVisitor to default value of 10.
Cleaned up interface to enqueueEdges
Added function to Util to make the floating point comparison between two error rates while allowing for a small tolerance.
Do not allow containment edges in Vertex::getEdges(dir)
Renamed the SGTransRedVisitor to SGTransitiveReductionVisitor
Added graph structure validation visitor to find cases where the irreducible edges are missing from a vertex or erroneously found.
Added oview2fa.pl tool
Changed the order that vertices are remodelled in the ContainRemove visitor to visit the neighbors in order of length. This
Removed missed print statement
Rewrite of the findOverlapBlocksInexact algorithm. This is somewhat cleaner and a bit faster than the previous method. More cleanup/improvement is possible.
Big improvement to inexact overlap, only branch the search seed after its interval is valid to avoid a big unnecessary copy.
Added ability to randomly change Ns to bases in preprocess so that discarding reads can be avoided. It is turned off by default.
Implemented reference-counted search tree
Refactored all the search history classes into one file. Added function to get the history from a SearchHistoryLink
Integrated new SearchHistory tracker into the SearchSeeds.
Re-enabled the list version of the inexact overlap algorithm instead of the queue version.
Removed dead code.
Tweaked preprocess GC filter.
Fixed output in overlap align loop
Added BWTDiskConstruction stub and command line arguments to index
Implemented the control flow for the bwtdisk algorithm
Refactored BWT class, moved the reader/writer logic into separate classes to allow them to be used by the BWTDisk construction algorithm.
Removed some more dead code from BWT
Implemented merging of a bwt in memory with a bwt on disk.
BWT merging now working, still need to merge the sai and track the relative ordering of read ids.
Changed constant in disk algo
Minor cleanup.
Changed the ordering of equal strings from ID comparison to index comparison. This makes it far simpler to merge BWTs on disk.
Cleanup of BWTDiskConstruction code
More cleanup.
Added merging of suffix array index to disk construction. Now fully functional.
Factored the Reader/Writer logic out of the SuffixArray class to use it in the disk construction.
Changed constant.
Added flag to disk construction to build the reverse index. This completes the algorithm -
Re-formatted entire source tree to use spaces instead of tabs.
Factored the visitor algorithms out of SGAlgorithms into their own file.
Merged SGAlgorithms::_discoverOverlaps and SGAlgorithms::addOverlapsToSet
Refactoring.
Rewrote the remodelAfterExcision function to use the newly developed EdgeDescOvermapMap code. It needs refactoring
Fixed bug introduced to OverlapAlgorithm a few checkins ago. The seed_length should be clamped at minOverlap.
Started to refactor the overlap collection logic out of SGAlgorithms into CompleteOverlapSet
Fixed a bug in CompleteOverlapSet, it now behaves exactly as if all overlaps within the parameters were found using the FM-index (as desired). Changed the remodel visitor to use it.
More refactoring, all the overlap discovery algorithms have been moved to CompleteOverlapSet.
Fixed bug where the graph error rate parameter was not being set after remodelling.
Fixed potential memory leak in irreducible algorithm.
Re-implemented a cleaner version of the inexact irreducible algorithm in OverlapAlgorithm
Started work on handling substrings in irreducible algorithm. Unfortunately it seems that we will have to load substrings into the graph and then remove them - they can't be deterministically removed at
Added a default value for the minimum read length
Wrote core code for resolving the path between the ends of a PE fragment
Added function to write result of fragment completion algorithm to file
Added new graph parameters to specify whether the graph has containments and/or transitive edges
Write out containment/transitive tags in Bigraph::writeASQG
Progress on handling substring vertices.
More work on substring containments, closer to giving the same results as exhaustive algorithm but not perfect.
Added isContainment property vertex to signal that it needs to be removed from the graph instead of setting a color.
Tweaked setting for sampled reads
large refactoring, remodelling the graph properly handles generated containment and substring edges.
Cleaned up some code, added new visitor to (trivially) remove identical reads.
Resurrected recursive overlap map construction for debugging long running time in yeast case
Began the implementation of the rmdup subprogram. Refactored OverlapAlgorithm so minOverlap is not a member variable but passed into the relevant algorithm to run.
More refactoring.
Refactored the hit computation code into its own file
More refactoring and the first working version of rmdup
Fixed bug in Vertex where the containment flag was not being set in the constructor.
Created files for merge subprogram to merge multiple BWTs.
Removed test code from read sampler that should not have been checked in.
Implemented merging of indices from two different read files.
Added flag to merge reverse indices
Wrote function to merge two read files together
Cleaned up outfile naming in SGA/merge, it is now complete in the case of merging two indices
Modified read sampler to add a prefix to each readname
Implemented new overlap detection algorithm in CompleteOverlapSet
Started work on 2 bit per base encoded string class
Implemented the rest of the EncodedString class
Implemented append and swap functions in EncodedString. Ported the Vertex class to use this class to store the sequence.
Added BWTCodec to encode an alphabet of ACGT$
Changed the BWT class to use the EncodedString representation of the BWT string.
Added lookup table for shift values and changed value in mask from decimal to hex
Added NoCodec which can be used by EncodedString to avoid doing any actual encoding. Useful for testing.
Changed NoCodec to use a similar get/store function as the real codecs.
Added 4-bit BWT codec. It uses half the memory compared to not encoding the string for roughly the same speed. It is faster than the 3-bit encoder for unknown reasons.
Added option to merge to clean up original files.
Don't load the reverse read table in SGA/overlap, only use the forward read table.
Changed interface to parseHits so that the reverse read table is not used.
Changed comment
Added check to SGA/merge and revised mergeDriver tool
Reduced the memory usage of Vertex by using the SimplePool allocator and removing two data members that are not used currently.
Major refactoring of how a sequence file is processed in parallel. Wrote the generic SequenceProcessFramework to handle reading the file
Refactored rmdup to use the new concurrency framework. Removed OverlapThread which is now obsolete
Moved some print messages to SequenceProcessFramework
Generalized the SequenceProcessFramework to take in a SeqReader and an optional parameter n which limits the number of
Removed some prints
Modified BWTDiskConstruction to use SequenceProcessFramework. This involved refactoring the GapArray into its own file.
Reverted the number of reads per group back to 2M
Removed double-construction of overlap block
Fixed a few incorrect forward declares that Clang picked up.
Implemented new, much faster remodel algorithm. The implementation is not perfect yet.
Fixed last issue with the new remodel algorithm, it now gives the same result as the old algorithm but is much much faster.
Refactored CompleteOverlapSet to use new partitioning code
Added some void casts so the program compiles without warnings if DNDEBUG is specified
Added skeleton for error correction subprogram
Added skeleton of ErrorCorrectProcess, implemented control flow for error correct subprogram
Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
Added methods to SearchHistoryVector and OverlapBlock for extracting the string corresponding to a match
Implemented the rest of the error correction subprogram. It uses the simple correction algorithm at the moment but gives good results on simulated data.
Began implementation of run-length encoded BWT class. It reads from a .bwt file and compresses the string into runs as it is read.
Rewrote RLBWT printInfo
Started to implement the marker placement code
Implemented setting the markers in the RLBWT and random accessing of elements.
Implemented occurrence counting for RLBWT. Some efficiency gains can still be made
Renamed the old BWT class to SBWT ("simple" BWT). The BWT identifier is now a typedef to switch between using the RLE version and the regular version
Implemented forward-search of the Marker array
Implemented forward search in getFullOcc as well. The code could be cleaned up a bit.
Fixed bug in RLBWT::initializeFMIndex where the last marker would not be placed correctly.
Added ReadInfoTable to load an index of id,length pairs. This is used to construct overlaps from hits in overlap and rmdup. The benefit here
Added hidden argument to overlap to use exact mode.
Fixed bug in RemovalAlgorithm where cycles in the graph would cause an infinite loop.
Re-enabled rmdup by writing the id and sequence out to the hits file.
Force the suffix array to BWT conversion methods to use SBWT for now. This should eventually change to writing the RLBWT
Added hacky BubbleEdge removal visitor. Currently not in use.
Made the number of reads to process in a batch a parameter to the BWT disk construction algorithm
Added subgraph subprogram, to extract a specified portion of the graph
Cleaned up subgraph, it now removes containments and properly handles the vertex visit logic
Changed tabs to spaces in samQC, modified so the summary stats can be printed in every mode.
Implemented -o, --outfile option to SGA/correct
-Added option to perform multiple rounds of error correction
Added quality filtering option to remove reads with a substantial amount of low-quality bases.
Fixed semantics of quality filter
Fix: the number of times the trim/bubble popping is performed did not match the command line parameter
Removed print that was checked in by error
Added some extra information to the break writer
Added small-repeat resolution code. Remove edges that join together two sequences with a sub-read length repeat unit if there are
Added method to MultiOverlap to generate SeqTries.
Re-enabled seqtrie correction.
Added quick and dirty PrimerScreen class and enabled the screen in the preprocess. This just checks for
Tweaked PrimerScreen settings. Now matches over the first 14 bases of the sequence.
Added tool scripts to revision control
Added command line arguments to correct to take in the algorithm to use and the conflictCutoff
Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
Removed print statements from SGRepeatResolveVisitor
First pass at the scaffold driver. This version is based on bwa but this will be replaced by bowtie
BWA-based distance estimation calculation is complete.
Added new conversion scripts:
Added some metrics to the error correction.
Adding additional development/analysis scripts to revision control
Removed hardcoded paths from analyzeCorrect, run_bwa.sh and samQC
Added function to calculate the amount of a read that is covered
Changed OverlapAlgorithm to remove submaximal overlap blocks for containments and proper overlaps at the same time
Refactored BWTAlgorithms::updateBothL/R to take in the AlphaCount for the lower and upper interval.
Working implementation of binary .bwt file. Uses run-length encoding.
Forgot to add RLBWT* files
heavy refactoring of BWT I/O. Now the binary and ascii output files are subclasses of IBWTReader/Writer. The rest of
Fixed a crash in the contain removal algorithm found in the yeast data. An assertion would blow if multiple valid overlaps
Added sparse hash check to configure
Bumped version number, removed define from SGA.cpp
Fixed Makefiles/includes so that make dist works
Removed Tests directory from standard build.
Initial implementation of in-place removal of strings from the FM-index. Not working in this version.
Working version of the removal of reads from the FM-index in the rmdup program.
Fixed parallel mode for rmdup. Now working as designed.
Changed output BWTs back to binary
Fixed rmdup index rebuild. Substring reads were not being written to the dup file.
Added checks to the StringGraph construction and oview functions to ensure that each
Refactored HashMap includes to check for the presence of tr1, ext/hash_map, etc.
Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
Added ability to load distance estimate edges to the ScaffoldGraph.
Improved dot output
Removed unused line of code
Abstracted out the GapArray functions.
Implemented 4-bit storage SparseGapArray
Added arguments to index and merge to control the size of the gap array.
Fixed order of arguments when creating a ScaffoldEdge.
Cleaned up MultiOverlap code and removed dead code from other classes
Added Metrics classes and ability to track statistics about what positions in
Removing Scaffold/Makefile.in which shouldn't have been added to the tracking
Updated the output after error correction
Merge branch 'master' of ssh://127.0.0.1:2222/~/work/git_repository/sga
Fixed help message for merge to indicate it can only take 2 files.
Added script to compute a-statistic for contigs from a bam file
Better estimate of expected number of reads per contig by using the number of positions in the read that
Minor update to calculation for expected arrival rate
Added ability to load a-statistic data from a file to sga-scaffold.
In-progress checkin of scaffolding code. It compiles but should not be used.
Merge branch 'master' of /nfs/team71/phd/js18/work/git_repository/sga
Fixed terrible bug in scaffolder
Added ability to write out scaffolds to a file after processing. Still in development.
Fixed bug where an istream* reader was not cleaned up.
Added script to evaluate the scaffold output
Added ability to output singletons from the scaffolder
Added -o,--outfile option to sga-scaffold to specify output file.
Ported sga-scaffold into the main SGA program as a subcommand
Added -a,--asqg-outfile option to assemble to write out the final graph.
evalScaffolds: output number of gaps and mean gap size
Factored the link data out of the ScaffoldEdge class
More refactoring. Created the ScaffoldRecord class to hold the output of the scaffolding process. It can be read from/written to a file.
Implemented simple scaffolding output where sequences are truncated if an overlap is predicted.
Modified scaffold to perform reductions until no more reductions can be made
Implemented edit distance calculation for two strings using dynamic programming in OverlapTools.
Refactored dynamic programming algorithm into its own class.
Finished code to join contigs that are predicted to overlap
Added perl script to break up a set of scaffolds into contigs
Added a driver script for the scaffold evaluation.
Turned off prints in Overlapper
Fixed bug in scaffold evaluation
Refactored vertex to vertex search algorithms
Fixed major performance bug in the irreducible extension algorithm. Every right extension was performing 4 branches,
Implemented scaffold resolution using the string graph.
Finished graph resolving work. Added command line parameter to choose the stringency of the resolution step.
Made the sequence process framework more generic by using an input generator
Added --no-discard flag to sga correct to suppress discarding reads.
Fixed int to double conversion warnings.
Started implementation of local string graph construction. It currently generates duplicate edges
Threaded the mkqs portion of the indexing step. Not a huge decrease in running time, around 20%.
Removed unnecessary print.
Changed default sample rate for merging to 1024
Completed connect subprogram.
Fixed bug in the connect subprogram where the program would abort if the first and second reads had identical sequences
Added experimental bubble-popping "smoothing" algorithm. Not in a state that it is usable for production work.
Added some parameters to merge and correct. Moved the smoothing task in assemble to occur before simplification. Smoothing is still experimental.
Fixed seg fault in search where the m_pWalkIndex member of an SGWalk was not initialized in the copy constructor
Added depth filter to the error correct process to avoid correcting very deep sequences, which takes a lot of time.
Added function to SGSearch to calculate the coverage spanning a given edge.
Made command line parameters for the coverage removal algorithm
Added sampleRate parameter to rmdup
Added parameter to the correct subprogram to limit the amount of branching for complex reads.
Fixed bad memory leak in the branch cutoff code for the overlap algorithm
Started work on kmer-based error correction
Initial checkin of sga2afg script
Fixed bugs in the sga2afg and sga2contig scripts
Added development script for computing an FM-index from a polymorphic genome
Updates on the sga2afg convertor script and the testing graphical fm-index python script
First pass at k-mer error corrector
Improved kmer-correction. Gives very close results to the overlap correction.
Rewrote vertex/edge allocation logic so that a global memory pool is not used. The pool now belongs to the graph that creates the vertex/edge. This allows multiple graphs to be created in different threads without stomping over each other's memory. The global new for Vertex/Edges is disabled, the allocations must go through a pool.
Re-enabled threading in the connect process since the memory pool issues are fixed.
Merge branch 'master' of /nfs/users/nfs_j/js18/work/git_repository/sga
Changed the read discarding logic for the kmer error corrector
Fixed bug in error corrector where no sequence would be output for uncorrected reads in the kmer algorithm.
Refactored repository to not contain data files and tools/analysis scripts. These are moved to the sgatools repo
Updated the README
Removed dead code from repository
More dead code removal
More README updates
Implemented hybrid mode error correction which first performs a kmer correction pass, then overlap correction.
Made the irreducible-edge only algorithm the default for sga overlap. All overlaps can be generated using the -x/--exhaustive option.
Bumped version to 0.92
Added assert ScaffoldRecord::introduceGap to catch case where the expected overlap between scaffold components is not sane.
Minor changes to the README
Merge branch 'gh-pages' of github.com:jts/sga
Rewrote sga main webpage.
Obscured email address
Fixed formatting
Added bin directory with first version of sga-pipeline script
Added pipeline script information to the README
Corrected file extension handling in sga-pipeline
Rewrote sga-pipeline to be more modular and flexible
add rmdup-pe workflow to sga-pipeline to remove duplicated paired-end reads
Added logging to the sga-pipeline script
sga-pipeline: fixed formatting issue for rmdup and correct wrappers
Modified SeqReader to read compressed fasta/fastq files
Modified SeqReader to automatically uppercase all input sequences.
Extended --permuteN option in preprocess to handle the full IUPAC ambiguity code set as suggested by Shaun Jackman.
Cleaned up help message for many subprograms, mostly by adding default parameters.
Changed version numbering to a conventional x.x.x scheme and bumped version to v0.9.3
Updated README with new name of the --trim option
Added sga connect workflow to sga-pipeline
Added --skip-preprocess option to sga-pipeline
Added --version option to sga main program
Implemented sga qc subprogram. This program looks for, and discards, problematic reads. Right now, the qc check requires each read to have a tiling of high confidence k-mers (with a short kmer length).
Added new output file to sga-connect to record the pe reads that could not be connected
Added new subprogram sga-stats which prints out a histogram of the kmer counts for a read set.
Implemented gmap subprogram which is a very basic read-read mapper.
Added flag to gmap output to indicate reverse complement alignments
Rewrote sga-connect to work from the graph instead of the FM-index.
Update the new sga-connect program to mark vertices in the graph that are covered by a pe-walk
Rewrote Util/HashMap.h logic to explicitly define the StringHasher function. This is to fix a problem where tr1::unordered_map was available but the sparsehash was still trying to use __gnu_cxx::hash<std::string> which does not exist.
Implemented edge link update function in scaffold module
Cleaned up output in bigraph and assemble.
Added sga-align and sga-deinterleave helper scripts
Added new statistics to sga-stats. Now outputs the estimated error rate in the reads and the mean overlap depth.
Rewrote portions of the MultiOverlap correction code for efficiency
Added structural variation detection options to sga-connect
Fixed bug in the bubble popper. The counter would never be incremented so it would always be reported that no bubbles were popped.
Fixed string initialization error spotted by valgrind
Added --with-hoard=PATH option to configure to allow the use of the Hoard memory allocator.
Minor formatting change in configure
Added --run-lengths parameter to sga-stats to print the run length distribution of the BWT
Fixed typo in README spotted by Matthias Haimel. Added instruction for running autogen.sh
Added a numReads field to the header of the sga-connect output
Rewrote AlphaCount class to take in a template parameter indicating the storage size. Replaced all existing uses of AlphaCount in the code with AlphaCount64, the 64-bit storage version.
Complete re-write of how the BWT occurrence array markers are represented.
Removed old marker code and cleaned up.
Fixed error in SmallMarker - was using size_t to hold the unitCount when it will be at most 128. Changed to uint8_t, which gives a huge memory saving.
Cleaned up two-tier code.
More clean up of two-tier code.
Removed unused print statements in getInterpolatedMarkers
Removed gcc force-inline attributes
Implemented second version of two-tier occurrence array markers.
Fixed bug in two-tier implementation where the count for the last SmallBlock placed was incorrect.
Changed default sample rate for merging bwts
Added a method to read in the non-RLE BWT from a binary bwt file.
Updated version to 0.9.4. The main difference in this version is an improved strategy for managing the Occurrence array in the BWT, which requires substantially less memory.
jts (1):
github generated gh-pages branch
-----------------------------------------------------------------------
--
Debian packaging for sga