[med-svn] [Git][med-team/flye][upstream] New upstream version 2.9.3+dfsg

Étienne Mollier (@emollier) gitlab at salsa.debian.org
Wed Nov 29 21:50:32 GMT 2023



Étienne Mollier pushed to branch upstream at Debian Med / flye


Commits:
eb1e7a71 by Étienne Mollier at 2023-11-29T22:07:23+01:00
New upstream version 2.9.3+dfsg
- - - - -


18 changed files:

- README.md
- docs/FAQ.md
- docs/NEWS.md
- flye/__build__.py
- flye/__version__.py
- flye/config/bin_cfg/asm_defaults.cfg
- flye/config/bin_cfg/asm_nano_hq.cfg
- flye/main.py
- flye/polishing/polish.py
- src/assemble/extender.cpp
- src/polishing/subs_matrix.h
- src/repeat_graph/haplotype_resolver.cpp
- src/repeat_graph/main_repeat.cpp
- src/repeat_graph/output_generator.cpp
- src/repeat_graph/repeat_graph.cpp
- src/repeat_graph/repeat_graph.h
- src/sequence/overlap.cpp
- src/sequence/sequence_container.h


Changes:

=====================================
README.md
=====================================
@@ -3,7 +3,7 @@ Flye assembler
 
 [![BioConda Install](https://img.shields.io/conda/dn/bioconda/flye.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/flye)
 
-### Version: 2.9.2
+### Version: 2.9.3
 
 Flye is a de novo assembler for single-molecule sequencing reads,
 such as those produced by PacBio and Oxford Nanopore Technologies.
@@ -26,6 +26,12 @@ Manuals
 Latest updates
 --------------
 
+### Flye 2.9.3 release (28 November 2023)
+* Disjointig step speedup for `--nano-hq` mode
+* Improved `--keep-haplotypes` mode preserves more heterozygous SVs
+* A few bug fixes
+
+
 ### Flye 2.9.2 release (18 March 2023)
 * Update to minimap 2.24 + using HiFi and Kit14 parameters for faster alignment
 * Fixed a few small bugs and corner cases
@@ -54,29 +60,6 @@ Latest updates
 * Update to minimap 2.18
 * Several rare bug fixes/other improvements
 
-### Flye 2.8.3 release (10 Feb 2021)
-* Reduced RAM consumption for some ultra-long ONT datasets
-* Fixed rare artificial sequence insertions on some ONT datasets
-* Assemblies should be largely identical to 2.8
-
-### Flye 2.8.2 release (12 Dec 2020)
-* Improvements in GFA output, much faster generation of large and tangled graphs
-* Speed improvements for graph simplification algorithms
-* A few minor bugs fixed
-* Assemblies should be largely identical to 2.8
-
-### Flye 2.8.1 release (02 Sep 2020)
-* Added a new option `--hifi-error` to control the expected error rate of HiFi reads (no other changes)
-
-### Flye 2.8 release (04 Aug 2020)
-* Improvements in contiguity and speed for PacBio HiFi mode
-* Using the `--meta` k-mer selection strategy in isolate assemblies as well.
-This strategy is more robust to drops in coverage/contamination and requires less memory
-* 1.5-2x RAM footprint reduction for large assemblies (e.g. human ONT assembly now uses 400-500 Gb)
-* Genome size parameter is no longer required (it is still needed for downsampling though `--asm-coverage`)
-* Flye now can occasionally use overlaps shorter than "minOverlap" parameter to close disjointing gaps
-* Various improvements and bugfixes
-
 
 Repeat graph
 ------------
@@ -218,4 +201,4 @@ has already been answered.
 If you are reporting a problem, please include the `flye.log` file and provide
 details about your dataset.
 
-In case you prefer personal communication, please contact Mikhail at fenderglass at gmail.com.
+In case you prefer personal communication, please contact Mikhail at mikolmogorov at gmail.com.


=====================================
docs/FAQ.md
=====================================
@@ -236,10 +236,17 @@ flye --polish-target SEQ_TO_POLISH --pacbio-raw READS --iterations NUM_ITER --ou
 
 You can also provide Bam file as input instead of reads, which will skip the read mapping step.
 
+
+Flye assembly of the same reads is slightly different from run to run
+---------------------------------------------------------------------
+
+Flye is not fully deterministic, and this would be very difficult to fix. See more info here: https://github.com/fenderglass/Flye/issues/509
+For test runs, one can use `--deterministic` option to make the output stable, at the expense of substantially slower runtimes.
+
 My question is not listed, how do I get help?
 ---------------------------------------------
 
 Please post your question to the [issue tracker](https://github.com/fenderglass/Flye/issues). 
-In case you prefer personal communcation, you can contact Mikhail at fenderglass at gmail.com.
+In case you prefer personal communcation, you can contact Mikhail at mikolmogorov at gmail.com.
 If you reporting a problem, please include the `flye.log` file and provide some 
 details about your dataset (if possible).


=====================================
docs/NEWS.md
=====================================
@@ -1,3 +1,9 @@
+Flye 2.9.3 release (28 November 2023)
+====================================
+* Disjointig step speedup for `--nano-hq` mode
+* Improved `--keep-haplotypes` mode preserves more heterozygous SVs
+* A few bug fixes
+
 Flye 2.9.2 release (18 March 2023)
 =================================
 * Update to minimap 2.24 + using HiFi and Kit14 parameters for faster alignment


=====================================
flye/__build__.py
=====================================
@@ -1 +1 @@
-__build__ = 1786
+__build__ = 1797


=====================================
flye/__version__.py
=====================================
@@ -1 +1 @@
-__version__ = "2.9.2"
+__version__ = "2.9.3"


=====================================
flye/config/bin_cfg/asm_defaults.cfg
=====================================
@@ -8,6 +8,7 @@ meta_read_filter_kmer_freq = 100
 chain_large_gap_penalty = 2
 chain_small_gap_penalty = 0.5
 chain_gap_jump_threshold = 100
+max_jump_gap = 500
 
 #read assembly parameters
 max_coverage_drop_rate = 5
@@ -17,6 +18,7 @@ chimera_overhang = 1000
 min_reads_in_disjointig = 4
 max_inner_reads = 10
 max_inner_fraction = 0.25
+aggressive_dup_filter = 1
 
 #repeat graph parameters
 max_separation = 500
@@ -33,5 +35,5 @@ weak_detach_rate = 5
 tip_coverage_rate = 2
 tip_length_rate = 2
 
-output_gfa_before_rr = 0
+output_gfa_before_rr = 1
 remove_alt_edges = 0


=====================================
flye/config/bin_cfg/asm_nano_hq.cfg
=====================================
@@ -6,7 +6,7 @@ low_cutoff_warning = 0
 #k-mer selection
 kmer_size = 17
 use_minimizers = 1
-minimizer_window = 5
+minimizer_window = 10
 
 reads_base_alignment = 1
 


=====================================
flye/main.py
=====================================
@@ -429,6 +429,9 @@ def _run_polisher_only(args):
     if bam_input and len(args.reads) > 1:
         raise ResumeException("Only single bam input supported")
 
+    if bam_input and args.num_iters > 1:
+        raise ResumeException("Bam input only supports single iteration. For multiple iterations, provide fastq instead")
+
     pol.polish(args.polish_target, args.reads, args.out_dir,
                args.num_iters, args.threads, args.platform,
                args.read_type, output_progress=True)
@@ -678,19 +681,21 @@ def main():
     if args.read_error and args.read_error > 1:
         parser.error("--read-error expressed as a decimal fraction, e.g. 0.01 or 0.03")
 
-    if args.read_error:
-        hifi_str = "assemble_ovlp_divergence={0},repeat_graph_ovlp_divergence={0}".format(args.read_error)
+    def _add_extra_param(param):
         if args.extra_params:
-            args.extra_params += "," + hifi_str
+            args.extra_params += "," + param
         else:
-            args.extra_params = hifi_str
+            args.extra_params = param
+
+    if args.read_error:
+        hifi_str = "assemble_ovlp_divergence={0},repeat_graph_ovlp_divergence={0}".format(args.read_error)
+        _add_extra_param(hifi_str)
 
     if args.no_alt_contigs:
-        alt_params = "remove_alt_edges=1"
-        if args.extra_params:
-            args.extra_params += "," + alt_params
-        else:
-            args.extra_params = "remove_alt_edges=1"
+        _add_extra_param("remove_alt_edges=1")
+
+    if args.keep_haplotypes:
+        _add_extra_param("aggressive_dup_filter=0")
 
     if args.pacbio_raw:
         args.reads = args.pacbio_raw
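
The refactoring above replaces two copies of the same "append to a comma-separated string" logic with one helper. A minimal standalone sketch of that pattern follows. The function name and the pure-function form are assumptions made for illustration; the upstream helper is a closure that mutates `args.extra_params` in place.

```python
def add_extra_param(extra_params, param):
    """Append one key=value override to a comma-separated string.

    Illustrative stand-in for the closure in flye/main.py: returns the new
    string instead of mutating an argparse namespace.
    """
    if extra_params:
        return extra_params + "," + param
    return param


# Simulate a run where both --no-alt-contigs and --keep-haplotypes are set
# and no --extra-params string was given on the command line.
params = None
params = add_extra_param(params, "remove_alt_edges=1")
params = add_extra_param(params, "aggressive_dup_filter=0")
print(params)
```

Because every flag goes through the same helper, later flags can be added with a single call and cannot accidentally drop an earlier override.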


=====================================
flye/polishing/polish.py
=====================================
@@ -125,6 +125,8 @@ def polish(contig_seqs, read_seqs, work_dir, num_iters, num_threads, read_platfo
     with open(stats_file, "w") as f:
         f.write("#seq_name\tlength\tcoverage\n")
         for ctg_id in contig_lengths:
+            if ctg_id not in coverage_stats:
+                coverage_stats[ctg_id] = 0
             f.write("{0}\t{1}\t{2}\n".format(ctg_id,
                     contig_lengths[ctg_id], coverage_stats[ctg_id]))
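
The guard above avoids a `KeyError` when a contig has no entry in `coverage_stats`. A minimal sketch of the same idea, assuming both arguments are plain dictionaries; the function name and the sample contigs are hypothetical. It uses `dict.get` with a default, which, unlike the upstream patch, does not modify `coverage_stats`.

```python
import io


def write_coverage_stats(out, contig_lengths, coverage_stats):
    """Write one tab-separated row per contig; missing coverage becomes 0."""
    out.write("#seq_name\tlength\tcoverage\n")
    for ctg_id in contig_lengths:
        # Fall back to zero instead of raising KeyError.
        coverage = coverage_stats.get(ctg_id, 0)
        out.write("{0}\t{1}\t{2}\n".format(ctg_id, contig_lengths[ctg_id],
                                           coverage))


# Hypothetical input: contig_2 has no coverage entry.
buf = io.StringIO()
write_coverage_stats(buf, {"contig_1": 5000, "contig_2": 1200},
                     {"contig_1": 37})
stats_text = buf.getvalue()
```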
 


=====================================
src/assemble/extender.cpp
=====================================
@@ -298,10 +298,10 @@ void Extender::assembleDisjointigs()
 		//int extRight = this->countRightExtensions(startOvlps);
 
 		if (_chimDetector.isChimeric(startRead, startOvlps) ||
-			_readsContainer.seqLen(startRead) < _safeOverlap ||
-			//std::max(extLeft, extRight) > maxStartExt ||
-			//std::min(extLeft, extRight) < minStartExt ||
-			numInnerOvlp > totalOverlaps / 2) return;
+			_readsContainer.seqLen(startRead) < _safeOverlap) return;
+
+		const bool aggressiveDupFilt = (int)Config::get("aggressive_dup_filter");
+		if (aggressiveDupFilt && numInnerOvlp > totalOverlaps / 2) return;
 		
 		//Good to go!
 		ExtensionInfo exInfo = this->extendDisjointig(startRead);


=====================================
src/polishing/subs_matrix.h
=====================================
@@ -4,6 +4,7 @@
 
 #pragma once
 
+#include <cstdint>
 #include <string>
 #include <fstream>
 #include <iostream>


=====================================
src/repeat_graph/haplotype_resolver.cpp
=====================================
@@ -166,7 +166,7 @@ int HaplotypeResolver::findHeterozygousLoops()
 		//loop coverage should be roughly equal or less
 		if (loop.meanCoverage > 
 				COV_MULT * std::min(entrancePath->meanCoverage, 
-									entrancePath->meanCoverage)) continue;
+									exitPath->meanCoverage)) continue;
 
 		//loop should not be longer than other branches
 		if (loop.length > std::max(entrancePath->length, 
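
The one-word fix above matters because `std::min(x, x)` is just `x`, so the exit path's coverage was never consulted. A small sketch of the difference, with a hypothetical multiplier and coverage values (the real `COV_MULT` is not shown in this diff):

```python
def loop_exceeds_coverage(loop_cov, entrance_cov, exit_cov, cov_mult):
    """Fixed check: compare against the lower of entrance and exit coverage."""
    return loop_cov > cov_mult * min(entrance_cov, exit_cov)


def loop_exceeds_coverage_old(loop_cov, entrance_cov, exit_cov, cov_mult):
    """Pre-fix behaviour: the exit coverage is ignored entirely."""
    return loop_cov > cov_mult * min(entrance_cov, entrance_cov)


# Hypothetical values: a well-covered entrance but a thin exit path.
fixed = loop_exceeds_coverage(30, 40, 10, cov_mult=1.5)
old = loop_exceeds_coverage_old(30, 40, 10, cov_mult=1.5)
```

With these numbers the fixed check rejects the loop candidate while the old check accepted it.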


=====================================
src/repeat_graph/main_repeat.cpp
=====================================
@@ -190,7 +190,7 @@ int repeat_main(int argc, char** argv)
 	Logger::get().info() << "Building repeat graph";
 	SequenceContainer edgeSequences;
 	RepeatGraph rg(seqAssembly, &edgeSequences);
-	rg.build();
+	rg.build(keepHaplotypes);
 	//rg.validateGraph();
 
 	Logger::get().info() << "Parsing reads";
@@ -261,7 +261,7 @@ int repeat_main(int argc, char** argv)
 		Logger::get().debug() << "[SIMPL] == Iteration " << iterNum << " ==";
 
 		actions += multInf.splitNodes();
-		if (isMeta) 
+		if (isMeta && !keepHaplotypes) 
 		{
 			actions += multInf.disconnectMinorPaths();
 		}
@@ -277,7 +277,7 @@ int repeat_main(int argc, char** argv)
 		if (!actions) break;
 	}
 
-	if (isMeta) 
+	if (isMeta && !keepHaplotypes) 
 	{
 		multInf.resolveForks();
 	}


=====================================
src/repeat_graph/output_generator.cpp
=====================================
@@ -106,13 +106,23 @@ void OutputGenerator::outputGfa(const std::vector<UnbranchingPath>& paths,
 	}
 
 	//make sure that if there are nodes with one incoming and one outgoing
-	//edge, they are connected. Most relevant to the circular contigs
+	//edge, they are connected. Initialize those connections to zero.
+	//Most relevant to the circular contigs, but also to strange bubbles
 	for (auto& node : _graph.iterNodes())
 	{
-		if (node->inEdges.size() == 1 && node->outEdges.size() == 1)
+		if (node->outEdges.size() == 1)
 		{
-			//initialize to zero
-			edgeConnections[node->inEdges.front()][node->outEdges.front()];
+			for (auto& inEdge : node->inEdges)
+			{
+				edgeConnections[inEdge][node->outEdges.front()];	//initialize to zero
+			}
+		}
+		if (node->inEdges.size() == 1)
+		{
+			for (auto& outEdge : node->outEdges)
+			{
+				edgeConnections[node->inEdges.front()][outEdge];	//initialize to zero
+			}
 		}
 	}
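
The new loops above register a zero-count connection for every edge pair that meets at a node with a single outgoing or a single incoming edge, not only at nodes with exactly one of each. A minimal sketch of that initialization, assuming each node is given as a pair of edge-name lists and the connection table is a nested dictionary; both representations are illustrative, not Flye's data structures.

```python
def init_edge_connections(nodes):
    """Create zero-count entries for unambiguous in/out edge pairs.

    nodes: iterable of (in_edges, out_edges) tuples of edge names.
    Returns {in_edge: {out_edge: 0}}.
    """
    connections = {}
    for in_edges, out_edges in nodes:
        if len(out_edges) == 1:
            for in_edge in in_edges:
                connections.setdefault(in_edge, {}).setdefault(out_edges[0], 0)
        if len(in_edges) == 1:
            for out_edge in out_edges:
                connections.setdefault(in_edges[0], {}).setdefault(out_edge, 0)
    return connections


# Hypothetical bubble end: two edges merge into one outgoing edge.
merged = init_edge_connections([(["e1", "e2"], ["e3"])])
```

Under the previous condition (exactly one incoming and one outgoing edge) this node would have produced no entries at all.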
 


=====================================
src/repeat_graph/repeat_graph.cpp
=====================================
@@ -75,7 +75,7 @@ std::unordered_set<GraphEdge*> GraphEdge::adjacentEdges()
 	return edges;
 }
 
-void RepeatGraph::build()
+void RepeatGraph::build(bool keepHaplotypes)
 {
 	//getting overlaps
 	VertexIndex asmIndex(_asmSeqs);
@@ -104,7 +104,10 @@ void RepeatGraph::build()
 	asmOverlaps.overlapDivergenceStats();
 
 	this->getGluepoints(asmOverlaps);
-	this->collapseTandems();
+	if (!keepHaplotypes)
+	{
+		this->collapseTandems();
+	}
 	this->initializeEdges(asmOverlaps);
 	GraphProcessor proc(*this, _asmSeqs);
 	proc.simplify();


=====================================
src/repeat_graph/repeat_graph.h
=====================================
@@ -250,7 +250,7 @@ public:
 	{}
 	~RepeatGraph();
 
-	void build();
+	void build(bool keepHaplotypes);
 	void updateEdgeSequences();
 	void storeGraph(const std::string& filename);
 	void loadGraph(const std::string& filename);


=====================================
src/sequence/overlap.cpp
=====================================
@@ -112,6 +112,7 @@ OverlapDetector::getSeqOverlaps(const FastaRecord& fastaRec,
 	static const float LG_GAP = (float)Config::get("chain_large_gap_penalty");
 	static const float SM_GAP = (float)Config::get("chain_small_gap_penalty");
 	static const int GAP_JUMP_THLD = (int)Config::get("chain_gap_jump_threshold");
+	static const int MAX_GAP = (int)Config::get("max_jump_gap");
 
 	//outSuggestChimeric = false;
 	int32_t curLen = fastaRec.sequence.length();
@@ -288,14 +289,15 @@ OverlapDetector::getSeqOverlaps(const FastaRecord& fastaRec,
 			{
 				int32_t curPrev = matchesList[j].curPos;
 				int32_t extPrev = matchesList[j].extPos;
+				int32_t jumpDiv = abs((curNext - curPrev) - 
+									  (extNext - extPrev));
 				if (0 < curNext - curPrev && curNext - curPrev < _maxJump &&
-					0 < extNext - extPrev && extNext - extPrev < _maxJump)
+					0 < extNext - extPrev && extNext - extPrev < _maxJump &&
+					jumpDiv <= MAX_GAP)
 				{
 					int32_t matchScore = 
 						std::min(std::min(curNext - curPrev, extNext - extPrev),
 										  kmerSize);
-					int32_t jumpDiv = abs((curNext - curPrev) - 
-										  (extNext - extPrev));
 					//int32_t gapCost = jumpDiv ? 
 					//		kmerSize * jumpDiv + ilog2_32(jumpDiv) : 0;
 					int32_t gapCost = (jumpDiv > GAP_JUMP_THLD ? LG_GAP : SM_GAP) * jumpDiv;
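
The change above computes `jumpDiv` before the chaining test so that two k-mer matches are only chained when their spacing on the two sequences differs by at most `max_jump_gap`. A rough sketch of the filter and the gap cost, using the defaults visible in `asm_defaults.cfg` in this commit (500, 2, 0.5, 100). The function names and the `max_jump` value in the example are assumptions, and the C++ code additionally truncates the cost to an integer.

```python
def is_chainable(cur_prev, ext_prev, cur_next, ext_next, max_jump,
                 max_gap=500):
    """Return True if the match pair may be chained."""
    cur_step = cur_next - cur_prev
    ext_step = ext_next - ext_prev
    jump_div = abs(cur_step - ext_step)
    return (0 < cur_step < max_jump and
            0 < ext_step < max_jump and
            jump_div <= max_gap)


def gap_cost(jump_div, lg_gap=2.0, sm_gap=0.5, gap_jump_thld=100):
    """Linear gap penalty with a steeper slope above the threshold."""
    penalty = lg_gap if jump_div > gap_jump_thld else sm_gap
    return penalty * jump_div


# max_jump=1500 is a hypothetical value chosen for the example.
near = is_chainable(0, 0, 100, 400, max_jump=1500)  # spacing differs by 300
far = is_chainable(0, 0, 100, 700, max_jump=1500)   # spacing differs by 600
```

Rejecting such pairs before scoring skips the score computation for candidates whose implied indel is larger than the configured limit.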


=====================================
src/sequence/sequence_container.h
=====================================
@@ -4,6 +4,7 @@
 
 #pragma once
 
+#include <cstdint>
 #include <vector>
 #include <unordered_map>
 #include <string>



View it on GitLab: https://salsa.debian.org/med-team/flye/-/commit/eb1e7a71b8199bf107b55b8c63a693a2514aecaa


