[med-svn] [Git][med-team/mindthegap][upstream] New upstream version 2.2.3

Nilesh Patra (@nilesh) gitlab at salsa.debian.org
Tue Sep 21 16:43:30 BST 2021



Nilesh Patra pushed to branch upstream at Debian Med / mindthegap


Commits:
b7bdcbee by Nilesh Patra at 2021-09-21T21:08:16+05:30
New upstream version 2.2.3
- - - - -


7 changed files:

- CHANGELOG.md
- CMakeLists.txt
- README.md
- doc/MindTheGap_insertion_caller.md
- src/Filler.cpp
- src/Filler.hpp
- src/main.cpp


Changes:

=====================================
CHANGELOG.md
=====================================
@@ -4,6 +4,14 @@
 ## [Unreleased]
 
 --------------------------------------------------------------------------------
+## [2.2.3] - 2021-06-11
+
+* 2 novel options in the `Fill` (local assembly) module :
+  * `-fwd-only`:  output the first-contig extensions of failed gap-fillings in a separate file, it can be useful for the assembly of the extremities of linear genomes (as in the [MinYS](https://github.com/cguyomar/MinYS) tool)
+  * `-extend`:  do not try in reverse direction if no inserted sequence is assembled (bkpt mode), it can improve the running time and/or help the user controlling the direction of assembly (as in the [MTG-link](https://github.com/anne-gcd/MTG-Link) tool)
+
+--------------------------------------------------------------------------------
+
 ## [2.2.2] - 2020-06-19
 
 * A bug fix: updating gatb-core version, notably this fixes a bug in the `fill` module: nodes at extremities of contigs of size exactly `k` were not marked correctly, potentially leading to duplicated contigs in the contig graph. This could prevent exploring some parts of the graph, and if graph exploration parameters where set too large (`-max-nodes` and `-max-length`), it could lead in some rare cases to extreme running times and/or memory consumptions. This should no longer happen now.


=====================================
CMakeLists.txt
=====================================
@@ -28,7 +28,7 @@ cmake_minimum_required(VERSION 3.1)
 # The default version number is the latest official build
 SET (gatb-tool_VERSION_MAJOR 2)
 SET (gatb-tool_VERSION_MINOR 2)
-SET (gatb-tool_VERSION_PATCH 2)
+SET (gatb-tool_VERSION_PATCH 3)
 
 # But, it is possible to define another release number during a local build
 IF (DEFINED MAJOR)


=====================================
README.md
=====================================
@@ -4,9 +4,9 @@
 |-----------|-------------|
 [![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/) | [![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/)
 
-Travis CI : [![Build Status](https://travis-ci.org/GATB/MindTheGap.svg?branch=master)](https://travis-ci.org/GATB/MindTheGap)
+[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mindthegap/README.html)
 
-[![License](http://img.shields.io/:license-affero-blue.svg)](http://www.gnu.org/licenses/agpl-3.0.en.html)
+[![License](http://img.shields.io/:license-affero-blue.svg)](http://www.gnu.org/licenses/agpl-3.0.en.html)     
 
 # What is MindTheGap ?
 
@@ -49,10 +49,7 @@ Retrieve a binary archive file from one of the official MindTheGap releases (see
     tar -zxf MindTheGap-vX.Y.Z-bin-Darwin.tar.gz
     cd MindTheGap-vX.Y.Z-bin-Darwin
     chmod u+x bin/MindTheGap
-    
-    # run a simple example
-    ./bin/MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example
-    ./bin/MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example
+    ./bin/MindTheGap -help
 
 In case the software does not run appropriately on your system, you should consider to install it from its source code. Retrieve the source archive file `MindTheGap-vX.Y.Z-Source.tar.gz`.
 
@@ -65,13 +62,19 @@ In case the software does not run appropriately on your system, you should consi
 ## Using conda or docker
 
 MindTheGap is also distributed as a [Bioconda package](https://anaconda.org/bioconda/mindthegap):
- 
+
     conda install -c bioconda mindthegap
 
 Or pull the docker image of MindTheGap (warning: need to be updated with latest releases):
 
     docker pull clemaitr/mindthegap
 
+## Small run example
+
+```
+MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example
+MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example
+```
 
 
 
@@ -194,21 +197,26 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
     Both MindTheGap modules generate the graph file if reads were given as input: 
     
 * a graph file (`.h5`). This is a binary file, to obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.
-    
+  
     `MindTheGap find` generates the following output files:
+    
     * a breakpoint file (`.breakpoints`) in fasta format. 
-* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs and deletion events.
     
+* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs and deletion events.
+  
     `MindTheGap fill` generates the following output files:
-* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences or contig gap-fills that were successfully assembled. 
     
+* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences or contig gap-fills that were successfully assembled. 
+  
 * an insertion variant file (`.insertions.vcf`) in vcf format, in the case of insertion variant detection. 
-    
+  
 * an assembly graph file (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).
-    
+  
 * a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/grap-fill. 
-    
-      
+  
+* with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, ie. when the target kmer was not found, the first contig immediately after the source kmer is output.
+  
+  ​    
 
 Other optional parameters and details on input and output file formats are given in [doc/MindTheGap_insertion_caller.md](doc/MindTheGap_insertion_caller.md) and [doc/MindTheGap_assembly.md](doc/MindTheGap_assembly.md), depending on the usage.
 


=====================================
doc/MindTheGap_insertion_caller.md
=====================================
@@ -49,6 +49,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
 	* `-max-nodes`: maximum number of nodes in contig graph for each insertion assembly [default '100']. This arguments limits the computational time, this is especially useful for complex genomes.
     * `-max-length`: maximum number of assembled nucleotides in the contig graph (nt)  [default '10000']. This arguments limits the computational time, this is especially useful for complex genomes.
     * `-filter`: if set, insertions with multiple solutions are not output in the final vcf file (default : not activated).
+	* `-fwd-only`: if set, inserted sequences are searched in only one direction : from the left kmer to the right kmer. Default behavior : not set, ie. when no solution is found in the forward direction, a solution is searched in the opposite direction (revcomp(right) --> revcomp(left)).
 	
 6. **MindTheGap Output**
   
@@ -92,7 +93,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
 	For insertion variants (`insertions.vcf` file only), positions are **left-normalized**. This happens when there is a small (typically <5bp)) repeated sequence between the breakpoint site and one extremity of the inserted sequence. In this case, multiple positions are possible. Here, the leftmost position is reported and the number of possible positions is indicated in the INFO field (NPOS id). This latter value is at least the size of the repeated sequence + 1 (>= fuzzy +1). 
 	
 		chr4    618791     bkpt20  T    TAGGTGTATTTAGCTCCG   .       PASS    TYPE=INS;LEN=17;QUAL=50;NSOL=1;NPOS=4;AVK=22.71;MDK=23.00      GT      1/1
-	 	#there are 4 (NPOS) possible positions for an insertion of 17 nt from positions 618791 to 618794 on chr4, therefore the repeat is of size 3 and is AGG.
+		#there are 4 (NPOS) possible positions for an insertion of 17 nt from positions 618791 to 618794 on chr4, therefore the repeat is of size 3 and is AGG.
 	
 	FILTER field: can be `PASS`or `LOWQUAL` (for insertions with multiple solutions)
 	


=====================================
src/Filler.cpp
=====================================
@@ -64,6 +64,8 @@ Filler::Filler ()  : Tool ("MindTheGap fill") , _progress(0)
     _breakpointMode = true;
     _contig_trim_size = 0;
     _filter = false;
+    _fwd_only = false;
+    _extend = false;
 
 
 
@@ -82,7 +84,8 @@ Filler::Filler ()  : Tool ("MindTheGap fill") , _progress(0)
     generalParser->push_front (new OptionOneParam (STR_NB_CORES,    "number of cores",      false, "0"  ));
 
     IOptionsParser* inputParser = new OptionsParser("Input / output");
-    inputParser->push_front (new OptionNoParam (STR_FILTER, "do not output low quality insertions", false));
+    inputParser->push_front (new OptionNoParam (STR_EXTEND, "output first-contig extensions of failed gap-fillings in a separate file", false));
+    inputParser->push_front (new OptionNoParam (STR_FILTER, "do not output low quality insertions (bkpt mode)", false));
     inputParser->push_front (new OptionOneParam (STR_CONTIG_OVERLAP, "Overlap between input contigs (default, ie. 0 = kmer size)",  false, "0"));
     inputParser->push_front (new OptionOneParam (STR_URI_OUTPUT, "prefix for output files", false, ""));
     inputParser->push_front (new OptionOneParam (STR_URI_BKPT, "breakpoint file", false, ""));
@@ -93,6 +96,7 @@ Filler::Filler ()  : Tool ("MindTheGap fill") , _progress(0)
     
     IOptionsParser* fillerParser = new OptionsParser("Assembly");
     //TODO HERE PUT THE FILL OPTIONS
+    fillerParser->push_front (new OptionNoParam (STR_FWD_ONLY, "do not try in reverse direction if no inserted sequence is assembled (bkpt mode)", false));
     fillerParser->push_front (new OptionOneParam (STR_MAX_DEPTH, "maximum length of insertions (nt)", false, "10000"));
     fillerParser->push_front (new OptionOneParam (STR_MAX_NODES, "maximum number of nodes in contig graph (nt)", false, "100"));
 
@@ -264,7 +268,16 @@ void Filler::execute ()
         }
     }
 
-	
+    if (getInput()->get(STR_EXTEND) != 0)
+    {
+        _extend = true;
+        _extension_file_name = getInput()->getStr(STR_URI_OUTPUT)+".extensions.fasta";
+        _extension_file = fopen(_extension_file_name.c_str(),"w");
+        if(_extension_file == NULL){
+            string message = "Cannot open file "+ _extension_file_name + " for writing";
+            throw Exception(message.c_str());
+        }
+    }
     
     //Getting the breakpoint sequences
     if (_breakpointMode)
@@ -298,6 +311,11 @@ void Filler::execute ()
         _filter = true;
     }
     
+    if(getInput()->get(STR_FWD_ONLY) != 0)
+    {
+        _fwd_only = true;
+    }
+    
     // Now do the job
     time_t start_time = time(0);
 
@@ -316,6 +334,10 @@ void Filler::execute ()
     {
         fclose(_gfa_file);
     }
+    if (_extend)
+    {
+        fclose(_extension_file);
+    }
     
     // We gather some info/statistics to print in stdout
     resumeParameters();
@@ -451,6 +473,10 @@ void Filler::resumeResults(double seconds){
         getInfo()->add(2,"assembly graph file","%s",_gfa_file_name.c_str());
     }
     getInfo()->add(2,"assembly statistics file","%s",_insert_info_file_name.c_str());
+    if (_extend)
+    {
+        getInfo()->add(2,"extension sequence file","%s",_extension_file_name.c_str());
+    }
 
 }
 
@@ -507,7 +533,8 @@ public:
      }
         
     std::vector<filled_insertion_t> filledSequences;
-     _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, conc_targetSequence,filledSequences, targetDictionary, is_anchor_repeated, reverse );
+    string extensionSequence;
+     _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, conc_targetSequence,filledSequences, targetDictionary, is_anchor_repeated, reverse, extensionSequence );
 
     // We filter out loops (ie target = seed_Rc)
     for (auto it=filledSequences.begin() ; it != filledSequences.end();)
@@ -532,7 +559,10 @@ public:
      // Write insertions to file
      _object->writeFilledBreakpoint(filledSequences,seedName,infostring);
      _object->writeToGFA(filledSequences,sourceSequence,seedName,isRc);
-        
+    
+    if(filledSequences.size()==0 && _object->_extend){
+        _object->writeExtensions(extensionSequence, seedName, sourceSequence);
+    }
 
      _nb_breakpoints++;
 
@@ -630,10 +660,13 @@ public:
             targetDictionary.insert ({targetSequence, std::make_pair(breakpointName_R, false)});
 
             //_object->gapFill<span>(infostring,_tid,sourceSequence,targetSequence,filledSequences,begin_kmer_repeated,end_kmer_repeated);
-            _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, targetSequence,filledSequences, targetDictionary, is_anchor_repeated, false);
+            string extensionSequence;
+            string extensionSequenceRev;
+            _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, targetSequence,filledSequences, targetDictionary, is_anchor_repeated, false,  extensionSequence);
+            
 
             //If gap-filling failed in one direction, try the other direction (from target to source in revcomp)
-            if(filledSequences.size()==0){
+            if(!_object->_fwd_only && filledSequences.size()==0){
                 string targetSequence2 = revcomp_sequence(sourceSequence);
                 targetDictionary.clear();
                 targetDictionary.insert({targetSequence2, std::make_pair(breakpointName, false)});
@@ -642,13 +675,18 @@ public:
 
 
                 //_object->GapFill<span>(infostring,_tid,sourceSequence2,targetSequence2,filledSequences,begin_kmer_repeated,end_kmer_repeated,true);
-                _object->gapFillFromSource<span>(infostring,_tid, sourceSequence2, targetSequence2,filledSequences, targetDictionary, is_anchor_repeated, true);
+                _object->gapFillFromSource<span>(infostring,_tid, sourceSequence2, targetSequence2,filledSequences, targetDictionary, is_anchor_repeated, true, extensionSequenceRev);
 
             }
 
             _object->writeFilledBreakpoint(filledSequences,breakpointName,infostring);
             _object->writeVcf(filledSequences,breakpointName,sourceSequence);
 
+            if(filledSequences.size()==0 && _object->_extend){
+                _object->writeExtensions(extensionSequence, breakpointName, sourceSequence);
+                string sourceSequence2 = revcomp_sequence(targetSequence);
+                _object->writeExtensions(extensionSequenceRev, breakpointName+"_reverse", sourceSequence2);
+            }
             
             // We increase the breakpoint counter.
             _nb_breakpoints++;
@@ -814,7 +852,7 @@ void Filler::fillAny<span>::operator () (Filler* object)
 }
 
 template<size_t span>
-void Filler::gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary, bool is_anchor_repeated, bool reverse ){
+void Filler::gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary, bool is_anchor_repeated, bool reverse, std::string & extensionSequence){
     typedef typename gatb::core::kmer::impl::Kmer<span>::ModelCanonical ModelCanonical;
 
 
@@ -977,6 +1015,10 @@ void Filler::gapFillFromSource(std::string & infostring, int tid, string sourceS
         infostring +=   Stringify::format ("\t%d", filledSequences.size()) ;
     }
      }
+    else // no solution
+    {
+        extensionSequence = get_first_contig(contig_file_name);
+    }
     
     remove(contig_file_name.c_str());
     remove((contig_graph_file_prefix+".graph").c_str());
@@ -1230,6 +1272,24 @@ void Filler::writeToGFA(std::vector<filled_insertion_t>& filledSequences, string
     funlockfile(_gfa_file);
 }
 
+void Filler::writeExtensions(string contigSeq, string seedName, string sourceSequence){
+
+    
+    int llen = contigSeq.length() ;
+    
+   if(llen>0){
+        flockfile(_extension_file);
+
+        //writing sequence header
+        fprintf(_extension_file,">%s_len_%d source=%s\n",seedName.c_str(),llen, sourceSequence.c_str());
+        //writing DNA sequence
+        fprintf(_extension_file,"%.*s\n",(int)llen,contigSeq.c_str() );
+    
+        funlockfile(_extension_file);
+  }
+    
+}
+
 
 set< info_node_t >  Filler::find_nodes_containing_multiple_R(bkpt_dict_t targetDictionary, string linear_seqs_name, int nb_mis_allowed, int nb_gaps_allowed)
 {
@@ -1316,3 +1376,32 @@ set< info_node_t >  Filler::find_nodes_containing_multiple_R(bkpt_dict_t targetD
     delete Nodes;
     return terminal_nodes;
 }
+
+
+string  Filler::get_first_contig(string contig_file_name)
+{
+    BankFasta* contigs = new BankFasta((char *)contig_file_name.c_str());
+
+    string contigseq="";
+    size_t contiglen;
+
+    BankFasta::Iterator  * itSeq  =  new BankFasta::Iterator  (*contigs);
+
+    // We loop over sequences.
+    for (itSeq->first(); !itSeq->isDone(); itSeq->next())
+    {
+        contiglen = (*itSeq)->getDataSize();
+        //cout << contiglen << endl;
+        if (contiglen > _kmerSize )
+        {
+            contigseq =  string((*itSeq)->getDataBuffer(),contiglen);
+            //cout << "contig:"<< contigseq << endl;
+            contigseq = contigseq.substr(_kmerSize, contiglen);
+        }
+        break;
+    }
+    
+    delete itSeq;
+    delete contigs;
+    return contigseq;
+}


=====================================
src/Filler.hpp
=====================================
@@ -37,6 +37,8 @@ static const char* STR_URI_BKPT = "-bkpt";
 static const char* STR_MAX_DEPTH = "-max-length";
 static const char* STR_MAX_NODES = "-max-nodes";
 static const char* STR_FILTER = "-filter";
+static const char* STR_FWD_ONLY = "-fwd-only";
+static const char* STR_EXTEND = "-extend";
 
 
  class info_node_t
@@ -129,6 +131,15 @@ public:
     //output file with statistics about each attempt of gap-filling
     string _insert_info_file_name;
     FILE * _insert_info_file;
+    
+    //output file in vcf format (bkpt mode)
+    string _vcf_file_name;
+    FILE * _vcf_file;
+    
+    //output file for single-contig extensions (-extend option)
+    string _extension_file_name;
+    FILE * _extension_file;
+    bool _extend;
 
     //parameters for dbg traversal (stop criteria)
     int _max_depth;
@@ -141,11 +152,13 @@ public:
     //parameter for gap-filling between contigs
     int _contig_trim_size;
 
-    //parameter for filtering out low quality insertions
+    //parameter for filtering out low quality insertions (in vcf output, bkpt mode only)
     bool _filter;
     
-    string _vcf_file_name;
-    FILE * _vcf_file;
+    //parameter for constraining assembly in the forward direction only (bkpt mode only), otherwise (default behavior) tries the revcomp direction when no inserted sequence is assembled in forward.
+    bool _fwd_only;
+    
+
 
     // Actual job done by the tool is here
     void execute ();
@@ -161,6 +174,11 @@ public:
      */
     void writeVcf(std::vector<filled_insertion_t>& filledSequences, string breakpointName, string seedk);
 
+    /** writes a given extension sequence in the output extension file (fasta format)
+    */
+    void writeExtensions(string contigSeq, string seedName, string sourceSequence);
+
+    
     /** Fill one gap
      */
     /*template<size_t span>
@@ -168,7 +186,7 @@ public:
                  ,bool reversed =false);*/
 
     template<size_t span>
-    void gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary,bool is_anchor_repeated, bool reverse );
+    void gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary,bool is_anchor_repeated, bool reverse, std::string & extensionSequence );
 
     gatb::core::tools::dp::IteratorListener* _progress;
 
@@ -203,6 +221,12 @@ private:
     set< info_node_t >  find_nodes_containing_R(string targetSequence, string linear_seqs_name, int nb_mis_allowed, int nb_gaps_allowed, bool anchor_is_repeated);
     set< info_node_t> find_nodes_containing_multiple_R(bkpt_dict_t targetDictionary, string linear_seqs_name, int nb_mis_allowed, int nb_gaps_allowed);
 
+    /**
+     returns the first contig from the local assembly (only if >_kmer_size), when no solution is found, it can be returned as a safe extension.
+     */
+    string  get_first_contig(string contig_file_name);
+
+    
     /** Handle on the progress information. */
     void setProgress (gatb::core::tools::dp::IteratorListener* progress)  { SP_SETATTR(progress); }
 


=====================================
src/main.cpp
=====================================
@@ -26,7 +26,7 @@
 
 using namespace std;
 
-static const char* MTG_VERSION = "2.2.2";
+static const char* MTG_VERSION = "2.2.3";
 
 static const char* STR_FIND        = "find";
 static const char* STR_FILL = "fill";



View it on GitLab: https://salsa.debian.org/med-team/mindthegap/-/commit/b7bdcbee246d474229b51a2960c60e0f6caaf96c

-- 
View it on GitLab: https://salsa.debian.org/med-team/mindthegap/-/commit/b7bdcbee246d474229b51a2960c60e0f6caaf96c
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210921/f9290706/attachment-0001.htm>


More information about the debian-med-commit mailing list