[med-svn] [Git][med-team/mindthegap][master] 6 commits: d/watch: Fix watch regex
Nilesh Patra (@nilesh)
gitlab at salsa.debian.org
Tue Sep 21 16:43:21 BST 2021
Nilesh Patra pushed to branch master at Debian Med / mindthegap
Commits:
b3ce070e by Nilesh Patra at 2021-09-21T21:07:20+05:30
d/watch: Fix watch regex
- - - - -
b7bdcbee by Nilesh Patra at 2021-09-21T21:08:16+05:30
New upstream version 2.2.3
- - - - -
a76c4a28 by Nilesh Patra at 2021-09-21T21:08:16+05:30
routine-update: New upstream version
- - - - -
2d5c9522 by Nilesh Patra at 2021-09-21T21:08:21+05:30
Update upstream source from tag 'upstream/2.2.3'
Update to upstream version '2.2.3'
with Debian dir 85893c2d67fa74d55b1231870f86c04e0410eda9
- - - - -
1b2b53d9 by Nilesh Patra at 2021-09-21T21:08:22+05:30
routine-update: Standards-Version: 4.6.0
- - - - -
45f4c582 by Nilesh Patra at 2021-09-21T21:09:25+05:30
Upload to unstable
- - - - -
10 changed files:
- CHANGELOG.md
- CMakeLists.txt
- README.md
- debian/changelog
- debian/control
- debian/watch
- doc/MindTheGap_insertion_caller.md
- src/Filler.cpp
- src/Filler.hpp
- src/main.cpp
Changes:
=====================================
CHANGELOG.md
=====================================
@@ -4,6 +4,14 @@
## [Unreleased]
--------------------------------------------------------------------------------
+## [2.2.3] - 2021-06-11
+
+* 2 novel options in the `Fill` (local assembly) module :
+ * `-fwd-only`: output the first-contig extensions of failed gap-fillings in a separate file, it can be useful for the assembly of the extremities of linear genomes (as in the [MinYS](https://github.com/cguyomar/MinYS) tool)
+ * `-extend`: do not try in reverse direction if no inserted sequence is assembled (bkpt mode), it can improve the running time and/or help the user controlling the direction of assembly (as in the [MTG-link](https://github.com/anne-gcd/MTG-Link) tool)
+
+--------------------------------------------------------------------------------
+
## [2.2.2] - 2020-06-19
* A bug fix: updating gatb-core version, notably this fixes a bug in the `fill` module: nodes at extremities of contigs of size exactly `k` were not marked correctly, potentially leading to duplicated contigs in the contig graph. This could prevent exploring some parts of the graph, and if graph exploration parameters where set too large (`-max-nodes` and `-max-length`), it could lead in some rare cases to extreme running times and/or memory consumptions. This should no longer happen now.
=====================================
CMakeLists.txt
=====================================
@@ -28,7 +28,7 @@ cmake_minimum_required(VERSION 3.1)
# The default version number is the latest official build
SET (gatb-tool_VERSION_MAJOR 2)
SET (gatb-tool_VERSION_MINOR 2)
-SET (gatb-tool_VERSION_PATCH 2)
+SET (gatb-tool_VERSION_PATCH 3)
# But, it is possible to define another release number during a local build
IF (DEFINED MAJOR)
=====================================
README.md
=====================================
@@ -4,9 +4,9 @@
|-----------|-------------|
[![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/) | [![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/)
-Travis CI : [![Build Status](https://travis-ci.org/GATB/MindTheGap.svg?branch=master)](https://travis-ci.org/GATB/MindTheGap)
+[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mindthegap/README.html)
-[![License](http://img.shields.io/:license-affero-blue.svg)](http://www.gnu.org/licenses/agpl-3.0.en.html)
+[![License](http://img.shields.io/:license-affero-blue.svg)](http://www.gnu.org/licenses/agpl-3.0.en.html)
# What is MindTheGap ?
@@ -49,10 +49,7 @@ Retrieve a binary archive file from one of the official MindTheGap releases (see
tar -zxf MindTheGap-vX.Y.Z-bin-Darwin.tar.gz
cd MindTheGap-vX.Y.Z-bin-Darwin
chmod u+x bin/MindTheGap
-
- # run a simple example
- ./bin/MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example
- ./bin/MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example
+ ./bin/MindTheGap -help
In case the software does not run appropriately on your system, you should consider to install it from its source code. Retrieve the source archive file `MindTheGap-vX.Y.Z-Source.tar.gz`.
@@ -65,13 +62,19 @@ In case the software does not run appropriately on your system, you should consi
## Using conda or docker
MindTheGap is also distributed as a [Bioconda package](https://anaconda.org/bioconda/mindthegap):
-
+
conda install -c bioconda mindthegap
Or pull the docker image of MindTheGap (warning: need to be updated with latest releases):
docker pull clemaitr/mindthegap
+## Small run example
+
+```
+MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example
+MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example
+```
@@ -194,21 +197,26 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
Both MindTheGap modules generate the graph file if reads were given as input:
* a graph file (`.h5`). This is a binary file, to obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.
-
+
`MindTheGap find` generates the following output files:
+
* a breakpoint file (`.breakpoints`) in fasta format.
-* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs and deletion events.
+* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs and deletion events.
+
`MindTheGap fill` generates the following output files:
-* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences or contig gap-fills that were successfully assembled.
+* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences or contig gap-fills that were successfully assembled.
+
* an insertion variant file (`.insertions.vcf`) in vcf format, in the case of insertion variant detection.
-
+
* an assembly graph file (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).
-
+
* a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/grap-fill.
-
-
+
+* with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, ie. when the target kmer was not found, the first contig immediately after the source kmer is output.
+
+
Other optional parameters and details on input and output file formats are given in [doc/MindTheGap_insertion_caller.md](doc/MindTheGap_insertion_caller.md) and [doc/MindTheGap_assembly.md](doc/MindTheGap_assembly.md), depending on the usage.
=====================================
debian/changelog
=====================================
@@ -1,3 +1,12 @@
+mindthegap (2.2.3-1) unstable; urgency=medium
+
+ * Team upload.
+ * New upstream version
+ * Standards-Version: 4.6.0 (routine-update)
+ * d/watch: Fix watch regex
+
+ -- Nilesh Patra <nilesh at debian.org> Tue, 21 Sep 2021 21:08:31 +0530
+
mindthegap (2.2.2-2) unstable; urgency=medium
* Team upload.
=====================================
debian/control
=====================================
@@ -9,7 +9,7 @@ Build-Depends: debhelper-compat (= 13),
libboost-dev,
zlib1g-dev,
libhdf5-dev
-Standards-Version: 4.5.0
+Standards-Version: 4.6.0
Vcs-Browser: https://salsa.debian.org/med-team/mindthegap
Vcs-Git: https://salsa.debian.org/med-team/mindthegap.git
Homepage: https://github.com/GATB/MindTheGap
=====================================
debian/watch
=====================================
@@ -1,3 +1,3 @@
version=4
-https://github.com/GATB/MindTheGap/releases .*/archive/v?@ANY_VERSION@@ARCHIVE_EXT@
+https://github.com/GATB/MindTheGap/releases .*/archive/.*/v?@ANY_VERSION@@ARCHIVE_EXT@
=====================================
doc/MindTheGap_insertion_caller.md
=====================================
@@ -49,6 +49,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
* `-max-nodes`: maximum number of nodes in contig graph for each insertion assembly [default '100']. This arguments limits the computational time, this is especially useful for complex genomes.
* `-max-length`: maximum number of assembled nucleotides in the contig graph (nt) [default '10000']. This arguments limits the computational time, this is especially useful for complex genomes.
* `-filter`: if set, insertions with multiple solutions are not output in the final vcf file (default : not activated).
+ * `-fwd-only`: if set, inserted sequences are searched in only one direction : from the left kmer to the right kmer. Default behavior : not set, ie. when no solution is found in the forward direction, a solution is searched in the opposite direction (revcomp(right) --> revcomp(left)).
6. **MindTheGap Output**
@@ -92,7 +93,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
For insertion variants (`insertions.vcf` file only), positions are **left-normalized**. This happens when there is a small (typically <5bp)) repeated sequence between the breakpoint site and one extremity of the inserted sequence. In this case, multiple positions are possible. Here, the leftmost position is reported and the number of possible positions is indicated in the INFO field (NPOS id). This latter value is at least the size of the repeated sequence + 1 (>= fuzzy +1).
chr4 618791 bkpt20 T TAGGTGTATTTAGCTCCG . PASS TYPE=INS;LEN=17;QUAL=50;NSOL=1;NPOS=4;AVK=22.71;MDK=23.00 GT 1/1
- #there are 4 (NPOS) possible positions for an insertion of 17 nt from positions 618791 to 618794 on chr4, therefore the repeat is of size 3 and is AGG.
+ #there are 4 (NPOS) possible positions for an insertion of 17 nt from positions 618791 to 618794 on chr4, therefore the repeat is of size 3 and is AGG.
FILTER field: can be `PASS`or `LOWQUAL` (for insertions with multiple solutions)
=====================================
src/Filler.cpp
=====================================
@@ -64,6 +64,8 @@ Filler::Filler () : Tool ("MindTheGap fill") , _progress(0)
_breakpointMode = true;
_contig_trim_size = 0;
_filter = false;
+ _fwd_only = false;
+ _extend = false;
@@ -82,7 +84,8 @@ Filler::Filler () : Tool ("MindTheGap fill") , _progress(0)
generalParser->push_front (new OptionOneParam (STR_NB_CORES, "number of cores", false, "0" ));
IOptionsParser* inputParser = new OptionsParser("Input / output");
- inputParser->push_front (new OptionNoParam (STR_FILTER, "do not output low quality insertions", false));
+ inputParser->push_front (new OptionNoParam (STR_EXTEND, "output first-contig extensions of failed gap-fillings in a separate file", false));
+ inputParser->push_front (new OptionNoParam (STR_FILTER, "do not output low quality insertions (bkpt mode)", false));
inputParser->push_front (new OptionOneParam (STR_CONTIG_OVERLAP, "Overlap between input contigs (default, ie. 0 = kmer size)", false, "0"));
inputParser->push_front (new OptionOneParam (STR_URI_OUTPUT, "prefix for output files", false, ""));
inputParser->push_front (new OptionOneParam (STR_URI_BKPT, "breakpoint file", false, ""));
@@ -93,6 +96,7 @@ Filler::Filler () : Tool ("MindTheGap fill") , _progress(0)
IOptionsParser* fillerParser = new OptionsParser("Assembly");
//TODO HERE PUT THE FILL OPTIONS
+ fillerParser->push_front (new OptionNoParam (STR_FWD_ONLY, "do not try in reverse direction if no inserted sequence is assembled (bkpt mode)", false));
fillerParser->push_front (new OptionOneParam (STR_MAX_DEPTH, "maximum length of insertions (nt)", false, "10000"));
fillerParser->push_front (new OptionOneParam (STR_MAX_NODES, "maximum number of nodes in contig graph (nt)", false, "100"));
@@ -264,7 +268,16 @@ void Filler::execute ()
}
}
-
+ if (getInput()->get(STR_EXTEND) != 0)
+ {
+ _extend = true;
+ _extension_file_name = getInput()->getStr(STR_URI_OUTPUT)+".extensions.fasta";
+ _extension_file = fopen(_extension_file_name.c_str(),"w");
+ if(_extension_file == NULL){
+ string message = "Cannot open file "+ _extension_file_name + " for writing";
+ throw Exception(message.c_str());
+ }
+ }
//Getting the breakpoint sequences
if (_breakpointMode)
@@ -298,6 +311,11 @@ void Filler::execute ()
_filter = true;
}
+ if(getInput()->get(STR_FWD_ONLY) != 0)
+ {
+ _fwd_only = true;
+ }
+
// Now do the job
time_t start_time = time(0);
@@ -316,6 +334,10 @@ void Filler::execute ()
{
fclose(_gfa_file);
}
+ if (_extend)
+ {
+ fclose(_extension_file);
+ }
// We gather some info/statistics to print in stdout
resumeParameters();
@@ -451,6 +473,10 @@ void Filler::resumeResults(double seconds){
getInfo()->add(2,"assembly graph file","%s",_gfa_file_name.c_str());
}
getInfo()->add(2,"assembly statistics file","%s",_insert_info_file_name.c_str());
+ if (_extend)
+ {
+ getInfo()->add(2,"extension sequence file","%s",_extension_file_name.c_str());
+ }
}
@@ -507,7 +533,8 @@ public:
}
std::vector<filled_insertion_t> filledSequences;
- _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, conc_targetSequence,filledSequences, targetDictionary, is_anchor_repeated, reverse );
+ string extensionSequence;
+ _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, conc_targetSequence,filledSequences, targetDictionary, is_anchor_repeated, reverse, extensionSequence );
// We filter out loops (ie target = seed_Rc)
for (auto it=filledSequences.begin() ; it != filledSequences.end();)
@@ -532,7 +559,10 @@ public:
// Write insertions to file
_object->writeFilledBreakpoint(filledSequences,seedName,infostring);
_object->writeToGFA(filledSequences,sourceSequence,seedName,isRc);
-
+
+ if(filledSequences.size()==0 && _object->_extend){
+ _object->writeExtensions(extensionSequence, seedName, sourceSequence);
+ }
_nb_breakpoints++;
@@ -630,10 +660,13 @@ public:
targetDictionary.insert ({targetSequence, std::make_pair(breakpointName_R, false)});
//_object->gapFill<span>(infostring,_tid,sourceSequence,targetSequence,filledSequences,begin_kmer_repeated,end_kmer_repeated);
- _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, targetSequence,filledSequences, targetDictionary, is_anchor_repeated, false);
+ string extensionSequence;
+ string extensionSequenceRev;
+ _object->gapFillFromSource<span>(infostring,_tid, sourceSequence, targetSequence,filledSequences, targetDictionary, is_anchor_repeated, false, extensionSequence);
+
//If gap-filling failed in one direction, try the other direction (from target to source in revcomp)
- if(filledSequences.size()==0){
+ if(!_object->_fwd_only && filledSequences.size()==0){
string targetSequence2 = revcomp_sequence(sourceSequence);
targetDictionary.clear();
targetDictionary.insert({targetSequence2, std::make_pair(breakpointName, false)});
@@ -642,13 +675,18 @@ public:
//_object->GapFill<span>(infostring,_tid,sourceSequence2,targetSequence2,filledSequences,begin_kmer_repeated,end_kmer_repeated,true);
- _object->gapFillFromSource<span>(infostring,_tid, sourceSequence2, targetSequence2,filledSequences, targetDictionary, is_anchor_repeated, true);
+ _object->gapFillFromSource<span>(infostring,_tid, sourceSequence2, targetSequence2,filledSequences, targetDictionary, is_anchor_repeated, true, extensionSequenceRev);
}
_object->writeFilledBreakpoint(filledSequences,breakpointName,infostring);
_object->writeVcf(filledSequences,breakpointName,sourceSequence);
+ if(filledSequences.size()==0 && _object->_extend){
+ _object->writeExtensions(extensionSequence, breakpointName, sourceSequence);
+ string sourceSequence2 = revcomp_sequence(targetSequence);
+ _object->writeExtensions(extensionSequenceRev, breakpointName+"_reverse", sourceSequence2);
+ }
// We increase the breakpoint counter.
_nb_breakpoints++;
@@ -814,7 +852,7 @@ void Filler::fillAny<span>::operator () (Filler* object)
}
template<size_t span>
-void Filler::gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary, bool is_anchor_repeated, bool reverse ){
+void Filler::gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary, bool is_anchor_repeated, bool reverse, std::string & extensionSequence){
typedef typename gatb::core::kmer::impl::Kmer<span>::ModelCanonical ModelCanonical;
@@ -977,6 +1015,10 @@ void Filler::gapFillFromSource(std::string & infostring, int tid, string sourceS
infostring += Stringify::format ("\t%d", filledSequences.size()) ;
}
}
+ else // no solution
+ {
+ extensionSequence = get_first_contig(contig_file_name);
+ }
remove(contig_file_name.c_str());
remove((contig_graph_file_prefix+".graph").c_str());
@@ -1230,6 +1272,24 @@ void Filler::writeToGFA(std::vector<filled_insertion_t>& filledSequences, string
funlockfile(_gfa_file);
}
+void Filler::writeExtensions(string contigSeq, string seedName, string sourceSequence){
+
+
+ int llen = contigSeq.length() ;
+
+ if(llen>0){
+ flockfile(_extension_file);
+
+ //writing sequence header
+ fprintf(_extension_file,">%s_len_%d source=%s\n",seedName.c_str(),llen, sourceSequence.c_str());
+ //writing DNA sequence
+ fprintf(_extension_file,"%.*s\n",(int)llen,contigSeq.c_str() );
+
+ funlockfile(_extension_file);
+ }
+
+}
+
set< info_node_t > Filler::find_nodes_containing_multiple_R(bkpt_dict_t targetDictionary, string linear_seqs_name, int nb_mis_allowed, int nb_gaps_allowed)
{
@@ -1316,3 +1376,32 @@ set< info_node_t > Filler::find_nodes_containing_multiple_R(bkpt_dict_t targetD
delete Nodes;
return terminal_nodes;
}
+
+
+string Filler::get_first_contig(string contig_file_name)
+{
+ BankFasta* contigs = new BankFasta((char *)contig_file_name.c_str());
+
+ string contigseq="";
+ size_t contiglen;
+
+ BankFasta::Iterator * itSeq = new BankFasta::Iterator (*contigs);
+
+ // We loop over sequences.
+ for (itSeq->first(); !itSeq->isDone(); itSeq->next())
+ {
+ contiglen = (*itSeq)->getDataSize();
+ //cout << contiglen << endl;
+ if (contiglen > _kmerSize )
+ {
+ contigseq = string((*itSeq)->getDataBuffer(),contiglen);
+ //cout << "contig:"<< contigseq << endl;
+ contigseq = contigseq.substr(_kmerSize, contiglen);
+ }
+ break;
+ }
+
+ delete itSeq;
+ delete contigs;
+ return contigseq;
+}
=====================================
src/Filler.hpp
=====================================
@@ -37,6 +37,8 @@ static const char* STR_URI_BKPT = "-bkpt";
static const char* STR_MAX_DEPTH = "-max-length";
static const char* STR_MAX_NODES = "-max-nodes";
static const char* STR_FILTER = "-filter";
+static const char* STR_FWD_ONLY = "-fwd-only";
+static const char* STR_EXTEND = "-extend";
class info_node_t
@@ -129,6 +131,15 @@ public:
//output file with statistics about each attempt of gap-filling
string _insert_info_file_name;
FILE * _insert_info_file;
+
+ //output file in vcf format (bkpt mode)
+ string _vcf_file_name;
+ FILE * _vcf_file;
+
+ //output file for single-contig extensions (-extend option)
+ string _extension_file_name;
+ FILE * _extension_file;
+ bool _extend;
//parameters for dbg traversal (stop criteria)
int _max_depth;
@@ -141,11 +152,13 @@ public:
//parameter for gap-filling between contigs
int _contig_trim_size;
- //parameter for filtering out low quality insertions
+ //parameter for filtering out low quality insertions (in vcf output, bkpt mode only)
bool _filter;
- string _vcf_file_name;
- FILE * _vcf_file;
+ //parameter for constraining assembly in the forward direction only (bkpt mode only), otherwise (default behavior) tries the revcomp direction when no inserted sequence is assembled in forward.
+ bool _fwd_only;
+
+
// Actual job done by the tool is here
void execute ();
@@ -161,6 +174,11 @@ public:
*/
void writeVcf(std::vector<filled_insertion_t>& filledSequences, string breakpointName, string seedk);
+ /** writes a given extension sequence in the output extension file (fasta format)
+ */
+ void writeExtensions(string contigSeq, string seedName, string sourceSequence);
+
+
/** Fill one gap
*/
/*template<size_t span>
@@ -168,7 +186,7 @@ public:
,bool reversed =false);*/
template<size_t span>
- void gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary,bool is_anchor_repeated, bool reverse );
+ void gapFillFromSource(std::string & infostring, int tid, string sourceSequence, string targetSequence, std::vector<filled_insertion_t>& filledSequences, bkpt_dict_t targetDictionary,bool is_anchor_repeated, bool reverse, std::string & extensionSequence );
gatb::core::tools::dp::IteratorListener* _progress;
@@ -203,6 +221,12 @@ private:
set< info_node_t > find_nodes_containing_R(string targetSequence, string linear_seqs_name, int nb_mis_allowed, int nb_gaps_allowed, bool anchor_is_repeated);
set< info_node_t> find_nodes_containing_multiple_R(bkpt_dict_t targetDictionary, string linear_seqs_name, int nb_mis_allowed, int nb_gaps_allowed);
+ /**
+ returns the first contig from the local assembly (only if >_kmer_size), when no solution is found, it can be returned as a safe extension.
+ */
+ string get_first_contig(string contig_file_name);
+
+
/** Handle on the progress information. */
void setProgress (gatb::core::tools::dp::IteratorListener* progress) { SP_SETATTR(progress); }
=====================================
src/main.cpp
=====================================
@@ -26,7 +26,7 @@
using namespace std;
-static const char* MTG_VERSION = "2.2.2";
+static const char* MTG_VERSION = "2.2.3";
static const char* STR_FIND = "find";
static const char* STR_FILL = "fill";
View it on GitLab: https://salsa.debian.org/med-team/mindthegap/-/compare/9b2eddbae37a1758f8ee9d3478488d112f5888f0...45f4c5823582db564d2a23f135329d7a08260fbb
--
View it on GitLab: https://salsa.debian.org/med-team/mindthegap/-/compare/9b2eddbae37a1758f8ee9d3478488d112f5888f0...45f4c5823582db564d2a23f135329d7a08260fbb
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210921/36a57596/attachment-0001.htm>
More information about the debian-med-commit
mailing list