[med-svn] [Git][med-team/mindthegap][upstream] New upstream version 2.3.0
Nilesh Patra (@nilesh)
gitlab at salsa.debian.org
Sun May 29 06:53:31 BST 2022
Nilesh Patra pushed to branch upstream at Debian Med / mindthegap
Commits:
b9057443 by Nilesh Patra at 2022-05-29T11:15:30+05:30
New upstream version 2.3.0
- - - - -
30 changed files:
- − .travis.yml
- CHANGELOG.md
- CMakeLists.txt
- README.md
- data/reads_r1.fastq
- data/reads_r2.fastq
- data/reference.fasta
- doc/MindTheGap_insertion_caller.md
- scripts/jenkins/tool-mindthegap-build-debian7-64bits-gcc-4.7.sh
- scripts/jenkins/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1.sh
- scripts/jenkins/tool-mindthegap-release-debian.sh
- src/FindBreakpoints.hpp
- src/FindHeteroInsertion.hpp
- src/FindSNP.hpp
- + src/FindSmallInsertion.hpp
- src/Finder.cpp
- src/Finder.hpp
- src/main.cpp
- test/full_test/README
- test/full_test/allele1.fasta
- test/full_test/allele2.fasta
- test/full_test/gold.breakpoints
- test/full_test/gold.insertions.fasta
- test/full_test/gold.insertions.vcf
- test/full_test/gold.othervariants.vcf
- test/full_test/gold_bed.breakpoints
- test/full_test/gold_bed.othervariants.vcf
- test/full_test/gold_fill.output
- test/full_test/gold_find.output
- test/full_test/reference.fasta
Changes:
=====================================
.travis.yml deleted
=====================================
@@ -1,34 +0,0 @@
-language: cpp
-os:
-- linux
-- osx
-compiler:
-- clang
-- gcc
-addons:
- apt:
- sources:
- - ubuntu-toolchain-r-test
- - llvm-toolchain-precise-3.7
- - george-edison55-precise-backports # for cmake 3
- packages:
- - libcppunit-dev
- - g++-4.8
- - clang-3.7
- - cmake
- - cmake-data
-install:
-- if [ "`echo $CXX`" == "g++" ] && [ "$TRAVIS_OS_NAME" == "linux" ]; then export CXX=g++-4.8; fi
-- if [ "`echo $CXX`" == "clang++" ] && [ "$TRAVIS_OS_NAME" == "linux" ]; then export CXX=clang++-3.7; fi
-matrix:
- exclude:
- - os: osx
- compiler: gcc
-script:
-- mkdir build
-- cd build
-- cmake .. && make
-- cd ../test && ./simple_full_test.sh
-env:
- global:
- - MAKEFLAGS="-j 4"
=====================================
CHANGELOG.md
=====================================
@@ -3,6 +3,16 @@
--------------------------------------------------------------------------------
## [Unreleased]
+--------------------------------------------------------------------------------
+## [2.3.0] - 2022-04-20
+
+Improving the `Find` (insertion breakpoint finder) module:
+
+* very small insertions (1 or 2 bp) are now directly assembled in the `Find` module and are output in the `.othervariants.vcf` file. This may increase the running time of the `Find` module but the overall running time of MindTheGap (Find+Fill) is drastically reduced. Indeed, these numerous small insertions are no longer output in the breakpoint file, nor given as input for the `Fill` assembly module which performs a deeper traversal of the de Bruijn graph (designed for longer insertions).
+* a novel filter is implemented to reduce the number of false positive insertion sites. It is based on the number of branching kmers in a 100-bp window before a heterozygous site and can be tuned with the new option `-branching-filter`. It is activated by default, so the number of heterozygous sites detected may differ from previous versions.
+
+With this new version, the running time of MindTheGap as an insertion variant caller is reduced for real large datasets, such as human genome re-sequencing data.
+
--------------------------------------------------------------------------------
## [2.2.3] - 2021-06-11
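
The branching filter described in the 2.3.0 changelog entry above can be pictured as a sliding-window count. The following is a minimal sketch, not MindTheGap's implementation. It assumes a per-position flag saying whether the kmer at each reference position is branching. The names `count_branching_in_window` and `passes_branching_filter` are hypothetical, and whether the real comparison against the threshold is strict is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts branching kmers among the `window` positions
// that immediately precede `site` (the site itself is excluded).
// `is_branching[i]` is assumed to be true when the kmer starting at
// reference position i is a branching node of the de Bruijn graph.
std::size_t count_branching_in_window(const std::vector<bool>& is_branching,
                                      std::size_t site,
                                      std::size_t window = 100)
{
    assert(site <= is_branching.size());
    const std::size_t begin = (site > window) ? site - window : 0;
    std::size_t count = 0;
    for (std::size_t i = begin; i < site; ++i) {
        if (is_branching[i]) ++count;
    }
    return count;
}

// Hypothetical decision rule: a negative threshold disables the filter
// (mirroring the documented '-1' value); otherwise the site is kept when the
// count does not exceed the threshold. Whether the real comparison is strict
// is an assumption of this sketch.
bool passes_branching_filter(const std::vector<bool>& is_branching,
                             std::size_t site,
                             int threshold = 15)
{
    if (threshold < 0) return true;
    return count_branching_in_window(is_branching, site)
           <= static_cast<std::size_t>(threshold);
}
```

Under this reading, lowering the threshold (for example to 10 or 5) discards more candidate sites in repeat-rich regions, which is consistent with the trade-off between running time and recall mentioned in the option's documentation.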
=====================================
CMakeLists.txt
=====================================
@@ -27,8 +27,8 @@ cmake_minimum_required(VERSION 3.1)
################################################################################
# The default version number is the latest official build
SET (gatb-tool_VERSION_MAJOR 2)
-SET (gatb-tool_VERSION_MINOR 2)
-SET (gatb-tool_VERSION_PATCH 3)
+SET (gatb-tool_VERSION_MINOR 3)
+SET (gatb-tool_VERSION_PATCH 0)
# But, it is possible to define another release number during a local build
IF (DEFINED MAJOR)
=====================================
README.md
=====================================
@@ -2,7 +2,7 @@
| **Linux** | **Mac OSX** |
|-----------|-------------|
-[![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/) | [![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/)
+[![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap-gitlab/job/tool-mindthegap-build-debian7-64bits-gcc-4.7-gitlab/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/) | [![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap-gitlab/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1-gitlab/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mindthegap/README.html)
@@ -194,27 +194,30 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
4. **MindTheGap Output**
All the output files are prefixed either by a default name: "MindTheGap_Expe-[date:YY:MM:DD-HH:mm]" or by a user defined prefix (option `-out` of MindTheGap).
- Both MindTheGap modules generate the graph file if reads were given as input:
-* a graph file (`.h5`). This is a binary file, to obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.
-
- `MindTheGap find` generates the following output files:
+ The main results files are output by the Fill module, these are:
- * a breakpoint file (`.breakpoints`) in fasta format.
+ * an **insertion variant file** (`.insertions.vcf`) in vcf format, in the case of insertion variant detection (for insertions >2 bp).
+
+ * an **assembly graph file** (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).
+
+ Additional output files are:
-* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs and deletion events.
+ * a graph file (`.h5`), output by both MindTheGap modules. This is a binary file containing the de Bruijn graph data structure. To obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.
- `MindTheGap fill` generates the following output files:
+ * Files output specifically by `MindTheGap find`:
-* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences or contig gap-fills that were successfully assembled.
-
-* an insertion variant file (`.insertions.vcf`) in vcf format, in the case of insertion variant detection.
+ * a breakpoint file (`.breakpoints`) in fasta format.
+
+ * a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs, deletions and very small insertions (1-2 bp).
-* an assembly graph file (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).
+ * Files output specifically by `MindTheGap fill`:
+
+ * a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences (for insertions >2 bp) or contig gap-fills that were successfully assembled.
-* a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/grap-fill.
+ * a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/gap-fill.
-* with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, ie. when the target kmer was not found, the first contig immediately after the source kmer is output.
+ * with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, ie. when the target kmer was not found, the first contig immediately after the source kmer is output.
@@ -233,10 +236,14 @@ Either in your `bin/` directory or in `ext/gatb-core/bin/`, you can find additio
## Reference
+If you use MindTheGap, please cite:
+
MindTheGap: integrated detection and assembly of short and long insertions. Guillaume Rizk, Anaïs Gouin, Rayan Chikhi and Claire Lemaitre. Bioinformatics 2014 30(24):3451-3457. http://bioinformatics.oxfordjournals.org/content/30/24/3451
[Web page](https://gatb.inria.fr/software/mind-the-gap/) with some updated results.
+MindTheGap was also evaluated in a recent benchmark exploring many different genomic features (size, nature, repeat context, junctional homology at breakpoints) of human insertion variants. Among other tested SV callers, MindTheGap was the only tool able to output sequence-resolved insertions for many types of insertions. Read more: [Towards a better understanding of the low recall of insertion variants with short-read based variant callers.](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07125-5) Delage W, Thevenon J, Lemaitre C. *BMC Genomics* **2020**, 21(1):762.
+
# Contact
=====================================
data/reads_r1.fastq
=====================================
The diff for this file was not included because it is too large.
=====================================
data/reads_r2.fastq
=====================================
The diff for this file was not included because it is too large.
=====================================
data/reference.fasta
=====================================
@@ -8,3 +8,7 @@ TGCTGCCGATCGCTACGACGTCCTACCTTACACACAACGGGCCGCGTTCATACCCACGTATGAAGACATGCGGTTATCCG
TATTGCGCCCTTCAAGAAGCTTCTGCTGACCGTAGGCGTCTCGGCGGTTTGTACTTTGAAAAATTAGCTGCACTACATCCGATGGGTATCCCTCCTCAATCTCAGCAGACCCGGAAAGCGATAGAATCAGCCACGCGGTCGTCCGGGCTAGGGGCCCTGCGCAAGGAAGGTTGGACAGGGCTAGACCCGGAAGCATCGGCTTTTCCTAAATGGTGACGGAGTTATATAGGGTAAGCCTGATAGCGCGGTAGGTGTAATGGCCATCCCCTCGCCTAGCGTGCGCGCAGACAAGTCCAGTCCCGGAGGAGGCATAGGCCTCATTATCATTTCCCTAGAATCGCTCTTGACATCTAGGTTGTACTAGGGACCAGGCGCCCAAAGCGGACGGTTCTCCGTGCTTTCGTGCCGTTTCAGCGTAAGATGCTATTTTTTGGGGAAATGGTCGGCGTGTGCGGGGGAGAACCACGGTACCAACTACGATAAGTCCGTCGTGTAACTTACGTGAAGGTGCTGTGAAGCAGGAATCCGTGCCAAAATGTCCGTGCGATATCCAACTTTCATAGTATTACACGAGAGCCTATGATTTGCCCAGGCGCGACCCGTGAATCGAGGTAATCGCCGACCAGATATTGCGAAACACCACATTACATGACTACTGTCCGCTTGAAGAGTTATATACTTGACAGTCCTGGTTGACGGCACAGCATATCTCCAATGTGTGGTTTAAAGTCTCACGTTCTTCATGCGCGCCGGCCCATGGGAACAGGTATCCTTACTTTCGGTACAAATGAGGCTCCAAAATAGCACGCTTGCAGCAGTCAAGTTGAACGCCTTAAAAGGCACCGCCGCTCGTTCATTGGGATTCCTTGAGAATCGTGACTTGTTACACTATAAGATCATGGATTGGACAAAATAGGCCAACTCCCGCACGCTGTGGCTATTCTTAAGTTGCATAGGTGGGAGTAGCCTTATACTCGATTTCTAAAAAGAGTAGGTGAGC
>Seq4
TTCCGGCGCCGCACTAATTGAAGTGGTGAGCTGACCAGTCGTTCAGGATCCGAAGGCGGGGATGGCGCTATAGGAGCCGGCAGGTATGCTTTGCCGCAAAATTTCGGGGTGGTGGAACCGTCTTACCGAAAGTTAGCTACAGCCTGGAATGTGAAATTCCATGACCTGCCCGTCCTGTGTCCACAGGGCGACATTTGCCACGTAGGTAGGGCGACCATTAGAATGCTGCATTATCGGGCGATAAAAAGTTTTATACCCAAGAATCCTACAAAGATGAAAATTTCGAAGAGCTGCACGCAGTTGTAAGTTGCTTTTCTGGGGTAATCGAGATTCTCCACCATAACCTGCGCAATGCATCGTGAAGCTTTACCGCGCCCAAGGGGAGCGTCTCAGTGGGGTTGCCTCCAGGGATATATTGAAAGTTGAAGAAGAAGATCACAGGTTAAGCGGTATGTTAAGTTAGAACTCACGGGGAGCCGCCTTGATTTTGTTCGACATGAACCAGAGACCAAGTGTGTTATGTTCTGGAACCTTAATACGTACGTCGCCAGCACCGAGCCGGCACTCCATCTCTTTTGGGTGCGCAACATTGCTATACTTAGGATCCATTGACATCTGTCAGCCGTCTTTCCAGAACGTTATAAGACTCGTGAGGAAATTATACAAATCGTTGCCATCATCCAAAGCAAAGTACTTCCGCTTAGGAGTGCCTTGAAGAACCGATTATCTCTGACAATGTAATGCCACAGCACCCTCGACAAAGTTCTACATTCGTTCCAGGTCATGATACAGCGCGCTAAATTACCGCTACGAGCCATACCCCGAACATTGAGACCTGGCCAGTAGGTAGGTGTCAAATCGATATCCACACCTGTCGAAGCAGCTAGGGACCTAGACGCAACAGTAACCGCCTCGGAGTAAGCCCTGGTAAAGATCGGTTGCGGCGGGAGTCCTCCATTCAGGCCAAACGTGCAGTGCTCGATGTGCTTCCTATCGCTCT
+>Seq5
+GATGTTTAGAAGTTTCCAGGTCACGCCAATGATTGGCATTTACACACGTGGATCAGCGGACATATCTAACCCTTAGTGTTCTTAAGAGCAACTCACTACTCATTTCCACTAACCCCGCCGGCGGTAATTCCAATCTAGTTGATCAGACTTCCCAGTCAATGAAAGCGACACCGTGCGTCTGTAATACCAACAAGACCCTGCTGTCGTCCCGCAGAGGACGCGGCACCTCCGGATTTTGAGTCCAGTCTGAACGATTTTCGATCACTCACCATGGATCTGGAAAACGGAGTCGAGTACTCAGAGCCAAATTGATGCATTTCCAATGACCCGATGCAGGTGCGACCGATCTTCGCCTATGCTTCCCGCCGTAATTATTGAGTCTGGGTCCCGGCCGCTAACGTTGACTCACGGGGAGGTACCCGTGCGTATTCTTCTCAAAGTGACGCTGGACAGCAGCGCATGTCCGAGCCCCATCGTCCTATCTGGTGTAGAGTCTTACCCTAATTAGAGTGATCGAACCAGTAGGTGTCGCGGTCTTAGGGCTCCCATTGTCCAAGGGAACGTGAACAGATATGAATCTGGGAGAATAGTGCAGCGTTGCCCTTCTGGTCGGTCAGCCCTTGCCTACGGCCCGTATGCGGAGAATGAAGGCGTGAAACATTCTGCTCTTTTAGAAGCAGCGGCTGCACCCGTATAACAATCGCACGATCGTACGTCTCATTTGCCGCGTTGGCGCGCCCGTGGATGATGGACCACGGTATGAACCTCTGCACTTCAAATTTGACGCAATCCTGCACTCACCGCACACAGTTCTAGTCTAACCGTCGCAGTGTCTGCTTTAAGGTAGAGATCGATACTTAGGATATGTTCATGTGTGTTTGTAGCGCTGGACCCTCTTATGGTGTGGTCACTTGTGATGGATCGAGGAACTTAGGCGGTTAACTTGTTTCGACGTCTCACCGACAATATCAGGATTTAGTATCG
+>Seq6
+ACCGAAAATGACAATGTTCACACGCATGCTCGGCGTGGAAAAGAGCCTTTTCTAAGACCGACTCGTTCCGGGCAGCAGGATTATTAGCCAATCAAAATTATCGACCGGTCATCAAGCTGCGATAGTGCAGGCGCATGCCGTCCAATGGGTCCACGGCGGAAGTGCGTTCGTCTACTCTGTCAAATCTTAACATTTTTTGAGGCTAATCCGGCCGGTAGTGTACCGTGAACCAAAGTCCTTCTACGAGCGTATTAGATTGCTCAAAAGATCCGGGAGAATTGACCAGGTCGTATCTTTAAAAACGCTGGTGCGAGCAGCTGCTGTTTTATCAACACCCATTTAGTCCTGTGAAGTTTGCTTAGCAGATACACCTTCCCGCGTGGTATGAGAGGCTGTTCTTTTAAAAACTATGAGGCTCTGGCACCTTCGACGCTAACAAAGTCCCCACGGACCATGATACCCTTACGCAACTCTCTTTGCACGCTAGGGCGAGAGTACTGCCCCTAGACTAGGTACACGCCGGGTAAACTCTCTCGCACACCTTTACGCTCGACTACAGGCTTCTAACCCTTCCGAACGCATATAATTCAAATGGCACTTAGTAACAGACGAATCACGGCTCACAGGCAGAATTCACTGGAGTAAAAGGATTCAGAACAATAGATAGTGTGTTAACTTTACAGTCATCCGTATTATAACGTAGCGAGAGGATTGAGTTCTTGTTAGGAAGGAAGGTCCTATAGACGAGTGCGGTAGCGCACCCGGTCGCCTTGCGTAGTCATGCCCGACGTGTTGATGGTTCCCTTTTAGCCGCCACACAAGGGATCCGAGGGTGAGAGACACATGGCCCTCACCGACGAGACTTACTCAGCCTGCCTCGCTATTGCCCTCTTTTTGATCGTCCCTTTGTGGCTCTCGAGGACTCGTGCAGCGTGTATCTGGGGATTTGTAAGCTTAAGACTACCTTCCATAGGA
=====================================
doc/MindTheGap_insertion_caller.md
=====================================
@@ -34,18 +34,28 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
4. **Find module specific options**
In addition to the read or graph files, the `find` module has one mandatory option `-ref` and several optional options:
+
* `-ref`: the path to the reference genome file (in fasta format).
* `-homo-only`: only homozygous insertions are reported (default: not activated).
* `-max-rep`: maximal repeat size allowed for fuzzy sites [default '5'].
* `-het-max-occ`: maximal number of occurrences of a (k-1)mer in the reference genome allowed for heterozyguous insertion breakpoints [default '1']. In order to detect an heterozyguous insertion breakpoints, both flanking k-1-mers, at each side of the insertion site, must have strictly less than this number of occurrences in the reference genome. This prevents false positive predictions inside repeated regions. Warning : increasing this parameter may lead to numerous false positives (genomic approximate repeats).
- * `-bed`: the path to a bed file defining genomic regions, to limit the find algorithm to particular regions of the genome. This can be usefull for exome data.
+ * `-branching-filter`: maximal number of branching kmers in a 100-bp window before a heterozygous site [default '15', '-1' means no filter applied]. This filter prevents numerous false positive predictions inside repeated regions. In large and complex genomes, such as human, this parameter can be set to lower values (10 or 5), in order to decrease the running time of the Fill module (but this may result in a loss of recall in repeat-rich regions).
+ * `-bed`: the path to a bed file defining genomic regions, to limit the find algorithm to particular regions of the genome. This can be useful for exome data. Important: the bed file must be sorted and its overlapping intervals merged, for example:
+
+ ```
+ sort -k1,1 -k2,2n file.bed > file_sorted.bed
+ bedtools merge -i file_sorted.bed > file_final.bed
+ ```
+
5. **Fill module specific options**
- In addition to the read or graph files, the `fill` module has one other mandatory option, `-bkpt`
- * `-bkpt`: the breakpoint file path. This is one of the output of the `find` module and contains for each detected insertion site its left and right kmers from and to which the local assembly will be performed (see section E for details about the format).
+ In addition to the read or graph files, the `fill` module has one other mandatory option, `-bkpt`:
+
+ * `-bkpt`: the breakpoint file path. This is one of the output of the `find` module and contains for each detected insertion site its left and right kmers from and to which the local assembly will be performed (see section [Output formats](#output-formats) for details about the format).
The fill module has several optional options:
+
* `-max-nodes`: maximum number of nodes in contig graph for each insertion assembly [default '100']. This arguments limits the computational time, this is especially useful for complex genomes.
* `-max-length`: maximum number of assembled nucleotides in the contig graph (nt) [default '10000']. This arguments limits the computational time, this is especially useful for complex genomes.
* `-filter`: if set, insertions with multiple solutions are not output in the final vcf file (default : not activated).
@@ -55,21 +65,25 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
All the output files are prefixed either by a default name: "MindTheGap_Expe-[date:YY:MM:DD-HH:mm]" or by a user defined prefix (option `-out` of MindTheGap)
Both MindTheGap modules generate the graph file if reads were given as input:
+
* a graph file (`.h5`). This is a binary file, to obtain information stored in it, you can use the utility program dbginfo located in your bin directory or in ext/gatb-core/bin/.
`MindTheGap find` generates the following output files:
+
* a breakpoint file (`.breakpoints`) in fasta format. It contains the breakpoint sequences of each detected insertion site. Each insertion site corresponds to 2 consecutive entries in the fasta file : sequences are the left and right side flanking kmers.
- * a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs and deletion events.
+ * a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs, deletions and very small insertions (1-2 bp).
`MindTheGap fill` generates the following output files:
- * a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences or contig gap-fills that were successfully assembled. In the case of insertion variants, the location of each insertion on the reference genome can be found in its fasta header. The fasta header includes also information about each gap-fill such as its length, quality score and median kmer abundance.
- * an insertion variant file (`.insertions.vcf`) in vcf format, in the case of insertion variant detection. This file contains all information of assembled insertion variants as in the `.insertions.fasta` file but in a different format. Here, insertion site positions are 1-based and left-normalized according to the VCF format specifications (contrary to positions indicated in the `.breakpoints` and `insertions.fasta` files which are right-normalized). Normalization occurs when multiple positions are possible for a single variation due to a small repeat.
+
+ * a sequence file (`.insertions.fasta`) in fasta format (for insertions >2 bp). It contains the inserted sequences or contig gap-fills that were successfully assembled. In the case of insertion variants, the location of each insertion on the reference genome can be found in its fasta header. The fasta header includes also information about each gap-fill such as its length, quality score and median kmer abundance.
+ * an insertion variant file (`.insertions.vcf`) in vcf format (for insertions >2 bp). This file contains all information of assembled insertion variants as in the `.insertions.fasta` file but in a different format. Here, insertion site positions are 1-based and left-normalized according to the VCF format specifications (contrary to positions indicated in the `.breakpoints` and `insertions.fasta` files which are right-normalized). Normalization occurs when multiple positions are possible for a single variation due to a small repeat.
* a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/grap-fill.
## Details on output formats
+<a name="output-formats"></a>
1. Breakpoint format
@@ -98,6 +112,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
FILTER field: can be `PASS`or `LOWQUAL` (for insertions with multiple solutions)
INFO fields:
+
* `TYPE`: variant type, INS for insertion
* `LEN`: insertion size in bp
* `QUAL`: quality of the insertion (quality scores range from 0 to 50, 50 being the best quality, see the different quality scores [below](#quality))
@@ -128,6 +143,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
<a name="quality"></a>
Each insertion is assigned a quality score ranging from 0 (low quality) to 50 (highest quality). This quality score reflects mainly repeat-associated criteria:
+
* `qual=5`: if one of the breakpoint kmer could not be found exactly but with 2 errors (mismatches)
* `qual=10`: if one of the breakpoint kmer could not be found exactly but with 1 error (mismatch)
* `qual=15`: if multiple sequences can be assembled for a given breakpoint (note that to output multiple sequences, they must differ from each other significantly, ie. <90% identity)
@@ -137,6 +153,7 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
5. Gap-filling information file:
For each gap-fill, some informations about the filling process are given in the file `.info.txt`, whether it has been successfully filled or not. This can help understand why some breakpoints could not be filled. Here are the description of the columns:
+
* column 1 : breakpoint name
* column 2-4 : number of nodes in the contig graph, total nt assembled, number of nodes containing the right breakpoint kmer
* (optionnally) column 5-7 : same informations as in column 2-4 but for the filling process in the reverse direction from right to left kmer, activated only if the filling failed in the forward direction
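
The `-bed` requirement documented in this file (sorted input with overlapping intervals merged) is what the `sort` and `bedtools merge` commands produce. As an illustration only, the same normalization for a single chromosome could look like the sketch below. `Interval` and `merge_intervals` are hypothetical names that are not part of MindTheGap, and treating touching intervals as mergeable is an assumption about the intended behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open interval [start, end) on one chromosome, as in BED.
using Interval = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper: returns the intervals sorted by start, with
// overlapping or touching intervals fused into one.
std::vector<Interval> merge_intervals(std::vector<Interval> intervals)
{
    std::sort(intervals.begin(), intervals.end());
    std::vector<Interval> merged;
    for (const Interval& iv : intervals) {
        assert(iv.first <= iv.second);  // malformed BED lines are not handled
        if (!merged.empty() && iv.first <= merged.back().second) {
            // Extend the previous interval instead of starting a new one.
            merged.back().second = std::max(merged.back().second, iv.second);
        } else {
            merged.push_back(iv);
        }
    }
    return merged;
}
```

For real data, running the documented shell commands remains the simpler route; the sketch only shows what "sorted and merged" means for the intervals the `find` module reads.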
=====================================
scripts/jenkins/tool-mindthegap-build-debian7-64bits-gcc-4.7.sh
=====================================
@@ -28,6 +28,8 @@ DO_NOT_STOP_AT_ERROR : ${DO_NOT_STOP_AT_ERROR}
Jenkins build parameters (built in)
-----------------------------------------
BUILD_NUMBER : ${BUILD_NUMBER}
+JENKINS_HOME : ${JENKINS_HOME}
+WORKSPACE : ${WORKSPACE}
"
error_code () { [ "$DO_NOT_STOP_AT_ERROR" = "true" ] && { return 0 ; } }
@@ -54,12 +56,15 @@ g++ --version
[ `gcc -dumpversion` = 4.7 ] && { echo "GCC 4.7"; } || { echo "GCC version is not 4.7, we exit"; exit 1; }
-JENKINS_TASK=tool-${TOOL_NAME}-build-debian7-64bits-gcc-4.7
+JENKINS_TASK=tool-${TOOL_NAME}-build-debian7-64bits-gcc-4.7-gitlab
+JENKINS_WORKSPACE=$WORKSPACE/$JENKINS_TASK/
+
GIT_DIR=/scratchdir/builds/workspace/gatb-${TOOL_NAME}
BUILD_DIR=/scratchdir/$JENKINS_TASK/gatb-${TOOL_NAME}/build
rm -rf $BUILD_DIR
mkdir -p $BUILD_DIR
+mkdir -p $JENKINS_WORKSPACE
#-----------------------------------------------
# we need gatb-core submodule to be initialized
@@ -118,10 +123,19 @@ cd build
# PACKAGING #
################################################################
-# Upload bin bundle to the forge
+#-- Upload bin bundle as a build artifact
+# -> bin bundle *-bin-Linux.tar.gz will be archived as a build artifact
+# -> source package is handled by the osx task
+
if [ $? -eq 0 ] && [ "$INRIA_FORGE_LOGIN" != none ] && [ "$DO_NOT_STOP_AT_ERROR" != true ]; then
- make package
- scp ${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Linux.tar.gz ${INRIA_FORGE_LOGIN}@scm.gforge.inria.fr:/home/groups/gatb-tools/htdocs/ci-inria
- # source package is handled by the osx task
+ echo "Creating a binary archive... "
+ make package
+
+ pwd
+ ls -atlhrsF
+
+ #-- Move the generated bin bundle to the workspace (so that it can be uploaded as a Jenkins job artifact)
+ mv *-${BRANCH_TO_BUILD}-bin-Linux.tar.gz $JENKINS_WORKSPACE/
+
fi
=====================================
scripts/jenkins/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1.sh
=====================================
@@ -28,6 +28,8 @@ DO_NOT_STOP_AT_ERROR : ${DO_NOT_STOP_AT_ERROR}
Jenkins build parameters (built in)
-----------------------------------------
BUILD_NUMBER : ${BUILD_NUMBER}
+JENKINS_HOME : ${JENKINS_HOME}
+WORKSPACE : ${WORKSPACE}
"
error_code () { [ "$DO_NOT_STOP_AT_ERROR" = "true" ] && { return 0 ; } }
@@ -117,9 +119,12 @@ cd ../build
# Prepare and upload bin and source bundle to the forge
if [ $? -eq 0 ] && [ "$INRIA_FORGE_LOGIN" != none ] && [ "$DO_NOT_STOP_AT_ERROR" != true ]; then
- make package
+ make package
make package_source
- scp ${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Darwin.tar.gz ${INRIA_FORGE_LOGIN}@scm.gforge.inria.fr:/home/groups/gatb-tools/htdocs/ci-inria
- scp ${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-Source.tar.gz ${INRIA_FORGE_LOGIN}@scm.gforge.inria.fr:/home/groups/gatb-tools/htdocs/ci-inria
+
+ # make both tar.gz available as Jenkins build artifacts
+ cp ${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Darwin.tar.gz ${WORKSPACE}/
+ cp ${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-Source.tar.gz ${WORKSPACE}/
+
fi
=====================================
scripts/jenkins/tool-mindthegap-release-debian.sh
=====================================
@@ -85,18 +85,24 @@ if [ "$INRIA_FORGE_LOGIN" == none ]; then
fi
cd $BUILD_DIR
-git clone https://github.com/pgdurand/github-release-api.git
+git clone https://github.com/GATB/github-release-api.git
################################################################
# RETRIEVE ARCHIVES FROM INRIA FORGE #
################################################################
+CI_URL=https://ci.inria.fr/gatb-core/view/MindTheGap-gitlab/job
+JENKINS_TASK_DEB=tool-mindthegap-build-debian7-64bits-gcc-4.7-gitlab
+JENKINS_TASK_MAC=tool-mindthegap-build-macos-10.9.5-gcc-4.2.1-gitlab
+
#retrieve last build from ci-inria (see tool-lean-build-XXX tasks)
-scp ${INRIA_FORGE_LOGIN}@scm.gforge.inria.fr:/home/groups/gatb-tools/htdocs/ci-inria/${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Linux.tar.gz .
+wget $CI_URL/$JENKINS_TASK_DEB/lastSuccessfulBuild/artifact/$JENKINS_TASK_DEB/${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Linux.tar.gz
[ $? != 0 ] && exit 1
-scp ${INRIA_FORGE_LOGIN}@scm.gforge.inria.fr:/home/groups/gatb-tools/htdocs/ci-inria/${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Darwin.tar.gz .
+
+wget $CI_URL/$JENKINS_TASK_MAC/lastSuccessfulBuild/artifact/${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-bin-Darwin.tar.gz
[ $? != 0 ] && exit 1
-scp ${INRIA_FORGE_LOGIN}@scm.gforge.inria.fr:/home/groups/gatb-tools/htdocs/ci-inria/${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-Source.tar.gz .
+
+wget $CI_URL/$JENKINS_TASK_MAC/lastSuccessfulBuild/artifact/${ARCHIVE_NAME}-${BRANCH_TO_BUILD}-Source.tar.gz
[ $? != 0 ] && exit 1
################################################################
=====================================
src/FindBreakpoints.hpp
=====================================
@@ -103,7 +103,7 @@ public :
/** writes a given variant in the output vcf file
*/
void writeVcfVariant(int bkt_id, string& chrom_name, uint64_t position, char* ref_char, char* alt_char, int repeat_size, string type);
-
+ void writeIndel(int bkt_id, string &chrom_name, uint64_t position, string ref_char, string alt_char, int repeat_size, string type);
/*Getter*/
/** Return the number of found breakpoints
@@ -130,7 +130,7 @@ public :
*/
size_t kmer_size();
- /** Return the numbre of max repeat
+ /** Return the max repeat size at breakpoint
*/
int max_repeat();
@@ -138,6 +138,9 @@ public :
*/
int snp_min_val();
+ /** Return the threshold value of the branching filter
+ */
+ int branching_threshold();
/** The last solid kmer before gap
*/
@@ -233,7 +236,18 @@ public :
/** Incremente the value of backup_iterate
*/
int backup_iterate();
-
+
+ /*Incremente the value of homo_clean_indel
+ */
+ int homo_clean_indel_iterate();
+
+ /* Incremente the value of homo_fuzzy_indel
+ */
+ int homo_fuzzy_indel_iterate();
+ /* Incremente the value of hetero_indel
+ */
+ int hetero_indel_iterate();
+
/*Setter*/
/** Set value of recent_hetero
*/
@@ -465,8 +479,12 @@ void FindBreakpoints<span>::operator()()
v.push_back(token);
}
if(v[0]==m_chrom_name){ // we are on the current chromosome
- interval=std::make_pair(std::stoi(v[1]),std::stoi(v[2]));
- interval_vector.push_back( tuple<uint64_t ,uint64_t>(interval));
+ uint64_t bed_begin = std::stoi(v[1]);
+ uint64_t bed_end = std::stoi(v[2]);
+ if ((bed_end-bed_begin) > this->finder->_kmerSize){
+ interval=std::make_pair(std::stoi(v[1]),std::stoi(v[2]));
+ interval_vector.push_back( tuple<uint64_t ,uint64_t>(interval));
+ }
}
iss.clear();
}
@@ -489,18 +507,28 @@ void FindBreakpoints<span>::operator()()
end_pos=get<1>(interval_vector.front());
}
- if(!(*m_it_kmer).isValid() || (m_position<start_pos))
+ if(!(*m_it_kmer).isValid())
{
- //Reintialize stretch_size for each bed region
-
+ //Re-initialize stretch_size
this->m_solid_stretch_size = 0;
this->m_gap_stretch_size = 0;
this->m_kmer_begin = KmerCanonical();
this->m_kmer_end = KmerCanonical();
- //DEBUG
- //cout<<"n";
+
}
-
+
+ if(m_position==start_pos-1) //for each beginning of bed region
+ {
+ //Re-initialize stretch_size for each bed region
+ this->m_solid_stretch_size = 0;
+ this->m_gap_stretch_size = 0;
+ this->m_kmer_begin = KmerCanonical();
+ this->m_kmer_end = KmerCanonical();
+
+ //Re-initialize het_kmer_history for each bed region
+ memset(this->m_het_kmer_history, 0, sizeof(info_type)*256);
+ }
+
if(((*m_it_kmer).isValid()) && (m_position>=start_pos)) //inside the current bed interval
{
@@ -648,7 +676,30 @@ void FindBreakpoints<span>::writeVcfVariant(int bkt_id, string& chrom_name, uint
repeat_size
);
}
-
+template <size_t span>
+void FindBreakpoints<span>::writeIndel(int bkt_id, string &chrom_name, uint64_t position, string ref_string, string alt_string, int repeat_size, string type)
+{
+ // NOTE : currently all positions coming from FindObservers are 0-based, VCF is supposed to be 1-based, so we add +1
+ int variant_size = alt_string.length() - 1;
+ string GT = "./.";
+ if (type == "HOM")
+ {
+ GT = "1/1";
+ }
+ if (type == "HET")
+ {
+ GT = "0/1";
+ }
+ fprintf(this->finder->_vcf_file, "%s\t%lli\tbkpt%i\t%s\t%s\t.\tPASS\tTYPE=INS;LEN=%i;FUZZY=%i\tGT\t%s\n",
+ chrom_name.c_str(),
+ position + 1, //switch to 1-based
+ bkt_id,
+ ref_string.c_str(),
+ alt_string.c_str(),
+ variant_size,
+ repeat_size,
+ GT.c_str());
+}
/*Getter*/
template<size_t span>
int FindBreakpoints<span>::node_in_branch(Node& kmer_node)
@@ -710,6 +761,12 @@ int FindBreakpoints<span>::snp_min_val()
return this->finder->_snp_min_val;
}
+template<size_t span>
+int FindBreakpoints<span>::branching_threshold()
+{
+ return this->finder->_branching_threshold;
+}
+
/*Kmer related object*/
template<size_t span>
typename FindBreakpoints<span>::KmerCanonical& FindBreakpoints<span>::kmer_begin()
@@ -870,7 +927,22 @@ int FindBreakpoints<span>::backup_iterate()
{
return this->finder->_nb_backup++;
}
+template <size_t span>
+int FindBreakpoints<span>::homo_clean_indel_iterate()
+{
+ return this->finder->_nb_homo_clean_indel++;
+}
+template <size_t span>
+int FindBreakpoints<span>::homo_fuzzy_indel_iterate()
+{
+ return this->finder->_nb_homo_fuzzy_indel++;
+}
+template <size_t span>
+int FindBreakpoints<span>::hetero_indel_iterate()
+{
+ return this->finder->_nb_hetero_indel++;
+}
/*Setter*/
template<size_t span>
void FindBreakpoints<span>::recent_hetero(int value)
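The writeIndel() function added in this file emits one VCF record per small insertion, shifting the 0-based position to 1-based and mapping the HOM/HET type to a genotype. The following is a minimal standalone sketch of that formatting step, not the project's code: the function name format_indel_vcf_line and the use of std::ostringstream instead of fprintf are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MindTheGap): builds the VCF line that
// writeIndel() prints, from a 0-based position and the REF/ALT strings.
std::string format_indel_vcf_line(int bkpt_id,
                                  const std::string& chrom,
                                  std::uint64_t pos0,        // 0-based position
                                  const std::string& ref,
                                  const std::string& alt,
                                  int repeat_size,
                                  const std::string& type)   // "HOM" or "HET"
{
    assert(!alt.empty()); // ALT holds the reference base plus the inserted bases

    // Genotype defaults to unknown, as in writeIndel()
    std::string gt = "./.";
    if (type == "HOM") gt = "1/1";
    else if (type == "HET") gt = "0/1";

    std::ostringstream out;
    out << chrom << '\t'
        << (pos0 + 1) << '\t'                  // VCF positions are 1-based
        << "bkpt" << bkpt_id << '\t'
        << ref << '\t'
        << alt << '\t'
        << ".\tPASS\t"
        << "TYPE=INS;LEN=" << (alt.size() - 1) // inserted length excludes the REF base
        << ";FUZZY=" << repeat_size << '\t'
        << "GT\t" << gt << '\n';
    return out.str();
}
```

Returning a string rather than writing to a FILE* keeps the sketch easy to check in isolation; the real function writes directly to the VCF file handle.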
=====================================
src/FindHeteroInsertion.hpp
=====================================
@@ -29,8 +29,10 @@ template<size_t span>
class FindHeteroInsertion : public IFindObserver<span>
{
public :
-
- /** \copydoc IFindObserver::IFindObserver
+ typedef typename gatb::core::kmer::impl::Kmer<span> Kmer;
+ typedef typename Kmer::ModelCanonical KmerModel;
+ typedef typename KmerModel::Iterator KmerIterator;
+ /** \copydoc IFindObserver::IFindObserver
*/
FindHeteroInsertion(FindBreakpoints<span> * find);
@@ -47,12 +49,24 @@ bool FindHeteroInsertion<span>::update()
{
if(!this->_find->homo_only())
{
+ // branching filter parameters
+ int branching_threshold = this->_find->branching_threshold(); //max number of branching kmers in the 100 bp window of previous kmers
+ int max_branching_kmers = branching_threshold;
+ bool filtering = true;
+ if (branching_threshold<0){
+ filtering = false;
+ max_branching_kmers = 100;
+ }
+ int filter_window_size = 100 ; //should not be larger than the size of het_kmer_history = 256
+
+
// hetero site detection
if(!this->_find->kmer_end_is_repeated() && this->_find->current_info().nb_in == 2 && !this->_find->recent_hetero())
{
//loop over putative repeat size (0=clean, >0 fuzzy), reports only the smallest repeat size found.
for(int i = 0; i <= this->_find->max_repeat(); i++)
{
+ bool found_base_one = false;
if(this->_find->het_kmer_history(this->_find->het_kmer_begin_index()+i).nb_out == 2 && !this->_find->het_kmer_history(this->_find->het_kmer_begin_index()+i).is_repeated)
{
//hetero breakpoint found
@@ -60,32 +74,102 @@ bool FindHeteroInsertion<span>::update()
//string kmer_end_str = this->_find->model().toString(this->_find->current_info().kmer);
//modif 15/06/2018 to check !!! (before in case of fuzzy>0, the end and right kmers overlapped, => insertion of wrong size (- fuzzy), missing the repeat + loss of recall if insertion of size < repeat)
string kmer_end_str = string(&(this->_find->chrom_seq()[this->_find->position() + i]), this->_find->kmer_size());
- if (!this->_find->model().codeSeed(&(this->_find->chrom_seq()[this->_find->position() +i]),Data::ASCII).isValid())
+ string ref = kmer_begin_str.substr(kmer_begin_str.size() - 1 - i, 1);
+
+ //Tests if this can be a small (1-2 bp) insertion
+ char nucleo[20][6] = {"A", "C", "G", "T", "AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT", "GA", "GC", "GG", "GT", "TA", "TC", "TG", "TT"};
+ KmerModel local_m(this->_find->kmer_size());
+ KmerIterator local_it(local_m);
+ std::string seq;
+ string inser_base_one;
+ if (!this->_find->model().codeSeed(&(this->_find->chrom_seq()[this->_find->position() +i]),Data::ASCII).isValid())
{
return false;
}
- this->_find->writeBreakpoint(this->_find->breakpoint_id(), this->_find->chrom_name(), this->_find->position()-1+i, kmer_begin_str, kmer_end_str,i, STR_HET_TYPE, this->_find->het_kmer_history(this->_find->het_kmer_begin_index()+i).is_repeated,this->_find->kmer_end_is_repeated() );
-
- this->_find->breakpoint_id_iterate();
-
- if(i==0)
+
+ for (int a = 0; a < 20; a++) // for all possible 1-2 bp insertions, perform a micro-assembly
{
- this->_find->hetero_clean_iterate();
+ seq = kmer_begin_str + nucleo[a] + kmer_end_str;
+ Data local_d(const_cast<char *>(seq.c_str()));
+ int sum_valid = 0;
+ // // Init this variable
+ local_d.setRef(const_cast<char *>(seq.c_str()), (size_t)seq.length());
+ local_it.setData(local_d);
+ for (local_it.first(); !local_it.isDone(); local_it.next())
+ {
+ if (this->contains(local_it->forward()))
+ {
+ sum_valid++;
+ }
+ else
+ {
+ break;
+ }
+ if (sum_valid == this->_find->kmer_size())
+ {
+ inser_base_one = ref + nucleo[a];
+ found_base_one = true;
+ }
+ }
+ if (found_base_one == true)
+ break;
}
- else
+ if (found_base_one)
{
- this->_find->hetero_fuzzy_iterate();
+ this->_find->writeIndel(this->_find->breakpoint_id(), this->_find->chrom_name(), this->_find->position() - 1, ref, inser_base_one, i, STR_HET_TYPE);
+ this->_find->hetero_indel_iterate();
+ this->_find->breakpoint_id_iterate();
+ return true;
}
-
- this->_find->recent_hetero(this->_find->max_repeat()); // we found a breakpoint, the next hetero one mus be at least _max_repeat apart from this one.
- return true; //reports only the smallest repeat size found.
+ else
+ {
+
+ //this may be a large insertion
+
+ int nb_branching = 0;
+ //Applying the branching-filter :
+ if (filtering){
+ //counts the number of branching-kmers among the 100 previous ones
+ int nb_prev = 0;
+ unsigned char begin_index = this->_find->het_kmer_begin_index()-1;
+ while ((nb_branching <= max_branching_kmers) && (nb_prev<filter_window_size)){
+ //cout << "in loop" << nb_prev << " " << begin_index-nb_prev << endl;
+ if(this->_find->het_kmer_history(begin_index-nb_prev).nb_out >1 || this->_find->het_kmer_history(begin_index-nb_prev).nb_in >1 ){
+ nb_branching ++;
+ }
+ nb_prev++;
+ }
+ }
+
+ if(nb_branching <= max_branching_kmers){
+ this->_find->writeBreakpoint(this->_find->breakpoint_id(), this->_find->chrom_name(), this->_find->position() - 1 + i, kmer_begin_str, kmer_end_str, i, STR_HET_TYPE, this->_find->het_kmer_history(this->_find->het_kmer_begin_index() + i).is_repeated, this->_find->kmer_end_is_repeated());
+
+ this->_find->breakpoint_id_iterate();
+
+ if (i == 0)
+ {
+ this->_find->hetero_clean_iterate();
+ }
+ else
+ {
+ this->_find->hetero_fuzzy_iterate();
+ }
+
+ this->_find->recent_hetero(this->_find->max_repeat()); // we found a breakpoint, the next hetero one must be at least _max_repeat apart from this one.
+ return true; //reports only the smallest repeat size found.
+ }
+ else{ // stop the loop over fuzzy sizes: the branching context stays unfavourable for the other fuzzy sizes
+ this->_find->recent_hetero(max(0, this->_find->recent_hetero() - 1)); // when recent_hetero=0 : we are sufficiently far from the previous hetero-site
+ return false;
+ }
+ }
}
}
}
-
- this->_find->recent_hetero(max(0, this->_find->recent_hetero() - 1)); // when recent_hetero=0 : we are sufficiently far from the previous hetero-site
+
+ this->_find->recent_hetero(max(0, this->_find->recent_hetero() - 1)); // when recent_hetero=0 : we are sufficiently far from the previous hetero-site
}
-
+
return false;
}
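The branching filter added to FindHeteroInsertion::update() counts how many of the 100 k-mers preceding a heterozygous site are branching (more than one in- or out-neighbour) and rejects the site when the count exceeds the threshold. Below is a minimal sketch of that counting over a 256-entry circular history. The KmerInfo struct and the passes_branching_filter name are simplified, hypothetical stand-ins for the info_type history held by FindBreakpoints.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified version of the per-kmer info stored in the history.
struct KmerInfo
{
    int nb_in;   // number of in-neighbours in the graph
    int nb_out;  // number of out-neighbours in the graph
};

// Returns true when at most max_branching of the `window` entries preceding
// begin_index are branching. A negative max_branching disables the filter.
bool passes_branching_filter(const std::array<KmerInfo, 256>& history,
                             std::uint8_t begin_index,
                             int max_branching,
                             int window = 100)
{
    assert(window >= 0 && window <= 256); // the history holds only 256 entries
    if (max_branching < 0) return true;   // -1 means "no filter"

    int nb_branching = 0;
    for (int prev = 0; prev < window; ++prev)
    {
        // The 8-bit cast makes the index wrap modulo 256 (circular buffer)
        const std::uint8_t idx = static_cast<std::uint8_t>(begin_index - 1 - prev);
        const KmerInfo& info = history[idx];
        if (info.nb_in > 1 || info.nb_out > 1)
        {
            ++nb_branching;
            if (nb_branching > max_branching) return false; // early exit
        }
    }
    return true;
}
```

The early exit mirrors the `nb_branching <= max_branching` loop condition in the patch: once the threshold is exceeded, scanning the rest of the window cannot change the outcome.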
=====================================
src/FindSNP.hpp
=====================================
@@ -146,9 +146,7 @@ bool FindSNP<span>::snp_at_end(unsigned char* beginpos, size_t limit, KmerType*
nuc[un] = 0;
nuc[deux] = 0;
nuc[trois] = 0;
-
- unsigned char endpos = (*beginpos + limit) % 256;
-
+
unsigned char beginpos_init = (*beginpos);
//this->remove_nuc(nuc, *beginpos);
*ref_nuc = this->_find->het_kmer_history(*beginpos).kmer & 3; // obtain the reference nuc
=====================================
src/FindSmallInsertion.hpp
=====================================
@@ -0,0 +1,216 @@
+/*****************************************************************************
+* MindTheGap: Integrated detection and assembly of insertion variants
+* A tool from the GATB (Genome Assembly Tool Box)
+* Copyright (C) 2022 INRIA
+* Authors: C. Lemaitre, G. Rizk, P. Marijon, W. Delage
+*
+* This program is free software: you can redistribute it and/or modify
+* it under the terms of the GNU Affero General Public License as
+* published by the Free Software Foundation, either version 3 of the
+* License, or (at your option) any later version.
+*
+* This program is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+* GNU Affero General Public License for more details.
+*
+* You should have received a copy of the GNU Affero General Public License
+* along with this program. If not, see <http://www.gnu.org/licenses/>.
+*****************************************************************************/
+
+#ifndef FINDSMALLINSERTION_HPP_
+#define FINDSMALLINSERTION_HPP_
+//**********************************
+#include <IFindObserver.hpp>
+#include <FindBreakpoints.hpp>
+
+
+template<size_t span>
+class FindSmallCleanInsertion : public IFindObserver<span>
+{
+public :
+
+ typedef typename gatb::core::kmer::impl::Kmer<span> Kmer;
+
+ typedef typename Kmer::ModelCanonical KmerModel;
+ typedef typename KmerModel::Iterator KmerIterator;
+
+public:
+
+ /** \copydoc IFindObserver<span>
+ */
+ /** \copydoc IFindObserver::IFindObserver
+ */
+ FindSmallCleanInsertion(FindBreakpoints<span> * find);
+
+
+ /** \copydoc IFindObserver::IFindObserver
+ */
+ bool update();
+};
+
+template<size_t span>
+FindSmallCleanInsertion<span>::FindSmallCleanInsertion(FindBreakpoints<span> * find) : IFindObserver<span>(find){}
+
+template<size_t span>
+bool FindSmallCleanInsertion<span>::update()
+{
+
+ if((this->_find->kmer_begin().isValid() && this->_find->kmer_end().isValid()) == false)
+ {
+ return false;
+ }
+
+ if(this->_find->gap_stretch_size() == (this->_find->kmer_size()-1)) //Check size of gap
+ {
+ // obtains the kmer sequence
+ string kmer_begin_str = this->_find->model().toString(this->_find->kmer_begin().forward());
+ string kmer_end_str = this->_find->model().toString(this->_find->kmer_end().forward());
+ string ref = kmer_begin_str.substr(kmer_begin_str.size()-1,1);
+
+ //All possible insertions of size 1 and 2
+ char nucleo[20][6] = {"A","C","G","T","AA","AC","AG","AT","CA","CC","CG","CT","GA","GC","GG","GT","TA","TC","TG","TT"};
+
+ KmerModel local_m(this->_find->kmer_size());
+ KmerIterator local_it(local_m);
+ std::string seq;
+ string inser_base_one;
+ bool found_base_one=false;
+
+ //Test all possible insertions, by performing a micro-guided-assembly, ie. checks if all kmers of the insertion are present in the graph
+ for (int i=0; i<20; i++)
+ {
+ seq = kmer_begin_str+ nucleo[i] + kmer_end_str;
+ Data local_d(const_cast<char*>(seq.c_str()));
+ int sum_valid=0;
+ // Init this variable
+ local_d.setRef(const_cast<char*>(seq.c_str()), (size_t)seq.length());
+ local_it.setData(local_d);
+ for(local_it.first(); !local_it.isDone(); local_it.next())
+ {
+ if(this->contains(local_it->forward()))
+ {
+ sum_valid++;
+ }
+ else
+ {
+ break;
+ }
+ if (sum_valid==this->_find->kmer_size())
+ {
+ inser_base_one=ref+nucleo[i];
+ found_base_one=true;
+ }
+ }
+ if (found_base_one==true) break;
+ }
+ if (!found_base_one) return false;
+
+ this->_find->writeIndel(this->_find->breakpoint_id(),this->_find->chrom_name(),this->_find->position()-2, ref, inser_base_one, 0, STR_HOM_TYPE);
+ this->_find->homo_clean_indel_iterate();
+ this->_find->breakpoint_id_iterate();
+
+ return true;
+ }
+ return false;
+}
+
+///*
+template<size_t span>
+class FindSmallFuzzyInsertion : public IFindObserver<span>
+{
+public :
+
+ typedef typename gatb::core::kmer::impl::Kmer<span> Kmer;
+
+ typedef typename Kmer::ModelCanonical KmerModel;
+ typedef typename KmerModel::Iterator KmerIterator;
+
+public:
+
+ /** \copydoc IFindObserver<span>
+ */
+ /** \copydoc IFindObserver::IFindObserver
+ */
+ FindSmallFuzzyInsertion(FindBreakpoints<span> * find);
+
+
+ /** \copydoc IFindObserver::IFindObserver
+ */
+ bool update();
+};
+
+template<size_t span>
+FindSmallFuzzyInsertion<span>::FindSmallFuzzyInsertion(FindBreakpoints<span> * find):IFindObserver<span>(find){}
+
+template<size_t span>
+bool FindSmallFuzzyInsertion<span>::update()
+{
+ if((this->_find->kmer_begin().isValid() && this->_find->kmer_end().isValid()) == false)
+ {
+ return false;
+ }
+
+ if(this->_find->gap_stretch_size() < this->_find->kmer_size() - 1 && this->_find->gap_stretch_size() >= this->_find->kmer_size() - 1 - this->_find->max_repeat())
+ {
+ int repeat_size = this->_find->kmer_size() - 1 - this->_find->gap_stretch_size();
+ // obtains the kmer sequence
+ string kmer_begin_str = this->_find->model().toString(this->_find->kmer_begin().forward());
+ string kmer_end_str = string(&(this->_find->chrom_seq()[this->_find->position() - 1 + repeat_size]), this->_find->kmer_size());
+ if ((this->nb_out_branch(this->_find->kmer_begin().forward())==0) || (this->nb_in_branch(this->_find->kmer_end().forward())==0) || (!this->_find->model().codeSeed(&(this->_find->chrom_seq()[this->_find->position() - 1 + repeat_size]),Data::ASCII).isValid()))
+ {
+ return false;
+ }
+ else
+ {
+ string ref = kmer_begin_str.substr(kmer_begin_str.size()-1-repeat_size,1);
+
+ //All possible insertions of size 1 and 2
+ char nucleo[20][6] = {"A","C","G","T","AA","AC","AG","AT","CA","CC","CG","CT","GA","GC","GG","GT","TA","TC","TG","TT"};
+ KmerModel local_m(this->_find->kmer_size());
+ KmerIterator local_it(local_m);
+ std::string seq;
+ string inser_base_one;
+ bool found_base_one=false;
+ //std::list<char> fourth (nucleo, nucleo + sizeof(nucleo) / sizeof(char) );
+ for (int i=0; i<20; i++)
+ {
+ seq = kmer_begin_str+ nucleo[i] + kmer_end_str;
+ //std::cout << seq << endl;
+ Data local_d(const_cast<char*>(seq.c_str()));
+ int sum_valid=0;
+ // // Init this variable
+ local_d.setRef(const_cast<char*>(seq.c_str()), (size_t)seq.length());
+ local_it.setData(local_d);
+ for(local_it.first(); !local_it.isDone(); local_it.next())
+ {
+ if(this->contains(local_it->forward()))
+ {
+ sum_valid++;
+ }
+ else
+ {
+ break;
+ }
+ if (sum_valid==this->_find->kmer_size())
+ {
+ inser_base_one=ref+nucleo[i];
+ found_base_one=true;
+ }
+ }
+ if (found_base_one==true) break;
+ }
+ if (!found_base_one) return false;
+ this->_find->writeIndel(this->_find->breakpoint_id(),this->_find->chrom_name(),this->_find->position()- 2, ref, inser_base_one, repeat_size, STR_HOM_TYPE);
+ this->_find->homo_fuzzy_indel_iterate();
+ this->_find->breakpoint_id_iterate();
+
+ return true;
+ }
+ }
+ return false;
+ }
+
+
+#endif // FINDSMALLINSERTION_HPP_
+
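Both observers in FindSmallInsertion.hpp test each of the 20 possible 1-2 bp insertions with a "micro-assembly": the candidate is placed between the left and right k-mers and the resulting k-mers are looked up in the graph. The sketch below illustrates the idea under stated assumptions. A std::unordered_set of forward k-mer strings stands in for the GATB graph, canonical (reverse-complement) k-mers are ignored, and the names KmerSet, all_kmers_present and find_small_insertion are hypothetical. It is also slightly stricter than the patch, which accepts a candidate as soon as kmer_size consecutive k-mers have been found rather than requiring every k-mer of the sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the de Bruijn graph: a set of forward k-mers.
using KmerSet = std::unordered_set<std::string>;

// Returns true if every k-mer of seq is present in the set.
bool all_kmers_present(const std::string& seq, std::size_t k, const KmerSet& kmers)
{
    if (k == 0 || seq.size() < k) return false;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
    {
        if (kmers.count(seq.substr(i, k)) == 0) return false;
    }
    return true;
}

// Tries the 4 single-base and 16 two-base insertions, in the same order as
// the nucleo[] table of the patch, between the left and right k-mers.
// Returns the first supported inserted sequence, or "" if none is supported.
std::string find_small_insertion(const std::string& left,
                                 const std::string& right,
                                 std::size_t k,
                                 const KmerSet& kmers)
{
    assert(left.size() == k && right.size() == k);

    static const std::vector<std::string> candidates = {
        "A", "C", "G", "T",
        "AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT",
        "GA", "GC", "GG", "GT", "TA", "TC", "TG", "TT"};

    for (const std::string& ins : candidates)
    {
        if (all_kmers_present(left + ins + right, k, kmers)) return ins;
    }
    return "";
}
```

As in the patch, the search stops at the first candidate that is supported, so a 1 bp insertion is preferred over a 2 bp insertion when both would fit.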
=====================================
src/Finder.cpp
=====================================
@@ -27,6 +27,7 @@
#include <FindInsertion.hpp>
#include <FindSNP.hpp>
#include <limits> //for std::numeric_limits
+#include <FindSmallInsertion.hpp>
//#define PRINT_DEBUG
/********************************************************************************/
@@ -63,6 +64,7 @@ Finder::Finder () : Tool ("MindTheGap find")
_max_repeat = 0;
_het_max_occ = 1;
_snp_min_val = 5;
+ _branching_threshold = 5;
_nbCores = 0;
_breakpoint_file_name = "";
_vcf_file_name = "";
@@ -75,14 +77,17 @@ Finder::Finder () : Tool ("MindTheGap find")
_nb_solo_snp = 0;
_nb_multi_snp = 0;
_nb_backup = 0;
-
+ _nb_homo_clean_indel = 0;
+ _nb_homo_fuzzy_indel = 0;
+ _nb_hetero_indel = 0;
_homo_only = false;
_homo_insert = true;
_hete_insert = true;
_snp = true;
_backup = false;
_deletion = true;
-
+ _small_homo = true;
+
_bed_file_name="";
setHelp(&HelpFinder);
@@ -112,6 +117,7 @@ Finder::Finder () : Tool ("MindTheGap find")
finderParser->push_front (new OptionNoParam (STR_INSERT_ONLY, "search only insertion breakpoints (do not report other variants)", false));
//finderParser->getParser(STR_INSERT_ONLY)->setVisible(false);
finderParser->push_front (new OptionOneParam (STR_HET_MAX_OCC, "maximal number of occurrences of a kmer in the reference genome allowed for heterozyguous breakpoints", false,"1"));
+ finderParser->push_front (new OptionOneParam (STR_BRANCHING_FILTER, "branching filter parameter for heterozygous insertions, maximal number of branching kmers in a 100-bp window before a heterozygous site (if -1 = no filter)", false,"15"));
//allow to find heterozyguous breakpoints in n-repeated regions of the reference genome
finderParser->push_front (new OptionOneParam (STR_MAX_REPEAT, "maximal repeat size detected for fuzzy sites", false, "5"));
finderParser->push_front (new OptionNoParam (STR_HOMO_ONLY, "search only homozygous breakpoints", false));
@@ -305,7 +311,8 @@ void Finder::execute ()
_max_repeat = getInput()->getInt(STR_MAX_REPEAT);
_het_max_occ=getInput()->getInt(STR_HET_MAX_OCC);
_snp_min_val=getInput()->getInt(STR_SNP_MIN_VAL);
-
+ _branching_threshold = getInput()->getInt(STR_BRANCHING_FILTER);
+
if(_het_max_occ<1){
_het_max_occ=1;
}
@@ -318,6 +325,7 @@ void Finder::execute ()
_snp = true;
_backup = false;
_deletion = true;
+ _small_homo = true;
}
if(getInput()->get(STR_INSERT_ONLY) != 0)
@@ -328,6 +336,7 @@ void Finder::execute ()
_snp = false;
_backup = false;
_deletion = false;
+ _small_homo = true;
}
if(getInput()->get(STR_SNP_ONLY) != 0)
@@ -338,6 +347,7 @@ void Finder::execute ()
_snp = true;
_backup = false;
_deletion = false;
+ _small_homo = true;
}
if(getInput()->get(STR_DELETION_ONLY) != 0)
@@ -348,6 +358,7 @@ void Finder::execute ()
_snp = false;
_backup = false;
_deletion = true;
+ _small_homo = true;
}
if(getInput()->get(STR_HETERO_ONLY) != 0)
@@ -358,6 +369,7 @@ void Finder::execute ()
_snp = false;
_backup = false;
_deletion = false;
+ _small_homo = true;
}
if(getInput()->get(STR_WITH_BACKUP) != 0)
@@ -419,6 +431,10 @@ void Finder::resumeParameters(){
getInfo()->add(2,"Graph",getInput()->getStr(STR_URI_GRAPH).c_str());
}
getInfo()->add(2,"Reference",getInput()->getStr(STR_URI_REF).c_str());
+ if(getInput()->get(STR_BED) != 0)
+ {
+ getInfo()->add(2,"Bed file",_bed_file_name.c_str());
+ }
getInfo()->add(1,"Graph");
getInfo()->add(2,"kmer-size","%i", _kmerSize);
@@ -456,6 +472,7 @@ void Finder::resumeParameters(){
getInfo()->add(1,"Breakpoint detection options");
getInfo()->add(2,"max_repeat","%i", _max_repeat);
getInfo()->add(2,"hetero_max_occ","%i", _het_max_occ);
+ getInfo()->add(2,"branching filter value", "%i", _branching_threshold);
getInfo()->add(2,"homo_insertions","%s", _homo_insert ? "yes" : "no");
getInfo()->add(2,"hete_insertions","%s", _hete_insert ? "yes" : "no");
getInfo()->add(2,"snp","%s", _snp ? "yes" : "no");
@@ -474,6 +491,8 @@ void Finder::resumeResults(double seconds){
getInfo()->add(3,"fuzzy","%i", _nb_hetero_fuzzy);
getInfo()->add(1,"Other variants");
getInfo()->add(2,"deletions","%i", _nb_clean_deletion+_nb_fuzzy_deletion);
+ getInfo()->add(2, "Homozygous insertions 1-2 bp size", "%i", _nb_homo_clean_indel + _nb_homo_fuzzy_indel);
+ getInfo()->add(2, "Heterozygous insertions 1-2 bp size", "%i", _nb_hetero_indel);
//getInfo()->add(3,"clean", "%i", _nb_clean_deletion);
//getInfo()->add(3,"fuzzy", "%i", _nb_fuzzy_deletion);
getInfo()->add(2,"SNPs","%i", _nb_solo_snp+_nb_multi_snp);
@@ -540,8 +559,12 @@ void Finder::runFindBreakpoints<span>::operator () (Finder* object)
{
findBreakpoints.addGapObserver(new FindDeletion<span>(&findBreakpoints));
}
-
- if(object->_homo_insert)
+ if (object->_small_homo)
+ {
+ findBreakpoints.addGapObserver(new FindSmallCleanInsertion<span>(&findBreakpoints));
+ findBreakpoints.addGapObserver(new FindSmallFuzzyInsertion<span>(&findBreakpoints));
+ }
+ if(object->_homo_insert)
{
findBreakpoints.addGapObserver(new FindCleanInsertion<span>(&findBreakpoints));
findBreakpoints.addGapObserver(new FindFuzzyInsertion<span>(&findBreakpoints));
=====================================
src/Finder.hpp
=====================================
@@ -31,6 +31,7 @@ static const char* STR_URI_REF = "-ref";
static const char* STR_MAX_REPEAT = "-max-rep";;
static const char* STR_HET_MAX_OCC = "-het-max-occ";
static const char* STR_SNP_MIN_VAL = "-snp-min-val";
+static const char* STR_BRANCHING_FILTER = "-branching-filter";
static const char* STR_HOMO_ONLY = "-homo-only";
static const char* STR_INSERT_ONLY = "-insert-only";
@@ -65,10 +66,12 @@ public:
const char* _mtg_version;
size_t _kmerSize;
Graph _graph;
- //Graph _ref_graph; // no longer used
+
+ //parameters
int _max_repeat;
int _het_max_occ;
int _snp_min_val;
+ int _branching_threshold;
int _nbCores;
bool _homo_only;
bool _homo_insert;
@@ -76,6 +79,10 @@ public:
bool _snp;
bool _backup;
bool _deletion;
+ bool _small_homo;
+ bool _small_hetero;
+
+ //input/output files
IBank* _refBank;
string _breakpoint_file_name;
FILE * _breakpoint_file;
@@ -84,6 +91,7 @@ public:
string _bed_file_name;
+ //results statistics
int _nb_homo_clean;
int _nb_homo_fuzzy;
int _nb_hetero_clean;
@@ -93,7 +101,9 @@ public:
int _nb_solo_snp;
int _nb_multi_snp;
int _nb_backup;
-
+ int _nb_homo_clean_indel;
+ int _nb_homo_fuzzy_indel;
+ int _nb_hetero_indel;
// Actual job done by the tool is here
void execute ();
=====================================
src/main.cpp
=====================================
@@ -26,7 +26,7 @@
using namespace std;
-static const char* MTG_VERSION = "2.2.3";
+static const char* MTG_VERSION = "2.3.0";
static const char* STR_FIND = "find";
static const char* STR_FILL = "fill";
=====================================
test/full_test/README
=====================================
@@ -121,5 +121,46 @@ cp reference.fasta ../../data/reference.fasta
# 7. Create Gold files for automated tests
-../../build/bin/MindTheGap find -in ../../data/reads_r1.fastq,../../data/reads_r2.fastq -ref ../../data/reference.fasta -out gold > gold_find.output
-../../build/bin/MindTheGap fill -graph gold.h5 -bkpt gold.breakpoints -out gold > gold_fill.output
\ No newline at end of file
+../../build/bin/MindTheGap find -in ../../data/reads_r1.fastq,../../data/reads_r2.fastq -ref ../../data/reference.fasta -out gold -nb-cores 1 > gold_find.output
+../../build/bin/MindTheGap fill -graph gold.h5 -bkpt gold.breakpoints -out gold -nb-cores 1 > gold_fill.output
+
+
+# 08/02/2022: added small 1-2 bp insertions
+
+Added Seq5 and Seq6 from: Projets/mindTheGap/test-small-indels (first 990 nt of chr1 and chr2, 6/10 HOM, 4/10 HET only in allele 1)
+
+# changed the read simulation parameters: higher coverage (and lower error rate)
+~/Bin/samtools-0.1.18/misc/wgsim -e 0.001 -d 200 -s 20 -N 1000 -1 100 -2 100 -r 0 -R 0 allele1.fasta allele1_r1.fq allele1_r2.fq
+~/Bin/samtools-0.1.18/misc/wgsim -e 0.001 -d 200 -s 20 -N 1000 -1 100 -2 100 -r 0 -R 0 allele2.fasta allele2_r1.fq allele2_r2.fq
+
+cat allele1_r1.fq allele2_r1.fq > reads_r1.fastq
+cat allele1_r2.fq allele2_r2.fq > reads_r2.fastq
+
+WARNING: the results changed: othervariants.vcf loses 2 SNPs (Seq1 206, 219), gains 1 deletion (Seq0 297) but loses another one (Seq1 740); the results for large insertions are unchanged
+
+NOTE: 2 small insertions of size 2 are missed (Seq6: pos 500 and 900)
+
+# List of all mutations :
+Seq0 101
+Seq0 123
+Seq0 816
+Seq1 206
+Seq1 219
+Seq1 342
+Seq1 740
+Seq2 320
+Seq2 344
+Seq2 379
+Seq2 535
+Seq2 834
+Seq3 256
+Seq3 511
+Seq3 766
+Seq3 781
+Seq4 257
+Seq4 349
+Seq4 512
+Seq4 600
+Seq4 841
+Seq4 884
+Seq4 821
\ No newline at end of file
=====================================
test/full_test/allele1.fasta
=====================================
@@ -8,3 +8,7 @@ TGCTGCCGATCGCTACGACGTCCTACCTTACACACAACGGGCCGCGTTCATACCCACGTATGAAGACATGCGGTTATCCG
TATTGCGCCCTTCAAGAAGCTTCTGCTGACCGTAGGCGTCTCGGCGGTTTGTACTTTGAAAAATTAGCTGCACTACATCCGATGGGTATCCCTCCTCAATCTCAGCAGACCCGGAAAGCGATAGAATCAGCCACGCGGTCGTCCGGGCTAGGGGCCCTGCGCAAGGAAGGTTGGACAGGGCTAGACCCGGAAGCATCGGCTTTTCCTAAATGGTGACGGAGTTATATAGGGTAAGCCTGATAGCGCGGTAGGTGTTATGGCCATCCCCTCGCCTAGCGTGCGCGCAGACAAGTCCAGTCCCGGAGGAGGCATAGGCCTCATTATCATTTCCCTAGAATCGCTCTTGACATCTAGGTTGTACTAGGGACCAGGCGCCCAAAGCGGACGGTTCTCCGTGCTTTCGTGCCGTTTCAGCGTAAGATGCTATTTTTTGGGGAAATGGTCGGCGTGTGCGGGGGAGAACCACGGTACCAACTACGATAAGTCCGTCGTGTAACTTACGTGAAGGTGATGTGAAGCAGGAATCCGTGCCAAAATGTCCGTGCGATATCCAACTTTCATAGTATTACACGAGAGCCTATGATTTGCCCAGGCGCGACCCGTGAATCGAGGTAATCGCCGACCAGATATTGCGAAACACCACATTACATGACTACTGTCCGCTTGAAGAGTTATATACTTGACAGTCCTGGTTGACGGCACAGCATATCTCCAATGTGTGGTTTAAAGTCTCACGTTCTTCATGCGCGCCGGCCCATGGGAACAAGTATCCTTACTTTCGTTTGCAGCACTAGCCGTTCCTTGACATCTGCGGCCAACTTGTGCCTGAACCTGGAGTTTCGACAGCGTGGCGCTCTGGCCTAGTTCTTCGCTGGCACCTGGAAGAGCCGCCGTACAAATGAGGCTCCAAAATAGCACGCTTGCAGCAGTCAAGTTGAACGCCTTAAAAGGCACCGCCGCTCGTTCATTGGGATTCCTTGAGAATCGTGACTTGTTACACTATAAGATCATGGATTGGACAAAATAGGCCAACTCCCGCACGCTGTGGCTATTCTTAAGTTGCATAGGTGGGAGTAGCCTTATACTCGATTTCTAAAAAGAGTAGGTGAGC
>Seq4
TTCCGGCGCCGCACTAATTGAAGTGGTGAGCTGACCAGTCGTTCAGGATCCGAAGGCGGGGATGGCGCTATAGGAGCCGGCAGGTATGCTTTGCCGCAAAATTTCGGGGTGGTGGAACCGTCTTACCGAAAGTTAGCTACAGCCTGGAATGTGAAATTCCATGACCTGCCCGTCCTGTGTCCACAGGGCGACATTTGCCACGTAGGTAGGGCGACCATTAGAATGCTGCATTATCGGGCGATAAAAAGTTTTATACTCAAGAATCCTACAAAGATGAAAATTTCGAAGAGCTGCACGCAGTTGTAAGTTGCTTTTCTGGGGTAATCGAGATTCTCCACCATAACCTGCGCAATGCATCGTGAAGCTTTACCGCGCCCAAGGGGAGCGTCTCAGTGGGGTTGCCTCCAGGGATATATTGAAAGTTGAAGAAGAAGATCACAGGTTAAGCGGTATGTTAAGTTAGAACTCACGGGGAGCCGCCTTGATTTTGTTCGACATGAACCAGAGACCAGGTGTGTTATGTTCTGGAACCTTAATACGTACGTCGCCAGCACCGAGCCGGCACTCCATCTCTTTTGGGTGCGCAACATTGCTATACTTAGGTGTATTCCTGGGTTGAGTGGCAGGTTTCTCTTAATTCTTCCCTAAGTAGCTCCGAGGATCCATTGACATCTGTCAGCCGTCTTTCCAGAACGTTATAAGACTCGTGAGGAAATTATACAAATCGTTGCCATCATCCAAAGCAAAGTACTTCCGCTTAGGAGTGCCTTGAAGAACCGATTATCTCTGACAATGTAATGCCACAGCACCCTCGACAAAGTTCTACATTCGTTCCAGGTCATGATACAGCGCGCTAAATTACCGCTACGAGCCATACCGGATGGCGGCCGGAGAGCGCTGCAATCGCATGGCTCGGGACCGAACATTGAGACCTGGCTAGTAGGTAGGTGTCAAATCGATATCCACACCTGTCGAAGCAGCTAAAGATCGGTTGCGGCGGGAGTCCTCCATTCAGGCCAAACGTGCAGTGCTCGATGTGCTTCCTATCGCTCT
+>Seq5
+GATGTTTAGAAGTTTCCAGGTCACGCCAATGATTGGCATTTACACACGTGGATCAGCGGACATATCTAACCCTTAGTGTTCTTAAGAGCAACTCACTACTCCATTTCCACTAACCCCGCCGGCGGTAATTCCAATCTAGTTGATCAGACTTCCCAGTCAATGAAAGCGACACCGTGCGTCTGTAATACCAACAAGACCCTGGCTGTCGTCCCGCAGAGGACGCGGCACCTCCGGATTTTGAGTCCAGTCTGAACGATTTTCGATCACTCACCATGGATCTGGAAAACGGAGTCGAGTACTCACGAGCCAAATTGATGCATTTCCAATGACCCGATGCAGGTGCGACCGATCTTCGCCTATGCTTCCCGCCGTAATTATTGAGTCTGGGTCCCGGCCGCTAACGTTTGACTCACGGGGAGGTACCCGTGCGTATTCTTCTCAAAGTGACGCTGGACAGCAGCGCATGTCCGAGCCCCATCGTCCTATCTGGTGTAGAGTCTTACCTCTAATTAGAGTGATCGAACCAGTAGGTGTCGCGGTCTTAGGGCTCCCATTGTCCAAGGGAACGTGAACAGATATGAATCTGGGAGAATAGTGCAGCGTTGACCCTTCTGGTCGGTCAGCCCTTGCCTACGGCCCGTATGCGGAGAATGAAGGCGTGAAACATTCTGCTCTTTTAGAAGCAGCGGCTGCACCCGTATAACAACTCGCACGATCGTACGTCTCATTTGCCGCGTTGGCGCGCCCGTGGATGATGGACCACGGTATGAACCTCTGCACTTCAAATTTGACGCAATCCTGCACTCACCCGCACACAGTTCTAGTCTAACCGTCGCAGTGTCTGCTTTAAGGTAGAGATCGATACTTAGGATATGTTCATGTGTGTTTGTAGCGCTGGACCCTCTTATGGGTGTGGTCACTTGTGATGGATCGAGGAACTTAGGCGGTTAACTTGTTTCGACGTCTCACCGACAATATCAGGATTTAGTATCG
+>Seq6
+ACCGAAAATGACAATGTTCACACGCATGCTCGGCGTGGAAAAGAGCCTTTTCTAAGACCGACTCGTTCCGGGCAGCAGGATTATTAGCCAATCAAAATTATATCGACCGGTCATCAAGCTGCGATAGTGCAGGCGCATGCCGTCCAATGGGTCCACGGCGGAAGTGCGTTCGTCTACTCTGTCAAATCTTAACATTTTTTGAGCGGCTAATCCGGCCGGTAGTGTACCGTGAACCAAAGTCCTTCTACGAGCGTATTAGATTGCTCAAAAGATCCGGGAGAATTGACCAGGTCGTATCTTTAAAATAACGCTGGTGCGAGCAGCTGCTGTTTTATCAACACCCATTTAGTCCTGTGAAGTTTGCTTAGCAGATACACCTTCCCGCGTGGTATGAGAGGCTGTTCTTCATTAAAAACTATGAGGCTCTGGCACCTTCGACGCTAACAAAGTCCCCACGGACCATGATACCCTTACGCAACTCTCTTTGCACGCTAGGGCGAGAGTACTGTCCCCCTAGACTAGGTACACGCCGGGTAAACTCTCTCGCACACCTTTACGCTCGACTACAGGCTTCTAACCCTTCCGAACGCATATAATTCAAATGGCACTTCAAGTAACAGACGAATCACGGCTCACAGGCAGAATTCACTGGAGTAAAAGGATTCAGAACAATAGATAGTGTGTTAACTTTACAGTCATCCGTATTATAACGTGTAGCGAGAGGATTGAGTTCTTGTTAGGAAGGAAGGTCCTATAGACGAGTGCGGTAGCGCACCCGGTCGCCTTGCGTAGTCATGCCCGACGTGTTGATGGTGGTCCCTTTTAGCCGCCACACAAGGGATCCGAGGGTGAGAGACACATGGCCCTCACCGACGAGACTTACTCAGCCTGCCTCGCTATTGCCCTCTTTTTGATCACGTCCCTTTGTGGCTCTCGAGGACTCGTGCAGCGTGTATCTGGGGATTTGTAAGCTTAAGACTACCTTCCATAGGA
=====================================
test/full_test/allele2.fasta
=====================================
@@ -8,3 +8,7 @@ TGCTGCCGATCGCTACGACGTCCTACCTTACACACAACGGGCCGCGTTCATACCCACGTATGAAGACATGCGGTTATCCG
TATTGCGCCCTTCAAGAAGCTTCTGCTGACCGTAGGCGTCTCGGCGGTTTGTACTTTGAAAAATTAGCTGCACTACATCCGATGGGTATCCCTCCTCAATCTCAGCAGACCCGGAAAGCGATAGAATCAGCCACGCGGTCGTCCGGGCTAGGGGCCCTGCGCAAGGAAGGTTGGACAGGGCTAGACCCGGAAGCATCGGCTTTTCCTAAATGGTGACGGAGTTATATAGGGTAAGCCTGATAGCGCGGTAGGTGTTATGGCCATCCCCTCGCCTAGCGTGCGCGCAGACAAGTCCAGTCCCGGAGGAGGCATAGGCCTCATTATCATTTCCCTAGAATCGCTCTTGACATCTAGGTTGTACTAGGGACCAGGCGCCCAAAGCGGACGGTTCTCCGTGCTTTCGTGCCGTTTCAGCGTAAGATGCTATTTTTTGGGGAAATGGTCGGCGTGTGCGGGGGAGAACCACGGTACCAACTACGATAAGTCCGTCGTGTAACTTACGTGAAGGTGATGTGAAGCAGGAATCCGTGCCAAAATGTCCGTGCGATATCCAACTTTCATAGTATTACACGAGAGCCTATGATTTGCCCAGGCGCGACCCGTGAATCGAGGTAATCGCCGACCAGATATTGCGAAACACCACATTACATGACTACTGTCCGCTTGAAGAGTTATATACTTGACAGTCCTGGTTGACGGCACAGCATATCTCCAATGTGTGGTTTAAAGTCTCACGTTCTTCATGCGCGCCGGCCCATGGGAACAAGTATCCTTACTTTCGTTTGCAGCACTAGCCGTTCCTTGACATCTGCGGCCAACTTGTGCCTGAACCTGGAGTTTCGACAGCGTGGCGCTCTGGCCTAGTTCTTCGCTGGCACCTGGAAGAGCCGCCGTACAAATGAGGCTCCAAAATAGCACGCTTGCAGCAGTCAAGTTGAACGCCTTAAAAGGCACCGCCGCTCGTTCATTGGGATTCCTTGAGAATCGTGACTTGTTACACTATAAGATCATGGATTGGACAAAATAGGCCAACTCCCGCACGCTGTGGCTATTCTTAAGTTGCATAGGTGGGAGTAGCCTTATACTCGATTTCTAAAAAGAGTAGGTGAGC
>Seq4
TTCCGGCGCCGCACTAATTGAAGTGGTGAGCTGACCAGTCGTTCAGGATCCGAAGGCGGGGATGGCGCTATAGGAGCCGGCAGGTATGCTTTGCCGCAAAATTTCGGGGTGGTGGAACCGTCTTACCGAAAGTTAGCTACAGCCTGGAATGTGAAATTCCATGACCTGCCCGTCCTGTGTCCACAGGGCGACATTTGCCACGTAGGTAGGGCGACCATTAGAATGCTGCATTATCGGGCGATAAAAAGTTTTATACTCAAGAATCCTACAAAGATGAAAATTTCGAAGAGCTGCACGCAGTTGTAAGTTGCTTTTCTGGGGTAATCGAGATTCTCCACCATAACCTGCGCAGTCTTAACCTTAAGACCGTTCATTGATAAAACTTGCTCACGCTCTAGATGGCGTGAAGCGAAACCTAGGAAAAAGTTTTGCAGATAATTAGATTATGCGCGATACTCCGCCGTGTGTTCAATGCATCGTGAAGCTTTACCGCGCCCAAGGGGAGCGTCTCAGTGGGGTTGCCTCCAGGGATATATTGAAAGTTGAAGAAGAAGATCACAGGTTAAGCGGTATGTTAAGTTAGAACTCACGGGGAGCCGCCTTGATTTTGTTCGACATGAACCAGAGACCAGGTGTGTTATGTTCTGGAACCTTAATACGTACGTCGCCAGCACCGAGCCGGCACTCCATCTCTTTTGGGTGCGCAACATTGCTATACTTAGGTGTATTCCTGGGTTGAGTGGCAGGTTTCTCTTAATTCTTCCCTAAGTAGCTCCGAGGATCCATTGACATCTGTCAGCCGTCTTTCCAGAACGTTATAAGACTCGTGAGGAAATTATACAAATCGTTGCCATCATCCAAAGCAAAGTACTTCCGCTTAGGAGTGCCTTGAAGAACCGATTATCTCTGACAATGTAATGCCACAGCACCCTCGACAAAGTTCTACATTCGTTCCAGGTCATGATACAGCGCGCTAAATTACCGCTACGAGCCATACCGGATGGCGGCCGGAGAGCGCTGCAATCGCATGGCTCGGGACCGAACATTGAGACCTGGCTAGTAGGTAGGTGTCAAATCGATATCCACACCTGTCGAAGCAGCTAAAGATCGGTTGCGGCGGGAGTCCTCCATTCAGGCCAAACGTGCAGTGCTCGATGTGCTTCCTATCGCTCT
+>Seq5
+GATGTTTAGAAGTTTCCAGGTCACGCCAATGATTGGCATTTACACACGTGGATCAGCGGACATATCTAACCCTTAGTGTTCTTAAGAGCAACTCACTACTCCATTTCCACTAACCCCGCCGGCGGTAATTCCAATCTAGTTGATCAGACTTCCCAGTCAATGAAAGCGACACCGTGCGTCTGTAATACCAACAAGACCCTGGCTGTCGTCCCGCAGAGGACGCGGCACCTCCGGATTTTGAGTCCAGTCTGAACGATTTTCGATCACTCACCATGGATCTGGAAAACGGAGTCGAGTACTCACGAGCCAAATTGATGCATTTCCAATGACCCGATGCAGGTGCGACCGATCTTCGCCTATGCTTCCCGCCGTAATTATTGAGTCTGGGTCCCGGCCGCTAACGTTTGACTCACGGGGAGGTACCCGTGCGTATTCTTCTCAAAGTGACGCTGGACAGCAGCGCATGTCCGAGCCCCATCGTCCTATCTGGTGTAGAGTCTTACCTCTAATTAGAGTGATCGAACCAGTAGGTGTCGCGGTCTTAGGGCTCCCATTGTCCAAGGGAACGTGAACAGATATGAATCTGGGAGAATAGTGCAGCGTTGCCCTTCTGGTCGGTCAGCCCTTGCCTACGGCCCGTATGCGGAGAATGAAGGCGTGAAACATTCTGCTCTTTTAGAAGCAGCGGCTGCACCCGTATAACAATCGCACGATCGTACGTCTCATTTGCCGCGTTGGCGCGCCCGTGGATGATGGACCACGGTATGAACCTCTGCACTTCAAATTTGACGCAATCCTGCACTCACCGCACACAGTTCTAGTCTAACCGTCGCAGTGTCTGCTTTAAGGTAGAGATCGATACTTAGGATATGTTCATGTGTGTTTGTAGCGCTGGACCCTCTTATGGTGTGGTCACTTGTGATGGATCGAGGAACTTAGGCGGTTAACTTGTTTCGACGTCTCACCGACAATATCAGGATTTAGTATCG
+>Seq6
+ACCGAAAATGACAATGTTCACACGCATGCTCGGCGTGGAAAAGAGCCTTTTCTAAGACCGACTCGTTCCGGGCAGCAGGATTATTAGCCAATCAAAATTATATCGACCGGTCATCAAGCTGCGATAGTGCAGGCGCATGCCGTCCAATGGGTCCACGGCGGAAGTGCGTTCGTCTACTCTGTCAAATCTTAACATTTTTTGAGCGGCTAATCCGGCCGGTAGTGTACCGTGAACCAAAGTCCTTCTACGAGCGTATTAGATTGCTCAAAAGATCCGGGAGAATTGACCAGGTCGTATCTTTAAAATAACGCTGGTGCGAGCAGCTGCTGTTTTATCAACACCCATTTAGTCCTGTGAAGTTTGCTTAGCAGATACACCTTCCCGCGTGGTATGAGAGGCTGTTCTTCATTAAAAACTATGAGGCTCTGGCACCTTCGACGCTAACAAAGTCCCCACGGACCATGATACCCTTACGCAACTCTCTTTGCACGCTAGGGCGAGAGTACTGTCCCCCTAGACTAGGTACACGCCGGGTAAACTCTCTCGCACACCTTTACGCTCGACTACAGGCTTCTAACCCTTCCGAACGCATATAATTCAAATGGCACTTAGTAACAGACGAATCACGGCTCACAGGCAGAATTCACTGGAGTAAAAGGATTCAGAACAATAGATAGTGTGTTAACTTTACAGTCATCCGTATTATAACGTAGCGAGAGGATTGAGTTCTTGTTAGGAAGGAAGGTCCTATAGACGAGTGCGGTAGCGCACCCGGTCGCCTTGCGTAGTCATGCCCGACGTGTTGATGGTTCCCTTTTAGCCGCCACACAAGGGATCCGAGGGTGAGAGACACATGGCCCTCACCGACGAGACTTACTCAGCCTGCCTCGCTATTGCCCTCTTTTTGATCGTCCCTTTGTGGCTCTCGAGGACTCGTGCAGCGTGTATCTGGGGATTTGTAAGCTTAAGACTACCTTCCATAGGA
=====================================
test/full_test/gold.breakpoints
=====================================
@@ -2,31 +2,31 @@
CTCCGGATCTCCGTGTTCTTCGGAAGCTTAG
>bkpt2_Seq0_pos_123_fuzzy_0_HET right_kmer
GTCACGCGCGTCATACTACAGTAAGTTACTG
->bkpt6_Seq1_pos_342_fuzzy_0_HET left_kmer
+>bkpt5_Seq1_pos_342_fuzzy_0_HET left_kmer
GCCGCGCAAAGCCGGTCAACAGCGTTAGTAT
->bkpt6_Seq1_pos_342_fuzzy_0_HET right_kmer
+>bkpt5_Seq1_pos_342_fuzzy_0_HET right_kmer
GTTGAAAGTTTACTCAGATCGCTTCTGTCGG
->bkpt11_Seq2_pos_535_fuzzy_0_HOM left_kmer
+>bkpt9_Seq2_pos_535_fuzzy_0_HOM left_kmer
GGCATGCGTAAGTTATCGTGAAACCATGATG
->bkpt11_Seq2_pos_535_fuzzy_0_HOM right_kmer
+>bkpt9_Seq2_pos_535_fuzzy_0_HOM right_kmer
GCCCCTTACTAGACCAAATGTACTGAATGCG
->bkpt12_Seq2_pos_835_fuzzy_1_HOM left_kmer
+>bkpt10_Seq2_pos_835_fuzzy_1_HOM left_kmer
GAGCTACCCGCCCTCGGTGAGAAGGTAGTAT
->bkpt12_Seq2_pos_835_fuzzy_1_HOM right_kmer
+>bkpt10_Seq2_pos_835_fuzzy_1_HOM right_kmer
ACCCAAACGCGTCCTATGCAGTTTTGGGCTT
->bkpt16_Seq3_pos_781_fuzzy_0_HOM left_kmer
+>bkpt14_Seq3_pos_781_fuzzy_0_HOM left_kmer
CGGCCCATGGGAACAAGTATCCTTACTTTCG
->bkpt16_Seq3_pos_781_fuzzy_0_HOM right_kmer
+>bkpt14_Seq3_pos_781_fuzzy_0_HOM right_kmer
GTACAAATGAGGCTCCAAAATAGCACGCTTG
->bkpt18_Seq4_pos_351_fuzzy_2_HET left_kmer
+>bkpt16_Seq4_pos_351_fuzzy_2_HET left_kmer
GTAATCGAGATTCTCCACCATAACCTGCGCA
->bkpt18_Seq4_pos_351_fuzzy_2_HET right_kmer
+>bkpt16_Seq4_pos_351_fuzzy_2_HET right_kmer
ATGCATCGTGAAGCTTTACCGCGCCCAAGGG
->bkpt20_Seq4_pos_603_fuzzy_3_HOM left_kmer
+>bkpt18_Seq4_pos_603_fuzzy_3_HOM left_kmer
CTTTTGGGTGCGCAACATTGCTATACTTAGG
->bkpt20_Seq4_pos_603_fuzzy_3_HOM right_kmer
+>bkpt18_Seq4_pos_603_fuzzy_3_HOM right_kmer
ATCCATTGACATCTGTCAGCCGTCTTTCCAG
->bkpt22_Seq4_pos_821_fuzzy_0_HOM left_kmer
+>bkpt20_Seq4_pos_821_fuzzy_0_HOM left_kmer
AGCGCGCTAAATTACCGCTACGAGCCATACC
->bkpt22_Seq4_pos_821_fuzzy_0_HOM right_kmer
+>bkpt20_Seq4_pos_821_fuzzy_0_HOM right_kmer
CCGAACATTGAGACCTGGCTAGTAGGTAGGT
=====================================
test/full_test/gold.insertions.fasta
=====================================
@@ -1,16 +1,16 @@
->bkpt2_Seq0_pos_123_fuzzy_0_HET_len_137_qual_50_avg_cov_8.38_median_cov_8.00
+>bkpt2_Seq0_pos_123_fuzzy_0_HET_len_137_qual_50_avg_cov_21.59_median_cov_21.00
ATCTAAGCTGTGACCTTGTGGCCGAGGCGCTTTTCACGCCTACATTAACTCCTGGGAAGCTCTCTGCTCTAGTTTCAGTGCACATCTCCAGGTGAGCAACCCTGGCAAGCAGCCCCTTCCTGTAGAAATTACTTAGC
->bkpt6_Seq1_pos_342_fuzzy_0_HET_len_125_qual_50_avg_cov_9.91_median_cov_9.00
+>bkpt5_Seq1_pos_342_fuzzy_0_HET_len_125_qual_50_avg_cov_25.17_median_cov_24.00
ATGGTTTATAGAACCCGGGCGTTCATGTCCGTCAGAACGATCTTGGCACGGTAGCCCCTGGTCCAGAGAGCCAAGGTGACTCAGCCCCACGATGGTGGTCTAGAGCGAAATAACCCTCGCCGAGA
->bkpt11_Seq2_pos_535_fuzzy_0_HOM_len_140_qual_50_avg_cov_21.63_median_cov_22.00
+>bkpt9_Seq2_pos_535_fuzzy_0_HOM_len_140_qual_50_avg_cov_41.66_median_cov_43.00
TAACGTTCGCTGAACATCGACTCCGGTGACGACATACGATTCAAGAAGAGAGTGACTCTGTAGGATAACATCCCGCAACGCCTAATCCATCCAGCCTGGCACCATGTATAAAGGGCGTCAGGTATGTTAACGAGACTATT
->bkpt12_Seq2_pos_835_fuzzy_1_HOM_len_207_qual_50_avg_cov_16.50_median_cov_16.00
+>bkpt10_Seq2_pos_835_fuzzy_1_HOM_len_207_qual_50_avg_cov_40.42_median_cov_42.00
GCACGCTGCAGGATTGGAACCACAATGTACGCCGATCCAAGCAGTAGTGGTTCATTGTATAAGTATCCTCCCTTGATTGGTCGAATATTAGGCATGCCCCGGGAGCATGTGGGCTCGAGCCACGGAGAGCAACTAATCGCGCATAAAACAAATACCTCATGGTTTTTGTGCGGAAAACCGTTGGGTGGACCATCAGCGGTTGTGATT
->bkpt16_Seq3_pos_781_fuzzy_0_HOM_len_111_qual_50_avg_cov_20.85_median_cov_20.50
+>bkpt14_Seq3_pos_781_fuzzy_0_HOM_len_111_qual_50_avg_cov_50.54_median_cov_53.00
TTTGCAGCACTAGCCGTTCCTTGACATCTGCGGCCAACTTGTGCCTGAACCTGGAGTTTCGACAGCGTGGCGCTCTGGCCTAGTTCTTCGCTGGCACCTGGAAGAGCCGCC
->bkpt18_Seq4_pos_351_fuzzy_2_HET_len_120_qual_50_avg_cov_9.73_median_cov_10.00
+>bkpt16_Seq4_pos_351_fuzzy_2_HET_len_120_qual_50_avg_cov_25.67_median_cov_26.00
GTCTTAACCTTAAGACCGTTCATTGATAAAACTTGCTCACGCTCTAGATGGCGTGAAGCGAAACCTAGGAAAAAGTTTTGCAGATAATTAGATTATGCGCGATACTCCGCCGTGTGTTCA
->bkpt20_Seq4_pos_603_fuzzy_3_HOM_len_57_qual_50_avg_cov_22.71_median_cov_23.00
+>bkpt18_Seq4_pos_603_fuzzy_3_HOM_len_57_qual_50_avg_cov_46.66_median_cov_47.00
TGTATTCCTGGGTTGAGTGGCAGGTTTCTCTTAATTCTTCCCTAAGTAGCTCCGAGG
->bkpt22_Seq4_pos_821_fuzzy_0_HOM_len_40_qual_50_avg_cov_24.34_median_cov_24.00
+>bkpt20_Seq4_pos_821_fuzzy_0_HOM_len_40_qual_50_avg_cov_37.63_median_cov_38.00
GGATGGCGGCCGGAGAGCGCTGCAATCGCATGGCTCGGGA
=====================================
test/full_test/gold.insertions.vcf
=====================================
@@ -1,8 +1,8 @@
##fileformat=VCFv4.1
-##filedate=Thu May 9 11:36:09 2019
-##source=MindTheGap fill version 2.2.0
-##SAMPLE=file:test-output/full-test.h5
-##REF=file:test-output/full-test
+##filedate=Tue Feb 8 12:29:50 2022
+##source=MindTheGap fill version 2.2.3
+##SAMPLE=file:gold.h5
+##REF=file:gold
##INFO=<ID=TYPE,Number=1,Type=String,Description="INS">
##INFO=<ID=LEN,Number=1,Type=Integer,Description="variant size">
##INFO=<=QUAL,Number=.,Type=Integer,Description="Quality of the insertion">
@@ -12,11 +12,11 @@
##INFO=<ID=NPOS,Number=1,Type=Integer,Description="number of alternative positions for the insertion site (= size of repeat (fuzzy) +1)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1
-Seq0 123 bkpt2 G GATCTAAGCTGTGACCTTGTGGCCGAGGCGCTTTTCACGCCTACATTAACTCCTGGGAAGCTCTCTGCTCTAGTTTCAGTGCACATCTCCAGGTGAGCAACCCTGGCAAGCAGCCCCTTCCTGTAGAAATTACTTAGC . PASS TYPE=INS;LEN=137;QUAL=50;NSOL=1;NPOS=1;AVK=8.38;MDK=8.00 GT 0/1
-Seq1 342 bkpt6 T TATGGTTTATAGAACCCGGGCGTTCATGTCCGTCAGAACGATCTTGGCACGGTAGCCCCTGGTCCAGAGAGCCAAGGTGACTCAGCCCCACGATGGTGGTCTAGAGCGAAATAACCCTCGCCGAGA . PASS TYPE=INS;LEN=125;QUAL=50;NSOL=1;NPOS=1;AVK=9.91;MDK=9.00 GT 0/1
-Seq2 535 bkpt11 G GTAACGTTCGCTGAACATCGACTCCGGTGACGACATACGATTCAAGAAGAGAGTGACTCTGTAGGATAACATCCCGCAACGCCTAATCCATCCAGCCTGGCACCATGTATAAAGGGCGTCAGGTATGTTAACGAGACTATT . PASS TYPE=INS;LEN=140;QUAL=50;NSOL=1;NPOS=1;AVK=21.63;MDK=22.00 GT 1/1
-Seq2 834 bkpt12 A ATGCACGCTGCAGGATTGGAACCACAATGTACGCCGATCCAAGCAGTAGTGGTTCATTGTATAAGTATCCTCCCTTGATTGGTCGAATATTAGGCATGCCCCGGGAGCATGTGGGCTCGAGCCACGGAGAGCAACTAATCGCGCATAAAACAAATACCTCATGGTTTTTGTGCGGAAAACCGTTGGGTGGACCATCAGCGGTTGTGAT . PASS TYPE=INS;LEN=207;QUAL=50;NSOL=1;NPOS=2;AVK=16.50;MDK=16.00 GT 1/1
-Seq3 781 bkpt16 G GTTTGCAGCACTAGCCGTTCCTTGACATCTGCGGCCAACTTGTGCCTGAACCTGGAGTTTCGACAGCGTGGCGCTCTGGCCTAGTTCTTCGCTGGCACCTGGAAGAGCCGCC . PASS TYPE=INS;LEN=111;QUAL=50;NSOL=1;NPOS=1;AVK=20.85;MDK=20.50 GT 1/1
-Seq4 349 bkpt18 G GCAGTCTTAACCTTAAGACCGTTCATTGATAAAACTTGCTCACGCTCTAGATGGCGTGAAGCGAAACCTAGGAAAAAGTTTTGCAGATAATTAGATTATGCGCGATACTCCGCCGTGTGTT . PASS TYPE=INS;LEN=120;QUAL=50;NSOL=1;NPOS=3;AVK=9.73;MDK=10.00 GT 0/1
-Seq4 600 bkpt20 T TAGGTGTATTCCTGGGTTGAGTGGCAGGTTTCTCTTAATTCTTCCCTAAGTAGCTCCG . PASS TYPE=INS;LEN=57;QUAL=50;NSOL=1;NPOS=4;AVK=22.71;MDK=23.00 GT 1/1
-Seq4 821 bkpt22 C CGGATGGCGGCCGGAGAGCGCTGCAATCGCATGGCTCGGGA . PASS TYPE=INS;LEN=40;QUAL=50;NSOL=1;NPOS=1;AVK=24.34;MDK=24.00 GT 1/1
+Seq0 123 bkpt2 G GATCTAAGCTGTGACCTTGTGGCCGAGGCGCTTTTCACGCCTACATTAACTCCTGGGAAGCTCTCTGCTCTAGTTTCAGTGCACATCTCCAGGTGAGCAACCCTGGCAAGCAGCCCCTTCCTGTAGAAATTACTTAGC . PASS TYPE=INS;LEN=137;QUAL=50;NSOL=1;NPOS=1;AVK=21.59;MDK=21.00 GT 0/1
+Seq1 342 bkpt5 T TATGGTTTATAGAACCCGGGCGTTCATGTCCGTCAGAACGATCTTGGCACGGTAGCCCCTGGTCCAGAGAGCCAAGGTGACTCAGCCCCACGATGGTGGTCTAGAGCGAAATAACCCTCGCCGAGA . PASS TYPE=INS;LEN=125;QUAL=50;NSOL=1;NPOS=1;AVK=25.17;MDK=24.00 GT 0/1
+Seq2 535 bkpt9 G GTAACGTTCGCTGAACATCGACTCCGGTGACGACATACGATTCAAGAAGAGAGTGACTCTGTAGGATAACATCCCGCAACGCCTAATCCATCCAGCCTGGCACCATGTATAAAGGGCGTCAGGTATGTTAACGAGACTATT . PASS TYPE=INS;LEN=140;QUAL=50;NSOL=1;NPOS=1;AVK=41.66;MDK=43.00 GT 1/1
+Seq2 834 bkpt10 A ATGCACGCTGCAGGATTGGAACCACAATGTACGCCGATCCAAGCAGTAGTGGTTCATTGTATAAGTATCCTCCCTTGATTGGTCGAATATTAGGCATGCCCCGGGAGCATGTGGGCTCGAGCCACGGAGAGCAACTAATCGCGCATAAAACAAATACCTCATGGTTTTTGTGCGGAAAACCGTTGGGTGGACCATCAGCGGTTGTGAT . PASS TYPE=INS;LEN=207;QUAL=50;NSOL=1;NPOS=2;AVK=40.42;MDK=42.00 GT 1/1
+Seq3 781 bkpt14 G GTTTGCAGCACTAGCCGTTCCTTGACATCTGCGGCCAACTTGTGCCTGAACCTGGAGTTTCGACAGCGTGGCGCTCTGGCCTAGTTCTTCGCTGGCACCTGGAAGAGCCGCC . PASS TYPE=INS;LEN=111;QUAL=50;NSOL=1;NPOS=1;AVK=50.54;MDK=53.00 GT 1/1
+Seq4 349 bkpt16 G GCAGTCTTAACCTTAAGACCGTTCATTGATAAAACTTGCTCACGCTCTAGATGGCGTGAAGCGAAACCTAGGAAAAAGTTTTGCAGATAATTAGATTATGCGCGATACTCCGCCGTGTGTT . PASS TYPE=INS;LEN=120;QUAL=50;NSOL=1;NPOS=3;AVK=25.67;MDK=26.00 GT 0/1
+Seq4 600 bkpt18 T TAGGTGTATTCCTGGGTTGAGTGGCAGGTTTCTCTTAATTCTTCCCTAAGTAGCTCCG . PASS TYPE=INS;LEN=57;QUAL=50;NSOL=1;NPOS=4;AVK=46.66;MDK=47.00 GT 1/1
+Seq4 821 bkpt20 C CGGATGGCGGCCGGAGAGCGCTGCAATCGCATGGCTCGGGA . PASS TYPE=INS;LEN=40;QUAL=50;NSOL=1;NPOS=1;AVK=37.63;MDK=38.00 GT 1/1
=====================================
test/full_test/gold.othervariants.vcf
=====================================
@@ -1,25 +1,39 @@
##fileformat=VCFv4.1
-##filedate=Thu May 9 11:49:49 2019
-##source=MindTheGap find version 2.2.0
-##SAMPLE=file:../data/reads_r1.fastq,../data/reads_r2.fastq
-##REF=file:../data/reference.fasta
+##filedate=Tue Feb 8 12:29:50 2022
+##source=MindTheGap find version 2.2.3
+##SAMPLE=file:../../data/reads_r1.fastq,../../data/reads_r2.fastq
+##REF=file:../../data/reference.fasta
##INFO=<ID=TYPE,Number=1,Type=String,Description="SNP, INS, DEL or .">
##INFO=<ID=LEN,Number=1,Type=Integer,Description="variant size">
##INFO=<ID=FUZZY,Number=1,Type=Integer,Description="repeat size at the breakpoint, only for INS and DEL">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1
Seq0 101 bkpt1 T C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq0 816 bkpt3 C A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq1 206 bkpt4 G C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq1 219 bkpt5 T A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq1 740 bkpt7 CCTGTTGGGAAGGAATTGCAATACTCTCCGAACCAGCTTAGGGCCCCCCGCCGCCGCAATTCGAGCGTTATGCCCGGAGCATTTGCACGATGCCATTAAACTATATCAA C . PASS TYPE=DEL;LEN=108;FUZZY=2 GT 1/1
-Seq2 320 bkpt8 T C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq2 344 bkpt9 C A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq2 379 bkpt10 G C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq3 256 bkpt13 A T . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq3 511 bkpt14 C A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq3 766 bkpt15 G A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq4 257 bkpt17 C T . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq4 512 bkpt19 A G . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq4 841 bkpt21 C T . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
-Seq4 884 bkpt23 CTAGGGACCTAGACGCAACAGTAACCGCCTCGGAGTAAGCCCTGG C . PASS TYPE=DEL;LEN=44;FUZZY=2 GT 1/1
+Seq0 297 bkpt3 CTAGCTTGAGAGTGCGTATCTCACCGATCCCCTGGCTATGCTCCGCGATTCACTAGTAGTTTCACGCCGACAGAGCGAAACCGTGATAGGTCATCATGCCGGTCTGCAGTCACGT C . PASS TYPE=DEL;LEN=114;FUZZY=0 GT 1/1
+Seq0 816 bkpt4 C A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq2 320 bkpt6 T C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq2 344 bkpt7 C A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq2 379 bkpt8 G C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq3 256 bkpt11 A T . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq3 511 bkpt12 C A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq3 766 bkpt13 G A . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq4 257 bkpt15 C T . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq4 512 bkpt17 A G . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq4 841 bkpt19 C T . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq4 884 bkpt21 CTAGGGACCTAGACGCAACAGTAACCGCCTCGGAGTAAGCCCTGG C . PASS TYPE=DEL;LEN=44;FUZZY=2 GT 1/1
+Seq5 100 bkpt22 T TC . PASS TYPE=INS;LEN=1;FUZZY=1 GT 1/1
+Seq5 199 bkpt23 T TG . PASS TYPE=INS;LEN=1;FUZZY=1 GT 1/1
+Seq5 300 bkpt24 A AC . PASS TYPE=INS;LEN=1;FUZZY=0 GT 1/1
+Seq5 400 bkpt25 G GT . PASS TYPE=INS;LEN=1;FUZZY=2 GT 1/1
+Seq5 500 bkpt26 C CT . PASS TYPE=INS;LEN=1;FUZZY=0 GT 1/1
+Seq5 600 bkpt27 G GA . PASS TYPE=INS;LEN=1;FUZZY=0 GT 0/1
+Seq5 700 bkpt28 A AC . PASS TYPE=INS;LEN=1;FUZZY=0 GT 0/1
+Seq5 800 bkpt29 A AC . PASS TYPE=INS;LEN=1;FUZZY=2 GT 0/1
+Seq5 900 bkpt30 T TG . PASS TYPE=INS;LEN=1;FUZZY=2 GT 0/1
+Seq6 98 bkpt31 T TAT . PASS TYPE=INS;LEN=2;FUZZY=3 GT 1/1
+Seq6 200 bkpt32 A ACG . PASS TYPE=INS;LEN=2;FUZZY=1 GT 1/1
+Seq6 300 bkpt33 A ATA . PASS TYPE=INS;LEN=2;FUZZY=1 GT 1/1
+Seq6 400 bkpt34 T TCA . PASS TYPE=INS;LEN=2;FUZZY=0 GT 1/1
+Seq6 600 bkpt35 T TCA . PASS TYPE=INS;LEN=2;FUZZY=0 GT 0/1
+Seq6 699 bkpt36 C CGT . PASS TYPE=INS;LEN=2;FUZZY=2 GT 0/1
+Seq6 800 bkpt37 T TGG . PASS TYPE=INS;LEN=2;FUZZY=0 GT 0/1
=====================================
test/full_test/gold_bed.breakpoints
=====================================
@@ -2,11 +2,11 @@
CTCCGGATCTCCGTGTTCTTCGGAAGCTTAG
>bkpt2_Seq0_pos_123_fuzzy_0_HET right_kmer
GTCACGCGCGTCATACTACAGTAAGTTACTG
->bkpt3_Seq1_pos_342_fuzzy_0_HET left_kmer
+>bkpt4_Seq1_pos_342_fuzzy_0_HET left_kmer
GCCGCGCAAAGCCGGTCAACAGCGTTAGTAT
->bkpt3_Seq1_pos_342_fuzzy_0_HET right_kmer
+>bkpt4_Seq1_pos_342_fuzzy_0_HET right_kmer
GTTGAAAGTTTACTCAGATCGCTTCTGTCGG
->bkpt4_Seq2_pos_535_fuzzy_0_HOM left_kmer
+>bkpt5_Seq2_pos_535_fuzzy_0_HOM left_kmer
GGCATGCGTAAGTTATCGTGAAACCATGATG
->bkpt4_Seq2_pos_535_fuzzy_0_HOM right_kmer
+>bkpt5_Seq2_pos_535_fuzzy_0_HOM right_kmer
GCCCCTTACTAGACCAAATGTACTGAATGCG
=====================================
test/full_test/gold_bed.othervariants.vcf
=====================================
@@ -1,6 +1,6 @@
##fileformat=VCFv4.1
-##filedate=Thu May 9 11:40:18 2019
-##source=MindTheGap find version 2.2.0
+##filedate=Tue Feb 8 12:46:04 2022
+##source=MindTheGap find version 2.2.3
##SAMPLE=file:../../data/reads_r1.fastq,../../data/reads_r2.fastq
##REF=file:../../data/reference.fasta
##INFO=<ID=TYPE,Number=1,Type=String,Description="SNP, INS, DEL or .">
@@ -9,3 +9,4 @@
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1
Seq0 101 bkpt1 T C . PASS TYPE=SNP;LEN=1;FUZZY=0 GT 1/1
+Seq0 297 bkpt3 CTAGCTTGAGAGTGCGTATCTCACCGATCCCCTGGCTATGCTCCGCGATTCACTAGTAGTTTCACGCCGACAGAGCGAAACCGTGATAGGTCATCATGCCGGTCTGCAGTCACGT C . PASS TYPE=DEL;LEN=114;FUZZY=0 GT 1/1
=====================================
test/full_test/gold_fill.output
=====================================
@@ -1,7 +1,6 @@
-nb breakpoints=7
MindTheGap fill
- version : 1.0.0
- gatb-core-library : 1.2.0
+ version : 2.2.3
+ gatb-core-library : 1.4.2
supported_kmer_sizes : 32 64 96 128
Parameters
Input data
@@ -9,18 +8,21 @@ Parameters
Breakpoints : gold.breakpoints
Graph
kmer-size : 31
- abundance_min (auto inferred) : 3
- abundance_min (used) : 3
- nb_solid_kmers : 5473
- nb_branching_nodes : 34
+ abundance_min (auto inferred) : 7
+ abundance_min (used) : 7
+ nb_solid_kmers : 7419
+ nb_branching_nodes : 36
Assembly options
max_depth : 10000
max_nodes : 100
Results
Breakpoints
- nb_input : 7
- nb_filled : 7
- unique_sequence : 7
- multiple_sequence : 0
- Time : 0.0 s
- Output file : gold.insertions.fasta
+ nb_input_breakpoints : 8
+ nb_filled_breakpoints : 8
+ as_unique_sequence : 8
+ as_multiple_sequence : 0
+ Time : 1.0 s
+ Output files
+ assembled sequence file : gold.insertions.fasta
+ insertion variant vcf file : gold.insertions.vcf
+ assembly statistics file : gold.info.txt
=====================================
test/full_test/gold_find.output
=====================================
@@ -1,6 +1,6 @@
MindTheGap find
- version : 1.0.0
- gatb-core-library : 1.2.0
+ version : 2.2.3
+ gatb-core-library : 1.4.2
supported_kmer_sizes : 32 64 96 128
Parameters
Input data
@@ -8,11 +8,11 @@ Parameters
Reference : ../../data/reference.fasta
Graph
kmer-size : 31
- abundance_min (auto inferred) : 3
- abundance_min (used) : 3
+ abundance_min (auto inferred) : 7
+ abundance_min (used) : 7
abundance_max : 2147483647
- nb_solid_kmers : 5473
- nb_branching_nodes : 34
+ nb_solid_kmers : 7419
+ nb_branching_nodes : 36
Breakpoint detection options
max_repeat : 5
hetero_max_occ : 1
@@ -25,12 +25,14 @@ Results
homozygous : 5
clean : 3
fuzzy : 2
- heterozygous : 2
- clean : 1
+ heterozygous : 3
+ clean : 2
fuzzy : 1
Other variants
- deletions : 3
- SNPs : 13
+ deletions : 2
+ Homozygous insertions 1-2 bp size : 9
+ Heterozygous insertions 1-2 bp size : 7
+ SNPs : 11
Time : 0.0 s
Output files
graph_file : gold.h5
=====================================
test/full_test/reference.fasta
=====================================
@@ -8,3 +8,7 @@ TGCTGCCGATCGCTACGACGTCCTACCTTACACACAACGGGCCGCGTTCATACCCACGTATGAAGACATGCGGTTATCCG
TATTGCGCCCTTCAAGAAGCTTCTGCTGACCGTAGGCGTCTCGGCGGTTTGTACTTTGAAAAATTAGCTGCACTACATCCGATGGGTATCCCTCCTCAATCTCAGCAGACCCGGAAAGCGATAGAATCAGCCACGCGGTCGTCCGGGCTAGGGGCCCTGCGCAAGGAAGGTTGGACAGGGCTAGACCCGGAAGCATCGGCTTTTCCTAAATGGTGACGGAGTTATATAGGGTAAGCCTGATAGCGCGGTAGGTGTAATGGCCATCCCCTCGCCTAGCGTGCGCGCAGACAAGTCCAGTCCCGGAGGAGGCATAGGCCTCATTATCATTTCCCTAGAATCGCTCTTGACATCTAGGTTGTACTAGGGACCAGGCGCCCAAAGCGGACGGTTCTCCGTGCTTTCGTGCCGTTTCAGCGTAAGATGCTATTTTTTGGGGAAATGGTCGGCGTGTGCGGGGGAGAACCACGGTACCAACTACGATAAGTCCGTCGTGTAACTTACGTGAAGGTGCTGTGAAGCAGGAATCCGTGCCAAAATGTCCGTGCGATATCCAACTTTCATAGTATTACACGAGAGCCTATGATTTGCCCAGGCGCGACCCGTGAATCGAGGTAATCGCCGACCAGATATTGCGAAACACCACATTACATGACTACTGTCCGCTTGAAGAGTTATATACTTGACAGTCCTGGTTGACGGCACAGCATATCTCCAATGTGTGGTTTAAAGTCTCACGTTCTTCATGCGCGCCGGCCCATGGGAACAGGTATCCTTACTTTCGGTACAAATGAGGCTCCAAAATAGCACGCTTGCAGCAGTCAAGTTGAACGCCTTAAAAGGCACCGCCGCTCGTTCATTGGGATTCCTTGAGAATCGTGACTTGTTACACTATAAGATCATGGATTGGACAAAATAGGCCAACTCCCGCACGCTGTGGCTATTCTTAAGTTGCATAGGTGGGAGTAGCCTTATACTCGATTTCTAAAAAGAGTAGGTGAGC
>Seq4
TTCCGGCGCCGCACTAATTGAAGTGGTGAGCTGACCAGTCGTTCAGGATCCGAAGGCGGGGATGGCGCTATAGGAGCCGGCAGGTATGCTTTGCCGCAAAATTTCGGGGTGGTGGAACCGTCTTACCGAAAGTTAGCTACAGCCTGGAATGTGAAATTCCATGACCTGCCCGTCCTGTGTCCACAGGGCGACATTTGCCACGTAGGTAGGGCGACCATTAGAATGCTGCATTATCGGGCGATAAAAAGTTTTATACCCAAGAATCCTACAAAGATGAAAATTTCGAAGAGCTGCACGCAGTTGTAAGTTGCTTTTCTGGGGTAATCGAGATTCTCCACCATAACCTGCGCAATGCATCGTGAAGCTTTACCGCGCCCAAGGGGAGCGTCTCAGTGGGGTTGCCTCCAGGGATATATTGAAAGTTGAAGAAGAAGATCACAGGTTAAGCGGTATGTTAAGTTAGAACTCACGGGGAGCCGCCTTGATTTTGTTCGACATGAACCAGAGACCAAGTGTGTTATGTTCTGGAACCTTAATACGTACGTCGCCAGCACCGAGCCGGCACTCCATCTCTTTTGGGTGCGCAACATTGCTATACTTAGGATCCATTGACATCTGTCAGCCGTCTTTCCAGAACGTTATAAGACTCGTGAGGAAATTATACAAATCGTTGCCATCATCCAAAGCAAAGTACTTCCGCTTAGGAGTGCCTTGAAGAACCGATTATCTCTGACAATGTAATGCCACAGCACCCTCGACAAAGTTCTACATTCGTTCCAGGTCATGATACAGCGCGCTAAATTACCGCTACGAGCCATACCCCGAACATTGAGACCTGGCCAGTAGGTAGGTGTCAAATCGATATCCACACCTGTCGAAGCAGCTAGGGACCTAGACGCAACAGTAACCGCCTCGGAGTAAGCCCTGGTAAAGATCGGTTGCGGCGGGAGTCCTCCATTCAGGCCAAACGTGCAGTGCTCGATGTGCTTCCTATCGCTCT
+>Seq5
+GATGTTTAGAAGTTTCCAGGTCACGCCAATGATTGGCATTTACACACGTGGATCAGCGGACATATCTAACCCTTAGTGTTCTTAAGAGCAACTCACTACTCATTTCCACTAACCCCGCCGGCGGTAATTCCAATCTAGTTGATCAGACTTCCCAGTCAATGAAAGCGACACCGTGCGTCTGTAATACCAACAAGACCCTGCTGTCGTCCCGCAGAGGACGCGGCACCTCCGGATTTTGAGTCCAGTCTGAACGATTTTCGATCACTCACCATGGATCTGGAAAACGGAGTCGAGTACTCAGAGCCAAATTGATGCATTTCCAATGACCCGATGCAGGTGCGACCGATCTTCGCCTATGCTTCCCGCCGTAATTATTGAGTCTGGGTCCCGGCCGCTAACGTTGACTCACGGGGAGGTACCCGTGCGTATTCTTCTCAAAGTGACGCTGGACAGCAGCGCATGTCCGAGCCCCATCGTCCTATCTGGTGTAGAGTCTTACCCTAATTAGAGTGATCGAACCAGTAGGTGTCGCGGTCTTAGGGCTCCCATTGTCCAAGGGAACGTGAACAGATATGAATCTGGGAGAATAGTGCAGCGTTGCCCTTCTGGTCGGTCAGCCCTTGCCTACGGCCCGTATGCGGAGAATGAAGGCGTGAAACATTCTGCTCTTTTAGAAGCAGCGGCTGCACCCGTATAACAATCGCACGATCGTACGTCTCATTTGCCGCGTTGGCGCGCCCGTGGATGATGGACCACGGTATGAACCTCTGCACTTCAAATTTGACGCAATCCTGCACTCACCGCACACAGTTCTAGTCTAACCGTCGCAGTGTCTGCTTTAAGGTAGAGATCGATACTTAGGATATGTTCATGTGTGTTTGTAGCGCTGGACCCTCTTATGGTGTGGTCACTTGTGATGGATCGAGGAACTTAGGCGGTTAACTTGTTTCGACGTCTCACCGACAATATCAGGATTTAGTATCG
+>Seq6
+ACCGAAAATGACAATGTTCACACGCATGCTCGGCGTGGAAAAGAGCCTTTTCTAAGACCGACTCGTTCCGGGCAGCAGGATTATTAGCCAATCAAAATTATCGACCGGTCATCAAGCTGCGATAGTGCAGGCGCATGCCGTCCAATGGGTCCACGGCGGAAGTGCGTTCGTCTACTCTGTCAAATCTTAACATTTTTTGAGGCTAATCCGGCCGGTAGTGTACCGTGAACCAAAGTCCTTCTACGAGCGTATTAGATTGCTCAAAAGATCCGGGAGAATTGACCAGGTCGTATCTTTAAAAACGCTGGTGCGAGCAGCTGCTGTTTTATCAACACCCATTTAGTCCTGTGAAGTTTGCTTAGCAGATACACCTTCCCGCGTGGTATGAGAGGCTGTTCTTTTAAAAACTATGAGGCTCTGGCACCTTCGACGCTAACAAAGTCCCCACGGACCATGATACCCTTACGCAACTCTCTTTGCACGCTAGGGCGAGAGTACTGCCCCTAGACTAGGTACACGCCGGGTAAACTCTCTCGCACACCTTTACGCTCGACTACAGGCTTCTAACCCTTCCGAACGCATATAATTCAAATGGCACTTAGTAACAGACGAATCACGGCTCACAGGCAGAATTCACTGGAGTAAAAGGATTCAGAACAATAGATAGTGTGTTAACTTTACAGTCATCCGTATTATAACGTAGCGAGAGGATTGAGTTCTTGTTAGGAAGGAAGGTCCTATAGACGAGTGCGGTAGCGCACCCGGTCGCCTTGCGTAGTCATGCCCGACGTGTTGATGGTTCCCTTTTAGCCGCCACACAAGGGATCCGAGGGTGAGAGACACATGGCCCTCACCGACGAGACTTACTCAGCCTGCCTCGCTATTGCCCTCTTTTTGATCGTCCCTTTGTGGCTCTCGAGGACTCGTGCAGCGTGTATCTGGGGATTTGTAAGCTTAAGACTACCTTCCATAGGA
View it on GitLab: https://salsa.debian.org/med-team/mindthegap/-/commit/b905744345c06c4c96602510025ab2bf5aefebbc