[med-svn] [Git][med-team/minia][upstream] New upstream version 3.2.1

Fri Jul 5 06:59:01 BST 2019


Andrius Merkys pushed to branch upstream at Debian Med / minia


Commits:
578eb06a by Andrius Merkys at 2019-07-05T05:33:15Z
New upstream version 3.2.1
- - - - -


14 changed files:

- CMakeLists.txt
- INSTALL
- README.md
- doc/manual.pdf
- doc/manual.tex
- + doc/paper.pdf
- merci/merci.cpp
- merci/run-test-simple.sh
- + scripts/convertToGFA.py
- + scripts/filter_by_previous_contigs.cpp
- src/Minia.cpp
- + src/build_info.hpp.in
- src/main.cpp
- test/test_ERR039477.sh


Changes:

=====================================
CMakeLists.txt
=====================================
@@ -37,7 +37,10 @@ exec_program(
     ARGS "rev-parse --short HEAD"
     OUTPUT_VARIABLE VERSION_SHA1 )
 
-add_definitions( -DGIT_SHA1="${VERSION_SHA1}" )
+if ( VERSION_SHA1 MATCHES "^[0-9a-f]+$") # might return "git commit fatal: not a git repository"
+    message("Git version: ${VERSION_SHA1}")
+    add_definitions( -DGIT_SHA1="${VERSION_SHA1}" )
+endif()
 
 ################################################################################
 # Define cmake modules directory
@@ -94,6 +97,12 @@ add_executable("merci"  ${MerciFiles})
 target_link_libraries("merci" ${gatb-core-libraries})
 
 
+# that file will hold the version information as a #define
+configure_file (
+${PROJECT_SOURCE_DIR}/src/build_info.hpp.in
+${PROJECT_SOURCE_DIR}/src/build_info.hpp
+)
+
 ################################################################################
 #  DOC
 ################################################################################


=====================================
INSTALL
=====================================
@@ -24,5 +24,5 @@ cd ..
 # run tests
 echo "Running simple test..."
 cd test
-. ./simple_test.sh
+./simple_test.sh
 cd ..


=====================================
README.md
=====================================
@@ -32,12 +32,7 @@ C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))
 
 Type `minia` without any arguments for usage instructions.
 
-A more complete manual:
-
-    cd doc 
-    pdflatex manual.tex
-
-If you cannot compile it: http://minia.genouest.org/files/minia.pdf
+A more complete manual is here: https://github.com/GATB/minia/raw/master/doc/manual.pdf
 
 # What is new ? (2018)
 


=====================================
doc/manual.pdf
=====================================
Binary files a/doc/manual.pdf and b/doc/manual.pdf differ


=====================================
doc/manual.tex
=====================================
@@ -1,6 +1,7 @@
 \documentclass[a4paper]{article}
 \usepackage{fancyvrb}
 \usepackage{pdfpages}
+\usepackage{hyperref}
 \usepackage{url}
 \begin{document}
 
@@ -65,7 +66,7 @@ The main parameters are:
 \begin{description}
 
 \vitem+kmer-size+
-The $k$-mer length is the length of the nodes in the de Bruijn graph. It strongly depends on the input dataset. For proper assembly, we recommend that you use the Minia-pipeline that runs Minia multiple times, with an iterative multi-k algorithm. That way, you won't need to choose k. If you insist on running with a single k value, the KmerGenie software can automatically find the best $k$ for your dataset.
+        The $k$-mer length is the length of the nodes in the de Bruijn graph. It strongly depends on the input dataset. For proper assembly, we recommend that you use the Minia-pipeline (\url{https://github.com/GATB/gatb-minia-pipeline}) that runs Minia multiple times, using an iterative multi-k algorithm (similar to IDBA, Megahit). That way, you won't need to choose k. If you insist on running with a single k value, the KmerGenie software can automatically find the best $k$ for your dataset.
 
 \vitem+abundance-min+
 The \verb+abundance-min+ is a hard cut-off to remove likely erroneous, low-abundance $k$-mers. In Minia 3, it is recommended to set it to 2 unless you have a good reason not to. Minia 3 also implements other ways to remove erroneous k-mers, using a more adequate relative cut-off, in the graph simplifications step.
@@ -113,11 +114,17 @@ Minia can direclty read files compressed with gzip. Compressed files should end
 
 \section{Output}
 
-The output of Minia is a set of contigs in the FASTA format, in the file \verb+[prefix].contigs.fa+. 
+\subsection{Fasta}
 
-\subsection*{Creating unitigs}
+The main output of Minia is a set of contigs in the FASTA format, in the file \verb+[prefix].contigs.fa+. 
 
-Minia will also output unitigs in the FASTA format. Those are non-branching paths in the de Bruijn graph prior to any graph simplification. File: \verb+[prefix].unitigs.fa+. 
+\subsection{Assembly graph}
+
+The contigs FASTA file can be converted to GFA using \href{https://github.com/GATB/bcalm/blob/master/scripts/convertToGFA.py}{BCALM's convertToGFA.py script}.
+
+\subsection*{Unitigs}
+
+Minia will also output unitigs, in the FASTA format. They correspond to non-branching paths in the de Bruijn graph prior to any graph simplification. File: \verb+[prefix].unitigs.fa+. 
 
 \section{Selection of graph simplification steps}
 


=====================================
doc/paper.pdf
=====================================
Binary files /dev/null and b/doc/paper.pdf differ


=====================================
merci/merci.cpp
=====================================
@@ -93,8 +93,7 @@ void index_assembly(string assembly, int k, int nb_threads, bool verbose, assemb
                 bool rc = modelCanon.toString(kmmerBegin.value()) != kmerBegin; // could be optimized
                 ExtremityInfo e (current_seq, rc, UNITIG_BEGIN); /* actually we dont know if its false/UNITIG_BEGIN or reverse/UNITIG_END but should think about it, for now, putting naive*/ 
                 index[kmmerBegin.value()] = e.pack();
-                if (debug)
-                std::cout << "index left kmer " << modelCanon.toString(kmmerBegin.value()) << " of contig " << current_seq << std::endl; 
+                if (debug) std::cout << "index left kmer " << modelCanon.toString(kmmerBegin.value()) << " of contig " << current_seq << std::endl; 
             }
 
             if (no_link_right)
@@ -104,8 +103,7 @@ void index_assembly(string assembly, int k, int nb_threads, bool verbose, assemb
                 bool rc =  modelCanon.toString(kmmerEnd.value()) != kmerEnd; // could be optimized
                 ExtremityInfo e (current_seq, rc, UNITIG_END); 
                 index[kmmerEnd.value()] = e.pack();
-                if (debug)
-                std::cout << "index right kmer " << modelCanon.toString(kmmerEnd.value()) << " of contig " << current_seq << std::endl; 
+                if (debug) std::cout << "index right kmer " << modelCanon.toString(kmmerEnd.value()) << " of contig " << current_seq << std::endl; 
             }
             current_seq++;
         }


=====================================
merci/run-test-simple.sh
=====================================
@@ -1,3 +1,8 @@
 cd test-simple; 
 ../../buildk32/merci -kmer-size 11 -reads reads.fa -assembly assembly.fa -verbose  1 ;
 cd ..
+
+echo "wc, should have 4 lines:"
+wc -l test-simple/assembly.fa
+echo "wc, should have 2 lines:"
+wc -l test-simple/assembly.fa.merci


=====================================
scripts/convertToGFA.py
=====================================
@@ -0,0 +1,123 @@
+#!/usr/bin/env python
+
+'''****************************************************************************
+
+*   Program - Convert To GFA format
+*   Author - Mayank Pahadia
+
+
+ *  This program is free software: you can redistribute it and/or modify
+ *  it under the terms of the GNU Affero General Public License as
+ *  published by the Free Software Foundation, either version 3 of the
+ *  License, or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU Affero General Public License for more details.
+
+****************************************************************************'''
+
+'''****************************************************************************
+
+*   To run the program, pass three arguments with the python script on command line.
+*   For example - python convertToGFA.py inputFileName outputFileName kmerSize
+
+*   Logic - It reads through the fasta file with all the unitigs information 
+*   and link information and outputs it in the GFA format. 
+
+
+****************************************************************************'''
+
+
+import sys
+import argparse
+
+def write_segment(name,segment,optional,g,links):
+    add = ""
+    add += "S\t" #for segment
+    add += name #id of segment
+    add += "\t"
+    add += segment #segment itself
+    add += "\t"
+    for i in optional: #optional tags
+        add+=i
+        add+="\t"
+    #adding Segment to the file
+    g.write(add.strip()+"\n")
+    for j in links: #adding all the links of the current segment to the GFA file
+        g.write(j)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Convert a bcalm-generated FASTA to a GFA.",
+                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser.add_argument('inputFilename', help='Input FASTA file')
+    parser.add_argument('outputFilename', help='Output GFA file')
+    parser.add_argument('kmerSize', type=int, help='k-mer length')
+    parser.add_argument('-s', '--single-directed', action='store_true',
+                        help='Avoid outputting the whole skew-simmetric graph and output only one edge between two nodes',
+                        dest='single_directed')
+
+    args = parser.parse_args()
+
+    with open(args.inputFilename) as f:
+        #name stores the id of the unitig
+        #optional is a list which stores all the optional tags of a segment
+        #links stores all the link information about a segment
+        name = ""
+        optional=[]
+        links=[]
+        g = open(args.outputFilename,'w')
+        #adding Header to the file
+        k = int(args.kmerSize)
+        g.write('H\tVN:Z:1.0\tks:i:%d\n' %k) # includes the k-mer size
+        print("GFA file open")
+        #firstLine is for implemetation purpose so that we don't add some garbage value to the output file.
+        firstLine = 0
+        #segment stores the segment till present, in a fasta file, segment can be on many lines, hence we need to get the whole segment from all the lines
+        segment = ""
+        for line in f:
+            line = line.replace("\n","")
+            if(line[0]!=">"):
+                #segment might be in more than one line, hence we get the whole segment first, and then put it in the GFA file.
+                segment += line
+            if(line[0]==">"):
+                if(firstLine!=0):#if it's not the firstline in the input file, we store the input in GFA format in the output file
+                    write_segment(name,segment,optional,g,links)
+                    segment = ""
+
+                firstLine = 1
+                #once the previous segment and it's information has been stored, we start the next segment and it's information
+                a = line.split(" ")
+                name=a[0][1:] #get the id
+                optional=[]
+                links = []
+                #we skip the first value because the first value is ">ID"
+                for i in range(1,len(a)):
+                    #we need this because the line can end with a space, hence we get one extra value in our list.
+                    if(a[i]==""):
+                        continue
+                    if(a[i][0:2] == "MA"): #previous bcalm2 versions had "MA=[xxx]" optional tag as well, kept it just for compatibility, and reformated
+                        optional.append(a[i][0:2]+":f:"+a[i][2:])
+                    elif(a[i][0:2] == "L:"): #for links
+                        b = a[i].split(":")
+                        k1 = int(args.kmerSize)-1
+                        if args.single_directed:
+                            if name < b[2]:
+                                links.append("L\t"+name+"\t"+b[1]+"\t"+b[2]+"\t"+b[3]+"\t"+str(k1)+"M\n")
+                            elif name == b[2] and not (b[1] == b[3] == '-'): # manage links between the same unitig
+                                links.append("L\t"+name+"\t"+b[1]+"\t"+b[2]+"\t"+b[3]+"\t"+str(k1)+"M\n")
+                        else:
+                            links.append("L\t"+name+"\t"+b[1]+"\t"+b[2]+"\t"+b[3]+"\t"+str(k1)+"M\n")
+                    else: #all the other optional tags
+                        optional.append(a[i])
+
+
+        #we will miss the last one, because it won't go into the if condition - if(line[0]==">") and hence won't add the segment to the file.
+        write_segment(name,segment,optional,g,links)
+        print("done")
+        g.close()
+
+if __name__ == "__main__":
+    main()


=====================================
scripts/filter_by_previous_contigs.cpp
=====================================
@@ -0,0 +1,149 @@
+// comments added a posteriori, 1 year later:
+// script that keeps reads if they match at least one kmer from a contig file
+// probably this was done for multi-k optimization? 
+
+// to compile it is a bit tricky: go to dsk repository, put that script in dsk/utils folder and use that CMakeLists file: https://github.com/GATB/dsk/blob/12de3e1f49e04f372b80773e7c4bc829c28e5657/utils/CMakeLists.txt
+
+#include <gatb/gatb_core.hpp>
+#include <fstream>
+#include <set>
+
+// We use the required packages
+using namespace std;
+
+
+static /* probably important that it's static here too */
+char revcomp (char s) {
+	if (s == 'A') return 'T';
+	else if (s == 'C') return 'G';
+	else if (s == 'G') return 'C';
+	else if (s == 'T') return 'A';
+	else if (s == 'a') return 't';
+	else if (s == 'c') return 'g';
+	else if (s == 'g') return 'c';
+	else if (s == 't') return 'a';
+	return 'X';
+}
+
+static string revcomp (const string &s) {
+	string rc;
+	for (signed int i = s.length() - 1; i >= 0; i--) {rc += revcomp(s[i]);}
+	return rc;
+}
+
+/********************************************************************************/
+class FilterByPreviousContigs : public Tool
+{
+    public:
+
+        /** */
+        FilterByPreviousContigs () : Tool ("filter_by_previous_contigs")
+    {
+        getParser()->push_front (new OptionOneParam ("-previous-contigs",   "previous contigs"  , true));
+        getParser()->push_front (new OptionOneParam ("-reads",              "reads"  , true));
+        getParser()->push_front (new OptionOneParam (STR_KMER_SIZE,  "k-mer size used to generate previous contigs"  , true));
+        getParser()->push_front (new OptionOneParam ("-out",                "output file that will contain filtered reads",           true));
+    }
+
+        /** */
+        void execute ()
+        {
+            size_t kmerSize = getInput()->getInt (STR_KMER_SIZE);
+
+            Integer::apply<Functor,Parameter> (kmerSize, Parameter(*this, kmerSize));
+        }
+
+        struct Parameter
+        {
+            Parameter (FilterByPreviousContigs& tool, size_t kmerSize) : tool(tool), kmerSize(kmerSize)  {}
+            FilterByPreviousContigs& tool;
+            size_t     kmerSize;
+        };
+
+        template<size_t span> struct Functor  {  void operator ()  (Parameter parameter)
+            {
+                FilterByPreviousContigs& tool     = parameter.tool;
+                size_t                   kmerSize = parameter.kmerSize;
+                unsigned int k = kmerSize;
+
+                typedef typename Kmer<span>::Count Count;
+
+                typename Kmer<span>::ModelCanonical model (kmerSize);
+
+                BankFasta bank(tool.getInput()->getStr("-previous-contigs"));
+                IBank* bankReads = Bank::open(tool.getInput()->getStr("-reads"));
+                BankFasta bankOut(tool.getInput()->getStr("-out"));
+
+                BankFasta::Iterator itCtg (bank);
+                set<string> kmers;
+
+                std::cout << "indexing previous contigs" << std::endl;
+
+                for (itCtg.first(); !itCtg.isDone(); itCtg.next())
+                {
+                    string seq = itCtg->toString();
+                    string kmerBegin = seq.substr(0, k );
+                    string kmerEnd = seq.substr(seq.size() - k , k );
+
+                    kmers.insert(kmerBegin);
+                    kmers.insert(kmerEnd);
+                    kmers.insert(revcomp(kmerBegin));
+                    kmers.insert(revcomp(kmerEnd));
+                }
+                
+                std::cout << "filtering reads" << std::endl;
+
+                Iterator<Sequence>* itReads = bankReads->iterator();
+                uint64_t nb_reads = 0, nb_kept = 0;
+
+                for (itReads->first(); !itReads->isDone(); itReads->next())
+                {
+                    bool contains = false;
+                    string seq = (*itReads)->toString();
+                    nb_reads++;
+                    int found_pos = 0;
+                    for (int i = 0; i < seq.size() - k; i++)
+                    {
+                        string kmer = seq.substr(i, k );
+                        contains = (kmers.find(kmer) != kmers.end());
+                        if (contains) 
+                        {
+                            found_pos = i;
+                            break;
+                        }
+                    }
+                    if (contains)
+                    {
+                        Sequence s (Data::ASCII);
+                        s.getData().setRef ((char*)seq.c_str(), seq.size());
+                        s._comment = (*itReads)->getComment() + " contains previous-k kmer at pos " + to_string(found_pos);
+                        bankOut.insert(s);
+                        nb_kept++;
+                    }
+                }
+
+                std::cout << "done filtering reads, kept " << nb_kept << "/" << nb_reads << std::endl;
+
+                /** We gather some statistics. */
+                tool.getInfo()->add (1, "stats");
+                tool.getInfo()->add (2, "kmer_size", "%ld", kmerSize);
+            }};
+};
+
+/********************************************************************************/
+/*                       Dump solid kmers in ASCII format                       */
+/********************************************************************************/
+int main (int argc, char* argv[])
+{
+    /*    try
+          {*/
+    FilterByPreviousContigs().run (argc, argv);
+    /*    }
+          catch (Exception& e)
+          {
+          std::cout << "EXCEPTION: " << e.getMessage() << std::endl;
+          return EXIT_FAILURE;
+          }*/ 
+    // gdb likes having clean exceptions
+}
+


=====================================
src/Minia.cpp
=====================================
@@ -51,7 +51,6 @@ static const char* progressFormat0 = "Minia : assembly";
 *********************************************************************/
 Minia::Minia () : Tool ("minia")
 {
-
 #ifdef GIT_SHA1
     std::cout << "Minia 3, git commit " << GIT_SHA1 << std::endl;
 #endif


=====================================
src/build_info.hpp.in
=====================================
@@ -0,0 +1,26 @@
+/*****************************************************************************
+ *   GATB : Genome Assembly Tool Box
+ *   Copyright (C) 2014  INRIA
+ *   Authors: R.Chikhi, G.Rizk, E.Drezen
+ *
+ *  This program is free software: you can redistribute it and/or modify
+ *  it under the terms of the GNU Affero General Public License as
+ *  published by the Free Software Foundation, either version 3 of the
+ *  License, or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU Affero General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Affero General Public License
+ *  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+*****************************************************************************/
+
+/* WARNING:
+ * Every source file including this header will be recompiled unconditionally
+ */
+#define STR_MINIA_VERSION     "${gatb-tool-version}"
+#define STR_MINIA_COMPILATION_FLAGS   "${gatb-core-flags}"
+#define STR_MINIA_COMPILER            "${CMAKE_C_COMPILER}  (${CMAKE_CXX_COMPILER_VERSION})"
+#define STR_MINIA_OPERATING_SYSTEM    "${CMAKE_SYSTEM}"


=====================================
src/main.cpp
=====================================
@@ -20,6 +20,7 @@
 /********************************************************************************/
 
 #include <Minia.hpp>
+#include "build_info.hpp"
 
 using namespace std;
 
@@ -27,6 +28,19 @@ using namespace std;
 
 int main (int argc, char* argv[])
 {
+
+	if(argc > 1 && (   strcmp(argv[1],"--version")==0 || strcmp(argv[1],"-v")==0 || strcmp(argv[1],"-version")==0    )     ){
+        std::cout << "Minia version " << STR_MINIA_VERSION << std::endl;
+#ifdef GIT_SHA1
+        std::cout << "git commit " << GIT_SHA1 << std::endl;
+#endif
+     	std::cout << "Using gatb-core version "<< System::info().getVersion() << std::endl;
+     	std::cout << "OS: " << STR_MINIA_OPERATING_SYSTEM << std::endl;
+        std::cout << "compiler: " << STR_MINIA_COMPILER << std::endl;
+		return EXIT_SUCCESS;
+	}
+
+
     // We define a try/catch block in case some method fails (bad filename for instance)
     try
     {


=====================================
test/test_ERR039477.sh
=====================================
@@ -19,6 +19,7 @@ fi
 # if wget is not installed, you may use "curl -O ..."
 DATA_SAMPLE="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR039/ERR039477/ERR039477.fastq.gz"
 WGET_PATH=`which wget`
+rm -f ERR039477.fastq.gz #needed otherwise wget will not overwrite and write to ERR039477.fastq.gz.1 instead
 echo ">>> Retrieving data sample: ${DATA_SAMPLE}"
 if [ ! -z "$WGET_PATH" ] ; then
   echo "    using '$WGET_PATH'..."
@@ -42,6 +43,10 @@ else
   fi
 fi
 
+echo "size and MD5 of ERR039477.fastq.gz :"
+ls -l ERR039477.fastq.gz
+md5sum ERR039477.fastq.gz
+
 ################################################################################
 # we launch minia; note that we use only one thread (no real time issues with
 # potential different results)



View it on GitLab: https://salsa.debian.org/med-team/minia/commit/578eb06a74342abef43d9afb47e9799f30e7d3c2

-- 
View it on GitLab: https://salsa.debian.org/med-team/minia/commit/578eb06a74342abef43d9afb47e9799f30e7d3c2
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20190705/29b8c922/attachment-0001.html>