[med-svn] [Git][med-team/minia][upstream] New upstream version 3.2.1+git20191130.5b131b9

Andreas Tille gitlab at salsa.debian.org
Thu Dec 5 14:26:29 GMT 2019



Andreas Tille pushed to branch upstream at Debian Med / minia


Commits:
b1b6ea2d by Andreas Tille at 2019-12-05T14:21:22Z
New upstream version 3.2.1+git20191130.5b131b9
- - - - -


6 changed files:

- CMakeLists.txt
- README.md
- merci/merci.cpp
- src/Minia.cpp
- test/ERR039477.md5
- + test/bubble_covmult0.5.fa


Changes:

=====================================
CMakeLists.txt
=====================================
@@ -8,7 +8,7 @@ cmake_minimum_required (VERSION 2.6)
 # The default version number is the latest official build
 SET (gatb-tool_VERSION_MAJOR 3)
 SET (gatb-tool_VERSION_MINOR 2)
-SET (gatb-tool_VERSION_PATCH 0)
+SET (gatb-tool_VERSION_PATCH 1)
 
 # But, it is possible to define another release number during a local build
 IF (DEFINED MAJOR)
@@ -84,6 +84,8 @@ link_directories (${gatb-core-extra-libraries-path})
 set (PROGRAM_SOURCE_DIR ${PROJECT_SOURCE_DIR}/src)
 set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
 
+cmake_policy(SET CMP0009 NEW) # fixes cmake complaining about symlinks
+
 include_directories (${PROGRAM_SOURCE_DIR})
 file (GLOB_RECURSE  ProjectFiles  ${PROGRAM_SOURCE_DIR}/*)
 add_executable(${PROJECT_NAME} ${ProjectFiles})


=====================================
README.md
=====================================
@@ -2,24 +2,27 @@
 
 [![License](http://img.shields.io/:license-affero-blue.svg)](http://www.gnu.org/licenses/agpl-3.0.en.html)
 
+<!---
 | **Linux** | **Mac OSX** |
 |-----------|-------------|
 [![Build Status](https://ci.inria.fr/gatb-core/view/Minia/job/tool-minia-build-debian7-64bits-gcc-4.7/badge/icon)](https://ci.inria.fr/gatb-core/view/Minia/job/tool-minia-build-debian7-64bits-gcc-4.7/) | [![Build Status](https://ci.inria.fr/gatb-core/view/Minia/job/tool-minia-build-macos-10.9.5-gcc-4.2.1/badge/icon)](https://ci.inria.fr/gatb-core/view/Minia/job/tool-minia-build-macos-10.9.5-gcc-4.2.1/)
+--->
 
+# Before continuing..
 
-# What is Minia ?
+If you are looking to do high-quality genome or metagenome assemblies, please go here: https://github.com/GATB/gatb-minia-pipeline This is a pipeline built on top of Minia that does a similar algorithm to metaSpades and MEGAHIT (multi-k assembly).
 
-Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet).
+# Introduction
 
-# Getting the latest source code
+Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Back when it was released, Minia produced results of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet). Now (2015 onwards), genome assemblers have evolved and in order ot have high contiguity, see the previous section. 
 
-## Requirements
+# Getting the latest source code
 
-CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html
+## Instructions
 
-C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))
+It is recommended to use download the latest binary release (Linux or OSX) there: https://github.com/GATB/minia/releases
 
-## Instructions
+Otherwise, Minia may be compiled from sources as follows:
 
     # get a local copy of minia source code
     git clone --recursive https://github.com/GATB/minia.git
@@ -28,6 +31,13 @@ C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))
     cd minia
     sh INSTALL
 
+## Requirements
+
+CMake 3.10+; see http://www.cmake.org/cmake/resources/software.html
+
+C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))
+
+
 # User manual	 
 
 Type `minia` without any arguments for usage instructions.


=====================================
merci/merci.cpp
=====================================
@@ -373,6 +373,65 @@ static bool maybe_merge(uint64_t packed, connections_index_t &connections_index,
     return true; 
 }
 
+        
+static void
+parse_unitig_header(string header, float& mean_abundance)
+{
+    bool debug = false;
+    if (debug) std::cout << "parsing unitig links for " << header << std::endl;
+    std::stringstream stream(header);
+    while(1) {
+        string tok;
+        stream >> tok;
+        if(!stream)
+            break;
+
+        if (tok.size() < 3)
+            // that's the id, skip it
+            continue;
+
+        string field = tok.substr(0,2);
+
+		if (field == "km")
+		{
+			mean_abundance = atof(tok.substr(tok.find_last_of(':')+1).c_str());
+			//std::cout << "unitig " << header << " mean abundance " << mean_abundance << std::endl;
+		}
+	}
+}
+
+
+void renumber_glue_file(string glue_filename, uint64_t nb_out_tigs)
+{
+    {
+        std::ifstream infile(glue_filename);
+        std::ofstream outfile(glue_filename+".tmp");
+        std::string line;
+        uint64_t counter = 1;
+        while (std::getline(infile, line))
+        {
+            if (line[0] == '>')
+            {
+                size_t space_pos = line.find(' ');
+                /* // yolo
+                if (space_pos >= line.size())
+                {
+                    std::cout << "error: no space in this glue file header (" << line << ") contact a developer." << std::endl;
+                    exit(1);
+                }
+                */
+                auto end_header = line.substr(space_pos);
+                string new_header = ">" + std::to_string(nb_out_tigs+counter) + end_header;
+                outfile << new_header << std::endl;
+                counter++;
+            }
+            else
+                outfile << line << std::endl;
+        }
+    } // closes files
+    file_copy(glue_filename+".tmp",glue_filename);
+    System::file().remove (glue_filename+".tmp");
+}
 
 static void 
 extend_assembly_with_connections(const string assembly, int k, int nb_threads, bool verbose, connections_index_t &connections_index, connections_t &connections, BankFasta &out, BankFasta &glue)
@@ -415,6 +474,13 @@ extend_assembly_with_connections(const string assembly, int k, int nb_threads, b
         s.getData().setRef ((char*)seq.c_str(), seq.size());
         s._comment = string(lmark?"1":"0")+string(rmark?"1":"0"); //We set the sequence comment.
         s._comment += " ";
+
+        // add coverage information 
+		float mean_abundance;
+		parse_unitig_header(comment,mean_abundance);
+		uint nb_kmers = seq.size() - k + 1;
+		for (uint i = 0; i < nb_kmers; i++)
+			s._comment += std::to_string((uint)mean_abundance) + " ";
         
         if (lmark || rmark)
             glue.insert(s); 
@@ -438,7 +504,8 @@ void merci(int k, string reads, string assembly, int nb_threads, bool verbose)
     string linked_assembly = assembly + ".linked";
     file_copy(assembly, linked_assembly);
     uint64_t nb_tigs = 0;
-    link_tigs<span>( linked_assembly, k, nb_threads, nb_tigs, verbose);
+    bool renumber_unitigs = true; // let's allow the input to be anything. Here it doesn't amtter much. We anyway renumber at the end
+    link_tigs<span>( linked_assembly, k, nb_threads, nb_tigs, verbose, renumber_unitigs);
 
     // real trick here
     // tigs of length exactly k are annoying, they need to be handled carefully with UNITIG_BOTH positions
@@ -463,11 +530,20 @@ void merci(int k, string reads, string assembly, int nb_threads, bool verbose)
     glue.flush();
    
     // glue what needs to be glued. magic, we're re-using bcalm code
-    bglue<span> (nullptr /*no storage*/, assembly+".glue", k, 0, nb_threads, verbose);
+    bglue<span> (nullptr /*no storage*/, assembly+".glue", k, 0, nb_threads, false, verbose);
+  
+    // renumber the .glue file just to avoid ID collision with .merci file
+    renumber_glue_file(assembly+".glue", nb_tigs );
     
     // append glued to merci
     out.flush();
     file_append(assembly+".merci", assembly+".glue");
+
+    // bglue drop links so let's recreate them 
+    k += 1;
+    file_copy(assembly+".merci", assembly+".merci.b4link");
+    renumber_unitigs = true; // here it's absolutely mandatory to renumber if we want the output to be processed by minia
+    link_tigs<span>( assembly+".merci", k, nb_threads, nb_tigs, verbose, false, renumber_unitigs);
 }
 
 class Merci : public gatb::core::tools::misc::impl::Tool
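
For readers of the merci/merci.cpp hunks above, here is a minimal standalone sketch of the three small steps this commit adds: reading the mean abundance from a unitig header, expanding it into one coverage value per k-mer, and renumbering FASTA headers. It is not the upstream code. The function names are hypothetical, the header layout ("<id> LN:i:.. KC:i:.. km:f:<abundance>") is an assumption, and streams stand in for the files and GATB helpers used upstream. Unlike the diff, the sketch reports a missing km field, avoids the unsigned underflow when a sequence is shorter than k, and tolerates headers without a space (upstream, line.substr(space_pos) would throw std::out_of_range in that case).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: extract the mean abundance from a unitig header,
// assumed to look like "12 LN:i:45 KC:i:300 km:f:12.5".
// Returns false and leaves mean_abundance untouched if no km field is found.
bool parse_mean_abundance(const std::string& header, float& mean_abundance)
{
    std::istringstream stream(header);
    std::string tok;
    while (stream >> tok) {
        if (tok.size() < 3) continue;                // too short to be a tagged field
        if (tok.compare(0, 2, "km") != 0) continue;  // not the abundance field
        std::size_t colon = tok.find_last_of(':');
        if (colon == std::string::npos) continue;    // malformed field, ignore it
        mean_abundance = static_cast<float>(std::atof(tok.c_str() + colon + 1));
        return true;
    }
    return false;
}

// Hypothetical helper: one truncated abundance value per k-mer of a sequence
// of length seq_len, each followed by a space. Returns an empty string when
// the sequence holds no k-mer, instead of letting seq_len - k + 1 wrap around.
std::string coverage_comment(std::size_t seq_len, std::size_t k, float mean_abundance)
{
    std::string out;
    if (k == 0 || seq_len < k) return out;
    const std::size_t nb_kmers = seq_len - k + 1;
    const float clamped = mean_abundance < 0.0f ? 0.0f : mean_abundance;
    const std::string value = std::to_string(static_cast<unsigned>(clamped));
    for (std::size_t i = 0; i < nb_kmers; ++i) {
        out += value;
        out += ' ';
    }
    return out;
}

// Hypothetical helper: copy a FASTA stream, replacing the id of every header
// line with nb_out_tigs + 1, nb_out_tigs + 2, ... and keeping the rest of the
// header. Returns the number of headers rewritten.
std::uint64_t renumber_headers(std::istream& in, std::ostream& out, std::uint64_t nb_out_tigs)
{
    std::string line;
    std::uint64_t counter = 1;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            const std::size_t space_pos = line.find(' ');
            // A header without a space carries only an id: nothing to keep.
            const std::string rest =
                (space_pos == std::string::npos) ? std::string() : line.substr(space_pos);
            out << '>' << (nb_out_tigs + counter) << rest << '\n';
            ++counter;
        } else {
            out << line << '\n';
        }
    }
    return counter - 1;
}
```

Taking streams rather than file names keeps the sketch testable without touching the filesystem. A caller would wrap it with std::ifstream and std::ofstream and then replace the original file, as renumber_glue_file does in the diff.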


=====================================
src/Minia.cpp
=====================================
@@ -154,8 +154,7 @@ struct MiniaFunctor  {  void operator ()  (Parameter parameter)
     // link contigs
     uint nb_threads = 1;  // doesn't matter because for now link_tigs is single-threaded
     bool verbose = true;
-    link_tigs<span>(output, minia.k, nb_threads, minia.nbContigs, verbose);
-
+    link_tigs<span>(output, minia.k, nb_threads, minia.nbContigs, verbose, false);
 
     /** We gather some statistics. */
     minia.getInfo()->add (1, minia.getTimeInfo().getProperties("time"));
@@ -274,8 +273,8 @@ string Minia::assemble (/*const, removed because Simplifications isn't const any
 			graphSimplifications._bulgeLen_kAdd = getInput()->getDouble("-bulge-len-kadd");
 		if (getParser()->saw("-bulge-altpath-kadd"))
 			graphSimplifications._bulgeAltPath_kAdd = getInput()->getDouble("-bulge-altpath-kadd");
-		if (getParser()->saw("-bulge-altpath-covMult"))
-			graphSimplifications._bulgeAltPath_covMult = getInput()->getDouble("-bulge-altpath-covMult");
+		if (getParser()->saw("-bulge-altpath-covmult"))
+			graphSimplifications._bulgeAltPath_covMult = getInput()->getDouble("-bulge-altpath-covmult");
 
 		if (getParser()->saw("-ec-len-kmult"))
 			graphSimplifications._ecLen_kMult = getInput()->getDouble("-ec-len-kmult");


=====================================
test/ERR039477.md5
=====================================
@@ -1,3 +1,3 @@
-3732560f98d63897d2b7a122938d7a42 # osx CI
-037b126f9e37db1db55d23eadc40477d # gcc 7 blok-bok
-c6e5a2cf1b9c6246129ae4263da749cc # debian CI
+e92d66d1e5b7450e6f6d8f6cc1de24bf # osx CI
+3192031d3491f3488a210419c50b9d4d # gcc 7 blok-bok
+dc556ec0e91c9aad6c1a68e48a2d8456 # debian CI


=====================================
test/bubble_covmult0.5.fa
=====================================
@@ -0,0 +1,16 @@
+>works well for k=21; part of genome10K.fasta
+CATCGATGCGAGACGCCTGTCGCGGGGAATTGTGGGGCGGACCACGCTCTGGCTAACGAGCTACCGTTTCCTTTAACCTGCCAGACGGTGACCAGGGCCGTTCGGCGTTGCATCGAGCGGTGTCGCTAGCGCAATGCGCAAGATTTTGACATTTACAAGGCAACATTGCAGCGTCCGATGGTCCGGTGGCCTCCAGATAGTGTCCAGTCGCTCTAACTGTATGGAGACCATAGGCATTTACCTTATTCTCATCGCCACGCCCCAAGATCTTTAGGACCCAGCATTCCTTTAACCACTAACATAACGCGTGTCATCTAGTTCAACAACC
+>that's the bubble  coverage 4
+TGTCATCTAGTTCAACAACCAAAATAACGACTCTTGCGCTCGGATGT
+>that's the bubble 
+TGTCATCTAGTTCAACAACCAAAATAACGACTCTTGCGCTCGGATGT
+>that's the bubble 
+TGTCATCTAGTTCAACAACCAAAATAACGACTCTTGCGCTCGGATGT
+>that's the bubble 
+TGTCATCTAGTTCAACAACCAAAATAACGACTCTTGCGCTCGGATGT
+>that's the bubble path 2, coverage 2
+TGTCATCTAGTTCAACAACCAAAAAAACGACTCTTGCGCTCGGATGT
+>that's the bubble  
+TGTCATCTAGTTCAACAACCAAAAAAACGACTCTTGCGCTCGGATGT
+>remaining part
+CGACTCTTGCGCTCGGATGTCCGCAATGGGTTATCCCTATGTTCCGGTAATCTCTCATCTACTAAGCGCCCTAAAGGTCGTATGGTTGGAGGGCGGTTACACACCCTTAAGTACCGAACGATAGAGCACCCGTCTAGGAGGGCGTGCAGGGTCTCCCGCTAGCTAATGGTCACGGCCTCTCTGGGAAAGCTGAACAACGGATGATACCCATACTGCCACTCCAGTACCTGGGCCGCGTGTTGTACGCTGTGTATCTTGAGAGCGTTTCCAGCAGATAGAACAGGATCACATGTACAAA



View it on GitLab: https://salsa.debian.org/med-team/minia/commit/b1b6ea2d6284fe6b58350cf106bdae8b8eb2bb7b
