[med-svn] [soapdenovo2] 03/08: New upstream version 240+dfsg1

Andreas Tille tille at debian.org
Sun Dec 4 17:08:50 UTC 2016


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository soapdenovo2.

commit 932cad1b0e2d92f9e714bb4b6f5907aef25d854a
Author: Andreas Tille <tille at debian.org>
Date:   Sun Dec 4 17:58:51 2016 +0100

    New upstream version 240+dfsg1
---
 MANUAL => README.md                | 122 +++++++++++++++++++------------------
 sparsePregraph/build_preArc.cpp    |   2 +
 sparsePregraph/inc/stdinc.h        |   1 +
 sparsePregraph/multi_threads.cpp   |   2 +
 sparsePregraph/pregraph_sparse.cpp |   2 +
 standardPregraph/orderContig.c     |   4 +-
 6 files changed, 71 insertions(+), 62 deletions(-)

diff --git a/MANUAL b/README.md
similarity index 88%
rename from MANUAL
rename to README.md
index ccc07a4..e3801a6 100644
--- a/MANUAL
+++ b/README.md
@@ -1,66 +1,52 @@
+# Manual of SOAPdenovo2
 
-Manual of SOAPdenovo-V2.04
+## What's next of SOAPdenovo2
 
-Ruibang Luo, 2012-7-10
-Zhenyu Li, 2012-7-10
+MEGAHIT is the formal successor of SOAPdenovo2
 
-**********************************************************
-Introduction
+MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
+http://www.ncbi.nlm.nih.gov/pubmed/25609793
+https://github.com/voutcn/megahit
+
+## Introduction
 
 SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for the human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost effective way.
 
-************
-System Requirement
+## System Requirement
 
 SOAPdenovo aims for large plant and animal genomes, although it also works well on bacteria and fungi genomes.  It runs on 64-bit Linux system with a minimum of 5G physical memory. For big genomes like human, about 150 GB memory would be required.
 
-************
-Update Log
-
-1) 63mer and 127mer versions were merged.
-
-2) A new module named "sparse-pregraph" was added which can reduce considerable computational comsumption.
-
-3) "Multi-kmer" method was introduced in "contig" step which allows the utilization of the advantages of small and large kmer.
-
-4) Algorithm of scaffolding was improved to get longer and more accuracy scaffolds.
-
-5) AIO (Asynchronous Input/Output) was introduced to boost the performance of reading files.
-
-6) Information for visualization purpose was available after scaffolding.
-
-7) Several bugs fixed.
-
-************
-Installation
+## Installation
 1. You can download the pre-compiled binary according to your platform, unpack using "tar -zxf  ${destination folder} download.tgz" and execute directly.
 2. Or download the source code, unpack to ${destination folder} with the method above, and compile by using GNU make with command "make" at ${destination folder}/SOAPdenovo-V2.04. Then install executable to ${destination folder}/SOAPdenovo-V2.04/bin using "make install"
 
-************
-How to use it
+## How to use it
 
-1. Configuration file
+### 1. Configuration file
 
 For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and the relevant information. "example.config" is an example of such a file.
 
-The configuration file has a section for global information, and then multiple library sections. Right now only "max_rd_len" is included in the global information section. Any read longer than max_rd_len will be cut to this length. 
+The configuration file has a section for global information, and then multiple library sections. Right now only "max_rd_len" is included in the global information section. Any read longer than max_rd_len will be cut to this length.
 
 The library information and the information of sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with tag [LIB] and includes the following items:
+
+<pre>
 1) avg_ins
    This value indicates the average insert size of this library or the peak value position in the insert size distribution figure.
 2) reverse_seq
-   This option takes value 0 or 1. It tells the assembler if the read sequences need to be complementarily reversed. 
+   This option takes value 0 or 1. It tells the assembler if the read sequences need to be complementarily reversed.
 Illumima GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with typical insert size less than 500 bp; b) reverse-forward, generated from circularizing libraries with typical insert size greater than 2 Kb. The parameter "reverse_seq" should be set to indicate this: 0, forward-reverse; 1, reverse-forward.
 3) asm_flags
    This indicator decides in which part(s) the reads are used. It takes value 1(only contig assembly), 2 (only scaffold assembly), 3(both contig and scaffold assembly), or 4 (only gap closure).
 4) rd_len_cutof
    The assembler will cut the reads from the current library to this length.
 5) rank
-   It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same "rank" are used at the same time during scaffold assembly. 
+   It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same "rank" are used at the same time during scaffold assembly.
 6) pair_num_cutoff
    This parameter is the cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds. The minimum number for paired-end reads and mate-pair reads is 3 and 5 respectively.
 7) map_len
    This takes effect in the "map" step and is the minimun alignment length between a read and a contig required for a reliable read location. The minimum length for paired-end reads and mate-pair reads is 32 and 35 respectively.
+</pre>
 
 The assembler accepts read file in three kinds of formats: FASTA, FASTQ and BAM. Mate-pair relationship could be indicated in two ways: two sequence files with reads in the same order belonging to a pair, or two adjacent reads in a single file (FASTA only) belonging to a pair. If a read in bam file fails platform/vendor quality checks(the flag field 0x0200 is set), itself and it's paired read would be ignored.
 
@@ -68,15 +54,16 @@ In the configuration file single end files are indicated by "f=/path/filename" o
 
 All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
 
-2. Get it started
+### 2. Get started
 Once the configuration file is available, a typical way to run the assembler is:
+<pre>
 ${bin} all -s config_file -K 63 -R -o graph_prefix 1>ass.log 2>ass.err
 
 User can also choose to run the assembly process step by step as:
 step1:
-${bin} pregraph -s config_file -K 63 -R -o graph_prefix 1>pregraph.log 2>pregraph.err  
+${bin} pregraph -s config_file -K 63 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
 OR
-${bin} sparse_pregraph -s config_file -K 63 -z 5000000000 -R -o graph_prefix 1>pregraph.log 2>pregraph.err 
+${bin} sparse_pregraph -s config_file -K 63 -z 5000000000 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
 
 step2:
 ${bin} contig -g graph_prefix -R 1>contig.log 2>contig.err
@@ -86,10 +73,12 @@ ${bin} map -s config_file -g graph_prefix 1>map.log 2>map.err
 
 step4:
 ${bin} scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
+</pre>
 
-3.Options
+## 3.Options
 
-3.1 Options for all (pregraph-contig-map-scaff)
+### 3.1 Options for all (pregraph-contig-map-scaff)
+<pre>
   -s <string>    configFile: the config file of solexa reads
   -o <string>    outputGraph: prefix of output graph file name
   -K <int>       kmer(min 13, max 63/127): kmer size, [23]
@@ -116,8 +105,10 @@ ${bin} scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
   -B <float>     bubbleCoverage: remove contig with lower cvoerage in bubble structure if both contigs' coverage are smaller than bubbleCoverage*avgCvg, [0.6]
   -N <int>       genomeSize: genome size for statistics, [0]
   -V (optional)  output visualization information of assembly, [NO]
+</pre>
 
-3.2 Options for sparse_pregraph
+### 3.2 Options for sparse_pregraph
+<pre>
   Usage: ./SOAPdenovo2 sparse_pregraph -s configFile -K kmer -z genomeSize -o outputGraph [-g maxKmerEdgeLength -d kmerFreqCutoff -e kmerEdgeFreqCutoff -R -r runMode -p n_cpu]
   -s <string>     configFile: the config file of solexa reads
   -K <int>        kmer(min 13, max 63/127): kmer size, [23]
@@ -129,42 +120,45 @@ ${bin} scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
   -r <int>        runMode: 0 build graph & build edge and preArc, 1 load graph by prefix & build edge and preArc, 2 build graph only, 3 build edges only, 4 build preArcs only [0]
   -p <int>        n_cpu: number of cpu for use,[8]
   -o <int>        outputGraph: prefix of output graph file name
+</pre>
 
-4. Output files
+## 4. Output files
 
-4.1 These files are output as assembly results:
-a. *.contig	
+### 4.1 These files are output as assembly results:
+<pre>
+a. *.contig
   contig sequences without using mate pair information.
-b. *.scafSeq	
+b. *.scafSeq
   scaffold sequences (final contig sequences can be extracted by breaking down scaffold sequences at gap regions).
+</pre>
 
-4.2 There are some other files that provide useful information for advanced users, which are listed in Appendix B.
+### 4.2 There are some other files that provide useful information for advanced users, which are listed in Appendix B.
 
-5. FAQ
+## 5. FAQ
 
-5.1 How to set K-mer size?
+### 5.1 How to set K-mer size?
 
 The program accepts odd numbers between 13 and 31. Larger K-mers would have higher rate of uniqueness in the genome and would make the graph simpler, but it requires deep sequencing depth and longer read length to guarantee the overlap at any genomic location.
 
 The sparse pregraph module usually needs 2-10bp smaller kmer length to achieve the same performance as the original pregraph module.
 
-5.2 How to set genome size(-z) for sparse pregraph module?
+### 5.2 How to set genome size(-z) for sparse pregraph module?
 
 The -z parameter for sparse pregraph should be set a litter larger than the real genome size, it is used to allocate memory.
 
-5.3 How to set library rank?
+### 5.3 How to set library rank?
 
 SOAPdenovo will use the pair-end libraries with insert size from smaller to larger to construct scaffolds. Libraries with the same rank would be used at the same time. For example, in a dataset of a human genome, we set five ranks for five libraries with insert size 200-bp, 500-bp, 2-Kb, 5-Kb and 10-Kb, separately. It is desired that the pairs in each rank provide adequate physical coverage of the genome.
 
-************
-APPENDIX A: an example.config 
+# APPENDIX A: an example.config
 
+<pre>
 #maximal read length
 max_rd_len=100
 [LIB]
 #average insert size
 avg_ins=200
-#if sequence needs to be reversed 
+#if sequence needs to be reversed
 reverse_seq=0
 #in which part(s) the reads are used
 asm_flags=3
@@ -220,14 +214,15 @@ f1=/path/**LIBNAMEA**/fasta_read_1.fa
 f2=/path/**LIBNAMEA**/fasta_read_2.fa
 p=/path/**LIBNAMEA**/pairs_in_one_file.fa
 b=/path/**LIBNAMEA**/reads_in_file.bam
+</pre>
 
-************
-Appendix B: output files
+# Appendix B: output files
 
-1. Output files from the command "pregraph"
+## 1. Output files from the command "pregraph"
+<pre>
    a. *.kmerFreq
       Each row shows the number of Kmers with a frequency equals the row number. Note that those peaks of frequencies which are the integral multiple of 63 are due to the data structure.
-   b. *.edge 
+   b. *.edge
       Each record gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer coverage, whether it's reverse-complementarily identical and the sequence.
    c. *.markOnEdge & *.path
       These two files are for using reads to solve small repeats.
@@ -237,26 +232,32 @@ Appendix B: output files
       Kmers at the ends of edges.
    g. *.preGraphBasic
       Some basic information about the pre-graph: number of vertex, K value, number of edges, maximum read length etc.
+</pre>
 
-2. Output files from the command "contig"
+## 2. Output files from the command "contig"
+<pre>
    a. *.contig
       Contig information: corresponding edge index, length, kmer coverage, whether it's tip and the sequence. Either a contig or its reverse complementry counterpart is included. Each reverse complementary contig index is indicated in the *.ContigIndex file.
    b. *.Arc
-      Arcs coming out of each edge and their corresponding coverage by reads 
+      Arcs coming out of each edge and their corresponding coverage by reads
    c. *.updated.edge
       Some information for each edge in graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one.
    d. *.ContigIndex
       Each record gives information about each contig in the *.contig: it's edge index, length, the index difference between its reverse-complementary counterpart and itself.
+</pre>
 
-3. Output files from the command "map"
+## 3. Output files from the command "map"
+<pre>
    a. *.peGrads
-      Information for each clone library: insert-size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning. 
+      Information for each clone library: insert-size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning.
    b. *.readOnContig
       Reads' locations on contigs. Here contigs are referred by their edge index. Howerver about half of them are not listed in the *.contig file for their reverse-complementary counterparts are included already.
    c. *.readInGap
-      This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds if "-F" is set. 
+      This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds if "-F" is set.
+</pre>
 
-4. Output files from the command "scaff"
+## 4. Output files from the command "scaff"
+<pre>
    a. *.newContigIndex
       Contigs are sorted according their length before scaffolding. Their new index are listed in this file.  This is useful if one wants to corresponds contigs in *.contig with those in *.links.
    b. *.links
@@ -275,3 +276,4 @@ Appendix B: output files
       Contigs that form bubble structures in scaffolds. Every two contigs form a bubble and the contig with higher coverage will be kept in scaffold.
    i. *.scafStatistics
       Statistic information of final scaffold and contig.
+</pre>
diff --git a/sparsePregraph/build_preArc.cpp b/sparsePregraph/build_preArc.cpp
index 7cf5c25..ad4f2ca 100644
--- a/sparsePregraph/build_preArc.cpp
+++ b/sparsePregraph/build_preArc.cpp
@@ -27,6 +27,8 @@
 #include "global.h"
 #include "multi_threads.h"
 
+#include <unistd.h>
+
 #include <math.h>
 #include "bam.h"
 #include "faidx.h"
diff --git a/sparsePregraph/inc/stdinc.h b/sparsePregraph/inc/stdinc.h
index 5ce79ed..5cd64c0 100644
--- a/sparsePregraph/inc/stdinc.h
+++ b/sparsePregraph/inc/stdinc.h
@@ -32,6 +32,7 @@
 #include <stddef.h>
 #include <time.h>
 #include <pthread.h>
+#include <unistd.h>
 using namespace std;
 
 
diff --git a/sparsePregraph/multi_threads.cpp b/sparsePregraph/multi_threads.cpp
index 6808834..5aa755d 100644
--- a/sparsePregraph/multi_threads.cpp
+++ b/sparsePregraph/multi_threads.cpp
@@ -25,6 +25,8 @@
 #include "build_graph.h"
 #include "stdinc.h"
 
+#include <unistd.h>
+
 #include "build_preArc.h"
 
 void creatThrds ( pthread_t * threads, PARAMETER * paras )
diff --git a/sparsePregraph/pregraph_sparse.cpp b/sparsePregraph/pregraph_sparse.cpp
index 53b687e..cd422f3 100644
--- a/sparsePregraph/pregraph_sparse.cpp
+++ b/sparsePregraph/pregraph_sparse.cpp
@@ -24,6 +24,8 @@
 #include "stdinc.h"
 #include "core.h"
 
+#include <unistd.h>
+
 #include "multi_threads.h"
 #include "build_graph.h"
 #include "build_edge.h"
diff --git a/standardPregraph/orderContig.c b/standardPregraph/orderContig.c
index 933058b..b79ae61 100644
--- a/standardPregraph/orderContig.c
+++ b/standardPregraph/orderContig.c
@@ -3202,7 +3202,7 @@ void ScafStat ( int len_cut, char * graphfile )
 	Singleton_Seq[Scaffold_Number] = 0;
 	Nucleotide = fgetc ( fp );
 
-	while ( Nucleotide != EOF )
+	while ( Nucleotide != (char) EOF ) /* Bug Fix */
 	{
 		if ( Nucleotide == '>' )
 		{
@@ -3529,7 +3529,7 @@ void ScafStat ( int len_cut, char * graphfile )
 	Singleton_Seq[Scaffold_Number] = 0;
 	Nucleotide = fgetc ( fp2 );
 
-	while ( Nucleotide != EOF )
+	while ( Nucleotide != (char) EOF ) /* Bug Fix */
 	{
 		if ( Nucleotide == '>' )
 		{
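
Note on the two orderContig.c hunks above: the cast to char suggests that Nucleotide is declared as char (an inference from the patch, not visible in the diff context). On platforms where plain char is unsigned, the uncast comparison with EOF can never be true, and the cast fixes that. The cast also makes a literal 0xFF byte in the input look like end-of-file. A minimal sketch of the conventional alternative keeps the return value of fgetc in an int. The helper name count_header_marks is hypothetical and not part of SOAPdenovo2.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from SOAPdenovo2: counts '>' characters
 * in a stream, reading until the real end-of-file. */
long count_header_marks(FILE *fp)
{
    long count = 0;
    int c; /* int, not char: fgetc returns an unsigned char value or EOF */

    while ((c = fgetc(fp)) != EOF) {
        if (c == '>') {
            count++;
        }
    }
    return count;
}
```

Because c is an int, a 0xFF data byte arrives as 255 and cannot be confused with EOF, so the loop ends only when the stream is exhausted or a read error occurs.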

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/soapdenovo2.git


