[med-svn] [velvetoptimiser] 01/05: New upstream version 2.2.6
Andreas Tille
tille at debian.org
Fri Sep 22 07:31:33 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository velvetoptimiser.
commit b409eb971a19dfd327f5fd44a1587370cd04fbac
Author: Andreas Tille <tille at debian.org>
Date: Fri Sep 22 09:19:30 2017 +0200
New upstream version 2.2.6
---
CHANGELOG | 8 +
README | 311 ---------------------------------
README.md | 396 ++++++++++++++++++++++++++++++++++++++++++
VelvetOpt/Assembly.pm | 6 +-
VelvetOpt/Utils.pm | 32 +++-
VelvetOpt/gwrap.pm | 9 +-
VelvetOpt/hwrap.pm | 7 +-
VelvetOptimiser.pl | 146 ++++++++--------
maketar.sh | 40 +++++
paper.bib | 258 +++++++++++++++++++++++++++
paper.md | 22 +++
test/data/test_long_R1.fq.gz | Bin 0 -> 1258403 bytes
test/data/test_long_R2.fq.gz | Bin 0 -> 1296399 bytes
test/data/test_short_R1.fq.gz | Bin 0 -> 1547180 bytes
test/data/test_short_R2.fq.gz | Bin 0 -> 1744996 bytes
test/run_test.sh | 21 +++
16 files changed, 857 insertions(+), 399 deletions(-)
diff --git a/CHANGELOG b/CHANGELOG
index c4a7bcc..f88aa1f 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -82,3 +82,11 @@ Changes since Version 2.0:
2.2.5
* Re-configured velveth input line checker to handle new velvet options and hopefully future proof it.
+
+2.2.6
+
+* Changed README to a markdown document. Thanks @andergs!
+* Fixed a findbin issue
+* Uses File::Path remove_tree instead of "rm -r" during cleanup. Thanks for the suggestion @tseemann!
+* Memory estimator can now work with fastq.gz files.
+* Memory estimator failing for -bam and -sam [Torsten]
diff --git a/README b/README
deleted file mode 100644
index 013a86c..0000000
--- a/README
+++ /dev/null
@@ -1,311 +0,0 @@
-NAME
-====
-
-VelvetOptimiser
-
-VERSION
-=======
-
-Version 2.2.5
-
-LICENCE
-=======
-
-Copyright 2009 - Simon Gladman - CSIRO & Monash University.
-
-simon.gladman at monash.edu
-
-This program is free software; you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation; either version 2 of the License, or
-(at your option) any later version.
-
-This program is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with this program; if not, write to the Free Software
-Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
-MA 02110-1301, USA.
-
-
-INTRODUCTION
-============
-
-The VelvetOptimiser is designed to run as a wrapper script for the Velvet
-assembler (Daniel Zerbino, EBI UK) and to assist with optimising the
-assembly. It searches a supplied hash value range for the optimum,
-estimates the expected coverage and then searches for the optimum coverage
-cutoff. It uses Velvet's internal mechanism for estimating insert lengths
-for paired end libraries. It can optimise the assemblies by either the
-default optimisation condition or by a user supplied one. It outputs the
-results to a subdirectory and records all its operations in a logfile.
-
-Expected coverage is estimated using the length weighted mode of the contig
-coverage in all active short columns of the stats.txt file.
-
-
-PREREQUISITES
-=============
-
-Velvet => 0.7.51
-Perl => 5.8.8
-BioPerl >= 1.4
-GNU utilities: grep sed free cut
-
-
-COMMAND LINE
-============
-
- VelvetOptimiser.pl [options] -f 'velveth input line'
-
- Options:
-
- --help This help.
- --V|version! Print version to stdout and exit.
- --v|verbose+ Verbose logging, includes all velvet output in the logfile. (default '0').
- --s|hashs=i The starting (lower) hash value (default '19').
- --e|hashe=i The end (higher) hash value (default '31').
- --x|step=i The step in hash search.. min 2, no odd numbers (default '2').
- --f|velvethfiles=s The file section of the velveth command line. (default '0').
- --a|amosfile! Turn on velvet's read tracking and amos file output. (default '0').
- --o|velvetgoptions=s Extra velvetg options to pass through. eg. -long_mult_cutoff -max_coverage etc (default '').
- --t|threads=i The maximum number of simulataneous velvet instances to run. (default '4').
- --g|genomesize=f The approximate size of the genome to be assembled in megabases.
- Only used in memory use estimation. If not specified, memory use estimation
- will not occur. If memory use is estimated, the results are shown and then program exits. (default '0').
- --k|optFuncKmer=s The optimisation function used for k-mer choice. (default 'n50').
- --c|optFuncCov=s The optimisation function used for cov_cutoff optimisation. (default 'Lbp').
- --p|prefix=s The prefix for the output filenames, the default is the date and time in the format DD-MM-YYYY-HH-MM_. (default 'auto').
- --d|dir_final=s The name of the directory to put the final output into. (default '.')
- --z|upperCovCutoff=f The maximum coverage cutoff to consider as a multiplier of the expected coverage. (default '0.8').
-
-
-Advanced!: Changing the optimisation function(s)
-
-Velvet optimiser assembly optimisation function can be built from the following variables.
- LNbp = The total number of Ns in large contigs
- Lbp = The total number of base pairs in large contigs
- Lcon = The number of large contigs
- max = The length of the longest contig
- n50 = The n50
- ncon = The total number of contigs
- tbp = The total number of basepairs in contigs
-Examples are:
- 'Lbp' = Just the total basepairs in contigs longer than 1kb
- 'n50*Lcon' = The n50 times the number of long contigs.
- 'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided
- by the total bases in all contigs plus the log of the number of bases
- in long contigs.
-
-
-
-EXAMPLES
-========
-
-Find the best assembly for a lane of Illumina single-end reads, trying k-values between 27 and 31:
-
-% VelvetOptimiser.pl -s 27 -e 31 -f '-short -fastq s_1_sequence.txt'
-
-Print an estimate of how much RAM is needed by the above command, if we use eight threads at once,
-and we estimate our assembled genome to be 4.5 megabases long:
-
-% VelvetOptimiser.pl -s 27 -e 31 -f '-short -fastq s_1_sequence.txt' -g 4.5 -t 8
-
-Find the best assembly for Illumina paired end reads just for k=31, using four threads (eg. quad core CPU),
-but optimizing for N50 for k-mer length rather than sum of large contig sizes:
-
-% VelvetOptimiser.pl -s 31 -e 31 -f '-shortPaired -fasta interleaved.fasta' -t 4 --optFuncKmer 'n50'
-
-
-DETAILED OPTIONS
-================
-
--h or --help
-
- Prints the commandline help to STDOUT.
-
--V or --version
-
- Prints the program name and version to STDOUT. Note that other information is still
- printed to STDERR. You can ignore this by redirecting the output like this:
- % VelvetOptimiser.pl --version 2> /dev/null
-
--v or --verbose
-
- Adds the full velveth and velvetg output to the logfile. (Handy for
- looking at the insert lengths and sds that Velvet has chosen for each library.)
-
--s or --hashs
-
- Parameter type required: odd integer > 0 & <= the MAXKMERLENGTH velvet was compiled with.
- Default: 19
-
- This is the lower end of the hash value range that the optimiser will search for the optimum.
- If the supplied value is even, it will be lowered by 1.
- If the supplied value is higher than MAXKMERLENGTH, it will be dropped to MAXKMERLENGTH.
-
--e or --hashe
-
- Parameter type required: odd integer >= 'hashs' & <= the MAXKMERLENGTH velvet was compiled with.
- Default: MAXKMERLENGTH
-
- This is the upper end of the hash value range that the optimiser will search for the optimum.
- If the supplied value is even, it will be lowered by 1.
- If the supplied value is higher than MAXKMERLENGTH, it will be dropped to MAXKMERLENGTH.
- If the supplied value is lower than 'hashs' then it will be set to equal 'hashs'.
-
--x or --step
-
- Parameter type required: even integer < difference between 'hashs' and 'hashe'.
- Default: 2
-
- This parameter details the number of hash values to skip each increment from 'hashs' to 'hashe' when searching for the optimum.
- If the supplied value is odd, it will be lowered by 1.
- If the supplied value is less than 2, it will be set to 2.
- If the supplied value is greater than the 'hashs' to 'hashe' range, it will be set to the range.
-
--f or --velvethfiles
-
- Parameter type required: string with '' or ""
- No default.
-
- This is a required parameter. If this option is not specified, then the optimisers usage will be displayed.
-
- You need to supply everything you would normally supply velveth at this point except for the hash size and the
- directory name in the following format.
-
- {[-file_format][-read_type] filename} repeated for as many read files as you have.
-
-
- File format options:
- -fasta
- -fastq
- -fasta.gz
- -fastq.gz
- -bam
- -sam
- -eland
- -gerald
-
- Read type options:
- -short
- -shortPaired
- -short2
- -shortPaired2
- -long
- -longPaired
-
- Examples:
-
- -f 'reads.fna'
- reads.fna is short not paired and fasta. (these are the defaults: -short and -fasta)
-
- -f '-shortPaired -fastq paired_reads.fastq -long long_reads.fna'
- Two read files supplied, first one is a paired end fastq file and the second is a long single ended read file.
-
- -f '-shortPaired paired_reads_1.fna -shortPaired2 paired_reads_2.fna'
- Two read files supplied, both are short paired fastas but come from two different libraries, therefore needing two different CATEGORIES.
-
- There is a fairly extensive checker built into the optimiser to check if the format of the string is correct. However, it won't check the read files for their format (fasta, fastq, eland etc.)
-
--a or --amosfile
-
- Turns on Velvets read tracking and amos file output.
- This option is the same as specifying '-amos_file yes -read_trkg yes' in velvetg. However, it will only be added to the final assembly and not to the intermediate ones.
-
--o or --velvetgoptions
-
- Parameter type required: string.
- No default
-
- String should contain extra options to be passed to velvetg as required such as "-long_mult_cutoff 1" or "-max_coverage 50" etc. Warning, there is no sanity check, so be careful. The optimiser will crash if you give velvetg something it doesn't handle.
-
--t or --threads
-
- Parameter type required: integer
-
- Specifies the maximum number of threads (simulataneous Velvet instances) to run. It defaults to the number of CPUs in the current computer.
-
--g or --genomesize
-
- Parameter type required: float.
- No default.
-
- This option will run the Optimiser's memory estimator. It will estimate the memory required to run Velvet with the current -f parameter and number of threads. Once the estimator has finsihed its calulations, it will display the required memory, make a recommendation and then exit the script. This is useful for determining if you will have sufficient free RAM to run the assembly before you start.
- You need to supply the approximate size of the genome you are assembling in mega bases. For example, for a Salmonella genome I would use: -g 5
-
--k or --optFuncKmer
-
- Parameter type required: string.
- Default: 'n50'
-
- This option will change the function that the Optimiser uses to find the best hash value from the given range. The default is to use the n50. I have found this function to work for me better than the previous single optimisation function, but you may wish to change it. A list of possible variables to use in your optimisation function and some examples are shown below.
-
--c or --optFuncCov
-
- Parameter type required: string.
- Default: 'Lbp'
-
- This option will change the function that the Optimiser uses to find the best hash value from the given range. The default is to use the number of basepairs in contigs greater than 1 kilobase in length. I have found this function to work for me but you may wish to change it. A list of possible variables to use in your optimisation function and some examples are shown below.
-
- Velvet optimiser assembly optimisation functions can be built from the following variables:
-
- LNbp = The total number of Ns in large contigs
- Lbp = The total number of base pairs in large contigs
- Lcon = The number of large contigs
- max = The length of the longest contig
- n50 = The n50
- ncon = The total number of contigs
- tbp = The total number of basepairs in contigs
-
- Examples are:
-
- 'Lbp' = Just the total basepairs in contigs longer than 1kb
- 'n50*Lcon' = The n50 times the number of long contigs.
- 'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided
- by the total bases in all contigs plus the log of the number of bases
- in long contigs.
-
- Be warned! The optimiser doesn't care what you supply in this string and will attempt to run anyway. If you give it a nonsensical optimisation function be prepared to receive a nonsensical assembly!
-
--p or --prefix
-
- Parameter type required: string
- Default: The current date and time in the format "DD-MM-YYYY-HH-MM-SS_"
-
- Names the logfile and the output directory with whatever prefix is supplied followed by "_logfile.txt" for the logfile and "_data_k" where k is the optimum hash value for the ouput data directory.
-
--d or --dir_final
-
- Parameter type required: string
- Default: . (the current directory)
-
- At the completion of the optimiser, any non default string will cause the final velvet output and the Velvet Optimiser logfile to be moved to the directory specified. If the directory already exists, an error is generated and the optimiser stops.
-
--z or --upperCovCutoff
-
- Parameter type required: float
- Default: 0.8
-
- Uses this fraction of the expected coverage to set the upper limit of the coverage cutoff range to search for the optimum in.
-
-BUGS
-====
-
-* None that I am aware of.
-
-TO DO
-=====
-
-* Make the auto_XXX folders be in --dir_final when set, not in the current directory.
-* Use velvetk.pl script to choose suitable -s and -e parameters.
-
-CONTACT
-=======
-
-Simon Gladman <simon.gladman at csiro.au>
-Torsten Seemann <torsten.seemann at monash.edu>
-
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..7885899
--- /dev/null
+++ b/README.md
@@ -0,0 +1,396 @@
+# VelvetOptimiser: automate your Velvet assemblies
+
+<!-- TOC depthFrom:2 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->
+
+- [Introduction](#introduction)
+- [Installation](#installation)
+ - [Dependencies](#dependencies)
+ - [Brew](#brew)
+ - [Conda](#conda)
+ - [Manual](#manual)
+- [Version](#version)
+- [Command Line](#command-line)
+- [Examples](#examples)
+- [Detailed Options](#detailed-options)
+ - [``` -h or --help```](#-h-or-help)
+ - [```-V or --version```](#-v-or-version)
+ - [```-v or --verbose```](#-v-or-verbose)
+ - [```-s or --hashs```](#-s-or-hashs)
+ - [```-e or --hashe```](#-e-or-hashe)
+ - [`-x or --step`](#-x-or-step)
+ - [`-f or --velvethfiles`](#-f-or-velvethfiles)
+ - [`-a or --amosfile`](#-a-or-amosfile)
+ - [`-o or --velvetgoptions`](#-o-or-velvetgoptions)
+ - [`-t or --threads`](#-t-or-threads)
+ - [`-g or --genomesize`](#-g-or-genomesize)
+ - [`-k or --optFuncKmer`](#-k-or-optfunckmer)
+ - [`-c or --optFuncCov`](#-c-or-optfunccov)
+ - [`-p or --prefix`](#-p-or-prefix)
+ - [`-d or --dir_final`](#-d-or-dirfinal)
+ - [`-z or --upperCovCutoff`](#-z-or-uppercovcutoff)
+- [Bugs](#bugs)
+- [To do](#to-do)
+- [Contact](#contact)
+- [License](#license)
+- [References](#references)
+
+<!-- /TOC -->
+
+## Introduction
+
+The VelvetOptimiser is designed to run as a wrapper script for the Velvet
+assembler ([Zerbino and Birney 2008](#zerbino2008)) and to assist with optimising the
+assembly. It searches a supplied hash value range for the optimum,
+estimates the expected coverage and then searches for the optimum coverage
+cutoff. It uses Velvet's internal mechanism for estimating insert lengths
+for paired end libraries. It can optimise the assemblies by either the
+default optimisation condition or by a user supplied one. It outputs the
+results to a subdirectory and records all its operations in a logfile.
+
+Expected coverage is estimated using the length weighted mode of the contig
+coverage in all active short columns of the `stats.txt` file.
+
+## Installation
+
+### Dependencies
+
+* Velvet => 1.1
+* Perl => 5.8.8
+* BioPerl >= 1.4
+* GNU utilities: grep sed free cut
+
+### Brew
+
+```
+brew install homebrew/science/velvetoptimiser
+```
+
+### Conda
+
+```
+conda install -c bioconda perl-velvetoptimiser
+```
+
+### Manual
+
+1. Download the tarball
+2. Copy it to the folder where you want to unpack it (it comes in its own folder)
+3. Add it to the path
+
+```
+# Step 1
+wget -O velvetoptimiser_2.2.6.tar.gz https://github.com/tseemann/VelvetOptimiser/releases/tag/2.2.6
+
+# Step 2
+mkdir $HOME/src && mv velvetoptimiser_2.2.6.tar.gz $HOME/src && tar zxvf velvetoptimiser_2.2.6.tar.gz
+
+# Step 3
+echo "export PATH=$PATH:$HOME/src/velvetoptimiser_2.2.6"
+```
+
+### Testing the installation
+
+You can test it with:
+
+```
+VelvetOptimiser.pl --help
+```
+
+or for a more comprehensive test:
+
+```
+cd $HOME/src/VelvetOptimiser/test
+./run_test.sh
+```
+
+<!-- ADD SOME INFORMATION HOW HOW TO INSTALL -->
+
+## Version
+
+Version 2.2.6
+
+## Command Line
+
+```
+VelvetOptimiser.pl \[options\] -f 'velveth input line'
+```
+
+```
+ Options:
+
+ --help This help.
+ --V|version! Print version to stdout and exit.
+ --v|verbose+ Verbose logging, includes all velvet output in the logfile. (default '0').
+ --s|hashs=i The starting (lower) hash value (default '19').
+ --e|hashe=i The end (higher) hash value (default '31').
+ --x|step=i The step in hash search.. min 2, no odd numbers (default '2').
+ --f|velvethfiles=s The file section of the velveth command line. (default '0').
+ --a|amosfile! Turn on velvet's read tracking and amos file output. (default '0').
+ --o|velvetgoptions=s Extra velvetg options to pass through. eg. -long_mult_cutoff -max_coverage etc (default '').
+ --t|threads=i The maximum number of simulataneous velvet instances to run. (default '4').
+ --g|genomesize=f The approximate size of the genome to be assembled in megabases.
+ Only used in memory use estimation. If not specified, memory use estimation
+ will not occur. If memory use is estimated, the results are shown and then program exits. (default '0').
+ --k|optFuncKmer=s The optimisation function used for k-mer choice. (default 'n50').
+ --c|optFuncCov=s The optimisation function used for cov_cutoff optimisation. (default 'Lbp').
+ --p|prefix=s The prefix for the output filenames, the default is the date and time in the format DD-MM-YYYY-HH-MM_. (default 'auto').
+ --d|dir_final=s The name of the directory to put the final output into. (default '.')
+ --z|upperCovCutoff=f The maximum coverage cutoff to consider as a multiplier of the expected coverage. (default '0.8').
+
+```
+
+**Advanced!**: VelvetOptimiser accepts custom optimisation functions. See [here](#-c-or-optfunccov).
+
+
+## Examples
+
+Find the best assembly for a lane of Illumina single-end reads, trying k-values between 53 and 75:
+
+```
+% VelvetOptimiser.pl -s 53 -e 75 -f '-short -fastq s_1_sequence.txt'
+```
+
+Print an estimate of how much RAM is needed by the above command, if we use eight threads at once,
+and we estimate our assembled genome to be 4.5 megabases long:
+
+```
+% VelvetOptimiser.pl -s 53 -e 75 -f '-short -fastq s_1_sequence.txt' -g 4.5 -t 8
+```
+
+Find the best assembly for Illumina paired end reads just for k=31, using four threads (eg. quad core CPU),
+but optimizing for N50 for k-mer length rather than sum of large contig sizes:
+
+```
+% VelvetOptimiser.pl -s 31 -e 31 -f '-shortPaired -fasta interleaved.fasta' -t 4 --optFuncKmer 'n50'
+```
+
+## Detailed Options
+
+### ``` -h or --help```
+
+Prints the commandline help to STDOUT.
+
+### ```-V or --version```
+
+Prints the program name and version to STDOUT.
+
+**Note** that other information is still
+printed to STDERR. You can ignore this by redirecting the output like this:
+
+ ```% VelvetOptimiser.pl --version 2> /dev/null```
+
+### ```-v or --verbose```
+
+Adds the full `velveth` and `velvet`g output to the `logfile`. (Handy for looking at the insert lengths and sds that Velvet has chosen for each library.)
+
+### ```-s or --hashs```
+
+Parameter type required: odd integer > 0 & <= the `MAXKMERLENGTH` velvet was compiled with.
+
+* Default: 19
+
+This is the lower end of the hash value range that the optimiser will search for the optimum.
+
+If the supplied value is even, it will be lowered by 1.
+
+If the supplied value is higher than `MAXKMERLENGTH`, it will be dropped to `MAXKMERLENGTH`.
+
+### ```-e or --hashe```
+
+Parameter type required: odd integer >= `hashs` & <= the `MAXKMERLENGTH` velvet was compiled with.
+
+* Default: MAXKMERLENGTH
+
+This is the upper end of the hash value range that the optimiser will search for the optimum.
+
+If the supplied value is even, it will be lowered by 1.
+
+If the supplied value is higher than `MAXKMERLENGTH`, it will be dropped to `MAXKMERLENGTH`.
+
+If the supplied value is lower than `hashs` then it will be set to equal `hashs`.
+
+### `-x or --step`
+
+Parameter type required: even integer < difference between `hashs` and `hashe`.
+
+* Default: 2
+
+This parameter details the number of hash values to skip each increment from `hashs` to `hashe` when searching for the optimum.
+
+If the supplied value is odd, it will be lowered by 1.
+
+If the supplied value is less than 2, it will be set to 2.
+
+If the supplied value is greater than the `hashs` to `hashe` range, it will be set to the range.
+
+### `-f or --velvethfiles`
+
+Parameter type required: string with '' or ""
+
+* No default.
+
+This is a required parameter. If this option is not specified, then the optimisers usage will be displayed.
+
+You need to supply everything you would normally supply `velveth` at this point except for the hash size and the directory name in the following format.
+
+{\[-file_format\]\[-read_type\] filename} repeated for as many read files as you have.
+
+
+File format options:
+- fasta
+- fastq
+- fasta.gz
+- fastq.gz
+- bam
+- sam
+- eland
+- gerald
+
+Read type options:
+- short
+- shortPaired
+- short2
+- shortPaired2
+- long
+- longPaired
+
+Examples:
+
+* `-f 'reads.fna'`
+ - `reads.fna` is short not paired and fasta. (these are the defaults: `-short` and `-fasta`)
+
+* `-f '-shortPaired -fastq paired_reads.fastq -long long_reads.fna'`
+ - Two read files supplied, first one is a paired end fastq file and the second is a long single ended read file.
+
+* `-f '-shortPaired paired_reads_1.fna -shortPaired2 paired_reads_2.fna'`
+ - Two read files supplied, both are short paired fastas but come from two different libraries, therefore needing two different CATEGORIES.
+
+There is a fairly extensive checker built into the optimiser to check if the format of the string is correct. However, it won't check the read files for their format (`fasta`, `fastq`, `eland` etc.)
+
+### `-a or --amosfile`
+
+Turns on Velvet\'s read tracking and `amos` file output.
+
+This option is the same as specifying '-amos_file yes -read_trkg yes' in velvetg. However, it will only be added to the final assembly and not to the intermediate ones.
+
+### `-o or --velvetgoptions`
+
+Parameter type required: `string`.
+
+* No default
+
+String should contain extra options to be passed to `velvetg` as required such as `-long_mult_cutoff 1` or `-max_coverage 50` etc. Warning, there is no sanity check, so be careful. The optimiser will crash if you give velvetg something it doesn't handle.
+
+### `-t or --threads`
+
+Parameter type required: `integer`
+
+* Default: Number of CPUs in the current computer.
+
+Specifies the maximum number of threads (simultaneous Velvet instances) to run.
+
+### `-g or --genomesize`
+
+Parameter type required: `float`.
+
+* No default.
+
+This option will run the Optimiser's memory estimator. It will estimate the memory required to run Velvet with the current -f parameter and number of threads. Once the estimator has finished its calculations, it will display the required memory, make a recommendation and then exit the script. This is useful for determining if you will have sufficient free RAM to run the assembly before you start.
+
+You need to supply the approximate size of the genome you are assembling in mega bases. For example, for a Salmonella genome I would use: -g 5
+
+### `-k or --optFuncKmer`
+
+Parameter type required: `string`.
+
+* Default: `n50`
+
+This option will change the function that the Optimiser uses to find the best hash value from the given range. The default is to use the `n50`. I have found this function to work for me better than the previous single optimisation function, but you may wish to change it. A list of possible variables to use in your optimisation function and some examples are shown below.
+
+### `-c or --optFuncCov`
+
+Parameter type required: `string`.
+
+* Default: `Lbp`
+
+This option will change the function that the Optimiser uses to find the best hash value from the given range. The default is to use the number of basepairs in contigs greater than 1 kilobase in length. I have found this function to work for me but you may wish to change it. A list of possible variables to use in your optimisation function and some examples are shown below.
+
+Velvet optimiser assembly optimisation functions can be built from the following variables:
+
+* `LNbp` = The total number of Ns in large contigs
+* `Lbp` = The total number of base pairs in large contigs
+* `Lcon` = The number of large contigs
+* `max` = The length of the longest contig
+* `n50` = The `n50`
+* `ncon` = The total number of contigs
+* `tbp` = The total number of basepairs in contigs
+
+Examples are:
+
+* `Lbp` = Just the total basepairs in contigs longer than 1kb
+* `n50*Lcon` = The n50 times the number of long contigs.
+* `n50*Lcon/tbp+log(Lbp)` = The n50 times the number of long contigs divided by the total bases in all contigs plus the log of the number of bases in long contigs.
+
+**Be warned!** The optimiser doesn't care what you supply in this string and will attempt to run anyway. If you give it a nonsensical optimisation function be prepared to receive a nonsensical assembly!
+
+### `-p or --prefix`
+
+Parameter type required: `string`
+
+Default: The current date and time in the format `DD-MM-YYYY-HH-MM-SS_`
+
+Names the `logfile` and the output directory with whatever prefix is supplied followed by `_logfile.txt` for the `logfile` and `_data_k` where `k` is the optimum hash value for the output data directory.
+
+### `-d or --dir_final`
+
+Parameter type required: `string`
+
+Default: . (the current directory)
+
+At the completion of the optimiser, any non default string will cause the final velvet output and the Velvet Optimiser `logfile` to be moved to the directory specified. If the directory already exists, an error is generated and the optimiser stops.
+
+### `-z or --upperCovCutoff`
+
+Parameter type required: `float`
+
+Default: 0.8
+
+Uses this fraction of the expected coverage to set the upper limit of the coverage cutoff range to search for the optimum in.
+
+## Bugs
+
+* Still none that I am aware of.
+
+## To do
+
+* Make the auto_XXX folders be in --dir_final when set, not in the current directory.
+* Use velvetk.pl script to choose suitable -s and -e parameters.
+
+## Contact
+Simon Gladman <mailto:simon.gladman at unimelb.edu.au>
+
+Torsten Seemann <mailto:torsten.seemann at unimelb.edu.au>
+
+
+## License
+
+Copyright 2008- Simon Gladman & Torsten Seemann
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+MA 02110-1301, USA.
+
+## References
+
+[](#zerbino208)Zerbino and Birney (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. **Genome Research**. 18(5):821-9. doi: doi:[10.1101/gr.074492.107](http://www.genome.org/cgi/doi/10.1101/gr.074492.107)
diff --git a/VelvetOpt/Assembly.pm b/VelvetOpt/Assembly.pm
index c2c9cef..3d25de7 100644
--- a/VelvetOpt/Assembly.pm
+++ b/VelvetOpt/Assembly.pm
@@ -1,6 +1,6 @@
# VelvetOpt::Assembly.pm
#
-# Copyright 2008,2009 Simon Gladman <simon.gladman at monash.edu>
+# Copyright 2008- Simon Gladman & Torsten Seemann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -45,11 +45,11 @@ VelvetOpt::Assembly.pm - Velvet assembly container class.
=head1 AUTHOR
-Simon Gladman, CSIRO, 2007, 2008.
+Simon Gladman & Torsten Seemann
=head1 LICENSE
-Copyright 2008, 2009 Simon Gladman <simon.gladman at csiro.au>
+Copyright 2008- Simon Gladman & Torsten Seemann
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
diff --git a/VelvetOpt/Utils.pm b/VelvetOpt/Utils.pm
index c0edbe2..9747f0c 100644
--- a/VelvetOpt/Utils.pm
+++ b/VelvetOpt/Utils.pm
@@ -18,7 +18,7 @@
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
-# Version 2.1.3
+# Version 2.2.6
# Changes for Version 2.0.1
# Added Mikael Brandstrom Durling's numCpus and freeMem for the Mac.
@@ -32,6 +32,9 @@
# Changes for Version 2.1.3
# Changed the minimum contig size to use for estimating expected coverage to 3*kmer size -1 and set the minimum coverage to 2 instead of 0.
# This should get rid of exp_covs of 1 when it should be very high for assembling reads that weren't ampped to a reference using one of the standard read mapping programs
+#
+# Changes for Version 2.2.6
+# Added the ability for the memory estimator to open fastq.gz files.
package VelvetOpt::Utils;
@@ -163,25 +166,37 @@ sub getReadSizeNum {
if(/^-fasta/){
$currentfiletype = "fasta";
}
+ elsif(/^-fastq\.gz/){
+ $currentfiletype = "fastq.gz";
+ }
elsif(/^-fastq/){
$currentfiletype = "fastq";
}
- elsif(/(-eland)|(-gerald)|(-fasta.gz)|(-fastq.gz)/) {
+ elsif(/(-eland)|(-gerald)|(-fasta.gz)|(-bam)|(-sam)|(-fmtAuto)/) {
croak "Cannot estimate memory usage from file types other than fasta or fastq..\n";
}
}
elsif(-r $_){
+ # Probably should not just use the 1st read x #reads as people trim reads
my $file = $_;
if($currentfiletype eq "fasta"){
- my $x = `grep -c "^>" $file`;
+ my $x = qx(grep -c "^>" $file);
chomp($x);
$num += $x;
my $l = &getReadLength($file, 'Fasta');
$reads{$l} += $x;
print STDERR "File: $file has $x reads of length $l\n";
}
+ elsif($currentfiletype eq "fastq.gz"){
+ my $x = qx(zgrep -c "^@" $file);
+ chomp($x);
+ $num += $x;
+ my $l = &getReadLength($file, "Fastq.gz");
+ $reads{$l} += $x;
+ print STDERR "File: $file has $x reads of length $l\n";
+ }
else {
- my $x = `grep -c "^@" $file`;
+ my $x = qx(grep -c "^@" $file);
chomp($x);
$num += $x;
my $l = &getReadLength($file, 'Fastq');
@@ -208,7 +223,14 @@ sub getReadSizeNum {
#
sub getReadLength {
my ($f, $t) = @_;
- my $sio = Bio::SeqIO->new(-file => $f, -format => $t);
+ my $sio;
+ if ($t eq "Fastq.gz"){
+ open my $zcat, "gunzip -c $f |" or croak $!;
+ $sio = Bio::SeqIO->new(-fh => $zcat, -format => "Fastq")
+ }
+ else {
+ $sio = Bio::SeqIO->new(-file => $f, -format => $t);
+ }
my $s = $sio->next_seq() or croak "Something went bad while reading file $f!\n";
return $s->length;
}
diff --git a/VelvetOpt/gwrap.pm b/VelvetOpt/gwrap.pm
index 1936a15..c6fd08a 100644
--- a/VelvetOpt/gwrap.pm
+++ b/VelvetOpt/gwrap.pm
@@ -1,6 +1,6 @@
# VelvetOpt::gwrap.pm
#
-# Copyright 2008 Simon Gladman <simon.gladman at monash.edu>
+# Copyright 2008- Simon Gladman & Torsten Seemann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -23,13 +23,14 @@ package VelvetOpt::gwrap;
VelvetOpt::gwrap.pm - Velvet graphing and assembly program wrapper module.
-=head1 AUTHOR
+=head1 AUTHORS
-Simon Gladman, CSIRO, 2007, 2008.
+Simon Gladman
+Torsten Seemann
=head1 LICENSE
-Copyright 2008 Simon Gladman <simon.gladman at csiro.au>
+Copyright 2008- Simon Gladman & Torsten Seemann
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
diff --git a/VelvetOpt/hwrap.pm b/VelvetOpt/hwrap.pm
index bd6b570..9ae65c2 100644
--- a/VelvetOpt/hwrap.pm
+++ b/VelvetOpt/hwrap.pm
@@ -1,6 +1,6 @@
# VelvetOpt::hwrap.pm
#
-# Copyright 2008 Simon Gladman <simon.gladman at monash.edu>
+# Copyright 2008- Simon Gladman & Torsten Seemann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -30,11 +30,12 @@ VelvetOpt::hwrap.pm - Velvet hashing program wrapper module.
=head1 AUTHOR
-Simon Gladman, CSIRO, 2007, 2008.
+Simon Gladman
+Torsten Seemann
=head1 LICENSE
-Copyright 2008 Simon Gladman <simon.gladman at csiro.au>
+Copyright 2008- Simon Gladman & Torsten Seemann
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
diff --git a/VelvetOptimiser.pl b/VelvetOptimiser.pl
index a76c677..05e5f94 100755
--- a/VelvetOptimiser.pl
+++ b/VelvetOptimiser.pl
@@ -1,8 +1,8 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
#
# VelvetOptimiser.pl
#
-# Copyright 2008, 2009, 2010 Simon Gladman <simon.gladman at monash.edu>
+# Copyright 2008, 2009, 2010 Simon Gladman <simon.gladman at unimelb.edu.au>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -19,7 +19,7 @@
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
-my $OptVersion = "2.2.5";
+my $OptVersion = "2.2.6";
#
# pragmas
@@ -31,7 +31,7 @@ use warnings;
#
use POSIX qw(strftime);
use FindBin;
-use lib "$FindBin::Bin";
+use lib "$FindBin::RealBin";
use threads;
use threads::shared;
use VelvetOpt::Assembly;
@@ -41,6 +41,7 @@ use VelvetOpt::Utils;
use Data::Dumper;
use Storable qw (freeze thaw);
use Getopt::Long;
+use File::Path qw (remove_tree);
#
@@ -69,6 +70,8 @@ my $ass_num = 1;
my $categories;
my $prefix;
my $OUT;
+my @error_log;
+my @memory_errors;
my $logSem : shared;
our $num_threads;
my $current_threads : shared = 0;
@@ -171,7 +174,7 @@ for(my $i = $hashs; $i <= $hashe; $i += $hashstep){
}
#check for $hashe in array..
-my $max = $hashvals[$#hashvals];
+my $max = $hashvals[-1]; # last element in array
if($max < $hashe){
push @hashvals, $hashe;
}
@@ -218,13 +221,20 @@ foreach my $hashval (@hashvals){
while($current_threads >= $num_threads){
sleep(2);
}
- if($threadfailed){
+ if($threadfailed == 1){
for my $thr (threads->list) {
print STDERR "Waiting for thread ",$thr->tid," to complete.\n";
$thr->join;
}
die "Velveth failed to run! Must be a problem with file types, check by running velveth manually or by using -v option and reading the log file.\n";
- }
+ }
+ elsif($threadfailed == 2){
+ for my $thr (threads->list) {
+ print STDERR "Waiting for thread ",$thr->tid," to complete.\n";
+ $thr->join;
+ }
+ die "Velveth failed to run because it ran out of memory. See logfile for more details and try reducing the number of threads using the -t option.\n";
+ }
$threads[$ass_num] = threads->create(\&runVelveth, $readfile, $hashval, $vhversion, \$logSem, $ass_num);
$ass_num ++;
sleep(2);
@@ -322,7 +332,7 @@ print $OUT strftime("%b %e %H:%M:%S", localtime), " Optimisation routine chosen
#now send the best assembly so far to the appropriate optimisation routine...
if($optRoute eq "shortOpt"){
-
+
&expCov($assembliesObjs{$bestId});
&covCutoff($assembliesObjs{$bestId});
@@ -384,10 +394,11 @@ else {
#delete superfluous directories..
foreach my $key(keys %assemblies){
- unless($key == $bestId){
+ unless($key == $bestId){
my $dir = $assembliesObjs{$key}->{ass_dir};
- system('rm', '-r', '--preserve-root', $dir);
- }
+ #system('rm', '-r', '--preserve-root', $dir);
+ remove_tree($dir);
+ }
}
unless ($finaldir eq "."){
rename $assembliesObjs{$bestId}->{ass_dir}, $finaldir;
@@ -407,13 +418,13 @@ sub setOptions {
use Getopt::Long;
my $num_cpus = VelvetOpt::Utils::num_cpu;
my $thmax = int($num_cpus/$thread_per_job) || 1;
-
+
@Options = (
{OPT=>"help", VAR=>\&usage, DESC=>"This help"},
{OPT=>"version!", VAR=>\$printVersion, DEFAULT=>0, DESC=>"Print version to stdout and exit."},
{OPT=>"v|verbose+", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose logging, includes all velvet output in the logfile."},
- {OPT=>"s|hashs=i", VAR=>\$hashs, DEFAULT=>19, DESC=>"The starting (lower) hash value"},
+ {OPT=>"s|hashs=i", VAR=>\$hashs, DEFAULT=>19, DESC=>"The starting (lower) hash value"},
{OPT=>"e|hashe=i", VAR=>\$hashe, DEFAULT=>$maxhash, DESC=>"The end (higher) hash value"},
{OPT=>"x|step=i", VAR=>\$hashstep, DEFAULT=>2, DESC=>"The step in hash search.. min 2, no odd numbers"},
{OPT=>"f|velvethfiles=s", VAR=>\$readfile, DEFAULT=>0, DESC=>"The file section of the velveth command line."},
@@ -439,19 +450,19 @@ sub setOptions {
${$_->{VAR}} = $_->{DEFAULT};
}
}
-
+
if ($printVersion) {
print "VelvetOptimiser $OptVersion\n";
exit 0;
}
print STDERR strftime("%b %e %H:%M:%S", localtime), " Starting to check input parameters.\n";
-
+
unless($readfile){
print STDERR "\tYou must supply the velveth parameter line in quotes. eg -f '-short .....'\n";
&usage();
}
-
+
if($hashs > $maxhash){
print STDERR "\tStart hash value too high. New start hash value is $maxhash.\n";
$hashs = $maxhash;
@@ -460,7 +471,7 @@ sub setOptions {
$hashs = $hashs - 1;
print STDERR "\tStart hash value not odd. Subtracting one. New start hash value = $hashs\n";
}
-
+
if(&isOdd($hashstep)){
print STDERR "\tOld hash step value $hashstep\n";
$hashstep --;
@@ -473,7 +484,7 @@ sub setOptions {
if($hashstep < 2){
$hashstep = 2;
print STDERR "\tHash step set below minimum of 2. New hash step value = 2\n";
- }
+ }
if($hashe > $maxhash || $hashe < 1){
print STDERR "\tEnd hash value not in workable range. New end hash value is $maxhash.\n";
$hashe = $maxhash;
@@ -486,26 +497,26 @@ sub setOptions {
$hashe = $hashe - 1;
print STDERR "\tEnd hash value not odd. Subtracting one. New end hash value = $hashe\n";
}
-
+
if($num_threads > $thmax){
print STDERR "\tWARNING: You have set the number of threads to use to a value greater than the number of available CPUs.\n";
print STDERR "\tWARNING: This may be because of the velvet compile option for OMP.\n";
}
-
+
#check the velveth parameter string..
my $vh_ok = VelvetOpt::hwrap::_checkVHString("check 21 $readfile", $categories);
unless($vh_ok){ die "Please re-start with a corrected velveth parameter string." }
-
+
print STDERR "\tVelveth parameter string OK.\n";
-
+
#test if outdir exists...
if(-d $finaldir && $finaldir ne "."){
die "Output directory $finaldir already exists, please choose a different name and restart.\n";
}
print STDERR strftime("%b %e %H:%M:%S", localtime), " Finished checking input parameters.\n";
-
+
}
sub usage {
@@ -518,7 +529,7 @@ sub usage {
print VelvetOpt::Assembly::opt_func_toString;
exit(1);
}
-
+
#----------------------------------------------------------------------
@@ -527,19 +538,19 @@ sub usage {
#
sub runVelveth{
-
+
{
lock($current_threads);
$current_threads ++;
}
-
+
my $rf = shift;
my $hv = shift;
my $vv = shift;
my $semRef = shift;
my $anum = shift;
my $assembly;
-
+
print STDERR strftime("%b %e %H:%M:%S", localtime), "\t\tRunning velveth with hash value: $hv.\n";
#make the velveth command line.
@@ -552,31 +563,43 @@ sub runVelveth{
my $vhresponse = VelvetOpt::hwrap::objectVelveth($assembly, $categories);
unless($vhresponse){ die "Velveth didn't run on hash value of $hv.\n$!\n";}
-
- unless(-r ($prefix . "_data_$hv" . "/Roadmaps")){
+
+ unless(-r ($prefix . "_data_$hv" . "/Roadmaps")){
print STDERR "Velveth failed! Response:\n$vhresponse\n";
{
lock ($threadfailed);
$threadfailed = 1;
}
}
-
+
#run the hashdetail generation routine.
$vhresponse = $assembly->getHashingDetails();
-
+
+ # check whether the assembly ran out of memory
+ @error_log = split(/\n/,$assembly->toString());
+ @memory_errors = grep(/No more memory for memory chunk/, at error_log);
+ if (scalar @memory_errors > 0) {
+ print "velveth ran out of memory while running on hash value of $hv\n";
+ print "Error: " . join(", ", at memory_errors) . "\n";
+ {
+ lock ($threadfailed);
+ $threadfailed = 2;
+ }
+ }
+
#print the objects to the log file...
{
lock($$semRef);
print $OUT $assembly->toStringNoV() if !$verbose;
print $OUT $assembly->toString() if $verbose;
}
-
+
{
lock(%assemblies);
my $ass_str = freeze($assembly);
$assemblies{$anum} = $ass_str;
}
-
+
{
lock($current_threads);
$current_threads --;
@@ -593,26 +616,26 @@ sub runVelvetg{
lock($current_threads);
$current_threads ++;
}
-
+
my $vv = shift;
my $semRef = shift;
my $anum = shift;
my $assembly;
-
+
#get back the object!
$assembly = bless thaw($assemblies{$anum}), "VelvetOpt::Assembly";
-
+
print STDERR strftime("%b %e %H:%M:%S", localtime), "\t\tRunning vanilla velvetg on hash value: " . $assembly->{hashval} . "\n";
#make the velvetg commandline.
my $vgline = $prefix . "_data_" . $assembly->{hashval};
-
+
$vgline .= " $vgoptions";
$vgline .= " -clean yes";
#save the velvetg commandline in the assembly.
$assembly->{pstringg} = $vgline;
-
+
#save the velvetg version in the assembly.
$assembly->{versiong} = $vv;
@@ -630,13 +653,13 @@ sub runVelvetg{
print $OUT $assembly->toStringNoV() if !$verbose;
print $OUT $assembly->toString() if $verbose;
}
-
+
{
lock(%assemblies);
my $ass_str = freeze($assembly);
$assemblies{$anum} = $ass_str;
}
-
+
{
lock($current_threads);
$current_threads --;
@@ -721,7 +744,7 @@ sub covCutoff{
#get the assembly score and set the current cutoff score.
my $ass_score = $ass->{assmscore};
print "In covCutOff and assembly score is: $ass_score..\n" if $interested;
-
+
sub func {
@@ -751,11 +774,11 @@ sub covCutoff{
}
$ass_score = $ass->{assmscore};
print $OUT $ass->toStringNoV();
-
+
return $ass_score;
-
+
}
-
+
print STDERR strftime("%b %e %H:%M:%S", localtime), " Beginning coverage cutoff optimisation\n";
print $OUT strftime("%b %e %H:%M:%S", localtime), " Beginning coverage cutoff optimisation\n";
@@ -772,7 +795,7 @@ sub covCutoff{
print $OUT strftime("%b %e %H:%M:%S", localtime), " Warning: Minimum coverage cutoff is set to be higher than 50% of the expected coverage. Please consider lowering minCovCutoff.\n";
print STDERR strftime("%b %e %H:%M:%S", localtime), " Warning: Minimum coverage cutoff is set to be higher than 50% of the expected coverage. Please consider lowering minCovCutoff.\n";
}
-
+
my $a = $minCovCutoff;
my $b = $upperCovCutoff * $expCov;
my $t = 0.618;
@@ -782,10 +805,10 @@ sub covCutoff{
my $fd = func($ass, $d);
my $iters = 1;
-
+
printf STDERR "\t\tLooking for best cutoff score between %.3f and %.3f\n", $a, $b;
printf $OUT "\t\tLooking for best cutoff score between %.3f and %.3f\n", $a, $b;
-
+
while(abs($a -$b) > 1){
if($fc > $fd){
printf STDERR "\t\tMax cutoff lies between %.3f & %.3f\n", $d, $b;
@@ -850,7 +873,7 @@ sub expCov {
}
$ass->{pstringg} = $vg;
-
+
print $OUT $ass->toStringNoV();
}
@@ -859,51 +882,28 @@ sub expCov {
# insLengthLong - get the Long insert length and use it in the assembly..
#
sub insLengthLong {
- print STDERR strftime("%b %e %H:%M:%S", localtime), " Getting the long insert length\n";
- print $OUT strftime("%b %e %H:%M:%S", localtime), " Getting the long insert length\n";
my $ass = shift;
my $len = "auto";
print STDERR strftime("%b %e %H:%M:%S", localtime), " Setting assembly long insert length $len\n";
print $OUT strftime("%b %e %H:%M:%S", localtime), " Setting assembly long insert length $len\n";
-
- #re-write the pstringg with the new velvetg command..
- #my $vg = $ass->{pstringg};
- #if($vg =~ /ins_length_long/){
- # $vg =~ s/ins_length_long\s+\d+/ins_length_long $len/;
- #}
- #else {
- # $vg .= " -ins_length_long $len";
- #}
}
#
# insLengthShort - get the short insert length and use it in the assembly..
#
sub insLengthShort {
- print STDERR strftime("%b %e %H:%M:%S", localtime), " Setting the short insert length\n";
- print $OUT strftime("%b %e %H:%M:%S", localtime), " Setting the short insert length\n";
my $ass = shift;
my $len = "auto";
print STDERR strftime("%b %e %H:%M:%S", localtime), " Setting assembly short insert length(s) to $len\n";
print $OUT strftime("%b %e %H:%M:%S", localtime), " Setting assembly short insert length(s) to $len\n";
-
- #re-write the pstringg with the new velvetg command..
- #my $vg = $ass->{pstringg};
- #if($vg =~ /ins_length /){
- # $vg =~ s/ins_length\s+\d+/ins_length $len/;
- #}
- #else {
- # $vg .= " -ins_length $len";
- #}
- #$ass->{pstringg} = $vg;
}
#
-# estMemUse - estimates the memory usage from
+# estMemUse - estimates the memory usage from
#
sub estMemUse {
-
+
my $max_runs = @hashvals;
my $totmem = 0;
#get the read lengths and the number of reads...
diff --git a/maketar.sh b/maketar.sh
new file mode 100755
index 0000000..3d96a57
--- /dev/null
+++ b/maketar.sh
@@ -0,0 +1,40 @@
+#!/bin/bash
+
+DEST=/tmp
+echo "Producing distributable .tar.gz for VelvetOptimizer in $DEST ..."
+
+CURRENTDIR=$(pwd)
+echo Current directory $CURRENTDIR
+#VERSION=$(grep "my \$OptVersion" VelvetOptimiser.pl | sed "s/my \$OptVersion = \"//" | sed "s/\";//")
+
+VERSION=$(./VelvetOptimiser.pl --version 2> /dev/null | sed 's/^.* //')
+echo "Determined version: $VERSION"
+if [ "$VERSION" == "" ]; then
+ echo "ERROR: could not get version!"
+ exit -1
+fi
+
+TARDIR="$DEST/VelvetOptimiser-$VERSION"
+if [ -d "$TARDIR" ]; then
+ rm -r $TARDIR
+fi
+if [ -f "$TARDIR.tar.gz" ]; then
+ rm "$TARDIR.tar.gz"
+fi
+
+echo "Copying files..."
+mkdir $TARDIR
+cp -v -r VelvetOptimiser.pl LICENSE README INSTALL CHANGELOG VelvetOpt $TARDIR
+cd $DEST
+
+echo "Tarring..."
+tar -zcvf VelvetOptimiser-$VERSION.tar.gz VelvetOptimiser-$VERSION
+rm -r $TARDIR
+
+echo "Checking..."
+tar ztf "$TARDIR.tar.gz"
+
+cd $CURRENTDIR
+echo "Archive is here: $TARDIR.tar.gz"
+ls -lsa "$TARDIR.tar.gz"
+
diff --git a/paper.bib b/paper.bib
new file mode 100644
index 0000000..15ffde3
--- /dev/null
+++ b/paper.bib
@@ -0,0 +1,258 @@
+% Generated by Paperpile. Check out http://paperpile.com for more information.
+% BibTeX export options can be customized via Settings -> BibTeX.
+
+ at ARTICLE{Zerbino2008-kh,
+ title = "Velvet: algorithms for de novo short read assembly using de
+ Bruijn graphs",
+ author = "Zerbino, Daniel R and Birney, Ewan",
+ affiliation = "EMBL-European Bioinformatics Institute, Wellcome Trust Genome
+ Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.",
+ journal = "Genome Res.",
+ volume = 18,
+ number = 5,
+ pages = "821--829",
+ month = may,
+ year = 2008,
+ language = "en"
+}
+
+ at ARTICLE{Zerbino2009-pt,
+ title = "Pebble and rock band: heuristic resolution of repeats and
+ scaffolding in the velvet short-read de novo assembler",
+ author = "Zerbino, Daniel R and McEwen, Gayle K and Margulies, Elliott H
+ and Birney, Ewan",
+ affiliation = "European Bioinformatics Institute, Wellcome Trust Genome
+ Campus, Hinxton, Cambridge, UK. zerbino at ebi.ac.uk",
+ journal = "PLoS One",
+ volume = 4,
+ number = 12,
+ pages = "e8407",
+ month = "22~" # dec,
+ year = 2009,
+ language = "en"
+}
+
+ at ARTICLE{Chewapreecha2014-on,
+ title = "Dense genomic sampling identifies highways of pneumococcal
+ recombination",
+ author = "Chewapreecha, Claire and Harris, Simon R and Croucher,
+ Nicholas J and Turner, Claudia and Marttinen, Pekka and Cheng,
+ Lu and Pessia, Alberto and Aanensen, David M and Mather,
+ Alison E and Page, Andrew J and Salter, Susannah J and Harris,
+ David and Nosten, Francois and Goldblatt, David and Corander,
+ Jukka and Parkhill, Julian and Turner, Paul and Bentley,
+ Stephen D",
+ affiliation = "The Wellcome Trust Sanger Institute, Wellcome Trust Genome
+ Campus, Hinxton, Cambridge, CB10 1SA, UK. Department of
+ Infectious Disease Epidemiology, Imperial College London, St.
+ Mary's Hospital, London, W2 1PG, UK. Shoklo Malaria Research
+ Unit, Mahidol-Oxford Tropical Medicine Research Unit, Faculty
+ of Tropical Medicine, Mahidol University, Maesot 63110,
+ Thailand. Cambodia-Oxford Medical Research Unit, Angkor
+ Hospital for Children, Siem Reap, Cambodia. Centre for
+ Tropical Medicine, Nuffield Department of Medicine, University
+ of Oxford, Oxford, OX3 7LJ, UK. Helsinki Institute for
+ Information Technology HIIT, Department of Information and
+ Computer Science, Aalto University, 00076, Finland. Department
+ of Mathematics and Statistics, University of Helsinki, 00014,
+ Finland. Immunobiology Unit, Institute of Child Health,
+ University College London, WC1N 1EH, UK. Department of
+ Medicine, University of Cambridge, Addenbrooke's Hospital,
+ Cambridge, CB2 0QQ, UK.",
+ journal = "Nat. Genet.",
+ volume = 46,
+ number = 3,
+ pages = "305--309",
+ month = mar,
+ year = 2014,
+ language = "en"
+}
+
+ at ARTICLE{Wong2015-fl,
+ title = "Phylogeographical analysis of the dominant multidrug-resistant
+ {H58} clade of Salmonella Typhi identifies inter- and
+ intracontinental transmission events",
+ author = "Wong, Vanessa K and Baker, Stephen and Pickard, Derek J and
+ Parkhill, Julian and Page, Andrew J and Feasey, Nicholas A and
+ Kingsley, Robert A and Thomson, Nicholas R and Keane,
+ Jacqueline A and Weill, Fran{\c c}ois-Xavier and Edwards,
+ David J and Hawkey, Jane and Harris, Simon R and Mather,
+ Alison E and Cain, Amy K and Hadfield, James and Hart, Peter J
+ and Thieu, Nga Tran Vu and Klemm, Elizabeth J and Glinos,
+ Dafni A and Breiman, Robert F and Watson, Conall H and
+ Kariuki, Samuel and Gordon, Melita A and Heyderman, Robert S
+ and Okoro, Chinyere and Jacobs, Jan and Lunguya, Octavie and
+ Edmunds, W John and Msefula, Chisomo and Chabalgoity, Jose A
+ and Kama, Mike and Jenkins, Kylie and Dutta, Shanta and Marks,
+ Florian and Campos, Josefina and Thompson, Corinne and Obaro,
+ Stephen and MacLennan, Calman A and Dolecek, Christiane and
+ Keddy, Karen H and Smith, Anthony M and Parry, Christopher M
+ and Karkey, Abhilasha and Mulholland, E Kim and Campbell,
+ James I and Dongol, Sabina and Basnyat, Buddha and Dufour,
+ Muriel and Bandaranayake, Don and Naseri, Take Toleafoa and
+ Singh, Shalini Pravin and Hatta, Mochammad and Newton, Paul
+ and Onsare, Robert S and Isaia, Lupeoletalalei and Dance,
+ David and Davong, Viengmon and Thwaites, Guy and Wijedoru,
+ Lalith and Crump, John A and De Pinna, Elizabeth and Nair,
+ Satheesh and Nilles, Eric J and Thanh, Duy Pham and Turner,
+ Paul and Soeng, Sona and Valcanis, Mary and Powling, Joan and
+ Dimovski, Karolina and Hogg, Geoff and Farrar, Jeremy and
+ Holt, Kathryn E and Dougan, Gordon",
+ affiliation = "1] Wellcome Trust Sanger Institute, Hinxton, UK. [2]
+ Department of Microbiology, Addenbrooke's Hospital, Cambridge
+ University Hospitals National Health Service (NHS) Foundation
+ Trust, Cambridge, UK. 1] Hospital for Tropical Diseases,
+ Wellcome Trust Major Overseas Programme, Oxford University
+ Clinical Research Unit, Ho Chi Minh City, Vietnam. [2] Centre
+ for Tropical Medicine and Global Health, Nuffield Department
+ of Clinical Medicine, Oxford University, Oxford, UK. [3]
+ Department of Infectious and Tropical Diseases, London School
+ of Hygiene and Tropical Medicine, London, UK. Wellcome Trust
+ Sanger Institute, Hinxton, UK. Liverpool School of Tropical
+ Medicine, Liverpool, UK. 1] Wellcome Trust Sanger Institute,
+ Hinxton, UK. [2] Institute of Food Research, Norwich Research
+ Park, Norwich, UK. 1] Wellcome Trust Sanger Institute,
+ Hinxton, UK. [2] Department of Infectious and Tropical
+ Diseases, London School of Hygiene and Tropical Medicine,
+ London, UK. Institut Pasteur, Unit{\'e} des Bact{\'e}ries
+ Pathog{\`e}nes Ent{\'e}riques, Paris, France. Department of
+ Biochemistry and Molecular Biology, Bio21 Molecular Science
+ and Biotechnology Institute, University of Melbourne,
+ Melbourne, Victoria, Australia. 1] Department of Biochemistry
+ and Molecular Biology, Bio21 Molecular Science and
+ Biotechnology Institute, University of Melbourne, Melbourne,
+ Victoria, Australia. [2] Faculty of Veterinary and
+ Agricultural Sciences, University of Melbourne, Melbourne,
+ Victoria, Australia. Institute of Biomedical Research, School
+ of Immunity and Infection, College of Medicine and Dental
+ Sciences, University of Birmingham, Birmingham, UK. Hospital
+ for Tropical Diseases, Wellcome Trust Major Overseas
+ Programme, Oxford University Clinical Research Unit, Ho Chi
+ Minh City, Vietnam. 1] Kenya Medical Research Institute
+ (KEMRI), Nairobi, Kenya. [2] Centers for Disease Control and
+ Prevention, Atlanta, Georgia, USA. [3] Emory Global Health
+ Institute, Atlanta, Georgia, USA. Centre for the Mathematical
+ Modelling of Infectious Diseases, Department of Infectious
+ Disease Epidemiology, London School of Hygiene and Tropical
+ Medicine, London, UK. 1] Wellcome Trust Sanger Institute,
+ Hinxton, UK. [2] Kenya Medical Research Institute (KEMRI),
+ Nairobi, Kenya. Institute of Infection and Global Health,
+ University of Liverpool, Liverpool, UK. Malawi-Liverpool
+ Wellcome Trust Clinical Research Programme, College of
+ Medicine, University of Malawi, Blantyre, Malawi. 1]
+ Department of Clinical Sciences, Institute of Tropical
+ Medicine, Antwerp, Belgium. [2] Department of Microbiology and
+ Immunology, Katholieke Universiteit (KU) Leuven, University of
+ Leuven, Leuven, Belgium. 1] National Institute for Biomedical
+ Research, Kinshasa, Democratic Republic of the Congo. [2]
+ University Hospital of Kinshasa, Kinshasa, Democratic Republic
+ of the Congo. 1] Malawi-Liverpool Wellcome Trust Clinical
+ Research Programme, College of Medicine, University of Malawi,
+ Blantyre, Malawi. [2] Microbiology Department, College of
+ Medicine, University of Malawi, Blantyre, Malawi. Departamento
+ de Desarrollo Biotecnol{\'o}gico, Instituto de Higiene,
+ Facultad de Medicina, Montevideo, Uruguay. Ministry of Health,
+ Suva, Fiji. Fiji Health Sector Support Program, Suva, Fiji.
+ National Institute of Cholera and Enteric Diseases, Kolkata,
+ India. International Vaccine Institute, Department of
+ Epidemiology, Seoul, Republic of Korea. Enteropathogen
+ Division, Administraci{\'o}n Nacional de Laboratorios e
+ Institutos de Salud (ANLIS) Carlos G. Malbran Institute,
+ Buenos Aires, Argentina. 1] Hospital for Tropical Diseases,
+ Wellcome Trust Major Overseas Programme, Oxford University
+ Clinical Research Unit, Ho Chi Minh City, Vietnam. [2] Centre
+ for Tropical Medicine and Global Health, Nuffield Department
+ of Clinical Medicine, Oxford University, Oxford, UK. 1]
+ Division of Pediatric Infectious Diseases, University of
+ Nebraska Medical Center, Omaha, Nebraska, USA. [2] University
+ of Abuja Teaching Hospital, Abuja, Nigeria. [3] Bingham
+ University, Karu, Nigeria. 1] Wellcome Trust Sanger Institute,
+ Hinxton, UK. [2] Institute of Biomedical Research, School of
+ Immunity and Infection, College of Medicine and Dental
+ Sciences, University of Birmingham, Birmingham, UK. [3]
+ Novartis Vaccines Institute for Global Health, Siena, Italy.
+ Centre for Tropical Medicine and Global Health, Nuffield
+ Department of Clinical Medicine, Oxford University, Oxford,
+ UK. Centre for Enteric Diseases, National Institute for
+ Communicable Diseases, Division in the National Health
+ Laboratory Service, University of the Witwatersrand,
+ Johannesburg, South Africa. 1] Department of Clinical
+ Research, London School of Hygiene and Tropical Medicine,
+ London, UK. [2] Graduate School of Tropical Medicine and
+ Global Health, Nagasaki University, Nagasaki, Japan. Patan
+ Academy of Health Sciences, Wellcome Trust Major Overseas
+ Programme, Oxford University Clinical Research Unit,
+ Kathmandu, Nepal. 1] Department of Infectious and Tropical
+ Diseases, London School of Hygiene and Tropical Medicine,
+ London, UK. [2] Murdoch Childrens Research Institute,
+ Melbourne, Victoria, Australia. Enteric and Leptospira
+ Reference Laboratory, Institute of Environmental Science and
+ Research, Ltd. (ESR), Porirua, New Zealand. National Centre
+ for Biosecurity and Infectious Disease, Institute of
+ Environmental Science and Research, Porirua, New Zealand.
+ Samoa Ministry of Health, Apia, Samoa. National Influenza
+ Center, World Health Organization, Center for Communicable
+ Disease Control, Suva, Fiji. Department of Microbiology,
+ Hasanuddin University, Makassar, Indonesia. 1] Centre for
+ Tropical Medicine and Global Health, Nuffield Department of
+ Clinical Medicine, Oxford University, Oxford, UK. [2] Lao
+ Oxford Mahosot Wellcome Trust Research Unit, Microbiology
+ Laboratory, Mahosot Hospital, Vientiane, Laos. Kenya Medical
+ Research Institute (KEMRI), Nairobi, Kenya. National Health
+ Services, Tupua Tamasese Meaole Hospital, Apia, Samoa. Lao
+ Oxford Mahosot Wellcome Trust Research Unit, Microbiology
+ Laboratory, Mahosot Hospital, Vientiane, Laos. 1]
+ Mahidol-Oxford Tropical Medicine Research Unit, Faculty of
+ Tropical Medicine, Mahidol University, Bangkok, Thailand. [2]
+ Paediatric Emergency Medicine, Chelsea and Westminster
+ Hospital, London, UK. Centre for International Health and
+ Otago International Health Research Network, Dunedin School of
+ Medicine, University of Otago, Dunedin, New Zealand.
+ Salmonella Reference Service, Public Health England,
+ Colindale, London, UK. Emerging Disease Surveillance and
+ Response, Division of Pacific Technical Support, World Health
+ Organization, Suva, Fiji. 1] Centre for Tropical Medicine and
+ Global Health, Nuffield Department of Clinical Medicine,
+ Oxford University, Oxford, UK. [2] Mahidol-Oxford Tropical
+ Medicine Research Unit, Faculty of Tropical Medicine, Mahidol
+ University, Bangkok, Thailand. [3] Cambodia-Oxford Medical
+ Research Unit, Angkor Hospital for Children, Siem Reap,
+ Cambodia. Cambodia-Oxford Medical Research Unit, Angkor
+ Hospital for Children, Siem Reap, Cambodia. Microbiological
+ Diagnostic Unit-Public Health Laboratory, Department of
+ Microbiology and Immunology at the Peter Doherty Institute for
+ Infection and Immunity, University of Melbourne, Melbourne,
+ Victoria, Australia.",
+ journal = "Nat. Genet.",
+ volume = 47,
+ number = 6,
+ pages = "632--639",
+ month = jun,
+ year = 2015,
+ language = "en"
+}
+
+ at ARTICLE{Page2016-zp,
+ title = "Robust high-throughput prokaryote de novo assembly and
+ improvement pipeline for Illumina data",
+ author = "Page, Andrew J and De Silva, Nishadi and Hunt, Martin and
+ Quail, Michael A and Parkhill, Julian and Harris, Simon R and
+ Otto, Thomas D and Keane, Jacqueline A",
+ affiliation = "Pathogen Informatics, Wellcome Trust Sanger Institute ,
+ Wellcome Genome Campus, Hinxton, CB10 1SA, Cambridgeshire ,
+ UK. Biochemical Development, Wellcome Trust Sanger Institute ,
+ Wellcome Genome Campus, Hinxton, CB10 1SA, Cambridgeshire ,
+ UK. Pathogen Genomics, Wellcome Trust Sanger Institute ,
+ Wellcome Genome Campus, Hinxton, CB10 1SA, Cambridgeshire ,
+ UK. Parasite Genomics, Wellcome Trust Sanger Institute ,
+ Wellcome Genome Campus, Hinxton, CB10 1SA, Cambridgeshire ,
+ UK.",
+ journal = "Microb Genom",
+ volume = 2,
+ number = 8,
+ pages = "e000083",
+ month = aug,
+ year = 2016,
+ keywords = "assembly; high-throughput; illumina; prokaryotic",
+ language = "en"
+}
diff --git a/paper.md b/paper.md
new file mode 100644
index 0000000..85bc1a3
--- /dev/null
+++ b/paper.md
@@ -0,0 +1,22 @@
+---
+title: 'Velvet Optimiser: automatic optimisation of Velvet assembly parameters'
+tags:
+ - bioinformatics
+ - de novo assembly
+authors:
+ - name: Simon Gladman
+ orcid: 0000-0002-6100-4385
+ affiliation: Melbourne Bioinformatics, University of Melbourne
+ - name: Torsten Seemann
+ orcid: 0000-0001-6046-610X
+ affiliation: Melbourne Bioinformatics, University of Melbourne
+
+date: 27 Jul 2017
+bibliography: paper.bib
+---
+
+# Summary
+
+The Velvet assembler, like several other next generation sequencing assembly applications, assemble de novo millions of short reads using de Bruijn graphs (Zerbino and Birney 2008; Zerbino et al. 2009). A de Bruijn graph is a directed graph where nodes consist of overlapping sub-sequences of reads (k-mers). After construction and simplification of the de Bruijn graph, contiguous DNA sequences (contigs) are "read off" the graph by tracing a set of paths in a determined order. These cont [...]
+
+# References
diff --git a/test/data/test_long_R1.fq.gz b/test/data/test_long_R1.fq.gz
new file mode 100644
index 0000000..139cadc
Binary files /dev/null and b/test/data/test_long_R1.fq.gz differ
diff --git a/test/data/test_long_R2.fq.gz b/test/data/test_long_R2.fq.gz
new file mode 100644
index 0000000..024bdbb
Binary files /dev/null and b/test/data/test_long_R2.fq.gz differ
diff --git a/test/data/test_short_R1.fq.gz b/test/data/test_short_R1.fq.gz
new file mode 100644
index 0000000..cdeb52a
Binary files /dev/null and b/test/data/test_short_R1.fq.gz differ
diff --git a/test/data/test_short_R2.fq.gz b/test/data/test_short_R2.fq.gz
new file mode 100644
index 0000000..49ce9f7
Binary files /dev/null and b/test/data/test_short_R2.fq.gz differ
diff --git a/test/run_test.sh b/test/run_test.sh
new file mode 100755
index 0000000..b35161d
--- /dev/null
+++ b/test/run_test.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+echo "Running VelvetOptimiser on short test data."
+
+VelvetOptimiser.pl -s 55 -e 59 -f "-shortPaired -fastq.gz -separate data/test_short_R1.fq.gz data/test_short_R2.fq.gz" -d test_out
+
+echo "Number of contigs in final assembly:"
+
+grep -c ">" test_out/contigs.fa
+
+echo "Running VelvetOptimiser on short and long test data."
+
+VelvetOptimiser.pl -s 43 -e 91 -x 4 -f "-shortPaired -fastq.gz -separate data/test_short_R1.fq.gz data/test_short_R2.fq.gz -longPaired -fastq.gz -separate data/test_long_R1.fq.gz data/test_long_R2.fq.gz" -d test_out_2 -o "-long_mult_cutoff 3"
+
+echo "Number of contigs in final assembly:"
+
+grep -c ">" test_out_2/contigs.fa
+
+echo "Estimation of memory test:"
+
+VelvetOptimiser.pl -s 55 -e 59 -f "-shortPaired -fastq.gz -separate mutant_R1.fq.gz mutant_R2.fq.gz -longPaired -fastq.gz -separate data/test_long_R1.fq.gz data/test_long_R2.fq.gz" -g 0.2
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/velvetoptimiser.git
More information about the debian-med-commit
mailing list