[med-svn] r19844 - in trunk/packages/sortmerna/trunk/debian: . doc_source
Andreas Tille
tille at moszumanska.debian.org
Wed Aug 5 14:16:06 UTC 2015
Author: tille
Date: 2015-08-05 14:16:06 +0000 (Wed, 05 Aug 2015)
New Revision: 19844
Add missing source of documentation
Added: trunk/packages/sortmerna/trunk/debian/README.source
--- trunk/packages/sortmerna/trunk/debian/README.source (rev 0)
+++ trunk/packages/sortmerna/trunk/debian/README.source 2015-08-05 14:16:06 UTC (rev 19844)
@@ -0,0 +1,19 @@
+Some scripts require python-skbio, but they are compatible
+with version 0.2.3 and NOT with the version currently in Debian,
+which is 0.4.0.
+You can check this by running
+$ python tests/test_sortmerna.py
+Traceback (most recent call last):
+ File "tests/test_sortmerna.py", line 16, in <module>
+ from skbio.parse.sequences import parse_fasta
+ImportError: No module named parse.sequences
+The module interface changed between python-skbio versions
+0.2.3 and 0.4.0.
+As a consequence the test is not run. If you succeed in
+adapting the code, please add the Build-Depends python-skbio and
Modified: trunk/packages/sortmerna/trunk/debian/copyright
--- trunk/packages/sortmerna/trunk/debian/copyright 2015-08-05 12:59:19 UTC (rev 19843)
+++ trunk/packages/sortmerna/trunk/debian/copyright 2015-08-05 14:16:06 UTC (rev 19844)
@@ -10,6 +10,43 @@
University of Colorado at Boulder, Boulder, CO
jenya.kopylov at gmail.com, laurent.noe at lifl.fr, helene.touzet at lifl.fr
License: LGPL-3+
+Files: alp/*
+Copyright: John Spouge, Sergey Sheetlin
+License: PublicDomain
+ National Center for Biotechnology Information
+ .
+ This software/database is a "United States Government Work" under the
+ terms of the United States Copyright Act. It was written as part of
+ the author's official duties as a United States Government employee and
+ thus cannot be copyrighted. This software/database is freely available
+ to the public for use. The National Library of Medicine and the U.S.
+ Government have not placed any restriction on its use or reproduction.
+ .
+ Although all reasonable efforts have been taken to ensure the accuracy
+ and reliability of the software and data, the NLM and the U.S.
+ Government do not and cannot warrant the performance or results that
+ may be obtained by using this software or data. The NLM and the U.S.
+ Government disclaim all warranties, express or implied, including
+ warranties of performance, merchantability or fitness for any particular
+ purpose.
+ .
+ Please cite the author in any work or product based on this material.
+Files: SortMeRNA-User-Manual-2.0.pdf
+Copyright: 2014 Evguenia Kopylova <jenya.kopylov at gmail.com>
+License: LGPL-3+
+Comment: The source for this file was obtained from Git and is
+ available in debian/doc_source.
+ It will be included in the next upstream release.
+Files: debian/*
+Copyright: 2015 Tim Booth <tbooth at ceh.ac.uk>
+ 2015 Andreas Tille <tille at debian.org>
+License: LGPL-3+
+License: LGPL-3+
SortMeRNA is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
@@ -46,27 +83,3 @@
-Files: alp/*
-Copyright: John Spouge, Sergey Sheetlin
-License: PublicDomain
- National Center for Biotechnology Information
- .
- This software/database is a "United States Government Work" under the
- terms of the United States Copyright Act. It was written as part of
- the author's official duties as a United States Government employee and
- thus cannot be copyrighted. This software/database is freely available
- to the public for use. The National Library of Medicine and the U.S.
- Government have not placed any restriction on its use or reproduction.
- .
- Although all reasonable efforts have been taken to ensure the accuracy
- and reliability of the software and data, the NLM and the U.S.
- Government do not and cannot warrant the performance or results that
- may be obtained by using this software or data. The NLM and the U.S.
- Government disclaim all warranties, express or implied, including
- warranties of performance, merchantability or fitness for any particular
- purpose.
- .
- Please cite the author in any work or product based on this material.
Added: trunk/packages/sortmerna/trunk/debian/doc_source/SortMeRNA-User-Manual-2.0.tex
--- trunk/packages/sortmerna/trunk/debian/doc_source/SortMeRNA-User-Manual-2.0.tex (rev 0)
+++ trunk/packages/sortmerna/trunk/debian/doc_source/SortMeRNA-User-Manual-2.0.tex 2015-08-05 14:16:06 UTC (rev 19844)
@@ -0,0 +1,996 @@
+\usepackage{amsmath, amsthm, amssymb}
+ colorlinks,
+ citecolor=black,
+ filecolor=black,
+ linkcolor=black,
+ urlcolor=black
+\usepackage{tikz} % graphic diagrams
+\usetikzlibrary{positioning,patterns,backgrounds,decorations.pathreplacing,decorations.markings,shapes,fit,calc,shadows} % fitting shapes to coordinates
+\title{SortMeRNA User Manual}
+\author{Evguenia Kopylova\\ {\em jenya.kopylov at gmail.com}}
+\date{Oct 2014, version 2.0}
+Copyright (C) 2012-2015 Bonsai Bioinformatics Research Group \\
+(LIFL - Universit\'{e} Lille 1), CNRS UMR 8022, INRIA Nord-Europe \\
+\url{http://bioinfo.lifl.fr/RNA/sortmerna/} \\
+OTU-picking extensions and continuous support developed in the Knight Lab, \\
+BioFrontiers Institute, University of Colorado at Boulder, CO \\
+SortMeRNA is a local sequence alignment tool for filtering, mapping and OTU-picking.
+The core algorithm is based on approximate seeds and allows for fast and sensitive analyses
+of NGS reads. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data.
+Additional applications include OTU-picking and taxonomy assignment available through QIIME v1.9+ (\url{http://qiime.org}, currently the development version to be released in early December).
+SortMeRNA takes as input a file of reads (fasta or fastq format) and one or multiple rRNA
+database file(s), and sorts apart aligned and rejected reads into two files specified by the user.
+SortMeRNA works with Illumina, 454, Ion Torrent and PacBio data, and can produce SAM and
+BLAST-like alignments.
+For questions \& help, please contact:
+ 1. Evguenia Kopylova evguenia.kopylova at lifl.fr
+ 2. Laurent Noe laurent.noe at lifl.fr
+ 3. Helene Touzet helene.touzet at lifl.fr
+{\bf Important:} This user manual is strictly for SortMeRNA version 2.0.
+\caption{\texttt{sortmerna-2.0} directory tree}~\\
+\tikzstyle{every node}=[draw=black,thick,anchor=west]
+ grow via three points={one child at (0.5,-0.7) and
+ two children at (0.5,-0.7) and (0.5,-1.4)},
+ edge from parent path={(\tikzparentnode.south) |- (\tikzchildnode.west)}]
+ \node [root] {sortmerna-2.0}
+ child { node {alp}}
+ child { node {cmph}}
+ child { node {src}}
+ child { node {include}}
+ child { node {scripts}}
+ child { node {tests}}
+ child { node {rRNA\_databases}
+ child { node {silva-bac-16s-id90.fasta}}
+ child { node {...}}
+ }
+ child [missing] {}
+ child [missing] {}
+ child { node [selected] {sortmerna} }
+ child { node [selected] {indexdb\_rna} }
+ ;
+\subsection{Install from tarball release}
+ \item Download \texttt{sortmerna-2.0.tar.gz} from \url{https://github.com/biocore/sortmerna/releases}
+ \item Extract the source code package into a directory of your choice, enter the \texttt{sortmerna-2.0} directory and type,
+ \begin{verbatim}
+ > bash ./build.sh
+ \end{verbatim}
+ \item At this point, two executables \texttt{indexdb\_rna} and \texttt{sortmerna} will be located
+ in the \texttt{sortmerna-2.0} directory.
+ If the user would like to install the executables into their default installation directory (\texttt{/usr/local/bin} for Linux or \texttt{/opt/local/bin} for Mac) then type,
+ \begin{verbatim}
+ > make install (with root permissions)
+ \end{verbatim}
+ \item To begin using SortMeRNA, type `\texttt{indexdb\_rna -h}' or `\texttt{sortmerna -h}'. Databases must first be indexed using \texttt{indexdb\_rna}.
+\subsection{Install development version from git}
+ \item Clone the sortmerna repository to your local system
+ \begin{verbatim}
+ > git clone https://github.com/biocore/sortmerna.git
+ \end{verbatim}
+ \item Build sortmerna
+ \begin{verbatim}
+ > cd sortmerna
+ > bash ./build.sh
+ \end{verbatim}
+\subsection{Install from precompiled code}
+ \item Download the latest binary distribution of SortMeRNA from \url{http://bioinfo.lifl.fr/RNA/sortmerna}
+ \item Extract the downloaded package into a directory of your choice,
+ \begin{verbatim}
+ > tar -xvf sortmerna-2.0.tar.gz
+ > cd sortmerna-2.0
+ \end{verbatim}
+ \item To begin using SortMeRNA, type `\texttt{indexdb\_rna -h}' or `\texttt{sortmerna -h}'. The user must first index
+ the databases with the command \texttt{indexdb\_rna} before running the command \texttt{sortmerna}.
+\noindent If the user installed SortMeRNA using the command \texttt{`make install'}, then they can use the command \texttt{`make uninstall'} to
+uninstall SortMeRNA (with root permissions).
+\noindent SortMeRNA comes prepackaged with 8 databases,\\
+ \textbf{representative database} & \textbf{\%id} & $\#$ \textbf{seq (clustered)} & \textbf{origin} & $\#$ \textbf{seq (original)} \\
+ \hline
+ silva-bac-16s-id90 & 90 & 12798 & SILVA SSU Ref NR v.119 & 464618 \\
+ silva-arc-16s-id95 & 95 & 3193 & SILVA SSU Ref NR v.119 & 18797 \\
+ silva-euk-18s-id95 & 95 & 7348 & SILVA SSU Ref NR v.119 & 51553 \\
+ silva-bac-23s-id98 & 98 & 4488 & SILVA LSU Ref v.119 & 43822 \\
+ silva-arc-23s-id98 & 98 & 251 & SILVA LSU Ref v.119 & 629 \\
+ silva-euk-28s-id98 & 98 & 4935 & SILVA LSU Ref v.119 & 13095 \\
+ rfam-5s-id98 & 98 & 59513 & RFAM & 116760 \\
+ rfam-5.8s-id98 & 98 & 13034 & RFAM & 225185 \\
+HMMER 3.1b1 and SumaClust v1.0.00 were used to reduce the size of the original databases to the similarity listed in column 2 (\%id) of the table above
+(see {\tt/sortmerna/rRNA\_databases/README.txt} for a list of complete steps).
+These representative databases were specifically made for fast filtering of rRNA. Approximately the same number of rRNA will be filtered
+using silva-bac-16s-id90 (12798 rRNA) as using Greengenes 97\% (99322 rRNA), but the former will run significantly faster.
+\noindent \textbf{id} $\%$: members of the cluster must share at least this \% identity with the representative sequence \\
+\noindent \textbf{Remark}: The user must first index the fasta database by using the command \texttt{indexdb\_rna} and
+then filter/map reads against the database using the command \texttt{sortmerna}.
+\section{How to run SortMeRNA}
+\subsection{Index the rRNA database: command `\texttt{indexdb\_rna}'}
+\noindent The executable \texttt{indexdb\_rna} indexes an rRNA database.\\
+\noindent To see the man page for \texttt{indexdb\_rna},
+>> indexdb_rna -h
+ Program: SortMeRNA version 2.0, 29/11/2014
+ Copyright: 2012-2015 Bonsai Bioinformatics Research Group:
+ LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
+ OTU-picking extensions and continuing support developed in the Knight Lab,
+ BioFrontiers Institute, University of Colorado at Boulder
+ Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
+ See the GNU Lesser General Public License for more details.
+ Contact: Evguenia Kopylova, jenya.kopylov at gmail.com
+ Laurent Noe, laurent.noe at lifl.fr
+ Helene Touzet, helene.touzet at lifl.fr
+ usage: ./indexdb_rna --ref db.fasta,db.idx [OPTIONS]:
+ --------------------------------------------------------------------------------------------------------
+ | parameter value description default |
+ --------------------------------------------------------------------------------------------------------
+ --ref STRING,STRING FASTA reference file, index file mandatory
+ (ex. --ref /path/to/file1.fasta,/path/to/index1)
+ If passing multiple reference sequence files, separate
+ them by ':',
+ (ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)
+ --fast BOOL suggested option for aligning ~99% related species off
+ --sensitive BOOL suggested option for aligning ~75-98% related species on
+ --tmpdir STRING directory where to write temporary files
+ -m INT the amount of memory (in Mbytes) for building the index 3072
+ -L INT seed length 18
+ --max_pos INT maximum number of positions to store for each unique L-mer 10000
+ (setting --max_pos 0 will store all positions)
+ -v BOOL verbose
+ -h BOOL help
+\noindent There are eight rRNA representative databases provided in the `\texttt{sortmerna-2.0/rRNA\_databases}' folder.
+All databases were derived from the SILVA SSU and LSU databases (release 119) and the RFAM databases using HMMER 3.1b1 and SumaClust v1.0.00.
+Additionally, the user can index their own database. \\
+\subsubsection{Example 1: indexdb\_rna using one database}
+>> ./indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db -v
+ Program: SortMeRNA version 2.0, 29/11/2014
+ Copyright: 2012-2015 Bonsai Bioinformatics Research Group:
+ LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
+ OTU-picking extensions and continuing support developed in the Knight Lab,
+ BioFrontiers Institute, University of Colorado at Boulder
+ Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
+ See the GNU Lesser General Public License for more details.
+ Contact: Evguenia Kopylova, jenya.kopylov at gmail.com
+ Laurent Noe, laurent.noe at lifl.fr
+ Helene Touzet, helene.touzet at lifl.fr
+ Parameters summary:
+ K-mer size: 19
+ K-mer interval: 1
+ Maximum positions to store per unique K-mer: 10000
+ Total number of databases to index: 1
+ Begin indexing file ./rRNA_databases/silva-bac-16s-id90.fasta under index name ./index/silva-bac-16s-db:
+ Collecting sequence distribution statistics .. done [1.133206 sec]
+ start index part # 0:
+ (1/3) building burst tries .. done [23.643256 sec]
+ (2/3) building CMPH hash .. done [22.306709 sec]
+ (3/3) building position lookup tables .. done [54.958680 sec]
+ total number of sequences in this part = 12798
+ writing kmer data to ./index/silva-bac-16s-db.kmer_0.dat
+ writing burst tries to ./index/silva-bac-16s-db.bursttrie_0.dat
+ writing position lookup table to ./index/silva-bac-16s-db.pos_0.dat
+ writing nucleotide distribution statistics to ./index/silva-bac-16s-db.stats
+ done.
+\subsubsection{Example 2: indexdb\_rna using multiple databases}
+Multiple databases can be indexed simultaneously by passing them as a `:' separated list to \texttt{--ref} (no spaces allowed).
+>> ./indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
+\subsection{A guide to choosing `{\bf sortmerna}' parameters for filtering and read mapping}
+In SortMeRNA version 1.99 beta and up, users have the option to output sequence alignments for their matching rRNA reads in
+the SAM or BLAST-like formats. Depending on the desired quality of alignments, different parameter choices must be made.
+Table~\ref{tab:guide} presents a guide to parameter choices for most use cases. In all cases, output alignments are guaranteed to reach
+the threshold E-value score (default E-value=1). An E-value of 1 signifies that one random alignment is expected for aligning
+\textbf{all} reads against the reference database. The E-value in SortMeRNA is computed for the entire search space, not per read.
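The relation between an E-value threshold and a minimal alignment score can be sketched as follows. This is only an illustration that assumes the standard Karlin-Altschul relation E = K * m * n * exp(-lambda * S); the manual does not spell out the exact formula SortMeRNA uses. The function names and the search-space sizes in the usage note are hypothetical; only the terms "Gumbel lambda", "Gumbel K" and "Minimal SW score based on E-value" come from the program output.

```python
import math


def expected_random_alignments(score, gumbel_lambda, gumbel_k,
                               total_read_len, total_ref_len):
    """Expected number of random alignments scoring at least `score`.

    Assumption: E = K * m * n * exp(-lambda * S), where m and n are the
    total read length and total reference length (the whole search space).
    """
    search_space = total_read_len * total_ref_len
    return gumbel_k * search_space * math.exp(-gumbel_lambda * score)


def min_sw_score(evalue, gumbel_lambda, gumbel_k,
                 total_read_len, total_ref_len):
    """Smallest integer score whose expected random count is <= evalue."""
    search_space = total_read_len * total_ref_len
    raw = math.log(gumbel_k * search_space / evalue) / gumbel_lambda
    return math.ceil(raw)
```

For example, `min_sw_score(1.0, 0.6, 0.33, 10**6, 10**7)` gives the lowest score at which no more than one random alignment is expected over a hypothetical search space of 10^6 read nucleotides against 10^7 reference nucleotides. Lowering the E-value threshold raises the minimal score.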
+\caption{SortMeRNA alignment parameter guide}
+ \centering
+ \footnotesize
+ \begin{tabular}{l | l | l}
+ \toprule
+ \parbox[t]{0.6in}{\sf option} & {\sf speed} & \parbox[t]{0.45in}{\sf description} \\
+ \midrule
+ \multirow{8}{*}{{\tt --num\_alignments INT}}
+ & Very fast for {\tt INT = 1}& \parbox{6cm}{Output the first alignment passing E-value threshold ({\bf best choice if only filtering is needed})} \\
+ \cmidrule{2-3}
+ & Speed decreases for higher value {\tt INT} & \parbox{6cm}{Higher {\tt INT} signifies more alignments will be made \& output }\\
+ \cmidrule{2-3}
+ & Very slow for {\tt INT = 0} & \parbox{6cm}{All alignments reaching the E-value threshold are reported (this option is not suggested for high similarity rRNA databases, due to many possible alignments per read causing a very large file output)} \\
+ \midrule
+ \multirow{4}{*}{{\tt --best INT}}
+ & Fast for {\tt INT = 1} & \parbox{6cm}{Only one high-candidate reference sequence will be searched for alignments (determined heuristically using a Longest Increasing Subsequence of seed matches). The single best alignment of those will be reported }\\
+ \cmidrule{2-3}
+ & Speed decreases for higher value {\tt INT} & \parbox{6cm}{Higher {\tt INT} signifies more alignments will be made, though only the best one will be reported } \\
+ \cmidrule{2-3}
+ & Very slow for {\tt INT = 0} & \parbox{6cm}{All high-candidate reference sequences will be searched for alignments, though only the best one will be reported }\\
+ \bottomrule
+ \end{tabular}
+\subsection{Filter rRNA reads}
+\noindent The executable \texttt{sortmerna} can filter rRNA reads against an indexed rRNA database.\\
+\noindent To see the man page for \texttt{sortmerna},
+>> ./sortmerna -h
+ Program: SortMeRNA version 2.0, 29/11/2014
+ Copyright: 2012-2015 Bonsai Bioinformatics Research Group:
+ LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
+ OTU-picking extensions and continuing support developed in the Knight Lab,
+ BioFrontiers Institute, University of Colorado at Boulder
+ Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
+ See the GNU Lesser General Public License for more details.
+ Contact: Evguenia Kopylova, jenya.kopylov at gmail.com
+ Laurent Noe, laurent.noe at lifl.fr
+ Helene Touzet, helene.touzet at lifl.fr
+ usage: ./sortmerna --ref db.fasta,db.idx --reads file.fa --aligned base_name_output [OPTIONS]:
+ -------------------------------------------------------------------------------------------------------------
+ | parameter value description default |
+ -------------------------------------------------------------------------------------------------------------
+ --ref STRING,STRING FASTA reference file, index file mandatory
+ (ex. --ref /path/to/file1.fasta,/path/to/index1)
+ If passing multiple reference files, separate
+ them using the delimiter ':',
+ (ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)
+ --reads STRING FASTA/FASTQ reads file mandatory
+ --aligned STRING aligned reads filepath + base file name mandatory
+ (appropriate extension will be added)
+ --other STRING rejected reads filepath + base file name
+ (appropriate extension will be added)
+ --fastx BOOL output FASTA/FASTQ file off
+ (for aligned and/or rejected reads)
+ --sam BOOL output SAM alignment off
+ (for aligned reads only)
+ --SQ BOOL add SQ tags to the SAM file off
+ --blast INT output alignments in various Blast-like formats
+ 0 - pairwise
+ 1 - tabular (Blast -m 8 format)
+ 2 - tabular + column for CIGAR
+ 3 - tabular + columns for CIGAR and query coverage
+ --log BOOL output overall statistics off
+ --num_alignments INT report first INT alignments per read reaching E-value -1
+ (--num_alignments 0 signifies all alignments will be output)
+ or (default)
+ --best INT report INT best alignments per read reaching E-value 1
+ by searching --min_lis INT candidate alignments
+ (--best 0 signifies all candidate alignments will be searched)
+ --min_lis INT search all alignments having the first INT longest LIS 2
+ LIS stands for Longest Increasing Subsequence, it is
+ computed using seeds' positions to expand hits into
+ longer matches prior to Smith-Waterman alignment.
+ --print_all_reads BOOL output null alignment strings for non-aligned reads off
+ to SAM and/or BLAST tabular files
+ --paired_in BOOL both paired-end reads go in --aligned fasta/q file off
+ (interleaved reads only, see Section 4.2.4 of User Manual)
+ --paired_out BOOL both paired-end reads go in --other fasta/q file off
+ (interleaved reads only, see Section 4.2.4 of User Manual)
+ --match INT SW score (positive integer) for a match 2
+ --mismatch INT SW penalty (negative integer) for a mismatch -3
+ --gap_open INT SW penalty (positive integer) for introducing a gap 5
+ --gap_ext INT SW penalty (positive integer) for extending a gap 2
+ -N INT SW penalty for ambiguous letters (N's) scored as --mismatch
+ -F BOOL search only the forward strand off
+ -R BOOL search only the reverse-complementary strand off
+ -a INT number of threads to use 1
+ -e DOUBLE E-value threshold 1
+ -m INT INT Mbytes for loading the reads into memory 1024
+ (maximum -m INT is 4096)
+ -v BOOL verbose off
+ --id DOUBLE %id similarity threshold (the alignment must 0.97
+ still pass the E-value threshold)
+ --coverage DOUBLE %query coverage threshold (the alignment must 0.97
+ still pass the E-value threshold)
+ --de_novo_otu BOOL FASTA/FASTQ file for reads matching database < %id off
+ (set using --id) and < %cov (set using --coverage)
+ (alignment must still pass the E-value threshold)
+ --otu_map BOOL output OTU map (input to QIIME's make_otu_table.py) off
+ [ADVANCED OPTIONS] (see SortMeRNA user manual for more details):
+ --passes INT,INT,INT three intervals at which to place the seed on the read L,L/2,3
+ (L is the seed length set in ./indexdb_rna)
+ --edges INT number (or percent if INT followed by % sign) of 4
+ nucleotides to add to each edge of the read
+ prior to SW local alignment
+ --num_seeds INT number of seeds matched before searching 2
+ for candidate LIS
+ --full_search BOOL search for all 0-error and 1-error seed off
+ matches in the index rather than stopping
+ after finding a 0-error match (<1% gain in
+ sensitivity with up four-fold decrease in speed)
+ --pid BOOL add pid to output file names off
+ [HELP]:
+ -h BOOL help
+ --version BOOL SortMeRNA version number
+\noindent The user can adjust the amount of memory allocated for loading the reads through the
+command option \texttt{-m}. By default, \texttt{-m} is set to 1024 Mbytes (1GB).
+If the reads file is larger than 1GB, then \texttt{sortmerna} internally divides the file into partial sections of
+1GB and executes one section at a time. Hence, if a user has an input file of 15GB and only 1GB of RAM to store it, the
+file will be processed in partial sections using \texttt{mmap} without having to physically split it prior to execution. Otherwise, the user
+can increase \texttt{-m} to map larger portions of the file. The limit for \texttt{-m} is given by typing \texttt{sortmerna -h}.
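The arithmetic behind the partial sections can be sketched as below. This is a rough estimate under two assumptions that are not stated in the manual: that 1 Mbyte means 2^20 bytes, and that sections are cut at exact multiples of the `-m` value (the real tool presumably adjusts the boundaries so that no read is split). The function name is hypothetical.

```python
MBYTE = 1 << 20  # assumption: -m counts Mbytes of 2**20 bytes


def partial_sections(file_size_bytes, m_mbytes=1024):
    """Estimate how many sections a reads file is processed in.

    `m_mbytes` mirrors the -m option (default 1024). The count is the
    file size divided by the window size, rounded up, and at least 1.
    """
    window = m_mbytes * MBYTE
    sections = -(-file_size_bytes // window)  # integer ceiling division
    return max(1, sections)
```

With the default window, a 15GB reads file (`15 * 1024**3` bytes) comes out as 15 sections, and raising `m_mbytes` to 4096 reduces the estimate to 4. A file of 35238748 bytes fits in a single section.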
+\subsubsection{Example 3: multiple databases and the fastest alignment option}
+>> time ./sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
+ --reads SRR106861.fasta --sam --num_alignments 1 --fastx --aligned SRR105861_rRNA\
+ --other SRR105861_non_rRNA --log -v
+ Program: SortMeRNA version 2.0, 29/11/2014
+ Copyright: 2012-2015 Bonsai Bioinformatics Research Group:
+ LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
+ OTU-picking extensions and continuing support developed in the Knight Lab,
+ BioFrontiers Institute, University of Colorado at Boulder
+ Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
+ See the GNU Lesser General Public License for more details.
+ Contact: Evguenia Kopylova, jenya.kopylov at gmail.com
+ Laurent Noe, laurent.noe at lifl.fr
+ Helene Touzet, helene.touzet at lifl.fr
+ Computing read file statistics ... done [2.16 sec]
+ size of reads file: 35238748 bytes
+ partial section(s) to be executed: 1 of size 35238748 bytes
+ Parameters summary:
+ Number of seeds = 2
+ Edges = 4 (as integer)
+ SW match = 2
+ SW mismatch = -3
+ SW gap open penalty = 5
+ SW gap extend penalty = 2
+ SW ambiguous nucleotide = -3
+ SQ tags are not output
+ Number of threads = 1
+ Begin mmap reads section # 1:
+ Time to mmap reads and set up pointers [0.11 sec]
+ Begin analysis of: ./rRNA_databases/silva-bac-16s-id90.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.602397
+ Gumbel K = 0.328927
+ Minimal SW score based on E-value = 54
+ Loading index part 1/1 ... done [4.67 sec]
+ Begin index search ... done [83.53 sec]
+ Freeing index ... done [0.87 sec]
+ Begin analysis of: ./rRNA_databases/silva-bac-23s-id98.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.603075
+ Gumbel K = 0.330488
+ Minimal SW score based on E-value = 53
+ Loading index part 1/1 ... done [3.63 sec]
+ Begin index search ... done [94.76 sec]
+ Freeing index ... done [0.41 sec]
+ Begin analysis of: ./rRNA_databases/silva-arc-16s-id95.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.596230
+ Gumbel K = 0.322143
+ Minimal SW score based on E-value = 52
+ Loading index part 1/1 ... done [1.14 sec]
+ Begin index search ... done [22.63 sec]
+ Freeing index ... done [0.14 sec]
+ Begin analysis of: ./rRNA_databases/silva-arc-23s-id98.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.597749
+ Gumbel K = 0.325630
+ Minimal SW score based on E-value = 49
+ Loading index part 1/1 ... done [0.50 sec]
+ Begin index search ... done [13.27 sec]
+ Freeing index ... done [0.06 sec]
+ Begin analysis of: ./rRNA_databases/silva-euk-18s-id95.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.612228
+ Gumbel K = 0.334926
+ Minimal SW score based on E-value = 52
+ Loading index part 1/1 ... done [3.23 sec]
+ Begin index search ... done [30.28 sec]
+ Freeing index ... done [0.45 sec]
+ Begin analysis of: ./rRNA_databases/silva-euk-28s-id98.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.612068
+ Gumbel K = 0.344763
+ Minimal SW score based on E-value = 53
+ Loading index part 1/1 ... done [3.43 sec]
+ Begin index search ... done [35.69 sec]
+ Freeing index ... done [0.48 sec]
+ Begin analysis of: ./rRNA_databases/rfam-5s-database-id98.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.616617
+ Gumbel K = 0.341306
+ Minimal SW score based on E-value = 51
+ Loading index part 1/1 ... done [1.77 sec]
+ Begin index search ... done [13.50 sec]
+ Freeing index ... done [0.22 sec]
+ Begin analysis of: ./rRNA_databases/rfam-5.8s-database-id98.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.617817
+ Gumbel K = 0.340589
+ Minimal SW score based on E-value = 49
+ Loading index part 1/1 ... done [0.60 sec]
+ Begin index search ... done [8.78 sec]
+ Freeing index ... done [0.07 sec]
+ Total number of reads mapped (incl. all reads file sections searched): 104243
+ Writing aligned FASTA/FASTQ ... done [1.13 sec]
+ Writing not-aligned FASTA/FASTQ ... done [0.10 sec]
+\noindent The option `\texttt{--log}' will create an overall statistics file,\\
+>> cat SRR105861_rRNA.log
+ Time and date
+ Command: sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
+ ./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\
+ ./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\
+ ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\
+ ./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\
+ ./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:\
+ ./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\
+ ./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db\
+ --reads /Users/jenya/Downloads/SRR106861.fasta --sam --num_alignments 1\
+ --fastx --aligned SRR105861_rRNA --other SRR105861_non_rRNA.fasta fasta -v
+ Process pid = 1957
+ Parameters summary:
+ Index: ./index/silva-bac-16s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.602397
+ Gumbel K = 0.328927
+ Minimal SW score based on E-value = 54
+ Index: ./index/silva-bac-23s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.603075
+ Gumbel K = 0.330488
+ Minimal SW score based on E-value = 53
+ Index: ./index/silva-arc-16s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.596230
+ Gumbel K = 0.322143
+ Minimal SW score based on E-value = 52
+ Index: ./index/silva-arc-23s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.597749
+ Gumbel K = 0.325630
+ Minimal SW score based on E-value = 49
+ Index: ./index/silva-euk-18s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.612228
+ Gumbel K = 0.334926
+ Minimal SW score based on E-value = 52
+ Index: ./index/silva-euk-28s
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.612068
+ Gumbel K = 0.344763
+ Minimal SW score based on E-value = 53
+ Index: ./index/rfam-5s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.616617
+ Gumbel K = 0.341306
+ Minimal SW score based on E-value = 51
+ Index: ./index/rfam-5.8s-db
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.617817
+ Gumbel K = 0.340589
+ Minimal SW score based on E-value = 49
+ Number of seeds = 2
+ Edges = 4 (as integer)
+ SW match = 2
+ SW mismatch = -3
+ SW gap open penalty = 5
+ SW gap extend penalty = 2
+ SW ambiguous nucleotide = -3
+ SQ tags are not output
+ Number of threads = 1
+ Reads file = SRR106861.fasta
+ Results:
+ Total reads = 113128
+ Total reads passing E-value threshold = 104243 (92.15%)
+ Total reads failing E-value threshold = 8885 (7.85%)
+ Minimum read length = 59
+ Maximum read length = 1253
+ Mean read length = 267
+ By database:
+ ./rRNA_databases/silva-bac-16s-id90.fasta 25.73%
+ ./rRNA_databases/silva-bac-23s-id98.fasta 64.37%
+ ./rRNA_databases/silva-arc-16s-id95.fasta 0.00%
+ ./rRNA_databases/silva-arc-23s-id98.fasta 0.00%
+ ./rRNA_databases/silva-euk-18s-id95.fasta 0.00%
+ ./rRNA_databases/silva-euk-28s-id98.fasta 0.00%
+ ./rRNA_databases/rfam-5s-database-id98.fasta 2.04%
+ ./rRNA_databases/rfam-5.8s-database-id98.fasta 0.00%
+ \end{Verbatim}
+\subsubsection{Filtering paired-end reads}
+When writing aligned and non-aligned reads to FASTA/Q files, the situation sometimes arises
+where one of the paired-end reads aligns and the other one doesn't. Since SortMeRNA
+looks at each read individually, by default the reads will be split into two separate files. That is, the read that
+aligned will go into the {\tt--aligned} FASTA/Q file and its mate that didn't align will go into the
+{\tt--other} FASTA/Q file.
+This situation would result in the splitting of some paired reads in the
+output files, which is not optimal for users who require paired order of the reads for
+downstream analyses.
+For users who wish to keep the order of their paired-end reads, two options are available.
+If one read aligns and the other does not, then
+ \item[(1)] {\tt--paired\_in} will put both reads into the file specified by {\tt--aligned}
+ \item[(2)] {\tt--paired\_out} will put both reads into the file specified by {\tt--other}
+The first option, {\tt--paired\_in}, is optimal for users who want all reads in the {\tt--other} file
+to be non-rRNA. However, there is a small chance that non-rRNA reads will also be
+put into the {\tt--aligned} file.
+The second option, {\tt--paired\_out}, is optimal for users who want only rRNA reads in the
+{\tt--aligned} file. However, there is a small chance that rRNA reads will also be
+put into the {\tt--other} file.
+If neither of these two options is added to the {\tt sortmerna} command, then aligned and
+non-aligned reads will be output to the {\tt--aligned} and {\tt--other} files respectively, possibly breaking
+the order for a set of paired reads between two output files.
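The routing rules described above can be summarised in a short sketch. It is an illustration of the documented behaviour, not SortMeRNA's implementation; the function name and the string labels are hypothetical, and treating the two options as mutually exclusive is an assumption.

```python
def route_pair(first_aligned, second_aligned,
               paired_in=False, paired_out=False):
    """Return the destination ('aligned' or 'other') of each read in a pair.

    `paired_in` and `paired_out` mirror the --paired_in and --paired_out
    options; they only matter when exactly one read of the pair aligns.
    """
    if paired_in and paired_out:
        # Assumption: the two options are not meant to be combined.
        raise ValueError("choose at most one of paired_in / paired_out")

    if first_aligned == second_aligned:
        # Both reads agree, so the pair stays together in either mode.
        dest = "aligned" if first_aligned else "other"
        return dest, dest

    if paired_in:
        return "aligned", "aligned"
    if paired_out:
        return "other", "other"

    # Default: each read is routed individually, which splits the pair.
    return ("aligned" if first_aligned else "other",
            "aligned" if second_aligned else "other")
```

For a pair in which only the first read aligns, the default call returns a split pair, `paired_in=True` sends both reads to the aligned output, and `paired_out=True` sends both to the other output.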
+{\bf It's important to note} that regardless of the options used, the {\tt--log} file will always
+report the true number of reads classified as rRNA (not the number of reads in the {\tt--aligned}
+\subsubsection{Example 4: forward-reverse paired-end reads (2 input files)}
+\tikzstyle{mybox} = [draw=OliveGreen, fill=blue!5, very thick,
+ rectangle, rounded corners, inner sep=10pt, inner ysep=20pt]
+\tikzstyle{fancytitle} =[fill=OliveGreen, text=white, rectangle, rounded corners]
+\node [mybox] (box1) {%
+ \begin{minipage}[t!]{2in}
+ {\footnotesize
+ @SEQUENCE\_ID\_1/\textbf{1} \\
+ ACTT..\\
+ +\\
+ QUALITY\_1/1\\
+ @SEQUENCE\_ID\_2/\textbf{1} \\
+ GTTA..\\
+ +\\
+ QUALITY\_2/1\\
+ ..
+ }
+ \end{minipage}
+ };
+\node [mybox] (box2) [right=of box1,xshift=2cm] {%
+ \begin{minipage}[t!]{2in}
+ {\footnotesize
+ @SEQUENCE\_ID\_1/\textbf{2} \\
+ GTAC..\\
+ +\\
+ QUALITY\_1/2\\
+ @SEQUENCE\_ID\_2/\textbf{2} \\
+ CCAC..\\
+ +\\
+ QUALITY\_2/2\\
+ ..
+ }
+ \end{minipage}
+ };
+\node[fancytitle] at (box1.north) {{\small FASTQ forward reads}};
+\node[fancytitle] at (box2.north) {{\small FASTQ reverse reads}};
+\draw [decorate,color=black!80,decoration={brace,mirror,amplitude=5pt,raise=2pt}] (3,0.3) -- node[right=10pt]{$~~pair~\#~1$}(3,1.8);
+\draw [decorate,color=black!80,decoration={brace,amplitude=5pt,raise=2pt}] (5.8,0.3) -- node[right=10pt]{~}(5.8,1.8);
+\draw [decorate,color=black!80,decoration={brace,amplitude=5pt,raise=2pt}] (3,0) -- node[right=10pt]{$~~pair~\#~2$}(3,-1.5);
+\draw [decorate,color=black!80,decoration={brace,mirror,amplitude=5pt,raise=2pt}] (5.8,0) -- node[right=10pt]{~}(5.8,-1.5);
+\caption{Forward and reverse reads in paired-end sequencing format}
+\tikzstyle{mybox} = [draw=OliveGreen, fill=blue!5, very thick,
+ rectangle, rounded corners, inner sep=10pt, inner ysep=20pt]
+\tikzstyle{fancytitle} =[fill=OliveGreen, text=white, rectangle, rounded corners]
+\node [mybox] (box) {%
+ \begin{minipage}[t!]{2in}
+ {\footnotesize
+ @SEQUENCE\_ID\_1/\textbf{1} \\
+ ACTT..\\
+ +\\
+ QUALITY\_1/1\\
+ @SEQUENCE\_ID\_1/\textbf{2} \\
+ GTAC..\\
+ +\\
+ QUALITY\_1/2\\
+ ..
+ }
+ \end{minipage}
+ };
+\node[fancytitle] at (box.north) {{\small FASTQ paired-end reads}};
+\draw [decorate,color=black!80,decoration={brace,mirror,amplitude=10pt,raise=2pt}] (0.5,-1.5) -- node[right=10pt]{$~pair~\#~1$}(0.5,1.8);
+\caption{Paired-end read format accepted by SortMeRNA}
+\noindent SortMeRNA accepts only one file as input for the reads. If a user has two input files, as is the case for
+forward and reverse paired-end reads (see Figure~\ref{fig:format2}), they may use the \texttt{merge-paired-reads.sh} script found in the
+\texttt{`sortmerna/scripts'} folder to interleave the paired reads into the format of Figure~\ref{fig:format1}.\\
+\noindent The command for \texttt{merge-paired-reads.sh} is the following,
+ > bash ./merge-paired-reads.sh forward-reads.fastq reverse-reads.fastq outfile.fastq
+\noindent Now, the user may input \texttt{outfile.fastq} to SortMeRNA for analysis.
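+\noindent To make the interleaved layout concrete, here is a minimal Python sketch of the same idea. It is not
+the shipped script; it assumes four-line FASTQ records (no wrapped sequences) and the function names are
+hypothetical.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Yield FASTQ records as lists of four lines (assumes unwrapped sequences)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(forward, reverse):
    """Yield the lines of an interleaved file: forward read, then its mate, and so on."""
    pairs = zip_longest(fastq_records(forward), fastq_records(reverse))
    for fwd, rev in pairs:
        if fwd is None or rev is None:
            raise ValueError("forward and reverse inputs hold different numbers of reads")
        yield from fwd
        yield from rev
```

+\noindent Both arguments can be open file handles, and the result can be written with
+{\tt out.writelines(interleave(fwd, rev))}.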
+\noindent Similarly, for unmerging the paired reads back into two separate files, use the command,
+ > bash ./unmerge-paired-reads.sh merged-reads.fastq forward-reads.fastq reverse-reads.fastq
+{\bf Important:} \texttt{unmerge-paired-reads.sh} should only be used if one of the options {\tt--paired\_in} or {\tt--paired\_out}
+was used during filtering. Otherwise it may give incorrect results if a read pair was split during alignment (one
+read aligned and the other did not).
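+\noindent The reason for this warning can be seen in a small Python sketch of de-interleaving. It is not the
+shipped script: it assumes four-line FASTQ records and read names ending in {\tt /1} and {\tt /2} as in the
+figures, and the mate check is an addition made here for illustration. All function names are hypothetical.

```python
from itertools import islice


def pair_name(header):
    """Return the read name without '@', description, or a trailing /1 or /2."""
    name = header[1:].strip().split(" ")[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def deinterleave(lines):
    """Split interleaved FASTQ lines into (forward_lines, reverse_lines).

    Raises ValueError when two consecutive records are not mates, which is
    what happens if a pair was split between output files during filtering.
    """
    it = iter(lines)
    forward, reverse = [], []
    while True:
        first = list(islice(it, 4))
        if not first:
            return forward, reverse
        second = list(islice(it, 4))
        if len(first) != 4 or len(second) != 4:
            raise ValueError("odd number of records or truncated record")
        if pair_name(first[0]) != pair_name(second[0]):
            raise ValueError(
                f"{first[0].strip()} and {second[0].strip()} are not mates")
        forward.extend(first)
        reverse.extend(second)
```

+\noindent Without such a check, a file in which one mate is missing would silently shift every following read
+into the wrong output file.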
+\subsection{Read mapping}
+\subsubsection{Mapping reads for classification}
+Although SortMeRNA is very sensitive with the small rRNA databases distributed with the source code,
+these databases are not optimal for classification since often alignments with 75-90\% identity
+will be returned (there are only several thousand rRNA in most of the databases, compared to the original
+SILVA or Greengenes databases containing millions of rRNA). Classification at the species level generally
+considers alignments at 97\% identity and above, so a larger database is suggested if species classification
+is the main goal.
+Moreover, SortMeRNA is a local alignment tool, so it's also important to look at the query coverage \% for
+each alignment. In the SAM output format, neither \% id nor query coverage is reported. If the user needs
+these values, then the Blast tabular format with CIGAR + query coverage option ({\tt--blast 3}) is the way to go.
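+As an illustration of using both values together, the following Python sketch keeps only alignments that reach
+a given \% id and query coverage. The column positions are an assumption: \% id is taken from the third column
+of the standard Blast tabular layout, and the CIGAR string and query coverage are assumed to be appended as
+columns 13 and 14. The 90\% coverage cutoff and the function names are hypothetical.

```python
def parse_hit(line):
    """Extract the fields used for filtering from one tab-separated alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "reference": fields[1],
        "identity": float(fields[2]),               # column 3 of Blast tabular output
        "coverage": float(fields[13].rstrip("%")),  # ASSUMED: query coverage in column 14
    }


def species_level_hits(lines, min_identity=97.0, min_coverage=90.0):
    """Return hits whose % id and query coverage both reach the thresholds."""
    hits = (parse_hit(line) for line in lines if line.strip())
    return [h for h in hits
            if h["identity"] >= min_identity and h["coverage"] >= min_coverage]
```

+Checking the column order against a real output file before relying on such a filter is advisable.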
+\subsubsection{Example 5: mapping reads against the 16S Greengenes 97\% id database with multithreading}
+This example will generate SAM and BLAST tabular output files. Alignments are classified as significant
+based on the E-value cutoff (default 1). SortMeRNA's E-value takes into consideration the full size of the
+reference database as well as the query file, so it is higher than BLAST's (e.g., equivalent to BLAST's 1e-5).
+>> sortmerna --ref 97_otus_gg_13_8.fasta,./index/97_otus_gg_13_8\
+ --reads SRR106861.fasta --blast 3 --sam --log --aligned SRR106861_gg_rRNA -a 20 -v
+ Program: SortMeRNA version 2.0, 29/11/2014
+ Copyright: 2012-2015 Bonsai Bioinformatics Research Group:
+ LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
+ OTU-picking extensions and continuing support developed in the Knight Lab,
+ BioFrontiers Institute, University of Colorado at Boulder
+ Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
+ See the GNU Lesser General Public License for more details.
+ Contact: Evguenia Kopylova, jenya.kopylov at gmail.com
+ Laurent Noe, laurent.noe at lifl.fr
+ Helene Touzet, helene.touzet at lifl.fr
+ Computing read file statistics ... done [0.44 sec]
+ size of reads file: 35238748 bytes
+ partial section(s) to be executed: 1 of size 35238748 bytes
+ Parameters summary:
+ Number of seeds = 2
+ Edges = 4 (as integer)
+ SW match = 2
+ SW mismatch = -3
+ SW gap open penalty = 5
+ SW gap extend penalty = 2
+ SW ambiguous nucleotide = -3
+ SQ tags are not output
+ Number of threads = 20
+ Begin mmap reads section # 1:
+ Time to mmap reads and set up pointers [0.10 sec]
+ Begin analysis of: 97_otus_gg_13_8.fasta
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.600470
+ Gumbel K = 0.327880
+ Minimal SW score based on E-value = 57
+ Loading index part 1/1 ... done [10.76 sec]
+ Begin index search ... done [23.75 sec]
+ Freeing index ... done [1.44 sec]
+ Total number of reads mapped (incl. all reads file sections searched): 29089
+ Writing alignments ... done [7.71 sec]
+This is almost the same number of 16S rRNA as identified by SortMeRNA using the smaller provided database,
+>> cat SRR106861_gg_rRNA.log
+ Date and time
+ Command: sortmerna --ref 97_otus_gg_13_8.fasta,./index/97_otus_gg_13_8\
+ --reads SRR106861.fasta --blast 3 --sam --log --aligned SRR106861_gg_rRNA -a 20 -v
+ Process pid = 44246
+ Parameters summary:
+ Index: ./index/97_otus_gg_13_8
+ Seed length = 18
+ Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
+ Gumbel lambda = 0.600470
+ Gumbel K = 0.327880
+ Minimal SW score based on E-value = 57
+ Number of seeds = 2
+ Edges = 4 (as integer)
+ SW match = 2
+ SW mismatch = -3
+ SW gap open penalty = 5
+ SW gap extend penalty = 2
+ SW ambiguous nucleotide = -3
+ SQ tags are not output
+ Number of threads = 20
+ Reads file = SRR106861.fasta
+ Results:
+ Total reads = 113128
+ Total reads passing E-value threshold = 29089 (25.71%)
+ Total reads failing E-value threshold = 84039 (74.29%)
+ Minimum read length = 59
+ Maximum read length = 1253
+ Mean read length = 267
+ By database:
+ 97_otus_gg_13_8.fasta 25.71%
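+The Gumbel parameters and the minimal SW score printed in the log are linked by the E-value cutoff. A rough
+Python sketch of that link uses the textbook Karlin-Altschul relation $E = K\,m\,n\,e^{-\lambda S}$. This is
+an assumption made for illustration: how SortMeRNA derives the search space $m\,n$, and any corrections it
+applies, are not described here, so the sketch is not expected to reproduce the logged score exactly.

```python
import math


def expected_hits(score, lam, k, query_len, db_len):
    """Textbook E-value for a local alignment score: K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def minimal_score(evalue_cutoff, lam, k, query_len, db_len):
    """Smallest integer score whose E-value does not exceed the cutoff."""
    return math.ceil(math.log(k * query_len * db_len / evalue_cutoff) / lam)
```

+Here {\tt lam} and {\tt k} correspond to the logged Gumbel lambda and Gumbel K, while {\tt query\_len} and
+{\tt db\_len} are placeholders for the search-space sizes.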
+SortMeRNA is implemented in QIIME's closed-reference and open-reference OTU-picking workflows.
+Readers are referred to QIIME's tutorials for an in-depth discussion of these methods.
+\section{SortMeRNA advanced options}
+\subsection*{{\tt --num\_seeds INT}}
+The threshold number of seeds required to match in the primary seed-search filter before
+moving on to the secondary seed-cluster filter. More specifically, the threshold number of
+seeds required before searching for a longest increasing subsequence (LIS) of the seeds' positions
+between the read and the closest matching reference sequence. By default, this is set to 2 seeds.
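+The idea of a longest increasing subsequence over seed positions can be sketched in Python as follows. This is
+a generic LIS routine, not SortMeRNA's implementation; the representation of a seed hit as a
+{\tt (read\_pos, ref\_pos)} pair and the function names are assumptions.

```python
from bisect import bisect_left


def lis_length(values):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[i] = smallest tail of an increasing subsequence of length i + 1
    for v in values:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def colinear_seed_count(hits):
    """Largest number of seed hits whose read and reference positions both increase.

    hits: iterable of (read_pos, ref_pos) pairs for one reference sequence.
    Ties on read_pos are sorted by descending ref_pos so that two hits of the
    same seed can never both be counted.
    """
    ordered = sorted(hits, key=lambda h: (h[0], -h[1]))
    return lis_length([ref_pos for _, ref_pos in ordered])
```

+Hits that appear on the reference in the same order as on the read support one consistent alignment, and
+out-of-order hits are left out of the chain.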
+\subsection*{{\tt --passes INT,INT,INT}}
+In the primary seed-search filter, SortMeRNA moves a seed of length $L$ (parameter of {\tt indexdb\_rna})
+across the read using three passes. If at the end of each pass a threshold number of seeds (defined by {\tt --num\_seeds})
+did not match to the reference database, SortMeRNA attempts to find more seeds by decreasing the interval at which the
+seed is placed along the read by using another pass. In default mode, these intervals are set to
+$L,L/2,3$ for Pass 1, 2 and 3, respectively. Usually, if the read is highly similar to the reference
+database, a threshold number of seeds will be found in the first pass.
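+The default intervals can be illustrated with a short Python sketch that lists where a seed window would start
+on a read in each pass. It only shows the spacing; whether later passes skip positions already tested is not
+stated here, and the function names are hypothetical.

```python
def default_passes(seed_len):
    """Default placement intervals L, L/2 and 3 for passes 1, 2 and 3."""
    return (seed_len, seed_len // 2, 3)


def seed_starts(read_len, seed_len, step):
    """Start positions of seed windows that fit entirely inside the read."""
    return list(range(0, read_len - seed_len + 1, step))


def starts_per_pass(read_len, seed_len=18, passes=None):
    """Seed start positions for each pass, using the default intervals if none given."""
    if passes is None:
        passes = default_passes(seed_len)
    return [seed_starts(read_len, seed_len, step) for step in passes]
```

+With $L=18$ the intervals are 18, 9 and 3, so each pass samples the read more densely than the previous one.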
+\subsection*{{\tt --edges INT(\%)}}
+The number (or percentage if followed by \%) of nucleotides to add to each edge of the alignment region
+on the reference sequence before performing Smith-Waterman alignment. By default, this is set to 4 nucleotides.
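+A minimal Python sketch of such padding is given below. It assumes that a percentage refers to the read length
+and that the padded region is clipped to the ends of the reference sequence; neither detail is stated in this
+section, and the function names are hypothetical.

```python
def parse_edges(value):
    """Return (amount, is_percent) for option strings such as '4' or '10%'."""
    text = value.strip()
    if text.endswith("%"):
        return float(text[:-1]), True
    return int(text), False


def padded_region(start, end, ref_len, read_len, edges="4"):
    """Extend the half-open region [start, end) on the reference by the edge amount.

    ASSUMPTIONS: a percentage is taken relative to the read length, and the
    result is clipped to [0, ref_len].
    """
    amount, is_percent = parse_edges(edges)
    pad = int(read_len * amount / 100) if is_percent else amount
    return max(0, start - pad), min(ref_len, end + pad)
```

+With the default of 4 nucleotides, a region from 100 to 200 becomes 96 to 204 before alignment.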
+\subsection*{{\tt --full\_search FLAG}}
+During the index traversal, if a seed match is found with 0-errors, SortMeRNA will stop searching for further
+1-error matches. This heuristic is based upon the assumption that 0-error matches are more significant than
+1-error matches. Turning this heuristic off with the {\tt--full\_search} flag may increase the sensitivity (often
+by less than 1\%), but at the cost of an up to four-fold decrease in speed.
+\subsection*{{\tt --pid FLAG}}
+The pid of the running {\tt sortmerna} process will be added to the output file names in order to avoid overwriting output if the same
+{\tt --aligned STRING} base name is provided for different runs.
+Any issues or bug reports should be reported to \url{https://github.com/biocore/sortmerna/issues} or by e-mail
+to the authors (see list of e-mails in Section 1 of this document). Comments and suggestions are also always appreciated!
+If you use SortMeRNA please cite,
+Kopylova E., No\'{e} L. and Touzet H., ``SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data'', {\it Bioinformatics} (2012), doi: 10.1093/bioinformatics/bts611.
Added: trunk/packages/sortmerna/trunk/debian/doc_source/get
--- trunk/packages/sortmerna/trunk/debian/doc_source/get (rev 0)
+++ trunk/packages/sortmerna/trunk/debian/doc_source/get 2015-08-05 14:16:06 UTC (rev 19844)
@@ -0,0 +1 @@
+wget https://github.com/biocore/sortmerna/raw/master/SortMeRNA-User-Manual-2.0.tex