[med-svn] [gasic] 01/01: Imported Upstream version 0.0.r18
Andreas Tille
tille at debian.org
Sat Feb 8 08:22:22 UTC 2014
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository gasic.
commit 1e9230efd6ba28de5d4f02c12f71f39eca0fb8bc
Author: Andreas Tille <tille at debian.org>
Date: Sat Feb 8 09:20:48 2014 +0100
Imported Upstream version 0.0.r18
---
LICENSE | 24 +++++++
README | 126 ++++++++++++++++++++++++++++++++++
core/__init__.py | 0
core/gasic.py | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++
core/tools.py | 152 +++++++++++++++++++++++++++++++++++++++++
correct_abundances.py | 177 +++++++++++++++++++++++++++++++++++++++++++++++
create_matrix.py | 158 ++++++++++++++++++++++++++++++++++++++++++
doc/MANUAL | 183 +++++++++++++++++++++++++++++++++++++++++++++++++
doc/example_script.py | 172 ++++++++++++++++++++++++++++++++++++++++++++++
quality_check.py | 131 +++++++++++++++++++++++++++++++++++
run_mappers.py | 85 +++++++++++++++++++++++
11 files changed, 1394 insertions(+)
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..2bba2f5
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,24 @@
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/README b/README
new file mode 100644
index 0000000..10d94ab
--- /dev/null
+++ b/README
@@ -0,0 +1,126 @@
+GASiC - Genome Abundance Similarity Correction
+
+
+Downloading
+-----------
+
+GASiC is hosted on sourceforge.net
+
+ http://sourceforge.net/projects/gasic/
+
+Further information can be found on the group homepage:
+
+ http://www.rki-ng4.de/
+
+
+System Requirements
+-------------------
+
+GASiC was built and tested on Linux operating systems.
+
+For Windows or Mac OS users, we provide a virtual machine (VM) with a
+pre-installed and tested version of GASiC. Download links and further
+information can be found at:
+
+ http://sourceforge.net/projects/gasic/
+
+and
+
+ http://www.rki-ng4.de/
+
+The VirtualBox software is reqired to run the virtual machine. It can
+can be obtained at:
+
+ https://www.virtualbox.org/
+
+
+Documentation
+-------------
+
+Documentation can be found in the folder 'doc/'. The sourceforge wiki
+pages also contain useful information.
+
+GASiC is self-documenting. Calling a script with the '--help' option
+displays help on the script functionality.
+
+The GASiC source code is thoroughly commented. Developers might find
+answers there.
+
+
+Installation
+------------
+
+GASiC is a Python script library and does not need installation. If all
+dependencies are installed correctly, GASiC scripts can be run directly
+from the main folder.
+
+We recommend to add the GASiC main directory to your PYTHONPATH environment
+variable. On the console type:
+ export PYTHONPATH=$PYTHONPATH:/path/to/GASiC
+
+To use GASiC from any directory, you may also adjust your PATH environment
+variable:
+ export PATH=$PATH:/path/to/GASiC
+
+
+Dependencies
+------------
+
+GASiC was tested with the software listed below. Older versions might still
+work, but are not tested.
+
+ * Python 2.7, http://www.python.org/
+
+ * NumPy 1.6.1, http://numpy.scipy.org/
+ * SciPy 0.10.0, http://www.scipy.org/
+ * BioPython 1.58, http://biopython.org/wiki/Biopython
+ * pysam 0.6, http://code.google.com/p/pysam/
+
+GASiC calls alignment tools and read simulators from the command line. These
+tools must be installed and added to the PATH before GASiC can be used.
+We tested GASiC with the following tools:
+
+ * bowtie, http://bowtie-bio.sourceforge.net/index.shtml
+ * bowtie2, http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
+ * bwa, http://bio-bwa.sourceforge.net/bwa.shtml
+
+ * Mason, http://www.seqan.de/projects/mason/
+ * dwgsim, http://dnaa.sourceforge.net/
+
+
+Bug Reporting
+-------------
+
+You can send GASiC bug reports to <LindnerM at rki.de>.
+
+Please provide a minimal example of the code producing the error and the
+error message(s). Also check which libraries (see Dependencies) you use.
+
+
+
+
+-------------------------------------------------------------------------------
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/core/__init__.py b/core/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/core/gasic.py b/core/gasic.py
new file mode 100644
index 0000000..687e1be
--- /dev/null
+++ b/core/gasic.py
@@ -0,0 +1,186 @@
+"""
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+import sys
+import numpy as np
+import numpy.linalg as la
+import scipy.optimize as opt
+
+
+def similarity_correction(sim, reads, N):
+ """ Calculate corrected abundances given a similarity matrix and observations using optimization.
+
+ Input:
+ sim: [numpy.array (M,M)]: with pairwise similarities between species
+ reads [numpy.array (M,)]: with number of observed reads for each species
+ N [int]: total number of reads
+
+ Output:
+ abundances [numpy.array (M,)]: estimated abundance of each species in the sample
+ """
+
+ # transform reads to abundances and rename similarity matrix
+ A = sim
+ r = reads.astype(np.float) / N
+
+ rng = range(len(reads))
+
+ # Now solve the optimization problem: min_c |Ac-r|^2 s.t. c_i >= 0 and 1-sum(c_i) >= 0
+ # construct objective function
+ def objective (c):
+ # numpy implementation of objective function |A*c-r|^2
+ return np.sum(np.square(np.dot(A,c)-r))
+
+ # construct constraints
+ def build_con(k):
+ # constraint: k'th component of x should be >0
+ return lambda c: c[k]
+ cons = [build_con(k) for k in rng]
+ # constraint: the sum of all components of x should not exceed 1
+ cons.append( lambda c: 1-np.sum(c) )
+
+ # initial guess
+ c_0 = np.array([0.5 for i in range(len(reads))])
+
+ # finally: optimization procedure
+ abundances = opt.fmin_cobyla(objective, c_0, cons, disp=0, rhoend=1e-10, maxfun=10000)
+
+ return abundances
+
+
+
+
+def similarity_correction_smp(sim, reads, N, subsets=1):
+ """ Calculate corrected abundances given a similarity matrix and observations
+ using optimization. Sample subsets of the reads to obtain a more stable
+ estimation of the abundance.
+
+ Input:
+ sim: [numpy.array (M,M)]: with pairwise similarities between species
+ reads [numpy.array (M,N)]: read mapping information about every species
+ N [int]: total number of reads
+ subsets [int]: divide the reads in subsets and use the median of the estimated abundances
+
+ Output:
+ abundances [numpy.array (M,)]: estimated relative abundance of each species in the sample
+ """
+ # Each subset consists of every subsets'th read. Count the number of matching reads
+ # per species in each subset
+ M = reads.shape[0]
+ reads_sbs = np.zeros((M, subsets))
+ for s in range(subsets):
+ reads_sbs[:,s] = np.sum(reads[:,s::subsets], axis=1)
+
+ N_sbs = np.ceil(float(N)/subsets)
+
+ # calculate the corrected abundances for each subset
+ sbs_corr = np.zeros( (M,subsets) )
+ for s in range(subsets):
+ sbs_corr[:,s] = similarity_correction(sim, reads_sbs[:,s], N_sbs)
+
+ abundances = np.median(sbs_corr,axis=1)
+
+ return abundances
+
+
+
+def bootstrap_similarity_matrix(mapped_reads):
+ """
+ Calculate a similarity matrix by bootstrapping.
+
+ INPUT:
+ mapped_reads: mapping information as generated by similarity_matrix_raw().
+
+ OUTPUT:
+ d_matrix: similarity matrix. Ordering is the same as in 'names' used in the
+ similarity_matrix_raw() function.
+ """
+ # get the number of sequences and simulated reads from the shape of the mapped reads matrix
+ num_reads = mapped_reads.shape[2]
+ num_seq = mapped_reads.shape[0]
+
+ # create a bootstrap index vector
+ bootstrap_indices = np.random.randint(num_reads, size=(num_reads,))
+
+ # count number of mapped reads in bootstrap sample
+ counts = np.sum( mapped_reads[:,:,bootstrap_indices], axis=2 )
+
+ # normalize read counts and build similarity matrix
+ s_matrix = np.zeros((num_seq,num_seq))
+ for i in range(num_seq):
+ for j in range(num_seq):
+ # similarity matrix entries as described in the paper
+ s_matrix[i,j] = float(counts[j,i])/counts[i,i]
+
+ return s_matrix
+
+
+
+def bootstrap(reads, smat_raw, B, test_c=0.01):
+ """
+ Similarity correction using a bootstrapping procedure for more robust corrections and error
+ estimates.
+ reads: [numpy.array (M,N)] array with mapping information; reads[m,n]==1, if read n mapped to
+ species m.
+ smat_raw: mapping information for similarity matrix. species have same ordering as reads array
+ B: Number of bootstrap samples
+ test_c: For testing: treat species as not present, if estimated concentration is below test_c.
+ """
+ # M: Number of species, N: Number of reads
+ M,N = reads.shape
+
+ # initialize arrays to store results
+ found = np.zeros( (B,M) )
+ corr = np.zeros( (B,M) )
+ fails = np.zeros( (B,M) )
+
+ for b in range(B):
+ msg = " bootstrapping %i of %i"%(b+1,B)
+ sys.stdout.write("\r" + " "*(len(msg)+5) + "\r")
+ sys.stdout.write(msg)
+ sys.stdout.flush()
+
+ # select a bootstrap sample
+ random_set = np.random.randint(N,size=N)
+
+ # count the number of matching reads in bootstrap sample
+ found[b,:] = np.sum(reads[:,random_set], axis=1)
+ #for r in range(N):
+ # found[b,:] += reads[:,random_set[r]]
+
+ # bootstrap a similarity matrix
+ smat = bootstrap_similarity_matrix(smat_raw)
+
+ # calculate abundances
+ corr[b,:] = similarity_correction(smat,found[b,:],N)
+
+ # check if the calculated abundance is below the test abundance
+ fails[b,:] = corr[b,:] < test_c
+ print
+ p_values = np.mean(fails, axis=0)
+ abundances = np.mean(corr, axis=0)
+ variances = np.var(corr, axis=0)
+ return p_values, abundances, variances
diff --git a/core/tools.py b/core/tools.py
new file mode 100644
index 0000000..6faa73f
--- /dev/null
+++ b/core/tools.py
@@ -0,0 +1,152 @@
+"""
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+import os
+
+# Names File Reader
+#
+# Reads a names file (plain text, one name per line) and returns an array of strings
+def read_names(filename):
+ f = open(filename,'r')
+ names = [nm.rstrip('\n\r') for nm in f]
+ f.close()
+ return names
+
+
+# Read Mappers
+#
+# Define caller functions for the read mappers here. These fuctions call the
+# read mappers on the command line with the correspondig parameters. Output
+# is a SAM file at the specified location. Each function takes 3 mandatory
+# string arguments:
+# index: name of the reference sequence or reference index
+# reads: name of the file containing the reads
+# out: name of the output SAM file
+# Additional mapper parameters may be speciefied:
+# param: string with parameters for the mapper [default: ""]
+#
+# Note: the mapper should be configured in that way, that it only reports ONE
+# match for each read (e.g. the best match).
+
+def run_bowtie2(index, reads, out, param=""):
+ command = "bowtie2 -U {reads} -x {index} -S {samfile} {param} --local -M 0 ".format(reads=reads, index=index, samfile=out, param=param)
+ print "Executing:",command
+ os.system(command)
+ return 1
+
+def run_bowtie(index, reads, out, param=""):
+ command = "bowtie -S -p 2 -q -3 30 -v 2 {param} {index} {reads} > {samfile}".format(index=index, reads=reads, samfile=out, param=param)
+ print "Executing:",command
+ os.system(command)
+ return 1
+
+def run_bwa(index, reads, out, param=""):
+ command = "bwa aln {param} {index} {reads} > /tmp/res.sai".format(index=index, reads=reads, param=param)
+ print "Executing:",command
+ os.system(command)
+ # convert the bwa output to SAM and remove temporary file
+ command_sam = "bwa samse %s /tmp/res.sai %s > %s && rm /tmp/res.sai"%(index,reads,out)
+ os.system(command_sam)
+ return 1
+
+def run_bwasw(index, reads, out, param=""):
+ command = "bwa bwasw {param} {index} {reads} > {samfile}".format(index=index, reads=reads, samfile=out, param=param)
+ print "Executing:",command
+ os.system(command)
+ return 1
+
+
+run_mapper = dict( bowtie=run_bowtie,
+ bowtie2=run_bowtie2,
+ bwa = run_bwa,
+ bwasw = run_bwasw,)
+
+"""
+How to add your custom mapper
+
+1. Create a caller function
+ - Create a copy of one of the existing functions, e.g. run_bowtie
+ - Rename it, customize it, but DO NOT TOUCH the interface!
+
+2. Add the caller function to the run_mapper dict
+ - The dict entry should have the format: [name] = [caller function]
+
+3. Now you can use your mapper:
+ >>> import tools_lib
+ >>> tools_lib.run_mapper["name"]
+"""
+
+
+
+# Read Simulators
+#
+# Define caller functions for the read simulators here. These fuctions call
+# the read simulators on the command line with the correspondig parameters.
+# Output is a SAM file at the specified location. Each function takes 3
+# mandatory string arguments:
+# ref: name of the reference sequence file
+# out: name of the output reads file
+#
+# Note: since simulators tend to have many tuning parameters, we encourage
+# to create a seperate caller function for every scenario.
+
+def run_mason_illumina(ref, out):
+ command = "mason illumina -N 10000 -hi 0 -hs 0 -n 72 -sq -o %s %s"%(out,ref)
+ print "Executing:",command
+ os.system(command)
+ # remove the needless SAM file
+ command = "rm %s.sam"%out
+ os.system(command)
+ return 1
+
+def run_dwgsim(ref, out):
+ command = "dwgsim -c 2 -1 80 -2 0 -r 0 -y 0 -e 0.002 -N 100000 -f TACG %s %s"%(ref, out)
+ print "Executing:",command
+ os.system(command)
+ # remove all additional files and rename reads file
+ command = "mv {out}.bfast.fastq {out} && rm {out}.bwa.read1.fastq && rm {out}.bwa.read2.fastq && rm {out}.mutations.txt".format(out=out)
+ os.system(command)
+ return 1
+
+
+run_simulator = dict( mason_illumina=run_mason_illumina,
+ dwgsim=run_dwgsim,)
+
+"""
+How to add your custom read simulator
+
+1. Create a caller function
+ - Create a copy of one of the existing functions, e.g. run_dwgsim
+ - Rename it, customize it, but DO NOT TOUCH the interface!
+
+2. Add the caller function to the run_simulator dict
+ - The dict entry should have the format: [name] = [caller function]
+
+3. Now you can use your simulator:
+ >>> import tools_lib
+ >>> tools_lib.run_simulator["name"]
+"""
diff --git a/correct_abundances.py b/correct_abundances.py
new file mode 100755
index 0000000..db367db
--- /dev/null
+++ b/correct_abundances.py
@@ -0,0 +1,177 @@
+#!/usr/bin/python
+"""
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+import numpy as np
+import pysam
+import sys
+import glob
+import optparse
+
+from core import gasic
+from core import tools
+
+
+def similarity_correction(names, smat_raw, sam_pattern, bootstrap_samples):
+ """
+ Perform similarity correction step. The similarity matrix and mapping
+ results must be available.
+
+ INPUT:
+ names: array of genome names
+ smat_raw: mapping information for similarity matrix with same ordering of genomes as in 'names'
+ sam_pattern: pattern pointing to the filenames of the SAM-files to analyze
+ bootstrap_samples: number of bootstrap samples, use 1 to disable bootstrapping
+
+ OUTPUT:
+ total: total number of reads in the dataset
+ num_reads: number of reads mapped to each genome (array)
+ corr: abundance of each genome after similarity correction
+ err: estimated standard error
+ p: p-value for the confidence, that the true abundance is above some threshold
+ """
+
+ print "\n--- GASiC correction\n"
+ # find out the number of reads
+ total = len( [1 for read in pysam.Samfile(sam_pattern%(names[0]), "r")] )
+ print " found %i reads"%total
+
+ # initialize some arrays
+ # mapping information; mapped[i,j]=1 if read j was successfully mapped to i.
+ mapped = np.zeros( (len(names), total) )
+
+ # total number of successfully mapped reads per reference
+ num_reads = np.zeros( (len(names),) )
+
+ print "\n--- Analyzing SAM files\n"
+ # analyze the SAM files
+ for n_ind,nm in enumerate(names):
+ msg = " SAM file %i of %i"%(n_ind+1,len(names))
+ sys.stdout.write("\r" + " "*(len(msg)+5) + "\r")
+ sys.stdout.write(msg)
+ sys.stdout.flush()
+
+ # open SAM file
+ sf = pysam.Samfile(sam_pattern%nm, "r")
+
+ # go through reads in samfile and check if it was successfully mapped
+ mapped[n_ind,:] = np.array([int(not rd.is_unmapped) for rd in sf])
+ num_reads[n_ind] = sum(mapped[n_ind,:])
+
+ print "\n\n--- Bootstrapping abundances\n"
+ # run similarity correction step
+ p,corr,var = gasic.bootstrap(mapped, smat_raw, bootstrap_samples)
+ err = np.sqrt(var)
+
+ print "\nDone estimating abundances\n"
+ return total,num_reads,corr,err,p
+
+
+def unique(names, sam_pattern):
+ """ Determine the number of unique reads for every species based on the read names.
+ INPUT:
+ names: array of genome names
+ sam_pattern: pattern pointing to the filenames of the SAM-files to analyze
+
+ OUTPUT:
+ unique: number of unique reads per species.
+ """
+ # one set for the names of mapped reads for each species
+ mapped_read_names = [set() for nm in names]
+
+ for n,nm in enumerate(names):
+ # parse the samfile
+ sf = pysam.Samfile(sam_pattern%nm, "r")
+ for read in sf:
+ # add the hash of read name to the set, if read was mapped
+ if not read.is_unmapped:
+ mapped_read_names[n].add(hash(read.qname))
+
+ unique_read_names = [set() for nm in names]
+ for n in range(len(names)):
+ others = set()
+ for m in range(len(names)):
+ if n!=m:
+ others |= mapped_read_names[m]
+ unique_read_names[n] = mapped_read_names[n] - others
+
+ return np.array([len(unq) for unq in unique_read_names])
+
+
+if __name__=="__main__":
+ usage = """%prog NAMES
+
+Run the similarity correction step.
+
+
+Note: Although it is possible to run the read mappers by hand or to create the
+similarity matrix manually, we strongly recommend to use the provided Python
+scripts 'run_mappers.py' and 'create_similarity_matrix.py'.
+
+Input:
+NAMES: Filename of the names file; the plain text names file should
+ contain one name per line. The name is used as identifier in
+ the whole algorithm.
+
+See the provided LICENSE file or source code for license information.
+"""
+
+ # configure the parser
+ parser = optparse.OptionParser(usage=usage)
+ parser.add_option('-m', '--similarity-matrix', type='string', dest='smat', default='./similarity_matrix.npy', help='Path to similarity matrix file. The similarity matrix must be created with the same NAMES file. [default: %default]')
+ parser.add_option('-s', '--samfiles', type='string', dest='sam', default='./SAM/%s.sam', help='Pattern pointing to the SAM files created by the mapper. Placeholder for the name is "%s". [default: %default]')
+
+ parser.add_option('-b', '--bootstrap-samples', type='int', dest='boot', default=100, help='Set the number of bootstrap samples. Use 1 to disable bootstrapping [default: %default]')
+ parser.add_option('-o', '--output', type='string', dest='out', default='./results.txt', help='Plain text output file containing the results. [default: %default]')
+ # parse arguments
+ opt, args = parser.parse_args()
+
+ numArgs = len(args)
+ if numArgs == 1:
+ # read the Names file
+ names_file = args[0]
+ names = tools.read_names(names_file)
+
+ # load the similarity matrix
+ smat = np.load(opt.smat)
+
+ # start similarity correction
+ total,num_reads,corr,err,p = similarity_correction(names, smat, opt.sam, opt.boot)
+
+ # write results into tab separated file.
+ ofile = open(opt.out,'w')
+ ofile.write("#genome name\tmapped reads\testimated reads\testimated error\tp-value\n")
+ for n_ind,nm in enumerate(names):
+ # Name, mapped reads, corrected reads, estimated error, p-value
+ out = "{name}\t{mapped}\t{corr}\t{error}\t{pval}\n"
+ ofile.write(out.format(name=nm,mapped=num_reads[n_ind],corr=corr[n_ind]*total,error=err[n_ind]*total,pval=p[n_ind]))
+ ofile.close()
+ print "--- wrote results to", opt.out
+
+ else:
+ parser.print_help()
+ sys.exit(1)
diff --git a/create_matrix.py b/create_matrix.py
new file mode 100755
index 0000000..fa412a6
--- /dev/null
+++ b/create_matrix.py
@@ -0,0 +1,158 @@
+#!/usr/bin/python
+"""
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+import sys
+import os
+import glob
+import optparse
+
+import numpy as np
+from Bio import SeqIO
+import pysam
+
+from core import tools
+
+
+def similarity_matrix_raw(names, ref_pattern, index_pattern, temp_dir, simulator, mapper):
+ """
+ Perform the read generation and mapping step for the similarity matrix
+ calculation. The raw output is used to bootstrap a similarity matrix.
+
+ INPUT:
+ names: array of genome names
+ ref_pattern: pattern pointing to the reference genome files used by the read simulator
+ index_pattern: pattern pointing to the mapping index files used by the read mapper
+ temp_dir: directory where temporary files are stored (simulated reads, SAM files).
+ Attention: make sure that there is enough space available!
+
+ OUTPUT:
+ mapped_reads: Mapping information about mapped reads readily usable by
+ bootstrap_similarity_matrix().
+ """
+
+ if not mapper in tools.run_mapper:
+ raise Exception('Aborting. Mapper "%s" not found in tools.py'%mapper)
+ if not simulator in tools.run_simulator:
+ raise Exception('Aborting. Simulator "%s" not found in tools.py'%simulator)
+ if not os.path.isdir(temp_dir):
+ os.makedirs(temp_dir)
+
+ n_seq = len(names)
+ rng = range(n_seq)
+
+ # construct arrays with real file names from pattern
+ ref_files = [ ref_pattern%nm for nm in names ] # filenames of reference sequences
+ index_files = [ index_pattern%nm for nm in names ] # filenames of mapper index files
+ sim_files = [ temp_dir+'/'+nm+'.fastq' for nm in names ] # filenames of simulated read files
+
+ print "\n--- Simulating reads for each reference genome with %s\n"%simulator
+ # generate reads for every reference genome
+ for i in rng:
+ tools.run_simulator[simulator](ref_files[i], sim_files[i])
+
+ # find out how many reads were generated
+ # Attention: Here we assume that all files contain the same number of reads
+ # and are stored in fastq format
+ num_reads = len( [ True for i in SeqIO.parse(sim_files[0],'fastq') ] )
+
+ print "\n--- Mapping simulated reads to reference genomes with %s\n"%mapper
+ # map the reads of every reference to all references
+ for i in rng:
+ for j in rng:
+ samfile = temp_dir+'/'+names[i]+'-'+names[j]+'.sam'
+ tools.run_mapper[mapper](index_files[j], sim_files[i], samfile)
+
+ print "\n--- Parsing SAM files\n"
+ # parse SAM files
+ mapped_reads = np.zeros((n_seq,n_seq, num_reads))
+ for i in rng:
+ for j in rng:
+ msg = " SAM file %i of %i"%(i*n_seq+j+1,n_seq*n_seq)
+ sys.stdout.write("\r" + " "*(len(msg)+5) + "\r")
+ sys.stdout.write(msg)
+ sys.stdout.flush()
+
+ # count the reads in i mapping to refernce j
+ samfile = temp_dir+'/'+names[i]+'-'+names[j]+'.sam'
+ samhandle = pysam.Samfile(samfile, "r")
+ mapped_reads[i,j,:] = np.array( [int(not read.is_unmapped) for read in samhandle] )
+
+ print "\n\nMatrix creation done\n"
+
+ return mapped_reads
+
+
+
+
+
+if __name__=="__main__":
+ usage = """%prog [options] NAMES
+
+Calculate the similarity matrix.
+
+First, a set of reads is simulated for every reference genome using a read
+simulator from core/tools.py specified via -s.
+Second, the simulated reads of each species are mapped against all reference
+genomes using the mapper specified with -m.
+Third, the resulting SAM-files are analyzed to calculate the similarity
+matrix. The similarity matrix is stored as a numpy file (-o).
+
+Input:
+NAMES: Filename of the names file; the plain text names file should
+ contain one name per line. The name is used as identifier in
+ the whole algorithm.
+
+See the provided LICENSE file or source code for license information.
+"""
+
+ # configure the parser
+ parser = optparse.OptionParser(usage=usage)
+ parser.add_option('-s', '--simulator', type='string', dest='simulator', default=None, help='Identifier of read simulator defined in core/tools.py [default: %default]')
+ parser.add_option('-r', '--reference', type='string', dest='ref', default='./ref/%s.fasta', help='Reference sequence file pattern for the read simulator. Placeholder for the name is "%s". [default: %default]')
+ parser.add_option('-m', '--mapper', type='string', dest='mapper', default=None, help='Identifier of mapper defined in core/tools.py [default: %default]')
+ parser.add_option('-i', '--index', type='string', dest='index', default='./ref/%s.fasta', help='Reference index files for the read mapper. Placeholder for the name is "%s". [default: %default]')
+ parser.add_option('-t', '--temp', type='string', dest='temp', default='./temp', help='Directory to store temporary simulated datasets and SAM files. [default: %default]')
+ parser.add_option('-o', '--output', type='string', dest='out', default='./similarity_matrix.npy', help='Output similarity matrix file. [default: %default]')
+ # parse arguments
+ opt, args = parser.parse_args()
+
+ numArgs = len(args)
+ if numArgs == 1:
+ # read the Names file
+ names_file = args[0]
+ names = tools.read_names(names_file)
+
+ # construct the similarity matrix
+ smat = similarity_matrix_raw(names, opt.ref, opt.index, opt.temp, opt.simulator, opt.mapper)
+
+ # save the similarity matrix
+ np.save(opt.out, smat)
+ print "Wrote similarity matrix to",opt.out
+ else:
+ parser.print_help()
+ sys.exit(1)
diff --git a/doc/MANUAL b/doc/MANUAL
new file mode 100644
index 0000000..493cec2
--- /dev/null
+++ b/doc/MANUAL
@@ -0,0 +1,183 @@
+GASiC Overview
+--------------
+
+GASiC (Genome Abundance Similarity Correction) analyzes results
+obtained from aligning sequence reads against a set of reference
+sequences and estimates the abundance of each reference in the
+dataset.
+
+
+1. Input / Format Prerequisites
+-------------------------------
+
+In addition to the raw sequence read dataset and the reference
+genomes, GASiC requires a read alignment/mapping tool and a read
+simulator installed.
+
+The reference genomes must be stored in separate files with a
+unique, brief filename entitling the content of the file, e.g.
+'e.coli.fasta'. The format is not fixed, but must be compatible
+to the read mapper and read simulator.
+
+If the read mapper requires a reference index (e.g. bowtie), the
+index must have the same base name as the reference genome file,
+e.g. 'e.coli.index'.
+
+The raw data requires no special format; everything is possible,
+if the read mapper is able to read the file.
+
+The read alignment/mapping tool should be appropriate for the
+raw data, i.e. do not use a short read mapper to align Sanger
+reads! The output of the read mapper must be in the SAM format.
+Most tools natively support SAM output or provide scripts to
+convert their output to SAM format (e.g. SAMSE for BWA).
+
+The read simulator should be able to simulate reads with the
+same characteristics as in the raw data. The output format must
+be readable by the alignment/mapping tool.
+
+
+2. Command Line Interface
+-------------------------
+
+One way to use GASiC is the command line interface. This
+requires the least knowledge about programming and Linux.
+
+2.1 Names File
+--------------
+
+One central element for practically using GASiC via the command
+line interface is the so called Names File. The Names File
+contains the brief uniqe names identifying the reference
+genomes involved in the analysis. The names are listed as plain
+text, using one name per line. An example Names File could look
+like (without the dashed lines):
+
+--- Names File ---
+e.coli
+EHEC
+shigella
+--- End of file ---
+
+This has the advantage, that you only need to pass the Names
+File and a general path pattern to the scripts. Let us make an
+example. Say, you stored your reference sequences in the
+directory 'references/' as 'e.coli.fasta', 'EHEC.fasta', and
+'shigella.fasta'. The mapper index files are stored accordingly
+in 'references/map_index/'. Instead of passing the full
+paths of all files to the GASiC scripts, you only need to pass
+the names file and the pattern string 'references/%s.fasta'
+for the references and 'references/map_index/%s.index' for the
+index files. '%s' is the placeholder for the name listed in
+the index file.
+
+
+2.2 Read Alignment/Mapping
+--------------------------
+
+The first stage of GASiC is to map/align the reads to the
+reference genomes. Although this can also be done by hand, we
+recommend to use our script 'run_mappers.py'.
+
+'run_mappers.py' can be called from the command line via:
+ python run_mappers.py
+Since no parameters were passed, this will simply show the
+help text and exit. The behaviour is controlled by the
+parameters described in the help text.
+
+The general call looks like this:
+ python run_mappers.py -m [mapper] -i [index] -o [output] [names]
+
+[mapper] is a string identifying which mapping tool is used.
+The identifier must be associated to a call function in the
+GASiC file 'core/gasic.py'. More information can be found
+there. If you specify 'bowtie', the bowtie mapper must be
+installed on your system.
+
+[index] is the string pattern pointing to the reference
+genome index files. For example: 'references/index/%s.idx'.
+
+[output] is the string pattern pointing to the output files
+created by the mapping tool. For example: 'SAM/%s.sam'.
+
+[names] is the path to the Names File, e.g. 'names.txt'.
+
+
+2.3 Similarity Estimation
+-------------------------
+
+The genome similarity is estimated with the GASiC script
+'create_matrix.py'. It is called via
+ python create_matrix.py -s [simulator] -r [reference] -m [mapper] -i [index] -t [temp] -o [output] [names]
+
+[simulator] is the identifier string for the read simulator,
+see the [mapper] description above.
+
+[reference] is the string pattern pointing to the reference
+sequences serving as input for the read simulator, e.g.
+'references/%s.fasta'
+
+[mapper] see above.
+
+[index] see above.
+
+[temp] is the path of a temporary directory, where the
+simulated reads and intermediate SAM files can be stored.
+For example: '/tmp'.
+
+[output] is the filename of the output similarity matrix
+file. The file is in the NumPy format. Example:
+'similarity_matrix.npy'
+
+[names] Names File, see above.
+
+
+2.4 Similarity Correction
+-------------------------
+
+The GASiC script 'correct_abundances.py' estimates the
+true abundances for each reference genome and calculates
+p-values. The script is called via
+ python correct_abundances.py -m [matrix] -s [samfiles] -b [bootstrap] -o [output] [names]
+
+[matrix] is the filename of the similarity matrix file
+obtained in the previous step. For example
+'similarity_matrix.npy'.
+
+[samfiles] is a string pointing to the SAM files created
+in the mapping step. For example: 'SAM/%s.sam'.
+
+[bootstrap] is the number of bootstrap repetitions of
+the experiment. This highly influences the runtime.
+For example: '100'
+
+[output] is the filename where the results will be saved.
+For example: 'results.txt'.
+
+
+3. Scripting Interface
+----------------------
+
+The command line scripts described in the previous
+section can also be imported in python, such that they
+can easily be used to create more sophisticated analysis
+scripts. One example script is provided in the file
+'doc/example_script.py'. The script is documented,
+please find information there.
+
+
+4. Developer Information
+------------------------
+
+Since the GASiC source code is freely available and well
+documented, users can easily extend GASiC or tune it to
+their needs. This may become necessary when very large
+datasets need to be analyzed and for example parallel
+computation needs to be involved, or the calculation of
+the similarity matrix needs to be accelerated and some
+prior knowledge about the genome similarities needs to
+be included.
+
+GASiC is published under a BSD-like license and can thus
+be used, modified, and distributed easily. See the
+LICENSE file for more information.
\ No newline at end of file
diff --git a/doc/example_script.py b/doc/example_script.py
new file mode 100644
index 0000000..88dc372
--- /dev/null
+++ b/doc/example_script.py
@@ -0,0 +1,172 @@
+"""
+File: example_script.py
+
+Description:
+ This file contains commented analysis script demonstrating how GASiC
+ can be used via the scripting interface. The contained code only has
+ exemplary character and is not meant to be executed directly.
+
+Author: Martin Lindner, LindnerM at rki.de
+"""
+
+
+"""
+ Example 1:
+ ----------
+
+ Simple GASiC workflow. Reproduces the command line tool
+ functionalities.
+
+ Assumptions:
+ - Data: FASTQ reads in './data/reads.fastq'
+ - Reference genomes: FASTA files in './ref/'
+ - Mapper index: Index files in './ref/index/'
+ - bowtie mapper
+ - Mason simulator (configured for illumina reads, "mason_illumina")
+ - Names File in 'names.txt'
+"""
+
+# load module for system operations
+import os
+
+
+# load GASiC modules
+from core import tools
+import create_matrix
+import correct_abundances
+
+
+# set all paths and patterns
+p_data = './data/reads.fastq'
+p_ref = './ref/%s.fasta'
+p_index = './ref/index/%s'
+p_SAM = './SAM/%s.sam'
+p_temp = './distance'
+
+
+# create directories for SAM files and temporary files
+os.makedirs(p_SAM)
+os.makedirs(p_temp)
+
+
+# read the Names File
+names = tools.read_names('names.txt')
+
+
+# Step 1: map the reads to every reference genome
+for name in names:
+ # fill the current name into the pattern of reference index and SAM path
+ index_name = p_index%name
+ SAM_name = p_SAM%name
+
+ # run the read mapper
+ tools.run_mapper['bowtie'](index_name, p_data, SAM_name)
+
+
+# Step 2: calculate the similarity matrix
+sim_matrix = create_matrix.similarity_matrix_raw(names, p_ref, p_index, p_temp, 'mason_illumina', 'bowtie')
+
+
+# Step 3: estimate the true abundances
+total,mapped,est,err,p = correct_abundances.similarity_correction(names, sim_matrix, p_SAM, 100)
+
+print "Finished."
+
+
+
+
+"""
+ Example 2:
+ ----------
+
+ Apply GASiC to 5 datasets, e.g. containing repetitions of the
+ same experiment
+
+ Assumptions:
+ - Data: FASTQ reads in './data/' with names 'rep0.fastq' - 'rep5.fastq'
+ - Reference genomes: FASTA files in './ref/'
+ - Mapper index: Index files in './ref/index/'
+ - a custom mapper
+ - Mason simulator (configured for illumina reads, "mason_illumina")
+ - Names File in 'names.txt'
+"""
+
+# load module for system operations
+import os
+
+
+# load GASiC modules
+from core import tools
+import create_matrix
+import correct_abundances
+
+
+# set all paths and patterns
+p_data = './data/rep%i.fastq'
+p_ref = './ref/%s.fasta'
+p_index = './ref/index/%s'
+p_temp = './distance'
+
+
+# create directories for SAM files and temporary files
+os.makedirs('./SAM')
+os.makedirs(p_temp)
+
+
+# read the Names File
+names = tools.read_names('names.txt')
+
+
+# define a wrapper function for 'my_mapper'
+def run_my_mapper(index, reads, out, param=""):
+ command = "my_mapper -d some_default {param} {index} {reads} > {samfile}".format(index=index, reads=reads, samfile=out, param=param)
+ print "Executing:",command
+ os.system(command)
+ return 1
+
+
+# register wrapper function temporarily as 'my_mapper'
+tools.run_mapper['my_mapper'] = run_my_mapper
+
+
+# calculate the similarity matrix only once
+sim_matrix = create_matrix.similarity_matrix_raw(names, p_ref, p_index, p_temp, 'mason_illumina', 'my_mapper')
+
+
+# set up numpy arrays to store all results
+num_genomes = len(names)
+import numpy as np
+total = np.zeros((5,))
+mapped= np.zeros((5,num_genomes))
+est = np.zeros((5,num_genomes))
+err = np.zeros((5,num_genomes))
+p = np.zeros((5,num_genomes))
+unique= np.zeros((5,num_genomes))
+
+# iterate over all datasets
+for it in range(5):
+ dataset = p_data%it
+
+ # map the reads to every reference genome
+ for name in names:
+ # fill the current name into the pattern of reference index and SAM path
+ index_name = p_index%name
+ SAM_name = './SAM/rep%i_%s.sam'%(it,name)
+
+ # run the read mapper
+ tools.run_mapper['bowtie'](index_name, dataset, SAM_name)
+
+ # estimate the true abundances and count unique reads
+ SAM_pattern = './SAM/rep{num}_%s.sam'.format(num=it)
+ total[it],mapped[it,:],est[it,:],err[it,:],p[it,:] = correct_abundances.similarity_correction(names, sim_matrix, SAM_pattern, 100)
+ unique[it,:] = correct_abundances.unique(names, SAM_pattern)
+
+# GASiC calculation finished. Now we can post-process the results
+# for example, calculate mean and variance for each genome over all datasets
+mean_est = np.mean(est,axis=0)
+var_est = np.var(est,axis=0)
+
+print mean_est
+print var_est
+
+print "Finished."
diff --git a/quality_check.py b/quality_check.py
new file mode 100755
index 0000000..1a79cf6
--- /dev/null
+++ b/quality_check.py
@@ -0,0 +1,131 @@
+#!/usr/bin/python
+"""
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+import optparse
+import glob
+import os
+import sys
+try:
+ import matplotlib.pyplot as plt
+except ImportError:
+ raise Exception('Import Error: Failed to load matplotlib module. Please install matplotlib properly (http://matplotlib.sourceforge.net/).')
+
+try:
+ import pysam
+except ImportError:
+ raise Exception('Import Error: Failed to load pysam module. Please install pysam properly (http://code.google.com/p/pysam/).')
+
+import numpy as np
+import scipy.stats as st
+from core import tools
+
+if __name__=="__main__":
+ usage = """%prog [Options] SAMFILE
+
+Perform a sanity check on the mapping results (SAM file).
+
+This tool analyzes the output of the read mapper and provides useful
+information to the user to decide whether the set of reference sequences
+and the mapper settings are feasible for the given dataset.
+
+Input:
+SAMFILE: Name of the SAM file to analyze
+
+"""
+
+ # configure the parser
+ parser = optparse.OptionParser(usage=usage)
+ parser.add_option('-o', '--outpath', type='string', dest='out', default='./', help='Path to the directory for the analysis output. [default: %default]')
+ # parse arguments
+ options, args = parser.parse_args()
+
+ numArgs = len(args)
+ if numArgs >= 1:
+ sam_names = args
+
+ # create output directory if necessary
+ if not os.path.exists(os.path.dirname(options.out)):
+ os.makedirs(os.path.dirname(options.out))
+
+ # analyze all files and gather statistics
+ for i,nm in enumerate(sam_names):
+ dataset_name = os.path.splitext(os.path.split(nm)[1])[0]
+ sf = pysam.Samfile(nm,'r')
+ cov = np.zeros( (sum(sf.lengths),) )
+ start_pos = np.cumsum(sf.lengths)-sf.lengths[0]
+ total_reads = 0
+ mapped_reads = 0
+ for read in sf:
+ total_reads += 1
+ if not read.is_unmapped:
+ r_start = start_pos[read.tid] + read.pos
+ r_end = start_pos[read.tid] + read.pos + read.qlen
+ cov[r_start:r_end] += 1
+ mapped_reads += 1
+
+ # calculate coverage
+ max_cov = np.max(cov)
+ mean_cov = np.mean(cov)
+
+ # plot the coverage
+ plt.figure()
+ h = plt.hist(cov, bins=min(max_cov,100), range=(0,max_cov))
+ plt.xlabel('coverage')
+ plt.ylabel('# occurrences')
+ plt.title('Coverage histogram for '+dataset_name)
+ plt.savefig(options.out+'/'+dataset_name+'_coverage.png', dpi=300)
+
+ # write out user information
+ f = open(options.out+'/'+dataset_name+'_info.txt','w')
+ f.write(' Name:\t'+nm+'\n')
+ f.write('Total Genome Length:\t%i\n'%(sum(sf.lengths)))
+ f.write('Num. Contigs:\t%i\n\n'%(len(sf.lengths)))
+ f.write('Mapping Results:\n')
+ f.write('Total Reads:\t%i\n'%(total_reads))
+ f.write('Mapped Reads:\t%i\n'%(mapped_reads))
+ f.write('Average Coverage:\t%f\n'%(mean_cov))
+ zero_cov = h[0][0]/float(sum(h[0]))
+ f.write('Fraction of bases with 0 Coverage:\t%f\n\n'%(zero_cov) )
+
+ f.write('Comments:\n')
+ if mean_cov < 1.:
+ f.write('Low overall coverage. ')
+ if mapped_reads/float(total_reads)>0.01:
+ f.write('Consider using a larger dataset.\n')
+ else:
+ f.write('Abundance of this or related sequences is below 0.01.\n')
+
+ p = st.poisson(mean_cov)
+ if zero_cov > 2*p.pmf(0):
+ f.write('Unnaturally many bases with zero coverage. Consider that this species is not present in the dataset.\n')
+
+ f.close()
+ else:
+ parser.print_help()
+ sys.exit(1)
+
diff --git a/run_mappers.py b/run_mappers.py
new file mode 100755
index 0000000..41114da
--- /dev/null
+++ b/run_mappers.py
@@ -0,0 +1,85 @@
+#!/usr/bin/python
+"""
+Copyright (c) 2012, Martin S. Lindner, LindnerM at rki.de,
+Robert Koch-Institut, Germany,
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+ * The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MARTIN S. LINDNER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+import optparse
+import glob
+import os
+import sys
+
+from core import tools
+
+if __name__=="__main__":
+ usage = """%prog NAMES READS -i REF -o OUT -m MAPPER
+
+Run a read mapper to map reads to reference genomes.
+
+The names in the NAMES file will be inserted in the provided string
+patterns. Each pattern must contain exactly one "%s" placeholder
+(python string formatting).
+
+Input:
+NAMES: Filename of the names file; the plain text names file should
+ contain one name per line. The name is used as identifier in
+ the whole algorithm.
+
+READS: File containing the reads to be mapped.
+
+"""
+
+ # configure the parser
+ parser = optparse.OptionParser(usage=usage)
+ parser.add_option('-m', '--mapper', type='string', dest='mapper', default=None, help='Identifier of mapper defined in core/tools.py [default: %default]')
+ parser.add_option('-i', '--index', type='string', dest='ref', default='./ref/%s.fasta', help='Pattern, that points to the reference sequences/indices when used with a name. Placeholder for the name is "%s". [default: %default]')
+ parser.add_option('-o', '--output', type='string', dest='out', default='./SAM/%s.sam', help='Pattern, that points to the output SAM file, when used with a name. Placeholder for the name is "%s". [default: %default]')
+ # parse arguments
+ options, args = parser.parse_args()
+
+ numArgs = len(args)
+ if numArgs == 2:
+ reads = args[1]
+ names = tools.read_names(args[0])
+ out = [options.out%nm for nm in names]
+ ref = [options.ref%nm for nm in names]
+
+ if not options.mapper:
+ parser.print_help()
+ raise Exception('Aborting. No mapper specified.')
+
+ # create output directory if necessary
+ if not os.path.exists(os.path.dirname(options.out)):
+ os.makedirs(os.path.dirname(options.out))
+
+ for i in range(len(names)):
+ msg = "--- mapping reads with %s to %s"%(options.mapper, ref[i])
+ print msg
+ tools.run_mapper[options.mapper](ref[i], reads, out[i])
+ print "\nMapping done.\n"
+ else:
+ parser.print_help()
+ sys.exit(1)
+
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/gasic.git
More information about the debian-med-commit
mailing list