[Debian-med-packaging] conservation-code (Predicting functionally important residues from sequence conservation) in Debian

Wed Oct 10 14:21:15 UTC 2012

Dear John and Mona!

I am Laszlo Kajan, a bioinformatician, Debian Maintainer and member of the Debian Med Packaging Group [2].

Thank you for making the code for your published method conservation-code available under a free license [1].

I am in the process of packaging conservation-code/score_conservation for Debian [4]. I have prepared a man page for score_conservation and
a patch, both attached below. Please consider including them in your version if you find them valuable.

I noticed a discrepancy between the publication [3] and the packaged score_conservation: PSEUDOCOUNT is 10E-7 in the Python script, but 10E-6 in
the publication ('Estimating probabilities'). I assume there is a typo in the paper.

Thank you for supporting the bioinformatics community with your free software tools.

Best regards,

Laszlo Kajan

[1] http://compbio.cs.princeton.edu/conservation/

[2] http://www.debian.org/devel/debian-med/

[3] http://bioinformatics.oxfordjournals.org/content/23/15/1875.full

[4] Who uses Debian?: http://wiki.debian.org/DebianScience/UpstreamAuthorsContactingQA

(upstream authors John A. Capra and Mona Singh are Bcc'd on this email)
-------------- next part --------------
Author: Laszlo Kajan <lkajan at rostlab.org>
Description: fixes to executable
Forwarded: no

--- a/score_conservation
+++ b/score_conservation
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python
 
 ################################################################################
 # score_conservation.py - Copyright Tony Capra 2007 - Last Update: 03/09/11
@@ -98,7 +98,7 @@
 
 
 def usage():
-    print """\nUSAGE:\npython score_conservation.py [options] alignfile\n\t -alignfile must be in fasta or clustal format.\n\nOPTIONS:\n\t
+    print """\nUSAGE:\nscore_conservation [options] alignfile\n\t -alignfile must be in fasta or clustal format.\n\nOPTIONS:\n\t
     -a\treference sequence. Print scores in reference to a specific sequence (ignoring gaps). Default prints the entire column. [sequence name]\n\t
     -b\tlambda for window heuristic linear combination. Default=.5 [real in [0,1]]\n
     -d\tbackground distribution file, e.g., swissprot.distribution. Default=BLOSUM62 background [filename]\n\t
@@ -136,8 +136,9 @@
 
 	aa_num += 1
 
+    freqsum = (sum(seq_weights) + len(amino_acids) * pc_amount)
     for j in range(len(freq_counts)):
-	freq_counts[j] = freq_counts[j] / (sum(seq_weights) + len(amino_acids) * pc_amount)
+	freq_counts[j] = freq_counts[j] / freqsum
 
     return freq_counts
 
@@ -718,7 +719,7 @@
 window_size = 3 # 0 = no window
 win_lam = .5 # for window method linear combination
 outfile_name = ""
-s_matrix_file = "matrix/blosum62.bla"
+s_matrix_file = "/usr/share/conservation-code/matrix/blosum62.bla"
 bg_distribution = blosum_background_distr[:]
 scoring_function = js_divergence
 use_seq_weights = True
@@ -790,7 +791,7 @@
 	if arg == 'shannon_entropy': scoring_function = shannon_entropy
 	elif arg == 'property_entropy': scoring_function = property_entropy
 	elif arg == 'property_relative_entropy': scoring_function = property_relative_entropy
-	elif arg == 'vn_entropy': scoring_function = vn_entropy; from numarray import *; import numarray.linear_algebra as la
+	elif arg == 'vn_entropy': scoring_function = vn_entropy; from numpy.numarray import *; import numpy.numarray.linear_algebra as la
 
 	elif arg == 'relative_entropy': scoring_function = relative_entropy
 	elif arg == 'js_divergence': scoring_function = js_divergence
-------------- next part --------------
=pod

=head1 NAME

score_conservation - score protein sequence conservation

=head1 SYNOPSIS

score_conservation [options] ALIGNFILE

=head1 DESCRIPTION

Score protein sequence conservation in B<ALIGNFILE>.  B<ALIGNFILE> must be in FASTA or CLUSTAL format.

The following conservation scoring methods are implemented:
 * sum of pairs
 * weighted sum of pairs
 * Shannon entropy
 * Shannon entropy with property groupings (Mirny and Shakhnovich 1999,
   Valdar and Thornton 2001)
 * relative entropy with property groupings (Williamson 1995)
 * von Neumann entropy (Caffrey et al 2004)
 * relative entropy (Wang and Samudrala 2006)
 * Jensen-Shannon divergence (Capra and Singh 2007)

A window-based extension that incorporates the estimated conservation of
sequentially adjacent residues into the score for each column is also given.
This window approach can be applied to any of the conservation scoring
methods.

With default parameters, B<score_conservation> computes the conservation scores for the alignment using the
Jensen-Shannon divergence and a window size (B<-w>) of I<3>.
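
As a rough illustration of the default method, the Jensen-Shannon divergence between the amino acid distribution of a column and a background distribution can be sketched as below. This is a simplified sketch and not the code in score_conservation: the function name js_divergence_sketch and the base-2 logarithm are assumptions made here, and sequence weighting, pseudocounts, the window and the gap penalty are all left out.

```python
import math


def js_divergence_sketch(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Hypothetical helper for illustration only. p and q are sequences of
    probabilities over the same alphabet, each summing to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")

    # Midpoint distribution r = (p + q) / 2.
    r = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Average of the two divergences from the midpoint distribution.
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)
```

With the base-2 logarithm the value lies between 0 (identical distributions) and 1 (distributions with no overlap), so a column whose composition differs strongly from the background gets a high score.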

The sequence-specific output can be used as the conservation input for
concavity.

Conservation is highly predictive in identifying catalytic sites and
residues near bound ligands.

=head1 REFERENCES

=over

=item Capra JA and Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics, 23(15):1875-82, 2007.

=back

=head1 OPTIONS

=over

=item -a [NAME]

Reference sequence. Print scores in reference to the named sequence (ignoring gaps). Default prints the entire column.

=item -b [0-1]

Lambda for window heuristic linear combination. Default=I<.5>.

Equation:

C<score = (1 - lambda) * average_score_over_window_around_middle + lambda * score_of_middle>
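
The equation above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the packaged implementation: the function name window_score_sketch is hypothetical, the average is taken over the neighbouring columns without the middle one, and columns closer than the window size to either end of the alignment are left unchanged.

```python
def window_score_sketch(scores, window, lam=0.5):
    """Blend each column score with the mean score of its neighbours.

    Hypothetical helper for illustration only.
    scores -- per-column conservation scores
    window -- number of columns on either side of the middle column
    lam    -- weight of the middle column, a real number in [0, 1]
    """
    scores = list(scores)
    smoothed = list(scores)
    for i in range(window, len(scores) - window):
        # Neighbours on either side, excluding the middle column (assumption).
        neighbours = scores[i - window:i] + scores[i + 1:i + window + 1]
        if not neighbours:
            continue  # window of 0: nothing to average, keep the raw score
        average = sum(neighbours) / len(neighbours)
        smoothed[i] = (1 - lam) * average + lam * scores[i]
    return smoothed
```

With lam set to 1 the raw scores are returned unchanged, and with lam set to 0 each interior score is replaced by the mean of its neighbours.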

=item -d [FILE]

Background distribution file, e.g. F<distributions/swissprot.distribution>. Default=built-in BLOSUM62.

=item -g [0-1)]

Gap cutoff. Do not score columns in which the fraction of gaps exceeds the gap cutoff. Default=I<.3>.
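
The gap cutoff test can be pictured with the small sketch below. The helper names gap_fraction and should_score are hypothetical, and the sketch assumes that gaps are written as '-' and that every sequence counts equally; the packaged script may count gaps differently.

```python
def gap_fraction(column, gap_char="-"):
    """Fraction of characters in an alignment column that are gaps."""
    if not column:
        raise ValueError("empty column")
    return sum(1 for residue in column if residue == gap_char) / len(column)


def should_score(column, gap_cutoff=0.3):
    """True when the gap fraction of the column does not exceed the cutoff."""
    return gap_fraction(column) <= gap_cutoff
```

For example, a column "AA-A" has a gap fraction of 0.25 and would be scored with the default cutoff, while "A--A" (0.5) would not.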

=item -h

Print help.

=item -l [true|false]

Use sequence weighting. Default=I<true>.

=item -m [FILE]

Similarity matrix file, e.g. F<matrix/blosum62.bla> or a F<.qij> file. Default=F<matrix/blosum62.bla>.

Some methods, e.g. I<js_divergence>, do not use this.

=item -n [true|false]

Normalize scores. Print the z-score (over the alignment) of each column's raw score. Default=I<false>.

=item -o FILE

Output file. Default: standard output stream.

=item -p [true|false]

Use gap penalty. Lower the score of columns that contain gaps, in proportion to the summed weight of the gapped sequences. Default=I<true>.
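
One way to picture such a penalty is as a factor between 0 and 1 by which the column score is multiplied. The sketch below is an assumption about the form of the penalty, not a copy of the packaged code, and the function name gap_penalty_factor is hypothetical.

```python
def gap_penalty_factor(column, seq_weights, gap_char="-"):
    """Return 1 minus the weighted fraction of gapped sequences in a column.

    column      -- one character per sequence
    seq_weights -- one non-negative weight per sequence
    """
    if len(column) != len(seq_weights):
        raise ValueError("need exactly one weight per sequence")
    total = sum(seq_weights)
    if total <= 0:
        raise ValueError("sequence weights must sum to a positive value")
    # Summed weight of the sequences that have a gap in this column.
    gapped = sum(w for residue, w in zip(column, seq_weights) if residue == gap_char)
    return 1.0 - gapped / total
```

A column without gaps keeps its full score (factor 1.0), and a column in which a quarter of the total sequence weight is gapped keeps three quarters of it.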

=item -s [METHOD]

Conservation estimation method, one of I<shannon_entropy property_entropy property_relative_entropy vn_entropy relative_entropy js_divergence sum_of_pairs>. Default=I<js_divergence>.

=item -w [0-INT]

Window size. Number of residues on either side included in the window. Default=I<3>.

=back

=head1 EXAMPLES

Note: you may have to copy and uncompress the example data files before running the following examples.

=over

=item Compute conservation scores for the alignment using the Jensen-Shannon divergence with default settings and print out the scores:

 score_conservation __docdir__/examples/2plc__hssp-filtered.aln

=item Score an alignment using Jensen-Shannon divergence, a window of size 3 (on either side of the residue), and the swissprot background distribution:

 score_conservation -s js_divergence -w 3 -d \
  __pkgdatadir__/distributions/swissprot.distribution \
  __docdir__/examples/2plc__hssp-filtered.aln

=back

=head1 FILES

=over

=item Distributions

F<__pkgdatadir__/distributions>

=item Matrices

F<__pkgdatadir__/matrix>

=back

=head1 SEE ALSO

=over

=item Homepage L<http://compbio.cs.princeton.edu/conservation/>

=item Publication L<http://bioinformatics.oxfordjournals.org/cgi/content/full/23/15/1875>

=item concavity(1)

=back

=cut