[med-svn] [Git][med-team/fasta3][upstream] New upstream version 36.3.8g
Steffen Möller
gitlab at salsa.debian.org
Sun Apr 15 23:22:21 BST 2018
Steffen Möller pushed to branch upstream at Debian Med / fasta3
Commits:
c1a8a345 by Steffen Moeller at 2018-04-15T15:56:45+02:00
New upstream version 36.3.8g
- - - - -
20 changed files:
- README
- README.md
- doc/README_v36.3.8d.md
- + doc/README_v36.3.8g.md
- doc/fasta_guide.pdf
- doc/fasta_guide.tex
- doc/readme.v36
- psisearch2/m89_btop_msa2.pl
- psisearch2/psisearch2_msa.pl
- psisearch2/psisearch2_msa.py
- scripts/ann_exons_ncbi.pl
- + scripts/ann_exons_up_sql.pl
- scripts/ann_exons_up_www.pl
- scripts/ann_pfam30_tmptbl.pl
- src/ag_stats.c
- src/cal_cons2.c
- src/comp_lib9.c
- src/compacc2e.c
- src/initfa.c
- src/scaleswn.c
Changes:
=====================================
README
=====================================
--- a/README
+++ b/README
@@ -1,4 +1,11 @@
+[This file has been replaced by README.md]
+
+December, 2017
+
+The most up-to-date version information on FASTA versions is available
+in README.md and doc/readme.v36 .
+
July, 2015
This version of the FASTA programs is fasta-36.3.8. Since March, 2011
=====================================
README.md
=====================================
--- a/README.md
+++ b/README.md
@@ -12,9 +12,26 @@ includes programs for aligning translated DNA sequences against
proteins (`fastx`, `fasty` are equivalent to `blastx`, `tfastx`,
`tfasty` are similar to `tblastn`).
-####September, 2016
+####December, 2017
+The current FASTA version is fasta-36.3.8f, Dec. 2017
+
+The statistics routines for normally distributed scores (ggsearch36,
+glsearch36) are more robust to very low E()-value thresholds.
+
+####Sept, 2017
+The current FASTA version is fasta-36.3.8f, Sept. 2017
+
+If the -S option is used and a query sequence has no upper case
+letters, it is re-read with lower-case letters converted to upper-case.
-The current FASTA version is fasta-36.3.8e.
+####May, 2017
+The current FASTA version is fasta-36.3.8f, May. 2017
+
+Various bugs in sub-alignment scoring corrected and support for the
+EBI SP:GSTM1_HUMAN P09488 added. The format for the $SRCH_URL and
+$SRCH_URL2 format strings has changed to enable pairwise alignment.
+
+####September, 2016
The fasta-36.3.6e version includes a new directory, `psisearch2`, with
scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using
=====================================
doc/README_v36.3.8d.md
=====================================
--- a/doc/README_v36.3.8d.md
+++ b/doc/README_v36.3.8d.md
@@ -6,7 +6,7 @@ Changes in **fasta-36.3.8d** released 13-April-2016:
1. Various bug fixes to `pssm_asn_subs.c` that avoid coredumps when
reading NCBI PSSM ASN.1 binary files. `pssm_asn_subs.c` can now read
- UUPACAA sequences.
+ IUPACAA sequences.
2. default gap penalties for VT40 (from -14/-2 to -13/-1), VT80 (from
-14/-2 to -11/-1), and VT120 (from -10/-1 to 11/-1) have changed
=====================================
doc/README_v36.3.8g.md
=====================================
--- /dev/null
+++ b/doc/README_v36.3.8g.md
@@ -0,0 +1,18 @@
+
+
+## The FASTA package - protein and DNA sequence similarity searching and alignment programs
+
+Changes in **fasta-36.3.8f** released 31-Dec-2017
+
+1. (December, 2017) -- Make statistical thresholds more robust for
+ small E()-values with normally distributed scores (ggsearch36,
+ glsearch36).
+
+2. (September, 2017) Treat all lower-case queries as uppercase with -S option.
+
+3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
+
+4. Improvements/fixes to psisearch2 scripts.
+
+For more detailed information, see `doc/readme.v36`.
+
=====================================
doc/fasta_guide.pdf
=====================================
Binary files a/doc/fasta_guide.pdf and b/doc/fasta_guide.pdf differ
=====================================
doc/fasta_guide.tex
=====================================
--- a/doc/fasta_guide.tex
+++ b/doc/fasta_guide.tex
@@ -1470,8 +1470,8 @@ environment variables are used in HTML mode (\texttt{-m 6}) to provide
links from the sequence alignment (see the links at
\url{http://fasta.bioch.virginia.edu/fasta_www2/}). \texttt{REF\_URL}
is associated with the \texttt{Entrez Lookup} link; \texttt{SRCH\_URL}
-with the \texttt{Re-search database} link, and \texttt{SRCH\_URL1}
-with the \texttt{General re-search} link. In each case, the text
+with the \texttt{General re-search} link, and \texttt{SRCH\_URL1}
+with the \texttt{Pairwise alignment} link. In each case, the text
corresponds to a HTML URL, but with positions containing the
\texttt{\%s} or \texttt{\%ld} (for numbers) part of a 'C'
\texttt{sprintf()} call for specific variables. \texttt{REF\_URL} uses
=====================================
doc/readme.v36
=====================================
--- a/doc/readme.v36
+++ b/doc/readme.v36
@@ -6,6 +6,24 @@ multiple high-scoring alignments to be shown, rather than just one.
This is the main functional difference between FASTA and BLAST -
BLAST could show multiple HSPs, FASTA did not.
+>>Dec. 30, 2017 [released as fasta-36.3.8g]
+[scaleswn.c]
+Replace np_to_z() with np1_to_z(), which does not substract low
+probability from 1.0, thus allowing accurate z-values for very low
+probabilities.
+
+>>Sept. 26, 2017
+[comp_lib9.c, compacc2e.c]
+Previously, if the query sequence was all lower-case letters (seg-ed)
+and the '-S' option specified, the search would effectively be done
+with a zero-length sequence, which broke the statistics. The code has
+been modified to convert all lower-case queries to upper-case when -S
+is used.
+
+[scaleswn.c]
+Fixed problem with scaleswn.c/ag_stats() not setting parameters
+properly when matrix was unknown.
+
>>May 23, 2017 [released as fasta-36.3.8f]
[url_subs.c]
A small, but major change in the output available to the $SRCH_URL and
@@ -18,7 +36,7 @@ pairwise alignment becomes possible.
>>May 23, 2017
[scripts/ann_feats2ipr.pl,ann_feats_up_www2.pl,test_ann_scripts.sh src/defs.h]
Changes to ensure that EBI format databases, which place the ID before
-the accession, e.g. SP:GSTM1_HUMAN P09388, can be processed properly
+the accession, e.g. SP:GSTM1_HUMAN P09488, can be processed properly
by annotation scripts. This involved displaying more of the
description line, so that the accession field is included, in the
annot_XXXXX file.
=====================================
psisearch2/m89_btop_msa2.pl
=====================================
--- a/psisearch2/m89_btop_msa2.pl
+++ b/psisearch2/m89_btop_msa2.pl
@@ -50,7 +50,7 @@ use Getopt::Long;
my ($shelp, $help, $m_format, $evalue, $qvalue, $domain_bound) = (0, 0, "m8CB", 0.001, 30.0,0);
-my ($query_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","");
+my ($query_file, $sel_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","","");
my $query_lib_r = 0;
my ($eval2_fmt, $eval2) = (0,"");
@@ -62,6 +62,9 @@ GetOptions(
"expect=f" => \$evalue,
"qvalue=f" => \$qvalue,
"format=s" => \$m_format,
+ "selected_file_in=s" => \$sel_file,
+ "sel_file_in=s" => \$sel_file,
+ "sel_file=s" => \$sel_file,
"m_format=s" => \$m_format,
"mformat=s" => \$m_format,
"bound_file_in=s" => \$bound_file_in,
@@ -120,6 +123,30 @@ if (! $query_file || !$query_len) {
die "query sequence required";
}
+my %sel_accs = ();
+my $have_sel_accs = 0;
+if ($sel_file) {
+ if (open (my $sfd, $sel_file)) {
+ while (my $line = <$sfd>) {
+ chomp $line;
+ next if $line =~ m/^#/;
+ if ($line =~ m/\t/) {
+ my @data = split(/\t/,$line);
+ $sel_accs{$data[0]} = $data[1];
+ }
+ else {
+ $sel_accs{$line} = 1;
+ }
+ }
+ close($sfd);
+ $evalue = 1000000.0;
+ $have_sel_accs = 1;
+ }
+ else {
+ warn "Cannot open selected sequence file: $sel_file";
+ }
+}
+
push @multi_names, $query_acc;
$acc_names{$query_acc} = 1;
$multi_align{$query_acc} = btop2alignment($query_seq_r, $query_len, {BTOP=>$query_len, q_start=>1, q_end=>$query_len}, 0);
@@ -242,6 +269,10 @@ while (my $line = <>) {
$subj_acc =~ s/^gi\|\d+\|(\w+\|\w+)\|?\w+/$1/;
}
+ if ($have_sel_accs) {
+ next unless ($sel_accs{$hit_data{'s_seqid'}});
+ }
+
# a better solution would be to rename the q_seqid, or at least to
# check for identity
# next if ($hit_data{q_seqid} eq $hit_data{s_seqid});
=====================================
psisearch2/psisearch2_msa.pl
=====================================
--- a/psisearch2/psisearch2_msa.pl
+++ b/psisearch2/psisearch2_msa.pl
@@ -25,21 +25,20 @@ use Pod::Usage;
# implementation of simple shell script to do iterative searches
#
# logic:
-# (1) do initial search
+# (1) do initial search or take results from previous search with --prev_m89res
# (2) use results of initial search to produce MSA/PSSM for next search
# (3) do PSSM search
-# (4) use results of PSSM search to produce MSA/PSSM for iterative step 3
+# (4) repeat at step (2)
#
################
#
# command:
-# psisearch2_msa.pl --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast
+# psisearch2_msa.pl --query query.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file
#
################
-use vars qw( $query_file $db_file $num_iter $evalue $int_mask $end_mask $query_mask $no_msa $tmp_dir $dom_flag $align_flag $suffix $srch_pgm $file_out $help $shelp $error_log $rm_flag $annot_type $quiet);
-use vars qw( $prev_msa $next_msa $prev_hitdb $next_hitdb $prev_pssm $next_pssm $prev_bound_in $next_bound_out $tmp_file_list $save_all $delete_bnd $delete_tmp);
-
+use vars qw( $query_file $db_file $num_iter $pssm_evalue $srch_evalue $int_mask $end_mask $query_mask $tmp_dir $dom_flag $align_flag $suffix $srch_pgm $file_out $help $shelp $error_log $rm_flag $annot_type $quiet);
+use vars qw( $prev_m89res $m_format $prev_sel_res $this_iter $prev_msa $next_msa $prev_hitdb $next_hitdb $prev_pssm $next_pssm $prev_bound $next_bound_out $tmp_file_list $save_all $delete_bnd $delete_tmp $use_stdout);
################
# locations of required programs:
@@ -54,49 +53,58 @@ my $ssearch_bin = "$pgm_bin/ssearch36";
my $psiblast_bin = "$pgm_bin/psiblast";
my $makeblastdb_bin = "$pgm_bin/makeblastdb";
my $datatool_bin = "$pgm_bin/datatool -m $pgm_data/NCBI_all.asn";
-my $align2msa_lib = "m89_btop_msa2.pl";
+my $align2msa_lib = "$pgm_bin/m89_btop_msa2.pl";
my %srch_subs = ('ssearch' => \&get_ssearch_cmd,
'psiblast' => \&get_psiblast_cmd,
);
-my %annot_cmds = ('rpd3' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
- 'rpd3nv' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over"),
- 'rpd3nvn' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
- 'pfam' => qq("\!../scripts/ann_pfam30.pl --pfacc --vdoms --split_over"));
+my %annot_cmds = ('rpd3' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
+ 'rpd3nv' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over"),
+ 'rpd3nvn' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
+ 'pfam' => qq("\!ann_pfam30.pl --vdoms --split_over --neg")
+ );
-($num_iter, $evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
- ( 5, 0.002, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
+($num_iter, $pssm_evalue, $srch_evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
+ ( 5, 0.002, 5.0, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
($save_all, $tmp_file_list, $delete_bnd, $delete_tmp) = (0, "", 0, 0);
+($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","", "","", 1, 0);
-
-my $pgm_command = "#".join(" ",($0, at ARGV));
-print STDERR "#",join(" ",($0, at ARGV)),"\n" if ($error_log);
+my $pgm_command = "# ".join(" ",($0, at ARGV));
+print STDERR "# ",join(" ",($0, at ARGV)),"\n" if ($error_log);
GetOptions(
'query|sequence=s' => \$query_file,
'db|database=s' => \$db_file,
'suffix|out_suffix=s' => \$suffix,
'dir=s' => \$tmp_dir,
- 'evalue=f' => \$evalue,
+ 'pssm_evalue=f' => \$pssm_evalue,
+ 'search_evalue=f' => \$srch_evalue,
'annot_db=s' => \$annot_type,
'out_name=s' => \$file_out,
+ 'use_stdout' => \$use_stdout,
+ 'this_iter=s' => \$this_iter,
'iter=i' => \$num_iter,
+ 'prev_m89res=s' => \$prev_m89res,
+ 'sel_res_in=s' => \$prev_sel_res,
+ 'sel_accs=s' => \$prev_sel_res,
+ 'sel_file=s' => \$prev_sel_res,
+ 'sel_file_in=s' => \$prev_sel_res,
# 'in_msa=s' => \$prev_msa,
# 'out_msa=s' => \$next_msa,
# 'in_hitdb=s' => \$prev_hitdb,
# 'out_hitdb=s' => \$next_hitdb,
'in_pssm=s' => \$prev_pssm,
# 'out_pssm=s' => \$next_pssm,
- 'in_bounds=s' => \$prev_bound_in,
+ 'prev_bounds=s' => \$prev_bound,
'out_bounds=s' => \$next_bound_out,
'num_iter|max_iter=i' => \$num_iter,
- # 'no_msa' => \$no_msa,
'dom|domain' => \$dom_flag,
'align' => \$align_flag,
'query_seed|query_mask' => \$query_mask,
'int_mask|int-mask|int_seed|int-seed=s' => \$int_mask,
'end_mask|end-mask|end_seed|end-seed=s' => \$end_mask,
+ 'm_format=s' => \$m_format,
'pgm=s' => \$srch_pgm,
'quiet' => \$quiet,
'q' => \$quiet,
@@ -116,6 +124,8 @@ pod2usage(exitstatus => 0, verbose => 2) if $help;
pod2usage(1) unless $query_file && -r $query_file; # need a query
pod2usage(1) unless $db_file ; # need a database
+my %sel_hits = ();
+
my @del_file_ext = qw(msa psibl_out hit_db asntxt asnbin);
if ($srch_pgm =~ m/psiblast/) {
@@ -127,10 +137,7 @@ if ($query_mask) {
$end_mask='query' unless $end_mask ne 'none';
}
-if ($delete_tmp) {
- $delete_bnd = 1;
-}
-elsif ($save_all) {
+if (!$delete_tmp && $save_all) {
@del_file_ext = ();
$tmp_file_list = "";
$delete_bnd = 0
@@ -155,45 +162,76 @@ if ($tmp_file_list) {
print "$pgm_command\n" unless ($quiet);
-my $this_iter = "it1";
+####
+# generate output filenames
my ($query_pref) = ($query_file =~ m/([\w\.]+)$/);
$file_out = $query_pref unless $file_out;
-my $this_file_pref = "$file_out.$this_iter";
+my $this_file_pref = "$file_out.it$this_iter";
$this_file_pref = "$this_file_pref.$suffix" if ($suffix);
my $this_file_out = $this_file_pref;
$this_file_out = "$tmp_dir/$this_file_out" if ($tmp_dir);
-my $prev_file_out = $this_file_out;
+my $prev_file_out = "";
####
-# parse output to build PSSM
-# generate output filenames
+# do the first search or use $prev_m89res
+my $first_iter = 0;
+my $iter_val = $this_iter;
+my $search = "";
+
+my @del_err_files = ();
-# do the first search
-my $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
-log_system("$search > $this_file_out 2> $this_file_out.err");
+unless ($prev_m89res) {
+ $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
+ unless ($use_stdout) {
+ log_system("$search > $this_file_out 2> $this_file_out.err");
+ }
+ else {
+ log_system("$search 2> $this_file_out.err");
+ }
+ push @del_err_files, "$this_file_out.err";
+ $first_iter++;
+}
+else {
+ $this_file_out = $prev_m89res;
+}
-my ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound_in);
+my ($this_pssm, $this_bound_out) = ("","");
# now have necessary files for next iteration
-for (my $it=2; $it <= $num_iter; $it++) {
+for (my $it=$first_iter; $it < $num_iter; $it++) {
+
+ ####
+ # build the PSSM for the current search
+
+ my ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound, $prev_sel_res, $m_format);
+ $prev_file_out = $this_file_out;
+ $prev_sel_res = "";
+
+ $iter_val = $this_iter + $it;
- $prev_pssm = $this_pssm;
- $prev_bound_in = $this_bound_out;
####
# build filename for this iteration
- $this_file_pref = $this_file_out = "$file_out.it$it";
+ $this_file_pref = $this_file_out = "$file_out.it$iter_val";
$this_file_out = "$this_file_pref.$suffix" if ($suffix);
$this_file_out = "$tmp_dir/$this_file_out" if ($tmp_dir);
- $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
- log_system("$search > $this_file_out 2> $this_file_out.err");
+ $search = $srch_subs{$srch_pgm}($query_file, $db_file, $this_pssm);
+ unless ($use_stdout) {
+ log_system("$search > $this_file_out 2> $this_file_out.err");
+ }
+ else {
+ log_system("$search 2> $this_file_out.err");
+ }
+ push @del_err_files, "$this_file_out.err";
+ ####
# here, we are done with previous .msa, .asntxt, .asnbin, etc files. Delete them if desired
+
if (@del_file_ext) {
my @del_file_list = ();
for my $ext (@del_file_ext) {
@@ -201,36 +239,20 @@ for (my $it=2; $it <= $num_iter; $it++) {
}
log_system("rm ".join(" ", at del_file_list));
}
- $prev_file_out = $this_file_out;
-
- ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound_in);
+ log_system("rm $prev_bound") if ($delete_bnd);
- if (has_converged($prev_bound_in, $this_bound_out)) {
- print STDERR "$0 $srch_pgm $query_file $db_file converged ($it iterations)\n" unless ($quiet);
-
- if (@del_file_ext) {
- my @del_file_list = ();
- for my $ext (@del_file_ext) {
- push @del_file_list, "$prev_file_out.$ext";
- }
- log_system("rm ".join(" ", at del_file_list));
- log_system("rm $prev_bound_in $this_bound_out") if ($delete_bnd);
- }
- exit(0);
+ if (has_converged($prev_bound, $this_bound_out)) {
+ print STDERR "$0 $srch_pgm $query_file $db_file converged ($iter_val iterations)\n" unless ($quiet);
+ last;
}
- log_system("rm $prev_bound_in") if ($delete_bnd);
+ $prev_bound = $this_bound_out
}
-if (@del_file_ext) {
- my @del_file_list = ();
- for my $ext (@del_file_ext) {
- push @del_file_list, "$prev_file_out.$ext";
- }
- log_system("rm ".join(" ", at del_file_list));
+if (@del_err_files) {
+ log_system("rm ".join(" ", at del_err_files));
}
-log_system("rm $this_bound_out") if ($delete_bnd);
-
+log_system("rm $prev_bound") if ($delete_bnd);
unless ($quiet) {
print STDERR "$0 $srch_pgm $query_file $db_file finished ($num_iter iterations)\n";
}
@@ -254,7 +276,7 @@ sub log_system {
sub get_ssearch_cmd {
my ($query_file, $db_file, $pssm_file) = @_;
- my $search_cmd = qq($ssearch_bin -S -m 8CB -d 0 -E "1.0 0" -s BP62);
+ my $search_cmd = qq($ssearch_bin -S -m 6 -m 9B -E "$srch_evalue 0" -s BP62);
if ($annot_type) {
$search_cmd .= qq( -V $annot_cmds{$annot_type});
}
@@ -274,7 +296,7 @@ sub get_ssearch_cmd {
sub get_psiblast_cmd {
my ($query_file, $db_file, $pssm_file) = @_;
- my $search_cmd = qq($psiblast_bin -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh $evalue -num_iterations 1 -db $db_file);
+ my $search_cmd = qq($psiblast_bin -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh $pssm_evalue -evalue $srch_evalue -num_iterations 1 -db $db_file);
if ($pssm_file) {
$search_cmd .= qq( -in_pssm $pssm_file);
# $search_cmd .= qq( -comp_based_stats 0);
@@ -296,7 +318,7 @@ sub get_psiblast_cmd {
# always produce a $bound_file_out file to test for convergence
#
sub build_msa_pssm {
- my ($query_file, $this_file_out,$prev_bound_in) = @_;
+ my ($query_file, $this_file_out,$prev_bound_in, $prev_sel_in, $m_format) = @_;
my ($this_msa, $this_hit_db, $this_pssm_asntxt, $this_pssm_asnbin, $this_psibl_out, $this_bound_out) =
("$this_file_out.msa",
@@ -308,7 +330,18 @@ sub build_msa_pssm {
);
my $blastdb_err = "$this_file_out.mkbldb_err";
- my $aln2msa_cmd = qq($align2msa_lib --query $query_file --evalue $evalue --masked_lib_out=$this_hit_db);
+ my $aln2msa_cmd = qq($align2msa_lib --query $query_file --masked_lib_out=$this_hit_db);
+
+ if ($m_format) {
+ $aln2msa_cmd .= qq( --m_format $m_format);
+ }
+
+ if ($prev_sel_in) {
+ $aln2msa_cmd .= qq( --sel_file_in $prev_sel_in);
+ }
+ else {
+ $aln2msa_cmd .= qq( --evalue $pssm_evalue);
+ }
if ($int_mask) {
$aln2msa_cmd .= qq( --int_mask_type $int_mask);
@@ -342,7 +375,7 @@ sub build_msa_pssm {
log_system("rm $this_hit_db.p* $blastdb_err");
# remove uninformative error logs
- log_system("rm $this_psibl_out.err $this_file_out.err") unless $error_log;
+ log_system("rm $this_psibl_out.err") unless $error_log;
unless ($srch_pgm eq 'psiblast') {
my $asn2asn_cmd = "$datatool_bin -v $this_pssm_asntxt -e $this_pssm_asnbin";
@@ -361,6 +394,8 @@ sub build_msa_pssm {
sub has_converged {
my ($file1, $file2) = @_;
+ return 0 unless ($file1 && $file2);
+
my @f1_names = ();
my @f2_names = ();
@@ -408,12 +443,19 @@ psisearch2_msa.pl
-h short help
--help include description
+
--query query file (also --sequence)
--db database file (--database)
- --pgm program used for searching, ssearch or psiblast
- --num_iter maximum number of iterations (--max_iter)
+ --pgm [ssearch] program used for searching, ssearch or psiblast
+ --num_iter/iter [5] maximum number of iterations (--max_iter)
+
+
+ --this_iter [0] iteration number for tracking
+ --pssm_evalue [0.002] threshold for inclusion in PSSM
+ --search_evalue [5.0] threshold for inclusion in search display
+ --annot_db [null] (rpd3, rpd3nv, rpd3nvn, pfam)
+
--dir working directory and location of output
- --evalue threshold for inclusion in PSSM
--out_name/--suffix result file is "out_name.it#.suffix"
--in_msa/--out_msa [not implemented] MSA used to build PSSM, requires --in_hitdb
--in_hitdb/--out_hitdb [not implemented] used to build PSSM
@@ -427,8 +469,8 @@ psisearch2_msa.pl
=head1 DESCRIPTION
-C<psisearch2_msa.pl> automates successive iterations of C<psiblast> or
-C<ssearch36> using different strategies to reduce PSSM contamination
+C<psisearch2_msa.pl> can perform one, or multiple, successive iterations of C<psiblast> or
+C<ssearch36> using different query seeding strategies to reduce PSSM contamination
from alignment over-extension. C<psisearch2_msa.pl> uses the
C<m89_btop_msa2.pl> program to read BTOP formatted output from
C<psiblast> or C<ssearch36> and produce both a multiple sequence
=====================================
psisearch2/psisearch2_msa.py
=====================================
--- a/psisearch2/psisearch2_msa.py
+++ b/psisearch2/psisearch2_msa.py
@@ -34,7 +34,7 @@ import re
################
#
# command:
-# psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix --pgm ssearch/psiblast
+# psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter #
#
################
@@ -59,15 +59,9 @@ annot_cmds = {'rpd3': '"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --sp
num_iter = 5
evalue = 0.002
-dom_flag = 0
-align_flag = 0
-int_mask = 'none'
-end_mask = 'none'
srch_pgm = 'ssearch'
-tmp_dir = ''
error_log = 0
rm_flag = 0
-annot_type = ''
quiet = 0
################
@@ -90,8 +84,8 @@ def get_ssearch_cmd(query_file, db_file, pssm_file) :
search_cmd = '%s -S -m 8CB -d 0 -E "1.0 0" -s BP62' % (ssearch_bin)
- if (annot_type) :
- search_cmd += " -V %s" % (annot_cmds[annot_type])
+ if (args.annot_type) :
+ search_cmd += " -V %s" % (annot_cmds[args.annot_type])
if (pssm_file) :
search_cmd += ' -P "%s 2"' % (pssm_file)
@@ -107,7 +101,7 @@ def get_ssearch_cmd(query_file, db_file, pssm_file) :
#
def get_psiblast_cmd(query_file, db_file, pssm_file) :
- search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, evalue, db_file)
+ search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, args.evalue, db_file)
if (pssm_file) :
search_cmd += " -in_pssm %s" % (pssm_file)
@@ -126,24 +120,29 @@ def get_psiblast_cmd(query_file, db_file, pssm_file) :
#
# always produce a bound_file_out file to test for convergence
#
-def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
+def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, args, error_log) :
(this_msa, this_hit_db, this_pssm_asntxt, this_pssm_asnbin, this_psibl_out, this_bound_out) = (this_file_out+".msa",this_file_out+".hit_db",this_file_out+".asntxt",this_file_out+".asnbin",this_file_out+".psibl_out",this_file_out+".bnd_out")
blastdb_err = this_file_out+".mkbldb_err"
- aln2msa_cmd = "%s --query %s --evalue %f --masked_lib_out=%s" % (align2msa_lib, query_file, evalue, this_hit_db)
+ aln2msa_cmd = "%s --query %s --masked_lib_out=%s" % (align2msa_lib, query_file, this_hit_db)
- if (int_mask) :
- aln2msa_cmd += " --int_mask_type %s" % (int_mask)
+ if (prev_sel_res) :
+ aln2msa_cmd += " --sel_res %s" % (prev_sel_res)
+ else:
+ aln2msa_cmd += " --evalue %f" % (args.evalue)
+
+ if (args.int_mask) :
+ aln2msa_cmd += " --int_mask_type %s" % (args.int_mask)
- if (end_mask) :
- aln2msa_cmd += " --end_mask_type %s" % (end_mask)
+ if (args.end_mask) :
+ aln2msa_cmd += " --end_mask_type %s" % (args.end_mask)
- if (dom_flag) :
+ if (args.dom_flag) :
aln2msa_cmd += " --domain"
- if (align_flag and prev_bound_in) :
- aln2msa_cmd += " --bound_file_in %s" %(prev_bound_in)
+ if (args.align_flag and args.prev_bound_in) :
+ aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in)
# always produce this file to check for convergence
aln2msa_cmd += " --bound_file_out %s" % (this_bound_out)
@@ -162,7 +161,7 @@ def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
# remove uninformative error logs
if (not error_log) :
- log_system("rm "+this_psibl_out+".err "+this_file_out+".err",error_log)
+ log_system("rm "+this_psibl_out+".err ",error_log)
if (srch_pgm != 'psiblast') :
asn2asn_cmd = "%s -v %s -e %s" % (datatool_bin, this_pssm_asntxt, this_pssm_asnbin)
@@ -177,6 +176,9 @@ def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
#
def has_converged(file1, file2) :
+ if (not file1 or not file2):
+ return 0
+
f1_names = []
f2_names = []
@@ -224,13 +226,16 @@ arg_parse.add_argument('--evalue', dest='evalue', default=0.002, type=float, act
arg_parse.add_argument('--annot_db', dest='annot_type', action='store',help='source of domain annotations')
arg_parse.add_argument('--suffix', dest='suffix', action='store',help='suffix for result output')
arg_parse.add_argument('--out_name', dest='file_out', action='store',help='result file name')
-arg_parse.add_argument('--iter', dest='num_iter', default=5, type=int, action='store',help='number of iterations')
+arg_parse.add_argument('--num_iter', dest='num_iter', default=5, type=int, action='store',help='number of iterations to run')
arg_parse.add_argument('--in_pssm', dest='prev_pssm', action='store',help='initial PSSM')
arg_parse.add_argument('--in_bounds', dest='prev_bound_in', type=str, action='store',help='initial boundaries')
arg_parse.add_argument('--domain', dest='dom_flag', action='store_true',help='use domain annotations')
-arg_parse.add_argument('--align', dest='align_flag', action='store_true',help='use alignment boundaries')
+arg_parse.add_argument('--align', dest='align_flag', default=0, action='store_true',help='use alignment boundaries')
arg_parse.add_argument('--pgm', dest='srch_pgm', action='store',default='ssearch',help='search program: ssearch/psiblast')
arg_parse.add_argument('--query_seed', dest='query_mask', action='store_true',help='use query seeding')
+arg_parse.add_argument('--prev_m89res', dest='prev_m89res', action='store', help='prevous iteration result')
+arg_parse.add_argument('--sel_res', dest='prev_sel_res', action='store', help='selected accession file')
+arg_parse.add_argument('--this_iter', dest='this_iter', help='this iteration number',type=int)
arg_parse.add_argument('--int_seed', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
arg_parse.add_argument('--end_seed', dest='end_mask', action='store',default='none',help='sequence masking: none/query/random')
arg_parse.add_argument('--int_mask', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
@@ -257,13 +262,16 @@ del_file_ext = ["msa","psibl_out","hit_db","asntxt","asnbin"]
if (re.match('psiblast',srch_pgm)) :
del_file_ext.pop()
-
if (args.query_mask) :
if (args.int_mask == 'none') :
args.int_mask = 'query'
if (args.end_mask == 'none') :
args.end_mask = 'query'
+this_iter = 1
+if (args.this_iter):
+ this_iter = args.this_iter
+
delete_bnd = args.delete_bnd
if (args.delete_tmp) :
delete_bnd = 1
@@ -282,8 +290,6 @@ if (args.tmp_file_list) :
new_del_file_ext.append(ext)
del_file_ext = new_del_file_ext[:]
-this_iter = "it1"
-
query_pref = query_file = args.query_file
m = re.search(r'([\w\.]+)$',str(args.query_file))
query_pref = m.groups()[0]
@@ -293,35 +299,50 @@ if (not args.file_out) :
else :
file_out = args.file_out
-this_file_out = this_file_pref = file_out+"."+this_iter
+this_file_out = this_file_pref = file_out+".it"+str(this_iter)
+
if (args.suffix) :
this_file_out = this_file_pref+"."+args.suffix
if (args.tmp_dir) :
this_file_out = args.tmp_dir+"/"+this_file_out
+prev_bound_in = ''
+if (args.prev_bound_in):
+ prev_bound_in = args.prev_bound_in
+
####
# parse output to build PSSM
# generate output filenames
-prev_file_out = this_file_out
+first_iter = 0
+del_err_files = []
# do the first search
-search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm)
-log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log)
+if (not args.prev_m89res):
+ search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm)
+ log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log)
+ del_err_files.append(this_file_out+".err")
+ first_iter += 1
+else:
+ this_file_out = args.prev_m89res
-prev_file_out = this_file_out
-(this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, args.prev_bound_in, error_log)
+it=first_iter
+
# now have necessary files for next iteration
-it=2
-while (it <= args.num_iter) :
+while (it < args.num_iter) :
+
+ (this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, arg.prev_sel_res, error_log)
+ prev_file_out = this_file_out
+ arg.prev_sel_res = ''
+
+ iter_val = this_iter + it
prev_pssm = this_pssm
- prev_bound_in = this_bound_out
####
# build filename for this iteration
- this_file_out = this_file_pref = "%s.it%d" % (file_out,it)
+ this_file_out = this_file_pref = "%s.it%d" % (file_out,iter_val)
if (args.suffix) :
this_file_out = this_file_pref+"."+args.suffix
if (args.tmp_dir) :
@@ -329,22 +350,19 @@ while (it <= args.num_iter) :
search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm)
log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), error_log)
+ del_err_files.append(this_file_out+".err")
if (len(del_file_ext)):
del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
log_system('rm '+' '.join(del_file_list),error_log)
- prev_file_out = this_file_out
-
- (this_pssm, this_bound_out) = build_msa_pssm(query_file, this_file_out, prev_bound_in, error_log)
-
if (has_converged(prev_bound_in, this_bound_out)) :
if (not quiet) :
sys.stderr.write(" %s %s %s %s converged (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
- if (len(del_file_ext)):
- del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
- log_system('rm '+' '.join(del_file_list),error_log)
+ # if (len(del_file_ext)):
+ # del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
+ # log_system('rm '+' '.join(del_file_list),error_log)
if (delete_bnd) :
log_system("rm "+prev_bound_in,error_log)
@@ -353,16 +371,20 @@ while (it <= args.num_iter) :
if (delete_bnd) :
log_system("rm "+prev_bound_in,error_log)
+ prev_bound_in = this_bound_out
it += 1
-if (len(del_file_ext)):
- del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
- log_system('rm '+' '.join(del_file_list),error_log)
+if (len(del_err_files)):
+ log_system('rm '+' '.join(del_err_files),error_log)
+
+# if (len(del_file_ext)):
+# del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
+# log_system('rm '+' '.join(del_file_list),error_log)
if (delete_bnd):
log_system("rm "+this_bound_out,error_log)
if (not quiet) :
- sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it-1))
+ sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
=====================================
scripts/ann_exons_ncbi.pl
=====================================
--- a/scripts/ann_exons_ncbi.pl
+++ b/scripts/ann_exons_ncbi.pl
@@ -4,13 +4,13 @@
# gi|23065544|ref|NP_000552.2|
#
-# and returns the exons present in the protein from NCBI gff3 tables (human and mouse only)
+# and returns the exons present in the protein from NCBI gff3 tables (human, mouse, rat, xtrop)
#
# it must:
# (1) read in the line
# (2) parse it to get the acc
# (3) return the tab delimited exon boundaries
-
+#
use strict;
=====================================
scripts/ann_exons_up_sql.pl
=====================================
--- /dev/null
+++ b/scripts/ann_exons_up_sql.pl
@@ -0,0 +1,337 @@
+#!/usr/bin/perl -w
+
+################################################################
+# copyright (c) 2014,2015 by William R. Pearson and The Rector &
+# Visitors of the University of Virginia
+################################################################
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under this License is distributed on an "AS
+# IS" BASIS, WITHOUT WRRANTIES OR CONDITIONS OF ANY KIND, either
+# express or implied. See the License for the specific language
+# governing permissions and limitations under the License.
+################################################################
+
+# ann_exons_up_sql.pl gets an annotation file from fasta36 -V with a line of the form:
+
+# gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
+#
+# it must:
+# (1) read in the line
+# (2) parse it to get the up_acc
+# (3) return the tab delimited features
+#
+
+# this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
+# modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl
+
+use strict;
+
+use DBI;
+use Getopt::Long;
+use Pod::Usage;
+
+use vars qw($host $db $a_table $port $user $pass);
+
+my %domains = ();
+my $domain_cnt = 0;
+
+my $hostname = `/bin/hostname`;
+
+unless ($hostname =~ m/ebi/) {
+ ($host, $db, $a_table, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "uniprot", "annot2", 0, "web_user", "fasta_www");
+# $host = 'xdb';
+}
+else {
+ ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www");
+}
+
+my ($sstr, $lav, $neg_doms, $no_vars, $no_doms, $no_feats, $shelp, $help, $pfam26) = (0,0,0,0,0,0,0,0,0,0);
+my ($min_nodom) = (10);
+
+my ($show_color) = (1);
+my $color_sep_str = " :";
+$color_sep_str = '~';
+
+GetOptions(
+ "host=s" => \$host,
+ "db=s" => \$db,
+ "user=s" => \$user,
+ "password=s" => \$pass,
+ "port=i" => \$port,
+ "lav" => \$lav,
+ "no_doms" => \$no_doms,
+ "no-doms" => \$no_doms,
+ "nodoms" => \$no_doms,
+ "no_var" => \$no_vars,
+ "no-var" => \$no_vars,
+ "novar" => \$no_vars,
+ "neg" => \$neg_doms,
+ "neg_doms" => \$neg_doms,
+ "neg-doms" => \$neg_doms,
+ "negdoms" => \$neg_doms,
+ "min_nodom=i" => \$min_nodom,
+ "min-nodom=i" => \$min_nodom,
+ "no_feats" => \$no_feats,
+ "no-feats" => \$no_feats,
+ "nofeats" => \$no_feats,
+ "color!" => \$show_color,
+ "sstr" => \$sstr,
+ "h|?" => \$shelp,
+ "help" => \$help,
+ );
+
+pod2usage(1) if $shelp;
+pod2usage(exitstatus => 0, verbose => 2) if $help;
+pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
+
+my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
+$connect .= ";host=$host" if $host;
+$connect .= ";port=$port" if $port;
+
+my $dbh = DBI->connect($connect,
+ $user,
+ $pass
+ ) or die $DBI::errstr;
+
+
+my $get_annot_sub = \&get_fasta_annots;
+if ($lav) {
+ $no_feats = 1;
+ $get_annot_sub = \&get_lav_annots;
+}
+
+my $get_annots_id = $dbh->prepare(qq(select acc, start, end, ix from up_exons join annot2 using(acc) where id=? order by ix));
+my $get_annots_acc = $dbh->prepare(qq(select acc, start, end, ix from up_exons where acc=? order by ix));
+my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix));
+
+my $get_annots_sql = $get_annots_acc;
+
+my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
+
+# get the query
+my ($query, $seq_len) = @ARGV;
+$seq_len = 0 unless defined($seq_len);
+
+$query =~ s/^>// if ($query);
+
+my @annots = ();
+
+#if it's a file I can open, read and parse it
+unless ($query && ($query =~ m/[\|:]/ ||
+ $query =~ m/^[NX]P_/ ||
+ $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
+
+ while (my $a_line = <>) {
+ $a_line =~ s/^>//;
+ chomp $a_line;
+ push @annots, show_annots($a_line, $get_annot_sub);
+ }
+}
+else {
+ push @annots, show_annots("$query\t$seq_len", $get_annot_sub);
+}
+
+for my $seq_annot (@annots) {
+ print ">",$seq_annot->{seq_info},"\n";
+ for my $annot (@{$seq_annot->{list}}) {
+ if (!$lav && $show_color && defined($domains{$annot->[-1]})) {
+ $annot->[-1] .= $color_sep_str.$domains{$annot->[-1]};
+ }
+ print join("\t",@$annot),"\n";
+ }
+}
+
+exit(0);
+
+sub show_annots {
+ my ($query_len, $get_annot_sub) = @_;
+
+ my ($annot_line, $seq_len) = split(/\t/,$query_len);
+
+ my %annot_data = (seq_info=>$annot_line);
+
+ if ($annot_line =~ m/^gi\|/) {
+ $use_acc = 1;
+ ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
+ }
+ elsif ($annot_line =~ m/^(SP|TR):(\w+) (\w+)/) {
+ ($sdb, $id, $acc) = ($1,$2,$3);
+ $use_acc = 1;
+ $sdb = lc($sdb)
+ }
+ elsif ($annot_line =~ m/^(SP|TR):(\w+)/) {
+ ($sdb, $id) = ($1,$2);
+ $use_acc = 0;
+ $sdb = lc($sdb)
+ }
+ elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format
+ $use_acc =1;
+ $sdb = 'sp';
+ ($acc) = split(/\s+/,$annot_line);
+ }
+ else {
+ $use_acc = 1;
+ ($sdb, $acc, $id) = split(/\|/,$annot_line);
+ }
+
+ # remove version number
+ unless ($use_acc) {
+ $get_annots_sql = $get_annots_id;
+ $get_annots_sql->execute($id);
+ }
+ else {
+ unless ($sdb =~ m/ref/) {
+ $get_annots_sql = $get_annots_acc;
+ } else {
+ $get_annots_sql = $get_annots_refacc;
+ }
+ $acc =~ s/\.\d+$//;
+ $get_annots_sql->execute($acc);
+ }
+
+ $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
+
+ return \%annot_data;
+}
+
+sub get_fasta_annots {
+ my ($get_annots_sql, $seq_len) = @_;
+
+ my ($acc, $start, $end, $ix);
+ my @feats = ();
+
+ while (($acc, $start, $end, $ix) = $get_annots_sql->fetchrow_array()) {
+ push @feats, [$start, "-", $end, "exon_$ix~$ix"];
+ }
+
+ return \@feats;
+}
+
+sub get_lav_annots {
+ my ($get_annots_sql, $seq_len) = @_;
+
+ my ($pos, $end, $label, $value, $comment);
+
+ my @feats = ();
+
+ my %annot = ();
+ while (($acc, $pos, $end, $label, $value) = $get_annots_sql->fetchrow_array()) {
+ next unless ($label =~ m/^DOMAIN/ || $label =~ m/^REPEAT/);
+ $value =~ s/\s?\{.+\}\.?$//;
+ $value = domain_name($label,$value);
+ push @feats, [$pos, $end, $value];
+ }
+
+ return \@feats;
+}
+
+# domain name takes a uniprot domain label, removes comments ( ;
+# truncated) and numbers and returns a canonical form. Thus:
+# Cortactin 6.
+# Cortactin 7; truncated.
+# becomes "Cortactin"
+#
+
+sub domain_name {
+
+ my ($label, $value) = @_;
+
+ if ($label =~ /DOMAIN|REPEAT/) {
+ $value =~ s/;.*$//;
+ $value =~ s/\s+\d+\.?$//;
+ $value =~ s/\.\s*$//;
+ $value =~ s/\s+\d+\.\s+.*$//;
+ $value =~ s/\s+/_/;
+ if (!defined($domains{$value})) {
+ $domain_cnt++;
+ $domains{$value} = $domain_cnt;
+ }
+ return $value;
+ }
+ else {
+ return $value;
+ }
+}
+
+__END__
+
+=pod
+
+=head1 NAME
+
+ann_exons_up_sql.pl
+
+=head1 SYNOPSIS
+
+ ann_exons_up_sql.pl --no_doms --no_feats --lav 'sp|P09488|GSTM1_NUMAN' | accession.file
+
+=head1 OPTIONS
+
+ -h short help
+ --help include description
+ --no-doms do not show domain boundaries (domains are always shown with --lav)
+ --no-feats do not show features (variants, active sites, phospho-sites)
+ --no-var do not show variant sites (--no_var, --novar)
+ --lav produce lav2plt.pl annotation format, only show domains/repeats
+ --neg-doms, -- report domains between annotated domains as NODOM
+ (also --neg, --neg_doms)
+ --min_nodom=10 minimum non-domain length to produce NODOM
+ --host, --user, --password, --port --db -- info for mysql database
+
+=head1 DESCRIPTION
+
+C<ann_exons_up_sql.pl> extracts feature, domain, and repeat information from
+a msyql database (default name, uniprot) built by parsing the
+uniprot_sprot.dat and uniprot_trembl.dat feature tables. Given a
+command line argument that contains a sequence accession (P09488) or
+identifier (GSTM1_HUMAN), the program looks up the features available
+for that sequence and returns them in a tab-delimited format:
+
+ >sp|P09488|GSTM1_HUMAN
+ 2 - 88 GST_N-terminal~1
+ 7 V F Mutagen: Reduces catalytic activity 100- fold. {ECO:0000269|PubMed:16548513}.
+ 34 * - MOD_RES: Phosphothreonine. {ECO:0000250|UniProtKB:P10649}.
+ 90 - 208 GST_C-terminal~2
+ 108 V S Mutagen: Changes the properties of the enzyme toward some substrates. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}.
+ 108 V Q Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}.
+ 109 V I Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513}.
+ 116 # - BINDING: Substrate.
+ 116 V A Mutagen: Reduces catalytic activity 10-fold. {ECO:0000269|PubMed:16548513}.
+ 116 V F Mutagen: Slight increase of catalytic activity. {ECO:0000269|PubMed:16548513}.
+ 173 V N in allele GSTM1B; dbSNP:rs1065411. {ECO:0000269|Ref.3, ECO:0000269|Ref.5}.
+ 210 * - MOD_RES: Phosphoserine. {ECO:0000250|UniProtKB:P04905}.
+ 210 V T in dbSNP:rs449856.
+
+If features are provided, then a legend of feature symbols is provided
+as well:
+
+ ==:Active site
+ =*:Modified
+ =#:Substrate binding
+ =^:Site
+ =!:Metal binding
+
+If the C<--lav> option is specified, domain and repeat features are
+presented in a different format for the C<lav2plt.pl> program:
+
+ >sp|P09488|GSTM1_HUMAN
+ 2 88 GST N-terminal.
+ 90 208 GST C-terminal.
+
+C<ann_exons_up_sql.pl> is designed to be used by the B<FASTA> programs
+with the C<-V \!ann_exons_up_sql.pl> option, or by the
+C<annot_blast_btop.pl> script. It can also be used with the
+lav2plt.pl program with the C<--xA "\!ann_exons_up_sql.pl --lav"> or
+C<--yA "\!ann_exons_up_sql.pl --lav"> options.
+
+=head1 AUTHOR
+
+William R. Pearson, wrp at virginia.edu
+
+=cut
=====================================
scripts/ann_exons_up_www.pl
=====================================
--- a/scripts/ann_exons_up_www.pl
+++ b/scripts/ann_exons_up_www.pl
@@ -149,9 +149,41 @@ sub parse_json_up_exons {
my $acc_exons = decode_json($exon_json);
my $exon_num = 1;
+ my $last_end = 0;
+ my $last_phase = 0;
+
for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) {
my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'});
+ my ($g_begin, $g_end) = ($exon->{'genomeLocation'}{'begin'}{'position'},$exon->{'genomeLocation'}{'end'}{'position'});
+
+ my $this_phase = 0;
+ if (defined($g_begin) && defined($g_end)) {
+ $this_phase = ($g_end - $g_begin + 1) % 3;
+ }
+
+ if (!defined($p_begin) || !defined($p_end)) {
+ $exon_num++;
+ $last_phase = 0;
+ next;
+ }
+
if ($p_end >= $p_begin) {
+ if ($p_begin == $last_end) {
+ if ($last_phase==2) {
+ $p_begin += 1;
+ }
+ elsif ($last_phase==1) {
+ $last_end -= 1;
+ $exons[-1]->{seq_end} -= 1;
+ }
+ }
+
+ if ($p_begin <= $last_end && $p_end > $last_end) {
+ $p_begin = $last_end+1;
+ }
+ $last_end = $p_end;
+ $last_phase = $this_phase;
+
push @exons, {
info=>"exon_".$exon_num.$color_sep_str.$exon_num,
seq_start=>$p_begin,
=====================================
scripts/ann_pfam30_tmptbl.pl
=====================================
--- a/scripts/ann_pfam30_tmptbl.pl
+++ b/scripts/ann_pfam30_tmptbl.pl
@@ -246,9 +246,7 @@ if ($rpd2_fams) {
}
#if it's a file I can open, read and parse it
-unless ($query && ($query =~ m/[\|:]/ ||
- $query =~ m/^[NX]P_/ ||
- $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
+unless ($query && $query =~ m/[\|:]/) {
while (my $a_line = <>) {
$a_line =~ s/^>//;
=====================================
src/ag_stats.c
=====================================
--- a/src/ag_stats.c
+++ b/src/ag_stats.c
@@ -34,11 +34,11 @@ ag_parm(char *pam_type, int gdelval, int ggapval)
r_v = look_p(p250_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
else if (strcmp(pam_type,"P120")==0)
r_v = look_p(p120_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
- else if (strcmp(pam_type,"MD10")==0)
+ else if (strcmp(pam_type,"MD10")==0 || strcmp(pam_type,"VT10")==0)
r_v = look_p(md10_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
- else if (strcmp(pam_type,"MD20")==0)
+ else if (strcmp(pam_type,"MD20")==0 || strcmp(pam_type,"VT20")==0)
r_v = look_p(md20_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
- else if (strcmp(pam_type,"MD40")==0)
+ else if (strcmp(pam_type,"MD40")==0 || strcmp(pam_type,"VT40")==0)
r_v = look_p(md40_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
else if (strcmp(pam_type,"DNA")==0 || strcmp(pam_type,"+5/-4")==0)
r_v = look_p(nt54_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
=====================================
src/cal_cons2.c
=====================================
--- a/src/cal_cons2.c
+++ b/src/cal_cons2.c
@@ -899,7 +899,7 @@ calc_astruct(struct a_struct *aln_p, struct a_res_str *a_res_p, struct f_struct
}
static struct update_code_str *
-init_update_data(show_code) {
+init_update_data(int show_code) {
struct update_code_str *update_data_p;
=====================================
src/comp_lib9.c
=====================================
--- a/src/comp_lib9.c
+++ b/src/comp_lib9.c
@@ -159,6 +159,8 @@ extern void ptime (FILE *, long);
#define GETLIB (m_file_p->getlib)
+extern void upper_seq(unsigned char *aa0, int n0, int *xascii, unsigned char *sqx);
+
int samp_stats_idx (int *pre_nstats, int nstats, void *rand_state);
void
@@ -332,6 +334,7 @@ void prhist(FILE *, const struct mngmsg *, struct pstruct *, struct hist_str his
void print_sum(FILE *, struct db_str *qtt, struct db_str *ntt, int in_mem, long tot_memK);
int reset_maxn(struct mngmsg *, int, int); /* set m_msg.maxt, maxn from maxl */
+int count_not_seg(unsigned char *aa0, int n0, struct pstruct *pst);
FILE *outfd; /* Output file */
@@ -734,6 +737,11 @@ main (int argc, char *argv[])
/* reset algorithm parameters for alphabet */
resetp (&m_msg, &pst);
+ if (count_not_seg(aa0[0],m_msg.n0, &pst) == 0) { /* if no un-seg'ed query residues, convert to upper case */
+ upper_seq(aa0[0],m_msg.n0,qascii,pst.sqx);
+ fprintf(stderr,"+++ warning [%s:%d] - all lower-case query converted to upper case: %s\n", __FILE__, __LINE__, info_qlabel);
+ }
+
#ifndef COMP_MLIB
gettitle(m_msg.tname,m_msg.qtitle,sizeof(m_msg.qtitle));
if (m_msg.tname[0]=='-' || m_msg.tname[0]=='@') {
@@ -1840,6 +1848,11 @@ main (int argc, char *argv[])
/* if ends with ESS, remove terminal ESS */
if (aa0[0][m_msg.n0-1] == ESS) { m_msg.n0--; aa0[0][m_msg.n0]= '\0';}
+ if (count_not_seg(aa0[0],m_msg.n0, &pst) == 0) { /* if no un-seg'ed query residues, convert to upper case */
+ upper_seq(aa0[0],m_msg.n0, qascii, pst.sqx);
+ fprintf(stderr,"+++ warning [%s:%d] - all lower-case query converted to upper case: %s\n", __FILE__, __LINE__, info_qlabel);
+ }
+
if (m_msg.outfd) {fputc('\n',stdout);}
if (qlcont) {
=====================================
src/compacc2e.c
=====================================
--- a/src/compacc2e.c
+++ b/src/compacc2e.c
@@ -4314,3 +4314,28 @@ display_push_features(void *annot_stack, struct dyn_string_str *annot_var_dyn,
}
}
}
+
+int count_not_seg(unsigned char *aa0, int n0, struct pstruct *ppst) {
+ int i;
+ int nseg_cnt = 0;
+ if (ppst->ext_sq_set != 1) {
+ return n0;
+ }
+ else {
+ for (i=0; i<n0; i++) {
+ if (aa0[i] < ppst->nsq) {nseg_cnt++;}
+ }
+ return nseg_cnt;
+ }
+}
+
+void upper_seq(unsigned char *aa0, int n0, int *xascii, unsigned char *sqx) {
+ int i;
+
+ for (i=0; i<n0; i++) {
+ if (sqx[aa0[i]] >= 'a' && sqx[aa0[i]] <= 'z') {
+ aa0[i] = xascii[sqx[aa0[i]] - ('a' - 'A')];
+ }
+ }
+}
+
=====================================
src/initfa.c
=====================================
--- a/src/initfa.c
+++ b/src/initfa.c
@@ -498,9 +498,9 @@ char *iprompt1=" test sequence file name: ";
char *iprompt2=" database file name: ";
#ifdef PCOMPLIB
-char *verstr="36.3.8f May, 2017 MPI";
+char *verstr="36.3.8g Dec, 2017 MPI";
#else
-char *verstr="36.3.8f May, 2017";
+char *verstr="36.3.8g Dec, 2017";
#endif
static int mktup=3;
@@ -1100,7 +1100,7 @@ f_getopt (char copt, char *optarg,
m_msg->e_cut_set = 1;
if (tmp_e_rep > 0.0) {
- if (tmp_e_rep >= 1.0) { ppst->e_cut_r = ppst->e_cut/tmp_e_rep;}
+ if (tmp_e_rep >= 1.0) { ppst->e_cut_r = ppst->e_cut / tmp_e_rep;}
else { ppst->e_cut_r = tmp_e_rep;}
}
break;
=====================================
src/scaleswn.c
=====================================
--- a/src/scaleswn.c
+++ b/src/scaleswn.c
@@ -677,18 +677,16 @@ proc_hist_a(struct stat_str *sptr, int nstats, struct score_count_s s_info,
}
}
- if (r_v == 1) {
- pu->r_u.ag.Lambda = Lambda;
- pu->r_u.ag.K = K;
- pu->r_u.ag.H = H;
- }
- else {
+ if (r_v != 1) {
r_v = look_p(bl62_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
#ifdef DEBUG
fprintf(stderr,"+++ Warning : [%s:%d] Parameters not available for: %s: %d/%d -- using BL62\n",
__FILE__, __LINE__, ppst->pam_name,t_gdelval-t_ggapval,t_ggapval);
#endif
}
+ pu->r_u.ag.Lambda = Lambda;
+ pu->r_u.ag.K = K;
+ pu->r_u.ag.H = H;
/*
fprintf(stderr," the parameters are: Lambda: %5.3f K: %5.3f H: %5.3f\n",
@@ -717,11 +715,11 @@ ag_parm(char *pam_name, int gdelval, int ggapval, struct pstat_str *pu)
r_v = look_p(p250_p,gdelval,ggapval,&K,&Lambda,&H);
else if (strcmp(pam_name,"P120")==0)
r_v = look_p(p120_p,gdelval,ggapval,&K,&Lambda,&H);
- else if (strcmp(pam_name,"MD10")==0)
+ else if (strcmp(pam_name,"MD10")==0 || strcmp(pam_name,"VT10")==0)
r_v = look_p(md10_p,gdelval,ggapval,&K,&Lambda,&H);
- else if (strcmp(pam_name,"MD20")==0)
+ else if (strcmp(pam_name,"MD20")==0 || strcmp(pam_name,"VT20")==0)
r_v = look_p(md20_p,gdelval,ggapval,&K,&Lambda,&H);
- else if (strcmp(pam_name,"MD40")==0)
+ else if (strcmp(pam_name,"MD40")==0 || strcmp(pam_name,"VT40")==0)
r_v = look_p(md40_p,gdelval,ggapval,&K,&Lambda,&H);
else if (strcmp(pam_name,"DNA")==0 || strcmp(pam_name,"+5/-4")==0)
r_v = look_p(nt54_p,gdelval,ggapval, &K,&Lambda,&H);
@@ -2699,6 +2697,7 @@ zs_to_E(double zs,int n1, int dnaseq, long entries, struct db_str db)
#ifdef NORMAL_DIST
double np_to_z(double, int *);
+double np1_to_z(double, int *);
#endif
/* compute z-score for given E()-value, assuming normal or
@@ -2717,7 +2716,7 @@ E_to_zs(double E, long entries)
#else
/* this formula does not work for E() >= 1 */
if (e >= 1.0) e = 0.99;
- z = np_to_z(1.0-e,&error);
+ z = np1_to_z(e,&error);
if (!error) return z*10.0 + 50.0;
else return 0.0;
#endif
@@ -2958,6 +2957,20 @@ scale_scores(struct beststr **bptr, int nbest, struct db_str db,
transcription.
*/
+/* added 30-Dec-2017 to deal with very low E()-value thresholds */
+
+double np1_to_z(double p1, int *fault) {
+ double zs1;
+
+ if (p1 < 0.1) {
+ zs1 = np_to_z(p1, fault);
+ return -zs1;
+ }
+ else {
+ return np_to_z(1.0 - p1, fault);
+ }
+}
+
double np_to_z(double p, int *fault) {
double q, r, ppnd16;
View it on GitLab: https://salsa.debian.org/med-team/fasta3/commit/c1a8a345121adca3f70993084dcf8d6f0f15a78d
---
View it on GitLab: https://salsa.debian.org/med-team/fasta3/commit/c1a8a345121adca3f70993084dcf8d6f0f15a78d
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180415/f5953cce/attachment-0001.html>
More information about the debian-med-commit
mailing list