[med-svn] [Git][med-team/fasta3][upstream] New upstream version 36.3.8g

Steffen Möller gitlab at salsa.debian.org
Sun Apr 15 23:22:21 BST 2018


Steffen Möller pushed to branch upstream at Debian Med / fasta3


Commits:
c1a8a345 by Steffen Moeller at 2018-04-15T15:56:45+02:00
New upstream version 36.3.8g
- - - - -


20 changed files:

- README
- README.md
- doc/README_v36.3.8d.md
- + doc/README_v36.3.8g.md
- doc/fasta_guide.pdf
- doc/fasta_guide.tex
- doc/readme.v36
- psisearch2/m89_btop_msa2.pl
- psisearch2/psisearch2_msa.pl
- psisearch2/psisearch2_msa.py
- scripts/ann_exons_ncbi.pl
- + scripts/ann_exons_up_sql.pl
- scripts/ann_exons_up_www.pl
- scripts/ann_pfam30_tmptbl.pl
- src/ag_stats.c
- src/cal_cons2.c
- src/comp_lib9.c
- src/compacc2e.c
- src/initfa.c
- src/scaleswn.c


Changes:

=====================================
README
=====================================
--- a/README
+++ b/README
@@ -1,4 +1,11 @@
 
+[This file has been replaced by README.md]
+
+December, 2017
+
+The most up-to-date version information on FASTA versions is available
+in README.md and doc/readme.v36 .
+
 July, 2015
 
 This version of the FASTA programs is fasta-36.3.8.  Since March, 2011


=====================================
README.md
=====================================
--- a/README.md
+++ b/README.md
@@ -12,9 +12,26 @@ includes programs for aligning translated DNA sequences against
 proteins (`fastx`, `fasty` are equivalent to `blastx`, `tfastx`,
 `tfasty` are similar to `tblastn`).
 
-####September, 2016
+####December, 2017
+The current FASTA version is fasta-36.3.8f, Dec. 2017
+
+The statistics routines for normally distributed scores (ggsearch36,
+glsearch36) are more robust to very low E()-value thresholds.
+
+####Sept, 2017
+The current FASTA version is fasta-36.3.8f, Sept. 2017
+
+If the -S option is used and a query sequence has no upper case
+letters, it is re-read with lower-case letters converted to upper-case.
 
-The current FASTA version is fasta-36.3.8e.
+####May, 2017
+The current FASTA version is fasta-36.3.8f, May. 2017
+
+Various bugs in sub-alignment scoring corrected and support for the
+EBI SP:GSTM1_HUMAN P09488 added.  The format for the $SRCH_URL and
+$SRCH_URL2 format strings has changed to enable pairwise alignment.
+
+####September, 2016
 
 The fasta-36.3.6e version includes a new directory, `psisearch2`, with
 scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using


=====================================
doc/README_v36.3.8d.md
=====================================
--- a/doc/README_v36.3.8d.md
+++ b/doc/README_v36.3.8d.md
@@ -6,7 +6,7 @@ Changes in **fasta-36.3.8d** released 13-April-2016:
 
 1. Various bug fixes to `pssm_asn_subs.c` that avoid coredumps when
    reading NCBI PSSM ASN.1 binary files.  `pssm_asn_subs.c` can now read
-   UUPACAA sequences.
+   IUPACAA sequences.
 
 2. default gap penalties for VT40 (from -14/-2 to -13/-1), VT80 (from
    -14/-2 to -11/-1), and VT120 (from -10/-1 to 11/-1) have changed


=====================================
doc/README_v36.3.8g.md
=====================================
--- /dev/null
+++ b/doc/README_v36.3.8g.md
@@ -0,0 +1,18 @@
+
+
+## The FASTA package - protein and DNA sequence similarity searching and alignment programs
+
+Changes in **fasta-36.3.8f** released 31-Dec-2017
+
+1. (December, 2017) -- Make statistical thresholds more robust for
+   small E()-values with normally distributed scores (ggsearch36,
+   glsearch36).
+
+2. (September, 2017) Treat all lower-case queries as uppercase with -S option.
+
+3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
+
+4. Improvements/fixes to psisearch2 scripts.
+
+For more detailed information, see `doc/readme.v36`.
+


=====================================
doc/fasta_guide.pdf
=====================================
Binary files a/doc/fasta_guide.pdf and b/doc/fasta_guide.pdf differ


=====================================
doc/fasta_guide.tex
=====================================
--- a/doc/fasta_guide.tex
+++ b/doc/fasta_guide.tex
@@ -1470,8 +1470,8 @@ environment variables are used in HTML mode (\texttt{-m 6}) to provide
 links from the sequence alignment (see the links at
 \url{http://fasta.bioch.virginia.edu/fasta_www2/}). \texttt{REF\_URL}
 is associated with the \texttt{Entrez Lookup} link; \texttt{SRCH\_URL}
-with the \texttt{Re-search database} link, and \texttt{SRCH\_URL1}
-with the \texttt{General re-search} link.  In each case, the text
+with the \texttt{General re-search} link, and \texttt{SRCH\_URL1}
+with the \texttt{Pairwise alignment} link.  In each case, the text
 corresponds to a HTML URL, but with positions containing the
 \texttt{\%s} or \texttt{\%ld} (for numbers) part of a 'C'
 \texttt{sprintf()} call for specific variables. \texttt{REF\_URL} uses


=====================================
doc/readme.v36
=====================================
--- a/doc/readme.v36
+++ b/doc/readme.v36
@@ -6,6 +6,24 @@ multiple high-scoring alignments to be shown, rather than just one.
 This is the main functional difference between FASTA and BLAST -
 BLAST could show multiple HSPs, FASTA did not.
 
+>>Dec. 30, 2017  [released as fasta-36.3.8g]
+[scaleswn.c]
+Replace np_to_z() with np1_to_z(), which does not substract low
+probability from 1.0, thus allowing accurate z-values for very low
+probabilities.
+
+>>Sept. 26, 2017
+[comp_lib9.c, compacc2e.c]
+Previously, if the query sequence was all lower-case letters (seg-ed)
+and the '-S' option specified, the search would effectively be done
+with a zero-length sequence, which broke the statistics.  The code has
+been modified to convert all lower-case queries to upper-case when -S
+is used.
+
+[scaleswn.c]
+Fixed problem with scaleswn.c/ag_stats() not setting parameters
+properly when matrix was unknown.
+
 >>May 23, 2017 [released as fasta-36.3.8f]
 [url_subs.c]
 A small, but major change in the output available to the $SRCH_URL and
@@ -18,7 +36,7 @@ pairwise alignment becomes possible.
 >>May 23, 2017
 [scripts/ann_feats2ipr.pl,ann_feats_up_www2.pl,test_ann_scripts.sh  src/defs.h]
 Changes to ensure that EBI format databases, which place the ID before
-the accession, e.g.  SP:GSTM1_HUMAN P09388, can be processed properly
+the accession, e.g.  SP:GSTM1_HUMAN P09488, can be processed properly
 by annotation scripts.  This involved displaying more of the
 description line, so that the accession field is included, in the
 annot_XXXXX file.  


=====================================
psisearch2/m89_btop_msa2.pl
=====================================
--- a/psisearch2/m89_btop_msa2.pl
+++ b/psisearch2/m89_btop_msa2.pl
@@ -50,7 +50,7 @@ use Getopt::Long;
 
 
 my ($shelp, $help, $m_format, $evalue, $qvalue, $domain_bound) = (0, 0, "m8CB", 0.001, 30.0,0);
-my ($query_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","");
+my ($query_file, $sel_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","","");
 my $query_lib_r = 0;
 my ($eval2_fmt, $eval2) = (0,"");
 
@@ -62,6 +62,9 @@ GetOptions(
     "expect=f" => \$evalue,
     "qvalue=f" => \$qvalue,
     "format=s" => \$m_format,
+    "selected_file_in=s" => \$sel_file,
+    "sel_file_in=s" => \$sel_file,
+    "sel_file=s" => \$sel_file,
     "m_format=s" => \$m_format,
     "mformat=s" => \$m_format,
     "bound_file_in=s" => \$bound_file_in,
@@ -120,6 +123,30 @@ if (! $query_file || !$query_len) {
   die "query sequence required";
 }
 
+my %sel_accs = ();
+my $have_sel_accs = 0;
+if ($sel_file) {
+  if (open (my $sfd, $sel_file)) {
+    while (my $line = <$sfd>) {
+      chomp $line;
+      next if $line =~ m/^#/;
+      if ($line =~ m/\t/) {
+	my @data = split(/\t/,$line);
+	$sel_accs{$data[0]} = $data[1];
+      }
+      else {
+	$sel_accs{$line} = 1;
+      }
+    }
+    close($sfd);
+    $evalue = 1000000.0;
+    $have_sel_accs = 1;
+  }
+  else {
+    warn "Cannot open selected sequence file: $sel_file";
+  }
+}
+
 push @multi_names, $query_acc;
 $acc_names{$query_acc} = 1;
 $multi_align{$query_acc} = btop2alignment($query_seq_r, $query_len, {BTOP=>$query_len, q_start=>1, q_end=>$query_len}, 0);
@@ -242,6 +269,10 @@ while (my $line = <>) {
     $subj_acc =~ s/^gi\|\d+\|(\w+\|\w+)\|?\w+/$1/;
   }
 
+  if ($have_sel_accs) {
+    next unless ($sel_accs{$hit_data{'s_seqid'}});
+  }
+
   # a better solution would be to rename the q_seqid, or at least to
   # check for identity
   # next if ($hit_data{q_seqid} eq $hit_data{s_seqid});


=====================================
psisearch2/psisearch2_msa.pl
=====================================
--- a/psisearch2/psisearch2_msa.pl
+++ b/psisearch2/psisearch2_msa.pl
@@ -25,21 +25,20 @@ use Pod::Usage;
 # implementation of simple shell script to do iterative searches
 #
 # logic:
-# (1) do initial search
+# (1) do initial search or take results from previous search with --prev_m89res
 # (2) use results of initial search to produce MSA/PSSM for next search
 # (3) do PSSM search
-# (4) use results of PSSM search to produce MSA/PSSM for iterative step 3
+# (4) repeat at step (2)
 #
 ################
 #
 # command:
-# psisearch2_msa.pl --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast
+# psisearch2_msa.pl --query query.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file
 #
 ################
 
-use vars qw( $query_file $db_file $num_iter $evalue $int_mask $end_mask $query_mask $no_msa $tmp_dir $dom_flag $align_flag $suffix $srch_pgm $file_out $help $shelp $error_log $rm_flag $annot_type $quiet);
-use vars qw( $prev_msa $next_msa $prev_hitdb $next_hitdb $prev_pssm $next_pssm $prev_bound_in $next_bound_out $tmp_file_list $save_all $delete_bnd $delete_tmp);
-
+use vars qw( $query_file $db_file $num_iter $pssm_evalue $srch_evalue $int_mask $end_mask $query_mask $tmp_dir $dom_flag $align_flag $suffix $srch_pgm $file_out $help $shelp $error_log $rm_flag $annot_type $quiet);
+use vars qw( $prev_m89res $m_format $prev_sel_res $this_iter $prev_msa $next_msa $prev_hitdb $next_hitdb $prev_pssm $next_pssm $prev_bound $next_bound_out $tmp_file_list $save_all $delete_bnd $delete_tmp $use_stdout);
 
 ################
 # locations of required programs:
@@ -54,49 +53,58 @@ my $ssearch_bin = "$pgm_bin/ssearch36";
 my $psiblast_bin = "$pgm_bin/psiblast";
 my $makeblastdb_bin = "$pgm_bin/makeblastdb";
 my $datatool_bin = "$pgm_bin/datatool -m $pgm_data/NCBI_all.asn";
-my $align2msa_lib = "m89_btop_msa2.pl";
+my $align2msa_lib = "$pgm_bin/m89_btop_msa2.pl";
 
 my %srch_subs = ('ssearch' => \&get_ssearch_cmd,
 		 'psiblast' => \&get_psiblast_cmd,
 		);
 
-my %annot_cmds = ('rpd3' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
-		  'rpd3nv' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over"),
-		  'rpd3nvn' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
-		  'pfam' => qq("\!../scripts/ann_pfam30.pl --pfacc --vdoms --split_over"));
+my %annot_cmds = ('rpd3' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
+		  'rpd3nv' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over"),
+		  'rpd3nvn' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
+		  'pfam' => qq("\!ann_pfam30.pl --vdoms --split_over --neg")
+    );
 
-($num_iter, $evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
-  ( 5, 0.002, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
+($num_iter, $pssm_evalue, $srch_evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
+  ( 5, 0.002, 5.0, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
 ($save_all, $tmp_file_list, $delete_bnd, $delete_tmp) = (0, "", 0, 0);
+($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","", "","", 1, 0);
 
-
-my $pgm_command =  "#".join(" ",($0, at ARGV));
-print STDERR "#",join(" ",($0, at ARGV)),"\n" if ($error_log);
+my $pgm_command =  "# ".join(" ",($0, at ARGV));
+print STDERR "# ",join(" ",($0, at ARGV)),"\n" if ($error_log);
 
 GetOptions(
 	   'query|sequence=s' => \$query_file,
 	   'db|database=s' => \$db_file,
            'suffix|out_suffix=s' => \$suffix,
     	   'dir=s' => \$tmp_dir,
-	   'evalue=f' => \$evalue,
+	   'pssm_evalue=f' => \$pssm_evalue,
+	   'search_evalue=f' => \$srch_evalue,
 	   'annot_db=s' => \$annot_type,
            'out_name=s' => \$file_out,
+	   'use_stdout' => \$use_stdout,
+    	   'this_iter=s' => \$this_iter,
 	   'iter=i' => \$num_iter,
+    	   'prev_m89res=s' => \$prev_m89res,
+           'sel_res_in=s' => \$prev_sel_res,
+           'sel_accs=s' => \$prev_sel_res,
+           'sel_file=s' => \$prev_sel_res,
+           'sel_file_in=s' => \$prev_sel_res,
 	   # 'in_msa=s' => \$prev_msa,
 	   # 'out_msa=s' => \$next_msa,
 	   # 'in_hitdb=s' => \$prev_hitdb,
 	   # 'out_hitdb=s' => \$next_hitdb,
 	   'in_pssm=s' => \$prev_pssm,
 	   # 'out_pssm=s' => \$next_pssm,
-	   'in_bounds=s' => \$prev_bound_in,
+	   'prev_bounds=s' => \$prev_bound,
 	   'out_bounds=s' => \$next_bound_out,
 	   'num_iter|max_iter=i' => \$num_iter,
-           # 'no_msa' => \$no_msa,
 	   'dom|domain' => \$dom_flag,
 	   'align' => \$align_flag,
     	   'query_seed|query_mask' => \$query_mask,
 	   'int_mask|int-mask|int_seed|int-seed=s' => \$int_mask,
 	   'end_mask|end-mask|end_seed|end-seed=s' => \$end_mask,
+	   'm_format=s' => \$m_format,
 	   'pgm=s' => \$srch_pgm,
     	   'quiet' => \$quiet,
     	   'q' => \$quiet,
@@ -116,6 +124,8 @@ pod2usage(exitstatus => 0, verbose => 2) if $help;
 pod2usage(1) unless $query_file && -r $query_file;  # need a query
 pod2usage(1) unless $db_file ;        # need a database
 
+my %sel_hits = ();
+
 my @del_file_ext = qw(msa psibl_out hit_db asntxt asnbin);
 
 if ($srch_pgm =~ m/psiblast/) {
@@ -127,10 +137,7 @@ if ($query_mask) {
   $end_mask='query' unless $end_mask ne 'none';
 }
 
-if ($delete_tmp) {
-  $delete_bnd = 1;
-}
-elsif ($save_all) {
+if (!$delete_tmp && $save_all) {
   @del_file_ext = ();
   $tmp_file_list = "";
   $delete_bnd = 0
@@ -155,45 +162,76 @@ if ($tmp_file_list) {
 
 print "$pgm_command\n" unless ($quiet);
 
-my $this_iter = "it1";
+####
+# generate output filenames
 
 my ($query_pref) = ($query_file =~ m/([\w\.]+)$/);
 
 $file_out = $query_pref unless $file_out;
 
-my $this_file_pref = "$file_out.$this_iter";
+my $this_file_pref = "$file_out.it$this_iter";
 $this_file_pref = "$this_file_pref.$suffix" if ($suffix);
 my $this_file_out = $this_file_pref;
 $this_file_out = "$tmp_dir/$this_file_out" if ($tmp_dir);
 
-my $prev_file_out = $this_file_out;
+my $prev_file_out = "";
 
 ####
-# parse output to build PSSM
-# generate output filenames
+# do the first search or use $prev_m89res
+my $first_iter = 0;
+my $iter_val = $this_iter;
+my $search = "";
+
+my @del_err_files = ();
 
-# do the first search
-my $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
-log_system("$search > $this_file_out 2> $this_file_out.err");
+unless ($prev_m89res) {
+  $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
+  unless ($use_stdout) {
+    log_system("$search > $this_file_out 2> $this_file_out.err");
+  }
+  else {
+    log_system("$search 2> $this_file_out.err");
+  }
+  push @del_err_files, "$this_file_out.err";
+  $first_iter++;
+}
+else {
+  $this_file_out = $prev_m89res;
+}
 
-my ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound_in);
+my ($this_pssm, $this_bound_out) = ("","");
 
 # now have necessary files for next iteration
 
-for (my $it=2; $it <= $num_iter; $it++) {
+for (my $it=$first_iter; $it < $num_iter; $it++) {
+
+  ####
+  # build the PSSM for the current search
+
+  my ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound, $prev_sel_res, $m_format);
+  $prev_file_out = $this_file_out;
+  $prev_sel_res = "";
+
+  $iter_val = $this_iter + $it;
 
-  $prev_pssm = $this_pssm;
-  $prev_bound_in = $this_bound_out;
   ####
   # build filename for this iteration
-  $this_file_pref = $this_file_out = "$file_out.it$it";
+  $this_file_pref = $this_file_out = "$file_out.it$iter_val";
   $this_file_out = "$this_file_pref.$suffix" if ($suffix);
   $this_file_out = "$tmp_dir/$this_file_out" if ($tmp_dir);
 
-  $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
-  log_system("$search > $this_file_out 2> $this_file_out.err");
+  $search = $srch_subs{$srch_pgm}($query_file, $db_file, $this_pssm);
+  unless ($use_stdout) {
+    log_system("$search > $this_file_out 2> $this_file_out.err");
+  }
+  else {
+    log_system("$search 2> $this_file_out.err");
+  }
+  push @del_err_files, "$this_file_out.err";
 
+  ####
   # here, we are done with previous .msa, .asntxt, .asnbin, etc files.  Delete them if desired
+
   if (@del_file_ext) {
     my @del_file_list = ();
     for my $ext (@del_file_ext) {
@@ -201,36 +239,20 @@ for (my $it=2; $it <= $num_iter; $it++) {
     }
     log_system("rm ".join(" ", at del_file_list));
   }
-  $prev_file_out = $this_file_out;
-
-  ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound_in);
+  log_system("rm $prev_bound") if ($delete_bnd);
 
-  if (has_converged($prev_bound_in, $this_bound_out)) {
-    print STDERR "$0 $srch_pgm $query_file $db_file converged ($it iterations)\n" unless ($quiet);
-
-    if (@del_file_ext) {
-      my @del_file_list = ();
-      for my $ext (@del_file_ext) {
-	push @del_file_list, "$prev_file_out.$ext";
-      }
-      log_system("rm ".join(" ", at del_file_list));
-      log_system("rm $prev_bound_in $this_bound_out") if ($delete_bnd);
-    }
-    exit(0);
+  if (has_converged($prev_bound, $this_bound_out)) {
+    print STDERR "$0 $srch_pgm $query_file $db_file converged ($iter_val iterations)\n" unless ($quiet);
+    last;
   }
-  log_system("rm $prev_bound_in") if ($delete_bnd);
+  $prev_bound = $this_bound_out
 }
 
-if (@del_file_ext) {
-  my @del_file_list = ();
-  for my $ext (@del_file_ext) {
-    push @del_file_list, "$prev_file_out.$ext";
-  }
-  log_system("rm ".join(" ", at del_file_list));
+if (@del_err_files) {
+  log_system("rm ".join(" ", at del_err_files));
 }
 
-log_system("rm $this_bound_out") if ($delete_bnd);
-
+log_system("rm $prev_bound") if ($delete_bnd);
 unless ($quiet) {
   print STDERR "$0 $srch_pgm $query_file $db_file finished ($num_iter iterations)\n";
 }
@@ -254,7 +276,7 @@ sub log_system {
 sub get_ssearch_cmd {
   my ($query_file, $db_file, $pssm_file) = @_;
 
-  my $search_cmd = qq($ssearch_bin -S -m 8CB -d 0 -E "1.0 0" -s BP62);
+  my $search_cmd = qq($ssearch_bin -S -m 6 -m 9B -E "$srch_evalue 0" -s BP62);
   if ($annot_type) {
     $search_cmd .= qq( -V $annot_cmds{$annot_type});
   }
@@ -274,7 +296,7 @@ sub get_ssearch_cmd {
 sub get_psiblast_cmd {
   my ($query_file, $db_file, $pssm_file) = @_;
 
-  my $search_cmd = qq($psiblast_bin -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh $evalue -num_iterations 1 -db $db_file);
+  my $search_cmd = qq($psiblast_bin -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh $pssm_evalue -evalue $srch_evalue -num_iterations 1 -db $db_file);
   if ($pssm_file) {
     $search_cmd .= qq( -in_pssm $pssm_file);
 #    $search_cmd .= qq( -comp_based_stats 0);
@@ -296,7 +318,7 @@ sub get_psiblast_cmd {
 # always produce a $bound_file_out file to test for convergence
 #
 sub build_msa_pssm {
-  my ($query_file, $this_file_out,$prev_bound_in) = @_;
+  my ($query_file, $this_file_out,$prev_bound_in, $prev_sel_in, $m_format) = @_;
 
   my ($this_msa, $this_hit_db, $this_pssm_asntxt, $this_pssm_asnbin, $this_psibl_out, $this_bound_out) =
     ("$this_file_out.msa",
@@ -308,7 +330,18 @@ sub build_msa_pssm {
     );
 
   my $blastdb_err = "$this_file_out.mkbldb_err";
-  my $aln2msa_cmd = qq($align2msa_lib --query $query_file --evalue $evalue --masked_lib_out=$this_hit_db);
+  my $aln2msa_cmd = qq($align2msa_lib --query $query_file --masked_lib_out=$this_hit_db);
+
+  if ($m_format) {
+    $aln2msa_cmd .= qq( --m_format $m_format);
+  }
+
+  if ($prev_sel_in) {
+    $aln2msa_cmd .= qq( --sel_file_in $prev_sel_in);
+  }
+  else {
+    $aln2msa_cmd .= qq( --evalue $pssm_evalue);
+  }
 
   if ($int_mask) {
     $aln2msa_cmd .= qq( --int_mask_type $int_mask);
@@ -342,7 +375,7 @@ sub build_msa_pssm {
   log_system("rm $this_hit_db.p* $blastdb_err");
 
   # remove uninformative error logs
-  log_system("rm $this_psibl_out.err $this_file_out.err") unless $error_log;
+  log_system("rm $this_psibl_out.err") unless $error_log;
 
   unless ($srch_pgm eq 'psiblast') {
     my $asn2asn_cmd = "$datatool_bin -v $this_pssm_asntxt -e $this_pssm_asnbin";
@@ -361,6 +394,8 @@ sub build_msa_pssm {
 sub has_converged {
   my ($file1, $file2) = @_;
 
+  return 0 unless ($file1 && $file2);
+
   my @f1_names = ();
   my @f2_names = ();
 
@@ -408,12 +443,19 @@ psisearch2_msa.pl
 
  -h	short help
  --help include description
+
  --query query file  (also --sequence)
  --db    database file (--database)
- --pgm   program used for searching, ssearch or psiblast
- --num_iter maximum number of iterations (--max_iter)
+ --pgm   [ssearch] program used for searching, ssearch or psiblast
+ --num_iter/iter [5] maximum number of iterations (--max_iter)
+
+
+ --this_iter [0] iteration number for tracking
+ --pssm_evalue  [0.002] threshold for inclusion in PSSM
+ --search_evalue [5.0] threshold for inclusion in search display
+ --annot_db [null] (rpd3, rpd3nv, rpd3nvn, pfam)
+
  --dir   working directory and location of output
- --evalue  threshold for inclusion in PSSM
  --out_name/--suffix  result file is "out_name.it#.suffix"
  --in_msa/--out_msa   [not implemented] MSA used to build PSSM, requires --in_hitdb
  --in_hitdb/--out_hitdb [not implemented] used to build PSSM
@@ -427,8 +469,8 @@ psisearch2_msa.pl
 
 =head1 DESCRIPTION
 
-C<psisearch2_msa.pl> automates successive iterations of C<psiblast> or
-C<ssearch36> using different strategies to reduce PSSM contamination
+C<psisearch2_msa.pl> can perform one, or multiple, successive iterations of C<psiblast> or
+C<ssearch36> using different query seeding strategies to reduce PSSM contamination
 from alignment over-extension.  C<psisearch2_msa.pl> uses the
 C<m89_btop_msa2.pl> program to read BTOP formatted output from
 C<psiblast> or C<ssearch36> and produce both a multiple sequence


=====================================
psisearch2/psisearch2_msa.py
=====================================
--- a/psisearch2/psisearch2_msa.py
+++ b/psisearch2/psisearch2_msa.py
@@ -34,7 +34,7 @@ import re
 ################
 #
 # command:
-# psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix --pgm ssearch/psiblast
+# psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter #
 #
 ################
 
@@ -59,15 +59,9 @@ annot_cmds = {'rpd3': '"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --sp
 
 num_iter = 5
 evalue = 0.002
-dom_flag = 0
-align_flag = 0
-int_mask = 'none'
-end_mask = 'none'
 srch_pgm = 'ssearch'
-tmp_dir = ''
 error_log = 0
 rm_flag = 0
-annot_type = ''
 quiet = 0
 
 ################
@@ -90,8 +84,8 @@ def get_ssearch_cmd(query_file, db_file, pssm_file) :
 
   search_cmd = '%s -S -m 8CB -d 0 -E "1.0 0" -s BP62' % (ssearch_bin)
 
-  if (annot_type) :
-      search_cmd += " -V %s" % (annot_cmds[annot_type])
+  if (args.annot_type) :
+      search_cmd += " -V %s" % (annot_cmds[args.annot_type])
 
   if (pssm_file) :
       search_cmd += ' -P "%s 2"' % (pssm_file)
@@ -107,7 +101,7 @@ def get_ssearch_cmd(query_file, db_file, pssm_file) :
 #
 def get_psiblast_cmd(query_file, db_file, pssm_file) :
 
-  search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, evalue, db_file)
+  search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, args.evalue, db_file)
 
   if (pssm_file) :
       search_cmd += " -in_pssm %s" % (pssm_file)
@@ -126,24 +120,29 @@ def get_psiblast_cmd(query_file, db_file, pssm_file) :
 #
 # always produce a bound_file_out file to test for convergence
 #
-def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
+def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, args, error_log) :
 
   (this_msa, this_hit_db, this_pssm_asntxt, this_pssm_asnbin, this_psibl_out, this_bound_out) =  (this_file_out+".msa",this_file_out+".hit_db",this_file_out+".asntxt",this_file_out+".asnbin",this_file_out+".psibl_out",this_file_out+".bnd_out")
 
   blastdb_err = this_file_out+".mkbldb_err"
-  aln2msa_cmd = "%s --query %s --evalue %f --masked_lib_out=%s" % (align2msa_lib, query_file, evalue, this_hit_db)
+  aln2msa_cmd = "%s --query %s --masked_lib_out=%s" % (align2msa_lib, query_file, this_hit_db)
 
-  if (int_mask) :
-      aln2msa_cmd += " --int_mask_type %s" % (int_mask)
+  if (prev_sel_res) :
+    aln2msa_cmd += " --sel_res %s" % (prev_sel_res)
+  else:
+    aln2msa_cmd += " --evalue %f" % (args.evalue)
+    
+  if (args.int_mask) :
+      aln2msa_cmd += " --int_mask_type %s" % (args.int_mask)
 
-  if (end_mask) :
-      aln2msa_cmd += " --end_mask_type %s" % (end_mask)
+  if (args.end_mask) :
+      aln2msa_cmd += " --end_mask_type %s" % (args.end_mask)
 
-  if (dom_flag) :
+  if (args.dom_flag) :
     aln2msa_cmd += " --domain"
 
-  if (align_flag and prev_bound_in) :
-      aln2msa_cmd += " --bound_file_in %s" %(prev_bound_in)
+  if (args.align_flag and args.prev_bound_in) :
+      aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in)
 
   # always produce this file to check for convergence
   aln2msa_cmd += " --bound_file_out %s" % (this_bound_out)
@@ -162,7 +161,7 @@ def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
 
   # remove uninformative error logs
   if (not error_log) :
-    log_system("rm "+this_psibl_out+".err "+this_file_out+".err",error_log)
+    log_system("rm "+this_psibl_out+".err ",error_log)
 
   if (srch_pgm != 'psiblast') :
       asn2asn_cmd = "%s -v %s -e %s" % (datatool_bin, this_pssm_asntxt, this_pssm_asnbin)
@@ -177,6 +176,9 @@ def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
 #
 def has_converged(file1, file2) :
 
+  if (not file1 or not file2):
+    return 0
+
   f1_names = []
   f2_names = []
 
@@ -224,13 +226,16 @@ arg_parse.add_argument('--evalue', dest='evalue', default=0.002, type=float, act
 arg_parse.add_argument('--annot_db', dest='annot_type', action='store',help='source of domain annotations')
 arg_parse.add_argument('--suffix', dest='suffix', action='store',help='suffix for result output')
 arg_parse.add_argument('--out_name', dest='file_out', action='store',help='result file name')
-arg_parse.add_argument('--iter', dest='num_iter', default=5, type=int, action='store',help='number of iterations')
+arg_parse.add_argument('--num_iter', dest='num_iter', default=5, type=int, action='store',help='number of iterations to run')
 arg_parse.add_argument('--in_pssm', dest='prev_pssm', action='store',help='initial PSSM')
 arg_parse.add_argument('--in_bounds', dest='prev_bound_in', type=str, action='store',help='initial boundaries')
 arg_parse.add_argument('--domain', dest='dom_flag', action='store_true',help='use domain annotations')
-arg_parse.add_argument('--align', dest='align_flag', action='store_true',help='use alignment boundaries')
+arg_parse.add_argument('--align', dest='align_flag', default=0, action='store_true',help='use alignment boundaries')
 arg_parse.add_argument('--pgm', dest='srch_pgm', action='store',default='ssearch',help='search program: ssearch/psiblast')
 arg_parse.add_argument('--query_seed', dest='query_mask', action='store_true',help='use query seeding')
+arg_parse.add_argument('--prev_m89res', dest='prev_m89res', action='store', help='prevous iteration result')
+arg_parse.add_argument('--sel_res', dest='prev_sel_res', action='store', help='selected accession file')
+arg_parse.add_argument('--this_iter', dest='this_iter', help='this iteration number',type=int)
 arg_parse.add_argument('--int_seed', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
 arg_parse.add_argument('--end_seed', dest='end_mask', action='store',default='none',help='sequence masking: none/query/random')
 arg_parse.add_argument('--int_mask', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
@@ -257,13 +262,16 @@ del_file_ext = ["msa","psibl_out","hit_db","asntxt","asnbin"]
 if (re.match('psiblast',srch_pgm)) :
     del_file_ext.pop()
 
-
 if (args.query_mask) :
     if (args.int_mask == 'none') :
         args.int_mask = 'query'
     if (args.end_mask == 'none') :
         args.end_mask = 'query'
 
+this_iter = 1
+if (args.this_iter):
+  this_iter = args.this_iter
+
 delete_bnd = args.delete_bnd
 if (args.delete_tmp) :
     delete_bnd = 1
@@ -282,8 +290,6 @@ if (args.tmp_file_list) :
             new_del_file_ext.append(ext)
     del_file_ext = new_del_file_ext[:]
 
-this_iter = "it1"
-
 query_pref = query_file = args.query_file
 m = re.search(r'([\w\.]+)$',str(args.query_file))
 query_pref = m.groups()[0]
@@ -293,35 +299,50 @@ if (not args.file_out) :
 else :
     file_out = args.file_out
 
-this_file_out = this_file_pref = file_out+"."+this_iter
+this_file_out = this_file_pref = file_out+".it"+str(this_iter)
+
 if (args.suffix) :
     this_file_out = this_file_pref+"."+args.suffix
 if (args.tmp_dir) :
     this_file_out = args.tmp_dir+"/"+this_file_out
 
+prev_bound_in = ''
+if (args.prev_bound_in):
+  prev_bound_in = args.prev_bound_in
+
 ####
 # parse output to build PSSM
 # generate output filenames
 
-prev_file_out = this_file_out
+first_iter = 0
+del_err_files = []
 
 # do the first search
-search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm)
-log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log)
+if (not args.prev_m89res):
+  search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm)
+  log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log)
+  del_err_files.append(this_file_out+".err")
+  first_iter += 1
+else:
+  this_file_out = args.prev_m89res
 
-prev_file_out = this_file_out
 
-(this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, args.prev_bound_in, error_log)
+it=first_iter
+
 # now have necessary files for next iteration
 
-it=2
-while (it <= args.num_iter) :
+while (it < args.num_iter) :
+
+  (this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, arg.prev_sel_res,  error_log)
+  prev_file_out = this_file_out
+  arg.prev_sel_res = ''
+
+  iter_val = this_iter + it
 
   prev_pssm = this_pssm
-  prev_bound_in = this_bound_out
   ####
   # build filename for this iteration
-  this_file_out = this_file_pref = "%s.it%d" % (file_out,it)
+  this_file_out = this_file_pref = "%s.it%d" % (file_out,iter_val)
   if (args.suffix) :
       this_file_out = this_file_pref+"."+args.suffix
   if (args.tmp_dir) :
@@ -329,22 +350,19 @@ while (it <= args.num_iter) :
 
   search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm)
   log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), error_log)
+  del_err_files.append(this_file_out+".err")
 
   if (len(del_file_ext)):
       del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
       log_system('rm '+' '.join(del_file_list),error_log)
 
-  prev_file_out = this_file_out
-
-  (this_pssm, this_bound_out) = build_msa_pssm(query_file, this_file_out, prev_bound_in, error_log)
-
   if (has_converged(prev_bound_in, this_bound_out)) :
       if (not quiet) :
           sys.stderr.write(" %s %s %s %s converged (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
 
-      if (len(del_file_ext)):
-          del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
-          log_system('rm '+' '.join(del_file_list),error_log)
+      # if (len(del_file_ext)):
+      #     del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
+      #     log_system('rm '+' '.join(del_file_list),error_log)
 
       if (delete_bnd) :
           log_system("rm "+prev_bound_in,error_log)
@@ -353,16 +371,20 @@ while (it <= args.num_iter) :
 
   if (delete_bnd) :
       log_system("rm "+prev_bound_in,error_log)
+  prev_bound_in = this_bound_out
 
   it += 1
 
-if (len(del_file_ext)):
-    del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
-    log_system('rm '+' '.join(del_file_list),error_log)
+if (len(del_err_files)):
+  log_system('rm '+' '.join(del_err_files),error_log)
+
+# if (len(del_file_ext)):
+#     del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
+#     log_system('rm '+' '.join(del_file_list),error_log)
 
 if (delete_bnd):
     log_system("rm "+this_bound_out,error_log)
 
 if (not quiet) :
-    sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it-1))
+    sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
 


=====================================
scripts/ann_exons_ncbi.pl
=====================================
--- a/scripts/ann_exons_ncbi.pl
+++ b/scripts/ann_exons_ncbi.pl
@@ -4,13 +4,13 @@
 
 # gi|23065544|ref|NP_000552.2| 
 #
-# and returns the exons present in the protein from NCBI gff3 tables (human and mouse only)
+# and returns the exons present in the protein from NCBI gff3 tables (human, mouse, rat, xtrop)
 #
 # it must:
 # (1) read in the line
 # (2) parse it to get the acc
 # (3) return the tab delimited exon boundaries
-
+#
 
 use strict;
 


=====================================
scripts/ann_exons_up_sql.pl
=====================================
--- /dev/null
+++ b/scripts/ann_exons_up_sql.pl
@@ -0,0 +1,337 @@
+#!/usr/bin/perl -w
+
+################################################################
+# copyright (c) 2014,2015 by William R. Pearson and The Rector &
+# Visitors of the University of Virginia
+################################################################
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under this License is distributed on an "AS
+# IS" BASIS, WITHOUT WRRANTIES OR CONDITIONS OF ANY KIND, either
+# express or implied.  See the License for the specific language
+# governing permissions and limitations under the License.
+################################################################
+
+# ann_exons_up_sql.pl gets an annotation file from fasta36 -V with a line of the form:
+
+# gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n  (at least from pir1.lseg)
+#
+# it must:
+# (1) read in the line
+# (2) parse it to get the up_acc
+# (3) return the tab delimited features
+#
+
+# this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
+# modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl
+
+use strict;
+
+use DBI;
+use Getopt::Long;
+use Pod::Usage;
+
+use vars qw($host $db $a_table $port $user $pass);
+
+my %domains = ();
+my $domain_cnt = 0;
+
+my $hostname = `/bin/hostname`;
+
+unless ($hostname =~ m/ebi/) {
+  ($host, $db, $a_table, $port, $user, $pass)  = ("wrpxdb.its.virginia.edu", "uniprot", "annot2", 0, "web_user", "fasta_www");
+#  $host = 'xdb';
+}
+else {
+  ($host, $db, $a_table, $port, $user, $pass)  = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www");
+}
+
+my ($sstr, $lav, $neg_doms, $no_vars, $no_doms, $no_feats, $shelp, $help, $pfam26) = (0,0,0,0,0,0,0,0,0,0);
+my ($min_nodom) = (10);
+
+my ($show_color) = (1);
+my $color_sep_str = " :";
+$color_sep_str = '~';
+
+GetOptions(
+    "host=s" => \$host,
+    "db=s" => \$db,
+    "user=s" => \$user,
+    "password=s" => \$pass,
+    "port=i" => \$port,
+    "lav" => \$lav,
+    "no_doms" => \$no_doms,
+    "no-doms" => \$no_doms,
+    "nodoms" => \$no_doms,
+    "no_var" => \$no_vars,
+    "no-var" => \$no_vars,
+    "novar" => \$no_vars,
+    "neg" => \$neg_doms,
+    "neg_doms" => \$neg_doms,
+    "neg-doms" => \$neg_doms,
+    "negdoms" => \$neg_doms,
+    "min_nodom=i" => \$min_nodom,
+    "min-nodom=i" => \$min_nodom,
+    "no_feats" => \$no_feats,
+    "no-feats" => \$no_feats,
+    "nofeats" => \$no_feats,
+    "color!" => \$show_color,
+    "sstr" => \$sstr,
+    "h|?" => \$shelp,
+    "help" => \$help,
+    );
+
+pod2usage(1) if $shelp;
+pod2usage(exitstatus => 0, verbose => 2) if $help;
+pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
+
+my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
+$connect .= ";host=$host" if $host;
+$connect .= ";port=$port" if $port;
+
+my $dbh = DBI->connect($connect,
+		       $user,
+		       $pass
+		      ) or die $DBI::errstr;
+
+
+my $get_annot_sub = \&get_fasta_annots;
+if ($lav) {
+  $no_feats = 1;
+  $get_annot_sub = \&get_lav_annots;
+}
+
+my $get_annots_id = $dbh->prepare(qq(select acc, start, end, ix from up_exons join annot2 using(acc) where id=? order by ix));
+my $get_annots_acc = $dbh->prepare(qq(select acc, start, end, ix from up_exons where acc=? order by ix));
+my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix));
+
+my $get_annots_sql = $get_annots_acc;
+
+my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
+
+# get the query
+my ($query, $seq_len) =  @ARGV;
+$seq_len = 0 unless defined($seq_len);
+
+$query =~ s/^>// if ($query);
+
+my @annots = ();
+
+#if it's a file I can open, read and parse it
+unless ($query && ($query =~ m/[\|:]/ ||
+		   $query =~ m/^[NX]P_/ ||
+		   $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
+
+  while (my $a_line = <>) {
+    $a_line =~ s/^>//;
+    chomp $a_line;
+    push @annots, show_annots($a_line, $get_annot_sub);
+  }
+}
+else {
+  push @annots, show_annots("$query\t$seq_len", $get_annot_sub);
+}
+
+for my $seq_annot (@annots) {
+  print ">",$seq_annot->{seq_info},"\n";
+  for my $annot (@{$seq_annot->{list}}) {
+    if (!$lav && $show_color && defined($domains{$annot->[-1]})) {
+      $annot->[-1] .= $color_sep_str.$domains{$annot->[-1]};
+    }
+    print join("\t",@$annot),"\n";
+  }
+}
+
+exit(0);
+
+sub show_annots {
+  my ($query_len, $get_annot_sub) = @_;
+
+  my ($annot_line, $seq_len) = split(/\t/,$query_len);
+
+  my %annot_data = (seq_info=>$annot_line);
+
+  if ($annot_line =~ m/^gi\|/) {
+    $use_acc = 1;
+    ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
+  }
+  elsif ($annot_line =~ m/^(SP|TR):(\w+) (\w+)/) {
+    ($sdb, $id, $acc) = ($1,$2,$3);
+    $use_acc = 1;
+    $sdb = lc($sdb)
+  }
+  elsif ($annot_line =~ m/^(SP|TR):(\w+)/) {
+    ($sdb, $id) = ($1,$2);
+    $use_acc = 0;
+    $sdb = lc($sdb)
+  }
+  elsif ($annot_line !~ m/\|/) {  # new NCBI swissprot format
+    $use_acc =1;
+    $sdb = 'sp';
+    ($acc) = split(/\s+/,$annot_line);
+  }
+  else {
+    $use_acc = 1;
+    ($sdb, $acc, $id) = split(/\|/,$annot_line);
+  }
+
+  # remove version number
+  unless ($use_acc) {
+    $get_annots_sql = $get_annots_id;
+    $get_annots_sql->execute($id);
+  }
+  else {
+    unless ($sdb =~ m/ref/) {
+      $get_annots_sql = $get_annots_acc;
+    } else {
+      $get_annots_sql = $get_annots_refacc;
+    }
+    $acc =~ s/\.\d+$//;
+    $get_annots_sql->execute($acc);
+  }
+
+  $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
+
+  return \%annot_data;
+}
+
+sub get_fasta_annots {
+  my ($get_annots_sql, $seq_len) = @_;
+
+  my ($acc, $start, $end, $ix);
+  my @feats = ();
+
+  while (($acc, $start, $end, $ix) = $get_annots_sql->fetchrow_array()) {
+      push @feats, [$start, "-", $end, "exon_$ix~$ix"];
+  }
+
+  return \@feats;
+}
+
+sub get_lav_annots {
+  my ($get_annots_sql, $seq_len) = @_;
+
+  my ($pos, $end, $label, $value, $comment);
+
+  my @feats = ();
+
+  my %annot = ();
+  while (($acc, $pos, $end, $label, $value) = $get_annots_sql->fetchrow_array()) {
+    next unless ($label =~ m/^DOMAIN/ || $label =~ m/^REPEAT/);
+    $value =~ s/\s?\{.+\}\.?$//;
+    $value = domain_name($label,$value);
+    push @feats, [$pos, $end, $value];
+  }
+
+  return \@feats;
+}
+
+# domain name takes a uniprot domain label, removes comments ( ;
+# truncated) and numbers and returns a canonical form. Thus:
+# Cortactin 6.
+# Cortactin 7; truncated.
+# becomes "Cortactin"
+#
+
+sub domain_name {
+
+  my ($label, $value) = @_;
+
+  if ($label =~ /DOMAIN|REPEAT/) {
+    $value =~ s/;.*$//;
+    $value =~ s/\s+\d+\.?$//;
+    $value =~ s/\.\s*$//;
+    $value =~ s/\s+\d+\.\s+.*$//;
+    $value =~ s/\s+/_/;
+    if (!defined($domains{$value})) {
+      $domain_cnt++;
+      $domains{$value} = $domain_cnt;
+    }
+    return $value;
+  }
+  else {
+    return $value;
+  }
+}
+
+__END__
+
+=pod
+
+=head1 NAME
+
+ann_exons_up_sql.pl
+
+=head1 SYNOPSIS
+
+ ann_exons_up_sql.pl --no_doms --no_feats --lav 'sp|P09488|GSTM1_NUMAN' | accession.file
+
+=head1 OPTIONS
+
+ -h	short help
+ --help include description
+ --no-doms  do not show domain boundaries (domains are always shown with --lav)
+ --no-feats do not show features (variants, active sites, phospho-sites)
+ --no-var   do not show variant sites (--no_var, --novar)
+ --lav  produce lav2plt.pl annotation format, only show domains/repeats
+ --neg-doms,  -- report domains between annotated domains as NODOM
+                 (also --neg, --neg_doms)
+ --min_nodom=10  minimum non-domain length to produce NODOM
+ --host, --user, --password, --port --db -- info for mysql database
+
+=head1 DESCRIPTION
+
+C<ann_exons_up_sql.pl> extracts feature, domain, and repeat information from
+a msyql database (default name, uniprot) built by parsing the
+uniprot_sprot.dat and uniprot_trembl.dat feature tables.  Given a
+command line argument that contains a sequence accession (P09488) or
+identifier (GSTM1_HUMAN), the program looks up the features available
+for that sequence and returns them in a tab-delimited format:
+
+ >sp|P09488|GSTM1_HUMAN
+ 2	-	88	GST_N-terminal~1
+ 7	V	F	Mutagen: Reduces catalytic activity 100- fold. {ECO:0000269|PubMed:16548513}.
+ 34	*	-	MOD_RES: Phosphothreonine. {ECO:0000250|UniProtKB:P10649}.
+ 90	-	208	GST_C-terminal~2
+ 108	V	S	Mutagen: Changes the properties of the enzyme toward some substrates. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}.
+ 108	V	Q	Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}.
+ 109	V	I	Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513}.
+ 116	#	-	BINDING: Substrate.
+ 116	V	A	Mutagen: Reduces catalytic activity 10-fold. {ECO:0000269|PubMed:16548513}.
+ 116	V	F	Mutagen: Slight increase of catalytic activity. {ECO:0000269|PubMed:16548513}.
+ 173	V	N	in allele GSTM1B; dbSNP:rs1065411. {ECO:0000269|Ref.3, ECO:0000269|Ref.5}.
+ 210	*	-	MOD_RES: Phosphoserine. {ECO:0000250|UniProtKB:P04905}.
+ 210	V	T	in dbSNP:rs449856.
+
+If features are provided, then a legend of feature symbols is provided
+as well:
+
+ ==:Active site
+ =*:Modified
+ =#:Substrate binding
+ =^:Site
+ =!:Metal binding
+
+If the C<--lav> option is specified, domain and repeat features are
+presented in a different format for the C<lav2plt.pl> program:
+
+  >sp|P09488|GSTM1_HUMAN
+  2	88	GST N-terminal.
+  90	208	GST C-terminal.
+
+C<ann_exons_up_sql.pl> is designed to be used by the B<FASTA> programs
+with the C<-V \!ann_exons_up_sql.pl> option, or by the
+C<annot_blast_btop.pl> script.  It can also be used with the
+lav2plt.pl program with the C<--xA "\!ann_exons_up_sql.pl --lav"> or
+C<--yA "\!ann_exons_up_sql.pl --lav"> options.
+
+=head1 AUTHOR
+
+William R. Pearson, wrp at virginia.edu
+
+=cut


=====================================
scripts/ann_exons_up_www.pl
=====================================
--- a/scripts/ann_exons_up_www.pl
+++ b/scripts/ann_exons_up_www.pl
@@ -149,9 +149,41 @@ sub parse_json_up_exons {
   my $acc_exons = decode_json($exon_json);
 
   my $exon_num = 1;
+  my $last_end = 0;
+  my $last_phase = 0;
+
   for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) {
     my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'});
+    my ($g_begin, $g_end) = ($exon->{'genomeLocation'}{'begin'}{'position'},$exon->{'genomeLocation'}{'end'}{'position'});
+
+    my $this_phase = 0;
+    if (defined($g_begin) && defined($g_end)) {
+      $this_phase = ($g_end - $g_begin + 1) % 3;
+    }
+
+    if (!defined($p_begin) || !defined($p_end)) {
+      $exon_num++;
+      $last_phase = 0;
+      next;
+    }
+
     if ($p_end >= $p_begin) {
+      if ($p_begin == $last_end) {
+	if ($last_phase==2) {
+	  $p_begin += 1;
+	}
+	elsif ($last_phase==1) {
+	  $last_end -= 1;
+	  $exons[-1]->{seq_end} -= 1;
+	}
+      }
+
+      if ($p_begin <= $last_end && $p_end > $last_end) {
+	$p_begin = $last_end+1;
+      }
+      $last_end = $p_end;
+      $last_phase = $this_phase;
+
       push @exons, {
 		    info=>"exon_".$exon_num.$color_sep_str.$exon_num,
 		    seq_start=>$p_begin,


=====================================
scripts/ann_pfam30_tmptbl.pl
=====================================
--- a/scripts/ann_pfam30_tmptbl.pl
+++ b/scripts/ann_pfam30_tmptbl.pl
@@ -246,9 +246,7 @@ if ($rpd2_fams) {
 }
 
 #if it's a file I can open, read and parse it
-unless ($query && ($query =~ m/[\|:]/ ||
-		   $query =~ m/^[NX]P_/ ||
-		   $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
+unless ($query && $query =~ m/[\|:]/) {
 
   while (my $a_line = <>) {
     $a_line =~ s/^>//;


=====================================
src/ag_stats.c
=====================================
--- a/src/ag_stats.c
+++ b/src/ag_stats.c
@@ -34,11 +34,11 @@ ag_parm(char *pam_type, int gdelval, int ggapval)
       r_v = look_p(p250_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
   else if (strcmp(pam_type,"P120")==0)
       r_v = look_p(p120_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
-  else if (strcmp(pam_type,"MD10")==0)
+  else if (strcmp(pam_type,"MD10")==0 || strcmp(pam_type,"VT10")==0)
       r_v = look_p(md10_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
-  else if (strcmp(pam_type,"MD20")==0)
+  else if (strcmp(pam_type,"MD20")==0 || strcmp(pam_type,"VT20")==0)
       r_v = look_p(md20_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
-  else if (strcmp(pam_type,"MD40")==0)
+  else if (strcmp(pam_type,"MD40")==0 || strcmp(pam_type,"VT40")==0)
       r_v = look_p(md40_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
   else if (strcmp(pam_type,"DNA")==0 || strcmp(pam_type,"+5/-4")==0)
       r_v = look_p(nt54_p,t_gdelval,t_ggapval,&K,&Lambda,&H);


=====================================
src/cal_cons2.c
=====================================
--- a/src/cal_cons2.c
+++ b/src/cal_cons2.c
@@ -899,7 +899,7 @@ calc_astruct(struct a_struct *aln_p, struct a_res_str *a_res_p, struct f_struct 
 }
 
 static struct update_code_str *
-init_update_data(show_code) {
+init_update_data(int show_code) {
 
   struct update_code_str *update_data_p;
 


=====================================
src/comp_lib9.c
=====================================
--- a/src/comp_lib9.c
+++ b/src/comp_lib9.c
@@ -159,6 +159,8 @@ extern void ptime (FILE *, long);
 
 #define GETLIB (m_file_p->getlib)
 
+extern void upper_seq(unsigned char *aa0, int n0, int *xascii, unsigned char *sqx);
+
 int samp_stats_idx (int *pre_nstats, int nstats, void *rand_state);
 
 void
@@ -332,6 +334,7 @@ void prhist(FILE *, const struct mngmsg *, struct pstruct *, struct hist_str his
 
 void print_sum(FILE *, struct db_str *qtt, struct db_str *ntt, int in_mem, long tot_memK);
 int reset_maxn(struct mngmsg *, int, int);	/* set m_msg.maxt, maxn from maxl */
+int count_not_seg(unsigned char *aa0, int n0, struct pstruct *pst);
 
 FILE *outfd;			/* Output file */
 
@@ -734,6 +737,11 @@ main (int argc, char *argv[])
    /* reset algorithm parameters for alphabet */
   resetp (&m_msg, &pst);
 
+  if (count_not_seg(aa0[0],m_msg.n0, &pst) == 0) { /* if no un-seg'ed query residues, convert to upper case */
+    upper_seq(aa0[0],m_msg.n0,qascii,pst.sqx);
+    fprintf(stderr,"+++ warning [%s:%d] - all lower-case query converted to upper case: %s\n", __FILE__, __LINE__, info_qlabel);
+  }
+
 #ifndef COMP_MLIB
   gettitle(m_msg.tname,m_msg.qtitle,sizeof(m_msg.qtitle));
   if (m_msg.tname[0]=='-' || m_msg.tname[0]=='@') {
@@ -1840,6 +1848,11 @@ main (int argc, char *argv[])
     /* if ends with ESS, remove terminal ESS */
     if (aa0[0][m_msg.n0-1] == ESS) { m_msg.n0--; aa0[0][m_msg.n0]= '\0';}
 
+    if (count_not_seg(aa0[0],m_msg.n0, &pst) == 0) { /* if no un-seg'ed query residues, convert to upper case */
+      upper_seq(aa0[0],m_msg.n0, qascii, pst.sqx);
+      fprintf(stderr,"+++ warning [%s:%d] - all lower-case query converted to upper case: %s\n", __FILE__, __LINE__, info_qlabel);
+    }
+
     if (m_msg.outfd) {fputc('\n',stdout);}
 
     if (qlcont) {


=====================================
src/compacc2e.c
=====================================
--- a/src/compacc2e.c
+++ b/src/compacc2e.c
@@ -4314,3 +4314,28 @@ display_push_features(void *annot_stack, struct dyn_string_str *annot_var_dyn,
     }
   }
 }
+
+int count_not_seg(unsigned char *aa0, int n0, struct pstruct *ppst) {
+  int i;
+  int nseg_cnt = 0;
+  if (ppst->ext_sq_set != 1) {
+    return n0;
+  }
+  else {
+    for (i=0; i<n0; i++) {
+      if (aa0[i] < ppst->nsq) {nseg_cnt++;}
+    }
+    return nseg_cnt;
+  }
+}
+
+void upper_seq(unsigned char *aa0, int n0, int *xascii, unsigned char *sqx) {
+  int i;
+
+  for (i=0; i<n0; i++) {
+    if (sqx[aa0[i]] >= 'a' && sqx[aa0[i]] <= 'z') {
+      aa0[i] = xascii[sqx[aa0[i]] - ('a' - 'A')];
+    }
+  }
+}
+


=====================================
src/initfa.c
=====================================
--- a/src/initfa.c
+++ b/src/initfa.c
@@ -498,9 +498,9 @@ char *iprompt1=" test sequence file name: ";
 char *iprompt2=" database file name: ";
 
 #ifdef PCOMPLIB
-char *verstr="36.3.8f May, 2017 MPI";
+char *verstr="36.3.8g Dec, 2017 MPI";
 #else
-char *verstr="36.3.8f May, 2017";
+char *verstr="36.3.8g Dec, 2017";
 #endif
 
 static int mktup=3;
@@ -1100,7 +1100,7 @@ f_getopt (char copt, char *optarg,
     m_msg->e_cut_set = 1;
 
     if (tmp_e_rep > 0.0) {
-      if (tmp_e_rep >= 1.0) { ppst->e_cut_r = ppst->e_cut/tmp_e_rep;}
+      if (tmp_e_rep >= 1.0) { ppst->e_cut_r = ppst->e_cut / tmp_e_rep;}
       else { ppst->e_cut_r = tmp_e_rep;}
     }
     break;


=====================================
src/scaleswn.c
=====================================
--- a/src/scaleswn.c
+++ b/src/scaleswn.c
@@ -677,18 +677,16 @@ proc_hist_a(struct stat_str *sptr, int nstats, struct score_count_s s_info,
     }
   }
 
-  if (r_v == 1) {
-    pu->r_u.ag.Lambda = Lambda;
-    pu->r_u.ag.K = K;
-    pu->r_u.ag.H = H;
-  }
-  else {
+  if (r_v != 1) {
       r_v = look_p(bl62_p,t_gdelval,t_ggapval,&K,&Lambda,&H);
 #ifdef DEBUG
     fprintf(stderr,"+++ Warning : [%s:%d] Parameters not available for: %s: %d/%d -- using BL62\n",
 	    __FILE__, __LINE__, ppst->pam_name,t_gdelval-t_ggapval,t_ggapval);
 #endif
   }
+  pu->r_u.ag.Lambda = Lambda;
+  pu->r_u.ag.K = K;
+  pu->r_u.ag.H = H;
 
   /*
     fprintf(stderr," the parameters are: Lambda: %5.3f K: %5.3f H: %5.3f\n",
@@ -717,11 +715,11 @@ ag_parm(char *pam_name, int gdelval, int ggapval, struct pstat_str *pu)
       r_v = look_p(p250_p,gdelval,ggapval,&K,&Lambda,&H);
   else if (strcmp(pam_name,"P120")==0)
       r_v = look_p(p120_p,gdelval,ggapval,&K,&Lambda,&H);
-  else if (strcmp(pam_name,"MD10")==0)
+  else if (strcmp(pam_name,"MD10")==0 || strcmp(pam_name,"VT10")==0)
       r_v = look_p(md10_p,gdelval,ggapval,&K,&Lambda,&H);
-  else if (strcmp(pam_name,"MD20")==0)
+  else if (strcmp(pam_name,"MD20")==0 || strcmp(pam_name,"VT20")==0)
       r_v = look_p(md20_p,gdelval,ggapval,&K,&Lambda,&H);
-  else if (strcmp(pam_name,"MD40")==0)
+  else if (strcmp(pam_name,"MD40")==0 || strcmp(pam_name,"VT40")==0)
       r_v = look_p(md40_p,gdelval,ggapval,&K,&Lambda,&H);
   else if (strcmp(pam_name,"DNA")==0 || strcmp(pam_name,"+5/-4")==0)
       r_v = look_p(nt54_p,gdelval,ggapval, &K,&Lambda,&H);
@@ -2699,6 +2697,7 @@ zs_to_E(double zs,int n1, int dnaseq, long entries, struct db_str db)
 
 #ifdef NORMAL_DIST
 double np_to_z(double, int *);
+double np1_to_z(double, int *);
 #endif
 
 /* compute z-score for given E()-value, assuming normal or
@@ -2717,7 +2716,7 @@ E_to_zs(double E, long entries)
 #else
   /* this formula does not work for E() >= 1 */
   if (e >= 1.0) e = 0.99;
-  z = np_to_z(1.0-e,&error);
+  z = np1_to_z(e,&error);
   if (!error) return z*10.0 + 50.0;
   else return 0.0;
 #endif
@@ -2958,6 +2957,20 @@ scale_scores(struct beststr **bptr, int nbest, struct db_str db,
        transcription.
 */
 
+/* added 30-Dec-2017 to deal with very low E()-value thresholds */
+
+double np1_to_z(double p1, int *fault) {
+  double zs1;
+
+  if (p1 < 0.1) {
+    zs1 = np_to_z(p1, fault);
+    return -zs1;
+  }
+  else {
+    return np_to_z(1.0 - p1, fault);
+  }
+}
+
 double np_to_z(double p, int *fault) {
 
   double q, r, ppnd16;



View it on GitLab: https://salsa.debian.org/med-team/fasta3/commit/c1a8a345121adca3f70993084dcf8d6f0f15a78d

---
View it on GitLab: https://salsa.debian.org/med-team/fasta3/commit/c1a8a345121adca3f70993084dcf8d6f0f15a78d
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180415/f5953cce/attachment-0001.html>


More information about the debian-med-commit mailing list