[Debian-med-packaging] Feature Suggestion with Patch

Laszlo Kajan lkajan at rostlab.org
Fri Oct 12 23:13:41 UTC 2012


Dear Sean!

I am Laszlo Kajan from Burkhard Rost's lab, bioinformatician, Debian maintainer.

I would like to use jackhmmer to search a smaller database to build up an HMM, and then search a bigger database with this model. The present
version of jackhmmer (that I know of, 3.0-4 in Debian) does not support this fully, because:

1: jackhmmer can not be started from a query /and/ a model, so I can not use the checkpoint model to restart it on the big database.

2: hmmsearch can not be started from a model /and/ a query, so that it would produce an alignment including the query as jackhmmer does; it also
can not do iterative search.

3: jackhmmer saves the checkpoint HMM /before/ doing another iteration, therefore in order to get the checkpoint for 3 iterations I have to run
actually 4. I would like to save those clock cycles.

I have prepared patches that solve 1 and 3 and allow me to search the big database with a model built on the smaller one, using jackhmmer
(patches attached, series: jackhmmer_chkhmmstop, jackhmmer_restartfromhmm, jackhmmer_stop_restart_test). The results obtained from the restarted
search are identical to a non-restart, when performed on the same database, i.e. 3 iterations on a database give the same results as one
iteration starting from a checkpoint HMM saved before the 3rd iteration. This make me hope that my patches - though I do not have much knowledge
about the internals of hmmer3 - are on the right path.

* Please comment on my patches and let me know if my approach is all right.

* Please consider adding the features (1 and 3) to hmmer in your next release.

  If you do approve my patches (or something with similar results), then I would like to include them in the Debian release of hmmer3. However
this should only be a temporary solution, until your next release.

Please treat my feature suggestion and patches as a sign of my great appreciation of hmmer3. Thank you for releasing hmmer3 under a free
software license.

Best regards,

Laszlo Kajan
Rost Lab
Debian Maintainer
-------------- next part --------------
--- a/src/jackhmmer.c
+++ b/src/jackhmmer.c
@@ -68,6 +68,7 @@
   { "--tblout",     eslARG_OUTFILE, NULL, NULL, NULL,      NULL,    NULL,  NULL,            "save parseable table of per-sequence hits to file <s>",        2 },
   { "--domtblout",  eslARG_OUTFILE, NULL, NULL, NULL,      NULL,    NULL,  NULL,            "save parseable table of per-domain hits to file <s>",          2 },
   { "--chkhmm",     eslARG_OUTFILE, NULL, NULL, NULL,      NULL,    NULL,  NULL,            "save HMM checkpoints to files <s>-<iteration>.hmm",            2 },
+  { "--chkhmmskip", eslARG_NONE,   FALSE, NULL, NULL,      NULL,    NULL,  NULL,            "skip search after saving HMM checkpoint for last iteration",   2 },
   { "--chkali",     eslARG_OUTFILE, NULL, NULL, NULL,      NULL,    NULL,  NULL,            "save alignment checkpoints to files <s>-<iteration>.sto",      2 },
   { "--acc",        eslARG_NONE,   FALSE, NULL, NULL,      NULL,    NULL,  NULL,            "prefer accessions over names in output",                       2 },
   { "--noali",      eslARG_NONE,   FALSE, NULL, NULL,      NULL,    NULL,  NULL,            "don't output alignments, so output is smaller",                2 },
@@ -260,6 +261,7 @@
   if (esl_opt_IsUsed(go, "--tblout"))    fprintf(ofp, "# per-seq hits tabular output:     %s\n",      esl_opt_GetString(go, "--tblout"));
   if (esl_opt_IsUsed(go, "--domtblout")) fprintf(ofp, "# per-dom hits tabular output:     %s\n",      esl_opt_GetString(go, "--domtblout"));
   if (esl_opt_IsUsed(go, "--chkhmm"))    fprintf(ofp, "# HMM checkpoint files output:     %s-<i>.hmm\n", esl_opt_GetString(go, "--chkhmm"));
+  if (esl_opt_IsUsed(go, "--chkhmmskip"))fprintf(ofp, "# skipped search after saving HMM checkpoint for last iteration\n");
   if (esl_opt_IsUsed(go, "--chkali"))    fprintf(ofp, "# MSA checkpoint files output:     %s-<i>.sto\n", esl_opt_GetString(go, "--chkali"));
   if (esl_opt_IsUsed(go, "--acc"))       fprintf(ofp, "# prefer accessions over names:    yes\n");
   if (esl_opt_IsUsed(go, "--noali"))     fprintf(ofp, "# show alignments in output:       no\n");
@@ -599,6 +601,10 @@
 #endif
 	    }
 
+		// lkajan: chkhmmskip
+		if( !( esl_opt_GetBoolean(go, "--chkhmmskip") == TRUE && esl_opt_IsOn(go, "--chkhmm") && iteration == maxiterations ) )
+		{
+
 #ifdef HMMER_THREADS
 	  if (ncpus > 0) sstatus = thread_loop(threadObj, queue, dbfp);
 	  else           sstatus = serial_loop(info, dbfp);
@@ -619,6 +625,8 @@
 			sstatus, dbfp->filename);
 	    }
 
+	    	} // lkajan: chkhmmskip
+
 	  /* merge the results of the search results */
 	  for (i = 1; i < infocnt; ++i)
 	    {
-------------- next part --------------
--- a/src/jackhmmer.c
+++ b/src/jackhmmer.c
@@ -62,6 +62,7 @@
   /* name           type         default   env  range   toggles     reqs   incomp                             help                                                  docgroup*/
   { "-h",           eslARG_NONE,   FALSE, NULL, NULL,      NULL,    NULL,  NULL,            "show brief help on version and usage",                         1 },
   { "-N",           eslARG_INT,      "5", NULL, "n>0",     NULL,    NULL,  NULL,            "set maximum number of iterations to <n>",                      1 },
+  { "--hmmprime",   eslARG_INFILE,  NULL, NULL, NULL,      NULL,    NULL,  NULL,            "HMM checkpoint to prime search with",                          1 },
 /* Control of output */
   { "-o",           eslARG_OUTFILE, NULL, NULL, NULL,      NULL,    NULL,  NULL,            "direct output to file <f>, not stdout",                        2 },
   { "-A",           eslARG_OUTFILE, NULL, NULL, NULL,      NULL,    NULL,  NULL,            "save multiple alignment of hits to file <s>",                  2 },
@@ -256,6 +257,7 @@
   fprintf(ofp, "# query sequence file:             %s\n", qfile);
   fprintf(ofp, "# target sequence database:        %s\n", dbfile);
   if (esl_opt_IsUsed(go, "-N"))          fprintf(ofp, "# maximum iterations set to:       %d\n",      esl_opt_GetInteger(go, "-N"));
+  if (esl_opt_IsUsed(go, "--hmmprime"))  fprintf(ofp, "# priming HMM checkpoint:          %s\n",      esl_opt_GetString(go, "--hmmprime"));
   if (esl_opt_IsUsed(go, "-o"))          fprintf(ofp, "# output directed to file:         %s\n",      esl_opt_GetString(go, "-o"));
   if (esl_opt_IsUsed(go, "-A"))          fprintf(ofp, "# MSA of hits saved to file:       %s\n",      esl_opt_GetString(go, "-A"));
   if (esl_opt_IsUsed(go, "--tblout"))    fprintf(ofp, "# per-seq hits tabular output:     %s\n",      esl_opt_GetString(go, "--tblout"));
@@ -401,6 +403,7 @@
   int              qformat  = eslSQFILE_UNKNOWN;  /* format of qfile                                 */
   int              dbformat = eslSQFILE_UNKNOWN;  /* format of dbfile                                */
   ESL_SQFILE      *qfp      = NULL;		  /* open qfile                                      */
+  P7_HMMFILE      *hfp      = NULL;		  /* open input HMM file                             */
   ESL_SQFILE      *dbfp     = NULL;               /* open dbfile                                     */
   ESL_ALPHABET    *abc      = NULL;               /* sequence alphabet                               */
   P7_BUILDER      *bld      = NULL;               /* HMM construction configuration                  */
@@ -414,6 +417,7 @@
   int              nnew_targets;
   int              prv_msa_nseq;
   int              status   = eslOK;
+  int              hstatus  = eslOK;
   int              qstatus  = eslOK;
   int              sstatus  = eslOK;
 
@@ -480,6 +484,15 @@
   else if (status != eslOK)        esl_fatal ("Unexpected error %d opening sequence file %s\n", status, cfg->qfile);
   qsq = esl_sq_CreateDigital(abc);
 
+  // lkajan: open HMM file
+  char		*hmmfile = esl_opt_GetString(go, "--hmmprime"); // query HMM file
+  if( hmmfile )
+  {
+      status = p7_hmmfile_Open(hmmfile, NULL, &hfp);
+      if      (status == eslENOTFOUND) p7_Fail("Failed to open hmm file %s for reading.\n",                      hmmfile);
+      else if (status == eslEFORMAT)   p7_Fail("Unrecognized format, trying to open hmm file %s for reading.\n", hmmfile);
+      else if (status != eslOK)        p7_Fail("Unexpected error %d in opening hmm file %s.\n", status,          hmmfile);
+  }
 #ifdef HMMER_THREADS
   /* initialize thread data */
   if (esl_opt_IsOn(go, "--cpu")) ncpus = esl_opt_GetInteger(go, "--cpu");
@@ -530,13 +543,22 @@
   while ((qstatus = esl_sqio_Read(qfp, qsq)) == eslOK)
     {
       P7_HMM          *hmm     = NULL;	     /* HMM - only needed if checkpointed        */
+      P7_HMM          *phmm    = NULL;	     /* priming HMM - only needed if primed      */
       P7_HMM         **ret_hmm = NULL;	     /* HMM - only needed if checkpointed        */
+      P7_PROFILE      *gm      = NULL;
       P7_OPROFILE     *om      = NULL;       /* optimized query profile                  */
       P7_TRACE        *qtr     = NULL;       /* faux trace for query sequence            */
       ESL_MSA         *msa     = NULL;       /* multiple alignment of included hits      */
       
       if (esl_opt_IsOn(go, "--chkhmm")) ret_hmm = &hmm;
 
+      // lkajan: read in an HMM and prime search with it - here, so that the below 'continue' does not affect this
+      if(hfp)
+      {
+	  hstatus = p7_hmmfile_Read(hfp, &abc, &phmm);
+	  if (hstatus != eslOK){ p7_Fail("Failed to read from hmm file %s.\n", hmmfile); phmm = NULL; }
+      }
+
       nquery++;
       if (qsq->n == 0) continue; /* skip zero length queries as if they aren't even present. */
 
@@ -560,6 +582,20 @@
 	    {
 	      p7_SingleBuilder(bld, qsq, info->bg, ret_hmm, &qtr, NULL, &om); /* bypass HMM - only need model */
 
+	      // lkajan: from hmmsearch.c:425
+	      if( phmm )
+	      {
+      		p7_hmm_Destroy(hmm);
+		p7_oprofile_Destroy(om);
+      		p7_profile_Destroy(gm);
+
+		hmm = p7_hmm_Clone(phmm);
+		gm = p7_profile_Create (hmm->M, abc);
+		om = p7_oprofile_Create(hmm->M, abc);
+      		p7_ProfileConfig(hmm, info->bg, gm, 100, p7_LOCAL); /* 100 is a dummy length for now; and MSVFilter requires local mode */
+      		p7_oprofile_Convert(gm, om);                  /* <om> is now p7_LOCAL, multihit */
+	      }
+
 	      prv_msa_nseq = 1;
 	    }
 	  else 
@@ -699,6 +735,8 @@
 
       esl_msa_Destroy(msa);
       p7_oprofile_Destroy(om);
+      p7_profile_Destroy(gm);
+      p7_hmm_Destroy(phmm);
       p7_trace_Destroy(qtr);
       esl_sq_Reuse(qsq);
       esl_keyhash_Reuse(kh);
@@ -732,6 +770,7 @@
   esl_keyhash_Destroy(kh);
   esl_sqfile_Close(qfp);
   esl_sqfile_Close(dbfp);
+  p7_hmmfile_Close(hfp);
   esl_sq_Destroy(qsq);  
   esl_stopwatch_Destroy(w);
   p7_builder_Destroy(bld);
-------------- next part --------------
--- a/testsuite/testsuite.sqc
+++ b/testsuite/testsuite.sqc
@@ -163,6 +163,8 @@
 1 exercise  j/--tblout          @src/jackhmmer@  --tblout    %PHMMER.tbl%  !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
 1 exercise  j/--domtblout       @src/jackhmmer@  --domtblout %PHMMER.dtbl% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
 1 exercise  j/--chkhmm          @src/jackhmmer@  --chkhmm    %PHMMER.ch%   !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
+1 exercise  j/--chkhmmskip      @src/jackhmmer@  -N 2 --chkhmm %PHMMER.ch% --chkhmmskip !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
+1 exercise  j/--hmmprime        @src/jackhmmer@  --hmmprime %PHMMER.ch-2.hmm% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
 1 exercise  j/--chkali          @src/jackhmmer@  --chkali    %PHMMER.ca%   !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
 1 exercise  j/--acc             @src/jackhmmer@  --acc                     !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
 1 exercise  j/--noali           @src/jackhmmer@  --noali                   !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
@@ -286,6 +288,7 @@
 1 exercise  dup-names             !testsuite/i10-duplicate-names.pl!    @@ !! %OUTFILES%
 1 exercise  mapali-again          !testsuite/i11-hmmalign-mapali.pl!    @@ !! %OUTFILES%
 1 exercise  delete-corruption     !testsuite/i12-delete-corruption.pl!  @@ !! %OUTFILES%
+1 exercise  jackhmmer-restart     /usr/bin/perl !testsuite/i13-jackhmmer-restart.pl!  @@ !! %OUTFILES%
 
 1 exercise  brute-itest           @src/itest_brute@  
 1 exercise  hmmpress-itest        !src/hmmpress.itest.pl! @src/hmmpress@ %MINIFAM.HMM% %TMPPFX%
--- /dev/null
+++ b/testsuite/i13-jackhmmer-restart.pl
@@ -0,0 +1,35 @@
+#!/usr/bin/perl
+
+# Usage:   ./i13-jackhmmer-restart.pl <builddir> <srcdir> <tmpfile prefix>
+# Example: ./i13-jackhmmer-restart.pl ..         ..       tmpfoo
+#
+# Laszlo Kajan <lkajan at rostlab.org>  Fri, 12 Oct 2012 23:30:37 +0200
+
+
+BEGIN {
+    $builddir  = shift;
+    $srcdir    = shift;
+    $tmppfx    = shift;
+}
+
+# Verify that we have all the executables we need for the test.
+if (! -x "$builddir/src/jackhmmer")   { die "FAIL: didn't find jackhmmer binary in $builddir/src\n";  }
+
+my $cmd = "$builddir/src/jackhmmer --notextw -N 2 -A $tmppfx.sto -o /dev/null $srcdir/tutorial/HBB_HUMAN $srcdir/tutorial/globins45.fa";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+my $cmd = "$builddir/src/jackhmmer --notextw -N 2 --chkhmm $tmppfx.ch --chkhmmskip -o /dev/null $srcdir/tutorial/HBB_HUMAN $srcdir/tutorial/globins45.fa";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+$cmd = "$builddir/src/jackhmmer --notextw -N 1 -A $tmppfx.R.sto --hmmprime $tmppfx.ch-2.hmm -o /dev/null $srcdir/tutorial/HBB_HUMAN $srcdir/tutorial/globins45.fa";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+$cmd = "sed -i -e '/^#=GF/d;' $tmppfx.sto $tmppfx.R.sto";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+$cmd = "diff -q $tmppfx.sto $tmppfx.R.sto";
+system( $cmd ) && die("FAIL ('$cmd'): files differ\n");
+
+print "ok\n";
+unlink( "$tmppfx.sto", glob("$tmppfx.ch-*"), "$tmppfx.R.sto" );
+exit 0;


More information about the Debian-med-packaging mailing list