[Debian-med-packaging] Feature Suggestion with Patch
Laszlo Kajan
lkajan at rostlab.org
Fri Oct 12 23:13:41 UTC 2012
Dear Sean!
I am Laszlo Kajan from Burkhard Rost's lab, bioinformatician, Debian maintainer.
I would like to use jackhmmer to search a smaller database to build up an HMM, and then search a bigger database with this model. The present
version of jackhmmer (that I know of, 3.0-4 in Debian) does not support this fully, because:
1: jackhmmer can not be started from a query /and/ a model, so I can not use the checkpoint model to restart it on the big database.
2: hmmsearch can not be started from a model /and/ a query, so that it would produce an alignment including the query as jackhmmer does; it also
can not do iterative search.
3: jackhmmer saves the checkpoint HMM /before/ doing another iteration, therefore in order to get the checkpoint for 3 iterations I have to run
actually 4. I would like to save those clock cycles.
I have prepared patches that solve 1 and 3 and allow me to search the big database with a model built on the smaller one, using jackhmmer
(patches attached, series: jackhmmer_chkhmmstop, jackhmmer_restartfromhmm, jackhmmer_stop_restart_test). The results obtained from the restarted
search are identical to a non-restart, when performed on the same database, i.e. 3 iterations on a database give the same results as one
iteration starting from a checkpoint HMM saved before the 3rd iteration. This make me hope that my patches - though I do not have much knowledge
about the internals of hmmer3 - are on the right path.
* Please comment on my patches and let me know if my approach is all right.
* Please consider adding the features (1 and 3) to hmmer in your next release.
If you do approve my patches (or something with similar results), then I would like to include them in the Debian release of hmmer3. However
this should only be a temporary solution, until your next release.
Please treat my feature suggestion and patches as a sign of my great appreciation of hmmer3. Thank you for releasing hmmer3 under a free
software license.
Best regards,
Laszlo Kajan
Rost Lab
Debian Maintainer
-------------- next part --------------
--- a/src/jackhmmer.c
+++ b/src/jackhmmer.c
@@ -68,6 +68,7 @@
{ "--tblout", eslARG_OUTFILE, NULL, NULL, NULL, NULL, NULL, NULL, "save parseable table of per-sequence hits to file <s>", 2 },
{ "--domtblout", eslARG_OUTFILE, NULL, NULL, NULL, NULL, NULL, NULL, "save parseable table of per-domain hits to file <s>", 2 },
{ "--chkhmm", eslARG_OUTFILE, NULL, NULL, NULL, NULL, NULL, NULL, "save HMM checkpoints to files <s>-<iteration>.hmm", 2 },
+ { "--chkhmmskip", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "skip search after saving HMM checkpoint for last iteration", 2 },
{ "--chkali", eslARG_OUTFILE, NULL, NULL, NULL, NULL, NULL, NULL, "save alignment checkpoints to files <s>-<iteration>.sto", 2 },
{ "--acc", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "prefer accessions over names in output", 2 },
{ "--noali", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "don't output alignments, so output is smaller", 2 },
@@ -260,6 +261,7 @@
if (esl_opt_IsUsed(go, "--tblout")) fprintf(ofp, "# per-seq hits tabular output: %s\n", esl_opt_GetString(go, "--tblout"));
if (esl_opt_IsUsed(go, "--domtblout")) fprintf(ofp, "# per-dom hits tabular output: %s\n", esl_opt_GetString(go, "--domtblout"));
if (esl_opt_IsUsed(go, "--chkhmm")) fprintf(ofp, "# HMM checkpoint files output: %s-<i>.hmm\n", esl_opt_GetString(go, "--chkhmm"));
+ if (esl_opt_IsUsed(go, "--chkhmmskip"))fprintf(ofp, "# skipped search after saving HMM checkpoint for last iteration\n");
if (esl_opt_IsUsed(go, "--chkali")) fprintf(ofp, "# MSA checkpoint files output: %s-<i>.sto\n", esl_opt_GetString(go, "--chkali"));
if (esl_opt_IsUsed(go, "--acc")) fprintf(ofp, "# prefer accessions over names: yes\n");
if (esl_opt_IsUsed(go, "--noali")) fprintf(ofp, "# show alignments in output: no\n");
@@ -599,6 +601,10 @@
#endif
}
+ // lkajan: chkhmmskip
+ if( !( esl_opt_GetBoolean(go, "--chkhmmskip") == TRUE && esl_opt_IsOn(go, "--chkhmm") && iteration == maxiterations ) )
+ {
+
#ifdef HMMER_THREADS
if (ncpus > 0) sstatus = thread_loop(threadObj, queue, dbfp);
else sstatus = serial_loop(info, dbfp);
@@ -619,6 +625,8 @@
sstatus, dbfp->filename);
}
+ } // lkajan: chkhmmskip
+
/* merge the results of the search results */
for (i = 1; i < infocnt; ++i)
{
-------------- next part --------------
--- a/src/jackhmmer.c
+++ b/src/jackhmmer.c
@@ -62,6 +62,7 @@
/* name type default env range toggles reqs incomp help docgroup*/
{ "-h", eslARG_NONE, FALSE, NULL, NULL, NULL, NULL, NULL, "show brief help on version and usage", 1 },
{ "-N", eslARG_INT, "5", NULL, "n>0", NULL, NULL, NULL, "set maximum number of iterations to <n>", 1 },
+ { "--hmmprime", eslARG_INFILE, NULL, NULL, NULL, NULL, NULL, NULL, "HMM checkpoint to prime search with", 1 },
/* Control of output */
{ "-o", eslARG_OUTFILE, NULL, NULL, NULL, NULL, NULL, NULL, "direct output to file <f>, not stdout", 2 },
{ "-A", eslARG_OUTFILE, NULL, NULL, NULL, NULL, NULL, NULL, "save multiple alignment of hits to file <s>", 2 },
@@ -256,6 +257,7 @@
fprintf(ofp, "# query sequence file: %s\n", qfile);
fprintf(ofp, "# target sequence database: %s\n", dbfile);
if (esl_opt_IsUsed(go, "-N")) fprintf(ofp, "# maximum iterations set to: %d\n", esl_opt_GetInteger(go, "-N"));
+ if (esl_opt_IsUsed(go, "--hmmprime")) fprintf(ofp, "# priming HMM checkpoint: %s\n", esl_opt_GetString(go, "--hmmprime"));
if (esl_opt_IsUsed(go, "-o")) fprintf(ofp, "# output directed to file: %s\n", esl_opt_GetString(go, "-o"));
if (esl_opt_IsUsed(go, "-A")) fprintf(ofp, "# MSA of hits saved to file: %s\n", esl_opt_GetString(go, "-A"));
if (esl_opt_IsUsed(go, "--tblout")) fprintf(ofp, "# per-seq hits tabular output: %s\n", esl_opt_GetString(go, "--tblout"));
@@ -401,6 +403,7 @@
int qformat = eslSQFILE_UNKNOWN; /* format of qfile */
int dbformat = eslSQFILE_UNKNOWN; /* format of dbfile */
ESL_SQFILE *qfp = NULL; /* open qfile */
+ P7_HMMFILE *hfp = NULL; /* open input HMM file */
ESL_SQFILE *dbfp = NULL; /* open dbfile */
ESL_ALPHABET *abc = NULL; /* sequence alphabet */
P7_BUILDER *bld = NULL; /* HMM construction configuration */
@@ -414,6 +417,7 @@
int nnew_targets;
int prv_msa_nseq;
int status = eslOK;
+ int hstatus = eslOK;
int qstatus = eslOK;
int sstatus = eslOK;
@@ -480,6 +484,15 @@
else if (status != eslOK) esl_fatal ("Unexpected error %d opening sequence file %s\n", status, cfg->qfile);
qsq = esl_sq_CreateDigital(abc);
+ // lkajan: open HMM file
+ char *hmmfile = esl_opt_GetString(go, "--hmmprime"); // query HMM file
+ if( hmmfile )
+ {
+ status = p7_hmmfile_Open(hmmfile, NULL, &hfp);
+ if (status == eslENOTFOUND) p7_Fail("Failed to open hmm file %s for reading.\n", hmmfile);
+ else if (status == eslEFORMAT) p7_Fail("Unrecognized format, trying to open hmm file %s for reading.\n", hmmfile);
+ else if (status != eslOK) p7_Fail("Unexpected error %d in opening hmm file %s.\n", status, hmmfile);
+ }
#ifdef HMMER_THREADS
/* initialize thread data */
if (esl_opt_IsOn(go, "--cpu")) ncpus = esl_opt_GetInteger(go, "--cpu");
@@ -530,13 +543,22 @@
while ((qstatus = esl_sqio_Read(qfp, qsq)) == eslOK)
{
P7_HMM *hmm = NULL; /* HMM - only needed if checkpointed */
+ P7_HMM *phmm = NULL; /* priming HMM - only needed if primed */
P7_HMM **ret_hmm = NULL; /* HMM - only needed if checkpointed */
+ P7_PROFILE *gm = NULL;
P7_OPROFILE *om = NULL; /* optimized query profile */
P7_TRACE *qtr = NULL; /* faux trace for query sequence */
ESL_MSA *msa = NULL; /* multiple alignment of included hits */
if (esl_opt_IsOn(go, "--chkhmm")) ret_hmm = &hmm;
+ // lkajan: read in an HMM and prime search with it - here, so that the below 'continue' does not affect this
+ if(hfp)
+ {
+ hstatus = p7_hmmfile_Read(hfp, &abc, &phmm);
+ if (hstatus != eslOK){ p7_Fail("Failed to read from hmm file %s.\n", hmmfile); phmm = NULL; }
+ }
+
nquery++;
if (qsq->n == 0) continue; /* skip zero length queries as if they aren't even present. */
@@ -560,6 +582,20 @@
{
p7_SingleBuilder(bld, qsq, info->bg, ret_hmm, &qtr, NULL, &om); /* bypass HMM - only need model */
+ // lkajan: from hmmsearch.c:425
+ if( phmm )
+ {
+ p7_hmm_Destroy(hmm);
+ p7_oprofile_Destroy(om);
+ p7_profile_Destroy(gm);
+
+ hmm = p7_hmm_Clone(phmm);
+ gm = p7_profile_Create (hmm->M, abc);
+ om = p7_oprofile_Create(hmm->M, abc);
+ p7_ProfileConfig(hmm, info->bg, gm, 100, p7_LOCAL); /* 100 is a dummy length for now; and MSVFilter requires local mode */
+ p7_oprofile_Convert(gm, om); /* <om> is now p7_LOCAL, multihit */
+ }
+
prv_msa_nseq = 1;
}
else
@@ -699,6 +735,8 @@
esl_msa_Destroy(msa);
p7_oprofile_Destroy(om);
+ p7_profile_Destroy(gm);
+ p7_hmm_Destroy(phmm);
p7_trace_Destroy(qtr);
esl_sq_Reuse(qsq);
esl_keyhash_Reuse(kh);
@@ -732,6 +770,7 @@
esl_keyhash_Destroy(kh);
esl_sqfile_Close(qfp);
esl_sqfile_Close(dbfp);
+ p7_hmmfile_Close(hfp);
esl_sq_Destroy(qsq);
esl_stopwatch_Destroy(w);
p7_builder_Destroy(bld);
-------------- next part --------------
--- a/testsuite/testsuite.sqc
+++ b/testsuite/testsuite.sqc
@@ -163,6 +163,8 @@
1 exercise j/--tblout @src/jackhmmer@ --tblout %PHMMER.tbl% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
1 exercise j/--domtblout @src/jackhmmer@ --domtblout %PHMMER.dtbl% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
1 exercise j/--chkhmm @src/jackhmmer@ --chkhmm %PHMMER.ch% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
+1 exercise j/--chkhmmskip @src/jackhmmer@ -N 2 --chkhmm %PHMMER.ch% --chkhmmskip !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
+1 exercise j/--hmmprime @src/jackhmmer@ --hmmprime %PHMMER.ch-2.hmm% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
1 exercise j/--chkali @src/jackhmmer@ --chkali %PHMMER.ca% !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
1 exercise j/--acc @src/jackhmmer@ --acc !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
1 exercise j/--noali @src/jackhmmer@ --noali !tutorial/HBB_HUMAN! !tutorial/globins45.fa!
@@ -286,6 +288,7 @@
1 exercise dup-names !testsuite/i10-duplicate-names.pl! @@ !! %OUTFILES%
1 exercise mapali-again !testsuite/i11-hmmalign-mapali.pl! @@ !! %OUTFILES%
1 exercise delete-corruption !testsuite/i12-delete-corruption.pl! @@ !! %OUTFILES%
+1 exercise jackhmmer-restart /usr/bin/perl !testsuite/i13-jackhmmer-restart.pl! @@ !! %OUTFILES%
1 exercise brute-itest @src/itest_brute@
1 exercise hmmpress-itest !src/hmmpress.itest.pl! @src/hmmpress@ %MINIFAM.HMM% %TMPPFX%
--- /dev/null
+++ b/testsuite/i13-jackhmmer-restart.pl
@@ -0,0 +1,35 @@
+#!/usr/bin/perl
+
+# Usage: ./i13-jackhmmer-restart.pl <builddir> <srcdir> <tmpfile prefix>
+# Example: ./i13-jackhmmer-restart.pl .. .. tmpfoo
+#
+# Laszlo Kajan <lkajan at rostlab.org> Fri, 12 Oct 2012 23:30:37 +0200
+
+
+BEGIN {
+ $builddir = shift;
+ $srcdir = shift;
+ $tmppfx = shift;
+}
+
+# Verify that we have all the executables we need for the test.
+if (! -x "$builddir/src/jackhmmer") { die "FAIL: didn't find jackhmmer binary in $builddir/src\n"; }
+
+my $cmd = "$builddir/src/jackhmmer --notextw -N 2 -A $tmppfx.sto -o /dev/null $srcdir/tutorial/HBB_HUMAN $srcdir/tutorial/globins45.fa";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+my $cmd = "$builddir/src/jackhmmer --notextw -N 2 --chkhmm $tmppfx.ch --chkhmmskip -o /dev/null $srcdir/tutorial/HBB_HUMAN $srcdir/tutorial/globins45.fa";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+$cmd = "$builddir/src/jackhmmer --notextw -N 1 -A $tmppfx.R.sto --hmmprime $tmppfx.ch-2.hmm -o /dev/null $srcdir/tutorial/HBB_HUMAN $srcdir/tutorial/globins45.fa";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+$cmd = "sed -i -e '/^#=GF/d;' $tmppfx.sto $tmppfx.R.sto";
+system( $cmd ) && die("FAIL: failed in call '$cmd'\n");
+
+$cmd = "diff -q $tmppfx.sto $tmppfx.R.sto";
+system( $cmd ) && die("FAIL ('$cmd'): files differ\n");
+
+print "ok\n";
+unlink( "$tmppfx.sto", glob("$tmppfx.ch-*"), "$tmppfx.R.sto" );
+exit 0;
More information about the Debian-med-packaging
mailing list