[med-svn] [prokka] 01/02: Imported Upstream version 1.11+dfsg
Andreas Tille
tille at debian.org
Thu Apr 30 10:09:14 UTC 2015
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository prokka.
commit 7a81aea273394333d593a4e52fff4d8b8f35f45a
Author: Andreas Tille <tille at debian.org>
Date: Thu Apr 30 12:04:46 2015 +0200
Imported Upstream version 1.11+dfsg
---
.gitignore | 24 +++
README.md | 374 +++++++++++++++++++++++++++++++++
bin/prokka | 179 +++++++++++-----
bin/prokka-genbank_to_fasta_db | 62 ++++--
bin/prokka-uniprot_to_fasta_db | 9 +-
db/hmm/CLUSTERS.hmm | Bin 375777699 -> 0 bytes
db/hmm/Pfam.hmm | Bin 327312167 -> 0 bytes
doc/ChangeLog.txt | 19 ++
doc/ToDoList.txt | 6 +-
doc/prokka-manual.html | 215 -------------------
doc/prokka-manual.txt | 454 -----------------------------------------
11 files changed, 597 insertions(+), 745 deletions(-)
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..2c8a08d
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,24 @@
+tmp/
+blib/
+.build/
+_build/
+cover_db/
+inc/
+Build
+!Build/
+Build.bat
+.last_cover_stats
+Makefile
+Makefile.old
+MANIFEST.bak
+META.yml
+MYMETA.yml
+nytprof.out
+pm_to_blib
+doc/prokka-*
+bug/
+db/cm/*.i1*
+db/kingdom/*/sprot.p*
+db/hmm/*.h3?
+db/genus/*.p*
+
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9373bbb
--- /dev/null
+++ b/README.md
@@ -0,0 +1,374 @@
+# Prokka: rapid prokaryotic genome annotation
+
+Torsten Seemann <torsten.seemann at monash.edu>
+Victorian Bioinformatics Consortium, AUSTRALIA <http://vicbioinformatics.com>
+
+##Contents
+
+- [Introduction](#introduction)
+- [Installation](#installation)
+- [Invoking Prokka](#invoking-prokka)
+- [Output Files](#output-files)
+- [Command line options](#command-line-options)
+- [Dependencies](#dependencies)
+- [Databases](#databases)
+- [FAQ](#faq)
+- [Still To Do](#still-to-do)
+- [Changes](#changes)
+- [Citation](#citation)
+
+##Introduction
+
+Whole genome annotation is the process of identifying features of interest
+in a set of genomic DNA sequences, and labelling them with useful
+information. Prokka is a software tool to annotate bacterial, archaeal and
+viral genomes quickly and produce standards-compliant output files.
+
+##Installation
+
+###Download
+
+Download the latest `prokka-1.xx.tar.gz` archive from http://www.bioinformatics.net.au/software.prokka.shtml
+
+###Extract
+
+Choose somewhere to put it, for example in your home directory (no root access required):
+
+ % cd $HOME
+ % tar zxvf prokka-1.xx.tar.gz
+ % ls prokka-1.xx
+
+###Add to PATH
+
+Add the following line to your `$HOME/.bashrc` file,
+or to `/etc/profile.d/prokka.sh` to make it available to all users:
+
+ export PATH=$PATH:$HOME/prokka-1.xx/bin
+
+###Index the sequence databases
+
+ % prokka --setupdb
+
+###Install dependencies
+
+Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have.
+
+You can consult the list of dependencies later in this document.
+
+###Choose a rRNA predictor
+
+####Option 1 - Don't use one
+
+If Prokka can't find a predictor for rRNA featues (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway,
+
+####Option 2 - Barrnap
+
+This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from http://www.vicbioinformatics.com/software.barrnap.shtml
+
+####Option 3 - RNAmmer
+
+RNAmmer was written when HMMER 2.x was the latest release. Since them, HMMER 3.x has been released, and uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit your RNAmmer script to explicitly point your HMMER2 binary instead of using the HMMER3 binary which is more likely to be in your PATH first.
+
+Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:
+
+ if ( $uname eq "Linux" ) {
+ # $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
+ $HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
+ }
+
+If you are using Mac OS X, you'll also have to change the `"Linux"` to `"Darwin"` too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.
+
+###Test
+
+* Type `prokka` and it should output it's help screen.
+* Type `prokka --version` and you should see an output like `prokka 1.x`
+* Type `prokka --listdb` and it will show you what databases it has installed to use.
+
+
+##Invoking Prokka
+
+###Beginner
+
+ # Vanilla (but with free toppings)
+ % prokka contigs.fa
+
+ # Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
+ % cat PROKKA_yyyymmdd/*.txt
+
+###Moderate
+
+ # Choose the names of the output files
+ % prokka --outdir mydir --prefix mygenome contigs.fa
+
+ # Visualize it in Artemis
+ % art mydir/mygenome.gff
+
+###Expert
+
+ # It's not just for bacteria, people
+ % prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
+
+ # Search for my favourite gene
+ % exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less
+
+###Wizard
+
+ # Watch and learn
+ % prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
+
+ # Check to see if anything went really wrong
+ % less mydir/EHEC_06072012.err
+
+ # Add final details using Sequin
+ % sequin mydir/EHEC_0607201.sqn
+
+###Genbank submitter
+
+ # Register your BioProject and your locus_tag prefix first!
+ % prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
+
+ # Check to see if anything went really wrong
+ % less PRJNA123456/EHEC-Chr1.err
+
+ # Add final details using Sequin
+ % sequin PRJNA123456/EHEC-Chr1.sqn
+
+###Crazy Person
+
+ # No stinking Perl script is going to control me
+ % prokka \
+ --outdir $HOME/genomes/Ec_POO247 --force \
+ --prefix Ec_POO247 --addgenes --locustag ECPOOp \
+ --increment 10 --gffver 2 --centre CDC --compliant \
+ --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
+ --kingdom Bacteria --gcode 11 --usegenus \
+ --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
+ --evalue 1e-9 --rfam \
+ plasmid-closed.fna
+
+
+##Output Files
+
+| Extension | Description |
+| --------- | ----------- |
+| .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
+| .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
+| .fna | Nucleotide FASTA file of the input contig sequences. |
+| .faa | Protein FASTA file of the translated CDS sequences. |
+| .ffn | Nucleotide FASTA file of all the annotated sequences, not just CDS. |
+| .sqn | An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
+| .fsa | Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
+| .tbl | Feature Table file, used by "tbl2asn" to create the .sqn file. |
+| .err | Unacceptable annotations - the NCBI discrepancy report. |
+| .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
+| .txt | Statistics relating to the annotated features found. |
+
+##Command line options
+
+ General:
+ --help This help
+ --version Print version and exit
+ --docs Show full manual/documentation
+ --citation Print citation for referencing Prokka
+ --quiet No screen output (default OFF)
+ --debug Debug mode: keep all temporary files (default OFF)
+ Setup:
+ --listdb List all configured databases
+ --setupdb Index all installed databases
+ --cleandb Remove all database indices
+ --depends List all software dependencies
+ Outputs:
+ --outdir [X] Output folder [auto] (default '')
+ --force Force overwriting existing output folder (default OFF)
+ --prefix [X] Filename output prefix [auto] (default '')
+ --addgenes Add 'gene' features for each 'CDS' feature (default OFF)
+ --locustag [X] Locus tag prefix (default 'PROKKA')
+ --increment [N] Locus tag counter increment (default '1')
+ --gffver [N] GFF version (default '3')
+ --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
+ --centre [X] Sequencing centre ID. (default '')
+ Organism details:
+ --genus [X] Genus name (default 'Genus')
+ --species [X] Species name (default 'species')
+ --strain [X] Strain name (default 'strain')
+ --plasmid [X] Plasmid name or identifier (default '')
+ Annotations:
+ --kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
+ --gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
+ --gram [X] Gram: -/neg +/pos (default '')
+ --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
+ --proteins [X] Fasta file of trusted proteins to first annotate from (default '')
+ --hmms [X] Trusted HMM to first annotate from (default '')
+ --metagenome Improve gene predictions for highly fragmented genomes (default OFF)
+ --rawproduct Do not clean up /product annotation (default OFF)
+ Computation:
+ --fast Fast mode - skip CDS /product searching (default OFF)
+ --cpus [N] Number of CPUs to use [0=all] (default '8')
+ --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
+ --evalue [n.n] Similarity e-value cut-off (default '1e-06')
+ --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
+ --norrna Don't run rRNA search (default OFF)
+ --notrna Don't run tRNA search (default OFF)
+ --rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
+
+##Dependencies
+
+* __GNU Parallel__
+A shell tool for executing jobs in parallel using one or more computers
+_O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47._
+
+* __BioPerl__
+Used for input/output of various file formats
+_Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8._
+
+* __Aragorn__
+Finds transfer RNA features (tRNA)
+_Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6._
+
+* __Barrnap__
+Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.
+_Manuscript under preparation._
+
+* __RNAmmer__
+Finds ribosomal RNA features (rRNA)
+_Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8._
+
+* __Prodigal__
+Finds protein-coding features (CDS)
+_Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119._
+
+* __SignalP__
+Finds signal peptide features in CDS (sig_peptide)
+_Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6._
+
+* __BLAST+__
+Used for similarity searching against protein sequence libraries
+_Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421._
+
+* __HMMER3__
+Used for similarity searching against protein family profiles
+_Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37._
+
+* __Infernal__
+Used for similarity searching against ncRNA family profiles
+_D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011._
+
+
+##Databases
+
+###The Core (BLAST+) Databases
+
+Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchial approach to make it fast. A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.
+
+The initial core databases are derived from UniProtKB; there is one per "kingdom" supported. To qualify for inclusion, a protein must be (1) from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries; and (3) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.
+
+####Making a Core Databases
+
+If you want to modify these core databases, the included script `prokka-uniprot_to_fasta_db`, along with the official `uniprot_sprot.dat`, can be used to generate a new database to put in `/opt/prokka/db/kingdom/`. If you add new ones, the command `prokka --listdb` will show you whether it has been detected properly.
+
+####The Genus Databases
+
+If you enable `--usegenus` and also provide a Genus via `--genus` then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka `--listdb` to see what they are.
+
+####Adding a Genus Databases
+
+If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called
+`prokka-genbank_to_fasta_db` to help. For example, if you had four annotated "Coccus" genomes, you could do the following:
+
+ % prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
+ % cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
+ % rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
+ % makeblastdb -dbtype prot -in Coccus
+ % mv Coccus.p* /path/to/prokka/db/genus/
+
+###The HMM Databases
+
+Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in `/opt/prokka/db/hmm`. Type `prokka --listdb` to confirm they are recognised.
+
+###FASTA database format
+
+Prokka understands two annotation tag formats, a plain one and a detailed one.
+
+The plain one is a standard FASTA-like line with the ID after the `>` sign, and the protein `/product`
+after the ID (the "description" part of the line):
+
+ >SeqID product
+
+The detailed one consists of a special encoded three-part description line. The parts are the `/EC_number`,
+the `/gene` code, then the `/product` - and they are separated by a special "~~~" sequence:
+
+ >SeqID EC_number~~~gene~~~product
+
+Here are some examples. Note that not all parts need to be present, but the "~~~" should still be there:
+
+ >YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
+ MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
+ DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
+ >YP_492697.1 ~~~traB~~~transfer complex protein TraB
+ MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
+ LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
+ >YP_492694.1 ~~~~~~transposase
+ MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
+ QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
+
+The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:
+
+ NAME PRK00001
+ ACC PRK00001
+ DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
+ LENG 284
+
+
+##FAQ
+
+* __Where does the name "Prokka" come from?__
+Prokka is a contraction of "prokaryotic annotation". It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.
+
+* __Can I annotate by eukaryote genome with Prokka?__
+No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
+
+* __Why does Prokka keeps on crashing when it gets to tge "tbl2asn" stage?__
+It seems that the tbl2asn program from NCBI "expires" after 12 months, and refuses to run. Unfortunately you need to install a newer version which you can download from here.
+
+* __The hmmscan step seems to hang and do nothing?__
+The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the `--gnu` option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change `parallel` to `parallel --gnu`.
+
+* __Why does prokka fail when it gets to hmmscan?__
+Unfortunately HMMER keeps changing it's database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to "re-press" the files. This can be done as follows:
+ cd /path/to/prokka/db/hmm
+ mkdir new
+ for D in *.hmm ; do hmmconvert $D > new/$D ; done
+ cd new
+ for D in *.hmm ; do hmmpress $D ; done
+ mv * ..
+ rmdir new
+
+* __Why does Prokka take so long to download?__
+Our server is in Australia, and the international pipes aren't always flowing as well as we'd like. I try to put it on GoogleDrive. Dropbox is no longer possible due to bandwidth quotas. If you are able to mirror Prokka (~2 GB) outside please let me know.
+
+* __Why can't I load Prokka .GBK files into Mauve?__
+Mauve uses BioJava to parse GenBank files, and it is very picky about Genbank files.
+It does not like long contig names,
+like those from Velvet or Spades. One solution is to use `--centre XXX`
+in Prokka and it will rename all your contigs to be NCBI (and Mauve)
+compliant. It does not like the ACCESSION and VERSION strings that Prokka
+produces via the "tbl2asn" tool. The following Unix command will fix them:
+`egrep -v '^(ACCESSION|VERSION)' prokka.gbk > mauve.gbk`
+
+
+##Still To Do
+
+* ToDoList.txt: https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt
+
+
+##Changes
+
+* ChangeLog.txt: https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt
+* Github commits: https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master
+
+##Citation
+
+Seemann T.
+*Prokka: rapid prokaryotic genome annotation*
+**Bioinformatics** 2014 Jul 15;30(14):2068-9.
+[PMID:24642063](http://www.ncbi.nlm.nih.gov/pubmed/24642063)
diff --git a/bin/prokka b/bin/prokka
index 61d1a6e..ff9a799 100755
--- a/bin/prokka
+++ b/bin/prokka
@@ -32,6 +32,7 @@ use Bio::SearchIO;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Tools::GFF;
+use Bio::Tools::GuessSeqFormat;
use FindBin;
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
@@ -41,13 +42,13 @@ my @CMDLINE = @ARGV;
my $OPSYS = $^O;
my $BINDIR = "$FindBin::RealBin/../binaries/$OPSYS";
my $EXE = $FindBin::RealScript;
-my $VERSION = "1.10";
+my $VERSION = "1.11";
my $AUTHOR = 'Torsten Seemann <torsten.seemann at monash.edu>';
my $URL = 'http://www.vicbioinformatics.com';
my $DBDIR = "$FindBin::RealBin/../db";
my $HYPO = 'hypothetical protein';
my $UNANN = 'unannotated protein';
-my $MAXCONTIGIDLEN = 38; # Genbank rule
+my $MAXCONTIGIDLEN = 20; # Genbank rule
# these should accept .faa on STDIN and write report to STDOUT
my $BLASTPCMD = "blastp -query - -db %d -evalue %e -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no";
@@ -57,13 +58,14 @@ my $barrnap_mode = 'bac';
my $aragorn_opt = '';
# debian package broke compatibility so have to force it now *grumble*
-my $PARALLELCMD = "parallel --gnu";
+my $PARALLELCMD = "parallel --gnu --plain";
# not used anymore
#my $INFERNALCMD = "cmscan --noali --notextw --acc -E %e --cpu 1 -o %o %d %i 2>/dev/null";
my $starttime = localtime;
my %seq;
+my @seq;
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# table of tools we need/optional and min versions
@@ -164,6 +166,8 @@ my %tools = (
'egrep' => { NEEDED=>1 },
'sed' => { NEEDED=>1 },
'find' => { NEEDED=>1 },
+ # for the new --proteins option ability to take .gbk or .embl files
+ 'prokka-genbank_to_fasta_db' => { NEEDED=>1 },
);
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
@@ -216,10 +220,10 @@ if (-d $BINDIR) {
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# command line options
-my(@Options, $quiet, $kingdom, $fast, $force, $outdir, $prefix, $cpus, $addgenes,
+my(@Options, $quiet, $debug, $kingdom, $fast, $force, $outdir, $prefix, $cpus, $addgenes,
$gcode, $gram, $gffver, $locustag, $increment, $mincontiglen, $evalue, $coverage,
$genus, $species, $strain, $plasmid,
- $usegenus, $proteins, $centre, $scaffolds,
+ $usegenus, $proteins, $hmms, $centre, $scaffolds,
$rfam, $norrna, $notrna, $rnammer, $rawproduct,
$metagenome, $compliant, $listdb, $citation);
setOptions();
@@ -233,6 +237,7 @@ msg("Victorian Bioinformatics Consortium - $URL");
msg("Local time is $starttime");
msg("You are", $ENV{USER} || 'not telling me who you are!');
msg("Operating system is $OPSYS");
+msg("You have enabled DEBUG mode. Temporary files will NOT be deleted.") if $debug;
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# check BioPerl version
@@ -243,6 +248,12 @@ msg("You have BioPerl $bpver");
err("Please install BioPerl $minbpver or higher") if $bpver < $minbpver;
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
+# Check some incompatible options
+
+$fast && $proteins and err("In --fast mode, the --proteins will not be used!");
+$fast && $hmms and err("In --fast mode, the --hmms will not be used!");
+
+# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# Determine CPU cores available
my $num_cores = num_cpu();
@@ -313,7 +324,7 @@ if ($compliant) {
}
#$centre or err("You must set --centre or the NCBI tbl2asn tool won't work properly, sorry.");
-($gcode < 1 or $gcode > 24) and err("Invalid genetic code, must be 1..24");
+($gcode < 1 or $gcode > 25) and err("Invalid genetic code, must be 1..25");
$evalue >= 0 or err("Invalid --evalue, must be >= 0");
#($coverage >= 0 and $coverage <= 100) or err("Invalid --coverage, must be 0..100");
$increment >= 1 or err("Invalid --increment, must be >= 1");
@@ -393,8 +404,11 @@ my $ncontig = 0;
my $contigprefix = $locustag || $prefix || $outdir || $strain || '';
$contigprefix .= '_' if $contigprefix;
my $contig_name_len = length($centre) + 1 + length($contigprefix) + 6;
-if ($compliant and $contig_name_len > $MAXCONTIGIDLEN) {
- err("Genbank contig IDs are $contig_name_len chars, must be <= $MAXCONTIGIDLEN. Prefix is: $contigprefix");
+if ($compliant) {
+ my $contig_name_len = length($centre) + 1 + length($contigprefix) + 6;
+ if ($contig_name_len > $MAXCONTIGIDLEN) {
+ err("Genbank contig IDs are $contig_name_len chars, must be <= $MAXCONTIGIDLEN. Prefix is: $contigprefix");
+ }
}
while (my $seq = $fin->next_seq) {
@@ -409,7 +423,8 @@ while (my $seq = $fin->next_seq) {
$seq->id( sprintf "gnl|$centre|${contigprefix}contig%06d", $ncontig );
}
if (length($seq->id) > $MAXCONTIGIDLEN) {
- msg("WARNING: Contig IDs must be less than 38 characters for Genbank compliance - ".$seq->id)
+ msg("Contig ID must <= $MAXCONTIGIDLEN chars long: ".$seq->id);
+ err("Please rename your contigs or use --centre XXX to generate clean contig names.");
}
my $s = $seq->seq;
$s = uc($s);
@@ -422,10 +437,10 @@ while (my $seq = $fin->next_seq) {
err("Uh oh! Sequence file '$in' contains duplicate sequence ID:", $seq->id);
}
$seq{ $seq->id }{DNA} = $seq;
+ push @seq, $seq->id; # this array it used to preserve original contig order
}
$ncontig > 0 or err("FASTA file '$in' contains no suitable sequence entries");
msg("Wrote $ncontig contigs");
-#msg(sort keys %seq); exit;
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# tRNA + tmRNA
@@ -568,7 +583,9 @@ if ($rfam) {
my $num_ncrna = 0;
my $tool = "Infernal:".$tools{'cmscan'}->{VERSION};
my $icpu = $cpus || 1;
- open INFERNAL, '-|', "cmscan --cpu $icpu -E $evalue --tblout /dev/stdout -o /dev/null --noali $cmdb \Q$outdir/$prefix.fna\E";
+ my $cmd = "cmscan --cpu $icpu -E $evalue --tblout /dev/stdout -o /dev/null --noali $cmdb \Q$outdir/$prefix.fna\E";
+ msg("Running: $cmd");
+ open INFERNAL, '-|', $cmd;
while (<INFERNAL>) {
next if m/^#/; # ignore comments
my @x = split ' '; # magic Perl whitespace splitter
@@ -608,7 +625,7 @@ else {
# Tally all the RNA features __ which we want to exclude overlaps with CDS __
my @allrna;
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
push @allrna, (grep { $_->primary_tag =~ m/[tr]RNA/ } @{ $seq{$sid}{FEATURE} });
}
msg("Total of", scalar(@allrna), "tRNA + rRNA features");
@@ -642,7 +659,7 @@ if ($tools{'minced'}->{HAVE}) {
# CDS
msg("Predicting coding sequences");
-my $totalbp = sum( map { $seq{$_}{DNA}->length } keys %seq);
+my $totalbp = sum( map { $seq{$_}{DNA}->length } @seq);
my $prodigal_mode = ($totalbp >= 100000 && !$metagenome) ? 'single' : 'meta';
msg("Contigs total $totalbp bp, so using $prodigal_mode mode");
my $num_cds=0;
@@ -701,7 +718,7 @@ msg("Found $num_cds CDS");
# Connect features to their parent sequences
msg("Connecting features back to sequences");
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
for my $f (@{ $seq{$sid}{FEATURE} }) {
$f->attach_seq( $seq{$sid}{DNA} );
}
@@ -723,11 +740,11 @@ if ($tools{signalp}->{HAVE}) {
my $spout = Bio::SeqIO->new(-fh=>$spoutfh, -format=>'fasta');
my %cds;
my $count=0;
- for my $sid (sort keys %seq) {
+ for my $sid (@seq) {
for my $f (@{ $seq{$sid}{FEATURE} }) {
next unless $f->primary_tag eq 'CDS';
$cds{++$count} = $f;
- my $seq = $f->seq->translate;
+ my $seq = $f->seq->translate(-codontable_id=>$gcode, -complete=>1);
$seq->display_id($count);
$spout->write_seq($seq);
}
@@ -747,7 +764,8 @@ if ($tools{signalp}->{HAVE}) {
my $prob = $x[5];
my $cleave = $x[3];
my $start = $parent->strand > 0 ? $parent->start : $parent->end;
- my $end = $start + $parent->strand * ($cleave - 1);
+ # need to convert to DNA coordinates
+ my $end = $start + $parent->strand * ($cleave*3 - 1);
my $sigpep = Bio::SeqFeature::Generic->new(
-seq_id => $parent->seq_id,
-source_tag => $tool,
@@ -773,7 +791,8 @@ if ($tools{signalp}->{HAVE}) {
my $parent = $cds{ $x[0] };
my $cleave = $x[2];
my $start = $parent->strand > 0 ? $parent->start : $parent->end;
- my $end = $start + $parent->strand * ($cleave - 1);
+ # need to convert to DNA coordinates
+ my $end = $start + $parent->strand * ($cleave*3 - 1);
my $sigpep = Bio::SeqFeature::Generic->new(
-seq_id => $parent->seq_id,
-source_tag => $tool,
@@ -818,14 +837,20 @@ my @database = (
# secondary sources are a series of HMMs
unless ($kingdom eq 'Viruses') {
- for my $name (qw(HAMAP CLUSTERS Pfam)) {
- push @database, {
- DB => "$DBDIR/hmm/$name.hmm",
- SRC => "protein motif:$name:",
- FMT => 'hmmer3',
- CMD => $HMMER3CMD,
- VERSION => 3, # without this, latest Bioperl goes into infinite loop
- };
+ for my $name ( hmms() ) {
+ my $dbfile = "$DBDIR/hmm/$name.hmm";
+ if (-r $dbfile) {
+ push @database, {
+ DB => $dbfile,
+ SRC => "protein motif:$name:",
+ FMT => 'hmmer3',
+ CMD => $HMMER3CMD,
+ VERSION => 3, # without this, latest Bioperl goes into infinite loop
+ }
+ }
+ else {
+ msg("Will not use $name HMM database, $dbfile not installed.");
+ }
}
}
@@ -850,10 +875,39 @@ else {
msg("Not using genus-specific database. Try --usegenus to enable it.");
}
+# if user supplies a trusted set of HMMs, we try these first!
+if (-r $hmms) {
+ msg("Preparing user-supplied primary HMMER annotation source: $hmms");
+ -r "$hmms.h3i" or err("Your HMM is not indexed, please run: hmmpress $hmms");
+ my $src = $hmms;
+ $src =~ s{^.*/}{};
+ $src =~ s/.hmm$//;
+ msg("Using /inference source as '$src'");
+ unshift @database, {
+ DB => $hmms,
+ SRC => "protein motif:$src:",
+ FMT => 'hmmer3',
+ CMD => $HMMER3CMD,
+ VERSION => 3, # without this, latest Bioperl goes into infinite loop
+ };
+}
+
# if user supplies a trusted set of proteins, we try these first!
if (-r $proteins) {
- msg("Preparing user-supplied primary annotation source: $proteins");
- runcmd("makeblastdb -dbtype prot -in \Q$proteins\E -out \Q$outdir/proteins\E -logfile /dev/null");
+ msg("Preparing user-supplied primary BLAST annotation source: $proteins");
+ my $faa_file = $proteins;
+ my $format = Bio::Tools::GuessSeqFormat->new(-file=>$proteins)->guess
+ or err("could not determine format of --proteins file '$proteins'");
+ msg("Guessed source was in $format format.");
+ if ($format =~ m/^(embl|genbank)$/) {
+ $faa_file = "$outdir/proteins.faa";
+ msg("Converting $format '$proteins' into Prokka FASTA '$faa_file'");
+ runcmd("prokka-genbank_to_fasta_db --format $format \Q$proteins\E > \Q$faa_file\E 2> /dev/null");
+ }
+ elsif ($format ne 'fasta') {
+ err("Option --proteins only supports FASTA, GenBank, and EMBL formats.");
+ }
+ runcmd("makeblastdb -dbtype prot -in \Q$faa_file\E -out \Q$outdir/proteins\E -logfile /dev/null");
my $src = $proteins;
$src =~ s{^.*/}{};
msg("Using /inference source as '$src'");
@@ -877,18 +931,23 @@ else {
for my $db (@database) {
+ # create a unqiue output name so we can save them in --debug mode
+ my $outname = $db->{DB};
+ $outname =~ s{^.*/}{};
+
# we write out all the CDS which haven't been annotated yet and then search them
- my $faa_name = "$outdir/proteins.faa";
+ my $faa_name = "$outdir/$outname.faa";
open my $faa, '>', $faa_name;
my %cds;
my $count=0;
- for my $sid (sort keys %seq) {
+ for my $sid (@seq) {
for my $f (@{ $seq{$sid}{FEATURE} }) {
next unless $f->primary_tag eq 'CDS';
next if $f->has_tag('product');
$cds{++$count} = $f;
- print $faa ">$count\n",$f->seq->translate->seq,"\n";
+ print $faa ">$count\n",
+ $f->seq->translate(-codontable_id=>$gcode, -complete=>1)->seq,"\n";
}
}
close $faa;
@@ -910,7 +969,7 @@ else {
my $bsize = int($faa_bytes / $cpus / 2); # div 2 to allow for slow vs fast subtasks?
my $paropts = $cpus > 0 ? " -j $cpus" : "";
- my $bls_name = "$outdir/proteins.bls";
+ my $bls_name = "$outdir/$outname.".$db->{FMT};
runcmd(
"cat \Q$faa_name\E | ${PARALLELCMD}$paropts --block $bsize --recstart '>' --pipe".
" $cmd > \Q$bls_name\E 2> /dev/null"
@@ -944,7 +1003,7 @@ else {
$cds{$pid}->add_tag_value('inference', $db->{SRC}.$hit->name);
}
msg("Cleaned $num_cleaned /product names") if $num_cleaned > 0;
- delfile( $faa_name, $bls_name);
+ delfile($faa_name, $bls_name);
}
}
@@ -957,7 +1016,7 @@ if ($proteins) {
my $empty_label = $fast ? 'unannotated protein' : $HYPO;
my $num_hypo=0;
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
for my $f ( @{ $seq{$sid}{FEATURE} }) {
if ($f->primary_tag eq 'CDS' and not $f->has_tag('product')) {
$f->add_tag_value('product', $empty_label);
@@ -970,7 +1029,7 @@ msg("Labelling remaining $num_hypo proteins as '$empty_label'") if $num_hypo > 0
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# Look for possible /pseudo genes - adjacent with same annotation
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
my $prev = '';
for my $f ( grep { $_->primary_tag eq 'CDS' } @{ $seq{$sid}{FEATURE} } ) {
my $this = TAG($f, 'product');
@@ -988,10 +1047,11 @@ for my $sid (sort keys %seq) {
my %collide;
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
for my $f ( sort { $a->start <=> $b->start } @{ $seq{$sid}{FEATURE} }) {
next unless $f->primary_tag eq 'CDS';
my $gene = TAG($f, 'gene') or next;
+ $gene =~ s/_(\d+)$//; # remove existing de-duplicated suffix
push @{ $collide{$gene} }, $f;
}
}
@@ -1007,6 +1067,7 @@ for my $gene (keys %collide) {
$n++;
$f->add_tag_value('gene', "${gene}_${n}");
}
+ msg("Fixed $n duplicate /gene -", map { $_->get_tag_values('gene') } @cds);
$num_collide++;
}
msg("Fixed $num_collide colliding /gene names.");
@@ -1016,7 +1077,7 @@ msg("Fixed $num_collide colliding /gene names.");
msg("Adding /locus_tag identifiers");
my $num_lt=0;
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
for my $f ( sort { $a->start <=> $b->start } @{ $seq{$sid}{FEATURE} }) {
next unless $f->primary_tag =~ m/CDS|RNA/;
$num_lt++;
@@ -1030,6 +1091,7 @@ for my $sid (sort keys %seq) {
if ($addgenes) {
# make a 'sister' gene feature for the CDS feature
# (ideally it would encompass the UTRs as well, but we don't know them)
+ my $gene_id = "${ID}_gene";
my $g = Bio::SeqFeature::Generic->new(
-primary => 'gene',
-seq_id => $f->seq_id,
@@ -1037,8 +1099,14 @@ for my $sid (sort keys %seq) {
-end => $f->end,
-strand => $f->strand,
-source_tag => $EXE,
- -tag => { 'locus_tag'=> $ID },
+ -tag => {
+ 'locus_tag' => $ID,
+ 'ID' => $gene_id, # Add suffix to ID for GFF output
+ },
);
+ # Make a Parent tag from the CDS to the gene
+ $f->add_tag_value('Parent', $gene_id);
+ # copy the /gene from the CDS
if (my $gENE = TAG($f, 'gene')) {
$g->add_tag_value('gene', $gENE);
}
@@ -1060,14 +1128,14 @@ my $fsa_fh = Bio::SeqIO->new(-file=>">$outdir/$prefix.fsa", -format=>'fasta');
my $gff_factory = Bio::Tools::GFF->new(-gff_version=>$gffver);
print $gff_fh "##gff-version $gffver\n";
-for my $id (sort keys %seq) {
+for my $id (@seq) {
print $gff_fh "##sequence-region $id 1 ", $seq{$id}{DNA}->length, "\n";
}
my $fsa_desc = "[gcode=$gcode] [organism=$genus $species] [strain=$strain]";
$fsa_desc .= " [plasmid=$plasmid]" if $plasmid;
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
my $ctg = $seq{$sid}{DNA};
$ctg->desc($fsa_desc);
$fsa_fh->write_seq($ctg);
@@ -1096,15 +1164,15 @@ for my $sid (sort keys %seq) {
$ffn_fh->write_seq($p);
}
if ($f->primary_tag eq 'CDS') {
- $faa_fh->write_seq( $p->translate(-codontable_id=>$gcode) );
+ $faa_fh->write_seq( $p->translate(-codontable_id=>$gcode, -complete=>1) );
}
}
}
-if (scalar keys %seq) {
+if (@seq) {
print $gff_fh "##FASTA\n";
my $seqio = Bio::SeqIO->new(-fh=>$gff_fh, -format=>'fasta');
- for my $sid (sort keys %seq) {
+ for my $sid (@seq) {
$seqio->write_seq($seq{$sid}{DNA});
}
}
@@ -1117,11 +1185,11 @@ open my $txtFh, '>', "$outdir/$prefix.txt";
select $txtFh;
printf "organism: $genus $species $strain $plasmid\n";
-printf "contigs: %d\n", scalar(keys %seq);
-printf "bases: %d\n", sum( map { $seq{$_}{DNA}->length } keys %seq );
+printf "contigs: %d\n", scalar(@seq);
+printf "bases: %d\n", sum( map { $seq{$_}{DNA}->length } @seq );
my %count;
-for my $sid (sort keys %seq) {
+for my $sid (@seq) {
for my $f (@{ $seq{$sid}{FEATURE} }) {
$count{ $f->primary_tag }++;
}
@@ -1150,7 +1218,10 @@ runcmd(
);
delfile("$outdir/errorsummary.val");
delfile( map { "$outdir/$prefix.$_" } qw(dr fixedproducts ecn val) );
-move("$outdir/$prefix.gbf", "$outdir/$prefix.gbk"); # rename XXX.gbf to XXX.gbk
+
+msg("Repairing broken .GBK output that tbl2asn produces...");
+runcmd("sed 's/COORDINATES: profile/COORDINATES:profile/' < \Q$outdir/$prefix.gbf\E > \Q$outdir/$prefix.gbk\E");
+delfile("$outdir/$prefix.gbf");
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# Some final log output
@@ -1182,8 +1253,9 @@ sub cleanup_product {
$p =~ s/^arCOG\d+\s+//;
$p =~ s/\((EC|COG).*?\)//;
- $p =~ s/\s*\w+\d{4,}c?//; # remove possible locus tags
+ $p =~ s/\s+\w+\d{4,}c?//; # remove possible locus tags
$p =~ s/ and (inactivated|related) \w+//;
+ $p =~ s/,\s*family$//;
$p =~ s/^(potential|possible|probable|predicted|uncharacteri.ed)/putative/i;
if ($p =~ m/(domain|family|binding|fold|like|motif|repeat)\s*$/i and $p !~ m/,/) {
@@ -1260,8 +1332,13 @@ sub runcmd {
sub delfile {
for my $file (@_) {
- msg("Deleting unwanted file:", $file);
- unlink $file;
+ if ($debug) {
+ msg("In --debug mode, saving temporary file:", $file);
+ }
+ else {
+ msg("Deleting unwanted file:", $file);
+ unlink $file;
+ }
}
}
@@ -1396,6 +1473,7 @@ sub setOptions {
{OPT=>"docs", VAR=>\&showdoc, DESC=>"Show full manual/documentation"},
{OPT=>"citation",VAR=>\&show_citation, DESC=>"Print citation for referencing Prokka"},
{OPT=>"quiet!", VAR=>\$quiet, DEFAULT=>0, DESC=>"No screen output"},
+ {OPT=>"debug!", VAR=>\$debug, DEFAULT=>0, DESC=>"Debug mode: keep all temporary files"},
'Setup:',
{OPT=>"listdb", VAR=>\&list_db, DESC=>"List all configured databases"},
{OPT=>"setupdb", VAR=>\&setup_db, DESC=>"Index all installed databases"},
@@ -1423,6 +1501,7 @@ sub setOptions {
{OPT=>"gram=s", VAR=>\$gram, DEFAULT=>'', DESC=>"Gram: -/neg +/pos"},
{OPT=>"usegenus!", VAR=>\$usegenus, DEFAULT=>0, DESC=>"Use genus-specific BLAST databases (needs --genus)"},
{OPT=>"proteins=s", VAR=>\$proteins, DEFAULT=>'', DESC=>"Fasta file of trusted proteins to first annotate from"},
+ {OPT=>"hmms=s", VAR=>\$hmms, DEFAULT=>'', DESC=>"Trusted HMM to first annotate from"},
{OPT=>"metagenome!", VAR=>\$metagenome, DEFAULT=>0, DESC=>"Improve gene predictions for highly fragmented genomes"},
{OPT=>"rawproduct!", VAR=>\$rawproduct, DEFAULT=>0, DESC=>"Do not clean up /product annotation"},
'Computation:',
diff --git a/bin/prokka-genbank_to_fasta_db b/bin/prokka-genbank_to_fasta_db
index 59b9f44..7e61381 100755
--- a/bin/prokka-genbank_to_fasta_db
+++ b/bin/prokka-genbank_to_fasta_db
@@ -1,16 +1,37 @@
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
+use Bio::Tools::CodonTable;
use Data::Dumper;
-my(@Options, $verbose, $format, $hypo, $idtag, $sep, $blank, $pseudo, $minlen);
+# gene complement(1..1200)
+# /locus_tag="P148_SR1C00001G0001"
+# CDS complement(1..1200)
+# /locus_tag="P148_SR1C00001G0001"
+# /note="RAAC1_SR1_1_1"
+# /codon_start=1
+# /transl_table=25
+# /product="MAEBL"
+# /protein_id="AHB40819.1"
+# /db_xref="GI:563351664"
+
+my @IDTAG = ('protein_id', 'locus_tag', 'db_xref');
+
+my(@Options, $verbose, $format, $hypo, $idtag, $sep, $blank, $pseudo, $minlen, $gcode);
setOptions();
my $in = Bio::SeqIO->new(-fh=>\*ARGV, -format=>$format);
my $out = Bio::SeqIO->new(-fh=>\*STDOUT, -format=>'Fasta');
+my @TRYIDTAG = $idtag ? ($idtag) : @IDTAG;
+print STDERR "Will use first of (@TRYIDTAG) as FASTA ID\n";
+
+# support different /transl_table=XX tags in CDS
+#my $table = Bio::Tools::CodonTable->new();
+#printf STDERR "Default /transl_table is %d - %s\n", $table->id, $table->name;
+
while (my $seq = $in->next_seq) {
- print STDERR "\rParsing: ",$seq->display_id;
+ print STDERR "\rParsing: ",$seq->display_id, "\n";
my $counter = 0;
for my $f ($seq->get_SeqFeatures) {
@@ -22,44 +43,46 @@ while (my $seq = $in->next_seq) {
my $prod = TAG($f, 'product') || $blank;
- next if $prod eq 'hypothetical protein';
+ next if !$hypo and $prod eq 'hypothetical protein';
my $cds = $f->spliced_seq; # don't forget eukaryotes!
- my $id = TAG($f, $idtag) or die "feature #$counter does not have /$idtag tag!";
+ my $id;
+ for my $idtag (@TRYIDTAG) {
+ $id = TAG($f, $idtag);
+ last if $id;
+ }
+ $id or die "\nFeature #$counter does not have any of these tags: @TRYIDTAG";
$counter++;
- print STDERR "Processing: [$counter] ", $seq->display_id, " | $id | $prod\n" if $verbose;
+ print STDERR "[$counter] ", $seq->display_id, " | $id | $prod\n" if $verbose;
# HANDLE CODON START FOR FUZZY FEATURES!
if ($f->has_tag('codon_start')) {
my($cs) = $f->get_tag_values('codon_start');
if ($cs != 1) {
- print STDERR "/codon_start=$cs - trimming mRNA!";
+ print STDERR "\t/codon_start=$cs - trimming mRNA!\n";
$cds = $cds->trunc($cs, $cds->length);
}
}
#END
-
- # DNA -> AA
- $cds = $cds->translate;
+ # DNA -> AA
+ my $tt = $gcode || TAG($f, 'transl_table') || 11;
+ print STDERR "\tUsing specified /transl_table=$tt\n" if $verbose && $tt != 1;
- # remove stop codon
- my $aa = $cds->seq;
- $aa =~ s/\*$//;
- $cds->seq($aa);
+ # http://www.bioperl.org/wiki/HOWTO:Beginners#Translating
+ $cds = $cds->translate(-codontable_id => $tt, -complete => 1);
my $ec = TAG($f, 'EC_number') || $blank;
my $gene = TAG($f, 'gene') || $blank;
- $cds->desc("$ec$sep$gene$sep$prod");
+ $cds->desc( join $sep, $ec, $gene, $prod );
$cds->display_id($id);
$out->write_seq($cds);
}
- print STDERR "\n";
}
-print STDERR "\nDone\n";
+print STDERR "Done.\n";
#----------------------------------------------------------------------
@@ -79,14 +102,15 @@ sub setOptions {
use Getopt::Long;
@Options = (
- {OPT=>"help", VAR=>\&usage, DESC=>"This help"},
- {OPT=>"verbose!", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose progress"},
+ {OPT=>"help", VAR=>\&usage, DESC=>"This help"},
+ {OPT=>"verbose!", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose progress"},
{OPT=>"format=s", VAR=>\$format, DEFAULT=>'genbank', DESC=>"Input format"},
- {OPT=>"idtag=s", VAR=>\$idtag, DEFAULT=>'protein_id', DESC=>"What tag to use as Fasta ID"},
+ {OPT=>"idtag=s", VAR=>\$idtag, DEFAULT=>'', DESC=>"What tag to use as Fasta ID (default = try first of: @IDTAG)"},
{OPT=>"sep=s", VAR=>\$sep, DEFAULT=>'~~~', DESC=>"Separator between EC/gene/product" },
{OPT=>"blank=s", VAR=>\$blank, DEFAULT=>'', DESC=>"Replace empty EC/gene/product with this"},
{OPT=>"pseudo!", VAR=>\$pseudo, DEFAULT=>0, DESC=>"Include /pseudo genes"},
{OPT=>"hypo!", VAR=>\$hypo, DEFAULT=>0, DESC=>"Include 'hypothetical protein' genes"},
+ {OPT=>"gcode=i", VAR=>\$gcode, DEFAULT=>0, DESC=>"Force this genetic code for translation (otherwise /transl_table)"},
{OPT=>"minlen=i", VAR=>\$minlen, DEFAULT=>0, DESC=>"Minimum peptide length"},
);
diff --git a/bin/prokka-uniprot_to_fasta_db b/bin/prokka-uniprot_to_fasta_db
index f9c86d8..79457d2 100755
--- a/bin/prokka-uniprot_to_fasta_db
+++ b/bin/prokka-uniprot_to_fasta_db
@@ -5,7 +5,7 @@ use SWISS::KW;
#use SWISS::OC;
use Data::Dumper;
-my(@Options, $verbose, $frag, $evlev, $minlen, $sep, $blank, $term, $hypo);
+my(@Options, $verbose, $frag, $evlev, $minlen, $maxlen, $sep, $blank, $term, $hypo);
setOptions();
my $HYPO = 'hypothetical protein';
@@ -26,8 +26,10 @@ while (<ARGV>)
next if not $frag and $entry->isFragment;
next if not $frag and $entry->DEs->hasFragment;
- # Too short?
- next if $minlen > 0 and length($entry->SQs->seq) < $minlen;
+ # Too short or to long?
+ my $L = length($entry->SQs->seq);
+ next if $minlen > 0 and $L < $minlen;
+ next if $maxlen > 0 and $L > $maxlen;
# Reject on evidence level
# grep ^PE uniprot_sprot.dat | sort | uniq -c
@@ -104,6 +106,7 @@ sub setOptions {
{OPT=>"evidence=i", VAR=>\$evlev, DEFAULT=>2, DESC=>"1=prot 2=mrna 3=homol 4=pred 5=unsure"},
{OPT=>"fragments!", VAR=>\$frag, DEFAULT=>0, DESC=>"Include 'DE Flags: Fragment;' entries"},
{OPT=>"minlen=i", VAR=>\$minlen, DEFAULT=>20, DESC=>"Minimum peptide length"},
+ {OPT=>"maxlen=i", VAR=>\$maxlen, DEFAULT=>1E5, DESC=>"Minimum peptide length"},
{OPT=>"term=s", VAR=>\$term, DEFAULT=>'', DESC=>"Lineage must contain this term eg. 'Bacteria'"},
{OPT=>"hypo!", VAR=>\$hypo, DEFAULT=>0, DESC=>"Don't filter out hypothetical proteins"},
);
diff --git a/db/hmm/CLUSTERS.hmm b/db/hmm/CLUSTERS.hmm
deleted file mode 100644
index 0e4f653..0000000
Binary files a/db/hmm/CLUSTERS.hmm and /dev/null differ
diff --git a/db/hmm/Pfam.hmm b/db/hmm/Pfam.hmm
deleted file mode 100644
index 6a01710..0000000
Binary files a/db/hmm/Pfam.hmm and /dev/null differ
diff --git a/doc/ChangeLog.txt b/doc/ChangeLog.txt
index 7188975..dc9a657 100644
--- a/doc/ChangeLog.txt
+++ b/doc/ChangeLog.txt
@@ -2,6 +2,25 @@
Prokka ChangeLog
----------------
+v1.11 - 15 Feb 2015
+* Option --proteins now supports .GBK/.EMBL files directly!
+* Support for user supplied HMM via --hmms [Connor Driscoll]
+* Add Prodigal -c option when in metagenome mode [Daan Speth]
+* Wierd coordinate errors with long Spades contig names [Stephan Pabinger]
+* Replaced dodgy OSX aragorn binary [Yevgeny Nikolaichik]
+* Handle translation table 25 [Mads Albertsen]
+* Fix semantically incorrect GFF3 output [Marc Hoeppner]
+* Fix contigs with pipe characters from Quiver [Vaughn Cooper]
+* Keep original contig ordering [Jane Hawkey]
+* Fix protein translation with alternate --gcode [Andreas Leimbach]
+* Various bug fixes in prokka-* support scripts [Andreas Leimbach]
+* Workaround tbl2asn bug with COORDINATES:profile [Wiep Smits]
+* Provide MD5 checksum for website download [Willem VA Viljoen]
+* Hopefully slightly improved cleanup_product() function
+* Ensure released are tagged on github for Linux Brew [Shaun Jackman]
+* Check databases exist before searching against them
+* Updated bundled binaries
+
v1.10 - 28 July 2014
* Support for barrnap 0.4 with Archaea and Mito support [Lionel Guy]
* New legal assembly_gap feature in gbk, sqn, tbl files [Connor Skennerton]
diff --git a/doc/ToDoList.txt b/doc/ToDoList.txt
index 74724c1..49d75ab 100644
--- a/doc/ToDoList.txt
+++ b/doc/ToDoList.txt
@@ -2,23 +2,21 @@
Prokka Wishlist
---------------
+* New --first-blast option, like --proteins?
+* New --last-blast option eg. "nr" [Check who emailed me this]
* annotate possible frame-shifts
* FAQ about annotation transfer
* use included binary if PATH one is wrong version [Simon Gladman]
* reconsider !~m/a-z/ rule in cleanup_product() [Roderick Felsheim]
* all Kingdom=ALL or ANY for metagenomes [Andreas Bremges]
* make --cleanup option for /product names [Andreas Bremges]
-* check databases exist before spawning blast/hmmer
* locus_tag rules: http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf
-* Add .PTT file output [Andrew Buultjens]
* Bug: "existing RNA (repeat_region)" CRISPR is not RNA ?
-* Allow --proteins to be a .GBK or .GFF3 file ie. extract the CDS annotations
* Factor out some functions into modules!
* Check for large genomic tracts which are unannotated. Sometimes Prodigal misses big genes.
* Add an example input/output so users can check their copy is producing similar results.
* Output potential homopolymer assembly errors near CDS flanks
* Add the CLUSTERS "PHA" library to Viruses mode.
* Option to include ribosomal binding sites (RBS) as features.
-* Check input contigs for runs of Ns, and either complain, or split the file, or additionally create a .AGP scaffold file. (Mitchell Stanton-Cook has done this, need to incorporate)
* Add hyperlinks to tool references in Manual
diff --git a/doc/prokka-manual.html b/doc/prokka-manual.html
deleted file mode 100644
index 50c59aa..0000000
--- a/doc/prokka-manual.html
+++ /dev/null
@@ -1,215 +0,0 @@
-<h1 id="prokka-rapid-prokaryotic-genome-annotation">Prokka: rapid prokaryotic genome annotation</h1>
-<p>Torsten Seemann <script type="text/javascript">
-<!--
-h='monash.edu';a='@';n='torsten.seemann';e=n+a+h;
-document.write('<a h'+'ref'+'="ma'+'ilto'+':'+e+'">'+'<code>'+e+'</code>'+'<\/'+'a'+'>');
-// -->
-</script><noscript>torsten.seemann at monash dot edu</noscript><br />Victorian Bioinformatics Consortium, AUSTRALIA <a href="http://vicbioinformatics.com"><code class="url">http://vicbioinformatics.com</code></a></p>
-<h2 id="introduction">Introduction</h2>
-<p>Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.</p>
-<h2 id="installation">Installation</h2>
-<h3 id="download">Download</h3>
-<p>Download the latest <code>prokka-1.xx.tar.gz</code> archive from http://www.bioinformatics.net.au/software.prokka.shtml</p>
-<h3 id="extract">Extract</h3>
-<p>Choose somewhere to put it, for example in your home directory (no root access required):</p>
-<pre><code>% cd $HOME
-% tar zxvf prokka-1.xx.tar.gz
-% ls prokka-1.xx</code></pre>
-<h3 id="add-to-path">Add to PATH</h3>
-<p>Add the following line to your <code>$HOME/.bashrc</code> file, or to <code>/etc/profile.d/prokka.sh</code> to make it available to all users:</p>
-<pre><code>export PATH=$PATH:$HOME/prokka-1.xx/bin</code></pre>
-<h3 id="index-the-sequence-databases">Index the sequence databases</h3>
-<pre><code>% prokka --setupdb</code></pre>
-<h3 id="install-dependencies">Install dependencies</h3>
-<p>Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have.</p>
-<p>You can consult the list of dependencies later in this document.</p>
-<h3 id="choose-a-rrna-predictor">Choose a rRNA predictor</h3>
-<h4 id="option-1---dont-use-one">Option 1 - Don't use one</h4>
-<p>If Prokka can't find a predictor for rRNA featues (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway,</p>
-<h4 id="option-2---barrnap">Option 2 - Barrnap</h4>
-<p>This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from http://www.vicbioinformatics.com/software.barrnap.shtml</p>
-<h4 id="option-3---rnammer">Option 3 - RNAmmer</h4>
-<p>RNAmmer was written when HMMER 2.x was the latest release. Since them, HMMER 3.x has been released, and uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit your RNAmmer script to explicitly point your HMMER2 binary instead of using the HMMER3 binary which is more likely to be in your PATH first.</p>
-<p>Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:</p>
-<pre><code>if ( $uname eq "Linux" ) {
-# $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
- $HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
-}</code></pre>
-<p>If you are using Mac OS X, you'll also have to change the <code>"Linux"</code> to <code>"Darwin"</code> too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.</p>
-<h3 id="test">Test</h3>
-<ul>
-<li>Type <code>prokka</code> and it should output it's help screen.</li>
-<li>Type <code>prokka --version</code> and you should see an output like <code>prokka 1.x</code></li>
-<li>Type <code>prokka --listdb</code> and it will show you what databases it has installed to use.</li>
-</ul>
-<h2 id="invoking-prokka">Invoking Prokka</h2>
-<h3 id="beginner">Beginner</h3>
-<pre><code># Vanilla (but with free toppings)
-% prokka contigs.fa
-
-# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
-% cat PROKKA_yyyymmdd/*.txt</code></pre>
-<h3 id="moderate">Moderate</h3>
-<pre><code># Choose the names of the output files
-% prokka --outdir mydir --prefix mygenome contigs.fa
-
-# Visualize it in Artemis
-% art mydir/mygenome.gff</code></pre>
-<h3 id="expert">Expert</h3>
-<pre><code># It's not just for bacteria, people
-% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
-
-# Search for my favourite gene
-% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less</code></pre>
-<h3 id="wizard">Wizard</h3>
-<pre><code># Watch and learn
-% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
-
-# Check to see if anything went really wrong
-% less mydir/EHEC_06072012.err
-
-# Add final details using Sequin
-% sequin mydir/EHEC_0607201.sqn</code></pre>
-<h3 id="genbank-submitter">Genbank submitter</h3>
-<pre><code># Register your BioProject and your locus_tag prefix first!
-% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
-
-# Check to see if anything went really wrong
-% less PRJNA123456/EHEC-Chr1.err
-
-# Add final details using Sequin
-% sequin PRJNA123456/EHEC-Chr1.sqn</code></pre>
-<h3 id="crazy-person">Crazy Person</h3>
-<pre><code># No stinking Perl script is going to control me
-% prokka \
- --outdir $HOME/genomes/Ec_POO247 --force \
- --prefix Ec_POO247 --addgenes --locustag ECPOOp \
- --increment 10 --gffver 2 --centre CDC --compliant \
- --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
- --kingdom Bacteria --gcode 11 --usegenus \
- --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
- --evalue 1e-9 --rfam \
- plasmid-closed.fna</code></pre>
-<h2 id="output-files">Output Files</h2>
-<p>| Extension | Description | | --------- | ----------- | | .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. | | .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. | | .fna | Nucleotide FASTA file of the input contig sequences. | | .faa | Protein FASTA file of the tran [...]
-<h2 id="command-line-options">Command line options</h2>
-<pre><code>General:
- --help This help
- --version Print version and exit
- --docs Show full manual/documentation
- --citation Print citation for referencing Prokka
- --quiet No screen output (default OFF)
-Setup:
- --listdb List all configured databases
- --setupdb Index all installed databases
- --cleandb Remove all database indices
- --depends List all software dependencies
-Outputs:
- --outdir [X] Output folder [auto] (default '')
- --force Force overwriting existing output folder (default OFF)
- --prefix [X] Filename output prefix [auto] (default '')
- --addgenes Add 'gene' features for each 'CDS' feature (default OFF)
- --locustag [X] Locus tag prefix (default 'PROKKA')
- --increment [N] Locus tag counter increment (default '1')
- --gffver [N] GFF version (default '3')
- --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
- --centre [X] Sequencing centre ID. (default '')
-Organism details:
- --genus [X] Genus name (default 'Genus')
- --species [X] Species name (default 'species')
- --strain [X] Strain name (default 'strain')
- --plasmid [X] Plasmid name or identifier (default '')
-Annotations:
- --kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
- --gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
- --gram [X] Gram: -/neg +/pos (default '')
- --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
- --proteins [X] Fasta file of trusted proteins to first annotate from (default '')
- --metagenome Improve gene predictions for highly fragmented genomes (default OFF)
- --rawproduct Do not clean up /product annotation (default OFF)
-Computation:
- --fast Fast mode - skip CDS /product searching (default OFF)
- --cpus [N] Number of CPUs to use [0=all] (default '8')
- --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
- --evalue [n.n] Similarity e-value cut-off (default '1e-06')
- --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
- --norrna Don't run rRNA search (default OFF)
- --notrna Don't run tRNA search (default OFF)
- --rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)</code></pre>
-<h2 id="dependencies">Dependencies</h2>
-<ul>
-<li><p><strong>GNU Parallel</strong><br />A shell tool for executing jobs in parallel using one or more computers<br /><em>O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.</em></p></li>
-<li><p><strong>BioPerl</strong><br />Used for input/output of various file formats<br /><em>Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.</em></p></li>
-<li><p><strong>Aragorn</strong><br />Finds transfer RNA features (tRNA)<br /><em>Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6.</em></p></li>
-<li><p><strong>Barrnap</strong><br />Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.<br /><em>Manuscript under preparation.</em></p></li>
-<li><p><strong>RNAmmer</strong><br />Finds ribosomal RNA features (rRNA)<br /><em>Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.</em></p></li>
-<li><p><strong>Prodigal</strong><br />Finds protein-coding features (CDS)<br /><em>Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.</em></p></li>
-<li><p><strong>SignalP</strong><br />Finds signal peptide features in CDS (sig_peptide)<br /><em>Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.</em></p></li>
-<li><p><strong>BLAST+</strong><br />Used for similarity searching against protein sequence libraries<br /><em>Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.</em></p></li>
-<li><p><strong>HMMER3</strong><br />Used for similarity searching against protein family profiles<br /><em>Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.</em></p></li>
-<li><p><strong>Infernal</strong><br />Used for similarity searching against ncRNA family profiles<br /><em>D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011.</em></p></li>
-</ul>
-<h2 id="databases">Databases</h2>
-<h3 id="the-core-blast-databases">The Core (BLAST+) Databases</h3>
-<p>Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchial approach to make it fast. A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.</p>
-<p>The initial core databases are derived from UniProtKB; there is one per "kingdom" supported. To qualify for inclusion, a protein must be (1) from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries; and (3) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.</p>
-<h4 id="making-a-core-databases">Making a Core Databases</h4>
-<p>If you want to modify these core databases, the included script <code>prokka-uniprot_to_fasta_db</code>, along with the official <code>uniprot_sprot.dat</code>, can be used to generate a new database to put in <code>/opt/prokka/db/kingdom/</code>. If you add new ones, the command <code>prokka --listdb</code> will show you whether it has been detected properly.</p>
-<h4 id="the-genus-databases">The Genus Databases</h4>
-<p>If you enable <code>--usegenus</code> and also provide a Genus via <code>--genus</code> then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka <code>--listdb</code> to see what they are.</p>
-<h4 id="adding-a-genus-databases">Adding a Genus Databases</h4>
-<p>If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called <code>prokka-genbank_to_fasta_db</code> to help. For example, if you had four annotated "Coccus" genomes, you could do the following:</p>
-<pre><code>% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
-% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
-% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
-% makeblastdb -dbtype prot -in Coccus
-% mv Coccus.p* /path/to/prokka/db/genus/</code></pre>
-<h3 id="the-hmm-databases">The HMM Databases</h3>
-<p>Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in <code>/opt/prokka/db/hmm</code>. Type <code>prokka --listdb</code> to confirm they are recognised.</p>
-<h3 id="fasta-database-format">FASTA database format</h3>
-<p>Prokka understands two annotation tag formats, a plain one and a detailed one.</p>
-<p>The plain one is a standard FASTA-like line with the ID after the <code>></code> sign, and the protein <code>/product</code> after the ID (the "description" part of the line):</p>
-<pre><code>>SeqID product</code></pre>
-<p>The detailed one consists of a special encoded three-part description line. The parts are the <code>/EC_number</code>, the <code>/gene</code> code, then the <code>/product</code> - and they are separated by a special "<sub>~</sub>" sequence:</p>
-<pre><code>>SeqID EC_number~~~gene~~~product</code></pre>
-<p>Here are some examples. Note that not all parts need to be present, but the "<sub>~</sub>" should still be there:</p>
-<pre><code>>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
-MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
-DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
->YP_492697.1 ~~~traB~~~transfer complex protein TraB
-MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
-LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
->YP_492694.1 ~~~~~~transposase
-MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
-QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*</code></pre>
-<p>The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:</p>
-<pre><code>NAME PRK00001
-ACC PRK00001
-DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
-LENG 284</code></pre>
-<h2 id="faq">FAQ</h2>
-<ul>
-<li><p><strong>Where does the name "Prokka" come from?</strong><br />Prokka is a contraction of "prokaryotic annotation". It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.</p></li>
-<li><p><strong>Can I annotate by eukaryote genome with Prokka?</strong><br />No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.</p></li>
-<li><p><strong>Why does Prokka keeps on crashing when it gets to tge "tbl2asn" stage?</strong><br />It seems that the tbl2asn program from NCBI "expires" after 12 months, and refuses to run. Unfortunately you need to install a newer version which you can download from here.</p></li>
-<li><p><strong>The hmmscan step seems to hang and do nothing?</strong><br />The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the <code>--gnu</code> option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change <code>parallel</code> to <code>parallel --gnu</code>.</p></li>
-<li><p><strong>Why does prokka fail when it gets to hmmscan?</strong><br />Unfortunately HMMER keeps changing it's database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to "re-press" the files. This can be done as follows: cd /path/to/prokka/db/hmm mkdir new for D in *.hmm ; do hmmconvert <span class="math"><em>D</em> > <em>n</em><em>e</em><em>w</em> / </span>D ; done cd new for D in *.hmm ; do hmmpress $D ; done mv * . [...]
-<li><p><strong>Why does Prokka take so long to download?</strong><br />Our server is in Australia, and the international pipes aren't always flowing as well as we'd like. I try to put it on GoogleDrive. Dropbox is no longer possible due to bandwidth quotas. If you are able to mirror Prokka (~2 GB) outside please let me know.</p></li>
-<li><p><strong>Why can't I load Prokka .GBK files into Mauve?</strong><br />Mauve is very picky about Genbank files. It does not like long contig names, like those from Velvet or Spades. The simple solution is to use <code>--centre XXX</code> in Prokka and it will rename all your contigs to be NCBI (and Mauve) compliant. It does not like the ACCESSION and VERSION strings that Prokka produces via the "tbl2asn" tool. The following Unix command will fix them: <code>egrep -v '^(ACC [...]
-</ul>
-<h2 id="still-to-do">Still To Do</h2>
-<ul>
-<li>ToDoList.txt: https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt</li>
-</ul>
-<h2 id="changes">Changes</h2>
-<ul>
-<li>ChangeLog.txt: https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt</li>
-<li>Github commits: https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master</li>
-</ul>
-<h2 id="bugs">Bugs</h2>
-<ul>
-<li>tbl2asn seems to be removing the "(anti-codon)" part in my tRNA /product values</li>
-<li>tbl2asn putting space in /inference for Infernal</li>
-</ul>
-<h2 id="citation">Citation</h2>
-<p>Seemann T.<br /><em>Prokka: rapid prokaryotic genome annotation</em><br /><strong>Bioinformatics</strong> 2014 Jul 15;30(14):2068-9. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24642063">PMID:24642063</a><br /><a href="doi:10.1093/bioinformatics/btu153">DOI</a></p>
diff --git a/doc/prokka-manual.txt b/doc/prokka-manual.txt
deleted file mode 100644
index 47b6329..0000000
--- a/doc/prokka-manual.txt
+++ /dev/null
@@ -1,454 +0,0 @@
-Prokka: rapid prokaryotic genome annotation
-===========================================
-
-Torsten Seemann torsten.seemann at monash.edu
-Victorian Bioinformatics Consortium, AUSTRALIA
-http://vicbioinformatics.com
-
-Introduction
-------------
-
-Whole genome annotation is the process of identifying features of
-interest in a set of genomic DNA sequences, and labelling them with
-useful information. Prokka is a software tool to annotate bacterial,
-archaeal and viral genomes quickly and produce standards-compliant
-output files.
-
-Installation
-------------
-
-Download
-
-Download the latest prokka-1.xx.tar.gz archive from
-http://www.bioinformatics.net.au/software.prokka.shtml
-
-Extract
-
-Choose somewhere to put it, for example in your home directory (no root
-access required):
-
- % cd $HOME
- % tar zxvf prokka-1.xx.tar.gz
- % ls prokka-1.xx
-
-Add to PATH
-
-Add the following line to your $HOME/.bashrc file, or to
-/etc/profile.d/prokka.sh to make it available to all users:
-
- export PATH=$PATH:$HOME/prokka-1.xx/bin
-
-Index the sequence databases
-
- % prokka --setupdb
-
-Install dependencies
-
-Prokka comes with many binaries for Linux and Mac OS X. It will always
-use your existing installed versions if they exist, but will use the
-included ones if that fails. For some older systems (eg. Centos 4.x)
-some of them won't work due to them being dynamically linked against new
-GLIBC libraries you don't have.
-
-You can consult the list of dependencies later in this document.
-
-Choose a rRNA predictor
-
-Option 1 - Don't use one
-
-If Prokka can't find a predictor for rRNA featues (either Barrnap or
-RNAmmer below) then it simply won't annotate any. Most people don't care
-that much about them anyway,
-
-Option 2 - Barrnap
-
-This was written by the author of Prokka and is recommended if you
-prefer speed over absolute accuracy. It uses the new multi-core NHMMER
-for DNA:DNA profile searches. Download it from
-http://www.vicbioinformatics.com/software.barrnap.shtml
-
-Option 3 - RNAmmer
-
-RNAmmer was written when HMMER 2.x was the latest release. Since them,
-HMMER 3.x has been released, and uses the same executable binary names.
-Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit
-your RNAmmer script to explicitly point your HMMER2 binary instead of
-using the HMMER3 binary which is more likely to be in your PATH first.
-
-Type which rnammer to find the script, and then edit it with your
-favourite editor. Find the following lines at the top:
-
- if ( $uname eq "Linux" ) {
- # $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
- $HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
- }
-
-If you are using Mac OS X, you'll also have to change the "Linux" to
-"Darwin" too. As you can see, I have commented out the original part,
-and replaced it with the location of my HMMER2 hmmsearch tool, so it
-doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH
-before the old HMMER2 too.
-
-Test
-
-- Type prokka and it should output it's help screen.
-- Type prokka --version and you should see an output like prokka 1.x
-- Type prokka --listdb and it will show you what databases it has
- installed to use.
-
-Invoking Prokka
----------------
-
-Beginner
-
- # Vanilla (but with free toppings)
- % prokka contigs.fa
-
- # Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
- % cat PROKKA_yyyymmdd/*.txt
-
-Moderate
-
- # Choose the names of the output files
- % prokka --outdir mydir --prefix mygenome contigs.fa
-
- # Visualize it in Artemis
- % art mydir/mygenome.gff
-
-Expert
-
- # It's not just for bacteria, people
- % prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
-
- # Search for my favourite gene
- % exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less
-
-Wizard
-
- # Watch and learn
- % prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
-
- # Check to see if anything went really wrong
- % less mydir/EHEC_06072012.err
-
- # Add final details using Sequin
- % sequin mydir/EHEC_0607201.sqn
-
-Genbank submitter
-
- # Register your BioProject and your locus_tag prefix first!
- % prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
-
- # Check to see if anything went really wrong
- % less PRJNA123456/EHEC-Chr1.err
-
- # Add final details using Sequin
- % sequin PRJNA123456/EHEC-Chr1.sqn
-
-Crazy Person
-
- # No stinking Perl script is going to control me
- % prokka \
- --outdir $HOME/genomes/Ec_POO247 --force \
- --prefix Ec_POO247 --addgenes --locustag ECPOOp \
- --increment 10 --gffver 2 --centre CDC --compliant \
- --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
- --kingdom Bacteria --gcode 11 --usegenus \
- --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
- --evalue 1e-9 --rfam \
- plasmid-closed.fna
-
-Output Files
-------------
-
-| Extension | Description | | --------- | ----------- | | .gff | This is
-the master annotation in GFF3 format, containing both sequences and
-annotations. It can be viewed directly in Artemis or IGV. | | .gbk |
-This is a standard Genbank file derived from the master .gff. If the
-input to prokka was a multi-FASTA, then this will be a multi-Genbank,
-with one record for each sequence. | | .fna | Nucleotide FASTA file of
-the input contig sequences. | | .faa | Protein FASTA file of the
-translated CDS sequences. | | .ffn | Nucleotide FASTA file of all the
-annotated sequences, not just CDS. | | .sqn | An ASN1 format "Sequin"
-file for submission to Genbank. It needs to be edited to set the correct
-taxonomy, authors, related publication etc. | | .fsa | Nucleotide FASTA
-file of the input contig sequences, used by "tbl2asn" to create the .sqn
-file. It is mostly the same as the .fna file, but with extra Sequin tags
-in the sequence description lines. | | .tbl | Feature Table file, used
-by "tbl2asn" to create the .sqn file. | | .err | Unacceptable
-annotations - the NCBI discrepancy report. | | .log | Contains all the
-output that Prokka produced during its run. This is a record of what
-settings you used, even if the --quiet option was enabled. | | .txt |
-Statistics relating to the annotated features found. |
-
-Command line options
---------------------
-
- General:
- --help This help
- --version Print version and exit
- --docs Show full manual/documentation
- --citation Print citation for referencing Prokka
- --quiet No screen output (default OFF)
- Setup:
- --listdb List all configured databases
- --setupdb Index all installed databases
- --cleandb Remove all database indices
- --depends List all software dependencies
- Outputs:
- --outdir [X] Output folder [auto] (default '')
- --force Force overwriting existing output folder (default OFF)
- --prefix [X] Filename output prefix [auto] (default '')
- --addgenes Add 'gene' features for each 'CDS' feature (default OFF)
- --locustag [X] Locus tag prefix (default 'PROKKA')
- --increment [N] Locus tag counter increment (default '1')
- --gffver [N] GFF version (default '3')
- --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
- --centre [X] Sequencing centre ID. (default '')
- Organism details:
- --genus [X] Genus name (default 'Genus')
- --species [X] Species name (default 'species')
- --strain [X] Strain name (default 'strain')
- --plasmid [X] Plasmid name or identifier (default '')
- Annotations:
- --kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
- --gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
- --gram [X] Gram: -/neg +/pos (default '')
- --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
- --proteins [X] Fasta file of trusted proteins to first annotate from (default '')
- --metagenome Improve gene predictions for highly fragmented genomes (default OFF)
- --rawproduct Do not clean up /product annotation (default OFF)
- Computation:
- --fast Fast mode - skip CDS /product searching (default OFF)
- --cpus [N] Number of CPUs to use [0=all] (default '8')
- --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
- --evalue [n.n] Similarity e-value cut-off (default '1e-06')
- --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
- --norrna Don't run rRNA search (default OFF)
- --notrna Don't run tRNA search (default OFF)
- --rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
-
-Dependencies
-------------
-
-- GNU Parallel
- A shell tool for executing jobs in parallel using one or more
- computers
- O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The
- USENIX Magazine, Feb 2011:42-47.
-
-- BioPerl
- Used for input/output of various file formats
- Stajich et al, The Bioperl toolkit: Perl modules for the life
- sciences. Genome Res. 2002 Oct;12(10):1611-8.
-
-- Aragorn
- Finds transfer RNA features (tRNA)
- Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and
- tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan
- 2;32(1):11-6.
-
-- Barrnap
- Used to predict ribosomal RNA features (rRNA). My licence-free
- replacement for RNAmmmer.
- Manuscript under preparation.
-
-- RNAmmer
- Finds ribosomal RNA features (rRNA)
- Lagesen K et al. RNAmmer: consistent and rapid annotation of
- ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.
-
-- Prodigal
- Finds protein-coding features (CDS)
- Hyatt D et al. Prodigal: prokaryotic gene recognition and
- translation initiation site identification. BMC Bioinformatics. 2010
- Mar 8;11:119.
-
-- SignalP
- Finds signal peptide features in CDS (sig_peptide)
- Petersen TN et al. SignalP 4.0: discriminating signal peptides from
- transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.
-
-- BLAST+
- Used for similarity searching against protein sequence libraries
- Camacho C et al. BLAST+: architecture and applications. BMC
- Bioinformatics. 2009 Dec 15;10:421.
-
-- HMMER3
- Used for similarity searching against protein family profiles
- Finn RD et al. HMMER web server: interactive sequence similarity
- searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.
-
-- Infernal
- Used for similarity searching against ncRNA family profiles
- D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search.
- Bioinformatics, 27:3102-3109, 2011.
-
-Databases
----------
-
-The Core (BLAST+) Databases
-
-Prokka uses a variety of databases when trying to assign function to the
-predicted CDS features. It takes a hierarchial approach to make it fast.
-A small, core set of well characterized proteins are first searched
-using BLAST+. This combination of small database and fast search
-typically completes about 70% of the workload. Then a series of slower
-but more sensitive HMM databases are searched using HMMER3.
-
-The initial core databases are derived from UniProtKB; there is one per
-"kingdom" supported. To qualify for inclusion, a protein must be (1)
-from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries;
-and (3) have an evidence level ("PE") of 2 or lower, which corresponds
-to experimental mRNA or proteomics evidence.
-
-Making a Core Databases
-
-If you want to modify these core databases, the included script
-prokka-uniprot_to_fasta_db, along with the official uniprot_sprot.dat,
-can be used to generate a new database to put in
-/opt/prokka/db/kingdom/. If you add new ones, the command
-prokka --listdb will show you whether it has been detected properly.
-
-The Genus Databases
-
-If you enable --usegenus and also provide a Genus via --genus then it
-will first use a BLAST database which is Genus specific. Prokka comes
-with a set of databases for the most common Bacterial genera; type
-prokka --listdb to see what they are.
-
-Adding a Genus Databases
-
-If you have a set of Genbank files and want to create a new Genus
-database, Prokka comes with a tool called prokka-genbank_to_fasta_db to
-help. For example, if you had four annotated "Coccus" genomes, you could
-do the following:
-
- % prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
- % cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
- % rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
- % makeblastdb -dbtype prot -in Coccus
- % mv Coccus.p* /path/to/prokka/db/genus/
-
-The HMM Databases
-
-Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly
-Bacteria-specific. They are searched after the core and genus databases.
-You can add more simply by putting them in /opt/prokka/db/hmm. Type
-prokka --listdb to confirm they are recognised.
-
-FASTA database format
-
-Prokka understands two annotation tag formats, a plain one and a
-detailed one.
-
-The plain one is a standard FASTA-like line with the ID after the >
-sign, and the protein /product after the ID (the "description" part of
-the line):
-
- >SeqID product
-
-The detailed one consists of a special encoded three-part description
-line. The parts are the /EC_number, the /gene code, then the /product -
-and they are separated by a special "~" sequence:
-
- >SeqID EC_number~~~gene~~~product
-
-Here are some examples. Note that not all parts need to be present, but
-the "~" should still be there:
-
- >YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
- MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
- DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
- >YP_492697.1 ~~~traB~~~transfer complex protein TraB
- MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
- LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
- >YP_492694.1 ~~~~~~transposase
- MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
- QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
-
-The same description lines apply to HMM models, except the "NAME" and
-"DESC" fields are used:
-
- NAME PRK00001
- ACC PRK00001
- DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
- LENG 284
-
-FAQ
----
-
-- Where does the name "Prokka" come from?
- Prokka is a contraction of "prokaryotic annotation". It's also
- relatively unique within Google, and also rhymes with a native
- Australian marsupial called the quokka.
-
-- Can I annotate by eukaryote genome with Prokka?
- No. Prokka is specifically designed for Bacteria, Archaea and
- Viruses. It can't handle multi-exon gene models; I would recommend
- using MAKER 2 for that purpose.
-
-- Why does Prokka keeps on crashing when it gets to tge "tbl2asn"
- stage?
- It seems that the tbl2asn program from NCBI "expires" after 12
- months, and refuses to run. Unfortunately you need to install a
- newer version which you can download from here.
-
-- The hmmscan step seems to hang and do nothing?
- The problem here is GNU Parallel. It seems the Debian package for
- hmmer has modified it to require the --gnu option to behave in the
- 'default' way. There is no clear reason for this. The only way to
- restore normal behaviour is to edit the prokka script and change
- parallel to parallel --gnu.
-
-- Why does prokka fail when it gets to hmmscan?
- Unfortunately HMMER keeps changing it's database format, and they
- aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1
- say) then you need to "re-press" the files. This can be done as
- follows: cd /path/to/prokka/db/hmm mkdir new for D in *.hmm ; do
- hmmconvert D > new/D ; done cd new for D in *.hmm ; do hmmpress $D ;
- done mv * .. rmdir new
-
-- Why does Prokka take so long to download?
- Our server is in Australia, and the international pipes aren't
- always flowing as well as we'd like. I try to put it on GoogleDrive.
- Dropbox is no longer possible due to bandwidth quotas. If you are
- able to mirror Prokka (~2 GB) outside please let me know.
-
-- Why can't I load Prokka .GBK files into Mauve?
- Mauve is very picky about Genbank files. It does not like long
- contig names, like those from Velvet or Spades. The simple solution
- is to use --centre XXX in Prokka and it will rename all your contigs
- to be NCBI (and Mauve) compliant. It does not like the ACCESSION and
- VERSION strings that Prokka produces via the "tbl2asn" tool. The
- following Unix command will fix them:
- egrep -v '^(ACCESSION|VERSION)' prokka.gbk > mauve.gbk
-
-Still To Do
------------
-
-- ToDoList.txt:
- https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt
-
-Changes
--------
-
-- ChangeLog.txt:
- https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt
-- Github commits:
- https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master
-
-Bugs
-----
-
-- tbl2asn seems to be removing the "(anti-codon)" part in my tRNA
- /product values
-- tbl2asn putting space in /inference for Infernal
-
-Citation
---------
-
-Seemann T.
-Prokka: rapid prokaryotic genome annotation
-Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
-DOI
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/prokka.git
More information about the debian-med-commit
mailing list