[med-svn] r12678 - trunk/packages/cdbfasta/trunk/debian

Tue Dec 18 14:24:21 UTC 2012

Author: tille
Date: 2012-12-18 14:24:20 +0000 (Tue, 18 Dec 2012)
New Revision: 12678

Added:
   trunk/packages/cdbfasta/trunk/debian/cdbfasta_usage.html
   trunk/packages/cdbfasta/trunk/debian/doc-base
Modified:
   trunk/packages/cdbfasta/trunk/debian/docs
Log:
Add documentation that was found in a separate HTML file on the SourceForge download page


Added: trunk/packages/cdbfasta/trunk/debian/cdbfasta_usage.html
===================================================================

--- trunk/packages/cdbfasta/trunk/debian/cdbfasta_usage.html	                        (rev 0)
+++ trunk/packages/cdbfasta/trunk/debian/cdbfasta_usage.html	2012-12-18 14:24:20 UTC (rev 12678)
@@ -0,0 +1,194 @@
+<!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">
+<html><head>
+
+
+   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+   <meta name="GENERATOR" content="Mozilla/4.78 [en] (X11; U; Linux 2.4.19 i686) [Netscape]">
+   <title>cdb tools for fasta files</title>
+</head><body bgcolor="#ffffff" link="#0000ee" text="#000000" vlink="#551a8b" alink="#ff0000">
+
+<h1>
+CDB (Constant DataBase) indexing and retrieval tools for multi-FASTA files</h1>
+This is a brief introduction to a pair of platform-independent, file-based
+hashing tools (<b>cdbfasta</b> and <b>cdbyank</b>) that can be used to
+create indices for quick retrieval of particular sequences from large
+multi-FASTA files. The latest version has the option to compress data records
+in order to save space. The index files are now architecture independent:
+the same index file can be created and used on many different Unix platforms
+(32-bit or 64-bit, big-endian or little-endian architectures) and even
+on Windows.
+<p><b>Content:</b>
+</p><p><b>   1.Typical usage</b>
+<br><b>   2.Retrieving sequence ranges or deflines</b>
+<br><b>   3.Data compression option</b>
+<br><b>   4.Development notes</b>
+<br> 
+</p><h2>
+1. Typical usage</h2>
+Use <b>cdbfasta</b> to create the index file for a multi-FASTA file and
+<b>cdbyank</b>
+to pull records based on that index file. A usage message is displayed
+if the commands cdbfasta or cdbyank are run without any parameters (or with
+-h). In order to create an index file, the name of the FASTA file to be
+indexed must be provided:
+<pre>cdbfasta &lt;fasta_file&gt;</pre>
+The fasta file can be specified with the whole path (if it's not in the
+current directory), e.g.
+<pre>cdbfasta /usr/local/db/GUDB.human</pre>
+By default cdbfasta creates an index file with the same path and name as
+the database file but with the  .cidx suffix added to the original
+name. So in the example above, a file GUDB.human.cidx will be created in
+/usr/local/db/. The default usage considers the key for a FASTA record
+to be the first space-delimited token following the ">" starting character
+from the definition line. For example, if a FASTA record had a defline
+like this:
+<p>>AA141526
+</p><p>...then we can use the string 'AA141526' with cdbyank to retrieve the
+full FASTA record associated with that sequence name:
+</p><pre>cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx</pre>
+Sometimes all the space-delimited tokens in the defline need to be declared
+as keys in the index file, all pointing to the same FASTA record. cdbfasta
+does this when given the "<b>-m</b>" switch.
+<p>For long and complex FASTA accessions (for example, EGAD|61|GP|186739|gb|AAA63210.1||M60828),
+the index file can be created in such a way that cdbyank does not need
+the full string in order to retrieve such a sequence: the first
+"&lt;db&gt;|&lt;accession&gt;" pair (i.e. the substring
+ending at the second '|' character; EGAD|61 in the example
+above) is enough. To enable this feature, cdbfasta offers two alternative
+options:
+</p><ul>
+<li>
+<b>-c</b> : the index file is built only by storing the "shortcut key"
+(the first "db|accession" pair found in the defline of each fasta record).
+In this case, cdbyank will only be able to accept these "shortcut" accessions
+for record retrieval.</li>
+
+<li>
+<b>-C</b> : the index file is built by storing both the "shortcut key"
+and the full keys (which are considered to end at the first space character
+in the defline). In this case, two strings are stored as keys for each
+FASTA record, so either of them can be used as an accession to retrieve
+the same record with cdbyank.</li>
+</ul>
+In order to retrieve records from the database file, cdbyank should be
+provided with the name of the index file created previously with cdbfasta,
+e.g.:
+<pre>cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx</pre>
+If the -a option is not given, a list of accessions is expected at stdin,
+e.g.:
+<pre>cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx</pre>
+This way the output will be a series of FASTA records at stdout. Redirecting
+this output to a file produces a multi-FASTA file. By default cdbyank locates the
+database file by stripping the '.cidx' suffix off the index filename. This
+is not enforced, however: with the <b>-d </b>option, cdbyank can
+be pointed at a user-provided database file to be used with the given index file.
+In the example above, if the index file "GUDB.human.cidx" is moved into
+another directory, a cdbyank command (in that other directory) can be issued
+like this:
+<pre>cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx</pre>
+The position of the index file in the list of arguments of cdbyank is not
+enforced. With -a, the exit status returned by cdbyank to the
+shell is 0 on success and 1 if the given key was not found.
+<p>The total number of fasta records indexed and the list of the keys stored
+in a specific cdb index file can be retrieved with cdbyank's <b>-n</b>
+and <b>-l </b>switches, respectively. This information is obtained from
+the index file directly (the database file is not needed for that). There
+is also a -s option that displays a summary of the indexing information
+stored in the index at index time: the original name of the FASTA
+file, its size, how the index was created (whether the -m (multiple keys)
+option or the -c or -C (shortcut keys) options were given), the number
+of keys stored in the file, and the number of FASTA records indexed
+(the latter being the same as what the <b>-n</b> option returns).
+</p><p>As an extra feature, cdbfasta and cdbyank can also be used for some
+special cases where databases may have different records but with the same
+key (non-unique keys). Although performance degrades a little,
+cdbfasta is able to index this kind of file, but by default cdbyank only
+outputs the first record found. If you want all the records sharing
+the same key (accession) to be retrieved and displayed, the <b>-x </b>option
+should be given to cdbyank.
+<br> 
+</p><h2>
+2. Retrieving sequence ranges or only the defline</h2>
+
+<p><br>Two <b>cdbyank</b> options are provided for convenience. The <b>-F</b>
+option returns only the definition line of each requested FASTA record (the
+first line of each record). The <b>-R </b>option is intended
+for FASTA files containing actual genetic sequences (nucleotide or protein)
+and expects each retrieval request to have the following format
+(space delimited):
+</p><p>&lt;key&gt; &lt;right_coordinate&gt; &lt;left_coordinate&gt;
+</p><p>For example, if we only want to retrieve the sequence range 24...178
+(letter numbering starts at 1) from the sequence named 'human|Z98492',
+then the cdbyank command would look like this:
+</p><pre>cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx</pre>
+Multiple sequence ranges can be extracted this way by providing a file
+in which each line follows the format above (key followed by the two coordinates).
+As before, such a file can be piped into cdbyank with the -R option to
+pull the specified range for each of the sequences listed in the
+input file.
+<br> 
+<pre>cat seqlistranges | cdbyank -R GUDB.human.cidx</pre>
+Note that this range option works by actually parsing and looping through
+the characters of the retrieved record internally, so performance is poor
+when a range near the end of a very large record is pulled.
+<h2>
+3. Data compression option</h2>
+The indexing program <b>cdbfasta</b> has the <b>-z &lt;compressed_db&gt;</b>
+option, which creates a compressed file &lt;compressed_db&gt; from the data in
+the given input file and at the same time creates an index file for this
+new compressed database, named &lt;compressed_db&gt;.cidxz. The original input
+file can then be discarded, as it can be recovered at any later point
+from the &lt;compressed_db&gt; file by using the <b>-z</b> option of <b>cdbyank</b>.
+<br>Because each record is compressed separately, compression is poor if
+the records are small. Compression is only advised when:
+<ul>
+<li>
+data records are large enough for the compression algorithm to become efficient
+(at least 1KB per record, the more the better)</li>
+
+<li>
+only random access to the data records is needed (so the original file
+can be discarded)</li>
+</ul>
+The compression can be quite slow for large files and there is also some
+performance penalty for cdbyank as it has to decompress the retrieved records
+on the fly. The input data for cdbfasta compression can be collected from
+stdin if '-' is used instead of a file name:
+<pre>cat my_data_files* | cdbfasta - -z mydata.cdbz</pre>
+This option is useful especially when the total size of input data files
+is extremely large (over the file-system limits or over the 4GB internal
+limit of cdbfasta) while the compressed output can be small enough to fall
+under such limits.
+<br>With compressed databases cdbyank can be used normally without extra
+options as it will auto-detect the compression (from the index file info)
+and activate on-the-fly decompression of the retrieved records. Only the -F
+and -R options are not yet supported for compressed records.
+<h2>
+4. Development notes</h2>
+These tools were developed in C++, based on the publicly available <b>cdb
+</b>("constant
+database") code written by D.J. Bernstein (<a href="http://cr.yp.to/djb.html">http://cr.yp.to/djb.html</a>).
+"<i>Constant databases</i>" are those that we don't need to add to or remove
+records from. The original C source was (rather crudely) wrapped into C++
+classes and adjusted to automatically index fasta records and to create
+an external index instead of compacting the original data file like the
+original cdb library code does.  Also the "endianness" is now checked
+at runtime and the bytes are swapped accordingly such that the file offsets
+and record sizes are always read/written in the same way in the index file.
+<br>The compression option uses <b>zlib</b>'s "deflate" method. The program
+uses deflate() with Z_FULL_FLUSH after each record, such that random record
+decompression is possible after the first [dummy] record is decompressed
+internally.
+<br>The index file contains an info chunk (actually stored at the end of
+the file) which holds summary data and flags about the indexing process
+(the -s option of cdbyank shows this info). Since the compression option
+was added, cdbyank always tries to read this information first (before
+opening the data file) in order to determine whether the data records are compressed
+or not.
+<p>Please let me know if you notice any problems with these tools.
+</p><p>--
+<br>Geo Pertea
+<br>geo.pertea at gmail.com
+<br>06/09/2003
+<br> 
+</p></body></html>
\ No newline at end of file
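
Note on the key rules in section 1 of cdbfasta_usage.html above: the Python sketch below restates how the documentation describes key selection in the default mode and with the -m, -c and -C switches. It only illustrates the documented behaviour and is not cdbfasta's actual code. The function names shortcut_key and defline_keys are hypothetical, and the handling of a first token with fewer than two '|' characters (the whole token is used, and -C stores it once) is an assumption.

```python
def shortcut_key(token):
    """Return the first "db|accession" pair: the substring up to the second '|'."""
    parts = token.split("|")
    # Assumption: with fewer than two '|' characters the whole token is used.
    return "|".join(parts[:2])


def defline_keys(defline, mode="default"):
    """List the keys the documentation describes for one FASTA definition line.

    mode: "default" (first token only), "m" (every token),
          "c" (shortcut key only), "C" (shortcut key and full key).
    """
    tokens = defline.lstrip(">").split()
    if not tokens:
        return []
    full = tokens[0]
    if mode == "m":
        return tokens
    if mode == "c":
        return [shortcut_key(full)]
    if mode == "C":
        short = shortcut_key(full)
        # Assumption: identical shortcut and full keys are stored only once.
        return [short] if short == full else [short, full]
    return [full]


defline = ">EGAD|61|GP|186739|gb|AAA63210.1||M60828 example protein"
for mode in ("default", "m", "c", "C"):
    print(mode, defline_keys(defline, mode))
```

The "example protein" description in the sample defline is made up; only the accession comes from the documentation.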

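Note on the compression scheme in section 4 of cdbfasta_usage.html above: a minimal Python sketch of why a full flush after each record makes random record decompression possible. It assumes a raw deflate stream (negative wbits), so no stream header has to be read before jumping to a record; this may differ from what cdbfasta does with its initial dummy record. The sample records and the in-memory index list are made up for illustration and do not reflect the .cidx file layout.

```python
import zlib

records = [b">seq1\nACGTACGTAC\n", b">seq2\nGGGTTTCCCAAA\n", b">seq3\nTTTTGGGGCC\n"]

# One raw deflate stream for all records, fully flushed after each record.
# Z_FULL_FLUSH byte-aligns the output and resets the compressor's history,
# so the bytes of the next record never refer back to earlier data.
comp = zlib.compressobj(9, zlib.DEFLATED, -15)
stream = b""
index = []  # (offset, length) of each record's compressed bytes
for rec in records:
    chunk = comp.compress(rec) + comp.flush(zlib.Z_FULL_FLUSH)
    index.append((len(stream), len(chunk)))
    stream += chunk


def fetch(i):
    """Decompress record i alone, starting at its stored offset."""
    offset, length = index[i]
    return zlib.decompressobj(-15).decompress(stream[offset:offset + length])


print(fetch(2).decode(), end="")
```

Each record pays the cost of a flush marker and starts with an empty history, which is consistent with the remark in section 3 that compression is poor when the records are small.
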
Added: trunk/packages/cdbfasta/trunk/debian/doc-base
===================================================================
--- trunk/packages/cdbfasta/trunk/debian/doc-base	                        (rev 0)
+++ trunk/packages/cdbfasta/trunk/debian/doc-base	2012-12-18 14:24:20 UTC (rev 12678)
@@ -0,0 +1,12 @@
+Document: cdbfasta
+Title: CDB (Constant DataBase) indexing and retrieval tools for multi-FASTA files
+Author: Geo Pertea <geo.pertea at gmail.com>
+Abstract: Constant DataBase indexing and retrieval tools for multi-FASTA files
+ CDB (Constant DataBase) can be used for creating indices for quick
+ retrieval of any particular sequences from large multi-FASTA files.
+ It has the option to compress data records in order to save space.
+Section: Science/Biology
+
+Format: html
+Files: /usr/share/doc/cdbfasta/cdbfasta_usage.html
+Index: /usr/share/doc/cdbfasta/cdbfasta_usage.html

Modified: trunk/packages/cdbfasta/trunk/debian/docs
===================================================================
--- trunk/packages/cdbfasta/trunk/debian/docs	2012-12-18 14:23:01 UTC (rev 12677)
+++ trunk/packages/cdbfasta/trunk/debian/docs	2012-12-18 14:24:20 UTC (rev 12678)
@@ -1 +1,2 @@
 README
+debian/cdbfasta_usage.html