[med-svn] r3130 - in trunk/community/infrastructure/getData: . debian getData.conf.d

Thu Feb 19 10:16:14 UTC 2009

Author: plessy
Date: 2009-02-19 10:16:14 +0000 (Thu, 19 Feb 2009)
New Revision: 3130

Added:
   trunk/community/infrastructure/getData/ChangeLog
   trunk/community/infrastructure/getData/debian/compat
   trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData
   trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk
Removed:
   trunk/community/infrastructure/getData/getData.txt
Modified:
   trunk/community/infrastructure/getData/debian/control
   trunk/community/infrastructure/getData/getData
   trunk/community/infrastructure/getData/getData.conf.d/dog.getData
   trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk
   trunk/community/infrastructure/getData/getData.conf.d/pdb.getData
   trunk/community/infrastructure/getData/getData.conf.d/rfam.getData
Log:
Added limited support for mouse RefSeq, and a changelog.

2009-02-19  Charles Plessy <plessy at debian.org>

	* ChangeLog: Added. Let's follow the GNU coding standards.
	http://www.gnu.org/prep/standards/html_node/Change-Logs.html

	* getData.conf.d: Added support for mouse and human RefSeq.
	* getData: Removed human RefSeq.

	* getData.conf.d/dog.getData, getData.conf.d/rfam.getData,
	getData.conf.d/pdb.getData: Print on STDERR only if verbose.
	* getData.conf.d/dog.getData.mk: Added missing parentheses around a
	make variable.


Added: trunk/community/infrastructure/getData/ChangeLog
===================================================================

--- trunk/community/infrastructure/getData/ChangeLog	                        (rev 0)
+++ trunk/community/infrastructure/getData/ChangeLog	2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,12 @@
+2009-02-19  Charles Plessy <plessy at debian.org>
+
+	* ChangeLog: Added. Let's follow the GNU coding standards.
+	http://www.gnu.org/prep/standards/html_node/Change-Logs.html
+
+	* getData.conf.d: Added support for mouse and human RefSeq.
+	* getData: Removed human RefSeq.
+
+	* getData.conf.d/dog.getData, getData.conf.d/rfam.getData,
+	getData.conf.d/pdb.getData: Print on STDERR only if verbose.
+	* getData.conf.d/dog.getData.mk: Added missing parentheses around a
+	make variable.

Added: trunk/community/infrastructure/getData/debian/compat
===================================================================
--- trunk/community/infrastructure/getData/debian/compat	                        (rev 0)
+++ trunk/community/infrastructure/getData/debian/compat	2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1 @@
+7

Modified: trunk/community/infrastructure/getData/debian/control
===================================================================
--- trunk/community/infrastructure/getData/debian/control	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/debian/control	2009-02-19 10:16:14 UTC (rev 3130)
@@ -2,7 +2,8 @@
 Section: science
 Priority: optional
 Maintainer: Steffen Moeller <moeller at debian.org>
-Build-Depends: cdbs, debhelper (>= 5)
+Uploaders: Charles Plessy <plessy at debian.org>
+Build-Depends: cdbs, debhelper (>= 7)
 Standards-Version: 3.8.0
 Homepage: http://debian-med.alioth.debian.org
 

Modified: trunk/community/infrastructure/getData/getData
===================================================================
--- trunk/community/infrastructure/getData/getData	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData	2009-02-19 10:16:14 UTC (rev 3130)
@@ -405,12 +405,6 @@
 		source => "wget $sharedWgetOptions http://www.reactome.org/download/interactions.README.txt http://www.reactome.org/download/current/homo_sapiens.interactions.txt.gz"
 	},
 
-# Proof-of-principle for RefSeq. Does not include everything.
-	"refseq.hsa" => {
-		name => "The NCBI Reference Sequence project - Homo sapiens",
-		source => "wget $sharedWgetOptions ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/*ff.gz"
-	},
-
 	"pfam-a" => {
 		name => "Pfam-A : Manually curated protein families and domains, only the seed is presented.",
 		source => "wget $sharedWgetOptions ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A-seed.gz"

Added: trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData	                        (rev 0)
+++ trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData	2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,18 @@
+# Proof-of-principle for RefSeq. Does not include everything.
+print STDERR "Reading RefSeq configuration file\n" if $verbose;
+
+$toBeMirrored{"refseq.hsa"} = {
+                "name"          => "The NCBI Reference Sequence project – Homo sapiens",
+                "tags"          => ["human", "proteome", "transcriptome"],
+                "source"        => "ln -s /etc/getData.conf.d/RefSeq.mk Makefile ; SPECIES=H_sapiens make get unpack",
+                "post-download" => "make emboss blast"
+        };
+
+$toBeMirrored{"refseq.mmu"} = {
+                "name"          => "The NCBI Reference Sequence project – Mus musculus",
+                "tags"          => ["mouse", "proteome", "transcriptome"],
+                "source"        => "ln -s /etc/getData.conf.d/RefSeq.mk Makefile ; SPECIES=M_musculus make get unpack",
+                "post-download" => "make emboss blast"
+        };
+
+1;
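
A note on the `source` commands above: the getData documentation describes `source` as running for the initial download and for subsequent updates, so on a second run the `ln -s ... Makefile` step finds an existing `Makefile` and reports an error (the `;` lets `make` run anyway). The following is a minimal sketch, not part of this commit, of how `ln -sf` would keep that step repeatable. The scratch directory and the stand-in `RefSeq.mk` are hypothetical.

```shell
#!/bin/sh
# Hypothetical scratch area standing in for $mirrordir/refseq.mmu.
workdir=$(mktemp -d)

# Stand-in for /etc/getData.conf.d/RefSeq.mk.
printf 'get:\n' > "$workdir/RefSeq.mk"

# First run: plain "ln -s" succeeds.
ln -s "$workdir/RefSeq.mk" "$workdir/Makefile"
first_status=$?

# Second run: plain "ln -s" fails because Makefile already exists.
ln -s "$workdir/RefSeq.mk" "$workdir/Makefile" 2>/dev/null
second_status=$?

# "ln -sf" replaces the existing link, so repeated runs succeed.
ln -sf "$workdir/RefSeq.mk" "$workdir/Makefile"
forced_status=$?

echo "first=$first_status second=$second_status forced=$forced_status"
rm -rf "$workdir"
```

Whether the error from the plain `ln -s` matters depends on how getData treats the exit status of the whole `source` string, which this diff does not show.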

Added: trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk	                        (rev 0)
+++ trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk	2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,13 @@
+SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)
+
+# $SPECIES is provided to make in the call from /etc/getData.conf.d/RefSeq.getData.
+
+get:
+	wget $(SHARED_WGET_OPTIONS) ftp://ftp.ncbi.nih.gov/refseq/$(SPECIES)/mRNA_Prot/*ff.gz
+
+unpack:
+	for file in *ff.gz ; do zcat $$file > `basename $$file .gz` ; done
+
+blast:
+
+emboss:
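
The `unpack` recipe above doubles each `$` because make would otherwise expand `$f` itself before the shell sees the loop. Run directly in a shell, the same loop uses a single `$`. This is a minimal sketch on throwaway data, not part of the commit: the scratch directory and file names are hypothetical, `gzip -dc` stands in for `zcat`, and `$(...)` with quoting replaces the backticks.

```shell
#!/bin/sh
# Hypothetical scratch directory with two small stand-in archives.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' | gzip -c > "$workdir/mouse.rna.gbff.gz"
printf '>seq2\nMKV\n'  | gzip -c > "$workdir/mouse.protein.gpff.gz"

# Same loop as the unpack recipe, written for a plain shell:
# decompress each *ff.gz next to its archive and keep the archive.
for file in "$workdir"/*ff.gz ; do
    gzip -dc "$file" > "$workdir/$(basename "$file" .gz)"
done

# Count the files that no longer carry the .gz suffix.
unpacked=$(ls "$workdir" | grep -c -v '\.gz$')
first_line=$(head -n 1 "$workdir/mouse.rna.gbff")
echo "unpacked=$unpacked first_line=$first_line"
rm -rf "$workdir"
```

Keeping the `.gz` archives in place, as the recipe does, lets a later `wget` run with timestamping options compare them against the server instead of downloading everything again.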

Modified: trunk/community/infrastructure/getData/getData.conf.d/dog.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/dog.getData	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/dog.getData	2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,4 +1,4 @@
-print STDERR "Reading Canis lupus familiaris configuration file\n";
+print STDERR "Reading Canis lupus familiaris configuration file\n" if $verbose;
 
 $toBeMirrored{"dog.genome"}={
   "name" => "CanFam2.0 - Dog Genome Sequencing Project",

Modified: trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk	2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,7 +1,7 @@
 SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)
 
 get:
-	wget $SHARED_WGET_OPTIONS ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.BROADD2.50.dna.chromosome.*.fa.gz
+	wget $(SHARED_WGET_OPTIONS) ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.BROADD2.50.dna.chromosome.*.fa.gz
 
 unpack:
 	for file in *chromosome.*.fa.gz ; do zcat $$file > `basename $$file .gz` ; done

Modified: trunk/community/infrastructure/getData/getData.conf.d/pdb.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/pdb.getData	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/pdb.getData	2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,5 +1,5 @@
 
-print STDERR "Reading PDB configuration file\n";
+print STDERR "Reading PDB configuration file\n" if $verbose;
 
 $toBeMirrored{"pdb"}={
   "name" => "PDB - protein structure database",

Modified: trunk/community/infrastructure/getData/getData.conf.d/rfam.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/rfam.getData	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/rfam.getData	2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,4 +1,4 @@
-print STDERR "Reading Canis lupus familiaris configuration file\n";
+print STDERR "Reading Rfam configuration file\n" if $verbose;
 
 $toBeMirrored{"dog.genome"}={
   "name" => "Rfam9.1 - Multiple alignments and covariance models of non-coding RNA families",

Deleted: trunk/community/infrastructure/getData/getData.txt
===================================================================
--- trunk/community/infrastructure/getData/getData.txt	2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.txt	2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,207 +0,0 @@
-NAME
-    getData - retrieves databases from the Internet
-
-SYNOPSIS
-    getData [ --mirrordir <path> ] <list of db names>
-
-    getData --list
-
-DESCRIPTION
-    Bioinformatics has the intrinsic problem of bringing biological data to
-    the end user. Astronomers have the equivalent problem, and particle
-    physicists, well, they have come up with (first) the web and (second)
-    the computational grids to address theirs. Debian helps with the
-    programs but will not provide such huge datasets that are also
-    frequently updated - not even in volatile.debian.org. Most
-    bioinformatics researchers will not need many such databases, and
-    even more will gladly continue to use public services remotely.
-
-    For those who need a set of databases on a regular basis, this script
-    shall be a start to automate the burden of downloading the data and
-    updating indices and the like. The world has seen such magic before
-    with the Lion Biosciences Prisma tool
-    (http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf), but how about
-    something simpler (as a start) that at least gets close to what we
-    desire and is Free? The aim must be to address the needs of all (most)
-    communities, not only of the bioinformatics world. The seed was hence
-    made with databases from astronomy.
-
-    Please contact the Debian-Med community if you consider this program to
-    be almost ready for your needs and explain what still needs to be added.
-    Public databases that you managed to integrate with this system are also
-    very warmly welcomed as feedback.
-
-OPTIONS
-    --help
-            this help
-
-    --man
-        Present a more detailed description in form of a man page.
-
-    --verbose
-        Say one or two words more than required.
-
-    --mirrordir <path>
-        Specifies the destination directory. The data will be mirrored to
-        the folder $mirrordir/$dbname/. Please be aware that this mirrordir
-        is not stored anywhere. The directory can consequently be moved to
-        an arbitrary location at any time, as long as the users of the data
-        are informed of the move.
-
-    --list
-        Lists all databases that may be requested to be installed.
-
-    <list of db names>
-        Only those databases that are explicitly requested to be downloaded
-        will be downloaded. Such databases may require considerable
-        bandwidth, so please make sure you know you are doing the right
-        thing.
-
-    --post
-        Perform only the unpacking/indexing, but do not retrieve/update the
-        databases. This option is considered useful when adding a new
-        database management system to the system, e.g. after installing
-        EMBOSS.
-
-    --source
-        Perform only the retrieval/update of the databases, but do not
-        unpack or index them. This option may be beneficial when the site
-        administrator is aware of current analyses that should not be
-        disturbed by the indexing process, while the download from the net
-        can already be started.
-
-    --confd <directory>
-        Allows for the specification of a directory in which multiple files
-        can be stored that will be read by getData upon its invocation.
-        These may add values to the global variable %toBeMirrored that
-        specifies the databases and their download scripts.
-
-    --config <system>
-        Preparation of the configuration file that would be required for a
-        particular system that deals with the database. The configuration is
-        printed to stdout and is expected to be copied manually to the
-        proper file or folder. One could imagine this process to be
-        automated, though this is not yet implemented. Currently available
-        is support for two systems:
-
-        emboss  This specifies the EMBOSS suite of tools for bioinformatics
-                (www.emboss.org) that is also available as a Debian package.
-                The configuration for the Uniprot databases will allow the
-                sequence retrieval with the seqret tool.
-
-        dre - ARC Grid Runtime Environment
-                Runtime environments (REs) are a concept of the ARC grid
-                middleware, about which more can be learned at
-                http://www.nordugrid.org. A script is needed to indicate the
-                presence of a runtime environment. The name of that script
-                matters, but getData cannot set it since it only writes to
-                stdout.
-
-        Unfortunately, the configuration has not yet been modularised.
-        It all needs to happen within the getData script
-        itself.
-
-    --remove <list of dbnames>
-        This command removes folders that store the data. In principle this
-        could be performed manually, though some databases may have special
-        requirements pre- or post-removal, which can be specified
-        individually for every database.
-
-SPECIFICATION OF DATABASES
-    Databases for download and their post-processing are specified at two
-    different locations. One is the getData script itself, the other is a
-    set of files stored in /etc/getData.d. Either will define elements of
-    a considerably large hash. The key is the identifier, which is also
-    shown by the 'getData --list' directive. The value is a reference to
-    another hash, which assigns values to all the properties that a
-    database has for its download and post-processing:
-
-    name - a human-readable pretty-printed name or short description that
-    makes clear to the world what this database is about.
-        A bad example is the mere assignment of "DE405", which few people
-        understand. A better example is "Pfam-A : Manually curated protein
-        families and domains, only the seed is presented.". One could argue
-        that one should have that field renamed to "description".
-
-    source - shell commands to perform the initial download and subsequent
-    updates
-        Commonly the wget tool is used for download. The little script
-        given here is executed underneath the mirrordir directory. One
-        simple example is "wget --mirror
-        ftp://ssd.jpl.nasa.gov/pub/eph/export/unix/unxp2[01]*.405". With
-        increasing proficiency in using wget, one is tempted to substitute
-        "--mirror" with "--recursive --no-host-directories --no-directories
-        --level 1 --no-parent".
-
-    post-download - shell commands to perform after the data has been
-    downloaded.
-        A simple example (unnecessary when the right flags are passed to
-        wget) is the mere setting of a symbolic link:
-
-          "post-download" => "ln -s ssd.jpl.nasa.gov/pub/eph/export/unix/unxp*.405 ."
-
-        Some more effort has been put into TrEMBL for the merging of
-        releases with subsequent updates and the indexing for EMBOSS:
-
-          "d=uncompressed; if [ ! -d \$d ]; then mkdir \$d; fi; "
-           ."rm -rf \$d/trembl.dat; "
-           ."(find ftp.ebi.ac.uk -name '*.dat.gz' | xargs -r zcat ) > \$d/trembl.dat; "
-           ."[ -x /usr/bin/dbxflat ] "
-           . "&& cd \$d && "
-           . "dbxflat -dbresource embl -dbname trembllocal -idformat swiss -filenames=trembl.dat -fields id,acc -auto",
-
-        The dots concatenate strings in Perl. This helps the readability
-        of the code. When writing these scripts, please be aware that
-        newlines do not separate the individual commands here. Semicolons
-        are required.
-
-    recommends - suggests a series of packages to be present for the use of
-    the database or the performance of the indexing.
-        This information is not used at the moment, partly to keep this
-        script useful for Linux distributions other than Debian.
-
-EXAMPLES
-    The following will list the identifiers and the descriptions of the
-    first 4 databases that are available via getData on your system.
-
-         ./getData --mirrordir=/local/databases/mirrored --list | head -n 4
-
-    To install any particular database, just give its name as an argument.
-    If the installation is performed in a directory other than the default,
-    then --mirrordir needs to be set again.
-
-         ./getData swiss.dat
-
-    To remove the database again, give the script a hint with the --remove
-    flag
-
-         ./getData --remove swiss.dat
-
-    To perform the indexing only and circumvent the download (attention,
-    this is dangerous since the index files will look newer than the
-    database is), do
-
-         ./getData --post swiss.dat
-
-    A special exception to these extra scripts is the --config flag in that
-    it takes a list of extra arguments. Each shall denote a particular
-    system that this database may be of interest for. There are today two
-    systems supported:
-
-TODO
-    We now need a mechanism with which packages can specify hooks that shall
-    be called upon an update of a database. But we cannot assume that every
-    indexing that can be performed because of the installation of some
-    package is also desired by the user. How to configure this properly is
-    left to be decided.
-
-SEE ALSO
-    http://debian-med.alioth.debian.org, http://wiki.debian.org/DebianMed,
-    /etc/getData.conf
-
-AUTHORS
-    This script was prepared by Steffen Moeller <moeller at debian.org> and
-    Charles Plessy <debian-no-spam at plessy.org> and is distributed under the
-    terms of the GNU General Public License (GPL). On Debian systems, this license
-    can be found under /usr/share/common-licenses/GPL.
-