[med-svn] r3130 - in trunk/community/infrastructure/getData: . debian getData.conf.d
plessy at alioth.debian.org
Thu Feb 19 10:16:14 UTC 2009
Author: plessy
Date: 2009-02-19 10:16:14 +0000 (Thu, 19 Feb 2009)
New Revision: 3130
Added:
trunk/community/infrastructure/getData/ChangeLog
trunk/community/infrastructure/getData/debian/compat
trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData
trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk
Removed:
trunk/community/infrastructure/getData/getData.txt
Modified:
trunk/community/infrastructure/getData/debian/control
trunk/community/infrastructure/getData/getData
trunk/community/infrastructure/getData/getData.conf.d/dog.getData
trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk
trunk/community/infrastructure/getData/getData.conf.d/pdb.getData
trunk/community/infrastructure/getData/getData.conf.d/rfam.getData
Log:
Added limited support for mouse RefSeq, and a changelog.
2009-02-19 Charles Plessy <plessy at debian.org>
* ChangeLog: Added. Let's follow the GNU coding standards.
http://www.gnu.org/prep/standards/html_node/Change-Logs.html
* getData.conf.d: Added support for mouse and human RefSeq.
* getData.pl: Removed human RefSeq.
* getData.conf.d/dog.getData, getData.conf.d/rfam.getData,
getData.conf.d/pdb.getData: Print on STDERR only if verbose.
* getData.conf.d/dog.getData.mk: added missing parenthesis around a
make variable.
Added: trunk/community/infrastructure/getData/ChangeLog
===================================================================
--- trunk/community/infrastructure/getData/ChangeLog (rev 0)
+++ trunk/community/infrastructure/getData/ChangeLog 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,12 @@
+2009-02-19 Charles Plessy <plessy at debian.org>
+
+ * ChangeLog: Added. Let's follow the GNU coding standards.
+ http://www.gnu.org/prep/standards/html_node/Change-Logs.html
+
+ * getData.conf.d: Added support for mouse and human RefSeq.
+ * getData.pl: Removed human RefSeq.
+
+ * getData.conf.d/dog.getData, getData.conf.d/rfam.getData,
+ getData.conf.d/pdb.getData: Print on STDERR only if verbose.
+ * getData.conf.d/dog.getData.mk: added missing parenthesis around a
+ make variable.
Added: trunk/community/infrastructure/getData/debian/compat
===================================================================
--- trunk/community/infrastructure/getData/debian/compat (rev 0)
+++ trunk/community/infrastructure/getData/debian/compat 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1 @@
+7
Modified: trunk/community/infrastructure/getData/debian/control
===================================================================
--- trunk/community/infrastructure/getData/debian/control 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/debian/control 2009-02-19 10:16:14 UTC (rev 3130)
@@ -2,7 +2,8 @@
Section: science
Priority: optional
Maintainer: Steffen Moeller <moeller at debian.org>
-Build-Depends: cdbs, debhelper (>= 5)
+Uploaders: Charles Plessy <plessy at debian.org>
+Build-Depends: cdbs, debhelper (>= 7)
Standards-Version: 3.8.0
Homepage: http://debian-med.alioth.debian.org
Modified: trunk/community/infrastructure/getData/getData
===================================================================
--- trunk/community/infrastructure/getData/getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -405,12 +405,6 @@
source => "wget $sharedWgetOptions http://www.reactome.org/download/interactions.README.txt http://www.reactome.org/download/current/homo_sapiens.interactions.txt.gz"
},
-# Proof-of-principle for RefSeq. Does not include everything.
- "refseq.hsa" => {
- name => "The NCBI Reference Sequence project - Homo sapiens",
- source => "wget $sharedWgetOptions ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/*ff.gz"
- },
-
"pfam-a" => {
name => "Pfam-A : Manually curated protein families and domains, only the seed is presented.",
source => "wget $sharedWgetOptions ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A-seed.gz"
Added: trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData (rev 0)
+++ trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,18 @@
+# Proof-of-principle for RefSeq. Does not include everything.
+print STDERR "Reading RefSeq configuration file\n" if $verbose;
+
+$toBeMirrored{"refseq.hsa"} = {
+ "name" => "The NCBI Reference Sequence project – Homo sapiens",
+ "tags" => ["human", "proteome", "transcriptome"],
+ "source" => "ln -s /etc/getData.conf.d/RefSeq.mk Makefile ; SPECIES=H_sapiens make get unpack",
+ "post-download" => "make emboss blast"
+ },
+
+$toBeMirrored{"refseq.mmu"} = {
+ "name" => "The NCBI Reference Sequence project – Mus musculus",
+ "tags" => ["mouse", "proteome", "transcriptome"],
+ "source" => "ln -s /etc/getData.conf.d/RefSeq.mk Makefile ; SPECIES=M_musculus make get unpack",
+ "post-download" => "make emboss blast"
+ },
+
+1;
Added: trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk (rev 0)
+++ trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,13 @@
+SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)
+
+# $SPECIES is provided to make in the call from /etc/getData.conf.d/RefSeq.getData.
+
+get:
+ wget $(SHARED_WGET_OPTIONS) ftp://ftp.ncbi.nih.gov/refseq/$(SPECIES)/mRNA_Prot/*ff.gz
+
+unpack:
+ for file in *ff.gz ; do zcat $$file > `basename $$file .gz` ; done
+
+blast:
+
+emboss:
Modified: trunk/community/infrastructure/getData/getData.conf.d/dog.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/dog.getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/dog.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,4 +1,4 @@
-print STDERR "Reading Canis lupus familiaris configuration file\n";
+print STDERR "Reading Canis lupus familiaris configuration file\n" if $verbose;
$toBeMirrored{"dog.genome"}={
"name" => "CanFam2.0 - Dog Genome Sequencing Project",
Modified: trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,7 +1,7 @@
SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)
get:
- wget $SHARED_WGET_OPTIONS ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.BROADD2.50.dna.chromosome.*.fa.gz
+ wget $(SHARED_WGET_OPTIONS) ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.BROADD2.50.dna.chromosome.*.fa.gz
unpack:
for file in *chromosome.*.fa.gz ; do zcat $$file > `basename $$file .gz` ; done
Modified: trunk/community/infrastructure/getData/getData.conf.d/pdb.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/pdb.getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/pdb.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,5 +1,5 @@
-print STDERR "Reading PDB configuration file\n";
+print STDERR "Reading PDB configuration file\n" if $verbose;
$toBeMirrored{"pdb"}={
"name" => "PDB - protein structure database",
Modified: trunk/community/infrastructure/getData/getData.conf.d/rfam.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/rfam.getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/rfam.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,4 +1,4 @@
-print STDERR "Reading Canis lupus familiaris configuration file\n";
+print STDERR "Reading Rfam configuration file\n" if $verbose;
$toBeMirrored{"dog.genome"}={
"name" => "Rfam9.1 - Multiple alignments and covariance models of non-coding RNA families",
Deleted: trunk/community/infrastructure/getData/getData.txt
===================================================================
--- trunk/community/infrastructure/getData/getData.txt 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.txt 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,207 +0,0 @@
-NAME
- getData - retrieves databases from the Internet
-
-SYNOPSIS
- getData [ --mirrordir <path> ] <list of db names>
-
- getData --list
-
-DESCRIPTION
- Bioinformatics has the intrinsic problem to bring the biological data to
- the end user. Astronomers have the equivalent problem and particle
- physicists, well, they have come up with (first) the web and (second)
- the computational grids to address their problems. Debian helps with the
- programs but will not provide such huge datasets that are even
- frequently updated - not even in volatile.debian.org. Most
- bioinformatics researchers will not need too many of such databases. And
- even more so will gladly continue in using public services remotely.
-
- For those who need a set of databases on a regular basis, this script
- shall be a start to automate the burden to download the data and update
- indices and the like. The world has seen such magic before with the Lion
- Biosciences Prisma tool
- (http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf) but how about
- something simpler (as a start) that at least gets close to what we
- desire and is Free. The aim must be to address the needs of all (most)
- communities, not only of the bioinformatics world. The seed was hence
- made with databases from astronomy.
-
- Please contact the Debian-Med community if you consider this program to
- be almost ready for your needs and explain what still needs to be added.
- Public databases that you managed to integrate with this system are also
- very warmly welcomed as feedback.
-
-OPTIONS
- --help
- this help
-
- --man
- Present a more detailed description in form of a man page.
-
- --verbose
- Say one or two words more than required.
-
- --mirrordir <path>
- Specifies destination directory. The data will be mirrored to the
- folder $mirrordir/$dbname/. Please be aware that this mirrordir is
- nowhere stored. The directory can consequently be moved to arbitrary
- locations at any time, if the users of the data are only informed
- about that moving.
-
- --list
- Lists all databases that may be requested to be installed.
-
- <list of db names>
- Only those databases that are explicitly requested to be downloaded
- will be downloaded. Such databases may require considerable
- bandwidth, so please make sure you know you are doing the right
- thing.
-
- --post
- Perform only the unpacking/indexing, but do not retrieve/update the
- databases. This option is considered useful when adding a new
- database management system to the system, e.g. after installing
- EMBOSS.
-
- --source
- Perform only the retrieval/update of the databases, but do not
- unpack/index them. This option may be beneficial when the site administrator
- is aware of current analyses that should not be disturbed by the
- indexing process but the downloading from the net can already be
- started.
-
- --confd <directory>
- Allows for the specification of a directory in which multiple files
- can be stored that will be read by getData upon its invocation.
- These may add values to the global variable %toBeMirrored that
- specifies the databases and their download scripts.
-
- --config <system>
- Preparation of the configuration file that would be required for a
- particular system that deals with the database. The configuration is
- printed to stdout and is expected to be copied manually to the
- proper file or folder. One could imagine this process to be
- automated, though this is not yet implemented. Currently available
- is support for two systems:
-
- emboss This specifies the EMBOSS suite of tools for bioinformatics
- (www.emboss.org) that is also available as a Debian package.
- The configuration for the Uniprot databases will allow the
- sequence retrieval with the seqret tool.
-
- dre - ARC Grid Runtime Environment
- Runtime environments (REs) are a concept of the ARC grid
- middleware of which more can be learned on
- http://www.nordugrid.org. A script is needed to indicate the
- presence of a runtime environment. Here, the name of the
- script is important, which is not definable by getData
- though since it only writes to stdout.
-
- Unfortunately, no way has yet been found to have the configuration
- modularised. It all needs to happen within the getData script
- itself.
-
- --remove <list of dbnames>
- This command removes folders that store the data. In principle this
- could be performed manually, though some databases may have special
- requirements pre- or post-removal, which can be specified
- individually for every database.
-
-SPECIFICATION OF DATABASES
- Databases for download and their post-processing are specified at two
- different locations. One is the getData script itself, the other are
- files stored in /etc/getData.d. Either will define elements of a
- considerably large hash. The key is the identifier which is also shown
- by the 'getData --list' directive. The value is a reference to another
- hash, which assigns values to all the properties that a database has for
- its download and post-processing:
-
- name - a human-readable pretty-printed name or short description that
- makes clear to the world what this database is about.
- A bad example is the mere assignment of "DE405", which few people
- understand. A better example is "Pfam-A : Manually curated protein
- families and domains, only the seed is presented.". One could argue
- that one should have that field renamed to "description".
-
- source - shell commands to perform the initial download and subsequent
- updates
- Commonly the wget tool is used for download. The little script given
- here is executed underneath the mirrordir directory. One
- simple example is "wget --mirror
- ftp://ssd.jpl.nasa.gov/pub/eph/export/unix/unxp2[01]*.405". With
- increasing proficiency in using wget, one is tempted to substitute
- "--mirror" with "--recursive --no-host-directories --no-directories
- --level 1 --no-parent".
-
- post-download - shell commands to perform after the data has been
- downloaded.
- A simple (and unnecessary when the right flags are passed to wget) example
- is the mere setting of a symbolic link:
-
- "post-download" => "ln -s ssd.jpl.nasa.gov/pub/eph/export/unix/unxp*.405 ."
-
- Some more effort has been put into TrEMBL for the merging of
- releases with subsequent updates and the indexing for EMBOSS:
-
- "d=uncompressed; if [ ! -d \$d ]; then mkdir \$d; fi; "
- ."rm -rf \$d/trembl.dat; "
- ."(find ftp.ebi.ac.uk -name '*.dat.gz' | xargs -r zcat ) > \$d/trembl.dat; "
- ."[ -x /usr/bin/dbxflat ] "
- . "&& cd \$d && "
- . "dbxflat -dbresource embl -dbname trembllocal -idformat swiss -filenames=trembl.dat -fields id,acc -auto",
-
- The dots are connecting strings in Perl. This helps the readability
- of the code. When writing these scripts, please be aware that
- newlines don't separate the individual commands here. Semicolons are
- required.
-
- recommends - suggests a series of packages to be present for the use of
- the database or the performance of the indexing.
- This information is not used at the moment, also to render this
- script more useful for other Linux distributions than Debian.
-
-EXAMPLES
- The following will list the identifiers and the descriptions of the
- first 4 databases that are available via getData on your system.
-
- ./getData --mirrordir=/local/databases/mirrored --list | head -n 4
-
- To install any particular database, only give its name as an argument.
- If the installation is performed at another directory than the default,
- then the --mirrordir needs again to be set.
-
- ./getData swiss.dat
-
- To remove the database again, give the script a hint with the --remove
- flag
-
- ./getData --remove swiss.dat
-
- To perform the indexing only and circumvent the download (attention,
- this is dangerous since the index files will look newer than the
- database is), do
-
- ./getData --post swiss.dat
-
- A special exception to these extra scripts is the --config flag in that
- it takes a list of extra arguments. Each shall denote a particular
- system that this database may be of interest for. There are today two
- systems supported:
-
-TODO
- We now need a mechanism with which packages can specify hooks that shall
- be called upon an update of a database. But we cannot assume that every
- indexing that can be performed because of the installation of some
- package is also desired by the user. How to configure this properly is
- left to be decided.
-
-SEE ALSO
- http://debian-med.alioth.debian.org, http://wiki.debian.org/DebianMed,
- /etc/getData.conf
-
-AUTHORS
- This script was prepared by Steffen Moeller <moeller at debian.org> and
- Charles Plessy <debian-no-spam at plessy.org> and is distributed under the
- terms of the GNU General Public License (GPL). On Debian systems, this license
- can be found under /usr/share/common-licenses/GPL.
-
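
Note on the unpack recipes: the "unpack" targets in RefSeq.mk and dog.getData.mk run a shell loop in which make's "$$file" reaches the shell as "$file". The following is only a sketch of that loop as a standalone shell function, not part of this commit. The function name unpack_ff and its directory argument are hypothetical, and gzip -dc is used in place of zcat (the two should behave the same on gzip files).

```shell
#!/bin/sh
# Sketch of the "unpack" recipe from RefSeq.mk as a plain shell function.
# unpack_ff and its directory argument are illustrative names, not part of getData.
unpack_ff() {
    dir=${1:-.}    # directory holding the downloaded *ff.gz files
    for file in "$dir"/*ff.gz; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$file" ] || continue
        # Write the decompressed data next to the archive and keep the .gz file,
        # as the recipe does with: zcat $$file > `basename $$file .gz`
        gzip -dc "$file" > "$dir/$(basename "$file" .gz)"
    done
}
```

Quoting "$file" keeps the loop safe for file names containing spaces, which the original recipe does not guard against.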