[med-svn] [Git][med-team/ncbi-entrez-direct][upstream] 9 commits: New upstream version 10.5.20181205+ds
Aaron M. Ucko
gitlab@salsa.debian.org
Thu Feb 7 03:54:54 GMT 2019
Aaron M. Ucko pushed to branch upstream at Debian Med / ncbi-entrez-direct
Commits:
566f983f by Aaron M. Ucko at 2019-02-07T02:17:56Z
New upstream version 10.5.20181205+ds
- - - - -
11778b93 by Aaron M. Ucko at 2019-02-07T02:22:25Z
New upstream version 10.6.20181210+ds
- - - - -
f8c4a9ae by Aaron M. Ucko at 2019-02-07T03:04:42Z
New upstream version 10.6.20190103+ds
- - - - -
12290890 by Aaron M. Ucko at 2019-02-07T03:10:04Z
New upstream version 10.7.20190104+ds
- - - - -
1ce3c44a by Aaron M. Ucko at 2019-02-07T03:12:12Z
New upstream version 10.7.20190114+ds
- - - - -
4357532b by Aaron M. Ucko at 2019-02-07T03:15:12Z
New upstream version 10.8.20190117+ds
- - - - -
093bb44f by Aaron M. Ucko at 2019-02-07T03:25:17Z
New upstream version 10.8.20190128+ds
- - - - -
492cba4e by Aaron M. Ucko at 2019-02-07T03:38:28Z
New upstream version 10.9.20190131+ds
- - - - -
1c42e99e by Aaron M. Ucko at 2019-02-07T03:39:48Z
New upstream version 10.9.20190205+ds
- - - - -
20 changed files:
- README
- archive-pubmed
- edirect.pl
- index-pubmed
- − local-phrase-search
- nquire
- − pm-clean
- − pm-current
- − pm-erase
- pm-index
- − pm-log
- − pm-repack
- pm-stash
- − pm-uids
- − pm-verify
- rchive.go
- setup-deps.pl
- test-pubmed-index
- transmute
- xtract.go
Changes:
=====================================
README
=====================================
@@ -1,40 +1,213 @@
-ENTREZ DIRECT - README
+ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES
-Entrez Direct (EDirect) is an advanced method for accessing the NCBI's set of interconnected Entrez databases (publication, nucleotide, protein, structure, gene, variation, expression, etc.) from a terminal window. It uses command-line arguments for the query terms and combines individual operations with UNIX pipes.
+Searching, retrieving, and parsing data from NCBI databases through the Unix command line.
-EDirect also provides an argument-driven function that simplifies the extraction of data from document summaries or other results that are returned in XML format. Queries can move seamlessly between EDirect commands and UNIX utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
+INTRODUCTION
-EDirect consists of a set of scripts that are downloaded to the user's computer. If you extract the archive in your home directory, you may need to enter:
+Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (biomedical literature, nucleotide and protein sequence, molecular structure, gene, genome assembly, gene expression, clinical variation, etc.) from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
- PATH=$PATH:$HOME/edirect
+EDirect also includes an argument-driven function that simplifies the extraction of data from document summaries or other results that are in structured XML format. This can eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
-in a terminal window to temporarily add EDirect functions to the PATH environment variable so they can be run by name. You can then try EDirect by copying the sample query below and pasting it into the terminal window for execution:
+PROGRAMMATIC ACCESS
- esearch -db pubmed -query "Beadle AND Tatum AND Neurospora" |
+Several underlying network services provide access to different facets of Entrez. These include searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports. The same functionalities are available on the web or when using programmatic methods.
+
+EDirect navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.
+
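+The message itself is a small XML object. For example, the output of an esearch step has the general shape shown below (values abbreviated here for illustration):
+
+   <ENTREZ_DIRECT>
+     <Db>pubmed</Db>
+     <WebEnv>...</WebEnv>
+     <QueryKey>1</QueryKey>
+     <Count>232</Count>
+   </ENTREZ_DIRECT>
+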
+All EDirect commands are designed to work on large sets of data. There is no need to write a script to loop over records one at a time. Intermediate results are stored on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile file:
+
+ export NCBI_API_KEY=user_api_key_goes_here
+
+Each program also has a -help command that prints detailed information about available arguments.
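+
+For example, to list the arguments accepted by efetch:
+
+   efetch -help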
+
+NAVIGATION FUNCTIONS
+
+Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query to obtain the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:
+
+ esearch -db pubmed -query "selective serotonin reuptake inhibitor"
+
+Search terms can also be qualified with bracketed field names:
+
+ esearch -db nucleotide -query "insulin [PROT] AND rodents [ORGN]"
+
+Elink looks up precomputed neighbors within a database, or finds associated records in other databases:
+
+ elink -related
+
+ elink -target gene
+
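+Elink reads the message produced by an earlier step. For example, in this illustrative two-step query (gene symbol chosen for illustration), papers about a human gene are found by linking from the gene database to PubMed:
+
+   esearch -db gene -query "BCO1 [GENE] AND human [ORGN]" |
+   elink -target pubmed
+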
+Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:
+
+ efilter -molecule genomic -location chloroplast -country sweden
+
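+Efilter normally follows another step in a pipeline, as in this illustrative query for a chloroplast gene:
+
+   esearch -db nucleotide -query "rbcL [GENE]" |
+   efilter -molecule genomic -location chloroplast
+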
+Efetch downloads selected records or reports in a designated format:
+
+ efetch -format abstract
+
+ENTREZ EXPLORATION
+
+Individual query commands are connected by a Unix vertical bar pipe symbol:
+
+ esearch -db pubmed -query "transposition immunity" | efetch -format medline
+
+PubMed related articles are calculated by a statistical algorithm using the title, abstract, and medical subject headings (MeSH terms). These connections between papers can be used for knowledge discovery.
+
+Lycopene cyclase converts lycopene to beta-carotene, the immediate precursor of vitamin A. An initial search on the enzyme results in 232 articles. Looking up precomputed neighbors returns 14,387 PubMed papers, some of which might be expected to discuss adjacent steps in the biosynthetic pathway:
+
+ esearch -db pubmed -query "lycopene cyclase" |
elink -related |
- efilter -query "NOT historical article [FILT]" |
- efetch -format docsum |
- xtract -pattern DocumentSummary -if Author -and Title \
- -element Id -first "Author/Name" -element Title |
- grep -i -e enzyme -e synthesis |
- sort -t $'\t' -k 2,3f |
- column -s $'\t' -t |
- head -n 10 |
- cut -c 1-80
-
-This query returns the PubMed ID, first author name, and article title for PubMed "neighbors" (related citations) of the original publications. It then requires specific words in the resulting rows, sorts alphabetically by author name and title, aligns the columns, and truncates the lines for easier viewing:
-
- 2960822 Anton IA A eukaryotic repressor protein, the qa-1S gene prod
- 5264137 Arroyo-Begovich A In vitro formation of an active multienzyme complex
- 14942736 BONNER DM Gene-enzyme relationships in Neurospora.
- 5361218 Caroline DF Pyrimidine synthesis in Neurospora crassa: gene-enz
- 123642 Case ME Genetic evidence on the organization and action of
+ elink -target protein |
+ efilter -organism mouse |
+ efetch -format fasta
+
+Linking to the protein database finds 251,887 sequence records, each of which has standardized organism information from the NCBI taxonomy. Limiting to proteins in mice returns 39 records. (Animals do not encode the genes involved in carotene biosynthesis.) Records are then retrieved in FASTA format. As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:
+
+ ...
+ >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
+ MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
+ SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
+ TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
+ KGKSPVKHAEVFCSISSRSLLSPSYYHSFGVTENYVVFLEQPFKLDILKMATAYMRGVSWASCMSFDRED
+ KTYIHIIDQRTRKPVPTKFYTDPMVVFHHVNAYEEDGCVLFDVIAYEDSSLYQLFYLANLNKDFEEKSRL
+ TSVPTLRRFAVPLHVDKDAEVGSNLVKVSSTTATALKEKDGHVYCQPEVLYEGLELPRINYAYNGKPYRY
+ IFAAEVQWSPVPTKILKYDILTKSSLKWSEESCWPAEPLFVPTPGAKDEDDGVILSAIVSTDPQKLPFLL
+ ILDAKSFTELARASVDADMHLDLHGLFIPDADWNAVKQTPAETQEVENSDHPTDPTAPELSHSENDFTAG
+ HGGSSL
+ ...
+
+STRUCTURED DATA EXTRACTION
+
+The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element.
+
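+A minimal illustrative example, returning one row per document summary:
+
+   efetch -db pubmed -id 1413997 -format docsum |
+   xtract -pattern DocumentSummary -element Id Title
+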
+Formatting arguments allow extensive customization of the output. The line break between -pattern objects can be changed with -ret, and the tab character between -element fields can be replaced by -tab.
+
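+For instance, replacing the tab with a newline prints each field of this illustrative query on its own line:
+
+   efetch -db pubmed -id 1413997 -format docsum |
+   xtract -pattern DocumentSummary -tab "\n" -element Id Title
+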
+The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab argument. The -sep value also applies to unrelated -element arguments that are grouped with commas. The query:
+
+ efetch -db pubmed -id 6271474,1413997,16589597 -format docsum |
+ xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name
+
+returns a table with individual author names separated by vertical bars:
+
+ 6271474 1981 Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
+ 1413997 1992 Oct Mortimer RK|Contopoulou CR|King JS
+ 16589597 1954 Dec Garber ED
+
+Selection arguments are specialized derivatives of -element. Among these are positional commands (-first and -last) and numeric processing operations (including -num, -len, -sum, -min, -max, and -avg). There are also functions that perform sequence coordinate conversion (-0-based, -1-based, and -ucsc-based).
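+
+For example, -num can count the authors of each paper while -first and -last select single names from the list (an illustrative query):
+
+   efetch -db pubmed -id 6271474,1413997 -format docsum |
+   xtract -pattern DocumentSummary -element Id -num Name -first Name -last Name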
+
+NESTED EXPLORATION
+
+Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, visiting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.
+
+PubmedArticle XML contains the MeSH terms applied to a publication. Each MeSH term can have its own unique set of qualifiers. A single level of nested exploration within the current pattern:
+
+ esearch -db gene -query "beta-carotene oxygenase 1" -organism human |
+ elink -target pubmed | efilter -released last_year | efetch -format xml |
+ xtract -pattern PubmedArticle -element MedlineCitation/PMID \
+ -block MeshHeading \
+ -pfc "\n" -sep "/" -element DescriptorName,QualifierName
+
+retains the proper association of subheadings for each MeSH term:
+
+ 30396924
+ Age Factors
+ Animals
+ Cell Cycle Proteins/deficiency/genetics/metabolism
+ Cellular Senescence/physiology
+ ...
+
+CONDITIONAL EXECUTION
+
+Conditional processing arguments (-if and -unless) restrict exploration by object name and value. These may be used in conjunction with string or numeric constraints:
+
+ esearch -db pubmed -query "Casadaban MJ [AUTH]" |
+ efetch -format xml |
+ xtract -pattern PubmedArticle -if "#Author" -lt 6 \
+ -block Author -if LastName -is-not Casadaban \
+ -sep ", " -tab "\n" -element LastName,Initials |
+ sort-uniq-count-rank
+
+to select papers with fewer than 6 authors and print a table of the most frequent coauthors:
+
+ 11 Chou, J
+ 8 Cohen, SN
+ 7 Groisman, EA
+ ...
+
+SAVING DATA IN VARIABLES
+
+A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:
+
+ efetch -db pubmed -id 3201829,6301692,781293 -format xml |
+ xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
+ -block Author -element "&PMID" \
+ -sep " " -tab "\n" -element Initials,LastName
+
+producing a list of authors, with the PubMed Identifier (PMID) in the first column of each row:
+
+ 3201829 JR Johnston
+ 3201829 CR Contopoulou
+ 3201829 RK Mortimer
+ 6301692 MA Krasnow
+ 6301692 NR Cozzarelli
+ 781293 MJ Casadaban
+
+The variable can be used even though the original object is no longer visible inside the -block section.
+
+SEQUENCE QUALIFIERS
+
+The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).
+
+The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function is provided for generating the appropriate nested extraction commands from feature and qualifier names on the command line. Processing the results of a search on cone snail venom:
+
+ esearch -db protein -query "conotoxin" -feature mat_peptide |
+ efetch -format gpc |
+ xtract -insd complete mat_peptide "%peptide" product peptide |
+ grep -i conotoxin | sort -t $'\t' -u -k 2,2n
+
+returns the accession, length, name, and sequence for a sample of neurotoxic peptides:
+
+ ADB43131.1 15 conotoxin Cal 1b LCCKRHHGCHPCGRT
+ ADB43128.1 16 conotoxin Cal 5.1 DPAPCCQHPIETCCRR
+ AIC77105.1 17 conotoxin Lt1.4 GCCSHPACDVNNPDICG
+ ADB43129.1 18 conotoxin Cal 5.2 MIQRSQCCAVKKNCCHVG
+ ADD97803.1 20 conotoxin Cal 1.2 AGCCPTIMYKTGACRTNRCR
+ AIC77085.1 21 conotoxin Bt14.8 NECDNCMRSFCSMIYEKCRLK
+ ADB43125.1 22 conotoxin Cal 14.2 GCPADCPNTCDSSNKCSPGFPG
+ AIC77154.1 23 conotoxin Bt14.19 VREKDCPPHPVPGMHKCVCLKTC
...
-EDirect will run on UNIX and Macintosh computers that have the Perl language installed, and under the Cygwin UNIX-emulation environment on Windows PCs.
+INSTALLATION
+
+EDirect consists of a set of scripts and programs that are downloaded to the user's computer.
+
+EDirect will run on Unix and Macintosh computers that have the Perl language installed, and under the Cygwin Unix-emulation environment on Windows PCs.
+
+To install the EDirect software, copy the following commands and paste them into a terminal window:
+
+ cd ~
+ /bin/bash
+ perl -MNet::FTP -e \
+ '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
+ $ftp->login; $ftp->binary;
+ $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
+ gunzip -c edirect.tar.gz | tar xf -
+ rm edirect.tar.gz
+ builtin exit
+ export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
+ ./edirect/setup.sh
+
+This downloads several scripts into an "edirect" folder in the user's home directory. The setup.sh script then downloads any missing Perl modules, and may print an additional command for updating the PATH environment variable in the user's configuration file. Copy that command, if present, and paste it into the terminal window to complete the installation process. The editing instructions will look something like:
+
+ echo "export PATH=\$PATH:\$HOME/edirect" >> $HOME/.bash_profile
+
+DOCUMENTATION
Documentation for EDirect is on the web at:
http://www.ncbi.nlm.nih.gov/books/NBK179288
-Questions or comments on EDirect may be sent to eutilities@ncbi.nlm.nih.gov.
+Information on how to obtain an API Key is given in this NCBI blog post:
+
+ https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities
+
+Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.
=====================================
archive-pubmed
=====================================
@@ -21,6 +21,7 @@ if [ "$#" -gt 0 ]
then
target="$1"
MASTER=$(cd "$target" && pwd)
+ CONFIG=${MASTER}
shift
else
if [ -z "${EDIRECT_PUBMED_MASTER}" ]
@@ -70,7 +71,7 @@ do
mkdir -p "$MASTER/$dir"
done
-for dir in Current Indexed Inverted Merged Pubmed
+for dir in Indexed Inverted Merged Pubmed
do
mkdir -p "$WORKING/$dir"
done
@@ -98,3 +99,20 @@ fetch-pubmed -path "$MASTER/Archive" |
xtract -pattern Author -if Affiliation -contains Medicine \
-pfx "Archive is " -element Initials
echo ""
+
+if [ -n "$CONFIG" ]
+then
+ target=bash_profile
+ if ! grep "$target" "$HOME/.bashrc" >/dev/null 2>&1
+ then
+ if [ ! -f $HOME/.$target ] || grep 'bashrc' "$HOME/.$target" >/dev/null 2>&1
+ then
+ target=bashrc
+ fi
+ fi
+ echo ""
+ echo "For convenience, please execute the following to save the archive path to a variable:"
+ echo ""
+ echo " echo \"export EDIRECT_PUBMED_MASTER='${CONFIG}'\" >>" "\$HOME/.$target"
+ echo ""
+fi
=====================================
edirect.pl
=====================================
@@ -43,7 +43,7 @@ use File::Spec;
# EDirect version number
-$version = "10.5";
+$version = "10.9";
BEGIN
{
@@ -116,6 +116,7 @@ sub clearflags {
$batch = false;
$chr_start = -1;
$chr_stop = -1;
+ $class = "";
$clean = false;
$cmd = "";
$compact = false;
@@ -143,6 +144,7 @@ sub clearflags {
$http = "";
$id = "";
$input = "";
+ $internal = false;
$journal = "";
$json = false;
$just_num = false;
@@ -170,6 +172,7 @@ sub clearflags {
$query = "";
$raw = false;
$related = false;
+ $result = 0;
$rldate = 0;
$seq_start = 0;
$seq_stop = 0;
@@ -192,6 +195,7 @@ sub clearflags {
$verbose = false;
$volume = "";
$web = "";
+ $released = "";
$word = false;
$year = "";
@@ -210,10 +214,20 @@ sub clearflags {
$api_key = "";
$api_key = $ENV{NCBI_API_KEY} if defined $ENV{NCBI_API_KEY};
+
+ $abbrv_flag = false;
+ if (defined $ENV{EDIRECT_DO_AUTO_ABBREV} && $ENV{EDIRECT_DO_AUTO_ABBREV} eq "true" ) {
+ $abbrv_flag = true;
+ }
}
sub do_sleep {
+ if ( $internal ) {
+ Time::HiRes::usleep(1000);
+ return;
+ }
+
if ( $api_key ne "" ) {
if ( $log ) {
print STDERR "sleeping 1/10 second\n";
@@ -360,8 +374,17 @@ sub read_aliases {
sub adjust_base {
+ if ( $basx ne "" ) {
+ $internal = false;
+ }
+
if ( $basx eq "" ) {
+ if ( $internal ) {
+ $base = "https://eutils-internal.ncbi.nlm.nih.gov/entrez/eutils/";
+ return;
+ }
+
# if base not overridden, check URL of previous query, stick with main or colo site,
# since history server data for EUtils does not copy between locations, by design
@@ -512,12 +535,14 @@ sub get_count {
$output = get ($url);
if ( ! defined $output ) {
- print STDERR "Failure of '$url'\n";
+ print STDERR "Failure of get_count '$url'\n";
+ $result = 1;
return "", "";
}
if ( $output eq "" ) {
print STDERR "No get_count output returned from '$url'\n";
+ $result = 1;
return "", ""
}
@@ -531,10 +556,12 @@ sub get_count {
if ( $errx ne "" ) {
close (STDOUT);
+ $result = 1;
die "ERROR in count output: $errx\nURL: $url\n\n";
}
if ( $numx eq "" ) {
+ $result = 1;
die "Count value not found in count output - WebEnv $webx\n";
}
@@ -607,7 +634,8 @@ sub get_uids {
if ( defined $data ) {
$keep_trying = false;
} else {
- print STDERR "Failure of '$url'\n";
+ print STDERR "Failure of get_uids '$url'\n";
+ $result = 1;
}
}
if ( $keep_trying ) {
@@ -687,12 +715,14 @@ sub do_post_yielding_ref {
$rslt = get ($urlx);
if ( ! defined $rslt ) {
- print STDERR "Failure of '$urlx'\n";
+ print STDERR "Failure of do_get '$urlx'\n";
+ $result = 1;
return "";
}
if ( $rslt eq "" ) {
print STDERR "No do_get output returned from '$urlx'\n";
+ $result = 1;
return "";
}
@@ -717,7 +747,15 @@ sub do_post_yielding_ref {
if ( $res->is_success) {
$rslt = $res->content_ref;
} else {
- print STDERR $res->status_line . "\n";
+ $stts = $res->status_line;
+ print STDERR $stts . "\n";
+ if ( $stts eq "429 Too Many Requests" ) {
+ if ( $api_key eq "" ) {
+ print STDERR "PLEASE REQUEST AN API_KEY FROM NCBI\n";
+ } else {
+ print STDERR "TOO MANY REQUESTS EVEN WITH API_KEY\n";
+ }
+ }
}
if ( $$rslt eq "" ) {
@@ -1052,12 +1090,32 @@ sub write_edirect {
# wrapper to detect command line errors
+my $abbrev_help = qq{
+ To enable argument auto abbreviation resolution, run:
+
+ export EDIRECT_DO_AUTO_ABBREV="true"
+
+ in the terminal, or add that line to your .bash_profile configuration file.
+
+};
+
sub MyGetOptions {
my $help_msg = shift @_;
+ if ( $abbrv_flag ) {
+ Getopt::Long::Configure("auto_abbrev");
+ } else {
+ Getopt::Long::Configure("no_auto_abbrev");
+ }
+
if ( !GetOptions(@_) ) {
- die $help_msg;
+ if ( $abbrv_flag ) {
+ die $help_msg;
+ } else {
+ print $help_msg;
+ die $abbrev_help;
+ }
} elsif (@ARGV) {
die ("Entrez Direct does not support positional arguments.\n"
. "Please remember to quote parameter values containing\n"
@@ -1089,6 +1147,7 @@ sub ecntc {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -1151,8 +1210,10 @@ Spell Check
Publication Filters
-pub abstract, clinical, english, free, historical,
- journal, last_week, last_month, last_year,
- medline, preprint, published, review, structured
+ journal, medline, preprint, published, review,
+ structured
+ -journal pnas, "j bacteriol", ...
+ -released last_week, last_month, last_year, prev_years
Sequence Filters
@@ -1170,6 +1231,11 @@ Gene Filters
-status alive
-type coding, pseudo
+SNP Filters
+
+ -class acceptor, donor, frameshift, indel, intron,
+ missense, nonsense, synonymous
+
Miscellaneous Arguments
-label Alias for query step
@@ -1180,6 +1246,8 @@ sub process_extras {
my $frst = shift (@_);
my $publ = shift (@_);
+ my $rlsd = shift (@_);
+ my $jrnl = shift (@_);
my $ctry = shift (@_);
my $fkey = shift (@_);
my $locn = shift (@_);
@@ -1188,8 +1256,11 @@ sub process_extras {
my $sorc = shift (@_);
my $stat = shift (@_);
my $gtyp = shift (@_);
+ my $clss = shift (@_);
$publ = lc($publ);
+ $rlsd = lc($rlsd);
+ $jrnl = lc($jrnl);
$ctry = lc($ctry);
$fkey = lc($fkey);
$bmol = lc($bmol);
@@ -1198,6 +1269,7 @@ sub process_extras {
$sorc = lc($sorc);
$stat = lc($stat);
$gtyp = lc($gtyp);
+ $clss = lc($clss);
%pubHash = (
'abstract' => 'has abstract [FILT]',
@@ -1219,6 +1291,15 @@ sub process_extras {
'trial' => 'clinical trial [FILT]',
);
+ %releasedHash = (
+ 'last_month' => 'published last month [FILT]',
+ 'last month' => 'published last month [FILT]',
+ 'last_week' => 'published last week [FILT]',
+ 'last week' => 'published last week [FILT]',
+ 'last_year' => 'published last year [FILT]',
+ 'last year' => 'published last year [FILT]',
+ );
+
@featureArray = (
"-10_signal",
"-35_signal",
@@ -1362,6 +1443,17 @@ sub process_extras {
'viruses' => 'viruses [FILT]',
);
+ %snpHash = (
+ 'acceptor' => 'splice acceptor variant [FXN]',
+ 'donor' => 'splice donor variant [FXN]',
+ 'frameshift' => 'frameshift [FXN]',
+ 'indel' => 'cds indel [FXN]',
+ 'intron' => 'intron [FXN]',
+ 'missense' => 'missense [FXN]',
+ 'nonsense' => 'nonsense [FXN]',
+ 'synonymous' => 'synonymous codon [FXN]',
+ );
+
%sourceHash = (
'ddbj' => 'srcdb ddbj [PROP]',
'embl' => 'srcdb embl [PROP]',
@@ -1387,34 +1479,62 @@ sub process_extras {
my @working = ();
- my $suffix = "";
+ my $suffix1 = "";
+ my $suffix2 = "";
+
+ my $is_published = false;
+ my $is_prev_year = false;
if ( $frst ne "" ) {
push (@working, $frst);
}
if ( $publ ne "" ) {
- if ( defined $pubHash{$publ} ) {
- $val = $pubHash{$publ};
+ # -pub can use comma-separated list
+ my @pbs = split (',', $publ);
+ foreach $pb (@pbs) {
+ if ( defined $pubHash{$pb} ) {
+ $val = $pubHash{$pb};
+ push (@working, $val);
+ } elsif ( $pb eq "published" ) {
+ $is_published = true;
+ } else {
+ die "\nUnrecognized -pub argument '$pb', use efilter -help to see available choices\n\n";
+ }
+ }
+ }
+
+ if ( $rlsd ne "" ) {
+ if ( defined $releasedHash{$rlsd} ) {
+ $val = $releasedHash{$rlsd};
push (@working, $val);
- } elsif ( $publ eq "published" ) {
- $suffix = "published";
+ } elsif ( $rlsd eq "prev_years" ) {
+ $is_prev_year = true;
} else {
- die "\nUnrecognized -pub argument '$publ', use efilter -help to see available choices\n\n";
+ die "\nUnrecognized -released argument '$rlsd', use efilter -help to see available choices\n\n";
}
}
+ if ( $jrnl ne "" ) {
+ $val = $jrnl . " [JOUR]";
+ push (@working, $val);
+ }
+
if ( $ctry ne "" ) {
$val = "country " . $ctry . " [TEXT]";
push (@working, $val);
}
if ( $fkey ne "" ) {
- if ( grep( /^$fkey$/, @featureArray ) ) {
- $val = $fkey . " [FKEY]";
- push (@working, $val);
- } else {
- die "\nUnrecognized -feature argument '$fkey', use efilter -help to see available choices\n\n";
+ # -feature can use comma-separated list
+ my @fts = split (',', $fkey);
+ foreach $ft (@fts) {
+ if ( grep( /^$ft$/, @featureArray ) ) {
+ $val = $ft . " [FKEY]";
+ push (@working, $val);
+ } else {
+ die "\nUnrecognized -feature argument '$ft', use efilter -help to see available choices\n\n";
+ }
}
}
@@ -1475,12 +1595,25 @@ sub process_extras {
}
}
+ if ( $clss ne "" ) {
+ if ( defined $snpHash{$clss} ) {
+ $val = $snpHash{$clss};
+ push (@working, $val);
+ } else {
+ die "\nUnrecognized -class argument '$clss', use efilter -help to see available choices\n\n";
+ }
+ }
+
my $xtras = join (" AND ", @working);
- if ( $suffix eq "published" ) {
+ if ( $is_published ) {
$xtras = $xtras . " NOT ahead of print [FILT]";
}
+ if ( $is_prev_year ) {
+ $xtras = $xtras . " NOT published last year [FILT]";
+ }
+
return $xtras;
}
@@ -1514,6 +1647,7 @@ sub efilt {
MyGetOptions(
$filt_help,
"query=s" => \$query,
+ "q=s" => \$query,
"sort=s" => \$sort,
"days=i" => \$rldate,
"mindate=s" => \$mndate,
@@ -1523,7 +1657,9 @@ sub efilt {
"field=s" => \$field,
"spell" => \$spell,
"pairs=s" => \$pair,
+ "journal=s" => \$journal,
"pub=s" => \$pub,
+ "released=s" => \$released,
"country=s" => \$country,
"feature=s" => \$feature,
"location=s" => \$location,
@@ -1532,6 +1668,7 @@ sub efilt {
"source=s" => \$source,
"status=s" => \$status,
"type=s" => \$gtype,
+ "class=s" => \$class,
"api_key=s" => \$api_key,
"email=s" => \$emaddr,
"tool=s" => \$tuul,
@@ -1539,6 +1676,7 @@ sub efilt {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -1554,7 +1692,7 @@ sub efilt {
}
# process special filter flags, add to query string
- $query = process_extras ( $query, $pub, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype );
+ $query = process_extras ( $query, $pub, $released, $journal, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype, $class );
if ( -t STDIN ) {
if ( $query eq "" ) {
@@ -1585,8 +1723,8 @@ sub efilt {
$email = $emaddr;
}
- if ( $query eq "" && $rldate < 1 and $mndate eq "" and $mxdate eq "" ) {
- die "Must supply -query or -days or -mindate and -maxdate arguments on command line\n";
+ if ( $query eq "" && $sort eq "" && $rldate < 1 and $mndate eq "" and $mxdate eq "" ) {
+ die "Must supply -query or -sort or -days or -mindate and -maxdate arguments on command line\n";
}
binmode STDOUT, ":utf8";
@@ -1897,6 +2035,7 @@ sub esmry {
my $silent = shift (@_);
my $verbose = shift (@_);
my $debug = shift (@_);
+ my $internal = shift (@_);
my $log = shift (@_);
my $http = shift (@_);
my $alias = shift (@_);
@@ -2175,6 +2314,7 @@ Format Examples
summary Summary
gene
+ full_report Detailed Report
gene_table Gene Table
native Gene Report
native asn.1 Entrezgene ASN.1
@@ -2357,9 +2497,7 @@ sub xml_to_json {
my $conv = $xc->XMLin($data);
convert_bools($conv);
my $jc = JSON::PP->new->ascii->pretty->allow_nonref;
- my $result = $jc->encode($conv);
-
- $data = "$result";
+ $data = $jc->encode($conv);
return $data;
}
@@ -2405,6 +2543,7 @@ sub eftch {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"raw" => \$raw,
@@ -2456,6 +2595,10 @@ sub eftch {
$style = "withparts";
}
+ if ( $type eq "gbc" and $mode eq "" ) {
+ $mode = "xml";
+ }
+
if ( -t STDIN and not @ARGV ) {
} elsif ( $db ne "" and $id ne "" ) {
} else {
@@ -2517,7 +2660,7 @@ sub eftch {
if ( $type eq "docsum" or $fnc eq "-summary" ) {
esmry ( $dbase, $web, $key, $num, $id, $mode, $min, $max, $tool, $email,
- $silent, $verbose, $debug, $log, $http, $alias, $basx );
+ $silent, $verbose, $debug, $internal, $log, $http, $alias, $basx );
return;
}
@@ -2683,10 +2826,11 @@ sub eftch {
$arg = "db=$dbase&id=$id";
- if ( $type eq "gb" ) {
+ if ( $type eq "gb" or $type eq "gbc" ) {
if ( $style eq "withparts" or $style eq "master" ) {
- $arg .= "&rettype=gbwithparts";
+ $arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
+ $arg .= "&style=$style";
} elsif ( $style eq "conwithfeat" or $style eq "withfeat" or $style eq "contigwithfeat" ) {
$arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
@@ -2826,10 +2970,11 @@ sub eftch {
$arg = "db=$dbase&query_key=$key&WebEnv=$web";
- if ( $type eq "gb" ) {
+ if ( $type eq "gb" or $type eq "gbc" ) {
if ( $style eq "withparts" or $style eq "master" ) {
- $arg .= "&rettype=gbwithparts";
+ $arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
+ $arg .= "&style=$style";
} elsif ( $style eq "conwithfeat" or $style eq "withfeat" or $style eq "contigwithfeat" ) {
$arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
@@ -3034,6 +3179,7 @@ sub einfo {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -3125,6 +3271,7 @@ sub einfo {
if ( ! defined $output ) {
print STDERR "Failure of '$url'\n";
+ $result = 1;
return;
}
@@ -3575,6 +3722,7 @@ sub elink {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -3929,6 +4077,7 @@ sub entfy {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4102,6 +4251,7 @@ sub epost {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4317,6 +4467,7 @@ sub epost {
}
if ( $combo eq "" ) {
+ $result = 1;
die "Failure of post to find data to load\n";
}
@@ -4369,6 +4520,7 @@ sub espel {
$spell_help,
"db=s" => \$db,
"query=s" => \$query,
+ "q=s" => \$query,
"api_key=s" => \$api_key,
"email=s" => \$emaddr,
"tool=s" => \$tuul,
@@ -4376,6 +4528,7 @@ sub espel {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4472,6 +4625,7 @@ sub ecitmtch {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4586,6 +4740,7 @@ sub eprxy {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4821,13 +4976,16 @@ sub esrch {
$srch_help,
"db=s" => \$db,
"query=s" => \$query,
+ "q=s" => \$query,
"sort=s" => \$sort,
"days=i" => \$rldate,
"mindate=s" => \$mndate,
"maxdate=s" => \$mxdate,
"datetype=s" => \$dttype,
"label=s" => \$lbl,
+ "journal=s" => \$journal,
"pub=s" => \$pub,
+ "released=s" => \$released,
"country=s" => \$country,
"feature=s" => \$feature,
"location=s" => \$location,
@@ -4836,6 +4994,7 @@ sub esrch {
"source=s" => \$source,
"status=s" => \$status,
"type=s" => \$gtype,
+ "class=s" => \$class,
"clean" => \$clean,
"field=s" => \$field,
"word" => \$word,
@@ -4853,6 +5012,7 @@ sub esrch {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
+ "internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4899,7 +5059,7 @@ sub esrch {
binmode STDOUT, ":utf8";
# support all efilter shortcut flags in esearch (undocumented)
- $query = process_extras ( $query, $pub, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype );
+ $query = process_extras ( $query, $pub, $released, $journal, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype, $class );
if ( $query eq "" ) {
die "Must supply -query search expression on command line\n";
@@ -5250,3 +5410,5 @@ if ( scalar @ARGV > 0 and $ARGV[0] eq "-version" ) {
close (STDIN);
close (STDOUT);
close (STDERR);
+
+exit $result;
=====================================
index-pubmed
=====================================
@@ -21,6 +21,7 @@ if [ "$#" -gt 0 ]
then
target="$1"
MASTER=$(cd "$target" && pwd)
+ CONFIG=${MASTER}
shift
else
if [ -z "${EDIRECT_PUBMED_MASTER}" ]
@@ -70,7 +71,7 @@ do
mkdir -p "$MASTER/$dir"
done
-for dir in Current Indexed Inverted Merged Pubmed
+for dir in Indexed Inverted Merged Pubmed
do
mkdir -p "$WORKING/$dir"
done
@@ -108,19 +109,9 @@ echo "$seconds seconds"
REF=$seconds
echo ""
-seconds_start=$(date "+%s")
-echo "Collecting PubMed Records"
-pm-current "$WORKING/Current" "$MASTER/Archive"
-seconds_end=$(date "+%s")
-seconds=$((seconds_end - seconds_start))
-echo "$seconds seconds"
-COL=$seconds
-echo ""
-
seconds_start=$(date "+%s")
echo "Indexing PubMed Records"
-cd "$WORKING/Current"
-pm-index "$WORKING/Indexed"
+pm-index "$MASTER/Archive" "$WORKING/Indexed"
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
echo "$seconds seconds"
@@ -160,7 +151,6 @@ echo ""
echo "DWN $DWN seconds"
echo "POP $POP seconds"
echo "REF $REF seconds"
-echo "COL $COL seconds"
echo "IDX $IDX seconds"
echo "INV $INV seconds"
echo "MRG $MRG seconds"
@@ -172,3 +162,20 @@ fetch-pubmed -path "$MASTER/Archive" |
xtract -pattern Author -if Affiliation -contains Medicine \
-pfx "Archive and Index are " -element Initials
echo ""
+
+if [ -n "$CONFIG" ]
+then
+ target=bash_profile
+ if ! grep "$target" "$HOME/.bashrc" >/dev/null 2>&1
+ then
+ if [ ! -f $HOME/.$target ] || grep 'bashrc' "$HOME/.$target" >/dev/null 2>&1
+ then
+ target=bashrc
+ fi
+ fi
+ echo ""
+ echo "For convenience, please execute the following to save the archive path to a variable:"
+ echo ""
+ echo " echo \"export EDIRECT_PUBMED_MASTER='${CONFIG}'\" >>" "\$HOME/.$target"
+ echo ""
+fi
=====================================
local-phrase-search deleted
=====================================
@@ -1,153 +0,0 @@
-#!/bin/sh
-
-target=""
-mode="query"
-debug=false
-
-while [ $# -gt 0 ]
-do
- case "$1" in
- -h | -help | --help )
- mode=help
- break
- ;;
- -debug )
- debug=true
- shift
- ;;
- -path | -master )
- target=$2
- shift
- shift
- ;;
- -count )
- mode="count"
- shift
- ;;
- -counts )
- mode="counts"
- shift
- ;;
- -countr )
- mode="countr"
- shift
- ;;
- -countp )
- mode="countp"
- shift
- ;;
- -query | -phrase )
- mode="query"
- shift
- ;;
- -search )
- mode="search"
- shift
- ;;
- -exact )
- mode="exact"
- shift
- ;;
- -mock )
- mode="mock"
- shift
- ;;
- -mocks )
- mode="mocks"
- shift
- ;;
- -mockx )
- mode="mockx"
- shift
- ;;
- -* )
- exec >&2
- echo "$0: Unrecognized option $1"
- exit 1
- ;;
- * )
- break
- ;;
- esac
-done
-
-if [ $mode = "help" ]
-then
- cat <<EOF
-USAGE: $0
- [-path path_to_pubmed_master]
- -count | -counts | -search | -exact | [-query]
- query arguments
-
-EXAMPLE: local-phrase-search -query catabolite repress* AND protease inhibit*
-EOF
- exit
-fi
-
-if [ -z "$target" ]
-then
- if [ -z "${EDIRECT_PUBMED_MASTER}" ]
- then
- echo "Must supply path to postings files or set EDIRECT_PUBMED_MASTER environment variable"
- exit 1
- else
- MASTER="${EDIRECT_PUBMED_MASTER}"
- MASTER=${MASTER%/}
- target="$MASTER/Postings"
- fi
-else
- argument="$target"
- target=$(cd "$argument" && pwd)
- target=${target%/}
- case "$target" in
- */Postings ) ;;
- * ) target=$target/Postings ;;
- esac
-fi
-
-osname=`uname -s | sed -e 's/_NT-.*$/_NT/; s/^MINGW[0-9]*/CYGWIN/'`
-if [ "$osname" = "CYGWIN_NT" -a -x /bin/cygpath ]
-then
- target=`cygpath -w "$target"`
-fi
-
-target=${target%/}
-
-if [ "$debug" = true ]
-then
- echo "mode: $mode, path: '$target', args: '$*'"
- exit
-fi
-
-case "$mode" in
- count )
- rchive -path "$target" -count "$*"
- ;;
- counts )
- rchive -path "$target" -counts "$*"
- ;;
- countr )
- rchive -path "$target" -countr "$*"
- ;;
- countp )
- rchive -path "$target" -countp "$*"
- ;;
- query )
- rchive -path "$target" -query "$*"
- ;;
- search )
- rchive -path "$target" -search "$*"
- ;;
- exact )
- rchive -path "$target" -exact "$*"
- ;;
- mock )
- rchive -path "$target" -mock "$*"
- ;;
- mocks )
- rchive -path "$target" -mocks "$*"
- ;;
- mockx )
- rchive -path "$target" -mockx "$*"
- ;;
-esac
=====================================
nquire
=====================================
@@ -43,7 +43,7 @@ use File::Spec;
# nquire version number
-$version = "10.5";
+$version = "10.9";
BEGIN
{
@@ -63,10 +63,12 @@ BEGIN
}
use lib $LibDir;
+use JSON::PP;
use LWP::UserAgent;
use POSIX;
use URI::Escape;
use Net::FTP;
+use XML::Simple;
# definitions
@@ -81,6 +83,7 @@ sub clearflags {
$alias = "";
$debug = false;
$http = "";
+ $j2x = false;
$output = "";
}
@@ -229,6 +232,33 @@ sub do_uri_escape {
return $rslt;
}
+sub convert_bools {
+ my %unrecognized;
+
+ local *_convert_bools = sub {
+ my $ref_type = ref($_[0]);
+ if (!$ref_type) {
+ # Nothing.
+ }
+ elsif ($ref_type eq 'HASH') {
+ _convert_bools($_) for values(%{ $_[0] });
+ }
+ elsif ($ref_type eq 'ARRAY') {
+ _convert_bools($_) for @{ $_[0] };
+ }
+ elsif (
+ $ref_type eq 'JSON::PP::Boolean' || $ref_type eq 'Types::Serialiser::Boolean'
+ ) {
+ $_[0] = $_[0] ? 1 : 0;
+ }
+ else {
+ ++$unrecognized{$ref_type};
+ }
+ };
+
+ &_convert_bools;
+}
+
# nquire executes an external URL query from command line arguments
my $nquire_help = qq{
@@ -439,8 +469,103 @@ Federated Query
}" |
xtract -pattern result -block binding -element "binding\@name" literal
+BioThings Queries
+
+ nquire -variant variant "chr6:g.26093141G>A" -fields dbsnp.gene |
+ xtract -pattern gene -element \@geneid
+
+ nquire -gene query -q "symbol:OPN1MW" -species 9606 |
+ xtract -pattern hits -element "\@_id"
+
+ nquire -gene query -q "symbol:OPN1MW AND taxid:9606" |
+ xtract -pattern hits -element "\@_id"
+
+ nquire -gene gene 2652 -fields pathway.wikipathways |
+ xtract -pattern pathway -element "\@id"
+
+ nquire -gene query -q "pathway.wikipathways.id:WP455" -size 300 |
+ xtract -pattern hits -element "\@_id"
+
+ nquire -chem query -q "drugbank.targets.uniprot:P05231 AND drugbank.targets.actions:inhibitor" -fields hgvs |
+ xtract -pattern hits -element "\@_id"
+
+EDirect Expansion
+
+ ExtractIDs() {
+ xtract -pattern BIO_THINGS -block Id -tab "\\n" -element "Id"
+ }
+
+ WrapIDs() {
+ xtract -wrp BIO_THINGS -pattern opt -wrp "Type" -lbl "\$1" \\
+ -wrp "Count" -num "\$2" -block "\$2" -wrp "Id" -element "\$3" |
+ xtract -format
+ }
+
+ nquire -gene query -q "symbol:OPN1MW AND taxid:9606" |
+ WrapIDs entrezgene hits "\@entrezgene" |
+
+ ExtractIDs |
+ while read geneid
+ do
+ nquire -gene gene "\$geneid" -fields pathway.wikipathways
+ done |
+ WrapIDs pathway.wikipathways.id pathway "\@id" |
+
+ ExtractIDs |
+ while read pathid
+ do
+ nquire -gene query -q "pathway.wikipathways.id:\$pathid" -size 300
+ done |
+ WrapIDs entrezgene hits "\@entrezgene" |
+
+ ExtractIDs |
+ sort -n
+
};
+my @pubchem_properties = qw(
+ MolecularFormula
+ MolecularWeight
+ CanonicalSMILES
+ IsomericSMILES
+ InChI
+ InChIKey
+ IUPACName
+ XLogP
+ ExactMass
+ MonoisotopicMass
+ TPSA
+ Complexity
+ Charge
+ HBondDonorCount
+ HBondAcceptorCount
+ RotatableBondCount
+ HeavyAtomCount
+ IsotopeAtomCount
+ AtomStereoCount
+ DefinedAtomStereoCount
+ UndefinedAtomStereoCount
+ BondStereoCount
+ DefinedBondStereoCount
+ UndefinedBondStereoCount
+ CovalentUnitCount
+ Volume3D
+ XStericQuadrupole3D
+ YStericQuadrupole3D
+ ZStericQuadrupole3D
+ FeatureCount3D
+ FeatureAcceptorCount3D
+ FeatureDonorCount3D
+ FeatureAnionCount3D
+ FeatureCationCount3D
+ FeatureRingCount3D
+ FeatureHydrophobeCount3D
+ ConformerModelRMSD3D
+ EffectiveRotorCount3D
+ ConformerCount3D
+ Fingerprint2D
+);
+
sub nquire {
# nquire -url http://... -tag value -tag value | ...
@@ -450,10 +575,19 @@ sub nquire {
$pfx = "";
$amp = "";
$pat = "";
+ $sfx = "";
@args = @ARGV;
$max = scalar @args;
+ %biothingsHash = (
+ '-gene' => 'http://mygene.info/v3',
+ '-variant' => 'http://myvariant.info/v1',
+ '-chem' => 'http://mychem.info/v1',
+ '-drug' => 'http://c.biothings.io/v1',
+ '-taxon' => 'http://t.biothings.io/v1',
+ );
+
if ( $max < 1 ) {
return;
}
@@ -515,12 +649,10 @@ sub nquire {
$ftp->cwd($dir) or die "Unable to change to $dir: ", $ftp->message;
$ftp->binary or warn "Unable to set binary mode: ", $ftp->message;
- if (! -e $fl) {
- if (! $ftp->get($fl, "/dev/stdout") ) {
- my $msg = $ftp->message;
- chomp $msg;
- print STDERR "\nFAILED TO DOWNLOAD:\n\n$fl ($msg\n";
- }
+ if (! $ftp->get($fl, "/dev/stdout") ) {
+ my $msg = $ftp->message;
+ chomp $msg;
+ print STDERR "\nFAILED TO DOWNLOAD:\n\n$fl ($msg)\n";
}
}
}
@@ -529,7 +661,7 @@ sub nquire {
}
}
- # if present, -http get or -get must be next
+ # if present, -http get or -get must be next (now also allow -http post or -post)
# nquire -get -url "http://collections.mnh.si.edu/services/resolver/resolver.php" -voucher "Birds:625456"
@@ -544,6 +676,9 @@ sub nquire {
} elsif ( $pat eq "-get" ) {
$i++;
$http = "get";
+ } elsif ( $pat eq "-post" ) {
+ $i++;
+ $http = "post";
}
}
@@ -605,6 +740,13 @@ sub nquire {
if ( $i < $max ) {
$url = "https://eutilstest.ncbi.nlm.nih.gov/entrez/eutils";
}
+ } elsif ( $pat eq "-qa" ) {
+ # shortcut for eutils QA base (undocumented)
+ $i++;
+ if ( $i < $max ) {
+ $url = "http://qa.ncbi.nlm.nih.gov/entrez/eutils";
+ }
+
} elsif ( $pat eq "-hydra" ) {
# internal citation match request (undocumented)
$i++;
@@ -617,6 +759,7 @@ sub nquire {
$amp = "&";
$i++;
}
+
} elsif ( $pat eq "-revhist" ) {
# internal sequence revision history request (undocumented)
$i++;
@@ -627,6 +770,47 @@ sub nquire {
$amp = "&";
$i++;
}
+
+ } elsif ( $pat eq "-pubchem" ) {
+ # shortcut for PubChem Power User Gateway REST service base (undocumented)
+ # nquire -pubchem "compound/name/creatine/property" "IUPACName,MolecularWeight,MolecularFormula" "XML"
+ $i++;
+ if ( $i < $max ) {
+ $url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug";
+ if ( $i + 2 == $max && $args[$i] eq "compound" ) {
+ # even shorter shortcut
+ # nquire -pubchem compound creatine
+ $pat = $args[$i + 1];
+ if ( $pat =~ /^-(.+)/ ) {
+ } elsif ( $pat !~ /\// ) {
+ $i = $i + 2;
+ $url .= "/compound/name/";
+ $pat = map_macros ($pat);
+ $url .= $pat;
+ $url .= "/property/";
+ $sfx = join(",", @pubchem_properties);
+ $url .= $sfx;
+ $url .= "/XML";
+ }
+ }
+ }
+
+ } elsif ( defined $biothingsHash{$pat} ) {
+ # shortcuts for biothings services (undocumented)
+ $i++;
+ $url = $biothingsHash{$pat};
+ if ( $http eq "" ) {
+ $http = "get";
+ }
+ $j2x = true;
+
+ } elsif ( $pat eq "-wikipathways" ) {
+ # shortcut for webservice.wikipathways.org (undocumented)
+ $i++;
+ if ( $i < $max ) {
+ $url = "http://webservice.wikipathways.org";
+ }
+
} elsif ( $pat eq "-biosample" ) {
# internal biosample_chk request on live database (undocumented)
$i++;
@@ -709,6 +893,33 @@ sub nquire {
# perform query
$output = do_post ($url, $arg);
+ if ( $j2x ) {
+ my $jc = JSON::PP->new->ascii->pretty->allow_nonref;
+ my $conv = $jc->decode($output);
+ convert_bools($conv);
+ my $result = XMLout($conv, SuppressEmpty => undef);
+
+ # remove newlines, tabs, space between tokens, compress runs of spaces
+ $result =~ s/\r/ /g;
+ $result =~ s/\n/ /g;
+ $result =~ s/\t//g;
+ $result =~ s/ +/ /g;
+ $result =~ s/> +</></g;
+
+ # remove <opt> flanking object
+ if ( $result =~ /<opt>\s*?</ and $result =~ />\s*?<\/opt>/ ) {
+ $result =~ s/<opt>\s*?</</g;
+ $result =~ s/>\s*?<\/opt>/>/g;
+ }
+
+ $output = "$result";
+
+ # restore newlines between objects
+ $output =~ s/> *?</>\n</g;
+
+ binmode(STDOUT, ":utf8");
+ }
+
print "$output";
}
=====================================
pm-clean deleted
=====================================
@@ -1,24 +0,0 @@
-#!/bin/sh
-
-if [ "$#" -eq 0 ]
-then
- echo "Must supply path for cleaned files"
- exit 1
-fi
-
-target="$1"
-
-target=${target%/}
-
-for fl in *.xml.gz
-do
- base=${fl%.xml.gz}
- if [ -f "$target/$base.xml.gz" ]
- then
- continue
- fi
- echo "$base"
- gunzip -c "$fl" |
- xtract -mixed -format flush |
- gzip > "$target/$base.xml.gz"
-done
=====================================
pm-current deleted
=====================================
@@ -1,59 +0,0 @@
-#!/bin/sh
-
-if [ "$#" -eq 0 ]
-then
- echo "Must supply path for current files"
- exit 1
-fi
-
-target="$1"
-shift
-
-target=${target%/}
-
-if [ "$#" -eq 0 ]
-then
- echo "Must supply path for archive files"
- exit 1
-fi
-
-archive="$1"
-shift
-
-archive=${archive%/}
-
-find "$target" -name "*.xml.gz" -delete
-
-fr=0
-chunk_size=250000
-if [ -n "${EDIRECT_CHUNK_SIZE}" ]
-then
- chunk_size="${EDIRECT_CHUNK_SIZE}"
-fi
-to=$((chunk_size - 1))
-loop_max=$((50000000 / chunk_size))
-seq 1 $((loop_max)) | while read n
-do
- base=$(printf pubmed%03d $n)
- if [ -f "$target/$base.xml.gz" ]
- then
- fr=$((fr + chunk_size))
- to=$((to + chunk_size))
- continue
- fi
- echo "$base XML"
- seconds_start=$(date "+%s")
- seq -f "%0.f" $fr $to | stream-pubmed -path "$archive" > "$target/$base.xml.gz"
- fr=$((fr + chunk_size))
- to=$((to + chunk_size))
- seconds_end=$(date "+%s")
- seconds=$((seconds_end - seconds_start))
- echo "$seconds seconds"
- fsize=$(wc -c <"$target/$base.xml.gz")
- if [ "$fsize" -le 300 ]
- then
- rm "$target/$base.xml.gz"
- exit 0
- fi
- sleep 1
-done
=====================================
pm-erase deleted
=====================================
@@ -1,17 +0,0 @@
-#!/bin/sh
-
-if [ "$#" -eq 0 ]
-then
- echo "Must supply path to archive files"
- exit 1
-fi
-
-target="$1"
-
-target=${target%/}
-
-rchive -trie -gzip |
-while read dir
-do
- rm "$target/$dir"
-done
=====================================
pm-index
=====================================
@@ -2,35 +2,74 @@
dir=`dirname "$0"`
+if [ "$#" -eq 0 ]
+then
+ echo "Must supply path for archive files"
+ exit 1
+fi
+
+archive="$1"
+shift
+
+archive=${archive%/}
+
if [ "$#" -eq 0 ]
then
echo "Must supply path for indexed files"
exit 1
fi
-target="$1"
+indexed="$1"
+shift
-target=${target%/}
+indexed=${indexed%/}
-find "$target" -name "*.e2x.gz" -delete
+cd "$archive"
-for fl in *.xml.gz
+find "$indexed" -name "*.e2x.gz" -delete
+
+q=0
+fr=0
+chunk_size=250000
+if [ -n "${EDIRECT_CHUNK_SIZE}" ]
+then
+ chunk_size="${EDIRECT_CHUNK_SIZE}"
+fi
+to=$((chunk_size - 1))
+loop_max=$((50000000 / chunk_size))
+seq 1 $((loop_max)) | while read n
do
- base=${fl%.xml.gz}
- echo "$base"
+ base=$(printf pubmed%03d $n)
+ if [ -f "$indexed/$base.e2x.gz" ]
+ then
+ fr=$((fr + chunk_size))
+ to=$((to + chunk_size))
+ continue
+ fi
+ echo "$base XML"
seconds_start=$(date "+%s")
if [ -s "$dir/meshtree.txt" ]
then
- gunzip -c "$fl" |
+ seq -f "%0.f" $fr $to |
+ fetch-pubmed -path "$archive" |
xtract -transform "$dir/meshtree.txt" -e2index |
- gzip -1 > "$target/$base.e2x.gz"
+ gzip -1 > "$indexed/$base.e2x.gz"
else
- gunzip -c "$fl" |
+ seq -f "%0.f" $fr $to |
+ fetch-pubmed -path "$archive" |
xtract -e2index |
- gzip -1 > "$target/$base.e2x.gz"
+ gzip -1 > "$indexed/$base.e2x.gz"
fi
+ fr=$((fr + chunk_size))
+ to=$((to + chunk_size))
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
echo "$seconds seconds"
+ fsize=$(wc -c < "$indexed/$base.e2x.gz")
+ if [ "$fsize" -le 300 ]
+ then
+ rm -f "$indexed/$base.e2x.gz"
+ exit 0
+ fi
sleep 1
done
=====================================
pm-log deleted
=====================================
@@ -1,24 +0,0 @@
-#!/bin/sh
-
-printAdditions() {
- f="$1"
- base=${f%.xml.gz}
- gunzip -c "$f" |
- xtract -strict -pattern PubmedArticle \
- -block MedlineCitation/PMID -lbl "$base" -sep "." \
-element MedlineCitation/PMID,MedlineCitation/PMID@Version
-}
-
-printDeletions() {
- f="$1"
- base=${f%.xml.gz}
- gunzip -c "$f" |
- xtract -strict -pattern DeleteCitation \
-block PMID -lbl "$base" -tab "\tD\n" -sep "." -element "PMID,@Version"
-}
-
-for fl in *.xml.gz
-do
- printAdditions "$fl"
- printDeletions "$fl"
-done > transactions.txt
=====================================
pm-repack deleted
=====================================
@@ -1,16 +0,0 @@
-#!/bin/sh
-
-for fl in *.xml.gz
-do
- echo "$fl"
- base=${fl%.xml.gz}
- gunzip -c "$fl" | xtract -strict -compress -format flush > "$base.tmp"
- xtract -input "$base.tmp" -pattern PubmedArticle -element MedlineCitation/PMID > "$base.uid"
- rchive -input "$base.tmp" -unique "$base.uid" -index MedlineCitation/PMID \
- -head "<PubmedArticleSet>" -tail "</PubmedArticleSet>" -pattern PubmedArticle |
- xtract -format indent -xml '<?xml version="1.0" encoding="UTF-8"?>' \
- -doctype '<!DOCTYPE PubmedArticleSet SYSTEM "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_180601.dtd">' > "$base.xml"
- rm "$base.tmp"
- rm "$base.uid"
-done
-rm *.xml.gz
=====================================
pm-stash
=====================================
@@ -45,7 +45,8 @@ deleteCitations() {
reportVersioned() {
inp="$1"
pmidlist=.TO-REPORT
- xtract -input "$inp" -pattern PubmedArticle -block MedlineCitation/PMID -if "@Version" -gt 1 -element "PMID" |
+ xtract -input "$inp" -pattern PubmedArticle \
+ -block MedlineCitation/PMID -if "@Version" -gt 1 -element "PMID" |
sort -n | uniq > $pmidlist
if [ -s $pmidlist ]
then
=====================================
pm-uids deleted
=====================================
@@ -1,13 +0,0 @@
-#!/bin/sh
-
-if [ "$#" -eq 0 ]
-then
- echo "Must supply path to archive files"
- exit 1
-fi
-
-target="$1"
-
-find "$target" -name "*.xml.gz" |
-sed -e 's,.*/\(.*\)\.xml\.gz,\1,' |
-sort -n | uniq
=====================================
pm-verify deleted
=====================================
@@ -1,7 +0,0 @@
-#!/bin/sh
-
-for fl in *.gz
-do
- echo "$fl"
- gunzip -c "$fl" | xtract -mixed -verify
-done
=====================================
rchive.go
=====================================
@@ -62,7 +62,7 @@ import (
// RCHIVE VERSION AND HELP MESSAGE TEXT
-const rchiveVersion = "10.5"
+const rchiveVersion = "10.9"
const rchiveHelp = `
Processing Flags
@@ -5794,7 +5794,7 @@ func main() {
if dbug {
// drain results, but suppress normal output
- for _ = range unsq {
+ for range unsq {
recordCount++
runtime.Gosched()
}
@@ -5922,7 +5922,7 @@ func main() {
if dbug {
// drain results, but suppress normal output
- for _ = range sptr {
+ for range sptr {
recordCount++
runtime.Gosched()
}
@@ -6503,7 +6503,7 @@ func main() {
if dbug {
// drain results, but suppress normal output
- for _ = range rslq {
+ for range rslq {
recordCount++
runtime.Gosched()
}
=====================================
setup-deps.pl
=====================================
@@ -73,7 +73,7 @@ my @lwp_deps = qw(Encode::Locale File::Listing
HTTP::Cookies HTTP::Date HTTP::Message HTTP::Negotiate
IO::Socket::SSL LWP::MediaTypes LWP::Protocol::https
Net::HTTP URI WWW::RobotRules Mozilla::CA);
-for my $module (@lwp_deps, 'Time::HiRes', 'JSON::PP', 'XML::Simple') {
+for my $module (@lwp_deps, 'Time::HiRes', 'JSON::PP', 'MIME::Base64', 'XML::Simple') {
if ( ! CheckAvailability($module) ) {
CPAN::Shell->install($module);
}
=====================================
test-pubmed-index
=====================================
@@ -23,7 +23,7 @@ do
fi
if [ -z "$ttl" ]
then
- echo "$uid TRIM"
+ echo "$uid TRIM -- $ttl"
continue
fi
res=`phrase-search -exact "$ttl"`
=====================================
transmute
=====================================
@@ -43,7 +43,7 @@ use File::Spec;
# transmute version number
-$version = "10.5";
+$version = "10.9";
BEGIN
{
@@ -64,6 +64,7 @@ BEGIN
use lib $LibDir;
use JSON::PP;
+use MIME::Base64;
use URI::Escape;
use XML::Simple;
@@ -102,7 +103,11 @@ Transformation Commands
};
-my $type = shift or die "Must supply -decode, -encode, -j2x, or -x2j on command line\n";
+# read required function argument
+my $type = shift or die "Must supply conversion type on command line\n";
+
+# read optional parent object name
+my $obj = shift;
sub transmute {
@@ -131,7 +136,7 @@ sub transmute {
# perform specific conversions
- if ( $type eq "decode" || $type eq "-decode" ) {
+ if ( $type eq "unescape" || $type eq "-unescape" ) {
$data = uri_unescape($data);
@@ -144,7 +149,7 @@ sub transmute {
print "$data";
}
- if ( $type eq "encode" || $type eq "-encode" ) {
+ if ( $type eq "escape" || $type eq "-escape" ) {
# compress runs of spaces
$data =~ s/ +/ /g;
@@ -154,6 +159,20 @@ sub transmute {
print "$data";
}
+ if ( $type eq "decode64" || $type eq "-decode64" ) {
+
+ $data = decode_base64($data);
+
+ print "$data";
+ }
+
+ if ( $type eq "encode64" || $type eq "-encode64" ) {
+
+ $data = encode_base64($data);
+
+ print "$data";
+ }
+
if ( $type eq "plain" || $type eq "-plain" ) {
# remove embedded mixed-content tags
@@ -262,6 +281,96 @@ sub transmute {
print "$data\n";
}
+ if ( $type eq "docsum" || $type eq "-docsum" ) {
+
+ # remove newlines, tabs, space between tokens, compress runs of spaces
+ $data =~ s/\r/ /g;
+ $data =~ s/\n/ /g;
+ $data =~ s/\t//g;
+ $data =~ s/ +/ /g;
+ $data =~ s/> +</></g;
+
+ # move UID from attribute to object
+ if ($data !~ /<Id>\d+<\/Id>/i) {
+ $data =~ s/<DocumentSummary uid=\"(\d+)\">/<DocumentSummary><Id>$1<\/Id>/g;
+ }
+ $data =~ s/<DocumentSummary uid=\"\d+\">/<DocumentSummary>/g;
+
+ # fix bad encoding
+ my @accum = ();
+ my @working = ();
+ my $prefix = "";
+ my $suffix = "";
+ my $docsumset_attrs = '';
+
+ if ( $data =~ /(.+?)<DocumentSummarySet(\s+.+?)?>(.+)<\/DocumentSummarySet>(.+)/s ) {
+ $prefix = $1;
+ $docsumset_attrs = $2;
+ my $docset = $3;
+ $suffix = $4;
+
+ my @vals = ($docset =~ /<DocumentSummary>(.+?)<\/DocumentSummary>/sg);
+ foreach $val (@vals) {
+ push (@working, "<DocumentSummary>");
+ if ( $val =~ /<Title>(.+?)<\/Title>/ ) {
+ my $x = $1;
+ if ( $x =~ /\&amp\;/ || $x =~ /\&lt\;/ || $x =~ /\&gt\;/ || $x =~ /\</ || $x =~ /\>/ ) {
+ while ( $x =~ /\&amp\;/ || $x =~ /\&lt\;/ || $x =~ /\&gt\;/ ) {
+ HTML::Entities::decode_entities($x);
+ }
+ # remove mixed-content tags
+ $x =~ s|<b>||g;
+ $x =~ s|<i>||g;
+ $x =~ s|<u>||g;
+ $x =~ s|<sup>||g;
+ $x =~ s|<sub>||g;
+ $x =~ s|</b>||g;
+ $x =~ s|</i>||g;
+ $x =~ s|</u>||g;
+ $x =~ s|</sup>||g;
+ $x =~ s|</sub>||g;
+ $x =~ s|<b/>||g;
+ $x =~ s|<i/>||g;
+ $x =~ s|<u/>||g;
+ $x =~ s|<sup/>||g;
+ $x =~ s|<sub/>||g;
+ # Reencode any resulting less-than or greater-than entities to avoid breaking the XML.
+ $x =~ s/</&lt;/g;
+ $x =~ s/>/&gt;/g;
+ $val =~ s/<Title>(.+?)<\/Title>/<Title>$x<\/Title>/;
+ }
+ }
+ if ( $val =~ /<Summary>(.+?)<\/Summary>/ ) {
+ my $x = $1;
+ if ( $x =~ /\&amp\;/ ) {
+ HTML::Entities::decode_entities($x);
+ # Reencode any resulting less-than or greater-than entities to avoid breaking the XML.
+ $x =~ s/</&lt;/g;
+ $x =~ s/>/&gt;/g;
+ $val =~ s/<Summary>(.+?)<\/Summary>/<Summary>$x<\/Summary>/;
+ }
+ }
+ push (@working, $val );
+ push (@working, "</DocumentSummary>");
+ }
+ }
+
+ if ( scalar @working > 0 ) {
+ push (@accum, $prefix);
+ push (@accum, "<DocumentSummarySet$docsumset_attrs>");
+ push (@accum, @working);
+ push (@accum, "</DocumentSummarySet>");
+ push (@accum, $suffix);
+ $data = join ("\n", @accum);
+ $data =~ s/\n\n/\n/g;
+ }
+
+ # restore newlines between objects
+ $data =~ s/> *?</>\n</g;
+
+ print "$data\n";
+ }
+
if ( $type eq "json2xml" || $type eq "-json2xml" || $type eq "j2x" || $type eq "-j2x" ) {
# convert JSON to XML
@@ -284,9 +393,6 @@ sub transmute {
$result =~ s/>\s*?<\/opt>/>/g;
}
- # read optional parent object name
- my $obj = shift;
-
if ( defined($obj) && $obj ne "" ) {
my $xml = '<?xml version="1.0" encoding="UTF-8"?>';
=====================================
xtract.go
=====================================
@@ -53,7 +53,7 @@ import (
// XTRACT VERSION AND HELP MESSAGE TEXT
-const xtractVersion = "10.5"
+const xtractVersion = "10.9"
const xtractHelp = `
Overview
@@ -202,6 +202,7 @@ Text Processing
-terms Partition text at spaces
-words Split at punctuation marks
-pairs Adjacent informative words
+ -reverse Reverse words in string
-letters Separate individual letters
-clauses Break at phrase separators
-indices Index normalized words
@@ -209,6 +210,7 @@ Text Processing
Sequence Processing
-revcomp Reverse-complement nucleotide sequence
+ -nucleic Subrange determines forward or revcomp
Sequence Coordinates
@@ -270,7 +272,7 @@ Notes
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
- -words, -pairs, and -indices convert to lower case.
+ -words, -pairs, -reverse, and -indices convert to lower case.
Examples
@@ -457,10 +459,11 @@ Formatted Authors
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block PubDate -sep "-" -element Year,Month,MedlineDate \
-block Author -sep " " -tab "" \
- -element "&COM" Initials,LastName -COM "(, )"
+ -element "&COM" Initials,LastName -COM "(|)" |
+ perl -pe 's/(\t[^\t|]*)\|([^\t|]*)$/$1 and $2/; s/\|([^|]*)$/, and $1/; s/\|/, /g'
- 1413997 1992-Oct RK Mortimer, CR Contopoulou, JS King
- 6301692 1983-Apr MA Krasnow, NR Cozzarelli
+ 1413997 1992-Oct RK Mortimer, CR Contopoulou, and JS King
+ 6301692 1983-Apr MA Krasnow and NR Cozzarelli
781293 1976-Jul MJ Casadaban
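In this revised example the pipe registered in -COM is a placeholder separator: the trailing perl step turns a lone pipe between two authors into " and ", rewrites the final pipe of a longer list as ", and ", and converts any remaining pipes to ", ".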
Medical Subject Headings
@@ -1640,6 +1643,7 @@ const (
TERMS
WORDS
PAIRS
+ REVERSE
LETTERS
CLAUSES
INDICES
@@ -1697,6 +1701,7 @@ const (
ONEBASED
UCSCBASED
REVCOMP
+ NUCLEIC
ELSE
VARIABLE
VALUE
@@ -1794,6 +1799,7 @@ var argTypeIs = map[string]ArgumentType{
"-terms": EXTRACTION,
"-words": EXTRACTION,
"-pairs": EXTRACTION,
+ "-reverse": EXTRACTION,
"-letters": EXTRACTION,
"-clauses": EXTRACTION,
"-indices": EXTRACTION,
@@ -1823,6 +1829,7 @@ var argTypeIs = map[string]ArgumentType{
"-bed-based": EXTRACTION,
"-bed-coords": EXTRACTION,
"-revcomp": EXTRACTION,
+ "-nucleic": EXTRACTION,
"-else": EXTRACTION,
"-pfx": CUSTOMIZATION,
"-sfx": CUSTOMIZATION,
@@ -1854,6 +1861,7 @@ var opTypeIs = map[string]OpType{
"-terms": TERMS,
"-words": WORDS,
"-pairs": PAIRS,
+ "-reverse": REVERSE,
"-letters": LETTERS,
"-clauses": CLAUSES,
"-indices": INDICES,
@@ -1917,6 +1925,7 @@ var opTypeIs = map[string]OpType{
"-bed-based": UCSCBASED,
"-bed-coords": UCSCBASED,
"-revcomp": REVCOMP,
+ "-nucleic": NUCLEIC,
"-else": ELSE,
}
@@ -2692,8 +2701,8 @@ func ParseArguments(cmdargs []string, pttrn string) *Block {
op := &Operation{Type: status, Value: ""}
comm = append(comm, op)
status = UNSET
- case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED:
- case NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP:
+ case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED:
+ case NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP, NUCLEIC:
case TAB, RET, PFX, SFX, SEP, LBL, PFC, DEQ, PLG, ELG, WRP, DEF, COLOR:
case UNSET:
fmt.Fprintf(os.Stderr, "\nERROR: No -element before '%s'\n", str)
@@ -2872,8 +2881,8 @@ func ParseArguments(cmdargs []string, pttrn string) *Block {
switch status {
case UNSET:
status = nextStatus(str)
- case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
- NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP:
+ case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
+ NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP, NUCLEIC:
for !strings.HasPrefix(str, "-") {
// create one operation per argument, even if under a single -element statement
op := &Operation{Type: status, Value: str}
@@ -3349,6 +3358,27 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
return "", false
}
+ // reverseComplement reverse-complements a nucleotide sequence
+ reverseComplement := func(str string) string {
+
+ runes := []rune(str)
+ // reverse sequence letters - middle base in odd-length sequence is not touched, so cannot also complement here
+ for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
+ runes[i], runes[j] = runes[j], runes[i]
+ }
+ found := false
+ // now complement every base, also handling uracil, leaving case intact
+ for i, ch := range runes {
+ runes[i], found = revComp[ch]
+ if !found {
+ runes[i] = 'X'
+ }
+ }
+ str = string(runes)
+
+ return str
+ }
+
// processElement handles individual -element constructs
processElement := func(acc func(string)) {
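The revComp lookup table consulted by this closure is defined elsewhere in the EDirect sources and does not appear in this diff. A trimmed, self-contained sketch of the same technique, assuming a complement table limited to the unambiguous bases and uracil (the real table also covers IUPAC ambiguity codes):

  package main

  import "fmt"

  // illustrative complement table; an assumption standing in for
  // the revComp map defined elsewhere in xtract.go
  var revComp = map[rune]rune{
      'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'U': 'A',
      'a': 't', 'c': 'g', 'g': 'c', 't': 'a', 'u': 'a',
  }

  func reverseComplement(str string) string {
      runes := []rune(str)
      // reverse the sequence in place
      for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
          runes[i], runes[j] = runes[j], runes[i]
      }
      // then complement every base, preserving case
      for i, ch := range runes {
          if cmp, ok := revComp[ch]; ok {
              runes[i] = cmp
          } else {
              runes[i] = 'X' // unrecognized letters become X
          }
      }
      return string(runes)
  }

  func main() {
      fmt.Println(reverseComplement("ATGCat")) // -> atGCAT
  }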
@@ -3467,13 +3497,36 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
}
}
+ doRevComp := false
+ doUpCase := false
+ if status == NUCLEIC {
+ // -nucleic uses direction of range to decide between forward strand or reverse complement
+ if min+1 > max {
+ min, max = max-1, min+1
+ doRevComp = true
+ }
+ doUpCase = true
+ }
+
// numeric range now calculated, apply slice to string
if min == 0 && max == 0 {
+ if doRevComp {
+ str = reverseComplement(str)
+ }
+ if doUpCase {
+ str = strings.ToUpper(str)
+ }
acc(str)
} else if max == 0 {
if min > 0 && min < len(str) {
str = str[min:]
if str != "" {
+ if doRevComp {
+ str = reverseComplement(str)
+ }
+ if doUpCase {
+ str = strings.ToUpper(str)
+ }
acc(str)
}
}
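By the time this test runs, the [from:to] subrange has already been reduced to a half-open min/max slice, so a descending range is detectable as min+1 > max. A worked sketch of the swap (the earlier 1-based-to-0-based coordinate conversion is assumed, not shown in this hunk):

  package main

  import "fmt"

  func main() {
      // a descending subrange such as [7:3] (1-based, inclusive) is
      // assumed to arrive here as min=6, max=3
      str := "ACGTACGTAC"
      min, max := 6, 3
      doRevComp := false
      if min+1 > max { // 7 > 3: coordinates run backwards
          min, max = max-1, min+1 // becomes min=2, max=7
          doRevComp = true
      }
      // str[2:7] selects bases 3 through 7 of the forward strand;
      // doRevComp then marks the slice for reverse complementation
      fmt.Println(str[min:max], doRevComp) // -> GTACG true
  }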
@@ -3481,6 +3534,12 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
if max > 0 && max <= len(str) {
str = str[:max]
if str != "" {
+ if doRevComp {
+ str = reverseComplement(str)
+ }
+ if doUpCase {
+ str = strings.ToUpper(str)
+ }
acc(str)
}
}
@@ -3488,6 +3547,12 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
if min < max && min > 0 && max <= len(str) {
str = str[min:max]
if str != "" {
+ if doRevComp {
+ str = reverseComplement(str)
+ }
+ if doUpCase {
+ str = strings.ToUpper(str)
+ }
acc(str)
}
}
@@ -3501,8 +3566,8 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
sendSlice(str)
}
})
- case TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
- VALUE, LEN, SUM, MIN, MAX, SUB, AVG, DEV, MED, BIN, BIT, REVCOMP:
+ case TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
+ VALUE, LEN, SUM, MIN, MAX, SUB, AVG, DEV, MED, BIN, BIT, REVCOMP, NUCLEIC:
exploreElements(func(str string, lvl int) {
if str != "" {
sendSlice(str)
@@ -3744,7 +3809,7 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
buffer.WriteString(single)
between = sep
}
- case ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, VALUE, NUM, INC, DEC, ZEROBASED, ONEBASED, UCSCBASED:
+ case ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, VALUE, NUM, INC, DEC, ZEROBASED, ONEBASED, UCSCBASED, NUCLEIC:
processElement(func(str string) {
if str != "" {
ok = true
@@ -3955,20 +4020,7 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
if str != "" {
ok = true
buffer.WriteString(between)
- runes := []rune(str)
- // reverse sequence letters - middle base in odd-length sequence is not touched, so cannot also complement here
- for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
- runes[i], runes[j] = runes[j], runes[i]
- }
- found := false
- // now complement every base, also handling uracil, leaving case intact
- for i, ch := range runes {
- runes[i], found = revComp[ch]
- if !found {
- runes[i] = 'X'
- }
- }
- str = string(runes)
+ str = reverseComplement(str)
buffer.WriteString(str)
between = sep
}
@@ -4270,6 +4322,37 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
}
}
})
+ case REVERSE:
+ processElement(func(str string) {
+ if str != "" {
+
+ words := strings.FieldsFunc(str, func(c rune) bool {
+ return !unicode.IsLetter(c) && !unicode.IsDigit(c)
+ })
+ for lf, rt := 0, len(words)-1; lf < rt; lf, rt = lf+1, rt-1 {
+ words[lf], words[rt] = words[rt], words[lf]
+ }
+ for _, item := range words {
+ item = strings.ToLower(item)
+ if DeStop {
+ if IsStopWord(item) {
+ continue
+ }
+ }
+ if DoStem {
+ item = porter2.Stem(item)
+ item = strings.TrimSpace(item)
+ }
+ if item == "" {
+ continue
+ }
+ ok = true
+ buffer.WriteString(between)
+ buffer.WriteString(item)
+ between = sep
+ }
+ }
+ })
case LETTERS:
processElement(func(str string) {
if str != "" {
@@ -4447,8 +4530,8 @@ func ProcessInstructions(commands []*Operation, curr *Node, mask, tab, ret strin
str := op.Value
switch op.Type {
- case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
- NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP:
+ case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
+ NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP, NUCLEIC:
txt, ok := ProcessClause(curr, op.Stages, mask, tab, pfx, sfx, plg, sep, def, op.Type, index, level, variables, transform)
if ok {
plg = ""
@@ -4474,11 +4557,16 @@ func ProcessInstructions(commands []*Operation, curr *Node, mask, tab, ret strin
case LBL:
lbl := str
accum(tab)
+ accum(plg)
+ accum(pfx)
if plain {
accum(lbl)
} else {
printInColor(lbl)
}
+ accum(sfx)
+ plg = ""
+ lst = elg
tab = col
ret = lin
case PFC:
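With this change a -lbl value is emitted inside the same pending -plg/-pfx/-sfx decorations as ordinary element output, instead of being printed as a bare string, so labels align with prefixed or suffixed fields.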
@@ -5697,7 +5785,7 @@ func ProcessINSD(args []string, isPipe, addDash, doIndex bool) []string {
acc = append(acc, "-element", "INSDSeq_accession-version", "-clr", "-rst", "-tab", "\\n")
}
} else {
- acc = append(acc, "-pattern", "INSDSeq", "-ACCN", "INSDSeq_accession-version")
+ acc = append(acc, "-pattern", "INSDSeq", "-ACCN", "INSDSeq_accession-version", "-SEQ", "INSDSeq_sequence")
}
if doIndex {
@@ -5868,6 +5956,30 @@ func ProcessINSD(args []string, isPipe, addDash, doIndex bool) []string {
// report capitalization or vocabulary failure
checkAgainstVocabulary(str, "element", insdtags)
+ } else if str == "sub_sequence" {
+
+ // special sub_sequence qualifier shows sequence under feature intervals
+ acc = append(acc, "-block", "INSDFeature_intervals")
+ if isPipe {
+ acc = append(acc, "-lbl", "")
+ } else {
+ acc = append(acc, "-lbl", "\"\"")
+ }
+
+ acc = append(acc, "-subset", "INSDInterval", "-FR", "INSDInterval_from", "-TO", "INSDInterval_to")
+ if isPipe {
+ acc = append(acc, "-pfx", "", "-tab", "", "-nucleic", "&SEQ[&FR:&TO]")
+ } else {
+ acc = append(acc, "-pfx", "\"\"", "-tab", "\"\"", "-nucleic", "\"&SEQ[&FR:&TO]\"")
+ }
+
+ acc = append(acc, "-subset", "INSDFeature_intervals")
+ if isPipe {
+ acc = append(acc, "-lbl", "\\t")
+ } else {
+ acc = append(acc, "-lbl", "\"\\t\"")
+ }
+
} else {
acc = append(acc, "-block", "INSDQualifier")
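The -SEQ variable captured at the -pattern level holds the full INSDSeq_sequence, so sub_sequence is generated by slicing that variable with each interval's from/to coordinates; because -nucleic swaps descending coordinates and reverse-complements the slice, minus-strand intervals come out correctly oriented. A usage sketch (the accession is illustrative):

  efetch -db nuccore -id U54469.1 -format gbc |
  xtract -insd CDS gene sub_sequence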
@@ -8364,7 +8476,7 @@ func main() {
// -e2index shortcut for experimental indexing code (documented in rchive.go)
if args[0] == "-e2index" {
- // e.g., xtract -transform meshtree.txt -e2index
+ // e.g., xtract -transform "$EDIRECT_MESH_TREE" -e2index
args = args[1:]
View it on GitLab: https://salsa.debian.org/med-team/ncbi-entrez-direct/compare/0ad84a2b84b29a5b58bbc04eac8d19cda6213f1a...1c42e99e91bbfb976d95038ed72113cc6d224482