[med-svn] [Git][med-team/ncbi-entrez-direct][master] 4 commits: New upstream version 14.0.20201030+dfsg

Aaron M. Ucko gitlab at salsa.debian.org
Mon Nov 2 02:33:55 GMT 2020



Aaron M. Ucko pushed to branch master at Debian Med / ncbi-entrez-direct


Commits:
b3b5834e by Aaron M. Ucko at 2020-11-01T15:07:43-05:00
New upstream version 14.0.20201030+dfsg
- - - - -
fa43ea4b by Aaron M. Ucko at 2020-11-01T15:08:54-05:00
Merge tag 'upstream/14.0.20201030+dfsg' into master

Upstream version 14.0.20201030(+dfsg).

- - - - -
657e422a by Aaron M. Ucko at 2020-11-01T15:14:03-05:00
debian/man/efetch.1: Update for new release.

Only the Perl implementation advertises -style withparts nowadays.

- - - - -
3d9114ce by Aaron M. Ucko at 2020-11-01T21:24:41-05:00
Finalize ncbi-entrez-direct 14.0.20201030+dfsg-1 for unstable.

- - - - -


17 changed files:

- README
- debian/changelog
- debian/man/efetch.1
- ecommon.sh
- edirect.pl
- efetch
- efilter
- einfo
- elink
- epost
- esearch
- esummary
- hlp-xtract.txt
- nquire
- test-edirect
- test-eutils
- xtract.go


Changes:

=====================================
README
=====================================
@@ -30,20 +30,26 @@ Esearch performs a new Entrez search using terms in indexed fields. It requires
 
 Search terms can also be qualified with bracketed field names:
 
+  esearch -db pubmed -query "Federhen S [AUTH] AND Nucleic Acids Res [JOUR]"
+
   esearch -db nuccore -query "insulin [PROT] AND rodents [ORGN]"
 
-Elink looks up precomputed neighbors within a database, or finds associated records in other databases:
+Elink looks up precomputed neighbors within a database:
 
   elink -related
 
+or finds associated records in other databases:
+
   elink -target gene
 
-or follows PubMed references in the NIH Open Citation Collection dataset (see PMID 31600197):
-
-  elink -cited
+Elink also accesses the NIH Open Citation Collection dataset (see PMID 31600197) to follow the reference lists of PubMed records:
 
   elink -cites
 
+or to find publications that cite the selected PubMed articles:
+
+  elink -cited
+
 Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:
 
   efilter -molecule genomic -location chloroplast -country sweden -days 365
@@ -58,19 +64,25 @@ Individual query commands are connected by a Unix vertical bar pipe symbol:
 
 DISCOVERY BY NAVIGATION
 
-PubMed related articles are calculated by a statistical text retrieval algorithm using the title, abstract, and medical subject headings (MeSH terms). The connections between papers can be used for making discoveries. A simple example is finding the last enzymatic step in the vitamin A biosynthetic pathway.
+PubMed related articles are calculated by a statistical text retrieval algorithm using the title, abstract, and medical subject headings (MeSH terms). The connections between papers can be used for making discoveries. An example of this is finding the last enzymatic step in the vitamin A biosynthetic pathway.
 
-Lycopene cyclase in plants converts lycopene to beta-carotene, the immediate precursor of vitamin A. An initial search on the enzyme finds 256 articles. Looking up precomputed neighbors:
+Lycopene cyclase in plants converts lycopene to β-carotene, the immediate dietary precursor of vitamin A. An initial search on the enzyme:
 
   esearch -db pubmed -query "lycopene cyclase" |
+
+finds 258 articles. Looking up precomputed neighbors:
+
   elink -related |
 
-returns 15,661 PubMed papers, some of which might be expected to discuss adjacent steps in the pathway. Since plants cannot convert beta-carotene to retinal, we can eliminate all earlier steps by restricting those results to enzymes in animals. To do so, we first link from publications to proteins, which are annotated with standardized organism information from the NCBI taxonomy. This finds 609,744 sequence records. Limiting these results to curated proteins in mice matches only 24 records:
+returns 15,827 PubMed papers, some of which might be expected to discuss adjacent steps in the pathway. Since beta-carotene is an essential nutrient, we can eliminate all earlier steps by restricting those results to enzymes in animals. To do so, we first link from publications to proteins, all of which are annotated with standardized organism information from the NCBI taxonomy:
 
   elink -target protein |
+
+This finds 644,215 protein sequence records. Limiting these results to curated proteins in mice:
+
   efilter -organism mouse -source refseq |
 
-This is small enough to examine individually, so we retrieve the records in FASTA format:
+matches only 34 sequences, which is small enough to examine by retrieving the individual records:
 
   efetch -format fasta
 
@@ -89,13 +101,13 @@ XML DATA EXTRACTION
 
 The ability to obtain Entrez records in structured format, and to easily extract the underlying data, allows the user to ask novel questions that are not addressed by existing analysis software.
 
-The xtract program uses command-line arguments to direct the conversion of XML data into a user-specified form. The -pattern command partitions an XML stream into individual records that are processed separately. Within each record, the -element command does an exhaustive, depth-first search to find data content by field name. Explicit paths to objects are not needed.
+The xtract program uses command-line arguments to direct the conversion of eXtensible Markup Language data. The -pattern command partitions an XML stream by object name into individual records that are processed separately. Within each record, the -element command does an exhaustive, depth-first search to find data content by field name. Explicit paths to objects are not needed.
 
 FORMAT CUSTOMIZATION
 
 By default, the -pattern argument divides the results into rows, while placement of data into columns is controlled by -element, to create a tab-delimited table.
 
-Formatting commands allow extensive customization of the output. The line break between -pattern rows can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab command. The following query:
+Formatting commands allow extensive customization of the output. The line break between -pattern rows can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep command is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab command. The following query:
 
   efetch -db pubmed -id 6271474,6092233,16589597 -format docsum |
   xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name
@@ -106,9 +118,29 @@ returns a table with individual author names separated by vertical bars:
   6092233     1984 Jul-Aug    Calderon IL|Contopoulou CR|Mortimer RK
   16589597    1954 Dec        Garber ED
 
-Derivatives of -element that may modify data values include positional commands (-first and -last), numeric operations (including -num, -len, -inc, -sum, -min, -max, and -avg), text processing variants (such as -encode, -plain, -upper, -title, and -words), and functions that perform sequence or coordinate conversion (-revcomp, -fasta, -ncbi2na, -0-based, -1-based, and -ucsc-based).
+The -sep value also applies to distinct -element arguments that are grouped with commas. This can be used to keep data from multiple related fields in the same column:
+
+  -sep " " -element Initials,LastName
+
+The -def command sets a default placeholder to be printed when none of the comma-separated fields in an -element clause are present:
+
+  -def "-" -sep " " -element Year,Month,MedlineDate
+
+ELEMENT VARIANTS
+
+Derivatives of -element were created to eliminate the need for writing post-processing scripts to perform otherwise trivial modifications or analyses on extracted data. They are subdivided into several categories. A representative selection of commands is shown below:
+
+  Positional:       -first, -last
+
+  Numeric:          -num, -len, -inc, -dec, -sum, -min, -max, -avg, -dev, -med
+
+  Text:             -encode, -plain, -upper, -lower, -title, -words, -reverse
+
+  Sequence:         -revcomp, -fasta, -ncbi2na, -molwt
 
-The -def command sets a default placeholder to be printed when an -element field is not present.
+  Coordinates:      -0-based, -1-based, -ucsc-based
+
+  Miscellaneous:    -year, -histogram
 
 EXPLORATION CONTROL
 
@@ -141,17 +173,15 @@ Each time through the loop, the -element command only sees the current author's
 
   RK    Mortimer    CR    Contopoulou    JS    King
 
-The -sep value also applies to unrelated -element arguments that are grouped with commas:
+Grouping the two author subfields with a comma, and adjusting the -sep and -tab values:
 
   xtract -pattern PubmedArticle \
     -block Author -sep " " -tab ", " -element Initials,LastName
 
-allowing -sep -and -tab to produce a more desirable formatting of author names:
+produces a more desirable formatting of author names:
 
   RK Mortimer, CR Contopoulou, JS King
 
-If set, the -def value would be printed if none of the comma-separated fields produced any output.
-
 NESTED EXPLORATION
 
 Exploration command names (-group, -block, and -subset) are assigned to a precedence hierarchy:
@@ -191,7 +221,7 @@ Adding -subset commands within the -block visits each individual descriptor or q
         -subset QualifierName -plg " / " -tab "" \
           -translate "@MajorTopicYN" -element QualifierName
 
-and keeps major topic attributes associated with their parent objects. A text translation command converts the "Y" attribute value to an asterisk for printing:
+and keeps major topic attributes associated with their parent objects. The -transform and -translate commands convert the "Y" attribute value to an asterisk for indicating major topics:
 
   6162838
   Base Sequence
@@ -206,7 +236,7 @@ and keeps major topic attributes associated with their parent objects. A text tr
 
 CONDITIONAL EXECUTION
 
-Conditional processing commands (-if, -unless, -and, -or, and -else) restrict exploration by object name and value. These may be used in conjunction with string or numeric constraints:
+Conditional processing commands (-if, -unless, -and, -or, and -else) restrict object exploration by data content. These may be used in conjunction with string, numeric, or object constraints:
 
   esearch -db pubmed -query "Casadaban MJ [AUTH]" |
   efetch -format xml |
@@ -215,13 +245,25 @@ Conditional processing commands (-if, -unless, -and, -or, and -else) restrict ex
       -sep ", " -tab "\n" -element LastName,Initials |
   sort-uniq-count-rank
 
-to select papers with fewer than 6 authors and print a table of the most frequent coauthors:
+This will select papers with fewer than 6 authors and print a table of the most frequent coauthors:
 
   11    Chou, J
   8     Cohen, SN
   7     Groisman, EA
   ...
 
+Numeric constraints can also compare the integer values of two fields. This can be used to find genes that are encoded on the minus strand of a nucleotide sequence:
+
+  -if ChrStart -gt ChrStop
+
+Object constraints will compare the string values of two named fields, and can look for internal inconsistencies between fields whose contents should (in most cases) be identical:
+
+  -if Chromosome -differs-from ChrLoc
+
+The -position command restricts presentation of objects by relative location or index number:
+
+  -block Author -position last -sep ", " -element LastName,Initials
+
 SAVING DATA IN VARIABLES
 
 A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:
@@ -231,7 +273,7 @@ A value can be recorded in a variable and used wherever needed. Variables are cr
     -block Author -element "&PMID" \
       -sep " " -tab "\n" -element Initials,LastName
 
-producing a list of authors, with the PubMed identifier in the first column of each row:
+This produces a list of authors, with the PubMed identifier in the first column of each row:
 
   3201829    JR Johnston
   3201829    CR Contopoulou
@@ -242,11 +284,19 @@ producing a list of authors, with the PubMed identifier in the first column of e
 
 The variable can be used even though the original object is no longer visible inside the -block section.
 
+Variables can be initialized with a literal value inside parentheses:
+
+  -block Author -sep " " -tab "" -element "&COM" Initials,LastName -COM "(, )"
+
+They can also be used as arguments in a conditional statement:
+
+  -ABST AbstractText -block Article -if "Journal/Title" -is-within "&ABST"
+
 SEQUENCE QUALIFIERS
 
-The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which contain information about the biology of a given region, including the transformations involved in gene expression. Each feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation, accession of the product sequence).
+The NCBI represents sequence records in a data model based on the central dogma of molecular biology. Each sequence can have multiple features, which contain information about the biology of a given region, including the transformations involved in gene expression. Each feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation, accession of the product sequence, cross-references to external databases).
 
-The data hierarchy is explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
+The data hierarchy is explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, the -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
 
   esearch -db protein -query "conotoxin" -feature mat_peptide |
   efetch -format gpc |
@@ -260,27 +310,24 @@ prints the accession number, peptide length, product name, calculated molecular
   AIC77105.1    17    conotoxin Lt1.4       1705    GCCSHPACDVNNPDICG
   ADB43129.1    18    conotoxin Cal 5.2     2008    MIQRSQCCAVKKNCCHVG
   ADD97803.1    20    conotoxin Cal 1.2     2206    AGCCPTIMYKTGACRTNRCR
-  AIC77085.1    21    conotoxin Bt14.8      2574    NECDNCMRSFCSMIYEKCRLK
-  ADB43125.1    22    conotoxin Cal 14.2    2157    GCPADCPNTCDSSNKCSPGFPG
-  AIC77154.1    23    conotoxin Bt14.19     2578    VREKDCPPHPVPGMHKCVCLKTC
   ...
 
 GENES IN A REGION
 
-To list all genes between two markers flanking the human X chromosome centromere, first retrieve the chromosome record:
+To list all genes between two markers flanking the human X chromosome centromere, first retrieve the protein-coding gene records on that chromosome:
 
   esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
   efilter -status alive -type coding | efetch -format docsum |
 
-Gene names and chromosomal positions are extracted by piping the record to:
+Gene names and chromosomal positions are extracted by piping the records to:
 
-  xtract -pattern DocumentSummary -NME Name -DSC Description \
+  xtract -pattern DocumentSummary -NAME Name -DESC Description \
     -block GenomicInfoType -if ChrLoc -equals X \
-      -min ChrStart,ChrStop -element "&NME" "&DSC" |
+      -min ChrStart,ChrStop -element "&NAME" "&DESC" |
 
 Exploring each GenomicInfoType is needed because of pseudoautosomal regions at the ends of the X and Y chromosomes. Without limiting to chromosome X, the copy of IL9R near the "q" telomere of chromosome Y would be erroneously placed with genes that are near the X chromosome centromere.
 
-Results can now be sorted, filtered, and subdivided:
+Results can now be sorted by position, and then filtered and partitioned:
 
   sort -k 1,1n | cut -f 2- |
   grep -v pseudogene | grep -v uncharacterized |
@@ -310,15 +357,13 @@ Linking from a pathway record back to the gene database:
   elink -target gene |
   efetch -format docsum |
   xtract -pattern DocumentSummary -element Name Description |
-  grep -v pseudogene | grep -v uncharacterized |
-  sort -f
+  grep -v pseudogene | grep -v uncharacterized | sort -f
 
 returns the set of all genes known to be involved in the pathway:
 
-  AANAT      aralkylamine N-acetyltransferase
-  ACADM      acyl-CoA dehydrogenase medium chain
-  ACHE       acetylcholinesterase (Cartwright blood group)
-  ADCYAP1    adenylate cyclase activating polypeptide 1
+  AANAT    aralkylamine N-acetyltransferase
+  ACADM    acyl-CoA dehydrogenase medium chain
+  ACHE     acetylcholinesterase (Cartwright blood group)
   ...
 
 GENE SEQUENCE
@@ -379,7 +424,7 @@ When a recursively defined object is given to an exploration command:
   efetch -db taxonomy -id 9606,7227,10090 -format xml |
   xtract -pattern Taxon -element TaxId ScientificName
 
-the -element command only examines fields in the outermost layer:
+the -element command only examines fields in the topmost level:
 
   9606     Homo sapiens
   7227     Drosophila melanogaster
@@ -424,7 +469,7 @@ This prints every genomic accession regardless of nesting depth:
 
 HETEROGENEOUS OBJECTS
 
-The nquire program uses command-line arguments to request data from external CGI or FTP servers. A query on curated biological database associations (see PMID 23175613):
+The nquire program uses command-line arguments to request data from external CGI or FTP servers. A query on a curated biological database service developed at the Scripps Research Institute (see PMID 23175613):
 
   nquire -get http://mygene.info/v3/gene/2652 |
   xtract -j2x -set - -rec GeneRec |
@@ -467,7 +512,7 @@ Namespace prefixes are indicated by a colon, and a leading colon matches any pre
   nquire -url "http://webservice.wikipathways.org" getPathway -pwId WP455 |
   xtract -pattern "ns1:getPathwayResponse" -decode ":gpml" |
 
-The -decode argument converts Base64-encoded data back to its original binary form. In this case, encoding was used to embed Graphical Pathway Markup Language inside another XML object:
+The -decode command converts Base64-encoded data back to its original binary form. In this case, encoding was used to embed Graphical Pathway Markup Language text inside another XML object:
 
   xtract -pattern Pathway -block Xref \
     -if @Database -equals "Entrez Gene" \
@@ -475,7 +520,7 @@ The -decode argument converts Base64-encoded data back to its original binary fo
 
 JSON TO XML
 
-Consolidated gene information retrieved in JSON format:
+Consolidated gene information for human beta-globin retrieved in JavaScript Object Notation format:
 
   nquire -get http://mygene.info/v3 gene 3043 |
 
@@ -531,16 +576,15 @@ Tab-delimited data is easily converted to XML with xtract -t2x:
 
 This takes a series of command-line arguments with tag names for wrapping the individual columns, and skips the first line of input, which contains header information, to generate a new XML file:
 
-  <Set>
-    <Rec>
-      <Code>1246500</Code>
-      <Name>repA1</Name>
-    </Rec>
-    <Rec>
-      <Code>1246501</Code>
-      <Name>repA2</Name>
-    </Rec>
-    ...
+  <Rec>
+    <Code>1246500</Code>
+    <Name>repA1</Name>
+  </Rec>
+  <Rec>
+    <Code>1246501</Code>
+    <Name>repA2</Name>
+  </Rec>
+  ...
 
 Similarly, xtract -c2x will convert comma-separated values (CSV) files to XML.
 
@@ -556,11 +600,11 @@ Nquire can also produce FTP directory listings and save FTP files to a local dis
 The latest GenPept incremental update file can be parsed into XML with xtract -g2x:
 
   nquire -ftp ftp.ncbi.nlm.nih.gov genbank daily-nc Last.File |
-  sed 's/flat/gnp/g' |
+  sed "s/flat/gnp/g" |
   nquire -ftp ftp.ncbi.nlm.nih.gov genbank daily-nc |
   gunzip -c | xtract -g2x > latest_genpept.xml
 
-Accessions can be filtered by organism name or taxon identifier and saved to a file:
+Records can be filtered by organism name or taxon identifier, with their accessions saved to a file:
 
   cat latest_genpept.xml | xtract -insd INSDSeq_organism source db_xref |
   grep -w "taxon:2697049" | cut -f 1 > accn_subset.txt
@@ -576,11 +620,11 @@ such as generating the sequences of individual mature peptides derived from poly
 
 INDEXED FIELDS
 
-Entrez can report the fields and links that are indexed for each database. For example:
+The einfo command can report the fields and links that are indexed for each database:
 
   einfo -db protein -fields
 
-will return a table of field abbreviations and names indexed for proteins:
+This will return a table of field abbreviations and names indexed for proteins:
 
   ACCN    Accession
   ALL     All Fields
@@ -592,13 +636,11 @@ will return a table of field abbreviations and names indexed for proteins:
   ECNO    EC/RN Number
   FILT    Filter
   FKEY    Feature key
-  GENE    Gene Name
-  GPRJ    BioProject
   ...
 
 LOCAL PUBMED CACHE
 
-Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor. EDirect can now preload all 31 million PubMed records onto an inexpensive external 500 GB solid state drive for rapid retrieval.
+Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor. EDirect can now preload over 30 million live PubMed records onto an inexpensive external 500 GB solid state drive for rapid retrieval.
 
 For example, PMID 12345678 would be stored (as a compressed XML file) at:
 
@@ -618,7 +660,7 @@ to download the PubMed release files and distribute each record on the drive. Th
 
 The local archive is a completely self-contained turnkey system, with no need for the user to download and configure complicated third-party database software.
 
-Retrieving a PubmedArticleSet containing over 120,000 PubMed records from the local archive:
+Retrieving a set containing over 120,000 PubMed records from the local archive:
 
   esearch -db pubmed -query "PNAS [JOUR]" -pub abstract |
   efetch -format uid | stream-pubmed | gunzip -c |
@@ -629,7 +671,7 @@ Even modest sets of PubMed query results can benefit from using the local cache.
 
   esearch -db pubmed -query "Cozzarelli NR [AUTH]" | elink -cited |
 
-requires 7 seconds to match 7468 subsequent articles. Fetching them from the local archive:
+requires 7 seconds to match 7485 subsequent articles. Fetching them from the local archive:
 
   efetch -format uid | fetch-pubmed |
 
@@ -648,7 +690,7 @@ that lists the authors who most often cited the original papers:
   76     Maxwell A
   58     Wang JC
   52     Osheroff N
-  51     Stasiak A
+  52     Stasiak A
   ...
 
 Fetching from the network service would extend the 7 second running time to over 2 minutes.
@@ -667,7 +709,7 @@ For example, the term list that includes "cancer" would be located at:
 
   /Postings/NORM/c/a/n/c/canc.trm
 
-A query on cancer thus only needs to load a very small subset of the total index. This design allows efficient expression evaluation, unrestricted wildcard truncation, phrase queries, and proximity searches.
+A query on cancer thus only needs to load a very small subset of the total index. This design combines efficient expression evaluation, unrestricted wildcard truncation, phrase queries, and proximity searches.
 
 The phrase-search script provides access to the local search system.
 
@@ -707,7 +749,7 @@ Runs of tildes indicate the maximum distance between sequential phrases:
 
   phrase-search -query "vitamin c ~ ~ common cold"
 
-MeSH hierarchy code and year of publication are also indexed:
+MeSH identifier code, MeSH hierarchy key, and year of publication are also indexed:
 
   phrase-search -query "C14.907.617.812* [TREE] AND 2015:2019 [YEAR]"
 
@@ -726,13 +768,12 @@ All query commands return a list of PMIDs, which can be piped directly to fetch-
   tee /dev/tty |
   xy-plot auth.png
 
-performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 12,454 PubMed records from the local archive. It then counts the number of authors for each paper, printing a frequency table of the number of papers per number of coauthors:
+performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 12,555 PubMed records from the local archive. It then counts the number of authors for each paper, printing a frequency table of the number of papers per number of coauthors:
 
   0    49
-  1    1367
-  2    1851
-  3    1859
-  4    1698
+  1    1372
+  2    1859
+  3    1870
   ...
 
 and creating a visual graph of the data. The entire set of commands runs in under 4 seconds.
@@ -744,17 +785,16 @@ RAPIDLY SCANNING PUBMED
 If the expand-current script is run after archive-pubmed or index-pubmed, an ad hoc scan can be performed on the entire set of live PubMed records:
 
   cat $EDIRECT_PUBMED_MASTER/Current/*.xml |
-  xtract -timer -pattern PubmedArticle \
-    -if "#Author" -eq 7 \
-      -element MedlineCitation/PMID LastName
+  xtract -timer -pattern PubmedArticle -PMID MedlineCitation/PMID \
+    -group AuthorList -if "#LastName" -eq 7 -element "&PMID" LastName
 
-in this case finding articles with seven authors. (Author count is not indexed by Entrez or locally by EDirect.)
+in this case finding 1,613,792 articles with seven authors. (Author count is not indexed by Entrez or EDirect. This query excludes consortia and additional named investigators.)
 
-Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, sending them down a thread-safe communication channel to be distributed among multiple instances of the data exploration and extraction function. On a modern six-core computer with a fast solid state drive, it can process the full set of 31 million PubMed records in just over 3 minutes, a sustained rate of over 160,000 records per second.
+Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, distributing them among multiple instances of the data exploration and extraction function for concurrent execution. On a modern six-core computer with a fast solid state drive, it can process the full set of PubMed records in under 4 minutes.
 
 IDENTIFIER CONVERSION
 
-The index-pubmed script also downloads MeSH descriptor information from NLM and creates a conversion file:
+The index-pubmed script also downloads MeSH descriptor information from the NLM ftp server and generates a conversion file:
 
   ...
   <Rec>
@@ -794,6 +834,8 @@ which wraps element contents in new XML tags by issuing several other formatting
 
   -pfx "<Tree>" -sep "</Tree><Tree>" -sfx "</Tree>"
 
+and also ensures that data values containing encoded angle brackets, ampersands, quotation marks, or apostrophes, remain properly encoded within the new XML.
+
 NATURAL LANGUAGE PROCESSING
 
 Additional annotation on PubMed can be downloaded and indexed by running:
@@ -802,7 +844,7 @@ Additional annotation on PubMed can be downloaded and indexed by running:
 
 NCBI's Biomedical Text Mining Group performs computational analysis of PubMed and PMC papers, and extracts chemical, disease, and gene references from article contents (see PMID 31114887). Along with NLM Gene Reference Into Function mappings (see PMID 14728215), these terms are indexed in CHEM, DISZ, and GENE fields.
 
-Recent research at Stanford University defined biological themes, supported by dependency paths, which are indexed as THME and PATH fields. Theme keys in the Global Network of Biomedical Relationships are taken from a table in the paper (see PMID 29490008):
+Recent research at Stanford University defined biological themes, supported by dependency paths, which are indexed in THME, PATH, and CONV fields. Theme keys in the Global Network of Biomedical Relationships are taken from a table in the paper (see PMID 29490008):
 
   A+    Agonism, activation                      N     Inhibits
   A-    Antagonism, blocking                     O     Transport, channels
@@ -841,7 +883,7 @@ This finds PubMed papers about complement proteins and limits them by the "impro
   1467432
   ...
 
-Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query, or uploaded to the Entrez history server by piping to:
+Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query. They can also be uploaded to the Entrez history server by piping to epost:
 
   epost -db pubmed
 

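The numeric constraint added to the README's CONDITIONAL EXECUTION section (-if ChrStart -gt ChrStop) keeps records whose start coordinate is larger than the stop coordinate, which the README uses to find genes on the minus strand. A minimal sketch of the same integer comparison outside xtract, assuming a tab-delimited table of name, start, and stop. The minus_strand function name and the coordinate values are made up for illustration, and awk stands in for xtract here:

```shell
# minus_strand is a hypothetical helper, not an EDirect command.
# It reads tab-delimited rows (name, start, stop) on stdin and prints
# only the rows whose start coordinate exceeds the stop coordinate.
minus_strand() {
  # Adding 0 forces a numeric rather than a string comparison.
  awk -F '\t' '$2 + 0 > $3 + 0'
}

# Made-up sample rows: geneA runs forward, geneB runs in reverse.
printf 'geneA\t100\t900\ngeneB\t5000\t4200\n' | minus_strand
```

Only the geneB row passes the filter, since it is the only row whose second column is larger than its third.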

=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+ncbi-entrez-direct (14.0.20201030+dfsg-1) unstable; urgency=medium
+
+  * New upstream release.
+  * debian/man/efetch.1: Update for new release.
+
+ -- Aaron M. Ucko <ucko at debian.org>  Sun, 01 Nov 2020 21:24:40 -0500
+
 ncbi-entrez-direct (14.0.20201015+dfsg-1) unstable; urgency=medium
 
   * New upstream release.


=====================================
debian/man/efetch.1
=====================================
@@ -1,4 +1,4 @@
-.TH EFETCH 1 2020-10-12 NCBI "NCBI Entrez Direct User's Manual"
+.TH EFETCH 1 2020-11-01 NCBI "NCBI Entrez Direct User's Manual"
 .SH NAME
 efetch, esummary \- retrieve results from an NCBI Entrez search
 .SH SYNOPSIS
@@ -87,7 +87,8 @@ or
 .TP
 \fB\-style\fP\ \fIstyle\fP
 \fBmaster\fP (shell implementation only),
-\fBwithparts\fP, or \fBconwithfeat\fP.
+\fBwithparts\fP (Perl implementation only),
+or \fBconwithfeat\fP.
 .SS Direct Record Selection
 .TP
 \fB\-db\fP\ \fIname\fP


=====================================
ecommon.sh
=====================================
@@ -388,7 +388,7 @@ ParseCommonArgs() {
         echo "$version"
         exit 0
         ;;
-      -newmode )
+      -newmode | -oldmode )
         argsConsumed=$(($argsConsumed + 1))
         shift
         ;;
@@ -605,7 +605,7 @@ RequestWithRetry() {
         ;;
       *\<eFetchResult\>* | *\<eSummaryResult\>*  | *\<eSearchResult\>*  | *\<eLinkResult\>* | *\<ePostResult\>* | *\<eInfoResult\>* )
         case "$res" in
-          *\<ERROR\>*  )
+          *\<ERROR\>* )
             ref=$( echo "$res" | xtract -format -doctype "" )
             ErrorHead "$warn" "$when"
             PrintQuery "$@"
@@ -620,7 +620,7 @@ RequestWithRetry() {
             # retry query
             res=$( "$@" )
             ;;
-          *\<error\>*  )
+          *\<error\>* )
             ref=$( echo "$res" | xtract -format -doctype "" )
             ErrorHead "$warn" "$when"
             PrintQuery "$@"
@@ -635,7 +635,7 @@ RequestWithRetry() {
             # retry query
             res=$( "$@" )
             ;;
-          *\<ErrorList\>*  )
+          *\<ErrorList\>* )
             ref=$( echo "$res" | xtract -format -doctype "" )
             # question mark prints names of heterogeneous child objects
             errs=$( echo "$res" | xtract -pattern "ErrorList/*" -element "?" )
@@ -658,7 +658,7 @@ RequestWithRetry() {
               res=$( "$@" )
             fi
             ;;
-          *\"error\":*  )
+          *\"error\":* )
             ref=$( echo "$res" | xtract -format -doctype "" )
             ErrorHead "$warn" "$when"
             PrintQuery "$@"
@@ -668,12 +668,32 @@ RequestWithRetry() {
             # retry query
             res=$( "$@" )
             ;;
+          *"<DocumentSummarySet status=\"OK\"><!--"* )
+            # 'DocSum Backend failed' message embedded in comment
+            ErrorHead "$warn" "$when"
+            PrintQuery "$@"
+            ErrorTail "$res" "$whch"
+            sleep 1
+            when=$( date )
+            # retry query
+            res=$( "$@" )
+            ;;
           * )
             # success - no error message detected
             goOn=false
             ;;
         esac
         ;;
+      *"<DocumentSummarySet status=\"OK\"><!--"* )
+        # docsum with comment not surrounded by wrapper
+        ErrorHead "$warn" "$when"
+        PrintQuery "$@"
+        ErrorTail "$res" "$whch"
+        sleep 1
+        when=$( date )
+        # retry query
+        res=$( "$@" )
+        ;;
       * )
         # success for non-structured or non-EUtils-XML result
         goOn=false
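
Note on the ecommon.sh hunk above: both new case arms in RequestWithRetry key on the same literal prefix, a DocumentSummarySet opening tag with status="OK" followed immediately by an XML comment. The pattern can be tried in isolation with the sketch below. It is not part of the commit: the function name IsCommentedDocsum and the two sample replies are made up for illustration, and the real function goes on to log the failure, sleep, and retry.

```shell
#!/bin/sh

# Hypothetical helper, not in ecommon.sh: succeeds (returns 0) when a reply
# matches the glob pattern used by the two new case arms.
IsCommentedDocsum() {
  case "$1" in
    *"<DocumentSummarySet status=\"OK\"><!--"* )
      return 0
      ;;
    * )
      return 1
      ;;
  esac
}

# Illustrative replies only, not captured from a real server.
bad='<DocumentSummarySet status="OK"><!-- DocSum Backend failed -->'
good='<DocumentSummarySet status="OK"><DocumentSummary>'

for res in "$bad" "$good"
do
  if IsCommentedDocsum "$res"
  then
    echo "retry"
  else
    echo "accept"
  fi
done
# prints "retry" then "accept"
```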


=====================================
edirect.pl
=====================================
@@ -182,6 +182,7 @@ sub clearflags {
   $name = "";
   $neighbor = false;
   $num = "";
+  $oldmode = false;
   $organism = "";
   $output = "";
   $page = "";
@@ -1182,6 +1183,7 @@ sub ecntc {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -1750,6 +1752,7 @@ sub efilt {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -2730,6 +2733,7 @@ sub eftch {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -3509,6 +3513,7 @@ sub einfo {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -4072,6 +4077,7 @@ sub elink {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -4525,6 +4531,7 @@ sub entfy {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -4700,6 +4707,7 @@ sub epost {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -4979,6 +4987,7 @@ sub espel {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -5078,6 +5087,7 @@ sub ecitmtch {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -5195,6 +5205,7 @@ sub eprxy {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -5467,6 +5478,7 @@ sub esrch {
     "verbose" => \$verbose,
     "debug" => \$debug,
     "internal" => \$internal,
+    "oldmode" => \$oldmode,
     "dev" => \$dev,
     "ext" => \$external,
     "log" => \$log,
@@ -6359,6 +6371,11 @@ sub do_nquire_post {
 
   $rslt = "";
 
+  $nquir_debug = false;
+  if (defined $ENV{NQUIRE_DEBUG} && $ENV{NQUIRE_DEBUG} eq "true" ) {
+    $nquir_debug = true;
+  }
+
   if ( $debug ) {
     if ( $argx ne "" ) {
       print STDERR "URL: $urlx?$argx\n\n";
@@ -6392,6 +6409,10 @@ sub do_nquire_post {
       print STDERR "$rslt\n";
     }
 
+    if ( $nquir_debug ) {
+      print STDERR "DEBUG: " . $res->status_line . "\n";
+    }
+
     return $rslt;
   }
 
@@ -6422,6 +6443,10 @@ sub do_nquire_post {
     print STDERR "$rslt\n";
   }
 
+  if ( $nquir_debug ) {
+    print STDERR "DEBUG: " . $res->status_line . "\n";
+  }
+
   return $rslt;
 }
 
@@ -6844,6 +6869,15 @@ sub nqir {
 
   $i = 0;
 
+  # ignore -oldmode flag during transition to redesigned EDirect
+
+  if ( $i < $max ) {
+    $pat = $args[$i];
+    if ( $pat eq "-oldmode" ) {
+      $i++;
+    }
+  }
+
   # if present, -debug must be first argument, only prints generated URL (undocumented)
 
   if ( $i < $max ) {


=====================================
efetch
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -fetch "$@"
@@ -94,7 +111,7 @@ Format Selection
 
   -format        Format of record or report
   -mode          text, xml, asn.1, json
-  -style         master, withparts, conwithfeat
+  -style         master, conwithfeat
 
 Direct Record Selection
 
@@ -250,6 +267,8 @@ Examples
 
   efetch -db nuccore -id CM000177.6 -format gb -style conwithfeat -showgaps
 
+  efetch -db nuccore -id 1121073309 -format gbc -style master
+
   esearch -db protein -query "conotoxin AND mat_peptide [FKEY]" |
   efetch -format gpc |
   xtract -insd complete mat_peptide "%peptide" product mol_wt peptide |
@@ -562,7 +581,7 @@ isSequence=false
 isFasta=false
 
 case "$dbase" in
-  nucleotide | nuccore | est | gss | protein )
+  nucleotide | nuccore | protein )
     isSequence=true
     ;;
 esac
@@ -596,12 +615,11 @@ case "$style" in
     style=master
     ;;
   conwithfeat | conwithfeats | contigwithfeat | gbconwithfeat | gbconwithfeats )
-    format=gb
     style=conwithfeat
     ;;
   withpart | withparts | gbwithpart | gbwithparts )
-    format="gbwithparts"
-    style=""
+    # accept from old scripts - same result as style master
+    style=withparts
     ;;
   "" )
     ;;
@@ -612,14 +630,6 @@ case "$style" in
 esac
 
 case "$format:$mode" in
-  gbconwithfeat:* | gbconwithfeats:* )
-    format=gb
-    style=conwithfeat
-    ;; 
-  gbwithpart:* | gbwithparts:* )
-    format="gbwithparts"
-    style=""
-    ;;
   gbc: | gpc: )
     mode=xml
     ;;
@@ -839,7 +849,7 @@ case "$format:$dbase:$mode:$isSequence" in
   *                 ) chunk=1000  ;;
 esac
 
-if [ "$format" = "gbwithparts" ] || [ "$style" = "conwithfeat" ]
+if [ "$style" = "master" ] || [ "$style" = "withparts" ] || [ "$style" = "conwithfeat" ]
 then
   chunk=1
 fi
@@ -932,10 +942,11 @@ then
   fi |
   if [ "$mode" = "json" ]
   then
-    grep "."
+    grep '.'
   elif [ "$raw" = true ]
   then
-    xtract -mixed -format -doctype ""
+    # xtract -mixed -format -doctype ""
+    grep '.'
   else
     xtract -normalize "$dbase" |
     sed -e 's/<!DOCTYPE eSummaryResult PUBLIC/<!DOCTYPE DocumentSummarySet PUBLIC/g; s/<eSummaryResult>//g; s/<\/eSummaryResult>//g' |
@@ -976,7 +987,8 @@ then
     nquire -get $biocbase $idtype $uids |
     if [ "$raw" = true ]
     then
-      xtract -format -doctype ""
+      # xtract -format -doctype ""
+      grep '.'
     else
       xtract -normalize bioc | xtract -format -doctype ""
     fi


=====================================
efilter
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -filter "$@"
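
Note on the wrapper preamble above (the same hunk is repeated for efetch, efilter, einfo, elink, epost, esearch and esummary): the new loop consumes only leading -internal, -newmode and -oldmode arguments and stops at the first other argument, then puts -internal back at the front if it was seen. The sketch below mirrors that logic inside a function so it can be run on its own. It is an illustration under assumptions, not code from the commit: the name StripModeFlags and the local variable mode (standing in for USE_NEW_EDIRECT) are hypothetical.

```shell
#!/bin/sh

# Hypothetical helper, not in the commit: mirrors the new wrapper preamble
# and prints the selected mode plus the arguments that would be passed on.
StripModeFlags() {
  mode="${USE_NEW_EDIRECT}"
  internal=no
  while [ "$#" -ne 0 ]
  do
    case "$1" in
      -internal )
        internal=yes
        shift
        ;;
      -newmode )
        mode=1
        shift
        ;;
      -oldmode )
        mode=0
        shift
        ;;
      * )
        # first argument that is not a mode flag ends the scan
        break
        ;;
    esac
  done
  if [ "$internal" = yes ]
  then
    # same idiom as the commit: the placeholder "_" stops set from
    # reading -internal as one of its own options, and shift drops it
    set _ -internal "$@"
    shift
  fi
  echo "mode=$mode args=$*"
}

StripModeFlags -oldmode -internal -db pubmed -id 123
# prints: mode=0 args=-internal -db pubmed -id 123
```

Because the scan stops at the first non-matching argument, a mode flag placed later on the command line is left in the argument list by this loop, whereas the removed for-loop looked for -newmode anywhere in "$@".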


=====================================
einfo
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -info "$@"


=====================================
elink
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -link "$@"
@@ -286,6 +303,10 @@ do
       cites=true
       shift
       ;;
+    -batch )
+      # accept -batch flag from old scripts - now standard behavior
+      shift
+      ;;
     -h | -help | --help )
       PrintHelp
       exit 0
@@ -437,22 +458,9 @@ fi
 
 # -cited or -cites access the NIH Open Citation Collection dataset (see PMID 31600197)
 
-iciteElement=""
-
-if [ "$cited" = true ]
-then
-  # equivalent of -name pubmed_pubmed_citedin (for pubmed records also in pmc)
-  iciteElement="cited_by"
-fi
-
-if [ "$cites" = true ]
-then
-  # equivalent of -name pubmed_pubmed_refs (for pubmed records also in pmc)
-  iciteElement="references"
-fi
-
 LinkInIcite() {
 
+  iciteElement="$1"
   GenerateUidList "$dbase" |
   join-into-groups-of 100 |
   while read uids
@@ -466,9 +474,9 @@ LinkInIcite() {
   epost -db pubmed -log "$log"
 }
 
-if [ -n "$iciteElement" ]
-then
-  cits=$( LinkInIcite )
+QueryIcite() {
+
+  cits=$( LinkInIcite "$1" )
 
   if [ -n "$cits" ]
   then
@@ -478,6 +486,20 @@ then
   fi
 
   WriteEDirect "$dbase" "$web_env" "$qry_key" "$num" "$stp" "$err"
+}
+
+if [ "$cited" = true ]
+then
+  # equivalent of -name pubmed_pubmed_citedin (for pubmed records also in pmc)
+  QueryIcite "cited_by"
+
+  exit 0
+fi
+
+if [ "$cites" = true ]
+then
+  # equivalent of -name pubmed_pubmed_refs (for pubmed records also in pmc)
+  QueryIcite "references"
 
   exit 0
 fi


=====================================
epost
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -post "$@"


=====================================
esearch
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -search "$@"
@@ -1151,7 +1168,7 @@ AddPathwayArg() {
 # warn on mismatch between filter argument and database
 
 case "$dbase" in
-  nucleotide | nuccore | est | gss | protein )
+  nucleotide | nuccore | protein )
     ;;
   * )
     if [ "$dbase" != "gene" ] || [ -z "$orgn" ]
@@ -1309,6 +1326,70 @@ then
   datetype=""
 fi
 
+# protect embedded 'and', 'or', and 'not' terms in single token filter,
+# properties, and organism fields, in select biological databases, since
+# lower-case Boolean operators will be replaced with AND according to:
+#   https://www.nlm.nih.gov/pubs/techbull/ja97/ja97_pubmed.html
+# although only 'or' and 'not' actually cause misinterpretation of:
+#   -db biosample -query "package metagenome or environmental version 1 0 [PROP]"
+
+ProtectWithUnderscores() {
+
+  item="$1"
+  case "$item" in
+    *" and "* | *" or "* | *" not "* )
+      echo "$item" | sed -e "s/ and /_and_/g; s/ or /_or_/g; s/ not /_not_/g"
+      ;;
+    * )
+      echo "$item"
+      ;;
+  esac
+}
+
+ProcessEntrezQuery() {
+
+  echo "$1" |
+  sed -e 's/(/ | ( | /g' \
+      -e 's/)/ | ) | /g' |
+  sed -e "s/ AND / | AND | /g" \
+      -e "s/ OR / | OR | /g" \
+      -e "s/ NOT / | NOT | /g" |
+  tr '|' '\n' |
+  while read item
+  do
+    item=$( echo "$item" | sed -e 's/^ *//g; s/ *$//g; s/  */ /g' )
+    case "$item" in
+      "" )
+        ;;
+      *"[FILT]" | *"[Filter]" | *"[filter]" )
+        ProtectWithUnderscores "$item"
+        ;;
+      *"[PROP]" | *"[Properties]" | *"[properties]" )
+        ProtectWithUnderscores "$item"
+        ;;
+      *"[ORGN]" | *"[Organism]" | *"[organism]" )
+        ProtectWithUnderscores "$item"
+        ;;
+      * )
+        echo "$item"
+        ;;
+    esac
+  done
+}
+
+case "$dbase" in
+  nuc* | prot* | gene | genome | popset | taxonomy | clinvar | cdd | sra | ipg | bio* )
+    case "$query" in
+      *\|* )
+        # skip if query contains an embedded vertical bar, reserved for splitting in ProcessEntrezQuery
+        ;;
+      * )
+        query=$( ProcessEntrezQuery "$query" | tr '\n' ' ' | sed -e 's/^ *//g; s/ *$//g; s/  */ /g' )
+        ;;
+    esac
+    ;;
+esac
+
 # helper function adds search-specific arguments (if set)
 
 RunWithSearchArgs() {
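
Note on the esearch hunk above: the effect of the new ProtectWithUnderscores helper can be checked by itself. The sketch below copies the helper as it appears in the diff and feeds it the biosample term quoted in the commit's own comment. The second input is an arbitrary example chosen here to show the pass-through branch. In esearch the helper is only reached through ProcessEntrezQuery, for tokens ending in a filter, properties or organism field tag.

```shell
#!/bin/sh

# Copied from the esearch hunk: replaces lower-case " and ", " or " and
# " not " inside a single token with underscore-joined forms.
ProtectWithUnderscores() {

  item="$1"
  case "$item" in
    *" and "* | *" or "* | *" not "* )
      echo "$item" | sed -e "s/ and /_and_/g; s/ or /_or_/g; s/ not /_not_/g"
      ;;
    * )
      echo "$item"
      ;;
  esac
}

ProtectWithUnderscores "package metagenome or environmental version 1 0 [PROP]"
# prints: package metagenome_or_environmental version 1 0 [PROP]

ProtectWithUnderscores "Homo sapiens [ORGN]"
# prints: Homo sapiens [ORGN]
```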


=====================================
esummary
=====================================
@@ -32,31 +32,50 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
+internal=no
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -internal )
+      internal=yes
+      shift
+      ;;
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
+if [ "$internal" = yes ]
+then
+  set _ -internal "$@"
+  shift
+fi
+
 if [ ! -f "$pth"/ecommon.sh ]
 then
   USE_NEW_EDIRECT=false
 fi
 
-# set PERL path if using old EDirect
-
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -71,8 +90,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -fetch -format docsum "$@"


=====================================
hlp-xtract.txt
=====================================
@@ -79,6 +79,7 @@ Link Counts
     -wrp "Uid" -element IdList/Id -wrp "Count" -num Link/Id |
   xtract -pattern Rec -if Count -ge 50 -element Uid Count
 
+  32296183    17997
   19372376    57
 
 Markup Correction
@@ -105,11 +106,11 @@ Record Counts
   efilter -days 365 -datetype PDAT |
   xtract -pattern ENTREZ_DIRECT -lbl "$0" -element Count'
 
-  diphtheria      33
-  measles         150
-  pertussis       85
-  polio           100
-  tuberculosis    1608
+  diphtheria      20
+  measles         213
+  pertussis       69
+  polio           76
+  tuberculosis    1787
 
 Citation Lookup
 
@@ -185,7 +186,7 @@ Vitamin Biosynthesis
 
 Coding Sequences
 
-  efetch -db nuccore -id J01636.1 -format gbc -style withparts |
+  efetch -db nuccore -id J01636.1 -format gbc |
   xtract -insd CDS gene sub_sequence
 
   J01636.1    lacI    GTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCG...
@@ -725,8 +726,7 @@ Heterogeneous Object Names
 XML Namespace Prefixes
 
   nquire -url "http://webservice.wikipathways.org" getPathway -pwId WP455 |
-  xtract -pattern "ns1:getPathwayResponse" -element ":gpml" |
-  transmute -decode64 |
+  xtract -pattern "ns1:getPathwayResponse" -decode ":gpml" |
   xtract -pattern Pathway -block Xref \
     -if @Database -equals "Entrez Gene" \
       -tab "\n" -element @ID |
@@ -953,11 +953,11 @@ Phrase Query Automation
 
   ascend_mesh_tree "C01.925.782.417.415"
 
-  5148       c01 925 782 417 415*
-  26792      c01 925 782 417*
-  607883     c01 925 782*
-  870516     c01 925*
-  2541697    c01*
+  5598       c01 925 782 417 415*
+  28400      c01 925 782 417*
+  658188     c01 925 782*
+  928201     c01 925*
+  2639368    c01*
 
 Medical Subject Heading Code Viewers
 


=====================================
nquire
=====================================
@@ -32,28 +32,37 @@
 #
 # ==========================================================================
 
-for x in "$@"
-do
-  if [ "x$x" = "x-newmode" ]
-  then
-    USE_NEW_EDIRECT=1
-    break
-  fi
-done
-
 # pth must contain cacert.pem certificate (previously within aux/lib/perl5/Mozilla/CA/ subdirectory)
 
 pth=$( dirname "$0" )
 
+# conditionally execute original Perl implementation
+
 PERL=""
 
-# set PERL path if using old EDirect
+while [ "$#" -ne 0 ]
+do
+  case "$1" in
+    -newmode )
+      USE_NEW_EDIRECT=1
+      shift
+      ;;
+    -oldmode )
+      USE_NEW_EDIRECT=0
+      shift
+      ;;
+    * )
+      break
+      ;;
+  esac
+done
 
 case "${USE_NEW_EDIRECT}" in
   "" | [FfNn]* | 0 | [Oo][Ff][Ff] )
+    # set PERL path if using old EDirect
     PERL=perl
     case "$( uname -s )" in
-      CYGWIN_NT*)
+      CYGWIN_NT* )
         # Use a negative match here because the shell treats 0 as success.
         if perl -e 'exit $^O !~ /^MSWin/'; then
            pth=$( cygpath -w "$pth" )
@@ -68,8 +77,6 @@ case "${USE_NEW_EDIRECT}" in
     ;;
 esac
 
-# conditionally execute original Perl implementation
-
 if [ -n "${PERL}" ]
 then
   exec "${PERL}" "$pth"/edirect.pl -nquir "$@"


=====================================
test-edirect
=====================================
@@ -235,7 +235,7 @@ PrintTimeAndTitle "Vitamin Biosynthesis"
 
 PrintTimeAndTitle "Coding Sequences"
 
-  efetch -db nuccore -id J01636.1 -format gbc -style withparts |
+  efetch -db nuccore -id J01636.1 -format gbc |
   xtract -insd CDS gene sub_sequence
 
 PrintTimeAndTitle "Sequence Subregion"
@@ -363,20 +363,58 @@ PrintTimeAndTitle "Unfiltered Gene Lookup"
 
   for sym in ATP6 ATP7B CBD DMD HFE PAH PRNP TTN
   do
-    esearch -db gene -query "$sym [GENE]" -organism human |
-    efetch -format docsum |
-    xtract -pattern DocumentSummary -def "-" -lbl "${sym}" \
-      -element NomenclatureSymbol Id Description CommonName
+    fst=$( esearch -db gene -query "$sym [GENE]" -organism human )
+    if [ -z "$fst" ]
+    then
+      echo "ESEARCH FAILURE FOR $sym"
+      continue
+    fi
+    scd=$( echo "$fst" | efetch -raw -format docsum )
+    if [ -z "$scd" ]
+    then
+      echo "EFETCH FAILURE FOR $sym, ESEARCH RESULT IS"
+      echo "$fst"
+      continue
+    fi
+    thd=$( echo "$scd" |
+           xtract -pattern DocumentSummary -def "-" -lbl "${sym}" \
+           -element NomenclatureSymbol Id Description CommonName )
+    if [ -z "$thd" ]
+    then
+      echo "XTRACT FAILURE FOR $sym, EFETCH RESULT IS"
+      echo "$scd"
+      continue
+    fi
+    echo "$thd"
   done
 
 PrintTimeAndTitle "Protein Coding Genes"
 
   for sym in MT-ATP6 BRCA2 CFTR HBB HFE IL9R OPN1MW PAH
   do
-    esearch -db gene -query "$sym [PREF]" -organism human |
-    efetch -format docsum |
-    xtract -pattern DocumentSummary -def "-" \
-      -lbl "${sym}" -element Id Chromosome Description
+    fst=$( esearch -db gene -query "$sym [PREF]" -organism human )
+    if [ -z "$fst" ]
+    then
+      echo "ESEARCH FAILURE FOR $sym"
+      continue
+    fi
+    scd=$( echo "$fst" | efetch -raw -format docsum )
+    if [ -z "$scd" ]
+    then
+      echo "EFETCH FAILURE FOR $sym, ESEARCH RESULT IS"
+      echo "$fst"
+      continue
+    fi
+    thd=$( echo "$scd" |
+           xtract -pattern DocumentSummary -def "-" \
+           -lbl "${sym}" -element Id Chromosome Description )
+    if [ -z "$thd" ]
+    then
+      echo "XTRACT FAILURE FOR $sym, EFETCH RESULT IS"
+      echo "$scd"
+      continue
+    fi
+    echo "$thd"
   done
 
 PrintTimeAndTitle "Taxonomic Names"


=====================================
test-eutils
=====================================
@@ -145,6 +145,7 @@ DoTime() {
 DoAlive() {
   for i in $(seq 1 $repeats)
   do
+    sleep 1
     DoStart
     size=0
     res=$(
@@ -170,7 +171,7 @@ DoAlive() {
         ;;
     esac
     DoTime
-    if [ "$size" -ne 1341 ]
+    if [ "$size" -ne 1367 ]
     then
       echo "($size)"
     fi
@@ -178,6 +179,7 @@ DoAlive() {
 
   for i in $(seq 1 $repeats)
   do
+    sleep 1
     DoStart
     size=0
     res=$(
@@ -200,14 +202,11 @@ DoAlive() {
       printf "."
     fi
     DoTime
-    if [ "$size" -ne 11750 ]
-    then
-      echo "($size)"
-    fi
   done
 
   for i in $(seq 1 $repeats)
   do
+    sleep 1
     DoStart
     size=0
     res=$(
@@ -240,6 +239,7 @@ DoAlive() {
 
   for i in $(seq 1 $repeats)
   do
+    sleep 1
     DoStart
     size=0
     res=$(
@@ -272,6 +272,7 @@ DoAlive() {
 
   for i in $(seq 1 $repeats)
   do
+    sleep 1
     DoStart
     size=0
     res=$(


=====================================
xtract.go
=====================================
@@ -5995,6 +5995,7 @@ func ProcessINSD(args []string, isPipe, addDash, doIndex bool) []string {
 		"chloroplast",
 		"chromoplast",
 		"chromosome",
+		"circular_RNA",
 		"citation",
 		"clone_lib",
 		"clone",
@@ -7223,6 +7224,9 @@ func ProcessFormat(rdr <-chan string, args []string, useTimer bool) int {
 	customDoctype := false
 	doctype := ""
 
+	doComment := false
+	doCdata := false
+
 	// look for [copy|compact|flush|indent|expand] specification
 	if len(args) > 0 {
 		inSwitch := true
@@ -7344,6 +7348,12 @@ func ProcessFormat(rdr <-chan string, args []string, useTimer bool) int {
 		case "-self", "-self-closing":
 			keepEmptySelfClosing = true
 			args = args[1:]
+		case "-comment":
+			doComment = true
+			args = args[1:]
+		case "-cdata":
+			doCdata = true
+			args = args[1:]
 		default:
 			fmt.Fprintf(os.Stderr, "\nERROR: Unrecognized option after -format command\n")
 			os.Exit(1)
@@ -7728,8 +7738,26 @@ func ProcessFormat(rdr <-chan string, args []string, useTimer bool) int {
 			}
 			pfx = ""
 			okIndent = false
-		case CDATATAG, COMMENTTAG:
-			// ignore
+		case CDATATAG:
+			if doCdata {
+				buffer.WriteString(pfx)
+				doIndent(indent)
+				buffer.WriteString("<![CDATA[")
+				buffer.WriteString(name)
+				buffer.WriteString("]]>")
+				pfx = ret
+				okIndent = true
+			}
+		case COMMENTTAG:
+			if doComment {
+				buffer.WriteString(pfx)
+				doIndent(indent)
+				buffer.WriteString("<!--")
+				buffer.WriteString(name)
+				buffer.WriteString("-->")
+				pfx = ret
+				okIndent = true
+			}
 		case DOCTYPETAG:
 			if customDoctype && doctype == "" {
 				doctype = name
@@ -9970,6 +9998,7 @@ func GenBankConverter(inp io.Reader) <-chan string {
 	const twentyonespaces = "                     "
 
 	var rec strings.Builder
+	var alt strings.Builder
 	var con strings.Builder
 	var seq strings.Builder
 
@@ -10058,7 +10087,10 @@ func GenBankConverter(inp io.Reader) <-chan string {
 				// start of record
 				rec.WriteString("  <INSDSeq>\n")
 
-				cols := strings.Fields(line)
+				// do not break if given artificial multi-line LOCUS
+				str := readContinuationLines(line)
+
+				cols := strings.Fields(str)
 				ln := len(cols)
 				if ln == 8 {
 					moleculetype := cols[4]
@@ -10109,12 +10141,12 @@ func GenBankConverter(inp io.Reader) <-chan string {
 					writeOneElement("    ", "INSDSeq_update-date", cols[6])
 
 				} else {
-					fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+					fmt.Fprintf(os.Stderr, "ERROR: "+str+"\n")
 				}
 
-				// read next line and continue
-				line = nextLine()
-				row++
+				// read next line and continue - handled by readContinuationLines above
+				// line = nextLine()
+				// row++
 			}
 
 			if strings.HasPrefix(line, "DEFINITION") {
@@ -10146,7 +10178,7 @@ func GenBankConverter(inp io.Reader) <-chan string {
 					writeOneElement("    ", "INSDSeq_primary-accession", accessions[0])
 
 				} else {
-					fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+					fmt.Fprintf(os.Stderr, "ERROR: ACCESSION "+str+"\n")
 				}
 			}
 
@@ -10196,6 +10228,9 @@ func GenBankConverter(inp io.Reader) <-chan string {
 
 				for _, secndry := range secondaries {
 
+					if strings.HasPrefix(secndry, "REGION") {
+						break
+					}
 					writeOneElement("      ", "INSDSecondary-accn", secndry)
 				}
 
@@ -10687,6 +10722,49 @@ func GenBankConverter(inp io.Reader) <-chan string {
 			}
 			rec.WriteString("    </INSDSeq_feature-table>\n")
 
+			// TSA, TLS, WGS, or CONTIG lines may be next
+
+			alt_name := ""
+
+			if strings.HasPrefix(line, "TSA") ||
+				strings.HasPrefix(line, "TLS") ||
+				strings.HasPrefix(line, "WGS") {
+
+				alt.Reset()
+
+				alt_name = line[:3]
+				line = line[3:]
+			}
+
+			if strings.HasPrefix(line, "WGS_CONTIG") ||
+				strings.HasPrefix(line, "WGS_SCAFLD") {
+
+				alt.Reset()
+
+				alt_name = line[:3]
+				line = line[10:]
+			}
+
+			if alt_name != "" {
+
+				alt_name = strings.ToLower(alt_name)
+				txt := strings.TrimSpace(line)
+				alt.WriteString(txt)
+				for {
+					// read next line
+					line = nextLine()
+					row++
+					if !strings.HasPrefix(line, twelvespaces) {
+						// if not continuation of contig, break out of loop
+						break
+					}
+					// append subsequent line and continue with loop
+					txt = strings.TrimPrefix(line, twelvespaces)
+					txt = strings.TrimSpace(txt)
+					alt.WriteString(txt)
+				}
+			}
+
 			if strings.HasPrefix(line, "CONTIG") {
 
 				// pathological records can have over 90,000 components, use strings.Builder
@@ -10748,6 +10826,22 @@ func GenBankConverter(inp io.Reader) <-chan string {
 					}
 					con.Reset()
 
+					if alt_name != "" {
+						rec.WriteString("    <INSDSeq_alt-seq>\n")
+						rec.WriteString("      <INSDAltSeqData>\n")
+						str = alt.String()
+						str = strings.TrimSpace(str)
+						if str != "" {
+							writeOneElement("        ", "INSDAltSeqData_name", alt_name)
+							rec.WriteString("        <INSDAltSeqData_items>\n")
+							writeOneElement("          ", "INSDAltSeqItem_value", str)
+							rec.WriteString("        </INSDAltSeqData_items>\n")
+						}
+						alt.Reset()
+						rec.WriteString("      </INSDAltSeqData>\n")
+						rec.WriteString("    </INSDSeq_alt-seq>\n")
+					}
+
 					// end of record
 					rec.WriteString("  </INSDSeq>\n")
 



View it on GitLab: https://salsa.debian.org/med-team/ncbi-entrez-direct/-/compare/f2e309255fa5041d1709a57aa0230c54a066fe51...3d9114ce05b1730e1373efbd116818658d584f06
