[med-svn] [Git][med-team/ncbi-entrez-direct][upstream] New upstream version 23.8.20250429+dfsg

Aaron M. Ucko (@ucko) gitlab at salsa.debian.org
Fri May 2 04:07:45 BST 2025



Aaron M. Ucko pushed to branch upstream at Debian Med / ncbi-entrez-direct


Commits:
18ae4204 by Aaron M. Ucko at 2025-05-01T23:00:11-04:00
New upstream version 23.8.20250429+dfsg

- - - - -


7 changed files:

- README
- cmd/transmute.go
- esample
- eutils/align.go
- + gbf2tbl
- gff-sort
- help/tst-efetch.txt


Changes:

=====================================
README
=====================================
@@ -14,12 +14,6 @@ Navigation programs (esearch, elink, efilter, and efetch) communicate by means o
 
 Accessory programs (nquire, transmute, and xtract) can help eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect programs and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
 
-All EDirect programs are designed to work on large sets of data. Intermediate results are either saved on the Entrez history server or instantiated in the hidden message. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile and .zshrc configuration files:
-
-  export NCBI_API_KEY=unique_api_key
-
-Each program also has a -help command that prints detailed information about available arguments.
-
 NAVIGATION FUNCTIONS
 
 Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query for the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:
@@ -50,7 +44,14 @@ Efetch downloads selected records or reports in a style designated by -format:
 
 Individual query commands are connected by a Unix vertical bar pipe symbol:
 
-  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format medline
+  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format apa
+
+The vertical bar also allows query steps to be placed on separate lines:
+
+  esearch -db pubmed -query "raynaud disease AND fish oil" |
+  efetch -format medline
+
+Each program has a -help command that prints detailed information about available arguments.
 
 There is no need to use a script to loop over records in small groups, or write code to retry after a transient network or server failure, or add a time delay between requests. All of those features are already built into the EDirect commands.
 
@@ -72,6 +73,22 @@ The resulting output can be post-processed by Unix utilities or scripts:
 
   fmt -w 1 | sort -V | uniq
 
+INSTALLATION
+
+EDirect consists of a set of scripts and programs that are downloaded to the user's computer. To install the software, open a terminal window and execute one of the following two commands:
+
+  sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+
+  sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
+
+Once installation is complete, run the following to set the PATH for the current terminal session:
+
+  export PATH=${HOME}/edirect:${PATH}
+
+All EDirect programs are designed to work on large sets of data. Intermediate results are either saved on the Entrez history server or instantiated in the hidden message. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile and .zshrc configuration files:
+
+  export NCBI_API_KEY=unique_api_key
+
 DISCOVERY BY NAVIGATION
 
 PubMed related articles are identified by a statistical text retrieval algorithm using the title, abstract, and medical subject headings (MeSH terms). The connections between papers can be used for making discoveries. An example of this is finding the last enzymatic step in the vitamin A biosynthetic pathway.
@@ -98,13 +115,17 @@ As anticipated, the results include the enzyme that splits beta-carotene into tw
   SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
   ...
 
+A better example used Entrez protein neighbors to instantly rediscover the similarity between a human colon cancer gene and microbial DNA repair genes. Unfortunately, precomputed BLAST links were discontinued due to the exponential growth of the sequence databases.
+
 XML DATA EXTRACTION
 
 The ability to obtain Entrez records in structured format, and to easily extract the underlying data, allows the user to ask novel questions that are not addressed by existing analysis software.
 
 The xtract program uses command-line arguments to direct the conversion of data in eXtensible Markup Language format. It allows record detection, path exploration, element selection, conditional processing, and report formatting to be controlled independently.
 
-The -pattern command partitions an XML stream by object name into individual records that are processed separately. Within each record, the -element command does an exhaustive, depth-first search to find data content by field name. Explicit paths to objects are not needed.
+The -pattern command partitions an XML stream by object name into individual records that are processed separately. Within each record, the -element command does an exhaustive, depth-first search to find data content by field name.
+
+Neither explicit object paths nor complicated path formulas are needed for element identification.
 
 FORMAT CUSTOMIZATION
 
@@ -141,8 +162,24 @@ ELEMENT VARIANTS
 
 Derivatives of -element were created to eliminate the inconvenience of having to write post-processing scripts to perform otherwise trivial modifications or analyses on extracted data. Examples include positional (-first, -even), numeric (-inc, -max, -mod, -log), text (-upper, -title, -words, -letters), and sequence (-revcomp, -fasta, -ncbi2na, -molwt) commands. Substitute for -element as needed.
 
+The -with and -split commands can parse multiple clauses that are packed into a single field:
+
+  -wrp Item -with ";" -split Attributes
+
 The original -element prefix shortcuts, "#" and "%", are redirected to -num and -len, respectively.
 
+ELEMENT CONSTRUCTS
+
+An -element argument can use the parent / child construct to limit selection when items cannot otherwise be disambiguated. In this case it prevents the display of additional PMIDs that might be present in CommentsCorrections objects deeper in the MedlineCitation container:
+
+  xtract -pattern PubmedArticle -element MedlineCitation/PMID
+
+A subrange is selected with start and stop positions inside square brackets and separated by a colon. Endpoints for removal of leading or trailing substrings are separated by a vertical bar inside brackets:
+
+  -author Initials[1:1],LastName -prose "Title[phospholipase | rattlesnake]"
+
+  -pkg Aspect -wrp Tag -element "Item[|=]" -wrp Val -element "Item[=|]"
+
 EXPLORATION CONTROL
 
 Exploration commands allow precise control of the order in which XML record contents are examined, by separately presenting each instance of the chosen subregion. This limits what subsequent commands "see" at any one time, and allows related fields in an object to be kept together.
@@ -204,6 +241,8 @@ keeps qualifiers, such as gene and product, associated with their parent feature
   NM_021486.4
       source
                 organism       Mus musculus
+                mol_type       mRNA
+                db_xref        taxon:10090
       gene
                 gene           Bco1
       CDS
@@ -227,6 +266,8 @@ This prints the feature key on each line before the qualifier name and value, ev
 
   NM_021486.4
       source    organism       Mus musculus
+      source    mol_type       mRNA
+      source    db_xref        taxon:10090
       gene      gene           Bco1
       CDS       gene           Bco1
       CDS       product        beta,beta-carotene 15,15'-dioxygenase isoform 1
@@ -266,11 +307,7 @@ Object constraints will compare the string values of two named fields, and can l
 
   -if Chromosome -differs-from ChrLoc
 
-PARSING AND SUBSTITUTION
-
-The -with and -split commands can parse multiple elements packed into a single field:
-
-  -wrp Item -with ";" -split Attributes
+VALUE SUBSTITUTION
 
 External values can be inserted by reading a two-column, precomputed file or ad hoc conversion table with -transform, and then requesting a replacement by applying -translate to an element:
 
@@ -319,9 +356,17 @@ EDirect provides additional functions, scripts, and exploration constructs to si
 
 SEQUENCE QUALIFIERS
 
-The NCBI data model for sequence records (see PMID 11449725) is based on the central dogma of molecular biology. Features contain information about the biology of a given region, including the transformations involved in gene expression. Qualifiers store specific details about a feature (e.g., name of the gene, genetic code used for protein translation, accession of the product sequence, cross-references to external databases).
+The NCBI data model for sequence records (see PMID 11449725) is based on the central dogma of molecular biology. Sequences, including genomic DNA, messenger RNAs, and protein products, are "instantiated" with the actual sequence letters, and are assigned accession numbers for reference.
+
+Features contain information about the biology of a region, including the transformations involved in gene expression. Qualifiers store specific details about a feature (e.g., name of the gene, genetic code used for protein translation, accession of the product sequence, cross-references to external databases).
 
-As a convenience for exploring sequence records, the -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. (Two computed qualifiers, feat_location and sub_sequence, are also supported.) A search on cone snail venom:
+A gene feature indicates the location of a heritable region of nucleic acid that confers a measurable phenotype. An mRNA feature on genomic DNA represents the exonic and untranslated regions of the message that remain after transcription and splicing. A coding region (CDS) feature has a product reference to the translated protein sequence.
+
+As a convenience for exploring sequence records, the xtract -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. (Two computed qualifiers, feat_location and sub_sequence, are also supported.)
+
+SNAIL VENOM PEPTIDE SEQUENCES
+
+A search for cone snail venom mature peptides:
 
   esearch -db protein -query "conotoxin" -feature mat_peptide |
   efetch -format gpc |
@@ -339,6 +384,38 @@ prints the accession number, mature peptide length, product name, calculated mol
   ADB43125.1    22    conotoxin Cal 14.2    2157    GCPADCPNTCDSSNKCSPGFPG
   ...
 
+SNP-MODIFIED PRODUCT PAIRS
+
+Single nucleotide polymorphisms can represent different substitutions at the same position, but variation records do not explicitly match a specific CDS modification to its altered protein product:
+
+  efetch -db snp -id 11549407 -format docsum |
+
+The hgvs2spdi script converts 1-based HGVS data ("NM_000518.5:c.118C>T") into 0-based SPDI format ("NM_000518.5:167:C:T"). For SNPs on cDNA transcripts the position is CDS-relative, and the script retrieves the GenBank record in order to calculate the absolute sequence offset:
+
+  snp2hgvs | hgvs2spdi | spdi2tbl |
+
+The normalized results are saved in a tab-delimited data table:
+
+  rs11549407    NC_000011.10    5226773    G    A    Genomic    Substitution    HBB
+  rs11549407    NM_000518.5     167        C    T    Coding     Substitution    HBB
+  rs11549407    NP_000509.1     39         Q    *    Protein    Termination     HBB
+  rs11549407    NP_000509.1     39         Q    E    Protein    Missense        HBB
+  ...
+
+A final step then translates the coding region locations (after nucleotide substitution), and sorts them with protein sequences (after residue replacement):
+
+  tbl2prod
+
+to produce adjacent matching CDS/protein pairs:
+
+  rs11549407    NM_000518.5:167:C:T    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT*R ...
+  rs11549407    NP_000509.1:39:Q:*     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT*R ...
+  rs11549407    NM_000518.5:167:C:G    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTER ...
+  rs11549407    NP_000509.1:39:Q:E     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTER ...
+  rs11549407    NM_000518.5:167:C:A    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTKR ...
+  rs11549407    NP_000509.1:39:Q:K     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTKR ...
+  ...
+
 GENES IN A REGION
 
 Records for protein-coding genes on the human X chromosome are retrieved by running:
@@ -371,31 +448,33 @@ to produce an ordered table of known genes located between two markers flanking
 
 TAXONOMIC LINEAGE
 
-When xtract explores a recursively-defined data structure, such as Taxon and Gene-commentary, entry to an internal object is normally blocked when its name matches the current exploration container. Recursive data can be flattened by exploring with a double star / child construct:
+To accommodate recursively-defined data, entry to an internal object is blocked when its name matches the current exploration container. The double star / child construct recursively visits every object regardless of depth, and can flatten a complex structure into a linear set of elements in a single step:
 
   efetch -db taxonomy -id 9606 -format xml |
   xtract -pattern Taxon \
     -first TaxId -tab "\n" -element ScientificName \
-    -block "**/Taxon" -if Rank -is-not "no rank" \
+    -block "**/Taxon" -if Rank -is-not "no rank" -and Rank -excludes "root" \
       -tab "\n" -element Rank,ScientificName
 
 which removes the search constraint and visits every child object, regardless of nesting depth, to print all of the individual internal lineage nodes:
 
-  9606            Homo sapiens
-  superkingdom    Eukaryota
-  clade           Opisthokonta
-  kingdom         Metazoa
-  clade           Eumetazoa
-  clade           Bilateria
-  clade           Deuterostomia
-  phylum          Chordata
-  subphylum       Craniata
-  clade           Vertebrata
+  9606         Homo sapiens
+  domain       Eukaryota
+  clade        Opisthokonta
+  kingdom      Metazoa
+  clade        Eumetazoa
+  clade        Bilateria
+  clade        Deuterostomia
+  phylum       Chordata
+  subphylum    Craniata
+  clade        Vertebrata
   ...
 
 SEQUENCE ANALYSIS
 
-EDirect sequence processing functions are provided by the transmute program. No special coding techniques or custom data structures are required. For example, the nucleotide sequence in a GenBank record can be extracted, reverse-complemented, and saved in FASTA format with:
+EDirect sequence processing functions are provided by the transmute program. No special coding techniques or custom data structures are required.
+
+For example, the nucleotide sequence in a GenBank record can be extracted, reverse-complemented, and saved in FASTA format with:
 
   efetch -db nuccore -id U00096 -format gb |
   gbf2fsa | transmute -revcomp | transmute -fasta -width 50
@@ -440,13 +519,13 @@ The repercussions of a genomic SNP can be followed with transmute functions: -re
 
 EXTERNAL DATA INTEGRATION
 
-The nquire program uses command-line arguments to obtain data from external RESTful, CGI, or FTP servers. Results in various formats can be converted to XML by the transmute program.
+The nquire program uses command-line arguments to obtain data from external RESTful, CGI, or FTP servers. Xtract can now automatically detect and convert input data in JSON, text ASN.1, and GenBank flatfile formats, but other formats still require an explicit call to transmute or its shortcuts.
 
 JSON ARRAYS
 
 Human beta-globin information from a Scripps Research data integration project (see PMID 23175613):
 
-  nquire -get http://mygene.info/v3 gene 3043 |
+  nquire -get http://mygene.info/v3 gene 3043 | transmute -j2x |
 
 contains a multi-dimensional JavaScript Object Notation array of exon coordinates:
 
@@ -457,11 +536,7 @@ contains a multi-dimensional JavaScript Object Notation array of exon coordinate
   ],
   "strand": -1,
 
-This can be converted to XML with transmute -j2x (or the json2xml shortcut script):
-
-  transmute -j2x |
-
-with the default "-nest element" argument assigning distinct tag names to each level:
+Conversion to XML assigns distinct tag names to each level with the "-nest element" default:
 
   <position>
     <position_E>5225463</position_E>
@@ -469,12 +544,11 @@ with the default "-nest element" argument assigning distinct tag names to each l
   </position>
   ...
 
-JSON MIXTURES
+HETEROGENEOUS DATA
 
 A query for the human green-sensitive opsin gene:
 
-  nquire -get http://mygene.info/v3/gene/2652 |
-  transmute -j2x |
+  nquire -get http://mygene.info/v3/gene/2652 | transmute -j2x |
 
 returns data containing a heterogeneous mixture of objects in the pathway section:
 
@@ -523,6 +597,8 @@ This takes a series of command-line arguments with tag names for wrapping the in
   </Rec>
   ...
 
+The tbl2xml -header argument will obtain tag names from the first line of the input data.
+
 Similarly, transmute -c2x (or csv2xml) will convert comma-separated values (CSV) files to XML.
 
 GENBANK DOWNLOAD
@@ -532,9 +608,9 @@ The most recent GenBank virus release file can also be downloaded from NCBI serv
   nquire -lst ftp.ncbi.nlm.nih.gov genbank |
   grep "^gbvrl" | grep ".seq.gz" | sort -V |
   tail -n 1 | skip-if-file-exists |
-  nquire -asp ftp.ncbi.nlm.nih.gov genbank
+  nquire -dwn ftp.ncbi.nlm.nih.gov genbank
 
-If the Aspera Connect client is not installed on your computer, nquire -asp will default to -dwn and use FTP transfer. GenBank flatfile records can be selected by organism name or taxon identifier, or by presence or absence of an arbitrary text string, with transmute -gbf (or filter-genbank). The subset is converted to INSDSeq XML with transmute -g2x (or gbf2xml):
+GenBank flatfile records can be selected by organism name or taxon identifier, or by presence or absence of an arbitrary text string, with transmute -gbf (or filter-genbank). While this can be read directly by xtract, explicit conversion to INSDSeq XML with transmute -g2x (or gbf2xml) may be up to three times faster for large sets of records:
 
   gunzip -c *.seq.gz | filter-genbank -taxid 11292 | gbf2xml |
 
@@ -591,7 +667,6 @@ LOCAL SEARCH INDEX
 
 A similar strategy was used to create a local information retrieval system suitable for large data mining queries. Run archive-pubmed -index to populate retrieval index files from records stored in the local archive. The initial indexing will also take a few hours. Since PubMed updates are released once per day, it may be convenient to schedule reindexing to start in the late evening and run during the night.
 
-
 For PubMed titles and primary abstracts, the indexing process deletes hyphens after specific prefixes, removes accents and diacritical marks, splits words at punctuation characters, corrects encoding artifacts, and spells out Greek letters for easier searching on scientific terms. It then prepares inverted indices with term positions, and uses them to build distributed term lists and postings files.
 
 For example, the term list that includes "cancer" in the title or abstract would be located at:
@@ -715,8 +790,6 @@ has reusable boilerplate in its first three lines, and indexes PubMed records by
   </IdxDocument>
   ...
 
-Note that "MedlineCitation/PMID" uses the parent / child construct to prevent the display of additional PMID items that might be present later in CommentsCorrections objects.
-
 PYTHON INTEGRATION
 
 Controlling EDirect from Python scripts is easily done with assistance from the edirect.py library file, which is included in the EDirect archive.
@@ -784,10 +857,12 @@ Programs in Google's Go language ("golang") start with package main and then imp
   package main
 
   import (
+      "cmp"
       "eutils"
       "fmt"
+      "maps"
       "os"
-      "sort"
+      "slices"
   )
 
 Each compiled Go binary has a single main function, which is where program execution begins:
@@ -812,15 +887,10 @@ A for loop on the range of the sequence string visits each sequence letter. The
              counts[base]++
           }
 
-Maps are not returned in a defined order, so map keys are loaded to a keys array, which is then sorted:
+A sorted keys array is produced by calling slices.SortedFunc. The alphabetical sort order is determined by the second argument, which is an anonymous function literal:
 
-          var keys []rune
-          for ky := range counts {
-              keys = append(keys, ky)
-          }
-          sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
-
-(The second argument passed to sort.Slice is an anonymous function literal used to control the sort order. It is also a closure, implicitly inheriting the keys array from the enclosing function.)
+          keys := slices.SortedFunc(maps.Keys(counts),
+              func(i, j rune) int { return cmp.Compare(i, j) })
 
 The sequence identifier is printed in the first column:
 
@@ -870,18 +940,6 @@ The build script creates module files used to track dependencies and retrieve im
   chmod +x build.sh
   ./build.sh
 
-INSTALLATION
-
-EDirect consists of a set of scripts and programs that are downloaded to the user's computer. To install the software, open a terminal window and execute one of the following two commands:
-
-  sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
-
-  sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
-
-Once installation is complete, run the following to set the PATH for the current terminal session:
-
-  export PATH=${HOME}/edirect:${PATH}
-
 SOLID-STATE DRIVE PREPARATION
 
 To initialize a solid-state drive for hosting the local archive on a Mac, log into an admin account, run Disk Utility, choose View -> Show All Devices, select the top-level external drive, and press the Erase icon. Set the Scheme popup to GUID Partition Map, and APFS will appear as a format choice. Set the Format popup to APFS, enter the desired name for the volume, and click the Erase button.
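
Note on the Go hunk in the README diff above: the revised text replaces the manual key loop and sort.Slice with slices.SortedFunc over maps.Keys. The following is a minimal, self-contained sketch of that pattern, assuming Go 1.23 or later (where maps.Keys returns an iterator). The helper names countBases and sortedBases and the input "GATTACA" are illustrative only and do not appear in the README.

```go
package main

import (
	"cmp"
	"fmt"
	"maps"
	"slices"
)

// countBases tallies how often each letter occurs in a sequence string.
// (Hypothetical helper; the README does this inline.)
func countBases(seq string) map[rune]int {
	counts := make(map[rune]int)
	for _, base := range seq {
		counts[base]++
	}
	return counts
}

// sortedBases collects the map keys and returns them in ascending order.
// The comparison function returns a negative, zero, or positive int,
// unlike the boolean "less" function that sort.Slice expected.
func sortedBases(counts map[rune]int) []rune {
	return slices.SortedFunc(maps.Keys(counts),
		func(i, j rune) int { return cmp.Compare(i, j) })
}

func main() {
	counts := countBases("GATTACA")
	for _, base := range sortedBases(counts) {
		fmt.Printf("%c\t%d\n", base, counts[base])
	}
}
```

For this input the loop prints A, C, G, T with counts 3, 1, 1, 2. Because rune is an ordered type, slices.Sorted(maps.Keys(counts)) would give the same order without a comparison function; the explicit function is kept here to match the README text.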


=====================================
cmd/transmute.go
=====================================
@@ -1049,7 +1049,12 @@ func processAlign(inp io.Reader, args []string) {
 
 		switch args[0] {
 		case "-g":
-			pdg = eutils.GetNumericArg(args, "-g spacing between columns", 0, 1, 30)
+			if len(args) > 1 && args[1] == "t" {
+				// special case to use tab character between columns
+				pdg = -1
+			} else {
+				pdg = eutils.GetNumericArg(args, "-g spacing between columns", 0, 1, 30)
+			}
 			args = args[2:]
 		case "-h":
 			mrg = eutils.GetNumericArg(args, "-i indent before columns", 0, 1, 30)
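
Note on the hunk above: the "-g" case now accepts the literal value "t" and records it as -1, a sentinel meaning "use a tab between columns". The sketch below reproduces that decision in isolation. parseGap and tabGap are hypothetical names, and the strconv-based range check is only a stand-in for eutils.GetNumericArg, whose trailing arguments (0, 1, 30) are assumed here to mean default, minimum, and maximum.

```go
package main

import (
	"fmt"
	"strconv"
)

// tabGap is the sentinel the patch stores in pdg to request a tab separator.
const tabGap = -1

// parseGap is a hypothetical stand-in for the "-g" case in processAlign.
// args[0] is expected to be "-g" and args[1] its value.
func parseGap(args []string) (int, error) {
	if len(args) < 2 {
		return 0, fmt.Errorf("-g is missing its value")
	}
	if args[1] == "t" {
		// special case: tab character between columns
		return tabGap, nil
	}
	n, err := strconv.Atoi(args[1])
	if err != nil {
		return 0, fmt.Errorf("-g value %q is not a number", args[1])
	}
	if n < 1 || n > 30 {
		// assumed bounds, taken from the 1 and 30 passed to GetNumericArg
		return 0, fmt.Errorf("-g value %d is outside 1..30", n)
	}
	return n, nil
}

func main() {
	for _, val := range []string{"t", "4", "x"} {
		gap, err := parseGap([]string{"-g", val})
		fmt.Println(val, gap, err)
	}
}
```

Using a negative sentinel keeps the variable an int, so the existing call sites do not need a second flag to carry the "tab" choice.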


=====================================
esample
=====================================
@@ -6,7 +6,7 @@
 do_help() {
   cat <<EOF
   
-Usage: esample [ -docsum | -article | -book | -protein | -gene | -taxon | -blast | -snp | -hgvs | -bioc | -flatfile | -gencode ]
+Usage: esample [ -docsum | -article | -book | -protein | -gene | -taxon | -blast | -snp | -hgvs | -bioc | -flatfile | -gff | -gencode ]
 
 EOF
 }
@@ -1660,6 +1660,18 @@ ORIGIN
 EOF
 }
 
+do_gff() {
+  cat <<EOF
+LGIB01000001.1	Gnomon	gene	52056	58768	.	+	.	ID=gene1
+LGIB01000001.1	Gnomon	mRNA	52056	58768	.	+	.	ID=rna1;Parent=gene1
+LGIB01000001.1	Gnomon	exon	52056	52096	.	+	.	ID=id4;Parent=rna1
+LGIB01000001.1	Gnomon	CDS	52056	52096	.	+	0	ID=cds1;Parent=rna1
+LGIB01000001.1	Gnomon	mRNA	52056	58768	.	+	.	ID=rna2;Parent=gene1
+LGIB01000001.1	Gnomon	exon	52056	53000	.	+	.	ID=id19;Parent=rna2
+LGIB01000001.1	Gnomon	CDS	52100	53000	.	+	0	ID=cds2;Parent=rna2
+EOF
+}
+
 do_gencode() {
   cat <<EOF
 
@@ -1818,6 +1830,9 @@ case "$choice" in
     flatfile | -flatfile )
      do_flatfile 
      ;;
+    gff | -gff | gff3 | -gff3 )
+     do_gff 
+     ;;
     gencode | -gencode )
      do_gencode 
      ;;


=====================================
eutils/align.go
=====================================
@@ -92,6 +92,9 @@ func AlignColumns(inp io.Reader, margin, padding, minimum int, align string) <-c
 
 	if padding > 0 && padding < 30 {
 		pad = spaces[0:padding]
+	} else if padding == -1 {
+		// continue using tab character between columns
+		pad = "\t"
 	}
 
 	align = strings.TrimSpace(align)


=====================================
gbf2tbl
=====================================
@@ -0,0 +1,6 @@
+#!/bin/sh
+
+# Public domain notice for all NCBI EDirect scripts is located at:
+# https://www.ncbi.nlm.nih.gov/books/NBK179288/#chapter6.Public_Domain_Notice
+
+gbf2xml | xml2tbl


=====================================
gff-sort
=====================================
@@ -20,7 +20,7 @@ V_region	2
 V_segment	2
 CDS	3
 exon	4
-intron	5
+intron	4
 EOF
 
 temp1=$(mktemp /tmp/GFF_TEMP1.XXXXXXXXX)
@@ -28,17 +28,19 @@ temp2=$(mktemp /tmp/GFF_TEMP2.XXXXXXXXX)
 
 grep '.' |
 sed '/^#/d' |
-# read GFF3 tab-delimited data into XML structure
+# read GFF3 tab-delimited data into XML structure, taking field names from the command line
 tbl2xml -rec Rec SeqID Source Type Start End Score Strand Phase Attributes |
-# use xtract -with and -split arguments to separate individual tag=value attributes
+# use xtract -with and -split arguments to separate individual tag=value attributes,
+# also use HERE document and suffix test to convert feature type to sort order number
 xtract -transform <( echo -e "$TYPEMAP" ) -rec Rec \
   -pattern Rec \
     -group Rec -pkg Fields \
       -block "Rec/*" -element "*" \
       -block Type -if Type -ends-with RNA -wrp Feat -lbl 2 \
-        -else -def 6 -wrp Feat -translate Type \
+        -else -def 5 -wrp Feat -translate Type \
     -group Rec -pkg Content -wrp Item -with ";" -split Attributes |
-# use xtract prefix and suffix trimming constructs to isolate tag and value
+# use xtract prefix and suffix trimming constructs to isolate tag and value, making new
+# XML objects with the extracted tag as the object name: <tag>value</tag>
 xtract -rec Rec \
   -pattern Rec \
     -group Fields -element "*" \


=====================================
help/tst-efetch.txt
=====================================
@@ -49,7 +49,6 @@ pubmed	abstract	2539356	cis-acting
 pubmed	bioc	2539356	palindromic
 pubmed	medline	2539356	transposition
 pubmed	xml	2539356	immunity
-snp	json	137853337	RCV000009793
 sra	runinfo	190091	SRS336098
 sra	xml	190091	SRS336098
 taxonomy	native	11060	dengue



View it on GitLab: https://salsa.debian.org/med-team/ncbi-entrez-direct/-/commit/18ae420437d125cd4c4f4fabbb8d4625dfa780fc
