[med-svn] [Git][med-team/ncbi-entrez-direct][upstream] 5 commits: New upstream version 23.9.20250512+dfsg

Aaron M. Ucko (@ucko) gitlab at salsa.debian.org
Mon May 26 17:03:12 BST 2025



Aaron M. Ucko pushed to branch upstream at Debian Med / ncbi-entrez-direct


Commits:
48271b21 by Aaron M. Ucko at 2025-05-18T22:19:05-04:00
New upstream version 23.9.20250512+dfsg
- - - - -
a2bef797 by Aaron M. Ucko at 2025-05-18T22:25:18-04:00
New upstream version 23.9.20250518+dfsg
- - - - -
8533ada9 by Aaron M. Ucko at 2025-05-23T15:04:55-04:00
New upstream version 23.9.20250520+dfsg
- - - - -
7c5fb8e0 by Aaron M. Ucko at 2025-05-23T15:06:37-04:00
New upstream version 24.0.20250522+dfsg
- - - - -
08f679c6 by Aaron M. Ucko at 2025-05-25T21:55:17-04:00
New upstream version 24.0.20250523+dfsg
- - - - -


27 changed files:

- README
- archive-pids
- archive-pmc
- archive-pubmed
- cmd/go.mod
- cmd/go.sum
- cmd/rchive.go
- cmd/transmute.go
- cmd/xtract.go
- ecommon.sh
- efetch
- einfo
- eutils/eutils_test.go
- eutils/format.go
- eutils/go.mod
- eutils/go.sum
- eutils/misc.go
- eutils/utils.go
- eutils/xplore.go
- gbf2info
- help/tst-elink.txt
- help/xtract-help.txt
- nquire
- xcommon.sh
- xfetch
- xlink
- + xlink.ini


Changes:

=====================================
README
=====================================
@@ -24,7 +24,7 @@ Search terms can also be qualified with a bracketed field name to match within t
 
   esearch -db nuccore -query "insulin [PROT] AND rodents [ORGN]"
 
-Elink looks up precomputed neighbors within a database, or finds associated records in other databases, or uses the NIH Open Citation Collection service (see PMID 31600197) to follow reference lists:
+Elink looks up precomputed neighbors within a database, or finds associated records in other databases, or uses the NIH Open Citation Collection service (PMID 31600197) to follow reference lists:
 
   elink -related
 
@@ -53,7 +53,7 @@ The vertical bar also allows query steps to be placed on separate lines:
 
 Each program has a -help command that prints detailed information about available arguments.
 
-There is no need to use a script to loop over records in small groups, or write code to retry after a transient network or server failure, or add a time delay between requests. All of those features are already built into the EDirect commands.
+EDirect programs are designed to work on large sets of data. There is no need to use a script to loop over records in small groups, or write code to retry a query after a transient network or server failure, or add a time delay between requests. All of those features are already built into the system.
 
 ACCESSORY PROGRAMS
 
@@ -61,7 +61,7 @@ Nquire retrieves data from remote servers with URLs constructed from command lin
 
   nquire -get https://icite.od.nih.gov api/pubs -pmids 2539356 |
 
-Transmute -j2x converts a concatenated stream of JSON objects into XML:
+Transmute converts a concatenated stream of JSON objects or other structured formats into XML:
 
   transmute -j2x |
 
@@ -85,7 +85,7 @@ Once installation is complete, run the following to set the PATH for the current
 
   export PATH=${HOME}/edirect:${PATH}
 
-All EDirect programs are designed to work on large sets of data. Intermediate results are either saved on the Entrez history server or instantiated in the hidden message. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile and .zshrc configuration files:
+For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile and .zshrc configuration files:
 
   export NCBI_API_KEY=unique_api_key
 
@@ -99,7 +99,7 @@ An initial search on the lycopene cyclase enzyme finds 306 articles. Looking up
 
   esearch -db pubmed -query "lycopene cyclase" | elink -related |
 
-We cannot reliably limit the results to animals in PubMed, but we can for sequence records, which are indexed by taxonomy. Linking the publication neighbors to their associated protein records finds 604,878 sequences. Restricting those to mice excludes plants, fungi, and bacteria, which eliminates the earlier enzymes:
+We cannot reliably limit the results to animals in PubMed, but we can for sequence records, which are indexed by the NCBI taxonomy. Linking the publication neighbors to their associated protein records finds 604,878 sequences. Restricting those to mice excludes plants, fungi, and bacteria, which eliminates the earlier enzymes:
 
   elink -target protein | efilter -organism mouse -source refseq |
 
@@ -156,33 +156,54 @@ Repackaging commands (-wrp, -enc, and -pkg) wrap extracted data values with brac
 
   -pfx "<Word>" -sep "</Word><Word>" -sfx "</Word>"
 
-Additional commands (-tag, -att, -atr, -cls, -slf, and -end) allow generation of XML attributes.
+It also sets an internal flag to ensure that data values containing encoded ampersands, angle brackets, apostrophes, and quotation marks remain properly encoded inside the new XML.
 
 ELEMENT VARIANTS
 
-Derivatives of -element were created to eliminate the inconvenience of having to write post-processing scripts to perform otherwise trivial modifications or analyses on extracted data. Examples include positional (-first, -even), numeric (-inc, -max, -mod, -log), text (-upper, -title, -words, -letters), and sequence (-revcomp, -fasta, -ncbi2na, -molwt) commands. Substitute for -element as needed.
+Derivatives of -element were created to avoid the inconvenience of having to write post-processing scripts to perform trivial modifications or calculations on extracted data. Other variants were added for content normalization, report formatting, or index generation. They are subdivided into several categories. Substitute for -element as needed. A representative selection is shown below:
+
+  Positional:    -first,  -last,  -even,  -odd,  -backward
+  Numeric:       -num,  -len,  -inc,  -dec,  -mod,  -bin,  -hex,  -bit,  -sqt,  -lge,  -lg2,  -log
+  Statistics:    -sum,  -acc,  -min,  -max,  -dev,  -med,  -avg,  -geo,  -hrm,  -rms
+  Character:     -upper,  -lower,  -title,  -mirror,  -alpha,  -alnum
+  Text:          -terms,  -words,  -pairs,  -letters,  -split,  -order,  -reverse,  -prose
+  Sequence:      -revcomp,  -fasta,  -ncbi2na,  -cds2prot,  -molwt,  -pept
+  Citation:      -year,  -month,  -date, -auth,  -initials,  -page,  -author,  -journal
+  Other:         -doi,  -wct,  -trim,  -pad,  -accession,  -numeric
+
+The original -element prefix shortcuts, "#" and "%", are redirected to -num and -len, respectively.
+
+VALUE SUBSTITUTION
+
+External values can be inserted by reading a two-column, precomputed file or ad hoc conversion table with -transform, and then requesting a replacement by applying -translate to an element:
+
+  xtract -transform accn-to-uid.txt  ...  -translate Accession
+
+  xtract -transform <( echo -e "Genomic\t1\nCoding\t2\nProtein\t3\n" )  ...
+
+PARSING FIELDS
 
 The -with and -split commands can parse multiple clauses that are packed into a single field:
 
   -wrp Item -with ";" -split Attributes
 
-The original -element prefix shortcuts, "#" and "%", are redirected to -num and -len, respectively.
+SUBSTRING LIMITS
 
-ELEMENT CONSTRUCTS
+A subrange is selected with start and stop positions inside square brackets and separated by a colon. Endpoints for removal of specific prefix and suffix strings are indicated by a vertical bar inside brackets:
 
-An -element argument can use the parent / child construct to limit selection when items cannot otherwise be disambiguated. In this case it prevents the display of additional PMIDs that might be present in CommentsCorrections objects deeper in the MedlineCitation container:
+  -author Initials[1:1],LastName -prose "Title[phospholipase | rattlesnake]"
 
-  xtract -pattern PubmedArticle -element MedlineCitation/PMID
+  -wrp Tag -element "Item[|=]" -wrp Val -element "Item[=|]"
 
-A subrange is selected with start and stop positions inside square brackets and separated by a colon. Endpoints for removal of leading or trailing substrings are separated by a vertical bar inside brackets:
+LOCAL CONTEXT
 
-  -author Initials[1:1],LastName -prose "Title[phospholipase | rattlesnake]"
+An -element argument can use the parent / child construct to limit selection when items can only be disambiguated by position, not by name. In this case it prevents the display of additional PMIDs that might be present in CommentsCorrections objects deeper in the MedlineCitation container:
 
-  -pkg Aspect -wrp Tag -element "Item[|=]" -wrp Val -element "Item[=|]"
+  xtract -pattern PubmedArticle -element MedlineCitation/PMID
 
 EXPLORATION CONTROL
 
-Exploration commands allow precise control of the order in which XML record contents are examined, by separately presenting each instance of the chosen subregion. This limits what subsequent commands "see" at any one time, and allows related fields in an object to be kept together.
+Exploration commands allow precise control over the order in which XML record contents are examined, by separately presenting each instance of the chosen subregion. This limits what subsequent commands "see" at any one time, allowing related fields in an object to be kept together.
 
 In contrast to the simpler DocumentSummary format, records retrieved as PubmedArticle XML:
 
@@ -242,7 +263,6 @@ keeps qualifiers, such as gene and product, associated with their parent feature
       source
                 organism       Mus musculus
                 mol_type       mRNA
-                db_xref        taxon:10090
       gene
                 gene           Bco1
       CDS
@@ -267,7 +287,6 @@ This prints the feature key on each line before the qualifier name and value, ev
   NM_021486.4
       source    organism       Mus musculus
       source    mol_type       mRNA
-      source    db_xref        taxon:10090
       gene      gene           Bco1
       CDS       gene           Bco1
       CDS       product        beta,beta-carotene 15,15'-dioxygenase isoform 1
@@ -281,16 +300,16 @@ Variables can be (re)initialized with an explicit literal value inside parenthes
 
 CONDITIONAL EXECUTION
 
-Conditional processing commands (-if, -unless, -and, -or, and -else) restrict object exploration by data content. They check to see if the named field is within the scope, and may be used in conjunction with string, numeric, or object constraints to require an additional match by value. For example:
+Conditional processing commands (-if and -unless) restrict object exploration by data content. They check to see if the named field is within the scope, and may be used in conjunction with string, numeric, or object constraints to require an additional match by value. Use -and and -or to build compound tests, and -select to remove records that do not satisfy the condition. For example:
 
-  esearch -db pubmed -query "Havran W [AUTH]" |
-  efetch -format xml |
-  xtract -pattern PubmedArticle -if Language -equals eng \
+  esearch -db pubmed -query "Havran W [AUTH]" | efetch -format xml |
+  xtract -pattern PubmedArticle -select Language -equals eng |
+  xtract -pattern PubmedArticle \
     -block Author -if LastName -is-not Havran \
       -sep ", " -tab "\n" -author LastName,Initials[1:1] |
   sort-uniq-count-rank
 
-selects papers written in English and prints a table of the most frequent collaborators, using a range to keep only the first initial so that variants like "Berg, CM" and "Berg, C" are combined:
+limits the results to papers written in English and prints a table of the most frequent collaborators, using a range to keep only the first initial so that variants like "Berg, C" and "Berg, CM" are combined:
 
   35    Witherden, D
   15    Boismenu, R
@@ -307,13 +326,42 @@ Object constraints will compare the string values of two named fields, and can l
 
   -if Chromosome -differs-from ChrLoc
 
-VALUE SUBSTITUTION
+The -position command restricts presentation of objects by relative location or index number:
 
-External values can be inserted by reading a two-column, precomputed file or ad hoc conversion table with -transform, and then requesting a replacement by applying -translate to an element:
+  -block Author -position last -sep ", " -element LastName,Initials
 
-  xtract -transform accn-to-uid.txt  ...  -translate Accession
+The -else command can run an alternative -element or -lbl instruction if the condition is not satisfied:
+
+  -if ChrStart -gt ChrStop -lbl "minus strand" -else -lbl "plus strand"
 
-  -transform <( echo -e "Genomic\t1\nCoding\t2\nProtein\t3\n" )
+GENERATING ATTRIBUTES
+
+Additional commands (-tag, -att, -atr, -cls, -slf, and -end) allow generation of XML tags with attributes. The following will produce regular and self-closing XML objects, respectively:
+
+  -tag Item -att type journal -cls -element Source -end Item
+
+  <Item type="journal">J Bacteriol</Item>
+
+  -tag Item -att type journal -atr name Source -slf
+
+  <Item type="journal" name="J Bacteriol" />
+
+XML NAMESPACES
+
+Namespace prefixes are followed by a colon, while a leading colon matches any prefix:
+
+  nquire -url http://webservice.wikipathways.org getPathway -pwId WP455 |
+  xtract -pattern "ns1:getPathwayResponse" -decode ":gpml" |
+
+The embedded Graphical Pathway Markup Language object can then be processed:
+
+  xtract -pattern Pathway -block Xref \
+    -if @Database -equals "Entrez Gene" \
+      -tab "\n" -element @ID
+
+AUTOMATIC FORMAT CONVERSION
+
+Xtract can now detect and convert input data in JSON, text ASN.1, and GenBank/GenPept flatfile formats. Explicit transmute or shortcut commands are only needed for inspecting the intermediate XML or for overriding the default conversion settings.
 
 MULTI-STEP TRANSFORMATIONS
 
@@ -356,11 +404,11 @@ EDirect provides additional functions, scripts, and exploration constructs to si
 
 SEQUENCE QUALIFIERS
 
-The NCBI data model for sequence records (see PMID 11449725) is based on the central dogma of molecular biology. Sequences, including genomic DNA, messenger RNAs, and protein products, are "instantiated" with the actual sequence letters, and are assigned accession numbers for reference.
+The NCBI data model for sequence records (PMID 11449725) is based on the central dogma of molecular biology. Sequences, including genomic DNA, messenger RNAs, and protein products, are "instantiated" with the actual sequence letters, and are assigned accession numbers for reference.
 
 Features contain information about the biology of a region, including the transformations involved in gene expression. Qualifiers store specific details about a feature (e.g., name of the gene, genetic code used for protein translation, accession of the product sequence, cross-references to external databases).
 
-A gene feature indicates the location of a heritable region of nucleic acid that confers a measurable phenotype. An mRNA feature on genomic DNA represents the exonic and untranslated regions of the message that remain after transcription and splicing. A coding region (CDS) feature has a product reference to the translated protein sequence.
+A gene feature indicates the location of a heritable region of nucleic acid that confers a measurable phenotype. An mRNA feature on genomic DNA represents the exonic and untranslated regions that remain after message transcription and intron splicing. A coding region (CDS) feature has a product reference to the translated protein sequence record.
 
 As a convenience for exploring sequence records, the xtract -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. (Two computed qualifiers, feat_location and sub_sequence, are also supported.)
 
@@ -373,7 +421,7 @@ A search for cone snail venom mature peptides:
   xtract -insd complete mat_peptide "%peptide" product mol_wt peptide |
   grep -i conotoxin | sort-table -u -k 2,2n
 
-prints the accession number, mature peptide length, product name, calculated molecular weight, and amino acid sequence for a sample of neurotoxic peptides:
+uses the -insd function to print the accession number, mature peptide length, product name, calculated molecular weight, and amino acid sequence for a sample of neurotoxic peptides:
 
   ADB43131.1    15    conotoxin Cal 1b      1708    LCCKRHHGCHPCGRT
   ADB43128.1    16    conotoxin Cal 5.1     1829    DPAPCCQHPIETCCRR
@@ -429,7 +477,7 @@ Gene names and chromosomal positions are extracted by piping the records to:
     -block GenomicInfoType -if ChrLoc -equals X \
       -min ChrStart,ChrStop -element "&NAME" "&DESC" |
 
-with the -if statement eliminating coordinates from pseudoautosomal gene copies present on the Y chromosome telomeres. Results can now be sorted by position, and then filtered and partitioned:
+The -if statement eliminates coordinates from pseudoautosomal gene copies present on the Y chromosome telomeres. Results can now be sorted by position, and then filtered and partitioned:
 
   sort-table -k 1,1n | cut -f 2- |
   grep -v pseudogene | grep -v uncharacterized | grep -v hypothetical |
@@ -448,7 +496,7 @@ to produce an ordered table of known genes located between two markers flanking
 
 TAXONOMIC LINEAGE
 
-To accommodate recursively-defined data, entry to an internal object is blocked when its name matches the current exploration container. The double star / child construct recursively visits every object regardless of depth, and can flatten a complex structure into a linear set of elements in a single step:
+To accommodate recursively-defined data, entry to an internal object is blocked when its name matches the current exploration container. The double star / child construct removes the search constraint to recursively visit every object regardless of depth, and can flatten a complex structure into a linear set of elements in a single step:
 
   efetch -db taxonomy -id 9606 -format xml |
   xtract -pattern Taxon \
@@ -456,7 +504,7 @@ To accommodate recursively-defined data, entry to an internal object is blocked
     -block "**/Taxon" -if Rank -is-not "no rank" -and Rank -excludes "root" \
       -tab "\n" -element Rank,ScientificName
 
-which removes the search constraint and visits every child object, regardless of nesting depth, to print all of the individual internal lineage nodes:
+This prints all of the individual internal lineage nodes:
 
   9606         Homo sapiens
   domain       Eukaryota
@@ -472,7 +520,7 @@ which removes the search constraint and visits every child object, regardless of
 
 SEQUENCE ANALYSIS
 
-EDirect sequence processing functions are provided by the transmute program. No special coding techniques or custom data structures are required.
+EDirect sequence processing functions are provided by the transmute program. They can handle huge sequences as strings, without requiring any special coding techniques or custom data structures.
 
 For example, the nucleotide sequence in a GenBank record can be extracted, reverse-complemented, and saved in FASTA format with:
 
@@ -519,11 +567,11 @@ The repercussions of a genomic SNP can be followed with transmute functions: -re
 
 EXTERNAL DATA INTEGRATION
 
-The nquire program uses command-line arguments to obtain data from external RESTful, CGI, or FTP servers. Xtract can now automatically detect and convert input data in JSON, text ASN.1, and GenBank flatfile formats, but other formats still require an explicit call to transmute or its shortcuts.
+The nquire program uses command-line arguments to obtain data from external RESTful, CGI, or FTP servers. (Xtract can read JSON, ASN.1, and GenBank formats directly, but previously-required conversion commands - now for inspecting XML or overriding defaults - are shown below in light text.)
 
 JSON ARRAYS
 
-Human beta-globin information from a Scripps Research data integration project (see PMID 23175613):
+Human beta-globin information from a Scripps Research data integration project (PMID 23175613):
 
   nquire -get http://mygene.info/v3 gene 3043 | transmute -j2x |
 
@@ -597,7 +645,7 @@ This takes a series of command-line arguments with tag names for wrapping the in
   </Rec>
   ...
 
-The tbl2xml -header argument will obtain tag names from the first line of the input data.
+The tbl2xml -header argument will instead obtain tag names from the first line of the input data.
 
 Similarly, transmute -c2x (or csv2xml) will convert comma-separated values (CSV) files to XML.
 
@@ -610,7 +658,9 @@ The most recent GenBank virus release file can also be downloaded from NCBI serv
   tail -n 1 | skip-if-file-exists |
   nquire -dwn ftp.ncbi.nlm.nih.gov genbank
 
-GenBank flatfile records can be selected by organism name or taxon identifier, or by presence or absence of an arbitrary text string, with transmute -gbf (or filter-genbank). While this can be read directly by xtract, explicit conversion to INSDSeq XML with transmute -g2x (or gbf2xml) may be up to three times faster for large sets of records:
+GenBank flatfile records can be selected by organism name or taxon identifier, or by presence or absence of an arbitrary text string, with transmute -gbf (or filter-genbank).
+
+While this can be read directly by xtract, explicit conversion to INSDSeq XML with transmute -g2x (or gbf2xml) may be up to three times faster for large sets of records:
 
   gunzip -c *.seq.gz | filter-genbank -taxid 11292 | gbf2xml |
 
@@ -624,7 +674,7 @@ Fetching data from Entrez works well when a few thousand records are needed, but
 
 LOCAL RECORD CACHE
 
-EDirect can now preload over 38 million live PubMed records onto an inexpensive external 500 GB solid-state drive as individual files for rapid retrieval. For example, PMID 2539356 would be stored at:
+EDirect can now preload over 38 million live PubMed records onto an inexpensive external 1 TB solid-state drive as individual files for rapid retrieval. For example, PMID 2539356 would be stored at:
 
   /pubmed/Archive/02/53/93/2539356.xml.gz
 
@@ -705,12 +755,14 @@ Each plus sign will replace a single word inside a phrase, and runs of tildes in
 
   xsearch -query "vitamin c ~ ~ common cold"
 
-An exact substring match, without special processing of Boolean operators or indexed field names, can be obtained with -title (on the article title) or -exact (on the title or abstract), while ranked partial term matching in any field is available with -match:
-
-  xsearch -title "Genetic Control of Biochemical Reactions in Neurospora."
+Ranked partial term matching is available in any field with -match:
 
   xsearch -match "tn3 transposition immunity [PAIR]" | just-top-hits 1
 
+An exact substring match, without special processing of Boolean operators or indexed field names, can be obtained with -title (on the article title) or -exact (on the title or abstract):
+
+  xsearch -title "Genetic Control of Biochemical Reactions in Neurospora."
+
 MeSH identifier code, MeSH hierarchy key, and year of publication are also indexed, and MESH field queries are supported by internally mapping to the appropriate CODE or TREE entries:
 
   xsearch -query "C14.907.617.812* [TREE] AND 2015:2019 [YEAR]"
@@ -737,9 +789,11 @@ The cumulative size of PubMed can be calculated with a running sum of the annual
   print-columns '$1, log($2)/log(10), log($3)/log(10)' |
   filter-columns '$1 >= 1800 && $1 < YR' | xy-plot annual-and-cumulative.png
 
+The sharp jump after World War II was caused by several factors, including the release of declassified papers, a policy of expanding biomedical research in postwar America, and the introduction of computers that could keep up with the indexing of articles from a broader range of subjects.
+
 NATURAL LANGUAGE PROCESSING
 
-NLM's Biomedical Text Mining Group performs computational analysis to extract chemical, disease, and gene references from article contents (see PMID 31114887). NLM indexing of PubMed records assigns Gene Reference into Function (GeneRIF) mappings (see PMID 14728215).
+NLM's Biomedical Text Mining Group performs computational analysis to extract chemical, disease, and gene references from article contents (PMID 31114887). NLM indexing of PubMed records assigns Gene Reference into Function (GeneRIF) mappings (PMID 14728215).
 
 Running archive-nlmnlp -index periodically (monthly) will automatically refresh any out-of-date support files and then index the connections in CHEM, DISZ, GENE, GRIF, GSYN, and PREF fields:
 
@@ -764,6 +818,18 @@ This returns PMIDs for 6670 articles that cite the original 97 papers. The resul
   xtract -pattern PubmedArticle -histogram Journal/ISOAbbreviation |
   sort-table -nr | head -n 10
 
+The archive-pid -index command reads the PubMed local archive's incremental inverted index files, and builds a PMCID index that allows xlink to return PubMed Central identifiers from PMIDs.
+
+ADDITIONAL EXPERIMENTAL ARCHIVES
+
+New database domains are also easily added, with records obtained from public data resources.
+
+Running archive-pmc -index downloads PMC release files, and collects primary author names, citation details, section titles, and full-text paragraphs. It then converts them to a more tractable form, and builds an archive from those derived records.
+
+Similarly, archive-taxonomy -index archives novel records assembled from NCBI taxonomy data tables retrieved from the FTP site.
+
+A new database is ready for use once its population script has finished all downloading, conversion, validation, caching, indexing, inversion, and postings steps.
+
 USER-SPECIFIED TERM INDEX
 
 Running custom-index with a PubMed indexer script and the names of the fields it populates:
@@ -790,6 +856,23 @@ has reusable boilerplate in its first three lines, and indexes PubMed records by
   </IdxDocument>
   ...
 
+SOLID-STATE DRIVE PREPARATION
+
+To initialize a solid-state drive for hosting the local archive on a Mac, log into an admin account, run Disk Utility, choose View -> Show All Devices, select the top-level external drive, and press the Erase icon. Set the Scheme popup to GUID Partition Map, and APFS will appear as a format choice. Set the Format popup to APFS, enter the desired name for the volume, and click the Erase button.
+
+To finish the drive configuration, disable Spotlight indexing on the drive with:
+
+  sudo mdutil -i off "${EDIRECT_LOCAL_ARCHIVE}"
+  sudo mdutil -E "${EDIRECT_LOCAL_ARCHIVE}"
+
+and turn off FSEvents logging with:
+
+  sudo touch "${EDIRECT_LOCAL_ARCHIVE}/.fseventsd/no_log"
+
+Also exclude the drive from being backed up by Time Machine or scanned by a virus checker.
+
+Finally, in Apple -> System Settings -> Privacy & Security -> Full Disk Access, turn on the Terminal slide switch.
+
 PYTHON INTEGRATION
 
 Controlling EDirect from Python scripts is easily done with assistance from the edirect.py library file, which is included in the EDirect archive.
@@ -940,21 +1023,6 @@ The build script creates module files used to track dependencies and retrieve im
   chmod +x build.sh
   ./build.sh
 
-SOLID-STATE DRIVE PREPARATION
-
-To initialize a solid-state drive for hosting the local archive on a Mac, log into an admin account, run Disk Utility, choose View -> Show All Devices, select the top-level external drive, and press the Erase icon. Set the Scheme popup to GUID Partition Map, and APFS will appear as a format choice. Set the Format popup to APFS, enter the desired name for the volume, and click the Erase button.
-
-To finish the drive configuration, disable Spotlight indexing on the drive with:
-
-  sudo mdutil -i off "${EDIRECT_LOCAL_ARCHIVE}"
-  sudo mdutil -E "${EDIRECT_LOCAL_ARCHIVE}"
-
-and turn off FSEvents logging with:
-
-  sudo touch "${EDIRECT_LOCAL_ARCHIVE}/.fseventsd/no_log"
-
-Also exclude the drive from being backed up by Time Machine or scanned by a virus checker.
-
 DOCUMENTATION
 
 Documentation for EDirect is on the web at:
@@ -974,6 +1042,10 @@ Instructions for downloading and installing the Go compiler are at:
 
   https://golang.org/doc/install#download
 
+To download the free Aspera Connect file transfer client, open the IBM Aspera Connect subsection at:
+
+  https://www.ibm.com/products/aspera/downloads#cds
+
 Questions or comments on EDirect may be sent to info at ncbi.nlm.nih.gov.
 
 This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH.

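The sort-uniq-count-rank helper that closes the README's collaborator-counting pipeline can be approximated with core Unix utilities; this is an editor's sketch for illustration (the sample names are taken from the README's output table), not the actual EDirect script, whose ranking and tab handling may differ in detail:

```shell
# Approximate EDirect's sort-uniq-count-rank with core utilities:
# count duplicate input lines, list them most-frequent first, and
# tab-separate the count from the value.
printf 'Witherden, D\nBoismenu, R\nWitherden, D\n' |
sort | uniq -c | sort -k1,1nr |
awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
```

The awk step strips the padding that uniq -c adds, so downstream tools such as sort-table see a clean two-column table.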

=====================================
archive-pids
=====================================
@@ -308,7 +308,7 @@ wait
 okay=""
 if [ "$e2post" = true ]
 then
-  okay=$( echo 2539356 | xkubj -db pubmed -target PMCID | grep -w 209839 )
+  okay=$( echo 2539356 | xlink -db pubmed -target PMCID | grep -w 209839 )
   if [ -n "$okay" ]
   then
     echo "Archive and Index are OK" >&2


=====================================
archive-pmc
=====================================
@@ -54,6 +54,9 @@ datafiles=true
 download=true
 populate=true
 
+justtest=false
+justmiss=false
+
 e2index=false
 e2invert=false
 e2collect=false
@@ -63,6 +66,25 @@ e2post=false
 while [ $# -gt 0 ]
 do
   case "$1" in
+    download | -download )
+      download=true
+      populate=false
+      shift
+      ;;
+    verify | -verify )
+      datafiles=false
+      download=false
+      populate=false
+      justtest=true
+      shift
+      ;;
+    missing | -missing )
+      datafiles=false
+      download=false
+      populate=false
+      justmiss=true
+      shift
+      ;;
     daily | -daily )
       e2index=true
       e2invert=true
@@ -735,6 +757,94 @@ then
   sleep 1
 fi
 
+CheckSection() {
+
+  dir="$1"
+  flt="$2"
+
+  if [ "$useFtp" = true ]
+  then
+    nquire -lst "ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/${dir}/xml" |
+    grep ".tar.gz" | grep "$flt" |
+    skip-if-file-exists |
+    while read fl
+    do
+      echo "$fl" >&2
+    done
+  elif [ "$useHttps" = true ]
+  then
+    nquire -get "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/${dir}/xml" |
+    xtract -pattern a -if a -starts-with "oa_" -and a -ends-with ".tar.gz" -and a -contains "$flt" -element a |
+    skip-if-file-exists |
+    while read fl
+    do
+      echo "$fl" >&2
+    done
+  fi
+}
+
+if [ "$justmiss" = true ]
+then
+  seconds_start=$(date "+%s")
+  echo "Looking for Missing PMC Files" >&2
+
+  if [ -d "${sourceBase}" ]
+  then
+    cd "${sourceBase}"
+
+    for flt in baseline incr
+    do
+      for dir in oa_comm oa_noncomm oa_other
+      do
+        CheckSection "$dir" "$flt"
+      done
+    done
+  fi
+
+  seconds_end=$(date "+%s")
+  seconds=$((seconds_end - seconds_start))
+  echo "$seconds seconds" >&2
+  echo "" >&2
+  exit 0
+fi
+
+if [ "$justtest" = true ]
+then
+  seconds_start=$(date "+%s")
+  echo "Verifing PMC Archive" >&2
+
+  if [ -d "${sourceBase}" ]
+  then
+    cd "${sourceBase}"
+
+    for fl in *.tar.gz
+    do
+      printf "."
+      # verify contents
+      if [ -s "$fl" ]
+      then
+        errs=$( (tar -xOzf "$fl" --to-stdout | xtract -mixed -verify -max 180) 2>&1 )
+        if [ -n "$errs" ]
+        then
+          printf "\n"
+          echo "Invalid Contents '$fl'" >&2
+        fi
+      else
+        printf "\n"
+        echo "Empty file '$fl'" >&2
+      fi
+    done
+    printf "\n"
+  fi
+
+  seconds_end=$(date "+%s")
+  seconds=$((seconds_end - seconds_start))
+  echo "" >&2
+  echo "$seconds seconds" >&2
+  echo "" >&2
+  exit 0
+fi
+
 PMCStash() {
 
   fl="$1"
@@ -772,7 +882,8 @@ then
         while read fl
         do
           base=${fl%.tar.gz}
-          if [ ! -f "${archiveBase}/Sentinels/$base.snt" ]
+          # skip if sentinel present or if file is present but empty
+          if [ ! -f "${archiveBase}/Sentinels/$base.snt" ] && [ -s "$fl" ]
           then
             PMCStash "$fl"
           fi


=====================================
archive-pubmed
=====================================
@@ -54,6 +54,9 @@ datafiles=true
 download=true
 populate=true
 
+justtest=false
+justmiss=false
+
 e2index=false
 e2invert=false
 e2collect=false
@@ -63,6 +66,25 @@ e2post=false
 while [ $# -gt 0 ]
 do
   case "$1" in
+    download | -download )
+      download=true
+      populate=false
+      shift
+      ;;
+    verify | -verify )
+      datafiles=false
+      download=false
+      populate=false
+      justtest=true
+      shift
+      ;;
+    missing | -missing )
+      datafiles=false
+      download=false
+      populate=false
+      justmiss=true
+      shift
+      ;;
     daily | -daily )
       e2index=true
       e2invert=true
@@ -885,6 +907,89 @@ then
   sleep 1
 fi
 
+CheckSection() {
+
+  dir="$1"
+
+  if [ "$useFtp" = true ]
+  then
+    nquire -lst ftp.ncbi.nlm.nih.gov pubmed "$dir" |
+    grep -v ".md5" | grep "xml.gz" |
+    skip-if-file-exists |
+    while read fl
+    do
+      echo "$fl" >&2
+    done
+  elif [ "$useHttps" = true ]
+  then
+    nquire -get https://ftp.ncbi.nlm.nih.gov pubmed "$dir" |
+    xtract -pattern a -if a -starts-with pubmed -and a -ends-with ".xml.gz" -element a |
+    skip-if-file-exists |
+    while read fl
+    do
+      echo "$fl" >&2
+    done
+  fi
+}
+
+if [ "$justmiss" = true ]
+then
+  seconds_start=$(date "+%s")
+  echo "Looking for Missing PubMed Files" >&2
+
+  if [ -d "${sourceBase}" ]
+  then
+    cd "${sourceBase}"
+
+    CheckSection "baseline"
+    CheckSection "updatefiles"
+  fi
+
+  seconds_end=$(date "+%s")
+  seconds=$((seconds_end - seconds_start))
+  echo "" >&2
+  echo "$seconds seconds" >&2
+  echo "" >&2
+  exit 0
+fi
+
+if [ "$justtest" = true ]
+then
+  seconds_start=$(date "+%s")
+  echo "Verifying PubMed Archive" >&2
+
+  if [ -d "${sourceBase}" ]
+  then
+    cd "${sourceBase}"
+
+    for fl in *.xml.gz
+    do
+      printf "."
+      # verify contents
+      if [ -s "$fl" ]
+      then
+        errs=$( (gunzip -c "$fl" | xtract -mixed -verify) 2>&1 )
+        if [ -n "$errs" ]
+        then
+          printf "\n"
+          echo "Invalid Contents '$fl'" >&2
+        fi
+      else
+        printf "\n"
+        echo "Empty file '$fl'" >&2
+      fi
+    done
+    printf "\n"
+  fi
+
+  seconds_end=$(date "+%s")
+  seconds=$((seconds_end - seconds_start))
+  echo "" >&2
+  echo "$seconds seconds" >&2
+  echo "" >&2
+  exit 0
+fi
+
 ReportVersioned() {
   inp="$1"
   pmidlist=.TO-REPORT
@@ -998,11 +1103,11 @@ then
     for fl in *.xml.gz
     do
       base=${fl%.xml.gz}
-      if [ -f "${archiveBase}/Sentinels/$base.snt" ]
+      # skip if sentinel present or if file is present but empty
+      if [ ! -f "${archiveBase}/Sentinels/$base.snt" ] && [ -s "$fl" ]
       then
-        continue
+        PMStash "$fl"
       fi
-      PMStash "$fl"
     done
   fi
 


=====================================
cmd/go.mod
=====================================
@@ -19,7 +19,7 @@ require (
 	github.com/go-playground/universal-translator v0.18.1 // indirect
 	github.com/go-playground/validator/v10 v10.20.0 // indirect
 	github.com/goccy/go-json v0.10.2 // indirect
-	github.com/goccy/go-yaml v1.15.23 // indirect
+	github.com/goccy/go-yaml v1.17.1 // indirect
 	github.com/json-iterator/go v1.1.12 // indirect
 	github.com/klauspost/compress v1.18.0 // indirect
 	github.com/klauspost/cpuid v1.3.1 // indirect
@@ -41,7 +41,7 @@ require (
 	golang.org/x/crypto v0.23.0 // indirect
 	golang.org/x/net v0.25.0 // indirect
 	golang.org/x/sys v0.25.0 // indirect
-	golang.org/x/text v0.22.0 // indirect
+	golang.org/x/text v0.24.0 // indirect
 	google.golang.org/protobuf v1.34.1 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
 )


=====================================
cmd/go.sum
=====================================
@@ -26,8 +26,8 @@ github.com/go-playground/validator/v10 v10.20.0 h1:K9ISHbSaI0lyB2eWMPJo+kOS/FBEx
 github.com/go-playground/validator/v10 v10.20.0/go.mod h1:dbuPbCMFw/DrkbEynArYaCwl3amGuJotoKCe95atGMM=
 github.com/goccy/go-json v0.10.2 h1:CrxCmQqYDkv1z7lO7Wbh2HN93uovUHgrECaO5ZrCXAU=
 github.com/goccy/go-json v0.10.2/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
-github.com/goccy/go-yaml v1.15.23 h1:WS0GAX1uNPDLUvLkNU2vXq6oTnsmfVFocjQ/4qA48qo=
-github.com/goccy/go-yaml v1.15.23/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA=
+github.com/goccy/go-yaml v1.17.1 h1:LI34wktB2xEE3ONG/2Ar54+/HJVBriAGJ55PHls4YuY=
+github.com/goccy/go-yaml v1.17.1/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA=
 github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
 github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM=
 github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo=
@@ -92,8 +92,8 @@ golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.25.0 h1:r+8e+loiHxRqhXVl6ML1nO3l1+oFoWbnlu2Ehimmi34=
 golang.org/x/sys v0.25.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
-golang.org/x/text v0.22.0 h1:bofq7m3/HAFvbF51jz3Q9wLg3jkvSPuiZu/pD1XwgtM=
-golang.org/x/text v0.22.0/go.mod h1:YRoo4H8PVmsu+E3Ou7cqLVH8oXWIHVoX0jqUWALQhfY=
+golang.org/x/text v0.24.0 h1:dd5Bzh4yt5KYA8f9CJHCP4FB4D51c2c6JvN37xJJkJ0=
+golang.org/x/text v0.24.0/go.mod h1:L8rBsPeo2pSS+xqN0d5u2ikmjtmoJbDBT1b7nHvFCdU=
 google.golang.org/protobuf v1.34.1 h1:9ddQBjfCyZPOHPUiPxpYESBLc+T8P3E+Vo4IbKZgFWg=
 google.golang.org/protobuf v1.34.1/go.mod h1:c6P6GXX6sHbq/GpV6MGZEdwhWPcYBgnhAHhKbcUYpos=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=


=====================================
cmd/rchive.go
=====================================
@@ -2620,7 +2620,7 @@ func main() {
 				if eutils.HasUnicodeMarkup(ctx) {
 					ctx = eutils.RepairUnicodeMarkup(ctx, eutils.SPACE)
 				}
-				ctx = eutils.RepairEncodedMarkup(ctx)
+				ctx = eutils.CleanupEncodedMarkup(ctx)
 				buffer.WriteString("\t| ")
 				buffer.WriteString(ctx)
 				if eutils.HasAmpOrNotASCII(ctx) {


=====================================
cmd/transmute.go
=====================================
@@ -992,7 +992,7 @@ func makePlain(inp io.Reader) {
 		}
 		if eutils.HasAngleOrAmpersandEncoding(str) {
 			str = eutils.RepairTableMarkup(str, eutils.SPACE)
-			// str = eutils.RemoveEmbeddedMarkup(str)
+			// str = eutils.RemoveEmbeddedMarkup(str, NOMARKUP)
 			str = eutils.RemoveHTMLDecorations(str)
 			str = eutils.CompressRunsOfSpaces(str)
 		}


=====================================
cmd/xtract.go
=====================================
@@ -103,12 +103,6 @@ func main() {
 		os.Exit(1)
 	}
 
-	// pretty-printing of xtract arguments
-	if args[0] == "-pretty" {
-		eutils.PrettyArguments(args[1:])
-		return
-	}
-
 	// performance arguments
 	chanDepth := 0
 	farmSize := 0
@@ -351,6 +345,14 @@ func main() {
 
 	eutils.SetOptions(doStrict, doMixed, doSelf, deAccent, deSymbol, doASCII, doCompress, doCleanup, doStem, deStop)
 
+	// pretty-printing of xtract arguments
+	if args[0] == "-pretty" {
+
+		eutils.PrettyArguments(args[1:])
+
+		return
+	}
+
 	// -stats prints number of CPUs and performance tuning values if no other arguments (undocumented)
 	if stts && len(args) < 1 {
 
@@ -585,6 +587,9 @@ func main() {
 	hd := ""
 	tl := ""
 
+	xmltype := ""
+	doctype := ""
+
 	for {
 
 		inSwitch = true
@@ -671,6 +676,18 @@ func main() {
 				hd = "<" + tmp + ">"
 				tl = "</" + tmp + ">"
 			}
+		case "-xml":
+			if len(args) < 2 {
+				eutils.DisplayError("Pattern missing after -xml command")
+				os.Exit(1)
+			}
+			xmltype = eutils.CleanXmltype(args[1])
+		case "-doctype":
+			if len(args) < 2 {
+				eutils.DisplayError("Pattern missing after -doctype command")
+				os.Exit(1)
+			}
+			doctype = eutils.CleanDoctype(args[1])
 		default:
 			// if not any of the controls, set flag to break out of for loop
 			inSwitch = false
@@ -689,6 +706,17 @@ func main() {
 		}
 	}
 
+	if head != "" {
+		// first prepend doctype
+		if doctype != "" {
+			head = doctype + "\n" + head
+		}
+		// then prepend xml line
+		if xmltype != "" {
+			head = xmltype + "\n" + head
+		}
+	}
+
 	// CREATE XML BLOCK READER FROM STDIN OR FILE
 
 	const FirstBuffSize = 4096
@@ -1455,23 +1483,27 @@ func main() {
 		return
 	}
 
-	// SPLIT FILE BY BY RECORD COUNT
+	// SPLIT FILE BY RECORD COUNT
 
 	// split XML record into subfiles by count
 	if len(args) == 8 && args[2] == "-split" && args[4] == "-prefix" && args[6] == "-suffix" {
 
 		// e.g., -head "<IdxDocumentSet>" -tail "</IdxDocumentSet>" -pattern IdxDocument -split 250000 -prefix "biocon" -suffix "e2x"
+
 		count := 0
 		fnum := 0
+
 		var (
 			fl  *os.File
 			err error
 		)
+
 		chunk, err := strconv.Atoi(args[3])
 		if err != nil {
 			eutils.DisplayError("-split argument '%s' is not an integer", err.Error())
 			return
 		}
+
 		prefix := args[5]
 		suffix := args[7]
 
@@ -1524,6 +1556,90 @@ func main() {
 		return
 	}
 
+	// ALLOT RECORDS BY BYTE SIZE
+
+	// allot XML record into subfiles by size
+	if len(args) == 8 && args[2] == "-allot" && args[4] == "-prefix" && args[6] == "-suffix" {
+
+		// e.g., -head "<IdxDocumentSet>" -tail "</IdxDocumentSet>" -pattern IdxDocument -allot 10000 -prefix "biocon" -suffix "e2x"
+
+		retlen := len("\n")
+		cumulative := 0
+		fnum := 0
+
+		headlen := 0
+		taillen := 0
+		if head != "" {
+			headlen += len(head) + retlen
+		}
+		if tail != "" {
+			taillen += len(tail) + retlen
+		}
+		padlen := retlen + headlen + taillen
+
+		var (
+			fl  *os.File
+			err error
+		)
+
+		chunk, err := strconv.Atoi(args[3])
+		if err != nil {
+			eutils.DisplayError("-allot argument '%s' is not an integer", err.Error())
+			return
+		}
+
+		prefix := args[5]
+		suffix := args[7]
+
+		eutils.PartitionXML(topPattern, star, false, rdr,
+			func(str string) {
+				recordCount++
+
+				if cumulative >= chunk {
+					if tail != "" {
+						fl.WriteString(tail)
+						fl.WriteString("\n")
+					}
+					fl.Close()
+					cumulative = 0
+				}
+				if cumulative == 0 {
+					fpath := fmt.Sprintf("%s%03d.%s", prefix, fnum, suffix)
+					fl, err = os.Create(fpath)
+					if err != nil {
+						eutils.DisplayError("Unable to create path '%s'", err.Error())
+						return
+					}
+					os.Stderr.WriteString(fpath + "\n")
+					fnum++
+					if head != "" {
+						fl.WriteString(head)
+						fl.WriteString("\n")
+					}
+				}
+				cumulative += len(str) + padlen
+
+				fl.WriteString(str[:])
+				fl.WriteString("\n")
+			})
+
+		if cumulative > 0 {
+			if tail != "" {
+				fl.WriteString(tail)
+				fl.WriteString("\n")
+			}
+			fl.Close()
+		}
+
+		debug.FreeOSMemory()
+
+		if timr {
+			printDuration("records")
+		}
+
+		return
+	}
+
 	// PARSE AND VALIDATE EXTRACTION ARGUMENTS
 
 	// parse nested exploration instruction from command-line arguments

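The new -allot mode above differs from -split by closing each output file only once its cumulative byte count has reached the limit, so files hold a variable number of records and may overshoot by one. A minimal sketch of that accounting, collecting batches in memory instead of writing files and ignoring the head/tail padding (the function name is illustrative, not the upstream API):

```go
package main

import "fmt"

// partitionBySize groups records into batches. Mirroring the -allot
// loop, the running total is checked BEFORE appending the next record,
// so a batch closes only after it has already reached the chunk limit
// and may therefore overshoot it by one record.
func partitionBySize(records []string, chunk int) [][]string {

	var batches [][]string
	var current []string
	cumulative := 0

	for _, rec := range records {
		if cumulative >= chunk {
			batches = append(batches, current)
			current = nil
			cumulative = 0
		}
		cumulative += len(rec) + 1 // account for the trailing newline
		current = append(current, rec)
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}

	return batches
}

func main() {
	batches := partitionBySize([]string{"aaaa", "bbbb", "cc", "dd"}, 8)
	fmt.Println(batches) // [[aaaa bbbb] [cc dd]]
}
```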

=====================================
ecommon.sh
=====================================
@@ -39,7 +39,7 @@ then
   set -x
 fi
 
-version="23.8"
+version="24.0"
 
 # initialize common flags
 


=====================================
efetch
=====================================
@@ -1058,16 +1058,20 @@ then
   join-into-groups-of "$chunk" |
   while read uids
   do
-    nquire -get $biocbase $xpoort biocxml $idtype $uids |
-    if [ "$raw" = true ]
-    then
-      # transmute -format indent -doctype ""
-      grep '.'
-    elif [ "$json" = true ]
+    res=$( nquire -get $biocbase $xpoort biocxml $idtype $uids )
+    if [ -n "$res" ]
     then
-      transmute -x2j
-    else
-      transmute -normalize bioc | transmute -format indent -doctype ""
+      echo "$res" |
+      if [ "$raw" = true ]
+      then
+        # transmute -format indent -doctype ""
+        grep '.'
+      elif [ "$json" = true ]
+      then
+        transmute -x2j
+      else
+        transmute -normalize bioc | transmute -format indent -doctype ""
+      fi
     fi
   done
 


=====================================
einfo
=====================================
@@ -207,6 +207,12 @@ if [ -n "$dbase" ]
 then
   res=$( RunWithCommonArgs nquire -get "$base" einfo.fcgi -db "$dbase" -version "2.0" )
 
+  if [ -z "$res" ]
+  then
+    DisplayError "einfo.fcgi query failed"
+    exit 1
+  fi
+
   # shortcut for fields
 
   if [ "$fields" = true ]


=====================================
eutils/eutils_test.go
=====================================
@@ -17,6 +17,26 @@ func stringTestMatch(t *testing.T, name string, proc func(str string) string, da
 	}
 }
 
+func policyTestMatch(t *testing.T, name string, proc func(str string, policy int) string, data []stringTable) {
+
+	for _, test := range data {
+		actual := proc(test.input, NOMARKUP)
+		if actual != test.expected {
+			t.Errorf("%s(%s) = %s, expected %s", name, test.input, actual, test.expected)
+		}
+	}
+}
+
+func spaceTestMatch(t *testing.T, name string, proc func(str string, policy int) string, data []stringTable) {
+
+	for _, test := range data {
+		actual := proc(test.input, SPACE)
+		if actual != test.expected {
+			t.Errorf("%s(%s) = %s, expected %s", name, test.input, actual, test.expected)
+		}
+	}
+}
+
 func TestCleanupAuthor(t *testing.T) {
 
 	stringTestMatch(t, "CleanupAuthor,",
@@ -38,6 +58,18 @@ func TestCleanupBadSpaces(t *testing.T) {
 		})
 }
 
+func TestCleanupEncodedMarkup(t *testing.T) {
+
+	stringTestMatch(t, "CleanupEncodedMarkup,",
+		CleanupEncodedMarkup,
+		[]stringTable{
+			{"&lt;sup&gt;", "<sup>"},
+			{"&amp;#181;", "&#181;"},
+			{"&amp;amp;amp;amp;amp;amp;amp;lt;", "&lt;"},
+			{"CO</sub><sub>2", "CO2"},
+		})
+}
+
 func TestCleanupSimple(t *testing.T) {
 
 	stringTestMatch(t, "CleanupSimple,",
@@ -176,10 +208,18 @@ func TestRelaxString(t *testing.T) {
 
 func TestRemoveEmbeddedMarkup(t *testing.T) {
 
-	stringTestMatch(t, "RemoveEmbeddedMarkup,",
+	policyTestMatch(t, "RemoveEmbeddedMarkup,",
 		RemoveEmbeddedMarkup,
 		[]stringTable{
 			{"using <i>Escherichia coli</i> bacteria", "using Escherichia coli bacteria"},
+			{"sulfuric H<sub>2</sub>SO<sub>4</sub> formula", "sulfuric H2SO4 formula"},
+		})
+
+	spaceTestMatch(t, "RemoveEmbeddedMarkup,",
+		RemoveEmbeddedMarkup,
+		[]stringTable{
+			{"emission E<sub>direct</sub> spectrum", "emission E direct spectrum"},
+			{"tritiated CO<sub>2</sub><sup>3</sup>H carboxyl", "tritiated CO2 3H carboxyl"},
 		})
 }
 
@@ -213,18 +253,6 @@ func TestRemoveHTMLDecorations(t *testing.T) {
 		})
 }
 
-func TestRepairEncodedMarkup(t *testing.T) {
-
-	stringTestMatch(t, "RepairEncodedMarkup,",
-		RepairEncodedMarkup,
-		[]stringTable{
-			{"<sup>", "<sup>"},
-			{"&lt;sup&gt;", "<sup>"},
-			{"&amp;#181;", "&#181;"},
-			{"&amp;amp;amp;amp;amp;amp;amp;lt;", "&lt;"},
-		})
-}
-
 func TestSortStringByWords(t *testing.T) {
 
 	stringTestMatch(t, "SortStringByWords",

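The effect of the policy parameter added to RemoveEmbeddedMarkup in this version: with SPACE, a removed tag that separated two letters or two digits leaves a single space behind, while NOMARKUP joins the runs. A standalone behavior sketch, using lowercase local names and a plain boolean in place of the eutils SPACE/NOMARKUP constants (those substitutions are assumptions for the sake of a self-contained example):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// removeEmbeddedMarkup strips <...> tags from mixed content. With pad
// set (mimicking the SPACE policy), a space is inserted wherever a
// removed tag separated two letters or two digits, so "E<sub>direct</sub>"
// becomes "E direct" while "H<sub>2</sub>O" stays "H2O".
func removeEmbeddedMarkup(str string, pad bool) string {

	inContent := true
	shouldPad := false
	var prev, curr rune
	var buffer strings.Builder

	for _, ch := range str {
		if ch == '<' {
			inContent = false
		} else if ch == '>' {
			if !inContent {
				shouldPad = true
			}
			inContent = true
		} else if inContent {
			prev, curr = curr, ch
			if shouldPad && pad {
				if (unicode.IsLetter(prev) && unicode.IsLetter(curr)) ||
					(unicode.IsDigit(prev) && unicode.IsDigit(curr)) {
					buffer.WriteRune(' ')
				}
			}
			shouldPad = false
			buffer.WriteRune(ch)
		}
	}

	return buffer.String()
}

func main() {
	fmt.Println(removeEmbeddedMarkup("tritiated CO<sub>2</sub><sup>3</sup>H carboxyl", true))
	// tritiated CO2 3H carboxyl
	fmt.Println(removeEmbeddedMarkup("sulfuric H<sub>2</sub>SO<sub>4</sub> formula", false))
	// sulfuric H2SO4 formula
}
```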

=====================================
eutils/format.go
=====================================
@@ -309,28 +309,8 @@ func xmlFormatter(rcrd, prnt string, inp <-chan XMLToken, offset int, doXML bool
 
 			// check for xml line explicitly set in argument
 			if xml != "" {
-				xml = strings.TrimSpace(xml)
-				if strings.HasPrefix(xml, "<") {
-					xml = xml[1:]
-				}
-				if strings.HasPrefix(xml, "?") {
-					xml = xml[1:]
-				}
-				if strings.HasPrefix(xml, "xml") {
-					xml = xml[3:]
-				}
-				if strings.HasPrefix(xml, " ") {
-					xml = xml[1:]
-				}
-				if strings.HasSuffix(xml, "?>") {
-					xlen := len(xml)
-					xml = xml[:xlen-2]
-				}
-				xml = strings.TrimSpace(xml)
-
-				buffer.WriteString("<?xml ")
+				xml = CleanXmltype(xml)
 				buffer.WriteString(xml)
-				buffer.WriteString(" ?>")
 			} else {
 				buffer.WriteString("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>")
 			}
@@ -339,25 +319,8 @@ func xmlFormatter(rcrd, prnt string, inp <-chan XMLToken, offset int, doXML bool
 
 			// check for doctype taken from XML file or explicitly set in argument
 			if doctype != "" {
-				doctype = strings.TrimSpace(doctype)
-				if strings.HasPrefix(doctype, "<") {
-					doctype = doctype[1:]
-				}
-				if strings.HasPrefix(doctype, "!") {
-					doctype = doctype[1:]
-				}
-				if strings.HasPrefix(doctype, "DOCTYPE") {
-					doctype = doctype[7:]
-				}
-				if strings.HasPrefix(doctype, " ") {
-					doctype = doctype[1:]
-				}
-				doctype = strings.TrimSuffix(doctype, ">")
-				doctype = strings.TrimSpace(doctype)
-
-				buffer.WriteString("<!DOCTYPE ")
+				doctype = CleanDoctype(doctype)
 				buffer.WriteString(doctype)
-				buffer.WriteString(">")
 			} else {
 				buffer.WriteString("<!DOCTYPE ")
 				buffer.WriteString(parent)


=====================================
eutils/go.mod
=====================================
@@ -5,13 +5,13 @@ go 1.24.0
 require (
 	github.com/fatih/color v1.18.0
 	github.com/gedex/inflector v0.0.0-20170307190818-16278e9db813
-	github.com/goccy/go-yaml v1.15.23
+	github.com/goccy/go-yaml v1.17.1
 	github.com/klauspost/cpuid v1.3.1
 	github.com/klauspost/pgzip v1.2.6
 	github.com/komkom/toml v0.1.2
 	github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58
 	github.com/surgebase/porter2 v0.0.0-20150829210152-56e4718818e8
-	golang.org/x/text v0.22.0
+	golang.org/x/text v0.24.0
 )
 
 require (


=====================================
eutils/go.sum
=====================================
@@ -4,8 +4,8 @@ github.com/fatih/color v1.18.0 h1:S8gINlzdQ840/4pfAwic/ZE0djQEH3wM94VfqLTZcOM=
 github.com/fatih/color v1.18.0/go.mod h1:4FelSpRwEGDpQ12mAdzqdOukCy4u8WUtOY6lkT/6HfU=
 github.com/gedex/inflector v0.0.0-20170307190818-16278e9db813 h1:Uc+IZ7gYqAf/rSGFplbWBSHaGolEQlNLgMgSE3ccnIQ=
 github.com/gedex/inflector v0.0.0-20170307190818-16278e9db813/go.mod h1:P+oSoE9yhSRvsmYyZsshflcR6ePWYLql6UU1amW13IM=
-github.com/goccy/go-yaml v1.15.23 h1:WS0GAX1uNPDLUvLkNU2vXq6oTnsmfVFocjQ/4qA48qo=
-github.com/goccy/go-yaml v1.15.23/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA=
+github.com/goccy/go-yaml v1.17.1 h1:LI34wktB2xEE3ONG/2Ar54+/HJVBriAGJ55PHls4YuY=
+github.com/goccy/go-yaml v1.17.1/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA=
 github.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo=
 github.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ=
 github.com/klauspost/cpuid v1.3.1 h1:5JNjFYYQrZeKRJ0734q51WCEEn2huer72Dc7K+R/b6s=
@@ -36,8 +36,8 @@ golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBc
 golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.25.0 h1:r+8e+loiHxRqhXVl6ML1nO3l1+oFoWbnlu2Ehimmi34=
 golang.org/x/sys v0.25.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
-golang.org/x/text v0.22.0 h1:bofq7m3/HAFvbF51jz3Q9wLg3jkvSPuiZu/pD1XwgtM=
-golang.org/x/text v0.22.0/go.mod h1:YRoo4H8PVmsu+E3Ou7cqLVH8oXWIHVoX0jqUWALQhfY=
+golang.org/x/text v0.24.0 h1:dd5Bzh4yt5KYA8f9CJHCP4FB4D51c2c6JvN37xJJkJ0=
+golang.org/x/text v0.24.0/go.mod h1:L8rBsPeo2pSS+xqN0d5u2ikmjtmoJbDBT1b7nHvFCdU=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c h1:dUUwHk2QECo/6vqA44rthZ8ie2QXMNeKRTHCNY2nXvo=
 gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=


=====================================
eutils/misc.go
=====================================
@@ -1356,6 +1356,37 @@ var ncbi4naToIupac = map[int]string{
 	255: "NN",
 }
 
+// CleanDoctype accepts full doctype objects or just the inner contents
+func CleanDoctype(doctype string) string {
+
+	doctype = strings.TrimSpace(doctype)
+	doctype = strings.TrimPrefix(doctype, "<")
+	doctype = strings.TrimPrefix(doctype, "!")
+	doctype = strings.TrimPrefix(doctype, "DOCTYPE")
+	doctype = strings.TrimSuffix(doctype, ">")
+	doctype = strings.TrimSpace(doctype)
+
+	doctype = "<!DOCTYPE " + doctype + ">"
+
+	return doctype
+}
+
+// CleanXmltype accepts full xml objects or just the inner contents
+func CleanXmltype(xmltype string) string {
+
+	xmltype = strings.TrimSpace(xmltype)
+	xmltype = strings.TrimPrefix(xmltype, "<")
+	xmltype = strings.TrimPrefix(xmltype, "?")
+	xmltype = strings.TrimPrefix(xmltype, "xml")
+	xmltype = strings.TrimSuffix(xmltype, ">")
+	xmltype = strings.TrimSuffix(xmltype, "?")
+	xmltype = strings.TrimSpace(xmltype)
+
+	xmltype = "<?xml " + xmltype + " ?>"
+
+	return xmltype
+}
+
 // use RelaxString to convert non-alphanumeric characters to spaces
 
 // CleanupAuthor fixes misused letters and accents
@@ -1488,7 +1519,7 @@ func CleanupContents(str string, ascii, amper, mixed bool) string {
 	}
 	if allowEmbed {
 		if amper {
-			str = RepairEncodedMarkup(str)
+			str = CleanupEncodedMarkup(str)
 		}
 	}
 	if doScript {
@@ -1513,7 +1544,7 @@ func CleanupContents(str string, ascii, amper, mixed bool) string {
 				str = RepairScriptMarkup(str, CONCISE)
 				str = RepairTableMarkup(str, SPACE)
 				// call RepairScriptMarkup before RemoveEmbeddedMarkup
-				str = RemoveEmbeddedMarkup(str)
+				str = RemoveEmbeddedMarkup(str, NOMARKUP)
 			}
 		}
 		if ascii && HasBadSpace(str) {
@@ -1574,6 +1605,162 @@ func CleanupContents(str string, ascii, amper, mixed bool) string {
 	return str
 }
 
+// CleanupEncodedMarkup removes ampersand-encoded markup
+func CleanupEncodedMarkup(str string) string {
+
+	// convert &lt;sup&gt; to <sup> (html subset)
+	// convert &amp;#181; to &#181; (but not further - use html.UnescapeString)
+	// convert &amp;amp;amp;amp;amp;amp;amp;lt; to &lt;
+	// remove </sub><sub> or </sup><sup> (internals)
+
+	var buffer strings.Builder
+
+	lookAhead := func(txt string, to int) string {
+		mx := len(txt)
+		if to > mx {
+			to = mx
+		}
+		pos := strings.Index(txt[:to], "gt;")
+		if pos > 0 {
+			to = pos + 3
+		}
+		return txt[:to]
+	}
+
+	skip := 0
+
+	for i, ch := range str {
+		if skip > 0 {
+			skip--
+			continue
+		}
+		if ch == '<' {
+			// remove internal tags in runs of subscripts or superscripts
+			if strings.HasPrefix(str[i:], "</sub><sub>") || strings.HasPrefix(str[i:], "</sup><sup>") {
+				skip = 10
+				continue
+			}
+			buffer.WriteRune(ch)
+			continue
+		} else if ch != '&' {
+			buffer.WriteRune(ch)
+			continue
+		} else if strings.HasPrefix(str[i:], "&lt;") {
+			sub := lookAhead(str[i:], 14)
+			txt, ok := htmlRepair[sub]
+			if ok {
+				adv := len(sub) - 1
+				// do not convert if flanked by spaces - it may be a scientific symbol,
+				// e.g., fragments <i> in PMID 9698410, or escaped <b> and <d> tags used
+				// to indicate stem position in letters in PMID 21892341
+				if i < 1 || str[i-1] != ' ' || !strings.HasPrefix(str[i+adv:], "; ") {
+					buffer.WriteString(txt)
+					skip = adv
+					continue
+				}
+			}
+		} else if strings.HasPrefix(str[i:], "&amp;") {
+			if strings.HasPrefix(str[i:], "&amp;lt;") {
+				sub := lookAhead(str[i:], 22)
+				txt, ok := htmlRepair[sub]
+				if ok {
+					buffer.WriteString(txt)
+					skip = len(sub) - 1
+					continue
+				} else {
+					buffer.WriteString("&lt;")
+					skip = 7
+					continue
+				}
+			} else if strings.HasPrefix(str[i:], "&amp;gt;") {
+				buffer.WriteString("&gt;")
+				skip = 7
+				continue
+			} else {
+				skip = 4
+				j := i + 5
+				// remove runs of multiply-encoded ampersands
+				for strings.HasPrefix(str[j:], "amp;") {
+					skip += 4
+					j += 4
+				}
+				// then look for special symbols used in PubMed records
+				if strings.HasPrefix(str[j:], "lt;") {
+					buffer.WriteString("&lt;")
+					skip += 3
+				} else if strings.HasPrefix(str[j:], "gt;") {
+					buffer.WriteString("&gt;")
+					skip += 3
+				} else if strings.HasPrefix(str[j:], "frac") {
+					buffer.WriteString("&frac")
+					skip += 4
+				} else if strings.HasPrefix(str[j:], "plusmn") {
+					buffer.WriteString("&plusmn")
+					skip += 6
+				} else if strings.HasPrefix(str[j:], "acute") {
+					buffer.WriteString("&acute")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "aacute") {
+					buffer.WriteString("&aacute")
+					skip += 6
+				} else if strings.HasPrefix(str[j:], "rsquo") {
+					buffer.WriteString("&rsquo")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "lsquo") {
+					buffer.WriteString("&lsquo")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "micro") {
+					buffer.WriteString("&micro")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "oslash") {
+					buffer.WriteString("&oslash")
+					skip += 6
+				} else if strings.HasPrefix(str[j:], "kgr") {
+					buffer.WriteString("&kgr")
+					skip += 3
+				} else if strings.HasPrefix(str[j:], "apos") {
+					buffer.WriteString("&apos")
+					skip += 4
+				} else if strings.HasPrefix(str[j:], "quot") {
+					buffer.WriteString("&quot")
+					skip += 4
+				} else if strings.HasPrefix(str[j:], "alpha") {
+					buffer.WriteString("&alpha")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "beta") {
+					buffer.WriteString("&beta")
+					skip += 4
+				} else if strings.HasPrefix(str[j:], "gamma") {
+					buffer.WriteString("&gamma")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "Delta") {
+					buffer.WriteString("&Delta")
+					skip += 5
+				} else if strings.HasPrefix(str[j:], "phi") {
+					buffer.WriteString("&phi")
+					skip += 3
+				} else if strings.HasPrefix(str[j:], "ge") {
+					buffer.WriteString("&ge")
+					skip += 2
+				} else if strings.HasPrefix(str[j:], "sup2") {
+					buffer.WriteString("&sup2")
+					skip += 4
+				} else if strings.HasPrefix(str[j:], "#") {
+					buffer.WriteString("&")
+				} else {
+					buffer.WriteString("&")
+				}
+				continue
+			}
+		}
+
+		// if loop not continued by any preceding test, print character
+		buffer.WriteRune(ch)
+	}
+
+	return buffer.String()
+}
+
 // CleanupPlain removes embedded mixed-content markup tags
 func CleanupPlain(str string, wrp bool) string {
 
@@ -1583,7 +1770,7 @@ func CleanupPlain(str string, wrp bool) string {
 
 	str = strings.Replace(str, "\n", " ", -1)
 
-	str = RemoveEmbeddedMarkup(str)
+	str = RemoveEmbeddedMarkup(str, NOMARKUP)
 	str = TransformAccents(str, false, false)
 
 	if HasUnicodeMarkup(str) {
@@ -1627,7 +1814,7 @@ func CleanupProse(str string, wrp bool) string {
 		str = html.UnescapeString(str)
 	}
 
-	str = RemoveEmbeddedMarkup(str)
+	str = RemoveEmbeddedMarkup(str, SPACE)
 	str = TransformAccents(str, false, false)
 	str = FixMisusedLetters(str, true, false, true)
 	str = TransformAccents(str, false, false)
@@ -1684,11 +1871,11 @@ func CleanupQuery(str string, exactMatch, removeBrackets bool) string {
 
 	if removeBrackets {
 		if HasAngleOrAmpersandEncoding(str) {
-			str = RepairEncodedMarkup(str)
+			str = CleanupEncodedMarkup(str)
 			str = RepairScriptMarkup(str, SPACE)
 			str = RepairMathMLMarkup(str, SPACE)
 			// RemoveEmbeddedMarkup must be called before UnescapeString, which was suppressed in ExploreElements
-			str = RemoveEmbeddedMarkup(str)
+			str = RemoveEmbeddedMarkup(str, NOMARKUP)
 		}
 	}
 
@@ -1977,7 +2164,7 @@ func FlattenMathML(str string, policy int) string {
 
 	str = strings.TrimSpace(str)
 
-	// str = RemoveEmbeddedMarkup(str)
+	// str = RemoveEmbeddedMarkup(str, NOMARKUP)
 
 	return str
 }
@@ -2766,12 +2953,12 @@ func PrepareForIndexing(str string, doHomoglyphs, isAuthor, isProse, spellGreek,
 		str = CleanupBadSpaces(str)
 	}
 	if HasAngleOrAmpersandEncoding(str) {
-		str = RepairEncodedMarkup(str)
+		str = CleanupEncodedMarkup(str)
 		str = RepairTableMarkup(str, SPACE)
 		str = RepairScriptMarkup(str, SPACE)
 		str = RepairMathMLMarkup(str, SPACE)
 		// RemoveEmbeddedMarkup must be called before UnescapeString, which was suppressed in ExploreElements
-		str = RemoveEmbeddedMarkup(str)
+		str = RemoveEmbeddedMarkup(str, SPACE)
 	}
 
 	if HasAmpOrNotASCII(str) {
@@ -2846,17 +3033,41 @@ func RelaxString(str string) string {
 }
 
 // RemoveEmbeddedMarkup removes internal mixed-content sections
-func RemoveEmbeddedMarkup(str string) string {
+func RemoveEmbeddedMarkup(str string, policy int) string {
 
 	inContent := true
-	var buffer strings.Builder
+	shouldPad := false
+	var (
+		curr   rune
+		prev   rune
+		buffer strings.Builder
+	)
 
 	for _, ch := range str {
 		if ch == '<' {
 			inContent = false
 		} else if ch == '>' {
+			if !inContent {
+				shouldPad = true
+			}
 			inContent = true
 		} else if inContent {
+			prev = curr
+			curr = ch
+			if shouldPad {
+				if unicode.IsLetter(prev) && unicode.IsLetter(curr) {
+					switch policy {
+					case SPACE:
+						buffer.WriteRune(' ')
+					}
+				} else if unicode.IsDigit(prev) && unicode.IsDigit(curr) {
+					switch policy {
+					case SPACE:
+						buffer.WriteRune(' ')
+					}
+				}
+			}
+			shouldPad = false
 			buffer.WriteRune(ch)
 		}
 	}
@@ -2925,162 +3136,6 @@ func RemoveHTMLDecorations(str string) string {
 	return str
 }
 
-// RepairEncodedMarkup removes ampersand-encoded markup
-func RepairEncodedMarkup(str string) string {
-
-	// convert &lt;sup&gt; to <sup> (html subset)
-	// convert &amp;#181; to &#181; (but not further - use html.UnescapeString)
-	// convert &amp;amp;amp;amp;amp;amp;amp;lt; to &lt;
-	// remove </sub><sub> or </sup><sup> (internals)
-
-	var buffer strings.Builder
-
-	lookAhead := func(txt string, to int) string {
-		mx := len(txt)
-		if to > mx {
-			to = mx
-		}
-		pos := strings.Index(txt[:to], "gt;")
-		if pos > 0 {
-			to = pos + 3
-		}
-		return txt[:to]
-	}
-
-	skip := 0
-
-	for i, ch := range str {
-		if skip > 0 {
-			skip--
-			continue
-		}
-		if ch == '<' {
-			// remove internal tags in runs of subscripts or superscripts
-			if strings.HasPrefix(str[i:], "</sub><sub>") || strings.HasPrefix(str[i:], "</sup><sup>") {
-				skip = 10
-				continue
-			}
-			buffer.WriteRune(ch)
-			continue
-		} else if ch != '&' {
-			buffer.WriteRune(ch)
-			continue
-		} else if strings.HasPrefix(str[i:], "&lt;") {
-			sub := lookAhead(str[i:], 14)
-			txt, ok := htmlRepair[sub]
-			if ok {
-				adv := len(sub) - 1
-				// do not convert if flanked by spaces - it may be a scientific symbol,
-				// e.g., fragments <i> in PMID 9698410, or escaped <b> and <d> tags used
-				// to indicate stem position in letters in PMID 21892341
-				if i < 1 || str[i-1] != ' ' || !strings.HasPrefix(str[i+adv:], "; ") {
-					buffer.WriteString(txt)
-					skip = adv
-					continue
-				}
-			}
-		} else if strings.HasPrefix(str[i:], "&amp;") {
-			if strings.HasPrefix(str[i:], "&amp;lt;") {
-				sub := lookAhead(str[i:], 22)
-				txt, ok := htmlRepair[sub]
-				if ok {
-					buffer.WriteString(txt)
-					skip = len(sub) - 1
-					continue
-				} else {
-					buffer.WriteString("&lt;")
-					skip = 7
-					continue
-				}
-			} else if strings.HasPrefix(str[i:], "&amp;gt;") {
-				buffer.WriteString("&gt;")
-				skip = 7
-				continue
-			} else {
-				skip = 4
-				j := i + 5
-				// remove runs of multiply-encoded ampersands
-				for strings.HasPrefix(str[j:], "amp;") {
-					skip += 4
-					j += 4
-				}
-				// then look for special symbols used in PubMed records
-				if strings.HasPrefix(str[j:], "lt;") {
-					buffer.WriteString("&lt;")
-					skip += 3
-				} else if strings.HasPrefix(str[j:], "gt;") {
-				buffer.WriteString("&gt;")
-					skip += 3
-				} else if strings.HasPrefix(str[j:], "frac") {
-					buffer.WriteString("&frac")
-					skip += 4
-				} else if strings.HasPrefix(str[j:], "plusmn") {
-					buffer.WriteString("&plusmn")
-					skip += 6
-				} else if strings.HasPrefix(str[j:], "acute") {
-					buffer.WriteString("&acute")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "aacute") {
-					buffer.WriteString("&aacute")
-					skip += 6
-				} else if strings.HasPrefix(str[j:], "rsquo") {
-					buffer.WriteString("&rsquo")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "lsquo") {
-					buffer.WriteString("&lsquo")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "micro") {
-					buffer.WriteString("&micro")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "oslash") {
-					buffer.WriteString("&oslash")
-					skip += 6
-				} else if strings.HasPrefix(str[j:], "kgr") {
-					buffer.WriteString("&kgr")
-					skip += 3
-				} else if strings.HasPrefix(str[j:], "apos") {
-					buffer.WriteString("&apos")
-					skip += 4
-				} else if strings.HasPrefix(str[j:], "quot") {
-					buffer.WriteString("&quot")
-					skip += 4
-				} else if strings.HasPrefix(str[j:], "alpha") {
-					buffer.WriteString("&alpha")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "beta") {
-					buffer.WriteString("&beta")
-					skip += 4
-				} else if strings.HasPrefix(str[j:], "gamma") {
-					buffer.WriteString("&gamma")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "Delta") {
-					buffer.WriteString("&Delta")
-					skip += 5
-				} else if strings.HasPrefix(str[j:], "phi") {
-					buffer.WriteString("&phi")
-					skip += 3
-				} else if strings.HasPrefix(str[j:], "ge") {
-					buffer.WriteString("&ge")
-					skip += 2
-				} else if strings.HasPrefix(str[j:], "sup2") {
-					buffer.WriteString("&sup2")
-					skip += 4
-				} else if strings.HasPrefix(str[j:], "#") {
-					buffer.WriteString("&")
-				} else {
-					buffer.WriteString("&")
-				}
-				continue
-			}
-		}
-
-		// if loop not continued by any preceding test, print character
-		buffer.WriteRune(ch)
-	}
-
-	return buffer.String()
-}
-
 // RepairMathMLMarkup removes MathML embedded markup symbols
 func RepairMathMLMarkup(str string, policy int) string {
 


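[Editor's note] The removed branch above first collapses runs of multiply-encoded ampersands (so "&amp;amp;amp;lt;" and "&amp;lt;" both reduce to "&lt;") before matching entity names. A minimal standalone Go sketch of just that run-collapsing step, under the assumption that only lt/gt need decoding (decodeMultiAmp is a hypothetical name; the real function handles many more PubMed entities):

```go
package main

import (
	"fmt"
	"strings"
)

// decodeMultiAmp collapses runs of multiply-encoded ampersands,
// so "&amp;amp;lt;" and "&amp;lt;" both decode to "<" in one pass.
// Only lt and gt are handled here; other entities pass through
// with the leading ampersand preserved.
func decodeMultiAmp(s string) string {
	var buf strings.Builder
	i := 0
	for i < len(s) {
		if strings.HasPrefix(s[i:], "&amp;") {
			// skip the leading "&", then any run of "amp;"
			j := i + 1
			for strings.HasPrefix(s[j:], "amp;") {
				j += 4
			}
			switch {
			case strings.HasPrefix(s[j:], "lt;"):
				buf.WriteByte('<')
				i = j + 3
			case strings.HasPrefix(s[j:], "gt;"):
				buf.WriteByte('>')
				i = j + 3
			default:
				// unrecognized entity: keep a single ampersand
				buf.WriteByte('&')
				i = j
			}
			continue
		}
		buf.WriteByte(s[i])
		i++
	}
	return buf.String()
}

func main() {
	fmt.Println(decodeMultiAmp("&amp;amp;amp;lt;i&amp;gt;"))
}
```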
=====================================
eutils/utils.go
=====================================
@@ -47,7 +47,7 @@ import (
 )
 
 // EDirectVersion is the current EDirect release number
-const EDirectVersion = "23.8"
+const EDirectVersion = "24.0"
 
 // ANSI escape codes for terminal color, highlight, and reverse
 const (


=====================================
eutils/xplore.go
=====================================
@@ -1998,12 +1998,12 @@ func processClause(
 				str = CleanupBadSpaces(str)
 			}
 			if HasAngleOrAmpersandEncoding(str) {
-				str = RepairEncodedMarkup(str)
+				str = CleanupEncodedMarkup(str)
 				str = RepairTableMarkup(str, SPACE)
 				str = RepairScriptMarkup(str, SPACE)
 				str = RepairMathMLMarkup(str, SPACE)
 				// RemoveEmbeddedMarkup must be called before UnescapeString, which was suppressed in ExploreElements
-				str = RemoveEmbeddedMarkup(str)
+				str = RemoveEmbeddedMarkup(str, SPACE)
 			}
 
 			if HasAmpOrNotASCII(str) {


=====================================
gbf2info
=====================================
@@ -68,7 +68,7 @@ xtract -rec Rec -pattern INSDSeq -SEQ INSDSeq_sequence -MOL INSDSeq_moltype \
     -group INSDFeature -if "&KEY" -equals CDS -and "&MOL" -is-not AA \
       -block INSDFeature_intervals -if "&MOL" -equals mRNA -FR -first INSDInterval_from -TO -first INSDInterval_to \
         -subset INSDFeature_intervals -if "&FR" -lt "&TO" \
-          -OFS -min INSDInterval_from,INSDInterval_to -wrp Offset -dec "&OFS" \
+          -OFS -min INSDInterval_from,INSDInterval_to -wrp offset -dec "&OFS" \
       -block INSDQualifier -if INSDQualifier_name -equals codon_start -FM INSDQualifier_value \
       -block INSDQualifier -if INSDQualifier_name -equals transl_table -GC INSDQualifier_value \
       -block INSDFeature -pkg mol_wt -molwt "&SUB" \


=====================================
help/tst-elink.txt
=====================================
@@ -1,4 +1,5 @@
 assembly	nuccore	9513491
+bioproject	sra	1183783
 cdd	pubmed	274590
 gds	pubmed	1336
 gds	taxonomy	1336


=====================================
help/xtract-help.txt
=====================================
@@ -426,6 +426,10 @@ Xtract Examples
 
   -1-based ChrStart
 
+  -tag Item -att type journal -cls -encode Source -end Item
+
+  -tag Item -att type journal -atr name Source -slf
+
   -insd CDS gene product protein_id translation
 
   -insd complete mat_peptide "%peptide" product peptide
@@ -440,7 +444,9 @@ Xtract Examples
 
   -wrp PubmedArticleSet -pattern PubmedArticle -sort MedlineCitation/PMID
 
-  -pattern PubmedArticle -split 5000 -prefix "subset" -suffix "xml"
+  -set PubmedArticleSet -pattern PubmedArticle -split 1000 -prefix "subset" -suffix "xml"
+
+  -set PubmedArticleSet -pattern PubmedArticle -allot 500000 -prefix "subset" -suffix "xml"
 
   -pattern PubmedBookArticle -path BookDocument.Book.AuthorList.Author -element LastName
 


=====================================
nquire
=====================================
@@ -147,14 +147,24 @@ ParseXMLObject() {
   echo "$mesg" | sed -n "s|.*<$objc[^>]*>\\(.*\\)</$objc>.*|\\1|p"
 }
 
-# check for whether Aspera is installed
+# NCBI servers now require Aspera Connect client version 4.2 or above.
+
+# To download the free client, open the IBM Aspera Connect subsection in:
+#   https://www.ibm.com/products/aspera/downloads#cds
+
+# Also see discussion in:
+#   https://www.biostars.org/p/9553092/
 
 APPPATH=""
 KEYPATH=""
-KEYNAME=asperaweb_id_dsa.openssh
+KEYNAME=aspera_tokenauth_id_rsa
+
+# the old KEYNAME, asperaweb_id_dsa.openssh, is no longer used
 
 HasAspera() {
 
+  # check to see if the Aspera Connect client is installed
+
   if [ -n "${EDIRECT_NO_ASPERA}" ] && [ "${EDIRECT_NO_ASPERA}" = true ]
   then
     return 1
@@ -162,7 +172,7 @@ HasAspera() {
 
   case "$( uname -s )" in
     Darwin )
-      sysdir='/Applications/Aspera Connect.app/Contents/Resources'
+      sysdir='/Applications/IBM Aspera Connect.app/Contents/Resources'
       sysdir2=/bin
       userdir=$HOME$sysdir
       ;;
@@ -179,13 +189,13 @@ HasAspera() {
   esac
   for d in "$sysdir" "$sysdir2" "$userdir"
   do
-    if "$d/ascp" --version 2>&1 | grep '^Aspera' >/dev/null
+    if "$d/ascp" --version 2>&1 | grep -e '^Aspera' -e '^IBM Aspera' >/dev/null
     then
       APPPATH=$d
       break
     fi
   done
-  if [ -z "$APPPATH" ]  &&  ascp --version 2>&1 | grep '^Aspera' >/dev/null
+  if [ -z "$APPPATH" ]  &&  ascp --version 2>&1 | grep -e '^Aspera' -e '^IBM Aspera' >/dev/null
   then
     APPPATH=$( type -path ascp )
     APPPATH=$( dirname "$APPPATH" )
@@ -1147,6 +1157,12 @@ DownloadOneFile() {
         SendRequest "$urlfl" > "$fl"
         ;;
       -asp )
+        if [ -z "${ASPERA_SCP_PASS}" ]
+        then
+          # this value is a public default key published in an IBM document:
+          #   https://delivery04.dhe.ibm.com/sar/CMA/OSA/08orb/0/IBM_Aspera_Faspex_Admin_4.4.0_Linux.pdf
+          export ASPERA_SCP_PASS=743128bf-3bf3-45b5-ab14-4602c67f2950
+        fi
         starttime=$( GetTime )
         "$APPPATH/ascp" -T -q -k 1 -l 500m -i "$KEYPATH/$KEYNAME" \
         "anonftp@$urlfl" "."


=====================================
xcommon.sh
=====================================
@@ -127,6 +127,34 @@ DisplayNote() {
   fi
 }
 
+# parse XML Config/File object
+
+ParseConfig() {
+
+  mesg=$1
+  objc=$2
+  shift 2
+
+  if [ -z "$mesg" ]
+  then
+    return 1
+  fi
+
+  while [ $# -gt 0 ]
+  do
+    var=$1
+    fld=$2
+    shift 2
+    value=$( echo "$mesg" | xtract -pattern Rec -ret "" -element "$fld" )
+    if [ -n "$value" ]
+    then
+      eval "$var=\$value"
+    fi
+  done
+
+  return 0
+}
+
 # parse ENTREZ_DIRECT object
 
 ParseMessage() {


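[Editor's note] The new ParseConfig helper above takes (variable, field) pairs and, for each field present in the flat XML record, assigns its text to the named shell variable. A minimal Go sketch of the field-extraction half of that idea (extractField is a hypothetical name; the real script delegates extraction to xtract, and field names are assumed to be plain identifiers with no regex metacharacters):

```go
package main

import (
	"fmt"
	"regexp"
)

// extractField pulls the text content of one flat XML element out of
// a config record, similar to what ParseConfig obtains from
// "xtract -pattern Rec -element". Returns "" if the field is absent.
func extractField(mesg, fld string) string {
	re := regexp.MustCompile("<" + fld + ">(.*?)</" + fld + ">")
	if m := re.FindStringSubmatch(mesg); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	rec := "<Rec> <name>PubmedArticle</name> <set>PubmedArticleSet</set> </Rec>"
	fmt.Println(extractField(rec, "set"))
}
```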
=====================================
xfetch
=====================================
@@ -224,65 +224,28 @@ FixDateConstraints
 
 FindArchiveFolder
 
-# get database-specific parameters from optional xfetch.ini configuration file
+# get database-specific parameters from xfetch.ini configuration file
 
-GetColumn() {
-
-  params="$1"
-  col="$2"
-
-  # remove result if default hyphen indicates field is missing from config file
-  echo "$params" | cut -f "$col" | sed -e 's/^-$//g' | grep '.'
-}
-
-if [ -f "${pth}/xfetch.ini" ]
+if [ ! -f "$pth"/xfetch.ini ]
 then
-  params=$(
-    cat "${pth}/xfetch.ini" |
-    ini2xml |
-    xtract -pattern "ConfigFile" -group "$dbase" -def "-" -element name set xml doctype skip
-  )
-  if [ -n "$params" ]
-  then
-    recname=$( GetColumn "$params" "1" )
-    settag=$( GetColumn "$params" "2" )
-    xmltag=$( GetColumn "$params" "3" )
-    doctype=$( GetColumn "$params" "4" )
-    recskip=$( GetColumn "$params" "5" )
-  fi
+  echo "ERROR: Unable to find '$pth/xfetch.ini' file" >&2
+  exit 1
 fi
 
-# set database-specific parameters for original NCBI databases unless already set
-
-if [ -z "$recname" ] || [ -z "$settag" ] || [ -z "$xmltag" ] || [ -z "$doctype" ]
+mssg=$(
+  cat "${pth}/xfetch.ini" |
+  ini2xml |
+  xtract -rec Rec -pattern "ConfigFile/*" -select "$dbase" |
+  tr '\n' ' '
+)
+if [ -n "$mssg" ]
 then
-  case "$dbase" in
-    pubmed )
-      recname="PubmedArticle"
-      settag="PubmedArticleSet"
-      xmltag='<?xml version="1.0" encoding="UTF-8" ?>'
-      doctype='<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">'
-      ;;
-    pmc )
-      recname="PMCInfo"
-      settag="PMCInfoSet"
-      xmltag='<?xml version="1.0" encoding="UTF-8" ?>'
-      doctype='<!DOCTYPE PMCInfoSet>'
-      ;;
-    taxonomy )
-      recname="TaxonInfo"
-      settag="TaxonInfoSet"
-      xmltag='<?xml version="1.0" encoding="UTF-8" ?>'
-      doctype='<!DOCTYPE TaxonInfoSet>'
-      ;;
-    * )
-      ;;
-  esac
+  ParseConfig "$mssg" Rec recname name settag set xmltag xml doctype doctype recskip skip
 fi
 
 if [ -z "$recname" ] || [ -z "$settag" ]
 then
-  echo "ERROR: Missing -tag or -set from database-specific wrapper, use database-specific fetch script" >&2
+  echo "ERROR: Missing -tag or -set from database-specific wrapper in xfetch.ini file" >&2
   exit 1
 fi
 


=====================================
xlink
=====================================
@@ -66,6 +66,7 @@ fi
 dbase=""
 debug=false
 raw=false
+target=""
 
 while [ $# -gt 0 ]
 do
@@ -188,48 +189,18 @@ FixDateConstraints
 
 FindPostingsFolder
 
-if [ $# -lt 1 ]
+if [ $# -lt 2 ]
 then
   echo "ERROR: Insufficient arguments given to xlink" >&2
   exit 1
 fi
 
-# call rchive -link
-
 val="$1"
 shift
 case "$val" in
   -target )
-    if [ "$raw" = true ]
-    then
-      GetUIDs |
-      word-at-a-time |
-      rchive -db "$dbase" -link "$*"
-    else
-      flt=""
-      num="0"
-      uids=$( GetUIDs | word-at-a-time | rchive -db "$dbase" -link "$*" )
-      if [ -n "$uids" ]
-      then
-        flt=$( echo "$uids" | sed -e 's/^/  <Id>/' -e 's/$/<\/Id>/' )
-        num=$( echo "$uids" | wc -l | tr -d ' ' )
-        echo "<ENTREZ_DIRECT>"
-        if [ -n "$dbase" ]
-        then
-          echo "  <Db>${dbase}</Db>"
-        fi
-        if [ -n "$num" ]
-        then
-          echo "  <Count>${num}</Count>"
-        fi
-        if [ -n "$flt" ]
-        then
-          echo "$flt"
-        fi
-        echo "  <Source>Local</Source>"
-        echo "</ENTREZ_DIRECT>"
-      fi
-    fi
+    target="$1"
+    shift
     ;;
   -* )
     exec >&2
@@ -243,4 +214,58 @@ case "$val" in
     ;;
 esac
 
+# get database-specific parameters from xlink.ini configuration file
+
+dest="$dbase"
+
+if [ ! -f "$pth"/xlink.ini ]
+then
+  echo "ERROR: Unable to find '$pth/xlink.ini' file" >&2
+  exit 1
+fi
+
+mssg=$(
+  cat "${pth}/xlink.ini" |
+  ini2xml |
+  xtract -rec Rec -pattern "ConfigFile/*" -select "$dbase" |
+  tr '\n' ' '
+)
+if [ -n "$mssg" ]
+then
+  ParseConfig "$mssg" Rec dest "$target"
+fi
+
+# call rchive -link
+
+if [ "$raw" = true ]
+then
+  GetUIDs |
+  word-at-a-time |
+  rchive -db "$dbase" -link "$target"
+else
+  flt=""
+  num="0"
+  uids=$( GetUIDs | word-at-a-time | rchive -db "$dbase" -link "$target" )
+  if [ -n "$uids" ]
+  then
+    flt=$( echo "$uids" | sed -e 's/^/  <Id>/' -e 's/$/<\/Id>/' )
+    num=$( echo "$uids" | wc -l | tr -d ' ' )
+    echo "<ENTREZ_DIRECT>"
+    if [ -n "$dest" ]
+    then
+      echo "  <Db>${dest}</Db>"
+    fi
+    if [ -n "$num" ]
+    then
+      echo "  <Count>${num}</Count>"
+    fi
+    if [ -n "$flt" ]
+    then
+      echo "$flt"
+    fi
+    echo "  <Source>Local</Source>"
+    echo "</ENTREZ_DIRECT>"
+  fi
+fi
+
 exit 0


=====================================
xlink.ini
=====================================
@@ -0,0 +1,9 @@
+# xlink.ini
+
+# Public domain notice for all NCBI EDirect scripts is located at:
+# https://www.ncbi.nlm.nih.gov/books/NBK179288/#chapter6.Public_Domain_Notice
+
+[pubmed]
+CITED=pubmed
+CITES=pubmed
+PMCID=pmc
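[Editor's note] The new xlink.ini maps each link name (CITED, CITES, PMCID) within a database section to its destination database. A minimal Go sketch of reading such a file into a nested map (parseINI is a hypothetical helper; the real scripts convert the file with ini2xml and query it via xtract):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseINI reads a minimal ini file ([section] headers, KEY=VALUE
// lines, '#' comments) into a nested map, enough for files shaped
// like xlink.ini.
func parseINI(text string) map[string]map[string]string {
	conf := make(map[string]map[string]string)
	section := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = strings.Trim(line, "[]")
			conf[section] = make(map[string]string)
			continue
		}
		if k, v, ok := strings.Cut(line, "="); ok && section != "" {
			conf[section][strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return conf
}

func main() {
	ini := "# xlink.ini\n[pubmed]\nCITED=pubmed\nCITES=pubmed\nPMCID=pmc\n"
	conf := parseINI(ini)
	fmt.Println(conf["pubmed"]["PMCID"])
}
```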



View it on GitLab: https://salsa.debian.org/med-team/ncbi-entrez-direct/-/compare/18ae420437d125cd4c4f4fabbb8d4625dfa780fc...08f679c67d79eb9568a05db088fcb7c5d262ed3b




