[med-svn] [Git][med-team/ncbi-entrez-direct][master] 6 commits: New upstream version 13.7.20200713+dfsg
Aaron M. Ucko
gitlab at salsa.debian.org
Wed Jul 15 03:20:44 BST 2020
Aaron M. Ucko pushed to branch master at Debian Med / ncbi-entrez-direct
Commits:
bbe0173f by Aaron M. Ucko at 2020-07-13T22:12:51-04:00
New upstream version 13.7.20200713+dfsg
- - - - -
8582cc64 by Aaron M. Ucko at 2020-07-13T22:13:42-04:00
Merge tag 'upstream/13.7.20200713+dfsg'
Upstream version 13.7.20200713(+dfsg).
- - - - -
b2e48ab9 by Aaron M. Ucko at 2020-07-14T21:39:17-04:00
d/rules: Install accn-at-a-time and g2x; update gbf2xml's handling.
Account for new as-is script accn-at-a-time, new Go executable g2x,
and gbf2xml's switch from a Perl script to an as-is script (wrapping
xtract).
- - - - -
6e84fc59 by Aaron M. Ucko at 2020-07-14T22:07:30-04:00
debian/rules: Belatedly arrange to tweak enquire.
Its --ca* options won't do here.
- - - - -
984a1b06 by Aaron M. Ucko at 2020-07-14T22:15:47-04:00
debian/man: Update for new upstream release (13.7.20200713[+dfsg]).
* accn-at-a-time.1, g2x.1: Document new commands.
* gbf2xml.1: Turn into an alias for xtract.1.
* xtract.1: Document -g2x flag (under Data Conversion) and
corresponding gbf2xml executable.
* Update SEE ALSO references accordingly.
- - - - -
3ff79da3 by Aaron M. Ucko at 2020-07-14T22:17:16-04:00
Finalize ncbi-entrez-direct 13.7.20200713+dfsg-1 for unstable.
- - - - -
15 changed files:
- + accn-at-a-time
- debian/changelog
- + debian/man/accn-at-a-time.1
- + debian/man/g2x.1
- debian/man/gbf2xml.1
- debian/man/word-at-a-time.1
- debian/man/xtract.1
- debian/rules
- enquire
- + g2x.go
- gbf2xml
- hlp-xtract.txt
- sort-uniq-count
- sort-uniq-count-rank
- xtract.go
Changes:
=====================================
accn-at-a-time
=====================================
@@ -0,0 +1,4 @@
+#!/bin/bash -norc
+sed 's/[^a-zA-Z0-9_.]/ /g; s/^ *//' |
+tr 'A-Z' 'a-z' |
+fmt -w 1
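The four-line pipeline above is self-contained and easy to spot-check; a minimal sketch that inlines the same sed/tr/fmt stages (rather than assuming the installed accn-at-a-time script) on a made-up input line:

```shell
# Inline the accn-at-a-time pipeline: turn every character other than
# letters, digits, '_' and '.' into a space, strip leading spaces,
# lowercase, and let fmt emit one token per line.
printf 'Accessions: NM_000518.5 and U49845\n' |
sed 's/[^a-zA-Z0-9_.]/ /g; s/^ *//' |
tr 'A-Z' 'a-z' |
fmt -w 1
# → accessions
#   nm_000518.5
#   and
#   u49845
```

The `fmt -w 1` trick relies on fmt never breaking a word, so each whitespace-separated token lands on its own line; the same idiom is used by word-at-a-time.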
=====================================
debian/changelog
=====================================
@@ -1,3 +1,15 @@
+ncbi-entrez-direct (13.7.20200713+dfsg-1) unstable; urgency=medium
+
+ * New upstream release.
+ * debian/man/{accn-at-a-time,g2x}.1: Document new commands.
+ * debian/man/{gbf2xml,word-at-a-time,xtract}.1: Update for new release.
+ * debian/rules:
+ - Account for new as-is script accn-at-a-time, new Go executable g2x, and
+ gbf2xml's switch from a Perl script to an as-is script (wrapping xtract).
+ - Belatedly arrange to tweak enquire, whose --ca* options won't do here.
+
+ -- Aaron M. Ucko <ucko at debian.org> Tue, 14 Jul 2020 22:17:15 -0400
+
ncbi-entrez-direct (13.7.20200615+dfsg-2) unstable; urgency=medium
* debian/rules: Install ftp-cp, ftp-ls, nquire, and transmute as simple
=====================================
debian/man/accn-at-a-time.1
=====================================
@@ -0,0 +1,13 @@
+.TH ACCN-AT-A-TIME 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
+.SH NAME
+accn\-at\-a\-time \- parse an input file into biological identifiers
+.SH SYNOPSIS
+.B accn\-at\-a\-time
+.SH DESCRIPTION
+\fBaccn\-at\-a\-time\fP reads a text file on standard input,
+extracts all sequences of
+digits, English letters, underscores, and periods,
+lowercases any capital letters,
+and prints the resulting terms in order, one per line.
+.SH SEE ALSO
+.BR word\-at\-a\-time (1).
=====================================
debian/man/g2x.1
=====================================
@@ -0,0 +1,21 @@
+.TH G2X 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
+.SH NAME
+g2x \- Convert GenBank flatfiles to INSDSeq XML
+.SH SYNOPSIS
+.B g2x
+[\|\fB\-input\fP\ \fIfilename\fP\|]
+[\|\fB\-output\fP\ \fIfilename\fP\|]
+.SH DESCRIPTION
+\fBg2x\fP reads the specified GenBank flatfile
+and writes a corresponding INSDSeq XML document.
+.SH OPTIONS
+.TP
+\fB\-input\fP\ \fIfilename\fP
+Read GenBank format from file instead of stdin.
+.TP
+\fB\-output\fP\ \fIfilename\fP
+Write INSDSeq XML to file instead of stdout.
+.SH SEE ALSO
+.BR asn2ff (1),
+.BR asn2gb (1),
+.BR gbf2xml (1).
=====================================
debian/man/gbf2xml.1
=====================================
@@ -1,21 +1 @@
-.TH GBF2XML 1 2017-07-05 NCBI "NCBI Entrez Direct User's Manual"
-.SH NAME
-gbf2xml \- Convert GenBank flatfiles to INSDSeq XML
-.SH SYNOPSIS
-.B gbf2xml
-[\|\fIfilename\fP\|]
-.SH DESCRIPTION
-\fBgbf2xml\fP reads the specified GenBank flatfile
-(or from standard input when not passed a filename)
-and writes a corresponding INSDSeq XML document
-to standard output.
-.SH BUGS
-Feature intervals that refer to 'far' locations, i.e., those not within
-the cited record and which have an accession and colon, are suppressed.
-Those rare features (e.g., trans\-splicing between molecules) are lost.
-
-Keywords and References are currently not supported.
-.SH SEE ALSO
-.BR asn2ff (1),
-.BR asn2gb (1),
-.BR xtract (1).
+.so xtract.1
=====================================
debian/man/word-at-a-time.1
=====================================
@@ -1,4 +1,4 @@
-.TH WORD-AT-A-TIME 1 2017-01-24 NCBI "NCBI Entrez Direct User's Manual"
+.TH WORD-AT-A-TIME 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
.SH NAME
word\-at\-a\-time \- parse an input file into alphanumeric words
.SH SYNOPSIS
@@ -9,4 +9,5 @@ extracts all sequences of digits and English letters,
lowercases any capital letters,
and prints the resulting terms in order, one per line.
.SH SEE ALSO
+.BR accn\-at\-a\-time (1),
.BR join\-into\-groups\-of (1).
=====================================
debian/man/xtract.1
=====================================
@@ -1,6 +1,6 @@
-.TH XTRACT 1 2020-06-28 NCBI "NCBI Entrez Direct User's Manual"
+.TH XTRACT 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
.SH NAME
-xtract \- convert XML into a table of data values
+gbf2xml, xtract \- NCBI Entrez Direct XML conversion and transformation tool
.SH SYNOPSIS
\fBxtract\fP
[\|\fB\-help\fP\|]
@@ -125,12 +125,18 @@ xtract \- convert XML into a table of data values
[\|\fB\-t2x\fP [\|\fB\-set\fP\ \fItag\fP\|] [\|\fB\-rec\fP\ \fItag\fP\|] \
[\|\fB\-skip\fP\ \fIN\fP\|] [\|\fB\-lower\fP|\fB\-upper\fP\|] \
[\|\fB\-indent\fP|\fB\-flush\fP\|] \fIcolumnName1\fP\ ...\|]
+[\|\fB\-g2x\fP\|]
[\|\fB\-examples\fP\|]
[\|\fB\-version\fP\|]
+
+\fBgbf2xml\fP\ ...
.SH DESCRIPTION
\fBxtract\fP converts an XML document
into a table of data values
according to user\-specified rules.
+
+\fBgbf2xml\fP converts from GenBank flatfile format to INSDSeq XML,
+and is equivalent to \fBxtract \-g2x\fP.
.SH OPTIONS
.SS Processing Flags
.TP
@@ -716,6 +722,9 @@ Do not indent XML output.
XML object names per column.
.RE
.PD
+.TP
+\fB\-g2x\fP
+Convert GenBank flatfile format to INSDSeq XML.
.SS Documentation
.TP
\fB\-help\fP
=====================================
debian/rules
=====================================
@@ -17,22 +17,25 @@ MODES = address blast citmatch contact filter link notify post proxy search \
STD_WRAPPERS = $(MODES:%=bin/e%)
OTHER_WRAPPERS = esummary ftp-cp ftp-ls nquire transmute
WRAPPERS = $(STD_WRAPPERS) $(OTHER_WRAPPERS:%=bin/%)
-AS_IS_SCRIPTS = amino-acid-composition archive-pubmed between-two-genes \
- enquire entrez-phrase-search esample exclude-uid-lists \
- expand-current fetch-extras fetch-pubmed filter-stop-words \
- index-pubmed intersect-uid-lists join-into-groups-of pm-* \
- protein-neighbors reorder-columns sort-uniq-count* \
- stream-pubmed theme-aliases word-at-a-time xml2tbl xy-plot
+AS_IS_SCRIPTS = accn-at-a-time amino-acid-composition archive-pubmed \
+ between-two-genes entrez-phrase-search esample \
+ exclude-uid-lists expand-current fetch-extras fetch-pubmed \
+ filter-stop-words gbf2xml index-pubmed intersect-uid-lists \
+ join-into-groups-of pm-* protein-neighbors reorder-columns \
+ sort-uniq-count* stream-pubmed theme-aliases word-at-a-time \
+ xml2tbl xy-plot
BASH_SCRIPTS = index-extras index-themes
BIN_BASH_SCRIPTS = $(BASH_SCRIPTS:%=bin/%)
DL_SCRIPTS = download-ncbi-data download-pubmed download-sequence
BIN_DL_SCRIPTS = $(DL_SCRIPTS:%=bin/%)
-PERL_SCRIPTS = edirutil gbf2xml run-ncbi-converter
+PERL_SCRIPTS = edirutil run-ncbi-converter
BIN_PERL_SCRIPTS = $(PERL_SCRIPTS:%=bin/%)
# Only bt-link and xplore need this treatment at present,
# but list all bt-* scripts to be safe.
BT_SCRIPTS = bt-link bt-load bt-save bt-srch xplore
BIN_BT_SCRIPTS = $(BT_SCRIPTS:%=bin/%)
+OTHER_SCRIPTS = enquire
+BIN_OTHER_SCRIPTS = $(OTHER_SCRIPTS:%=bin/%)
FIX_PERL_SHEBANG = 1s,^\#!/usr/bin/env perl$$,\#!/usr/bin/perl,
FIX_BASH_SHEBANG = 1s,^\#!/bin/sh,\#!/bin/bash,
@@ -50,7 +53,7 @@ export GOCACHE = $(CURDIR)/go-build
export GOPATH = $(CURDIR)/obj-$(DEB_HOST_GNU_TYPE)
GOLIBS = $(GOLIBSRC:$(GOCODE)/%=$(GOPATH)/%)
-GO_APPS = j2x rchive t2x xtract
+GO_APPS = g2x j2x rchive t2x xtract
BIN_GO_APPS = $(GO_APPS:%=bin/%)
GOVERSION := $(word 3,$(shell go version)) # go version **goX.Y.Z** OS/CPU
@@ -77,6 +80,11 @@ $(STD_WRAPPERS): bin/e%: bin/edirect
echo 'exec /usr/bin/edirect -$* "$$@"' >> $@
chmod +x $@
+bin/enquire: enquire
+ mkdir -p bin
+ sed -e 's/ --ca.*\.pem//' $< > $@
+ chmod +x $@
+
bin/esummary: bin/edirect
echo '#!/bin/sh' > $@
echo 'exec /usr/bin/edirect -fetch -format docsum "$$@"' >> $@
@@ -127,6 +135,7 @@ $(GOPATH)/src/$(GH)/fiam/gounidecode: $(GOLIBS)
ln -s ../rainycape $(GOPATH)/src/$(GH)/fiam/gounidecode
COMMON = common.go
+bin/g2x: COMMON =
bin/j2x: COMMON =
bin/t2x: COMMON =
$(BIN_GO_APPS): bin/%: %.go $(GOPATH)/src/$(GH)/fiam/gounidecode
@@ -139,7 +148,8 @@ override_dh_auto_configure:
-mv go.mod go.sum s2p.go saved/
override_dh_auto_build: $(WRAPPERS) $(BIN_BASH_SCRIPTS) $(BIN_DL_SCRIPTS) \
- $(BIN_PERL_SCRIPTS) $(BIN_BT_SCRIPTS) $(BIN_GO_APPS)
+ $(BIN_PERL_SCRIPTS) $(BIN_BT_SCRIPTS) \
+ $(BIN_OTHER_SCRIPTS) $(BIN_GO_APPS)
dh_auto_build
install $(AS_IS_SCRIPTS) debian/efetch debian/einfo bin/
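The new bin/enquire rule rewrites the script with a one-line sed that drops any ` --ca...pem` curl certificate option. A rough illustration of that substitution on a hypothetical command line (the actual option string inside enquire may differ):

```shell
# Hypothetical curl invocation as it might appear in enquire;
# the sed expression from d/rules removes ' --ca' through '.pem'.
echo 'curl -fsSL --cacert "$pth"/cacert.pem "$url"' |
sed -e 's/ --ca.*\.pem//'
# → curl -fsSL "$url"
```

Note the pattern is greedy, so a line with several `.pem` mentions would lose everything up to the last one; that is presumably acceptable for the single-option case being targeted here.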
=====================================
enquire
=====================================
@@ -222,10 +222,60 @@ do
esac
done
-# get extraction method
+# subset of perl -MURI::Escape -ne 'chomp;print uri_escape($_),"\n"'
+
+Escape() {
+ echo "$1" |
+ sed -e "s/%/%25/g" \
+ -e "s/!/%21/g" \
+ -e "s/#/%23/g" \
+ -e "s/&/%26/g" \
+ -e "s/'/%27/g" \
+ -e "s/*/%2A/g" \
+ -e "s/+/%2B/g" \
+ -e "s/,/%2C/g" \
+ -e "s|/|%2F|g" \
+ -e "s/:/%3A/g" \
+ -e "s/;/%3B/g" \
+ -e "s/=/%3D/g" \
+ -e "s/?/%3F/g" \
+ -e "s/@/%40/g" \
+ -e "s/|/%7C/g" \
+ -e "s/ /%20/g" |
+ sed -e 's/\$/%24/g' \
+ -e 's/(/%28/g' \
+ -e 's/)/%29/g' \
+ -e 's/</%3C/g' \
+ -e 's/>/%3E/g' \
+ -e 's/\[/%5B/g' \
+ -e 's/\]/%5D/g' \
+ -e 's/\^/%5E/g' \
+ -e 's/{/%7B/g' \
+ -e 's/}/%7D/g'
+}
+
+# initialize variables
mode=""
+url=""
+sls=""
+
+arg=""
+amp=""
+cmd=""
+pfx=""
+
+# optionally include nextra.sh script, if present, for internal NCBI maintenance functions (undocumented)
+# dot command is equivalent of "source"
+
+if [ -f "$pth"/nextra.sh ]
+then
+ . "$pth"/nextra.sh
+fi
+
+# get extraction method
+
if [ $# -gt 0 ]
then
case "$1" in
@@ -247,9 +297,6 @@ fi
# collect URL directory components
-url=""
-sls=""
-
while [ $# -gt 0 ]
do
case "$1" in
@@ -268,45 +315,8 @@ do
esac
done
-# subset of perl -MURI::Escape -ne 'chomp;print uri_escape($_),"\n"'
-
-Escape() {
- echo "$1" |
- sed -e "s/%/%25/g" \
- -e "s/!/%21/g" \
- -e "s/#/%23/g" \
- -e "s/&/%26/g" \
- -e "s/'/%27/g" \
- -e "s/*/%2A/g" \
- -e "s/+/%2B/g" \
- -e "s/,/%2C/g" \
- -e "s|/|%2F|g" \
- -e "s/:/%3A/g" \
- -e "s/;/%3B/g" \
- -e "s/=/%3D/g" \
- -e "s/?/%3F/g" \
- -e "s/@/%40/g" \
- -e "s/|/%7C/g" \
- -e "s/ /%20/g" |
- sed -e 's/\$/%24/g' \
- -e 's/(/%28/g' \
- -e 's/)/%29/g' \
- -e 's/</%3C/g' \
- -e 's/>/%3E/g' \
- -e 's/\[/%5B/g' \
- -e 's/\]/%5D/g' \
- -e 's/\^/%5E/g' \
- -e 's/{/%7B/g' \
- -e 's/}/%7D/g'
-}
-
# collect argument tags paired with (escaped) values
-arg=""
-amp=""
-cmd=""
-pfx=""
-
while [ $# -gt 0 ]
do
case "$1" in
=====================================
g2x.go
=====================================
@@ -0,0 +1,1125 @@
+// ===========================================================================
+//
+// PUBLIC DOMAIN NOTICE
+// National Center for Biotechnology Information (NCBI)
+//
+// This software/database is a "United States Government Work" under the
+// terms of the United States Copyright Act. It was written as part of
+// the author's official duties as a United States Government employee and
+// thus cannot be copyrighted. This software/database is freely available
+// to the public for use. The National Library of Medicine and the U.S.
+// Government do not place any restriction on its use or reproduction.
+// We would, however, appreciate having the NCBI and the author cited in
+// any work or product based on this material.
+//
+// Although all reasonable efforts have been taken to ensure the accuracy
+// and reliability of the software and data, the NLM and the U.S.
+// Government do not and cannot warrant the performance or results that
+// may be obtained by using this software or data. The NLM and the U.S.
+// Government disclaim all warranties, express or implied, including
+// warranties of performance, merchantability or fitness for any particular
+// purpose.
+//
+// ===========================================================================
+//
+// File Name: g2x.go
+//
+// Author: Jonathan Kans
+//
+// ==========================================================================
+
+/*
+ Compile application by running:
+
+ go build g2x.go
+*/
+
+package main
+
+import (
+ "bufio"
+ "fmt"
+ "html"
+ "io"
+ "os"
+ "runtime"
+ "runtime/debug"
+ "strings"
+ "time"
+ "unicode"
+)
+
+const g2xHelp = `
+Data Files
+
+ -input Read GenBank format from file instead of stdin
+ -output Write INSDSeq XML to file instead of stdout
+
+`
+
+// global variables, initialized (recursively) to "zero value" of type
+var (
+ ByteCount int
+ ChanDepth int
+ InBlank [256]bool
+ InElement [256]bool
+)
+
+// init function(s) run after creation of variables, before main function
+func init() {
+
+ // set communication channel buffer size
+ ChanDepth = 16
+
+ // range iterates over all elements of slice
+ for i := range InBlank {
+ // (would already have been zeroed at creation in this case)
+ InBlank[i] = false
+ }
+ InBlank[' '] = true
+ InBlank['\t'] = true
+ InBlank['\n'] = true
+ InBlank['\r'] = true
+ InBlank['\f'] = true
+
+ for i := range InElement {
+ // (would already have been zeroed at creation in this case)
+ InElement[i] = false
+ }
+ for ch := 'A'; ch <= 'Z'; ch++ {
+ InElement[ch] = true
+ }
+ for ch := 'a'; ch <= 'z'; ch++ {
+ InElement[ch] = true
+ }
+ for ch := '0'; ch <= '9'; ch++ {
+ InElement[ch] = true
+ }
+ InElement['_'] = true
+ InElement['-'] = true
+ InElement['.'] = true
+ InElement[':'] = true
+}
+
+func CompressRunsOfSpaces(str string) string {
+
+ whiteSpace := false
+ var buffer strings.Builder
+
+ for _, ch := range str {
+ if ch < 127 && InBlank[ch] {
+ if !whiteSpace {
+ buffer.WriteRune(' ')
+ }
+ whiteSpace = true
+ } else {
+ buffer.WriteRune(ch)
+ whiteSpace = false
+ }
+ }
+
+ return buffer.String()
+}
+
+func IsAllDigits(str string) bool {
+
+ for _, ch := range str {
+ if !unicode.IsDigit(ch) {
+ return false
+ }
+ }
+
+ return true
+}
+
+// GenBankConverter sends INSDSeq XML records down a channel
+func GenBankConverter(inp io.Reader) <-chan string {
+
+ if inp == nil {
+ return nil
+ }
+
+ out := make(chan string, ChanDepth)
+ if out == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank converter channel\n")
+ os.Exit(1)
+ }
+
+ const twelvespaces = " "
+ const twentyonespaces = " "
+
+ var rec strings.Builder
+ var con strings.Builder
+ var seq strings.Builder
+
+ scanr := bufio.NewScanner(inp)
+
+ convertGenBank := func(inp io.Reader, out chan<- string) {
+
+ // close channel when all records have been sent
+ defer close(out)
+
+ row := 0
+
+ nextLine := func() string {
+
+ for scanr.Scan() {
+ line := scanr.Text()
+ if line == "" {
+ continue
+ }
+ return line
+ }
+ return ""
+
+ }
+
+ for {
+
+ rec.Reset()
+
+ // read first line of next record
+ line := nextLine()
+ if line == "" {
+ break
+ }
+
+ row++
+
+ for {
+ if !strings.HasPrefix(line, "LOCUS") {
+ // skip release file header information
+ line = nextLine()
+ row++
+ continue
+ }
+ break
+ }
+
+ readContinuationLines := func(str string) string {
+
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation line, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt := strings.TrimPrefix(line, twelvespaces)
+ str += " " + txt
+ }
+
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+
+ return str
+ }
+
+ writeOneElement := func(spaces, tag, value string) {
+
+ rec.WriteString(spaces)
+ rec.WriteString("<")
+ rec.WriteString(tag)
+ rec.WriteString(">")
+ value = html.EscapeString(value)
+ rec.WriteString(value)
+ rec.WriteString("</")
+ rec.WriteString(tag)
+ rec.WriteString(">\n")
+ }
+
+ // each section will exit with the next line ready to process
+
+ if strings.HasPrefix(line, "LOCUS") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 8 {
+
+ // start of record
+ rec.WriteString(" <INSDSeq>\n")
+
+ moleculetype := cols[4]
+ strandedness := ""
+ if strings.HasPrefix(moleculetype, "ds-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ds-")
+ strandedness = "double"
+ } else if strings.HasPrefix(moleculetype, "ss-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ss-")
+ strandedness = "single"
+ } else if strings.HasPrefix(moleculetype, "ms-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ms-")
+ strandedness = "mixed"
+ } else if strings.HasSuffix(moleculetype, "DNA") {
+ strandedness = "double"
+ } else if strings.HasSuffix(moleculetype, "RNA") {
+ strandedness = "single"
+ }
+
+ writeOneElement(" ", "INSDSeq_locus", cols[1])
+
+ writeOneElement(" ", "INSDSeq_length", cols[2])
+
+ if strandedness != "" {
+ writeOneElement(" ", "INSDSeq_strandedness", strandedness)
+ }
+
+ writeOneElement(" ", "INSDSeq_moltype", moleculetype)
+
+ writeOneElement(" ", "INSDSeq_topology", cols[5])
+
+ writeOneElement(" ", "INSDSeq_division", cols[6])
+
+ writeOneElement(" ", "INSDSeq_update-date", cols[7])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+ }
+
+ if strings.HasPrefix(line, "DEFINITION") {
+
+ txt := strings.TrimPrefix(line, "DEFINITION")
+ def := readContinuationLines(txt)
+ def = strings.TrimSuffix(def, ".")
+
+ writeOneElement(" ", "INSDSeq_definition", def)
+ }
+
+ var secondaries []string
+
+ if strings.HasPrefix(line, "ACCESSION") {
+
+ txt := strings.TrimPrefix(line, "ACCESSION")
+ str := readContinuationLines(txt)
+ accessions := strings.Fields(str)
+ ln := len(accessions)
+ if ln > 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ // skip past primary accession, collect secondaries
+ secondaries = accessions[1:]
+
+ } else if ln == 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+ }
+
+ accnver := ""
+ gi := ""
+
+ if strings.HasPrefix(line, "VERSION") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 2 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ } else if len(cols) == 3 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ // collect gi for other-seqids
+ if strings.HasPrefix(cols[2], "GI:") {
+ gi = strings.TrimPrefix(cols[2], "GI:")
+ }
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ if gi != "" {
+
+ rec.WriteString(" <INSDSeq_other-seqids>\n")
+
+ writeOneElement(" ", "INSDSeqid", "gi|"+gi)
+
+ rec.WriteString(" </INSDSeq_other-seqids>\n")
+ }
+
+ if len(secondaries) > 0 {
+
+ rec.WriteString(" <INSDSeq_secondary-accessions>\n")
+
+ for _, secndry := range secondaries {
+
+ writeOneElement(" ", "INSDSecondary-accn", secndry)
+ }
+
+ rec.WriteString(" </INSDSeq_secondary-accessions>\n")
+ }
+
+ if strings.HasPrefix(line, "DBLINK") {
+
+ txt := strings.TrimPrefix(line, "DBLINK")
+ readContinuationLines(txt)
+ // collect for database-reference
+ // out <- Token{DBLINK, dbl}
+ }
+
+ if strings.HasPrefix(line, "KEYWORDS") {
+
+ txt := strings.TrimPrefix(line, "KEYWORDS")
+ key := readContinuationLines(txt)
+ key = strings.TrimSuffix(key, ".")
+
+ if key != "" {
+ rec.WriteString(" <INSDSeq_keywords>\n")
+ kywds := strings.Split(key, ";")
+ for _, kw := range kywds {
+ kw = strings.TrimSpace(kw)
+ if kw == "" || kw == "." {
+ continue
+ }
+
+ writeOneElement(" ", "INSDKeyword", kw)
+ }
+ rec.WriteString(" </INSDSeq_keywords>\n")
+ }
+ }
+
+ if strings.HasPrefix(line, "SOURCE") {
+
+ txt := strings.TrimPrefix(line, "SOURCE")
+ src := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_source", src)
+ }
+
+ if strings.HasPrefix(line, " ORGANISM") {
+
+ org := strings.TrimPrefix(line, " ORGANISM")
+ org = CompressRunsOfSpaces(org)
+ org = strings.TrimSpace(org)
+
+ writeOneElement(" ", "INSDSeq_organism", org)
+
+ line = nextLine()
+ row++
+ if strings.HasPrefix(line, twelvespaces) {
+ txt := strings.TrimPrefix(line, twelvespaces)
+ tax := readContinuationLines(txt)
+ tax = strings.TrimSuffix(tax, ".")
+
+ writeOneElement(" ", "INSDSeq_taxonomy", tax)
+ }
+ }
+
+ rec.WriteString(" <INSDSeq_references>\n")
+ for {
+ if !strings.HasPrefix(line, "REFERENCE") {
+ // exit out of reference section
+ break
+ }
+
+ ref := "0"
+
+ rec.WriteString(" <INSDReference>\n")
+
+ str := strings.TrimPrefix(line, "REFERENCE")
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+ idx := strings.Index(str, "(")
+ if idx > 0 {
+ ref = strings.TrimSpace(str[:idx])
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+
+ posn := str[idx+1:]
+ posn = strings.TrimSuffix(posn, ")")
+ posn = strings.TrimSpace(posn)
+ if posn == "sites" {
+
+ writeOneElement(" ", "INSDReference_position", posn)
+
+ } else {
+ cols := strings.Fields(posn)
+ if len(cols) == 4 && cols[2] == "to" {
+
+ writeOneElement(" ", "INSDReference_position", cols[1]+".."+cols[3])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+ }
+ } else {
+ ref = strings.TrimSpace(str)
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+ }
+ line = nextLine()
+ row++
+
+ if strings.HasPrefix(line, " AUTHORS") {
+
+ txt := strings.TrimPrefix(line, " AUTHORS")
+ auths := readContinuationLines(txt)
+
+ rec.WriteString(" <INSDReference_authors>\n")
+ authors := strings.Split(auths, ", ")
+ for _, auth := range authors {
+ auth = strings.TrimSpace(auth)
+ if auth == "" {
+ continue
+ }
+ pair := strings.Split(auth, " and ")
+ for _, name := range pair {
+
+ writeOneElement(" ", "INSDAuthor", name)
+ }
+ }
+ rec.WriteString(" </INSDReference_authors>\n")
+ }
+
+ if strings.HasPrefix(line, " CONSRTM") {
+
+ txt := strings.TrimPrefix(line, " CONSRTM")
+ cons := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_consortium", cons)
+ }
+
+ if strings.HasPrefix(line, " TITLE") {
+
+ txt := strings.TrimPrefix(line, " TITLE")
+ titl := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_title", titl)
+ }
+
+ if strings.HasPrefix(line, " JOURNAL") {
+
+ txt := strings.TrimPrefix(line, " JOURNAL")
+ jour := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_journal", jour)
+ }
+
+ if strings.HasPrefix(line, " PUBMED") {
+
+ txt := strings.TrimPrefix(line, " PUBMED")
+ pmid := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_pubmed", pmid)
+ }
+
+ if strings.HasPrefix(line, " REMARK") {
+
+ txt := strings.TrimPrefix(line, " REMARK")
+ rem := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_remark", rem)
+ }
+
+ // end of this reference
+ rec.WriteString(" </INSDReference>\n")
+ // continue to next reference
+ }
+ rec.WriteString(" </INSDSeq_references>\n")
+
+ if strings.HasPrefix(line, "COMMENT") {
+
+ txt := strings.TrimPrefix(line, "COMMENT")
+ com := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_comment", com)
+ }
+
+ rec.WriteString(" <INSDSeq_feature-table>\n")
+ if strings.HasPrefix(line, "FEATURES") {
+
+ line = nextLine()
+ row++
+
+ for {
+ if !strings.HasPrefix(line, " ") {
+ // exit out of features section
+ break
+ }
+ if len(line) < 22 {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ line = nextLine()
+ row++
+ continue
+ }
+
+ rec.WriteString(" <INSDFeature>\n")
+
+ // read feature key and start of location
+ fkey := line[5:21]
+ fkey = strings.TrimSpace(fkey)
+
+ writeOneElement(" ", "INSDFeature_key", fkey)
+
+ loc := line[21:]
+ loc = strings.TrimSpace(loc)
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of location, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ loc += strings.TrimSpace(txt)
+ }
+
+ writeOneElement(" ", "INSDFeature_location", loc)
+
+ location_operator := ""
+ is_comp := false
+ prime5 := false
+ prime3 := false
+
+ // parseloc recursive definition
+ var parseloc func(string) []string
+
+ parseloc = func(str string) []string {
+
+ var acc []string
+
+ if strings.HasPrefix(str, "join(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "join"
+
+ str = strings.TrimPrefix(str, "join(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "order(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "order"
+
+ str = strings.TrimPrefix(str, "order(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "complement(") && strings.HasSuffix(str, ")") {
+
+ is_comp = true
+
+ str = strings.TrimPrefix(str, "complement(")
+ str = strings.TrimSuffix(str, ")")
+ items := parseloc(str)
+
+ // reverse items
+ for i, j := 0, len(items)-1; i < j; i, j = i+1, j-1 {
+ items[i], items[j] = items[j], items[i]
+ }
+
+ // reverse from and to positions, flip direction of angle brackets (partial flags)
+ for _, thisloc := range items {
+ pts := strings.Split(thisloc, "..")
+ ln := len(pts)
+ if ln == 2 {
+ fst := pts[0]
+ scd := pts[1]
+ lf := ""
+ rt := ""
+ if strings.HasPrefix(fst, "<") {
+ fst = strings.TrimPrefix(fst, "<")
+ rt = ">"
+ }
+ if strings.HasPrefix(scd, ">") {
+ scd = strings.TrimPrefix(scd, ">")
+ lf = "<"
+ }
+ acc = append(acc, lf+scd+".."+rt+fst)
+ } else if ln > 0 {
+ acc = append(acc, pts[0])
+ }
+ }
+
+ } else {
+
+ // save individual interval or point if no leading accession
+ if strings.Index(str, ":") < 0 {
+ acc = append(acc, str)
+ }
+ }
+
+ return acc
+ }
+
+ items := parseloc(loc)
+
+ rec.WriteString(" <INSDFeature_intervals>\n")
+
+ num_ivals := 0
+
+ // report individual intervals
+ for _, thisloc := range items {
+ if thisloc == "" {
+ continue
+ }
+
+ num_ivals++
+
+ rec.WriteString(" <INSDInterval>\n")
+ pts := strings.Split(thisloc, "..")
+ if len(pts) == 2 {
+
+ // fr..to
+ fr := pts[0]
+ to := pts[1]
+ if strings.HasPrefix(fr, "<") {
+ fr = strings.TrimPrefix(fr, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(to, ">") {
+ to = strings.TrimPrefix(to, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ crt := strings.Split(thisloc, "^")
+ if len(crt) == 2 {
+
+ // fr^to
+ fr := crt[0]
+ to := crt[1]
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ rec.WriteString(" <INSDInterval_interbp value=\"true\"/>\n")
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ // pt
+ pt := pts[0]
+ if strings.HasPrefix(pt, "<") {
+ pt = strings.TrimPrefix(pt, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(pt, ">") {
+ pt = strings.TrimPrefix(pt, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_point", pt)
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+ }
+ }
+ rec.WriteString(" </INSDInterval>\n")
+ }
+
+ rec.WriteString(" </INSDFeature_intervals>\n")
+
+ if num_ivals > 1 {
+ writeOneElement(" ", "INSDFeature_operator", location_operator)
+ }
+ if prime5 {
+ rec.WriteString(" <INSDFeature_partial5 value=\"true\"/>\n")
+ }
+ if prime3 {
+ rec.WriteString(" <INSDFeature_partial3 value=\"true\"/>\n")
+ }
+
+ hasQual := false
+ for {
+ if !strings.HasPrefix(line, twentyonespaces) {
+ // if not qualifier line, break out of loop
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ qual := ""
+ val := ""
+ if strings.HasPrefix(txt, "/") {
+ if !hasQual {
+ hasQual = true
+ rec.WriteString(" <INSDFeature_quals>\n")
+ }
+ // read new qualifier and start of value
+ qual = strings.TrimPrefix(txt, "/")
+ qual = strings.TrimSpace(qual)
+ idx := strings.Index(qual, "=")
+ if idx > 0 {
+ val = qual[idx+1:]
+ qual = qual[:idx]
+ }
+
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of qualifier, break out of loop
+ break
+ }
+ // append subsequent line to value and continue with loop
+ if qual == "transcription" || qual == "translation" || qual == "peptide" || qual == "anticodon" {
+ val += strings.TrimSpace(txt)
+ } else {
+ val += " " + strings.TrimSpace(txt)
+ }
+ }
+
+ rec.WriteString(" <INSDQualifier>\n")
+
+ writeOneElement(" ", "INSDQualifier_name", qual)
+
+ val = strings.TrimPrefix(val, "\"")
+ val = strings.TrimSuffix(val, "\"")
+ val = strings.TrimSpace(val)
+ if val != "" {
+
+ writeOneElement(" ", "INSDQualifier_value", val)
+ }
+
+ rec.WriteString(" </INSDQualifier>\n")
+ }
+ }
+ if hasQual {
+ rec.WriteString(" </INSDFeature_quals>\n")
+ }
+
+ // end of this feature
+ rec.WriteString(" </INSDFeature>\n")
+ // continue to next feature
+ }
+ }
+ rec.WriteString(" </INSDSeq_feature-table>\n")
+
+ if strings.HasPrefix(line, "CONTIG") {
+
+ // pathological records can have over 90,000 components, use strings.Builder
+ con.Reset()
+
+ txt := strings.TrimPrefix(line, "CONTIG")
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation of contig, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt = strings.TrimPrefix(line, twelvespaces)
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ }
+ }
+
+ if strings.HasPrefix(line, "BASE COUNT") {
+
+ txt := strings.TrimPrefix(line, "BASE COUNT")
+ readContinuationLines(txt)
+ // not supported
+ }
+
+ if strings.HasPrefix(line, "ORIGIN") {
+
+ line = nextLine()
+ row++
+ }
+
+ // remainder should be sequence
+
+ // sequence can be millions of bases, use strings.Builder
+ seq.Reset()
+
+ for line != "" {
+
+ if strings.HasPrefix(line, "//") {
+
+ // end of record, print collected sequence
+ str := seq.String()
+ if str != "" {
+
+ writeOneElement(" ", "INSDSeq_sequence", str)
+ }
+ seq.Reset()
+
+ // print contig section
+ str = con.String()
+ str = strings.TrimSpace(str)
+ if str != "" {
+ writeOneElement(" ", "INSDSeq_contig", str)
+ }
+ con.Reset()
+
+ // end of record
+ rec.WriteString(" </INSDSeq>\n")
+
+ // send formatted record down channel
+ txt := rec.String()
+ out <- txt
+ rec.Reset()
+ // go to top of loop for next record
+ break
+ }
+
+ // read next sequence line
+
+ cols := strings.Fields(line)
+ for _, str := range cols {
+
+ if IsAllDigits(str) {
+ continue
+ }
+
+ // append letters to sequence
+ seq.WriteString(str)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ // continue to next record
+ }
+ }
+
+ // launch single converter goroutine
+ go convertGenBank(inp, out)
+
+ return out
+}
+
+func main() {
+
+ // skip past executable name
+ args := os.Args[1:]
+
+ goOn := true
+
+ timr := false
+
+ infile := ""
+ outfile := ""
+
+ for len(args) > 0 && goOn {
+ str := args[0]
+ switch str {
+ case "-help":
+ fmt.Printf("g2x\n%s\n", g2xHelp)
+ return
+ case "-i", "-input":
+ // read data from file instead of stdin
+ args = args[1:]
+ if len(args) < 1 {
+ fmt.Fprintf(os.Stderr, "Input file name is missing\n")
+ os.Exit(1)
+ }
+ infile = args[0]
+ if infile == "-" {
+ infile = ""
+ }
+ args = args[1:]
+ case "-o", "-output":
+ // write data to file instead of stdout
+ args = args[1:]
+ if len(args) < 1 {
+ fmt.Fprintf(os.Stderr, "Output file name is missing\n")
+ os.Exit(1)
+ }
+ outfile = args[0]
+ if outfile == "-" {
+ outfile = ""
+ }
+ args = args[1:]
+ case "-timer":
+ timr = true
+ args = args[1:]
+ default:
+ goOn = false
+ }
+ }
+
+ in := os.Stdin
+
+ isPipe := false
+ fi, err := os.Stdin.Stat()
+ if err == nil {
+ // check for data being piped into stdin
+ isPipe = bool((fi.Mode() & os.ModeNamedPipe) != 0)
+ }
+
+ usingFile := false
+ if infile != "" {
+
+ fl, err := os.Open(infile)
+ if err != nil {
+ fmt.Fprintf(os.Stderr, "%s\n", err.Error())
+ os.Exit(1)
+ }
+
+ defer fl.Close()
+
+ // use indicated file instead of stdin
+ in = fl
+ usingFile = true
+
+ if isPipe && runtime.GOOS != "windows" {
+ mode := fi.Mode().String()
+ fmt.Fprintf(os.Stderr, "Input data from both stdin and file '%s', mode is '%s'\n", infile, mode)
+ os.Exit(1)
+ }
+ }
+
+ if !usingFile && !isPipe {
+ fmt.Fprintf(os.Stderr, "No input data supplied\n")
+ os.Exit(1)
+ }
+
+ op := os.Stdout
+
+ if outfile != "" {
+
+ fl, err := os.Create(outfile)
+ if err != nil {
+ fmt.Fprintf(os.Stderr, "%s\n", err.Error())
+ os.Exit(1)
+ }
+
+ defer fl.Close()
+
+ // use indicated file instead of stdout
+ op = fl
+ }
+
+ // initialize process timer
+ startTime := time.Now()
+ recordCount := 0
+ byteCount := 0
+
+ // print processing rate and program duration
+ printDuration := func(name string) {
+
+ stopTime := time.Now()
+ duration := stopTime.Sub(startTime)
+ seconds := float64(duration.Nanoseconds()) / 1e9
+
+ if recordCount >= 1000000 {
+ throughput := float64(recordCount/100000) / 10.0
+ fmt.Fprintf(os.Stderr, "\nXtract processed %.1f million %s in %.3f seconds", throughput, name, seconds)
+ } else {
+ fmt.Fprintf(os.Stderr, "\nXtract processed %d %s in %.3f seconds", recordCount, name, seconds)
+ }
+
+ if seconds >= 0.001 && recordCount > 0 {
+ rate := int(float64(recordCount) / seconds)
+ if rate >= 1000000 {
+ fmt.Fprintf(os.Stderr, " (%d million %s/second", rate/1000000, name)
+ } else {
+ fmt.Fprintf(os.Stderr, " (%d %s/second", rate, name)
+ }
+ if byteCount > 0 {
+ rate := int(float64(byteCount) / seconds)
+ if rate >= 1000000 {
+ fmt.Fprintf(os.Stderr, ", %d megabytes/second", rate/1000000)
+ } else if rate >= 1000 {
+ fmt.Fprintf(os.Stderr, ", %d kilobytes/second", rate/1000)
+ } else {
+ fmt.Fprintf(os.Stderr, ", %d bytes/second", rate)
+ }
+ }
+ fmt.Fprintf(os.Stderr, ")")
+ }
+
+ fmt.Fprintf(os.Stderr, "\n\n")
+ }
+
+ gbk := GenBankConverter(in)
+
+ if gbk == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank to XML converter\n")
+ os.Exit(1)
+ }
+
+ head := `<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
+<INSDSet>
+`
+ tail := ""
+
+ // drain output of last channel in service chain
+ // runtime assigns concurrent goroutines to execute on separate CPUs for maximum speed
+ for str := range gbk {
+
+ if str == "" {
+ continue
+ }
+
+ if head != "" {
+ op.WriteString(head)
+ head = ""
+ tail = `</INSDSet>
+`
+ }
+
+ // send result to stdout
+ op.WriteString(str)
+ if !strings.HasSuffix(str, "\n") {
+ op.WriteString("\n")
+ }
+
+ recordCount++
+
+ runtime.Gosched()
+ }
+
+ if tail != "" {
+ op.WriteString(tail)
+ }
+
+ // explicitly freeing memory before exit is useful for finding leaks when profiling
+ debug.FreeOSMemory()
+
+ if timr {
+ printDuration("records")
+ }
+}
=====================================
gbf2xml
=====================================
@@ -1,554 +1,3 @@
-#!/usr/bin/env perl
-
-# ===========================================================================
-#
-# PUBLIC DOMAIN NOTICE
-# National Center for Biotechnology Information (NCBI)
-#
-# This software/database is a "United States Government Work" under the
-# terms of the United States Copyright Act. It was written as part of
-# the author's official duties as a United States Government employee and
-# thus cannot be copyrighted. This software/database is freely available
-# to the public for use. The National Library of Medicine and the U.S.
-# Government do not place any restriction on its use or reproduction.
-# We would, however, appreciate having the NCBI and the author cited in
-# any work or product based on this material.
-#
-# Although all reasonable efforts have been taken to ensure the accuracy
-# and reliability of the software and data, the NLM and the U.S.
-# Government do not and cannot warrant the performance or results that
-# may be obtained by using this software or data. The NLM and the U.S.
-# Government disclaim all warranties, express or implied, including
-# warranties of performance, merchantability or fitness for any particular
-# purpose.
-#
-# ===========================================================================
-#
-# File Name: gbf2xml
-#
-# Author: Jonathan Kans
-#
-# Version Creation Date: 6/8/17
-#
-# ==========================================================================
-
-use strict;
-use warnings;
-
-
-# Script to convert GenBank flatfiles to INSDSeq XML.
-#
-# Feature intervals that refer to 'far' locations, i.e., those not within
-# the cited record and which have an accession and colon, are suppressed.
-# Those rare features (e.g., trans-splicing between molecules) are lost.
-#
-# Keywords and References are currently not supported.
-
-
-# definitions
-
-use constant false => 0;
-use constant true => 1;
-
-# state variables for tracking current position in flatfile
-
-my $in_seq;
-my $in_con;
-my $in_feat;
-my $in_key;
-my $in_qual;
-my $in_def;
-my $in_tax;
-my $any_feat;
-my $any_qual;
-my $no_space;
-my $is_comp;
-my $current_key;
-my $current_loc;
-my $current_qual;
-my $current_val;
-my $moltype;
-my $division;
-my $update_date;
-my $organism;
-my $source;
-my $taxonomy;
-my $topology;
-my $sequence;
-my $length;
-my $curr_seq;
-my $locus;
-my $defline;
-my $accn;
-my $accndv;
-my $location_operator;
-
-# subroutine to clear state variables for each flatfile
-# start in in_feat state to gracefully handle missing FEATURES/FH line
-
-sub clearflags {
- $in_seq = false;
- $in_con = false;
- $in_feat = false;
- $in_key = false;
- $in_qual = false;
- $in_def = false;
- $in_tax = false;
- $any_feat = false;
- $any_qual = false;
- $no_space = false;
- $is_comp = false;
- $current_key = "";
- $current_loc = "";
- $current_qual = "";
- $current_val = "";
- $moltype = "";
- $division = "";
- $update_date = "";
- $organism = "";
- $source = "";
- $taxonomy = "";
- $topology = "";
- $sequence = "";
- $length = 0;
- $curr_seq = "";
- $locus = "";
- $defline = "";
- $accn = "";
- $accndv = "";
- $location_operator = "";
-}
-
-# recursive subroutine for parsing flatfile representation of feature location
-
-sub parseloc {
- my $subloc = shift (@_);
- my @working = ();
-
- if ( $subloc =~ /^(join|order)\((.+)\)$/ ) {
- $location_operator = $1;
- my $temploc = $2;
- my @items = split (',', $temploc);
- foreach my $thisloc (@items ) {
- if ( $thisloc !~ /^.*:.*$/ ) {
- push (@working, parseloc ($thisloc));
- }
- }
-
- } elsif ( $subloc =~ /^complement\((.+)\)$/ ) {
- $is_comp = true;
- my $comploc = $1;
- my @items = parseloc ($comploc);
- my @rev = reverse (@items);
- foreach my $thisloc (@rev ) {
- if ( $thisloc =~ /^([^.]+)\.\.([^.]+)$/ ) {
- $thisloc = "$2..$1";
- }
-
- if ( $thisloc =~ /^>([^.]+)\.\.([^.]+)$/ ) {
- $thisloc = "<$1..$2";
- }
- if ( $thisloc =~ /^([^.]+)\.\.<([^.]+)$/ ) {
- $thisloc = "$1..>$2";
- }
-
- if ( $thisloc !~ /^.*:.*$/ ) {
- push (@working, parseloc ($thisloc));
- }
- }
-
- } elsif ( $subloc !~ /^.*:.*$/ ) {
- push (@working, $subloc);
- }
-
- return @working;
-}
-
-#subroutine to print next feature key / location / qualifier line
-
-sub flushline {
- if ( $in_key ) {
-
- if ( $any_qual ) {
- print " </INSDFeature_quals>\n";
- $any_qual = false;
- }
-
- if ( $any_feat ) {
- print " </INSDFeature>\n";
- }
- $any_feat = true;
-
- print " <INSDFeature>\n";
-
- #print feature key and intervals
- print " <INSDFeature_key>$current_key</INSDFeature_key>\n";
-
- my $clean_loc = $current_loc;
- $clean_loc =~ s/</&lt;/g;
- $clean_loc =~ s/>/&gt;/g;
- print " <INSDFeature_location>$clean_loc</INSDFeature_location>\n";
-
- print " <INSDFeature_intervals>\n";
-
- # parse join() order() complement() ###..### location
- $location_operator = 0;
- $is_comp = false;
- my @theloc = parseloc ($current_loc);
-
- # convert number (dot) (dot) number to number (tab) number
- my $numivals = 0;
- my $prime5 = false;
- my $prime3 = false;
- foreach my $thisloc (@theloc ) {
- $numivals++;
- print " <INSDInterval>\n";
- if ( $thisloc =~ /^([^.]+)\.\.([^.]+)$/ ) {
- my $fr = $1;
- my $to = $2;
- if ( $thisloc =~ /^</ ) {
- $prime5 = true;
- }
- if ( $thisloc =~ /\.\.>/ ) {
- $prime3 = true;
- }
- $fr =~ s/[<>]//;
- $to =~ s/[<>]//;
- print " <INSDInterval_from>$fr</INSDInterval_from>\n";
- print " <INSDInterval_to>$to</INSDInterval_to>\n";
- if ( $is_comp ) {
- print " <INSDInterval_iscomp value=\"true\"/>\n";
- }
- print " <INSDInterval_accession>$accndv</INSDInterval_accession>\n";
- } elsif ( $thisloc =~ /^(.+)\^(.+)$/ ) {
- my $fr = $1;
- my $to = $2;
- $fr =~ s/[<>]//;
- $to =~ s/[<>]//;
- print " <INSDInterval_from>$fr</INSDInterval_from>\n";
- print " <INSDInterval_to>$to</INSDInterval_to>\n";
- if ( $is_comp ) {
- print " <INSDInterval_iscomp value=\"true\"/>\n";
- }
- print " <INSDInterval_interbp value=\"true\"/>\n";
- print " <INSDInterval_accession>$accndv</INSDInterval_accession>\n";
- } elsif ( $thisloc =~ /^([^.]+)$/ ) {
- my $pt = $1;
- $pt =~ s/[<>]//;
- print " <INSDInterval_point>$pt</INSDInterval_point>\n";
- print " <INSDInterval_accession>$accndv</INSDInterval_accession>\n";
- }
- print " </INSDInterval>\n";
- }
-
- print " </INSDFeature_intervals>\n";
-
- if ( $numivals > 1 ) {
- print " <INSDFeature_operator>$location_operator</INSDFeature_operator>\n";
- }
- if ( $prime5 ) {
- print " <INSDFeature_partial5 value=\"true\"/>\n";
- }
- if ( $prime3 ) {
- print " <INSDFeature_partial3 value=\"true\"/>\n";
- }
-
- } elsif ( $in_qual ) {
-
- if ( ! $any_qual ) {
- print " <INSDFeature_quals>\n";
- }
- $any_qual = true;
-
- if ( $current_val eq "" ) {
- print " <INSDQualifier>\n";
- print " <INSDQualifier_name>$current_qual</INSDQualifier_name>\n";
- print " </INSDQualifier>\n";
- } else {
- print " <INSDQualifier>\n";
- print " <INSDQualifier_name>$current_qual</INSDQualifier_name>\n";
- my $clean_val = $current_val;
- $clean_val =~ s/</&lt;/g;
- $clean_val =~ s/>/&gt;/g;
- print " <INSDQualifier_value>$clean_val</INSDQualifier_value>\n";
- print " </INSDQualifier>\n";
- }
- }
-}
-
-# initialize flags and lists at start of program
-
-clearflags ();
-
-print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
-print "<!DOCTYPE INSDSet PUBLIC \"-//NCBI//INSD INSDSeq/EN\" \"https://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd\">\n";
-print "<INSDSet>\n";
-
-# main loop reads one line at a time
-
-while (<> ) {
- chomp;
- $_ =~ s/\r$//;
-
- # first check for extra definition or taxonomy lines, otherwise clear continuation flags
- if ( $in_def ) {
- if ( /^ {12}(.*)$/ ) {
- $defline = $defline . " " . $1;
- } else {
- $in_def = false;
- }
- } elsif ( $in_tax ) {
- if ( /^ {12}(.*)$/ ) {
- if ( $taxonomy eq "" ) {
- $taxonomy = $1;
- } else {
- $taxonomy = $taxonomy . " " . $1;
- }
- } else {
- $in_tax = false;
- }
- }
-
- if ( $in_def || $in_tax ) {
-
- # continuation lines taken care of above
-
- } elsif ( /^LOCUS\s+(\S*).*$/ ) {
-
- # record locus
- $locus = $1;
- if ( / (\d+) bp / || / (\d+) aa / ) {
- $length = $1;
- }
-
- if ( /^.*\s(\S+\s+\S+\s+\S+\s+\d+-\S+-\d+)$/ ) {
- my $tail = $1;
- if ( $tail =~ /^(\S*)\s+(\S*)\s+(\S*)\s+(\d*-\S*-\d*)$/ ) {
- $moltype = $1;
- $topology = $2;
- $division = $3;
- $update_date = $4;
- $moltype = uc $moltype;
- }
- }
-
- print " <INSDSeq>\n";
-
- print " <INSDSeq_locus>$locus</INSDSeq_locus>\n";
- print " <INSDSeq_length>$length</INSDSeq_length>\n";
-
- if ( $moltype ne "" ) {
- print " <INSDSeq_moltype>$moltype</INSDSeq_moltype>\n";
- }
- if ( $topology ne "" ) {
- print " <INSDSeq_topology>$topology</INSDSeq_topology>\n";
- }
- if ( $division ne "" ) {
- print " <INSDSeq_division>$division</INSDSeq_division>\n";
- }
- if ( $update_date ne "" ) {
- print " <INSDSeq_update-date>$update_date</INSDSeq_update-date>\n";
- }
-
- } elsif ( /^DEFINITION\s*(.*).*$/ ) {
-
- # record first line of definition line
- $defline = $1;
- # next line with leading spaces will be continuation of definition line
- $in_def = true;
-
- } elsif ( /^ACCESSION\s*(\S*).*$/ ) {
-
- # record accession
- $accn = $1;
-
- } elsif ( /^VERSION\s*(\S*).*$/ ) {
-
- # record accession.version
- $accndv = $1;
-
- } elsif ( /^SOURCE\s*(.*)$/ ) {
-
- # record source
- $source = $1;
-
- } elsif ( /^ {1,3}ORGANISM\s+(.*)$/ ) {
-
- # record organism
- if ( $organism eq "" ) {
- $organism = $1;
- if ( $organism =~ /^([^(]*) \(.*\)/ ) {
- $organism = $1;
- }
- }
- # next line with leading spaces will be start of taxonomy
- $in_tax = true;
-
- } elsif ( /^FEATURES\s+.*$/ ) {
-
- # beginning of feature table, flags already set up
-
- # first print saved fields
- $defline =~ s/\.$//;
- $defline =~ s/</&lt;/g;
- $defline =~ s/>/&gt;/g;
- if ( $defline ne "" ) {
- print " <INSDSeq_definition>$defline</INSDSeq_definition>\n";
- }
- if ( $accn ne "" ) {
- print " <INSDSeq_primary-accession>$accn</INSDSeq_primary-accession>\n";
- }
- if ( $accndv ne "" ) {
- print " <INSDSeq_accession-version>$accndv</INSDSeq_accession-version>\n";
- }
-
- $in_feat = true;
-
- if ( $source ne "" ) {
- print " <INSDSeq_source>$source</INSDSeq_source>\n";
- }
- if ( $organism ne "" ) {
- print " <INSDSeq_organism>$organism</INSDSeq_organism>\n";
- }
- $taxonomy =~ s/\.$//;
- if ( $taxonomy ne "" ) {
- print " <INSDSeq_taxonomy>$taxonomy</INSDSeq_taxonomy>\n";
- }
-
- print " <INSDSeq_feature-table>\n";
-
- } elsif ( /^ORIGIN\s*.*$/ ) {
-
- # end of feature table, print final newline
- flushline ();
-
- if ( $in_feat ) {
- if ( $any_qual ) {
- print " </INSDFeature_quals>\n";
- $any_qual = false;
- }
-
- print " </INSDFeature>\n";
-
- print " </INSDSeq_feature-table>\n";
- }
-
- $in_feat = false;
- $in_key = false;
- $in_qual = false;
- $no_space = false;
- $in_seq = true;
- $in_con = false;
-
- } elsif ( /^CONTIG\s*.*$/ ) {
-
- # end of feature table, print final newline
- flushline ();
-
- if ( $in_feat ) {
- if ( $any_qual ) {
- print " </INSDFeature_quals>\n";
- $any_qual = false;
- }
-
- print " </INSDFeature>\n";
-
- print " </INSDSeq_feature-table>\n";
- }
-
- $in_feat = false;
- $in_key = false;
- $in_qual = false;
- $no_space = false;
- $in_seq = false;
- $in_con = true;
-
- } elsif ( /^\/\/\.*/ ) {
-
- # at end-of-record double slash
- if ( $sequence ne "" ) {
- print " <INSDSeq_sequence>$sequence</INSDSeq_sequence>\n";
- }
- print " </INSDSeq>\n";
- # reset variables for catenated flatfiles
- clearflags ();
-
- } elsif ( $in_seq ) {
-
- if ( /^\s+\d+ (.*)$/ || /^\s+(.*)\s+\d+$/ ) {
- # record sequence
- $curr_seq = $1;
- $curr_seq =~ s/ //g;
- $curr_seq = lc $curr_seq;
- if ( $sequence eq "" ) {
- $sequence = $curr_seq;
- } else {
- $sequence = $sequence . $curr_seq;
- }
- }
-
- } elsif ( $in_con ) {
-
- } elsif ( $in_feat ) {
-
- if ( /^ {1,10}(\w+)\s+(.*)$/ ) {
- # new feature key and location
- flushline ();
-
- $in_key = true;
- $in_qual = false;
- $current_key = $1;
- $current_loc = $2;
-
- } elsif ( /^\s+\/(\w+)=(.*)$/ ) {
- # new qualifier
- flushline ();
-
- $in_key = false;
- $in_qual = true;
- $current_qual = $1;
- # remove leading double quote
- my $val = $2;
- $val =~ s/\"//g;
- $current_val = $val;
- if ( $current_qual =~ /(?:translation|transcription|peptide|anticodon)/ ) {
- $no_space = true;
- } else {
- $no_space = false;
- }
-
- } elsif ( /^\s+\/(\w+)$/ ) {
- # new singleton qualifier - e.g., trans-splicing, pseudo
- flushline ();
-
- $in_key = false;
- $in_qual = true;
- $current_qual = $1;
- $current_val = "";
- $no_space = false;
-
- } elsif ( /^\s+(.*)$/ ) {
-
- if ( $in_key ) {
- # continuation of feature location
- $current_loc = $current_loc . $1;
-
- } elsif ( $in_qual ) {
- # continuation of qualifier
- # remove trailing double quote
- my $val = $1;
- $val =~ s/\"//g;
- if ( $no_space ) {
- $current_val = $current_val . $val;
- } elsif ( $current_val =~ /-$/ ) {
- $current_val = $current_val . $val;
- } else {
- $current_val = $current_val . " " . $val;
- }
- }
- }
- }
-}
-
-print "</INSDSet>\n";
+#!/bin/sh
+xtract -g2x "$@"
=====================================
hlp-xtract.txt
=====================================
@@ -126,6 +126,16 @@ Elink -cited Equivalent
elink_cited |
efetch -format abstract
+Combining Independent Queries
+
+ esearch -db protein -query "amyloid* [PROT]" |
+ elink -target pubmed |
+ esearch -db gene -query "apo* [GENE]" |
+ elink -target pubmed |
+ esearch -query "(#3) AND (#6)" |
+ efetch -format docsum |
+ xtract -pattern DocumentSummary -element Id Title
+
PMC
Formatting Tag Removal
=====================================
sort-uniq-count
=====================================
@@ -10,4 +10,4 @@ then
fi
sort "-$flags" |
uniq -i -c |
-perl -pe 's/\s*(\d+)\s(.+)/$1\t$2/'
+awk '{ n=$1; sub(/[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }'
=====================================
sort-uniq-count-rank
=====================================
@@ -10,5 +10,5 @@ then
fi
sort "-$flags" |
uniq -i -c |
-perl -pe 's/\s*(\d+)\s(.+)/$1\t$2/' |
+awk '{ n=$1; sub(/[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
sort -t "$(printf '\t')" -k 1,1nr -k "2$flags"
=====================================
xtract.go
=====================================
@@ -331,6 +331,8 @@ Data Conversion
[-indent | -flush]
XML object names per column
+ -g2x Convert GenBank flatfile format to INSDSeq XML
+
Documentation
-help Print this document
@@ -6549,14 +6551,15 @@ func ProcessTokens(rdr <-chan string) {
}
// ProcessFormat reformats XML for ease of reading
-func ProcessFormat(rdr <-chan string, args []string) {
+func ProcessFormat(rdr <-chan string, args []string, useTimer bool) int {
if rdr == nil || args == nil {
- return
+ return 0
}
var buffer strings.Builder
count := 0
+ maxLine := 0
// skip past command name
args = args[1:]
@@ -6613,7 +6616,7 @@ func ProcessFormat(rdr <-chan string, args []string) {
os.Stdout.WriteString(str)
}
os.Stdout.WriteString("\n")
- return
+ return 0
}
unicodePolicy := ""
@@ -6714,7 +6717,7 @@ func ProcessFormat(rdr <-chan string, args []string) {
DoMathML = true
}
- CountLines = DoMixed
+ CountLines = DoMixed || useTimer
AllowEmbed = DoStrict || DoMixed
ContentMods = AllowEmbed || DoCompress || DoUnicode || DoScript || DoMathML || DeAccent || DoASCII
@@ -7117,6 +7120,8 @@ func ProcessFormat(rdr <-chan string, args []string) {
return
}
+ maxLine = tkn.Line
+
if tkn.Tag == DOCTYPETAG {
if skipDoctype {
return
@@ -7146,6 +7151,8 @@ func ProcessFormat(rdr <-chan string, args []string) {
fmt.Fprintf(os.Stdout, "%s", txt)
}
}
+
+ return maxLine
}
// ProcessOutline displays outline of XML structure
@@ -7339,12 +7346,14 @@ func ProcessSynopsis(rdr <-chan string, leaf bool, delim string) {
}
// ProcessVerify checks for well-formed XML
-func ProcessVerify(rdr <-chan string, args []string) {
+func ProcessVerify(rdr <-chan string, args []string) int {
if rdr == nil || args == nil {
- return
+ return 0
}
+ CountLines = true
+
tknq := CreateTokenizer(rdr)
if tknq == nil {
@@ -7374,6 +7383,7 @@ func ProcessVerify(rdr <-chan string, args []string) {
maxDepth := 0
depthLine := 0
depthID := ""
+ maxLine := 0
// warn if HTML tags are not well-formed
unbalancedHTML := func(text string) bool {
@@ -7497,6 +7507,7 @@ func ProcessVerify(rdr <-chan string, args []string) {
tag := tkn.Tag
name := tkn.Name
line := tkn.Line
+ maxLine = line
if level > maxDepth {
maxDepth = level
@@ -7564,6 +7575,8 @@ func ProcessVerify(rdr <-chan string, args []string) {
if maxDepth > 20 {
fmt.Fprintf(os.Stdout, "%s%8d\tMaximum nesting, %d levels\n", depthID, depthLine, maxDepth)
}
+
+ return maxLine
}
// ProcessFilter modifies XML content, comments, or CDATA
@@ -8667,6 +8680,805 @@ func TableConverter(inp io.Reader, args []string) int {
return recordCount
}
+// READ GENBANK FLATFILE AND TRANSLATE TO INSDSEQ XML
+
+// GenBankConverter sends INSDSeq XML records down a channel
+func GenBankConverter(inp io.Reader) <-chan string {
+
+ if inp == nil {
+ return nil
+ }
+
+ out := make(chan string, ChanDepth)
+ if out == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank converter channel\n")
+ os.Exit(1)
+ }
+
+ const twelvespaces = "            "
+ const twentyonespaces = "                     "
+
+ var rec strings.Builder
+ var con strings.Builder
+ var seq strings.Builder
+
+ scanr := bufio.NewScanner(inp)
+
+ convertGenBank := func(inp io.Reader, out chan<- string) {
+
+ // close channel when all records have been sent
+ defer close(out)
+
+ row := 0
+
+ nextLine := func() string {
+
+ for scanr.Scan() {
+ line := scanr.Text()
+ if line == "" {
+ continue
+ }
+ return line
+ }
+ return ""
+
+ }
+
+ for {
+
+ rec.Reset()
+
+ // read first line of next record
+ line := nextLine()
+ if line == "" {
+ break
+ }
+
+ row++
+
+ for {
+ if !strings.HasPrefix(line, "LOCUS") {
+ // skip release file header information
+ line = nextLine()
+ row++
+ continue
+ }
+ break
+ }
+
+ readContinuationLines := func(str string) string {
+
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation line, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt := strings.TrimPrefix(line, twelvespaces)
+ str += " " + txt
+ }
+
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+
+ return str
+ }
+
+ writeOneElement := func(spaces, tag, value string) {
+
+ rec.WriteString(spaces)
+ rec.WriteString("<")
+ rec.WriteString(tag)
+ rec.WriteString(">")
+ value = html.EscapeString(value)
+ rec.WriteString(value)
+ rec.WriteString("</")
+ rec.WriteString(tag)
+ rec.WriteString(">\n")
+ }
+
+ // each section will exit with the next line ready to process
+
+ if strings.HasPrefix(line, "LOCUS") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 8 {
+
+ // start of record
+ rec.WriteString(" <INSDSeq>\n")
+
+ moleculetype := cols[4]
+ strandedness := ""
+ if strings.HasPrefix(moleculetype, "ds-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ds-")
+ strandedness = "double"
+ } else if strings.HasPrefix(moleculetype, "ss-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ss-")
+ strandedness = "single"
+ } else if strings.HasPrefix(moleculetype, "ms-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ms-")
+ strandedness = "mixed"
+ } else if strings.HasSuffix(moleculetype, "DNA") {
+ strandedness = "double"
+ } else if strings.HasSuffix(moleculetype, "RNA") {
+ strandedness = "single"
+ }
+
+ writeOneElement(" ", "INSDSeq_locus", cols[1])
+
+ writeOneElement(" ", "INSDSeq_length", cols[2])
+
+ if strandedness != "" {
+ writeOneElement(" ", "INSDSeq_strandedness", strandedness)
+ }
+
+ writeOneElement(" ", "INSDSeq_moltype", moleculetype)
+
+ writeOneElement(" ", "INSDSeq_topology", cols[5])
+
+ writeOneElement(" ", "INSDSeq_division", cols[6])
+
+ writeOneElement(" ", "INSDSeq_update-date", cols[7])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+ }
+
+ if strings.HasPrefix(line, "DEFINITION") {
+
+ txt := strings.TrimPrefix(line, "DEFINITION")
+ def := readContinuationLines(txt)
+ def = strings.TrimSuffix(def, ".")
+
+ writeOneElement(" ", "INSDSeq_definition", def)
+ }
+
+ var secondaries []string
+
+ if strings.HasPrefix(line, "ACCESSION") {
+
+ txt := strings.TrimPrefix(line, "ACCESSION")
+ str := readContinuationLines(txt)
+ accessions := strings.Fields(str)
+ ln := len(accessions)
+ if ln > 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ // skip past primary accession, collect secondaries
+ secondaries = accessions[1:]
+
+ } else if ln == 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+ }
+
+ accnver := ""
+ gi := ""
+
+ if strings.HasPrefix(line, "VERSION") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 2 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ } else if len(cols) == 3 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ // collect gi for other-seqids
+ if strings.HasPrefix(cols[2], "GI:") {
+ gi = strings.TrimPrefix(cols[2], "GI:")
+ }
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ if gi != "" {
+
+ rec.WriteString(" <INSDSeq_other-seqids>\n")
+
+ writeOneElement(" ", "INSDSeqid", "gi|"+gi)
+
+ rec.WriteString(" </INSDSeq_other-seqids>\n")
+ }
+
+ if len(secondaries) > 0 {
+
+ rec.WriteString(" <INSDSeq_secondary-accessions>\n")
+
+ for _, secndry := range secondaries {
+
+ writeOneElement(" ", "INSDSecondary-accn", secndry)
+ }
+
+ rec.WriteString(" </INSDSeq_secondary-accessions>\n")
+ }
+
+ if strings.HasPrefix(line, "DBLINK") {
+
+ txt := strings.TrimPrefix(line, "DBLINK")
+ readContinuationLines(txt)
+ // collect for database-reference
+ // out <- Token{DBLINK, dbl}
+ }
+
+ if strings.HasPrefix(line, "KEYWORDS") {
+
+ txt := strings.TrimPrefix(line, "KEYWORDS")
+ key := readContinuationLines(txt)
+ key = strings.TrimSuffix(key, ".")
+
+ if key != "" {
+ rec.WriteString(" <INSDSeq_keywords>\n")
+ kywds := strings.Split(key, ";")
+ for _, kw := range kywds {
+ kw = strings.TrimSpace(kw)
+ if kw == "" || kw == "." {
+ continue
+ }
+
+ writeOneElement(" ", "INSDKeyword", kw)
+ }
+ rec.WriteString(" </INSDSeq_keywords>\n")
+ }
+ }
+
+ if strings.HasPrefix(line, "SOURCE") {
+
+ txt := strings.TrimPrefix(line, "SOURCE")
+ src := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_source", src)
+ }
+
+ if strings.HasPrefix(line, " ORGANISM") {
+
+ org := strings.TrimPrefix(line, " ORGANISM")
+ org = CompressRunsOfSpaces(org)
+ org = strings.TrimSpace(org)
+
+ writeOneElement(" ", "INSDSeq_organism", org)
+
+ line = nextLine()
+ row++
+ if strings.HasPrefix(line, twelvespaces) {
+ txt := strings.TrimPrefix(line, twelvespaces)
+ tax := readContinuationLines(txt)
+ tax = strings.TrimSuffix(tax, ".")
+
+ writeOneElement(" ", "INSDSeq_taxonomy", tax)
+ }
+ }
+
+ rec.WriteString(" <INSDSeq_references>\n")
+ for {
+ if !strings.HasPrefix(line, "REFERENCE") {
+ // exit out of reference section
+ break
+ }
+
+ ref := "0"
+
+ rec.WriteString(" <INSDReference>\n")
+
+ str := strings.TrimPrefix(line, "REFERENCE")
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+ idx := strings.Index(str, "(")
+ if idx > 0 {
+ ref = strings.TrimSpace(str[:idx])
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+
+ posn := str[idx+1:]
+ posn = strings.TrimSuffix(posn, ")")
+ posn = strings.TrimSpace(posn)
+ if posn == "sites" {
+
+ writeOneElement(" ", "INSDReference_position", posn)
+
+ } else {
+ cols := strings.Fields(posn)
+ if len(cols) == 4 && cols[2] == "to" {
+
+ writeOneElement(" ", "INSDReference_position", cols[1]+".."+cols[3])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+ }
+ } else {
+ ref = strings.TrimSpace(str)
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+ }
+ line = nextLine()
+ row++
+
+ if strings.HasPrefix(line, " AUTHORS") {
+
+ txt := strings.TrimPrefix(line, " AUTHORS")
+ auths := readContinuationLines(txt)
+
+ rec.WriteString(" <INSDReference_authors>\n")
+ authors := strings.Split(auths, ", ")
+ for _, auth := range authors {
+ auth = strings.TrimSpace(auth)
+ if auth == "" {
+ continue
+ }
+ pair := strings.Split(auth, " and ")
+ for _, name := range pair {
+
+ writeOneElement(" ", "INSDAuthor", name)
+ }
+ }
+ rec.WriteString(" </INSDReference_authors>\n")
+ }
+
+ if strings.HasPrefix(line, " CONSRTM") {
+
+ txt := strings.TrimPrefix(line, " CONSRTM")
+ cons := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_consortium", cons)
+ }
+
+ if strings.HasPrefix(line, " TITLE") {
+
+ txt := strings.TrimPrefix(line, " TITLE")
+ titl := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_title", titl)
+ }
+
+ if strings.HasPrefix(line, " JOURNAL") {
+
+ txt := strings.TrimPrefix(line, " JOURNAL")
+ jour := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_journal", jour)
+ }
+
+ if strings.HasPrefix(line, " PUBMED") {
+
+ txt := strings.TrimPrefix(line, " PUBMED")
+ pmid := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_pubmed", pmid)
+ }
+
+ if strings.HasPrefix(line, " REMARK") {
+
+ txt := strings.TrimPrefix(line, " REMARK")
+ rem := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_remark", rem)
+ }
+
+ // end of this reference
+ rec.WriteString(" </INSDReference>\n")
+ // continue to next reference
+ }
+ rec.WriteString(" </INSDSeq_references>\n")
+
+ if strings.HasPrefix(line, "COMMENT") {
+
+ txt := strings.TrimPrefix(line, "COMMENT")
+ com := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_comment", com)
+ }
+
+ rec.WriteString(" <INSDSeq_feature-table>\n")
+ if strings.HasPrefix(line, "FEATURES") {
+
+ line = nextLine()
+ row++
+
+ for {
+ if !strings.HasPrefix(line, " ") {
+ // exit out of features section
+ break
+ }
+ if len(line) < 22 {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ line = nextLine()
+ row++
+ continue
+ }
+
+ rec.WriteString(" <INSDFeature>\n")
+
+ // read feature key and start of location
+ fkey := line[5:21]
+ fkey = strings.TrimSpace(fkey)
+
+ writeOneElement(" ", "INSDFeature_key", fkey)
+
+ loc := line[21:]
+ loc = strings.TrimSpace(loc)
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of location, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ loc += strings.TrimSpace(txt)
+ }
+
+ writeOneElement(" ", "INSDFeature_location", loc)
+
+ location_operator := ""
+ is_comp := false
+ prime5 := false
+ prime3 := false
+
+ // parseloc recursive definition
+ var parseloc func(string) []string
+
+ parseloc = func(str string) []string {
+
+ var acc []string
+
+ if strings.HasPrefix(str, "join(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "join"
+
+ str = strings.TrimPrefix(str, "join(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "order(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "order"
+
+ str = strings.TrimPrefix(str, "order(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "complement(") && strings.HasSuffix(str, ")") {
+
+ is_comp = true
+
+ str = strings.TrimPrefix(str, "complement(")
+ str = strings.TrimSuffix(str, ")")
+ items := parseloc(str)
+
+ // reverse items
+ for i, j := 0, len(items)-1; i < j; i, j = i+1, j-1 {
+ items[i], items[j] = items[j], items[i]
+ }
+
+ // reverse from and to positions, flip direction of angle brackets (partial flags)
+ for _, thisloc := range items {
+ pts := strings.Split(thisloc, "..")
+ ln := len(pts)
+ if ln == 2 {
+ fst := pts[0]
+ scd := pts[1]
+ lf := ""
+ rt := ""
+ if strings.HasPrefix(fst, "<") {
+ fst = strings.TrimPrefix(fst, "<")
+ rt = ">"
+ }
+ if strings.HasPrefix(scd, ">") {
+ scd = strings.TrimPrefix(scd, ">")
+ lf = "<"
+ }
+ acc = append(acc, lf+scd+".."+rt+fst)
+ } else if ln > 0 {
+ acc = append(acc, pts[0])
+ }
+ }
+
+ } else {
+
+ // save individual interval or point if no leading accession
+ if !strings.Contains(str, ":") {
+ acc = append(acc, str)
+ }
+ }
+
+ return acc
+ }
+
+ items := parseloc(loc)
+
+ rec.WriteString(" <INSDFeature_intervals>\n")
+
+ num_ivals := 0
+
+ // report individual intervals
+ for _, thisloc := range items {
+ if thisloc == "" {
+ continue
+ }
+
+ num_ivals++
+
+ rec.WriteString(" <INSDInterval>\n")
+ pts := strings.Split(thisloc, "..")
+ if len(pts) == 2 {
+
+ // fr..to
+ fr := pts[0]
+ to := pts[1]
+ if strings.HasPrefix(fr, "<") {
+ fr = strings.TrimPrefix(fr, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(to, ">") {
+ to = strings.TrimPrefix(to, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ crt := strings.Split(thisloc, "^")
+ if len(crt) == 2 {
+
+ // fr^to
+ fr := crt[0]
+ to := crt[1]
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ rec.WriteString(" <INSDInterval_interbp value=\"true\"/>\n")
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ // pt
+ pt := pts[0]
+ if strings.HasPrefix(pt, "<") {
+ pt = strings.TrimPrefix(pt, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(pt, ">") {
+ pt = strings.TrimPrefix(pt, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_point", pt)
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+ }
+ }
+ rec.WriteString(" </INSDInterval>\n")
+ }
+
+ rec.WriteString(" </INSDFeature_intervals>\n")
+
+ if num_ivals > 1 {
+ writeOneElement(" ", "INSDFeature_operator", location_operator)
+ }
+ if prime5 {
+ rec.WriteString(" <INSDFeature_partial5 value=\"true\"/>\n")
+ }
+ if prime3 {
+ rec.WriteString(" <INSDFeature_partial3 value=\"true\"/>\n")
+ }
+
+ hasQual := false
+ for {
+ if !strings.HasPrefix(line, twentyonespaces) {
+ // if not qualifier line, break out of loop
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ qual := ""
+ val := ""
+ if strings.HasPrefix(txt, "/") {
+ if !hasQual {
+ hasQual = true
+ rec.WriteString(" <INSDFeature_quals>\n")
+ }
+ // read new qualifier and start of value
+ qual = strings.TrimPrefix(txt, "/")
+ qual = strings.TrimSpace(qual)
+ idx := strings.Index(qual, "=")
+ if idx > 0 {
+ val = qual[idx+1:]
+ qual = qual[:idx]
+ }
+
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of qualifier, break out of loop
+ break
+ }
+ // append subsequent line to value and continue with loop
+ if qual == "transcription" || qual == "translation" || qual == "peptide" || qual == "anticodon" {
+ val += strings.TrimSpace(txt)
+ } else {
+ val += " " + strings.TrimSpace(txt)
+ }
+ }
+
+ rec.WriteString(" <INSDQualifier>\n")
+
+ writeOneElement(" ", "INSDQualifier_name", qual)
+
+ val = strings.TrimPrefix(val, "\"")
+ val = strings.TrimSuffix(val, "\"")
+ val = strings.TrimSpace(val)
+ if val != "" {
+
+ writeOneElement(" ", "INSDQualifier_value", val)
+ }
+
+ rec.WriteString(" </INSDQualifier>\n")
+ }
+ }
+ if hasQual {
+ rec.WriteString(" </INSDFeature_quals>\n")
+ }
+
+ // end of this feature
+ rec.WriteString(" </INSDFeature>\n")
+ // continue to next feature
+ }
+ }
+ rec.WriteString(" </INSDSeq_feature-table>\n")
+
+ if strings.HasPrefix(line, "CONTIG") {
+
+ // pathological records can have over 90,000 components, use strings.Builder
+ con.Reset()
+
+ txt := strings.TrimPrefix(line, "CONTIG")
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation of contig, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt = strings.TrimPrefix(line, twelvespaces)
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ }
+ }
+
+ if strings.HasPrefix(line, "BASE COUNT") {
+
+ txt := strings.TrimPrefix(line, "BASE COUNT")
+ readContinuationLines(txt)
+ // not supported
+ }
+
+ if strings.HasPrefix(line, "ORIGIN") {
+
+ line = nextLine()
+ row++
+ }
+
+ // remainder should be sequence
+
+ // sequence can be millions of bases, use strings.Builder
+ seq.Reset()
+
+ for line != "" {
+
+ if strings.HasPrefix(line, "//") {
+
+ // end of record, print collected sequence
+ str := seq.String()
+ if str != "" {
+
+ writeOneElement(" ", "INSDSeq_sequence", str)
+ }
+ seq.Reset()
+
+ // print contig section
+ str = con.String()
+ str = strings.TrimSpace(str)
+ if str != "" {
+ writeOneElement(" ", "INSDSeq_contig", str)
+ }
+ con.Reset()
+
+ // end of record
+ rec.WriteString(" </INSDSeq>\n")
+
+ // send formatted record down channel
+ txt := rec.String()
+ out <- txt
+ rec.Reset()
+ // go to top of loop for next record
+ break
+ }
+
+ // read next sequence line
+
+ cols := strings.Fields(line)
+ for _, str := range cols {
+
+ if IsAllDigits(str) {
+ continue
+ }
+
+ // append letters to sequence
+ seq.WriteString(str)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ // continue to next record
+ }
+ }
+
+ // launch single converter goroutine
+ go convertGenBank(inp, out)
+
+ return out
+}
+
// MAIN FUNCTION
// e.g., xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -element Initials,LastName
@@ -9297,6 +10109,62 @@ func main() {
return
}
+ // READ GENBANK FLATFILE AND TRANSLATE TO INSDSEQ XML
+
+ // must be called before CreateReader starts draining stdin
+ if len(args) > 0 && args[0] == "-g2x" {
+
+ gbk := GenBankConverter(in)
+
+ if gbk == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank to XML converter\n")
+ os.Exit(1)
+ }
+
+ head := `<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
+<INSDSet>
+`
+ tail := ""
+
+ // drain output of last channel in service chain
+ for str := range gbk {
+
+ if str == "" {
+ continue
+ }
+
+ if head != "" {
+ os.Stdout.WriteString(head)
+ head = ""
+ tail = `</INSDSet>
+`
+ }
+
+ // send result to stdout
+ os.Stdout.WriteString(str)
+ if !strings.HasSuffix(str, "\n") {
+ os.Stdout.WriteString("\n")
+ }
+
+ recordCount++
+
+ runtime.Gosched()
+ }
+
+ if tail != "" {
+ os.Stdout.WriteString(tail)
+ }
+
+ debug.FreeOSMemory()
+
+ if timr {
+ printDuration("records")
+ }
+
+ return
+ }
+
// CREATE XML BLOCK READER FROM STDIN OR FILE
rdr := CreateReader(in)
@@ -9468,10 +10336,9 @@ func main() {
switch args[0] {
case "-format":
- ProcessFormat(rdr, args)
+ recordCount = ProcessFormat(rdr, args, timr)
case "-verify", "-validate":
- CountLines = true
- ProcessVerify(rdr, args)
+ recordCount = ProcessVerify(rdr, args)
case "-filter":
ProcessFilter(rdr, args)
case "-normalize", "-normal":
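In the qualifier loop of the converter diff above, continuation lines are joined with a single space except for sequence-like qualifiers, whose values must stay unbroken. A sketch of that rule, with an illustrative function name:

```go
package main

import (
	"fmt"
	"strings"
)

// joinContinuation appends one wrapped continuation line to a qualifier value.
// Sequence-style qualifiers (transcription, translation, peptide, anticodon)
// are concatenated without a separator so the residue or base string stays
// unbroken; prose qualifiers get a single space between fragments.
func joinContinuation(qual, val, next string) string {
	trimmed := strings.TrimSpace(next)
	switch qual {
	case "transcription", "translation", "peptide", "anticodon":
		return val + trimmed
	}
	return val + " " + trimmed
}

func main() {
	fmt.Println(joinContinuation("translation", "MKV", "   LLT"))
	// → MKVLLT
	fmt.Println(joinContinuation("note", "first part", "   second part"))
	// → first part second part
}
```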
View it on GitLab: https://salsa.debian.org/med-team/ncbi-entrez-direct/-/compare/36914b997fe7b6f71dbe208a0f2e29e2ca47f056...3ff79da3f1d0d11d2976f414edfd56adcf372b89