[med-svn] [Git][med-team/ncbi-entrez-direct][master] 6 commits: New upstream version 13.7.20200713+dfsg
Aaron M. Ucko
gitlab at salsa.debian.org
Wed Jul 15 03:20:44 BST 2020
Aaron M. Ucko pushed to branch master at Debian Med / ncbi-entrez-direct
Commits:
bbe0173f by Aaron M. Ucko at 2020-07-13T22:12:51-04:00
New upstream version 13.7.20200713+dfsg
- - - - -
8582cc64 by Aaron M. Ucko at 2020-07-13T22:13:42-04:00
Merge tag 'upstream/13.7.20200713+dfsg'
Upstream version 13.7.20200713(+dfsg).
- - - - -
b2e48ab9 by Aaron M. Ucko at 2020-07-14T21:39:17-04:00
d/rules: Install accn-at-a-time and g2x; update gbf2xml's handling.
Account for new as-is script accn-at-a-time, new Go executable g2x,
and gbf2xml's switch from a Perl script to an as-is script (wrapping
xtract).
- - - - -
6e84fc59 by Aaron M. Ucko at 2020-07-14T22:07:30-04:00
debian/rules: Belatedly arrange to tweak enquire.
Its --ca* options won't do here.
- - - - -
984a1b06 by Aaron M. Ucko at 2020-07-14T22:15:47-04:00
debian/man: Update for new upstream release (13.7.20200713[+dfsg]).
* accn-at-a-time.1, g2x.1: Document new commands.
* gbf2xml.1: Turn into an alias for xtract.1.
* xtract.1: Document -g2x flag (under Data Conversion) and
corresponding gbf2xml executable.
* Update SEE ALSO references accordingly.
- - - - -
3ff79da3 by Aaron M. Ucko at 2020-07-14T22:17:16-04:00
Finalize ncbi-entrez-direct 13.7.20200713+dfsg-1 for unstable.
- - - - -
15 changed files:
- + accn-at-a-time
- debian/changelog
- + debian/man/accn-at-a-time.1
- + debian/man/g2x.1
- debian/man/gbf2xml.1
- debian/man/word-at-a-time.1
- debian/man/xtract.1
- debian/rules
- enquire
- + g2x.go
- gbf2xml
- hlp-xtract.txt
- sort-uniq-count
- sort-uniq-count-rank
- xtract.go
Changes:
=====================================
accn-at-a-time
=====================================
@@ -0,0 +1,4 @@
+#!/bin/bash -norc
+sed 's/[^a-zA-Z0-9_.]/ /g; s/^ *//' |
+tr 'A-Z' 'a-z' |
+fmt -w 1
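The four-line pipeline above is self-contained and easy to spot-check; a minimal sketch that inlines the same sed/tr/fmt stages (rather than assuming the installed accn-at-a-time script) on a made-up input line:

```shell
# Inline the accn-at-a-time pipeline: turn every character other than
# letters, digits, '_' and '.' into a space, strip leading spaces,
# lowercase, and let fmt emit one token per line.
printf 'Accessions: NM_000518.5 and U49845\n' |
sed 's/[^a-zA-Z0-9_.]/ /g; s/^ *//' |
tr 'A-Z' 'a-z' |
fmt -w 1
# → accessions
#   nm_000518.5
#   and
#   u49845
```

The `fmt -w 1` trick relies on fmt never breaking a word, so each whitespace-separated token lands on its own line; the same idiom is used by word-at-a-time.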
=====================================
debian/changelog
=====================================
@@ -1,3 +1,15 @@
+ncbi-entrez-direct (13.7.20200713+dfsg-1) unstable; urgency=medium
+
+ * New upstream release.
+ * debian/man/{accn-at-a-time,g2x}.1: Document new commands.
+ * debian/man/{gbf2xml,word-at-a-time,xtract}.1: Update for new release.
+ * debian/rules:
+ - Account for new as-is script accn-at-a-time, new Go executable g2x, and
+ gbf2xml's switch from a Perl script to an as-is script (wrapping xtract).
+ - Belatedly arrange to tweak enquire, whose --ca* options won't do here.
+
+ -- Aaron M. Ucko <ucko at debian.org> Tue, 14 Jul 2020 22:17:15 -0400
+
ncbi-entrez-direct (13.7.20200615+dfsg-2) unstable; urgency=medium
* debian/rules: Install ftp-cp, ftp-ls, nquire, and transmute as simple
=====================================
debian/man/accn-at-a-time.1
=====================================
@@ -0,0 +1,13 @@
+.TH ACCN-AT-A-TIME 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
+.SH NAME
+accn\-at\-a\-time \- parse an input file into biological identifiers
+.SH SYNOPSIS
+.B accn\-at\-a\-time
+.SH DESCRIPTION
+\fBaccn\-at\-a\-time\fP reads a text file on standard input,
+extracts all sequences of
+digits, English letters, underscores, and periods,
+lowercases any capital letters,
+and prints the resulting terms in order, one per line.
+.SH SEE ALSO
+.BR word\-at\-a\-time (1).
=====================================
debian/man/g2x.1
=====================================
@@ -0,0 +1,21 @@
+.TH G2X 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
+.SH NAME
+g2x \- Convert GenBank flatfiles to INSDSeq XML
+.SH SYNOPSIS
+.B g2x
+[\|\fB\-input\fP\ \fIfilename\fP\|]
+[\|\fB\-output\fP\ \fIfilename\fP\|]
+.SH DESCRIPTION
+\fBg2x\fP reads the specified GenBank flatfile
+and writes a corresponding INSDSeq XML document.
+.SH OPTIONS
+.TP
+\fB\-input\fP\ \fIfilename\fP
+Read GenBank format from file instead of stdin.
+.TP
+\fB\-output\fP\ \fIfilename\fP
+Write INSDSeq XML to file instead of stdout.
+.SH SEE ALSO
+.BR asn2ff (1),
+.BR asn2gb (1),
+.BR gbf2xml (1).
=====================================
debian/man/gbf2xml.1
=====================================
@@ -1,21 +1 @@
-.TH GBF2XML 1 2017-07-05 NCBI "NCBI Entrez Direct User's Manual"
-.SH NAME
-gbf2xml \- Convert GenBank flatfiles to INSDSeq XML
-.SH SYNOPSIS
-.B gbf2xml
-[\|\fIfilename\fP\|]
-.SH DESCRIPTION
-\fBgbf2xml\fP reads the specified GenBank flatfile
-(or from standard input when not passed a filename)
-and writes a corresponding INSDSeq XML document
-to standard output.
-.SH BUGS
-Feature intervals that refer to 'far' locations, i.e., those not within
-the cited record and which have an accession and colon, are suppressed.
-Those rare features (e.g., trans\-splicing between molecules) are lost.
-
-Keywords and References are currently not supported.
-.SH SEE ALSO
-.BR asn2ff (1),
-.BR asn2gb (1),
-.BR xtract (1).
+.so xtract.1
=====================================
debian/man/word-at-a-time.1
=====================================
@@ -1,4 +1,4 @@
-.TH WORD-AT-A-TIME 1 2017-01-24 NCBI "NCBI Entrez Direct User's Manual"
+.TH WORD-AT-A-TIME 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
.SH NAME
word\-at\-a\-time \- parse an input file into alphanumeric words
.SH SYNOPSIS
@@ -9,4 +9,5 @@ extracts all sequences of digits and English letters,
lowercases any capital letters,
and prints the resulting terms in order, one per line.
.SH SEE ALSO
+.BR accn\-at\-a\-time (1),
.BR join\-into\-groups\-of (1).
=====================================
debian/man/xtract.1
=====================================
@@ -1,6 +1,6 @@
-.TH XTRACT 1 2020-06-28 NCBI "NCBI Entrez Direct User's Manual"
+.TH XTRACT 1 2020-07-14 NCBI "NCBI Entrez Direct User's Manual"
.SH NAME
-xtract \- convert XML into a table of data values
+gbf2xml, xtract \- NCBI Entrez Direct XML conversion and transformation tool
.SH SYNOPSIS
\fBxtract\fP
[\|\fB\-help\fP\|]
@@ -125,12 +125,18 @@ xtract \- convert XML into a table of data values
[\|\fB\-t2x\fP [\|\fB\-set\fP\ \fItag\fP\|] [\|\fB\-rec\fP\ \fItag\fP\|] \
[\|\fB\-skip\fP\ \fIN\fP\|] [\|\fB\-lower\fP|\fB\-upper\fP\|] \
[\|\fB\-indent\fP|\fB\-flush\fP\|] \fIcolumnName1\fP\ ...\|]
+[\|\fB\-g2x\fP\|]
[\|\fB\-examples\fP\|]
[\|\fB\-version\fP\|]
+
+\fBgbf2xml\fP\ ...
.SH DESCRIPTION
\fBxtract\fP converts an XML document
into a table of data values
according to user\-specified rules.
+
+\fBgbf2xml\fP converts from GenBank flatfile format to INSDSeq XML,
+and is equivalent to \fBxtract \-g2x\fP.
.SH OPTIONS
.SS Processing Flags
.TP
@@ -716,6 +722,9 @@ Do not indent XML output.
XML object names per column.
.RE
.PD
+.TP
+\fB\-g2x\fP
+Convert GenBank flatfile format to INSDSeq XML.
.SS Documentation
.TP
\fB\-help\fP
=====================================
debian/rules
=====================================
@@ -17,22 +17,25 @@ MODES = address blast citmatch contact filter link notify post proxy search \
STD_WRAPPERS = $(MODES:%=bin/e%)
OTHER_WRAPPERS = esummary ftp-cp ftp-ls nquire transmute
WRAPPERS = $(STD_WRAPPERS) $(OTHER_WRAPPERS:%=bin/%)
-AS_IS_SCRIPTS = amino-acid-composition archive-pubmed between-two-genes \
- enquire entrez-phrase-search esample exclude-uid-lists \
- expand-current fetch-extras fetch-pubmed filter-stop-words \
- index-pubmed intersect-uid-lists join-into-groups-of pm-* \
- protein-neighbors reorder-columns sort-uniq-count* \
- stream-pubmed theme-aliases word-at-a-time xml2tbl xy-plot
+AS_IS_SCRIPTS = accn-at-a-time amino-acid-composition archive-pubmed \
+ between-two-genes entrez-phrase-search esample \
+ exclude-uid-lists expand-current fetch-extras fetch-pubmed \
+ filter-stop-words gbf2xml index-pubmed intersect-uid-lists \
+ join-into-groups-of pm-* protein-neighbors reorder-columns \
+ sort-uniq-count* stream-pubmed theme-aliases word-at-a-time \
+ xml2tbl xy-plot
BASH_SCRIPTS = index-extras index-themes
BIN_BASH_SCRIPTS = $(BASH_SCRIPTS:%=bin/%)
DL_SCRIPTS = download-ncbi-data download-pubmed download-sequence
BIN_DL_SCRIPTS = $(DL_SCRIPTS:%=bin/%)
-PERL_SCRIPTS = edirutil gbf2xml run-ncbi-converter
+PERL_SCRIPTS = edirutil run-ncbi-converter
BIN_PERL_SCRIPTS = $(PERL_SCRIPTS:%=bin/%)
# Only bt-link and xplore need this treatment at present,
# but list all bt-* scripts to be safe.
BT_SCRIPTS = bt-link bt-load bt-save bt-srch xplore
BIN_BT_SCRIPTS = $(BT_SCRIPTS:%=bin/%)
+OTHER_SCRIPTS = enquire
+BIN_OTHER_SCRIPTS = $(OTHER_SCRIPTS:%=bin/%)
FIX_PERL_SHEBANG = 1s,^\#!/usr/bin/env perl$$,\#!/usr/bin/perl,
FIX_BASH_SHEBANG = 1s,^\#!/bin/sh,\#!/bin/bash,
@@ -50,7 +53,7 @@ export GOCACHE = $(CURDIR)/go-build
export GOPATH = $(CURDIR)/obj-$(DEB_HOST_GNU_TYPE)
GOLIBS = $(GOLIBSRC:$(GOCODE)/%=$(GOPATH)/%)
-GO_APPS = j2x rchive t2x xtract
+GO_APPS = g2x j2x rchive t2x xtract
BIN_GO_APPS = $(GO_APPS:%=bin/%)
GOVERSION := $(word 3,$(shell go version)) # go version **goX.Y.Z** OS/CPU
@@ -77,6 +80,11 @@ $(STD_WRAPPERS): bin/e%: bin/edirect
echo 'exec /usr/bin/edirect -$* "$$@"' >> $@
chmod +x $@
+bin/enquire: enquire
+ mkdir -p bin
+ sed -e 's/ --ca.*\.pem//' $< > $@
+ chmod +x $@
+
bin/esummary: bin/edirect
echo '#!/bin/sh' > $@
echo 'exec /usr/bin/edirect -fetch -format docsum "$$@"' >> $@
@@ -127,6 +135,7 @@ $(GOPATH)/src/$(GH)/fiam/gounidecode: $(GOLIBS)
ln -s ../rainycape $(GOPATH)/src/$(GH)/fiam/gounidecode
COMMON = common.go
+bin/g2x: COMMON =
bin/j2x: COMMON =
bin/t2x: COMMON =
$(BIN_GO_APPS): bin/%: %.go $(GOPATH)/src/$(GH)/fiam/gounidecode
@@ -139,7 +148,8 @@ override_dh_auto_configure:
-mv go.mod go.sum s2p.go saved/
override_dh_auto_build: $(WRAPPERS) $(BIN_BASH_SCRIPTS) $(BIN_DL_SCRIPTS) \
- $(BIN_PERL_SCRIPTS) $(BIN_BT_SCRIPTS) $(BIN_GO_APPS)
+ $(BIN_PERL_SCRIPTS) $(BIN_BT_SCRIPTS) \
+ $(BIN_OTHER_SCRIPTS) $(BIN_GO_APPS)
dh_auto_build
install $(AS_IS_SCRIPTS) debian/efetch debian/einfo bin/
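The new bin/enquire rule rewrites the script with a one-line sed that drops any ` --ca...pem` curl certificate option. A rough illustration of that substitution on a hypothetical command line (the actual option string inside enquire may differ):

```shell
# Hypothetical curl invocation as it might appear in enquire;
# the sed expression from d/rules removes ' --ca' through '.pem'.
echo 'curl -fsSL --cacert "$pth"/cacert.pem "$url"' |
sed -e 's/ --ca.*\.pem//'
# → curl -fsSL "$url"
```

Note the pattern is greedy, so a line with several `.pem` mentions would lose everything up to the last one; that is presumably acceptable for the single-option case being targeted here.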
=====================================
enquire
=====================================
@@ -222,10 +222,60 @@ do
esac
done
-# get extraction method
+# subset of perl -MURI::Escape -ne 'chomp;print uri_escape($_),"\n"'
+
+Escape() {
+ echo "$1" |
+ sed -e "s/%/%25/g" \
+ -e "s/!/%21/g" \
+ -e "s/#/%23/g" \
+ -e "s/&/%26/g" \
+ -e "s/'/%27/g" \
+ -e "s/*/%2A/g" \
+ -e "s/+/%2B/g" \
+ -e "s/,/%2C/g" \
+ -e "s|/|%2F|g" \
+ -e "s/:/%3A/g" \
+ -e "s/;/%3B/g" \
+ -e "s/=/%3D/g" \
+ -e "s/?/%3F/g" \
+ -e "s/@/%40/g" \
+ -e "s/|/%7C/g" \
+ -e "s/ /%20/g" |
+ sed -e 's/\$/%24/g' \
+ -e 's/(/%28/g' \
+ -e 's/)/%29/g' \
+ -e 's/</%3C/g' \
+ -e 's/>/%3E/g' \
+ -e 's/\[/%5B/g' \
+ -e 's/\]/%5D/g' \
+ -e 's/\^/%5E/g' \
+ -e 's/{/%7B/g' \
+ -e 's/}/%7D/g'
+}
+
+# initialize variables
mode=""
+url=""
+sls=""
+
+arg=""
+amp=""
+cmd=""
+pfx=""
+
+# optionally include nextra.sh script, if present, for internal NCBI maintenance functions (undocumented)
+# dot command is equivalent of "source"
+
+if [ -f "$pth"/nextra.sh ]
+then
+ . "$pth"/nextra.sh
+fi
+
+# get extraction method
+
if [ $# -gt 0 ]
then
case "$1" in
@@ -247,9 +297,6 @@ fi
# collect URL directory components
-url=""
-sls=""
-
while [ $# -gt 0 ]
do
case "$1" in
@@ -268,45 +315,8 @@ do
esac
done
-# subset of perl -MURI::Escape -ne 'chomp;print uri_escape($_),"\n"'
-
-Escape() {
- echo "$1" |
- sed -e "s/%/%25/g" \
- -e "s/!/%21/g" \
- -e "s/#/%23/g" \
- -e "s/&/%26/g" \
- -e "s/'/%27/g" \
- -e "s/*/%2A/g" \
- -e "s/+/%2B/g" \
- -e "s/,/%2C/g" \
- -e "s|/|%2F|g" \
- -e "s/:/%3A/g" \
- -e "s/;/%3B/g" \
- -e "s/=/%3D/g" \
- -e "s/?/%3F/g" \
- -e "s/@/%40/g" \
- -e "s/|/%7C/g" \
- -e "s/ /%20/g" |
- sed -e 's/\$/%24/g' \
- -e 's/(/%28/g' \
- -e 's/)/%29/g' \
- -e 's/</%3C/g' \
- -e 's/>/%3E/g' \
- -e 's/\[/%5B/g' \
- -e 's/\]/%5D/g' \
- -e 's/\^/%5E/g' \
- -e 's/{/%7B/g' \
- -e 's/}/%7D/g'
-}
-
# collect argument tags paired with (escaped) values
-arg=""
-amp=""
-cmd=""
-pfx=""
-
while [ $# -gt 0 ]
do
case "$1" in
=====================================
g2x.go
=====================================
@@ -0,0 +1,1125 @@
+// ===========================================================================
+//
+// PUBLIC DOMAIN NOTICE
+// National Center for Biotechnology Information (NCBI)
+//
+// This software/database is a "United States Government Work" under the
+// terms of the United States Copyright Act. It was written as part of
+// the author's official duties as a United States Government employee and
+// thus cannot be copyrighted. This software/database is freely available
+// to the public for use. The National Library of Medicine and the U.S.
+// Government do not place any restriction on its use or reproduction.
+// We would, however, appreciate having the NCBI and the author cited in
+// any work or product based on this material.
+//
+// Although all reasonable efforts have been taken to ensure the accuracy
+// and reliability of the software and data, the NLM and the U.S.
+// Government do not and cannot warrant the performance or results that
+// may be obtained by using this software or data. The NLM and the U.S.
+// Government disclaim all warranties, express or implied, including
+// warranties of performance, merchantability or fitness for any particular
+// purpose.
+//
+// ===========================================================================
+//
+// File Name: g2x.go
+//
+// Author: Jonathan Kans
+//
+// ==========================================================================
+
+/*
+ Compile application by running:
+
+ go build g2x.go
+*/
+
+package main
+
+import (
+ "bufio"
+ "fmt"
+ "html"
+ "io"
+ "os"
+ "runtime"
+ "runtime/debug"
+ "strings"
+ "time"
+ "unicode"
+)
+
+const g2xHelp = `
+Data Files
+
+ -input Read GenBank format from file instead of stdin
+ -output Write INSDSeq XML to file instead of stdout
+
+`
+
+// global variables, initialized (recursively) to "zero value" of type
+var (
+ ByteCount int
+ ChanDepth int
+ InBlank [256]bool
+ InElement [256]bool
+)
+
+// init function(s) run after creation of variables, before main function
+func init() {
+
+ // set communication channel buffer size
+ ChanDepth = 16
+
+ // range iterates over all elements of slice
+ for i := range InBlank {
+ // (would already have been zeroed at creation in this case)
+ InBlank[i] = false
+ }
+ InBlank[' '] = true
+ InBlank['\t'] = true
+ InBlank['\n'] = true
+ InBlank['\r'] = true
+ InBlank['\f'] = true
+
+ for i := range InElement {
+ // (would already have been zeroed at creation in this case)
+ InElement[i] = false
+ }
+ for ch := 'A'; ch <= 'Z'; ch++ {
+ InElement[ch] = true
+ }
+ for ch := 'a'; ch <= 'z'; ch++ {
+ InElement[ch] = true
+ }
+ for ch := '0'; ch <= '9'; ch++ {
+ InElement[ch] = true
+ }
+ InElement['_'] = true
+ InElement['-'] = true
+ InElement['.'] = true
+ InElement[':'] = true
+}
+
+func CompressRunsOfSpaces(str string) string {
+
+ whiteSpace := false
+ var buffer strings.Builder
+
+ for _, ch := range str {
+ if ch < 127 && InBlank[ch] {
+ if !whiteSpace {
+ buffer.WriteRune(' ')
+ }
+ whiteSpace = true
+ } else {
+ buffer.WriteRune(ch)
+ whiteSpace = false
+ }
+ }
+
+ return buffer.String()
+}
+
+func IsAllDigits(str string) bool {
+
+ for _, ch := range str {
+ if !unicode.IsDigit(ch) {
+ return false
+ }
+ }
+
+ return true
+}
+
+// GenBankConverter sends INSDSeq XML records down a channel
+func GenBankConverter(inp io.Reader) <-chan string {
+
+ if inp == nil {
+ return nil
+ }
+
+ out := make(chan string, ChanDepth)
+ if out == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank converter channel\n")
+ os.Exit(1)
+ }
+
+ const twelvespaces = " "
+ const twentyonespaces = " "
+
+ var rec strings.Builder
+ var con strings.Builder
+ var seq strings.Builder
+
+ scanr := bufio.NewScanner(inp)
+
+ convertGenBank := func(inp io.Reader, out chan<- string) {
+
+ // close channel when all records have been sent
+ defer close(out)
+
+ row := 0
+
+ nextLine := func() string {
+
+ for scanr.Scan() {
+ line := scanr.Text()
+ if line == "" {
+ continue
+ }
+ return line
+ }
+ return ""
+
+ }
+
+ for {
+
+ rec.Reset()
+
+ // read first line of next record
+ line := nextLine()
+ if line == "" {
+ break
+ }
+
+ row++
+
+ for {
+ if !strings.HasPrefix(line, "LOCUS") {
+ // skip release file header information
+ line = nextLine()
+ row++
+ continue
+ }
+ break
+ }
+
+ readContinuationLines := func(str string) string {
+
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation line, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt := strings.TrimPrefix(line, twelvespaces)
+ str += " " + txt
+ }
+
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+
+ return str
+ }
+
+ writeOneElement := func(spaces, tag, value string) {
+
+ rec.WriteString(spaces)
+ rec.WriteString("<")
+ rec.WriteString(tag)
+ rec.WriteString(">")
+ value = html.EscapeString(value)
+ rec.WriteString(value)
+ rec.WriteString("</")
+ rec.WriteString(tag)
+ rec.WriteString(">\n")
+ }
+
+ // each section will exit with the next line ready to process
+
+ if strings.HasPrefix(line, "LOCUS") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 8 {
+
+ // start of record
+ rec.WriteString(" <INSDSeq>\n")
+
+ moleculetype := cols[4]
+ strandedness := ""
+ if strings.HasPrefix(moleculetype, "ds-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ds-")
+ strandedness = "double"
+ } else if strings.HasPrefix(moleculetype, "ss-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ss-")
+ strandedness = "single"
+ } else if strings.HasPrefix(moleculetype, "ms-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ms-")
+ strandedness = "mixed"
+ } else if strings.HasSuffix(moleculetype, "DNA") {
+ strandedness = "double"
+ } else if strings.HasSuffix(moleculetype, "RNA") {
+ strandedness = "single"
+ }
+
+ writeOneElement(" ", "INSDSeq_locus", cols[1])
+
+ writeOneElement(" ", "INSDSeq_length", cols[2])
+
+ if strandedness != "" {
+ writeOneElement(" ", "INSDSeq_strandedness", strandedness)
+ }
+
+ writeOneElement(" ", "INSDSeq_moltype", moleculetype)
+
+ writeOneElement(" ", "INSDSeq_topology", cols[5])
+
+ writeOneElement(" ", "INSDSeq_division", cols[6])
+
+ writeOneElement(" ", "INSDSeq_update-date", cols[7])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+ }
+
+ if strings.HasPrefix(line, "DEFINITION") {
+
+ txt := strings.TrimPrefix(line, "DEFINITION")
+ def := readContinuationLines(txt)
+ def = strings.TrimSuffix(def, ".")
+
+ writeOneElement(" ", "INSDSeq_definition", def)
+ }
+
+ var secondaries []string
+
+ if strings.HasPrefix(line, "ACCESSION") {
+
+ txt := strings.TrimPrefix(line, "ACCESSION")
+ str := readContinuationLines(txt)
+ accessions := strings.Fields(str)
+ ln := len(accessions)
+ if ln > 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ // skip past primary accession, collect secondaries
+ secondaries = accessions[1:]
+
+ } else if ln == 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+ }
+
+ accnver := ""
+ gi := ""
+
+ if strings.HasPrefix(line, "VERSION") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 2 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ } else if len(cols) == 3 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ // collect gi for other-seqids
+ if strings.HasPrefix(cols[2], "GI:") {
+ gi = strings.TrimPrefix(cols[2], "GI:")
+ }
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ if gi != "" {
+
+ rec.WriteString(" <INSDSeq_other-seqids>\n")
+
+ writeOneElement(" ", "INSDSeqid", "gi|"+gi)
+
+ rec.WriteString(" </INSDSeq_other-seqids>\n")
+ }
+
+ if len(secondaries) > 0 {
+
+ rec.WriteString(" <INSDSeq_secondary-accessions>\n")
+
+ for _, secndry := range secondaries {
+
+ writeOneElement(" ", "INSDSecondary-accn", secndry)
+ }
+
+ rec.WriteString(" </INSDSeq_secondary-accessions>\n")
+ }
+
+ if strings.HasPrefix(line, "DBLINK") {
+
+ txt := strings.TrimPrefix(line, "DBLINK")
+ readContinuationLines(txt)
+ // collect for database-reference
+ // out <- Token{DBLINK, dbl}
+ }
+
+ if strings.HasPrefix(line, "KEYWORDS") {
+
+ txt := strings.TrimPrefix(line, "KEYWORDS")
+ key := readContinuationLines(txt)
+ key = strings.TrimSuffix(key, ".")
+
+ if key != "" {
+ rec.WriteString(" <INSDSeq_keywords>\n")
+ kywds := strings.Split(key, ";")
+ for _, kw := range kywds {
+ kw = strings.TrimSpace(kw)
+ if kw == "" || kw == "." {
+ continue
+ }
+
+ writeOneElement(" ", "INSDKeyword", kw)
+ }
+ rec.WriteString(" </INSDSeq_keywords>\n")
+ }
+ }
+
+ if strings.HasPrefix(line, "SOURCE") {
+
+ txt := strings.TrimPrefix(line, "SOURCE")
+ src := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_source", src)
+ }
+
+ if strings.HasPrefix(line, " ORGANISM") {
+
+ org := strings.TrimPrefix(line, " ORGANISM")
+ org = CompressRunsOfSpaces(org)
+ org = strings.TrimSpace(org)
+
+ writeOneElement(" ", "INSDSeq_organism", org)
+
+ line = nextLine()
+ row++
+ if strings.HasPrefix(line, twelvespaces) {
+ txt := strings.TrimPrefix(line, twelvespaces)
+ tax := readContinuationLines(txt)
+ tax = strings.TrimSuffix(tax, ".")
+
+ writeOneElement(" ", "INSDSeq_taxonomy", tax)
+ }
+ }
+
+ rec.WriteString(" <INSDSeq_references>\n")
+ for {
+ if !strings.HasPrefix(line, "REFERENCE") {
+ // exit out of reference section
+ break
+ }
+
+ ref := "0"
+
+ rec.WriteString(" <INSDReference>\n")
+
+ str := strings.TrimPrefix(line, "REFERENCE")
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+ idx := strings.Index(str, "(")
+ if idx > 0 {
+ ref = strings.TrimSpace(str[:idx])
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+
+ posn := str[idx+1:]
+ posn = strings.TrimSuffix(posn, ")")
+ posn = strings.TrimSpace(posn)
+ if posn == "sites" {
+
+ writeOneElement(" ", "INSDReference_position", posn)
+
+ } else {
+ cols := strings.Fields(posn)
+ if len(cols) == 4 && cols[2] == "to" {
+
+ writeOneElement(" ", "INSDReference_position", cols[1]+".."+cols[3])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ }
+ }
+ } else {
+ ref = strings.TrimSpace(str)
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+ }
+ line = nextLine()
+ row++
+
+ if strings.HasPrefix(line, " AUTHORS") {
+
+ txt := strings.TrimPrefix(line, " AUTHORS")
+ auths := readContinuationLines(txt)
+
+ rec.WriteString(" <INSDReference_authors>\n")
+ authors := strings.Split(auths, ", ")
+ for _, auth := range authors {
+ auth = strings.TrimSpace(auth)
+ if auth == "" {
+ continue
+ }
+ pair := strings.Split(auth, " and ")
+ for _, name := range pair {
+
+ writeOneElement(" ", "INSDAuthor", name)
+ }
+ }
+ rec.WriteString(" </INSDReference_authors>\n")
+ }
+
+ if strings.HasPrefix(line, " CONSRTM") {
+
+ txt := strings.TrimPrefix(line, " CONSRTM")
+ cons := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_consortium", cons)
+ }
+
+ if strings.HasPrefix(line, " TITLE") {
+
+ txt := strings.TrimPrefix(line, " TITLE")
+ titl := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_title", titl)
+ }
+
+ if strings.HasPrefix(line, " JOURNAL") {
+
+ txt := strings.TrimPrefix(line, " JOURNAL")
+ jour := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_journal", jour)
+ }
+
+ if strings.HasPrefix(line, " PUBMED") {
+
+ txt := strings.TrimPrefix(line, " PUBMED")
+ pmid := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_pubmed", pmid)
+ }
+
+ if strings.HasPrefix(line, " REMARK") {
+
+ txt := strings.TrimPrefix(line, " REMARK")
+ rem := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_remark", rem)
+ }
+
+ // end of this reference
+ rec.WriteString(" </INSDReference>\n")
+ // continue to next reference
+ }
+ rec.WriteString(" </INSDSeq_references>\n")
+
+ if strings.HasPrefix(line, "COMMENT") {
+
+ txt := strings.TrimPrefix(line, "COMMENT")
+ com := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_comment", com)
+ }
+
+ rec.WriteString(" <INSDSeq_feature-table>\n")
+ if strings.HasPrefix(line, "FEATURES") {
+
+ line = nextLine()
+ row++
+
+ for {
+ if !strings.HasPrefix(line, " ") {
+ // exit out of features section
+ break
+ }
+ if len(line) < 22 {
+ fmt.Fprintf(os.Stderr, "ERROR: "+line+"\n")
+ line = nextLine()
+ row++
+ continue
+ }
+
+ rec.WriteString(" <INSDFeature>\n")
+
+ // read feature key and start of location
+ fkey := line[5:21]
+ fkey = strings.TrimSpace(fkey)
+
+ writeOneElement(" ", "INSDFeature_key", fkey)
+
+ loc := line[21:]
+ loc = strings.TrimSpace(loc)
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of location, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ loc += strings.TrimSpace(txt)
+ }
+
+ writeOneElement(" ", "INSDFeature_location", loc)
+
+ location_operator := ""
+ is_comp := false
+ prime5 := false
+ prime3 := false
+
+ // parseloc recursive definition
+ var parseloc func(string) []string
+
+ parseloc = func(str string) []string {
+
+ var acc []string
+
+ if strings.HasPrefix(str, "join(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "join"
+
+ str = strings.TrimPrefix(str, "join(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "order(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "order"
+
+ str = strings.TrimPrefix(str, "order(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "complement(") && strings.HasSuffix(str, ")") {
+
+ is_comp = true
+
+ str = strings.TrimPrefix(str, "complement(")
+ str = strings.TrimSuffix(str, ")")
+ items := parseloc(str)
+
+ // reverse items
+ for i, j := 0, len(items)-1; i < j; i, j = i+1, j-1 {
+ items[i], items[j] = items[j], items[i]
+ }
+
+ // reverse from and to positions, flip direction of angle brackets (partial flags)
+ for _, thisloc := range items {
+ pts := strings.Split(thisloc, "..")
+ ln := len(pts)
+ if ln == 2 {
+ fst := pts[0]
+ scd := pts[1]
+ lf := ""
+ rt := ""
+ if strings.HasPrefix(fst, "<") {
+ fst = strings.TrimPrefix(fst, "<")
+ rt = ">"
+ }
+ if strings.HasPrefix(scd, ">") {
+ scd = strings.TrimPrefix(scd, ">")
+ lf = "<"
+ }
+ acc = append(acc, lf+scd+".."+rt+fst)
+ } else if ln > 0 {
+ acc = append(acc, pts[0])
+ }
+ }
+
+ } else {
+
+ // save individual interval or point if no leading accession
+ if strings.Index(str, ":") < 0 {
+ acc = append(acc, str)
+ }
+ }
+
+ return acc
+ }
+
+ items := parseloc(loc)
+
+ rec.WriteString(" <INSDFeature_intervals>\n")
+
+ num_ivals := 0
+
+ // report individual intervals
+ for _, thisloc := range items {
+ if thisloc == "" {
+ continue
+ }
+
+ num_ivals++
+
+ rec.WriteString(" <INSDInterval>\n")
+ pts := strings.Split(thisloc, "..")
+ if len(pts) == 2 {
+
+ // fr..to
+ fr := pts[0]
+ to := pts[1]
+ if strings.HasPrefix(fr, "<") {
+ fr = strings.TrimPrefix(fr, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(to, ">") {
+ to = strings.TrimPrefix(to, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ crt := strings.Split(thisloc, "^")
+ if len(crt) == 2 {
+
+ // fr^to
+ fr := crt[0]
+ to := crt[1]
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ rec.WriteString(" <INSDInterval_interbp value=\"true\"/>\n")
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ // pt
+ pt := pts[0]
+ if strings.HasPrefix(pt, "<") {
+ pt = strings.TrimPrefix(pt, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(pt, ">") {
+ pt = strings.TrimPrefix(pt, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_point", pt)
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+ }
+ }
+ rec.WriteString(" </INSDInterval>\n")
+ }
+
+ rec.WriteString(" </INSDFeature_intervals>\n")
+
+ if num_ivals > 1 {
+ writeOneElement(" ", "INSDFeature_operator", location_operator)
+ }
+ if prime5 {
+ rec.WriteString(" <INSDFeature_partial5 value=\"true\"/>\n")
+ }
+ if prime3 {
+ rec.WriteString(" <INSDFeature_partial3 value=\"true\"/>\n")
+ }
+
+ hasQual := false
+ for {
+ if !strings.HasPrefix(line, twentyonespaces) {
+ // if not qualifier line, break out of loop
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ qual := ""
+ val := ""
+ if strings.HasPrefix(txt, "/") {
+ if !hasQual {
+ hasQual = true
+ rec.WriteString(" <INSDFeature_quals>\n")
+ }
+ // read new qualifier and start of value
+ qual = strings.TrimPrefix(txt, "/")
+ qual = strings.TrimSpace(qual)
+ idx := strings.Index(qual, "=")
+ if idx > 0 {
+ val = qual[idx+1:]
+ qual = qual[:idx]
+ }
+
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of qualifier, break out of loop
+ break
+ }
+ // append subsequent line to value and continue with loop
+ if qual == "transcription" || qual == "translation" || qual == "peptide" || qual == "anticodon" {
+ val += strings.TrimSpace(txt)
+ } else {
+ val += " " + strings.TrimSpace(txt)
+ }
+ }
+
+ rec.WriteString(" <INSDQualifier>\n")
+
+ writeOneElement(" ", "INSDQualifier_name", qual)
+
+ val = strings.TrimPrefix(val, "\"")
+ val = strings.TrimSuffix(val, "\"")
+ val = strings.TrimSpace(val)
+ if val != "" {
+
+ writeOneElement(" ", "INSDQualifier_value", val)
+ }
+
+ rec.WriteString(" </INSDQualifier>\n")
+ }
+ }
+ if hasQual {
+ rec.WriteString(" </INSDFeature_quals>\n")
+ }
+
+ // end of this feature
+ rec.WriteString(" </INSDFeature>\n")
+ // continue to next feature
+ }
+ }
+ rec.WriteString(" </INSDSeq_feature-table>\n")
+
+ if strings.HasPrefix(line, "CONTIG") {
+
+ // pathological records can have over 90,000 components, use strings.Builder
+ con.Reset()
+
+ txt := strings.TrimPrefix(line, "CONTIG")
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation of contig, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt = strings.TrimPrefix(line, twelvespaces)
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ }
+ }
+
+ if strings.HasPrefix(line, "BASE COUNT") {
+
+ txt := strings.TrimPrefix(line, "BASE COUNT")
+ readContinuationLines(txt)
+ // not supported
+ }
+
+ if strings.HasPrefix(line, "ORIGIN") {
+
+ line = nextLine()
+ row++
+ }
+
+ // remainder should be sequence
+
+ // sequence can be millions of bases, use strings.Builder
+ seq.Reset()
+
+ for line != "" {
+
+ if strings.HasPrefix(line, "//") {
+
+ // end of record, print collected sequence
+ str := seq.String()
+ if str != "" {
+
+ writeOneElement(" ", "INSDSeq_sequence", str)
+ }
+ seq.Reset()
+
+ // print contig section
+ str = con.String()
+ str = strings.TrimSpace(str)
+ if str != "" {
+ writeOneElement(" ", "INSDSeq_contig", str)
+ }
+ con.Reset()
+
+ // end of record
+ rec.WriteString(" </INSDSeq>\n")
+
+ // send formatted record down channel
+ txt := rec.String()
+ out <- txt
+ rec.Reset()
+ // go to top of loop for next record
+ break
+ }
+
+ // read next sequence line
+
+ cols := strings.Fields(line)
+ for _, str := range cols {
+
+ if IsAllDigits(str) {
+ continue
+ }
+
+ // append letters to sequence
+ seq.WriteString(str)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ // continue to next record
+ }
+ }
+
+ // launch single converter goroutine
+ go convertGenBank(inp, out)
+
+ return out
+}
+
+func main() {
+
+ // skip past executable name
+ args := os.Args[1:]
+
+ goOn := true
+
+ timr := false
+
+ infile := ""
+ outfile := ""
+
+ for len(args) > 0 && goOn {
+ str := args[0]
+ switch str {
+ case "-help":
+ fmt.Printf("g2x\n%s\n", g2xHelp)
+ return
+ case "-i", "-input":
+ // read data from file instead of stdin
+ args = args[1:]
+ if len(args) < 1 {
+ fmt.Fprintf(os.Stderr, "Input file name is missing\n")
+ os.Exit(1)
+ }
+ infile = args[0]
+ if infile == "-" {
+ infile = ""
+ }
+ args = args[1:]
+ case "-o", "-output":
+ // write data to file instead of stdout
+ args = args[1:]
+ if len(args) < 1 {
+ fmt.Fprintf(os.Stderr, "Output file name is missing\n")
+ os.Exit(1)
+ }
+ outfile = args[0]
+ if outfile == "-" {
+ outfile = ""
+ }
+ args = args[1:]
+ case "-timer":
+ timr = true
+ args = args[1:]
+ default:
+ goOn = false
+ }
+ }
+
+ in := os.Stdin
+
+ isPipe := false
+ fi, err := os.Stdin.Stat()
+ if err == nil {
+ // check for data being piped into stdin
+ isPipe = bool((fi.Mode() & os.ModeNamedPipe) != 0)
+ }
+
+ usingFile := false
+ if infile != "" {
+
+ fl, err := os.Open(infile)
+ if err != nil {
+ fmt.Fprintf(os.Stderr, "%s\n", err.Error())
+ os.Exit(1)
+ }
+
+ defer fl.Close()
+
+ // use indicated file instead of stdin
+ in = fl
+ usingFile = true
+
+ if isPipe && runtime.GOOS != "windows" {
+ mode := fi.Mode().String()
+ fmt.Fprintf(os.Stderr, "Input data from both stdin and file '%s', mode is '%s'\n", infile, mode)
+ os.Exit(1)
+ }
+ }
+
+ if !usingFile && !isPipe {
+ fmt.Fprintf(os.Stderr, "No input data supplied\n")
+ os.Exit(1)
+ }
+
+ op := os.Stdout
+
+ if outfile != "" {
+
+ fl, err := os.Create(outfile)
+ if err != nil {
+ fmt.Fprintf(os.Stderr, "%s\n", err.Error())
+ os.Exit(1)
+ }
+
+ defer fl.Close()
+
+ // use indicated file instead of stdout
+ op = fl
+ }
+
+ // initialize process timer
+ startTime := time.Now()
+ recordCount := 0
+ byteCount := 0
+
+ // print processing rate and program duration
+ printDuration := func(name string) {
+
+ stopTime := time.Now()
+ duration := stopTime.Sub(startTime)
+ seconds := float64(duration.Nanoseconds()) / 1e9
+
+ if recordCount >= 1000000 {
+ throughput := float64(recordCount/100000) / 10.0
+ fmt.Fprintf(os.Stderr, "\nXtract processed %.1f million %s in %.3f seconds", throughput, name, seconds)
+ } else {
+ fmt.Fprintf(os.Stderr, "\nXtract processed %d %s in %.3f seconds", recordCount, name, seconds)
+ }
+
+ if seconds >= 0.001 && recordCount > 0 {
+ rate := int(float64(recordCount) / seconds)
+ if rate >= 1000000 {
+ fmt.Fprintf(os.Stderr, " (%d million %s/second", rate/1000000, name)
+ } else {
+ fmt.Fprintf(os.Stderr, " (%d %s/second", rate, name)
+ }
+ if byteCount > 0 {
+ rate := int(float64(byteCount) / seconds)
+ if rate >= 1000000 {
+ fmt.Fprintf(os.Stderr, ", %d megabytes/second", rate/1000000)
+ } else if rate >= 1000 {
+ fmt.Fprintf(os.Stderr, ", %d kilobytes/second", rate/1000)
+ } else {
+ fmt.Fprintf(os.Stderr, ", %d bytes/second", rate)
+ }
+ }
+ fmt.Fprintf(os.Stderr, ")")
+ }
+
+ fmt.Fprintf(os.Stderr, "\n\n")
+ }
+
+ gbk := GenBankConverter(in)
+
+ if gbk == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank to XML converter\n")
+ os.Exit(1)
+ }
+
+ head := `<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
+<INSDSet>
+`
+ tail := ""
+
+ // drain output of last channel in service chain
+ // runtime assigns concurrent goroutines to execute on separate CPUs for maximum speed
+ for str := range gbk {
+
+ if str == "" {
+ continue
+ }
+
+ if head != "" {
+ op.WriteString(head)
+ head = ""
+ tail = `</INSDSet>
+`
+ }
+
+ // send result to stdout
+ op.WriteString(str)
+ if !strings.HasSuffix(str, "\n") {
+ op.WriteString("\n")
+ }
+
+ recordCount++
+
+ runtime.Gosched()
+ }
+
+ if tail != "" {
+ op.WriteString(tail)
+ }
+
+ // explicitly freeing memory before exit is useful for finding leaks when profiling
+ debug.FreeOSMemory()
+
+ if timr {
+ printDuration("records")
+ }
+}
=====================================
gbf2xml
=====================================
@@ -1,554 +1,3 @@
-#!/usr/bin/env perl
-
-# ===========================================================================
-#
-# PUBLIC DOMAIN NOTICE
-# National Center for Biotechnology Information (NCBI)
-#
-# This software/database is a "United States Government Work" under the
-# terms of the United States Copyright Act. It was written as part of
-# the author's official duties as a United States Government employee and
-# thus cannot be copyrighted. This software/database is freely available
-# to the public for use. The National Library of Medicine and the U.S.
-# Government do not place any restriction on its use or reproduction.
-# We would, however, appreciate having the NCBI and the author cited in
-# any work or product based on this material.
-#
-# Although all reasonable efforts have been taken to ensure the accuracy
-# and reliability of the software and data, the NLM and the U.S.
-# Government do not and cannot warrant the performance or results that
-# may be obtained by using this software or data. The NLM and the U.S.
-# Government disclaim all warranties, express or implied, including
-# warranties of performance, merchantability or fitness for any particular
-# purpose.
-#
-# ===========================================================================
-#
-# File Name: gbf2xml
-#
-# Author: Jonathan Kans
-#
-# Version Creation Date: 6/8/17
-#
-# ==========================================================================
-
-use strict;
-use warnings;
-
-
-# Script to convert GenBank flatfiles to INSDSeq XML.
-#
-# Feature intervals that refer to 'far' locations, i.e., those not within
-# the cited record and which have an accession and colon, are suppressed.
-# Those rare features (e.g., trans-splicing between molecules) are lost.
-#
-# Keywords and References are currently not supported.
-
-
-# definitions
-
-use constant false => 0;
-use constant true => 1;
-
-# state variables for tracking current position in flatfile
-
-my $in_seq;
-my $in_con;
-my $in_feat;
-my $in_key;
-my $in_qual;
-my $in_def;
-my $in_tax;
-my $any_feat;
-my $any_qual;
-my $no_space;
-my $is_comp;
-my $current_key;
-my $current_loc;
-my $current_qual;
-my $current_val;
-my $moltype;
-my $division;
-my $update_date;
-my $organism;
-my $source;
-my $taxonomy;
-my $topology;
-my $sequence;
-my $length;
-my $curr_seq;
-my $locus;
-my $defline;
-my $accn;
-my $accndv;
-my $location_operator;
-
-# subroutine to clear state variables for each flatfile
-# start in in_feat state to gracefully handle missing FEATURES/FH line
-
-sub clearflags {
- $in_seq = false;
- $in_con = false;
- $in_feat = false;
- $in_key = false;
- $in_qual = false;
- $in_def = false;
- $in_tax = false;
- $any_feat = false;
- $any_qual = false;
- $no_space = false;
- $is_comp = false;
- $current_key = "";
- $current_loc = "";
- $current_qual = "";
- $current_val = "";
- $moltype = "";
- $division = "";
- $update_date = "";
- $organism = "";
- $source = "";
- $taxonomy = "";
- $topology = "";
- $sequence = "";
- $length = 0;
- $curr_seq = "";
- $locus = "";
- $defline = "";
- $accn = "";
- $accndv = "";
- $location_operator = "";
-}
-
-# recursive subroutine for parsing flatfile representation of feature location
-
-sub parseloc {
- my $subloc = shift (@_);
- my @working = ();
-
- if ( $subloc =~ /^(join|order)\((.+)\)$/ ) {
- $location_operator = $1;
- my $temploc = $2;
- my @items = split (',', $temploc);
- foreach my $thisloc (@items ) {
- if ( $thisloc !~ /^.*:.*$/ ) {
- push (@working, parseloc ($thisloc));
- }
- }
-
- } elsif ( $subloc =~ /^complement\((.+)\)$/ ) {
- $is_comp = true;
- my $comploc = $1;
- my @items = parseloc ($comploc);
- my @rev = reverse (@items);
- foreach my $thisloc (@rev ) {
- if ( $thisloc =~ /^([^.]+)\.\.([^.]+)$/ ) {
- $thisloc = "$2..$1";
- }
-
- if ( $thisloc =~ /^>([^.]+)\.\.([^.]+)$/ ) {
- $thisloc = "<$1..$2";
- }
- if ( $thisloc =~ /^([^.]+)\.\.<([^.]+)$/ ) {
- $thisloc = "$1..>$2";
- }
-
- if ( $thisloc !~ /^.*:.*$/ ) {
- push (@working, parseloc ($thisloc));
- }
- }
-
- } elsif ( $subloc !~ /^.*:.*$/ ) {
- push (@working, $subloc);
- }
-
- return @working;
-}
-
-#subroutine to print next feature key / location / qualifier line
-
-sub flushline {
- if ( $in_key ) {
-
- if ( $any_qual ) {
- print " </INSDFeature_quals>\n";
- $any_qual = false;
- }
-
- if ( $any_feat ) {
- print " </INSDFeature>\n";
- }
- $any_feat = true;
-
- print " <INSDFeature>\n";
-
- #print feature key and intervals
- print " <INSDFeature_key>$current_key</INSDFeature_key>\n";
-
- my $clean_loc = $current_loc;
- $clean_loc =~ s/</&lt;/g;
- $clean_loc =~ s/>/&gt;/g;
- print " <INSDFeature_location>$clean_loc</INSDFeature_location>\n";
-
- print " <INSDFeature_intervals>\n";
-
- # parse join() order() complement() ###..### location
- $location_operator = 0;
- $is_comp = false;
- my @theloc = parseloc ($current_loc);
-
- # convert number (dot) (dot) number to number (tab) number
- my $numivals = 0;
- my $prime5 = false;
- my $prime3 = false;
- foreach my $thisloc (@theloc ) {
- $numivals++;
- print " <INSDInterval>\n";
- if ( $thisloc =~ /^([^.]+)\.\.([^.]+)$/ ) {
- my $fr = $1;
- my $to = $2;
- if ( $thisloc =~ /^</ ) {
- $prime5 = true;
- }
- if ( $thisloc =~ /\.\.>/ ) {
- $prime3 = true;
- }
- $fr =~ s/[<>]//;
- $to =~ s/[<>]//;
- print " <INSDInterval_from>$fr</INSDInterval_from>\n";
- print " <INSDInterval_to>$to</INSDInterval_to>\n";
- if ( $is_comp ) {
- print " <INSDInterval_iscomp value=\"true\"/>\n";
- }
- print " <INSDInterval_accession>$accndv</INSDInterval_accession>\n";
- } elsif ( $thisloc =~ /^(.+)\^(.+)$/ ) {
- my $fr = $1;
- my $to = $2;
- $fr =~ s/[<>]//;
- $to =~ s/[<>]//;
- print " <INSDInterval_from>$fr</INSDInterval_from>\n";
- print " <INSDInterval_to>$to</INSDInterval_to>\n";
- if ( $is_comp ) {
- print " <INSDInterval_iscomp value=\"true\"/>\n";
- }
- print " <INSDInterval_interbp value=\"true\"/>\n";
- print " <INSDInterval_accession>$accndv</INSDInterval_accession>\n";
- } elsif ( $thisloc =~ /^([^.]+)$/ ) {
- my $pt = $1;
- $pt =~ s/[<>]//;
- print " <INSDInterval_point>$pt</INSDInterval_point>\n";
- print " <INSDInterval_accession>$accndv</INSDInterval_accession>\n";
- }
- print " </INSDInterval>\n";
- }
-
- print " </INSDFeature_intervals>\n";
-
- if ( $numivals > 1 ) {
- print " <INSDFeature_operator>$location_operator</INSDFeature_operator>\n";
- }
- if ( $prime5 ) {
- print " <INSDFeature_partial5 value=\"true\"/>\n";
- }
- if ( $prime3 ) {
- print " <INSDFeature_partial3 value=\"true\"/>\n";
- }
-
- } elsif ( $in_qual ) {
-
- if ( ! $any_qual ) {
- print " <INSDFeature_quals>\n";
- }
- $any_qual = true;
-
- if ( $current_val eq "" ) {
- print " <INSDQualifier>\n";
- print " <INSDQualifier_name>$current_qual</INSDQualifier_name>\n";
- print " </INSDQualifier>\n";
- } else {
- print " <INSDQualifier>\n";
- print " <INSDQualifier_name>$current_qual</INSDQualifier_name>\n";
- my $clean_val = $current_val;
- $clean_val =~ s/</&lt;/g;
- $clean_val =~ s/>/&gt;/g;
- print " <INSDQualifier_value>$clean_val</INSDQualifier_value>\n";
- print " </INSDQualifier>\n";
- }
- }
-}
-
-# initialize flags and lists at start of program
-
-clearflags ();
-
-print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
-print "<!DOCTYPE INSDSet PUBLIC \"-//NCBI//INSD INSDSeq/EN\" \"https://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd\">\n";
-print "<INSDSet>\n";
-
-# main loop reads one line at a time
-
-while (<> ) {
- chomp;
- $_ =~ s/\r$//;
-
- # first check for extra definition or taxonomy lines, otherwise clear continuation flags
- if ( $in_def ) {
- if ( /^ {12}(.*)$/ ) {
- $defline = $defline . " " . $1;
- } else {
- $in_def = false;
- }
- } elsif ( $in_tax ) {
- if ( /^ {12}(.*)$/ ) {
- if ( $taxonomy eq "" ) {
- $taxonomy = $1;
- } else {
- $taxonomy = $taxonomy . " " . $1;
- }
- } else {
- $in_tax = false;
- }
- }
-
- if ( $in_def || $in_tax ) {
-
- # continuation lines taken care of above
-
- } elsif ( /^LOCUS\s+(\S*).*$/ ) {
-
- # record locus
- $locus = $1;
- if ( / (\d+) bp / || / (\d+) aa / ) {
- $length = $1;
- }
-
- if ( /^.*\s(\S+\s+\S+\s+\S+\s+\d+-\S+-\d+)$/ ) {
- my $tail = $1;
- if ( $tail =~ /^(\S*)\s+(\S*)\s+(\S*)\s+(\d*-\S*-\d*)$/ ) {
- $moltype = $1;
- $topology = $2;
- $division = $3;
- $update_date = $4;
- $moltype = uc $moltype;
- }
- }
-
- print " <INSDSeq>\n";
-
- print " <INSDSeq_locus>$locus</INSDSeq_locus>\n";
- print " <INSDSeq_length>$length</INSDSeq_length>\n";
-
- if ( $moltype ne "" ) {
- print " <INSDSeq_moltype>$moltype</INSDSeq_moltype>\n";
- }
- if ( $topology ne "" ) {
- print " <INSDSeq_topology>$topology</INSDSeq_topology>\n";
- }
- if ( $division ne "" ) {
- print " <INSDSeq_division>$division</INSDSeq_division>\n";
- }
- if ( $update_date ne "" ) {
- print " <INSDSeq_update-date>$update_date</INSDSeq_update-date>\n";
- }
-
- } elsif ( /^DEFINITION\s*(.*).*$/ ) {
-
- # record first line of definition line
- $defline = $1;
- # next line with leading spaces will be continuation of definition line
- $in_def = true;
-
- } elsif ( /^ACCESSION\s*(\S*).*$/ ) {
-
- # record accession
- $accn = $1;
-
- } elsif ( /^VERSION\s*(\S*).*$/ ) {
-
- # record accession.version
- $accndv = $1;
-
- } elsif ( /^SOURCE\s*(.*)$/ ) {
-
- # record source
- $source = $1;
-
- } elsif ( /^ {1,3}ORGANISM\s+(.*)$/ ) {
-
- # record organism
- if ( $organism eq "" ) {
- $organism = $1;
- if ( $organism =~ /^([^(]*) \(.*\)/ ) {
- $organism = $1;
- }
- }
- # next line with leading spaces will be start of taxonomy
- $in_tax = true;
-
- } elsif ( /^FEATURES\s+.*$/ ) {
-
- # beginning of feature table, flags already set up
-
- # first print saved fields
- $defline =~ s/\.$//;
- $defline =~ s/</&lt;/g;
- $defline =~ s/>/&gt;/g;
- if ( $defline ne "" ) {
- print " <INSDSeq_definition>$defline</INSDSeq_definition>\n";
- }
- if ( $accn ne "" ) {
- print " <INSDSeq_primary-accession>$accn</INSDSeq_primary-accession>\n";
- }
- if ( $accndv ne "" ) {
- print " <INSDSeq_accession-version>$accndv</INSDSeq_accession-version>\n";
- }
-
- $in_feat = true;
-
- if ( $source ne "" ) {
- print " <INSDSeq_source>$source</INSDSeq_source>\n";
- }
- if ( $organism ne "" ) {
- print " <INSDSeq_organism>$organism</INSDSeq_organism>\n";
- }
- $taxonomy =~ s/\.$//;
- if ( $taxonomy ne "" ) {
- print " <INSDSeq_taxonomy>$taxonomy</INSDSeq_taxonomy>\n";
- }
-
- print " <INSDSeq_feature-table>\n";
-
- } elsif ( /^ORIGIN\s*.*$/ ) {
-
- # end of feature table, print final newline
- flushline ();
-
- if ( $in_feat ) {
- if ( $any_qual ) {
- print " </INSDFeature_quals>\n";
- $any_qual = false;
- }
-
- print " </INSDFeature>\n";
-
- print " </INSDSeq_feature-table>\n";
- }
-
- $in_feat = false;
- $in_key = false;
- $in_qual = false;
- $no_space = false;
- $in_seq = true;
- $in_con = false;
-
- } elsif ( /^CONTIG\s*.*$/ ) {
-
- # end of feature table, print final newline
- flushline ();
-
- if ( $in_feat ) {
- if ( $any_qual ) {
- print " </INSDFeature_quals>\n";
- $any_qual = false;
- }
-
- print " </INSDFeature>\n";
-
- print " </INSDSeq_feature-table>\n";
- }
-
- $in_feat = false;
- $in_key = false;
- $in_qual = false;
- $no_space = false;
- $in_seq = false;
- $in_con = true;
-
- } elsif ( /^\/\/\.*/ ) {
-
- # at end-of-record double slash
- if ( $sequence ne "" ) {
- print " <INSDSeq_sequence>$sequence</INSDSeq_sequence>\n";
- }
- print " </INSDSeq>\n";
- # reset variables for catenated flatfiles
- clearflags ();
-
- } elsif ( $in_seq ) {
-
- if ( /^\s+\d+ (.*)$/ || /^\s+(.*)\s+\d+$/ ) {
- # record sequence
- $curr_seq = $1;
- $curr_seq =~ s/ //g;
- $curr_seq = lc $curr_seq;
- if ( $sequence eq "" ) {
- $sequence = $curr_seq;
- } else {
- $sequence = $sequence . $curr_seq;
- }
- }
-
- } elsif ( $in_con ) {
-
- } elsif ( $in_feat ) {
-
- if ( /^ {1,10}(\w+)\s+(.*)$/ ) {
- # new feature key and location
- flushline ();
-
- $in_key = true;
- $in_qual = false;
- $current_key = $1;
- $current_loc = $2;
-
- } elsif ( /^\s+\/(\w+)=(.*)$/ ) {
- # new qualifier
- flushline ();
-
- $in_key = false;
- $in_qual = true;
- $current_qual = $1;
- # remove leading double quote
- my $val = $2;
- $val =~ s/\"//g;
- $current_val = $val;
- if ( $current_qual =~ /(?:translation|transcription|peptide|anticodon)/ ) {
- $no_space = true;
- } else {
- $no_space = false;
- }
-
- } elsif ( /^\s+\/(\w+)$/ ) {
- # new singleton qualifier - e.g., trans-splicing, pseudo
- flushline ();
-
- $in_key = false;
- $in_qual = true;
- $current_qual = $1;
- $current_val = "";
- $no_space = false;
-
- } elsif ( /^\s+(.*)$/ ) {
-
- if ( $in_key ) {
- # continuation of feature location
- $current_loc = $current_loc . $1;
-
- } elsif ( $in_qual ) {
- # continuation of qualifier
- # remove trailing double quote
- my $val = $1;
- $val =~ s/\"//g;
- if ( $no_space ) {
- $current_val = $current_val . $val;
- } elsif ( $current_val =~ /-$/ ) {
- $current_val = $current_val . $val;
- } else {
- $current_val = $current_val . " " . $val;
- }
- }
- }
- }
-}
-
-print "</INSDSet>\n";
+#!/bin/sh
+xtract -g2x "$@"
=====================================
hlp-xtract.txt
=====================================
@@ -126,6 +126,16 @@ Elink -cited Equivalent
elink_cited |
efetch -format abstract
+Combining Independent Queries
+
+ esearch -db protein -query "amyloid* [PROT]" |
+ elink -target pubmed |
+ esearch -db gene -query "apo* [GENE]" |
+ elink -target pubmed |
+ esearch -query "(#3) AND (#6)" |
+ efetch -format docsum |
+ xtract -pattern DocumentSummary -element Id Title
+
PMC
Formatting Tag Removal
=====================================
sort-uniq-count
=====================================
@@ -10,4 +10,4 @@ then
fi
sort "-$flags" |
uniq -i -c |
-perl -pe 's/\s*(\d+)\s(.+)/$1\t$2/'
+awk '{ n=$1; sub(/[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }'
=====================================
sort-uniq-count-rank
=====================================
@@ -10,5 +10,5 @@ then
fi
sort "-$flags" |
uniq -i -c |
-perl -pe 's/\s*(\d+)\s(.+)/$1\t$2/' |
+awk '{ n=$1; sub(/[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
sort -t "$(printf '\t')" -k 1,1nr -k "2$flags"
=====================================
xtract.go
=====================================
@@ -331,6 +331,8 @@ Data Conversion
[-indent | -flush]
XML object names per column
+ -g2x Convert GenBank flatfile format to INSDSeq XML
+
Documentation
-help Print this document
@@ -6549,14 +6551,15 @@ func ProcessTokens(rdr <-chan string) {
}
// ProcessFormat reformats XML for ease of reading
-func ProcessFormat(rdr <-chan string, args []string) {
+func ProcessFormat(rdr <-chan string, args []string, useTimer bool) int {
if rdr == nil || args == nil {
- return
+ return 0
}
var buffer strings.Builder
count := 0
+ maxLine := 0
// skip past command name
args = args[1:]
@@ -6613,7 +6616,7 @@ func ProcessFormat(rdr <-chan string, args []string) {
os.Stdout.WriteString(str)
}
os.Stdout.WriteString("\n")
- return
+ return 0
}
unicodePolicy := ""
@@ -6714,7 +6717,7 @@ func ProcessFormat(rdr <-chan string, args []string) {
DoMathML = true
}
- CountLines = DoMixed
+ CountLines = DoMixed || useTimer
AllowEmbed = DoStrict || DoMixed
ContentMods = AllowEmbed || DoCompress || DoUnicode || DoScript || DoMathML || DeAccent || DoASCII
@@ -7117,6 +7120,8 @@ func ProcessFormat(rdr <-chan string, args []string) {
return
}
+ maxLine = tkn.Line
+
if tkn.Tag == DOCTYPETAG {
if skipDoctype {
return
@@ -7146,6 +7151,8 @@ func ProcessFormat(rdr <-chan string, args []string) {
fmt.Fprintf(os.Stdout, "%s", txt)
}
}
+
+ return maxLine
}
// ProcessOutline displays outline of XML structure
@@ -7339,12 +7346,14 @@ func ProcessSynopsis(rdr <-chan string, leaf bool, delim string) {
}
// ProcessVerify checks for well-formed XML
-func ProcessVerify(rdr <-chan string, args []string) {
+func ProcessVerify(rdr <-chan string, args []string) int {
if rdr == nil || args == nil {
- return
+ return 0
}
+ CountLines = true
+
tknq := CreateTokenizer(rdr)
if tknq == nil {
@@ -7374,6 +7383,7 @@ func ProcessVerify(rdr <-chan string, args []string) {
maxDepth := 0
depthLine := 0
depthID := ""
+ maxLine := 0
// warn if HTML tags are not well-formed
unbalancedHTML := func(text string) bool {
@@ -7497,6 +7507,7 @@ func ProcessVerify(rdr <-chan string, args []string) {
tag := tkn.Tag
name := tkn.Name
line := tkn.Line
+ maxLine = line
if level > maxDepth {
maxDepth = level
@@ -7564,6 +7575,8 @@ func ProcessVerify(rdr <-chan string, args []string) {
if maxDepth > 20 {
fmt.Fprintf(os.Stdout, "%s%8d\tMaximum nesting, %d levels\n", depthID, depthLine, maxDepth)
}
+
+ return maxLine
}
// ProcessFilter modifies XML content, comments, or CDATA
@@ -8667,6 +8680,805 @@ func TableConverter(inp io.Reader, args []string) int {
return recordCount
}
+// READ GENBANK FLATFILE AND TRANSLATE TO INSDSEQ XML
+
+// GenBankConverter sends INSDSeq XML records down a channel
+func GenBankConverter(inp io.Reader) <-chan string {
+
+ if inp == nil {
+ return nil
+ }
+
+ out := make(chan string, ChanDepth)
+ if out == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank converter channel\n")
+ os.Exit(1)
+ }
+
+ const twelvespaces = "            "
+ const twentyonespaces = "                     "
+
+ var rec strings.Builder
+ var con strings.Builder
+ var seq strings.Builder
+
+ scanr := bufio.NewScanner(inp)
+
+ convertGenBank := func(inp io.Reader, out chan<- string) {
+
+ // close channel when all records have been sent
+ defer close(out)
+
+ row := 0
+
+ nextLine := func() string {
+
+ for scanr.Scan() {
+ line := scanr.Text()
+ if line == "" {
+ continue
+ }
+ return line
+ }
+ return ""
+
+ }
+
+ for {
+
+ rec.Reset()
+
+ // read first line of next record
+ line := nextLine()
+ if line == "" {
+ break
+ }
+
+ row++
+
+ for {
+ if !strings.HasPrefix(line, "LOCUS") {
+ // skip release file header information
+ line = nextLine()
+ row++
+ continue
+ }
+ break
+ }
+
+ readContinuationLines := func(str string) string {
+
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation line, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt := strings.TrimPrefix(line, twelvespaces)
+ str += " " + txt
+ }
+
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+
+ return str
+ }
+
+ writeOneElement := func(spaces, tag, value string) {
+
+ rec.WriteString(spaces)
+ rec.WriteString("<")
+ rec.WriteString(tag)
+ rec.WriteString(">")
+ value = html.EscapeString(value)
+ rec.WriteString(value)
+ rec.WriteString("</")
+ rec.WriteString(tag)
+ rec.WriteString(">\n")
+ }
+
+ // each section will exit with the next line ready to process
+
+ if strings.HasPrefix(line, "LOCUS") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 8 {
+
+ // start of record
+ rec.WriteString(" <INSDSeq>\n")
+
+ moleculetype := cols[4]
+ strandedness := ""
+ if strings.HasPrefix(moleculetype, "ds-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ds-")
+ strandedness = "double"
+ } else if strings.HasPrefix(moleculetype, "ss-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ss-")
+ strandedness = "single"
+ } else if strings.HasPrefix(moleculetype, "ms-") {
+ moleculetype = strings.TrimPrefix(moleculetype, "ms-")
+ strandedness = "mixed"
+ } else if strings.HasSuffix(moleculetype, "DNA") {
+ strandedness = "double"
+ } else if strings.HasSuffix(moleculetype, "RNA") {
+ strandedness = "single"
+ }
+
+ writeOneElement(" ", "INSDSeq_locus", cols[1])
+
+ writeOneElement(" ", "INSDSeq_length", cols[2])
+
+ if strandedness != "" {
+ writeOneElement(" ", "INSDSeq_strandedness", strandedness)
+ }
+
+ writeOneElement(" ", "INSDSeq_moltype", moleculetype)
+
+ writeOneElement(" ", "INSDSeq_topology", cols[5])
+
+ writeOneElement(" ", "INSDSeq_division", cols[6])
+
+ writeOneElement(" ", "INSDSeq_update-date", cols[7])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+ }
+
+ if strings.HasPrefix(line, "DEFINITION") {
+
+ txt := strings.TrimPrefix(line, "DEFINITION")
+ def := readContinuationLines(txt)
+ def = strings.TrimSuffix(def, ".")
+
+ writeOneElement(" ", "INSDSeq_definition", def)
+ }
+
+ var secondaries []string
+
+ if strings.HasPrefix(line, "ACCESSION") {
+
+ txt := strings.TrimPrefix(line, "ACCESSION")
+ str := readContinuationLines(txt)
+ accessions := strings.Fields(str)
+ ln := len(accessions)
+ if ln > 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ // skip past primary accession, collect secondaries
+ secondaries = accessions[1:]
+
+ } else if ln == 1 {
+
+ writeOneElement(" ", "INSDSeq_primary-accession", accessions[0])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+ }
+
+ accnver := ""
+ gi := ""
+
+ if strings.HasPrefix(line, "VERSION") {
+
+ cols := strings.Fields(line)
+ if len(cols) == 2 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ } else if len(cols) == 3 {
+
+ accnver = cols[1]
+ writeOneElement(" ", "INSDSeq_accession-version", accnver)
+
+ // collect gi for other-seqids
+ if strings.HasPrefix(cols[2], "GI:") {
+ gi = strings.TrimPrefix(cols[2], "GI:")
+ }
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ if gi != "" {
+
+ rec.WriteString(" <INSDSeq_other-seqids>\n")
+
+ writeOneElement(" ", "INSDSeqid", "gi|"+gi)
+
+ rec.WriteString(" </INSDSeq_other-seqids>\n")
+ }
+
+ if len(secondaries) > 0 {
+
+ rec.WriteString(" <INSDSeq_secondary-accessions>\n")
+
+ for _, secndry := range secondaries {
+
+ writeOneElement(" ", "INSDSecondary-accn", secndry)
+ }
+
+ rec.WriteString(" </INSDSeq_secondary-accessions>\n")
+ }
+
+ if strings.HasPrefix(line, "DBLINK") {
+
+ txt := strings.TrimPrefix(line, "DBLINK")
+ readContinuationLines(txt)
+ // collect for database-reference
+ // out <- Token{DBLINK, dbl}
+ }
+
+ if strings.HasPrefix(line, "KEYWORDS") {
+
+ txt := strings.TrimPrefix(line, "KEYWORDS")
+ key := readContinuationLines(txt)
+ key = strings.TrimSuffix(key, ".")
+
+ if key != "" {
+ rec.WriteString(" <INSDSeq_keywords>\n")
+ kywds := strings.Split(key, ";")
+ for _, kw := range kywds {
+ kw = strings.TrimSpace(kw)
+ if kw == "" || kw == "." {
+ continue
+ }
+
+ writeOneElement(" ", "INSDKeyword", kw)
+ }
+ rec.WriteString(" </INSDSeq_keywords>\n")
+ }
+ }
+
+ if strings.HasPrefix(line, "SOURCE") {
+
+ txt := strings.TrimPrefix(line, "SOURCE")
+ src := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_source", src)
+ }
+
+ if strings.HasPrefix(line, " ORGANISM") {
+
+ org := strings.TrimPrefix(line, " ORGANISM")
+ org = CompressRunsOfSpaces(org)
+ org = strings.TrimSpace(org)
+
+ writeOneElement(" ", "INSDSeq_organism", org)
+
+ line = nextLine()
+ row++
+ if strings.HasPrefix(line, twelvespaces) {
+ txt := strings.TrimPrefix(line, twelvespaces)
+ tax := readContinuationLines(txt)
+ tax = strings.TrimSuffix(tax, ".")
+
+ writeOneElement(" ", "INSDSeq_taxonomy", tax)
+ }
+ }
+
+ rec.WriteString(" <INSDSeq_references>\n")
+ for {
+ if !strings.HasPrefix(line, "REFERENCE") {
+ // exit out of reference section
+ break
+ }
+
+ ref := "0"
+
+ rec.WriteString(" <INSDReference>\n")
+
+ str := strings.TrimPrefix(line, "REFERENCE")
+ str = CompressRunsOfSpaces(str)
+ str = strings.TrimSpace(str)
+ idx := strings.Index(str, "(")
+ if idx > 0 {
+ ref = strings.TrimSpace(str[:idx])
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+
+ posn := str[idx+1:]
+ posn = strings.TrimSuffix(posn, ")")
+ posn = strings.TrimSpace(posn)
+ if posn == "sites" {
+
+ writeOneElement(" ", "INSDReference_position", posn)
+
+ } else {
+ cols := strings.Fields(posn)
+ if len(cols) == 4 && cols[2] == "to" {
+
+ writeOneElement(" ", "INSDReference_position", cols[1]+".."+cols[3])
+
+ } else {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ }
+ }
+ } else {
+ ref = strings.TrimSpace(str)
+
+ writeOneElement(" ", "INSDReference_reference", ref)
+ }
+ line = nextLine()
+ row++
+
+ if strings.HasPrefix(line, " AUTHORS") {
+
+ txt := strings.TrimPrefix(line, " AUTHORS")
+ auths := readContinuationLines(txt)
+
+ rec.WriteString(" <INSDReference_authors>\n")
+ authors := strings.Split(auths, ", ")
+ for _, auth := range authors {
+ auth = strings.TrimSpace(auth)
+ if auth == "" {
+ continue
+ }
+ pair := strings.Split(auth, " and ")
+ for _, name := range pair {
+
+ writeOneElement(" ", "INSDAuthor", name)
+ }
+ }
+ rec.WriteString(" </INSDReference_authors>\n")
+ }
+
+ if strings.HasPrefix(line, " CONSRTM") {
+
+ txt := strings.TrimPrefix(line, " CONSRTM")
+ cons := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_consortium", cons)
+ }
+
+ if strings.HasPrefix(line, " TITLE") {
+
+ txt := strings.TrimPrefix(line, " TITLE")
+ titl := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_title", titl)
+ }
+
+ if strings.HasPrefix(line, " JOURNAL") {
+
+ txt := strings.TrimPrefix(line, " JOURNAL")
+ jour := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_journal", jour)
+ }
+
+ if strings.HasPrefix(line, " PUBMED") {
+
+ txt := strings.TrimPrefix(line, " PUBMED")
+ pmid := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_pubmed", pmid)
+ }
+
+ if strings.HasPrefix(line, " REMARK") {
+
+ txt := strings.TrimPrefix(line, " REMARK")
+ rem := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDReference_remark", rem)
+ }
+
+ // end of this reference
+ rec.WriteString(" </INSDReference>\n")
+ // continue to next reference
+ }
+ rec.WriteString(" </INSDSeq_references>\n")
+
+ if strings.HasPrefix(line, "COMMENT") {
+
+ txt := strings.TrimPrefix(line, "COMMENT")
+ com := readContinuationLines(txt)
+
+ writeOneElement(" ", "INSDSeq_comment", com)
+ }
+
+ rec.WriteString(" <INSDSeq_feature-table>\n")
+ if strings.HasPrefix(line, "FEATURES") {
+
+ line = nextLine()
+ row++
+
+ for {
+ if !strings.HasPrefix(line, " ") {
+ // exit out of features section
+ break
+ }
+ if len(line) < 22 {
+ fmt.Fprintf(os.Stderr, "ERROR: %s\n", line)
+ line = nextLine()
+ row++
+ continue
+ }
+
+ rec.WriteString(" <INSDFeature>\n")
+
+ // read feature key and start of location
+ fkey := line[5:21]
+ fkey = strings.TrimSpace(fkey)
+
+ writeOneElement(" ", "INSDFeature_key", fkey)
+
+ loc := line[21:]
+ loc = strings.TrimSpace(loc)
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of location, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ loc += strings.TrimSpace(txt)
+ }
+
+ writeOneElement(" ", "INSDFeature_location", loc)
+
+ location_operator := ""
+ is_comp := false
+ prime5 := false
+ prime3 := false
+
+ // parseloc recursive definition
+ var parseloc func(string) []string
+
+ parseloc = func(str string) []string {
+
+ var acc []string
+
+ if strings.HasPrefix(str, "join(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "join"
+
+ str = strings.TrimPrefix(str, "join(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "order(") && strings.HasSuffix(str, ")") {
+
+ location_operator = "order"
+
+ str = strings.TrimPrefix(str, "order(")
+ str = strings.TrimSuffix(str, ")")
+ items := strings.Split(str, ",")
+
+ for _, thisloc := range items {
+ inner := parseloc(thisloc)
+ for _, sub := range inner {
+ acc = append(acc, sub)
+ }
+ }
+
+ } else if strings.HasPrefix(str, "complement(") && strings.HasSuffix(str, ")") {
+
+ is_comp = true
+
+ str = strings.TrimPrefix(str, "complement(")
+ str = strings.TrimSuffix(str, ")")
+ items := parseloc(str)
+
+ // reverse items
+ for i, j := 0, len(items)-1; i < j; i, j = i+1, j-1 {
+ items[i], items[j] = items[j], items[i]
+ }
+
+ // reverse from and to positions, flip direction of angle brackets (partial flags)
+ for _, thisloc := range items {
+ pts := strings.Split(thisloc, "..")
+ ln := len(pts)
+ if ln == 2 {
+ fst := pts[0]
+ scd := pts[1]
+ lf := ""
+ rt := ""
+ if strings.HasPrefix(fst, "<") {
+ fst = strings.TrimPrefix(fst, "<")
+ rt = ">"
+ }
+ if strings.HasPrefix(scd, ">") {
+ scd = strings.TrimPrefix(scd, ">")
+ lf = "<"
+ }
+ acc = append(acc, lf+scd+".."+rt+fst)
+ } else if ln > 0 {
+ acc = append(acc, pts[0])
+ }
+ }
+
+ } else {
+
+ // save individual interval or point if no leading accession
+ if !strings.Contains(str, ":") {
+ acc = append(acc, str)
+ }
+ }
+
+ return acc
+ }
+
+ items := parseloc(loc)
+
+ rec.WriteString(" <INSDFeature_intervals>\n")
+
+ num_ivals := 0
+
+ // report individual intervals
+ for _, thisloc := range items {
+ if thisloc == "" {
+ continue
+ }
+
+ num_ivals++
+
+ rec.WriteString(" <INSDInterval>\n")
+ pts := strings.Split(thisloc, "..")
+ if len(pts) == 2 {
+
+ // fr..to
+ fr := pts[0]
+ to := pts[1]
+ if strings.HasPrefix(fr, "<") {
+ fr = strings.TrimPrefix(fr, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(to, ">") {
+ to = strings.TrimPrefix(to, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ crt := strings.Split(thisloc, "^")
+ if len(crt) == 2 {
+
+ // fr^to
+ fr := crt[0]
+ to := crt[1]
+ writeOneElement(" ", "INSDInterval_from", fr)
+ writeOneElement(" ", "INSDInterval_to", to)
+ if is_comp {
+ rec.WriteString(" <INSDInterval_iscomp value=\"true\"/>\n")
+ }
+ rec.WriteString(" <INSDInterval_interbp value=\"true\"/>\n")
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+
+ } else {
+
+ // pt
+ pt := pts[0]
+ if strings.HasPrefix(pt, "<") {
+ pt = strings.TrimPrefix(pt, "<")
+ prime5 = true
+ }
+ if strings.HasPrefix(pt, ">") {
+ pt = strings.TrimPrefix(pt, ">")
+ prime3 = true
+ }
+ writeOneElement(" ", "INSDInterval_point", pt)
+ writeOneElement(" ", "INSDInterval_accession", accnver)
+ }
+ }
+ rec.WriteString(" </INSDInterval>\n")
+ }
+
+ rec.WriteString(" </INSDFeature_intervals>\n")
+
+ if num_ivals > 1 {
+ writeOneElement(" ", "INSDFeature_operator", location_operator)
+ }
+ if prime5 {
+ rec.WriteString(" <INSDFeature_partial5 value=\"true\"/>\n")
+ }
+ if prime3 {
+ rec.WriteString(" <INSDFeature_partial3 value=\"true\"/>\n")
+ }
+
+ hasQual := false
+ for {
+ if !strings.HasPrefix(line, twentyonespaces) {
+ // if not qualifier line, break out of loop
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ qual := ""
+ val := ""
+ if strings.HasPrefix(txt, "/") {
+ if !hasQual {
+ hasQual = true
+ rec.WriteString(" <INSDFeature_quals>\n")
+ }
+ // read new qualifier and start of value
+ qual = strings.TrimPrefix(txt, "/")
+ qual = strings.TrimSpace(qual)
+ idx := strings.Index(qual, "=")
+ if idx > 0 {
+ val = qual[idx+1:]
+ qual = qual[:idx]
+ }
+
+ for {
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twentyonespaces) {
+ break
+ }
+ txt := strings.TrimPrefix(line, twentyonespaces)
+ if strings.HasPrefix(txt, "/") {
+ // if not continuation of qualifier, break out of loop
+ break
+ }
+ // append subsequent line to value and continue with loop
+ if qual == "transcription" || qual == "translation" || qual == "peptide" || qual == "anticodon" {
+ val += strings.TrimSpace(txt)
+ } else {
+ val += " " + strings.TrimSpace(txt)
+ }
+ }
+
+ rec.WriteString(" <INSDQualifier>\n")
+
+ writeOneElement(" ", "INSDQualifier_name", qual)
+
+ val = strings.TrimPrefix(val, "\"")
+ val = strings.TrimSuffix(val, "\"")
+ val = strings.TrimSpace(val)
+ if val != "" {
+
+ writeOneElement(" ", "INSDQualifier_value", val)
+ }
+
+ rec.WriteString(" </INSDQualifier>\n")
+ }
+ }
+ if hasQual {
+ rec.WriteString(" </INSDFeature_quals>\n")
+ }
+
+ // end of this feature
+ rec.WriteString(" </INSDFeature>\n")
+ // continue to next feature
+ }
+ }
+ rec.WriteString(" </INSDSeq_feature-table>\n")
+
+ if strings.HasPrefix(line, "CONTIG") {
+
+ // pathological records can have over 90,000 components, use strings.Builder
+ con.Reset()
+
+ txt := strings.TrimPrefix(line, "CONTIG")
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ for {
+ // read next line
+ line = nextLine()
+ row++
+ if !strings.HasPrefix(line, twelvespaces) {
+ // if not continuation of contig, break out of loop
+ break
+ }
+ // append subsequent line and continue with loop
+ txt = strings.TrimPrefix(line, twelvespaces)
+ txt = strings.TrimSpace(txt)
+ con.WriteString(txt)
+ }
+ }
+
+ if strings.HasPrefix(line, "BASE COUNT") {
+
+ txt := strings.TrimPrefix(line, "BASE COUNT")
+ readContinuationLines(txt)
+ // not supported
+ }
+
+ if strings.HasPrefix(line, "ORIGIN") {
+
+ line = nextLine()
+ row++
+ }
+
+ // remainder should be sequence
+
+ // sequence can be millions of bases, use strings.Builder
+ seq.Reset()
+
+ for line != "" {
+
+ if strings.HasPrefix(line, "//") {
+
+ // end of record, print collected sequence
+ str := seq.String()
+ if str != "" {
+
+ writeOneElement(" ", "INSDSeq_sequence", str)
+ }
+ seq.Reset()
+
+ // print contig section
+ str = con.String()
+ str = strings.TrimSpace(str)
+ if str != "" {
+ writeOneElement(" ", "INSDSeq_contig", str)
+ }
+ con.Reset()
+
+ // end of record
+ rec.WriteString(" </INSDSeq>\n")
+
+ // send formatted record down channel
+ txt := rec.String()
+ out <- txt
+ rec.Reset()
+ // go to top of loop for next record
+ break
+ }
+
+ // read next sequence line
+
+ cols := strings.Fields(line)
+ for _, str := range cols {
+
+ if IsAllDigits(str) {
+ continue
+ }
+
+ // append letters to sequence
+ seq.WriteString(str)
+ }
+
+ // read next line and continue
+ line = nextLine()
+ row++
+
+ }
+
+ // continue to next record
+ }
+ }
+
+ // launch single converter goroutine
+ go convertGenBank(inp, out)
+
+ return out
+}
+
// MAIN FUNCTION
// e.g., xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -element Initials,LastName
@@ -9297,6 +10109,62 @@ func main() {
return
}
+ // READ GENBANK FLATFILE AND TRANSLATE TO INSDSEQ XML
+
+ // must be called before CreateReader starts draining stdin
+ if len(args) > 0 && args[0] == "-g2x" {
+
+ gbk := GenBankConverter(in)
+
+ if gbk == nil {
+ fmt.Fprintf(os.Stderr, "Unable to create GenBank to XML converter\n")
+ os.Exit(1)
+ }
+
+ head := `<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
+<INSDSet>
+`
+ tail := ""
+
+ // drain output of last channel in service chain
+ for str := range gbk {
+
+ if str == "" {
+ continue
+ }
+
+ if head != "" {
+ os.Stdout.WriteString(head)
+ head = ""
+ tail = `</INSDSet>
+`
+ }
+
+ // send result to stdout
+ os.Stdout.WriteString(str)
+ if !strings.HasSuffix(str, "\n") {
+ os.Stdout.WriteString("\n")
+ }
+
+ recordCount++
+
+ runtime.Gosched()
+ }
+
+ if tail != "" {
+ os.Stdout.WriteString(tail)
+ }
+
+ debug.FreeOSMemory()
+
+ if timr {
+ printDuration("records")
+ }
+
+ return
+ }
+
// CREATE XML BLOCK READER FROM STDIN OR FILE
rdr := CreateReader(in)
@@ -9468,10 +10336,9 @@ func main() {
switch args[0] {
case "-format":
- ProcessFormat(rdr, args)
+ recordCount = ProcessFormat(rdr, args, timr)
case "-verify", "-validate":
- CountLines = true
- ProcessVerify(rdr, args)
+ recordCount = ProcessVerify(rdr, args)
case "-filter":
ProcessFilter(rdr, args)
case "-normalize", "-normal":
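In the qualifier loop of the converter diff above, continuation lines are joined with a single space except for sequence-like qualifiers, whose values must stay unbroken. A sketch of that rule, with an illustrative function name:

```go
package main

import (
	"fmt"
	"strings"
)

// joinContinuation appends one wrapped continuation line to a qualifier value.
// Sequence-style qualifiers (transcription, translation, peptide, anticodon)
// are concatenated without a separator so the residue or base string stays
// unbroken; prose qualifiers get a single space between fragments.
func joinContinuation(qual, val, next string) string {
	trimmed := strings.TrimSpace(next)
	switch qual {
	case "transcription", "translation", "peptide", "anticodon":
		return val + trimmed
	}
	return val + " " + trimmed
}

func main() {
	fmt.Println(joinContinuation("translation", "MKV", "   LLT"))
	// → MKVLLT
	fmt.Println(joinContinuation("note", "first part", "   second part"))
	// → first part second part
}
```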
View it on GitLab: https://salsa.debian.org/med-team/ncbi-entrez-direct/-/compare/36914b997fe7b6f71dbe208a0f2e29e2ca47f056...3ff79da3f1d0d11d2976f414edfd56adcf372b89