[med-svn] [r-cran-rentrez] 03/05: New upstream version 1.0.4
Andreas Tille
tille at debian.org
Sat Sep 30 08:37:58 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository r-cran-rentrez.
commit a857d51b7a48552df06a31b9dcc42753cddd95a8
Author: Andreas Tille <tille at debian.org>
Date: Sat Sep 30 10:33:14 2017 +0200
New upstream version 1.0.4
---
DESCRIPTION | 29 ++
LICENSE | 2 +
MD5 | 50 +++
NAMESPACE | 43 +++
NEWS | 102 ++++++
R/base.r | 135 ++++++++
R/entrez_citmatch.r | 45 +++
R/entrez_fetch.r | 73 +++++
R/entrez_global_query.r | 31 ++
R/entrez_info.r | 173 ++++++++++
R/entrez_link.r | 235 ++++++++++++++
R/entrez_post.r | 43 +++
R/entrez_search.r | 120 +++++++
R/entrez_summary.r | 188 +++++++++++
R/help.r | 19 ++
R/parse_pubmed_xml.r | 102 ++++++
build/vignette.rds | Bin 0 -> 210 bytes
debian/README.test | 8 -
debian/changelog | 5 -
debian/compat | 1 -
debian/control | 26 --
debian/copyright | 32 --
debian/docs | 3 -
debian/rules | 5 -
debian/source/format | 1 -
debian/tests/control | 3 -
debian/tests/run-unit-test | 13 -
debian/watch | 2 -
inst/doc/rentrez_tutorial.R | 183 +++++++++++
inst/doc/rentrez_tutorial.Rmd | 627 +++++++++++++++++++++++++++++++++++
inst/doc/rentrez_tutorial.html | 722 +++++++++++++++++++++++++++++++++++++++++
man/entrez_citmatch.Rd | 40 +++
man/entrez_db_links.Rd | 44 +++
man/entrez_db_searchable.Rd | 41 +++
man/entrez_db_summary.Rd | 40 +++
man/entrez_dbs.Rd | 29 ++
man/entrez_fetch.Rd | 67 ++++
man/entrez_global_query.Rd | 29 ++
man/entrez_info.Rd | 43 +++
man/entrez_link.Rd | 81 +++++
man/entrez_post.Rd | 42 +++
man/entrez_search.Rd | 86 +++++
man/entrez_summary.Rd | 84 +++++
man/extract_from_esummary.Rd | 22 ++
man/linkout_urls.Rd | 22 ++
man/parse_pubmed_xml.Rd | 30 ++
man/rentrez.Rd | 23 ++
tests/test-all.R | 10 +
tests/testthat/test_citmatch.r | 12 +
tests/testthat/test_docs.r | 18 +
tests/testthat/test_fetch.r | 31 ++
tests/testthat/test_httr.r | 9 +
tests/testthat/test_info.r | 50 +++
tests/testthat/test_link.r | 68 ++++
tests/testthat/test_net.r | 4 +
tests/testthat/test_parse.r | 44 +++
tests/testthat/test_post.r | 35 ++
tests/testthat/test_query.r | 46 +++
tests/testthat/test_search.r | 34 ++
tests/testthat/test_summary.r | 79 +++++
tests/testthat/test_webenv.r | 15 +
vignettes/rentrez_tutorial.Rmd | 627 +++++++++++++++++++++++++++++++++++
62 files changed, 4727 insertions(+), 99 deletions(-)
diff --git a/DESCRIPTION b/DESCRIPTION
new file mode 100644
index 0000000..b95d8ba
--- /dev/null
+++ b/DESCRIPTION
@@ -0,0 +1,29 @@
+Package: rentrez
+Version: 1.0.4
+Date: 2016-10-26
+Title: Entrez in R
+Authors@R: c(
+ person("David", "Winter", role=c("aut", "cre"),
+ email = "david.winter at gmail.com"),
+ person("Scott", "Chamberlain", role="ctb", email = "myrmecocystus at gmail.com"),
+ person("Han","Guangchun", role=c("ctb"),email="hanguangchun at gmail.com")
+ )
+Depends: R (>= 2.6.0)
+Imports: XML, httr (>= 0.5), jsonlite (>= 0.9)
+Suggests: testthat, knitr, rmarkdown
+URL: http://github.com/ropensci/rentrez
+BugReports: https://github.com/ropensci/rentrez/issues
+Description: Provides an R interface to the NCBI's EUtils API
+ allowing users to search databases like GenBank and PubMed, process the
+ results of those searches and pull data into their R sessions.
+VignetteBuilder: knitr
+License: MIT + file LICENSE
+RoxygenNote: 5.0.1
+NeedsCompilation: no
+Packaged: 2016-10-25 21:45:43 UTC; dwinter
+Author: David Winter [aut, cre],
+ Scott Chamberlain [ctb],
+ Han Guangchun [ctb]
+Maintainer: David Winter <david.winter at gmail.com>
+Repository: CRAN
+Date/Publication: 2016-10-26 10:37:53
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..16a01c3
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,2 @@
+YEAR: 2012-2016
+COPYRIGHT HOLDER: David Winter
diff --git a/MD5 b/MD5
new file mode 100644
index 0000000..7ccd2ed
--- /dev/null
+++ b/MD5
@@ -0,0 +1,50 @@
+c46cd1b0d179ed94f36607f4d794fcba *DESCRIPTION
+9cc081ea2d963c6df84446d83052ad2e *LICENSE
+734927c7997434b0a0b33184d323cbc1 *NAMESPACE
+65508a8c8d00af2fe9a9d7c878998f45 *NEWS
+bd61a06a1bf850c5e8382b0b709d2e74 *R/base.r
+111faade12c3113b9e790591a06e4d6e *R/entrez_citmatch.r
+eaba24477be03709dc9165be72ea0af3 *R/entrez_fetch.r
+ecdd5a491cbac7d4724dc8f9fc9ed9c9 *R/entrez_global_query.r
+033301b7c9d56c3a42748f7831967acd *R/entrez_info.r
+bee110dd037576d16a0dab0b86fcd72a *R/entrez_link.r
+6df3c3820aee8f91f0da62f7dd6af96a *R/entrez_post.r
+ac8f2afdf128547c59b48c895e22e73e *R/entrez_search.r
+fa7a54a7b0bc597ca1a0dc36296256db *R/entrez_summary.r
+8bc803d43b3e90e932d6c82894f59650 *R/help.r
+646f614d14b267f07f6289e0cc54f357 *R/parse_pubmed_xml.r
+b619f815cdf95e7e3fd71991e1a4c3fb *build/vignette.rds
+c4e8efa0982cdfe1a54a41d79f3c45cb *inst/doc/rentrez_tutorial.R
+d6b6f2d94601a8060eccf9138e1e13d4 *inst/doc/rentrez_tutorial.Rmd
+ca1f984d47ba7da9814366c7b7932a6c *inst/doc/rentrez_tutorial.html
+6a9d7d59c190e52d444f99df98d7250b *man/entrez_citmatch.Rd
+1004875de1892a54ebdd08af57b41888 *man/entrez_db_links.Rd
+7b9430ebdf08921fef74c053dc79f338 *man/entrez_db_searchable.Rd
+1c76c77d9a4ecc1b4ec24b01f5a2b478 *man/entrez_db_summary.Rd
+f014c478bccc89fa7ee923bb1ca257fe *man/entrez_dbs.Rd
+9b5edd0123abaf9f940ac4633f35b153 *man/entrez_fetch.Rd
+1c247dda2af926fc6c1e60cfa2ca837e *man/entrez_global_query.Rd
+c76debb88267f7eededddc9e5dd1ec9e *man/entrez_info.Rd
+2039f9f031b0939a53dc1b7ef5780c65 *man/entrez_link.Rd
+ff390bf26de7ede2ef8830df1a71a900 *man/entrez_post.Rd
+bda0d4b88d02a9c9dd5dcea23c73170c *man/entrez_search.Rd
+3b98e2c7e97907c4379f44fc8eda8fff *man/entrez_summary.Rd
+ed7f3e55c89c45441b594e068f429148 *man/extract_from_esummary.Rd
+631c78e4d1fa5abd0f9c7000ff0196b7 *man/linkout_urls.Rd
+7f25c20a98fd461a8d73b5d00643a6a8 *man/parse_pubmed_xml.Rd
+617722e7b51351ef3bb9153ef67f0d46 *man/rentrez.Rd
+db04e7147a14d952e0ae8c93d1390087 *tests/test-all.R
+0c4b51d40ae63cbfdcfac31cd67edb96 *tests/testthat/test_citmatch.r
+4edd85844f931fee501b87861537459c *tests/testthat/test_docs.r
+eb3281531e131d2c8fb84bcac1bb1cc8 *tests/testthat/test_fetch.r
+a4c45c8f355eafbc660aede214e6f526 *tests/testthat/test_httr.r
+6f1a4c681ca3b43318b45fdf4a87221f *tests/testthat/test_info.r
+c6822fe9c387ac4be79cd407ea7cd9b4 *tests/testthat/test_link.r
+1ac649cfb5ba8744d2d62ef182c6bad9 *tests/testthat/test_net.r
+d9c769bcd94e0464e232d79ca6a063db *tests/testthat/test_parse.r
+a2bfe354a3cecc892df44f53fd6c9f58 *tests/testthat/test_post.r
+5a51a6ccff29b6a57e393edbef9331e2 *tests/testthat/test_query.r
+d8b42d6257b50ed6dba47f0d00c28875 *tests/testthat/test_search.r
+9c6bee496f6534a5d8518582277d8c33 *tests/testthat/test_summary.r
+c435f7927e6e2f015bf43780f4957e1e *tests/testthat/test_webenv.r
+d6b6f2d94601a8060eccf9138e1e13d4 *vignettes/rentrez_tutorial.Rmd
diff --git a/NAMESPACE b/NAMESPACE
new file mode 100644
index 0000000..abb6355
--- /dev/null
+++ b/NAMESPACE
@@ -0,0 +1,43 @@
+# Generated by roxygen2: do not edit by hand
+
+S3method(as.data.frame,eInfoList)
+S3method(extract_from_esummary,esummary)
+S3method(extract_from_esummary,esummary_list)
+S3method(print,eInfoEntry)
+S3method(print,eInfoLink)
+S3method(print,eInfoSearch)
+S3method(print,elink)
+S3method(print,elink_classic)
+S3method(print,elink_list)
+S3method(print,esearch)
+S3method(print,esummary)
+S3method(print,esummary_list)
+S3method(print,linkout)
+S3method(print,multi_pubmed_record)
+S3method(print,pubmed_record)
+S3method(print,web_history)
+export(entrez_citmatch)
+export(entrez_db_links)
+export(entrez_db_searchable)
+export(entrez_db_summary)
+export(entrez_dbs)
+export(entrez_fetch)
+export(entrez_global_query)
+export(entrez_info)
+export(entrez_link)
+export(entrez_post)
+export(entrez_search)
+export(entrez_summary)
+export(extract_from_esummary)
+export(linkout_urls)
+export(parse_pubmed_xml)
+importFrom(XML,xmlChildren)
+importFrom(XML,xmlGetAttr)
+importFrom(XML,xmlName)
+importFrom(XML,xmlSApply)
+importFrom(XML,xmlToList)
+importFrom(XML,xmlTreeParse)
+importFrom(XML,xmlValue)
+importFrom(XML,xpathApply)
+importFrom(XML,xpathSApply)
+importFrom(jsonlite,fromJSON)
diff --git a/NEWS b/NEWS
new file mode 100644
index 0000000..cdc067c
--- /dev/null
+++ b/NEWS
@@ -0,0 +1,102 @@
+Version 1.0.3
+------------------
+Update to only use https
+ * NCBI is going all https, rentrez will only use https from now on.
+ * Added links to repo/bug reporting to DESCRIPTION
+ * Documented changes to sequence database XML records
+ * Allow automatic parsing of XML flavours
+
+
+Version 1.0.2
+-------------------
+Bug fix release
+ * Tests now work with testthat 1.0.0
+ * All calls to ncbi specify encoding is UTF-8 (saving error messages)
+ * HTTP Error codes associated with large requests now give the user a hint
+ to check out the documentation for web-history features
+
+Version 1.0.1
+-------------------
+Bug fix release
+ * Properly format "by_id" mode URLS (bug exposed by httr 1.0.1)
+ * Handle case in which some IDs passed to "by_id" mode are invalid (thanks
+ Zachary Foster for report)
+ * Documentation updated to reflect OMIM->SNP links no longer possible
+ * Use Rmarkdown (not knitr) as vignette builder
+ * Return NCBI error messages as text when they exist
+
+Version 1.0.0
+--------------------
+ * new function extract_from_esummary() for extracting like-named elements
+ from a list of esummary records (e.g. get all "Title" fields from a list of
+ PubMed esummaries)
+ * Support for `cmd` option in entrez_link (breaks backward compatibility)
+ * Allows discovery of external links from and use of web_history
+ * New helper function linkout_urls to get URLs from external links
+ * Support for 'by_id' mode for entrez_link. Pass a vector of IDs to
+ entrez_link, (optionally) get a list of elink objects back (one per ID)
+ * New web_history object makes using NCBI Web History features easier
+ * All of these changes documented in new vignette
+ * Han Guangchun added as contributor for his pull requests
+ * New tests, minor bug fixes and extended documentation
+
+
+
+Version 0.4.1
+---------------------
+* Bug fix: The example for entrez_summary contained a typo which made it fail
+ (being wrapped in donttest this hadn't previously shown up).
+
+Version 0.4
+------------------------
+ * entrez_summary now fetches 'version 2.0' esummary records from NCBI
+ * This change may break some scripts. In particular, the names of some
+ elements in esummary records have changed. Broken scripts should produce a
+ helpful error message, and using entrez_summary(..., version="1.0")
+ should fix it. More details are given in the help to entrez_summary.
+ * When version 2.0 records are requested entrez_summary fetches the json
+ record.
+ * New helper functions for einfo Eutil
+ * entrez_dbs() lists available databases.
+ * entrez_db_summary() gets summary information about a given database.
+ * entrez_db_links() lists databases against which a given db's records might
+ be cross referenced.
+ * entrez_db_searchable() lists search terms available for a given database.
+ * Nicer print functions for search and summary objects
+ * New dependency on jsonlite for handling json records.
+ * Bunch of bugs squashed and typos cleaned up
+
+Version 0.3.1
+------------------------
+ * Squashed a bug in the vignette which wrote to users' $HOME
+
+Version 0.3
+------------------------
+ * using httr to handle HTTP GETs and some url building
+ * esummary parsing for clinvar database
+ * Scott Chamberlain added as contributor for above
+ * Pubmed parser handles multi-record files
+ * html vignette included
+
+Version 0.2.4
+-------------------------
+ * minor release to fix bug in esummary parsing
+
+Version 0.2.3
+---------------------------------
+ * Edited license/description to meet CRAN requirements
+ * Added sentence to description to summarise the package
+
+
+Version 0.2.2
+--------------------------------
+
+ * Parsing of esummary xmls is now much nicer.
+ * S3 items to represent most results
+ * Tests to cover all functions
+
+
+Version 0.1.1
+---------------------------------
+ * First release on CRAN + now part of ROpenSci
+ * Functions cover the whole EUtils API
diff --git a/R/base.r b/R/base.r
new file mode 100755
index 0000000..46deeee
--- /dev/null
+++ b/R/base.r
@@ -0,0 +1,135 @@
+#What's going on under the hood. As far as possible we are following the best
+#practices for API packages suggested by hadley/httr:
+#
+# http://cran.r-project.org/web/packages/httr/vignettes/api-packages.html
+#
+#and also conforming to the NCBI's requirements about rate limiting and
+#adding identifiers to each request:
+#
+# http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requirements
+#
+
+
+
+
+#As per NCBI's documentation -- we set tool developer's email and tool name:
+entrez_email <- function() 'david.winter at gmail.com'
+entrez_tool <- function() 'rentrez'
+
+#Create a URL for the EUtils API.
+#
+# This function is used by all the API-querying functions in rentrez to build
+# the appropriate url. Arguments required by a particular EUtil are checked in
+# the calling function. Records are identified either by ID(s) or by a WebEnv
+# cookie; both are passed to `make_entrez_query` as named arguments via `...`
+#
+#
+# efetch_xml <- make_entrez_query("efetch", config=NULL,
+# id=c(23310964,23310965), db="pubmed",
+# rettype="xml")
+#
+
+
+make_entrez_query <- function(util, config, interface=".fcgi?", by_id=FALSE, ...){
+ uri <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/", util, interface)
+ args <- list(..., email=entrez_email(), tool=entrez_tool())
+ if(by_id){
+ ids_string <- paste0("id=", args$id, collapse="&")
+ args$id <- NULL
+ uri <- paste0(uri, ids_string)
+ }else{
+ if("id" %in% names(args)){
+ args$id <- paste(args$id, collapse=",")
+ }
+ }
+ response <- httr::GET(uri, query=args, config= config)
+ entrez_check(response)
+ httr::content(response, as="text", encoding="UTF-8")
+}
+
+##
+# Check that either the ID or the web_history argument (but not both) is
+# specified for those functions that need one.
+##
+
+id_or_webenv <- function(){
+ args <- sys.frame(sys.parent())
+ msg <- "Must specify either (not both) 'id' or 'web_history' arguments"
+ if(!is.null(args$id)){
+ if(!is.null(args$web_history)){
+ stop(msg, call.=FALSE)
+ }
+ return(list(id=args$id))
+ }
+ if(is.null(args$web_history)){
+ stop(msg, call.=FALSE)
+ }
+ list(WebEnv=args$web_history$WebEnv, query_key=args$web_history$QueryKey)
+}
+
+
+entrez_check <- function(req){
+ if (req$status_code < 400) {
+ return(invisible())
+ }
+ if (req$status_code == 414){
+ stop("HTTP failure 414, the request is too large. For large requests, try using web history as described in the rentrez tutorial")
+ }
+ if (req$status_code == 502){
+ stop("HTTP failure: 502, bad gateway. This error code is often returned when trying to download many records in a single request. Try using web history as described in the rentrez tutorial")
+ }
+ message <- httr::content(req, as="text", encoding="UTF-8")
+ stop("HTTP failure: ", req$status_code, "\n", message, call. = FALSE)
+}
+
+
+#Does a parsed-xml object contain ERRORs as reported by NCBI
+#(i.e. <ERROR> entries in a valid XML):
+check_xml_errors <- function(x){
+ errs <- x["//ERROR"]
+ if( length(errs) > 0){
+ for(e in errs){
+ warning(xmlValue(e))
+ }
+ }
+ invisible()
+}
+
+
+parse_response <- function(x, type=NULL){
+ res <- switch(type,
+ "json" = fromJSON(x),
+ "xml" = xmlTreeParse(x, useInternalNodes=TRUE),
+ "native" = xmlTreeParse(x, useInternalNodes=TRUE),
+ "gbc" = xmlTreeParse(x, useInternalNodes=TRUE),
+ "ipg" = xmlTreeParse(x, useInternalNodes=TRUE),
+ "text" = x, #citmatch uses plain old plain text
+ x #fall-through, if in doubt, return un-parsed response
+ )
+ return(res)
+}
+
+#constructor for web history objects
+web_history <- function(WebEnv, QueryKey){
+ res <- list(WebEnv=WebEnv, QueryKey=QueryKey)
+ class(res) <- list("web_history", "list")
+ res
+}
+
+#'@export
+print.web_history <- function(x, ...){
+ cat("Web history object (QueryKey = ", x$QueryKey,
+ ", WebEnv = ", substr(x$WebEnv, 1, 12), "...", ")\n",sep="")
+}
+
+
+
+add_class <- function(x, new_class){
+ class(x) <- c(new_class, class(x))
+ x
+}
+
+.last <- function(s){
+ len <- nchar(s)
+ substr(s, len, len)
+}
diff --git a/R/entrez_citmatch.r b/R/entrez_citmatch.r
new file mode 100644
index 0000000..80bd911
--- /dev/null
+++ b/R/entrez_citmatch.r
@@ -0,0 +1,45 @@
+#' Fetch pubmed ids matching specially formatted citation strings
+#'
+#'@param bdata character, containing citation data.
+#' Each citation must be represented in a pipe-delimited format
+#' journal_title|year|volume|first_page|author_name|your_key|
+#' The final field "your_key" is arbitrary, and can be used as you see
+#' fit. Fields can be left empty, but be sure to keep 6 pipes.
+#'@param db character, the database to search. Defaults to pubmed,
+#' the only database currently available
+#'@param retmode character, file format to retrieve. Defaults to xml, as
+#' per the API documentation, though note the API only returns plain text
+#'@param config vector configuration options passed to httr::GET
+#'@return A character vector containing PMIDs
+#'@seealso \code{\link[httr]{config}} for available configs
+#'@export
+#'@examples
+#'\donttest{
+#' ex_cites <- c("proc natl acad sci u s a|1991|88|3248|mann bj|test1|",
+#' "science|1987|235|182|palmenberg ac|test2|")
+#' entrez_citmatch(ex_cites)
+#'}
+entrez_citmatch <- function(bdata, db="pubmed", retmode="xml", config=NULL){
+ if(length(bdata) > 1){
+ bdata <- paste0(bdata, collapse="\r")
+ }
+ if(substring(bdata, nchar(bdata)) != "|") bdata <- paste0(bdata, "|")
+ request <- make_entrez_query("ecitmatch",
+ bdata=bdata,
+ db=db,
+ retmode=retmode,
+ interface=".cgi?",
+ config=config)
+ results <- strsplit(strsplit(request, "\n")[[1]], "\\|")
+ sapply(results, extract_pmid)
+}
+
+extract_pmid <- function(line){
+ tryCatch("[["(line,7),
+ error=function(e){
+ warning(paste("No pmid found for line", line))
+ NA
+ }
+ )
+
+}
diff --git a/R/entrez_fetch.r b/R/entrez_fetch.r
new file mode 100755
index 0000000..e3aa445
--- /dev/null
+++ b/R/entrez_fetch.r
@@ -0,0 +1,73 @@
+#' Download data from NCBI databases
+#'
+#' A set of unique identifiers must be specified with either the \code{id}
+#' argument (which directly specifies the IDs as a numeric or character vector)
+#' or a \code{web_history} object as returned by
+#' \code{\link{entrez_link}}, \code{\link{entrez_search}} or
+#' \code{\link{entrez_post}}. See Table 1 in the linked reference for the set of
+#' formats available for each database. In particular, note that sequence
+#' databases (nuccore, protein and their relatives) use specific format names
+#' (eg "native", "ipg") for different flavours of xml.
+#'
+#' For the most part, this function returns a character vector containing the
+#' fetched records. For XML records (including 'native', 'ipg', 'gbc' sequence
+#' records), setting \code{parsed} to \code{TRUE} will return an
+#' \code{XMLInternalDocument}.
+#'
+#'@export
+#'@param db character, name of the database to use
+#'@param id vector (numeric or character), unique ID(s) for records in database \code{db}
+#'@param web_history a web_history object
+#'@param rettype character, format in which to get data (eg, fasta, xml...)
+#'@param retmode character, mode in which to receive data, defaults to 'text'
+#'@param config vector, httr configuration options passed to httr::GET
+#'@param \dots character, additional terms to add to the request, see NCBI
+#'documentation linked to in references for a complete list
+#'@references \url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_}
+#'@param parsed boolean should entrez_fetch attempt to parse the resulting
+#' file. Only works with xml records (including those with rettypes other than
+#' "xml") at present
+#'@seealso \code{\link[httr]{config}} for available configs
+#'@return character string containing the file created
+#'@return XMLInternalDocument a parsed XML document if parsed=TRUE and
+#'rettype is a flavour of XML.
+#'
+#' @examples
+#' \dontrun{
+#' katipo <- "Latrodectus katipo[Organism]"
+#' katipo_search <- entrez_search(db="nuccore", term=katipo)
+#' katipo_seqs <- entrez_fetch(db="nuccore", id=katipo_search$ids, rettype="fasta")
+#' #xml
+#' katipo_seqs <- entrez_fetch(db="nuccore", id=katipo_search$ids, rettype="native")
+#'}
+
+entrez_fetch <- function(db, id=NULL, web_history=NULL, rettype, retmode="", parsed=FALSE,
+ config=NULL, ...){
+ identifiers <- id_or_webenv()
+ if(parsed){
+ if(!is_xml_record(rettype, retmode)){
+ msg <- paste("At present, entrez_fetch can only parse XML records, got", rettype)
+ stop(msg)
+ }
+ }
+ args <- c(list("efetch", db=db, rettype=rettype, config=config, ...), identifiers)
+ records <- do.call(make_entrez_query, args)
+ #NCBI limits requests to three per second
+ Sys.sleep(0.33)
+ if(parsed){
+ #At the moment, this is just a long-winded way to call
+ #XML::xmlTreeParse, but we already use this approach to parse
+ #esummaries,and this is more flexible if NCBI starts sharing more
+ #records in JSON.
+ return(parse_response(records, rettype))
+ }
+ records
+}
+
+is_xml_record <- function(rettype, retmode){
+ if(rettype %in% c("xml", "native", "gbc","ipg")){
+ return(TRUE)
+ }
+ retmode == "xml"
+}
+
diff --git a/R/entrez_global_query.r b/R/entrez_global_query.r
new file mode 100755
index 0000000..e5f505f
--- /dev/null
+++ b/R/entrez_global_query.r
@@ -0,0 +1,31 @@
+#' Find the number of records that match a given term across all NCBI Entrez databases
+#'
+#'
+#'
+#'@export
+#'@param term the search term to use
+#'@param config vector configuration options passed to httr::GET
+#'@param ... additional arguments to add to the query
+#'@seealso \code{\link[httr]{config}} for available configs
+#'@return a named vector with counts for each database
+#'
+#' @examples
+#'
+#' NCBI_data_on_best_butterflies_ever <- entrez_global_query(term="Heliconius")
+
+entrez_global_query <- function(term, config=NULL, ...){
+ response <- make_entrez_query("egquery",
+ term=gsub(" ", "+", term),
+ config=config,
+ ...)
+ record <- xmlTreeParse(response, useInternalNodes=TRUE)
+ db_names <- xpathSApply(record, "//ResultItem/DbName", xmlValue)
+ get_Ids <- function(dbname){
+ path <- paste("//ResultItem/DbName[text()='", dbname, "']/../Count", sep="")
+ res <- as.numeric(xpathSApply(record, path, xmlValue))
+ }
+ #NCBI limits requests to three per second
+ Sys.sleep(0.33)
+ res <- structure(sapply(db_names, get_Ids), names=db_names)
+ return(res)
+}
diff --git a/R/entrez_info.r b/R/entrez_info.r
new file mode 100644
index 0000000..3c4f0d7
--- /dev/null
+++ b/R/entrez_info.r
@@ -0,0 +1,173 @@
+#' Get information about EUtils databases
+#'
+#' Gather information about EUtils generally, or a given Eutils database.
+#'Note: The most common use cases for the einfo util are finding the list of
+#' search fields available for a given database or the other NCBI databases to
+#' which records in a given database might be linked. Both these use cases
+#' are implemented in higher-level functions that return just this information
+#' (\code{entrez_db_searchable} and \code{entrez_db_links} respectively).
+#' Consequently most users will not have a reason to use this function (though
+#' it is exported by \code{rentrez} for the sake of completeness).
+#'@param db character database about which to retrieve information (optional)
+#'@param config config vector passed on to \code{httr::GET}
+#'@return XMLInternalDocument with information describing either all the
+#'databases available in Eutils (if db is not set) or one particular database
+#'(set by 'db')
+#'@seealso \code{\link[httr]{config}} for available httr configurations
+#'@family einfo
+#'@importFrom XML xmlChildren xmlName xpathSApply
+#'@examples
+#'\dontrun{
+#'all_the_data <- entrez_info()
+#'XML::xpathSApply(all_the_data, "//DbName", XML::xmlValue)
+#'entrez_dbs()
+#'}
+#'@export
+
+entrez_info <- function(db=NULL, config=NULL){
+ req <- make_entrez_query("einfo", db=db, config=config)
+ res <- parse_response(req, "xml")
+ check_xml_errors(res)
+ res
+}
+
+#' List databases available from the NCBI
+#'
+#' Retrieves the names of databases available through the EUtils API
+#'@param config config vector passed to \code{httr::GET}
+#'@family einfo
+#'@return character vector listing available dbs
+#'@export
+#'@examples
+#'\donttest{
+#' entrez_dbs()
+#'}
+entrez_dbs <- function(config=NULL){
+ xpathSApply(entrez_info(config=config), "//DbName", xmlValue)
+}
+
+
+#' Retrieve summary information about an NCBI database
+#'
+#'@param config config vector passed to \code{httr::GET}
+#'@param db character, name of database to summarise
+#'@return Character vector with the following data
+#'@return DbName Name of database
+#'@return Description Brief description of the database
+#'@return Count Number of records contained in the database
+#'@return MenuName Name in web-interface to EUtils
+#'@return DbBuild Unique ID for current build of database
+#'@return LastUpdate Date of most recent update to database
+#'@family einfo
+#'@examples
+#'entrez_db_summary("pubmed")
+#'@export
+
+entrez_db_summary <- function(db, config=NULL){
+ rec <- entrez_info(db, config)
+ unparsed <- xpathApply( rec, "//DbInfo/*[not(self::LinkList or self::FieldList)]")
+ res <- sapply(unparsed, xmlValue)
+ names(res) <- sapply(unparsed, xmlName)
+ class(res) <- c("eInfoEntry", class(res))
+ res
+}
+
+
+#' List available links for records from a given NCBI database
+#'
+#' For a given database, fetch a list of other databases that contain
+#' cross-referenced records. The names of these databases can be used as the
+#' \code{db} argument in \code{\link{entrez_link}}
+#'
+#'@param config config vector passed to \code{httr::GET}
+#'@param db character, name of database to search
+#'@return An eInfoLink object (sub-classed from list) summarizing linked-databases.
+#' Can be coerced to a data-frame with \code{as.data.frame}. Printing the object
+#' shows the name of each element (which is the correct name to pass to
+#' \code{entrez_link}); each element can be used to get (a little) more
+#' information about the linked database (see example below).
+#'@family einfo
+#'@seealso \code{\link{entrez_link}}
+#'@examples
+#' \donttest{
+#'taxid <- entrez_search(db="taxonomy", term="Osmeriformes")$ids
+#'tax_links <- entrez_db_links("taxonomy")
+#'tax_links
+#'entrez_link(dbfrom="taxonomy", db="pmc", id=taxid)
+#'
+#'sra_links <- entrez_db_links("sra")
+#'as.data.frame(sra_links)
+#'}
+#'@export
+entrez_db_links <- function(db, config=NULL){
+ rec <- entrez_info(db, config)
+ unparsed <- xpathApply(rec, "//Link", xmlChildren)
+ res <- lapply(unparsed, lapply, xmlValue)
+ res <- lapply(res, add_class, new_class='eInfoEntry')
+ names(res) <- sapply(res, "[[", "DbTo")
+ class(res) <- c("eInfoLink", "eInfoList", "list")
+ attr(res, 'db') <- xmlValue(rec["/eInfoResult/DbInfo/DbName"][[1]])
+ res
+}
+
+
+#' List available search fields for a given database
+#'
+
+#'Fetch a list of search fields that can be used with a given database. Fields
+#' can be used as part of the \code{term} argument to \code{\link{entrez_search}}
+#'@param config config vector passed to \code{httr::GET}
+#'@param db character, name of database to get search field from
+#'@return An eInfoSearch object (subclassed from list) summarizing search fields.
+#' Can be coerced to a data-frame with \code{as.data.frame}. Printing the object
+#' shows only the names of each available search field.
+#'@seealso \code{\link{entrez_search}}
+#'@family einfo
+#'@examples
+#'\donttest{
+#' pmc_fields <- entrez_db_searchable("pmc")
+#' pmc_fields[["AFFL"]]
+#' entrez_search(db="pmc", term="Otago[AFFL]", retmax=0)
+#' entrez_search(db="pmc", term="Auckland[AFFL]", retmax=0)
+#'
+#' sra_fields <- entrez_db_searchable("sra")
+#' as.data.frame(sra_fields)
+#'}
+#'@export
+
+entrez_db_searchable <- function(db, config=NULL){
+ rec <- entrez_info(db, config)
+ unparsed <- xpathApply(rec,
+ "/eInfoResult/DbInfo/FieldList/Field",
+ xmlChildren)
+ res <- lapply(unparsed, lapply, xmlValue)
+ res <- lapply(res, add_class, new_class="eInfoEntry")
+ names(res) <- sapply(res, "[[", "Name")
+ class(res) <- c("eInfoSearch", "eInfoList", "list")
+ attr(res, 'db') <- xmlValue(rec["/eInfoResult/DbInfo/DbName"][[1]])
+ res
+}
+
+#'@export
+print.eInfoLink<- function(x, ...){
+ cat("Databases with linked records for database '", attr(x, "db"), "'\n", sep="")
+ print(names(x), quote=FALSE)
+}
+
+#'@export
+as.data.frame.eInfoList <- function(x, ...){
+ data.frame(do.call("rbind", x), row.names=NULL)
+}
+
+#'@export
+print.eInfoSearch <- function(x, ...){
+ cat("Searchable fields for database '", attr(x, "db"), "'\n", sep="")
+ for (term in x){
+ cat(" ", term$Name, "\t", term$Description, "\n")
+ }
+}
+
+#'@export
+print.eInfoEntry <- function(x, ...){
+ cat(paste0(" ", names(x), ": ", unlist(x), collapse="\n"), "\n")
+}
diff --git a/R/entrez_link.r b/R/entrez_link.r
new file mode 100755
index 0000000..f1f2741
--- /dev/null
+++ b/R/entrez_link.r
@@ -0,0 +1,235 @@
+#' Get links to datasets related to records from an NCBI database
+#'
+#' Discover records related to a set of unique identifiers from
+#' an NCBI database. The object returned by this function depends on the value
+#' set for the \code{cmd} argument. Printing the returned object lists the names,
+#' and provides a brief description, of the elements included in the object.
+#'
+#'@export
+#'@param db character Name of the database to search for links (or use "all" to
+#' search all databases available for \code{dbfrom}). \code{entrez_db_links} allows you
+#' to discover databases that might have linked information (see examples).
+#'@param id vector with unique ID(s) for records in database \code{dbfrom}.
+#'@param web_history a web_history object
+#'@param dbfrom character Name of database from which the Id(s) originate
+#'@param by_id logical If FALSE (default) return a single
+#' \code{elink} object containing links for all of the provided \code{id}s.
+#' Alternatively, if TRUE return a list of \code{elink} objects, one for each
+#' ID in \code{id}.
+#'@param cmd link function to use. Allowed values include
+#' \itemize{
+#' \item neighbor (default). Returns a set of IDs in \code{db} linked to the
+#' input IDs in \code{dbfrom}.
+#' \item neighbor_score. As `neighbor', but additionally returns similarity scores.
+#' \item neighbor_history. As `neighbor', but returns web history objects.
+#' \item acheck. Returns a list of linked databases available from NCBI for a set of IDs.
+#' \item ncheck. Checks for the existence of links within a single database.
+#' \item lcheck. Checks for external (i.e. outside NCBI) links.
+#' \item llinks. Returns a list of external links for each ID, excluding links
+#' provided by libraries.
+#' \item llinkslib. As 'llinks' but additionally includes links provided by
+#' libraries.
+#' \item prlinks. As 'llinks' but returns only the primary external link for
+#' each ID.
+#'}
+#'@param \dots character Additional terms to add to the request, see NCBI
+#'documentation linked to in references for a complete list
+#'@param config vector configuration options passed to httr::GET
+#'@seealso \code{\link[httr]{config}} for available configs
+#'@seealso \code{entrez_db_links}
+#'@return An elink object containing the data defined by the \code{cmd} argument
+#'(if by_id=FALSE) or a list of such object (if by_id=TRUE).
+#'@return file XMLInternalDocument xml file resulting from search, parsed with
+#'\code{\link{xmlTreeParse}}
+#'@references \url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ELink_}
+#'@importFrom XML xmlToList
+#' @examples
+#' \donttest{
+#' pubmed_search <- entrez_search(db = "pubmed", term ="10.1016/j.ympev.2010.07.013[doi]")
+#' linked_dbs <- entrez_db_links("pubmed")
+#' linked_dbs
+#' nucleotide_data <- entrez_link(dbfrom = "pubmed", id = pubmed_search$ids, db ="nuccore")
+#' #Sources for the full text of the paper
+#' res <- entrez_link(dbfrom="pubmed", db="", cmd="llinks", id=pubmed_search$ids)
+#' linkout_urls(res)
+#'}
+#'
+
+
+entrez_link <- function(dbfrom, web_history=NULL, id=NULL, db=NULL, cmd='neighbor', by_id=FALSE, config=NULL, ...){
+ identifiers <- id_or_webenv()
+ args <- c(list("elink", db=db, dbfrom=dbfrom, cmd=cmd, config=config, by_id=by_id, ...), identifiers)
+ if(by_id){
+ if(is.null(id)) stop("Can't use by_id mode without ids!")
+ }
+ response <- do.call(make_entrez_query,args)
+ record <- parse_response(response, 'xml')
+ Sys.sleep(0.33)
+ res <- parse_elink(record, cmd=cmd, by_id=by_id)
+ if(!is.null(id) & by_id){
+ if(length(res) != length(id)){
+ msg <- paste(id[!(id %in% res)], collapse=", ")
+ warning("Some IDs appear to be invalid. Result contains no information for the following IDs: ", msg)
+ }
+ }
+ res
+}
+
+#' Extract URLs from an elink object
+#' @param elink elink object (returned by entrez_link) containing Urls
+#' @return list of character vectors, one per ID, each containing the URLs for that
+#' ID.
+#' @seealso entrez_link
+#' @export
+linkout_urls <- function(elink){
+ if (!("linkouts" %in% names(elink))){
+ stop("No linkouts in the elink object. Use entrez_link commands 'prlinks', 'llinks' or 'llinkslib' to fetch urls")
+ }
+ lapply(elink$linkouts, function(lo) if(length(lo) == 0) NA else sapply(lo, "[[", "Url"))
+}
+
+
+#
+# Parsing Elink is.... fun. The XML files returned by the different 'cmd'
+# args are very different, so we can't hope for a one-size-fits-all solution.
+# Instead, we can break off a few similar cases and write parsing functions,
+# which we dispatch via a big switch statement.
+#
+# Each parsing function should return a list with elements corresponding to the
+# data in the XML, and set the attribute "content" to a brief description of what
+# each element in the record contains, to be used by the print fxn.
+#
+# In addition, the "by_id" mode
+# means we sometimes return a list of elink objects, having applied the
+# relevant function to each "<LinkSet>" in the XML.
+#
+parse_elink <- function(x, cmd, by_id, id){
+ check_xml_errors(x)
+ f <- make_elink_fxn(cmd)
+ res <- xpathApply(x, "//LinkSet",f)
+ if(length(res) > 1){
+ class(res) <- c("elink_list", "list")
+ return(res)
+ }
+ res[[1]]
+}
+
+
+
+
+
+make_elink_fxn <- function(cmd){
+ f <- switch(cmd,
+ "neighbor" = parse_neighbors,
+ "neighbor_score" = function(x) parse_neighbors(x, scores=TRUE),
+ "neighbor_history" = parse_history,
+ "acheck" = parse_acheck,
+ "ncheck" = function(x) parse_check(x, "HasNeighbor"),
+ "lcheck" = function(x) parse_check(x, "HasLinkOut"),
+ "llinkslib" = parse_linkouts,
+ "llinks" = parse_linkouts,
+ "prlinks" = parse_linkouts,
+ stop("Don't know how to deal with cmd ", cmd)
+ )
+ function(x){
+ res <- f(x)
+ class(res) <- c("elink", "list")
+ res
+ }
+
+}
+
+parse_neighbors <- function(x, scores=FALSE){
+ content <- ""
+ if("-1" %in% xpathSApply(x, "//IdList/Id", xmlValue)){
+ warning("Some IDs not found")
+ }
+ db_names <- xpathSApply(x, "LinkSetDb/LinkName", xmlValue)
+ links <- sapply(db_names, get_linked_elements, record=x, element="Id", simplify=FALSE)
+ class(links) <- c("elink_classic", "list")
+ res <- list(links = links, file=x)
+ if(scores){
+ nscores <- sapply(db_names, get_linked_elements, record=x, element="Score", simplify=FALSE)
+ class(nscores) <- c("elink_classic", "list")
+ content <- " $scores: weighted neighbouring scores for each hit in links\n"
+ res$scores <- nscores
+ }
+ attr(res, "content") <- paste(" $links: IDs for linked records from NCBI\n",
+ content)
+ res
+}
+
+parse_history <- function(x){
+ qks <- xpathSApply(x, "LinkSetDbHistory/QueryKey", xmlValue, simplify=FALSE)
+ cookie <- xmlValue(x[["WebEnv"]])
+ histories <- lapply(qks, web_history, WebEnv=cookie)
+ names(histories) <- xpathSApply(x, "//LinkSetDbHistory/LinkName", xmlValue)
+ res <- list(web_histories=histories, file=x)
+ attr(res, "content") <- paste0(" $web_histories: Objects containing web history information\n")
+ res
+}
+
+parse_acheck <- function(x){
+ db_info <- xpathApply(x, "//LinkInfo", xmlToList)
+ names(db_info) <- sapply(db_info, "[[","LinkName")
+ class(db_info) <- "elink_classic"
+ res <- list(linked_databases = db_info)
+ attr(res, "content") <- " $linked_databases: a list of summary data from each database with linked records"
+ res
+}
+
+parse_check <- function(x, attr){
+ path <- paste0("IdCheckList/Id/@", attr)
+ is_it_y <- structure(names= xpathSApply(x, "IdCheckList/Id", xmlValue),
+ xpathSApply(x, path, `==`, "Y"))
+
+ res <- list(check = is_it_y)
+ attr(res, "content") <- " $check: TRUE/FALSE for whether each ID has links"
+ res
+}
+
+parse_linkouts <- function(x){
+ per_id <- xpathApply(x, "//IdUrlList/IdUrlSet")
+ list_per_id <- lapply(per_id, function(x) lapply(x["ObjUrl"], xmlToList))
+ names(list_per_id) <-paste0("ID_", sapply(per_id,function(x) xmlValue(x[["Id"]])))
+ list_o_lists <- lapply(list_per_id, unname) #otherwise first element of each list has same name!
+ list_o_lists <- lapply(list_o_lists, lapply, add_class, "linkout")
+ res <- list( linkouts = list_o_lists)
+ attr(res, "content") <- " $linkouts: links to external websites"
+ res
+}
+
+
+
+
+#' @export
+
+print.elink_list <- function(x, ...){
+ payload <- attr(x[[1]], "content")
+ cat("List of", length(x), "elink objects, each containing\n", payload)
+}
+
+#' @export
+print.elink <- function(x, ...){
+ payload <- attr(x, "content")
+ cat("elink object with contents:\n", payload, "\n",sep="")
+}
+
+
+#' @export
+print.linkout <- function(x,...){
+ cat("Linkout from", x$Provider$Name, "\n $Url:", substr(x$Url, 1, 26), "...\n")
+}
+
+#' @export
+print.elink_classic <- function(x, ...){
+ len <- length(x)
+ cat(paste("elink result with information from", len , "databases:\n"))
+ print (names(x), quote=FALSE)
+}
+
+
+get_linked_elements <- function(record, dbname, element){
+ path <- paste0("LinkSetDb/LinkName[text()='", dbname, "']/../Link/", element)
+ return(xpathSApply(record, path, xmlValue))
+}
diff --git a/R/entrez_post.r b/R/entrez_post.r
new file mode 100755
index 0000000..a2a6d5a
--- /dev/null
+++ b/R/entrez_post.r
@@ -0,0 +1,43 @@
+#' Post IDs to Eutils for later use
+#'
+#'
+#'
+#'@export
+#'@param db character Name of the database from which the IDs were taken
+#'@param id vector with unique ID(s) for records in database \code{db}.
+#'@param web_history A web_history object. Can be used to add additional
+#' identifiers to an existing web environment on the NCBI
+#'@param \dots character Additional terms to add to the request, see NCBI
+#'documentation linked to in references for a complete list
+#'@param config vector of configuration options passed to httr::GET
+#'@references \url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_}
+#'@seealso \code{\link[httr]{config}} for available httr configurations
+#'@importFrom XML xmlTreeParse
+#' @examples
+#'\dontrun{
+#' so_many_snails <- entrez_search(db="nuccore",
+#' "Gastropoda[Organism] AND COI[Gene]", retmax=200)
+#' upload <- entrez_post(db="nuccore", id=so_many_snails$ids)
+#' first <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload,
+#' retmax=10)
+#' second <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload,
+#' retstart=10, retmax=10)
+#'}
+
+entrez_post <- function(db, id=NULL, web_history=NULL, config=NULL, ...){
+ args <-list("epost", db=db, config=config, id=id, web_history=web_history, ...)
+ if(!is.null(web_history)){
+ args <- c(args, WebEnv=web_history$WebEnv, query_key = web_history$QueryKey)
+ args$web_history <- NULL
+ }
+ response <- do.call(make_entrez_query, args)
+ record <- xmlTreeParse(response, useInternalNodes=TRUE)
+ result <- xpathApply(record, "/ePostResult/*", XML::xmlValue)
+ names(result) <- c("QueryKey", "WebEnv")
+ class(result) <- c("web_history", "list")
+ #NCBI limits requests to three per second
+ Sys.sleep(0.33)
+ return(result)
+}
+
+
diff --git a/R/entrez_search.r b/R/entrez_search.r
new file mode 100755
index 0000000..09a620b
--- /dev/null
+++ b/R/entrez_search.r
@@ -0,0 +1,120 @@
+#' Search the NCBI databases using EUtils
+#'
+#' The NCBI uses a search term syntax where search terms can be associated with
+#' a specific search field with square brackets. So, for instance ``Homo[ORGN]''
+#' denotes a search for Homo in the ``Organism'' field. The names and
+#' definitions of these fields can be identified using
+#' \code{\link{entrez_db_searchable}}.
+#'
+#' Searches can make use of several fields by combining them via the boolean
+#' operators AND, OR and NOT. So, using the search term ``((Homo[ORGN] AND APP[GENE]) NOT
+#' Review[PTYP])'' in PubMed would identify articles matching the gene APP in
+#' humans, and exclude review articles. More examples of the use of these search
+#' terms, and the more specific MeSH terms for precise searching,
+#' are given in the package vignette.
+#'
+#'@export
+#'@param db character, name of the database to search for.
+#'@param term character, the search term.
+#'@param use_history logical. If TRUE return a web_history object for use in
+#' later calls to the NCBI
+#'@param retmode character, one of xml (default) or json. This will make no
+#' difference in most cases.
+#'@param \dots character, additional terms to add to the request, see NCBI
+#'documentation linked to in references for a complete list
+#'@param config vector configuration options passed to httr::GET
+#'@seealso \code{\link[httr]{config}} for available httr configurations
+#'@seealso \code{\link{entrez_db_searchable}} to get a set of search fields that
+#' can be used in \code{term} for any database
+#'@return ids integer Unique IDs returned by the search
+#'@return count integer Total number of hits for the search
+#'@return retmax integer Maximum number of hits returned by the search
+#'@return web_history A web_history object for use in subsequent calls to NCBI
+#'@return QueryTranslation character, search term as the NCBI interpreted it
+#'@return file either an XMLInternalDocument xml file resulting from search, parsed with
+#'\code{\link[XML]{xmlTreeParse}} or, if \code{retmode} was set to json a list
+#' resulting from the returned JSON file being parsed with
+#' \code{\link[jsonlite]{fromJSON}}.
+#'@references \url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESearch_}
+#'@examples
+#' \dontrun{
+#' query <- "Gastropoda[Organism] AND COI[Gene]"
+#' web_env_search <- entrez_search(db="nuccore", query, use_history=TRUE)
+#' cookie <- web_env_search$web_history
+#' # the web_history object carries both the WebEnv and the QueryKey
+#' snail_coi <- entrez_fetch(db = "nuccore", web_history = cookie,
+#' rettype = "fasta", retmax = 10)
+#'}
+#'\donttest{
+#'
+#' fly_id <- entrez_search(db="taxonomy", term="Drosophila")
+#' #Oh, right. There is a genus and a subgenus named Drosophila...
+#' #how can we limit this search?
+#' (tax_fields <- entrez_db_searchable("taxonomy"))
+#' #"RANK" looks promising
+#' tax_fields$RANK
+#' entrez_search(db="taxonomy", term="Drosophila & Genus[RANK]")
+#'}
+
+entrez_search <- function(db, term, config=NULL, retmode="xml", use_history=FALSE, ... ){
+ usehistory <- if(use_history) "y" else "n"
+ response <- make_entrez_query("esearch",
+ db=db,
+ term=term,
+ config=config,
+ retmode=retmode,
+ usehistory=usehistory,
+ ...)
+ parsed <- parse_response(response, retmode)
+ parse_esearch(parsed, history=use_history)
+}
+
+
+parse_esearch <- function(x, history) UseMethod("parse_esearch")
+
+parse_esearch.XMLInternalDocument <- function(x, history){
+ res <- list( ids = xpathSApply(x, "//IdList/Id", xmlValue),
+ count = as.integer(xmlValue(x[["/eSearchResult/Count"]])),
+ retmax = as.integer(xmlValue(x[["/eSearchResult/RetMax"]])),
+ QueryTranslation = xmlValue(x[["/eSearchResult/QueryTranslation"]]),
+ file = x)
+ if(history){
+ res$web_history = web_history(
+ QueryKey = xmlValue(x[["/eSearchResult/QueryKey"]]),
+ WebEnv = xmlValue(x[["/eSearchResult/WebEnv"]])
+ )
+ }
+ class(res) <- c("esearch", "list")
+ return(res)
+}
+
+parse_esearch.list <- function(x, history){
+ #for consistency between xml/json records we are going to change the
+ #field names from lower -> CamelCase
+ res <- x$esearchresult[ c("idlist", "count", "retmax", "querytranslation") ]
+ names(res)[c(1,4)] <- c("ids", "QueryTranslation")
+ if(history){
+ res$web_history = web_history(QueryKey = x$esearchresult[["querykey"]],
+ WebEnv = x$esearchresult[["webenv"]])
+ }
+ res$count <- as.integer(res$count)
+ res$retmax <- as.integer(res$retmax)
+ res$file <- x
+ class(res) <- c("esearch", "list")
+ return(res)
+}
+
+#'@export
+print.esearch <- function(x, ...){
+ display_term <- if(nchar(x$QueryTranslation) > 50){
+ paste(substr(x$QueryTranslation, 1, 50), "...")
+ } else x$QueryTranslation
+ cookie_word <- if("web_history" %in% names(x)) "a" else "no"
+ msg<- paste("Entrez search result with", x$count, "hits (object contains",
+ length(x$ids), "IDs and", cookie_word,
+ "web_history object)\n Search term (as translated): " , display_term, "\n")
+ cat(msg)
+}
+
+
+ c("//IdList/Id", "/eSearchResult/Count", "/eSearchResult/RetMax", "/eSearchResult/QueryTranslation")
diff --git a/R/entrez_summary.r b/R/entrez_summary.r
new file mode 100755
index 0000000..12697f1
--- /dev/null
+++ b/R/entrez_summary.r
@@ -0,0 +1,188 @@
+#' Get summaries of objects in NCBI datasets from a unique ID
+#'
+#'
+#' The NCBI offers two distinct formats for summary documents.
+#' Version 1.0 is a relatively limited summary of a database record based on a
+#' shared Document Type Definition. Version 1.0 summaries are only available as
+#' XML and are not available for some newer databases.
+#' Version 2.0 summaries generally contain more information about a given
+#' record, but each database has its own distinct format. 2.0 summaries are
+#' available for records in all databases and as JSON and XML files.
+#' As of version 0.4, rentrez fetches version 2.0 summaries by default and
+#' uses JSON as the exchange format (as JSON objects can be more easily converted
+#' into native R types). Existing scripts which relied on the structure and
+#' naming of the "Version 1.0" summary files can be updated by setting the new
+#' \code{version} argument to "1.0".
+#'
+#' By default, entrez_summary returns a single record when only one ID is
+#' passed and a list of such records when multiple IDs are passed. This can lead
+#' to unexpected behaviour when the results of a variable number of IDs (perhaps the
+#' result of \code{entrez_search}) are processed with an apply family function
+#' or in a for-loop. If you use this function as part of a function or script that
+#' generates a variably-sized vector of IDs, setting \code{always_return_list} to
+#' \code{TRUE} will avoid these problems. The function
+#' \code{extract_from_esummary} is provided for the specific case of extracting
+#' named elements from a list of esummary objects, and is designed to work on
+#' single objects as well as lists.
+#'
+#'@export
+#'@param db character Name of the database to search for
+#'@param id vector with unique ID(s) for records in database \code{db}.
+#'@param web_history A web_history object
+#'@param always_return_list logical, return a list of esummary objects even
+#'when only one ID is provided (see description for a note about this option)
+#'@param \dots character Additional terms to add to the request, see NCBI
+#'documentation linked to in references for a complete list
+#'@param config vector configuration options passed to \code{httr::GET}
+#'@param version either 1.0 or 2.0; see above for description
+#'@references \url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESummary_}
+#'@seealso \code{\link[httr]{config}} for available configs
+#'@seealso \code{\link{extract_from_esummary}} which can be used to extract
+#'elements from a list of esummary records
+#'@return A list of esummary records (if multiple IDs are passed and
+#'always_return_list is FALSE) or a single record.
+#'@return file XMLInternalDocument xml file containing the entire record
+#'returned by the NCBI.
+#'@importFrom XML xpathApply xmlSApply xmlGetAttr xmlValue
+#'@importFrom jsonlite fromJSON
+#' @examples
+#'\donttest{
+#' pop_ids = c("307082412", "307075396", "307075338", "307075274")
+#' pop_summ <- entrez_summary(db="popset", id=pop_ids)
+#' extract_from_esummary(pop_summ, "title")
+#'
+#' # clinvar example
+#' res <- entrez_search(db = "clinvar", term = "BRCA1", retmax=10)
+#' cv <- entrez_summary(db="clinvar", id=res$ids)
+#' cv
+#' extract_from_esummary(cv, "title", simplify=FALSE)
+#' extract_from_esummary(cv, "trait_set")[1:2]
+#' extract_from_esummary(cv, "gene_sort")
+#' }
+entrez_summary <- function(db, id=NULL, web_history=NULL,
+ version=c("2.0", "1.0"), always_return_list = FALSE, config=NULL, ...){
+ identifiers <- id_or_webenv()
+ v <-match.arg(version)
+ retmode <- if(v == "2.0") "json" else "xml"
+ args <- c(list("esummary", db=db, config=config, retmode=retmode, version=v, ...), identifiers)
+ response <- do.call(make_entrez_query, args)
+ whole_record <- parse_response(response, retmode)
+ parse_esummary(whole_record, always_return_list)
+}
+
+#' Extract elements from a list of esummary records
+#'@export
+#'@param esummaries A list of esummary objects
+#'@param elements the names of the elements to extract
+#'@param simplify logical, if possible return a vector
+#'@return List or vector containing requested elements
+extract_from_esummary <- function(esummaries, elements, simplify=TRUE){
+ UseMethod("extract_from_esummary", esummaries)
+}
+
+#'@export
+extract_from_esummary.esummary <- function(esummaries, elements, simplify=TRUE){
+ fxn <- if(simplify & length(elements)==1) `[[` else `[`
+ fxn(esummaries, elements)
+}
+
+#'@export
+extract_from_esummary.esummary_list <- function(esummaries, elements, simplify=TRUE){
+ fxn <- if (simplify & length(elements) == 1) `[[` else `[`
+ sapply(esummaries, fxn, elements, simplify=simplify)
+}
+
+
+
+
+parse_esummary <- function(x, always_return_list) UseMethod("parse_esummary")
+
+
+check_json_errs <- function(rec){
+ if("error" %in% names(rec)){
+ msg <- paste0("ID ", rec$uid, " produced error '", rec$error, "'")
+ warning(msg, call.=FALSE)
+ }
+ invisible()
+}
+
+
+parse_esummary.list <- function(x, always_return_list){
+ #already parsed by jsonlite, just add check for errors, then re-class
+ res <- x$result[2: length(x$result)]
+ sapply(res, check_json_errs)
+ res <- lapply(res, add_class, new_class="esummary")
+ if(length(res)==1 & !always_return_list){
+ return(res[[1]])
+ }
+ class(res) <- c("esummary_list", "list")
+ res
+}
+
+# Parse a summary XML
+#
+# Logic goes like this
+# 1. Define functions parse_esumm_* to handle all data types
+# 2. For each node detect type, parse accordingly
+# 3. wrap it all up in function parse_summary that
+#
+
+#
+#@export
+parse_esummary.XMLInternalDocument <- function(x, always_return_list){
+ check_xml_errors(x)
+ recs <- x["//DocSum"]
+ if(length(recs)==0){
+ stop("Esummary document contains no DocSums, try 'version=2.0'?")
+ }
+ per_rec <- function(r){
+ res <- xpathApply(r, "Item", parse_node)
+ names(res) <- xpathApply(r, "Item", xmlGetAttr, "Name")
+ res <- c(res, file=x)
+ class(res) <- c("esummary", class(res))
+ return(res)
+ }
+ if(length(recs)==1 & !always_return_list){
+ return(per_rec(recs[[1]]))
+ }
+ res <- lapply(recs, per_rec)
+ names(res) <- xpathSApply(x, "//DocSum/Id", xmlValue)
+ class(res) <- c("esummary_list", "list")
+ res
+}
+
+parse_node <- function(node) {
+ node_type <- xmlGetAttr(node, "Type")
+
+ node_fxn <- switch(node_type,
+ "Integer" = parse_esumm_int,
+ "List" = parse_esumm_list,
+ "Structure" = parse_esumm_list,
+ xmlValue) #unnamed arguments to switch = default val.
+ return(node_fxn(node))
+
+}
+
+parse_esumm_int <- function(node) as.integer(xmlValue(node))
+
+parse_esumm_list <- function(node){
+ res <- lapply(node["Item"], parse_node)
+ names(res) <- lapply(node["Item"], xmlGetAttr, "Name")
+ return(res)
+}
+
+
+#' @export
+print.esummary <- function(x, ...){
+ len <- length(x)
+ cat(paste("esummary result with", len - 1, "items:\n"))
+ print(names(x)[-len], quote=FALSE)
+}
+
+#' @export
+print.esummary_list <- function(x, ...){
+ len <- length(x)
+ cat("List of ", len, "esummary records. First record:\n\n ")
+ print(x[1])
+}
+
diff --git a/R/help.r b/R/help.r
new file mode 100755
index 0000000..60820fa
--- /dev/null
+++ b/R/help.r
@@ -0,0 +1,19 @@
+#' rentrez
+#'
+#' rentrez provides functions to search for, discover and download data from
+#' the NCBI's databases using their EUtils function.
+#'
+#' Users are expected to know a little bit about the EUtils API, which is well
+#' documented: \url{http://www.ncbi.nlm.nih.gov/books/NBK25500/}
+#'
+#' The NCBI will ban IPs that don't use EUtils within their \href{http://www.ncbi.nlm.nih.gov/corehtml/query/static/eutils_help.html}{user guidelines}. In particular
+#' \enumerate{
+#' \item Don't send more than three requests per second (rentrez enforces this limit)
+#' \item If you plan on sending a sequence of more than ~100 requests, do so outside of peak times for the US
+#' \item For large requests use the web history method (see examples for \code{\link{entrez_search}} or use \code{\link{entrez_post}} to upload IDs)
+#'}
+#' @docType package
+#' @name rentrez
+#' @aliases rentrez rentrez-package
+#'
+NULL
diff --git a/R/parse_pubmed_xml.r b/R/parse_pubmed_xml.r
new file mode 100755
index 0000000..3186bc4
--- /dev/null
+++ b/R/parse_pubmed_xml.r
@@ -0,0 +1,102 @@
+#' Summarize an XML record from pubmed.
+#'
+#' Note: this function assumes all records are of the type "PubmedArticle"
+#' and will return an empty record for any other type (including books).
+#'
+#'@export
+#'@param record Either an XMLInternalDocument or character: the record to be
+#'parsed (expected to come from \code{\link{entrez_fetch}})
+#'@return Either a single pubmed_record object, or a list of several
+#'@importFrom XML xmlName
+#'@examples
+#'
+#'
+#' hox_paper <- entrez_search(db="pubmed", term="10.1038/nature08789[doi]")
+#' hox_rel <- entrez_link(db="pubmed", dbfrom="pubmed", id=hox_paper$ids)
+#' recs <- entrez_fetch(db="pubmed",
+#' id=hox_rel$links$pubmed_pubmed[1:3],
+#' rettype="xml")
+#' parse_pubmed_xml(recs)
+#'
+
+parse_pubmed_xml<- function(record){
+ if(typeof(record) == "character"){
+ record <- xmlTreeParse(record, useInternalNodes=TRUE)
+ }
+ res <- xpathApply(record,
+ "/PubmedArticleSet/*",
+ parse_one_pubmed)
+ if(length(res)==1){
+ return(res[[1]])
+ }
+ class(res) <- c("multi_pubmed_record", "list")
+ return(res)
+}
+
+#The work-horse function - get information from a single xml rec
+parse_one_pubmed <- function(paper){
+ atype <- xmlName(paper)
+ if( atype != "PubmedArticle" ){
+ pmid = xpathSApply(paper, ".//PMID", xmlValue)
+ msg = paste0("Pubmed record ", pmid, " is of type '", atype,
+ "' which rentrez doesn't know how to parse.",
+ " Returning empty record")
+
+ warning(msg)
+ return(structure(list(), class="pubmed_record", empty=TRUE))
+ }
+ get_value <- function(path){
+ return(xpathSApply(paper, path, xmlValue))
+ }
+ res <- list()
+ res$title <- get_value(".//ArticleTitle")
+ res$authors <- paste(get_value(".//Author/LastName"),
+ get_value(".//Author/ForeName"), sep=", ")
+ res$year <- get_value(".//PubDate/Year")
+ res$journal <- get_value(".//Journal/Title")
+ res$volume <- get_value(".//JournalIssue/Volume")
+ res$issue <- get_value(".//JournalIssue/Issue")
+ res$pages <- get_value(".//MedlinePgn")
+ res$key_words <- get_value(".//DescriptorName")
+ res$doi <- get_value(".//ArticleId[@IdType='doi']")
+ res$pmid <- get_value(".//ArticleId[@IdType='pubmed']")
+ res$abstract <- get_value(".//AbstractText")
+
+ structure(res, class="pubmed_record", empty=FALSE)
+}
+
+
+
+#' @export
+print.pubmed_record <- function(x, first_line=TRUE, ...){
+ if( attr(x, "empty")){
+ cat('Pubmed record (empty)\n')
+ return()
+ }
+
+ if(length(x$authors) == 1){
+ display.author <- x$authors[1]
+ }
+ else if(length(x$authors) == 2){
+ display.author <- with(x, paste(authors[1], authors[2], sep=". & "))
+ }
+ else
+ display.author <- paste(x$authors[1], "et al")
+
+ display <- with(x, sprintf(" %s. (%s). %s. %s:%s",
+ display.author, year, journal, volume, pages))
+ if(first_line){
+ cat("Pubmed record", "\n")
+ }
+ cat(display, "\n")
+}
+
+#' @export
+print.multi_pubmed_record <- function(x, ...){
+ nrecs <- length(x)
+ cat("List of", nrecs, "pubmed records\n")
+ if( nrecs > 3){
+ sapply(x[1:3], print, first_line=FALSE)
+ cat(".\n.\n.\n")
+ } else sapply(x, print, first_line=FALSE)
+}
diff --git a/build/vignette.rds b/build/vignette.rds
new file mode 100644
index 0000000..7982763
Binary files /dev/null and b/build/vignette.rds differ
diff --git a/debian/README.test b/debian/README.test
deleted file mode 100644
index 55a9142..0000000
--- a/debian/README.test
+++ /dev/null
@@ -1,8 +0,0 @@
-Notes on how this package can be tested.
-────────────────────────────────────────
-
-To run the unit tests provided by the package you can do
-
- sh run-unit-test
-
-in this directory.
diff --git a/debian/changelog b/debian/changelog
deleted file mode 100644
index 4358ed7..0000000
--- a/debian/changelog
+++ /dev/null
@@ -1,5 +0,0 @@
-r-cran-rentrez (1.0.4-1) unstable; urgency=medium
-
- * Initial release (closes: #844259)
-
- -- Andreas Tille <tille at debian.org> Sun, 13 Nov 2016 22:04:56 +0100
diff --git a/debian/compat b/debian/compat
deleted file mode 100644
index ec63514..0000000
--- a/debian/compat
+++ /dev/null
@@ -1 +0,0 @@
-9
diff --git a/debian/control b/debian/control
deleted file mode 100644
index 04dcdc4..0000000
--- a/debian/control
+++ /dev/null
@@ -1,26 +0,0 @@
-Source: r-cran-rentrez
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Andreas Tille <tille at debian.org>
-Section: gnu-r
-Priority: optional
-Build-Depends: debhelper (>= 9),
- dh-r,
- r-base-dev,
- r-cran-xml,
- r-cran-httr,
- r-cran-jsonlite
-Standards-Version: 3.9.8
-Vcs-Browser: https://anonscm.debian.org/viewvc/debian-med/trunk/packages/R/r-cran-rentrez/trunk/
-Vcs-Svn: svn://anonscm.debian.org/debian-med/trunk/packages/R/r-cran-rentrez/trunk/
-Homepage: https://cran.r-project.org/package=rentrez
-
-Package: r-cran-rentrez
-Architecture: all
-Depends: ${R:Depends},
- ${misc:Depends}
-Recommends: ${R:Recommends}
-Suggests: ${R:Suggests}
-Description: GNU R interface to the NCBI's EUtils API
- Provides an R interface to the NCBI's EUtils API allowing users to
- search databases like GenBank and PubMed, process the results of those
- searches and pull data into their R sessions.
diff --git a/debian/copyright b/debian/copyright
deleted file mode 100644
index 198ef62..0000000
--- a/debian/copyright
+++ /dev/null
@@ -1,32 +0,0 @@
-Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
-Upstream-Name: rentrez
-Upstream-Contact: David Winter <david.winter at gmail.com>
-Source: https://cran.r-project.org/package=rentrez
-
-Files: *
-Copyright: 2012-2016 David Winter, Scott Chamberlain, Han Guangchun
-License: MIT
-
-Files: debian/*
-Copyright: 2016 Andreas Tille <tille at debian.org>
-License: MIT
-
-License: MIT
- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- "Software"), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
- .
- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
- .
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/debian/docs b/debian/docs
deleted file mode 100644
index 960011c..0000000
--- a/debian/docs
+++ /dev/null
@@ -1,3 +0,0 @@
-tests
-debian/README.test
-debian/tests/run-unit-test
diff --git a/debian/rules b/debian/rules
deleted file mode 100755
index 529c38a..0000000
--- a/debian/rules
+++ /dev/null
@@ -1,5 +0,0 @@
-#!/usr/bin/make -f
-
-%:
- dh $@ --buildsystem R
-
diff --git a/debian/source/format b/debian/source/format
deleted file mode 100644
index 163aaf8..0000000
--- a/debian/source/format
+++ /dev/null
@@ -1 +0,0 @@
-3.0 (quilt)
diff --git a/debian/tests/control b/debian/tests/control
deleted file mode 100644
index b044b0c..0000000
--- a/debian/tests/control
+++ /dev/null
@@ -1,3 +0,0 @@
-Tests: run-unit-test
-Depends: @, r-cran-testthat
-Restrictions: allow-stderr
diff --git a/debian/tests/run-unit-test b/debian/tests/run-unit-test
deleted file mode 100644
index 79b98d7..0000000
--- a/debian/tests/run-unit-test
+++ /dev/null
@@ -1,13 +0,0 @@
-#!/bin/sh -e
-
-oname=rentrez
-pkg=r-cran-`echo $oname | tr '[A-Z]' '[a-z]'`
-
-if [ "$ADTTMP" = "" ] ; then
- ADTTMP=`mktemp -d /tmp/${pkg}-test.XXXXXX`
- trap "rm -rf $ADTTMP" 0 INT QUIT ABRT PIPE TERM
-fi
-cd $ADTTMP
-cp -a /usr/share/doc/${pkg}/tests/* $ADTTMP
-find . -name "*.gz" -exec gunzip \{\} \;
-LC_ALL=C R --no-save < test-all.R
diff --git a/debian/watch b/debian/watch
deleted file mode 100644
index af7ceec..0000000
--- a/debian/watch
+++ /dev/null
@@ -1,2 +0,0 @@
-version=4
-https://cran.r-project.org/src/contrib/rentrez_([-\d.]*)\.tar\.gz
diff --git a/inst/doc/rentrez_tutorial.R b/inst/doc/rentrez_tutorial.R
new file mode 100644
index 0000000..13f4b86
--- /dev/null
+++ b/inst/doc/rentrez_tutorial.R
@@ -0,0 +1,183 @@
+## ---- count_recs, echo=FALSE---------------------------------------------
+library(rentrez)
+count_recs <- function(db, denom) {
+ nrecs <- rentrez::entrez_db_summary(db)["Count"]
+ round(as.integer(nrecs)/denom, 1)
+}
+
+## ---- dbs----------------------------------------------------------------
+entrez_dbs()
+
+## ---- cdd----------------------------------------------------------------
+entrez_db_summary("cdd")
+
+## ---- sra_eg-------------------------------------------------------------
+entrez_db_searchable("sra")
+
+## ----eg_search-----------------------------------------------------------
+r_search <- entrez_search(db="pubmed", term="R Language")
+
+## ----print_search--------------------------------------------------------
+r_search
+
+## ----search_ids----------------------------------------------------------
+r_search$ids
+
+## ----searchids_2---------------------------------------------------------
+another_r_search <- entrez_search(db="pubmed", term="R Language", retmax=40)
+another_r_search
+
+## ---- Tt-----------------------------------------------------------------
+entrez_search(db="sra",
+ term="Tetrahymena thermophila[ORGN]",
+ retmax=0)
+
+## ---- Tt2----------------------------------------------------------------
+entrez_search(db="sra",
+ term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
+ retmax=0)
+
+## ---- Tt3----------------------------------------------------------------
+entrez_search(db="sra",
+ term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
+ retmax=0)
+
+## ---- sra_searchable-----------------------------------------------------
+entrez_db_searchable("sra")
+
+## ---- mesh---------------------------------------------------------------
+entrez_search(db = "pubmed",
+ term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
+
+## ---- connectome, fig.width=5, fig.height=4, fig.align='center'----------
+search_year <- function(year, term){
+ query <- paste(term, "AND (", year, "[PDAT])")
+ entrez_search(db="pubmed", term=query, retmax=0)$count
+}
+
+year <- 2008:2014
+papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)
+
+plot(year, papers, type='b', main="The Rise of the Connectome")
+
+## ----elink0--------------------------------------------------------------
+all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
+all_the_links
+
+## ----elink_link----------------------------------------------------------
+all_the_links$links
+
+## ---- elink_pmc----------------------------------------------------------
+all_the_links$links$gene_pmc[1:10]
+
+## ---- elink_omim---------------------------------------------------------
+all_the_links$links$gene_clinvar
+
+
+## ---- elink1-------------------------------------------------------------
+nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
+nuc_links
+nuc_links$links
+
+## ---- elinik_refseqs-----------------------------------------------------
+nuc_links$links$gene_nuccore_refseqrna
+
+## ---- outlinks-----------------------------------------------------------
+paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
+paper_links
+
+## ---- urls---------------------------------------------------------------
+paper_links$linkouts
+
+## ----just_urls-----------------------------------------------------------
+linkout_urls(paper_links)
+
+## ---- multi_default------------------------------------------------------
+all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
+all_links_together
+all_links_together$links$gene_protein
+
+## ---- multi_byid---------------------------------------------------------
+all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
+all_links_sep
+lapply(all_links_sep, function(x) x$links$gene_protein)
+
+## ---- Summ_1-------------------------------------------------------------
+taxize_summ <- entrez_summary(db="pubmed", id=24555091)
+taxize_summ
+
+## ---- Summ_2-------------------------------------------------------------
+taxize_summ$articleids
+
+## ---- Summ_3-------------------------------------------------------------
+taxize_summ$pmcrefcount
+
+## ---- multi_summ---------------------------------------------------------
+vivax_search <- entrez_search(db = "pubmed",
+ term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
+multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
+
+## ---- multi_summ2--------------------------------------------------------
+extract_from_esummary(multi_summs, "fulljournalname")
+
+## ---- multi_summ3--------------------------------------------------------
+date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
+knitr::kable(head(t(date_and_cite)), row.names=FALSE)
+
+## ---- transcript_ids-----------------------------------------------------
+gene_ids <- c(351, 11647)
+linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
+linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
+head(linked_transripts)
+
+## ----fetch_fasta---------------------------------------------------------
+all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
+class(all_recs)
+nchar(all_recs)
+
+## ---- peak---------------------------------------------------------------
+cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
+
+## ---- Tt_tax-------------------------------------------------------------
+Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
+tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
+class(tax_rec)
+
+## ---- Tt_list------------------------------------------------------------
+tax_list <- XML::xmlToList(tax_rec)
+tax_list$Taxon$GeneticCode
+
+## ---- Tt_path------------------------------------------------------------
+tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
+tt_lineage[1:4]
+
+## ---- Tt_apply-----------------------------------------------------------
+XML::xpathSApply(tax_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
+
+## ---- asthma-------------------------------------------------------------
+upload <- entrez_post(db="omim", id=600807)
+upload
+
+## ---- snail_search-------------------------------------------------------
+entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]")
+
+## ---- snail_history------------------------------------------------------
+snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)
+snail_coi
+snail_coi$web_history
+
+## ---- asthma_links-------------------------------------------------------
+asthma_clinvar <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", id=600807)
+asthma_clinvar$web_histories
+
+## ---- asthma_links_upload------------------------------------------------
+asthma_variants <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", web_history=upload)
+asthma_variants
+
+## ---- links--------------------------------------------------------------
+snp_links <- entrez_link(dbfrom="clinvar", db="snp",
+ web_history=asthma_variants$web_histories$omim_clinvar,
+ cmd="neighbor_history")
+snp_summ <- entrez_summary(db="snp", web_history=snp_links$web_histories$clinvar_snp)
+knitr::kable(extract_from_esummary(snp_summ, c("chr", "fxn_class", "global_maf")))
+
diff --git a/inst/doc/rentrez_tutorial.Rmd b/inst/doc/rentrez_tutorial.Rmd
new file mode 100644
index 0000000..cbaf1f6
--- /dev/null
+++ b/inst/doc/rentrez_tutorial.Rmd
@@ -0,0 +1,627 @@
+---
+title: Rentrez Tutorial
+author: "David winter"
+date: "`r Sys.Date()`"
+output:
+ rmarkdown::html_vignette:
+ toc: true
+vignette: >
+ %\VignetteIndexEntry{Rentrez Tutorial}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\usepackage[utf8]{inputenc}
+---
+
+```{r, count_recs, echo=FALSE}
+library(rentrez)
+count_recs <- function(db, denom) {
+ nrecs <- rentrez::entrez_db_summary(db)["Count"]
+ round(as.integer(nrecs)/denom, 1)
+}
+```
+## Introduction: The NCBI, entrez and `rentrez`.
+
+The NCBI shares a _lot_ of data. At the time this document was compiled, there
+were `r count_recs("pubmed",1e6)` million papers in [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/),
+including `r count_recs("pmc", 1e6)` million full-text records available in [PubMed Central](http://www.ncbi.nlm.nih.gov/pubmed/).
+[The NCBI Nucleotide Database](http://www.ncbi.nlm.nih.gov/nuccore) (which includes GenBank) has data for `r count_recs("nuccore", 1e6)`
+million different sequences, and [dbSNP](http://www.ncbi.nlm.nih.gov/snp/) describes
+`r count_recs("snp", 1e6)` million different genetic variants. All of these
+records can be cross-referenced with the `r round(entrez_search(db="taxonomy", term='species[RANK]')$count/1e6,2)` million
+species in the [NCBI taxonomy](www.ncbi.nlm.nih.gov/taxonomy) or `r count_recs("omim", 1e3)` thousand disease-associated records
+in [OMIM](http://www.ncbi.nlm.nih.gov/omim).
+
+
+The NCBI makes this data available through a [web interface](http://www.ncbi.nlm.nih.gov/),
+an [FTP server](ftp://ftp.ncbi.nlm.nih.gov/) and through a REST API called the
+[Entrez Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25500/) (`Eutils` for
+short). This package provides functions to use that API, allowing users to
+gather and combine data from multiple NCBI databases in the comfort of an R
+session or script.
+
+## Getting started with the rentrez
+
+To make the most of all the data the NCBI shares you need to know a little about
+their databases, the records they contain and the ways you can find those
+records. The [NCBI provides extensive documentation for each of their
+databases](http://www.ncbi.nlm.nih.gov/home/documentation.shtml) and for the
+[EUtils API that `rentrez` takes advantage of](http://www.ncbi.nlm.nih.gov/books/NBK25501/).
+There are also some helper functions in `rentrez` that help users learn their
+way around the NCBI's databases.
+
+First, you can use `entrez_dbs()` to find the list of available databases:
+
+```{r, dbs}
+entrez_dbs()
+```
+There is a set of functions with names starting `entrez_db_` that can be used to
+gather more information about each of these databases:
+
+**Functions that help you learn about NCBI databases**
+
+| Function name | Return |
+|--------------------------|------------------------------------------------------|
+| `entrez_db_summary()` | Brief description of what the database is |
+| `entrez_db_searchable()` | Set of search terms that can be used with this database |
+| `entrez_db_links()` | Set of databases that might contain linked records |
+
+For instance, we can get a description of the somewhat cryptically named
+database 'cdd'...
+
+```{r, cdd}
+entrez_db_summary("cdd")
+```
+
+... or find out which search terms can be used with the Sequence Read Archive (SRA)
+database (which contains raw data from sequencing projects):
+
+```{r, sra_eg}
+entrez_db_searchable("sra")
+```
+
+Just how these 'helper' functions might be useful will become clearer once
+you've started using `rentrez`, so let's get started.
+
+## Searching databases: `entrez_search()`
+
+Very often, the first thing you'll want to do with `rentrez` is search a given
+NCBI database to find records that match some keywords. You can do this using
+the function `entrez_search()`. In the simplest case you just need to provide a
+database name (`db`) and a search term (`term`) so let's search PubMed for
+articles about the `R language`:
+
+
+```{r eg_search}
+r_search <- entrez_search(db="pubmed", term="R Language")
+```
+The object returned by a search acts like a list, and you can get a summary of
+its contents by printing it.
+
+```{r print_search}
+r_search
+```
+
+There are a few things to note here. First, the NCBI's server has worked out
+that we meant R as a programming language, and so included the
+['MeSH' term](http://www.ncbi.nlm.nih.gov/mesh) associated with programming
+languages. We'll worry about MeSH terms and other special queries later; for now
+just note that you can use this feature to check that your search term was interpreted in the way
+you intended. Second, there are many more 'hits' for this search than there
+are unique IDs contained in this object. That's because the optional argument
+`retmax`, which controls the maximum number of returned values, has a default
+value of 20.
+
+The IDs are the most important thing returned here. They
+allow us to fetch records matching those IDs, gather summary data about them or find
+cross-referenced records in other databases. We access the IDs as a vector using the
+`$` operator:
+
+
+```{r search_ids}
+r_search$ids
+```
+
+If we want to get more than 20 IDs we can do so by increasing the `retmax` argument.
+
+```{r searchids_2}
+another_r_search <- entrez_search(db="pubmed", term="R Language", retmax=40)
+another_r_search
+```
+
+If we want to get IDs for all of the thousands of records that match this
+search, we can use the NCBI's web history feature [described below](#web_history).
+
+
+### Building search terms
+
+The EUtils API uses a special syntax to build search terms. You can search a
+database against a specific term using the format `query[SEARCH FIELD]`, and
+combine multiple such searches using the boolean operators `AND`, `OR` and `NOT`.
+
+For instance, we can find next generation sequence datasets for the (amazing...) ciliate
+_Tetrahymena thermophila_ by using the organism ('ORGN') search field:
+
+
+```{r, Tt}
+entrez_search(db="sra",
+ term="Tetrahymena thermophila[ORGN]",
+ retmax=0)
+```
+
+We can narrow our focus to only those records that have been added recently (using the colon to
+specify a range of values):
+
+
+```{r, Tt2}
+entrez_search(db="sra",
+ term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
+ retmax=0)
+```
+
+Or include recent records for either _T. thermophila_ or its close relative _T.
+borealis_ (using parentheses to make ANDs and ORs explicit):
+
+
+```{r, Tt3}
+entrez_search(db="sra",
+ term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
+ retmax=0)
+```
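+
+Terms like the one above can also be assembled programmatically, which helps
+when the same query is run for many organisms or date ranges. The following is
+a minimal sketch: `field_term()` and `any_of()` are hypothetical helpers written
+for this example (they are not part of `rentrez`), and the block is not executed
+as part of the vignette:
+
+```r
+# Hypothetical helpers, not rentrez functions.
+# field_term() wraps a query in the `query[SEARCH FIELD]` syntax.
+field_term <- function(query, field) paste0(query, "[", field, "]")
+# any_of() joins several terms with OR and wraps them in parentheses.
+any_of <- function(...) paste0("(", paste(..., sep = " OR "), ")")
+
+organisms <- any_of(field_term("Tetrahymena thermophila", "ORGN"),
+                    field_term("Tetrahymena borealis", "ORGN"))
+term <- paste(organisms, "AND", field_term("2013:2015", "PDAT"))
+term
+```
+
+The resulting string can be passed as the `term` argument of `entrez_search()`.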
+
+The set of search terms available varies between databases. You can get a list
+of available terms for any given database with `entrez_db_searchable()`:
+
+```{r, sra_searchable}
+entrez_db_searchable("sra")
+```
+
+### Precise queries using MeSH terms
+
+In addition to the search terms described above, the NCBI allows searches using
+[Medical Subject Heading (MeSH)](http://www.ncbi.nlm.nih.gov/mesh) terms. These
+terms create a 'controlled vocabulary', and allow users to make very finely
+controlled queries of databases.
+
+For instance, if you were interested in reviewing studies on how a class of
+anti-malarial drugs called Folic Acid Antagonists work against _Plasmodium vivax_ (a
+particular species of malarial parasite), you could use this search:
+
+```{r, mesh}
+entrez_search(db = "pubmed",
+ term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
+```
+
+The complete set of MeSH terms is available as a database from the NCBI. That
+means it is possible to download detailed information about each term and find
+the ways in which terms relate to each other using `rentrez`. You can search
+for specific terms with `entrez_search(db="mesh", term =...)` and learn about the
+results of your search using the tools described below.
+
+### Advanced counting
+
+As you can see above, the object returned by `entrez_search()` includes the
+number of records matching a given search. This means you can learn a little
+about the composition of, or trends in, the records stored in the NCBI's
+databases using only the search utility. For instance, let's track the rise of
+the scientific buzzword "connectome" in PubMed, programmatically creating
+search terms for the `PDAT` field:
+
+```{r, connectome, fig.width=5, fig.height=4, fig.align='center'}
+search_year <- function(year, term){
+ query <- paste(term, "AND (", year, "[PDAT])")
+ entrez_search(db="pubmed", term=query, retmax=0)$count
+}
+
+year <- 2008:2014
+papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)
+
+plot(year, papers, type='b', main="The Rise of the Connectome")
+```
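+
+Because `paste()` separates its arguments with spaces, the function above
+builds terms such as `Connectome AND ( 2008 [PDAT])`. If you would rather
+inspect tidier queries before sending them, one option (a sketch only, not
+executed in this vignette; `queries` is a name introduced for this example) is
+to build them all at once with `sprintf()`:
+
+```r
+year <- 2008:2014
+# One query per year; %d is safe here because `year` is an integer vector
+queries <- sprintf("%s AND (%d[PDAT])", "Connectome", year)
+queries[1]
+length(queries)
+```
+
+Each element of `queries` could then be used as the `term` argument in a call
+to `entrez_search()` with `retmax=0`.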
+
+## Finding cross-references: `entrez_link()`
+
+
+One of the strengths of the NCBI databases is the degree to which records of one
+type are connected to other records within the NCBI or to external data
+sources. The function `entrez_link()` allows users to discover these links
+between records.
+
+### My god, it's full of links
+
+To get an idea of the degree to which records in the NCBI are cross-linked we
+can find all NCBI data associated with a single gene (in this case the
+Amyloid Beta Precursor gene, the product of which is associated with the
+plaques that form in the brains of Alzheimer's Disease patients).
+
+The function `entrez_link()` can be used to find cross-referenced records. In
+the most basic case we need to provide an ID (`id`), the database from which this
+ID comes (`dbfrom`) and the name of a database in which to find linked records (`db`).
+If we set this last argument to 'all' we can find links in multiple databases:
+
+```{r elink0}
+all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
+all_the_links
+```
+Just as with `entrez_search` the returned object behaves like a list, and we can
+learn a little about its contents by printing it. In this case, all of the
+information is in `links` (and there are a lot of them!):
+
+
+```{r elink_link}
+all_the_links$links
+```
+The names of the list elements are in the format `[source_database]_[linked_database]`
+and the elements themselves contain a vector of linked-IDs. So, if we want to
+find open access publications associated with this gene we could get linked records
+in PubMed Central:
+
+```{r, elink_pmc}
+all_the_links$links$gene_pmc[1:10]
+```
+
+Or if we were interested in this gene's role in diseases we could find links to ClinVar:
+
+```{r, elink_omim}
+all_the_links$links$gene_clinvar
+
+```
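+
+Since `links` behaves like a named list of ID vectors, ordinary list tools can
+summarise it. The sketch below counts the linked records per database. It uses
+`mock_links`, a hand-made stand-in with invented IDs, so that it can be run
+without contacting the NCBI (the block is not executed in this vignette); with
+real data you would use `all_the_links$links` instead:
+
+```r
+# Stand-in for all_the_links$links; the IDs are invented for illustration
+mock_links <- list(gene_pmc               = c("1", "2", "3"),
+                   gene_clinvar           = c("10", "11"),
+                   gene_nuccore_refseqrna = "20")
+
+# Number of linked IDs in each element, largest first
+n_links <- vapply(mock_links, length, integer(1))
+sort(n_links, decreasing = TRUE)
+```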
+
+### Narrowing our focus
+
+If we know beforehand what sort of links we'd like to find, we can
+use the `db` argument to narrow the focus of a call to `entrez_link`.
+
+For instance, say we are interested in knowing about all of the
+RNA transcripts associated with the Amyloid Beta Precursor gene in humans.
+Transcript sequences are stored in the nucleotide database (referred
+to as `nuccore` in EUtils), so to find transcripts associated with a given gene
+we need to set `dbfrom=gene` and `db=nuccore`.
+
+```{r, elink1}
+nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
+nuc_links
+nuc_links$links
+```
+The object we get back contains links to the nucleotide database generally, but
+also to special subsets of that database like [refseq](http://www.ncbi.nlm.nih.gov/refseq/).
+We can take advantage of this narrower set of links to find IDs that match unique
+transcripts from our gene of interest.
+
+```{r, elinik_refseqs}
+nuc_links$links$gene_nuccore_refseqrna
+```
+We can use these ids in calls to `entrez_fetch()` or `entrez_summary()` to learn
+more about the transcripts they represent.
+
+### External links
+
+In addition to finding data within the NCBI, `entrez_link` can turn up
+connections to external databases. Perhaps the most interesting example is
+finding links to the full text of papers in PubMed. For example, when I wrote
+this document the first paper linked to Amyloid Beta Precursor had a unique ID of
+`25500142`. We can find links to the full text of that paper with `entrez_link`
+by setting the `cmd` argument to 'llinks':
+
+```{r, outlinks}
+paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
+paper_links
+```
+
+Each element of the `linkouts` object contains information about an external
+source of data on this paper:
+
+```{r, urls}
+paper_links$linkouts
+```
+
+Each of those linkout objects contains quite a lot of information, but the URL
+is probably the most useful. For that reason, `rentrez` provides the
+function `linkout_urls` to make extracting just the URL simple:
+
+```{r just_urls}
+linkout_urls(paper_links)
+```
+
+The full list of options for the `cmd` argument is given in the in-line
+documentation (`?entrez_link`). If you are interested in finding full text
+records for a large number of articles, check out the package
+[fulltext](https://github.com/ropensci/fulltext), which makes use of multiple
+sources (including the NCBI) to discover full-text articles.
+
+### Using more than one ID
+
+It is possible to pass more than one ID to `entrez_link()`. By default, doing so
+will give you a single elink object containing the complete set of links for
+_all_ of the IDs that you specified. So, if you were looking for protein IDs
+related to specific genes you could do:
+
+```{r, multi_default}
+all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
+all_links_together
+all_links_together$links$gene_protein
+```
+
+Although this behaviour might sometimes be useful, it means we've lost track of
+which `protein` ID is linked to which `gene` ID. To retain that information we
+can set `by_id` to `TRUE`. This gives us a list of elink objects, each one
+containing links from a single `gene` ID:
+
+```{r, multi_byid}
+all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
+all_links_sep
+lapply(all_links_sep, function(x) x$links$gene_protein)
+```
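+
+A list like this can be flattened into a two-column table that records which
+protein belongs to which gene. The sketch below is not executed in this
+vignette. It uses `mock_sep`, a hand-made stand-in for `all_links_sep` with
+invented protein IDs, and it assumes that the elink objects come back in the
+same order as the IDs that were supplied (worth checking against your own
+results):
+
+```r
+gene_ids <- c("93100", "223646")
+
+# Stand-in for all_links_sep; the protein IDs are invented for illustration
+mock_sep <- list(
+    list(links = list(gene_protein = c("p1", "p2"))),
+    list(links = list(gene_protein = "p3"))
+)
+
+proteins <- lapply(mock_sep, function(x) x$links$gene_protein)
+
+# Repeat each gene ID once per linked protein, then line the two columns up
+mapping <- data.frame(gene    = rep(gene_ids, lengths(proteins)),
+                      protein = unlist(proteins),
+                      stringsAsFactors = FALSE)
+mapping
+```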
+
+## Getting summary data: `entrez_summary()`
+
+Having found the unique IDs for some records via `entrez_search` or `entrez_link()`, you are
+probably going to want to learn something about them. The `Eutils` API has two
+ways to get information about a record. `entrez_fetch()` returns 'full' records
+in varying formats and `entrez_summary()` returns less information about each
+record, but in a relatively simple format. Very often the summary records have the information
+you are after, so `rentrez` provides functions to parse and summarise summary
+records.
+
+
+### The summary record
+
+`entrez_summary()` takes a vector of unique IDs for the records you want to get
+summary information from. Let's start by finding out something about the paper
+describing [Taxize](https://github.com/ropensci/taxize), using its PubMed ID:
+
+
+```{r, Summ_1}
+taxize_summ <- entrez_summary(db="pubmed", id=24555091)
+taxize_summ
+```
+
+Once again, the object returned by `entrez_summary` behaves like a list, so you can extract
+elements using `$`. For instance, we could convert our PubMed ID to another
+article identifier...
+
+```{r, Summ_2}
+taxize_summ$articleids
+```
+...or see how many times the article has been cited in PubMed Central papers:
+
+```{r, Summ_3}
+taxize_summ$pmcrefcount
+```
+
+### Dealing with many records
+
+If you give `entrez_summary()` a vector with more than one ID you'll get a
+list of summary records back. Let's get those _Plasmodium vivax_ papers we found
+in the `entrez_search()` section back, and fetch some summary data on each paper:
+
+```{r, multi_summ}
+vivax_search <- entrez_search(db = "pubmed",
+ term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
+multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
+```
+
+`rentrez` provides a helper function, `extract_from_esummary()` that takes one
+or more elements from every summary record in one of these lists. Here it is
+working with one...
+
+```{r, multi_summ2}
+extract_from_esummary(multi_summs, "fulljournalname")
+```
+... and several elements:
+
+```{r, multi_summ3}
+date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
+knitr::kable(head(t(date_and_cite)), row.names=FALSE)
+```
+
+## Fetching full records: `entrez_fetch()`
+
+As useful as the summary records are, sometimes they just don't have the
+information that you need. If you want a complete representation of a record you
+can use `entrez_fetch`, using the argument `rettype` to specify the format you'd
+like the record in.
+
+### Fetch DNA sequences in fasta format
+
+Let's extend the example given in the `entrez_link()` section about finding
+transcripts for a given gene. This time we will fetch cDNA sequences of those
+transcripts. We can start by repeating the steps in the earlier example
+to get nucleotide IDs for refseq transcripts of two genes:
+
+```{r, transcript_ids}
+gene_ids <- c(351, 11647)
+linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
+linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
+head(linked_transripts)
+```
+
+Now we can get our sequences with `entrez_fetch`, setting `rettype` to "fasta"
+(the list of formats available for [each database is given in this table](http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/)):
+
+```{r fetch_fasta}
+all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
+class(all_recs)
+nchar(all_recs)
+```
+
+Congratulations, now you have a really huge character vector! Rather than
+printing all those thousands of bases we can take a peek at the top of the file:
+
+```{r, peak}
+cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
+```
+
+If we wanted to use these sequences in some other application we could write them
+to file:
+
+```r
+write(all_recs, file="my_transcripts.fasta")
+```
+
+Alternatively, if you want to use them within an R session
+you could write them to a temporary file and then read that file back in. In this case I'm using `read.dna()` from the
+phylogenetics package ape (but not executing the code block in this vignette, so
+you don't have to install that package):
+
+```r
+temp <- tempfile()
+write(all_recs, temp)
+parsed_recs <- ape::read.dna(temp, format="fasta")
+```
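+
+If you only need the sequences as plain strings, the fasta text can also be
+split up with base R. The sketch below is not executed in this vignette and
+works on `fasta_text`, a small made-up string standing in for `all_recs`. It
+assumes every header line is followed by at least one sequence line:
+
+```r
+# Made-up stand-in for the character vector returned by entrez_fetch()
+fasta_text <- ">seq1 first record\nACGT\nACGT\n\n>seq2 second record\nGGCC\n"
+
+fasta_lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
+fasta_lines <- fasta_lines[nzchar(fasta_lines)]    # drop blank lines
+
+is_header <- startsWith(fasta_lines, ">")
+record    <- cumsum(is_header)                     # record number of each line
+
+# Paste the sequence lines of each record together, named by its header
+seqs <- vapply(split(fasta_lines[!is_header], record[!is_header]),
+               paste, character(1), collapse = "")
+names(seqs) <- sub("^>", "", fasta_lines[is_header])
+nchar(seqs)
+```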
+
+### Fetch a parsed XML document
+
+Most of the NCBI's databases can return records in XML format. In addition to
+downloading the text-representation of these files, `entrez_fetch()` can return
+objects parsed by the `XML` package. As an example, we can check out the Taxonomy
+database's record for (did I mention they are amazing....) _Tetrahymena
+thermophila_, specifying we want the result to be parsed by setting
+`parsed=TRUE`:
+
+```{r, Tt_tax}
+Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
+tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
+class(tax_rec)
+```
+
+The package XML (which you have if you have installed `rentrez`) provides
+functions to get information from these files. For relatively simple records
+like this one you can use `XML::xmlToList`:
+
+```{r, Tt_list}
+tax_list <- XML::xmlToList(tax_rec)
+tax_list$Taxon$GeneticCode
+```
+
+For more complex records, which generate deeply-nested lists, you can use
+[XPath expressions](https://en.wikipedia.org/wiki/XPath) along with the function
+`XML::xpathSApply` or the extraction operators `[` and `[[` to extract specific parts of the
+file. For instance, we can get the scientific name of each taxon in _T.
+thermophila_'s lineage by specifying a path through the XML:
+
+```{r, Tt_path}
+tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
+tt_lineage[1:4]
+```
+
+As the name suggests, `XML::xpathSApply()` is a counterpart of base R's
+`sapply`, and can be used to apply a function to
+nodes in an XML object. A particularly useful function to apply is `XML::xmlValue`,
+which returns the content of the node:
+
+```{r, Tt_apply}
+XML::xpathSApply(tax_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
+```
+There are a few more complex examples of using `XPath` [on the rentrez wiki](https://github.com/ropensci/rentrez/wiki).
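+
+Two XPath queries over the same nodes can be combined to pair values up. The
+sketch below is not executed in this vignette. It parses `toy_xml`, a
+hand-written document that is only loosely shaped like a taxonomy record, and
+it assumes every `Taxon` node has both a `ScientificName` and a `Rank` child so
+that the two result vectors line up:
+
+```r
+# Hand-written toy document, not a real NCBI record
+toy_xml <- paste0(
+    "<TaxaSet><Taxon><LineageEx>",
+    "<Taxon><ScientificName>Eukaryota</ScientificName><Rank>superkingdom</Rank></Taxon>",
+    "<Taxon><ScientificName>Ciliophora</ScientificName><Rank>phylum</Rank></Taxon>",
+    "</LineageEx></Taxon></TaxaSet>")
+toy_rec <- XML::xmlParse(toy_xml, asText = TRUE)
+
+taxon_names <- XML::xpathSApply(toy_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
+taxon_ranks <- XML::xpathSApply(toy_rec, "//LineageEx/Taxon/Rank", XML::xmlValue)
+
+# Name each taxon by its rank so it can be looked up directly
+lineage <- setNames(taxon_names, taxon_ranks)
+lineage[["phylum"]]
+```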
+
+<a name="web_history"></a>
+
+## Using NCBI's Web History features
+
+When you are dealing with very large queries it can be time consuming to pass
+long vectors of unique IDs to and from the NCBI. To avoid this problem, the NCBI
+provides a feature called "web history" which allows users to store IDs on the
+NCBI servers then refer to them in future calls.
+
+### Post a set of IDs to the NCBI for later use: `entrez_post()`
+
+If you have a list of many NCBI IDs that you want to use later on, you can post
+them to the NCBI's servers. In order to provide a brief example, I'm going to post just one
+ID, the `omim` identifier for asthma:
+
+```{r, asthma}
+upload <- entrez_post(db="omim", id=600807)
+upload
+```
+The NCBI sends you back some information you can use to refer to the posted IDs.
+In `rentrez`, that information is represented as a `web_history` object.
+
+### Get a `web_history` object from `entrez_search` or `entrez_link()`
+
+In addition to directly uploading IDs to the NCBI, you can use the web history
+features with `entrez_search` and `entrez_link`. For instance, imagine you wanted to
+find all of the sequences of the widely-studied gene COI from all snails
+(which are members of the taxonomic group Gastropoda):
+
+```{r, snail_search}
+entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]")
+```
+
+That's a lot of sequences! If you really wanted to download all of these it
+would be a good idea to save all those IDs to the server by setting
+`use_history` to `TRUE` (note you now get a `web_history` object along with your
+normal search result):
+
+```{r, snail_history}
+snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)
+snail_coi
+snail_coi$web_history
+```
+
+Similarly, `entrez_link()` can return `web_history` objects by using the `cmd`
+`neighbor_history`. Let's find genetic variants (from the clinvar database)
+associated with asthma (using the same OMIM ID we identified earlier):
+
+
+```{r, asthma_links}
+asthma_clinvar <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", id=600807)
+asthma_clinvar$web_histories
+```
+
+As you can see, instead of returning lists of IDs for each linked database (as
+it would by default), `entrez_link()` now returns a list of web_histories.
+
+### Use a `web_history` object
+
+Once you have those IDs stored on the NCBI's servers, you are going to want to
+do something with them. The functions `entrez_fetch()`, `entrez_summary()` and
+`entrez_link()` can all use `web_history` objects in exactly the same way they
+use IDs.
+
+So, we could repeat the last example (finding variants linked to asthma), but this
+time using the ID we uploaded earlier:
+
+```{r, asthma_links_upload}
+asthma_variants <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", web_history=upload)
+asthma_variants
+```
+
+... if we want to get some genetic information about these variants we need to
+map our clinvar IDs to SNP IDs:
+
+
+```{r, links}
+snp_links <- entrez_link(dbfrom="clinvar", db="snp",
+ web_history=asthma_variants$web_histories$omim_clinvar,
+ cmd="neighbor_history")
+snp_summ <- entrez_summary(db="snp", web_history=snp_links$web_histories$clinvar_snp)
+knitr::kable(extract_from_esummary(snp_summ, c("chr", "fxn_class", "global_maf")))
+```
+
+If you really wanted to you could also use `web_history` objects to download all those thousands of COI sequences.
+When downloading large sets of data, it is a good idea to take advantage of the
+arguments `retmax` and `retstart` to split the request up into smaller chunks.
+For instance, we could get the first 200 sequences in 50-sequence chunks:
+
+(note: this code block is not executed as part of the vignette to save time and bandwidth):
+
+
+```r
+# retstart is zero-based, so the four chunk offsets are 0, 50, 100 and 150
+for( seq_start in seq(0,150,50)){
+    recs <- entrez_fetch(db="nuccore", web_history=snail_coi$web_history,
+                         rettype="fasta", retmax=50, retstart=seq_start)
+    cat(recs, file="snail_coi.fasta", append=TRUE)
+    cat(seq_start+50, "sequences downloaded\r")
+}
+```
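+
+For an arbitrary number of records the chunk offsets can be computed rather
+than typed in. The sketch below is not executed in this vignette.
+`chunk_starts()` and `chunk_sizes()` are hypothetical helpers written for this
+example (not part of `rentrez`), and they assume that `retstart` counts from
+zero, as the EUtils documentation describes:
+
+```r
+# Hypothetical helpers, not rentrez functions.
+# Zero-based offset of the first record in each chunk.
+chunk_starts <- function(total, chunk_size) seq(0, total - 1, by = chunk_size)
+# Number of records in each chunk; the last chunk may be smaller.
+chunk_sizes <- function(total, chunk_size) {
+    pmin(chunk_size, total - chunk_starts(total, chunk_size))
+}
+
+chunk_starts(200, 50)
+chunk_sizes(120, 50)
+```
+
+Each offset can be passed as `retstart` and each size as `retmax` in a call to
+`entrez_fetch()`.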
+
+## What next?
+
+This tutorial has introduced you to the core functions of `rentrez`, and there are
+almost limitless ways that you could put them together. [Check out the wiki](https://github.com/ropensci/rentrez/wiki)
+for more specific examples, and be sure to read the inline documentation for
+each function. If you run into a problem with rentrez, or just need help with the
+package and `Eutils`, please contact us by opening an issue at the [github
+repository](https://github.com/ropensci/rentrez/issues).
+
+
+
diff --git a/inst/doc/rentrez_tutorial.html b/inst/doc/rentrez_tutorial.html
new file mode 100644
index 0000000..7849d85
--- /dev/null
+++ b/inst/doc/rentrez_tutorial.html
@@ -0,0 +1,722 @@
+<!DOCTYPE html>
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+
+<head>
+
+<meta charset="utf-8">
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+<meta name="generator" content="pandoc" />
+
+<meta name="viewport" content="width=device-width, initial-scale=1">
+
+<meta name="author" content="David winter" />
+
+<meta name="date" content="2016-10-26" />
+
+<title>Rentrez Tutorial</title>
+
+
+
+<style type="text/css">code{white-space: pre;}</style>
+<style type="text/css">
+table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
+ margin: 0; padding: 0; vertical-align: baseline; border: none; }
+table.sourceCode { width: 100%; line-height: 100%; }
+td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
+td.sourceCode { padding-left: 5px; }
+code > span.kw { color: #007020; font-weight: bold; }
+code > span.dt { color: #902000; }
+code > span.dv { color: #40a070; }
+code > span.bn { color: #40a070; }
+code > span.fl { color: #40a070; }
+code > span.ch { color: #4070a0; }
+code > span.st { color: #4070a0; }
+code > span.co { color: #60a0b0; font-style: italic; }
+code > span.ot { color: #007020; }
+code > span.al { color: #ff0000; font-weight: bold; }
+code > span.fu { color: #06287e; }
+code > span.er { color: #ff0000; font-weight: bold; }
+</style>
+
+
+
+<link href="data:text/css,body%20%7B%0A%20%20background%2Dcolor%3A%20%23fff%3B%0A%20%20margin%3A%201em%20auto%3B%0A%20%20max%2Dwidth%3A%20700px%3B%0A%20%20overflow%3A%20visible%3B%0A%20%20padding%2Dleft%3A%202em%3B%0A%20%20padding%2Dright%3A%202em%3B%0A%20%20font%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0A%20%20font%2Dsize%3A%2014px%3B%0A%20%20line%2Dheight%3A%201%2E35%3B%0A%7D%0A%0A%23header%20%7B%0A%20%20text%2Dalign%3A% [...]
+
+</head>
+
+<body>
+
+
+
+
+<h1 class="title toc-ignore">Rentrez Tutorial</h1>
+<h4 class="author"><em>David Winter</em></h4>
+<h4 class="date"><em>2016-10-26</em></h4>
+
+
+<div id="TOC">
+<ul>
+<li><a href="#introduction-the-ncbi-entrez-and-rentrez.">Introduction: The NCBI, entrez and <code>rentrez</code>.</a></li>
+<li><a href="#getting-started-with-the-rentrez">Getting started with rentrez</a></li>
+<li><a href="#searching-databases-entrez_search">Searching databases: <code>entrez_search()</code></a><ul>
+<li><a href="#building-search-terms">Building search terms</a></li>
+<li><a href="#precise-queries-using-mesh-terms">Precise queries using MeSH terms</a></li>
+<li><a href="#advanced-counting">Advanced counting</a></li>
+</ul></li>
+<li><a href="#finding-cross-references-entrez_link">Finding cross-references: <code>entrez_link()</code></a><ul>
+<li><a href="#my-god-its-full-of-links">My god, it’s full of links</a></li>
+<li><a href="#narrowing-our-focus">Narrowing our focus</a></li>
+<li><a href="#external-links">External links</a></li>
+<li><a href="#using-more-than-one-id">Using more than one ID</a></li>
+</ul></li>
+<li><a href="#getting-summary-data-entrez_summary">Getting summary data: <code>entrez_summary()</code></a><ul>
+<li><a href="#the-summary-record">The summary record</a></li>
+<li><a href="#dealing-with-many-records">Dealing with many records</a></li>
+</ul></li>
+<li><a href="#fetching-full-records-entrez_fetch">Fetching full records: <code>entrez_fetch()</code></a><ul>
+<li><a href="#fetch-dna-sequences-in-fasta-format">Fetch DNA sequences in fasta format</a></li>
+<li><a href="#fetch-a-parsed-xml-document">Fetch a parsed XML document</a></li>
+</ul></li>
+<li><a href="#using-ncbis-web-history-features">Using NCBI’s Web History features</a><ul>
+<li><a href="#post-a-set-of-ids-to-the-ncbi-for-later-use-entrez_post">Post a set of IDs to the NCBI for later use: <code>entrez_post()</code></a></li>
+<li><a href="#get-a-web_history-object-from-entrez_search-or-entrez_link">Get a <code>web_history</code> object from <code>entrez_search</code> or <code>entrez_link()</code></a></li>
+<li><a href="#use-a-web_history-object">Use a <code>web_history</code> object</a></li>
+</ul></li>
+<li><a href="#what-next">What next ?</a></li>
+</ul>
+</div>
+
+<div id="introduction-the-ncbi-entrez-and-rentrez." class="section level2">
+<h2>Introduction: The NCBI, entrez and <code>rentrez</code>.</h2>
+<p>The NCBI shares a <em>lot</em> of data. At the time this document was compiled, there were 26.6 million papers in <a href="http://www.ncbi.nlm.nih.gov/pubmed/">PubMed</a>, including 4.2 million full-text records available in <a href="http://www.ncbi.nlm.nih.gov/pmc/">PubMed Central</a>. <a href="http://www.ncbi.nlm.nih.gov/nuccore">The NCBI Nucleotide Database</a> (which includes GenBank) has data for 219.2 million different sequences, and <a href="http://www.ncbi.nlm.nih.gov/snp/" [...]
+<p>The NCBI makes this data available through a <a href="http://www.ncbi.nlm.nih.gov/">web interface</a>, an <a href="ftp://ftp.ncbi.nlm.nih.gov/">FTP server</a> and through a REST API called the <a href="http://www.ncbi.nlm.nih.gov/books/NBK25500/">Entrez Utilities</a> (<code>Eutils</code> for short). This package provides functions to use that API, allowing users to gather and combine data from multiple NCBI databases in the comfort of an R session or script.</p>
+</div>
+<div id="getting-started-with-the-rentrez" class="section level2">
+<h2>Getting started with rentrez</h2>
+<p>To make the most of all the data the NCBI shares you need to know a little about their databases, the records they contain and the ways you can find those records. The <a href="http://www.ncbi.nlm.nih.gov/home/documentation.shtml">NCBI provides extensive documentation for each of their databases</a> and for the <a href="http://www.ncbi.nlm.nih.gov/books/NBK25501/">EUtils API that <code>rentrez</code> takes advantage of</a>. There are also some helper functions in <code>rentrez</code> [...]
+<p>First, you can use <code>entrez_dbs()</code> to find the list of available databases:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_dbs</span>()</code></pre>
+<pre><code>## [1] "pubmed" "protein" "nuccore"
+## [4] "nucleotide" "nucgss" "nucest"
+## [7] "structure" "sparcle" "genome"
+## [10] "annotinfo" "assembly" "bioproject"
+## [13] "biosample" "blastdbinfo" "books"
+## [16] "cdd" "clinvar" "clone"
+## [19] "gap" "gapplus" "grasp"
+## [22] "dbvar" "gene" "gds"
+## [25] "geoprofiles" "homologene" "medgen"
+## [28] "mesh" "ncbisearch" "nlmcatalog"
+## [31] "omim" "orgtrack" "pmc"
+## [34] "popset" "probe" "proteinclusters"
+## [37] "pcassay" "biosystems" "pccompound"
+## [40] "pcsubstance" "pubmedhealth" "seqannot"
+## [43] "snp" "sra" "taxonomy"
+## [46] "unigene" "gencoll" "gtr"</code></pre>
+<p>There is a set of functions with names starting with <code>entrez_db_</code> that can be used to gather more information about each of these databases:</p>
+<p><strong>Functions that help you learn about NCBI databases</strong></p>
+<table>
+<thead>
+<tr class="header">
+<th align="left">Function name</th>
+<th align="left">Return</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td align="left"><code>entrez_db_summary()</code></td>
+<td align="left">Brief description of what the database is</td>
+</tr>
+<tr class="even">
+<td align="left"><code>entrez_db_searchable()</code></td>
+<td align="left">Set of search terms that can be used with this database</td>
+</tr>
+<tr class="odd">
+<td align="left"><code>entrez_db_links()</code></td>
+<td align="left">Set of databases that might contain linked records</td>
+</tr>
+</tbody>
+</table>
+<p>For instance, we can get a description of the somewhat cryptically named database ‘cdd’…</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_db_summary</span>(<span class="st">"cdd"</span>)</code></pre>
+<pre><code>## DbName: cdd
+## MenuName: Conserved Domains
+## Description: Conserved Domain Database
+## DbBuild: Build160706-0602.1
+## Count: 52411
+## LastUpdate: 2016/07/06 12:08</code></pre>
+<p>… or find out which search terms can be used with the Sequence Read Archive (SRA) database (which contains raw data from sequencing projects):</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_db_searchable</span>(<span class="st">"sra"</span>)</code></pre>
+<pre><code>## Searchable fields for database 'sra'
+## ALL All terms from all searchable fields
+## UID Unique number assigned to publication
+## FILT Limits the records
+## ACCN Accession number of sequence
+## TITL Words in definition line
+## PROP Classification by source qualifiers and molecule type
+## WORD Free text associated with record
+## ORGN Scientific and common names of organism, and all higher levels of taxonomy
+## AUTH Author(s) of publication
+## PDAT Date sequence added to GenBank
+## MDAT Date of last update
+## GPRJ BioProject
+## BSPL BioSample
+## PLAT Platform
+## STRA Strategy
+## SRC Source
+## SEL Selection
+## LAY Layout
+## RLEN Percent of aligned reads
+## ACS Access is public or controlled
+## ALN Percent of aligned reads
+## MBS Size in megabases</code></pre>
+<p>Just how these ‘helper’ functions might be useful will become clearer once you’ve started using <code>rentrez</code>, so let’s get started.</p>
+</div>
+<div id="searching-databases-entrez_search" class="section level2">
+<h2>Searching databases: <code>entrez_search()</code></h2>
+<p>Very often, the first thing you’ll want to do with <code>rentrez</code> is search a given NCBI database to find records that match some keywords. You can do this using the function <code>entrez_search()</code>. In the simplest case you just need to provide a database name (<code>db</code>) and a search term (<code>term</code>) so let’s search PubMed for articles about the <code>R language</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r">r_search <-<span class="st"> </span><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"pubmed"</span>, <span class="dt">term=</span><span class="st">"R Language"</span>)</code></pre>
+<p>The object returned by a search acts like a list, and you can get a summary of its contents by printing it.</p>
+<pre class="sourceCode r"><code class="sourceCode r">r_search</code></pre>
+<pre><code>## Entrez search result with 9461 hits (object contains 20 IDs and no web_history object)
+## Search term (as translated): R[All Fields] AND ("programming languages"[MeSH Te ...</code></pre>
+<p>There are a few things to note here. First, the NCBI’s server has worked out that we meant R as a programming language, and so included the <a href="http://www.ncbi.nlm.nih.gov/mesh">‘MeSH’ term</a> associated with programming languages. We’ll worry about MeSH terms and other special queries later; for now, just note that you can use this feature to check that your search term was interpreted in the way you intended. Second, there are many more ‘hits’ for this search than there ar [...]
+<p>The IDs are the most important thing returned here. They allow us to fetch records matching those IDs, gather summary data about them or find cross-referenced records in other databases. We access the IDs as a vector using the <code>$</code> operator:</p>
+<pre class="sourceCode r"><code class="sourceCode r">r_search$ids</code></pre>
+<pre><code>## [1] "27774058" "27771785" "27771004" "27694991" "27770071" "27491423"
+## [7] "27762050" "27760879" "27509845" "27312095" "27755987" "27755648"
+## [13] "27751661" "27346093" "27059941" "26452616" "27747969" "27746120"
+## [19] "25544604" "27744111"</code></pre>
+<p>If we want to get more than 20 IDs we can do so by increasing the <code>retmax</code> argument.</p>
+<pre class="sourceCode r"><code class="sourceCode r">another_r_search <-<span class="st"> </span><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"pubmed"</span>, <span class="dt">term=</span><span class="st">"R Language"</span>, <span class="dt">retmax=</span><span class="dv">40</span>)
+another_r_search</code></pre>
+<pre><code>## Entrez search result with 9461 hits (object contains 40 IDs and no web_history object)
+## Search term (as translated): R[All Fields] AND ("programming languages"[MeSH Te ...</code></pre>
+<p>If we want to get IDs for all of the thousands of records that match this search, we can use the NCBI’s web history feature <a href="#using-ncbis-web-history-features">described below</a>.</p>
+<div id="building-search-terms" class="section level3">
+<h3>Building search terms</h3>
+<p>The EUtils API uses a special syntax to build search terms. You can restrict a query to a specific search field using the format <code>query[SEARCH FIELD]</code>, and combine multiple such searches using the boolean operators <code>AND</code>, <code>OR</code> and <code>NOT</code>.</p>
+<p>For instance, we can find next generation sequence datasets for the (amazing…) ciliate <em>Tetrahymena thermophila</em> by using the organism (‘ORGN’) search field:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"sra"</span>,
+ <span class="dt">term=</span><span class="st">"Tetrahymena thermophila[ORGN]"</span>,
+ <span class="dt">retmax=</span><span class="dv">0</span>)</code></pre>
+<pre><code>## Entrez search result with 165 hits (object contains 0 IDs and no web_history object)
+## Search term (as translated): "Tetrahymena thermophila"[Organism]</code></pre>
+<p>We can narrow our focus to only those records that have been added recently (using the colon to specify a range of values):</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"sra"</span>,
+ <span class="dt">term=</span><span class="st">"Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]"</span>,
+ <span class="dt">retmax=</span><span class="dv">0</span>)</code></pre>
+<pre><code>## Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
+## Search term (as translated): "Tetrahymena thermophila"[Organism] AND 2013[PDAT] ...</code></pre>
+<p>Or include recent records for either <em>T. thermophila</em> or its close relative <em>T. borealis</em> (using parentheses to make ANDs and ORs explicit).</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"sra"</span>,
+ <span class="dt">term=</span><span class="st">"(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]"</span>,
+ <span class="dt">retmax=</span><span class="dv">0</span>)</code></pre>
+<pre><code>## Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
+## Search term (as translated): ("Tetrahymena thermophila"[Organism] OR "Tetrahyme ...</code></pre>
+<p>The set of search terms available varies between databases. You can get a list of available terms for any given database with <code>entrez_db_searchable()</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_db_searchable</span>(<span class="st">"sra"</span>)</code></pre>
+<pre><code>## Searchable fields for database 'sra'
+## ALL All terms from all searchable fields
+## UID Unique number assigned to publication
+## FILT Limits the records
+## ACCN Accession number of sequence
+## TITL Words in definition line
+## PROP Classification by source qualifiers and molecule type
+## WORD Free text associated with record
+## ORGN Scientific and common names of organism, and all higher levels of taxonomy
+## AUTH Author(s) of publication
+## PDAT Date sequence added to GenBank
+## MDAT Date of last update
+## GPRJ BioProject
+## BSPL BioSample
+## PLAT Platform
+## STRA Strategy
+## SRC Source
+## SEL Selection
+## LAY Layout
+## RLEN Percent of aligned reads
+## ACS Access is public or controlled
+## ALN Percent of aligned reads
+## MBS Size in megabases</code></pre>
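+<p>When search terms are assembled programmatically, it can help to wrap the <code>query[SEARCH FIELD]</code> syntax in small helpers. The functions below (<code>field_term()</code> and <code>any_of()</code>) are hypothetical names used only for this sketch and are not part of <code>rentrez</code>; they use base R string functions to rebuild the two-species <em>Tetrahymena</em> query shown earlier in this section:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper (not part of rentrez): build a "value[FIELD]" term
+field_term <- function(value, field) paste0(value, "[", field, "]")
+# Hypothetical helper: join several terms with OR and wrap them in parentheses
+any_of <- function(...) paste0("(", paste(c(...), collapse = " OR "), ")")
+
+query <- paste(
+    any_of(field_term("Tetrahymena thermophila", "ORGN"),
+           field_term("Tetrahymena borealis", "ORGN")),
+    "AND",
+    field_term("2013:2015", "PDAT")
+)
+query</code></pre>
+<pre><code>## [1] "(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]"</code></pre>
+<p>The resulting string could then be passed as the <code>term</code> argument of <code>entrez_search()</code>, for example <code>entrez_search(db="sra", term=query, retmax=0)</code>.</p>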
+</div>
+<div id="precise-queries-using-mesh-terms" class="section level3">
+<h3>Precise queries using MeSH terms</h3>
+<p>In addition to the search terms described above, the NCBI allows searches using <a href="http://www.ncbi.nlm.nih.gov/mesh">Medical Subject Heading (MeSH)</a> terms. These terms create a ‘controlled vocabulary’, and allow users to make very finely controlled queries of databases.</p>
+<p>For instance, if you were interested in reviewing studies on how a class of anti-malarial drugs called Folic Acid Antagonists work against <em>Plasmodium vivax</em> (a particular species of malarial parasite), you could use this search:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_search</span>(<span class="dt">db =</span> <span class="st">"pubmed"</span>,
+ <span class="dt">term =</span> <span class="st">"(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])"</span>)</code></pre>
+<pre><code>## Entrez search result with 12 hits (object contains 12 IDs and no web_history object)
+## Search term (as translated): "malaria, vivax"[MeSH Terms] AND "folic acid antag ...</code></pre>
+<p>The complete set of MeSH terms is available as a database from the NCBI. That means it is possible to download detailed information about each term and find the ways in which terms relate to each other using <code>rentrez</code>. You can search for specific terms with <code>entrez_search(db="mesh", term =...)</code> and learn about the results of your search using the tools described below.</p>
+</div>
+<div id="advanced-counting" class="section level3">
+<h3>Advanced counting</h3>
+<p>As you can see above, the object returned by <code>entrez_search()</code> includes the number of records matching a given search. This means you can learn a little about the composition of, or trends in, the records stored in the NCBI’s databases using only the search utility. For instance, let’s track the rise of the scientific buzzword “connectome” in PubMed, programmatically creating search terms for the <code>PDAT</code> field:</p>
+<pre class="sourceCode r"><code class="sourceCode r">search_year <-<span class="st"> </span>function(year, term){
+ query <-<span class="st"> </span><span class="kw">paste</span>(term, <span class="st">"AND ("</span>, year, <span class="st">"[PDAT])"</span>)
+ <span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"pubmed"</span>, <span class="dt">term=</span>query, <span class="dt">retmax=</span><span class="dv">0</span>)$count
+}
+
+year <-<span class="st"> </span><span class="dv">2008</span>:<span class="dv">2014</span>
+papers <-<span class="st"> </span><span class="kw">sapply</span>(year, search_year, <span class="dt">term=</span><span class="st">"Connectome"</span>, <span class="dt">USE.NAMES=</span><span class="ot">FALSE</span>)
+
+<span class="kw">plot</span>(year, papers, <span class="dt">type=</span><span class="st">'b'</span>, <span class="dt">main=</span><span class="st">"The Rise of the Connectome"</span>)</code></pre>
+<p><img src=" [...]
+</div>
+</div>
+<div id="finding-cross-references-entrez_link" class="section level2">
+<h2>Finding cross-references: <code>entrez_link()</code></h2>
+<p>One of the strengths of the NCBI databases is the degree to which records of one type are connected to other records within the NCBI or to external data sources. The function <code>entrez_link()</code> allows users to discover these links between records.</p>
+<div id="my-god-its-full-of-links" class="section level3">
+<h3>My god, it’s full of links</h3>
+<p>To get an idea of the degree to which records in the NCBI are cross-linked we can find all NCBI data associated with a single gene (in this case the Amyloid Beta Precursor gene, the product of which is associated with the plaques that form in the brains of Alzheimer’s Disease patients).</p>
+<p>The function <code>entrez_link()</code> can be used to find cross-referenced records. In the most basic case we need to provide an ID (<code>id</code>), the database from which this ID comes (<code>dbfrom</code>) and the name of a database in which to find linked records (<code>db</code>). If we set this last argument to ‘all’ we can find links in multiple databases:</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_the_links <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">'gene'</span>, <span class="dt">id=</span><span class="dv">351</span>, <span class="dt">db=</span><span class="st">'all'</span>)
+all_the_links</code></pre>
+<pre><code>## elink object with contents:
+## $links: IDs for linked records from NCBI
+## </code></pre>
+<p>Just as with <code>entrez_search</code> the returned object behaves like a list, and we can learn a little about its contents by printing it. In this case, all of the information is in <code>links</code> (and there’s a lot of them!):</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_the_links$links</code></pre>
+<pre><code>## elink result with information from 52 databases:
+## [1] gene_bioconcepts gene_biosystems
+## [3] gene_biosystems_all gene_clinvar
+## [5] gene_dbvar gene_gene_h3k4me3
+## [7] gene_genome gene_gtr
+## [9] gene_homologene gene_medgen_diseases
+## [11] gene_pcassay_alltarget_list gene_pcassay_alltarget_summary
+## [13] gene_pcassay_rnai gene_pcassay_target
+## [15] gene_probe gene_structure
+## [17] gene_bioproject gene_books
+## [19] gene_cdd gene_gene_neighbors
+## [21] gene_genereviews gene_genome2
+## [23] gene_geoprofiles gene_nuccore
+## [25] gene_nuccore_mgc gene_nuccore_pos
+## [27] gene_nuccore_refseqgene gene_nuccore_refseqrna
+## [29] gene_nucest gene_nucest_clust
+## [31] gene_nucleotide gene_nucleotide_clust
+## [33] gene_nucleotide_mgc gene_nucleotide_mgc_url
+## [35] gene_nucleotide_pos gene_omim
+## [37] gene_pcassay_proteintarget gene_pccompound
+## [39] gene_pcsubstance gene_pmc
+## [41] gene_pmc_nucleotide gene_protein
+## [43] gene_protein_refseq gene_pubmed
+## [45] gene_pubmed_citedinomim gene_pubmed_pmc_nucleotide
+## [47] gene_pubmed_rif gene_snp
+## [49] gene_snp_geneview gene_taxonomy
+## [51] gene_unigene gene_varview</code></pre>
+<p>The names of the list elements are in the format <code>[source_database]_[linked_database]</code> and the elements themselves contain a vector of linked-IDs. So, if we want to find open access publications associated with this gene we could get linked records in PubMed Central:</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_the_links$links$gene_pmc[<span class="dv">1</span>:<span class="dv">10</span>]</code></pre>
+<pre><code>## [1] "5054717" "4944527" "4837062" "4769228" "4760399" "4751366" "4748078"
+## [8] "4667289" "4648968" "4600482"</code></pre>
+<p>Or if we were interested in this gene’s role in diseases we could find links to ClinVar:</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_the_links$links$gene_clinvar</code></pre>
+<pre><code>## [1] "253512" "253403" "236549" "236548" "236547" "221889" "160886"
+## [8] "155682" "155309" "155093" "155053" "154360" "154063" "153438"
+## [15] "152839" "151388" "150018" "149551" "149418" "149160" "149035"
+## [22] "148411" "148262" "148180" "146125" "145984" "145474" "145468"
+## [29] "145332" "145107" "144677" "144194" "127268" "98242" "98241"
+## [36] "98240" "98239" "98238" "98237" "98236" "98235" "59247"
+## [43] "59246" "59245" "59243" "59226" "59224" "59223" "59222"
+## [50] "59221" "59010" "59005" "59004" "37145" "32099" "18106"
+## [57] "18105" "18104" "18103" "18102" "18101" "18100" "18099"
+## [64] "18098" "18097" "18096" "18095" "18094" "18093" "18092"
+## [71] "18091" "18090" "18089" "18088" "18087"</code></pre>
+</div>
+<div id="narrowing-our-focus" class="section level3">
+<h3>Narrowing our focus</h3>
+<p>If we know beforehand what sort of links we’d like to find, we can use the <code>db</code> argument to narrow the focus of a call to <code>entrez_link</code>.</p>
+<p>For instance, say we are interested in knowing about all of the RNA transcripts associated with the Amyloid Beta Precursor gene in humans. Transcript sequences are stored in the nucleotide database (referred to as <code>nuccore</code> in EUtils), so to find transcripts associated with a given gene we need to set <code>dbfrom=gene</code> and <code>db=nuccore</code>.</p>
+<pre class="sourceCode r"><code class="sourceCode r">nuc_links <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">'gene'</span>, <span class="dt">id=</span><span class="dv">351</span>, <span class="dt">db=</span><span class="st">'nuccore'</span>)
+nuc_links</code></pre>
+<pre><code>## elink object with contents:
+## $links: IDs for linked records from NCBI
+## </code></pre>
+<pre class="sourceCode r"><code class="sourceCode r">nuc_links$links</code></pre>
+<pre><code>## elink result with information from 5 databases:
+## [1] gene_nuccore gene_nuccore_mgc gene_nuccore_pos
+## [4] gene_nuccore_refseqgene gene_nuccore_refseqrna</code></pre>
+<p>The object we get back contains links to the nucleotide database generally, but also to special subsets of that database like <a href="http://www.ncbi.nlm.nih.gov/refseq/">refseq</a>. We can take advantage of this narrower set of links to find IDs that match unique transcripts from our gene of interest.</p>
+<pre class="sourceCode r"><code class="sourceCode r">nuc_links$links$gene_nuccore_refseqrna</code></pre>
+<pre><code>## [1] "324021747" "324021746" "324021739" "324021737" "324021735"
+## [6] "228008405" "228008404" "228008403" "228008402" "228008401"</code></pre>
+<p>We can use these IDs in calls to <code>entrez_fetch()</code> or <code>entrez_summary()</code> to learn more about the transcripts they represent.</p>
+</div>
+<div id="external-links" class="section level3">
+<h3>External links</h3>
+<p>In addition to finding data within the NCBI, <code>entrez_link</code> can turn up connections to external databases. Perhaps the most interesting example is finding links to the full text of papers in PubMed. For example, when I wrote this document the first paper linked to Amyloid Beta Precursor had a unique ID of <code>25500142</code>. We can find links to the full text of that paper with <code>entrez_link</code> by setting the <code>cmd</code> argument to ‘llinks’:</p>
+<pre class="sourceCode r"><code class="sourceCode r">paper_links <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">"pubmed"</span>, <span class="dt">id=</span><span class="dv">25500142</span>, <span class="dt">cmd=</span><span class="st">"llinks"</span>)
+paper_links</code></pre>
+<pre><code>## elink object with contents:
+## $linkouts: links to external websites</code></pre>
+<p>Each element of the <code>linkouts</code> object contains information about an external source of data on this paper:</p>
+<pre class="sourceCode r"><code class="sourceCode r">paper_links$linkouts</code></pre>
+<pre><code>## $ID_25500142
+## $ID_25500142[[1]]
+## Linkout from Elsevier Science
+## $Url: http://linkinghub.elsevier ...
+##
+## $ID_25500142[[2]]
+## Linkout from Europe PubMed Central
+## $Url: http://europepmc.org/abstr ...
+##
+## $ID_25500142[[3]]
+## Linkout from Ovid Technologies, Inc.
+## $Url: http://ovidsp.ovid.com/ovi ...
+##
+## $ID_25500142[[4]]
+## Linkout from PubMed Central
+## $Url: https://www.ncbi.nlm.nih.g ...
+##
+## $ID_25500142[[5]]
+## Linkout from MedlinePlus Health Information
+## $Url: https://medlineplus.gov/al ...
+##
+## $ID_25500142[[6]]
+## Linkout from Mouse Genome Informatics (MGI)
+## $Url: http://www.informatics.jax ...</code></pre>
+<p>Each of those linkout objects contains quite a lot of information, but the URL is probably the most useful. For that reason, <code>rentrez</code> provides the function <code>linkout_urls</code> to make extracting just the URL simple:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">linkout_urls</span>(paper_links)</code></pre>
+<pre><code>## $ID_25500142
+## [1] "http://linkinghub.elsevier.com/retrieve/pii/S0014-4886(14)00393-8"
+## [2] "http://europepmc.org/abstract/MED/25500142"
+## [3] "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=25500142.ui"
+## [4] "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25500142/"
+## [5] "https://medlineplus.gov/alzheimersdisease.html"
+## [6] "http://www.informatics.jax.org/marker/reference/25500142"</code></pre>
+<p>The full list of options for the <code>cmd</code> argument is given in the in-line documentation (<code>?entrez_link</code>). If you are interested in finding full text records for a large number of articles, check out the package <a href="https://github.com/ropensci/fulltext">fulltext</a>, which makes use of multiple sources (including the NCBI) to discover full text articles.</p>
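+<p>Because <code>linkout_urls()</code> returns a list of character vectors (one per ID, as the output above shows), base R tools can be used to pick out particular providers. The following is a minimal sketch that works on a hand-copied subset of the URLs printed above rather than a live query; <code>urls_matching()</code> is a hypothetical helper name and not part of <code>rentrez</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># URLs copied from the output above (a stand-in for linkout_urls(paper_links))
+urls <- list(ID_25500142 = c(
+    "http://europepmc.org/abstract/MED/25500142",
+    "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25500142/",
+    "https://medlineplus.gov/alzheimersdisease.html"
+))
+
+# Hypothetical helper: for each ID, keep only URLs matching a regular expression
+urls_matching <- function(url_list, pattern) {
+    lapply(url_list, function(u) grep(pattern, u, value = TRUE))
+}
+
+# Keep the Europe PMC and PubMed Central links, drop everything else
+open_access <- urls_matching(urls, "europepmc|ncbi\\.nlm\\.nih\\.gov/pmc")
+open_access</code></pre>
+<p>With a live result, the same helper could be applied directly to <code>linkout_urls(paper_links)</code>.</p>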
+</div>
+<div id="using-more-than-one-id" class="section level3">
+<h3>Using more than one ID</h3>
+<p>It is possible to pass more than one ID to <code>entrez_link()</code>. By default, doing so will give you a single elink object containing the complete set of links for <em>all</em> of the IDs that you specified. So, if you were looking for protein IDs related to specific genes you could do:</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_links_together <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">db=</span><span class="st">"protein"</span>, <span class="dt">dbfrom=</span><span class="st">"gene"</span>, <span class="dt">id=</span><span class="kw">c</span>(<span class="st">"93100"</span>, <span class="st">"223646"</span>))
+all_links_together</code></pre>
+<pre><code>## elink object with contents:
+## $links: IDs for linked records from NCBI
+## </code></pre>
+<pre class="sourceCode r"><code class="sourceCode r">all_links_together$links$gene_protein</code></pre>
+<pre><code>## [1] "1034662002" "1034662000" "1034661998" "1034661996" "1034661994"
+## [6] "1034661992" "558472750" "545685826" "194394158" "166221824"
+## [11] "154936864" "148697547" "148697546" "122346659" "119602646"
+## [16] "119602645" "119602644" "119602643" "119602642" "81899807"
+## [21] "74215266" "74186774" "37787317" "37787309" "37787307"
+## [26] "37787305" "37589273" "33991172" "31982089" "26339824"
+## [31] "26329351" "21619615" "10834676"</code></pre>
+<p>Although this behaviour might sometimes be useful, it means we’ve lost track of which <code>protein</code> ID is linked to which <code>gene</code> ID. To retain that information we can set <code>by_id</code> to <code>TRUE</code>. This gives us a list of elink objects, each one containing links from a single <code>gene</code> ID:</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_links_sep <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">db=</span><span class="st">"protein"</span>, <span class="dt">dbfrom=</span><span class="st">"gene"</span>, <span class="dt">id=</span><span class="kw">c</span>(<span class="st">"93100"</span>, <span class="st">"223646"</span>), <span class="dt">by_id=</span><span class="ot">TRUE</span>)
+all_links_sep</code></pre>
+<pre><code>## List of 2 elink objects,each containing
+## $links: IDs for linked records from NCBI
+## </code></pre>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">lapply</span>(all_links_sep, function(x) x$links$gene_protein)</code></pre>
+<pre><code>## [[1]]
+## [1] "1034662002" "1034662000" "1034661998" "1034661996" "1034661994"
+## [6] "1034661992" "558472750" "545685826" "194394158" "166221824"
+## [11] "154936864" "122346659" "119602646" "119602645" "119602644"
+## [16] "119602643" "119602642" "37787309" "37787307" "37787305"
+## [21] "33991172" "21619615" "10834676"
+##
+## [[2]]
+## [1] "148697547" "148697546" "81899807" "74215266" "74186774"
+## [6] "37787317" "37589273" "31982089" "26339824" "26329351"</code></pre>
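+<p>A list like this is easier to work with if each element is labelled with the ID it came from. The sketch below makes one assumption that should be checked against your own queries: that the elink objects come back in the same order as the IDs passed to <code>id</code>. The helper <code>links_by_id()</code> is a hypothetical name, not part of <code>rentrez</code>, and it is demonstrated on a small hand-built stand-in (<code>mock_links</code>) with the same shape as <code>all_links_sep</code>, using the first two IDs of each element printed above:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: name each element of a by_id result after its source ID.
+# Assumes elink_list is in the same order as ids.
+links_by_id <- function(elink_list, ids, link_name) {
+    stopifnot(length(elink_list) == length(ids))
+    setNames(lapply(elink_list, function(x) x$links[[link_name]]), ids)
+}
+
+# Hand-built stand-in for all_links_sep (not a live query)
+mock_links <- list(
+    list(links = list(gene_protein = c("1034662002", "1034662000"))),
+    list(links = list(gene_protein = c("148697547", "148697546")))
+)
+
+protein_by_gene <- links_by_id(mock_links, c("93100", "223646"), "gene_protein")
+protein_by_gene[["223646"]]</code></pre>
+<p>With a live result, the call would be <code>links_by_id(all_links_sep, c("93100", "223646"), "gene_protein")</code>.</p>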
+</div>
+</div>
+<div id="getting-summary-data-entrez_summary" class="section level2">
+<h2>Getting summary data: <code>entrez_summary()</code></h2>
+<p>Having found the unique IDs for some records via <code>entrez_search</code> or <code>entrez_link()</code>, you are probably going to want to learn something about them. The <code>Eutils</code> API has two ways to get information about a record. <code>entrez_fetch()</code> returns ‘full’ records in varying formats and <code>entrez_summary()</code> returns less information about each record, but in a relatively simple format. Very often the summary records have the information you are aft [...]
+<div id="the-summary-record" class="section level3">
+<h3>The summary record</h3>
+<p><code>entrez_summary()</code> takes a vector of unique IDs for the records you want to get summary information from. Let’s start by finding out something about the paper describing <a href="https://github.com/ropensci/taxize">Taxize</a>, using its PubMed ID:</p>
+<pre class="sourceCode r"><code class="sourceCode r">taxize_summ <-<span class="st"> </span><span class="kw">entrez_summary</span>(<span class="dt">db=</span><span class="st">"pubmed"</span>, <span class="dt">id=</span><span class="dv">24555091</span>)
+taxize_summ</code></pre>
+<pre><code>## esummary result with 42 items:
+## [1] uid pubdate epubdate
+## [4] source authors lastauthor
+## [7] title sorttitle volume
+## [10] issue pages lang
+## [13] nlmuniqueid issn essn
+## [16] pubtype recordstatus pubstatus
+## [19] articleids history references
+## [22] attributes pmcrefcount fulljournalname
+## [25] elocationid doctype srccontriblist
+## [28] booktitle medium edition
+## [31] publisherlocation publishername srcdate
+## [34] reportnumber availablefromurl locationlabel
+## [37] doccontriblist docdate bookname
+## [40] chapter sortpubdate sortfirstauthor</code></pre>
+<p>Once again, the object returned by <code>entrez_summary</code> behaves like a list, so you can extract elements using <code>$</code>. For instance, we could convert our PubMed ID to another article identifier…</p>
+<pre class="sourceCode r"><code class="sourceCode r">taxize_summ$articleids</code></pre>
+<pre><code>## idtype idtypen value
+## 1 pubmed 1 24555091
+## 2 doi 3 10.12688/f1000research.2-191.v2
+## 3 pmc 8 PMC3901538
+## 4 rid 8 24563765
+## 5 eid 8 24555091
+## 6 version 8 2
+## 7 version-id 8 2
+## 8 pmcid 5 pmc-id: PMC3901538;</code></pre>
+<p>…or see how many times the article has been cited in PubMed Central papers</p>
+<pre class="sourceCode r"><code class="sourceCode r">taxize_summ$pmcrefcount</code></pre>
+<pre><code>## [1] 7</code></pre>
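+<p>Returning to the <code>articleids</code> table printed above: it is laid out like a data frame, so one identifier can be picked out by its <code>idtype</code>. The sketch below is only an illustration and makes no request to the NCBI. <code>article_ids</code> is a hand-built stand-in (three rows copied from the output above), and treating the real element as a data frame with these column names is an assumption based on how it prints:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Stand-in for taxize_summ$articleids (assumed structure, values copied from above)
+article_ids <- data.frame(
+  idtype = c("pubmed", "doi", "pmc"),
+  value  = c("24555091", "10.12688/f1000research.2-191.v2", "PMC3901538"),
+  stringsAsFactors = FALSE
+)
+# Keep only the row whose idtype is "doi" and return its value
+doi <- article_ids$value[article_ids$idtype == "doi"]
+doi</code></pre>
+<pre><code>## [1] "10.12688/f1000research.2-191.v2"</code></pre>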
+</div>
+<div id="dealing-with-many-records" class="section level3">
+<h3>Dealing with many records</h3>
+<p>If you give <code>entrez_summary()</code> a vector with more than one ID you’ll get a list of summary records back. Let’s get those <em>Plasmodium vivax</em> papers we found in the <code>entrez_search()</code> section back, and fetch some summary data on each paper:</p>
+<pre class="sourceCode r"><code class="sourceCode r">vivax_search <-<span class="st"> </span><span class="kw">entrez_search</span>(<span class="dt">db =</span> <span class="st">"pubmed"</span>,
+ <span class="dt">term =</span> <span class="st">"(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])"</span>)
+multi_summs <-<span class="st"> </span><span class="kw">entrez_summary</span>(<span class="dt">db=</span><span class="st">"pubmed"</span>, <span class="dt">id=</span>vivax_search$ids)</code></pre>
+<p><code>rentrez</code> provides a helper function, <code>extract_from_esummary()</code>, that takes one or more elements from every summary record in one of these lists. Here it is working with one…</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">extract_from_esummary</span>(multi_summs, <span class="st">"fulljournalname"</span>)</code></pre>
+<pre><code>## 24861816
+## "Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases"
+## 24145518
+## "Antimicrobial agents and chemotherapy"
+## 24007534
+## "Malaria journal"
+## 23230341
+## "The Korean journal of parasitology"
+## 23043980
+## "Experimental parasitology"
+## 20810806
+## "The American journal of tropical medicine and hygiene"
+## 20412783
+## "Acta tropica"
+## 19597012
+## "Clinical microbiology reviews"
+## 17556611
+## "The American journal of tropical medicine and hygiene"
+## 17519409
+## "JAMA"
+## 17368986
+## "Trends in parasitology"
+## 12374849
+## "Proceedings of the National Academy of Sciences of the United States of America"</code></pre>
+<p>… and several elements:</p>
+<pre class="sourceCode r"><code class="sourceCode r">date_and_cite <-<span class="st"> </span><span class="kw">extract_from_esummary</span>(multi_summs, <span class="kw">c</span>(<span class="st">"pubdate"</span>, <span class="st">"pmcrefcount"</span>, <span class="st">"title"</span>))
+knitr::<span class="kw">kable</span>(<span class="kw">head</span>(<span class="kw">t</span>(date_and_cite)), <span class="dt">row.names=</span><span class="ot">FALSE</span>)</code></pre>
+<table>
+<thead>
+<tr class="header">
+<th align="left">pubdate</th>
+<th align="left">pmcrefcount</th>
+<th align="left">title</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td align="left">2014 Aug</td>
+<td align="left"></td>
+<td align="left">Prevalence of mutations in the antifolates resistance-associated genes (dhfr and dhps) in Plasmodium vivax parasites from Eastern and Central Sudan.</td>
+</tr>
+<tr class="even">
+<td align="left">2014</td>
+<td align="left">5</td>
+<td align="left">Prevalence of polymorphisms in antifolate drug resistance molecular marker genes pvdhfr and pvdhps in clinical isolates of Plasmodium vivax from Kolkata, India.</td>
+</tr>
+<tr class="odd">
+<td align="left">2013 Sep 5</td>
+<td align="left">2</td>
+<td align="left">Prevalence and patterns of antifolate and chloroquine drug resistance markers in Plasmodium vivax across Pakistan.</td>
+</tr>
+<tr class="even">
+<td align="left">2012 Dec</td>
+<td align="left">11</td>
+<td align="left">Prevalence of drug resistance-associated gene mutations in Plasmodium vivax in Central China.</td>
+</tr>
+<tr class="odd">
+<td align="left">2012 Dec</td>
+<td align="left">7</td>
+<td align="left">Novel mutations in the antifolate drug resistance marker genes among Plasmodium vivax isolates exhibiting severe manifestations.</td>
+</tr>
+<tr class="even">
+<td align="left">2010 Sep</td>
+<td align="left">15</td>
+<td align="left">Mutations in the antifolate-resistance-associated genes dihydrofolate reductase and dihydropteroate synthase in Plasmodium vivax isolates from malaria-endemic countries.</td>
+</tr>
+</tbody>
+</table>
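+<p>Note that the first row of the table has a blank <code>pmcrefcount</code>. If you plan to sort or filter on such a field, it may help to turn blanks into <code>NA</code> and store the counts as integers. The following is a minimal sketch in base R, not part of <code>rentrez</code>: <code>mock_summs</code> is a hypothetical list standing in for a set of summary records (the IDs and titles are invented), and <code>get_field()</code> is a helper defined here for illustration:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Mock summary records: hypothetical IDs and titles, for illustration only
+mock_summs <- list(
+  "1001" = list(pubdate = "2014 Aug", pmcrefcount = "", title = "Paper A"),
+  "1002" = list(pubdate = "2014", pmcrefcount = 5, title = "Paper B")
+)
+# Pull one field from every record as a character vector
+get_field <- function(recs, field) {
+  vapply(recs, function(rec) as.character(rec[[field]]), character(1))
+}
+refcounts <- get_field(mock_summs, "pmcrefcount")
+refcounts[refcounts == ""] <- NA   # treat a blank count as missing
+summ_df <- data.frame(
+  id = names(mock_summs),
+  pubdate = get_field(mock_summs, "pubdate"),
+  pmcrefcount = as.integer(refcounts),
+  stringsAsFactors = FALSE
+)
+summ_df[order(summ_df$pmcrefcount, decreasing = TRUE), ]</code></pre>
+<p>With real data the same steps would presumably be applied to the result of <code>entrez_summary()</code>, but check the structure of your own records first.</p>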
+</div>
+</div>
+<div id="fetching-full-records-entrez_fetch" class="section level2">
+<h2>Fetching full records: <code>entrez_fetch()</code></h2>
+<p>As useful as the summary records are, sometimes they just don’t have the information that you need. If you want a complete representation of a record you can use <code>entrez_fetch</code>, using the argument <code>rettype</code> to specify the format you’d like the record in.</p>
+<div id="fetch-dna-sequences-in-fasta-format" class="section level3">
+<h3>Fetch DNA sequences in fasta format</h3>
+<p>Let’s extend the example given in the <code>entrez_link()</code> section about finding transcripts for a given gene. This time we will fetch cDNA sequences of those transcripts. We can start by repeating the steps in the earlier example to get nucleotide IDs for refseq transcripts of two genes:</p>
+<pre class="sourceCode r"><code class="sourceCode r">gene_ids <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">351</span>, <span class="dv">11647</span>)
+linked_seq_ids <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">"gene"</span>, <span class="dt">id=</span>gene_ids, <span class="dt">db=</span><span class="st">"nuccore"</span>)
+linked_transripts <-<span class="st"> </span>linked_seq_ids$links$gene_nuccore_refseqrna
+<span class="kw">head</span>(linked_transripts)</code></pre>
+<pre><code>## [1] "1039766414" "1039766413" "1039766411" "1039766410" "1039766409"
+## [6] "563317856"</code></pre>
+<p>Now we can get our sequences with <code>entrez_fetch</code>, setting <code>rettype</code> to “fasta” (the list of formats available for <a href="http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/">each database is given in this table</a>):</p>
+<pre class="sourceCode r"><code class="sourceCode r">all_recs <-<span class="st"> </span><span class="kw">entrez_fetch</span>(<span class="dt">db=</span><span class="st">"nuccore"</span>, <span class="dt">id=</span>linked_transripts, <span class="dt">rettype=</span><span class="st">"fasta"</span>)
+<span class="kw">class</span>(all_recs)</code></pre>
+<pre><code>## [1] "character"</code></pre>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">nchar</span>(all_recs)</code></pre>
+<pre><code>## [1] 55183</code></pre>
+<p>Congratulations, now you have a really huge character vector! Rather than printing all those thousands of bases we can take a peek at the top of the file:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw">strwrap</span>(<span class="kw">substr</span>(all_recs, <span class="dv">1</span>, <span class="dv">500</span>)), <span class="dt">sep=</span><span class="st">"</span><span class="ch">\n</span><span class="st">"</span>)</code></pre>
+<pre><code>## >XM_006538500.2 PREDICTED: Mus musculus alkaline phosphatase,
+## liver/bone/kidney (Alpl), transcript variant X5, mRNA
+## GCGCCCGTGGCTTGCGCGACTCCCACGCGCGCGCTCCGCCGGTCCCGCAGTGACTGTCCCAGCCACGGTG
+## GGGACACGTGGAAGGTCAGGCTCCCTGGGGACCCACGACCTCCCGCTCCGGACTCCGCGCGCATCTCTTG
+## TGGCCTGGCAGGATGATGGACGTGGCGCCCGCTGAGCCGCTACCCAGGACCTCACCCTCGTGCTAAGCAC
+## CTGCTCCCGGTGCCCACGCGCCTCCGTAGTCCACAGCTGCGCCCTTCGTGGTCCCTTGGCACTCTGTCCC
+## GTTGGTGTCTAAAGTAGTTGGGGAGCAGCAGGAAGAAGGCACGTGCTGCGATCTTTGGCGGGAGAGATCG
+## GAGACCGCGTGCTAGTGTCTGTCTGAGAG</code></pre>
+<p>If we wanted to use these sequences in some other application we could write them to file:</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">write</span>(all_recs, <span class="dt">file=</span><span class="st">"my_transcripts.fasta"</span>)</code></pre>
+<p>Alternatively, if you want to use them within an R session, we could write them to a temporary file and then read that. In this case I’m using <code>read.dna()</code> from the phylogenetics package ape (but not executing the code block in this vignette, so you don’t have to install that package):</p>
+<pre class="sourceCode r"><code class="sourceCode r">temp <-<span class="st"> </span><span class="kw">tempfile</span>()
+<span class="kw">write</span>(all_recs, temp)
+parsed_recs <-<span class="st"> </span>ape::<span class="kw">read.dna</span>(temp, <span class="dt">format=</span><span class="st">"fasta"</span>)</code></pre>
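+<p>If you would rather not install another package, the FASTA text can also be split into one sequence per record with base R alone. This is a rough sketch, not something <code>rentrez</code> provides: <code>fasta_txt</code> is a made-up two-record string standing in for <code>all_recs</code>, and the code assumes every header line is followed by at least one line of sequence:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Made-up FASTA text standing in for the string returned by entrez_fetch()
+fasta_txt <- ">seq1 hypothetical record one\nACGT\nTTGA\n>seq2 hypothetical record two\nGGCC\n"
+fasta_lines <- strsplit(fasta_txt, "\n", fixed = TRUE)[[1]]
+fasta_lines <- fasta_lines[nzchar(fasta_lines)]        # drop empty lines
+is_header <- grepl("^>", fasta_lines)
+record_id <- cumsum(is_header)                         # which record each line belongs to
+# Join the sequence lines of each record into a single string
+seqs <- vapply(split(fasta_lines[!is_header], record_id[!is_header]),
+               paste, character(1), collapse = "")
+# Name each sequence by the first word of its header line
+names(seqs) <- sub("^>([^[:space:]]+).*", "\\1", fasta_lines[is_header])
+nchar(seqs)</code></pre>
+<pre><code>## seq1 seq2 
+##    8    4</code></pre>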
+</div>
+<div id="fetch-a-parsed-xml-document" class="section level3">
+<h3>Fetch a parsed XML document</h3>
+<p>Most of the NCBI’s databases can return records in XML format. In addition to downloading the text-representation of these files, <code>entrez_fetch()</code> can return objects parsed by the <code>XML</code> package. As an example, we can check out the Taxonomy database’s record for (did I mention they are amazing….) <em>Tetrahymena thermophila</em>, specifying we want the result to be parsed by setting <code>parsed=TRUE</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r">Tt <-<span class="st"> </span><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"taxonomy"</span>, <span class="dt">term=</span><span class="st">"(Tetrahymena thermophila[ORGN]) AND Species[RANK]"</span>)
+tax_rec <-<span class="st"> </span><span class="kw">entrez_fetch</span>(<span class="dt">db=</span><span class="st">"taxonomy"</span>, <span class="dt">id=</span>Tt$ids, <span class="dt">rettype=</span><span class="st">"xml"</span>, <span class="dt">parsed=</span><span class="ot">TRUE</span>)
+<span class="kw">class</span>(tax_rec)</code></pre>
+<pre><code>## [1] "XMLInternalDocument" "XMLAbstractDocument"</code></pre>
+<p>The package XML (which you have if you have installed <code>rentrez</code>) provides functions to get information from these files. For relatively simple records like this one you can use <code>XML::xmlToList</code>:</p>
+<pre class="sourceCode r"><code class="sourceCode r">tax_list <-<span class="st"> </span>XML::<span class="kw">xmlToList</span>(tax_rec)
+tax_list$Taxon$GeneticCode</code></pre>
+<pre><code>## $GCId
+## [1] "6"
+##
+## $GCName
+## [1] "Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear"</code></pre>
+<p>For more complex records, which generate deeply-nested lists, you can use <a href="https://en.wikipedia.org/wiki/XPath">XPath expressions</a> along with the function <code>XML::xpathSApply</code> or the extraction operators <code>[</code> and <code>[[</code> to extract specific parts of the file. For instance, we can get the scientific name of each taxon in <em>T. thermophila</em>’s lineage by specifying a path through the XML:</p>
+<pre class="sourceCode r"><code class="sourceCode r">tt_lineage <-<span class="st"> </span>tax_rec[<span class="st">"//LineageEx/Taxon/ScientificName"</span>]
+tt_lineage[<span class="dv">1</span>:<span class="dv">4</span>]</code></pre>
+<pre><code>## [[1]]
+## <ScientificName>cellular organisms</ScientificName>
+##
+## [[2]]
+## <ScientificName>Eukaryota</ScientificName>
+##
+## [[3]]
+## <ScientificName>Alveolata</ScientificName>
+##
+## [[4]]
+## <ScientificName>Ciliophora</ScientificName></code></pre>
+<p>As the name suggests, <code>XML::xpathSApply()</code> is a counterpart of base R’s <code>sapply</code>, and can be used to apply a function to nodes in an XML object. A particularly useful function to apply is <code>XML::xmlValue</code>, which returns the content of the node:</p>
+<pre class="sourceCode r"><code class="sourceCode r">XML::<span class="kw">xpathSApply</span>(tax_rec, <span class="st">"//LineageEx/Taxon/ScientificName"</span>, XML::xmlValue)</code></pre>
+<pre><code>## [1] "cellular organisms" "Eukaryota" "Alveolata"
+## [4] "Ciliophora" "Intramacronucleata" "Oligohymenophorea"
+## [7] "Hymenostomatida" "Tetrahymenina" "Tetrahymenidae"
+## [10] "Tetrahymena"</code></pre>
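+<p>The same approach can pull several fields and line them up side by side. The sketch below runs on a small hand-written document rather than a real NCBI record: <code>mock_xml</code> is a cut-down imitation of a Taxonomy record, and the assumption that every <code>Taxon</code> in the lineage carries both a <code>ScientificName</code> and a <code>Rank</code> element should be checked against the records you actually fetch:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Mock document: an imitation of a Taxonomy record, not real NCBI output
+mock_xml <- paste0(
+  "<TaxaSet><Taxon><LineageEx>",
+  "<Taxon><ScientificName>Eukaryota</ScientificName><Rank>superkingdom</Rank></Taxon>",
+  "<Taxon><ScientificName>Ciliophora</ScientificName><Rank>phylum</Rank></Taxon>",
+  "</LineageEx></Taxon></TaxaSet>"
+)
+mock_rec <- XML::xmlParse(mock_xml, asText = TRUE)
+# One XPath query per field; rows line up only if each Taxon has both elements
+lineage <- data.frame(
+  name = XML::xpathSApply(mock_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue),
+  rank = XML::xpathSApply(mock_rec, "//LineageEx/Taxon/Rank", XML::xmlValue),
+  stringsAsFactors = FALSE
+)
+lineage$name[lineage$rank == "phylum"]</code></pre>
+<pre><code>## [1] "Ciliophora"</code></pre>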
+<p>There are a few more complex examples of using <code>XPath</code> <a href="https://github.com/ropensci/rentrez/wiki">on the rentrez wiki</a>.</p>
+<p><a name="web_history"></a></p>
+</div>
+</div>
+<div id="using-ncbis-web-history-features" class="section level2">
+<h2>Using NCBI’s Web History features</h2>
+<p>When you are dealing with very large queries it can be time consuming to pass long vectors of unique IDs to and from the NCBI. To avoid this problem, the NCBI provides a feature called “web history” which allows users to store IDs on the NCBI servers then refer to them in future calls.</p>
+<div id="post-a-set-of-ids-to-the-ncbi-for-later-use-entrez_post" class="section level3">
+<h3>Post a set of IDs to the NCBI for later use: <code>entrez_post()</code></h3>
+<p>If you have a list of many NCBI IDs that you want to use later on, you can post them to the NCBI’s servers. In order to provide a brief example, I’m going to post just one ID, the <code>omim</code> identifier for asthma:</p>
+<pre class="sourceCode r"><code class="sourceCode r">upload <-<span class="st"> </span><span class="kw">entrez_post</span>(<span class="dt">db=</span><span class="st">"omim"</span>, <span class="dt">id=</span><span class="dv">600807</span>)
+upload</code></pre>
+<pre><code>## Web history object (QueryKey = 1, WebEnv = NCID_1_54492...)</code></pre>
+<p>The NCBI sends you back some information you can use to refer to the posted IDs. In <code>rentrez</code>, that information is represented as a <code>web_history</code> object.</p>
+</div>
+<div id="get-a-web_history-object-from-entrez_search-or-entrez_link" class="section level3">
+<h3>Get a <code>web_history</code> object from <code>entrez_search</code> or <code>entrez_link()</code></h3>
+<p>In addition to directly uploading IDs to the NCBI, you can use the web history features with <code>entrez_search</code> and <code>entrez_link</code>. For instance, imagine you wanted to find all of the sequences of the widely-studied gene COI from all snails (which are members of the taxonomic group Gastropoda):</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"nuccore"</span>, <span class="dt">term=</span><span class="st">"COI[Gene] AND Gastropoda[ORGN]"</span>)</code></pre>
+<pre><code>## Entrez search result with 61927 hits (object contains 20 IDs and no web_history object)
+## Search term (as translated): COI[Gene] AND "Gastropoda"[Organism]</code></pre>
+<p>That’s a lot of sequences! If you really wanted to download all of these it would be a good idea to save all those IDs to the server by setting <code>use_history</code> to <code>TRUE</code> (note you now get a <code>web_history</code> object along with your normal search result):</p>
+<pre class="sourceCode r"><code class="sourceCode r">snail_coi <-<span class="st"> </span><span class="kw">entrez_search</span>(<span class="dt">db=</span><span class="st">"nuccore"</span>, <span class="dt">term=</span><span class="st">"COI[Gene] AND Gastropoda[ORGN]"</span>, <span class="dt">use_history=</span><span class="ot">TRUE</span>)
+snail_coi</code></pre>
+<pre><code>## Entrez search result with 61927 hits (object contains 20 IDs and a web_history object)
+## Search term (as translated): COI[Gene] AND "Gastropoda"[Organism]</code></pre>
+<pre class="sourceCode r"><code class="sourceCode r">snail_coi$web_history</code></pre>
+<pre><code>## Web history object (QueryKey = 1, WebEnv = NCID_1_54493...)</code></pre>
+<p>Similarly, <code>entrez_link()</code> can return <code>web_history</code> objects by using the <code>cmd</code> <code>neighbor_history</code>. Let’s find genetic variants (from the clinvar database) associated with asthma (using the same OMIM ID we identified earlier):</p>
+<pre class="sourceCode r"><code class="sourceCode r">asthma_clinvar <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">"omim"</span>, <span class="dt">db=</span><span class="st">"clinvar"</span>, <span class="dt">cmd=</span><span class="st">"neighbor_history"</span>, <span class="dt">id=</span><span class="dv">600807</span>)
+asthma_clinvar$web_histories</code></pre>
+<pre><code>## $omim_clinvar
+## Web history object (QueryKey = 1, WebEnv = NCID_1_51728...)</code></pre>
+<p>As you can see, instead of returning lists of IDs for each linked database (as it would by default), <code>entrez_link()</code> now returns a list of web_histories.</p>
+</div>
+<div id="use-a-web_history-object" class="section level3">
+<h3>Use a <code>web_history</code> object</h3>
+<p>Once you have those IDs stored on the NCBI’s servers, you are going to want to do something with them. The functions <code>entrez_fetch()</code>, <code>entrez_summary()</code> and <code>entrez_link()</code> can all use <code>web_history</code> objects in exactly the same way they use IDs.</p>
+<p>So, we could repeat the last example (finding variants linked to asthma), but this time using the ID we uploaded earlier:</p>
+<pre class="sourceCode r"><code class="sourceCode r">asthma_variants <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">"omim"</span>, <span class="dt">db=</span><span class="st">"clinvar"</span>, <span class="dt">cmd=</span><span class="st">"neighbor_history"</span>, <span class="dt">web_history=</span>upload)
+asthma_variants</code></pre>
+<pre><code>## elink object with contents:
+## $web_histories: Objects containing web history information</code></pre>
+<p>… if we want to get some genetic information about these variants we need to map our clinvar IDs to SNP IDs:</p>
+<pre class="sourceCode r"><code class="sourceCode r">snp_links <-<span class="st"> </span><span class="kw">entrez_link</span>(<span class="dt">dbfrom=</span><span class="st">"clinvar"</span>, <span class="dt">db=</span><span class="st">"snp"</span>,
+ <span class="dt">web_history=</span>asthma_variants$web_histories$omim_clinvar,
+ <span class="dt">cmd=</span><span class="st">"neighbor_history"</span>)
+snp_summ <-<span class="st"> </span><span class="kw">entrez_summary</span>(<span class="dt">db=</span><span class="st">"snp"</span>, <span class="dt">web_history=</span>snp_links$web_histories$clinvar_snp)
+knitr::<span class="kw">kable</span>(<span class="kw">extract_from_esummary</span>(snp_summ, <span class="kw">c</span>(<span class="st">"chr"</span>, <span class="st">"fxn_class"</span>, <span class="st">"global_maf"</span>)))</code></pre>
+<table>
+<thead>
+<tr class="header">
+<th align="left"></th>
+<th align="left">41364547</th>
+<th align="left">11558538</th>
+<th align="left">2303067</th>
+<th align="left">20541</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td align="left">chr</td>
+<td align="left">11</td>
+<td align="left">2</td>
+<td align="left">5</td>
+<td align="left">5</td>
+</tr>
+<tr class="even">
+<td align="left">fxn_class</td>
+<td align="left">intron-variant,utr-variant-5-prime</td>
+<td align="left">missense,reference,utr-variant-5-prime</td>
+<td align="left">missense,reference</td>
+<td align="left">missense,reference</td>
+</tr>
+<tr class="odd">
+<td align="left">global_maf</td>
+<td align="left">A=0.0036/18</td>
+<td align="left">T=0.0595/298</td>
+<td align="left">G=0.4331/2169</td>
+<td align="left">A=0.2700/1352</td>
+</tr>
+</tbody>
+</table>
+<p>If you really wanted to, you could also use <code>web_history</code> objects to download all those thousands of COI sequences. When downloading large sets of data, it is a good idea to take advantage of the arguments <code>retmax</code> and <code>retstart</code> to split the request up into smaller chunks. For instance, we could get the first 200 sequences in 50-sequence chunks:</p>
+<p>(note: this code block is not executed as part of the vignette to save time and bandwidth):</p>
+<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># retstart is zero-based, so the first chunk starts at record 0</span>
+for( seq_start in <span class="kw">seq</span>(<span class="dv">0</span>,<span class="dv">150</span>,<span class="dv">50</span>)){
+ recs <-<span class="st"> </span><span class="kw">entrez_fetch</span>(<span class="dt">db=</span><span class="st">"nuccore"</span>, <span class="dt">web_history=</span>snail_coi$web_history,
+ <span class="dt">rettype=</span><span class="st">"fasta"</span>, <span class="dt">retmax=</span><span class="dv">50</span>, <span class="dt">retstart=</span>seq_start)
+ <span class="kw">cat</span>(recs, <span class="dt">file=</span><span class="st">"snail_coi.fasta"</span>, <span class="dt">append=</span><span class="ot">TRUE</span>)
+ <span class="kw">cat</span>(seq_start<span class="dv">+50</span>, <span class="st">"sequences downloaded</span><span class="ch">\r</span><span class="st">"</span>)
+}</code></pre>
+}</code></pre>
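+<p>When planning a chunked download like this, keep in mind that the E-utilities count records from zero, so a <code>retstart</code> of 0 refers to the first record in the set. As a small illustration (no request is made, and <code>chunk_starts()</code> is a hypothetical helper, not a <code>rentrez</code> function), the offsets and chunk sizes for any total can be worked out ahead of time:</p>
+<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: zero-based retstart values covering n_records records
+chunk_starts <- function(n_records, chunk_size) {
+  stopifnot(n_records >= 1, chunk_size >= 1)
+  seq(from = 0, to = n_records - 1, by = chunk_size)
+}
+starts <- chunk_starts(200, 50)
+starts</code></pre>
+<pre><code>## [1]   0  50 100 150</code></pre>
+<pre class="sourceCode r"><code class="sourceCode r"># The last chunk may be shorter when the total is not a multiple of the chunk size
+last_starts <- chunk_starts(120, 50)
+pmin(50, 120 - last_starts)</code></pre>
+<pre><code>## [1] 50 50 20</code></pre>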
+</div>
+</div>
+<div id="what-next" class="section level2">
+<h2>What next?</h2>
+<p>This tutorial has introduced you to the core functions of <code>rentrez</code>; there are almost limitless ways that you could put them together. <a href="https://github.com/ropensci/rentrez/wiki">Check out the wiki</a> for more specific examples, and be sure to read the inline-documentation for each function. If you run into a problem with rentrez, or just need help with the package and <code>Eutils</code>, please contact us by opening an issue at the <a href="https://github.com/ropensc [...]
+</div>
+
+
+
+<!-- dynamically load mathjax for compatibility with self-contained -->
+<script>
+ (function () {
+ var script = document.createElement("script");
+ script.type = "text/javascript";
+ script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
+ document.getElementsByTagName("head")[0].appendChild(script);
+ })();
+</script>
+
+</body>
+</html>
diff --git a/man/entrez_citmatch.Rd b/man/entrez_citmatch.Rd
new file mode 100644
index 0000000..d27fcc5
--- /dev/null
+++ b/man/entrez_citmatch.Rd
@@ -0,0 +1,40 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_citmatch.r
+\name{entrez_citmatch}
+\alias{entrez_citmatch}
+\title{Fetch pubmed ids matching specially formatted citation strings}
+\usage{
+entrez_citmatch(bdata, db = "pubmed", retmode = "xml", config = NULL)
+}
+\arguments{
+\item{bdata}{character, containing citation data.
+Each citation must be represented in a pipe-delimited format
+ journal_title|year|volume|first_page|author_name|your_key|
+The final field "your_key" is arbitrary, and can be used as you see
+fit. Fields can be left empty, but be sure to keep 6 pipes.}
+
+\item{db}{character, the database to search. Defaults to pubmed,
+the only database currently available}
+
+\item{retmode}{character, file format to retrieve. Defaults to xml, as
+per the API documentation, though note the API only returns plain text}
+
+\item{config}{vector configuration options passed to httr::GET}
+}
+\value{
+A character vector containing PMIDs
+}
+\description{
+Fetch pubmed ids matching specially formatted citation strings
+}
+\examples{
+\donttest{
+ex_cites <- c("proc natl acad sci u s a|1991|88|3248|mann bj|test1|",
+ "science|1987|235|182|palmenberg ac|test2|")
+entrez_citmatch(ex_cites)
+}
+}
+\seealso{
+\code{\link[httr]{config}} for available configs
+}
+
diff --git a/man/entrez_db_links.Rd b/man/entrez_db_links.Rd
new file mode 100644
index 0000000..b0eb60e
--- /dev/null
+++ b/man/entrez_db_links.Rd
@@ -0,0 +1,44 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_info.r
+\name{entrez_db_links}
+\alias{entrez_db_links}
+\title{List available links for records from a given NCBI database}
+\usage{
+entrez_db_links(db, config = NULL)
+}
+\arguments{
+\item{db}{character, name of database to search}
+
+\item{config}{config vector passed to \code{httr::GET}}
+}
+\value{
+An eInfoLink object (sub-classed from list) summarizing linked-databases.
+Can be coerced to a data-frame with \code{as.data.frame}. Printing the object
+shows the name of each element (which is the correct name for \code{entrez_link}),
+and these names can be used to get (a little) more information about each linked
+database (see example below).
+}
+\description{
+For a given database, fetch a list of other databases that contain
+cross-referenced records. The names of these records can be used as the
+\code{db} argument in \code{\link{entrez_link}}
+}
+\examples{
+\donttest{
+taxid <- entrez_search(db="taxonomy", term="Osmeriformes")$ids
+tax_links <- entrez_db_links("taxonomy")
+tax_links
+entrez_link(dbfrom="taxonomy", db="pmc", id=taxid)
+
+sra_links <- entrez_db_links("sra")
+as.data.frame(sra_links)
+}
+}
+\seealso{
+\code{\link{entrez_link}}
+
+Other einfo: \code{\link{entrez_db_searchable}},
+ \code{\link{entrez_db_summary}},
+ \code{\link{entrez_dbs}}, \code{\link{entrez_info}}
+}
+
diff --git a/man/entrez_db_searchable.Rd b/man/entrez_db_searchable.Rd
new file mode 100644
index 0000000..d96d018
--- /dev/null
+++ b/man/entrez_db_searchable.Rd
@@ -0,0 +1,41 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_info.r
+\name{entrez_db_searchable}
+\alias{entrez_db_searchable}
+\title{List available search fields for a given database}
+\usage{
+entrez_db_searchable(db, config = NULL)
+}
+\arguments{
+\item{db}{character, name of database to get search field from}
+
+\item{config}{config vector passed to \code{httr::GET}}
+}
+\value{
+An eInfoSearch object (subclassed from list) summarizing the available search fields.
+Can be coerced to a data-frame with \code{as.data.frame}. Printing the object
+shows only the names of each available search field.
+}
+\description{
+Fetch a list of search fields that can be used with a given database. Fields
+can be used as part of the \code{term} argument to \code{\link{entrez_search}}
+}
+\examples{
+\donttest{
+pmc_fields <- entrez_db_searchable("pmc")
+pmc_fields[["AFFL"]]
+entrez_search(db="pmc", term="Otago[AFFL]", retmax=0)
+entrez_search(db="pmc", term="Auckland[AFFL]", retmax=0)
+
+sra_fields <- entrez_db_searchable("sra")
+as.data.frame(sra_fields)
+}
+}
+\seealso{
+\code{\link{entrez_search}}
+
+Other einfo: \code{\link{entrez_db_links}},
+ \code{\link{entrez_db_summary}},
+ \code{\link{entrez_dbs}}, \code{\link{entrez_info}}
+}
+
diff --git a/man/entrez_db_summary.Rd b/man/entrez_db_summary.Rd
new file mode 100644
index 0000000..242f3c5
--- /dev/null
+++ b/man/entrez_db_summary.Rd
@@ -0,0 +1,40 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_info.r
+\name{entrez_db_summary}
+\alias{entrez_db_summary}
+\title{Retrieve summary information about an NCBI database}
+\usage{
+entrez_db_summary(db, config = NULL)
+}
+\arguments{
+\item{db}{character, name of database to summarise}
+
+\item{config}{config vector passed to \code{httr::GET}}
+}
+\value{
+Character vector with the following data
+
+DbName Name of database
+
+Description Brief description of the database
+
+Count Number of records contained in the database
+
+MenuName Name in web-interface to EUtils
+
+DbBuild Unique ID for current build of database
+
+LastUpdate Date of most recent update to database
+}
+\description{
+Retrieve summary information about an NCBI database
+}
+\examples{
+entrez_db_summary("pubmed")
+}
+\seealso{
+Other einfo: \code{\link{entrez_db_links}},
+ \code{\link{entrez_db_searchable}},
+ \code{\link{entrez_dbs}}, \code{\link{entrez_info}}
+}
+
diff --git a/man/entrez_dbs.Rd b/man/entrez_dbs.Rd
new file mode 100644
index 0000000..490563c
--- /dev/null
+++ b/man/entrez_dbs.Rd
@@ -0,0 +1,29 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_info.r
+\name{entrez_dbs}
+\alias{entrez_dbs}
+\title{List databases available from the NCBI}
+\usage{
+entrez_dbs(config = NULL)
+}
+\arguments{
+\item{config}{config vector passed to \code{httr::GET}}
+}
+\value{
+character vector listing available dbs
+}
+\description{
+Retrieves the names of databases available through the EUtils API
+}
+\examples{
+\donttest{
+entrez_dbs()
+}
+}
+\seealso{
+Other einfo: \code{\link{entrez_db_links}},
+ \code{\link{entrez_db_searchable}},
+ \code{\link{entrez_db_summary}},
+ \code{\link{entrez_info}}
+}
+
diff --git a/man/entrez_fetch.Rd b/man/entrez_fetch.Rd
new file mode 100644
index 0000000..87fa96c
--- /dev/null
+++ b/man/entrez_fetch.Rd
@@ -0,0 +1,67 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_fetch.r
+\name{entrez_fetch}
+\alias{entrez_fetch}
+\title{Download data from NCBI databases}
+\usage{
+entrez_fetch(db, id = NULL, web_history = NULL, rettype, retmode = "",
+ parsed = FALSE, config = NULL, ...)
+}
+\arguments{
+\item{db}{character, name of the database to use}
+
+\item{id}{vector (numeric or character), unique ID(s) for records in database \code{db}}
+
+\item{web_history}{a web_history object}
+
+\item{rettype}{character, format in which to get data (eg, fasta, xml...)}
+
+\item{retmode}{character, mode in which to receive data, defaults to 'text'}
+
+\item{parsed}{boolean, should entrez_fetch attempt to parse the resulting
+file? Only works with xml records (including those with rettypes other than
+"xml") at present}
+
+\item{config}{vector, httr configuration options passed to httr::GET}
+
+\item{\dots}{character, additional terms to add to the request, see NCBI
+documentation linked to in references for a complete list}
+}
+\value{
+character string containing the file created
+
+XMLInternalDocument a parsed XML document if parsed=TRUE and
+rettype is a flavour of XML.
+}
+\description{
+A set of unique identifiers must be specified with either the \code{id}
+argument (which directly specifies the IDs as a numeric or character vector)
+or a \code{web_history} object as returned by
+\code{\link{entrez_link}}, \code{\link{entrez_search}} or
+\code{\link{entrez_post}}. See Table 1 in the linked reference for the set of
+formats available for each database. In particular, note that sequence
+databases (nuccore, protein and their relatives) use specific format names
+(eg "native", "ipg") for different flavours of xml.
+}
+\details{
+For the most part, this function returns a character vector containing the
+fetched records. For XML records (including 'native', 'ipg', 'gbc' sequence
+records), setting \code{parsed} to \code{TRUE} will return an
+\code{XMLInternalDocument}.
+}
+\examples{
+\dontrun{
+katipo <- "Latrodectus katipo[Organism]"
+katipo_search <- entrez_search(db="nuccore", term=katipo)
+katipo_seqs <- entrez_fetch(db="nuccore", id=katipo_search$ids, rettype="fasta")
+#xml
+katipo_seqs <- entrez_fetch(db="nuccore", id=katipo_search$ids, rettype="native")
+}
+}
+\references{
+\url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_}
+}
+\seealso{
+\code{\link[httr]{config}} for available configs
+}
+
diff --git a/man/entrez_global_query.Rd b/man/entrez_global_query.Rd
new file mode 100644
index 0000000..1e6ac60
--- /dev/null
+++ b/man/entrez_global_query.Rd
@@ -0,0 +1,29 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_global_query.r
+\name{entrez_global_query}
+\alias{entrez_global_query}
+\title{Find the number of records that match a given term across all NCBI Entrez databases}
+\usage{
+entrez_global_query(term, config = NULL, ...)
+}
+\arguments{
+\item{term}{the search term to use}
+
+\item{config}{vector configuration options passed to httr::GET}
+
+\item{...}{additional arguments to add to the query}
+}
+\value{
+a named vector with counts for each database
+}
+\description{
+Find the number of records that match a given term across all NCBI Entrez databases
+}
+\examples{
+
+NCBI_data_on_best_butterflies_ever <- entrez_global_query(term="Heliconius")
+}
+\seealso{
+\code{\link[httr]{config}} for available configs
+}
+
diff --git a/man/entrez_info.Rd b/man/entrez_info.Rd
new file mode 100644
index 0000000..1e24943
--- /dev/null
+++ b/man/entrez_info.Rd
@@ -0,0 +1,43 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_info.r
+\name{entrez_info}
+\alias{entrez_info}
+\title{Get information about EUtils databases}
+\usage{
+entrez_info(db = NULL, config = NULL)
+}
+\arguments{
+\item{db}{character database about which to retrieve information (optional)}
+
+\item{config}{config vector passed on to \code{httr::GET}}
+}
+\value{
+XMLInternalDocument with information describing either all the
+databases available in Eutils (if db is not set) or one particular database
+(set by 'db')
+}
+\description{
+Gather information about EUtils generally, or a given Eutils database.
+Note: The most common use cases for the einfo util are finding the list of
+search fields available for a given database or the other NCBI databases to
+which records in a given database might be linked. Both these use cases
+are implemented in higher-level functions that return just this information
+(\code{entrez_db_searchable} and \code{entrez_db_links} respectively).
+Consequently most users will not have a reason to use this function (though
+it is exported by \code{rentrez} for the sake of completeness).
+}
+\examples{
+\dontrun{
+all_the_data <- entrez_info()
+XML::xpathSApply(all_the_data, "//DbName", XML::xmlValue)
+entrez_dbs()
+}
+}
+\seealso{
+\code{\link[httr]{config}} for available httr configurations
+
+Other einfo: \code{\link{entrez_db_links}},
+ \code{\link{entrez_db_searchable}},
+ \code{\link{entrez_db_summary}}, \code{\link{entrez_dbs}}
+}
+
diff --git a/man/entrez_link.Rd b/man/entrez_link.Rd
new file mode 100644
index 0000000..adbaa3b
--- /dev/null
+++ b/man/entrez_link.Rd
@@ -0,0 +1,81 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_link.r
+\name{entrez_link}
+\alias{entrez_link}
+\title{Get links to datasets related to records from an NCBI database}
+\usage{
+entrez_link(dbfrom, web_history = NULL, id = NULL, db = NULL,
+ cmd = "neighbor", by_id = FALSE, config = NULL, ...)
+}
+\arguments{
+\item{dbfrom}{character Name of database from which the Id(s) originate}
+
+\item{web_history}{a web_history object}
+
+\item{id}{vector with unique ID(s) for records in database \code{dbfrom}.}
+
+\item{db}{character Name of the database to search for links (or use "all" to
+search all databases available for \code{db}). \code{entrez_db_links} allows you
+to discover databases that might have linked information (see examples).}
+
+\item{cmd}{link function to use. Allowed values include
+\itemize{
+ \item neighbor (default). Returns a set of IDs in \code{db} linked to the
+ input IDs in \code{dbfrom}.
+ \item neighbor_score. As `neighbor', but additionally returns similarity scores.
+ \item neighbor_history. As `neighbor', but returns web history objects.
+ \item acheck. Returns a list of linked databases available from NCBI for a set of IDs.
+ \item ncheck. Checks for the existence of links within a single database.
+ \item lcheck. Checks for external (i.e. outside NCBI) links.
+ \item llinks. Returns a list of external links for each ID, excluding links
+ provided by libraries.
+ \item llinkslib. As 'llinks' but additionally includes links provided by
+ libraries.
+ \item prlinks. As 'llinks' but returns only the primary external link for
+ each ID.
+}}
+
+\item{by_id}{logical If FALSE (default) return a single
+\code{elink} object containing links for all of the provided \code{id}s.
+Alternatively, if TRUE return a list of \code{elink} objects, one for each
+ID in \code{id}.}
+
+\item{config}{vector configuration options passed to httr::GET}
+
+\item{\dots}{character Additional terms to add to the request, see NCBI
+documentation linked to in references for a complete list}
+}
+\value{
+An elink object containing the data defined by the \code{cmd} argument
+(if by_id=FALSE) or a list of such objects (if by_id=TRUE).
+
+file XMLInternalDocument xml file resulting from search, parsed with
+\code{\link[XML]{xmlTreeParse}}
+}
+\description{
+Discover records related to a set of unique identifiers from
+an NCBI database. The object returned by this function depends on the value
+set for the \code{cmd} argument. Printing the returned object lists the names,
+and provides a brief description, of the elements included in the object.
+}
+\examples{
+\donttest{
+ pubmed_search <- entrez_search(db = "pubmed", term ="10.1016/j.ympev.2010.07.013[doi]")
+ linked_dbs <- entrez_db_links("pubmed")
+ linked_dbs
+ nucleotide_data <- entrez_link(dbfrom = "pubmed", id = pubmed_search$ids, db ="nuccore")
+ #Sources for the full text of the paper
+ res <- entrez_link(dbfrom="pubmed", db="", cmd="llinks", id=pubmed_search$ids)
+ linkout_urls(res)
+}
+
+}
+\references{
+\url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ELink_}
+}
+\seealso{
+\code{\link[httr]{config}} for available configs
+
+\code{entrez_db_links}
+}
+
diff --git a/man/entrez_post.Rd b/man/entrez_post.Rd
new file mode 100644
index 0000000..ed6c741
--- /dev/null
+++ b/man/entrez_post.Rd
@@ -0,0 +1,42 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_post.r
+\name{entrez_post}
+\alias{entrez_post}
+\title{Post IDs to Eutils for later use}
+\usage{
+entrez_post(db, id = NULL, web_history = NULL, config = NULL, ...)
+}
+\arguments{
+\item{db}{character Name of the database from which the IDs were taken}
+
+\item{id}{vector with unique ID(s) for records in database \code{db}.}
+
+\item{web_history}{A web_history object. Can be used to add additional
+identifiers to an existing web environment on the NCBI}
+
+\item{config}{vector of configuration options passed to httr::GET}
+
+\item{\dots}{character Additional terms to add to the request, see NCBI
+documentation linked to in references for a complete list}
+}
+\description{
+Post IDs to Eutils for later use
+}
+\examples{
+\dontrun{
+so_many_snails <- entrez_search(db="nuccore",
+ "Gastropoda[Organism] AND COI[Gene]", retmax=200)
+upload <- entrez_post(db="nuccore", id=so_many_snails$ids)
+first <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload,
+ retmax=10)
+second <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload,
+ retstart=10, retmax=10)
+}
+}
+\references{
+\url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_}
+}
+\seealso{
+\code{\link[httr]{config}} for available httr configurations
+}
+
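Note on the entrez_post example above: it pages through a posted ID set with retstart and retmax. A minimal sketch of how the offsets for such paging could be computed follows. batch_offsets is a hypothetical helper (not part of rentrez), and the commented-out loop assumes the upload object from the example plus network access.

```r
# Hypothetical helper, not part of rentrez: compute the retstart offsets
# needed to page through `total` records in batches of `batch_size`.
batch_offsets <- function(total, batch_size) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = batch_size)
}

offsets <- batch_offsets(total = 200, batch_size = 50)
print(offsets)

# Each batch could then be requested in turn (needs network, so not run here):
# for (start in offsets) {
#   recs <- entrez_fetch(db = "nuccore", web_history = upload,
#                        rettype = "fasta", retstart = start, retmax = 50)
# }
```

Because retstart is zero-based, the last offset is always strictly less than the total number of posted records.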
diff --git a/man/entrez_search.Rd b/man/entrez_search.Rd
new file mode 100644
index 0000000..e01cf6c
--- /dev/null
+++ b/man/entrez_search.Rd
@@ -0,0 +1,86 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_search.r
+\name{entrez_search}
+\alias{entrez_search}
+\title{Search the NCBI databases using EUtils}
+\usage{
+entrez_search(db, term, config = NULL, retmode = "xml",
+ use_history = FALSE, ...)
+}
+\arguments{
+\item{db}{character, name of the database to search for.}
+
+\item{term}{character, the search term.}
+
+\item{config}{vector configuration options passed to httr::GET}
+
+\item{retmode}{character, one of xml (default) or json. This will make no
+difference in most cases.}
+
+\item{use_history}{logical. If TRUE return a web_history object for use in
+later calls to the NCBI}
+
+\item{\dots}{character, additional terms to add to the request, see NCBI
+documentation linked to in references for a complete list}
+}
+\value{
+ids integer Unique IDs returned by the search
+
+count integer Total number of hits for the search
+
+retmax integer Maximum number of hits returned by the search
+
+web_history A web_history object for use in subsequent calls to NCBI
+
+QueryTranslation character, search term as the NCBI interpreted it
+
+file either an XMLInternalDocument xml file resulting from search, parsed with
+\code{\link[XML]{xmlTreeParse}} or, if \code{retmode} was set to json, a list
+resulting from the returned JSON file being parsed with
+\code{\link[jsonlite]{fromJSON}}.
+}
+\description{
+The NCBI uses a search term syntax where search terms can be associated with
+a specific search field with square brackets. So, for instance ``Homo[ORGN]''
+denotes a search for Homo in the ``Organism'' field. The names and
+definitions of these fields can be identified using
+\code{\link{entrez_db_searchable}}.
+}
+\details{
+Searches can make use of several fields by combining them via the boolean
+operators AND, OR and NOT. So, using the search term ``((Homo[ORGN] AND APP[GENE]) NOT
+Review[PTYP])'' in PubMed would identify articles matching the gene APP in
+humans, and exclude review articles. More examples of the use of these search
+terms, and the more specific MeSH terms for precise searching,
+are given in the package vignette.
+}
+\examples{
+\dontrun{
+ query <- "Gastropoda[Organism] AND COI[Gene]"
+ web_env_search <- entrez_search(db="nuccore", query, use_history=TRUE)
+ # the web_history element carries the WebEnv and QueryKey of this search
+ snail_coi <- entrez_fetch(db = "nuccore",
+                           web_history = web_env_search$web_history,
+                           rettype = "fasta", retmax = 10)
+}
+\donttest{
+
+fly_id <- entrez_search(db="taxonomy", term="Drosophila")
+#Oh, right. There is a genus and a subgenus named Drosophila...
+#how can we limit this search?
+(tax_fields <- entrez_db_searchable("taxonomy"))
+#"RANK" looks promising
+tax_fields$RANK
+entrez_search(db="taxonomy", term="Drosophila & Genus[RANK]")
+}
+}
+\references{
+\url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESearch_}
+}
+\seealso{
+\code{\link[httr]{config}} for available httr configurations
+
+\code{\link{entrez_db_searchable}} to get a set of search fields that
+can be used in \code{term} for any database
+}
+
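Note on the search-term syntax documented in entrez_search.Rd above: terms are tagged with a field in square brackets and combined with AND, OR and NOT. A minimal sketch of assembling such a term programmatically is given below. field_term is a hypothetical helper, not a rentrez function, and nothing here contacts the NCBI.

```r
# Hypothetical helper, not part of rentrez: attach a search field to a term.
field_term <- function(term, field) {
  paste0(term, "[", field, "]")
}

# Rebuild the example term from the details section of the help page.
human_app <- paste(field_term("Homo", "ORGN"), "AND", field_term("APP", "GENE"))
query <- paste0("((", human_app, ") NOT ", field_term("Review", "PTYP"), ")")
print(query)

# The string could then be passed on as the `term` argument, e.g.
# entrez_search(db = "pubmed", term = query)   # needs network, not run here
```

Building the term from pieces keeps the bracketed field names in one place, which may help when the same fields are reused across several searches.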
diff --git a/man/entrez_summary.Rd b/man/entrez_summary.Rd
new file mode 100644
index 0000000..2fa9d19
--- /dev/null
+++ b/man/entrez_summary.Rd
@@ -0,0 +1,84 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_summary.r
+\name{entrez_summary}
+\alias{entrez_summary}
+\title{Get summaries of objects in NCBI datasets from a unique ID}
+\usage{
+entrez_summary(db, id = NULL, web_history = NULL, version = c("2.0",
+ "1.0"), always_return_list = FALSE, config = NULL, ...)
+}
+\arguments{
+\item{db}{character Name of the database to search for}
+
+\item{id}{vector with unique ID(s) for records in database \code{db}.}
+
+\item{web_history}{A web_history object}
+
+\item{version}{either 1.0 or 2.0; see the description for details}
+
+\item{always_return_list}{logical, return a list of esummary objects even
+when only one ID is provided (see description for a note about this option)}
+
+\item{config}{vector configuration options passed to \code{httr::GET}}
+
+\item{\dots}{character Additional terms to add to the request, see NCBI
+documentation linked to in references for a complete list}
+}
+\value{
+A list of esummary records (if multiple IDs are passed and
+always_return_list is FALSE) or a single record.
+
+file XMLInternalDocument xml file containing the entire record
+returned by the NCBI.
+}
+\description{
+The NCBI offers two distinct formats for summary documents.
+Version 1.0 is a relatively limited summary of a database record based on a
+shared Document Type Definition. Version 1.0 summaries are only available as
+XML and are not available for some newer databases.
+Version 2.0 summaries generally contain more information about a given
+record, but each database has its own distinct format. 2.0 summaries are
+available for records in all databases and as JSON and XML files.
+As of version 0.4, rentrez fetches version 2.0 summaries by default and
+uses JSON as the exchange format (as JSON objects can be more easily converted
+into native R types). Existing scripts which relied on the structure and
+naming of the "Version 1.0" summary files can be updated by setting the new
+\code{version} argument to "1.0".
+}
+\details{
+By default, entrez_summary returns a single record when only one ID is
+passed and a list of such records when multiple IDs are passed. This can lead
+to unexpected behaviour when the results of a variable number of IDs (perhaps the
+result of \code{entrez_search}) are processed with an apply family function
+or in a for-loop. If you use this function as part of a function or script that
+generates a variably-sized vector of IDs setting \code{always_return_list} to
+\code{TRUE} will avoid these problems. The function
+\code{extract_from_esummary} is provided for the specific case of extracting
+named elements from a list of esummary objects, and is designed to work on
+single objects as well as lists.
+}
+\examples{
+\donttest{
+ pop_ids = c("307082412", "307075396", "307075338", "307075274")
+ pop_summ <- entrez_summary(db="popset", id=pop_ids)
+ extract_from_esummary(pop_summ, "title")
+
+ # clinvar example
+ res <- entrez_search(db = "clinvar", term = "BRCA1", retmax=10)
+ cv <- entrez_summary(db="clinvar", id=res$ids)
+ cv
+ extract_from_esummary(cv, "title", simplify=FALSE)
+ extract_from_esummary(cv, "trait_set")[1:2]
+ extract_from_esummary(cv, "gene_sort")
+}
+}
+\references{
+\url{http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESummary_}
+}
+\seealso{
+\code{\link[httr]{config}} for available configs
+
+\code{\link{extract_from_esummary}} which can be used to extract
+elements from a list of esummary records
+}
+
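Note on the always_return_list discussion in entrez_summary.Rd above: the pitfall can be reproduced without network access. The sketch below uses mock_summary, a hypothetical stand-in that only mimics the documented return shape of entrez_summary (a bare record for one ID, a list of records otherwise); the record fields are invented for illustration.

```r
# Hypothetical stand-in for entrez_summary(); no request is sent to the NCBI.
mock_summary <- function(ids, always_return_list = FALSE) {
  recs <- lapply(ids, function(id) list(uid = id, title = paste("record", id)))
  names(recs) <- ids
  if (length(recs) == 1 && !always_return_list) {
    return(recs[[1]])  # a bare record rather than a list of records
  }
  recs
}

# Code written with "a list of records" in mind.
get_titles <- function(x) vapply(x, function(rec) rec$title, character(1))

# Several IDs: works as intended.
many <- get_titles(mock_summary(c("1", "2")))

# One ID, default behaviour: vapply iterates over the fields of the single
# record, so `rec$title` is applied to a plain string and raises an error.
one_default <- tryCatch(get_titles(mock_summary("1")), error = function(e) NULL)

# One ID with always_return_list = TRUE: same code path as the multi-ID case.
one_listed <- get_titles(mock_summary("1", always_return_list = TRUE))

print(many)
print(is.null(one_default))
print(one_listed)
```

In real use, extract_from_esummary is documented as handling both a single record and a list, so it may be the simpler choice when only named elements are needed.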
diff --git a/man/extract_from_esummary.Rd b/man/extract_from_esummary.Rd
new file mode 100644
index 0000000..9494960
--- /dev/null
+++ b/man/extract_from_esummary.Rd
@@ -0,0 +1,22 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_summary.r
+\name{extract_from_esummary}
+\alias{extract_from_esummary}
+\title{Extract elements from a list of esummary records}
+\usage{
+extract_from_esummary(esummaries, elements, simplify = TRUE)
+}
+\arguments{
+\item{esummaries}{A list of esummary objects}
+
+\item{elements}{the names of the elements to extract}
+
+\item{simplify}{logical, if possible return a vector}
+}
+\value{
+List or vector containing requested elements
+}
+\description{
+Extract elements from a list of esummary records
+}
+
diff --git a/man/linkout_urls.Rd b/man/linkout_urls.Rd
new file mode 100644
index 0000000..b94955a
--- /dev/null
+++ b/man/linkout_urls.Rd
@@ -0,0 +1,22 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/entrez_link.r
+\name{linkout_urls}
+\alias{linkout_urls}
+\title{Extract URLs from an elink object}
+\usage{
+linkout_urls(elink)
+}
+\arguments{
+\item{elink}{elink object (returned by entrez_link) containing Urls}
+}
+\value{
+list of character vectors, one per ID, each containing the URLs for that
+ID.
+}
+\description{
+Extract URLs from an elink object
+}
+\seealso{
+entrez_link
+}
+
diff --git a/man/parse_pubmed_xml.Rd b/man/parse_pubmed_xml.Rd
new file mode 100644
index 0000000..24e3ecc
--- /dev/null
+++ b/man/parse_pubmed_xml.Rd
@@ -0,0 +1,30 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/parse_pubmed_xml.r
+\name{parse_pubmed_xml}
+\alias{parse_pubmed_xml}
+\title{Summarize an XML record from pubmed.}
+\usage{
+parse_pubmed_xml(record)
+}
+\arguments{
+\item{record}{Either an XMLInternalDocument or a character vector; the record
+to be parsed (expected to come from \code{\link{entrez_fetch}})}
+}
+\value{
+Either a single pubmed_record object, or a list of several
+}
+\description{
+Note: this function assumes all records are of the type "PubmedArticle"
+and will return an empty record for any other type (including books).
+}
+\examples{
+
+hox_paper <- entrez_search(db="pubmed", term="10.1038/nature08789[doi]")
+hox_rel <- entrez_link(db="pubmed", dbfrom="pubmed", id=hox_paper$ids)
+recs <- entrez_fetch(db="pubmed",
+ id=hox_rel$links$pubmed_pubmed[1:3],
+ rettype="xml")
+parse_pubmed_xml(recs)
+
+}
+
diff --git a/man/rentrez.Rd b/man/rentrez.Rd
new file mode 100644
index 0000000..90466e7
--- /dev/null
+++ b/man/rentrez.Rd
@@ -0,0 +1,23 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/help.r
+\docType{package}
+\name{rentrez}
+\alias{rentrez}
+\alias{rentrez-package}
+\title{rentrez}
+\description{
+rentrez provides functions to search for, discover and download data from
+the NCBI's databases using their EUtils function.
+}
+\details{
+Users are expected to know a little bit about the EUtils API, which is well
+documented: \url{http://www.ncbi.nlm.nih.gov/books/NBK25500/}
+
+The NCBI will ban IPs that don't use EUtils within their \href{http://www.ncbi.nlm.nih.gov/corehtml/query/static/eutils_help.html}{user guidelines}. In particular
+\enumerate{
+ \item Don't send more than three requests per second (rentrez enforces this limit)
+ \item If you plan on sending a sequence of more than ~100 requests, do so outside of peak times for the US
+ \item For large requests use the web history method (see examples for \code{\link{entrez_search}} or use \code{\link{entrez_post}} to upload IDs)
+}
+}
+
diff --git a/tests/test-all.R b/tests/test-all.R
new file mode 100644
index 0000000..3bf21d5
--- /dev/null
+++ b/tests/test-all.R
@@ -0,0 +1,10 @@
+library("testthat")
+
+#All of the tests rely on the API existing and behaving as documented. However,
+#the API occasionally falls over or stops working which leads to errors on CRAN.
+#Because we use travis CI we will hear about any test failures as soon as they
+#happen. So, let's skip all tests on CRAN:
+
+if(identical(Sys.getenv("NOT_CRAN"), "true")){
+ test_check("rentrez")
+}
diff --git a/tests/testthat/test_citmatch.r b/tests/testthat/test_citmatch.r
new file mode 100644
index 0000000..4ea0d18
--- /dev/null
+++ b/tests/testthat/test_citmatch.r
@@ -0,0 +1,12 @@
+context("Cite matching")
+test_that("Citation matching works",{
+ ex_cites <- c("proc natl acad sci u s a|1991|88|3248|mann bj|test1|",
+ "science|1987|235|182|palmenberg ac|test2|")
+ res <- entrez_citmatch(ex_cites)
+ expect_that(res, is_a("character"))
+ expect_equal(res, c("2014248", "3026048"))
+ expect_warning(entrez_citmatch(c("some|nonsense|", ex_cites)))
+})
+
+
+
diff --git a/tests/testthat/test_docs.r b/tests/testthat/test_docs.r
new file mode 100644
index 0000000..dd71335
--- /dev/null
+++ b/tests/testthat/test_docs.r
@@ -0,0 +1,18 @@
+# test any parts of the README or tutorial that aren't already part of the test
+# suite. Note, the final example of the README makes a lot of calls to NCBI, so is
+# not included here
+context("documentation")
+
+test_that("Examples in documentation work", {
+ #setup
+ hox_paper <- entrez_search(db="pubmed", term="10.1038/nature08789[doi]")
+ katipo_search <- entrez_search(db="popset",
+ term="Latrodectus katipo[Organism]")
+
+
+
+ expect_that(hox_paper$ids, equals("20203609"))
+ expect_true(katipo_search$count >= 6)
+})
+
+
diff --git a/tests/testthat/test_fetch.r b/tests/testthat/test_fetch.r
new file mode 100644
index 0000000..30e95c1
--- /dev/null
+++ b/tests/testthat/test_fetch.r
@@ -0,0 +1,31 @@
+context("fetching records")
+
+
+pop_ids = c("307082412", "307075396", "307075338", "307075274")
+coi <- entrez_fetch(db = "popset", id = pop_ids[1],
+ rettype = "fasta")
+xml_rec <- entrez_fetch(db = "popset", id=pop_ids[1], rettype="native", parsed=TRUE)
+raw_rec <- entrez_fetch(db = "popset", id=pop_ids[1], rettype="native")
+
+test_that("httr does not warn about inferred encoding", {
+ expect_message( entrez_fetch(db = "popset", id=pop_ids[1], rettype="uilist"), NA)
+})
+
+
+
+test_that("Fetching sequences works", {
+ expect_that(length(strsplit(coi, ">")[[1]]), equals(30))
+
+})
+
+test_that("Entrez_fetch record parsing works", {
+ expect_that(raw_rec, is_a("character"))
+ expect_that(xml_rec, is_a("XMLInternalDocument"))
+ expect_error(
+ entrez_fetch(db="popset", id="307082412", rettype="fasta", parsed=TRUE),
+ "At present, entrez_fetch can only parse XML records, got fasta"
+ )
+})
+
+
+
diff --git a/tests/testthat/test_httr.r b/tests/testthat/test_httr.r
new file mode 100644
index 0000000..e689f3d
--- /dev/null
+++ b/tests/testthat/test_httr.r
@@ -0,0 +1,9 @@
+context("httr option passing")
+ #most config options don't produce capture-able output, so instead
+ # we will test if we raise an error when we use a non-existent proxy to
+ # connect to the internet
+ test_that("httr config options can be passed to rentrez functions",{
+ expect_error(entrez_search(db="popset",
+ term="test",
+ config=use_proxy(url="0.0.0.0", port=80 )))
+})
diff --git a/tests/testthat/test_info.r b/tests/testthat/test_info.r
new file mode 100644
index 0000000..e1fdc54
--- /dev/null
+++ b/tests/testthat/test_info.r
@@ -0,0 +1,50 @@
+context("einfo functions")
+
+einfo_rec <- entrez_info()
+pm_rec <- entrez_info(db="pubmed")
+
+test_that(" can get xml recs from einfo", {
+ expect_that(einfo_rec, is_a("XMLInternalDocument"))
+ expect_that(pm_rec, is_a("XMLInternalDocument"))
+})
+
+dbs <- entrez_dbs()
+cdd <- entrez_db_summary("cdd")
+
+test_that(" We can get summary information on DBs", {
+ expect_that(dbs, is_a("character"))
+ expect_true("pubmed" %in% dbs)
+
+ expect_that(cdd, is_a("character"))
+ expect_named(cdd)
+})
+
+search_fields <- entrez_db_searchable("pmc")
+sf_df <- as.data.frame(search_fields)
+
+test_that("We can retrieve search fields", {
+ expect_that(search_fields, is_a("eInfoSearch"))
+ expect_named(search_fields$GRNT)
+ expect_that(sf_df, is_a("data.frame"))
+})
+
+omim_links <- entrez_db_links("omim")
+omim_df <- as.data.frame(omim_links)
+
+test_that("We can retrieve linked dbs", {
+ expect_that(omim_links, is_a("eInfoLink"))
+ expect_named(omim_links[[1]])
+ expect_that(omim_df, is_a("data.frame"))
+ expect_equal(nrow(omim_df), length(omim_links))
+})
+
+test_that("We can print elink objects", {
+ expect_output(print(omim_links), "Databases with linked records for database 'omim'")
+ expect_output(print(search_fields), "Searchable fields for database 'pmc'")
+})
+
+test_that("We can print elements from einfo object", {
+ expect_output(print(omim_links$gene), "Name: omim_gene\n")
+ expect_output(print(search_fields$GRNT), "Name: GRNT\n")
+ expect_output(print(cdd), "DbName: cdd")
+})
diff --git a/tests/testthat/test_link.r b/tests/testthat/test_link.r
new file mode 100644
index 0000000..aece800
--- /dev/null
+++ b/tests/testthat/test_link.r
@@ -0,0 +1,68 @@
+context("elink")
+
+elinks_mixed <- entrez_link(dbfrom = "pubmed", id = c(19880848, 22883857), db = "all")
+elinks_by_id <- entrez_link(dbfrom = "pubmed", id = c(19880848, 22883857), db = "all", by_id=TRUE)
+
+
+#
+#We should maybe download these xmls and test the internal functions
+# as these really take some downloading,... especially the lib. links?
+
+message("(this may take some time, have to download many records)")
+commands <- c("neighbor_history", "neighbor_score",
+ "acheck", "ncheck", "lcheck",
+ "llinks", "llinkslib", "prlinks")
+
+
+all_the_commands <- lapply(commands, function(cmd_arg)
+ entrez_link(db="pubmed", dbfrom="pubmed", id=19880848, cmd=cmd_arg)
+)
+test_that("The record-linking functions work",{
+ expect_that(elinks_mixed, is_a("elink"))
+ expect_that(names(elinks_mixed$links), is_a("character"))
+ expect_true(length(elinks_mixed$links$pubmed_mesh_major) > 0)
+})
+
+
+test_that("by_id mode works for elinks", {
+ expect_that(elinks_by_id, is_a("elink_list"))
+ expect_that(length(elinks_by_id), equals(2))
+ expect_that(elinks_by_id[[1]], is_a("elink"))
+})
+
+test_that("elink printing behaves", {
+ expect_output(print(elinks_by_id), "List of 2 elink objects,each containing")
+ for(ret in all_the_commands){
+ expect_output(print(ret), "elink object with contents:\\s+\\$[A-Za-z]+")
+ }
+})
+
+
+test_that("We detect missing ids from elink results",{
+ expect_warning(
+ entrez_link(dbfrom="pubmed", db="all", id=c(20203609,2020360999999,20203610), by_id=TRUE)
+ )
+})
+
+test_that("Elink sub-elements can be accessed and printed", {
+ expect_output(print(all_the_commands[[3]][[1]]),
+ "elink result with information from \\d+ databases")
+ expect_output(print(all_the_commands[[8]]$linkouts[[1]]),
+ "Linkout from [A-Za-z]+\\s+\\$Url")
+})
+
+
+test_that("URLs can be extracted from elink objs", {
+ for(idx in 6:8){
+ urls <- linkout_urls(all_the_commands[[idx]])
+ expect_that(urls, is_a("list"))
+ expect_that(urls[[1]], is_a("character"))
+ }
+})
+
+test_that("Elink errors on mis-spelled/unknown cmds",{
+ expect_error(rcheck <- entrez_link(dbfrom = "pubmed",
+ id = 19880848, db = "all",
+ cmd='rcheck'))
+})
+
diff --git a/tests/testthat/test_net.r b/tests/testthat/test_net.r
new file mode 100644
index 0000000..06d5b0a
--- /dev/null
+++ b/tests/testthat/test_net.r
@@ -0,0 +1,4 @@
+context("Network")
+test_that("The NCBI is contactable from this computer",{
+ expect_true(!httr::http_error("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"))
+})
diff --git a/tests/testthat/test_parse.r b/tests/testthat/test_parse.r
new file mode 100644
index 0000000..36df7ae
--- /dev/null
+++ b/tests/testthat/test_parse.r
@@ -0,0 +1,44 @@
+context("result-parsers")
+
+
+raw_rec <- entrez_fetch(db="pubmed", id=20674752, rettype="xml")
+xml_rec <- entrez_fetch(db="pubmed", id=20674752, rettype="xml", parsed=TRUE)
+multi_rec <- entrez_fetch(db="pubmed",
+ id=c(22883857, 25042335, 20203609,11959827),
+ rettype="xml", parsed=TRUE)
+parsed_raw <- parse_pubmed_xml(raw_rec)
+parsed_rec <- parse_pubmed_xml(xml_rec)
+parsed_multi <- parse_pubmed_xml(multi_rec)
+
+test_that("pubmed file parsers work",{
+ expect_that(raw_rec, is_a("character"))
+
+ expect_that(parsed_raw, is_a("pubmed_record"))
+ expect_that(parsed_rec, is_a("pubmed_record"))
+ expect_that(names(parsed_rec), is_a("character"))
+ expect_that(parsed_rec$pmid, is_identical_to("20674752"))
+
+ expect_that(parsed_multi, is_a("multi_pubmed_record"))
+ expect_that(parsed_multi[[1]], is_a("pubmed_record"))
+ expect_that(length(parsed_multi), equals(4))
+
+ # Older (buggier) versions of the pubmed parser included data from every
+ # record in an xml file in each parsed record. If that error is
+ # re-introduced there will be 25 authors in each record and this will fail
+ expect_that(length(parsed_multi[[1]]$authors), equals(1))
+
+})
+
+test_that("we can print pubmed records", {
+ expect_output(print(parsed_rec), "Pubmed record")
+ expect_output(print(parsed_multi), "List of 4 pubmed records")
+})
+
+test_that("We warn about unknown pubmed record types", {
+ rec = entrez_fetch(db="pubmed", id=25905152, rettype="xml")
+ expect_warning(parsed_rec <- parse_pubmed_xml(rec))
+ expect_output(print(parsed_rec), "Pubmed record \\(empty\\)")
+})
+
+
+
diff --git a/tests/testthat/test_post.r b/tests/testthat/test_post.r
new file mode 100644
index 0000000..2d9a8ad
--- /dev/null
+++ b/tests/testthat/test_post.r
@@ -0,0 +1,35 @@
+context("entrez_post")
+
+prot_ids = c(15718680,157427902)
+ret <- entrez_post(id=prot_ids, db="protein")
+
+test_that("we can post ids", {
+ qk <- ret$QueryKey
+ expect_that(as.integer(qk), is_a("integer"))
+ expect_false(is.na(as.integer(qk)))
+ expect_that(ret$QueryKey, is_a("character"))
+})
+
+test_that("we can add to WebEnv", {
+ ret2 <- entrez_post(id=119703751, db="protein", web_history=ret)
+ first <- entrez_summary(db="protein", web_history=ret)
+ second <- entrez_summary(db="protein", web_history=ret2)
+ expect_equal(ret2$QueryKey, "2")
+ expect_equal(ret2$WebEnv, ret$WebEnv)
+ expect_equal(length(first), 2)
+ expect_that(second, is_a("esummary")) # i.e. just one
+})
+
+test_that("Example works", {
+ so_many_snails <- entrez_search(db="nuccore",
+ "Gastropoda[Organism] AND COI[Gene]", retmax=200)
+ upload <- entrez_post(db="nuccore", id=so_many_snails$ids)
+ first <- entrez_fetch(db="nuccore", rettype="fasta", web_history=upload, retstart=0, retmax=4)
+ nrecs <- length(gregexpr(">", first)[[1]])
+ expect_equal(nrecs, 4)
+})
+
+test_that("We can print a post result", {
+ expect_output(print(ret),
+ "\\(QueryKey = \\d+, WebEnv = [A-Z0-9_]+\\.\\.\\.\\)")
+})
diff --git a/tests/testthat/test_query.r b/tests/testthat/test_query.r
new file mode 100644
index 0000000..db2c248
--- /dev/null
+++ b/tests/testthat/test_query.r
@@ -0,0 +1,46 @@
+context("query")
+test_that("Query building functions work", {
+
+ #concatenate multiple IDs, include entrez terms
+ query <- rentrez:::make_entrez_query("efetch",
+ db="nuccore",
+ id=c(443610374, 443610372),
+ config=list(),
+ retmode="txt",
+ rettype="fasta")
+ nrecs <- length(gregexpr(">", query)[[1]])
+
+ expect_equal(nrecs, 2)
+
+
+ #should be able to give ints or characters to id and get a url
+ query <- rentrez:::make_entrez_query("efetch",
+ db="nuccore",
+ id=c("443610374", "443610372"),
+ retmode="txt",
+ config=list(),
+ rettype="fasta")
+ nrecs <- length(gregexpr(">", query)[[1]])
+ expect_equal(nrecs, 2)
+
+ #specific functions have the right "require one of" settings
+ expect_that(entrez_fetch(db="nuccore", rettype="fasta"), throws_error())
+ expect_that(entrez_summary(db="nuccore", web_history="A", id=123), throws_error())
+ expect_that(entrez_link(db="nuccore", dbfrom="pubmed"), throws_error())
+
+ #httr passes on errors
+ #404
+ expect_error(rentrez:::make_entrez_query("non-eutil",
+ id=12,
+ db="none",
+ config=list()))
+ #400
+ expect_error(rentrez:::make_entrez_query("efetch",
+ id=1e-17,
+ config=list(),
+ db="nuccore"))
+
+})
+
+
+
diff --git a/tests/testthat/test_search.r b/tests/testthat/test_search.r
new file mode 100644
index 0000000..f25ace2
--- /dev/null
+++ b/tests/testthat/test_search.r
@@ -0,0 +1,34 @@
+context("search")
+
+#setup
+gsearch <- entrez_global_query("Heliconius")
+pubmed_search <- entrez_search(db = "pubmed",
+ term = "10.1016/j.ympev.2010.07.013[doi]")
+json_search <- entrez_search(db="pubmed",
+ term = "10.1016/j.ympev.2010.07.013[doi]",
+ retmode='json')
+
+test_that("Global query works",{
+ #global query
+ expect_that(gsearch, is_a("numeric"))
+ expect_that(names(gsearch), is_a("character"))
+ expect_true(sum(gsearch) > 0 )
+})
+
+test_that("Entrez query works",{
+ #entrez query
+ expect_that(pubmed_search, is_a("esearch"))
+ expect_that(pubmed_search$ids, is_identical_to("20674752"))
+})
+
+test_that("Entrez query works just as well with xml/json",{
+ expect_that(json_search, is_a("esearch"))
+ expect_that(json_search$ids, is_identical_to("20674752"))
+ expect_equal(names(pubmed_search),names(json_search))
+})
+
+
+test_that("we can print search results", {
+ expect_output(print(pubmed_search), "Entrez search result with \\d+ hits")
+ expect_output(print(json_search), "Entrez search result with \\d+ hits")
+})
diff --git a/tests/testthat/test_summary.r b/tests/testthat/test_summary.r
new file mode 100644
index 0000000..e14527c
--- /dev/null
+++ b/tests/testthat/test_summary.r
@@ -0,0 +1,79 @@
+context("fetching and parsing summary recs")
+
+
+pop_ids = c("307082412", "307075396", "307075338", "307075274")
+pop_summ_xml <- entrez_summary(db="popset",
+ id=pop_ids, version="1.0")
+pop_summ_json <- entrez_summary(db="popset",
+ id=pop_ids, version="2.0")
+
+
+test_that("Functions to fetch summaries work", {
+ #tests
+ expect_that(pop_summ_xml, is_a("list"))
+ expect_that(pop_summ_json, is_a("list"))
+
+ expect_that(pop_summ_xml[[4]], is_a("esummary"))
+ expect_that(pop_summ_json[[4]], is_a("esummary"))
+ sapply(pop_summ_json, function(x)
+ expect_that(x[["title"]], matches("Muraenidae"))
+ )
+})
+
+
+
+test_that("List elements in XML are parsable", {
+ rec <- entrez_summary(db="pubmed", id=25696867, version="1.0")
+ expect_named(rec$History)
+ expect_gt(length(rec$History), 0)
+})
+
+
+test_that("JSON and XML objects are similar", {
+ #It would be nice to test whether the xml and json records
+ # have the same data in them, but it turns out they don't
+ # when they leave the NCBI, so let's ensure we can get some
+ # info from each file, even if they won't be exactly the same
+ sapply(pop_summ_xml, function(x)
+ expect_that(x[["Title"]], matches("Muraenidae")))
+ sapply(pop_summ_json, function(x)
+ expect_that(x[["title"]], matches("Muraenidae")))
+
+ expect_that(length(pop_summ_xml[[1]]), is_more_than(12))
+ expect_that(length(pop_summ_json[[1]]), is_more_than(12))
+
+})
+
+test_that("We can print summary records", {
+ expect_output(print(pop_summ_json), "List of 4 esummary records")
+ expect_output(print(pop_summ_json[[1]]), "esummary result with \\d+ items")
+ expect_output(print(pop_summ_xml), "List of 4 esummary records")
+ expect_output(print(pop_summ_xml[[1]]), "esummary result with \\d+ items")
+})
+
+test_that("We can detect errors in esummary records", {
+ expect_warning(
+ entrez_summary(db="pmc", id=c(4318541212,4318541), version="1.0")
+ )
+ expect_warning(
+ entrez_summary(db="pmc", id=c(4318541212,4318541))
+ )
+})
+
+test_that("We can extract elements from esummary object", {
+ expect_that(extract_from_esummary(pop_summ_xml, c("Title", "TaxId")), is_a("matrix"))
+ expect_that(extract_from_esummary(pop_summ_xml, c("Title", "TaxId"), simplify=FALSE), is_a("list"))
+ expect_that(extract_from_esummary(pop_summ_json, "title"), is_a("character"))
+
+})
+
+test_that("We can extract elements from a single esummary", {
+ expect_that(extract_from_esummary(pop_summ_xml[[1]], c("Title", "TaxId")), is_a("list"))
+ expect_that(extract_from_esummary(pop_summ_xml[[1]], "Gi"), is_a("integer"))
+ expect_that(extract_from_esummary(pop_summ_xml[[1]], "Gi", FALSE), is_a("list"))
+})
+
+test_that("We can get a list of one element if we ask for it", {
+ expect_that(entrez_summary(db="popset", id=307075396, always_return_list=TRUE), is_a("list"))
+ expect_that(entrez_summary(db="popset", id=307075396), is_a("esummary"))
+})
diff --git a/tests/testthat/test_webenv.r b/tests/testthat/test_webenv.r
new file mode 100644
index 0000000..7cd751c
--- /dev/null
+++ b/tests/testthat/test_webenv.r
@@ -0,0 +1,15 @@
+context("WebEnv")
+test_that("Searches using WebEnv features work", {
+ #setup
+ web_env_search <- entrez_search(db="nuccore",
+ term="Gastropoda[Organism] AND COI[Gene]",
+ use_history=TRUE)
+ wh <- web_env_search$web_history
+ snail_coi <- entrez_fetch(db = "nuccore", web_history=wh, rettype = "fasta", retmax = 10)
+
+ #test
+ expect_that(wh$WebEnv, is_a("character"))
+ expect_that(as.integer(wh$QueryKey), is_a("integer"))
+ expect_that(snail_coi, is_a("character"))
+ expect_that(length(strsplit(snail_coi, ">")[[1]]), equals(11))
+})
diff --git a/vignettes/rentrez_tutorial.Rmd b/vignettes/rentrez_tutorial.Rmd
new file mode 100644
index 0000000..cbaf1f6
--- /dev/null
+++ b/vignettes/rentrez_tutorial.Rmd
@@ -0,0 +1,627 @@
+---
+title: Rentrez Tutorial
+author: "David Winter"
+date: "`r Sys.Date()`"
+output:
+ rmarkdown::html_vignette:
+ toc: true
+vignette: >
+ %\VignetteIndexEntry{Rentrez Tutorial}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\usepackage[utf8]{inputenc}
+---
+
+```{r, count_recs, echo=FALSE}
+library(rentrez)
+count_recs <- function(db, denom) {
+ nrecs <- rentrez::entrez_db_summary(db)["Count"]
+ round(as.integer(nrecs)/denom, 1)
+}
+```
+## Introduction: The NCBI, entrez and `rentrez`.
+
+The NCBI shares a _lot_ of data. At the time this document was compiled, there
+were `r count_recs("pubmed",1e6)` million papers in [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/),
+including `r count_recs("pmc", 1e6)` million full-text records available in [PubMed Central](http://www.ncbi.nlm.nih.gov/pmc/).
+[The NCBI Nucleotide Database](http://www.ncbi.nlm.nih.gov/nuccore) (which includes GenBank) has data for `r count_recs("nuccore", 1e6)`
+million different sequences, and [dbSNP](http://www.ncbi.nlm.nih.gov/snp/) describes
+`r count_recs("snp", 1e6)` million different genetic variants. All of these
+records can be cross-referenced with the `r round(entrez_search(db="taxonomy", term='species[RANK]')$count/1e6,2)` million
+species in the [NCBI taxonomy](http://www.ncbi.nlm.nih.gov/taxonomy) or `r count_recs("omim", 1e3)` thousand disease-associated records
+in [OMIM](http://www.ncbi.nlm.nih.gov/omim).
+
+
+The NCBI makes this data available through a [web interface](http://www.ncbi.nlm.nih.gov/),
+an [FTP server](ftp://ftp.ncbi.nlm.nih.gov/) and through a REST API called the
+[Entrez Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25500/) (`Eutils` for
+short). This package provides functions to use that API, allowing users to
+gather and combine data from multiple NCBI databases in the comfort of an R
+session or script.
+
+## Getting started with rentrez
+
+To make the most of all the data the NCBI shares you need to know a little about
+their databases, the records they contain and the ways you can find those
+records. The [NCBI provides extensive documentation for each of their
+databases](http://www.ncbi.nlm.nih.gov/home/documentation.shtml) and for the
+[EUtils API that `rentrez` takes advantage of](http://www.ncbi.nlm.nih.gov/books/NBK25501/).
+There are also some helper functions in `rentrez` that help users learn their
+way around the NCBI's databases.
+
+First, you can use `entrez_dbs()` to find the list of available databases:
+
+```{r, dbs}
+entrez_dbs()
+```
+There is a set of functions with names starting with `entrez_db_` that can be used to
+gather more information about each of these databases:
+
+**Functions that help you learn about NCBI databases**
+
+| Function name            | Return                                                  |
+|--------------------------|---------------------------------------------------------|
+| `entrez_db_summary()`    | Brief description of what the database is               |
+| `entrez_db_searchable()` | Set of search terms that can be used with this database |
+| `entrez_db_links()`      | Set of databases that might contain linked records      |
+
+For instance, we can get a description of the somewhat cryptically named
+database 'cdd'...
+
+```{r, cdd}
+entrez_db_summary("cdd")
+```
+
+... or find out which search terms can be used with the Sequence Read Archive (SRA)
+database (which contains raw data from sequencing projects):
+
+```{r, sra_eg}
+entrez_db_searchable("sra")
+```
+
+Just how these 'helper' functions might be useful will become clearer once
+you've started using `rentrez`, so let's get started.
+
+## Searching databases: `entrez_search()`
+
+Very often, the first thing you'll want to do with `rentrez` is search a given
+NCBI database to find records that match some keywords. You can do this using
+the function `entrez_search()`. In the simplest case you just need to provide a
+database name (`db`) and a search term (`term`) so let's search PubMed for
+articles about the `R language`:
+
+
+```{r eg_search}
+r_search <- entrez_search(db="pubmed", term="R Language")
+```
+The object returned by a search acts like a list, and you can get a summary of
+its contents by printing it.
+
+```{r print_search}
+r_search
+```
+
+There are a few things to note here. First, the NCBI's server has worked out
+that we meant R as a programming language, and so included the
+['MeSH' term](http://www.ncbi.nlm.nih.gov/mesh) associated with programming
+languages. We'll worry about MeSH terms and other special queries later; for now
+just note that you can use this feature to check that your search term was interpreted in the way
+you intended. Second, there are many more 'hits' for this search than there
+are unique IDs contained in this object. That's because the optional argument
+`retmax`, which controls the maximum number of returned values, has a default
+value of 20.
+
+The IDs are the most important thing returned here. They
+allow us to fetch records matching those IDs, gather summary data about them or find
+cross-referenced records in other databases. We access the IDs as a vector using the
+`$` operator:
+
+
+```{r search_ids}
+r_search$ids
+```
+
+If we want to get more than 20 IDs we can do so by increasing the `retmax` argument.
+
+```{r searchids_2}
+another_r_search <- entrez_search(db="pubmed", term="R Language", retmax=40)
+another_r_search
+```
+
+If we want to get IDs for all of the thousands of records that match this
+search, we can use the NCBI's web history feature [described below](#web_history).
+
+
+### Building search terms
+
+The EUtils API uses a special syntax to build search terms. You can search a
+database against a specific term using the format `query[SEARCH FIELD]`, and
+combine multiple such searches using the boolean operators `AND`, `OR` and `NOT`.
+
+For instance, we can find next generation sequence datasets for the (amazing...) ciliate
+_Tetrahymena thermophila_ by using the organism ('ORGN') search field:
+
+
+```{r, Tt}
+entrez_search(db="sra",
+ term="Tetrahymena thermophila[ORGN]",
+ retmax=0)
+```
+
+We can narrow our focus to only those records that have been added recently (using the colon to
+specify a range of values):
+
+
+```{r, Tt2}
+entrez_search(db="sra",
+ term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
+ retmax=0)
+```
+
+Or include recent records for either _T. thermophila_ or its close relative _T.
+borealis_ (using parentheses to make ANDs and ORs explicit):
+
+
+```{r, Tt3}
+entrez_search(db="sra",
+ term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
+ retmax=0)
+```
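+
+Search terms like these are ordinary strings, so they can also be assembled
+with base R. The block below is a minimal sketch, assuming a hypothetical helper
+called `build_term()` (it is not part of `rentrez`) that tags each value with a
+search field and joins alternatives with `OR`:
+
+```r
+# Hypothetical helper, not part of rentrez: tag each value with a search
+# field and, when there are several values, join them with OR in parentheses.
+build_term <- function(values, field) {
+    tagged <- paste0(values, "[", field, "]")
+    if (length(tagged) > 1) {
+        paste0("(", paste(tagged, collapse = " OR "), ")")
+    } else {
+        tagged
+    }
+}
+
+organisms <- build_term(c("Tetrahymena thermophila", "Tetrahymena borealis"), "ORGN")
+query <- paste(organisms, "AND", build_term("2013:2015", "PDAT"))
+query
+```
+
+The resulting string can be passed to the `term` argument of `entrez_search()`
+in the usual way.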
+
+The set of search terms available varies between databases. You can get a list
+of available terms for any given database with `entrez_db_searchable()`:
+
+```{r, sra_searchable}
+entrez_db_searchable("sra")
+```
+
+### Precise queries using MeSH terms
+
+In addition to the search terms described above, the NCBI allows searches using
+[Medical Subject Heading (MeSH)](http://www.ncbi.nlm.nih.gov/mesh) terms. These
+terms create a 'controlled vocabulary', and allow users to make very finely
+controlled queries of databases.
+
+For instance, if you were interested in reviewing studies on how a class of
+anti-malarial drugs called Folic Acid Antagonists work against _Plasmodium vivax_ (a
+particular species of malarial parasite), you could use this search:
+
+```{r, mesh}
+entrez_search(db = "pubmed",
+ term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
+```
+
+The complete set of MeSH terms is available as a database from the NCBI. That
+means it is possible to download detailed information about each term and find
+the ways in which terms relate to each other using `rentrez`. You can search
+for specific terms with `entrez_search(db="mesh", term =...)` and learn about the
+results of your search using the tools described below.
+
+### Advanced counting
+
+As you can see above, the object returned by `entrez_search()` includes the
+number of records matching a given search. This means you can learn a little
+about the composition of, or trends in, the records stored in the NCBI's
+databases using only the search utility. For instance, let's track the rise of
+the scientific buzzword "connectome" in PubMed, programmatically creating
+search terms for the `PDAT` field:
+
+```{r, connectome, fig.width=5, fig.height=4, fig.align='center'}
+search_year <- function(year, term){
+ query <- paste(term, "AND (", year, "[PDAT])")
+ entrez_search(db="pubmed", term=query, retmax=0)$count
+}
+
+year <- 2008:2014
+papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)
+
+plot(year, papers, type='b', main="The Rise of the Connectome")
+```
+
+## Finding cross-references: `entrez_link()`
+
+
+One of the strengths of the NCBI databases is the degree to which records of one
+type are connected to other records within the NCBI or to external data
+sources. The function `entrez_link()` allows users to discover these links
+between records.
+
+### My god, it's full of links
+
+To get an idea of the degree to which records in the NCBI are cross-linked we
+can find all NCBI data associated with a single gene (in this case the
+Amyloid Beta Precursor gene, the product of which is associated with the
+plaques that form in the brains of Alzheimer's Disease patients).
+
+The function `entrez_link()` can be used to find cross-referenced records. In
+the most basic case we need to provide an ID (`id`), the database from which this
+ID comes (`dbfrom`) and the name of a database in which to find linked records (`db`).
+If we set this last argument to 'all' we can find links in multiple databases:
+
+```{r elink0}
+all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
+all_the_links
+```
+Just as with `entrez_search` the returned object behaves like a list, and we can
+learn a little about its contents by printing it. In this case, all of the
+information is in `links` (and there are a lot of them!):
+
+
+```{r elink_link}
+all_the_links$links
+```
+The names of the list elements are in the format `[source_database]_[linked_database]`
+and the elements themselves contain a vector of linked-IDs. So, if we want to
+find open access publications associated with this gene we could get linked records
+in PubMed Central:
+
+```{r, elink_pmc}
+all_the_links$links$gene_pmc[1:10]
+```
+
+Or, if we were interested in this gene's role in diseases, we could find links to ClinVar:
+
+```{r, elink_omim}
+all_the_links$links$gene_clinvar
+
+```
+
+### Narrowing our focus
+
+If we know beforehand what sort of links we'd like to find, we can
+use the `db` argument to narrow the focus of a call to `entrez_link`.
+
+For instance, say we are interested in knowing about all of the
+RNA transcripts associated with the Amyloid Beta Precursor gene in humans.
+Transcript sequences are stored in the nucleotide database (referred
+to as `nuccore` in EUtils), so to find transcripts associated with a given gene
+we need to set `dbfrom=gene` and `db=nuccore`.
+
+```{r, elink1}
+nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
+nuc_links
+nuc_links$links
+```
+The object we get back contains links to the nucleotide database generally, but
+also to special subsets of that database like [refseq](http://www.ncbi.nlm.nih.gov/refseq/).
+We can take advantage of this narrower set of links to find IDs that match unique
+transcripts from our gene of interest.
+
+```{r, elinik_refseqs}
+nuc_links$links$gene_nuccore_refseqrna
+```
+We can use these ids in calls to `entrez_fetch()` or `entrez_summary()` to learn
+more about the transcripts they represent.
+
+### External links
+
+In addition to finding data within the NCBI, `entrez_link` can turn up
+connections to external databases. Perhaps the most interesting example is
+finding links to the full text of papers in PubMed. For example, when I wrote
+this document the first paper linked to Amyloid Beta Precursor had a unique ID of
+`25500142`. We can find links to the full text of that paper with `entrez_link`
+by setting the `cmd` argument to 'llinks':
+
+```{r, outlinks}
+paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
+paper_links
+```
+
+Each element of the `linkouts` object contains information about an external
+source of data on this paper:
+
+```{r, urls}
+paper_links$linkouts
+```
+
+Each of those linkout objects contains quite a lot of information, but the URL
+is probably the most useful. For that reason, `rentrez` provides the
+function `linkout_urls` to make extracting just the URL simple:
+
+```{r just_urls}
+linkout_urls(paper_links)
+```
+
+The full list of options for the `cmd` argument is given in the in-line
+documentation (`?entrez_link`). If you are interested in finding full text
+records for a large number of articles, check out the package
+[fulltext](https://github.com/ropensci/fulltext) which makes use of multiple
+sources (including the NCBI) to discover full text articles.
+
+### Using more than one ID
+
+It is possible to pass more than one ID to `entrez_link()`. By default, doing so
+will give you a single elink object containing the complete set of links for
+_all_ of the IDs that you specified. So, if you were looking for protein IDs
+related to specific genes you could do:
+
+```{r, multi_default}
+all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
+all_links_together
+all_links_together$links$gene_protein
+```
+
+Although this behaviour might sometimes be useful, it means we've lost track of
+which `protein` ID is linked to which `gene` ID. To retain that information we
+can set `by_id` to `TRUE`. This gives us a list of elink objects, each one
+containing links from a single `gene` ID:
+
+```{r, multi_byid}
+all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
+all_links_sep
+lapply(all_links_sep, function(x) x$links$gene_protein)
+```
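+
+A list like this can be flattened into a two-column table of source and linked
+IDs. The block below is a sketch only: `link_table()` is a hypothetical helper
+(not part of `rentrez`), the protein IDs in `mock_links` are invented stand-ins
+for `all_links_sep`, and it assumes the elink objects come back in the same
+order as the IDs that were passed in:
+
+```r
+gene_ids <- c("93100", "223646")
+
+# Invented stand-in for `all_links_sep`; real elink objects hold real IDs.
+mock_links <- list(
+    list(links = list(gene_protein = c("111", "222"))),
+    list(links = list(gene_protein = c("333")))
+)
+
+# Hypothetical helper: pair each source ID with every ID it links to.
+# Assumes `elinks` is in the same order as `ids`.
+link_table <- function(ids, elinks, name) {
+    linked <- lapply(elinks, function(x) x$links[[name]])
+    data.frame(from = rep(ids, lengths(linked)),
+               to = unlist(linked),
+               stringsAsFactors = FALSE)
+}
+
+gene_to_protein <- link_table(gene_ids, mock_links, "gene_protein")
+gene_to_protein
+```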
+
+## Getting summary data: `entrez_summary()`
+
+Having found the unique IDs for some records via `entrez_search` or `entrez_link()`, you are
+probably going to want to learn something about them. The `Eutils` API has two
+ways to get information about a record. `entrez_fetch()` returns 'full' records
+in varying formats and `entrez_summary()` returns less information about each
+record, but in a relatively simple format. Very often the summary records have the information
+you are after, so `rentrez` provides functions to parse and summarise summary
+records.
+
+
+### The summary record
+
+`entrez_summary()` takes a vector of unique IDs for the records you want to get
+summary information from. Let's start by finding out something about the paper
+describing [Taxize](https://github.com/ropensci/taxize), using its PubMed ID:
+
+
+```{r, Summ_1}
+taxize_summ <- entrez_summary(db="pubmed", id=24555091)
+taxize_summ
+```
+
+Once again, the object returned by `entrez_summary` behaves like a list, so you can extract
+elements using `$`. For instance, we could convert our PubMed ID to another
+article identifier...
+
+```{r, Summ_2}
+taxize_summ$articleids
+```
+...or see how many times the article has been cited in PubMed Central papers:
+
+```{r, Summ_3}
+taxize_summ$pmcrefcount
+```
+
+### Dealing with many records
+
+If you give `entrez_summary()` a vector with more than one ID you'll get a
+list of summary records back. Let's get those _Plasmodium vivax_ papers we found
+in the `entrez_search()` section back, and fetch some summary data on each paper:
+
+```{r, multi_summ}
+vivax_search <- entrez_search(db = "pubmed",
+ term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
+multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
+```
+
+`rentrez` provides a helper function, `extract_from_esummary()` that takes one
+or more elements from every summary record in one of these lists. Here it is
+working with one...
+
+```{r, multi_summ2}
+extract_from_esummary(multi_summs, "fulljournalname")
+```
+... and several elements:
+
+```{r, multi_summ3}
+date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
+knitr::kable(head(t(date_and_cite)), row.names=FALSE)
+```
+
+## Fetching full records: `entrez_fetch()`
+
+As useful as the summary records are, sometimes they just don't have the
+information that you need. If you want a complete representation of a record you
+can use `entrez_fetch`, using the argument `rettype` to specify the format you'd
+like the record in.
+
+### Fetch DNA sequences in fasta format
+
+Let's extend the example given in the `entrez_link()` section about finding
+transcripts for a given gene. This time we will fetch cDNA sequences of those
+transcripts. We can start by repeating the steps in the earlier example
+to get nucleotide IDs for refseq transcripts of two genes:
+
+```{r, transcript_ids}
+gene_ids <- c(351, 11647)
+linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
+linked_transcripts <- linked_seq_ids$links$gene_nuccore_refseqrna
+head(linked_transcripts)
+```
+
+Now we can get our sequences with `entrez_fetch`, setting `rettype` to "fasta"
+(the list of formats available for [each database is given in this table](http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/)):
+
+```{r fetch_fasta}
+all_recs <- entrez_fetch(db="nuccore", id=linked_transcripts, rettype="fasta")
+class(all_recs)
+nchar(all_recs)
+```
+
+Congratulations, now you have a really huge character vector! Rather than
+printing all those thousands of bases we can take a peek at the top of the file:
+
+```{r, peak}
+cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
+```
+
+If we wanted to use these sequences in some other application we could write them
+to file:
+
+```r
+write(all_recs, file="my_transcripts.fasta")
+```
+
+Alternatively, if we want to use them within an R session,
+we could write them to a temporary file then read that. In this case I'm using `read.dna()` from the
+phylogenetics package ape (but not executing the code block in this vignette, so
+you don't have to install that package):
+
+```r
+temp <- tempfile()
+write(all_recs, temp)
+parsed_recs <- ape::read.dna(temp, format="fasta")
+```
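+
+If installing another package is not an option, a fasta-formatted string can
+also be split up with base R. The block below is a minimal sketch, assuming a
+hypothetical helper `parse_fasta()` (not part of `rentrez`), Unix line endings,
+and header lines that contain no `>` beyond the leading one. It is shown on a
+toy string rather than on `all_recs`:
+
+```r
+# Hypothetical helper: split a fasta-formatted string into a character vector
+# of sequences, named by their header lines.
+parse_fasta <- function(fasta_text) {
+    records <- strsplit(fasta_text, ">", fixed = TRUE)[[1]]
+    records <- records[nzchar(trimws(records))]
+    rec_lines <- strsplit(records, "\n", fixed = TRUE)
+    # First line of each record is the header, the rest is sequence
+    seqs <- vapply(rec_lines, function(x) paste(x[-1], collapse = ""), character(1))
+    names(seqs) <- vapply(rec_lines, function(x) x[1], character(1))
+    seqs
+}
+
+toy <- ">seq1 first record\nACGT\nAC\n\n>seq2 second record\nTTGA\n"
+toy_seqs <- parse_fasta(toy)
+nchar(toy_seqs)
+```
+
+For real analyses a dedicated parser such as `read.dna()` is the safer choice,
+since it deals with the corner cases this sketch ignores.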
+
+### Fetch a parsed XML document
+
+Most of the NCBI's databases can return records in XML format. In addition to
+downloading the text-representation of these files, `entrez_fetch()` can return
+objects parsed by the `XML` package. As an example, we can check out the Taxonomy
+database's record for (did I mention they are amazing....) _Tetrahymena
+thermophila_, specifying we want the result to be parsed by setting
+`parsed=TRUE`:
+
+```{r, Tt_tax}
+Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
+tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
+class(tax_rec)
+```
+
+The package XML (which you have if you have installed `rentrez`) provides
+functions to get information from these files. For relatively simple records
+like this one you can use `XML::xmlToList`:
+
+```{r, Tt_list}
+tax_list <- XML::xmlToList(tax_rec)
+tax_list$Taxon$GeneticCode
+```
+
+For more complex records, which generate deeply-nested lists, you can use
+[XPath expressions](https://en.wikipedia.org/wiki/XPath) along with the function
+`XML::xpathSApply` or the extraction operators `[` and `[[` to extract specific parts of the
+file. For instance, we can get the scientific name of each taxon in _T.
+thermophila_'s lineage by specifying a path through the XML:
+
+```{r, Tt_path}
+tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
+tt_lineage[1:4]
+```
+
+As the name suggests, `XML::xpathSApply()` is a counterpart of base R's
+`sapply`, and can be used to apply a function to
+nodes in an XML object. A particularly useful function to apply is `XML::xmlValue`,
+which returns the content of the node:
+
+```{r, Tt_apply}
+XML::xpathSApply(tax_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
+```
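+
+XPath predicates can narrow a query to nodes that meet a condition. The block
+below is a sketch on a small hand-written document rather than a real NCBI
+record: `toy_xml` only imitates the nesting used above, and its names and ranks
+are there for illustration.
+
+```r
+# Toy document imitating the nesting of a Taxonomy record (illustration only)
+toy_xml <- paste0(
+    "<TaxaSet><Taxon><LineageEx>",
+    "<Taxon><ScientificName>Eukaryota</ScientificName><Rank>superkingdom</Rank></Taxon>",
+    "<Taxon><ScientificName>Ciliophora</ScientificName><Rank>phylum</Rank></Taxon>",
+    "</LineageEx></Taxon></TaxaSet>"
+)
+toy_rec <- XML::xmlParse(toy_xml, asText = TRUE)
+
+# Every name in the lineage, then only the taxon whose Rank is "phylum"
+all_names <- XML::xpathSApply(toy_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
+phylum <- XML::xpathSApply(toy_rec, "//LineageEx/Taxon[Rank='phylum']/ScientificName", XML::xmlValue)
+```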
+There are a few more complex examples of using `XPath` [on the rentrez wiki](https://github.com/ropensci/rentrez/wiki).
+
+<a name="web_history"></a>
+
+## Using NCBI's Web History features
+
+When you are dealing with very large queries it can be time consuming to pass
+long vectors of unique IDs to and from the NCBI. To avoid this problem, the NCBI
+provides a feature called "web history" which allows users to store IDs on the
+NCBI servers then refer to them in future calls.
+
+### Post a set of IDs to the NCBI for later use: `entrez_post()`
+
+If you have a list of many NCBI IDs that you want to use later on, you can post
+them to the NCBI's servers. In order to provide a brief example, I'm going to post just one
+ID, the `omim` identifier for asthma:
+
+```{r, asthma}
+upload <- entrez_post(db="omim", id=600807)
+upload
+```
+The NCBI sends you back some information you can use to refer to the posted IDs.
+In `rentrez`, that information is represented as a `web_history` object.
+
+### Get a `web_history` object from `entrez_search` or `entrez_link()`
+
+In addition to directly uploading IDs to the NCBI, you can use the web history
+features with `entrez_search` and `entrez_link`. For instance, imagine you wanted to
+find all of the sequences of the widely-studied gene COI from all snails
+(which are members of the taxonomic group Gastropoda):
+
+```{r, snail_search}
+entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]")
+```
+
+That's a lot of sequences! If you really wanted to download all of these it
+would be a good idea to save all those IDs to the server by setting
+`use_history` to `TRUE` (note you now get a `web_history` object along with your
+normal search result):
+
+```{r, snail_history}
+snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)
+snail_coi
+snail_coi$web_history
+```
+
+Similarly, `entrez_link()` can return `web_history` objects by using the `cmd`
+`neighbor_history`. Let's find genetic variants (from the clinvar database)
+associated with asthma (using the same OMIM ID we identified earlier):
+
+
+```{r, asthma_links}
+asthma_clinvar <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", id=600807)
+asthma_clinvar$web_histories
+```
+
+As you can see, instead of returning lists of IDs for each linked database (as
+it would by default), `entrez_link()` now returns a list of web_histories.
+
+### Use a `web_history` object
+
+Once you have those IDs stored on the NCBI's servers, you are going to want to
+do something with them. The functions `entrez_fetch()`, `entrez_summary()` and
+`entrez_link()` can all use `web_history` objects in exactly the same way they
+use IDs.
+
+So, we could repeat the last example (finding variants linked to asthma), but this
+time using the ID we uploaded earlier:
+
+```{r, asthma_links_upload}
+asthma_variants <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", web_history=upload)
+asthma_variants
+```
+
+... if we want to get some genetic information about these variants we need to
+map our clinvar IDs to SNP IDs:
+
+
+```{r, links}
+snp_links <- entrez_link(dbfrom="clinvar", db="snp",
+ web_history=asthma_variants$web_histories$omim_clinvar,
+ cmd="neighbor_history")
+snp_summ <- entrez_summary(db="snp", web_history=snp_links$web_histories$clinvar_snp)
+knitr::kable(extract_from_esummary(snp_summ, c("chr", "fxn_class", "global_maf")))
+```
+
+If you really wanted to, you could also use `web_history` objects to download all those thousands of COI sequences.
+When downloading large sets of data, it is a good idea to take advantage of the
+arguments `retmax` and `retstart` to split the request up into smaller chunks.
+For instance, we could get the first 200 sequences in 50-sequence chunks:
+
+(note: this code block is not executed as part of the vignette to save time and bandwidth):
+
+
+```r
+# retstart is zero-based in EUtils, so the first chunk starts at 0
+for( seq_start in seq(0,150,50)){
+    recs <- entrez_fetch(db="nuccore", web_history=snail_coi$web_history,
+                         rettype="fasta", retmax=50, retstart=seq_start)
+    cat(recs, file="snail_coi.fasta", append=TRUE)
+    cat(seq_start+50, "sequences downloaded\r")
+}
+```
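+
+The chunk offsets can also be computed from the total number of records
+rather than hard-coded. The block below is a minimal sketch, assuming a
+hypothetical helper `chunk_starts()` (not part of `rentrez`) and relying on the
+EUtils convention that `retstart` counts from zero:
+
+```r
+# Hypothetical helper: zero-based start offsets for fetching `total` records
+# in chunks of `chunk_size`.
+chunk_starts <- function(total, chunk_size) {
+    seq(0, total - 1, by = chunk_size)
+}
+
+starts <- chunk_starts(200, 50)
+starts
+```
+
+Each value in `starts` could then be passed as `retstart`, with `retmax` set to
+the chunk size. The total might come from the `count` element of a search result.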
+
+## What next?
+
+This tutorial has introduced you to the core functions of `rentrez`; there are
+almost limitless ways that you could put them together. [Check out the wiki](https://github.com/ropensci/rentrez/wiki)
+for more specific examples, and be sure to read the inline documentation for
+each function. If you run into a problem with `rentrez`, or just need help with the
+package and `Eutils`, please contact us by opening an issue at the [github
+repository](https://github.com/ropensci/rentrez/issues).
+
+
+
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-cran-rentrez.git