[med-svn] [r-cran-urltools] 03/09: New upstream version 1.6.0
Andreas Tille
tille at debian.org
Thu Nov 30 15:01:57 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository r-cran-urltools.
commit 7512ab10b81fda7b1a643c6c95e09db78233a9ee
Author: Andreas Tille <tille at debian.org>
Date: Thu Nov 30 15:48:00 2017 +0100
New upstream version 1.6.0
---
DESCRIPTION | 30 +++
LICENSE | 2 +
MD5 | 65 +++++++
NAMESPACE | 34 ++++
NEWS | 175 ++++++++++++++++++
R/RcppExports.R | 262 ++++++++++++++++++++++++++
R/accessors.R | 202 ++++++++++++++++++++
R/suffix.R | 265 +++++++++++++++++++++++++++
R/urltools.R | 21 +++
R/zzz.R | 22 +++
README.md | 39 ++++
build/vignette.rds | Bin 0 -> 195 bytes
data/suffix_dataset.rda | Bin 0 -> 36976 bytes
data/tld_dataset.rda | Bin 0 -> 6099 bytes
debian/README.test | 8 -
debian/changelog | 5 -
debian/compat | 1 -
debian/control | 29 ---
debian/copyright | 47 -----
debian/docs | 3 -
debian/rules | 5 -
debian/source/format | 1 -
debian/tests/control | 5 -
debian/tests/run-unit-test | 17 --
debian/watch | 2 -
inst/doc/urltools.R | 86 +++++++++
inst/doc/urltools.Rmd | 182 +++++++++++++++++++
inst/doc/urltools.html | 384 +++++++++++++++++++++++++++++++++++++++
man/domain.Rd | 34 ++++
man/encoder.Rd | 66 +++++++
man/fragment.Rd | 34 ++++
man/host_extract.Rd | 32 ++++
man/param_get.Rd | 38 ++++
man/param_remove.Rd | 34 ++++
man/param_set.Rd | 42 +++++
man/parameters.Rd | 35 ++++
man/path.Rd | 34 ++++
man/port.Rd | 34 ++++
man/puny.Rd | 37 ++++
man/scheme.Rd | 37 ++++
man/suffix_dataset.Rd | 28 +++
man/suffix_extract.Rd | 53 ++++++
man/suffix_refresh.Rd | 34 ++++
man/tld_dataset.Rd | 23 +++
man/tld_extract.Rd | 40 ++++
man/tld_refresh.Rd | 34 ++++
man/url_compose.Rd | 30 +++
man/url_parse.Rd | 37 ++++
man/urltools.Rd | 17 ++
src/Makevars | 1 +
src/RcppExports.cpp | 182 +++++++++++++++++++
src/accessors.cpp | 37 ++++
src/compose.cpp | 68 +++++++
src/compose.h | 58 ++++++
src/encoding.cpp | 92 ++++++++++
src/encoding.h | 67 +++++++
src/param.cpp | 112 ++++++++++++
src/parameter.cpp | 175 ++++++++++++++++++
src/parameter.h | 93 ++++++++++
src/parsing.cpp | 238 ++++++++++++++++++++++++
src/parsing.h | 141 ++++++++++++++
src/puny.cpp | 226 +++++++++++++++++++++++
src/punycode.c | 289 +++++++++++++++++++++++++++++
src/punycode.h | 108 +++++++++++
src/suffix.cpp | 145 +++++++++++++++
src/urltools.cpp | 184 +++++++++++++++++++
src/utf8.c | 172 ++++++++++++++++++
src/utf8.h | 17 ++
tests/testthat.R | 4 +
tests/testthat/test_encoding.R | 26 +++
tests/testthat/test_get_set.R | 59 ++++++
tests/testthat/test_memory.R | 30 +++
tests/testthat/test_parameters.R | 61 +++++++
tests/testthat/test_parsing.R | 68 +++++++
tests/testthat/test_puny.R | 47 +++++
tests/testthat/test_suffixes.R | 108 +++++++++++
vignettes/urltools.Rmd | 182 +++++++++++++++++++
77 files changed, 5512 insertions(+), 123 deletions(-)
diff --git a/DESCRIPTION b/DESCRIPTION
new file mode 100644
index 0000000..9dc7274
--- /dev/null
+++ b/DESCRIPTION
@@ -0,0 +1,30 @@
+Package: urltools
+Type: Package
+Title: Vectorised Tools for URL Handling and Parsing
+Version: 1.6.0
+Date: 2016-10-12
+Author: Oliver Keyes [aut, cre], Jay Jacobs [aut, cre], Drew Schmidt [aut],
+ Mark Greenaway [ctb], Bob Rudis [ctb], Alex Pinto [ctb], Maryam Khezrzadeh [ctb],
+ Adam M. Costello [cph], Jeff Bezanson [cph]
+Maintainer: Oliver Keyes <ironholds at gmail.com>
+Description: A toolkit for all URL-handling needs, including encoding and decoding,
+ parsing, parameter extraction and modification. All functions are
+ designed to be both fast and entirely vectorised. It is intended to be
+ useful for people dealing with web-related datasets, such as server-side
+ logs, although it may be useful for other situations involving large sets of
+ URLs.
+License: MIT + file LICENSE
+LazyData: TRUE
+LinkingTo: Rcpp
+Imports: Rcpp, methods, triebeard
+Suggests: testthat, knitr
+URL: https://github.com/Ironholds/urltools/
+BugReports: https://github.com/Ironholds/urltools/issues
+VignetteBuilder: knitr
+RoxygenNote: 5.0.1
+Encoding: UTF-8
+Depends: R (>= 2.10)
+NeedsCompilation: yes
+Packaged: 2016-10-16 13:19:23 UTC; ironholds
+Repository: CRAN
+Date/Publication: 2016-10-17 00:43:16
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..ebbb227
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,2 @@
+YEAR: 2014
+COPYRIGHT HOLDER: Oliver Keyes
\ No newline at end of file
diff --git a/MD5 b/MD5
new file mode 100644
index 0000000..eca9994
--- /dev/null
+++ b/MD5
@@ -0,0 +1,65 @@
+2232d0cadef5f6970d6fc56f14d2f545 *DESCRIPTION
+1d9678dbfe1732b5d2c521e07b2ceef0 *LICENSE
+e2f6b30a8006b3ca050d5175156c7fe3 *NAMESPACE
+0c2e9fab14fd100d3932e793c8397a35 *NEWS
+9cf0fa24c4282284d492c92f9e58ce07 *R/RcppExports.R
+cf7a242daa691c3e888f3fadf29eab8c *R/accessors.R
+f323200797b8d7d82c02958355d79cbd *R/suffix.R
+93c2f49af67ce6e11579f17898d19f21 *R/urltools.R
+d101c8875ce174214696cd65e7af61fe *R/zzz.R
+c796e3e3545b201327e524aab77b7138 *README.md
+4e47a0883d28b7040f18b65b59778063 *build/vignette.rds
+c924aa202b18a3de5b29cb4ecfd8bb67 *data/suffix_dataset.rda
+a8544a607fdee8a4b89953c2707b4e7a *data/tld_dataset.rda
+c4794a2695511ab6ba493c38720c6d6a *inst/doc/urltools.R
+2bfbb1b33412b3b272caadf2203789f2 *inst/doc/urltools.Rmd
+821c6d655b2b870f3fb450034bcaa8d6 *inst/doc/urltools.html
+69e6f1e8788ee583ea94aa8a48bf27cb *man/domain.Rd
+2eb37077109f1eb71fed124c85899ae0 *man/encoder.Rd
+1b6c21cff37766aa6639e192779d0d33 *man/fragment.Rd
+af6ff0288e5f0f7494f7910b85285407 *man/host_extract.Rd
+522eaeabd45044c5e57e64b7682b56a4 *man/param_get.Rd
+a7f554a2e090b4047e4d3901d311abb0 *man/param_remove.Rd
+cd3f28b0e44fb741180320299f6b74ec *man/param_set.Rd
+16c239f5f857ee21e467e535c8d9013a *man/parameters.Rd
+4bce2cf14c3a5d91e9baceb29d9c7327 *man/path.Rd
+49509b1bf42e3f3a23b7f510453e50fd *man/port.Rd
+c853cf55bd9d4f3fc406d312becc1b20 *man/puny.Rd
+a47a88872f2a9183c3ec444dd1cf52db *man/scheme.Rd
+f2ab57b9dbc038074ca52644fd82a6ba *man/suffix_dataset.Rd
+439cbef564ed6852bf36580d8c4923e2 *man/suffix_extract.Rd
+a92be408e3af6d00e2fe6cc8bc0a0b48 *man/suffix_refresh.Rd
+d45d75d759cb33c7010e4e3043ff40d8 *man/tld_dataset.Rd
+476adfa8c63aefb313a01def7958b134 *man/tld_extract.Rd
+2f77fd184ba17c3f752b492121021e15 *man/tld_refresh.Rd
+1e86cc89689a9d685d9cd743e4828938 *man/url_compose.Rd
+68c467bf2f96256852842d4095464b64 *man/url_parse.Rd
+5f069622f935c9c9a0270d8226b46ccc *man/urltools.Rd
+b0dee8aa6fb1b7b7044247f95d69ca53 *src/Makevars
+55b45d775dfa8f55ed28f126b4682e5e *src/RcppExports.cpp
+aa992a9862464b4faddb7fa13b7b1cc0 *src/accessors.cpp
+58de4ead82f3ff4a8a7ea20fafe32278 *src/compose.cpp
+265f752d10ea607a576eaaf97d9be942 *src/compose.h
+7f4f2fdd72e83a364536ece419e7de37 *src/encoding.cpp
+6e97f641de365133cb986cbbc5dec856 *src/encoding.h
+5f617e95cb9b5a04a10776386eae5d2a *src/param.cpp
+1a8feb44aa41d5cd35203897ce133f83 *src/parameter.cpp
+794885585068e8c76f4b996e638170aa *src/parameter.h
+a8d804bb2bd63abedc650ff567cd35b7 *src/parsing.cpp
+0c1c58fe75c930766b853f09424e3a0c *src/parsing.h
+b5e867d5bd27fb7f1c9bc276a47edaf8 *src/puny.cpp
+3d86c99b18baecd835083c425090d9cb *src/punycode.c
+b4e4b506528208635ead995600c538ba *src/punycode.h
+50f49742d5b8da5101092aeea4622fa3 *src/suffix.cpp
+090f4c8d1751d348cbb45c35a71b5f12 *src/urltools.cpp
+85e63230a4eaeb1200891c02de40193a *src/utf8.c
+3333d69c11f25242049d1d226d599b94 *src/utf8.h
+9e9970bb4d6e50ba34bab76c8bebcfc6 *tests/testthat.R
+f60de02a5a42405ef86e58c919029e94 *tests/testthat/test_encoding.R
+a9189dfb91afb312c18b9a8142c6b266 *tests/testthat/test_get_set.R
+97a5e4be008b21d5b0c97df21f576c51 *tests/testthat/test_memory.R
+5e2ef2cea7502986e64343431e2b5fb3 *tests/testthat/test_parameters.R
+b96d41814df04f1d374e4814a30d75bd *tests/testthat/test_parsing.R
+3e624b6a700ba5fa0a8e85f24de9ba8d *tests/testthat/test_puny.R
+536a7b5df0d453e38d82f5738c5b2f8b *tests/testthat/test_suffixes.R
+2bfbb1b33412b3b272caadf2203789f2 *vignettes/urltools.Rmd
diff --git a/NAMESPACE b/NAMESPACE
new file mode 100644
index 0000000..fb21c20
--- /dev/null
+++ b/NAMESPACE
@@ -0,0 +1,34 @@
+# Generated by roxygen2: do not edit by hand
+
+export("domain<-")
+export("fragment<-")
+export("parameters<-")
+export("path<-")
+export("port<-")
+export("scheme<-")
+export(domain)
+export(fragment)
+export(host_extract)
+export(param_get)
+export(param_remove)
+export(param_set)
+export(parameters)
+export(path)
+export(port)
+export(puny_decode)
+export(puny_encode)
+export(scheme)
+export(suffix_extract)
+export(suffix_refresh)
+export(tld_extract)
+export(tld_refresh)
+export(url_compose)
+export(url_decode)
+export(url_encode)
+export(url_parameters)
+export(url_parse)
+import(methods)
+importFrom(Rcpp,sourceCpp)
+importFrom(triebeard,longest_match)
+importFrom(triebeard,trie)
+useDynLib(urltools)
diff --git a/NEWS b/NEWS
new file mode 100644
index 0000000..58627c6
--- /dev/null
+++ b/NEWS
@@ -0,0 +1,175 @@
+
+Version 1.6.0 [WIP]
+-------------------------------------------------------------------------
+
+FEATURES
+* Full punycode encoding and decoding support, thanks to Drew Schmidt.
+* param_get, param_set and param_remove are all fully capable of handling NA values.
+* component setting functions can now assign even when the previous value was NA.
+
+Version 1.5.2
+-------------------------------------------------------------------------
+
+BUGS
+* Custom suffix lists were not working properly.
+Version 1.5.1
+-------------------------------------------------------------------------
+
+BUGS
+* Fixed a bug in which punycode TLDs were excluded from TLD extraction (thanks to
+ Alex Pinto for pointing that out) #51
+* param_get now returns NAs for missing values, rather than empty strings (thanks to Josh Izzard for the report) #49
+* suffix_extract now no longer goofs if the domain+suffix combo overlaps with a valid suffix (thanks to Maryam Khezrzadeh and Alex Pinto) #50
+
+DEVELOPMENT
+* Removed the non-portable -g compiler flag in response to CRAN feedback.
+
+Version 1.5.0
+-------------------------------------------------------------------------
+FEATURES
+
+* Using tries as a data structure (see https://github.com/Ironholds/triebeard), we've increased the speed of suffix_extract() (instead of taking twenty seconds to process a million domains, it now takes one).
+* A dataset of top-level domains (TLDs) is now available as data(tld_dataset)
+* suffix_refresh() has been reinstated, and can be used with suffix_extract() to ensure suffix
+extraction is done with the most up-to-date dataset version possible.
+* tld_extract() and tld_refresh() mirror the functionality of suffix_extract() and suffix_refresh().
+BUG FIXES
+* host_extract() lets you get the host (the lowest-level subdomain, or the domain itself if no subdomain
+is present) from the `domain` fragment of a parsed URL.
+* Code from Jay Jacobs has allowed us to include a best-guess at the org name in the suffix dataset.
+* url_parameters is now deprecated, and has been marked as such.
+
+DEVELOPMENT
+* The instantiation and processing of suffix and TLD datasets on load marginally increases
+the speed of both (if you're calling suffix/TLD-related functions more than once a session)
+
+Version 1.4.0
+-------------------------------------------------------------------------
+
+BUG FIXES
+* Full NA support is now available!
+
+DEVELOPMENT
+* A substantial (20%) speed increase is now available thanks to internal
+refactoring.
+
+Version 1.3.3
+-------------------------------------------------------------------------
+
+BUG FIXES
+* url_parse no longer lower-cases URLs (case sensitivity is Important) thanks to GitHub user 17843
+
+DOCUMENTATION
+* A note on NAs (as reported by Alex Pinto) added to the vignette
+* Mention Bob Rudis's 'punycode' package.
+
+Version 1.3.2
+-------------------------------------------------------------------------
+
+BUG FIXES
+* Fixed a critical bug impacting URLs with colons in the path
+
+Version 1.3.1
+-------------------------------------------------------------------------
+
+CHANGES
+* suffix_refresh has been removed, since LazyData's parameters prevented it from functioning; thanks to
+ Alex Pinto for the initial bug report and Hadley Wickham for confirming the possible solutions.
+
+BUG FIXES
+* the parser was not properly handling ports; thanks to a report from Rich FitzJohn, this is now fixed.
+
+Version 1.3.0
+-------------------------------------------------------------------------
+
+NEW FEATURES
+* param_set() for inserting or modifying key/value pairs in URL query strings.
+* param_remove() added for stripping key/value pairs out of URL query strings.
+
+CHANGES
+* url_parameters has been renamed param_get() under the new naming scheme - url_parameters still exists, however,
+for the purpose of backwards-compatibility.
+
+BUG FIXES
+* Fixed a bug reported by Alex Pinto whereby URLs with parameters but no paths would not have their domain
+ correctly parsed.
+
+Version 1.2.1
+-------------------------------------------------------------------------
+
+CHANGES
+* Changed "tld" column to "suffix" in return of "suffix_extract" to more
+accurately reflect what it is
+* Switched to "vapply" in "suffix_extract" to give a bit of a speedup to
+an already fast function
+
+BUG FIXES
+* Fixed documentation of "suffix_extract"
+
+DEVELOPMENT
+* More internal documentation added to compiled code.
+* The suffix_dataset dataset was refreshed
+
+Version 1.2.0
+-------------------------------------------------------------------------
+NEW FEATURES
+* Jay Jacobs' "tldextract" functionality has been merged with urltools, and can be accessed
+with "suffix_extract"
+* At Nicolas Coutin's suggestion, url_compose - url_parse in reverse - has been introduced.
+
+BUG FIXES
+
+* To adhere to RfC standards, "query" functions have been renamed "parameter"
+* A bug in which fragments could not be retrieved (and were incorrectly identified as parameters)
+has been fixed. Thanks to Nicolas Coutin for reporting it and providing a reproducible example.
+
+Version 1.1.1
+-------------------------------------------------------------------------
+BUG FIXES
+
+* Parameter parsing now fixed to require a = after the parameter name, thus handling scenarios where
+the URL contains the parameter name as part of, say, the domain and the wrong value would be grabbed. Thanks
+to Jacob Barnett for the bug report and example.
+* URL encoding no longer encodes the slash between the domain and path (thanks to Peter Meissner for pointing
+this bug out).
+
+DEVELOPMENT
+*More unit tests
+
+Version 1.1.0
+-------------------------------------------------------------------------
+NEW FEATURES
+*url_parameters provides the values of specified parameters within a vector of URLs, as a data.frame
+*KeyboardInterrupts are now available for interrupting long computations.
+*url_parse now provides a data.frame, rather than a list, as output.
+
+BUG FIXES
+
+DEVELOPMENT
+*De-static the hell out of all the C++.
+*Internal refactor to store each logical stage of url decomposition as its own method
+*Internal refactor to use references, minimising memory usage; thanks to Mark Greenaway for making this work!
+*Roxygen upgrade
+
+Version 1.0.0
+-------------------------------------------------------------------------
+NEW FEATURES
+*New get/set functionality, mimicking lubridate; see the package vignette.
+
+DEVELOPMENT
+*Internal C++ documentation added and the encoders and parsers refactored.
+
+Version 0.6.0
+-------------------------------------------------------------------------
+NEW FEATURES
+*replace_parameter introduced, to augment extract_parameter (previously simply url_param). This
+allows you to take the value a parameter has associated with it, and replace it with one of your choosing.
+*extract_host allows you to grab the hostname of a site, ignoring other components.
+
+BUG FIXES
+*extract_parameter (now url_extract_param) previously failed with an obfuscated error if the requested
+parameter terminated the URL. This has now been fixed.
+
+DEVELOPMENT
+*unit tests expanded
+*Internal tweaks to improve the speed of url_decode and url_encode.
\ No newline at end of file
diff --git a/R/RcppExports.R b/R/RcppExports.R
new file mode 100644
index 0000000..915f9a3
--- /dev/null
+++ b/R/RcppExports.R
@@ -0,0 +1,262 @@
+# Generated by using Rcpp::compileAttributes() -> do not edit by hand
+# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
+
+get_component_ <- function(urls, component) {
+ .Call('urltools_get_component_', PACKAGE = 'urltools', urls, component)
+}
+
+set_component_ <- function(urls, component, new_value) {
+ .Call('urltools_set_component_', PACKAGE = 'urltools', urls, component, new_value)
+}
+
+#'@title get the values of a URL's parameters
+#'@description URLs can have parameters, taking the form of \code{name=value}, chained together
+#'with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
+#'of parameter names, will generate a data.frame consisting of the values of each parameter
+#'for each URL.
+#'
+#'@param urls a vector of URLs
+#'
+#'@param parameter_names a vector of parameter names
+#'
+#'@return a data.frame containing one column for each provided parameter name. Values that
+#'cannot be found within a particular URL are represented by an NA.
+#'
+#'@examples
+#'#A very simple example
+#'url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
+#'parameter_values <- param_get(url, c("this_parameter","hiphop"))
+#'
+#'@seealso \code{\link{url_parse}} for decomposing URLs into their constituent parts and
+#'\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
+#'
+#'@aliases param_get url_parameter
+#'@rdname param_get
+#'@export
+param_get <- function(urls, parameter_names) {
+ .Call('urltools_param_get', PACKAGE = 'urltools', urls, parameter_names)
+}
+
+#'@title Set the value associated with a parameter in a URL's query.
+#'@description URLs often have queries associated with them, particularly URLs for
+#'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
+#'allows you to modify key/value pairs within query strings, or even add new ones
+#'if they don't exist within the URL.
+#'
+#'@param urls a vector of URLs. These should be decoded (with \code{url_decode})
+#'but do not have to have been otherwise manipulated.
+#'
+#'@param key a string representing the key to modify the value of (or insert wholesale
+#'if it doesn't exist within the URL).
+#'
+#'@param value a value to associate with the key. This can be a single string,
+#'or a vector the same length as \code{urls}
+#'
+#'@return the original vector of URLs, but with modified/inserted key-value pairs. If the
+#'URL is \code{NA}, the returned value will be \code{NA}; if the key or value is \code{NA}, no insertion
+#'will be made.
+#'
+#'@examples
+#'# Set a URL parameter where there's already a key for that
+#'param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
+#'
+#'# Set a URL parameter where there isn't.
+#'param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
+#'
+#'@seealso \code{\link{param_get}} to retrieve the values associated with multiple keys in
+#'a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
+#'
+#'@export
+param_set <- function(urls, key, value) {
+ .Call('urltools_param_set', PACKAGE = 'urltools', urls, key, value)
+}
+
+#'@title Remove key-value pairs from query strings
+#'@description URLs often have queries associated with them, particularly URLs for
+#'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
+#'allows you to remove key/value pairs while leaving the rest of the URL intact.
+#'
+#'@param urls a vector of URLs. These should be decoded with \code{url_decode} but don't
+#'have to have been otherwise processed.
+#'
+#'@param keys a vector of parameter keys to remove.
+#'
+#'@return the original URLs but with the key/value pairs specified by \code{keys} removed.
+#'If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
+#'nothing will be done with it.
+#'
+#'@seealso \code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
+#'to retrieve those values.
+#'
+#'@examples
+#'# Remove multiple parameters from a URL
+#'param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
+#' keys = c("action","format"))
+#'@export
+param_remove <- function(urls, keys) {
+ .Call('urltools_param_remove', PACKAGE = 'urltools', urls, keys)
+}
+
+#'@title Encode or Decode Internationalised Domains
+#'@description \code{puny_encode} and \code{puny_decode} implement
+#'the encoding standard for internationalised (non-ASCII) domains and
+#'subdomains. You can use them to encode UTF-8 domain names, or decode
+#'encoded names (which start "xn--"), or both.
+#'
+#'@param x a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.
+#'
+#'@return a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
+#'Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
+#'decoded or encoded version) will be returned as \code{NA}.
+#'
+#'@examples
+#'# Encode a URL
+#'puny_encode("https://www.bücher.com/foo")
+#'
+#'# Decode the result, back to the original
+#'puny_decode("https://www.xn--bcher-kva.com/foo")
+#'
+#'@seealso \code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
+#'
+#'@rdname puny
+#'@export
+puny_encode <- function(x) {
+ .Call('urltools_puny_encode', PACKAGE = 'urltools', x)
+}
+
+#'@rdname puny
+#'@export
+puny_decode <- function(x) {
+ .Call('urltools_puny_decode', PACKAGE = 'urltools', x)
+}
+
+reverse_strings <- function(strings) {
+ .Call('urltools_reverse_strings', PACKAGE = 'urltools', strings)
+}
+
+finalise_suffixes <- function(full_domains, suffixes, wildcard, is_suffix) {
+ .Call('urltools_finalise_suffixes', PACKAGE = 'urltools', full_domains, suffixes, wildcard, is_suffix)
+}
+
+tld_extract_ <- function(domains) {
+ .Call('urltools_tld_extract_', PACKAGE = 'urltools', domains)
+}
+
+host_extract_ <- function(domains) {
+ .Call('urltools_host_extract_', PACKAGE = 'urltools', domains)
+}
+
+#'@title Encode or decode a URI
+#'@description encodes or decodes a URI/URL
+#'
+#'@param urls a vector of URLs to decode or encode.
+#'
+#'@details
+#'URL encoding and decoding is an essential prerequisite to proper web interaction
+#'and data analysis around things like server-side logs. The
+#'\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} mandates the percentage-encoding
+#'of non-Latin characters, including things like slashes, unless those are reserved.
+#'
+#'Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
+#'URL encoding - in theory. In practice, they have a set of substantial problems
+#'that the urltools implementation solves:
+#'
+#'\itemize{
+#' \item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
+#' This means that, when confronted with a vector of URLs that need encoding or
+#' decoding, your only option is to loop from within R. This can be incredibly
+#' computationally costly with large datasets. url_encode and url_decode are
+#' implemented in C++ and entirely vectorised, allowing for a substantial
+#' performance improvement.}
+#' \item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
+#' of making sure your URL no longer works. Because of this, the only thing
+#' you can encode in URLencode (unless you refuse to encode reserved characters)
+#' is a partial URL, lacking the initial scheme, which requires additional operations
+#' to set up and increases the complexity of encoding or decoding. url_encode
+#' detects the protocol and silently splits it off, leaving it unencoded to ensure
+#' that the resulting URL is valid.}
+#' \item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
+#' characters. Unfortunately, URLdecode's response to these characters is to convert
+#' them to NULs, which R can't handle, at which point your URLdecode call breaks.
+#' \code{url_decode} simply ignores them.}
+#'}
+#'
+#'@return a character vector containing the encoded (or decoded) versions of "urls".
+#'
+#'@seealso \code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
+#'and encoding.
+#'
+#'@examples
+#'
+#'url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_%28logo%29.jpg")
+#'url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
+#'
+#'\dontrun{
+#'#A demonstrator of the contrasting behaviours around out-of-range characters
+#'URLdecode("%gIL")
+#'url_decode("%gIL")
+#'}
+#'@rdname encoder
+#'@export
+url_decode <- function(urls) {
+ .Call('urltools_url_decode', PACKAGE = 'urltools', urls)
+}
+
+#'@rdname encoder
+#'@export
+url_encode <- function(urls) {
+ .Call('urltools_url_encode', PACKAGE = 'urltools', urls)
+}
+
+#'@title split URLs into their component parts
+#'@description \code{url_parse} takes a vector of URLs and splits each one into its component
+#'parts, as recognised by RfC 3986.
+#'
+#'@param urls a vector of URLs
+#'
+#'@details It's useful to be able to take a URL and split it out into its component parts -
+#'for the purpose of hostname extraction, for example, or analysing API calls. This functionality
+#'is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
+#'implementation is entirely in R, uses regular expressions, and is not vectorised. It's
+#'perfectly suitable for the intended purpose (decomposition in the context of automated
+#'HTTP requests from R), but not for large-scale analysis.
+#'
+#'@return a data.frame consisting of the columns scheme, domain, port, path, query
+#'and fragment. See the '\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} for
+#'definitions. If an element cannot be identified, it is represented by an empty string.
+#'
+#'@examples
+#'url_parse("https://en.wikipedia.org/wiki/Article")
+#'
+#'@seealso \code{\link{url_parameters}} for extracting values associated with particular keys in a URL's
+#'query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
+#'
+#'@export
+url_parse <- function(urls) {
+ .Call('urltools_url_parse', PACKAGE = 'urltools', urls)
+}
+
+#'@title Recompose Parsed URLs
+#'
+#'@description Sometimes you want to take a vector of URLs, parse them, perform
+#'some operations and then rebuild them. \code{url_compose} takes a data.frame produced
+#'by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
+#'as the vector initially thrown into url_parse).
+#'
+#'This is currently a `beta` feature; please do report bugs if you find them.
+#'
+#'@param parsed_urls a data.frame sourced from \code{\link{url_parse}}
+#'
+#'@seealso \code{\link{scheme}} and other accessors, which you may want to
+#'run URLs through before composing them to modify individual values.
+#'
+#'@examples
+#'#Parse a URL and compose it
+#'url <- "http://en.wikipedia.org"
+#'url_compose(url_parse(url))
+#'
+#'@export
+url_compose <- function(parsed_urls) {
+ .Call('urltools_url_compose', PACKAGE = 'urltools', parsed_urls)
+}
+
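As a quick illustration of the parameter helpers documented above (a hand-written sketch, not part of the package sources), the lines below chain param_get, param_set and param_remove on an invented MediaWiki-style URL:

    library(urltools)
    url <- "https://en.wikipedia.org/w/api.php?action=query&format=xml"
    # Pull the values of two keys out as a one-row data.frame
    param_get(url, c("action", "format"))
    # Overwrite one value, then strip another key entirely
    url <- param_set(url, key = "format", value = "json")
    url <- param_remove(url, keys = "action")
    url
    # Expected, given the behaviour documented above:
    # [1] "https://en.wikipedia.org/w/api.php?format=json"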
diff --git a/R/accessors.R b/R/accessors.R
new file mode 100644
index 0000000..b553986
--- /dev/null
+++ b/R/accessors.R
@@ -0,0 +1,202 @@
+#'@title Get or set a URL's scheme
+#'@description as in the lubridate package, individual components of a URL
+#'can be either extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases scheme
+#'@rdname scheme
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's scheme.
+#'
+#'@seealso \code{\link{domain}}, \code{\link{port}}, \code{\link{path}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org/submit.html"
+#'scheme(example_url)
+#'
+#'#Set a component
+#'scheme(example_url) <- "https"
+#'
+#'# NA out the URL
+#'scheme(example_url) <- NA_character_
+#'@import methods
+#'@export
+scheme <- function(x){
+ return(get_component_(x,0))
+}
+
+"scheme<-" <- function(x, value) standardGeneric("scheme<-")
+#'@rdname scheme
+#'@export
+setGeneric("scheme<-", useAsDefault = function(x, value){
+ return(set_component_(x, 0, value))
+})
+
+#'@title Get or set a URL's domain
+#'@description as in the lubridate package, individual components of a URL
+#'can be either extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases domain
+#'@rdname domain
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's domain.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{port}}, \code{\link{path}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org/submit.html"
+#'domain(example_url)
+#'
+#'#Set a component
+#'domain(example_url) <- "en.wikipedia.org"
+#'@export
+domain <- function(x){
+ return(get_component_(x,1))
+}
+"domain<-" <- function(x, value) standardGeneric("domain<-")
+#'@rdname domain
+#'@export
+setGeneric("domain<-", useAsDefault = function(x, value){
+ return(set_component_(x, 1, value))
+})
+
+#'@title Get or set a URL's port
+#'@description as in the lubridate package, individual components of a URL
+#'can be either extracted or set using the relevant function call - see the
+#'examples.
+#'
+#'@aliases port
+#'@rdname port
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's port.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{path}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org:80/submit.html"
+#'port(example_url)
+#'
+#'#Set a component
+#'port(example_url) <- "12"
+#'@export
+port <- function(x){
+ return(get_component_(x,2))
+}
+"port<-" <- function(x, value) standardGeneric("port<-")
+#'@rdname port
+#'@export
+setGeneric("port<-", useAsDefault = function(x, value){
+ return(set_component_(x, 2, value))
+})
+
+#'@title Get or set a URL's path
+#'@description as in the lubridate package, individual components of a URL
+#'can be either extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases path
+#'@rdname path
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's path
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org:80/submit.html"
+#'path(example_url)
+#'
+#'#Set a component
+#'path(example_url) <- "bin/windows/"
+#'@export
+path <- function(x){
+ return(get_component_(x,3))
+}
+"path<-" <- function(x, value) standardGeneric("path<-")
+#'@rdname path
+#'@export
+setGeneric("path<-", useAsDefault = function(x, value){
+ return(set_component_(x, 3, value))
+})
+
+#'@title Get or set a URL's parameters
+#'@description as in the lubridate package, individual components of a URL
+#'can be either extracted or set using the relevant function call - see the
+#'examples.
+#'
+#'@aliases parameters
+#'@rdname parameters
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's parameters.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+#'\code{\link{path}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
+#'parameters(example_url)
+#'#[1] "debug=true"
+#'
+#'#Set a component
+#'parameters(example_url) <- "debug=false"
+#'@export
+parameters <- function(x){
+ return(get_component_(x,4))
+}
+"parameters<-" <- function(x, value) standardGeneric("parameters<-")
+#'@rdname parameters
+#'@export
+setGeneric("parameters<-", useAsDefault = function(x, value){
+ return(set_component_(x, 4, value))
+})
+
+#'@title Get or set a URL's fragment
+#'@description as in the lubridate package, individual components of a URL
+#'can be either extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases fragment
+#'@rdname fragment
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's fragment.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+#'\code{\link{path}} and \code{\link{parameters}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
+#'fragment(example_url)
+#'
+#'#Set a component
+#'fragment(example_url) <- "production"
+#'@export
+#'@rdname fragment
+#'@export
+fragment <- function(x){
+ return(get_component_(x,5))
+}
+
+"fragment<-" <- function(x, value) standardGeneric("fragment<-")
+#'@rdname fragment
+#'@export
+setGeneric("fragment<-", useAsDefault = function(x, value){
+ return(set_component_(x, 5, value))
+})
\ No newline at end of file
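For reference (a minimal sketch, not part of the package sources), the lubridate-style accessors above can be chained on the same URL; the example reuses the URL from the fragment documentation:

    library(urltools)
    url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
    scheme(url)                  # "http"
    scheme(url)   <- "https"
    path(url)     <- "wiki/Main_Page"
    fragment(url) <- "history"
    url
    # Expected, given the setter behaviour documented above:
    # [1] "https://en.wikipedia.org/wiki/Main_Page?debug=true#history"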
diff --git a/R/suffix.R b/R/suffix.R
new file mode 100644
index 0000000..55c388c
--- /dev/null
+++ b/R/suffix.R
@@ -0,0 +1,265 @@
+#' @title Dataset of public suffixes
+#' @description This dataset contains a registry of public suffixes, as retrieved from
+#' and defined by the \href{https://publicsuffix.org/}{public suffix list}. It is
+#' sorted by how many periods (".") appear in the suffix, to optimise it for
+#' \code{\link{suffix_extract}}. It is a data.frame with two columns, the first is
+#' the list of suffixes and the second is our best guess at the comment or owner
+#' associated with the particular suffix.
+#'
+#' @docType data
+#' @keywords datasets
+#' @name suffix_dataset
+#'
+#' @seealso \code{\link{suffix_extract}} for extracting suffixes from domain names,
+#' and \code{\link{suffix_refresh}} for getting a new, totally-up-to-date dataset
+#' version.
+#'
+#' @usage data(suffix_dataset)
+#' @note Last updated 2016-07-31.
+#' @format A data.frame of 8030 rows and 2 columns
+"suffix_dataset"
+
+#'@title Retrieve a public suffix dataset
+#'
+#'@description \code{urltools} comes with an inbuilt
+#'dataset of public suffixes, \code{\link{suffix_dataset}}.
+#'This is used in \code{\link{suffix_extract}} to identify the top-level domain
+#'within a particular domain name.
+#'
+#'While updates to the dataset will be included in each new package release,
+#'there's going to be a gap between changes to the suffixes list and changes to the package.
+#'Accordingly, the package also includes \code{suffix_refresh}, which generates
+#'and returns a \emph{fresh} version of the dataset. This can then be passed through
+#'to \code{\link{suffix_extract}}.
+#'
+#'@return a dataset equivalent in format to \code{\link{suffix_dataset}}.
+#'
+#'@seealso \code{\link{suffix_extract}} to extract suffixes from domain names,
+#'or \code{\link{suffix_dataset}} for the inbuilt, default version of the data.
+#'
+#'@examples
+#'\dontrun{
+#'new_suffixes <- suffix_refresh()
+#'}
+#'
+#'@export
+suffix_refresh <- function(){
+
+ has_libcurl <- capabilities("libcurl")
+ if(length(has_libcurl) == 0 || has_libcurl == FALSE){
+ stop("libcurl support is needed for this function")
+ }
+
+ #Read in and filter
+ connection <- url("https://www.publicsuffix.org/list/effective_tld_names.dat", method = "libcurl")
+ results <- readLines(connection)
+ close(connection)
+
+ # making an assumption that sections are broken by blank lines
+ blank <- which(results == "")
+ # and gotta know where the comments are
+ comments <- grep(pattern = "^//", x=results)
+
+ # if the file doesn't end on a blank line, stick an ending on there.
+ if (blank[length(blank)] < length(results)) {
+ blank <- c(blank, length(results)+1)
+ }
+ # now break up each section into a list
+ # grab right after the blank line and right before the next blank line.
+ suffix_dataset <- do.call(rbind, lapply(seq(length(blank) - 1), function(i) {
+ # these are the lines in the current block
+ lines <- seq(blank[i] + 1, blank[i + 1] - 1)
+ # assume there is nothing in the block
+ rez <- NULL
+ # the lines of text in this block
+ suff <- results[lines]
+ # of which these are the comments
+ iscomment <- lines %in% comments
+ # and check if we have any results
+ # append the first comment at the top of the block only.
+ if(length(suff[!iscomment])) {
+ rez <- data.frame(suffixes = suff[!iscomment],
+ comments = suff[which(iscomment)[1]], stringsAsFactors = FALSE)
+ }
+ return(rez)
+ }))
+ ## this is the old way
+ #suffix_dataset <- results[!grepl(x = results, pattern = "//", fixed = TRUE) & !results == ""]
+
+ #Return the user-friendly version
+ return(suffix_dataset)
+}
+
+#' @title extract the suffix from domain names
+#' @description domain names have suffixes - common endings that people
+#' can or could register domains under. This includes things like ".org", but
+#' also things like ".edu.co". A simple Top Level Domain list, as a
+#' result, probably won't cut it.
+#'
+#' \code{\link{suffix_extract}} takes the list of public suffixes,
+#' as maintained by Mozilla (see \code{\link{suffix_dataset}}) and
+#' a vector of domain names, and produces a data.frame containing the
+#' suffix that each domain uses, and the remaining fragment.
+#'
+#' @param domains a vector of domains, from \code{\link{domain}}
+#' or \code{\link{url_parse}}. Alternately, full URLs can be provided
+#' and will then be run through \code{\link{domain}} internally.
+#'
+#' @param suffixes a dataset of suffixes. By default, this is NULL and the function
+#' relies on \code{\link{suffix_dataset}}. Optionally, if you want more updated
+#' suffix data, you can provide the result of \code{\link{suffix_refresh}} for
+#' this parameter.
+#'
+#' @return a data.frame of four columns: "host", "subdomain", "domain" and "suffix".
+#' "host" is what was passed in. "subdomain" is the subdomain of the suffix.
+#' "domain" contains the part of the domain name that came before the matched suffix.
+#' "suffix" is, well, the suffix.
+#'
+#' @seealso \code{\link{suffix_dataset}} for the dataset of suffixes.
+#'
+#' @examples
+#'
+#' # Using url_parse
+#' domain_name <- url_parse("http://en.wikipedia.org")$domain
+#' suffix_extract(domain_name)
+#'
+#' # Using domain()
+#' domain_name <- domain("http://en.wikipedia.org")
+#' suffix_extract(domain_name)
+#'
+#' #Relying on a fresh version of the suffix dataset
+#' suffix_extract(domain("http://en.wikipedia.org"), suffix_refresh())
+#'
+#' @importFrom triebeard trie longest_match
+#' @export
+suffix_extract <- function(domains, suffixes = NULL){
+ if(!is.null(suffixes)){
+ # check if suffixes is a data.frame, and stop if column not found
+ if(is.data.frame(suffixes)) {
+ if ("suffixes" %in% colnames(suffixes)) {
+ suffixes <- suffixes$suffixes
+ } else {
+ stop("Expected column named \"suffixes\" in suffixes data.frame")
+ }
+ }
+ holding <- suffix_load(suffixes)
+ } else {
+ holding <- list(suff_trie = urltools_env$suff_trie,
+ is_wildcard = urltools_env$is_wildcard,
+ cleaned_suffixes = urltools_env$cleaned_suffixes)
+ }
+
+ rev_domains <- reverse_strings(tolower(domains))
+ matched_suffixes <- triebeard::longest_match(holding$suff_trie, rev_domains)
+ has_wildcard <- matched_suffixes %in% holding$is_wildcard
+ is_suffix <- domains %in% holding$cleaned_suffixes
+ return(finalise_suffixes(domains, matched_suffixes, has_wildcard, is_suffix))
+}
+
+#' @title Dataset of top-level domains (TLDs)
+#' @description This dataset contains a registry of top-level domains, as retrieved from
+#' and defined by the \href{http://data.iana.org/TLD/tlds-alpha-by-domain.txt}{IANA}.
+#'
+#' @docType data
+#' @keywords datasets
+#' @name tld_dataset
+#'
+#' @seealso \code{\link{tld_extract}} for extracting TLDs from domain names,
+#' and \code{\link{tld_refresh}} to get an updated version of this dataset.
+#'
+#' @usage data(tld_dataset)
+#' @note Last updated 2016-07-20.
+#' @format A vector of 1275 elements.
+"tld_dataset"
+
+#'@title Retrieve a TLD dataset
+#'
+#'@description \code{urltools} comes with an inbuilt
+#'dataset of top level domains (TLDs), \code{\link{tld_dataset}}.
+#'This is used in \code{\link{tld_extract}} to identify the top-level domain
+#'within a particular domain name.
+#'
+#'While updates to the dataset will be included in each new package release,
+#'there's going to be a gap between changes to TLDs and changes to the package.
+#'Accordingly, the package also includes \code{tld_refresh}, which generates
+#'and returns a \emph{fresh} version of the dataset. This can then be passed through
+#'to \code{\link{tld_extract}}.
+#'
+#'@return a dataset equivalent in format to \code{\link{tld_dataset}}.
+#'
+#'@seealso \code{\link{tld_extract}} to extract suffixes from domain names,
+#'or \code{\link{tld_dataset}} for the inbuilt, default version of the data.
+#'
+#'@examples
+#'\dontrun{
+#'new_tlds <- tld_refresh()
+#'}
+#'
+#'@export
+tld_refresh <- function(){
+ raw_tlds <- readLines("http://data.iana.org/TLD/tlds-alpha-by-domain.txt", warn = FALSE)
+ raw_tlds <- tolower(raw_tlds[!grepl(x = raw_tlds, pattern = "#", fixed = TRUE)])
+ return(raw_tlds)
+}
+
+#'@title Extract TLDs
+#'@description \code{tld_extract} extracts the top-level domain (TLD) from
+#'a vector of domain names. This is distinct from the suffixes, extracted with
+#'\code{\link{suffix_extract}}; TLDs are \emph{top} level, while suffixes are just
+#'domains under which internet users can publicly register domains (the difference
+#'between \code{.org.uk} and \code{.uk}).
+#'
+#'@param domains a vector of domains, retrieved through \code{\link{url_parse}} or
+#'\code{\link{domain}}.
+#'
+#'@param tlds a dataset of TLDs. If NULL (the default), \code{tld_extract} relies
+#'on urltools' \code{\link{tld_dataset}}; otherwise, you can pass in the result of
+#'\code{\link{tld_refresh}}.
+#'
+#'@return a data.frame of two columns: \code{domain}, with the original domain names,
+#'and \code{tld}, the identified TLD from the domain.
+#'
+#'@examples
+#'# Using the inbuilt dataset
+#'domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
+#'tld_extract(domains)
+#'
+#'# Using a refreshed one
+#'tld_extract(domains, tld_refresh())
+#'
+#'@seealso \code{\link{suffix_extract}} for retrieving suffixes (distinct from TLDs).
+#'
+#'@export
+tld_extract <- function(domains, tlds = NULL){
+ if(is.null(tlds)){
+ tlds <- urltools::tld_dataset
+ }
+ guessed_tlds <- tld_extract_(tolower(domains))
+ guessed_tlds[!guessed_tlds %in% tlds] <- NA
+ return(data.frame(domain = domains, tld = guessed_tlds, stringsAsFactors = FALSE))
+}
+
+#'@title Extract hosts
+#'@description \code{host_extract} extracts the host from
+#'a vector of domain names. A host isn't the same as a domain - it could be
+#'the subdomain, if there are one or more subdomains. The host of \code{en.wikipedia.org}
+#'is \code{en}, while the host of \code{wikipedia.org} is \code{wikipedia}.
+#'
+#'@param domains a vector of domains, retrieved through \code{\link{url_parse}} or
+#'\code{\link{domain}}.
+#'
+#'@return a data.frame of two columns: \code{domain}, with the original domain names,
+#'and \code{host}, the identified host from the domain.
+#'
+#'@examples
+#'# With subdomains
+#'has_subdomain <- domain("https://en.wikipedia.org/wiki/Main_Page")
+#'host_extract(has_subdomain)
+#'
+#'# Without
+#'no_subdomain <- domain("https://ironholds.org/projects/r_shiny/")
+#'host_extract(no_subdomain)
+#'@export
+host_extract <- function(domains){
+ return(data.frame(domain = domains, host = host_extract_(domains), stringsAsFactors = FALSE))
+}
\ No newline at end of file
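The reversed-string trie lookup that suffix_extract relies on can be illustrated with a tiny, hand-made suffix list (a sketch using triebeard directly, not code from the package):

    library(triebeard)
    suffixes <- c("com", "org", "co.uk")
    # Reverse each string so that a reversed domain shares a prefix with its reversed ".suffix"
    rev_string <- function(x){
      vapply(strsplit(x, NULL), function(chars) paste(rev(chars), collapse = ""), character(1))
    }
    suff_trie <- trie(keys = rev_string(paste0(".", suffixes)), values = suffixes)
    longest_match(suff_trie, rev_string("en.wikipedia.org"))   # "org"
    longest_match(suff_trie, rev_string("bbc.co.uk"))          # "co.uk"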
diff --git a/R/urltools.R b/R/urltools.R
new file mode 100644
index 0000000..5b3e253
--- /dev/null
+++ b/R/urltools.R
@@ -0,0 +1,21 @@
+#' @title Tools for handling URLs
+#' @name urltools
+#' @description This package provides functions for URL encoding and decoding,
+#' parsing, and parameter extraction, designed to be both fast and
+#' entirely vectorised. It is intended to be useful for people dealing with
+#' web-related datasets, such as server-side logs.
+#'
+#' @seealso the \href{https://CRAN.R-project.org/package=urltools/vignettes/urltools.html}{package vignette}.
+#' @useDynLib urltools
+#' @importFrom Rcpp sourceCpp
+#' @docType package
+#' @aliases urltools urltools-package
+NULL
+
+#'@rdname param_get
+#'@export
+url_parameters <- function(urls, parameter_names){
+ .Deprecated("param_get",
+ old = as.character(sys.call(sys.parent()))[1L])
+ return(param_get(urls, parameter_names))
+}
\ No newline at end of file
diff --git a/R/zzz.R b/R/zzz.R
new file mode 100644
index 0000000..7a2a2af
--- /dev/null
+++ b/R/zzz.R
@@ -0,0 +1,22 @@
+urltools_env <- new.env(parent = emptyenv())
+
+suffix_load <- function(suffixes = NULL){
+ if(is.null(suffixes)){
+ suffixes <- urltools::suffix_dataset
+ }
+ cleaned_suffixes <- gsub(x = suffixes, pattern = "*.", replacement = "", fixed = TRUE)
+ is_wildcard <- cleaned_suffixes[which(grepl(x = suffixes, pattern = "*.", fixed = TRUE))]
+ suff_trie <- triebeard::trie(keys = reverse_strings(paste0(".", cleaned_suffixes)),
+ values = cleaned_suffixes)
+ return(list(suff_trie = suff_trie,
+ is_wildcard = is_wildcard,
+ cleaned_suffixes = cleaned_suffixes))
+ return(invisible())
+}
+
+.onLoad <- function(...) {
+ holding <- suffix_load()
+ assign("is_wildcard", holding$is_wildcard, envir = urltools_env)
+ assign("cleaned_suffixes", holding$cleaned_suffixes, envir = urltools_env)
+ assign("suff_trie", holding$suff_trie, envir = urltools_env)
+}
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..414c18d
--- /dev/null
+++ b/README.md
@@ -0,0 +1,39 @@
+## urltools
+A package for elegantly handling and parsing URLs from within R.
+
+__Author:__ Oliver Keyes, Jay Jacobs<br/>
+__License:__ [MIT](http://opensource.org/licenses/MIT)<br/>
+__Status:__ Stable
+
+[![Travis-CI Build Status](https://travis-ci.org/Ironholds/urltools.svg?branch=master)](https://travis-ci.org/Ironholds/urltools) ![downloads](http://cranlogs.r-pkg.org/badges/grand-total/urltools)
+
+### Description
+
+URLs in R are often treated as nothing more than part of data retrieval -
+they're used for making connections and reading data. With web analytics
+and research, however, URLs can *be* the data, and R's default handlers
+are not best suited to handle vectorised operations over large datasets.
+<code>urltools</code> is intended to solve this.
+
+It contains drop-in replacements for R's URLdecode and URLencode functions, along
+with new functionality such as a URL parser and parameter value extractor. In all
+cases, the functions are designed to be content-safe (not breaking on unexpected values)
+and fully vectorised, resulting in a dramatic speed improvement over existing implementations -
+crucial for large datasets. For more information, see the [urltools vignette](https://github.com/Ironholds/urltools/blob/master/vignettes/urltools.Rmd).
+
+Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).
+By participating in this project you agree to abide by its terms.
+
+### Installation
+
+The latest CRAN version can be obtained via:
+
+ install.packages("urltools")
+
+To get the development version:
+
+ devtools::install_github("ironholds/urltools")
+
+### Dependencies
+* R. Doy.
+* [Rcpp](https://cran.r-project.org/package=Rcpp)
\ No newline at end of file
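To make the parser concrete (a short sketch, not part of the README itself), url_parse turns a URL into a one-row data.frame and url_compose reassembles it:

    library(urltools)
    parsed <- url_parse("https://en.wikipedia.org/wiki/Article")
    str(parsed)
    # one row with columns scheme, domain, port, path, parameter and fragment
    url_compose(parsed)
    # [1] "https://en.wikipedia.org/wiki/article"   (output as shown in the vignette below)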
diff --git a/build/vignette.rds b/build/vignette.rds
new file mode 100644
index 0000000..4c8421a
Binary files /dev/null and b/build/vignette.rds differ
diff --git a/data/suffix_dataset.rda b/data/suffix_dataset.rda
new file mode 100644
index 0000000..fdb94f5
Binary files /dev/null and b/data/suffix_dataset.rda differ
diff --git a/data/tld_dataset.rda b/data/tld_dataset.rda
new file mode 100644
index 0000000..e0062d4
Binary files /dev/null and b/data/tld_dataset.rda differ
diff --git a/debian/README.test b/debian/README.test
deleted file mode 100644
index 53fb4d7..0000000
--- a/debian/README.test
+++ /dev/null
@@ -1,8 +0,0 @@
-Notes on how this package can be tested.
-────────────────────────────────────────
-
-This package can be tested by running the provided test:
-
- sh ./run-unit-test
-
-in order to confirm its integrity.
diff --git a/debian/changelog b/debian/changelog
deleted file mode 100644
index 5632c3c..0000000
--- a/debian/changelog
+++ /dev/null
@@ -1,5 +0,0 @@
-r-cran-urltools (1.6.0-1) unstable; urgency=medium
-
- * Initial release (closes: #851565)
-
- -- Andreas Tille <tille at debian.org> Mon, 16 Jan 2017 16:58:09 +0100
diff --git a/debian/compat b/debian/compat
deleted file mode 100644
index f599e28..0000000
--- a/debian/compat
+++ /dev/null
@@ -1 +0,0 @@
-10
diff --git a/debian/control b/debian/control
deleted file mode 100644
index ed88f33..0000000
--- a/debian/control
+++ /dev/null
@@ -1,29 +0,0 @@
-Source: r-cran-urltools
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Andreas Tille <tille at debian.org>
-Section: gnu-r
-Priority: optional
-Build-Depends: debhelper (>= 10),
- dh-r,
- r-base-dev,
- r-cran-rcpp,
- r-cran-triebeard
-Standards-Version: 3.9.8
-Vcs-Browser: https://anonscm.debian.org/viewvc/debian-med/trunk/packages/R/r-cran-urltools/
-Vcs-Svn: svn://anonscm.debian.org/debian-med/trunk/packages/R/r-cran-urltools/
-Homepage: https://cran.r-project.org/package=urltools
-
-Package: r-cran-urltools
-Architecture: any
-Depends: ${R:Depends},
- ${shlibs:Depends},
- ${misc:Depends}
-Recommends: ${R:Recommends}
-Suggests: ${R:Suggests}
-Description: GNU R vectorised tools for URL handling and parsing
- A toolkit for all URL-handling needs, including encoding and decoding,
- parsing, parameter extraction and modification. All functions are
- designed to be both fast and entirely vectorised. It is intended to be
- useful for people dealing with web-related datasets, such as server-side
- logs, although may be useful for other situations involving large sets of
- URLs.
diff --git a/debian/copyright b/debian/copyright
deleted file mode 100644
index 0d6ff34..0000000
--- a/debian/copyright
+++ /dev/null
@@ -1,47 +0,0 @@
-Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
-Upstream-Name: urltools
-Upstream-Contact: Oliver Keyes <ironholds at gmail.com>
-Source: https://cran.r-project.org/package=urltools
-
-Files: *
-Copyright: 2010-2016 Oliver Keyes, Jay Jacobs, Drew Schmidt, Mark Greenaway,
- Bob Rudis, Alex Pinto, Maryam Khezrzadeh, Adam M. Costello,
- Jeff Bezanson
-License: MIT
-
-Files: urltools/src/punycode.*
-Copyright: 2010-2014 Adam M. Costello
-License: punycode
- Regarding this entire document or any portion of it (including
- the pseudocode and C code), the author makes no guarantees and
- is not responsible for any damage resulting from its use. The
- author grants irrevocable permission to anyone to use, modify,
- and distribute it in any way that does not diminish the rights
- of anyone else to use, modify, and distribute it, provided that
- redistributed derivative works do not contain misleading author or
- version information. Derivative works need not be licensed under
- similar terms.
-
-Files: debian/*
-Copyright: 2017 Andreas Tille <tille at debian.org>
-License: MIT
-
-License: MIT
- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- "Software"), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
- .
- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
- .
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/debian/docs b/debian/docs
deleted file mode 100644
index 6466d39..0000000
--- a/debian/docs
+++ /dev/null
@@ -1,3 +0,0 @@
-debian/tests/run-unit-test
-debian/README.test
-tests
diff --git a/debian/rules b/debian/rules
deleted file mode 100755
index 529c38a..0000000
--- a/debian/rules
+++ /dev/null
@@ -1,5 +0,0 @@
-#!/usr/bin/make -f
-
-%:
- dh $@ --buildsystem R
-
diff --git a/debian/source/format b/debian/source/format
deleted file mode 100644
index 163aaf8..0000000
--- a/debian/source/format
+++ /dev/null
@@ -1 +0,0 @@
-3.0 (quilt)
diff --git a/debian/tests/control b/debian/tests/control
deleted file mode 100644
index d746f15..0000000
--- a/debian/tests/control
+++ /dev/null
@@ -1,5 +0,0 @@
-Tests: run-unit-test
-Depends: @, r-cran-testthat
-Restrictions: allow-stderr
-
-
diff --git a/debian/tests/run-unit-test b/debian/tests/run-unit-test
deleted file mode 100644
index b29f739..0000000
--- a/debian/tests/run-unit-test
+++ /dev/null
@@ -1,17 +0,0 @@
-#!/bin/sh -e
-
-pkgname=urltools
-debname=r-cran-urltools
-
-if [ "$ADTTMP" = "" ] ; then
- ADTTMP=`mktemp -d /tmp/${debname}-test.XXXXXX`
- trap "rm -rf $ADTTMP" 0 INT QUIT ABRT PIPE TERM
-fi
-cd $ADTTMP
-cp -a /usr/share/doc/$debname/tests/* $ADTTMP
-gunzip -r *
-for testfile in *.R; do
- echo "BEGIN TEST $testfile"
- LC_ALL=C R --no-save < $testfile
-done
-
diff --git a/debian/watch b/debian/watch
deleted file mode 100644
index 7bfc7a0..0000000
--- a/debian/watch
+++ /dev/null
@@ -1,2 +0,0 @@
-version=4
-https://cran.r-project.org/src/contrib/urltools_([-\d.]*)\.tar\.gz
diff --git a/inst/doc/urltools.R b/inst/doc/urltools.R
new file mode 100644
index 0000000..c803438
--- /dev/null
+++ b/inst/doc/urltools.R
@@ -0,0 +1,86 @@
+## ---- eval=FALSE---------------------------------------------------------
+# URLdecode("test%gIL")
+# Error in rawToChar(out) : embedded nul in string: '\0L'
+# In addition: Warning message:
+# In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+
+## ---- eval=FALSE---------------------------------------------------------
+# URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+# [1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+
+## ---- eval=FALSE---------------------------------------------------------
+# library(urltools)
+# url_decode("test%gIL")
+# [1] "test"
+# url_encode("https://en.wikipedia.org/wiki/Article")
+# [1] "https://en.wikipedia.org%2fwiki%2fArticle"
+
+## ---- eval=FALSE---------------------------------------------------------
+# > parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+# > str(parsed_address)
+# 'data.frame': 1 obs. of 6 variables:
+# $ scheme : chr "https"
+# $ domain : chr "en.wikipedia.org"
+# $ port : chr NA
+# $ path : chr "wiki/Article"
+# $ parameter: chr NA
+# $ fragment : chr NA
+
+## ---- eval=FALSE---------------------------------------------------------
+# > url_compose(parsed_address)
+# [1] "https://en.wikipedia.org/wiki/article"
+
+## ---- eval=FALSE---------------------------------------------------------
+# url <- "https://en.wikipedia.org/wiki/Article"
+# scheme(url)
+# "https"
+# scheme(url) <- "ftp"
+# url
+# "ftp://en.wikipedia.org/wiki/Article"
+
+## ---- eval=FALSE---------------------------------------------------------
+# > url <- "https://en.wikipedia.org/wiki/Article"
+# > domain_name <- domain(url)
+# > domain_name
+# [1] "en.wikipedia.org"
+# > str(suffix_extract(domain_name))
+# 'data.frame': 1 obs. of 4 variables:
+# $ host : chr "en.wikipedia.org"
+# $ subdomain: chr "en"
+# $ domain : chr "wikipedia"
+# $ suffix : chr "org"
+
+## ---- eval=FALSE---------------------------------------------------------
+# domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+# updated_suffixes <- suffix_refresh()
+# suffix_extract(domain_name, updated_suffixes)
+
+## ---- eval=FALSE---------------------------------------------------------
+# domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+# host_extract(domain_name)
+
+## ---- eval=FALSE---------------------------------------------------------
+# > str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+# parameter_names = c("pageid","export")))
+# 'data.frame': 1 obs. of 2 variables:
+# $ pageid: chr "1023"
+# $ export: chr "json"
+
+## ---- eval=FALSE---------------------------------------------------------
+# url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+# url <- param_set(url, key = "pageid", value = "12")
+# url
+# # [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+
+## ---- eval=FALSE---------------------------------------------------------
+# url <- "http://en.wikipedia.org/wiki/api.php"
+# url <- param_set(url, key = "pageid", value = "12")
+# url
+# # [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+
+## ---- eval=FALSE---------------------------------------------------------
+# url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+# url <- param_remove(url, keys = c("action","export"))
+# url
+# # [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+
diff --git a/inst/doc/urltools.Rmd b/inst/doc/urltools.Rmd
new file mode 100644
index 0000000..beb3db3
--- /dev/null
+++ b/inst/doc/urltools.Rmd
@@ -0,0 +1,182 @@
+<!--
+%\VignetteEngine{knitr::knitr}
+%\VignetteIndexEntry{urltools}
+-->
+
+## Elegant URL handling with urltools
+
+URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
+to create connections to retrieve datasets. This is an essential feature for the language to have,
+but it also means that URL handlers are designed for situations where URLs *get* you to the data -
+not situations where URLs *are* the data.
+
+There is no support for encoding or decoding URLs en-masse, and no support for parsing and
+interpreting them. `urltools` provides this support!
+
+### URL encoding and decoding
+
+Base R provides two functions - <code>URLdecode</code> and <code>URLencode</code> - for taking percentage-encoded
+URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
+enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
+make them unsuitable for large datasets.
+
+Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
+<code>URLdecode</code>, for example, breaks if the decoded value is out of range:
+
+```{r, eval=FALSE}
+URLdecode("test%gIL")
+Error in rawToChar(out) : embedded nul in string: '\0L'
+In addition: Warning message:
+In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+```
+
+URLencode, on the other hand, encodes slashes on its most strict setting - without
+paying attention to where those slashes *are*: if we attempt to URLencode an entire URL, we get:
+
+```{r, eval=FALSE}
+URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+```
+That's a completely unusable URL (or ewRL, if you will).
+
+urltools replaces both functions with <code>url\_decode</code> and <code>url\_encode</code> respectively:
+```{r, eval=FALSE}
+library(urltools)
+url_decode("test%gIL")
+[1] "test"
+url_encode("https://en.wikipedia.org/wiki/Article")
+[1] "https://en.wikipedia.org%2fwiki%2fArticle"
+```
+
+As you can see, <code>url\_decode</code> simply excludes out-of-range characters from consideration, while <code>url\_encode</code> detects characters that make up part of the URL's scheme, and leaves them unencoded. Both are extremely fast; with `urltools`, you can
+decode a vector of 1,000,000 URLs in 0.9 seconds.
+
+Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and have unicode characters in them. These also take one argument, a vector of URLs, and can be found at `puny_encode` and `puny_decode` respectively.
+
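+As a quick illustration (a sketch, shown but not evaluated here), encoding an internationalised domain and then decoding the result should round-trip:
+
+```{r, eval=FALSE}
+puny_encode("https://www.bücher.com/foo")
+# should give "https://www.xn--bcher-kva.com/foo"
+puny_decode("https://www.xn--bcher-kva.com/foo")
+# should give back "https://www.bücher.com/foo"
+```
+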
+### URL parsing
+
+Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
+you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
+but not the entire thing as one string.
+
+The solution is <code>url_parse</code>, which takes a URL and breaks it out into its [RfC 3986](http://www.ietf.org/rfc/rfc3986.txt) components: scheme, domain, port, path, query string and fragment identifier. This is,
+again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
+results are provided as a data.frame, since most people use data.frames to store data.
+
+```{r, eval=FALSE}
+> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+> str(parsed_address)
+'data.frame': 1 obs. of 6 variables:
+ $ scheme : chr "https"
+ $ domain : chr "en.wikipedia.org"
+ $ port : chr NA
+ $ path : chr "wiki/Article"
+ $ parameter: chr NA
+ $ fragment : chr NA
+```
+
+We can also perform the opposite of this operation with `url_compose`:
+```{r, eval=FALSE}
+> url_compose(parsed_address)
+[1] "https://en.wikipedia.org/wiki/article"
+```
+
+### Getting/setting URL components
+With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
+and setting. Syntax is identical to that of `lubridate`, but uses URL components as function names.
+
+```{r, eval=FALSE}
+url <- "https://en.wikipedia.org/wiki/Article"
+scheme(url)
+"https"
+scheme(url) <- "ftp"
+url
+"ftp://en.wikipedia.org/wiki/Article"
+```
+Fields that can be extracted or set are <code>scheme</code>, <code>domain</code>, <code>port</code>, <code>path</code>,
+<code>parameters</code> and <code>fragment</code>.
+
+### Suffix and TLD extraction
+
+Once we've extracted a domain from a URL with `domain` or `url_parse`, we can identify which bit is the domain name, and which
+bit is the suffix:
+
+```{r, eval=FALSE}
+> url <- "https://en.wikipedia.org/wiki/Article"
+> domain_name <- domain(url)
+> domain_name
+[1] "en.wikipedia.org"
+> str(suffix_extract(domain_name))
+'data.frame': 1 obs. of 4 variables:
+ $ host : chr "en.wikipedia.org"
+ $ subdomain: chr "en"
+ $ domain : chr "wikipedia"
+ $ suffix : chr "org"
+```
+
+This relies on an internal database of public suffixes, accessible at `suffix_dataset` - we recognise, though,
+that this dataset may get a bit out of date, so you can also pass the results of the `suffix_refresh` function,
+which retrieves an updated dataset, to `suffix_extract`:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_suffixes <- suffix_refresh()
+suffix_extract(domain_name, updated_suffixes)
+```
+
+We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are `tld_refresh`, `tld_extract` and `tld_dataset`.
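+
+For instance (a sketch, not evaluated here), using the bundled dataset or a freshly refreshed one:
+
+```{r, eval=FALSE}
+domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
+tld_extract(domains)
+# a data.frame with columns `domain` and `tld`; here the TLD is "org"
+tld_extract(domains, tld_refresh())
+```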
+
+In the other direction we have `host_extract`, which retrieves, well, the host! If the URL has subdomains, it'll be the
+lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+host_extract(domain_name)
+```
+### Query manipulation
+Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
+an example, take the URL `http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json`. What
+pageID is being used? What is the export format? We can find out with `param_get`.
+
+```{r, eval=FALSE}
+> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+ parameter_names = c("pageid","export")))
+'data.frame': 1 obs. of 2 variables:
+ $ pageid: chr "1023"
+ $ export: chr "json"
+```
+
+This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
+might have, or strip them out entirely.
+
+To modify the values, we use `param_set`:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+```
+
+As you can see, this works pretty well; it even works in situations where the URL doesn't *have* a query yet:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+```
+
+On the other hand we might have a parameter we just don't want any more - that can be handled with `param_remove`, which can
+take multiple parameters as well as multiple URLs:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_remove(url, keys = c("action","export"))
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+```
+
+### Other URL handlers
+If you have ideas for other URL handlers that would make your data processing easier, the best approach
+is to either [request it](https://github.com/Ironholds/urltools/issues) or [add it](https://github.com/Ironholds/urltools/pulls)!
diff --git a/inst/doc/urltools.html b/inst/doc/urltools.html
new file mode 100644
index 0000000..9a50878
--- /dev/null
+++ b/inst/doc/urltools.html
@@ -0,0 +1,384 @@
+<!DOCTYPE html>
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
+
+<title>URL encoding and decoding</title>
+
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
+
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
+
+ pre .literal {
+ color: #990073
+ }
+
+ pre .number {
+ color: #099;
+ }
+
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
+
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
+
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
+
+ pre .string {
+ color: #d14;
+ }
+</style>
+
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
+
+
+
+<style type="text/css">
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
+
+h1 {
+ font-size:2.2em;
+}
+
+h2 {
+ font-size:1.8em;
+}
+
+h3 {
+ font-size:1.4em;
+}
+
+h4 {
+ font-size:1.0em;
+}
+
+h5 {
+ font-size:0.9em;
+}
+
+h6 {
+ font-size:0.8em;
+}
+
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
+
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
+
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
+
+code[class] {
+ background-color: #F8F8F8;
+}
+
+table, td, th {
+ border: none;
+}
+
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+
+ img {
+ max-width: 100% !important;
+ }
+
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
+
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
+</style>
+
+
+
+</head>
+
+<body>
+<!--
+%\VignetteEngine{knitr::knitr}
+%\VignetteIndexEntry{urltools}
+-->
+
+<h2>Elegant URL handling with urltools</h2>
+
+<p>URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
+to create connections to retrieve datasets. This is an essential feature for the language to have,
+but it also means that URL handlers are designed for situations where URLs <em>get</em> you to the data -
+not situations where URLs <em>are</em> the data.</p>
+
+<p>There is no support for encoding or decoding URLs en-masse, and no support for parsing and
+interpreting them. <code>urltools</code> provides this support!</p>
+
+<h3>URL encoding and decoding</h3>
+
+<p>Base R provides two functions - <code>URLdecode</code> and <code>URLencode</code> - for taking percentage-encoded
+URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
+enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
+make them unsuitable for large datasets.</p>
+
+<p>Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
+<code>URLdecode</code>, for example, breaks if the decoded value is out of range:</p>
+
+<pre><code class="r">URLdecode("test%gIL")
+Error in rawToChar(out) : embedded nul in string: '\0L'
+In addition: Warning message:
+In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+</code></pre>
+
+<p>URLencode, on the other hand, encodes slashes on its most strict setting - without
+paying attention to where those slashes <em>are</em>: if we attempt to URLencode an entire URL, we get:</p>
+
+<pre><code class="r">URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+</code></pre>
+
+<p>That's a completely unusable URL (or ewRL, if you will).</p>
+
+<p>urltools replaces both functions with <code>url_decode</code> and <code>url_encode</code> respectively:</p>
+
+<pre><code class="r">library(urltools)
+url_decode("test%gIL")
+[1] "test"
+url_encode("https://en.wikipedia.org/wiki/Article")
+[1] "https://en.wikipedia.org%2fwiki%2fArticle"
+</code></pre>
+
+<p>As you can see, <code>url_decode</code> simply excludes out-of-range characters from consideration, while <code>url_encode</code> detects characters that make up part of the URL's scheme, and leaves them unencoded. Both are extremely fast; with <code>urltools</code>, you can
+decode a vector of 1,000,000 URLs in 0.9 seconds.</p>
+
+<p>Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and have unicode characters in them. These also take one argument, a vector of URLs, and can be found at <code>puny_encode</code> and <code>puny_decode</code> respectively.</p>
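+
+<p>As a quick illustration (a sketch, shown but not evaluated here), encoding an internationalised domain and then decoding the result should round-trip:</p>
+
+<pre><code class="r">puny_encode("https://www.bücher.com/foo")
+# should give "https://www.xn--bcher-kva.com/foo"
+puny_decode("https://www.xn--bcher-kva.com/foo")
+# should give back "https://www.bücher.com/foo"
+</code></pre>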
+
+<h3>URL parsing</h3>
+
+<p>Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
+you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
+but not the entire thing as one string.</p>
+
+<p>The solution is <code>url_parse</code>, which takes a URL and breaks it out into its <a href="http://www.ietf.org/rfc/rfc3986.txt">RfC 3986</a> components: scheme, domain, port, path, query string and fragment identifier. This is,
+again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
+results are provided as a data.frame, since most people use data.frames to store data.</p>
+
+<pre><code class="r">> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+> str(parsed_address)
+'data.frame': 1 obs. of 6 variables:
+ $ scheme : chr "https"
+ $ domain : chr "en.wikipedia.org"
+ $ port : chr NA
+ $ path : chr "wiki/Article"
+ $ parameter: chr NA
+ $ fragment : chr NA
+</code></pre>
+
+<p>We can also perform the opposite of this operation with <code>url_compose</code>:</p>
+
+<pre><code class="r">> url_compose(parsed_address)
+[1] "https://en.wikipedia.org/wiki/article"
+</code></pre>
+
+<h3>Getting/setting URL components</h3>
+
+<p>With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
+and setting. Syntax is identical to that of <code>lubridate</code>, but uses URL components as function names.</p>
+
+<pre><code class="r">url <- "https://en.wikipedia.org/wiki/Article"
+scheme(url)
+"https"
+scheme(url) <- "ftp"
+url
+"ftp://en.wikipedia.org/wiki/Article"
+</code></pre>
+
+<p>Fields that can be extracted or set are <code>scheme</code>, <code>domain</code>, <code>port</code>, <code>path</code>,
+<code>parameters</code> and <code>fragment</code>.</p>
+
+<h3>Suffix and TLD extraction</h3>
+
+<p>Once we've extracted a domain from a URL with <code>domain</code> or <code>url_parse</code>, we can identify which bit is the domain name, and which
+bit is the suffix:</p>
+
+<pre><code class="r">> url <- "https://en.wikipedia.org/wiki/Article"
+> domain_name <- domain(url)
+> domain_name
+[1] "en.wikipedia.org"
+> str(suffix_extract(domain_name))
+'data.frame': 1 obs. of 4 variables:
+ $ host : chr "en.wikipedia.org"
+ $ subdomain: chr "en"
+ $ domain : chr "wikipedia"
+ $ suffix : chr "org"
+</code></pre>
+
+<p>This relies on an internal database of public suffixes, accessible at <code>suffix_dataset</code> - we recognise, though,
+that this dataset may get a bit out of date, so you can also pass the results of the <code>suffix_refresh</code> function,
+which retrieves an updated dataset, to <code>suffix_extract</code>:</p>
+
+<pre><code class="r">domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_suffixes <- suffix_refresh()
+suffix_extract(domain_name, updated_suffixes)
+</code></pre>
+
+<p>We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are <code>tld_refresh</code>, <code>tld_extract</code> and <code>tld_dataset</code>.</p>
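+
+<p>For instance (a sketch, not evaluated here), using the bundled dataset or a freshly refreshed one:</p>
+
+<pre><code class="r">domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
+tld_extract(domains)
+# a data.frame with columns `domain` and `tld`; here the TLD is "org"
+tld_extract(domains, tld_refresh())
+</code></pre>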
+
+<p>In the other direction we have <code>host_extract</code>, which retrieves, well, the host! If the URL has subdomains, it'll be the
+lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:</p>
+
+<pre><code class="r">domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+host_extract(domain_name)
+</code></pre>
+
+<h3>Query manipulation</h3>
+
+<p>Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
+an example, take the URL <code>http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json</code>. What
+pageID is being used? What is the export format? We can find out with <code>param_get</code>.</p>
+
+<pre><code class="r">> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+ parameter_names = c("pageid","export")))
+'data.frame': 1 obs. of 2 variables:
+ $ pageid: chr "1023"
+ $ export: chr "json"
+</code></pre>
+
+<p>This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
+might have, or strip them out entirely.</p>
+
+<p>To modify the values, we use <code>param_set</code>:</p>
+
+<pre><code class="r">url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+</code></pre>
+
+<p>As you can see, this works pretty well; it even works in situations where the URL doesn't <em>have</em> a query yet:</p>
+
+<pre><code class="r">url <- "http://en.wikipedia.org/wiki/api.php"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+</code></pre>
+
+<p>On the other hand we might have a parameter we just don't want any more - that can be handled with <code>param_remove</code>, which can
+take multiple parameters as well as multiple URLs:</p>
+
+<pre><code class="r">url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_remove(url, keys = c("action","export"))
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+</code></pre>
+
+<h3>Other URL handlers</h3>
+
+<p>If you have ideas for other URL handlers that would make your data processing easier, the best approach
+is to either <a href="https://github.com/Ironholds/urltools/issues">request it</a> or <a href="https://github.com/Ironholds/urltools/pulls">add it</a>!</p>
+
+</body>
+
+</html>
diff --git a/man/domain.Rd b/man/domain.Rd
new file mode 100644
index 0000000..304fc8b
--- /dev/null
+++ b/man/domain.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{domain}
+\alias{domain}
+\alias{domain<-}
+\title{Get or set a URL's domain}
+\usage{
+domain(x)
+
+domain(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's domain.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted or set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org/submit.html"
+domain(example_url)
+
+#Set a component
+domain(example_url) <- "en.wikipedia.org"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{port}}, \code{\link{path}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/encoder.Rd b/man/encoder.Rd
new file mode 100644
index 0000000..8c7c418
--- /dev/null
+++ b/man/encoder.Rd
@@ -0,0 +1,66 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{url_decode}
+\alias{url_decode}
+\alias{url_encode}
+\title{Encode or decode a URI}
+\usage{
+url_decode(urls)
+
+url_encode(urls)
+}
+\arguments{
+\item{urls}{a vector of URLs to decode or encode.}
+}
+\value{
+a character vector containing the encoded (or decoded) versions of "urls".
+}
+\description{
+encodes or decodes a URI/URL
+}
+\details{
+URL encoding and decoding is an essential prerequisite to proper web interaction
+and data analysis around things like server-side logs. The
+\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} mandates the percentage-encoding
+of non-Latin characters, including things like slashes, unless those are reserved.
+
+Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
+URL encoding - in theory. In practise, they have a set of substantial problems
+that the urltools implementation solves:
+
+\itemize{
+\item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
+ This means that, when confronted with a vector of URLs that need encoding or
+ decoding, your only option is to loop from within R. This can be incredibly
+ computationally costly with large datasets. url_encode and url_decode are
+ implemented in C++ and entirely vectorised, allowing for a substantial
+ performance improvement.}
+\item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
+ of making sure your URL no longer works. Because of this, the only thing
+ you can encode in URLencode (unless you refuse to encode reserved characters)
+ is a partial URL, lacking the initial scheme, which requires additional operations
+ to set up and increases the complexity of encoding or decoding. url_encode
+ detects the protocol and silently splits it off, leaving it unencoded to ensure
+ that the resulting URL is valid.}
+\item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
+ characters. Unfortunately, URLdecode's response to these characters is to convert
+ them to NULs, which R can't handle, at which point your URLdecode call breaks.
+ \code{url_decode} simply ignores them.}
+}
+}
+\examples{
+
+url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_\%28logo\%29.jpg")
+url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
+
+\dontrun{
+#A demonstrator of the contrasting behaviours around out-of-range characters
+URLdecode("\%gIL")
+url_decode("\%gIL")
+}
+}
+\seealso{
+\code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
+and encoding.
+}
+
diff --git a/man/fragment.Rd b/man/fragment.Rd
new file mode 100644
index 0000000..af3ec99
--- /dev/null
+++ b/man/fragment.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{fragment}
+\alias{fragment}
+\alias{fragment<-}
+\title{Get or set a URL's fragment}
+\usage{
+fragment(x)
+
+fragment(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's fragment.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted or set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
+fragment(example_url)
+
+#Set a component
+fragment(example_url) <- "production"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+\code{\link{path}} and \code{\link{parameters}} for other accessors.
+}
+
diff --git a/man/host_extract.Rd b/man/host_extract.Rd
new file mode 100644
index 0000000..16ac819
--- /dev/null
+++ b/man/host_extract.Rd
@@ -0,0 +1,32 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{host_extract}
+\alias{host_extract}
+\title{Extract hosts}
+\usage{
+host_extract(domains)
+}
+\arguments{
+\item{domains}{a vector of domains, retrieved through \code{\link{url_parse}} or
+\code{\link{domain}}.}
+}
+\value{
+a data.frame of two columns: \code{domain}, with the original domain names,
+and \code{host}, the identified host from the domain.
+}
+\description{
+\code{host_extract} extracts the host from
+a vector of domain names. A host isn't the same as a domain - it could be
+the subdomain, if there are one or more subdomains. The host of \code{en.wikipedia.org}
+is \code{en}, while the host of \code{wikipedia.org} is \code{wikipedia}.
+}
+\examples{
+# With subdomains
+has_subdomain <- domain("https://en.wikipedia.org/wiki/Main_Page")
+host_extract(has_subdomain)
+
+# Without
+no_subdomain <- domain("https://ironholds.org/projects/r_shiny/")
+host_extract(no_subdomain)
+}
+
diff --git a/man/param_get.Rd b/man/param_get.Rd
new file mode 100644
index 0000000..8e4bd6e
--- /dev/null
+++ b/man/param_get.Rd
@@ -0,0 +1,38 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R, R/urltools.R
+\name{param_get}
+\alias{param_get}
+\alias{url_parameter}
+\alias{url_parameters}
+\title{get the values of a URL's parameters}
+\usage{
+param_get(urls, parameter_names)
+
+url_parameters(urls, parameter_names)
+}
+\arguments{
+\item{urls}{a vector of URLs}
+
+\item{parameter_names}{a vector of parameter names}
+}
+\value{
+a data.frame containing one column for each provided parameter name. Values that
+cannot be found within a particular URL are represented by an NA.
+}
+\description{
+URLs can have parameters, taking the form of \code{name=value}, chained together
+with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
+of parameter names, will generate a data.frame consisting of the values of each parameter
+for each URL.
+}
+\examples{
+#A very simple example
+url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
+parameter_values <- param_get(url, c("this_parameter","hiphop"))
+
+}
+\seealso{
+\code{\link{url_parse}} for decomposing URLs into their constituent parts and
+\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
+}
+
diff --git a/man/param_remove.Rd b/man/param_remove.Rd
new file mode 100644
index 0000000..0168d6f
--- /dev/null
+++ b/man/param_remove.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{param_remove}
+\alias{param_remove}
+\title{Remove key-value pairs from query strings}
+\usage{
+param_remove(urls, keys)
+}
+\arguments{
+\item{urls}{a vector of URLs. These should be decoded with \code{url_decode} but don't
+have to have been otherwise processed.}
+
+\item{keys}{a vector of parameter keys to remove.}
+}
+\value{
+the original URLs but with the key/value pairs specified by \code{keys} removed.
+If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
+nothing will be done with it.
+}
+\description{
+URLs often have queries associated with them, particularly URLs for
+APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
+allows you to remove key/value pairs while leaving the rest of the URL intact.
+}
+\examples{
+# Remove multiple parameters from a URL
+param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
+ keys = c("action","format"))
+}
+\seealso{
+\code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
+to retrieve those values.
+}
+
diff --git a/man/param_set.Rd b/man/param_set.Rd
new file mode 100644
index 0000000..6959bba
--- /dev/null
+++ b/man/param_set.Rd
@@ -0,0 +1,42 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{param_set}
+\alias{param_set}
+\title{Set the value associated with a parameter in a URL's query.}
+\usage{
+param_set(urls, key, value)
+}
+\arguments{
+\item{urls}{a vector of URLs. These should be decoded (with \code{url_decode})
+but do not have to have been otherwise manipulated.}
+
+\item{key}{a string representing the key to modify the value of (or insert wholesale
+if it doesn't exist within the URL).}
+
+\item{value}{a value to associate with the key. This can be a single string,
+or a vector the same length as \code{urls}}
+}
+\value{
+the original vector of URLs, but with modified/inserted key-value pairs. If the
+URL is \code{NA}, the returned value will also be \code{NA}; if the key or value is \code{NA},
+no insertion will be made.
+}
+\description{
+URLs often have queries associated with them, particularly URLs for
+APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
+allows you to modify key/value pairs within query strings, or even add new ones
+if they don't exist within the URL.
+}
+\examples{
+# Set a URL parameter where there's already a key for that
+param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
+
+# Set a URL parameter where there isn't.
+param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
+
+}
+\seealso{
+\code{\link{param_get}} to retrieve the values associated with multiple keys in
+a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
+}
+
diff --git a/man/parameters.Rd b/man/parameters.Rd
new file mode 100644
index 0000000..df3d677
--- /dev/null
+++ b/man/parameters.Rd
@@ -0,0 +1,35 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{parameters}
+\alias{parameters}
+\alias{parameters<-}
+\title{Get or set a URL's parameters}
+\usage{
+parameters(x)
+
+parameters(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's parameters.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted or set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
+parameters(example_url)
+#[1] "debug=true"
+
+#Set a component
+parameters(example_url) <- "debug=false"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+\code{\link{path}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/path.Rd b/man/path.Rd
new file mode 100644
index 0000000..d5d870f
--- /dev/null
+++ b/man/path.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{path}
+\alias{path}
+\alias{path<-}
+\title{Get or set a URL's path}
+\usage{
+path(x)
+
+path(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's path}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted or set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org:80/submit.html"
+path(example_url)
+
+#Set a component
+path(example_url) <- "bin/windows/"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/port.Rd b/man/port.Rd
new file mode 100644
index 0000000..20901ce
--- /dev/null
+++ b/man/port.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{port}
+\alias{port}
+\alias{port<-}
+\title{Get or set a URL's port}
+\usage{
+port(x)
+
+port(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's port.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted or set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org:80/submit.html"
+port(example_url)
+
+#Set a component
+port(example_url) <- "12"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{path}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/puny.Rd b/man/puny.Rd
new file mode 100644
index 0000000..2ce6e14
--- /dev/null
+++ b/man/puny.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{puny_encode}
+\alias{puny_decode}
+\alias{puny_encode}
+\title{Encode or Decode Internationalised Domains}
+\usage{
+puny_encode(x)
+
+puny_decode(x)
+}
+\arguments{
+\item{x}{a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.}
+}
+\value{
+a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
+Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
+decoded or encoded version) will be returned as \code{NA}.
+}
+\description{
+\code{puny_encode} and \code{puny_decode} implement
+the encoding standard for internationalised (non-ASCII) domains and
+subdomains. You can use them to encode UTF-8 domain names, or decode
+encoded names (which start "xn--"), or both.
+}
+\examples{
+# Encode a URL
+puny_encode("https://www.bücher.com/foo")
+
+# Decode the result, back to the original
+puny_decode("https://www.xn--bcher-kva.com/foo")
+
+}
+\seealso{
+\code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
+}
+
diff --git a/man/scheme.Rd b/man/scheme.Rd
new file mode 100644
index 0000000..4bd9858
--- /dev/null
+++ b/man/scheme.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{scheme}
+\alias{scheme}
+\alias{scheme<-}
+\title{Get or set a URL's scheme}
+\usage{
+scheme(x)
+
+scheme(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's scheme.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted or set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org/submit.html"
+scheme(example_url)
+
+#Set a component
+scheme(example_url) <- "https"
+
+# NA out the URL
+scheme(example_url) <- NA_character_
+}
+\seealso{
+\code{\link{domain}}, \code{\link{port}}, \code{\link{path}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/suffix_dataset.Rd b/man/suffix_dataset.Rd
new file mode 100644
index 0000000..61d36c2
--- /dev/null
+++ b/man/suffix_dataset.Rd
@@ -0,0 +1,28 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\docType{data}
+\name{suffix_dataset}
+\alias{suffix_dataset}
+\title{Dataset of public suffixes}
+\format{A data.frame of 8030 rows and 2 columns}
+\usage{
+data(suffix_dataset)
+}
+\description{
+This dataset contains a registry of public suffixes, as retrieved from
+and defined by the \href{https://publicsuffix.org/}{public suffix list}. It is
+sorted by how many periods (".") appear in the suffix, to optimise it for
+\code{\link{suffix_extract}}. It is a data.frame with two columns, the first is
+the list of suffixes and the second is our best guess at the comment or owner
+associated with the particular suffix.
+}
+\note{
+Last updated 2016-07-31.
+}
+\seealso{
+\code{\link{suffix_extract}} for extracting suffixes from domain names,
+and \code{\link{suffix_refresh}} for getting a new, totally-up-to-date dataset
+version.
+}
+\keyword{datasets}
+
diff --git a/man/suffix_extract.Rd b/man/suffix_extract.Rd
new file mode 100644
index 0000000..95c9a10
--- /dev/null
+++ b/man/suffix_extract.Rd
@@ -0,0 +1,53 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{suffix_extract}
+\alias{suffix_extract}
+\title{extract the suffix from domain names}
+\usage{
+suffix_extract(domains, suffixes = NULL)
+}
+\arguments{
+\item{domains}{a vector of domains, from \code{\link{domain}}
+or \code{\link{url_parse}}. Alternately, full URLs can be provided
+and will then be run through \code{\link{domain}} internally.}
+
+\item{suffixes}{a dataset of suffixes. By default, this is NULL and the function
+relies on \code{\link{suffix_dataset}}. Optionally, if you want more updated
+suffix data, you can provide the result of \code{\link{suffix_refresh}} for
+this parameter.}
+}
+\value{
+a data.frame of four columns: "host", "subdomain", "domain" and "suffix".
+"host" is what was passed in. "subdomain" is the subdomain of the suffix.
+"domain" contains the part of the domain name that came before the matched suffix.
+"suffix" is, well, the suffix.
+}
+\description{
+domain names have suffixes - common endings that people
+can or could register domains under. This includes things like ".org", but
+also things like ".edu.co". A simple Top Level Domain list, as a
+result, probably won't cut it.
+
+\code{\link{suffix_extract}} takes the list of public suffixes,
+as maintained by Mozilla (see \code{\link{suffix_dataset}}) and
+a vector of domain names, and produces a data.frame containing the
+suffix that each domain uses, and the remaining fragment.
+}
+\examples{
+
+# Using url_parse
+domain_name <- url_parse("http://en.wikipedia.org")$domain
+suffix_extract(domain_name)
+
+# Using domain()
+domain_name <- domain("http://en.wikipedia.org")
+suffix_extract(domain_name)
+
+#Relying on a fresh version of the suffix dataset
+suffix_extract(domain("http://en.wikipedia.org"), suffix_refresh())
+
+}
+\seealso{
+\code{\link{suffix_dataset}} for the dataset of suffixes.
+}
+
diff --git a/man/suffix_refresh.Rd b/man/suffix_refresh.Rd
new file mode 100644
index 0000000..3c6d4d9
--- /dev/null
+++ b/man/suffix_refresh.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{suffix_refresh}
+\alias{suffix_refresh}
+\title{Retrieve a public suffix dataset}
+\usage{
+suffix_refresh()
+}
+\value{
+a dataset equivalent in format to \code{\link{suffix_dataset}}.
+}
+\description{
+\code{urltools} comes with an inbuilt
+dataset of public suffixes, \code{\link{suffix_dataset}}.
+This is used in \code{\link{suffix_extract}} to identify the top-level domain
+within a particular domain name.
+
+While updates to the dataset will be included in each new package release,
+there's going to be a gap between changes to the suffixes list and changes to the package.
+Accordingly, the package also includes \code{suffix_refresh}, which generates
+and returns a \emph{fresh} version of the dataset. This can then be passed through
+to \code{\link{suffix_extract}}.
+}
+\examples{
+\dontrun{
+new_suffixes <- suffix_refresh()
+}
+
+}
+\seealso{
+\code{\link{suffix_extract}} to extract suffixes from domain names,
+or \code{\link{suffix_dataset}} for the inbuilt, default version of the data.
+}
+
diff --git a/man/tld_dataset.Rd b/man/tld_dataset.Rd
new file mode 100644
index 0000000..20d0409
--- /dev/null
+++ b/man/tld_dataset.Rd
@@ -0,0 +1,23 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\docType{data}
+\name{tld_dataset}
+\alias{tld_dataset}
+\title{Dataset of top-level domains (TLDs)}
+\format{A vector of 1275 elements.}
+\usage{
+data(tld_dataset)
+}
+\description{
+This dataset contains a registry of top-level domains, as retrieved from
+and defined by the \href{http://data.iana.org/TLD/tlds-alpha-by-domain.txt}{IANA}.
+}
+\note{
+Last updated 2016-07-20.
+}
+\seealso{
+\code{\link{tld_extract}} for extracting TLDs from domain names,
+and \code{\link{tld_refresh}} to get an updated version of this dataset.
+}
+\keyword{datasets}
+
diff --git a/man/tld_extract.Rd b/man/tld_extract.Rd
new file mode 100644
index 0000000..593f659
--- /dev/null
+++ b/man/tld_extract.Rd
@@ -0,0 +1,40 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{tld_extract}
+\alias{tld_extract}
+\title{Extract TLDs}
+\usage{
+tld_extract(domains, tlds = NULL)
+}
+\arguments{
+\item{domains}{a vector of domains, retrieved through \code{\link{url_parse}} or
+\code{\link{domain}}.}
+
+\item{tlds}{a dataset of TLDs. If NULL (the default), \code{tld_extract} relies
+on urltools' \code{\link{tld_dataset}}; otherwise, you can pass in the result of
+\code{\link{tld_refresh}}.}
+}
+\value{
+a data.frame of two columns: \code{domain}, with the original domain names,
+and \code{tld}, the identified TLD from the domain.
+}
+\description{
+\code{tld_extract} extracts the top-level domain (TLD) from
+a vector of domain names. This is distinct from the suffixes, extracted with
+\code{\link{suffix_extract}}; TLDs are \emph{top} level, while suffixes are just
+domains through which internet users can publicly register domains (the difference
+between \code{.org.uk} and \code{.uk}).
+}
+\examples{
+# Using the inbuilt dataset
+domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
+tld_extract(domains)
+
+# Using a refreshed one
+tld_extract(domains, tld_refresh())
+
+}
+\seealso{
+\code{\link{suffix_extract}} for retrieving suffixes (distinct from TLDs).
+}
+
diff --git a/man/tld_refresh.Rd b/man/tld_refresh.Rd
new file mode 100644
index 0000000..40e3fcd
--- /dev/null
+++ b/man/tld_refresh.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{tld_refresh}
+\alias{tld_refresh}
+\title{Retrieve a TLD dataset}
+\usage{
+tld_refresh()
+}
+\value{
+a dataset equivalent in format to \code{\link{tld_dataset}}.
+}
+\description{
+\code{urltools} comes with an inbuilt
+dataset of top level domains (TLDs), \code{\link{tld_dataset}}.
+This is used in \code{\link{tld_extract}} to identify the top-level domain
+within a particular domain name.
+
+While updates to the dataset will be included in each new package release,
+there's going to be a gap between changes to TLDs and changes to the package.
+Accordingly, the package also includes \code{tld_refresh}, which generates
+and returns a \emph{fresh} version of the dataset. This can then be passed through
+to \code{\link{tld_extract}}.
+}
+\examples{
+\dontrun{
+new_tlds <- tld_refresh()
+}
+
+}
+\seealso{
+\code{\link{tld_extract}} to extract TLDs from domain names,
+or \code{\link{tld_dataset}} for the inbuilt, default version of the data.
+}
+
diff --git a/man/url_compose.Rd b/man/url_compose.Rd
new file mode 100644
index 0000000..99cb400
--- /dev/null
+++ b/man/url_compose.Rd
@@ -0,0 +1,30 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{url_compose}
+\alias{url_compose}
+\title{Recompose Parsed URLs}
+\usage{
+url_compose(parsed_urls)
+}
+\arguments{
+\item{parsed_urls}{a data.frame sourced from \code{\link{url_parse}}}
+}
+\description{
+Sometimes you want to take a vector of URLs, parse them, perform
+some operations and then rebuild them. \code{url_compose} takes a data.frame produced
+by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
+as the vector initially thrown into url_parse).
+
+This is currently a `beta` feature; please do report bugs if you find them.
+}
+\examples{
+#Parse a URL and compose it
+url <- "http://en.wikipedia.org"
+url_compose(url_parse(url))
+
+}
+\seealso{
+\code{\link{scheme}} and other accessors, which you may want to
+run URLs through before composing them to modify individual values.
+}
+
diff --git a/man/url_parse.Rd b/man/url_parse.Rd
new file mode 100644
index 0000000..9217c8e
--- /dev/null
+++ b/man/url_parse.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{url_parse}
+\alias{url_parse}
+\title{split URLs into their component parts}
+\usage{
+url_parse(urls)
+}
+\arguments{
+\item{urls}{a vector of URLs}
+}
+\value{
+a data.frame consisting of the columns scheme, domain, port, path, parameter
+and fragment. See the \href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} for
+definitions. If an element cannot be identified, it is represented by an empty string.
+}
+\description{
+\code{url_parse} takes a vector of URLs and splits each one into its component
+parts, as recognised by RfC 3986.
+}
+\details{
+It's useful to be able to take a URL and split it out into its component parts -
+for the purpose of hostname extraction, for example, or analysing API calls. This functionality
+is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
+implementation is entirely in R, uses regular expressions, and is not vectorised. It's
+perfectly suitable for the intended purpose (decomposition in the context of automated
+HTTP requests from R), but not for large-scale analysis.
+}
+\examples{
+url_parse("https://en.wikipedia.org/wiki/Article")
+
+}
+\seealso{
+\code{\link{url_parameters}} for extracting values associated with particular keys in a URL's
+query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
+}
+
diff --git a/man/urltools.Rd b/man/urltools.Rd
new file mode 100644
index 0000000..64af9d6
--- /dev/null
+++ b/man/urltools.Rd
@@ -0,0 +1,17 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/urltools.R
+\docType{package}
+\name{urltools}
+\alias{urltools}
+\alias{urltools-package}
+\title{Tools for handling URLs}
+\description{
+This package provides functions for URL encoding and decoding,
+parsing, and parameter extraction, designed to be both fast and
+entirely vectorised. It is intended to be useful for people dealing with
+web-related datasets, such as server-side logs.
+}
+\seealso{
+the \href{https://CRAN.R-project.org/package=urltools/vignettes/urltools.html}{package vignette}.
+}
+
diff --git a/src/Makevars b/src/Makevars
new file mode 100644
index 0000000..5de939a
--- /dev/null
+++ b/src/Makevars
@@ -0,0 +1 @@
+PKG_CPPFLAGS = -UNDEBUG
diff --git a/src/RcppExports.cpp b/src/RcppExports.cpp
new file mode 100644
index 0000000..4e28c55
--- /dev/null
+++ b/src/RcppExports.cpp
@@ -0,0 +1,182 @@
+// Generated by using Rcpp::compileAttributes() -> do not edit by hand
+// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
+
+#include <Rcpp.h>
+
+using namespace Rcpp;
+
+// get_component_
+CharacterVector get_component_(CharacterVector urls, int component);
+RcppExport SEXP urltools_get_component_(SEXP urlsSEXP, SEXP componentSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ Rcpp::traits::input_parameter< int >::type component(componentSEXP);
+ rcpp_result_gen = Rcpp::wrap(get_component_(urls, component));
+ return rcpp_result_gen;
+END_RCPP
+}
+// set_component_
+CharacterVector set_component_(CharacterVector urls, int component, String new_value);
+RcppExport SEXP urltools_set_component_(SEXP urlsSEXP, SEXP componentSEXP, SEXP new_valueSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ Rcpp::traits::input_parameter< int >::type component(componentSEXP);
+ Rcpp::traits::input_parameter< String >::type new_value(new_valueSEXP);
+ rcpp_result_gen = Rcpp::wrap(set_component_(urls, component, new_value));
+ return rcpp_result_gen;
+END_RCPP
+}
+// param_get
+List param_get(CharacterVector urls, CharacterVector parameter_names);
+RcppExport SEXP urltools_param_get(SEXP urlsSEXP, SEXP parameter_namesSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ Rcpp::traits::input_parameter< CharacterVector >::type parameter_names(parameter_namesSEXP);
+ rcpp_result_gen = Rcpp::wrap(param_get(urls, parameter_names));
+ return rcpp_result_gen;
+END_RCPP
+}
+// param_set
+CharacterVector param_set(CharacterVector urls, String key, CharacterVector value);
+RcppExport SEXP urltools_param_set(SEXP urlsSEXP, SEXP keySEXP, SEXP valueSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ Rcpp::traits::input_parameter< String >::type key(keySEXP);
+ Rcpp::traits::input_parameter< CharacterVector >::type value(valueSEXP);
+ rcpp_result_gen = Rcpp::wrap(param_set(urls, key, value));
+ return rcpp_result_gen;
+END_RCPP
+}
+// param_remove
+CharacterVector param_remove(CharacterVector urls, CharacterVector keys);
+RcppExport SEXP urltools_param_remove(SEXP urlsSEXP, SEXP keysSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ Rcpp::traits::input_parameter< CharacterVector >::type keys(keysSEXP);
+ rcpp_result_gen = Rcpp::wrap(param_remove(urls, keys));
+ return rcpp_result_gen;
+END_RCPP
+}
+// puny_encode
+CharacterVector puny_encode(CharacterVector x);
+RcppExport SEXP urltools_puny_encode(SEXP xSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type x(xSEXP);
+ rcpp_result_gen = Rcpp::wrap(puny_encode(x));
+ return rcpp_result_gen;
+END_RCPP
+}
+// puny_decode
+CharacterVector puny_decode(CharacterVector x);
+RcppExport SEXP urltools_puny_decode(SEXP xSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type x(xSEXP);
+ rcpp_result_gen = Rcpp::wrap(puny_decode(x));
+ return rcpp_result_gen;
+END_RCPP
+}
+// reverse_strings
+CharacterVector reverse_strings(CharacterVector strings);
+RcppExport SEXP urltools_reverse_strings(SEXP stringsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type strings(stringsSEXP);
+ rcpp_result_gen = Rcpp::wrap(reverse_strings(strings));
+ return rcpp_result_gen;
+END_RCPP
+}
+// finalise_suffixes
+DataFrame finalise_suffixes(CharacterVector full_domains, CharacterVector suffixes, LogicalVector wildcard, LogicalVector is_suffix);
+RcppExport SEXP urltools_finalise_suffixes(SEXP full_domainsSEXP, SEXP suffixesSEXP, SEXP wildcardSEXP, SEXP is_suffixSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type full_domains(full_domainsSEXP);
+ Rcpp::traits::input_parameter< CharacterVector >::type suffixes(suffixesSEXP);
+ Rcpp::traits::input_parameter< LogicalVector >::type wildcard(wildcardSEXP);
+ Rcpp::traits::input_parameter< LogicalVector >::type is_suffix(is_suffixSEXP);
+ rcpp_result_gen = Rcpp::wrap(finalise_suffixes(full_domains, suffixes, wildcard, is_suffix));
+ return rcpp_result_gen;
+END_RCPP
+}
+// tld_extract_
+CharacterVector tld_extract_(CharacterVector domains);
+RcppExport SEXP urltools_tld_extract_(SEXP domainsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type domains(domainsSEXP);
+ rcpp_result_gen = Rcpp::wrap(tld_extract_(domains));
+ return rcpp_result_gen;
+END_RCPP
+}
+// host_extract_
+CharacterVector host_extract_(CharacterVector domains);
+RcppExport SEXP urltools_host_extract_(SEXP domainsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type domains(domainsSEXP);
+ rcpp_result_gen = Rcpp::wrap(host_extract_(domains));
+ return rcpp_result_gen;
+END_RCPP
+}
+// url_decode
+CharacterVector url_decode(CharacterVector urls);
+RcppExport SEXP urltools_url_decode(SEXP urlsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ rcpp_result_gen = Rcpp::wrap(url_decode(urls));
+ return rcpp_result_gen;
+END_RCPP
+}
+// url_encode
+CharacterVector url_encode(CharacterVector urls);
+RcppExport SEXP urltools_url_encode(SEXP urlsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ rcpp_result_gen = Rcpp::wrap(url_encode(urls));
+ return rcpp_result_gen;
+END_RCPP
+}
+// url_parse
+DataFrame url_parse(CharacterVector urls);
+RcppExport SEXP urltools_url_parse(SEXP urlsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+ rcpp_result_gen = Rcpp::wrap(url_parse(urls));
+ return rcpp_result_gen;
+END_RCPP
+}
+// url_compose
+CharacterVector url_compose(DataFrame parsed_urls);
+RcppExport SEXP urltools_url_compose(SEXP parsed_urlsSEXP) {
+BEGIN_RCPP
+ Rcpp::RObject rcpp_result_gen;
+ Rcpp::RNGScope rcpp_rngScope_gen;
+ Rcpp::traits::input_parameter< DataFrame >::type parsed_urls(parsed_urlsSEXP);
+ rcpp_result_gen = Rcpp::wrap(url_compose(parsed_urls));
+ return rcpp_result_gen;
+END_RCPP
+}
diff --git a/src/accessors.cpp b/src/accessors.cpp
new file mode 100644
index 0000000..f365d4c
--- /dev/null
+++ b/src/accessors.cpp
@@ -0,0 +1,37 @@
+#include <Rcpp.h>
+#include "parsing.h"
+using namespace Rcpp;
+
+//[[Rcpp::export]]
+CharacterVector get_component_(CharacterVector urls, int component){
+ parsing p_inst;
+ unsigned int input_size = urls.size();
+ CharacterVector output(input_size);
+ for (unsigned int i = 0; i < input_size; ++i){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(urls[i] != NA_STRING){
+ output[i] = p_inst.get_component(Rcpp::as<std::string>(urls[i]), component);
+ } else {
+ output[i] = NA_STRING;
+ }
+ }
+ return output;
+}
+
+//[[Rcpp::export]]
+CharacterVector set_component_(CharacterVector urls, int component,
+ String new_value){
+ parsing p_inst;
+ unsigned int input_size = urls.size();
+ CharacterVector output(input_size);
+ for (unsigned int i = 0; i < input_size; ++i){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+
+ output[i] = p_inst.set_component(Rcpp::as<std::string>(urls[i]), component, new_value);
+ }
+ return output;
+}
diff --git a/src/compose.cpp b/src/compose.cpp
new file mode 100644
index 0000000..c8983be
--- /dev/null
+++ b/src/compose.cpp
@@ -0,0 +1,68 @@
+#include "compose.h"
+
+bool compose::emptycheck(String element){
+ if(element == NA_STRING){
+ return false;
+ }
+ return true;
+}
+
+std::string compose::compose_single(String scheme, String domain, String port, String path,
+ String parameter, String fragment){
+
+ std::string output;
+
+ if(emptycheck(scheme)){
+ output += scheme;
+ output += "://";
+ }
+
+ if(emptycheck(domain)){
+ output += domain;
+ }
+
+ if(emptycheck(port)){
+ output += ":";
+ output += port;
+ }
+
+ if(emptycheck(path)){
+ output += "/";
+ output += path;
+ }
+
+ if(emptycheck(parameter)){
+ output += "?";
+ output += parameter;
+ }
+
+ if(emptycheck(fragment)){
+ output += "#";
+ output += fragment;
+ }
+
+ return output;
+}
+
+CharacterVector compose::compose_multiple(DataFrame parsed_urls){
+
+ CharacterVector schemes = parsed_urls["scheme"];
+ CharacterVector domains = parsed_urls["domain"];
+ CharacterVector ports = parsed_urls["port"];
+ CharacterVector paths = parsed_urls["path"];
+ CharacterVector parameters = parsed_urls["parameter"];
+ CharacterVector fragments = parsed_urls["fragment"];
+
+ unsigned int input_size = schemes.size();
+ CharacterVector output(input_size);
+
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ output[i] = compose_single(schemes[i], domains[i], ports[i], paths[i], parameters[i],
+ fragments[i]);
+ }
+
+ return output;
+}
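
url_compose() (documented later in this diff, in src/urltools.cpp) hands a parsed data.frame to compose_multiple(), which reads the six named columns shown above. A minimal sketch, assuming the package is installed:

library(urltools)
parts <- data.frame(scheme = "https", domain = "example.com", port = "8080",
                    path = "foo/bar", parameter = "q=1", fragment = "top",
                    stringsAsFactors = FALSE)
url_compose(parts)
# should give roughly "https://example.com:8080/foo/bar?q=1#top"
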
diff --git a/src/compose.h b/src/compose.h
new file mode 100644
index 0000000..5fbd194
--- /dev/null
+++ b/src/compose.h
@@ -0,0 +1,58 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+#ifndef __COMPOSE_INCLUDED__
+#define __COMPOSE_INCLUDED__
+
+/**
+ * A class for recomposing parsed URLs
+ */
+class compose {
+
+private:
+
+ /**
+ * A function for briefly checking if a component is empty before doing anything
+ * with it
+ *
+   * @param element an Rcpp String to check
+ *
+   * @return true if the string is not NA, false if it is.
+ */
+ bool emptycheck(String element);
+
+ /**
+ * A function for recomposing a single URL
+ *
+ * @param scheme the scheme of the URL
+ *
+ * @param domain the domain of the URL
+ *
+ * @param port the port of the URL
+ *
+ * @param path the path of the URL
+ *
+ * @param parameter the parameter of the URL
+ *
+ * @param fragment the fragment of the URL
+ *
+ * @return an Rcpp String containing the recomposed URL
+ *
+ * @seealso compose_multiple for the vectorised version
+ */
+ std::string compose_single(String scheme, String domain, String port, String path,
+ String parameter, String fragment);
+
+public:
+
+ /**
+ * A function for recomposing a vector of URLs
+ *
+ * @param parsed_urls a DataFrame provided by url_parse
+ *
+ * @return a CharacterVector containing the recomposed URLs
+ */
+ CharacterVector compose_multiple(DataFrame parsed_urls);
+};
+
+#endif
diff --git a/src/encoding.cpp b/src/encoding.cpp
new file mode 100644
index 0000000..d4f3540
--- /dev/null
+++ b/src/encoding.cpp
@@ -0,0 +1,92 @@
+#include <Rcpp.h>
+#include "encoding.h"
+using namespace Rcpp;
+
+char encoding::from_hex (char x){
+ if(x <= '9' && x >= '0'){
+ x -= '0';
+ } else if(x <= 'f' && x >= 'a'){
+ x -= ('a' - 10);
+ } else if(x <= 'F' && x >= 'A'){
+ x -= ('A' - 10);
+ } else {
+ x = 0;
+ }
+ return x;
+}
+
+std::string encoding::to_hex(char x){
+
+ //Holding objects and output
+ char digit_1 = (x&0xF0)>>4;
+ char digit_2 = (x&0x0F);
+ std::string output;
+
+ //Convert
+ if(0 <= digit_1 && digit_1 <= 9){
+ digit_1 += 48;
+ } else if(10 <= digit_1 && digit_1 <=15){
+ digit_1 += 97-10;
+ }
+ if(0 <= digit_2 && digit_2 <= 9){
+ digit_2 += 48;
+ } else if(10 <= digit_2 && digit_2 <= 15){
+ digit_2 += 97-10;
+ }
+
+ output.append(&digit_1, 1);
+ output.append(&digit_2, 1);
+ return output;
+}
+
+std::string encoding::internal_url_decode(std::string url){
+
+ //Create output object
+ std::string result;
+
+ //For each character...
+ for (std::string::size_type i = 0; i < url.size(); ++i){
+
+ //If it's a +, space
+ if (url[i] == '+'){
+ result += ' ';
+ } else if (url[i] == '%' && url.size() > i+2){
+
+      //Escaped? Convert from hex and include it
+ char holding_1 = encoding::from_hex(url[i+1]);
+ char holding_2 = encoding::from_hex(url[i+2]);
+ char holding = (holding_1 << 4) | holding_2;
+ result += holding;
+ i += 2;
+
+ } else { //Permitted? Include.
+ result += url[i];
+ }
+ }
+
+ //Return
+ return result;
+}
+
+std::string encoding::internal_url_encode(std::string url){
+
+ //Note the unreserved characters, create an output string
+ std::string unreserved_characters = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._~-";
+ std::string output = "";
+
+ //For each character..
+ for(int i=0; i < (signed) url.length(); i++){
+
+    //If it's in the list of unreserved ones, just pass it through
+ if (unreserved_characters.find_first_of(url[i]) != std::string::npos){
+ output.append(&url[i], 1);
+ //Otherwise, append in an encoded form.
+ } else {
+ output.append("%");
+ output.append(to_hex(url[i]));
+ }
+ }
+
+ //Return
+ return output;
+}
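
A small worked example of the hex round trip implemented above, assuming the package is installed; url_encode()/url_decode() (exported from src/urltools.cpp later in this diff) call internal_url_encode()/internal_url_decode() on each element:

library(urltools)
url_encode("q=R (programming)")
# roughly "q%3dR%20%28programming%29": "=", " ", "(" and ")" are outside the
# unreserved set, so each becomes "%" plus its two (lower-case) hex digits
url_decode("q%3dR%20%28programming%29")
# back to "q=R (programming)"; "+" and "%XX" escapes are both reversed
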
diff --git a/src/encoding.h b/src/encoding.h
new file mode 100644
index 0000000..88e2ac7
--- /dev/null
+++ b/src/encoding.h
@@ -0,0 +1,67 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+#ifndef __ENCODING_INCLUDED__
+#define __ENCODING_INCLUDED__
+
+/**
+ * A class for applying percent-encoding to
+ * arbitrary strings - optimised for URLs, obviously.
+ */
+class encoding{
+
+ private:
+
+ /**
+ * A function for taking a hexadecimal element and converting
+ * it to the equivalent non-hex value. Used in internal_url_decode
+ *
+   * @param x a char representing the hexed value.
+ *
+ * @see to_hex for the reverse operation.
+ *
+   * @return a char containing the un-hexed value of x.
+ */
+ char from_hex (char x);
+
+ /**
+ * A function for taking a character value and converting
+ * it to the equivalent hexadecimal value. Used in internal_url_encode.
+ *
+ * @param x a character array representing the unhexed value.
+ *
+ * @see from_hex for the reverse operation.
+ *
+ * @return a string containing the now-hexed value of x.
+ */
+ std::string to_hex(char x);
+
+ public:
+
+ /**
+   * A function for decoding URLs. Calls from_hex, and is
+ * in turn called by url_decode in urltools.cpp.
+ *
+ * @param url a string representing a percent-encoded URL.
+ *
+ * @see internal_url_encode for the reverse operation.
+ *
+ * @return a string containing the decoded URL.
+ */
+ std::string internal_url_decode(std::string url);
+
+ /**
+   * A function for encoding URLs. Calls to_hex, and is
+ * in turn called by url_encode in urltools.cpp.
+ *
+ * @param url a string representing a URL.
+ *
+ * @see internal_url_decode for the reverse operation.
+ *
+ * @return a string containing the percent-encoded version of "url".
+ */
+ std::string internal_url_encode(std::string url);
+
+};
+
+#endif
diff --git a/src/param.cpp b/src/param.cpp
new file mode 100644
index 0000000..89a1ab1
--- /dev/null
+++ b/src/param.cpp
@@ -0,0 +1,112 @@
+#include <Rcpp.h>
+#include "parameter.h"
+using namespace Rcpp;
+
+
+//'@title Get the values of a URL's parameters
+//'@description URLs can have parameters, taking the form of \code{name=value}, chained together
+//'with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
+//'of parameter names, will generate a data.frame consisting of the values of each parameter
+//'for each URL.
+//'
+//'@param urls a vector of URLs
+//'
+//'@param parameter_names a vector of parameter names
+//'
+//'@return a data.frame containing one column for each provided parameter name. Values that
+//'cannot be found within a particular URL are represented by an NA.
+//'
+//'@examples
+//'#A very simple example
+//'url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
+//'parameter_values <- param_get(url, c("this_parameter","hiphop"))
+//'
+//'@seealso \code{\link{url_parse}} for decomposing URLs into their constituent parts and
+//'\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
+//'
+//'@aliases param_get url_parameter
+//'@rdname param_get
+//'@export
+//[[Rcpp::export]]
+List param_get(CharacterVector urls, CharacterVector parameter_names){
+ parameter p_inst;
+ List output;
+ IntegerVector rownames = Rcpp::seq(1,urls.size());
+ unsigned int column_count = parameter_names.size();
+
+ for(unsigned int i = 0; i < column_count; ++i){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ output.push_back(p_inst.get_parameter(urls, Rcpp::as<std::string>(parameter_names[i])));
+ }
+ output.attr("class") = "data.frame";
+ output.attr("names") = parameter_names;
+ output.attr("row.names") = rownames;
+ return output;
+}
+
+//'@title Set the value associated with a parameter in a URL's query.
+//'@description URLs often have queries associated with them, particularly URLs for
+//'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
+//'allows you to modify key/value pairs within query strings, or even add new ones
+//'if they don't exist within the URL.
+//'
+//'@param urls a vector of URLs. These should be decoded (with \code{url_decode})
+//'but do not have to have been otherwise manipulated.
+//'
+//'@param key a string representing the key to modify the value of (or insert wholesale
+//'if it doesn't exist within the URL).
+//'
+//'@param value a value to associate with the key. This can be a single string,
+//'or a vector the same length as \code{urls}
+//'
+//'@return the original vector of URLs, but with modified/inserted key-value pairs. If the
+//'URL is \code{NA}, the returned value will be \code{NA}; if the key or value are \code{NA},
+//'no insertion will be made.
+//'
+//'@examples
+//'# Set a URL parameter where there's already a key for that
+//'param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
+//'
+//'# Set a URL parameter where there isn't.
+//'param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
+//'
+//'@seealso \code{\link{param_get}} to retrieve the values associated with multiple keys in
+//'a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
+//'
+//'@export
+//[[Rcpp::export]]
+CharacterVector param_set(CharacterVector urls, String key, CharacterVector value){
+ parameter p_inst;
+ return p_inst.set_parameter_vectorised(urls, key, value);
+}
+
+//'@title Remove key-value pairs from query strings
+//'@description URLs often have queries associated with them, particularly URLs for
+//'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
+//'allows you to remove key/value pairs while leaving the rest of the URL intact.
+//'
+//'@param urls a vector of URLs. These should be decoded with \code{url_decode} but don't
+//'have to have been otherwise processed.
+//'
+//'@param keys a vector of parameter keys to remove.
+//'
+//'@return the original URLs but with the key/value pairs specified by \code{keys} removed.
+//'If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
+//'nothing will be done with it.
+//'
+//'@seealso \code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
+//'to retrieve those values.
+//'
+//'@examples
+//'# Remove multiple parameters from a URL
+//'param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
+//' keys = c("action","format"))
+//'@export
+//[[Rcpp::export]]
+CharacterVector param_remove(CharacterVector urls, CharacterVector keys){
+ parameter p_inst;
+ return p_inst.remove_parameter_vectorised(urls, keys);
+
+}
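
The three helpers above are designed to chain together; a short sketch of a typical edit cycle, assuming the package is installed:

library(urltools)
u <- "https://en.wikipedia.org/api.php?action=query&format=xml"
param_get(u, c("action", "format"))      # one-row data.frame: action = "query", format = "xml"
u <- param_set(u, "format", "json")      # replaces the value of an existing key
u <- param_set(u, "prop", "revisions")   # appends a key that was not present
param_remove(u, keys = "action")         # strips action=... from the query string
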
diff --git a/src/parameter.cpp b/src/parameter.cpp
new file mode 100644
index 0000000..704c784
--- /dev/null
+++ b/src/parameter.cpp
@@ -0,0 +1,175 @@
+#include "parameter.h"
+
+std::vector < std::string > parameter::get_query_string(std::string url){
+
+ std::vector < std::string > output;
+ size_t query_location = url.find("?");
+ if(query_location == std::string::npos){
+ output.push_back(url);
+ } else {
+ output.push_back(url.substr(0, query_location));
+ output.push_back(url.substr(query_location));
+ }
+ return output;
+}
+
+std::string parameter::set_parameter(std::string url, std::string& component, std::string value){
+
+ std::vector < std::string > holding = get_query_string(url);
+ if(holding.size() == 1){
+ return holding[0] + ("?" + component + "=" + value);
+ }
+
+ size_t component_location = holding[1].find((component + "="));
+
+ if(component_location == std::string::npos){
+ holding[1] = (holding[1] + "&" + component + "=" + value);
+ } else {
+ size_t value_location = holding[1].find("&", component_location);
+ if(value_location == std::string::npos){
+ holding[1].replace(component_location, value_location, (component + "=" + value));
+ } else {
+ holding[1].replace(component_location, (value_location - component_location), (component + "=" + value));
+ }
+
+ }
+
+ return(holding[0] + holding[1]);
+
+}
+
+std::string parameter::remove_parameter_single(std::string url, CharacterVector params){
+
+ std::vector < std::string > parsed_url = get_query_string(url);
+ if(parsed_url.size() == 1){
+ return url;
+ }
+
+ for(unsigned int i = 0; i < params.size(); i++){
+ if(params[i] != NA_STRING){
+ size_t param_location = parsed_url[1].find(Rcpp::as<std::string>(params[i]));
+ while(param_location != std::string::npos){
+ size_t end_location = parsed_url[1].find("&", param_location);
+ parsed_url[1].erase(param_location, end_location);
+        param_location = parsed_url[1].find(params[i], param_location);
+ }
+ }
+ }
+
+ // We may have removed all of the parameters or the last one, leading to trailing ampersands or
+ // question marks. If those exist, erase them.
+ if(parsed_url[1][parsed_url[1].size()-1] == '&' || parsed_url[1][parsed_url[1].size()-1] == '?'){
+ parsed_url[1].erase(parsed_url[1].size()-1);
+ }
+
+ return (parsed_url[0] + parsed_url[1]);
+}
+
+//Parameter retrieval
+CharacterVector parameter::get_parameter(CharacterVector& urls, std::string component){
+ std::size_t component_location;
+ std::size_t next_location;
+ unsigned int input_size = urls.size();
+ int component_size = component.length();
+ CharacterVector output(input_size);
+ component = component + "=";
+ std::string holding;
+ for(unsigned int i = 0; i < input_size; ++i){
+ if(urls[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ holding = Rcpp::as<std::string>(urls[i]);
+ component_location = holding.find(component);
+ if(component_location == std::string::npos){
+ output[i] = NA_STRING;
+ } else {
+ next_location = holding.find_first_of("&#", component_location + component_size);
+ if(next_location == std::string::npos){
+ output[i] = holding.substr(component_location + component_size + 1);
+ } else {
+ output[i] = holding.substr(component_location + component_size + 1, (next_location-(component_location + component_size + 1)));
+ }
+ }
+ }
+ }
+ return output;
+}
+
+CharacterVector parameter::set_parameter_vectorised(CharacterVector urls, String component,
+ CharacterVector value){
+
+ unsigned int input_size = urls.size();
+ CharacterVector output(input_size);
+
+ if(component != NA_STRING){
+ std::string component_ref = component.get_cstring();
+ if(value.size() == input_size){
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(urls[i] != NA_STRING && value[i] != NA_STRING){
+ output[i] = set_parameter(Rcpp::as<std::string>(urls[i]), component_ref,
+ Rcpp::as<std::string>(value[i]));
+ } else if(value[i] == NA_STRING){
+ output[i] = urls[i];
+ } else {
+ output[i] = NA_STRING;
+ }
+ }
+ } else if(value.size() == 1){
+ if(value[0] != NA_STRING){
+ std::string value_ref = Rcpp::as<std::string>(value[0]);
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(urls[i] != NA_STRING){
+ output[i] = set_parameter(Rcpp::as<std::string>(urls[i]), component_ref, value_ref);
+ } else {
+ output[i] = NA_STRING;
+ }
+ }
+ } else {
+ return urls;
+ }
+
+ } else {
+ throw std::range_error("'value' must be the same length as 'urls', or of length 1");
+ }
+ } else {
+ return urls;
+ }
+
+ return output;
+}
+
+CharacterVector parameter::remove_parameter_vectorised(CharacterVector urls,
+ CharacterVector params){
+
+ unsigned int input_size = urls.size();
+ CharacterVector output(input_size);
+  CharacterVector p_copy = clone(params);
+ // Generate easily find-able params.
+ for(unsigned int i = 0; i < p_copy.size(); i++){
+ if(p_copy[i] != NA_STRING){
+ p_copy[i] += "=";
+ }
+ }
+
+ // For each URL, remove those parameters.
+ for(unsigned int i = 0; i < urls.size(); i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(urls[i] != NA_STRING){
+ output[i] = remove_parameter_single(Rcpp::as<std::string>(urls[i]), p_copy);
+
+ } else {
+ output[i] = NA_STRING;
+ }
+ }
+
+ // Return
+ return output;
+}
diff --git a/src/parameter.h b/src/parameter.h
new file mode 100644
index 0000000..c83e683
--- /dev/null
+++ b/src/parameter.h
@@ -0,0 +1,93 @@
+#include "parsing.h"
+
+#ifndef __PARAM_INCLUDED__
+#define __PARAM_INCLUDED__
+
+
+class parameter: public parsing {
+
+private:
+
+ /**
+ * Split out a URL query from the actual body. Used
+ * in set_ and remove_parameter.
+ *
+ * @param url a URL.
+ *
+ * @return a vector either of length 1, indicating that no
+ * query was found, or 2, indicating that one was.
+ */
+ std::vector < std::string > get_query_string(std::string url);
+
+ /**
+ * Set the value of a single key=value parameter.
+ *
+ * @param url a URL.
+ *
+ * @param component a reference to the key to set
+ *
+ * @param value a reference to the value to set.
+ *
+ * @return a string containing URL + key=value, controlling
+ * for the possibility that the URL did not previously have a query
+   * string associated with it - or did, and had that key, but associated a
+ * different value with it.
+ */
+ std::string set_parameter(std::string url, std::string& component, std::string value);
+
+ /**
+   * Remove a range of key/value parameters
+ *
+ * @param url a URL.
+ *
+ * @param params a vector of keys.
+ *
+ * @return a string containing the URL but absent the keys and values that were specified.
+ *
+ */
+ std::string remove_parameter_single(std::string url, CharacterVector params);
+
+public:
+
+ /**
+ * Component retrieval specifically for parameters.
+ *
+ * @param urls a reference to a vector of URLs
+ *
+ * @param component the name of a component to retrieve
+ * the value of
+ *
+ * @return a vector of the values for that component.
+ */
+ CharacterVector get_parameter(CharacterVector& urls, std::string component);
+
+
+ /**
+ * Set the value of a single key=value parameter for a vector of strings.
+ *
+ * @param urls a vector of URLs.
+ *
+ * @param component a string containing the key to set
+ *
+ * @param value a vector of values to set.
+ *
+ * @return the initial URLs vector, with the aforementioned string modifications.
+ */
+ CharacterVector set_parameter_vectorised(CharacterVector urls, String component,
+ CharacterVector value);
+
+ /**
+   * Remove a range of key/value parameters from a vector of strings.
+ *
+ * @param urls a vector of URLs.
+ *
+ * @param params a vector of keys.
+ *
+ * @return the initial URLs vector, with the aforementioned string modifications.
+ *
+ */
+ CharacterVector remove_parameter_vectorised(CharacterVector urls,
+ CharacterVector params);
+};
+
+#endif
diff --git a/src/parsing.cpp b/src/parsing.cpp
new file mode 100644
index 0000000..f8e01d0
--- /dev/null
+++ b/src/parsing.cpp
@@ -0,0 +1,238 @@
+#include "parsing.h"
+
+std::string parsing::scheme(std::string& url){
+ std::string output;
+ std::size_t protocol = url.find("://");
+  if((protocol == std::string::npos) || (protocol > 6)){
+ //If that's not present, or isn't present at the /beginning/, unknown
+ output = "";
+ } else {
+ output = url.substr(0,protocol);
+ url = url.substr((protocol+3));
+ }
+ return output;
+}
+
+std::string parsing::string_tolower(std::string str){
+ unsigned int input_size = str.size();
+ for(unsigned int i = 0; i < input_size; i++){
+ str[i] = tolower(str[i]);
+ }
+ return str;
+}
+
+std::vector < std::string > parsing::domain_and_port(std::string& url){
+
+ std::vector < std::string > output(2);
+ std::string holding;
+ unsigned int output_offset = 0;
+
+ // Identify the port. If there is one, push everything
+ // before that straight into the output, and the remainder
+ // into the holding string. If not, the entire
+ // url goes into the holding string.
+ std::size_t port = url.find(":");
+
+ if(port != std::string::npos && url.find("/") >= port){
+ output[0] = url.substr(0,port);
+ holding = url.substr(port+1);
+ output_offset++;
+ } else {
+ holding = url;
+ }
+
+ // Look for a trailing slash
+ std::size_t trailing_slash = holding.find("/");
+
+ // If there is one, that's when everything ends
+ if(trailing_slash != std::string::npos){
+ output[output_offset] = holding.substr(0, trailing_slash);
+ output_offset++;
+ url = holding.substr(trailing_slash+1);
+ return output;
+ }
+
+ // If not, there might be a query parameter associated
+ // with the base URL, which we need to preserve.
+ std::size_t param = holding.find("?");
+
+ // If there is, handle that
+ if(param != std::string::npos){
+ output[output_offset] = holding.substr(0, param);
+ url = holding.substr(param);
+ return output;
+ }
+
+ // Otherwise we're done here
+ output[output_offset] = holding;
+ url = "";
+ return output;
+}
+
+std::string parsing::path(std::string& url){
+ if(url.size() == 0){
+ return url;
+ }
+ std::string output;
+ std::size_t path = url.find("?");
+ if(path == std::string::npos){
+ std::size_t fragment = url.find("#");
+ if(fragment == std::string::npos){
+ output = url;
+ url = "";
+ return output;
+ }
+ output = url.substr(0,fragment);
+ url = url.substr(fragment);
+ return output;
+ }
+
+ output = url.substr(0,path);
+ url = url.substr(path+1);
+ return output;
+}
+
+std::string parsing::query(std::string& url){
+ if(url == ""){
+ return url;
+ }
+
+ std::string output;
+ std::size_t fragment = url.find("#");
+ if(fragment == std::string::npos){
+ output = url;
+ url = "";
+ return output;
+ }
+ output = url.substr(0,fragment);
+ url = url.substr(fragment+1);
+ return output;
+}
+
+String parsing::check_parse_out(std::string x){
+
+ if(x == ""){
+ return NA_STRING;
+ }
+ return x;
+}
+
+//URL parser
+CharacterVector parsing::url_to_vector(std::string url){
+
+ std::string &url_ptr = url;
+
+ //Output object, holding object, normalise.
+ CharacterVector output(6);
+ std::vector < std::string > holding(2);
+
+ std::string s = scheme(url_ptr);
+
+ holding = domain_and_port(url_ptr);
+
+ //Run
+ output[0] = check_parse_out(string_tolower(s));
+ output[1] = check_parse_out(string_tolower(holding[0]));
+ output[2] = check_parse_out(holding[1]);
+ output[3] = check_parse_out(path(url_ptr));
+ output[4] = check_parse_out(query(url_ptr));
+ output[5] = check_parse_out(url_ptr);
+
+ return output;
+}
+
+//Component retrieval
+String parsing::get_component(std::string url, int component){
+ return url_to_vector(url)[component];
+}
+
+//Component modification
+String parsing::set_component(std::string url, int component, String new_value){
+
+ if(new_value == NA_STRING){
+ return NA_STRING;
+ }
+ std::string output;
+ CharacterVector parsed_url = url_to_vector(url);
+ parsed_url[component] = new_value;
+
+ if(parsed_url[0] != NA_STRING){
+ output += parsed_url[0];
+ output += "://";
+ }
+
+ if(parsed_url[1] != NA_STRING){
+ output += parsed_url[1];
+ }
+
+ if(parsed_url[2] != NA_STRING){
+ output += ":";
+ output += parsed_url[2];
+ }
+
+ if(parsed_url[3] != NA_STRING){
+ output += "/";
+ output += parsed_url[3];
+ }
+
+ if(parsed_url[4] != NA_STRING){
+ output += "?";
+ output += parsed_url[4];
+ }
+
+ if(parsed_url[5] != NA_STRING){
+ output += "#";
+ output += parsed_url[5];
+ }
+
+ return output;
+}
+
+DataFrame parsing::parse_to_df(CharacterVector& urls_ptr){
+
+ //Input and holding objects
+ unsigned int input_size = urls_ptr.size();
+ CharacterVector holding(6);
+
+ //Output objects
+ CharacterVector schemes(input_size);
+ CharacterVector domains(input_size);
+ CharacterVector ports(input_size);
+ CharacterVector paths(input_size);
+ CharacterVector parameters(input_size);
+ CharacterVector fragments(input_size);
+
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+
+ // Handle NAs on input
+ if(urls_ptr[i] == NA_STRING){
+
+ schemes[i] = NA_STRING;
+ domains[i] = NA_STRING;
+ ports[i] = NA_STRING;
+ paths[i] = NA_STRING;
+ parameters[i] = NA_STRING;
+ fragments[i] = NA_STRING;
+
+ } else {
+ holding = url_to_vector(Rcpp::as<std::string>(urls_ptr[i]));
+ schemes[i] = holding[0];
+ domains[i] = holding[1];
+ ports[i] = holding[2];
+ paths[i] = holding[3];
+ parameters[i] = holding[4];
+ fragments[i] = holding[5];
+ }
+ }
+
+ return DataFrame::create(_["scheme"] = schemes,
+ _["domain"] = domains,
+ _["port"] = ports,
+ _["path"] = paths,
+ _["parameter"] = parameters,
+ _["fragment"] = fragments,
+ _["stringsAsFactors"] = false);
+}
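
A sketch of what parse_to_df() produces once it is exposed as url_parse() (see src/urltools.cpp later in this diff), assuming the package is installed:

library(urltools)
parsed <- url_parse("https://Example.COM:8080/foo/bar.php?q=1#frag")
parsed$scheme     # "https" (lower-cased)
parsed$domain     # "example.com" (lower-cased)
parsed$port       # "8080"
parsed$path       # "foo/bar.php" - the leading "/" is consumed during parsing
parsed$parameter  # "q=1"
parsed$fragment   # "frag"
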
diff --git a/src/parsing.h b/src/parsing.h
new file mode 100644
index 0000000..24c509c
--- /dev/null
+++ b/src/parsing.h
@@ -0,0 +1,141 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+#ifndef __PARSING_INCLUDED__
+#define __PARSING_INCLUDED__
+
+class parsing {
+
+ protected:
+
+ /**
+ * A function for parsing a URL and turning it into a vector.
+ * Tremendously useful (read: everything breaks without this)
+ *
+ * @param url a URL.
+ *
+ * @see get_ and set_component, which call this.
+ *
+ * @return a vector consisting of the value for each component
+ * part of the URL.
+ */
+ CharacterVector url_to_vector(std::string url);
+
+ private:
+
+ /**
+ * A function for lower-casing an entire string
+ *
+ * @param str a string to lower-case
+ *
+ * @return a string containing the lower-cased version of the
+ * input.
+ */
+ std::string string_tolower(std::string str);
+
+ /**
+ * A function for extracting the scheme of a URL; part of the
+ * URL parsing framework.
+ *
+ * @param url a reference to a url.
+ *
+ * @see url_to_vector which calls this.
+ *
+ * @return a string containing the scheme of the URL if identifiable,
+ * and "" if not.
+ */
+ std::string scheme(std::string& url);
+
+ /**
+ * A function for extracting the domain and port of a URL; part of the
+ * URL parsing framework. Fairly unique in that it outputs a
+ * vector, unlike the rest of the framework, which outputs a string,
+ * since it has to handle multiple elements.
+ *
+ * @param url a reference to a url. Should've been run through
+ * scheme() first.
+ *
+ * @see url_to_vector which calls this.
+ *
+ * @return a vector containing the domain and port of the URL if identifiable,
+ * and "" for each non-identifiable element.
+ */
+ std::vector < std::string > domain_and_port(std::string& url);
+
+ /**
+ * A function for extracting the path of a URL; part of the
+ * URL parsing framework.
+ *
+ * @param url a reference to a url. Should've been run through
+ * scheme() and domain_and_port() first.
+ *
+ * @see url_to_vector which calls this.
+ *
+ * @return a string containing the path of the URL if identifiable,
+ * and "" if not.
+ */
+ std::string path(std::string& url);
+
+ /**
+   * A function for extracting the query string of a URL; part of the
+ * URL parsing framework.
+ *
+ * @param url a reference to a url. Should've been run through
+ * scheme(), domain_and_port() and path() first.
+ *
+ * @see url_to_vector which calls this.
+ *
+ * @return a string containing the query string of the URL if identifiable,
+ * and "" if not.
+ */
+ std::string query(std::string& url);
+
+ String check_parse_out(std::string x);
+
+ public:
+
+ /**
+ * A function to retrieve an individual component from a parsed
+   * URL. Used in scheme(), host() et al; calls url_to_vector.
+ *
+ * @param url a URL.
+ *
+ * @param component an integer representing which value in
+   * url_to_vector's returned vector to grab.
+ *
+ * @see set_component, which allows for modification.
+ *
+ * @return a string consisting of the requested URL component.
+ */
+ String get_component(std::string url, int component);
+
+ /**
+ * A function to set an individual component in a parsed
+   * URL. Used in "scheme<-", et al; calls url_to_vector.
+ *
+ * @param url a URL.
+ *
+ * @param component an integer representing which value in
+   * url_to_vector's returned vector to modify.
+ *
+ * @param new_value the value to insert into url[component].
+ *
+ * @see get_component, which allows for retrieval.
+ *
+ * @return a string consisting of the modified URL.
+ */
+ String set_component(std::string url, int component, String new_value);
+
+ /**
+ * Decompose a vector of URLs and turn it into a data.frame.
+ *
+ * @param URLs a reference to a vector of URLs
+ *
+ * @return an Rcpp data.frame.
+ *
+ */
+ DataFrame parse_to_df(CharacterVector& urls_ptr);
+
+};
+#endif
diff --git a/src/puny.cpp b/src/puny.cpp
new file mode 100644
index 0000000..fa84ac0
--- /dev/null
+++ b/src/puny.cpp
@@ -0,0 +1,226 @@
+#include <Rcpp.h>
+#include "punycode.h"
+extern "C"{
+#include "utf8.h"
+}
+using namespace Rcpp;
+#define R_NO_REMAP
+#include <R.h>
+#include <Rinternals.h>
+
+#define BUFLENT 2048
+static char buf[BUFLENT];
+static uint32_t ibuf[BUFLENT];
+static std::string ascii = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890_.?&=:/";
+
+static inline void clearbuf(){
+ for (int i=0; i<BUFLENT; i++){
+ buf[i] = '\0';
+ ibuf[i] = 0;
+ }
+}
+
+struct url {
+ std::deque<std::string> split_url;
+ std::string protocol;
+ std::string path;
+};
+
+void split_url(std::string x, url& output){
+
+ size_t last;
+ size_t loc = x.find(".");
+
+ last = x.find("://");
+ if(last != std::string::npos){
+ output.protocol = x.substr(0, (last + 3));
+ x = x.substr(last + 3);
+ }
+ last = x.find_first_of(":/");
+ if(last != std::string::npos){
+ output.path = x.substr(last);
+ x = x.substr(0, last);
+ }
+
+ last = 0;
+ loc = x.find(".");
+ while (loc != std::string::npos) {
+ output.split_url.push_back(x.substr(last, loc-last));
+ last = ++loc;
+ loc = x.find(".", loc);
+ }
+ if (loc == std::string::npos){
+ output.split_url.push_back(x.substr(last, x.length()));
+ }
+}
+
+std::string check_result(enum punycode_status& st, std::string& x){
+ std::string ret = "Error with the URL " + x + ":";
+ if (st == punycode_bad_input){
+ ret += "input is invalid";
+ } else if (st == punycode_big_output){
+ ret += "output would exceed the space provided";
+ } else if (st == punycode_overflow){
+ ret += "input needs wider integers to process";
+ } else {
+ return "";
+ }
+ return ret;
+};
+
+String encode_single(std::string x){
+
+ url holding;
+ split_url(x, holding);
+ std::string output = holding.protocol;
+
+ for(unsigned int i = 0; i < holding.split_url.size(); i++){
+    // Check if it's an ASCII-only fragment - if so, nowt to do here.
+ if(holding.split_url[i].find_first_not_of(ascii) == std::string::npos){
+ output += holding.split_url[i];
+ if(i < (holding.split_url.size() - 1)){
+ output += ".";
+ }
+ } else {
+
+ // Prep for conversion
+ punycode_uint buflen = BUFLENT;
+ punycode_uint unilen = BUFLENT;
+ const char *s = holding.split_url[i].c_str();
+ const int slen = strlen(s);
+
+ // Do the conversion
+ unilen = u8_toucs(ibuf, unilen, s, slen);
+ enum punycode_status st = punycode_encode(unilen, ibuf, NULL, &buflen, buf);
+
+ // Check it worked
+ std::string ret = check_result(st, x);
+ if(ret.size()){
+ Rcpp::warning(ret);
+ return NA_STRING;
+ }
+
+ std::string encoded = Rcpp::as<std::string>(Rf_mkCharLenCE(buf, buflen, CE_UTF8));
+ if(encoded != holding.split_url[i]){
+ encoded = "xn--" + encoded;
+ }
+ output += encoded;
+ if(i < (holding.split_url.size() - 1)){
+ output += ".";
+ }
+ }
+ }
+ output += holding.path;
+ return output;
+}
+
+//'@title Encode or Decode Internationalised Domains
+//'@description \code{puny_encode} and \code{puny_decode} implement
+//'the encoding standard for internationalised (non-ASCII) domains and
+//'subdomains. You can use them to encode UTF-8 domain names, or decode
+//'encoded names (which start with "xn--"), or both.
+//'
+//'@param x a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.
+//'
+//'@return a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
+//'Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
+//'decoded or encoded version) will be returned as \code{NA}.
+//'
+//'@examples
+//'# Encode a URL
+//'puny_encode("https://www.bücher.com/foo")
+//'
+//'# Decode the result, back to the original
+//'puny_decode("https://www.xn--bcher-kva.com/foo")
+//'
+//'@seealso \code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
+//'
+//'@rdname puny
+//'@export
+//[[Rcpp::export]]
+CharacterVector puny_encode(CharacterVector x){
+
+ unsigned int input_size = x.size();
+ CharacterVector output(input_size);
+
+ for(unsigned int i = 0; i < input_size; i++){
+
+ if(i % 10000 == 0){
+ Rcpp::checkUserInterrupt();
+ }
+
+ if(x[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ output[i] = encode_single(Rcpp::as<std::string>(x[i]));
+ }
+ }
+
+ clearbuf();
+ return output;
+}
+
+String decode_single(std::string x){
+ url holding;
+ split_url(x, holding);
+ std::string output = holding.protocol;
+
+ for(unsigned int i = 0; i < holding.split_url.size(); i++){
+    // Check if the fragment is punycode-encoded (an "xn--" prefix) - if not, nowt to do here.
+ if(holding.split_url[i].size() < 4 || holding.split_url[i].substr(0,4) != "xn--"){
+ output += holding.split_url[i];
+ if(i < (holding.split_url.size() - 1)){
+ output += ".";
+ }
+ } else {
+
+ // Prep for conversion
+ punycode_uint buflen;
+ punycode_uint unilen = BUFLENT;
+      const std::string stripped = holding.split_url[i].substr(4); // keep alive; c_str() of a temporary would dangle
+      const char *s = stripped.c_str();
+      const int slen = strlen(s);
+
+ // Do the conversion
+ enum punycode_status st = punycode_decode(slen, s, &unilen, ibuf, NULL);
+
+ // Check it worked
+ std::string ret = check_result(st, x);
+ if(ret.size()){
+ Rcpp::warning(ret);
+ return NA_STRING;
+ }
+ buflen = u8_toutf8(buf, BUFLENT, ibuf, unilen);
+ std::string encoded = Rcpp::as<std::string>(Rf_mkCharLenCE(buf, buflen, CE_UTF8));
+ output += encoded;
+ if(i < (holding.split_url.size() - 1)){
+ output += ".";
+ }
+ }
+ }
+ output += holding.path;
+ return output;
+}
+
+//'@rdname puny
+//'@export
+//[[Rcpp::export]]
+CharacterVector puny_decode(CharacterVector x){
+
+ unsigned int input_size = x.size();
+ CharacterVector output(input_size);
+
+ for(unsigned int i = 0; i < input_size; i++){
+
+ if(i % 10000 == 0){
+ Rcpp::checkUserInterrupt();
+ }
+
+ if(x[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ output[i] = decode_single(Rcpp::as<std::string>(x[i]));
+ }
+ }
+
+ return output;
+}
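
Because encode_single()/decode_single() work label by label, only the non-ASCII label gains the "xn--" prefix; the protocol and path are carried through untouched. A sketch, assuming the package is installed:

library(urltools)
puny_encode("https://www.bücher.de/katalog")
# roughly "https://www.xn--bcher-kva.de/katalog" - "www", "de" and "/katalog" pass through
puny_decode("https://www.xn--bcher-kva.de/katalog")
# back to "https://www.bücher.de/katalog"
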
diff --git a/src/punycode.c b/src/punycode.c
new file mode 100644
index 0000000..f905e52
--- /dev/null
+++ b/src/punycode.c
@@ -0,0 +1,289 @@
+/*
+punycode.c from RFC 3492
+http://www.nicemice.net/idn/
+Adam M. Costello
+http://www.nicemice.net/amc/
+
+This is ANSI C code (C89) implementing Punycode (RFC 3492).
+
+
+C. Disclaimer and license
+
+ Regarding this entire document or any portion of it (including
+ the pseudocode and C code), the author makes no guarantees and
+ is not responsible for any damage resulting from its use. The
+ author grants irrevocable permission to anyone to use, modify,
+ and distribute it in any way that does not diminish the rights
+ of anyone else to use, modify, and distribute it, provided that
+ redistributed derivative works do not contain misleading author or
+ version information. Derivative works need not be licensed under
+ similar terms.
+*/
+
+#include "punycode.h"
+
+/**********************************************************/
+/* Implementation (would normally go in its own .c file): */
+
+#include <string.h>
+
+/*** Bootstring parameters for Punycode ***/
+
+enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
+ initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
+
+/* basic(cp) tests whether cp is a basic code point: */
+#define basic(cp) ((punycode_uint)(cp) < 0x80)
+
+/* delim(cp) tests whether cp is a delimiter: */
+#define delim(cp) ((cp) == delimiter)
+
+/* decode_digit(cp) returns the numeric value of a basic code */
+/* point (for use in representing integers) in the range 0 to */
+/* base-1, or base if cp does not represent a value. */
+
+static punycode_uint decode_digit(punycode_uint cp)
+{
+ return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
+ cp - 97 < 26 ? cp - 97 : base;
+}
+
+/* encode_digit(d,flag) returns the basic code point whose value */
+/* (when used for representing integers) is d, which needs to be in */
+/* the range 0 to base-1. The lowercase form is used unless flag is */
+/* nonzero, in which case the uppercase form is used. The behavior */
+/* is undefined if flag is nonzero and digit d has no uppercase form. */
+
+static char encode_digit(punycode_uint d, int flag)
+{
+ return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
+ /* 0..25 map to ASCII a..z or A..Z */
+ /* 26..35 map to ASCII 0..9 */
+}
+
+/* flagged(bcp) tests whether a basic code point is flagged */
+/* (uppercase). The behavior is undefined if bcp is not a */
+/* basic code point. */
+
+#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
+
+/* encode_basic(bcp,flag) forces a basic code point to lowercase */
+/* if flag is zero, uppercase if flag is nonzero, and returns */
+/* the resulting code point. The code point is unchanged if it */
+/* is caseless. The behavior is undefined if bcp is not a basic */
+/* code point. */
+
+static char encode_basic(punycode_uint bcp, int flag)
+{
+ bcp -= (bcp - 97 < 26) << 5;
+ return bcp + ((!flag && (bcp - 65 < 26)) << 5);
+}
+
+/*** Platform-specific constants ***/
+
+/* maxint is the maximum value of a punycode_uint variable: */
+static const punycode_uint maxint = (punycode_uint) -1;
+/* Because maxint is unsigned, -1 becomes the maximum value. */
+
+/*** Bias adaptation function ***/
+
+static punycode_uint adapt(
+ punycode_uint delta, punycode_uint numpoints, int firsttime )
+{
+ punycode_uint k;
+
+ delta = firsttime ? delta / damp : delta >> 1;
+ /* delta >> 1 is a faster way of doing delta / 2 */
+ delta += delta / numpoints;
+
+ for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) {
+ delta /= base - tmin;
+ }
+
+ return k + (base - tmin + 1) * delta / (delta + skew);
+}
+
+/*** Main encode function ***/
+
+enum punycode_status punycode_encode(
+ punycode_uint input_length,
+ const punycode_uint input[],
+ const unsigned char case_flags[],
+ punycode_uint *output_length,
+ char output[] )
+{
+ punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
+
+ /* Initialize the state: */
+
+ n = initial_n;
+ delta = out = 0;
+ max_out = *output_length;
+ bias = initial_bias;
+
+ /* Handle the basic code points: */
+
+ for (j = 0; j < input_length; ++j) {
+ if (basic(input[j])) {
+ if (max_out - out < 2) return punycode_big_output;
+ output[out++] =
+ case_flags ? encode_basic(input[j], case_flags[j]) : (char)input[j];
+ }
+ /* else if (input[j] < n) return punycode_bad_input; */
+ /* (not needed for Punycode with unsigned code points) */
+ }
+
+ h = b = out;
+
+ /* h is the number of code points that have been handled, b is the */
+ /* number of basic code points, and out is the number of characters */
+ /* that have been output. */
+
+ if (b > 0) output[out++] = delimiter;
+
+ /* Main encoding loop: */
+
+ while (h < input_length) {
+ /* All non-basic code points < n have been */
+ /* handled already. Find the next larger one: */
+
+ for (m = maxint, j = 0; j < input_length; ++j) {
+ /* if (basic(input[j])) continue; */
+ /* (not needed for Punycode) */
+ if (input[j] >= n && input[j] < m) m = input[j];
+ }
+
+ /* Increase delta enough to advance the decoder's */
+ /* <n,i> state to <m,0>, but guard against overflow: */
+
+ if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
+ delta += (m - n) * (h + 1);
+ n = m;
+
+ for (j = 0; j < input_length; ++j) {
+ /* Punycode does not need to check whether input[j] is basic: */
+ if (input[j] < n /* || basic(input[j]) */ ) {
+ if (++delta == 0) return punycode_overflow;
+ }
+
+ if (input[j] == n) {
+ /* Represent delta as a generalized variable-length integer: */
+
+ for (q = delta, k = base; ; k += base) {
+ if (out >= max_out) return punycode_big_output;
+ t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
+ k >= bias + tmax ? tmax : k - bias;
+ if (q < t) break;
+ output[out++] = encode_digit(t + (q - t) % (base - t), 0);
+ q = (q - t) / (base - t);
+ }
+
+ output[out++] = encode_digit(q, case_flags && case_flags[j]);
+ bias = adapt(delta, h + 1, h == b);
+ delta = 0;
+ ++h;
+ }
+ }
+
+ ++delta, ++n;
+ }
+
+ *output_length = out;
+ return punycode_success;
+}
+
+/*** Main decode function ***/
+
+enum punycode_status punycode_decode(
+ punycode_uint input_length,
+ const char input[],
+ punycode_uint *output_length,
+ punycode_uint output[],
+ unsigned char case_flags[] )
+{
+ punycode_uint n, out, i, max_out, bias,
+ b, j, in, oldi, w, k, digit, t;
+
+ if (!input_length) {
+ return punycode_bad_input;
+ }
+
+ /* Initialize the state: */
+
+ n = initial_n;
+ out = i = 0;
+ max_out = *output_length;
+ bias = initial_bias;
+
+ /* Handle the basic code points: Let b be the number of input code */
+ /* points before the last delimiter, or 0 if there is none, then */
+ /* copy the first b code points to the output. */
+
+ for (b = 0, j = input_length - 1 ; j > 0; --j) {
+ if (delim(input[j])) {
+ b = j;
+ break;
+ }
+ }
+ if (b > max_out) return punycode_big_output;
+
+ for (j = 0; j < b; ++j) {
+ if (case_flags) case_flags[out] = flagged(input[j]);
+ if (!basic(input[j])) return punycode_bad_input;
+ output[out++] = input[j];
+ }
+
+ /* Main decoding loop: Start just after the last delimiter if any */
+ /* basic code points were copied; start at the beginning otherwise. */
+
+ for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {
+
+ /* in is the index of the next character to be consumed, and */
+ /* out is the number of code points in the output array. */
+
+ /* Decode a generalized variable-length integer into delta, */
+ /* which gets added to i. The overflow checking is easier */
+ /* if we increase i as we go, then subtract off its starting */
+ /* value at the end to obtain delta. */
+
+ for (oldi = i, w = 1, k = base; ; k += base) {
+ if (in >= input_length) return punycode_bad_input;
+ digit = decode_digit(input[in++]);
+ if (digit >= base) return punycode_bad_input;
+ if (digit > (maxint - i) / w) return punycode_overflow;
+ i += digit * w;
+ t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
+ k >= bias + tmax ? tmax : k - bias;
+ if (digit < t) break;
+ if (w > maxint / (base - t)) return punycode_overflow;
+ w *= (base - t);
+ }
+
+ bias = adapt(i - oldi, out + 1, oldi == 0);
+
+ /* i was supposed to wrap around from out+1 to 0, */
+ /* incrementing n each time, so we'll fix that now: */
+
+ if (i / (out + 1) > maxint - n) return punycode_overflow;
+ n += i / (out + 1);
+ i %= (out + 1);
+
+ /* Insert n at position i of the output: */
+
+ /* not needed for Punycode: */
+ /* if (decode_digit(n) <= base) return punycode_invalid_input; */
+ if (out >= max_out) return punycode_big_output;
+
+ if (case_flags) {
+ memmove(case_flags + i + 1, case_flags + i, out - i);
+ /* Case of last character determines uppercase flag: */
+ case_flags[i] = flagged(input[in - 1]);
+ }
+
+ memmove(output + i + 1, output + i, (out - i) * sizeof *output);
+ output[i++] = n;
+ }
+
+ *output_length = out;
+ return punycode_success;
+}
diff --git a/src/punycode.h b/src/punycode.h
new file mode 100644
index 0000000..459c6fd
--- /dev/null
+++ b/src/punycode.h
@@ -0,0 +1,108 @@
+/*
+punycode.c from RFC 3492
+http://www.nicemice.net/idn/
+Adam M. Costello
+http://www.nicemice.net/amc/
+
+This is ANSI C code (C89) implementing Punycode (RFC 3492).
+
+
+
+C. Disclaimer and license
+
+ Regarding this entire document or any portion of it (including
+ the pseudocode and C code), the author makes no guarantees and
+ is not responsible for any damage resulting from its use. The
+ author grants irrevocable permission to anyone to use, modify,
+ and distribute it in any way that does not diminish the rights
+ of anyone else to use, modify, and distribute it, provided that
+ redistributed derivative works do not contain misleading author or
+ version information. Derivative works need not be licensed under
+ similar terms.
+*/
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+/************************************************************/
+/* Public interface (would normally go in its own .h file): */
+
+#include <limits.h>
+
+enum punycode_status {
+ punycode_success,
+ punycode_bad_input, /* Input is invalid. */
+ punycode_big_output, /* Output would exceed the space provided. */
+ punycode_overflow /* Input needs wider integers to process. */
+};
+
+#if UINT_MAX >= (1 << 26) - 1
+typedef unsigned int punycode_uint;
+#else
+typedef unsigned long punycode_uint;
+#endif
+
+enum punycode_status punycode_encode(
+ punycode_uint input_length,
+ const punycode_uint input[],
+ const unsigned char case_flags[],
+ punycode_uint *output_length,
+ char output[] );
+
+ /* punycode_encode() converts Unicode to Punycode. The input */
+ /* is represented as an array of Unicode code points (not code */
+ /* units; surrogate pairs are not allowed), and the output */
+ /* will be represented as an array of ASCII code points. The */
+ /* output string is *not* null-terminated; it will contain */
+ /* zeros if and only if the input contains zeros. (Of course */
+ /* the caller can leave room for a terminator and add one if */
+ /* needed.) The input_length is the number of code points in */
+ /* the input. The output_length is an in/out argument: the */
+ /* caller passes in the maximum number of code points that it */
+ /* can receive, and on successful return it will contain the */
+ /* number of code points actually output. The case_flags array */
+ /* holds input_length boolean values, where nonzero suggests that */
+ /* the corresponding Unicode character be forced to uppercase */
+ /* after being decoded (if possible), and zero suggests that */
+ /* it be forced to lowercase (if possible). ASCII code points */
+ /* are encoded literally, except that ASCII letters are forced */
+ /* to uppercase or lowercase according to the corresponding */
+ /* uppercase flags. If case_flags is a null pointer then ASCII */
+ /* letters are left as they are, and other code points are */
+ /* treated as if their uppercase flags were zero. The return */
+ /* value can be any of the punycode_status values defined above */
+ /* except punycode_bad_input; if not punycode_success, then */
+ /* output_size and output might contain garbage. */
+
+enum punycode_status punycode_decode(
+ punycode_uint input_length,
+ const char input[],
+ punycode_uint *output_length,
+ punycode_uint output[],
+ unsigned char case_flags[] );
+
+ /* punycode_decode() converts Punycode to Unicode. The input is */
+ /* represented as an array of ASCII code points, and the output */
+ /* will be represented as an array of Unicode code points. The */
+ /* input_length is the number of code points in the input. The */
+ /* output_length is an in/out argument: the caller passes in */
+ /* the maximum number of code points that it can receive, and */
+ /* on successful return it will contain the actual number of */
+ /* code points output. The case_flags array needs room for at */
+ /* least output_length values, or it can be a null pointer if the */
+ /* case information is not needed. A nonzero flag suggests that */
+ /* the corresponding Unicode character be forced to uppercase */
+ /* by the caller (if possible), while zero suggests that it be */
+ /* forced to lowercase (if possible). ASCII code points are */
+ /* output already in the proper case, but their flags will be set */
+ /* appropriately so that applying the flags would be harmless. */
+ /* The return value can be any of the punycode_status values */
+ /* defined above; if not punycode_success, then output_length, */
+ /* output, and case_flags might contain garbage. On success, the */
+ /* decoder will never need to write an output_length greater than */
+ /* input_length, because of how the encoding is defined. */
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
diff --git a/src/suffix.cpp b/src/suffix.cpp
new file mode 100644
index 0000000..35dd584
--- /dev/null
+++ b/src/suffix.cpp
@@ -0,0 +1,145 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+std::string string_reverse(std::string x){
+ std::reverse(x.begin(), x.end());
+ return x;
+}
+
+//[[Rcpp::export]]
+CharacterVector reverse_strings(CharacterVector strings){
+
+ unsigned int input_size = strings.size();
+ CharacterVector output(input_size);
+ for(unsigned int i = 0; i < input_size; i++){
+ if(strings[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ output[i] = string_reverse(Rcpp::as<std::string>(strings[i]));
+ }
+ }
+
+ return output;
+}
+
+//[[Rcpp::export]]
+DataFrame finalise_suffixes(CharacterVector full_domains, CharacterVector suffixes,
+ LogicalVector wildcard, LogicalVector is_suffix){
+
+ unsigned int input_size = full_domains.size();
+ CharacterVector subdomains(input_size);
+ CharacterVector domains(input_size);
+ std::string holding;
+ size_t domain_location;
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(is_suffix[i]){
+ subdomains[i] = NA_STRING;
+ domains[i] = NA_STRING;
+ suffixes[i] = full_domains[i];
+ } else {
+ if(suffixes[i] == NA_STRING || suffixes[i].size() == full_domains[i].size()){
+ subdomains[i] = NA_STRING;
+ domains[i] = NA_STRING;
+ } else if(wildcard[i]) {
+ holding = Rcpp::as<std::string>(full_domains[i]);
+ holding = holding.substr(0, ((full_domains[i].size() - suffixes[i].size()) - 1));
+ domain_location = holding.rfind(".");
+ if(domain_location == std::string::npos){
+ domains[i] = NA_STRING;
+ subdomains[i] = NA_STRING;
+ suffixes[i] = holding + "." + suffixes[i];
+ } else {
+ suffixes[i] = holding.substr(domain_location+1) + "." + suffixes[i];
+ holding = holding.substr(0, domain_location);
+ domain_location = holding.rfind(".");
+ if(domain_location == std::string::npos){
+ if(holding.size() == 0){
+ domains[i] = NA_STRING;
+ } else {
+ domains[i] = holding;
+ }
+ subdomains[i] = NA_STRING;
+ } else {
+ domains[i] = holding.substr(domain_location+1);
+ subdomains[i] = holding.substr(0, domain_location);
+ }
+ }
+ } else {
+ holding = Rcpp::as<std::string>(full_domains[i]);
+ holding = holding.substr(0, ((full_domains[i].size() - suffixes[i].size()) - 1));
+ domain_location = holding.rfind(".");
+ if(domain_location == std::string::npos){
+ subdomains[i] = NA_STRING;
+ if(holding.size() == 0){
+ domains[i] = NA_STRING;
+ } else {
+ domains[i] = holding;
+ }
+ } else {
+ subdomains[i] = holding.substr(0, domain_location);
+ domains[i] = holding.substr(domain_location+1);
+ }
+ }
+ }
+ }
+ return DataFrame::create(_["host"] = full_domains, _["subdomain"] = subdomains,
+ _["domain"] = domains, _["suffix"] = suffixes,
+ _["stringsAsFactors"] = false);
+}
+
+//[[Rcpp::export]]
+CharacterVector tld_extract_(CharacterVector domains){
+
+ unsigned int input_size = domains.size();
+ CharacterVector output(input_size);
+ std::string holding;
+ size_t fragment_location;
+
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(domains[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ holding = Rcpp::as<std::string>(domains[i]);
+ fragment_location = holding.rfind(".");
+ if(fragment_location == std::string::npos || fragment_location == (holding.size() - 1)){
+ output[i] = NA_STRING;
+ } else {
+ output[i] = holding.substr(fragment_location+1);
+ }
+ }
+ }
+ return output;
+}
+
+//[[Rcpp::export]]
+CharacterVector host_extract_(CharacterVector domains){
+
+ unsigned int input_size = domains.size();
+ CharacterVector output(input_size);
+ std::string holding;
+ size_t fragment_location;
+
+ for(unsigned int i = 0; i < input_size; i++){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(domains[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ holding = Rcpp::as<std::string>(domains[i]);
+ fragment_location = holding.find(".");
+ if(fragment_location == std::string::npos){
+ output[i] = NA_STRING;
+ } else {
+ output[i] = holding.substr(0, fragment_location);
+ }
+ }
+ }
+ return output;
+}
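
tld_extract_() keeps whatever follows the final "." and host_extract_() keeps whatever precedes the first "."; inputs with no "." at all come back as NA. A plain-R restatement of the same logic (the package exposes these helpers through the R-level wrappers registered in RcppExports above):

hosts <- c("stats.en.wikipedia.org", "example.com")
sub("^.*\\.", "", hosts)   # what tld_extract_ computes: "org", "com"
sub("\\..*$", "", hosts)   # what host_extract_ computes: "stats", "example"
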
diff --git a/src/urltools.cpp b/src/urltools.cpp
new file mode 100644
index 0000000..754fb7e
--- /dev/null
+++ b/src/urltools.cpp
@@ -0,0 +1,184 @@
+#include <Rcpp.h>
+#include "encoding.h"
+#include "compose.h"
+#include "parsing.h"
+
+using namespace Rcpp;
+
+//'@title Encode or decode a URI
+//'@description encodes or decodes a URI/URL
+//'
+//'@param urls a vector of URLs to decode or encode.
+//'
+//'@details
+//'URL encoding and decoding is an essential prerequisite to proper web interaction
+//'and data analysis around things like server-side logs. The
+//'\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} mandates the percentage-encoding
+//'of non-Latin characters, including things like slashes, unless those are reserved.
+//'
+//'Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
+//'URL encoding - in theory. In practice, they have a set of substantial problems
+//'that the urltools implementation solves:
+//'
+//'\itemize{
+//' \item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
+//' This means that, when confronted with a vector of URLs that need encoding or
+//' decoding, your only option is to loop from within R. This can be incredibly
+//' computationally costly with large datasets. url_encode and url_decode are
+//' implemented in C++ and entirely vectorised, allowing for a substantial
+//' performance improvement.}
+//' \item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
+//' of making sure your URL no longer works. Because of this, the only thing
+//' you can encode in URLencode (unless you refuse to encode reserved characters)
+//' is a partial URL, lacking the initial scheme, which requires additional operations
+//' to set up and increases the complexity of encoding or decoding. url_encode
+//' detects the protocol and silently splits it off, leaving it unencoded to ensure
+//' that the resulting URL is valid.}
+//' \item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
+//' characters. Unfortunately, URLdecode's response to these characters is to convert
+//' them to NULs, which R can't handle, at which point your URLdecode call breaks.
+//' \code{url_decode} simply ignores them.}
+//'}
+//'
+//'@return a character vector containing the encoded (or decoded) versions of "urls".
+//'
+//'@seealso \code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
+//'and encoding.
+//'
+//'@examples
+//'
+//'url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_%28logo%29.jpg")
+//'url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
+//'
+//'\dontrun{
+//'#A demonstrator of the contrasting behaviours around out-of-range characters
+//'URLdecode("%gIL")
+//'url_decode("%gIL")
+//'}
+//'@rdname encoder
+//'@export
+// [[Rcpp::export]]
+CharacterVector url_decode(CharacterVector urls){
+
+ //Measure size, create output object
+ int input_size = urls.size();
+ CharacterVector output(input_size);
+ encoding enc_inst;
+ //Decode each string in turn.
+ for (int i = 0; i < input_size; ++i){
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+ if(urls[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ output[i] = enc_inst.internal_url_decode(Rcpp::as<std::string>(urls[i]));
+ }
+ }
+
+ //Return
+ return output;
+}
+
+//'@rdname encoder
+//'@export
+// [[Rcpp::export]]
+CharacterVector url_encode(CharacterVector urls){
+
+ //Measure size, create output object and holding objects
+ int input_size = urls.size();
+ CharacterVector output(input_size);
+ std::string holding;
+ size_t scheme_start;
+ size_t first_slash;
+ encoding enc_inst;
+
+ //For each string..
+ for (int i = 0; i < input_size; ++i){
+
+ //Check for user interrupts.
+ if((i % 10000) == 0){
+ Rcpp::checkUserInterrupt();
+ }
+
+ if(urls[i] == NA_STRING){
+ output[i] = NA_STRING;
+ } else {
+ holding = Rcpp::as<std::string>(urls[i]);
+
+ //Extract the protocol. If you can't find it, just encode the entire thing.
+ scheme_start = holding.find("://");
+ if(scheme_start == std::string::npos){
+ output[i] = enc_inst.internal_url_encode(holding);
+ } else {
+ //Otherwise, split out the protocol and encode !protocol.
+ first_slash = holding.find("/", scheme_start+3);
+ if(first_slash == std::string::npos){
+ output[i] = holding.substr(0,scheme_start+3) + enc_inst.internal_url_encode(holding.substr(scheme_start+3));
+ } else {
+ output[i] = holding.substr(0,first_slash+1) + enc_inst.internal_url_encode(holding.substr(first_slash+1));
+ }
+ }
+ }
+ }
+
+ //Return
+ return output;
+}
+
+//'@title Split URLs into their component parts
+//'@description \code{url_parse} takes a vector of URLs and splits each one into its component
+//'parts, as recognised by RfC 3986.
+//'
+//'@param urls a vector of URLs
+//'
+//'@details It's useful to be able to take a URL and split it out into its component parts -
+//'for the purpose of hostname extraction, for example, or analysing API calls. This functionality
+//'is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
+//'implementation is entirely in R, uses regular expressions, and is not vectorised. It's
+//'perfectly suitable for the intended purpose (decomposition in the context of automated
+//'HTTP requests from R), but not for large-scale analysis.
+//'
+//'@return a data.frame consisting of the columns scheme, domain, port, path, parameter
+//'and fragment. See the \href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} for
+//'definitions. If an element cannot be identified, it is represented by NA.
+//'
+//'@examples
+//'url_parse("https://en.wikipedia.org/wiki/Article")
+//'
+//'@seealso \code{\link{url_parameters}} for extracting values associated with particular keys in a URL's
+//'query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
+//'
+//'@export
+//[[Rcpp::export]]
+DataFrame url_parse(CharacterVector urls){
+ CharacterVector& urls_ptr = urls;
+ parsing p_inst;
+ return p_inst.parse_to_df(urls_ptr);
+}
+
+//'@title Recompose Parsed URLs
+//'
+//'@description Sometimes you want to take a vector of URLs, parse them, perform
+//'some operations and then rebuild them. \code{url_compose} takes a data.frame produced
+//'by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
+//'as the vector initially thrown into url_parse).
+//'
+//'This is currently a `beta` feature; please do report bugs if you find them.
+//'
+//'@param parsed_urls a data.frame sourced from \code{\link{url_parse}}
+//'
+//'@seealso \code{\link{scheme}} and other accessors, which you may want to
+//'run URLs through before composing them to modify individual values.
+//'
+//'@examples
+//'#Parse a URL and compose it
+//'url <- "http://en.wikipedia.org"
+//'url_compose(url_parse(url))
+//'
+//'@export
+//[[Rcpp::export]]
+CharacterVector url_compose(DataFrame parsed_urls){
+ compose c_inst;
+ return c_inst.compose_multiple(parsed_urls);
+}
diff --git a/src/utf8.c b/src/utf8.c
new file mode 100644
index 0000000..c6e27a6
--- /dev/null
+++ b/src/utf8.c
@@ -0,0 +1,172 @@
+/*
+ Basic UTF-8 manipulation routines
+ by Jeff Bezanson
+ placed in the public domain Fall 2005
+
+ This code is designed to provide the utilities you need to manipulate
+ UTF-8 as an internal string encoding. These functions do not perform the
+ error checking normally needed when handling UTF-8 data, so if you happen
+ to be from the Unicode Consortium you will want to flay me alive.
+ I do this because error checking can be performed at the boundaries (I/O),
+ with these routines reserved for higher performance on data known to be
+ valid.
+ A UTF-8 validation routine is included.
+*/
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <wchar.h>
+#include <wctype.h>
+
+#ifdef WIN32
+#include <malloc.h>
+#define snprintf _snprintf
+#else
+#ifndef __FreeBSD__
+#include <alloca.h>
+#endif /* __FreeBSD__ */
+#endif
+#include <assert.h>
+
+#include "utf8.h"
+
+static const uint32_t offsetsFromUTF8[6] = {
+ 0x00000000UL, 0x00003080UL, 0x000E2080UL,
+ 0x03C82080UL, 0xFA082080UL, 0x82082080UL
+};
+
+static const char trailingBytesForUTF8[256] = {
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
+};
+
+/* returns length of next utf-8 sequence */
+size_t u8_seqlen(const char *s)
+{
+ return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]] + 1;
+}
+
+/* returns the # of bytes needed to encode a certain character
+ 0 means the character cannot (or should not) be encoded. */
+size_t u8_charlen(uint32_t ch)
+{
+ if (ch < 0x80)
+ return 1;
+ else if (ch < 0x800)
+ return 2;
+ else if (ch < 0x10000)
+ return 3;
+ else if (ch < 0x110000)
+ return 4;
+ return 0;
+}
+
+size_t u8_codingsize(uint32_t *wcstr, size_t n)
+{
+ size_t i, c=0;
+
+ for(i=0; i < n; i++)
+ c += u8_charlen(wcstr[i]);
+ return c;
+}
+
+/* conversions without error checking
+ only works for valid UTF-8, i.e. no 5- or 6-byte sequences
+ srcsz = source size in bytes
+ sz = dest size in # of wide characters
+
+ returns # characters converted
+ if sz == srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be enough space.
+*/
+size_t u8_toucs(uint32_t *dest, size_t sz, const char *src, size_t srcsz)
+{
+ uint32_t ch;
+ const char *src_end = src + srcsz;
+ size_t nb;
+ size_t i=0;
+
+ if (sz == 0 || srcsz == 0)
+ return 0;
+
+ while (i < sz) {
+ if (!isutf(*src)) { // invalid sequence
+ dest[i++] = 0xFFFD;
+ src++;
+ if (src >= src_end) break;
+ continue;
+ }
+ nb = trailingBytesForUTF8[(unsigned char)*src];
+ if (src + nb >= src_end)
+ break;
+ ch = 0;
+ switch (nb) {
+ /* these fall through deliberately */
+ case 5: ch += (unsigned char)*src++; ch <<= 6;
+ case 4: ch += (unsigned char)*src++; ch <<= 6;
+ case 3: ch += (unsigned char)*src++; ch <<= 6;
+ case 2: ch += (unsigned char)*src++; ch <<= 6;
+ case 1: ch += (unsigned char)*src++; ch <<= 6;
+ case 0: ch += (unsigned char)*src++;
+ }
+ ch -= offsetsFromUTF8[nb];
+ dest[i++] = ch;
+ }
+ return i;
+}
+
+
+
+/* srcsz = number of source characters
+ sz = size of dest buffer in bytes
+
+ returns # bytes stored in dest
+ the destination string will never be bigger than the source string.
+*/
+size_t u8_toutf8(char *dest, size_t sz, const uint32_t *src, size_t srcsz)
+{
+ uint32_t ch;
+ size_t i = 0;
+ char *dest0 = dest;
+ char *dest_end = dest + sz;
+
+ while (i < srcsz) {
+ ch = src[i];
+ if (ch < 0x80) {
+ if (dest >= dest_end)
+ break;
+ *dest++ = (char)ch;
+ }
+ else if (ch < 0x800) {
+ if (dest >= dest_end-1)
+ break;
+ *dest++ = (ch>>6) | 0xC0;
+ *dest++ = (ch & 0x3F) | 0x80;
+ }
+ else if (ch < 0x10000) {
+ if (dest >= dest_end-2)
+ break;
+ *dest++ = (ch>>12) | 0xE0;
+ *dest++ = ((ch>>6) & 0x3F) | 0x80;
+ *dest++ = (ch & 0x3F) | 0x80;
+ }
+ else if (ch < 0x110000) {
+ if (dest >= dest_end-3)
+ break;
+ *dest++ = (ch>>18) | 0xF0;
+ *dest++ = ((ch>>12) & 0x3F) | 0x80;
+ *dest++ = ((ch>>6) & 0x3F) | 0x80;
+ *dest++ = (ch & 0x3F) | 0x80;
+ }
+ i++;
+ }
+ return (dest-dest0);
+}
+
diff --git a/src/utf8.h b/src/utf8.h
new file mode 100644
index 0000000..a558efe
--- /dev/null
+++ b/src/utf8.h
@@ -0,0 +1,17 @@
+#ifndef UTF8_H
+#define UTF8_H
+
+extern int locale_is_utf8;
+
+/* is c the start of a utf8 sequence? */
+#define isutf(c) (((c)&0xC0)!=0x80)
+
+#define UEOF ((uint32_t)-1)
+
+/* convert UTF-8 data to wide character */
+size_t u8_toucs(uint32_t *dest, size_t sz, const char *src, size_t srcsz);
+
+/* the opposite conversion */
+size_t u8_toutf8(char *dest, size_t sz, const uint32_t *src, size_t srcsz);
+
+#endif
diff --git a/tests/testthat.R b/tests/testthat.R
new file mode 100644
index 0000000..9a8b7ee
--- /dev/null
+++ b/tests/testthat.R
@@ -0,0 +1,4 @@
+library(testthat)
+library(urltools)
+
+test_check("urltools")
diff --git a/tests/testthat/test_encoding.R b/tests/testthat/test_encoding.R
new file mode 100644
index 0000000..89450b3
--- /dev/null
+++ b/tests/testthat/test_encoding.R
@@ -0,0 +1,26 @@
+context("URL encoding tests")
+
+test_that("Check encoding doesn't encode the scheme", {
+ expect_that(url_encode("https://"), equals("https://"))
+})
+
+test_that("Check encoding does does not encode pre-path slashes", {
+ expect_that(url_encode("https://foo.org/bar/"), equals("https://foo.org/bar%2f"))
+})
+
+test_that("Check encoding can handle NAs", {
+ expect_that(url_encode(c("https://foo.org/bar/", NA)), equals(c("https://foo.org/bar%2f", NA)))
+})
+
+test_that("Check decoding can handle NAs", {
+ expect_that(url_decode(c("https://foo.org/bar%2f", NA)), equals(c("https://foo.org/bar/", NA)))
+})
+
+test_that("Check decoding and encoding are equivalent", {
+
+ url <- "Hinrichtung_auf_dem_Altst%c3%a4dter_Ring.JPG%2f120px-Hinrichtung_auf_dem_Altst%c3%a4dter_Ring.JPG"
+ decoded_url <- "Hinrichtung_auf_dem_Altstädter_Ring.JPG/120px-Hinrichtung_auf_dem_Altstädter_Ring.JPG"
+ expect_that(url_decode(url), equals(decoded_url))
+ expect_that(url_encode(decoded_url), equals(url))
+
+})
\ No newline at end of file
diff --git a/tests/testthat/test_get_set.R b/tests/testthat/test_get_set.R
new file mode 100644
index 0000000..0524dbe
--- /dev/null
+++ b/tests/testthat/test_get_set.R
@@ -0,0 +1,59 @@
+context("Component get/set tests")
+
+test_that("Check elements can be retrieved", {
+ url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
+ testthat::expect_equal(scheme(url), "https")
+ testthat::expect_equal(domain(url), "www.google.com")
+ testthat::expect_equal(port(url), "80")
+ testthat::expect_equal(path(url), "foo.php")
+ testthat::expect_equal(parameters(url), "api_params=turnip")
+ testthat::expect_equal(fragment(url), "ending")
+})
+
+test_that("Check elements can be retrieved with NAs", {
+ url <- as.character(NA)
+ testthat::expect_equal(is.na(scheme(url)), TRUE)
+ testthat::expect_equal(is.na(domain(url)), TRUE)
+ testthat::expect_equal(is.na(port(url)), TRUE)
+ testthat::expect_equal(is.na(path(url)), TRUE)
+ testthat::expect_equal(is.na(parameters(url)), TRUE)
+ testthat::expect_equal(is.na(fragment(url)), TRUE)
+})
+
+test_that("Check elements can be set", {
+ url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
+ scheme(url) <- "http"
+ testthat::expect_equal(scheme(url), "http")
+ domain(url) <- "www.wikipedia.org"
+ testthat::expect_equal(domain(url), "www.wikipedia.org")
+ port(url) <- "23"
+ testthat::expect_equal(port(url), "23")
+ path(url) <- "bar.php"
+ testthat::expect_equal(path(url), "bar.php")
+ parameters(url) <- "api_params=manic"
+ testthat::expect_equal(parameters(url), "api_params=manic")
+ fragment(url) <- "beginning"
+ testthat::expect_equal(fragment(url), "beginning")
+})
+
+test_that("Check elements can be set with NAs", {
+ url <- "https://www.google.com:80/"
+ scheme(url) <- "http"
+ testthat::expect_equal(scheme(url), "http")
+ domain(url) <- "www.wikipedia.org"
+ testthat::expect_equal(domain(url), "www.wikipedia.org")
+ port(url) <- "23"
+ testthat::expect_equal(port(url), "23")
+ path(url) <- "bar.php"
+ testthat::expect_equal(path(url), "bar.php")
+ parameters(url) <- "api_params=manic"
+ testthat::expect_equal(parameters(url), "api_params=manic")
+ fragment(url) <- "beginning"
+ testthat::expect_equal(fragment(url), "beginning")
+})
+
+test_that("Assigning NA with get will NA a URL", {
+ url <- "https://www.google.com:80/"
+ port(url) <- NA_character_
+ testthat::expect_true(is.na(url))
+})
diff --git a/tests/testthat/test_memory.R b/tests/testthat/test_memory.R
new file mode 100644
index 0000000..0cc6c69
--- /dev/null
+++ b/tests/testthat/test_memory.R
@@ -0,0 +1,30 @@
+context("Avoid regressions around proxy objects")
+
+test_that("Values are correctly disposed from memory",{
+ memfn <- function(d = NULL){
+ test_url <- "https://test.com"
+ if(!is.null(d)){
+ test_url <- urltools::param_set(test_url, "q" , urltools::url_encode(d))
+ }
+ return(test_url)
+ }
+
+ baseurl <- "https://test.com"
+ expect_equal(memfn(), baseurl)
+ expect_equal(memfn("blah"), paste0(baseurl, "?q=blah"))
+ expect_equal(memfn(), baseurl)
+})
+
+test_that("Parameters correctly add to output",{
+ outfn <- function(d = FALSE){
+ test_url <- "https://test.com"
+ if(d){
+ test_url <- urltools::param_set(test_url, "q", urltools::url_encode(d))
+ }
+ return(test_url)
+ }
+
+ baseurl <- "https://test.com"
+ expect_equal(outfn(), baseurl)
+ expect_equal(outfn(TRUE), paste0(baseurl, "?q=TRUE"))
+})
diff --git a/tests/testthat/test_parameters.R b/tests/testthat/test_parameters.R
new file mode 100644
index 0000000..819f554
--- /dev/null
+++ b/tests/testthat/test_parameters.R
@@ -0,0 +1,61 @@
+context("Test parameter manipulation")
+
+test_that("Parameter parsing can handle multiple, non-existent and pre-trailing parameters",{
+ urls <- c("https://www.google.com:80/foo.php?api_params=parsable&this_parameter=selfreferencing&hiphop=awesome",
+ "https://www.google.com:80/foo.php?api_params=parsable&this_parameter=selfreferencing&hiphop=awesome#foob",
+ "https://www.google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome")
+ results <- param_get(urls, c("api_params","hiphop"))
+ expect_that(results[1:2,1], equals(c("parsable","parsable")))
+ expect_true(is.na(results[3,1]))
+
+})
+
+test_that("Parameter parsing works where the parameter appears earlier in the URL", {
+ url <- param_get("www.housetrip.es/tos-de-vacaciones/geo?from=01/04/2015&guests=4&to=05/04/2015","to")
+ expect_that(ncol(url), equals(1))
+ expect_that(url$to[1], equals("05/04/2015"))
+})
+
+test_that("Setting parameter values works", {
+ expect_true(param_set("https://en.wikipedia.org/wiki/api.php", "baz", "quorn") ==
+ "https://en.wikipedia.org/wiki/api.php?baz=quorn")
+ expect_true(param_set("https://en.wikipedia.org/wiki/api.php?foo=bar&baz=qux", "baz", "quorn") ==
+ "https://en.wikipedia.org/wiki/api.php?foo=bar&baz=quorn")
+ expect_true(param_set("https://en.wikipedia.org/wiki/api.php?foo=bar", "baz", "quorn") ==
+ "https://en.wikipedia.org/wiki/api.php?foo=bar&baz=quorn")
+})
+
+test_that("Setting parameter values quietly fails with NA components", {
+ url <- "https://en.wikipedia.org/api.php?action=query"
+ expect_identical(url, param_set(url, "action", NA_character_))
+ expect_true(is.na(param_set(NA_character_, "action", "foo")))
+ expect_identical(url, param_set(url, NA_character_, "pageinfo"))
+})
+
+
+test_that("Removing parameter entries quietly fails with NA components", {
+ url <- "https://en.wikipedia.org/api.php?action=query"
+ expect_identical(url, param_remove(url, "foo"))
+ expect_true(is.na(param_remove(NA_character_, "action")))
+})
+
+test_that("Removing parameter keys works", {
+ expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux", "baz") ==
+ "https://en.wikipedia.org/api.php")
+})
+
+test_that("Removing parameter keys works when there are multiple parameters in the URL", {
+ expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux&foo=bar", "baz") ==
+ "https://en.wikipedia.org/api.php?foo=bar")
+})
+
+test_that("Removing parameter keys works when there are multiple parameters to remove", {
+ expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux&foo=bar", c("baz","foo")) ==
+ "https://en.wikipedia.org/api.php")
+})
+
+test_that("Removing parameter keys works when there is no query", {
+ expect_true(param_remove("https://en.wikipedia.org/api.php", "baz") ==
+ "https://en.wikipedia.org/api.php")
+})
+
diff --git a/tests/testthat/test_parsing.R b/tests/testthat/test_parsing.R
new file mode 100644
index 0000000..385a64e
--- /dev/null
+++ b/tests/testthat/test_parsing.R
@@ -0,0 +1,68 @@
+context("URL parsing tests")
+
+test_that("Check parsing identifies each RfC element", {
+
+ data <- url_parse("https://www.google.com:80/foo.php?api_params=turnip#ending")
+ expect_that(ncol(data), equals(6))
+ expect_that(names(data), equals(c("scheme","domain","port","path","parameter","fragment")))
+ expect_that(data$scheme[1], equals("https"))
+ expect_that(data$domain[1], equals("www.google.com"))
+ expect_that(data$port[1], equals("80"))
+ expect_that(data$path[1], equals("foo.php"))
+ expect_that(data$parameter[1], equals("api_params=turnip"))
+ expect_that(data$fragment[1], equals("ending"))
+})
+
+test_that("Check parsing can handle missing elements", {
+
+ data <- url_parse("https://www.google.com/foo.php?api_params=turnip#ending")
+ expect_that(ncol(data), equals(6))
+ expect_that(names(data), equals(c("scheme","domain","port","path","parameter","fragment")))
+ expect_that(data$scheme[1], equals("https"))
+ expect_that(data$domain[1], equals("www.google.com"))
+ expect_true(is.na(data$port[1]))
+ expect_that(data$path[1], equals("foo.php"))
+ expect_that(data$parameter[1], equals("api_params=turnip"))
+ expect_that(data$fragment[1], equals("ending"))
+})
+
+test_that("Parsing does not up and die and misplace the fragment",{
+ data <- url_parse("http://www.yeastgenome.org/locus/S000005366/overview#protein")
+ expect_that(data$fragment[1], equals("protein"))
+})
+
+test_that("Composing works",{
+ url <- c("http://foo.bar.baz/qux/", "https://en.wikipedia.org:4000/wiki/api.php")
+ amended_url <- url_compose(url_parse(url))
+ expect_that(url, equals(amended_url))
+})
+
+test_that("Port handling works", {
+ url <- "https://en.wikipedia.org:4000/wiki/api.php"
+ expect_that(port(url), equals("4000"))
+ expect_that(path(url), equals("wiki/api.php"))
+ url <- "https://en.wikipedia.org:4000"
+ expect_that(port(url), equals("4000"))
+ expect_true(is.na(path(url)))
+ url <- "https://en.wikipedia.org:4000/"
+ expect_that(port(url), equals("4000"))
+ expect_true(is.na(path(url)))
+ url <- "https://en.wikipedia.org:4000?foo=bar"
+ expect_that(port(url), equals("4000"))
+ expect_true(is.na(path(url)))
+ expect_that(parameters(url), equals("foo=bar"))
+})
+
+test_that("Port handling does not break path handling", {
+ url <- "https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg"
+ expect_true(is.na(port(url)))
+ expect_that(path(url), equals("wiki/File:Vice_City_Public_Radio_(logo).jpg"))
+})
+
+test_that("URLs with parameters but no paths work", {
+ url <- url_parse("http://www.nextpedition.com?inav=menu_travel_nextpedition")
+ expect_true(url$domain[1] == "www.nextpedition.com")
+ expect_true(is.na(url$port[1]))
+ expect_true(is.na(url$path[1]))
+ expect_true(url$parameter[1] == "inav=menu_travel_nextpedition")
+})
\ No newline at end of file
diff --git a/tests/testthat/test_puny.R b/tests/testthat/test_puny.R
new file mode 100644
index 0000000..c5a1331
--- /dev/null
+++ b/tests/testthat/test_puny.R
@@ -0,0 +1,47 @@
+context("Check punycode handling")
+
+testthat::test_that("Simple punycode domain encoding works", {
+ testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com/foo")),
+ "https://www.xn--bcher-kva.com/foo")
+})
+
+testthat::test_that("Punycode domain encoding works with fragmentary paths", {
+ testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com/")),
+ "https://www.xn--bcher-kva.com/")
+})
+
+testthat::test_that("Punycode domain encoding works with ports", {
+ testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com:80")),
+ "https://www.xn--bcher-kva.com:80")
+})
+
+testthat::test_that("Punycode domain encoding returns an NA on NAs", {
+ testthat::expect_true(is.na(puny_encode(NA_character_)))
+})
+
+testthat::test_that("Simple punycode domain decoding works", {
+ testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com/foo"),
+ enc2utf8("https://www.b\u00FCcher.com/foo"))
+})
+
+testthat::test_that("Punycode domain decoding works with fragmentary paths", {
+ testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com/"),
+ enc2utf8("https://www.b\u00FCcher.com/"))
+})
+
+testthat::test_that("Punycode domain decoding works with ports", {
+ testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com:80"),
+ enc2utf8("https://www.b\u00FCcher.com:80"))
+})
+
+testthat::test_that("Punycode domain decoding returns an NA on NAs", {
+ testthat::expect_true(is.na(puny_decode(NA_character_)))
+})
+
+testthat::test_that("Punycode domain decoding returns an NA on invalid entries", {
+ testthat::expect_true(is.na(suppressWarnings(puny_decode("xn--9"))))
+})
+
+testthat::test_that("Punycode domain decoding warns on invalid entries", {
+ testthat::expect_warning(puny_decode("xn--9"))
+})
\ No newline at end of file
diff --git a/tests/testthat/test_suffixes.R b/tests/testthat/test_suffixes.R
new file mode 100644
index 0000000..b47afc4
--- /dev/null
+++ b/tests/testthat/test_suffixes.R
@@ -0,0 +1,108 @@
+context("Test suffix extraction")
+
+test_that("Suffix extraction works with simple domains",{
+ result <- suffix_extract("en.wikipedia.org")
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(1))
+
+ expect_that(result$subdomain[1], equals("en"))
+ expect_that(result$domain[1], equals("wikipedia"))
+ expect_that(result$suffix[1], equals("org"))
+})
+
+test_that("Suffix extraction works with multiple domains",{
+ result <- suffix_extract(c("en.wikipedia.org","en.wikipedia.org"))
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(2))
+
+ expect_that(result$subdomain[1], equals("en"))
+ expect_that(result$domain[1], equals("wikipedia"))
+ expect_that(result$suffix[1], equals("org"))
+ expect_that(result$subdomain[2], equals("en"))
+ expect_that(result$domain[2], equals("wikipedia"))
+ expect_that(result$suffix[2], equals("org"))
+})
+
+test_that("Suffix extraction works when the domain is the same as the suffix",{
+ result <- suffix_extract(c("googleapis.com", "myapi.googleapis.com"))
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(2))
+
+ expect_equal(result$subdomain[1], NA_character_)
+ expect_equal(result$domain[1], NA_character_)
+ expect_equal(result$suffix[1], "googleapis.com")
+ expect_equal(result$subdomain[2], NA_character_)
+ expect_equal(result$domain[2], "myapi")
+ expect_equal(result$suffix[2], "googleapis.com")
+})
+
+test_that("Suffix extraction works where domains/suffixes overlap", {
+ result <- suffix_extract(domain("http://www.converse.com")) # could be se.com or .com
+ expect_equal(result$subdomain[1], "www")
+ expect_equal(result$domain[1], "converse")
+ expect_equal(result$suffix[1], "com")
+})
+
+test_that("Suffix extraction works when the domain matches a wildcard suffix",{
+ result <- suffix_extract(c("banana.bd", "banana.boat.bd"))
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(2))
+
+ expect_equal(result$subdomain[1], NA_character_)
+ expect_equal(result$domain[1], NA_character_)
+ expect_equal(result$suffix[1], "banana.bd")
+ expect_equal(result$subdomain[2], NA_character_)
+ expect_equal(result$domain[2], "banana")
+ expect_equal(result$suffix[2], "boat.bd")
+})
+
+test_that("Suffix extraction works when the domain matches a wildcard suffix and has subdomains",{
+ result <- suffix_extract(c("foo.bar.banana.bd"))
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(1))
+ expect_equal(result$subdomain[1], "foo")
+ expect_equal(result$domain[1], "bar")
+ expect_equal(result$suffix[1], "banana.bd")
+})
+
+
+test_that("Suffix extraction works with new suffixes",{
+ result <- suffix_extract("en.wikipedia.org", suffix_refresh())
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(1))
+
+ expect_that(result$subdomain[1], equals("en"))
+ expect_that(result$domain[1], equals("wikipedia"))
+ expect_that(result$suffix[1], equals("org"))
+})
+
+test_that("Suffix extraction works with an arbitrary suffixes database (to ensure it is loading it)",{
+ result <- suffix_extract(c("is-this-a.bananaboat", "en.wikipedia.org"), data.frame(suffixes = "bananaboat"))
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(2))
+
+ expect_equal(result$subdomain[1], NA_character_)
+ expect_equal(result$domain[1], "is-this-a")
+ expect_equal(result$suffix[1], "bananaboat")
+ expect_equal(result$subdomain[2], NA_character_)
+ expect_equal(result$domain[2], NA_character_)
+ expect_equal(result$suffix[2], NA_character_)
+})
+
+test_that("Suffix extraction is back to normal using the internal database when it receives suffixes=NULL",{
+ result <- suffix_extract("en.wikipedia.org")
+ expect_that(ncol(result), equals(4))
+ expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+ expect_that(nrow(result), equals(1))
+
+ expect_that(result$subdomain[1], equals("en"))
+ expect_that(result$domain[1], equals("wikipedia"))
+ expect_that(result$suffix[1], equals("org"))
+})
\ No newline at end of file
diff --git a/vignettes/urltools.Rmd b/vignettes/urltools.Rmd
new file mode 100644
index 0000000..beb3db3
--- /dev/null
+++ b/vignettes/urltools.Rmd
@@ -0,0 +1,182 @@
+<!--
+%\VignetteEngine{knitr::knitr}
+%\VignetteIndexEntry{urltools}
+-->
+
+## Elegant URL handling with urltools
+
+URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
+to create connections to retrieve datasets. This is an essential feature for the language to have,
+but it also means that URL handlers are designed for situations where URLs *get* you to the data -
+not situations where URLs *are* the data.
+
+There is no support for encoding or decoding URLs en masse, and no support for parsing and
+interpreting them. `urltools` provides this support!
+
+### URL encoding and decoding
+
+Base R provides two functions - <code>URLdecode</code> and <code>URLencode</code> - for taking percentage-encoded
+URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
+enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
+make them unsuitable for large datasets.
+
+Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
+<code>URLdecode</code>, for example, breaks if the decoded value is out of range:
+
+```{r, eval=FALSE}
+URLdecode("test%gIL")
+Error in rawToChar(out) : embedded nul in string: '\0L'
+In addition: Warning message:
+In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+```
+
+URLencode, on the other hand, encodes slashes on its most strict setting - without
+paying attention to where those slashes *are*: if we attempt to URLencode an entire URL, we get:
+
+```{r, eval=FALSE}
+URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+```
+That's a completely unusable URL (or ewRL, if you will).
+
+urltools replaces both functions with <code>url\_decode</code> and <code>url\_encode</code> respectively:
+```{r, eval=FALSE}
+library(urltools)
+url_decode("test%gIL")
+[1] "test"
+url_encode("https://en.wikipedia.org/wiki/Article")
+[1] "https://en.wikipedia.org%2fwiki%2fArticle"
+```
+
+As you can see, <code>url\_decode</code> simply excludes out-of-range characters from consideration, while <code>url\_encode</code> detects characters that make up part of the URL's scheme and leaves them unencoded. Both are extremely fast; with `urltools`, you can
+decode a vector of 1,000,000 URLs in 0.9 seconds.
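+
+Because both functions are vectorised, a whole vector of URLs can be handled in a single
+call, with no need to loop or `sapply` from R. A small illustrative sketch (any URLs will do):
+
+```{r, eval=FALSE}
+urls <- c("https://en.wikipedia.org/wiki/File:Article_(1).jpg",
+          "https://en.wikipedia.org/wiki/File:Article_(2).jpg")
+encoded <- url_encode(urls)
+url_decode(encoded)
+```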
+
+Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and have unicode characters in them. These also take one argument, a vector of URLs, and can be found at `puny_encode` and `puny_decode` respectively.
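+
+A quick sketch of how they are used (the URL here is purely illustrative):
+
+```{r, eval=FALSE}
+puny_encode("https://www.bücher.com/foo")
+# [1] "https://www.xn--bcher-kva.com/foo"
+puny_decode("https://www.xn--bcher-kva.com/foo")
+# [1] "https://www.bücher.com/foo"
+```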
+
+### URL parsing
+
+Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
+you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
+but not the entire thing as one string.
+
+The solution is <code>url_parse</code>, which takes a URL and breaks it out into its [RfC 3986](http://www.ietf.org/rfc/rfc3986.txt) components: scheme, domain, port, path, query string and fragment identifier. This is,
+again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
+results are provided as a data.frame, since most people use data.frames to store data.
+
+```{r, eval=FALSE}
+> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+> str(parsed_address)
+'data.frame': 1 obs. of 6 variables:
+ $ scheme : chr "https"
+ $ domain : chr "en.wikipedia.org"
+ $ port : chr NA
+ $ path : chr "wiki/Article"
+ $ parameter: chr NA
+ $ fragment : chr NA
+```
+
+We can also perform the opposite of this operation with `url_compose`:
+```{r, eval=FALSE}
+> url_compose(parsed_address)
+[1] "https://en.wikipedia.org/wiki/article"
+```
+
+### Getting/setting URL components
+With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
+and setting. Syntax is identical to that of `lubridate`, but uses URL components as function names.
+
+```{r, eval=FALSE}
+url <- "https://en.wikipedia.org/wiki/Article"
+scheme(url)
+"https"
+scheme(url) <- "ftp"
+url
+"ftp://en.wikipedia.org/wiki/Article"
+```
+Fields that can be extracted or set are <code>scheme</code>, <code>domain</code>, <code>port</code>, <code>path</code>,
+<code>parameters</code> and <code>fragment</code>.
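+
+The same assignment syntax works for each of them; a brief sketch (the values are purely
+illustrative):
+
+```{r, eval=FALSE}
+url <- "https://en.wikipedia.org/wiki/Article"
+path(url) <- "wiki/sandbox"
+path(url)
+# [1] "wiki/sandbox"
+fragment(url) <- "history"
+fragment(url)
+# [1] "history"
+```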
+
+### Suffix and TLD extraction
+
+Once we've extracted a domain from a URL with `domain` or `url_parse`, we can identify which bit is the domain name, and which
+bit is the suffix:
+
+```{r, eval=FALSE}
+> url <- "https://en.wikipedia.org/wiki/Article"
+> domain_name <- domain(url)
+> domain_name
+[1] "en.wikipedia.org"
+> str(suffix_extract(domain_name))
+'data.frame': 1 obs. of 4 variables:
+ $ host : chr "en.wikipedia.org"
+ $ subdomain: chr "en"
+ $ domain : chr "wikipedia"
+ $ suffix : chr "org"
+```
+
+This relies on an internal database of public suffixes, accessible at `suffix_dataset` - we recognise, though,
+that this dataset may get a bit out of date, so you can also pass the results of the `suffix_refresh` function,
+which retrieves an updated dataset, to `suffix_extract`:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_suffixes <- suffix_refresh()
+suffix_extract(domain_name, updated_suffixes)
+```
+
+We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are `tld_refresh`, `tld_extract` and `tld_dataset`.
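+
+A quick sketch along exactly the same lines as the suffix example above:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_tlds <- tld_refresh()
+tld_extract(domain_name, updated_tlds)
+```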
+
+In the other direction we have `host_extract`, which retrieves, well, the host! If the URL has subdomains, it'll be the
+lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+host_extract(domain_name)
+```
+### Query manipulation
+Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
+an example, take the URL `http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json`. What
+pageID is being used? What is the export format? We can find out with `param_get`.
+
+```{r, eval=FALSE}
+> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+ parameter_names = c("pageid","export")))
+'data.frame': 1 obs. of 2 variables:
+ $ pageid: chr "1023"
+ $ export: chr "json"
+```
+
+This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
+might have, or strip them out entirely.
+
+To modify the values, we use `param_set`:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+```
+
+As you can see this works pretty well; it even works in situations where the URL doesn't *have* a query yet:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+```
+
+On the other hand we might have a parameter we just don't want any more - that can be handled with `param_remove`, which can
+take multiple parameters as well as multiple URLs:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_remove(url, keys = c("action","export"))
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+```
+
+### Other URL handlers
+If you have ideas for other URL handlers that would make your data processing easier, the best approach
+is to either [request it](https://github.com/Ironholds/urltools/issues) or [add it](https://github.com/Ironholds/urltools/pulls)!
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-cran-urltools.git