[med-svn] [r-cran-urltools] 03/09: New upstream version 1.6.0

Andreas Tille tille at debian.org
Thu Nov 30 15:01:57 UTC 2017


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository r-cran-urltools.

commit 7512ab10b81fda7b1a643c6c95e09db78233a9ee
Author: Andreas Tille <tille at debian.org>
Date:   Thu Nov 30 15:48:00 2017 +0100

    New upstream version 1.6.0
---
 DESCRIPTION                      |  30 +++
 LICENSE                          |   2 +
 MD5                              |  65 +++++++
 NAMESPACE                        |  34 ++++
 NEWS                             | 175 ++++++++++++++++++
 R/RcppExports.R                  | 262 ++++++++++++++++++++++++++
 R/accessors.R                    | 202 ++++++++++++++++++++
 R/suffix.R                       | 265 +++++++++++++++++++++++++++
 R/urltools.R                     |  21 +++
 R/zzz.R                          |  22 +++
 README.md                        |  39 ++++
 build/vignette.rds               | Bin 0 -> 195 bytes
 data/suffix_dataset.rda          | Bin 0 -> 36976 bytes
 data/tld_dataset.rda             | Bin 0 -> 6099 bytes
 debian/README.test               |   8 -
 debian/changelog                 |   5 -
 debian/compat                    |   1 -
 debian/control                   |  29 ---
 debian/copyright                 |  47 -----
 debian/docs                      |   3 -
 debian/rules                     |   5 -
 debian/source/format             |   1 -
 debian/tests/control             |   5 -
 debian/tests/run-unit-test       |  17 --
 debian/watch                     |   2 -
 inst/doc/urltools.R              |  86 +++++++++
 inst/doc/urltools.Rmd            | 182 +++++++++++++++++++
 inst/doc/urltools.html           | 384 +++++++++++++++++++++++++++++++++++++++
 man/domain.Rd                    |  34 ++++
 man/encoder.Rd                   |  66 +++++++
 man/fragment.Rd                  |  34 ++++
 man/host_extract.Rd              |  32 ++++
 man/param_get.Rd                 |  38 ++++
 man/param_remove.Rd              |  34 ++++
 man/param_set.Rd                 |  42 +++++
 man/parameters.Rd                |  35 ++++
 man/path.Rd                      |  34 ++++
 man/port.Rd                      |  34 ++++
 man/puny.Rd                      |  37 ++++
 man/scheme.Rd                    |  37 ++++
 man/suffix_dataset.Rd            |  28 +++
 man/suffix_extract.Rd            |  53 ++++++
 man/suffix_refresh.Rd            |  34 ++++
 man/tld_dataset.Rd               |  23 +++
 man/tld_extract.Rd               |  40 ++++
 man/tld_refresh.Rd               |  34 ++++
 man/url_compose.Rd               |  30 +++
 man/url_parse.Rd                 |  37 ++++
 man/urltools.Rd                  |  17 ++
 src/Makevars                     |   1 +
 src/RcppExports.cpp              | 182 +++++++++++++++++++
 src/accessors.cpp                |  37 ++++
 src/compose.cpp                  |  68 +++++++
 src/compose.h                    |  58 ++++++
 src/encoding.cpp                 |  92 ++++++++++
 src/encoding.h                   |  67 +++++++
 src/param.cpp                    | 112 ++++++++++++
 src/parameter.cpp                | 175 ++++++++++++++++++
 src/parameter.h                  |  93 ++++++++++
 src/parsing.cpp                  | 238 ++++++++++++++++++++++++
 src/parsing.h                    | 141 ++++++++++++++
 src/puny.cpp                     | 226 +++++++++++++++++++++++
 src/punycode.c                   | 289 +++++++++++++++++++++++++++++
 src/punycode.h                   | 108 +++++++++++
 src/suffix.cpp                   | 145 +++++++++++++++
 src/urltools.cpp                 | 184 +++++++++++++++++++
 src/utf8.c                       | 172 ++++++++++++++++++
 src/utf8.h                       |  17 ++
 tests/testthat.R                 |   4 +
 tests/testthat/test_encoding.R   |  26 +++
 tests/testthat/test_get_set.R    |  59 ++++++
 tests/testthat/test_memory.R     |  30 +++
 tests/testthat/test_parameters.R |  61 +++++++
 tests/testthat/test_parsing.R    |  68 +++++++
 tests/testthat/test_puny.R       |  47 +++++
 tests/testthat/test_suffixes.R   | 108 +++++++++++
 vignettes/urltools.Rmd           | 182 +++++++++++++++++++
 77 files changed, 5512 insertions(+), 123 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
new file mode 100644
index 0000000..9dc7274
--- /dev/null
+++ b/DESCRIPTION
@@ -0,0 +1,30 @@
+Package: urltools
+Type: Package
+Title: Vectorised Tools for URL Handling and Parsing
+Version: 1.6.0
+Date: 2016-10-12
+Author: Oliver Keyes [aut, cre], Jay Jacobs [aut, cre], Drew Schmidt [aut],
+    Mark Greenaway [ctb], Bob Rudis [ctb], Alex Pinto [ctb], Maryam Khezrzadeh [ctb], 
+    Adam M. Costello [cph], Jeff Bezanson [cph]
+Maintainer: Oliver Keyes <ironholds at gmail.com>
+Description: A toolkit for all URL-handling needs, including encoding and decoding,
+    parsing, parameter extraction and modification. All functions are
+    designed to be both fast and entirely vectorised. It is intended to be
+    useful for people dealing with web-related datasets, such as server-side
+    logs, although it may be useful for other situations involving large sets of
+    URLs.
+License: MIT + file LICENSE
+LazyData: TRUE
+LinkingTo: Rcpp
+Imports: Rcpp, methods, triebeard
+Suggests: testthat, knitr
+URL: https://github.com/Ironholds/urltools/
+BugReports: https://github.com/Ironholds/urltools/issues
+VignetteBuilder: knitr
+RoxygenNote: 5.0.1
+Encoding: UTF-8
+Depends: R (>= 2.10)
+NeedsCompilation: yes
+Packaged: 2016-10-16 13:19:23 UTC; ironholds
+Repository: CRAN
+Date/Publication: 2016-10-17 00:43:16
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..ebbb227
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,2 @@
+YEAR: 2014
+COPYRIGHT HOLDER: Oliver Keyes
\ No newline at end of file
diff --git a/MD5 b/MD5
new file mode 100644
index 0000000..eca9994
--- /dev/null
+++ b/MD5
@@ -0,0 +1,65 @@
+2232d0cadef5f6970d6fc56f14d2f545 *DESCRIPTION
+1d9678dbfe1732b5d2c521e07b2ceef0 *LICENSE
+e2f6b30a8006b3ca050d5175156c7fe3 *NAMESPACE
+0c2e9fab14fd100d3932e793c8397a35 *NEWS
+9cf0fa24c4282284d492c92f9e58ce07 *R/RcppExports.R
+cf7a242daa691c3e888f3fadf29eab8c *R/accessors.R
+f323200797b8d7d82c02958355d79cbd *R/suffix.R
+93c2f49af67ce6e11579f17898d19f21 *R/urltools.R
+d101c8875ce174214696cd65e7af61fe *R/zzz.R
+c796e3e3545b201327e524aab77b7138 *README.md
+4e47a0883d28b7040f18b65b59778063 *build/vignette.rds
+c924aa202b18a3de5b29cb4ecfd8bb67 *data/suffix_dataset.rda
+a8544a607fdee8a4b89953c2707b4e7a *data/tld_dataset.rda
+c4794a2695511ab6ba493c38720c6d6a *inst/doc/urltools.R
+2bfbb1b33412b3b272caadf2203789f2 *inst/doc/urltools.Rmd
+821c6d655b2b870f3fb450034bcaa8d6 *inst/doc/urltools.html
+69e6f1e8788ee583ea94aa8a48bf27cb *man/domain.Rd
+2eb37077109f1eb71fed124c85899ae0 *man/encoder.Rd
+1b6c21cff37766aa6639e192779d0d33 *man/fragment.Rd
+af6ff0288e5f0f7494f7910b85285407 *man/host_extract.Rd
+522eaeabd45044c5e57e64b7682b56a4 *man/param_get.Rd
+a7f554a2e090b4047e4d3901d311abb0 *man/param_remove.Rd
+cd3f28b0e44fb741180320299f6b74ec *man/param_set.Rd
+16c239f5f857ee21e467e535c8d9013a *man/parameters.Rd
+4bce2cf14c3a5d91e9baceb29d9c7327 *man/path.Rd
+49509b1bf42e3f3a23b7f510453e50fd *man/port.Rd
+c853cf55bd9d4f3fc406d312becc1b20 *man/puny.Rd
+a47a88872f2a9183c3ec444dd1cf52db *man/scheme.Rd
+f2ab57b9dbc038074ca52644fd82a6ba *man/suffix_dataset.Rd
+439cbef564ed6852bf36580d8c4923e2 *man/suffix_extract.Rd
+a92be408e3af6d00e2fe6cc8bc0a0b48 *man/suffix_refresh.Rd
+d45d75d759cb33c7010e4e3043ff40d8 *man/tld_dataset.Rd
+476adfa8c63aefb313a01def7958b134 *man/tld_extract.Rd
+2f77fd184ba17c3f752b492121021e15 *man/tld_refresh.Rd
+1e86cc89689a9d685d9cd743e4828938 *man/url_compose.Rd
+68c467bf2f96256852842d4095464b64 *man/url_parse.Rd
+5f069622f935c9c9a0270d8226b46ccc *man/urltools.Rd
+b0dee8aa6fb1b7b7044247f95d69ca53 *src/Makevars
+55b45d775dfa8f55ed28f126b4682e5e *src/RcppExports.cpp
+aa992a9862464b4faddb7fa13b7b1cc0 *src/accessors.cpp
+58de4ead82f3ff4a8a7ea20fafe32278 *src/compose.cpp
+265f752d10ea607a576eaaf97d9be942 *src/compose.h
+7f4f2fdd72e83a364536ece419e7de37 *src/encoding.cpp
+6e97f641de365133cb986cbbc5dec856 *src/encoding.h
+5f617e95cb9b5a04a10776386eae5d2a *src/param.cpp
+1a8feb44aa41d5cd35203897ce133f83 *src/parameter.cpp
+794885585068e8c76f4b996e638170aa *src/parameter.h
+a8d804bb2bd63abedc650ff567cd35b7 *src/parsing.cpp
+0c1c58fe75c930766b853f09424e3a0c *src/parsing.h
+b5e867d5bd27fb7f1c9bc276a47edaf8 *src/puny.cpp
+3d86c99b18baecd835083c425090d9cb *src/punycode.c
+b4e4b506528208635ead995600c538ba *src/punycode.h
+50f49742d5b8da5101092aeea4622fa3 *src/suffix.cpp
+090f4c8d1751d348cbb45c35a71b5f12 *src/urltools.cpp
+85e63230a4eaeb1200891c02de40193a *src/utf8.c
+3333d69c11f25242049d1d226d599b94 *src/utf8.h
+9e9970bb4d6e50ba34bab76c8bebcfc6 *tests/testthat.R
+f60de02a5a42405ef86e58c919029e94 *tests/testthat/test_encoding.R
+a9189dfb91afb312c18b9a8142c6b266 *tests/testthat/test_get_set.R
+97a5e4be008b21d5b0c97df21f576c51 *tests/testthat/test_memory.R
+5e2ef2cea7502986e64343431e2b5fb3 *tests/testthat/test_parameters.R
+b96d41814df04f1d374e4814a30d75bd *tests/testthat/test_parsing.R
+3e624b6a700ba5fa0a8e85f24de9ba8d *tests/testthat/test_puny.R
+536a7b5df0d453e38d82f5738c5b2f8b *tests/testthat/test_suffixes.R
+2bfbb1b33412b3b272caadf2203789f2 *vignettes/urltools.Rmd
diff --git a/NAMESPACE b/NAMESPACE
new file mode 100644
index 0000000..fb21c20
--- /dev/null
+++ b/NAMESPACE
@@ -0,0 +1,34 @@
+# Generated by roxygen2: do not edit by hand
+
+export("domain<-")
+export("fragment<-")
+export("parameters<-")
+export("path<-")
+export("port<-")
+export("scheme<-")
+export(domain)
+export(fragment)
+export(host_extract)
+export(param_get)
+export(param_remove)
+export(param_set)
+export(parameters)
+export(path)
+export(port)
+export(puny_decode)
+export(puny_encode)
+export(scheme)
+export(suffix_extract)
+export(suffix_refresh)
+export(tld_extract)
+export(tld_refresh)
+export(url_compose)
+export(url_decode)
+export(url_encode)
+export(url_parameters)
+export(url_parse)
+import(methods)
+importFrom(Rcpp,sourceCpp)
+importFrom(triebeard,longest_match)
+importFrom(triebeard,trie)
+useDynLib(urltools)
diff --git a/NEWS b/NEWS
new file mode 100644
index 0000000..58627c6
--- /dev/null
+++ b/NEWS
@@ -0,0 +1,175 @@
+
+Version 1.6.0 [WIP]
+-------------------------------------------------------------------------
+
+FEATURES
+* Full punycode encoding and decoding support, thanks to Drew Schmidt.
+* param_get, param_set and param_remove are all fully capable of handling NA values.
+* component setting functions can now assign even when the previous value was NA.
+
+Version 1.5.2
+-------------------------------------------------------------------------
+
+BUGS
+* Custom suffix lists were not working properly.
+
+Version 1.5.1
+-------------------------------------------------------------------------
+
+BUGS
+* Fixed a bug in which punycode TLDs were excluded from TLD extraction (thanks to
+  Alex Pinto for pointing that out) #51
+* param_get now returns NAs for missing values, rather than empty strings (thanks to Josh Izzard for the report) #49
+* suffix_extract now no longer goofs if the domain+suffix combo overlaps with a valid suffix (thanks to Maryam Khezrzadeh and Alex Pinto) #50
+
+DEVELOPMENT
+* Removed the non-portable -g compiler flag in response to CRAN feedback.
+
+Version 1.5.0
+-------------------------------------------------------------------------
+FEATURES
+
+* Using tries as a data structure (see https://github.com/Ironholds/triebeard), we've increased the speed of suffix_extract() (instead of taking twenty seconds to process a million domains, it now takes... one.)
+* A dataset of top-level domains (TLDs) is now available as data(tld_dataset)
+* suffix_refresh() has been reinstated, and can be used with suffix_extract() to ensure suffix
+extraction is done with the most up-to-date dataset version possible.
+* tld_extract() and tld_refresh() mirror the functionality of suffix_extract() and suffix_refresh()
+
+BUG FIXES
+* host_extract() lets you get the host (the lowest-level subdomain, or the domain itself if no subdomain
+is present) from the `domain` fragment of a parsed URL.
+* Code from Jay Jacobs has allowed us to include a best-guess at the org name in the suffix dataset.
+* url_parameters is now deprecated, and has been marked as such.
+
+DEVELOPMENT
+* The instantiation and processing of suffix and TLD datasets on load marginally increases
+the speed of both (if you're calling suffix/TLD related functions more than once a session)
+
+Version 1.4.0
+-------------------------------------------------------------------------
+
+BUG FIXES
+* Full NA support is now available!
+
+DEVELOPMENT
+* A substantial (20%) speed increase is now available thanks to internal
+refactoring.
+
+Version 1.3.3
+-------------------------------------------------------------------------
+
+BUG FIXES
+* url_parse no longer lower-cases URLs (case sensitivity is Important) thanks to GitHub user 17843
+
+DOCUMENTATION
+* A note on NAs (as reported by Alex Pinto) added to the vignette
+* Mention Bob Rudis's 'punycode' package.
+
+Version 1.3.2
+-------------------------------------------------------------------------
+
+BUG FIXES
+* Fixed a critical bug impacting URLs with colons in the path
+
+Version 1.3.1
+-------------------------------------------------------------------------
+
+CHANGES
+* suffix_refresh has been removed, since LazyData's parameters prevented it from functioning; thanks to
+  Alex Pinto for the initial bug report and Hadley Wickham for confirming the possible solutions.
+
+BUG FIXES
+* the parser was not properly handling ports; thanks to a report from Rich FitzJohn, this is now fixed.
+
+Version 1.3.0
+-------------------------------------------------------------------------
+
+NEW FEATURES
+* param_set() for inserting or modifying key/value pairs in URL query strings.
+* param_remove() added for stripping key/value pairs out of URL query strings.
+
+CHANGES
+* url_parameters has been renamed to param_get() under the new naming scheme - url_parameters still exists, however,
+for the purpose of backwards-compatibility.
+
+BUG FIXES
+* Fixed a bug reported by Alex Pinto whereby URLs with parameters but no paths would not have their domain
+  correctly parsed.
+
+Version 1.2.1
+-------------------------------------------------------------------------
+
+CHANGES
+* Changed "tld" column to "suffix" in return of "suffix_extract" to more
+accurately reflect what it is
+* Switched to "vapply" in "suffix_extract" to give a bit of a speedup to
+an already fast function
+
+BUG FIXES
+* Fixed documentation of "suffix_extract"
+
+DEVELOPMENT
+* More internal documentation added to compiled code.
+* The suffix_dataset dataset was refreshed
+
+Version 1.2.0
+-------------------------------------------------------------------------
+NEW FEATURES
+* Jay Jacobs' "tldextract" functionality has been merged with urltools, and can be accessed
+with "suffix_extract"
+* At Nicolas Coutin's suggestion, url_compose - url_parse in reverse - has been introduced.
+
+BUG FIXES
+
+* To adhere to RfC standards, "query" functions have been renamed "parameter"
+* A bug in which fragments could not be retrieved (and were incorrectly identified as parameters)
+has been fixed. Thanks to Nicolas Coutin for reporting it and providing a reproducible example.
+
+Version 1.1.1
+-------------------------------------------------------------------------
+BUG FIXES
+
+* Parameter parsing has been fixed to require an "=" after the parameter name, solving scenarios where
+the URL contained the parameter name as part of, say, the domain, and the wrong value was grabbed. Thanks
+to Jacob Barnett for the bug report and example.
+* URL encoding no longer encodes the slash between the domain and path (thanks to Peter Meissner for pointing
+this bug out).
+
+DEVELOPMENT
+*More unit tests
+
+Version 1.1.0
+-------------------------------------------------------------------------
+NEW FEATURES
+*url_parameters provides the values of specified parameters within a vector of URLs, as a data.frame
+*KeyboardInterrupts are now available for interrupting long computations.
+*url_parse now provides a data.frame, rather than a list, as output.
+
+BUG FIXES
+
+DEVELOPMENT
+*De-static the hell out of all the C++.
+*Internal refactor to store each logical stage of url decomposition as its own method
+*Internal refactor to use references, minimising memory usage; thanks to Mark Greenaway for making this work!
+*Roxygen upgrade
+
+Version 1.0.0
+-------------------------------------------------------------------------
+NEW FEATURES
+*New get/set functionality, mimicking lubridate; see the package vignette.
+
+DEVELOPMENT
+*Internal C++ documentation added and the encoders and parsers refactored.
+
+Version 0.6.0
+-------------------------------------------------------------------------
+NEW FEATURES
+*replace_parameter introduced, to augment extract_parameter (previously simply url_param). This
+allows you to take the value a parameter has associated with it, and replace it with one of your choosing.
+*extract_host allows you to grab the hostname of a site, ignoring other components.
+
+BUG FIXES
+*extract_parameter (now url_extract_param) previously failed with an obfuscated error if the requested
+parameter terminated the URL. This has now been fixed.
+
+DEVELOPMENT
+*unit tests expanded
+*Internal tweaks to improve the speed of url_decode and url_encode.
\ No newline at end of file
diff --git a/R/RcppExports.R b/R/RcppExports.R
new file mode 100644
index 0000000..915f9a3
--- /dev/null
+++ b/R/RcppExports.R
@@ -0,0 +1,262 @@
+# Generated by using Rcpp::compileAttributes() -> do not edit by hand
+# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
+
+get_component_ <- function(urls, component) {
+    .Call('urltools_get_component_', PACKAGE = 'urltools', urls, component)
+}
+
+set_component_ <- function(urls, component, new_value) {
+    .Call('urltools_set_component_', PACKAGE = 'urltools', urls, component, new_value)
+}
+
+#'@title get the values of a URL's parameters
+#'@description URLs can have parameters, taking the form of \code{name=value}, chained together
+#'with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
+#'of parameter names, will generate a data.frame consisting of the values of each parameter
+#'for each URL.
+#'
+#'@param urls a vector of URLs
+#'
+#'@param parameter_names a vector of parameter names
+#'
+#'@return a data.frame containing one column for each provided parameter name. Values that
+#'cannot be found within a particular URL are represented by an NA.
+#'
+#'@examples
+#'#A very simple example
+#'url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
+#'parameter_values <- param_get(url, c("this_parameter","hiphop"))
+#'
+#'@seealso \code{\link{url_parse}} for decomposing URLs into their constituent parts and
+#'\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
+#'
+#'@aliases param_get url_parameter
+#'@rdname param_get
+#'@export
+param_get <- function(urls, parameter_names) {
+    .Call('urltools_param_get', PACKAGE = 'urltools', urls, parameter_names)
+}
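+
+# A minimal illustrative sketch (hypothetical URLs): per the documentation above,
+# parameters missing from a given URL come back as NA in the corresponding column, e.g.
+#   param_get(c("https://example.com/?a=1&b=2", "https://example.com/?a=3"),
+#             c("a", "b"))
+# would return a data.frame whose second row has NA in the "b" column.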
+
+#'@title Set the value associated with a parameter in a URL's query.
+#'@description URLs often have queries associated with them, particularly URLs for
+#'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
+#'allows you to modify key/value pairs within query strings, or even add new ones
+#'if they don't exist within the URL.
+#'
+#'@param urls a vector of URLs. These should be decoded (with \code{url_decode})
+#'but do not have to have been otherwise manipulated.
+#'
+#'@param key a string representing the key to modify the value of (or insert wholesale
+#'if it doesn't exist within the URL).
+#'
+#'@param value a value to associate with the key. This can be a single string,
+#'or a vector the same length as \code{urls}
+#'
+#'@return the original vector of URLs, but with modified/inserted key-value pairs. If the
+#'URL is \code{NA}, the returned value will be \code{NA} too - if the key or value is \code{NA}, no insertion
+#'will be made.
+#'
+#'@examples
+#'# Set a URL parameter where there's already a key for that
+#'param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
+#'
+#'# Set a URL parameter where there isn't.
+#'param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
+#'
+#'@seealso \code{\link{param_get}} to retrieve the values associated with multiple keys in
+#'a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
+#'
+#'@export
+param_set <- function(urls, key, value) {
+    .Call('urltools_param_set', PACKAGE = 'urltools', urls, key, value)
+}
+
+#'@title Remove key-value pairs from query strings
+#'@description URLs often have queries associated with them, particularly URLs for
+#'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
+#'allows you to remove key/value pairs while leaving the rest of the URL intact.
+#'
+#'@param urls a vector of URLs. These should be decoded with \code{url_decode} but don't
+#'have to have been otherwise processed.
+#'
+#'@param keys a vector of parameter keys to remove.
+#'
+#'@return the original URLs but with the key/value pairs specified by \code{keys} removed.
+#'If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
+#'nothing will be done with it.
+#'
+#'@seealso \code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
+#'to retrieve those values.
+#'
+#'@examples
+#'# Remove multiple parameters from a URL
+#'param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
+#'             keys = c("action","format"))
+#'@export
+param_remove <- function(urls, keys) {
+    .Call('urltools_param_remove', PACKAGE = 'urltools', urls, keys)
+}
+
+#'@title Encode or Decode Internationalised Domains
+#'@description \code{puny_encode} and \code{puny_decode} implement
+#'the encoding standard for internationalised (non-ASCII) domains and
+#'subdomains. You can use them to encode UTF-8 domain names, or decode
+#'encoded names (which start with "xn--"), or both.
+#'
+#'@param x a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.
+#'
+#'@return a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
+#'Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
+#'decoded or encoded version) will be returned as \code{NA}.
+#'
+#'@examples
+#'# Encode a URL
+#'puny_encode("https://www.bücher.com/foo")
+#'
+#'# Decode the result, back to the original
+#'puny_decode("https://www.xn--bcher-kva.com/foo")
+#'
+#'@seealso \code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
+#'
+#'@rdname puny
+#'@export
+puny_encode <- function(x) {
+    .Call('urltools_puny_encode', PACKAGE = 'urltools', x)
+}
+
+#'@rdname puny
+#'@export
+puny_decode <- function(x) {
+    .Call('urltools_puny_decode', PACKAGE = 'urltools', x)
+}
+
+reverse_strings <- function(strings) {
+    .Call('urltools_reverse_strings', PACKAGE = 'urltools', strings)
+}
+
+finalise_suffixes <- function(full_domains, suffixes, wildcard, is_suffix) {
+    .Call('urltools_finalise_suffixes', PACKAGE = 'urltools', full_domains, suffixes, wildcard, is_suffix)
+}
+
+tld_extract_ <- function(domains) {
+    .Call('urltools_tld_extract_', PACKAGE = 'urltools', domains)
+}
+
+host_extract_ <- function(domains) {
+    .Call('urltools_host_extract_', PACKAGE = 'urltools', domains)
+}
+
+#'@title Encode or decode a URI
+#'@description encodes or decodes a URI/URL
+#'
+#'@param urls a vector of URLs to decode or encode.
+#'
+#'@details
+#'URL encoding and decoding is an essential prerequisite to proper web interaction
+#'and data analysis around things like server-side logs. The
+#'\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} mandates the percent-encoding
+#'of non-Latin characters, including things like slashes, unless those are reserved.
+#'
+#'Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
+#'URL encoding - in theory. In practice, they have a set of substantial problems
+#'that the urltools implementation solves:
+#'
+#'\itemize{
+#' \item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
+#'       This means that, when confronted with a vector of URLs that need encoding or
+#'       decoding, your only option is to loop from within R. This can be incredibly
+#'       computationally costly with large datasets. url_encode and url_decode are
+#'       implemented in C++ and entirely vectorised, allowing for a substantial
+#'       performance improvement.}
+#' \item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
+#'       of making sure your URL no longer works. Because of this, the only thing
+#'       you can encode in URLencode (unless you refuse to encode reserved characters)
+#'       is a partial URL, lacking the initial scheme, which requires additional operations
+#'       to set up and increases the complexity of encoding or decoding. url_encode
+#'       detects the protocol and silently splits it off, leaving it unencoded to ensure
+#'       that the resulting URL is valid.}
+#' \item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
+#'       characters. Unfortunately, URLdecode's response to these characters is to convert
+#'       them to NULs, which R can't handle, at which point your URLdecode call breaks.
+#'       \code{url_decode} simply ignores them.}
+#'}
+#'
+#'@return a character vector containing the encoded (or decoded) versions of "urls".
+#'
+#'@seealso \code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
+#'and encoding.
+#'
+#'@examples
+#'
+#'url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_%28logo%29.jpg")
+#'url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
+#'
+#'\dontrun{
+#'#A demonstrator of the contrasting behaviours around out-of-range characters
+#'URLdecode("%gIL")
+#'url_decode("%gIL")
+#'}
+#'@rdname encoder
+#'@export
+url_decode <- function(urls) {
+    .Call('urltools_url_decode', PACKAGE = 'urltools', urls)
+}
+
+#'@rdname encoder
+#'@export
+url_encode <- function(urls) {
+    .Call('urltools_url_encode', PACKAGE = 'urltools', urls)
+}
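+
+# A short sketch of the vectorisation point made in the details above
+# (hypothetical URLs): one call handles a whole vector, where the base R
+# functions would have to be looped over, e.g.
+#   urls <- c("https://example.com/a path/", "https://example.com/file (1).jpg")
+#   url_encode(urls)                                          # single vectorised call
+#   vapply(urls, URLencode, character(1), reserved = FALSE)   # base R equivalent loop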
+
+#'@title split URLs into their component parts
+#'@description \code{url_parse} takes a vector of URLs and splits each one into its component
+#'parts, as recognised by RfC 3986.
+#'
+#'@param urls a vector of URLs
+#'
+#'@details It's useful to be able to take a URL and split it out into its component parts - 
+#'for the purpose of hostname extraction, for example, or analysing API calls. This functionality
+#'is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
+#'implementation is entirely in R, uses regular expressions, and is not vectorised. It's
+#'perfectly suitable for the intended purpose (decomposition in the context of automated
+#'HTTP requests from R), but not for large-scale analysis.
+#'
+#'@return a data.frame consisting of the columns scheme, domain, port, path, query
+#'and fragment. See the '\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} for
+#'definitions. If an element cannot be identified, it is represented by an empty string.
+#'
+#'@examples
+#'url_parse("https://en.wikipedia.org/wiki/Article")
+#'
+#'@seealso \code{\link{url_parameters}} for extracting values associated with particular keys in a URL's
+#'query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
+#'
+#'@export
+url_parse <- function(urls) {
+    .Call('urltools_url_parse', PACKAGE = 'urltools', urls)
+}
+
+#'@title Recompose Parsed URLs
+#'
+#'@description Sometimes you want to take a vector of URLs, parse them, perform
+#'some operations and then rebuild them. \code{url_compose} takes a data.frame produced
+#'by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
+#'as the vector initially thrown into url_parse).
+#'
+#'This is currently a `beta` feature; please do report bugs if you find them.
+#'
+#'@param parsed_urls a data.frame sourced from \code{\link{url_parse}}
+#'
+#'@seealso \code{\link{scheme}} and other accessors, which you may want to
+#'run URLs through before composing them to modify individual values.
+#'
+#'@examples
+#'#Parse a URL and compose it
+#'url <- "http://en.wikipedia.org"
+#'url_compose(url_parse(url))
+#'
+#'@export
+url_compose <- function(parsed_urls) {
+    .Call('urltools_url_compose', PACKAGE = 'urltools', parsed_urls)
+}
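+
+# A sketch of the parse -> modify -> recompose round trip described above,
+# assuming the data.frame columns documented for url_parse():
+#   parsed <- url_parse("http://en.wikipedia.org/wiki/Article")
+#   parsed$scheme <- "https"
+#   url_compose(parsed)   # rebuilds the URL with the updated scheme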
+
diff --git a/R/accessors.R b/R/accessors.R
new file mode 100644
index 0000000..b553986
--- /dev/null
+++ b/R/accessors.R
@@ -0,0 +1,202 @@
+#'@title Get or set a URL's scheme
+#'@description As in the lubridate package, individual components of a URL
+#'can be extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases scheme
+#'@rdname scheme
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's scheme.
+#'
+#'@seealso \code{\link{domain}}, \code{\link{port}}, \code{\link{path}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org/submit.html"
+#'scheme(example_url)
+#'
+#'#Set a component
+#'scheme(example_url) <- "https"
+#'
+#'# NA out the URL
+#'scheme(example_url) <- NA_character_
+#'@import methods
+#'@export
+scheme <- function(x){
+  return(get_component_(x,0))
+}
+
+"scheme<-" <- function(x, value) standardGeneric("scheme<-")
+#'@rdname scheme
+#'@export
+setGeneric("scheme<-", useAsDefault = function(x, value){
+  return(set_component_(x, 0, value))
+})
+
+#'@title Get or set a URL's domain
+#'@description As in the lubridate package, individual components of a URL
+#'can be extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases domain
+#'@rdname domain
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's domain.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{port}}, \code{\link{path}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org/submit.html"
+#'domain(example_url)
+#'
+#'#Set a component
+#'domain(example_url) <- "en.wikipedia.org"
+#'@export
+domain <- function(x){
+  return(get_component_(x,1))
+}
+"domain<-" <- function(x, value) standardGeneric("domain<-")
+#'@rdname domain
+#'@export
+setGeneric("domain<-", useAsDefault = function(x, value){
+  return(set_component_(x, 1, value))
+})
+
+#'@title Get or set a URL's port
+#'@description As in the lubridate package, individual components of a URL
+#'can be extracted or set using the relevant function call - see the
+#'examples.
+#'
+#'@aliases port
+#'@rdname port
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's port.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{path}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org:80/submit.html"
+#'port(example_url)
+#'
+#'#Set a component
+#'port(example_url) <- "12"
+#'@export
+port <- function(x){
+  return(get_component_(x,2))
+}
+"port<-" <- function(x, value) standardGeneric("port<-")
+#'@rdname port
+#'@export
+setGeneric("port<-", useAsDefault = function(x, value){
+  return(set_component_(x, 2, value))
+})
+
+#'@title Get or set a URL's path
+#'@description As in the lubridate package, individual components of a URL
+#'can be extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases path
+#'@rdname path
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's path
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+#'\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://cran.r-project.org:80/submit.html"
+#'path(example_url)
+#'
+#'#Set a component
+#'path(example_url) <- "bin/windows/"
+#'@export
+path <- function(x){
+  return(get_component_(x,3))
+}
+"path<-" <- function(x, value) standardGeneric("path<-")
+#'@rdname path
+#'@export
+setGeneric("path<-", useAsDefault = function(x, value){
+  return(set_component_(x, 3, value))
+})
+
+#'@title Get or set a URL's parameters
+#'@description As in the lubridate package, individual components of a URL
+#'can be extracted or set using the relevant function call - see the
+#'examples.
+#'
+#'@aliases parameters
+#'@rdname parameters
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's parameters.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+#'\code{\link{path}} and \code{\link{fragment}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
+#'parameters(example_url)
+#'#[1] "debug=true"
+#'
+#'#Set a component
+#'parameters(example_url) <- "debug=false"
+#'@export
+parameters <- function(x){
+  return(get_component_(x,4))
+}
+"parameters<-" <- function(x, value) standardGeneric("parameters<-")
+#'@rdname parameters
+#'@export
+setGeneric("parameters<-", useAsDefault = function(x, value){
+  return(set_component_(x, 4, value))
+})
+
+#'@title Get or set a URL's fragment
+#'@description As in the lubridate package, individual components of a URL
+#'can be extracted or set using the relevant function call - see the
+#'examples.
+#'@aliases fragment
+#'@rdname fragment
+#'
+#'@param x a URL, or vector of URLs
+#'
+#'@param value a replacement value for x's fragment.
+#'
+#'@seealso \code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+#'\code{\link{path}} and \code{\link{parameters}} for other accessors.
+#'
+#'@examples
+#'#Get a component
+#'example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
+#'fragment(example_url)
+#'
+#'#Set a component
+#'fragment(example_url) <- "production"
+#'@export
+fragment <- function(x){
+  return(get_component_(x,5))
+}
+
+"fragment<-" <- function(x, value) standardGeneric("fragment<-")
+#'@rdname fragment
+#'@export
+setGeneric("fragment<-", useAsDefault = function(x, value){
+  return(set_component_(x, 5, value))
+})
\ No newline at end of file
diff --git a/R/suffix.R b/R/suffix.R
new file mode 100644
index 0000000..55c388c
--- /dev/null
+++ b/R/suffix.R
@@ -0,0 +1,265 @@
+#' @title Dataset of public suffixes
+#' @description This dataset contains a registry of public suffixes, as retrieved from
+#' and defined by the \href{https://publicsuffix.org/}{public suffix list}. It is
+#' sorted by how many periods(".") appear in the suffix, to optimise it for
+#' \code{\link{suffix_extract}}.  It is a data.frame with two columns, the first is
+#' the list of suffixes and the second is our best guess at the comment or owner 
+#' associated with the particular suffix. 
+#'
+#' @docType data
+#' @keywords datasets
+#' @name suffix_dataset
+#'
+#' @seealso \code{\link{suffix_extract}} for extracting suffixes from domain names,
+#' and \code{\link{suffix_refresh}} for getting a new, totally-up-to-date dataset
+#' version.
+#'
+#' @usage data(suffix_dataset)
+#' @note Last updated 2016-07-31.
+#' @format A data.frame of 8030 rows and 2 columns
+"suffix_dataset"
+
+#'@title Retrieve a public suffix dataset
+#'
+#'@description \code{urltools} comes with an inbuilt
+#'dataset of public suffixes, \code{\link{suffix_dataset}}.
+#'This is used in \code{\link{suffix_extract}} to identify the top-level domain
+#'within a particular domain name.
+#'
+#'While updates to the dataset will be included in each new package release,
+#'there's going to be a gap between changes to the suffixes list and changes to the package.
+#'Accordingly, the package also includes \code{suffix_refresh}, which generates
+#'and returns a \emph{fresh} version of the dataset. This can then be passed through
+#'to \code{\link{suffix_extract}}.
+#'
+#'@return a dataset equivalent in format to \code{\link{suffix_dataset}}.
+#'
+#'@seealso \code{\link{suffix_extract}} to extract suffixes from domain names,
+#'or \code{\link{suffix_dataset}} for the inbuilt, default version of the data.
+#'
+#'@examples
+#'\dontrun{
+#'new_suffixes <- suffix_refresh()
+#'}
+#'
+#'@export
+suffix_refresh <- function(){
+  
+  has_libcurl <- capabilities("libcurl")
+  if(length(has_libcurl) == 0 || has_libcurl == FALSE){
+    stop("libcurl support is needed for this function")
+  }
+  
+  #Read in and filter
+  connection <- url("https://www.publicsuffix.org/list/effective_tld_names.dat", method = "libcurl")
+  results <- readLines(connection)
+  close(connection)
+  
+  # making an assumption that sections are broken by blank lines
+  blank <- which(results == "")
+  # and gotta know where the comments are
+  comments <- grep(pattern = "^//", x=results)
+  
+  # if the file doesn't end on a blank line, stick an ending on there.
+  if (blank[length(blank)] < length(results)) {
+    blank <- c(blank, length(results)+1)
+  }
+  # now break up each section into a list
+  # grab right after the blank line and right before the next blank line.
+  suffix_dataset <- do.call(rbind, lapply(seq(length(blank) - 1), function(i) {
+    # these are the lines in the current block
+    lines <- seq(blank[i] + 1, blank[i + 1] - 1)
+    # assume there is nothing in the block
+    rez <- NULL
+    # the lines of text in this block
+    suff <- results[lines]
+    # of which these are the comments
+    iscomment <- lines %in% comments
+    # and check if we have any results 
+    # append the first comment at the top of the block only.
+    if(length(suff[!iscomment])) {
+      rez <- data.frame(suffixes = suff[!iscomment],
+                 comments = suff[which(iscomment)[1]], stringsAsFactors = FALSE)
+    }
+    return(rez)
+  }))
+  ## this is the old way
+  #suffix_dataset <- results[!grepl(x = results, pattern = "//", fixed = TRUE) & !results == ""]
+
+  #Return the user-friendly version
+  return(suffix_dataset)
+}
+
+#' @title extract the suffix from domain names
+#' @description domain names have suffixes - common endings that people
+#' can or could register domains under. This includes things like ".org", but
+#' also things like ".edu.co". A simple Top Level Domain list, as a
+#' result, probably won't cut it.
+#'
+#' \code{\link{suffix_extract}} takes the list of public suffixes,
+#' as maintained by Mozilla (see \code{\link{suffix_dataset}}) and
+#' a vector of domain names, and produces a data.frame containing the
+#' suffix that each domain uses, and the remaining fragment.
+#'
+#' @param domains a vector of domains, from \code{\link{domain}}
+#' or \code{\link{url_parse}}. Alternately, full URLs can be provided
+#' and will then be run through \code{\link{domain}} internally.
+#'
+#' @param suffixes a dataset of suffixes. By default, this is NULL and the function
+#' relies on \code{\link{suffix_dataset}}. Optionally, if you want more updated
+#' suffix data, you can provide the result of \code{\link{suffix_refresh}} for
+#' this parameter.
+#' 
+#' @return a data.frame of four columns: "host", "subdomain", "domain" & "suffix".
+#' "host" is what was passed in. "subdomain" is the subdomain of the suffix.
+#' "domain" contains the part of the domain name that came before the matched suffix.
+#' "suffix" is, well, the suffix.
+#'
+#' @seealso \code{\link{suffix_dataset}} for the dataset of suffixes.
+#'
+#' @examples
+#'
+#' # Using url_parse
+#' domain_name <- url_parse("http://en.wikipedia.org")$domain
+#' suffix_extract(domain_name)
+#'
+#' # Using domain()
+#' domain_name <- domain("http://en.wikipedia.org")
+#' suffix_extract(domain_name)
+#'
+#' #Relying on a fresh version of the suffix dataset
+#' suffix_extract(domain("http://en.wikipedia.org"), suffix_refresh())
+#' 
+#' @importFrom triebeard trie longest_match
+#' @export
+suffix_extract <- function(domains, suffixes = NULL){
+  if(!is.null(suffixes)){
+    # check if suffixes is a data.frame, and stop if column not found
+    if(is.data.frame(suffixes)) {
+      if ("suffixes" %in% colnames(suffixes)) {
+        suffixes <- suffixes$suffixes
+      } else {
+        stop("Expected column named \"suffixes\" in suffixes data.frame")
+      }
+    }
+    holding <- suffix_load(suffixes)
+  } else {
+    holding <- list(suff_trie = urltools_env$suff_trie,
+                    is_wildcard = urltools_env$is_wildcard,
+                    cleaned_suffixes = urltools_env$cleaned_suffixes)
+  }
+  
+  rev_domains <- reverse_strings(tolower(domains))
+  matched_suffixes <- triebeard::longest_match(holding$suff_trie, rev_domains)
+  has_wildcard <- matched_suffixes %in% holding$is_wildcard
+  is_suffix <- domains %in% holding$cleaned_suffixes
+  return(finalise_suffixes(domains, matched_suffixes, has_wildcard, is_suffix))
+}
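+
+# A sketch of the reversed-string trie trick used above (assumptions: the
+# internal reverse_strings() helper and the triebeard calls behave as in
+# suffix_load() in R/zzz.R). Reversing both the suffixes and the domain turns
+# suffix matching into prefix matching, which is what a trie is good at:
+#   suffixes <- c("com", "org.uk")
+#   tr <- triebeard::trie(keys = reverse_strings(paste0(".", suffixes)),
+#                         values = suffixes)
+#   triebeard::longest_match(tr, reverse_strings("example.org.uk"))   # -> "org.uk"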
+
+#' @title Dataset of top-level domains (TLDs)
+#' @description This dataset contains a registry of top-level domains, as retrieved from
+#' and defined by the \href{http://data.iana.org/TLD/tlds-alpha-by-domain.txt}{IANA}.
+#' 
+#' @docType data
+#' @keywords datasets
+#' @name tld_dataset
+#'
+#' @seealso \code{\link{tld_extract}} for extracting TLDs from domain names,
+#' and \code{\link{tld_refresh}} to get an updated version of this dataset.
+#'
+#' @usage data(tld_dataset)
+#' @note Last updated 2016-07-20.
+#' @format A vector of 1275 elements.
+"tld_dataset"
+
+#'@title Retrieve a TLD dataset
+#'
+#'@description \code{urltools} comes with an inbuilt
+#'dataset of top level domains (TLDs), \code{\link{tld_dataset}}.
+#'This is used in \code{\link{tld_extract}} to identify the top-level domain
+#'within a particular domain name.
+#'
+#'While updates to the dataset will be included in each new package release,
+#'there's going to be a gap between changes to TLDs and changes to the package.
+#'Accordingly, the package also includes \code{tld_refresh}, which generates
+#'and returns a \emph{fresh} version of the dataset. This can then be passed through
+#'to \code{\link{tld_extract}}.
+#'
+#'@return a dataset equivalent in format to \code{\link{tld_dataset}}.
+#'
+#'@seealso \code{\link{tld_extract}} to extract suffixes from domain names,
+#'or \code{\link{tld_dataset}} for the inbuilt, default version of the data.
+#'
+#'@examples
+#'\dontrun{
+#'new_tlds <- tld_refresh()
+#'}
+#'
+#'@export
+tld_refresh <- function(){
+  raw_tlds <- readLines("http://data.iana.org/TLD/tlds-alpha-by-domain.txt", warn = FALSE)
+  raw_tlds <- tolower(raw_tlds[!grepl(x = raw_tlds, pattern = "#", fixed = TRUE)])
+  return(raw_tlds)
+}
+
+#'@title Extract TLDs
+#'@description \code{tld_extract} extracts the top-level domain (TLD) from
+#'a vector of domain names. This is distinct from the suffixes, extracted with
+#'\code{\link{suffix_extract}}; TLDs are \emph{top} level, while suffixes are just
+#'domains through which internet users can publicly register domains (the difference
+#'between \code{.org.uk} and \code{.uk}).
+#'
+#'@param domains a vector of domains, retrieved through \code{\link{url_parse}} or
+#'\code{\link{domain}}.
+#'
+#'@param tlds a dataset of TLDs. If NULL (the default), \code{tld_extract} relies
+#'on urltools' \code{\link{tld_dataset}}; otherwise, you can pass in the result of
+#'\code{\link{tld_refresh}}.
+#'
+#'@return a data.frame of two columns: \code{domain}, with the original domain names,
+#'and \code{tld}, the identified TLD from the domain.
+#'
+#'@examples
+#'# Using the inbuilt dataset
+#'domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
+#'tld_extract(domains)
+#'
+#'# Using a refreshed one
+#'tld_extract(domains, tld_refresh())
+#'
+#'@seealso \code{\link{suffix_extract}} for retrieving suffixes (distinct from TLDs).
+#'
+#'@export
+tld_extract <- function(domains, tlds = NULL){
+  if(is.null(tlds)){
+    tlds <- urltools::tld_dataset
+  }
+  guessed_tlds <- tld_extract_(tolower(domains))
+  guessed_tlds[!guessed_tlds %in% tlds] <- NA
+  return(data.frame(domain = domains, tld = guessed_tlds, stringsAsFactors = FALSE))
+}
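+
+# A small sketch of the TLD/suffix distinction described above, using a
+# hypothetical .org.uk domain: the TLD is the top label only, while the public
+# suffix can span several labels.
+#   tld_extract(domain("http://example.org.uk/"))      # tld column: "uk"
+#   suffix_extract(domain("http://example.org.uk/"))   # suffix column: "org.uk"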
+
+#'@title Extract hosts
+#'@description \code{host_extract} extracts the host from
+#'a vector of domain names. A host isn't the same as a domain - it could be
+#'the subdomain, if there are one or more subdomains. The host of \code{en.wikipedia.org}
+#'is \code{en}, while the host of \code{wikipedia.org} is \code{wikipedia}.
+#'
+#'@param domains a vector of domains, retrieved through \code{\link{url_parse}} or
+#'\code{\link{domain}}.
+#'
+#'@return a data.frame of two columns: \code{domain}, with the original domain names,
+#'and \code{host}, the identified host from the domain.
+#'
+#'@examples
+#'# With subdomains
+#'has_subdomain <- domain("https://en.wikipedia.org/wiki/Main_Page")
+#'host_extract(has_subdomain)
+#'
+#'# Without
+#'no_subdomain <- domain("https://ironholds.org/projects/r_shiny/")
+#'host_extract(no_subdomain)
+#'@export
+host_extract <- function(domains){
+  return(data.frame(domain = domains, host = host_extract_(domains), stringsAsFactors = FALSE))
+}
\ No newline at end of file
diff --git a/R/urltools.R b/R/urltools.R
new file mode 100644
index 0000000..5b3e253
--- /dev/null
+++ b/R/urltools.R
@@ -0,0 +1,21 @@
+#' @title Tools for handling URLs
+#' @name urltools
+#' @description This package provides functions for URL encoding and decoding,
+#' parsing, and parameter extraction, designed to be both fast and
+#' entirely vectorised. It is intended to be useful for people dealing with
+#' web-related datasets, such as server-side logs.
+#' 
+#' @seealso the \href{https://CRAN.R-project.org/package=urltools/vignettes/urltools.html}{package vignette}.
+#' @useDynLib urltools
+#' @importFrom Rcpp sourceCpp
+#' @docType package
+#' @aliases urltools urltools-package
+NULL
+
+#'@rdname param_get
+#'@export
+url_parameters <- function(urls, parameter_names){
+  .Deprecated("param_get",
+              old = as.character(sys.call(sys.parent()))[1L])
+  return(param_get(urls, parameter_names))
+}
\ No newline at end of file
diff --git a/R/zzz.R b/R/zzz.R
new file mode 100644
index 0000000..7a2a2af
--- /dev/null
+++ b/R/zzz.R
@@ -0,0 +1,22 @@
+urltools_env <- new.env(parent = emptyenv())
+
+suffix_load <- function(suffixes = NULL){
+  if(is.null(suffixes)){
+    suffixes <- urltools::suffix_dataset
+  }
+  cleaned_suffixes <- gsub(x = suffixes, pattern = "*.", replacement = "", fixed = TRUE)
+  is_wildcard <- cleaned_suffixes[which(grepl(x = suffixes, pattern = "*.", fixed = TRUE))]
+  suff_trie <- triebeard::trie(keys = reverse_strings(paste0(".", cleaned_suffixes)),
+                               values = cleaned_suffixes)
+  return(list(suff_trie = suff_trie,
+              is_wildcard = is_wildcard,
+              cleaned_suffixes = cleaned_suffixes))
+}
+
+.onLoad <- function(...) {
+  holding <- suffix_load()
+  assign("is_wildcard", holding$is_wildcard, envir = urltools_env)
+  assign("cleaned_suffixes", holding$cleaned_suffixes, envir = urltools_env)
+  assign("suff_trie", holding$suff_trie, envir = urltools_env)
+}
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..414c18d
--- /dev/null
+++ b/README.md
@@ -0,0 +1,39 @@
+## urltools
+A package for elegantly handling and parsing URLs from within R.
+
+__Author:__ Oliver Keyes, Jay Jacobs<br/>
+__License:__ [MIT](http://opensource.org/licenses/MIT)<br/>
+__Status:__ Stable
+
+[![Travis-CI Build Status](https://travis-ci.org/Ironholds/urltools.svg?branch=master)](https://travis-ci.org/Ironholds/urltools) ![downloads](http://cranlogs.r-pkg.org/badges/grand-total/urltools)
+
+### Description
+
+URLs in R are often treated as nothing more than part of data retrieval -
+they're used for making connections and reading data. With web analytics
+and research, however, URLs can *be* the data, and R's default handlers
+are not best suited to handle vectorised operations over large datasets.
+<code>urltools</code> is intended to solve this. 
+
+It contains drop-in replacements for R's URLdecode and URLencode functions, along
+with new functionality such as a URL parser and parameter value extractor. In all
+cases, the functions are designed to be content-safe (not breaking on unexpected values)
+and fully vectorised, resulting in a dramatic speed improvement over existing implementations -
+crucial for large datasets. For more information, see the [urltools vignette](https://github.com/Ironholds/urltools/blob/master/vignettes/urltools.Rmd).
+
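+A minimal usage sketch (the example URLs are illustrative; see the vignette for
+full worked examples):
+
+    library(urltools)
+    url_parse("https://en.wikipedia.org/wiki/Article")
+    url_encode("https://en.wikipedia.org/wiki/Article Name")
+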
+Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).
+By participating in this project you agree to abide by its terms.
+
+### Installation
+
+The latest CRAN version can be obtained via:
+
+    install.packages("urltools")
+    
+To get the development version:
+
+    devtools::install_github("ironholds/urltools")
+
+### Dependencies
+* R. Doy.
+* [Rcpp](https://cran.r-project.org/package=Rcpp)
\ No newline at end of file
diff --git a/build/vignette.rds b/build/vignette.rds
new file mode 100644
index 0000000..4c8421a
Binary files /dev/null and b/build/vignette.rds differ
diff --git a/data/suffix_dataset.rda b/data/suffix_dataset.rda
new file mode 100644
index 0000000..fdb94f5
Binary files /dev/null and b/data/suffix_dataset.rda differ
diff --git a/data/tld_dataset.rda b/data/tld_dataset.rda
new file mode 100644
index 0000000..e0062d4
Binary files /dev/null and b/data/tld_dataset.rda differ
diff --git a/debian/README.test b/debian/README.test
deleted file mode 100644
index 53fb4d7..0000000
--- a/debian/README.test
+++ /dev/null
@@ -1,8 +0,0 @@
-Notes on how this package can be tested.
-────────────────────────────────────────
-
-This package can be tested by running the provided test:
-
-   sh ./run-unit-test
-
-in order to confirm its integrity.
diff --git a/debian/changelog b/debian/changelog
deleted file mode 100644
index 5632c3c..0000000
--- a/debian/changelog
+++ /dev/null
@@ -1,5 +0,0 @@
-r-cran-urltools (1.6.0-1) unstable; urgency=medium
-
-  * Initial release (closes: #851565)
-
- -- Andreas Tille <tille at debian.org>  Mon, 16 Jan 2017 16:58:09 +0100
diff --git a/debian/compat b/debian/compat
deleted file mode 100644
index f599e28..0000000
--- a/debian/compat
+++ /dev/null
@@ -1 +0,0 @@
-10
diff --git a/debian/control b/debian/control
deleted file mode 100644
index ed88f33..0000000
--- a/debian/control
+++ /dev/null
@@ -1,29 +0,0 @@
-Source: r-cran-urltools
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Andreas Tille <tille at debian.org>
-Section: gnu-r
-Priority: optional
-Build-Depends: debhelper (>= 10),
-               dh-r,
-               r-base-dev,
-               r-cran-rcpp,
-               r-cran-triebeard
-Standards-Version: 3.9.8
-Vcs-Browser: https://anonscm.debian.org/viewvc/debian-med/trunk/packages/R/r-cran-urltools/
-Vcs-Svn: svn://anonscm.debian.org/debian-med/trunk/packages/R/r-cran-urltools/
-Homepage: https://cran.r-project.org/package=urltools
-
-Package: r-cran-urltools
-Architecture: any
-Depends: ${R:Depends},
-         ${shlibs:Depends},
-         ${misc:Depends}
-Recommends: ${R:Recommends}
-Suggests: ${R:Suggests}
-Description: GNU R vectorised tools for URL handling and parsing
- A toolkit for all URL-handling needs, including encoding and decoding,
- parsing, parameter extraction and modification. All functions are
- designed to be both fast and entirely vectorised. It is intended to be
- useful for people dealing with web-related datasets, such as server-side
- logs, although may be useful for other situations involving large sets of
- URLs.
diff --git a/debian/copyright b/debian/copyright
deleted file mode 100644
index 0d6ff34..0000000
--- a/debian/copyright
+++ /dev/null
@@ -1,47 +0,0 @@
-Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
-Upstream-Name: urltools
-Upstream-Contact: Oliver Keyes <ironholds at gmail.com>
-Source: https://cran.r-project.org/package=urltools
-
-Files: *
-Copyright: 2010-2016 Oliver Keyes, Jay Jacobs, Drew Schmidt, Mark Greenaway,
-                     Bob Rudis, Alex Pinto, Maryam Khezrzadeh, Adam M. Costello,
-                     Jeff Bezanson
-License: MIT
-
-Files: urltools/src/punycode.*
-Copyright: 2010-2014 Adam M. Costello
-License: punycode
-    Regarding this entire document or any portion of it (including
-    the pseudocode and C code), the author makes no guarantees and
-    is not responsible for any damage resulting from its use.  The
-    author grants irrevocable permission to anyone to use, modify,
-    and distribute it in any way that does not diminish the rights
-    of anyone else to use, modify, and distribute it, provided that
-    redistributed derivative works do not contain misleading author or
-    version information.  Derivative works need not be licensed under
-    similar terms.
-
-Files: debian/*
-Copyright: 2017 Andreas Tille <tille at debian.org>
-License: MIT
-
-License: MIT
- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- "Software"), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
- .
- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
- .
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/debian/docs b/debian/docs
deleted file mode 100644
index 6466d39..0000000
--- a/debian/docs
+++ /dev/null
@@ -1,3 +0,0 @@
-debian/tests/run-unit-test
-debian/README.test
-tests
diff --git a/debian/rules b/debian/rules
deleted file mode 100755
index 529c38a..0000000
--- a/debian/rules
+++ /dev/null
@@ -1,5 +0,0 @@
-#!/usr/bin/make -f
-
-%:
-	dh $@ --buildsystem R
-
diff --git a/debian/source/format b/debian/source/format
deleted file mode 100644
index 163aaf8..0000000
--- a/debian/source/format
+++ /dev/null
@@ -1 +0,0 @@
-3.0 (quilt)
diff --git a/debian/tests/control b/debian/tests/control
deleted file mode 100644
index d746f15..0000000
--- a/debian/tests/control
+++ /dev/null
@@ -1,5 +0,0 @@
-Tests: run-unit-test
-Depends: @, r-cran-testthat
-Restrictions: allow-stderr
-
-
diff --git a/debian/tests/run-unit-test b/debian/tests/run-unit-test
deleted file mode 100644
index b29f739..0000000
--- a/debian/tests/run-unit-test
+++ /dev/null
@@ -1,17 +0,0 @@
-#!/bin/sh -e
-
-pkgname=urltools
-debname=r-cran-urltools
-
-if [ "$ADTTMP" = "" ] ; then
-    ADTTMP=`mktemp -d /tmp/${debname}-test.XXXXXX`
-    trap "rm -rf $ADTTMP" 0 INT QUIT ABRT PIPE TERM
-fi
-cd $ADTTMP
-cp -a /usr/share/doc/$debname/tests/* $ADTTMP
-gunzip -r *
-for testfile in *.R; do
-    echo "BEGIN TEST $testfile"
-    LC_ALL=C R --no-save < $testfile
-done
-
diff --git a/debian/watch b/debian/watch
deleted file mode 100644
index 7bfc7a0..0000000
--- a/debian/watch
+++ /dev/null
@@ -1,2 +0,0 @@
-version=4
-https://cran.r-project.org/src/contrib/urltools_([-\d.]*)\.tar\.gz
diff --git a/inst/doc/urltools.R b/inst/doc/urltools.R
new file mode 100644
index 0000000..c803438
--- /dev/null
+++ b/inst/doc/urltools.R
@@ -0,0 +1,86 @@
+## ---- eval=FALSE---------------------------------------------------------
+#  URLdecode("test%gIL")
+#  Error in rawToChar(out) : embedded nul in string: '\0L'
+#  In addition: Warning message:
+#  In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+
+## ---- eval=FALSE---------------------------------------------------------
+#  URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+#  [1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  library(urltools)
+#  url_decode("test%gIL")
+#  [1] "test"
+#  url_encode("https://en.wikipedia.org/wiki/Article")
+#  [1] "https://en.wikipedia.org%2fwiki%2fArticle"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  > parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+#  > str(parsed_address)
+#  'data.frame':	1 obs. of  6 variables:
+#   $ scheme   : chr "https"
+#   $ domain   : chr "en.wikipedia.org"
+#   $ port     : chr NA
+#   $ path     : chr "wiki/Article"
+#   $ parameter: chr NA
+#   $ fragment : chr NA
+
+## ---- eval=FALSE---------------------------------------------------------
+#  > url_compose(parsed_address)
+#  [1] "https://en.wikipedia.org/wiki/article"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  url <- "https://en.wikipedia.org/wiki/Article"
+#  scheme(url)
+#  "https"
+#  scheme(url) <- "ftp"
+#  url
+#  "ftp://en.wikipedia.org/wiki/Article"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  > url <- "https://en.wikipedia.org/wiki/Article"
+#  > domain_name <- domain(url)
+#  > domain_name
+#  [1] "en.wikipedia.org"
+#  > str(suffix_extract(domain_name))
+#  'data.frame':	1 obs. of  4 variables:
+#   $ host     : chr "en.wikipedia.org"
+#   $ subdomain: chr "en"
+#   $ domain   : chr "wikipedia"
+#   $ suffix      : chr "org"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+#  updated_suffixes <- suffix_refresh()
+#  suffix_extract(domain_name, updated_suffixes)
+
+## ---- eval=FALSE---------------------------------------------------------
+#  domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+#  host_extract(domain_name)
+
+## ---- eval=FALSE---------------------------------------------------------
+#  > str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+#                       parameter_names = c("pageid","export")))
+#  'data.frame':	1 obs. of  2 variables:
+#   $ pageid: chr "1023"
+#   $ export: chr "json"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+#  url <- param_set(url, key = "pageid", value = "12")
+#  url
+#  # [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  url <- "http://en.wikipedia.org/wiki/api.php"
+#  url <- param_set(url, key = "pageid", value = "12")
+#  url
+#  # [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+
+## ---- eval=FALSE---------------------------------------------------------
+#  url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+#  url <- param_remove(url, keys = c("action","export"))
+#  url
+#  # [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+
diff --git a/inst/doc/urltools.Rmd b/inst/doc/urltools.Rmd
new file mode 100644
index 0000000..beb3db3
--- /dev/null
+++ b/inst/doc/urltools.Rmd
@@ -0,0 +1,182 @@
+<!--
+%\VignetteEngine{knitr::knitr}
+%\VignetteIndexEntry{urltools}
+-->
+
+## Elegant URL handling with urltools
+
+URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
+to create connections to retrieve datasets. This is an essential feature for the language to have,
+but it also means that URL handlers are designed for situations where URLs *get* you to the data - 
+not situations where URLs *are* the data.
+
+There is no support for encoding or decoding URLs en-masse, and no support for parsing and
+interpreting them. `urltools` provides this support!
+
+### URL encoding and decoding
+
+Base R provides two functions - <code>URLdecode</code> and <code>URLencode</code> - for taking percentage-encoded
+URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
+enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
+make them unsuitable for large datasets.
+
+Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
+<code>URLdecode</code>, for example, breaks if the decoded value is out of range:
+
+```{r, eval=FALSE}
+URLdecode("test%gIL")
+Error in rawToChar(out) : embedded nul in string: '\0L'
+In addition: Warning message:
+In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+```
+
+URLencode, on the other hand, encodes slashes on its most strict setting - without
+paying attention to where those slashes *are*: if we attempt to URLencode an entire URL, we get:
+
+```{r, eval=FALSE}
+URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+```
+That's a completely unusable URL (or ewRL, if you will).
+
+urltools replaces both functions with <code>url\_decode</code> and <code>url\_encode</code> respectively:
+```{r, eval=FALSE}
+library(urltools)
+url_decode("test%gIL")
+[1] "test"
+url_encode("https://en.wikipedia.org/wiki/Article")
+[1] "https://en.wikipedia.org%2fwiki%2fArticle"
+```
+
+As you can see, <code>url\_decode</code> simply excludes out-of-range characters from consideration, while <code>url\_encode</code> detects characters that make up part of the URL's scheme, and leaves them unencoded. Both are extremely fast; with `urltools`, you can
+decode a vector of 1,000,000 URLs in 0.9 seconds.
+
+Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and have unicode characters in them. These also take one argument, a vector of URLs, and can be found at `puny_encode` and `puny_decode` respectively.
+
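+As a quick sketch - borrowing the example URLs from the `puny_encode` help page - encoding and decoding are symmetric:
+```{r, eval=FALSE}
+puny_encode("https://www.bücher.com/foo")
+# [1] "https://www.xn--bcher-kva.com/foo"
+puny_decode("https://www.xn--bcher-kva.com/foo")
+# [1] "https://www.bücher.com/foo"
+```
+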
+### URL parsing
+
+Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
+you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
+but not the entire thing as one string.
+
+The solution is <code>url_parse</code>, which takes a URL and breaks it out into its [RFC 3986](http://www.ietf.org/rfc/rfc3986.txt) components: scheme, domain, port, path, query string and fragment identifier. This is,
+again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
+results are provided as a data.frame, since most people use data.frames to store data.
+
+```{r, eval=FALSE}
+> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+> str(parsed_address)
+'data.frame':	1 obs. of  6 variables:
+ $ scheme   : chr "https"
+ $ domain   : chr "en.wikipedia.org"
+ $ port     : chr NA
+ $ path     : chr "wiki/Article"
+ $ parameter: chr NA
+ $ fragment : chr NA                         
+```
+
+We can also perform the opposite of this operation with `url_compose`:
+```{r, eval=FALSE}
+> url_compose(parsed_address)
+[1] "https://en.wikipedia.org/wiki/article"
+```
+
+### Getting/setting URL components
+With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
+and setting. Syntax is identical to that of `lubridate`, but uses URL components as function names.
+
+```{r, eval=FALSE}
+url <- "https://en.wikipedia.org/wiki/Article"
+scheme(url)
+"https"
+scheme(url) <- "ftp"
+url
+"ftp://en.wikipedia.org/wiki/Article"
+```
+Fields that can be extracted or set are <code>scheme</code>, <code>domain</code>, <code>port</code>, <code>path</code>,
+<code>parameters</code> and <code>fragment</code>.
+
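+The other accessors follow exactly the same pattern; for instance, for fragments (this mirrors the example in the
+`fragment()` help page):
+```{r, eval=FALSE}
+example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
+fragment(example_url)
+"test"
+fragment(example_url) <- "production"
+```
+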
+### Suffix and TLD extraction
+
+Once we've extracted a domain from a URL with `domain` or `url_parse`, we can identify which bit is the domain name, and which
+bit is the suffix:
+
+```{r, eval=FALSE}
+> url <- "https://en.wikipedia.org/wiki/Article"
+> domain_name <- domain(url)
+> domain_name
+[1] "en.wikipedia.org"
+> str(suffix_extract(domain_name))
+'data.frame':	1 obs. of  4 variables:
+ $ host     : chr "en.wikipedia.org"
+ $ subdomain: chr "en"
+ $ domain   : chr "wikipedia"
+ $ suffix      : chr "org"
+```
+
+This relies on an internal database of public suffixes, accessible at `suffix_dataset` - we recognise, though,
+that this dataset may get a bit out of date, so you can also pass the results of the `suffix_refresh` function,
+which retrieves an updated dataset, to `suffix_extract`:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_suffixes <- suffix_refresh()
+suffix_extract(domain_name, updated_suffixes)
+```
+
+We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are `tld_refresh`, `tld_extract` and `tld_dataset`.
+
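+As a minimal sketch of the TLD equivalent, reusing the same Wikipedia URL as above:
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+tld_extract(domain_name)
+# a data.frame with columns "domain" and "tld"; here the TLD is "org"
+tld_extract(domain_name, tld_refresh())
+```
+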
+In the other direction we have `host_extract`, which retrieves, well, the host! If the URL has subdomains, it'll be the
+lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+host_extract(domain_name)
+```
+### Query manipulation
+Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
+an example, take the URL `http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json`. What
+pageID is being used? What is the export format? We can find out with `param_get`.
+
+```{r, eval=FALSE}
+> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+                     parameter_names = c("pageid","export")))
+'data.frame':	1 obs. of  2 variables:
+ $ pageid: chr "1023"
+ $ export: chr "json"
+```
+
+This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
+might have, or strip them out entirely.
+
+To modify the values, we use `param_set`:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+```
+
+As you can see this works pretty well; it even works in situations where the URL doesn't *have* a query yet:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+```
+
+On the other hand we might have a parameter we just don't want any more - that can be handled with `param_remove`, which can
+take multiple parameters as well as multiple URLs:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_remove(url, keys = c("action","export"))
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+```
+
+### Other URL handlers
+If you have ideas for other URL handlers that would make your data processing easier, the best approach
+is to either [request it](https://github.com/Ironholds/urltools/issues) or [add it](https://github.com/Ironholds/urltools/pulls)!
diff --git a/inst/doc/urltools.html b/inst/doc/urltools.html
new file mode 100644
index 0000000..9a50878
--- /dev/null
+++ b/inst/doc/urltools.html
@@ -0,0 +1,384 @@
+<!DOCTYPE html>
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
+
+<title>URL encoding and decoding</title>
+
+<script type="text/javascript">
+window.onload = function() {
+  var imgs = document.getElementsByTagName('img'), i, img;
+  for (i = 0; i < imgs.length; i++) {
+    img = imgs[i];
+    // center an image if it is the only element of its parent
+    if (img.parentElement.childElementCount === 1)
+      img.parentElement.style.textAlign = 'center';
+  }
+};
+</script>
+
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+   pre .operator,
+   pre .paren {
+     color: rgb(104, 118, 135)
+   }
+
+   pre .literal {
+     color: #990073
+   }
+
+   pre .number {
+     color: #099;
+   }
+
+   pre .comment {
+     color: #998;
+     font-style: italic
+   }
+
+   pre .keyword {
+     color: #900;
+     font-weight: bold
+   }
+
+   pre .identifier {
+     color: rgb(0, 0, 0);
+   }
+
+   pre .string {
+     color: #d14;
+   }
+</style>
+
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
+
+
+
+<style type="text/css">
+body, td {
+   font-family: sans-serif;
+   background-color: white;
+   font-size: 13px;
+}
+
+body {
+  max-width: 800px;
+  margin: auto;
+  padding: 1em;
+  line-height: 20px;
+}
+
+tt, code, pre {
+   font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
+
+h1 {
+   font-size:2.2em;
+}
+
+h2 {
+   font-size:1.8em;
+}
+
+h3 {
+   font-size:1.4em;
+}
+
+h4 {
+   font-size:1.0em;
+}
+
+h5 {
+   font-size:0.9em;
+}
+
+h6 {
+   font-size:0.8em;
+}
+
+a:visited {
+   color: rgb(50%, 0%, 50%);
+}
+
+pre, img {
+  max-width: 100%;
+}
+pre {
+  overflow-x: auto;
+}
+pre code {
+   display: block; padding: 0.5em;
+}
+
+code {
+  font-size: 92%;
+  border: 1px solid #ccc;
+}
+
+code[class] {
+  background-color: #F8F8F8;
+}
+
+table, td, th {
+  border: none;
+}
+
+blockquote {
+   color:#666666;
+   margin:0;
+   padding-left: 1em;
+   border-left: 0.5em #EEE solid;
+}
+
+hr {
+   height: 0px;
+   border-bottom: none;
+   border-top-width: thin;
+   border-top-style: dotted;
+   border-top-color: #999999;
+}
+
+@media print {
+   * {
+      background: transparent !important;
+      color: black !important;
+      filter:none !important;
+      -ms-filter: none !important;
+   }
+
+   body {
+      font-size:12pt;
+      max-width:100%;
+   }
+
+   a, a:visited {
+      text-decoration: underline;
+   }
+
+   hr {
+      visibility: hidden;
+      page-break-before: always;
+   }
+
+   pre, blockquote {
+      padding-right: 1em;
+      page-break-inside: avoid;
+   }
+
+   tr, img {
+      page-break-inside: avoid;
+   }
+
+   img {
+      max-width: 100% !important;
+   }
+
+   @page :left {
+      margin: 15mm 20mm 15mm 10mm;
+   }
+
+   @page :right {
+      margin: 15mm 10mm 15mm 20mm;
+   }
+
+   p, h2, h3 {
+      orphans: 3; widows: 3;
+   }
+
+   h2, h3 {
+      page-break-after: avoid;
+   }
+}
+</style>
+
+
+
+</head>
+
+<body>
+<!--
+%\VignetteEngine{knitr::knitr}
+%\VignetteIndexEntry{urltools}
+-->
+
+<h2>Elegant URL handling with urltools</h2>
+
+<p>URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
+to create connections to retrieve datasets. This is an essential feature for the language to have,
+but it also means that URL handlers are designed for situations where URLs <em>get</em> you to the data - 
+not situations where URLs <em>are</em> the data.</p>
+
+<p>There is no support for encoding or decoding URLs en-masse, and no support for parsing and
+interpreting them. <code>urltools</code> provides this support!</p>
+
+<h3>URL encoding and decoding</h3>
+
+<p>Base R provides two functions - <code>URLdecode</code> and <code>URLencode</code> - for taking percentage-encoded
+URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
+enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
+make them unsuitable for large datasets.</p>
+
+<p>Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
+<code>URLdecode</code>, for example, breaks if the decoded value is out of range:</p>
+
+<pre><code class="r">URLdecode("test%gIL")
+Error in rawToChar(out) : embedded nul in string: '\0L'
+In addition: Warning message:
+In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
+</code></pre>
+
+<p>URLencode, on the other hand, encodes slashes on its most strict setting - without
+paying attention to where those slashes <em>are</em>: if we attempt to URLencode an entire URL, we get:</p>
+
+<pre><code class="r">URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+</code></pre>
+
+<p>That's a completely unusable URL (or ewRL, if you will).</p>
+
+<p>urltools replaces both functions with <code>url_decode</code> and <code>url_encode</code> respectively:</p>
+
+<pre><code class="r">library(urltools)
+url_decode("test%gIL")
+[1] "test"
+url_encode("https://en.wikipedia.org/wiki/Article")
+[1] "https://en.wikipedia.org%2fwiki%2fArticle"
+</code></pre>
+
+<p>As you can see, <code>url_decode</code> simply excludes out-of-range characters from consideration, while <code>url_encode</code> detects characters that make up part of the URL's scheme, and leaves them unencoded. Both are extremely fast; with <code>urltools</code>, you can
+decode a vector of 1,000,000 URLs in 0.9 seconds.</p>
+
+<p>Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and have unicode characters in them. These also take one argument, a vector of URLs, and can be found at <code>puny_encode</code> and <code>puny_decode</code> respectively.</p>
+
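+<p>As a quick sketch - borrowing the example URLs from the <code>puny_encode</code> help page - encoding and
+decoding are symmetric:</p>
+
+<pre><code class="r">puny_encode("https://www.bücher.com/foo")
+# [1] "https://www.xn--bcher-kva.com/foo"
+puny_decode("https://www.xn--bcher-kva.com/foo")
+# [1] "https://www.bücher.com/foo"
+</code></pre>
+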
+<h3>URL parsing</h3>
+
+<p>Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
+you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
+but not the entire thing as one string.</p>
+
+<p>The solution is <code>url_parse</code>, which takes a URL and breaks it out into its <a href="http://www.ietf.org/rfc/rfc3986.txt">RFC 3986</a> components: scheme, domain, port, path, query string and fragment identifier. This is,
+again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
+results are provided as a data.frame, since most people use data.frames to store data.</p>
+
+<pre><code class="r">> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+> str(parsed_address)
+'data.frame':   1 obs. of  6 variables:
+ $ scheme   : chr "https"
+ $ domain   : chr "en.wikipedia.org"
+ $ port     : chr NA
+ $ path     : chr "wiki/Article"
+ $ parameter: chr NA
+ $ fragment : chr NA                         
+</code></pre>
+
+<p>We can also perform the opposite of this operation with <code>url_compose</code>:</p>
+
+<pre><code class="r">> url_compose(parsed_address)
+[1] "https://en.wikipedia.org/wiki/article"
+</code></pre>
+
+<h3>Getting/setting URL components</h3>
+
+<p>With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
+and setting. Syntax is identical to that of <code>lubridate</code>, but uses URL components as function names.</p>
+
+<pre><code class="r">url <- "https://en.wikipedia.org/wiki/Article"
+scheme(url)
+"https"
+scheme(url) <- "ftp"
+url
+"ftp://en.wikipedia.org/wiki/Article"
+</code></pre>
+
+<p>Fields that can be extracted or set are <code>scheme</code>, <code>domain</code>, <code>port</code>, <code>path</code>,
+<code>parameters</code> and <code>fragment</code>.</p>
+
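+<p>The other accessors follow exactly the same pattern; for instance, for fragments (this mirrors the example in the
+<code>fragment()</code> help page):</p>
+
+<pre><code class="r">example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
+fragment(example_url)
+"test"
+fragment(example_url) <- "production"
+</code></pre>
+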
+<h3>Suffix and TLD extraction</h3>
+
+<p>Once we've extracted a domain from a URL with <code>domain</code> or <code>url_parse</code>, we can identify which bit is the domain name, and which
+bit is the suffix:</p>
+
+<pre><code class="r">> url <- "https://en.wikipedia.org/wiki/Article"
+> domain_name <- domain(url)
+> domain_name
+[1] "en.wikipedia.org"
+> str(suffix_extract(domain_name))
+'data.frame':   1 obs. of  4 variables:
+ $ host     : chr "en.wikipedia.org"
+ $ subdomain: chr "en"
+ $ domain   : chr "wikipedia"
+ $ suffix      : chr "org"
+</code></pre>
+
+<p>This relies on an internal database of public suffixes, accessible at <code>suffix_dataset</code> - we recognise, though,
+that this dataset may get a bit out of date, so you can also pass the results of the <code>suffix_refresh</code> function,
+which retrieves an updated dataset, to <code>suffix_extract</code>:</p>
+
+<pre><code class="r">domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_suffixes <- suffix_refresh()
+suffix_extract(domain_name, updated_suffixes)
+</code></pre>
+
+<p>We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are <code>tld_refresh</code>, <code>tld_extract</code> and <code>tld_dataset</code>.</p>
+
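+<p>As a minimal sketch of the TLD equivalent, reusing the same Wikipedia URL as above:</p>
+
+<pre><code class="r">domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+tld_extract(domain_name)
+# a data.frame with columns "domain" and "tld"; here the TLD is "org"
+tld_extract(domain_name, tld_refresh())
+</code></pre>
+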
+<p>In the other direction we have <code>host_extract</code>, which retrieves, well, the host! If the URL has subdomains, it'll be the
+lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:</p>
+
+<pre><code class="r">domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+host_extract(domain_name)
+</code></pre>
+
+<h3>Query manipulation</h3>
+
+<p>Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
+an example, take the URL <code>http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json</code>. What
+pageID is being used? What is the export format? We can find out with <code>param_get</code>.</p>
+
+<pre><code class="r">> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+                     parameter_names = c("pageid","export")))
+'data.frame':   1 obs. of  2 variables:
+ $ pageid: chr "1023"
+ $ export: chr "json"
+</code></pre>
+
+<p>This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
+might have, or strip them out entirely.</p>
+
+<p>To modify the values, we use <code>param_set</code>:</p>
+
+<pre><code class="r">url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+</code></pre>
+
+<p>As you can see this works pretty well; it even works in situations where the URL doesn't <em>have</em> a query yet:</p>
+
+<pre><code class="r">url <- "http://en.wikipedia.org/wiki/api.php"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+</code></pre>
+
+<p>On the other hand we might have a parameter we just don't want any more - that can be handled with <code>param_remove</code>, which can
+take multiple parameters as well as multiple URLs:</p>
+
+<pre><code class="r">url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_remove(url, keys = c("action","export"))
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+</code></pre>
+
+<h3>Other URL handlers</h3>
+
+<p>If you have ideas for other URL handlers that would make your data processing easier, the best approach
+is to either <a href="https://github.com/Ironholds/urltools/issues">request it</a> or <a href="https://github.com/Ironholds/urltools/pulls">add it</a>!</p>
+
+</body>
+
+</html>
diff --git a/man/domain.Rd b/man/domain.Rd
new file mode 100644
index 0000000..304fc8b
--- /dev/null
+++ b/man/domain.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{domain}
+\alias{domain}
+\alias{domain<-}
+\title{Get or set a URL's domain}
+\usage{
+domain(x)
+
+domain(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's domain.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted and set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org/submit.html"
+domain(example_url)
+
+#Set a component
+domain(example_url) <- "en.wikipedia.org"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{port}}, \code{\link{path}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/encoder.Rd b/man/encoder.Rd
new file mode 100644
index 0000000..8c7c418
--- /dev/null
+++ b/man/encoder.Rd
@@ -0,0 +1,66 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{url_decode}
+\alias{url_decode}
+\alias{url_encode}
+\title{Encode or decode a URI}
+\usage{
+url_decode(urls)
+
+url_encode(urls)
+}
+\arguments{
+\item{urls}{a vector of URLs to decode or encode.}
+}
+\value{
+a character vector containing the encoded (or decoded) versions of "urls".
+}
+\description{
+encodes or decodes a URI/URL
+}
+\details{
+URL encoding and decoding is an essential prerequisite to proper web interaction
+and data analysis around things like server-side logs. The
+\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RFC} mandates the percentage-encoding
+of non-Latin characters, including things like slashes, unless those are reserved.
+
+Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
+URL encoding - in theory. In practice, they have a set of substantial problems
+that the urltools implementation solves:
+
+\itemize{
+\item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
+      This means that, when confronted with a vector of URLs that need encoding or
+      decoding, your only option is to loop from within R. This can be incredibly
+      computationally costly with large datasets. url_encode and url_decode are
+      implemented in C++ and entirely vectorised, allowing for a substantial
+      performance improvement.}
+\item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
+      of making sure your URL no longer works. Because of this, the only thing
+      you can encode in URLencode (unless you refuse to encode reserved characters)
+      is a partial URL, lacking the initial scheme, which requires additional operations
+      to set up and increases the complexity of encoding or decoding. url_encode
+      detects the protocol and silently splits it off, leaving it unencoded to ensure
+      that the resulting URL is valid.}
+\item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
+      characters. Unfortunately, URLdecode's response to these characters is to convert
+      them to NULs, which R can't handle, at which point your URLdecode call breaks.
+      \code{url_decode} simply ignores them.}
+}
+}
+\examples{
+
+url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_\%28logo\%29.jpg")
+url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
+
+\dontrun{
+#A demonstrator of the contrasting behaviours around out-of-range characters
+URLdecode("\%gIL")
+url_decode("\%gIL")
+}
+}
+\seealso{
+\code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
+and encoding.
+}
+
diff --git a/man/fragment.Rd b/man/fragment.Rd
new file mode 100644
index 0000000..af3ec99
--- /dev/null
+++ b/man/fragment.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{fragment}
+\alias{fragment}
+\alias{fragment<-}
+\title{Get or set a URL's fragment}
+\usage{
+fragment(x)
+
+fragment(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's fragment.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted and set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true#test"
+fragment(example_url)
+
+#Set a component
+fragment(example_url) <- "production"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+\code{\link{path}} and \code{\link{parameters}} for other accessors.
+}
+
diff --git a/man/host_extract.Rd b/man/host_extract.Rd
new file mode 100644
index 0000000..16ac819
--- /dev/null
+++ b/man/host_extract.Rd
@@ -0,0 +1,32 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{host_extract}
+\alias{host_extract}
+\title{Extract hosts}
+\usage{
+host_extract(domains)
+}
+\arguments{
+\item{domains}{a vector of domains, retrieved through \code{\link{url_parse}} or
+\code{\link{domain}}.}
+}
+\value{
+a data.frame of two columns: \code{domain}, with the original domain names,
+and \code{host}, the identified host from the domain.
+}
+\description{
+\code{host_extract} extracts the host from
+a vector of domain names. A host isn't the same as a domain - it could be
+the subdomain, if there are one or more subdomains. The host of \code{en.wikipedia.org}
+is \code{en}, while the host of \code{wikipedia.org} is \code{wikipedia}.
+}
+\examples{
+# With subdomains
+has_subdomain <- domain("https://en.wikipedia.org/wiki/Main_Page")
+host_extract(has_subdomain)
+
+# Without
+no_subdomain <- domain("https://ironholds.org/projects/r_shiny/")
+host_extract(no_subdomain)
+}
+
diff --git a/man/param_get.Rd b/man/param_get.Rd
new file mode 100644
index 0000000..8e4bd6e
--- /dev/null
+++ b/man/param_get.Rd
@@ -0,0 +1,38 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R, R/urltools.R
+\name{param_get}
+\alias{param_get}
+\alias{url_parameter}
+\alias{url_parameters}
+\title{get the values of a URL's parameters}
+\usage{
+param_get(urls, parameter_names)
+
+url_parameters(urls, parameter_names)
+}
+\arguments{
+\item{urls}{a vector of URLs}
+
+\item{parameter_names}{a vector of parameter names}
+}
+\value{
+a data.frame containing one column for each provided parameter name. Values that
+cannot be found within a particular URL are represented by an NA.
+}
+\description{
+URLs can have parameters, taking the form of \code{name=value}, chained together
+with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
+of parameter names, will generate a data.frame consisting of the values of each parameter
+for each URL.
+}
+\examples{
+#A very simple example
+url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
+parameter_values <- param_get(url, c("this_parameter","hiphop"))
+
+}
+\seealso{
+\code{\link{url_parse}} for decomposing URLs into their constituent parts and
+\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
+}
+
diff --git a/man/param_remove.Rd b/man/param_remove.Rd
new file mode 100644
index 0000000..0168d6f
--- /dev/null
+++ b/man/param_remove.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{param_remove}
+\alias{param_remove}
+\title{Remove key-value pairs from query strings}
+\usage{
+param_remove(urls, keys)
+}
+\arguments{
+\item{urls}{a vector of URLs. These should be decoded with \code{url_decode} but don't
+have to have been otherwise processed.}
+
+\item{keys}{a vector of parameter keys to remove.}
+}
+\value{
+the original URLs but with the key/value pairs specified by \code{keys} removed.
+If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
+nothing will be done with it.
+}
+\description{
+URLs often have queries associated with them, particularly URLs for
+APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
+allows you to remove key/value pairs while leaving the rest of the URL intact.
+}
+\examples{
+# Remove multiple parameters from a URL
+param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
+            keys = c("action","format"))
+}
+\seealso{
+\code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
+to retrieve those values.
+}
+
diff --git a/man/param_set.Rd b/man/param_set.Rd
new file mode 100644
index 0000000..6959bba
--- /dev/null
+++ b/man/param_set.Rd
@@ -0,0 +1,42 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{param_set}
+\alias{param_set}
+\title{Set the value associated with a parameter in a URL's query.}
+\usage{
+param_set(urls, key, value)
+}
+\arguments{
+\item{urls}{a vector of URLs. These should be decoded (with \code{url_decode})
+but do not have to have been otherwise manipulated.}
+
+\item{key}{a string representing the key to modify the value of (or insert wholesale
+if it doesn't exist within the URL).}
+
+\item{value}{a value to associate with the key. This can be a single string,
+or a vector the same length as \code{urls}}
+}
+\value{
+the original vector of URLs, but with modified/inserted key-value pairs. If the
+URL is \code{NA}, the returned value will be \code{NA} too; if the key or value is \code{NA}, no insertion
+will be made.
+}
+\description{
+URLs often have queries associated with them, particularly URLs for
+APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
+allows you to modify key/value pairs within query strings, or even add new ones
+if they don't exist within the URL.
+}
+\examples{
+# Set a URL parameter where there's already a key for that
+param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
+
+# Set a URL parameter where there isn't.
+param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
+
+}
+\seealso{
+\code{\link{param_get}} to retrieve the values associated with multiple keys in
+a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
+}
+
diff --git a/man/parameters.Rd b/man/parameters.Rd
new file mode 100644
index 0000000..df3d677
--- /dev/null
+++ b/man/parameters.Rd
@@ -0,0 +1,35 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{parameters}
+\alias{parameters}
+\alias{parameters<-}
+\title{Get or set a URL's parameters}
+\usage{
+parameters(x)
+
+parameters(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's parameters.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted and set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
+parameters(example_url)
+#[1] "debug=true"
+
+#Set a component
+parameters(example_url) <- "debug=false"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+\code{\link{path}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/path.Rd b/man/path.Rd
new file mode 100644
index 0000000..d5d870f
--- /dev/null
+++ b/man/path.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{path}
+\alias{path}
+\alias{path<-}
+\title{Get or set a URL's path}
+\usage{
+path(x)
+
+path(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's path}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted and set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org:80/submit.html"
+path(example_url)
+
+#Set a component
+path(example_url) <- "bin/windows/"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{port}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/port.Rd b/man/port.Rd
new file mode 100644
index 0000000..20901ce
--- /dev/null
+++ b/man/port.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{port}
+\alias{port}
+\alias{port<-}
+\title{Get or set a URL's port}
+\usage{
+port(x)
+
+port(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's port.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted and set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org:80/submit.html"
+port(example_url)
+
+#Set a component
+port(example_url) <- "12"
+}
+\seealso{
+\code{\link{scheme}}, \code{\link{domain}}, \code{\link{path}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/puny.Rd b/man/puny.Rd
new file mode 100644
index 0000000..2ce6e14
--- /dev/null
+++ b/man/puny.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{puny_encode}
+\alias{puny_decode}
+\alias{puny_encode}
+\title{Encode or Decode Internationalised Domains}
+\usage{
+puny_encode(x)
+
+puny_decode(x)
+}
+\arguments{
+\item{x}{a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.}
+}
+\value{
+a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
+Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
+decoded or encoded version) will be returned as \code{NA}.
+}
+\description{
+\code{puny_encode} and \code{puny_decode} implement
+the encoding standard for internationalised (non-ASCII) domains and
+subdomains. You can use them to encode UTF-8 domain names, or decode
+encoded names (which start with "xn--"), or both.
+}
+\examples{
+# Encode a URL
+puny_encode("https://www.bücher.com/foo")
+
+# Decode the result, back to the original
+puny_decode("https://www.xn--bcher-kva.com/foo")
+
+}
+\seealso{
+\code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
+}
+
diff --git a/man/scheme.Rd b/man/scheme.Rd
new file mode 100644
index 0000000..4bd9858
--- /dev/null
+++ b/man/scheme.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/accessors.R
+\name{scheme}
+\alias{scheme}
+\alias{scheme<-}
+\title{Get or set a URL's scheme}
+\usage{
+scheme(x)
+
+scheme(x) <- value
+}
+\arguments{
+\item{x}{a URL, or vector of URLs}
+
+\item{value}{a replacement value for x's scheme.}
+}
+\description{
+as in the lubridate package, individual components of a URL
+can be both extracted and set using the relevant function call - see the
+examples.
+}
+\examples{
+#Get a component
+example_url <- "http://cran.r-project.org/submit.html"
+scheme(example_url)
+
+#Set a component
+scheme(example_url) <- "https"
+
+# NA out the URL
+scheme(example_url) <- NA_character_
+}
+\seealso{
+\code{\link{domain}}, \code{\link{port}}, \code{\link{path}},
+\code{\link{parameters}} and \code{\link{fragment}} for other accessors.
+}
+
diff --git a/man/suffix_dataset.Rd b/man/suffix_dataset.Rd
new file mode 100644
index 0000000..61d36c2
--- /dev/null
+++ b/man/suffix_dataset.Rd
@@ -0,0 +1,28 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\docType{data}
+\name{suffix_dataset}
+\alias{suffix_dataset}
+\title{Dataset of public suffixes}
+\format{A data.frame of 8030 rows and 2 columns}
+\usage{
+data(suffix_dataset)
+}
+\description{
+This dataset contains a registry of public suffixes, as retrieved from
+and defined by the \href{https://publicsuffix.org/}{public suffix list}. It is
+sorted by how many periods (".") appear in the suffix, to optimise it for
+\code{\link{suffix_extract}}.  It is a data.frame with two columns, the first is
+the list of suffixes and the second is our best guess at the comment or owner 
+associated with the particular suffix.
+}
+\note{
+Last updated 2016-07-31.
+}
+\seealso{
+\code{\link{suffix_extract}} for extracting suffixes from domain names,
+and \code{\link{suffix_refresh}} for getting a new, totally-up-to-date dataset
+version.
+}
+\keyword{datasets}
+
diff --git a/man/suffix_extract.Rd b/man/suffix_extract.Rd
new file mode 100644
index 0000000..95c9a10
--- /dev/null
+++ b/man/suffix_extract.Rd
@@ -0,0 +1,53 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{suffix_extract}
+\alias{suffix_extract}
+\title{extract the suffix from domain names}
+\usage{
+suffix_extract(domains, suffixes = NULL)
+}
+\arguments{
+\item{domains}{a vector of domains, from \code{\link{domain}}
+or \code{\link{url_parse}}. Alternately, full URLs can be provided
+and will then be run through \code{\link{domain}} internally.}
+
+\item{suffixes}{a dataset of suffixes. By default, this is NULL and the function
+relies on \code{\link{suffix_dataset}}. Optionally, if you want more updated
+suffix data, you can provide the result of \code{\link{suffix_refresh}} for
+this parameter.}
+}
+\value{
+a data.frame of four columns, "host", "subdomain", "domain" & "suffix".
+"host" is what was passed in. "subdomain" is the subdomain of the suffix.
+"domain" contains the part of the domain name that came before the matched suffix.
+"suffix" is, well, the suffix.
+}
+\description{
+domain names have suffixes - common endings that people
+can or could register domains under. This includes things like ".org", but
+also things like ".edu.co". A simple Top Level Domain list, as a
+result, probably won't cut it.
+
+\code{\link{suffix_extract}} takes the list of public suffixes,
+as maintained by Mozilla (see \code{\link{suffix_dataset}}) and
+a vector of domain names, and produces a data.frame containing the
+suffix that each domain uses, and the remaining fragment.
+}
+\examples{
+
+# Using url_parse
+domain_name <- url_parse("http://en.wikipedia.org")$domain
+suffix_extract(domain_name)
+
+# Using domain()
+domain_name <- domain("http://en.wikipedia.org")
+suffix_extract(domain_name)
+
+#Relying on a fresh version of the suffix dataset
+suffix_extract(domain("http://en.wikipedia.org"), suffix_refresh())
+
+}
+\seealso{
+\code{\link{suffix_dataset}} for the dataset of suffixes.
+}
+
diff --git a/man/suffix_refresh.Rd b/man/suffix_refresh.Rd
new file mode 100644
index 0000000..3c6d4d9
--- /dev/null
+++ b/man/suffix_refresh.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{suffix_refresh}
+\alias{suffix_refresh}
+\title{Retrieve a public suffix dataset}
+\usage{
+suffix_refresh()
+}
+\value{
+a dataset equivalent in format to \code{\link{suffix_dataset}}.
+}
+\description{
+\code{urltools} comes with an inbuilt
+dataset of public suffixes, \code{\link{suffix_dataset}}.
+This is used in \code{\link{suffix_extract}} to identify the top-level domain
+within a particular domain name.
+
+While updates to the dataset will be included in each new package release,
+there's going to be a gap between changes to the suffixes list and changes to the package.
+Accordingly, the package also includes \code{suffix_refresh}, which generates
+and returns a \emph{fresh} version of the dataset. This can then be passed through
+to \code{\link{suffix_extract}}.
+}
+\examples{
+\dontrun{
+new_suffixes <- suffix_refresh()
+}
+
+}
+\seealso{
+\code{\link{suffix_extract}} to extract suffixes from domain names,
+or \code{\link{suffix_dataset}} for the inbuilt, default version of the data.
+}
+
diff --git a/man/tld_dataset.Rd b/man/tld_dataset.Rd
new file mode 100644
index 0000000..20d0409
--- /dev/null
+++ b/man/tld_dataset.Rd
@@ -0,0 +1,23 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\docType{data}
+\name{tld_dataset}
+\alias{tld_dataset}
+\title{Dataset of top-level domains (TLDs)}
+\format{A vector of 1275 elements.}
+\usage{
+data(tld_dataset)
+}
+\description{
+This dataset contains a registry of top-level domains, as retrieved from
+and defined by the \href{http://data.iana.org/TLD/tlds-alpha-by-domain.txt}{IANA}.
+}
+\note{
+Last updated 2016-07-20.
+}
+\seealso{
+\code{\link{tld_extract}} for extracting TLDs from domain names,
+and \code{\link{tld_refresh}} to get an updated version of this dataset.
+}
+\keyword{datasets}
+
diff --git a/man/tld_extract.Rd b/man/tld_extract.Rd
new file mode 100644
index 0000000..593f659
--- /dev/null
+++ b/man/tld_extract.Rd
@@ -0,0 +1,40 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{tld_extract}
+\alias{tld_extract}
+\title{Extract TLDs}
+\usage{
+tld_extract(domains, tlds = NULL)
+}
+\arguments{
+\item{domains}{a vector of domains, retrieved through \code{\link{url_parse}} or
+\code{\link{domain}}.}
+
+\item{tlds}{a dataset of TLDs. If NULL (the default), \code{tld_extract} relies
+on urltools' \code{\link{tld_dataset}}; otherwise, you can pass in the result of
+\code{\link{tld_refresh}}.}
+}
+\value{
+a data.frame of two columns: \code{domain}, with the original domain names,
+and \code{tld}, the identified TLD from the domain.
+}
+\description{
+\code{tld_extract} extracts the top-level domain (TLD) from
+a vector of domain names. This is distinct from the suffixes, extracted with
+\code{\link{suffix_extract}}; TLDs are \emph{top} level, while suffixes are just
+domains through which internet users can publicly register domains (the difference
+between \code{.org.uk} and \code{.uk}).
+}
+\examples{
+# Using the inbuilt dataset
+domains <- domain("https://en.wikipedia.org/wiki/Main_Page")
+tld_extract(domains)
+
+# Using a refreshed one
+tld_extract(domains, tld_refresh())
+
+}
+\seealso{
+\code{\link{suffix_extract}} for retrieving suffixes (distinct from TLDs).
+}
+
diff --git a/man/tld_refresh.Rd b/man/tld_refresh.Rd
new file mode 100644
index 0000000..40e3fcd
--- /dev/null
+++ b/man/tld_refresh.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/suffix.R
+\name{tld_refresh}
+\alias{tld_refresh}
+\title{Retrieve a TLD dataset}
+\usage{
+tld_refresh()
+}
+\value{
+a dataset equivalent in format to \code{\link{tld_dataset}}.
+}
+\description{
+\code{urltools} comes with an inbuilt
+dataset of top level domains (TLDs), \code{\link{tld_dataset}}.
+This is used in \code{\link{tld_extract}} to identify the top-level domain
+within a particular domain name.
+
+While updates to the dataset will be included in each new package release,
+there's going to be a gap between changes to TLDs and changes to the package.
+Accordingly, the package also includes \code{tld_refresh}, which generates
+and returns a \emph{fresh} version of the dataset. This can then be passed through
+to \code{\link{tld_extract}}.
+}
+\examples{
+\dontrun{
+new_tlds <- tld_refresh()
+}
+
+}
+\seealso{
+\code{\link{tld_extract}} to extract TLDs from domain names,
+or \code{\link{tld_dataset}} for the inbuilt, default version of the data.
+}
+
diff --git a/man/url_compose.Rd b/man/url_compose.Rd
new file mode 100644
index 0000000..99cb400
--- /dev/null
+++ b/man/url_compose.Rd
@@ -0,0 +1,30 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{url_compose}
+\alias{url_compose}
+\title{Recompose Parsed URLs}
+\usage{
+url_compose(parsed_urls)
+}
+\arguments{
+\item{parsed_urls}{a data.frame sourced from \code{\link{url_parse}}}
+}
+\description{
+Sometimes you want to take a vector of URLs, parse them, perform
+some operations and then rebuild them. \code{url_compose} takes a data.frame produced
+by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
+as the vector initially thrown into url_parse).
+
+This is currently a `beta` feature; please do report bugs if you find them.
+}
+\examples{
+#Parse a URL and compose it
+url <- "http://en.wikipedia.org"
+url_compose(url_parse(url))
+
+}
+\seealso{
+\code{\link{scheme}} and other accessors, which you may want to
+run URLs through before composing them to modify individual values.
+}
+
diff --git a/man/url_parse.Rd b/man/url_parse.Rd
new file mode 100644
index 0000000..9217c8e
--- /dev/null
+++ b/man/url_parse.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/RcppExports.R
+\name{url_parse}
+\alias{url_parse}
+\title{split URLs into their component parts}
+\usage{
+url_parse(urls)
+}
+\arguments{
+\item{urls}{a vector of URLs}
+}
+\value{
+a data.frame consisting of the columns scheme, domain, port, path, query
+and fragment. See the \href{http://tools.ietf.org/html/rfc3986}{relevant IETF RFC} for
+definitions. If an element cannot be identified, it is represented by an empty string.
+}
+\description{
+\code{url_parse} takes a vector of URLs and splits each one into its component
+parts, as recognised by RFC 3986.
+}
+\details{
+It's useful to be able to take a URL and split it out into its component parts - 
+for the purpose of hostname extraction, for example, or analysing API calls. This functionality
+is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
+implementation is entirely in R, uses regular expressions, and is not vectorised. It's
+perfectly suitable for the intended purpose (decomposition in the context of automated
+HTTP requests from R), but not for large-scale analysis.
+}
+\examples{
+url_parse("https://en.wikipedia.org/wiki/Article")
+
+}
+\seealso{
+\code{\link{url_parameters}} for extracting values associated with particular keys in a URL's
+query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
+}
+
diff --git a/man/urltools.Rd b/man/urltools.Rd
new file mode 100644
index 0000000..64af9d6
--- /dev/null
+++ b/man/urltools.Rd
@@ -0,0 +1,17 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/urltools.R
+\docType{package}
+\name{urltools}
+\alias{urltools}
+\alias{urltools-package}
+\title{Tools for handling URLs}
+\description{
+This package provides functions for URL encoding and decoding,
+parsing, and parameter extraction, designed to be both fast and
+entirely vectorised. It is intended to be useful for people dealing with
+web-related datasets, such as server-side logs.
+}
+\seealso{
+the \href{https://CRAN.R-project.org/package=urltools/vignettes/urltools.html}{package vignette}.
+}
+
diff --git a/src/Makevars b/src/Makevars
new file mode 100644
index 0000000..5de939a
--- /dev/null
+++ b/src/Makevars
@@ -0,0 +1 @@
+PKG_CPPFLAGS = -UNDEBUG
diff --git a/src/RcppExports.cpp b/src/RcppExports.cpp
new file mode 100644
index 0000000..4e28c55
--- /dev/null
+++ b/src/RcppExports.cpp
@@ -0,0 +1,182 @@
+// Generated by using Rcpp::compileAttributes() -> do not edit by hand
+// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
+
+#include <Rcpp.h>
+
+using namespace Rcpp;
+
+// get_component_
+CharacterVector get_component_(CharacterVector urls, int component);
+RcppExport SEXP urltools_get_component_(SEXP urlsSEXP, SEXP componentSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    Rcpp::traits::input_parameter< int >::type component(componentSEXP);
+    rcpp_result_gen = Rcpp::wrap(get_component_(urls, component));
+    return rcpp_result_gen;
+END_RCPP
+}
+// set_component_
+CharacterVector set_component_(CharacterVector urls, int component, String new_value);
+RcppExport SEXP urltools_set_component_(SEXP urlsSEXP, SEXP componentSEXP, SEXP new_valueSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    Rcpp::traits::input_parameter< int >::type component(componentSEXP);
+    Rcpp::traits::input_parameter< String >::type new_value(new_valueSEXP);
+    rcpp_result_gen = Rcpp::wrap(set_component_(urls, component, new_value));
+    return rcpp_result_gen;
+END_RCPP
+}
+// param_get
+List param_get(CharacterVector urls, CharacterVector parameter_names);
+RcppExport SEXP urltools_param_get(SEXP urlsSEXP, SEXP parameter_namesSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    Rcpp::traits::input_parameter< CharacterVector >::type parameter_names(parameter_namesSEXP);
+    rcpp_result_gen = Rcpp::wrap(param_get(urls, parameter_names));
+    return rcpp_result_gen;
+END_RCPP
+}
+// param_set
+CharacterVector param_set(CharacterVector urls, String key, CharacterVector value);
+RcppExport SEXP urltools_param_set(SEXP urlsSEXP, SEXP keySEXP, SEXP valueSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    Rcpp::traits::input_parameter< String >::type key(keySEXP);
+    Rcpp::traits::input_parameter< CharacterVector >::type value(valueSEXP);
+    rcpp_result_gen = Rcpp::wrap(param_set(urls, key, value));
+    return rcpp_result_gen;
+END_RCPP
+}
+// param_remove
+CharacterVector param_remove(CharacterVector urls, CharacterVector keys);
+RcppExport SEXP urltools_param_remove(SEXP urlsSEXP, SEXP keysSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    Rcpp::traits::input_parameter< CharacterVector >::type keys(keysSEXP);
+    rcpp_result_gen = Rcpp::wrap(param_remove(urls, keys));
+    return rcpp_result_gen;
+END_RCPP
+}
+// puny_encode
+CharacterVector puny_encode(CharacterVector x);
+RcppExport SEXP urltools_puny_encode(SEXP xSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type x(xSEXP);
+    rcpp_result_gen = Rcpp::wrap(puny_encode(x));
+    return rcpp_result_gen;
+END_RCPP
+}
+// puny_decode
+CharacterVector puny_decode(CharacterVector x);
+RcppExport SEXP urltools_puny_decode(SEXP xSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type x(xSEXP);
+    rcpp_result_gen = Rcpp::wrap(puny_decode(x));
+    return rcpp_result_gen;
+END_RCPP
+}
+// reverse_strings
+CharacterVector reverse_strings(CharacterVector strings);
+RcppExport SEXP urltools_reverse_strings(SEXP stringsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type strings(stringsSEXP);
+    rcpp_result_gen = Rcpp::wrap(reverse_strings(strings));
+    return rcpp_result_gen;
+END_RCPP
+}
+// finalise_suffixes
+DataFrame finalise_suffixes(CharacterVector full_domains, CharacterVector suffixes, LogicalVector wildcard, LogicalVector is_suffix);
+RcppExport SEXP urltools_finalise_suffixes(SEXP full_domainsSEXP, SEXP suffixesSEXP, SEXP wildcardSEXP, SEXP is_suffixSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type full_domains(full_domainsSEXP);
+    Rcpp::traits::input_parameter< CharacterVector >::type suffixes(suffixesSEXP);
+    Rcpp::traits::input_parameter< LogicalVector >::type wildcard(wildcardSEXP);
+    Rcpp::traits::input_parameter< LogicalVector >::type is_suffix(is_suffixSEXP);
+    rcpp_result_gen = Rcpp::wrap(finalise_suffixes(full_domains, suffixes, wildcard, is_suffix));
+    return rcpp_result_gen;
+END_RCPP
+}
+// tld_extract_
+CharacterVector tld_extract_(CharacterVector domains);
+RcppExport SEXP urltools_tld_extract_(SEXP domainsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type domains(domainsSEXP);
+    rcpp_result_gen = Rcpp::wrap(tld_extract_(domains));
+    return rcpp_result_gen;
+END_RCPP
+}
+// host_extract_
+CharacterVector host_extract_(CharacterVector domains);
+RcppExport SEXP urltools_host_extract_(SEXP domainsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type domains(domainsSEXP);
+    rcpp_result_gen = Rcpp::wrap(host_extract_(domains));
+    return rcpp_result_gen;
+END_RCPP
+}
+// url_decode
+CharacterVector url_decode(CharacterVector urls);
+RcppExport SEXP urltools_url_decode(SEXP urlsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    rcpp_result_gen = Rcpp::wrap(url_decode(urls));
+    return rcpp_result_gen;
+END_RCPP
+}
+// url_encode
+CharacterVector url_encode(CharacterVector urls);
+RcppExport SEXP urltools_url_encode(SEXP urlsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    rcpp_result_gen = Rcpp::wrap(url_encode(urls));
+    return rcpp_result_gen;
+END_RCPP
+}
+// url_parse
+DataFrame url_parse(CharacterVector urls);
+RcppExport SEXP urltools_url_parse(SEXP urlsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
+    rcpp_result_gen = Rcpp::wrap(url_parse(urls));
+    return rcpp_result_gen;
+END_RCPP
+}
+// url_compose
+CharacterVector url_compose(DataFrame parsed_urls);
+RcppExport SEXP urltools_url_compose(SEXP parsed_urlsSEXP) {
+BEGIN_RCPP
+    Rcpp::RObject rcpp_result_gen;
+    Rcpp::RNGScope rcpp_rngScope_gen;
+    Rcpp::traits::input_parameter< DataFrame >::type parsed_urls(parsed_urlsSEXP);
+    rcpp_result_gen = Rcpp::wrap(url_compose(parsed_urls));
+    return rcpp_result_gen;
+END_RCPP
+}
diff --git a/src/accessors.cpp b/src/accessors.cpp
new file mode 100644
index 0000000..f365d4c
--- /dev/null
+++ b/src/accessors.cpp
@@ -0,0 +1,37 @@
+#include <Rcpp.h>
+#include "parsing.h"
+using namespace Rcpp;
+
+//[[Rcpp::export]]
+CharacterVector get_component_(CharacterVector urls, int component){
+  parsing p_inst;
+  unsigned int input_size = urls.size();
+  CharacterVector output(input_size);
+  for (unsigned int i = 0; i < input_size; ++i){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    if(urls[i] != NA_STRING){
+      output[i] = p_inst.get_component(Rcpp::as<std::string>(urls[i]), component);
+    } else {
+      output[i] = NA_STRING;
+    }
+  }
+  return output;
+}
+
+//[[Rcpp::export]]
+CharacterVector set_component_(CharacterVector urls, int component,
+                               String new_value){
+  parsing p_inst;
+  unsigned int input_size = urls.size();
+  CharacterVector output(input_size);
+  for (unsigned int i = 0; i < input_size; ++i){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    
+    output[i] = p_inst.set_component(Rcpp::as<std::string>(urls[i]), component, new_value);
+  }
+  return output;
+}
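
For reference, a minimal sketch of how these accessors map a component index to a URL part, assuming an Rcpp-enabled translation unit compiled alongside the file above; the helper name accessor_example and the example URL are hypothetical. The index follows the order produced by parsing::url_to_vector further down (0 = scheme, 1 = domain, 2 = port, 3 = path, 4 = parameter, 5 = fragment).

    #include <Rcpp.h>
    using namespace Rcpp;

    CharacterVector get_component_(CharacterVector urls, int component); // defined in accessors.cpp above

    // [[Rcpp::export]]
    CharacterVector accessor_example(){
      CharacterVector urls = CharacterVector::create("http://example.com/index.html");
      return get_component_(urls, 0); // component 0 is the scheme, so this returns "http"
    }
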
diff --git a/src/compose.cpp b/src/compose.cpp
new file mode 100644
index 0000000..c8983be
--- /dev/null
+++ b/src/compose.cpp
@@ -0,0 +1,68 @@
+#include "compose.h"
+
+bool compose::emptycheck(String element){
+  if(element == NA_STRING){
+    return false;
+  }
+  return true;
+}
+
+std::string compose::compose_single(String scheme, String domain, String port, String path,
+                                    String parameter, String fragment){
+  
+  std::string output;
+  
+  if(emptycheck(scheme)){
+    output += scheme;
+    output += "://";
+  }
+  
+  if(emptycheck(domain)){
+    output += domain;
+  }
+  
+  if(emptycheck(port)){
+      output += ":";
+      output += port;
+    }
+  
+  if(emptycheck(path)){
+    output += "/";
+    output += path;
+  }
+  
+  if(emptycheck(parameter)){
+    output += "?";
+    output += parameter;
+  }
+  
+  if(emptycheck(fragment)){
+    output += "#";
+    output += fragment;
+  }
+  
+  return output;
+}
+
+CharacterVector compose::compose_multiple(DataFrame parsed_urls){
+  
+  CharacterVector schemes = parsed_urls["scheme"];
+  CharacterVector domains = parsed_urls["domain"];
+  CharacterVector ports = parsed_urls["port"];
+  CharacterVector paths = parsed_urls["path"];
+  CharacterVector parameters = parsed_urls["parameter"];
+  CharacterVector fragments = parsed_urls["fragment"];
+  
+  unsigned int input_size = schemes.size();
+  CharacterVector output(input_size);
+  
+  for(unsigned int i = 0; i < input_size; i++){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    output[i] = compose_single(schemes[i], domains[i], ports[i], paths[i], parameters[i],
+                               fragments[i]);
+  }
+  
+  return output;
+}
diff --git a/src/compose.h b/src/compose.h
new file mode 100644
index 0000000..5fbd194
--- /dev/null
+++ b/src/compose.h
@@ -0,0 +1,58 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+#ifndef __COMPOSE_INCLUDED__
+#define __COMPOSE_INCLUDED__
+
+/**
+ * A class for recomposing parsed URLs
+ */
+class compose {
+  
+private:
+  
+  /**
+   * A function for briefly checking if a component is empty before doing anything
+   * with it
+   * 
+   * @param element an Rcpp String to check
+   * 
+   * @return false if the element is NA, true if it is not.
+   */
+  bool emptycheck(String element);
+  
+  /**
+   * A function for recomposing a single URL
+   * 
+   * @param scheme the scheme of the URL
+   * 
+   * @param domain the domain of the URL
+   * 
+   * @param port the port of the URL
+   * 
+   * @param path the path of the URL
+   * 
+   * @param parameter the parameter of the URL
+   * 
+   * @param fragment the fragment of the URL
+   * 
+   * @return an Rcpp String containing the recomposed URL
+   * 
+   * @seealso compose_multiple for the vectorised version
+   */
+  std::string compose_single(String scheme, String domain, String port, String path,
+                             String parameter, String fragment);
+  
+public:
+  
+  /**
+   * A function for recomposing a vector of URLs
+   * 
+   * @param parsed_urls a DataFrame provided by url_parse
+   * 
+   * @return a CharacterVector containing the recomposed URLs
+   */
+  CharacterVector compose_multiple(DataFrame parsed_urls);
+};
+
+#endif
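
To make the recomposition order concrete, here is a minimal usage sketch, assuming an Rcpp-enabled translation unit; the column values and the helper name compose_example are hypothetical. NA components are simply skipped, so the port, parameter and fragment drop out of the result.

    #include <Rcpp.h>
    #include "compose.h"
    using namespace Rcpp;

    // [[Rcpp::export]]
    CharacterVector compose_example(){
      // One parsed URL, in the column layout compose_multiple() expects.
      DataFrame parsed = DataFrame::create(
        _["scheme"]    = CharacterVector::create("https"),
        _["domain"]    = CharacterVector::create("en.wikipedia.org"),
        _["port"]      = CharacterVector::create(NA_STRING),
        _["path"]      = CharacterVector::create("wiki/Main_Page"),
        _["parameter"] = CharacterVector::create(NA_STRING),
        _["fragment"]  = CharacterVector::create(NA_STRING),
        _["stringsAsFactors"] = false);
      compose c;
      return c.compose_multiple(parsed); // "https://en.wikipedia.org/wiki/Main_Page"
    }
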
diff --git a/src/encoding.cpp b/src/encoding.cpp
new file mode 100644
index 0000000..d4f3540
--- /dev/null
+++ b/src/encoding.cpp
@@ -0,0 +1,92 @@
+#include <Rcpp.h>
+#include "encoding.h"
+using namespace Rcpp;
+
+char encoding::from_hex (char x){
+  if(x <= '9' && x >= '0'){
+    x -= '0';
+  } else if(x <= 'f' && x >= 'a'){
+    x -= ('a' - 10);
+  } else if(x <= 'F' && x >= 'A'){
+    x -= ('A' - 10);
+  } else {
+    x = 0;
+  }
+  return x;
+}
+
+std::string encoding::to_hex(char x){
+  
+  //Holding objects and output
+  char digit_1 = (x&0xF0)>>4;
+  char digit_2 = (x&0x0F);
+  std::string output;
+  
+  //Convert
+  if(0 <= digit_1 && digit_1 <= 9){
+    digit_1 += 48;
+  } else if(10 <= digit_1 && digit_1 <=15){
+    digit_1 += 97-10;
+  }
+  if(0 <= digit_2 && digit_2 <= 9){
+    digit_2 += 48;
+  } else if(10 <= digit_2 && digit_2 <= 15){
+    digit_2 += 97-10;
+  }
+  
+  output.append(&digit_1, 1);
+  output.append(&digit_2, 1);
+  return output;
+}
+
+std::string encoding::internal_url_decode(std::string url){
+  
+  //Create output object
+  std::string result;
+
+  //For each character...
+  for (std::string::size_type i = 0; i <  url.size(); ++i){
+    
+    //If it's a +, space
+    if (url[i] == '+'){
+      result += ' ';
+    } else if (url[i] == '%' && url.size() > i+2){
+      
+      //Escaped? Convert from hex and include it
+      char holding_1 = encoding::from_hex(url[i+1]);
+      char holding_2 = encoding::from_hex(url[i+2]);
+      char holding = (holding_1 << 4) | holding_2;
+      result += holding;
+      i += 2;
+      
+    } else { //Permitted? Include.
+      result += url[i];
+    }
+  }
+  
+  //Return
+  return result;
+}
+
+std::string encoding::internal_url_encode(std::string url){
+
+  //Note the unreserved characters, create an output string
+  std::string unreserved_characters = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._~-";
+  std::string output = "";
+  
+  //For each character..
+  for(int i=0; i < (signed) url.length(); i++){
+    
+    //If it's in the list of unreserved characters, just pass it through
+    if (unreserved_characters.find_first_of(url[i]) != std::string::npos){
+      output.append(&url[i], 1);
+    //Otherwise, append in an encoded form.
+    } else {
+      output.append("%");
+      output.append(to_hex(url[i]));
+    }
+  }
+  
+  //Return
+  return output;
+}
diff --git a/src/encoding.h b/src/encoding.h
new file mode 100644
index 0000000..88e2ac7
--- /dev/null
+++ b/src/encoding.h
@@ -0,0 +1,67 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+#ifndef __ENCODING_INCLUDED__
+#define __ENCODING_INCLUDED__
+
+/**
+ * A class for applying percent-encoding to
+ * arbitrary strings - optimised for URLs, obviously.
+ */
+class encoding{
+  
+  private:
+  
+    /**
+     * A function for taking a hexadecimal element and converting
+     * it to the equivalent non-hex value. Used in internal_url_decode
+     * 
+     * @param x a char representing a single hexadecimal digit.
+     * 
+     * @see to_hex for the reverse operation.
+     * 
+     * @return a char containing the un-hexed value of x.
+     */
+    char from_hex (char x);
+    
+    /**
+     * A function for taking a character value and converting
+     * it to the equivalent hexadecimal value. Used in internal_url_encode.
+     * 
+     * @param x a char to convert to its hexadecimal representation.
+     * 
+     * @see from_hex for the reverse operation.
+     * 
+     * @return a string containing the now-hexed value of x.
+     */
+    std::string to_hex(char x);
+
+  public:
+  
+    /**
+     * A function for decoding URLs. Calls from_hex, and is
+     * in turn called by url_decode in urltools.cpp.
+     * 
+     * @param url a string representing a percent-encoded URL.
+     * 
+     * @see internal_url_encode for the reverse operation.
+     * 
+     * @return a string containing the decoded URL.
+     */
+    std::string internal_url_decode(std::string url);
+    
+    /**
+     * A function for encoding URLs. Calls to_hex, and is
+     * in turn called by url_encode in urltools.cpp.
+     * 
+     * @param url a string representing a URL.
+     * 
+     * @see internal_url_decode for the reverse operation.
+     * 
+     * @return a string containing the percent-encoded version of "url".
+     */
+    std::string internal_url_encode(std::string url);
+
+};
+
+#endif
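
A minimal round-trip sketch of the class above, assuming an Rcpp-enabled translation unit; the input string and the helper name encoding_example are hypothetical. Characters outside the unreserved set come back as lowercase %XX escapes, and decoding reverses them.

    #include <Rcpp.h>
    #include "encoding.h"

    // [[Rcpp::export]]
    std::string encoding_example(){
      encoding enc;
      std::string encoded = enc.internal_url_encode("q=r&s t"); // "q%3dr%26s%20t"
      return enc.internal_url_decode(encoded);                  // back to "q=r&s t"
    }
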
diff --git a/src/param.cpp b/src/param.cpp
new file mode 100644
index 0000000..89a1ab1
--- /dev/null
+++ b/src/param.cpp
@@ -0,0 +1,112 @@
+#include <Rcpp.h>
+#include "parameter.h"
+using namespace Rcpp;
+
+
+//'@title Get the values of a URL's parameters
+//'@description URLs can have parameters, taking the form of \code{name=value}, chained together
+//'with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
+//'of parameter names, will generate a data.frame consisting of the values of each parameter
+//'for each URL.
+//'
+//'@param urls a vector of URLs
+//'
+//'@param parameter_names a vector of parameter names
+//'
+//'@return a data.frame containing one column for each provided parameter name. Values that
+//'cannot be found within a particular URL are represented by an NA.
+//'
+//'@examples
+//'#A very simple example
+//'url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
+//'parameter_values <- param_get(url, c("this_parameter","hiphop"))
+//'
+//'@seealso \code{\link{url_parse}} for decomposing URLs into their constituent parts and
+//'\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
+//'
+//'@aliases param_get url_parameter
+//'@rdname param_get
+//'@export
+//[[Rcpp::export]]
+List param_get(CharacterVector urls, CharacterVector parameter_names){
+  parameter p_inst;
+  List output;
+  IntegerVector rownames = Rcpp::seq(1,urls.size());
+  unsigned int column_count = parameter_names.size();
+
+  for(unsigned int i = 0; i < column_count; ++i){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    output.push_back(p_inst.get_parameter(urls, Rcpp::as<std::string>(parameter_names[i])));
+  }
+  output.attr("class") = "data.frame";
+  output.attr("names") = parameter_names;
+  output.attr("row.names") = rownames;
+  return output;
+}
+
+//'@title Set the value associated with a parameter in a URL's query.
+//'@description URLs often have queries associated with them, particularly URLs for
+//'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
+//'allows you to modify key/value pairs within query strings, or even add new ones
+//'if they don't exist within the URL.
+//'
+//'@param urls a vector of URLs. These should be decoded (with \code{url_decode})
+//'but do not have to have been otherwise manipulated.
+//'
+//'@param key a string representing the key to modify the value of (or insert wholesale
+//'if it doesn't exist within the URL).
+//'
+//'@param value a value to associate with the key. This can be a single string,
+//'or a vector the same length as \code{urls}
+//'
+//'@return the original vector of URLs, but with modified/inserted key-value pairs. If the
+//'URL is \code{NA}, the returned value will be \code{NA}; if the key or value is \code{NA},
+//'no insertion will be made.
+//'
+//'@examples
+//'# Set a URL parameter where there's already a key for that
+//'param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
+//'
+//'# Set a URL parameter where there isn't.
+//'param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
+//'
+//'@seealso \code{\link{param_get}} to retrieve the values associated with multiple keys in
+//'a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
+//'
+//'@export
+//[[Rcpp::export]]
+CharacterVector param_set(CharacterVector urls, String key, CharacterVector value){
+  parameter p_inst;
+  return p_inst.set_parameter_vectorised(urls, key, value);
+}
+
+//'@title Remove key-value pairs from query strings
+//'@description URLs often have queries associated with them, particularly URLs for
+//'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
+//'allows you to remove key/value pairs while leaving the rest of the URL intact.
+//'
+//'@param urls a vector of URLs. These should be decoded with \code{url_decode} but don't
+//'have to have been otherwise processed.
+//'
+//'@param keys a vector of parameter keys to remove.
+//'
+//'@return the original URLs but with the key/value pairs specified by \code{keys} removed.
+//'If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
+//'nothing will be done with it.
+//'
+//'@seealso \code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
+//'to retrieve those values.
+//'
+//'@examples
+//'# Remove multiple parameters from a URL
+//'param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
+//'             keys = c("action","format"))
+//'@export
+//[[Rcpp::export]]
+CharacterVector param_remove(CharacterVector urls, CharacterVector keys){
+  parameter p_inst;
+  return p_inst.remove_parameter_vectorised(urls, keys);
+  
+}
diff --git a/src/parameter.cpp b/src/parameter.cpp
new file mode 100644
index 0000000..704c784
--- /dev/null
+++ b/src/parameter.cpp
@@ -0,0 +1,175 @@
+#include "parameter.h"
+
+std::vector < std::string > parameter::get_query_string(std::string url){
+  
+  std::vector < std::string > output;
+  size_t query_location = url.find("?");
+  if(query_location == std::string::npos){
+    output.push_back(url);
+  } else {
+    output.push_back(url.substr(0, query_location));
+    output.push_back(url.substr(query_location));
+  }
+  return output;
+}
+
+std::string parameter::set_parameter(std::string url, std::string& component, std::string value){
+  
+  std::vector < std::string > holding = get_query_string(url);
+  if(holding.size() == 1){
+    return holding[0] + ("?" + component + "=" + value);
+  }
+  
+  size_t component_location = holding[1].find((component + "="));
+  
+  if(component_location == std::string::npos){
+    holding[1] = (holding[1] + "&" + component + "=" + value);
+  } else {
+    size_t value_location = holding[1].find("&", component_location);
+    if(value_location == std::string::npos){
+      holding[1].replace(component_location, value_location, (component + "=" + value));
+    } else {
+      holding[1].replace(component_location, (value_location - component_location), (component + "=" + value));
+    }
+    
+  }
+  
+  return(holding[0] + holding[1]);
+  
+}
+
+std::string parameter::remove_parameter_single(std::string url, CharacterVector params){
+  
+  std::vector < std::string > parsed_url = get_query_string(url);
+  if(parsed_url.size() == 1){
+    return url;
+  }
+  
+  for(unsigned int i = 0; i < params.size(); i++){
+    if(params[i] != NA_STRING){
+      size_t param_location = parsed_url[1].find(Rcpp::as<std::string>(params[i]));
+      while(param_location != std::string::npos){
+        size_t end_location = parsed_url[1].find("&", param_location);
+        // Erase through the trailing ampersand, or to the end of the string if there is none.
+        parsed_url[1].erase(param_location, end_location == std::string::npos ?
+                              std::string::npos : (end_location - param_location) + 1);
+        param_location = parsed_url[1].find(Rcpp::as<std::string>(params[i]), param_location);
+      }
+    }
+  }
+  
+  // We may have removed all of the parameters or the last one, leading to trailing ampersands or
+  // question marks. If those exist, erase them.
+  if(parsed_url[1][parsed_url[1].size()-1] == '&' || parsed_url[1][parsed_url[1].size()-1] == '?'){
+    parsed_url[1].erase(parsed_url[1].size()-1);
+  }
+  
+  return (parsed_url[0] + parsed_url[1]);
+}
+
+//Parameter retrieval
+CharacterVector parameter::get_parameter(CharacterVector& urls, std::string component){
+  std::size_t component_location;
+  std::size_t next_location;
+  unsigned int input_size = urls.size();
+  int component_size = component.length();
+  CharacterVector output(input_size);
+  component = component + "=";
+  std::string holding;
+  for(unsigned int i = 0; i < input_size; ++i){
+    if(urls[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      holding = Rcpp::as<std::string>(urls[i]);
+      component_location = holding.find(component);
+      if(component_location == std::string::npos){
+        output[i] = NA_STRING;
+      } else {
+        next_location = holding.find_first_of("&#", component_location + component_size);
+        if(next_location == std::string::npos){
+          output[i] = holding.substr(component_location + component_size + 1);
+        } else {
+          output[i] = holding.substr(component_location + component_size + 1, (next_location-(component_location + component_size + 1)));
+        }
+      }
+    }
+  }
+  return output;
+}
+
+CharacterVector parameter::set_parameter_vectorised(CharacterVector urls, String component,
+                                                    CharacterVector value){
+  
+  unsigned int input_size = urls.size();
+  CharacterVector output(input_size);
+  
+  if(component != NA_STRING){
+    std::string component_ref = component.get_cstring();
+    if(value.size() == input_size){
+      for(unsigned int i = 0; i < input_size; i++){
+        if((i % 10000) == 0){
+          Rcpp::checkUserInterrupt();
+        }
+        if(urls[i] != NA_STRING && value[i] != NA_STRING){
+          output[i] = set_parameter(Rcpp::as<std::string>(urls[i]), component_ref,
+                                    Rcpp::as<std::string>(value[i]));
+        } else if(value[i] == NA_STRING){
+          output[i] = urls[i];
+        } else {
+          output[i] = NA_STRING;
+        }
+      }
+    } else if(value.size() == 1){
+      if(value[0] != NA_STRING){
+        std::string value_ref = Rcpp::as<std::string>(value[0]);
+        for(unsigned int i = 0; i < input_size; i++){
+          if((i % 10000) == 0){
+            Rcpp::checkUserInterrupt();
+          }
+          if(urls[i] != NA_STRING){
+            output[i] = set_parameter(Rcpp::as<std::string>(urls[i]), component_ref, value_ref);
+          } else {
+            output[i] = NA_STRING;
+          }
+        }
+      } else {
+        return urls;
+      }
+      
+    } else {
+      throw std::range_error("'value' must be the same length as 'urls', or of length 1");
+    }
+  } else {
+    return urls;
+  }
+
+  return output;
+}
+
+CharacterVector parameter::remove_parameter_vectorised(CharacterVector urls,
+                                                       CharacterVector params){
+  
+  unsigned int input_size = urls.size();
+  CharacterVector output(input_size);
+  CharacterVector p_copy = params;
+  // Generate easily find-able params.
+  for(unsigned int i = 0; i < p_copy.size(); i++){
+    if(p_copy[i] != NA_STRING){
+      p_copy[i] += "=";
+    }
+  }
+
+  // For each URL, remove those parameters.
+  for(unsigned int i = 0; i < urls.size(); i++){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    if(urls[i] != NA_STRING){
+      output[i] = remove_parameter_single(Rcpp::as<std::string>(urls[i]), p_copy);
+      
+    } else {
+      output[i] = NA_STRING;
+    }
+  }
+  
+  // Return
+  return output;
+}
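
For orientation, a minimal sketch of the three public entry points above, assuming an Rcpp-enabled translation unit; the URL, keys and the helper name parameter_example are hypothetical.

    #include <Rcpp.h>
    #include "parameter.h"
    using namespace Rcpp;

    // [[Rcpp::export]]
    List parameter_example(){
      parameter p;
      CharacterVector urls = CharacterVector::create(
        "https://example.com/api.php?action=query&format=xml");
      CharacterVector got     = p.get_parameter(urls, "format");          // "xml"
      CharacterVector set     = p.set_parameter_vectorised(urls, "format",
                                    CharacterVector::create("json"));     // ...?action=query&format=json
      CharacterVector removed = p.remove_parameter_vectorised(urls,
                                    CharacterVector::create("action"));   // ...?format=xml
      return List::create(_["got"] = got, _["set"] = set, _["removed"] = removed);
    }
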
diff --git a/src/parameter.h b/src/parameter.h
new file mode 100644
index 0000000..c83e683
--- /dev/null
+++ b/src/parameter.h
@@ -0,0 +1,93 @@
+#include "parsing.h"
+
+#ifndef __PARAM_INCLUDED__
+#define __PARAM_INCLUDED__
+
+
+class parameter: public parsing {
+  
+private:
+  
+  /**
+   * Split out a URL query from the actual body. Used
+   * in set_ and remove_parameter.
+   * 
+   * @param url a URL.
+   * 
+   * @return a vector either of length 1, indicating that no
+   * query was found, or 2, indicating that one was.
+   */
+  std::vector < std::string > get_query_string(std::string url);
+  
+  /**
+   * Set the value of a single key=value parameter.
+   * 
+   * @param url a URL.
+   * 
+   * @param component a reference to the key to set
+   * 
+   * @param value a reference to the value to set.
+   * 
+   * @return a string containing URL + key=value, controlling
+   * for the possibility that the URL did not previously have a query
+   * associated - or did, and /had that key/, but was associating a
+   * different value with it.
+   */
+  std::string set_parameter(std::string url, std::string& component, std::string value);
+  
+  /**
+   * Remove a range of key/value parameters
+   * 
+   * @param url a URL.
+   * 
+   * @param params a vector of keys.
+   * 
+   * @return a string containing the URL but absent the keys and values that were specified.
+   * 
+   */
+  std::string remove_parameter_single(std::string url, CharacterVector params);
+  
+public:
+  
+  /**
+   * Component retrieval specifically for parameters.
+   * 
+   * @param urls a reference to a vector of URLs
+   * 
+   * @param component the name of a component to retrieve
+   * the value of
+   * 
+   * @return a vector of the values for that component.
+   */
+  CharacterVector get_parameter(CharacterVector& urls, std::string component);
+  
+  
+  /**
+   * Set the value of a single key=value parameter for a vector of strings.
+   * 
+   * @param urls a vector of URLs.
+   * 
+   * @param component a string containing the key to set
+   * 
+   * @param value a vector of values to set.
+   * 
+   * @return the initial URLs vector, with the aforementioned string modifications.
+   */
+  CharacterVector set_parameter_vectorised(CharacterVector urls, String component,
+                                           CharacterVector value);
+  
+  /**
+   * Remove a range of key/value parameters from a vector of strings.
+   * 
+   * @param urls a vector of URLs.
+   * 
+   * @param params a vector of keys.
+   * 
+   * @return the initial URLs vector, with the aforementioned string modifications.
+   * 
+   */
+  CharacterVector remove_parameter_vectorised(CharacterVector urls,
+                                              CharacterVector params);
+};
+
+#endif
diff --git a/src/parsing.cpp b/src/parsing.cpp
new file mode 100644
index 0000000..f8e01d0
--- /dev/null
+++ b/src/parsing.cpp
@@ -0,0 +1,238 @@
+#include "parsing.h"
+
+std::string parsing::scheme(std::string& url){
+  std::string output;
+  std::size_t protocol = url.find("://");
+  if((protocol == std::string::npos) | (protocol > 6)){
+    //If that's not present, or isn't present at the /beginning/, unknown
+    output = "";
+  } else {
+    output = url.substr(0,protocol);
+    url = url.substr((protocol+3));
+  }
+  return output;
+}
+
+std::string parsing::string_tolower(std::string str){
+  unsigned int input_size = str.size();
+  for(unsigned int i = 0; i < input_size; i++){
+    str[i] = tolower(str[i]);
+  }
+  return str;
+}
+
+std::vector < std::string > parsing::domain_and_port(std::string& url){
+  
+  std::vector < std::string > output(2);
+  std::string holding;
+  unsigned int output_offset = 0;
+  
+  // Identify the port. If there is one, push everything
+  // before that straight into the output, and the remainder
+  // into the holding string. If not, the entire
+  // url goes into the holding string.
+  std::size_t port = url.find(":");
+  
+  if(port != std::string::npos && url.find("/") >= port){
+    output[0] = url.substr(0,port);
+    holding = url.substr(port+1);
+    output_offset++;
+  } else {
+    holding = url;
+  }
+  
+  // Look for a trailing slash
+  std::size_t trailing_slash = holding.find("/");
+  
+  // If there is one, that's when everything ends
+  if(trailing_slash != std::string::npos){
+    output[output_offset] = holding.substr(0, trailing_slash);
+    output_offset++;
+    url = holding.substr(trailing_slash+1);
+    return output;
+  }
+  
+  // If not, there might be a query parameter associated
+  // with the base URL, which we need to preserve.
+  std::size_t param = holding.find("?");
+  
+  // If there is, handle that
+  if(param != std::string::npos){
+    output[output_offset] = holding.substr(0, param);
+    url = holding.substr(param);
+    return output;
+  }
+  
+  // Otherwise we're done here
+  output[output_offset] = holding;
+  url = "";
+  return output;
+}
+
+std::string parsing::path(std::string& url){
+  if(url.size() == 0){
+    return url;
+  }
+  std::string output;
+  std::size_t path = url.find("?");
+  if(path == std::string::npos){
+    std::size_t fragment = url.find("#");
+    if(fragment == std::string::npos){
+      output = url;
+      url = "";
+      return output;
+    }
+    output = url.substr(0,fragment);
+    url = url.substr(fragment);
+    return output;
+  }
+
+  output = url.substr(0,path);
+  url = url.substr(path+1);
+  return output;
+}
+
+std::string parsing::query(std::string& url){
+  if(url == ""){
+    return url;
+  }
+  
+  std::string output;
+  std::size_t fragment = url.find("#");
+  if(fragment == std::string::npos){
+    output = url;
+    url = "";
+    return output;
+  }
+  output = url.substr(0,fragment);
+  url = url.substr(fragment+1);
+  return output;
+}
+
+String parsing::check_parse_out(std::string x){
+  
+  if(x == ""){
+    return NA_STRING;
+  }
+  return x;
+}
+
+//URL parser
+CharacterVector parsing::url_to_vector(std::string url){
+  
+  std::string &url_ptr = url;
+  
+  //Output object, holding object, normalise.
+  CharacterVector output(6);
+  std::vector < std::string > holding(2);
+  
+  std::string s = scheme(url_ptr);
+  
+  holding = domain_and_port(url_ptr);
+
+  //Run
+  output[0] = check_parse_out(string_tolower(s));
+  output[1] = check_parse_out(string_tolower(holding[0]));
+  output[2] = check_parse_out(holding[1]);
+  output[3] = check_parse_out(path(url_ptr));
+  output[4] = check_parse_out(query(url_ptr));
+  output[5] = check_parse_out(url_ptr);
+  
+  return output;
+}
+
+//Component retrieval
+String parsing::get_component(std::string url, int component){
+  return url_to_vector(url)[component];
+}
+
+//Component modification
+String parsing::set_component(std::string url, int component, String new_value){
+  
+  if(new_value == NA_STRING){
+    return NA_STRING;
+  }
+  std::string output;
+  CharacterVector parsed_url = url_to_vector(url);
+  parsed_url[component] = new_value;
+  
+  if(parsed_url[0] != NA_STRING){
+    output += parsed_url[0];
+    output += "://";
+  }
+  
+  if(parsed_url[1] != NA_STRING){
+    output += parsed_url[1];
+  }
+  
+  if(parsed_url[2] != NA_STRING){
+    output += ":";
+    output += parsed_url[2];
+  }
+  
+  if(parsed_url[3] != NA_STRING){
+    output += "/";
+    output += parsed_url[3];
+  }
+  
+  if(parsed_url[4] != NA_STRING){
+    output += "?";
+    output += parsed_url[4];
+  }
+  
+  if(parsed_url[5] != NA_STRING){
+    output += "#";
+    output += parsed_url[5];
+  }
+  
+  return output;
+}
+
+DataFrame parsing::parse_to_df(CharacterVector& urls_ptr){
+  
+  //Input and holding objects
+  unsigned int input_size = urls_ptr.size();
+  CharacterVector holding(6);
+  
+  //Output objects
+  CharacterVector schemes(input_size);
+  CharacterVector domains(input_size);
+  CharacterVector ports(input_size);
+  CharacterVector paths(input_size);
+  CharacterVector parameters(input_size);
+  CharacterVector fragments(input_size);
+
+  for(unsigned int i = 0; i < input_size; i++){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    
+    // Handle NAs on input
+    if(urls_ptr[i] == NA_STRING){
+      
+      schemes[i] = NA_STRING;
+      domains[i] = NA_STRING;
+      ports[i] = NA_STRING;
+      paths[i] = NA_STRING;
+      parameters[i] = NA_STRING;
+      fragments[i] = NA_STRING;
+      
+    } else {
+      holding = url_to_vector(Rcpp::as<std::string>(urls_ptr[i]));
+      schemes[i] = holding[0];
+      domains[i] = holding[1];
+      ports[i] = holding[2];
+      paths[i] = holding[3];
+      parameters[i] = holding[4];
+      fragments[i] = holding[5];
+    }
+  }
+  
+  return DataFrame::create(_["scheme"] = schemes,
+                           _["domain"] = domains,
+                           _["port"] = ports,
+                           _["path"] = paths,
+                           _["parameter"] = parameters,
+                           _["fragment"] = fragments,
+                           _["stringsAsFactors"] = false);
+}
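
A minimal sketch of the parser entry point above, assuming an Rcpp-enabled translation unit; the URL and the helper name parsing_example are hypothetical. It shows the six-column layout that the exported url_parse ultimately returns.

    #include <Rcpp.h>
    #include "parsing.h"
    using namespace Rcpp;

    // [[Rcpp::export]]
    DataFrame parsing_example(){
      parsing p;
      CharacterVector urls = CharacterVector::create(
        "https://www.example.com:80/dir/page.html?x=1#top");
      // One row: scheme "https", domain "www.example.com", port "80",
      // path "dir/page.html", parameter "x=1", fragment "top".
      return p.parse_to_df(urls);
    }
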
diff --git a/src/parsing.h b/src/parsing.h
new file mode 100644
index 0000000..24c509c
--- /dev/null
+++ b/src/parsing.h
@@ -0,0 +1,141 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+#ifndef __PARSING_INCLUDED__
+#define __PARSING_INCLUDED__
+
+class parsing {
+  
+  protected:
+  
+  /**
+   * A function for parsing a URL and turning it into a vector.
+   * Tremendously useful (read: everything breaks without this)
+   * 
+   * @param url a URL.
+   * 
+   * @see get_ and set_component, which call this.
+   * 
+   * @return a vector consisting of the value for each component
+   * part of the URL.
+   */
+  CharacterVector url_to_vector(std::string url);
+  
+  private:
+    
+    /**
+     * A function for lower-casing an entire string
+     * 
+     * @param str a string to lower-case
+     * 
+     * @return a string containing the lower-cased version of the
+     * input.
+     */
+    std::string string_tolower(std::string str);
+    
+    /**
+     * A function for extracting the scheme of a URL; part of the
+     * URL parsing framework.
+     * 
+     * @param url a reference to a url.
+     * 
+     * @see url_to_vector which calls this.
+     * 
+     * @return a string containing the scheme of the URL if identifiable,
+     * and "" if not.
+     */
+    std::string scheme(std::string& url);
+    
+    /**
+     * A function for extracting the domain and port of a URL; part of the
+     * URL parsing framework. Fairly unique in that it outputs a
+     * vector, unlike the rest of the framework, which outputs a string,
+     * since it has to handle multiple elements.
+     * 
+     * @param url a reference to a url. Should've been run through
+     * scheme() first.
+     * 
+     * @see url_to_vector which calls this.
+     * 
+     * @return a vector containing the domain and port of the URL if identifiable,
+     * and "" for each non-identifiable element.
+     */
+    std::vector < std::string > domain_and_port(std::string& url);
+    
+    /**
+     * A function for extracting the path of a URL; part of the
+     * URL parsing framework.
+     * 
+     * @param url a reference to a url. Should've been run through
+     * scheme() and domain_and_port() first.
+     * 
+     * @see url_to_vector which calls this.
+     * 
+     * @return a string containing the path of the URL if identifiable,
+     * and "" if not.
+     */
+    std::string path(std::string& url);
+    
+    /**
+     * A function for extracting the query string of a URL; part of the
+     * URL parsing framework.
+     * 
+     * @param url a reference to a url. Should've been run through
+     * scheme(), domain_and_port() and path() first.
+     * 
+     * @see url_to_vector which calls this.
+     * 
+     * @return a string containing the query string of the URL if identifiable,
+     * and "" if not.
+     */
+    std::string query(std::string& url);
+    
+    String check_parse_out(std::string x);
+    
+  public:
+  
+    /**
+     * A function to retrieve an individual component from a parsed
+     * URL. Used in scheme(), host() et al; calls parse_url.
+     * 
+     * @param url a URL.
+     * 
+     * @param component an integer representing which value in
+     * parse_url's returned vector to grab.
+     * 
+     * @see set_component, which allows for modification.
+     * 
+     * @return a string consisting of the requested URL component.
+     */
+    String get_component(std::string url, int component);
+    
+    /**
+     * A function to set an individual component in a parsed
+     * URL. Used in "scheme<-", et al; calls parse_url.
+     * 
+     * @param url a URL.
+     * 
+     * @param component an integer representing which value in
+     * parse_url's returned vector to modify.
+     * 
+     * @param new_value the value to insert into url[component].
+     * 
+     * @see get_component, which allows for retrieval.
+     * 
+     * @return a string consisting of the modified URL.
+     */
+    String set_component(std::string url, int component, String new_value);
+    
+    /**
+     * Decompose a vector of URLs and turn it into a data.frame.
+     * 
+     * @param URLs a reference to a vector of URLs
+     * 
+     * @return an Rcpp data.frame.
+     * 
+     */
+    DataFrame parse_to_df(CharacterVector& urls_ptr);
+
+};
+#endif
diff --git a/src/puny.cpp b/src/puny.cpp
new file mode 100644
index 0000000..fa84ac0
--- /dev/null
+++ b/src/puny.cpp
@@ -0,0 +1,226 @@
+#include <Rcpp.h>
+#include "punycode.h"
+extern "C"{
+#include "utf8.h"
+}
+using namespace Rcpp;
+#define R_NO_REMAP
+#include <R.h>
+#include <Rinternals.h>
+
+#define BUFLENT 2048
+static char buf[BUFLENT];
+static uint32_t ibuf[BUFLENT];
+static std::string ascii = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890_.?&=:/";
+
+static inline void clearbuf(){
+  for (int i=0; i<BUFLENT; i++){
+    buf[i] = '\0';
+    ibuf[i] = 0;
+  }
+}
+
+struct url {
+  std::deque<std::string> split_url;
+  std::string protocol;
+  std::string path;
+};
+
+void split_url(std::string x, url& output){
+
+  size_t last;
+  size_t loc = x.find(".");
+  
+  last = x.find("://");
+  if(last != std::string::npos){
+    output.protocol = x.substr(0, (last + 3));
+    x = x.substr(last + 3);
+  }
+  last = x.find_first_of(":/");
+  if(last != std::string::npos){
+    output.path = x.substr(last);
+    x = x.substr(0, last);
+  }
+
+  last = 0;
+  loc = x.find(".");
+  while (loc != std::string::npos) {
+    output.split_url.push_back(x.substr(last, loc-last));
+    last = ++loc;
+    loc = x.find(".", loc);
+  }
+  if (loc == std::string::npos){
+    output.split_url.push_back(x.substr(last, x.length()));
+  }
+}
+
+std::string check_result(enum punycode_status& st, std::string& x){
+  std::string ret = "Error with the URL " + x + ":";
+  if (st == punycode_bad_input){
+    ret += "input is invalid";
+  } else if (st == punycode_big_output){
+    ret += "output would exceed the space provided";
+  } else if (st == punycode_overflow){
+    ret += "input needs wider integers to process";
+  } else {
+    return "";
+  }
+  return ret;
+};
+
+String encode_single(std::string x){
+  
+  url holding;
+  split_url(x, holding);
+  std::string output = holding.protocol;
+  
+  for(unsigned int i = 0; i < holding.split_url.size(); i++){
+    // Check if it's an ASCII-only fragment - if so, nowt to do here.
+    if(holding.split_url[i].find_first_not_of(ascii) == std::string::npos){
+      output += holding.split_url[i];
+      if(i < (holding.split_url.size() - 1)){
+        output += ".";
+      }
+    } else {
+      
+      // Prep for conversion
+      punycode_uint buflen = BUFLENT;
+      punycode_uint unilen = BUFLENT;
+      const char *s = holding.split_url[i].c_str();
+      const int slen = strlen(s);
+      
+      // Do the conversion
+      unilen = u8_toucs(ibuf, unilen, s, slen);
+      enum punycode_status st = punycode_encode(unilen, ibuf, NULL, &buflen, buf);
+      
+      // Check it worked
+      std::string ret = check_result(st, x);
+      if(ret.size()){
+        Rcpp::warning(ret);
+        return NA_STRING;
+      }
+      
+      std::string encoded = Rcpp::as<std::string>(Rf_mkCharLenCE(buf, buflen, CE_UTF8));
+      if(encoded != holding.split_url[i]){
+        encoded = "xn--" + encoded;
+      }
+      output += encoded;
+      if(i < (holding.split_url.size() - 1)){
+        output += ".";
+      }
+    }
+  }
+  output += holding.path;
+  return output;
+}
+
+//'@title Encode or Decode Internationalised Domains
+//'@description \code{puny_encode} and \code{puny_decode} implement
+//'the encoding standard for internationalised (non-ASCII) domains and
+//'subdomains. You can use them to encode UTF-8 domain names, or decode
+//'encoded names (which start with "xn--"), or both.
+//'
+//'@param x a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.
+//'
+//'@return a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
+//'Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
+//'decoded or encoded version) will be returned as \code{NA}.
+//'
+//'@examples
+//'# Encode a URL
+//'puny_encode("https://www.bücher.com/foo")
+//'
+//'# Decode the result, back to the original
+//'puny_decode("https://www.xn--bcher-kva.com/foo")
+//'
+//'@seealso \code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
+//'
+//'@rdname puny
+//'@export
+//[[Rcpp::export]]
+CharacterVector puny_encode(CharacterVector x){
+  
+  unsigned int input_size = x.size();
+  CharacterVector output(input_size);
+  
+  for(unsigned int i = 0; i < input_size; i++){
+    
+    if(i % 10000 == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    
+    if(x[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      output[i] = encode_single(Rcpp::as<std::string>(x[i]));
+    }
+  }
+  
+  clearbuf();
+  return output;
+}
+
+String decode_single(std::string x){
+  url holding;
+  split_url(x, holding);
+  std::string output = holding.protocol;
+  
+  for(unsigned int i = 0; i < holding.split_url.size(); i++){
+    // Check whether it's an "xn--"-encoded label - if not, nowt to do here.
+    if(holding.split_url[i].size() < 4 || holding.split_url[i].substr(0,4) != "xn--"){
+      output += holding.split_url[i];
+      if(i < (holding.split_url.size() - 1)){
+        output += ".";
+      }
+    } else {
+      
+      // Prep for conversion
+      punycode_uint buflen;
+      punycode_uint unilen = BUFLENT;
+      const std::string stripped = holding.split_url[i].substr(4); // copy first: c_str() on the temporary would dangle
+      const char *s = stripped.c_str();
+      const int slen = strlen(s);
+      
+      // Do the conversion
+      enum punycode_status st = punycode_decode(slen, s, &unilen, ibuf, NULL);
+      
+      // Check it worked
+      std::string ret = check_result(st, x);
+      if(ret.size()){
+        Rcpp::warning(ret);
+        return NA_STRING;
+      }
+      buflen = u8_toutf8(buf, BUFLENT, ibuf, unilen);
+      std::string encoded = Rcpp::as<std::string>(Rf_mkCharLenCE(buf, buflen, CE_UTF8));
+      output += encoded;
+      if(i < (holding.split_url.size() - 1)){
+        output += ".";
+      }
+    }
+  }
+  output += holding.path;
+  return output;
+}
+
+//'@rdname puny
+//'@export
+//[[Rcpp::export]]
+CharacterVector puny_decode(CharacterVector x){
+  
+  unsigned int input_size = x.size();
+  CharacterVector output(input_size);
+
+  for(unsigned int i = 0; i < input_size; i++){
+    
+    if(i % 10000 == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    
+    if(x[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      output[i] = decode_single(Rcpp::as<std::string>(x[i]));
+    }
+  }
+  
+  return output;
+}
diff --git a/src/punycode.c b/src/punycode.c
new file mode 100644
index 0000000..f905e52
--- /dev/null
+++ b/src/punycode.c
@@ -0,0 +1,289 @@
+/*
+punycode.c from RFC 3492
+http://www.nicemice.net/idn/
+Adam M. Costello
+http://www.nicemice.net/amc/
+
+This is ANSI C code (C89) implementing Punycode (RFC 3492).
+
+
+C. Disclaimer and license
+
+    Regarding this entire document or any portion of it (including
+    the pseudocode and C code), the author makes no guarantees and
+    is not responsible for any damage resulting from its use.  The
+    author grants irrevocable permission to anyone to use, modify,
+    and distribute it in any way that does not diminish the rights
+    of anyone else to use, modify, and distribute it, provided that
+    redistributed derivative works do not contain misleading author or
+    version information.  Derivative works need not be licensed under
+    similar terms.
+*/
+
+#include "punycode.h"
+
+/**********************************************************/
+/* Implementation (would normally go in its own .c file): */
+
+#include <string.h>
+
+/*** Bootstring parameters for Punycode ***/
+
+enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
+       initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
+
+/* basic(cp) tests whether cp is a basic code point: */
+#define basic(cp) ((punycode_uint)(cp) < 0x80)
+
+/* delim(cp) tests whether cp is a delimiter: */
+#define delim(cp) ((cp) == delimiter)
+
+/* decode_digit(cp) returns the numeric value of a basic code */
+/* point (for use in representing integers) in the range 0 to */
+/* base-1, or base if cp does not represent a value.          */
+
+static punycode_uint decode_digit(punycode_uint cp)
+{
+  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
+          cp - 97 < 26 ? cp - 97 :  base;
+}
+
+/* encode_digit(d,flag) returns the basic code point whose value      */
+/* (when used for representing integers) is d, which needs to be in   */
+/* the range 0 to base-1.  The lowercase form is used unless flag is  */
+/* nonzero, in which case the uppercase form is used.  The behavior   */
+/* is undefined if flag is nonzero and digit d has no uppercase form. */
+
+static char encode_digit(punycode_uint d, int flag)
+{
+  return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
+  /*  0..25 map to ASCII a..z or A..Z */
+  /* 26..35 map to ASCII 0..9         */
+}
+
+/* flagged(bcp) tests whether a basic code point is flagged */
+/* (uppercase).  The behavior is undefined if bcp is not a  */
+/* basic code point.                                        */
+
+#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
+
+/* encode_basic(bcp,flag) forces a basic code point to lowercase */
+/* if flag is zero, uppercase if flag is nonzero, and returns    */
+/* the resulting code point.  The code point is unchanged if it  */
+/* is caseless.  The behavior is undefined if bcp is not a basic */
+/* code point.                                                   */
+
+static char encode_basic(punycode_uint bcp, int flag)
+{
+  bcp -= (bcp - 97 < 26) << 5;
+  return bcp + ((!flag && (bcp - 65 < 26)) << 5);
+}
+
+/*** Platform-specific constants ***/
+
+/* maxint is the maximum value of a punycode_uint variable: */
+static const punycode_uint maxint = (punycode_uint) -1;
+/* Because maxint is unsigned, -1 becomes the maximum value. */
+
+/*** Bias adaptation function ***/
+
+static punycode_uint adapt(
+  punycode_uint delta, punycode_uint numpoints, int firsttime )
+{
+  punycode_uint k;
+
+  delta = firsttime ? delta / damp : delta >> 1;
+  /* delta >> 1 is a faster way of doing delta / 2 */
+  delta += delta / numpoints;
+
+  for (k = 0;  delta > ((base - tmin) * tmax) / 2;  k += base) {
+    delta /= base - tmin;
+  }
+
+  return k + (base - tmin + 1) * delta / (delta + skew);
+}
+
+/*** Main encode function ***/
+
+enum punycode_status punycode_encode(
+  punycode_uint input_length,
+  const punycode_uint input[],
+  const unsigned char case_flags[],
+  punycode_uint *output_length,
+  char output[] )
+{
+  punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
+
+  /* Initialize the state: */
+
+  n = initial_n;
+  delta = out = 0;
+  max_out = *output_length;
+  bias = initial_bias;
+
+  /* Handle the basic code points: */
+
+  for (j = 0;  j < input_length;  ++j) {
+    if (basic(input[j])) {
+      if (max_out - out < 2) return punycode_big_output;
+      output[out++] =
+        case_flags ? encode_basic(input[j], case_flags[j]) : (char)input[j];
+    }
+    /* else if (input[j] < n) return punycode_bad_input; */
+    /* (not needed for Punycode with unsigned code points) */
+  }
+
+  h = b = out;
+
+  /* h is the number of code points that have been handled, b is the  */
+  /* number of basic code points, and out is the number of characters */
+  /* that have been output.                                           */
+
+  if (b > 0) output[out++] = delimiter;
+
+  /* Main encoding loop: */
+
+  while (h < input_length) {
+    /* All non-basic code points < n have been     */
+    /* handled already.  Find the next larger one: */
+
+    for (m = maxint, j = 0;  j < input_length;  ++j) {
+      /* if (basic(input[j])) continue; */
+      /* (not needed for Punycode) */
+      if (input[j] >= n && input[j] < m) m = input[j];
+    }
+
+    /* Increase delta enough to advance the decoder's    */
+    /* <n,i> state to <m,0>, but guard against overflow: */
+
+    if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
+    delta += (m - n) * (h + 1);
+    n = m;
+
+    for (j = 0;  j < input_length;  ++j) {
+      /* Punycode does not need to check whether input[j] is basic: */
+      if (input[j] < n /* || basic(input[j]) */ ) {
+        if (++delta == 0) return punycode_overflow;
+      }
+
+      if (input[j] == n) {
+        /* Represent delta as a generalized variable-length integer: */
+
+        for (q = delta, k = base;  ;  k += base) {
+          if (out >= max_out) return punycode_big_output;
+          t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
+              k >= bias + tmax ? tmax : k - bias;
+          if (q < t) break;
+          output[out++] = encode_digit(t + (q - t) % (base - t), 0);
+          q = (q - t) / (base - t);
+        }
+
+        output[out++] = encode_digit(q, case_flags && case_flags[j]);
+        bias = adapt(delta, h + 1, h == b);
+        delta = 0;
+        ++h;
+      }
+    }
+
+    ++delta, ++n;
+  }
+
+  *output_length = out;
+  return punycode_success;
+}
+
+/*** Main decode function ***/
+
+enum punycode_status punycode_decode(
+  punycode_uint input_length,
+  const char input[],
+  punycode_uint *output_length,
+  punycode_uint output[],
+  unsigned char case_flags[] )
+{
+  punycode_uint n, out, i, max_out, bias,
+                 b, j, in, oldi, w, k, digit, t;
+
+  if (!input_length) {
+    return punycode_bad_input;
+  }
+
+  /* Initialize the state: */
+
+  n = initial_n;
+  out = i = 0;
+  max_out = *output_length;
+  bias = initial_bias;
+
+  /* Handle the basic code points:  Let b be the number of input code */
+  /* points before the last delimiter, or 0 if there is none, then    */
+  /* copy the first b code points to the output.                      */
+
+  for (b = 0, j = input_length - 1 ;  j > 0;  --j) {
+    if (delim(input[j])) {
+      b = j;
+      break;
+    }
+  }
+  if (b > max_out) return punycode_big_output;
+
+  for (j = 0;  j < b;  ++j) {
+    if (case_flags) case_flags[out] = flagged(input[j]);
+    if (!basic(input[j])) return punycode_bad_input;
+    output[out++] = input[j];
+  }
+
+  /* Main decoding loop:  Start just after the last delimiter if any  */
+  /* basic code points were copied; start at the beginning otherwise. */
+
+  for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {
+
+    /* in is the index of the next character to be consumed, and */
+    /* out is the number of code points in the output array.     */
+
+    /* Decode a generalized variable-length integer into delta,  */
+    /* which gets added to i.  The overflow checking is easier   */
+    /* if we increase i as we go, then subtract off its starting */
+    /* value at the end to obtain delta.                         */
+
+    for (oldi = i, w = 1, k = base;  ;  k += base) {
+      if (in >= input_length) return punycode_bad_input;
+      digit = decode_digit(input[in++]);
+      if (digit >= base) return punycode_bad_input;
+      if (digit > (maxint - i) / w) return punycode_overflow;
+      i += digit * w;
+      t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
+          k >= bias + tmax ? tmax : k - bias;
+      if (digit < t) break;
+      if (w > maxint / (base - t)) return punycode_overflow;
+      w *= (base - t);
+    }
+
+    bias = adapt(i - oldi, out + 1, oldi == 0);
+
+    /* i was supposed to wrap around from out+1 to 0,   */
+    /* incrementing n each time, so we'll fix that now: */
+
+    if (i / (out + 1) > maxint - n) return punycode_overflow;
+    n += i / (out + 1);
+    i %= (out + 1);
+
+    /* Insert n at position i of the output: */
+
+    /* not needed for Punycode: */
+    /* if (decode_digit(n) <= base) return punycode_invalid_input; */
+    if (out >= max_out) return punycode_big_output;
+
+    if (case_flags) {
+      memmove(case_flags + i + 1, case_flags + i, out - i);
+      /* Case of last character determines uppercase flag: */
+      case_flags[i] = flagged(input[in - 1]);
+    }
+
+    memmove(output + i + 1, output + i, (out - i) * sizeof *output);
+    output[i++] = n;
+  }
+
+  *output_length = out;
+  return punycode_success;
+}
diff --git a/src/punycode.h b/src/punycode.h
new file mode 100644
index 0000000..459c6fd
--- /dev/null
+++ b/src/punycode.h
@@ -0,0 +1,108 @@
+/*
+punycode.c from RFC 3492
+http://www.nicemice.net/idn/
+Adam M. Costello
+http://www.nicemice.net/amc/
+
+This is ANSI C code (C89) implementing Punycode (RFC 3492).
+
+
+
+C. Disclaimer and license
+
+    Regarding this entire document or any portion of it (including
+    the pseudocode and C code), the author makes no guarantees and
+    is not responsible for any damage resulting from its use.  The
+    author grants irrevocable permission to anyone to use, modify,
+    and distribute it in any way that does not diminish the rights
+    of anyone else to use, modify, and distribute it, provided that
+    redistributed derivative works do not contain misleading author or
+    version information.  Derivative works need not be licensed under
+    similar terms.
+*/
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+/************************************************************/
+/* Public interface (would normally go in its own .h file): */
+
+#include <limits.h>
+
+enum punycode_status {
+  punycode_success,
+  punycode_bad_input,   /* Input is invalid.                       */
+  punycode_big_output,  /* Output would exceed the space provided. */
+  punycode_overflow     /* Input needs wider integers to process.  */
+};
+
+#if UINT_MAX >= (1 << 26) - 1
+typedef unsigned int punycode_uint;
+#else
+typedef unsigned long punycode_uint;
+#endif
+
+enum punycode_status punycode_encode(
+  punycode_uint input_length,
+  const punycode_uint input[],
+  const unsigned char case_flags[],
+  punycode_uint *output_length,
+  char output[] );
+
+    /* punycode_encode() converts Unicode to Punycode.  The input     */
+    /* is represented as an array of Unicode code points (not code    */
+    /* units; surrogate pairs are not allowed), and the output        */
+    /* will be represented as an array of ASCII code points.  The     */
+    /* output string is *not* null-terminated; it will contain        */
+    /* zeros if and only if the input contains zeros.  (Of course     */
+    /* the caller can leave room for a terminator and add one if      */
+    /* needed.)  The input_length is the number of code points in     */
+    /* the input.  The output_length is an in/out argument: the       */
+    /* caller passes in the maximum number of code points that it     */
+    /* can receive, and on successful return it will contain the      */
+    /* number of code points actually output.  The case_flags array   */
+    /* holds input_length boolean values, where nonzero suggests that */
+    /* the corresponding Unicode character be forced to uppercase     */
+    /* after being decoded (if possible), and zero suggests that      */
+    /* it be forced to lowercase (if possible).  ASCII code points    */
+    /* are encoded literally, except that ASCII letters are forced    */
+    /* to uppercase or lowercase according to the corresponding       */
+    /* uppercase flags.  If case_flags is a null pointer then ASCII   */
+    /* letters are left as they are, and other code points are        */
+    /* treated as if their uppercase flags were zero.  The return     */
+    /* value can be any of the punycode_status values defined above   */
+    /* except punycode_bad_input; if not punycode_success, then       */
+    /* output_size and output might contain garbage.                  */
+
+enum punycode_status punycode_decode(
+  punycode_uint input_length,
+  const char input[],
+  punycode_uint *output_length,
+  punycode_uint output[],
+  unsigned char case_flags[] );
+
+    /* punycode_decode() converts Punycode to Unicode.  The input is  */
+    /* represented as an array of ASCII code points, and the output   */
+    /* will be represented as an array of Unicode code points.  The   */
+    /* input_length is the number of code points in the input.  The   */
+    /* output_length is an in/out argument: the caller passes in      */
+    /* the maximum number of code points that it can receive, and     */
+    /* on successful return it will contain the actual number of      */
+    /* code points output.  The case_flags array needs room for at    */
+    /* least output_length values, or it can be a null pointer if the */
+    /* case information is not needed.  A nonzero flag suggests that  */
+    /* the corresponding Unicode character be forced to uppercase     */
+    /* by the caller (if possible), while zero suggests that it be    */
+    /* forced to lowercase (if possible).  ASCII code points are      */
+    /* output already in the proper case, but their flags will be set */
+    /* appropriately so that applying the flags would be harmless.    */
+    /* The return value can be any of the punycode_status values      */
+    /* defined above; if not punycode_success, then output_length,    */
+    /* output, and case_flags might contain garbage.  On success, the */
+    /* decoder will never need to write an output_length greater than */
+    /* input_length, because of how the encoding is defined.          */
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
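
A minimal standalone sketch of the calling convention documented above (a hypothetical example, not part of the package): output_length goes in as the buffer capacity and comes back as the number of characters written, and the output is not null-terminated.

    #include <cstdio>
    #include "punycode.h"

    int main(){
      // Code points for the label "bücher" (0xFC is ü).
      const punycode_uint input[] = {'b', 0xFC, 'c', 'h', 'e', 'r'};
      char output[64];
      punycode_uint output_length = sizeof(output);
      enum punycode_status st = punycode_encode(6, input, NULL, &output_length, output);
      if(st == punycode_success){
        std::printf("%.*s\n", (int) output_length, output); // prints "bcher-kva"
      }
      return 0;
    }
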
diff --git a/src/suffix.cpp b/src/suffix.cpp
new file mode 100644
index 0000000..35dd584
--- /dev/null
+++ b/src/suffix.cpp
@@ -0,0 +1,145 @@
+#include <Rcpp.h>
+using namespace Rcpp;
+
+std::string string_reverse(std::string x){
+  std::reverse(x.begin(), x.end());
+  return x;
+}
+
+//[[Rcpp::export]]
+CharacterVector reverse_strings(CharacterVector strings){
+  
+  unsigned int input_size = strings.size();
+  CharacterVector output(input_size);
+  for(unsigned int i = 0; i < input_size; i++){
+    if(strings[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      output[i] = string_reverse(Rcpp::as<std::string>(strings[i]));
+    }
+  }
+  
+  return output;
+}
+
+//[[Rcpp::export]]
+DataFrame finalise_suffixes(CharacterVector full_domains, CharacterVector suffixes,
+                            LogicalVector wildcard, LogicalVector is_suffix){
+  
+  unsigned int input_size = full_domains.size();
+  CharacterVector subdomains(input_size);
+  CharacterVector domains(input_size);
+  std::string holding;
+  size_t domain_location;
+  for(unsigned int i = 0; i < input_size; i++){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    if(is_suffix[i]){
+      subdomains[i] = NA_STRING;
+      domains[i] = NA_STRING;
+      suffixes[i] = full_domains[i];
+    } else {
+      if(suffixes[i] == NA_STRING || suffixes[i].size() == full_domains[i].size()){
+        subdomains[i] = NA_STRING;
+        domains[i] = NA_STRING;
+      } else if(wildcard[i]) {
+        holding = Rcpp::as<std::string>(full_domains[i]);
+        holding = holding.substr(0, ((full_domains[i].size() - suffixes[i].size()) - 1));
+        domain_location = holding.rfind(".");
+        if(domain_location == std::string::npos){
+          domains[i] = NA_STRING;
+          subdomains[i] = NA_STRING;
+          suffixes[i] = holding + "." + suffixes[i];
+        } else {
+          suffixes[i] = holding.substr(domain_location+1) + "." + suffixes[i];
+          holding = holding.substr(0, domain_location);
+          domain_location = holding.rfind(".");
+          if(domain_location == std::string::npos){
+            if(holding.size() == 0){
+              domains[i] = NA_STRING;
+            } else {
+              domains[i] = holding;
+            }
+            subdomains[i] = NA_STRING;
+          } else {
+            domains[i] = holding.substr(domain_location+1);
+            subdomains[i] = holding.substr(0, domain_location);
+          }
+        }
+      } else {
+        holding = Rcpp::as<std::string>(full_domains[i]);
+        holding = holding.substr(0, ((full_domains[i].size() - suffixes[i].size()) - 1));
+        domain_location = holding.rfind(".");
+        if(domain_location == std::string::npos){
+          subdomains[i] = NA_STRING;
+          if(holding.size() == 0){
+            domains[i] = NA_STRING;
+          } else {
+            domains[i] = holding;
+          }
+        } else {
+          subdomains[i] = holding.substr(0, domain_location);
+          domains[i] = holding.substr(domain_location+1);
+        }
+      }
+    }
+  }
+  return DataFrame::create(_["host"] = full_domains, _["subdomain"] = subdomains,
+                           _["domain"] = domains, _["suffix"] = suffixes,
+                           _["stringsAsFactors"] = false);
+}
+
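+// Return the final dot-separated label of each domain (NA when there is no dot,
+// or when the domain ends in one).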
+//[[Rcpp::export]]
+CharacterVector tld_extract_(CharacterVector domains){
+  
+  unsigned int input_size = domains.size();
+  CharacterVector output(input_size);
+  std::string holding;
+  size_t fragment_location;
+  
+  for(unsigned int i = 0; i < input_size; i++){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    if(domains[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      holding = Rcpp::as<std::string>(domains[i]);
+      fragment_location = holding.rfind(".");
+      if(fragment_location == std::string::npos || fragment_location == (holding.size() - 1)){
+        output[i] = NA_STRING;
+      } else {
+        output[i] = holding.substr(fragment_location+1);
+      }
+    }
+  }
+  return output;
+}
+
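+// Return the first dot-separated label of each host - the lowest-level element -
+// or NA when there is no dot at all.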
+//[[Rcpp::export]]
+CharacterVector host_extract_(CharacterVector domains){
+  
+  unsigned int input_size = domains.size();
+  CharacterVector output(input_size);
+  std::string holding;
+  size_t fragment_location;
+  
+  for(unsigned int i = 0; i < input_size; i++){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    if(domains[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      holding = Rcpp::as<std::string>(domains[i]);
+      fragment_location = holding.find(".");
+      if(fragment_location == std::string::npos){
+        output[i] = NA_STRING;
+      } else {
+        output[i] = holding.substr(0, fragment_location);
+      }
+    }
+  }
+  return output;
+}
diff --git a/src/urltools.cpp b/src/urltools.cpp
new file mode 100644
index 0000000..754fb7e
--- /dev/null
+++ b/src/urltools.cpp
@@ -0,0 +1,184 @@
+#include <Rcpp.h>
+#include "encoding.h"
+#include "compose.h"
+#include "parsing.h"
+
+using namespace Rcpp;
+
+//'@title Encode or decode a URI
+//'@description encodes or decodes a URI/URL
+//'
+//'@param urls a vector of URLs to decode or encode.
+//'
+//'@details
+//'URL encoding and decoding is an essential prerequisite to proper web interaction
+//'and data analysis around things like server-side logs. The
+//'\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} mandates the percent-encoding
+//'of characters outside the unreserved set, including things like slashes, unless they serve as reserved delimiters.
+//'
+//'Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
+//'URL encoding - in theory. In practice, they have a set of substantial problems
+//'that the urltools implementation solves:
+//'
+//'\itemize{
+//' \item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
+//'       This means that, when confronted with a vector of URLs that need encoding or
+//'       decoding, your only option is to loop from within R. This can be incredibly
+//'       computationally costly with large datasets. url_encode and url_decode are
+//'       implemented in C++ and entirely vectorised, allowing for a substantial
+//'       performance improvement.}
+//' \item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
+//'       of making sure your URL no longer works. Because of this, the only thing
+//'       you can encode in URLencode (unless you refuse to encode reserved characters)
+//'       is a partial URL, lacking the initial scheme, which requires additional operations
+//'       to set up and increases the complexity of encoding or decoding. url_encode
+//'       detects the protocol and silently splits it off, leaving it unencoded to ensure
+//'       that the resulting URL is valid.}
+//' \item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
+//'       characters. Unfortunately, URLdecode's response to these characters is to convert
+//'       them to NULs, which R can't handle, at which point your URLdecode call breaks.
+//'       \code{url_decode} simply ignores them.}
+//'}
+//'
+//'@return a character vector containing the encoded (or decoded) versions of "urls".
+//'
+//'@seealso \code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
+//'and encoding.
+//'
+//'@examples
+//'
+//'url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_%28logo%29.jpg")
+//'url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
+//'
+//'\dontrun{
+//'#A demonstrator of the contrasting behaviours around out-of-range characters
+//'URLdecode("%gIL")
+//'url_decode("%gIL")
+//'}
+//'@rdname encoder
+//'@export
+// [[Rcpp::export]]
+CharacterVector url_decode(CharacterVector urls){
+  
+  //Measure size, create output object
+  int input_size = urls.size();
+  CharacterVector output(input_size);
+  encoding enc_inst;
+  //Decode each string in turn.
+  for (int i = 0; i < input_size; ++i){
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    if(urls[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      output[i] = enc_inst.internal_url_decode(Rcpp::as<std::string>(urls[i]));
+    }
+  }
+  
+  //Return
+  return output;
+}
+
+//'@rdname encoder
+//'@export
+// [[Rcpp::export]]
+CharacterVector url_encode(CharacterVector urls){
+  
+  //Measure size, create output object and holding objects
+  int input_size = urls.size();
+  CharacterVector output(input_size);
+  std::string holding;
+  size_t scheme_start;
+  size_t first_slash;
+  encoding enc_inst;
+
+  //For each string..
+  for (int i = 0; i < input_size; ++i){
+    
+    //Check for user interrupts.
+    if((i % 10000) == 0){
+      Rcpp::checkUserInterrupt();
+    }
+    
+    if(urls[i] == NA_STRING){
+      output[i] = NA_STRING;
+    } else {
+      holding = Rcpp::as<std::string>(urls[i]);
+
+      //Extract the protocol. If you can't find it, just encode the entire thing.
+      scheme_start = holding.find("://");
+      if(scheme_start == std::string::npos){
+        output[i] = enc_inst.internal_url_encode(holding);
+      } else {
+        //Otherwise, split out the protocol and encode !protocol.
+        first_slash = holding.find("/", scheme_start+3);
+        if(first_slash == std::string::npos){
+          output[i] = holding.substr(0,scheme_start+3) + enc_inst.internal_url_encode(holding.substr(scheme_start+3));
+        } else {
+          output[i] = holding.substr(0,first_slash+1) + enc_inst.internal_url_encode(holding.substr(first_slash+1));
+        }
+      }
+    }
+  }
+  
+  //Return
+  return output;
+}
+
+//'@title Split URLs into their component parts
+//'@description \code{url_parse} takes a vector of URLs and splits each one into its component
+//'parts, as recognised by RfC 3986.
+//'
+//'@param urls a vector of URLs
+//'
+//'@details It's useful to be able to take a URL and split it out into its component parts - 
+//'for the purpose of hostname extraction, for example, or analysing API calls. This functionality
+//'is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
+//'implementation is entirely in R, uses regular expressions, and is not vectorised. It's
+//'perfectly suitable for the intended purpose (decomposition in the context of automated
+//'HTTP requests from R), but not for large-scale analysis.
+//'
+//'@return a data.frame consisting of the columns scheme, domain, port, path, parameter
+//'and fragment. See the \href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} for
+//'definitions. If an element cannot be identified, it is represented by NA.
+//'
+//'@examples
+//'url_parse("https://en.wikipedia.org/wiki/Article")
+//'
+//'@seealso \code{\link{url_parameters}} for extracting values associated with particular keys in a URL's
+//'query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
+//'
+//'@export
+//[[Rcpp::export]]
+DataFrame url_parse(CharacterVector urls){
+  CharacterVector& urls_ptr = urls;
+  parsing p_inst;
+  return p_inst.parse_to_df(urls_ptr);
+}
+
+//'@title Recompose Parsed URLs
+//'
+//'@description Sometimes you want to take a vector of URLs, parse them, perform
+//'some operations and then rebuild them. \code{url_compose} takes a data.frame produced
+//'by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
+//'as the vector initially thrown into url_parse).
+//'
+//'This is currently a `beta` feature; please do report bugs if you find them.
+//'
+//'@param parsed_urls a data.frame sourced from \code{\link{url_parse}}
+//'
+//'@seealso \code{\link{scheme}} and other accessors, which you may want to
+//'run URLs through before composing them to modify individual values.
+//'
+//'@examples
+//'#Parse a URL and compose it
+//'url <- "http://en.wikipedia.org"
+//'url_compose(url_parse(url))
+//'
+//'@export
+//[[Rcpp::export]]
+CharacterVector url_compose(DataFrame parsed_urls){
+  compose c_inst;
+  return c_inst.compose_multiple(parsed_urls);
+}
diff --git a/src/utf8.c b/src/utf8.c
new file mode 100644
index 0000000..c6e27a6
--- /dev/null
+++ b/src/utf8.c
@@ -0,0 +1,172 @@
+/*
+  Basic UTF-8 manipulation routines
+  by Jeff Bezanson
+  placed in the public domain Fall 2005
+
+  This code is designed to provide the utilities you need to manipulate
+  UTF-8 as an internal string encoding. These functions do not perform the
+  error checking normally needed when handling UTF-8 data, so if you happen
+  to be from the Unicode Consortium you will want to flay me alive.
+  I do this because error checking can be performed at the boundaries (I/O),
+  with these routines reserved for higher performance on data known to be
+  valid.
+  A UTF-8 validation routine is included.
+*/
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <wchar.h>
+#include <wctype.h>
+
+#ifdef WIN32
+#include <malloc.h>
+#define snprintf _snprintf
+#else
+#ifndef __FreeBSD__
+#include <alloca.h>
+#endif /* __FreeBSD__ */
+#endif
+#include <assert.h>
+
+#include "utf8.h"
+
+static const uint32_t offsetsFromUTF8[6] = {
+    0x00000000UL, 0x00003080UL, 0x000E2080UL,
+    0x03C82080UL, 0xFA082080UL, 0x82082080UL
+};
+
+static const char trailingBytesForUTF8[256] = {
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
+};
+
+/* returns length of next utf-8 sequence */
+size_t u8_seqlen(const char *s)
+{
+    return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]] + 1;
+}
+
+/* returns the # of bytes needed to encode a certain character
+   0 means the character cannot (or should not) be encoded. */
+size_t u8_charlen(uint32_t ch)
+{
+    if (ch < 0x80)
+        return 1;
+    else if (ch < 0x800)
+        return 2;
+    else if (ch < 0x10000)
+        return 3;
+    else if (ch < 0x110000)
+        return 4;
+    return 0;
+}
+
+size_t u8_codingsize(uint32_t *wcstr, size_t n)
+{
+    size_t i, c=0;
+
+    for(i=0; i < n; i++)
+        c += u8_charlen(wcstr[i]);
+    return c;
+}
+
+/* conversions without error checking
+   only works for valid UTF-8, i.e. no 5- or 6-byte sequences
+   srcsz = source size in bytes
+   sz = dest size in # of wide characters
+
+   returns # characters converted
+   if sz == srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be enough space.
+*/
+size_t u8_toucs(uint32_t *dest, size_t sz, const char *src, size_t srcsz)
+{
+    uint32_t ch;
+    const char *src_end = src + srcsz;
+    size_t nb;
+    size_t i=0;
+
+    if (sz == 0 || srcsz == 0)
+        return 0;
+
+    while (i < sz) {
+        if (!isutf(*src)) {     // invalid sequence
+            dest[i++] = 0xFFFD;
+            src++;
+            if (src >= src_end) break;
+            continue;
+        }
+        nb = trailingBytesForUTF8[(unsigned char)*src];
+        if (src + nb >= src_end)
+            break;
+        ch = 0;
+        switch (nb) {
+            /* these fall through deliberately */
+        case 5: ch += (unsigned char)*src++; ch <<= 6;
+        case 4: ch += (unsigned char)*src++; ch <<= 6;
+        case 3: ch += (unsigned char)*src++; ch <<= 6;
+        case 2: ch += (unsigned char)*src++; ch <<= 6;
+        case 1: ch += (unsigned char)*src++; ch <<= 6;
+        case 0: ch += (unsigned char)*src++;
+        }
+        ch -= offsetsFromUTF8[nb];
+        dest[i++] = ch;
+    }
+    return i;
+}
+
+
+
+/* srcsz = number of source characters
+   sz = size of dest buffer in bytes
+
+   returns # bytes stored in dest
+   the destination string will never be bigger than the source string.
+*/
+size_t u8_toutf8(char *dest, size_t sz, const uint32_t *src, size_t srcsz)
+{
+    uint32_t ch;
+    size_t i = 0;
+    char *dest0 = dest;
+    char *dest_end = dest + sz;
+
+    while (i < srcsz) {
+        ch = src[i];
+        if (ch < 0x80) {
+            if (dest >= dest_end)
+                break;
+            *dest++ = (char)ch;
+        }
+        else if (ch < 0x800) {
+            if (dest >= dest_end-1)
+                break;
+            *dest++ = (ch>>6) | 0xC0;
+            *dest++ = (ch & 0x3F) | 0x80;
+        }
+        else if (ch < 0x10000) {
+            if (dest >= dest_end-2)
+                break;
+            *dest++ = (ch>>12) | 0xE0;
+            *dest++ = ((ch>>6) & 0x3F) | 0x80;
+            *dest++ = (ch & 0x3F) | 0x80;
+        }
+        else if (ch < 0x110000) {
+            if (dest >= dest_end-3)
+                break;
+            *dest++ = (ch>>18) | 0xF0;
+            *dest++ = ((ch>>12) & 0x3F) | 0x80;
+            *dest++ = ((ch>>6) & 0x3F) | 0x80;
+            *dest++ = (ch & 0x3F) | 0x80;
+        }
+        i++;
+    }
+    return (dest-dest0);
+}
+
diff --git a/src/utf8.h b/src/utf8.h
new file mode 100644
index 0000000..a558efe
--- /dev/null
+++ b/src/utf8.h
@@ -0,0 +1,17 @@
+#ifndef UTF8_H
+#define UTF8_H
+
+extern int locale_is_utf8;
+
+/* is c the start of a utf8 sequence? */
+#define isutf(c) (((c)&0xC0)!=0x80)
+
+#define UEOF ((uint32_t)-1)
+
+/* convert UTF-8 data to wide character */
+size_t u8_toucs(uint32_t *dest, size_t sz, const char *src, size_t srcsz);
+
+/* the opposite conversion */
+size_t u8_toutf8(char *dest, size_t sz, const uint32_t *src, size_t srcsz);
+
+#endif
diff --git a/tests/testthat.R b/tests/testthat.R
new file mode 100644
index 0000000..9a8b7ee
--- /dev/null
+++ b/tests/testthat.R
@@ -0,0 +1,4 @@
+library(testthat)
+library(urltools)
+
+test_check("urltools")
diff --git a/tests/testthat/test_encoding.R b/tests/testthat/test_encoding.R
new file mode 100644
index 0000000..89450b3
--- /dev/null
+++ b/tests/testthat/test_encoding.R
@@ -0,0 +1,26 @@
+context("URL encoding tests")
+
+test_that("Check encoding doesn't encode the scheme", {
+  expect_that(url_encode("https://"), equals("https://"))
+})
+
+test_that("Check encoding does does not encode pre-path slashes", {
+  expect_that(url_encode("https://foo.org/bar/"), equals("https://foo.org/bar%2f"))
+})
+
+test_that("Check encoding can handle NAs", {
+  expect_that(url_encode(c("https://foo.org/bar/", NA)), equals(c("https://foo.org/bar%2f", NA)))
+})
+
+test_that("Check decoding can handle NAs", {
+  expect_that(url_decode(c("https://foo.org/bar%2f", NA)), equals(c("https://foo.org/bar/", NA)))
+})
+
+test_that("Check decoding and encoding are equivalent", {
+  
+  url <- "Hinrichtung_auf_dem_Altst%c3%a4dter_Ring.JPG%2f120px-Hinrichtung_auf_dem_Altst%c3%a4dter_Ring.JPG"
+  decoded_url <- "Hinrichtung_auf_dem_Altstädter_Ring.JPG/120px-Hinrichtung_auf_dem_Altstädter_Ring.JPG"
+  expect_that(url_decode(url), equals(decoded_url))
+  expect_that(url_encode(decoded_url), equals(url))
+  
+})
\ No newline at end of file
diff --git a/tests/testthat/test_get_set.R b/tests/testthat/test_get_set.R
new file mode 100644
index 0000000..0524dbe
--- /dev/null
+++ b/tests/testthat/test_get_set.R
@@ -0,0 +1,59 @@
+context("Component get/set tests")
+
+test_that("Check elements can be retrieved", {
+  url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
+  testthat::expect_equal(scheme(url), "https")
+  testthat::expect_equal(domain(url), "www.google.com")
+  testthat::expect_equal(port(url), "80")
+  testthat::expect_equal(path(url), "foo.php")
+  testthat::expect_equal(parameters(url), "api_params=turnip")
+  testthat::expect_equal(fragment(url), "ending")
+})
+
+test_that("Check elements can be retrieved with NAs", {
+  url <- as.character(NA)
+  testthat::expect_equal(is.na(scheme(url)), TRUE)
+  testthat::expect_equal(is.na(domain(url)), TRUE)
+  testthat::expect_equal(is.na(port(url)), TRUE)
+  testthat::expect_equal(is.na(path(url)), TRUE)
+  testthat::expect_equal(is.na(parameters(url)), TRUE)
+  testthat::expect_equal(is.na(fragment(url)), TRUE)
+})
+
+test_that("Check elements can be set", {
+  url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
+  scheme(url) <- "http"
+  testthat::expect_equal(scheme(url), "http")
+  domain(url) <- "www.wikipedia.org"
+  testthat::expect_equal(domain(url), "www.wikipedia.org")
+  port(url) <- "23"
+  testthat::expect_equal(port(url), "23")
+  path(url) <- "bar.php"
+  testthat::expect_equal(path(url), "bar.php")
+  parameters(url) <- "api_params=manic"
+  testthat::expect_equal(parameters(url), "api_params=manic")
+  fragment(url) <- "beginning"
+  testthat::expect_equal(fragment(url), "beginning")
+})
+
+test_that("Check elements can be set with NAs", {
+  url <- "https://www.google.com:80/"
+  scheme(url) <- "http"
+  testthat::expect_equal(scheme(url), "http")
+  domain(url) <- "www.wikipedia.org"
+  testthat::expect_equal(domain(url), "www.wikipedia.org")
+  port(url) <- "23"
+  testthat::expect_equal(port(url), "23")
+  path(url) <- "bar.php"
+  testthat::expect_equal(path(url), "bar.php")
+  parameters(url) <- "api_params=manic"
+  testthat::expect_equal(parameters(url), "api_params=manic")
+  fragment(url) <- "beginning"
+  testthat::expect_equal(fragment(url), "beginning")
+})
+
+test_that("Assigning NA with get will NA a URL", {
+  url <- "https://www.google.com:80/"
+  port(url) <- NA_character_
+  testthat::expect_true(is.na(url))
+})
diff --git a/tests/testthat/test_memory.R b/tests/testthat/test_memory.R
new file mode 100644
index 0000000..0cc6c69
--- /dev/null
+++ b/tests/testthat/test_memory.R
@@ -0,0 +1,30 @@
+context("Avoid regressions around proxy objects")
+
+test_that("Values are correctly disposed from memory",{
+  memfn <- function(d = NULL){
+    test_url <- "https://test.com"
+    if(!is.null(d)){
+      test_url <- urltools::param_set(test_url, "q" , urltools::url_encode(d))
+     }
+    return(test_url)
+  }
+  
+  baseurl <- "https://test.com"
+  expect_equal(memfn(), baseurl)
+  expect_equal(memfn("blah"), paste0(baseurl, "?q=blah"))
+  expect_equal(memfn(), baseurl)
+})
+
+test_that("Parameters correctly add to output",{
+  outfn <- function(d = FALSE){
+    test_url <- "https://test.com"
+    if(d){
+      test_url <- urltools::param_set(test_url, "q", urltools::url_encode(d))
+    }
+    return(test_url)
+  }
+  
+  baseurl <- "https://test.com"
+  expect_equal(outfn(), baseurl)
+  expect_equal(outfn(TRUE), paste0(baseurl, "?q=TRUE"))
+})
diff --git a/tests/testthat/test_parameters.R b/tests/testthat/test_parameters.R
new file mode 100644
index 0000000..819f554
--- /dev/null
+++ b/tests/testthat/test_parameters.R
@@ -0,0 +1,61 @@
+context("Test parameter manipulation")
+
+test_that("Parameter parsing can handle multiple, non-existent and pre-trailing parameters",{
+  urls <- c("https://www.google.com:80/foo.php?api_params=parsable&this_parameter=selfreferencing&hiphop=awesome",
+            "https://www.google.com:80/foo.php?api_params=parsable&this_parameter=selfreferencing&hiphop=awesome#foob",
+            "https://www.google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome")
+  results <- param_get(urls, c("api_params","hiphop"))
+  expect_that(results[1:2,1], equals(c("parsable","parsable")))
+  expect_true(is.na(results[3,1]))
+  
+})
+
+test_that("Parameter parsing works where the parameter appears earlier in the URL", {
+  url <- param_get("www.housetrip.es/tos-de-vacaciones/geo?from=01/04/2015&guests=4&to=05/04/2015","to")
+  expect_that(ncol(url), equals(1))
+  expect_that(url$to[1], equals("05/04/2015"))
+})
+
+test_that("Setting parameter values works", {
+  expect_true(param_set("https://en.wikipedia.org/wiki/api.php", "baz", "quorn") ==
+              "https://en.wikipedia.org/wiki/api.php?baz=quorn")
+  expect_true(param_set("https://en.wikipedia.org/wiki/api.php?foo=bar&baz=qux", "baz", "quorn") ==
+                "https://en.wikipedia.org/wiki/api.php?foo=bar&baz=quorn")
+  expect_true(param_set("https://en.wikipedia.org/wiki/api.php?foo=bar", "baz", "quorn") ==
+                "https://en.wikipedia.org/wiki/api.php?foo=bar&baz=quorn")
+})
+
+test_that("Setting parameter values quietly fails with NA components", {
+  url <- "https://en.wikipedia.org/api.php?action=query"
+  expect_identical(url, param_set(url, "action", NA_character_))
+  expect_true(is.na(param_set(NA_character_, "action", "foo")))
+  expect_identical(url, param_set(url, NA_character_, "pageinfo"))
+})
+
+
+test_that("Removing parameter entries quietly fails with NA components", {
+  url <- "https://en.wikipedia.org/api.php?action=query"
+  expect_identical(url, param_remove(url, "foo"))
+  expect_true(is.na(param_remove(NA_character_, "action")))
+})
+
+test_that("Removing parameter keys works", {
+  expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux", "baz") ==
+                "https://en.wikipedia.org/api.php")
+})
+
+test_that("Removing parameter keys works when there are multiple parameters in the URL", {
+  expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux&foo=bar", "baz") ==
+                "https://en.wikipedia.org/api.php?foo=bar")
+})
+
+test_that("Removing parameter keys works when there are multiple parameters to remove", {
+  expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux&foo=bar", c("baz","foo")) ==
+                "https://en.wikipedia.org/api.php")
+})
+
+test_that("Removing parameter keys works when there is no query", {
+  expect_true(param_remove("https://en.wikipedia.org/api.php", "baz") ==
+              "https://en.wikipedia.org/api.php")
+})
+
diff --git a/tests/testthat/test_parsing.R b/tests/testthat/test_parsing.R
new file mode 100644
index 0000000..385a64e
--- /dev/null
+++ b/tests/testthat/test_parsing.R
@@ -0,0 +1,68 @@
+context("URL parsing tests")
+
+test_that("Check parsing identifies each RfC element", {
+  
+  data <- url_parse("https://www.google.com:80/foo.php?api_params=turnip#ending")
+  expect_that(ncol(data), equals(6))
+  expect_that(names(data), equals(c("scheme","domain","port","path","parameter","fragment")))
+  expect_that(data$scheme[1], equals("https"))
+  expect_that(data$domain[1], equals("www.google.com"))
+  expect_that(data$port[1], equals("80"))
+  expect_that(data$path[1], equals("foo.php"))
+  expect_that(data$parameter[1], equals("api_params=turnip"))
+  expect_that(data$fragment[1], equals("ending"))
+})
+
+test_that("Check parsing can handle missing elements", {
+  
+  data <- url_parse("https://www.google.com/foo.php?api_params=turnip#ending")
+  expect_that(ncol(data), equals(6))
+  expect_that(names(data), equals(c("scheme","domain","port","path","parameter","fragment")))
+  expect_that(data$scheme[1], equals("https"))
+  expect_that(data$domain[1], equals("www.google.com"))
+  expect_true(is.na(data$port[1]))
+  expect_that(data$path[1], equals("foo.php"))
+  expect_that(data$parameter[1], equals("api_params=turnip"))
+  expect_that(data$fragment[1], equals("ending"))
+})
+
+test_that("Parsing does not up and die and misplace the fragment",{
+  data <- url_parse("http://www.yeastgenome.org/locus/S000005366/overview#protein")
+  expect_that(data$fragment[1], equals("protein"))
+})
+
+test_that("Composing works",{
+  url <- c("http://foo.bar.baz/qux/", "https://en.wikipedia.org:4000/wiki/api.php")
+  amended_url <- url_compose(url_parse(url))
+  expect_that(url, equals(amended_url))
+})
+
+test_that("Port handling works", {
+  url <- "https://en.wikipedia.org:4000/wiki/api.php"
+  expect_that(port(url), equals("4000"))
+  expect_that(path(url), equals("wiki/api.php"))
+  url <- "https://en.wikipedia.org:4000"
+  expect_that(port(url), equals("4000"))
+  expect_true(is.na(path(url)))
+  url <- "https://en.wikipedia.org:4000/"
+  expect_that(port(url), equals("4000"))
+  expect_true(is.na(path(url)))
+  url <- "https://en.wikipedia.org:4000?foo=bar"
+  expect_that(port(url), equals("4000"))
+  expect_true(is.na(path(url)))
+  expect_that(parameters(url), equals("foo=bar"))
+})
+
+test_that("Port handling does not break path handling", {
+  url <- "https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg"
+  expect_true(is.na(port(url)))
+  expect_that(path(url), equals("wiki/File:Vice_City_Public_Radio_(logo).jpg"))
+})
+
+test_that("URLs with parameters but no paths work", {
+  url <- url_parse("http://www.nextpedition.com?inav=menu_travel_nextpedition")
+  expect_true(url$domain[1] == "www.nextpedition.com")
+  expect_true(is.na(url$port[1]))
+  expect_true(is.na(url$path[1]))
+  expect_true(url$parameter[1] == "inav=menu_travel_nextpedition")
+})
\ No newline at end of file
diff --git a/tests/testthat/test_puny.R b/tests/testthat/test_puny.R
new file mode 100644
index 0000000..c5a1331
--- /dev/null
+++ b/tests/testthat/test_puny.R
@@ -0,0 +1,47 @@
+context("Check punycode handling")
+
+testthat::test_that("Simple punycode domain encoding works", {
+  testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com/foo")),
+                             "https://www.xn--bcher-kva.com/foo")
+})
+
+testthat::test_that("Punycode domain encoding works with fragmentary paths", {
+  testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com/")),
+                             "https://www.xn--bcher-kva.com/")
+})
+
+testthat::test_that("Punycode domain encoding works with ports", {
+  testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com:80")),
+                             "https://www.xn--bcher-kva.com:80")
+})
+
+testthat::test_that("Punycode domain encoding returns an NA on NAs", {
+  testthat::expect_true(is.na(puny_encode(NA_character_)))
+})
+
+testthat::test_that("Simple punycode domain decoding works", {
+  testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com/foo"),
+                             enc2utf8("https://www.b\u00FCcher.com/foo"))
+})
+
+testthat::test_that("Punycode domain decoding works with fragmentary paths", {
+  testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com/"),
+                             enc2utf8("https://www.b\u00FCcher.com/"))
+})
+
+testthat::test_that("Punycode domain decoding works with ports", {
+  testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com:80"),
+                             enc2utf8("https://www.b\u00FCcher.com:80"))
+})
+
+testthat::test_that("Punycode domain decoding returns an NA on NAs", {
+  testthat::expect_true(is.na(puny_decode(NA_character_)))
+})
+
+testthat::test_that("Punycode domain decoding returns an NA on invalid entries", {
+  testthat::expect_true(is.na(suppressWarnings(puny_decode("xn--9"))))
+})
+
+testthat::test_that("Punycode domain decoding warns on invalid entries", {
+  testthat::expect_warning(puny_decode("xn--9"))
+})
\ No newline at end of file
diff --git a/tests/testthat/test_suffixes.R b/tests/testthat/test_suffixes.R
new file mode 100644
index 0000000..b47afc4
--- /dev/null
+++ b/tests/testthat/test_suffixes.R
@@ -0,0 +1,108 @@
+context("Test suffix extraction")
+
+test_that("Suffix extraction works with simple domains",{
+  result <- suffix_extract("en.wikipedia.org")
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(1))
+  
+  expect_that(result$subdomain[1], equals("en"))
+  expect_that(result$domain[1], equals("wikipedia"))
+  expect_that(result$suffix[1], equals("org"))
+})
+
+test_that("Suffix extraction works with multiple domains",{
+  result <- suffix_extract(c("en.wikipedia.org","en.wikipedia.org"))
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(2))
+  
+  expect_that(result$subdomain[1], equals("en"))
+  expect_that(result$domain[1], equals("wikipedia"))
+  expect_that(result$suffix[1], equals("org"))
+  expect_that(result$subdomain[2], equals("en"))
+  expect_that(result$domain[2], equals("wikipedia"))
+  expect_that(result$suffix[2], equals("org"))
+})
+
+test_that("Suffix extraction works when the domain is the same as the suffix",{
+  result <- suffix_extract(c("googleapis.com", "myapi.googleapis.com"))
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(2))
+  
+  expect_equal(result$subdomain[1], NA_character_)
+  expect_equal(result$domain[1], NA_character_)
+  expect_equal(result$suffix[1], "googleapis.com")
+  expect_equal(result$subdomain[2], NA_character_)
+  expect_equal(result$domain[2], "myapi")
+  expect_equal(result$suffix[2], "googleapis.com")
+})
+
+test_that("Suffix extraction works where domains/suffixes overlap", {
+  result <- suffix_extract(domain("http://www.converse.com")) # could be se.com or .com
+  expect_equal(result$subdomain[1], "www")
+  expect_equal(result$domain[1], "converse")
+  expect_equal(result$suffix[1], "com")
+})
+
+test_that("Suffix extraction works when the domain matches a wildcard suffix",{
+  result <- suffix_extract(c("banana.bd", "banana.boat.bd"))
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(2))
+  
+  expect_equal(result$subdomain[1], NA_character_)
+  expect_equal(result$domain[1], NA_character_)
+  expect_equal(result$suffix[1], "banana.bd")
+  expect_equal(result$subdomain[2], NA_character_)
+  expect_equal(result$domain[2], "banana")
+  expect_equal(result$suffix[2], "boat.bd")
+})
+
+test_that("Suffix extraction works when the domain matches a wildcard suffix and has subdomains",{
+  result <- suffix_extract(c("foo.bar.banana.bd"))
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(1))
+  expect_equal(result$subdomain[1], "foo")
+  expect_equal(result$domain[1], "bar")
+  expect_equal(result$suffix[1], "banana.bd")
+})
+
+
+test_that("Suffix extraction works with new suffixes",{
+  result <- suffix_extract("en.wikipedia.org", suffix_refresh())
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(1))
+  
+  expect_that(result$subdomain[1], equals("en"))
+  expect_that(result$domain[1], equals("wikipedia"))
+  expect_that(result$suffix[1], equals("org"))
+})
+
+test_that("Suffix extraction works with an arbitrary suffixes database (to ensure it is loading it)",{
+  result <- suffix_extract(c("is-this-a.bananaboat", "en.wikipedia.org"), data.frame(suffixes = "bananaboat"))
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(2))
+  
+  expect_equal(result$subdomain[1], NA_character_)
+  expect_equal(result$domain[1], "is-this-a")
+  expect_equal(result$suffix[1], "bananaboat")
+  expect_equal(result$subdomain[2], NA_character_)
+  expect_equal(result$domain[2], NA_character_)
+  expect_equal(result$suffix[2], NA_character_)
+})
+
+test_that("Suffix extraction is back to normal using the internal database when it receives suffixes=NULL",{
+  result <- suffix_extract("en.wikipedia.org")
+  expect_that(ncol(result), equals(4))
+  expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
+  expect_that(nrow(result), equals(1))
+  
+  expect_that(result$subdomain[1], equals("en"))
+  expect_that(result$domain[1], equals("wikipedia"))
+  expect_that(result$suffix[1], equals("org"))
+})
\ No newline at end of file
diff --git a/vignettes/urltools.Rmd b/vignettes/urltools.Rmd
new file mode 100644
index 0000000..beb3db3
--- /dev/null
+++ b/vignettes/urltools.Rmd
@@ -0,0 +1,182 @@
+<!--
+%\VignetteEngine{knitr::knitr}
+%\VignetteIndexEntry{urltools}
+-->
+
+## Elegant URL handling with urltools
+
+URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
+to create connections to retrieve datasets. This is an essential feature for the language to have,
+but it also means that URL handlers are designed for situations where URLs *get* you to the data - 
+not situations where URLs *are* the data.
+
+There is no support for encoding or decoding URLs en masse, and no support for parsing and
+interpreting them. `urltools` provides this support!
+
+### URL encoding and decoding
+
+Base R provides two functions - <code>URLdecode</code> and <code>URLencode</code> - for taking percentage-encoded
+URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
+enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
+make them unsuitable for large datasets.
+
+Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
+<code>URLdecode</code>, for example, breaks if the decoded value is out of range:
+
+```{r, eval=FALSE}
+URLdecode("test%gIL")
+Error in rawToChar(out) : embedded nul in string: 'test\0L'
+In addition: Warning message:
+In URLdecode("test%gIL") : out-of-range values treated as 0 in coercion to raw
+```
+
+URLencode, on the other hand, encodes slashes on its most strict setting - without
+paying attention to where those slashes *are*: if we attempt to URLencode an entire URL, we get:
+
+```{r, eval=FALSE}
+URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
+[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
+```
+That's a completely unusable URL (or ewRL, if you will).
+
+urltools replaces both functions with <code>url\_decode</code> and <code>url\_encode</code> respectively:
+```{r, eval=FALSE}
+library(urltools)
+url_decode("test%gIL")
+[1] "test"
+url_encode("https://en.wikipedia.org/wiki/Article")
+[1] "https://en.wikipedia.org%2fwiki%2fArticle"
+```
+
+As you can see, <code>url\_decode</code> simply excludes out-of-range characters from consideration, while <code>url\_encode</code> detects characters that make up part of the URL's scheme, and leaves them unencoded. Both are extremely fast; with `urltools`, you can
+decode a vector of 1,000,000 URLs in 0.9 seconds.
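+
+If you want to check that claim on your own hardware, a minimal sketch along these lines should do;
+the repeated URL and the timing harness are illustrative only, and actual timings will vary:
+
+```{r, eval=FALSE}
+# build a large vector of identical URLs and time a round-trip
+urls <- rep("https://en.wikipedia.org/wiki/Article", 1000000)
+system.time(url_decode(url_encode(urls)))
+```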
+
+Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and have unicode characters in them. These also take one argument, a vector of URLs, and can be found at `puny_encode` and `puny_decode` respectively.
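+
+For example (these calls and their expected results mirror the package's own test suite):
+
+```{r, eval=FALSE}
+puny_encode("https://www.bücher.com/foo")
+# [1] "https://www.xn--bcher-kva.com/foo"
+puny_decode("https://www.xn--bcher-kva.com/foo")
+# [1] "https://www.bücher.com/foo"
+```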
+
+### URL parsing
+
+Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
+you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
+but not the entire thing as one string.
+
+The solution is <code>url_parse</code>, which takes a URL and breaks it out into its [RfC 3986](http://www.ietf.org/rfc/rfc3986.txt) components: scheme, domain, port, path, query string and fragment identifier. This is,
+again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
+results are provided as a data.frame, since most people use data.frames to store data.
+
+```{r, eval=FALSE}
+> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
+> str(parsed_address)
+'data.frame':	1 obs. of  6 variables:
+ $ scheme   : chr "https"
+ $ domain   : chr "en.wikipedia.org"
+ $ port     : chr NA
+ $ path     : chr "wiki/Article"
+ $ parameter: chr NA
+ $ fragment : chr NA                         
+```
+
+We can also perform the opposite of this operation with `url_compose`:
+```{r, eval=FALSE}
+> url_compose(parsed_address)
+[1] "https://en.wikipedia.org/wiki/article"
+```
+
+### Getting/setting URL components
+With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
+and setting. Syntax is identical to that of `lubridate`, but uses URL components as function names.
+
+```{r, eval=FALSE}
+url <- "https://en.wikipedia.org/wiki/Article"
+scheme(url)
+"https"
+scheme(url) <- "ftp"
+url
+"ftp://en.wikipedia.org/wiki/Article"
+```
+Fields that can be extracted or set are <code>scheme</code>, <code>domain</code>, <code>port</code>, <code>path</code>,
+<code>parameters</code> and <code>fragment</code>.
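+
+One behaviour worth knowing about - it comes straight from the package's test suite - is that
+assigning NA to a component turns the whole URL into NA:
+
+```{r, eval=FALSE}
+url <- "https://www.google.com:80/"
+port(url) <- NA_character_
+url
+# [1] NA
+```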
+
+### Suffix and TLD extraction
+
+Once we've extracted a domain from a URL with `domain` or `url_parse`, we can identify which bit is the domain name, and which
+bit is the suffix:
+
+```{r, eval=FALSE}
+> url <- "https://en.wikipedia.org/wiki/Article"
+> domain_name <- domain(url)
+> domain_name
+[1] "en.wikipedia.org"
+> str(suffix_extract(domain_name))
+'data.frame':	1 obs. of  4 variables:
+ $ host     : chr "en.wikipedia.org"
+ $ subdomain: chr "en"
+ $ domain   : chr "wikipedia"
+ $ suffix   : chr "org"
+```
+
+This relies on an internal database of public suffixes, accessible at `suffix_dataset` - we recognise, though,
+that this dataset may get a bit out of date, so you can also pass the results of the `suffix_refresh` function,
+which retrieves an updated dataset, to `suffix_extract`:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+updated_suffixes <- suffix_refresh()
+suffix_extract(domain_name, updated_suffixes)
+```
+
+We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are `tld_refresh`, `tld_extract` and `tld_dataset`.
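+
+For example, a sketch mirroring the suffix example above (output omitted, as for `host_extract` below):
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+tld_extract(domain_name)
+```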
+
+In the other direction we have `host_extract`, which retrieves, well, the host! If the URL has subdomains, it'll be the
+lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:
+
+```{r, eval=FALSE}
+domain_name <- domain("https://en.wikipedia.org/wiki/Article")
+host_extract(domain_name)
+```
+### Query manipulation
+Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
+an example, take the URL `http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json`. What
+pageID is being used? What is the export format? We can find out with `param_get`.
+
+```{r, eval=FALSE}
+> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
+                     parameter_names = c("pageid","export")))
+'data.frame':	1 obs. of  2 variables:
+ $ pageid: chr "1023"
+ $ export: chr "json"
+```
+
+This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
+might have, or strip them out entirely.
+
+To modify the values, we use `param_set`:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
+```
+
+As you can see, this works pretty well; it even works in situations where the URL doesn't *have* a query yet:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php"
+url <- param_set(url, key = "pageid", value = "12")
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
+```
+
+On the other hand we might have a parameter we just don't want any more - that can be handled with `param_remove`, which can
+take multiple parameters as well as multiple URLs:
+
+```{r, eval=FALSE}
+url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
+url <- param_remove(url, keys = c("action","export"))
+url
+# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
+```
+
+### Other URL handlers
+If you have ideas for other URL handlers that would make your data processing easier, the best approach
+is to either [request it](https://github.com/Ironholds/urltools/issues) or [add it](https://github.com/Ironholds/urltools/pulls)!

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-cran-urltools.git


