[med-svn] [r-cran-filehash] 03/05: New upstream version 2.3
Andreas Tille
tille at debian.org
Tue Oct 10 09:47:22 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository r-cran-filehash.
commit 22864ed6bafeb303dff664b845ac66c61e3867ea
Author: Andreas Tille <tille at debian.org>
Date: Tue Oct 10 11:45:42 2017 +0200
New upstream version 2.3
---
DESCRIPTION | 22 +++
MD5 | 44 +++++
NAMESPACE | 62 +++++++
R/coerce.R | 43 +++++
R/dump.R | 67 +++++++
R/filehash-DB1.R | 442 ++++++++++++++++++++++++++++++++++++++++++++
R/filehash-RDS.R | 183 +++++++++++++++++++
R/filehash.R | 306 +++++++++++++++++++++++++++++++
R/hash.R | 10 +
R/queue.R | 91 ++++++++++
R/stack.R | 91 ++++++++++
R/zzz.R | 22 +++
build/vignette.rds | Bin 0 -> 209 bytes
debian/README.test | 10 -
debian/changelog | 5 -
debian/compat | 1 -
debian/control | 29 ---
debian/copyright | 29 ---
debian/docs | 3 -
debian/rules | 15 --
debian/source/format | 1 -
debian/tests/control | 3 -
debian/tests/run-unit-test | 37 ----
debian/watch | 2 -
inst/CITATION | 14 ++
inst/COPYING | 19 ++
inst/NEWS | 90 +++++++++
inst/doc/filehash.R | 145 +++++++++++++++
inst/doc/filehash.Rnw | 443 +++++++++++++++++++++++++++++++++++++++++++++
inst/doc/filehash.pdf | Bin 0 -> 100431 bytes
man/createQ.Rd | 31 ++++
man/createS.Rd | 31 ++++
man/db2env.Rd | 96 ++++++++++
man/dbInit.Rd | 64 +++++++
man/dump.Rd | 65 +++++++
man/filehash-class.Rd | 150 +++++++++++++++
man/filehashFormats.Rd | 30 +++
man/filehashOption.Rd | 27 +++
man/push.Rd | 43 +++++
man/queue-class.Rd | 47 +++++
man/stack-class.Rd | 50 +++++
src/hash.c | 84 +++++++++
src/lockfile.c | 21 +++
src/readKeyMap.c | 65 +++++++
src/sha1.c | 371 +++++++++++++++++++++++++++++++++++++
src/sha1.h | 24 +++
tests/SHA1SUM | 2 +
tests/misc/create-testdb.R | 14 ++
tests/reg-tests.R | 183 +++++++++++++++++++
tests/reg-tests.Rout.save | 304 +++++++++++++++++++++++++++++++
tests/testdb-v1.1 | Bin 0 -> 726 bytes
tests/testdb-v2.0 | Bin 0 -> 726 bytes
tests/versions.R | 22 +++
tests/versions.Rout.save | 114 ++++++++++++
vignettes/combined.bib | 50 +++++
vignettes/filehash.Rnw | 443 +++++++++++++++++++++++++++++++++++++++++++++
56 files changed, 4425 insertions(+), 135 deletions(-)
diff --git a/DESCRIPTION b/DESCRIPTION
new file mode 100644
index 0000000..264febc
--- /dev/null
+++ b/DESCRIPTION
@@ -0,0 +1,22 @@
+Package: filehash
+Date: 2015-08-12
+Version: 2.3
+Depends: R (>= 3.0.0), methods
+Collate: filehash.R filehash-DB1.R filehash-RDS.R coerce.R dump.R
+ hash.R queue.R stack.R zzz.R
+Title: Simple Key-Value Database
+Author: Roger D. Peng <rdpeng at jhu.edu>
+Maintainer: Roger D. Peng <rdpeng at jhu.edu>
+Description: Implements a simple key-value style database where character string keys
+ are associated with data values that are stored on the disk. A simple interface is provided for inserting,
+ retrieving, and deleting data from the database. Utilities are provided that allow 'filehash' databases to be
+ treated much like environments and lists are already used in R. These utilities are provided to encourage
+ interactive and exploratory analysis on large datasets. Three different file formats for representing the
+ database are currently available and new formats can easily be incorporated by third parties for use in the
+ 'filehash' framework.
+License: GPL (>= 2)
+URL: http://github.com/rdpeng/filehash
+Packaged: 2015-08-12 14:57:00 UTC; rdpeng
+NeedsCompilation: yes
+Repository: CRAN
+Date/Publication: 2015-08-16 07:30:57
diff --git a/MD5 b/MD5
new file mode 100644
index 0000000..1fd5c05
--- /dev/null
+++ b/MD5
@@ -0,0 +1,44 @@
+a50b1c1bdc0c3c65e36b52d5b85b364a *DESCRIPTION
+369879e4ab19e6934e6136e2f5a86f9f *NAMESPACE
+45232e1ac4dac258e45556bd5f43123c *R/coerce.R
+90f64221f6f44767deed77323c0d4db2 *R/dump.R
+d20b08ebd20413d3c82045b78e07d662 *R/filehash-DB1.R
+792728111ce93475ae36b03871c7fb51 *R/filehash-RDS.R
+b0769cf1a3599dced35d14ce286a651c *R/filehash.R
+23024206925dc990b4acd2f5f14d18b4 *R/hash.R
+b096fd971b2464562c885b8f6d3caeab *R/queue.R
+a75a45aef48fc227a52cc6144180eed0 *R/stack.R
+5d89ecc5246dec2fa4b5b6550d00c1bc *R/zzz.R
+aface47053e12f9628bb70a828a1ad75 *build/vignette.rds
+ed14a5c660ea0f0e723a406175038eb7 *inst/CITATION
+b128d2038f8d0c5c554b72de80298e4a *inst/COPYING
+44422cadef3c067ffd43b9f00a18aa9d *inst/NEWS
+6b5ee8f3a31a761dd38bd0238efbdfc1 *inst/doc/filehash.R
+7fb6c57e7c3a9b572359b21e60e07d3f *inst/doc/filehash.Rnw
+425c7ead37800f5dcfc3af75c2d7f43d *inst/doc/filehash.pdf
+196634a9bcac54d5a8c2f30c622caf8d *man/createQ.Rd
+76d0e8fb13ec5e82d2db2fe69c153399 *man/createS.Rd
+994907132b6735cd18cb2e58fbe38246 *man/db2env.Rd
+666fa5e79d7b3eef0600c59653613c1c *man/dbInit.Rd
+7540db5f93596a70ed9a552bda857059 *man/dump.Rd
+fd3538ab1ab32963e4450a394c1970e6 *man/filehash-class.Rd
+45f7f3a21e0d0cb2afbed678f4cf1e44 *man/filehashFormats.Rd
+fe2418282768cb27bdcef448742c263e *man/filehashOption.Rd
+19f14bdb7f0a406d23e38ea6d0a22405 *man/push.Rd
+3206913d23a67ae002e0214ee2093e94 *man/queue-class.Rd
+d6937904dedc8cd4e083e1b941d78b70 *man/stack-class.Rd
+e3ae87905dbded19881cffd6c80a8be4 *src/hash.c
+1f69eebc69b381da6e96c802a8c93003 *src/lockfile.c
+8a9e91209bf674c2db625d29298caad2 *src/readKeyMap.c
+2212ffe253fda0b122c209ae4e220c71 *src/sha1.c
+7cf279b5c8a6743e49a2c37d105733f5 *src/sha1.h
+d83a9585d249e9a093cb3c545cf49834 *tests/SHA1SUM
+f00793baa7e5a70812957058ec569332 *tests/misc/create-testdb.R
+4900ff9e4624ba08887b7ca8a702df8c *tests/reg-tests.R
+4872c25b98a848e6f7ef872f9dfbf617 *tests/reg-tests.Rout.save
+5b7464763d85ba9406c9e4dddce80d97 *tests/testdb-v1.1
+5b7464763d85ba9406c9e4dddce80d97 *tests/testdb-v2.0
+40829c8958672fbc650561dde99ea5a0 *tests/versions.R
+6e36a99376f5c4281e1acc0885d9d91c *tests/versions.Rout.save
+b2631aaa4f28eae69ba9e6b9b5bbbe80 *vignettes/combined.bib
+7fb6c57e7c3a9b572359b21e60e07d3f *vignettes/filehash.Rnw
diff --git a/NAMESPACE b/NAMESPACE
new file mode 100644
index 0000000..7b05eca
--- /dev/null
+++ b/NAMESPACE
@@ -0,0 +1,62 @@
+useDynLib(filehash)
+import(methods)
+
+## Classes
+exportClasses(
+ "filehash",
+ "filehashRDS",
+ "filehashDB1"
+ )
+
+
+## Primary interface
+exportMethods(
+ "dbInsert",
+ "dbFetch",
+ "dbExists",
+ "dbList",
+ "dbDelete",
+ "dbReorganize",
+ "dbUnlink",
+ "dbMultiFetch",
+ "dbCreate",
+ "dbInit",
+ "dbLoad",
+ "dbLazyLoad"
+ )
+
+exportMethods("[[", "[", "[[<-", "$<-", "$")
+
+export(
+ "filehashOption",
+ "registerFormatDB",
+ "filehashFormats"
+ )
+
+
+## Miscellaneous functions
+exportMethods(
+ "show",
+ "with",
+ "coerce",
+ "lapply",
+ "names",
+ "length"
+ )
+
+export(
+ "dumpDF",
+ "dumpObjects",
+ "db2env",
+ "dumpImage",
+ "dumpList",
+ "dumpEnv"
+ )
+
+## Stack and Queue stuff
+exportClasses("stack", "queue")
+exportMethods("isEmpty", "top", "push", "pop")
+exportMethods("mpush")
+
+export("createQ", "initQ")
+export("createS", "initS")
diff --git a/R/coerce.R b/R/coerce.R
new file mode 100644
index 0000000..23912b0
--- /dev/null
+++ b/R/coerce.R
@@ -0,0 +1,43 @@
+######################################################################
+## Copyright (C) 2006, Roger D. Peng <rpeng at jhsph.edu>
+##
+## This program is free software; you can redistribute it and/or modify
+## it under the terms of the GNU General Public License as published by
+## the Free Software Foundation; either version 2 of the License, or
+## (at your option) any later version.
+##
+## This program is distributed in the hope that it will be useful,
+## but WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+## GNU General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with this program; if not, write to the Free Software
+## Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+## 02110-1301, USA
+#####################################################################
+
+toDBType <- function(from, type, dbpath = NULL) {
+ if(is.null(dbpath))
+ dbpath <- dbName(from)
+ if(!dbCreate(dbpath, type = type))
+ stop("could not create ", type, " database")
+ db <- dbInit(dbpath, type = type)
+ keys <- dbList(from)
+
+ for(key in keys)
+ dbInsert(db, key, dbFetch(from, key))
+ invisible(db)
+}
+
+setAs("filehashDB1", "filehashRDS",
+ function(from) {
+ dbpath <- paste(dbName(from), "RDS", sep = "")
+ toDBType(from, "RDS", dbpath)
+ })
+
+setAs("filehashDB1", "list",
+ function(from) {
+ keys <- dbList(from)
+ dbMultiFetch(from, keys)
+ })
diff --git a/R/dump.R b/R/dump.R
new file mode 100644
index 0000000..b381e88
--- /dev/null
+++ b/R/dump.R
@@ -0,0 +1,67 @@
+######################################################################
+## Copyright (C) 2006--2008, Roger D. Peng <rpeng at jhsph.edu>
+##
+## This program is free software; you can redistribute it and/or modify
+## it under the terms of the GNU General Public License as published by
+## the Free Software Foundation; either version 2 of the License, or
+## (at your option) any later version.
+##
+## This program is distributed in the hope that it will be useful,
+## but WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+## GNU General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with this program; if not, write to the Free Software
+## Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+## 02110-1301, USA
+#####################################################################
+
+dumpEnv <- function(env, dbName) {
+ keys <- ls(env, all.names = TRUE)
+ dumpObjects(list = keys, dbName = dbName, envir = env)
+}
+
+dumpImage <- function(dbName = "Rworkspace", type = NULL) {
+ dumpObjects(list = ls(envir = globalenv(), all.names = TRUE),
+ dbName = dbName, type = type, envir = globalenv())
+}
+
+dumpObjects <- function(..., list = character(0), dbName, type = NULL,
+ envir = parent.frame()) {
+ names <- as.character(substitute(list(...)))[-1]
+ list <- c(list, names)
+ if(!dbCreate(dbName, type))
+ stop("could not create database file")
+ db <- dbInit(dbName, type)
+
+ for(i in seq(along = list))
+ dbInsert(db, list[i], get(list[i], envir))
+ db
+}
+
+dumpDF <- function(data, dbName = NULL, type = NULL) {
+ if(is.null(dbName))
+ dbName <- as.character(substitute(data))
+ dumpList(as.list(data), dbName = dbName, type = type)
+}
+
+dumpList <- function(data, dbName = NULL, type = NULL) {
+ if(!is.list(data))
+ stop("'data' must be a list")
+ vnames <- names(data)
+
+ if(is.null(vnames) || isTRUE("" %in% vnames))
+ stop("list must have non-empty names")
+ if(is.null(dbName))
+ dbName <- as.character(substitute(data))
+
+ if(!dbCreate(dbName, type))
+ stop("could not create database file")
+ db <- dbInit(dbName, type)
+
+ for(i in seq(along = vnames))
+ dbInsert(db, vnames[i], data[[vnames[i]]])
+ db
+}
+
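Note on dumpObjects() in R/dump.R above: the unquoted objects passed through '...' become database keys via substitute(). The following is a minimal base-R sketch of that idiom only; captureNames is a hypothetical helper used for illustration and is not part of filehash.

```r
## Hypothetical helper illustrating the name-capture idiom used in
## dumpObjects(); it is not part of filehash.
captureNames <- function(...) {
    ## substitute() yields the unevaluated call list(x, y, ...).
    ## as.character() turns it into c("list", "x", "y", ...),
    ## so the first element is dropped.
    as.character(substitute(list(...)))[-1]
}

x <- 1:10
y <- letters
keys <- captureNames(x, y)
print(keys)
```

The captured names are then looked up with get() in the calling environment, which is why dumpObjects() takes an 'envir' argument.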
diff --git a/R/filehash-DB1.R b/R/filehash-DB1.R
new file mode 100644
index 0000000..b516127
--- /dev/null
+++ b/R/filehash-DB1.R
@@ -0,0 +1,442 @@
+######################################################################
+## Copyright (C) 2006--2008, Roger D. Peng <rpeng at jhsph.edu>
+##
+## This program is free software; you can redistribute it and/or modify
+## it under the terms of the GNU General Public License as published by
+## the Free Software Foundation; either version 2 of the License, or
+## (at your option) any later version.
+##
+## This program is distributed in the hope that it will be useful,
+## but WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+## GNU General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with this program; if not, write to the Free Software
+## Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+## 02110-1301, USA
+#####################################################################
+
+######################################################################
+## Class 'filehashDB1'
+
+## Database entries
+##
+## File format: [key] [nbytes data] [data]
+## serialized serialized raw bytes (serialized)
+##
+
+######################################################################
+
+## 'meta' is a list of functions for updating the file size of the
+## database and the file map.
+
+setClass("filehashDB1",
+ representation(datafile = "character",
+ meta = "list"),
+ contains = "filehash"
+ )
+
+setValidity("filehashDB1",
+ function(object) {
+ if(!file.exists(object@datafile))
+ return(gettextf("datafile '%s' does not exist",
+ object@datafile))
+ TRUE
+ })
+
+createDB1 <- function(dbName) {
+ if(!hasWorkingFtell())
+ stop("need working 'ftell()' to use 'DB1' format")
+ if(file.exists(dbName)) {
+ message(gettextf("database '%s' already exists", dbName))
+ return(TRUE)
+ }
+ status <- file.create(dbName)
+
+ if(!status)
+ stop(gettextf("unable to create database file '%s'", dbName))
+ TRUE
+}
+
+makeMetaEnv <- function(filename) {
+ dbmap <- NULL ## 'NULL' indicates the map needs to be read
+ dbfilesize <- file.info(filename)$size
+
+ updatesize <- function(size) {
+ dbfilesize <<- size
+ }
+ updatemap <- function(map) {
+ dbmap <<- map
+ }
+ getsize <- function() {
+ dbfilesize
+ }
+ getmap <- function() {
+ dbmap
+ }
+ list(updatesize = updatesize,
+ updatemap = updatemap,
+ getmap = getmap,
+ getsize = getsize)
+}
+
+initializeDB1 <- function(dbName) {
+ if(!hasWorkingFtell())
+ stop("need working 'ftell()' to use DB1 format")
+ dbName <- normalizePath(dbName)
+
+ new("filehashDB1",
+ datafile = dbName,
+ meta = makeMetaEnv(dbName),
+ name = basename(dbName)
+ )
+}
+
+
+readKeyMap <- function(con, map = NULL, pos = 0) {
+ if(is.null(map)) {
+ ## using 'hash = TRUE' is critical because it can have a major
+ ## impact on performance for large databases
+ map <- new.env(hash = TRUE, parent = emptyenv())
+ pos <- 0
+ }
+ if(pos < 0)
+ stop("'pos' cannot be negative")
+ filename <- path.expand(summary(con)$description)
+ filesize <- file.info(filename)$size
+
+ if(pos > filesize)
+ stop("'pos' cannot be greater than file size")
+ .Call("read_key_map", filename, map, filesize, pos)
+}
+
+readSingleKey <- function(con, map, key) {
+ start <- map[[key]]
+
+ if(is.null(start))
+ stop(gettextf("unable to obtain value for key '%s'", key))
+
+ seek(con, start, rw = "read")
+ unserialize(con)
+}
+
+readKeys <- function(con, map, keys) {
+ r <- lapply(keys, function(key) readSingleKey(con, map, key))
+ names(r) <- keys
+ r
+}
+
+gotoEndPos <- function(con) {
+ ## Move connection to the end
+ seek(con, 0, "end")
+ seek(con)
+}
+
+writeNullKeyValue <- function(con, key) {
+ writestart <- gotoEndPos(con)
+
+ handler <- function(cond) {
+ ## Rewind the file back to where writing began and truncate at
+ ## that position
+ seek(con, writestart, "start", "write")
+ truncate(con)
+ cond
+ }
+ tryCatch({
+ serialize(key, con)
+
+ len <- as.integer(-1)
+ serialize(len, con)
+ }, interrupt = handler, error = handler, finally = {
+ flush(con)
+ })
+}
+
+writeKeyValue <- function(con, key, value) {
+ writestart <- gotoEndPos(con)
+
+ handler <- function(cond) {
+ ## Rewind the file back to where writing began and
+ ## truncate at that position; this is probably a bad
+ ## idea for files > 2GB
+ seek(con, writestart, "start", "write")
+ truncate(con)
+ cond
+ }
+ tryCatch({
+ serialize(key, con)
+
+ byteData <- serialize(value, NULL)
+ len <- length(byteData)
+ serialize(len, con)
+
+ writeBin(byteData, con)
+ }, interrupt = handler, error = handler, finally = {
+ flush(con)
+ })
+}
+
+setMethod("lockFile", "file", function(db, ...) {
+ ## Use 3 underscores for lock file
+ sprintf("%s___LOCK", summary(db)$description)
+})
+
+createLockFile <- function(name) {
+ if(.Platform$OS.type != "windows")
+ status <- .Call("lock_file", name)
+ else {
+ ## TODO: are these optimal values for max.attempts
+ ## and sleep.duration?
+ max.attempts <- 4
+ sleep.duration <- 0.5
+ attempts <- 0
+ status <- -1
+ while ((attempts <= max.attempts) && ! isTRUE(status >= 0)) {
+ attempts <- attempts + 1
+ status <- .Call("lock_file", name)
+
+ if(!isTRUE(status >= 0))
+ Sys.sleep(sleep.duration)
+ }
+ }
+ if(!isTRUE(status >= 0))
+ stop("cannot create lock file ", sQuote(name))
+ TRUE
+}
+
+deleteLockFile <- function(name) {
+ if(!file.remove(name))
+ stop(paste('cannot remove lock file "', name, '"', sep=''))
+ TRUE
+}
+
+################################################################################
+## Internal utilities
+
+filesize <- gotoEndPos
+
+setGeneric("checkMap", function(db, ...) standardGeneric("checkMap"))
+
+setMethod("checkMap", "filehashDB1",
+ function(db, filecon, ...) {
+ old.size <- db@meta$getsize()
+ cur.size <- tryCatch({
+ filesize(filecon)
+ }, error = function(err) {
+ old.size
+ })
+ size.change <- old.size != cur.size
+ map <- getMap(db)
+ map0 <- map
+
+ if(is.null(map))
+ map <- readKeyMap(filecon)
+ else if(size.change) {
+ ## Update the existing map directly, reading from the old file size onward
+ map <- tryCatch({
+ readKeyMap(filecon, map, old.size)
+ }, error = function(err) {
+ message(conditionMessage(err))
+ map0
+ })
+ }
+ else
+ map <- map0
+ if(!identical(map, map0)) {
+ db@meta$updatemap(map)
+ db@meta$updatesize(cur.size)
+ }
+ invisible(db)
+ })
+
+
+setGeneric("getMap", function(db) standardGeneric("getMap"))
+
+setMethod("getMap", "filehashDB1",
+ function(db) {
+ db@meta$getmap()
+ })
+
+################################################################################
+## Interface functions
+
+openDBConn <- function(filename, mode) {
+ con <- try({
+ file(filename, mode)
+ }, silent = TRUE)
+
+ if(inherits(con, "try-error"))
+ stop("unable to open connection to database")
+ con
+}
+
+setMethod("dbInsert",
+ signature(db = "filehashDB1", key = "character", value = "ANY"),
+ function(db, key, value, ...) {
+ con <- openDBConn(db@datafile, "ab")
+ on.exit(close(con))
+
+ lockname <- lockFile(con)
+ createLockFile(lockname)
+ on.exit(deleteLockFile(lockname), add = TRUE)
+
+ invisible(writeKeyValue(con, key, value))
+ })
+
+setMethod("dbFetch",
+ signature(db = "filehashDB1", key = "character"),
+ function(db, key, ...) {
+ con <- openDBConn(db@datafile, "rb")
+ on.exit(close(con))
+
+ lockname <- lockFile(con)
+ createLockFile(lockname)
+ on.exit(deleteLockFile(lockname), add = TRUE)
+
+ checkMap(db, con)
+ map <- getMap(db)
+
+ val <- readSingleKey(con, map, key)
+ val
+ })
+
+setMethod("dbMultiFetch",
+ signature(db = "filehashDB1", key = "character"),
+ function(db, key, ...) {
+ con <- openDBConn(db@datafile, "rb")
+ on.exit(close(con))
+
+ lockname <- lockFile(con)
+ createLockFile(lockname)
+ on.exit(deleteLockFile(lockname), add = TRUE)
+
+ checkMap(db, con)
+ map <- getMap(db)
+
+ readKeys(con, map, key)
+ })
+
+setMethod("dbExists", signature(db = "filehashDB1", key = "character"),
+ function(db, key, ...) {
+ dbkeys <- dbList(db)
+ key %in% dbkeys
+ })
+
+setMethod("dbList", "filehashDB1",
+ function(db, ...) {
+ con <- openDBConn(db@datafile, "rb")
+ on.exit(close(con))
+
+ lockname <- lockFile(con)
+ createLockFile(lockname)
+ on.exit(deleteLockFile(lockname), add = TRUE)
+
+ checkMap(db, con)
+ map <- getMap(db)
+
+ if(length(map) == 0)
+ character(0)
+ else {
+ keys <- as.list(map, all.names = TRUE)
+ use <- !sapply(keys, is.null)
+ names(keys[use])
+ }
+ })
+
+setMethod("dbDelete", signature(db = "filehashDB1", key = "character"),
+ function(db, key, ...) {
+ con <- openDBConn(db@datafile, "ab")
+ on.exit(close(con))
+
+ lockname <- lockFile(con)
+ createLockFile(lockname)
+ on.exit(deleteLockFile(lockname), add = TRUE)
+
+ invisible(writeNullKeyValue(con, key))
+ })
+
+setMethod("dbUnlink", "filehashDB1",
+ function(db, ...) {
+ file.remove(db@datafile)
+ })
+
+reorganizeDB <- function(db, ...) {
+ datafile <- db@datafile
+
+ ## Find a temporary file name
+ tempdata <- paste(datafile, "Tmp", sep = "")
+ i <- 0
+ while(file.exists(tempdata)) {
+ i <- i + 1
+ tempdata <- paste(datafile, "Tmp", i, sep = "")
+ }
+ if(!dbCreate(tempdata, type = "DB1")) {
+ warning("could not create temporary database")
+ return(FALSE)
+ }
+ on.exit(file.remove(tempdata))
+
+ tempdb <- dbInit(tempdata, type = "DB1")
+ keys <- dbList(db)
+
+ ## Copy all keys to temporary database
+ nkeys <- length(keys)
+ cat("Reorganizing database: ")
+
+ for(i in seq_along(keys)) {
+ key <- keys[i]
+ msg <- sprintf("%d%% (%d/%d)", round (100 * i / nkeys),
+ i, nkeys)
+ cat(msg)
+
+ dbInsert(tempdb, key, dbFetch(db, key))
+
+ back <- paste(rep("\b", nchar(msg)), collapse = "")
+ cat(back)
+ }
+ cat("\n")
+ status <- file.rename(tempdata, datafile)
+
+ if(!isTRUE(status)) {
+ on.exit()
+ warning("temporary database could not be renamed and is left in ",
+ tempdata)
+ return(FALSE)
+ }
+ on.exit()
+ cat("Finished; reload database with 'dbInit'\n")
+ TRUE
+}
+
+setMethod("dbReorganize", "filehashDB1", reorganizeDB)
+
+
+################################################################################
+## Test system's ftell()
+
+hasWorkingFtell <- function() {
+ tfile <- tempfile()
+ con <- file(tfile, "wb")
+
+ tryCatch({
+ bytes <- raw(10)
+ begin <- seek(con)
+
+ if(begin != 0)
+ return(FALSE)
+ writeBin(bytes, con)
+ end <- seek(con)
+ offset <- end - begin
+ isTRUE(offset == 10)
+ }, error = function(e) {
+ FALSE
+ }, finally = {
+ close(con)
+ unlink(tfile)
+ })
+}
+
+######################################################################
+
+
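Note on the DB1 layout in R/filehash-DB1.R above: each record is appended as [key] [nbytes] [data], and a deletion is recorded as a key followed by a length of -1. The following base-R sketch illustrates that layout under simplifying assumptions. All function names are hypothetical, there is no locking or error rollback, and the scan loads values directly, whereas the package appears to build a map from keys to byte offsets through its C routine.

```r
## Base-R sketch of an append-only record file laid out as
## [key] [nbytes] [data]. All function names here are hypothetical.
appendRecord <- function(path, key, value) {
    con <- file(path, "ab")
    on.exit(close(con))
    serialize(key, con)               # key, serialized
    bytes <- serialize(value, NULL)   # value as a raw vector
    serialize(length(bytes), con)     # byte count, serialized
    writeBin(bytes, con)              # payload
    invisible(TRUE)
}

appendDeletion <- function(path, key) {
    con <- file(path, "ab")
    on.exit(close(con))
    serialize(key, con)
    serialize(-1L, con)               # negative length marks a deleted key
    invisible(TRUE)
}

scanRecords <- function(path) {
    size <- file.info(path)$size
    con <- file(path, "rb")
    on.exit(close(con))
    out <- list()
    while(seek(con) < size) {
        key <- unserialize(con)
        len <- unserialize(con)
        if(len < 0)
            out[[key]] <- NULL        # drop a deleted key
        else
            out[[key]] <- unserialize(readBin(con, "raw", n = len))
    }
    out
}

## Later records for the same key win, so an update is just another append.
dbfile <- tempfile()
appendRecord(dbfile, "a", 1:3)
appendRecord(dbfile, "a", 10)
str(scanRecords(dbfile))
unlink(dbfile)
```

Because old records are never overwritten, the file only grows; that is the reason the package provides dbReorganize() to copy the live keys into a fresh file.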
diff --git a/R/filehash-RDS.R b/R/filehash-RDS.R
new file mode 100644
index 0000000..fbb0e1d
--- /dev/null
+++ b/R/filehash-RDS.R
@@ -0,0 +1,183 @@
+######################################################################
+## Copyright (C) 2006, Roger D. Peng <rpeng at jhsph.edu>
+##
+## This program is free software; you can redistribute it and/or modify
+## it under the terms of the GNU General Public License as published by
+## the Free Software Foundation; either version 2 of the License, or
+## (at your option) any later version.
+##
+## This program is distributed in the hope that it will be useful,
+## but WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+## GNU General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with this program; if not, write to the Free Software
+## Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+## 02110-1301, USA
+#####################################################################
+
+################################################################################
+## Class 'filehashRDS'
+
+setClass("filehashRDS",
+ representation(dir = "character"),
+ contains = "filehash"
+ )
+
+setValidity("filehashRDS",
+ function(object) {
+ if(length(object@dir) != 1)
+ return("only one directory should be set in 'dir'")
+ if(!file.exists(object@dir))
+ return(gettextf("directory '%s' does not exist",
+ object@dir))
+ TRUE
+ })
+
+createRDS <- function(dbName) {
+ if(!file.exists(dbName)) {
+ status <- dir.create(dbName)
+
+ if(!status)
+ stop(gettextf("unable to create database directory '%s'",
+ dbName))
+ }
+ else
+ message(gettextf("database '%s' already exists", dbName))
+ TRUE
+}
+
+initializeRDS <- function(dbName) {
+ ## Trailing '/' causes a problem in Windows?
+ dbName <- sub("/$", "", dbName, perl = TRUE)
+ new("filehashRDS", dir = normalizePath(dbName),
+ name = basename(dbName))
+}
+
+## For case-insensitive file systems, objects with the same name but
+## differ by capitalization might get clobbered. `mangleName()'
+## inserts a "@" before each capital letter and `unMangleName()'
+## reverses the operation.
+
+mangleName <- function(oname) {
+ if(any(grep("@",oname,fixed=TRUE)))
+ stop("RDS format cannot cope with objects with @ characters",
+ " in their names")
+ gsub("([A-Z])", "@\\1", oname, perl = TRUE)
+}
+
+unMangleName <- function(mname) {
+ gsub("@", "", mname, fixed = TRUE)
+}
+
+## Function for mapping a key to a path on the filesystem
+setGeneric("objectFile", function(db, key) standardGeneric("objectFile"))
+setMethod("objectFile", signature(db = "filehashRDS", key = "character"),
+ function(db, key) {
+ file.path(db@dir, mangleName(key))
+ })
+
+################################################################################
+## Interface functions
+
+setMethod("dbInsert",
+ signature(db = "filehashRDS", key = "character", value = "ANY"),
+ function(db, key, value, safe = TRUE, ...) {
+ writefile <- if(safe)
+ tempfile()
+ else
+ objectFile(db, key)
+ con <- gzfile(writefile, "wb")
+
+ writestatus <- tryCatch({
+ serialize(value, con)
+ }, condition = function(cond) {
+ cond
+ }, finally = {
+ close(con)
+ })
+ if(inherits(writestatus, "condition"))
+ stop(gettextf("unable to write object '%s'", key))
+ if(!safe)
+ return(invisible(!inherits(writestatus, "condition")))
+
+ cpstatus <- file.copy(writefile, objectFile(db, key),
+ overwrite = TRUE)
+
+ if(!cpstatus)
+ stop(gettextf("unable to insert object '%s'", key))
+ else {
+ rmstatus <- file.remove(writefile)
+
+ if(!rmstatus)
+ warning("unable to remove temporary file")
+ }
+ invisible(cpstatus)
+ })
+
+setMethod("dbFetch", signature(db = "filehashRDS", key = "character"),
+ function(db, key, ...) {
+ ## Create filename from key
+ ofile <- objectFile(db, key)
+ ## Open connection
+ val <- tryCatch({
+ con<-gzfile(ofile)
+ # note it is necessary to split creating and opening
+ # the connection into two steps so that the connection
+ # can be closed/destroyed successfully if ofile does
+ # not exist (avoiding connection leaks).
+ open(con,"rb")
+ ## Read data
+ unserialize(con)
+ }, condition = function(cond) {
+ cond
+ }, finally = {
+ close(con)
+ })
+ if(inherits(val, "condition"))
+ stop(gettextf("unable to obtain value for key '%s'",
+ key))
+ val
+ })
+
+setMethod("dbMultiFetch",
+ signature(db = "filehashRDS", key = "character"),
+ function(db, key, ...) {
+ r <- lapply(key, function(k) dbFetch(db, k))
+ names(r) <- key
+ r
+ })
+
+setMethod("dbExists", signature(db = "filehashRDS", key = "character"),
+ function(db, key, ...) {
+ key %in% dbList(db)
+ })
+
+setMethod("dbList", "filehashRDS",
+ function(db, ...) {
+ ## list all keys/files in the database
+ fileList <- dir(db@dir, all.files = TRUE, full.names = TRUE)
+ use <- !file.info(fileList)$isdir
+ fileList <- basename(fileList[use])
+
+ unMangleName(fileList)
+ })
+
+setMethod("dbDelete", signature(db = "filehashRDS", key = "character"),
+ function(db, key, ...) {
+ ofile <- objectFile(db, key)
+
+ ## remove/delete the file
+ status <- file.remove(ofile)
+ invisible(isTRUE(all(status)))
+ })
+
+setMethod("dbUnlink", "filehashRDS",
+ function(db, ...) {
+ ## delete the entire database directory
+ d <- db@dir
+ status <- unlink(d, recursive = TRUE)
+ invisible(status)
+ })
+
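Note on the RDS key-to-filename mapping in R/filehash-RDS.R above: keys that differ only by capitalization would collide on a case-insensitive file system, so a "@" is inserted before each capital letter. The following is a small base-R sketch of the round trip; mangle and unmangle are hypothetical stand-ins that restate the logic of mangleName() and unMangleName() for illustration.

```r
## Hypothetical stand-ins restating the mangling logic shown in the diff.
mangle <- function(key) {
    ## "@" is reserved as the marker, so keys containing it are rejected.
    if(any(grepl("@", key, fixed = TRUE)))
        stop("keys containing '@' are not supported")
    gsub("([A-Z])", "@\\1", key, perl = TRUE)
}

unmangle <- function(fname) {
    gsub("@", "", fname, fixed = TRUE)
}

keys <- c("abc", "ABC", "myKey")
files <- mangle(keys)
print(files)
print(unmangle(files))
```

After mangling, "abc" and "ABC" no longer differ only by case, so they map to distinct files even where the file system folds case.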
diff --git a/R/filehash.R b/R/filehash.R
new file mode 100644
index 0000000..79ffeb0
--- /dev/null
+++ b/R/filehash.R
@@ -0,0 +1,306 @@
+######################################################################
+## Copyright (C) 2006, Roger D. Peng <rpeng at jhsph.edu>
+##
+## This program is free software; you can redistribute it and/or modify
+## it under the terms of the GNU General Public License as published by
+## the Free Software Foundation; either version 2 of the License, or
+## (at your option) any later version.
+##
+## This program is distributed in the hope that it will be useful,
+## but WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+## GNU General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with this program; if not, write to the Free Software
+## Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+## 02110-1301, USA
+#####################################################################
+
+######################################################################
+## Class 'filehash'
+
+setClass("filehash", representation(name = "character"))
+
+setValidity("filehash", function(object) {
+ if(length(object@name) == 0)
+ "database name has length 0"
+ else
+ TRUE
+})
+
+setGeneric("dbName", function(db) standardGeneric("dbName"))
+setMethod("dbName", "filehash", function(db) db@name)
+
+setMethod("show", "filehash",
+ function(object) {
+ if(length(object@name) == 0)
+ stop("database does not have a name")
+ cat(gettextf("'%s' database '%s'\n", as.character(class(object)),
+ object@name))
+ })
+
+
+######################################################################
+
+registerFormatDB <- function(name, funlist) {
+ if(!all(c("initialize", "create") %in% names(funlist)))
+ stop("need both 'initialize' and 'create' functions in 'funlist'")
+ r <- list(list(create = funlist[["create"]],
+ initialize = funlist[["initialize"]]))
+ names(r) <- name
+ do.call("filehashFormats", r)
+ TRUE
+}
+
+filehashFormats <- function(...) {
+ args <- list(...)
+ n <- names(args)
+
+ for(n in names(args))
+ assign(n, args[[n]], .filehashFormats)
+ current <- as.list(.filehashFormats)
+
+ if(length(args) == 0)
+ current
+ else
+ invisible(current)
+}
+
+######################################################################
+## Create necessary database files. On successful creation, return
+## TRUE. If the database already exists, don't do anything but return
+## TRUE (and print a message). If there's any other strange
+## condition, return FALSE.
+
+dbStartup <- function(dbName, type, action = c("initialize", "create")) {
+ action <- match.arg(action)
+ validFormat <- type %in% names(filehashFormats())
+
+ if(!validFormat)
+ stop(gettextf("'%s' not a valid database format", type))
+ formatList <- filehashFormats()[[type]]
+ doFUN <- formatList[[action]]
+
+ if(!is.function(doFUN))
+ stop(gettextf("'%s' function for database format '%s' is not valid",
+ action, type))
+ doFUN(dbName)
+}
+
+setGeneric("dbCreate", function(db, ...) standardGeneric("dbCreate"))
+
+setMethod("dbCreate", "ANY",
+ function(db, type = NULL, ...) {
+ if(is.null(type))
+ type <- filehashOption()$defaultType
+
+ dbStartup(db, type, "create")
+ })
+
+setGeneric("dbInit", function(db, ...) standardGeneric("dbInit"))
+
+setMethod("dbInit", "ANY",
+ function(db, type = NULL, ...) {
+ if(is.null(type))
+ type <- filehashOption()$defaultType
+ dbStartup(db, type, "initialize")
+ })
+
+######################################################################
+## Set options and retrieve list of options
+
+filehashOption <- function(...) {
+ args <- list(...)
+ n <- names(args)
+
+ for(n in names(args))
+ assign(n, args[[n]], .filehashOptions)
+ current <- as.list(.filehashOptions)
+
+ if(length(args) == 0)
+ current
+ else
+ invisible(current)
+}
+
+######################################################################
+## Load active bindings into an environment
+
+setGeneric("dbLoad", function(db, ...) standardGeneric("dbLoad"))
+
+setMethod("dbLoad", "filehash",
+ function(db, env = parent.frame(2), keys = NULL, ...) {
+ if(is.null(keys))
+ keys <- dbList(db)
+ else if(!is.character(keys))
+ stop("'keys' should be a character vector")
+ active <- sapply(keys, function(k) {
+ exists(k, env, inherits = FALSE)
+ })
+ if(any(active)) {
+ warning("keys with active/regular bindings ignored: ",
+ paste(sQuote(keys[active]), collapse = ", "))
+ keys <- keys[!active]
+ }
+ make.f <- function(k) {
+ key <- k
+ function(value) {
+ if(!missing(value)) {
+ dbInsert(db, key, value)
+ invisible(value)
+ }
+ else {
+ obj <- dbFetch(db, key)
+ obj
+ }
+ }
+ }
+ for(k in keys)
+ makeActiveBinding(k, make.f(k), env)
+ invisible(keys)
+ })
+
+setGeneric("dbLazyLoad", function(db, ...) standardGeneric("dbLazyLoad"))
+
+setMethod("dbLazyLoad", "filehash",
+ function(db, env = parent.frame(2), keys = NULL, ...) {
+ if(is.null(keys))
+ keys <- dbList(db)
+ else if(!is.character(keys))
+ stop("'keys' should be a character vector")
+
+ wrap <- function(x, env) {
+ key <- x
+ delayedAssign(x, dbFetch(db, key), environment(), env)
+ }
+ for(k in keys)
+ wrap(k, env)
+ invisible(keys)
+ })
+
+## Load active bindings into an environment and return the environment
+
+db2env <- function(db) {
+ if(is.character(db))
+ db <- dbInit(db) ## use the default type
+ env <- new.env(hash = TRUE)
+ dbLoad(db, env)
+ env
+}
+
+######################################################################
+## Other methods
+
+setGeneric("names")
+setMethod("names", "filehash",
+ function(x) {
+ dbList(x)
+ })
+
+setGeneric("length")
+setMethod("length", "filehash",
+ function(x) {
+ length(dbList(x))
+ })
+
+setAs("filehash", "list",
+ function(from) {
+ env <- new.env(hash = TRUE)
+ dbLoad(from, env)
+ as.list(env, all.names = TRUE)
+ })
+
+setGeneric("with")
+setMethod("with", "filehash",
+ function(data, expr, ...) {
+ env <- db2env(data)
+ eval(substitute(expr), env, enclos = parent.frame())
+ })
+
+setGeneric("lapply")
+setMethod("lapply", signature(X = "filehash"),
+ function(X, FUN, ..., keep.names = TRUE) {
+ FUN <- match.fun(FUN)
+ keys <- dbList(X)
+ rval <- vector("list", length = length(keys))
+
+ for(i in seq(along = keys)) {
+ obj <- dbFetch(X, keys[i])
+ rval[[i]] <- FUN(obj, ...)
+ }
+ if(keep.names)
+ names(rval) <- keys
+ rval
+ })
+
+######################################################################
+## Database interface
+
+setGeneric("dbMultiFetch", function(db, key, ...) {
+ standardGeneric("dbMultiFetch")
+})
+setGeneric("dbInsert", function(db, key, value, ...) {
+ standardGeneric("dbInsert")
+})
+setGeneric("dbFetch", function(db, key, ...) standardGeneric("dbFetch"))
+setGeneric("dbExists", function(db, key, ...) standardGeneric("dbExists"))
+setGeneric("dbList", function(db, ...) standardGeneric("dbList"))
+setGeneric("dbDelete", function(db, key, ...) standardGeneric("dbDelete"))
+setGeneric("dbReorganize", function(db, ...) standardGeneric("dbReorganize"))
+setGeneric("dbUnlink", function(db, ...) standardGeneric("dbUnlink"))
+
+## Other
+setOldClass(c("file", "connection"))
+setGeneric("lockFile", function(db, ...) standardGeneric("lockFile"))
+
+######################################################################
+## Extractor/replacement
+
+setMethod("[[", signature(x = "filehash", i = "character", j = "missing"),
+ function(x, i, j) {
+ dbFetch(x, i)
+ })
+
+setMethod("$", signature(x = "filehash"),
+ function(x, name) {
+ dbFetch(x, name)
+ })
+
+setReplaceMethod("[[", signature(x = "filehash", i = "character", j = "missing"),
+ function(x, i, j, value) {
+ dbInsert(x, i, value)
+ x
+ })
+
+setReplaceMethod("$", signature(x = "filehash"),
+ function(x, name, value) {
+ dbInsert(x, name, value)
+ x
+ })
+
+
+## Need to define these because they're not automatically caught.
+## Don't need this if R >= 2.4.0.
+
+setReplaceMethod("[[", signature(x = "filehash", i = "numeric", j = "missing"),
+ function(x, i, j, value) {
+ stop("numeric indices not allowed")
+ })
+
+setMethod("[[", signature(x = "filehash", i = "numeric", j = "missing"),
+ function(x, i, j) {
+ stop("numeric indices not allowed")
+ })
+
+setMethod("[", signature(x = "filehash", i = "character", j = "missing",
+ drop = "missing"),
+ function(x, i , j, drop) {
+ dbMultiFetch(x, i)
+ })
+
+
+
+
+
+
diff --git a/R/hash.R b/R/hash.R
new file mode 100644
index 0000000..7fb8ab2
--- /dev/null
+++ b/R/hash.R
@@ -0,0 +1,10 @@
+sha1 <- function(object, skip = 14L) {
+ ## Setting 'skip = 14' gives us the same results as
+ ## 'digest(object, "sha1")'
+ bytes <- serialize(object, NULL)
+ .Call("sha1_object", bytes, skip)
+}
+
+sha1_file <- function(filename, skip = 0L) {
+ .Call("sha1_file", filename, skip)
+}
diff --git a/R/queue.R b/R/queue.R
new file mode 100644
index 0000000..25ce47f
--- /dev/null
+++ b/R/queue.R
@@ -0,0 +1,91 @@
+setClass("queue",
+ representation(queue = "filehashDB1",
+ name = "character")
+ )
+
+setMethod("show", "queue",
+ function(object) {
+ cat(gettextf("<queue: %s>\n", object@name))
+ invisible(object)
+ })
+
+createQ <- function(filename) {
+ dbCreate(filename, "DB1")
+ queue <- dbInit(filename, "DB1")
+ dbInsert(queue, "head", NULL)
+ dbInsert(queue, "tail", NULL)
+
+ new("queue", queue = queue, name = filename)
+}
+
+initQ <- function(filename) {
+ new("queue",
+ queue = dbInit(filename, "DB1"),
+ name = filename)
+}
+
+## Public
+setGeneric("pop", function(db, ...) standardGeneric("pop"))
+setGeneric("push", function(db, val, ...) standardGeneric("push"))
+setGeneric("isEmpty", function(db, ...) standardGeneric("isEmpty"))
+setGeneric("top", function(db, ...) standardGeneric("top"))
+
+
+################################################################################
+## Methods
+
+setMethod("lockFile", "queue",
+ function(db, ...) {
+ paste(db@name, "qlock", sep = ".")
+ })
+
+setMethod("push", c("queue", "ANY"), function(db, val, ...) {
+ ## Create a new tail node
+ node <- list(value = val,
+ nextkey = NULL)
+ key <- sha1(node)
+
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ if(isEmpty(db))
+ dbInsert(db@queue, "head", key)
+ else {
+ ## Convert tail node to regular node
+ tailkey <- dbFetch(db@queue, "tail")
+ oldtail <- dbFetch(db@queue, tailkey)
+ oldtail$nextkey <- key
+ dbInsert(db@queue, tailkey, oldtail)
+ }
+ ## Insert new node and point tail to new node
+ dbInsert(db@queue, key, node)
+ dbInsert(db@queue, "tail", key)
+})
+
+setMethod("isEmpty", "queue", function(db) {
+ is.null(dbFetch(db@queue, "head"))
+})
+
+setMethod("top", "queue", function(db, ...) {
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ if(isEmpty(db))
+ stop("queue is empty")
+ h <- dbFetch(db@queue, "head")
+ node <- dbFetch(db@queue, h)
+ node$value
+})
+
+setMethod("pop", "queue", function(db, ...) {
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ if(isEmpty(db))
+ stop("queue is empty")
+ h <- dbFetch(db@queue, "head")
+ node <- dbFetch(db@queue, h)
+ dbInsert(db@queue, "head", node$nextkey)
+ dbDelete(db@queue, h)
+ node$value
+})
diff --git a/R/stack.R b/R/stack.R
new file mode 100644
index 0000000..36d4e4a
--- /dev/null
+++ b/R/stack.R
@@ -0,0 +1,91 @@
+setClass("stack",
+ representation(stack = "filehashDB1",
+ name = "character"))
+
+setMethod("show", "stack",
+ function(object) {
+ cat(gettextf("<stack: %s>\n", object@name))
+ invisible(object)
+ })
+
+createS <- function(filename) {
+ dbCreate(filename, "DB1")
+ stack <- dbInit(filename, "DB1")
+ dbInsert(stack, "top", NULL)
+
+ new("stack", stack = stack, name = filename)
+}
+
+initS <- function(filename) {
+ new("stack",
+ stack = dbInit(filename, "DB1"),
+ name = filename)
+}
+
+setMethod("lockFile", "stack",
+ function(db, ...) {
+ paste(db@name, "slock", sep = ".")
+ })
+
+setMethod("push", c("stack", "ANY"), function(db, val, ...) {
+ node <- list(value = val,
+ nextkey = dbFetch(db@stack, "top"))
+ topkey <- sha1(node)
+
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ dbInsert(db@stack, topkey, node)
+ dbInsert(db@stack, "top", topkey)
+})
+
+setGeneric("mpush", function(db, vals, ...) standardGeneric("mpush"))
+
+setMethod("mpush", c("stack", "ANY"), function(db, vals, ...) {
+ if(!is.list(vals))
+ vals <- as.list(vals)
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ topkey <- dbFetch(db@stack, "top")
+
+ for(i in seq_along(vals)) {
+ node <- list(value = vals[[i]],
+ nextkey = topkey)
+ topkey <- sha1(node)
+
+ dbInsert(db@stack, topkey, node)
+ dbInsert(db@stack, "top", topkey)
+ }
+})
+
+setMethod("isEmpty", "stack", function(db, ...) {
+ h <- dbFetch(db@stack, "top")
+ is.null(h)
+})
+
+
+setMethod("top", "stack", function(db, ...) {
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ if(isEmpty(db))
+ stop("stack is empty")
+ h <- dbFetch(db@stack, "top")
+ node <- dbFetch(db@stack, h)
+ node$value
+})
+
+setMethod("pop", "stack", function(db, ...) {
+ createLockFile(lockFile(db))
+ on.exit(deleteLockFile(lockFile(db)))
+
+ if(isEmpty(db))
+ stop("stack is empty")
+ h <- dbFetch(db@stack, "top")
+ node <- dbFetch(db@stack, h)
+
+ dbInsert(db@stack, "top", node$nextkey)
+ dbDelete(db@stack, h)
+ node$value
+})
diff --git a/R/zzz.R b/R/zzz.R
new file mode 100644
index 0000000..2ed8cd1
--- /dev/null
+++ b/R/zzz.R
@@ -0,0 +1,22 @@
+.onLoad <- function(lib, pkg) {
+ assign("defaultType", "DB1", .filehashOptions)
+
+ for(type in c("DB1", "RDS")) {
+ cname <- paste("create", type, sep = "")
+ iname <- paste("initialize", type, sep = "")
+ r <- list(create = get(cname, mode = "function"),
+ initialize = get(iname, mode="function"))
+ assign(type, r, .filehashFormats)
+ }
+}
+
+.onAttach <- function(lib, pkg) {
+ dcf <- read.dcf(file.path(lib, pkg, "DESCRIPTION"))
+ msg <- gettextf("%s: %s (%s %s)", dcf[, "Package"], dcf[, "Title"],
+ as.character(dcf[, "Version"]), dcf[, "Date"])
+ packageStartupMessage(paste(strwrap(msg), collapse = "\n"))
+}
+
+.filehashOptions <- new.env()
+
+.filehashFormats <- new.env()
diff --git a/build/vignette.rds b/build/vignette.rds
new file mode 100644
index 0000000..0f52621
Binary files /dev/null and b/build/vignette.rds differ
diff --git a/debian/README.test b/debian/README.test
deleted file mode 100644
index 3d2b347..0000000
--- a/debian/README.test
+++ /dev/null
@@ -1,10 +0,0 @@
-Notes on how this package can be tested.
-────────────────────────────────────────
-
-To run the unit tests provided by the package you can do
-
- sh run-unit-test
-
-in this directory.
-
-
diff --git a/debian/changelog b/debian/changelog
deleted file mode 100644
index f253012..0000000
--- a/debian/changelog
+++ /dev/null
@@ -1,5 +0,0 @@
-r-cran-filehash (2.3-1) unstable; urgency=low
-
- * Initial release (closes: #837344)
-
- -- Andreas Tille <tille at debian.org> Sat, 10 Sep 2016 21:40:05 +0200
diff --git a/debian/compat b/debian/compat
deleted file mode 100644
index ec63514..0000000
--- a/debian/compat
+++ /dev/null
@@ -1 +0,0 @@
-9
diff --git a/debian/control b/debian/control
deleted file mode 100644
index ada3a10..0000000
--- a/debian/control
+++ /dev/null
@@ -1,29 +0,0 @@
-Source: r-cran-filehash
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Andreas Tille <tille at debian.org>
-Section: gnu-r
-Priority: optional
-Build-Depends: debhelper (>= 9),
- cdbs,
- r-base-dev
-Standards-Version: 3.9.8
-Vcs-Browser: https://anonscm.debian.org/viewvc/debian-med/trunk/packages/R/r-cran-filehash/trunk/
-Vcs-Svn: svn://anonscm.debian.org/debian-med/trunk/packages/R/r-cran-filehash/trunk/
-Homepage: http://cran.r-project.org/web/packages/filehash
-
-Package: r-cran-filehash
-Architecture: any
-Depends: ${R:Depends},
- ${misc:Depends},
- ${shlibs:Depends}
-Description: GNU R simple key-value database
- This GNU R package implements a simple key-value style database where
- character string keys are associated with data values that are stored on
- the disk. A simple interface is provided for inserting, retrieving, and
- deleting data from the database. Utilities are provided that allow
- 'filehash' databases to be treated much like environments and lists are
- already used in R. These utilities are provided to encourage interactive
- and exploratory analysis on large datasets. Three different file formats
- for representing the database are currently available and new formats
- can easily be incorporated by third parties for use in the 'filehash'
- framework.
diff --git a/debian/copyright b/debian/copyright
deleted file mode 100644
index 0c0aa2e..0000000
--- a/debian/copyright
+++ /dev/null
@@ -1,29 +0,0 @@
-Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
-Upstream-Contact: Roger D. Peng <rdpeng at jhu.edu>
-Source: http://cran.r-project.org/web/packages/filehash
-
-Files: *
-Copyright: 2009-2016 Roger D. Peng <rdpeng at jhu.edu>
-License: GPL-2+
-
-Files: debian/*
-Copyright: 2016 Andreas Tille <tille at debian.org>
-License: GPL-2+
-
-License: GPL-2+
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2 of the License, or
- (at your option) any later version.
- .
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- .
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- .
- On Debian systems, the complete text of the GNU General Public
- License can be found in `/usr/share/common-licenses/GPL-2'.
diff --git a/debian/docs b/debian/docs
deleted file mode 100644
index 960011c..0000000
--- a/debian/docs
+++ /dev/null
@@ -1,3 +0,0 @@
-tests
-debian/README.test
-debian/tests/run-unit-test
diff --git a/debian/rules b/debian/rules
deleted file mode 100755
index e2bc9e7..0000000
--- a/debian/rules
+++ /dev/null
@@ -1,15 +0,0 @@
-#!/usr/bin/make -f
-
-include /usr/share/R/debian/r-cran.mk
-
-install/$(package)::
- rm -rf debian/$(package)/usr/lib/R/site-library/$(cranName)/LICENSE
-
-# if I would only know how to hook in after dh_installdocs - forget this magic
-# cdbs thingy and remove the file rather in the test sccript ...
- # Delete tests depending from devtools since this is not (yet) packaged
-# cd debian/$(package)/usr/share/doc/$(package)/tests/ ; \
-# if grep -qR devtools * ; then \
-# rm -f `grep -lR devtools *` ; \
-# fi
-
diff --git a/debian/source/format b/debian/source/format
deleted file mode 100644
index 163aaf8..0000000
--- a/debian/source/format
+++ /dev/null
@@ -1 +0,0 @@
-3.0 (quilt)
diff --git a/debian/tests/control b/debian/tests/control
deleted file mode 100644
index d2aa55a..0000000
--- a/debian/tests/control
+++ /dev/null
@@ -1,3 +0,0 @@
-Tests: run-unit-test
-Depends: @
-Restrictions: allow-stderr
diff --git a/debian/tests/run-unit-test b/debian/tests/run-unit-test
deleted file mode 100644
index d13482a..0000000
--- a/debian/tests/run-unit-test
+++ /dev/null
@@ -1,37 +0,0 @@
-#!/bin/sh -e
-
-pkg=r-cran-filehash
-
-# The saved result files do contain some differences in metadata and we also
-# need to ignore version differences of R
-filter() {
- grep -v -e '^R version' \
- -e '^Copyright (C)' \
- -e '^R : Copyright 20' \
- -e '^Version 2.0' \
- -e '^Platform:' \
- -e '^Spam version .* is loaded.' \
- -e '^ISBN 3-900051-07-0' \
- $1 | \
- sed -e '/^> *proc\.time()$/,$d'
-}
-
-if [ "$ADTTMP" = "" ] ; then
- ADTTMP=`mktemp -d /tmp/${pkg}-test.XXXXXX`
- trap "rm -rf $ADTTMP" 0 INT QUIT ABRT PIPE TERM
-fi
-cd $ADTTMP
-cp -a /usr/share/doc/${pkg}/tests/* $ADTTMP
-find . -name "*.gz" -exec gunzip \{\} \;
-for htest in `ls *.R | sed 's/\.R$//'` ; do
- LC_ALL=C R --no-save < ${htest}.R 2>&1 | tee > ${htest}.Rout
- filter ${htest}.Rout.save > ${htest}.Rout.save_
- filter ${htest}.Rout > ${htest}.Rout_
- diff -u --ignore-all-space ${htest}.Rout.save_ ${htest}.Rout_
- if [ ! $? ] ; then
- echo "Test ${htest} failed"
- exit 1
- else
- echo "Test ${htest} passed"
- fi
-done
diff --git a/debian/watch b/debian/watch
deleted file mode 100644
index 4f57bbb..0000000
--- a/debian/watch
+++ /dev/null
@@ -1,2 +0,0 @@
-version=3
-http://cran.r-project.org/src/contrib/filehash_([-\d.]*)\.tar\.gz
diff --git a/inst/CITATION b/inst/CITATION
new file mode 100644
index 0000000..a38a6f5
--- /dev/null
+++ b/inst/CITATION
@@ -0,0 +1,14 @@
+citHeader("The reference for the 'filehash' package is:")
+
+citEntry(entry = "article",
+ title = "Interacting with data using the filehash package",
+ author = personList(person("Roger", "Peng", "D.")),
+ journal = "R News",
+ year = "2006",
+ volume = "6",
+ number = "4",
+ pages = "19--24",
+ url = "http://CRAN.R-project.org/doc/Rnews/",
+ textVersion = paste("Peng RD (2006).", dQuote("Interacting with data using the filehash package,"), "R News, 6 (4), 19--24.")
+ )
+
diff --git a/inst/COPYING b/inst/COPYING
new file mode 100644
index 0000000..ffea8cd
--- /dev/null
+++ b/inst/COPYING
@@ -0,0 +1,19 @@
+License
+=======
+
+`filehash' is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 2 of the License, or (at your
+option) any later version.
+
+This program is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+02110-1301, USA
+
+
diff --git a/inst/NEWS b/inst/NEWS
new file mode 100644
index 0000000..be2e358
--- /dev/null
+++ b/inst/NEWS
@@ -0,0 +1,90 @@
+Check the 'filehash' git repository for the latest updates on the
+package at http://repo.or.cz/w/filehash.git
+
+Version 1.0
+-----------
+
+* The 'DB' format has been removed; users should use 'DB1' instead
+
+* Internals of 'DB1' format have changed so that it should be a bit
+more reliable but perhaps a little slower
+
+* The 'dbDisconnect' generic has been removed since it is no longer
+necessary for the 'DB1' format (as it was before). It was never
+needed for the 'RDS' format and one never existed for that format.
+
+
+Version 0.9
+-----------
+
+* For 'filehashRDS' class, the 'dbDir' slot has been renamed to 'dir'.
+
+* An attempt has been made to normalize the error handling to make it
+consistent.
+
+* The various 'dump' functions have been given a 'type' argument
+
+
+Version 0.8
+-----------
+
+* Added function dbLazyLoad for lazy loading filehash databases.
+
+* dbCreate and dbInit are now generics with a method for character
+vectors. The behavior should be the same as before, by default.
+
+* dbLoad is generic.
+
+* The second argument to dbMultiFetch is 'key', not 'keys'.
+
+* dbInitialize is deprecated
+
+* 'DB1' and 'RDS' formats use normalizePath() for resolving paths to
+directories
+
+* There is a vignette now [via vignette("filehash")]
+
+
+Version 0.6-3
+-------------
+
+* Added methods for "[[", "$", "[[<-", and "$<-" for filehash
+objects. Only character indices are allowed
+
+* filehash-DB functions use the new serialize() from R 2.4.0 so that
+numeric data will not suffer from rounding error due to previous use
+of serialize(ascii = TRUE).
+
+* New format filehash-DB1 which stores the key index/map and data in a
+single file.
+
+* New "filehash" method for lapply so that functions can be applied to
+database entries.
+
+
+Version 0.4-1
+-------------
+
+* Patch release, changed some internals for the "DB" type databases
+
+* Added test database for regression testing in future releases
+
+
+Version 0.4
+-----------
+
+* Added name mangling scheme to prevent clobbering on case-insensitive
+OSes like Windows (thanks to Bill Venables and David Brahm)
+
+* Added dumpImage, dumpObjects, dumpDF functions for dumping various
+things to filehash databases
+
+* Added filehashOption() function for setting global options; right
+now only the default database type can be set
+
+* dbLoad and db2env are regular functions now rather than
+generics/methods. dbLoad's default 'env' is the parent frame now
+
+* Added a "filehash" method for 'with'
+
+* Added new generic dbUnlink which deletes a database from the disk
diff --git a/inst/doc/filehash.R b/inst/doc/filehash.R
new file mode 100644
index 0000000..5e3ec0f
--- /dev/null
+++ b/inst/doc/filehash.R
@@ -0,0 +1,145 @@
+### R code from vignette source 'filehash.Rnw'
+
+###################################################
+### code chunk number 1: options
+###################################################
+options(width=60)
+
+
+###################################################
+### code chunk number 2: exampleGlobalEnv
+###################################################
+x <- 1
+print(x)
+
+
+###################################################
+### code chunk number 3: create
+###################################################
+library(filehash)
+dbCreate("mydb")
+db <- dbInit("mydb")
+
+
+###################################################
+### code chunk number 4: setseed1
+###################################################
+set.seed(100)
+
+
+###################################################
+### code chunk number 5: insert
+###################################################
+dbInsert(db, "a", rnorm(100))
+
+
+###################################################
+### code chunk number 6: fetch
+###################################################
+value <- dbFetch(db, "a")
+mean(value)
+
+
+###################################################
+### code chunk number 7: delete
+###################################################
+dbInsert(db, "b", 123)
+dbDelete(db, "a")
+dbList(db)
+dbExists(db, "a")
+
+
+###################################################
+### code chunk number 8: accessors
+###################################################
+db$a <- rnorm(100, 1)
+mean(db$a)
+mean(db[["a"]])
+db$b <- rnorm(100, 2)
+dbList(db)
+
+
+###################################################
+### code chunk number 9: characteronly
+###################################################
+e <- local({
+ err <- function(e) e
+ tryCatch(db[[1]], error = err)
+})
+conditionMessage(e)
+
+
+###################################################
+### code chunk number 10: with
+###################################################
+with(db, c(a = mean(a), b = mean(b)))
+
+
+###################################################
+### code chunk number 11: sapply
+###################################################
+sapply(db[c("a", "b")], mean)
+
+
+###################################################
+### code chunk number 12: lapply
+###################################################
+unlist(lapply(db, mean))
+
+
+###################################################
+### code chunk number 13: cleanupMyDB
+###################################################
+dbUnlink(db)
+rm(list = ls(all = TRUE))
+
+
+###################################################
+### code chunk number 14: setseed2
+###################################################
+set.seed(200)
+
+
+###################################################
+### code chunk number 15: testDB
+###################################################
+dbCreate("testDB")
+db <- dbInit("testDB")
+db$x <- rnorm(100)
+db$y <- runif(100)
+db$a <- letters
+dbLoad(db)
+ls()
+
+
+###################################################
+### code chunk number 16: accessbinding
+###################################################
+mean(y)
+sort(a)
+
+
+###################################################
+### code chunk number 17: assignvalue
+###################################################
+y <- rnorm(100, 2)
+mean(y)
+
+
+###################################################
+### code chunk number 18: removeandload
+###################################################
+rm(list = ls())
+db <- dbInit("testDB")
+dbLoad(db)
+ls()
+mean(y)
+
+
+###################################################
+### code chunk number 19: cleanupTestDB
+###################################################
+dbUnlink(db)
+rm(list = ls(all = TRUE))
+
+
diff --git a/inst/doc/filehash.Rnw b/inst/doc/filehash.Rnw
new file mode 100644
index 0000000..c82aead
--- /dev/null
+++ b/inst/doc/filehash.Rnw
@@ -0,0 +1,443 @@
+\documentclass{article}
+
+%%\VignetteIndexEntry{The filehash Package}
+%%\VignetteDepends{filehash}
+
+\usepackage{charter}
+\usepackage{courier}
+\usepackage[noae]{Sweave}
+\usepackage[margin=1in]{geometry}
+\usepackage{natbib}
+
+\title{Interacting with Data using the \textbf{filehash} Package for
+R}
+
+\author{Roger D. Peng $<$rpeng at jhsph.edu$>$\\\textit{Department of
+Biostatistics}\\\textit{Johns Hopkins Bloomberg School of Public Health}}
+
+\date{}
+
+\newcommand{\pkg}{\textbf}
+\newcommand{\code}{\texttt}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+The \pkg{filehash} package for R implements a simple key-value style
+database where character string keys are associated with data values
+that are stored on the disk. A simple interface is provided for
+inserting, retrieving, and deleting data from the database. Utilities
+are provided that allow \pkg{filehash} databases to be used in much
+the same way that environments and lists are already used in R. These utilities
+are provided to encourage interactive and exploratory analysis on
+large datasets. Three different file formats for representing the
+database are currently available and new formats can easily be
+incorporated by third parties for use in the \pkg{filehash} framework.
+\end{abstract}
+
+<<options,results=hide,echo=false>>=
+options(width=60)
+@
+
+\section{Overview and Motivation}
+
+Working with large datasets in R can be cumbersome because of the need
+to keep objects in physical memory. While many might generally see
+that as a feature of the system, the need to keep whole objects in
+memory creates challenges to those who might want to work
+interactively with large datasets. Here we take a simple definition
+of ``large dataset'' to be any dataset that cannot be loaded into R as
+a single R object because of memory limitations. For example, a very
+large data frame might be too large for all of the columns and rows to
+be loaded at once. In such a situation, one might load only a subset
+of the rows or columns, if that is possible.
+
+In a key-value database, an arbitrary data object (a ``value'') has a
+``key'' associated with it, usually a character string. When one
+requests the value associated with a particular key, it is the
+database's job to match up the key with the correct value and return
+the value to the requester.
+
+The most straightforward example of a key-value database in R is the
+global environment. Every object in R has a name and a value
+associated with it. When you execute at the R prompt
+<<exampleGlobalEnv,results=hide>>=
+x <- 1
+print(x)
+@
+the first line assigns the value 1 to the name/key ``x''. The second
+line requests the value of ``x'' and prints out 1 to the console. R
+handles the task of finding the appropriate value for ``x'' by
+searching through a series of environments, including the namespaces
+of the packages on the search list.
+
+In most cases, R stores the values associated with keys in memory, so
+that the value of \code{x} in the example above was stored in and
+retrieved from physical memory. However, the idea of a key-value
+database can be generalized beyond this particular configuration. For
+example, as of R 2.0.0, much of the R code for R packages is stored in
+a lazy-loaded database, where the values are initially stored on disk
+and loaded into memory on first access~\citep{Rnews:Ripley:2004}.
+Hence, when R starts up, it uses relatively little memory, while the
+memory usage increases as more objects are requested. Data could also
+be stored on other computers (e.g. websites) and retrieved over the
+network.
+
+The general S language concept of a database is described in Chapter 5
+of the Green Book~\citep{cham:1998} and earlier in~\cite{cham:1991}.
+Although the S and R languages have different semantics with respect
+to how variable names are looked up and bound to values, the general
+concept of using a key-value database applies to both languages.
+Duncan Temple Lang has implemented this general database framework for
+R in the \pkg{RObjectTables} package of
+Omegahat~\citep{TempleLang:2002}. The \pkg{RObjectTables} package
+provides an interface for connecting R with arbitrary backend systems,
+allowing data values to be stored in potentially any format or
+location. While the package itself does not include a specific
+implementation, some examples are provided on the package's website.
+
+The \pkg{filehash} package provides a full read-write implementation
+of a key-value database for R. The package does not depend on any
+external packages (beyond those provided in a standard R installation)
+or software systems and is written entirely in R, making it readily
+usable on most platforms. The \pkg{filehash} package can be thought
+of as a specific implementation of the database concept described
+in~\cite{cham:1991}, taking a slightly different approach to the
+problem. Both~\cite{TempleLang:2002} and~\cite{cham:1991} focus on
+generalizing the notion of ``attach()-ing'' a database in an R/S
+session so that variable names can be looked up automatically via the
+search list. The \pkg{filehash} package represents a database as an
+instance of an S4 class and operates directly on the S4 object via
+various methods.
+
+Key-value databases are sometimes called hash tables and indeed, the
+name of the package comes from the idea of having a ``file-based hash
+table''. With \pkg{filehash} the values are stored in a file on the
+disk rather than in memory. When a user requests the values
+associated with a key, \pkg{filehash} finds the object on the disk,
+loads the value into R and returns it to the user. The package offers
+two formats for storing data on the disk: The values can be stored (1)
+concatenated together in a single file or (2) separately as a
+directory of files.
+
+
+
+
+\section{Related R packages}
+
+There are other packages on CRAN designed specifically to help users
+work with large datasets. Two packages that come immediately to mind
+are the \pkg{g.data} package by David Brahm~\citep{brahm:2002} and the
+\pkg{biglm} package by Thomas Lumley. The \pkg{g.data} package takes
+advantage of the lazy evaluation mechanism in R via the
+\code{delayedAssign} function. Briefly, objects are loaded into R as
+promises to load the actual data associated with an object name. The
+first time an object is requested, the promise is evaluated and the
+data are loaded. From then on, the data reside in memory. The
+mechanism used in \pkg{g.data} is similar to the one used by the
+lazy-loaded databases described in~\cite{Rnews:Ripley:2004}. The
+\pkg{biglm} package allows users to fit linear models on datasets that
+are too large to fit in memory. However, the \pkg{biglm} package does
+not provide methods for dealing with large datasets in general. The
+\pkg{filehash} package also draws inspiration from Luke Tierney's
+experimental \pkg{gdbm} package which implements a key-value database
+via the GNU dbm (GDBM) library. The use of GDBM creates an external
+dependence since the GDBM C library has to be compiled on each system.
+In addition, I encountered a problem where databases created on 32-bit
+machines could not be transferred to and read on 64-bit machines (and
+vice versa). However, with the increasing use of 64-bit machines in
+the future, it seems this problem will eventually go away.
+
+The R Special Interest Group on Databases has developed a number of
+packages that provide an R interface to commonly used relational
+database management systems (RDBMS) such as MySQL (\pkg{RMySQL}),
+PostgreSQL (\pkg{RPgSQL}), and Oracle (\pkg{ROracle}). These packages
+use the S4 classes and generics defined in the \pkg{DBI} package and
+have the advantage that they offer much better database functionality,
+inherited via the use of a true database management system. However,
+this benefit comes with the cost of having to install and use
+third-party software. While installing an RDBMS may not be an
+issue---many systems have them pre-installed and the \pkg{RSQLite}
+package comes bundled with the source for the RDBMS---the need for the
+RDBMS and knowledge of structured query language (SQL) nevertheless
+adds some overhead. This overhead may serve as an impediment for
+users in need of a database for simpler applications.
+
+
+
+\section{Creating a filehash database}
+
+Databases can be created with \pkg{filehash} using the \code{dbCreate}
+function. The one required argument is the name of the database,
+which we call here ``mydb''.
+<<create>>=
+library(filehash)
+dbCreate("mydb")
+db <- dbInit("mydb")
+@
+You can also specify the \code{type} argument which controls how the
+database is represented on the backend. We will discuss the different
+backends in further detail later. For now, we use the default backend
+which is called ``DB1''.
+
+Once the database is created, it must be initialized in order to be
+accessed. The \code{dbInit} function returns an S4 object inheriting
+from class ``filehash''. Since this is a newly created database,
+there are no objects in it.
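+
+As a brief sketch of the \code{type} argument, the following creates
+a database with the ``RDS'' backend instead of the default. The
+temporary path and the object name \code{dbRDS} are assumptions made
+for illustration; the type is passed to \code{dbInit} as well so that
+the database is opened with the same backend that created it.

```r
library(filehash)

# Temporary location so the sketch leaves nothing in the working directory
path <- file.path(tempdir(), "mydbRDS")

dbCreate(path, type = "RDS")          # one file per key, kept in a directory
dbRDS <- dbInit(path, type = "RDS")

nKeys <- length(dbList(dbRDS))        # a new database holds no keys
dbUnlink(dbRDS)                       # remove the database from disk
```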
+
+\section{Accessing a filehash database}
+
+<<setseed1,results=hide,echo=false>>=
+set.seed(100)
+@
+
+The primary interface to filehash databases consists of the functions
+\code{dbFetch}, \code{dbInsert}, \code{dbExists}, \code{dbList}, and
+\code{dbDelete}. These functions are all generic---specific methods
+exist for each type of database backend. They all take as their
+first argument an object of class ``filehash''. To insert some data
+into the database we can simply call \code{dbInsert}
+<<insert>>=
+dbInsert(db, "a", rnorm(100))
+@
+Here we have associated with the key ``a'' 100 standard normal random
+variates. We can retrieve those values with \code{dbFetch}.
+<<fetch>>=
+value <- dbFetch(db, "a")
+mean(value)
+@
+
+The function \code{dbList} lists all of the keys that are available in
+the database, \code{dbExists} tests to see if a given key is in the
+database, and \code{dbDelete} deletes a key-value pair from the
+database
+<<delete>>=
+dbInsert(db, "b", 123)
+dbDelete(db, "a")
+dbList(db)
+dbExists(db, "a")
+@
+
+While using functions like \code{dbInsert} and \code{dbFetch} is
+straightforward, it can often be easier on the fingers to use standard
+R subset and accessor functions like \code{\$}, \code{[[}, and
+\code{[}. Filehash databases have methods for these functions so that
+objects can be accessed in a more compact manner. Similarly,
+replacement methods for these functions are also available. The
+\verb+[+ function can be used to access multiple objects from the
+database, in which case a list is returned.
+
+<<accessors>>=
+db$a <- rnorm(100, 1)
+mean(db$a)
+mean(db[["a"]])
+db$b <- rnorm(100, 2)
+dbList(db)
+@
+For all of the accessor functions, only character indices are allowed.
+Numeric indices are caught and an error is given.
+<<characteronly>>=
+e <- local({
+ err <- function(e) e
+ tryCatch(db[[1]], error = err)
+})
+conditionMessage(e)
+@
+Finally, there is a method for the \code{with} generic function which
+operates much like using \code{with} on lists or environments.
+
+The following three statements all return the same value.
+<<with>>=
+with(db, c(a = mean(a), b = mean(b)))
+@
+When using \code{with}, the values of ``a'' and ``b'' are looked up in
+the database.
+<<sapply>>=
+sapply(db[c("a", "b")], mean)
+@
+Here, using \code{[} on \code{db} returns a list with the values
+associated with ``a'' and ``b''. Then \code{sapply} is applied in the
+usual way on the returned list.
+<<lapply>>=
+unlist(lapply(db, mean))
+@
+In the last statement we call \code{lapply} directly on the
+``filehash'' object. The \pkg{filehash} package defines a method for
+\code{lapply} that allows the user to apply a function on all the
+elements of a database directly. The method essentially loops through
+all the keys in the database, loads each object separately and applies
+the supplied function to each object. \code{lapply} returns a named
+list with each element being the result of applying the supplied
+function to an object in the database. There is an argument
+\code{keep.names} to the \code{lapply} method which, if set to
+\code{FALSE}, will drop all the names from the list.
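+
+The loop performed by the \code{lapply} method can be written out by
+hand, as in the sketch below. It only mirrors the description above
+and is not the package's own source; the database path and the names
+\code{db2}, \code{manual} and \code{builtin} are chosen for
+illustration.

```r
library(filehash)

path <- tempfile("lapplydb")
dbCreate(path)
db2 <- dbInit(path)
db2$a <- 1:10
db2$b <- c(2, 4, 6)

# Visit one key at a time, so only one object has to be in memory
keys <- dbList(db2)
manual <- lapply(keys, function(k) mean(dbFetch(db2, k)))
names(manual) <- keys

# The method supplied by the package, for comparison
builtin <- lapply(db2, mean)
unnamed <- lapply(db2, mean, keep.names = FALSE)

dbUnlink(db2)
```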
+
+<<cleanupMyDB,results=hide,echo=false>>=
+dbUnlink(db)
+rm(list = ls(all = TRUE))
+@
+
+\section{Loading filehash databases}
+
+<<setseed2,results=hide,echo=false>>=
+set.seed(200)
+@
+
+An alternative way of working with a filehash database is to load it
+into an environment and access the element names directly, without
+having to use any of the accessor functions. The \pkg{filehash}
+function \code{dbLoad} works much like the standard R \code{load}
+function except that \code{dbLoad} loads active bindings into a given
+environment rather than the actual data. The active bindings are
+created via the \code{makeActiveBinding} function in the \pkg{base}
+package. \code{dbLoad} takes a filehash database and creates symbols
+in an environment corresponding to the keys in the database. It then
+calls \code{makeActiveBinding} to associate with each key a function
+which loads the data associated with a given key. Conceptually,
+active bindings are like pointers to the database. After calling
+\code{dbLoad}, anytime an object with an active binding is accessed
+the associated function (installed by \code{makeActiveBinding}) loads
+the data from the database.
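+
+The following base R sketch shows the idea behind an active binding
+without using \pkg{filehash} at all. A plain environment called
+\code{store} stands in for the on-disk database, and \code{nReads}
+counts how often the value is fetched; both names are assumptions
+made for this illustration.

```r
# A plain environment stands in for the on-disk database
store <- new.env()
assign("y", c(1, 2, 3), envir = store)
nReads <- 0

target <- new.env()
makeActiveBinding("y", function(value) {
    if (missing(value)) {
        nReads <<- nReads + 1               # every access goes back to the store
        get("y", envir = store)
    } else {
        assign("y", value, envir = store)   # assignment updates the store
    }
}, target)

m1 <- mean(target$y)
m2 <- mean(target$y)
target$y <- c(10, 20)
```

+Each read of \code{target\$y} calls the function again, and the
+assignment is passed through to \code{store} rather than replacing
+the binding.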
+
+We can create a simple database to demonstrate the active binding
+mechanism.
+<<testDB>>=
+dbCreate("testDB")
+db <- dbInit("testDB")
+db$x <- rnorm(100)
+db$y <- runif(100)
+db$a <- letters
+dbLoad(db)
+ls()
+@
+Notice that we appear to have some additional objects in our
+workspace. However, the values of these objects are not stored in
+memory---they are stored in the database. When one of the objects is
+accessed, the value is automatically loaded from the database.
+<<accessbinding>>=
+mean(y)
+sort(a)
+@
+If I assign a different value to one of these objects, its
+associated value is updated in the database via the active binding
+mechanism.
+<<assignvalue>>=
+y <- rnorm(100, 2)
+mean(y)
+@
+If I subsequently remove the database object from the workspace and
+reload it later, the updated value for ``y'' persists.
+<<removeandload>>=
+rm(list = ls())
+db <- dbInit("testDB")
+dbLoad(db)
+ls()
+mean(y)
+@
+
+Perhaps one disadvantage of the active binding approach taken here is
+that whenever an object is accessed, the data must be reloaded into R.
+This behavior is distinctly different from the delayed assignment
+approach taken in \pkg{g.data} where an object must only be loaded
+once and then is subsequently in memory. However, when using delayed
+assignments, if one cycles through all of the objects in the database,
+one could eventually exhaust the available memory.
+
+<<cleanupTestDB,results=hide,echo=false>>=
+dbUnlink(db)
+rm(list = ls(all = TRUE))
+@
+
+\section{Other filehash utilities}
+
+There are a few other utilities included with the \pkg{filehash}
+package. Two of the utilities, \code{dumpObjects} and
+\code{dumpImage}, are analogues of \code{save} and \code{save.image}.
+Rather than save objects to an R workspace, \code{dumpObjects} saves
+the given objects to a ``filehash'' database so that in the future,
+individual objects can be reloaded if desired. Similarly,
+\code{dumpImage} saves the entire workspace to a ``filehash''
+database.
+
+The function \code{dumpList} takes a list and creates a ``filehash''
+database with values from the list. The list must have a non-empty
+name for every element in order for \code{dumpList} to succeed.
+\code{dumpDF} creates a ``filehash'' database from a data frame where
+each column of the data frame is an element in the database.
+Essentially, \code{dumpDF} converts the data frame to a list and calls
+\code{dumpList}.
+
+
+\section{Filehash database backends}
+
+Currently, the \pkg{filehash} package can represent databases in two
+different formats. The default format is called ``DB1'' and it stores
+the keys and values in a single file. From experience, this format
+works well overall but can be a little slow to initialize when there
+are many thousands of keys. Briefly, the ``filehash'' object in R
+stores a map which associates keys with a byte location in the
+database file where the corresponding value is stored. Given the byte
+location, we can \code{seek} to that location in the file and read the
+data directly. Before reading in the data, a check is made to make
+sure that the map is up to date. This format depends critically on
+having a working \code{ftell} at the system level and a crude check is
+made when trying to initialize a database of this format.
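+
+The key-to-offset idea can be sketched in base R as follows. This is
+a simplified illustration and not the actual ``DB1'' file layout: the
+map is kept only in memory, and the names \code{datafile},
+\code{offsets}, \code{insert} and \code{fetch} are hypothetical.

```r
datafile <- tempfile("db1sketch")
file.create(datafile)
offsets <- list()   # in-memory map: key -> byte offset and length

insert <- function(key, value) {
    bytes <- serialize(value, NULL)
    start <- file.size(datafile)      # new values are appended at the end
    con <- file(datafile, "ab")
    on.exit(close(con))
    writeBin(bytes, con)
    offsets[[key]] <<- c(start = start, len = length(bytes))
}

fetch <- function(key) {
    loc <- offsets[[key]]
    con <- file(datafile, "rb")
    on.exit(close(con))
    seek(con, where = loc[["start"]])   # jump straight to the stored value
    unserialize(readBin(con, "raw", n = loc[["len"]]))
}

insert("a", 1:10)
insert("b", letters)
insert("a", c(2.5, 3.5))   # re-inserting appends; the map points to the new copy
```

+Because a re-inserted value is appended rather than overwritten, the
+file keeps the stale copy of ``a'', which is why a single-file format
+of this kind can grow over time.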
+
+The second format is called ``RDS'' and it stores objects as separate
+files on the disk in a directory with the same name as the database.
+This format is the most straightforward and simple of the available
+formats. When a request is made for a specific key, \pkg{filehash}
+finds the appropriate file in the directory and reads the file into R.
+The only catch is that on operating systems that use case-insensitive
+file names, objects whose names differ only in case will collide on
+the filesystem. To work around this, object names with capital letters
+are stored with mangled names on the disk. An advantage of this
+format is that most of the organizational work is delegated to the
+filesystem.
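+
+A one-file-per-key store can be sketched with \code{saveRDS} and
+\code{readRDS}. The mangling rule used here, which prefixes each
+capital letter with ``@'', is an assumption for illustration and may
+differ from what the package does; the helper names are hypothetical,
+and keys that already contain ``@'' are not handled.

```r
dbdir <- tempfile("rdssketch")
dir.create(dbdir)

# Hypothetical mangling rule so that "x" and "X" map to different files
# even on a case-insensitive filesystem
mangle <- function(key) gsub("([A-Z])", "@\\1", key, perl = TRUE)
unmangle <- function(fname) gsub("@", "", fname, fixed = TRUE)

insertRDS <- function(key, value) saveRDS(value, file.path(dbdir, mangle(key)))
fetchRDS <- function(key) readRDS(file.path(dbdir, mangle(key)))
listKeys <- function() unmangle(list.files(dbdir))

insertRDS("x", 1:3)
insertRDS("X", "upper case")
```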
+
+
+\section{Extending filehash}
+
+The \pkg{filehash} package has a mechanism for developing new backend
+formats, should the need arise. The function \code{registerFormatDB}
+can be used to make \pkg{filehash} aware of a new database format that
+may be implemented in a separate R package or a file.
+\code{registerFormatDB} takes two arguments: a \code{name} for the new
+format (like ``DB1'' or ``RDS'') and a list of functions. The list
+should contain two functions: one function named ``create'' for
+creating a database, given the database name, and another function
+named ``initialize'' for initializing the database. In addition, one
+needs to define methods for \code{dbInsert}, \code{dbFetch}, etc.
+
+A list of available backend formats can be obtained via the
+\code{filehashFormats} function. Upon registering a new backend
+format, the new format will be listed when \code{filehashFormats} is
+called.
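+
+A registration call might look like the sketch below. The format
+name ``TOY'' and the two stub functions are hypothetical and do not
+form a usable backend: a real format would also need a class
+inheriting from ``filehash'' and methods for \code{dbInsert},
+\code{dbFetch} and the other generics. The sketch also assumes that
+the list returned by \code{filehashFormats} is named by format.

```r
library(filehash)

# Stub functions for illustration only
createToy <- function(dbName) {
    dir.create(dbName, showWarnings = FALSE)
    TRUE
}
initToy <- function(dbName) {
    stop("initialization is not implemented in this sketch")
}

registerFormatDB("TOY", list(create = createToy, initialize = initToy))
formatNames <- names(filehashFormats())
```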
+
+The interface for registering new backend formats is still
+experimental and could change in the future.
+
+
+\section{Discussion}
+
+The \pkg{filehash} package has been designed to be useful in both a
+programming setting and an interactive setting. Its main purpose is
+to allow for simpler interaction with large datasets where
+simultaneous access to the full dataset is not needed. While the
+package may not be optimal for all settings, one goal was to write a
+simple package in pure R that users could install with minimal
+overhead. In the future I hope to add functionality for interacting
+with databases stored on remote computers and perhaps incorporate a
+``real'' database backend. Some work has already begun on developing
+a backend based on the \pkg{RSQLite} package.
+
+
+
+\bibliographystyle{alpha}
+\bibliography{combined}
+
+
+\end{document}
+
diff --git a/inst/doc/filehash.pdf b/inst/doc/filehash.pdf
new file mode 100644
index 0000000..952abff
Binary files /dev/null and b/inst/doc/filehash.pdf differ
diff --git a/man/createQ.Rd b/man/createQ.Rd
new file mode 100644
index 0000000..bfd6d6d
--- /dev/null
+++ b/man/createQ.Rd
@@ -0,0 +1,31 @@
+\name{createQ}
+\alias{createQ}
+\alias{initQ}
+
+%- Also NEED an '\alias' for EACH other topic documented here.
+\title{Create/Initialize Queue}
+\description{
+ Create or initialize a queue data structure using \code{filehash}
+ databases
+}
+\usage{
+createQ(filename)
+initQ(filename)
+}
+%- maybe also 'usage' for other objects documented here.
+\arguments{
+ \item{filename}{character, file name for storing the queue data
+ structure}
+}
+\details{
+ A new queue can be created using \code{createQ}, which creates a file
+ for storing the queue information and returns an object of class
+ \code{"queue"}.
+}
+\value{
+ The \code{createQ} and \code{initQ} functions both return an object of
+ class \code{"queue"}.
+}
+\author{Roger D. Peng \email{rpeng at jhsph.edu}}
+
+\keyword{database}
diff --git a/man/createS.Rd b/man/createS.Rd
new file mode 100644
index 0000000..bdf5d54
--- /dev/null
+++ b/man/createS.Rd
@@ -0,0 +1,31 @@
+\name{createS}
+\alias{createS}
+\alias{initS}
+
+%- Also NEED an '\alias' for EACH other topic documented here.
+\title{Create/Initialize Stack}
+\description{
+ Create or initialize a stack data structure using \code{filehash}
+ databases
+}
+\usage{
+createS(filename)
+initS(filename)
+}
+%- maybe also 'usage' for other objects documented here.
+\arguments{
+ \item{filename}{character, file name for storing the stack data
+ structure}
+}
+\details{
+ A new stack can be created using \code{createS}, which creates a file
+ for storing the stack information and returns an object of class
+ \code{"stack"}.
+}
+\value{
+ The \code{createS} and \code{initS} functions both return an object of
+ class \code{"stack"}.
+}
+\author{Roger D. Peng \email{rpeng at jhsph.edu}}
+
+\keyword{database}
diff --git a/man/db2env.Rd b/man/db2env.Rd
new file mode 100644
index 0000000..45bfb9e
--- /dev/null
+++ b/man/db2env.Rd
@@ -0,0 +1,96 @@
+\name{dbLoad}
+\alias{dbLoad}
+\alias{dbLoad,filehash-method}
+\alias{dbLazyLoad}
+\alias{dbLazyLoad,filehash-method}
+\alias{db2env}
+
+\title{Load database into environment}
+\description{
+ Load entire database into an environment
+}
+\usage{
+db2env(db)
+dbLoad(db, ...)
+dbLazyLoad(db, ...)
+
+\S4method{dbLoad}{filehash}(db, env = parent.frame(2), keys = NULL, ...)
+\S4method{dbLazyLoad}{filehash}(db, env = parent.frame(2), keys = NULL, ...)
+}
+\arguments{
+ \item{db}{database object}
+ \item{env}{an environment}
+ \item{keys}{character vector of database keys to load}
+ \item{...}{other arguments passed to methods}
+}
+\details{
+ \code{db2env} loads the entire database \code{db} into an environment
+ via calls to \code{makeActiveBinding}. Therefore, the data themselves
+ are not stored in the environment, but a function pointing to the data
+ in the database is stored. When an element of the environment is
+ accessed, the function is called to retrieve the data from the
+ database. If the data in the database is changed, the changes will be
+ reflected in the environment.
+
+ \code{dbLoad} loads objects in the database directly into the
+ environment specified, like \code{load} does except with active bindings.
+ \code{dbLoad} takes a second argument \code{env}, which is an
+ environment, and the default for \code{env} is \code{parent.frame()}.
+
+ The use of \code{makeActiveBinding} in \code{db2env} and \code{dbLoad}
+ allows for potentially large databases to, at least conceptually, be
+ used in R, as long as you don't need simultaneous access to all of the
+ elements in the database.
+
+ With \code{dbLazyLoad} database objects are
+ "lazy-loaded" into the environment. Promises to load the
+ objects are created in the environment specified by \code{env}. Upon
+ first access, those objects are copied into the environment and will
+ from then on reside in memory. Changes to the database will not be
+ reflected in the object residing in the environment after first
+ access. Conversely, changes to the object in the environment will not
+ be reflected in the database. This type of loading is useful for
+ read-only databases.
+}
+
+\value{
+ For \code{db2env}, an environment is returned, the elements of which
+ are the keys of the database. For \code{dbLoad} and \code{dbLazyLoad}, a character vector
+ is returned (invisibly) containing the keys associated with the values
+ loaded into the environment.
+}
+
+\author{Roger D. Peng}
+
+\seealso{
+ \code{\link{dbInit}} and \code{\link{filehash-class}}
+}
+
+\examples{
+dbCreate("myDB")
+db <- dbInit("myDB")
+dbInsert(db, "a", rnorm(100))
+dbInsert(db, "b", 1:10)
+
+env <- db2env(db)
+ls(env) ## "a", "b"
+print(env$b)
+mean(env$a)
+env$a <- rnorm(100)
+mean(env$a)
+
+env$b[1:5] <- 5:1
+print(env$b)
+
+env <- new.env()
+dbLoad(db, env)
+ls(env)
+
+env <- new.env()
+dbLazyLoad(db, env)
+ls(env)
+
+as(db, "list")
+}
+
+\keyword{database}
diff --git a/man/dbInit.Rd b/man/dbInit.Rd
new file mode 100644
index 0000000..b56eda0
--- /dev/null
+++ b/man/dbInit.Rd
@@ -0,0 +1,64 @@
+\name{dbInit}
+\alias{dbInit}
+\alias{dbInitialize}
+\alias{dbCreate}
+\alias{dbCreate,ANY-method}
+\alias{dbInit,ANY-method}
+\alias{dbReconnect}
+\alias{dbReconnect,filehashDB1-method}
+
+%\alias{dbInitialize}
+
+\title{Simple file-based hash table}
+\description{
+ Interface for creating and initializing a simple file-based hash table
+}
+\usage{
+dbCreate(db, ...)
+dbInit(db, ...)
+dbReconnect(db, ...)
+
+\S4method{dbCreate}{ANY}(db, type = NULL, ...)
+\S4method{dbInit}{ANY}(db, type = NULL, ...)
+\S4method{dbReconnect}{filehashDB1}(db, ...)
+}
+
+\arguments{
+ \item{db}{name of database or a database object}
+ \item{type}{type of database format. If missing, the default type
+ will be used}
+ \item{...}{other arguments passed to methods}
+}
+
+\details{
+ \code{dbCreate} creates the necessary files or directory for the
+ database. If those files already exist nothing is done.
+
+ \code{dbInit} takes a database name and returns an object
+ inheriting from class \code{"filehash"}.
+
+ The \code{type} argument specifies the format in which the database
+ should be stored on the disk. If not specified, the default
+ type will be used (as specified by \code{filehashOption}).
+}
+
+\note{
+ The function \code{dbInitialize} has been deprecated. Use
+ \code{dbInit} instead.
+}
+
+\value{
+ \code{dbCreate} returns \code{TRUE} upon success and \code{FALSE} in
+ the event of an error. \code{dbInit} returns an object
+ inheriting from class \code{"filehash"}
+}
+
+\author{Roger D. Peng}
+
+\seealso{
+ See \code{\link{filehash-class}} for more information and examples and
+ \code{\link{filehashOption}} for setting the default database type.
+}
+
+\keyword{database}% at least one, from doc/KEYWORDS
+
diff --git a/man/dump.Rd b/man/dump.Rd
new file mode 100644
index 0000000..a34849e
--- /dev/null
+++ b/man/dump.Rd
@@ -0,0 +1,65 @@
+\name{dumpObjects}
+\alias{dumpObjects}
+\alias{dumpImage}
+\alias{dumpDF}
+\alias{dumpList}
+\alias{dumpEnv}
+
+\title{Dump objects to a database}
+\description{
+ Dump R objects to a filehash database
+}
+\usage{
+dumpObjects(..., list = character(0), dbName, type = NULL, envir = parent.frame())
+dumpImage(dbName = "Rworkspace", type = NULL)
+dumpDF(data, dbName = NULL, type = NULL)
+dumpList(data, dbName = NULL, type = NULL)
+dumpEnv(env, dbName)
+}
+
+\arguments{
+ \item{\dots}{R objects to dump}
+ \item{list}{character vector of names of objects to dump}
+ \item{dbName}{character, name of database to which objects should be
+ dumped}
+ \item{type}{type of database to create}
+ \item{envir}{environment from which to obtain objects}
+ \item{data}{a data frame or a list}
+ \item{env}{an environment}
+}
+\details{
+ Objects dumped to a database can later be loaded via \code{dbLoad} or
+ can be accessed with \code{dbFetch}, \code{dbList}, etc.
+ Alternatively, the \code{with} method can be used to evaluate code in
+ the context of a database. If a database with name \code{dbName}
+ already exists, objects will be inserted into the existing database
+ (and values for already-existing keys will be overwritten).
+
+ \code{dumpDF} is different in that each variable in the data frame is
+ stored as a separate object in the database. So each variable can be
+ read from the database separately rather than having to load the
+ entire data frame into memory. \code{dumpList} works in a similar
+ way.
+
+ The \code{dumpEnv} function takes an environment and stores each
+ element of the environment in a \code{filehash} database.
+}
+
+\value{
+ An object of class \code{"filehash"} is returned and a database is
+ created.
+}
+
+\author{Roger D. Peng}
+
+\examples{
+data <- data.frame(y = rnorm(100), x = rnorm(100), z = rnorm(100))
+db <- dumpDF(data, dbName = "dataframe.dump")
+fit <- with(db, lm(y ~ x + z))
+summary(fit)
+
+db <- dumpList(list(a = 1, b = 2, c = 3), "list.dump")
+db$a
+}
+\keyword{database}% at least one, from doc/KEYWORDS
+
diff --git a/man/filehash-class.Rd b/man/filehash-class.Rd
new file mode 100644
index 0000000..4b75d63
--- /dev/null
+++ b/man/filehash-class.Rd
@@ -0,0 +1,150 @@
+\name{filehash-class}
+\docType{class}
+\alias{filehash-class}
+\alias{filehashDB-class}
+\alias{filehashRDS-class}
+\alias{filehashDB1-class}
+\alias{dbFetch}
+\alias{dbMultiFetch}
+\alias{dbInsert}
+\alias{dbExists}
+\alias{dbList}
+\alias{dbDelete}
+\alias{dbReorganize}
+\alias{dbUnlink}
+\alias{dbDelete,filehashDB,character-method}
+\alias{dbExists,filehashDB,character-method}
+\alias{dbFetch,filehashDB,character-method}
+\alias{dbInsert,filehashDB,character-method}
+\alias{dbList,filehashDB-method}
+\alias{dbUnlink,filehashDB-method}
+\alias{dbReorganize,filehashDB-method}
+\alias{dbMultiFetch,filehashDB1-method}
+\alias{dbDelete,filehashDB1,character-method}
+\alias{dbExists,filehashDB1,character-method}
+\alias{dbFetch,filehashDB1,character-method}
+\alias{dbMultiFetch,filehashDB1,character-method}
+\alias{dbInsert,filehashDB1,character-method}
+\alias{dbList,filehashDB1-method}
+\alias{dbUnlink,filehashDB1-method}
+\alias{dbReorganize,filehashDB1-method}
+\alias{dbDelete,filehashRDS,character-method}
+\alias{dbExists,filehashRDS,character-method}
+\alias{dbFetch,filehashRDS,character-method}
+\alias{dbMultiFetch,filehashRDS,character-method}
+\alias{dbInsert,filehashRDS,character-method}
+\alias{dbList,filehashRDS-method}
+\alias{dbUnlink,filehashRDS-method}
+\alias{show,filehash-method}
+\alias{with,filehash-method}
+\alias{coerce,filehashDB,filehashRDS-method}
+\alias{coerce,filehashRDS,filehashDB-method}
+\alias{coerce,filehashDB1,filehashRDS-method}
+\alias{coerce,filehashDB1,list-method}
+\alias{coerce,filehashDB,filehashDB1-method}
+\alias{coerce,filehash,list-method}
+\alias{lapply,filehash-method}
+\alias{names,filehash-method}
+\alias{length,filehash-method}
+
+\alias{[,filehash,character,missing,missing-method}
+\alias{[[,filehash,character,missing-method}
+\alias{[[,filehash,numeric,missing-method}
+\alias{[[<-,filehash,character,missing-method}
+\alias{[[<-,filehash,numeric,missing-method}
+\alias{$<-,filehash-method}
+\alias{$,filehash-method}
+
+\title{Class "filehash"}
+
+\description{
+ These functions form the interface for a simple file-based key-value
+ database (i.e. hash table).
+}
+
+\section{Objects from the Class}{
+ Objects can be created by calls of the form \code{new("filehash", ...)}.
+}
+
+\section{Slots}{
+ \describe{
+ \item{\code{name}:}{Object of class \code{"character"}, name of the
+ database.}
+ }
+}
+
+\section{Additional slots for "filehashDB1"}{
+ \describe{
+ \item{\code{datafile}:}{full path to the database file.}
+ \item{\code{meta}:}{list containing an environment for database
+ metadata.}
+ }
+}
+
+\section{Additional slots for "filehashRDS"}{
+ \describe{
+ \item{dir:}{Directory where files are stored.}
+ }
+}
+
+\section{Methods}{
+ \describe{
+ \item{dbDelete}{The \code{dbDelete} function is for deleting
+ elements, but for the \code{"DB1"} format all it does is remove the
+ key from the lookup table.
+ The actual data are still in the database (but inaccessible). If
+ you reinsert data for the same key, the new data are simply
+ appended on to the end of the file. Therefore, it's possible to
+ have multiple copies of data lying around after a while,
+ potentially making the database file big. The \code{"RDS"} format
+ does not have this problem.}
+ \item{dbExists}{check to see if a key exists.}
+ \item{dbFetch}{retrieve the value associated with a given key.}
+ \item{dbMultiFetch}{retrieve values associated with multiple keys (a
+ list of those values is returned).}
+ \item{dbInsert}{insert a key-value pair into the database. If
+ that key already exists, its associated value is overwritten. For
+ \code{"RDS"} type databases, there is a \code{safe} option
+ (defaults to \code{TRUE}) which allows the user to insert objects
+ somewhat more safely (objects should not be lost in the event of
+ an interrupt).}
+ \item{dbList}{list all keys in the database.}
+ \item{dbReorganize}{The \code{dbReorganize} function is there for
+ the purpose of rewriting the database to remove all of the stale
+ entries. Basically, this function creates a new copy of the
+ database and then overwrites the old copy. This function has not
+ been tested extensively and so should be considered
+ \emph{experimental}. \code{dbReorganize} is not needed when using
+ the \code{"RDS"} format.}
+ \item{dbUnlink}{delete an entire database from the disk}
+ \item{show}{print method}
+ \item{with}{allows \code{with} to be used with \code{"filehash"}
+ objects much like it can be used with lists or data frames}
+ \item{[[,[[<-}{elements of a database can be accessed using the \code{[[}
+ operator much like a list or environment, but only character
+ indices are allowed}
+ \item{$,$<-}{elements of a database can be accessed using the \code{$}
+ operator much like with a list or environment}
+ \item{lapply}{works much like \code{lapply} with lists; a list is
+ returned.}
+ \item{names}{returns all of the keys in the database}
+ \item{length}{returns the number of elements in the database}
+ }
+}
+
+\author{Roger D. Peng \email{rpeng at jhsph.edu}}
+
+\examples{
+dbCreate("myDB") ## Create database 'myDB'
+db <- dbInit("myDB")
+dbInsert(db, "a", 1:10)
+dbInsert(db, "b", rnorm(1000))
+dbExists(db, "b") ## 'TRUE'
+
+dbList(db) ## c("a", "b")
+dbDelete(db, "a")
+dbList(db) ## "b"
+
+with(db, mean(b))
+}
+\keyword{classes}
diff --git a/man/filehashFormats.Rd b/man/filehashFormats.Rd
new file mode 100644
index 0000000..6dded7d
--- /dev/null
+++ b/man/filehashFormats.Rd
@@ -0,0 +1,30 @@
+\name{filehashFormats}
+\alias{filehashFormats}
+\alias{registerFormatDB}
+
+\title{List and register filehash formats}
+\description{
+ List and register filehash backend database formats.
+}
+\usage{
+registerFormatDB(name, funlist)
+filehashFormats(...)
+}
+%- maybe also 'usage' for other objects documented here.
+\arguments{
+ \item{name}{character, name of database format}
+ \item{funlist}{list of functions for creating and initializing a
+ database format}
+ \item{\dots}{list of functions for registering a new database format}
+}
+\details{
+ \code{registerFormatDB} can be used to register new filehash backend
+ database formats. \code{filehashFormats} called with no arguments
+ lists information on available formats.
+}
+\value{
+ \code{filehashFormats} returns a list containing information on the
+ available filehash formats.
+}
+
+\keyword{utilities}% at least one, from doc/KEYWORDS
diff --git a/man/filehashOption.Rd b/man/filehashOption.Rd
new file mode 100644
index 0000000..77542ed
--- /dev/null
+++ b/man/filehashOption.Rd
@@ -0,0 +1,27 @@
+\name{filehashOption}
+\alias{filehashOption}
+
+\title{Set filehash options}
+\description{
+ Set global filehash options
+}
+\usage{
+filehashOption(...)
+}
+
+\arguments{
+ \item{\dots}{name-value pairs for options}
+}
+\details{
+ Currently, the only option that can be set is the default database
+ type (\code{defaultType}) which can be "DB1", "RDS" or "DB".
+}
+\value{
+ \code{filehashOption} returns a list of current settings for all
+ options.
+}
+
+\author{Roger D. Peng}
+
+\keyword{database}% at least one, from doc/KEYWORDS
+
diff --git a/man/push.Rd b/man/push.Rd
new file mode 100644
index 0000000..b0b8fec
--- /dev/null
+++ b/man/push.Rd
@@ -0,0 +1,43 @@
+\name{stackqueue}
+\alias{stackqueue}
+\alias{push}
+\alias{pop}
+\alias{mpush}
+\alias{top}
+\alias{isEmpty}
+
+%- Also NEED an '\alias' for EACH other topic documented here.
+\title{Operations on Stacks/Queues}
+\description{
+ Functions for interacting with stack and queue data structures
+ implemented using \code{filehash} databases.
+}
+\usage{
+push(db, val, ...)
+mpush(db, vals, ...)
+pop(db, ...)
+top(db, ...)
+isEmpty(db, ...)
+}
+%- maybe also 'usage' for other objects documented here.
+\arguments{
+ \item{db}{an object of class \code{"stack"} or \code{"queue"}}
+ \item{val}{an R object}
+ \item{vals}{a list of R objects}
+ \item{\dots}{arguments passed to other methods}
+}
+\details{
+ Note that for \code{mpush}, if \code{vals} is not a list it will be
+ coerced to a list via \code{as.list}. Currently, \code{mpush} is only
+ implemented for \code{"stack"}s.
+}
+\value{
+ \code{push} and \code{mpush} return nothing useful; \code{pop} returns
+ a value from the stack/queue and deletes that value from the
+ stack/queue; \code{top} returns the "top" value from the stack/queue;
+ \code{isEmpty} returns \code{TRUE}/\code{FALSE} depending on whether
+ the stack/queue is empty or not. Both \code{pop} and \code{top}
+ signal an error if the stack/queue is empty.
+}
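+% A short usage sketch for the functions documented above. The file
+% locations and the object names 's' and 'qu' are assumptions made for
+% illustration; the comments follow the behavior described in the
+% stack and queue class documentation.

```r
library(filehash)

# Stack: the last value pushed is the first one returned
s <- createS(tempfile("stack"))
push(s, "first")
push(s, "second")
sTop <- top(s)          # looks at the top value without removing it
sPop <- pop(s)          # returns the top value and removes it
sEmpty <- isEmpty(s)    # one element is still on the stack

# Queue: values come back in the order in which they were pushed
qu <- createQ(tempfile("queue"))
push(qu, "first")
push(qu, "second")
qPop <- pop(qu)         # head of the queue, i.e. the oldest value
```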
+\author{Roger D. Peng \email{rpeng at jhsph.edu}}
+\keyword{database}% __ONLY ONE__ keyword per line
diff --git a/man/queue-class.Rd b/man/queue-class.Rd
new file mode 100644
index 0000000..09f3204
--- /dev/null
+++ b/man/queue-class.Rd
@@ -0,0 +1,47 @@
+\name{queue-class}
+\docType{class}
+\alias{queue-class}
+\alias{isEmpty,queue-method}
+\alias{pop,queue-method}
+\alias{push,queue-method}
+\alias{show,queue-method}
+\alias{top,queue-method}
+
+\title{Class "queue"}
+\description{A queue implementation using a \code{filehash} database}
+\section{Objects from the Class}{
+Objects can be created by calls of the form \code{new("queue", ...)} or
+by calling \code{createQ}. Existing queues can be initialized with
+\code{initQ}.
+}
+\section{Slots}{
+ \describe{
+ \item{\code{queue}:}{Object of class \code{"filehashDB1"}}
+ \item{\code{name}:}{Object of class \code{"character"}: the name of
+ the queue (default is the file name in which the queue data are
+ stored)}
+ }
+}
+\section{Methods}{
+ \describe{
+ \item{isEmpty}{\code{signature(db = "queue")}: returns
+ \code{TRUE}/\code{FALSE} depending on whether there are elements
+ in the queue.}
+ \item{pop}{\code{signature(db = "queue")}: returns the value of the
+ "top" (i.e. head) of the queue and subsequently removes that
+ element from the queue; an error is signaled if the queue is empty}
+ \item{push}{\code{signature(db = "queue")}: adds an element to the
+ tail ("bottom") of the queue}
+ \item{show}{\code{signature(object = "queue")}: prints the name of
+ the queue}
+ \item{top}{\code{signature(db = "queue")}: returns the value of the
+ "top" (i.e. head) of the queue; an error is signaled if the queue
+ is empty}
+ }
+}
+\author{Roger D. Peng \email{rpeng at jhsph.edu}}
+
+\examples{
+showClass("queue")
+}
+\keyword{classes}
diff --git a/man/stack-class.Rd b/man/stack-class.Rd
new file mode 100644
index 0000000..8fbddb8
--- /dev/null
+++ b/man/stack-class.Rd
@@ -0,0 +1,50 @@
+\name{stack-class}
+\docType{class}
+\alias{stack-class}
+\alias{isEmpty,stack-method}
+\alias{mpush,stack-method}
+\alias{pop,stack-method}
+\alias{push,stack-method}
+\alias{show,stack-method}
+\alias{top,stack-method}
+
+\title{Class "stack"}
+\description{A stack implementation using a \code{filehash} database}
+\section{Objects from the Class}{
+Objects can be created by calls of the form \code{new("stack", ...)} or
+by calling \code{createS}. Existing stacks can be initialized with
+\code{initS}.
+}
+\section{Slots}{
+ \describe{
+ \item{\code{stack}:}{Object of class \code{"filehashDB1"}}
+ \item{\code{name}:}{Object of class \code{"character"}: the name of
+ the stack (default is the file name in which the stack data are
+ stored)}
+ }
+}
+\section{Methods}{
+ \describe{
+ \item{isEmpty}{\code{signature(db = "stack")}: returns
+ \code{TRUE}/\code{FALSE} depending on whether there are elements
+ in the stack.}
+ \item{pop}{\code{signature(db = "stack")}: returns the value of the
+ top of the stack and subsequently removes that
+ element from the stack; an error is signaled if the stack is empty}
+ \item{push}{\code{signature(db = "stack")}: adds an element to the
+ top of the stack}
+ \item{show}{\code{signature(object = "stack")}: prints the name of
+ the stack}
+ \item{top}{\code{signature(db = "stack")}: returns the value of the
+ top of the stack; an error is signaled if the stack
+ is empty}
+ \item{mpush}{\code{signature(db = "stack")}: works like \code{push}
+ except it can push multiple objects in a list on to the stack}
+ }
+}
+\author{Roger D. Peng \email{rpeng at jhsph.edu}}
+
+\examples{
+showClass("stack")
+}
+\keyword{classes}
diff --git a/src/hash.c b/src/hash.c
new file mode 100644
index 0000000..7cdf4fb
--- /dev/null
+++ b/src/hash.c
@@ -0,0 +1,84 @@
+#include <R.h>
+#include <Rinternals.h>
+#include "sha1.h"
+
+/*
+ * This code is adapted from the 'digest.c' code in the 'digest'
+ * package by Dirk Eddelbuettel <edd at debian.org> with contributions by
+ * Antoine Lucas, Jarek Tuszynski, Henrik Bengtsson and Simon Urbanek
+ */
+
+SEXP sha1_object(SEXP object, SEXP skip_bytes)
+{
+ char output[41]; /* SHA-1 digest as hex: 40 chars + '\0' */
+ int i, skip;
+ SEXP result;
+ sha1_context ctx;
+ unsigned char buffer[20];
+ Rbyte *data;
+ int nChar = length(object);
+
+ PROTECT(object = coerceVector(object, RAWSXP));
+ data = RAW(object);
+ PROTECT(skip_bytes = coerceVector(skip_bytes, INTSXP));
+ skip = INTEGER(skip_bytes)[0];
+
+ if(skip > 0) {
+ if(skip >= nChar)
+ nChar = 0;
+ else {
+ nChar -= skip;
+ data += skip;
+ }
+ }
+ sha1_starts(&ctx);
+ sha1_update(&ctx, (uint8 *) data, nChar);
+ sha1_finish(&ctx, buffer);
+
+ for(i=0; i < 20; i++)
+ sprintf(output + i * 2, "%02x", buffer[i]);
+
+ PROTECT(result = allocVector(STRSXP, 1));
+ SET_STRING_ELT(result, 0, mkChar(output));
+ UNPROTECT(3);
+
+ return result;
+}
+
+
+SEXP sha1_file(SEXP filename, SEXP skip_bytes)
+{
+ char output[41]; /* SHA-1 digest as hex: 40 chars + '\0' */
+ int nChar, i, skip;
+ FILE *fp;
+ SEXP result;
+ sha1_context ctx;
+ unsigned char buf[1024];
+ unsigned char sha1sum[20];
+
+ PROTECT(skip_bytes = coerceVector(skip_bytes, INTSXP));
+ PROTECT(filename = coerceVector(filename, STRSXP));
+
+ skip = INTEGER(skip_bytes)[0];
+
+ if(!(fp = fopen(CHAR(STRING_ELT(filename, 0)), "rb")))
+ error("unable to open input file");
+ if (skip > 0)
+ fseek(fp, skip, SEEK_SET);
+ sha1_starts(&ctx);
+
+ while((nChar = fread(buf, 1, sizeof(buf), fp)) > 0)
+ sha1_update(&ctx, buf, nChar);
+
+ fclose(fp);
+ sha1_finish(&ctx, sha1sum);
+
+ for(i=0; i < 20; i++)
+ sprintf(output + i * 2, "%02x", sha1sum[i]);
+
+ PROTECT(result = allocVector(STRSXP, 1));
+ SET_STRING_ELT(result, 0, mkChar(output));
+ UNPROTECT(3);
+
+ return result;
+}
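
Note (illustrative, not part of the commit): both functions above finish by
rendering the 20-byte digest as 40 hex characters with sprintf. That step can
be isolated in a small helper. The name digest_to_hex is hypothetical, and
snprintf is swapped in for sprintf here so that each write is bounded; this is
a sketch under those assumptions, not the package's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not in filehash: write n digest bytes as
 * 2*n lowercase hex characters followed by a terminating '\0'. */
void digest_to_hex(const unsigned char *digest, size_t n,
                   char *out, size_t outlen)
{
    size_t i;

    assert(outlen >= 2 * n + 1);   /* caller must supply enough room */
    out[0] = '\0';                 /* well-defined result when n == 0 */
    for (i = 0; i < n; i++)
        snprintf(out + 2 * i, outlen - 2 * i, "%02x", digest[i]);
}
```

Called with a 20-byte SHA-1 digest and a 41-byte buffer, this should build the
same string as the loops in sha1_object and sha1_file.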
diff --git a/src/lockfile.c b/src/lockfile.c
new file mode 100644
index 0000000..0420339
--- /dev/null
+++ b/src/lockfile.c
@@ -0,0 +1,21 @@
+#include <R.h>
+#include <Rinternals.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+SEXP lock_file(SEXP filename)
+{
+ int fd;
+ SEXP status;
+
+ if(!isString(filename))
+ error("'filename' should be character");
+ PROTECT(status = allocVector(INTSXP, 1));
+
+ fd = open(CHAR(STRING_ELT(filename, 0)),
+ O_WRONLY | O_CREAT | O_EXCL, 0666);
+ INTEGER(status)[0] = fd;
+ close(fd);
+ UNPROTECT(1);
+ return status;
+}
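
Note (illustrative, not part of the commit): lock_file relies on open() with
O_CREAT | O_EXCL, which fails if the file already exists, so creating the file
doubles as taking the lock. A minimal POSIX sketch of the same idea follows.
The name try_lock and its 0 / -1 return convention are assumptions made for
the example (lock_file itself returns the descriptor value), and the sketch
only closes the descriptor when open() succeeded.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if this call created the lock file,
 * -1 if the file already exists or cannot be created. */
int try_lock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (fd < 0)
        return -1;      /* e.g. errno == EEXIST: the lock is already held */
    close(fd);          /* the file's existence is the lock itself */
    return 0;
}
```

Releasing the lock is then a matter of removing the file, for example with
unlink(path).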
diff --git a/src/readKeyMap.c b/src/readKeyMap.c
new file mode 100644
index 0000000..f312915
--- /dev/null
+++ b/src/readKeyMap.c
@@ -0,0 +1,65 @@
+#define NEED_CONNECTION_PSTREAMS
+
+#include <R.h>
+#include <Rinternals.h>
+
+SEXP read_key_map(SEXP filename, SEXP map, SEXP filesize, SEXP pos)
+{
+ SEXP key, datalen;
+ FILE *fp;
+ int status, len;
+ struct R_inpstream_st in;
+
+ if(!isEnvironment(map))
+ error("'map' should be an environment");
+ if(!isString(filename))
+ error("'filename' should be character");
+
+ PROTECT(filesize = coerceVector(filesize, INTSXP));
+ PROTECT(pos = coerceVector(pos, INTSXP));
+
+ fp = fopen(CHAR(STRING_ELT(filename, 0)), "rb");
+ if(fp == NULL) error("unable to open input file");
+ if(INTEGER(pos)[0] > 0) {
+ status = fseek(fp, INTEGER(pos)[0], SEEK_SET);
+
+ if(status < 0)
+ error("problem with initial file pointer seek");
+ }
+
+ /* Initialize the incoming R file stream */
+ R_InitFileInPStream(&in, fp, R_pstream_any_format, NULL, NULL);
+
+ while(INTEGER(pos)[0] < INTEGER(filesize)[0]) {
+ PROTECT(key = R_Unserialize(&in));
+ PROTECT(datalen = R_Unserialize(&in));
+ len = INTEGER(datalen)[0];
+
+ /* calculate the position of file pointer */
+ INTEGER(pos)[0] = ftell(fp);
+
+ if(len <= 0) {
+ /* key has been deleted; set pos to NULL */
+ defineVar(install(CHAR(STRING_ELT(key, 0))),
+ R_NilValue, map);
+ UNPROTECT(2);
+ continue;
+ }
+ /* create a new entry in the key map */
+ defineVar(install(CHAR(STRING_ELT(key, 0))), duplicate(pos), map);
+
+ /* advance to the next key */
+ status = fseek(fp, len, SEEK_CUR);
+
+ if(status < 0) {
+ fclose(fp);
+ error("problem with seek");
+ }
+ INTEGER(pos)[0] = INTEGER(pos)[0] + len;
+
+ UNPROTECT(2);
+ }
+ UNPROTECT(2);
+ fclose(fp);
+ return map;
+}
diff --git a/src/sha1.c b/src/sha1.c
new file mode 100644
index 0000000..082ef97
--- /dev/null
+++ b/src/sha1.c
@@ -0,0 +1,371 @@
+/*
+ * FIPS-180-1 compliant SHA-1 implementation,
+ * by Christophe Devine <devine at cr0.net>;
+ * this program is licensed under the GPL.
+ */
+
+#include <string.h>
+
+#include "sha1.h"
+
+#define GET_UINT32(n,b,i) \
+{ \
+ (n) = ( (uint32) (b)[(i) ] << 24 ) \
+ | ( (uint32) (b)[(i) + 1] << 16 ) \
+ | ( (uint32) (b)[(i) + 2] << 8 ) \
+ | ( (uint32) (b)[(i) + 3] ); \
+}
+
+#define PUT_UINT32(n,b,i) \
+{ \
+ (b)[(i) ] = (uint8) ( (n) >> 24 ); \
+ (b)[(i) + 1] = (uint8) ( (n) >> 16 ); \
+ (b)[(i) + 2] = (uint8) ( (n) >> 8 ); \
+ (b)[(i) + 3] = (uint8) ( (n) ); \
+}
+
+void sha1_starts( sha1_context *ctx )
+{
+ ctx->total[0] = 0;
+ ctx->total[1] = 0;
+
+ ctx->state[0] = 0x67452301;
+ ctx->state[1] = 0xEFCDAB89;
+ ctx->state[2] = 0x98BADCFE;
+ ctx->state[3] = 0x10325476;
+ ctx->state[4] = 0xC3D2E1F0;
+}
+
+void sha1_process( sha1_context *ctx, uint8 data[64] )
+{
+ uint32 temp, W[16], A, B, C, D, E;
+
+ GET_UINT32( W[0], data, 0 );
+ GET_UINT32( W[1], data, 4 );
+ GET_UINT32( W[2], data, 8 );
+ GET_UINT32( W[3], data, 12 );
+ GET_UINT32( W[4], data, 16 );
+ GET_UINT32( W[5], data, 20 );
+ GET_UINT32( W[6], data, 24 );
+ GET_UINT32( W[7], data, 28 );
+ GET_UINT32( W[8], data, 32 );
+ GET_UINT32( W[9], data, 36 );
+ GET_UINT32( W[10], data, 40 );
+ GET_UINT32( W[11], data, 44 );
+ GET_UINT32( W[12], data, 48 );
+ GET_UINT32( W[13], data, 52 );
+ GET_UINT32( W[14], data, 56 );
+ GET_UINT32( W[15], data, 60 );
+
+#define S(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))
+
+#define R(t) \
+( \
+ temp = W[(t - 3) & 0x0F] ^ W[(t - 8) & 0x0F] ^ \
+ W[(t - 14) & 0x0F] ^ W[ t & 0x0F], \
+ ( W[t & 0x0F] = S(temp,1) ) \
+)
+
+#define P(a,b,c,d,e,x) \
+{ \
+ e += S(a,5) + F(b,c,d) + K + x; b = S(b,30); \
+}
+
+ A = ctx->state[0];
+ B = ctx->state[1];
+ C = ctx->state[2];
+ D = ctx->state[3];
+ E = ctx->state[4];
+
+#define F(x,y,z) (z ^ (x & (y ^ z)))
+#define K 0x5A827999
+
+ P( A, B, C, D, E, W[0] );
+ P( E, A, B, C, D, W[1] );
+ P( D, E, A, B, C, W[2] );
+ P( C, D, E, A, B, W[3] );
+ P( B, C, D, E, A, W[4] );
+ P( A, B, C, D, E, W[5] );
+ P( E, A, B, C, D, W[6] );
+ P( D, E, A, B, C, W[7] );
+ P( C, D, E, A, B, W[8] );
+ P( B, C, D, E, A, W[9] );
+ P( A, B, C, D, E, W[10] );
+ P( E, A, B, C, D, W[11] );
+ P( D, E, A, B, C, W[12] );
+ P( C, D, E, A, B, W[13] );
+ P( B, C, D, E, A, W[14] );
+ P( A, B, C, D, E, W[15] );
+ P( E, A, B, C, D, R(16) );
+ P( D, E, A, B, C, R(17) );
+ P( C, D, E, A, B, R(18) );
+ P( B, C, D, E, A, R(19) );
+
+#undef K
+#undef F
+
+#define F(x,y,z) (x ^ y ^ z)
+#define K 0x6ED9EBA1
+
+ P( A, B, C, D, E, R(20) );
+ P( E, A, B, C, D, R(21) );
+ P( D, E, A, B, C, R(22) );
+ P( C, D, E, A, B, R(23) );
+ P( B, C, D, E, A, R(24) );
+ P( A, B, C, D, E, R(25) );
+ P( E, A, B, C, D, R(26) );
+ P( D, E, A, B, C, R(27) );
+ P( C, D, E, A, B, R(28) );
+ P( B, C, D, E, A, R(29) );
+ P( A, B, C, D, E, R(30) );
+ P( E, A, B, C, D, R(31) );
+ P( D, E, A, B, C, R(32) );
+ P( C, D, E, A, B, R(33) );
+ P( B, C, D, E, A, R(34) );
+ P( A, B, C, D, E, R(35) );
+ P( E, A, B, C, D, R(36) );
+ P( D, E, A, B, C, R(37) );
+ P( C, D, E, A, B, R(38) );
+ P( B, C, D, E, A, R(39) );
+
+#undef K
+#undef F
+
+#define F(x,y,z) ((x & y) | (z & (x | y)))
+#define K 0x8F1BBCDC
+
+ P( A, B, C, D, E, R(40) );
+ P( E, A, B, C, D, R(41) );
+ P( D, E, A, B, C, R(42) );
+ P( C, D, E, A, B, R(43) );
+ P( B, C, D, E, A, R(44) );
+ P( A, B, C, D, E, R(45) );
+ P( E, A, B, C, D, R(46) );
+ P( D, E, A, B, C, R(47) );
+ P( C, D, E, A, B, R(48) );
+ P( B, C, D, E, A, R(49) );
+ P( A, B, C, D, E, R(50) );
+ P( E, A, B, C, D, R(51) );
+ P( D, E, A, B, C, R(52) );
+ P( C, D, E, A, B, R(53) );
+ P( B, C, D, E, A, R(54) );
+ P( A, B, C, D, E, R(55) );
+ P( E, A, B, C, D, R(56) );
+ P( D, E, A, B, C, R(57) );
+ P( C, D, E, A, B, R(58) );
+ P( B, C, D, E, A, R(59) );
+
+#undef K
+#undef F
+
+#define F(x,y,z) (x ^ y ^ z)
+#define K 0xCA62C1D6
+
+ P( A, B, C, D, E, R(60) );
+ P( E, A, B, C, D, R(61) );
+ P( D, E, A, B, C, R(62) );
+ P( C, D, E, A, B, R(63) );
+ P( B, C, D, E, A, R(64) );
+ P( A, B, C, D, E, R(65) );
+ P( E, A, B, C, D, R(66) );
+ P( D, E, A, B, C, R(67) );
+ P( C, D, E, A, B, R(68) );
+ P( B, C, D, E, A, R(69) );
+ P( A, B, C, D, E, R(70) );
+ P( E, A, B, C, D, R(71) );
+ P( D, E, A, B, C, R(72) );
+ P( C, D, E, A, B, R(73) );
+ P( B, C, D, E, A, R(74) );
+ P( A, B, C, D, E, R(75) );
+ P( E, A, B, C, D, R(76) );
+ P( D, E, A, B, C, R(77) );
+ P( C, D, E, A, B, R(78) );
+ P( B, C, D, E, A, R(79) );
+
+#undef K
+#undef F
+
+ ctx->state[0] += A;
+ ctx->state[1] += B;
+ ctx->state[2] += C;
+ ctx->state[3] += D;
+ ctx->state[4] += E;
+}
+
+void sha1_update( sha1_context *ctx, uint8 *input, uint32 length )
+{
+ uint32 left, fill;
+
+ if( ! length ) return;
+
+ left = ctx->total[0] & 0x3F;
+ fill = 64 - left;
+
+ ctx->total[0] += length;
+ ctx->total[0] &= 0xFFFFFFFF;
+
+ if( ctx->total[0] < length )
+ ctx->total[1]++;
+
+ if( left && length >= fill )
+ {
+ memcpy( (void *) (ctx->buffer + left),
+ (void *) input, fill );
+ sha1_process( ctx, ctx->buffer );
+ length -= fill;
+ input += fill;
+ left = 0;
+ }
+
+ while( length >= 64 )
+ {
+ sha1_process( ctx, input );
+ length -= 64;
+ input += 64;
+ }
+
+ if( length )
+ {
+ memcpy( (void *) (ctx->buffer + left),
+ (void *) input, length );
+ }
+}
+
+static uint8 sha1_padding[64] =
+{
+ 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+};
+
+void sha1_finish( sha1_context *ctx, uint8 digest[20] )
+{
+ uint32 last, padn;
+ uint32 high, low;
+ uint8 msglen[8];
+
+ high = ( ctx->total[0] >> 29 )
+ | ( ctx->total[1] << 3 );
+ low = ( ctx->total[0] << 3 );
+
+ PUT_UINT32( high, msglen, 0 );
+ PUT_UINT32( low, msglen, 4 );
+
+ last = ctx->total[0] & 0x3F;
+ padn = ( last < 56 ) ? ( 56 - last ) : ( 120 - last );
+
+ sha1_update( ctx, sha1_padding, padn );
+ sha1_update( ctx, msglen, 8 );
+
+ PUT_UINT32( ctx->state[0], digest, 0 );
+ PUT_UINT32( ctx->state[1], digest, 4 );
+ PUT_UINT32( ctx->state[2], digest, 8 );
+ PUT_UINT32( ctx->state[3], digest, 12 );
+ PUT_UINT32( ctx->state[4], digest, 16 );
+}
+
+#ifdef TEST
+
+#include <stdlib.h>
+#include <stdio.h>
+
+/*
+ * those are the standard FIPS-180-1 test vectors
+ */
+
+static char *msg[] =
+{
+ "abc",
+ "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq",
+ NULL
+};
+
+static char *val[] =
+{
+ "a9993e364706816aba3e25717850c26c9cd0d89d",
+ "84983e441c3bd26ebaae4aa1f95129e5e54670f1",
+ "34aa973cd4c4daa4f61eeb2bdbad27316534016f"
+};
+
+int main( int argc, char *argv[] )
+{
+ FILE *f;
+ int i, j;
+ char output[41];
+ sha1_context ctx;
+ unsigned char buf[1000];
+ unsigned char sha1sum[20];
+
+ if( argc < 2 )
+ {
+ printf( "\n SHA-1 Validation Tests:\n\n" );
+
+ for( i = 0; i < 3; i++ )
+ {
+ printf( " Test %d ", i + 1 );
+
+ sha1_starts( &ctx );
+
+ if( i < 2 )
+ {
+ sha1_update( &ctx, (uint8 *) msg[i],
+ strlen( msg[i] ) );
+ }
+ else
+ {
+ memset( buf, 'a', 1000 );
+
+ for( j = 0; j < 1000; j++ )
+ {
+ sha1_update( &ctx, (uint8 *) buf, 1000 );
+ }
+ }
+
+ sha1_finish( &ctx, sha1sum );
+
+ for( j = 0; j < 20; j++ )
+ {
+ sprintf( output + j * 2, "%02x", sha1sum[j] );
+ }
+
+ if( memcmp( output, val[i], 40 ) )
+ {
+ printf( "failed!\n" );
+ return( 1 );
+ }
+
+ printf( "passed.\n" );
+ }
+
+ printf( "\n" );
+ }
+ else
+ {
+ if( ! ( f = fopen( argv[1], "rb" ) ) )
+ {
+ perror( "fopen" );
+ return( 1 );
+ }
+
+ sha1_starts( &ctx );
+
+ while( ( i = fread( buf, 1, sizeof( buf ), f ) ) > 0 )
+ {
+ sha1_update( &ctx, buf, i );
+ }
+
+ sha1_finish( &ctx, sha1sum );
+
+ for( j = 0; j < 20; j++ )
+ {
+ printf( "%02x", sha1sum[j] );
+ }
+
+ printf( " %s\n", argv[1] );
+ }
+
+ return( 0 );
+}
+
+#endif
diff --git a/src/sha1.h b/src/sha1.h
new file mode 100644
index 0000000..806eba1
--- /dev/null
+++ b/src/sha1.h
@@ -0,0 +1,24 @@
+#ifndef _SHA1_H
+#define _SHA1_H
+
+#ifndef uint8
+#define uint8 unsigned char
+#endif
+
+#ifndef uint32
+#define uint32 unsigned long int
+#endif
+
+typedef struct
+{
+ uint32 total[2];
+ uint32 state[5];
+ uint8 buffer[64];
+}
+sha1_context;
+
+void sha1_starts( sha1_context *ctx );
+void sha1_update( sha1_context *ctx, uint8 *input, uint32 length );
+void sha1_finish( sha1_context *ctx, uint8 digest[20] );
+
+#endif /* sha1.h */
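
Note (illustrative, not part of the commit): sha1.h defines uint32 as
unsigned long int, which is 64 bits wide on common LP64 platforms; that is
presumably why sha1.c masks with 0xFFFFFFFF in several places. Where C99's
<stdint.h> is available, uint32_t makes the width explicit. Below is a sketch
of the rotate-left performed by the S(x,n) macro under that assumption; the
name rotl32 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits, for 0 < n < 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);    /* shifts by 0 or 32 are excluded */
    return (uint32_t)((x << n) | (x >> (32 - n)));
}
```

SHA-1 only rotates by 1, 5 and 30, so the restriction on n costs nothing here.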
diff --git a/tests/SHA1SUM b/tests/SHA1SUM
new file mode 100644
index 0000000..89b6e02
--- /dev/null
+++ b/tests/SHA1SUM
@@ -0,0 +1,2 @@
+6b1babdfa60a17a2e79cd9187ea06b3df3c46624 testdb-v1.1
+6b1babdfa60a17a2e79cd9187ea06b3df3c46624 testdb-v2.0
diff --git a/tests/misc/create-testdb.R b/tests/misc/create-testdb.R
new file mode 100644
index 0000000..f0687f2
--- /dev/null
+++ b/tests/misc/create-testdb.R
@@ -0,0 +1,14 @@
+library(filehash)
+
+name <- sprintf("testdb-v%s", packageDescription("filehash", fields = "Version"))
+dbCreate(name, "DB1")
+db <- dbInit(name, "DB1")
+
+set.seed(1)
+dbInsert(db, "a", rnorm(10))
+dbInsert(db, "b", runif(7))
+dbInsert(db, "list", list(1, 2, 3, 4, 5, 6, "a"))
+dbInsert(db, "c", 1L)
+dbInsert(db, "entry", "string")
+dbDelete(db, "b")
+
diff --git a/tests/reg-tests.R b/tests/reg-tests.R
new file mode 100644
index 0000000..93000d2
--- /dev/null
+++ b/tests/reg-tests.R
@@ -0,0 +1,183 @@
+suppressMessages(library(filehash))
+
+######################################################################
+## Test 'filehashRDS' class
+
+dbCreate("mydbRDS", "RDS")
+db <- dbInit("mydbRDS", "RDS")
+show(db)
+
+## Put some data into it
+set.seed(1000)
+dbInsert(db, "a", 1:10)
+dbInsert(db, "b", rnorm(100))
+dbInsert(db, "c", 100:1)
+dbInsert(db, "d", runif(1000))
+dbInsert(db, "other", "hello")
+
+dbList(db)
+
+dbExists(db, "e")
+dbExists(db, "a")
+
+env <- db2env(db)
+ls(env)
+
+env$a
+env$b
+env$c
+str(env$d)
+env$other
+
+env$b <- rnorm(100)
+mean(env$b)
+
+env$a[1:5] <- 5:1
+print(env$a)
+
+dbDelete(db, "c")
+
+tryCatch(print(env$c), error = function(e) cat(as.character(e)))
+tryCatch(dbFetch(db, "c"), error = function(e) cat(as.character(e)))
+
+## Check trailing '/' problem
+dbCreate("testRDSdb", "RDS")
+db <- dbInit("testRDSdb/", "RDS")
+print(db)
+
+######################################################################
+## test filehashDB1 class
+
+dbCreate("mydb", "DB1")
+db <- dbInit("mydb", "DB1")
+
+## Put some data into it
+set.seed(1000)
+dbInsert(db, "a", 1:10)
+dbInsert(db, "b", rnorm(100))
+dbInsert(db, "c", 100:1)
+dbInsert(db, "d", runif(1000))
+dbInsert(db, "other", "hello")
+
+dbList(db)
+
+env <- db2env(db)
+ls(env)
+
+env$a
+env$b
+env$c
+str(env$d)
+env$other
+
+env$b <- rnorm(100)
+mean(env$b)
+
+env$a[1:5] <- 5:1
+print(env$a)
+
+dbDelete(db, "c")
+
+tryCatch(print(env$c), error = function(e) cat(as.character(e)))
+tryCatch(dbFetch(db, "c"), error = function(e) cat(as.character(e)))
+
+numbers <- rnorm(100)
+dbInsert(db, "numbers", numbers)
+b <- dbFetch(db, "numbers")
+stopifnot(all.equal(numbers, b))
+stopifnot(identical(numbers, b))
+
+################################################################################
+## Other tests
+
+rm(list = ls())
+
+
+dbCreate("testLoadingDB", "DB1")
+db <- dbInit("testLoadingDB", "DB1")
+
+set.seed(234)
+
+db$a <- rnorm(100)
+db$b <- runif(1000)
+
+dbLoad(db) ## 'a', 'b'
+summary(a)
+summary(b)
+
+rm(list = ls())
+db <- dbInit("testLoadingDB", "DB1")
+
+dbLazyLoad(db)
+
+summary(a)
+summary(b)
+
+
+
+################################################################################
+## Check dbReorganize
+
+dbCreate("test_reorg", "DB1")
+db <- dbInit("test_reorg", "DB1")
+
+set.seed(1000)
+dbInsert(db, "a", 1)
+dbInsert(db, "a", 1)
+dbInsert(db, "a", 1)
+dbInsert(db, "a", 1)
+dbInsert(db, "b", rnorm(1000))
+dbInsert(db, "b", rnorm(1000))
+dbInsert(db, "b", rnorm(1000))
+dbInsert(db, "b", rnorm(1000))
+dbInsert(db, "c", runif(1000))
+dbInsert(db, "c", runif(1000))
+dbInsert(db, "c", runif(1000))
+dbInsert(db, "c", runif(1000))
+
+summary(db$b)
+summary(db$c)
+
+print(file.info(db@datafile)$size)
+
+dbReorganize(db)
+
+db <- dbInit("test_reorg", "DB1")
+
+print(file.info(db@datafile)$size)
+
+summary(db$b)
+summary(db$c)
+
+
+################################################################################
+## Taken from the vignette
+
+file.remove("mydb")
+
+dbCreate("mydb")
+db <- dbInit("mydb")
+
+set.seed(100)
+
+dbInsert(db, "a", rnorm(100))
+value <- dbFetch(db, "a")
+mean(value)
+
+dbInsert(db, "b", 123)
+dbDelete(db, "a")
+dbList(db)
+dbExists(db, "a")
+
+file.remove("mydb")
+
+################################################################################
+## Check queue
+
+db <- createQ("testq")
+push(db, 1)
+push(db, 2)
+top(db)
+
+pop(db)
+top(db)
diff --git a/tests/reg-tests.Rout.save b/tests/reg-tests.Rout.save
new file mode 100644
index 0000000..32f4b48
--- /dev/null
+++ b/tests/reg-tests.Rout.save
@@ -0,0 +1,304 @@
+
+R version 2.10.1 Patched (--)
+Copyright (C) The R Foundation for Statistical Computing
+ISBN 3-900051-07-0
+
+R is free software and comes with ABSOLUTELY NO WARRANTY.
+You are welcome to redistribute it under certain conditions.
+Type 'license()' or 'licence()' for distribution details.
+
+R is a collaborative project with many contributors.
+Type 'contributors()' for more information and
+'citation()' on how to cite R or R packages in publications.
+
+Type 'demo()' for some demos, 'help()' for on-line help, or
+'help.start()' for an HTML browser interface to help.
+Type 'q()' to quit R.
+
+> suppressMessages(library(filehash))
+>
+> ######################################################################
+> ## Test 'filehashRDS' class
+>
+> dbCreate("mydbRDS", "RDS")
+[1] TRUE
+> db <- dbInit("mydbRDS", "RDS")
+> show(db)
+'filehashRDS' database 'mydbRDS'
+>
+> ## Put some data into it
+> set.seed(1000)
+> dbInsert(db, "a", 1:10)
+> dbInsert(db, "b", rnorm(100))
+> dbInsert(db, "c", 100:1)
+> dbInsert(db, "d", runif(1000))
+> dbInsert(db, "other", "hello")
+>
+> dbList(db)
+[1] "a" "b" "c" "d" "other"
+>
+> dbExists(db, "e")
+[1] FALSE
+> dbExists(db, "a")
+[1] TRUE
+>
+> env <- db2env(db)
+> ls(env)
+[1] "a" "b" "c" "d" "other"
+>
+> env$a
+ [1] 1 2 3 4 5 6 7 8 9 10
+> env$b
+ [1] -0.44577826 -1.20585657 0.04112631 0.63938841 -0.78655436 -0.38548930
+ [7] -0.47586788 0.71975069 -0.01850562 -1.37311776 -0.98242783 -0.55448870
+ [13] 0.12138119 -0.12087232 -1.33604105 0.17005748 0.15507872 0.02493187
+ [19] -2.04658541 0.21315411 2.67007166 -1.22701601 0.83424733 0.53257175
+ [25] -0.64682496 0.60316126 -1.78384414 0.33494217 0.56097572 1.22093565
+ [31] -0.21145359 0.69942953 -0.70643668 -0.46515095 -1.76619861 0.18928860
+ [37] -0.36618068 1.05760118 -0.74162146 -1.34835905 -0.51730643 1.41173570
+ [43] 0.18546503 -0.04369144 -0.21591338 1.46377535 0.22966664 0.10762363
+ [49] -1.37810256 -0.96818288 0.25171138 -1.09469370 0.39764284 -0.99630200
+ [55] 0.10057801 0.95368028 -1.79032293 0.31170122 2.55398801 -0.86083776
+ [61] 0.54392844 -0.39233804 1.23544190 1.19608644 -0.49574690 -0.29434122
+ [67] -0.57349748 1.61920873 -0.95692767 0.04123712 -1.49831044 0.66095916
+ [73] 0.28545762 1.38886629 -0.15934361 -0.46091890 0.16843807 1.39549302
+ [79] 0.72842626 0.33508995 1.16927649 0.24796682 -0.35814947 1.38349332
+ [85] 0.41206917 -0.12300786 -0.06622931 -2.32249088 -1.04565650 2.05787502
+ [91] 1.97153237 -1.92099520 0.46212607 -0.16072406 -0.10421153 0.46783940
+ [97] 0.44392082 0.82855281 -0.38705012 2.01893816
+> env$c
+ [1] 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83
+ [19] 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65
+ [37] 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47
+ [55] 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29
+ [73] 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11
+ [91] 10 9 8 7 6 5 4 3 2 1
+> str(env$d)
+ num [1:1000] 0.0854 0.3317 0.5647 0.4989 0.4549 ...
+> env$other
+[1] "hello"
+>
+> env$b <- rnorm(100)
+> mean(env$b)
+[1] -0.02208835
+>
+> env$a[1:5] <- 5:1
+> print(env$a)
+ [1] 5 4 3 2 1 6 7 8 9 10
+>
+> dbDelete(db, "c")
+>
+> tryCatch(print(env$c), error = function(e) cat(as.character(e)))
+Error in dbFetch(db, key): unable to obtain value for key 'c'
+> tryCatch(dbFetch(db, "c"), error = function(e) cat(as.character(e)))
+Error in dbFetch(db, "c"): unable to obtain value for key 'c'
+>
+> ## Check trailing '/' problem
+> dbCreate("testRDSdb", "RDS")
+[1] TRUE
+> db <- dbInit("testRDSdb/", "RDS")
+> print(db)
+'filehashRDS' database 'testRDSdb'
+>
+> ######################################################################
+> ## test filehashDB1 class
+>
+> dbCreate("mydb", "DB1")
+[1] TRUE
+> db <- dbInit("mydb", "DB1")
+>
+> ## Put some data into it
+> set.seed(1000)
+> dbInsert(db, "a", 1:10)
+> dbInsert(db, "b", rnorm(100))
+> dbInsert(db, "c", 100:1)
+> dbInsert(db, "d", runif(1000))
+> dbInsert(db, "other", "hello")
+>
+> dbList(db)
+[1] "a" "b" "other" "c" "d"
+>
+> env <- db2env(db)
+> ls(env)
+[1] "a" "b" "c" "d" "other"
+>
+> env$a
+ [1] 1 2 3 4 5 6 7 8 9 10
+> env$b
+ [1] -0.44577826 -1.20585657 0.04112631 0.63938841 -0.78655436 -0.38548930
+ [7] -0.47586788 0.71975069 -0.01850562 -1.37311776 -0.98242783 -0.55448870
+ [13] 0.12138119 -0.12087232 -1.33604105 0.17005748 0.15507872 0.02493187
+ [19] -2.04658541 0.21315411 2.67007166 -1.22701601 0.83424733 0.53257175
+ [25] -0.64682496 0.60316126 -1.78384414 0.33494217 0.56097572 1.22093565
+ [31] -0.21145359 0.69942953 -0.70643668 -0.46515095 -1.76619861 0.18928860
+ [37] -0.36618068 1.05760118 -0.74162146 -1.34835905 -0.51730643 1.41173570
+ [43] 0.18546503 -0.04369144 -0.21591338 1.46377535 0.22966664 0.10762363
+ [49] -1.37810256 -0.96818288 0.25171138 -1.09469370 0.39764284 -0.99630200
+ [55] 0.10057801 0.95368028 -1.79032293 0.31170122 2.55398801 -0.86083776
+ [61] 0.54392844 -0.39233804 1.23544190 1.19608644 -0.49574690 -0.29434122
+ [67] -0.57349748 1.61920873 -0.95692767 0.04123712 -1.49831044 0.66095916
+ [73] 0.28545762 1.38886629 -0.15934361 -0.46091890 0.16843807 1.39549302
+ [79] 0.72842626 0.33508995 1.16927649 0.24796682 -0.35814947 1.38349332
+ [85] 0.41206917 -0.12300786 -0.06622931 -2.32249088 -1.04565650 2.05787502
+ [91] 1.97153237 -1.92099520 0.46212607 -0.16072406 -0.10421153 0.46783940
+ [97] 0.44392082 0.82855281 -0.38705012 2.01893816
+> env$c
+ [1] 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83
+ [19] 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65
+ [37] 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47
+ [55] 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29
+ [73] 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11
+ [91] 10 9 8 7 6 5 4 3 2 1
+> str(env$d)
+ num [1:1000] 0.0854 0.3317 0.5647 0.4989 0.4549 ...
+> env$other
+[1] "hello"
+>
+> env$b <- rnorm(100)
+> mean(env$b)
+[1] -0.02208835
+>
+> env$a[1:5] <- 5:1
+> print(env$a)
+ [1] 5 4 3 2 1 6 7 8 9 10
+>
+> dbDelete(db, "c")
+>
+> tryCatch(print(env$c), error = function(e) cat(as.character(e)))
+Error in readSingleKey(con, map, key): unable to obtain value for key 'c'
+> tryCatch(dbFetch(db, "c"), error = function(e) cat(as.character(e)))
+Error in readSingleKey(con, map, key): unable to obtain value for key 'c'
+>
+> numbers <- rnorm(100)
+> dbInsert(db, "numbers", numbers)
+> b <- dbFetch(db, "numbers")
+> stopifnot(all.equal(numbers, b))
+> stopifnot(identical(numbers, b))
+>
+> ################################################################################
+> ## Other tests
+>
+> rm(list = ls())
+>
+>
+> dbCreate("testLoadingDB", "DB1")
+[1] TRUE
+> db <- dbInit("testLoadingDB", "DB1")
+>
+> set.seed(234)
+>
+> db$a <- rnorm(100)
+> db$b <- runif(1000)
+>
+> dbLoad(db) ## 'a', 'b'
+> summary(a)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+-3.036000 -0.642100 0.172000 0.004131 0.614100 2.107000
+> summary(b)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+0.004583 0.229900 0.478600 0.482200 0.729200 0.999800
+>
+> rm(list = ls())
+> db <- dbInit("testLoadingDB", "DB1")
+>
+> dbLazyLoad(db)
+>
+> summary(a)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+-3.036000 -0.642100 0.172000 0.004131 0.614100 2.107000
+> summary(b)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+0.004583 0.229900 0.478600 0.482200 0.729200 0.999800
+>
+>
+>
+> ################################################################################
+> ## Check dbReorganize
+>
+> dbCreate("test_reorg", "DB1")
+[1] TRUE
+> db <- dbInit("test_reorg", "DB1")
+>
+> set.seed(1000)
+> dbInsert(db, "a", 1)
+> dbInsert(db, "a", 1)
+> dbInsert(db, "a", 1)
+> dbInsert(db, "a", 1)
+> dbInsert(db, "b", rnorm(1000))
+> dbInsert(db, "b", rnorm(1000))
+> dbInsert(db, "b", rnorm(1000))
+> dbInsert(db, "b", rnorm(1000))
+> dbInsert(db, "c", runif(1000))
+> dbInsert(db, "c", runif(1000))
+> dbInsert(db, "c", runif(1000))
+> dbInsert(db, "c", runif(1000))
+>
+> summary(db$b)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+-2.76800 -0.65520 -0.06100 -0.01269 0.65240 3.73900
+> summary(db$c)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+0.0002346 0.2416000 0.4813000 0.4938000 0.7492000 0.9992000
+>
+> print(file.info(db@datafile)$size)
+[1] 64980
+>
+> dbReorganize(db)
+Reorganizing database: 33% (1/3)67% (2/3)100% (3/3)
+Finished; reload database with 'dbInit'
+[1] TRUE
+>
+> db <- dbInit("test_reorg", "DB1")
+>
+> print(file.info(db@datafile)$size)
+[1] 16245
+>
+> summary(db$b)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+-2.76800 -0.65520 -0.06100 -0.01269 0.65240 3.73900
+> summary(db$c)
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+0.0002346 0.2416000 0.4813000 0.4938000 0.7492000 0.9992000
+>
+>
+> ################################################################################
+> ## Taken from the vignette
+>
+> file.remove("mydb")
+[1] TRUE
+>
+> dbCreate("mydb")
+[1] TRUE
+> db <- dbInit("mydb")
+>
+> set.seed(100)
+>
+> dbInsert(db, "a", rnorm(100))
+> value <- dbFetch(db, "a")
+> mean(value)
+[1] 0.002912563
+>
+> dbInsert(db, "b", 123)
+> dbDelete(db, "a")
+> dbList(db)
+[1] "b"
+> dbExists(db, "a")
+[1] FALSE
+>
+> file.remove("mydb")
+[1] TRUE
+>
+> ################################################################################
+> ## Check queue
+>
+> db <- createQ("testq")
+> push(db, 1)
+> push(db, 2)
+> top(db)
+[1] 1
+>
+> pop(db)
+[1] 1
+> top(db)
+[1] 2
+>
diff --git a/tests/testdb-v1.1 b/tests/testdb-v1.1
new file mode 100644
index 0000000..ebeaf0d
Binary files /dev/null and b/tests/testdb-v1.1 differ
diff --git a/tests/testdb-v2.0 b/tests/testdb-v2.0
new file mode 100644
index 0000000..ebeaf0d
Binary files /dev/null and b/tests/testdb-v2.0 differ
diff --git a/tests/versions.R b/tests/versions.R
new file mode 100644
index 0000000..7c53cc1
--- /dev/null
+++ b/tests/versions.R
@@ -0,0 +1,22 @@
+## Test databases
+
+suppressMessages(library(filehash))
+
+testdblist <- dir(pattern = glob2rx("testdb-v*"))
+
+for(testname in testdblist) {
+ msg <- sprintf("DATABASE: %s\n", testname)
+ cat(paste(rep("=", nchar(msg)), collapse = ""), "\n")
+ cat(msg)
+ cat(paste(rep("=", nchar(msg)), collapse = ""), "\n")
+ db <- dbInit(testname, "DB1")
+ keys <- dbList(db)
+ print(keys)
+
+ for(k in keys) {
+ cat("key:", k, "\n")
+ val <- dbFetch(db, k)
+ print(val)
+ cat("\n")
+ }
+}
diff --git a/tests/versions.Rout.save b/tests/versions.Rout.save
new file mode 100644
index 0000000..20219f6
--- /dev/null
+++ b/tests/versions.Rout.save
@@ -0,0 +1,114 @@
+
+R version 2.10.1 Patched (--)
+Copyright (C) The R Foundation for Statistical Computing
+ISBN 3-900051-07-0
+
+R is free software and comes with ABSOLUTELY NO WARRANTY.
+You are welcome to redistribute it under certain conditions.
+Type 'license()' or 'licence()' for distribution details.
+
+R is a collaborative project with many contributors.
+Type 'contributors()' for more information and
+'citation()' on how to cite R or R packages in publications.
+
+Type 'demo()' for some demos, 'help()' for on-line help, or
+'help.start()' for an HTML browser interface to help.
+Type 'q()' to quit R.
+
+> ## Test databases
+>
+> suppressMessages(library(filehash))
+>
+> testdblist <- dir(pattern = glob2rx("testdb-v*"))
+>
+> for(testname in testdblist) {
++ msg <- sprintf("DATABASE: %s\n", testname)
++ cat(paste(rep("=", nchar(msg)), collapse = ""), "\n")
++ cat(msg)
++ cat(paste(rep("=", nchar(msg)), collapse = ""), "\n")
++ db <- dbInit(testname, "DB1")
++ keys <- dbList(db)
++ print(keys)
++
++ for(k in keys) {
++ cat("key:", k, "\n")
++ val <- dbFetch(db, k)
++ print(val)
++ cat("\n")
++ }
++ }
+======================
+DATABASE: testdb-v1.1
+======================
+[1] "a" "c" "list" "entry"
+key: a
+ [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -0.8204684
+ [7] 0.4874291 0.7383247 0.5757814 -0.3053884
+
+key: c
+[1] 1
+
+key: list
+[[1]]
+[1] 1
+
+[[2]]
+[1] 2
+
+[[3]]
+[1] 3
+
+[[4]]
+[1] 4
+
+[[5]]
+[1] 5
+
+[[6]]
+[1] 6
+
+[[7]]
+[1] "a"
+
+
+key: entry
+[1] "string"
+
+======================
+DATABASE: testdb-v2.0
+======================
+[1] "a" "c" "list" "entry"
+key: a
+ [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -0.8204684
+ [7] 0.4874291 0.7383247 0.5757814 -0.3053884
+
+key: c
+[1] 1
+
+key: list
+[[1]]
+[1] 1
+
+[[2]]
+[1] 2
+
+[[3]]
+[1] 3
+
+[[4]]
+[1] 4
+
+[[5]]
+[1] 5
+
+[[6]]
+[1] 6
+
+[[7]]
+[1] "a"
+
+
+key: entry
+[1] "string"
+
+>
diff --git a/vignettes/combined.bib b/vignettes/combined.bib
new file mode 100644
index 0000000..b142fd3
--- /dev/null
+++ b/vignettes/combined.bib
@@ -0,0 +1,50 @@
+
+@Manual{ templelang:2002,
+ title = {RObjectTables: User-level attach()'able table support},
+ author = {Duncan {Temple Lang}},
+ year = {2002},
+ note = {{R} package version 0.3-1},
+ url = {http://www.omegahat.org/RObjectTables}
+}
+
+@Article{ rnews:ripley:2004,
+ author = {Brian D. Ripley},
+ title = {Lazy Loading and Packages in {R} 2.0.0},
+ journal = {R News},
+ year = 2004,
+ volume = 4,
+ number = 2,
+ pages = {2--4},
+ month = {September},
+ url = http,
+ pdf = rnews2004-2
+}
+
+@TechReport{ cham:1991,
+ author = {John M. Chambers},
+ title = {Data Management in {S}},
+ institution = {AT\&T Bell Laboratories Statistics Research},
+ year = {1991},
+ number = {99},
+ month = {December},
+ note = {http://stat.bell-labs.com/doc/93.15.ps}
+}
+
+@Book{ cham:1998,
+ author = {John M. Chambers},
+ title = {Programming with Data: A Guide to the {S} Language},
+ publisher = {Springer},
+ year = {1998}
+}
+
+@Article{ brahm:2002,
+ author = {David E. Brahm},
+ title = {Delayed Data Packages},
+ journal = {R News},
+ year = 2002,
+ volume = 2,
+ number = 3,
+ pages = {11--12},
+ month = {December},
+ url = {http://CRAN.R-project.org/doc/Rnews/}
+}
diff --git a/vignettes/filehash.Rnw b/vignettes/filehash.Rnw
new file mode 100644
index 0000000..c82aead
--- /dev/null
+++ b/vignettes/filehash.Rnw
@@ -0,0 +1,443 @@
+\documentclass{article}
+
+%%\VignetteIndexEntry{The filehash Package}
+%%\VignetteDepends{filehash}
+
+\usepackage{charter}
+\usepackage{courier}
+\usepackage[noae]{Sweave}
+\usepackage[margin=1in]{geometry}
+\usepackage{natbib}
+
+\title{Interacting with Data using the \textbf{filehash} Package for
+R}
+
+\author{Roger D. Peng $<$rpeng at jhsph.edu$>$\\\textit{Department of
+Biostatistics}\\\textit{Johns Hopkins Bloomberg School of Public Health}}
+
+\date{}
+
+\newcommand{\pkg}{\textbf}
+\newcommand{\code}{\texttt}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+The \pkg{filehash} package for R implements a simple key-value style
+database where character string keys are associated with data values
+that are stored on the disk. A simple interface is provided for
+inserting, retrieving, and deleting data from the database. Utilities
+are provided that allow \pkg{filehash} databases to be treated much
+as environments and lists are already treated in R. These utilities
+are provided to encourage interactive and exploratory analysis on
+large datasets. Two different file formats for representing the
+database are currently available and new formats can easily be
+incorporated by third parties for use in the \pkg{filehash} framework.
+\end{abstract}
+
+<<options,results=hide,echo=false>>=
+options(width=60)
+@
+
+\section{Overview and Motivation}
+
+Working with large datasets in R can be cumbersome because of the need
+to keep objects in physical memory. While many might generally see
+that as a feature of the system, the need to keep whole objects in
+memory creates challenges for those who might want to work
+interactively with large datasets. Here we take a simple definition
+of ``large dataset'' to be any dataset that cannot be loaded into R as
+a single R object because of memory limitations. For example, a very
+large data frame might be too large for all of the columns and rows to
+be loaded at once. In such a situation, one might load only a subset
+of the rows or columns, if that is possible.
+
+In a key-value database, an arbitrary data object (a ``value'') has a
+``key'' associated with it, usually a character string. When one
+requests the value associated with a particular key, it is the
+database's job to match up the key with the correct value and return
+the value to the requester.
+
+The most straightforward example of a key-value database in R is the
+global environment. Every object in R has a name and a value
+associated with it. When you execute at the R prompt
+<<exampleGlobalEnv,results=hide>>=
+x <- 1
+print(x)
+@
+the first line assigns the value 1 to the name/key ``x''. The second
+line requests the value of ``x'' and prints out 1 to the console. R
+handles the task of finding the appropriate value for ``x'' by
+searching through a series of environments, including the namespaces
+of the packages on the search list.
+
+In most cases, R stores the values associated with keys in memory, so
+that the value of \code{x} in the example above was stored in and
+retrieved from physical memory. However, the idea of a key-value
+database can be generalized beyond this particular configuration. For
+example, as of R 2.0.0, much of the R code for R packages is stored in
+a lazy-loaded database, where the values are initially stored on disk
+and loaded into memory on first access~\citep{Rnews:Ripley:2004}.
+Hence, when R starts up, it uses relatively little memory, while the
+memory usage increases as more objects are requested. Data could also
+be stored on other computers (e.g. websites) and retrieved over the
+network.
+
+The general S language concept of a database is described in Chapter 5
+of the Green Book~\citep{cham:1998} and earlier in~\cite{cham:1991}.
+Although the S and R languages have different semantics with respect
+to how variable names are looked up and bound to values, the general
+concept of using a key-value database applies to both languages.
+Duncan Temple Lang has implemented this general database framework for
+R in the \pkg{RObjectTables} package of
+Omegahat~\citep{TempleLang:2002}. The \pkg{RObjectTables} package
+provides an interface for connecting R with arbitrary backend systems,
+allowing data values to be stored in potentially any format or
+location. While the package itself does not include a specific
+implementation, some examples are provided on the package's website.
+
+The \pkg{filehash} package provides a full read-write implementation
+of a key-value database for R. The package does not depend on any
+external packages (beyond those provided in a standard R installation)
+or software systems and is written entirely in R, making it readily
+usable on most platforms. The \pkg{filehash} package can be thought
+of as a specific implementation of the database concept described
+in~\cite{cham:1991}, taking a slightly different approach to the
+problem. Both~\cite{TempleLang:2002} and~\cite{cham:1991} focus on
+generalizing the notion of ``attach()-ing'' a database in an R/S
+session so that variable names can be looked up automatically via the
+search list. The \pkg{filehash} package represents a database as an
+instance of an S4 class and operates directly on the S4 object via
+various methods.
+
+Key-value databases are sometimes called hash tables and indeed, the
+name of the package comes from the idea of having a ``file-based hash
+table''. With \pkg{filehash} the values are stored in a file on the
+disk rather than in memory. When a user requests the values
+associated with a key, \pkg{filehash} finds the object on the disk,
+loads the value into R and returns it to the user. The package offers
+two formats for storing data on the disk: The values can be stored (1)
+concatenated together in a single file or (2) separately as a
+directory of files.
+
+
+
+
+\section{Related R packages}
+
+There are other packages on CRAN designed specifically to help users
+work with large datasets. Two packages that come immediately to mind
+are the \pkg{g.data} package by David Brahm~\citep{brahm:2002} and the
+\pkg{biglm} package by Thomas Lumley. The \pkg{g.data} package takes
+advantage of the lazy evaluation mechanism in R via the
+\code{delayedAssign} function. Briefly, objects are loaded into R as
+promises to load the actual data associated with an object name. The
+first time an object is requested, the promise is evaluated and the
+data are loaded. From then on, the data reside in memory. The
+mechanism used in \pkg{g.data} is similar to the one used by the
+lazy-loaded databases described in~\cite{Rnews:Ripley:2004}. The
+\pkg{biglm} package allows users to fit linear models on datasets that
+are too large to fit in memory. However, the \pkg{biglm} package does
+not provide methods for dealing with large datasets in general. The
+\pkg{filehash} package also draws inspiration from Luke Tierney's
+experimental \pkg{gdbm} package which implements a key-value database
+via the GNU dbm (GDBM) library. The use of GDBM creates an external
+dependence since the GDBM C library has to be compiled on each system.
+In addition, I encountered a problem where databases created on 32-bit
+machines could not be transferred to and read on 64-bit machines (and
+vice versa). However, with the increasing use of 64-bit machines in
+the future, it seems this problem will eventually go away.
+
+The R Special Interest Group on Databases has developed a number of
+packages that provide an R interface to commonly used relational
+database management systems (RDBMS) such as MySQL (\pkg{RMySQL}),
+PostgreSQL (\pkg{RPgSQL}), and Oracle (\pkg{ROracle}). These packages
+use the S4 classes and generics defined in the \pkg{DBI} package and
+have the advantage that they offer much better database functionality,
+inherited via the use of a true database management system. However,
+this benefit comes with the cost of having to install and use
+third-party software. While installing an RDBMS may not be an
+issue---many systems have them pre-installed and the \pkg{RSQLite}
+package comes bundled with the source for the RDBMS---the need for the
+RDBMS and knowledge of structured query language (SQL) nevertheless
+adds some overhead. This overhead may serve as an impediment for
+users in need of a database for simpler applications.
+
+
+
+\section{Creating a filehash database}
+
+Databases can be created with \pkg{filehash} using the \code{dbCreate}
+function. The one required argument is the name of the database,
+which we call here ``mydb''.
+<<create>>=
+library(filehash)
+dbCreate("mydb")
+db <- dbInit("mydb")
+@
+You can also specify the \code{type} argument which controls how the
+database is represented on the backend. We will discuss the different
+backends in further detail later. For now, we use the default backend
+which is called ``DB1''.
+
+Once the database is created, it must be initialized in order to be
+accessed. The \code{dbInit} function returns an S4 object inheriting
+from class ``filehash''. Since this is a newly created database,
+there are no objects in it.
+
+\section{Accessing a filehash database}
+
+<<setseed1,results=hide,echo=false>>=
+set.seed(100)
+@
+
+The primary interface to filehash databases consists of the functions
+\code{dbFetch}, \code{dbInsert}, \code{dbExists}, \code{dbList}, and
+\code{dbDelete}. These functions are all generic---specific methods
+exist for each type of database backend. They all take as their
+first argument an object of class ``filehash''. To insert some data
+into the database we can simply call \code{dbInsert}:
+<<insert>>=
+dbInsert(db, "a", rnorm(100))
+@
+Here we have associated with the key ``a'' 100 standard normal random
+variates. We can retrieve those values with \code{dbFetch}.
+<<fetch>>=
+value <- dbFetch(db, "a")
+mean(value)
+@
+
+The function \code{dbList} lists all of the keys that are available in
+the database, \code{dbExists} tests to see if a given key is in the
+database, and \code{dbDelete} deletes a key-value pair from the
+database.
+<<delete>>=
+dbInsert(db, "b", 123)
+dbDelete(db, "a")
+dbList(db)
+dbExists(db, "a")
+@
+
+While using functions like \code{dbInsert} and \code{dbFetch} is
+straightforward, it can often be easier on the fingers to use standard
+R subset and accessor functions like \code{\$}, \code{[[}, and
+\code{[}. Filehash databases have methods for these functions so that
+objects can be accessed in a more compact manner. Similarly,
+replacement methods for these functions are also available. The
+\verb+[+ function can be used to access multiple objects from the
+database, in which case a list is returned.
+
+<<accessors>>=
+db$a <- rnorm(100, 1)
+mean(db$a)
+mean(db[["a"]])
+db$b <- rnorm(100, 2)
+dbList(db)
+@
+For all of the accessor functions, only character indices are allowed.
+Numeric indices are caught and an error is given.
+<<characteronly>>=
+e <- local({
+ err <- function(e) e
+ tryCatch(db[[1]], error = err)
+})
+conditionMessage(e)
+@
+Finally, there is a method for the \code{with} generic function which
+operates much like using \code{with} on lists or environments.
+
+The following three statements all return the same value.
+<<with>>=
+with(db, c(a = mean(a), b = mean(b)))
+@
+When using \code{with}, the values of ``a'' and ``b'' are looked up in
+the database.
+<<sapply>>=
+sapply(db[c("a", "b")], mean)
+@
+Here, using \code{[} on \code{db} returns a list with the values
+associated with ``a'' and ``b''. Then \code{sapply} is applied in the
+usual way on the returned list.
+<<lapply>>=
+unlist(lapply(db, mean))
+@
+In the last statement we call \code{lapply} directly on the
+``filehash'' object. The \pkg{filehash} package defines a method for
+\code{lapply} that allows the user to apply a function on all the
+elements of a database directly. The method essentially loops through
+all the keys in the database, loads each object separately and applies
+the supplied function to each object. \code{lapply} returns a named
+list with each element being the result of applying the supplied
+function to an object in the database. There is an argument
+\code{keep.names} to the \code{lapply} method which, if set to
+\code{FALSE}, will drop all the names from the list.
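+
+As a brief illustration of the \code{keep.names} argument described
+above, the following call applies \code{mean} to every element of
+\code{db} and checks that the returned list carries no names. The
+variable name \code{res} is chosen only for this sketch.
+<<lapplyNoNames>>=
+res <- lapply(db, mean, keep.names = FALSE)
+is.null(names(res))
+@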
+
+<<cleanupMyDB,results=hide,echo=false>>=
+dbUnlink(db)
+rm(list = ls(all = TRUE))
+@
+
+\section{Loading filehash databases}
+
+<<setseed2,results=hide,echo=false>>=
+set.seed(200)
+@
+
+An alternative way of working with a filehash database is to load it
+into an environment and access the element names directly, without
+having to use any of the accessor functions. The \pkg{filehash}
+function \code{dbLoad} works much like the standard R \code{load}
+function except that \code{dbLoad} loads active bindings into a given
+environment rather than the actual data. The active bindings are
+created via the \code{makeActiveBinding} function in the \pkg{base}
+package. \code{dbLoad} takes a filehash database and creates symbols
+in an environment corresponding to the keys in the database. It then
+calls \code{makeActiveBinding} to associate with each key a function
+which loads the data associated with a given key. Conceptually,
+active bindings are like pointers to the database. After calling
+\code{dbLoad}, anytime an object with an active binding is accessed
+the associated function (installed by \code{makeActiveBinding}) loads
+the data from the database.
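+
+The idea can be sketched in base R without \pkg{filehash}. In the
+chunk below, the list \code{store} is only a hypothetical stand-in for
+the database on disk and \code{env} is a scratch environment; neither
+is part of the package. Everything is wrapped in \code{local} so that
+the workspace is left untouched.
+<<activeBindingSketch>>=
+local({
+    store <- list(x = 1:3)   # stand-in for on-disk storage
+    env <- new.env()
+    ## The binding function reads when called without an argument
+    ## and writes when called with a value.
+    makeActiveBinding("x", function(value) {
+        if (missing(value)) {
+            store[["x"]]
+        } else {
+            store[["x"]] <<- value
+        }
+    }, env)
+    env$x <- 10:12           # routed through the function to 'store'
+    identical(store$x, env$x)
+})
+@
+Every read of \code{x} in \code{env} calls the function again, which
+is why nothing needs to be cached in the environment itself.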
+
+We can create a simple database to demonstrate the active binding
+mechanism.
+<<testDB>>=
+dbCreate("testDB")
+db <- dbInit("testDB")
+db$x <- rnorm(100)
+db$y <- runif(100)
+db$a <- letters
+dbLoad(db)
+ls()
+@
+Notice that we appear to have some additional objects in our
+workspace. However, the values of these objects are not stored in
+memory---they are stored in the database. When one of the objects is
+accessed, the value is automatically loaded from the database.
+<<accessbinding>>=
+mean(y)
+sort(a)
+@
+If I assign a different value to one of these objects, its
+associated value is updated in the database via the active binding
+mechanism.
+<<assignvalue>>=
+y <- rnorm(100, 2)
+mean(y)
+@
+If I subsequently clear the workspace and reload the database later, the
+updated value for ``y'' persists.
+<<removeandload>>=
+rm(list = ls())
+db <- dbInit("testDB")
+dbLoad(db)
+ls()
+mean(y)
+@
+
+Perhaps one disadvantage of the active binding approach taken here is
+that whenever an object is accessed, the data must be reloaded into R.
+This behavior is distinctly different from the delayed assignment
+approach taken in \pkg{g.data} where an object need only be loaded
+once and then is subsequently in memory. However, when using delayed
+assignments, if one cycles through all of the objects in the database,
+one could eventually exhaust the available memory.
+
+<<cleanupTestDB,results=hide,echo=false>>=
+dbUnlink(db)
+rm(list = ls(all = TRUE))
+@
+
+\section{Other filehash utilities}
+
+There are a few other utilities included with the \pkg{filehash}
+package. Two of the utilities, \code{dumpObjects} and
+\code{dumpImage}, are analogues of \code{save} and \code{save.image}.
+Rather than save objects to an R workspace, \code{dumpObjects} saves
+the given objects to a ``filehash'' database so that in the future,
+individual objects can be reloaded if desired. Similarly,
+\code{dumpImage} saves the entire workspace to a ``filehash''
+database.
+
+The function \code{dumpList} takes a list and creates a ``filehash''
+database with values from the list. The list must have a non-empty
+name for every element in order for \code{dumpList} to succeed.
+\code{dumpDF} creates a ``filehash'' database from a data frame where
+each column of the data frame is an element in the database.
+Essentially, \code{dumpDF} converts the data frame to a list and calls
+\code{dumpList}.
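+
+A minimal sketch of \code{dumpList} follows. The list contents and the
+database name ``listDB'' are chosen only for illustration, and the
+database is removed again at the end of the chunk.
+<<dumpListSketch>>=
+dumpList(list(x = 1:10, y = letters), dbName = "listDB")
+db <- dbInit("listDB")
+keys <- sort(dbList(db))
+keys
+dbUnlink(db)
+@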
+
+
+\section{Filehash database backends}
+
+Currently, the \pkg{filehash} package can represent databases in two
+different formats. The default format is called ``DB1'' and it stores
+the keys and values in a single file. From experience, this format
+works well overall but can be a little slow to initialize when there
+are many thousands of keys. Briefly, the ``filehash'' object in R
+stores a map which associates keys with a byte location in the
+database file where the corresponding value is stored. Given the byte
+location, we can \code{seek} to that location in the file and read the
+data directly. Before reading in the data, a check is made to make
+sure that the map is up to date. This format depends critically on
+having a working \code{ftell} at the system level and a crude check is
+made when trying to initialize a database of this format.
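+
+The key-to-offset idea can be sketched in base R. This is only an
+illustration of the principle and not the actual ``DB1'' file layout;
+the names \code{vals}, \code{path}, and \code{map} are hypothetical.
+<<offsetSketch>>=
+vals <- list(a = 1:5, b = letters)
+path <- tempfile()
+map <- list()
+con <- file(path, "wb")
+offset <- 0
+for (key in names(vals)) {
+    bytes <- serialize(vals[[key]], NULL)
+    ## Record where this value starts and how many bytes it occupies
+    map[[key]] <- c(offset = offset, len = length(bytes))
+    writeBin(bytes, con)
+    offset <- offset + length(bytes)
+}
+close(con)
+## Fetch "b" without reading "a": seek to its offset, read, unserialize
+con <- file(path, "rb")
+invisible(seek(con, map$b[["offset"]]))
+b <- unserialize(readBin(con, "raw", map$b[["len"]]))
+close(con)
+identical(b, letters)
+unlink(path)
+@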
+
+The second format is called ``RDS'' and it stores objects as separate
+files on the disk in a directory with the same name as the database.
+This format is the most straightforward and simple of the available
+formats. When a request is made for a specific key, \pkg{filehash}
+finds the appropriate file in the directory and reads the file into R.
+The only catch is that on operating systems that use case-insensitive
+file names, objects whose names differ only in case will collide on
+the filesystem. To work around this, object names with capital letters
+are stored with mangled names on the disk. An advantage of this
+format is that most of the organizational work is delegated to the
+filesystem.
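+
+As a small illustration, the chunk below creates a database in the
+``RDS'' format via the \code{type} argument and lists the files in the
+resulting directory. The database name ``rdsdemo'', the keys, and the
+variable names are hypothetical, and the database is removed at the
+end of the chunk.
+<<rdsSketch>>=
+dbCreate("rdsdemo", type = "RDS")
+rdb <- dbInit("rdsdemo", type = "RDS")
+dbInsert(rdb, "alpha", 1:10)
+dbInsert(rdb, "Beta", letters)
+rdsKeys <- dbList(rdb)
+list.files("rdsdemo")
+dbUnlink(rdb)
+@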
+
+
+\section{Extending filehash}
+
+The \pkg{filehash} package has a mechanism for developing new backend
+formats, should the need arise. The function \code{registerFormatDB}
+can be used to make \pkg{filehash} aware of a new database format that
+may be implemented in a separate R package or a file.
+\code{registerFormatDB} takes two arguments: a \code{name} for the new
+format (like ``DB1'' or ``RDS'') and a list of functions. The list
+should contain two functions: one function named ``create'' for
+creating a database, given the database name, and another function
+named ``initialize'' for initializing the database. In addition, one
+needs to define methods for \code{dbInsert}, \code{dbFetch}, etc.
+
+A list of available backend formats can be obtained via the
+\code{filehashFormats} function. Upon registering a new backend
+format, the new format will be listed when \code{filehashFormats} is
+called.
+
+The interface for registering new backend formats is still
+experimental and could change in the future.
+
+
+\section{Discussion}
+
+The \pkg{filehash} package has been designed to be useful in both a
+programming setting and an interactive setting. Its main purpose is
+to allow for simpler interaction with large datasets where
+simultaneous access to the full dataset is not needed. While the
+package may not be optimal for all settings, one goal was to write a
+simple package in pure R that users could install with minimal
+overhead. In the future I hope to add functionality for interacting
+with databases stored on remote computers and perhaps incorporate a
+``real'' database backend. Some work has already begun on developing
+a backend based on the \pkg{RSQLite} package.
+
+
+
+\bibliographystyle{alpha}
+\bibliography{combined}
+
+
+\end{document}
+
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-cran-filehash.git