[R-pkg-team] Bug#975252: ITP: r-cran-ff -- Memory-Efficient Fast-Access Storage of Large Data

Steffen Moeller moeller at debian.org
Thu Nov 19 16:59:33 GMT 2020


Package: wnpp
Severity: wishlist

Subject: ITP: r-cran-ff -- Memory-Efficient Fast-Access Storage of Large Data
Owner: Steffen Moeller <moeller at debian.org>

* Package name    : r-cran-ff
  Version         : 4.0.4
  Upstream Author : Daniel Adler
* URL             : https://cran.r-project.org/package=ff
* License         : GPL-2+
  Programming Lang: GNU R
  Description     : Memory-Efficient Fast-Access Storage of Large Data
 The ff package provides data structures that are stored on
 disk but behave (almost) as if they were in RAM by transparently
 mapping only a section (pagesize) into main memory; that section is
 the effective virtual memory consumption per ff object. ff supports
 R's standard
 atomic data types 'double', 'logical', 'raw' and 'integer' and
 non-standard atomic types boolean (1 bit), quad (2 bit unsigned),
 nibble (4 bit unsigned), byte (1 byte signed with NAs), ubyte (1 byte
 unsigned), short (2 byte signed with NAs), ushort (2 byte unsigned),
 single (4 byte float with NAs). For example 'quad' allows efficient
 storage of genomic data as an 'A','T','G','C' factor. The unsigned
 types support 'circular' arithmetic. There is also support for
 close-to-atomic types 'factor', 'ordered', 'POSIXct', 'Date' and
 custom close-to-atomic types.
 .
 ff has native C support for vectors, matrices and arrays with
 flexible dimorder (column-major order, row-major order and
 generalizations for arrays). There is also an ffdf class not unlike
 data.frames, and import/export filters for csv files.
 ff objects store raw data in binary flat files in native encoding,
 and complement this with metadata stored in R as physical and virtual
 attributes. ff objects have well-defined hybrid copying semantics,
 which gives rise to certain performance improvements through
 virtualization. ff objects can be stored and reopened across R
 sessions. ff files can be shared by multiple ff R objects
 (using different data en/de-coding schemes) in the same process
 or from multiple R processes to exploit parallelism. A wide choice of
 finalizer options allows working with 'permanent' files as well as
 creating and removing 'temporary' ff files completely transparently
 to the user. On certain OS/filesystem combinations, creating the ff files
 works without notable delay thanks to using sparse file allocation.
 Several access optimization techniques such as Hybrid Index
 Preprocessing and Virtualization are implemented to achieve good
 performance even with large datasets, for example virtual matrix
 transpose without touching a single byte on disk. Further, to reduce
 disk I/O, 'logicals' and non-standard data types are stored natively
 and compactly in binary flat files, i.e. logicals take up exactly
 2 bits to represent TRUE, FALSE and NA.
 .
 Beyond basic access functions, the ff package also provides
 compatibility functions that facilitate writing code for ff and ram
 objects and support for batch processing on ff objects (e.g. as.ram,
 as.ff, ffapply). ff interfaces closely with functionality from package
 'bit': chunked looping, fast bit operations and coercions between
 different objects that can store subscript information ('bit',
 'bitwhich', ff 'boolean', ri range index, hi hybrid index). This allows
 working interactively with selections of large datasets and quickly
 modifying selection criteria.
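
To make the 2-bit storage claims in the description concrete (logicals as
TRUE/FALSE/NA, 'quad' for an 'A','T','G','C' factor), here is a minimal
sketch in Python. It only illustrates the general packing technique: the
function names, the code assignments and the bit order within a byte are
assumptions made for this example, and nothing here reflects ff's actual
C implementation or its R API.

```python
# Illustrative sketch only: generic 2-bit packing, four values per byte.
# Names, code assignments and bit order are hypothetical, not taken from ff.

def pack2(codes):
    """Pack integers in 0..3 into bytes, four 2-bit codes per byte."""
    out = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("code does not fit in 2 bits")
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


def unpack2(buf, n):
    """Recover the first n 2-bit codes from a packed buffer."""
    return [(buf[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


# Three-valued logicals: None stands in for NA (assumed encoding).
LOGICAL = {False: 0, True: 1, None: 2}
flags = [True, False, None, True, True]
packed_flags = pack2([LOGICAL[v] for v in flags])

# A four-level factor such as DNA bases fits the same 2 bits per element.
BASES = "ATGC"
dna = "GATTACA"
packed_dna = pack2([BASES.index(b) for b in dna])
decoded = "".join(BASES[c] for c in unpack2(packed_dna, len(dna)))
```

In this sketch five logicals and seven bases each occupy two bytes, where
one byte per element would need five and seven.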

Remark: This package is maintained by Debian R Packages Maintainers at
   https://salsa.debian.org/r-pkg-team/r-cran-ff


