[med-svn] [r-cran-data.table] 02/07: New upstream version 1.10.4-2
Andreas Tille
tille at debian.org
Mon Oct 23 17:40:29 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository r-cran-data.table.
commit 23e4eb591e073ba97a4419ea9b3b180bb4444592
Author: Andreas Tille <tille at debian.org>
Date: Mon Oct 23 19:22:03 2017 +0200
New upstream version 1.10.4-2
---
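Note on the NEWS.md hunk further down: the v1.10.4 entry says the `nanotime` writer in `fwrite()` type punned with `*(long long *)&REAL(column)[i]`, which is undefined behaviour, and was "replaced with the union method". The sketch below only illustrates that general technique and is not taken from src/fwrite.c. The function names `bits_via_union` and `bits_via_memcpy` are hypothetical, and the code assumes `double` and `int64_t` are both 8 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: both types occupy 8 bytes. */
static_assert(sizeof(double) == sizeof(int64_t), "double must be 64 bits wide");

/* Hypothetical helper: reinterpret the bytes of a double as a 64-bit
 * integer through a union. C99 and C11 allow reading a union member other
 * than the one last stored; the bytes are reinterpreted in the new type.
 * Dereferencing a cast pointer such as *(int64_t *)&d instead violates the
 * aliasing rules, so an optimizing compiler may miscompile it. */
static int64_t bits_via_union(double d)
{
    union { double d; int64_t i; } u;
    u.d = d;
    return u.i;
}

/* Hypothetical helper: the same reinterpretation with memcpy, which is
 * also well defined and is typically compiled to a single move. */
static int64_t bits_via_memcpy(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);
    return i;
}
```

Both helpers return the same bit pattern for any input. Which form a code base prefers is a style choice; the NEWS entry only states that the union form was the one adopted upstream.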
DESCRIPTION | 6 +-
MD5 | 54 +-
NEWS.md | 22 +-
R/data.table.R | 182 +--
R/onAttach.R | 3 +-
R/test.data.table.R | 14 +-
R/utils.R | 5 +-
build/vignette.rds | Bin 442 -> 443 bytes
inst/doc/datatable-faq.Rmd | 2 +-
inst/doc/datatable-faq.html | 1433 ++++++++++--------
inst/doc/datatable-intro.R | 3 -
inst/doc/datatable-intro.html | 1577 +++++++++++---------
inst/doc/datatable-keys-fast-subset.html | 1235 ++++++++-------
inst/doc/datatable-reference-semantics.html | 1066 ++++++-------
inst/doc/datatable-reshape.html | 816 +++++-----
...atable-secondary-indices-and-auto-indexing.html | 859 ++++++-----
inst/tests/tests.Rraw | 9 +-
src/assign.c | 103 +-
src/data.table.h | 24 +-
src/dogroups.c | 80 +-
src/fread.c | 70 +-
src/fwrite.c | 79 +-
src/init.c | 40 +-
src/openmp-utils.c | 17 +-
src/rbindlist.c | 96 +-
src/uniqlist.c | 2 -
src/wrappers.c | 33 +-
vignettes/datatable-faq.Rmd | 2 +-
28 files changed, 4314 insertions(+), 3518 deletions(-)
diff --git a/DESCRIPTION b/DESCRIPTION
index 94949f2..412cd8b 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,5 +1,5 @@
Package: data.table
-Version: 1.10.4
+Version: 1.10.4-2
Title: Extension of `data.frame`
Authors@R: c(
person("Matt","Dowle", role=c("aut","cre"), email="mattjdowle at gmail.com"),
@@ -22,7 +22,7 @@ MailingList: datatable-help at lists.r-forge.r-project.org
VignetteBuilder: knitr
ByteCompile: TRUE
NeedsCompilation: yes
-Packaged: 2017-02-01 00:15:31.074 UTC; mdowle
+Packaged: 2017-10-12 04:08:36.856 UTC; mdowle
Author: Matt Dowle [aut, cre],
Arun Srinivasan [aut],
Jan Gorecki [ctb],
@@ -31,4 +31,4 @@ Author: Matt Dowle [aut, cre],
Eduard Antonyan [ctb]
Maintainer: Matt Dowle <mattjdowle at gmail.com>
Repository: CRAN
-Date/Publication: 2017-02-01 15:52:19
+Date/Publication: 2017-10-12 14:03:42 UTC
diff --git a/MD5 b/MD5
index d1decf1..3c4dd3c 100644
--- a/MD5
+++ b/MD5
@@ -1,7 +1,7 @@
-ae468059bb64612611a8899452d6acfb *DESCRIPTION
+fac6375e9b7bd47a4b07d14eb27600ae *DESCRIPTION
d32239bcb673463ab874e80d47fae504 *LICENSE
5feca43f299bbbd001563b1033a59705 *NAMESPACE
-3dc8bdf9e48d766984dd260692d5408f *NEWS.md
+25a37cfa18cb1fffe50e5f11dec46d5e *NEWS.md
4377e4b917c44366f66862324c8fd32a *R/AllS4.R
3d35eb16da4271f72a9118200db76d65 *R/IDateTime.R
55b63a5d831071ec745ce7bd4ab4a336 *R/as.data.table.R
@@ -9,7 +9,7 @@ ad10acb3d99aab6ccf9dc37bb836d817 *R/between.R
9052f694cdd5de844433349a91f3bc7a *R/bmerge.R
0041b7a70824ed486983cf52ff902aaa *R/c.factor.R
b28ccef4d6291066b742fd3a3a5f5d39 *R/cedta.R
-f370c168081fb7efddc89d88c5afb872 *R/data.table.R
+84f3c5d5f661addc5156f96044d49108 *R/data.table.R
4e672da0bdc96542f0713ba1da6bde69 *R/duplicated.R
8fdd31e066db641fe2da803bd2e20268 *R/fcast.R
f758580ab081803256534bd9ee22e250 *R/fmelt.R
@@ -21,39 +21,39 @@ cd2b459de37c217870b1ada4d197d1c4 *R/fwrite.R
9e49976bdb1d7b41a755a588fd9eb05b *R/last.R
a9a8ec624e5afd3067d8ea9cc33482f6 *R/like.R
8cace748741791eac2d81501c02a1488 *R/merge.R
-9a127e9279e88bebc231dd038a62f537 *R/onAttach.R
+05ff4c9a78ae75c9904346e0310b44f6 *R/onAttach.R
0354731d8708c60ba8bcc167c6428829 *R/onLoad.R
a00ff6edb3599fcf2801acfc12c4a7e4 *R/openmp-utils.R
4cbd6f0cf26f1a7ecc7b2bd2fb5df8bb *R/setkey.R
b8d3020884ea456d3eeeecf5b06323ef *R/setops.R
b9a144d533820eb7529e86a9b794e661 *R/shift.R
1432e4ed7a2a3d6be052527b9005cd8d *R/tables.R
-f8b4fc4d8738808c6a8540e290324a83 *R/test.data.table.R
+aeb30c8c8e41c2e3c680f9fbc3eff6c6 *R/test.data.table.R
20aa38be566f1dc0f21a8937944481c5 *R/timetaken.R
5ced66446db69ebb73eba28399336922 *R/transpose.R
085cbece46eb9e4689189ac6cda84457 *R/uniqlist.R
-5d0affa6ab449271b8d6d076526ac75f *R/utils.R
+00f125cc15182fc45a0462203062b293 *R/utils.R
8ff82535188f42a368314fddf610483a *R/xts.R
399a1d96153aa5721e8544fa53084c9b *README.md
-52d0eaab5376b90d6f2e7e20e90e0e2a *build/vignette.rds
+188d652dbf24c7603d31ef722a8c836a *build/vignette.rds
d20a5d50c2a2fae35659da9de05acea3 *inst/doc/datatable-faq.R
-0b13b6d61aa41e7908a871653f63c755 *inst/doc/datatable-faq.Rmd
-9092d0c955a3b47273c5b883f856a90b *inst/doc/datatable-faq.html
-ea6aff9f9dc4f5f26a8cee86ba0e8393 *inst/doc/datatable-intro.R
+40824ef7a5e4b01cfeabb26a382fa289 *inst/doc/datatable-faq.Rmd
+ac84983e3ef23aec8afed8a0a1d58282 *inst/doc/datatable-faq.html
+a6cee59288fadeb28014a1ecbd66c326 *inst/doc/datatable-intro.R
e9fff1a46fdf96e3572b583bc89e8f86 *inst/doc/datatable-intro.Rmd
-2cc922bca1adcdce0f112d200c9e6e4a *inst/doc/datatable-intro.html
+627a040520c6a285aa2f7f99654c79bf *inst/doc/datatable-intro.html
c2f4d1dc6234576bf0c1f071325d5b1d *inst/doc/datatable-keys-fast-subset.R
3f2980389baaff06c2d6b401b26d71bf *inst/doc/datatable-keys-fast-subset.Rmd
-f77edced904971a84d23a9593746856b *inst/doc/datatable-keys-fast-subset.html
+0ec41e717cbfea0881f2c90841f6f808 *inst/doc/datatable-keys-fast-subset.html
723df81331669d44c4cab1f541a3d956 *inst/doc/datatable-reference-semantics.R
531acab6260b82f65ab9048aee6fb331 *inst/doc/datatable-reference-semantics.Rmd
-1d2f06ba982c95bc983b51164a72794c *inst/doc/datatable-reference-semantics.html
+c5cebe2a0d8d2bc31d5f5ec1bdc2b4ee *inst/doc/datatable-reference-semantics.html
7149288c1c45ff4e6dc0c89b71f81f72 *inst/doc/datatable-reshape.R
e8ef65c1d8424e390059b854cb18740e *inst/doc/datatable-reshape.Rmd
-bfca7571d9aba8825edf6565e02fadcf *inst/doc/datatable-reshape.html
+6a3b8dbf14a5a7b78aec35525dbdfee6 *inst/doc/datatable-reshape.html
22265ade65535db347b44213d4354772 *inst/doc/datatable-secondary-indices-and-auto-indexing.R
bcdc8c1716a1e3aa1ef831bad0d67715 *inst/doc/datatable-secondary-indices-and-auto-indexing.Rmd
-99d25b646960c7b3cf9d6799c9c7d7e0 *inst/doc/datatable-secondary-indices-and-auto-indexing.html
+9ce3beb36cba7e2c2dc5251f157f05c1 *inst/doc/datatable-secondary-indices-and-auto-indexing.html
e48efd4babf364e97ff98e56b1980c8b *inst/tests/1206FUT.txt
28b57d31f67353c1192c6f65d69a12b1 *inst/tests/1680-fread-header-encoding.csv
fe198c1178f7db508ee0b10a94272e7e *inst/tests/2008head.csv
@@ -84,7 +84,7 @@ d5f1a4914ee94df080c787a3dcc530e3 *inst/tests/issue_785_fread.txt
0738c8cabf507aecd5b004fbbc45c2b4 *inst/tests/quoted_multiline.csv
278198d16e23ea87b8be7822d0af26e3 *inst/tests/russellCRCRLF.csv
2d8e8b64b59f727ae5bf5c527948be6a *inst/tests/russellCRLF.csv
-aa9ca3a13c9a76ee9eb18ad7afa00600 *inst/tests/tests.Rraw
+c145c5065401223f5ba07a63ed975700 *inst/tests/tests.Rraw
9f0dd826cb0518ead716be2e932d4def *man/IDateTime.Rd
9cbcd078347c5a08402df1ba73aa44b5 *man/J.Rd
835019f8d7596f0e2f6482b725b10cbf *man/address.Rd
@@ -139,35 +139,35 @@ a2a4eb530691106fb68f4851182b66fb *man/transpose.Rd
4d71404f8c7d5b3a9aa93195c22e8f97 *man/truelength.Rd
142dc7a616c9c546ffb9f0bea81cc2d7 *man/tstrsplit.Rd
7140c94fe69f182d6c111a8d40795e4f *src/Makevars
-52bc93d41bad403ca8fcc4f7878d4934 *src/assign.c
+359027a5655626317b2a5e36d25c0c4b *src/assign.c
059a1ee027698e97d5757cdc6493ba2d *src/between.c
12239da57bc1754b2345486e48fe1d2b *src/bmerge.c
771f833144f4bcd530d36944c9ece29f *src/chmatch.c
-8977546957face8344131b9eaf04da32 *src/data.table.h
-530e5eb6ef7580c12385c9a964f5e7de *src/dogroups.c
+953f4ff30171bb62e65cda435d90fb47 *src/data.table.h
+d89a571cb1ac3548042db13a43fab6bc *src/dogroups.c
e3d5b289ef4b511e4f89c9838b86cbd0 *src/fastmean.c
87d693c023ca8935be4f1f4aa8aa979e *src/fcast.c
15f33d9fd7a31c7f288019c0d667f003 *src/fmelt.c
5682cc0d46990d81a83ed0d540b31acf *src/forder.c
4cace1380fb006a5c786e9a7f4d12937 *src/frank.c
-dae009c0fbb1f7f1bb028d756417f406 *src/fread.c
+75947440a219645508e4f4047f5d42c9 *src/fread.c
8cc10403f23c6e018f26b0220b509a86 *src/fsort.c
-9d057220f1278fd15061b81dbb684d2e *src/fwrite.c
+23201456c40a379b5bdb9ae51104f9cd *src/fwrite.c
237a455e5212cf3358ab5aaca12fbd9c *src/fwriteLookups.h
e7aae63b27c01a5acce45023ff436b69 *src/gsumm.c
47792eafb3cee1c03bbcb972d00c4aad *src/ijoin.c
-6d60952300a7cb70ad844aa06f3d2edd *src/init.c
+bef407b3b627c2e8704e3a9cc6456de8 *src/init.c
520938944d8dbd58460bcf4ca44e9479 *src/inrange.c
-a1237adc2f1ced2e4e9e3724a9434211 *src/openmp-utils.c
+85ad32d99ab86c022521b275f5dc86de *src/openmp-utils.c
ab561ed83137b5b2d78d5d06030f7446 *src/quickselect.c
-7081147bbf13dd8d85c29a58a4a66e55 *src/rbindlist.c
+5fafe7c34074d2eaf6549debadaa37df *src/rbindlist.c
416562e57a9368398d026ec1edc96313 *src/reorder.c
43566a73264aab49b4f4fb9ffcf77c0b *src/shift.c
eb145cdab68c4ea5d0285ffa8dcb7bb4 *src/subset.c
53304fe0233e11f15786cdbddf6c89f8 *src/transpose.c
-cfe85d1626ba795d8edc0f098c1b7f12 *src/uniqlist.c
+0eb934d739e08dc707b4d335cf74e438 *src/uniqlist.c
75a359ce5e8c6927d46dd9b2aa169da1 *src/vecseq.c
-8fdbfefae387548d671c637eb28694bb *src/wrappers.c
+6835bb448de657ce49779d908b7839e7 *src/wrappers.c
441a393fe285e88a86e1126af8d6d7d8 *tests/autoprint.R
1ad409241d679d234e1a56ef06507e64 *tests/autoprint.Rout.save
5b9fc0d7c7ea64a9b1f60f9eba66327e *tests/knitr.R
@@ -180,7 +180,7 @@ d4e7268543efee032a82d9bde312b34c *tests/main.R
19d2c1a56e20560c13418539c56fbe9a *tests/testthat/test-data.frame-like.R
4304c919f6e28ea84d1ca05708ccaae8 *vignettes/Makefile
5dcf8be4b810d38fc5d4d0817167b079 *vignettes/css/bootstrap.css
-0b13b6d61aa41e7908a871653f63c755 *vignettes/datatable-faq.Rmd
+40824ef7a5e4b01cfeabb26a382fa289 *vignettes/datatable-faq.Rmd
e9fff1a46fdf96e3572b583bc89e8f86 *vignettes/datatable-intro.Rmd
3f2980389baaff06c2d6b401b26d71bf *vignettes/datatable-keys-fast-subset.Rmd
531acab6260b82f65ab9048aee6fb331 *vignettes/datatable-reference-semantics.Rmd
diff --git a/NEWS.md b/NEWS.md
index 9291dd2..89e746f 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,9 +1,29 @@
+### Changes in v1.10.4-2 (on CRAN 12 Oct 2017)
+
+1. OpenMP on MacOS is now supported by CRAN and included in CRAN's package binaries for Mac. But installing v1.10.4-1 from source on MacOS failed when OpenMP was not enabled at compile time, [#2409](https://github.com/Rdatatable/data.table/issues/2409). Thanks to Liz Macfie and @fupangpangpang for reporting. The startup message when OpenMP is not enabled has been updated.
+
+2. Two rare potential memory faults fixed, thanks to CRAN's automated use of latest compiler tools; e.g. clang-5 and gcc-7
+
+
+### Changes in v1.10.4-1 (on CRAN 09 Oct 2017)
+
+1. The `nanotime` v0.2.0 update on CRAN 22 June 2017 changed from `integer64` to `S4` and broke `fwrite` of `nanotime` columns. Fixed to work with `nanotime` both before and after v0.2.0.
+
+2. Pass R-devel changes related to `deparse(,backtick=)` and `factor()`.
+
+3. Internal `NAMED()==2` now `MAYBE_SHARED()` instead, [#2330](https://github.com/Rdatatable/data.table/issues/2330). Back-ported to pass under the stated dependency, R 3.0.0.
+
+4. Attempted improvement on Mac with Intel's OpenMP when package 'parallel' forks after `data.table` has performed in parallel with OpenMP. That OpenMP implementation appears to leave threads running after the parallel region has finished. If this fix still doesn't work, call `setDTthreads(1)` immediately after `library(data.table)` which has been reported to fix the problem.
+
+5. When `fread()` and `print()` see `integer64` columns are present but package `bit64` is not installed, the warning is now displayed as intended. Thanks to a question by Santosh on r-help and forwarded by Bill Dunlap.
+
+
### Changes in v1.10.4 (on CRAN 01 Feb 2017)
#### BUG FIXES
-1. The new specialized `nanotime` writer in `fwrite()` type punned using `*(long long *)&REAL(column)[i]` which, strictly, is undefined behavour under C standards. It passed a plethora of tests on linux (gcc 5.4 and clang 3.8), win-builder and 6 out 10 CRAN flavours using gcc. But failed (wrong data written) with the newest version of clang (3.9.1) as used by CRAN on the failing flavors, and solaris-sparc. Replaced with the union method and added a grep to CRAN_Release.cmd.
+1. The new specialized `nanotime` writer in `fwrite()` type punned using `*(long long *)&REAL(column)[i]` which, strictly, is undefined behaviour under C standards. It passed a plethora of tests on linux (gcc 5.4 and clang 3.8), win-builder and 6 out 10 CRAN flavours using gcc. But failed (wrong data written) with the newest version of clang (3.9.1) as used by CRAN on the failing flavors, and solaris-sparc. Replaced with the union method and added a grep to CRAN_Release.cmd.
### Changes in v1.10.2 (on CRAN 31 Jan 2017)
diff --git a/R/data.table.R b/R/data.table.R
index 1effe44..f1b0cbf 100644
--- a/R/data.table.R
+++ b/R/data.table.R
@@ -1,5 +1,5 @@
-dim.data.table <- function(x)
+dim.data.table <- function(x)
{
.Call(Cdim, x)
}
@@ -26,14 +26,14 @@ shouldPrint = function(x) {
ret
}
-print.data.table <- function(x, topn=getOption("datatable.print.topn"),
- nrows=getOption("datatable.print.nrows"),
- class=getOption("datatable.print.class"),
- row.names=getOption("datatable.print.rownames"),
+print.data.table <- function(x, topn=getOption("datatable.print.topn"),
+ nrows=getOption("datatable.print.nrows"),
+ class=getOption("datatable.print.class"),
+ row.names=getOption("datatable.print.rownames"),
quote=FALSE, ...) { # topn - print the top topn and bottom topn rows with '---' inbetween (5)
# nrows - under this the whole (small) table is printed, unless topn is provided (100)
# class - should column class be printed underneath column name? (FALSE)
- if (!shouldPrint(x)) {
+ if (!shouldPrint(x)) {
# := in [.data.table sets .global$print=address(x) to suppress the next print i.e., like <- does. See FAQ 2.22 and README item in v1.9.5
# The issue is distinguishing "> DT" (after a previous := in a function) from "> DT[,foo:=1]". To print.data.table(), there
# is no difference. Now from R 3.2.0 a side effect of the very welcome and requested change to avoid silent deep copy is that
@@ -43,7 +43,7 @@ print.data.table <- function(x, topn=getOption("datatable.print.topn"),
# topenv(), inspecting next statement in caller, using clock() at C level to timeout suppression after some number of cycles
SYS <- sys.calls()
if (length(SYS) <= 2 || # "> DT" auto-print or "> print(DT)" explicit print (cannot distinguish from R 3.2.0 but that's ok)
- ( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
+ ( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
as.character(thisSYS) %chin% mimicsAutoPrint ) ) {
return(invisible())
# is.symbol() temp fix for #1758.
@@ -162,14 +162,14 @@ data.table <-function(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, str
# TO DO: rewrite data.table(), one of the oldest functions here. Many people use data.table() to convert data.frame rather than
# as.data.table which is faster; speed could be better. Revisit how many copies are taken in for example data.table(DT1,DT2) which
# cbind directs to. And the nested loops for recycling lend themselves to being C level.
-
+
x <- list(...) # doesn't copy named inputs as from R >= 3.1.0 (a very welcome change)
if (!.R.listCopiesNamed) .Call(CcopyNamedInList,x) # to maintain the old behaviour going forwards, for now. See test 548.2.
# **TO DO** Something strange with NAMED on components of `...`. To investigate. Or just port data.table() to C. This is why
# it's switched, because extra copies would be introduced in R <= 3.1.0, iiuc.
-
+
# fix for #5377 - data.table(null list, data.frame and data.table) should return null data.table. Simple fix: check all scenarios here at the top.
- if (identical(x, list(NULL)) || identical(x, list(list())) ||
+ if (identical(x, list(NULL)) || identical(x, list(list())) ||
identical(x, list(data.frame(NULL))) || identical(x, list(data.table(NULL)))) return( null.data.table() )
tt <- as.list(substitute(list(...)))[-1L] # Intention here is that data.table(X,Y) will automatically put X and Y as the column names. For longer expressions, name the arguments to data.table(). But in a call to [.data.table, wrap in list() e.g. DT[,list(a=mean(v),b=foobarzoo(zang))] will get the col names
vnames = names(tt)
@@ -181,7 +181,7 @@ data.table <-function(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, str
}
for (i in which(novname)) {
# if (ncol(as.data.table(x[[i]])) <= 1) { # cbind call in test 230 fails if I write ncol(as.data.table(eval(tt[[i]], parent.frame()))) <= 1, no idea why... (keep this for later even though all tests pass with ncol(.).. because base uses as.data.frame(.))
- if (is.null(ncol(x[[i]]))) {
+ if (is.null(ncol(x[[i]]))) {
if ((tmp <- deparse(tt[[i]])[1]) == make.names(tmp))
vnames[i] <- tmp
}
@@ -305,7 +305,7 @@ data.table <-function(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, str
replace_dot_alias <- function(e) {
# we don't just simply alias .=list because i) list is a primitive (faster to iterate) and ii) we test for use
- # of "list" in several places so it saves having to remember to write "." || "list" in those places
+ # of "list" in several places so it saves having to remember to write "." || "list" in those places
if (is.call(e)) {
if (e[[1L]] == ".") e[[1L]] = quote(list)
for (i in seq_along(e)[-1]) if (!is.null(e[[i]])) e[[i]] = replace_dot_alias(e[[i]])
@@ -314,15 +314,15 @@ replace_dot_alias <- function(e) {
}
.massagei <- function(x) {
- # J alias for list as well in i, just if the first symbol
+ # J alias for list as well in i, just if the first symbol
if (is.call(x) && as.character(x[[1L]]) %chin% c("J","."))
x[[1L]] = quote(list)
x
}
# A (relatively) fast (uses DT grouping) wrapper for matching two vectors, BUT:
-# it behaves like 'pmatch' but only the 'exact' matching part. That is, a value in
-# 'x' is matched to 'table' only once. No index will be present more than once.
+# it behaves like 'pmatch' but only the 'exact' matching part. That is, a value in
+# 'x' is matched to 'table' only once. No index will be present more than once.
# This should make it even clearer:
# chmatch2(c("a", "a"), c("a", "a")) # 1,2 - the second 'a' in 'x' has a 2nd match in 'table'
# chmatch2(c("a", "a"), c("a", "b")) # 1,NA - the second one doesn't 'see' the first 'a'
@@ -426,7 +426,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
stop("Variable '",jsubChar,"' is not found in calling scope. Looking in calling scope because either you used the .. prefix or set with=FALSE")
}
}
- if (root=="{") {
+ if (root=="{") {
if (length(jsub)==2) {
jsub = jsub[[2L]] # to allow {} wrapping of := e.g. [,{`:=`(...)},] [#376]
root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
@@ -441,7 +441,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
jsub = eval(jsub[[2L]], parent.frame(), parent.frame()) # this evals the symbol to return the dynamic expression
if (is.expression(jsub)) jsub = jsub[[1L]] # if expression, convert it to call
# Note that the dynamic expression could now be := (new in v1.9.7)
- root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
+ root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
}
if (root == ":=") {
allow.cartesian=TRUE # (see #800)
@@ -453,17 +453,17 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
}
}
-
+
# To take care of duplicate column names properly (see chmatch2 function above `[data.table`) for description
dupmatch <- function(x, y, ...) {
if (anyDuplicated(x))
pmax(chmatch(x,y, ...), chmatch2(x,y,0L))
else chmatch(x,y)
}
-
+
# setdiff removes duplicate entries, which'll create issues with duplicated names. Use '%chin% instead.
dupdiff <- function(x, y) x[!x %chin% y]
-
+
if (!missing(i)) {
xo = NULL
isub = substitute(i)
@@ -482,7 +482,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
isnull_inames = FALSE
nqgrp = integer(0) # for non-equi join
nqmaxgrp = 1L # for non-equi join
- # Fixes 4994: a case where quoted expression with a "!", ex: expr = quote(!dt1); dt[eval(expr)] requires
+ # Fixes 4994: a case where quoted expression with a "!", ex: expr = quote(!dt1); dt[eval(expr)] requires
# the "eval" to be checked before `as.name("!")`. Therefore interchanged.
restore.N = remove.N = FALSE
if (exists(".N", envir=parent.frame(), inherits=FALSE)) {
@@ -541,10 +541,10 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (isub[[1L]] == "==" && length(RHS)>1) {
if (length(RHS)!=nrow(x)) stop("RHS of == is length ",length(RHS)," which is not 1 or nrow (",nrow(x),"). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %in% instead.")
i = x[[isub2]] == RHS # DT[colA == colB] regular element-wise vector scan
- } else if ( (is.integer(x[[isub2]]) && is.double(RHS) && isReallyReal(RHS)) || (mode(x[[isub2]]) != mode(RHS) && !(class(x[[isub2]]) %in% c("character", "factor") &&
- class(RHS) %in% c("character", "factor"))) ||
+ } else if ( (is.integer(x[[isub2]]) && is.double(RHS) && isReallyReal(RHS)) || (mode(x[[isub2]]) != mode(RHS) && !(class(x[[isub2]]) %in% c("character", "factor") &&
+ class(RHS) %in% c("character", "factor"))) ||
(is.factor(x[[isub2]]) && !is.factor(RHS) && mode(RHS)=="numeric") ) { # fringe case, #1361. TODO: cleaner way of doing these checks.
- # re-direct all non-matching mode cases to base R, as data.table's binary
+ # re-direct all non-matching mode cases to base R, as data.table's binary
# search based join is strict in types. #957 and #961.
i = if (isub[[1L]] == "==") x[[isub2]] == RHS else x[[isub2]] %in% RHS
} else {
@@ -737,7 +737,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
io = if (missing(on)) haskey(i) else identical(unname(on), head(key(i), length(on)))
i = .shallow(i, retain.key = io)
ans = bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, mult, ops, nqgrp, nqmaxgrp, verbose=verbose)
- # temp fix for issue spotted by Jan, test #1653.1. TODO: avoid this
+ # temp fix for issue spotted by Jan, test #1653.1. TODO: avoid this
# 'setorder', as there's another 'setorder' in generating 'irows' below...
if (length(ans$indices)) setorder(setDT(ans[1:3]), indices)
allLen1 = ans$allLen1
@@ -759,20 +759,20 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# Really, `anyDuplicated` in base is AWESOME!
# allow.cartesian shouldn't error if a) not-join, b) 'i' has no duplicates
irows = if (allLen1) f__ else vecseq(f__,len__,
- if( allow.cartesian ||
+ if( allow.cartesian ||
notjoin || # #698. When notjoin=TRUE, ignore allow.cartesian. Rows in answer will never be > nrow(x).
- !anyDuplicated(f__, incomparables = c(0L, NA_integer_))) # #742. If 'i' has no duplicates, ignore
- NULL
+ !anyDuplicated(f__, incomparables = c(0L, NA_integer_))) # #742. If 'i' has no duplicates, ignore
+ NULL
else as.double(nrow(x)+nrow(i))) # rows in i might not match to x so old max(nrow(x),nrow(i)) wasn't enough. But this limit now only applies when there are duplicates present so the reason now for nrow(x)+nrow(i) is just to nail it down and be bigger than max(nrow(x),nrow(i)).
# Fix for #1092 and #1074
- # TODO: implement better version of "any"/"all"/"which" to avoid
+ # TODO: implement better version of "any"/"all"/"which" to avoid
# unnecessary construction of logical vectors
if (identical(nomatch, 0L) && allLen1) irows = irows[irows != 0L]
} else {
if (length(xo) && missing(on))
stop("Cannot by=.EACHI when joining to a secondary key, yet")
# since f__ refers to xo later in grouping, so xo needs to be passed through to dogroups too.
- if (length(irows))
+ if (length(irows))
stop("Internal error. irows has length in by=.EACHI")
}
if (nqbyjoin) {
@@ -811,18 +811,18 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (is.logical(i)) {
if (isTRUE(i)) irows=i=NULL
# NULL is efficient signal to avoid creating 1:nrow(x) but still return all rows, fixes #1249
-
+
else if (length(i)<=1L) irows=i=integer(0)
# FALSE, NA and empty. All should return empty data.table. The NA here will be result of expression,
# where for consistency of edge case #1252 all NA to be removed. If NA is a single NA symbol, it
# was auto converted to NA_integer_ higher up for ease of use and convenience. We definitely
# don't want base R behaviour where DF[NA,] returns an entire copy filled with NA everywhere.
-
+
else if (length(i)==nrow(x)) irows=i=which(i)
# The which() here auto removes NA for convenience so user doesn't need to remember "!is.na() & ..."
# Also this which() is for consistenty of DT[colA>3,which=TRUE] and which(DT[,colA>3])
# Assigning to 'i' here as well to save memory, #926.
-
+
else stop("i evaluates to a logical vector length ", length(i), " but there are ", nrow(x), " rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.")
} else {
irows = as.integer(i) # e.g. DT[c(1,3)] and DT[c(-1,-3)] ok but not DT[c(1,-3)] (caught as error)
@@ -882,22 +882,22 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# j was substituted before dealing with i so that := can set allow.cartesian=FALSE (#800) (used above in i logic)
if (is.null(jsub)) return(NULL)
-
+
if (!with && is.call(jsub) && jsub[[1L]]==":=") {
# TODO: make these both errors (or single long error in both cases) in next release.
# i.e. using with=FALSE together with := at all will become an error. Eventually with will be removed.
if (is.null(names(jsub)) && is.name(jsub[[2L]])) {
warning("with=FALSE together with := was deprecated in v1.9.4 released Oct 2014. Please wrap the LHS of := with parentheses; e.g., DT[,(myVar):=sum(b),by=a] to assign to column name(s) held in variable myVar. See ?':=' for other examples. As warned in 2014, this is now a warning.")
- jsub[[2L]] = eval(jsub[[2L]], parent.frame(), parent.frame())
+ jsub[[2L]] = eval(jsub[[2L]], parent.frame(), parent.frame())
} else {
warning("with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.")
}
with = TRUE
}
-
+
if (!with) {
# missing(by)==TRUE was already checked above before dealing with i
- if (is.call(jsub) && deparse(jsub[[1]], 500L) %in% c("!", "-")) { # TODO is deparse avoidable here?
+ if (is.call(jsub) && deparse(jsub[[1]], 500L, backtick=FALSE) %in% c("!", "-")) { # TODO is deparse avoidable here?
notj = TRUE
jsub = jsub[[2L]]
} else notj = FALSE
@@ -925,7 +925,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
ansvals = dupmatch(ansvars, names(x))
} else {
# once again, use 'setdiff'. Basically, unless indices are specified in `j`, we shouldn't care about duplicated columns.
- ansvars = j # x. and i. prefixes may be in here, and they'll be dealt with below
+ ansvars = j # x. and i. prefixes may be in here, and they'll be dealt with below
# dups = FALSE here.. even if DT[, c("x", "x"), with=FALSE], we subset only the first.. No way to tell which one the OP wants without index.
ansvals = chmatch(ansvars, names(x))
}
@@ -946,8 +946,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (byjoin) {
bynames = names(x)[rightcols]
} else if (!missing(by)) {
- # deal with by before j because we need byvars when j contains .SD
- # may evaluate to NULL | character() | "" | list(), likely a result of a user expression where no-grouping is one case being loop'd through
+ # deal with by before j because we need byvars when j contains .SD
+ # may evaluate to NULL | character() | "" | list(), likely a result of a user expression where no-grouping is one case being loop'd through
bysubl = as.list.default(bysub)
bysuborig = bysub
if (is.name(bysub) && !(as.character(bysub) %chin% names(x))) { # TO DO: names(x),names(i),and i. and x. prefixes
@@ -977,7 +977,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
bysub = as.call(c(as.name('('), list(bysub)))
bysubl = as.list.default(bysub)
} else if (is.call(bysub) && bysub[[1L]] == ".") bysub[[1L]] = quote(list)
-
+
if (mode(bysub) == "character") {
if (length(grep(",",bysub))) {
if (length(bysub)>1L) stop("'by' is a character vector length ",length(bysub)," but one or more items include a comma. Either pass a vector of column names (which can contain spaces, but no commas), or pass a vector length 1 containing comma separated column names. See ?data.table for other possibilities.")
@@ -989,7 +989,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
bysubl = as.list.default(bysub)
}
allbyvars = intersect(all.vars(bysub),names(x))
-
+
orderedirows = .Call(CisOrderedSubset, irows, nrow(x)) # TRUE when irows is NULL (i.e. no i clause)
# orderedirows = is.sorted(f__)
bysameorder = orderedirows && haskey(x) && all(sapply(bysubl,is.name)) && length(allbyvars) && identical(allbyvars,head(key(x),length(allbyvars)))
@@ -1067,7 +1067,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
bynames[jj] = deparse(bysubl[[jj+1L]])
if (verbose)
cat("by-expression '", bynames[jj], "' is not named, and the auto-generated name '", tt, "' clashed with variable(s) in j. Therefore assigning the entire by-expression as name.\n", sep="")
-
+
}
else bynames[jj] = tt
# if user doesn't like this inferred name, user has to use by=list() to name the column
@@ -1080,7 +1080,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
setattr(byval, "names", bynames) # byval is just a list not a data.table hence setattr not setnames
}
-
+
jvnames = NULL
if (is.name(jsub)) {
# j is a single unquoted column name
@@ -1121,11 +1121,11 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# FR #4979 - negative numeric and character indices for SDcols
colsub = substitute(.SDcols)
# fix for #5190. colsub[[1L]] gave error when it's a symbol.
- if (is.call(colsub) && deparse(colsub[[1L]], 500L) %in% c("!", "-")) {
+ if (is.call(colsub) && deparse(colsub[[1L]], 500L, backtick=FALSE) %in% c("!", "-")) {
colm = TRUE
colsub = colsub[[2L]]
} else colm = FALSE
- # fix for #1216, make sure the paranthesis are peeled from expr of the form (((1:4)))
+ # fix for #1216, make sure the paranthesis are peeled from expr of the form (((1:4)))
while(is.call(colsub) && colsub[[1L]] == "(") colsub = as.list(colsub)[[-1L]]
if (is.call(colsub) && length(colsub) == 3L && colsub[[1L]] == ":") {
# .SDcols is of the format a:b
@@ -1153,8 +1153,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# fix for long standing FR/bug, #495 and #484
allcols = c(names(x), xdotprefix, names(i), idotprefix)
if ( length(othervars <- setdiff(intersect(av, allcols), c(bynames, ansvars))) ) {
- # we've a situation like DT[, c(sum(V1), lapply(.SD, mean)), by=., .SDcols=...] or
- # DT[, lapply(.SD, function(x) x *v1), by=, .SDcols=...] etc.,
+ # we've a situation like DT[, c(sum(V1), lapply(.SD, mean)), by=., .SDcols=...] or
+ # DT[, lapply(.SD, function(x) x *v1), by=, .SDcols=...] etc.,
ansvars = union(ansvars, othervars)
ansvals = chmatch(ansvars, names(x))
}
@@ -1171,7 +1171,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
ansvals = chmatch(ansvars, names(x))
}
# if (!length(ansvars)) Leave ansvars empty. Important for test 607.
-
+
# TODO remove as (m)get is now folded in above.
# added 'mget' - fix for #994
if (any(c("get", "mget") %chin% av)) {
@@ -1196,7 +1196,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
suppPrint <- function(x) { .global$print=address(x); x }
# Suppress print when returns ok not on error, bug #2376. Thanks to: http://stackoverflow.com/a/13606880/403310
# All appropriate returns following this point are wrapped; i.e. return(suppPrint(x)).
-
+
if (is.null(names(jsub))) {
# regular LHS:=RHS usage, or `:=`(...) with no named arguments (an error)
# `:=`(LHS,RHS) is valid though, but more because can't see how to detect that, than desire
@@ -1293,14 +1293,14 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
}
}
-
+
if (length(ansvars)) {
w = ansvals
if (length(rightcols) && missing(by)) {
w[ w %in% rightcols ] = NA
}
# patch for #1615. Allow 'x.' syntax. Only useful during join op when x's join col needs to be used.
- # Note that I specifically have not implemented x[y, aa, on=c(aa="bb")] to refer to x's join column
+ # Note that I specifically have not implemented x[y, aa, on=c(aa="bb")] to refer to x's join column
# as well because x[i, col] == x[i][, col] will not be TRUE anymore..
if ( any(xdotprefixvals <- ansvars %in% xdotprefix)) {
w[xdotprefixvals] = chmatch(ansvars[xdotprefixvals], xdotprefix)
@@ -1332,7 +1332,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
}
} # end of if !missing(j)
-
+
SDenv = new.env(parent=parent.frame())
# taking care of warnings for posixlt type, #646
SDenv$strptime <- function(x, ...) {
@@ -1371,7 +1371,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
ans[[target]] = x[[source]]
# Temp fix for #921 - skip COPY until after evaluating 'jval' (scroll down).
# Unless 'with=FALSE' - can not be expressions but just column names.
- if (!with && address(ans[[target]]) == address(x[[source]]))
+ if (!with && address(ans[[target]]) == address(x[[source]]))
ans[[target]] = copy(ans[[target]])
else ans[[target]] = ans[[target]]
}
@@ -1387,7 +1387,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# would cause the next set or := to copy that column (so the warning is needed). To tackle that, we could have our
# own DT.NAMED attribute, perhaps.
# Or keep the rule that [.data.table always returns new memory, and create view() or view= as well, maybe cleaner.
-
+
setattr(ans, "names", ansvars)
if (haskey(x)) {
keylen = which.first(!key(x) %chin% ansvars)-1L
@@ -1403,7 +1403,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
setattr(ans, "class", class(x)) # fix for #5296
setattr(ans, "row.names", .set_row_names(nrow(ans)))
-
+
if (!with || missing(j)) return(alloc.col(ans))
SDenv$.SDall = ans
@@ -1465,7 +1465,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# is.call: selecting from a list column should return list
# is.object: for test 168 and 168.1 (S4 object result from ggplot2::qplot). Just plain list results should result in data.table
- # Fix for #813 and #758. Ex: DT[c(FALSE, FALSE), list(integer(0), y)]
+ # Fix for #813 and #758. Ex: DT[c(FALSE, FALSE), list(integer(0), y)]
# where DT = data.table(x=1:2, y=3:4) should return an empty data.table!!
if (!is.null(irows) && (identical(irows, integer(0)) || all(irows %in% 0L))) ## TODO: any way to not check all 'irows' values?
if (is.atomic(jval)) jval = jval[0L] else jval = lapply(jval, `[`, 0L)
@@ -1490,7 +1490,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# fix for bug #5114 from GSee's - .data.table.locked=TRUE. # TO DO: more efficient way e.g. address==address (identical will do that but then proceed to deep compare if !=, wheras we want just to stop?)
# Commented as it's taken care of above, along with #921 fix. Kept here for the bug fix info and TO DO.
# if (identical(jval, SDenv$.SD)) return(copy(jval))
-
+
if (is.data.table(jval)) {
setattr(jval, 'class', class(x)) # fix for #5296
if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
@@ -1502,7 +1502,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
###########################################################################
# Grouping ...
###########################################################################
-
+
o__ = integer()
if (".N" %chin% ansvars) stop("The column '.N' can't be grouped because it conflicts with the special .N variable. Try setnames(DT,'.N','N') first.")
if (".I" %chin% ansvars) stop("The column '.I' can't be grouped because it conflicts with the special .I variable. Try setnames(DT,'.I','I') first.")
@@ -1542,14 +1542,14 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
} else {
# Find the groups, using 'byval' ...
if (missing(by)) stop("Internal error, by is missing")
-
+
if (length(byval) && length(byval[[1]])) {
if (!bysameorder) {
if (verbose) {last.started.at=proc.time()[3];cat("Finding groups using forderv ... ");flush.console()}
o__ = forderv(byval, sort=!missing(keyby), retGrp=TRUE)
# The sort= argument is called sortStr at C level. It's just about saving the sort of unique strings at
- # C level for efficiency (cgroup vs csort) when by= not keyby=. All other types are always sorted. Getting
- # orginal order below is the part that retains original order. Passing sort=TRUE here always won't change any
+ # C level for efficiency (cgroup vs csort) when by= not keyby=. All other types are always sorted. Getting
+ # orginal order below is the part that retains original order. Passing sort=TRUE here always won't change any
# result at all (tested and confirmed to double check), it'll just make by= slower when there's a large
# number of unique strings. It must be TRUE when keyby= though, since the key is just marked afterwards.
# forderv() returns empty integer() if already ordered to save allocating 1:xnrow
@@ -1566,7 +1566,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (!bysameorder && missing(keyby)) {
# TO DO: lower this into forder.c
if (verbose) {last.started.at=proc.time()[3];cat("Getting back original order ... ");flush.console()}
- firstofeachgroup = o__[f__]
+ firstofeachgroup = o__[f__]
if (length(origorder <- forderv(firstofeachgroup))) {
f__ = f__[origorder]
len__ = len__[origorder]
@@ -1615,7 +1615,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
lockBinding(".GRP",SDenv)
lockBinding(".I",SDenv)
lockBinding(".iSD",SDenv)
-
+
GForce = FALSE
if ( getOption("datatable.optimize")>=1 && (is.call(jsub) || (is.name(jsub) && as.character(jsub) %chin% c(".SD",".N"))) ) { # Ability to turn off if problems or to benchmark the benefit
# Optimization to reduce overhead of calling lapply over and over for each group
@@ -1680,7 +1680,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# FR #2722 is just about optimisation of j=c(.N, lapply(.SD, .)) that is taken care of here.
# FR #735 tries to optimise j-expressions of the form c(...) as long as ... contains
# 1) lapply(.SD, ...), 2) simply .SD or .SD[..], 3) .N, 4) list(...) and 5) functions that normally return a single value*
- # On 5)* the IMPORTANT point to note is that things that are not wrapped within "list(...)" should *always*
+ # On 5)* the IMPORTANT point to note is that things that are not wrapped within "list(...)" should *always*
# return length 1 output for us to optimise. Else, there's no equivalent to optimising c(...) to list(...) AFAICT.
# One issue could be that these functions (e.g., mean) can be "re-defined" by the OP to produce a length > 1 output
# Of course this is worrying too much though. If the issue comes up, we'll just remove the relevant optimisations.
@@ -1700,9 +1700,9 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
jsubl[[i_]] = lapply(ansvarsnew, as.name)
jvnames = c(jvnames, ansvarsnew)
} else if (this == ".N") {
- # don't optimise .I in c(.SD, .I), it's length can be > 1
+ # don't optimise .I in c(.SD, .I), it's length can be > 1
# only c(.SD, list(.I)) should be optimised!! .N is always length 1.
- jvnames = c(jvnames, gsub("^[.]([N])$", "\\1", this))
+ jvnames = c(jvnames, gsub("^[.]([N])$", "\\1", this))
} else {
# jvnames = c(jvnames, if (is.null(names(jsubl))) "" else names(jsubl)[i_])
is_valid=FALSE
@@ -1727,7 +1727,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
} else if (is.call(this) && length(this) > 1L && as.character(this[[1L]]) %in% optfuns) {
jvnames = c(jvnames, if (is.null(names(jsubl))) "" else names(jsubl)[i_])
- } else if ( length(this) == 3L && (this[[1L]] == "[" || this[[1L]] == "head") &&
+ } else if ( length(this) == 3L && (this[[1L]] == "[" || this[[1L]] == "head") &&
this[[2L]] == ".SD" && (is.numeric(this[[3L]]) || this[[3L]] == ".N") ) {
# optimise .SD[1] or .SD[2L]. Not sure how to test .SD[a] as to whether a is numeric/integer or a data.table, yet.
any_SD = TRUE
@@ -1767,7 +1767,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
cat("lapply optimization is on, j unchanged as '",deparse(jsub,width.cutoff=200L),"'\n",sep="")
}
dotN <- function(x) if (is.name(x) && x == ".N") TRUE else FALSE # For #5760
- # FR #971, GForce kicks in on all subsets, no joins yet. Although joins could work with
+ # FR #971, GForce kicks in on all subsets, no joins yet. Although joins could work with
# nomatch=0L even now.. but not switching it on yet, will deal it separately.
if (getOption("datatable.optimize")>=2 && !is.data.table(i) && !byjoin && length(f__) && !length(lhs)) {
if (!length(ansvars) && !use.I) {
@@ -1784,9 +1784,9 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
cond = is.call(q) && as.character(q[[1L]]) %chin% gfuns && !is.call(q[[2L]])
ans = cond && (length(q)==2 || identical("na",substring(names(q)[3L],1,2)))
if (identical(ans, TRUE)) return(ans)
- ans = cond && length(q)==3 && ( as.character(q[[1]]) %chin% c("head", "tail") &&
- (identical(q[[3]], 1) || identical(q[[3]], 1L)) ||
- as.character(q[[1]]) %chin% "[" && is.numeric(q[[3]]) &&
+ ans = cond && length(q)==3 && ( as.character(q[[1]]) %chin% c("head", "tail") &&
+ (identical(q[[3]], 1) || identical(q[[3]], 1L)) ||
+ as.character(q[[1]]) %chin% "[" && is.numeric(q[[3]]) &&
length(q[[3]])==1 && q[[3]]>0 )
if (is.na(ans)) ans=FALSE
ans
@@ -1797,7 +1797,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
} else GForce = .ok(jsub)
if (GForce) {
if (jsub[[1L]]=="list")
- for (ii in seq_along(jsub)[-1L]) {
+ for (ii in seq_along(jsub)[-1L]) {
if (dotN(jsub[[ii]])) next; # For #5760
jsub[[ii]][[1L]] = as.name(paste("g", jsub[[ii]][[1L]], sep=""))
if (length(jsub[[ii]])==3) jsub[[ii]][[3]] = eval(jsub[[ii]][[3]], parent.frame()) # tests 1187.2 & 1187.4
@@ -1883,7 +1883,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
gi = if (length(o__)) o__[f__] else f__
g = lapply(grpcols, function(i) groups[[i]][gi])
ans = c(g, ans)
- } else {
+ } else {
ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, xjiscols, grporder, o__, f__, len__, jsub, SDenv, cols, newnames, !missing(on), verbose)
}
if (verbose) {cat(round(proc.time()[3]-last.started.at,3),"secs\n");flush.console()}
@@ -2043,13 +2043,13 @@ as.matrix.data.table <- function(x,...)
# bug #2375. fixed. same as head.data.frame and tail.data.frame to deal with negative indices
head.data.table <- function(x, n=6, ...) {
if (!cedta()) return(NextMethod())
- stopifnot(length(n) == 1L)
+ stopifnot(length(n) == 1L)
i = seq_len(if (n<0L) max(nrow(x)+n, 0L) else min(n,nrow(x)))
x[i, , ]
}
tail.data.table <- function(x, n=6, ...) {
if (!cedta()) return(NextMethod())
- stopifnot(length(n) == 1L)
+ stopifnot(length(n) == 1L)
n <- if (n<0L) max(nrow(x) + n, 0L) else min(n, nrow(x))
i = seq.int(to=nrow(x), length.out=n)
x[i]
@@ -2139,7 +2139,7 @@ as.data.frame.data.table <- function(x, ...)
setattr(ans,"class","data.frame")
setattr(ans,"sorted",NULL) # remove so if you convert to df, do something, and convert back, it is not sorted
setattr(ans,".internal.selfref",NULL)
- # leave tl intact, no harm,
+ # leave tl intact, no harm,
ans
}
@@ -2161,7 +2161,7 @@ as.list.data.table <- function(x, ...) {
dimnames.data.table <- function(x) {
if (!cedta()) {
- if (!inherits(x, "data.frame"))
+ if (!inherits(x, "data.frame"))
stop("data.table inherits from data.frame (from v1.5), but this data.table does not. Has it been created manually (e.g. by using 'structure' rather than 'data.table') or saved to disk using a prior version of data.table?")
return(`dimnames.data.frame`(x))
}
@@ -2176,7 +2176,7 @@ dimnames.data.table <- function(x) {
if (!is.null(value[[1L]])) stop("data.tables do not have rownames")
if (ncol(x) != length(value[[2]])) stop("can't assign",length(value[[2]]),"colnames to a",ncol(x),"column data.table")
setnames(x,as.character(value[[2]]))
- x # this returned value is now shallow copied by R 3.1.0 via *tmp*. A very welcome change.
+ x # this returned value is now shallow copied by R 3.1.0 via *tmp*. A very welcome change.
}
"names<-.data.table" <- function(x,value)
@@ -2193,7 +2193,7 @@ dimnames.data.table <- function(x) {
setattr(x,"names",NULL) # e.g. plyr::melt() calls base::unname()
else
setnames(x,value)
- x # this returned value is now shallow copied by R 3.1.0 via *tmp*. A very welcome change.
+ x # this returned value is now shallow copied by R 3.1.0 via *tmp*. A very welcome change.
}
within.data.table <- function (data, expr, ...)
@@ -2304,7 +2304,7 @@ na.omit.data.table <- function (object, cols = seq_along(object), invert = FALSE
old = cols
cols = chmatch(cols, names(object), nomatch=0L)
if (any(cols==0L))
- stop("Columns ", paste(old[cols==0L], collapse=","),
+ stop("Columns ", paste(old[cols==0L], collapse=","),
" doesn't exist in the input data.table")
}
cols = as.integer(cols)
@@ -2395,7 +2395,7 @@ split.data.table <- function(x, f, drop = FALSE, by, sorted = FALSE, keep.by = T
tmp = eval(dtq)
# add names on list
setattr(ll <- tmp$.ll.tech.split,
- "names",
+ "names",
as.character(
if (!flatten) tmp[[.by]] else tmp[, list(.nm.tech.split=paste(unlist(lapply(.SD, as.character)), collapse = ".")), by=by, .SDcols=by]$.nm.tech.split
))
@@ -2453,7 +2453,7 @@ point <- function(to, to_idx, from, from_idx) {
}
shallow <- function(x, cols=NULL) {
- if (!is.data.table(x))
+ if (!is.data.table(x))
stop("x is not a data.table. Shallow copy is a copy of the vector of column pointers (only), so is only meaningful for data.table")
ans = .shallow(x, cols=cols, retain.key = TRUE)
ans
@@ -2468,7 +2468,7 @@ alloc.col <- function(DT, n=getOption("datatable.alloccol"), verbose=getOption("
name = as.character(name)
assign(name,ans,parent.frame(),inherits=TRUE)
}
- .Call(Csetnamed,ans,0L)
+ .Call(Csetmutable,ans)
}
selfrefok <- function(DT,verbose=getOption("datatable.verbose")) {
@@ -2528,7 +2528,7 @@ setnames <- function(x,old,new) {
if (missing(old)) stop("When 'new' is provided, 'old' must be provided too")
if (!is.character(new)) stop("'new' is not a character vector")
if (is.numeric(old)) {
- if (length(sgn <- unique(sign(old))) != 1L)
+ if (length(sgn <- unique(sign(old))) != 1L)
stop("Items of 'old' is numeric but has both +ve and -ve indices.")
tt = abs(old)<1L | abs(old)>length(x) | is.na(old)
if (any(tt)) stop("Items of 'old' either NA or outside range [1,",length(x),"]: ",paste(old[tt],collapse=","))
@@ -2548,7 +2548,7 @@ setnames <- function(x,old,new) {
w = which(!is.na(m))
if (length(w))
.Call(Csetcharvec, attr(x,"sorted"), m[w], new[w])
-
+
# update secondary keys
idx = attr(x,"index")
for (k in names(attributes(idx))) {
@@ -2560,7 +2560,7 @@ setnames <- function(x,old,new) {
newk = paste("__",paste(tt,collapse="__"),sep="")
setattr(idx, newk, attr(idx, k))
setattr(idx, k, NULL)
- }
+ }
}
.Call(Csetcharvec, attr(x,"names"), as.integer(i), new)
@@ -2665,7 +2665,7 @@ setDF <- function(x, rownames=NULL) {
setattr(x, ".internal.selfref", NULL)
} else if (is.data.frame(x)) {
if (!is.null(rownames)){
- if (length(rownames) != nrow(x))
+ if (length(rownames) != nrow(x))
stop("rownames incorrect length; expected ", nrow(x), " names, got ", length(rownames))
setattr(x, "row.names", rownames)
}
@@ -2747,7 +2747,7 @@ setDT <- function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
if (is.null(xn)) {
setattr(x, "names", paste("V",seq_len(length(x)),sep=""))
} else {
- idx = xn %chin% "" # names can be NA - test 1006 caught that!
+ idx = xn %chin% "" # names can be NA - test 1006 caught that!
if (any(idx)) {
xn[idx] = paste("V", seq_along(which(idx)), sep="")
setattr(x, "names", xn)
@@ -2828,14 +2828,14 @@ rleidv <- function(x, cols=seq_along(x), prefix=NULL) {
if (!is.null(prefix) && (!is.character(prefix) || length(prefix) != 1L))
stop("prefix must be NULL or a character vector of length=1.")
if (is.atomic(x)) {
- if (!missing(cols) && !is.null(cols))
+ if (!missing(cols) && !is.null(cols))
stop("x is a single vector, non-NULL 'cols' doesn't make sense.")
cols = 1L
x = as_list(x)
} else {
if (!length(cols))
stop("x is a list, 'cols' can not be 0-length.")
- if (is.character(cols))
+ if (is.character(cols))
cols = chmatch(cols, names(x))
cols = as.integer(cols)
}
diff --git a/R/onAttach.R b/R/onAttach.R
index 3d91493..6d70fc9 100644
--- a/R/onAttach.R
+++ b/R/onAttach.R
@@ -17,11 +17,10 @@
if (dev && (Sys.Date() - as.Date(d))>28)
packageStartupMessage("**********\nThis development version of data.table was built more than 4 weeks ago. Please update.\n**********")
if (!.Call(ChasOpenMP))
- packageStartupMessage("**********\nThis installation of data.table has not detected OpenMP support. It will still work but in single-threaded mode. If this a Mac and you obtained the Mac binary of data.table from CRAN, CRAN's Mac does not yet support OpenMP. In the meantime please follow our Mac installation instructions on the data.table homepage. If it works and you observe benefits from multiple threads as others have reported, please convince Simon Ubanek by sending him evide [...]
+ packageStartupMessage("**********\nThis installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode. If this is a Mac, please ensure you are using R>=3.4.0 and have installed the MacOS binary package from CRAN: see ?install.packages, the 'type=' argument and the 'Binary packages' section. If you compiled from source, please reinstall and precisely follow the installation instructions on the data.table homepage. This warning message [...]
packageStartupMessage(' The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way')
packageStartupMessage(' Documentation: ?data.table, example(data.table) and browseVignettes("data.table")')
packageStartupMessage(' Release notes, videos and slides: http://r-datatable.com')
}
}
-
diff --git a/R/test.data.table.R b/R/test.data.table.R
index c66abd8..b587486 100644
--- a/R/test.data.table.R
+++ b/R/test.data.table.R
@@ -2,16 +2,12 @@
test.data.table <- function(verbose=FALSE, pkg="pkg", silent=FALSE) {
if (exists("test.data.table",.GlobalEnv,inherits=FALSE)) {
# package developer
- if ("package:data.table" %in% search()) stop("data.table package loaded")
- if (.Platform$OS.type == "unix" && Sys.info()['sysname'] != "Darwin")
- d = path.expand("~/data.table/inst/tests")
- else {
- if (!pkg %in% dir()) stop(paste(pkg, " not in dir()", sep=""))
- d = paste(getwd(),"/", pkg, "/inst/tests",sep="")
- }
+ if ("package:data.table" %in% search()) stop("data.table package is loaded. Unload or start a fresh R session.")
+ d = if (pkg %in% dir()) paste0(getwd(),"/",pkg) else Sys.getenv("CC_DIR")
+ d = paste0(d, "/inst/tests")
} else {
- # user
- d = paste(getNamespaceInfo("data.table","path"),"/tests",sep="")
+ # R CMD check and user running test.data.table()
+ d = paste0(getNamespaceInfo("data.table","path"),"/tests")
}
# for (fn in dir(d,"*.[rR]$",full=TRUE)) { # testthat runs those
oldenc = options(encoding="UTF-8")[[1L]] # just for tests 708-712 on Windows
diff --git a/R/utils.R b/R/utils.R
index 07af702..ab6fec4 100644
--- a/R/utils.R
+++ b/R/utils.R
@@ -49,9 +49,8 @@ UseMethod("%+%")
# we often construct warning msgs with a msg followed by several items of a vector, so %+% is for convenience
require_bit64 = function() {
- # called in fread and print when they see integer64 columns are present
- tt = try(requireNamespace("bit64",quietly=TRUE))
- if (inherits(tt,"try-error"))
+ # called in fread and print when they see integer64 columns are present
+ if (!requireNamespace("bit64",quietly=TRUE))
warning("Some columns are type 'integer64' but package bit64 is not installed. Those columns will print as strange looking floating point data. There is no need to reload the data. Simply install.packages('bit64') to obtain the integer64 print method and print the data again.")
}
diff --git a/build/vignette.rds b/build/vignette.rds
index 0465335..45df8f8 100644
Binary files a/build/vignette.rds and b/build/vignette.rds differ
diff --git a/inst/doc/datatable-faq.Rmd b/inst/doc/datatable-faq.Rmd
index f78ca19..48fdbb9 100644
--- a/inst/doc/datatable-faq.Rmd
+++ b/inst/doc/datatable-faq.Rmd
@@ -595,7 +595,7 @@ Please file suggestions, bug reports and enhancement requests on our [issues tra
Please do star the package on [GitHub](https://github.com/Rdatatable/data.table/wiki). This helps encourage the developers and helps other R users find the package.
-You can submit pull requests to change the code and/or documentation yourself; see our [Contribution Guidelines](https://github.com/Rdatatable/data.table/blob/master/Contributing.md).
+You can submit pull requests to change the code and/or documentation yourself; see our [Contribution Guidelines](https://github.com/Rdatatable/data.table/blob/master/CONTRIBUTING.md).
## I think it's not great. How do I warn others about my experience?
Please put your vote and comments on [Crantastic](http://crantastic.org/packages/data-table). Please make it constructive so we have a chance to improve.
diff --git a/inst/doc/datatable-faq.html b/inst/doc/datatable-faq.html
index 8d283c6..20fe184 100644
--- a/inst/doc/datatable-faq.html
+++ b/inst/doc/datatable-faq.html
@@ -1,495 +1,640 @@
<!DOCTYPE html>
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
-<html xmlns="http://www.w3.org/1999/xhtml">
+<title>Beginner FAQs</title>
-<head>
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
+
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
-<meta charset="utf-8">
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<meta name="generator" content="pandoc" />
+ pre .literal {
+ color: #990073
+ }
-<meta name="viewport" content="width=device-width, initial-scale=1">
+ pre .number {
+ color: #099;
+ }
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
-<meta name="date" content="2017-01-31" />
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
-<title>Frequently Asked Questions about data.table</title>
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
+ pre .string {
+ color: #d14;
+ }
+</style>
+
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&amp;").replace(/</gm,"&lt;")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
+
+<!-- MathJax scripts -->
+<script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js?config=TeX-MML-AM_CHTML">
+</script>
-<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
-div.sourceCode { overflow-x: auto; }
-table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
- margin: 0; padding: 0; vertical-align: baseline; border: none; }
-table.sourceCode { width: 100%; line-height: 100%; }
-td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
-td.sourceCode { padding-left: 5px; }
-code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
-code > span.dt { color: #902000; } /* DataType */
-code > span.dv { color: #40a070; } /* DecVal */
-code > span.bn { color: #40a070; } /* BaseN */
-code > span.fl { color: #40a070; } /* Float */
-code > span.ch { color: #4070a0; } /* Char */
-code > span.st { color: #4070a0; } /* String */
-code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
-code > span.ot { color: #007020; } /* Other */
-code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
-code > span.fu { color: #06287e; } /* Function */
-code > span.er { color: #ff0000; font-weight: bold; } /* Error */
-code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
-code > span.cn { color: #880000; } /* Constant */
-code > span.sc { color: #4070a0; } /* SpecialChar */
-code > span.vs { color: #4070a0; } /* VerbatimString */
-code > span.ss { color: #bb6688; } /* SpecialString */
-code > span.im { } /* Import */
-code > span.va { color: #19177c; } /* Variable */
-code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
-code > span.op { color: #666666; } /* Operator */
-code > span.bu { } /* BuiltIn */
-code > span.ex { } /* Extension */
-code > span.pp { color: #bc7a00; } /* Preprocessor */
-code > span.at { color: #7d9029; } /* Attribute */
-code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
-code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
-code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
-code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
-</style>
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
-<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
+h1 {
+ font-size:2.2em;
+}
-</head>
+h2 {
+ font-size:1.8em;
+}
-<body>
+h3 {
+ font-size:1.4em;
+}
+h4 {
+ font-size:1.0em;
+}
+
+h5 {
+ font-size:0.9em;
+}
+
+h6 {
+ font-size:0.8em;
+}
+
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
-<h1 class="title toc-ignore">Frequently Asked Questions about data.table</h1>
-<h4 class="date"><em>2017-01-31</em></h4>
+code[class] {
+ background-color: #F8F8F8;
+}
+table, td, th {
+ border: none;
+}
-<div id="TOC">
-<ul>
-<li><a href="#beginner-faqs"><span class="toc-section-number">1</span> Beginner FAQs</a><ul>
-<li><a href="#j-num"><span class="toc-section-number">1.1</span> Why do <code>DT[ , 5]</code> and <code>DT[2, 5]</code> return a 1-column data.table rather than vectors like <code>data.frame</code>?</a></li>
-<li><a href="#why-does-dtregion-return-a-1-column-data.table-rather-than-a-vector"><span class="toc-section-number">1.2</span> Why does <code>DT[,"region"]</code> return a 1-column data.table rather than a vector?</a></li>
-<li><a href="#why-does-dt-region-return-a-vector-for-the-region-column-id-like-a-1-column-data.table."><span class="toc-section-number">1.3</span> Why does <code>DT[, region]</code> return a vector for the “region” column? I’d like a 1-column data.table.</a></li>
-<li><a href="#why-does-dt-x-y-z-not-work-i-wanted-the-3-columns-xy-and-z."><span class="toc-section-number">1.4</span> Why does <code>DT[ , x, y, z]</code> not work? I wanted the 3 columns <code>x</code>,<code>y</code> and <code>z</code>.</a></li>
-<li><a href="#i-assigned-a-variable-mycol-x-but-then-dt-mycol-returns-x.-how-do-i-get-it-to-look-up-the-column-name-contained-in-the-mycol-variable"><span class="toc-section-number">1.5</span> I assigned a variable <code>mycol = "x"</code> but then <code>DT[ , mycol]</code> returns <code>"x"</code>. How do I get it to look up the column name contained in the <code>mycol</code> variable?</a></li>
-<li><a href="#what-are-the-benefits-of-being-able-to-use-column-names-as-if-they-are-variables-inside-dt..."><span class="toc-section-number">1.6</span> What are the benefits of being able to use column names as if they are variables inside <code>DT[...]</code>?</a></li>
-<li><a href="#ok-im-starting-to-see-what-data.table-is-about-but-why-didnt-you-just-enhance-data.frame-in-r-why-does-it-have-to-be-a-new-package"><span class="toc-section-number">1.7</span> OK, I’m starting to see what data.table is about, but why didn’t you just enhance <code>data.frame</code> in R? Why does it have to be a new package?</a></li>
-<li><a href="#why-are-the-defaults-the-way-they-are-why-does-it-work-the-way-it-does"><span class="toc-section-number">1.8</span> Why are the defaults the way they are? Why does it work the way it does?</a></li>
-<li><a href="#isnt-this-already-done-by-with-and-subset-in-base"><span class="toc-section-number">1.9</span> Isn’t this already done by <code>with()</code> and <code>subset()</code> in <code>base</code>?</a></li>
-<li><a href="#why-does-xy-return-all-the-columns-from-y-too-shouldnt-it-return-a-subset-of-x"><span class="toc-section-number">1.10</span> Why does <code>X[Y]</code> return all the columns from <code>Y</code> too? Shouldn’t it return a subset of <code>X</code>?</a></li>
-<li><a href="#MergeDiff"><span class="toc-section-number">1.11</span> What is the difference between <code>X[Y]</code> and <code>merge(X, Y)</code>?</a></li>
-<li><a href="#anything-else-about-xy-sumfoobar"><span class="toc-section-number">1.12</span> Anything else about <code>X[Y, sum(foo*bar)]</code>?</a></li>
-<li><a href="#thats-nice.-how-did-you-manage-to-change-it-given-that-users-depended-on-the-old-behaviour"><span class="toc-section-number">1.13</span> That’s nice. How did you manage to change it given that users depended on the old behaviour?</a></li>
-</ul></li>
-<li><a href="#general-syntax"><span class="toc-section-number">2</span> General Syntax</a><ul>
-<li><a href="#how-can-i-avoid-writing-a-really-long-j-expression-youve-said-that-i-should-use-the-column-names-but-ive-got-a-lot-of-columns."><span class="toc-section-number">2.1</span> How can I avoid writing a really long <code>j</code> expression? You’ve said that I should use the column <em>names</em>, but I’ve got a lot of columns.</a></li>
-<li><a href="#why-is-the-default-for-mult-now-all"><span class="toc-section-number">2.2</span> Why is the default for <code>mult</code> now <code>"all"</code>?</a></li>
-<li><a href="#im-using-c-in-j-and-getting-strange-results."><span class="toc-section-number">2.3</span> I’m using <code>c()</code> in <code>j</code> and getting strange results.</a></li>
-<li><a href="#i-have-built-up-a-complex-table-with-many-columns.-i-want-to-use-it-as-a-template-for-a-new-table-i.e.-create-a-new-table-with-no-rows-but-with-the-column-names-and-types-copied-from-my-table.-can-i-do-that-easily"><span class="toc-section-number">2.4</span> I have built up a complex table with many columns. I want to use it as a template for a new table; <em>i.e.</em>, create a new table with no rows, but with the column names and types copied from my table. Can I do that [...]
-<li><a href="#is-a-null-data.table-the-same-as-dt0"><span class="toc-section-number">2.5</span> Is a null data.table the same as <code>DT[0]</code>?</a></li>
-<li><a href="#DTremove1"><span class="toc-section-number">2.6</span> Why has the <code>DT()</code> alias been removed?</a></li>
-<li><a href="#DTremove2"><span class="toc-section-number">2.7</span> But my code uses <code>j = DT(...)</code> and it works. The previous FAQ says that <code>DT()</code> has been removed.</a></li>
-<li><a href="#what-are-the-scoping-rules-for-j-expressions"><span class="toc-section-number">2.8</span> What are the scoping rules for <code>j</code> expressions?</a></li>
-<li><a href="#j-trace"><span class="toc-section-number">2.9</span> Can I trace the <code>j</code> expression as it runs through the groups?</a></li>
-<li><a href="#inside-each-group-why-are-the-group-variables-length-1"><span class="toc-section-number">2.10</span> Inside each group, why are the group variables length-1?</a></li>
-<li><a href="#only-the-first-10-rows-are-printed-how-do-i-print-more"><span class="toc-section-number">2.11</span> Only the first 10 rows are printed, how do I print more?</a></li>
-<li><a href="#with-an-xy-join-what-if-x-contains-a-column-called-y"><span class="toc-section-number">2.12</span> With an <code>X[Y]</code> join, what if <code>X</code> contains a column called <code>"Y"</code>?</a></li>
-<li><a href="#xzy-is-failing-because-x-contains-a-column-y.-id-like-it-to-use-the-table-y-in-calling-scope."><span class="toc-section-number">2.13</span> <code>X[Z[Y]]</code> is failing because <code>X</code> contains a column <code>"Y"</code>. I’d like it to use the table <code>Y</code> in calling scope.</a></li>
-<li><a href="#can-you-explain-further-why-data.table-is-inspired-by-ab-syntax-in-base"><span class="toc-section-number">2.14</span> Can you explain further why data.table is inspired by <code>A[B]</code> syntax in <code>base</code>?</a></li>
-<li><a href="#can-base-be-changed-to-do-this-then-rather-than-a-new-package"><span class="toc-section-number">2.15</span> Can base be changed to do this then, rather than a new package?</a></li>
-<li><a href="#ive-heard-that-data.table-syntax-is-analogous-to-sql."><span class="toc-section-number">2.16</span> I’ve heard that data.table syntax is analogous to SQL.</a></li>
-<li><a href="#SmallerDiffs"><span class="toc-section-number">2.17</span> What are the smaller syntax differences between <code>data.frame</code> and data.table</a></li>
-<li><a href="#im-using-j-for-its-side-effect-only-but-im-still-getting-data-returned.-how-do-i-stop-that"><span class="toc-section-number">2.18</span> I’m using <code>j</code> for its side effect only, but I’m still getting data returned. How do I stop that?</a></li>
-<li><a href="#why-does-.data.table-now-have-a-drop-argument-from-v1.5"><span class="toc-section-number">2.19</span> Why does <code>[.data.table</code> now have a <code>drop</code> argument from v1.5?</a></li>
-<li><a href="#rolling-joins-are-cool-and-very-fast-was-that-hard-to-program"><span class="toc-section-number">2.20</span> Rolling joins are cool and very fast! Was that hard to program?</a></li>
-<li><a href="#why-does-dti-col-value-return-the-whole-of-dt-i-expected-either-no-visible-value-consistent-with---or-a-message-or-return-value-containing-how-many-rows-were-updated.-it-isnt-obvious-that-the-data-has-indeed-been-updated-by-reference."><span class="toc-section-number">2.21</span> Why does <code>DT[i, col := value]</code> return the whole of <code>DT</code>? I expected either no visible value (consistent with <code><-</code>), or a message or return value containing how m [...]
-<li><a href="#ok-thanks.-what-was-so-difficult-about-the-result-of-dti-col-value-being-returned-invisibly"><span class="toc-section-number">2.22</span> OK, thanks. What was so difficult about the result of <code>DT[i, col := value]</code> being returned invisibly?</a></li>
-<li><a href="#why-do-i-have-to-type-dt-sometimes-twice-after-using-to-print-the-result-to-console"><span class="toc-section-number">2.23</span> Why do I have to type <code>DT</code> sometimes twice after using <code>:=</code> to print the result to console?</a></li>
-<li><a href="#ive-noticed-that-basecbind.data.frame-and-baserbind.data.frame-appear-to-be-changed-by-data.table.-how-is-this-possible-why"><span class="toc-section-number">2.24</span> I’ve noticed that <code>base::cbind.data.frame</code> (and <code>base::rbind.data.frame</code>) appear to be changed by data.table. How is this possible? Why?</a></li>
-<li><a href="#r-dispatch"><span class="toc-section-number">2.25</span> I’ve read about method dispatch (<em>e.g.</em> <code>merge</code> may or may not dispatch to <code>merge.data.table</code>) but <em>how</em> does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when?</a></li>
-</ul></li>
-<li><a href="#questions-relating-to-compute-time"><span class="toc-section-number">3</span> Questions relating to compute time</a><ul>
-<li><a href="#i-have-20-columns-and-a-large-number-of-rows.-why-is-an-expression-of-one-column-so-quick"><span class="toc-section-number">3.1</span> I have 20 columns and a large number of rows. Why is an expression of one column so quick?</a></li>
-<li><a href="#i-dont-have-a-key-on-a-large-table-but-grouping-is-still-really-quick.-why-is-that"><span class="toc-section-number">3.2</span> I don’t have a <code>key</code> on a large table, but grouping is still really quick. Why is that?</a></li>
-<li><a href="#why-is-grouping-by-columns-in-the-key-faster-than-an-ad-hoc-by"><span class="toc-section-number">3.3</span> Why is grouping by columns in the key faster than an <em>ad hoc</em> <code>by</code>?</a></li>
-<li><a href="#what-are-primary-and-secondary-indexes-in-data.table"><span class="toc-section-number">3.4</span> What are primary and secondary indexes in data.table?</a></li>
-</ul></li>
-<li><a href="#error-messages"><span class="toc-section-number">4</span> Error messages</a><ul>
-<li><a href="#could-not-find-function-dt"><span class="toc-section-number">4.1</span> “Could not find function <code>DT</code>”</a></li>
-<li><a href="#unused-arguments-mysum-sumv"><span class="toc-section-number">4.2</span> “unused argument(s) (<code>MySum = sum(v)</code>)”</a></li>
-<li><a href="#translatecharutf8-must-be-called-on-a-charsxp"><span class="toc-section-number">4.3</span> “<code>translateCharUTF8</code> must be called on a <code>CHARSXP</code>”</a></li>
-<li><a href="#cbinddt-df-returns-a-strange-format-e.g.-integer5"><span class="toc-section-number">4.4</span> <code>cbind(DT, DF)</code> returns a strange format, <em>e.g.</em> <code id="cbinderror">Integer,5</code></a></li>
-<li><a href="#cannot-change-value-of-locked-binding-for-.sd"><span class="toc-section-number">4.5</span> “cannot change value of locked binding for <code>.SD</code>”</a></li>
-<li><a href="#cannot-change-value-of-locked-binding-for-.n"><span class="toc-section-number">4.6</span> “cannot change value of locked binding for <code>.N</code>”</a></li>
-</ul></li>
-<li><a href="#warning-messages"><span class="toc-section-number">5</span> Warning messages</a><ul>
-<li><a href="#the-following-objects-are-masked-from-packagebase-cbind-rbind"><span class="toc-section-number">5.1</span> “The following object(s) are masked from <code>package:base</code>: <code>cbind</code>, <code>rbind</code>”</a></li>
-<li><a href="#coerced-numeric-rhs-to-integer-to-match-the-columns-type"><span class="toc-section-number">5.2</span> “Coerced numeric RHS to integer to match the column’s type”</a></li>
-<li><a href="#reading-data.table-from-rds-or-rdata-file"><span class="toc-section-number">5.3</span> Reading data.table from RDS or RData file</a></li>
-</ul></li>
-<li><a href="#general-questions-about-the-package"><span class="toc-section-number">6</span> General questions about the package</a><ul>
-<li><a href="#v1.3-appears-to-be-missing-from-the-cran-archive"><span class="toc-section-number">6.1</span> v1.3 appears to be missing from the CRAN archive?</a></li>
-<li><a href="#is-data.table-compatible-with-s-plus"><span class="toc-section-number">6.2</span> Is data.table compatible with S-plus?</a></li>
-<li><a href="#is-it-available-for-linux-mac-and-windows"><span class="toc-section-number">6.3</span> Is it available for Linux, Mac and Windows?</a></li>
-<li><a href="#i-think-its-great.-what-can-i-do"><span class="toc-section-number">6.4</span> I think it’s great. What can I do?</a></li>
-<li><a href="#i-think-its-not-great.-how-do-i-warn-others-about-my-experience"><span class="toc-section-number">6.5</span> I think it’s not great. How do I warn others about my experience?</a></li>
-<li><a href="#i-have-a-question.-i-know-the-r-help-posting-guide-tells-me-to-contact-the-maintainer-not-r-help-but-is-there-a-larger-group-of-people-i-can-ask"><span class="toc-section-number">6.6</span> I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask?</a></li>
-<li><a href="#where-are-the-datatable-help-archives"><span class="toc-section-number">6.7</span> Where are the datatable-help archives?</a></li>
-<li><a href="#id-prefer-not-to-post-on-the-issues-page-can-i-mail-just-one-or-two-people-privately"><span class="toc-section-number">6.8</span> I’d prefer not to post on the Issues page, can I mail just one or two people privately?</a></li>
-<li><a href="#i-have-created-a-package-that-uses-data.table.-how-do-i-ensure-my-package-is-data.table-aware-so-that-inheritance-from-data.frame-works"><span class="toc-section-number">6.9</span> I have created a package that uses data.table. How do I ensure my package is data.table-aware so that inheritance from <code>data.frame</code> works?</a></li>
-</ul></li>
-</ul>
-</div>
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+
+ img {
+ max-width: 100% !important;
+ }
+
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
+
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
+</style>
+
+
+
+</head>
+
+<body>
<style>
h2 {
font-size: 20px;
}
</style>
-<p>The first section, Beginner FAQs, is intended to be read in order, from start to finish. It’s just written in a FAQ style to be digested more easily. It isn’t really the most frequently asked questions. A better measure for that is looking on Stack Overflow.</p>
-<p>This FAQ is required reading and considered core documentation. Please do not ask questions on Stack Overflow or raise issues on GitHub until you have read it. We can all tell when you ask that you haven’t read it. So if you do ask and haven’t read it, don’t use your real name.</p>
-<p>This document has been quickly revised given the changes in v1.9.8 released Nov 2016. Please do submit pull requests to fix mistakes or improvements. If anyone knows why the table of contents comes out so narrow and squashed when displayed by CRAN, please let us know. This document used to be a PDF and we changed it recently to HTML.</p>
-<div id="beginner-faqs" class="section level1">
-<h1><span class="header-section-number">1</span> Beginner FAQs</h1>
-<div id="j-num" class="section level2">
-<h2><span class="header-section-number">1.1</span> Why do <code>DT[ , 5]</code> and <code>DT[2, 5]</code> return a 1-column data.table rather than vectors like <code>data.frame</code>?</h2>
-<p>For consistency so that when you use data.table in functions that accept varying inputs, you can rely on <code>DT[...]</code> returning a data.table. You don’t have to remember to include <code>drop=FALSE</code> like you do in data.frame. data.table was first released in 2006 and this difference to data.frame has been a feature since the very beginning.</p>
-<p>You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R’s or data.table’s. It’ [...]
-<p>Say column 5 is named <code>"region"</code> and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write <code>DT$region</code> or <code>DT[["region"]]</code>; i.e., the same as base R. Using base R’s <code>$</code> and <code>[[</code> on data.table is encouraged. Not when combined with <code><-</code> to assign (use <code>:=</code> instead for that) but just to select a single column by name they are e [...]
-<p>There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write <code>DT[, 5:10]</code> and <code>DT[,c(1,4,10)]</code>. However, again, it is more robust (to future changes in your data’s number of and ordering of columns) to use a named range such as <code>DT[,columnRed:columnViolet]</code> or name each one <code>DT[,c("columnRed","columnOrange",&quo [...]
-<p>However, what we really want you to do is <code>DT[,.(columnRed,columnOrange,columnYellow)]</code>; i.e., use column names as if they are variables directly inside <code>DT[...]</code>. You don’t have to prefix each column with <code>DT$</code> like you do in data.frame. The <code>.()</code> part is just an alias for <code>list()</code> and you can use <code>list()</code> instead if you prefer. You can place any R expression of column names, using any R package, returning different ty [...]
-<p>Reminder: you can place <em>any</em> R expression inside <code>DT[...]</code> using column names as if they are variables; e.g., try <code>DT[, colA*colB/2]</code>. That does return a vector because you used column names as if they are variables. Wrap with <code>.()</code> to return a data.table; i.e. <code>DT[,.(colA*colB/2)]</code>. Name it: <code>DT[,.(myResult = colA*colB/2)]</code>. And we’ll leave it to you to guess how to return two things from this query. It’s also quite commo [...]
-</div>
-<div id="why-does-dtregion-return-a-1-column-data.table-rather-than-a-vector" class="section level2">
-<h2><span class="header-section-number">1.2</span> Why does <code>DT[,"region"]</code> return a 1-column data.table rather than a vector?</h2>
-<p>See the <a href="#j-num">answer above</a>. Try <code>DT$region</code> instead. Or <code>DT[["region"]]</code>.</p>
-</div>
-<div id="why-does-dt-region-return-a-vector-for-the-region-column-id-like-a-1-column-data.table." class="section level2">
-<h2><span class="header-section-number">1.3</span> Why does <code>DT[, region]</code> return a vector for the “region” column? I’d like a 1-column data.table.</h2>
+
+<p>The first section, Beginner FAQs, is intended to be read in order, from start to finish. It's just written in a FAQ style to be digested more easily. It isn't really the most frequently asked questions. A better measure for that is looking on Stack Overflow.</p>
+
+<p>This FAQ is required reading and considered core documentation. Please do not ask questions on Stack Overflow or raise issues on GitHub until you have read it. We can all tell when you ask that you haven't read it. So if you do ask and haven't read it, don't use your real name.</p>
+
+<p>This document has been quickly revised given the changes in v1.9.8 released Nov 2016. Please do submit pull requests to fix mistakes or improvements. If anyone knows why the table of contents comes out so narrow and squashed when displayed by CRAN, please let us know. This document used to be a PDF and we changed it recently to HTML.</p>
+
+<h1>Beginner FAQs</h1>
+
+<h2>Why do <code>DT[ , 5]</code> and <code>DT[2, 5]</code> return a 1-column data.table rather than vectors like <code>data.frame</code>? {#j-num}</h2>
+
+<p>For consistency so that when you use data.table in functions that accept varying inputs, you can rely on <code>DT[...]</code> returning a data.table. You don't have to remember to include <code>drop=FALSE</code> like you do in data.frame. data.table was first released in 2006 and this difference to data.frame has been a feature since the very beginning.</p>
+
+<p>You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R's or data.table [...]
+
+<p>Say column 5 is named <code>"region"</code> and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write <code>DT$region</code> or <code>DT[["region"]]</code>; i.e., the same as base R. Using base R's <code>$</code> and <code>[[</code> on data.table is encouraged. Not when combined with <code><-</code> to assign (use <code>:=</code> instead for that) but just to select a single column by name they a [...]
+
+<p>There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write <code>DT[, 5:10]</code> and <code>DT[,c(1,4,10)]</code>. However, again, it is more robust (to future changes in your data's number of and ordering of columns) to use a named range such as <code>DT[,columnRed:columnViolet]</code> or name each one <code>DT[,c("columnRed","columnOrange", [...]
+
+<p>However, what we really want you to do is <code>DT[,.(columnRed,columnOrange,columnYellow)]</code>; i.e., use column names as if they are variables directly inside <code>DT[...]</code>. You don't have to prefix each column with <code>DT$</code> like you do in data.frame. The <code>.()</code> part is just an alias for <code>list()</code> and you can use <code>list()</code> instead if you prefer. You can place any R expression of column names, using any R package, returning differen [...]
+
+<p>Reminder: you can place <em>any</em> R expression inside <code>DT[...]</code> using column names as if they are variables; e.g., try <code>DT[, colA*colB/2]</code>. That does return a vector because you used column names as if they are variables. Wrap with <code>.()</code> to return a data.table; i.e. <code>DT[,.(colA*colB/2)]</code>. Name it: <code>DT[,.(myResult = colA*colB/2)]</code>. And we'll leave it to you to guess how to return two things from this query. It's also q [...]
+
+<h2>Why does <code>DT[,"region"]</code> return a 1-column data.table rather than a vector?</h2>
+
+<p>See the <a href="#j-num">answer above</a>. Try <code>DT$region</code> instead. Or <code>DT[["region"]]</code>. </p>
+
+<h2>Why does <code>DT[, region]</code> return a vector for the “region” column? I'd like a 1-column data.table.</h2>
+
<p>Try <code>DT[ , .(region)]</code> instead. <code>.()</code> is an alias for <code>list()</code> and ensures a data.table is returned.</p>
+
<p>Also continue reading and see the FAQ after next. Skim whole documents before getting stuck in one part.</p>
-</div>
-<div id="why-does-dt-x-y-z-not-work-i-wanted-the-3-columns-xy-and-z." class="section level2">
-<h2><span class="header-section-number">1.4</span> Why does <code>DT[ , x, y, z]</code> not work? I wanted the 3 columns <code>x</code>,<code>y</code> and <code>z</code>.</h2>
+
+<h2>Why does <code>DT[ , x, y, z]</code> not work? I wanted the 3 columns <code>x</code>,<code>y</code> and <code>z</code>.</h2>
+
<p>The <code>j</code> expression is the 2nd argument. Try <code>DT[ , c("x","y","z")]</code> or <code>DT[ , .(x,y,z)]</code>.</p>
-</div>
-<div id="i-assigned-a-variable-mycol-x-but-then-dt-mycol-returns-x.-how-do-i-get-it-to-look-up-the-column-name-contained-in-the-mycol-variable" class="section level2">
-<h2><span class="header-section-number">1.5</span> I assigned a variable <code>mycol = "x"</code> but then <code>DT[ , mycol]</code> returns <code>"x"</code>. How do I get it to look up the column name contained in the <code>mycol</code> variable?</h2>
-<p>In v1.9.8 released Nov 2016 there is an ability to turn on new behaviour: <code>options(datatable.WhenJisSymbolThenCallingScope=TRUE)</code>. It will then work as you expected, just like data.frame. If you are a new user of data.table, you should probably do this. You can place this command in your .Rprofile file so you don’t have to remember again. See the long item in release notes about this. The release notes are linked at the top of the data.table homepage: <a href="https://gith [...]
-<p>Without turning on that new behavior, what’s happening is that the <code>j</code> expression sees objects in the calling scope. The variable <code>mycol</code> does not exist as a column name of <code>DT</code> so data.table then looked in the calling scope and found <code>mycol</code> there and returned its value <code>"x"</code>. This is correct behaviour currently. Had <code>mycol</code> been a column name, then that column’s data would have been returned. What has been d [...]
-</div>
-<div id="what-are-the-benefits-of-being-able-to-use-column-names-as-if-they-are-variables-inside-dt..." class="section level2">
-<h2><span class="header-section-number">1.6</span> What are the benefits of being able to use column names as if they are variables inside <code>DT[...]</code>?</h2>
-<p><code>j</code> doesn’t have to be just column names. You can write any R <em>expression</em> of column names directly in <code>j</code>, <em>e.g.</em>, <code>DT[ , mean(x*y/z)]</code>. The same applies to <code>i</code>, <em>e.g.</em>, <code>DT[x>1000, sum(y*z)]</code>.</p>
-<p>This runs the <code>j</code> expression on the set of rows where the <code>i</code> expression is true. You don’t even need to return data, <em>e.g.</em>, <code>DT[x>1000, plot(y, z)]</code>. You can do <code>j</code> by group simply by adding <code>by =</code>; e.g., <code>DT[x>1000, sum(y*z), by = w]</code>. This runs <code>j</code> for each group in column <code>w</code> but just over the rows where <code>x>1000</code>. By placing the 3 parts of the query (i=where, j=selec [...]
-</div>
-<div id="ok-im-starting-to-see-what-data.table-is-about-but-why-didnt-you-just-enhance-data.frame-in-r-why-does-it-have-to-be-a-new-package" class="section level2">
-<h2><span class="header-section-number">1.7</span> OK, I’m starting to see what data.table is about, but why didn’t you just enhance <code>data.frame</code> in R? Why does it have to be a new package?</h2>
-<p>As <a href="#j-num">highlighted above</a>, <code>j</code> in <code>[.data.table</code> is fundamentally different from <code>j</code> in <code>[.data.frame</code>. Even if something as simple as <code>DF[ , 1]</code> was changed in base R to return a data.frame rather than a vector, that would break existing code in many 1000’s of CRAN packages and user code. As soon as we took the step to create a new class that inherited from data.frame, we had the opportunity to change a few things [...]
+
+<h2>I assigned a variable <code>mycol = "x"</code> but then <code>DT[ , mycol]</code> returns <code>"x"</code>. How do I get it to look up the column name contained in the <code>mycol</code> variable?</h2>
+
+<p>In v1.9.8 released Nov 2016 there is an ability to turn on new behaviour: <code>options(datatable.WhenJisSymbolThenCallingScope=TRUE)</code>. It will then work as you expected, just like data.frame. If you are a new user of data.table, you should probably do this. You can place this command in your .Rprofile file so you don't have to remember again. See the long item in release notes about this. The release notes are linked at the top of the data.table homepage: <a href="https:// [...]
+
+<p>Without turning on that new behavior, what's happening is that the <code>j</code> expression sees objects in the calling scope. The variable <code>mycol</code> does not exist as a column name of <code>DT</code> so data.table then looked in the calling scope and found <code>mycol</code> there and returned its value <code>"x"</code>. This is correct behaviour currently. Had <code>mycol</code> been a column name, then that column's data would have been returned. What ha [...]
+
+<h2>What are the benefits of being able to use column names as if they are variables inside <code>DT[...]</code>?</h2>
+
+<p><code>j</code> doesn't have to be just column names. You can write any R <em>expression</em> of column names directly in <code>j</code>, <em>e.g.</em>, <code>DT[ , mean(x*y/z)]</code>. The same applies to <code>i</code>, <em>e.g.</em>, <code>DT[x>1000, sum(y*z)]</code>.</p>
+
+<p>This runs the <code>j</code> expression on the set of rows where the <code>i</code> expression is true. You don't even need to return data, <em>e.g.</em>, <code>DT[x>1000, plot(y, z)]</code>. You can do <code>j</code> by group simply by adding <code>by =</code>; e.g., <code>DT[x>1000, sum(y*z), by = w]</code>. This runs <code>j</code> for each group in column <code>w</code> but just over the rows where <code>x>1000</code>. By placing the 3 parts of the query (i=where, j=s [...]
+
+<h2>OK, I'm starting to see what data.table is about, but why didn't you just enhance <code>data.frame</code> in R? Why does it have to be a new package?</h2>
+
+<p>As <a href="#j-num">highlighted above</a>, <code>j</code> in <code>[.data.table</code> is fundamentally different from <code>j</code> in <code>[.data.frame</code>. Even if something as simple as <code>DF[ , 1]</code> was changed in base R to return a data.frame rather than a vector, that would break existing code in many 1000's of CRAN packages and user code. As soon as we took the step to create a new class that inherited from data.frame, we had the opportunity to change a few th [...]
+
<p>Furthermore, data.table <em>inherits</em> from <code>data.frame</code>. It <em>is</em> a <code>data.frame</code>, too. A data.table can be passed to any package that only accepts <code>data.frame</code> and that package can use <code>[.data.frame</code> syntax on the data.table. See <a href="http://stackoverflow.com/a/10529888/403310">this answer</a> for how that is achieved.</p>
+
<p>We <em>have</em> proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0 :</p>
+
<blockquote>
-<p><code>unique()</code> and <code>match()</code> are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matt Dowle for suggesting improvements to the way the hash code is generated in unique.c.</p>
+<p><code>unique()</code> and <code>match()</code> are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matt Dowle for suggesting improvements to the way the hash code is generated in unique.c.</p>
</blockquote>
+
<p>A second proposal was to use <code>memcpy</code> in duplicate.c, which is much faster than a for loop in C. This would improve the <em>way</em> that R copies data internally (on some measures by 13 times). The thread on r-devel is <a href="http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html">here</a>.</p>
-<p>A third more significant proposal that was accepted is that R now uses data.table’s radix sort code as from R 3.3.0 :</p>
+
+<p>A third more significant proposal that was accepted is that R now uses data.table's radix sort code as from R 3.3.0 :</p>
+
<blockquote>
<p>The radix sort algorithm and implementation from data.table (forder) replaces the previous radix (counting) sort and adds a new method for order(). Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ?sort).</p>
</blockquote>
+
<p>This was a big event for us and we celebrated until the cows came home. (Not really.)</p>
-</div>
-<div id="why-are-the-defaults-the-way-they-are-why-does-it-work-the-way-it-does" class="section level2">
-<h2><span class="header-section-number">1.8</span> Why are the defaults the way they are? Why does it work the way it does?</h2>
+
+<h2>Why are the defaults the way they are? Why does it work the way it does?</h2>
+
<p>The simple answer is because the main author originally designed it for his own use. He wanted it that way. He finds it a more natural, faster way to write code, which also executes more quickly.</p>
-</div>
-<div id="isnt-this-already-done-by-with-and-subset-in-base" class="section level2">
-<h2><span class="header-section-number">1.9</span> Isn’t this already done by <code>with()</code> and <code>subset()</code> in <code>base</code>?</h2>
+
+<h2>Isn't this already done by <code>with()</code> and <code>subset()</code> in <code>base</code>?</h2>
+
<p>Some of the features discussed so far are, yes. The package builds upon base functionality. It does the same sorts of things but with less code required and executes many times faster if used correctly.</p>
-</div>
-<div id="why-does-xy-return-all-the-columns-from-y-too-shouldnt-it-return-a-subset-of-x" class="section level2">
-<h2><span class="header-section-number">1.10</span> Why does <code>X[Y]</code> return all the columns from <code>Y</code> too? Shouldn’t it return a subset of <code>X</code>?</h2>
-<p>This was changed in v1.5.3 (Feb 2011). Since then <code>X[Y]</code> includes <code>Y</code>’s non-join columns. We refer to this feature as <em>join inherited scope</em> because not only are <code>X</code> columns available to the <code>j</code> expression, so are <code>Y</code> columns. The downside is that <code>X[Y]</code> is less efficient since every item of <code>Y</code>’s non-join columns are duplicated to match the (likely large) number of rows in <code>X</code> that match. W [...]
-</div>
-<div id="MergeDiff" class="section level2">
-<h2><span class="header-section-number">1.11</span> What is the difference between <code>X[Y]</code> and <code>merge(X, Y)</code>?</h2>
-<p><code>X[Y]</code> is a join, looking up <code>X</code>’s rows using <code>Y</code> (or <code>Y</code>’s key if it has one) as an index.</p>
-<p><code>Y[X]</code> is a join, looking up <code>Y</code>’s rows using <code>X</code> (or <code>X</code>’s key if it has one) as an index.</p>
-<p><code>merge(X,Y)</code><a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> does both ways at the same time. The number of rows of <code>X[Y]</code> and <code>Y[X]</code> usually differ, whereas the number of rows returned by <code>merge(X, Y)</code> and <code>merge(Y, X)</code> is the same.</p>
-<p><em>BUT</em> that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest <code>merge(X[ , ColsNeeded1], Y[ , ColsNeeded2])</code>, but that requires the programmer to work out which columns are needed. <code>X[Y, j]</code> in data.table does all that in one step for you. When you write <code>X[Y, sum(foo*bar)]</code>, data.table automatically ins [...]
-</div>
-<div id="anything-else-about-xy-sumfoobar" class="section level2">
-<h2><span class="header-section-number">1.12</span> Anything else about <code>X[Y, sum(foo*bar)]</code>?</h2>
-<p>This behaviour changed in v1.9.4 (Sep 2014). It now does the <code>X[Y]</code> join and then runs <code>sum(foo*bar)</code> over all the rows; i.e., <code>X[Y][ , sum(foo*bar)]</code>. It used to run <code>j</code> for each <em>group</em> of <code>X</code> that each row of <code>Y</code> matches to. That can still be done as it’s very useful but you now need to be explicit and specify <code>by = .EACHI</code>, <em>i.e.</em>, <code>X[Y, sum(foo*bar), by = .EACHI]</code>. We call this < [...]
+
+<h2>Why does <code>X[Y]</code> return all the columns from <code>Y</code> too? Shouldn't it return a subset of <code>X</code>?</h2>
+
+<p>This was changed in v1.5.3 (Feb 2011). Since then <code>X[Y]</code> includes <code>Y</code>'s non-join columns. We refer to this feature as <em>join inherited scope</em> because not only are <code>X</code> columns available to the <code>j</code> expression, so are <code>Y</code> columns. The downside is that <code>X[Y]</code> is less efficient since every item of <code>Y</code>'s non-join columns are duplicated to match the (likely large) number of rows in <code>X</code> that [...]
+
+<h2 id="MergeDiff">What is the difference between <code>X[Y]</code> and <code>merge(X, Y)</code>?</h2>
+
+<p><code>X[Y]</code> is a join, looking up <code>X</code>'s rows using <code>Y</code> (or <code>Y</code>'s key if it has one) as an index.</p>
+
+<p><code>Y[X]</code> is a join, looking up <code>Y</code>'s rows using <code>X</code> (or <code>X</code>'s key if it has one) as an index.</p>
+
+<p><code>merge(X,Y)</code><sup>1</sup> does both ways at the same time. The number of rows of <code>X[Y]</code> and <code>Y[X]</code> usually differ, whereas the number of rows returned by <code>merge(X, Y)</code> and <code>merge(Y, X)</code> is the same.</p>
+
+<p><em>BUT</em> that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest <code>merge(X[ , ColsNeeded1], Y[ , ColsNeeded2])</code>, but that requires the programmer to work out which columns are needed. <code>X[Y, j]</code> in data.table does all that in one step for you. When you write <code>X[Y, sum(foo*bar)]</code>, data.table automatically ins [...]
+
+<p><sup>1</sup> Here we mean either the <code>merge</code> <em>method</em> for data.table or the <code>merge</code> method for <code>data.frame</code> since both methods work in the same way in this respect. See <code>?merge.data.table</code> and <a href="#r-dispatch">below</a> for more information about method dispatch.</p>
+
+<h2>Anything else about <code>X[Y, sum(foo*bar)]</code>?</h2>
+
+<p>This behaviour changed in v1.9.4 (Sep 2014). It now does the <code>X[Y]</code> join and then runs <code>sum(foo*bar)</code> over all the rows; i.e., <code>X[Y][ , sum(foo*bar)]</code>. It used to run <code>j</code> for each <em>group</em> of <code>X</code> that each row of <code>Y</code> matches to. That can still be done as it's very useful but you now need to be explicit and specify <code>by = .EACHI</code>, <em>i.e.</em>, <code>X[Y, sum(foo*bar), by = .EACHI]</code>. We call th [...]
+
<p>For example, (further complicating it by using <em>join inherited scope</em>, too):</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">X =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">grp =</span> <span class="kw">c</span>(<span class="st">"a"</span>, <span class="st">"a"</span>, <span class="st">"b"</span>,
- <span class="st">"b"</span>, <span class="st">"b"</span>, <span class="st">"c"</span>, <span class="st">"c"</span>), <span class="dt">foo =</span> <span class="dv">1</span>:<span class="dv">7</span>)
-<span class="kw">setkey</span>(X, grp)
-Y =<span class="st"> </span><span class="kw">data.table</span>(<span class="kw">c</span>(<span class="st">"b"</span>, <span class="st">"c"</span>), <span class="dt">bar =</span> <span class="kw">c</span>(<span class="dv">4</span>, <span class="dv">2</span>))
+
+<pre><code class="r">X = data.table(grp = c("a", "a", "b",
+ "b", "b", "c", "c"), foo = 1:7)
+setkey(X, grp)
+Y = data.table(c("b", "c"), bar = c(4, 2))
X
-<span class="co"># grp foo</span>
-<span class="co"># 1: a 1</span>
-<span class="co"># 2: a 2</span>
-<span class="co"># 3: b 3</span>
-<span class="co"># 4: b 4</span>
-<span class="co"># 5: b 5</span>
-<span class="co"># 6: c 6</span>
-<span class="co"># 7: c 7</span>
+# grp foo
+# 1: a 1
+# 2: a 2
+# 3: b 3
+# 4: b 4
+# 5: b 5
+# 6: c 6
+# 7: c 7
Y
-<span class="co"># V1 bar</span>
-<span class="co"># 1: b 4</span>
-<span class="co"># 2: c 2</span>
-X[Y, <span class="kw">sum</span>(foo*bar)]
-<span class="co"># [1] 74</span>
-X[Y, <span class="kw">sum</span>(foo*bar), by =<span class="st"> </span>.EACHI]
-<span class="co"># grp V1</span>
-<span class="co"># 1: b 48</span>
-<span class="co"># 2: c 26</span></code></pre></div>
-</div>
-<div id="thats-nice.-how-did-you-manage-to-change-it-given-that-users-depended-on-the-old-behaviour" class="section level2">
-<h2><span class="header-section-number">1.13</span> That’s nice. How did you manage to change it given that users depended on the old behaviour?</h2>
+# V1 bar
+# 1: b 4
+# 2: c 2
+X[Y, sum(foo*bar)]
+# [1] 74
+X[Y, sum(foo*bar), by = .EACHI]
+# grp V1
+# 1: b 48
+# 2: c 26
+</code></pre>
+
+<h2>That's nice. How did you manage to change it given that users depended on the old behaviour?</h2>
+
<p>The request to change came from users. The feeling was that if a query is doing grouping then an explicit <code>by=</code> should be present for code readability reasons. An option was provided to return the old behaviour: <code>options(datatable.old.bywithoutby)</code>, by default <code>FALSE</code>. This enabled upgrading to test the other new features / bug fixes in v1.9.4, with later migration of any by-without-by queries when ready by adding <code>by=.EACHI</code> to them. We ret [...]
-<p>Of the 66 packages on CRAN or Bioconductor that depended on or import data.table at the time of releasing v1.9.4 (it is now over 300), only one was affected by the change. That could be because many packages don’t have comprehensive tests, or just that grouping by each row in <code>i</code> wasn’t being used much by downstream packages. We always test the new version with all dependent packages before release and coordinate any changes with those maintainers. So this release was quite [...]
-<p>Another compelling reason to make the change was that previously, there was no efficient way to achieve what <code>X[Y, sum(foo*bar)]</code> does now. You had to write <code>X[Y][ , sum(foo*bar)]</code>. That was suboptimal because <code>X[Y]</code> joined all the columns and passed them all to the second compound query without knowing that only <code>foo</code> and <code>bar</code> are needed. To solve that efficiency problem, extra programming effort was required: <code>X[Y, list(fo [...]
-</div>
-</div>
-<div id="general-syntax" class="section level1">
-<h1><span class="header-section-number">2</span> General Syntax</h1>
-<div id="how-can-i-avoid-writing-a-really-long-j-expression-youve-said-that-i-should-use-the-column-names-but-ive-got-a-lot-of-columns." class="section level2">
-<h2><span class="header-section-number">2.1</span> How can I avoid writing a really long <code>j</code> expression? You’ve said that I should use the column <em>names</em>, but I’ve got a lot of columns.</h2>
-<p>When grouping, the <code>j</code> expression can use column names as variables, as you know, but it can also use a reserved symbol <code>.SD</code> which refers to the <strong>S</strong>ubset of the <strong>D</strong>ata.table for each group (excluding the grouping columns). So to sum up all your columns it’s just <code>DT[ , lapply(.SD, sum), by = grp]</code>. It might seem tricky, but it’s fast to write and fast to run. Notice you don’t have to create an anonymous function. The <cod [...]
-<p>So please don’t do, for example, <code>DT[ , sum(.SD[["sales"]]), by = grp]</code>. That works but is inefficient and inelegant. <code>DT[ , sum(sales), by = grp]</code> is what was intended, and it could be 100s of times faster. If you use <em>all</em> of the data in <code>.SD</code> for each group (such as in <code>DT[ , lapply(.SD, sum), by = grp]</code>) then that’s very good usage of <code>.SD</code>. If you’re using <em>several</em> but not <em>all</em> of the columns, [...]
-</div>
-<div id="why-is-the-default-for-mult-now-all" class="section level2">
-<h2><span class="header-section-number">2.2</span> Why is the default for <code>mult</code> now <code>"all"</code>?</h2>
-<p>In v1.5.3 the default was changed to <code>"all"</code>. When <code>i</code> (or <code>i</code>’s key if it has one) has fewer columns than <code>x</code>’s key, <code>mult</code> was already set to <code>"all"</code> automatically. Changing the default makes this clearer and easier for users as it came up quite often.</p>
+
+<p>Of the 66 packages on CRAN or Bioconductor that depended on or imported data.table at the time of releasing v1.9.4 (it is now over 300), only one was affected by the change. That could be because many packages don't have comprehensive tests, or just that grouping by each row in <code>i</code> wasn't being used much by downstream packages. We always test the new version with all dependent packages before release and coordinate any changes with those maintainers. So this release w [...]
+
+<p>Another compelling reason to make the change was that previously, there was no efficient way to achieve what <code>X[Y, sum(foo*bar)]</code> does now. You had to write <code>X[Y][ , sum(foo*bar)]</code>. That was suboptimal because <code>X[Y]</code> joined all the columns and passed them all to the second compound query without knowing that only <code>foo</code> and <code>bar</code> are needed. To solve that efficiency problem, extra programming effort was required: <code>X[Y, list(fo [...]
+
+<h1>General Syntax</h1>
+
+<h2>How can I avoid writing a really long <code>j</code> expression? You've said that I should use the column <em>names</em>, but I've got a lot of columns.</h2>
+
+<p>When grouping, the <code>j</code> expression can use column names as variables, as you know, but it can also use a reserved symbol <code>.SD</code> which refers to the <strong>S</strong>ubset of the <strong>D</strong>ata.table for each group (excluding the grouping columns). So to sum up all your columns it's just <code>DT[ , lapply(.SD, sum), by = grp]</code>. It might seem tricky, but it's fast to write and fast to run. Notice you don't have to create an anonymous functi [...]
+
+<p>So please don't do, for example, <code>DT[ , sum(.SD[["sales"]]), by = grp]</code>. That works but is inefficient and inelegant. <code>DT[ , sum(sales), by = grp]</code> is what was intended, and it could be 100s of times faster. If you use <em>all</em> of the data in <code>.SD</code> for each group (such as in <code>DT[ , lapply(.SD, sum), by = grp]</code>) then that's very good usage of <code>.SD</code>. If you're using <em>several</em> but not <em>all</em> of [...]
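<p>As a minimal sketch of the three patterns discussed above (the table and its column names <code>grp</code>, <code>sales</code>, <code>units</code> and <code>cost</code> are hypothetical, chosen only for illustration), the following contrasts summing every column with <code>.SD</code>, summing a chosen subset with <code>.SDcols</code>, and naming a single column directly:</p>

```r
library(data.table)

# Hypothetical data: one grouping column and three numeric columns
DT = data.table(grp   = c("a", "a", "b"),
                sales = c(10, 20, 5),
                units = c(1L, 2L, 3L),
                cost  = c(4, 6, 1))

# All non-grouping columns: .SD holds each group's subset
all_sums = DT[ , lapply(.SD, sum), by = grp]

# Several but not all columns: restrict .SD with .SDcols
some_sums = DT[ , lapply(.SD, sum), by = grp, .SDcols = c("sales", "cost")]

# A single column: refer to it by name rather than via .SD
sales_sum = DT[ , sum(sales), by = grp]
```

<p>Which form is fastest will depend on the data, so treat this as a starting point rather than a benchmark.</p>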
+
+<h2>Why is the default for <code>mult</code> now <code>"all"</code>?</h2>
+
+<p>In v1.5.3 the default was changed to <code>"all"</code>. When <code>i</code> (or <code>i</code>'s key if it has one) has fewer columns than <code>x</code>'s key, <code>mult</code> was already set to <code>"all"</code> automatically. Changing the default makes this clearer and easier for users as it came up quite often.</p>
+
<p>In versions up to v1.3, <code>"all"</code> was slower. Internally, <code>"all"</code> was implemented by joining using <code>"first"</code>, then again from scratch using <code>"last"</code>, after which a diff between them was performed to work out the span of the matches in <code>x</code> for each row in <code>i</code>. Most often we join to single rows, though, where <code>"first"</code>, <code>"last"</code> and <code>" [...]
+
<p>In v1.4 the binary search in C was changed to branch at the deepest level to find first and last. That branch will likely occur within the same final pages of RAM so there should no longer be a speed disadvantage in defaulting <code>mult</code> to <code>"all"</code>. We warned that the default might change and made the change in v1.5.3.</p>
-<p>A future version of data.table may allow a distinction between a key and a <em>unique key</em>. Internally <code>mult = "all"</code> would perform more like <code>mult = "first"</code> when all <code>x</code>’s key columns were joined to and <code>x</code>’s key was a unique key. data.table would need checks on insert and update to make sure a unique key is maintained. An advantage of specifying a unique key would be that data.table would ensure no duplicates could [...]
-</div>
-<div id="im-using-c-in-j-and-getting-strange-results." class="section level2">
-<h2><span class="header-section-number">2.3</span> I’m using <code>c()</code> in <code>j</code> and getting strange results.</h2>
+
+<p>A future version of data.table may allow a distinction between a key and a <em>unique key</em>. Internally <code>mult = "all"</code> would perform more like <code>mult = "first"</code> when all <code>x</code>'s key columns were joined to and <code>x</code>'s key was a unique key. data.table would need checks on insert and update to make sure a unique key is maintained. An advantage of specifying a unique key would be that data.table would ensure no duplicat [...]
+
+<h2>I'm using <code>c()</code> in <code>j</code> and getting strange results.</h2>
+
<p>This is a common source of confusion. In <code>data.frame</code> you are used to, for example:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">x =</span> <span class="dv">1</span>:<span class="dv">3</span>, <span class="dt">y =</span> <span class="dv">4</span>:<span class="dv">6</span>, <span class="dt">z =</span> <span class="dv">7</span>:<span class="dv">9</span>)
+
+<pre><code class="r">DF = data.frame(x = 1:3, y = 4:6, z = 7:9)
DF
-<span class="co"># x y z</span>
-<span class="co"># 1 1 4 7</span>
-<span class="co"># 2 2 5 8</span>
-<span class="co"># 3 3 6 9</span>
-DF[ , <span class="kw">c</span>(<span class="st">"y"</span>, <span class="st">"z"</span>)]
-<span class="co"># y z</span>
-<span class="co"># 1 4 7</span>
-<span class="co"># 2 5 8</span>
-<span class="co"># 3 6 9</span></code></pre></div>
+# x y z
+# 1 1 4 7
+# 2 2 5 8
+# 3 3 6 9
+DF[ , c("y", "z")]
+# y z
+# 1 4 7
+# 2 5 8
+# 3 6 9
+</code></pre>
+
<p>which returns the two columns. In data.table you know you can use the column names directly and might try:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(DF)
-DT[ , <span class="kw">c</span>(y, z)]
-<span class="co"># [1] 4 5 6 7 8 9</span></code></pre></div>
-<p>but this returns one vector. Remember that the <code>j</code> expression is evaluated within the environment of <code>DT</code> and <code>c()</code> returns a vector. If 2 or more columns are required, use <code>list()</code> or <code>.()</code> instead:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[ , .(y, z)]
-<span class="co"># y z</span>
-<span class="co"># 1: 4 7</span>
-<span class="co"># 2: 5 8</span>
-<span class="co"># 3: 6 9</span></code></pre></div>
+
+<pre><code class="r">DT = data.table(DF)
+DT[ , c(y, z)]
+# [1] 4 5 6 7 8 9
+</code></pre>
+
+<p>but this returns one vector. Remember that the <code>j</code> expression is evaluated within the environment of <code>DT</code> and <code>c()</code> returns a vector. If 2 or more columns are required, use <code>list()</code> or <code>.()</code> instead:</p>
+
+<pre><code class="r">DT[ , .(y, z)]
+# y z
+# 1: 4 7
+# 2: 5 8
+# 3: 6 9
+</code></pre>
+
<p><code>c()</code> can be useful in a data.table too, but its behaviour is different from that in <code>[.data.frame</code>.</p>
-</div>
-<div id="i-have-built-up-a-complex-table-with-many-columns.-i-want-to-use-it-as-a-template-for-a-new-table-i.e.-create-a-new-table-with-no-rows-but-with-the-column-names-and-types-copied-from-my-table.-can-i-do-that-easily" class="section level2">
-<h2><span class="header-section-number">2.4</span> I have built up a complex table with many columns. I want to use it as a template for a new table; <em>i.e.</em>, create a new table with no rows, but with the column names and types copied from my table. Can I do that easily?</h2>
+
+<h2>I have built up a complex table with many columns. I want to use it as a template for a new table; <em>i.e.</em>, create a new table with no rows, but with the column names and types copied from my table. Can I do that easily?</h2>
+
<p>Yes. If your complex table is called <code>DT</code>, try <code>NEWDT = DT[0]</code>.</p>
-</div>
-<div id="is-a-null-data.table-the-same-as-dt0" class="section level2">
-<h2><span class="header-section-number">2.5</span> Is a null data.table the same as <code>DT[0]</code>?</h2>
-<p>No. By “null data.table” we mean the result of <code>data.table(NULL)</code> or <code>as.data.table(NULL)</code>; <em>i.e.</em>,</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">data.table</span>(<span class="ot">NULL</span>)
-<span class="co"># Null data.table (0 rows and 0 cols)</span>
-<span class="kw">data.frame</span>(<span class="ot">NULL</span>)
-<span class="co"># data frame with 0 columns and 0 rows</span>
-<span class="kw">as.data.table</span>(<span class="ot">NULL</span>)
-<span class="co"># Null data.table (0 rows and 0 cols)</span>
-<span class="kw">as.data.frame</span>(<span class="ot">NULL</span>)
-<span class="co"># data frame with 0 columns and 0 rows</span>
-<span class="kw">is.null</span>(<span class="kw">data.table</span>(<span class="ot">NULL</span>))
-<span class="co"># [1] FALSE</span>
-<span class="kw">is.null</span>(<span class="kw">data.frame</span>(<span class="ot">NULL</span>))
-<span class="co"># [1] FALSE</span></code></pre></div>
-<p>The null data.table|<code>frame</code> is <code>NULL</code> with some attributes attached, which means it’s no longer <code>NULL</code>. In R only pure <code>NULL</code> is <code>NULL</code> as tested by <code>is.null()</code>. When referring to the “null data.table” we use lower case null to help distinguish from upper case <code>NULL</code>. To test for the null data.table, use <code>length(DT) == 0</code> or <code>ncol(DT) == 0</code> (<code>length</code> is slightly faster as it’s [...]
+
+<h2>Is a null data.table the same as <code>DT[0]</code>?</h2>
+
+<p>No. By “null data.table” we mean the result of <code>data.table(NULL)</code> or <code>as.data.table(NULL)</code>; <em>i.e.</em>,</p>
+
+<pre><code class="r">data.table(NULL)
+# Null data.table (0 rows and 0 cols)
+data.frame(NULL)
+# data frame with 0 columns and 0 rows
+as.data.table(NULL)
+# Null data.table (0 rows and 0 cols)
+as.data.frame(NULL)
+# data frame with 0 columns and 0 rows
+is.null(data.table(NULL))
+# [1] FALSE
+is.null(data.frame(NULL))
+# [1] FALSE
+</code></pre>
+
+<p>The null data.table|<code>frame</code> is <code>NULL</code> with some attributes attached, which means it's no longer <code>NULL</code>. In R only pure <code>NULL</code> is <code>NULL</code> as tested by <code>is.null()</code>. When referring to the “null data.table” we use lower case null to help distinguish from upper case <code>NULL</code>. To test for the null data.table, use <code>length(DT) == 0</code> or <code>ncol(DT) == 0</code> (<code>length</code> is slightl [...]
+
<p>An <em>empty</em> data.table (<code>DT[0]</code>) has one or more columns, all of which are empty. Those empty columns still have names and types.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">a =</span> <span class="dv">1</span>:<span class="dv">3</span>, <span class="dt">b =</span> <span class="kw">c</span>(<span class="dv">4</span>, <span class="dv">5</span>, <span class="dv">6</span>), <span class="dt">d =</span> <span class="kw">c</span>(7L,8L,9L))
-DT[<span class="dv">0</span>]
-<span class="co"># Empty data.table (0 rows) of 3 cols: a,b,d</span>
-<span class="kw">sapply</span>(DT[<span class="dv">0</span>], class)
-<span class="co"># a b d </span>
-<span class="co"># "integer" "numeric" "integer"</span></code></pre></div>
-</div>
-<div id="DTremove1" class="section level2">
-<h2><span class="header-section-number">2.6</span> Why has the <code>DT()</code> alias been removed?</h2>
+
+<pre><code class="r">DT = data.table(a = 1:3, b = c(4, 5, 6), d = c(7L,8L,9L))
+DT[0]
+# Empty data.table (0 rows) of 3 cols: a,b,d
+sapply(DT[0], class)
+# a b d
+# "integer" "numeric" "integer"
+</code></pre>
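<p>The distinction between a null and an empty data.table can be checked programmatically. This is a small sketch; the object names <code>null_dt</code> and <code>empty_dt</code> are illustrative only:</p>

```r
library(data.table)

null_dt  = data.table(NULL)          # no columns at all
empty_dt = data.table(a = 1:3)[0]    # one column, zero rows

# The null data.table has no columns, so its length is 0
null_is_null_dt  = length(null_dt) == 0     # TRUE
# The empty data.table still has its column, so its length is 1
empty_is_null_dt = length(empty_dt) == 0    # FALSE
# Both have zero rows
both_zero_rows = nrow(null_dt) == 0 && nrow(empty_dt) == 0   # TRUE
# Neither is NULL in the is.null() sense
neither_is_NULL = !is.null(null_dt) && !is.null(empty_dt)    # TRUE
```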
+
+<h2 id="DTremove1">Why has the <code>DT()</code> alias been removed?</h2>
+
<p><code>DT</code> was introduced originally as a wrapper for a list of <code>j</code> expressions. Since <code>DT</code> was an alias for data.table, this was a convenient way to take care of silent recycling in cases where each item of the <code>j</code> list evaluated to different lengths. The alias was one reason grouping was slow, though.</p>
+
<p>As of v1.3, <code>list()</code> or <code>.()</code> should be passed instead to the <code>j</code> argument. These are much faster, especially when there are many groups. Internally, this was a nontrivial change. Vector recycling is now done internally, along with several other speed enhancements for grouping.</p>
-</div>
-<div id="DTremove2" class="section level2">
-<h2><span class="header-section-number">2.7</span> But my code uses <code>j = DT(...)</code> and it works. The previous FAQ says that <code>DT()</code> has been removed.</h2>
+
+<h2 id="DTremove2">But my code uses <code>j = DT(...)</code> and it works. The previous FAQ says that <code>DT()</code> has been removed.</h2>
+
<p>Then you are using a version prior to 1.5.3. Prior to 1.5.3 <code>[.data.table</code> detected use of <code>DT()</code> in the <code>j</code> and automatically replaced it with a call to <code>list()</code>. This was to help the transition for existing users.</p>
-</div>
-<div id="what-are-the-scoping-rules-for-j-expressions" class="section level2">
-<h2><span class="header-section-number">2.8</span> What are the scoping rules for <code>j</code> expressions?</h2>
+
+<h2>What are the scoping rules for <code>j</code> expressions?</h2>
+
<p>Think of the subset as an environment where all the column names are variables. When a variable <code>foo</code> is used in the <code>j</code> of a query such as <code>X[Y, sum(foo)]</code>, <code>foo</code> is looked for in the following order:</p>
-<ol style="list-style-type: decimal">
-<li>The scope of <code>X</code>’s subset; <em>i.e.</em>, <code>X</code>’s column names.</li>
-<li>The scope of each row of <code>Y</code>; <em>i.e.</em>, <code>Y</code>’s column names (<em>join inherited scope</em>)</li>
+
+<ol>
+<li>The scope of <code>X</code>'s subset; <em>i.e.</em>, <code>X</code>'s column names.</li>
+<li>The scope of each row of <code>Y</code>; <em>i.e.</em>, <code>Y</code>'s column names (<em>join inherited scope</em>)</li>
<li>The scope of the calling frame; <em>e.g.</em>, the line that appears before the data.table query.</li>
<li>Exercise for reader: does it then ripple up the calling frames, or go straight to <code>globalenv()</code>?</li>
<li>The global environment</li>
</ol>
+
<p>This is <em>lexical scoping</em> as explained in <a href="https://cran.r-project.org/doc/FAQ/R-FAQ.html#Lexical-scoping">R FAQ 3.3.1</a>. The environment in which the function was created is not relevant, though, because there is <em>no function</em>. No anonymous <em>function</em> is passed to <code>j</code>. Instead, an anonymous <em>body</em> is passed to <code>j</code>; for example,</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> <span class="kw">rep</span>(<span class="kw">c</span>(<span class="st">"a"</span>, <span class="st">"b"</span>), <span class="kw">c</span>(<span class="dv">2</span>, <span class="dv">3</span>)), <span class="dt">y =</span> <span class="dv">1</span>:<span class="dv">5</span>)
+
+<pre><code class="r">DT = data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)
DT
-<span class="co"># x y</span>
-<span class="co"># 1: a 1</span>
-<span class="co"># 2: a 2</span>
-<span class="co"># 3: b 3</span>
-<span class="co"># 4: b 4</span>
-<span class="co"># 5: b 5</span>
-DT[ , {z =<span class="st"> </span><span class="kw">sum</span>(y); z +<span class="st"> </span><span class="dv">3</span>}, by =<span class="st"> </span>x]
-<span class="co"># x V1</span>
-<span class="co"># 1: a 6</span>
-<span class="co"># 2: b 15</span></code></pre></div>
+# x y
+# 1: a 1
+# 2: a 2
+# 3: b 3
+# 4: b 4
+# 5: b 5
+DT[ , {z = sum(y); z + 3}, by = x]
+# x V1
+# 1: a 6
+# 2: b 15
+</code></pre>
+
<p>Some programming languages call this a <em>lambda</em>.</p>
-</div>
-<div id="j-trace" class="section level2">
-<h2><span class="header-section-number">2.9</span> Can I trace the <code>j</code> expression as it runs through the groups?</h2>
+
+<h2 id="j-trace">Can I trace the <code>j</code> expression as it runs through the groups?</h2>
+
<p>Try something like this:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[ , {
- <span class="kw">cat</span>(<span class="st">"Objects:"</span>, <span class="kw">paste</span>(<span class="kw">objects</span>(), <span class="dt">collapse =</span> <span class="st">","</span>), <span class="st">"</span><span class="ch">\n</span><span class="st">"</span>)
- <span class="kw">cat</span>(<span class="st">"Trace: x="</span>, <span class="kw">as.character</span>(x), <span class="st">" y="</span>, y, <span class="st">"</span><span class="ch">\n</span><span class="st">"</span>)
- <span class="kw">sum</span>(y)},
- by =<span class="st"> </span>x]
-<span class="co"># Objects: Cfastmean,mean,print,strptime,x,y </span>
-<span class="co"># Trace: x= a y= 1 2 </span>
-<span class="co"># Objects: Cfastmean,mean,print,strptime,x,y </span>
-<span class="co"># Trace: x= b y= 3 4 5</span>
-<span class="co"># x V1</span>
-<span class="co"># 1: a 3</span>
-<span class="co"># 2: b 12</span></code></pre></div>
-</div>
-<div id="inside-each-group-why-are-the-group-variables-length-1" class="section level2">
-<h2><span class="header-section-number">2.10</span> Inside each group, why are the group variables length-1?</h2>
-<p><a href="#j-trace">Above</a>, <code>x</code> is a grouping variable and (as from v1.6.1) has <code>length</code> 1 (if inspected or used in <code>j</code>). It’s for efficiency and convenience. Therefore, there is no difference between the following two statements:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[ , .(<span class="dt">g =</span> <span class="dv">1</span>, <span class="dt">h =</span> <span class="dv">2</span>, <span class="dt">i =</span> <span class="dv">3</span>, <span class="dt">j =</span> <span class="dv">4</span>, <span class="dt">repeatgroupname =</span> x, <span class="kw">sum</span>(y)), by =<span class="st"> </span>x]
-<span class="co"># x g h i j repeatgroupname V6</span>
-<span class="co"># 1: a 1 2 3 4 a 3</span>
-<span class="co"># 2: b 1 2 3 4 b 12</span>
-DT[ , .(<span class="dt">g =</span> <span class="dv">1</span>, <span class="dt">h =</span> <span class="dv">2</span>, <span class="dt">i =</span> <span class="dv">3</span>, <span class="dt">j =</span> <span class="dv">4</span>, <span class="dt">repeatgroupname =</span> x[<span class="dv">1</span>], <span class="kw">sum</span>(y)), by =<span class="st"> </span>x]
-<span class="co"># x g h i j repeatgroupname V6</span>
-<span class="co"># 1: a 1 2 3 4 a 3</span>
-<span class="co"># 2: b 1 2 3 4 b 12</span></code></pre></div>
+
+<pre><code class="r">DT[ , {
+ cat("Objects:", paste(objects(), collapse = ","), "\n")
+ cat("Trace: x=", as.character(x), " y=", y, "\n")
+ sum(y)},
+ by = x]
+# Objects: Cfastmean,mean,print,strptime,x,y
+# Trace: x= a y= 1 2
+# Objects: Cfastmean,mean,print,strptime,x,y
+# Trace: x= b y= 3 4 5
+# x V1
+# 1: a 3
+# 2: b 12
+</code></pre>
+
+<h2>Inside each group, why are the group variables length-1?</h2>
+
+<p><a href="#j-trace">Above</a>, <code>x</code> is a grouping variable and (as from v1.6.1) has <code>length</code> 1 (if inspected or used in <code>j</code>). It's for efficiency and convenience. Therefore, there is no difference between the following two statements:</p>
+
+<pre><code class="r">DT[ , .(g = 1, h = 2, i = 3, j = 4, repeatgroupname = x, sum(y)), by = x]
+# x g h i j repeatgroupname V6
+# 1: a 1 2 3 4 a 3
+# 2: b 1 2 3 4 b 12
+DT[ , .(g = 1, h = 2, i = 3, j = 4, repeatgroupname = x[1], sum(y)), by = x]
+# x g h i j repeatgroupname V6
+# 1: a 1 2 3 4 a 3
+# 2: b 1 2 3 4 b 12
+</code></pre>
+
<p>If you need the size of the current group, use <code>.N</code> rather than calling <code>length()</code> on any column.</p>
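<p>A short sketch of <code>.N</code> in use, reusing the two-column <code>DT</code> from the scoping example; the output column names <code>n</code> and <code>total</code> are chosen here for illustration:</p>

```r
library(data.table)

DT = data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

# .N is the number of rows in the current group.
# length(x) would not work here, because the grouping variable x has length 1 inside j.
sizes = DT[ , .(n = .N, total = sum(y)), by = x]
# n is 2 for group "a" and 3 for group "b"; total is 3 and 12
```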
-</div>
-<div id="only-the-first-10-rows-are-printed-how-do-i-print-more" class="section level2">
-<h2><span class="header-section-number">2.11</span> Only the first 10 rows are printed, how do I print more?</h2>
-<p>There are two things happening here. First, if the number of rows in a data.table are large (<code>> 100</code> by default), then a summary of the data.table is printed to the console by default. Second, the summary of a large data.table is printed by taking the top and bottom <code>n</code> (<code>= 5</code> by default) rows of the data.table and only printing those. Both of these parameters (when to trigger a summary and how much of a table to use as a summary) are configurable b [...]
+
+<h2>Only the first 10 rows are printed, how do I print more?</h2>
+
+<p>There are two things happening here. First, if the number of rows in a data.table is large (<code>> 100</code> by default), then a summary of the data.table is printed to the console by default. Second, the summary of a large data.table is printed by taking the top and bottom <code>n</code> (<code>= 5</code> by default) rows of the data.table and only printing those. Both of these parameters (when to trigger a summary and how much of a table to use as a summary) are configurable b [...]
+
<p>For instance, to print a summary only when a data.table has more than 50 rows, you could set <code>options(datatable.print.nrows = 50)</code>. To disable the summary-by-default completely, you could set <code>options(datatable.print.nrows = Inf)</code>. You could also call <code>print</code> directly, as in <code>print(your.data.table, nrows = Inf)</code>.</p>
+
<p>If you want to show more than just the top (and bottom) 10 rows of a data.table summary (say you like 20), set <code>options(datatable.print.topn = 20)</code>, for example. Again, you could also just call <code>print</code> directly, as in <code>print(your.data.table, topn = 20)</code>.</p>
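<p>The per-call and session-wide settings described above can be tried side by side. This is a sketch on a made-up 200-row table (<code>big</code>, <code>id</code> and <code>value</code> are hypothetical names); it restores the previous options at the end so the session is left unchanged:</p>

```r
library(data.table)

# Hypothetical table large enough to trigger the summary (> 100 rows)
big = data.table(id = 1:200, value = (1:200)^2)

print(big)                 # summary: top and bottom rows only
print(big, topn = 20)      # summary with 20 rows at each end
print(big, nrows = Inf)    # no summary: every row is printed

# Session-wide equivalents; options() returns the previous values
old = options(datatable.print.nrows = 50, datatable.print.topn = 20)
print(big)                 # summarised using the new defaults
options(old)               # restore the earlier settings
```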
-</div>
-<div id="with-an-xy-join-what-if-x-contains-a-column-called-y" class="section level2">
-<h2><span class="header-section-number">2.12</span> With an <code>X[Y]</code> join, what if <code>X</code> contains a column called <code>"Y"</code>?</h2>
+
+<h2>With an <code>X[Y]</code> join, what if <code>X</code> contains a column called <code>"Y"</code>?</h2>
+
<p>When <code>i</code> is a single name such as <code>Y</code> it is evaluated in the calling frame. In all other cases such as calls to <code>.()</code> or other expressions, <code>i</code> is evaluated within the scope of <code>X</code>. This facilitates easy <em>self-joins</em> such as <code>X[J(unique(colA)), mult = "first"]</code>.</p>
-</div>
-<div id="xzy-is-failing-because-x-contains-a-column-y.-id-like-it-to-use-the-table-y-in-calling-scope." class="section level2">
-<h2><span class="header-section-number">2.13</span> <code>X[Z[Y]]</code> is failing because <code>X</code> contains a column <code>"Y"</code>. I’d like it to use the table <code>Y</code> in calling scope.</h2>
+
+<h2><code>X[Z[Y]]</code> is failing because <code>X</code> contains a column <code>"Y"</code>. I'd like it to use the table <code>Y</code> in calling scope.</h2>
+
<p>The <code>Z[Y]</code> part is not a single name so that is evaluated within the frame of <code>X</code> and the problem occurs. Try <code>tmp = Z[Y]; X[tmp]</code>. This is robust to <code>X</code> containing a column <code>"tmp"</code> because <code>tmp</code> is a single name. If you often encounter conflicts of this type, one simple solution may be to name all tables in uppercase and all column names in lowercase, or some similar scheme.</p>
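<p>The workaround can be sketched as follows. The tables <code>X</code>, <code>Z</code> and <code>Y</code> and their columns are hypothetical, chosen so that <code>X</code> has a column called <code>"Y"</code>.</p>

```r
library(data.table)

# Hypothetical tables; X deliberately contains a column named "Y"
X = data.table(id = 1:3, Y = c("a", "b", "c"), key = "id")
Z = data.table(id = 1:3, z = 4:6, key = "id")
Y = data.table(id = 2:3, key = "id")

# X[Z[Y]] would evaluate Z[Y] inside X, where Y is the character column.
# Evaluating the inner join first and passing a single name avoids that:
tmp = Z[Y]    # single names are looked up in the calling frame
ans = X[tmp]
ans
```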
-</div>
-<div id="can-you-explain-further-why-data.table-is-inspired-by-ab-syntax-in-base" class="section level2">
-<h2><span class="header-section-number">2.14</span> Can you explain further why data.table is inspired by <code>A[B]</code> syntax in <code>base</code>?</h2>
+
+<h2>Can you explain further why data.table is inspired by <code>A[B]</code> syntax in <code>base</code>?</h2>
+
<p>Consider <code>A[B]</code> syntax using an example matrix <code>A</code> :</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">A =<span class="st"> </span><span class="kw">matrix</span>(<span class="dv">1</span>:<span class="dv">12</span>, <span class="dt">nrow =</span> <span class="dv">4</span>)
+
+<pre><code class="r">A = matrix(1:12, nrow = 4)
A
-<span class="co"># [,1] [,2] [,3]</span>
-<span class="co"># [1,] 1 5 9</span>
-<span class="co"># [2,] 2 6 10</span>
-<span class="co"># [3,] 3 7 11</span>
-<span class="co"># [4,] 4 8 12</span></code></pre></div>
+# [,1] [,2] [,3]
+# [1,] 1 5 9
+# [2,] 2 6 10
+# [3,] 3 7 11
+# [4,] 4 8 12
+</code></pre>
+
<p>To obtain cells <code>(1, 2) = 5</code> and <code>(3, 3) = 11</code> many users (we believe) may try this first :</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">A[<span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">3</span>), <span class="kw">c</span>(<span class="dv">2</span>, <span class="dv">3</span>)]
-<span class="co"># [,1] [,2]</span>
-<span class="co"># [1,] 5 9</span>
-<span class="co"># [2,] 7 11</span></code></pre></div>
+
+<pre><code class="r">A[c(1, 3), c(2, 3)]
+# [,1] [,2]
+# [1,] 5 9
+# [2,] 7 11
+</code></pre>
+
<p>However, this returns the union of those rows and columns. To reference the cells, a 2-column matrix is required. <code>?Extract</code> says :</p>
+
<blockquote>
<p>When indexing arrays by <code>[</code> a single argument <code>i</code> can be a matrix with as many columns as there are dimensions of <code>x</code>; the result is then a vector with elements corresponding to the sets of indices in each row of <code>i</code>.</p>
</blockquote>
-<p>Let’s try again.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">B =<span class="st"> </span><span class="kw">cbind</span>(<span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">3</span>), <span class="kw">c</span>(<span class="dv">2</span>, <span class="dv">3</span>))
+
+<p>Let's try again.</p>
+
+<pre><code class="r">B = cbind(c(1, 3), c(2, 3))
B
-<span class="co"># [,1] [,2]</span>
-<span class="co"># [1,] 1 2</span>
-<span class="co"># [2,] 3 3</span>
+# [,1] [,2]
+# [1,] 1 2
+# [2,] 3 3
A[B]
-<span class="co"># [1] 5 11</span></code></pre></div>
+# [1] 5 11
+</code></pre>
+
<p>A matrix is a 2-dimensional structure with row names and column names. Can we do the same with names?</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">rownames</span>(A) =<span class="st"> </span>letters[<span class="dv">1</span>:<span class="dv">4</span>]
-<span class="kw">colnames</span>(A) =<span class="st"> </span>LETTERS[<span class="dv">1</span>:<span class="dv">3</span>]
+
+<pre><code class="r">rownames(A) = letters[1:4]
+colnames(A) = LETTERS[1:3]
A
-<span class="co"># A B C</span>
-<span class="co"># a 1 5 9</span>
-<span class="co"># b 2 6 10</span>
-<span class="co"># c 3 7 11</span>
-<span class="co"># d 4 8 12</span>
-B =<span class="st"> </span><span class="kw">cbind</span>(<span class="kw">c</span>(<span class="st">"a"</span>, <span class="st">"c"</span>), <span class="kw">c</span>(<span class="st">"B"</span>, <span class="st">"C"</span>))
+# A B C
+# a 1 5 9
+# b 2 6 10
+# c 3 7 11
+# d 4 8 12
+B = cbind(c("a", "c"), c("B", "C"))
A[B]
-<span class="co"># [1] 5 11</span></code></pre></div>
+# [1] 5 11
+</code></pre>
+
<p>So yes, we can. Can we do the same with a <code>data.frame</code>?</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">A =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">A =</span> <span class="dv">1</span>:<span class="dv">4</span>, <span class="dt">B =</span> letters[<span class="dv">11</span>:<span class="dv">14</span>], <span class="dt">C =</span> pi*<span class="dv">1</span>:<span class="dv">4</span>)
-<span class="kw">rownames</span>(A) =<span class="st"> </span>letters[<span class="dv">1</span>:<span class="dv">4</span>]
+
+<pre><code class="r">A = data.frame(A = 1:4, B = letters[11:14], C = pi*1:4)
+rownames(A) = letters[1:4]
A
-<span class="co"># A B C</span>
-<span class="co"># a 1 k 3.141593</span>
-<span class="co"># b 2 l 6.283185</span>
-<span class="co"># c 3 m 9.424778</span>
-<span class="co"># d 4 n 12.566371</span>
+# A B C
+# a 1 k 3.141593
+# b 2 l 6.283185
+# c 3 m 9.424778
+# d 4 n 12.566371
B
-<span class="co"># [,1] [,2]</span>
-<span class="co"># [1,] "a" "B" </span>
-<span class="co"># [2,] "c" "C"</span>
+# [,1] [,2]
+# [1,] "a" "B"
+# [2,] "c" "C"
A[B]
-<span class="co"># [1] "k" " 9.424778"</span></code></pre></div>
-<p>But, notice that the result was coerced to <code>character.</code> R coerced <code>A</code> to <code>matrix</code> first so that the syntax could work, but the result isn’t ideal. Let’s try making <code>B</code> a <code>data.frame</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">B =<span class="st"> </span><span class="kw">data.frame</span>(<span class="kw">c</span>(<span class="st">"a"</span>, <span class="st">"c"</span>), <span class="kw">c</span>(<span class="st">"B"</span>, <span class="st">"C"</span>))
-<span class="kw">cat</span>(<span class="kw">try</span>(A[B], <span class="dt">silent =</span> <span class="ot">TRUE</span>))
-<span class="co"># Error in `[.default`(A, B) : invalid subscript type 'list'</span></code></pre></div>
-<p>So we can’t subset a <code>data.frame</code> by a <code>data.frame</code> in base R. What if we want row names and column names that aren’t <code>character</code> but <code>integer</code> or <code>float</code>? What if we want more than 2 dimensions of mixed types? Enter data.table.</p>
+# [1] "k" " 9.424778"
+</code></pre>
+
+<p>But, notice that the result was coerced to <code>character</code>. R coerced <code>A</code> to <code>matrix</code> first so that the syntax could work, but the result isn&#39;t ideal. Let&#39;s try making <code>B</code> a <code>data.frame</code>.</p>
+
+<pre><code class="r">B = data.frame(c("a", "c"), c("B", "C"))
+cat(try(A[B], silent = TRUE))
+# Error in `[.default`(A, B) : invalid subscript type 'list'
+</code></pre>
+
+<p>So we can't subset a <code>data.frame</code> by a <code>data.frame</code> in base R. What if we want row names and column names that aren't <code>character</code> but <code>integer</code> or <code>float</code>? What if we want more than 2 dimensions of mixed types? Enter data.table.</p>
+
<p>Furthermore, matrices, especially sparse matrices, are often stored in a 3-column tuple: <code>(i, j, value)</code>. This can be thought of as a key-value pair where <code>i</code> and <code>j</code> form a 2-column key. If we have more than one value, perhaps of different types, it might look like <code>(i, j, val1, val2, val3, ...)</code>. This looks very much like a <code>data.frame</code>. Hence data.table extends <code>data.frame</code> so that a <code>data.frame</code> <code>X</ [...]
-</div>
-<div id="can-base-be-changed-to-do-this-then-rather-than-a-new-package" class="section level2">
-<h2><span class="header-section-number">2.15</span> Can base be changed to do this then, rather than a new package?</h2>
-<p><code>data.frame</code> is used <em>everywhere</em> and so it is very difficult to make <em>any</em> changes to it. data.table <em>inherits</em> from <code>data.frame</code>. It <em>is</em> a <code>data.frame</code>, too. A data.table <em>can</em> be passed to any package that <em>only</em> accepts <code>data.frame</code>. When that package uses <code>[.data.frame</code> syntax on the data.table, it works. It works because <code>[.data.table</code> looks to see where it was called fro [...]
-</div>
-<div id="ive-heard-that-data.table-syntax-is-analogous-to-sql." class="section level2">
-<h2><span class="header-section-number">2.16</span> I’ve heard that data.table syntax is analogous to SQL.</h2>
+
+<h2>Can base be changed to do this then, rather than a new package?</h2>
+
+<p><code>data.frame</code> is used <em>everywhere</em> and so it is very difficult to make <em>any</em> changes to it.
+data.table <em>inherits</em> from <code>data.frame</code>. It <em>is</em> a <code>data.frame</code>, too. A data.table <em>can</em> be passed to any package that <em>only</em> accepts <code>data.frame</code>. When that package uses <code>[.data.frame</code> syntax on the data.table, it works. It works because <code>[.data.table</code> looks to see where it was called from. If it was called from such a package, <code>[.data.table</code> diverts to <code>[.data.frame</code>.</p>
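<p>The inheritance can be checked directly. A minimal sketch, using an illustrative table <code>DT</code>:</p>

```r
library(data.table)

DT = data.table(a = 1:3, b = letters[1:3])

class(DT)          # "data.table" "data.frame"
is.data.frame(DT)  # TRUE

# Functions written for data.frame accept it as-is
nrow(DT)
summary(DT)
```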
+
+<h2>I've heard that data.table syntax is analogous to SQL.</h2>
+
<p>Yes :</p>
+
<ul>
-<li><code>i</code> <span class="math inline">\(\Leftrightarrow\)</span> where</li>
-<li><code>j</code> <span class="math inline">\(\Leftrightarrow\)</span> select</li>
-<li><code>:=</code> <span class="math inline">\(\Leftrightarrow\)</span> update</li>
-<li><code>by</code> <span class="math inline">\(\Leftrightarrow\)</span> group by</li>
-<li><code>i</code> <span class="math inline">\(\Leftrightarrow\)</span> order by (in compound syntax)</li>
-<li><code>i</code> <span class="math inline">\(\Leftrightarrow\)</span> having (in compound syntax)</li>
-<li><code>nomatch = NA</code> <span class="math inline">\(\Leftrightarrow\)</span> outer join</li>
-<li><code>nomatch = 0L</code> <span class="math inline">\(\Leftrightarrow\)</span> inner join</li>
-<li><code>mult = "first"|"last"</code> <span class="math inline">\(\Leftrightarrow\)</span> N/A because SQL is inherently unordered</li>
-<li><code>roll = TRUE</code> <span class="math inline">\(\Leftrightarrow\)</span> N/A because SQL is inherently unordered</li>
+<li><code>i</code> \(\Leftrightarrow\) where</li>
+<li><code>j</code> \(\Leftrightarrow\) select</li>
+<li><code>:=</code> \(\Leftrightarrow\) update</li>
+<li><code>by</code> \(\Leftrightarrow\) group by</li>
+<li><code>i</code> \(\Leftrightarrow\) order by (in compound syntax)</li>
+<li><code>i</code> \(\Leftrightarrow\) having (in compound syntax)</li>
+<li><code>nomatch = NA</code> \(\Leftrightarrow\) outer join</li>
+<li><code>nomatch = 0L</code> \(\Leftrightarrow\) inner join</li>
+<li><code>mult = "first"|"last"</code> \(\Leftrightarrow\) N/A because SQL is inherently unordered</li>
+<li><code>roll = TRUE</code> \(\Leftrightarrow\) N/A because SQL is inherently unordered</li>
</ul>
+
<p>The general form is :</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[where, select|update, group by][order by][...] ... [...]</code></pre></div>
-<p>A key advantage of column vectors in R is that they are <em>ordered</em>, unlike SQL<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>. We can use ordered functions in <code>data.table queries such as</code>diff()` and we can use <em>any</em> R function from any package, not just the functions that are defined in SQL. A disadvantage is that R objects must fit in memory, but with several R packages such as ff, bigmemory, mmap and indexing, this is changing.</p>
-</div>
-<div id="SmallerDiffs" class="section level2">
-<h2><span class="header-section-number">2.17</span> What are the smaller syntax differences between <code>data.frame</code> and data.table</h2>
+
+<pre><code class="r">DT[where, select|update, group by][order by][...] ... [...]
+</code></pre>
+
+<p>A key advantage of column vectors in R is that they are <em>ordered</em>, unlike SQL<sup>2</sup>. We can use ordered functions in data.table queries, such as <code>diff()</code>, and we can use <em>any</em> R function from any package, not just the functions that are defined in SQL. A disadvantage is that R objects must fit in memory, but with several R packages such as ff, bigmemory, mmap and indexing, this is changing.</p>
+
+<p><sup>2</sup> It may be a surprise to learn that <code>select top 10 * from ...</code> does <em>not</em> reliably return the same rows over time in SQL. You do need to include an <code>order by</code> clause, or use a clustered index to guarantee row order; <em>i.e.</em>, SQL is inherently unordered.</p>
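<p>As a concrete instance of the general form, the sketch below filters, aggregates by group and then orders the result in a second pair of brackets. The table and column names are invented for illustration.</p>

```r
library(data.table)

DT = data.table(grp = c("a", "b", "a", "b", "c"),
                v   = c(1, 2, 3, 4, 5))

# Roughly: SELECT grp, SUM(v) AS total FROM DT WHERE v > 1
#          GROUP BY grp ORDER BY total DESC
ans = DT[v > 1, .(total = sum(v)), by = grp][order(-total)]
ans
#    grp total
# 1:   b     6
# 2:   c     5
# 3:   a     3
```

<p>The ordering step is a separate query on the result of the first, which is what "compound syntax" refers to in the list above.</p>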
+
+<h2 id="SmallerDiffs">What are the smaller syntax differences between <code>data.frame</code> and data.table?</h2>
+
<ul>
<li><code>DT[3]</code> refers to the 3rd <em>row</em>, but <code>DF[3]</code> refers to the 3rd <em>column</em></li>
<li><code>DT[3, ] == DT[3]</code>, but <code>DF[ , 3] == DF[3]</code> (somewhat confusingly in data.frame, whereas data.table is consistent)</li>
@@ -500,309 +645,311 @@ A[B]
<li><code>DT[ , "colA"][[1]] == DF[ , "colA"]</code>.</li>
<li><code>DT[ , colA] == DF[ , "colA"]</code> (currently in data.table v1.9.8 but is about to change, see release notes)</li>
<li><code>DT[ , list(colA)] == DF[ , "colA", drop = FALSE]</code></li>
-<li><code>DT[NA]</code> returns 1 row of <code>NA</code>, but <code>DF[NA]</code> returns an entire copy of <code>DF</code> containing <code>NA</code> throughout. The symbol <code>NA</code> is type <code>logical</code> in R and is therefore recycled by <code>[.data.frame</code>. The user’s intention was probably <code>DF[NA_integer_]</code>. <code>[.data.table</code> diverts to this probable intention automatically, for convenience.</li>
-<li><code>DT[c(TRUE, NA, FALSE)]</code> treats the <code>NA</code> as <code>FALSE</code>, but <code>DF[c(TRUE, NA, FALSE)]</code> returns <code>NA</code> rows for each <code>NA</code></li>
+<li><code>DT[NA]</code> returns 1 row of <code>NA</code>, but <code>DF[NA]</code> returns an entire copy of <code>DF</code> containing <code>NA</code> throughout. The symbol <code>NA</code> is type <code>logical</code> in R and is therefore recycled by <code>[.data.frame</code>. The user's intention was probably <code>DF[NA_integer_]</code>. <code>[.data.table</code> diverts to this probable intention automatically, for convenience.</li>
+<li><code>DT[c(TRUE, NA, FALSE)]</code> treats the <code>NA</code> as <code>FALSE</code>, but <code>DF[c(TRUE, NA, FALSE)]</code> returns
+<code>NA</code> rows for each <code>NA</code></li>
<li><code>DT[ColA == ColB]</code> is simpler than <code>DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]</code></li>
<li><code>data.frame(list(1:2, "k", 1:4))</code> creates 3 columns, data.table creates one <code>list</code> column.</li>
<li><code>check.names</code> is by default <code>TRUE</code> in <code>data.frame</code> but <code>FALSE</code> in data.table, for convenience.</li>
<li><code>stringsAsFactors</code> is by default <code>TRUE</code> in <code>data.frame</code> but <code>FALSE</code> in data.table, for efficiency. Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit of converting to <code>factor</code>.</li>
<li>Atomic vectors in <code>list</code> columns are collapsed when printed using <code>", "</code> in <code>data.frame</code>, but <code>","</code> in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.</li>
</ul>
+
<p>In <code>[.data.frame</code> we very often set <code>drop = FALSE</code>. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column <code>data.frame</code>. In <code>[.data.table</code> we took the opportunity to make it consistent and dropped <code>drop</code>.</p>
+
<p>When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.</p>
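<p>A few of the differences listed above can be seen side by side. This is a small sketch on made-up data, not an exhaustive check of the list.</p>

```r
library(data.table)

DF = data.frame(a = 1:3, b = 4:6, c = 7:9)
DT = as.data.table(DF)

dim(DF[3])  # 3 1 : the 3rd column
dim(DT[3])  # 1 3 : the 3rd row

# NA in a logical subscript
nrow(DF[c(TRUE, NA, FALSE), ])  # 2 : one real row plus one row of NA
nrow(DT[c(TRUE, NA, FALSE)])    # 1 : NA is treated as FALSE

# A bare NA subscript
nrow(DT[NA])                    # 1 : a single row of NA
```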
-</div>
-<div id="im-using-j-for-its-side-effect-only-but-im-still-getting-data-returned.-how-do-i-stop-that" class="section level2">
-<h2><span class="header-section-number">2.18</span> I’m using <code>j</code> for its side effect only, but I’m still getting data returned. How do I stop that?</h2>
-<p>In this case <code>j</code> can be wrapped with <code>invisible()</code>; e.g., <code>DT[ , invisible(hist(colB)), by = colA]</code><a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a></p>
-</div>
-<div id="why-does-.data.table-now-have-a-drop-argument-from-v1.5" class="section level2">
-<h2><span class="header-section-number">2.19</span> Why does <code>[.data.table</code> now have a <code>drop</code> argument from v1.5?</h2>
+
+<h2>I'm using <code>j</code> for its side effect only, but I'm still getting data returned. How do I stop that?</h2>
+
+<p>In this case <code>j</code> can be wrapped with <code>invisible()</code>; e.g., <code>DT[ , invisible(hist(colB)), by = colA]</code><sup>3</sup>.</p>
+
+<p><sup>3</sup> <em>e.g.</em>, <code>hist()</code> returns the breakpoints in addition to plotting to the graphics device.</p>
+
+<h2>Why does <code>[.data.table</code> now have a <code>drop</code> argument from v1.5?</h2>
+
<p>So that data.table can inherit from <code>data.frame</code> without using <code>...</code>. If we used <code>...</code> then invalid argument names would not be caught.</p>
+
<p>The <code>drop</code> argument is never used by <code>[.data.table</code>. It is a placeholder for non-data.table-aware packages when they use the <code>[.data.frame</code> syntax directly on a data.table.</p>
-</div>
-<div id="rolling-joins-are-cool-and-very-fast-was-that-hard-to-program" class="section level2">
-<h2><span class="header-section-number">2.20</span> Rolling joins are cool and very fast! Was that hard to program?</h2>
+
+<h2>Rolling joins are cool and very fast! Was that hard to program?</h2>
+
<p>The prevailing row on or before the <code>i</code> row is the final row the binary search tests anyway. So <code>roll = TRUE</code> is essentially just a switch in the binary search C code to return that row.</p>
-</div>
-<div id="why-does-dti-col-value-return-the-whole-of-dt-i-expected-either-no-visible-value-consistent-with---or-a-message-or-return-value-containing-how-many-rows-were-updated.-it-isnt-obvious-that-the-data-has-indeed-been-updated-by-reference." class="section level2">
-<h2><span class="header-section-number">2.21</span> Why does <code>DT[i, col := value]</code> return the whole of <code>DT</code>? I expected either no visible value (consistent with <code><-</code>), or a message or return value containing how many rows were updated. It isn’t obvious that the data has indeed been updated by reference.</h2>
+
+<h2>Why does <code>DT[i, col := value]</code> return the whole of <code>DT</code>? I expected either no visible value (consistent with <code><-</code>), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference.</h2>
+
<p>This has changed in v1.8.3 to meet your expectations. Please upgrade.</p>
+
<p>The whole of <code>DT</code> is returned (now invisibly) so that compound syntax can work; <em>e.g.</em>, <code>DT[i, done := TRUE][ , sum(done)]</code>. The number of rows updated is returned when <code>verbose</code> is <code>TRUE</code>, either on a per-query basis or globally using <code>options(datatable.verbose = TRUE)</code>.</p>
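<p>The compound form mentioned above can be sketched like this. The table is illustrative; the <code>verbose</code> argument is the one named in the text, and its exact output is not shown here.</p>

```r
library(data.table)

DT = data.table(id = 1:5, done = FALSE)

# := updates DT by reference and returns it invisibly,
# so a second query can be chained straight onto the result
n_done = DT[id <= 2, done := TRUE][ , sum(done)]
n_done  # 2

# Per-query verbosity reports details of the update
DT[id > 4, done := TRUE, verbose = TRUE]
```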
-</div>
-<div id="ok-thanks.-what-was-so-difficult-about-the-result-of-dti-col-value-being-returned-invisibly" class="section level2">
-<h2><span class="header-section-number">2.22</span> OK, thanks. What was so difficult about the result of <code>DT[i, col := value]</code> being returned invisibly?</h2>
-<p>R internally forces visibility on for <code>[</code>. The value of FunTab’s eval column (see <a href="https://github.com/wch/r-source/blob/trunk/src/main/names.c">src/main/names.c</a>) for <code>[</code> is <code>0</code> meaning “force <code>R_Visible</code> on” (see <a href="https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Autoprinting">R-Internals section 1.6</a> ). Therefore, when we tried <code>invisible()</code> or setting <code>R_Visible</code> to <code>0</code> dir [...]
+
+<h2>OK, thanks. What was so difficult about the result of <code>DT[i, col := value]</code> being returned invisibly?</h2>
+
+<p>R internally forces visibility on for <code>[</code>. The value of FunTab's eval column (see <a href="https://github.com/wch/r-source/blob/trunk/src/main/names.c">src/main/names.c</a>) for <code>[</code> is <code>0</code> meaning “force <code>R_Visible</code> on” (see <a href="https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Autoprinting">R-Internals section 1.6</a> ). Therefore, when we tried <code>invisible()</code> or setting <code>R_Visible</code> to <c [...]
+
<p>To solve this problem, the key was to stop trying to stop the print method running after a <code>:=</code>. Instead, inside <code>:=</code> we now (from v1.8.3) set a global flag which the print method uses to know whether to actually print or not.</p>
-</div>
-<div id="why-do-i-have-to-type-dt-sometimes-twice-after-using-to-print-the-result-to-console" class="section level2">
-<h2><span class="header-section-number">2.23</span> Why do I have to type <code>DT</code> sometimes twice after using <code>:=</code> to print the result to console?</h2>
-<p>This is an unfortunate downside to get <a href="https://github.com/Rdatatable/data.table/issues/869">#869</a> to work. If a <code>:=</code> is used inside a function with no <code>DT[]</code> before the end of the function, then the next time <code>DT</code> is typed at the prompt, nothing will be printed. A repeated <code>DT</code> will print. To avoid this: include a <code>DT[]</code> after the last <code>:=</code> in your function. If that is not possible (e.g., it’s not a function [...]
-</div>
-<div id="ive-noticed-that-basecbind.data.frame-and-baserbind.data.frame-appear-to-be-changed-by-data.table.-how-is-this-possible-why" class="section level2">
-<h2><span class="header-section-number">2.24</span> I’ve noticed that <code>base::cbind.data.frame</code> (and <code>base::rbind.data.frame</code>) appear to be changed by data.table. How is this possible? Why?</h2>
+
+<h2>Why do I have to type <code>DT</code> sometimes twice after using <code>:=</code> to print the result to console?</h2>
+
+<p>This is an unfortunate downside to get <a href="https://github.com/Rdatatable/data.table/issues/869">#869</a> to work. If a <code>:=</code> is used inside a function with no <code>DT[]</code> before the end of the function, then the next time <code>DT</code> is typed at the prompt, nothing will be printed. A repeated <code>DT</code> will print. To avoid this: include a <code>DT[]</code> after the last <code>:=</code> in your function. If that is not possible (e.g., it's not a func [...]
+
+<h2>I've noticed that <code>base::cbind.data.frame</code> (and <code>base::rbind.data.frame</code>) appear to be changed by data.table. How is this possible? Why?</h2>
+
<p>It is a temporary, last resort solution until we discover a better way to solve the problems listed below. Essentially, the issue is that data.table inherits from <code>data.frame</code>, <em>and</em> <code>base::cbind</code> and <code>base::rbind</code> (uniquely) do their own S3 dispatch internally as documented by <code>?cbind</code>. The change is adding one <code>for</code> loop to the start of each function directly in <code>base</code>; <em>e.g.</em>,</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">base::cbind.data.frame
-<span class="co"># function (..., deparse.level = 1) </span>
-<span class="co"># {</span>
-<span class="co"># if (!identical(class(..1), "data.frame")) </span>
-<span class="co"># for (x in list(...)) {</span>
-<span class="co"># if (inherits(x, "data.table")) </span>
-<span class="co"># return(data.table::data.table(...))</span>
-<span class="co"># }</span>
-<span class="co"># data.frame(..., check.names = FALSE)</span>
-<span class="co"># }</span>
-<span class="co"># <environment: namespace:base></span></code></pre></div>
+
+<pre><code class="r">base::cbind.data.frame
+# function (..., deparse.level = 1)
+# {
+# if (!identical(class(..1), "data.frame"))
+# for (x in list(...)) {
+# if (inherits(x, "data.table"))
+# return(data.table::data.table(...))
+# }
+# data.frame(..., check.names = FALSE)
+# }
+# <environment: namespace:base>
+</code></pre>
+
<p>That modification is made dynamically, <em>i.e.</em>, the <code>base</code> definition of <code>cbind.data.frame</code> is fetched, the <code>for</code> loop added to the beginning and then assigned back to <code>base</code>. This solution is intended to be robust to different definitions of <code>base::cbind.data.frame</code> in different versions of R, including unknown future changes. Again, it is a last resort until a better solution is known or made available. The competing requi [...]
+
<ul>
-<li><p><code>cbind(DT, DF)</code> needs to work. Defining <code>cbind.data.table</code> doesn’t work because <code>base::cbind</code> does its own S3 dispatch and requires that the <em>first</em> <code>cbind</code> method for each object it is passed is <em>identical</em>. This is not true in <code>cbind(DT, DF)</code> because the first method for <code>DT</code> is <code>cbind.data.table</code> but the first method for <code>DF</code> is <code>cbind.data.frame</code>. <code>base::cbind< [...]
-<li><p>This naturally leads to trying to mask <code>cbind.data.frame</code> instead. Since a data.table is a <code>data.frame</code>, <code>cbind</code> would find the same method for both <code>DT</code> and <code>DF</code>. However, this doesn’t work either because <code>base::cbind</code> appears to find methods in <code>base</code> first; <em>i.e.</em>, <code>base::cbind.data.frame</code> isn’t maskable. This is reproducible as follows :</p></li>
+<li><p><code>cbind(DT, DF)</code> needs to work. Defining <code>cbind.data.table</code> doesn't work because <code>base::cbind</code> does its own S3 dispatch and requires that the <em>first</em> <code>cbind</code> method for each object it is passed is <em>identical</em>. This is not true in <code>cbind(DT, DF)</code> because the first method for <code>DT</code> is <code>cbind.data.table</code> but the first method for <code>DF</code> is <code>cbind.data.frame</code>. <code>base::cb [...]
+<li><p>This naturally leads to trying to mask <code>cbind.data.frame</code> instead. Since a data.table is a <code>data.frame</code>, <code>cbind</code> would find the same method for both <code>DT</code> and <code>DF</code>. However, this doesn't work either because <code>base::cbind</code> appears to find methods in <code>base</code> first; <em>i.e.</em>, <code>base::cbind.data.frame</code> isn't maskable. This is reproducible as follows :</p></li>
</ul>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">foo =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">a =</span> <span class="dv">1</span>:<span class="dv">3</span>)
-cbind.data.frame =<span class="st"> </span>function(...) <span class="kw">cat</span>(<span class="st">"Not printed</span><span class="ch">\n</span><span class="st">"</span>)
-<span class="kw">cbind</span>(foo)
-<span class="co"># a</span>
-<span class="co"># 1 1</span>
-<span class="co"># 2 2</span>
-<span class="co"># 3 3</span>
-<span class="kw">rm</span>(<span class="st">"cbind.data.frame"</span>)</code></pre></div>
+
+<pre><code class="r">foo = data.frame(a = 1:3)
+cbind.data.frame = function(...) cat("Not printed\n")
+cbind(foo)
+# a
+# 1 1
+# 2 2
+# 3 3
+rm("cbind.data.frame")
+</code></pre>
+
<ul>
-<li>Finally, we tried masking <code>cbind</code> itself (v1.6.5 and v1.6.6). This allowed <code>cbind(DT, DF)</code> to work, but introduced compatibility issues with package <code>IRanges</code>, since <code>IRanges</code> also masks <code>cbind</code>. It worked if <code>IRanges</code> was lower on the <code>search()</code> path than data.table, but if <code>IRanges</code> was higher then data.table’s, <code>cbind</code> would never be called and the strange-looking <code>matrix</code> [...]
+<li>Finally, we tried masking <code>cbind</code> itself (v1.6.5 and v1.6.6). This allowed <code>cbind(DT, DF)</code> to work, but introduced compatibility issues with package <code>IRanges</code>, since <code>IRanges</code> also masks <code>cbind</code>. It worked if <code>IRanges</code> was lower on the <code>search()</code> path than data.table, but if <code>IRanges</code> was higher then data.table's, <code>cbind</code> would never be called and the strange-looking <code>matrix</c [...]
</ul>
-<p>If you know of a better solution that still solves all the issues above, then please let us know and we’ll gladly change it.</p>
-</div>
-<div id="r-dispatch" class="section level2">
-<h2><span class="header-section-number">2.25</span> I’ve read about method dispatch (<em>e.g.</em> <code>merge</code> may or may not dispatch to <code>merge.data.table</code>) but <em>how</em> does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when?</h2>
-<p>This comes up quite a lot but it’s really earth-shatteringly simple. A function such as <code>merge</code> is <em>generic</em> if it consists of a call to <code>UseMethod</code>. When you see people talking about whether or not functions are <em>generic</em> functions they are merely typing the function without <code>()</code> afterwards, looking at the program code inside it and if they see a call to <code>UseMethod</code> then it is <em>generic</em>. What does <code>UseMethod</code> [...]
-<p>You might now ask: where is this documented in R? Answer: it’s quite clear, but, you need to first know to look in <code>?UseMethod</code> and <em>that</em> help file contains :</p>
+
+<p>If you know of a better solution that still solves all the issues above, then please let us know and we'll gladly change it.</p>
+
+<h2 id="r-dispatch">I&#39;ve read about method dispatch (<em>e.g.</em> <code>merge</code> may or may not dispatch to <code>merge.data.table</code>) but <em>how</em> does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when?</h2>
+
+<p>This comes up quite a lot but it's really earth-shatteringly simple. A function such as <code>merge</code> is <em>generic</em> if it consists of a call to <code>UseMethod</code>. When you see people talking about whether or not functions are <em>generic</em> functions they are merely typing the function without <code>()</code> afterwards, looking at the program code inside it and if they see a call to <code>UseMethod</code> then it is <em>generic</em>. What does <code>UseMethod</ [...]
+
+<p>You might now ask: where is this documented in R? Answer: it's quite clear, but, you need to first know to look in <code>?UseMethod</code> and <em>that</em> help file contains :</p>
+
<blockquote>
-<p>When a function calling <code>UseMethod('fun')</code> is applied to an object with class attribute <code>c('first', 'second')</code>, the system searches for a function called <code>fun.first</code> and, if it finds it, applies it to the object. If no such function is found a function called <code>fun.second</code> is tried. If no class name produces a suitable function, the function <code>fun.default</code> is used, if it exists, or an error results.</p>
+<p>When a function calling <code>UseMethod('fun')</code> is applied to an object with class attribute <code>c('first', 'second')</code>, the system searches for a function called <code>fun.first</code> and, if it finds it, applies it to the object. If no such function is found a function called <code>fun.second</code> is tried. If no class name produces a suitable function, the function <code>fun.default</code> is used, if it exists, or an error results.</p>
</blockquote>
-<p>Happily, an internet search for “How does R method dispatch work” (at the time of this writing) returns the <code>?UseMethod</code> help page in the top few links. Admittedly, other links rapidly descend into the intricacies of S3 vs S4, internal generics and so on.</p>
-<p>However, features like basic S3 dispatch (pasting the function name together with the class name) is why some R folk love R. It’s so simple. No complicated registration or signature is required. There isn’t much needed to learn. To create the <code>merge</code> method for data.table all that was required, literally, was to merely create a function called <code>merge.data.table</code>.</p>
-</div>
-</div>
-<div id="questions-relating-to-compute-time" class="section level1">
-<h1><span class="header-section-number">3</span> Questions relating to compute time</h1>
-<div id="i-have-20-columns-and-a-large-number-of-rows.-why-is-an-expression-of-one-column-so-quick" class="section level2">
-<h2><span class="header-section-number">3.1</span> I have 20 columns and a large number of rows. Why is an expression of one column so quick?</h2>
+
+<p>Happily, an internet search for “How does R method dispatch work” (at the time of this writing) returns the <code>?UseMethod</code> help page in the top few links. Admittedly, other links rapidly descend into the intricacies of S3 vs S4, internal generics and so on.</p>
+
+<p>However, features like basic S3 dispatch (pasting the function name together with the class name) are why some R folk love R. It's so simple. No complicated registration or signature is required. There isn't much to learn. To create the <code>merge</code> method for data.table, all that was required, literally, was to create a function called <code>merge.data.table</code>.</p>
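<p>The lookup described in the <code>?UseMethod</code> quote can be tried in a few lines of base R. This is only a minimal sketch run at the top level of a session; the generic <code>describe</code> and the classes <code>first</code> and <code>second</code> are made-up names for illustration and are not part of data.table.</p>

```r
# A generic is just a function whose body calls UseMethod().
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>.
# 'describe', 'first' and 'second' are hypothetical names.
describe.first   <- function(x, ...) "method for class 'first'"
describe.default <- function(x, ...) "default method"

obj   <- structure(list(), class = c("first", "second"))
other <- structure(list(), class = "second")

describe(obj)    # finds describe.first, so that method is used
describe(other)  # no describe.second exists, so describe.default is used
describe(42)     # no class-specific method either, so describe.default again
```

<p>Typing <code>describe</code> without <code>()</code> shows the call to <code>UseMethod</code>, which is the check for a generic described above.</p>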
+
+<h1>Questions relating to compute time</h1>
+
+<h2>I have 20 columns and a large number of rows. Why is an expression of one column so quick?</h2>
+
<p>Several reasons:</p>
+
<ul>
-<li>Only that column is grouped, the other 19 are ignored because data.table inspects the <code>j</code> expression and realises it doesn’t use the other columns.</li>
+<li>Only that column is grouped, the other 19 are ignored because data.table inspects the <code>j</code> expression and realises it doesn't use the other columns.</li>
<li>One memory allocation is made for the largest group only, then that memory is re-used for the other groups. There is very little garbage to collect.</li>
<li>R is an in-memory column store; i.e., the columns are contiguous in RAM. Page fetches from RAM into L2 cache are minimised.</li>
</ul>
-</div>
-<div id="i-dont-have-a-key-on-a-large-table-but-grouping-is-still-really-quick.-why-is-that" class="section level2">
-<h2><span class="header-section-number">3.2</span> I don’t have a <code>key</code> on a large table, but grouping is still really quick. Why is that?</h2>
+
+<h2>I don't have a <code>key</code> on a large table, but grouping is still really quick. Why is that?</h2>
+
<p>data.table uses radix sorting. This is significantly faster than other sort algorithms. See <a href="http://user2015.math.aau.dk/presentations/234.pdf">our presentations</a> on <a href="https://github.com/Rdatatable/data.table/wiki">our homepage</a> for more information.</p>
+
<p>This is also one reason why <code>setkey()</code> is quick.</p>
+
<p>When no <code>key</code> is set, or we group in a different order from that of the key, we call it an <em>ad hoc</em> <code>by</code>.</p>
-</div>
-<div id="why-is-grouping-by-columns-in-the-key-faster-than-an-ad-hoc-by" class="section level2">
-<h2><span class="header-section-number">3.3</span> Why is grouping by columns in the key faster than an <em>ad hoc</em> <code>by</code>?</h2>
-<p>Because each group is contiguous in RAM, thereby minimising page fetches and memory can be copied in bulk (<code>memcpy</code> in C) rather than looping in C.</p>
-</div>
-<div id="what-are-primary-and-secondary-indexes-in-data.table" class="section level2">
-<h2><span class="header-section-number">3.4</span> What are primary and secondary indexes in data.table?</h2>
-<p>Manual: <a href="https://www.rdocumentation.org/packages/data.table/functions/setkey"><code>?setkey</code></a> S.O. : <a href="https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411">What is the purpose of setting a key in data.table?</a></p>
-<p><code>setkey(DT, col1, col2)</code> orders the rows by column <code>col1</code> then within each group of <code>col1</code> it orders by <code>col2</code>. This is a <em>primary index</em>. The row order is changed <em>by reference</em> in RAM. Subsequent joins and groups on those key columns then take advantage of the sort order for efficiency. (Imagine how difficult looking for a phone number in a printed telephone directory would be if it wasn’t sorted by surname then forename. Tha [...]
-<p>However, you can only have one primary key because data can only be physically sorted in RAM in one way at a time. Choose the primary index to be the one you use most often (e.g. <code>[id,date]</code>). Sometimes there isn’t an obvious choice for the primary key or you need to join and group many different columns in different orders. Enter a secondary index. This does use memory (<code>4*nrow</code> bytes regardless of the number of columns in the index) to store the order of the ro [...]
+
+<h2>Why is grouping by columns in the key faster than an <em>ad hoc</em> <code>by</code>?</h2>
+
+<p>Because each group is contiguous in RAM, which minimises page fetches and lets memory be
+copied in bulk (<code>memcpy</code> in C) rather than by looping in C.</p>
+
+<h2>What are primary and secondary indexes in data.table?</h2>
+
+<p>Manual: <a href="https://www.rdocumentation.org/packages/data.table/functions/setkey"><code>?setkey</code></a>
+S.O. : <a href="https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411">What is the purpose of setting a key in data.table?</a></p>
+
+<p><code>setkey(DT, col1, col2)</code> orders the rows by column <code>col1</code> then within each group of <code>col1</code> it orders by <code>col2</code>. This is a <em>primary index</em>. The row order is changed <em>by reference</em> in RAM. Subsequent joins and groups on those key columns then take advantage of the sort order for efficiency. (Imagine how difficult looking for a phone number in a printed telephone directory would be if it wasn't sorted by surname then forename. [...]
+
+<p>However, you can only have one primary key because data can only be physically sorted in RAM in one way at a time. Choose the primary index to be the one you use most often (e.g. <code>[id,date]</code>). Sometimes there isn't an obvious choice for the primary key or you need to join and group many different columns in different orders. Enter a secondary index. This does use memory (<code>4*nrow</code> bytes regardless of the number of columns in the index) to store the order of th [...]
+
<p>We use the words <em>index</em> and <em>key</em> interchangeably.</p>
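<p>As a rough illustration of the difference between a primary key and a secondary index, the sketch below sets one of each on a toy table. The table and its column names are invented for the example, and it assumes a data.table version that exports <code>setindex()</code> and <code>indices()</code>.</p>

```r
library(data.table)

# Toy data; 'id', 'date' and 'v' are hypothetical column names.
DT <- data.table(id   = c(3L, 1L, 2L, 1L),
                 date = c(2L, 1L, 1L, 2L),
                 v    = c(10, 20, 30, 40))

setkey(DT, id, date)   # primary key: rows are physically reordered by id, then date
key(DT)                # the key columns, in order

setindex(DT, v)        # secondary index: stores an ordering of v, rows stay where they are
indices(DT)            # names of the secondary indices currently attached

DT[.(1L)]              # subset on the first key column (id == 1)
DT[v == 30]            # subset on v, which can make use of the secondary index
```

<p>The index is set after the key here because re-keying a table reorders its rows, which would make a previously stored ordering stale.</p>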
-</div>
-</div>
-<div id="error-messages" class="section level1">
-<h1><span class="header-section-number">4</span> Error messages</h1>
-<div id="could-not-find-function-dt" class="section level2">
-<h2><span class="header-section-number">4.1</span> “Could not find function <code>DT</code>”</h2>
+
+<h1>Error messages</h1>
+
+<h2>“Could not find function <code>DT</code>”</h2>
+
<p>See above <a href="#DTremove1">here</a> and <a href="#DTremove2">here</a>.</p>
-</div>
-<div id="unused-arguments-mysum-sumv" class="section level2">
-<h2><span class="header-section-number">4.2</span> “unused argument(s) (<code>MySum = sum(v)</code>)”</h2>
+
+<h2>“unused argument(s) (<code>MySum = sum(v)</code>)”</h2>
+
<p>This error is generated by <code>DT[ , MySum = sum(v)]</code>. <code>DT[ , .(MySum = sum(v))]</code> was intended, or <code>DT[ , j = .(MySum = sum(v))]</code>.</p>
-</div>
-<div id="translatecharutf8-must-be-called-on-a-charsxp" class="section level2">
-<h2><span class="header-section-number">4.3</span> “<code>translateCharUTF8</code> must be called on a <code>CHARSXP</code>”</h2>
-<p>This error (and similar, <em>e.g.</em>, “<code>getCharCE</code> must be called on a <code>CHARSXP</code>”) may be nothing do with character data or locale. Instead, this can be a symptom of an earlier memory corruption. To date these have been reproducible and fixed (quickly). Please report it to our <a href="https://github.com/Rdatatable/data.table/issues">issues tracker</a>.</p>
-</div>
-<div id="cbinddt-df-returns-a-strange-format-e.g.-integer5" class="section level2">
-<h2><span class="header-section-number">4.4</span> <code>cbind(DT, DF)</code> returns a strange format, <em>e.g.</em> <code id="cbinderror">Integer,5</code></h2>
+
+<h2>“<code>translateCharUTF8</code> must be called on a <code>CHARSXP</code>”</h2>
+
+<p>This error (and similar, <em>e.g.</em>, “<code>getCharCE</code> must be called on a <code>CHARSXP</code>”) may be nothing to do with character data or locale. Instead, this can be a symptom of an earlier memory corruption. To date these have been reproducible and fixed (quickly). Please report it to our <a href="https://github.com/Rdatatable/data.table/issues">issues tracker</a>.</p>
+
+<h2><code>cbind(DT, DF)</code> returns a strange format, <em>e.g.</em> <code>Integer,5</code> {#cbinderror}</h2>
+
<p>This occurs prior to v1.6.5, for <code>rbind(DT, DF)</code> too. Please upgrade to v1.6.7 or later.</p>
-</div>
-<div id="cannot-change-value-of-locked-binding-for-.sd" class="section level2">
-<h2><span class="header-section-number">4.5</span> “cannot change value of locked binding for <code>.SD</code>”</h2>
-<p><code>.SD</code> is locked by design. See <code>?data.table</code>. If you’d like to manipulate <code>.SD</code> before using it, or returning it, and don’t wish to modify <code>DT</code> using <code>:=</code>, then take a copy first (see <code>?copy</code>), <em>e.g.</em>,</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">a =</span> <span class="kw">rep</span>(<span class="dv">1</span>:<span class="dv">3</span>, <span class="dv">1</span>:<span class="dv">3</span>), <span class="dt">b =</span> <span class="dv">1</span>:<span class="dv">6</span>, <span class="dt">c =</span> <span class="dv">7</span>:<span class="dv">12</span>)
+
+<h2>“cannot change value of locked binding for <code>.SD</code>”</h2>
+
+<p><code>.SD</code> is locked by design. See <code>?data.table</code>. If you'd like to manipulate <code>.SD</code> before using it, or returning it, and don't wish to modify <code>DT</code> using <code>:=</code>, then take a copy first (see <code>?copy</code>), <em>e.g.</em>,</p>
+
+<pre><code class="r">DT = data.table(a = rep(1:3, 1:3), b = 1:6, c = 7:12)
DT
-<span class="co"># a b c</span>
-<span class="co"># 1: 1 1 7</span>
-<span class="co"># 2: 2 2 8</span>
-<span class="co"># 3: 2 3 9</span>
-<span class="co"># 4: 3 4 10</span>
-<span class="co"># 5: 3 5 11</span>
-<span class="co"># 6: 3 6 12</span>
-DT[ , { mySD =<span class="st"> </span><span class="kw">copy</span>(.SD)
- mySD[<span class="dv">1</span>, b :<span class="er">=</span><span class="st"> </span>99L]
+# a b c
+# 1: 1 1 7
+# 2: 2 2 8
+# 3: 2 3 9
+# 4: 3 4 10
+# 5: 3 5 11
+# 6: 3 6 12
+DT[ , { mySD = copy(.SD)
+ mySD[1, b := 99L]
mySD},
- by =<span class="st"> </span>a]
-<span class="co"># a b c</span>
-<span class="co"># 1: 1 99 7</span>
-<span class="co"># 2: 2 99 8</span>
-<span class="co"># 3: 2 3 9</span>
-<span class="co"># 4: 3 99 10</span>
-<span class="co"># 5: 3 5 11</span>
-<span class="co"># 6: 3 6 12</span></code></pre></div>
-</div>
-<div id="cannot-change-value-of-locked-binding-for-.n" class="section level2">
-<h2><span class="header-section-number">4.6</span> “cannot change value of locked binding for <code>.N</code>”</h2>
+ by = a]
+# a b c
+# 1: 1 99 7
+# 2: 2 99 8
+# 3: 2 3 9
+# 4: 3 99 10
+# 5: 3 5 11
+# 6: 3 6 12
+</code></pre>
+
+<h2>“cannot change value of locked binding for <code>.N</code>”</h2>
+
<p>Please upgrade to v1.8.1 or later. From this version, if <code>.N</code> is returned by <code>j</code> it is renamed to <code>N</code> to avoid any ambiguity in any subsequent grouping between the <code>.N</code> special variable and a column called <code>".N"</code>.</p>
+
<p>The old behaviour can be reproduced by forcing <code>.N</code> to be called <code>.N</code>, like this :</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">a =</span> <span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">2</span>,<span class="dv">2</span>), <span class="dt">b =</span> <span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">2</span>,<span class="dv">2</span>,<span c [...]
+
+<pre><code class="r">DT = data.table(a = c(1,1,2,2,2), b = c(1,2,2,2,1))
DT
-<span class="co"># a b</span>
-<span class="co"># 1: 1 1</span>
-<span class="co"># 2: 1 2</span>
-<span class="co"># 3: 2 2</span>
-<span class="co"># 4: 2 2</span>
-<span class="co"># 5: 2 1</span>
-DT[ , <span class="kw">list</span>(<span class="dt">.N =</span> .N), <span class="kw">list</span>(a, b)] <span class="co"># show intermediate result for exposition</span>
-<span class="co"># a b .N</span>
-<span class="co"># 1: 1 1 1</span>
-<span class="co"># 2: 1 2 1</span>
-<span class="co"># 3: 2 2 2</span>
-<span class="co"># 4: 2 1 1</span>
-<span class="kw">cat</span>(<span class="kw">try</span>(
- DT[ , <span class="kw">list</span>(<span class="dt">.N =</span> .N), <span class="dt">by =</span> <span class="kw">list</span>(a, b)][ , <span class="kw">unique</span>(.N), <span class="dt">by =</span> a] <span class="co"># compound query more typical</span>
-, <span class="dt">silent =</span> <span class="ot">TRUE</span>))
-<span class="co"># Error in `[.data.table`(DT[, list(.N = .N), by = list(a, b)], , unique(.N), : </span>
-<span class="co"># The column '.N' can't be grouped because it conflicts with the special .N variable. Try setnames(DT,'.N','N') first.</span></code></pre></div>
-<p>If you are already running v1.8.1 or later then the error message is now more helpful than the “cannot change value of locked binding” error, as you can see above, since this vignette was produced using v1.8.1 or later.</p>
+# a b
+# 1: 1 1
+# 2: 1 2
+# 3: 2 2
+# 4: 2 2
+# 5: 2 1
+DT[ , list(.N = .N), list(a, b)] # show intermediate result for exposition
+# a b .N
+# 1: 1 1 1
+# 2: 1 2 1
+# 3: 2 2 2
+# 4: 2 1 1
+cat(try(
+ DT[ , list(.N = .N), by = list(a, b)][ , unique(.N), by = a] # compound query more typical
+, silent = TRUE))
+# Error in `[.data.table`(DT[, list(.N = .N), by = list(a, b)], , unique(.N), :
+# The column '.N' can't be grouped because it conflicts with the special .N variable. Try setnames(DT,'.N','N') first.
+</code></pre>
+
+<p>If you are already running v1.8.1 or later then the error message is now more helpful than the “cannot change value of locked binding” error, as you can see above, since this vignette was produced using v1.8.1 or later.</p>
+
<p>The more natural syntax now works :</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">if (<span class="kw">packageVersion</span>(<span class="st">"data.table"</span>) >=<span class="st"> "1.8.1"</span>) {
- DT[ , .N, by =<span class="st"> </span><span class="kw">list</span>(a, b)][ , <span class="kw">unique</span>(N), by =<span class="st"> </span>a]
+
+<pre><code class="r">if (packageVersion("data.table") >= "1.8.1") {
+ DT[ , .N, by = list(a, b)][ , unique(N), by = a]
}
-<span class="co"># a V1</span>
-<span class="co"># 1: 1 1</span>
-<span class="co"># 2: 2 2</span>
-<span class="co"># 3: 2 1</span>
-if (<span class="kw">packageVersion</span>(<span class="st">"data.table"</span>) >=<span class="st"> "1.9.3"</span>) {
- DT[ , .N, by =<span class="st"> </span>.(a, b)][ , <span class="kw">unique</span>(N), by =<span class="st"> </span>a] <span class="co"># same</span>
+# a V1
+# 1: 1 1
+# 2: 2 2
+# 3: 2 1
+if (packageVersion("data.table") >= "1.9.3") {
+ DT[ , .N, by = .(a, b)][ , unique(N), by = a] # same
}
-<span class="co"># a V1</span>
-<span class="co"># 1: 1 1</span>
-<span class="co"># 2: 2 2</span>
-<span class="co"># 3: 2 1</span></code></pre></div>
-</div>
-</div>
-<div id="warning-messages" class="section level1">
-<h1><span class="header-section-number">5</span> Warning messages</h1>
-<div id="the-following-objects-are-masked-from-packagebase-cbind-rbind" class="section level2">
-<h2><span class="header-section-number">5.1</span> “The following object(s) are masked from <code>package:base</code>: <code>cbind</code>, <code>rbind</code>”</h2>
+# a V1
+# 1: 1 1
+# 2: 2 2
+# 3: 2 1
+</code></pre>
+
+<h1>Warning messages</h1>
+
+<h2>“The following object(s) are masked from <code>package:base</code>: <code>cbind</code>, <code>rbind</code>”</h2>
+
<p>This warning was present in v1.6.5 and v1.6.6 only, when loading the package. The motivation was to allow <code>cbind(DT, DF)</code> to work, but as it transpired, this broke (full) compatibility with package <code>IRanges</code>. Please upgrade to v1.6.7 or later.</p>
-</div>
-<div id="coerced-numeric-rhs-to-integer-to-match-the-columns-type" class="section level2">
-<h2><span class="header-section-number">5.2</span> “Coerced numeric RHS to integer to match the column’s type”</h2>
+
+<h2>“Coerced numeric RHS to integer to match the column's type”</h2>
+
<p>Hopefully, this is self-explanatory. The full message is:</p>
-<p>Coerced numeric RHS to integer to match the column’s type; may have truncated precision. Either change the column to numeric first by creating a new numeric vector length 5 (nrows of entire table) yourself and assigning that (i.e. ‘replace’ column), or coerce RHS to integer yourself (e.g. 1L or as.integer) to make your intent clear (and for speed). Or, set the column type correctly up front when you create the table and stick to it, please.</p>
+
+<p>Coerced numeric RHS to integer to match the column's type; may have truncated precision. Either change the column to numeric first by creating a new numeric vector length 5 (nrows of entire table) yourself and assigning that (i.e. 'replace' column), or coerce RHS to integer yourself (e.g. 1L or as.integer) to make your intent clear (and for speed). Or, set the column type correctly up front when you create the table and stick to it, please.</p>
+
<p>To generate it, try :</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">a =</span> <span class="dv">1</span>:<span class="dv">5</span>, <span class="dt">b =</span> <span class="dv">1</span>:<span class="dv">5</span>)
-<span class="kw">suppressWarnings</span>(
-DT[<span class="dv">2</span>, b :<span class="er">=</span><span class="st"> </span><span class="dv">6</span>] <span class="co"># works (slower) with warning</span>
+
+<pre><code class="r">DT = data.table(a = 1:5, b = 1:5)
+suppressWarnings(
+DT[2, b := 6] # works (slower) with warning
)
-<span class="co"># a b</span>
-<span class="co"># 1: 1 1</span>
-<span class="co"># 2: 2 6</span>
-<span class="co"># 3: 3 3</span>
-<span class="co"># 4: 4 4</span>
-<span class="co"># 5: 5 5</span>
-<span class="kw">class</span>(<span class="dv">6</span>) <span class="co"># numeric not integer</span>
-<span class="co"># [1] "numeric"</span>
-DT[<span class="dv">2</span>, b :<span class="er">=</span><span class="st"> </span>7L] <span class="co"># works (faster) without warning</span>
-<span class="co"># a b</span>
-<span class="co"># 1: 1 1</span>
-<span class="co"># 2: 2 7</span>
-<span class="co"># 3: 3 3</span>
-<span class="co"># 4: 4 4</span>
-<span class="co"># 5: 5 5</span>
-<span class="kw">class</span>(7L) <span class="co"># L makes it an integer</span>
-<span class="co"># [1] "integer"</span>
-DT[ , b :<span class="er">=</span><span class="st"> </span><span class="kw">rnorm</span>(<span class="dv">5</span>)] <span class="co"># 'replace' integer column with a numeric column</span>
-<span class="co"># a b</span>
-<span class="co"># 1: 1 0.1552267</span>
-<span class="co"># 2: 2 1.0811211</span>
-<span class="co"># 3: 3 -0.2778669</span>
-<span class="co"># 4: 4 -0.7133567</span>
-<span class="co"># 5: 5 0.4300428</span></code></pre></div>
-</div>
-<div id="reading-data.table-from-rds-or-rdata-file" class="section level2">
-<h2><span class="header-section-number">5.3</span> Reading data.table from RDS or RData file</h2>
-<p><code>*.RDS</code> and <code>*.RData</code> are file types which can store in-memory R objects on disk efficiently. However, storing data.table into the binary file loses its column over-allocation. This isn’t a big deal – your data.table will be copied in memory on the next <em>by reference</em> operation and throw a warning. Therefore it is recommended to call <code>alloc.col()</code> on each data.table loaded with <code>readRDS()</code> or <code>load()</code> calls.</p>
-</div>
-</div>
-<div id="general-questions-about-the-package" class="section level1">
-<h1><span class="header-section-number">6</span> General questions about the package</h1>
-<div id="v1.3-appears-to-be-missing-from-the-cran-archive" class="section level2">
-<h2><span class="header-section-number">6.1</span> v1.3 appears to be missing from the CRAN archive?</h2>
-<p>That is correct. v1.3 was available on R-Forge only. There were several large changes internally and these took some time to test in development.</p>
-</div>
-<div id="is-data.table-compatible-with-s-plus" class="section level2">
-<h2><span class="header-section-number">6.2</span> Is data.table compatible with S-plus?</h2>
+class(6) # numeric not integer
+# [1] "numeric"
+DT[2, b := 7L] # works (faster) without warning
+class(7L) # L makes it an integer
+# [1] "integer"
+DT[ , b := rnorm(5)] # 'replace' integer column with a numeric column
+</code></pre>
+
+<h2>Reading data.table from RDS or RData file</h2>
+
+<p><code>*.RDS</code> and <code>*.RData</code> are file types which can store in-memory R objects on disk efficiently. However, storing a data.table in such a binary file loses its column over-allocation. This isn't a big deal: your data.table will be copied in memory on the next <em>by reference</em> operation, which also throws a warning. It is therefore recommended to call <code>alloc.col()</code> on each data.table loaded with <code>readRDS()</code> or <code>load()</code>.</p>
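<p>A possible round trip is sketched below, assuming the data.table version described in this FAQ, where <code>alloc.col()</code> and <code>truelength()</code> are available. The temporary file and the toy table are placeholders for the example.</p>

```r
library(data.table)

DT <- data.table(a = 1:3)
path <- tempfile(fileext = ".rds")   # hypothetical scratch file
saveRDS(DT, path)

DT2 <- readRDS(path)
truelength(DT2)          # typically 0 here: the spare column slots were not saved

DT2 <- alloc.col(DT2)    # re-create the over-allocated column slots
truelength(DT2) > length(DT2)

DT2[ , b := a * 2L]      # add a column by reference
unlink(path)
```

<p>Comparing <code>truelength()</code> with <code>length()</code> is one way to check whether a loaded table still has room for new columns before using <code>:=</code> on it.</p>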
+
+<h1>General questions about the package</h1>
+
+<h2>v1.3 appears to be missing from the CRAN archive?</h2>
+
+<p>That is correct. v1.3 was available on R-Forge only. There were several large
+changes internally and these took some time to test in development.</p>
+
+<h2>Is data.table compatible with S-plus?</h2>
+
<p>Not currently.</p>
+
<ul>
<li>A few core parts of the package are written in C and use internal R functions and R structures.</li>
<li>The package uses lexical scoping which is one of the differences between R and <strong>S-plus</strong> explained by <a href="https://cran.r-project.org/doc/FAQ/R-FAQ.html#Lexical-scoping">R FAQ 3.3.1</a></li>
</ul>
-</div>
-<div id="is-it-available-for-linux-mac-and-windows" class="section level2">
-<h2><span class="header-section-number">6.3</span> Is it available for Linux, Mac and Windows?</h2>
+
+<h2>Is it available for Linux, Mac and Windows?</h2>
+
<p>Yes, for both 32-bit and 64-bit on all platforms. Thanks to CRAN. There are no special or OS-specific libraries used.</p>
-</div>
-<div id="i-think-its-great.-what-can-i-do" class="section level2">
-<h2><span class="header-section-number">6.4</span> I think it’s great. What can I do?</h2>
+
+<h2>I think it's great. What can I do?</h2>
+
<p>Please file suggestions, bug reports and enhancement requests on our <a href="https://github.com/Rdatatable/data.table/issues">issues tracker</a>. This helps make the package better.</p>
+
<p>Please do star the package on <a href="https://github.com/Rdatatable/data.table/wiki">GitHub</a>. This helps encourage the developers and helps other R users find the package.</p>
-<p>You can submit pull requests to change the code and/or documentation yourself; see our <a href="https://github.com/Rdatatable/data.table/blob/master/Contributing.md">Contribution Guidelines</a>.</p>
-</div>
-<div id="i-think-its-not-great.-how-do-i-warn-others-about-my-experience" class="section level2">
-<h2><span class="header-section-number">6.5</span> I think it’s not great. How do I warn others about my experience?</h2>
+
+<p>You can submit pull requests to change the code and/or documentation yourself; see our <a href="https://github.com/Rdatatable/data.table/blob/master/CONTRIBUTING.md">Contribution Guidelines</a>.</p>
+
+<h2>I think it's not great. How do I warn others about my experience?</h2>
+
<p>Please put your vote and comments on <a href="http://crantastic.org/packages/data-table">Crantastic</a>. Please make it constructive so we have a chance to improve.</p>
-</div>
-<div id="i-have-a-question.-i-know-the-r-help-posting-guide-tells-me-to-contact-the-maintainer-not-r-help-but-is-there-a-larger-group-of-people-i-can-ask" class="section level2">
-<h2><span class="header-section-number">6.6</span> I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask?</h2>
-<p>Yes, there are two options. You can post to <a href="mailto:datatable-help at lists.r-forge.r-project.org">datatable-help</a>. It’s like r-help, but just for this package. Or the <a href="https://stackoverflow.com/tags/data.table/info"><code>[data.table]</code> tag</a> on <a href="https://stackoverflow.com/">Stack Overflow</a>. Feel free to answer questions in those places, too.</p>
-</div>
-<div id="where-are-the-datatable-help-archives" class="section level2">
-<h2><span class="header-section-number">6.7</span> Where are the datatable-help archives?</h2>
+
+<h2>I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask?</h2>
+
+<p>Yes, there are two options. You can post to <a href="mailto:datatable-help at lists.r-forge.r-project.org">datatable-help</a>. It's like r-help, but just for this package. Or the <a href="https://stackoverflow.com/tags/data.table/info"><code>[data.table]</code> tag</a> on <a href="https://stackoverflow.com/">Stack Overflow</a>. Feel free to answer questions in those places, too.</p>
+
+<h2>Where are the datatable-help archives?</h2>
+
<p>The <a href="https://github.com/Rdatatable/data.table/wiki">homepage</a> contains links to the archives in several formats.</p>
-</div>
-<div id="id-prefer-not-to-post-on-the-issues-page-can-i-mail-just-one-or-two-people-privately" class="section level2">
-<h2><span class="header-section-number">6.8</span> I’d prefer not to post on the Issues page, can I mail just one or two people privately?</h2>
-<p>Sure. You’re more likely to get a faster answer from the Issues page or Stack Overflow, though. Further, asking publicly in those places helps build the general knowledge base.</p>
-</div>
-<div id="i-have-created-a-package-that-uses-data.table.-how-do-i-ensure-my-package-is-data.table-aware-so-that-inheritance-from-data.frame-works" class="section level2">
-<h2><span class="header-section-number">6.9</span> I have created a package that uses data.table. How do I ensure my package is data.table-aware so that inheritance from <code>data.frame</code> works?</h2>
-<p>Please see <a href="http://stackoverflow.com/a/10529888/403310">this answer</a>.</p>
-</div>
-</div>
-<div class="footnotes">
-<hr />
-<ol>
-<li id="fn1"><p>Here we mean either the <code>merge</code> <em>method</em> for data.table or the <code>merge</code> method for <code>data.frame</code> since both methods work in the same way in this respect. See <code>?merge.data.table</code> and <a href="#r-dispatch">below</a> for more information about method dispatch.<a href="#fnref1">↩</a></p></li>
-<li id="fn2"><p>It may be a surprise to learn that <code>select top 10 * from ...</code> does <em>not</em> reliably return the same rows over time in SQL. You do need to include an <code>order by</code> clause, or use a clustered index to guarantee row order; <em>i.e.</em>, SQL is inherently unordered.<a href="#fnref2">↩</a></p></li>
-<li id="fn3"><p><em>e.g.</em>, <code>hist()</code> returns the breakpoints in addition to plotting to the graphics device.<a href="#fnref3">↩</a></p></li>
-</ol>
-</div>
+<h2>I'd prefer not to post on the Issues page, can I mail just one or two people privately?</h2>
+<p>Sure. You're more likely to get a faster answer from the Issues page or Stack Overflow, though. Further, asking publicly in those places helps build the general knowledge base.</p>
-<!-- dynamically load mathjax for compatibility with self-contained -->
-<script>
- (function () {
- var script = document.createElement("script");
- script.type = "text/javascript";
- script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
- document.getElementsByTagName("head")[0].appendChild(script);
- })();
-</script>
+<h2>I have created a package that uses data.table. How do I ensure my package is data.table-aware so that inheritance from <code>data.frame</code> works?</h2>
+
+<p>Please see <a href="http://stackoverflow.com/a/10529888/403310">this answer</a>.</p>
</body>
+
</html>
diff --git a/inst/doc/datatable-intro.R b/inst/doc/datatable-intro.R
index 56df3c1..e123628 100644
--- a/inst/doc/datatable-intro.R
+++ b/inst/doc/datatable-intro.R
@@ -20,9 +20,6 @@ DT = data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DT
class(DT$ID)
-## -------------------------------------------------------------------------------------------------
-getOption("datatable.print.nrows")
-
## ----eval = FALSE---------------------------------------------------------------------------------
# DT[i, j, by]
#
diff --git a/inst/doc/datatable-intro.html b/inst/doc/datatable-intro.html
index aab9cc9..e980e3c 100644
--- a/inst/doc/datatable-intro.html
+++ b/inst/doc/datatable-intro.html
@@ -1,907 +1,1100 @@
<!DOCTYPE html>
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
-<html xmlns="http://www.w3.org/1999/xhtml">
+<title>Data analysis using data.table</title>
-<head>
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
+
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
-<meta charset="utf-8">
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<meta name="generator" content="pandoc" />
+ pre .literal {
+ color: #990073
+ }
-<meta name="viewport" content="width=device-width, initial-scale=1">
+ pre .number {
+ color: #099;
+ }
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
-<meta name="date" content="2017-01-31" />
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
-<title>Introduction to data.table</title>
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
+
+ pre .string {
+ color: #d14;
+ }
+</style>
+
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
-<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
-div.sourceCode { overflow-x: auto; }
-table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
- margin: 0; padding: 0; vertical-align: baseline; border: none; }
-table.sourceCode { width: 100%; line-height: 100%; }
-td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
-td.sourceCode { padding-left: 5px; }
-code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
-code > span.dt { color: #902000; } /* DataType */
-code > span.dv { color: #40a070; } /* DecVal */
-code > span.bn { color: #40a070; } /* BaseN */
-code > span.fl { color: #40a070; } /* Float */
-code > span.ch { color: #4070a0; } /* Char */
-code > span.st { color: #4070a0; } /* String */
-code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
-code > span.ot { color: #007020; } /* Other */
-code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
-code > span.fu { color: #06287e; } /* Function */
-code > span.er { color: #ff0000; font-weight: bold; } /* Error */
-code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
-code > span.cn { color: #880000; } /* Constant */
-code > span.sc { color: #4070a0; } /* SpecialChar */
-code > span.vs { color: #4070a0; } /* VerbatimString */
-code > span.ss { color: #bb6688; } /* SpecialString */
-code > span.im { } /* Import */
-code > span.va { color: #19177c; } /* Variable */
-code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
-code > span.op { color: #666666; } /* Operator */
-code > span.bu { } /* BuiltIn */
-code > span.ex { } /* Extension */
-code > span.pp { color: #bc7a00; } /* Preprocessor */
-code > span.at { color: #7d9029; } /* Attribute */
-code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
-code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
-code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
-code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
-</style>
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
-<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
+h1 {
+ font-size:2.2em;
+}
-</head>
+h2 {
+ font-size:1.8em;
+}
-<body>
+h3 {
+ font-size:1.4em;
+}
+
+h4 {
+ font-size:1.0em;
+}
+
+h5 {
+ font-size:0.9em;
+}
+
+h6 {
+ font-size:0.8em;
+}
+
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
+
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
+code[class] {
+ background-color: #F8F8F8;
+}
+table, td, th {
+ border: none;
+}
-<h1 class="title toc-ignore">Introduction to data.table</h1>
-<h4 class="date"><em>2017-01-31</em></h4>
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+
+ img {
+ max-width: 100% !important;
+ }
+
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
+
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
+</style>
+
+
+
+</head>
+
+<body>
<p>This vignette introduces the <em>data.table</em> syntax, its general form, how to <em>subset</em> rows, <em>select and compute</em> on columns and perform aggregations <em>by group</em>. Familiarity with <em>data.frame</em> data structure from base R is useful, but not essential to follow this vignette.</p>
-<hr />
-<div id="data-analysis-using-data.table" class="section level2">
+
+<hr/>
+
<h2>Data analysis using data.table</h2>
+
<p>Data manipulation operations such as <em>subset</em>, <em>group</em>, <em>update</em>, <em>join</em> etc., are all inherently related. Keeping these <em>related operations together</em> allows for:</p>
+
<ul>
<li><p><em>concise</em> and <em>consistent</em> syntax irrespective of the set of operations you would like to perform to achieve your end goal.</p></li>
<li><p>performing analysis <em>fluidly</em> without the cognitive burden of having to map each operation to a particular function, chosen from a set of available functions, before performing the analysis.</p></li>
<li><p><em>automatically</em> optimising operations internally, and very effectively, by knowing precisely the data required for each operation, which makes them very fast and memory efficient.</p></li>
</ul>
+
<p>Briefly, if you are interested in reducing <em>programming</em> and <em>compute</em> time tremendously, then this package is for you. The philosophy that <em>data.table</em> adheres to makes this possible. Our goal is to illustrate it through this series of vignettes.</p>
-</div>
-<div id="data" class="section level2">
-<h2>Data</h2>
-<p>In this vignette, we will use <a href="https://github.com/arunsrinivasan/flights/wiki/NYC-Flights-2014-data">NYC-flights14</a> data. It contains On-Time flights data from the <a href="http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236">Bureau of Transporation Statistics</a> for all the flights that departed from New York City airports in 2014 (inspired by <a href="https://github.com/hadley/nycflights13">nycflights13</a>). The data is available only for Jan-Oct’14.</p>
-<p>We can use <em>data.table’s</em> fast file reader <code>fread</code> to load <em>flights</em> directly as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)
+
+<h2>Data {#data}</h2>
+
+<p>In this vignette, we will use <a href="https://github.com/arunsrinivasan/flights/wiki/NYC-Flights-2014-data">NYC-flights14</a> data. It contains On-Time flights data from the <a href="http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236">Bureau of Transporation Statistics</a> for all the flights that departed from New York City airports in 2014 (inspired by <a href="https://github.com/hadley/nycflights13">nycflights13</a>). The data is available only for Jan-Oct'14.</p>
+
+<p>We can use <em>data.table's</em> fast file reader <code>fread</code> to load <em>flights</em> directly as follows:</p>
+
+<pre><code class="r">flights <- fread("flights14.csv")
flights
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span>
-<span class="kw">dim</span>(flights)
-<span class="co"># [1] 253316 11</span></code></pre></div>
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# ---
+# 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14
+# 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8
+# 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11
+# 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
+# 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
+dim(flights)
+# [1] 253316 11
+</code></pre>
+
<p>Aside: <code>fread</code> accepts <code>http</code> and <code>https</code> URLs directly as well as operating system commands such as <code>sed</code> and <code>awk</code> output. See <code>?fread</code> for examples.</p>
-</div>
-<div id="introduction" class="section level2">
+
<h2>Introduction</h2>
+
<p>In this vignette, we will</p>
-<ol style="list-style-type: decimal">
+
+<ol>
<li><p>start with basics - what is a <em>data.table</em>, its general form, how to <em>subset</em> rows, <em>select and compute</em> on columns</p></li>
<li><p>and then we will look at performing data aggregations by group,</p></li>
</ol>
-</div>
-<div id="basics-1" class="section level2">
-<h2>1. Basics</h2>
-<div id="what-is-datatable-1a" class="section level3">
-<h3>a) What is data.table?</h3>
+
+<h2>1. Basics {#basics-1}</h2>
+
+<h3>a) What is data.table? {#what-is-datatable-1a}</h3>
+
<p><em>data.table</em> is an R package that provides <strong>an enhanced version</strong> of <em>data.frames</em>. In the <a href="#data">Data</a> section, we already created a <em>data.table</em> using <code>fread()</code>. We can also create one using the <code>data.table()</code> function. Here is an example:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">ID =</span> <span class="kw">c</span>(<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"a"</span>,<span class="st">"a"</span>,<span class="st">"c"</span>), <span class="dt">a =</span> <span class="dv">1</span>:<span class= [...]
+
+<pre><code class="r">DT = data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DT
-<span class="co"># ID a b c</span>
-<span class="co"># 1: b 1 7 13</span>
-<span class="co"># 2: b 2 8 14</span>
-<span class="co"># 3: b 3 9 15</span>
-<span class="co"># 4: a 4 10 16</span>
-<span class="co"># 5: a 5 11 17</span>
-<span class="co"># 6: c 6 12 18</span>
-<span class="kw">class</span>(DT$ID)
-<span class="co"># [1] "character"</span></code></pre></div>
+# ID a b c
+# 1: b 1 7 13
+# 2: b 2 8 14
+# 3: b 3 9 15
+# 4: a 4 10 16
+# 5: a 5 11 17
+# 6: c 6 12 18
+class(DT$ID)
+# [1] "character"
+</code></pre>
+
<p>You can also convert existing objects to a <em>data.table</em> using <code>as.data.table()</code>.</p>
-<div id="note-that" class="section level4 bs-callout bs-callout-info">
-<h4>Note that:</h4>
+
+<h4>Note that: {.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Unlike <em>data.frames</em>, columns of <code>character</code> type are <em>never</em> converted to <code>factors</code> by default.</p></li>
<li><p>Row numbers are printed with a <code>:</code> in order to visually separate the row number from the first column.</p></li>
<li><p>When the number of rows to print exceeds the global option <code>datatable.print.nrows</code> (default = 100), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the <a href="#data">Data</a> section).</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">getOption</span>(<span class="st">"datatable.print.nrows"</span>)
-<span class="co"># [1] 100</span></code></pre></div></li>
-<li><p><em>data.table</em> doesn’t set or use <em>row names</em>, ever. We will see as to why in <em>“Keys and fast binary search based subset”</em> vignette.</p></li>
+
+<pre><code class="r">getOption("datatable.print.nrows")
+</code></pre></li>
+<li><p><em>data.table</em> doesn't set or use <em>row names</em>, ever. We will see as to why in <em>“Keys and fast binary search based subset”</em> vignette.</p></li>
</ul>
-</div>
-</div>
-<div id="enhanced-1b" class="section level3">
-<h3>b) General form - in what way is a data.table <em>enhanced</em>?</h3>
+
+<h3>b) General form - in what way is a data.table <em>enhanced</em>? {#enhanced-1b}</h3>
+
<p>In contrast to a <em>data.frame</em>, you can do <em>a lot more</em> than just subsetting rows and selecting columns within the frame of a <em>data.table</em>, i.e., within <code>[ ... ]</code>. To understand it we will have to first look at the <em>general form</em> of <em>data.table</em> syntax, as shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[i, j, by]
+
+<pre><code class="r">DT[i, j, by]
## R: i j by
-## SQL: where select | update group by</code></pre></div>
+## SQL: where select | update group by
+</code></pre>
+
<p>Users who have a SQL background might perhaps immediately relate to this syntax.</p>
-<div id="the-way-to-read-it-out-loud-is" class="section level4 bs-callout bs-callout-info">
-<h4>The way to read it (out loud) is:</h4>
+
+<h4>The way to read it (out loud) is: {.bs-callout .bs-callout-info}</h4>
+
<p>Take <code>DT</code>, subset rows using <code>i</code>, then calculate <code>j</code>, grouped by <code>by</code>.</p>
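<p>As a minimal sketch of the general form, assuming a small made-up table (the name <code>toy</code> and its columns are illustrative and not part of the flights data), all three parts can be combined in a single call:</p>

<pre><code class="r">library(data.table)
## hypothetical toy data, used only for illustration
toy <- data.table(grp = c("x", "x", "y", "y", "y"), val = 1:5)
## i: keep rows with val > 1;  j: sum val;  by: one result row per grp
toy[val > 1L, .(total = sum(val)), by = grp]
</code></pre>

<p>Read out loud: take <code>toy</code>, subset the rows where <code>val > 1</code>, then calculate the sum of <code>val</code>, grouped by <code>grp</code>.</p>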
-</div>
-</div>
-</div>
-<div id="section" class="section level1">
-<h1></h1>
-<p>Let’s begin by looking at <code>i</code> and <code>j</code> first - subsetting rows and operating on columns.</p>
-<div id="c-subset-rows-in-i" class="section level3">
-<h3>c) Subset rows in <code id="subset-i-1c">i</code></h3>
-<div id="get-all-the-flights-with-jfk-as-the-origin-airport-in-the-month-of-june." class="section level4">
-<h4>– Get all the flights with “JFK” as the origin airport in the month of June.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[origin ==<span class="st"> "JFK"</span> &<span class="st"> </span>month ==<span class="st"> </span>6L]
-<span class="kw">head</span>(ans)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8</span>
-<span class="co"># 2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12</span>
-<span class="co"># 3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7</span>
-<span class="co"># 4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10</span>
-<span class="co"># 5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18</span>
-<span class="co"># 6: 2014 6 1 -6 -23 AA JFK LAX 329 2475 14</span></code></pre></div>
-</div>
-<div id="section-1" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>#</p>
+
+<p>Let's begin by looking at <code>i</code> and <code>j</code> first - subsetting rows and operating on columns.</p>
+
+<h3>c) Subset rows in <code>i</code> {#subset-i-1c}</h3>
+
+<h4>– Get all the flights with “JFK” as the origin airport in the month of June.</h4>
+
+<pre><code class="r">ans <- flights[origin == "JFK" & month == 6L]
+head(ans)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8
+# 2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12
+# 3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7
+# 4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10
+# 5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18
+# 6: 2014 6 1 -6 -23 AA JFK LAX 329 2475 14
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Within the frame of a <em>data.table</em>, columns can be referred to <em>as if they are variables</em>. Therefore, we simply refer to <code>origin</code> and <code>month</code> as if they are variables. We do not need to add the prefix <code>flights$</code> each time. However, using <code>flights$origin</code> and <code>flights$month</code> would work just fine.</p></li>
<li><p>The <em>row indices</em> that satisfy the condition <code>origin == "JFK" & month == 6L</code> are computed, and since there is nothing else left to do, a <em>data.table</em> with all columns from <code>flights</code> corresponding to those <em>row indices</em> is simply returned.</p></li>
<li><p>A comma after the condition is not required in <code>i</code>, although <code>flights[origin == "JFK" & month == 6L, ]</code> would work just fine. In <em>data.frames</em>, however, the comma is necessary.</p></li>
</ul>
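<p>The three ways of writing the subset mentioned above can be compared on a small made-up table. This is only a sketch: <code>toy</code> and its contents are assumptions for illustration, not the flights data.</p>

<pre><code class="r">library(data.table)
## hypothetical toy data with the same column names as in the example
toy <- data.table(origin = c("JFK", "LGA", "JFK"), month = c(6L, 6L, 7L))
a <- toy[origin == "JFK" & month == 6L]            ## columns used as variables, no comma
b <- toy[origin == "JFK" & month == 6L, ]          ## trailing comma is allowed as well
d <- toy[toy$origin == "JFK" & toy$month == 6L]    ## explicit prefix also works
all.equal(a, b)
all.equal(a, d)
</code></pre>

<p>With a <em>data.frame</em>, only the form with the comma would select rows.</p>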
-</div>
-<div id="subset-rows-integer" class="section level4">
-<h4>– Get the first two rows from <code>flights</code>.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[<span class="dv">1</span>:<span class="dv">2</span>]
+
+<h4>– Get the first two rows from <code>flights</code>. {#subset-rows-integer}</h4>
+
+<pre><code class="r">ans <- flights[1:2]
ans
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></code></pre></div>
-</div>
-<div id="section-2" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li>In this case, there is no condition. The row indices are already provided in <code>i</code>. We therefore return a <em>data.table</em> with all columns from <code>flights</code> for those <em>row indices</em>.</li>
</ul>
-</div>
-<div id="sort-flights-first-by-column-origin-in-ascending-order-and-then-by-dest-in-descending-order" class="section level4">
-<h4>– Sort <code>flights</code> first by column <code>origin</code> in <em>ascending</em> order, and then by <code>dest</code> in <em>descending</em> order:</h4>
+
+<h4>– Sort <code>flights</code> first by column <code>origin</code> in <em>ascending</em> order, and then by <code>dest</code> in <em>descending</em> order:</h4>
+
<p>We can use the base R function <code>order()</code> to accomplish this.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[<span class="kw">order</span>(origin, -dest)]
-<span class="kw">head</span>(ans)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 5 6 49 EV EWR XNA 195 1131 8</span>
-<span class="co"># 2: 2014 1 6 7 13 EV EWR XNA 190 1131 8</span>
-<span class="co"># 3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8</span>
-<span class="co"># 4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8</span>
-<span class="co"># 5: 2014 1 9 16 7 EV EWR XNA 181 1131 8</span>
-<span class="co"># 6: 2014 1 13 66 66 EV EWR XNA 188 1131 9</span></code></pre></div>
-</div>
-<div id="order-is-internally-optimised" class="section level4 bs-callout bs-callout-info">
-<h4><code>order()</code> is internally optimised</h4>
+
+<pre><code class="r">ans <- flights[order(origin, -dest)]
+head(ans)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 5 6 49 EV EWR XNA 195 1131 8
+# 2: 2014 1 6 7 13 EV EWR XNA 190 1131 8
+# 3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8
+# 4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8
+# 5: 2014 1 9 16 7 EV EWR XNA 181 1131 8
+# 6: 2014 1 13 66 66 EV EWR XNA 188 1131 9
+</code></pre>
+
+<h4><code>order()</code> is internally optimised {.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>We can use “-” on a <em>character</em> columns within the frame of a <em>data.table</em> to sort in decreasing order.</p></li>
-<li><p>In addition, <code>order(...)</code> within the frame of a <em>data.table</em> uses <em>data.table</em>’s internal fast radix order <code>forder()</code>, which is much faster than <code>base::order</code>. Here’s a small example to highlight the difference.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">odt =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">col =</span> <span class="kw">sample</span>(<span class="fl">1e7</span>))
-(t1 <-<span class="st"> </span><span class="kw">system.time</span>(ans1 <-<span class="st"> </span>odt[base::<span class="kw">order</span>(col)])) ## uses order from base R
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.312 0.008 0.320</span>
-(t2 <-<span class="st"> </span><span class="kw">system.time</span>(ans2 <-<span class="st"> </span>odt[<span class="kw">order</span>(col)])) ## uses data.table's forder
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.336 0.008 0.345</span>
-(<span class="kw">identical</span>(ans1, ans2))
-<span class="co"># [1] TRUE</span></code></pre></div></li>
+<li><p>We can use “-” on a <em>character</em> columns within the frame of a <em>data.table</em> to sort in decreasing order.</p></li>
+<li><p>In addition, <code>order(...)</code> within the frame of a <em>data.table</em> uses <em>data.table</em>'s internal fast radix order <code>forder()</code>, which is much faster than <code>base::order</code>. Here's a small example to highlight the difference.</p>
+
+<pre><code class="r">odt = data.table(col = sample(1e7))
+(t1 <- system.time(ans1 <- odt[base::order(col)])) ## uses order from base R
+# user system elapsed
+# 0.408 0.000 0.408
+(t2 <- system.time(ans2 <- odt[order(col)])) ## uses data.table's forder
+# user system elapsed
+# 0.852 0.000 0.852
+(identical(ans1, ans2))
+# [1] TRUE
+</code></pre></li>
</ul>
-<p>The speedup here is <strong>~1x</strong>. We will discuss <em>data.table</em>’s fast order in more detail in the <em>data.table internals</em> vignette.</p>
+
+<p>The speedup here is <strong>~0x</strong>. We will discuss <em>data.table</em>'s fast order in more detail in the <em>data.table internals</em> vignette.</p>
+
<ul>
<li>This is so that you can improve performance tremendously while using already familiar functions.</li>
</ul>
-</div>
-</div>
-</div>
-<div id="section-3" class="section level1">
-<h1></h1>
-<div id="d-select-columns-in-j" class="section level3">
-<h3>d) Select column(s) in <code id="select-j-1d">j</code></h3>
-<div id="select-arr_delay-column-but-return-it-as-a-vector." class="section level4">
-<h4>– Select <code>arr_delay</code> column, but return it as a <em>vector</em>.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, arr_delay]
-<span class="kw">head</span>(ans)
-<span class="co"># [1] 13 13 9 -26 1 0</span></code></pre></div>
-</div>
-<div id="section-4" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>#</p>
+
+<h3>d) Select column(s) in <code>j</code> {#select-j-1d}</h3>
+
+<h4>– Select <code>arr_delay</code> column, but return it as a <em>vector</em>.</h4>
+
+<pre><code class="r">ans <- flights[, arr_delay]
+head(ans)
+# [1] 13 13 9 -26 1 0
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Since columns can be referred to as if they are variables within the frame of data.tables, we directly refer to the <em>variable</em> we want to subset. Since we want <em>all the rows</em>, we simply skip <code>i</code>.</p></li>
<li><p>It returns <em>all</em> the rows for the column <code>arr_delay</code>.</p></li>
</ul>
-</div>
-<div id="select-arr_delay-column-but-return-as-a-data.table-instead." class="section level4">
-<h4>– Select <code>arr_delay</code> column, but return as a <em>data.table</em> instead.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, <span class="kw">list</span>(arr_delay)]
-<span class="kw">head</span>(ans)
-<span class="co"># arr_delay</span>
-<span class="co"># 1: 13</span>
-<span class="co"># 2: 13</span>
-<span class="co"># 3: 9</span>
-<span class="co"># 4: -26</span>
-<span class="co"># 5: 1</span>
-<span class="co"># 6: 0</span></code></pre></div>
-</div>
-<div id="section-5" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– Select <code>arr_delay</code> column, but return as a <em>data.table</em> instead.</h4>
+
+<pre><code class="r">ans <- flights[, list(arr_delay)]
+head(ans)
+# arr_delay
+# 1: 13
+# 2: 13
+# 3: 9
+# 4: -26
+# 5: 1
+# 6: 0
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>We wrap the <em>variables</em> (column names) within <code>list()</code>, which ensures that a <em>data.table</em> is returned. In case of a single column name, not wrapping with <code>list()</code> returns a vector instead, as seen in the <a href="#select-j-1d">previous example</a>.</p></li>
+<li><p>We wrap the <em>variables</em> (column names) within <code>list()</code>, which ensures that a <em>data.table</em> is returned. In case of a single column name, not wrapping with <code>list()</code> returns a vector instead, as seen in the <a href="#select-j-1d">previous example</a>.</p></li>
<li><p><em>data.table</em> also allows using <code>.()</code> to wrap columns with. It is an <em>alias</em> to <code>list()</code>; they both mean the same. Feel free to use whichever you prefer.</p>
+
<p>We will continue to use <code>.()</code> from here on.</p></li>
</ul>
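<p>The difference between a bare column name and a wrapped one can be checked directly. The following is a small sketch on a made-up table named <code>toy</code> (an assumption for illustration):</p>

<pre><code class="r">library(data.table)
## hypothetical toy data
toy <- data.table(a = 1:3, b = c("u", "v", "w"))
is.data.table(toy[, a])         ## bare column name: a plain vector is returned
is.data.table(toy[, list(a)])   ## wrapped in list(): a data.table is returned
is.data.table(toy[, .(a)])      ## .() behaves the same way as list()
</code></pre>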
-</div>
-</div>
-</div>
-<div id="section-6" class="section level1">
-<h1></h1>
-<p><em>data.tables</em> (and <em>data.frames</em>) are internally <em>lists</em> as well, but with all its columns of equal length and with a <em>class</em> attribute. Allowing <code>j</code> to return a <em>list</em> enables converting and returning a <em>data.table</em> very efficiently.</p>
-<div id="tip-1" class="section level4 bs-callout bs-callout-warning">
-<h4>Tip:</h4>
+
+<p>#</p>
+
+<p><em>data.tables</em> (and <em>data.frames</em>) are internally <em>lists</em> as well, but with all its columns of equal length and with a <em>class</em> attribute. Allowing <code>j</code> to return a <em>list</em> enables converting and returning a <em>data.table</em> very efficiently.</p>
+
+<h4>Tip: {.bs-callout .bs-callout-warning #tip-1}</h4>
+
<p>As long as <code>j-expression</code> returns a <em>list</em>, each element of the list will be converted to a column in the resulting <em>data.table</em>. This makes <code>j</code> quite powerful, as we will see shortly.</p>
-</div>
-<div id="select-both-arr_delay-and-dep_delay-columns." class="section level4">
-<h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, .(arr_delay, dep_delay)]
-<span class="kw">head</span>(ans)
-<span class="co"># arr_delay dep_delay</span>
-<span class="co"># 1: 13 14</span>
-<span class="co"># 2: 13 -3</span>
-<span class="co"># 3: 9 2</span>
-<span class="co"># 4: -26 -8</span>
-<span class="co"># 5: 1 2</span>
-<span class="co"># 6: 0 4</span>
+
+<h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns.</h4>
+
+<pre><code class="r">ans <- flights[, .(arr_delay, dep_delay)]
+head(ans)
+# arr_delay dep_delay
+# 1: 13 14
+# 2: 13 -3
+# 3: 9 2
+# 4: -26 -8
+# 5: 1 2
+# 6: 0 4
## alternatively
-<span class="co"># ans <- flights[, list(arr_delay, dep_delay)]</span></code></pre></div>
-</div>
-<div id="section-7" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# ans <- flights[, list(arr_delay, dep_delay)]
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li>Wrap both columns within <code>.()</code>, or <code>list()</code>. That’s it.</li>
+<li>Wrap both columns within <code>.()</code>, or <code>list()</code>. That's it.</li>
</ul>
-</div>
-</div>
-<div id="section-8" class="section level1">
-<h1></h1>
-<div id="select-both-arr_delay-and-dep_delay-columns-and-rename-them-to-delay_arr-and-delay_dep." class="section level4">
-<h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns <em>and</em> rename them to <code>delay_arr</code> and <code>delay_dep</code>.</h4>
+
+<p>#</p>
+
+<h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns <em>and</em> rename them to <code>delay_arr</code> and <code>delay_dep</code>.</h4>
+
<p>Since <code>.()</code> is just an alias for <code>list()</code>, we can name columns as we would while creating a <em>list</em>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, .(<span class="dt">delay_arr =</span> arr_delay, <span class="dt">delay_dep =</span> dep_delay)]
-<span class="kw">head</span>(ans)
-<span class="co"># delay_arr delay_dep</span>
-<span class="co"># 1: 13 14</span>
-<span class="co"># 2: 13 -3</span>
-<span class="co"># 3: 9 2</span>
-<span class="co"># 4: -26 -8</span>
-<span class="co"># 5: 1 2</span>
-<span class="co"># 6: 0 4</span></code></pre></div>
-<p>That’s it.</p>
-</div>
-<div id="e-compute-or-do-in-j" class="section level3">
+
+<pre><code class="r">ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
+head(ans)
+# delay_arr delay_dep
+# 1: 13 14
+# 2: 13 -3
+# 3: 9 2
+# 4: -26 -8
+# 5: 1 2
+# 6: 0 4
+</code></pre>
+
+<p>That's it.</p>
+
<h3>e) Compute or <em>do</em> in <code>j</code></h3>
-<div id="how-many-trips-have-had-total-delay-0" class="section level4">
-<h4>– How many trips have had total delay < 0?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, <span class="kw">sum</span>((arr_delay +<span class="st"> </span>dep_delay) <<span class="st"> </span><span class="dv">0</span>)]
+
+<h4>– How many trips have had total delay < 0?</h4>
+
+<pre><code class="r">ans <- flights[, sum((arr_delay + dep_delay) < 0)]
ans
-<span class="co"># [1] 141814</span></code></pre></div>
-</div>
-<div id="whats-happening-here" class="section level4 bs-callout bs-callout-info">
-<h4>What’s happening here?</h4>
+# [1] 141814
+</code></pre>
+
+<h4>What's happening here? {.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><em>data.table</em>’s <code>j</code> can handle more than just <em>selecting columns</em> - it can handle <em>expressions</em>, i.e., <em>compute on columns</em>. This shouldn’t be surprising, as columns can be referred to as if they are variables. Then we should be able to <em>compute</em> by calling functions on those variables. And that’s precisely what happens here.</li>
+<li><em>data.table</em>'s <code>j</code> can handle more than just <em>selecting columns</em> - it can handle <em>expressions</em>, i.e., <em>compute on columns</em>. This shouldn't be surprising, as columns can be referred to as if they are variables. Then we should be able to <em>compute</em> by calling functions on those variables. And that's precisely what happens here.</li>
</ul>
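+
+<p>As a minimal sketch of the same idea, assuming a small made-up table <code>DT</code> with hypothetical columns <code>a</code> and <code>b</code> (not part of <code>flights</code>), any R expression over the columns can go in <code>j</code>:</p>
+
+<pre><code class="r">library(data.table)
+# hypothetical toy data, for illustration only
+DT <- data.table(a = c(1, -5, 2), b = c(-3, 2, 1))
+# a + b is -2, -3, 3, so two rows have a negative total
+DT[, sum((a + b) < 0)]
+# [1] 2
+</code></pre>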
-</div>
-</div>
-<div id="f-subset-in-i-and-do-in-j" class="section level3">
+
<h3>f) Subset in <code>i</code> <em>and</em> do in <code>j</code></h3>
-<div id="calculate-the-average-arrival-and-departure-delay-for-all-flights-with-jfk-as-the-origin-airport-in-the-month-of-june." class="section level4">
-<h4>– Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[origin ==<span class="st"> "JFK"</span> &<span class="st"> </span>month ==<span class="st"> </span>6L,
- .(<span class="dt">m_arr =</span> <span class="kw">mean</span>(arr_delay), <span class="dt">m_dep =</span> <span class="kw">mean</span>(dep_delay))]
+
+<h4>– Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.</h4>
+
+<pre><code class="r">ans <- flights[origin == "JFK" & month == 6L,
+ .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
ans
-<span class="co"># m_arr m_dep</span>
-<span class="co"># 1: 5.839349 9.807884</span></code></pre></div>
-</div>
-<div id="section-9" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# m_arr m_dep
+# 1: 5.839349 9.807884
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>We first subset in <code>i</code> to find matching <em>row indices</em> where <code>origin</code> airport equals <em>“JFK”</em>, and <code>month</code> equals <em>6</em>. At this point, we <em>do not</em> subset the entire <em>data.table</em> corresponding to those rows.</p></li>
+<li><p>We first subset in <code>i</code> to find matching <em>row indices</em> where <code>origin</code> airport equals <em>“JFK”</em>, and <code>month</code> equals <em>6</em>. At this point, we <em>do not</em> subset the entire <em>data.table</em> corresponding to those rows.</p></li>
<li><p>Now, we look at <code>j</code> and find that it uses only <em>two columns</em>. And what we have to do is to compute their <code>mean()</code>. Therefore we subset just those columns corresponding to the matching rows, and compute their <code>mean()</code>.</p></li>
</ul>
+
<p>Because the three main components of the query (<code>i</code>, <code>j</code> and <code>by</code>) are <em>together</em> inside <code>[...]</code>, <em>data.table</em> can see all three and optimise the query altogether <em>before evaluation</em>, not each separately. We can therefore avoid materialising the entire subset, which benefits both speed and memory efficiency.</p>
-</div>
-<div id="how-many-trips-have-been-made-in-2014-from-jfk-airport-in-the-month-of-june" class="section level4">
-<h4>– How many trips have been made in 2014 from “JFK” airport in the month of June?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[origin ==<span class="st"> "JFK"</span> &<span class="st"> </span>month ==<span class="st"> </span>6L, <span class="kw">length</span>(dest)]
+
+<h4>– How many trips have been made in 2014 from “JFK” airport in the month of June?</h4>
+
+<pre><code class="r">ans <- flights[origin == "JFK" & month == 6L, length(dest)]
ans
-<span class="co"># [1] 8422</span></code></pre></div>
+# [1] 8422
+</code></pre>
+
<p>The function <code>length()</code> requires an input argument. We just needed to compute the number of rows in the subset. We could have used any other column as input argument to <code>length()</code> really.</p>
+
<p>This type of operation occurs so frequently, especially while grouping (as we will see in the next section), that <em>data.table</em> provides a <em>special symbol</em> <code>.N</code> for it.</p>
-</div>
-<div id="special-N" class="section level4 bs-callout bs-callout-info">
-<h4>Special symbol <code>.N</code>:</h4>
-<p><code>.N</code> is a special in-built variable that holds the number of observations in the current group. It is particularly useful when combined with <code>by</code> as we’ll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.</p>
-</div>
-</div>
-</div>
-<div id="section-10" class="section level1">
-<h1></h1>
-<p>So we can now accomplish the same task by using <code>.N</code> as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[origin ==<span class="st"> "JFK"</span> &<span class="st"> </span>month ==<span class="st"> </span>6L, .N]
+
+<h4>Special symbol <code>.N</code>: {.bs-callout .bs-callout-info #special-N}</h4>
+
+<p><code>.N</code> is a special in-built variable that holds the number of observations in the current group. It is particularly useful when combined with <code>by</code> as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.</p>
+
+<p>So we can now accomplish the same task by using <code>.N</code> as follows:</p>
+
+<pre><code class="r">ans <- flights[origin == "JFK" & month == 6L, .N]
ans
-<span class="co"># [1] 8422</span></code></pre></div>
-<div id="section-11" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# [1] 8422
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>Once again, we subset in <code>i</code> to get the <em>row indices</em> where <code>origin</code> airport equals <em>“JFK”</em>, and <code>month</code> equals <em>6</em>.</p></li>
+<li><p>Once again, we subset in <code>i</code> to get the <em>row indices</em> where <code>origin</code> airport equals <em>“JFK”</em>, and <code>month</code> equals <em>6</em>.</p></li>
<li><p>We see that <code>j</code> uses only <code>.N</code> and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices).</p></li>
<li><p>Note that we did not wrap <code>.N</code> with <code>list()</code> or <code>.()</code>. Therefore a vector is returned.</p></li>
</ul>
+
<p>We could have accomplished the same operation by doing <code>nrow(flights[origin == "JFK" & month == 6L])</code>. However, it would have to subset the entire <em>data.table</em> first corresponding to the <em>row indices</em> in <code>i</code> <em>and then</em> return the number of rows using <code>nrow()</code>, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the <em>data.table design</em> vignette.</p>
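+
+<p>A minimal sketch of the two forms side by side, assuming a small made-up table <code>DT</code> (hypothetical, not the <code>flights</code> data). Both give the same count; only the second has to build the subset first:</p>
+
+<pre><code class="r">library(data.table)
+# hypothetical toy data, for illustration only
+DT <- data.table(origin = c("JFK", "JFK", "LGA", "JFK"),
+                 month = c(6L, 6L, 6L, 7L))
+# counts the matching rows directly
+DT[origin == "JFK" & month == 6L, .N]
+# [1] 2
+# materialises the subset first, then counts its rows
+nrow(DT[origin == "JFK" & month == 6L])
+# [1] 2
+</code></pre>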
-</div>
-<div id="g-great-but-how-can-i-refer-to-columns-by-names-in-j-like-in-a-data.frame" class="section level3">
+
<h3>g) Great! But how can I refer to columns by names in <code>j</code> (like in a <em>data.frame</em>)?</h3>
+
<p>You can refer to column names the <em>data.frame</em> way using <code>with = FALSE</code>.</p>
-<div id="select-both-arr_delay-and-dep_delay-columns-the-data.frame-way." class="section level4">
-<h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns the <em>data.frame</em> way.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, <span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>), with =<span class="st"> </span><span class="ot">FALSE</span>]
-<span class="kw">head</span>(ans)
-<span class="co"># arr_delay dep_delay</span>
-<span class="co"># 1: 13 14</span>
-<span class="co"># 2: 13 -3</span>
-<span class="co"># 3: 9 2</span>
-<span class="co"># 4: -26 -8</span>
-<span class="co"># 5: 1 2</span>
-<span class="co"># 6: 0 4</span></code></pre></div>
-<p>The argument is named <code>with</code> after the R function <code>with()</code> because of similar functionality. Suppose you’ve a <em>data.frame</em> <code>DF</code> and you’d like to subset all rows where <code>x > 1</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">x =</span> <span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">3</span>,<span class="dv">3</span>), <span class="dt">y =</span> <span class="dv">1</span>:<span class="dv">8</span>)
+
+<h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns the <em>data.frame</em> way.</h4>
+
+<pre><code class="r">ans <- flights[, c("arr_delay", "dep_delay"), with = FALSE]
+head(ans)
+# arr_delay dep_delay
+# 1: 13 14
+# 2: 13 -3
+# 3: 9 2
+# 4: -26 -8
+# 5: 1 2
+# 6: 0 4
+</code></pre>
+
+<p>The argument is named <code>with</code> after the R function <code>with()</code> because of similar functionality. Suppose you've a <em>data.frame</em> <code>DF</code> and you'd like to subset all rows where <code>x > 1</code>.</p>
+
+<pre><code class="r">DF = data.frame(x = c(1,1,1,2,2,3,3,3), y = 1:8)
## (1) normal way
-DF[DF$x ><span class="st"> </span><span class="dv">1</span>, ] <span class="co"># data.frame needs that ',' as well</span>
-<span class="co"># x y</span>
-<span class="co"># 4 2 4</span>
-<span class="co"># 5 2 5</span>
-<span class="co"># 6 3 6</span>
-<span class="co"># 7 3 7</span>
-<span class="co"># 8 3 8</span>
+DF[DF$x > 1, ] # data.frame needs that ',' as well
+# x y
+# 4 2 4
+# 5 2 5
+# 6 3 6
+# 7 3 7
+# 8 3 8
## (2) using with
-DF[<span class="kw">with</span>(DF, x ><span class="st"> </span><span class="dv">1</span>), ]
-<span class="co"># x y</span>
-<span class="co"># 4 2 4</span>
-<span class="co"># 5 2 5</span>
-<span class="co"># 6 3 6</span>
-<span class="co"># 7 3 7</span>
-<span class="co"># 8 3 8</span></code></pre></div>
-</div>
-<div id="with_false" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+DF[with(DF, x > 1), ]
+# x y
+# 4 2 4
+# 5 2 5
+# 6 3 6
+# 7 3 7
+# 8 3 8
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info #with_false}</h4>
+
<ul>
-<li><p>Using <code>with()</code> in (2) allows using <code>DF</code>’s column <code>x</code> as if it were a variable.</p>
-<p>Hence the argument name <code>with</code> in <em>data.table</em>. Setting <code>with = FALSE</code> disables the ability to refer to columns as if they are variables, thereby restoring the “<em>data.frame</em> mode”.</p></li>
+<li><p>Using <code>with()</code> in (2) allows using <code>DF</code>'s column <code>x</code> as if it were a variable.</p>
+
+<p>Hence the argument name <code>with</code> in <em>data.table</em>. Setting <code>with = FALSE</code> disables the ability to refer to columns as if they are variables, thereby restoring the “<em>data.frame</em> mode”.</p></li>
<li><p>We can also <em>deselect</em> columns using <code>-</code> or <code>!</code>. For example:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## not run
-<span class="co"># returns all columns except arr_delay and dep_delay</span>
-ans <-<span class="st"> </span>flights[, !<span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>), with =<span class="st"> </span><span class="ot">FALSE</span>]
-<span class="co"># or</span>
-ans <-<span class="st"> </span>flights[, -<span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>), with =<span class="st"> </span><span class="ot">FALSE</span>]</code></pre></div></li>
+<pre><code class="r">## not run
+
+# returns all columns except arr_delay and dep_delay
+ans <- flights[, !c("arr_delay", "dep_delay"), with = FALSE]
+# or
+ans <- flights[, -c("arr_delay", "dep_delay"), with = FALSE]
+</code></pre></li>
<li><p>From <code>v1.9.5+</code>, we can also select by specifying start and end column names, e.g., <code>year:day</code> to select the first three columns.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## not run
-
-<span class="co"># returns year,month and day</span>
-ans <-<span class="st"> </span>flights[, year:day, with =<span class="st"> </span><span class="ot">FALSE</span>]
-<span class="co"># returns day, month and year</span>
-ans <-<span class="st"> </span>flights[, day:year, with =<span class="st"> </span><span class="ot">FALSE</span>]
-<span class="co"># returns all columns except year, month and day</span>
-ans <-<span class="st"> </span>flights[, -(year:day), with =<span class="st"> </span><span class="ot">FALSE</span>]
-ans <-<span class="st"> </span>flights[, !(year:day), with =<span class="st"> </span><span class="ot">FALSE</span>]</code></pre></div>
+
+<pre><code class="r">## not run
+
+# returns year,month and day
+ans <- flights[, year:day, with = FALSE]
+# returns day, month and year
+ans <- flights[, day:year, with = FALSE]
+# returns all columns except year, month and day
+ans <- flights[, -(year:day), with = FALSE]
+ans <- flights[, !(year:day), with = FALSE]
+</code></pre>
+
<p>This is particularly handy while working interactively.</p></li>
</ul>
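+
+<p>A small sketch of where <code>with = FALSE</code> helps in practice, assuming a made-up table <code>DT</code> and a hypothetical variable <code>cols</code> that holds the column names to select:</p>
+
+<pre><code class="r">library(data.table)
+# hypothetical toy data, for illustration only
+DT <- data.table(x = 1:3, y = 4:6, z = 7:9)
+# column names held in a character vector
+cols <- c("x", "y")
+# with = FALSE: cols is read as a vector of column names, not as a column
+ans <- DT[, cols, with = FALSE]
+names(ans)
+# [1] "x" "y"
+</code></pre>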
-</div>
-</div>
-</div>
-<div id="section-12" class="section level1">
-<h1></h1>
-<p><code>with = TRUE</code> is default in <em>data.table</em> because we can do much more by allowing <code>j</code> to handle expressions - especially when combined with <code>by</code> as we’ll see in a moment.</p>
-<div id="aggregations" class="section level2">
+
+<p><code>with = TRUE</code> is default in <em>data.table</em> because we can do much more by allowing <code>j</code> to handle expressions - especially when combined with <code>by</code> as we'll see in a moment.</p>
+
<h2>2. Aggregations</h2>
-<p>We’ve already seen <code>i</code> and <code>j</code> from <em>data.table</em>’s general form in the previous section. In this section, we’ll see how they can be combined together with <code>by</code> to perform operations <em>by group</em>. Let’s look at some examples.</p>
-<div id="a-grouping-using-by" class="section level3">
+
+<p>We've already seen <code>i</code> and <code>j</code> from <em>data.table</em>'s general form in the previous section. In this section, we'll see how they can be combined together with <code>by</code> to perform operations <em>by group</em>. Let's look at some examples.</p>
+
<h3>a) Grouping using <code>by</code></h3>
-<div id="how-can-we-get-the-number-of-trips-corresponding-to-each-origin-airport" class="section level4">
-<h4>– How can we get the number of trips corresponding to each origin airport?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, .(.N), by =<span class="st"> </span>.(origin)]
+
+<h4>– How can we get the number of trips corresponding to each origin airport?</h4>
+
+<pre><code class="r">ans <- flights[, .(.N), by = .(origin)]
ans
-<span class="co"># origin N</span>
-<span class="co"># 1: JFK 81483</span>
-<span class="co"># 2: LGA 84433</span>
-<span class="co"># 3: EWR 87400</span>
-
-## or equivalently using a character vector in 'by'
-<span class="co"># ans <- flights[, .(.N), by = "origin"]</span></code></pre></div>
-</div>
-<div id="section-13" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# origin N
+# 1: JFK 81483
+# 2: LGA 84433
+# 3: EWR 87400
+
+## or equivalently using a character vector in 'by'
+# ans <- flights[, .(.N), by = "origin"]
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We know <code>.N</code> <a href="#special-N">is a special variable</a> that holds the number of rows in the current group. Grouping by <code>origin</code> obtains the number of rows, <code>.N</code>, for each group.</p></li>
-<li><p>By doing <code>head(flights)</code> you can see that the origin airports occur in the order <em>“JFK”</em>, <em>“LGA”</em> and <em>“EWR”</em>. The original order of grouping variables is preserved in the result.</p></li>
+<li><p>By doing <code>head(flights)</code> you can see that the origin airports occur in the order <em>“JFK”</em>, <em>“LGA”</em> and <em>“EWR”</em>. The original order of grouping variables is preserved in the result.</p></li>
<li><p>Since we did not provide a name for the column returned in <code>j</code>, it was named <code>N</code> automatically by recognising the special symbol <code>.N</code>.</p></li>
<li><p><code>by</code> also accepts a character vector of column names. This is particularly useful for programming, e.g., designing a function that takes the grouping columns as an argument.</p></li>
-<li><p>When there’s only one column or expression to refer to in <code>j</code> and <code>by</code>, we can drop the <code>.()</code> notation. This is purely for convenience. We could instead do:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, .N, by =<span class="st"> </span>origin]
+<li><p>When there's only one column or expression to refer to in <code>j</code> and <code>by</code>, we can drop the <code>.()</code> notation. This is purely for convenience. We could instead do:</p>
+
+<pre><code class="r">ans <- flights[, .N, by = origin]
ans
-<span class="co"># origin N</span>
-<span class="co"># 1: JFK 81483</span>
-<span class="co"># 2: LGA 84433</span>
-<span class="co"># 3: EWR 87400</span></code></pre></div>
-<p>We’ll use this convenient form wherever applicable hereafter.</p></li>
+# origin N
+# 1: JFK 81483
+# 2: LGA 84433
+# 3: EWR 87400
+</code></pre>
+
+<p>We'll use this convenient form wherever applicable hereafter.</p></li>
</ul>
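+
+<p>The point about passing a character vector to <code>by</code> can be sketched as follows. The helper <code>count_by()</code> and the table <code>DT</code> are hypothetical names introduced only for illustration:</p>
+
+<pre><code class="r">library(data.table)
+# hypothetical helper: count rows per group, grouping columns given by name
+count_by <- function(DT, cols) DT[, .N, by = cols]
+# hypothetical toy data, for illustration only
+DT <- data.table(origin = c("JFK", "LGA", "JFK"),
+                 dest = c("LAX", "PBI", "LAX"))
+ans <- count_by(DT, c("origin", "dest"))
+# groups appear in order of first occurrence: (JFK, LAX), then (LGA, PBI)
+ans$N
+# [1] 2 1
+</code></pre>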
-</div>
-</div>
-</div>
-</div>
-<div id="section-14" class="section level1">
-<h1></h1>
-<div id="origin-.N" class="section level4">
-<h4>– How can we calculate the number of trips for each origin airport for carrier code <em>“AA”</em>?</h4>
-<p>The unique carrier code <em>“AA”</em> corresponds to <em>American Airlines Inc.</em></p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[carrier ==<span class="st"> "AA"</span>, .N, by =<span class="st"> </span>origin]
+
+<h4>– How can we calculate the number of trips for each origin airport for carrier code <em>“AA”</em>? {#origin-.N}</h4>
+
+<p>The unique carrier code <em>“AA”</em> corresponds to <em>American Airlines Inc.</em></p>
+
+<pre><code class="r">ans <- flights[carrier == "AA", .N, by = origin]
ans
-<span class="co"># origin N</span>
-<span class="co"># 1: JFK 11923</span>
-<span class="co"># 2: LGA 11730</span>
-<span class="co"># 3: EWR 2649</span></code></pre></div>
-</div>
-<div id="section-15" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# origin N
+# 1: JFK 11923
+# 2: LGA 11730
+# 3: EWR 2649
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We first obtain the row indices for the expression <code>carrier == "AA"</code> from <code>i</code>.</p></li>
<li><p>Using those <em>row indices</em>, we obtain the number of rows while grouped by <code>origin</code>. Once again no columns are actually materialised here, because the <code>j-expression</code> does not require any columns to be actually subsetted and is therefore fast and memory efficient.</p></li>
</ul>
-</div>
-<div id="origin-dest-.N" class="section level4">
-<h4>– How can we get the total number of trips for each <code>origin, dest</code> pair for carrier code <em>“AA”</em>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[carrier ==<span class="st"> "AA"</span>, .N, by =<span class="st"> </span>.(origin,dest)]
-<span class="kw">head</span>(ans)
-<span class="co"># origin dest N</span>
-<span class="co"># 1: JFK LAX 3387</span>
-<span class="co"># 2: LGA PBI 245</span>
-<span class="co"># 3: EWR LAX 62</span>
-<span class="co"># 4: JFK MIA 1876</span>
-<span class="co"># 5: JFK SEA 298</span>
-<span class="co"># 6: EWR MIA 848</span>
-
-## or equivalently using a character vector in 'by'
-<span class="co"># ans <- flights[carrier == "AA", .N, by = c("origin", "dest")]</span></code></pre></div>
-</div>
-<div id="section-16" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– How can we get the total number of trips for each <code>origin, dest</code> pair for carrier code <em>“AA”</em>? {#origin-dest-.N}</h4>
+
+<pre><code class="r">ans <- flights[carrier == "AA", .N, by = .(origin,dest)]
+head(ans)
+# origin dest N
+# 1: JFK LAX 3387
+# 2: LGA PBI 245
+# 3: EWR LAX 62
+# 4: JFK MIA 1876
+# 5: JFK SEA 298
+# 6: EWR MIA 848
+
+## or equivalently using a character vector in 'by'
+# ans <- flights[carrier == "AA", .N, by = c("origin", "dest")]
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><code>by</code> accepts multiple columns. We just provide all the columns by which to group.</li>
</ul>
-</div>
-<div id="origin-dest-month" class="section level4">
-<h4>– How can we get the average arrival and departure delay for each <code>origin, dest</code> pair for each month for carrier code <em>“AA”</em>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[carrier ==<span class="st"> "AA"</span>,
- .(<span class="kw">mean</span>(arr_delay), <span class="kw">mean</span>(dep_delay)),
- by =<span class="st"> </span>.(origin, dest, month)]
+
+<h4>– How can we get the average arrival and departure delay for each <code>origin, dest</code> pair for each month for carrier code <em>“AA”</em>? {#origin-dest-month}</h4>
+
+<pre><code class="r">ans <- flights[carrier == "AA",
+ .(mean(arr_delay), mean(dep_delay)),
+ by = .(origin, dest, month)]
ans
-<span class="co"># origin dest month V1 V2</span>
-<span class="co"># 1: JFK LAX 1 6.590361 14.2289157</span>
-<span class="co"># 2: LGA PBI 1 -7.758621 0.3103448</span>
-<span class="co"># 3: EWR LAX 1 1.366667 7.5000000</span>
-<span class="co"># 4: JFK MIA 1 15.720670 18.7430168</span>
-<span class="co"># 5: JFK SEA 1 14.357143 30.7500000</span>
-<span class="co"># --- </span>
-<span class="co"># 196: LGA MIA 10 -6.251799 -1.4208633</span>
-<span class="co"># 197: JFK MIA 10 -1.880184 6.6774194</span>
-<span class="co"># 198: EWR PHX 10 -3.032258 -4.2903226</span>
-<span class="co"># 199: JFK MCO 10 -10.048387 -1.6129032</span>
-<span class="co"># 200: JFK DCA 10 16.483871 15.5161290</span></code></pre></div>
-</div>
-<div id="section-17" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# origin dest month V1 V2
+# 1: JFK LAX 1 6.590361 14.2289157
+# 2: LGA PBI 1 -7.758621 0.3103448
+# 3: EWR LAX 1 1.366667 7.5000000
+# 4: JFK MIA 1 15.720670 18.7430168
+# 5: JFK SEA 1 14.357143 30.7500000
+# ---
+# 196: LGA MIA 10 -6.251799 -1.4208633
+# 197: JFK MIA 10 -1.880184 6.6774194
+# 198: EWR PHX 10 -3.032258 -4.2903226
+# 199: JFK MCO 10 -10.048387 -1.6129032
+# 200: JFK DCA 10 16.483871 15.5161290
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We did not provide column names for expressions in <code>j</code>, they were automatically generated (<code>V1</code>, <code>V2</code>).</p></li>
<li><p>Once again, note that the input order of grouping columns is preserved in the result.</p></li>
</ul>
-</div>
-</div>
-<div id="section-18" class="section level1">
-<h1></h1>
+
<p>Now what if we would like to order the result by those grouping columns <code>origin</code>, <code>dest</code> and <code>month</code>?</p>
-<div id="b-keyby" class="section level3">
+
<h3>b) keyby</h3>
+
<p><em>data.table</em> retaining the original order of groups is intentional and by design. There are cases when preserving the original order is essential. But at times we would like to automatically sort by the variables we grouped by.</p>
-<div id="so-how-can-we-directly-order-by-all-the-grouping-variables" class="section level4">
-<h4>– So how can we directly order by all the grouping variables?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[carrier ==<span class="st"> "AA"</span>,
- .(<span class="kw">mean</span>(arr_delay), <span class="kw">mean</span>(dep_delay)),
- keyby =<span class="st"> </span>.(origin, dest, month)]
+
+<h4>– So how can we directly order by all the grouping variables?</h4>
+
+<pre><code class="r">ans <- flights[carrier == "AA",
+ .(mean(arr_delay), mean(dep_delay)),
+ keyby = .(origin, dest, month)]
ans
-<span class="co"># origin dest month V1 V2</span>
-<span class="co"># 1: EWR DFW 1 6.427673 10.0125786</span>
-<span class="co"># 2: EWR DFW 2 10.536765 11.3455882</span>
-<span class="co"># 3: EWR DFW 3 12.865031 8.0797546</span>
-<span class="co"># 4: EWR DFW 4 17.792683 12.9207317</span>
-<span class="co"># 5: EWR DFW 5 18.487805 18.6829268</span>
-<span class="co"># --- </span>
-<span class="co"># 196: LGA PBI 1 -7.758621 0.3103448</span>
-<span class="co"># 197: LGA PBI 2 -7.865385 2.4038462</span>
-<span class="co"># 198: LGA PBI 3 -5.754098 3.0327869</span>
-<span class="co"># 199: LGA PBI 4 -13.966667 -4.7333333</span>
-<span class="co"># 200: LGA PBI 5 -10.357143 -6.8571429</span></code></pre></div>
-</div>
-<div id="section-19" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# origin dest month V1 V2
+# 1: EWR DFW 1 6.427673 10.0125786
+# 2: EWR DFW 2 10.536765 11.3455882
+# 3: EWR DFW 3 12.865031 8.0797546
+# 4: EWR DFW 4 17.792683 12.9207317
+# 5: EWR DFW 5 18.487805 18.6829268
+# ---
+# 196: LGA PBI 1 -7.758621 0.3103448
+# 197: LGA PBI 2 -7.865385 2.4038462
+# 198: LGA PBI 3 -5.754098 3.0327869
+# 199: LGA PBI 4 -13.966667 -4.7333333
+# 200: LGA PBI 5 -10.357143 -6.8571429
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li>All we did was to change <code>by</code> to <code>keyby</code>. This automatically orders the result by the grouping variables in increasing order. Note that <code>keyby</code> is applied after performing the operation, i.e., on the computed result.</li>
+<li>All we did was to change <code>by</code> to <code>keyby</code>. This automatically orders the result by the grouping variables in increasing order. Note that <code>keyby</code> is applied after performing the operation, i.e., on the computed result.</li>
</ul>
-<p><strong>Keys:</strong> Actually <code>keyby</code> does a little more than <em>just ordering</em>. It also <em>sets a key</em> after ordering by setting an <em>attribute</em> called <code>sorted</code>. But we’ll learn more about <code>keys</code> in the next vignette.</p>
-<p>For now, all you’ve to know is you can use <code>keyby</code> to automatically order by the columns specified in <code>by</code>.</p>
-</div>
-</div>
-<div id="c-chaining" class="section level3">
+
+<p><strong>Keys:</strong> Actually <code>keyby</code> does a little more than <em>just ordering</em>. It also <em>sets a key</em> after ordering by setting an <em>attribute</em> called <code>sorted</code>. But we'll learn more about <code>keys</code> in the next vignette.</p>
+
+<p>For now, all you've to know is you can use <code>keyby</code> to automatically order by the columns specified in <code>by</code>.</p>
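+
+<p>A minimal sketch of the difference between <code>by</code> and <code>keyby</code>, assuming a made-up table <code>DT</code> with hypothetical columns <code>g</code> and <code>v</code>:</p>
+
+<pre><code class="r">library(data.table)
+# hypothetical toy data, for illustration only
+DT <- data.table(g = c("b", "a", "b"), v = 1:3)
+# by: groups are returned in order of first appearance, no key is set
+by_ans <- DT[, .(total = sum(v)), by = g]
+by_ans$g
+# [1] "b" "a"
+key(by_ans)
+# NULL
+# keyby: groups are sorted and the result is keyed on g
+key_ans <- DT[, .(total = sum(v)), keyby = g]
+key_ans$g
+# [1] "a" "b"
+key(key_ans)
+# [1] "g"
+</code></pre>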
+
<h3>c) Chaining</h3>
-<p>Let’s reconsider the task of <a href="#origin-dest-.N">getting the total number of trips for each <code>origin, dest</code> pair for carrier <em>“AA”</em></a>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[carrier ==<span class="st"> "AA"</span>, .N, by =<span class="st"> </span>.(origin, dest)]</code></pre></div>
-<div id="how-can-we-order-ans-using-the-columns-origin-in-ascending-order-and-dest-in-descending-order" class="section level4">
-<h4>– How can we order <code>ans</code> using the columns <code>origin</code> in ascending order, and <code>dest</code> in descending order?</h4>
+
+<p>Let's reconsider the task of <a href="#origin-dest-.N">getting the total number of trips for each <code>origin, dest</code> pair for carrier <em>“AA”</em></a>.</p>
+
+<pre><code class="r">ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
+</code></pre>
+
+<h4>– How can we order <code>ans</code> using the columns <code>origin</code> in ascending order, and <code>dest</code> in descending order?</h4>
+
<p>We can store the intermediate result in a variable, and then use <code>order(origin, -dest)</code> on that variable. It seems fairly straightforward.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>ans[<span class="kw">order</span>(origin, -dest)]
-<span class="kw">head</span>(ans)
-<span class="co"># origin dest N</span>
-<span class="co"># 1: EWR PHX 121</span>
-<span class="co"># 2: EWR MIA 848</span>
-<span class="co"># 3: EWR LAX 62</span>
-<span class="co"># 4: EWR DFW 1618</span>
-<span class="co"># 5: JFK STT 229</span>
-<span class="co"># 6: JFK SJU 690</span></code></pre></div>
-</div>
-<div id="section-20" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<pre><code class="r">ans <- ans[order(origin, -dest)]
+head(ans)
+# origin dest N
+# 1: EWR PHX 121
+# 2: EWR MIA 848
+# 3: EWR LAX 62
+# 4: EWR DFW 1618
+# 5: JFK STT 229
+# 6: JFK SJU 690
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>Recall that we can use “-” on a <em>character</em> column in <code>order()</code> within the frame of a <em>data.table</em>. This is possible due to <em>data.table</em>’s internal query optimisation.</p></li>
-<li><p>Also recall that <code>order(...)</code> within the frame of a <em>data.table</em> is <em>automatically optimised</em> to use <em>data.table</em>’s internal fast radix order <code>forder()</code> for speed. So you can keep using the already <em>familiar</em> base R functions without compromising on the speed or memory efficiency that <em>data.table</em> offers. We will cover this in more detail in the <em>data.table internals</em> vignette.</p></li>
+<li><p>Recall that we can use “-” on a <em>character</em> column in <code>order()</code> within the frame of a <em>data.table</em>. This is possible due to <em>data.table</em>'s internal query optimisation.</p></li>
+<li><p>Also recall that <code>order(...)</code> within the frame of a <em>data.table</em> is <em>automatically optimised</em> to use <em>data.table</em>'s internal fast radix order <code>forder()</code> for speed. So you can keep using the already <em>familiar</em> base R functions without compromising on the speed or memory efficiency that <em>data.table</em> offers. We will cover this in more detail in the <em>data.table internals</em> vignette.</p></li>
</ul>
-</div>
-</div>
-</div>
-<div id="section-21" class="section level1">
-<h1></h1>
+
<p>But this requires assigning the intermediate result and then overwriting it. We can do one better and avoid this intermediate assignment to a variable altogether by <code>chaining</code> expressions.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[carrier ==<span class="st"> "AA"</span>, .N, by =<span class="st"> </span>.(origin, dest)][<span class="kw">order</span>(origin, -dest)]
-<span class="kw">head</span>(ans, <span class="dv">10</span>)
-<span class="co"># origin dest N</span>
-<span class="co"># 1: EWR PHX 121</span>
-<span class="co"># 2: EWR MIA 848</span>
-<span class="co"># 3: EWR LAX 62</span>
-<span class="co"># 4: EWR DFW 1618</span>
-<span class="co"># 5: JFK STT 229</span>
-<span class="co"># 6: JFK SJU 690</span>
-<span class="co"># 7: JFK SFO 1312</span>
-<span class="co"># 8: JFK SEA 298</span>
-<span class="co"># 9: JFK SAN 299</span>
-<span class="co"># 10: JFK ORD 432</span></code></pre></div>
-<div id="section-22" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<pre><code class="r">ans <- flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
+head(ans, 10)
+# origin dest N
+# 1: EWR PHX 121
+# 2: EWR MIA 848
+# 3: EWR LAX 62
+# 4: EWR DFW 1618
+# 5: JFK STT 229
+# 6: JFK SJU 690
+# 7: JFK SFO 1312
+# 8: JFK SEA 298
+# 9: JFK SAN 299
+# 10: JFK ORD 432
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We can tack expressions one after another, <em>forming a chain</em> of operations, i.e., <code>DT[ ... ][ ... ][ ... ]</code>.</p></li>
<li><p>Or you can also chain them vertically:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[ ...
+
+<pre><code class="r">DT[ ...
][ ...
][ ...
- ]</code></pre></div></li>
+ ]
+</code></pre></li>
</ul>
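+
+<p>As a small, self-contained sketch of chaining: the toy table below mirrors the <code>DT</code> used elsewhere in this vignette and is rebuilt here only as an assumption, so that the snippet runs on its own; the column name <code>total</code> is illustrative. The first <code>[ ... ]</code> aggregates per group and the second sorts the aggregated result.</p>
+
+<pre><code class="r">library(data.table)
+
+# Toy table, assumed for illustration (same shape as the vignette's DT)
+DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
+
+# Step 1: sum column a within each ID; step 2: sort the result by that sum, descending
+ans <- DT[, .(total = sum(a)), by = ID][order(-total)]
+ans
+</code></pre>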
-</div>
-<div id="d-expressions-in-by" class="section level3">
+
<h3>d) Expressions in <code>by</code></h3>
-<div id="can-by-accept-expressions-as-well-or-just-take-columns" class="section level4">
-<h4>– Can <code>by</code> accept <em>expressions</em> as well or just take columns?</h4>
-<p>Yes it does. As an example, if we would like to find out how many flights started late but arrived early (or on time), started and arrived late etc…</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, .N, .(dep_delay><span class="dv">0</span>, arr_delay><span class="dv">0</span>)]
+
+<h4>– Can <code>by</code> accept <em>expressions</em> as well or just take columns?</h4>
+
+<p>Yes it does. As an example, if we would like to find out how many flights started late but arrived early (or on time), started and arrived late etc…</p>
+
+<pre><code class="r">ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]
ans
-<span class="co"># dep_delay arr_delay N</span>
-<span class="co"># 1: TRUE TRUE 72836</span>
-<span class="co"># 2: FALSE TRUE 34583</span>
-<span class="co"># 3: FALSE FALSE 119304</span>
-<span class="co"># 4: TRUE FALSE 26593</span></code></pre></div>
-</div>
-<div id="section-23" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# dep_delay arr_delay N
+# 1: TRUE TRUE 72836
+# 2: FALSE TRUE 34583
+# 3: FALSE FALSE 119304
+# 4: TRUE FALSE 26593
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>The last row corresponds to <code>dep_delay > 0 = TRUE</code> and <code>arr_delay > 0 = FALSE</code>. We can see that 26593 flights started late but arrived early (or on time).</p></li>
<li><p>Note that we did not provide any names to <code>by-expression</code>. And names have been automatically assigned in the result.</p></li>
<li><p>You can provide other columns along with expressions, for example: <code>DT[, .N, by = .(a, b>0)]</code>.</p></li>
</ul>
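+
+<p>The last point can be sketched on a toy table. The table below is an assumption that mirrors the vignette's <code>DT</code>, and the name <code>a_gt_2</code> is purely illustrative. It mixes a plain column with a named expression in <code>by</code>, so the resulting column gets the name we chose rather than an automatically assigned one.</p>
+
+<pre><code class="r">library(data.table)
+
+# Toy table, assumed for illustration (same shape as the vignette's DT)
+DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
+
+# Group by the column ID and by the logical expression a > 2, named a_gt_2
+ans <- DT[, .N, by = .(ID, a_gt_2 = a > 2)]
+names(ans)   # "ID" "a_gt_2" "N"
+</code></pre>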
-</div>
-</div>
-<div id="e-multiple-columns-in-j---.sd" class="section level3">
+
<h3>e) Multiple columns in <code>j</code> - <code>.SD</code></h3>
-<div id="do-we-have-to-compute-mean-for-each-column-individually" class="section level4">
-<h4>– Do we have to compute <code>mean()</code> for each column individually?</h4>
+
+<h4>– Do we have to compute <code>mean()</code> for each column individually?</h4>
+
<p>It is of course not practical to have to type <code>mean(myCol)</code> for every column one by one. What if you had 100 columns to compute the <code>mean()</code> of?</p>
-<p>How can we do this efficiently? To get there, refresh on <a href="#tip-1">this tip</a> - <em>“As long as j-expression returns a list, each element of the list will be converted to a column in the resulting data.table”</em>. Suppose we can refer to the <em>data subset for each group</em> as a variable <em>while grouping</em>, then we can loop through all the columns of that variable using the already familiar base function <code>lapply()</code>. We don’t have to learn any new function.</p>
-</div>
-<div id="special-SD" class="section level4 bs-callout bs-callout-info">
-<h4>Special symbol <code>.SD</code>:</h4>
+
+<p>How can we do this efficiently? To get there, refresh on <a href="#tip-1">this tip</a> - <em>“As long as j-expression returns a list, each element of the list will be converted to a column in the resulting data.table”</em>. Suppose we can refer to the <em>data subset for each group</em> as a variable <em>while grouping</em>, then we can loop through all the columns of that variable using the already familiar base function <code>lapply()</code>. We don't have to learn a [...]
+
+<h4>Special symbol <code>.SD</code>: {.bs-callout .bs-callout-info #special-SD}</h4>
+
<p><em>data.table</em> provides a <em>special</em> symbol, called <code>.SD</code>. It stands for <strong>S</strong>ubset of <strong>D</strong>ata. It by itself is a <em>data.table</em> that holds the data for <em>the current group</em> defined using <code>by</code>.</p>
+
<p>Recall that a <em>data.table</em> is internally a list as well with all its columns of equal length.</p>
-</div>
-</div>
-</div>
-<div id="section-24" class="section level1">
-<h1></h1>
-<p>Let’s use the <a href="#what-is-datatable-1a"><em>data.table</em> <code>DT</code> from before</a> to get a glimpse of what <code>.SD</code> looks like.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT
-<span class="co"># ID a b c</span>
-<span class="co"># 1: b 1 7 13</span>
-<span class="co"># 2: b 2 8 14</span>
-<span class="co"># 3: b 3 9 15</span>
-<span class="co"># 4: a 4 10 16</span>
-<span class="co"># 5: a 5 11 17</span>
-<span class="co"># 6: c 6 12 18</span>
-
-DT[, <span class="kw">print</span>(.SD), by =<span class="st"> </span>ID]
-<span class="co"># a b c</span>
-<span class="co"># 1: 1 7 13</span>
-<span class="co"># 2: 2 8 14</span>
-<span class="co"># 3: 3 9 15</span>
-<span class="co"># a b c</span>
-<span class="co"># 1: 4 10 16</span>
-<span class="co"># 2: 5 11 17</span>
-<span class="co"># a b c</span>
-<span class="co"># 1: 6 12 18</span>
-<span class="co"># Empty data.table (0 rows) of 1 col: ID</span></code></pre></div>
-<div id="section-25" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>#</p>
+
+<p>Let's use the <a href="#what-is-datatable-1a"><em>data.table</em> <code>DT</code> from before</a> to get a glimpse of what <code>.SD</code> looks like.</p>
+
+<pre><code class="r">DT
+# ID a b c
+# 1: b 1 7 13
+# 2: b 2 8 14
+# 3: b 3 9 15
+# 4: a 4 10 16
+# 5: a 5 11 17
+# 6: c 6 12 18
+
+DT[, print(.SD), by = ID]
+# a b c
+# 1: 1 7 13
+# 2: 2 8 14
+# 3: 3 9 15
+# a b c
+# 1: 4 10 16
+# 2: 5 11 17
+# a b c
+# 1: 6 12 18
+# Empty data.table (0 rows) of 1 col: ID
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p><code>.SD</code> contains all the columns <em>except the grouping columns</em> by default.</p></li>
<li><p>It is also generated by preserving the original order - data corresponding to <code>ID = "b"</code>, then <code>ID = "a"</code>, and then <code>ID = "c"</code>.</p></li>
</ul>
-</div>
-</div>
-<div id="section-26" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>To compute on (multiple) columns, we can then simply use the base R function <code>lapply()</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[, <span class="kw">lapply</span>(.SD, mean), by =<span class="st"> </span>ID]
-<span class="co"># ID a b c</span>
-<span class="co"># 1: b 2.0 8.0 14.0</span>
-<span class="co"># 2: a 4.5 10.5 16.5</span>
-<span class="co"># 3: c 6.0 12.0 18.0</span></code></pre></div>
-<div id="section-27" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<pre><code class="r">DT[, lapply(.SD, mean), by = ID]
+# ID a b c
+# 1: b 2.0 8.0 14.0
+# 2: a 4.5 10.5 16.5
+# 3: c 6.0 12.0 18.0
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p><code>.SD</code> holds the rows corresponding to columns <em>a</em>, <em>b</em> and <em>c</em> for that group. We compute the <code>mean()</code> on each of these columns using the already familiar base function <code>lapply()</code>.</p></li>
<li><p>Each group returns a list of three elements containing the mean value which will become the columns of the resulting <code>data.table</code>.</p></li>
<li><p>Since <code>lapply()</code> returns a <em>list</em>, there is no need to wrap it with an additional <code>.()</code> (if necessary, refer to <a href="#tip-1">this tip</a>).</p></li>
</ul>
-</div>
-</div>
-<div id="section-28" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>We are almost there. There is one little thing left to address. In our <code>flights</code> <em>data.table</em>, we only wanted to calculate the <code>mean()</code> of two columns <code>arr_delay</code> and <code>dep_delay</code>. But <code>.SD</code> would contain all the columns other than the grouping variables by default.</p>
-<div id="how-can-we-specify-just-the-columns-we-would-like-to-compute-the-mean-on" class="section level4">
-<h4>– How can we specify just the columns we would like to compute the <code>mean()</code> on?</h4>
-</div>
-<div id="sdcols" class="section level4 bs-callout bs-callout-info">
-<h4>.SDcols</h4>
+
+<h4>– How can we specify just the columns we would like to compute the <code>mean()</code> on?</h4>
+
+<h4>.SDcols {.bs-callout .bs-callout-info}</h4>
+
<p>Using the argument <code>.SDcols</code>. It accepts either column names or column indices. For example, <code>.SDcols = c("arr_delay", "dep_delay")</code> ensures that <code>.SD</code> contains only these two columns for each group.</p>
-<p>Similar to the <a href="#with_false">with = FALSE section</a>, you can also provide the columns to remove instead of columns to keep using <code>-</code> or <code>!</code> sign as well as select consecutive columns as <code>colA:colB</code> and deselect consecutive columns as <code>!(colA:colB) or</code>-(colA:colB)`.</p>
-</div>
-</div>
-<div id="section-29" class="section level1">
-<h1></h1>
-<p>Now let us try to use <code>.SD</code> along with <code>.SDcols</code> to get the <code>mean()</code> of <code>arr_delay</code> and <code>dep_delay</code> columns grouped by <code>origin</code>, <code>dest</code> and <code>month</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[carrier ==<span class="st"> "AA"</span>, ## Only on trips with carrier "AA"
- <span class="kw">lapply</span>(.SD, mean), ## compute the mean
- by =<span class="st"> </span>.(origin, dest, month), ## for every 'origin,dest,month'
- .SDcols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>)] ## for just those specified in .SDcols
-<span class="co"># origin dest month arr_delay dep_delay</span>
-<span class="co"># 1: JFK LAX 1 6.590361 14.2289157</span>
-<span class="co"># 2: LGA PBI 1 -7.758621 0.3103448</span>
-<span class="co"># 3: EWR LAX 1 1.366667 7.5000000</span>
-<span class="co"># 4: JFK MIA 1 15.720670 18.7430168</span>
-<span class="co"># 5: JFK SEA 1 14.357143 30.7500000</span>
-<span class="co"># --- </span>
-<span class="co"># 196: LGA MIA 10 -6.251799 -1.4208633</span>
-<span class="co"># 197: JFK MIA 10 -1.880184 6.6774194</span>
-<span class="co"># 198: EWR PHX 10 -3.032258 -4.2903226</span>
-<span class="co"># 199: JFK MCO 10 -10.048387 -1.6129032</span>
-<span class="co"># 200: JFK DCA 10 16.483871 15.5161290</span></code></pre></div>
-<div id="f-subset-.sd-for-each-group" class="section level3">
+
+<p>Similar to the <a href="#with_false">with = FALSE section</a>, you can also provide the columns to remove instead of the columns to keep using the <code>-</code> or <code>!</code> sign, as well as select consecutive columns as <code>colA:colB</code> and deselect consecutive columns as <code>!(colA:colB)</code> or <code>-(colA:colB)</code>.</p>
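+
+<p>A minimal sketch of both styles, assuming a toy table shaped like the vignette's <code>DT</code> (the variable names <code>keep</code> and <code>drop</code> are illustrative): naming the columns to keep, and naming the column to leave out with <code>!</code>, should give <code>.SD</code> the same two columns here.</p>
+
+<pre><code class="r">library(data.table)
+
+# Toy table, assumed for illustration (same shape as the vignette's DT)
+DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
+
+# Keep only columns a and b in .SD
+keep <- DT[, lapply(.SD, mean), by = ID, .SDcols = c("a", "b")]
+
+# Equivalent here: drop column c from .SD instead
+drop <- DT[, lapply(.SD, mean), by = ID, .SDcols = !c("c")]
+
+names(keep)
+names(drop)
+</code></pre>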
+
+<p>#
+Now let us try to use <code>.SD</code> along with <code>.SDcols</code> to get the <code>mean()</code> of <code>arr_delay</code> and <code>dep_delay</code> columns grouped by <code>origin</code>, <code>dest</code> and <code>month</code>.</p>
+
+<pre><code class="r">flights[carrier == "AA", ## Only on trips with carrier "AA"
+ lapply(.SD, mean), ## compute the mean
+ by = .(origin, dest, month), ## for every 'origin,dest,month'
+ .SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols
+# origin dest month arr_delay dep_delay
+# 1: JFK LAX 1 6.590361 14.2289157
+# 2: LGA PBI 1 -7.758621 0.3103448
+# 3: EWR LAX 1 1.366667 7.5000000
+# 4: JFK MIA 1 15.720670 18.7430168
+# 5: JFK SEA 1 14.357143 30.7500000
+# ---
+# 196: LGA MIA 10 -6.251799 -1.4208633
+# 197: JFK MIA 10 -1.880184 6.6774194
+# 198: EWR PHX 10 -3.032258 -4.2903226
+# 199: JFK MCO 10 -10.048387 -1.6129032
+# 200: JFK DCA 10 16.483871 15.5161290
+</code></pre>
+
<h3>f) Subset <code>.SD</code> for each group:</h3>
-<div id="how-can-we-return-the-first-two-rows-for-each-month" class="section level4">
-<h4>– How can we return the first two rows for each <code>month</code>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[, <span class="kw">head</span>(.SD, <span class="dv">2</span>), by =<span class="st"> </span>month]
-<span class="kw">head</span>(ans)
-<span class="co"># month year day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 1 2014 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 1 2014 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2 2014 1 -1 1 AA JFK LAX 358 2475 8</span>
-<span class="co"># 4: 2 2014 1 -5 3 AA JFK LAX 358 2475 11</span>
-<span class="co"># 5: 3 2014 1 -11 36 AA JFK LAX 375 2475 8</span>
-<span class="co"># 6: 3 2014 1 -3 14 AA JFK LAX 368 2475 11</span></code></pre></div>
-</div>
-<div id="section-30" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– How can we return the first two rows for each <code>month</code>?</h4>
+
+<pre><code class="r">ans <- flights[, head(.SD, 2), by = month]
+head(ans)
+# month year day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 1 2014 1 14 13 AA JFK LAX 359 2475 9
+# 2: 1 2014 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2 2014 1 -1 1 AA JFK LAX 358 2475 8
+# 4: 2 2014 1 -5 3 AA JFK LAX 358 2475 11
+# 5: 3 2014 1 -11 36 AA JFK LAX 375 2475 8
+# 6: 3 2014 1 -3 14 AA JFK LAX 368 2475 11
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p><code>.SD</code> is a <em>data.table</em> that holds all the rows for <em>that group</em>. We simply subset the first two rows as we have seen <a href="#subset-rows-integer">here</a> already.</p></li>
<li><p>For each group, <code>head(.SD, 2)</code> returns the first two rows as a <em>data.table</em> which is also a list. So we do not have to wrap it with <code>.()</code>.</p></li>
</ul>
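+
+<p>The same idea extends to other row subsets of <code>.SD</code>. As a hedged sketch on a toy table (assumed here, shaped like the vignette's <code>DT</code>), <code>.SD[.N]</code> picks the last row of each group, since <code>.N</code> holds the number of rows in the current group.</p>
+
+<pre><code class="r">library(data.table)
+
+# Toy table, assumed for illustration (same shape as the vignette's DT)
+DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
+
+# Last row of every ID group; the result is a data.table, so no .() wrapper is needed
+last_rows <- DT[, .SD[.N], by = ID]
+last_rows
+</code></pre>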
-</div>
-</div>
-<div id="g-why-keep-j-so-flexible" class="section level3">
+
<h3>g) Why keep <code>j</code> so flexible?</h3>
+
<p>So that we have a consistent syntax and keep using already existing (and familiar) base functions instead of learning new functions. To illustrate, let us use the <em>data.table</em> <code>DT</code> we created at the very beginning under the <a href="#what-is-datatable-1a">What is a data.table?</a> section.</p>
-<div id="how-can-we-concatenate-columns-a-and-b-for-each-group-in-id" class="section level4">
-<h4>– How can we concatenate columns <code>a</code> and <code>b</code> for each group in <code>ID</code>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[, .(<span class="dt">val =</span> <span class="kw">c</span>(a,b)), by =<span class="st"> </span>ID]
-<span class="co"># ID val</span>
-<span class="co"># 1: b 1</span>
-<span class="co"># 2: b 2</span>
-<span class="co"># 3: b 3</span>
-<span class="co"># 4: b 7</span>
-<span class="co"># 5: b 8</span>
-<span class="co"># 6: b 9</span>
-<span class="co"># 7: a 4</span>
-<span class="co"># 8: a 5</span>
-<span class="co"># 9: a 10</span>
-<span class="co"># 10: a 11</span>
-<span class="co"># 11: c 6</span>
-<span class="co"># 12: c 12</span></code></pre></div>
-</div>
-<div id="section-31" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– How can we concatenate columns <code>a</code> and <code>b</code> for each group in <code>ID</code>?</h4>
+
+<pre><code class="r">DT[, .(val = c(a,b)), by = ID]
+# ID val
+# 1: b 1
+# 2: b 2
+# 3: b 3
+# 4: b 7
+# 5: b 8
+# 6: b 9
+# 7: a 4
+# 8: a 5
+# 9: a 10
+# 10: a 11
+# 11: c 6
+# 12: c 12
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li>That’s it. There is no special syntax required. All we need to know is the base function <code>c()</code> which concatenates vectors and <a href="#tip-1">the tip from before</a>.</li>
+<li>That's it. There is no special syntax required. All we need to know is the base function <code>c()</code> which concatenates vectors and <a href="#tip-1">the tip from before</a>.</li>
</ul>
-</div>
-<div id="what-if-we-would-like-to-have-all-the-values-of-column-a-and-b-concatenated-but-returned-as-a-list-column" class="section level4">
-<h4>– What if we would like to have all the values of column <code>a</code> and <code>b</code> concatenated, but returned as a list column?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[, .(<span class="dt">val =</span> <span class="kw">list</span>(<span class="kw">c</span>(a,b))), by =<span class="st"> </span>ID]
-<span class="co"># ID val</span>
-<span class="co"># 1: b 1,2,3,7,8,9</span>
-<span class="co"># 2: a 4, 5,10,11</span>
-<span class="co"># 3: c 6,12</span></code></pre></div>
-</div>
-<div id="section-32" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– What if we would like to have all the values of column <code>a</code> and <code>b</code> concatenated, but returned as a list column?</h4>
+
+<pre><code class="r">DT[, .(val = list(c(a,b))), by = ID]
+# ID val
+# 1: b 1,2,3,7,8,9
+# 2: a 4, 5,10,11
+# 3: c 6,12
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Here, we first concatenate the values with <code>c(a,b)</code> for each group, and wrap that with <code>list()</code>. So for each group, we return a list of all concatenated values.</p></li>
<li><p>Note those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others.</p></li>
</ul>
-</div>
-</div>
-</div>
-<div id="section-33" class="section level1">
-<h1></h1>
-<p>Once you start internalising usage in <code>j</code>, you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of <code>print()</code>.</p>
+
+<p>#
+Once you start internalising usage in <code>j</code>, you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of <code>print()</code>.</p>
+
<p>For example:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## (1) look at the difference between
-DT[, <span class="kw">print</span>(<span class="kw">c</span>(a,b)), by =<span class="st"> </span>ID]
-<span class="co"># [1] 1 2 3 7 8 9</span>
-<span class="co"># [1] 4 5 10 11</span>
-<span class="co"># [1] 6 12</span>
-<span class="co"># Empty data.table (0 rows) of 1 col: ID</span>
+
+<pre><code class="r">## (1) look at the difference between
+DT[, print(c(a,b)), by = ID]
+# [1] 1 2 3 7 8 9
+# [1] 4 5 10 11
+# [1] 6 12
+# Empty data.table (0 rows) of 1 col: ID
## (2) and
-DT[, <span class="kw">print</span>(<span class="kw">list</span>(<span class="kw">c</span>(a,b))), by =<span class="st"> </span>ID]
-<span class="co"># [[1]]</span>
-<span class="co"># [1] 1 2 3 7 8 9</span>
-<span class="co"># </span>
-<span class="co"># [[1]]</span>
-<span class="co"># [1] 4 5 10 11</span>
-<span class="co"># </span>
-<span class="co"># [[1]]</span>
-<span class="co"># [1] 6 12</span>
-<span class="co"># Empty data.table (0 rows) of 1 col: ID</span></code></pre></div>
+DT[, print(list(c(a,b))), by = ID]
+# [[1]]
+# [1] 1 2 3 7 8 9
+#
+# [[1]]
+# [1] 4 5 10 11
+#
+# [[1]]
+# [1] 6 12
+# Empty data.table (0 rows) of 1 col: ID
+</code></pre>
+
<p>In (1), for each group, a vector is returned, with length = 6,4,2 here. However (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore (1) results in a length of <code>6+4+2 = 12</code>, whereas (2) returns <code>1+1+1=3</code>.</p>
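+
+<p>The row counts described above can be checked directly. This is a sketch on a toy table assumed to match the vignette's <code>DT</code>; the names <code>long</code> and <code>nested</code> are illustrative.</p>
+
+<pre><code class="r">library(data.table)
+
+# Toy table, assumed for illustration (same shape as the vignette's DT)
+DT <- data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
+
+# (1) a plain vector per group: one output row per element
+long <- DT[, .(val = c(a, b)), by = ID]
+
+# (2) a length-1 list per group: one output row per group, holding the whole vector
+nested <- DT[, .(val = list(c(a, b))), by = ID]
+
+nrow(long)                   # 12
+nrow(nested)                 # 3
+sapply(nested$val, length)   # 6 4 2
+</code></pre>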
-<div id="summary" class="section level2">
+
<h2>Summary</h2>
+
<p>The general form of <em>data.table</em> syntax is:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[i, j, by]</code></pre></div>
+
+<pre><code class="r">DT[i, j, by]
+</code></pre>
+
<p>We have seen so far that,</p>
-<div id="using-i" class="section level4 bs-callout bs-callout-info">
-<h4>Using <code>i</code>:</h4>
+
+<h4>Using <code>i</code>: {.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>We can subset rows similar to a <em>data.frame</em> - except you don’t have to use <code>DT$</code> repetitively since columns within the frame of a <em>data.table</em> are seen as if they are <em>variables</em>.</p></li>
-<li><p>We can also sort a <em>data.table</em> using <code>order()</code>, which internally uses <em>data.table</em>’s fast order for performance.</p></li>
+<li><p>We can subset rows similar to a <em>data.frame</em> - except you don't have to use <code>DT$</code> repetitively since columns within the frame of a <em>data.table</em> are seen as if they are <em>variables</em>.</p></li>
+<li><p>We can also sort a <em>data.table</em> using <code>order()</code>, which internally uses <em>data.table</em>'s fast order for performance.</p></li>
</ul>
-<p>We can do much more in <code>i</code> by keying a <em>data.table</em>, which allows blazing fast subsets and joins. We will see this in the <em>“Keys and fast binary search based subsets”</em> and <em>“Joins and rolling joins”</em> vignette.</p>
-</div>
-<div id="using-j" class="section level4 bs-callout bs-callout-info">
-<h4>Using <code>j</code>:</h4>
-<ol style="list-style-type: decimal">
+
+<p>We can do much more in <code>i</code> by keying a <em>data.table</em>, which allows blazing fast subsets and joins. We will see this in the <em>“Keys and fast binary search based subsets”</em> and <em>“Joins and rolling joins”</em> vignettes.</p>
+
+<h4>Using <code>j</code>: {.bs-callout .bs-callout-info}</h4>
+
+<ol>
<li><p>Select columns the <em>data.table</em> way: <code>DT[, .(colA, colB)]</code>.</p></li>
<li><p>Select columns the <em>data.frame</em> way: <code>DT[, c("colA", "colB"), with = FALSE]</code>.</p></li>
<li><p>Compute on columns: <code>DT[, .(sum(colA), mean(colB))]</code>.</p></li>
<li><p>Provide names if necessary: <code>DT[, .(sA =sum(colA), mB = mean(colB))]</code>.</p></li>
<li><p>Combine with <code>i</code>: <code>DT[colA > value, sum(colB)]</code>.</p></li>
</ol>
-</div>
-</div>
-</div>
-<div id="section-34" class="section level1">
-<h1></h1>
-<div id="using-by" class="section level4 bs-callout bs-callout-info">
-<h4>Using <code>by</code>:</h4>
+
+<p>#</p>
+
+<h4>Using <code>by</code>: {.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Using <code>by</code>, we can group by columns by specifying a <em>list of columns</em> or a <em>character vector of column names</em> or even <em>expressions</em>. The flexibility of <code>j</code>, combined with <code>by</code> and <code>i</code> makes for a very powerful syntax.</p></li>
<li><p><code>by</code> can handle multiple columns and also <em>expressions</em>.</p></li>
<li><p>We can <code>keyby</code> grouping columns to automatically sort the grouped result.</p></li>
-<li><p>We can use <code>.SD</code> and <code>.SDcols</code> in <code>j</code> to operate on multiple columns using already familiar base functions. Here are some examples:</p>
-<ol style="list-style-type: decimal">
-<li><p><code>DT[, lapply(.SD, fun), by = ..., .SDcols = ...]</code> - applies <code>fun</code> to all columns specified in <code>.SDcols</code> while grouping by the columns specified in <code>by</code>.</p></li>
-<li><p><code>DT[, head(.SD, 2), by = ...]</code> - return the first two rows for each group.</p></li>
-<li><p><code>DT[col > val, head(.SD, 1), by = ...]</code> - combine <code>i</code> along with <code>j</code> and <code>by</code>.</p></li>
-</ol></li>
+<li><p>We can use <code>.SD</code> and <code>.SDcols</code> in <code>j</code> to operate on multiple columns using already familiar base functions. Here are some examples:</p></li>
</ul>
-</div>
-</div>
-<div id="section-35" class="section level1">
-<h1></h1>
-<div id="and-remember-the-tip" class="section level4 bs-callout bs-callout-warning">
-<h4>And remember the tip:</h4>
+
+<pre><code>1. `DT[, lapply(.SD, fun), by = ..., .SDcols = ...]` - applies `fun` to all columns specified in `.SDcols` while grouping by the columns specified in `by`.
+
+2. `DT[, head(.SD, 2), by = ...]` - return the first two rows for each group.
+
+3. `DT[col > val, head(.SD, 1), by = ...]` - combine `i` along with `j` and `by`.
+</code></pre>
+
+<p>#</p>
+
+<h4>And remember the tip: {.bs-callout .bs-callout-warning}</h4>
+
<p>As long as <code>j</code> returns a <em>list</em>, each element of the list will become a column in the resulting <em>data.table</em>.</p>
-</div>
-</div>
-<div id="section-36" class="section level1">
-<h1></h1>
-<p>We will see how to <em>add/update/delete</em> columns <em>by reference</em> and how to combine them with <code>i</code> and <code>by</code> in the next vignette.</p>
-<hr />
-</div>
+<p>#</p>
+<p>We will see how to <em>add/update/delete</em> columns <em>by reference</em> and how to combine them with <code>i</code> and <code>by</code> in the next vignette.</p>
-<!-- dynamically load mathjax for compatibility with self-contained -->
-<script>
- (function () {
- var script = document.createElement("script");
- script.type = "text/javascript";
- script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
- document.getElementsByTagName("head")[0].appendChild(script);
- })();
-</script>
+<hr/>
</body>
+
</html>
diff --git a/inst/doc/datatable-keys-fast-subset.html b/inst/doc/datatable-keys-fast-subset.html
index 4b90d19..2512e12 100644
--- a/inst/doc/datatable-keys-fast-subset.html
+++ b/inst/doc/datatable-keys-fast-subset.html
@@ -1,679 +1,854 @@
<!DOCTYPE html>
-
-<html xmlns="http://www.w3.org/1999/xhtml">
-
+<html>
<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
+
+<title>Data {#data}</title>
+
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
-<meta charset="utf-8">
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<meta name="generator" content="pandoc" />
-
-<meta name="viewport" content="width=device-width, initial-scale=1">
-
-
-<meta name="date" content="2017-01-31" />
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
+
+ pre .literal {
+ color: #990073
+ }
+
+ pre .number {
+ color: #099;
+ }
+
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
+
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
+
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
+
+ pre .string {
+ color: #d14;
+ }
+</style>
-<title>Keys and fast binary search based subset</title>
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
-<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
-div.sourceCode { overflow-x: auto; }
-table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
- margin: 0; padding: 0; vertical-align: baseline; border: none; }
-table.sourceCode { width: 100%; line-height: 100%; }
-td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
-td.sourceCode { padding-left: 5px; }
-code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
-code > span.dt { color: #902000; } /* DataType */
-code > span.dv { color: #40a070; } /* DecVal */
-code > span.bn { color: #40a070; } /* BaseN */
-code > span.fl { color: #40a070; } /* Float */
-code > span.ch { color: #4070a0; } /* Char */
-code > span.st { color: #4070a0; } /* String */
-code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
-code > span.ot { color: #007020; } /* Other */
-code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
-code > span.fu { color: #06287e; } /* Function */
-code > span.er { color: #ff0000; font-weight: bold; } /* Error */
-code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
-code > span.cn { color: #880000; } /* Constant */
-code > span.sc { color: #4070a0; } /* SpecialChar */
-code > span.vs { color: #4070a0; } /* VerbatimString */
-code > span.ss { color: #bb6688; } /* SpecialString */
-code > span.im { } /* Import */
-code > span.va { color: #19177c; } /* Variable */
-code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
-code > span.op { color: #666666; } /* Operator */
-code > span.bu { } /* BuiltIn */
-code > span.ex { } /* Extension */
-code > span.pp { color: #bc7a00; } /* Preprocessor */
-code > span.at { color: #7d9029; } /* Attribute */
-code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
-code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
-code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
-code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
+
+h1 {
+ font-size:2.2em;
+}
+
+h2 {
+ font-size:1.8em;
+}
+
+h3 {
+ font-size:1.4em;
+}
+
+h4 {
+ font-size:1.0em;
+}
+
+h5 {
+ font-size:0.9em;
+}
+
+h6 {
+ font-size:0.8em;
+}
+
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
+
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
+
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
+
+code[class] {
+ background-color: #F8F8F8;
+}
+
+table, td, th {
+ border: none;
+}
+
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+
+ img {
+ max-width: 100% !important;
+ }
+
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
+
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
</style>
-<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
-
</head>
<body>
+<p>This vignette is aimed at those who are already familiar with <em>data.table</em> syntax, its general form, how to subset rows in <code>i</code>, select and compute on columns, add/modify/delete columns <em>by reference</em> in <code>j</code> and group by using <code>by</code>. If you're not familiar with these concepts, please read the <em>“Introduction to data.table”</em> and <em>“Reference semantics”</em> vignettes first.</p>
+<hr/>
+<h2>Data {#data}</h2>
+<p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p>
-<h1 class="title toc-ignore">Keys and fast binary search based subset</h1>
-<h4 class="date"><em>2017-01-31</em></h4>
+<pre><code class="r">flights <- fread("flights14.csv")
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
+dim(flights)
+# [1] 253316 11
+</code></pre>
-
-
-<p>This vignette is aimed at those who are already familiar with <em>data.table</em> syntax, its general form, how to subset rows in <code>i</code>, select and compute on columns, add/modify/delete columns <em>by reference</em> in <code>j</code> and group by using <code>by</code>. If you’re not familiar with these concepts, please read the <em>“Introduction to data.table”</em> and <em>“Reference semantics”</em> vignettes first.</p>
-<hr />
-<div id="data" class="section level2">
-<h2>Data</h2>
-<p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span>
-<span class="kw">dim</span>(flights)
-<span class="co"># [1] 253316 11</span></code></pre></div>
-</div>
-<div id="introduction" class="section level2">
<h2>Introduction</h2>
+
<p>In this vignette, we will</p>
+
<ul>
<li><p>first introduce the concept of <code>key</code> in <em>data.table</em>, and set and use keys to perform <em>fast binary search</em> based subsets in <code>i</code>,</p></li>
<li><p>see that we can combine key based subsets along with <code>j</code> and <code>by</code> in the exact same way as before,</p></li>
<li><p>look at other additional useful arguments - <code>mult</code> and <code>nomatch</code>,</p></li>
<li><p>and finally conclude by looking at the advantage of setting keys - perform <em>fast binary search based subsets</em> and compare with the traditional vector scan approach.</p></li>
</ul>
-</div>
-<div id="keys" class="section level2">
+
<h2>1. Keys</h2>
-<div id="a-what-is-a-key" class="section level3">
+
<h3>a) What is a <em>key</em>?</h3>
-<p>In the <em>“Introduction to data.table”</em> vignette, we saw how to subset rows in <code>i</code> using logical expressions, row numbers and using <code>order()</code>. In this section, we will look at another way of subsetting incredibly fast - using <em>keys</em>.</p>
-<p>But first, let’s start by looking at <em>data.frames</em>. All <em>data.frames</em> have a row names attribute. Consider the <em>data.frame</em> <code>DF</code> below.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(1L)
-DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">ID1 =</span> <span class="kw">sample</span>(letters[<span class="dv">1</span>:<span class="dv">2</span>], <span class="dv">10</span>, <span class="ot">TRUE</span>),
- <span class="dt">ID2 =</span> <span class="kw">sample</span>(<span class="dv">1</span>:<span class="dv">3</span>, <span class="dv">10</span>, <span class="ot">TRUE</span>),
- <span class="dt">val =</span> <span class="kw">sample</span>(<span class="dv">10</span>),
- <span class="dt">stringsAsFactors =</span> <span class="ot">FALSE</span>,
- <span class="dt">row.names =</span> <span class="kw">sample</span>(LETTERS[<span class="dv">1</span>:<span class="dv">10</span>]))
+
+<p>In the <em>“Introduction to data.table”</em> vignette, we saw how to subset rows in <code>i</code> using logical expressions, row numbers and using <code>order()</code>. In this section, we will look at another way of subsetting incredibly fast - using <em>keys</em>.</p>
+
+<p>But first, let's start by looking at <em>data.frames</em>. All <em>data.frames</em> have a row names attribute. Consider the <em>data.frame</em> <code>DF</code> below.</p>
+
+<pre><code class="r">set.seed(1L)
+DF = data.frame(ID1 = sample(letters[1:2], 10, TRUE),
+ ID2 = sample(1:3, 10, TRUE),
+ val = sample(10),
+ stringsAsFactors = FALSE,
+ row.names = sample(LETTERS[1:10]))
DF
-<span class="co"># ID1 ID2 val</span>
-<span class="co"># C a 3 5</span>
-<span class="co"># D a 1 6</span>
-<span class="co"># E b 2 4</span>
-<span class="co"># G a 1 2</span>
-<span class="co"># B b 1 10</span>
-<span class="co"># H a 2 8</span>
-<span class="co"># I b 1 9</span>
-<span class="co"># F b 2 1</span>
-<span class="co"># J a 3 7</span>
-<span class="co"># A b 2 3</span>
-
-<span class="kw">rownames</span>(DF)
-<span class="co"># [1] "C" "D" "E" "G" "B" "H" "I" "F" "J" "A"</span></code></pre></div>
+# ID1 ID2 val
+# C a 3 5
+# D a 1 6
+# E b 2 4
+# G a 1 2
+# B b 1 10
+# H a 2 8
+# I b 1 9
+# F b 2 1
+# J a 3 7
+# A b 2 3
+
+rownames(DF)
+# [1] "C" "D" "E" "G" "B" "H" "I" "F" "J" "A"
+</code></pre>
+
<p>We can <em>subset</em> a particular row using its row name as shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DF[<span class="st">"C"</span>, ]
-<span class="co"># ID1 ID2 val</span>
-<span class="co"># C a 3 5</span></code></pre></div>
+
+<pre><code class="r">DF["C", ]
+# ID1 ID2 val
+# C a 3 5
+</code></pre>
+
<p>i.e., row names are more or less <em>an index</em> to rows of a <em>data.frame</em>. However,</p>
-<ol style="list-style-type: decimal">
+
+<ol>
<li><p>Each row is limited to <em>exactly one</em> row name.</p>
+
<p>But, a person (for example) has at least two names - a <em>first</em> and a <em>second</em> name. It is useful to organise a telephone directory by <em>surname</em> then <em>first name</em>.</p></li>
<li><p>And row names should be <em>unique</em>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">rownames</span>(DF) =<span class="st"> </span><span class="kw">sample</span>(LETTERS[<span class="dv">1</span>:<span class="dv">5</span>], <span class="dv">10</span>, <span class="ot">TRUE</span>)
-<span class="co"># Warning: non-unique values when setting 'row.names': 'C', 'D'</span>
-<span class="co"># Error in `row.names<-.data.frame`(`*tmp*`, value = value): duplicate 'row.names' are not allowed</span></code></pre></div></li>
+
+<pre><code class="r">rownames(DF) = sample(LETTERS[1:5], 10, TRUE)
+# Warning: non-unique values when setting 'row.names': 'C', 'D'
+# Error in `row.names<-.data.frame`(`*tmp*`, value = value): duplicate 'row.names' are not allowed
+</code></pre></li>
</ol>
-<p>Now let’s convert it to a <em>data.table</em>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">as.data.table</span>(DF)
+
+<p>Now let's convert it to a <em>data.table</em>.</p>
+
+<pre><code class="r">DT = as.data.table(DF)
DT
-<span class="co"># ID1 ID2 val</span>
-<span class="co"># 1: a 3 5</span>
-<span class="co"># 2: a 1 6</span>
-<span class="co"># 3: b 2 4</span>
-<span class="co"># 4: a 1 2</span>
-<span class="co"># 5: b 1 10</span>
-<span class="co"># 6: a 2 8</span>
-<span class="co"># 7: b 1 9</span>
-<span class="co"># 8: b 2 1</span>
-<span class="co"># 9: a 3 7</span>
-<span class="co"># 10: b 2 3</span>
-
-<span class="kw">rownames</span>(DT)
-<span class="co"># [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"</span></code></pre></div>
+# ID1 ID2 val
+# 1: a 3 5
+# 2: a 1 6
+# 3: b 2 4
+# 4: a 1 2
+# 5: b 1 10
+# 6: a 2 8
+# 7: b 1 9
+# 8: b 2 1
+# 9: a 3 7
+# 10: b 2 3
+
+rownames(DT)
+# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
+</code></pre>
+
<ul>
<li><p>Note that row names have been reset.</p></li>
-<li><p><em>data.tables</em> never uses row names. Since <em>data.tables</em> <strong>inherit</strong> from <em>data.frames</em>, it still has the row names attribute. But it never uses them. We’ll see in a moment as to why.</p>
+<li><p><em>data.tables</em> never uses row names. Since <em>data.tables</em> <strong>inherit</strong> from <em>data.frames</em>, it still has the row names attribute. But it never uses them. We'll see in a moment as to why.</p>
+
<p>If you would like to preserve the row names, use <code>keep.rownames = TRUE</code> in <code>as.data.table()</code> - this will create a new column called <code>rn</code> and assign row names to this column.</p></li>
</ul>
+
<p>Instead, in <em>data.tables</em> we set and use <code>keys</code>. Think of a <code>key</code> as <strong>supercharged rownames</strong>.</p>
-<div id="key-properties" class="section level4 bs-callout bs-callout-info">
-<h4>Keys and their properties</h4>
-<ol style="list-style-type: decimal">
-<li><p>We can set keys on <em>multiple columns</em> and the column can be of <em>different types</em> – <em>integer</em>, <em>numeric</em>, <em>character</em>, <em>factor</em>, <em>integer64</em> etc. <em>list</em> and <em>complex</em> types are not supported yet.</p></li>
+
+<h4>Keys and their properties {.bs-callout .bs-callout-info #key-properties}</h4>
+
+<ol>
+<li><p>We can set keys on <em>multiple columns</em> and the column can be of <em>different types</em> – <em>integer</em>, <em>numeric</em>, <em>character</em>, <em>factor</em>, <em>integer64</em> etc. <em>list</em> and <em>complex</em> types are not supported yet.</p></li>
<li><p>Uniqueness is not enforced, i.e., duplicate key values are allowed. Since rows are sorted by key, any duplicates in the key columns will appear consecutively.</p></li>
<li><p>Setting a <code>key</code> does <em>two</em> things:</p>
-<ol style="list-style-type: lower-alpha">
-<li><p>physically reorders the rows of the <em>data.table</em> by the column(s) provided <em>by reference</em>, always in <em>increasing</em> order.</p></li>
-<li><p>marks those columns as <em>key</em> columns by setting an attribute called <code>sorted</code> to the <em>data.table</em>.</p></li>
-</ol>
+
+<p>a. physically reorders the rows of the <em>data.table</em> by the column(s) provided <em>by reference</em>, always in <em>increasing</em> order.</p>
+
+<p>b. marks those columns as <em>key</em> columns by setting an attribute called <code>sorted</code> to the <em>data.table</em>.</p>
+
<p>Since the rows are reordered, a <em>data.table</em> can have at most one key because it can not be sorted in more than one way.</p></li>
</ol>
-</div>
-</div>
-</div>
-<div id="section" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>For the rest of the vignette, we will work with <code>flights</code> data set.</p>
-<div id="b-set-get-and-use-keys-on-a-data.table" class="section level3">
+
<h3>b) Set, get and use keys on a <em>data.table</em></h3>
-<div id="how-can-we-set-the-column-origin-as-key-in-the-data.table-flights" class="section level4">
-<h4>– How can we set the column <code>origin</code> as key in the <em>data.table</em> <code>flights</code>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setkey</span>(flights, origin)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span>
-<span class="co"># 2: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16</span>
-<span class="co"># 3: 2014 1 1 191 185 AA EWR DFW 214 1372 16</span>
-<span class="co"># 4: 2014 1 1 -1 -2 AA EWR DFW 214 1372 14</span>
-<span class="co"># 5: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6</span>
-<span class="co"># 6: 2014 1 1 4 -17 AA EWR DFW 215 1372 9</span>
-
-## alternatively we can provide character vectors to the function 'setkeyv()'
-<span class="co"># setkeyv(flights, "origin") # useful to program with</span></code></pre></div>
-</div>
-<div id="section-1" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– How can we set the column <code>origin</code> as key in the <em>data.table</em> <code>flights</code>?</h4>
+
+<pre><code class="r">setkey(flights, origin)
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 4 0 AA EWR LAX 339 2454 18
+# 2: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16
+# 3: 2014 1 1 191 185 AA EWR DFW 214 1372 16
+# 4: 2014 1 1 -1 -2 AA EWR DFW 214 1372 14
+# 5: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6
+# 6: 2014 1 1 4 -17 AA EWR DFW 215 1372 9
+
+## alternatively we can provide character vectors to the function 'setkeyv()'
+# setkeyv(flights, "origin") # useful to program with
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>You can use the function <code>setkey()</code> and provide the column names (without quoting them). This is helpful during interactive use.</p></li>
<li><p>Alternatively you can pass a character vector of column names to the function <code>setkeyv()</code>. This is particularly useful while designing functions to pass columns to set key on as function arguments.</p></li>
-<li><p>Note that we did not have to assign the result back to a variable. This is because like the <code>:=</code> function we saw in the <em>“Introduction to data.table”</em> vignette, <code>setkey()</code> and <code>setkeyv()</code> modify the input <em>data.table</em> <em>by reference</em>. They return the result invisibly.</p></li>
+<li><p>Note that we did not have to assign the result back to a variable. This is because like the <code>:=</code> function we saw in the <em>“Introduction to data.table”</em> vignette, <code>setkey()</code> and <code>setkeyv()</code> modify the input <em>data.table</em> <em>by reference</em>. They return the result invisibly.</p></li>
<li><p>The <em>data.table</em> is now reordered (or sorted) by the column we provided - <code>origin</code>. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the <em>data.table</em>, and is therefore very memory efficient.</p></li>
<li><p>You can also set keys directly when creating <em>data.tables</em> using the <code>data.table()</code> function using <code>key</code> argument. It takes a character vector of column names.</p></li>
</ul>
-</div>
-<div id="set-and" class="section level4 bs-callout bs-callout-info">
-<h4>set* and <code>:=</code>:</h4>
+
+<h4>set* and <code>:=</code>: {.bs-callout .bs-callout-info}</h4>
+
<p>In <em>data.table</em>, the <code>:=</code> operator and all the <code>set*</code> (e.g., <code>setkey</code>, <code>setorder</code>, <code>setnames</code> etc..) functions are the only ones which modify the input object <em>by reference</em>.</p>
-</div>
-</div>
-</div>
-<div id="section-2" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>Once you <em>key</em> a <em>data.table</em> by certain columns, you can subset by querying those key columns using the <code>.()</code> notation in <code>i</code>. Recall that <code>.()</code> is an <em>alias to</em> <code>list()</code>.</p>
-<div id="use-the-key-column-origin-to-subset-all-rows-where-the-origin-airport-matches-jfk" class="section level4">
-<h4>– Use the key column <code>origin</code> to subset all rows where the origin airport matches <em>“JFK”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"JFK"</span>)]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span>
-<span class="co"># --- </span>
-<span class="co"># 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17</span>
-<span class="co"># 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18</span>
-<span class="co"># 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17</span>
-<span class="co"># 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9</span>
-<span class="co"># 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11</span>
+
+<h4>– Use the key column <code>origin</code> to subset all rows where the origin airport matches <em>“JFK”</em></h4>
+
+<pre><code class="r">flights[.("JFK")]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
+# ---
+# 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17
+# 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18
+# 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17
+# 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9
+# 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11
## alternatively
-<span class="co"># flights[J("JFK")] (or) </span>
-<span class="co"># flights[list("JFK")]</span></code></pre></div>
-</div>
-<div id="section-3" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# flights[J("JFK")] (or)
+# flights[list("JFK")]
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>The <em>key</em> column has already been set to <code>origin</code>. So it is sufficient to provide the value, here <em>“JFK”</em>, directly. The <code>.()</code> syntax helps identify that the task requires looking up the value <em>“JFK”</em> in the key column of <em>data.table</em> (here column <code>origin</code> of <code>flights</code> <em>data.table</em>).</p></li>
-<li><p>The <em>row indices</em> corresponding to the value <em>“JFK”</em> in <code>origin</code> is obtained first. And since there is no expression in <code>j</code>, all columns corresponding to those row indices are returned.</p></li>
+<li><p>The <em>key</em> column has already been set to <code>origin</code>. So it is sufficient to provide the value, here <em>“JFK”</em>, directly. The <code>.()</code> syntax helps identify that the task requires looking up the value <em>“JFK”</em> in the key column of <em>data.table</em> (here column <code>origin</code> of <code>flights</code> <em>data.table</em>).</p></li>
+<li><p>The <em>row indices</em> corresponding to the value <em>“JFK”</em> in <code>origin</code> is obtained first. And since there is no expression in <code>j</code>, all columns corresponding to those row indices are returned.</p></li>
<li><p>On single column key of <em>character</em> type, you can drop the <code>.()</code> notation and use the values directly when subsetting, like subset using row names on <em>data.frames</em>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[<span class="st">"JFK"</span>] ## same as flights[.("JFK")]</code></pre></div></li>
+
+<pre><code class="r">flights["JFK"] ## same as flights[.("JFK")]
+</code></pre></li>
<li><p>We can subset any amount of values as required</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[<span class="kw">c</span>(<span class="st">"JFK"</span>, <span class="st">"LGA"</span>)] ## same as flights[.(c("JFK", "LGA"))]</code></pre></div>
-<p>This returns all columns corresponding to those rows where <code>origin</code> column matches either <em>“JFK”</em> or <em>“LGA”</em>.</p></li>
+
+<pre><code class="r">flights[c("JFK", "LGA")] ## same as flights[.(c("JFK", "LGA"))]
+</code></pre>
+
+<p>This returns all columns corresponding to those rows where <code>origin</code> column matches either <em>“JFK”</em> or <em>“LGA”</em>.</p></li>
</ul>
-</div>
-<div id="how-can-we-get-the-columns-a-data.table-is-keyed-by" class="section level4">
-<h4>– How can we get the column(s) a <em>data.table</em> is keyed by?</h4>
+
+<h4>– How can we get the column(s) a <em>data.table</em> is keyed by?</h4>
+
<p>Using the function <code>key()</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">key</span>(flights)
-<span class="co"># [1] "origin"</span></code></pre></div>
-</div>
-<div id="section-4" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<pre><code class="r">key(flights)
+# [1] "origin"
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>It returns a character vector of all the key columns.</p></li>
<li><p>If no key is set, it returns <code>NULL</code>.</p></li>
</ul>
-</div>
-<div id="c-keys-and-multiple-columns" class="section level3">
+
<h3>c) Keys and multiple columns</h3>
+
<p>To refresh, <em>keys</em> are like <em>supercharged</em> row names. We can set key on multiple columns and they can be of multiple types.</p>
-<div id="how-can-i-set-keys-on-both-origin-and-dest-columns" class="section level4">
-<h4>– How can I set keys on both <code>origin</code> <em>and</em> <code>dest</code> columns?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setkey</span>(flights, origin, dest)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 2 -2 -25 EV EWR ALB 30 143 7</span>
-<span class="co"># 2: 2014 1 3 88 79 EV EWR ALB 29 143 23</span>
-<span class="co"># 3: 2014 1 4 220 211 EV EWR ALB 32 143 15</span>
-<span class="co"># 4: 2014 1 4 35 19 EV EWR ALB 32 143 7</span>
-<span class="co"># 5: 2014 1 5 47 42 EV EWR ALB 26 143 8</span>
-<span class="co"># 6: 2014 1 5 66 62 EV EWR ALB 31 143 23</span>
+
+<h4>– How can I set keys on both <code>origin</code> <em>and</em> <code>dest</code> columns?</h4>
+
+<pre><code class="r">setkey(flights, origin, dest)
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 2 -2 -25 EV EWR ALB 30 143 7
+# 2: 2014 1 3 88 79 EV EWR ALB 29 143 23
+# 3: 2014 1 4 220 211 EV EWR ALB 32 143 15
+# 4: 2014 1 4 35 19 EV EWR ALB 32 143 7
+# 5: 2014 1 5 47 42 EV EWR ALB 26 143 8
+# 6: 2014 1 5 66 62 EV EWR ALB 31 143 23
## or alternatively
-<span class="co"># setkeyv(flights, c("origin", "dest")) # provide a character vector of column names</span>
+# setkeyv(flights, c("origin", "dest")) # provide a character vector of column names
+
+key(flights)
+# [1] "origin" "dest"
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
-<span class="kw">key</span>(flights)
-<span class="co"># [1] "origin" "dest"</span></code></pre></div>
-</div>
-<div id="section-5" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
<ul>
<li>It sorts the <em>data.table</em> first by the column <code>origin</code> and then by <code>dest</code> <em>by reference</em>.</li>
</ul>
-</div>
-<div id="subset-all-rows-using-key-columns-where-first-key-column-origin-matches-jfk-and-second-key-column-dest-matches-mia" class="section level4">
-<h4>– Subset all rows using key columns where first key column <code>origin</code> matches <em>“JFK”</em> and second key column <code>dest</code> matches <em>“MIA”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"JFK"</span>, <span class="st">"MIA"</span>)]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15</span>
-<span class="co"># 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9</span>
-<span class="co"># 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12</span>
-<span class="co"># 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5</span>
-<span class="co"># 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17</span>
-<span class="co"># --- </span>
-<span class="co"># 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16</span>
-<span class="co"># 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8</span>
-<span class="co"># 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6</span>
-<span class="co"># 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5</span>
-<span class="co"># 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19</span></code></pre></div>
-</div>
-<div id="multiple-key-point" class="section level4 bs-callout bs-callout-info">
-<h4>How does the subset work here?</h4>
+
+<h4>– Subset all rows using key columns where first key column <code>origin</code> matches <em>“JFK”</em> and second key column <code>dest</code> matches <em>“MIA”</em></h4>
+
+<pre><code class="r">flights[.("JFK", "MIA")]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15
+# 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9
+# 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12
+# 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5
+# 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17
+# ---
+# 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16
+# 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8
+# 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6
+# 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5
+# 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19
+</code></pre>
+
+<h4>How does the subset work here? {.bs-callout .bs-callout-info #multiple-key-point}</h4>
+
<ul>
-<li><p>It is important to understand how this works internally. <em>“JFK”</em> is first matched against the first key column <code>origin</code>. And <em>within those matching rows</em>, <em>“MIA”</em> is matched against the second key column <code>dest</code> to obtain <em>row indices</em> where both <code>origin</code> and <code>dest</code> match the given values.</p></li>
+<li><p>It is important to understand how this works internally. <em>“JFK”</em> is first matched against the first key column <code>origin</code>. And <em>within those matching rows</em>, <em>“MIA”</em> is matched against the second key column <code>dest</code> to obtain <em>row indices</em> where both <code>origin</code> and <code>dest</code> match the given values.</p></li>
<li><p>Since no <code>j</code> is provided, we simply return <em>all columns</em> corresponding to those row indices.</p></li>
</ul>
-</div>
-</div>
-</div>
-<div id="section-6" class="section level1">
-<h1></h1>
-<div id="subset-all-rows-where-just-the-first-key-column-origin-matches-jfk" class="section level4">
-<h4>– Subset all rows where just the first key column <code>origin</code> matches <em>“JFK”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">key</span>(flights)
-<span class="co"># [1] "origin" "dest"</span>
-
-flights[.(<span class="st">"JFK"</span>)] ## or in this case simply flights["JFK"], for convenience
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 10 4 B6 JFK ABQ 280 1826 20</span>
-<span class="co"># 2: 2014 1 2 134 161 B6 JFK ABQ 252 1826 22</span>
-<span class="co"># 3: 2014 1 7 6 6 B6 JFK ABQ 269 1826 20</span>
-<span class="co"># 4: 2014 1 8 15 -15 B6 JFK ABQ 259 1826 20</span>
-<span class="co"># 5: 2014 1 9 45 32 B6 JFK ABQ 267 1826 20</span>
-<span class="co"># --- </span>
-<span class="co"># 81479: 2014 10 31 0 -18 DL JFK TPA 142 1005 8</span>
-<span class="co"># 81480: 2014 10 31 1 -8 B6 JFK TPA 149 1005 19</span>
-<span class="co"># 81481: 2014 10 31 -2 -22 B6 JFK TPA 145 1005 14</span>
-<span class="co"># 81482: 2014 10 31 -8 -5 B6 JFK TPA 149 1005 9</span>
-<span class="co"># 81483: 2014 10 31 -4 -18 B6 JFK TPA 145 1005 8</span></code></pre></div>
-</div>
-<div id="section-7" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>#</p>
+
+<h4>– Subset all rows where just the first key column <code>origin</code> matches <em>“JFK”</em></h4>
+
+<pre><code class="r">key(flights)
+# [1] "origin" "dest"
+
+flights[.("JFK")] ## or in this case simply flights["JFK"], for convenience
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 10 4 B6 JFK ABQ 280 1826 20
+# 2: 2014 1 2 134 161 B6 JFK ABQ 252 1826 22
+# 3: 2014 1 7 6 6 B6 JFK ABQ 269 1826 20
+# 4: 2014 1 8 15 -15 B6 JFK ABQ 259 1826 20
+# 5: 2014 1 9 45 32 B6 JFK ABQ 267 1826 20
+# ---
+# 81479: 2014 10 31 0 -18 DL JFK TPA 142 1005 8
+# 81480: 2014 10 31 1 -8 B6 JFK TPA 149 1005 19
+# 81481: 2014 10 31 -2 -22 B6 JFK TPA 145 1005 14
+# 81482: 2014 10 31 -8 -5 B6 JFK TPA 149 1005 9
+# 81483: 2014 10 31 -4 -18 B6 JFK TPA 145 1005 8
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li>Since we did not provide any values for the second key column <code>dest</code>, it just matches <em>“JFK”</em> against the first key column <code>origin</code> and returns all the matched rows.</li>
+<li>Since we did not provide any values for the second key column <code>dest</code>, it just matches <em>“JFK”</em> against the first key column <code>origin</code> and returns all the matched rows.</li>
</ul>
-</div>
-<div id="subset-all-rows-where-just-the-second-key-column-dest-matches-mia" class="section level4">
-<h4>– Subset all rows where just the second key column <code>dest</code> matches <em>“MIA”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="kw">unique</span>(origin), <span class="st">"MIA"</span>)]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16</span>
-<span class="co"># 2: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6</span>
-<span class="co"># 3: 2014 1 1 -5 -8 AA EWR MIA 157 1085 11</span>
-<span class="co"># 4: 2014 1 1 43 42 UA EWR MIA 155 1085 15</span>
-<span class="co"># 5: 2014 1 1 60 49 UA EWR MIA 162 1085 21</span>
-<span class="co"># --- </span>
-<span class="co"># 9924: 2014 10 31 -11 -8 AA LGA MIA 157 1096 13</span>
-<span class="co"># 9925: 2014 10 31 -5 -11 AA LGA MIA 150 1096 9</span>
-<span class="co"># 9926: 2014 10 31 -2 10 AA LGA MIA 156 1096 6</span>
-<span class="co"># 9927: 2014 10 31 -2 -16 AA LGA MIA 156 1096 19</span>
-<span class="co"># 9928: 2014 10 31 1 -11 US LGA MIA 164 1096 15</span></code></pre></div>
-</div>
-<div id="whats-happening-here" class="section level4 bs-callout bs-callout-info">
-<h4>What’s happening here?</h4>
+
+<h4>– Subset all rows where just the second key column <code>dest</code> matches <em>“MIA”</em></h4>
+
+<pre><code class="r">flights[.(unique(origin), "MIA")]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16
+# 2: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6
+# 3: 2014 1 1 -5 -8 AA EWR MIA 157 1085 11
+# 4: 2014 1 1 43 42 UA EWR MIA 155 1085 15
+# 5: 2014 1 1 60 49 UA EWR MIA 162 1085 21
+# ---
+# 9924: 2014 10 31 -11 -8 AA LGA MIA 157 1096 13
+# 9925: 2014 10 31 -5 -11 AA LGA MIA 150 1096 9
+# 9926: 2014 10 31 -2 10 AA LGA MIA 156 1096 6
+# 9927: 2014 10 31 -2 -16 AA LGA MIA 156 1096 19
+# 9928: 2014 10 31 1 -11 US LGA MIA 164 1096 15
+</code></pre>
+
+<h4>What's happening here?</h4>
+
<ul>
-<li><p>Read <a href="#multiple-key-point">this</a> again. The value provided for the second key column <em>“MIA”</em> has to find the matching values in <code>dest</code> key column <em>on the matching rows provided by the first key column <code>origin</code></em>. We can not skip the values of key columns <em>before</em>. Therefore we provide <em>all</em> unique values from key column <code>origin</code>.</p></li>
-<li><p><em>“MIA”</em> is automatically recycled to fit the length of <code>unique(origin)</code> which is <em>3</em>.</p></li>
+<li><p>Read <a href="#multiple-key-point">this</a> again. The value provided for the second key column <em>“MIA”</em> has to find the matching values in the <code>dest</code> key column <em>on the matching rows provided by the first key column <code>origin</code></em>. We cannot skip the values of key columns <em>before</em>. Therefore we provide <em>all</em> unique values from key column <code>origin</code>.</p></li>
+<li><p><em>“MIA”</em> is automatically recycled to fit the length of <code>unique(origin)</code> which is <em>3</em>.</p></li>
</ul>
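A minimal sketch of the point above, assuming a small made-up table (`flights_toy` and its values are hypothetical and are not part of the vignette's `flights` data). Because the key is `(origin, dest)`, the query has to supply every `origin` value before it can match on `dest`, and `"MIA"` is recycled across them:

```r
library(data.table)

# Hypothetical toy data, keyed the same way as flights in the vignette.
flights_toy <- data.table(
  origin = c("JFK", "JFK", "LGA", "EWR", "EWR"),
  dest   = c("MIA", "TPA", "MIA", "MIA", "BOS"),
  delay  = c(5L, -3L, 12L, 0L, 7L)
)
setkey(flights_toy, origin, dest)

# unique(origin) has length 3; "MIA" is recycled to match that length,
# so the query is effectively (EWR, MIA), (JFK, MIA), (LGA, MIA).
res <- flights_toy[.(unique(origin), "MIA")]
print(res)
```

Every origin in this toy table has at least one flight to `"MIA"`, so no `NA` rows appear; with real data some origins may have no match.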
-</div>
-<div id="combining-keys-with-j-and-by" class="section level2">
+
<h2>2) Combining keys with <code>j</code> and <code>by</code></h2>
-<p>All we have seen so far is the same concept – obtaining <em>row indices</em> in <code>i</code>, but just using a different method – using <code>keys</code>. It shouldn’t be surprising that we can do exactly the same things in <code>j</code> and <code>by</code> as seen from the previous vignettes. We will highlight this with a few examples.</p>
-<div id="a-select-in-j" class="section level3">
+
+<p>All we have seen so far is the same concept – obtaining <em>row indices</em> in <code>i</code>, but just using a different method – using <code>keys</code>. It shouldn't be surprising that we can do exactly the same things in <code>j</code> and <code>by</code> as seen from the previous vignettes. We will highlight this with a few examples.</p>
+
<h3>a) Select in <code>j</code></h3>
-<div id="return-arr_delay-column-as-a-data.table-corresponding-to-origin-lga-and-dest-tpa." class="section level4">
-<h4>– Return <code>arr_delay</code> column as a <em>data.table</em> corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">key</span>(flights)
-<span class="co"># [1] "origin" "dest"</span>
-flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay)]
-<span class="co"># arr_delay</span>
-<span class="co"># 1: 1</span>
-<span class="co"># 2: 14</span>
-<span class="co"># 3: -17</span>
-<span class="co"># 4: -4</span>
-<span class="co"># 5: -12</span>
-<span class="co"># --- </span>
-<span class="co"># 1848: 39</span>
-<span class="co"># 1849: -24</span>
-<span class="co"># 1850: -12</span>
-<span class="co"># 1851: 21</span>
-<span class="co"># 1852: -11</span></code></pre></div>
-</div>
-<div id="section-8" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– Return <code>arr_delay</code> column as a <em>data.table</em> corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4>
+
+<pre><code class="r">key(flights)
+# [1] "origin" "dest"
+flights[.("LGA", "TPA"), .(arr_delay)]
+# arr_delay
+# 1: 1
+# 2: 14
+# 3: -17
+# 4: -4
+# 5: -12
+# ---
+# 1848: 39
+# 1849: -24
+# 1850: -12
+# 1851: 21
+# 1852: -11
+</code></pre>
+
<ul>
<li><p>The <em>row indices</em> corresponding to <code>origin == "LGA"</code> and <code>dest == "TPA"</code> are obtained using <em>key based subset</em>.</p></li>
<li><p>Once we have the row indices, we look at <code>j</code> which requires only the <code>arr_delay</code> column. So we simply select the column <code>arr_delay</code> for those <em>row indices</em> in the exact same way as we have seen in <em>Introduction to data.table</em> vignette.</p></li>
<li><p>We could have returned the result by using <code>with = FALSE</code> as well.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), <span class="st">"arr_delay"</span>, with =<span class="st"> </span><span class="ot">FALSE</span>]</code></pre></div></li>
+
+<pre><code class="r">flights[.("LGA", "TPA"), "arr_delay", with = FALSE]
+</code></pre></li>
</ul>
-</div>
-</div>
-<div id="b-chaining" class="section level3">
+
<h3>b) Chaining</h3>
-<div id="on-the-result-obtained-above-use-chaining-to-order-the-column-in-decreasing-order." class="section level4">
-<h4>– On the result obtained above, use chaining to order the column in decreasing order.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay)][<span class="kw">order</span>(-arr_delay)]
-<span class="co"># arr_delay</span>
-<span class="co"># 1: 486</span>
-<span class="co"># 2: 380</span>
-<span class="co"># 3: 351</span>
-<span class="co"># 4: 318</span>
-<span class="co"># 5: 300</span>
-<span class="co"># --- </span>
-<span class="co"># 1848: -40</span>
-<span class="co"># 1849: -43</span>
-<span class="co"># 1850: -46</span>
-<span class="co"># 1851: -48</span>
-<span class="co"># 1852: -49</span></code></pre></div>
-</div>
-</div>
-<div id="c-compute-or-do-in-j" class="section level3">
+
+<h4>– On the result obtained above, use chaining to order the column in decreasing order.</h4>
+
+<pre><code class="r">flights[.("LGA", "TPA"), .(arr_delay)][order(-arr_delay)]
+# arr_delay
+# 1: 486
+# 2: 380
+# 3: 351
+# 4: 318
+# 5: 300
+# ---
+# 1848: -40
+# 1849: -43
+# 1850: -46
+# 1851: -48
+# 1852: -49
+</code></pre>
+
<h3>c) Compute or <em>do</em> in <code>j</code></h3>
-<div id="find-the-maximum-arrival-delay-correspondong-to-origin-lga-and-dest-tpa." class="section level4">
-<h4>– Find the maximum arrival delay corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), <span class="kw">max</span>(arr_delay)]
-<span class="co"># [1] 486</span></code></pre></div>
-</div>
-<div id="section-9" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– Find the maximum arrival delay corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4>
+
+<pre><code class="r">flights[.("LGA", "TPA"), max(arr_delay)]
+# [1] 486
+</code></pre>
+
<ul>
 <li>We can verify that the result is identical to the first value (486) from the previous example.</li>
</ul>
-</div>
-</div>
-<div id="d-sub-assign-by-reference-using-in-j" class="section level3">
+
<h3>d) <em>sub-assign</em> by reference using <code>:=</code> in <code>j</code></h3>
-<p>We have seen this example already in the <em>Reference semantics</em> vignette. Let’s take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># get all 'hours' in flights</span>
-flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]
-<span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</span></code></pre></div>
-<p>We see that there are totally <code>25</code> unique values in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let’s go ahead and replace <em>24</em> with <em>0</em>, but this time using <em>key</em>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setkey</span>(flights, hour)
-<span class="kw">key</span>(flights)
-<span class="co"># [1] "hour"</span>
-flights[.(<span class="dv">24</span>), hour :<span class="er">=</span><span class="st"> </span>0L]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 4 15 598 602 DL EWR ATL 104 746 0</span>
-<span class="co"># 2: 2014 5 22 289 267 DL EWR ATL 102 746 0</span>
-<span class="co"># 3: 2014 7 14 277 253 DL EWR ATL 101 746 0</span>
-<span class="co"># 4: 2014 2 14 128 117 EV EWR BDL 27 116 0</span>
-<span class="co"># 5: 2014 6 17 127 119 EV EWR BDL 24 116 0</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 8 3 1 -13 DL JFK SJU 196 1598 0</span>
-<span class="co"># 253313: 2014 10 8 1 1 B6 JFK SJU 199 1598 0</span>
-<span class="co"># 253314: 2014 7 14 211 219 B6 JFK SLC 282 1990 0</span>
-<span class="co"># 253315: 2014 7 3 440 418 FL LGA ATL 107 762 0</span>
-<span class="co"># 253316: 2014 6 13 300 280 DL LGA PBI 140 1035 0</span>
-<span class="kw">key</span>(flights)
-<span class="co"># NULL</span></code></pre></div>
-<div id="section-10" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>We have seen this example already in the <em>Reference semantics</em> vignette. Let's take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p>
+
+<pre><code class="r"># get all 'hours' in flights
+flights[, sort(unique(hour))]
+# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
+</code></pre>
+
+<p>We see that there are <code>25</code> unique values in total. Both <em>0</em> and <em>24</em> hours seem to be present. Let's go ahead and replace <em>24</em> with <em>0</em>, but this time using <em>key</em>.</p>
+
+<pre><code class="r">setkey(flights, hour)
+key(flights)
+# [1] "hour"
+flights[.(24), hour := 0L]
+key(flights)
+# NULL
+</code></pre>
+
<ul>
<li><p>We first set <code>key</code> to <code>hour</code>. This reorders <code>flights</code> by the column <code>hour</code> and marks that column as the <code>key</code> column.</p></li>
<li><p>Now we can subset on <code>hour</code> by using the <code>.()</code> notation. We subset for the value <em>24</em> and obtain the corresponding <em>row indices</em>.</p></li>
<li><p>And on those row indices, we replace the <code>key</code> column with the value <code>0</code>.</p></li>
-<li><p>Since we have replaced values on the <em>key</em> column, the <em>data.table</em> <code>flights</code> isn’t sorted by <code>hour</code> anymore. Therefore, the key has been automatically removed by setting to NULL.</p></li>
+<li><p>Since we have replaced values on the <em>key</em> column, the <em>data.table</em> <code>flights</code> isn't sorted by <code>hour</code> anymore. Therefore, the key has been automatically removed by setting to NULL.</p></li>
</ul>
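The key removal described above can be reproduced on a tiny table. This is only a sketch under assumed data (`hours_toy` is a hypothetical name, not from the vignette):

```r
library(data.table)

# Hypothetical toy table with both 0 and 24 present in 'hour'.
hours_toy <- data.table(id = 1:4, hour = c(0L, 24L, 5L, 24L))
setkey(hours_toy, hour)
key_before <- key(hours_toy)   # "hour"

# Key-based subset in i, sub-assign by reference in j.
hours_toy[.(24L), hour := 0L]

# The key column was modified, so the table may no longer be sorted
# by it; data.table drops the key rather than keep a stale one.
key_after <- key(hours_toy)    # NULL
print(sort(unique(hours_toy$hour)))
```

If the key is still needed afterwards, it has to be set again with `setkey()`, which re-sorts the table.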
-</div>
-</div>
-</div>
-</div>
-<div id="section-11" class="section level1">
-<h1></h1>
-<p>Now, there shouldn’t be any <em>24</em> in the <code>hour</code> column.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]
-<span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23</span></code></pre></div>
-<div id="e-aggregation-using-by" class="section level3">
+
+<p>Now, there shouldn't be any <em>24</em> in the <code>hour</code> column.</p>
+
+<pre><code class="r">flights[, sort(unique(hour))]
+# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+</code></pre>
+
<h3>e) Aggregation using <code>by</code></h3>
-<p>Let’s set the key back to <code>origin, dest</code> first.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setkey</span>(flights, origin, dest)
-<span class="kw">key</span>(flights)
-<span class="co"># [1] "origin" "dest"</span></code></pre></div>
-<div id="get-the-maximum-departure-delay-for-each-month-corresponding-to-origin-jfk.-order-the-result-by-month" class="section level4">
-<h4>– Get the maximum departure delay for each <code>month</code> corresponding to <code>origin = "JFK"</code>. Order the result by <code>month</code></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[<span class="st">"JFK"</span>, <span class="kw">max</span>(dep_delay), keyby =<span class="st"> </span>month]
-<span class="kw">head</span>(ans)
-<span class="co"># month V1</span>
-<span class="co"># 1: 1 881</span>
-<span class="co"># 2: 2 1014</span>
-<span class="co"># 3: 3 920</span>
-<span class="co"># 4: 4 1241</span>
-<span class="co"># 5: 5 853</span>
-<span class="co"># 6: 6 798</span>
-<span class="kw">key</span>(ans)
-<span class="co"># [1] "month"</span></code></pre></div>
-</div>
-<div id="section-12" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>Let's set the key back to <code>origin, dest</code> first.</p>
+
+<pre><code class="r">setkey(flights, origin, dest)
+key(flights)
+# [1] "origin" "dest"
+</code></pre>
+
+<h4>– Get the maximum departure delay for each <code>month</code> corresponding to <code>origin = "JFK"</code>. Order the result by <code>month</code></h4>
+
+<pre><code class="r">ans <- flights["JFK", max(dep_delay), keyby = month]
+head(ans)
+# month V1
+# 1: 1 881
+# 2: 2 1014
+# 3: 3 920
+# 4: 4 1241
+# 5: 5 853
+# 6: 6 798
+key(ans)
+# [1] "month"
+</code></pre>
+
<ul>
-<li><p>We subset on the <code>key</code> column <em>origin</em> to obtain the <em>row indices</em> corresponding to <em>“JFK”</em>.</p></li>
-<li><p>Once we obtain the row indices, we only need two columns - <code>month</code> to group by and <code>dep_delay</code> to obtain <code>max()</code> for each group. <em>data.table’s</em> query optimisation therefore subsets just those two columns corresponding to the <em>row indices</em> obtained in <code>i</code>, for speed and memory efficiency.</p></li>
+<li><p>We subset on the <code>key</code> column <em>origin</em> to obtain the <em>row indices</em> corresponding to <em>“JFK”</em>.</p></li>
+<li><p>Once we obtain the row indices, we only need two columns - <code>month</code> to group by and <code>dep_delay</code> to obtain <code>max()</code> for each group. <em>data.table's</em> query optimisation therefore subsets just those two columns corresponding to the <em>row indices</em> obtained in <code>i</code>, for speed and memory efficiency.</p></li>
<li><p>And on that subset, we group by <em>month</em> and compute <code>max(dep_delay)</code>.</p></li>
<li><p>We use <code>keyby</code> to automatically key that result by <em>month</em>. Now we understand what that means. In addition to ordering, it also sets <em>month</em> as the <code>key</code> column.</p></li>
</ul>
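As a small illustration of the steps listed above, here is a sketch on made-up data (`delays_toy` and its values are hypothetical). It combines a key-based subset in `i` with grouping through `keyby`, and checks that the result is keyed by the grouping column:

```r
library(data.table)

# Hypothetical toy data keyed by origin only.
delays_toy <- data.table(
  origin    = c("JFK", "JFK", "JFK", "LGA"),
  month     = c(2L, 1L, 1L, 1L),
  dep_delay = c(15L, 40L, 5L, 99L),
  key       = "origin"
)

# i: key-based subset for "JFK"; j: max per group; keyby: group and key by month.
ans <- delays_toy["JFK", max(dep_delay), keyby = month]
print(ans)
print(key(ans))
```

Using `by = month` instead would return the groups in order of first appearance and leave the result unkeyed.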
-</div>
-</div>
-<div id="additional-arguments---mult-and-nomatch" class="section level2">
+
<h2>3) Additional arguments - <code>mult</code> and <code>nomatch</code></h2>
-<div id="a-the-mult-argument" class="section level3">
+
<h3>a) The <em>mult</em> argument</h3>
-<p>We can choose, for each query, if <em>“all”</em> the matching rows should be returned, or just the <em>“first”</em> or <em>“last”</em> using the <code>mult</code> argument. The default value is <em>“all”</em> - what we’ve seen so far.</p>
-<div id="subset-only-the-first-matching-row-from-all-rows-where-origin-matches-jfk-and-dest-matches-mia" class="section level4">
-<h4>– Subset only the first matching row from all rows where <code>origin</code> matches <em>“JFK”</em> and <code>dest</code> matches <em>“MIA”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"JFK"</span>, <span class="st">"MIA"</span>), mult =<span class="st"> "first"</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 6 3 AA JFK MIA 157 1089 5</span></code></pre></div>
-</div>
-<div id="subset-only-the-last-matching-row-of-all-the-rows-where-origin-matches-lga-jfk-ewr-and-dest-matches-xna" class="section level4">
-<h4>– Subset only the last matching row of all the rows where <code>origin</code> matches <em>“LGA”, “JFK”, “EWR”</em> and <code>dest</code> matches <em>“XNA”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), mult =<span class="st"> "last"</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18</span>
-<span class="co"># 2: NA NA NA NA NA NA JFK XNA NA NA NA</span>
-<span class="co"># 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12</span></code></pre></div>
-</div>
-<div id="section-13" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>We can choose, for each query, if <em>“all”</em> the matching rows should be returned, or just the <em>“first”</em> or <em>“last”</em> using the <code>mult</code> argument. The default value is <em>“all”</em> - what we've seen so far.</p>
+
+<h4>– Subset only the first matching row from all rows where <code>origin</code> matches <em>“JFK”</em> and <code>dest</code> matches <em>“MIA”</em></h4>
+
+<pre><code class="r">flights[.("JFK", "MIA"), mult = "first"]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 6 3 AA JFK MIA 157 1089 5
+</code></pre>
+
+<h4>– Subset only the last matching row of all the rows where <code>origin</code> matches <em>“LGA”, “JFK”, “EWR”</em> and <code>dest</code> matches <em>“XNA”</em></h4>
+
+<pre><code class="r">flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last"]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18
+# 2: NA NA NA NA NA NA JFK XNA NA NA NA
+# 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12
+</code></pre>
+
<ul>
-<li><p>The query <em>“JFK”, “XNA”</em> doesn’t match any rows in <code>flights</code> and therefore returns <code>NA</code>.</p></li>
-<li><p>Once again, the query for second key column <code>dest</code>, <em>“XNA”</em>, is recycled to fit the length of the query for first key column <code>origin</code>, which is of length 3.</p></li>
+<li><p>The query <em>“JFK”, “XNA”</em> doesn't match any rows in <code>flights</code> and therefore returns <code>NA</code>.</p></li>
+<li><p>Once again, the query for second key column <code>dest</code>, <em>“XNA”</em>, is recycled to fit the length of the query for first key column <code>origin</code>, which is of length 3.</p></li>
</ul>
-</div>
-</div>
-<div id="b-the-nomatch-argument" class="section level3">
+
<h3>b) The <em>nomatch</em> argument</h3>
+
<p>We can choose if queries that do not match should return <code>NA</code> or be skipped altogether using the <code>nomatch</code> argument.</p>
-<div id="from-the-previous-example-subset-all-rows-only-if-theres-a-match" class="section level4">
-<h4>– From the previous example, Subset all rows only if there’s a match</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), mult =<span class="st"> "last"</span>, nomatch =<span class="st"> </span>0L]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18</span>
-<span class="co"># 2: 2014 2 3 231 268 EV EWR XNA 184 1131 12</span></code></pre></div>
-</div>
-<div id="section-14" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– From the previous example, subset all rows only if there's a match</h4>
+
+<pre><code class="r">flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", nomatch = 0L]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18
+# 2: 2014 2 3 231 268 EV EWR XNA 184 1131 12
+</code></pre>
+
<ul>
<li><p>Default value for <code>nomatch</code> is <code>NA</code>. Setting <code>nomatch = 0L</code> skips queries with no matches.</p></li>
 <li><p>The query “JFK”, “XNA” doesn’t match any rows in flights and therefore is skipped.</p></li>
</ul>
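The effect of `mult` and `nomatch` can be seen side by side on a small table. This is a sketch with assumed data (`routes_toy` is a hypothetical name), using the same `nomatch = 0L` spelling as the vignette:

```r
library(data.table)

# Hypothetical toy data: LGA has two XNA flights, EWR one, JFK none.
routes_toy <- data.table(
  origin = c("LGA", "LGA", "EWR", "JFK"),
  dest   = c("XNA", "XNA", "XNA", "MIA"),
  delay  = c(10L, 20L, 30L, 40L),
  key    = c("origin", "dest")
)

query <- c("LGA", "JFK", "EWR")

# Default nomatch: the unmatched (JFK, XNA) query comes back as an NA row.
with_na <- routes_toy[.(query, "XNA"), mult = "last"]

# nomatch = 0L: the unmatched query is dropped from the result.
no_na <- routes_toy[.(query, "XNA"), mult = "last", nomatch = 0L]

print(with_na)
print(no_na)
```

Note that the rows come back in the order of the query (`LGA`, `JFK`, `EWR`), not in key order.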
-</div>
-</div>
-</div>
-<div id="binary-search-vs-vector-scans" class="section level2">
+
<h2>4) binary search vs vector scans</h2>
-<p>We have seen so far how we can set and use keys to subset. But what’s the advantage? For example, instead of doing:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># key by origin,dest columns</span>
-flights[.(<span class="st">"JFK"</span>, <span class="st">"MIA"</span>)]</code></pre></div>
+
+<p>We have seen so far how we can set and use keys to subset. But what's the advantage? For example, instead of doing:</p>
+
+<pre><code class="r"># key by origin,dest columns
+flights[.("JFK", "MIA")]
+</code></pre>
+
<p>we could have done:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[origin ==<span class="st"> "JFK"</span> &<span class="st"> </span>dest ==<span class="st"> "MIA"</span>]</code></pre></div>
+
+<pre><code class="r">flights[origin == "JFK" & dest == "MIA"]
+</code></pre>
+
<p>One advantage very likely is shorter syntax. But even more than that, <em>binary search based subsets</em> are <strong>incredibly fast</strong>.</p>
-<div id="a-performance-of-binary-search-approach" class="section level3">
+
<h3>a) Performance of binary search approach</h3>
-<p>To illustrate, let’s create a sample <em>data.table</em> with 20 million rows and three columns and key it by columns <code>x</code> and <code>y</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(2L)
-N =<span class="st"> </span>2e7L
-DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> <span class="kw">sample</span>(letters, N, <span class="ot">TRUE</span>),
- <span class="dt">y =</span> <span class="kw">sample</span>(1000L, N, <span class="ot">TRUE</span>),
- <span class="dt">val =</span> <span class="kw">runif</span>(N), <span class="dt">key =</span> <span class="kw">c</span>(<span class="st">"x"</span>, <span class="st">"y"</span>))
-<span class="kw">print</span>(<span class="kw">object.size</span>(DT), <span class="dt">units =</span> <span class="st">"Mb"</span>)
-<span class="co"># 381.5 Mb</span>
-
-<span class="kw">key</span>(DT)
-<span class="co"># [1] "x" "y"</span></code></pre></div>
+
+<p>To illustrate, let's create a sample <em>data.table</em> with 20 million rows and three columns and key it by columns <code>x</code> and <code>y</code>.</p>
+
+<pre><code class="r">set.seed(2L)
+N = 2e7L
+DT = data.table(x = sample(letters, N, TRUE),
+ y = sample(1000L, N, TRUE),
+ val = runif(N), key = c("x", "y"))
+print(object.size(DT), units = "Mb")
+# 381.5 Mb
+
+key(DT)
+# [1] "x" "y"
+</code></pre>
+
<p><code>DT</code> is ~380MB. It is not really huge, but this will do to illustrate the point.</p>
+
<p>From what we have seen in the Introduction to data.table section, we can subset those rows where columns <code>x = "g"</code> and <code>y = 877</code> as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## (1) Usual way of subsetting - vector scan approach
-t1 <-<span class="st"> </span><span class="kw">system.time</span>(ans1 <-<span class="st"> </span>DT[x ==<span class="st"> "g"</span> &<span class="st"> </span>y ==<span class="st"> </span>877L])
+
+<pre><code class="r">## (1) Usual way of subsetting - vector scan approach
+t1 <- system.time(ans1 <- DT[x == "g" & y == 877L])
t1
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.124 0.016 0.140</span>
-<span class="kw">head</span>(ans1)
-<span class="co"># x y val</span>
-<span class="co"># 1: g 877 0.3946652</span>
-<span class="co"># 2: g 877 0.9424275</span>
-<span class="co"># 3: g 877 0.7068512</span>
-<span class="co"># 4: g 877 0.6959935</span>
-<span class="co"># 5: g 877 0.9673482</span>
-<span class="co"># 6: g 877 0.4842585</span>
-<span class="kw">dim</span>(ans1)
-<span class="co"># [1] 761 3</span></code></pre></div>
-<p>Now let’s try to subset by using keys.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## (2) Subsetting using keys
-t2 <-<span class="st"> </span><span class="kw">system.time</span>(ans2 <-<span class="st"> </span>DT[.(<span class="st">"g"</span>, 877L)])
+# user system elapsed
+# 0.132 0.020 0.153
+head(ans1)
+# x y val
+# 1: g 877 0.3946652
+# 2: g 877 0.9424275
+# 3: g 877 0.7068512
+# 4: g 877 0.6959935
+# 5: g 877 0.9673482
+# 6: g 877 0.4842585
+dim(ans1)
+# [1] 761 3
+</code></pre>
+
+<p>Now let's try to subset by using keys.</p>
+
+<pre><code class="r">## (2) Subsetting using keys
+t2 <- system.time(ans2 <- DT[.("g", 877L)])
t2
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.000 0.000 0.001</span>
-<span class="kw">head</span>(ans2)
-<span class="co"># x y val</span>
-<span class="co"># 1: g 877 0.3946652</span>
-<span class="co"># 2: g 877 0.9424275</span>
-<span class="co"># 3: g 877 0.7068512</span>
-<span class="co"># 4: g 877 0.6959935</span>
-<span class="co"># 5: g 877 0.9673482</span>
-<span class="co"># 6: g 877 0.4842585</span>
-<span class="kw">dim</span>(ans2)
-<span class="co"># [1] 761 3</span>
-
-<span class="kw">identical</span>(ans1$val, ans2$val)
-<span class="co"># [1] TRUE</span></code></pre></div>
+# user system elapsed
+# 0.000 0.000 0.001
+head(ans2)
+# x y val
+# 1: g 877 0.3946652
+# 2: g 877 0.9424275
+# 3: g 877 0.7068512
+# 4: g 877 0.6959935
+# 5: g 877 0.9673482
+# 6: g 877 0.4842585
+dim(ans2)
+# [1] 761 3
+
+identical(ans1$val, ans2$val)
+# [1] TRUE
+</code></pre>
+
<ul>
-<li>The speedup is <strong>~140x</strong>!</li>
+<li>The speedup is <strong>~153x</strong>!</li>
</ul>
-</div>
-<div id="b-why-does-keying-a-data.table-result-in-blazing-fast-susbets" class="section level3">
+
 <h3>b) Why does keying a <em>data.table</em> result in blazing fast subsets?</h3>
-<p>To understand that, let’s first look at what <em>vector scan approach</em> (method 1) does.</p>
-<div id="vector-scan-approach" class="section level4 bs-callout bs-callout-info">
-<h4>Vector scan approach:</h4>
+
+<p>To understand that, let's first look at what <em>vector scan approach</em> (method 1) does.</p>
+
+<h4>Vector scan approach: {.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>The column <code>x</code> is searched for the value <em>“g”</em> row by row, on all 20 million of them. This results in a <em>logical vector</em> of size 20 million, with values <code>TRUE, FALSE or NA</code> corresponding to <code>x</code>’s value.</p></li>
+<li><p>The column <code>x</code> is searched for the value <em>“g”</em> row by row, on all 20 million of them. This results in a <em>logical vector</em> of size 20 million, with values <code>TRUE, FALSE or NA</code> corresponding to <code>x</code>'s value.</p></li>
<li><p>Similarly, the column <code>y</code> is searched for <code>877</code> on all 20 million rows one by one, and stored in another logical vector.</p></li>
<li><p>Element wise <code>&</code> operations are performed on the intermediate logical vectors and all the rows where the expression evaluates to <code>TRUE</code> are returned.</p></li>
</ul>
+
<p>This is what we call a <em>vector scan approach</em>. And this is quite inefficient, especially on larger tables and when one needs repeated subsetting, because it has to scan through all the rows each time.</p>
-</div>
-</div>
-</div>
-</div>
-<div id="section-15" class="section level1">
-<h1></h1>
-<p>Now let us look at binary search approach (method 2). Recall from <a href="#key-properties">Properties of key</a> - <em>setting keys reorders the data.table by key columns</em>. Since the data is sorted, we don’t have to <em>scan through the entire length of the column</em>! We can instead use <em>binary search</em> to search a value in <code>O(log n)</code> as opposed to <code>O(n)</code> in case of <em>vector scan approach</em>, where <code>n</code> is the number of rows in the <em> [...]
-<div id="binary-search-approach" class="section level4 bs-callout bs-callout-info">
-<h4>Binary search approach:</h4>
-<p>Here’s a very simple illustration. Let’s consider the (sorted) numbers shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="dv">1</span>, <span class="dv">5</span>, <span class="dv">10</span>, <span class="dv">19</span>, <span class="dv">22</span>, <span class="dv">23</span>, <span class="dv">30</span></code></pre></div>
-<p>Suppose we’d like to find the matching position of the value <em>1</em>, using binary search, this is how we would proceed - because we know that the data is <em>sorted</em>.</p>
+
+<p>Now let us look at binary search approach (method 2). Recall from <a href="#key-properties">Properties of key</a> - <em>setting keys reorders the data.table by key columns</em>. Since the data is sorted, we don't have to <em>scan through the entire length of the column</em>! We can instead use <em>binary search</em> to search a value in <code>O(log n)</code> as opposed to <code>O(n)</code> in case of <em>vector scan approach</em>, where <code>n</code> is the number of rows in the [...]
+
+<h4>Binary search approach: {.bs-callout .bs-callout-info}</h4>
+
+<p>Here's a very simple illustration. Let's consider the (sorted) numbers shown below:</p>
+
+<pre><code class="r">1, 5, 10, 19, 22, 23, 30
+</code></pre>
+
+<p>Suppose we'd like to find the matching position of the value <em>1</em>, using binary search, this is how we would proceed - because we know that the data is <em>sorted</em>.</p>
+
<ul>
<li><p>Start with the middle value = 19. Is 1 == 19? No. 1 < 19.</p></li>
-<li><p>Since the value we’re looking for is smaller than 19, it should be somewhere before 19. So we can discard the rest of the half that are >= 19.</p></li>
+<li><p>Since the value we're looking for is smaller than 19, it should be somewhere before 19. So we can discard the half of the values that are >= 19.</p></li>
<li><p>Our set is now reduced to <em>1, 5, 10</em>. Grab the middle value once again = 5. Is 1 == 5? No. 1 < 5.</p></li>
-<li><p>Our set is reduced to <em>1</em>. Is 1 == 1? Yes. The corresponding index is also 1. And that’s the only match.</p></li>
+<li><p>Our set is reduced to <em>1</em>. Is 1 == 1? Yes. The corresponding index is also 1. And that's the only match.</p></li>
</ul>
+
<p>A vector scan approach on the other hand would have to scan through all the values (here, 7).</p>
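The walk-through above can be written as a short base R function. This is an illustrative sketch only: `binary_search` is a hypothetical helper, not a data.table function, and data.table's own implementation is in C and handles multiple key columns and duplicate values.

```r
# Illustrative binary search on a sorted vector.
# Returns the matching index (or NA) and the number of comparisons made.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  steps <- 0L
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L          # middle position of the current range
    steps <- steps + 1L
    if (sorted[mid] == target) {
      return(list(index = mid, steps = steps))
    }
    if (sorted[mid] < target) {
      lo <- mid + 1L                 # discard the lower half
    } else {
      hi <- mid - 1L                 # discard the upper half
    }
  }
  list(index = NA_integer_, steps = steps)
}

x <- c(1, 5, 10, 19, 22, 23, 30)
res <- binary_search(x, 1)
print(res$index)   # position of the value 1
print(res$steps)   # comparisons used: 19, then 5, then 1
```

For the value 1 the function compares against 19, 5 and 1 in turn, which mirrors the three steps listed above, whereas a scan would test all 7 values.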
-</div>
-</div>
-<div id="section-16" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>It can be seen that every comparison halves the number of values left to search. This is why <em>binary search</em> based subsets are <strong>incredibly fast</strong>. Since rows of each column of <em>data.tables</em> have contiguous locations in memory, the operations are performed in a very cache efficient manner, which also contributes to <em>speed</em>.</p>
+
<p>In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a <em>data.table</em>), it is quite <strong>memory efficient</strong> as well.</p>
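<p>To get a feel for the memory argument, the base R sketch below compares the size of a full-length logical vector with the size of the matching positions alone. The vector <code>x</code> and its size are made up for illustration. Note that <code>which()</code> here still builds the logical vector first; the point is only the size difference between the two representations, not how data.table obtains the indices.</p>

```r
set.seed(1)
n = 1e6L
x = sort(sample.int(1000L, n, replace = TRUE))   # a sorted column with n rows

mask = x == 1L       # vector scan: one logical value per row
idx  = which(mask)   # positions of the matching rows only

format(object.size(mask), units = "MB")   # grows with the number of rows
format(object.size(idx),  units = "MB")   # grows with the number of matches
```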
-<div id="summary" class="section level2">
+
<h2>Summary</h2>
+
<p>In this vignette, we have learnt another method to subset rows in <code>i</code> by keying a <em>data.table</em>. Setting keys allows us to perform blazing fast subsets by using <em>binary search</em>. In particular, we have seen how to</p>
-<div id="section-17" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>set key and subset using the key on a <em>data.table</em>.</p></li>
<li><p>subset using keys which fetches <em>row indices</em> in <code>i</code>, but much faster.</p></li>
<li><p>combine key based subsets with <code>j</code> and <code>by</code>. Note that the <code>j</code> and <code>by</code> operations are exactly the same as before.</p></li>
</ul>
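<p>As a compact recap, the three points above might look as follows on a small made-up table. The table <code>DT</code> and its columns are assumptions for illustration only, not part of the <code>flights</code> data.</p>

```r
library(data.table)

DT = data.table(id = c("b", "a", "c", "a"), v = 1:4)

setkey(DT, id)   # reorders DT by 'id' and marks it as the key
key(DT)          # "id"

DT[.("a")]                       # key based subset: rows where id == "a"
DT[.("a"), .(total = sum(v))]    # combine the subset with j
DT[.(c("a", "c")), .(total = sum(v)), by = id]   # ... and with by
```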
-</div>
-</div>
-</div>
-<div id="section-18" class="section level1">
-<h1></h1>
-<p>Key based subsets are <strong>incredibly fast</strong> and are particularly useful when the task involves <em>repeated subsetting</em>. But it may not be always desirable to set key and physically reorder the <em>data.table</em>. In the next vignette, we will address this using a <em>new</em> feature – <em>secondary indexes</em>.</p>
-<hr />
-</div>
-
-
-
-<!-- dynamically load mathjax for compatibility with self-contained -->
-<script>
- (function () {
- var script = document.createElement("script");
- script.type = "text/javascript";
- script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
- document.getElementsByTagName("head")[0].appendChild(script);
- })();
-</script>
+
+<p>#</p>
+
+<p>Key based subsets are <strong>incredibly fast</strong> and are particularly useful when the task involves <em>repeated subsetting</em>. But it may not always be desirable to set a key and physically reorder the <em>data.table</em>. In the next vignette, we will address this using a <em>new</em> feature – <em>secondary indexes</em>.</p>
+
+<hr/>
</body>
+
</html>
diff --git a/inst/doc/datatable-reference-semantics.html b/inst/doc/datatable-reference-semantics.html
index 542ed96..2301781 100644
--- a/inst/doc/datatable-reference-semantics.html
+++ b/inst/doc/datatable-reference-semantics.html
@@ -1,640 +1,664 @@
<!DOCTYPE html>
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
+
+<title>Data {#data}</title>
+
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
-<html xmlns="http://www.w3.org/1999/xhtml">
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
+
+ pre .literal {
+ color: #990073
+ }
+
+ pre .number {
+ color: #099;
+ }
+
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
+
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
+
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
+
+ pre .string {
+ color: #d14;
+ }
+</style>
+
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
-<head>
-<meta charset="utf-8">
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<meta name="generator" content="pandoc" />
-<meta name="viewport" content="width=device-width, initial-scale=1">
+<style type="text/css">
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
-<meta name="date" content="2017-01-31" />
+h1 {
+ font-size:2.2em;
+}
-<title>Reference semantics</title>
+h2 {
+ font-size:1.8em;
+}
+h3 {
+ font-size:1.4em;
+}
+h4 {
+ font-size:1.0em;
+}
-<style type="text/css">code{white-space: pre;}</style>
-<style type="text/css">
-div.sourceCode { overflow-x: auto; }
-table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
- margin: 0; padding: 0; vertical-align: baseline; border: none; }
-table.sourceCode { width: 100%; line-height: 100%; }
-td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
-td.sourceCode { padding-left: 5px; }
-code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
-code > span.dt { color: #902000; } /* DataType */
-code > span.dv { color: #40a070; } /* DecVal */
-code > span.bn { color: #40a070; } /* BaseN */
-code > span.fl { color: #40a070; } /* Float */
-code > span.ch { color: #4070a0; } /* Char */
-code > span.st { color: #4070a0; } /* String */
-code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
-code > span.ot { color: #007020; } /* Other */
-code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
-code > span.fu { color: #06287e; } /* Function */
-code > span.er { color: #ff0000; font-weight: bold; } /* Error */
-code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
-code > span.cn { color: #880000; } /* Constant */
-code > span.sc { color: #4070a0; } /* SpecialChar */
-code > span.vs { color: #4070a0; } /* VerbatimString */
-code > span.ss { color: #bb6688; } /* SpecialString */
-code > span.im { } /* Import */
-code > span.va { color: #19177c; } /* Variable */
-code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
-code > span.op { color: #666666; } /* Operator */
-code > span.bu { } /* BuiltIn */
-code > span.ex { } /* Extension */
-code > span.pp { color: #bc7a00; } /* Preprocessor */
-code > span.at { color: #7d9029; } /* Attribute */
-code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
-code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
-code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
-code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
-</style>
+h5 {
+ font-size:0.9em;
+}
+h6 {
+ font-size:0.8em;
+}
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
-<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
-</head>
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
-<body>
+code[class] {
+ background-color: #F8F8F8;
+}
+table, td, th {
+ border: none;
+}
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+
+ img {
+ max-width: 100% !important;
+ }
+
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
+
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
+</style>
-<h1 class="title toc-ignore">Reference semantics</h1>
-<h4 class="date"><em>2017-01-31</em></h4>
+</head>
-<p>This vignette discusses <em>data.table</em>’s reference semantics which allows to <em>add/update/delete</em> columns of a <em>data.table by reference</em>, and also combine them with <code>i</code> and <code>by</code>. It is aimed at those who are already familiar with <em>data.table</em> syntax, its general form, how to subset rows in <code>i</code>, select and compute on columns, and perform aggregations by group. If you’re not familiar with these concepts, please read the <em>“Intr [...]
-<hr />
-<div id="data" class="section level2">
-<h2>Data</h2>
-<p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)
+<body>
+<p>This vignette discusses <em>data.table</em>'s reference semantics which allows to <em>add/update/delete</em> columns of a <em>data.table by reference</em>, and also combine them with <code>i</code> and <code>by</code>. It is aimed at those who are already familiar with <em>data.table</em> syntax, its general form, how to subset rows in <code>i</code>, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the < [...]
+
+<hr/>
+
+<h2>Data {#data}</h2>
+
+<p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p>
+
+<pre><code class="r">flights <- fread("flights14.csv")
flights
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span>
-<span class="kw">dim</span>(flights)
-<span class="co"># [1] 253316 11</span></code></pre></div>
-</div>
-<div id="introduction" class="section level2">
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# ---
+# 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14
+# 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8
+# 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11
+# 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
+# 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
+dim(flights)
+# [1] 253316 11
+</code></pre>
+
<h2>Introduction</h2>
+
<p>In this vignette, we will</p>
-<ol style="list-style-type: decimal">
+
+<ol>
<li><p>first discuss reference semantics briefly and look at the two different forms in which the <code>:=</code> operator can be used</p></li>
<li><p>then see how we can <em>add/update/delete</em> columns <em>by reference</em> in <code>j</code> using the <code>:=</code> operator and how to combine with <code>i</code> and <code>by</code>.</p></li>
<li><p>and finally we will look at using <code>:=</code> for its <em>side-effect</em> and how we can avoid the side effects using <code>copy()</code>.</p></li>
</ol>
-</div>
-<div id="reference-semantics" class="section level2">
+
<h2>1. Reference semantics</h2>
+
<p>All the operations we have seen so far in the previous vignette resulted in a new data set. We will see how to <em>add</em> new column(s), <em>update</em> or <em>delete</em> existing column(s) on the original data.</p>
-<div id="a-background" class="section level3">
+
<h3>a) Background</h3>
+
<p>Before we look at <em>reference semantics</em>, consider the <em>data.frame</em> shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">ID =</span> <span class="kw">c</span>(<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"a"</span>,<span class="st">"a"</span>,<span class="st">"c"</span>), <span class="dt">a =</span> <span class="dv">1</span>:<span class= [...]
+
+<pre><code class="r">DF = data.frame(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DF
-<span class="co"># ID a b c</span>
-<span class="co"># 1 b 1 7 13</span>
-<span class="co"># 2 b 2 8 14</span>
-<span class="co"># 3 b 3 9 15</span>
-<span class="co"># 4 a 4 10 16</span>
-<span class="co"># 5 a 5 11 17</span>
-<span class="co"># 6 c 6 12 18</span></code></pre></div>
+# ID a b c
+# 1 b 1 7 13
+# 2 b 2 8 14
+# 3 b 3 9 15
+# 4 a 4 10 16
+# 5 a 5 11 17
+# 6 c 6 12 18
+</code></pre>
+
<p>When we did:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DF$c <-<span class="st"> </span><span class="dv">18</span>:<span class="dv">13</span> <span class="co"># (1) -- replace entire column</span>
-<span class="co"># or</span>
-DF$c[DF$ID ==<span class="st"> "b"</span>] <-<span class="st"> </span><span class="dv">15</span>:<span class="dv">13</span> <span class="co"># (2) -- subassign in column 'c'</span></code></pre></div>
+
+<pre><code class="r">DF$c <- 18:13 # (1) -- replace entire column
+# or
+DF$c[DF$ID == "b"] <- 15:13 # (2) -- subassign in column 'c'
+</code></pre>
+
<p>both (1) and (2) resulted in a <a href="http://r.789695.n4.nabble.com/speeding-up-perception-td3640920.html#a3646694"><em>deep</em> copy of the entire <em>data.frame</em></a> in <code>R</code> versions <code>< 3.1</code>. <a href="http://stackoverflow.com/q/23898969/559784">It copied more than once</a>. To improve performance by avoiding these redundant copies, <em>data.table</em> utilised the <a href="http://stackoverflow.com/q/7033106/559784">available but unused <code> [...]
+
<p>Great performance improvements were made in <code>R v3.1</code>, as a result of which only a <em>shallow</em> copy, not a <em>deep</em> copy, is made for (1). For (2), however, the entire column is still <em>deep</em> copied even in <code>R v3.1+</code>. This means the more columns one subassigns to in the <em>same query</em>, the more <em>deep</em> copies R makes.</p>
-<div id="shallow-vs-deep-copy" class="section level4 bs-callout bs-callout-info">
-<h4><em>shallow</em> vs <em>deep</em> copy</h4>
+
+<h4><em>shallow</em> vs <em>deep</em> copy {.bs-callout .bs-callout-info}</h4>
+
<p>A <em>shallow</em> copy is just a copy of the vector of column pointers (corresponding to the columns in a <em>data.frame</em> or <em>data.table</em>). The actual data is not physically copied in memory.</p>
+
<p>A <em>deep</em> copy on the other hand copies the entire data to another location in memory.</p>
-</div>
-</div>
-</div>
-<div id="section" class="section level1">
-<h1></h1>
-<p>With <em>data.table’s</em> <code>:=</code> operator, absolutely no copies are made in <em>both</em> (1) and (2), irrespective of R version you are using. This is because <code>:=</code> operator updates <em>data.table</em> columns <em>in-place</em> (by reference).</p>
-<div id="b-the-operator" class="section level3">
+
+<p>#
+With <em>data.table's</em> <code>:=</code> operator, absolutely no copies are made in <em>both</em> (1) and (2), irrespective of the R version you are using. This is because the <code>:=</code> operator updates <em>data.table</em> columns <em>in-place</em> (by reference).</p>
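<p>One way to observe the in-place update is to give the same <em>data.table</em> a second name before modifying it. The sketch below reuses the values of <code>DF</code> from above as a <em>data.table</em>; the names <code>DT</code> and <code>DT2</code> are illustrative.</p>

```r
library(data.table)

DT = data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DT2 = DT                     # a second name for the same data.table, not a copy

DT[, c := 18:13]             # (1) -- replace entire column, by reference
DT[ID == "b", c := 15:13]    # (2) -- subassign in column 'c', by reference

DT2$c                        # 15 14 13 15 14 13: DT2 sees both updates
```

<p>Because nothing was copied, <code>DT2</code> reflects both assignments even though only <code>DT</code> appears in the queries.</p>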
+
<h3>b) The <code>:=</code> operator</h3>
+
<p>It can be used in <code>j</code> in two ways:</p>
-<ol style="list-style-type: lower-alpha">
-<li><p>The <code>LHS := RHS</code> form</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[, <span class="kw">c</span>(<span class="st">"colA"</span>, <span class="st">"colB"</span>, ...) :<span class="er">=</span><span class="st"> </span><span class="kw">list</span>(valA, valB, ...)]
-
-<span class="co"># when you have only one column to assign to you</span>
-<span class="co"># can drop the quotes and list(), for convenience</span>
-DT[, colA :<span class="er">=</span><span class="st"> </span>valA]</code></pre></div></li>
-<li><p>The functional form</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT[, <span class="st">`</span><span class="dt">:=</span><span class="st">`</span>(<span class="dt">colA =</span> valA, <span class="co"># valA is assigned to colA</span>
- <span class="dt">colB =</span> valB, <span class="co"># valB is assigned to colB</span>
+
+<p>(a) The <code>LHS := RHS</code> form</p>
+
+<pre><code>```r
+DT[, c("colA", "colB", ...) := list(valA, valB, ...)]
+
+# when you have only one column to assign to you
+# can drop the quotes and list(), for convenience
+DT[, colA := valA]
+```
+</code></pre>
+
+<p>(b) The functional form</p>
+
+<pre><code>```r
+DT[, `:=`(colA = valA, # valA is assigned to colA
+ colB = valB, # valB is assigned to colB
...
-)]</code></pre></div></li>
-</ol>
-<div id="section-1" class="section level4 bs-callout bs-callout-warning">
-<h4></h4>
+)]
+```
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-warning}</h4>
+
<p>Note that the code above explains how <code>:=</code> can be used. They are not working examples. We will start using them on <code>flights</code> <em>data.table</em> from the next section.</p>
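<p>For readers who want something runnable right away, here is a minimal sketch of both forms on a made-up table. The table <code>DT</code> and all column names are assumptions for illustration.</p>

```r
library(data.table)

DT = data.table(x = 1:3, y = c(10, 20, 30))

# (a) the LHS := RHS form: character vector of names, list of values
DT[, c("s", "d") := list(x + y, y - x)]

# single column: quotes and list() can be dropped
DT[, p := x * y]

# (b) the functional form, with room for comments
DT[, `:=`(half_y = y / 2,             # half of y
          x_chr  = as.character(x))]  # x as character

names(DT)   # "x" "y" "s" "d" "p" "half_y" "x_chr"
```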
-</div>
-</div>
-</div>
-<div id="section-2" class="section level1">
-<h1></h1>
-<div id="section-3" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>#</p>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>In (a), <code>LHS</code> takes a character vector of column names and <code>RHS</code> a <em>list of values</em>. <code>RHS</code> just needs to be a <code>list</code>, irrespective of how its generated (e.g., using <code>lapply()</code>, <code>list()</code>, <code>mget()</code>, <code>mapply()</code> etc.). This form is usually easy to program with and is particularly useful when you don’t know the columns to assign values to in advance.</p></li>
+<li><p>In (a), <code>LHS</code> takes a character vector of column names and <code>RHS</code> a <em>list of values</em>. <code>RHS</code> just needs to be a <code>list</code>, irrespective of how it's generated (e.g., using <code>lapply()</code>, <code>list()</code>, <code>mget()</code>, <code>mapply()</code> etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance.</p></li>
<li><p>On the other hand, (b) is handy if you would like to jot some comments down for later.</p></li>
<li><p>The result is returned <em>invisibly</em>.</p></li>
<li><p>Since <code>:=</code> is available in <code>j</code>, we can combine it with <code>i</code> and <code>by</code> operations just like the aggregation operations we saw in the previous vignette.</p></li>
</ul>
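<p>The first point above, programming with form (a) when the column names are only known at run time, could look like the sketch below. The table and the names <code>cols</code> and <code>new_cols</code> are hypothetical. Wrapping the variable in parentheses on the <code>LHS</code> tells <code>:=</code> to use the names stored in it rather than create a column literally called <code>new_cols</code>.</p>

```r
library(data.table)

DT = data.table(a = 1:3, b = 4:6)

cols     = c("a", "b")              # columns to transform, known only at run time
new_cols = paste0(cols, "_sq")      # names of the columns to create

# RHS is a list built by lapply(); LHS is a character vector of names
DT[, (new_cols) := lapply(.SD, function(v) v^2), .SDcols = cols]

DT[]   # the result is returned invisibly, so add [] to print it
```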
-</div>
-</div>
-<div id="section-4" class="section level1">
-<h1></h1>
-<p>In the two forms of <code>:=</code> shown above, note that we don’t assign the result back to a variable. Because we don’t need to. The input <em>data.table</em> is modified by reference. Let’s go through examples to understand what we mean by this.</p>
+
+<p>#</p>
+
+<p>In the two forms of <code>:=</code> shown above, note that we don't assign the result back to a variable, because we don't need to. The input <em>data.table</em> is modified by reference. Let's go through examples to understand what we mean by this.</p>
+
<p>For the rest of the vignette, we will work with <code>flights</code> <em>data.table</em>.</p>
-<div id="addupdatedelete-columns-by-reference" class="section level2">
+
<h2>2. Add/update/delete columns <em>by reference</em></h2>
-<div id="ref-j" class="section level3">
-<h3>a) Add columns by reference</h3>
-<div id="how-can-we-add-columns-speed-and-total-delay-of-each-flight-to-flights-data.table" class="section level4">
-<h4>– How can we add columns <em>speed</em> and <em>total delay</em> of each flight to <code>flights</code> <em>data.table</em>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, <span class="st">`</span><span class="dt">:=</span><span class="st">`</span>(<span class="dt">speed =</span> distance /<span class="st"> </span>(air_time/<span class="dv">60</span>), <span class="co"># speed in mph (mi/h)</span>
- <span class="dt">delay =</span> arr_delay +<span class="st"> </span>dep_delay)] <span class="co"># delay in minutes</span>
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span>
-<span class="co"># delay</span>
-<span class="co"># 1: 27</span>
-<span class="co"># 2: 10</span>
-<span class="co"># 3: 11</span>
-<span class="co"># 4: -34</span>
-<span class="co"># 5: 3</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: -29</span>
-<span class="co"># 253313: -19</span>
-<span class="co"># 253314: 8</span>
-<span class="co"># 253315: 11</span>
-<span class="co"># 253316: -4</span>
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed delay</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 27</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 10</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 11</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 -34</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 3</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 4</span>
-
-## alternatively, using the 'LHS := RHS' form
-<span class="co"># flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]</span></code></pre></div>
-</div>
-<div id="note-that" class="section level4 bs-callout bs-callout-info">
-<h4>Note that</h4>
+
+<h3>a) Add columns by reference {#ref-j}</h3>
+
+<h4>– How can we add columns <em>speed</em> and <em>total delay</em> of each flight to <code>flights</code> <em>data.table</em>?</h4>
+
+<pre><code class="r">flights[, `:=`(speed = distance / (air_time/60), # speed in mph (mi/h)
+ delay = arr_delay + dep_delay)] # delay in minutes
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed delay
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 27
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 10
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 11
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 -34
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 3
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 4
+
+## alternatively, using the 'LHS := RHS' form
+# flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]
+</code></pre>
+
+<h4>Note that {.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We did not have to assign the result back to <code>flights</code>.</p></li>
<li><p>The <code>flights</code> <em>data.table</em> now contains the two newly added columns. This is what we mean by <em>added by reference</em>.</p></li>
<li><p>We used the functional form so that we could add comments on the side to explain what the computation does. You can also see the <code>LHS := RHS</code> form (commented).</p></li>
</ul>
-</div>
-</div>
-<div id="ref-i-j" class="section level3">
-<h3>b) Update some rows of columns by reference - <em>sub-assign</em> by reference</h3>
-<p>Let’s take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># get all 'hours' in flights</span>
-flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]
-<span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</span></code></pre></div>
-<p>We see that there are totally <code>25</code> unique values in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let’s go ahead and replace <em>24</em> with <em>0</em>.</p>
-<div id="replace-those-rows-where-hour-24-with-the-value-0" class="section level4">
-<h4>– Replace those rows where <code>hour == 24</code> with the value <code>0</code></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># subassign by reference</span>
-flights[hour ==<span class="st"> </span>24L, hour :<span class="er">=</span><span class="st"> </span>0L]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span>
-<span class="co"># delay</span>
-<span class="co"># 1: 27</span>
-<span class="co"># 2: 10</span>
-<span class="co"># 3: 11</span>
-<span class="co"># 4: -34</span>
-<span class="co"># 5: 3</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: -29</span>
-<span class="co"># 253313: -19</span>
-<span class="co"># 253314: 8</span>
-<span class="co"># 253315: 11</span>
-<span class="co"># 253316: -4</span></code></pre></div>
-</div>
-<div id="section-5" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h3>b) Update some rows of columns by reference - <em>sub-assign</em> by reference {#ref-i-j}</h3>
+
+<p>Let's take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p>
+
+<pre><code class="r"># get all 'hours' in flights
+flights[, sort(unique(hour))]
+# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
+</code></pre>
+
+<p>We see that there are <code>25</code> unique values in total in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let's go ahead and replace <em>24</em> with <em>0</em>.</p>
+
+<h4>– Replace those rows where <code>hour == 24</code> with the value <code>0</code></h4>
+
+<pre><code class="r"># subassign by reference
+flights[hour == 24L, hour := 0L]
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>We can use <code>i</code> along with <code>:=</code> in <code>j</code> the very same way as we have already seen in the <em>“Introduction to data.table”</em> vignette.</p></li>
+<li><p>We can use <code>i</code> along with <code>:=</code> in <code>j</code> the very same way as we have already seen in the <em>“Introduction to data.table”</em> vignette.</p></li>
<li><p>Column <code>hour</code> is replaced with <code>0</code> only on those <em>row indices</em> where the condition <code>hour == 24L</code> specified in <code>i</code> evaluates to <code>TRUE</code>.</p></li>
<li><p><code>:=</code> returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty <code>[]</code> at the end of the query as shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[hour ==<span class="st"> </span>24L, hour :<span class="er">=</span><span class="st"> </span>0L][]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span>
-<span class="co"># delay</span>
-<span class="co"># 1: 27</span>
-<span class="co"># 2: 10</span>
-<span class="co"># 3: 11</span>
-<span class="co"># 4: -34</span>
-<span class="co"># 5: 3</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: -29</span>
-<span class="co"># 253313: -19</span>
-<span class="co"># 253314: 8</span>
-<span class="co"># 253315: 11</span>
-<span class="co"># 253316: -4</span></code></pre></div></li>
+
+<pre><code class="r">flights[hour == 24L, hour := 0L][]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857
+# ---
+# 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866
+# 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444
+# 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663
+# 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000
+# 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545
+# delay
+# 1: 27
+# 2: 10
+# 3: 11
+# 4: -34
+# 5: 3
+# ---
+# 253312: -29
+# 253313: -19
+# 253314: 8
+# 253315: 11
+# 253316: -4
+</code></pre></li>
</ul>
-</div>
-</div>
-</div>
-</div>
-<div id="section-6" class="section level1">
-<h1></h1>
-<p>Let’s look at all the <code>hours</code> to verify.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># check again for '24'</span>
-flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]
-<span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23</span></code></pre></div>
-<div id="update-by-reference-question" class="section level4 bs-callout bs-callout-warning">
-<h4>Exercise:</h4>
+
+<p>#
+Let's look at all the <code>hours</code> to verify.</p>
+
+<pre><code class="r"># check again for '24'
+flights[, sort(unique(hour))]
+# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+</code></pre>
+
+<h4>Exercise: {.bs-callout .bs-callout-warning #update-by-reference-question}</h4>
+
<p>What is the difference between <code>flights[hour == 24L, hour := 0L]</code> and <code>flights[hour == 24L][, hour := 0L]</code>? Hint: The latter needs an assignment (<code><-</code>) if you want to use the result later.</p>
-<p>If you can’t figure it out, have a look at the <code>Note</code> section of <code>?":="</code>.</p>
-</div>
-<div id="c-delete-column-by-reference" class="section level3">
+
+<p>If you can't figure it out, have a look at the <code>Note</code> section of <code>?":="</code>.</p>
+
<h3>c) Delete column by reference</h3>
-<div id="remove-delay-column" class="section level4">
-<h4>– Remove <code>delay</code> column</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, <span class="kw">c</span>(<span class="st">"delay"</span>) :<span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span>
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363</span>
+
+<h4>– Remove <code>delay</code> column</h4>
+
+<pre><code class="r">flights[, c("delay") := NULL]
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363
## or using the functional form
-<span class="co"># flights[, `:=`(delay = NULL)]</span></code></pre></div>
-</div>
-<div id="delete-convenience" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# flights[, `:=`(delay = NULL)]
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info #delete-convenience}</h4>
+
<ul>
<li><p>Assigning <code>NULL</code> to a column <em>deletes</em> that column. And it happens <em>instantly</em>.</p></li>
<li><p>We can also pass column numbers instead of names in the <code>LHS</code>, although it is good programming practice to use column names.</p></li>
<li><p>When there is just one column to delete, we can drop the <code>c()</code> and double quotes and just use the column name <em>unquoted</em>, for convenience. That is:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, delay :<span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]</code></pre></div>
+
+<pre><code class="r">flights[, delay := NULL]
+</code></pre>
+
<p>is equivalent to the code above.</p></li>
</ul>
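<p>The deletion forms described above can be sketched on a small toy table. This is a minimal illustration, assuming a hypothetical <code>DT</code> with made-up column names <code>w</code>, <code>x</code>, <code>y</code> and <code>z</code>; it is not part of the <code>flights</code> example.</p>

```r
library(data.table)

# Hypothetical toy table; the column names are illustrative only.
DT <- data.table(w = 1:3, x = 4:6, y = 7:9, z = 10:12)

# Quoted column name wrapped in c(), as in the LHS := RHS form
DT[, c("w") := NULL]

# Single column: the name can be given unquoted
DT[, x := NULL]

# Functional form
DT[, `:=`(y = NULL)]

# Only the untouched column is left; the rows are unaffected
remaining <- names(DT)
print(remaining)
```

<p>Each call removes its column from the same object in place, so no reassignment of <code>DT</code> is needed between the steps.</p>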
-</div>
-</div>
-<div id="d-along-with-grouping-using-by" class="section level3">
-<h3>d) <code>:=</code> along with grouping using <code id="ref-j-by">by</code></h3>
-<p>We have already seen the use of <code>i</code> along with <code>:=</code> in <a href="#ref-i-j">Section 2b</a>. Let’s now see how we can use <code>:=</code> along with <code>by</code>.</p>
-<div id="how-can-we-add-a-new-column-which-contains-for-each-origdest-pair-the-maximum-speed" class="section level4">
-<h4>– How can we add a new column which contains for each <code>origin, dest</code> pair the maximum speed?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, max_speed :<span class="er">=</span><span class="st"> </span><span class="kw">max</span>(speed), by =<span class="st"> </span>.(origin, dest)]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span>
-<span class="co"># max_speed</span>
-<span class="co"># 1: 526.5957</span>
-<span class="co"># 2: 526.5957</span>
-<span class="co"># 3: 526.5957</span>
-<span class="co"># 4: 517.5000</span>
-<span class="co"># 5: 526.5957</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 508.7425</span>
-<span class="co"># 253313: 538.4615</span>
-<span class="co"># 253314: 445.8621</span>
-<span class="co"># 253315: 456.3636</span>
-<span class="co"># 253316: 434.5055</span>
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507</span></code></pre></div>
-</div>
-<div id="section-7" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h3>d) <code>:=</code> along with grouping using <code>by</code> {#ref-j-by}</h3>
+
+<p>We have already seen the use of <code>i</code> along with <code>:=</code> in <a href="#ref-i-j">Section 2b</a>. Let's now see how we can use <code>:=</code> along with <code>by</code>.</p>
+
+<h4>– How can we add a new column which contains for each <code>origin, dest</code> pair the maximum speed?</h4>
+
+<pre><code class="r">flights[, max_speed := max(speed), by = .(origin, dest)]
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We add a new column <code>max_speed</code> using the <code>:=</code> operator by reference.</p></li>
<li><p>We provide the columns to group by the same way as shown in the <em>Introduction to data.table</em> vignette. For each group, <code>max(speed)</code> is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. <code>flights</code> <em>data.table</em> is modified <em>in-place</em>.</p></li>
<li><p>We could have also provided <code>by</code> with a <em>character vector</em> as we saw in the <em>Introduction to data.table</em> vignette, e.g., <code>by = c("origin", "dest")</code>.</p></li>
</ul>
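<p>To make the recycling behaviour concrete, here is a small sketch on a hypothetical table (the names <code>grp</code>, <code>val</code>, <code>max_val</code> and <code>min_val</code> are assumptions made for illustration). Each group's single summary value is repeated for every row of that group, and the original row order is kept.</p>

```r
library(data.table)

# Hypothetical toy data: two groups of unequal size
DT <- data.table(grp = c("a", "a", "b", "b", "b"),
                 val = c(1, 5, 2, 9, 4))

# One value per group, recycled to the length of each group
DT[, max_val := max(val), by = grp]

# The same idea with `by` given as a character vector
DT[, min_val := min(val), by = c("grp")]

print(DT)
```

<p>Unlike an aggregating query such as <code>DT[, .(max_val = max(val)), by = grp]</code>, which returns one row per group, the <code>:=</code> form keeps all five rows and only adds columns.</p>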
-</div>
-</div>
-</div>
-<div id="section-8" class="section level1">
-<h1></h1>
-<div id="e-multiple-columns-and" class="section level3">
+
+<p>#</p>
+
<h3>e) Multiple columns and <code>:=</code></h3>
-<div id="how-can-we-add-two-more-columns-computing-max-of-dep_delay-and-arr_delay-for-each-month-using-.sd" class="section level4">
-<h4>– How can we add two more columns computing <code>max()</code> of <code>dep_delay</code> and <code>arr_delay</code> for each month, using <code>.SD</code>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">in_cols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"dep_delay"</span>, <span class="st">"arr_delay"</span>)
-out_cols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"max_dep_delay"</span>, <span class="st">"max_arr_delay"</span>)
-flights[, <span class="kw">c</span>(out_cols) :<span class="er">=</span><span class="st"> </span><span class="kw">lapply</span>(.SD, max), by =<span class="st"> </span>month, .SDcols =<span class="st"> </span>in_cols]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span>
-<span class="co"># max_speed max_dep_delay max_arr_delay</span>
-<span class="co"># 1: 526.5957 973 996</span>
-<span class="co"># 2: 526.5957 973 996</span>
-<span class="co"># 3: 526.5957 973 996</span>
-<span class="co"># 4: 517.5000 973 996</span>
-<span class="co"># 5: 526.5957 973 996</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 508.7425 1498 1494</span>
-<span class="co"># 253313: 538.4615 1498 1494</span>
-<span class="co"># 253314: 445.8621 1498 1494</span>
-<span class="co"># 253315: 456.3636 1498 1494</span>
-<span class="co"># 253316: 434.5055 1498 1494</span>
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507</span>
-<span class="co"># max_dep_delay max_arr_delay</span>
-<span class="co"># 1: 973 996</span>
-<span class="co"># 2: 973 996</span>
-<span class="co"># 3: 973 996</span>
-<span class="co"># 4: 973 996</span>
-<span class="co"># 5: 973 996</span>
-<span class="co"># 6: 973 996</span></code></pre></div>
-</div>
-<div id="section-9" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>– How can we add two more columns computing <code>max()</code> of <code>dep_delay</code> and <code>arr_delay</code> for each month, using <code>.SD</code>?</h4>
+
+<pre><code class="r">in_cols = c("dep_delay", "arr_delay")
+out_cols = c("max_dep_delay", "max_arr_delay")
+flights[, c(out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols]
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507
+# max_dep_delay max_arr_delay
+# 1: 973 996
+# 2: 973 996
+# 3: 973 996
+# 4: 973 996
+# 5: 973 996
+# 6: 973 996
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We use the <code>LHS := RHS</code> form. For better readability, we store the input column names and the new column names in separate variables and provide them to <code>.SDcols</code> and to the <code>LHS</code>, respectively.</p></li>
<li><p>Note that since we allow assignment by reference without quoting column names when there is only one column, as explained in <a href="#delete-convenience">Section 2c</a>, we cannot do <code>out_cols := lapply(.SD, max)</code>. That would result in adding one new column named <code>out_cols</code>. Instead we should do either <code>c(out_cols)</code> or simply <code>(out_cols)</code>. Wrapping the variable name in <code>(</code> is enough to differentiate between the two cases.</p></li>
-<li><p>The <code>LHS := RHS</code> form allows us to operate on multiple columns. In the RHS, to compute the <code>max</code> on columns specified in <code>.SDcols</code>, we make use of the base function <code>lapply()</code> along with <code>.SD</code> in the same way as we have seen before in the <em>“Introduction to data.table”</em> vignette. It returns a list of two elements, containing the maximum value corresponding to <code>dep_delay</code> and <code>arr_delay</code> for each gro [...]
+<li><p>The <code>LHS := RHS</code> form allows us to operate on multiple columns. In the RHS, to compute the <code>max</code> on columns specified in <code>.SDcols</code>, we make use of the base function <code>lapply()</code> along with <code>.SD</code> in the same way as we have seen before in the <em>“Introduction to data.table”</em> vignette. It returns a list of two elements, containing the maximum value corresponding to <code>dep_delay</code> and <code>arr_delay</code> [...]
</ul>
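<p>The difference between <code>(out_cols)</code> and a bare <code>out_cols</code> on the left-hand side can be seen in a short sketch. The tables <code>DT</code> and <code>DT2</code> and the variable <code>new_name</code> are hypothetical and exist only for this illustration.</p>

```r
library(data.table)

# Hypothetical toy data with one grouping column and two value columns
DT <- data.table(g = c(1L, 1L, 2L), x = c(3, 7, 2), y = c(10, 4, 8))

in_cols  <- c("x", "y")
out_cols <- c("max_x", "max_y")

# Parentheses make data.table use the *contents* of out_cols as the names
DT[, (out_cols) := lapply(.SD, max), by = g, .SDcols = in_cols]
print(names(DT))

# Pitfall: without parentheses the symbol itself becomes the column name
DT2 <- data.table(x = 1:2)
new_name <- "y"
DT2[, new_name := 0L]
print(names(DT2))
```

<p>In the second table the new column is literally called <code>new_name</code>; the string stored in the variable is not used.</p>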
-</div>
-</div>
-</div>
-<div id="section-10" class="section level1">
-<h1></h1>
-<p>Before moving on to the next section, let’s clean up the newly created columns <code>speed</code>, <code>max_speed</code>, <code>max_dep_delay</code> and <code>max_arr_delay</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># RHS gets automatically recycled to length of LHS</span>
-flights[, <span class="kw">c</span>(<span class="st">"speed"</span>, <span class="st">"max_speed"</span>, <span class="st">"max_dep_delay"</span>, <span class="st">"max_arr_delay"</span>) :<span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span>
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></code></pre></div>
-<div id="and-copy" class="section level2">
+
+<p>#
+Before moving on to the next section, let's clean up the newly created columns <code>speed</code>, <code>max_speed</code>, <code>max_dep_delay</code> and <code>max_arr_delay</code>.</p>
+
+<pre><code class="r"># RHS gets automatically recycled to length of LHS
+flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL]
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
+</code></pre>
+
<h2>3) <code>:=</code> and <code>copy()</code></h2>
+
<p><code>:=</code> modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. At other times it may not be desirable to modify the original object, in which case we can use the <code>copy()</code> function, as we will see in a moment.</p>
-<div id="a-for-its-side-effect" class="section level3">
+
<h3>a) <code>:=</code> for its side effect</h3>
-<p>Let’s say we would like to create a function that would return the <em>maximum speed</em> for each month. But at the same time, we would also like to add the column <code>speed</code> to <em>flights</em>. We could write a simple function as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">foo <-<span class="st"> </span>function(DT) {
- DT[, speed :<span class="er">=</span><span class="st"> </span>distance /<span class="st"> </span>(air_time/<span class="dv">60</span>)]
- DT[, .(<span class="dt">max_speed =</span> <span class="kw">max</span>(speed)), by =<span class="st"> </span>month]
+
+<p>Let's say we would like to create a function that would return the <em>maximum speed</em> for each month. But at the same time, we would also like to add the column <code>speed</code> to <em>flights</em>. We could write a simple function as follows:</p>
+
+<pre><code class="r">foo <- function(DT) {
+ DT[, speed := distance / (air_time/60)]
+ DT[, .(max_speed = max(speed)), by = month]
}
-ans =<span class="st"> </span><span class="kw">foo</span>(flights)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363</span>
-<span class="kw">head</span>(ans)
-<span class="co"># month max_speed</span>
-<span class="co"># 1: 1 535.6425</span>
-<span class="co"># 2: 2 535.6425</span>
-<span class="co"># 3: 3 549.0756</span>
-<span class="co"># 4: 4 585.6000</span>
-<span class="co"># 5: 5 544.2857</span>
-<span class="co"># 6: 6 608.5714</span></code></pre></div>
-<div id="section-11" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+ans = foo(flights)
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363
+head(ans)
+# month max_speed
+# 1: 1 535.6425
+# 2: 2 535.6425
+# 3: 3 549.0756
+# 4: 4 585.6000
+# 5: 5 544.2857
+# 6: 6 608.5714
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Note that the new column <code>speed</code> has been added to the <code>flights</code> <em>data.table</em>. This is because <code>:=</code> performs operations by reference. Since <code>DT</code> (the function argument) and <code>flights</code> refer to the same object in memory, modifying <code>DT</code> is also reflected in <code>flights</code>.</p></li>
<li><p>And <code>ans</code> contains the maximum speed for each month.</p></li>
</ul>
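<p>The same side effect can be reproduced on a tiny table, which makes it easy to check by eye. This is a sketch under assumed names: <code>add_doubled()</code>, <code>small</code>, <code>doubled</code> and <code>total</code> are hypothetical and not part of the vignette's example.</p>

```r
library(data.table)

# Hypothetical helper: adds a column by reference, then returns a summary
add_doubled <- function(DT) {
  DT[, doubled := 2L * x]          # modifies the caller's object
  DT[, .(total = sum(doubled))]    # ordinary query, returns a new data.table
}

small <- data.table(x = 1:3)
ans <- add_doubled(small)

# The caller's table has gained the column as a side effect
print(names(small))
print(ans)
```

<p>The function never assigns to <code>small</code>, yet <code>small</code> ends up with the extra column, because the argument and the caller's variable point at the same object.</p>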
-</div>
-</div>
-<div id="b-the-copy-function" class="section level3">
+
<h3>b) The <code>copy()</code> function</h3>
-<p>In the previous section, we used <code>:=</code> for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a <em>data.table</em> object to a function, and might want to use the <code>:=</code> operator, but <em>wouldn’t</em> want to update the original object. We can accomplish this using the function <code>copy()</code>.</p>
-<div id="section-12" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>In the previous section, we used <code>:=</code> for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a <em>data.table</em> object to a function, and might want to use the <code>:=</code> operator, but <em>wouldn't</em> want to update the original object. We can accomplish this using the function <code>copy()</code>.</p>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<p>The <code>copy()</code> function <em>deep</em> copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.</p>
-</div>
-</div>
-</div>
-</div>
-<div id="section-13" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>There are two particular places where the <code>copy()</code> function is essential:</p>
-<ol style="list-style-type: decimal">
-<li><p>Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified <em>by reference</em>. As an example, let’s consider the task in the previous section, except we don’t want to modify <code>flights</code> by reference.</p>
-<p>Let’s first delete the <code>speed</code> column we generated in the previous section.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, speed :<span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span></code></pre></div>
+
+<ol>
+<li><p>Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified <em>by reference</em>. As an example, let's consider the task in the previous section, except we don't want to modify <code>flights</code> by reference.</p>
+
+<p>Let's first delete the <code>speed</code> column we generated in the previous section.</p>
+
+<pre><code class="r">flights[, speed := NULL]
+</code></pre>
+
<p>Now, we could accomplish the task as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">foo <-<span class="st"> </span>function(DT) {
- DT <-<span class="st"> </span><span class="kw">copy</span>(DT) ## deep copy
- DT[, speed :<span class="er">=</span><span class="st"> </span>distance /<span class="st"> </span>(air_time/<span class="dv">60</span>)] ## doesn't affect 'flights'
- DT[, .(<span class="dt">max_speed =</span> <span class="kw">max</span>(speed)), by =<span class="st"> </span>month]
+
+<pre><code class="r">foo <- function(DT) {
+ DT <- copy(DT) ## deep copy
+ DT[, speed := distance / (air_time/60)] ## doesn't affect 'flights'
+ DT[, .(max_speed = max(speed)), by = month]
}
-ans <-<span class="st"> </span><span class="kw">foo</span>(flights)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span>
-<span class="kw">head</span>(ans)
-<span class="co"># month max_speed</span>
-<span class="co"># 1: 1 535.6425</span>
-<span class="co"># 2: 2 535.6425</span>
-<span class="co"># 3: 3 549.0756</span>
-<span class="co"># 4: 4 585.6000</span>
-<span class="co"># 5: 5 544.2857</span>
-<span class="co"># 6: 6 608.5714</span></code></pre></div></li>
+ans <- foo(flights)
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
+head(ans)
+# month max_speed
+# 1: 1 535.6425
+# 2: 2 535.6425
+# 3: 3 549.0756
+# 4: 4 585.6000
+# 5: 5 544.2857
+# 6: 6 608.5714
+</code></pre></li>
</ol>
-<div id="section-14" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p>Using <code>copy()</code> function did not update <code>flights</code> <em>data.table</em> by reference. It doesn’t contain the column <code>speed</code>.</p></li>
+<li><p>Using <code>copy()</code> function did not update <code>flights</code> <em>data.table</em> by reference. It doesn't contain the column <code>speed</code>.</p></li>
<li><p>And <code>ans</code> contains the maximum speed corresponding to each month.</p></li>
</ul>
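<p>As a compact check of the same idea, the sketch below uses a hypothetical helper <code>add_doubled_safe()</code> on a made-up table <code>small</code>. The only assumption beyond the text above is the toy data and the names; <code>copy()</code> is used exactly as in the function shown earlier.</p>

```r
library(data.table)

# Hypothetical helper: works on a deep copy, so the caller's table is untouched
add_doubled_safe <- function(DT) {
  DT <- copy(DT)                   # deep copy; DT now refers to a new object
  DT[, doubled := 2L * x]          # only the copy is modified
  DT[, .(total = sum(doubled))]
}

small <- data.table(x = 1:3)
ans <- add_doubled_safe(small)

# The original still has only its original column
print(names(small))
print(ans)
```

<p>The trade-off is memory and time: the deep copy duplicates every column, which is why the text goes on to mention shallow copying as a possible improvement.</p>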
+
<p>However, we could improve this functionality further by <em>shallow</em> copying instead of <em>deep</em> copying. In fact, we would very much like to <a href="https://github.com/Rdatatable/data.table/issues/617">provide this functionality for <code>v1.9.8</code></a>. We will touch upon this again in the <em>data.table design</em> vignette.</p>
-</div>
-</div>
-<div id="section-15" class="section level1">
-<h1></h1>
-<ol start="2" style="list-style-type: decimal">
+
+<p>#</p>
+
+<ol>
<li><p>When we store the column names in a variable, e.g., <code>DT_n = names(DT)</code>, and then <em>add/update/delete</em> column(s) <em>by reference</em>, <code>DT_n</code> is modified as well, unless we do <code>copy(names(DT))</code>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> 1L, <span class="dt">y =</span> 2L)
-DT_n =<span class="st"> </span><span class="kw">names</span>(DT)
+
+<pre><code class="r">DT = data.table(x = 1L, y = 2L)
+DT_n = names(DT)
DT_n
-<span class="co"># [1] "x" "y"</span>
+# [1] "x" "y"
## add a new column by reference
-DT[, z :<span class="er">=</span><span class="st"> </span>3L]
-<span class="co"># x y z</span>
-<span class="co"># 1: 1 2 3</span>
+DT[, z := 3L]
## DT_n also gets updated
DT_n
-<span class="co"># [1] "x" "y" "z"</span>
+# [1] "x" "y" "z"
## use `copy()`
-DT_n =<span class="st"> </span><span class="kw">copy</span>(<span class="kw">names</span>(DT))
-DT[, w :<span class="er">=</span><span class="st"> </span>4L]
-<span class="co"># x y z w</span>
-<span class="co"># 1: 1 2 3 4</span>
+DT_n = copy(names(DT))
+DT[, w := 4L]
-## DT_n doesn't get updated
+## DT_n doesn't get updated
DT_n
-<span class="co"># [1] "x" "y" "z"</span></code></pre></div></li>
+# [1] "x" "y" "z"
+</code></pre></li>
</ol>
-<div id="summary" class="section level2">
+
<h2>Summary</h2>
-<div id="the-operator" class="section level4 bs-callout bs-callout-info">
-<h4>The <code>:=</code> operator</h4>
+
+<h4>The <code>:=</code> operator {.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>It is used to <em>add/update/delete</em> columns by reference.</p></li>
<li><p>We have also seen how to use <code>:=</code> along with <code>i</code> and <code>by</code>, in the same way as in the <em>Introduction to data.table</em> vignette. Likewise, we can use <code>keyby</code>, chain operations together, and pass expressions to <code>by</code>. The syntax is <em>consistent</em>.</p></li>
<li><p>We can use <code>:=</code> for its side effect or use <code>copy()</code> to not modify the original object while updating by reference.</p></li>
</ul>
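<p>The points above can be sketched on a small table. This is a minimal illustration only: the table and the names <code>DT2</code>, <code>g</code>, <code>v</code> and <code>total</code> are hypothetical and do not come from the vignette.</p>

```r
library(data.table)

# Hypothetical toy data, not taken from the vignette
DT = data.table(g = c("a", "a", "b"), v = c(1L, 2L, 3L))

# Work on a deep copy so that the original object is left untouched
DT2 = copy(DT)

# := combined with i: update only the rows where g == "b"
DT2[g == "b", v := 10L]

# := combined with by: add a per-group total
DT2[, total := sum(v), by = g]

# Assigning NULL deletes a column by reference
DT2[, v := NULL]

names(DT)   # column names of the original
names(DT2)  # column names of the modified copy
```

<p>Because every change is applied to the result of <code>copy()</code>, the side effects of <code>:=</code> stay confined to <code>DT2</code>.</p>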
-</div>
-</div>
-</div>
-<div id="section-16" class="section level1">
-<h1></h1>
-<p>So far we have seen a whole lot in <code>j</code>, and how to combine it with <code>by</code> and little of <code>i</code>. Let’s turn our attention back to <code>i</code> in the next vignette <em>“Keys and fast binary search based subset”</em> to perform <em>blazing fast subsets</em> by <em>keying data.tables</em>.</p>
-<hr />
-</div>
-
-
-
-<!-- dynamically load mathjax for compatibility with self-contained -->
-<script>
- (function () {
- var script = document.createElement("script");
- script.type = "text/javascript";
- script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
- document.getElementsByTagName("head")[0].appendChild(script);
- })();
-</script>
+
+<p>#</p>
+
+<p>So far we have seen a whole lot in <code>j</code>, and how to combine it with <code>by</code> and little of <code>i</code>. Let's turn our attention back to <code>i</code> in the next vignette <em>“Keys and fast binary search based subset”</em> to perform <em>blazing fast subsets</em> by <em>keying data.tables</em>.</p>
+
+<hr/>
</body>
+
</html>
diff --git a/inst/doc/datatable-reshape.html b/inst/doc/datatable-reshape.html
index e747b5c..f94c458 100644
--- a/inst/doc/datatable-reshape.html
+++ b/inst/doc/datatable-reshape.html
@@ -1,450 +1,560 @@
<!DOCTYPE html>
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
-<html xmlns="http://www.w3.org/1999/xhtml">
+<title>Data</title>
-<head>
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
+
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
+
+ pre .literal {
+ color: #990073
+ }
+
+ pre .number {
+ color: #099;
+ }
-<meta charset="utf-8">
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<meta name="generator" content="pandoc" />
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
-<meta name="viewport" content="width=device-width, initial-scale=1">
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
-<meta name="date" content="2017-01-31" />
+ pre .string {
+ color: #d14;
+ }
+</style>
-<title>Efficient reshaping using data.tables</title>
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
-<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
-div.sourceCode { overflow-x: auto; }
-table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
- margin: 0; padding: 0; vertical-align: baseline; border: none; }
-table.sourceCode { width: 100%; line-height: 100%; }
-td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
-td.sourceCode { padding-left: 5px; }
-code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
-code > span.dt { color: #902000; } /* DataType */
-code > span.dv { color: #40a070; } /* DecVal */
-code > span.bn { color: #40a070; } /* BaseN */
-code > span.fl { color: #40a070; } /* Float */
-code > span.ch { color: #4070a0; } /* Char */
-code > span.st { color: #4070a0; } /* String */
-code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
-code > span.ot { color: #007020; } /* Other */
-code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
-code > span.fu { color: #06287e; } /* Function */
-code > span.er { color: #ff0000; font-weight: bold; } /* Error */
-code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
-code > span.cn { color: #880000; } /* Constant */
-code > span.sc { color: #4070a0; } /* SpecialChar */
-code > span.vs { color: #4070a0; } /* VerbatimString */
-code > span.ss { color: #bb6688; } /* SpecialString */
-code > span.im { } /* Import */
-code > span.va { color: #19177c; } /* Variable */
-code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
-code > span.op { color: #666666; } /* Operator */
-code > span.bu { } /* BuiltIn */
-code > span.ex { } /* Extension */
-code > span.pp { color: #bc7a00; } /* Preprocessor */
-code > span.at { color: #7d9029; } /* Attribute */
-code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
-code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
-code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
-code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
-</style>
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
-<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
+h1 {
+ font-size:2.2em;
+}
-</head>
+h2 {
+ font-size:1.8em;
+}
-<body>
+h3 {
+ font-size:1.4em;
+}
+
+h4 {
+ font-size:1.0em;
+}
+
+h5 {
+ font-size:0.9em;
+}
+
+h6 {
+ font-size:0.8em;
+}
+
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
+
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
+
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
+
+code[class] {
+ background-color: #F8F8F8;
+}
+
+table, td, th {
+ border: none;
+}
+
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+ img {
+ max-width: 100% !important;
+ }
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
-<h1 class="title toc-ignore">Efficient reshaping using data.tables</h1>
-<h4 class="date"><em>2017-01-31</em></h4>
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
+</style>
+</head>
+
+<body>
<p>This vignette discusses the default usage of reshaping functions <code>melt</code> (wide to long) and <code>dcast</code> (long to wide) for <em>data.tables</em> as well as the <strong>new extended functionalities</strong> of melting and casting on <em>multiple columns</em> available from <code>v1.9.6</code>.</p>
-<hr />
-<div id="data" class="section level2">
+
+<hr/>
+
<h2>Data</h2>
+
<p>We will load the data sets directly within sections.</p>
-</div>
-<div id="introduction" class="section level2">
+
<h2>Introduction</h2>
+
<p>The <code>melt</code> and <code>dcast</code> functions for <em>data.tables</em> are extensions of the corresponding functions from the <a href="https://cran.r-project.org/package=reshape2">reshape2</a> package.</p>
+
<p>In this vignette, we will</p>
-<ol style="list-style-type: decimal">
+
+<ol>
<li><p>first briefly look at the default <em>melting</em> and <em>casting</em> of <em>data.tables</em> to convert them from <em>wide</em> to <em>long</em> format and vice versa,</p></li>
<li><p>then look at scenarios where the current functionalities become cumbersome and inefficient,</p></li>
<li><p>and finally look at the new improvements to both <code>melt</code> and <code>dcast</code> methods for <em>data.tables</em> to handle multiple columns simultaneously.</p></li>
</ol>
-<p>The extended functionalities are in line with <em>data.table’s</em> philosophy of performing operations efficiently and in a straightforward manner.</p>
-<div id="note" class="section level4 bs-callout bs-callout-info">
-<h4>Note:</h4>
-<p>From <code>v1.9.6</code> on, you don’t have to load <code>reshape2</code> package to use these functions for <em>data.tables</em>. You just need to load <code>data.table</code>. If you’ve to load <code>reshape2</code> for melting or casting matrices and/or data.frames, then make sure to load it <em>before</em> loading <code>data.table</code>.</p>
-</div>
-</div>
-<div id="default-functionality" class="section level2">
+
+<p>The extended functionalities are in line with <em>data.table's</em> philosophy of performing operations efficiently and in a straightforward manner.</p>
+
+<h4>Note: {.bs-callout .bs-callout-info}</h4>
+
+<p>From <code>v1.9.6</code> on, you don't have to load <code>reshape2</code> package to use these functions for <em>data.tables</em>. You just need to load <code>data.table</code>. If you've to load <code>reshape2</code> for melting or casting matrices and/or data.frames, then make sure to load it <em>before</em> loading <code>data.table</code>.</p>
+
<h2>1. Default functionality</h2>
-<div id="a-melting-data.tables-wide-to-long" class="section level3">
+
<h3>a) <code>melt</code>ing <em>data.tables</em> (wide to long)</h3>
+
<p>Suppose we have a <code>data.table</code> (artificial data) as shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">fread</span>(<span class="st">"melt_default.csv"</span>)
+
+<pre><code class="r">DT = fread("melt_default.csv")
DT
-<span class="co"># family_id age_mother dob_child1 dob_child2 dob_child3</span>
-<span class="co"># 1: 1 30 1998-11-26 2000-01-29 NA</span>
-<span class="co"># 2: 2 27 1996-06-22 NA NA</span>
-<span class="co"># 3: 3 26 2002-07-11 2004-04-05 2007-09-02</span>
-<span class="co"># 4: 4 32 2004-10-10 2009-08-27 2012-07-21</span>
-<span class="co"># 5: 5 29 2000-12-05 2005-02-28 NA</span>
+# family_id age_mother dob_child1 dob_child2 dob_child3
+# 1: 1 30 1998-11-26 2000-01-29 NA
+# 2: 2 27 1996-06-22 NA NA
+# 3: 3 26 2002-07-11 2004-04-05 2007-09-02
+# 4: 4 32 2004-10-10 2009-08-27 2012-07-21
+# 5: 5 29 2000-12-05 2005-02-28 NA
## dob stands for date of birth.
-<span class="kw">str</span>(DT)
-<span class="co"># Classes 'data.table' and 'data.frame': 5 obs. of 5 variables:</span>
-<span class="co"># $ family_id : int 1 2 3 4 5</span>
-<span class="co"># $ age_mother: int 30 27 26 32 29</span>
-<span class="co"># $ dob_child1: chr "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...</span>
-<span class="co"># $ dob_child2: chr "2000-01-29" NA "2004-04-05" "2009-08-27" ...</span>
-<span class="co"># $ dob_child3: chr NA NA "2007-09-02" "2012-07-21" ...</span>
-<span class="co"># - attr(*, ".internal.selfref")=<externalptr></span></code></pre></div>
-</div>
-</div>
-<div id="section" class="section level1">
-<h1></h1>
-<div id="convert-dt-to-long-form-where-each-dob-is-a-separate-observation." class="section level4">
+str(DT)
+# Classes 'data.table' and 'data.frame': 5 obs. of 5 variables:
+# $ family_id : int 1 2 3 4 5
+# $ age_mother: int 30 27 26 32 29
+# $ dob_child1: chr "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
+# $ dob_child2: chr "2000-01-29" NA "2004-04-05" "2009-08-27" ...
+# $ dob_child3: chr NA NA "2007-09-02" "2012-07-21" ...
+# - attr(*, ".internal.selfref")=<externalptr>
+</code></pre>
+
+<p>#</p>
+
<h4>- Convert <code>DT</code> to <em>long</em> form where each <code>dob</code> is a separate observation.</h4>
+
<p>We could accomplish this using <code>melt()</code> by specifying <code>id.vars</code> and <code>measure.vars</code> arguments as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT.m1 =<span class="st"> </span><span class="kw">melt</span>(DT, <span class="dt">id.vars =</span> <span class="kw">c</span>(<span class="st">"family_id"</span>, <span class="st">"age_mother"</span>),
- <span class="dt">measure.vars =</span> <span class="kw">c</span>(<span class="st">"dob_child1"</span>, <span class="st">"dob_child2"</span>, <span class="st">"dob_child3"</span>))
+
+<pre><code class="r">DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"),
+ measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
DT.m1
-<span class="co"># family_id age_mother variable value</span>
-<span class="co"># 1: 1 30 dob_child1 1998-11-26</span>
-<span class="co"># 2: 2 27 dob_child1 1996-06-22</span>
-<span class="co"># 3: 3 26 dob_child1 2002-07-11</span>
-<span class="co"># 4: 4 32 dob_child1 2004-10-10</span>
-<span class="co"># 5: 5 29 dob_child1 2000-12-05</span>
-<span class="co"># 6: 1 30 dob_child2 2000-01-29</span>
-<span class="co"># 7: 2 27 dob_child2 NA</span>
-<span class="co"># 8: 3 26 dob_child2 2004-04-05</span>
-<span class="co"># 9: 4 32 dob_child2 2009-08-27</span>
-<span class="co"># 10: 5 29 dob_child2 2005-02-28</span>
-<span class="co"># 11: 1 30 dob_child3 NA</span>
-<span class="co"># 12: 2 27 dob_child3 NA</span>
-<span class="co"># 13: 3 26 dob_child3 2007-09-02</span>
-<span class="co"># 14: 4 32 dob_child3 2012-07-21</span>
-<span class="co"># 15: 5 29 dob_child3 NA</span>
-<span class="kw">str</span>(DT.m1)
-<span class="co"># Classes 'data.table' and 'data.frame': 15 obs. of 4 variables:</span>
-<span class="co"># $ family_id : int 1 2 3 4 5 1 2 3 4 5 ...</span>
-<span class="co"># $ age_mother: int 30 27 26 32 29 30 27 26 32 29 ...</span>
-<span class="co"># $ variable : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ...</span>
-<span class="co"># $ value : chr "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...</span>
-<span class="co"># - attr(*, ".internal.selfref")=<externalptr></span></code></pre></div>
-</div>
-<div id="section-1" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# family_id age_mother variable value
+# 1: 1 30 dob_child1 1998-11-26
+# 2: 2 27 dob_child1 1996-06-22
+# 3: 3 26 dob_child1 2002-07-11
+# 4: 4 32 dob_child1 2004-10-10
+# 5: 5 29 dob_child1 2000-12-05
+# 6: 1 30 dob_child2 2000-01-29
+# 7: 2 27 dob_child2 NA
+# 8: 3 26 dob_child2 2004-04-05
+# 9: 4 32 dob_child2 2009-08-27
+# 10: 5 29 dob_child2 2005-02-28
+# 11: 1 30 dob_child3 NA
+# 12: 2 27 dob_child3 NA
+# 13: 3 26 dob_child3 2007-09-02
+# 14: 4 32 dob_child3 2012-07-21
+# 15: 5 29 dob_child3 NA
+str(DT.m1)
+# Classes 'data.table' and 'data.frame': 15 obs. of 4 variables:
+# $ family_id : int 1 2 3 4 5 1 2 3 4 5 ...
+# $ age_mother: int 30 27 26 32 29 30 27 26 32 29 ...
+# $ variable : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ...
+# $ value : chr "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
+# - attr(*, ".internal.selfref")=<externalptr>
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p><code>measure.vars</code> specifies the set of columns we would like to collapse (or combine) together.</p></li>
<li><p>We can also specify column <em>indices</em> instead of <em>names</em>.</p></li>
-<li><p>By default, <code>variable</code> column is of type <code>factor</code>. Set <code>variable.factor</code> argument to <code>FALSE</code> if you’d like to return a <em>character</em> vector instead. <code>variable.factor</code> argument is only available in <code>melt</code> from <code>data.table</code> and not in the <a href="https://github.com/hadley/reshape"><code>reshape2</code> package</a>.</p></li>
+<li><p>By default, <code>variable</code> column is of type <code>factor</code>. Set <code>variable.factor</code> argument to <code>FALSE</code> if you'd like to return a <em>character</em> vector instead. <code>variable.factor</code> argument is only available in <code>melt</code> from <code>data.table</code> and not in the <a href="https://github.com/hadley/reshape"><code>reshape2</code> package</a>.</p></li>
<li><p>By default, the molten columns are automatically named <code>variable</code> and <code>value</code>.</p></li>
<li><p><code>melt</code> preserves column attributes in result.</p></li>
</ul>
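<p>As a rough sketch of two of the points above (column <em>indices</em> and <code>variable.factor</code>), assuming a small stand-in table built inline rather than read from <code>melt_default.csv</code>; the name <code>DT.idx</code> is hypothetical:</p>

```r
library(data.table)

# Stand-in for melt_default.csv: values copied from the first two rows shown above
DT = data.table(family_id  = 1:2,
                age_mother = c(30L, 27L),
                dob_child1 = c("1998-11-26", "1996-06-22"),
                dob_child2 = c("2000-01-29", NA),
                dob_child3 = c(NA_character_, NA_character_))

# Columns are given by position instead of by name;
# variable.factor = FALSE returns `variable` as character rather than factor
DT.idx = melt(DT, id.vars = 1:2, measure.vars = 3:5, variable.factor = FALSE)
str(DT.idx)
```

<p>Positions are convenient interactively, but names are more robust if the column order of the input may change.</p>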
-</div>
-</div>
-<div id="section-2" class="section level1">
-<h1></h1>
-<div id="name-the-variable-and-value-columns-to-child-and-dob-respectively" class="section level4">
+
+<p>#</p>
+
<h4>- Name the <code>variable</code> and <code>value</code> columns to <code>child</code> and <code>dob</code> respectively</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT.m1 =<span class="st"> </span><span class="kw">melt</span>(DT, <span class="dt">measure.vars =</span> <span class="kw">c</span>(<span class="st">"dob_child1"</span>, <span class="st">"dob_child2"</span>, <span class="st">"dob_child3"</span>),
- <span class="dt">variable.name =</span> <span class="st">"child"</span>, <span class="dt">value.name =</span> <span class="st">"dob"</span>)
+
+<pre><code class="r">DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
+ variable.name = "child", value.name = "dob")
DT.m1
-<span class="co"># family_id age_mother child dob</span>
-<span class="co"># 1: 1 30 dob_child1 1998-11-26</span>
-<span class="co"># 2: 2 27 dob_child1 1996-06-22</span>
-<span class="co"># 3: 3 26 dob_child1 2002-07-11</span>
-<span class="co"># 4: 4 32 dob_child1 2004-10-10</span>
-<span class="co"># 5: 5 29 dob_child1 2000-12-05</span>
-<span class="co"># 6: 1 30 dob_child2 2000-01-29</span>
-<span class="co"># 7: 2 27 dob_child2 NA</span>
-<span class="co"># 8: 3 26 dob_child2 2004-04-05</span>
-<span class="co"># 9: 4 32 dob_child2 2009-08-27</span>
-<span class="co"># 10: 5 29 dob_child2 2005-02-28</span>
-<span class="co"># 11: 1 30 dob_child3 NA</span>
-<span class="co"># 12: 2 27 dob_child3 NA</span>
-<span class="co"># 13: 3 26 dob_child3 2007-09-02</span>
-<span class="co"># 14: 4 32 dob_child3 2012-07-21</span>
-<span class="co"># 15: 5 29 dob_child3 NA</span></code></pre></div>
-</div>
-<div id="section-3" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# family_id age_mother child dob
+# 1: 1 30 dob_child1 1998-11-26
+# 2: 2 27 dob_child1 1996-06-22
+# 3: 3 26 dob_child1 2002-07-11
+# 4: 4 32 dob_child1 2004-10-10
+# 5: 5 29 dob_child1 2000-12-05
+# 6: 1 30 dob_child2 2000-01-29
+# 7: 2 27 dob_child2 NA
+# 8: 3 26 dob_child2 2004-04-05
+# 9: 4 32 dob_child2 2009-08-27
+# 10: 5 29 dob_child2 2005-02-28
+# 11: 1 30 dob_child3 NA
+# 12: 2 27 dob_child3 NA
+# 13: 3 26 dob_child3 2007-09-02
+# 14: 4 32 dob_child3 2012-07-21
+# 15: 5 29 dob_child3 NA
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>By default, when one of <code>id.vars</code> or <code>measure.vars</code> is missing, the rest of the columns are <em>automatically assigned</em> to the missing argument.</p></li>
<li><p>When neither <code>id.vars</code> nor <code>measure.vars</code> is specified, as mentioned under <code>?melt</code>, all columns that are not <code>numeric</code>, <code>integer</code>, or <code>logical</code> will be assigned to <code>id.vars</code>.</p>
+
<p>In addition, a warning message is issued highlighting the columns that are automatically considered to be <code>id.vars</code>.</p></li>
</ul>
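<p>The complementary case, where only <code>id.vars</code> is supplied, might look like the sketch below. The inline table is a cut-down stand-in for <code>melt_default.csv</code> and <code>DT.auto</code> is a hypothetical name.</p>

```r
library(data.table)

# Cut-down stand-in for melt_default.csv (two families, two dob columns)
DT = data.table(family_id  = 1:2,
                age_mother = c(30L, 27L),
                dob_child1 = c("1998-11-26", "1996-06-22"),
                dob_child2 = c("2000-01-29", NA))

# measure.vars is missing, so all columns not listed in id.vars are melted
DT.auto = melt(DT, id.vars = c("family_id", "age_mother"))
levels(DT.auto$variable)
```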
-</div>
-<div id="b-casting-data.tables-long-to-wide" class="section level3">
+
<h3>b) <code>Cast</code>ing <em>data.tables</em> (long to wide)</h3>
-<p>In the previous section, we saw how to get from wide form to long form. Let’s see the reverse operation in this section.</p>
-<div id="how-can-we-get-back-to-the-original-data-table-dt-from-dt.m" class="section level4">
+
+<p>In the previous section, we saw how to get from wide form to long form. Let's see the reverse operation in this section.</p>
+
<h4>- How can we get back to the original data table <code>DT</code> from <code>DT.m1</code>?</h4>
-<p>That is, we’d like to collect all <em>child</em> observations corresponding to each <code>family_id, age_mother</code> together under the same row. We can accomplish it using <code>dcast</code> as follows:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dcast</span>(DT.m1, family_id +<span class="st"> </span>age_mother ~<span class="st"> </span>child, <span class="dt">value.var =</span> <span class="st">"dob"</span>)
-<span class="co"># family_id age_mother dob_child1 dob_child2 dob_child3</span>
-<span class="co"># 1: 1 30 1998-11-26 2000-01-29 NA</span>
-<span class="co"># 2: 2 27 1996-06-22 NA NA</span>
-<span class="co"># 3: 3 26 2002-07-11 2004-04-05 2007-09-02</span>
-<span class="co"># 4: 4 32 2004-10-10 2009-08-27 2012-07-21</span>
-<span class="co"># 5: 5 29 2000-12-05 2005-02-28 NA</span></code></pre></div>
-</div>
-<div id="section-4" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+
+<p>That is, we'd like to collect all <em>child</em> observations corresponding to each <code>family_id, age_mother</code> together under the same row. We can accomplish it using <code>dcast</code> as follows:</p>
+
+<pre><code class="r">dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
+# family_id age_mother dob_child1 dob_child2 dob_child3
+# 1: 1 30 1998-11-26 2000-01-29 NA
+# 2: 2 27 1996-06-22 NA NA
+# 3: 3 26 2002-07-11 2004-04-05 2007-09-02
+# 4: 4 32 2004-10-10 2009-08-27 2012-07-21
+# 5: 5 29 2000-12-05 2005-02-28 NA
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
-<li><p><code>dcast</code> uses <em>formula</em> interface. The variables on the <em>LHS</em> of formula represents the <em>id</em> vars and <em>RHS</em> the <em>measure</em> vars.</p></li>
+<li><p><code>dcast</code> uses <em>formula</em> interface. The variables on the <em>LHS</em> of formula represents the <em>id</em> vars and <em>RHS</em> the <em>measure</em> vars.</p></li>
<li><p><code>value.var</code> denotes the column whose values fill the cells when casting to wide format.</p></li>
<li><p><code>dcast</code> also tries to preserve attributes in result wherever possible.</p></li>
</ul>
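<p>One way to convince yourself that <code>dcast</code> reverses <code>melt</code> is a round trip on a small table. The sketch below assumes an inline stand-in for <code>melt_default.csv</code>; <code>DT.long</code> and <code>DT.wide</code> are hypothetical names.</p>

```r
library(data.table)

# Cut-down stand-in for melt_default.csv
DT = data.table(family_id  = 1:2,
                age_mother = c(30L, 27L),
                dob_child1 = c("1998-11-26", "1996-06-22"),
                dob_child2 = c("2000-01-29", NA))

# Wide to long: one row per family and child
DT.long = melt(DT, id.vars = c("family_id", "age_mother"),
               variable.name = "child", value.name = "dob")

# Long to wide: LHS of the formula holds the id columns, RHS the column to spread
DT.wide = dcast(DT.long, family_id + age_mother ~ child, value.var = "dob")

# Compare values only; attributes such as the key set by dcast are ignored
all.equal(DT, DT.wide, check.attributes = FALSE)
```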
-</div>
-</div>
-</div>
-<div id="section-5" class="section level1">
-<h1></h1>
-<div id="starting-from-dt.m-how-can-we-get-the-number-of-children-in-each-family" class="section level4">
+
+<p>#</p>
+
<h4>- Starting from <code>DT.m1</code>, how can we get the number of children in each family?</h4>
+
<p>You can also pass a function to aggregate by in <code>dcast</code> with the argument <code>fun.aggregate</code>. This is essential when the formula provided does not identify a single observation for each cell.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dcast</span>(DT.m1, family_id ~<span class="st"> </span>., <span class="dt">fun.agg =</span> function(x) <span class="kw">sum</span>(!<span class="kw">is.na</span>(x)), <span class="dt">value.var =</span> <span class="st">"dob"</span>)
-<span class="co"># family_id .</span>
-<span class="co"># 1: 1 2</span>
-<span class="co"># 2: 2 1</span>
-<span class="co"># 3: 3 3</span>
-<span class="co"># 4: 4 3</span>
-<span class="co"># 5: 5 2</span></code></pre></div>
+
+<pre><code class="r">dcast(DT.m1, family_id ~ ., fun.agg = function(x) sum(!is.na(x)), value.var = "dob")
+# family_id .
+# 1: 1 2
+# 2: 2 1
+# 3: 3 3
+# 4: 4 3
+# 5: 5 2
+</code></pre>
+
<p>Check <code>?dcast</code> for other useful arguments and additional examples.</p>
-</div>
-<div id="limitations-in-current-meltdcast-approaches" class="section level2">
+
<h2>2. Limitations in current <code>melt/dcast</code> approaches</h2>
-<p>So far we’ve seen features of <code>melt</code> and <code>dcast</code> that are based on <code>reshape2</code> package, but implemented efficiently for <em>data.table</em>s, using internal <code>data.table</code> machinery (<em>fast radix ordering</em>, <em>binary search</em> etc..).</p>
+
+<p>So far we've seen features of <code>melt</code> and <code>dcast</code> that are based on the <code>reshape2</code> package, but implemented efficiently for <em>data.table</em>s, using internal <code>data.table</code> machinery (<em>fast radix ordering</em>, <em>binary search</em>, etc.).</p>
+
<p>However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the <em>data.table</em> shown below:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT =<span class="st"> </span><span class="kw">fread</span>(<span class="st">"melt_enhanced.csv"</span>)
+
+<pre><code class="r">DT = fread("melt_enhanced.csv")
DT
-<span class="co"># family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3</span>
-<span class="co"># 1: 1 30 1998-11-26 2000-01-29 NA 1 2 NA</span>
-<span class="co"># 2: 2 27 1996-06-22 NA NA 2 NA NA</span>
-<span class="co"># 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1</span>
-<span class="co"># 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1</span>
-<span class="co"># 5: 5 29 2000-12-05 2005-02-28 NA 2 1 NA</span>
-## 1 = female, 2 = male</code></pre></div>
-<p>And you’d like to combine (melt) all the <code>dob</code> columns together, and <code>gender</code> columns together. Using the current functionality, we can do something like this:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT.m1 =<span class="st"> </span><span class="kw">melt</span>(DT, <span class="dt">id =</span> <span class="kw">c</span>(<span class="st">"family_id"</span>, <span class="st">"age_mother"</span>))
-<span class="co"># Warning in melt.data.table(DT, id = c("family_id", "age_mother")): 'measure.vars' [dob_child1,</span>
-<span class="co"># dob_child2, dob_child3, gender_child1, ...] are not all of the same type. By order of hierarchy, the</span>
-<span class="co"># molten data value column will be of type 'character'. All measure variables not of type 'character'</span>
-<span class="co"># will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.</span>
-DT.m1[, <span class="kw">c</span>(<span class="st">"variable"</span>, <span class="st">"child"</span>) :<span class="er">=</span><span class="st"> </span><span class="kw">tstrsplit</span>(variable, <span class="st">"_"</span>, <span class="dt">fixed =</span> <span class="ot">TRUE</span>)]
-<span class="co"># family_id age_mother variable value child</span>
-<span class="co"># 1: 1 30 dob 1998-11-26 child1</span>
-<span class="co"># 2: 2 27 dob 1996-06-22 child1</span>
-<span class="co"># 3: 3 26 dob 2002-07-11 child1</span>
-<span class="co"># 4: 4 32 dob 2004-10-10 child1</span>
-<span class="co"># 5: 5 29 dob 2000-12-05 child1</span>
-<span class="co"># 6: 1 30 dob 2000-01-29 child2</span>
-<span class="co"># 7: 2 27 dob NA child2</span>
-<span class="co"># 8: 3 26 dob 2004-04-05 child2</span>
-<span class="co"># 9: 4 32 dob 2009-08-27 child2</span>
-<span class="co"># 10: 5 29 dob 2005-02-28 child2</span>
-<span class="co"># 11: 1 30 dob NA child3</span>
-<span class="co"># 12: 2 27 dob NA child3</span>
-<span class="co"># 13: 3 26 dob 2007-09-02 child3</span>
-<span class="co"># 14: 4 32 dob 2012-07-21 child3</span>
-<span class="co"># 15: 5 29 dob NA child3</span>
-<span class="co"># 16: 1 30 gender 1 child1</span>
-<span class="co"># 17: 2 27 gender 2 child1</span>
-<span class="co"># 18: 3 26 gender 2 child1</span>
-<span class="co"># 19: 4 32 gender 1 child1</span>
-<span class="co"># 20: 5 29 gender 2 child1</span>
-<span class="co"># 21: 1 30 gender 2 child2</span>
-<span class="co"># 22: 2 27 gender NA child2</span>
-<span class="co"># 23: 3 26 gender 2 child2</span>
-<span class="co"># 24: 4 32 gender 1 child2</span>
-<span class="co"># 25: 5 29 gender 1 child2</span>
-<span class="co"># 26: 1 30 gender NA child3</span>
-<span class="co"># 27: 2 27 gender NA child3</span>
-<span class="co"># 28: 3 26 gender 1 child3</span>
-<span class="co"># 29: 4 32 gender 1 child3</span>
-<span class="co"># 30: 5 29 gender NA child3</span>
-<span class="co"># family_id age_mother variable value child</span>
-DT.c1 =<span class="st"> </span><span class="kw">dcast</span>(DT.m1, family_id +<span class="st"> </span>age_mother +<span class="st"> </span>child ~<span class="st"> </span>variable, <span class="dt">value.var =</span> <span class="st">"value"</span>)
+# family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
+# 1: 1 30 1998-11-26 2000-01-29 NA 1 2 NA
+# 2: 2 27 1996-06-22 NA NA 2 NA NA
+# 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
+# 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
+# 5: 5 29 2000-12-05 2005-02-28 NA 2 1 NA
+## 1 = female, 2 = male
+</code></pre>
+
+<p>And you'd like to combine (melt) all the <code>dob</code> columns together, and <code>gender</code> columns together. Using the current functionality, we can do something like this:</p>
+
+<pre><code class="r">DT.m1 = melt(DT, id = c("family_id", "age_mother"))
+# Warning in melt.data.table(DT, id = c("family_id", "age_mother")): 'measure.vars' [dob_child1,
+# dob_child2, dob_child3, gender_child1, ...] are not all of the same type. By order of hierarchy, the
+# molten data value column will be of type 'character'. All measure variables not of type 'character'
+# will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.
+DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
+DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1
-<span class="co"># family_id age_mother child dob gender</span>
-<span class="co"># 1: 1 30 child1 1998-11-26 1</span>
-<span class="co"># 2: 1 30 child2 2000-01-29 2</span>
-<span class="co"># 3: 1 30 child3 NA NA</span>
-<span class="co"># 4: 2 27 child1 1996-06-22 2</span>
-<span class="co"># 5: 2 27 child2 NA NA</span>
-<span class="co"># 6: 2 27 child3 NA NA</span>
-<span class="co"># 7: 3 26 child1 2002-07-11 2</span>
-<span class="co"># 8: 3 26 child2 2004-04-05 2</span>
-<span class="co"># 9: 3 26 child3 2007-09-02 1</span>
-<span class="co"># 10: 4 32 child1 2004-10-10 1</span>
-<span class="co"># 11: 4 32 child2 2009-08-27 1</span>
-<span class="co"># 12: 4 32 child3 2012-07-21 1</span>
-<span class="co"># 13: 5 29 child1 2000-12-05 2</span>
-<span class="co"># 14: 5 29 child2 2005-02-28 1</span>
-<span class="co"># 15: 5 29 child3 NA NA</span>
-
-<span class="kw">str</span>(DT.c1) ## gender column is character type now!
-<span class="co"># Classes 'data.table' and 'data.frame': 15 obs. of 5 variables:</span>
-<span class="co"># $ family_id : int 1 1 1 2 2 2 3 3 3 4 ...</span>
-<span class="co"># $ age_mother: int 30 30 30 27 27 27 26 26 26 32 ...</span>
-<span class="co"># $ child : chr "child1" "child2" "child3" "child1" ...</span>
-<span class="co"># $ dob : chr "1998-11-26" "2000-01-29" NA "1996-06-22" ...</span>
-<span class="co"># $ gender : chr "1" "2" NA "2" ...</span>
-<span class="co"># - attr(*, ".internal.selfref")=<externalptr> </span>
-<span class="co"># - attr(*, "sorted")= chr "family_id" "age_mother" "child"</span></code></pre></div>
-<div id="issues" class="section level4 bs-callout bs-callout-info">
-<h4>Issues</h4>
-<ol style="list-style-type: decimal">
-<li><p>What we wanted to do was to combine all the <code>dob</code> and <code>gender</code> type columns together respectively. Instead we are combining <em>everything</em> together, and then splitting them again. I think it’s easy to see that it’s quite roundabout (and inefficient).</p>
-<p>As an analogy, imagine you’ve a closet with four shelves of clothes and you’d like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!</p></li>
+# family_id age_mother child dob gender
+# 1: 1 30 child1 1998-11-26 1
+# 2: 1 30 child2 2000-01-29 2
+# 3: 1 30 child3 NA NA
+# 4: 2 27 child1 1996-06-22 2
+# 5: 2 27 child2 NA NA
+# 6: 2 27 child3 NA NA
+# 7: 3 26 child1 2002-07-11 2
+# 8: 3 26 child2 2004-04-05 2
+# 9: 3 26 child3 2007-09-02 1
+# 10: 4 32 child1 2004-10-10 1
+# 11: 4 32 child2 2009-08-27 1
+# 12: 4 32 child3 2012-07-21 1
+# 13: 5 29 child1 2000-12-05 2
+# 14: 5 29 child2 2005-02-28 1
+# 15: 5 29 child3 NA NA
+
+str(DT.c1) ## gender column is character type now!
+# Classes 'data.table' and 'data.frame': 15 obs. of 5 variables:
+# $ family_id : int 1 1 1 2 2 2 3 3 3 4 ...
+# $ age_mother: int 30 30 30 27 27 27 26 26 26 32 ...
+# $ child : chr "child1" "child2" "child3" "child1" ...
+# $ dob : chr "1998-11-26" "2000-01-29" NA "1996-06-22" ...
+# $ gender : chr "1" "2" NA "2" ...
+# - attr(*, ".internal.selfref")=<externalptr>
+# - attr(*, "sorted")= chr "family_id" "age_mother" "child"
+</code></pre>
+
+<h4>Issues {.bs-callout .bs-callout-info}</h4>
+
+<ol>
+<li><p>What we wanted to do was to combine all the <code>dob</code> and <code>gender</code> type columns together respectively. Instead we are combining <em>everything</em> together, and then splitting them again. I think it's easy to see that it's quite roundabout (and inefficient).</p>
+
+<p>As an analogy, imagine you've a closet with four shelves of clothes and you'd like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!</p></li>
<li><p>The columns to <em>melt</em> may be of different types, as in this case (character and integer types). By <em>melting</em> them all together, the columns will be coerced in the result, as explained by the warning message above and shown in the output of <code>str(DT.c1)</code>, where <code>gender</code> has been converted to <em>character</em> type.</p></li>
<li><p>We are generating an additional column by splitting the <code>variable</code> column into two columns, whose purpose is quite cryptic. We do it because we need it for <em>casting</em> in the next step.</p></li>
-<li><p>Finally, we cast the data set. But the issue is it’s a much more computationally involved operation than <em>melt</em>. Specifically, it requires computing the order of the variables in formula, and that’s costly.</p></li>
+<li><p>Finally, we cast the data set. But the issue is it's a much more computationally involved operation than <em>melt</em>. Specifically, it requires computing the order of the variables in formula, and that's costly.</p></li>
</ol>
-</div>
-</div>
-</div>
-<div id="section-6" class="section level1">
-<h1></h1>
+
+<p>#</p>
+
<p>In fact, <code>base::reshape</code> is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!</p>
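+
+<p>A minimal sketch of such a <code>base::reshape</code> call, assuming a hypothetical two-family table <code>df</code> (not the vignette's <code>DT</code>); the names <code>df</code> and <code>long</code> are illustrative only:</p>
+
+<pre><code class="r">## hypothetical toy data: two children per family
+df = data.frame(
+  family_id     = 1:2,
+  dob_child1    = c("1998-11-26", "1996-06-22"),
+  dob_child2    = c("2000-01-29", NA),
+  gender_child1 = c(1L, 2L),
+  gender_child2 = c(2L, NA),
+  stringsAsFactors = FALSE
+)
+
+## each element of 'varying' lists the wide columns that feed one long column
+long = reshape(df, direction = "long",
+  varying = list(c("dob_child1", "dob_child2"),
+                 c("gender_child1", "gender_child2")),
+  v.names = c("dob", "gender"),
+  timevar = "child", times = 1:2,
+  idvar   = "family_id")
+
+str(long)   ## 'gender' should remain integer, since it is never mixed with 'dob'
+</code></pre>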
-<div id="enhanced-new-functionality" class="section level2">
+
<h2>3. Enhanced (new) functionality</h2>
-<div id="a-enhanced-melt" class="section level3">
+
<h3>a) Enhanced <code>melt</code></h3>
-<p>Since we’d like for <em>data.tables</em> to perform this operation straightforward and efficient using the same interface, we went ahead and implemented an <em>additional functionality</em>, where we can <code>melt</code> to multiple columns <em>simultaneously</em>.</p>
-<div id="melt-multiple-columns-simultaneously" class="section level4">
+
+<p>Since we'd like <em>data.tables</em> to perform this operation in a straightforward and efficient manner using the same interface, we went ahead and implemented <em>additional functionality</em> that lets us <code>melt</code> to multiple columns <em>simultaneously</em>.</p>
+
<h4>- <code>melt</code> multiple columns simultaneously</h4>
+
<p>The idea is quite simple. We pass a list of columns to <code>measure.vars</code>, where each element of the list contains the columns that should be combined together.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">colA =<span class="st"> </span><span class="kw">paste</span>(<span class="st">"dob_child"</span>, <span class="dv">1</span>:<span class="dv">3</span>, <span class="dt">sep =</span> <span class="st">""</span>)
-colB =<span class="st"> </span><span class="kw">paste</span>(<span class="st">"gender_child"</span>, <span class="dv">1</span>:<span class="dv">3</span>, <span class="dt">sep =</span> <span class="st">""</span>)
-DT.m2 =<span class="st"> </span><span class="kw">melt</span>(DT, <span class="dt">measure =</span> <span class="kw">list</span>(colA, colB), <span class="dt">value.name =</span> <span class="kw">c</span>(<span class="st">"dob"</span>, <span class="st">"gender"</span>))
+
+<pre><code class="r">colA = paste("dob_child", 1:3, sep = "")
+colB = paste("gender_child", 1:3, sep = "")
+DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
DT.m2
-<span class="co"># family_id age_mother variable dob gender</span>
-<span class="co"># 1: 1 30 1 1998-11-26 1</span>
-<span class="co"># 2: 2 27 1 1996-06-22 2</span>
-<span class="co"># 3: 3 26 1 2002-07-11 2</span>
-<span class="co"># 4: 4 32 1 2004-10-10 1</span>
-<span class="co"># 5: 5 29 1 2000-12-05 2</span>
-<span class="co"># 6: 1 30 2 2000-01-29 2</span>
-<span class="co"># 7: 2 27 2 NA NA</span>
-<span class="co"># 8: 3 26 2 2004-04-05 2</span>
-<span class="co"># 9: 4 32 2 2009-08-27 1</span>
-<span class="co"># 10: 5 29 2 2005-02-28 1</span>
-<span class="co"># 11: 1 30 3 NA NA</span>
-<span class="co"># 12: 2 27 3 NA NA</span>
-<span class="co"># 13: 3 26 3 2007-09-02 1</span>
-<span class="co"># 14: 4 32 3 2012-07-21 1</span>
-<span class="co"># 15: 5 29 3 NA NA</span>
-
-<span class="kw">str</span>(DT.m2) ## col type is preserved
-<span class="co"># Classes 'data.table' and 'data.frame': 15 obs. of 5 variables:</span>
-<span class="co"># $ family_id : int 1 2 3 4 5 1 2 3 4 5 ...</span>
-<span class="co"># $ age_mother: int 30 27 26 32 29 30 27 26 32 29 ...</span>
-<span class="co"># $ variable : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...</span>
-<span class="co"># $ dob : chr "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...</span>
-<span class="co"># $ gender : int 1 2 2 1 2 2 NA 2 1 1 ...</span>
-<span class="co"># - attr(*, ".internal.selfref")=<externalptr></span></code></pre></div>
-</div>
-<div id="using-patterns" class="section level4">
+# family_id age_mother variable dob gender
+# 1: 1 30 1 1998-11-26 1
+# 2: 2 27 1 1996-06-22 2
+# 3: 3 26 1 2002-07-11 2
+# 4: 4 32 1 2004-10-10 1
+# 5: 5 29 1 2000-12-05 2
+# 6: 1 30 2 2000-01-29 2
+# 7: 2 27 2 NA NA
+# 8: 3 26 2 2004-04-05 2
+# 9: 4 32 2 2009-08-27 1
+# 10: 5 29 2 2005-02-28 1
+# 11: 1 30 3 NA NA
+# 12: 2 27 3 NA NA
+# 13: 3 26 3 2007-09-02 1
+# 14: 4 32 3 2012-07-21 1
+# 15: 5 29 3 NA NA
+
+str(DT.m2) ## col type is preserved
+# Classes 'data.table' and 'data.frame': 15 obs. of 5 variables:
+# $ family_id : int 1 2 3 4 5 1 2 3 4 5 ...
+# $ age_mother: int 30 27 26 32 29 30 27 26 32 29 ...
+# $ variable : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
+# $ dob : chr "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
+# $ gender : int 1 2 2 1 2 2 NA 2 1 1 ...
+# - attr(*, ".internal.selfref")=<externalptr>
+</code></pre>
+
<h4>- Using <code>patterns()</code></h4>
-<p>Usually in these problems, the columns we’d like to melt can be distinguished by a common pattern. We can use the function <code>patterns()</code>, implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">DT.m2 =<span class="st"> </span><span class="kw">melt</span>(DT, <span class="dt">measure =</span> <span class="kw">patterns</span>(<span class="st">"^dob"</span>, <span class="st">"^gender"</span>), <span class="dt">value.name =</span> <span class="kw">c</span>(<span class="st">"dob"</span>, <span class="st">"gender"</span>))
+
+<p>Usually in these problems, the columns we'd like to melt can be distinguished by a common pattern. We can use the function <code>patterns()</code>, implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:</p>
+
+<pre><code class="r">DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
-<span class="co"># family_id age_mother variable dob gender</span>
-<span class="co"># 1: 1 30 1 1998-11-26 1</span>
-<span class="co"># 2: 2 27 1 1996-06-22 2</span>
-<span class="co"># 3: 3 26 1 2002-07-11 2</span>
-<span class="co"># 4: 4 32 1 2004-10-10 1</span>
-<span class="co"># 5: 5 29 1 2000-12-05 2</span>
-<span class="co"># 6: 1 30 2 2000-01-29 2</span>
-<span class="co"># 7: 2 27 2 NA NA</span>
-<span class="co"># 8: 3 26 2 2004-04-05 2</span>
-<span class="co"># 9: 4 32 2 2009-08-27 1</span>
-<span class="co"># 10: 5 29 2 2005-02-28 1</span>
-<span class="co"># 11: 1 30 3 NA NA</span>
-<span class="co"># 12: 2 27 3 NA NA</span>
-<span class="co"># 13: 3 26 3 2007-09-02 1</span>
-<span class="co"># 14: 4 32 3 2012-07-21 1</span>
-<span class="co"># 15: 5 29 3 NA NA</span></code></pre></div>
-<p>That’s it!</p>
-</div>
-<div id="section-7" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# family_id age_mother variable dob gender
+# 1: 1 30 1 1998-11-26 1
+# 2: 2 27 1 1996-06-22 2
+# 3: 3 26 1 2002-07-11 2
+# 4: 4 32 1 2004-10-10 1
+# 5: 5 29 1 2000-12-05 2
+# 6: 1 30 2 2000-01-29 2
+# 7: 2 27 2 NA NA
+# 8: 3 26 2 2004-04-05 2
+# 9: 4 32 2 2009-08-27 1
+# 10: 5 29 2 2005-02-28 1
+# 11: 1 30 3 NA NA
+# 12: 2 27 3 NA NA
+# 13: 3 26 3 2007-09-02 1
+# 14: 4 32 3 2012-07-21 1
+# 15: 5 29 3 NA NA
+</code></pre>
+
+<p>That's it!</p>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>We can remove the <code>variable</code> column if necessary.</p></li>
<li><p>The functionality is implemented entirely in C, and is therefore both <em>fast</em> and <em>memory efficient</em> in addition to being <em>straightforward</em>.</p></li>
</ul>
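+
+<p>As a rough illustration of the first point, the <code>variable</code> column can be dropped by reference after melting. This sketch assumes a hypothetical two-family table; <code>DT.small</code> and <code>DT.m</code> are made-up names, not objects from the vignette:</p>
+
+<pre><code class="r">library(data.table)
+
+## hypothetical toy data, same column layout as the vignette's DT
+DT.small = data.table(
+  family_id     = 1:2,
+  dob_child1    = c("1998-11-26", "1996-06-22"),
+  dob_child2    = c("2000-01-29", NA),
+  gender_child1 = c(1L, 2L),
+  gender_child2 = c(2L, NA)
+)
+
+DT.m = melt(DT.small, measure = patterns("^dob", "^gender"),
+            value.name = c("dob", "gender"))
+
+## remove the 'variable' column by reference (no copy of DT.m is made)
+DT.m[, variable := NULL]
+names(DT.m)
+</code></pre>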
-</div>
-</div>
-<div id="b-enhanced-dcast" class="section level3">
+
<h3>b) Enhanced <code>dcast</code></h3>
+
<p>Okay great! We can now melt into multiple columns simultaneously. Now given the data set <code>DT.m2</code> as shown above, how can we get back to the same format as the original data we started with?</p>
-<p>If we use the current functionality of <code>dcast</code>, then we’d have to cast twice and bind the results together. But that’s once again verbose, not straightforward and is also inefficient.</p>
-<div id="casting-multiple-value.vars-simultaneously" class="section level4">
+
+<p>If we use the current functionality of <code>dcast</code>, then we'd have to cast twice and bind the results together. But that's once again verbose, not straightforward and is also inefficient.</p>
+
<h4>- Casting multiple <code>value.var</code>s simultaneously</h4>
+
<p>We can now provide <strong>multiple <code>value.var</code> columns</strong> to <code>dcast</code> for <em>data.tables</em> directly so that the operations are taken care of internally and efficiently.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## new 'cast' functionality - multiple value.vars
-DT.c2 =<span class="st"> </span><span class="kw">dcast</span>(DT.m2, family_id +<span class="st"> </span>age_mother ~<span class="st"> </span>variable, <span class="dt">value.var =</span> <span class="kw">c</span>(<span class="st">"dob"</span>, <span class="st">"gender"</span>))
+
+<pre><code class="r">## new 'cast' functionality - multiple value.vars
+DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
-<span class="co"># family_id age_mother dob_1 dob_2 dob_3 gender_1 gender_2 gender_3</span>
-<span class="co"># 1: 1 30 1998-11-26 2000-01-29 NA 1 2 NA</span>
-<span class="co"># 2: 2 27 1996-06-22 NA NA 2 NA NA</span>
-<span class="co"># 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1</span>
-<span class="co"># 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1</span>
-<span class="co"># 5: 5 29 2000-12-05 2005-02-28 NA 2 1 NA</span></code></pre></div>
-</div>
-<div id="section-8" class="section level4 bs-callout bs-callout-info">
-<h4></h4>
+# family_id age_mother dob_1 dob_2 dob_3 gender_1 gender_2 gender_3
+# 1: 1 30 1998-11-26 2000-01-29 NA 1 2 NA
+# 2: 2 27 1996-06-22 NA NA 2 NA NA
+# 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
+# 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
+# 5: 5 29 2000-12-05 2005-02-28 NA 2 1 NA
+</code></pre>
+
+<h4>{.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>Attributes are preserved in result wherever possible.</p></li>
<li><p>Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.</p></li>
</ul>
-</div>
-</div>
-</div>
-</div>
-<div id="section-9" class="section level1">
-<h1></h1>
-<div id="multiple-functions-to-fun.aggregate" class="section level4 bs-callout bs-callout-info">
-<h4>Multiple functions to <code>fun.aggregate</code>:</h4>
+
+<p>#</p>
+
+<h4>Multiple functions to <code>fun.aggregate</code>: {.bs-callout .bs-callout-info}</h4>
+
<p>You can also provide <em>multiple functions</em> to the <code>fun.aggregate</code> argument of <code>dcast</code> for <em>data.tables</em>. Check the examples in <code>?dcast</code>, which illustrate this functionality.</p>
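+
+<p>A small sketch of what this might look like, assuming a hypothetical table <code>DT.agg</code> with an id column, a grouping column and one numeric value column (all names here are illustrative):</p>
+
+<pre><code class="r">library(data.table)
+
+## hypothetical toy data: several values per (id, grp) cell
+DT.agg = data.table(
+  id  = c(1L, 1L, 2L, 2L, 2L),
+  grp = c("a", "a", "a", "b", "b"),
+  v   = c(1, 2, 3, 4, 5)
+)
+
+## one output column per (function, grp) combination
+res = dcast(DT.agg, id ~ grp, value.var = "v",
+            fun.aggregate = list(sum, mean))
+res
+</code></pre>
+
+<p>Each function is applied to every cell of the cast, so the result has one row per <code>id</code> and one column per function and <code>grp</code> pair, in addition to <code>id</code> itself.</p>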
-</div>
-</div>
-<div id="section-10" class="section level1">
-<h1></h1>
-<hr />
-</div>
-
-
-
-<!-- dynamically load mathjax for compatibility with self-contained -->
-<script>
- (function () {
- var script = document.createElement("script");
- script.type = "text/javascript";
- script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
- document.getElementsByTagName("head")[0].appendChild(script);
- })();
-</script>
+
+<p>#</p>
+
+<hr/>
</body>
+
</html>
diff --git a/inst/doc/datatable-secondary-indices-and-auto-indexing.html b/inst/doc/datatable-secondary-indices-and-auto-indexing.html
index 2c67a8a..e59255b 100644
--- a/inst/doc/datatable-secondary-indices-and-auto-indexing.html
+++ b/inst/doc/datatable-secondary-indices-and-auto-indexing.html
@@ -1,453 +1,598 @@
<!DOCTYPE html>
-
-<html xmlns="http://www.w3.org/1999/xhtml">
-
+<html>
<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
+
+<title>Data {#data}</title>
+
+<script type="text/javascript">
+window.onload = function() {
+ var imgs = document.getElementsByTagName('img'), i, img;
+ for (i = 0; i < imgs.length; i++) {
+ img = imgs[i];
+ // center an image if it is the only element of its parent
+ if (img.parentElement.childElementCount === 1)
+ img.parentElement.style.textAlign = 'center';
+ }
+};
+</script>
-<meta charset="utf-8">
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<meta name="generator" content="pandoc" />
-
-<meta name="viewport" content="width=device-width, initial-scale=1">
-
-
-<meta name="date" content="2017-01-31" />
+<!-- Styles for R syntax highlighter -->
+<style type="text/css">
+ pre .operator,
+ pre .paren {
+ color: rgb(104, 118, 135)
+ }
+
+ pre .literal {
+ color: #990073
+ }
+
+ pre .number {
+ color: #099;
+ }
+
+ pre .comment {
+ color: #998;
+ font-style: italic
+ }
+
+ pre .keyword {
+ color: #900;
+ font-weight: bold
+ }
+
+ pre .identifier {
+ color: rgb(0, 0, 0);
+ }
+
+ pre .string {
+ color: #d14;
+ }
+</style>
-<title>Secondary indices and auto indexing</title>
+<!-- R syntax highlighter -->
+<script type="text/javascript">
+var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.chi [...]
+hljs.initHighlightingOnLoad();
+</script>
-<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
-div.sourceCode { overflow-x: auto; }
-table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
- margin: 0; padding: 0; vertical-align: baseline; border: none; }
-table.sourceCode { width: 100%; line-height: 100%; }
-td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
-td.sourceCode { padding-left: 5px; }
-code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
-code > span.dt { color: #902000; } /* DataType */
-code > span.dv { color: #40a070; } /* DecVal */
-code > span.bn { color: #40a070; } /* BaseN */
-code > span.fl { color: #40a070; } /* Float */
-code > span.ch { color: #4070a0; } /* Char */
-code > span.st { color: #4070a0; } /* String */
-code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
-code > span.ot { color: #007020; } /* Other */
-code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
-code > span.fu { color: #06287e; } /* Function */
-code > span.er { color: #ff0000; font-weight: bold; } /* Error */
-code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
-code > span.cn { color: #880000; } /* Constant */
-code > span.sc { color: #4070a0; } /* SpecialChar */
-code > span.vs { color: #4070a0; } /* VerbatimString */
-code > span.ss { color: #bb6688; } /* SpecialString */
-code > span.im { } /* Import */
-code > span.va { color: #19177c; } /* Variable */
-code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
-code > span.op { color: #666666; } /* Operator */
-code > span.bu { } /* BuiltIn */
-code > span.ex { } /* Extension */
-code > span.pp { color: #bc7a00; } /* Preprocessor */
-code > span.at { color: #7d9029; } /* Attribute */
-code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
-code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
-code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
-code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
+body, td {
+ font-family: sans-serif;
+ background-color: white;
+ font-size: 13px;
+}
+
+body {
+ max-width: 800px;
+ margin: auto;
+ padding: 1em;
+ line-height: 20px;
+}
+
+tt, code, pre {
+ font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
+}
+
+h1 {
+ font-size:2.2em;
+}
+
+h2 {
+ font-size:1.8em;
+}
+
+h3 {
+ font-size:1.4em;
+}
+
+h4 {
+ font-size:1.0em;
+}
+
+h5 {
+ font-size:0.9em;
+}
+
+h6 {
+ font-size:0.8em;
+}
+
+a:visited {
+ color: rgb(50%, 0%, 50%);
+}
+
+pre, img {
+ max-width: 100%;
+}
+pre {
+ overflow-x: auto;
+}
+pre code {
+ display: block; padding: 0.5em;
+}
+
+code {
+ font-size: 92%;
+ border: 1px solid #ccc;
+}
+
+code[class] {
+ background-color: #F8F8F8;
+}
+
+table, td, th {
+ border: none;
+}
+
+blockquote {
+ color:#666666;
+ margin:0;
+ padding-left: 1em;
+ border-left: 0.5em #EEE solid;
+}
+
+hr {
+ height: 0px;
+ border-bottom: none;
+ border-top-width: thin;
+ border-top-style: dotted;
+ border-top-color: #999999;
+}
+
+@media print {
+ * {
+ background: transparent !important;
+ color: black !important;
+ filter:none !important;
+ -ms-filter: none !important;
+ }
+
+ body {
+ font-size:12pt;
+ max-width:100%;
+ }
+
+ a, a:visited {
+ text-decoration: underline;
+ }
+
+ hr {
+ visibility: hidden;
+ page-break-before: always;
+ }
+
+ pre, blockquote {
+ padding-right: 1em;
+ page-break-inside: avoid;
+ }
+
+ tr, img {
+ page-break-inside: avoid;
+ }
+
+ img {
+ max-width: 100% !important;
+ }
+
+ @page :left {
+ margin: 15mm 20mm 15mm 10mm;
+ }
+
+ @page :right {
+ margin: 15mm 10mm 15mm 20mm;
+ }
+
+ p, h2, h3 {
+ orphans: 3; widows: 3;
+ }
+
+ h2, h3 {
+ page-break-after: avoid;
+ }
+}
</style>
-<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
-
</head>
<body>
+<p>This vignette assumes that the reader is familiar with data.table's <code>[i, j, by]</code> syntax, and how to perform fast key based subsets. If you're not familiar with these concepts, please read the <em>“Introduction to data.table”</em>, <em>“Reference semantics”</em> and <em>“Keys and fast binary search based subset”</em> vignettes first.</p>
+<hr/>
+<h2>Data {#data}</h2>
+<p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p>
-<h1 class="title toc-ignore">Secondary indices and auto indexing</h1>
-<h4 class="date"><em>2017-01-31</em></h4>
+<pre><code class="r">flights <- fread("flights14.csv")
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
+dim(flights)
+# [1] 253316 11
+</code></pre>
-
-
-<p>This vignette assumes that the reader is familiar with data.table’s <code>[i, j, by]</code> syntax, and how to perform fast key based subsets. If you’re not familar with these concepts, please read the <em>“Introduction to data.table”</em>, <em>“Reference semantics”</em> and <em>“Keys and fast binary search based subset”</em> vignettes first.</p>
-<hr />
-<div id="data" class="section level2">
-<h2>Data</h2>
-<p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span>
-<span class="kw">dim</span>(flights)
-<span class="co"># [1] 253316 11</span></code></pre></div>
-</div>
-<div id="introduction" class="section level2">
<h2>Introduction</h2>
+
<p>In this vignette, we will</p>
+
<ul>
<li><p>discuss <em>secondary indices</em> and provide rationale as to why we need them by citing cases where setting keys is not necessarily ideal,</p></li>
<li><p>perform fast subsetting, once again, but using the new <code>on</code> argument, which computes secondary indices internally for the task (temporarily), and reuses one if it already exists,</p></li>
<li><p>and finally look at <em>auto indexing</em> which goes a step further and creates secondary indices automatically, but does so on native R syntax for subsetting.</p></li>
</ul>
-</div>
-<div id="secondary-indices" class="section level2">
+
<h2>1. Secondary indices</h2>
-<div id="a-what-are-secondary-indices" class="section level3">
+
<h3>a) What are secondary indices?</h3>
+
<p>Secondary indices are similar to <code>keys</code> in <em>data.table</em>, except for two major differences:</p>
+
<ul>
-<li><p>It <em>doesn’t</em> physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that <em>order vector</em> in an additional attribute called <code>index</code>.</p></li>
+<li><p>It <em>doesn't</em> physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that <em>order vector</em> in an additional attribute called <code>index</code>.</p></li>
<li><p>There can be more than one secondary index for a data.table (as we will see below).</p></li>
</ul>
-</div>
-<div id="b-set-and-get-secondary-indices" class="section level3">
+
<h3>b) Set and get secondary indices</h3>
-<div id="how-can-we-set-the-column-origin-as-a-secondary-index-in-the-data.table-flights" class="section level4">
-<h4>– How can we set the column <code>origin</code> as a secondary index in the <em>data.table</em> <code>flights</code>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setindex</span>(flights, origin)
-<span class="kw">head</span>(flights)
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span>
-
-## alternatively we can provide character vectors to the function 'setindexv()'
-<span class="co"># setindexv(flights, "origin") # useful to program with</span>
-
-<span class="co"># 'index' attribute added</span>
-<span class="kw">names</span>(<span class="kw">attributes</span>(flights))
-<span class="co"># [1] "names" "row.names" "class" ".internal.selfref"</span>
-<span class="co"># [5] "index"</span></code></pre></div>
+
+<h4>– How can we set the column <code>origin</code> as a secondary index in the <em>data.table</em> <code>flights</code>?</h4>
+
+<pre><code class="r">setindex(flights, origin)
+head(flights)
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
+# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
+
+## alternatively we can provide character vectors to the function 'setindexv()'
+# setindexv(flights, "origin") # useful to program with
+
+# 'index' attribute added
+names(attributes(flights))
+# [1] "names" "row.names" "class" ".internal.selfref"
+# [5] "index"
+</code></pre>
+
<ul>
<li><p><code>setindex()</code> and <code>setindexv()</code> allow adding a secondary index to the data.table.</p></li>
<li><p>Note that <code>flights</code> is <strong>not</strong> physically reordered in increasing order of <code>origin</code>, as would have been the case with <code>setkey()</code>.</p></li>
-<li><p>Also note that the attribute <code>index</code> has been added to <code>flights</code>.</p></li>
+<li><p>Also note that the attribute <code>index</code> has been added to <code>flights</code>. </p></li>
<li><p><code>setindex(flights, NULL)</code> would remove all secondary indices.</p></li>
</ul>
-</div>
-<div id="how-can-we-get-all-the-secondary-indices-set-so-far-in-flights" class="section level4">
-<h4>– How can we get all the secondary indices set so far in <code>flights</code>?</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">indices</span>(flights)
-<span class="co"># [1] "origin"</span>
-
-<span class="kw">setindex</span>(flights, origin, dest)
-<span class="kw">indices</span>(flights)
-<span class="co"># [1] "origin" "origin__dest"</span></code></pre></div>
+
+<h4>– How can we get all the secondary indices set so far in <code>flights</code>?</h4>
+
+<pre><code class="r">indices(flights)
+# [1] "origin"
+
+setindex(flights, origin, dest)
+indices(flights)
+# [1] "origin" "origin__dest"
+</code></pre>
+
<ul>
<li><p>The function <code>indices()</code> returns all current secondary indices in the data.table. If none exists, <code>NULL</code> is returned.</p></li>
<li><p>Note that by creating another index on the columns <code>origin, dest</code>, we do not lose the first index created on the column <code>origin</code>, i.e., we can have multiple secondary indices.</p></li>
</ul>
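+
+<p>The set, get and remove steps above can be sketched on a hypothetical four-row table (<code>DT.idx</code> is an illustrative name, not the <code>flights</code> data):</p>
+
+<pre><code class="r">library(data.table)
+
+## hypothetical toy data
+DT.idx = data.table(
+  origin = c("JFK", "LGA", "EWR", "JFK"),
+  dest   = c("LAX", "PBI", "LAX", "SFO")
+)
+
+setindex(DT.idx, origin)         ## first secondary index
+setindex(DT.idx, origin, dest)   ## second index; the first one is kept
+indices(DT.idx)
+
+DT.idx$origin                    ## row order is untouched by setindex()
+
+setindex(DT.idx, NULL)           ## drop all secondary indices
+indices(DT.idx)                  ## NULL once no index is left
+</code></pre>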
-</div>
-</div>
-<div id="c-why-do-we-need-secondary-indices" class="section level3">
+
<h3>c) Why do we need secondary indices?</h3>
-<div id="reordering-a-data.table-can-be-expensive-and-not-always-ideal" class="section level4">
-<h4>– Reordering a data.table can be expensive and not always ideal</h4>
-<p>Consider the case where you would like to perform a fast key based subset on <code>origin</code> column for the value “JFK”. We’d do this as:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## not run
-<span class="kw">setkey</span>(flights, origin)
-flights[<span class="st">"JFK"</span>] <span class="co"># or flights[.("JFK")]</span></code></pre></div>
-</div>
-<div id="setkey-requires" class="section level4 bs-callout bs-callout-info">
-<h4><code>setkey()</code> requires:</h4>
-<ol style="list-style-type: lower-alpha">
-<li><p>computing the order vector for the column(s) provided, here, <code>origin</code>, and</p></li>
-<li><p>reordering the entire data.table, by reference, based on the order vector computed.</p></li>
-</ol>
-</div>
-</div>
-</div>
-<div id="section" class="section level1">
-<h1></h1>
-<p>Computing the order isn’t the time consuming part, since data.table uses true radix sorting on integer, character and numeric vectors. However reordering the data.table could be time consuming (depending on the number of rows and columns).</p>
+
+<h4>– Reordering a data.table can be expensive and not always ideal</h4>
+
+<p>Consider the case where you would like to perform a fast key based subset on <code>origin</code> column for the value “JFK”. We'd do this as:</p>
+
+<pre><code class="r">## not run
+setkey(flights, origin)
+flights["JFK"] # or flights[.("JFK")]
+</code></pre>
+
+<h4><code>setkey()</code> requires: {.bs-callout .bs-callout-info}</h4>
+
+<p>a) computing the order vector for the column(s) provided, here, <code>origin</code>, and</p>
+
+<p>b) reordering the entire data.table, by reference, based on the order vector computed.</p>
+
+<p>Computing the order isn't the time-consuming part, since data.table uses true radix sorting on integer, character and numeric vectors. However, reordering the data.table could be time-consuming (depending on the number of rows and columns).</p>
+
<p>Unless our task involves repeated subsetting on the same column, fast key based subsetting could effectively be nullified by the time to reorder, depending on our data.table dimensions.</p>
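+
+<p>To make the difference concrete, here is a minimal sketch on a hypothetical three-row table (<code>DT.ord</code> is a made-up name): <code>setindex()</code> leaves the rows where they are, whereas <code>setkey()</code> physically reorders them.</p>
+
+<pre><code class="r">library(data.table)
+
+## hypothetical toy data; 'n' records the original row position
+DT.ord = data.table(origin = c("LGA", "JFK", "EWR"), n = 1:3)
+
+setindex(DT.ord, origin)
+DT.ord$n                 ## still 1 2 3: only an order vector was stored
+
+setkey(DT.ord, origin)
+DT.ord$n                 ## now 3 2 1: rows were moved into EWR, JFK, LGA order
+key(DT.ord)
+</code></pre>
+
+<p>On a table this small the reordering cost is negligible; the concern in the text applies to tables with many rows and columns.</p>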
-<div id="there-can-be-only-one-key-at-the-most" class="section level4">
-<h4>– There can be only one <code>key</code> at the most</h4>
-<p>Now if we would like to repeat the same operation but on <code>dest</code> column instead, for the value “LAX”, then we have to <code>setkey()</code>, <em>again</em>.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## not run
-<span class="kw">setkey</span>(flights, dest)
-flights[<span class="st">"LAX"</span>]</code></pre></div>
-<p>And this reorders <code>flights</code> by <code>dest</code>, <em>again</em>. What we would really like is to be able to perform the fast subsetting by eliminating the reordering step.</p>
+
+<h4>– There can be only one <code>key</code> at the most</h4>
+
+<p>Now if we would like to repeat the same operation but on <code>dest</code> column instead, for the value “LAX”, then we have to <code>setkey()</code>, <em>again</em>. </p>
+
+<pre><code class="r">## not run
+setkey(flights, dest)
+flights["LAX"]
+</code></pre>
+
+<p>And this reorders <code>flights</code> by <code>dest</code>, <em>again</em>. What we would really like is to be able to perform the fast subsetting by eliminating the reordering step. </p>
+
<p>And this is precisely what <em>secondary indices</em> allow for!</p>
-</div>
-<div id="secondary-indices-can-be-reused" class="section level4">
-<h4>– Secondary indices can be reused</h4>
+
+<h4>– Secondary indices can be reused</h4>
+
<p>Since there can be multiple secondary indices, and creating an index is as simple as storing the order vector as an attribute, this allows us to even eliminate the time to recompute the order vector if an index already exists.</p>
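+
+<p>A minimal sketch of this idea, using a small hypothetical table (<code>toy</code> and its columns are made up for illustration and are not part of the <code>flights</code> data). It assumes only that <code>setindex()</code> and <code>indices()</code> behave as described in this vignette:</p>
+
+<pre><code class="r">library(data.table)
+## hypothetical toy data, for illustration only
+toy = data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))
+setindex(toy, id)       # stores the order vector as an attribute; rows are not moved
+toy$id                  # physical row order is unchanged
+# [1] 3 1 2
+indices(toy)            # names of the existing secondary indices
+# [1] "id"
+toy[.(2L), on = "id"]   # subset on id; an existing index on id can be reused
+</code></pre>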
-</div>
-<div id="the-new-on-argument-allows-for-cleaner-syntax-and-automatic-creation-and-reuse-of-secondary-indices" class="section level4">
-<h4>– The new <code>on</code> argument allows for cleaner syntax and automatic creation and reuse of secondary indices</h4>
+
+<h4>– The new <code>on</code> argument allows for cleaner syntax and automatic creation and reuse of secondary indices</h4>
+
<p>As we will see in the next section, the <code>on</code> argument provides several advantages:</p>
-</div>
-<div id="on-argument" class="section level4 bs-callout bs-callout-info">
-<h4><code>on</code> argument</h4>
+
+<h4><code>on</code> argument {.bs-callout .bs-callout-info}</h4>
+
<ul>
<li><p>enables subsetting by computing secondary indices on the fly. This eliminates having to do <code>setindex()</code> every time.</p></li>
<li><p>allows easy reuse of existing indices by just checking the attributes.</p></li>
-<li><p>allows for a cleaner syntax by having the columns on which the subset is performed as part of the syntax. This makes the code easier to follow when looking at it at a later point.</p>
+<li><p>allows for a cleaner syntax by having the columns on which the subset is performed as part of the syntax. This makes the code easier to follow when looking at it at a later point. </p>
+
<p>Note that the <code>on</code> argument can also be used on keyed subsets. In fact, we encourage providing the <code>on</code> argument even when subsetting using keys, for better readability.</p></li>
</ul>
-</div>
-</div>
-<div id="section-1" class="section level1">
-<h1></h1>
-<div id="fast-subsetting-using-on-argument-and-secondary-indices" class="section level2">
+
<h2>2. Fast subsetting using <code>on</code> argument and secondary indices</h2>
-<div id="a-fast-subsets-in-i" class="section level3">
+
<h3>a) Fast subsets in <code>i</code></h3>
-<div id="subset-all-rows-where-the-origin-airport-matches-jfk-using-on" class="section level4">
-<h4>– Subset all rows where the origin airport matches <em>“JFK”</em> using <code>on</code></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[<span class="st">"JFK"</span>, on =<span class="st"> "origin"</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span>
-<span class="co"># --- </span>
-<span class="co"># 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17</span>
-<span class="co"># 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18</span>
-<span class="co"># 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17</span>
-<span class="co"># 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9</span>
-<span class="co"># 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11</span>
+
+<h4>– Subset all rows where the origin airport matches <em>“JFK”</em> using <code>on</code></h4>
+
+<pre><code class="r">flights["JFK", on = "origin"]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
+# ---
+# 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17
+# 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18
+# 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17
+# 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9
+# 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11
## alternatively
-<span class="co"># flights[.("JFK"), on = "origin"] (or) </span>
-<span class="co"># flights[list("JFK"), on = "origin"]</span></code></pre></div>
+# flights[.("JFK"), on = "origin"] (or)
+# flights[list("JFK"), on = "origin"]
+</code></pre>
+
<ul>
-<li><p>This statement performs a fast binary search based subset as well, by computing the index on the fly. However, note that it doesn’t save the index as an attribute automatically. This may change in the future.</p></li>
+<li><p>This statement performs a fast binary search based subset as well, by computing the index on the fly. However, note that it doesn't save the index as an attribute automatically. This may change in the future.</p></li>
<li><p>If we had already created a secondary index, using <code>setindex()</code>, then <code>on</code> would reuse it instead of (re)computing it. We can see that by using <code>verbose = TRUE</code>:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setindex</span>(flights, origin)
-flights[<span class="st">"JFK"</span>, on =<span class="st"> "origin"</span>, verbose =<span class="st"> </span><span class="ot">TRUE</span>][<span class="dv">1</span>:<span class="dv">5</span>]
-<span class="co"># on= matches existing index, using index</span>
-<span class="co"># Starting bmerge ...done in 0 secs</span>
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span></code></pre></div></li>
+
+<pre><code class="r">setindex(flights, origin)
+flights["JFK", on = "origin", verbose = TRUE][1:5]
+# on= matches existing index, using index
+# Starting bmerge ...done in 0 secs
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
+</code></pre></li>
</ul>
-</div>
-<div id="how-can-i-subset-based-on-origin-and-dest-columns" class="section level4">
-<h4>– How can I subset based on <code>origin</code> <em>and</em> <code>dest</code> columns?</h4>
+
+<h4>– How can I subset based on <code>origin</code> <em>and</em> <code>dest</code> columns?</h4>
+
<p>For example, if we want to subset <code>"JFK", "LAX"</code> combination, then:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"JFK"</span>, <span class="st">"LAX"</span>), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)][<span class="dv">1</span>:<span class="dv">5</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span></code></pre></div>
+
+<pre><code class="r">flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
+# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
+# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
+# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
+# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
+</code></pre>
+
<ul>
<li><p><code>on</code> argument accepts a character vector of column names corresponding to the order provided to <code>i-argument</code>.</p></li>
-<li><p>Since the time to compute the secondary index is quite small, we don’t have to use <code>setindex()</code>, unless, once again, the task involves repeated subsetting on the same column.</p></li>
+<li><p>Since the time to compute the secondary index is quite small, we don't have to use <code>setindex()</code>, unless, once again, the task involves repeated subsetting on the same column.</p></li>
</ul>
-</div>
-</div>
-<div id="b-select-in-j" class="section level3">
+
<h3>b) Select in <code>j</code></h3>
-<p>All the operations we will discuss below are no different to the ones we already saw in the <em>Keys and fast binary search based subset</em> vignette. Except we’ll be using the <code>on</code> argument instead of setting keys.</p>
-<div id="return-arr_delay-column-alone-as-a-data.table-corresponding-to-origin-lga-and-dest-tpa" class="section level4">
-<h4>– Return <code>arr_delay</code> column alone as a data.table corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)]
-<span class="co"># arr_delay</span>
-<span class="co"># 1: 1</span>
-<span class="co"># 2: 14</span>
-<span class="co"># 3: -17</span>
-<span class="co"># 4: -4</span>
-<span class="co"># 5: -12</span>
-<span class="co"># --- </span>
-<span class="co"># 1848: 39</span>
-<span class="co"># 1849: -24</span>
-<span class="co"># 1850: -12</span>
-<span class="co"># 1851: 21</span>
-<span class="co"># 1852: -11</span></code></pre></div>
-</div>
-</div>
-<div id="c-chaining" class="section level3">
+
+<p>All the operations we will discuss below are no different to the ones we already saw in the <em>Keys and fast binary search based subset</em> vignette. Except we'll be using the <code>on</code> argument instead of setting keys.</p>
+
+<h4>– Return <code>arr_delay</code> column alone as a data.table corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code></h4>
+
+<pre><code class="r">flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
+# arr_delay
+# 1: 1
+# 2: 14
+# 3: -17
+# 4: -4
+# 5: -12
+# ---
+# 1848: 39
+# 1849: -24
+# 1850: -12
+# 1851: 21
+# 1852: -11
+</code></pre>
+
<h3>c) Chaining</h3>
-<div id="on-the-result-obtained-above-use-chaining-to-order-the-column-in-decreasing-order." class="section level4">
-<h4>– On the result obtained above, use chaining to order the column in decreasing order.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)][<span class="kw">order</span>(-arr_delay)]
-<span class="co"># arr_delay</span>
-<span class="co"># 1: 486</span>
-<span class="co"># 2: 380</span>
-<span class="co"># 3: 351</span>
-<span class="co"># 4: 318</span>
-<span class="co"># 5: 300</span>
-<span class="co"># --- </span>
-<span class="co"># 1848: -40</span>
-<span class="co"># 1849: -43</span>
-<span class="co"># 1850: -46</span>
-<span class="co"># 1851: -48</span>
-<span class="co"># 1852: -49</span></code></pre></div>
-</div>
-</div>
-<div id="d-compute-or-do-in-j" class="section level3">
+
+<h4>– On the result obtained above, use chaining to order the column in decreasing order.</h4>
+
+<pre><code class="r">flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
+# arr_delay
+# 1: 486
+# 2: 380
+# 3: 351
+# 4: 318
+# 5: 300
+# ---
+# 1848: -40
+# 1849: -43
+# 1850: -46
+# 1851: -48
+# 1852: -49
+</code></pre>
+
<h3>d) Compute or <em>do</em> in <code>j</code></h3>
-<div id="find-the-maximum-arrival-delay-correspondong-to-origin-lga-and-dest-tpa." class="section level4">
-<h4>– Find the maximum arrival delay correspondong to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), <span class="kw">max</span>(arr_delay), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)]
-<span class="co"># [1] 486</span></code></pre></div>
-</div>
-</div>
-<div id="e-sub-assign-by-reference-using-in-j" class="section level3">
+
+<h4>– Find the maximum arrival delay corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4>
+
+<pre><code class="r">flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
+# [1] 486
+</code></pre>
+
<h3>e) <em>sub-assign</em> by reference using <code>:=</code> in <code>j</code></h3>
-<p>We have seen this example already in the <em>Reference semantics</em> and <em>Keys and fast binary search based subset</em> vignette. Let’s take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># get all 'hours' in flights</span>
-flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]
-<span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</span></code></pre></div>
-<p>We see that there are totally <code>25</code> unique values in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let’s go ahead and replace <em>24</em> with <em>0</em>, but this time using <code>on</code> instead of setting keys.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(24L), hour :<span class="er">=</span><span class="st"> </span>0L, on =<span class="st"> "hour"</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span>
-<span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span>
-<span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span>
-<span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span>
-<span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span>
-<span class="co"># --- </span>
-<span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span>
-<span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8</span>
-<span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span>
-<span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span>
-<span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span></code></pre></div>
-<p>Now, let’s check if <code>24</code> is replaced with <code>0</code> in the <code>hour</code> column.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]
-<span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23</span></code></pre></div>
+
+<p>We have seen this example already in the <em>Reference semantics</em> and <em>Keys and fast binary search based subset</em> vignette. Let's take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p>
+
+<pre><code class="r"># get all 'hours' in flights
+flights[, sort(unique(hour))]
+# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
+</code></pre>
+
+<p>We see that there are a total of <code>25</code> unique values in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let's go ahead and replace <em>24</em> with <em>0</em>, but this time using <code>on</code> instead of setting keys.</p>
+
+<pre><code class="r">flights[.(24L), hour := 0L, on = "hour"]
+</code></pre>
+
+<p>Now, let's check if <code>24</code> is replaced with <code>0</code> in the <code>hour</code> column.</p>
+
+<pre><code class="r">flights[, sort(unique(hour))]
+# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+</code></pre>
+
<ul>
<li>This is a particularly big advantage of secondary indices. Previously, just to update a few rows of <code>hour</code>, we had to <code>setkey()</code> on it, which inevitably reorders the entire data.table. With <code>on</code>, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.</li>
</ul>
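+
+<p>The same pattern on a small hypothetical table (<code>toy</code> is made up for illustration) suggests how the update touches only the matching rows and leaves the row order as it was:</p>
+
+<pre><code class="r">library(data.table)
+## hypothetical toy data, for illustration only
+toy = data.table(hour = c(24L, 5L, 0L, 24L), flight = 1:4)
+toy[.(24L), hour := 0L, on = "hour"]   # update only the rows where hour is 24
+toy$hour
+# [1] 0 5 0 0
+toy$flight                             # rows stay in their original order
+# [1] 1 2 3 4
+</code></pre>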
-</div>
-<div id="f-aggregation-using-by" class="section level3">
+
<h3>f) Aggregation using <code>by</code></h3>
-<div id="get-the-maximum-departure-delay-for-each-month-corresponding-to-origin-jfk.-order-the-result-by-month" class="section level4">
-<h4>– Get the maximum departure delay for each <code>month</code> corresponding to <code>origin = "JFK"</code>. Order the result by <code>month</code></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ans <-<span class="st"> </span>flights[<span class="st">"JFK"</span>, <span class="kw">max</span>(dep_delay), keyby =<span class="st"> </span>month, on =<span class="st"> "origin"</span>]
-<span class="kw">head</span>(ans)
-<span class="co"># month V1</span>
-<span class="co"># 1: 1 881</span>
-<span class="co"># 2: 1 1014</span>
-<span class="co"># 3: 1 920</span>
-<span class="co"># 4: 1 1241</span>
-<span class="co"># 5: 1 853</span>
-<span class="co"># 6: 1 798</span></code></pre></div>
+
+<h4>– Get the maximum departure delay for each <code>month</code> corresponding to <code>origin = "JFK"</code>. Order the result by <code>month</code></h4>
+
+<pre><code class="r">ans <- flights["JFK", max(dep_delay), keyby = month, on = "origin"]
+head(ans)
+# month V1
+# 1: 1 881
+# 2: 1 1014
+# 3: 1 920
+# 4: 1 1241
+# 5: 1 853
+# 6: 1 798
+</code></pre>
+
<ul>
<li>We would have had to set the <code>key</code> back to <code>origin, dest</code> again, if we did not use <code>on</code> which internally builds secondary indices on the fly.</li>
</ul>
-</div>
-</div>
-<div id="g-the-mult-argument" class="section level3">
+
<h3>g) The <em>mult</em> argument</h3>
-<p>The other arguments including <code>mult</code> work exactly the same way as we saw in the <em>Keys and fast binary search based subset</em> vignette. The default value for <code>mult</code> is “all”. We can choose, instead only the “first” or “last” matching rows should be returned.</p>
-<div id="subset-only-the-first-matching-row-where-dest-matches-bos-and-day" class="section level4">
-<h4>– Subset only the first matching row where <code>dest</code> matches <em>“BOS”</em> and <em>“DAY”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[<span class="kw">c</span>(<span class="st">"BOS"</span>, <span class="st">"DAY"</span>), on =<span class="st"> "dest"</span>, mult =<span class="st"> "first"</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 1 1 3 1 AA JFK BOS 39 187 12</span>
-<span class="co"># 2: 2014 1 1 25 35 EV EWR DAY 102 533 17</span></code></pre></div>
-</div>
-<div id="subset-only-the-last-matching-row-where-origin-matches-lga-jfk-ewr-and-dest-matches-xna" class="section level4">
-<h4>– Subset only the last matching row where <code>origin</code> matches <em>“LGA”, “JFK”, “EWR”</em> and <code>dest</code> matches <em>“XNA”</em></h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>), mult =<span class="st"> "last"</span>]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6</span>
-<span class="co"># 2: NA NA NA NA NA NA JFK XNA NA NA NA</span>
-<span class="co"># 3: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6</span></code></pre></div>
-</div>
-</div>
-<div id="h-the-nomatch-argument" class="section level3">
+
+<p>The other arguments including <code>mult</code> work exactly the same way as we saw in the <em>Keys and fast binary search based subset</em> vignette. The default value for <code>mult</code> is “all”. We can instead choose to return only the “first” or “last” matching rows.</p>
+
+<h4>– Subset only the first matching row where <code>dest</code> matches <em>“BOS”</em> and <em>“DAY”</em></h4>
+
+<pre><code class="r">flights[c("BOS", "DAY"), on = "dest", mult = "first"]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 1 1 3 1 AA JFK BOS 39 187 12
+# 2: 2014 1 1 25 35 EV EWR DAY 102 533 17
+</code></pre>
+
+<h4>– Subset only the last matching row where <code>origin</code> matches <em>“LGA”, “JFK”, “EWR”</em> and <code>dest</code> matches <em>“XNA”</em></h4>
+
+<pre><code class="r">flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6
+# 2: NA NA NA NA NA NA JFK XNA NA NA NA
+# 3: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6
+</code></pre>
+
<h3>h) The <em>nomatch</em> argument</h3>
+
<p>We can choose if queries that do not match should return <code>NA</code> or be skipped altogether using the <code>nomatch</code> argument.</p>
-<div id="from-the-previous-example-subset-all-rows-only-if-theres-a-match" class="section level4">
-<h4>– From the previous example, subset all rows only if there’s a match</h4>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), mult =<span class="st"> "last"</span>, on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>), nomatch =<span class=" [...]
-<span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span>
-<span class="co"># 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6</span>
-<span class="co"># 2: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6</span></code></pre></div>
+
+<h4>– From the previous example, subset all rows only if there's a match</h4>
+
+<pre><code class="r">flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = 0L]
+# year month day dep_delay arr_delay carrier origin dest air_time distance hour
+# 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6
+# 2: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6
+</code></pre>
+
<ul>
-<li>There are no flights connecting “JFK” and “XNA”. Therefore, that row is skipped in the result.</li>
+<li>There are no flights connecting “JFK” and “XNA”. Therefore, that row is skipped in the result.</li>
</ul>
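+
+<p>To see <code>mult</code> and <code>nomatch</code> side by side without the full <code>flights</code> data, here is a small sketch on a hypothetical table (<code>toy</code> and its column <code>n</code> are made up for illustration):</p>
+
+<pre><code class="r">library(data.table)
+## hypothetical toy data: no row has origin "JFK"
+toy = data.table(origin = c("LGA", "LGA", "EWR"), dest = "XNA", n = 1:3)
+q = .(c("LGA", "JFK", "EWR"), "XNA")
+toy[q, on = c("origin", "dest"), mult = "last"]$n                 # unmatched query row gives NA
+# [1]  2 NA  3
+toy[q, on = c("origin", "dest"), mult = "last", nomatch = 0L]$n   # unmatched query row is dropped
+# [1] 2 3
+</code></pre>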
-</div>
-</div>
-</div>
-<div id="auto-indexing" class="section level2">
+
<h2>3. Auto indexing</h2>
+
<p>First we looked at how to perform fast subsets with binary search using <em>keys</em>. Then we figured out that we could improve performance even further and have cleaner syntax by using secondary indices. What could be better than that? The answer is to optimise <em>native R syntax</em> to use secondary indices internally so that we can have the same performance without having to use newer syntax.</p>
-<p>That is what <em>auto indexing</em> does. At the moment, it is only implemented for binary operators <code>==</code> and <code>%in%</code>. And it only works with a single column at the moment as well. An index is automatically created <em>and</em> saved as an attribute. That is, unlike the <code>on</code> argument which computes the index on the fly each time, a secondary index is created here.</p>
-<p>Let’s start by creating a data.table big enough to highlight the advantage.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">set.seed</span>(1L)
-dt =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> <span class="kw">sample</span>(1e5L, 1e7L, <span class="ot">TRUE</span>), <span class="dt">y =</span> <span class="kw">runif</span>(100L))
-<span class="kw">print</span>(<span class="kw">object.size</span>(dt), <span class="dt">units =</span> <span class="st">"Mb"</span>)
-<span class="co"># 114.4 Mb</span></code></pre></div>
+
+<p>That is what <em>auto indexing</em> does. At the moment, it is only implemented for binary operators <code>==</code> and <code>%in%</code>. And it only works with a single column at the moment as well. An index is automatically created <em>and</em> saved as an attribute. That is, unlike the <code>on</code> argument which computes the index on the fly each time, a secondary index is created here. </p>
+
+<p>Let's start by creating a data.table big enough to highlight the advantage.</p>
+
+<pre><code class="r">set.seed(1L)
+dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
+print(object.size(dt), units = "Mb")
+# 114.4 Mb
+</code></pre>
+
<p>When we use <code>==</code> or <code>%in%</code> on a single column for the first time, a secondary index is created automatically, and it is used to perform the subset.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## have a look at all the attribute names
-<span class="kw">names</span>(<span class="kw">attributes</span>(dt))
-<span class="co"># [1] "names" "row.names" "class" ".internal.selfref"</span>
+
+<pre><code class="r">## have a look at all the attribute names
+names(attributes(dt))
+# [1] "names" "row.names" "class" ".internal.selfref"
## run the first time
-(t1 <-<span class="st"> </span><span class="kw">system.time</span>(ans <-<span class="st"> </span>dt[x ==<span class="st"> </span>989L]))
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.156 0.012 0.170</span>
-<span class="kw">head</span>(ans)
-<span class="co"># x y</span>
-<span class="co"># 1: 989 0.5372007</span>
-<span class="co"># 2: 989 0.5642786</span>
-<span class="co"># 3: 989 0.7151100</span>
-<span class="co"># 4: 989 0.3920405</span>
-<span class="co"># 5: 989 0.9547465</span>
-<span class="co"># 6: 989 0.2914710</span>
+(t1 <- system.time(ans <- dt[x == 989L]))
+# user system elapsed
+# 0.208 0.004 0.212
+head(ans)
+# x y
+# 1: 989 0.5372007
+# 2: 989 0.5642786
+# 3: 989 0.7151100
+# 4: 989 0.3920405
+# 5: 989 0.9547465
+# 6: 989 0.2914710
## secondary index is created
-<span class="kw">names</span>(<span class="kw">attributes</span>(dt))
-<span class="co"># [1] "names" "row.names" "class" ".internal.selfref"</span>
-<span class="co"># [5] "index"</span>
+names(attributes(dt))
+# [1] "names" "row.names" "class" ".internal.selfref"
+# [5] "index"
+
+indices(dt)
+# [1] "x"
+</code></pre>
-<span class="kw">indices</span>(dt)
-<span class="co"># [1] "x"</span></code></pre></div>
<p>The time to subset the first time is the time to create the index + the time to subset. Since creating a secondary index involves only creating the order vector, this combined operation is faster than vector scans in many cases. But the real advantage comes in successive subsets. They are extremely fast.</p>
-<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## successive subsets
-(t2 <-<span class="st"> </span><span class="kw">system.time</span>(dt[x ==<span class="st"> </span>989L]))
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.000 0.000 0.001</span>
-<span class="kw">system.time</span>(dt[x %in%<span class="st"> </span><span class="dv">1989</span>:<span class="dv">2012</span>])
-<span class="co"># user system elapsed </span>
-<span class="co"># 0.000 0.000 0.001</span></code></pre></div>
+
+<pre><code class="r">## successive subsets
+(t2 <- system.time(dt[x == 989L]))
+# user system elapsed
+# 0 0 0
+system.time(dt[x %in% 1989:2012])
+# user system elapsed
+# 0 0 0
+</code></pre>
+
<ul>
-<li><p>Running the first time took 0.170 seconds where as the second time took 0.001 seconds.</p></li>
+<li><p>Running the first time took 0.212 seconds whereas the second time took 0.000 seconds.</p></li>
<li><p>Auto indexing can be disabled by setting the global option <code>options(datatable.auto.index = FALSE)</code>.</p></li>
<li><p>Disabling auto indexing still allows the use of indices created explicitly with <code>setindex</code> or <code>setindexv</code>. You can disable indices fully by setting the global option <code>options(datatable.use.index = FALSE)</code>.</p></li>
</ul>
-</div>
-</div>
-<div id="section-2" class="section level1">
-<h1></h1>
-<p>In the future, we plan to extend auto indexing to expressions involving more than one column. Also we are working on extending binary search to work with more binary operators like <code><</code>, <code><=</code>, <code>></code> and <code>>=</code>. Once done, it would be straightforward to extend it to these operators as well.</p>
-<p>We will extend fast <em>subsets</em> using keys and secondary indices to <em>joins</em> in the next vignette, <em>“Joins and rolling joins”</em>.</p>
-<hr />
-</div>
-
-
-
-<!-- dynamically load mathjax for compatibility with self-contained -->
-<script>
- (function () {
- var script = document.createElement("script");
- script.type = "text/javascript";
- script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
- document.getElementsByTagName("head")[0].appendChild(script);
- })();
-</script>
+
+<p>In the future, we plan to extend auto indexing to expressions involving more than one column. Also we are working on extending binary search to work with more binary operators like <code><</code>, <code><=</code>, <code>></code> and <code>>=</code>. Once done, it would be straightforward to extend it to these operators as well. </p>
+
+<p>We will extend fast <em>subsets</em> using keys and secondary indices to <em>joins</em> in the next vignette, <em>“Joins and rolling joins”</em>.</p>
+
+<hr/>
</body>
+
</html>
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
index 8d26b54..57e2b5d 100644
--- a/inst/tests/tests.Rraw
+++ b/inst/tests/tests.Rraw
@@ -4197,11 +4197,11 @@ if (base::getRversion() < "3.3.0") {
} else {
base_order <- function(..., na.last=TRUE, method=c("shell","radix")) {
ans1 = base::order(..., na.last=na.last, method="shell")
- if (!is.na(na.last) || base::getRversion()>"3.3.2") {
+ if (!is.na(na.last) || base::getRversion()>"3.3.3") {
ans2 = base::order(..., na.last=na.last, method="radix")
if (!identical(ans1,ans2)) stop("Base R's order(,method='shell') != order(,method='radix')")
} else {
- # Only when na.last=NA in just R 3.3.0-3.3.2 we don't check shell==radix
+ # Only when na.last=NA in just R 3.3.0-3.3.3 we don't check shell==radix
# because there was a problem in base R's port of data.table code then when :
# 1) 2 or more vectors were passed to base::order(,method="radix")
# AND 2) na.last=NA
@@ -8956,10 +8956,11 @@ test(1673.5, as.IDate("2016-04-28") - as.IDate("2016-04-20"), 8L)
test(1674, forderv(c(2147483645L, 2147483646L, 2147483647L, 2147483644L), order=-1L), c(3,2,1,4))
# fix for #1718
+# In R-devel somewhere between 12 June 2017 (r72786) and 27 June 2017 (r72859), the behaviour of factor() changed.
+# Test updated minimally to create the previous representation directly instead of going via factor().
A = data.table(foo = c(1, 2, 3), bar = c(4, 5, 6))
-B = data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6))
A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)]
-B[, bar := factor(bar, levels = c(4, 5, NA), labels = c("Boop", "Beep", NA), exclude = NULL)]
+B = data.table(foo = c(1, 2, 3, 4, 5, 6), bar = structure(c(3L, 3L, 3L, 1L, 2L, NA), .Label=c("Boop","Beep",NA), class="factor"))
test(1675.1, as.integer(B[A, bar := i.bar, on="foo"]$bar), c(1:3,1:2,NA))
A = data.table(foo = c(1, 2, 3), bar = c(4, 5, 6))
B = data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6))
diff --git a/src/assign.c b/src/assign.c
index 4be58c1..b9cfa84 100644
--- a/src/assign.c
+++ b/src/assign.c
@@ -1,6 +1,6 @@
#include "data.table.h"
#include <Rdefines.h>
-#include <Rmath.h>
+#include <Rmath.h>
#include <Rversion.h>
static SEXP *saveds=NULL;
@@ -29,7 +29,7 @@ static void finalizer(SEXP p)
return;
}
-void setselfref(SEXP x) {
+void setselfref(SEXP x) {
SEXP p;
// Store pointer to itself so we can detect if the object has been copied. See
// ?copy for why copies are not just inefficient but cause a problem for over-allocated data.tables.
@@ -45,7 +45,7 @@ void setselfref(SEXP x) {
));
R_RegisterCFinalizerEx(p, finalizer, FALSE);
UNPROTECT(1); // The PROTECT above is needed by --enable-strict-barrier (it seems, iiuc)
-
+
/* * base::identical doesn't check prot and tag of EXTPTR, just that the ptr itself is the
same in both objects. R_NilValue is always equal to R_NilValue. R_NilValue is a memory
location constant within an R session, but can vary from session to session. So, it
@@ -88,11 +88,11 @@ closest I got to getting it to pass all tests :
));
R_RegisterCFinalizerEx(p, finalizer, FALSE);
UNPROTECT(2);
-
+
Then in finalizer:
SETLENGTH(names, tl)
SETLENGTH(dt, tl)
-
+
and that finalizer indeed now happens before the GC releases memory (thanks to the env wrapper).
However, we still have problem (ii) above and it didn't pass tests involving base R copies.
@@ -122,7 +122,7 @@ static int _selfrefok(SEXP x, Rboolean checkNames, Rboolean verbose) {
if (!(isNull(tag) || isString(tag))) error("Internal error: .internal.selfref tag isn't NULL or a character vector");
names = getAttrib(x, R_NamesSymbol);
if (names != tag && isString(names))
- SET_TRUELENGTH(names, LENGTH(names));
+ SET_TRUELENGTH(names, LENGTH(names));
// R copied this vector not data.table; it's not actually over-allocated. It looks over-allocated
// because R copies the original vector's tl over despite allocating length.
prot = R_ExternalPtrProtected(v);
@@ -157,7 +157,7 @@ static SEXP shallow(SEXP dt, SEXP cols, R_len_t n)
// so that the next change knows to duplicate.
// Does copyMostAttrib duplicate each attrib or does it point? It seems to point, hence DUPLICATE_ATTRIB
// for now otherwise example(merge.data.table) fails (since attr(d4,"sorted") gets written by setnames).
- names = getAttrib(dt, R_NamesSymbol);
+ names = getAttrib(dt, R_NamesSymbol);
PROTECT(newnames = allocVector(STRSXP, n));
protecti++;
if (isNull(cols)) {
@@ -166,16 +166,16 @@ static SEXP shallow(SEXP dt, SEXP cols, R_len_t n)
if (length(names)) {
if (length(names) < l) error("Internal error: length(names)>0 but <length(dt)");
for (i=0; i<l; i++) SET_STRING_ELT(newnames, i, STRING_ELT(names,i));
- }
+ }
// else an unnamed data.table is valid e.g. unname(DT) done by ggplot2, and .SD may have its names cleared in dogroups, but shallow will always create names for data.table(NULL) which has 100 slots all empty so you can add to an empty data.table by reference ok.
} else {
l = length(cols);
for (i=0; i<l; i++) SET_VECTOR_ELT(newdt, i, VECTOR_ELT(dt,INTEGER(cols)[i]-1));
if (length(names)) {
- // no need to check length(names) < l here. R-level checks if all value
- // in 'cols' are valid - in the range of 1:length(names(x))
+ // no need to check length(names) < l here. R-level checks if all value
+ // in 'cols' are valid - in the range of 1:length(names(x))
for (i=0; i<l; i++) SET_STRING_ELT( newnames, i, STRING_ELT(names,INTEGER(cols)[i]-1) );
- }
+ }
}
setAttrib(newdt, R_NamesSymbol, newnames);
// setAttrib appears to change length and truelength, so need to do that first _then_ SET next,
@@ -185,7 +185,6 @@ static SEXP shallow(SEXP dt, SEXP cols, R_len_t n)
SETLENGTH(newdt,l);
SET_TRUELENGTH(newdt,n);
setselfref(newdt);
- // SET_NAMED(dt,1); // for some reason, R seems to set NAMED=2 via setAttrib? Need NAMED to be 1 for passing to assign via a .C dance before .Call (which sets NAMED to 2), and we can't use .C with DUP=FALSE on lists.
UNPROTECT(protecti);
return(newdt);
}
@@ -201,15 +200,15 @@ SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose)
l = LENGTH(dt);
names = getAttrib(dt,R_NamesSymbol);
// names may be NULL when null.data.table() passes list() to alloccol for example.
- // So, careful to use length() on names, not LENGTH().
+ // So, careful to use length() on names, not LENGTH().
if (length(names)!=l) error("Internal error: length of names (%d) is not length of dt (%d)",length(names),l);
if (!selfrefok(dt,verbose))
- return shallow(dt,R_NilValue,(n>l) ? n : l); // e.g. test 848 and 851 in R > 3.0.2
+ return shallow(dt,R_NilValue,(n>l) ? n : l); // e.g. test 848 and 851 in R > 3.0.2
// added (n>l) ? ... for #970, see test 1481.
// TO DO: test realloc names if selfrefnamesok (users can setattr(x,"name") themselves for example.
// if (TRUELENGTH(getAttrib(dt,R_NamesSymbol))!=tl)
// error("Internal error: tl of dt passes checks, but tl of names (%d) != tl of dt (%d)", tl, TRUELENGTH(getAttrib(dt,R_NamesSymbol)));
-
+
tl = TRUELENGTH(dt);
if (tl<0) error("Internal error, tl of class is marked but tl<0."); // R <= 2.13.2 and we didn't catch uninitialized tl somehow
if (tl>0 && tl<l) error("Internal error, please report (including result of sessionInfo()) to datatable-help: tl (%d) < l (%d) but tl of class is marked.", tl, l);
@@ -222,10 +221,10 @@ SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose)
SEXP alloccolwrapper(SEXP dt, SEXP newncol, SEXP verbose) {
if (!isInteger(newncol) || length(newncol)!=1) error("n must be integer length 1. Has getOption('datatable.alloccol') somehow become unset?");
- if (!isLogical(verbose) || length(verbose)!=1) error("verbose must be TRUE or FALSE");
-
+ if (!isLogical(verbose) || length(verbose)!=1) error("verbose must be TRUE or FALSE");
+
SEXP ans = PROTECT(alloccol(dt, INTEGER(newncol)[0], LOGICAL(verbose)[0]));
-
+
for(R_len_t i = 0; i < LENGTH(ans); i++) {
// clear the same excluded by copyMostAttrib(). Primarily for data.table and as.data.table, but added here centrally (see #4890).
@@ -293,7 +292,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
if (TYPEOF(dt) != VECSXP) error("dt passed to assign isn't type VECSXP");
if (length(bindingIsLocked) && LOGICAL(bindingIsLocked)[0])
error(".SD is locked. Updating .SD by reference using := or set are reserved for future use. Use := in j directly. Or use copy(.SD) as a (slow) last resort, until shallow() is exported.");
-
+
class = getAttrib(dt, R_ClassSymbol);
if (isNull(class)) error("Input passed to assign has no class attribute. Must be a data.table or data.frame.");
// Check if there is a class "data.table" somewhere (#5115).
@@ -362,7 +361,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
if (!isDataTable) error("set() on a data.frame is for changing existing columns, not adding new ones. Please use a data.table for that. data.table's are over-allocated and don't shallow copy.");
PROTECT(newcolnames = allocVector(STRSXP, k));
protecti++;
- for (i=0; i<k; i++) {
+ for (i=0; i<k; i++) {
SET_STRING_ELT(newcolnames, i, STRING_ELT(cols, buf[i]));
INTEGER(tmp)[buf[i]] = oldncol+i+1;
}
@@ -389,10 +388,10 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
if (length(cols)>1) {
if (length(values)==0) error("Supplied %d columns to be assigned an empty list (which may be an empty data.table or data.frame since they are lists too). To delete multiple columns use NULL instead. To add multiple empty list columns, use list(list()).", length(cols));
if (length(values)>length(cols))
- warning("Supplied %d columns to be assigned a list (length %d) of values (%d unused)", length(cols), length(values), length(values)-length(cols));
+ warning("Supplied %d columns to be assigned a list (length %d) of values (%d unused)", length(cols), length(values), length(values)-length(cols));
else if (length(cols)%length(values) != 0)
warning("Supplied %d columns to be assigned a list (length %d) of values (recycled leaving remainder of %d items).",length(cols),length(values),length(cols)%length(values));
- } // else it's a list() column being assigned to one column
+ } // else it's a list() column being assigned to one column
}
}
// Check all inputs :
@@ -436,7 +435,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
error("Can't assign to column '%s' (type 'factor') a value of type '%s' (not character, factor, integer or numeric)", CHAR(STRING_ELT(names,coln)),type2char(TYPEOF(thisvalue)));
if (nrow>0 && targetlen>0) {
if (vlen>targetlen)
- warning("Supplied %d items to be assigned to %d items of column '%s' (%d unused)", vlen, targetlen,CHAR(colnam),vlen-targetlen);
+ warning("Supplied %d items to be assigned to %d items of column '%s' (%d unused)", vlen, targetlen,CHAR(colnam),vlen-targetlen);
else if (vlen>0 && targetlen%vlen != 0)
warning("Supplied %d items to be assigned to %d items of column '%s' (recycled leaving remainder of %d items).",vlen,targetlen,CHAR(colnam),targetlen%vlen);
}
@@ -446,10 +445,10 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
// out-of-memory. In that case the user will receive hard halt and know to rerun.
if (length(newcolnames)) {
oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.
-
+
if (oldtncol<oldncol) error("Internal error, please report (including result of sessionInfo()) to datatable-help: oldtncol (%d) < oldncol (%d) but tl of class is marked.", oldtncol, oldncol);
- if (oldtncol>oldncol+10000L) warning("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().",oldtncol, oldncol);
-
+ if (oldtncol>oldncol+10000L) warning("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().",oldtncol, oldncol);
+
if (oldtncol < oldncol+LENGTH(newcolnames))
error("Internal logical error. DT passed to assign has not been allocated enough column slots. l=%d, tl=%d, adding %d", oldncol, oldtncol, LENGTH(newcolnames));
if (!selfrefnamesok(dt,verbose))
@@ -477,11 +476,16 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
}
vlen = length(thisvalue);
if (length(rows)==0 && targetlen==vlen && (vlen>0 || nrow==0)) {
- if ( NAMED(thisvalue)==2 || // set() protects the NAMED of atomic vectors from .Call setting arguments to 2 by wrapping with list
+ if ( MAYBE_SHARED(thisvalue) || // set() protects the NAMED of atomic vectors from .Call setting arguments to 2 by wrapping with list
(TYPEOF(values)==VECSXP && i>LENGTH(values)-1)) { // recycled RHS would have columns pointing to others, #185.
if (verbose) {
- if (NAMED(thisvalue)==2) Rprintf("RHS for item %d has been duplicated because NAMED is %d, but then is being plonked.\n",i+1, NAMED(thisvalue));
- else Rprintf("RHS for item %d has been duplicated because the list of RHS values (length %d) is being recycled, but then is being plonked.\n", i+1, length(values));
+ if (length(values)==length(cols)) {
+ // usual branch
+ Rprintf("RHS for item %d has been duplicated because NAMED is %d, but then is being plonked.\n", i+1, NAMED(thisvalue));
+ } else {
+ // rare branch where the lhs of := is longer than the items on the rhs of :=
+ Rprintf("RHS for item %d has been duplicated because the list of RHS values (length %d) is being recycled, but then is being plonked.\n", i+1, length(values));
+ }
}
thisvalue = duplicate(thisvalue); // PROTECT not needed as assigned as element to protected list below.
} else {
@@ -574,7 +578,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
if (INTEGER(RHS)[0] != NA_INTEGER) warning("Coerced '%s' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.", type2char(TYPEOF(thisvalue)));
} else RHS = thisvalue;
for (j=0; j<length(RHS); j++) {
- if ( (INTEGER(RHS)[j]<1 || INTEGER(RHS)[j]>LENGTH(targetlevels))
+ if ( (INTEGER(RHS)[j]<1 || INTEGER(RHS)[j]>LENGTH(targetlevels))
&& INTEGER(RHS)[j] != NA_INTEGER) {
warning("RHS contains %d which is outside the levels range ([1,%d]) of column %d, NAs generated", INTEGER(RHS)[j], LENGTH(targetlevels), i+1);
INTEGER(RHS)[j] = NA_INTEGER;
@@ -595,8 +599,8 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
protecti++;
// FR #2551, added test for equality between RHS and thisvalue to not provide the warning when length(thisvalue) == 1
if ( length(thisvalue) == 1 && TYPEOF(RHS) != VECSXP && TYPEOF(thisvalue) != VECSXP && (
- (isReal(thisvalue) && isInteger(targetcol) && REAL(thisvalue)[0] == INTEGER(RHS)[0]) ||
- (isLogical(thisvalue) && LOGICAL(thisvalue)[0] == NA_LOGICAL) ||
+ (isReal(thisvalue) && isInteger(targetcol) && REAL(thisvalue)[0] == INTEGER(RHS)[0]) ||
+ (isLogical(thisvalue) && LOGICAL(thisvalue)[0] == NA_LOGICAL) ||
(isReal(RHS) && isInteger(thisvalue)) )) {
;
} else {
@@ -677,7 +681,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
protecti++;
// Can't find a visible R entry point to return ordering of cols, above is only way I could find.
// Need ordering (rather than just sorting) because the RHS corresponds in order to the LHS.
-
+
for (r=LENGTH(cols)-1; r>=0; r--) {
i = INTEGER(colorder)[r]-1;
coln = INTEGER(cols)[i]-1;
@@ -689,14 +693,14 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
// A new column being assigned NULL would have been warned above, added above, and now deleted (just easier
// to code it this way e.g. so that other columns may be added or removed ok by the same query).
size=sizeof(SEXP *);
- memmove((char *)DATAPTR(dt)+coln*size,
+ memmove((char *)DATAPTR(dt)+coln*size,
(char *)DATAPTR(dt)+(coln+1)*size,
(LENGTH(dt)-coln-1)*size);
SET_VECTOR_ELT(dt, LENGTH(dt)-1, R_NilValue);
SETLENGTH(dt, LENGTH(dt)-1);
// adding using := by group relies on NULL here to know column slot is empty.
// good to tidy up the vector anyway.
- memmove((char *)DATAPTR(names)+coln*size,
+ memmove((char *)DATAPTR(names)+coln*size,
(char *)DATAPTR(names)+(coln+1)*size,
(LENGTH(names)-coln-1)*size);
SET_STRING_ELT(names, LENGTH(names)-1, NA_STRING); // no need really, just to be tidy.
@@ -716,7 +720,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v
}
static Rboolean anyNamed(SEXP x) {
- if (NAMED(x)) return TRUE;
+ if (MAYBE_REFERENCED(x)) return TRUE;
if (isNewList(x)) for (int i=0; i<LENGTH(x); i++)
if (anyNamed(VECTOR_ELT(x,i))) return TRUE;
return FALSE;
@@ -767,7 +771,7 @@ void memrecycle(SEXP target, SEXP where, int start, int len, SEXP source)
default :
error("Unsupported type '%s'", type2char(TYPEOF(target)));
}
- if (slen == 1) {
+ if (slen == 1) {
if (size==4) for (; r<len; r++)
INTEGER(target)[start+r] = INTEGER(source)[0]; // copies pointer on 32bit, sizes checked in init.c
else for (; r<len; r++)
@@ -805,7 +809,7 @@ void memrecycle(SEXP target, SEXP where, int start, int len, SEXP source)
break;
default :
error("Unsupported type '%s'", type2char(TYPEOF(target)));
- }
+ }
if (slen == 1) {
if (size==4) for (; r<len; r++) {
w = INTEGER(where)[start+r]; if (w<1) continue;
@@ -955,26 +959,3 @@ SEXP pointWrapper(SEXP to, SEXP to_idx, SEXP from, SEXP from_idx) {
return(to);
}
-/*
-SEXP pointer(SEXP x) {
- SEXP ans;
- PROTECT(ans = allocVector(REALSXP, 1));
- REAL(ans)[0] = (double)x;
- UNPROTECT(1);
- return(ans);
-}
-
-SEXP named(SEXP x) {
- SEXP y = (SEXP)(REAL(x)[0]);
- Rprintf("%d length = %d\n",NAMED(y), LENGTH(y));
- return(R_NilValue);
-}
-
-void setnamed(double *x, int *v) { // call by .Call(,DUP=FALSE) only.
- SEXP y = (SEXP)(*x);
- Rprintf("%d length = %d\n",NAMED(y), LENGTH(y));
- SET_NAMED(y,*v);
-}
-*/
-
-
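Note on the column-deletion code in the `@@ -689,14 +693,14 @@` hunk of src/assign.c: the two `memmove` calls shift the tail of the column list and of the names vector left by one slot, after which the vacated last slot is cleared and the length reduced. A minimal sketch of that pattern on a plain pointer array follows. `drop_slot` is a hypothetical helper written for illustration and is not part of data.table; it uses no R API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): remove slot `k` from an array
   of `n` pointers by shifting the tail left one position, then clear the
   vacated last slot. Returns the new logical length. */
static size_t drop_slot(const void **slots, size_t n, size_t k)
{
    assert(slots != NULL && k < n);
    /* memmove rather than memcpy: source and destination overlap. */
    memmove(slots + k, slots + k + 1, (n - k - 1) * sizeof *slots);
    slots[n - 1] = NULL;  /* plays the role of SET_VECTOR_ELT(dt, LENGTH(dt)-1, R_NilValue) */
    return n - 1;         /* plays the role of SETLENGTH(dt, LENGTH(dt)-1) */
}
```

Clearing the last slot matters in the original code because, per its comment, adding columns by group relies on a NULL there to know the slot is empty.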
diff --git a/src/data.table.h b/src/data.table.h
index 9bf6092..435edb2 100644
--- a/src/data.table.h
+++ b/src/data.table.h
@@ -10,9 +10,10 @@
#endif
// #include <signal.h> // the debugging machinery + breakpoint aidee
// raise(SIGINT);
+#include <stdint.h> // for uint64_t rather than unsigned long long
// Fixes R-Forge #5150, and #1641
-// a simple check for R version to decide if the type should be R_len_t or
+// a simple check for R version to decide if the type should be R_len_t or
// R_xlen_t long vector support was added in R 3.0.0
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 0, 0)
typedef R_xlen_t RLEN;
@@ -31,6 +32,14 @@
#define MIN(a,b) (((a)<(b))?(a):(b))
#define NAINT64 LLONG_MIN
+// Backport macros added to R in 2017 so we don't need to update dependency from R 3.0.0
+#ifndef MAYBE_SHARED
+# define MAYBE_SHARED(x) (NAMED(x) > 1)
+#endif
+#ifndef MAYBE_REFERENCED
+# define MAYBE_REFERENCED(x) ( NAMED(x) > 0 )
+#endif
+
// init.c
void setSizes();
SEXP char_integer64;
@@ -44,7 +53,7 @@ SEXP sym_BY;
SEXP sym_starts, char_starts;
SEXP sym_maxgrpn;
Rboolean INHERITS(SEXP x, SEXP char_);
-long long I64(double x);
+int64_t I64(double x);
// dogroups.c
SEXP keepattr(SEXP to, SEXP from);
@@ -96,14 +105,14 @@ SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose);
void memrecycle(SEXP target, SEXP where, int r, int len, SEXP source);
SEXP shallowwrapper(SEXP dt, SEXP cols);
-SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols,
- SEXP xjiscols, SEXP grporder, SEXP order, SEXP starts,
- SEXP lens, SEXP jexp, SEXP env, SEXP lhs, SEXP newnames,
+SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols,
+ SEXP xjiscols, SEXP grporder, SEXP order, SEXP starts,
+ SEXP lens, SEXP jexp, SEXP env, SEXP lhs, SEXP newnames,
SEXP on, SEXP verbose);
// bmerge.c
-SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted,
- SEXP xoArg, SEXP rollarg, SEXP rollendsArg, SEXP nomatchArg,
+SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted,
+ SEXP xoArg, SEXP rollarg, SEXP rollendsArg, SEXP nomatchArg,
SEXP multArg, SEXP opArg, SEXP nqgrpArg, SEXP nqmaxgrpArg);
SEXP ENC2UTF8(SEXP s);
@@ -118,4 +127,3 @@ double iquickselect(int *x, int n, int k);
int getDTthreads();
void avoid_openmp_hang_within_fork();
-
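Note on the backported macros in src/data.table.h: with the fallback definitions, `MAYBE_SHARED(x)` is true exactly when `NAMED(x) > 1`, so it agrees with the old `NAMED(thisvalue)==2` test in assign.c for NAMED values 0, 1 and 2, and `MAYBE_REFERENCED(x)` agrees with the old `if (NAMED(x))` test in `anyNamed`. The sketch below checks this against a mock object. `mock_sexp`, the mock `NAMED` and `needs_duplicate` are assumptions made so the example compiles without R headers; in real code `NAMED` comes from R's own headers.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-in for an R object header (assumption, illustration only). */
typedef struct { int named; } mock_sexp;
#define NAMED(x) ((x)->named)

/* Same guarded fallback definitions as in the data.table.h hunk. */
#ifndef MAYBE_SHARED
# define MAYBE_SHARED(x) (NAMED(x) > 1)
#endif
#ifndef MAYBE_REFERENCED
# define MAYBE_REFERENCED(x) (NAMED(x) > 0)
#endif

/* Hypothetical helper: should a value be duplicated before it is stored
   by reference? Mirrors the first half of the condition in assign.c. */
static int needs_duplicate(const mock_sexp *value)
{
    assert(value != NULL);
    return MAYBE_SHARED(value);
}
```

The `#ifndef` guards mean the fallbacks only take effect when the R headers in use do not already provide these macros.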
diff --git a/src/dogroups.c b/src/dogroups.c
index 4098695..421b022 100644
--- a/src/dogroups.c
+++ b/src/dogroups.c
@@ -41,9 +41,9 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if(!isEnvironment(env)) error("’env’ should be an environment");
ngrp = length(starts); // the number of groups (nrow(groups) will be larger when by)
ngrpcols = length(grpcols);
- // fix for longstanding FR/bug, #495. E.g., DT[, c(sum(v1), lapply(.SD, mean)), by=grp, .SDcols=v2:v3] resulted in error.. the idea is, 1) we create .SDall, which is normally == .SD. But if extra vars are detected in jexp other than .SD, then .SD becomes a shallow copy of .SDall with only .SDcols in .SD. Since internally, we don't make a copy, changing .SDall will reflect in .SD. Hopefully this'll workout :-).
+ // fix for longstanding FR/bug, #495. E.g., DT[, c(sum(v1), lapply(.SD, mean)), by=grp, .SDcols=v2:v3] resulted in error.. the idea is, 1) we create .SDall, which is normally == .SD. But if extra vars are detected in jexp other than .SD, then .SD becomes a shallow copy of .SDall with only .SDcols in .SD. Since internally, we don't make a copy, changing .SDall will reflect in .SD. Hopefully this'll workout :-).
SDall = findVar(install(".SDall"), env);
-
+
defineVar(sym_BY, BY = allocVector(VECSXP, ngrpcols), env);
bynames = PROTECT(allocVector(STRSXP, ngrpcols)); protecti++; // TO DO: do we really need bynames, can we assign names afterwards in one step?
for (i=0; i<ngrpcols; i++) {
@@ -61,13 +61,13 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
R_LockBinding(sym_BY, env);
if (isNull(jiscols) && (length(bynames)!=length(groups) || length(bynames)!=length(grpcols))) error("!length(bynames)[%d]==length(groups)[%d]==length(grpcols)[%d]",length(bynames),length(groups),length(grpcols));
// TO DO: check this check above.
-
+
N = findVar(install(".N"), env);
GRP = findVar(install(".GRP"), env);
iSD = findVar(install(".iSD"), env); // 1-row and possibly no cols (if no i variables are used via JIS)
xSD = findVar(install(".xSD"), env);
I = findVar(install(".I"), env);
-
+
dtnames = getAttrib(dt, R_NamesSymbol); // added here to fix #4990 - `:=` did not issue recycling warning during "by"
// fetch rownames of .SD. rownames[1] is set to -thislen for each group, in case .SD is passed to
// non data.table aware package that uses rownames
@@ -76,7 +76,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if (s==R_NilValue) error("row.names attribute of .SD not found");
rownames = CAR(s);
if (!isInteger(rownames) || LENGTH(rownames)!=2 || INTEGER(rownames)[0]!=NA_INTEGER) error("row.names of .SD isn't integer length 2 with NA as first item; i.e., .set_row_names(). [%s %d %d]",type2char(TYPEOF(rownames)),LENGTH(rownames),INTEGER(rownames)[0]);
-
+
// fetch names of .SD and prepare symbols. In case they are copied-on-write by user assigning to those variables
// using <- in j (which is valid, useful and tested), they are repointed to the .SD cols for each group.
names = getAttrib(SDall, R_NamesSymbol);
@@ -89,7 +89,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
// fixes http://stackoverflow.com/questions/14753411/why-does-data-table-lose-class-definition-in-sd-after-group-by
copyMostAttrib(VECTOR_ELT(dt,INTEGER(dtcols)[i]-1), VECTOR_ELT(SDall,i)); // not names, otherwise test 778 would fail
}
-
+
origIlen = length(I); // test 762 has length(I)==1 but nrow(SD)==0
if (length(SDall)) origSDnrow = length(VECTOR_ELT(SDall, 0));
@@ -104,12 +104,12 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if (length(iSD)!=length(jiscols)) error("length(iSD)[%d] != length(jiscols)[%d]",length(iSD),length(jiscols));
if (length(xSD)!=length(xjiscols)) error("length(xSD)[%d] != length(xjiscols)[%d]",length(xSD),length(xjiscols));
-
-
+
+
PROTECT(listwrap = allocVector(VECSXP, 1));
protecti++;
Rboolean jexpIsSymbolOtherThanSD = (isSymbol(jexp) && strcmp(CHAR(PRINTNAME(jexp)),".SD")!=0); // test 559
-
+
ansloc = 0;
for(i=0; i<ngrp; i++) { // even for an empty i table, ngroup is length 1 (starts is value 0), for consistency of empty cases
@@ -118,7 +118,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
// The above is now to fix #1993, see test 1746.
// In cases where no i rows match, '|| estn>-1' ensures that the last empty group creates an empty result.
// TODO: revisit and tidy
-
+
if (!isNull(lhs) &&
(INTEGER(starts)[i] == NA_INTEGER ||
(LENGTH(order) && INTEGER(order)[ INTEGER(starts)[i]-1 ]==NA_INTEGER)))
@@ -127,12 +127,12 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
INTEGER(N)[0] = INTEGER(starts)[i] == NA_INTEGER ? 0 : grpn;
// .N is number of rows matched to ( 0 even when nomatch is NA)
INTEGER(GRP)[0] = i+1; // group counter exposed as .GRP
-
+
for (j=0; j<length(iSD); j++) { // either this or the next for() will run, not both
size = SIZEOF(VECTOR_ELT(iSD,j));
memcpy((char *)DATAPTR(VECTOR_ELT(iSD,j)), // ok use of memcpy. Loop'd through columns not rows
(char *)DATAPTR(VECTOR_ELT(groups,INTEGER(jiscols)[j]-1))+i*size,
- size);
+ size);
}
if (LOGICAL(on)[0])
igrp = (length(grporder) && isNull(jiscols)) ? INTEGER(grporder)[INTEGER(starts)[i]-1]-1 : i;
@@ -187,23 +187,25 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if (LOGICAL(verbose)[0]) tstart = clock();
if (LENGTH(order)==0) {
rownum = INTEGER(starts)[i]-1;
- for (j=0; j<length(SDall); j++) {
- size = SIZEOF(VECTOR_ELT(SDall,j));
- memcpy((char *)DATAPTR(VECTOR_ELT(SDall,j)), // direct memcpy best here, for usually large size groups. by= each row is slow and not recommended anyway, so we don't mind there's no switch here for grpn==1
- (char *)DATAPTR(VECTOR_ELT(dt,INTEGER(dtcols)[j]-1))+rownum*size,
- grpn*size);
- // SD is our own alloc'd memory, and the source (DT) is protected throughout, so no need for SET_* overhead
- }
for (j=0; j<grpn; j++) INTEGER(I)[j] = rownum+j+1;
- for (j=0; j<length(xSD); j++) {
- size = SIZEOF(VECTOR_ELT(xSD,j));
- memcpy((char *)DATAPTR(VECTOR_ELT(xSD,j)), // ok use of memcpy. Loop'd through columns not rows
- (char *)DATAPTR(VECTOR_ELT(dt,INTEGER(xjiscols)[j]-1))+rownum*size,
- size);
+ if (rownum>=0) {
+ for (j=0; j<length(SDall); j++) {
+ size = SIZEOF(VECTOR_ELT(SDall,j));
+ memcpy((char *)DATAPTR(VECTOR_ELT(SDall,j)), // direct memcpy best here, for usually large size groups. by= each row is slow and not recommended anyway, so we don't mind there's no switch here for grpn==1
+ (char *)DATAPTR(VECTOR_ELT(dt,INTEGER(dtcols)[j]-1))+rownum*size,
+ grpn*size);
+ // SD is our own alloc'd memory, and the source (DT) is protected throughout, so no need for SET_* overhead
+ }
+ for (j=0; j<length(xSD); j++) {
+ size = SIZEOF(VECTOR_ELT(xSD,j));
+ memcpy((char *)DATAPTR(VECTOR_ELT(xSD,j)), // ok use of memcpy. Loop'd through columns not rows
+ (char *)DATAPTR(VECTOR_ELT(dt,INTEGER(xjiscols)[j]-1))+rownum*size,
+ size);
+ }
}
if (LOGICAL(verbose)[0]) { tblock[0] += clock()-tstart; nblock[0]++; }
} else {
- // Fairly happy with this block. No need for SET_* here. See comment above.
+ // Fairly happy with this block. No need for SET_* here. See comment above.
for (k=0; k<grpn; k++) INTEGER(I)[k] = INTEGER(order)[ INTEGER(starts)[i]-1 + k ];
for (j=0; j<length(SDall); j++) {
size = SIZEOF(VECTOR_ELT(SDall,j));
@@ -214,7 +216,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
rownum = INTEGER(I)[k]-1;
INTEGER(target)[k] = INTEGER(source)[rownum]; // on 32bit, copies pointers too
}
- } else { // size 8
+ } else { // size 8
for (k=0; k<grpn; k++) {
rownum = INTEGER(I)[k]-1;
REAL(target)[k] = REAL(source)[rownum]; // on 64bit, copies pointers too
@@ -236,12 +238,12 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
defineVar(xknameSyms[j], VECTOR_ELT(xSD, j), env);
}
SETLENGTH(I, grpn);
-
+
if (LOGICAL(verbose)[0]) tstart = clock(); // call to clock() is more expensive than an 'if'
PROTECT(jval = eval(jexp, env));
-
+
if (LOGICAL(verbose)[0]) { tblock[2] += clock()-tstart; nblock[2]++; }
-
+
if (isNull(jval)) {
// j may be a plot or other side-effect only
UNPROTECT(1);
@@ -291,12 +293,12 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if (vlen==0) continue;
if (grpn>0 && vlen>grpn && j<LENGTH(jval)) warning("RHS %d is length %d (greater than the size (%d) of group %d). The last %d element(s) will be discarded.", j+1, vlen, grpn, i+1, vlen-grpn);
// fix for #4990 - `:=` did not issue recycling warning during "by" operation.
- if (vlen<grpn && vlen>0 && grpn%vlen != 0)
+ if (vlen<grpn && vlen>0 && grpn%vlen != 0)
warning("Supplied %d items to be assigned to group %d of size %d in column '%s' (recycled leaving remainder of %d items).",vlen,i+1,grpn,CHAR(STRING_ELT(dtnames,INTEGER(lhs)[j]-1)),grpn%vlen);
-
+
memrecycle(target, order, INTEGER(starts)[i]-1, grpn, RHS);
-
- copyMostAttrib(RHS, target); // not names, otherwise test 778 would fail.
+
+ copyMostAttrib(RHS, target); // not names, otherwise test 778 would fail.
/* OLD FIX: commented now. The fix below resulted in segfault on factor columns because I didn't set the "levels"
Instead of fixing that, I just removed setting class if it's factor. Not appropriate fix.
Correct fix of copying all attributes (except names) added above. Now, everything should be alright.
@@ -336,7 +338,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if (estn<maxn) estn=maxn; // if the result for the first group is larger than the table itself(!) Unusual case where a join is being done in j via .SD and the 1-row table is an edge case of bigger picture.
PROTECT(ans = allocVector(VECSXP, ngrpcols + njval));
protecti++;
- firstalloc=TRUE;
+ firstalloc=TRUE;
for(j=0; j<ngrpcols; j++) {
thiscol = VECTOR_ELT(groups, INTEGER(grpcols)[j]-1);
SET_VECTOR_ELT(ans, j, allocVector(TYPEOF(thiscol), estn));
@@ -358,7 +360,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
}
names = getAttrib(jval, R_NamesSymbol);
if (!isNull(names)) {
- if (LOGICAL(verbose)[0]) Rprintf("The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.\n"); // e.g. test 104 has j=transform().
+ if (LOGICAL(verbose)[0]) Rprintf("The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.\n"); // e.g. test 104 has j=transform().
// names of result come from the first group and the names of remaining groups are ignored (all that matters for them is that the number of columns (and their types) match the first group.
names2 = PROTECT(allocVector(STRSXP,ngrpcols+njval));
protecti++;
@@ -425,9 +427,9 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
}
ansloc += maxn;
if (firstalloc) {
- protecti++; // remember the first jval. If we UNPROTECTed now, we'd unprotect
+ protecti++; // remember the first jval. If we UNPROTECTed now, we'd unprotect
firstalloc = FALSE; // ans. The first jval is needed to create the right size and type of ans.
- // TO DO: could avoid this last 'if' by adding a dummy PROTECT after first alloc for this UNPROTECT(1) to do.
+ // TO DO: could avoid this last 'if' by adding a dummy PROTECT after first alloc for this UNPROTECT(1) to do.
}
else UNPROTECT(1); // the jval. Don't want them to build up. The first jval can stay protected till the end ok.
}
@@ -483,10 +485,10 @@ SEXP growVector(SEXP x, R_len_t newlen)
case VECSXP :
for (i=0; i<len; i++)
SET_VECTOR_ELT(newx, i, VECTOR_ELT(x, i));
- // TO DO: Again, is there bulk op to avoid this loop, which still respects older generations
+ // TO DO: Again, is there bulk op to avoid this loop, which still respects older generations
break;
default :
- memcpy((char *)DATAPTR(newx), (char *)DATAPTR(x), len*SIZEOF(x)); // SIZEOF() returns size_t (just as sizeof()) so * shouldn't overflow
+ memcpy((char *)DATAPTR(newx), (char *)DATAPTR(x), len*SIZEOF(x)); // SIZEOF() returns size_t (just as sizeof()) so * shouldn't overflow
}
// if (verbose) Rprintf("Growing vector from %d to %d items of type '%s'\n", len, newlen, type2char(TYPEOF(x)));
// Would print for every column if here. Now just up in dogroups (one msg for each table grow).
diff --git a/src/fread.c b/src/fread.c
index f5085c1..9ed61bd 100644
--- a/src/fread.c
+++ b/src/fread.c
@@ -49,7 +49,7 @@ as.read.table=TRUE/FALSE option. Or fread.table and fread.csv (see http://r.789
*****/
-static const char *ch, *eof;
+static const char *ch, *eof;
static char sep, eol, eol2; // sep2 TO DO
static int eolLen, line, field;
static Rboolean verbose, ERANGEwarning;
@@ -91,7 +91,7 @@ void closeFile() {
}
#else
int fd=-1;
-void closeFile() {
+void closeFile() {
if (fnam!=NULL) {
munmap((char *)mmp, filesize);
close(fd);
@@ -112,7 +112,7 @@ void STOP(const char *format, ...) {
// NA handling.
// algorithm is following
// 1) Strto*() checks whether we can convert substring into given * type
-// 2) If not, we try to iteratively char-by-char starting from begining of the substring
+// 2) If not, we try to iteratively char-by-char starting from begining of the substring
// look forward for maximum max(nchar(nastrings)) symbols:
// ********************************************************************************************
// max_na_nchar = max(nchar(nastrings))
@@ -125,7 +125,7 @@ void STOP(const char *format, ...) {
// }
// return TRUE
// ********************************************************************************************
-// for checking "if" condition we manage mask "na_mask" with length na_len = length(nastrings).
+// for checking "if" condition we manage mask "na_mask" with length na_len = length(nastrings).
// 1 on mask position i means that nastring[i] is still candidate for given substring.
// 0 means this substring can't be casted into nastring[i], so nastring[i] is not candidate.
int *NA_MASK;
@@ -225,7 +225,7 @@ static inline void Field()
}
if (ch+1==eof || *(ch+1)==sep || *(ch+1)==eol) break;
// " followed by sep|eol|eof dominates a field ending with \" (for support of Windows style paths)
-
+
if (*(ch-1)!='\\') {
if (ch+1<eof && *(ch+1)==quote[0]) { ch++; continue; } // skip doubled-quote
// unescaped subregion
@@ -304,18 +304,18 @@ static inline Rboolean Strtoll()
return(TRUE); // caller already set u.l=NA_INTEGR or u.d=NA_REAL as appropriate
}
// moved this if-statement to the top for #1314
- if(can_cast_to_na(lch)) {
+ if(can_cast_to_na(lch)) {
// ch pointer already set to end of nastring by can_cast_to_na() function
return(TRUE);
}
if (*lch=='-') { sign=0; lch++; if (*lch<'0' || *lch>'9') return(FALSE); } // + or - symbols alone should be character
else if (*lch=='+') { sign=1; lch++; if (*lch<'0' || *lch>'9') return(FALSE); }
long long acc = 0;
- while (lch<eof && '0'<=*lch && *lch<='9' && acc<(LLONG_MAX-10)/10) {
+ while (lch<eof && '0'<=*lch && *lch<='9' && acc<(LLONG_MAX-10)/10) {
acc *= 10; // overflow avoided, thanks to BDR and CRAN's new ASAN checks, 25 Feb 2014
acc += *lch-'0'; // have assumed compiler will optimize the constant expression (LLONG_MAX-10)/10
lch++; // TO DO can remove lch<eof when last row is specialized in case of no final eol
-
+
}
// take care of leading spaces
while(lch<eof && *lch!=sep && *lch==' ') lch++;
@@ -503,7 +503,7 @@ static SEXP coerceVectorSoFar(SEXP v, int oldtype, int newtype, R_len_t sofar, R
#ifdef WIN32
snprintf(buffer,128,"%" PRId64, I64(REAL(v)[i]));
#else
- snprintf(buffer,128,"%lld", I64(REAL(v)[i]));
+ snprintf(buffer,128,"%lld", (long long)I64(REAL(v)[i]));
#endif
SET_STRING_ELT(newv, i, mkChar(buffer));
}
@@ -564,22 +564,22 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (!isLogical(showProgressArg) || LENGTH(showProgressArg)!=1 || LOGICAL(showProgressArg)[0]==NA_LOGICAL)
error("Internal error: showProgress is not TRUE or FALSE. Please report.");
const Rboolean showProgress = LOGICAL(showProgressArg)[0];
-
+
if (!isString(dec) || LENGTH(dec)!=1 || strlen(CHAR(STRING_ELT(dec,0))) != 1)
error("dec must be a single character");
const char decChar = *CHAR(STRING_ELT(dec,0));
-
+
fnam = NULL; // reset global, so STOP() can call closeFile() which sees fnam
if (NA_INTEGER != INT_MIN) error("Internal error: NA_INTEGER (%d) != INT_MIN (%d).", NA_INTEGER, INT_MIN); // relied on by Stroll
if (sizeof(double) != 8) error("Internal error: sizeof(double) is %d bytes, not 8.", sizeof(double));
if (sizeof(long long) != 8) error("Internal error: sizeof(long long) is %d bytes, not 8.", sizeof(long long));
-
+
// raise(SIGINT);
// ********************************************************************************************
// Check inputs.
// ********************************************************************************************
-
+
if (!isLogical(headerarg) || LENGTH(headerarg)!=1) error("'header' must be 'auto', TRUE or FALSE"); // 'auto' was converted to NA at R level
header = LOGICAL(headerarg)[0];
if (!isNull(nastrings) && !isString(nastrings)) error("'na.strings' is type '%s'. Must be a character vector.", type2char(TYPEOF(nastrings)));
@@ -696,7 +696,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
}
clock_t tMap = clock();
// From now use STOP() wrapper instead of error(), for Windows to close file so as not to lock the file after an error.
-
+
// ********************************************************************************************
// Auto detect eol, first eol where there are two (i.e. CRLF)
// ********************************************************************************************
@@ -720,7 +720,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
} else {
if (ch+1<eof && *(ch+1)=='\r')
STOP("Line ending is \\r\\r\\n. R's download.file() appears to add the extra \\r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.");
- // NB: on Windows, download.file from file: seems to condense \r\r too. So
+ // NB: on Windows, download.file from file: seems to condense \r\r too. So
if (verbose) Rprintf("Detected eol as \\r only (no \\n or \\r afterwards). An old Mac 9 standard, discontinued in 2002 according to Wikipedia.\n");
}
} else if (eol=='\n') {
@@ -757,7 +757,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
}
pos = ch;
if (verbose) Rprintf("Positioned on line %d after skip or autostart\n", line);
-
+
while (ch<eof && isspace(*ch) && *ch!=eol) ch++;
Rboolean thisLineBlank = (ch==eof || *ch==eol);
ch = pos;
@@ -790,7 +790,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
}
}
if (pos>mmp && *(pos-1)!=eol2) STOP("Internal error. No eol2 immediately before line %d, '%.1s' instead", line, pos-1);
-
+
// ********************************************************************************************
// Auto detect separator, number of fields, and location of first row
@@ -857,7 +857,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (verbose) {
if (isNull(separg)) { if (sep=='\t') Rprintf("'\\t'\n"); else Rprintf("'%c'\n", sep); }
else Rprintf("found ok\n");
- }
+ }
}
if (verbose) {
if (sep!=eol) {
@@ -876,12 +876,12 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (i>0) {
if (line==2 && isInteger(skip) && INTEGER(skip)[0]==0) // warn about line 1, unless skip was provided
warning("Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: %.*s", ch-ch2-eolLen, ch2);
- else if (verbose)
+ else if (verbose)
Rprintf("The line before starting line %d is non-empty and will be ignored (it has too few or too many items to be column names or data): %.*s", line, ch-ch2-eolLen, ch2);
}
}
if (ch!=pos) STOP("Internal error. ch!=pos after sep detection");
-
+
// ********************************************************************************************
// Detect and assign column names (if present)
// ********************************************************************************************
@@ -896,20 +896,20 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
} else { // if field reads as double ok then it's INT/INT64/REAL; i.e., not character (and so not a column name)
if (*ch!=sep && *ch!=eol && Strtod()) // blank column names (,,) considered character and will get default names
allchar=FALSE; // considered testing at least one isalpha, but we want 1E9 to be a value not a column name
- else
+ else
while(ch<eof && *ch!=eol && *ch!=sep) ch++; // skip over unquoted character field
}
if (i<ncol-1) { // not the last column (doesn't have a separator after it)
if (ch<eof && *ch!=sep) {
if (!fill) STOP("Unexpected character ending field %d of line %d: %.*s", i+1, line, ch-pos+5, pos);
} else if (ch<eof) ch++;
- }
+ }
}
// discard any whitespace after last column name on first row before the eol (was a TODO)
skip_spaces();
if (ch<eof && *ch!=eol) STOP("Not positioned correctly after testing format of header row. ch='%c'",*ch);
if (verbose && header!=NA_LOGICAL) Rprintf("'header' changed by user from 'auto' to %s\n", header?"TRUE":"FALSE");
- char buff[10]; // to construct default column names
+ char buff[12]; // to construct default column names
if (header==FALSE || (header==NA_LOGICAL && !allchar)) {
if (verbose && header==NA_LOGICAL) Rprintf("Some fields on line %d are not type character (or are empty). Treating as a data row and using default column names.\n", line);
for (i=0; i<ncol; i++) {
@@ -938,7 +938,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
pos = ch;
}
clock_t tLayout = clock();
-
+
// ********************************************************************************************
// Count number of rows
// ********************************************************************************************
@@ -960,7 +960,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
nsep+=(*ch++==sep);
}
} else {
- // goal: don't count newlines within quotes,
+ // goal: don't count newlines within quotes,
// don't rely on 'sep' because it'll provide an underestimate
if (!skipEmptyLines) {
while (ch<eof) {
@@ -973,7 +973,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (*ch!=quote[0]) neol += (*ch==eol && *(ch-1)!=eol);
else while(ch+1<eof && *(++ch)!=quote[0]); // quits at next quote
ch++;
- }
+ }
}
}
if (ch!=eof) STOP("Internal error: ch!=eof after counting sep and eol");
@@ -999,7 +999,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
Rprintf("nrow = MIN( nsep [%lld] / (ncol [%d] -1), neol [%lld] - endblanks [%d] ) = %lld\n", nsep, ncol, neol, endblanks, tmp);
}
} else {
- if (!skipEmptyLines)
+ if (!skipEmptyLines)
Rprintf("nrow = neol [%lld] - endblanks [%d] = %lld\n", neol, endblanks, tmp);
else Rprintf("nrow = neol (after discarding blank lines) = %lld\n", tmp);
}
@@ -1014,7 +1014,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
// if (verbose) Rprintf("Estimated nrows: %d ( 1.05*%d*(%ld-(%ld-%ld))/(%ld-%ld) )\n",estn,10,filesize,pos,mmp,pos2,pos1);
}
clock_t tRowCount = clock();
-
+
// *********************************************************************************************************
// Make best guess at column types using 100 rows at 10 points, including the very first and very last row
// *********************************************************************************************************
@@ -1070,7 +1070,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
case SXP_INT:
ch2=ch;
u.l = NA_INTEGER; // see comments in the main read step lower down. Similar switch as here.
- if (Strtoll() && INT_MIN<=u.l && u.l<=INT_MAX) break;
+ if (Strtoll() && INT_MIN<=u.l && u.l<=INT_MAX) break;
type[field]++; ch=ch2;
case SXP_INT64:
if (Strtoll()) break;
@@ -1094,7 +1094,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (verbose) { Rprintf("Type codes (point %2d): ",j); for (i=0; i<ncol; i++) Rprintf("%d",type[i]); Rprintf("\n"); }
}
ch = pos;
-
+
// ********************************************************************************************
// Apply colClasses, select and integer64
// ********************************************************************************************
@@ -1195,7 +1195,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (isString(select)) {
// invalid cols check part of #1445 moved here (makes sense before reading the file)
itemsInt = PROTECT(chmatch(select, names, NA_INTEGER, FALSE));
- for (i=0; i<length(select); i++) if (INTEGER(itemsInt)[i]==NA_INTEGER)
+ for (i=0; i<length(select); i++) if (INTEGER(itemsInt)[i]==NA_INTEGER)
warning("Column name '%s' not found in column name header (case sensitive), skipping.", CHAR(STRING_ELT(select, i)));
UNPROTECT(1);
PROTECT_WITH_INDEX(itemsInt, &pi);
@@ -1215,7 +1215,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
}
if (verbose) { Rprintf("Type codes: "); for (i=0; i<ncol; i++) Rprintf("%d",type[i]); Rprintf(" (after applying drop or select (if supplied)\n"); }
clock_t tColType = clock();
-
+
// ********************************************************************************************
// Allocate columns for known nrow
// ********************************************************************************************
@@ -1240,7 +1240,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
SET_TRUELENGTH(thiscol, nrow);
}
clock_t tAlloc = clock();
-
+
// ********************************************************************************************
// Read the data
// ********************************************************************************************
@@ -1340,7 +1340,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
R_FlushConsole();
}
clock_t tRead = clock();
-
+
// Warn about any non-whitespace not read at the end
while (ch<eof && isspace(*ch)) ch++;
if (ch<eof) {
@@ -1361,7 +1361,7 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
if (verbose) Rprintf("Read %d rows. Exactly what was estimated and allocated up front\n", i);
}
for (j=0; j<ncol-numNULL; j++) SETLENGTH(VECTOR_ELT(ans,j), nrow);
-
+
// ********************************************************************************************
// Convert na.strings to NA for character columns
// ********************************************************************************************
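Note on the `char buff[10]` to `char buff[12]` change in the hunk at -896 above: the diff does not state the reason. One plausible reading, assuming the default column names take data.table's usual "V<n>" form with `n` an `int`, is that the largest possible name needs 12 bytes including the terminating NUL. The helper below is hypothetical (it is not part of fread.c) and only checks that arithmetic:

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper, not from fread.c: bytes needed to store the
 * default column name "V<n>" including the terminating NUL. */
static int default_name_bytes(int n)
{
    /* C99: snprintf with a NULL buffer and size 0 writes nothing and
     * returns the length the formatted string would have. */
    int len = snprintf(NULL, 0, "V%d", n);
    assert(len > 0);
    return len + 1;   /* +1 for the NUL terminator */
}
```

With a 4-byte `int` (which init.c checks at load time), `default_name_bytes(INT_MAX)` is 12, so a 10-byte buffer could only hold names up to "V99999999".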
diff --git a/src/fwrite.c b/src/fwrite.c
index f8a377c..241d56c 100644
--- a/src/fwrite.c
+++ b/src/fwrite.c
@@ -96,7 +96,7 @@ SEXP genLookups() {
return R_NilValue;
}
/*
- FILE *f = fopen("/tmp/fwriteLookups.h", "w");
+ FILE *f = fopen("/tmp/fwriteLookups.h", "w");
fprintf(f, "//\n\
// Generated by fwrite.c:genLookups()\n\
//\n\
@@ -158,7 +158,7 @@ static void writeNumeric(SEXP column, int i, char **thisCh)
unsigned long long fraction = u.ull & 0xFFFFFFFFFFFFF; // (1ULL<<52)-1;
int exponent = (int)((u.ull>>52) & 0x7FF); // [0,2047]
- // Now sum the appropriate powers 2^-(1:52) of the fraction
+ // Now sum the appropriate powers 2^-(1:52) of the fraction
// Important for accuracy to start with the smallest first; i.e. 2^-52
// Exact powers of 2 (1.0, 2.0, 4.0, etc) are represented precisely with fraction==0
// Skip over tailing zeros for exactly representable numbers such 0.5, 0.75
@@ -167,7 +167,7 @@ static void writeNumeric(SEXP column, int i, char **thisCh)
double acc = 0; // 'long double' not needed
int i = 52;
if (fraction) {
- while ((fraction & 0xFF) == 0) { fraction >>= 8; i-=8; }
+ while ((fraction & 0xFF) == 0) { fraction >>= 8; i-=8; }
while (fraction) {
acc += sigparts[(((fraction&1u)^1u)-1u) & i];
i--;
@@ -185,7 +185,7 @@ static void writeNumeric(SEXP column, int i, char **thisCh)
unsigned long long l = y * SIZE_SF; // low magnitude mult 10^NUM_SF
// l now contains NUM_SF+1 digits as integer where repeated /10 below is accurate
- // if (verbose) Rprintf("\nTRACE: acc=%.20Le ; y=%.20Le ; l=%llu ; e=%d ", acc, y, l, exp);
+ // if (verbose) Rprintf("\nTRACE: acc=%.20Le ; y=%.20Le ; l=%llu ; e=%d ", acc, y, l, exp);
if (l%10 >= 5) l+=10; // use the last digit to round
l /= 10;
@@ -198,7 +198,7 @@ static void writeNumeric(SEXP column, int i, char **thisCh)
while (l%10 == 0) { l /= 10; trailZero++; }
int sf = NUM_SF - trailZero;
if (sf==0) {sf=1; exp++;} // e.g. l was 9999999[5-9] rounded to 10000000 which added 1 digit
-
+
// l is now an unsigned long that doesn't start or end with 0
// sf is the number of digits now in l
// exp is e<exp> were l to be written with the decimal sep after the first digit
@@ -231,7 +231,7 @@ static void writeNumeric(SEXP column, int i, char **thisCh)
// scientific ...
ch += sf; // sf-1 + 1 for dec
for (int i=sf; i>1; i--) {
- *ch-- = '0' + l%10;
+ *ch-- = '0' + l%10;
l /= 10;
}
if (sf == 1) ch--; else *ch-- = dec;
@@ -311,7 +311,7 @@ static inline void write_time(int x, char **thisCh)
{
char *ch = *thisCh;
if (x<0) { // <0 covers NA_INTEGER too (==INT_MIN checked in init.c)
- write_chars(na, &ch);
+ write_chars(na, &ch);
} else {
int hh = x/3600;
int mm = (x - hh*3600) / 60;
@@ -363,7 +363,7 @@ static inline void write_date(int x, char **thisCh)
int z = x - y*365 - y/4 + y/100 - y/400 + 1; // days from March 1st in year y
int md = monthday[z]; // See fwriteLookups.h for how the 366 item lookup 'monthday' is arranged
y += z && (md/100)<3; // The +1 above turned z=-1 to 0 (meaning Feb29 of year y not Jan or Feb of y+1)
-
+
ch += 7 + 2*!squash;
*ch-- = '0'+md%10; md/=10;
*ch-- = '0'+md%10; md/=10;
@@ -396,7 +396,7 @@ static void writePOSIXct(SEXP column, int i, char **thisCh)
// Aside: an often overlooked option for users is to start R in UTC: $ TZ='UTC' R
// All positive integers up to 2^53 (9e15) are exactly representable by double which is relied
// on in the ops here; number of seconds since epoch.
-
+
double x = REAL(column)[i];
char *ch = *thisCh;
if (!R_FINITE(x)) {
@@ -494,8 +494,8 @@ static writer_fun_t whichWriter(SEXP column) {
if (INHERITS(column, char_Date)) return writeDateInt;
return writeInteger;
case REALSXP:
- if (INHERITS(column, char_integer64))
- return (INHERITS(column, char_nanotime) && dateTimeAs!=DATETIMEAS_EPOCH) ? writeNanotime : writeInteger;
+ if (INHERITS(column, char_nanotime) && dateTimeAs!=DATETIMEAS_EPOCH) return writeNanotime;
+ if (INHERITS(column, char_integer64))return writeInteger;
if (dateTimeAs==DATETIMEAS_EPOCH) return writeNumeric;
if (INHERITS(column, char_Date)) return writeDateReal;
if (INHERITS(column, char_POSIXct)) return writePOSIXct;
@@ -588,17 +588,17 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
const Rboolean showProgress = LOGICAL(showProgress_Arg)[0];
time_t start_time = time(NULL);
time_t next_time = start_time+2; // start printing progress meter in 2 sec if not completed by then
-
+
verbose = LOGICAL(verbose_Arg)[0];
-
+
sep = *CHAR(STRING_ELT(sep_Arg, 0)); // DO NOT DO: allow multichar separator (bad idea)
sep2start = CHAR(STRING_ELT(sep2_Arg, 0));
sep2 = *CHAR(STRING_ELT(sep2_Arg, 1));
sep2end = CHAR(STRING_ELT(sep2_Arg, 2));
-
+
const char *eol = CHAR(STRING_ELT(eol_Arg, 0));
// someone might want a trailer on every line so allow any length string as eol
-
+
na = CHAR(STRING_ELT(na_Arg, 0));
dec = *CHAR(STRING_ELT(dec_Arg,0));
quote = LOGICAL(quote_Arg)[0];
@@ -635,7 +635,7 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
UNPROTECT(1); // s, not DF
}
}
-
+
// Allocate lookup vector to writer function for each column. For simplicity and robustness via many fewer lines
// of code and less logic need. Secondly, for efficiency to save deep switch and branches later.
// Don't use a VLA as ncol could be > 1e6 columns
@@ -656,9 +656,9 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
if (verbose) Rprintf("If quote='auto', fields will be quoted if the field contains either sep ('%c') or sep2[2] ('%c') because column %d is a list column.\n", sep, sep2, firstListColumn );
if (dec==sep) error("Internal error: dec != sep was checked at R level");
if (dec==sep2 || sep==sep2)
- error("sep ('%c'), sep2[2L] ('%c') and dec ('%c') must all be different when list columns are present. Column %d is a list column.", sep, sep2, dec, firstListColumn);
+ error("sep ('%c'), sep2[2L] ('%c') and dec ('%c') must all be different when list columns are present. Column %d is a list column.", sep, sep2, dec, firstListColumn);
}
-
+
// user may want row names even when they don't exist (implied row numbers as row names)
Rboolean doRowNames = LOGICAL(row_names)[0];
SEXP rowNames = NULL;
@@ -666,7 +666,7 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
rowNames = getAttrib(DFin, R_RowNamesSymbol);
if (!isString(rowNames)) rowNames=NULL;
}
-
+
// Estimate max line length of a 1000 row sample (100 rows in 10 places).
// 'Estimate' even of this sample because quote='auto' may add quotes and escape embedded quotes.
// Buffers will be resized later if there are too many line lengths outside the sample, anyway.
@@ -710,10 +710,10 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
if (fun==NULL) error("Column %d is a list column but on row %d is type '%s' - not yet implemented. fwrite() can write list columns containing atomic vectors of type logical, integer, integer64, double, character and factor, currently.", j+1, i+1, type2char(TYPEOF(v)));
for (int k=0; k<LENGTH(v); k++) {
(*fun)(v, k, &ch); thisLineLen+=(int)(ch-tmp); ch=tmp;
- }
+ }
}
thisLineLen += LENGTH(v); // sep2 after each field. LENGTH(v) could be 0 so don't -1 (only estimate anyway)
- thisLineLen += strlen(sep2end);
+ thisLineLen += strlen(sep2end);
} else {
// regular atomic columns like integer, integer64, double, date and time
(*fun[j])(column, i, &ch);
@@ -727,12 +727,12 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
}
maxLineLen += strlen(eol);
if (verbose) Rprintf("maxLineLen=%d from sample. Found in %.3fs\n", maxLineLen, 1.0*(clock()-t0)/CLOCKS_PER_SEC);
-
+
int f;
if (*filename=='\0') {
f=-1; // file="" means write to standard output
eol = "\n"; // We'll use Rprintf(); it knows itself about \r\n on Windows
- } else {
+ } else {
#ifdef WIN32
f = _open(filename, _O_WRONLY | _O_BINARY | _O_CREAT | (LOGICAL(append)[0] ? _O_APPEND : _O_TRUNC), _S_IWRITE);
// eol must be passed from R level as '\r\n' on Windows since write() only auto-converts \n to \r\n in
@@ -745,17 +745,17 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
if( access( filename, F_OK ) != -1 )
error("%s: '%s'. Failed to open existing file for writing. Do you have write permission to it? Is this Windows and does another process such as Excel have it open?", strerror(erropen), filename);
else
- error("%s: '%s'. Unable to create new file for writing (it does not exist already). Do you have permission to write here, is there space on the disk and does the path exist?", strerror(erropen), filename);
+ error("%s: '%s'. Unable to create new file for writing (it does not exist already). Do you have permission to write here, is there space on the disk and does the path exist?", strerror(erropen), filename);
}
}
t0=clock();
-
+
if (verbose) {
Rprintf("Writing column names ... ");
if (f==-1) Rprintf("\n");
}
if (LOGICAL(col_names)[0]) {
- SEXP names = getAttrib(DFin, R_NamesSymbol);
+ SEXP names = getAttrib(DFin, R_NamesSymbol);
if (names!=R_NilValue) {
if (LENGTH(names) != ncol) error("Internal error: length of column names is not equal to the number of columns. Please report.");
// allow for quoting even when not.
@@ -775,7 +775,7 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
*ch++ = sep;
}
ch--; // backup onto the last sep after the last column
- write_chars(eol, &ch); // replace it with the newline
+ write_chars(eol, &ch); // replace it with the newline
if (f==-1) { *ch='\0'; Rprintf(buffer); }
else if (WRITE(f, buffer, (int)(ch-buffer))==-1) {
int errwrite=errno;
@@ -818,43 +818,43 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
if (f==-1) Rprintf("\n");
}
t0 = clock();
-
+
failed=0; // static global so checkBuffer can set it. -errno for malloc or realloc fails, +errno for write fail
Rboolean hasPrinted=FALSE;
Rboolean anyBufferGrown=FALSE;
int maxBuffUsedPC=0;
-
+
#pragma omp parallel num_threads(nth)
{
char *ch, *buffer; // local to each thread
ch = buffer = malloc(buffSize); // each thread has its own buffer
// Don't use any R API alloc here (e.g. R_alloc); they are
// not thread-safe as per last sentence of R-exts 6.1.1.
-
+
if (buffer==NULL) {failed=-errno;}
// Do not rely on availability of '#omp cancel' new in OpenMP v4.0 (July 2013).
// OpenMP v4.0 is in gcc 4.9+ (https://gcc.gnu.org/wiki/openmp) but
// not yet in clang as of v3.8 (http://openmp.llvm.org/)
// If not-me failed, I'll see shared 'failed', fall through loop, free my buffer
// and after parallel section, single thread will call R API error() safely.
-
+
size_t myAlloc = buffSize;
size_t myMaxLineLen = maxLineLen;
// so we can realloc(). Should only be needed if there are very long single CHARSXP
- // much longer than occurred in the sample for maxLineLen. Or for list() columns
+ // much longer than occurred in the sample for maxLineLen. Or for list() columns
// contain vectors which are much longer than occurred in the sample.
-
+
#pragma omp single
{
nth = omp_get_num_threads(); // update nth with the actual nth (might be different than requested)
}
int me = omp_get_thread_num();
-
+
#pragma omp for ordered schedule(dynamic)
for(RLEN start=0; start<nrow; start+=rowsPerBatch) {
if (failed) continue; // Not break. See comments above about #omp cancel
int end = ((nrow-start)<rowsPerBatch) ? nrow : start+rowsPerBatch;
-
+
for (RLEN i=start; i<end; i++) {
char *lineStart = ch;
if (doRowNames) {
@@ -873,7 +873,7 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
}
ch--; // backup onto the last sep after the last column. ncol>=1 because 0-columns was caught earlier.
write_chars(eol, &ch); // replace it with the newline.
-
+
// Track longest line seen so far. If we start to see longer lines than we saw in the
// sample, we'll realloc the buffer. The rowsPerBatch chosen based on the (very good) sample,
// must fit in the buffer. Can't early write and reset buffer because the
@@ -910,10 +910,10 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
// showProgress=FALSE until this can be fixed or removed.
int ETA = (int)((nrow-end)*(((double)(now-start_time))/end));
if (hasPrinted || ETA >= 2) {
- if (verbose && !hasPrinted) Rprintf("\n");
+ if (verbose && !hasPrinted) Rprintf("\n");
Rprintf("\rWritten %.1f%% of %d rows in %d secs using %d thread%s. "
"anyBufferGrown=%s; maxBuffUsed=%d%%. Finished in %d secs. ",
- (100.0*end)/nrow, nrow, (int)(now-start_time), nth, nth==1?"":"s",
+ (100.0*end)/nrow, nrow, (int)(now-start_time), nth, nth==1?"":"s",
anyBufferGrown?"yes":"no", maxBuffUsedPC, ETA);
R_FlushConsole(); // for Windows
next_time = now+1;
@@ -921,7 +921,7 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
}
}
// May be possible for master thread (me==0) to call R_CheckUserInterrupt() here.
- // Something like:
+ // Something like:
// if (me==0) {
// failed = TRUE; // inside ordered here; the slaves are before ordered and not looking at 'failed'
// R_CheckUserInterrupt();
@@ -972,4 +972,3 @@ SEXP writefile(SEXP DFin, // any list of same length vectors; e.g.
return(R_NilValue);
}
-
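Note on the `writeNumeric` context lines above (`fraction = u.ull & 0xFFFFFFFFFFFFF`, `exponent = (u.ull>>52) & 0x7FF`): these split an IEEE 754 binary64 value into its bit fields. The sketch below is a standalone illustration under that same assumption. The names `double_parts` and `split_double` are hypothetical, and it uses `memcpy` where fwrite.c uses a union:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type, not from fwrite.c. */
typedef struct {
    int      sign;      /* 0 or 1 */
    int      exponent;  /* biased exponent, in [0, 2047] */
    uint64_t fraction;  /* low 52 bits of the significand */
} double_parts;

/* Split a double into sign, biased exponent and fraction fields.
 * Assumes double is IEEE 754 binary64. */
static double_parts split_double(double x)
{
    uint64_t bits;
    double_parts p;
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &x, sizeof bits);          /* well-defined bit copy */
    p.sign     = (int)(bits >> 63);
    p.exponent = (int)((bits >> 52) & 0x7FF);
    p.fraction = bits & 0xFFFFFFFFFFFFFULL;  /* (1ULL<<52)-1 */
    return p;
}
```

Exact powers of two such as 1.0 give `fraction == 0`, matching the comment in the hunk. 0.75 is 1.5 * 2^-1, so only the top fraction bit is set, and the trailing zero bits are what the `(fraction & 0xFF) == 0` loop skips eight at a time.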
diff --git a/src/init.c b/src/init.c
index 564b3d6..abf5030 100644
--- a/src/init.c
+++ b/src/init.c
@@ -24,7 +24,7 @@ SEXP rbindlist();
SEXP vecseq();
SEXP copyattr();
SEXP setlistelt();
-SEXP setnamed();
+SEXP setmutable();
SEXP address();
SEXP copyNamedInList();
SEXP fmelt();
@@ -103,11 +103,11 @@ R_CallMethodDef callMethods[] = {
{"Cvecseq", (DL_FUNC) &vecseq, -1},
{"Ccopyattr", (DL_FUNC) &copyattr, -1},
{"Csetlistelt", (DL_FUNC) &setlistelt, -1},
-{"Csetnamed", (DL_FUNC) &setnamed, -1},
+{"Csetmutable", (DL_FUNC) &setmutable, -1},
{"Caddress", (DL_FUNC) &address, -1},
{"CcopyNamedInList", (DL_FUNC) &copyNamedInList, -1},
-{"Cfmelt", (DL_FUNC) &fmelt, -1},
-{"Cfcast", (DL_FUNC) &fcast, -1},
+{"Cfmelt", (DL_FUNC) &fmelt, -1},
+{"Cfcast", (DL_FUNC) &fcast, -1},
{"Cuniqlist", (DL_FUNC) &uniqlist, -1},
{"Cuniqlengths", (DL_FUNC) &uniqlengths, -1},
{"Csetrev", (DL_FUNC) &setrev, -1},
@@ -173,14 +173,19 @@ void attribute_visible R_init_datatable(DllInfo *info)
R_useDynamicSymbols(info, FALSE);
setSizes();
const char *msg = "... failed. Please forward this message to maintainer('data.table').";
- if (NA_INTEGER != INT_MIN) error("Checking NA_INTEGER [%d] == INT_MIN [%d] %s", NA_INTEGER, INT_MIN, msg);
- if (NA_INTEGER != NA_LOGICAL) error("Checking NA_INTEGER [%d] == NA_LOGICAL [%d] %s", NA_INTEGER, NA_LOGICAL, msg);
+ if ((int)NA_INTEGER != (int)INT_MIN) error("Checking NA_INTEGER [%d] == INT_MIN [%d] %s", NA_INTEGER, INT_MIN, msg);
+ if ((int)NA_INTEGER != (int)NA_LOGICAL) error("Checking NA_INTEGER [%d] == NA_LOGICAL [%d] %s", NA_INTEGER, NA_LOGICAL, msg);
if (sizeof(int) != 4) error("Checking sizeof(int) [%d] is 4 %s", sizeof(int), msg);
- if (sizeof(double) != 8) error("Checking sizeof(double) [%d] is 8 %s", sizeof(double), msg); // 8 on both 32bit and 64bit.
+ if (sizeof(double) != 8) error("Checking sizeof(double) [%d] is 8 %s", sizeof(double), msg); // 8 on both 32bit and 64bit
+ // alignof not available in C99: if (alignof(double) != 8) error("Checking alignof(double) [%d] is 8 %s", alignof(double), msg); // 8 on both 32bit and 64bit
if (sizeof(long long) != 8) error("Checking sizeof(long long) [%d] is 8 %s", sizeof(long long), msg);
if (sizeof(char *) != 4 && sizeof(char *) != 8) error("Checking sizeof(pointer) [%d] is 4 or 8 %s", sizeof(char *), msg);
if (sizeof(SEXP) != sizeof(char *)) error("Checking sizeof(SEXP) [%d] == sizeof(pointer) [%d] %s", sizeof(SEXP), sizeof(char *), msg);
-
+ if (sizeof(uint64_t) != 8) error("Checking sizeof(uint64_t) [%d] is 8 %s", sizeof(uint64_t), msg);
+ if (sizeof(int64_t) != 8) error("Checking sizeof(int64_t) [%d] is 8 %s", sizeof(int64_t), msg);
+ if (sizeof(signed char) != 1) error("Checking sizeof(signed char) [%d] is 1 %s", sizeof(signed char), msg);
+ if (sizeof(int8_t) != 1) error("Checking sizeof(int8_t) [%d] is 1 %s", sizeof(int8_t), msg);
+
SEXP tmp = PROTECT(allocVector(INTSXP,2));
if (LENGTH(tmp)!=2) error("Checking LENGTH(allocVector(INTSXP,2)) [%d] is 2 %s", LENGTH(tmp), msg);
if (TRUELENGTH(tmp)!=0) error("Checking TRUELENGTH(allocVector(INTSXP,2)) [%d] is 0 %s", TRUELENGTH(tmp), msg);
@@ -200,10 +205,10 @@ void attribute_visible R_init_datatable(DllInfo *info)
long double ld = 3.14;
memset(&ld, 0, sizeof(long double));
if (ld != 0.0) error("Checking memset(&ld, 0, sizeof(long double)); ld == (long double)0.0 %s", msg);
-
+
setNumericRounding(PROTECT(ScalarInteger(0))); // #1642, #1728, #1463, #485
UNPROTECT(1);
-
+
// create needed strings in advance for speed, same techique as R_*Symbol
// Following R-exts 5.9.4; paragraph and example starting "Using install ..."
// either use PRINTNAME(install()) or R_PreserveObject(mkChar()) here.
@@ -218,18 +223,18 @@ void attribute_visible R_init_datatable(DllInfo *info)
error("PRINTNAME(install(\"integer64\")) has returned %s not %s",
type2char(TYPEOF(char_integer64)), type2char(CHARSXP));
}
-
+
// create commonly used symbols, same as R_*Symbol but internal to DT
// Not really for speed but to avoid leak in situations like setAttrib(DT, install(), allocVector()) where
// the allocVector() can happen first and then the install() could gc and free it before it is protected
// within setAttrib. Thanks to Bill Dunlap finding and reporting. Using these symbols instead of install()
// avoids the gc without needing an extra PROTECT and immediate UNPROTECT after the setAttrib which would
// look odd (and devs in future might be tempted to remove them). Avoiding passing install() to API calls
- // keeps the code neat and readable. Also see grep's added to CRAN_Release.cmd to find such calls.
+ // keeps the code neat and readable. Also see grep's added to CRAN_Release.cmd to find such calls.
sym_sorted = install("sorted");
sym_BY = install(".BY");
sym_maxgrpn = install("maxgrpn");
-
+
avoid_openmp_hang_within_fork();
}
@@ -251,13 +256,13 @@ inline Rboolean INHERITS(SEXP x, SEXP char_) {
return FALSE;
}
-inline long long I64(double x) {
+inline int64_t I64(double x) {
// type punning such as *(long long *)&REAL(column)[i] is undefined and I think was the
// cause of 1.10.2 failing on 31 Jan 2017 under clang 3.9.1 -O3 and solaris-sparc but
- // not solaris-x86 or gcc. There is now a grep in CRAN_Release.cmd; use this union method instead.
- union {double d; long long ll;} u;
+ // not solaris-x86 or gcc. There is now a grep in CRAN_Release.cmd; use this union method instead.
+ union {double d; int64_t i64;} u; // not static, inline instead
u.d = x;
- return u.ll;
+ return u.i64;
}
SEXP hasOpenMP() {
@@ -272,4 +277,3 @@ SEXP hasOpenMP() {
#endif
}
-
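Note on the I64() change in src/init.c above: the hunk swaps `long long` for `int64_t` but keeps the union approach to reading a double's bit pattern. The standalone sketch below restates that idea next to a `memcpy`-based alternative that is also well-defined. The function names `bits_via_union` and `bits_via_memcpy` are hypothetical and do not appear in data.table, and the sketch assumes an IEEE 754 binary64 `double`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Union-based reinterpretation, same shape as I64() in the patch:
   write the double member, then read the 64-bit integer member. */
int64_t bits_via_union(double x) {
    union { double d; int64_t i64; } u;
    u.d = x;
    return u.i64;
}

/* Alternative: copy the object representation byte for byte.
   Compilers typically optimise this to a single register move. */
int64_t bits_via_memcpy(double x) {
    int64_t out;
    assert(sizeof out == sizeof x);  /* both must be 8 bytes wide */
    memcpy(&out, &x, sizeof out);
    return out;
}
```

Either form avoids the pointer cast `*(long long *)&x` that the source comment identifies as undefined behaviour. Both functions should return the same value for any input.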
diff --git a/src/openmp-utils.c b/src/openmp-utils.c
index ca4871e..d812dca 100644
--- a/src/openmp-utils.c
+++ b/src/openmp-utils.c
@@ -9,7 +9,7 @@
* 3) And not if user doesn't want to:
* i) Respect env variable OMP_NUM_THREADS (which just calls (ii) on startup)
* ii) Respect omp_set_num_threads()
-* iii) Provide way to restrict data.table only independently of base R and
+* iii) Provide way to restrict data.table only independently of base R and
* other packages using openMP
* 4) Avoid user needing to remember to unset this control after their use of data.table
* 5) Automatically drop down to 1 thread when called from parallel package (e.g. mclapply) to
@@ -22,7 +22,7 @@ static int DTthreads = 0;
int getDTthreads() {
#ifdef _OPENMP
- return DTthreads == 0 ? omp_get_max_threads() : DTthreads;
+ return DTthreads == 0 ? omp_get_max_threads() : DTthreads;
#else
return 1;
#endif
@@ -38,7 +38,7 @@ SEXP setDTthreads(SEXP threads) {
error("Argument to setDTthreads must be a single integer >= 0. \
Default 0 is recommended to use all CPU.");
}
- // do not call omp_set_num_threads() here as that affects other openMP
+ // do not call omp_set_num_threads() here as that affects other openMP
// packages and base R as well potentially.
int old = DTthreads;
DTthreads = INTEGER(threads)[0];
@@ -46,19 +46,16 @@ SEXP setDTthreads(SEXP threads) {
}
// auto avoid deadlock when data.table called from parallel::mclapply
-static int preFork_DTthreads = 0;
void when_fork() {
- preFork_DTthreads = DTthreads;
+#ifdef _OPENMP
+ omp_set_num_threads(1);
+#endif
DTthreads = 1;
}
-void when_fork_end() {
- DTthreads = preFork_DTthreads;
-}
void avoid_openmp_hang_within_fork() {
// Called once on loading data.table from init.c
#ifdef _OPENMP
- pthread_atfork(&when_fork, &when_fork_end, NULL);
+ pthread_atfork(&when_fork, NULL, NULL);
#endif
}
-
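Note on the src/openmp-utils.c change above: `when_fork` is registered as the first (prepare) argument of `pthread_atfork()`, so it runs in the parent just before `fork()` and the child inherits the single-thread setting. The parent-side restore handler is no longer registered, so the parent keeps that setting too. The sketch below shows the mechanism in isolation, assuming a POSIX system. `pkg_threads`, `install_fork_guard` and `threads_seen_by_child` are hypothetical names, and the OpenMP call from the patch is left out.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for DTthreads; 0 means "use all available threads". */
static int pkg_threads = 0;

int get_pkg_threads(void) { return pkg_threads; }

/* Prepare handler: runs in the parent immediately before fork(). */
static void before_fork(void) { pkg_threads = 1; }

/* Register the prepare handler only; no parent or child handlers. */
int install_fork_guard(void) {
    return pthread_atfork(before_fork, NULL, NULL);
}

/* Fork once and return the thread limit the child observed,
   passed back through the child's exit status. Returns -1 on failure. */
int threads_seen_by_child(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(get_pkg_threads());
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After `install_fork_guard()` succeeds, a call to `threads_seen_by_child()` should report 1, and `get_pkg_threads()` in the parent should also be 1 afterwards, since nothing restores the earlier value.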
diff --git a/src/rbindlist.c b/src/rbindlist.c
index cc08300..fcd4692 100644
--- a/src/rbindlist.c
+++ b/src/rbindlist.c
@@ -269,7 +269,7 @@ SEXP combineFactorLevels(SEXP factorLevels, int * factorType, Rboolean * isRowOr
while (h[idx] != NULL) {
pl = h[idx];
if (data.equal(VECTOR_ELT(factorLevels, pl->i), pl->j, elem, j)) {
- // Fixes #899. "rest" can have identical levels in
+ // Fixes #899. "rest" can have identical levels in
// more than 1 data.table.
if (!(pl->i == i && pl->j == j)) break;
record = TRUE;
@@ -352,7 +352,7 @@ struct preprocessData {
};
static SEXP unlist2(SEXP v) {
-
+
RLEN i, j, k=0, ni, n=0;
SEXP ans, vi, lnames, groups, runids;
@@ -379,7 +379,7 @@ static SEXP unlist2(SEXP v) {
}
// Don't use elsewhere. No checks are made on byArg and handleSorted
-// if handleSorted is 0, then it'll return integer(0) as such when
+// if handleSorted is 0, then it'll return integer(0) as such when
// input is already sorted, like forder. if not, seq_len(nrow(dt)).
static SEXP fast_order(SEXP dt, R_len_t byArg, R_len_t handleSorted) {
@@ -401,7 +401,7 @@ static SEXP fast_order(SEXP dt, R_len_t byArg, R_len_t handleSorted) {
} else {
order = PROTECT(allocVector(INTSXP, 1)); INTEGER(order)[0] = 1;
UNPROTECT(4);
- }
+ }
ans = PROTECT(forder(dt, by, retGrp, sortStr, order, na)); protecti++;
if (!length(ans) && handleSorted != 0) {
starts = getAttrib(ans, sym_starts);
@@ -416,7 +416,7 @@ static SEXP fast_order(SEXP dt, R_len_t byArg, R_len_t handleSorted) {
}
static SEXP uniq_lengths(SEXP v, R_len_t n) {
-
+
R_len_t i, nv=length(v);
SEXP ans = PROTECT(allocVector(INTSXP, nv));
for (i=1; i<nv; i++) {
@@ -429,22 +429,22 @@ static SEXP uniq_lengths(SEXP v, R_len_t n) {
}
static SEXP match_names(SEXP v) {
-
+
R_len_t i, j, idx, ncols, protecti=0;
SEXP ans, dt, lnames, ti;
SEXP uorder, starts, ulens, index, firstofeachgroup, origorder;
SEXP fnames, findices, runid, grpid;
-
+
ans = PROTECT(allocVector(VECSXP, 2));
dt = PROTECT(unlist2(v)); protecti++;
lnames = VECTOR_ELT(dt, 0);
grpid = PROTECT(duplicate(VECTOR_ELT(dt, 1))); protecti++; // dt[1] will be reused, so backup
runid = VECTOR_ELT(dt, 2);
-
+
uorder = PROTECT(fast_order(dt, 2, 1)); protecti++; // byArg alone is set, everything else is set inside fast_order
starts = getAttrib(uorder, sym_starts);
ulens = PROTECT(uniq_lengths(starts, length(lnames))); protecti++;
-
+
// seq_len(.N) for each group
index = PROTECT(VECTOR_ELT(dt, 1)); protecti++; // reuse dt[1] (in 0-index coordinate), value already backed up above.
for (i=0; i<length(ulens); i++) {
@@ -454,7 +454,7 @@ static SEXP match_names(SEXP v) {
// order again
uorder = PROTECT(fast_order(dt, 2, 1)); protecti++; // byArg alone is set, everything else is set inside fast_order
starts = getAttrib(uorder, sym_starts);
- ulens = PROTECT(uniq_lengths(starts, length(lnames))); protecti++;
+ ulens = PROTECT(uniq_lengths(starts, length(lnames))); protecti++;
ncols = length(starts);
// check if order has to be changed (bysameorder = FALSE here by default - in `[.data.table` parlance)
firstofeachgroup = PROTECT(allocVector(INTSXP, length(starts)));
@@ -488,18 +488,18 @@ static SEXP match_names(SEXP v) {
}
static void preprocess(SEXP l, Rboolean usenames, Rboolean fill, struct preprocessData *data) {
-
+
R_len_t i, j, idx;
SEXP li, lnames=R_NilValue, fnames, findices=R_NilValue, f_ind=R_NilValue, thiscol, col_name=R_NilValue, thisClass = R_NilValue;
SEXPTYPE type;
-
+
data->first = -1; data->lcount = 0; data->n_rows = 0; data->n_cols = 0; data->protecti = 0;
data->max_type = NULL; data->is_factor = NULL; data->ans_ptr = R_NilValue; data->mincol=0;
data->fn_rows = (int *)R_alloc(LENGTH(l), sizeof(int));
data->colname = R_NilValue;
// get first non null name, 'rbind' was doing a 'match.names' for each item.. which is a bit more time consuming.
- // And warning that it'll be matched by names is not necessary, I think, as that's the default for 'rbind'. We
+ // And warning that it'll be matched by names is not necessary, I think, as that's the default for 'rbind'. We
// should instead document it.
for (i=0; i<LENGTH(l); i++) { // isNull is checked already in rbindlist
data->fn_rows[i] = 0; // careful to initialize before continues as R_alloc above doesn't initialize
@@ -557,7 +557,7 @@ static void preprocess(SEXP l, Rboolean usenames, Rboolean fill, struct preproce
error("Answer requires %d columns whereas one or more item(s) in the input list has only %d columns. This could be because the items in the list may not all have identical column names or some of the items may have duplicate names. In either case, if you're aware of this and would like to fill those missing columns, set the argument 'fill=TRUE'.", length(fnames), data->mincol);
} else data->n_cols = length(fnames);
}
-
+
// decide type of each column
// initialize the max types - will possibly increment later
data->max_type = (SEXPTYPE *)R_alloc(data->n_cols, sizeof(SEXPTYPE));
@@ -581,7 +581,7 @@ static void preprocess(SEXP l, Rboolean usenames, Rboolean fill, struct preproce
data->max_type[i] = STRSXP;
} else {
// Fix for #705, check attributes and error if non-factor class and not identical
- if (!data->is_factor[i] &&
+ if (!data->is_factor[i] &&
!R_compute_identical(thisClass, getAttrib(thiscol, R_ClassSymbol), 0) && !fill) {
error("Class attributes at column %d of input list at position %d does not match with column %d of input list at position %d. Coercion of objects of class 'factor' alone is handled internally by rbind/rbindlist at the moment.", i+1, j+1, i+1, data->first+1);
}
@@ -605,9 +605,9 @@ SEXP add_idcol(SEXP nm, SEXP idcol, int cols) {
}
SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
-
+
R_len_t jj, ansloc, resi, i,j,r, idx, thislen;
- struct preprocessData data;
+ struct preprocessData data;
Rboolean usenames, fill, to_copy = FALSE, coerced=FALSE, isidcol = !isNull(idcol);
SEXP fnames = R_NilValue, findices = R_NilValue, f_ind = R_NilValue, ans, lf, li, target, thiscol, levels;
SEXP factorLevels = R_NilValue, finalFactorLevels;
@@ -620,10 +620,10 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
error("fill should be TRUE or FALSE");
if (!length(l)) return(l);
if (TYPEOF(l) != VECSXP) error("Input to rbindlist must be a list of data.tables");
-
+
usenames = LOGICAL(sexp_usenames)[0];
fill = LOGICAL(sexp_fill)[0];
- if (fill && !usenames) {
+ if (fill && !usenames) {
// override default
warning("Resetting 'use.names' to TRUE. 'use.names' can not be FALSE when 'fill=TRUE'.\n");
usenames=TRUE;
@@ -632,7 +632,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
// check for factor, get max types, and when usenames=TRUE get the answer 'names' and column indices for proper reordering.
preprocess(l, usenames, fill, &data);
fnames = VECTOR_ELT(data.ans_ptr, 0);
- findices = VECTOR_ELT(data.ans_ptr, 1);
+ if (usenames) findices = VECTOR_ELT(data.ans_ptr, 1);
protecti = data.protecti; // TODO very ugly and doesn't seem right. Assign items to list instead, perhaps.
if (data.n_rows == 0 && data.n_cols == 0) {
UNPROTECT(protecti);
@@ -645,7 +645,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
factorLevels = PROTECT(allocVector(VECSXP, data.lcount));
Rboolean *isRowOrdered = (Rboolean *)R_alloc(data.lcount, sizeof(Rboolean));
for (int i=0; i<data.lcount; i++) isRowOrdered[i] = FALSE;
-
+
ans = PROTECT(allocVector(VECSXP, data.n_cols+isidcol)); protecti++;
setAttrib(ans, R_NamesSymbol, fnames);
lf = VECTOR_ELT(l, data.first);
@@ -653,7 +653,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
if (fill) target = allocNAVector(data.max_type[j], data.n_rows);
else target = allocVector(data.max_type[j], data.n_rows);
SET_VECTOR_ELT(ans, j+isidcol, target);
-
+
if (usenames) {
to_copy = TRUE;
f_ind = VECTOR_ELT(findices, j);
@@ -663,7 +663,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
}
ansloc = 0;
jj = 0; // to increment factorLevels
- resi = -1;
+ resi = -1;
for (i=data.first; i<LENGTH(l); i++) {
li = VECTOR_ELT(l,i);
if (!length(li)) continue; // majority of time though, each item of l is populated
@@ -703,7 +703,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
SET_STRING_ELT(target, ansloc+r, NA_STRING);
else
SET_STRING_ELT(target, ansloc+r, STRING_ELT(levels,INTEGER(thiscol)[r]-1));
-
+
// add levels to factorLevels
// changed "i" to "jj" and increment 'jj' after so as to fill only non-empty tables with levels
SET_VECTOR_ELT(factorLevels, jj, levels); jj++;
@@ -711,7 +711,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
} else {
if (TYPEOF(thiscol) != STRSXP) error("Internal logical error in rbindlist.c (not STRSXP), please report to datatable-help.");
for (r=0; r<thislen; r++) SET_STRING_ELT(target, ansloc+r, STRING_ELT(thiscol,r));
-
+
// if this column is going to be a factor, add column to factorLevels
// changed "i" to "jj" and increment 'jj' after so as to fill only non-empty tables with levels
if (data.is_factor[j]) {
@@ -741,7 +741,7 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
thislen * SIZEOF(thiscol));
break;
default :
- error("Unsupported column type '%s'", type2char(TYPEOF(target)));
+ error("Unsupported column type '%s'", type2char(TYPEOF(target)));
}
ansloc += thislen;
if (coerced) {
@@ -785,9 +785,9 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
return(ans);
}
-/*
+/*
## The section below implements "chmatch2_old" and "chmatch2" (faster version of chmatch2_old).
-## It's basically 'pmatch' but without the partial matching part. These examples should
+## It's basically 'pmatch' but without the partial matching part. These examples should
## make it clearer.
## Examples:
## chmatch2_old(c("a", "a"), c("a", "a")) # 1,2 - the second 'a' in 'x' has a 2nd match in 'table'
@@ -798,13 +798,13 @@ SEXP rbindlist(SEXP l, SEXP sexp_usenames, SEXP sexp_fill, SEXP idcol) {
## dt = data.table(val=c(x,y), grp1 = rep(1:2, c(length(x),length(y))), grp2=c(1:length(x), 1:length(y)))
## dt[, grp1 := 0:(.N-1), by="val,grp1"]
## dt[, grp2[2], by="val,grp1"]
-##
-## NOTE: This is FAST, but not AS FAST AS it could be. See chmatch2 for a faster implementation (and bottom
-## of this file for a benchmark). I've retained here for now. Ultimately, will've to discuss with Matt and
+##
+## NOTE: This is FAST, but not AS FAST AS it could be. See chmatch2 for a faster implementation (and bottom
+## of this file for a benchmark). I've retained here for now. Ultimately, will've to discuss with Matt and
## probably export it??
*/
SEXP chmatch2_old(SEXP x, SEXP table, SEXP nomatch) {
-
+
R_len_t i, j, k, nx, li, si, oi;
SEXP dt, l, ans, order, start, lens, grpid, index;
if (TYPEOF(nomatch) != INTSXP || length(nomatch) != 1) error("'nomatch' must be an integer of length 1");
@@ -822,7 +822,7 @@ SEXP chmatch2_old(SEXP x, SEXP table, SEXP nomatch) {
l = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(l, 0, x);
SET_VECTOR_ELT(l, 1, table);
-
+
UNPROTECT(1); // l
dt = PROTECT(unlist2(l));
@@ -832,7 +832,7 @@ SEXP chmatch2_old(SEXP x, SEXP table, SEXP nomatch) {
lens = PROTECT(uniq_lengths(start, length(order))); // length(order) = nrow(dt)
grpid = VECTOR_ELT(dt, 1);
index = VECTOR_ELT(dt, 2);
-
+
// replace dt[1], we don't need it anymore
k=0;
for (i=0; i<length(lens); i++) {
@@ -843,10 +843,10 @@ SEXP chmatch2_old(SEXP x, SEXP table, SEXP nomatch) {
}
// order - again
UNPROTECT(2); // order, lens
- order = PROTECT(fast_order(dt, 2, 1));
+ order = PROTECT(fast_order(dt, 2, 1));
start = getAttrib(order, sym_starts);
lens = PROTECT(uniq_lengths(start, length(order)));
-
+
ans = PROTECT(allocVector(INTSXP, nx));
k = 0;
for (i=0; i<length(lens); i++) {
@@ -862,16 +862,16 @@ SEXP chmatch2_old(SEXP x, SEXP table, SEXP nomatch) {
// utility function used from within chmatch2
static SEXP listlist(SEXP x) {
-
+
R_len_t i,j,k, nl;
SEXP lx, xo, xs, xl, tmp, ans, ans0, ans1;
-
+
lx = PROTECT(allocVector(VECSXP, 1));
SET_VECTOR_ELT(lx, 0, x);
xo = PROTECT(fast_order(lx, 1, 1));
xs = getAttrib(xo, sym_starts);
xl = PROTECT(uniq_lengths(xs, length(x)));
-
+
ans0 = PROTECT(allocVector(STRSXP, length(xs)));
ans1 = PROTECT(allocVector(VECSXP, length(xs)));
k=0;
@@ -893,11 +893,11 @@ static SEXP listlist(SEXP x) {
}
/*
-## While chmatch2_old works great, I find it inefficient in terms of both memory (stores 2 indices over the
-## length of x+y) and speed (2 ordering and looping over unnecesssary amount of times). So, here's
-## another stab at a faster version of 'chmatch2_old', leveraging the power of 'chmatch' and data.table's
+## While chmatch2_old works great, I find it inefficient in terms of both memory (stores 2 indices over the
+## length of x+y) and speed (2 ordering and looping over unnecesssary amount of times). So, here's
+## another stab at a faster version of 'chmatch2_old', leveraging the power of 'chmatch' and data.table's
## DT[ , list(list()), by=.] syntax.
-##
+##
## The algorithm:
## x.agg = data.table(x)[, list(list(rep(x, .N))), by=x]
## y.agg = data.table(y)[, list(list(rep(y, .N))), by=y]
@@ -923,18 +923,18 @@ SEXP chmatch2(SEXP x, SEXP y, SEXP nomatch) {
// Done with special cases. On to the real deal.
xll = PROTECT(listlist(x));
yll = PROTECT(listlist(y));
-
+
xu = VECTOR_ELT(xll, 0);
yu = VECTOR_ELT(yll, 0);
-
+
mx = PROTECT(chmatch(xu, yu, 0, FALSE));
ans = PROTECT(allocVector(INTSXP, nx));
k=0;
for (i=0; i<length(mx); i++) {
xl = VECTOR_ELT(VECTOR_ELT(xll, 1), i);
- ix = length(xl);
+ ix = length(xl);
if (INTEGER(mx)[i] == 0) {
- for (j=0; j<ix; j++)
+ for (j=0; j<ix; j++)
INTEGER(ans)[INTEGER(xl)[j]-1] = INTEGER(nomatch)[0];
} else {
yl = VECTOR_ELT(VECTOR_ELT(yll, 1), INTEGER(mx)[i]-1);
@@ -946,7 +946,7 @@ SEXP chmatch2(SEXP x, SEXP y, SEXP nomatch) {
}
UNPROTECT(4);
return(ans);
-
+
}
/*
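Note on chmatch2 in src/rbindlist.c above: the comment block describes it as 'pmatch' without the partial matching, where each entry of `table` can satisfy at most one entry of `x`. The sketch below is a naive quadratic reference for that matching rule only. It is not the grouping-based algorithm data.table uses, and `match_each_once` is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* For each x[i], store the 1-based index of the first not-yet-used exact
   match in table, or nomatch if none remains. ans must hold nx ints.
   Returns 0 on success, -1 if the bookkeeping array cannot be allocated. */
int match_each_once(const char *const *x, int nx,
                    const char *const *table, int nt,
                    int nomatch, int *ans) {
    char *used = calloc(nt > 0 ? (size_t)nt : 1, 1);  /* used[j] != 0 once table[j] is taken */
    if (used == NULL) return -1;
    for (int i = 0; i < nx; i++) {
        ans[i] = nomatch;
        for (int j = 0; j < nt; j++) {
            if (!used[j] && strcmp(x[i], table[j]) == 0) {
                used[j] = 1;      /* consume this table entry */
                ans[i] = j + 1;   /* R-style 1-based position */
                break;
            }
        }
    }
    free(used);
    return 0;
}
```

With `x = {"a", "a"}` and `table = {"a", "a"}` this fills `ans` with 1 and 2, matching the first example in the source comment.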
diff --git a/src/uniqlist.c b/src/uniqlist.c
index b714ead..7a21425 100644
--- a/src/uniqlist.c
+++ b/src/uniqlist.c
@@ -17,8 +17,6 @@ SEXP uniqlist(SEXP l, SEXP order)
int *iidx = Calloc(isize, int); // for 'idx'
int *n_iidx; // to catch allocation errors using Realloc!
- if (NA_INTEGER != NA_LOGICAL || sizeof(NA_INTEGER)!=sizeof(NA_LOGICAL))
- error("Have assumed NA_INTEGER == NA_LOGICAL (currently R_NaInt). If R changes this in future (seems unlikely), an extra case is required; a simple change.");
ncol = length(l);
nrow = length(VECTOR_ELT(l,0));
len = 1;
diff --git a/src/wrappers.c b/src/wrappers.c
index 9f483ae..6d151c9 100644
--- a/src/wrappers.c
+++ b/src/wrappers.c
@@ -9,24 +9,24 @@
SEXP setattrib(SEXP x, SEXP name, SEXP value)
{
if (TYPEOF(name) != STRSXP) error("Attribute name must be of type character");
- if ( !isNewList(x) &&
- strcmp(CHAR(STRING_ELT(name, 0)), "class") == 0 &&
- isString(value) && (strcmp(CHAR(STRING_ELT(value, 0)), "data.table") == 0 ||
+ if ( !isNewList(x) &&
+ strcmp(CHAR(STRING_ELT(name, 0)), "class") == 0 &&
+ isString(value) && (strcmp(CHAR(STRING_ELT(value, 0)), "data.table") == 0 ||
strcmp(CHAR(STRING_ELT(value, 0)), "data.frame") == 0) )
error("Internal structure doesn't seem to be a list. Can't set class to be 'data.table' or 'data.frame'. Use 'as.data.table()' or 'as.data.frame()' methods instead.");
if (isLogical(x) && x == ScalarLogical(TRUE)) { // ok not to protect this ScalarLogical() as not assigned or passed
x = PROTECT(duplicate(x));
- setAttrib(x, name, NAMED(value) ? duplicate(value) : value);
+ setAttrib(x, name, MAYBE_REFERENCED(value) ? duplicate(value) : value);
UNPROTECT(1);
return(x);
}
setAttrib(x, name,
- NAMED(value) ? duplicate(value) : value);
+ MAYBE_REFERENCED(value) ? duplicate(value) : value);
// duplicate is temp fix to restore R behaviour prior to R-devel change on 10 Jan 2014 (r64724).
// TO DO: revisit. Enough to reproduce is: DT=data.table(a=1:3); DT[2]; DT[,b:=2]
// ... Error: selfrefnames is ok but tl names [1] != tl [100]
return(R_NilValue);
-}
+}
// fix for #1142 - duplicated levels for factors
SEXP setlevels(SEXP x, SEXP levels, SEXP ulevels) {
@@ -67,10 +67,11 @@ SEXP setlistelt(SEXP l, SEXP i, SEXP value)
return(R_NilValue);
}
-SEXP setnamed(SEXP x, SEXP value)
+SEXP setmutable(SEXP x)
{
- if (!isInteger(value) || LENGTH(value)!=1) error("Second argument to setnamed must a length 1 integer vector");
- SET_NAMED(x,INTEGER(value)[0]);
+ // called from one single place at R level. TODO: avoid somehow, but fails tests without
+ // At least the SET_NAMED() direct call is passed 0 and makes no assumptions about >0. Good enough for now as patch for CRAN is needed.
+ SET_NAMED(x,0);
return(x);
}
@@ -87,15 +88,15 @@ SEXP copyNamedInList(SEXP x)
// As from R 3.1.0 list() no longer copies NAMED inputs
// Since data.table allows subassignment by reference, we need a way to copy NAMED inputs, still.
// But for many other applications (such as in j and elsewhere internally) the new non-copying list() in R 3.1.0 is very welcome.
-
- // This is intended to be called just after list(...) in data.table(). It isn't for use on a single data.table, as
+
+ // This is intended to be called just after list(...) in data.table(). It isn't for use on a single data.table, as
// member columns of a list aren't marked as NAMED when the VECSXP is.
-
+
// For now, this makes the old behaviour of list() in R<3.1.0 available for use, where we need it.
-
+
if (TYPEOF(x) != VECSXP) error("x isn't a VECSXP");
for (int i=0; i<LENGTH(x); i++) {
- if (NAMED(VECTOR_ELT(x, i))) {
+ if (MAYBE_REFERENCED(VECTOR_ELT(x, i))) {
SET_VECTOR_ELT(x, i, duplicate(VECTOR_ELT(x,i)));
}
}
@@ -109,10 +110,10 @@ SEXP dim(SEXP x)
// fast implementation of dim.data.table
if (TYPEOF(x) != VECSXP) {
- error("dim.data.table expects a data.table as input (which is a list), but seems to be of type %s",
+ error("dim.data.table expects a data.table as input (which is a list), but seems to be of type %s",
type2char(TYPEOF(x)));
}
-
+
SEXP ans = allocVector(INTSXP, 2);
if(length(x) == 0) {
INTEGER(ans)[0] = 0;
diff --git a/vignettes/datatable-faq.Rmd b/vignettes/datatable-faq.Rmd
index f78ca19..48fdbb9 100644
--- a/vignettes/datatable-faq.Rmd
+++ b/vignettes/datatable-faq.Rmd
@@ -595,7 +595,7 @@ Please file suggestions, bug reports and enhancement requests on our [issues tra
Please do star the package on [GitHub](https://github.com/Rdatatable/data.table/wiki). This helps encourage the developers and helps other R users find the package.
-You can submit pull requests to change the code and/or documentation yourself; see our [Contribution Guidelines](https://github.com/Rdatatable/data.table/blob/master/Contributing.md).
+You can submit pull requests to change the code and/or documentation yourself; see our [Contribution Guidelines](https://github.com/Rdatatable/data.table/blob/master/CONTRIBUTING.md).
## I think it's not great. How do I warn others about my experience?
Please put your vote and comments on [Crantastic](http://crantastic.org/packages/data-table). Please make it constructive so we have a chance to improve.
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-cran-data.table.git