Data.frame/table brass tacks (the pointy part)

Matt Dowle and Arun Srinivasan’s¹ data.table package is a data.frame twin with amazing performance. Often, it’s the fastest way to operate on in-memory tabular data, period (not just in R!).

I’ve used data.table for performance reasons in projects for many years now. These notes outline a few things to watch out for (I have stumbled across them in practice). Fortunately, they are not that problematic to deal with.

Data.frames

R’s data.frame is simply an R list with extra attributes: class = “data.frame” and row.names = (integer or character vector of row names). Data.frame column names (if any) are taken from the list names.

That is, you can make a data.frame the usual way:

x <- data.frame(v = 1:2, a = c(pi, 5))

or, if you prefer, like this²:

y <- structure(
       list(v = 1:2, a = c(pi, 5)),
       class = "data.frame",
       row.names = 1:2)
all.equal(x, y)

## [1] TRUE

R defines the number of rows of a data.frame to be the length of its row.names vector attribute. The number of columns is simply the length of the list.
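
A quick way to see this structure for yourself, as a small sketch using only base R (x is recreated here so the snippet stands alone):

```r
x <- data.frame(v = 1:2, a = c(pi, 5))

is.list(x)                    # TRUE: a data.frame is a list underneath
names(attributes(x))          # the extra attributes: names, class, row.names
length(x)                     # 2: number of columns = length of the list
length(attr(x, "row.names"))  # 2: number of rows = length of row.names
```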

This simple scheme is very flexible! Each data.frame list entry (column) is effectively a memory pointer to a native R vector. And usually, R tries to defer copying data until it must. Consider the following example run on my laptop with 8GB RAM:

system.time({
  x <- runif(4e8)
  y <- runif(4e8, 0, 10)
})

##    user  system elapsed 
##  15.195   2.282  17.523

system.time({d <- data.frame(y, x)})

##    user  system elapsed 
##   0.000   0.001   0.000

The vectors x and y each consume about 3GB of RAM, for a total of about 6GB. Creating the data.frame d above happens instantly, without copying data, because its columns simply contain pointers to y and x. (Of course, subsequently modifying the values would induce a copy.)
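
The same behavior can be poked at on a much smaller scale. The sketch below assumes the data.table package is installed and borrows its address() helper purely as a convenient way to print where a vector lives in memory; the names x, y and d mirror the example above. Whether the two printed addresses match may depend on your R version, so no output is shown for them.

```r
x <- runif(10)
y <- runif(10, 0, 10)
d <- data.frame(y, x)

# Print where the original vector and the data.frame column live in memory.
data.table::address(x)
data.table::address(d$x)

# Modifying the column changes only the data.frame's version of the data;
# the original vector x is left untouched.
x_first <- x[1]
d$x[1] <- -1
x[1] == x_first   # TRUE
```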

Re-arranging columns in a data.frame simply re-arranges pointers and need not always copy data:

system.time({print(tail(d[, order(names(d))]))})

##                   x        y
## 399999995 0.4109185 7.576306
## 399999996 0.3578239 7.045663
## 399999997 0.3312816 7.116747
## 399999998 0.6940979 3.164837
## 399999999 0.2465026 5.793221
## 400000000 0.8413648 7.473116

##    user  system elapsed 
##   0.003   0.000   0.006

That operation (ordering columns and printing the tail) ran almost instantly on my laptop.

Data.table corner cases

Data.tables are blobs of data behind a reference pointer that (mostly) emulate standard data.frame structure and behavior. They are almost always faster and more efficient than standard R data.frames (and I often encourage their use), but there exist situations where plain old data.frames can be more efficient.

First, let’s run rm and gc to free up the 6GB memory used by the previous example before proceeding:

rm(list = ls())
gc()

##          used (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 462458 24.7     973298   52.0    665888   35.6
## Vcells 869100  6.7  924660767 7054.7 800959386 6110.9

Now consider this:

library(data.table)
system.time({
  dt <- data.table(x = runif(4e8), y = runif(4e8, 0, 10))
})

##    user  system elapsed 
##  13.979  17.280 129.837

Why did that take so long? Because data.table made copies of the data, which happen to be larger than the RAM in my laptop in this case, and thus operating system swap was involved.
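
One data.table-native way to sidestep that kind of copy, sketched here at a small scale, is to build the columns as a plain list first and then convert it with setDT(), which data.table documents as converting lists and data.frames to data.tables by reference. The name cols is just for illustration, and I have not timed this at the 4e8 scale used above.

```r
library(data.table)

# Build the columns as an ordinary named list of equal-length vectors.
cols <- list(x = runif(5), y = runif(5, 0, 10))

# Convert the list to a data.table in place, rather than via data.table().
setDT(cols)

is.data.table(cols)   # TRUE
names(cols)           # "x" "y"
```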

Somewhat more problematically, unlike base-R data.frames, simply re-ordering data.table columns sometimes induces data copies. For example:

system.time({print(tail(dt[, order(names(dt)), with = FALSE]))})

##             x          y
## 1: 0.21629333 9.35813882
## 2: 0.42472695 0.71434654
## 3: 0.28599030 8.08345845
## 4: 0.07070132 5.94907941
## 5: 0.23969760 0.01440646
## 6: 0.28483723 8.35151906

##    user  system elapsed 
##   1.669  13.798  88.547

That’s a bit of a bummer, so watch out for it. The workaround is usually to use “the data.table way” as much as possible to solve problems.
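
For the column re-ordering case, the data.table way is setcolorder(), which rearranges columns by reference instead of building a new table. A minimal sketch on a small table (a stand-in for the large dt above):

```r
library(data.table)

dt <- data.table(y = runif(5, 0, 10), x = runif(5))

# Reorder the columns alphabetically, in place; nothing is assigned back.
setcolorder(dt, sort(names(dt)))

names(dt)   # "x" "y"
```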

Data.table function arguments

Another brass tack to be aware of: a data.table passed to a function behaves as call-by-reference, unlike R's standard call-by-value convention for function arguments (see, for instance, https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Argument-evaluation). If the function modifies its argument in place with := or one of the set* functions, the caller's data.table changes too. I’ve personally wrestled with hard-to-find bugs resulting from this, so it is something to watch out for. To be fair, environments passed as function arguments are also call-by-reference in R, so it may help to think of data.tables like that.
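
Here is a small sketch of how that can bite. The function names add_flag_df, add_flag_dt and add_flag_dt_safely are made up for illustration; copy() is data.table's function for taking an explicit copy.

```r
library(data.table)

# With a data.frame, the function works on its own copy of the argument.
add_flag_df <- function(d) {
  d$flag <- TRUE
  invisible(NULL)
}
df <- data.frame(a = 1:3)
add_flag_df(df)
names(df)   # "a": the caller's data.frame is unchanged

# With a data.table, := modifies the caller's object in place.
add_flag_dt <- function(d) {
  d[, flag := TRUE]
  invisible(NULL)
}
dt <- data.table(a = 1:3)
add_flag_dt(dt)
names(dt)   # "a" "flag": the caller's data.table was changed

# Taking an explicit copy first restores the usual R behavior.
add_flag_dt_safely <- function(d) {
  d <- copy(d)
  d[, flag := TRUE]
  d
}
dt2 <- data.table(a = 1:3)
res <- add_flag_dt_safely(dt2)
names(dt2)  # "a"
```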

Sort order

Be aware that data.table sorts using the ‘C’ locale (ASCII, basically), which is almost certainly not the way your R session defaults to sorting things.

Here is an example:

set.seed(1)
a <- sample(c(letters, LETTERS), 10)

x <- data.table(a = a)
setkeyv(x, "a")  # sort the data.table by the contents of 'a'
a <- sort(a)     # sort 'a' in plain old R and compare:
cbind(x, a)

##     a a
##  1: G a
##  2: H d
##  3: M G
##  4: Q H
##  5: a M
##  6: d n
##  7: n Q
##  8: r r
##  9: u u
## 10: w w

Forcing R to use the C locale makes the sort order the same:

Sys.setlocale("LC_COLLATE", "C")

## [1] "C"

a <- sort(a)
cbind(x, a)

##     a a
##  1: G G
##  2: H H
##  3: M M
##  4: Q Q
##  5: a a
##  6: d d
##  7: n n
##  8: r r
##  9: u u
## 10: w w
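
If changing the session locale is undesirable, base R's sort() also accepts method = "radix", which is documented to collate character vectors in the C locale regardless of the session setting. A small sketch with a hand-picked vector (not the sampled one above):

```r
library(data.table)

a <- c("b", "A", "a", "B")

x <- data.table(a = a)
setkeyv(x, "a")              # data.table sorts in the C locale

sort(a, method = "radix")    # "A" "B" "a" "b"

# The two orderings agree without touching Sys.setlocale():
identical(x$a, sort(a, method = "radix"))   # TRUE
```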

¹ along with many other contributors
² Mostly; there are some edge-case exceptions to this, however!