On magrittr

My opinion of the tidyverse (https://www.tidyverse.org/) is that there is a lot of really great stuff out there. For me anyway, apex tidyverse is the work that David Robinson and more recently Max Kuhn and many others have been doing to promote uniform modeling interfaces for both model specification and output.

Most of the other tidyverse stuff strikes me as simply common sense. I like common sense! Even if I don’t always personally use tidy tools, I very much like that they promote sensible patterns to follow for reasoning about problems.

I worry, but only a little, about the large vocabulary associated with the tidyverse. That’s really just me: I personally just can’t remember that much stuff anymore (my brain is shrinking with age I guess; or maybe it’s from the whiskey. Yeah, probably whiskey). Besides that, occasional R users who primarily think about other topics (science!) and also use other languages (Python! Fortran!) might have a hard time going back and reading R code with so many function names to remember. But, hey, R and S have always been “big” languages with lots of terms. It’s a big tent and I don’t see any problem at all with including tools that promote common sense, even verbosely.

I’m not sure about the magrittr pipe operator syntax though.

This note briefly lays out the main problems I personally have with the pipe operator and why I don’t prefer that syntax. The examples I came up with are contrived to illustrate their points economically, but they are representative. That is to say, if I used magrittr pipe operators, I would have to worry about where and how best to use them, or not use them.

I do understand and appreciate the readability argument often made for magrittr pipe operators, but personally I just don’t really agree. Again, personally, I find the “repeated assignment” syntax generally more understandable and illustrate that in an example below. I am aware that this may in fact be a minority opinion!

And that’s all this document is. An opinion about why I rarely use the magrittr syntax.

They’re (mostly) not lazy

The magrittr %>% and related operators evaluate their left-hand side eagerly by default, defeating R’s lazy evaluation mechanism, an innovative feature of the language. For example:

f <- function(x, y) if(y > 0) sum(x) else 0

f(runif(1e10), 0)
## [1] 0
# vs

library(magrittr)
runif(1e10) %>% f(0)
## Error: cannot allocate vector of size 74.5 Gb
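
The same effect can be seen at a scale that fits in memory. The sketch below uses only base R. Both `expensive()` and the `%eager%` operator are hypothetical stand-ins (they are not part of magrittr) for a costly argument and for any operator that forces its left-hand side before calling the function on its right. Whether a particular magrittr release forces its left-hand side may depend on the version installed.

```r
f <- function(x, y) if (y > 0) sum(x) else 0

# Hypothetical costly argument that records whether it was ever evaluated.
evaluated <- FALSE
expensive <- function() {
  evaluated <<- TRUE
  runif(10)
}

# Ordinary call: y is 0, so x is never touched and expensive() never runs.
lazy_result <- f(expensive(), 0)
lazy_flag   <- evaluated

# Hypothetical eager operator: force the left-hand side, then apply rhs to it.
`%eager%` <- function(lhs, rhs) {
  force(lhs)
  rhs(lhs)
}

# Same computation, but the left-hand side is evaluated even though f ignores it.
eager_result <- expensive() %eager% (function(x) f(x, 0))
eager_flag   <- evaluated
```

Both calls return the same value; the only difference is whether the unused argument was ever computed, which is what `lazy_flag` and `eager_flag` record.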

%>% does not like closures

This is alluded to in a sketchy way in the documentation for %>% but no examples are provided. Are there other, perhaps unexpected, edge cases? I think magrittr should at least document, with examples, where it does not play well with R.

f <- function(x)
{
  i <- 3
  function(y) x + i
}

g <- function(x) x()

g(f(1))
## [1] 4
# or
x <- f(1)
g(x)
## [1] 4
# vs

library(magrittr)
1 %>% f %>% g
## Error in x + i: non-numeric argument to binary operator
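
One way to get exactly this kind of failure in plain R, with no pipe at all, is to let an unevaluated promise refer to a name that is rebound before the promise is forced. The sketch below is only that: an illustration of the lazy-evaluation hazard, not a confirmed description of magrittr’s internals. The names `val`, `f_forced`, `broken` and `fixed` are hypothetical.

```r
f <- function(x) {
  i <- 3
  function(y) x + i      # x is still an unevaluated promise when f returns
}
g <- function(x) x()

# Thread a value through one name, as a pipeline implementation might.
val <- 1
val <- f(val)            # the promise for x refers to the symbol `val`

# By the time x is forced, `val` is the closure itself, so x + i fails.
broken <- tryCatch(g(val), error = function(e) e)

# Forcing the argument inside f pins its value before the name is rebound.
f_forced <- function(x) {
  force(x)
  i <- 3
  function(y) x + i
}
val <- 1
val <- f_forced(val)
fixed <- g(val)
```

Here `broken` holds an error condition and `fixed` is the expected sum, which suggests why closures that capture lazy arguments are a delicate case for any tool that rebinds names behind the scenes.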

Syntax!
Or, “the example from the magrittr vignette bothers me”

Copied verbatim, this example is called “horrific” in the magrittr vignette:

car_data <- 
  transform(aggregate(. ~ cyl,
                      data = subset(mtcars, hp > 100),
                      FUN = function(x) round(mean(x), 2)),
            kpl = mpg * 0.4251)

Somehow, I find it pretty readable! Horrific? Really? Personally, however, I dislike and avoid using transform and subset, and would probably have written the above expression like this:

f <- function(x) round(mean(x), 2)                                        # Define an aggregation function
car_data <- aggregate(. ~ cyl, mtcars[mtcars[["hp"]] > 100, ], FUN = f)   # Aggregate a filtered mtcars
car_data[["kpl"]] <- car_data[["mpg"]] * 0.4251                           # Add a kpl column

The above approach is extremely imperative. I use a named function instead of an anonymous one really only for readability here. I’ll bet that most programmers familiar with C or Java or nearly any other procedural language can figure out what this code does without knowing anything about R. The most mysterious part is the first argument of the aggregate function, a formula term, but the aggregate help page at least covers this. Like the “horrific” version, this approach also emphasizes, in a fairly obvious way for most programmers not already familiar with R, that functions are first-class objects, a really key feature of R.

Here is the magrittr version, again from its vignette:

library(magrittr)

car_data <- 
  mtcars %>%
  subset(hp > 100) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251))

It’s still very imperative, but a few things stand out at least to me:

A bit more about functions and dots

In that last example we saw this line with judicious use of dots

aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2))

I think most tidy proponents would argue that this is really a problem with the aggregate function (so maybe it’s not the best example for the magrittr vignette!). But it illustrates something deeper that bothers me about the magrittr pipes: they mix up the concepts of functions and expressions by making every function look like an expression.

Mixing expression and function syntax makes their distinct scoping rules a bit more confusing I think. Consider, for instance:

x <- 3
f <- function(x)  x + 3

I’m used to understanding that the x in the function isn’t the same as the x defined above it in the expression, because the symbol used as the function argument name takes precedence within the function. But in the aggregate expression above, there are three subtly distinct meanings of the dot symbol (a confusing enough symbol to begin with!):

  1. . ~ cyl is an R formula, mysterious but well documented. The dot has its own formula-specific meaning here: every other column in the data.
  2. data = . Here the dot means the output of the previous step in the magrittr pipeline.
  3. . %>% mean %>% round(2) This dot has a different meaning, and even a different scope, from the others: it is shorthand for the argument to an unseen closure. It is not the same as dot #2 above, as one might naturally assume because it’s used in an expression.

I prefer the explicit function syntax; scoping is explicit and easier to reason about.
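
For comparison, here is a sketch of the third dot spelled out as an ordinary named function, assuming the intent is the mean rounded to two places, as the piped `mean %>% round(2)` suggests. It uses base R only; `agg_mean`, `big_cars` and `out` are hypothetical names.

```r
# Hypothetical name; the argument x is visibly scoped to this function.
agg_mean <- function(x) round(mean(x), 2)

# Same filter as subset(mtcars, hp > 100), written with plain indexing.
big_cars <- mtcars[mtcars[["hp"]] > 100, ]

# The only remaining dot is the formula's: "every column other than cyl".
out <- aggregate(. ~ cyl, data = big_cars, FUN = agg_mean)
names(out)
```

With the closure written out, only the formula dot is left, and its meaning is the one covered by the aggregate help page.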

By the way, the R formula dots really are different from the magrittr dots, as the next contrivance shows:

coef(lm(update(y ~ x, sqrt(.) ~ . ), data = data.frame(x = 1:2, y = 3:4)))
## (Intercept)           x 
##   1.4641016   0.2679492
y ~ x %>% update(sqrt(.) ~ . ) %>% lm(data = data.frame(x = 1:2, y = 3:4)) %>% coef
## y ~ x %>% update(sqrt(.) ~ .) %>% lm(data = data.frame(x = 1:2, 
##     y = 3:4)) %>% coef

?
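
One plausible reading of that output is operator precedence: in R’s grammar the `%any%` operators bind more tightly than `~`, so the whole pipeline is parsed as the right-hand side of a formula and nothing is ever run. The sketch below checks this with quote() alone, so magrittr does not need to be loaded; `expr`, `wrapped` and the `*_op` names are just illustrative.

```r
# quote() only parses, so %>% need not exist for this to run.
expr <- quote(y ~ x %>% update(sqrt(.) ~ .))

# The outermost call is the formula operator, not the pipe.
outer_op <- as.character(expr[[1]])

# The pipe sits inside the formula's right-hand side.
inner_op <- as.character(expr[[3]][[1]])

# Parenthesizing the formula makes the pipe the outermost call instead.
wrapped    <- quote((y ~ x) %>% update(sqrt(.) ~ .))
wrapped_op <- as.character(wrapped[[1]])
```

So the unparenthesized line builds one large formula whose right-hand side happens to contain pipe calls, which is what gets printed.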

The magrittr version re-worked

This is just like my shorter preferred example above, but with subset.

f        <- function(x) round(mean(x), 2)
car_data <- subset(mtcars, hp > 100)
car_data <- aggregate(. ~ cyl, data = car_data, FUN = f)
car_data[["kpl"]] <- car_data[["mpg"]] * 0.4251

Functions are used in ways that match their documentation, the imperative flow is easy to understand for anyone not already familiar with R but perhaps familiar with another language, and there are far fewer words and special symbols to know about. It’s also just as efficient as the pipe version from an R interpreter evaluation standpoint.

Ce n’est pas une pipe

Yeah, I get the joke: magrittr explicitly advertises itself as not a pipe. But then really, why call it that?

R’s pipe() and fifo() functions let you set up pipes and pipelines. That is to say first in, first out thingys that the computer people usually think of when you say “pipe.” Pipes are a powerful abstraction for parallel data processing. With pipes we set up pipelines to process data incrementally as they move through the pipeline. The analogy often used is to a factory assembly line. A lot of Unix people are very familiar with using pipes in this way at their command lines. R very capably supports this kind of parallel processing, even on Windows!
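
As a small illustration of that other kind of pipe, the sketch below reads the output of an external command through a pipe() connection. It assumes a shell with an `echo` command is available, which is an assumption about the machine rather than about R; `con` and `out` are illustrative names.

```r
# Open a connection to the standard output of an external command.
con <- pipe("echo hello")

# readLines() opens the connection, reads what the command wrote, and closes it.
out <- readLines(con)

# Destroy the connection object once we are done with it.
close(con)
out
```

Here the data come from another process through the operating system, which is the sense of “pipe” that Unix users usually have in mind.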

Magrittr pipes don’t do that.

Anyway that’s just like, my opinion, man.

Look, I’m not trying to start a fight or flame or whatever the modern internet term is. This is just my opinion of magrittr operators. I feel that the operators confuse otherwise well-defined concepts of functions and expressions and obscure or outright inhibit some of the cool, innovative features of R like first-class functions, meticulous documentation, and lazy evaluation. These are features that many other languages aspire to have.