Advanced dplyr Quiz

Being able to effectively perform meaningful work using R programming involves being able to both know how various packages work and anticipate package method outcomes in basic situations. Any mismatch there (be it a knowledge gap in the programmer, an implementation gap in a package, or a difference between programmer opinion and package doctrine) can lead to confusion, bugs and incorrect results.

Below is our advanced dplyr quiz. Can you anticipate the result of each of the example operations? Can you anticipate which commands are in error and which are valid dplyr?

Or another phrasing: here are our notes on dplyr corner-cases (in my opinion). You may not need to know how any of these work (it is often good to avoid corner-cases), but you should at least be confident you are avoiding the malformed ones.

Start

With the current version of dplyr in mind, please anticipate the result of each example command. Note: we don’t claim all of the examples below are correct dplyr code. However, effective programming requires knowledge of what happens in some incorrect cases (at least knowing which throw usable errors, and which perform quiet mal-calculations).

# Show versions we are using.
# devtools::install_github('tidyverse/dplyr')
# devtools::install_github('tidyverse/dbplyr')
# devtools::install_github('rstats-db/RSQLite')
# devtools::install_github('tidyverse/rlang')
packageVersion("dplyr")
## [1] '0.7.1.9000'
packageVersion("dbplyr")
## [1] '1.1.0.9000'
packageVersion("RSQlite")
## [1] '2.0'
packageVersion("rlang")
## [1] '0.1.1.9000'
packageVersion("magrittr")
## [1] '1.5'
base::date()
## [1] "Thu Jul  6 09:22:41 2017"
suppressPackageStartupMessages(library("dplyr"))

Now for the examples/quiz.

Please take a moment to try the quiz, and write down your answers before moving on to the solutions. This should give you a much more open mind as to what constitutes “surprising behavior.”

You can also run the quiz yourself by downloading and knitting the source document.

Please keep in mind while “you never want errors” you do sometimes want exceptions (which are unfortunately called “Error:” in R). Exceptions are an important way of stopping off-track computation and preventing later incorrect results. Exceptions can often be the desired outcome of a malformed calculation.

Not all of the questions are “trick”, some of them are just convenient ways to remember different dplyr conventions.

Column selection

data.frame(x = 1) %>% 
  select(x)
# Two questions: 
#  1) Should this next one work?
#  2) Does this next one work?
data.frame(x = 1) %>% 
  select('x')
y <- 'x' # value used in later examples

# Hint: this is new for dplyr 0.7.* 
# (dplyr 0.5.0 only did this for integers).
# This is going to be a massive source
# of unexpected results, bugs, and
# incorrect analyses.
data.frame(x = 1) %>% 
  select(y)
y <- 'x' # value used in later examples

data.frame(x = 1, y = 2) %>% 
  select(y)

(From dplyr issue 2904.)

rename and mutate

data.frame(x=1, y=2) %>% rename(x=y , y=x)
data.frame(x=1, y=2) %>% mutate(x=y , y=x)

Column grouping

y <- 'x' # value used in later examples

data.frame(x = 1) %>% 
  group_by(.data[[y]]) %>% 
  summarize(count = n()) %>% 
  colnames()

(From dplyr issue 2916, notation taken from here).

Piping into different targets (functions, blocks expressions)

magrittr pipe details:

data.frame(x = 1)  %>%  { bind_rows(list(., .)) }
data.frame(x = 1)  %>%    bind_rows(list(., .))
data.frame(x = 1)  %>%  ( bind_rows(list(., .)) )

Same with Bizarro Pipe:

data.frame(x = 1)  ->.;  { bind_rows(list(., .)) }
data.frame(x = 1)  ->.;    bind_rows(list(., .))
data.frame(x = 1)  ->.;  ( bind_rows(list(., .)) )

summary

data.frame(x = c(1, 2), y = c(3, 3)) %>% 
  group_by(x) %>% 
  summarize(y)

(From dplyr issue 2915.)

enquo rules

This section is about enquo(), or passing unquoted variable names to functions that use dplyr methods. Please skip it if you don’t write such code, or if you use something like wrapr::let() for such replacements.

(function(z) select(data.frame(x = 1), !!enquo(z)))(x)
(function(z) data.frame(x = 1) %>% select(!!enquo(z)))(x)

(From dplyr issue 2726.)

y <- NULL # value used in later examples

(function(z) mutate(data.frame(x = 1), !!quo_name(enquo(z)) := 2))(y)
y <- NULL # value used in later examples

(function() mutate(data.frame(x = 1), !!quo_name(enquo(y)) := 2))()
y <- NULL # value used in later examples

(function(z) select(data.frame(y = 1), !!enquo(z)))(y)
y <- NULL # value used in later examples

(function() select(data.frame(y = 1), !!enquo(y)))()
y <- NULL # value used in later examples

(function() mutate(data.frame(x = 1), !!ensym(y) := 2))()

(From rlang issue 203.)

Databases

Setup:

# values used in later examples
db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

nrow()

dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)

nrow(dL)
db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

nrow(dR)

(From dplyr issue 2871.)

union_all()

db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

union_all(dR, dR)
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)

union_all(dL, head(dL))
db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

union_all(dR, head(dR))

(From dplyr issue 2858.)

mutate_all funs()

dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)

dL %>% select(x) %>% 
  mutate_all(funs(round(., digits = 2)))
db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

dR %>% 
  mutate_all(funs(round(., 2)))
db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

dR %>% select(x) %>% 
  mutate_all(funs(round(., digits = 2)))

(From dplyr issue 2890 and dplyr issue 2908.)

rename

db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

dR %>% 
  rename(x2 = x) %>% rename(k2 = k)
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)

dL %>% 
  rename(x2 = x, k2 = k)
db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
dL <- data.frame(x = 3.077, 
                 k = 'a', 
                 stringsAsFactors = FALSE)
dR <- dplyr::copy_to(db, dL, 'dR')

dR %>% 
  rename(x2 = x, k2 = k)

(From dplyr issue 2860.)

Conclusion

The above quiz is really my working notes on both how things work (so many examples are correct), and corner-cases to avoid. Some of the odd cases are simple bugs (which will likely be fixed), and some are legacy behaviors from earlier versions of dplyr. In many cases you can and should re-arrange your dplyr pipelines to avoid triggering the above issues. But to do that, you have to know what to avoid (hence the notes).

My quiz-grading priniciple comes from Software for Data Analysis: Programming with R by John Chambers (Springer 2008):

… the computations and the software for data analysis should be trustworthy: they should do what the claim, and be seen to do so. Neither those how view the results of data analysis nor, in many cases, the statisticians performing the analysis can directly validate extensive computations on large and complicated data processes. Ironically, the steadily increasing computer power applied to data analysis often distances the result further from direct checking by the recipient. The many computational steps between original data source and displayed results must all be truthful, or the effect of the analysis may be worthless, if not pernicious. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted.

The point is: to know a long calculation is correct, we must at least know all the small steps did what we (the analyst) intended (and not something else). To go back to one of our examples: the analyst must know the column selected in their analysis was always the one they intended.

Also: please understand, some of these may not represent problems with the above packages. They may instead represent mistakes and misunderstandings on my part. Or opinions of mine that may differ from the considered opinions and experience of the people who have authored and who have to maintain these packages. Some things that might seem “easy to fix” to an outsider may already be set at a “best possible compromise” among many other considerations.

I may or may not keep these up to date depending on the utility of such a list going forward.

[ quiz ] [ solutions ]