Using seplyr to Program Over dplyr

Introduction

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")

## [1] '1.0.5'

colnames(starwars)

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"

starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())

## # A tibble: 49 x 4
##    homeworld      mean_height mean_mass count
##    <chr>                <dbl>     <dbl> <int>
##  1 Alderaan              176.      64       3
##  2 Aleen Minor            79       15       1
##  3 Bespin                175       79       1
##  4 Bestine IV            180      110       1
##  5 Cato Neimoidia        191       90       1
##  6 Cerea                 198       82       1
##  7 Champala              196      NaN       1
##  8 Chandrila             150      NaN       1
##  9 Concord Dawn          183       79       1
## 10 Corellia              175       78.5     2
## # … with 39 more rows

The above is colloquially called “an interactive script.” The name comes from the fact that we directly use names of variables (such as “homeworld”, which would only be known from looking at the data directly) in our analysis code. Only somebody interacting with the data could write such a script (hence the name).

It has long been considered a point of discomfort to convert such an interactive dplyr pipeline into a re-usable script or function. That is a script or function that specifies column names in some parametric or re-usable fashion. Roughly it means the names of the data columns are not yet known when we are writing the code (and this is what makes the code re-usable).

This inessential (or conquerable) difficulty is largely a due to the strong preference for non-standard evaluation interfaces (that is interfaces that capture and inspect un-evaluated expressions from their calling interface) in the design dplyr. The common tooling to work around dplyr’s non-standard interfaces is a system called alternately rlang (a name which is perhaps a bit overly ambitious as rlang is not the R language) or tidyeval (a name which is perhaps a bit derogatory, incorrectly labeling R itself as the “non-tidy” dialect relative to tidyeval). We, on the other hand, recommend using either the superior tooling offered by wrapr::let() (which actual pre-dates rlang and can be used with any package, not just those that are designed around it) or to avoid non-standard interface foibles by using a standard evaluation adapter for dplyr called seplyr (which we will discuss here).

Our contention is a dialect (or adaption) of dplyr itself (which we call seplyr for “standard evaluation plyr”) that emphasizes standard (or value carrying) interfaces is easy to use and does not require the complexities of rlang (as seplyr fails to introduce the difficulty rlang is designed to feast on).

`seplyr`

seplyr is a dplyr adapter layer that prefers “slightly clunkier” standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our example.

Let’s translate the above example into a re-usable function in small (easy) stages. First translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring, we are changing the code without changing its observable external behavior.

The translation is mechanical in that it is mostly using seplyr documentation as a lookup table. What you have to do is:

Change dplyr verbs to their matching seplyr “*_se()” adapters.
Add quote marks around names and expressions.
Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
Replace “=” in expressions with “:=”.

Our converted code looks like the following.

library("seplyr")

## Loading required package: wrapr

## 
## Attaching package: 'wrapr'

## The following object is masked from 'package:dplyr':
## 
##     coalesce

starwars %>%
  group_by_se("homeworld") %>%
  summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                 "mean_mass" := "mean(mass, na.rm = TRUE)",
                 "count" := "n()"))

## # A tibble: 49 x 4
##    homeworld      mean_height mean_mass count
##    <chr>                <dbl>     <dbl> <int>
##  1 Alderaan              176.      64       3
##  2 Aleen Minor            79       15       1
##  3 Bespin                175       79       1
##  4 Bestine IV            180      110       1
##  5 Cato Neimoidia        191       90       1
##  6 Cerea                 198       82       1
##  7 Champala              196      NaN       1
##  8 Chandrila             150      NaN       1
##  9 Concord Dawn          183       79       1
## 10 Corellia              175       78.5     2
## # … with 39 more rows

This code works the same as the original dplyr code. Also the translation could be performed by following the small set of explicit re-coding rules that we gave above.

Obviously at this point all we have done is: worked to make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say all the original dplyr notation is saving us is from having to write a few quote marks).

The benefit is: this new code can very easily be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.

grouped_mean <- function(data, 
                         grouping_variables, 
                         value_variables,
                         count_name = "count") {
  result_names <- paste0("mean_", 
                         value_variables)
  expressions <- paste0("mean(", 
                        value_variables, 
                        ", na.rm = TRUE)")
  calculation <- result_names := expressions
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(calculation,
                   count_name := "n()")) %>%
    ungroup()
}

starwars %>% 
  grouped_mean(grouping_variables = c("eye_color", "skin_color"),
               value_variables = c("mass", "birth_year"))

## # A tibble: 53 x 5
##    eye_color skin_color       mean_mass mean_birth_year count
##    <chr>     <chr>                <dbl>           <dbl> <int>
##  1 black     green                 80.5            44       2
##  2 black     grey                  78.7           NaN       4
##  3 black     none                 NaN             NaN       1
##  4 black     orange                80              22       1
##  5 black     red, blue, white      57             NaN       1
##  6 black     white, blue          NaN             NaN       1
##  7 blue      blue                 NaN             NaN       1
##  8 blue      brown                136             NaN       1
##  9 blue      dark                  50             NaN       1
## 10 blue      fair                  90              62.6    10
## # … with 43 more rows

We have translated our original interactive or ad-hoc calculation into a parameterized reusable function in two easy steps:

Translate into notation.
Replace values with variables.

To be sure: there are some clunky details of using to build up the expressions, but the conversion process is very regular and easy. In seplyr parametric programming is intentionally easy (just replace values with variables).

`rlang`/tidyeval

The above may all seem academic as “dplyr supplies its own parametric abstraction layer: rlang/tidyeval.” However, in our opinion rlang abstraction is irregular and full of corner cases (even for simple column names). When you are programming of rlang you are programming over a language that is not quite R, and a language that is not in fact superior to R. Take for example the following example from the dplyr 0.7.0 announcement:

The above code appears to work, unless you take the extra steps of actually running it and examining the results.

packageVersion("dplyr")

## [1] '1.0.5'

packageVersion("rlang")

## [1] '0.4.10'

packageVersion("tidyselect")

## [1] '1.1.0'

my_var <- "homeworld"

starwars %>%
  group_by(.data[[my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

## # A tibble: 49 x 3
##    homeworld      height  mass
##    <chr>           <dbl> <dbl>
##  1 Alderaan         176.  64  
##  2 Aleen Minor       79   15  
##  3 Bespin           175   79  
##  4 Bestine IV       180  110  
##  5 Cato Neimoidia   191   90  
##  6 Cerea            198   82  
##  7 Champala         196  NaN  
##  8 Chandrila        150  NaN  
##  9 Concord Dawn     183   79  
## 10 Corellia         175   78.5
## # … with 39 more rows

Notice the created grouping column is called “my_var”, and not “homeworld”. This may seem like a small thing, but it will cause any downstream processing to fail. One can try variations and get many different results, some of which are correct and some of which are not.

# wrong
starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

## # A tibble: 1 x 3
##   `"homeworld"` height  mass
##   <chr>          <dbl> <dbl>
## 1 homeworld       174.  97.3

# correct, or at least appears to work
starwars %>%
  group_by(!!rlang::sym(my_var)) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

## # A tibble: 49 x 3
##    homeworld      height  mass
##    <chr>           <dbl> <dbl>
##  1 Alderaan         176.  64  
##  2 Aleen Minor       79   15  
##  3 Bespin           175   79  
##  4 Bestine IV       180  110  
##  5 Cato Neimoidia   191   90  
##  6 Cerea            198   82  
##  7 Champala         196  NaN  
##  8 Chandrila        150  NaN  
##  9 Concord Dawn     183   79  
## 10 Corellia         175   78.5
## # … with 39 more rows

# correct, or at least appears to work
starwars %>%
  group_by(.data[[!!my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

## # A tibble: 49 x 3
##    homeworld      height  mass
##    <chr>           <dbl> <dbl>
##  1 Alderaan         176.  64  
##  2 Aleen Minor       79   15  
##  3 Bespin           175   79  
##  4 Bestine IV       180  110  
##  5 Cato Neimoidia   191   90  
##  6 Cerea            198   82  
##  7 Champala         196  NaN  
##  8 Chandrila        150  NaN  
##  9 Concord Dawn     183   79  
## 10 Corellia         175   78.5
## # … with 39 more rows

# errors-out
starwars %>%
  group_by(.data[[!!rlang::sym(my_var)]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

## # A tibble: 49 x 3
##    homeworld      height  mass
##    <chr>           <dbl> <dbl>
##  1 Alderaan         176.  64  
##  2 Aleen Minor       79   15  
##  3 Bespin           175   79  
##  4 Bestine IV       180  110  
##  5 Cato Neimoidia   191   90  
##  6 Cerea            198   82  
##  7 Champala         196  NaN  
##  8 Chandrila        150  NaN  
##  9 Concord Dawn     183   79  
## 10 Corellia         175   78.5
## # … with 39 more rows

Notice above in some places strings are required and in others names or symbols are required. In some contexts one may substitute for the other, and in some contexts they can not. In some situations “!!” must be used, and in others it must not be used. In some situations “rlang::sym()” must be used, and in some situations it must not be used. In some situations errors are signaled, and in others a bad result is quietly returned.

The details above are possibly teachable (in the sense one can rote memorize them, once APIs stabilize and stop changing), but they are not friendly to exploration or discoverable. There are a needlessly large number of moving pieces (strings, names, symbols, quoting, quosures, de-quoting, “data pronouns”, select semantics, non-select semantics, = versus :=, and so on) and one is expected to know which combinations of these items work together in which context.

For a further example consider the following.

# correct, or at least appears to work
starwars %>%
  select(my_var) %>%
  head(n=2)

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(my_var)` instead of `my_var` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

## # A tibble: 2 x 1
##   homeworld
##   <chr>    
## 1 Tatooine 
## 2 Tatooine

# errors-out
starwars %>%
  group_by(my_var) %>%
  head(n=2)

## Error: Must group by variables found in `.data`.
## * Column `my_var` is not found.

The “use the value unless the variable name is a column” semantics of select can make code hard to read and potentially unsafe.
The point is the user does not know what column a select() actually chooses unless that know all of:

The name of the variable being used (in this case “my_var”).
The value in the variable.
If the data.frame does or does not have a column matching the variable name.

Column-name coincidences are actually likely in re-usable code. Often when generalizing a sequence of operations, one often chooses an existing column name as the name of a column control variable.

Let’s look at examples.

my_var <- "homeworld"
# selects height column
starwars %>%
  select(my_var) %>%
  head(n=2)

## # A tibble: 2 x 1
##   homeworld
##   <chr>    
## 1 Tatooine 
## 2 Tatooine

# perhaps different data with an extra column
starwars_plus <- starwars %>%
  mutate(my_var = seq_len(nrow(starwars)))

# same code now selects my_var column
starwars_plus %>%
  select(my_var) %>%
  head(n=2)

## # A tibble: 2 x 1
##   my_var
##    <int>
## 1      1
## 2      2

# original official example code now errors-out
starwars_plus %>%
  group_by(.data[[my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

## # A tibble: 49 x 3
##    homeworld      height  mass
##    <chr>           <dbl> <dbl>
##  1 Alderaan         176.  64  
##  2 Aleen Minor       79   15  
##  3 Bespin           175   79  
##  4 Bestine IV       180  110  
##  5 Cato Neimoidia   191   90  
##  6 Cerea            198   82  
##  7 Champala         196  NaN  
##  8 Chandrila        150  NaN  
##  9 Concord Dawn     183   79  
## 10 Corellia         175   78.5
## # … with 39 more rows

The above might be acceptable in interactive work, if we assume the user knows the names of all of the columns in the data.frame and therefore can be held responsible for avoiding column name to variable coincidences. Or more realistically: the user is working interactively in the sense they are there to alter and re-run the code if they happen to notice the failure. However, this is not a safe state of affairs in re-usable functions or packages: which should work even if their are naming coincidences in thought to be irrelevant extra columns.

rlang does have methods to write unambiguous code such as in:

starwars %>%
  select(!!my_var) %>%
  head(n=2)

## # A tibble: 2 x 1
##   homeworld
##   <chr>    
## 1 Tatooine 
## 2 Tatooine

However, because the dplyr interface does not insist on the “!!” notation, the “!!” notation will not always be remembered. rlang/tidyeval interfaces tend to be complex and irregular.

The seplyr::select_se() interface is intentionally simple and regular: it always looks for columns using the strings (not variable names) given.

print(my_var)

## [1] "homeworld"

# selects homeworld column (the value specified my_var) independent of
# any coincidence between variable name and column names.
starwars_plus %>% 
  select_se(my_var) %>%
  head(n=2)

## # A tibble: 2 x 1
##   homeworld
##   <chr>    
## 1 Tatooine 
## 2 Tatooine

We have found when a programing notation is irregular it discourages exploration and leads to learned helplessness (rote appeals to authority and manuals instead of rapid acquisition of patterns and skills).

Conclusion

The seplyr methodology is simple, easy to teach, and powerful.

The methodologies we discussed differ in philosophy.

rlang/tidyeval is often used to build new non-standard interfaces on top of existing non-standard interfaces. This is needed because non-standard interfaces do not naturally compose. The tidy/non-tidy analogy is: it works around a mess by introducing a new mess.
wrapr::let() converts non-standard interfaces into standard referentially transparent or value-oriented interfaces. It tries to help clean up messes.
seplyr is a demonstration of a possible variation of dplyr where more of the interfaces expect values. It tries to cut down on the initial mess, which in turn cuts down on the need for tools and training in dealing with messes.

The seplyr package contains a number of worked examples both in help() and vignette(package='seplyr') documentation.

John Mount

2021-06-12

Introduction

`seplyr`

`rlang`/tidyeval

Conclusion

Using seplyr to Program Over dplyr

John Mount

2021-06-12

Introduction

seplyr

rlang/tidyeval

Conclusion

`seplyr`

`rlang`/tidyeval