DplyrDependencies

Win-Vector LLC

11/30/2017

In an earlier note we exhibited a non-signalling result corruption in dplyr 0.7.4. In this note we demonstrate the seplyr work-around.

First we re-establish our example:

packageVersion("dplyr")
## [1] '0.7.4'
my_db <- DBI::dbConnect(RSQLite::SQLite(),
                        ":memory:")
d <- dplyr::copy_to(
  my_db, 
  data.frame(
    valuesA = c("A", NA, NA),
    valuesB = c("B", NA, NA),
    canUseFix1 = c(TRUE, TRUE, FALSE),
    fix1 = c('Fix_1_V1', "Fix_1_V2", "Fix_1_V3"),
    canUseFix2 = c(FALSE, FALSE, TRUE),
    fix2 = c('Fix_2_V1', "Fix_2_V2", "Fix_2_V3"),
    stringsAsFactors = FALSE),
  'd', 
  temporary = TRUE, overwrite = TRUE)
knitr::kable(dplyr::collect(d))
|valuesA |valuesB | canUseFix1|fix1     | canUseFix2|fix2     |
|:-------|:-------|----------:|:--------|----------:|:--------|
|A       |B       |          1|Fix_1_V1 |          0|Fix_2_V1 |
|NA      |NA      |          1|Fix_1_V2 |          0|Fix_2_V2 |
|NA      |NA      |          0|Fix_1_V3 |          1|Fix_2_V3 |

seplyr has a fix/work-around for the earlier issue: it automatically breaks the steps up into safe blocks (see the announcement; here we are using the development seplyr 0.5.1 version of mutate_se()).

library("seplyr")
## Loading required package: wrapr
packageVersion("seplyr")
## [1] '0.5.1'
d %.>% 
  mutate_se(
    ., 
    qae(valuesA := ifelse(is.na(valuesA) & canUseFix1, 
                          fix1, valuesA),
        valuesA := ifelse(is.na(valuesA) & canUseFix2, 
                          fix2, valuesA),
        valuesB := ifelse(is.na(valuesB) & canUseFix1, 
                          fix1, valuesB),
        valuesB := ifelse(is.na(valuesB) & canUseFix2, 
                          fix2, valuesB)),
    printPlan = TRUE) %.>% 
  select_se(., c("valuesA", "valuesB")) %.>% 
  dplyr::collect(.) %.>% 
  knitr::kable(.)
## $group00001
##                                              valuesA 
## "ifelse(is.na(valuesA) & canUseFix1, fix1, valuesA)" 
##                                              valuesB 
## "ifelse(is.na(valuesB) & canUseFix1, fix1, valuesB)" 
## 
## $group00002
##                                              valuesA 
## "ifelse(is.na(valuesA) & canUseFix2, fix2, valuesA)" 
##                                              valuesB 
## "ifelse(is.na(valuesB) & canUseFix2, fix2, valuesB)"
|valuesA  |valuesB  |
|:--------|:--------|
|A        |B        |
|Fix_1_V2 |Fix_1_V2 |
|Fix_2_V3 |Fix_2_V3 |

We now have a correct result (all cells filled).
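As a cross-check on what "correct" means here, the same four assignments can be applied one at a time to a local data.frame using only base R. This is a minimal sketch: `d_local` is a name we introduce for illustration, and the code does not touch the database or dplyr at all.

```r
# Local copy of the example data (same values as the remote table d).
d_local <- data.frame(
  valuesA = c("A", NA, NA),
  valuesB = c("B", NA, NA),
  canUseFix1 = c(TRUE, TRUE, FALSE),
  fix1 = c("Fix_1_V1", "Fix_1_V2", "Fix_1_V3"),
  canUseFix2 = c(FALSE, FALSE, TRUE),
  fix2 = c("Fix_2_V1", "Fix_2_V2", "Fix_2_V3"),
  stringsAsFactors = FALSE)

# Apply the assignments strictly in order, so each step sees the
# result of the step before it.
d_local$valuesA <- with(d_local,
  ifelse(is.na(valuesA) & canUseFix1, fix1, valuesA))
d_local$valuesA <- with(d_local,
  ifelse(is.na(valuesA) & canUseFix2, fix2, valuesA))
d_local$valuesB <- with(d_local,
  ifelse(is.na(valuesB) & canUseFix1, fix1, valuesB))
d_local$valuesB <- with(d_local,
  ifelse(is.na(valuesB) & canUseFix2, fix2, valuesB))

print(d_local[, c("valuesA", "valuesB")])
```

Every cell ends up filled, which agrees with the seplyr result above.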

seplyr used safe statement re-ordering to break the calculation into the minimum number of blocks/groups that have no in-block dependencies between statements (note this is more efficient than merely introducing a new mutate the first time each newly assigned value is used).
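To make the idea concrete, here is a small base-R sketch of one way such a partition could be computed. It is not seplyr's implementation, and we make no claim that it always finds the minimum number of blocks: `partition_sketch()` and `last_group()` are hypothetical helpers written for this note. Each statement is placed in the earliest block that comes after every block that assigns a name it reads, after any block that already assigns its own target, and no earlier than any block that reads its target.

```r
# Largest block index recorded in `tbl` for any name in `keys` (0 if none).
last_group <- function(tbl, keys) {
  vals <- tbl[keys]            # names not yet recorded come back as NA
  vals[is.na(vals)] <- 0L
  max(c(0L, vals))
}

# Greedy partition of named assignment strings into dependency-free blocks.
partition_sketch <- function(steps) {
  targets <- names(steps)
  last_write <- integer(0)     # name -> last block that assigns it
  last_read <- integer(0)      # name -> last block that reads it
  groups <- list()
  for (i in seq_along(steps)) {
    target <- targets[[i]]
    expr_text <- as.character(steps[[i]])
    used <- all.vars(parse(text = expr_text))  # names the expression reads
    g <- max(
      1L,
      last_group(last_write, used) + 1L,    # after producers of its inputs
      last_group(last_write, target) + 1L,  # after earlier writes of target
      last_group(last_read, target))        # not before readers of target
    if (g > length(groups)) {
      groups[[g]] <- character(0)
    }
    groups[[g]] <- c(groups[[g]], stats::setNames(expr_text, target))
    last_write[target] <- g
    for (v in used) {
      last_read[v] <- max(last_group(last_read, v), g)
    }
  }
  names(groups) <- sprintf("group%05d", seq_along(groups))
  groups
}

sketch_steps <- c(
  valuesA = "ifelse(is.na(valuesA) & canUseFix1, fix1, valuesA)",
  valuesA = "ifelse(is.na(valuesA) & canUseFix2, fix2, valuesA)",
  valuesB = "ifelse(is.na(valuesB) & canUseFix1, fix1, valuesB)",
  valuesB = "ifelse(is.na(valuesB) & canUseFix2, fix2, valuesB)")
sketch_plan <- partition_sketch(sketch_steps)
print(sketch_plan)
```

On this example the sketch produces two blocks, with the two canUseFix1 statements in the first and the two canUseFix2 statements in the second.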

We can slow that down and see how the underlying planning functions break the assignments down into a small number of safe blocks (here we are using the development wrapr 1.0.2 function qae()).

packageVersion("wrapr")
## [1] '1.0.3'
steps <- qae(
  valuesA := ifelse(is.na(valuesA) & canUseFix1, 
                    fix1, valuesA),
  valuesA := ifelse(is.na(valuesA) & canUseFix2, 
                    fix2, valuesA),
  valuesB := ifelse(is.na(valuesB) & canUseFix1, 
                    fix1, valuesB),
  valuesB := ifelse(is.na(valuesB) & canUseFix2, 
                    fix2, valuesB))
print(steps)
## $valuesA
## [1] "ifelse(is.na(valuesA) & canUseFix1, fix1, valuesA)"
## 
## $valuesA
## [1] "ifelse(is.na(valuesA) & canUseFix2, fix2, valuesA)"
## 
## $valuesB
## [1] "ifelse(is.na(valuesB) & canUseFix1, fix1, valuesB)"
## 
## $valuesB
## [1] "ifelse(is.na(valuesB) & canUseFix2, fix2, valuesB)"
plan <- partition_mutate_se(steps)
print(plan)
## $group00001
##                                              valuesA 
## "ifelse(is.na(valuesA) & canUseFix1, fix1, valuesA)" 
##                                              valuesB 
## "ifelse(is.na(valuesB) & canUseFix1, fix1, valuesB)" 
## 
## $group00002
##                                              valuesA 
## "ifelse(is.na(valuesA) & canUseFix2, fix2, valuesA)" 
##                                              valuesB 
## "ifelse(is.na(valuesB) & canUseFix2, fix2, valuesB)"
d %.>% 
  mutate_seb(., plan) %.>% 
  select_se(., c("valuesA", "valuesB")) %.>% 
  dplyr::collect(.) %.>% 
  knitr::kable(.)
|valuesA  |valuesB  |
|:--------|:--------|
|A        |B        |
|Fix_1_V2 |Fix_1_V2 |
|Fix_2_V3 |Fix_2_V3 |
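To see why blocks without in-block dependencies are safe, we can imitate the block-by-block application on a local data.frame in base R. In the sketch below every expression in a block is evaluated against the same snapshot of the data, so no statement can see another statement's result from the same block. `apply_blocks_sketch()`, `d_local`, and `blocks` are hypothetical names for this illustration; this is not how mutate_seb() is implemented.

```r
# Apply each block in turn. Within a block, all expressions are evaluated
# against the data as it stood before the block, then assigned together.
apply_blocks_sketch <- function(df, blocks) {
  for (block in blocks) {
    new_cols <- lapply(block, function(expr_text) {
      eval(parse(text = expr_text), envir = df, enclos = baseenv())
    })
    for (j in seq_along(new_cols)) {
      df[[names(block)[[j]]]] <- new_cols[[j]]
    }
  }
  df
}

d_local <- data.frame(
  valuesA = c("A", NA, NA),
  valuesB = c("B", NA, NA),
  canUseFix1 = c(TRUE, TRUE, FALSE),
  fix1 = c("Fix_1_V1", "Fix_1_V2", "Fix_1_V3"),
  canUseFix2 = c(FALSE, FALSE, TRUE),
  fix2 = c("Fix_2_V1", "Fix_2_V2", "Fix_2_V3"),
  stringsAsFactors = FALSE)

# The two blocks from the printed plan, written out by hand.
blocks <- list(
  group00001 = c(
    valuesA = "ifelse(is.na(valuesA) & canUseFix1, fix1, valuesA)",
    valuesB = "ifelse(is.na(valuesB) & canUseFix1, fix1, valuesB)"),
  group00002 = c(
    valuesA = "ifelse(is.na(valuesA) & canUseFix2, fix2, valuesA)",
    valuesB = "ifelse(is.na(valuesB) & canUseFix2, fix2, valuesB)"))

res <- apply_blocks_sketch(d_local, blocks)
print(res[, c("valuesA", "valuesB")])
```

Because each block reads only values produced by earlier blocks, evaluating its statements simultaneously gives the same filled-in columns as the one-at-a-time calculation.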

Note that the current CRAN versions of wrapr and seplyr already implement the above work-around. Only some of the conveniences, such as printPlan = TRUE and qae(), require the development versions of these packages.