seplyr::partition_mutate_qt() is a service supplied by the package seplyr (version 0.5.0 or higher).

seplyr::partition_mutate_qt() partitions a sequence of assignments so that no statement uses any value created in the same partition element (or group). The partitions are in a format accepted by seplyr::mutate_se() for execution.

For such a partition the evaluation result does not depend on the order of execution of the statements within each group (as they are all independent of each other’s left-hand sides). A dependency-free partition with a small number of groups is very helpful when executing expressions on SQL-based data interfaces (such as Apache Spark).

The method used to partition expressions is to scan the remaining expressions in order, taking any that: have all their values available from earlier groups, do not use a value formed in the current group, and do not overwrite a value formed in the current group. The scan is repeated on the statements left over, forming one new group per pass, until no statements remain.
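The scan described above can be sketched in a few lines of base R. This is only an illustration of the idea, not seplyr's implementation: `partition_sketch()` is a hypothetical helper, it takes the assignments as a named character vector (names are left-hand sides, values are expression text), and it is deliberately conservative in that it treats every statement seen earlier in the current pass (taken or skipped) as blocking.

```r
# Hypothetical helper (not part of seplyr): greedy partition of assignments.
# steps: named character vector, names = left-hand sides, values = expressions.
partition_sketch <- function(steps) {
  # symbols each expression reads (function names such as rand are excluded)
  uses <- lapply(steps, function(e) all.vars(parse(text = e)))
  remaining <- seq_along(steps)
  groups <- list()
  while (length(remaining) > 0) {
    taken <- integer(0)
    pending_names <- character(0)  # left-hand sides seen so far in this pass
    for (i in remaining) {
      lhs <- names(steps)[[i]]
      ok <- !any(uses[[i]] %in% pending_names) &&  # inputs come from earlier groups
        !(lhs %in% pending_names)                  # no overwrite within this pass
      if (ok) taken <- c(taken, i)
      pending_names <- c(pending_names, lhs)
    }
    groups[[length(groups) + 1]] <- steps[taken]
    remaining <- setdiff(remaining, taken)
  }
  groups
}

steps <- c(
  rand_a   = "rand()",
  choice_a = "rand_a >= 0.5",
  a_1      = 'ifelse(choice_a, "treatment", "control")',
  rand_b   = "rand()",
  choice_b = "rand_b >= 0.5",
  b_1      = 'ifelse(choice_b, "treatment", "control")')

groups <- partition_sketch(steps)
length(groups)       # 3
names(groups[[1]])   # "rand_a" "rand_b"
```

The first remaining statement is always taken in each pass, so the loop terminates after at most one pass per statement.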

This partitioning method can lead to far fewer groups than the straightforward method of breaking up the sequence of expressions at each new-value use.

Here is a non-trivial example (notice we use := for assignment; that is a requirement of seplyr::mutate_se()):

library("seplyr")
#> Loading required package: wrapr

plan <- partition_mutate_qt(
  rand_a := rand(),
   choice_a := rand_a>=0.5, # first use of a new value 1
    a_1 := ifelse(choice_a, "treatment", "control"), # first use of a new value 2
    a_2 := ifelse(choice_a, "control", "treatment"),
  rand_b := rand(),
   choice_b := rand_b>=0.5, # first use of a new value 3
    b_1 := ifelse(choice_b, "treatment", "control"), # first use of a new value 4
    b_2 := ifelse(choice_b, "control", "treatment"),
  rand_c := rand(),
   choice_c := rand_c>=0.5, # first use of a new value 5
    c_1 := ifelse(choice_c, "treatment", "control"), # first use of a new value 6
    c_2 := ifelse(choice_c, "control", "treatment"),
  rand_d := rand(),
   choice_d := rand_d>=0.5, # first use of a new value 7
    d_1 := ifelse(choice_d, "treatment", "control"), # first use of a new value 8
    d_2 := ifelse(choice_d, "control", "treatment"),
  rand_e := rand(),
   choice_e := rand_e>=0.5, # first use of a new value 9
    e_1 := ifelse(choice_e, "treatment", "control"), # first use of a new value 10
    e_2 := ifelse(choice_e, "control", "treatment")
  )
print(plan)

#> $group00001
#>   rand_a   rand_b   rand_c   rand_d   rand_e 
#> "rand()" "rand()" "rand()" "rand()" "rand()" 
#> $group00002
#>        choice_a        choice_b        choice_c        choice_d 
#> "rand_a >= 0.5" "rand_b >= 0.5" "rand_c >= 0.5" "rand_d >= 0.5" 
#>        choice_e 
#> "rand_e >= 0.5" 
#> $group00003
#>                                            a_1 
#> "ifelse(choice_a, \"treatment\", \"control\")" 
#>                                            a_2 
#> "ifelse(choice_a, \"control\", \"treatment\")" 
#>                                            b_1 
#> "ifelse(choice_b, \"treatment\", \"control\")" 
#>                                            b_2 
#> "ifelse(choice_b, \"control\", \"treatment\")" 
#>                                            c_1 
#> "ifelse(choice_c, \"treatment\", \"control\")" 
#>                                            c_2 
#> "ifelse(choice_c, \"control\", \"treatment\")" 
#>                                            d_1 
#> "ifelse(choice_d, \"treatment\", \"control\")" 
#>                                            d_2 
#> "ifelse(choice_d, \"control\", \"treatment\")" 
#>                                            e_1 
#> "ifelse(choice_e, \"treatment\", \"control\")" 
#>                                            e_2 
#> "ifelse(choice_e, \"control\", \"treatment\")"

Notice seplyr::partition_mutate_qt() split the work into 3 groups. The straightforward method (with no statement re-ordering) of splitting into non-dependent groups would have to split the mutate at each first use of a new value, yielding 10 splits and therefore 11 mutate stages. For why a low number of execution stages is important please see here.
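The count of 11 stages for the straightforward method can be checked with a small base R sketch. `naive_partition_sketch()` is a hypothetical helper written for this illustration (it is not a seplyr function): it walks the statements in their original order and starts a new stage whenever a statement reads or overwrites a value formed in the current stage.

```r
# Hypothetical helper (not part of seplyr): in-order splitting, no re-ordering.
# steps: named character vector, names = left-hand sides, values = expressions.
naive_partition_sketch <- function(steps) {
  group_id <- integer(length(steps))
  current <- 1L
  formed <- character(0)  # left-hand sides formed in the current stage
  for (i in seq_along(steps)) {
    used <- all.vars(parse(text = steps[[i]]))
    lhs <- names(steps)[[i]]
    if (any(used %in% formed) || lhs %in% formed) {
      current <- current + 1L  # first use of a new value: start a new stage
      formed <- character(0)
    }
    group_id[[i]] <- current
    formed <- c(formed, lhs)
  }
  unname(split(steps, group_id))
}

# Rebuild the 20 statements of the example as text.
steps <- unlist(lapply(c("a", "b", "c", "d", "e"), function(g) {
  s <- c(
    "rand()",
    sprintf("rand_%s >= 0.5", g),
    sprintf('ifelse(choice_%s, "treatment", "control")', g),
    sprintf('ifelse(choice_%s, "control", "treatment")', g))
  names(s) <- paste0(c("rand_", "choice_", "", ""), g, c("", "", "_1", "_2"))
  s
}))

stages <- naive_partition_sketch(steps)
length(stages)  # 11
```

Each of the ten commented "first use of a new value" lines in the example triggers one split, which is where the 11 stages come from.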

To execute the statements on a data item “d” (an in-memory data.frame, a remote database handle, or a sparklyr handle) we would do something like the following:

res <- mutate_seb(d, plan)
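To make concrete what block-wise execution amounts to, here is a base R sketch on an in-memory data.frame. It is an assumption-laden illustration rather than how seplyr or a database actually runs the plan: `run_blocks_sketch()` is a hypothetical helper, and the toy plan below uses plain arithmetic because rand() in the example above is a database-side function, not an R one.

```r
# Hypothetical helper (not part of seplyr): apply a list of blocks in sequence.
# Every expression in a block is evaluated against the same input snapshot,
# so the order of statements inside a block cannot change the result.
run_blocks_sketch <- function(d, plan) {
  for (block in plan) {
    new_cols <- lapply(block, function(e) eval(parse(text = e), envir = d))
    d[names(new_cols)] <- new_cols
  }
  d
}

d <- data.frame(x = c(1, 2, 3))
toy_plan <- list(
  c(y = "x + 1", z = "x * 2"),  # block 1: reads only existing columns
  c(w = "y + z"))               # block 2: reads values formed in block 1
res <- run_blocks_sketch(d, toy_plan)
res$w  # 4 7 10
```

Reversing the statements inside the first block (z before y) gives the same values, which is the order-independence property the partition is designed to provide.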

A fully worked version of this example can be found here.