This third article shows our second example being processed by the rquery big data scale relational data operator system (currently in development).
We will repeat the steps from Partitioning Mutate, Example 2, using only the rquery package and DBI/sparklyr (no dplyr).
library("wrapr")
library("rquery")
class(sc)
## [1] "spark_connection" "spark_shell_connection"
## [3] "DBIConnection"
class(d)
## [1] "relop_table_source" "relop" "wrapr_applicable"
d %.>%
to_sql(., sc) %.>%
DBI::dbGetQuery(sc, .) %.>%
knitr::kable(.)
| rowNum | a_1 | a_2 | b_1 | b_2 | c_1 | c_2 | d_1 | d_2 | e_1 | e_2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
dQ <- d %.>%
extend_se(.,
if_else_block(
testexpr =
"rand()>=0.5",
thenexprs = qae(
a_1 := 'treatment',
a_2 := 'control'),
elseexprs = qae(
a_1 := 'control',
a_2 := 'treatment'))) %.>%
select_columns(., c("rowNum", "a_1", "a_2"))
cat(format(dQ))
table('d') %.>%
extend(.,
ifebtest_1 := rand() >= 0.5) %.>%
extend(.,
a_1 := ifelse(ifebtest_1, "treatment", a_1),
a_2 := ifelse(ifebtest_1, "control", a_2)) %.>%
extend(.,
a_1 := ifelse(!( ifebtest_1 ), "control", a_1),
a_2 := ifelse(!( ifebtest_1 ), "treatment", a_2)) %.>%
select_columns(., rowNum, a_1, a_2)
Notice the rquery extend_se command accepts the if_else_block and partitions it into conflict-free segments. Also the rquery presentation lets the user inspect the operation plan before attempting execution.
sql <- to_sql(dQ, sc)
DBI::dbGetQuery(sc, sql) %.>%
knitr::kable(.)
| rowNum | a_1 | a_2 |
|---|---|---|
| 1 | control | treatment |
| 2 | control | treatment |
| 3 | control | treatment |
| 4 | treatment | control |
| 5 | control | treatment |
The underlying SQL is fairly involved, but can be performant at big-data scale.
cat(sql)
SELECT
`rowNum`,
`a_1`,
`a_2`
FROM (
SELECT
`ifebtest_1`,
`rowNum`,
( CASE WHEN ( ( NOT ( `ifebtest_1` ) ) ) THEN ( "control" ) ELSE ( `a_1` ) END ) AS `a_1`,
( CASE WHEN ( ( NOT ( `ifebtest_1` ) ) ) THEN ( "treatment" ) ELSE ( `a_2` ) END ) AS `a_2`
FROM (
SELECT
`ifebtest_1`,
`rowNum`,
( CASE WHEN ( `ifebtest_1` ) THEN ( "treatment" ) ELSE ( `a_1` ) END ) AS `a_1`,
( CASE WHEN ( `ifebtest_1` ) THEN ( "control" ) ELSE ( `a_2` ) END ) AS `a_2`
FROM (
SELECT
`rowNum`,
`a_1`,
`a_2`,
rand ( ) >= 0.5 AS `ifebtest_1`
FROM (
SELECT
`d`.`rowNum`,
`d`.`a_1`,
`d`.`a_2`
FROM
`d`
) tsql_0000
) tsql_0001
) tsql_0002
) tsql_0003
Win-Vector LLC supplies a number of open-source R packages for working effectively with big data. These include:
dplyr) much easier.dplyr on big data.Partitioning mutate articles:
ifelse example.rquery example.Topics such as the above are often discussed on the Win-Vector blog.