vignettes/rquery_many_columns.Rmd
rquery_many_columns.Rmd
A common data manipulation need is: applying the same operation to a number of columns.
In the rquery
R
package we strongly recommend using value-oriented (or standard evaluation) for this task.
Here is what this looks like.
For our data lets build a simple data set similar to mtcars
:
library(wrapr)
library(rquery)
have_rqdatatable <- requireNamespace('rqdatatable', quietly = TRUE)
example_data <- wrapr::build_frame(
"mpg" , "cyl", "disp", "car" |
21 , 6 , 160 , "Mazda RX4" |
21 , 6 , 160 , "Mazda RX4 Wag" |
22.8 , 4 , 108 , "Datsun 710" |
21.4 , 6 , 258 , "Hornet 4 Drive" |
18.7 , 8 , 360 , "Hornet Sportabout" |
18.1 , 6 , 225 , "Valiant" )
knitr::kable(example_data)
mpg | cyl | disp | car |
---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 |
21.0 | 6 | 160 | Mazda RX4 Wag |
22.8 | 4 | 108 | Datsun 710 |
21.4 | 6 | 258 | Hornet 4 Drive |
18.7 | 8 | 360 | Hornet Sportabout |
18.1 | 6 | 225 | Valiant |
Now suppose for a number of columns we wish to perform a calculation, such centering it with respect to the grand average.
This is easily handled by first specifying the set of variables we wish to work with.
## [1] "mpg" "cyl" "disp"
Now we build up what we want as a name-vector of strings using the :=
named map builder.
## mpg cyl disp
## "mpg - mean(mpg)" "cyl - mean(cyl)" "disp - mean(disp)"
The idea is: the :=
operator fits into R
idiom by looking very much like a vectorized version of “names get assigned expressions”.
These expressions can then be used in an rquery
pipeline using the _se
-variant of extend()
: extend_se()
.
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg := mpg - mean(mpg),
## cyl := cyl - mean(cyl),
## disp := disp - mean(disp))
And this operator pipeline is ready to use (assuming we have rqdatatable
available):
if(have_rqdatatable) {
example_data %.>%
ops %.>%
knitr::kable(.)
}
mpg | cyl | disp | car |
---|---|---|---|
0.5 | 0 | -51.83333 | Mazda RX4 |
0.5 | 0 | -51.83333 | Mazda RX4 Wag |
2.3 | -2 | -103.83333 | Datsun 710 |
0.9 | 0 | 46.16667 | Hornet 4 Drive |
-1.8 | 2 | 148.16667 | Hornet Sportabout |
-2.4 | 0 | 13.16667 | Valiant |
The expression construction can also be done inside the extend_se()
operator.
ops <- local_td(example_data) %.>%
extend_se(., vars := paste0(vars, ' - mean(', vars, ')'))
cat(format(ops))
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg := mpg - mean(mpg),
## cyl := cyl - mean(cyl),
## disp := disp - mean(disp))
Note: the above is only a notional example to demonstrate the operations; for supervised machine learning we would probably use base::scale()
, which saves the learned centering for later re-use on new data.
The point is: we use standard R
tools to build up the lists of names and operations. We are not restricted to any single argument manipulation grammar.
For example we could build all interaction terms as follows.
combos <- t(combn(vars, 2))
interactions <-
paste0(combos[, 1], '_', combos[, 2]) :=
paste0(combos[, 1], ' * ', combos[, 2])
print(interactions)
## mpg_cyl mpg_disp cyl_disp
## "mpg * cyl" "mpg * disp" "cyl * disp"
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg_cyl := mpg * cyl,
## mpg_disp := mpg * disp,
## cyl_disp := cyl * disp)
It is a critical advantage to work with sets of variables as standard values. In this case what we can do is limited only by the power of R
itself.
Note: we also supply an alias for :=
as %:=%
for those that don’t want to confuse this assignment with how the symbol is used in data.table
. Take care that :=
has the precedence-level of an assignment and %:=%
has the precedence-level of a user defined operator.
As, as always, our queries can be used on data.
if(have_rqdatatable) {
example_data %.>%
ops %.>%
knitr::kable(.)
}
mpg | cyl | disp | car | mpg_cyl | mpg_disp | cyl_disp |
---|---|---|---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 | 126.0 | 3360.0 | 960 |
21.0 | 6 | 160 | Mazda RX4 Wag | 126.0 | 3360.0 | 960 |
22.8 | 4 | 108 | Datsun 710 | 91.2 | 2462.4 | 432 |
21.4 | 6 | 258 | Hornet 4 Drive | 128.4 | 5521.2 | 1548 |
18.7 | 8 | 360 | Hornet Sportabout | 149.6 | 6732.0 | 2880 |
18.1 | 6 | 225 | Valiant | 108.6 | 4072.5 | 1350 |
Or even in a database.
have_db <- requireNamespace("DBI", quietly = TRUE) &&
requireNamespace("RSQLite", quietly = TRUE)
if(have_db) {
raw_connection <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
RSQLite::initExtension(raw_connection)
db <- rquery_db_info(
connection = raw_connection,
is_dbi = TRUE,
connection_options = rq_connection_tests(raw_connection))
rq_copy_to(db, 'example_data',
example_data,
temporary = TRUE,
overwrite = TRUE)
sql <- to_sql(ops, db)
cat(format(sql))
}
## SELECT
## `mpg`,
## `cyl`,
## `disp`,
## `car`,
## `mpg` * `cyl` AS `mpg_cyl`,
## `mpg` * `disp` AS `mpg_disp`,
## `cyl` * `disp` AS `cyl_disp`
## FROM (
## SELECT
## `mpg`,
## `cyl`,
## `disp`,
## `car`
## FROM
## `example_data`
## ) tsql_10091297325049892353_0000000000
if(have_db) {
res_table <- materialize(db, ops)
DBI::dbReadTable(raw_connection, res_table$table_name) %.>%
knitr::kable(.)
}
mpg | cyl | disp | car | mpg_cyl | mpg_disp | cyl_disp |
---|---|---|---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 | 126.0 | 3360.0 | 960 |
21.0 | 6 | 160 | Mazda RX4 Wag | 126.0 | 3360.0 | 960 |
22.8 | 4 | 108 | Datsun 710 | 91.2 | 2462.4 | 432 |
21.4 | 6 | 258 | Hornet 4 Drive | 128.4 | 5521.2 | 1548 |
18.7 | 8 | 360 | Hornet Sportabout | 149.6 | 6732.0 | 2880 |
18.1 | 6 | 225 | Valiant | 108.6 | 4072.5 | 1350 |
if(have_db) {
DBI::dbDisconnect(raw_connection)
}