rqdatatable is an implementation of the rquery piped Codd-style relational algebra hosted on data.table. rquery allows complex transformations to be expressed as a series of relational operators, and rqdatatable implements those operators using data.table.

For example, scoring a logistic regression model (which requires grouping, ordering, and ranking) can be organized as follows. For more on this example, please see “Let’s Have Some Sympathy For The Part-time R User”.

library("rqdatatable")
## Loading required package: rquery
scale <- 0.237

# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale))  %.>% 
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>% 
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID', 
                      'diagnosis', 
                      'probability')) %.>%
  orderby(., cols = 'subjectID')

We can show the expanded form of the query tree.

cat(format(rquery_pipeline))
table(dL; 
  subjectID,
  surveyCategory,
  assessmentTotal) %.>%
 extend(.,
  probability := exp(assessmentTotal * 0.237)) %.>%
 extend(.,
  probability := probability / sum(probability),
  p= subjectID) %.>%
 extend(.,
  row_number := row_number(),
  p= subjectID,
  o= "probability" DESC, "surveyCategory" DESC) %.>%
 select_rows(.,
   row_number <= 1) %.>%
 rename(.,
  c('diagnosis' = 'surveyCategory')) %.>%
 select_columns(.,
   subjectID, diagnosis, probability) %.>%
 orderby(., subjectID)

And execute it using data.table.

ex_data_table(rquery_pipeline)
##    subjectID           diagnosis probability
## 1:         1 withdrawal behavior   0.6706221
## 2:         2 positive re-framing   0.5589742

One can also apply the pipeline to new tables.

##    subjectID           diagnosis probability
## 1:         7 positive re-framing   0.9722128
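The call that applies the pipeline to a new table is not shown above. As a minimal sketch of one way to do it, the example below assumes the `tables` argument of `ex_data_table()` is used to bind a data frame to the table name `dL`. The data frame `d_new` and its values are hypothetical, the pipeline is shortened to the first three stages of the example, and `mk_td()` is used here only to declare the table's name and columns without attaching data.

```r
library("rqdatatable")

# Table description only: a name plus column names, no data attached.
td <- mk_td("dL", c("subjectID", "surveyCategory", "assessmentTotal"))

scale <- 0.237

# Shortened version of the example pipeline (first three stages only).
short_pipeline <- td %.>%
  extend_nse(.,
             probability := exp(assessmentTotal * scale)) %.>%
  normalize_cols(.,
                 "probability",
                 partitionby = "subjectID") %.>%
  pick_top_k(.,
             k = 1,
             partitionby = "subjectID",
             orderby = c("probability", "surveyCategory"),
             reverse = c("probability", "surveyCategory"))

# Hypothetical new data with the same columns as dL.
d_new <- data.frame(
  subjectID = c(7, 7),
  surveyCategory = c("withdrawal behavior", "positive re-framing"),
  assessmentTotal = c(5, 20),
  stringsAsFactors = FALSE)

# Bind the new data to the table name the pipeline expects, then execute.
res <- ex_data_table(short_pipeline, tables = list(dL = d_new))
print(res)
```

Because the pipeline refers to its input only by table name and column names, the same pipeline object can be reused with any data frame that supplies those columns.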

Initial benchmarking of rqdatatable is very favorable (notes here).

Note that rqdatatable has an “immediate mode”, which allows pipeline stages to be applied directly to data without pre-assembling the pipeline. “Immediate mode” is a convenience for ad-hoc analyses and has some negative performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found here.
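As a rough illustration of the difference, the sketch below applies a single stage directly to a data frame rather than to a table description. The data frame `d` and its values are hypothetical, and the scale constant is written inline for simplicity.

```r
library("rqdatatable")

# Hypothetical data with the same columns as the example table.
d <- data.frame(
  subjectID = c(1, 1),
  surveyCategory = c("withdrawal behavior", "positive re-framing"),
  assessmentTotal = c(5, 2),
  stringsAsFactors = FALSE)

# Immediate mode: the operator receives data, so it executes right away
# and returns a result table instead of a pipeline node.
d2 <- d %.>%
  extend_nse(.,
             probability := exp(assessmentTotal * 0.237))

print(d2)
```

Each stage applied this way is executed on its own, which is where the performance cost mentioned above comes from when several stages are chained.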

rqdatatable is a fairly complete implementation of rquery. The main difference is that the rqdatatable versions of sql_node() and theta_join() are implemented by round-tripping through a database handle specified by the rquery.rquery_db_executor option, so they are not very desirable implementations.

To install rqdatatable, please use install.packages("rqdatatable").