cdata is a general data re-shaper that has the great virtue of adhering to the so-called “Rule of Representation”:

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

Briefly: cdata supplies data transform operators that:

  • Work on local data or with any DBI data source.
  • Are powerful generalizations of the operations commonly called pivot and un-pivot.
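
To make "pivot" and "un-pivot" concrete, here is a minimal sketch on a small local data frame. The data frame `d`, its values, and the column names `id`, `meas`, and `val` are made up for illustration. The control table is an ordinary data frame: its first column holds the keys, and the remaining column names which source columns feed each key.

```r
library("cdata")

# hypothetical wide data: one row record per id
d <- data.frame(
  id  = c(1, 2),
  AUC = c(0.6, 0.7),
  R2  = c(0.2, 0.3))

# control table: key column "meas", value column "val"
controlTable <- data.frame(
  meas = c("AUC", "R2"),
  val  = c("AUC", "R2"),
  stringsAsFactors = FALSE)

# un-pivot: each row record becomes a two-row block
blocks <- rowrecs_to_blocks(
  d,
  controlTable,
  columnsToCopy = "id")
print(blocks)

# pivot: the same control table drives the inverse transform
back <- blocks_to_rowrecs(
  blocks,
  keyColumns = "id",
  controlTable = controlTable)
print(back)
```

The same control table describes both directions, which is the sense in which the transform is specified by data rather than by code.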

A quick example: plot iris petal and sepal dimensions in a faceted graph.

iris <- data.frame(iris)

library("ggplot2")
library("cdata")

#
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
   flower_part, Length      , Width       |
   Petal    , Petal.Length, Petal.Width |
   Sepal    , Sepal.Length, Sepal.Width )

# do the unpivot to convert the row records to block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = c("Species"))


ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_wrap(~flower_part, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +  scale_color_brewer(palette = "Dark2")

More details on the above example can be found here. A tutorial on how to design a controlTable can be found here.
And some discussion of the nature of records in cdata can be found here.
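
One way to get a feel for a control table is to apply it to a single row record and inspect the resulting block. The sketch below reuses the iris control table from the example above; the names `one_flower` and `one_block` are illustrative only.

```r
library("cdata")

# same control table as in the faceted-plot example
controlTable <- wrapr::qchar_frame(
   flower_part, Length      , Width       |
   Petal      , Petal.Length, Petal.Width |
   Sepal      , Sepal.Length, Sepal.Width )

# take a single row record
one_flower <- data.frame(iris)[1, , drop = FALSE]

# convert it to a block record: one row per flower_part
one_block <- rowrecs_to_blocks(
  one_flower,
  controlTable,
  columnsToCopy = "Species")
print(one_block)
```

The block should have one row per row of the control table, with the `Length` and `Width` cells filled from the source columns named in the corresponding control table cells.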


We can also exhibit a larger example of using cdata to create a scatter-plot matrix, or pair plot:

iris <- data.frame(iris)

library("ggplot2")
library("cdata")

# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"

# build a control table with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))
# name the value columns value1 and value2
colnames(controlTable) <- qc(value1, value2)
# insert the first (key) column
controlTable <- cbind(
  data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
             stringsAsFactors = FALSE),
  controlTable)


# do the unpivot to convert the row records to multiple block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = category_variable)

# unpack the key column into two variable keys for the facet_grid
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$v1 <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$v2 <- vapply(splt, function(si) si[[2]], character(1))


ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes(color = .data[[category_variable]],
                 shape = .data[[category_variable]])) + 
  facet_grid(v2~v1, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

The above is now wrapped into a one-line command in WVPlots.
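
As a rough sketch, and assuming the one-line command meant here is `WVPlots::PairPlot` (the text above does not name the function), the call would look something like the following. It requires the WVPlots package to be installed.

```r
library("WVPlots")

# columns to plot against each other
meas_vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# assumed entry point: PairPlot(d, meas_vars, title, group_var = ...)
p <- PairPlot(
  iris,
  meas_vars,
  "Iris dimensions",
  group_var = "Species")
print(p)
```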

And a quick database example:

library("cdata")
library("rquery")

use_spark <- TRUE

if(use_spark) {
  my_db <- sparklyr::spark_connect(version='2.2.0', 
                                   master = "local")
} else {
  my_db <- DBI::dbConnect(RSQLite::SQLite(),
                          ":memory:")
}



# pivot example
d <- wrapr::build_frame(
   "meas", "val" |
   "AUC" , 0.6   |
   "R2"  , 0.2   )
DBI::dbWriteTable(my_db,
                  'd',
                  d,
                  temporary = TRUE)
rstr(my_db, 'd')
## table `d` spark_connection spark_shell_connection DBIConnection 
##  nrow: 2 
## 'data.frame':    2 obs. of  2 variables:
##  $ meas: chr  "AUC" "R2"
##  $ val : num  0.6 0.2
td <- db_td(my_db, "d")
td
## [1] "table(`d`; meas, val)"
cT <- td %.>%
  build_pivot_control(.,
                      columnToTakeKeysFrom= 'meas',
                      columnToTakeValuesFrom= 'val') %.>%
  execute(my_db, .)
print(cT)
##   meas val
## 1  AUC AUC
## 2   R2  R2
tab <- td %.>%
  blocks_to_rowrecs(.,
                    keyColumns = NULL,
                    controlTable = cT,
                    temporary = FALSE) %.>%
  materialize(my_db, .)

print(tab)
## [1] "table(`rquery_mat_50172924785930095101_0000000000`; AUC, R2)"
rstr(my_db, tab)
## table `rquery_mat_50172924785930095101_0000000000` spark_connection spark_shell_connection DBIConnection 
##  nrow: 1 
## 'data.frame':    1 obs. of  2 variables:
##  $ AUC: num 0.6
##  $ R2 : num 0.2
if(use_spark) {
  sparklyr::spark_disconnect(my_db)
} else {
  DBI::dbDisconnect(my_db)
}
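
For comparison, the same pivot can be sketched on a local data frame with no database connection. The data below is invented for illustration, and an `id` column is added as an assumption so that each row record has an explicit key.

```r
library("cdata")

# hypothetical block records: two measurements per id
d <- data.frame(
  id   = c(1, 1, 2, 2),
  meas = c("AUC", "R2", "AUC", "R2"),
  val  = c(0.6, 0.2, 0.7, 0.3),
  stringsAsFactors = FALSE)

# derive the control table from the data itself
cT <- build_pivot_control(
  d,
  columnToTakeKeysFrom = "meas",
  columnToTakeValuesFrom = "val")
print(cT)

# pivot: one row record per id, one column per measurement
tab <- blocks_to_rowrecs(
  d,
  keyColumns = "id",
  controlTable = cT)
print(tab)
```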

The cdata package is a demonstration of the “coordinatized data” theory and includes an implementation of the “fluid data” methodology. The recommended tutorial is: Fluid data reshaping with cdata. We also have a short free cdata screencast (and another example can be found here).


Install via CRAN:

install.packages("cdata")

Or from GitHub using devtools:

devtools::install_github("WinVector/cdata")

Note: cdata is targeted at data with “tame column names” (column names that are valid both in databases, and as R unquoted variable names) and basic types (column values that are simple R types such as character, numeric, logical, and so on).
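
A quick pre-check for the R half of this requirement can be sketched in base R. The helper name `has_r_safe_names` is hypothetical and not part of cdata, and it does not test whether the names are also acceptable to a particular database.

```r
# hypothetical helper: TRUE when every column name is already a
# unique, syntactically valid R name (so make.names() changes nothing)
has_r_safe_names <- function(d) {
  nms <- colnames(d)
  identical(nms, make.names(nms, unique = TRUE))
}

# iris column names are valid unquoted R names
print(has_r_safe_names(iris))

# names with spaces or leading digits are not
wild <- data.frame(1, 2)
colnames(wild) <- c("my col", "2nd")
print(has_r_safe_names(wild))
```

If the check fails, renaming the columns before reshaping (and mapping the names back afterwards) is one possible workaround.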