Data Wrangling at Scale

John Mount, Win-Vector LLC

2018-01-25

As is often said, “R is its packages.”

One great example of this is the sparklyr package, which allows R to work with big data. In this note we will use sparklyr, but concentrate on another important package called cdata.

In the following example the table we are working with is remote. For all we know it could be distributed across a large remote data center. Working with remote data tools can be like working with a Waldo system: powerful, but a touch awkward.
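This example assumes an existing sparklyr connection sc with a "starwars" table already present on the remote system. For readers who want to run it locally, a minimal setup might look like the following (a sketch only: the local Spark master and the copy_to() step stand in for data that, in practice, would already live remotely):

library("sparklyr")

# A local Spark instance standing in for a remote cluster (sketch only).
sc <- spark_connect(master = "local")

# Copy a small example table to the Spark side; in real applications the
# data would already be resident on the remote system.
copy_to(sc,
        dplyr::select(dplyr::starwars, name, height, mass),
        name = "starwars",
        overwrite = TRUE)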

library("seplyr")

d <- dplyr::tbl(sc, "starwars") %.>%
  arrange_se(., "name")
print(d)
## # Source: table<starwars> [?? x 3]
## # Database: spark_connection
## # Ordered by: name
##    name                height  mass
##    <chr>                <int> <dbl>
##  1 Ackbar                 180  83.0
##  2 Adi Gallia             184  50.0
##  3 Anakin Skywalker       188  84.0
##  4 Arvel Crynyd            NA NaN  
##  5 Ayla Secura            178  55.0
##  6 BB8                     NA NaN  
##  7 Bail Prestor Organa    191 NaN  
##  8 Barriss Offee          166  50.0
##  9 Ben Quadinaros         163  65.0
## 10 Beru Whitesun lars     165  75.0
## # ... with more rows

Suppose we needed to un-pivot this data into a row-oriented representation. Big data transform steps can often achieve a much higher degree of parallelization with “tall data” (many rows, fewer columns). With the cdata package this transform is easy and performant, as we show below.

library("cdata")

# Describe the unpivot: height and mass become measurement/value pairs.
cT <- build_unpivot_control(
  nameForNewKeyColumn = "measurement",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c('height', 'mass') )

# Perform the transform on the remote table; the result is the name of a
# new remote table, not a local value.
dTName <- rowrecs_to_blocks_q(
  "starwars", 
  controlTable = cT, 
  my_db = sc, 
  columnsToCopy = "name")

str(dTName)
##  chr "mvtrq_rrakp3dqsorimj4ohzgh_0000000001"
qlook(sc, dTName)
## table `mvtrq_rrakp3dqsorimj4ohzgh_0000000001` spark_connection spark_shell_connection DBIConnection 
##  nrow: 174 
##  NOTE: "obs" below is count of sample, not number of rows of data.
## 'data.frame':    10 obs. of  3 variables:
##  $ name       : chr  "Luke Skywalker" "Luke Skywalker" "C-3PO" "C-3PO" ...
##  $ measurement: chr  "height" "mass" "height" "mass" ...
##  $ value      : num  172 77 167 75 96 32 202 136 150 49

cdata can work fluidly with dplyr. However, cdata does not require dplyr.
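For instance, the derived table can be examined with plain DBI calls, no dplyr involved (a sketch; the query text here is illustrative):

# Inspect the derived remote table directly through DBI (no dplyr needed).
DBI::dbGetQuery(
  sc,
  paste0("SELECT * FROM ", dTName, " ORDER BY name LIMIT 6"))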

dT <- dplyr::tbl(sc, dTName)

dT %.>%
  arrange_se(., c("name", "measurement")) %.>%
  head(., n = 12) %.>%
  knitr::kable(.)
name               measurement    value
-----------------  ------------  ------
Ackbar             height           180
Ackbar             mass              83
Adi Gallia         height           184
Adi Gallia         mass              50
Anakin Skywalker   height           188
Anakin Skywalker   mass              84
Arvel Crynyd       height           NaN
Arvel Crynyd       mass             NaN
Ayla Secura        height           178
Ayla Secura        mass              55
BB8                height           NaN
BB8                mass             NaN

cdata is itself based on SQL aggregations, using the DBI package. This is why “BB8”’s NA (likely represented on the remote system as a NULL) gets changed to a NaN.
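To give a feel for what runs on the database side, an unpivot of this shape can be written as a union of per-measurement selections. The following is only roughly equivalent SQL (a sketch, not the exact statements cdata generates), issued through DBI:

# Roughly equivalent SQL for the unpivot (illustrative sketch only;
# cdata generates its own statements).
head(DBI::dbGetQuery(sc, "
  SELECT name, 'height' AS measurement, height AS value FROM starwars
  UNION ALL
  SELECT name, 'mass' AS measurement, mass AS value FROM starwars"))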

The important thing is: the data stays remote (class(dT) is tbl_spark, tbl_sql, tbl_lazy, tbl), and all operations are performed on the remote big data system, possibly in parallel. Note that the tidyr package does not currently (version 0.7.2) work with remote data; tidyr::gather(d, key = measurement, value = value, mass, height) fails with “Error in gather.default(d, key = measurement, value = value, mass, height): object ‘height’ not found”.
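A quick way to confirm that the work is being pushed to the remote system is to ask dplyr to show the SQL it would issue for a pending operation (a small check using the dT handle from above):

# The handle is lazy: dplyr translates the pipeline to SQL that runs remotely.
dplyr::show_query(dplyr::arrange(dT, name, measurement))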

The cdata arguments are deliberately verbose. This is because at some point you may have to read your own code, and it is best to be reminded about the role of each function argument at that time.

Notice that cdata takes the names of tables as arguments and also returns names of tables (not values or dplyr::tbl() handles). This is why we show the explicit use of dplyr::tbl() to produce remote data handles in this example. Basing storage on names instead of handles means that for production code you need tools to help manage table lifetime; cdata::makeTempNameGenerator() is supplied for that purpose. As in-memory data doesn't have representation names in the same sense remote data does, cdata supplies in-memory value-to-value functions: rowrecs_to_blocks(), buildPivotControlTableD(), and blocks_to_rowrecs().
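For example, the same unpivot can be performed entirely in memory on a local data.frame, and a name generator can be used to manage temporary remote tables (a sketch; d_local is a small made-up stand-in for the remote data, and the control table cT from above is re-used):

# In-memory (value-to-value) version of the same unpivot (sketch).
d_local <- data.frame(
  name = c("Ackbar", "Adi Gallia"),
  height = c(180L, 184L),
  mass = c(83, 50),
  stringsAsFactors = FALSE)

rowrecs_to_blocks(
  d_local,
  controlTable = cT,
  columnsToCopy = "name")

# A temp-name generator for managing remote table lifetime (sketch).
tmp_namer <- cdata::makeTempNameGenerator("example")
tmp_namer()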

The reverse operation (often called pivoting) is also easy (and in fact shares the same control structures).

# Convert back to one row per name, with height and mass as columns; the
# returned table name is immediately wrapped in a dplyr handle.
dL <- blocks_to_rowrecs_q(
  dTName,
  keyColumns = "name",
  controlTable = cT,
  my_db = sc ) %.>%
  dplyr::tbl(sc, .)

dL %.>%
  arrange_se(., "name") %.>%
  head(., n = 6) %.>%
  knitr::kable(.)
name               height   mass
-----------------  -------  -----
Ackbar                 180     83
Adi Gallia             184     50
Anakin Skywalker       188     84
Arvel Crynyd           NaN    NaN
Ayla Secura            178     55
BB8                    NaN    NaN

More on these sorts of transforms can be found in the following notes (the source code that produced this article can be downloaded from here):