“
R
is its packages.”
One great example of this is: the sparklyr
package, which allows R
to work with big data. In this note we we will use the sparklyr
package, but concentrate on another important package called cdata
.
In the following example the table we are working with is remote. For all we know it could be distributed on a large remote data center.1 Working with remote data tools can be like working with a Waldo system: powerful, but a touch awkward.
library("seplyr")
d <- dplyr::tbl(sc, "starwars") %.>%
arrange_se(., "name")
print(d)
## # Source: table<starwars> [?? x 3]
## # Database: spark_connection
## # Ordered by: name
## name height mass
## <chr> <int> <dbl>
## 1 Ackbar 180 83.0
## 2 Adi Gallia 184 50.0
## 3 Anakin Skywalker 188 84.0
## 4 Arvel Crynyd NA NaN
## 5 Ayla Secura 178 55.0
## 6 BB8 NA NaN
## 7 Bail Prestor Organa 191 NaN
## 8 Barriss Offee 166 50.0
## 9 Ben Quadinaros 163 65.0
## 10 Beru Whitesun lars 165 75.0
## # ... with more rows
Suppose we needed to un-pivot this data into a row oriented representation. Often big data transform steps can achieve a much higher degree of parallelization with “tall data”2 many rows, fewer columns. With the cdata
package this transform is easy and performant, as we show below.
library("cdata")
cT <- build_unpivot_control(
nameForNewKeyColumn = "measurement",
nameForNewValueColumn = "value",
columnsToTakeFrom = c('height', 'mass') )
dTName <- rowrecs_to_blocks_q(
"starwars",
controlTable = cT,
my_db = sc,
columnsToCopy = "name")
str(dTName)
## chr "mvtrq_rrakp3dqsorimj4ohzgh_0000000001"
qlook(sc, dTName)
## table `mvtrq_rrakp3dqsorimj4ohzgh_0000000001` spark_connection spark_shell_connection DBIConnection
## nrow: 174
## NOTE: "obs" below is count of sample, not number of rows of data.
## 'data.frame': 10 obs. of 3 variables:
## $ name : chr "Luke Skywalker" "Luke Skywalker" "C-3PO" "C-3PO" ...
## $ measurement: chr "height" "mass" "height" "mass" ...
## $ value : num 172 77 167 75 96 32 202 136 150 49
cdata
can work fluidly with dplyr
.3 However, cdata
does not require dplyr
.
dT <- dplyr::tbl(sc, dTName)
dT %.>%
arrange_se(., c("name", "measurement")) %.>%
head(., n = 12) %.>%
knitr::kable(.)
name | measurement | value |
---|---|---|
Ackbar | height | 180 |
Ackbar | mass | 83 |
Adi Gallia | height | 184 |
Adi Gallia | mass | 50 |
Anakin Skywalker | height | 188 |
Anakin Skywalker | mass | 84 |
Arvel Crynyd | height | NaN |
Arvel Crynyd | mass | NaN |
Ayla Secura | height | 178 |
Ayla Secura | mass | 55 |
BB8 | height | NaN |
BB8 | mass | NaN |
cdata
is itself is based on SQL
aggregations, using the DBI
package.4 This is why “BB8”’s NA
(likely represented on the remote system as a NULL
) gets changed to a NaN
.
The important thing is: the data stays remote5 class(dT)
: tbl_spark, tbl_sql, tbl_lazy, tbl, all operations are performed on the remote big data system, possibly in parallel.6 Note: the tidyr
package does not currently (0.7.2) work with remote data: tidyr::gather(d, key=measurement, value=value, mass, height)
Error in gather.default(d, key = measurement, value = value, mass, height): object ‘height’ not found .
The cdata
arguments are deliberately verbose. This is because at some point you may have to read your own code, and it is best to be reminded about the role of each function argument at that time.
Notice cdata
takes names of tables as arguments and also returns names of tables (not values or dplyr::tbl()
handles). This is why we show the explicit use of dplyr::tbl()
to produce remote data handles in this example.7 Basing storage on names instead of handles means: for production code you need tools to help manage table lifetime. cdata::makeTempNameGenerator()
is supplied for that purpose. As in-memory data doesn’t have representation names in the same sense remote data does, cdata
supplies in-memory value to value functions: rowrecs_to_blocks()
, buildPivotControlTableD()
, and blocks_to_rowrecs()
.
The reverse operation (often called pivoting) is also easy (and in fact shares the same control structures).
dL <- blocks_to_rowrecs_q(
dTName,
keyColumns = "name",
controlTable = cT,
my_db = sc ) %.>%
dplyr::tbl(sc, .)
dL %.>%
arrange_se(., "name") %.>%
head(., n = 6) %.>%
knitr::kable(.)
name | height | mass |
---|---|---|
Ackbar | 180 | 83 |
Adi Gallia | 184 | 50 |
Anakin Skywalker | 188 | 84 |
Arvel Crynyd | NaN | NaN |
Ayla Secura | 178 | 55 |
BB8 | NaN | NaN |
More8 The source code that produced this article can be downloaded from here. on these sort of transforms can be found in the following notes: