vtreat data set splitting

Motivation

vtreat supplies a number of data set splitting or cross-validation planning facilities. Some services are implicit, such as the simulated out-of-sample scoring of high degree of freedom derived variables (such as catB, catN, catD, and catP; see here for a list of variable types). Some services are explicit, such as vtreat::mkCrossFrameCExperiment and vtreat::mkCrossFrameNExperiment (please see here). And there is even a user-facing cross-validation planner in vtreat::buildEvalSets (try help(buildEvalSets) for details).

We (Nina Zumel and John Mount) have written a lot on structured cross-validation; the most relevant article being Random Test/Train Split is not Always Enough. The point is that in retrospective studies random test/train split is at best a simulation of how a model will be applied in the future. It is not an actual experimental design as in a randomized control trial. To be an effective simulation you must work to preserve structure that will be true in future application.

The overall idea is: a better splitting plan helps build a model that actually performs better in practice. And porting such a splitting plan back to your evaluation procedures gives you a better estimate of this future model performance.

A random test/train split attempts to preserve the following:

  • Future application data is exchangeable with training data (prior to model construction).
  • Future application data remains exchangeable with test data (even after model construction, as test data is not used in model construction).

Note that if there is concept change (also called an issue of non-stationarity) then future data is already not statistically exchangeable with training data (you cannot preserve a property you never had). However, even if your future data starts out exchangeable with your training data, there is at least one (often) un-modeled difference between training data and future application data:

  • Future application data tends to be formed after (or in the future of) training data.

This is usually an unstated structure of your problem solving plan: use annotated data from the past to build a supervised model for future un-annotated data.

Examples

With the above discussion under our belt we get back to the problem at hand. When creating an appropriate test/train split, we may have to consider one or more of the following:

  • Stratification: Stratification preserves the distribution or prevalence of the outcome variable (or any other variable, but vtreat only stratifies on y). For example, for a classification problem with a target class prevalence of 15%, stratifying on y insures that both the training and test sets have target class prevalence of precisely 15% (or as close to that as is possible), not just “around” 15%, as would happen with a simple randomized test/train split. This is especially important for modeling rare events.

  • Grouping: By “grouping” we mean not splitting closely related events between test and train: if a set of rows constitutes a “group,” then we want all of those rows to go either into test or into train – as a group. Typical examples are multiple events from a single customer (as you really want your model to predict the behavior of new customers) or records close together in time (as later application records will not be close in time to the original training records).

  • Structured back testing: Structured back testing preserves the order of time ordered events. In finance it is considered ridiculous to use data from a Monday and a Wednesday to build a model for prices on the intervening Tuesday – but this is the kind of thing that can happen if the training and evaluation data are partitioned using a simple random split.
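The effect of stratification is easy to check empirically. The sketch below (an illustration added here, not from the original article) draws a binary outcome with roughly 15% target prevalence and confirms that vtreat's kWayStratifiedY keeps that prevalence nearly constant across folds:

```r
library(vtreat)

set.seed(2353)
y <- rbinom(1000, 1, 0.15)  # binary outcome, ~15% target prevalence

# build a 3-way y-stratified split plan; this splitter does not need dframe
plan <- kWayStratifiedY(length(y), 3, NULL, y)

# target prevalence in each fold's application set stays close to 15%
sapply(plan, function(si) mean(y[si$app]))
```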

Our goal is for vtreat to be a domain-agnostic, y-aware data conditioner, so vtreat should y-stratify its data splits throughout. Prior to version 0.5.26 vtreat used simple random splits; with version 0.5.26 (currently available from GitHub) vtreat defaults to stratified sampling throughout. Respecting things like locality of record grouping or ordering in time is a domain issue and should be handled by the analyst.

Any splitting or stratification plan requires domain knowledge and represents a domain-sensitive trade-off among the competing goals of:

  • Having a random split.
  • Stability of distribution of outcome variable across splits.
  • Not cutting into “atomic” groups of records.
  • Not using data from the future to predict the past.
  • Having a lot of data in each split.
  • Having disjoint training and testing data.

As of version 0.5.26 vtreat supports this by allowing a user-specified data splitting function in which the analyst can encode their desired domain invariants. The user-implemented splitting function should have the signature

function(nRows,nSplits,dframe,y)

where

  • nRows is the number of rows you are trying to split
  • nSplits is the number of split groups you want
  • dframe is the original data frame (which may contain grouping or order columns that you want)
  • y is the outcome variable converted to numeric

The function should return a list of lists. The ith element should have slots train and app, where [[i]]$train designates the training data used to fit the model that evaluates the data designated by [[i]]$app.

This is easiest to show through an example:

## [[1]]
## [[1]]$train
## [1] 1 3
## 
## [[1]]$app
## [1] 2
## 
## 
## [[2]]
## [[2]]$train
## [1] 2
## 
## [[2]]$app
## [1] 1 3
## 
## 
## attr(,"splitmethod")
## [1] "kwaycross"

As we can see, the returned plan is a list of split sets; in each set the "application data rows" ([[i]]$app) and the "training data rows" ([[i]]$train) are complementary subsets of the row indexes (here a 2-way plan over 3 rows, as the "kwaycross" split method indicates). At the extreme, vtreat::oneWayHoldout builds n split sets in which each set's application data is a single row index and the corresponding training rows are the complementary row indexes: a leave-one-out cross-validation plan.
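A leave-one-out plan can be requested directly from vtreat::oneWayHoldout; a small sketch (NULL is passed for the arguments this splitter does not use):

```r
library(vtreat)

# leave-one-out over 3 rows: each split holds out exactly one row
plan <- oneWayHoldout(3, NULL, NULL, NULL)

# every application set is a single row index
sapply(plan, function(si) length(si$app))
```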

vtreat supplies a number of cross validation split/plan implementations:

  • kWayStratifiedY: k-way y-stratified cross-validation. This is the vtreat default splitting plan.
  • makekWayCrossValidationGroupedByColumn: k-way y-stratified cross-validation that preserves grouping (for example, keeping together all rows corresponding to a single customer or patient). This is a complex splitting plan, and only recommended when absolutely needed.
  • kWayCrossValidation: k-way un-stratified cross-validation.
  • oneWayHoldout: jackknife, or leave-one-out, cross-validation. Note that one-way hold-out can leak target expectations, so it is not preferred for nested model situations.
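The grouped splitter is built by a factory function: makekWayCrossValidationGroupedByColumn takes the name of a grouping column and returns a splitting function of the standard signature. A sketch (the customer column and toy data are illustrative assumptions):

```r
library(vtreat)

set.seed(25)
d <- data.frame(customer = rep(c("a", "b", "c", "d"), each = 3),
                y = rnorm(12))

# build a splitter that keeps each customer's rows together
splitFn <- makekWayCrossValidationGroupedByColumn("customer")
plan <- splitFn(nrow(d), 2, d, d$y)

# no customer appears on both the train and app sides of a split
for (si in plan) {
  stopifnot(length(intersect(d$customer[si$train],
                             d$customer[si$app])) == 0)
}
```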

The function buildEvalSets takes one of the above splitting functions as input and returns a cross-validation plan that instantiates the desired splitting, while also guarding against corner cases. You can also explicitly specify the splitting plan when designing a vtreat variable treatment plan using designTreatments[N/C] or mkCrossFrame[N/C]Experiment.
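For instance, the splitting plan can be overridden when designing treatments via the splitFunction argument; a sketch on toy data (the data frame and the choice of kWayCrossValidation are illustrative):

```r
library(vtreat)

set.seed(2018)
d <- data.frame(x = sample(letters[1:5], 100, replace = TRUE),
                y = rnorm(100))

# use plain (un-stratified) k-way cross-validation instead of the
# stratified default when scoring derived variables
treatments <- designTreatmentsN(d, "x", "y",
                                splitFunction = kWayCrossValidation,
                                verbose = FALSE)
```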

For issues beyond stratification the user may want to supply their own splitting plan. Such a function can be passed into any vtreat operation that accepts a splitFunction argument (such as mkCrossFrameNExperiment, designTreatmentsN, and many more). For example, we can pass a user-defined splitting function into vtreat::buildEvalSets as follows:
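The call below uses a splitting function named modularSplit whose definition did not survive in this copy of the article; a definition consistent with the printed output (round-robin assignment of rows to splits by row index modulo nSplits) would be:

```r
# round-robin splitter: row i goes to the application group (i mod nSplits)
modularSplit <- function(nRows, nSplits, dframe, y) {
  group <- seq_len(nRows) %% nSplits
  lapply(unique(group),
         function(gi) {
           list(train = which(group != gi),
                app = which(group == gi))
         })
}
```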

vtreat::buildEvalSets(nRows=25,nSplits=3,splitFunction=modularSplit)
## [[1]]
## [[1]]$train
##  [1]  2  3  5  6  8  9 11 12 14 15 17 18 20 21 23 24
## 
## [[1]]$app
## [1]  1  4  7 10 13 16 19 22 25
## 
## 
## [[2]]
## [[2]]$train
##  [1]  1  3  4  6  7  9 10 12 13 15 16 18 19 21 22 24 25
## 
## [[2]]$app
## [1]  2  5  8 11 14 17 20 23
## 
## 
## [[3]]
## [[3]]$train
##  [1]  1  2  4  5  7  8 10 11 13 14 16 17 19 20 22 23 25
## 
## [[3]]$app
## [1]  3  6  9 12 15 18 21 24
## 
## 
## attr(,"splitmethod")
## [1] "userfunction"

As stated above, the vtreat library code will try to use the user function for splitting, but will fall back to an appropriate vtreat function in corner cases that the user function may not handle (for example, too few rows, too few groups, and so on). Thus the user code can assume it is in a reasonable situation (and even safely return NULL if it can’t deal with the situation it is given). For example the following bad user split is detected and corrected:

badSplit <- function(nRows,nSplits,dframe,y) {
  # invalid plan: the train and app index sets fully overlap
  list(list(train=seq_len(nRows),app=seq_len(nRows)))
}
vtreat::buildEvalSets(nRows=5,nSplits=3,splitFunction=badSplit)
## Warning in doTryCatch(return(expr), name, parentenv, handler):
## vtreat::buildEvalSets user carve-up rejected: train and application slots
## overlap
## [[1]]
## [[1]]$train
## [1] 1 2 3 5
## 
## [[1]]$app
## [1] 4
## 
## 
## [[2]]
## [[2]]$train
## [1] 2 4 5
## 
## [[2]]$app
## [1] 3 1
## 
## 
## [[3]]
## [[3]]$train
## [1] 1 3 4
## 
## [[3]]$app
## [1] 2 5
## 
## 
## attr(,"splitmethod")
## [1] "kwaycross"

Notice above the returned split does not meet all of the original desiderata, but is guaranteed to be a useful data partition.

Implementations

The file outOfSample.R contains worked examples. In particular we would suggest running the code displayed when you type any of:

help(buildEvalSets)
help(kWayCrossValidation)
help(kWayStratifiedY)
help(oneWayHoldout)

For example, from help(kWayStratifiedY) we can see that the distribution of y is much more similar in each fold when we stratify than when we don't. In the output below (produced by the help example's code, which is not reproduced here), simpleGroup holds fold assignments from a simple random split and stratGroup holds fold assignments from a y-stratified split:

library('vtreat')
## Loading required package: wrapr
## [1] "Plan is good"
## [1] "Plan is good"
##            1            2            3            4            5 
## -0.059622525  0.068139081 -0.007774052  0.099774019 -0.106875074
# standard error of mean(y)
sd(tapply(d$y,d$simpleGroup,mean))
## [1] 0.08606286
##            1            2            3            4            5 
##  0.008797500 -0.011530915 -0.010448401  0.009648950 -0.002825685
# standard error of mean(y)
sd(tapply(d$y,d$stratGroup,mean))
## [1] 0.01015539

Notice the increased similarity of the distributions.

Conclusion

Controlling the way data is split in cross-validation – preserving the y-distribution, groups, and even ordering – can improve the real-world performance of models trained on such data. Obviously this adds some complexity and some “places to go wrong,” but it is a topic worth learning about.