[Reader’s Note. Some of our articles are applied and some are more theoretical. The following article is more theoretical, and requires fairly formal notation even to work through. However, it should be of interest as it touches on some fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We considered including some “simplifying explanatory diagrams,” but so many entities are introduced and manipulated by the processes we describe that we found equation notation to be cleaner than the diagrams we attempted and rejected.]

Please consider either of the following common predictive modeling tasks:

  • choosing or preparing variables (for example, “y-aware” variable selection or variable preparation) on your training data, and then fitting a model on that same data; or
  • selecting among models, tuning hyper-parameters, or performing dimension reduction on your training data, and then fitting and evaluating the chosen model on that same data.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

  • split your data into disjoint subsets and use a separate subset for each stage (often called train/test, or calibration/train/test splitting); or
  • use cross simulation (cross-frame) methods to simulate out-of-sample data at each stage.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience, the cross simulation method is supplied as a ready-to-go procedure in our R data cleaning and preparation package vtreat.

What would it look like if we insisted on using cross simulation or simulated out-of-sample techniques for all three (or more) stages of the pipeline?

[Image: “Clean all the things!” from Hyperbole and a Half, copyright Allie Brosh (use allowed in some situations with attribution).]

We mentioned two possible nested tasks to emphasize that nested model bias is not merely an issue of our own invention (despite our writings on variable selection and variable preparation). The issue also occurs with hyper-parameter tuning, model selection, and dimension reduction.

As we have said: if you have enough data you can split your data into a disjoint set for each stage (often called train/test, or calibration/train/test) and wish most of the problem away. However, if you are not feeling “data rich” (which rapidly happens with complex models, rare features, and unbalanced target classes), or your training data is expensive to produce (very true for many high-value data science projects), then you may want to be more statistically efficient. That said, there is a reason to always reserve a portion of data for the last-stage test (as we usually advise): it ensures you are testing properties of the actual model you are proposing to put into production (whereas cross-procedures mostly test the expected quality of your procedures, not the realized quality of your single model).

It is still an interesting problem: how do you properly organize nested cross simulation procedures so no additional data splitting is required? We are going to describe one such method here which is a generalization of k-way cross validation. We call the whole family of methods “nested cross simulation” (to emphasize both the multi-stage nature, and also that scoring or validation is not the only task we are removing bias from).

Notation/Definitions

[Image: “It’s all a question of notation.” (“A Bridge Too Far”, 1977, copyright United Artists.)]

Suppose we are drawing supervised training examples from an arbitrary data schema (a schema that may include numeric, categorical, and logical columns or variables) called \(\mathbb{S}\). We have an indexed list of training examples \(T = ((s_i,y_i) \;|\; i = 1, \cdots, m)\) where \(i\) is the example index, \(s_i \in \mathbb{S}\) is a vector of input or independent variables, and \(y_i\) is the outcome (numeric, but possibly zero/one representing a categorization target).
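For concreteness, here is a tiny R illustration of this notation; the column names, types, and sizes are invented for the example.

    # T as an R data frame: each row i carries the inputs s_i (mixed-type columns
    # drawn from the schema S) together with the numeric outcome y_i.
    set.seed(2017)
    m <- 10
    T_frame <- data.frame(
      x_cat  = sample(c("a", "b", "c"), m, replace = TRUE),  # categorical input column
      x_num  = rnorm(m),                                     # numeric input column
      x_flag = sample(c(TRUE, FALSE), m, replace = TRUE),    # logical input column
      y      = rnorm(m)                                      # outcome y_i
    )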

We define some functions:

  • \(buildF(\cdot)\): the variable treatment design procedure; it takes a list of training examples and returns a treatment function \(f\) that maps rows of \(\mathbb{S}\) to treated (all-numeric) rows.
  • \(buildG(\cdot)\): the model fitting procedure; it takes a list of treated training examples \((f(s_i), y_i)\) and returns a model or prediction function \(g\).
  • \(loss(y, \widehat{y})\): a loss function scoring how well the predictions \(\widehat{y}\) match the actual outcomes \(y\).

The naive method

The naive (not recommended) modeling method uses the same data over and over for data treatment design, model training, and scoring. In our notation this looks like:

  • \(T = ((s_i,y_i) \;|\; i = 1, \cdots, m)\) (training data)
  • \(f = buildF(T)\) (design the variable treatments)
  • \(g = buildG(((f(s_i),y_i) \;|\; i = 1, \cdots, m))\) (build the model on treated data)
  • \(\widehat{y}_j = g(f(s_j)) ,\; j = 1, \cdots, m\) (compute the model predictions)
  • Compute \(loss(y,\widehat{y})\) (evaluate performance on training data).

In the above method we expect both over-fit and an upwardly biased assessment of quality. This is because the same data rows are used in many stages of the training and scoring, which is something that will not be true for future test or application data. This is a standard criticism, so we won’t go into it here.
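To make this concrete, here is a minimal, self-contained R sketch of the naive procedure. The treatment design step is stood in for by a toy y-aware “impact coding” of a single categorical variable (a simplified stand-in for the kind of treatment a package such as vtreat designs); the data and names are invented for illustration.

    # Toy y-aware treatment design (playing the role of buildF):
    # impact-code the categorical column x against the outcome y.
    design_impact_code <- function(d) {
      level_means <- tapply(d$y, d$x, mean)    # per-level mean outcome
      grand_mean <- mean(d$y)
      function(newd) {                         # the returned treatment function f()
        v <- level_means[as.character(newd$x)]
        v[is.na(v)] <- grand_mean              # unseen levels fall back to the grand mean
        data.frame(x_impact = as.numeric(v) - grand_mean)
      }
    }

    set.seed(2017)
    m <- 200
    d <- data.frame(
      x = sample(letters, m, replace = TRUE),  # a many-leveled categorical input
      y = rnorm(m)                             # outcome deliberately independent of x
    )

    f <- design_impact_code(d)                 # f = buildF(T), designed on all of T
    treated <- cbind(f(d), y = d$y)            # (f(s_i), y_i) for all i
    g <- lm(y ~ x_impact, data = treated)      # g = buildG(...) fit on the treated data
    y_hat <- predict(g, newdata = treated)     # y_hat_j = g(f(s_j)) on the same rows
    mean((d$y - y_hat)^2)                      # training loss: optimistically biased,
                                               # since y is actually independent of x

Because the same rows drive both the impact coding and the regression, the apparent in-sample loss looks better than the true (here: useless) relationship warrants.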

To avoid the over-fit and assessment bias we try more sophisticated data handling as we will illustrate below.

The cross-frame or cross simulation method

The vtreat documentation recommends using a separate held-out set for testing. It then recommends either using separate portions of the training set for variable treatment design and model training, or using cross-frame methods to try to break some of the data-reuse driven information leakage. This scheme, when used in “leave-one-out” mode, looks like the following. (For convenience in this write-up we add the additional assumption that the \(m\) items in \(T\) are in random order, so a segment of them is a sample picked uniformly at random.)

  • \(T = ((s_i,y_i) \;|\; i = 1, \cdots, m)\) (training data), plus a disjoint held-out test set.
  • \(f = buildF(T)\) (the variable treatment to be kept and used going forward).
  • \(f_i = buildF(((s_j,y_j) \;|\; j = 1, \cdots, m;\; j \neq i))\) for each \(i\) (treatments designed with row \(i\) held out).
  • \(g = buildG(((f_i(s_i),y_i) \;|\; i = 1, \cdots, m))\) (build the model on the cross-frame).
  • Compute \(loss(y,\widehat{y})\) on the held-out test set, where \(\widehat{y} = g(f(s))\) for test rows \(s\).

The important points are:

  • Each \(f_i\) never sees row \(i\), so \(f_i(s_i)\) behaves much like a treatment applied to new data; this is what breaks the data re-use between treatment design and model fitting.
  • The actual \(f\) and \(g\) that will be put into production are the objects evaluated on the held-out test set.

This is the system recommended in the vtreat documentation and examples. We are trying to prevent the components of over-fit driven by the data treatment by introducing the \(f_i()\). \(g\) is fit in a standard (not cross-simulated) manner and we depend on the model fitting procedures themselves to avoid over-fit. The usefulness of this procedure depends on the estimation procedures being somewhat stable and easy to stitch together (an issue for procedures with symmetries such as principal components analysis).

We use the held-out test set to try and detect over-fit issues.

I think this formulation hits a real sweet spot of sensibleness and data re-use. The fact that the actual model to be used is the one evaluated is a real plus.
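Here is a self-contained leave-one-out R sketch of this cross-frame scheme, reusing the toy impact-coding treatment from the earlier sketch (vtreat’s mkCrossFrame*Experiment functions supply a fold-based version of this pattern); the data are again invented for illustration.

    # Toy y-aware treatment design (buildF), as in the previous sketch.
    design_impact_code <- function(d) {
      level_means <- tapply(d$y, d$x, mean)
      grand_mean <- mean(d$y)
      function(newd) {
        v <- level_means[as.character(newd$x)]
        v[is.na(v)] <- grand_mean
        data.frame(x_impact = as.numeric(v) - grand_mean)
      }
    }

    set.seed(2017)
    m <- 200
    d     <- data.frame(x = sample(letters, m, replace = TRUE), y = rnorm(m))  # training T
    dTest <- data.frame(x = sample(letters, m, replace = TRUE), y = rnorm(m))  # held-out test

    # Cross-frame: row i is treated by f_i, a treatment designed with row i held out.
    cross_frame <- do.call(rbind, lapply(seq_len(m), function(i) {
      f_i <- design_impact_code(d[-i, ])
      cbind(f_i(d[i, ]), y = d$y[i])
    }))

    f <- design_impact_code(d)                 # f = buildF(T): the treatment kept for use
    g <- lm(y ~ x_impact, data = cross_frame)  # g fit (in the standard manner) on the cross-frame

    y_hat <- predict(g, newdata = f(dTest))    # evaluate the actual f, g pair
    mean((dTest$y - y_hat)^2)                  # held-out loss of the production model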

The nested cross simulation method

We can go further. Set up the nested cross simulation method as follows (again shown in “leave-one-out” mode):

  • \(T = ((s_i,y_i) \;|\; i = 1, \cdots, m)\) (training data).
  • For each \(j\) and each \(i \neq j\): \(f_{i,j} = buildF(((s_k,y_k) \;|\; k = 1, \cdots, m;\; k \neq i,\, k \neq j))\) (treatments designed with both rows \(i\) and \(j\) held out).
  • For each \(j\): \(g_j = buildG(((f_{i,j}(s_i),y_i) \;|\; i = 1, \cdots, m;\; i \neq j))\) (a model built on a cross-frame that never uses row \(j\)).
  • For each \(j\): \(f_j = buildF(((s_k,y_k) \;|\; k = 1, \cdots, m;\; k \neq j))\) (a treatment designed without row \(j\)).
  • \(\widehat{y}_j = g_j(f_j(s_j)) ,\; j = 1, \cdots, m\) (predictions where row \(j\) was hidden from every earlier stage).
  • Compute \(loss(y,\widehat{y})\) (evaluate performance without any additional data split).

Notice we are building a lot of different data treatments \(f_*()\) (one for every distinct pair \(i,j\) and one for each \(j\)) and models (one for each \(j\)). This can be mitigated by grouping the observations into randomly chosen blocks and treating each block as a single unit, which converts “leave-one-out” cross validation into “\(m\)-fold cross validation” (where \(m\) is now the number of blocks).

The important points are:

  • For each \(j\), the prediction \(\widehat{y}_j\) is produced by a treatment \(f_j\) and a model \(g_j\) that never saw row \(j\) at any stage (neither during treatment design nor during model fitting).
  • Therefore the computed loss estimates the out-of-sample quality of the modeling procedure without reserving an additional test set.

There is also the issue of what treatment plan and function to put into production. This is essentially re-running the process without the loss scoring stage (removing one level of nesting), so we can take the results \(f\) and \(g\) from the previous cross simulation method as our data treatment and model to put into production. The extra work in this section is all just trying to get an unbiased estimate of the quality of the modeling process without splitting the data.
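Under the same toy assumptions as the earlier sketches, here is a self-contained leave-one-out R sketch of the nested scheme. It is kept deliberately small because it designs a treatment for every distinct pair \(i,j\); the block/fold variant described above reduces this cost.

    # Toy y-aware treatment design (buildF), as in the earlier sketches.
    design_impact_code <- function(d) {
      level_means <- tapply(d$y, d$x, mean)
      grand_mean <- mean(d$y)
      function(newd) {
        v <- level_means[as.character(newd$x)]
        v[is.na(v)] <- grand_mean
        data.frame(x_impact = as.numeric(v) - grand_mean)
      }
    }

    set.seed(2017)
    m <- 60                                    # kept small: this sketch is O(m^2) in designs
    d <- data.frame(x = sample(letters[1:10], m, replace = TRUE), y = rnorm(m))

    y_hat <- vapply(seq_len(m), function(j) {
      d_j <- d[-j, ]                           # row j is hidden from every stage below
      # cross-frame over the remaining rows: f_ij never sees row i or row j
      cross_frame_j <- do.call(rbind, lapply(seq_len(nrow(d_j)), function(i) {
        f_ij <- design_impact_code(d_j[-i, ])
        cbind(f_ij(d_j[i, ]), y = d_j$y[i])
      }))
      g_j <- lm(y ~ x_impact, data = cross_frame_j)  # model g_j never saw row j
      f_j <- design_impact_code(d_j)                 # treatment f_j never saw row j
      as.numeric(predict(g_j, newdata = f_j(d[j, ])))
    }, numeric(1))

    mean((d$y - y_hat)^2)   # estimate of the modeling procedure's quality, no extra data split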

Despite the heavy notation, the above is all a bit informal (all the “never sees … therefore …” bits; generalizing cross validation is fun, but the notation is laborious). A fine point: exchangeability and information hiding only hold for each row individually; a process that looks at many rows can in fact over-fit. That is: a row that is processed “without seeing such and such” must behave as if it has no information about what it did not see when taken alone; however, when combined with a seemingly harmless other piece of information (such as a grand mean) it may become informative (the usual issue: \(u\) and \(v\) each being singly independent of \(y\) does not imply \(y\) is independent of the joint fact \((u, v)\)). This is why stacking/super-learner style proofs don’t work from independence of information alone, but instead use typical PAC-style arguments.

However, my claim is: the nested cross simulation method is sound in practice, and we could expand the above into a formal argument. The method can be generalized to more stages, but obviously we start running over larger and larger sets of points to be simultaneously excluded in the earlier stages (so the method gets expensive). We must also take care to note that, for all its complexity, the nested cross simulation method measures the quality of the modeling procedure, not the quality of the actual model returned.

Conclusion

Our plan throughout has been to prevent over-fit with respect to a single data point by hiding that data point from the training and fitting procedures. The hope is such care (realized by the complicated bookkeeping and subscripting above) will preserve, in part, statistical exchangeability between training data and future test or application data (if it was there in the first place). As such we can mitigate over-fit and upward evaluation bias.

Obviously this is all quite a lot of trouble to work with directly, which is why we recommend using the simple (not nested) cross simulation procedure; that process is wrapped and automated in vtreat.
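For reference, here is a sketch of how the packaged (not nested) cross simulation is typically invoked, assuming the mkCrossFrameNExperiment()/prepare() interface described in the vtreat documentation (invented example data; consult that documentation for current argument details).

    library(vtreat)
    set.seed(2017)
    d     <- data.frame(x = sample(letters, 200, replace = TRUE), y = rnorm(200))
    dTest <- data.frame(x = sample(letters, 200, replace = TRUE), y = rnorm(200))

    # Design treatments and build the cross-simulated training frame in one step.
    cfe <- mkCrossFrameNExperiment(d, varlist = "x", outcomename = "y")
    treatments  <- cfe$treatments     # the treatment plan f to keep for production
    cross_frame <- cfe$crossFrame     # cross-simulated treated training data

    # Fit the model on the cross-frame, using all treated variables except the outcome.
    model_vars <- setdiff(colnames(cross_frame), "y")
    fmla <- as.formula(paste("y ~", paste(model_vars, collapse = " + ")))
    g <- lm(fmla, data = cross_frame)

    # Apply the production treatment plan to new data and evaluate the actual model.
    test_treated <- prepare(treatments, dTest)
    y_hat <- predict(g, newdata = test_treated)
    mean((dTest$y - y_hat)^2)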

I feel it is well worth knowing how to work the theory to collect evidence that we have settled on a good recommendation. In this document we are mostly just laying down notation to discuss the theory later. Our group, Win-Vector LLC, has found for our consulting clients that certain problems and domains (for example, time series) benefit greatly from properly designed, domain-specific cross simulation procedures (please see arXiv:1611.09477 [stat.AP], Section 4, for such an example).

Comments/Discussion

The above scheme is essentially a block-oriented re-telling and generalization of Breiman’s “Stacked Regressions”, Machine Learning, 24, pp. 49-64, 1996.

We will be writing later about some biases that do leak into the cross-frame “new data simulation.” So think of cross-frames as a bias/variance trade-off: a small amount of bias is introduced, while variance is reduced by appearing to have a full-sized data set available at all stages.