[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]

Please consider either of the following common predictive modeling tasks:

- Picking hyper-parameters, fitting a model, and then evaluating the model.
- Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

- Split your data into 3 or more disjoint pieces, such as separate variable preparation/pruning, model fitting, and model evaluation.
- Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our `R`

data cleaning and preparation package `vtreat`

.

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages?

We mentioned two possible nested tasks to emphasize that nested model bias is not merely an issue of our own invention (despite our writings on variable selection and variable preparation). The issue also occurs with hyper-parameter tuning, model selection, and dimension reduction.

As we have said: if you have enough data you can split your data into a disjoint set for each stage (often called train/test, or calibration/train/test) and wish most of the problem away. However, if you are not feeling “data rich” (and this rapidly happens with complex models, rare features, and un-balanced target classes) or your training data is expensive to produce (very true for many high-value data science projects) then you may want to be more statistically efficient. However, there is a reason to always reserve a portion of data for the last-stage test (as we usually advise): it ensures you are testing properties of the actual model you are proposing to put into production (whereas cross-procedures mostly test the expected quality of your procedures, not the realized quality of your single model).

It is still an interesting problem: how do you properly organize nested cross simulation procedures so no additional data splitting is required? We are going to describe one such method here which is a generalization of k-way cross validation. We call the whole family of methods “nested cross simulation” (to emphasize both the multi-stage nature, and also that scoring or validation is not the only task we are removing bias from).

Suppose we are drawing supervised training examples from an arbitrary data schema (a schema that includes possibly numeric, categorical and logical columns or variables) called \(\mathbb{S}\). We have an indexed list of training examples \(T = ((s_i,y_i) \;|\; i = 1, \cdots m)\) where \(i\) is the example name, \(s_i \in \mathbb{S}\) is a vector of input or independent variables, and \(y_i\) is the outcome (numeric, but can be zero-one representing a categorization target).

We define some functions:

- \(buildF()\): takes a sequence of training examples \((s_i,y_i)\) (as described above) and returns a map \(f: \mathbb{S} \rightarrow \mathbb{R}^n\). \(f\) is a “data treatment” or preparation map (mapping examples to vectors in \(\mathbb{R}^n\)) and \(buildF\) is the data treatment designer. Think of \(buildF()\) designing things like one-hot encoders, or (more to our point) building various outcome sensitive or \(y\)-aware transforms (such as found in
`vtreat`

). - \(buildG()\): takes a sequence of tuples \((x_i,y_i)\) where \(i\) is the example name, \(x_i \in \mathbb{R}^n\), and \(y_i\) is an outcome. Think of these as encoded or treated training examples. \(buildG()\) returns a map \(g:\mathbb{R}^n \rightarrow \mathbb{R}\) mapping inputs to estimated outputs. \(buildG()\) is a model training procedure (such as linear regression, gradient boosted trees, and so on) and \(g()\) is the returned model.
*loss()*: takes two numeric vectors \(\widehat{y}\) and \(y\) (built up from the \(y_i\)) and returns a loss calculation or model quality score. Examples loss functions include sum of mean square error (\(loss(y,\widehat{y}) = \frac{1}{m} \sum_{i=1}^{m} (\widehat{y}_i - y_i)^2\)) for regression or per-example normalized deviance (\(loss(y,\widehat{y}) = -2 \frac{1}{m} \sum_{i=1}^{m} (y_i \log(\widehat{y}_i) + (1-y_i) \log(1-\widehat{y}_i)\)) for classification (the deviance is proportional to a cross-entropy).

The naive (not recommended) modeling method uses the same data over and over for data treatment design, model training, and scoring. In our notation this looks like:

- \(T = ((s_i,y_i) \;|\; i = 1, \cdots m)\) (training data)
- \(f = buildF(T)\) (design the variable treatments)
- \(g = buildG((f(s_i),y_i) \;|\; i = 1, \cdots m))\) (build the model on treated data)
- \(\widehat{y}_j = g(f(s_j)) ,\; j = 1, \cdots m\) (build the model predictions)
- Compute \(loss(y,\widehat{y})\) (evaluate performance on training data).

In the above method we expect both over-fit and an upwardly biased assessment of quality. This is because the same data rows are used in many stages of the training and scoring, which is something that will not be true for future test or application data. This is a standard criticism, so we won’t go into it here.

To avoid the over-fit and assessment bias we try more sophisticated data handling as we will illustrate below.

The `vtreat`

documentation recommends using a separate held-out set for testing. It then recommends either using separate portions of the training set for variable treatment design and model training or using cross-frame methods to try to break some of the data-reuse driven information leakage. This scheme when used in “leave-one-out” mode looks like the following. (For convenience in this writeup we add the additional assumption that the \(m\) items in \(T\) are in random order, so a segment of them is a sample picked uniformly at random.)

- \(Train = ((s_i,y_i) \;|\; i = 1, \cdots k)\) (training data)
- \(Test = ((s_i,y_i) \;|\; i = k+1, \cdots m)\) (test data)
- \(f_i = buildF( (s_j,y_j) \;|\; j = 1, \cdots k ,\; j \neq i )), \; i = 1, \cdots k\) (build the cross-frame or data that is treated using treatments designed and data disjoint from the data being treated)
- \(g = buildG((f_i(s_i),y_i) \;|\; i = 1, \cdots k))\) (build the model using the cross-frame or cross simulated treated data)
- \(f = buildF( (s_j,y_j) \;|\; j = 1, \cdots k))\) (design a final data treatment using all the data)
- \(\widehat{y}_j = g(f(s_j)) ,\; j = k+1, \cdots m\) (evaluate the model on test data, treated using the final data treatment)
- Compute \(loss(y_{j=k+1, \cdots m},\widehat{y}_{j=k+1, \cdots m})\) (score the model only on test data).

The important points are:

- \(f_i\) the treatment plan applied to point \(s_i\) is built without ever looking at \((i,s_i,y_i)\), so \((f_i(s_i),y_i)\) should be distributed as it would be for data not seen during \(f_i\) construction. Thus \(buildG()\) is built training data that statistically looks disjoint from the data determining the \(f()\). This addresses the issue of nested over-fitting.
- Disjoint, uniformly selected at random, held out test data is used for scoring. this addressed the issue of model evaluation bias.

This is the system recommended in the `vtreat`

documentation and examples. We are trying to prevent the components of over-fit driven by the data treatment by introducing the \(f_i()\). \(g\) is fit in a standard (not cross-simulated) manner and we depend on the model fitting procedures themselves to avoid over-fit. The usefulness of this procedure depends on the estimation procedures being somewhat *stable* and easy to stitch together (an issue for procedures with symmetries such as principal components analysis).

We use the held-out test set to try and detect over-fit issues.

I think this formulation is hits a real sweet-spot in sensibleness and data re-use. The fact that the actual model to be used is evaluated is a real plus.

We can go further. Set up the nested cross simulation method as follows.

- \(T = ((s_i,y_i) \;|\; i = 1, \cdots m)\) (training data)
- \(f_c = buildF((s_j,y_j) \;|\; j = 1, \cdots m ,\; j \not\in c )) ,\; c \subseteq \{1, \cdots m \} \;\wedge\; |c| \in \{1,2\}\) (build many cross-frames or cross simulated data maps for use in different contexts)
- \(g_j = buildG((f_{\{j,i\}}(s_i),y_i) \;|\; i = 1, \cdots m ,\; i \neq j)) ,\; j = 1, \cdots m\) (build many models to cross validate)
- \(\widehat{y}_j = g_j(f_{\{j\}}(s_j)) ,\; j = 1, \cdots m\) (evaluate the models only on points and treatments disjoint form the point we are evaluating)
- Compute \(loss(y,\widehat{y})\) (score the model on all data).

Notice we are building *a lot* of different data treatments \(f_*()\) (one for every distinct \(i,j\) and one for each \(j\)) and models (one for each \(j\)). This can be mitigated by treating each datum as a block of observations (chosen randomly) and noticing this converts “leave-one-out” cross validation to “\(m\)-fold cross validation” (where \(m\) is now how many blocks we have).

The important points are:

- \(g_j\) is built without ever looking at \((j,s_j,y_j)\), and using only \(f_{\{j,i\}}(s_i)\) where \(f_{\{j,i\}}()\) has looked at neither \((j,s_j,y_j)\) nor \((i,s_i,y_i)\). Thus from \(f_{\{j,i\}}()\)’s point of view both of these points are statistically exchangable with future data. So \(f_{\{j,i\}}()\) is not “over fit” for either of these points, as it never saw them. \(f_{\{j,i\}}(s_i)\) behaves as \(f_{\{j,i\}}()\) would on a novel point, meaning \(buildG()\) sees training data that looks disjoint from the data determining the \(f()\) (what \(buildF()\) sees)
*and*the \(f_{c}()\) should behave during training as they will on future test or application data. This addresses the issue of nested over-fitting. - In the assignment “\(\widehat{y}_j = g_j(f_{\{j\}}(s_j))\)” by design both \(g_j()\) and \(f_{\{j\}}()\) are built without having seen the datum \((j,s_j,y_j)\). So this datum remains statistically exchangable with future test or application data (addressing the issue of model evaluation bias).

There is also the issue of what treatment plan and function to put into production. This is essentially re-running the process without the loss scoring stage (removing one level of nesting), so we can take the results \(f\) and \(g\) from the previous cross simulation method as our data treatment and model to put into production. The extra work in this section is all just trying to get an unbiased estimate of the quality of the modeling process without splitting the data.

Despite the heavy notation the above is all a bit informal (all the “never sees … therefore …” bits; generalizing cross validation is fun- but the notation is laborious). A fine point is exchangeability and information hiding only holds for each row individually: a process that looks at many rows can in fact over-fit. That is: a row that is processed “without seeing such and such” must behave as if it has no information about what it did not see when taken along; however when combined with a seemingly harmless other piece of information (such as a grand mean) it may be the the become informative (the usual issue: `u`

and `v`

each being singly independent of `y`

does not imply `y`

is independent of the joint fact `(u, v)`

). This is why stacking/super-learner style proofs don’t work on independence of information but instead use typical PAC-style arguments.

However, my claim is: the nested cross simulation method is sound in practice and we could expand the above into a formal argument. The method can be generalized to more stages- but obviously we start running over larger and larger sets of points to be simultaneously excluded in the earlier stages (so the method will get expensive). We also must take care to note for all its complexity the nested cross simulation method is measuring the quality of the modeling procedure, not measuring the quality of the actual model returned.

Our plan throughout has been to prevent over-fit with respect to a single data point by hiding that data point from the training and fitting procedures. The hope is such care (realized by the complicated bookkeeping and subscripting above) will preserve, in part, statistical exchangeability between training data and future test or application data (if it was there in the first place). As such we can mitigate over-fit and upward evaluation bias.

Obviously this is all quite a lot of trouble to directly work with, which is why recommend using the simple (not nested) cross simulation procedure and this process is wrapped and automated in `vtreat`

.

I feel it is well worth knowing how to work the theory to collect evidence we are settled on a good recommendation. In this document we are mostly just laying down notation to discuss theory later. Our group, Win-Vector LLC, has found for our consulting clients: certain problems and domains (for example time series) benefit *greatly* from properly designed domain specific cross simulation procedures (please see arXiv:1611.09477 stat.AP Section 4 for such an example).

## Comments/Discussion

The above scheme is essentially a block-oriented re-telling and generalization of Breiman’s “Stacked Regressions”, Machine Learning, 24, pp. 49-64, 1996.

We are going to be writing on a situation of some biases that do leak into the cross-frame “new data simulation.” So think of cross-frames as bias (some small amount is introduced) / variance (reduced be appearing to have a full sized data set at all stages) trade-off.