`vignettes/vtreatOverfit.Rmd`

`vtreatOverfit.Rmd`

Example showing safe “best practice” use of the ‘vtreat’ variable preparation library. For more on vtreat see here and here.

Below we generate an example data frame with no relation between x and y. We are using a synthetic data set so we know what the “right answer is” (no signal). False fitting on no-signal variables is bad for several reasons:

- It creates undesirable biases in variable quality estimates and in subsequent models.
- It “hides degrees of freedom” from subsequent models.
- It creates the false impression you have a good result (which you may fail to falsify).
- Complex bad variables can starve out simple weak good variables.

This example shows things we don’t want to happen, and then the additional precautions that help prevent them.

```
set.seed(22626)
d <- data.frame(x=sample(paste('level',1:1000,sep=''),2000,replace=TRUE)) # independent variable.
d$y <- runif(nrow(d))>0.5 # the quantity to be predicted, notice: independent of variables.
d$rgroup <- round(100*runif(nrow(d))) # the random group used for splitting the data set, not a variable.
```

Using the same set of data to prepare the variable encoding and train the model can lead to the false belief (derived from the training set) that the model fit well. This is largely due to the treated variable appearing to consume only one degree of freedom, when it in fact consumes many more. In many cases a reasonable setting of `pruneSig`

(say 0.01) will help against a noise variable being considered desirable, but selected variables may still be mis-used by downstream modeling.

`## Loading required package: wrapr`

```
treatments <- vtreat::designTreatmentsC(dTrain,'x','y',TRUE,
rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
```

```
## [1] "vtreat 1.6.0 inspecting inputs Wed Mar 11 16:35:00 2020"
## [1] "designing treatments Wed Mar 11 16:35:00 2020"
## [1] " have initial level statistics Wed Mar 11 16:35:00 2020"
## [1] " scoring treatments Wed Mar 11 16:35:00 2020"
## [1] "have treatment plan Wed Mar 11 16:35:00 2020"
## [1] "rescoring complex variables Wed Mar 11 16:35:00 2020"
## [1] "done rescoring complex variables Wed Mar 11 16:35:00 2020"
```

```
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
```

```
## Warning in prepare.treatmentplan(treatments, dTrain, pruneSig = c()):
## possibly called prepare() on same data frame as designTreatments*()/
## mkCrossFrame*Experiment(), this can lead to over-fit. To avoid this, please use
## mkCrossFrame*Experiment$crossFrame.
```

```
m1 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1)) # notice low residual deviance
```

```
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79811 -0.75642 0.00684 0.75617 1.89724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03361 0.07170 0.469 0.639
## x_catB 1.00624 0.11442 8.794 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2266.1 on 1634 degrees of freedom
## Residual deviance: 1131.2 on 1633 degrees of freedom
## AIC: 1135.2
##
## Number of Fisher Scoring iterations: 8
```

```
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
print(title)
tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
print(tab)
diag <- sum(vapply(seq_len(min(dim(tab))),
function(i) tab[i,i],numeric(1)))
acc <- diag/sum(tab)
# if(requireNamespace("WVPlots",quietly=TRUE)) {
# print(WVPlots::ROCPlot(d,predName,yName,title))
# }
print(paste('accuracy',acc))
}
# evaluate model on training
plotRes(dTrain,'predM1','y','model1 on train')
```

```
## [1] "model1 on train"
## pred
## truth FALSE TRUE
## FALSE 556 248
## TRUE 92 739
## [1] "accuracy 0.792048929663609"
```

```
# evaluate model on test
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
```

```
## [1] "model1 on test"
## pred
## truth FALSE TRUE
## FALSE 57 123
## TRUE 65 120
## [1] "accuracy 0.484931506849315"
```

The above is bad: we saw a “significant” model fit on training data (even though there is no relation to be found). This means the treated training data can be confusing to machine learning techniques and to the analyst. The issue is that the training data is no longer exchangeable with the test data because the training data was used to build the variable encodings. One way to avoid this is to not use the training data for variable encoding construction, but instead use a third set for this task.

Notice that vtreat did not think there was any usable signal, and did not want us to use the variables: the values in `treatments$scoreFrame$sig`

are all much larger than a nominally acceptable significance level like 0.05. The variables stayed in our model because we did not prune them (*ie* we set `pruneSig=c()`

). Also notice we set `rareCount=0`

, which allows the use of very rare levels (which help drive the problem).

`print(treatments$scoreFrame)`

```
## varName varMoves rsq sig needsSplit extraModelDegrees
## 1 x_catP TRUE 0.0001174834 0.60586892 TRUE 803
## 2 x_catB TRUE 0.0020852084 0.02972052 TRUE 803
## origName code default_threshold recommended
## 1 x catP 0.5 FALSE
## 2 x catB 0.5 TRUE
```

Subsequently, the down-stream machine learning (in this case a standard logistic regression) used the variable incorrectly. The modeling algorithm gave the variable a non-negligible coefficient (around 3) that it thought was reliably bounded away from zero; it also believed that the resulting model almost halved deviance (when in fact it explained nothing). So any variables that do get through may have distributional issues (and misleadingly low apparent degrees of freedom).

**Rare levels of a categorical variable**

The biggest contributors to this distributional issue tend to be rare levels of categorical variables. Since the individual levels are rare we have unreliable estimates for their effects, and if there are very many of them we may see quite a large effect in aggregate. To help combat this we have a control called `rareLevels`

. Any level that is observed no more than `rareLevels`

times during training is re-mapped to a new special level called *rare* and not allowed to directly contribute (i.e. can not generate unique indicator columns, and doesn’t have a direct effect on `catB`

or `catN`

encodings). If all the rare levels have a distinct behavior after grouping, the *rare* level can capture that.

**Impact-coding of categorical variables with many levels**

Another undesirable effect is over-estimating significance of derived variable fit for `catB`

and `catN`

impact-coded variables. To fight this vtreat attempts to estimate out of sample or cross-validated effect significances (when it has enough data). With enough data, setting the `pruneSig`

parameter during `prepare()`

will help remove noise variables. One can set `pruneSig`

to something like *1/number-of-columns* to ensure that with high probability only an constant number of truly useless variables make it to later modeling. However, the significance of a given effect size for variables that actually have some signal (i.e. non-noise variables) can still be sensitive to in/out sample scoring and the hiding of degrees of freedom that occurs when a large categorical variable (that represents a large number of degrees of freedom) is re-coded as an impact or effect (which appears to have only a single degree of freedom).

We next show how to avoid these undesirable illusory effects: better practice in partitioning and using training data. We are doing more with our data (essentially chaining models), so we have to take a bit more care with our data.

Below is part of our suggested work pattern: coding/train/test split.

```
dCode <- d[d$rgroup<=20,,drop=FALSE]
dTrain <- d[(d$rgroup>20) & (d$rgroup<=80),,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dCode,'x','y',TRUE,
rareCount=0, # Note set this to something larger, like 5
rareSig=c() # Note set this to something like 0.3
)
```

```
## [1] "vtreat 1.6.0 inspecting inputs Wed Mar 11 16:35:01 2020"
## [1] "designing treatments Wed Mar 11 16:35:01 2020"
## [1] " have initial level statistics Wed Mar 11 16:35:01 2020"
## [1] " scoring treatments Wed Mar 11 16:35:01 2020"
## [1] "have treatment plan Wed Mar 11 16:35:01 2020"
## [1] "rescoring complex variables Wed Mar 11 16:35:01 2020"
## [1] "done rescoring complex variables Wed Mar 11 16:35:01 2020"
```

```
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: set this to filter, like 0.05 or 1/nvars
)
m2 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m2)) # notice high residual deviance
```

```
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.254 -1.182 1.103 1.173 1.246
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.01040 0.05746 0.181 0.856
## x_catB 0.01713 0.01054 1.625 0.104
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1687.1 on 1216 degrees of freedom
## Residual deviance: 1684.4 on 1215 degrees of freedom
## AIC: 1688.4
##
## Number of Fisher Scoring iterations: 3
```

```
dTrain$predM2 <- predict(m2,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM2','y','model2 on train')
```

```
## [1] "model2 on train"
## pred
## truth FALSE TRUE
## FALSE 105 499
## TRUE 92 521
## [1] "accuracy 0.514379622021364"
```

```
# We do not advise creating dCodeTreated for any purpose other than
# diagnostic plotting. You should not use the treated coding data
# for anything (as that would undo the benefit of having a separate
# coding data subset).
dCodeTreated <- vtreat::prepare(treatments,dCode,pruneSig=c())
```

```
## Warning in prepare.treatmentplan(treatments, dCode, pruneSig = c()):
## possibly called prepare() on same data frame as designTreatments*()/
## mkCrossFrame*Experiment(), this can lead to over-fit. To avoid this, please use
## mkCrossFrame*Experiment$crossFrame.
```

```
dCode$predM2 <- predict(m2,newdata=dCodeTreated,type='response')
plotRes(dCode,'predM2','y','model2 on coding set')
```

```
## [1] "model2 on coding set"
## pred
## truth FALSE TRUE
## FALSE 174 26
## TRUE 3 215
## [1] "accuracy 0.930622009569378"
```

```
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM2 <- predict(m2,newdata=dTestTreated,type='response')
plotRes(dTest,'predM2','y','model2 on test set')
```

```
## [1] "model2 on test set"
## pred
## truth FALSE TRUE
## FALSE 26 154
## TRUE 42 143
## [1] "accuracy 0.463013698630137"
```

In the above example we saw training and test performance are similar – and equally poor, as they should be since there is no signal. Though it didn’t happen in this case, note the coding set can (falsely) show high performance. This is the bad behavior we wanted to isolate out of the training set.

Remember, the goal isn’t good performance on training- it is good performance on future data (simulated by test). So doing well on training and bad on test is worse than doing bad on both test and training.

There are, of course, other methods to avoid the bias introduced in using the same data to both treat/encode the variables and to train the model. vtreat incorporates a number of these methods, including smoothing (controlled through `smFactor`

) and pruning of rare levels (controlled through `rareSig`

).

Another effective technique: cross-constructed training frames can also be accessed by using `mkCrossFrameCExperiment`

or `mkCrossFrameNExperiment`

, which we demonstrate here.

```
dTrain <- d[d$rgroup<=80,,drop=FALSE]
xdat <- vtreat::mkCrossFrameCExperiment(dTrain,'x','y',TRUE,
rareCount=0, # Note set this to something larger, like 5
rareSig=c())
```

```
## [1] "vtreat 1.6.0 start initial treatment design Wed Mar 11 16:35:01 2020"
## [1] " start cross frame work Wed Mar 11 16:35:01 2020"
## [1] " vtreat::mkCrossFrameCExperiment done Wed Mar 11 16:35:01 2020"
```

```
treatments <- xdat$treatments
print(treatments$scoreFrame)
```

```
## varName varMoves rsq sig needsSplit extraModelDegrees
## 1 x_catP TRUE 2.772417e-05 0.8020822863 TRUE 803
## 2 x_catB TRUE 5.776072e-03 0.0002969686 TRUE 803
## origName code default_threshold recommended
## 1 x catP 0.5 FALSE
## 2 x catB 0.5 TRUE
```

```
dTrainTreated <- xdat$crossFrame
m3 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m3)) # notice high residual deviance
```

```
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.226 -1.191 1.131 1.164 1.199
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.031450 0.049508 0.635 0.525
## x_catB 0.007853 0.007422 1.058 0.290
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2266.1 on 1634 degrees of freedom
## Residual deviance: 2265.0 on 1633 degrees of freedom
## AIC: 2269
##
## Number of Fisher Scoring iterations: 3
```

```
dTrainTreated$predM3 <- predict(m3,newdata=dTrainTreated,type='response')
plotRes(dTrainTreated,'predM3','y','model3 on train')
```

```
## [1] "model3 on train"
## pred
## truth FALSE TRUE
## FALSE 184 620
## TRUE 205 626
## [1] "accuracy 0.495412844036697"
```

```
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM3 <- predict(m3,newdata=dTestTreated,type='response')
plotRes(dTest,'predM3','y','model3 on test set')
```

```
## [1] "model3 on test set"
## pred
## truth FALSE TRUE
## FALSE 44 136
## TRUE 53 132
## [1] "accuracy 0.482191780821918"
```

Notice the glm significance is off, but the model quality is similar on train and test, and the scoreFrame significance is a correct indication.