Example showing safe “best practice” use of the ‘vtreat’ variable preparation library. For more on vtreat see here and here.

Below we generate an example data frame with no relation between x and y. We are using a synthetic data set so we know what the “right answer is” (no signal). False fitting on no-signal variables is bad for several reasons:

- It creates undesirable biases in variable quality estimates and in subsequent models.
- It “hides degrees of freedom” from subsequent models.
- It creates the false impression you have a good result (which you may fail to falsify).
- Complex bad variables can starve out simple weak good variables.

This example shows things we don’t want to happen, and then the additional precautions that help prevent them.

```
set.seed(22626)
d <- data.frame(x=sample(paste('level',1:1000,sep=''),2000,replace=TRUE)) # independent variable.
d$y <- runif(nrow(d))>0.5 # the quantity to be predicted, notice: independent of variables.
d$rgroup <- round(100*runif(nrow(d))) # the random group used for splitting the data set, not a variable.
```

Using the same set of data to prepare the variable encoding and train the model can lead to the false belief (derived from the training set) that the model fit well. This is largely due to the treated variable appearing to consume only one degree of freedom, when it in fact consumes many more. In many cases a reasonable setting of `pruneSig`

(say 0.01) will help against a noise variable being considered desirable, but selected variables may still be mis-used by downstream modeling.

```
dTrain <- d[d$rgroup<=80,,drop=FALSE]
dTest <- d[d$rgroup>80,,drop=FALSE]
library('vtreat')
treatments <- vtreat::designTreatmentsC(dTrain,'x','y',TRUE,
rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
```

```
## [1] "desigining treatments Thu Apr 13 17:41:08 2017"
## [1] "designing treatments Thu Apr 13 17:41:08 2017"
## [1] " have level statistics Thu Apr 13 17:41:08 2017"
## [1] "design var x Thu Apr 13 17:41:08 2017"
## [1] " scoring treatments Thu Apr 13 17:41:08 2017"
## [1] "have treatment plan Thu Apr 13 17:41:08 2017"
## [1] "rescoring complex variables Thu Apr 13 17:41:08 2017"
## [1] "done rescoring complex variables Thu Apr 13 17:41:08 2017"
```

```
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1)) # notice low residual deviance
```

```
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79793 -0.75624 0.00611 0.75631 1.79803
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.02692 0.07046 0.382 0.702
## x_catB 1.00630 0.11333 8.879 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2270.5 on 1637 degrees of freedom
## Residual deviance: 1146.2 on 1636 degrees of freedom
## AIC: 1150.2
##
## Number of Fisher Scoring iterations: 8
```

```
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
print(title)
tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
print(tab)
diag <- sum(vapply(seq_len(min(dim(tab))),
function(i) tab[i,i],numeric(1)))
acc <- diag/sum(tab)
# if(requireNamespace("WVPlots",quietly=TRUE)) {
# print(WVPlots::ROCPlot(d,predName,yName,title))
# }
print(paste('accuracy',acc))
}
# evaluate model on training
plotRes(dTrain,'predM1','y','model1 on train')
```

```
## [1] "model1 on train"
## pred
## truth FALSE TRUE
## FALSE 711 97
## TRUE 245 585
## [1] "accuracy 0.791208791208791"
```

```
# evaluate model on test
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
```

```
## [1] "model1 on test"
## pred
## truth FALSE TRUE
## FALSE 80 95
## TRUE 98 89
## [1] "accuracy 0.466850828729282"
```

The above is bad: we saw a “significant” model fit on training data (even though there is no relation to be found). This means the treated training data can be confusing to machine learning techniques and to the analyst. The issue is that the training data is no longer exchangeable with the test data because the training data was used to build the variable encodings. One way to avoid this is to not use the training data for variable encoding construction, but instead use a third set for this task.

Notice that vtreat did not think there was any usable signal, and did not want us to use the variables: the values in `treatments$scoreFrame$sig`

are all much larger than a nominally acceptable significance level like 0.05. The variables stayed in our model because we did not prune them (*ie* we set `pruneSig=c()`

). Also notice we set `rareCount=0`

, which allows the use of very rare levels (which help drive the problem).

`print(treatments$scoreFrame)`

```
## varName varMoves rsq sig needsSplit extraModelDegrees
## 1 x_catP TRUE 6.344411e-05 0.7042905 TRUE 819
## 2 x_catB TRUE 2.016246e-05 0.8305800 TRUE 819
## origName code
## 1 x catP
## 2 x catB
```

Subsequently, the down-stream machine learning (in this case a standard logistic regression) used the variable incorrectly. The modeling algorithm gave the variable a non-negligible coefficient (around 3) that it thought was reliably bounded away from zero; it also believed that the resulting model almost halved deviance (when in fact it explained nothing). So any variables that do get through may have distributional issues (and misleadingly low apparent degrees of freedom).

**Rare levels of a categorical variable**

The biggest contributors to this distributional issue tend to be rare levels of categorical variables. Since the individual levels are rare we have unreliable estimates for their effects, and if there are very many of them we may see quite a large effect in aggregate. To help combat this we have a control called `rareLevels`

. Any level that is observed no more than `rareLevels`

times during training is re-mapped to a new special level called *rare* and not allowed to directly contribute (i.e. can not generate unique indicator columns, and doesn’t have a direct effect on `catB`

or `catN`

encodings). If all the rare levels have a distinct behavior after grouping, the *rare* level can capture that.

**Impact-coding of categorical variables with many levels**

Another undesirable effect is over-estimating significance of derived variable fit for `catB`

and `catN`

impact-coded variables. To fight this vtreat attempts to estimate out of sample or cross-validated effect significances (when it has enough data). With enough data, setting the `pruneSig`

parameter during `prepare()`

will help remove noise variables. One can set `pruneSig`

to something like *1/number-of-columns* to ensure that with high probability only an constant number of truly useless variables make it to later modeling. However, the significance of a given effect size for variables that actually have some signal (i.e. non-noise variables) can still be sensitive to in/out sample scoring and the hiding of degrees of freedom that occurs when a large categorical variable (that represents a large number of degrees of freedom) is re-coded as an impact or effect (which appears to have only a single degree of freedom).

We next show how to avoid these undesirable illusory effects: better practice in partitioning and using training data. We are doing more with our data (essentially chaining models), so we have to take a bit more care with our data.

Below is part of our suggested work pattern: coding/train/test split.

```
dCode <- d[d$rgroup<=20,,drop=FALSE]
dTrain <- d[(d$rgroup>20) & (d$rgroup<=80),,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dCode,'x','y',TRUE,
rareCount=0, # Note set this to something larger, like 5
rareSig=c() # Note set this to something like 0.3
)
```

```
## [1] "desigining treatments Thu Apr 13 17:41:08 2017"
## [1] "designing treatments Thu Apr 13 17:41:08 2017"
## [1] " have level statistics Thu Apr 13 17:41:08 2017"
## [1] "design var x Thu Apr 13 17:41:08 2017"
## [1] " scoring treatments Thu Apr 13 17:41:08 2017"
## [1] "have treatment plan Thu Apr 13 17:41:08 2017"
## [1] "rescoring complex variables Thu Apr 13 17:41:08 2017"
## [1] "done rescoring complex variables Thu Apr 13 17:41:08 2017"
```

```
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: set this to filter, like 0.05 or 1/nvars
)
m2 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m2)) # notice high residual deviance
```

```
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.190 -1.181 1.165 1.174 1.183
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.008294 0.057427 0.144 0.885
## x_catB -0.002186 0.010944 -0.200 0.842
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1681.6 on 1212 degrees of freedom
## Residual deviance: 1681.5 on 1211 degrees of freedom
## AIC: 1685.5
##
## Number of Fisher Scoring iterations: 3
```

```
dTrain$predM2 <- predict(m2,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM2','y','model2 on train')
```

```
## [1] "model2 on train"
## pred
## truth FALSE TRUE
## FALSE 100 504
## TRUE 97 512
## [1] "accuracy 0.504534212695796"
```

```
# We do not advise creating dCodeTreated for any purpose other than
# diagnostic plotting. You should not use the treated coding data
# for anything (as that would undo the benefit of having a separate
# coding data subset).
dCodeTreated <- vtreat::prepare(treatments,dCode,pruneSig=c())
dCode$predM2 <- predict(m2,newdata=dCodeTreated,type='response')
plotRes(dCode,'predM2','y','model2 on coding set')
```

```
## [1] "model2 on coding set"
## pred
## truth FALSE TRUE
## FALSE 0 204
## TRUE 182 39
## [1] "accuracy 0.0917647058823529"
```

```
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM2 <- predict(m2,newdata=dTestTreated,type='response')
plotRes(dTest,'predM2','y','model2 on test set')
```

```
## [1] "model2 on test set"
## pred
## truth FALSE TRUE
## FALSE 31 144
## TRUE 20 167
## [1] "accuracy 0.546961325966851"
```

In the above example we saw training and test performance are similar – and equally poor, as they should be since there is no signal. Though it didn’t happen in this case, note the coding set can (falsely) show high performance. This is the bad behavior we wanted to isolate out of the training set.

Remember, the goal isn’t good performance on training- it is good performance on future data (simulated by test). So doing well on training and bad on test is worse than doing bad on both test and training.

There are, of course, other methods to avoid the bias introduced in using the same data to both treat/encode the variables and to train the model. vtreat incorporates a number of these methods, including smoothing (controlled through `smFactor`

) and pruning of rare levels (controlled through `rareSig`

).

Another effective technique: cross-constructed training frames can also be accessed by using `mkCrossFrameCExperiment`

or `mkCrossFrameNExperiment`

, which we demonstrate here.

```
dTrain <- d[d$rgroup<=80,,drop=FALSE]
xdat <- vtreat::mkCrossFrameCExperiment(dTrain,'x','y',TRUE,
rareCount=0, # Note set this to something larger, like 5
rareSig=c())
treatments <- xdat$treatments
print(treatments$scoreFrame)
```

```
## varName varMoves rsq sig needsSplit extraModelDegrees
## 1 x_catP TRUE 7.636995e-05 0.6771125 TRUE 819
## 2 x_catB TRUE 2.052927e-05 0.8290694 TRUE 819
## origName code
## 1 x catP
## 2 x catB
```

```
dTrainTreated <- xdat$crossFrame
m3 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m3)) # notice high residual deviance
```

```
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.254 -1.189 1.103 1.166 1.233
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.026160 0.049480 0.529 0.597
## x_catB 0.014767 0.007567 1.952 0.051 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2270.5 on 1637 degrees of freedom
## Residual deviance: 2266.6 on 1636 degrees of freedom
## AIC: 2270.6
##
## Number of Fisher Scoring iterations: 3
```

```
dTrainTreated$predM3 <- predict(m3,newdata=dTrainTreated,type='response')
plotRes(dTrainTreated,'predM3','y','model3 on train')
```

```
## [1] "model3 on train"
## pred
## truth FALSE TRUE
## FALSE 209 599
## TRUE 182 648
## [1] "accuracy 0.523199023199023"
```

```
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM3 <- predict(m3,newdata=dTestTreated,type='response')
plotRes(dTest,'predM3','y','model3 on test set')
```

```
## [1] "model3 on test set"
## pred
## truth FALSE TRUE
## FALSE 48 127
## TRUE 55 132
## [1] "accuracy 0.497237569060773"
```

Notice the glm significance is off, but the model quality is similar on train and test, and the scoreFrame significance is a correct indication.