vtreat overfit

John Mount, Nina Zumel

2017-04-13

Example showing safe “best practice” use of the ‘vtreat’ variable preparation library. For more on vtreat see here and here.

Below we generate an example data frame with no relation between x and y. We are using a synthetic data set so we know what the “right answer is” (no signal). False fitting on no-signal variables is bad for several reasons:

This example shows things we don’t want to happen, and then the additional precautions that help prevent them.

set.seed(22626)
d <- data.frame(x=sample(paste('level',1:1000,sep=''),2000,replace=TRUE)) # independent variable.
d$y <- runif(nrow(d))>0.5  # the quantity to be predicted, notice: independent of variables.
d$rgroup <- round(100*runif(nrow(d)))  # the random group used for splitting the data set, not a variable.

Bad Practice: Using the same data to treat and to train

Using the same set of data to prepare the variable encoding and train the model can lead to the false belief (derived from the training set) that the model fit well. This is largely due to the treated variable appearing to consume only one degree of freedom, when it in fact consumes many more. In many cases a reasonable setting of pruneSig (say 0.01) will help against a noise variable being considered desirable, but selected variables may still be mis-used by downstream modeling.

dTrain <- d[d$rgroup<=80,,drop=FALSE]
dTest <- d[d$rgroup>80,,drop=FALSE]
library('vtreat')
treatments <- vtreat::designTreatmentsC(dTrain,'x','y',TRUE,
  rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Thu Apr 13 17:41:08 2017"
## [1] "designing treatments Thu Apr 13 17:41:08 2017"
## [1] " have level statistics Thu Apr 13 17:41:08 2017"
## [1] "design var x Thu Apr 13 17:41:08 2017"
## [1] " scoring treatments Thu Apr 13 17:41:08 2017"
## [1] "have treatment plan Thu Apr 13 17:41:08 2017"
## [1] "rescoring complex variables Thu Apr 13 17:41:08 2017"
## [1] "done rescoring complex variables Thu Apr 13 17:41:08 2017"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
  pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))  # notice low residual deviance
## 
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"), 
##     data = dTrainTreated)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79793  -0.75624   0.00611   0.75631   1.79803  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02692    0.07046   0.382    0.702    
## x_catB       1.00630    0.11333   8.879   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2270.5  on 1637  degrees of freedom
## Residual deviance: 1146.2  on 1636  degrees of freedom
## AIC: 1150.2
## 
## Number of Fisher Scoring iterations: 8
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')

# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
  print(title)
  tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
  print(tab)
  diag <- sum(vapply(seq_len(min(dim(tab))),
                     function(i) tab[i,i],numeric(1)))
  acc <- diag/sum(tab)
#  if(requireNamespace("WVPlots",quietly=TRUE)) {
#     print(WVPlots::ROCPlot(d,predName,yName,title))
#  }
  print(paste('accuracy',acc))
}

# evaluate model on training
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   711   97
##   TRUE    245  585
## [1] "accuracy 0.791208791208791"
# evaluate model on test
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE    80   95
##   TRUE     98   89
## [1] "accuracy 0.466850828729282"

The above is bad: we saw a “significant” model fit on training data (even though there is no relation to be found). This means the treated training data can be confusing to machine learning techniques and to the analyst. The issue is that the training data is no longer exchangeable with the test data because the training data was used to build the variable encodings. One way to avoid this is to not use the training data for variable encoding construction, but instead use a third set for this task.

What went wrong?

Notice that vtreat did not think there was any usable signal, and did not want us to use the variables: the values in treatments$scoreFrame$sig are all much larger than a nominally acceptable significance level like 0.05. The variables stayed in our model because we did not prune them (ie we set pruneSig=c()). Also notice we set rareCount=0, which allows the use of very rare levels (which help drive the problem).

print(treatments$scoreFrame)
##   varName varMoves          rsq       sig needsSplit extraModelDegrees
## 1  x_catP     TRUE 6.344411e-05 0.7042905       TRUE               819
## 2  x_catB     TRUE 2.016246e-05 0.8305800       TRUE               819
##   origName code
## 1        x catP
## 2        x catB

Subsequently, the down-stream machine learning (in this case a standard logistic regression) used the variable incorrectly. The modeling algorithm gave the variable a non-negligible coefficient (around 3) that it thought was reliably bounded away from zero; it also believed that the resulting model almost halved deviance (when in fact it explained nothing). So any variables that do get through may have distributional issues (and misleadingly low apparent degrees of freedom).

Rare levels of a categorical variable

The biggest contributors to this distributional issue tend to be rare levels of categorical variables. Since the individual levels are rare we have unreliable estimates for their effects, and if there are very many of them we may see quite a large effect in aggregate. To help combat this we have a control called rareLevels. Any level that is observed no more than rareLevels times during training is re-mapped to a new special level called rare and not allowed to directly contribute (i.e. can not generate unique indicator columns, and doesn’t have a direct effect on catB or catN encodings). If all the rare levels have a distinct behavior after grouping, the rare level can capture that.

Impact-coding of categorical variables with many levels

Another undesirable effect is over-estimating significance of derived variable fit for catB and catN impact-coded variables. To fight this vtreat attempts to estimate out of sample or cross-validated effect significances (when it has enough data). With enough data, setting the pruneSig parameter during prepare() will help remove noise variables. One can set pruneSig to something like 1/number-of-columns to ensure that with high probability only an constant number of truly useless variables make it to later modeling. However, the significance of a given effect size for variables that actually have some signal (i.e. non-noise variables) can still be sensitive to in/out sample scoring and the hiding of degrees of freedom that occurs when a large categorical variable (that represents a large number of degrees of freedom) is re-coded as an impact or effect (which appears to have only a single degree of freedom).

We next show how to avoid these undesirable illusory effects: better practice in partitioning and using training data. We are doing more with our data (essentially chaining models), so we have to take a bit more care with our data.

Correct Practice 1/2: Use different data to treat and train

Below is part of our suggested work pattern: coding/train/test split.

dCode <- d[d$rgroup<=20,,drop=FALSE]
dTrain <- d[(d$rgroup>20) & (d$rgroup<=80),,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dCode,'x','y',TRUE,
                                        rareCount=0,  # Note set this to something larger, like 5
                                        rareSig=c() # Note set this to something like 0.3
)
## [1] "desigining treatments Thu Apr 13 17:41:08 2017"
## [1] "designing treatments Thu Apr 13 17:41:08 2017"
## [1] " have level statistics Thu Apr 13 17:41:08 2017"
## [1] "design var x Thu Apr 13 17:41:08 2017"
## [1] " scoring treatments Thu Apr 13 17:41:08 2017"
## [1] "have treatment plan Thu Apr 13 17:41:08 2017"
## [1] "rescoring complex variables Thu Apr 13 17:41:08 2017"
## [1] "done rescoring complex variables Thu Apr 13 17:41:08 2017"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
                                 pruneSig=c() # Note: set this to filter, like 0.05 or 1/nvars
)
m2 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m2)) # notice high residual deviance
## 
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"), 
##     data = dTrainTreated)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.190  -1.181   1.165   1.174   1.183  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.008294   0.057427   0.144    0.885
## x_catB      -0.002186   0.010944  -0.200    0.842
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1681.6  on 1212  degrees of freedom
## Residual deviance: 1681.5  on 1211  degrees of freedom
## AIC: 1685.5
## 
## Number of Fisher Scoring iterations: 3
dTrain$predM2 <- predict(m2,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM2','y','model2 on train')
## [1] "model2 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   100  504
##   TRUE     97  512
## [1] "accuracy 0.504534212695796"
# We do not advise creating dCodeTreated for any purpose other than
# diagnostic plotting.  You should not use the treated coding data
# for anything (as that would undo the benefit of having a separate
# coding data subset).
dCodeTreated <- vtreat::prepare(treatments,dCode,pruneSig=c())
dCode$predM2 <- predict(m2,newdata=dCodeTreated,type='response')
plotRes(dCode,'predM2','y','model2 on coding set')
## [1] "model2 on coding set"
##        pred
## truth   FALSE TRUE
##   FALSE     0  204
##   TRUE    182   39
## [1] "accuracy 0.0917647058823529"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM2 <- predict(m2,newdata=dTestTreated,type='response')
plotRes(dTest,'predM2','y','model2 on test set')
## [1] "model2 on test set"
##        pred
## truth   FALSE TRUE
##   FALSE    31  144
##   TRUE     20  167
## [1] "accuracy 0.546961325966851"

In the above example we saw training and test performance are similar – and equally poor, as they should be since there is no signal. Though it didn’t happen in this case, note the coding set can (falsely) show high performance. This is the bad behavior we wanted to isolate out of the training set.

Remember, the goal isn’t good performance on training- it is good performance on future data (simulated by test). So doing well on training and bad on test is worse than doing bad on both test and training.

There are, of course, other methods to avoid the bias introduced in using the same data to both treat/encode the variables and to train the model. vtreat incorporates a number of these methods, including smoothing (controlled through smFactor) and pruning of rare levels (controlled through rareSig).

Correct Practice 2/2: Use simulated out of sample methods (cross methods)

Another effective technique: cross-constructed training frames can also be accessed by using mkCrossFrameCExperiment or mkCrossFrameNExperiment, which we demonstrate here.

dTrain <- d[d$rgroup<=80,,drop=FALSE]
xdat <- vtreat::mkCrossFrameCExperiment(dTrain,'x','y',TRUE,
                                  rareCount=0,  # Note set this to something larger, like 5
                                  rareSig=c())
treatments <- xdat$treatments
print(treatments$scoreFrame)
##   varName varMoves          rsq       sig needsSplit extraModelDegrees
## 1  x_catP     TRUE 7.636995e-05 0.6771125       TRUE               819
## 2  x_catB     TRUE 2.052927e-05 0.8290694       TRUE               819
##   origName code
## 1        x catP
## 2        x catB
dTrainTreated <- xdat$crossFrame
m3 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m3)) # notice high residual deviance
## 
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"), 
##     data = dTrainTreated)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.254  -1.189   1.103   1.166   1.233  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 0.026160   0.049480   0.529    0.597  
## x_catB      0.014767   0.007567   1.952    0.051 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2270.5  on 1637  degrees of freedom
## Residual deviance: 2266.6  on 1636  degrees of freedom
## AIC: 2270.6
## 
## Number of Fisher Scoring iterations: 3
dTrainTreated$predM3 <- predict(m3,newdata=dTrainTreated,type='response')
plotRes(dTrainTreated,'predM3','y','model3 on train')
## [1] "model3 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   209  599
##   TRUE    182  648
## [1] "accuracy 0.523199023199023"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM3 <- predict(m3,newdata=dTestTreated,type='response')
plotRes(dTest,'predM3','y','model3 on test set')
## [1] "model3 on test set"
##        pred
## truth   FALSE TRUE
##   FALSE    48  127
##   TRUE     55  132
## [1] "accuracy 0.497237569060773"

Notice the glm significance is off, but the model quality is similar on train and test, and the scoreFrame significance is a correct indication.