vtreat package

John Mount, Nina Zumel

2017-04-13

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. A formal article on the method is available as arXiv:1611.09477 (stat.AP).

A ‘vtreat’ clean data frame:

To achieve this a number of techniques are used. Principally:

For more details see: the ‘vtreat’ article and update.

The use pattern is:

  1. Use designTreatmentsC() or designTreatmentsN() to design a treatment plan
  2. Use the returned structure with prepare() to apply the plan to data frames.

The main feature of ‘vtreat’ is that all data preparation is “y-aware”: it uses the relations of effective variables to the dependent or outcome variable to encode the effective variables.
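As a rough illustration of what “y-aware” encoding means, the base-R sketch below replaces each level of a categorical variable with the mean of y for that level minus the overall mean of y. This is a simplified stand-in rather than vtreat's own implementation, which is more involved; the names `na_label`, `impact`, and `encode` are hypothetical. The data are the x and y columns of dTrainN from the numeric outcome example later in this document.

```r
# Toy data: the x and y columns of dTrainN from the numeric example.
d <- data.frame(
  x = c("a", "a", "a", "a", "b", "b", NA),
  y = c(0, 0, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Treat a missing category as its own level (the label is an arbitrary choice).
na_label <- "_NA_"
lev <- ifelse(is.na(d$x), na_label, d$x)

# Conditional mean of y per level, expressed relative to the grand mean.
impact <- vapply(split(d$y, lev), mean, numeric(1)) - mean(d$y)

# Hypothetical helper: look up the code for each value; levels never seen
# in training carry no information and are coded as 0.
encode <- function(x) {
  key <- ifelse(is.na(x), na_label, x)
  code <- unname(impact[key])
  code[is.na(code)] <- 0
  code
}

encoded <- encode(c("a", "b", "c", NA))
print(round(encoded, 7))
```

For this tiny data set the result agrees with the x_catN column of the unscaled dTestNTreated frame shown in the numeric example.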

The structure returned from designTreatmentsN() or designTreatmentsC() includes a list of “treatments”: objects that encapsulate the transformation process from the original variables to the new “clean” variables.

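In addition to the treatment objects, designTreatmentsC() and designTreatmentsN() also return a data frame named scoreFrame, which contains the columns varName, varMoves, rsq, sig, needsSplit, extraModelDegrees, origName, and code (all visible in the printed treatment plans in the examples below).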
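One plausible use of scoreFrame is to screen variables by significance before modeling. The sketch below rebuilds a small data frame from the varName and sig values printed in the categorical example further down and keeps the variables whose sig falls below 1/(number of variables). That threshold is an assumed heuristic chosen for illustration, not a rule stated in this document.

```r
# varName and sig copied from the scoreFrame printed in the categorical example.
scoreFrame <- data.frame(
  varName = c("x_lev_NA", "x_lev_x.a", "x_lev_x.b", "x_catP",
              "x_catB", "z_clean", "z_isBAD"),
  sig = c(0.20766228, 0.40972582, 1.00000000, 0.25493078,
          0.04220134, 0.14299775, 0.20766228),
  stringsAsFactors = FALSE
)

# Assumed heuristic: allow roughly one false positive among the candidates.
threshold <- 1 / nrow(scoreFrame)

# Names of the variables that pass the screen.
keep <- scoreFrame$varName[scoreFrame$sig < threshold]
print(keep)
```

With a real plan the same filter would be applied to treatmentsC$scoreFrame, or the threshold could be handed to prepare() through its pruneSig argument.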

In all cases we have two undesirable upward biases on the scores:

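‘vtreat’ uses a number of cross-training and jackknife style procedures to try to mitigate these effects. The suggested best practice (if you have enough data) is to split your data randomly into at least the following disjoint data sets: a calibration set used for designTreatmentsC() or designTreatmentsN(), a training set used to fit the model, and a test set used to evaluate it (the cal/train/test split).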

Taking the extra step to perform the designTreatmentsC() or designTreatmentsN() on data disjoint from training makes the training data more exchangeable with test and avoids the issue that ‘vtreat’ may be hiding a large number of degrees of freedom in variables it derives from large categoricals.
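A minimal base-R sketch of such a split is shown below. The 20/40/40 proportions, the seed, and the names `d`, `group`, `dCal`, `dTrain`, and `dTest` are illustrative assumptions. The calibration set would be passed to designTreatmentsC() or designTreatmentsN(), and the other two sets to prepare().

```r
set.seed(2017)  # arbitrary seed, only for reproducibility

# Stand-in data; any data frame with an outcome column would do.
d <- data.frame(
  x = sample(c("a", "b", "c"), 100, replace = TRUE),
  z = rnorm(100),
  y = rnorm(100)
)

# Randomly assign every row to exactly one of three groups.
group <- sample(c("cal", "train", "test"), nrow(d),
                replace = TRUE, prob = c(0.2, 0.4, 0.4))

dCal   <- d[group == "cal",   , drop = FALSE]  # design the treatment plan here
dTrain <- d[group == "train", , drop = FALSE]  # fit the model here
dTest  <- d[group == "test",  , drop = FALSE]  # evaluate the model here

# The three pieces are disjoint and together cover the original rows.
print(c(cal = nrow(dCal), train = nrow(dTrain), test = nrow(dTest)))
```

Because each row receives exactly one group label, no row can appear in more than one of the three sets.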

Some trivial execution examples (not demonstrating any cal/train/test split) are given below. Variables that do not move during hold-out testing are considered “not to move.”


A Categorical Outcome Example

library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
head(dTrainC)
##      x  z     y
## 1    a  1 FALSE
## 2    a  2 FALSE
## 3    a  3  TRUE
## 4    b  4 FALSE
## 5    b NA  TRUE
## 6 <NA>  6  TRUE
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestC)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "desigining treatments Thu Apr 13 17:40:59 2017"
## [1] "designing treatments Thu Apr 13 17:40:59 2017"
## [1] " have level statistics Thu Apr 13 17:40:59 2017"
## [1] "design var x Thu Apr 13 17:40:59 2017"
## [1] "design var z Thu Apr 13 17:40:59 2017"
## [1] " scoring treatments Thu Apr 13 17:40:59 2017"
## [1] "have treatment plan Thu Apr 13 17:40:59 2017"
## [1] "rescoring complex variables Thu Apr 13 17:40:59 2017"
## [1] "done rescoring complex variables Thu Apr 13 17:41:00 2017"
print(treatmentsC)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
## 
## $treatments[[2]]
## [1] "vtreat 'Prevalence Code'('x'(integer,factor)->character->'x_catP')"
## 
## $treatments[[3]]
## [1] "vtreat 'Bayesian Impact Code'('x'(integer,factor)->character->'x_catB')"
## 
## $treatments[[4]]
## [1] "vtreat 'Scalable pass through'('z'(double,numeric)->numeric->'z_clean')"
## 
## $treatments[[5]]
## [1] "vtreat 'is.bad'('z'(double,numeric)->numeric->'z_isBAD')"
## 
## 
## $scoreFrame
##     varName varMoves        rsq        sig needsSplit extraModelDegrees
## 1  x_lev_NA     TRUE 0.19087450 0.20766228      FALSE                 0
## 2 x_lev_x.a     TRUE 0.08170417 0.40972582      FALSE                 0
## 3 x_lev_x.b     TRUE 0.00000000 1.00000000      FALSE                 0
## 4    x_catP     TRUE 0.15582050 0.25493078       TRUE                 2
## 5    x_catB     TRUE 0.49618022 0.04220134       TRUE                 2
## 6   z_clean     TRUE 0.25792985 0.14299775      FALSE                 0
## 7   z_isBAD     TRUE 0.19087450 0.20766228      FALSE                 0
##   origName  code
## 1        x   lev
## 2        x   lev
## 3        x   lev
## 4        x  catP
## 5        x  catB
## 6        z clean
## 7        z isBAD
## 
## $outcomename
## [1] "y"
## 
## $vtreatVersion
## [1] '0.5.31'
## 
## $outcomeType
## [1] "Binary"
## 
## $outcomeTarget
## [1] TRUE
## 
## $meanY
## [1] 0.5
## 
## $splitmethod
## [1] "oneway"
## 
## attr(,"class")
## [1] "treatmentplan"
print(treatmentsC$treatments[[1]])
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"

Here we demonstrate the optional scaling feature of prepare(), which scales and centers all significant variables to mean 0 and slope 1 with respect to y. In other words, it rescales the variables to “y-units”. This is useful for downstream principal components analysis. Note: variables perfectly uncorrelated with y necessarily have slope 0 and can’t be “scaled” to slope 1. For the same reason, however, these variables will be insignificant and can be pruned by pruneSig.

scale=FALSE by default.

dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
head(dTrainCTreated)
##   x_lev_NA  x_lev_x.a x_lev_x.b x_catP      x_catB     z_clean z_isBAD
## 1     -0.1 -0.1666667         0   -0.2 -0.11976374 -0.38648649    -0.1
## 2     -0.1 -0.1666667         0   -0.2 -0.11976374 -0.21081081    -0.1
## 3     -0.1 -0.1666667         0   -0.2 -0.11976374 -0.03513514    -0.1
## 4     -0.1  0.1666667         0    0.1 -0.07564865  0.14054054    -0.1
## 5     -0.1  0.1666667         0    0.1 -0.07564865  0.00000000     0.5
## 6      0.5  0.1666667         0    0.4  0.51058851  0.49189189    -0.1
##       y
## 1 FALSE
## 2 FALSE
## 3  TRUE
## 4 FALSE
## 5  TRUE
## 6  TRUE
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
##      x_lev_NA     x_lev_x.a     x_lev_x.b        x_catP        x_catB 
## -6.938894e-18  0.000000e+00  0.000000e+00  1.850372e-17  1.387779e-17 
##       z_clean       z_isBAD 
##  9.251859e-18 -6.938894e-18
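# slope 1 is targeted in the linear (lm) sense for variables with treatmentsC$scoreFrame$sig<1;
# the logistic (glm) coefficients computed below are therefore not expected to equal 1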
sapply(varsC,function(c) { glm(paste('y',c,sep='~'),family=binomial,
   data=dTrainCTreated)$coefficients[[2]]})
##  x_lev_NA x_lev_x.a x_lev_x.b    x_catP    x_catB   z_clean   z_isBAD 
## 31.619223  4.158883        NA  4.698112 15.815409  5.733441 31.619223
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)
head(dTestCTreated)
##   x_lev_NA  x_lev_x.a x_lev_x.b x_catP      x_catB  z_clean z_isBAD
## 1     -0.1 -0.1666667         0   -0.2 -0.11976374 1.194595    -0.1
## 2     -0.1  0.1666667         0    0.1 -0.07564865 2.951351    -0.1
## 3     -0.1  0.1666667         0    0.7 -0.07564865 4.708108    -0.1
## 4      0.5  0.1666667         0    0.4  0.51058851 0.000000     0.5
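The “y-units” scaling used above can be illustrated in base R. In this sketch (an illustration of the idea, not a claim about vtreat's internals) a variable is replaced by the prediction of its single-variable linear model minus the mean of y. The names `ind`, `ind_scaled`, and `slope` are hypothetical; the data are the missingness indicator of x and the outcome y from dTrainC.

```r
# Indicator for "x is missing" and the outcome (as 0/1) from dTrainC.
ind <- c(0, 0, 0, 0, 0, 1)
y   <- c(0, 0, 1, 0, 1, 1)

# Fit the single-variable linear model and move the variable to y-units:
# its fitted prediction, centered by the mean of y.
fit <- lm(y ~ ind)
ind_scaled <- unname(fitted(fit)) - mean(y)
print(ind_scaled)

# The rescaled variable has mean 0 ...
print(mean(ind_scaled))

# ... and slope 1 when y is regressed on it with lm().
slope <- unname(coef(lm(y ~ ind_scaled))[2])
print(slope)
```

The rescaled values reproduce the x_lev_NA column of dTrainCTreated. Because the slope is 1 in the linear-regression sense, the logistic-regression coefficients reported by glm() above need not equal 1.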

A Numeric Outcome Example

# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA),
   z=c(1,2,3,4,5,NA,7),y=c(0,0,0,1,0,1,1))
head(dTrainN)
##   x  z y
## 1 a  1 0
## 2 a  2 0
## 3 a  3 0
## 4 a  4 1
## 5 b  5 0
## 6 b NA 1
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestN)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "desigining treatments Thu Apr 13 17:41:00 2017"
## [1] "designing treatments Thu Apr 13 17:41:00 2017"
## [1] " have level statistics Thu Apr 13 17:41:00 2017"
## [1] "design var x Thu Apr 13 17:41:00 2017"
## [1] "design var z Thu Apr 13 17:41:00 2017"
## [1] " scoring treatments Thu Apr 13 17:41:00 2017"
## [1] "have treatment plan Thu Apr 13 17:41:00 2017"
## [1] "rescoring complex variables Thu Apr 13 17:41:00 2017"
## [1] "done rescoring complex variables Thu Apr 13 17:41:00 2017"
print(treatmentsN)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x.a','x_lev_x.b')"
## 
## $treatments[[2]]
## [1] "vtreat 'Prevalence Code'('x'(integer,factor)->character->'x_catP')"
## 
## $treatments[[3]]
## [1] "vtreat 'Scalable Impact Code'('x'(integer,factor)->character->'x_catN')"
## 
## $treatments[[4]]
## [1] "vtreat 'Deviation Fact'('x'(integer,factor)->character->'x_catD')"
## 
## $treatments[[5]]
## [1] "vtreat 'Scalable pass through'('z'(double,numeric)->numeric->'z_clean')"
## 
## $treatments[[6]]
## [1] "vtreat 'is.bad'('z'(double,numeric)->numeric->'z_isBAD')"
## 
## 
## $scoreFrame
##     varName varMoves         rsq       sig needsSplit extraModelDegrees
## 1  x_lev_NA     TRUE 0.222222222 0.2855909      FALSE                 0
## 2 x_lev_x.a     TRUE 0.173611111 0.3524132      FALSE                 0
## 3 x_lev_x.b     TRUE 0.008333333 0.8456711      FALSE                 0
## 4    x_catP     TRUE 0.233333333 0.2721791       TRUE                 2
## 5    x_catN     TRUE 0.172043011 0.3548219       TRUE                 2
## 6    x_catD     TRUE 0.086668611 0.5215921       TRUE                 2
## 7   z_clean     TRUE 0.336111111 0.1724763      FALSE                 0
## 8   z_isBAD     TRUE 0.222222222 0.2855909      FALSE                 0
##   origName  code
## 1        x   lev
## 2        x   lev
## 3        x   lev
## 4        x  catP
## 5        x  catN
## 6        x  catD
## 7        z clean
## 8        z isBAD
## 
## $outcomename
## [1] "y"
## 
## $vtreatVersion
## [1] '0.5.31'
## 
## $outcomeType
## [1] "Numeric"
## 
## $outcomeTarget
## [1] "y"
## 
## $meanY
## [1] 0.4285714
## 
## $splitmethod
## [1] "oneway"
## 
## attr(,"class")
## [1] "treatmentplan"
dTrainNTreated <- prepare(treatmentsN,dTrainN,
                          pruneSig=c(),scale=TRUE)
head(dTrainNTreated)
##     x_lev_NA  x_lev_x.a   x_lev_x.b x_catP      x_catN     x_catD
## 1 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 2 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 3 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 4 -0.0952381 -0.1785714 -0.02857143   -0.2 -0.17857143 -0.1785714
## 5 -0.0952381  0.2380952  0.07142857    0.2  0.07142857  0.2380952
## 6 -0.0952381  0.2380952  0.07142857    0.2  0.07142857  0.2380952
##       z_clean    z_isBAD y
## 1 -0.41904762 -0.0952381 0
## 2 -0.26190476 -0.0952381 0
## 3 -0.10476190 -0.0952381 0
## 4  0.05238095 -0.0952381 1
## 5  0.20952381 -0.0952381 0
## 6  0.00000000  0.5714286 1
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean) 
##      x_lev_NA     x_lev_x.a     x_lev_x.b        x_catP        x_catN 
## -3.965082e-18  0.000000e+00 -2.974054e-18 -5.551115e-17 -3.965082e-18 
##        x_catD       z_clean       z_isBAD 
## -9.515810e-17  4.757324e-17 -3.967986e-18
# all slopes should be 1 for variables with treatmentsN$scoreFrame$sig<1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
   data=dTrainNTreated)$coefficients[[2]]}) 
##  x_lev_NA x_lev_x.a x_lev_x.b    x_catP    x_catN    x_catD   z_clean 
##         1         1         1         1         1         1         1 
##   z_isBAD 
##         1
# prepared frame
dTestNTreated <- prepare(treatmentsN,dTestN,
                         pruneSig=c())
head(dTestNTreated)
##   x_lev_NA x_lev_x.a x_lev_x.b    x_catP      x_catN    x_catD   z_clean
## 1        0         1         0 0.5714286 -0.17857143 0.5000000 10.000000
## 2        0         0         1 0.2857143  0.07142857 0.7071068 20.000000
## 3        0         0         0 0.0000000  0.00000000 0.7071068 30.000000
## 4        1         0         0 0.1428571  0.57142857 0.7071068  3.666667
##   z_isBAD
## 1       0
## 2       0
## 3       0
## 4       1
# scaled prepared frame
dTestNTreatedS <- prepare(treatmentsN,dTestN,
                         pruneSig=c(),scale=TRUE)
head(dTestNTreatedS)
##     x_lev_NA  x_lev_x.a   x_lev_x.b x_catP        x_catN     x_catD
## 1 -0.0952381 -0.1785714 -0.02857143   -0.2 -1.785714e-01 -0.1785714
## 2 -0.0952381  0.2380952  0.07142857    0.2  7.142857e-02  0.2380952
## 3 -0.0952381  0.2380952 -0.02857143    0.6 -1.586033e-17  0.2380952
## 4  0.5714286  0.2380952 -0.02857143    0.4  5.714286e-01  0.2380952
##     z_clean    z_isBAD
## 1 0.9952381 -0.0952381
## 2 2.5666667 -0.0952381
## 3 4.1380952 -0.0952381
## 4 0.0000000  0.5714286
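The z_clean and z_isBAD columns of the unscaled test frame above can be reproduced with a short base-R sketch: missing values are replaced by the mean of the non-missing training values and flagged in a separate indicator. This mirrors the printed output for this example only; the helper name `treat_numeric` is hypothetical, and vtreat's own handling may differ in detail.

```r
# z columns from dTrainN and dTestN.
z_train <- c(1, 2, 3, 4, 5, NA, 7)
z_test  <- c(10, 20, 30, NA)

# The replacement value is learned from the training data only.
z_fill <- mean(z_train, na.rm = TRUE)

# Hypothetical helper: returns a cleaned column and a missingness indicator.
treat_numeric <- function(z, fill) {
  bad <- is.na(z)
  data.frame(
    z_clean = ifelse(bad, fill, z),
    z_isBAD = as.numeric(bad)
  )
}

treated <- treat_numeric(z_test, z_fill)
print(treated)
```

Keeping the indicator alongside the filled-in value lets a downstream model distinguish a genuinely average z from a missing one.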

Related work: