vtreat scale mode

Win-Vector LLC


vtreat::prepare(scale=TRUE) is a variation of vtreat::prepare() intended to prepare data frames so all the derived input or independent (x) variables are fully in outcome or dependent variable (y) units. This is in the sense of a linear regression for numeric y’s (vtreat::designTreatmentsN and vtreat::mkCrossFrameNExperiment).
For classification problems (or categorical y’s) as of version 0.5.26 and newer (available here) scaling is established through a logistic regression “in link units” or as 0/1 indicators, depending on the setting of the catScaling argument in vtreat::designTreatmentsC or vtreat::mkCrossFrameCExperiment. Prior to this version, the scaling calculation for classification (and only the scaling calculation) was always handled as a linear regression against a 0/1 y-indicator. catScaling=FALSE can be a bit faster, as the underlying linear regression can be a bit quicker than a logistic regression.

This is the appropriate preparation before a geometry/metric sensitive modeling step such as principal components analysis or clustering (such as k-means clustering).

Normally (with vtreat::prepare(scale=FALSE)) vtreat passes through a number of variables with minimal alteration (cleaned numeric), builds 0/1 indicator variables for various conditions (categorical levels, presence of NAs, and so on), and builds some “in y-units” variables (catN, catB) that are in fact sub-models. With vtreat::prepare(scale=TRUE) all of these numeric variables are then re-processed to have mean zero, and slope 1 (when possible) when appropriately regressed against the y-variable.

This is easiest to illustrate with a concrete example.

library('vtreat')
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE,
                                 catScaling=FALSE)
dTrainCTreatedUnscaled <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=FALSE)
dTrainCTreatedScaled <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)

Note we have set catScaling=FALSE to ask that we treat y as a 0/1 indicator and scale using linear regression. The standard vtreat treated frame converts the original data from this:

##      x     y
## 1    a FALSE
## 2    a FALSE
## 3    a  TRUE
## 4    b FALSE
## 5    b  TRUE
## 6 <NA>  TRUE

into this:

##   x_lev_NA x_lev_x.a x_lev_x.b    x_catP     x_catB     y
## 1        0         1         0 0.5000000 -0.6930972 FALSE
## 2        0         1         0 0.5000000 -0.6930972 FALSE
## 3        0         1         0 0.5000000 -0.6930972  TRUE
## 4        0         0         1 0.3333333  0.0000000 FALSE
## 5        0         0         1 0.3333333  0.0000000  TRUE
## 6        1         0         0 0.1666667  9.2104404  TRUE

This is the “standard way” to run vtreat – with the exception that for this example we set pruneSig to NULL to suppress variable pruning, instead of setting it to a value in the interval (0,1). The principle is: vtreat inflicts the minimal possible alterations on the data, leaving as much as possible to the downstream machine learning code. This does turn out to already be a lot of alteration. Mostly vtreat is taking only steps that are unsafe to leave for later: re-encoding of large categoricals, re-coding of aberrant values, and bulk pruning of variables.

However, some procedures, in particular principal components analysis or geometric clustering, assume all of the columns have been fully transformed. The usual assumption (“more honored in the breach than the observance”) is that the columns are centered (mean zero) and scaled. The non y-aware meaning of “scaled” is unit variance. However, vtreat is designed to emphasize y-aware processing and we feel the y-aware sense of scaling should be: unit slope when regressed against y. If you want standard scaling you can use the standard frame produced by vtreat and scale it yourself. If you want vtreat style y-aware scaling (which we strongly think is the right thing to do) you can use vtreat::prepare(scale=TRUE), which produces a frame that looks like the following:

##   x_lev_NA  x_lev_x.a x_lev_x.b x_catP      x_catB     y
## 1     -0.1 -0.1666667         0   -0.2 -0.11976374 FALSE
## 2     -0.1 -0.1666667         0   -0.2 -0.11976374 FALSE
## 3     -0.1 -0.1666667         0   -0.2 -0.11976374  TRUE
## 4     -0.1  0.1666667         0    0.1 -0.07564865 FALSE
## 5     -0.1  0.1666667         0    0.1 -0.07564865  TRUE
## 6      0.5  0.1666667         0    0.4  0.51058851  TRUE
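To make the definition concrete, here is a minimal base R sketch of the linear (catScaling=FALSE) y-aware scaling rule for two of the indicator columns. The helper yAwareScale is a hypothetical name introduced for illustration and is not part of vtreat; it simply multiplies the centered column by its linear regression slope against y, which appears to reproduce the x_lev_NA and x_lev_x.a columns printed above.

```r
# Hypothetical helper (not part of vtreat): y-aware scaling of one numeric column.
yAwareScale <- function(x, y) {
  slope <- coef(lm(y ~ x))[[2]]   # linear-regression slope of y on x
  slope * (x - mean(x))           # center, then rescale into y-units
}

y <- c(0, 0, 1, 0, 1, 1)            # the example outcome as a 0/1 indicator
x_lev_NA  <- c(0, 0, 0, 0, 0, 1)    # unscaled indicator columns from the standard frame
x_lev_x.a <- c(1, 1, 1, 0, 0, 0)

scaledNA <- yAwareScale(x_lev_NA, y)
scaledA  <- yAwareScale(x_lev_x.a, y)
print(round(scaledNA, 7))
print(round(scaledA, 7))

# Regressing y on a rescaled column gives slope 1.
print(coef(lm(y ~ scaledNA))[[2]])
```

This is only a sketch of the idea; vtreat itself handles additional details (such as the sub-model variables and variables that cannot be scaled).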

First we can check the claims. Are the variables mean-zero and slope 1 when regressed against y?

slopeFrame <- data.frame(varName = treatmentsC$scoreFrame$varName,
                         stringsAsFactors = FALSE)
# mean of each scaled column
slopeFrame$mean <-
  vapply(dTrainCTreatedScaled[, slopeFrame$varName, drop = FALSE], mean,
         numeric(1))
# slope of y regressed on each scaled column
slopeFrame$slope <- vapply(slopeFrame$varName,
                           function(c) {
                             lm(paste('y', c, sep = '~'),
                                data = dTrainCTreatedScaled)$coefficients[[2]]
                           },
                           numeric(1))
# significance of each variable, as recorded in the treatment plan
slopeFrame$sig <- vapply(slopeFrame$varName,
                         function(c) {
                           treatmentsC$scoreFrame[treatmentsC$scoreFrame$varName == c, 'sig']
                         },
                         numeric(1))
slopeFrame$badSlope <-
  ifelse(is.na(slopeFrame$slope), TRUE, abs(slopeFrame$slope - 1) > 1.e-8)
print(slopeFrame)
##     varName          mean slope        sig badSlope
## 1  x_lev_NA -6.938894e-18     1 0.20766228    FALSE
## 2 x_lev_x.a  0.000000e+00     1 0.40972582    FALSE
## 3 x_lev_x.b  0.000000e+00    NA 1.00000000     TRUE
## 4    x_catP  1.850372e-17     1 0.25493078    FALSE
## 5    x_catB  1.387779e-17     1 0.04220134    FALSE

The above claims are true with the exception of the derived variable x_lev_x.b. This is because the outcome variable y has an identical distribution when the original variable x=='b' and when x!='b' (y is TRUE half the time in both cases). This means y is perfectly independent of x=='b' and the regression slope must be zero (thus, cannot be 1). vtreat now treats this as needing to scale by a multiplicative factor of zero. Note also that the significance level associated with x_lev_x.b is large, making this variable easy to prune. The varMoves and significance facts in treatmentsC$scoreFrame are about the unscaled frame (where x_lev_x.b does in fact move).
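The independence argument can be checked directly in base R. The sketch below rebuilds the x=='b' indicator from the example data (the variable names are illustrative only) and shows that the rate of TRUE outcomes is the same in both groups, that the fitted slope is numerically zero, and that a column scaled by zero is constant, so lm() reports an NA slope for it.

```r
y   <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
x   <- c('a', 'a', 'a', 'b', 'b', NA)
isB <- !is.na(x) & x == 'b'          # same 0/1 pattern as the unscaled x_lev_x.b

rateB    <- mean(y[isB])             # rate of TRUE when x == 'b'
rateNotB <- mean(y[!isB])            # rate of TRUE otherwise
slopeB   <- coef(lm(as.numeric(y) ~ as.numeric(isB)))[[2]]
print(c(rateB = rateB, rateNotB = rateNotB, slopeB = slopeB))

# After multiplying by a zero slope the column is constant,
# so lm() cannot estimate a slope for it and returns NA.
constantCol <- rep(0, length(y))
slopeConst  <- coef(lm(as.numeric(y) ~ constantCol))[[2]]
print(slopeConst)
```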

For a good discussion of the application of y-aware scaling to Principal Components Analysis please see here.

Previous versions of vtreat (0.5.22 and earlier) would copy variables that could not be sensibly scaled into the treated frame unaltered. This was considered the “most faithful” thing to do. However we now feel that this practice was not safe for many downstream procedures, such as principal components analysis and geometric clustering.

Categorical outcome mode “catScaling=TRUE”

As of version 0.5.26 vtreat also supports a “scaling mode for categorical outcomes.” In this mode scaling is performed using the coefficient of a logistic regression fit on the categorical outcome, instead of the coefficient of a linear fit (with the outcome encoded as a zero/one indicator).

The idea is that with this mode on we are scaling as a logistic regression would, so we are in logistic regression “link space” (where logistic regression assumes effects are additive). The mode may be well suited for principal components analysis or principal components regression where the target variable is categorical (i.e. classification tasks).

To ensure this effect we set the argument catScaling=TRUE in vtreat::designTreatmentsC or vtreat::mkCrossFrameCExperiment. We demonstrate this below.

treatmentsC2 <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE,
                                  catScaling=TRUE)
dTrainCTreatedScaled2 <- prepare(treatmentsC2,dTrainC,pruneSig=c(),scale=TRUE)
print(dTrainCTreatedScaled2)
##    x_lev_NA  x_lev_x.a x_lev_x.b     x_catP    x_catB     y
## 1 -3.161922 -0.6931472         0 -0.9396225 -1.894112 FALSE
## 2 -3.161922 -0.6931472         0 -0.9396225 -1.894112 FALSE
## 3 -3.161922 -0.6931472         0 -0.9396225 -1.894112  TRUE
## 4 -3.161922  0.6931472         0  0.4698112 -1.196414 FALSE
## 5 -3.161922  0.6931472         0  0.4698112 -1.196414  TRUE
## 6 15.809611  0.6931472         0  1.8792449  8.075166  TRUE

Notice the new scaled frame is on a different scale than the original scaled frame. Which scaling is more appropriate or useful likely depends on the problem domain.

The new scaled columns are again mean-0 (so they are not exactly the logistic link values, which may not have been so shifted). The new scaled columns do not necessarily have linear model slope 1 as the original scaled columns did as we see below:

##     x_lev_NA    x_lev_x.a    x_lev_x.b       x_catP       x_catB 
## 0.000000e+00 0.000000e+00 0.000000e+00 3.700743e-16 7.401487e-17 
##            y 
## 5.000000e-01

## Call:
## lm(formula = y ~ x_lev_NA, data = dTrainCTreatedScaled)
## Coefficients:
## (Intercept)     x_lev_NA  
##         0.5          1.0

## Call:
## lm(formula = y ~ x_lev_NA, data = dTrainCTreatedScaled2)
## Coefficients:
## (Intercept)     x_lev_NA  
##     0.50000      0.03163

The new scaled columns, however, are in good logistic link units: a logistic regression of y on each scaled column has coefficient 1.

vapply(slopeFrame$varName,
       function(c) {
         glm(paste('y', c, sep = '~'),family=binomial,
             data = dTrainCTreatedScaled2)$coefficients[[2]]
       },
       numeric(1))
##  x_lev_NA x_lev_x.a x_lev_x.b    x_catP    x_catB 
##         1         1        NA         1         1
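As a rough illustration of what “link units” means here, the following base R sketch applies the same center-then-multiply rule as in the linear case, but with a logistic regression coefficient as the multiplier. It uses only the x_lev_x.a indicator, and the variable names are illustrative; x_lev_NA is deliberately avoided because y is always TRUE when x is NA, which makes a plain glm() fit on that column unstable. Under these assumptions the result appears to match the x_lev_x.a column of the catScaling=TRUE frame.

```r
y         <- c(0, 0, 1, 0, 1, 1)     # the example outcome as a 0/1 indicator
x_lev_x.a <- c(1, 1, 1, 0, 0, 0)     # unscaled indicator for x == 'a'

# Logistic-regression coefficient of y on the unscaled indicator.
linkSlope <- coef(glm(y ~ x_lev_x.a, family = binomial))[[2]]
scaledA   <- linkSlope * (x_lev_x.a - mean(x_lev_x.a))
print(round(scaledA, 7))

# Refitting on the rescaled column gives a logistic coefficient of 1.
refitSlope <- coef(glm(y ~ scaledA, family = binomial))[[2]]
print(refitSlope)
```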


The intended applications of scale mode include preparing data for metric sensitive applications such as KNN classification/regression and Principal Components Analysis/Regression. Please see here for an article series describing such applications.

Overall the advice is to first use the following pattern:

However, practitioners experienced in principal components analysis may be uncomfortable with the range of eigenvalues or singular values returned by y-aware analysis. If a more familiar scale is desired we suggest performing the y-aware scaling against an additional scaled and centered y to try to get ranges closer to the traditional unit ranges. This can be achieved as shown below.

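dTrainN <- data.frame(x1=rnorm(100),
                      x2=rnorm(100), x3=rnorm(100))  # x2, x3: assumed drawn like x1
dTrainN$y <- 1000*(dTrainN$x1 + dTrainN$x2)
cEraw <- vtreat::mkCrossFrameNExperiment(dTrainN,
                                         c('x1','x2','x3'),'y',scale=TRUE)
dM1 <- as.matrix(cEraw$crossFrame[,c('x1_clean','x2_clean','x3_clean')])
pCraw <- stats::prcomp(dM1,
                       scale.=FALSE,center=TRUE)  # assumed: keep the y-aware scale
print(pCraw)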
## Standard deviations:
## [1] 1157.8816 1044.9080  179.8415
## Rotation:
##                  PC1         PC2         PC3
## x1_clean  0.98564543 -0.16880819 0.002623125
## x2_clean  0.16849926  0.98457413 0.047135824
## x3_clean -0.01053957 -0.04601721 0.998885045
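dTrainN$yScaled <- scale(dTrainN$y,center=TRUE,scale=TRUE)
cEscaled <- vtreat::mkCrossFrameNExperiment(dTrainN,
                                            c('x1','x2','x3'),'yScaled',scale=TRUE)
dM2 <- as.matrix(cEscaled$crossFrame[,c('x1_clean','x2_clean','x3_clean')])
pCscaled <- stats::prcomp(dM2,
                          scale.=FALSE,center=TRUE)  # assumed: keep the y-aware scale
print(pCscaled)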
## Standard deviations:
## [1] 0.7843205 0.7283565 0.1177734
## Rotation:
##                 PC1          PC2         PC3
## x1_clean  0.9810074 -0.191767234 0.029151677
## x2_clean  0.1915603  0.981432326 0.009759229
## x3_clean -0.0304819 -0.003989572 0.999527357

Notice the second application of stats::prcomp reports standard deviations on a more conventional scale (though we still do not advise choosing latent variables based on mere comparisons to unit magnitude).
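The effect of standardizing y can be seen in a small base R sketch that does not use vtreat at all. The helper yAwareScale is a hypothetical stand-in for plain (non cross-validated) y-aware scaling, and the seed is arbitrary, so the numbers will not match the vtreat cross-frame output above. Under this simplified rule, standardizing y divides every scaled column, and therefore every reported standard deviation, by sd(y).

```r
set.seed(2024)                        # arbitrary seed, for repeatability only
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1000 * (d$x1 + d$x2)

# Hypothetical helper (not part of vtreat): plain y-aware scaling of one column.
yAwareScale <- function(x, y) coef(lm(y ~ x))[[2]] * (x - mean(x))

yStd <- as.numeric(scale(y, center = TRUE, scale = TRUE))
mRaw <- sapply(d, yAwareScale, y = y)       # columns in raw y units
mStd <- sapply(d, yAwareScale, y = yStd)    # columns in standardized y units

pRaw <- stats::prcomp(mRaw, center = TRUE, scale. = FALSE)
pStd <- stats::prcomp(mStd, center = TRUE, scale. = FALSE)

# Standardizing y divides every column, and hence every sdev, by sd(y).
print(pRaw$sdev / pStd$sdev)
print(sd(y))
```

The principal directions are unchanged by this rescaling in the sketch; only the magnitudes move toward the more familiar range.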