vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Let’s work a specific example: trying to model multi-class y as a function of x1, x2, and x3.

library("vtreat")
## Loading required package: wrapr
# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_along(sym_bonuses3))
n_row <- 1000
d <- data.frame(
  x1 = rnorm(n_row),
  x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
  x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
  y = "NoInfo",
  stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] > 
      pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] > 
      pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"

knitr::kable(head(d))
|         x1|x2 |x3 |y      |
|----------:|:--|:--|:------|
|  0.8178292|a  |2  |NoInfo |
|  0.5867139|c  |1  |NoInfo |
| -0.6711920|a  |3  |Large2 |
|  0.1033166|a  |2  |Large1 |
| -0.3182176|c  |3  |Large2 |
| -0.5914308|c  |2  |NoInfo |

We define the problem controls and use mkCrossFrameMExperiment() to build both a cross-frame and a treatment plan.

# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"

# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)

The cross-frame is the entity safest to train on (unless you have made a separate data split for the treatment design step). It uses cross-validation to reduce nested model bias. Some notes on this issue are available here and here.

# look at the data we would train models on
str(cfe_m$cross_frame)
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1            : num  0.818 0.587 -0.671 0.103 -0.318 ...
##  $ x2_catP       : num  0.333 0.334 0.333 0.333 0.334 0.334 0.333 0.333 0.333 0.333 ...
##  $ x3_catP       : num  0.35 0.321 0.329 0.35 0.329 0.35 0.321 0.321 0.321 0.35 ...
##  $ x2_lev_x_a    : num  1 0 1 1 0 0 0 0 1 1 ...
##  $ x2_lev_x_b    : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ x2_lev_x_c    : num  0 1 0 0 1 1 0 0 0 0 ...
##  $ x3_lev_x_1    : num  0 1 0 0 0 0 1 1 1 0 ...
##  $ x3_lev_x_2    : num  1 0 0 1 0 1 0 0 0 1 ...
##  $ x3_lev_x_3    : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ Large1_x2_catB: num  1.23 -10.72 1.15 1.16 -10.53 ...
##  $ Large1_x3_catB: num  0.7025 0.0903 -10.4833 0.6238 -10.529 ...
##  $ Large2_x2_catB: num  0.17979 0.19661 -0.00379 -0.09818 0.00627 ...
##  $ Large2_x3_catB: num  -13.12 -13.05 4.49 -4.03 4.71 ...
##  $ NoInfo_x2_catB: num  -0.48752 -0.00254 -0.27947 -0.26155 0.15195 ...
##  $ NoInfo_x3_catB: num  2.05 2.43 -4.34 1.79 -4.55 ...
##  $ y             : chr  "NoInfo" "NoInfo" "Large2" "Large1" ...
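
For example, one could fit a multinomial model directly on this cross-frame. The following is a minimal sketch, not part of the original example: it assumes the nnet package is installed, and uses wrapr::mk_formula() (wrapr is already loaded) to build the model formula over all treated variables.

# sketch: fit a multinomial logit model on the cross-frame
# (assumes the nnet package is installed; not from the original example)
library("nnet")
train_frame <- cfe_m$cross_frame
train_frame$y <- as.factor(train_frame$y)
# formula over all treated variables except the outcome
model_vars <- setdiff(colnames(train_frame), "y")
f <- wrapr::mk_formula("y", model_vars)
model <- multinom(f, data = train_frame, trace = FALSE)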

The prepare() method applies the designed treatments to new data. Here we simulate new data by re-using our design data, NA-ing out the top row to show missing-value processing.

# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1            : num  0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
##  $ x2_catP       : num  0.0005 0.334 0.333 0.333 0.334 0.334 0.333 0.333 0.333 0.333 ...
##  $ x3_catP       : num  0.0005 0.321 0.329 0.35 0.329 0.35 0.321 0.321 0.321 0.35 ...
##  $ x2_lev_x_a    : num  0 0 1 1 0 0 0 0 1 1 ...
##  $ x2_lev_x_b    : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ x2_lev_x_c    : num  0 1 0 0 1 1 0 0 0 0 ...
##  $ x3_lev_x_1    : num  0 1 0 0 0 0 1 1 1 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 0 0 1 ...
##  $ x3_lev_x_3    : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ Large1_x2_catB: num  0 -10.58 1.18 1.18 -10.58 ...
##  $ Large1_x3_catB: num  0 0.284 -10.584 0.529 -10.584 ...
##  $ Large2_x2_catB: num  0 0.1 0.0242 0.0242 0.1 ...
##  $ Large2_x3_catB: num  0 -13.08 4.72 -4.43 4.72 ...
##  $ NoInfo_x2_catB: num  0 0.0685 -0.3392 -0.3392 0.0685 ...
##  $ NoInfo_x3_catB: num  0 2.39 -4.55 2.05 -4.55 ...
##  $ y             : chr  "NoInfo" "NoInfo" "Large2" "Large1" ...
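
Continuing the earlier sketch (again assuming nnet and the model object fit above), the same two-step pattern scores genuinely new data: prepare() with the stored treatment plan, then predict with the fitted model.

# sketch: apply the treatment plan, then the model, to new data
d_treated <- prepare(cfe_m$treat_m, d)
probs <- predict(model, newdata = d_treated, type = "probs")
head(probs)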

Obvious open issues include computing variable importance, and the blow-up and co-dependency of the produced columns. We leave these for the next modeling step to deal with (as is our philosophy with most issues involving joint distributions of variables).

We also have per-outcome variable importance.

knitr::kable(
  cfe_m$score_frame[, 
                    c("varName", "rsq", "sig", "outcome_level"), 
                    drop = FALSE])
|varName        |       rsq|       sig|outcome_level |
|:--------------|---------:|---------:|:-------------|
|x1             | 0.0427675| 0.0002015|Large1        |
|x2_catP        | 0.0979334| 0.0000000|Large1        |
|x2_lev_x_a     | 0.2681130| 0.0000000|Large1        |
|x2_lev_x_b     | 0.0975700| 0.0000000|Large1        |
|x2_lev_x_c     | 0.0979334| 0.0000000|Large1        |
|x3_catP        | 0.0125618| 0.0439536|Large1        |
|x3_lev_x_1     | 0.0053772| 0.1874933|Large1        |
|x3_lev_x_2     | 0.0266092| 0.0033678|Large1        |
|x3_lev_x_3     | 0.0961219| 0.0000000|Large1        |
|x1             | 0.0003984| 0.4784542|Large2        |
|x2_catP        | 0.0008969| 0.2875322|Large2        |
|x2_lev_x_a     | 0.0000512| 0.7994128|Large2        |
|x2_lev_x_b     | 0.0013961| 0.1845435|Large2        |
|x2_lev_x_c     | 0.0008969| 0.2875322|Large2        |
|x3_catP        | 0.0574052| 0.0000000|Large2        |
|x3_lev_x_1     | 0.2546121| 0.0000000|Large2        |
|x3_lev_x_2     | 0.2659830| 0.0000000|Large2        |
|x3_lev_x_3     | 0.9308590| 0.0000000|Large2        |
|x1             | 0.0035420| 0.0312177|NoInfo        |
|x2_catP        | 0.0004091| 0.4641054|NoInfo        |
|x2_lev_x_a     | 0.0108027| 0.0001684|NoInfo        |
|x2_lev_x_b     | 0.0072297| 0.0020855|NoInfo        |
|x2_lev_x_c     | 0.0004091| 0.4641054|NoInfo        |
|x3_catP        | 0.0416046| 0.0000000|NoInfo        |
|x3_lev_x_1     | 0.1848006| 0.0000000|NoInfo        |
|x3_lev_x_2     | 0.1796720| 0.0000000|NoInfo        |
|x3_lev_x_3     | 0.7228777| 0.0000000|NoInfo        |
|Large1_x2_catB | 0.2679354| 0.0000000|Large1        |
|Large1_x3_catB | 0.0835409| 0.0000002|Large1        |
|Large2_x2_catB | 0.0002176| 0.6004146|Large2        |
|Large2_x3_catB | 0.9064823| 0.0000000|Large2        |
|NoInfo_x2_catB | 0.0080585| 0.0011565|NoInfo        |
|NoInfo_x3_catB | 0.7143906| 0.0000000|NoInfo        |

One can relate these per-outcome and per-treatment performances back to the original columns by aggregating.

tapply(cfe_m$score_frame$rsq, 
       cfe_m$score_frame$origName, 
       max)
##         x1         x2         x3 
## 0.04276746 0.26811298 0.93085900
tapply(cfe_m$score_frame$sig, 
       cfe_m$score_frame$origName, 
       min)
##            x1            x2            x3 
##  2.015164e-04  1.315559e-20 2.777723e-257
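
These aggregates can also drive a simple variable screen. The sketch below keeps treated variables that are significant for at least one outcome; the threshold choice is an illustrative assumption, not a vtreat recommendation.

# sketch: screen treated variables by per-outcome significance
# (the threshold choice here is illustrative, not a vtreat recommendation)
sig_threshold <- 1 / nrow(cfe_m$score_frame)
good_vars <- unique(
  cfe_m$score_frame$varName[cfe_m$score_frame$sig < sig_threshold])
good_vars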