Use a treatment plan to prepare a data frame for analysis. The
resulting frame will have new effective variables that are numeric
and free of NaN/NA. If the outcome column is present it will be copied over.
The intent is that these frames are compatible with more machine learning
techniques, and avoid a lot of corner cases (NA, NaN, novel levels, too many levels).
Note: each column is processed independently of all others.
Note: treatmentplan objects are not meant for long-term storage; a warning is issued if the version of
vtreat that produced the plan differs from the version running.
# S3 method for treatmentplan
prepare(
  treatmentplan,
  dframe,
  ...,
  pruneSig = NULL,
  scale = FALSE,
  doCollar = FALSE,
  varRestriction = NULL,
  codeRestriction = NULL,
  trackedValues = NULL,
  extracols = NULL,
  parallelCluster = NULL,
  use_parallel = TRUE,
  check_for_duplicate_frames = TRUE
)
treatmentplan: Plan built by designTreatmentsC() or designTreatmentsN().
dframe: Data frame to be treated.
...: no additional arguments, declared to force named binding of later arguments.
pruneSig: suppress variables with significance above this level.
scale: optional, if TRUE replace numeric variables with single-variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems, glm for classification problems) against the outcome.
doCollar: optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design.
varRestriction: optional list of treated variable names to restrict to.
codeRestriction: optional list of treated variable codes to restrict to.
trackedValues: optional named list mapping variables to known values, allows warnings upon novel level appearances (see
extracols: extra columns to copy.
parallelCluster: (optional) a cluster object created by package parallel or package snow.
use_parallel: logical, if TRUE use parallel methods.
check_for_duplicate_frames: logical, if TRUE check if we called prepare on same data.frame as design step.
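The scale and doCollar descriptions above are compact, so a small base-R sketch may help make them concrete. It is an illustration under stated assumptions, not vtreat's internal code: it covers only the regression (lm) case of scale, it assumes collaring clamps values to the collarProb and 1 - collarProb quantiles, and the names x_scaled, collar_prob, cuts and z_collared are hypothetical.

```r
# Base-R illustration only; not vtreat's implementation.

# --- scale: "move to outcome-scale" (regression case, via lm) ---
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 1.9, 3.2, 3.9, 5.1, 6.3)

fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# Centered single-variable prediction: mean zero by construction.
x_scaled <- slope * (x - mean(x))

# Regressing the outcome on the rescaled variable gives slope 1.
refit <- lm(y ~ x_scaled)
print(mean(x_scaled))
print(coef(refit)[["x_scaled"]])

# --- doCollar: clamp numeric values at tail quantiles ---
collar_prob <- 0.1          # hypothetical stand-in for collarProb
z <- c(1:9, 1000)           # one extreme value in the upper tail

# Assumption: cut points are the collar_prob and 1 - collar_prob quantiles.
cuts <- quantile(z, probs = c(collar_prob, 1 - collar_prob), names = FALSE)
z_collared <- pmin(pmax(z, cuts[1]), cuts[2])
print(range(z))
print(range(z_collared))
```

In actual use the cut points and regression coefficients would be learned during treatment design and only applied by prepare(); the sketch computes them in place to stay self-contained.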
Treated data frame (all columns numeric, without NA or NaN).
# categorical example
set.seed(23525)

# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
#>   origName   varName  code         rsq        sig extraModelDegrees
#> 1        x    x_catP  catP 0.166956795 0.20643885                 2
#> 2        x    x_catB  catB 0.254788311 0.11858143                 2
#> 3        z         z clean 0.237601767 0.13176020                 0
#> 4        z   z_isBAD isBAD 0.296065432 0.09248399                 0
#> 5        x  x_lev_NA   lev 0.296065432 0.09248399                 0
#> 6        x x_lev_x_a   lev 0.130005705 0.26490379                 0
#> 7        x x_lev_x_b   lev 0.006067337 0.80967242                 0

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)
#>   x_catP      x_catB z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b     y
#> 1   0.50   0.0000000 1       0        0         1         0 FALSE
#> 2   0.40  -0.4054484 2       0        0         1         0 FALSE
#> 3   0.40 -10.3089860 3       0        0         1         0  TRUE
#> 4   0.20   8.8049919 4       0        0         0         1 FALSE
#> 5   0.25  -9.2104404 3       1        0         0         1  TRUE
#> 6   0.25   9.2104404 6       0        1         0         0  TRUE

# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)

dTestCTreated %.>%
  head(.) %.>%
  print(.)
#>       x_catP     x_catB    z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b
#> 1 0.42857143 -0.9807709 10.0       0        0         1         0
#> 2 0.28571429 -0.2876737 20.0       0        0         0         1
#> 3 0.07142857  0.0000000 30.0       0        0         0         0
#> 4 0.28571429  9.6158638  3.2       1        1         0         0
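
To see how the prepared test frame above ends up free of NA and tolerant of the novel level 'c', the z, z_isBAD and x_lev_* columns can be rebuilt by hand in base R. This is a sketch for intuition, not vtreat's implementation: it assumes missing z values are filled with the training mean (consistent with the 3.2 shown above) and that an unseen level gets zero in every indicator column. The names z_fill and manual are hypothetical.

```r
# Hand-rolled base-R sketch; an illustration, not vtreat's implementation.
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  stringsAsFactors = FALSE)

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA),
  stringsAsFactors = FALSE)

# Learn the fill value for z from the training data only.
z_fill <- mean(dTrainC$z, na.rm = TRUE)

manual <- data.frame(
  # numeric column with missing values replaced
  z         = ifelse(is.na(dTestC$z), z_fill, dTestC$z),
  # indicator that the original value was missing
  z_isBAD   = as.numeric(is.na(dTestC$z)),
  # level indicators; the novel level 'c' matches none of them
  x_lev_NA  = as.numeric(is.na(dTestC$x)),
  x_lev_x_a = as.numeric(!is.na(dTestC$x) & dTestC$x == 'a'),
  x_lev_x_b = as.numeric(!is.na(dTestC$x) & dTestC$x == 'b'))

print(manual)
```

The x_catP and x_catB columns are left out because they depend on statistics estimated during treatment design, which this sketch does not attempt to reproduce.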