Use a treatment plan to prepare a data frame for analysis. The
resulting frame will have new effective variables that are numeric
and free of NaN/NA. If the outcome column is present, it is copied over.
The intent is that these frames are compatible with more machine learning
techniques and avoid many corner cases (NA, NaN, novel levels, too many levels).
Note: each column is processed independently of all others.
Note: treatmentplan objects are not meant for long-term storage; a warning is issued if the version of
vtreat that produced the plan differs from the version running prepare().

    # S3 method for treatmentplan
    prepare(
      treatmentplan,
      dframe,
      ...,
      pruneSig = NULL,
      scale = FALSE,
      doCollar = FALSE,
      varRestriction = NULL,
      codeRestriction = NULL,
      trackedValues = NULL,
      extracols = NULL,
      parallelCluster = NULL,
      use_parallel = TRUE,
      check_for_duplicate_frames = TRUE
    )

Argument | Description |
---|---|
treatmentplan | Plan built by designTreatmentsC() or designTreatmentsN() |
dframe | Data frame to be treated |
... | no additional arguments; declared to force named binding of later arguments |
pruneSig | suppress variables with significance above this level |
scale | optional; if TRUE, replace numeric variables with single-variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems, glm for classification problems) against the outcome. |
doCollar | optional; if TRUE, collar numeric variables by cutting off after a tail probability specified by collarProb during treatment design. |
varRestriction | optional list of treated variable names to restrict to |
codeRestriction | optional list of treated variable codes to restrict to |
trackedValues | optional named list mapping variables to known values; allows warnings upon novel level appearances (see
extracols | extra columns to copy. |
parallelCluster | (optional) a cluster object created by package parallel or package snow. |
use_parallel | logical, if TRUE use parallel methods. |
check_for_duplicate_frames | logical, if TRUE check if we called prepare on same data.frame as design step. |
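
As a minimal sketch of how a few of these arguments combine, the following builds a small outcome-free plan and applies it to new data. It assumes vtreat is installed; the data frames `dTrain` and `dNew` and the chosen `collarProb` value are made up for illustration and do not come from this page.

```r
library(vtreat)

# Hypothetical toy data (an assumption for this sketch).
dTrain <- data.frame(
  x = c('a', 'a', 'b', 'b', NA, 'a', 'b', 'a'),
  z = c(1, 2, NA, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE)
dNew <- data.frame(
  x = c('a', 'c', NA),   # 'c' is a level never seen in dTrain
  z = c(100, NA, 3),     # 100 lies far outside the training range
  stringsAsFactors = FALSE)

# Design step: collarProb is fixed here and only takes effect
# later if prepare() is called with doCollar = TRUE.
plan <- designTreatmentsZ(
  dTrain, c('x', 'z'),
  collarProb = 0.1,
  verbose = FALSE)

# Application step: keep only some variable codes and collar numerics.
treated <- prepare(
  plan, dNew,
  codeRestriction = c('clean', 'isBAD', 'lev'),
  doCollar = TRUE)

print(treated)
```

Arguments after `...` have to be passed by name, which is why `codeRestriction` and `doCollar` are spelled out in the call.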
Returns the treated data frame (all columns numeric, without NA or NaN).
See also: mkCrossFrameCExperiment, mkCrossFrameNExperiment, designTreatmentsC, designTreatmentsN, designTreatmentsZ, prepare

    # categorical example
    library(vtreat)
    library(wrapr)  # provides unpack[] and %.>%
    set.seed(23525)

    # we set up our raw training and application data
    dTrainC <- data.frame(
      x = c('a', 'a', 'a', 'b', 'b', NA, NA),
      z = c(1, 2, 3, 4, NA, 6, NA),
      y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

    dTestC <- data.frame(
      x = c('a', 'b', 'c', NA),
      z = c(10, 20, 30, NA))

    # we perform a vtreat cross frame experiment
    # and unpack the results into treatmentsC
    # and dTrainCTreated
    unpack[
      treatmentsC = treatments,
      dTrainCTreated = crossFrame
      ] <- mkCrossFrameCExperiment(
        dframe = dTrainC,
        varlist = setdiff(colnames(dTrainC), 'y'),
        outcomename = 'y',
        outcometarget = TRUE,
        verbose = FALSE)

    # the treatments include a score frame relating new
    # derived variables to original columns
    treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
      print(.)
    #>   origName   varName  code         rsq        sig extraModelDegrees
    #> 1        x    x_catP  catP 0.166956795 0.20643885                 2
    #> 2        x    x_catB  catB 0.254788311 0.11858143                 2
    #> 3        z         z clean 0.237601767 0.13176020                 0
    #> 4        z   z_isBAD isBAD 0.296065432 0.09248399                 0
    #> 5        x  x_lev_NA   lev 0.296065432 0.09248399                 0
    #> 6        x x_lev_x_a   lev 0.130005705 0.26490379                 0
    #> 7        x x_lev_x_b   lev 0.006067337 0.80967242                 0

    # the treated frame is a "cross frame" which
    # is a transform of the training data built
    # as if the treatment were learned on a different
    # disjoint training set to avoid nested model
    # bias and over-fit.
    dTrainCTreated %.>%
      head(.) %.>%
      print(.)
    #>   x_catP      x_catB z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b     y
    #> 1   0.50   0.0000000 1       0        0         1         0 FALSE
    #> 2   0.40  -0.4054484 2       0        0         1         0 FALSE
    #> 3   0.40 -10.3089860 3       0        0         1         0  TRUE
    #> 4   0.20   8.8049919 4       0        0         0         1 FALSE
    #> 5   0.25  -9.2104404 3       1        0         0         1  TRUE
    #> 6   0.25   9.2104404 6       0        1         0         0  TRUE

    # Any future application data is prepared with
    # the prepare method.
    dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig = NULL)

    dTestCTreated %.>%
      head(.) %.>%
      print(.)
    #>       x_catP     x_catB    z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b
    #> 1 0.42857143 -0.9807709 10.0       0        0         1         0
    #> 2 0.28571429 -0.2876737 20.0       0        0         0         1
    #> 3 0.07142857  0.0000000 30.0       0        0         0         0
    #> 4 0.28571429  9.6158638  3.2       1        1         0         0