Builds a designTreatmentsC treatment plan and a data frame prepared
from dframe that is "cross" in the sense that each row is treated using a treatment
plan built from a subset of dframe disjoint from the given row.
The goal is to supply a method of breaking nested model bias other than splitting
the data into calibration, training, and test sets.
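The cross-frame idea can be illustrated without vtreat. The following is a minimal base-R sketch, not vtreat's implementation: the helper name cross_encode, the random fold assignment, and the use of a plain per-level success rate (rather than vtreat's own variable codes) are all assumptions made for illustration. Each row's encoded value is computed only from rows in other folds, so a row's own outcome never contributes to its encoding.

```r
# Hypothetical helper (not part of vtreat): out-of-fold encoding of a
# categorical column by the success rate observed in the other folds.
cross_encode <- function(x, y, ncross = 3) {
  n <- length(x)
  # assign each row to one of ncross folds, in random order
  fold <- sample(rep_len(seq_len(ncross), n))
  # treat NA as its own level
  x <- ifelse(is.na(x), "_NA_", as.character(x))
  enc <- numeric(n)
  for (k in seq_len(ncross)) {
    app <- fold == k    # rows encoded in this round
    train <- !app       # disjoint rows used to learn the encoding
    # per-level success rate, learned from the training rows only
    rates <- vapply(split(y[train], x[train]), mean, numeric(1))
    fallback <- mean(y[train])   # used for levels unseen in training rows
    vals <- rates[x[app]]
    vals[is.na(vals)] <- fallback
    enc[app] <- unname(vals)
  }
  list(encoded = enc, fold = fold)
}

set.seed(2024)
x <- c('a', 'a', 'a', 'b', 'b', NA, NA, 'a', 'b')
y <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
res <- cross_encode(x, y, ncross = 3)
print(data.frame(x = x, y = y, fold = res$fold, encoded = res$encoded))
```

Because the encoding for fold k is learned with fold k held out, a downstream model fit on the encoded column sees values comparable to what it would see on new data, which is the property the cross frame is meant to provide.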
mkCrossFrameCExperiment(
  dframe,
  varlist,
  outcomename,
  outcometarget,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = 1,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  scale = FALSE,
  doCollar = FALSE,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  catScaling = TRUE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)
dframe: Data frame to learn treatments from (training data), must have at least 1 row.
varlist: Names of columns to treat (effective variables).
outcomename: Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values.
outcometarget: Value/level of outcome to be considered "success". There must be a cut such that dframe[[outcomename]]==outcometarget at least twice and dframe[[outcomename]]!=outcometarget at least twice.
...: no additional arguments, declared to force named binding of later arguments.
weights: optional training weights for each row.
minFraction: optional minimum frequency a categorical level must have to be converted to an indicator column.
smFactor: optional smoothing factor for impact coding models.
rareCount: optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off.
rareSig: optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 in the usage above.
collarProb: what fraction of the data (pseudo-probability) to collar data at if doCollar is set.
codeRestriction: what types of variables to produce (character array of level codes, NULL means no restriction).
customCoders: map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/master/extras/CustomLevelCoders.md).
scale: optional, if TRUE replace numeric variables with regression ("move to outcome-scale").
doCollar: optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design.
splitFunction: (optional) see vtreat::buildEvalSets.
ncross: optional scalar>=2 number of cross-validation rounds to design.
forceSplit: logical, if TRUE force cross-validated significance calculations on all variables.
catScaling: optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling.
verbose: if TRUE print progress.
parallelCluster: (optional) a cluster object created by package parallel or package snow.
use_parallel: logical, if TRUE use parallel methods.
missingness_imputation: function of signature f(values: numeric, weights: numeric), simple missing value imputer.
imputation_map: map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.
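As an illustration of the f(values, weights) shape described for missingness_imputation and imputation_map, here is a minimal sketch of a weighted-median imputer. The function name weighted_median_imputer, the choice of a weighted median, and the fallback value of 0 are assumptions for illustration, not anything vtreat prescribes. The sketch also filters missing values itself, since the text above does not say whether they are passed through to the function.

```r
# Hypothetical imputer matching the documented f(values, weights) shape.
# Returns a single replacement value: the weighted median of the
# finite, positively weighted entries.
weighted_median_imputer <- function(values, weights) {
  if (is.null(weights) || length(weights) == 0) {
    weights <- rep(1, length(values))   # assumption: unit weights if none given
  }
  ok <- is.finite(values) & is.finite(weights) & (weights > 0)
  if (!any(ok)) {
    return(0)                           # assumption: fall back to 0
  }
  v <- values[ok]
  w <- weights[ok]
  ord <- order(v)
  v <- v[ord]
  w <- w[ord]
  # smallest value whose cumulative weight reaches half the total weight
  v[which(cumsum(w) >= sum(w) / 2)[1]]
}

z <- c(1, 2, 3, 4, NA, 6, NA)
weighted_median_imputer(z, rep(1, length(z)))        # 3
weighted_median_imputer(z, c(1, 1, 1, 1, 1, 10, 1))  # 6
```

A function of this shape could be supplied as missingness_imputation = weighted_median_imputer to apply it to every numeric column, or as imputation_map = list(z = weighted_median_imputer) to apply it to column z only.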
named list containing: treatments, crossFrame, crossWeights, method, and evalSets
# categorical example
library(vtreat)
library(wrapr)  # supplies unpack[] and %.>%

set.seed(23525)

# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
#>   origName   varName  code         rsq        sig extraModelDegrees
#> 1        x    x_catP  catP 0.166956795 0.20643885                 2
#> 2        x    x_catB  catB 0.254788311 0.11858143                 2
#> 3        z         z clean 0.237601767 0.13176020                 0
#> 4        z   z_isBAD isBAD 0.296065432 0.09248399                 0
#> 5        x  x_lev_NA   lev 0.296065432 0.09248399                 0
#> 6        x x_lev_x_a   lev 0.130005705 0.26490379                 0
#> 7        x x_lev_x_b   lev 0.006067337 0.80967242                 0

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)
#>   x_catP      x_catB z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b     y
#> 1   0.50   0.0000000 1       0        0         1         0 FALSE
#> 2   0.40  -0.4054484 2       0        0         1         0 FALSE
#> 3   0.40 -10.3089860 3       0        0         1         0  TRUE
#> 4   0.20   8.8049919 4       0        0         0         1 FALSE
#> 5   0.25  -9.2104404 3       1        0         0         1  TRUE
#> 6   0.25   9.2104404 6       0        1         0         0  TRUE

# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig = NULL)

dTestCTreated %.>%
  head(.) %.>%
  print(.)
#>       x_catP     x_catB    z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b
#> 1 0.42857143 -0.9807709 10.0       0        0         1         0
#> 2 0.28571429 -0.2876737 20.0       0        0         0         1
#> 3 0.07142857  0.0000000 30.0       0        0         0         0
#> 4 0.28571429  9.6158638  3.2       1        1         0         0