Builds a designTreatmentsN treatment plan and a data frame prepared from dframe
that is "cross" in the sense that each row is treated using a treatment
plan built from a subset of dframe disjoint from the given row.
The goal is to supply a method of breaking nested model bias other than splitting
into calibration, training, and test sets.
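The "cross" idea described above can be sketched in base R. This is a minimal illustration only, not vtreat's implementation: `cross_mean_encode` is a hypothetical helper, and a simple per-level mean encoding stands in for a full treatment plan. Each row's encoded value is computed from rows in the other folds, so no row contributes to its own encoding.

```r
# Hypothetical helper (not part of vtreat): out-of-fold mean encoding.
cross_mean_encode <- function(x, y, ncross = 3) {
  n <- length(y)
  # assign each row to one of ncross folds, in random order
  fold <- sample(rep(seq_len(ncross), length.out = n))
  encoded <- numeric(n)
  for (k in seq_len(ncross)) {
    held_out <- fold == k
    # the "plan" for fold k is learned only from rows outside fold k
    grand_mean <- mean(y[!held_out])
    level_means <- vapply(split(y[!held_out], x[!held_out]), mean, numeric(1))
    # apply the plan to the held-out rows;
    # levels unseen in the other folds fall back to the grand mean
    est <- unname(level_means[as.character(x[held_out])])
    est[is.na(est)] <- grand_mean
    encoded[held_out] <- est - grand_mean
  }
  list(encoded = encoded, fold = fold)
}

set.seed(2024)
x <- c("a", "a", "a", "a", "b", "b", "c", "c")
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
res <- cross_mean_encode(x, y, ncross = 3)
print(data.frame(x = x, y = y, fold = res$fold, encoded = res$encoded))
```

A model fit on `encoded` then sees values that were never derived from the same row's outcome, which is the property the cross frame is meant to provide.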
mkCrossFrameNExperiment(
  dframe,
  varlist,
  outcomename,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = 1,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  scale = FALSE,
  doCollar = FALSE,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)
Argument | Description |
---|---|
dframe | Data frame to learn treatments from (training data), must have at least 1 row. |
varlist | Names of columns to treat (effective variables). |
outcomename | Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values and there must be a cut such that dframe[[outcomename]] is both above the cut at least twice and below the cut at least twice. |
... | no additional arguments, declared to force named binding of later arguments. |
weights | optional training weights for each row |
minFraction | optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor | optional smoothing factor for impact coding models. |
rareCount | optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig | optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 (see the usage above). |
collarProb | what fraction of the data (pseudo-probability) to collar data at if doCollar is set. |
codeRestriction | what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders | map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
scale | optional if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar | optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction | (optional) see vtreat::buildEvalSets. |
ncross | optional scalar >= 2, number of cross-validation rounds to design. |
forceSplit | logical, if TRUE force cross-validated significance calculations on all variables. |
verbose | if TRUE print progress. |
parallelCluster | (optional) a cluster object created by package parallel or package snow. |
use_parallel | logical, if TRUE use parallel methods. |
missingness_imputation | function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map | map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
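
As an illustration of the `f(values, weights)` signature listed for `missingness_imputation` and `imputation_map`, here is a minimal base R sketch. `weighted_mean_imputer` is a hypothetical name. It assumes the function should return a single numeric replacement value, and the fallback to 0 is likewise an assumption rather than documented vtreat behavior.

```r
# Hypothetical imputer with the documented signature f(values, weights).
# Assumption: it returns one numeric value to use in place of missing entries.
weighted_mean_imputer <- function(values, weights) {
  # keep only rows where both the value and its weight are observed
  ok <- !is.na(values) & !is.na(weights)
  if (!any(ok) || sum(weights[ok]) <= 0) {
    return(0)  # assumption: fall back to 0 when nothing usable is observed
  }
  sum(values[ok] * weights[ok]) / sum(weights[ok])
}

z <- c(1, 2, 3, 4, 5, NA, 7, NA)
w <- rep(1, length(z))
print(weighted_mean_imputer(z, w))
```

With unit weights this is the plain mean of the six observed values, 22/6. Such a function would presumably be supplied as `missingness_imputation = weighted_mean_imputer`, or per column through `imputation_map`.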
named list containing: treatments, crossFrame, crossWeights, method, and evalSets
# numeric example
library(vtreat)
library(wrapr)  # provides unpack[] and the %.>% pipe used below
set.seed(23525)

# we set up our raw training and application data
dTrainN <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1))

dTestN <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
  treatmentsN = treatments,
  dTrainNTreated = crossFrame
  ] <- mkCrossFrameNExperiment(
    dframe = dTrainN,
    varlist = setdiff(colnames(dTrainN), 'y'),
    outcomename = 'y',
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
#>   origName   varName  code          rsq        sig extraModelDegrees
#> 1        x    x_catP  catP 4.047085e-01 0.08994062                 2
#> 2        x    x_catN  catN 2.822908e-01 0.17539581                 2
#> 3        x    x_catD  catD 2.096931e-02 0.73225708                 2
#> 4        z         z clean 2.880952e-01 0.17018920                 0
#> 5        z   z_isBAD isBAD 3.333333e-01 0.13397460                 0
#> 6        x  x_lev_NA   lev 3.333333e-01 0.13397460                 0
#> 7        x x_lev_x_a   lev 2.500000e-01 0.20703125                 0
#> 8        x x_lev_x_b   lev 1.110223e-16 0.99999998                 0

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainNTreated %.>%
  head(.) %.>%
  print(.)
#>        x_catN    x_catD z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b x_catP y
#> 1 -0.26666667 0.5000000 1       0        0         1         0    0.6 0
#> 2 -0.50000000 0.0000000 2       0        0         1         0    0.5 0
#> 3 -0.06666667 0.5000000 3       0        0         1         0    0.6 0
#> 4 -0.50000000 0.0000000 4       0        0         1         0    0.5 1
#> 5  0.40000000 0.7071068 5       0        0         0         1    0.2 0
#> 6 -0.40000000 0.7071068 3       1        0         0         1    0.2 1

# Any future application data is prepared with
# the prepare method.
dTestNTreated <- prepare(treatmentsN, dTestN, pruneSig=NULL)

dTestNTreated %.>%
  head(.) %.>%
  print(.)
#>   x_catP x_catN    x_catD         z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b
#> 1 0.5000  -0.25 0.5000000 10.000000       0        0         1         0
#> 2 0.2500   0.00 0.7071068 20.000000       0        0         0         1
#> 3 0.0625   0.00 0.7071068 30.000000       0        0         0         0
#> 4 0.2500   0.50 0.0000000  3.666667       1        1         0         0