Builds a designTreatmentsC treatment plan and a data frame prepared from dframe that is "cross" in the sense that each row is treated using a treatment plan built from a subset of dframe disjoint from the given row.
The goal is to supply a method of breaking nested model bias other than splitting into calibration, training, and test sets.
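The "cross" idea described above can be illustrated in base R without vtreat. The following is a minimal sketch under stated assumptions, not vtreat's implementation: the toy data, the random fold assignment, and the per-level success-rate encoding are all hypothetical stand-ins chosen for illustration. Each row's encoding is learned only from rows in other folds, whereas the naive encoding is learned from all rows, including the row being encoded.

```r
# Toy data (hypothetical): one categorical input and a logical outcome.
d <- data.frame(
  x = c("a", "a", "a", "b", "b", "b", "a", "b"),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE)

ncross <- 3
set.seed(2024)
# Assign each row to exactly one fold (illustrative only, not vtreat's splitter).
fold <- sample(rep(seq_len(ncross), length.out = nrow(d)))

# Look up the per-level success rate learned on `train` for the levels in `lev`.
level_rate <- function(lev, train) {
  rates <- tapply(d$y[train], d$x[train], mean)
  est <- as.numeric(rates)[match(lev, names(rates))]
  est[is.na(est)] <- mean(d$y[train])  # level unseen in train: overall rate
  est
}

# Cross encoding: each row is encoded from rows outside its own fold.
cross_rate <- rep(NA_real_, nrow(d))
for (k in seq_len(ncross)) {
  app <- which(fold == k)     # rows treated in this round
  train <- which(fold != k)   # disjoint rows used to learn the encoding
  cross_rate[app] <- level_rate(d$x[app], train)
}

# Naive (nested) encoding: learned from all rows, including the row itself.
naive_rate <- level_rate(d$x, seq_len(nrow(d)))

print(data.frame(d, fold = fold, cross_rate = cross_rate, naive_rate = naive_rate))
```

A downstream model fit on naive_rate sees an encoding that already used each row's own outcome, which is the nested model bias the cross frame is meant to avoid.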
mkCrossFrameCExperiment(
  dframe,
  varlist,
  outcomename,
  outcometarget,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = 1,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  scale = FALSE,
  doCollar = FALSE,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  catScaling = TRUE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)
Argument | Description |
---|---|
dframe | Data frame to learn treatments from (training data), must have at least 1 row. |
varlist | Names of columns to treat (effective variables). |
outcomename | Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values. |
outcometarget | Value/level of outcome to be considered "success", and there must be a cut such that dframe[[outcomename]]==outcometarget at least twice and dframe[[outcomename]]!=outcometarget at least twice. |
... | no additional arguments, declared to force named binding of later arguments. |
weights | optional training weights for each row |
minFraction | optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor | optional smoothing factor for impact coding models. |
rareCount | optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig | optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 (see the usage above) or off. |
collarProb | what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
codeRestriction | what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders | map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
scale | optional, if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar | optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction | (optional) see vtreat::buildEvalSets . |
ncross | optional scalar >= 2, number of cross-validation rounds to design. |
forceSplit | logical, if TRUE force cross-validated significance calculations on all variables. |
catScaling | optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
verbose | if TRUE print progress. |
parallelCluster | (optional) a cluster object created by package parallel or package snow. |
use_parallel | logical, if TRUE use parallel methods. |
missingness_imputation | function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map | map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
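As an illustration of the f(values, weights) signature described for missingness_imputation and imputation_map, here is a minimal base-R sketch. The function names weighted_mean_imputer and median_imputer are hypothetical, and whether vtreat passes only the non-missing values is not stated here, so both functions defensively drop non-finite entries themselves.

```r
# Hypothetical imputer: weighted mean of the observed (finite) values.
weighted_mean_imputer <- function(values, weights) {
  if (length(weights) != length(values)) {
    weights <- rep(1, length(values))  # fall back to equal weights
  }
  ok <- is.finite(values) & is.finite(weights)
  if (!any(ok) || sum(weights[ok]) <= 0) {
    return(0)  # nothing usable observed: arbitrary fallback value
  }
  sum(values[ok] * weights[ok]) / sum(weights[ok])
}

# Hypothetical imputer: plain median of the observed values, ignoring weights.
median_imputer <- function(values, weights) {
  stats::median(values[is.finite(values)])
}

z <- c(1, 2, 3, 4, NA, 6, NA)
w <- rep(1, length(z))
print(weighted_mean_imputer(z, w))  # 3.2
print(median_imputer(z, w))         # 3

# One way the two arguments could be shaped (column name "z" is illustrative):
missingness_imputation <- weighted_mean_imputer   # used for all numeric columns
imputation_map <- list(z = median_imputer)        # per-column choice
```

These objects would then be passed by name, for example imputation_map = imputation_map, since all arguments after ... must be bound by name.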
named list containing: treatments, crossFrame, crossWeights, method, and evalSets
# categorical example
set.seed(23525)

# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
#>   origName   varName  code         rsq        sig extraModelDegrees
#> 1        x    x_catP  catP 0.166956795 0.20643885                 2
#> 2        x    x_catB  catB 0.254788311 0.11858143                 2
#> 3        z         z clean 0.237601767 0.13176020                 0
#> 4        z   z_isBAD isBAD 0.296065432 0.09248399                 0
#> 5        x  x_lev_NA   lev 0.296065432 0.09248399                 0
#> 6        x x_lev_x_a   lev 0.130005705 0.26490379                 0
#> 7        x x_lev_x_b   lev 0.006067337 0.80967242                 0

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)
#>   x_catP      x_catB z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b     y
#> 1   0.50   0.0000000 1       0        0         1         0 FALSE
#> 2   0.40  -0.4054484 2       0        0         1         0 FALSE
#> 3   0.40 -10.3089860 3       0        0         1         0  TRUE
#> 4   0.20   8.8049919 4       0        0         0         1 FALSE
#> 5   0.25  -9.2104404 3       1        0         0         1  TRUE
#> 6   0.25   9.2104404 6       0        1         0         0  TRUE

# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)

dTestCTreated %.>%
  head(.) %.>%
  print(.)
#>       x_catP     x_catB    z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b
#> 1 0.42857143 -0.9807709 10.0       0        0         1         0
#> 2 0.28571429 -0.2876737 20.0       0        0         0         1
#> 3 0.07142857  0.0000000 30.0       0        0         0         0
#> 4 0.28571429  9.6158638  3.2       1        1         0         0