Function to design variable treatments for binary prediction of a
categorical outcome. The data frame is assumed to have only atomic columns,
except for dates (which are converted to numeric). Note: re-encoding high cardinality
categorical variables can introduce undesirable nested model bias; for such data consider
using mkCrossFrameCExperiment.
designTreatmentsC(
  dframe,
  varlist,
  outcomename,
  outcometarget = TRUE,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = NULL,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  catScaling = TRUE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)
- dframe : Data frame to learn treatments from (training data); must have at least 1 row.
- varlist : Names of columns to treat (effective variables).
- outcomename : Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values.
- outcometarget : Value/level of outcome to be considered "success". The data must split so that dframe[[outcomename]]==outcometarget holds for at least two rows and dframe[[outcomename]]!=outcometarget holds for at least two rows.
- ... : no additional arguments; declared to force named binding of later arguments.
- weights : optional training weights for each row.
- minFraction : optional minimum frequency a categorical level must have to be converted to an indicator column.
- smFactor : optional smoothing factor for impact coding models.
- rareCount : optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 (off).
- rareSig : optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL (off).
- collarProb : what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare.
- codeRestriction : what types of variables to produce (character array of level codes; NULL means no restriction).
- customCoders : map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/master/extras/CustomLevelCoders.md).
- splitFunction : (optional) see vtreat::buildEvalSets.
- ncross : optional scalar >=2, number of cross validation splits to use in rescoring complex variables.
- forceSplit : logical, if TRUE force cross-validated significance calculations on all variables.
- catScaling : optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling.
- verbose : if TRUE print progress.
- parallelCluster : (optional) a cluster object created by package parallel or package snow.
- use_parallel : logical, if TRUE use parallel methods (when parallel cluster is set).
- missingness_imputation : function of signature f(values: numeric, weights: numeric), simple missing value imputer.
- imputation_map : map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.
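The f(values, weights) signature described for missingness_imputation and imputation_map can be illustrated with a small base R sketch. The function name weighted_median_imputer is hypothetical and not part of vtreat; the handling of empty weights and the fallback value of 0 are assumptions made only for this illustration, not documented vtreat behavior.

```r
# Hypothetical helper (not part of vtreat): a weighted-median imputer with the
# f(values, weights) signature described for missingness_imputation.
weighted_median_imputer <- function(values, weights) {
  # Assumption: NULL or empty weights are treated as uniform weights.
  if (is.null(weights) || length(weights) == 0) {
    weights <- rep(1, length(values))
  }
  # Drop missing values defensively, in case any are passed in.
  keep <- !is.na(values)
  values <- values[keep]
  weights <- weights[keep]
  if (length(values) == 0) {
    return(0)  # assumed fallback when no value is observed
  }
  # Sort the values and return the first one at which the cumulative
  # weight reaches half of the total weight (lower weighted median).
  ord <- order(values)
  values <- values[ord]
  weights <- weights[ord]
  cum_w <- cumsum(weights)
  values[which(cum_w >= sum(weights) / 2)[1]]
}

# Direct calls, independent of vtreat:
weighted_median_imputer(c(1, 2, 3, 100), c(1, 1, 1, 1))
weighted_median_imputer(c(1, 2, 3, 100), c(1, 1, 1, 10))
```

Under the argument descriptions above, such a function could presumably be supplied as missingness_imputation = weighted_median_imputer, or for a single column through imputation_map = list(z = weighted_median_imputer).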
treatment plan (for use with prepare)
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in the same order as the names on the other diagnostic vectors)
- varMoves : logical, TRUE if the variable varied during hold-out scoring; only variables that move will be in the treated frame
- sig : an estimate of the significance of the effect
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
Note: re-encoding high cardinality variables on training data can introduce nested model bias; consider using mkCrossFrameCExperiment instead.
# small training frame: categorical x, numeric z, logical outcome y
dTrainC <- data.frame(x=c('a','a','a','b','b','b'),
                      z=c(1,2,3,4,5,6),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
# test frame with a novel level ('c') and missing values
dTestC <- data.frame(x=c('a','b','c',NA),
                     z=c(10,20,30,NA))
# design the treatment plan, treating y == TRUE as the target
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
#>  "vtreat 1.6.1 inspecting inputs Wed Aug 12 09:50:20 2020"
#>  "designing treatments Wed Aug 12 09:50:20 2020"
#>  " have initial level statistics Wed Aug 12 09:50:20 2020"
#>  " scoring treatments Wed Aug 12 09:50:20 2020"
#>  "have treatment plan Wed Aug 12 09:50:20 2020"
# apply the plan to the test frame
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=0.99)