The data frame is assumed to have only atomic columns, except for dates (which are converted to numeric). Note: each column is processed independently of all others.
designTreatmentsZ(
  dframe,
  varlist,
  ...,
  minFraction = 0,
  weights = c(),
  rareCount = 0,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)
Argument | Description |
---|---|
dframe | Data frame to learn treatments from (training data), must have at least 1 row. |
varlist | Names of columns to treat (effective variables). |
... | no additional arguments, declared to force named binding of later arguments |
minFraction | optional minimum frequency a categorical level must have to be converted to an indicator column. |
weights | optional training weights for each row |
rareCount | optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
collarProb | what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare. |
codeRestriction | what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders | map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
verbose | if TRUE print progress. |
parallelCluster | (optional) a cluster object created by package parallel or package snow. |
use_parallel | logical, if TRUE use parallel methods (if parallel cluster is set). |
missingness_imputation | function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map | map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
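The two imputation arguments both take functions of the form f(values, weights). As a minimal sketch of what such functions might look like, the base R code below defines two imputers. The names `median_imputer` and `weighted_mean_imputer` are hypothetical, and passing `imputation_map` as a named list is an assumption based on the "map from column names to functions" wording above.

```r
# Hypothetical imputers following the documented f(values, weights) signature.
# Both drop NA values defensively, in case the caller does not pre-filter them.

# Ignores weights and returns the median of the observed values.
median_imputer <- function(values, weights) {
  stats::median(values, na.rm = TRUE)
}

# Uses the row weights when they are supplied, otherwise a plain mean.
weighted_mean_imputer <- function(values, weights) {
  keep <- !is.na(values)
  if (is.null(weights) || length(weights) == 0) {
    return(mean(values[keep]))
  }
  stats::weighted.mean(values[keep], weights[keep])
}

# Try the imputers on a numeric column with one missing value.
z <- c(1, 2, 3, 4, 5, 6, 7, NA, 9)
w <- rep(1, length(z))
median_value <- median_imputer(z, w)
mean_value <- weighted_mean_imputer(z, w)

# With vtreat installed, they could be supplied like this
# (assumption: a named list is an acceptable imputation_map):
# designTreatmentsZ(dTrainZ, colnames(dTrainZ),
#                   missingness_imputation = median_imputer,
#                   imputation_map = list(z = weighted_mean_imputer))
```

Here `missingness_imputation` would act as the default imputer, while `imputation_map` would override it for the named columns.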
treatment plan (for use with prepare)
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
dTrainZ <- data.frame(x=c('a','a','a','a','b','b',NA,'e','e'),
    z=c(1,2,3,4,5,6,7,NA,9))
dTestZ <- data.frame(x=c('a','x','c',NA),
    z=c(10,20,30,NA))
treatmentsZ = designTreatmentsZ(dTrainZ, colnames(dTrainZ),
    rareCount=0)
#> [1] "vtreat 1.6.3 inspecting inputs Fri Jun 11 07:01:19 2021"
#> [1] "designing treatments Fri Jun 11 07:01:19 2021"
#> [1] " have initial level statistics Fri Jun 11 07:01:19 2021"
#> [1] " scoring treatments Fri Jun 11 07:01:19 2021"
#> [1] "have treatment plan Fri Jun 11 07:01:19 2021"
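Since the treatment plan is intended for use with prepare, a natural next step is to apply it to both the training and the test frame. The following is a sketch under the assumption that the vtreat package is installed. It rebuilds the same data as the example above so it can run on its own, and sets `verbose = FALSE` to suppress the progress messages.

```r
# Assumes the vtreat package is installed.
library(vtreat)

# Same training and test data as in the example above.
dTrainZ <- data.frame(x = c('a','a','a','a','b','b',NA,'e','e'),
                      z = c(1,2,3,4,5,6,7,NA,9))
dTestZ <- data.frame(x = c('a','x','c',NA),
                     z = c(10,20,30,NA))

# Learn the treatment plan from the training data only.
treatmentsZ <- designTreatmentsZ(dTrainZ, colnames(dTrainZ),
                                 rareCount = 0, verbose = FALSE)

# Apply the plan to the training and test frames.
dTrainZTreated <- prepare(treatmentsZ, dTrainZ)
dTestZTreated <- prepare(treatmentsZ, dTestZ)

# Inspect the derived columns.
print(colnames(dTrainZTreated))
print(dTestZTreated)
```

Both treated frames should share the same derived columns, which lets a model fit on the treated training frame be applied to the treated test frame even though the test data contains levels ('x', 'c') never seen in training.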