`R/vtreat.R`

`designTreatmentsC.Rd`

Function to design variable treatments for binary prediction of a
categorical outcome. The data frame is assumed to have only atomic columns,
except for dates (which are converted to numeric). Note: re-encoding high cardinality
categorical variables can introduce undesirable nested model bias; for such data consider
using `mkCrossFrameCExperiment`.

designTreatmentsC(
  dframe,
  varlist,
  outcomename,
  outcometarget = TRUE,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = NULL,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  catScaling = TRUE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)

dframe | Data frame to learn treatments from (training data), must have at least 1 row. |
---|---|

varlist | Names of columns to treat (effective variables). |

outcomename | Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values. |

outcometarget | Value/level of outcome to be considered "success". dframe[[outcomename]]==outcometarget must hold for at least two rows, and dframe[[outcomename]]!=outcometarget must hold for at least two rows. |

... | no additional arguments, declared to force named binding of later arguments |

weights | optional training weights for each row |

minFraction | optional minimum frequency a categorical level must have to be converted to an indicator column. |

smFactor | optional smoothing factor for impact coding models. |

rareCount | optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |

rareSig | optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off. |

collarProb | what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |

codeRestriction | what types of variables to produce (character array of level codes, NULL means no restriction). |

customCoders | map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/master/extras/CustomLevelCoders.md). |

splitFunction | (optional) see vtreat::buildEvalSets . |

ncross | optional scalar >=2, number of cross-validation splits to use in rescoring complex variables. |

forceSplit | logical, if TRUE force cross-validated significance calculations on all variables. |

catScaling | optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |

verbose | if TRUE print progress. |

parallelCluster | (optional) a cluster object created by package parallel or package snow. |

use_parallel | logical, if TRUE use parallel methods (when parallel cluster is set). |

missingness_imputation | function of signature f(values: numeric, weights: numeric), simple missing value imputer. |

imputation_map | map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
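
The `missingness_imputation` and `imputation_map` arguments expect functions of the form f(values, weights). A minimal sketch of such a function in base R follows. The name `weighted_mean_imputer`, the handling of empty weights, and the fallback value of 0 are assumptions made for illustration; they are not taken from the vtreat sources.

```r
# Hypothetical imputer with the f(values, weights) signature described above.
# Returns the weighted mean of the finite values.
weighted_mean_imputer <- function(values, weights) {
  # Assumption: treat missing or empty weights as equal weights.
  if (is.null(weights) || length(weights) == 0) {
    weights <- rep(1, length(values))
  }
  # Keep only rows where both the value and the weight are usable.
  ok <- is.finite(values) & is.finite(weights)
  # Assumption: fall back to 0 when nothing usable remains.
  if (!any(ok) || sum(weights[ok]) <= 0) {
    return(0)
  }
  sum(values[ok] * weights[ok]) / sum(weights[ok])
}

# The NA is skipped; the last value counts twice.
imputed <- weighted_mean_imputer(c(1, 2, NA, 4), c(1, 1, 1, 2))
print(imputed)
```

Such a function could then be supplied as `missingness_imputation = weighted_mean_imputer`, or for a single column through `imputation_map`, for example `imputation_map = list(z = weighted_mean_imputer)`.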

treatment plan (for use with prepare)

The main fields are mostly vectors with names (all with the same names in the same order):

- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical, TRUE if the variable varied during hold-out scoring; only variables that move will be in the treated frame
- sig : an estimate of the significance of the effect

See the vtreat vignette for a bit more detail and a worked example.

Columns that do not vary are not passed through.

Note: re-encoding high cardinality categorical variables on training data can introduce nested model bias; consider using `mkCrossFrameCExperiment` instead.
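
As a rough sketch of the alternative mentioned in the note, the following uses `mkCrossFrameCExperiment` on synthetic data with a high cardinality categorical column. The data, the seed, and the variable names are made up for illustration; the example assumes the vtreat package is installed.

```r
library(vtreat)

# Synthetic data (illustrative only): 40 levels over 200 rows.
set.seed(2020)
n <- 200
d <- data.frame(
  x = sample(paste0("lev", 1:40), n, replace = TRUE),
  z = rnorm(n),
  stringsAsFactors = FALSE
)
d$y <- (d$z + rnorm(n)) > 0

# Design the treatments and build a cross-validated training frame in one step.
cfe <- mkCrossFrameCExperiment(d, c("x", "z"), "y", TRUE, verbose = FALSE)

treatments <- cfe$treatments    # treatment plan, for use with prepare() on new data
trainTreated <- cfe$crossFrame  # cross-validated treated copy of the training data
```

The idea is to fit the downstream model on `trainTreated` rather than on the training data run through `prepare`, and to keep `treatments` for preparing test or application data.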

dTrainC <- data.frame(x=c('a','a','a','b','b','b'),
                      z=c(1,2,3,4,5,6),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
dTestC <- data.frame(x=c('a','b','c',NA), z=c(10,20,30,NA))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
#> [1] "vtreat 1.6.1 inspecting inputs Sun Sep 6 13:50:35 2020"
#> [1] "designing treatments Sun Sep 6 13:50:35 2020"
#> [1] " have initial level statistics Sun Sep 6 13:50:35 2020"
#> [1] " scoring treatments Sun Sep 6 13:50:35 2020"
#> [1] "have treatment plan Sun Sep 6 13:50:35 2020"
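
The value section notes that the treatment plan is meant for use with prepare. A minimal sketch of that step follows, rebuilding the same small frames so the block runs on its own. It assumes the vtreat package is installed; passing only `x` and `z` as the variable list and setting `pruneSig = NULL` (no significance pruning) are choices made for this illustration.

```r
library(vtreat)

dTrainC <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                      z = c(1, 2, 3, 4, 5, 6),
                      y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))
dTestC <- data.frame(x = c('a', 'b', 'c', NA),
                     z = c(10, 20, 30, NA))

# Design the plan on the training data (progress messages turned off).
treatmentsC <- designTreatmentsC(dTrainC, c('x', 'z'), 'y', TRUE,
                                 verbose = FALSE)

# Apply the plan to new data; novel levels and NAs in dTestC are handled
# by the plan rather than passed through.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig = NULL)
print(dTestCTreated)
```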