Builds a `designTreatmentsN` treatment plan and a data frame prepared from `dframe` that is "cross" in the sense that each row is treated using a treatment plan built from a subset of `dframe` disjoint from the given row.
The goal is to supply a method of breaking nested model bias other than splitting the data into calibration, training, and test sets.
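The "cross" idea can be illustrated without vtreat. The following is a minimal base R sketch, not vtreat's actual implementation: it encodes a high-cardinality categorical variable by the mean outcome per level, once naively (each row's own outcome contributes to its code) and once in cross fashion (each row's code is computed only from rows in other folds). The fold assignment, the variable names, and the fallback to the grand mean for unseen levels are assumptions made for this illustration.

```r
set.seed(2018)
n <- 200
# y is pure noise, so zip carries no real signal about y
d <- data.frame(
  zip = sample(paste0("z", 1:50), n, replace = TRUE),
  y = runif(n),
  stringsAsFactors = FALSE
)

ncross <- 3
fold <- sample(rep(seq_len(ncross), length.out = n))

# Naive coding: per-level mean of y estimated on all rows,
# so each row's own outcome leaks into its code.
naive <- ave(d$y, d$zip)

# Cross coding: per-level means are estimated on the other folds only.
cross <- numeric(n)
for (k in seq_len(ncross)) {
  app <- fold == k
  level_means <- tapply(d$y[!app], d$zip[!app], mean)
  idx <- match(d$zip[app], names(level_means))
  coded <- as.numeric(level_means)[idx]
  # Levels not seen in the other folds fall back to the grand mean.
  coded[is.na(coded)] <- mean(d$y[!app])
  cross[app] <- coded
}

print(cor(d$y, naive))  # inflated by leakage
print(cor(d$y, cross))  # close to what new data would show
```

Because the outcome here is independent of `zip`, any sizeable correlation for the naive code is an artifact of reusing the same rows for design and application, which is the nested model bias the cross frame is meant to avoid.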

mkCrossFrameNExperiment(dframe, varlist, outcomename, ...,
  weights = c(),
  minFraction = 0.02, smFactor = 0,
  rareCount = 0, rareSig = 1,
  collarProb = 0,
  codeRestriction = NULL, customCoders = NULL,
  scale = FALSE, doCollar = FALSE,
  splitFunction = NULL, ncross = 3, forceSplit = FALSE,
  verbose = TRUE,
  parallelCluster = NULL, use_parallel = TRUE)

Argument | Description |
---|---|
dframe | Data frame to learn treatments from (training data); must have at least 1 row. |
varlist | Names of columns to treat (effective variables). |
outcomename | Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values and there must be a cut such that dframe[[outcomename]] is both above the cut at least twice and below the cut at least twice. |
... | no additional arguments; declared to force named binding of later arguments. |
weights | optional training weights for each row. |
minFraction | optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor | optional smoothing factor for impact coding models. |
rareCount | optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 (off). |
rareSig | optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1. |
collarProb | what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
codeRestriction | what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders | map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/master/extras/CustomLevelCoders.md). |
scale | optional, if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar | optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction | (optional) see vtreat::buildEvalSets. |
ncross | optional scalar >= 2, number of cross-validation rounds to design. |
forceSplit | logical, if TRUE force cross-validated significance calculations on all variables. |
verbose | if TRUE print progress. |
parallelCluster | (optional) a cluster object created by package parallel or package snow. |
use_parallel | logical, if TRUE use parallel methods. |
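The requirement on `outcomename` described above (finite, non-missing values, and a cut with at least two values above and at least two below) can be checked before calling the function. The sketch below is a hypothetical helper, not part of vtreat, and it assumes "above" and "below" are meant strictly. Under that assumption the condition reduces to the second-smallest value being strictly less than the second-largest.

```r
# Hypothetical pre-flight check (not a vtreat function).
# Assumes "above"/"below" the cut are strict comparisons.
outcome_is_usable <- function(y) {
  if (!is.numeric(y) || length(y) < 4 || !all(is.finite(y))) {
    return(FALSE)
  }
  ys <- sort(y)
  n <- length(ys)
  # A cut strictly between ys[2] and ys[n - 1] leaves at least
  # two values below it and at least two values above it.
  ys[2] < ys[n - 1]
}

print(outcome_is_usable(c(1, 2, 3, 4)))   # enough variation
print(outcome_is_usable(c(1, 1, 1, 5)))   # only one value can sit above any valid cut
print(outcome_is_usable(c(1, 2, NA, 4)))  # missing value
```

Running a check like this first gives a clearer failure message than letting the treatment design stop partway through.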

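A list that includes the treatment plan (`treatments`, for use with `prepare`) and the cross-validated prepared training frame (`crossFrame`).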

set.seed(23525)
zip <- paste('z',1:100)
N <- 200
d <- data.frame(zip=sample(zip,N,replace=TRUE),
                zip2=sample(zip,N,replace=TRUE),
                y=runif(N))
del <- runif(length(zip))
names(del) <- zip
d$y <- d$y + del[d$zip2]
d$yc <- d$y>=mean(d$y)
cN <- mkCrossFrameNExperiment(d,c('zip','zip2'),'y',
                              rareCount=2,rareSig=0.9)
#> [1] "vtreat 1.3.2 start initial treatment design Mon Oct 1 14:32:55 2018"
#> [1] " start cross frame work Mon Oct 1 14:32:55 2018"
#> [1] " vtreat::mkCrossFrameNExperiment done Mon Oct 1 14:32:56 2018"
cor(cN$crossFrame$y,cN$crossFrame$zip_catN) # poor
#> [1] 0.02114663
cor(cN$crossFrame$y,cN$crossFrame$zip2_catN) # better
#> [1] 0.2013498
treatments <- cN$treatments
dTrainV <- cN$crossFrame
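The example above stops at the training frame. A natural follow-up, sketched here under stated assumptions, is to apply the returned treatment plan to new rows with `prepare`. The helper `make_data` and the frame names `dTrain`, `dTest`, and `dTestV` are introduced only for this illustration. The call `vtreat::prepare(treatmentplan, dframe, pruneSig = NULL)` reflects my understanding of the vtreat interface and should be checked against the installed version. The vtreat calls are skipped if the package is not available.

```r
set.seed(23525)
zip <- paste('z', 1:100)
del <- runif(length(zip))
names(del) <- zip

# Illustrative data generator following the example above:
# y depends on zip2 but not on zip.
make_data <- function(n) {
  d <- data.frame(zip = sample(zip, n, replace = TRUE),
                  zip2 = sample(zip, n, replace = TRUE),
                  y = runif(n),
                  stringsAsFactors = FALSE)
  d$y <- d$y + unname(del[d$zip2])
  d
}

dTrain <- make_data(200)
dTest <- make_data(50)

if (requireNamespace("vtreat", quietly = TRUE)) {
  cN <- vtreat::mkCrossFrameNExperiment(dTrain, c('zip', 'zip2'), 'y',
                                        rareCount = 2, rareSig = 0.9,
                                        verbose = FALSE)
  # Training rows: use the cross frame, not prepare() on dTrain.
  dTrainV <- cN$crossFrame
  # New rows: apply the treatment plan with prepare().
  dTestV <- vtreat::prepare(cN$treatments, dTest, pruneSig = NULL)
  print(dim(dTrainV))
  print(dim(dTestV))
} else {
  message("vtreat is not installed; skipping the treatment step")
}
```

The design choice to note is the asymmetry: downstream models are fit on `crossFrame`, while the treatment plan is reserved for data that played no part in designing it.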