Return a carve-up of seq_len(nRows). Very useful for any sort of nested model situation (such as data prep, stacking, or super-learning).

buildEvalSets(
nRows,
...,
dframe = NULL,
y = NULL,
splitFunction = NULL,
nSplits = 3
)

## Arguments

nRows: scalar, >=1, number of rows to sample from.

...: no additional arguments; declared to force named binding of later arguments.

dframe: (optional) original data.frame, passed to the user splitFunction.

y: (optional) numeric vector, outcome variable (possibly to stratify on), passed to the user splitFunction.

splitFunction: (optional) function taking arguments nRows, nSplits, dframe, and y; returning a user-desired split.

nSplits: integer, target number of splits.

## Value

List of lists, where the app portions of the sub-lists form a disjoint carve-up of seq_len(nRows), and each sub-list has a train portion disjoint from its app portion.
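For illustration, these invariants can be checked with base R. The plan below is hand-built (not produced by buildEvalSets) just to show the documented shape:

```r
# a hand-built plan with the documented shape (hypothetical data,
# not produced by buildEvalSets)
plan <- list(
  list(train = c(3, 4), app = c(1, 2)),
  list(train = c(1, 2), app = c(3, 4))
)
# the app portions are disjoint and carve up seq_len(nRows)
apps <- unlist(lapply(plan, function(pi) pi$app))
stopifnot(sort(apps) == seq_len(4))
# each train portion is disjoint from its own app portion
stopifnot(vapply(plan,
  function(pi) length(intersect(pi$train, pi$app)) == 0,
  logical(1)))
```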

## Details

Also sets the attribute "splitmethod" on the return value, describing how the split was performed. attr(returnValue,'splitmethod') is one of:

- 'notsplit': data was not split (corner cases such as single-row data sets),
- 'oneway': leave-one-out holdout,
- 'kwaycross': a simple partition,
- 'userfunction': the user-supplied function was actually used,

or a user-specified attribute. Any user-desired properties (such as stratification on y, or preservation of groups designated by original data row numbers) may not apply unless you see that 'userfunction' was used.

The intent is that the user splitFunction only needs to handle "easy cases" and maintain user invariants. If the user splitFunction returns NULL, throws an error, or returns an unacceptable carve-up, then vtreat::buildEvalSets returns its own evaluation-set plan. The signature of splitFunction should be splitFunction(nRows, nSplits, dframe, y), where nSplits is the number of pieces we want in the carve-up, nRows is the number of rows to split, dframe is the original data frame (useful for any group-control variables), and y is a numeric vector representing the outcome (useful for outcome stratification).
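As a sketch of this contract (the function name and the even-split policy are hypothetical, not part of vtreat), a user split function that handles only the "easy case" of an even two-way split and declines otherwise:

```r
# sketch of a user splitFunction: handles only an even two-way split,
# returns NULL otherwise so buildEvalSets falls back to its own plan
evenTwoWaySplit <- function(nRows, nSplits, dframe, y) {
  if((nSplits != 2) || (nRows %% 2 != 0)) {
    return(NULL)  # decline; buildEvalSets supplies its own carve-up
  }
  perm <- sample.int(nRows, nRows)
  half <- nRows / 2
  g1 <- perm[seq_len(half)]
  g2 <- perm[half + seq_len(half)]
  list(list(train = g1, app = g2),
       list(train = g2, app = g1))
}

# would be passed as:
#   buildEvalSets(nRows, splitFunction = evenTwoWaySplit, nSplits = 2)
```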

Note that buildEvalSets may not always return a partition: for example, on one-row data frames, or when the user split function chooses to make rows eligible for application a different number of times.
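One base-R way to check whether a plan is a true partition is to count, per row, how many times it is eligible for application (the helper below is illustrative, not part of the vtreat API):

```r
# count how many times each of nRows rows appears in an app portion;
# a true partition has every count equal to 1
appCounts <- function(plan, nRows) {
  tabulate(unlist(lapply(plan, function(pi) pi$app)), nbins = nRows)
}

# e.g. a leave-one-out style plan over 3 rows is a partition
plan <- lapply(1:3, function(i) list(train = setdiff(1:3, i), app = i))
appCounts(plan, 3)  # all ones
```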

## See also

kWayCrossValidation, kWayStratifiedY, and makekWayCrossValidationGroupedByColumn

## Examples


# use
buildEvalSets(200)
#> [[1]]
#> [[1]]$train
#>   [1]   1   2   3   4   6   7   8   9  11  13  14  18  19  21  22  23  24  28
#>  [19]  29  30  32  33  35  38  40  41  43  44  47  48  49  50  51  52  53  56
#>  [37]  60  62  64  65  66  68  69  70  71  72  73  74  75  76  77  78  79  81
#>  [55]  82  84  85  86  87  88  89  91  92  97  98 100 101 102 103 106 107 108
#>  [73] 109 110 111 112 113 115 116 117 118 119 121 122 124 125 126 127 129 131
#>  [91] 132 133 134 135 136 138 139 142 145 147 148 150 151 152 153 154 155 156
#> [109] 158 159 163 166 169 172 173 174 176 178 179 181 182 183 184 185 186 189
#> [127] 190 191 192 193 195 196 197 199
#>
#> [[1]]$app
#>  [1]  26   5  17 161  55  36  34 105  25  45 162  94 128  39  12  42  15  58 200
#> [20]  37  83  31  10 146  63  57  59  54 157 104  16 143 198  90 123 140 187 194
#> [39]  95 164 180 141 170  67  99 144  80  93 165 171 160 175 167  27  20 168 120
#> [58] 188  96  46 114 130  61 177 149 137
#>
#>
#> [[2]]
#> [[2]]$train
#>   [1]   4   5   6   8   9  10  12  14  15  16  17  20  25  26  27  31  32  33
#>  [19]  34  36  37  39  42  45  46  50  51  52  54  55  57  58  59  61  62  63
#>  [37]  65  66  67  68  69  70  74  76  79  80  81  82  83  85  86  89  90  91
#>  [55]  92  93  94  95  96  97  98  99 100 101 103 104 105 106 109 110 111 113
#>  [73] 114 117 119 120 123 124 125 126 128 130 131 132 133 135 136 137 138 139
#>  [91] 140 141 143 144 146 148 149 150 151 154 155 157 158 160 161 162 163 164
#> [109] 165 167 168 170 171 175 176 177 178 180 182 183 184 186 187 188 189 190
#> [127] 191 193 194 196 197 198 200
#>
#> [[2]]$app
#>  [1] 102  60 115  13  77  11   7  73 166  53 181  41 147 145   3 159  40  44  29
#> [20] 142 108  56  24 179  49  23 116  75 152 172 156  72  71  64   2  43  87  84
#> [39] 174 112 199 153  19 122  18 134 173  47 107 129  78 121 192  21  88 195  48
#> [58] 169  22  28  30 118  35 185  38   1 127
#>
#>
#> [[3]]
#> [[3]]$train
#>   [1]   1   2   3   5   7  10  11  12  13  15  16  17  18  19  20  21  22  23
#>  [19]  24  25  26  27  28  29  30  31  34  35  36  37  38  39  40  41  42  43
#>  [37]  44  45  46  47  48  49  53  54  55  56  57  58  59  60  61  63  64  67
#>  [55]  71  72  73  75  77  78  80  83  84  87  88  90  93  94  95  96  99 102
#>  [73] 104 105 107 108 112 114 115 116 118 120 121 122 123 127 128 129 130 134
#>  [91] 137 140 141 142 143 144 145 146 147 149 152 153 156 157 159 160 161 162
#> [109] 164 165 166 167 168 169 170 171 172 173 174 175 177 179 180 181 185 187
#> [127] 188 192 194 195 198 199 200
#>
#> [[3]]$app
#>  [1] 103  14 101  97  32 189 151  51 109 163   4  82 178 191 182 139 113  85  74
#> [20]  66 132  52  89 133  79 193 135 111 197  92 158 110 176  33 131 100 150  65
#> [39] 154 126  62  50   6 196  91 155   8 119 183  86 125 138 117  68  76 106 186
#> [58] 184 124  70 190 148  81  98  69 136   9
#>
#>
#> attr(,"splitmethod")
#> [1] "kwaycross"
# longer example
# helper fns
# fit models using experiment plan to estimate out of sample behavior
fitModelAndApply <- function(trainData, applicationData) {
   model <- lm(y~x, data=trainData)
   predict(model, newdata=applicationData)
}
simulateOutOfSampleTrainEval <- function(d, fitApplyFn) {
   eSets <- buildEvalSets(nrow(d))
   evals <- lapply(eSets,
                   function(ei) { fitApplyFn(d[ei$train, ], d[ei$app, ]) })
   pred <- numeric(nrow(d))
   for(eii in seq_len(length(eSets))) {
     pred[eSets[[eii]]$app] <- evals[[eii]]
   }
   pred
}

# run the experiment
set.seed(2352356)
# example data
d <- data.frame(x=rnorm(5), y=rnorm(5),
                outOfSampleEst=NA, inSampleEst=NA)
# fit model on all data
d$inSampleEst <- fitModelAndApply(d, d)
# compute in-sample R^2 (above zero, falsely shows a
#   relation until we adjust for degrees of freedom)
1 - sum((d$y-d$inSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] 0.4193942
d$outOfSampleEst <- simulateOutOfSampleTrainEval(d, fitModelAndApply)
# compute out-of-sample R^2 (not positive,
#   evidence of no relation)
1 - sum((d$y-d$outOfSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] -3.873148