Return a carve-up of seq_len(nRows). This is useful for any nested-model situation (such as data preparation, stacking, or super-learning).

buildEvalSets(
  nRows,
  ...,
  dframe = NULL,
  y = NULL,
  splitFunction = NULL,
  nSplits = 3
)

Arguments

nRows

scalar, >= 1; number of rows to sample from.

...

no additional arguments; declared to force named binding of later arguments.

dframe

(optional) original data.frame, passed to user splitFunction.

y

(optional) numeric vector, outcome variable (possibly to stratify on), passed to user splitFunction.

splitFunction

(optional) function taking arguments nRows, nSplits, dframe, and y; returning a user desired split.

nSplits

integer, target number of splits.

Value

list of lists where the app portions of the sub-lists form a disjoint carve-up of seq_len(nRows) and each sub-list has a train portion disjoint from its app portion.
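The properties stated above can be checked directly in base R. The sketch below is illustrative only: checkEvalPlan is a hypothetical helper (not part of vtreat), and the plan is built by hand in the same shape as the return value rather than produced by buildEvalSets.

```r
# Hypothetical helper (not part of vtreat): check the properties stated above
# for a plan shaped like list(list(train = ..., app = ...), ...).
checkEvalPlan <- function(plan, nRows) {
  # how many times each row appears across all app portions
  appCounts <- tabulate(unlist(lapply(plan, function(p) p$app)), nbins = nRows)
  # TRUE if every split keeps its train and app portions disjoint
  trainAppDisjoint <- all(vapply(
    plan,
    function(p) length(intersect(p$train, p$app)) == 0,
    logical(1)))
  list(
    appIsPartition = all(appCounts == 1),
    trainAppDisjoint = trainAppDisjoint)
}

# hand-built 3-way plan over 6 rows, for illustration only
plan <- list(
  list(train = c(3, 4, 5, 6), app = c(1, 2)),
  list(train = c(1, 2, 5, 6), app = c(3, 4)),
  list(train = c(1, 2, 3, 4), app = c(5, 6)))

str(checkEvalPlan(plan, nRows = 6))
```

The same helper could presumably be applied to a plan returned by buildEvalSets(nRows); a FALSE entry would point at one of the corner cases described under Details.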

Details

The return value also carries the attribute "splitmethod", which describes how the split was performed. attr(returnValue,'splitmethod') is one of: 'notsplit' (data was not split; corner cases such as single-row data sets), 'oneway' (leave-one-out holdout), 'kwaycross' (a simple partition), 'userfunction' (the user-supplied function was actually used), or a user-specified attribute. Any user-desired properties (such as stratification on y, or preservation of groups designated by original data row numbers) may not hold unless you see that 'userfunction' has been used.
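Because user invariants are only guaranteed when the user function was actually used, it can help to inspect the attribute before relying on them. The following is a minimal sketch: isBuiltinSplit and builtinMethods are hypothetical names (not part of vtreat), and the plans are empty stand-ins with the attribute set by hand. It is only a heuristic, since a user function may label its plan with the same string as a built-in method.

```r
# Hypothetical helper (not part of vtreat): report whether a plan's
# "splitmethod" attribute names one of the built-in methods listed above.
builtinMethods <- c("notsplit", "oneway", "kwaycross")

isBuiltinSplit <- function(plan) {
  method <- attr(plan, "splitmethod")
  !is.null(method) && method %in% builtinMethods
}

# stand-in plans with the attribute set by hand, for illustration only
fallbackPlan <- structure(list(), splitmethod = "kwaycross")
userPlan <- structure(list(), splitmethod = "userfunction")

isBuiltinSplit(fallbackPlan)
isBuiltinSplit(userPlan)
```

With a real plan, the same check would be applied to the value returned by buildEvalSets.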

The intent is that the user splitFunction only needs to handle "easy cases" and maintain user invariants. If the user splitFunction returns NULL, throws an error, or returns an unacceptable carve-up, then vtreat::buildEvalSets returns its own eval set plan. The signature of splitFunction should be splitFunction(nRows,nSplits,dframe,y), where nSplits is the number of pieces we want in the carve-up, nRows is the number of rows to split, dframe is the original data.frame (useful for any group control variables), and y is a numeric vector representing outcome (useful for outcome stratification).
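As an illustration of the signature above, here is a sketch of a user split function that keeps grouped rows together. The function name groupedSplit and the column name group are assumptions made for this example, not part of vtreat. It returns NULL for cases it does not want to handle, relying on the fallback behavior described above.

```r
# Hypothetical user split function: keeps all rows sharing dframe$group
# (an assumed column name) in the same app portion.
groupedSplit <- function(nRows, nSplits, dframe, y) {
  if (is.null(dframe) || is.null(dframe$group)) {
    return(NULL)  # decline hard cases; the caller is expected to fall back
  }
  groups <- unique(dframe$group)
  if (length(groups) < nSplits) {
    return(NULL)  # too few groups to fill every fold
  }
  # deal the shuffled groups round-robin into nSplits folds
  foldOfGroup <- rep_len(seq_len(nSplits), length(groups))
  names(foldOfGroup) <- as.character(groups[sample.int(length(groups))])
  foldOfRow <- unname(foldOfGroup[as.character(dframe$group)])
  lapply(seq_len(nSplits), function(k) {
    list(train = which(foldOfRow != k), app = which(foldOfRow == k))
  })
}

# small example: 12 rows in 6 groups of 2, split 3 ways
set.seed(2024)
d <- data.frame(x = rnorm(12), group = rep(letters[1:6], each = 2))
plan <- groupedSplit(nrow(d), 3, d, NULL)
str(plan)
```

Following the usage shown at the top of this page, such a function would be supplied as buildEvalSets(nrow(d), dframe = d, splitFunction = groupedSplit). Checking the "splitmethod" attribute of the result then shows whether the user plan was accepted.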

Note that buildEvalSets may not always return a partition, for example with one-row data frames or when the user split function chooses to make rows eligible for application a different number of times.

Examples

# use
buildEvalSets(200)
#> [[1]]
#> [[1]]$train
#> [1] 1 2 3 4 6 7 8 9 11 13 14 18 19 21 22 23 24 28
#> [19] 29 30 32 33 35 38 40 41 43 44 47 48 49 50 51 52 53 56
#> [37] 60 62 64 65 66 68 69 70 71 72 73 74 75 76 77 78 79 81
#> [55] 82 84 85 86 87 88 89 91 92 97 98 100 101 102 103 106 107 108
#> [73] 109 110 111 112 113 115 116 117 118 119 121 122 124 125 126 127 129 131
#> [91] 132 133 134 135 136 138 139 142 145 147 148 150 151 152 153 154 155 156
#> [109] 158 159 163 166 169 172 173 174 176 178 179 181 182 183 184 185 186 189
#> [127] 190 191 192 193 195 196 197 199
#>
#> [[1]]$app
#> [1] 26 5 17 161 55 36 34 105 25 45 162 94 128 39 12 42 15 58 200
#> [20] 37 83 31 10 146 63 57 59 54 157 104 16 143 198 90 123 140 187 194
#> [39] 95 164 180 141 170 67 99 144 80 93 165 171 160 175 167 27 20 168 120
#> [58] 188 96 46 114 130 61 177 149 137
#>
#>
#> [[2]]
#> [[2]]$train
#> [1] 4 5 6 8 9 10 12 14 15 16 17 20 25 26 27 31 32 33
#> [19] 34 36 37 39 42 45 46 50 51 52 54 55 57 58 59 61 62 63
#> [37] 65 66 67 68 69 70 74 76 79 80 81 82 83 85 86 89 90 91
#> [55] 92 93 94 95 96 97 98 99 100 101 103 104 105 106 109 110 111 113
#> [73] 114 117 119 120 123 124 125 126 128 130 131 132 133 135 136 137 138 139
#> [91] 140 141 143 144 146 148 149 150 151 154 155 157 158 160 161 162 163 164
#> [109] 165 167 168 170 171 175 176 177 178 180 182 183 184 186 187 188 189 190
#> [127] 191 193 194 196 197 198 200
#>
#> [[2]]$app
#> [1] 102 60 115 13 77 11 7 73 166 53 181 41 147 145 3 159 40 44 29
#> [20] 142 108 56 24 179 49 23 116 75 152 172 156 72 71 64 2 43 87 84
#> [39] 174 112 199 153 19 122 18 134 173 47 107 129 78 121 192 21 88 195 48
#> [58] 169 22 28 30 118 35 185 38 1 127
#>
#>
#> [[3]]
#> [[3]]$train
#> [1] 1 2 3 5 7 10 11 12 13 15 16 17 18 19 20 21 22 23
#> [19] 24 25 26 27 28 29 30 31 34 35 36 37 38 39 40 41 42 43
#> [37] 44 45 46 47 48 49 53 54 55 56 57 58 59 60 61 63 64 67
#> [55] 71 72 73 75 77 78 80 83 84 87 88 90 93 94 95 96 99 102
#> [73] 104 105 107 108 112 114 115 116 118 120 121 122 123 127 128 129 130 134
#> [91] 137 140 141 142 143 144 145 146 147 149 152 153 156 157 159 160 161 162
#> [109] 164 165 166 167 168 169 170 171 172 173 174 175 177 179 180 181 185 187
#> [127] 188 192 194 195 198 199 200
#>
#> [[3]]$app
#> [1] 103 14 101 97 32 189 151 51 109 163 4 82 178 191 182 139 113 85 74
#> [20] 66 132 52 89 133 79 193 135 111 197 92 158 110 176 33 131 100 150 65
#> [39] 154 126 62 50 6 196 91 155 8 119 183 86 125 138 117 68 76 106 186
#> [58] 184 124 70 190 148 81 98 69 136 9
#>
#>
#> attr(,"splitmethod")
#> [1] "kwaycross"
# longer example

# helper fns
# fit models using experiment plan to estimate out of sample behavior
fitModelAndApply <- function(trainData,applicationData) {
  model <- lm(y~x,data=trainData)
  predict(model,newdata=applicationData)
}
simulateOutOfSampleTrainEval <- function(d,fitApplyFn) {
  eSets <- buildEvalSets(nrow(d))
  evals <- lapply(eSets,
                  function(ei) { fitApplyFn(d[ei$train,],d[ei$app,]) })
  pred <- numeric(nrow(d))
  for(eii in seq_along(eSets)) {
    pred[eSets[[eii]]$app] <- evals[[eii]]
  }
  pred
}

# run the experiment
set.seed(2352356)
# example data
d <- data.frame(x=rnorm(5),y=rnorm(5),
                outOfSampleEst=NA,inSampleEst=NA)

# fit model on all data
d$inSampleEst <- fitModelAndApply(d,d)
# compute in-sample R^2 (above zero, falsely shows a
# relation until we adjust for degrees of freedom)
1-sum((d$y-d$inSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] 0.4193942
d$outOfSampleEst <- simulateOutOfSampleTrainEval(d,fitModelAndApply)
# compute out-sample R^2 (not positive,
# evidence of no relation)
1-sum((d$y-d$outOfSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] -3.873148