Return a carve-up of seq_len(nRows). Very useful for any sort of nested model situation (such as data prep, stacking, or super-learning).

buildEvalSets(
  nRows,
  ...,
  dframe = NULL,
  y = NULL,
  splitFunction = NULL,
  nSplits = 3
)

Arguments

nRows

scalar, >= 1; number of rows to sample from.

...

no additional arguments; declared to force named binding of later arguments.

dframe

(optional) original data.frame, passed to user splitFunction.

y

(optional) numeric vector, outcome variable (possibly to stratify on), passed to user splitFunction.

splitFunction

(optional) function taking arguments nRows, nSplits, dframe, and y; returning a user desired split.

nSplits

integer, target number of splits.

Value

list of lists where the app portions of the sub-lists form a disjoint carve-up of seq_len(nRows) and each sub-list has a train portion disjoint from its app portion.
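The properties described above can be checked directly. The following is a minimal sketch, assuming the vtreat package is installed and that no splitFunction is supplied (so the default plan is used); the variable names plan and appRows are illustrative only.

```r
library(vtreat)

set.seed(2024)
plan <- buildEvalSets(200)

# Gather the app indices from every sub-list of the plan.
appRows <- unlist(lapply(plan, function(fold) fold$app))

# Together the app portions should cover seq_len(200) with no repeats.
stopifnot(length(appRows) == 200)
stopifnot(all(sort(appRows) == seq_len(200)))

# Within each sub-list, train and app should share no rows.
for (fold in plan) {
  stopifnot(length(intersect(fold$train, fold$app)) == 0)
}
```

With a user supplied splitFunction the first two checks may not hold, since a user plan is allowed to make rows eligible for application a different number of times.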

Details

Also sets attribute "splitmethod" on return value that describes how the split was performed. attr(returnValue,'splitmethod') is one of: 'notsplit' (data was not split; corner cases like single row data sets), 'oneway' (leave one out holdout), 'kwaycross' (a simple partition), 'userfunction' (user supplied function was actually used), or a user specified attribute. Any user desired properties (such as stratification on y, or preservation of groups designated by original data row numbers) may not apply unless you see that 'userfunction' has been used.
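As an illustration of why the attribute is worth inspecting, here is a small sketch, assuming vtreat is installed. The function decliningSplit is a hypothetical user split function that always returns NULL, so the documented fallback behavior applies and the reported method should be something other than 'userfunction'.

```r
library(vtreat)

# Hypothetical user split function that always declines to produce a plan.
decliningSplit <- function(nRows, nSplits, dframe, y) {
  NULL
}

set.seed(2024)
plan <- buildEvalSets(200, splitFunction = decliningSplit)

# Read back how the split was actually performed.
method <- attr(plan, "splitmethod")
print(method)

# Only rely on user invariants (stratification, grouping) when the
# user function was actually used.
if (!identical(method, "userfunction")) {
  message("user split function was not used; plan built by: ", method)
}
```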

The intent is the user splitFunction only needs to handle "easy cases" and maintain user invariants. If the user splitFunction returns NULL, throws, or returns an unacceptable carve-up then vtreat::buildEvalSets returns its own eval set plan. The signature of splitFunction should be splitFunction(nRows,nSplits,dframe,y) where nSplits is the number of pieces we want in the carve-up, nRows is the number of rows to split, dframe is the original dataframe (useful for any group control variables), and y is a numeric vector representing outcome (useful for outcome stratification).
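A user split function following this signature might look like the sketch below. It assumes vtreat is installed; groupedSplit and the column name group are hypothetical and not part of vtreat. The function keeps all rows sharing a group value in the same fold, and returns NULL for cases it does not want to handle so that buildEvalSets can fall back to its own plan.

```r
library(vtreat)

# Hypothetical split function: rows with the same dframe$group value
# always land in the same fold.
groupedSplit <- function(nRows, nSplits, dframe, y) {
  if (is.null(dframe) || is.null(dframe$group)) {
    return(NULL)  # decline; buildEvalSets builds its own plan
  }
  groups <- unique(dframe$group)
  if (length(groups) < nSplits) {
    return(NULL)  # too few groups to fill every fold
  }
  # Randomly assign each group to one of nSplits folds.
  foldOfGroup <- sample(rep_len(seq_len(nSplits), length(groups)))
  foldOfRow <- foldOfGroup[match(dframe$group, groups)]
  lapply(seq_len(nSplits), function(k) {
    list(train = which(foldOfRow != k), app = which(foldOfRow == k))
  })
}

set.seed(25)
d <- data.frame(group = rep(1:20, each = 5), x = rnorm(100))
plan <- buildEvalSets(nrow(d), dframe = d,
                      splitFunction = groupedSplit, nSplits = 4)

# Check whether the user function was actually used before relying on grouping.
attr(plan, "splitmethod")
```

If the reported method is not 'userfunction', the grouping invariant should not be assumed to hold for the returned plan.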

Note that buildEvalSets may not always return a partition. This happens in corner cases such as one-row data frames, or when the user split function chooses to make rows eligible for application a different number of times.

Examples

# use
buildEvalSets(200)
#> [[1]]
#> [[1]]$train
#> [1] 1 3 4 5 6 7 8 9 10 11 15 17 18 21 23 24 25 26
#> [19] 27 28 29 30 31 32 33 34 39 41 42 43 45 46 48 49 50 51
#> [37] 52 54 55 56 61 62 64 65 66 67 68 69 70 73 74 75 76 77
#> [55] 80 81 82 84 85 86 87 88 90 91 92 94 95 98 100 101 103 104
#> [73] 108 109 110 111 112 113 114 115 116 117 119 120 122 125 126 129 130 131
#> [91] 132 133 139 140 141 142 143 144 146 147 149 150 152 154 155 158 159 160
#> [109] 162 164 165 166 167 168 170 173 174 175 176 180 181 182 183 184 185 186
#> [127] 187 188 189 190 191 193 198 200
#>
#> [[1]]$app
#> [1] 40 44 58 102 37 83 145 89 177 79 179 135 12 72 71 124 57 16 107
#> [20] 60 138 118 63 199 19 78 169 197 153 14 196 178 53 127 148 151 97 161
#> [39] 96 105 195 20 22 134 156 192 171 106 93 59 121 47 38 123 128 35 136
#> [58] 157 172 137 2 194 13 36 99 163
#>
#>
#> [[2]]
#> [[2]]$train
#> [1] 2 5 9 11 12 13 14 15 16 18 19 20 22 24 27 33 34 35
#> [19] 36 37 38 40 41 42 44 46 47 49 51 52 53 55 57 58 59 60
#> [37] 62 63 64 65 66 71 72 73 74 75 76 77 78 79 83 85 89 90
#> [55] 93 94 95 96 97 99 100 101 102 103 104 105 106 107 109 114 115 116
#> [73] 118 120 121 123 124 125 127 128 129 130 131 132 133 134 135 136 137 138
#> [91] 141 143 144 145 148 150 151 152 153 155 156 157 158 159 160 161 162 163
#> [109] 165 166 169 171 172 174 175 177 178 179 180 181 185 186 187 188 192 193
#> [127] 194 195 196 197 198 199 200
#>
#> [[2]]$app
#> [1] 139 113 182 29 142 108 56 31 10 191 23 54 92 3 140 26 117 43 87
#> [20] 32 25 149 111 184 50 84 98 6 88 86 67 164 39 112 147 173 146 168
#> [39] 17 48 69 170 4 21 28 7 119 8 122 176 82 61 68 110 183 91 167
#> [58] 154 189 80 30 126 190 70 1 45 81
#>
#>
#> [[3]]
#> [[3]]$train
#> [1] 1 2 3 4 6 7 8 10 12 13 14 16 17 19 20 21 22 23
#> [19] 25 26 28 29 30 31 32 35 36 37 38 39 40 43 44 45 47 48
#> [37] 50 53 54 56 57 58 59 60 61 63 67 68 69 70 71 72 78 79
#> [55] 80 81 82 83 84 86 87 88 89 91 92 93 96 97 98 99 102 105
#> [73] 106 107 108 110 111 112 113 117 118 119 121 122 123 124 126 127 128 134
#> [91] 135 136 137 138 139 140 142 145 146 147 148 149 151 153 154 156 157 161
#> [109] 163 164 167 168 169 170 171 172 173 176 177 178 179 182 183 184 189 190
#> [127] 191 192 194 195 196 197 199
#>
#> [[3]]$app
#> [1] 42 15 85 74 66 132 52 24 73 49 101 198 104 144 187 64 130 33 143
#> [20] 65 125 116 141 34 186 94 152 51 166 159 18 180 90 155 165 158 77 100
#> [39] 131 109 162 181 62 46 114 188 5 11 193 55 120 185 133 95 200 150 76
#> [58] 175 9 160 75 115 27 129 41 174 103
#>
#>
#> attr(,"splitmethod")
#> [1] "kwaycross"

# longer example

# helper fns
# fit models using experiment plan to estimate out of sample behavior
fitModelAndApply <- function(trainData,applicationData) {
  model <- lm(y~x,data=trainData)
  predict(model,newdata=applicationData)
}
simulateOutOfSampleTrainEval <- function(d,fitApplyFn) {
  eSets <- buildEvalSets(nrow(d))
  evals <- lapply(eSets,
    function(ei) { fitApplyFn(d[ei$train,],d[ei$app,]) })
  pred <- numeric(nrow(d))
  for(eii in seq_len(length(eSets))) {
    pred[eSets[[eii]]$app] <- evals[[eii]]
  }
  pred
}

# run the experiment
set.seed(2352356)
# example data
d <- data.frame(x=rnorm(5),y=rnorm(5),
                outOfSampleEst=NA,inSampleEst=NA)

# fit model on all data
d$inSampleEst <- fitModelAndApply(d,d)
# compute in-sample R^2 (above zero, falsely shows a
# relation until we adjust for degrees of freedom)
1-sum((d$y-d$inSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] 0.4193942

d$outOfSampleEst <- simulateOutOfSampleTrainEval(d,fitModelAndApply)
# compute out-sample R^2 (not positive,
# evidence of no relation)
1-sum((d$y-d$outOfSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] -3.873148