Build set carve-up for out-of-sample evaluation.

Return a carve-up of seq_len(nRows). Very useful for any sort of nested model situation (such as data prep, stacking, or super-learning).

buildEvalSets(
  nRows,
  ...,
  dframe = NULL,
  y = NULL,
  splitFunction = NULL,
  nSplits = 3
)

Arguments

nRows	scalar, >=1, number of rows to sample from.
...	no additional arguments, declared to force named binding of later arguments.
dframe	(optional) original data.frame, passed to user splitFunction.
y	(optional) numeric vector, outcome variable (possibly to stratify on), passed to user splitFunction.
splitFunction	(optional) function taking arguments nRows, nSplits, dframe, and y; returning a user-desired split.
nSplits	integer, target number of splits.

Value

list of lists where the app portions of the sub-lists form a disjoint carve-up of seq_len(nRows) and each sub-list has a train portion disjoint from its app portion.
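
The structure described above can be checked mechanically. The following is a minimal sketch, not part of vtreat: checkPlan and toyPlan are hypothetical names introduced here for illustration, and the call to vtreat::buildEvalSets runs only if the package is installed.

```r
# Hypothetical helper (not part of vtreat): check that a plan is a list of
# train/app pairs whose app portions partition seq_len(nRows) and whose
# train portion never overlaps its own app portion.
checkPlan <- function(plan, nRows) {
  apps <- unlist(lapply(plan, function(p) p$app))
  trainAppDisjoint <- all(vapply(
    plan,
    function(p) length(intersect(p$train, p$app)) == 0,
    logical(1)
  ))
  list(
    appIsPartition = identical(sort(as.integer(apps)), seq_len(nRows)),
    trainAppDisjoint = trainAppDisjoint
  )
}

# Hand-built plan in the same shape as the return value, for illustration only.
toyPlan <- list(
  list(train = c(3L, 4L), app = c(1L, 2L)),
  list(train = c(1L, 2L), app = c(3L, 4L))
)
str(checkPlan(toyPlan, 4))

# Apply the same check to a real plan when vtreat is available.
if (requireNamespace("vtreat", quietly = TRUE)) {
  str(checkPlan(vtreat::buildEvalSets(200), 200))
}
```

Corner cases such as single-row data are not expected to pass both checks, since the data cannot be split.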

Details

The return value also carries the attribute "splitmethod", which describes how the split was performed. attr(returnValue,'splitmethod') is one of: 'notsplit' (data was not split; corner cases like single-row data sets), 'oneway' (leave-one-out holdout), 'kwaycross' (a simple partition), 'userfunction' (user-supplied function was actually used), or a user-specified attribute. Any user-desired properties (such as stratification on y, or preservation of groups designated by original data row numbers) may not apply unless you see that 'userfunction' has been used.

The intent is the user splitFunction only needs to handle "easy cases" and maintain user invariants. If the user splitFunction returns NULL, throws, or returns an unacceptable carve-up then vtreat::buildEvalSets returns its own eval set plan. The signature of splitFunction should be splitFunction(nRows,nSplits,dframe,y) where nSplits is the number of pieces we want in the carve-up, nRows is the number of rows to split, dframe is the original dataframe (useful for any group control variables), and y is a numeric vector representing outcome (useful for outcome stratification).
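
As an illustration of the signature described above, here is a minimal sketch of a user splitFunction that keeps all rows of a group in the same piece. The function name groupedSplit and the column name "group" are assumptions made for this example, not part of vtreat. The function returns NULL for cases it does not want to handle, leaving buildEvalSets to fall back to its own plan.

```r
# Illustrative user splitFunction (hypothetical, not supplied by vtreat).
# Assumes dframe has a column named "group" marking rows that must stay together.
groupedSplit <- function(nRows, nSplits, dframe, y) {
  if (is.null(dframe) || is.null(dframe$group) || nSplits < 2) {
    return(NULL)  # decline the hard cases
  }
  groups <- unique(dframe$group)
  if (length(groups) < nSplits) {
    return(NULL)  # not enough groups to fill every piece
  }
  # Assign each group to one of nSplits pieces, in random order.
  fold <- sample(rep_len(seq_len(nSplits), length(groups)))
  rowFold <- fold[match(dframe$group, groups)]
  lapply(seq_len(nSplits), function(k) {
    list(train = which(rowFold != k), app = which(rowFold == k))
  })
}

set.seed(1)
d <- data.frame(group = rep(letters[1:6], each = 5), y = rnorm(30))
plan <- groupedSplit(nrow(d), 3, d, d$y)

# Hand the function to buildEvalSets when vtreat is available, then confirm
# from the attribute whether the user plan was actually used.
if (requireNamespace("vtreat", quietly = TRUE)) {
  plan2 <- vtreat::buildEvalSets(nrow(d), dframe = d, y = d$y,
                                 splitFunction = groupedSplit, nSplits = 3)
  print(attr(plan2, "splitmethod"))
}
```

Checking the "splitmethod" attribute after the call matters here: the grouping guarantee holds only if it reports 'userfunction'.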

Note that buildEvalSets may not always return a partition: for example, with one-row data frames, or when the user split function makes some rows eligible for application a different number of times than others.

Examples


# use
buildEvalSets(200)
#> [[1]]
#> [[1]]$train
#>   [1]   1   3   4   5   6   7   8   9  10  11  15  17  18  21  23  24  25  26
#>  [19]  27  28  29  30  31  32  33  34  39  41  42  43  45  46  48  49  50  51
#>  [37]  52  54  55  56  61  62  64  65  66  67  68  69  70  73  74  75  76  77
#>  [55]  80  81  82  84  85  86  87  88  90  91  92  94  95  98 100 101 103 104
#>  [73] 108 109 110 111 112 113 114 115 116 117 119 120 122 125 126 129 130 131
#>  [91] 132 133 139 140 141 142 143 144 146 147 149 150 152 154 155 158 159 160
#> [109] 162 164 165 166 167 168 170 173 174 175 176 180 181 182 183 184 185 186
#> [127] 187 188 189 190 191 193 198 200
#> 
#> [[1]]$app
#>  [1]  40  44  58 102  37  83 145  89 177  79 179 135  12  72  71 124  57  16 107
#> [20]  60 138 118  63 199  19  78 169 197 153  14 196 178  53 127 148 151  97 161
#> [39]  96 105 195  20  22 134 156 192 171 106  93  59 121  47  38 123 128  35 136
#> [58] 157 172 137   2 194  13  36  99 163
#> 
#> 
#> [[2]]
#> [[2]]$train
#>   [1]   2   5   9  11  12  13  14  15  16  18  19  20  22  24  27  33  34  35
#>  [19]  36  37  38  40  41  42  44  46  47  49  51  52  53  55  57  58  59  60
#>  [37]  62  63  64  65  66  71  72  73  74  75  76  77  78  79  83  85  89  90
#>  [55]  93  94  95  96  97  99 100 101 102 103 104 105 106 107 109 114 115 116
#>  [73] 118 120 121 123 124 125 127 128 129 130 131 132 133 134 135 136 137 138
#>  [91] 141 143 144 145 148 150 151 152 153 155 156 157 158 159 160 161 162 163
#> [109] 165 166 169 171 172 174 175 177 178 179 180 181 185 186 187 188 192 193
#> [127] 194 195 196 197 198 199 200
#> 
#> [[2]]$app
#>  [1] 139 113 182  29 142 108  56  31  10 191  23  54  92   3 140  26 117  43  87
#> [20]  32  25 149 111 184  50  84  98   6  88  86  67 164  39 112 147 173 146 168
#> [39]  17  48  69 170   4  21  28   7 119   8 122 176  82  61  68 110 183  91 167
#> [58] 154 189  80  30 126 190  70   1  45  81
#> 
#> 
#> [[3]]
#> [[3]]$train
#>   [1]   1   2   3   4   6   7   8  10  12  13  14  16  17  19  20  21  22  23
#>  [19]  25  26  28  29  30  31  32  35  36  37  38  39  40  43  44  45  47  48
#>  [37]  50  53  54  56  57  58  59  60  61  63  67  68  69  70  71  72  78  79
#>  [55]  80  81  82  83  84  86  87  88  89  91  92  93  96  97  98  99 102 105
#>  [73] 106 107 108 110 111 112 113 117 118 119 121 122 123 124 126 127 128 134
#>  [91] 135 136 137 138 139 140 142 145 146 147 148 149 151 153 154 156 157 161
#> [109] 163 164 167 168 169 170 171 172 173 176 177 178 179 182 183 184 189 190
#> [127] 191 192 194 195 196 197 199
#> 
#> [[3]]$app
#>  [1]  42  15  85  74  66 132  52  24  73  49 101 198 104 144 187  64 130  33 143
#> [20]  65 125 116 141  34 186  94 152  51 166 159  18 180  90 155 165 158  77 100
#> [39] 131 109 162 181  62  46 114 188   5  11 193  55 120 185 133  95 200 150  76
#> [58] 175   9 160  75 115  27 129  41 174 103
#> 
#> 
#> attr(,"splitmethod")
#> [1] "kwaycross"

# longer example
# helper fns
# fit models using experiment plan to estimate out of sample behavior
fitModelAndApply <- function(trainData,applicationData) {
   model <- lm(y~x,data=trainData)
   predict(model,newdata=applicationData)
}
simulateOutOfSampleTrainEval <- function(d,fitApplyFn) {
   eSets <- buildEvalSets(nrow(d))
   evals <- lapply(eSets, 
      function(ei) { fitApplyFn(d[ei$train,],d[ei$app,]) })
   pred <- numeric(nrow(d))
   for(eii in seq_along(eSets)) {
     pred[eSets[[eii]]$app] <- evals[[eii]]
   }
   pred
}

# run the experiment
set.seed(2352356)
# example data
d <- data.frame(x=rnorm(5),y=rnorm(5),
        outOfSampleEst=NA,inSampleEst=NA)
        
# fit model on all data
d$inSampleEst <- fitModelAndApply(d,d)
# compute in-sample R^2 (above zero, falsely shows a 
#   relation until we adjust for degrees of freedom)
1-sum((d$y-d$inSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] 0.4193942

d$outOfSampleEst <- simulateOutOfSampleTrainEval(d,fitModelAndApply)
# compute out-of-sample R^2 (not positive, 
#  evidence of no relation)
1-sum((d$y-d$outOfSampleEst)^2)/sum((d$y-mean(d$y))^2)
#> [1] -3.873148
