Implementation of section 6.4.2 of Abraham Wald, Sequential Analysis, Dover 2004.

We will phrase this as purchasing web-traffic from two sources in parallel. Each source (or process) either converts (a success) or does not (a failure). Source 1 converts with (unknown) probability p1 and source 2 converts with (unknown) probability p2. We want to buy traffic from both sources for a while and then with high probability pick the source with the higher conversion rate.

In the sequential analysis all traffic is paired. So we observe tuples of the form (c1,c2) where ci=0 if the traffic did not convert and 1 if it did (i being the traffic source). All tuples of the form (0,0) and (1,1) are discarded as being uninformative about the relative values of p1 versus p2. So the experimental summary is how many (0,1) and (1,0) pairs we have observed. Note this is not a sufficient statistic (such as recording the number of items taken from each source and the number of successes from each source), but instead roughly a pivotal quantity that depends only on the relative sizes of p1 and p2 (and not their absolute rates).

The recording procedure is then:

Our decision surface is a pair of parallel lines:

picture of parallel lines

picture of parallel lines

When our tracking point (t,y) crosses out of the region between the lines our experiment is complete. Crossing over the top line means we have evidence the second process likely has a higher conversion rate (and should go with it), crossing under the bottom line means the second process likely has a lower conversion rate (and we should reject it).

To run the process we traditionally need 4 user supplied parameters. We will demonstrate a simplified procedure that uses two parameters.

The inspection procedure is such that if |log( (p1/(1-p1)) / (p2/(1-p2)) )| > log(u) and we exit the inspection region (decide on one of the streams) then with probability at least 1-a we will have chosen the larger rate process.

The idea is: if p1/(1-p1) is near p2/(1-p2) then the difference is both very hard to measure, and not at all important to the business. So we design our inspection plan to not attempt to make such a determination, and instead ask the user for a value u that tells us where we stop caring.

Here is a typical inspection plan:

# User chosen parameters. 
# We are sticking to a special simple symmetric case of Wald's system.
# All we are saying is we value relative error and probability of mistake symmetrically.
relDiff <- 0.1  # measure to relative rate of 10%
errorProb <- 0.05 # accept 5% chance of making wrong decision

# plug into Wald parameters
u1 <- (1+relDiff)
u0 <- 1
a <- errorProb
b <- errorProb

# Equations 6:7 and 6:8 of Section 6.4.2, page 111, Abraham Wald, Sequential Analysis, Dover 2004
# Sequential inspection plan
s <- log((1+u1)/(1+u0))/(log(u1)-log(u0))
h0 <- log(b/(1-a))/(log(u1)-log(u0))
h1 <- log((1-b)/a)/(log(u1)-log(u0))

library('ggplot2')
ggplot(data=data.frame(t=c(0,300),y=c(0,150)),
       mapping=aes(x=t,y=y)) +
  geom_point(size=0) +
  geom_abline(slope=s,intercept=h0,
              color='blue',size=3,alpha=0.5) +
  geom_abline(slope=s,intercept=h1,
              color='green',size=3,alpha=0.5) +
  coord_fixed() +
  ggtitle("inspection plan")

Here is a simulation of executing such a plan:

set.seed(3525)

# Unknown rates
p1 <- 0.05
p2 <- 0.055

simulate <- function(s,h0,h1,p1,p2) {
  # statistics we are tracking
  runLength <- 0
  t <- 0
  y <- 0
  
  d <- data.frame(t=rep(NA,1000),y=NA)
  inBounds <- TRUE
  while(inBounds) {
    c1 <- runif(1)<=p1
    c2 <- runif(1)<=p2
    if(c1!=c2) {
      t <- t+1
      y <- y+c2
      if(t<=nrow(d)) {
        d$t[[t]] <- t
        d$y[[t]] <- y
      }
      inBounds <- (y<=h1+s*t)&&(y>=h0+s*t)
    }
    runLength <- runLength+1
  }
  d <- d[!is.na(d$t),]
  decision <- ifelse(y/t>s,2,1)
  list(d=d,
       summary=data.frame(t=t,
                          y=y,
                          runLength=runLength,
                          decision=decision))
}

sim <- simulate(s,h0,h1,p1,p2)
ggplot(data=sim$d,mapping=aes(x=t,y=y)) +
  geom_point() + geom_line() +
  geom_abline(slope=s,intercept=h0,
              color='blue',size=3,alpha=0.5) +
  geom_abline(slope=s,intercept=h1,
              color='green',size=3,alpha=0.5) +
  coord_fixed() +
  ggtitle(paste("inspection, steps=",sim$summary$runLength,
                ', decision=',sim$summary$decision))

This becomes more legible if we replace y with y-s*t, skewing the frame so our decision boundaries are horizontal lines and we are always stepping up or down as we move to the right (actually it is affine transform as it includes a rescaling).

On the skewed graph we always move one to the right on (0,1) and (1,0) observations (which is now irrelevant as the skewed acceptance lines are horizontal) and either down 1/2 or up 1/2 depending which process was better. So Wald’s acceptance rule becomes: whichever process is first more than 2*h1 ahead is accepted (the 2 is because we are moving up and down in halves). Also notice this rule does not require throwing out the (0,0) and (1,1) observations- as adding them in would not affect vertical position. So this special case of Wald’s rule is in fact a function of the standard sufficient statistics, and doesn’t need the exact details of the pairings.

For our example the acceptance rule (in skewed form) is: “run until one of the options is more than k wins ahead of the other and then quit”. The record keeping doesn’t even require a chart or throwing out pairs- just one running “net ahead” sum.

ggplot(data=sim$d,mapping=aes(x=t,y=y-s*t)) +
  geom_point(data=data.frame(x=c(0,0),y=c(-h1,h1)),
             mapping=aes(x=x,y=y),size=0) +
  geom_point() + geom_line() +
  geom_abline(slope=0,intercept=h0,
              color='blue',size=3,alpha=0.5) +
  geom_abline(slope=0,intercept=h1,
              color='green',size=3,alpha=0.5) +
  ggtitle(paste("(skewed) inspection, steps=",sim$summary$runLength,
                ', decision=',sim$summary$decision))

Here are many repeats of such an experiment:

parallelCluster <- parallel::makeCluster(parallel::detectCores())
mkWorker <- function(s,h0,h1,p1,p2,simulate) {
  force(s)
  force(h0)
  force(h1)
  force(p1)
  force(p2)
  force(simulate)
  function(i) {
    si <- simulate(s,h0,h1,p1,p2)
    si$d$rep <- i
    si$summary$rep <- i
    si
  }
}
worker <- mkWorker(s,h0,h1,p1,p2,simulate)
reps <- parallel::parLapply(parallelCluster,
                            1:100,
                            worker)
parallel::stopCluster(parallelCluster)
parallelCluster <- NULL

d <- do.call(rbind,lapply(reps,function(r) {r$d}))
sum <- do.call(rbind,lapply(reps,function(r) {r$summary}))
print(summary(sum))
##        t               y            runLength         decision   
##  Min.   :  480   Min.   : 277.0   Min.   :  4705   Min.   :1.00  
##  1st Qu.: 1049   1st Qu.: 568.0   1st Qu.: 10549   1st Qu.:2.00  
##  Median : 1702   Median : 902.5   Median : 17032   Median :2.00  
##  Mean   : 2159   Mean   :1134.9   Mean   : 21717   Mean   :1.98  
##  3rd Qu.: 2798   3rd Qu.:1463.8   3rd Qu.: 28102   3rd Qu.:2.00  
##  Max.   :10228   Max.   :5267.0   Max.   :102137   Max.   :2.00  
##       rep        
##  Min.   :  1.00  
##  1st Qu.: 25.75  
##  Median : 50.50  
##  Mean   : 50.50  
##  3rd Qu.: 75.25  
##  Max.   :100.00
ggplot(data=d,mapping=aes(x=t,y=y-s*t)) +
  geom_point(data=data.frame(x=c(0,0),y=c(-h1,h1)),
             mapping=aes(x=x,y=y),size=0) +
  geom_line(alpha=0.5,mapping=aes(group=as.factor(rep))) +
  geom_smooth(color='orange') +
  geom_density2d(color='gold') +
  geom_abline(slope=0,intercept=h0,
              color='blue',size=3,alpha=0.5) +
  geom_abline(slope=0,intercept=h1,
              color='green',size=3,alpha=0.5) +
  ggtitle("(skewed) many repetitions of the experiment")
## `geom_smooth()` using method = 'gam'

ggplot(data=sum,mapping=aes(x=runLength)) + 
  geom_density(adjust=0.2) + ggtitle('experiment run times')

The experiment run time is the subject of random walks. You would expect to take time on the order of (1/(p1+p2-p1 p2))*(h1-h0)^2 to escape if p1=p2, and to escape much quicker if they are in fact different.

(1/(p1+p2-p1*p2))*(h1-h0)^2
## [1] 37335.61
mean(sum$runLength)
## [1] 21717.12
summary(as.factor(sum$decision))
##  1  2 
##  2 98

For more articles on A/B testing (including dynamic programing formulations and Bayesian formulations) please see following: