Determining significance with Bayesian analysis.

A major part of working with data is finding objective measures of when an observed variation is significant. It's especially difficult in cases like A/B testing, where you can occasionally have so few examples of user behavior that measuring significance is tough. You want your test to yield definitive results as quickly as possible, to reduce the time spent displaying the less performant option. But you also need results to be supportable. After all, you'll be putting this on a PowerPoint slide in an exec-level meeting at some point.

Thankfully, we have computers, which we've trained to do math for us. And therefore, solutions. In this example of an A/B test I put together in a hackathon environment (I told you, I'm an addict), I'm using the Python PyMC3 package to draw extra samples using Markov chain Monte Carlo (MCMC) and comparing the beta distributions of two different website versions.

As you can see below, 21 results on version A and 17 results on version B is not exactly enough to bet the farm on. However, we can still make some mathematically supportable assertions with what we have by using MCMC to trace out more samples.

Here I drew 100,000 new samples, which is undoubtedly overkill, but my CPU needed a workout.
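For a Beta-Bernoulli model like this one, the posterior is actually conjugate, so the same idea can be sketched without PyMC3 at all, using direct draws from the Beta posterior via Python's built-in `random.betavariate`. This is a stdlib-only sketch, not the original hackathon code, and the click counts are hypothetical assumptions (the post only gives the totals of 21 and 17):

```python
from random import betavariate, seed

seed(42)  # for reproducibility

# Totals from the post; the click counts are hypothetical assumptions
visitors_a, clicks_a = 21, 5
visitors_b, clicks_b = 17, 7

# With a uniform Beta(1, 1) prior, the posterior click-through rate is
# Beta(1 + clicks, 1 + non-clicks); conjugacy means no MCMC is needed here
n_draws = 100_000  # matches the 100,000 samples in the post; overkill, as noted
samples_a = [betavariate(1 + clicks_a, 1 + visitors_a - clicks_a)
             for _ in range(n_draws)]
samples_b = [betavariate(1 + clicks_b, 1 + visitors_b - clicks_b)
             for _ in range(n_draws)]

mean_a = sum(samples_a) / n_draws
mean_b = sum(samples_b) / n_draws
```

Each list is 100,000 draws of a plausible click-through rate; their spread is exactly the uncertainty the small sample sizes leave us with.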

With this increased sample size, we can visualize the beta distributions of the probability that version A will result in a click and the probability that version B will.
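The actual plots presumably came from matplotlib or PyMC3's built-in plotting helpers; purely as a stdlib-only illustration of the same idea, here is a crude text histogram of two Beta posteriors, with hypothetical click counts since the post doesn't list them:

```python
from random import betavariate, seed

seed(1)

# Hypothetical posteriors, Beta(1 + clicks, 1 + non-clicks) under a uniform
# prior; the counts here are assumptions, not taken from the post
posteriors = {"A": (6, 17), "B": (8, 11)}
n_draws = 10_000

histograms = {}
for name, (alpha, beta) in posteriors.items():
    draws = [betavariate(alpha, beta) for _ in range(n_draws)]
    # Bucket the draws into ten bins across [0, 1)
    counts = [0] * 10
    for d in draws:
        counts[min(int(d * 10), 9)] += 1
    histograms[name] = counts

for name, counts in histograms.items():
    print(name)
    for i, n in enumerate(counts):
        print(f"  {i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * (n // 100)}")
```

Even in ASCII, the two humps sit visibly apart, which is the whole point of the visualization.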

So this is all well and good; it seems perfectly clear that B has the higher probability of click-through. A/B test closed, right?

Well, though we just generated additional samples to create our beta distributions, any good CEO would also point out that we only started with ~20 samples of each to begin with.

How do I know my findings are statistically valid? Furthermore, how can we be good Bayesians and state a credible interval around our assertions?

By calling the Deterministic method on our PyMC3 model and subtracting A's probability distribution from B's, we get the above distribution of differences. The posterior probability that B's lead is real, our Bayesian measure of significance, comes out to 75.9%.
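In PyMC3 that difference is typically wired up as a `pm.Deterministic` node inside the model; a stdlib-only sketch of the same computation (direct Beta draws instead of MCMC, and hypothetical click counts since the post doesn't list them) might look like:

```python
from random import betavariate, seed

seed(0)

n_draws = 100_000
# Hypothetical posteriors under a uniform prior: Beta(1 + clicks, 1 + non-clicks)
p_a = [betavariate(6, 17) for _ in range(n_draws)]  # assumed: 5/21 clicks on A
p_b = [betavariate(8, 11) for _ in range(n_draws)]  # assumed: 7/17 clicks on B

# Distribution of B's lead over A (the Deterministic node in the PyMC3 version)
delta = [b - a for a, b in zip(p_a, p_b)]

# Posterior probability that B genuinely beats A
prob_b_better = sum(d > 0 for d in delta) / n_draws

# A 95% credible interval for the lead, straight from the sample quantiles
delta.sort()
ci_low, ci_high = delta[int(0.025 * n_draws)], delta[int(0.975 * n_draws)]
print(f"P(B > A) = {prob_b_better:.3f}, "
      f"95% credible interval for the lead: ({ci_low:.3f}, {ci_high:.3f})")
```

Note that with so little data the credible interval for the lead still straddles zero, which is exactly why the probability of B's lead stays well short of certainty.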

Working across programs.
