Calibration of the Testing Environment for Better Metrics on User Behavior

By Paul Terry

May 13, 2022


A/B testing is about understanding changes in user behavior by sampling user journeys and comparing them through metrics. Best practices around test configuration, sampling, measurement, segmentation, and statistical significance are all important to understanding how journeys change. But remember, those practices begin with a clean, calibrated A/B testing lab.


Before we begin, let’s define a few terms.

  • A/B Testing: A/B testing is an experiment where two versions of a web page are compared to see which one drives more conversions.
  • A/A Testing: A/A testing is a method of experimentation where two identical versions of a web page are shown to a specified set of visitors and compared to test the accuracy of a testing tool.
  • Bot: A ‘bot’ – short for robot – is a software program that performs automated, repetitive, pre-defined tasks. Bots typically imitate or replace human user behavior. Because they are automated, they operate much faster than human users.
  • Delta: A delta is the difference in a metric between treatment and control, reported either as a percent change or as an absolute difference.
  • Outlier: An outlier is an observation that appears to deviate markedly from other observations in the sample.
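
As a quick illustration of the two ways a delta can be reported, here is a minimal Python sketch. The function names and the conversion rates (4.0% for control, 4.6% for treatment) are hypothetical values chosen for the example, not figures from any real test. With those numbers, the same change is a 0.6 percentage point absolute difference and a 15% change.

```python
def absolute_delta(control: float, treatment: float) -> float:
    """Absolute difference: treatment minus control, in the metric's own units."""
    return treatment - control


def percent_change(control: float, treatment: float) -> float:
    """Relative difference: change as a percentage of the control value."""
    if control == 0:
        raise ValueError("percent change is undefined when the control value is 0")
    return (treatment - control) / control * 100.0


# Hypothetical conversion rates, for illustration only.
control_rate = 0.040
treatment_rate = 0.046

points = absolute_delta(control_rate, treatment_rate) * 100
relative = percent_change(control_rate, treatment_rate)
print(f"Absolute delta: {points:.2f} percentage points")
print(f"Percent change: {relative:.1f}%")
```

When reading a report, it helps to check which of the two is being quoted, since a small absolute difference can look large as a percent change when the control value is low.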

Before conducting an experiment in a test tube, it’s important to first clean that test tube of anything that might bias the results. The same is true when A/B testing your site or application.

Creating a Clean Test Environment

An A/A test randomly assigns users to two variations, neither of which makes any changes. All major metrics (significant user behaviors from landing page to thank-you page) are measured for each group. As the test runs and the data matures, you’ll see that a test which should ultimately show no significant deltas does, at times, show significant deltas along the way.
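
To see why an A/A can look significant from time to time, here is a rough Python simulation sketch. Everything in it is an assumption made for illustration: the 5% conversion rate, the 2,000 users per day, the two-proportion z-test, and the 0.05 threshold. It is not how SiteSpect or any particular tool computes significance.

```python
import math
import random


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa(days: int = 14, users_per_day: int = 2000,
                rate: float = 0.05, seed: int = 0) -> list:
    """Run one simulated A/A and return the cumulative p-value after each day."""
    rng = random.Random(seed)
    conversions = [0, 0]
    users = [0, 0]
    p_values = []
    for _ in range(days):
        for _ in range(users_per_day):
            arm = rng.randrange(2)       # random assignment to one of two identical arms
            users[arm] += 1
            if rng.random() < rate:      # same true conversion rate in both arms
                conversions[arm] += 1
        p_values.append(
            two_proportion_p_value(conversions[0], users[0], conversions[1], users[1])
        )
    return p_values


# Count how many of 200 simulated A/A tests dip below 0.05 on at least one day.
runs = 200
ever_significant = sum(
    1 for seed in range(runs)
    if min(simulate_aa(users_per_day=500, seed=seed)) < 0.05
)
print(f"{ever_significant} of {runs} A/A runs looked significant on at least one day")
```

Because the p-value is checked every day, the share of runs that cross the threshold at some point is typically higher than 5%, even though both arms are identical. That is the behavior to expect from a healthy A/A; a delta that stays significant as the data matures is the one worth investigating.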

An excellent way to study the calibration is a Double A/A: two A/A campaigns run simultaneously on the same users, each with its own independent random variation assignment. As the two tests mature, most metric deltas will shrink and disappear, but some may persist, perhaps in one A/A and not the other. Why? Systemic noise. Test tube dirt.
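
One way to picture "the same users, with distinct random assignments" is salted hashing, sketched below in Python. This is an assumption for illustration only: the salts, labels, and function names are hypothetical, and a real testing tool may assign variations quite differently.

```python
import hashlib


def assign_variation(user_id: str, campaign_salt: str) -> str:
    """Deterministically place a user in one of two identical arms of a campaign."""
    digest = hashlib.sha256(f"{campaign_salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "A1" if bucket == 0 else "A2"


def double_aa_assignment(user_id: str) -> tuple:
    """Assign the same user in two A/A campaigns, using a different salt for each."""
    return (
        assign_variation(user_id, "aa-campaign-1"),  # hypothetical salt
        assign_variation(user_id, "aa-campaign-2"),  # hypothetical salt
    )


# With independent assignments, roughly half the users should land in the
# same-named arm in both campaigns and half should not.
users = [f"user-{i}" for i in range(10_000)]
same_arm = sum(1 for u in users if len(set(double_aa_assignment(u))) == 1)
print(f"Same arm in both campaigns: {same_arm / len(users):.1%}")
```

Because the two assignments are independent, noise from a single source (one bot, for example) can land in different arms of each campaign, which is why a delta may persist in one A/A and not in the other.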

Any metric showing a sustained, significant delta in either A/A should not be used for analysis until a clean A/A has been achieved. Such deltas likely result from bot activity concentrated in one variation, such as a health or performance check system or a nasty content scraper.

Segmenting an A/A with a sustained delta may help isolate the cause. Look for outliers relative to normal site behavior; one useful metric for uncovering bots is pageviews per second. Bots typically make more frequent requests than real users and usually focus on one or a few page types. Once identified, bots should be removed from the testing pool.
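
The pageviews-per-second check can be sketched in a few lines of Python. The event format (session ID, timestamp in seconds, page type), the thresholds of 1 pageview per second and 10 pageviews, and the sample sessions are all hypothetical; sensible cutoffs depend on what normal behavior looks like on your own site.

```python
from collections import defaultdict


def pageviews_per_second(events):
    """Summarize each session: pageview count, request rate, distinct page types.

    events: iterable of (session_id, timestamp_seconds, page_type) tuples.
    """
    by_session = defaultdict(list)
    for session_id, timestamp, page_type in events:
        by_session[session_id].append((timestamp, page_type))

    stats = {}
    for session_id, hits in by_session.items():
        times = [ts for ts, _ in hits]
        # Floor the duration at 1 second so single-hit sessions do not divide by zero.
        duration = max(max(times) - min(times), 1.0)
        stats[session_id] = {
            "pageviews": len(hits),
            "rate": len(hits) / duration,
            "page_types": len({page for _, page in hits}),
        }
    return stats


def flag_suspected_bots(events, max_rate=1.0, min_pageviews=10):
    """Return session IDs whose request rate exceeds max_rate (assumed thresholds)."""
    return sorted(
        session_id
        for session_id, s in pageviews_per_second(events).items()
        if s["pageviews"] >= min_pageviews and s["rate"] > max_rate
    )


# Hypothetical data: one human-paced session and one rapid single-page-type session.
example_events = [
    ("visitor-1", 0, "home"), ("visitor-1", 30, "category"),
    ("visitor-1", 75, "product"), ("visitor-1", 110, "cart"),
    ("visitor-1", 120, "checkout"),
]
example_events += [("scraper-1", i * 0.2, "product") for i in range(50)]

print(flag_suspected_bots(example_events))
```

The `page_types` count is included because a high request rate concentrated on a single page type is a stronger signal than rate alone. Flagged sessions are candidates for review rather than confirmed bots.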


Regular calibration with A/A tests is important to a clean test environment. Metrics should also be compared against other systems outside the testing environment. Any differential should be minimal, consistent, and explainable. Segmentation can help when comparing traffic types and sources between different measurement systems. If you apply these tips, your metrics will be more accurate and you will be able to make informed decisions to improve the user experience.

Happy Testing!


Paul Terry


Paul Terry is a SiteSpect Consultant in Customer Support, guiding SiteSpect users on the road to optimization. He has over 15 years of experience in optimization, testing, and personalization. He is based in Duluth, Georgia.
