Kim Ann King

In this post, we’ll look at four key steps in implementing your optimization program:

  1. Clearly Define Success and Failure
  2. Ensure Good Test Design
  3. Clarify Your Testing Timeline
  4. Test Different Audience Segments

Clearly Define Success and Failure

A common disappointment among companies deploying testing and optimization technology stems from tests that fail to produce the type of gains expected. Seemingly without rhyme or reason, even the most dramatic design changes yield “no significant differences” on simple measures such as click-through, and even less on more involved downstream metrics such as conversion rate. While this is the reality of testing, I believe that much of the disappointment stems from a lack of attention to the definition of “success” and “failure” as the design or changes are implemented.

Success in testing can be measured many different ways:

  • For some, “success” is a dramatic increase in a revenue-based metric, knowing that most senior stakeholders will respond to incremental revenue.
  • For others, “success” is a small increase in key visitor engagement metrics, knowing that a series of small gains eventually adds up.
  • For still others, “success” is a reduction in the number of problems present throughout the site, knowing that reducing barriers improves usability.
  • For some, especially those with an increasingly dated site, “success” is simply being able to deploy a new look without negatively impacting existing key performance indicators.

A lack of “success” in testing is often viewed as a failure on someone’s part, but that is actually rarely the case. In reality, testing powers a continual learning process about your visitors and customers. If a particular image fails to increase conversion rates, you have learned that your audience does not respond to that particular image. If subsequent testing reveals that variations of the same image yield similar results, then you learn something about your audience’s reaction to the image’s content. In this context, there is no such thing as “failure” in testing, only a failure to achieve the specific defined objective.

Keep in mind that not every test can yield incremental millions in revenue for your business. Some tests will fail to produce the change desired; others will yield results but not across the key performance indicators; and still others will simply fail to produce statistically significant differences. But it is our firm opinion that there are no “failures” in testing other than a failure to carefully design your tests and a failure to carefully consider what you’ve learned.

Ensure Good Test Design

Success with testing depends heavily on the quality of your test design. One of the reasons we recommend requiring a formal test plan is so that the Testing Team has as much information as possible to determine how the test should be run. Especially when you start to aggressively test, good test design helps ensure that any effects from participation in multiple tests can be taken into account, either by identification and isolation or outright removal from the result set.

To this end, it is reasonable to consult with someone experienced in online experimental design, either from your vendor or a third party. Several elements constitute a good test design, and it is important to pay attention to them. For example, you should:

  • know whether you need an A/B or multivariate test.
  • pick the test array that works best for your needs, either a full or fractional factorial array.
  • make sure you are running the test long enough based on traffic and conversions in order to get a statistically valid sample size.
  • make sure you are properly testing variations of factors. Improper factoring is caused by poor (or no) isolation of individual changes; for example, changing a headline’s text, font, color, and size all at the same time.
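To make the full-factorial option concrete, here is a minimal sketch in Python that enumerates every combination of factor levels. The factor names and levels are hypothetical and chosen only for illustration. The point is how quickly the number of recipes grows as factors are added, which is why a fractional array (a carefully chosen subset of these rows, not shown here) is sometimes preferred.

```python
from itertools import product

# Hypothetical factors and levels for a landing-page test (illustration only).
factors = {
    "headline": ["Save time", "Save money"],
    "hero_image": ["product", "lifestyle", "none"],
    "button_color": ["green", "orange"],
}

def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

recipes = full_factorial(factors)
print(len(recipes))  # 2 * 3 * 2 = 12 recipes
```

Because each factor is varied independently of the others, the effect of a single change (for example, the headline) can be separated from the rest, which is the isolation the last bullet above asks for.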

Another mistake new testers often make is always running tests against anyone and everyone; a good test design means you are targeting your tests to a relevant audience, and then performing additional segmentation on the results.

Clarify Your Testing Timeline

One of the most unfortunate mistakes that companies make when getting started with testing is to only test for statistical significance. A great deal has been written about test design and full factorial versus fractional factorial versus A/B testing. While these are all important considerations, none are nearly as important as having a test sample that takes day-part and day-of-week variation into account.

Consider that even on the highest volume sites, there are typical peaks and valleys in traffic caused by target audience geography, marketing efforts, and the particular interaction model promoted by the site. Within each of these peaks and valleys, your site is attracting a particular type of visitor—late night visitors, early risers, lunch-timers across different time zones, etc.

Assuming you’re not trying to target a specific audience segment, a truly random sample of visitors will account for this variation and sample across these visitor variants. In order to reduce test bias as much as possible, a general rule-of-thumb for test planning is the “7+1” testing model.

In this model, you will be testing over an entire week (seven days) and building in a little extra time to make sure that you have a clean break in the data for analysis. Thus, “7+1” means running your test for a full week with an extra day on the front end. By giving the test a day before you start actively tracking results, you allow for slippage and the need for last-minute changes, plus it gives the analysis team the ability to gather data starting at midnight at the end of the “+1” day.
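As a rough illustration of the calendar arithmetic, the sketch below computes the analysis window for a given launch day. The function name `seven_plus_one` and the example date are assumptions made for this example; the only logic taken from the model is that tracking starts at midnight at the end of the “+1” day and runs for full weeks.

```python
from datetime import date, datetime, time, timedelta

def seven_plus_one(launch_day, weeks=1):
    """Return (analysis_start, analysis_end) for a "7+1" style test.

    launch_day is the "+1" day: the test is live, but results are not yet
    counted. Tracking starts at midnight at the end of that day and runs
    for `weeks` full weeks (weeks=2 gives the "14+1" variant).
    """
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    analysis_start = datetime.combine(launch_day + timedelta(days=1), time.min)
    analysis_end = analysis_start + timedelta(days=7 * weeks)
    return analysis_start, analysis_end

# Hypothetical launch on Monday, 4 March 2024.
start, end = seven_plus_one(date(2024, 3, 4))
print(start, end)  # 2024-03-05 00:00:00 2024-03-12 00:00:00
```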

And by running the test over an entire week, you will account for all of the potential day-part and day-of-week variation, at least as much as is possible. If you have the luxury of time, you may want to consider extending the test to a “14+1” model, doubling the amount of time you run the test. With two weeks, you will be better able to account for additional variation in the data arising from tactical marketing efforts, a sudden increase in referral from social media, holidays, current events, etc.

One of the advantages of the “7+1” model is that you can adjust your sample size to still only gather as much data as you need; you’ll just gather that data more slowly. Rather than taking a 20% sample over four days to get statistical significance, the “7+1” model may guide you to take a 5% sample over seven days. The smaller sample lessens your risk associated with testing, since if the tests fare poorly, fewer visitors will be exposed to them, and you’re still able to get to statistical significance in a relatively short period of time. Further, it allows non-test traffic to be eligible for assignment to other tests that you may be running concurrently.
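The trade-off between test length and sampling rate is simple arithmetic, sketched below. The traffic and sample-size figures are hypothetical and are not meant to reproduce the 20% and 5% figures above; the function only shows that, for a fixed number of required participants, a longer test needs a smaller share of each day’s traffic.

```python
def sampling_rate(required_visitors, daily_visitors, days):
    """Fraction of daily traffic to assign to the test so that
    required_visitors are collected over the given number of days.
    Assumes roughly constant daily traffic (a simplification)."""
    if required_visitors <= 0 or daily_visitors <= 0 or days <= 0:
        raise ValueError("all inputs must be positive")
    rate = required_visitors / (daily_visitors * days)
    if rate > 1:
        raise ValueError("not enough traffic to reach the sample in this many days")
    return rate

# Hypothetical numbers: 35,000 participants needed, 100,000 visitors per day.
four_day = sampling_rate(35_000, 100_000, days=4)   # 8.75% of traffic
seven_day = sampling_rate(35_000, 100_000, days=7)  # 5% of traffic
print(f"{four_day:.2%} vs {seven_day:.2%}")
```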

The major complaint about the “7+1” model is that it takes time, and if you just open the spigot on the test, you can achieve statistical significance in a matter of hours in some cases. While this sounds good, opening the spigot on testing is exactly how not to achieve success through testing. Unless you have a very sophisticated understanding of your audience and the sampling technique employed, the “fire hose” model will likely leave you with more questions than helpful insights. Anyone who doesn’t like the results can simply argue that your sample does not represent the diversity of user types coming to the site and refuse to accept your analysis. Whatever kind of test you are running (A/B or multivariate), you want to make sure you’ve run your test long enough to obtain a statistically valid sample size, that is, a sufficient number of participants assigned to the test.

Your sample size will be determined by a combination of traffic volume, your baseline (control) conversion rate, and the conversion rate observed among test participants. You’ll want to make sure you obtain an appropriate sample without bias in time of day, day of week, holiday/event, etc.
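For planning purposes, one common way to estimate the sample size is the textbook normal-approximation formula for comparing two proportions. The sketch below is not the method of any particular testing platform. The function name, the default significance level (5%, two-sided), and the default power (80%) are assumptions for illustration, and the “expected” rate stands in for the conversion rate you hope to see among test participants.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate participants needed in EACH group (control and test)
    to detect a change from baseline_rate to expected_rate, using the
    normal approximation for two proportions."""
    if not (0 < baseline_rate < 1 and 0 < expected_rate < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    if baseline_rate == expected_rate:
        raise ValueError("rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical: 3.0% baseline conversion, hoping to detect a move to 3.6%.
n = sample_size_per_group(0.03, 0.036)
print(n)  # roughly 14,000 participants per group
```

Note how strongly the result depends on the size of the change you want to detect: halving the expected difference roughly quadruples the required sample.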

For example, you might run a test with a huge sample size and obtain statistically significant results in one day, but this only reflects how visitors behaved on that particular day. So take care that the test is run across a longer period of time (at least 7+1 or 14+1, and perhaps longer depending on the situation) to guard against bias.

Remember: the best testers work thoughtfully and carefully, and they are willing to spend a little extra time on process or testing to make sure they deliver accurate, reliable, and believable results to socialize through the rest of the organization.

Test Different Audience Segments

Advanced testers are testing against key visitor and customer segments. The logic behind this is clear: why optimize your site for everyone when you can focus your optimization efforts on those visitors who have already demonstrated value to your business?

There are two ways to conduct a segmented test: ad hoc and post hoc. The former method requires that you are able to identify segment members in real time so that the testing engine can assign people appropriately. For example, you may be targeting “first-time visitors” or “visitors referred from Google organic search results,” which, depending on the testing platform you use, can be easily done.

The latter method for segmenting is post hoc—after the fact—which is more an analysis technique than a testing strategy. In this case, you will mine test results for segment members and compare these results across control and test groups. This strategy also involves some work between testing and analytics vendors but is often more forgiving, especially if your testing vendor supports full data export and is able to provide the analytics vendor’s ID.
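A post hoc comparison of this kind can be sketched in a few lines of Python. The row layout (group, segment, converted) is a hypothetical export format, not one produced by any specific vendor, and the data is made up for illustration. The sketch also assumes no real segment is literally named "all" and that the control rate is not zero.

```python
from collections import defaultdict

def conversion_rates(rows):
    """Conversion rate per (group, segment), plus an "all" roll-up per group.
    Each row is (group, segment, converted) from a hypothetical data export."""
    counts = defaultdict(lambda: [0, 0])  # key -> [conversions, visitors]
    for group, segment, converted in rows:
        for key in ((group, "all"), (group, segment)):
            counts[key][0] += int(converted)
            counts[key][1] += 1
    return {key: conv / total for key, (conv, total) in counts.items()}

def relative_lift(rates, segment):
    """Relative change of the test group versus control for one segment."""
    control = rates[("control", segment)]
    test = rates[("test", segment)]
    return (test - control) / control

# Made-up data: 100 visitors per group and segment.
rows = (
    [("control", "new", True)] * 10 + [("control", "new", False)] * 90
    + [("control", "returning", True)] * 20 + [("control", "returning", False)] * 80
    + [("test", "new", True)] * 10 + [("test", "new", False)] * 90
    + [("test", "returning", True)] * 25 + [("test", "returning", False)] * 75
)
rates = conversion_rates(rows)
for segment in ("all", "new", "returning"):
    print(segment, f"{relative_lift(rates, segment):+.1%}")
```

Keep in mind that each segment contains fewer participants than the test as a whole, so a lift seen in a single segment deserves its own significance check before it is reported.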

Regardless of how you produce the data, focus on your key segments when communicating your test results. If you have the data and the time, it is definitely better to be able to tell management, “The test produced a 5% lift in click-through rate across all visitors and a 15% increase in click-through rate across our most valuable customer segment.” This message should resonate loud and clear, especially if your measurement team has done a good job at leveraging visitor segmentation.

(For more information, download the whitepaper titled "Successful Web Site Testing Practices: Ten Best Practices for Building a World-Class Testing and Optimization Program.")
