Significance Testing

Risky Business, Statistically Speaking

Statistical significance testing is fraught with danger. “Getting it wrong” can translate into suboptimal business decisions at best and financial loss at worst. Those who dare venture into the morass of Greek letters and complex equations should become familiar with its perils, especially the likelihood of drawing an incorrect conclusion.

Although there are several potential pitfalls associated with statistical significance testing, these are the two main mistakes:

  • Mistake #1 is a false positive: concluding that a true difference exists between two numbers (usually averages or percentages) when it does not. This is called a Type 1 error, and the probability of making it is the “significance level.” Reporting that two numbers are statistically different at the 95% confidence level means that there is a 5% chance of a false positive.
  • Mistake #2 is a false negative: failing to detect a true difference between two numbers when that difference really does exist. This is called a Type 2 error (statisticians are not especially creative with names). A short simulation after this list shows both mistakes in action.
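For readers who like to see these two mistakes in numbers rather than words, here is a minimal simulation sketch in Python. The response rates, sample size, and number of simulated studies are all made up for illustration, and a two-sample t-test on 0/1 responses is used as a convenient stand-in for a two-proportion test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ALPHA = 0.05          # significance level (tolerated false-positive risk)
N_PER_GROUP = 200     # respondents per cell (hypothetical)
N_SIMULATIONS = 5000

def reject_null(p_a, p_b):
    """Run one simulated study: draw two samples of 0/1 responses
    and test whether their proportions differ at the ALPHA level."""
    a = rng.binomial(1, p_a, N_PER_GROUP)
    b = rng.binomial(1, p_b, N_PER_GROUP)
    # A two-sample t-test on 0/1 data approximates a two-proportion z-test.
    _, p_value = stats.ttest_ind(a, b)
    return p_value < ALPHA

# Scenario 1: no real difference (both 40%). Any "significant" result is a Type 1 error.
type1 = sum(reject_null(0.40, 0.40) for _ in range(N_SIMULATIONS)) / N_SIMULATIONS

# Scenario 2: a real difference exists (40% vs. 45%). Failing to detect it is a Type 2 error.
type2 = sum(not reject_null(0.40, 0.45) for _ in range(N_SIMULATIONS)) / N_SIMULATIONS

print(f"False-positive (Type 1) rate: {type1:.1%}  (should hover near {ALPHA:.0%})")
print(f"False-negative (Type 2) rate: {type2:.1%}  (depends on sample size and effect size)")
```

In this particular setup the false-positive rate lands near 5%, as designed, while the test misses the real five-point difference most of the time. That imbalance is exactly what the sample-size discussion below is about.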

Understanding the magnitude of the risk is the first step in managing it. Fortunately, we can estimate the probability of making each mistake. The probability of making a Type 1 error is called alpha (𝜶) and is mostly under the control of the experimenter or analyst. Prior to collecting data, assess your organization’s tolerance for risk. Generally, 5% is used (see “Why 95%?” for historical context). Market researchers will often set 𝜶 as high as 10%, which corresponds to declaring significance at the 90% confidence level.
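To make the choice of 𝜶 concrete, the sketch below runs a standard two-proportion z-test (via statsmodels) on two hypothetical survey percentages and checks the verdict at both the 95% and 90% confidence levels. The counts and sample sizes are invented for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical survey results: 168 of 400 (42.0%) vs. 194 of 400 (48.5%) top-two-box scores.
successes = [168, 194]
sample_sizes = [400, 400]

z_stat, p_value = proportions_ztest(successes, sample_sizes)
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")

for alpha in (0.05, 0.10):   # 95% and 90% confidence levels
    verdict = "significantly different" if p_value < alpha else "not significantly different"
    print(f"At {1 - alpha:.0%} confidence (alpha = {alpha:.0%}): {verdict}")
```

With these made-up numbers the p-value falls between 0.05 and 0.10, so the same pair of percentages would be declared significantly different at the 90% confidence level but not at the 95% level. The data do not change; only the tolerance for a false positive does.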

The probability of making a Type 2 error is called beta (𝜷). The complement of 𝜷 is called power (1 - 𝜷). (Normally I avoid referring to Greek letters in blogs...but trust me, you will amaze your coworkers with a few well-placed references to 𝜶 and 𝜷.) Type 2 error can be controlled via sample size. How large is large enough? The sample size needed depends broadly on the 𝜶 level, the desired power, and the size of the difference between the two numbers being tested (called the effect size). Simply put, the sample needs to be large enough for the statistical test to detect a difference if one exists.
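Here is a minimal sample-size sketch, again using statsmodels and hypothetical planning inputs: if we expect roughly a 40% versus 45% split and want 80% power (𝜷 = 20%) at 𝜶 = 5%, how many respondents per cell do we need?

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical planning inputs: expect roughly 45% vs. 40% top-two-box scores.
effect_size = proportion_effectsize(0.45, 0.40)   # Cohen's h for two proportions

analysis = NormalIndPower()
n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,    # tolerated Type 1 error rate
    power=0.80,    # 1 - beta: tolerated Type 2 error rate of 20%
    ratio=1.0,     # equal group sizes
)
print(f"Cohen's h = {effect_size:.3f}")
print(f"Respondents needed per group: {n_per_group:.0f}")
```

With these assumptions the answer comes out to several hundred respondents per group, far more than many ad hoc studies field, which is exactly why small samples so often miss real differences.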

We learn in life that not all mistakes and missteps are created equal. Apply this heuristic to statistical testing and a question naturally arises: Which is worse, a Type 1 error or a Type 2 error?

Academic researchers are well-versed in the nuances of these two errors, and polite disagreement exists regarding which error is more problematic. Market researchers, though, are much less familiar with these statistical testing mistakes. Most understand Type 1 error within the context of the confidence level (1 - 𝜶) but are fuzzier on Type 2 error.

I thought it would be interesting to tap into the market researcher’s perspective. So, I posed the following question to my colleagues:

Which of the following mistakes is worse, if either?
  • Declaring that two means or proportions are significantly different when they are not.
  • Not declaring that two means or proportions are significantly different when they are.
  • Both are equally bad.

These are the results of the poll.

  1. 37% — Declaring that two means or proportions are significantly different when they are not. (Type 1 Error)
  2. 14% — Not declaring that two means or proportions are significantly different when they are. (Type 2 Error)
  3. 49% — Both are equally bad.

On a purely rational basis, it makes complete sense that about half of those who answered the poll agreed that both mistakes are equally undesirable. After all, researchers are averse to errors of any kind. It is striking, though, that respondents were more than twice as likely to call a Type 1 error the worse mistake as they were to say the same of a Type 2 error (37% versus 14%, respectively).

Why these results? Generally, we strive to make decisions correctly and quickly (time is money). Therefore, we evaluate risk in ways that minimize time and effort but yield the best results. The two statistical errors are then likely evaluated based on the perceived differences in their consequences. On the surface, the consequences of a Type 1 error appear more immediate and more painful than those of a Type 2 error. This can be illustrated using a new product usage-test example.

Imagine that the purpose of the statistical test is to determine which of two new product formulations (A and B) to launch. Purchase intent for both formulations is assessed in an in-home-use test. The results indicate that both products score just shy of the manufacturer’s new product-launch action standard. Yet, Product A’s purchase intent is significantly higher than Product B’s purchase intent at the 95% confidence level (𝜶 = 5%). Based on the test results, Product A is launched. Unfortunately, Product A has a difficult time generating trial, even with typical marketing support. After a couple of months, the manufacturer makes the decision to substitute Product B, which performs well. Considering that all other factors were held constant (marketing, economic forces, etc.), we are left to assume that a Type 1 error has occurred. Although Product B is performing well, several painful consequences (such as lost revenue, lost market share, and potential damage to brand equity) resulted from launching Product A first.

Now imagine a different scenario using these same two products. This time purchase intent for Product A and Product B are not significantly different. The manufacturer decides to hold off on launching either formulation. Yet, based on the in-market outcomes cited previously, Product B is a better-performing formulation than Product A. The research yields no indication, however, that launching Product B would increase market share and revenue. A Type 2 error has been committed. Management congratulates itself on the no-go decision for Product B and continues on to the next new product initiative, never realizing what could have been.

The consequences of a Type 2 error are often hidden, whereas those for a Type 1 error are more obvious. Therefore, market researchers tend to pay more attention to reducing the risk of a Type 1 error (setting 𝜶 as low as possible) than reducing the risk of a Type 2 error (ensuring that sample size is robust enough to allow the statistical test to detect a difference). Moreover, it is more cost-effective in the short term to adjust the 𝜶 level (it’s completely under the control of the researcher) than it is to adjust sample size. Rather than being viewed as a cost to be managed, robust samples should be considered a standard requirement for good decision-making.
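This trade-off can be shown numerically. In the sketch below (the same hypothetical 45% versus 40% comparison, using statsmodels), tightening 𝜶 while holding the sample fixed quietly pushes the Type 2 risk up, whereas adding sample improves power without loosening 𝜶.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.45, 0.40)   # hypothetical 45% vs. 40% comparison
analysis = NormalIndPower()

for n_per_group in (200, 400, 800):
    for alpha in (0.10, 0.05, 0.01):
        power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                               alpha=alpha, ratio=1.0)
        print(f"n = {n_per_group:>4} per group, alpha = {alpha:.0%}: "
              f"power = {power:.0%}, Type 2 risk = {1 - power:.0%}")
```

The pattern is consistent: at a fixed sample size, every cut in 𝜶 buys extra protection against false positives by paying for it with false negatives. Only a larger sample lets you hold 𝜶 down and keep 𝜷 in check at the same time.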

Statistical significance testing is perilous, but there are ways to mitigate the risk. First, understand the potential mistakes that can be made (Type 1 and Type 2 errors). Second, decide what level of uncertainty is tolerable for your organization and allocate an appropriate research budget to achieve a robust sample size. Reducing the probability of these two errors translates into increasing the likelihood that your organization will make the right decision when faced with the outcome of a statistical significance test.

Author

Elizabeth Horn

Senior VP, Advanced Analytics

Beth has provided expertise and high-end analytics for Decision Analyst for over 25 years. She is responsible for design, analyses, and insights derived from discrete choice models; MaxDiff analysis; volumetric forecasting; predictive modeling; GIS analysis; and market segmentation. She regularly consults with clients regarding best practices in research methodology. Beth earned a Ph.D. and a Master of Science in Experimental Psychology with emphasis on psychological principles, research methods, and statistics from Texas Christian University in Fort Worth, TX.

Copyright © 2019 by Decision Analyst, Inc.
This posting may not be copied, published, or used in any way without written permission of Decision Analyst.