Way Out In Left Field:
How To Manage Extreme Data Values
You’ve cleaned the data of survey cheaters, bots, straightliners, speedsters, and other ne’er-do-wells.
The data set is ready to go. But hold the analysis! You still need to examine the data for outliers: extreme data points in continuous, numerical fields such as taxable income, age, number of times shopped in the past 6 months, or number of children.
Outliers can play havoc with the data. At best, these pesky data values can make individual metrics look a bit “off.” At worst, outliers can lead to erroneous business decisions. Case in point: if the U.S. Census Bureau relied on the simple average to describe a typical household’s annual income, the resulting statistic would give a distorted view of earnings (think of the relatively few individuals who earn billions per year versus the many households with zero or near-zero income). Instead, the Census Bureau reports the median (middle value), which resists the undue influence of outliers and provides a more accurate picture of U.S. household income.
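To make the mean-versus-median point concrete, here is a toy illustration in Python (the income figures are invented purely for demonstration):

```python
from statistics import mean, median

# Nine ordinary household incomes plus one extreme earner (values invented)
incomes = [0, 18_000, 32_000, 41_000, 48_000,
           52_000, 58_000, 65_000, 72_000, 2_000_000_000]

print(f"mean:   ${mean(incomes):,.0f}")    # ~$200 million, dominated by one value
print(f"median: ${median(incomes):,.0f}")  # $50,000, the typical household
```

One extreme value drags the mean three orders of magnitude away from anything a typical household earns, while the median barely moves.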
So, if outliers are potentially a big problem, there should be some formal ways to find them, right? Two typical definitions: (1) for data that follow the normal distribution (the bell-shaped curve), points that fall more than 2 to 3 standard deviations from the mean; or (2) for non-normal data, points that fall more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile. But often the definition of an outlier is in the eye of the beholder (or data analyst).
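As a rough sketch, both rules take only a few lines of Python; the function names are my own, and the cutoffs of 3 standard deviations and 1.5 × IQR are the conventional defaults rather than fixed requirements:

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean
    (sensible only when the data are roughly normal)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > threshold

def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR below Q1 or above Q3 (Tukey's fences);
    makes no normality assumption."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Shopping-trip counts with one suspicious entry (data invented for illustration)
trips = [0, 1, 2, 2, 3, 3, 4, 5, 6, 48]
print(iqr_outliers(trips))  # only the 48 is flagged
```

Note that the two rules can disagree: the z-score rule relies on a mean and standard deviation that the outlier itself inflates, so the IQR rule is often the safer default for skewed survey data.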
Once outliers have been identified, what do you do with them? The simplest method is to delete the problematic data values. You can keep the remaining information from that same data record (respondent) unless multiple outliers from the same record are suspected. This method works when there are a large number of respondents and the number of outliers within a data field is relatively low. Other methods include transforming the data and/or weighting. There are many ways of transforming outliers, including capping them at the nearest acceptable value (Winsorization) or using the remaining data to impute (via regression or other algorithms) a more probable value. Down-weighting outliers also reduces their influence on any statistics generated. Finally, if you are reluctant to omit or change outliers prior to analysis, robust or non-linear regression techniques can reduce the impact of extreme values on predicted outcomes.
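As an illustration of the capping idea, percentile-based Winsorization can be sketched in a few lines; the cutoffs below are arbitrary choices for demonstration, not recommendations:

```python
import numpy as np

def winsorize(values, lower_pct=0, upper_pct=95):
    """Cap values outside the given percentile cutoffs at those cutoffs.
    Cutoffs are illustrative; choose them to suit the variable at hand."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Shopping-trip counts with one implausible entry (data invented for illustration)
trips = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 90]
print(winsorize(trips))  # the 90 is capped near 11, the sample's 95th percentile
```

For production work, scipy.stats.mstats.winsorize implements an order-statistic version of the same idea.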
In addition to addressing outliers, it is important to understand why they occurred in the first place. Are these out-of-bounds values telling you something about your data? Recently, my team processed a client’s customer data (sales and products) with the goal of merging it with survey data for analysis. The client was attuned to the outlier issue, having dealt with it previously. During data cleaning, my team developed the hypothesis that many of the customers with outlier values were actually members of a high-value segment for the client’s business. Omitting these records would have discarded valuable intelligence about a small but mighty group of customers. The client ultimately agreed, and these former outliers were partitioned into a separate data set for review on their own.
Often, though, outliers result from typos or poorly worded survey items. Typos can be prevented by using pre-coded answer lists rather than open-ended numerical (user-typed) entry boxes. After all, can survey participants really provide a precise count of the times they have used a product in the past 6 months? Allowing respondents to select the appropriate frequency category (e.g., less than once a month, once a month) is easier and reflects actual usage more accurately. If open-ended numerical questions must be used, validate the answers so that only responses within a realistic range are accepted, as sketched below.
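A minimal sketch of that range check, assuming the survey platform lets you run custom validation on a typed answer; the bounds are hypothetical and would come from the questionnaire design:

```python
def validate_frequency(raw: str, lo: int = 0, hi: int = 60) -> int | None:
    """Accept an open-ended 'times used in the past 6 months' answer only if
    it parses as an integer within a realistic range; otherwise return None
    so the respondent can be prompted to re-enter. Bounds are illustrative."""
    try:
        value = int(raw.strip())
    except ValueError:
        return None          # not a number at all
    if lo <= value <= hi:
        return value
    return None              # outside the realistic range, e.g., 500 or -3

print(validate_frequency("12"))   # 12
print(validate_frequency("500"))  # None -> ask respondent to confirm or re-enter
```

Catching an implausible entry at the moment of data collection is far cheaper than hunting it down during cleaning.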
Poorly worded questions can be improved by pre-testing the survey among a small group of potential respondents. This vetting process identifies wording that is confusing or that unintentionally elicits wild responses. Carefully crafted instructions can also help respondents answer more consistently and accurately. For instance, rather than asking for a vague number of occasions over the past 3 to 12 months, ask respondents to recall one recent, specific usage occasion, then ask for the details of that one occasion.
As a final note, always document how you treated outliers so that, if the study is repeated (as in a tracking study), you can easily replicate the procedure. Changing outlier handling midstream in a research program can lead to unintended consequences.
Author
Elizabeth Horn
Senior VP, Advanced Analytics
Beth has provided expertise and high-end analytics for Decision Analyst for over 25 years. She is responsible for design, analyses, and insights derived from discrete choice models; MaxDiff analysis; volumetric forecasting; predictive modeling; GIS analysis; and market segmentation. She regularly consults with clients regarding best practices in research methodology. Beth earned a Ph.D. and a Master of Science in Experimental Psychology with emphasis on psychological principles, research methods, and statistics from Texas Christian University in Fort Worth, TX.
Copyright © 2024 by Decision Analyst, Inc.
This posting may not be copied, published, or used in any way without written permission of Decision Analyst.