Marketers often undertake large survey projects with little forethought about their approach to data analysis. Compounding this problem is their general lack of interest in cleaning the data they collect. When raw survey data is left unprocessed, it can lead to misleading conclusions, especially when statistical outliers distort the findings, resulting in poor business decisions.
Data cleaning isn’t optional. Without it, your quantitative data may be tainted, and actions based on inaccurate information can lead to ineffective strategies and lost opportunities. In today’s data-driven world, the quality of the information you use plays a significant role in helping you make informed decisions, and those decisions ultimately affect your organization’s success.
Identifying statistical outliers is a key part of data cleaning, and that’s what we’re going to cover here. We will discuss how to identify an outlier in relation to the study’s goals and the type of data collected. Additionally, we will cover what to do with an outlier once identified. This includes deciding whether to omit it or leave it in your results. Understanding outliers can also help you capture valuable insights that might otherwise be overlooked.
What Are Statistical Outliers?
Statistical outliers are data points that lie significantly outside the range of the majority of other values in a data set. They can arise for various reasons, including data entry errors, measurement errors, or genuine variability in the population being studied. Identifying these outliers is critical because they can disproportionately affect statistical analyses, skewing results and leading to incorrect interpretations.
For example, consider a survey measuring customer satisfaction, where most respondents rate their experience between 4 and 5 out of 5. If one respondent rates their experience as 1, this response could be considered an outlier. Understanding why this outlier exists is essential. It may indicate a problem worth investigating, or it could simply be an anomaly in the data.
Why Outliers Matter
Outliers can provide insights that are valuable for decision-making. They may point to exceptional cases or trends that can inform your strategies. However, they can also obscure meaningful patterns if not addressed properly. For instance, if a marketing campaign generates a sudden spike in responses, it’s essential to analyze the situation carefully. You need to determine whether that spike is due to a legitimate increase in interest or if it is an outlier that requires further investigation.
Ignoring outliers can lead to biased estimates of central tendency (particularly the mean, which is far more sensitive to extreme values than the median), misinformed decisions, and ineffective marketing strategies. Conversely, removing too many outliers might lead to the loss of crucial insights, as they can represent unique customer experiences or emerging trends.
Identifying Statistical Outliers in Your Survey Data
Data points that lie outside of the trend set by the majority of other values are typically easy to distinguish when the data is represented visually in a graph. Visual tools like box plots, scatter plots, or histograms can help highlight outliers, making it easier for marketers to grasp their potential impact on the overall dataset.
For example, the day you get 139 trial signups on your marketing site when the daily median is closer to 60 would be an obvious outlier, right?
Well, maybe.
But it’s tough to say without doing a little simple math first. Notice that we used the median of 60 in the example rather than the average. This is because an average can be pulled heavily by an outlier, especially if the sample is small. The median offers a more robust measure of central tendency, particularly in skewed datasets.
How to Calculate the Median
Start by taking your sample and ordering each observation from lowest to highest. As an example, we’ll stick with the trial signup hypothetical. In this case, we have a sample of 13 days and the signups from those days. After being rearranged from smallest to largest, with each day numbered by its position in the sorted list, they look like this:
Day 1: 32
Day 2: 45
Day 3: 49
Day 4: 52
Day 5: 59
Day 6: 62
Day 7: 63 <-median
Day 8: 67
Day 9: 68
Day 10: 71
Day 11: 72
Day 12: 74
Day 13: 139
The median in this data set is Day 7, with a value of 63 trial signups. If you happen to have an even number of observations, the median would be the average of the two values closest to the middle. Now that we have the median for this sample, we’ll assign 63 to the variable Q2, which sits between Q1 and Q3, the lower and upper quartiles.
Q2 = 63
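The median step can also be checked with a few lines of Python. This is a minimal sketch rather than a prescribed tool: the list name `signups` and the helper `median_by_hand` are made up for illustration, and the standard library's `statistics.median` is included only as a cross-check.

```python
from statistics import median

# Daily trial signups from the 13-day sample above.
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

def median_by_hand(values):
    """Sort the observations, then take the middle one
    (or average the two middle ones when the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

q2 = median_by_hand(signups)
print(q2)               # 63
print(median(signups))  # 63, the standard library agrees
```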
Calculate the Lower Quartile
Similar to the median (Q2), the lower quartile (Q1) is the middle observation of the lower half of the sample. With an even number of days (6) below the median, we’ll have to average days 3 and 4 (49 and 52 respectively). That makes our lower quartile (Q1) 50.5.
Q1 = 50.5
Calculate the Upper Quartile
Following the same steps for the six days above the median, days 10 and 11 will have to be averaged (71 and 72 respectively). This gives us 71.5 for Q3.
Q3 = 71.5
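Here is a short Python sketch of the quartile steps, assuming the same median-of-halves approach used above, where the median itself is left out of both halves when the sample size is odd. The function name `quartiles` is hypothetical. Other quartile conventions exist and can give slightly different values, so treat this as an illustration of this article's method only.

```python
from statistics import median

# Daily trial signups from the 13-day sample above.
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of observations, the median is excluded
    from both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(signups)
print(q1, q2, q3)  # 50.5 63 71.5
```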
Calculate the Interquartile Range
The idea behind the interquartile range is that once you know the distance between Q1 and Q3 (71.5 - 50.5 = 21 in this example), you can quickly identify boundaries known as ‘fences’ to sieve for statistical outliers. Observations that fall outside the inner fence but inside the outer fence are known as minor statistical outliers, while observations that fall outside the outer fence are known as major statistical outliers.
Interquartile range: 21
There are two sets of fences: the inner fence and the outer fence. To calculate the inner fence, we multiply the interquartile range by 1.5, then subtract the result from Q1 for the lower bound and add it to Q3 for the upper bound. To calculate the outer fence, we follow the same steps, but multiply by 3.
21 x 1.5 = 31.5
Q1-31.5 = 19, Q3+31.5 = 103
Inner fence = 19 to 103
21 x 3 = 63
Q1-63 = -12.5, Q3+63 = 134.5
Outer fence = -12.5 to 134.5
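The fence arithmetic can be sketched in Python as follows. The function name `fences` is made up for illustration, and the Q1 and Q3 values are the ones calculated above.

```python
def fences(q1, q3):
    """Return the interquartile range plus the inner and outer fences.

    Inner fence: 1.5 x IQR below Q1 and above Q3.
    Outer fence: 3 x IQR below Q1 and above Q3.
    """
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return iqr, inner, outer

iqr, inner, outer = fences(50.5, 71.5)
print(iqr)    # 21.0
print(inner)  # (19.0, 103.0)
print(outer)  # (-12.5, 134.5)
```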
Now that we have our inner and outer fences, we can clearly see that the lowest of our observations, Day 1 with 32 signups, is well within the inner fence, and not considered an outlier. However, at our high end, Day 13 with 139 signups is well outside the inner fence and also outside the outer fence. This makes Day 13 a major outlier.
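To apply the same check to every observation at once, a small script along these lines would do. It is only a sketch: the fence values are hard-coded from the calculation above, and the `classify` helper and its labels are hypothetical names chosen for this example.

```python
# Daily trial signups from the 13-day sample, in sorted order.
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

# Fences taken from the worked example above.
INNER = (19.0, 103.0)
OUTER = (-12.5, 134.5)

def classify(value, inner=INNER, outer=OUTER):
    """Label a value as a major outlier, a minor outlier, or neither."""
    if value < outer[0] or value > outer[1]:
        return "major outlier"
    if value < inner[0] or value > inner[1]:
        return "minor outlier"
    return "not an outlier"

for day, value in enumerate(signups, start=1):
    label = classify(value)
    if label != "not an outlier":
        print(f"Day {day}: {value} signups is a {label}")
# prints: Day 13: 139 signups is a major outlier
```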
You’ve Identified the Statistical Outliers – Now What?
This is where a very objective process begins to take on a more subjective feel. Even though you’ve clearly labeled the observations that are statistical outliers within the data set, whether to omit an observation isn’t a black-and-white issue, especially since removing data may be viewed as a form of data tampering.
Here are several key considerations:
- Was the outlier caused by error, such as human error, process error, or calculation error? If an inaccuracy is to blame, omission is generally a good idea. For instance, if a data entry mistake inflated the number of trial signups, it would make sense to exclude that entry. If the value is genuine, it may provide valuable insight, and including it may prove important.
- Will the outlier’s inclusion skew the average? If so, it should probably be removed. If not, removing the outlier may be less crucial to forming an accurate picture. For instance, in a dataset where most values cluster tightly around a median, a significant outlier can distort the average, leading to misleading conclusions.
- Consider the context of the outlier. It’s essential to assess whether the outlier represents a significant shift in your target audience’s behavior or if it’s merely an anomaly. Sometimes, outliers can indicate emerging trends or changes in customer preferences, which can be valuable information for your marketing strategy.
- Explore potential reasons for the outlier. Delve into the data to understand what caused the outlier. Was it a specific event that drove extra signups, like a promotional campaign? Understanding the “why” behind the data outlier can offer actionable insights that inform future decisions.
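The second consideration, whether the outlier skews the average, can be checked directly by comparing the mean and median with and without the suspect observation. The sketch below uses the 13-day signup sample from earlier. The variable names are illustrative only.

```python
from statistics import mean, median

# Daily trial signups from the 13-day sample, in sorted order.
signups = [32, 45, 49, 52, 59, 62, 63, 67, 68, 71, 72, 74, 139]

# The same sample with the Day 13 value (139) left out.
without_day_13 = signups[:-1]

print(round(mean(signups), 1), median(signups))                # 65.6 63
print(round(mean(without_day_13), 1), median(without_day_13))  # 59.5 62.5
```

In this sample, dropping the one outlier moves the mean by about six signups per day, while the median moves by only half a signup. That is the kind of gap that suggests the outlier is distorting the average.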
There are several other methods for identifying statistical outliers, such as Chauvenet’s criterion and Grubbs’ test. These methods offer more sophisticated approaches to identifying outliers based on statistical properties of the dataset.
This is certainly not the only way to calculate an outlier, but if you need a simple and fast equation to determine an outlier with regard to the median and quartiles, the method outlined here will serve you well. Remember, understanding data outliers is essential for maintaining the integrity of your survey data analysis.
Conclusion
Identifying statistical outliers in your survey data is crucial for ensuring the quality and accuracy of your data analysis. By following the steps outlined above, you can effectively pinpoint these outliers, understand their implications, and make informed decisions about whether to include or exclude them from your analysis. In a landscape where data-driven decisions reign supreme, having clean, accurate data is the bedrock of successful marketing strategies.
By embracing a thorough data cleaning process and understanding the nuances of statistical outliers, you can transform your survey data into actionable insights that drive your business forward.