When you work with data in Excel, you’ll often have to deal with outliers in your data set.
Outliers are common in all kinds of data, and it’s important to identify and treat them so that your analysis is correct and meaningful.
In this tutorial, I’ll show you how to find outliers in Excel, along with some of the techniques I have used in my own work to handle them.
But what exactly are outliers, and why is it important to find and handle them correctly?
In this article, we’ll cover everything you need to know about calculating outliers in Excel, including:
- What are outliers and why do they matter?
- 3 methods to identify outliers
- How to handle outliers properly
- Pros and cons of each technique
Plus plenty of easy-to-follow examples and Excel tips along the way.
Let’s dive in!
What Are Outliers and Why Do They Matter?
An outlier is a data point that is significantly different from the rest of the data set. Outliers fall outside the overall pattern of a distribution.
Outliers can occur for many reasons. Sometimes they happen naturally (e.g. one unusually tall person in a group). Other times, they indicate experimental errors or faulty data.
Either way, it’s important to find outliers because they can skew results. Just one or two outliers can drastically impact measures like the mean and standard deviation.
Let me give you an example. Imagine we have data on the income of 30 people traveling on a bus. Most are in a similar income group, making around $70,000 per year.
But then Bill Gates hops on the bus! His income alone would pull the group’s average up by orders of magnitude, far beyond what anyone else on the bus earns. That is clearly not an accurate representation if we’re interested in the “typical” income.
In short, outliers matter because they can:
- Distort statistics such as the mean and standard deviation
- Lead to incorrect conclusions from the data
- Point to potential errors or anomalies worth investigating
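To see the bus example in numbers, here is a minimal Python sketch. The figures are made up for illustration (30 passengers at exactly $70,000 and one extreme earner at $1 billion); the point is only how differently the mean and the median react.

```python
from statistics import mean, median

# Hypothetical figures: 30 passengers earning $70,000 a year each.
incomes = [70_000] * 30

# One extreme earner boards the bus (the $1 billion figure is invented).
with_outlier = incomes + [1_000_000_000]

# The mean jumps by orders of magnitude; the median does not move.
print(f"Mean before:   {mean(incomes):,.0f}")
print(f"Mean after:    {mean(with_outlier):,.0f}")
print(f"Median before: {median(incomes):,.0f}")
print(f"Median after:  {median(with_outlier):,.0f}")
```

In Excel, the same comparison would be AVERAGE versus MEDIAN on the income column.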
The key is identifying outliers accurately, then deciding how to handle them appropriately.
Now let’s look at 3 methods for finding outliers in Excel.
Method 1: Sorting to Eyeball Outliers
With small datasets, a quick way to spot outliers is by sorting. Just organize your data from largest to smallest or vice versa. Then scan the top and bottom values to see if any stand out.
Let’s walk through an example. Say column B holds the call durations (in seconds) for 15 customer service calls.
To sort:
- Select the column header
- Go to Home > Sort & Filter > Custom Sort
- Sort by duration, largest to smallest
Scanning the sorted data, we can see that the top 2 durations are significantly higher than the rest, and the bottom 2 are far lower. So we’ve quickly identified 4 potential outliers just by sorting and eyeballing.
This works well for smaller datasets, where you can manually inspect for outliers. But for large datasets, we need more powerful techniques.
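The sort-and-scan idea can be sketched in a few lines of Python. The 15 call durations below are invented for illustration and are not the tutorial’s data.

```python
# Made-up call durations in seconds (15 calls), for illustration only.
durations = [310, 295, 1240, 330, 18, 305, 290, 1105,
             320, 300, 25, 315, 285, 325, 298]

# Sort largest to smallest, as in the Custom Sort step.
ranked = sorted(durations, reverse=True)

# Look at both ends of the sorted list for values that stand apart.
top3 = ranked[:3]
bottom3 = ranked[-3:]

print("Largest: ", top3)
print("Smallest:", bottom3)
```

Here the two largest and two smallest values sit far from the rest, which is exactly what you would spot by eye in the sorted column.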
Method 2: Finding Outliers with Quartiles
A more robust statistical approach uses quartiles. Quartiles divide your sorted dataset into 4 equal groups, each containing 25% of the data points.
The 1st quartile (Q1) marks the 25th percentile. The 3rd quartile (Q3) marks the 75th percentile. The difference between them, Q3 - Q1, is the interquartile range (IQR).
Values that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are potential outliers. Let’s see how to calculate this in Excel.
For this example, assume the data is in cells B2:B15.
Follow these steps:
- Calculate Q1 using QUARTILE.INC: =QUARTILE.INC($B$2:$B$15,1)
- Calculate Q3: =QUARTILE.INC($B$2:$B$15,3)
- Calculate the IQR by subtracting Q1 from Q3
- Calculate the lower limit: Q1 - 1.5*IQR
- Calculate the upper limit: Q3 + 1.5*IQR
- Flag outliers: any values outside the limits
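If you want to cross-check the Excel result outside the workbook, the steps above can be sketched in Python. This is a minimal sketch, not the tutorial’s own workbook: the sample values are made up, the function names are hypothetical, and it relies on `statistics.quantiles` with `method="inclusive"`, which should interpolate the same way as QUARTILE.INC.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    # n=4 yields the three quartile cut points; the middle one is the median.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR limits."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Made-up sample of 14 values (the same count as B2:B15).
sample = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14, 16, 13, 95, 1]

print(iqr_fences(sample))     # lower and upper limits
print(flag_outliers(sample))  # values outside the limits
```

For this sample, Q1 is 13 and Q3 is 15, so the limits are 10 and 18, and the values 95 and 1 are flagged.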
This gives us an objective way to identify potential outliers on both ends of the distribution. The quartile method is more robust than sorting and eyeballing, but the 1.5 multiplier is a convention rather than a hard rule, so treat flagged values as candidates rather than confirmed errors. That brings us to our 3rd method.
Method 3: Finding Outliers with Z-Scores
Z-scores provide another standardized, statistical way to calculate outliers.
A z-score measures how many standard deviations a data point is from the mean. The formula is:
Z-score = (x - μ) / σ
Where:
- x is a data point
- μ is the mean
- σ is the standard deviation
The further a z-score is from 0, the more likely the value is an outlier. Common thresholds are ±2 or ±3 standard deviations from the mean.
Let’s see an example using a dataset of exam scores.
To calculate outliers:
- Calculate the mean using =AVERAGE()
- Calculate the standard deviation with =STDEV.S()
- Calculate the z-score for each data point: (value - mean) / standard deviation
- Flag potential outliers based on z-score thresholds (e.g. ±3)
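The same steps can be sketched in Python as a sanity check. The exam scores are invented and the function names are hypothetical; `statistics.stdev` is the sample standard deviation, which corresponds to STDEV.S.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like STDEV.S
    return [(v - mu) / sigma for v in values]

def flag_by_z(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up exam scores with one suspiciously low value.
scores = [72, 75, 78, 80, 81, 83, 85, 86, 88, 90, 15]

print(flag_by_z(scores, threshold=2.0))
print(flag_by_z(scores, threshold=3.0))
```

With these 11 scores, the value 15 has a z-score of about -2.9, so it is flagged at a cutoff of 2 but not at 3. In small samples a ±3 cutoff can be very hard to reach, which is worth keeping in mind when choosing a threshold.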
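The z-score method works best for roughly normally distributed data. It accounts for the spread of the data by using the standard deviation. One limitation is that the cutoff for what counts as an outlier is somewhat arbitrary. Another is that the mean and standard deviation are themselves pulled by extreme values, so a very large outlier can partly mask itself.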
Now that we’ve covered 3 ways to calculate outliers, let’s discuss how to handle them properly.
How to Handle Outliers
Simply identifying outliers isn’t enough. We need to determine why they exist and deal with them appropriately. Here are some common approaches:
Delete Outliers
If you confirm that an outlier is due to an error in data collection, you can simply remove it from your working dataset. Make sure to keep a copy of the original data intact!
Only delete when the dataset is large enough that removing a few points won’t change the analysis.
Keep But Ignore Outliers
You may decide to keep outliers in the dataset, but ignore them in your analysis by:
- Calculating statistics like means/standard deviations with outliers excluded
- Using non-parametric tests that are less sensitive to outliers
- Using median instead of mean
Transform the Outliers
Winsorizing replaces outliers with the closest “non-outlier” values in the dataset. For example, you might replace an exam score of 200 with the next highest score of 98.
Capping replaces outliers with a specified limit or cap. For example, you could set any value above Q3 + 1.5*IQR to exactly that limit, and any value below Q1 - 1.5*IQR to that lower limit.
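To make the difference between the two transformations concrete, here is a minimal Python sketch. It follows the descriptions above, using the IQR limits to decide what counts as an outlier; the scores are made up and the function names are hypothetical.

```python
from statistics import quantiles

def fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(values, k=1.5):
    """Replace each outlier with the nearest value that is not an outlier."""
    lower, upper = fences(values, k)
    inside = [v for v in values if lower <= v <= upper]
    lo, hi = min(inside), max(inside)
    return [min(max(v, lo), hi) for v in values]

def cap_to_fences(values, k=1.5):
    """Replace each outlier with the limit it crossed."""
    lower, upper = fences(values, k)
    return [min(max(v, lower), upper) for v in values]

# Made-up exam scores with one impossible value of 200.
scores = [88, 92, 95, 90, 98, 91, 94, 200]

print(winsorize(scores))      # 200 becomes 98, the highest non-outlier
print(cap_to_fences(scores))  # 200 becomes the upper limit
```

For these scores the upper limit works out to 103.25, so winsorizing turns 200 into 98 while capping turns it into 103.25. Either way, apply the change to a copy so the original values stay available.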
Investigate the Cause
Instead of blindly removing or transforming outliers, it’s worth investigating their cause. They could uncover important findings like flaws in data collection or rare events worth exploring further.
Each approach comes with trade-offs:
- Deleting outliers removes their influence but risks losing potentially useful data
- Keeping outliers preserves the data but requires adjustments during analysis
- Transforming outliers reduces distortion but still modifies the original data
- Investigating outliers takes more effort but can reveal key insights
The best approach depends on the context and goals of your analysis.
Now let’s recap the key steps for identifying and handling outliers in Excel.
In Summary: How to Calculate and Handle Outliers
Here’s a quick checklist for dealing with outliers in your Excel data analysis:
- Sort your data and eyeball potential outliers
- Calculate quartiles and the interquartile range to identify outliers
- Compute z-scores to reveal outliers based on standard deviations
- Investigate the source and cause of each outlier
- Decide whether to delete, keep, transform or further analyze outliers
- Adjust your analysis methods as needed to handle outliers (e.g. use median instead of mean)
- Clearly document your outlier calculation, assumptions and handling approach
Handling outliers takes thought and care. But identifying these potentially problematic data points is a key part of robust Excel data analysis.
Finding Outliers Using the Quartile Functions
Now let’s look more closely at a more scientific approach that can help you identify whether there are any outliers.
In statistics, a quartile is one-fourth of the data set. For example, if you have 12 data points sorted from smallest to largest, the first quartile would be the bottom three data points, the second quartile would be the next three, and so on.
Suppose the data set where I want to find the outliers is in cells B2:B15. To do this, I will calculate the 1st and the 3rd quartile, and then use them to calculate the upper and the lower limit.
Below is the formula to calculate the first quartile in cell E2:
=QUARTILE.INC($B$2:$B$15,1)
And here is the one to calculate the third quartile in cell E3:
=QUARTILE.INC($B$2:$B$15,3)
Now, I can use these two values to get the interquartile range (the range that spans the middle 50% of the data, between the 1st and the 3rd quartile):
=E3-E2
Next, we use the interquartile range to find the lower and upper limits, which contain most of our data.
Anything that falls outside these limits is then considered an outlier.
Below is the formula to calculate the lower limit:
Q1 - 1.5*IQR
which, written with the quartile cells from our example, becomes:
=E2-1.5*(E3-E2)
And the formula to calculate the upper limit is:
Q3 + 1.5*IQR
which, written with the same cells, becomes:
=E3+1.5*(E3-E2)
Now that we have the upper and lower limits for our data set, we can go back to the original data and quickly identify the values that do not lie in this range.
A quick way to do this is to check every value and return TRUE or FALSE in a new column.
An OR formula along the following lines, entered next to the first value (B2) and copied down, returns TRUE for the values that are outliers:
=OR(B2<$E$2-1.5*($E$3-$E$2),B2>$E$3+1.5*($E$3-$E$2))
Now you can filter the Outlier column and show only the records where the value is TRUE.
Alternatively, you can use conditional formatting to highlight all the cells where the value is TRUE.
Find Outliers by Sorting the Data
With small datasets, a quick way to identify outliers is to simply sort the data and manually go through some of the values at the top of this sorted data.
And since there could be outliers in both directions, make sure you first sort the data in ascending order and then in descending order and then go through the top values.
Let me show you an example.
In this example, I have a dataset of call durations (in seconds) for 15 customer service calls.
Below are the steps to sort this data so that we can identify the outliers in the dataset:
- Select the Column Header of the column you want to sort (cell B1 in this example)
- Click the Home tab
- In the Editing group, click on the Sort & Filter icon.
- Click on Custom Sort
- In the Sort dialog box, select ‘Duration’ in the Sort by drop-down and ‘Largest to Smallest’ in the Order drop-down
- Click OK
The above steps would sort the call duration column with the highest values at the top. Now you can manually scan the data and see if there are any outliers.
In our example, I can see that the first two values are way higher than the rest of the values (and the bottom two are way lower).
Note: This method works with small datasets where you can manually scan the data. It’s not a scientific method, but it works well for a quick check.
Frequently Asked Questions
How do you find outliers in a data set?
To find the outliers in a data set, follow these steps: calculate the 1st and 3rd quartiles, compute the interquartile range, work out the upper and lower bounds of the data range, and then use these bounds to identify the outlying data points.
How to analyze outliers in Excel?
To begin analyzing outliers in Excel, you first need to identify them within your dataset. One common method uses the interquartile range (IQR). Once you have calculated the IQR, you can flag data points that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as potential outliers.