Data Cleaning - Dealing with Outliers#
An important first step in data science is cleaning up your data. This includes identifying and removing data points that may be due to errors in data collection, or other factors that could lead us to make erroneous conclusions about the data. This is not the same as “cooking” your data — modifying it specifically to get the results you want. Data cleaning involves a set of methods that are specifically designed to identify and remove data points that are objectively anomalous — that is, you have objective reasons for removing them. Data cleaning steps should always be documented when you write up a study, and should be included in the code you share for analyses so that your procedures are transparent and reproducible.
Outlier Identification and Removal#
Outliers are data points that are “significantly different” from the majority of data points in a data set. “Significantly different” is in quotes here, because this doesn’t necessarily mean we use statistical testing to define outliers, and also that there several different definitions of “significant” that can be used to identify outliers.
Tukey’s method#
John Tukey, who was a driving force in the development of exploratory data analysis, defined outliers as points more than \(1.5 \times IQR\) above the third quartile (Q3) or below the first quartile (Q1). Recall that IQR is the inter-quartile range, which captures the middle 50% of the data points in a data set. In a box plot (which Tukey invented), the IQR is the “box”, and 1.5 IQR are the extent of the “whiskers” in each direction from Q1 and Q3. We noted earlier that in box plots, outliers are represented as individual points in the plot, beyond the whiskers. Since Tukey invented box plots and this definition of outliers, it’s not surprising that they are identical. In general, outliers defined in this way will comprise approximately 1% of the data, if the data are approximately normally distributed.
Image from Jhguch and shared under the Creative Commons Attribution-Share Alike 2.5 Generic license, via Wikimedia Commons
Standard scores#
An alternative method is to use standardized scores. This is done by applying the z transform to the data. This transform’s formula is:
Where \(x\) is an individual data point, \(\mu\) is the mean of the data set, and \(\sigma\) is the standard deviation. In other words, the z transform takes a data set and transforms it so that its mean is zero (by subtracting the mean from every value), and its standard deviation is 1.
Put another way, after we apply a z transform to a data set, the numerical values in the data set are not in their original units (be that milliseconds, radians, dollars, or whatever) but in units of standard deviations. So in a z transformed data set, a data point with a value of 1 would have had an original value that was 1 standard deviation from the mean. This is where the “standard” in “standard scores” comes from.
Applying a z transform is also called normalizing the data set, because the z transform assumes that the data come from a normal distribution (also known as a Gaussian distribution, or the bell curve). The nice thing about standardizing a data set is that the values have fairly straightforward interpretations. In a normal distribution:
68.27% of all values fall within ±1 standard deviation of the mean (\(1 \sigma\)),
95.45% of all values fall within ±2 standard deviations of the mean (\(2 \sigma\)), and
99.73% of all values fall within ±3 standard deviations of the mean (\(3 \sigma\))
Image from M. W. Toews, and shared under the Creative Commons Attribution-Share Alike 2.5 Generic license, via Wikimedia Commons
This means that if we define outliers as those values z > 3 or z < -3, then we will tend to reject about 0.25% of a data set. Sometimes, \(z \pm 2.5 \sigma\) is used instead. This will tend to reject more data points (typically around 2%).
Identifying outliers#
First, we’ll load in the same Flanker data set we used in the [Repeated Measures] lesson:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('data/flanker_rt_data.csv')
Visualizing the distribution#
Let’s generate a box plot of the entire data set, organized by condition. This is the same as the first box plot in the Repeated Measures lesson.
sns.catplot(kind='box',
data=df,
x='flankers', y='rt')
plt.show()
Let’s also look at the distribution of values with a kernel density estimate:
sns.displot(kind='kde',
data=df,
x='rt', hue='flankers')
plt.show()
Idenfity outliers in a DataFrame#
To define values based on the IQR, we first need to calculate the IQR. pandas doesn’t have a method for this specifically, but we can use the pandas .quantile()
method with the argument 0.25 to reference the lower end of the IQR (the 0.25 quantile means the point below which 25% of data values lie), and 0.75 for the upper end of the IQR. Since IQR is defined as the range between the 25th and 75th quantiles, we can compute it by subtracting the difference between the two:
q1 = df['rt'].quantile(0.25)
q3 = df['rt'].quantile(0.75)
iqr = q3 - q1
The variable iqr
represents the size of the box in the box plot of the data. Recall that our definition of outliers is \(1.5 \times IQR\) — the length of the whiskers in the box plot.
We can now define the threshold for values we will call “outliers” (i.e., those data that extend beyond the length of the whiskers) based on iqr
:
out_low = q1 - 1.5 * iqr
out_high = q3 + 1.5 * iqr
So out_low
(the end of the bottom whisker) is \(1.5 \times IQR\) below q1 (the lower end of the box), and out_high
is \(1.5 \times IQR\) above q3.
Recall from the [pandas] lesson that we can select values in a DataFrame using a Boolean mask. We can do that here to identify the rows of the DataFrame where the RT values are lower than \(1.5 \times IQR\):
df.loc[:, 'rt'] < out_low
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 False
Name: rt, Length: 4289, dtype: bool
The result will have values of True
for those data points that are “low” outliers, and False
for all other data points. We can count the number of these by passing the mask to np.sum()
(since True
= 1
and False
= 0
)
np.sum(df.loc[:, 'rt'] < out_low)
10
We can do the same to identify outliers at the high end of the range of values, using the >
operator this time:
np.sum(df.loc[:, 'rt'] > out_high)
110
Consistent with what we saw in the box plot, there are more outliers at the high (slow RT) end of the distribution than at the low (fast RT) end.
Marking data points as outliers#
We’ll worry about what to do with outliers later — after we’ve described standard scores. For now, let’s make a column in df
called outliers_tukey
that is a Boolean mask: values of True
in rows with outlier RT values, and False
for all other rows.
To do this, we want to combine the two masking operations above < out_low
and > out_high
. We can do that using the OR |
operator, as long as we include each of our masking operations in parentheses (this last bit is important, but easy to forget!):
(df.loc[:, 'rt'] < out_low) | (df.loc[:, 'rt'] > out_high)
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 True
Name: rt, Length: 4289, dtype: bool
It’s a bit hard to confirm that this works from the output above, but if we pass that whole expression to np.sum()
we can see that the number of all True
values is the same as the total of the out_low
and out_high
values reported above:
np.sum((df.loc[:, 'rt'] < out_low) | (df.loc[:, 'rt'] > out_high))
120
To actually use this expression to create a new column, we just have to assign the result to a new column name in df
:
df['outliers_tukey'] = (df.loc[:, 'rt'] < out_low) | (df.loc[:, 'rt'] > out_high)
df.head()
participant | block | trial | flankers | rt | outliers_tukey | |
---|---|---|---|---|---|---|
0 | s25 | 1 | 1 | congruent | 0.652072 | False |
1 | s25 | 1 | 2 | neutral | 0.644872 | False |
2 | s25 | 1 | 3 | congruent | 0.445182 | False |
3 | s25 | 1 | 4 | neutral | 0.525200 | False |
4 | s25 | 1 | 5 | congruent | 0.533328 | False |
And double-check the number, using the pandas .sum()
method:
df['outliers_tukey'].sum()
120
Standard scores (z transform)#
We can convert scores to standard scores very easily, using the zscore()
function in scipy.stats
:
from scipy import stats
df['rt_z'] = stats.zscore(df['rt'])
df.head()
participant | block | trial | flankers | rt | outliers_tukey | rt_z | |
---|---|---|---|---|---|---|---|
0 | s25 | 1 | 1 | congruent | 0.652072 | False | 1.526800 |
1 | s25 | 1 | 2 | neutral | 0.644872 | False | 1.456483 |
2 | s25 | 1 | 3 | congruent | 0.445182 | False | -0.493736 |
3 | s25 | 1 | 4 | neutral | 0.525200 | False | 0.287737 |
4 | s25 | 1 | 5 | congruent | 0.533328 | False | 0.367119 |
Now we can identify outliers as those rows where rt_z > 2.5
or rt_z < -2.5
. We’ll define a variable z_thresh
because that way we can later change it to a different value (like 3) if we desire.
Tip
This technique of assigning a constant value to a variable is one you should adopt. Few things are more annoying (or prone to human error) than having to go into your code and find all the places where you think you used that value and change each and every instance. By assigning the constant to a variable, you only have to change its value once.
z_thresh = 2.5
(df['rt_z'] > z_thresh) | (df['rt_z'] < -z_thresh)
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 True
Name: rt_z, Length: 4289, dtype: bool
More simply, we can take the absolute value of z (which ignores the -
sign):
np.abs(df['rt_z']) > z_thresh
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 True
Name: rt_z, Length: 4289, dtype: bool
The number of outliers identified this way is:
np.sum(np.abs(df['rt_z']) > z_thresh)
111
You’ll note that there are actually fewer outliers defined this way as opposed to using the \(1.5 \times IQR\) (Tukey) method. Why is this? As noted earlier, Tukey’s threshold and \(z \ge 2.5\) are only approximately the same, so we expect minor differences. However, we noted that the \(1.5 \times IQR\) range is a bit wider than \(z \gt 2.5\), meaning it should capture more data points and fewer values should be identified as outliers by Tukey’s method. Yet here, more are. The reason for this is that both Tukey’s and the standard score methods assume our data are normally distributed, when in fact they are skewed, with more values at the high (slow RT) end than at the low (fast RT) end. Since the data don’t fit the assumptions of the methods we used, we get unexpected results.
Write column of standard score outliers#
We’ll worry about skewness a little later, but for now we will add a column marking outliers as defined this way, similar to what we did above for the Tukey method:
df['outliers_z'] = np.abs(df['rt_z']) > z_thresh
df.sample(6)
participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | |
---|---|---|---|---|---|---|---|---|
3013 | s6 | 5 | 29 | incongruent | 0.617003 | False | 1.184308 | False |
140 | s25 | 5 | 13 | incongruent | 0.581365 | False | 0.836260 | False |
598 | s22 | 4 | 23 | incongruent | 0.409900 | False | -0.838308 | False |
2040 | s12 | 5 | 13 | incongruent | 0.385942 | False | -1.072289 | False |
295 | s18 | 5 | 8 | neutral | 0.454454 | False | -0.403185 | False |
1109 | s24 | 5 | 29 | neutral | 0.536790 | False | 0.400928 | False |
The above .sample()
is not likely to happen to sample outlier data points, but we can confirm their presence with:
df[df['outliers_z'] == True]
participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | |
---|---|---|---|---|---|---|---|---|
26 | s25 | 1 | 27 | incongruent | 0.757437 | True | 2.555820 | True |
37 | s25 | 2 | 6 | neutral | 0.917223 | True | 4.116328 | True |
117 | s25 | 4 | 22 | incongruent | 0.817285 | True | 3.140311 | True |
227 | s18 | 3 | 4 | neutral | 0.806528 | True | 3.035256 | True |
350 | s16 | 1 | 32 | incongruent | 0.762060 | True | 2.600971 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4208 | s15 | 3 | 16 | incongruent | 0.897641 | True | 3.925088 | True |
4231 | s15 | 4 | 7 | congruent | 0.881503 | True | 3.767480 | True |
4262 | s15 | 5 | 6 | incongruent | 0.853477 | True | 3.493770 | True |
4279 | s15 | 5 | 23 | neutral | 0.845636 | True | 3.417194 | True |
4288 | s15 | 5 | 32 | congruent | 0.789597 | True | 2.869903 | True |
111 rows × 8 columns
Why are outliers a problem?#
Outliers are a problem because they have disproportionate leverage on statistical properties of the data, such as the mean and variance. We can see this by creating a small, simulated data set from a normal distribution, then replacing one value with either an extremely different or just slightly different data point. Let’s do this, and compare the means and standard deviations in each case. We use NumPy’s np.random.default_rng().normal()
function to do this:
# Create array with mean=0, standard deviation=1, and 10 data values
data = np.random.default_rng().normal(0, 1, 10)
print("Data:", str(data))
print("Mean of data with no outliers:", str(round(np.mean(data), 5)))
print("Standard deviation of data with no outliers:", str(round(np.std(data), 5)))
Data: [ 1.91100003 0.17137214 -0.39349555 -0.28543587 -0.26501308 1.89003043
-0.9411103 -0.01666448 0.33770206 0.58309185]
Mean of data with no outliers: 0.29915
Standard deviation of data with no outliers: 0.89472
data_not_extreme = np.copy(data)
# replace last value in array with outlier
data_not_extreme[-1] = 3
print("Data:", str(data_not_extreme))
print("Mean of data with minor outlier:", str(round(np.mean(data_not_extreme), 5)))
print("Standard deviation of data with minor outlier:", str(round(np.std(data_not_extreme), 5)))
Data: [ 1.91100003 0.17137214 -0.39349555 -0.28543587 -0.26501308 1.89003043
-0.9411103 -0.01666448 0.33770206 3. ]
Mean of data with minor outlier: 0.54084
Standard deviation of data with minor outlier: 1.20975
data_extreme = np.copy(data)
# replace last value in array with outlier
data_extreme[-1] = 10
print("Data:", str(data_extreme))
print("Mean of data with extreme outlier:", str(round(np.mean(data_extreme), 5)))
print("Standard deviation of data with extreme outlier:", str(round(np.std(data_extreme), 5)))
Data: [ 1.91100003 0.17137214 -0.39349555 -0.28543587 -0.26501308 1.89003043
-0.9411103 -0.01666448 0.33770206 10. ]
Mean of data with extreme outlier: 1.24084
Standard deviation of data with extreme outlier: 3.05227
From the above we can see that a larger outlier has a larger impact on both the mean and variance of the data. Also remember, when looking at the information above, that the true population mean of the data (as specified in the arguments to the random number generator) is 0 and the true standard deviation is 1. Using this normal distribution makes it easy to see that outliers bias our statistics away from the true population mean and variance.
The impact of outliers is often less extreme — in the above examples, our data set comprised only 10 values, and so making one an outlier meant that 10% of data points were outliers — whereas by the definitions above, we would normally expect only 1–2% of data points to be outliers.
If we re-run the simulations with a data set of 100 values and one outlier, you can see that the impact of even an extreme outlier is much reduced:
# Create array with mean=0, standard deviation=1, and 10 data values
data = np.random.default_rng().normal(0, 1, 100)
print("Data:", str(data))
print("Mean of data with no outliers:", str(round(np.mean(data), 5)))
print("Standard deviation of data with no outliers:", str(round(np.std(data), 5)))
Data: [ 1.32837597 0.56328608 -2.339687 -0.60652073 -0.64891056 -0.27277238
-0.95103654 1.03914128 0.83792084 -0.35941137 0.3160935 0.28874106
-1.28107104 0.52819336 0.10162759 1.16806614 -1.02678795 -0.14535503
0.72357183 -0.13326056 0.57595151 -1.49044684 -0.57508832 -0.99281764
-0.09958412 -0.25178987 0.59542842 -1.49041781 -1.68006343 -0.35217149
1.10107656 1.52984378 -2.23385531 -0.79695084 -0.18556523 -0.6294443
-0.36146888 -0.6945822 -1.86737644 0.36249266 0.9957168 -0.16027802
-1.73786059 1.24128357 -0.38647875 0.65344037 1.0470381 0.63601265
-0.70834178 -0.3189896 1.03957853 0.40938784 0.05409531 -0.36373765
-0.93715383 -1.30072443 0.5441396 1.95568419 -0.27352537 -0.11599894
-0.32905507 -1.25068221 -0.88989643 -0.39049096 2.48742749 2.08343882
0.38968338 -0.13534081 0.03774834 -0.11423551 0.44582902 0.39899617
-0.1292473 0.20832833 -1.64807316 -0.02193188 -0.24140068 0.18740622
1.84322568 -1.11545035 -1.15847791 0.49739761 0.68455642 0.39512782
-0.66238918 0.71030956 -0.84076594 -1.87700564 -0.60068309 1.59858186
0.2650798 -0.73271475 -0.15597434 -1.72768534 -0.4747589 0.21323863
0.07468702 -0.15626044 0.36456704 -0.4191594 ]
Mean of data with no outliers: -0.10319
Standard deviation of data with no outliers: 0.95248
data_extreme = np.copy(data)
# replace last value in array with outlier
data_extreme[-1] = 10
print("Data:", str(data_extreme))
print("Mean of data with extreme outlier:", str(round(np.mean(data_extreme), 5)))
print("Standard deviation of data with extreme outlier:", str(round(np.std(data_extreme), 5)))
Data: [ 1.32837597 0.56328608 -2.339687 -0.60652073 -0.64891056 -0.27277238
-0.95103654 1.03914128 0.83792084 -0.35941137 0.3160935 0.28874106
-1.28107104 0.52819336 0.10162759 1.16806614 -1.02678795 -0.14535503
0.72357183 -0.13326056 0.57595151 -1.49044684 -0.57508832 -0.99281764
-0.09958412 -0.25178987 0.59542842 -1.49041781 -1.68006343 -0.35217149
1.10107656 1.52984378 -2.23385531 -0.79695084 -0.18556523 -0.6294443
-0.36146888 -0.6945822 -1.86737644 0.36249266 0.9957168 -0.16027802
-1.73786059 1.24128357 -0.38647875 0.65344037 1.0470381 0.63601265
-0.70834178 -0.3189896 1.03957853 0.40938784 0.05409531 -0.36373765
-0.93715383 -1.30072443 0.5441396 1.95568419 -0.27352537 -0.11599894
-0.32905507 -1.25068221 -0.88989643 -0.39049096 2.48742749 2.08343882
0.38968338 -0.13534081 0.03774834 -0.11423551 0.44582902 0.39899617
-0.1292473 0.20832833 -1.64807316 -0.02193188 -0.24140068 0.18740622
1.84322568 -1.11545035 -1.15847791 0.49739761 0.68455642 0.39512782
-0.66238918 0.71030956 -0.84076594 -1.87700564 -0.60068309 1.59858186
0.2650798 -0.73271475 -0.15597434 -1.72768534 -0.4747589 0.21323863
0.07468702 -0.15626044 0.36456704 10. ]
Mean of data with extreme outlier: 0.001
Standard deviation of data with extreme outlier: 1.38423
The impact of an individual outlier is thus dependent on the size of the data set, and indeed in many data sets comprising a large number of data points, the impact of one outlier (or even 1% of the data values being outliers) can be quite minimal.
However, the impact of an outlier is also proportionate to its size, relative to the other values in the data set. Here we use 100 as the anomalous value, rather than 10, again in a data set with 100 values:
data_extreme = np.copy(data)
# replace last value in array with outlier
data_extreme[-1] = 100
print("Data:", str(data_extreme))
print("Mean of data with extreme outlier:", str(round(np.mean(data_extreme), 5)))
print("Standard deviation of data with extreme outlier:", str(round(np.std(data_extreme), 5)))
Data: [ 1.32837597e+00 5.63286077e-01 -2.33968700e+00 -6.06520730e-01
-6.48910558e-01 -2.72772384e-01 -9.51036543e-01 1.03914128e+00
8.37920839e-01 -3.59411367e-01 3.16093504e-01 2.88741058e-01
-1.28107104e+00 5.28193365e-01 1.01627594e-01 1.16806614e+00
-1.02678795e+00 -1.45355028e-01 7.23571831e-01 -1.33260556e-01
5.75951506e-01 -1.49044684e+00 -5.75088315e-01 -9.92817638e-01
-9.95841161e-02 -2.51789868e-01 5.95428417e-01 -1.49041781e+00
-1.68006343e+00 -3.52171487e-01 1.10107656e+00 1.52984378e+00
-2.23385531e+00 -7.96950838e-01 -1.85565233e-01 -6.29444301e-01
-3.61468878e-01 -6.94582198e-01 -1.86737644e+00 3.62492658e-01
9.95716799e-01 -1.60278022e-01 -1.73786059e+00 1.24128357e+00
-3.86478749e-01 6.53440366e-01 1.04703810e+00 6.36012654e-01
-7.08341784e-01 -3.18989605e-01 1.03957853e+00 4.09387843e-01
5.40953067e-02 -3.63737652e-01 -9.37153827e-01 -1.30072443e+00
5.44139605e-01 1.95568419e+00 -2.73525367e-01 -1.15998936e-01
-3.29055071e-01 -1.25068221e+00 -8.89896431e-01 -3.90490963e-01
2.48742749e+00 2.08343882e+00 3.89683375e-01 -1.35340814e-01
3.77483355e-02 -1.14235514e-01 4.45829021e-01 3.98996170e-01
-1.29247297e-01 2.08328325e-01 -1.64807316e+00 -2.19318789e-02
-2.41400680e-01 1.87406218e-01 1.84322568e+00 -1.11545035e+00
-1.15847791e+00 4.97397607e-01 6.84556424e-01 3.95127816e-01
-6.62389180e-01 7.10309562e-01 -8.40765944e-01 -1.87700564e+00
-6.00683087e-01 1.59858186e+00 2.65079797e-01 -7.32714753e-01
-1.55974336e-01 -1.72768534e+00 -4.74758898e-01 2.13238631e-01
7.46870243e-02 -1.56260441e-01 3.64567045e-01 1.00000000e+02]
Mean of data with extreme outlier: 0.901
Standard deviation of data with extreme outlier: 10.00521
Now the impact of this 1% of the data is much more extreme. In some data sets, it is unlikely or even impossible to have very extreme outliers. For example, in the Flanker study we’re working with, the lower bound of RT is 0, and the upper bound is 1000 ms, because if a participant didn’t make a response iwthin 1 s of the stimulus, the trial automatically ended without recording a response (Note: This illustrates the value of sometimes including hard limits within your experimental paradigm.). However, in some data sets more extreme values are possible. For example, a technical problem with a neurophysiological recording, such as via surface EMG, single-unit recording electrodes, or force transducer, could result in very extreme values relative to normal physiologically-plausible values.
Which method to use?#
If you look at the image describing the Tukey method above, you can see that the range of the distribution covered by \(1.5 \times IQR\) is just slightly larger than 2.5 \(\sigma\). So, while the Tukey method is roughly equivalent to using a z threshold of ±2.5, the \(1.5 \times IQR\) range is slightly wider than z = 2.5 and may on average identify slightly fewer data points as outliers. However, the result is probably negligible in practice. Indeed, many researchers use z > 3 as their threshold for outliers, which will tend to identify fewer outliers than the Tukey method.
Since box plots are a great — and well-known — way of visualizing data distributions, there is a logic to using the \(1.5 \times IQR\) method to identify outliers since the results will be consistent. However, the approximate equivalence between the Tukey method and z = 2.5 is only true if the data are normally distributed. For skewed data, like RT data, this method may exclude too many outliers. In particular, it will tend to exclude more long RT values than short RT values, even though many of the long RT values are likely valid (especially in a case like the present experiment, where trials were cut off at 1 s, preventing really long RTs).
An empirical study addressing outlier removal with RT data#
Berger and Kiefer (2021) performed a study in which they simulated many samples of reaction times, and introduced either more- or less-extreme outliers into the data. Essentially this was a more sophisticated version of the simulations shown above, mimicing the natural properties of RT data and simulating an experiment with different conditions. Berger and Kiefer compared Tukey’s method, the z transform method using thresholds of both 2 and 3, and several other methods, as well as simply leaving outliers in the data. Since this was simulated data, the authors knew the true values of the mean and standard deviation in each data set, which allowed them to compare each method to the “truth” — something we essentially never have the luxury of knowing in the life sciences.
Berger and Kiefer found that the z transform method was the best choice. With RT data, as noted above, Tukey’s method excluded too many data points on the long-RT side of the distribution. This had two negative impacts: firstly, it made the data appear more normally distributed than it really was, and secondly, it biased the statistical analyses towards false positives (Type 1 errors) — finding statistically significant differences between conditions when they were not present in the “true” data. Berger and Kiefer also observed that not rejecting outliers was the worst choice. The outliers led to a high rate of false negative results (Type 2 errors), meaning that true differences between conditions were not found to be statistically significant, due to the presence of outliers.
The z transform was not perfect (and we cannot expect any method to be), but it was more balanced in not rejecting disproportionately more long RTs than short RTs, rejecting fewer data points overall, and biasing the statistical analyses the least. Since in general, with experimental data we want to only exclude truly anomalous data points, a good rule-of-thumb would be to use a z threshold of 3 for outlier removal. Using a threshold of 2 or 2.5 will naturally exclude more data points, increasing the likelihood that valid data are excluded.
It’s also important to note that trying different outlier removal methods on a real data set, or even different z thresholds to define outliers, can introduce a component of researcher bias (or researcher degrees of freedom) into the analysis. Once you start doing this, the temptation may be to choose a method based on what “looks right” — which is, ultimately, what the researcher thinks the data should look like, not the true nature of the data. For this reason, parameters such as outlier definition method and threshold should ideally be chosen prior to performing the analyses. Of course, if you know for a fact that certain trials are invalid and, thus, outliers (e.g., technical issue during specific trials), that supersedes any of this discussion of exact methods to choose between. As is always the case, there is no single right answer, and so transparency regarding your methods is key.
What to do with outliers#
We have three basic options for outliers:
Ignore them — just include them in the data and all analyses
Remove them completely from the DataFrame
Replace them with non-outlier values
1. Ignore them#
Some scientists fundamentally do not believe in rejecting outliers. In working with data from human or animal participants, one can argue that any behavioural response is possibly valid, and outliers may represent interesting data points for any number of reasons (e.g., an exceptionally slow RT may mean the person spent a lot of time thinking about the stimulus and trying to decide on a response).
Another reason to consider not removing outliers (or using some other definition of outliers) is if the data are not actually normally distributed. Indeed, the RT data we have been working with are not normally distributed, they are skewed - an issue which we will address below. Skew is a relatively minor violation of normality. More generally, although the normal distribution is very common in nature, not all data are normally distributed, and so it is worth examining the distribution of your data (one reason we spend so much time looking at box plots, histograms, and other visualizations of distributions).
On the other hand, outliers can arise for reasons that really aren’t of experimental interest (McCormack & Wright, 1964). For instance, in the present data set we have a few impossibly short RTs. In the box plot of the data, we can see 2–3 data points in each flanker condition below 100–200 ms. It is nearly impossible for a typical human to respond to a stimulus in less than this amount of time, particularly when some decision (like which way an arrow is pointing) is required first (Luce, 1991; Whelan, 2008). So rather than being scientifically interesting, these outliers probably reflect trials when the participant accidentally pressed a response button. Likewise, exceptionally long RTs could reflect trials on which the participant thought they pressed the button then realized they didn’t, trials where the participant got distracted, or sneezed and didn’t see the stimulus, etc.. As noted above, Berger and Kiefer (2021) found that not rejecting outliers decreased the likelihood of finding statistically significant differences, when those differences were actually present in the data.
My general advice is to not ignore outliers, unless you have strong, principled reasons as to why outliers might be interesting (or not actually outliers, if the distribution is not normal) and ideas of how you should interpret them.
2. Remove them from the data set#
Removing data points — known as trimming or truncation — is acceptable in some cases, but problematic in others. For repeated measures data, it’s often fine to remove outliers at the individual participant level, since we have many other samples of the measurement (i.e., the repeated measures, such as trials). On the other hand, if we only have one measurement for a particular “unit” of the experimental design (e.g., one measurement from each participant, in each condition), then we might not want to completely lose that data point. This is especially true for data that will be analyzed statistically with a t test or ANOVA, because these methods cannot handle missing values.
3. Replace them with non-outlier values#
Another approach to dealing with outliers is to replace them with a less-outlying value. This could be the mean value, or the most extreme value that is not considered an outlier. Replacing an outlier with the mean value basically “neutralizes” that data point, since it is equivalent to the mean it doesn’t alter the mean of the data points. However, it does artificially reduce the variance in the data a bit, because variance is, by definition, the amount by which any data point differs from the mean. But if you are only replacing 1–2% of your data values, the impact on variance should be fairly small. Also note that, properly, the mean that is used as the replacement value should be computed only from all of the non-outlier data values. This is because a major reason to replace or remove outliers in the first place is that they have disproportionate influence on the mean.
Replacing the outlier with the most extreme non-outlier value will preserve variance contributed by that data point, but reduce the variance relative to retaining the outlier value. Likewise, the impact on the mean will not be neutral, but it will be rather small. Note that an assumption of replacing the outlier value with another, relatively extreme (relative to the mean), value is that the outlier was actually a valid measurement, perhaps with some noise in it. Since we don’t know for sure what the true value of the data was, in principle it seems better to replace outliers with the mean.
Replacing outliers with mean values#
Based on the information above, we will approach outliers in the Flanker data set by applying a z transform, defining outliers as those values with \(\lvert z \lvert \gt 3\), and replace these values with the mean (where the mean is calculated excluding the outlier values).
First, as we did above, we’ll define z_thresh
as 3 and fill the outliers_z
column with True
where \(\lvert z \lvert \gt 3\) and False
for all other rows.
z_thresh = 3
df['outliers_z'] = np.abs(df['rt_z']) > z_thresh
Next we’ll create a column containing the mean RT. To exclude the outlier values in computing this, we select only the rows where outliers_z
is False:
df['rt_mean'] = df.loc[df['outliers_z'] == False, 'rt'].mean()
df.head()
participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | rt_mean | |
---|---|---|---|---|---|---|---|---|---|
0 | s25 | 1 | 1 | congruent | 0.652072 | False | 1.526800 | False | 0.49191 |
1 | s25 | 1 | 2 | neutral | 0.644872 | False | 1.456483 | False | 0.49191 |
2 | s25 | 1 | 3 | congruent | 0.445182 | False | -0.493736 | False | 0.49191 |
3 | s25 | 1 | 4 | neutral | 0.525200 | False | 0.287737 | False | 0.49191 |
4 | s25 | 1 | 5 | congruent | 0.533328 | False | 0.367119 | False | 0.49191 |
Finally, we fill a new column, rt_clean
, with either the original RT value (for non-outliers), or the mean RT (for outliers). The way that we create a new pandas column, contingent on the values of another column, is using the np.where()
function. This is essentially a way of running an if-else
conditional statement on each row in a pandas column. np.where()
takes three arguments, which are (in order), the conditional statement, the value to fill in the new column if the conditional is true, and finally the value to fill in if the conditional is false.
So the code below tests, for each row, whether outliers_z
is True
. If it is (i.e., an outlier), the mean RT is filled in; otherwise the original RT value for that trial is copied to the clean_rt
column. The three arguments to np.where()
are each shown on their own line, to make it a little more readable:
df['rt_clean'] = np.where(df['outliers_z'] == True,
df['rt_mean'],
df['rt'])
df.head()
participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|---|---|
0 | s25 | 1 | 1 | congruent | 0.652072 | False | 1.526800 | False | 0.49191 | 0.652072 |
1 | s25 | 1 | 2 | neutral | 0.644872 | False | 1.456483 | False | 0.49191 | 0.644872 |
2 | s25 | 1 | 3 | congruent | 0.445182 | False | -0.493736 | False | 0.49191 | 0.445182 |
3 | s25 | 1 | 4 | neutral | 0.525200 | False | 0.287737 | False | 0.49191 | 0.525200 |
4 | s25 | 1 | 5 | congruent | 0.533328 | False | 0.367119 | False | 0.49191 | 0.533328 |
Although the first few rows in this DataFrame were not identified as outliers, we can confirm that this procedure worked by filtering the DataFrame to only show us rows where outliers_z
is True
:
df[df['outliers_z'] == True]
participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|---|---|
37 | s25 | 2 | 6 | neutral | 0.917223 | True | 4.116328 | True | 0.49191 | 0.49191 |
117 | s25 | 4 | 22 | incongruent | 0.817285 | True | 3.140311 | True | 0.49191 | 0.49191 |
227 | s18 | 3 | 4 | neutral | 0.806528 | True | 3.035256 | True | 0.49191 | 0.49191 |
518 | s22 | 2 | 7 | incongruent | 0.817863 | True | 3.145955 | True | 0.49191 | 0.49191 |
841 | s8 | 2 | 11 | congruent | 0.056265 | True | -4.291995 | True | 0.49191 | 0.49191 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4195 | s15 | 3 | 3 | congruent | 0.853705 | True | 3.495996 | True | 0.49191 | 0.49191 |
4208 | s15 | 3 | 16 | incongruent | 0.897641 | True | 3.925088 | True | 0.49191 | 0.49191 |
4231 | s15 | 4 | 7 | congruent | 0.881503 | True | 3.767480 | True | 0.49191 | 0.49191 |
4262 | s15 | 5 | 6 | incongruent | 0.853477 | True | 3.493770 | True | 0.49191 | 0.49191 |
4279 | s15 | 5 | 23 | neutral | 0.845636 | True | 3.417194 | True | 0.49191 | 0.49191 |
66 rows × 10 columns
Reporting outlier rejections#
In any report describing your data, you should state:
the method you used to identify outliers
how you dealt with outliers
what proportion of the data were identified as outliers, and thus treated
To get this last value, we can simply compute the number of True
values in outliers_z
, divided by the total number of trials (rows):
outliers = df['outliers_z'].sum() / df['outliers_z'].count()
# Convert proportion to percent, and round to 2 decimal places
print(str(round(outliers * 100, 2)), "% of the data identified as outliers and replaced with mean value.")
1.54 % of the data identified as outliers and replaced with mean value.
What I’ve done in my own work is to run separate analyses, one on the data with outliers and one on the data without outliers. The inferences you can make based on both of the analyses should be the same. In other words, removing outliers appropriately should not qualitatively change your statistical inferences and how you interpret the data. If that were the case, that would be an issue, as it would suggest your findings are quite fragile.
Working with nested data#
All of the above was correct, for data without nesting or repeated measures. The reason for this caveat is that with nested data, what constitutes an “outlier” for one source of data (e.g., one participant) may be a very reasonable value for another person. This is perhaps most intuitive if we think about an experiment comparing older and younger adults. With age, RTs on average get longer. Therefore, a slow RT that is clearly an outlier for an 18 year old university student may be about the average RT for someone who is 80 years old.
To properly address the nested structure of data in identifying and correcting outliers, we can use the same steps as above, but using the split-apply-combine approach provided by pandas .groupby()
method. In other words, we split the data by participant, apply the outlier detection and correction steps, and then combine the data into a new column in the DataFrame.
The code below is probably the most complex that we’ve worked with so far. Take your time going through this, and try to understand it as fully as you can. It may seem like a convoluted series of steps, but they are a logical sequence and each one is optimized for pandas. This is in fact pandas’ suggested approach for outlier correction.
Since we’ve added lots of columns above, let’s start fresh by re-loading the DataFrame from the file:
df = pd.read_csv('data/flanker_rt_data.csv')
Compute z scores by participant#
Now let’s split by participant and apply the z transform to the split data. Pandas has a useful method, .transform()
, that allows us to run a non-pandas function on a selection from a DataFrame. So we group by participant, select only the rt
column to operate on, and then .transform(stats.zscore)
and assign this to a new column in the DataFrame, rt_z
. Note that when we pass a function to .transform()
, we do not put parentheses after the name of the function we are using. That is, if we were running stats.zscore
on its own (see above), we would use stats.zscore(df['rt'])
. But since the we have already selected the data that we are passing with .transform()
, we don’t use the parentheses with the function call.
df['rt_z'] = df.groupby('participant')['rt'].transform(stats.zscore)
df.sample(12)
participant | block | trial | flankers | rt | rt_z | |
---|---|---|---|---|---|---|
4099 | s5 | 5 | 3 | congruent | 0.498127 | -0.032565 |
829 | s8 | 1 | 31 | neutral | 0.445732 | -0.124476 |
2511 | s14 | 5 | 6 | neutral | 0.396724 | -0.772640 |
470 | s16 | 5 | 23 | neutral | 0.414089 | -0.517560 |
1791 | s7 | 2 | 13 | neutral | 0.613734 | -0.141204 |
2163 | s9 | 4 | 9 | neutral | 0.372220 | -0.875797 |
3108 | s23 | 3 | 28 | incongruent | 0.877870 | 3.010529 |
141 | s25 | 5 | 14 | incongruent | 0.721394 | 2.556630 |
3729 | s17 | 3 | 15 | incongruent | 0.602214 | 1.704254 |
747 | s11 | 4 | 12 | incongruent | 0.517459 | 1.144262 |
4106 | s5 | 5 | 10 | incongruent | 0.430217 | -0.933662 |
2913 | s6 | 2 | 25 | neutral | 0.392572 | -0.623929 |
Compute mean RT from valid trials for each participant#
Likewise, we need to compute the mean RT within participants, so again we could split by participant and then compute the within-participant mean, this time passing np.mean
to .transform()
:
df['rt_mean'] = df.groupby('participant')['rt'].transform(np.mean)
But actually, doing the above computes the mean across all trials for each participant — including the outlier trials. We can compute the mean across only the non-outlier trials by adding a filter on df
prior to the groupby
operation:
df['rt_mean'] = df[np.abs(df['rt_z']) <= z_thresh].groupby('participant')['rt'].transform(np.mean)
df.head()
participant | block | trial | flankers | rt | rt_z | rt_mean | |
---|---|---|---|---|---|---|---|
0 | s25 | 1 | 1 | congruent | 0.652072 | 1.822298 | 0.475143 |
1 | s25 | 1 | 2 | neutral | 0.644872 | 1.746028 | 0.475143 |
2 | s25 | 1 | 3 | congruent | 0.445182 | -0.369295 | 0.475143 |
3 | s25 | 1 | 4 | neutral | 0.525200 | 0.478337 | 0.475143 |
4 | s25 | 1 | 5 | congruent | 0.533328 | 0.564439 | 0.475143 |
Fill in mean values for outlier rows#
There’s a catch with this however — since we grouped filtered np.abs(df['rt_z']) <= z_thresh
data, this only fills in the non-outlier rows with mean RT, because those are the only rows that get passed to the .transform()
method. The mean_RT
rows where z exceeds outlier threshold are filled with NaN
, since pandas doesn’t know what else to do with them. We can see this if we ask for a sample of just the rows that exceed the outlier threshold:
# view sample of rows to confirm RT varies by participant
df[df['rt_z'] > z_thresh].sample(10)
participant | block | trial | flankers | rt | rt_z | rt_mean | |
---|---|---|---|---|---|---|---|
3302 | s27 | 4 | 32 | incongruent | 0.728959 | 3.322114 | NaN |
350 | s16 | 1 | 32 | incongruent | 0.762060 | 3.576242 | NaN |
4231 | s15 | 4 | 7 | congruent | 0.881503 | 3.046180 | NaN |
1296 | s2 | 1 | 25 | incongruent | 0.745997 | 3.284224 | NaN |
37 | s25 | 2 | 6 | neutral | 0.917223 | 4.631054 | NaN |
2454 | s14 | 3 | 13 | incongruent | 0.797524 | 5.013169 | NaN |
3118 | s23 | 4 | 8 | neutral | 0.919488 | 3.395270 | NaN |
1299 | s2 | 1 | 28 | neutral | 0.837920 | 4.414282 | NaN |
4077 | s5 | 4 | 13 | incongruent | 0.742321 | 3.207646 | NaN |
3588 | s20 | 4 | 1 | neutral | 0.796931 | 3.650637 | NaN |
To fix this, we need to group the data by participant again, except this time not excluding outlier rows. Then, we need to replace all NaNs in the rt_mean
column with the mean RT for that participant.
To do this, we run another .groupby()
+ .transform()
operation, but this time applying a more complex transformation using a lambda function. Lambda functions are basically a way of writing a small Python function inside a command. They are used when what you need to do is a bit more complex or specific than can be done with an existing method or function.
In this case, what we want to do is group df
by participant ID, then fill in the NaN values with the mean RT from all non-NaN values for that participant. We do this by applying two operations in sequence: first we use np.nanmean()
to compute the mean over all non-NaN RT values for that participant, and then we pass that value to .fillna()
to replace any NaN values for that participant with the mean RT. This works because the x
in the lambda function is a variable that refers to the entire grouped data set that we pass to it:
df['rt_mean'] = df.groupby('participant')['rt_mean'].transform(lambda x: x.fillna(np.nanmean(x)))
# Check by again printing a sample of outlier rows
df[df['rt_z'] > z_thresh].sample(10)
participant | block | trial | flankers | rt | rt_z | rt_mean | |
---|---|---|---|---|---|---|---|
2016 | s12 | 4 | 20 | neutral | 0.814067 | 3.895140 | 0.490843 |
3302 | s27 | 4 | 32 | incongruent | 0.728959 | 3.322114 | 0.457454 |
4231 | s15 | 4 | 7 | congruent | 0.881503 | 3.046180 | 0.557696 |
4142 | s15 | 1 | 14 | neutral | 0.909568 | 3.315476 | 0.557696 |
2454 | s14 | 3 | 13 | incongruent | 0.797524 | 5.013169 | 0.446461 |
2964 | s6 | 4 | 12 | neutral | 0.917153 | 3.456548 | 0.466651 |
1296 | s2 | 1 | 25 | incongruent | 0.745997 | 3.284224 | 0.474832 |
2262 | s26 | 2 | 12 | incongruent | 0.840871 | 3.309630 | 0.550074 |
3769 | s17 | 4 | 23 | incongruent | 0.778205 | 4.073122 | 0.473696 |
766 | s11 | 4 | 31 | neutral | 0.749664 | 4.908618 | 0.443325 |
If you don’t fully understand the lambda function above, that’s OK. It is fairly advanced relative to most of what we’ve covered so far. However, it’s introduced here because it is the proper and efficient way to replace outlier values. It’s also worth knowing about lambda functions because you will start seeing them more in the future!
Create final output column#
Next, we want to create a column called rt_clean
that contains the original RT for non-outlier trials, and the mean RT for that participant for outlier trials. We can use np.where()
to find outliers and replace them with each individual’s mean RT. Below, we do this in one line of code, which is more compact and efficient than we did above. Earlier, we first created a column, outliers_z
and then used that to determine which rows would be filled with the original RT versus the mean RT. However, creating a column in our DataFrame that we will never use again is not efficient use of resources. Here we simply use the threshold test as our conditional.
df['rt_clean'] = np.where(np.abs(df['rt_z']) > z_thresh,
df['rt_mean'],
df['rt'])
df.sample(10)
participant | block | trial | flankers | rt | rt_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|
1700 | s13 | 4 | 16 | neutral | 0.540628 | 1.327034 | 0.453831 | 0.540628 |
907 | s8 | 4 | 13 | congruent | 0.393582 | -0.546616 | 0.477679 | 0.393582 |
2686 | s4 | 5 | 21 | neutral | 0.369700 | -0.914764 | 0.416534 | 0.369700 |
4188 | s15 | 2 | 28 | neutral | 0.517368 | -0.447881 | 0.557696 | 0.517368 |
490 | s22 | 1 | 11 | neutral | 0.361490 | -0.980852 | 0.431980 | 0.361490 |
636 | s22 | 5 | 29 | neutral | 0.449969 | 0.101231 | 0.431980 | 0.449969 |
2620 | s4 | 3 | 19 | neutral | 0.453388 | 0.634132 | 0.416534 | 0.453388 |
1289 | s2 | 1 | 18 | congruent | 0.493976 | 0.185997 | 0.474832 | 0.493976 |
4147 | s15 | 1 | 19 | congruent | 0.461373 | -0.985179 | 0.557696 | 0.461373 |
2019 | s12 | 4 | 23 | incongruent | 0.569706 | 0.931736 | 0.490843 | 0.569706 |
Since a random sample of 10 rows may not show any where z actually exceeded our outlier threshold, we can use filtering to view only those rows where the absolute value of z was greater than our threshold, to confirm that in those rows rt_clean
is the same as rt_mean
:
df[np.abs(df['rt_z']) > z_thresh].sample(10)
participant | block | trial | flankers | rt | rt_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|
1151 | s21 | 2 | 7 | incongruent | 0.750666 | 3.288323 | 0.459101 | 0.459101 |
4231 | s15 | 4 | 7 | congruent | 0.881503 | 3.046180 | 0.557696 | 0.557696 |
37 | s25 | 2 | 6 | neutral | 0.917223 | 4.631054 | 0.475143 | 0.475143 |
3017 | s23 | 1 | 1 | neutral | 0.945058 | 3.631653 | 0.542556 | 0.542556 |
2926 | s6 | 3 | 6 | congruent | 0.889037 | 3.237846 | 0.466651 | 0.466651 |
4208 | s15 | 3 | 16 | incongruent | 0.897641 | 3.201033 | 0.557696 | 0.557696 |
2841 | s10 | 5 | 17 | neutral | 0.730521 | 3.250012 | 0.489909 | 0.489909 |
847 | s8 | 2 | 17 | neutral | 0.031539 | -3.477270 | 0.477679 | 0.477679 |
602 | s22 | 4 | 27 | incongruent | 0.701941 | 3.182802 | 0.431980 | 0.431980 |
2695 | s4 | 5 | 30 | incongruent | 0.657840 | 4.418124 | 0.416534 | 0.416534 |
Finally, we’ll print out the proportion of data identified as outliers, by computing the number of z values that exceed our threshold, relative to the total number of z values (i.e., the number of rows in the DataFrame):
outliers = df[np.abs(df['rt_z']) > z_thresh]['rt_z'].count() / df['rt_z'].count()
# Convert proportion to percent, and round to 2 decimal places
print(str(round(outliers * 100, 2)), "% of the data identified as outliers and replaced with mean value.")
1.35 % of the data identified as outliers and replaced with mean value.
Visualizing the cleaned data#
Box plots#
Before outlier correction#
Let’s generate a box plot of the original RT values for reference (this is the same as the one plotted at the top of this lesson).
sns.catplot(kind='box',
data=df,
x='flankers', y='rt')
plt.show()
After outlier correction#
Below, we can see that the number of short RT outliers is reduced, and those that remain are very close to the whiskers (and also greater than 200 ms). At the same time, the number of outliers at the long RT end of the distribution is also reduced, but there are still more of them than at the short end (which is reasonable, for RT data).
sns.catplot(kind='box',
data=df,
x='flankers', y='rt_clean')
plt.show()
Kernel density estimates#
Before outlier correction#
sns.displot(kind='kde',
data=df,
x='rt', hue='flankers')
plt.show()
After outlier correction#
The visually detectable changes below are pretty subtle. That’s to be expected when only a small percentage of the data were considered outliers and none were very extreme. However, we can see that the distributions look less “peaky” in the middle and some of the bumps in the “tails” have been smoothed out. The long tails were caused by the few outlier data points at extreme ends of the 0–1 s range.
sns.displot(kind='kde',
data=df,
x='rt_clean', hue='flankers')
plt.show()
Summary of steps for cleaning nested data#
So, what all did we do to filter the data appropriately? We performed the following steps:
We created a z-scored version of our main dependent variable, RT, for each participant (
rt_z
)We created a column of mean RTs for each participant using only valid trials (
rt_mean
), so that we could later replace outliers with these valuesWe filled in the empty rows of
rt_mean
, which contained NaN values, withrt_mean
values from their respective participant using a lambda functionWe created a new column,
rt_clean
, filling in valid trials (rows) with the actual RT value (rt
) and replacing outliers withrt_mean
We calculated the proportion of data that were outliers
This section was adapted from Aaron J. Newman’s Data Science for Psychology and Neuroscience - in Python and Software Carpentry’s Plotting and Programming in Python workshop.