Basic Statistics in Python: t tests with SciPy

Learning Objectives#

Implement paired, unpaired, and 1-sample t tests using the SciPy package

Introduction#

An introductory course in statistics is a prerequisite for this class, so we assume you remember (some of) the basics, including t tests, ANOVAs, and regression (or at least correlation).

Here we will demonstrate how to perform t tests in Python.

The t test#

A t test is used to compare the means of two sets of data. For example, in the flanker experiment we used in the previous section, we could compare the mean RTs for the congruent and incongruent conditions. t tests consider the size of the difference between the means of the two data sets, relative to the variance in each one. The less the distributions of values in the two data sets overlap, the larger the t value will tend to be. We can then estimate the probability of observing a difference at least this large simply by chance, if there were in fact no true difference; this is the p value. Typically, researchers use a p < .05 threshold to determine statistical significance.
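To make "difference between the means, relative to the variance" concrete, here is a minimal sketch that computes the classic pooled-variance t statistic by hand and checks it against SciPy (the SciPy import is introduced in the next paragraph). The two small arrays are made-up values, not the flanker data, and all variable names are purely illustrative.

```python
import numpy as np
from scipy import stats

# Made-up RTs (in seconds) for two hypothetical, independent groups
group_a = np.array([0.42, 0.45, 0.47, 0.44, 0.46, 0.43])
group_b = np.array([0.50, 0.52, 0.49, 0.53, 0.51, 0.48])

n_a, n_b = len(group_a), len(group_b)
mean_diff = group_a.mean() - group_b.mean()

# Pooled variance: a weighted average of the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# t = difference between the means, scaled by its standard error
t_manual = mean_diff / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# SciPy's independent-samples t test (assumes equal variances by default)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

print(f't (by hand) = {t_manual:.3f}, t (SciPy) = {t_scipy:.3f}, p = {p_scipy:.4f}')
```

The two t values should agree. Making the values within each group more spread out, while keeping the means the same, would shrink t.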

t tests are implemented in the SciPy library, which “provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.” Each of those types of routines is in a separate sub-module of SciPy; the one we’ll want is scipy.stats. We can import this specific module with the command:

from scipy import stats

We’ll also import some other packages we’ll need, and the flanker data from the previous lesson to work with:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import glob

# Import the data
df = pd.read_csv('data/flanker_rt_data.csv')

# Aggregate the data across participants (see the Repeated Measures chapter, if you've 
# forgotten the logic behind this next line).
df_avg = pd.DataFrame(df.groupby(['participant', 'flankers'])['rt'].mean()).reset_index()

Paired t-test#

Let’s start by comparing the mean RTs for the congruent and incongruent flanker conditions.

Recall that we are working with repeated-measures data – for each participant, we have 160 trials across 4 conditions. t tests are not meant for within-condition repeated measures data — we need only one measurement per participant in each condition. This is for essentially the same reason discussed at the end of the previous section on repeated measures data: if we treat the within-participant variability the same as the between-participant variability, then we will tend to grossly under-estimate the true (between-participant) variance. When running a t test, this would result in erroneously large t values that could often falsely suggest a statistically significant result. So, we need to use the aggregated data, df_avg.
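Before running a t test, it can be worth confirming that the aggregated data really do contain one value per participant per condition. The following is a small sketch of such a check; it uses a tiny made-up DataFrame (the names trials and avg are hypothetical stand-ins for df and df_avg) so that it runs on its own.

```python
import pandas as pd

# Hypothetical trial-level data: 2 participants x 2 conditions x 3 trials
trials = pd.DataFrame({
    'participant': ['s1'] * 6 + ['s2'] * 6,
    'flankers': (['congruent'] * 3 + ['incongruent'] * 3) * 2,
    'rt': [0.41, 0.43, 0.45, 0.50, 0.52, 0.48,
           0.39, 0.40, 0.44, 0.47, 0.49, 0.51],
})

# Average over trials, following the same pattern used to create df_avg
avg = trials.groupby(['participant', 'flankers'])['rt'].mean().reset_index()

# Count rows in each participant/condition cell; every count should be 1
cell_counts = avg.groupby(['participant', 'flankers']).size()
one_per_cell = bool((cell_counts == 1).all())

print(avg)
print('One value per participant per condition:', one_per_cell)
```

Running the same two lines of checking code on df_avg itself should likewise report True.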

The other important characteristic of our data is that, even though aggregation has reduced the data to one measurement per participant in each condition, we still have repeated measures across the two conditions. The default assumption of a t test is that each of the two data sets being compared comes from a different sample of the population (often called a between-subjects design). This means that t tests assume there is no relationship between any particular measurement in each of the two data sets being compared (such as coming from the same participants!). When we do have measurements from the same people in both data sets (a within-subjects design), we need to account for this, or the t test will again give an incorrect value. (Unlike the trial-level problem above, ignoring the pairing usually makes the test less sensitive rather than more, because stable differences between participants get counted as noise.) We account for this by using what’s referred to as a paired t test. In SciPy, this is the function ttest_rel(). (For a between-subjects, or independent groups, design, which we will not cover here, you would use ttest_ind().)
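As a rough illustration of why the choice between ttest_rel() and ttest_ind() matters, the sketch below runs both on the same simulated within-subjects data. The numbers are randomly generated under stated assumptions (each simulated participant has a personal baseline RT, plus a small condition effect), so they are not the flanker results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 30

# Each simulated participant has their own baseline speed...
baseline = rng.normal(0.50, 0.05, size=n)
# ...plus a small condition effect and a little measurement noise
cond_a = baseline + rng.normal(0, 0.01, size=n)
cond_b = baseline + 0.03 + rng.normal(0, 0.01, size=n)

t_rel, p_rel = stats.ttest_rel(cond_a, cond_b)  # respects the pairing
t_ind, p_ind = stats.ttest_ind(cond_a, cond_b)  # ignores the pairing

print(f'paired:      t = {t_rel:.2f}, p = {p_rel:.2g}')
print(f'independent: t = {t_ind:.2f}, p = {p_ind:.2g}')
```

In this simulation the paired test yields a much larger absolute t value, because it compares each participant with themselves and so removes the baseline differences between people. The two tests answer different questions, so the design of the study should decide which one is used.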

Select the data#

Running ttest_rel() is as simple as giving it the two sets of data you want to compare, as arguments. We can pull these directly from our df_avg pandas DataFrame. We’ll do this in a few lines of code below, first assigning each data set to a new variable, and then running the t test.

Let’s make sure you understand how to appropriately parse the data. We’ve seen these steps before, but maybe not in exactly this form — and it is quite complex, but logical.

Let’s start by testing which rows of the DataFrame are associated with congruent trials. This comparison returns a Boolean mask, with True for every congruent row:

# Solution
df_avg['flankers'] == 'congruent'

0      True
1     False
2     False
3      True
4     False
      ...  
76    False
77    False
78     True
79    False
80    False
Name: flankers, Length: 81, dtype: bool

We embed this inside another selector, which applies the Boolean mask to the DataFrame, essentially saying, “select from df_avg all the rows associated with congruent trials”.

df_avg[df_avg['flankers'] == 'congruent']

	participant	flankers	rt
0	s1	congruent	0.455259
3	s10	congruent	0.471231
6	s11	congruent	0.417540
9	s12	congruent	0.429758
12	s13	congruent	0.419096
15	s14	congruent	0.437178
18	s15	congruent	0.548638
21	s16	congruent	0.433748
24	s17	congruent	0.437577
27	s18	congruent	0.488892
30	s19	congruent	0.539020
33	s2	congruent	0.438167
36	s20	congruent	0.462935
39	s21	congruent	0.417553
42	s22	congruent	0.410191
45	s23	congruent	0.549622
48	s24	congruent	0.568396
51	s25	congruent	0.450102
54	s26	congruent	0.528508
57	s27	congruent	0.439243
60	s3	congruent	0.570766
63	s4	congruent	0.401993
66	s5	congruent	0.462927
69	s6	congruent	0.446840
72	s7	congruent	0.628185
75	s8	congruent	0.428642
78	s9	congruent	0.431829

Finally, add ['rt'] to the end of your expression to indicate that, having selected the congruent rows, we actually only want the column with the RT values, because those are what we want to perform the t test on. Assign these values to a variable called congr.

congr = df_avg[df_avg['flankers'] == 'congruent']['rt']
congr

0     0.455259
3     0.471231
6     0.417540
9     0.429758
12    0.419096
15    0.437178
18    0.548638
21    0.433748
24    0.437577
27    0.488892
30    0.539020
33    0.438167
36    0.462935
39    0.417553
42    0.410191
45    0.549622
48    0.568396
51    0.450102
54    0.528508
57    0.439243
60    0.570766
63    0.401993
66    0.462927
69    0.446840
72    0.628185
75    0.428642
78    0.431829
Name: rt, dtype: float64

Note, by the way, that congr is a pandas Series, not a DataFrame. Check for yourself in the cell below.

type(congr)

pandas.core.series.Series

Now follow the same logic to assign RTs from incongruent trials to a variable called incongr. This time we’ll use pandas’ .loc indexer, to illustrate another way of indexing into the same data.

incongr = df_avg.loc[df_avg['flankers'] == 'incongruent', 'rt']
incongr

   0.471838
   0.499031
   0.473012
  0.506722
  0.478367
  0.453524
  0.591644
  0.492921
  0.504452
  0.527152
  0.591181
  0.518216
  0.507257
  0.507033
  0.474612
  0.554172
  0.595977
  0.513179
  0.565531
  0.501069
  0.591022
  0.428867
  0.530722
  0.490298
  0.650769
  0.494878
  0.437926
Name: rt, dtype: float64

Likewise, incongr is a Series of the same length (the number of participants).

Run the t test#

Now we just pass congr and incongr as the first (and only) two arguments to ttest_rel(), and print the results out with some explanatory text. Note that we have to write stats.ttest_rel(), because we imported the stats module from SciPy under that name.

t, p = stats.ttest_rel(congr, incongr)
print('Congruent vs. Incongruent t = ', str(t), ' p = ', str(p))

Congruent vs. Incongruent t =  -10.209634805365013  p =  1.373929657982063e-10

We can make the output nicer by rounding to a reasonable level of precision:

print(f'Congruent vs. Incongruent t = {str(round(t, 2))}, p = {str(round(p, 4))}')

Congruent vs. Incongruent t = -10.21, p = 0.0

Now those are results any researcher would be happy to see! The p value is not actually zero, by the way. In the original output the p value was reported in scientific notation, ending in e-10, which means that it is actually 0.00000000013739; rounding to 4 decimal places simply turned it into 0.0. We would typically report this as p < .0001, since 4 decimal places is a fairly typical precision for reporting p values.
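One way to avoid printing a misleading 0.0 is to format the p value explicitly. The sketch below shows two options: scientific notation, and a small helper that switches to an inequality below a threshold. The function format_p() is a hypothetical helper written for this illustration, not part of SciPy, and its style choices (4 decimals, no leading zero) are just one common convention.

```python
def format_p(p, threshold=0.0001):
    """Return a report-style string for a p value (hypothetical helper)."""
    if p < threshold:
        # Report very small values as an inequality instead of a rounded 0.0
        return f'p < {threshold:.4f}'.replace('0.', '.', 1)
    # Otherwise show 4 decimals, dropping the leading zero
    return f'p = {p:.4f}'.replace('0.', '.', 1)


p_example = 1.373929657982063e-10  # the p value from the output above

print(f'{p_example:.2e}')    # scientific notation with 2 decimals
print(format_p(p_example))   # inequality form for tiny values
print(format_p(0.0312))      # an ordinary, made-up value
```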

1-tailed vs. 2-tailed p values#

By default, SciPy’s ttest_ functions return 2-tailed p values. This means that the p value considers both possible directions of difference between the two conditions. In the present example, that means either RTs for congruent are faster than incongruent, or they are slower for congruent than incongruent. In contrast, a 1-tailed p value should be used, in theory, if we have a specific prediction of a “direction” of the difference. Using a 1-tailed p value will tend to be less conservative, i.e., more likely to find a significant effect. This is because, for a given significance threshold (e.g., \(\alpha = .05\)), a 2-tailed test effectively splits \(\alpha\) in half: 2.5% is allotted to chance results in one direction (e.g., congruent slower) and 2.5% to chance results in the reverse direction (e.g., congruent faster). In contrast, a 1-tailed test allocates the entire 5% to a difference in one direction (e.g., congruent faster).

Practically speaking, 2-tailed tests should be used by default, but if you have a specific a priori hypothesis regarding the direction of the difference, you can use a 1-tailed test. For example, for the flanker experiment we’re working with here, previous research would lead us to the congruent-faster hypothesis.

In the present example, it really doesn’t matter since the two-tailed p value is minuscule. However, if you want to convert to a one-tailed p value, and the observed difference is in the predicted direction (as it is here), you just need to divide p in half:

print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p (one-tailed) = ', str(p / 2))

Congruent vs. Incongruent t =  -10.21  p (one-tailed) =  6.869648289910314e-11
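Halving the two-tailed p value is only appropriate when the observed difference goes in the predicted direction. As an alternative sketch, recent versions of SciPy (1.6 and later, which is an assumption about your installation) accept an alternative argument that states the direction explicitly. The arrays below are made-up paired values for six hypothetical participants.

```python
import numpy as np
from scipy import stats

# Made-up paired RTs for six hypothetical participants
congruent = np.array([0.45, 0.47, 0.42, 0.43, 0.44, 0.49])
incongruent = np.array([0.47, 0.50, 0.47, 0.51, 0.48, 0.53])

# Two-tailed test (the default)
t_two, p_two = stats.ttest_rel(congruent, incongruent)

# One-tailed test of the prediction "congruent < incongruent"
# (the `alternative` argument requires SciPy 1.6 or later)
t_one, p_one = stats.ttest_rel(congruent, incongruent, alternative='less')

# One-tailed test of the opposite prediction, for comparison
_, p_wrong = stats.ttest_rel(congruent, incongruent, alternative='greater')

print(f't = {t_two:.2f}')
print(f'two-tailed p = {p_two:.4f}')
print(f'one-tailed p (congruent faster) = {p_one:.4f}')
print(f'one-tailed p (congruent slower) = {p_wrong:.4f}')
```

Because the observed difference matches the 'less' prediction, that one-tailed p value is half the two-tailed one, while testing the opposite direction gives a p value close to 1.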

Be careful about order of data values#

The above paired t test worked properly because in our pandas DataFrame, participants are listed in a consistent order. So when we create separate Series for congruent and incongruent, the same rows of the two Series belong to the same participant. However, this isn’t always guaranteed, and so it’s good practice to do things in a way that ensures proper pairing of participants between data sets.

pandas indexing allows us to do this. Recall that indexes are row labels. By default, when we read a CSV file to a DataFrame, the rows are indexed numerically starting from zero. Indeed, if you look back above at the contents of congr and incongr, you’ll see the indexes in the left column are discontinuous and different between the two, because each data point came from a separate row. To ensure alignment of each participant’s data across the two Series, we can first use the participant ID as the index of df_avg, and then create separate Series for each condition:

df_avg = df_avg.set_index('participant')
congr = df_avg.loc[df_avg['flankers'] == 'congruent', 'rt']
incongr = df_avg.loc[df_avg['flankers'] == 'incongruent', 'rt']

Now when we look at the resulting Series, we see that the participant indexes are preserved:

congr

participant
s1     0.455259
s10    0.471231
s11    0.417540
s12    0.429758
s13    0.419096
s14    0.437178
s15    0.548638
s16    0.433748
s17    0.437577
s18    0.488892
s19    0.539020
s2     0.438167
s20    0.462935
s21    0.417553
s22    0.410191
s23    0.549622
s24    0.568396
s25    0.450102
s26    0.528508
s27    0.439243
s3     0.570766
s4     0.401993
s5     0.462927
s6     0.446840
s7     0.628185
s8     0.428642
s9     0.431829
Name: rt, dtype: float64

incongr

participant
s1     0.471838
s10    0.499031
s11    0.473012
s12    0.506722
s13    0.478367
s14    0.453524
s15    0.591644
s16    0.492921
s17    0.504452
s18    0.527152
s19    0.591181
s2     0.518216
s20    0.507257
s21    0.507033
s22    0.474612
s23    0.554172
s24    0.595977
s25    0.513179
s26    0.565531
s27    0.501069
s3     0.591022
s4     0.428867
s5     0.530722
s6     0.490298
s7     0.650769
s8     0.494878
s9     0.437926
Name: rt, dtype: float64
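Rather than comparing the two printouts by eye, you can ask pandas whether two Series have the same index. The sketch below uses two tiny made-up Series (a and b are hypothetical names) in which the participants are the same but listed in a different order.

```python
import pandas as pd

# Hypothetical per-participant means, indexed by participant ID
a = pd.Series([0.45, 0.47, 0.42], index=['s1', 's2', 's3'], name='rt')
b = pd.Series([0.50, 0.47, 0.52], index=['s2', 's1', 's3'], name='rt')

# Same labels in the same order?
same_order = a.index.equals(b.index)

# Same labels, regardless of order?
same_people = set(a.index) == set(b.index)

print('Same order: ', same_order)
print('Same people:', same_people)
```

Applied to congr and incongr above, congr.index.equals(incongr.index) should return True, since both were taken from the same DataFrame.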

Ensure pandas indexing is used in t tests#

What could go wrong?#

It turns out that SciPy’s ttest functions ignore pandas indexes, so indexing on its own won’t ensure that the t test compares data points from the same individuals. We can see this by creating incongr2, a copy of the incongruent data in which the order of the rows is randomized while the relationship between indexes (participant IDs) and RTs is preserved (you can compare with the data above to confirm that the same RT values are associated with the same IDs as in the original incongr Series):

df_avg = df_avg.reset_index()
inc_arr = np.array(df_avg[df_avg['flankers']=='incongruent'].iloc[:, [0, 2]])
np.random.shuffle(inc_arr)
incongr2 = pd.DataFrame(inc_arr, columns=['participant', 'rt']).set_index('participant')
incongr2 = pd.Series(incongr2['rt']).astype(float)  # convert from object to float dtype
incongr2

participant
s13    0.478367
s21    0.507033
s10    0.499031
s11    0.473012
s14    0.453524
s4     0.428867
s19    0.591181
s20    0.507257
s5     0.530722
s26    0.565531
s8     0.494878
s25    0.513179
s18    0.527152
s6     0.490298
s9     0.437926
s16    0.492921
s24    0.595977
s2     0.518216
s7     0.650769
s12    0.506722
s3     0.591022
s22    0.474612
s23    0.554172
s27    0.501069
s17    0.504452
s1     0.471838
s15    0.591644
Name: rt, dtype: float64
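As an aside, the same kind of shuffle can be done in one step with pandas’ .sample() method, which keeps each value attached to its index label. This is a sketch of an alternative, not the approach used above; the Series rts and the seed value are made up for illustration.

```python
import pandas as pd

# Hypothetical Series indexed by participant ID
rts = pd.Series([0.47, 0.50, 0.47, 0.51],
                index=pd.Index(['s1', 's2', 's3', 's4'], name='participant'),
                name='rt')

# frac=1 draws every row exactly once, in random order;
# random_state fixes the order so the result is repeatable
shuffled = rts.sample(frac=1, random_state=42)

print(shuffled)
```

Omitting random_state gives a different order on each run, which mirrors the behaviour of np.random.shuffle() above.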

Now when we run the t test, the t value doesn’t match the t value that we got above with the properly paired data. In fact, if you re-run the shuffling code above and then the code below multiple times, you will get different t and p values each time due to the random shuffling.

t, p = stats.ttest_rel(congr, incongr2)
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))

Congruent vs. Incongruent t =  -4.24  p =  0.0003

Solution 1: Use DataFrame columns rather than extracting Series#

Above, we extracted the two data sets we wanted to compare with a t test from a DataFrame (df_avg) into two pandas Series, congr and incongr. On the one hand, this simplifies the syntax of the t test command, but on the other hand we lose the structure of the pandas DataFrame. That is, in the DataFrame, the values for each condition are grouped by participant, in the same participant order for every condition, so we don’t have to worry about the order of the data values. We can run the t test based on the flankers value of the pandas DataFrame and be assured that congruent and incongruent values will be matched by participant. The code is just a little more complex to look at:

t, p = stats.ttest_rel(df_avg[df_avg['flankers'] == 'congruent']['rt'],
                       df_avg[df_avg['flankers'] == 'incongruent']['rt'])
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))

Congruent vs. Incongruent t =  -10.21  p =  0.0

This is probably the best approach to use in most cases, because:

It ensures that the repeated-measures structure of the data is preserved.
It uses less memory, because we aren’t copying columns of our DataFrame to new Series/variables.

It may in fact seem overly convoluted to have first demonstrated the extract-to-Series approach, then explain that it’s not the ideal way to do things! However, for many people, it’s intuitive to extract subsets of data to perform further processing on. One point of this lesson was to illustrate how that can create problems, even though it might seem like a logical approach.
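Another way to keep the pairing inside a DataFrame, offered here as a sketch rather than as the approach this lesson uses, is to reshape the data to "wide" format with one row per participant and one column per condition. The DataFrame long_df below is made up, and its rows are deliberately jumbled to show that the reshaping does the matching.

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format data, deliberately listed in a jumbled order
long_df = pd.DataFrame({
    'participant': ['s2', 's1', 's3', 's1', 's3', 's2', 's4', 's4'],
    'flankers': ['congruent', 'incongruent', 'congruent', 'congruent',
                 'incongruent', 'incongruent', 'incongruent', 'congruent'],
    'rt': [0.47, 0.47, 0.42, 0.45, 0.47, 0.50, 0.51, 0.43],
})

# One row per participant, one column per condition
wide = long_df.pivot(index='participant', columns='flankers', values='rt')
print(wide)

# Both columns share the same rows, so the values are paired by construction
t_wide, p_wide = stats.ttest_rel(wide['congruent'], wide['incongruent'])
print(f't = {t_wide:.2f}, p = {p_wide:.4f}')
```

Because each row of wide belongs to a single participant, there is no separate ordering to keep track of. Note that .pivot() raises an error if any participant has more than one value for a condition, which also acts as a check that the data were aggregated first.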

Solution 2: Use `.sort_index()` to ensure paired data are aligned#

If you do choose to work with a pair of Series, you can ensure that the indexes of the two data sets align by re-ordering the data in both Series that we’re comparing (congr and incongr2 in this case), using pandas’ .sort_index() method:

t, p = stats.ttest_rel(congr.sort_index(), incongr2.sort_index())
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))

Congruent vs. Incongruent t =  -10.21  p =  0.0
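A closely related option, shown here as a sketch with made-up data, is to re-order one Series to follow the index of the other using .reindex(). This leaves the first Series in its existing order instead of sorting both. The names first and second are hypothetical.

```python
import pandas as pd
from scipy import stats

ids = ['s1', 's2', 's3', 's4', 's5']
first = pd.Series([0.45, 0.47, 0.42, 0.43, 0.44], index=ids)

# Same participants, stored in a different order
second = pd.Series([0.51, 0.48, 0.47, 0.50, 0.47],
                   index=['s4', 's5', 's1', 's2', 's3'])

# Re-order `second` so that its rows follow the index of `first`
second_aligned = second.reindex(first.index)

t_aligned, p_aligned = stats.ttest_rel(first, second_aligned)
t_sorted, p_sorted = stats.ttest_rel(first.sort_index(), second.sort_index())

print(f'reindex:    t = {t_aligned:.2f}, p = {p_aligned:.4f}')
print(f'sort_index: t = {t_sorted:.2f}, p = {p_sorted:.4f}')
```

Both approaches give the same result here. One difference worth knowing is that .reindex() inserts NaN for any participant missing from the second Series, so a missing value becomes visible rather than silently shifting the pairing.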

Long story short, it is good practice to index by participant ID, and use the .sort_index() method when applying t tests to pandas Series or DataFrames, to ensure that values are appropriately paired.

Testing differences: one-sample t tests#

An alternative way to compare the congruent and incongruent conditions is to compute the difference in mean RTs between the two conditions for each participant (since it is a paired design), and then run a t test on the differences. In this case, we use a one-sample t test, in which we compare the data set to zero. In other words, is the difference between the conditions basically zero, or is it significantly different from zero (i.e., a believable difference)?

We can compute the difference between two pandas Series easily just using the - (minus) operator, so in this case we could use congr - incongr.

Note that this subtraction only works if the two Series are indexed by participant ID (or in some way that preserves the alignment of values between the two data sets). However, because we are subtracting two pandas objects, pandas recognizes the indexes in each and aligns them, even if the indexes aren’t in the same order in the two input Series. So we don’t have to worry about using .sort_index() as we did above for paired t tests.

congr_vs_incongr = congr - incongr
t, p = stats.ttest_1samp(congr_vs_incongr, 0)
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))

Congruent vs. Incongruent t =  -10.21  p =  0.0

Note that we get the same result from the 1 sample t test if we perform the subtraction on the two Series that have the same order of indexes, as when we perform the subtraction using incongr2, which has a randomly shuffled order of indexes. We don’t need to explicitly .sort_index() in this case:

congr_vs_incongr = congr - incongr2
congr_vs_incongr
t, p = stats.ttest_1samp(congr_vs_incongr, 0)
print(f'Congruent vs. Incongruent t = {str(round(t, 2))}, p = {str(round(p, 4))}')

Congruent vs. Incongruent t = -10.21, p = 0.0
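The index alignment that makes this work can be seen directly in a tiny example. The sketch below uses three made-up values per Series (x, y, and z are hypothetical names) and contrasts label-based subtraction with position-based subtraction on the underlying NumPy arrays.

```python
import pandas as pd

x = pd.Series([0.45, 0.47, 0.42], index=['s1', 's2', 's3'])
y = pd.Series([0.47, 0.50, 0.47], index=['s3', 's1', 's2'])  # different order

# pandas matches values on index labels, not on position
diff_by_label = x - y

# Converting to NumPy arrays discards the labels, so values pair by position
diff_by_position = x.to_numpy() - y.to_numpy()

# A label that is present in only one Series produces NaN
z = pd.Series([0.50, 0.47], index=['s1', 's2'])
diff_missing = x - z

print(diff_by_label)
print(diff_by_position)
print(diff_missing)
```

The last case is worth watching for: if a participant is missing from one condition, the difference for that participant is NaN, and by default ttest_1samp() will then return NaN as well.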

Paired vs. 1-sample t tests?#

You’ll note the result of the 1-sample t test is the same as the paired t test above. This is expected, because in both cases we ran a t test to compare the difference between the same two sets of data. From a coding perspective, the paired t test is a bit simpler, because you don’t have to perform a subtraction on the data prior to running the t test.
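The equivalence follows from how the paired t statistic is defined: it is the mean of the per-participant differences divided by the standard error of those differences. The sketch below computes that by hand and compares it with both SciPy functions, using made-up values for six hypothetical participants.

```python
import numpy as np
from scipy import stats

# Made-up paired values for six hypothetical participants
cond_1 = np.array([0.45, 0.47, 0.42, 0.43, 0.44, 0.49])
cond_2 = np.array([0.47, 0.50, 0.47, 0.51, 0.48, 0.53])

d = cond_1 - cond_2  # per-participant differences
n = len(d)

# t = mean difference / standard error of the mean difference
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

t_paired, p_paired = stats.ttest_rel(cond_1, cond_2)
t_onesamp, p_onesamp = stats.ttest_1samp(d, 0)

print(f'by hand:      t = {t_manual:.4f}')
print(f'ttest_rel:    t = {t_paired:.4f}, p = {p_paired:.4f}')
print(f'ttest_1samp:  t = {t_onesamp:.4f}, p = {p_onesamp:.4f}')
```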

The reasons we might want to run a 1-sample t test include cases where our data are already represented as a subtraction, or cases where we’re working with multiple variables and performing subtractions is a way of simplifying our presentation of the results. As well, since pandas subtraction respects the indexes, computing differences and then running 1-sample t tests can be a bit safer in ensuring that the proper within-participants nesting structure of your data is preserved.

Summary#

t tests are used to compare the means of two sets of data to each other, or the mean of one set of data against a particular value (such as zero)
An unpaired t test is used to compare two independent sets of data (e.g., from two different samples of a population, two groups, etc.)
A paired t test must be used when the two sets of data come from the same samples (e.g., the same individual participants)
A 1-sample t test is used to compare the mean of one set of data against a specific value. This is often used to compare a data set to zero
Paired t tests and 1-sample t tests can both be used to determine whether differences between two samples are significantly different from zero (no difference).
- In the 1-sample case, you must first compute the difference between the pairs of data in two conditions.
When working with pandas data objects, it is important to remember that SciPy’s functions (including t tests) do not use pandas indexes. So when doing paired t tests, you must ensure that the data are listed in the same order in the two Series being compared.
- The best way to ensure that the within-participant/repeated measures structure of the data is preserved when doing a t test, is to use two columns from a DataFrame that is indexed by participant ID.
- One alternative is to use the .sort_index() method on two Series that are indexed by participant ID.
- Another alternative is to use the fact that pandas does respect its indexing when you subtract two Series, so if your data are indexed by participant ID, doing the subtraction followed by a 1-sample t test is a way of ensuring that the within-participants relationships between data sets are preserved.

This section was adapted from Aaron J. Newman’s Data Science for Psychology and Neuroscience - in Python and Software Carpentry’s Plotting and Programming in Python workshop.