Looping Over Data Files#
Questions:#
How can I efficiently read in many data sets from different files?
How can I combine data from different files into one pandas DataFrame?
Learning Objectives:#
Be able to write “globbing” expressions that match sets of files
Use
glob
to create lists of filesWrite
for
loops to perform operations on many filesWrite list comprehensions to perform operations on many files
combine pandas DataFrames
Use a for
loop to process files given a list of their names#
We can use a for
loop to read in a set of data files, and do a thing for each one. In this case, we’ll print the minimum value in each file:
import pandas as pd
data_files = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']
for filename in data_files:
data = pd.read_csv(filename, index_col='country')
print(filename, data.min())
data/gapminder_gdp_africa.csv gdpPercap_1952 298.846212
gdpPercap_1957 335.997115
gdpPercap_1962 355.203227
gdpPercap_1967 412.977514
gdpPercap_1972 464.099504
gdpPercap_1977 502.319733
gdpPercap_1982 462.211415
gdpPercap_1987 389.876185
gdpPercap_1992 410.896824
gdpPercap_1997 312.188423
gdpPercap_2002 241.165876
gdpPercap_2007 277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952 331.0
gdpPercap_1957 350.0
gdpPercap_1962 388.0
gdpPercap_1967 349.0
gdpPercap_1972 357.0
gdpPercap_1977 371.0
gdpPercap_1982 424.0
gdpPercap_1987 385.0
gdpPercap_1992 347.0
gdpPercap_1997 415.0
gdpPercap_2002 611.0
gdpPercap_2007 944.0
dtype: float64
Use glob.glob
to find sets of files whose names match a pattern.#
In Unix, the term globbing means matching a set of files with a pattern.
The most common patterns are:
*
meaning match zero or more characters?
meaning match exactly one character
Python’s standard library contains the
glob
module to provide pattern matching functionalityThe
glob
module contains a function also calledglob
to match file patternsE.g.,
glob.glob('*.txt')
matches all files in the current directory whose names end with.txt
.Result is a list of strings.
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_europe.csv', 'data/gapminder_all.csv', 'data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv', 'data/s2.csv', 'data/s3.csv', 'data/s1.csv', 'data/gapminder_life_expectancy_years.csv', 'data/gapminder_gdp_asia.csv']
Use glob
and for
to process batches of files.#
It’s good practice to name your files systematically. As you’ve learned, Python is very precise about things like capitalization, so if your file names are inconsistent (e.g., Gapminder_Europe.csv
, gapminder_americas.csv
, gapminder_Oceania.csv
), then it is harder to write code with glob
that works correctly.
For the Gapminder data, fortunately the file names are quite systematic and consistent (as are the names of the columns inside each file), so we can use the following to read in each one and print the minimum GDP from 1952:
for filename in glob.glob('data/gapminder_gdp*.csv'):
data = pd.read_csv(filename)
print(filename, data['gdpPercap_1952'].min())
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_asia.csv 331.0
Appending Files to a Single DataFrame#
Often we don’t just want to open a file and extract a small bit of data (such as the minimum value in examples above). Rather, we might want to open a set of related data files and combine them into one big DataFrame. For example, in psychology and neuroscience most experiments involve multiple participants. For each participant, when we run the experiment we get a data file. To analyze the data across participants, we would want to read in all participants’ data files and combine them into one DataFrame.
pandas has a few methods that allow us to combine DataFrames, including:
We will focus here on the first one. concat
stands for “concatenate” which essentially means combine files by “stacking” them. That is, start with one DataFrame, and add a new data frame to the bottom of it, creating additional rows. In what we’ll do here, we assume that all of the data files we’re reading have the same columns. For example, in the Gapminder GDP data sets, each file has a column for country
plus a series of columns for GDP in different years — and the same years are in the columns of all the data sets.
Reading in data from multiple experimental participants#
Let’s say we have data from an experiment in which we ran three human participants (sometimes called “people”) on different days. For each participant, we have a data file. The columns in all the files are the same, because the files were generated by a computer program that ran the experiment.
We give the participants anonymized ID codes to protect their privacy, and allow for a simple, systematic naming convention for the files. The first participant’s data is saved in a file called s1.csv
, the second’s in s2.csv
, etc..
We can glob the data folder in which the files are stored, to find all the CSV files whose names start with an s
followed by a single character, followed by .csv
. We’ll save the result to a list that we can loop through later:
filenames = glob.glob('data/s?.csv')
Next, we create an empty list that we will store the DataFrames from each participant in. It will end up being a list of DataFrames (remember, lists can contain just about any other Python data type), and once we have read in all the data, we will combine them into one DataFrame. This is a trick that’s important to use in pandas. The reason has to do with how pandas combines DataFrames and stores them in memory. In simple terms, each time we concatenate DataFrames, pandas does a lot of internal checking to make sure there are no errors. Doing this checking once, when combining many DataFrames, is far more efficient (and thus faster) than doing it many times. Likewise, when a DataFrame is created, an appropriate amount of memory space is allocated for it on the computer. Each time we append additional data, we have to create a new, bigger block of memory. Allocating new blocks of memory, many times, takes more time than just doing it once.
df_list = []
Finally, use a for
loop to read the files in. This will cycle through the items in the filenames
list; each time through the loop, filename
has the value of the current file name, and we use the list append()
method to add the data from that file to df_list
:
for f in filenames:
df_list.append(pd.read_csv(f))
When we view the contents of the list, we see each data set, with its two columns (with headers saying what they are), and commas separating the list entries, as is typical of a list.
df_list
[ participantID trial RT
0 s2 1 0.433094
1 s2 2 0.392526
2 s2 3 0.396831
3 s2 4 0.417988
4 s2 5 0.371810
5 s2 6 0.659228
6 s2 7 0.411051
7 s2 8 0.409580
8 s2 9 0.486828
9 s2 10 0.468912,
participantID trial RT
0 s3 1 0.322099
1 s3 2 0.396106
2 s3 3 0.384297
3 s3 4 0.364524
4 s3 5 0.454075
5 s3 6 0.494156
6 s3 7 0.492787
7 s3 8 0.506836
8 s3 9 0.340722
9 s3 10 0.704491,
participantID trial RT
0 s1 1 0.508971
1 s1 2 0.389858
2 s1 3 0.404175
3 s1 4 0.269520
4 s1 5 0.437765
5 s1 6 0.368142
6 s1 7 0.400544
7 s1 8 0.335198
8 s1 9 0.341722
9 s1 10 0.439583]
Reading multiple files using list comprehension#
While the for
loop above works fine, there is an alternative way to do this, using list comprehension. Recall that list comprehensions are basically just a compact version of a for
loop, but they have some advantages:
they are more pythonic: they only require one line of code, whereas the
for
loop above required twothey are more efficient: list comprehensions actually run faster. This may not be an issue in the small examples here, but can make a big difference when working with real, large data sets
df_list = [pd.read_csv(f) for f in filenames]
df_list
[ participantID trial RT
0 s2 1 0.433094
1 s2 2 0.392526
2 s2 3 0.396831
3 s2 4 0.417988
4 s2 5 0.371810
5 s2 6 0.659228
6 s2 7 0.411051
7 s2 8 0.409580
8 s2 9 0.486828
9 s2 10 0.468912,
participantID trial RT
0 s3 1 0.322099
1 s3 2 0.396106
2 s3 3 0.384297
3 s3 4 0.364524
4 s3 5 0.454075
5 s3 6 0.494156
6 s3 7 0.492787
7 s3 8 0.506836
8 s3 9 0.340722
9 s3 10 0.704491,
participantID trial RT
0 s1 1 0.508971
1 s1 2 0.389858
2 s1 3 0.404175
3 s1 4 0.269520
4 s1 5 0.437765
5 s1 6 0.368142
6 s1 7 0.400544
7 s1 8 0.335198
8 s1 9 0.341722
9 s1 10 0.439583]
Combining DataFrames#
At this point, we’ve read each input file in and stored it as a DataFrame, but we have a list of three distinct DataFrames. In most cases, we’ll want to combine these in some way. Having built our list of DataFrames through reading a set of files, we can combine them into a single DataFrame using the pandas .concat()
method:
df = pd.concat(df_list)
# Confirm this worked by viewing a random sample of rows
df.sample(8)
participantID | trial | RT | |
---|---|---|---|
5 | s1 | 6 | 0.368142 |
2 | s3 | 3 | 0.384297 |
9 | s1 | 10 | 0.439583 |
8 | s3 | 9 | 0.340722 |
6 | s3 | 7 | 0.492787 |
4 | s3 | 5 | 0.454075 |
0 | s2 | 1 | 0.433094 |
3 | s1 | 4 | 0.269520 |
Setting the index column#
Recall that row labels in pands are called indexes. We can convert any column to an index using the .set_index()
method. For this data, an appropriate index is the participant ID, which is in the column participantID
. Note that we need to assign the result of the .set_index()
operation back to df
for the change to be stored:
df = df.set_index('participantID')
df.sample(8)
trial | RT | |
---|---|---|
participantID | ||
s1 | 1 | 0.508971 |
s3 | 1 | 0.322099 |
s3 | 3 | 0.384297 |
s2 | 2 | 0.392526 |
s3 | 8 | 0.506836 |
s1 | 3 | 0.404175 |
s3 | 5 | 0.454075 |
s1 | 9 | 0.341722 |
Exercises#
Determining Matches#
Which of these files is not matched by the expression glob.glob('data/*as*.csv')
?
data/gapminder_gdp_africa.csv
data/gapminder_gdp_americas.csv
data/gapminder_gdp_asia.csv
Click the button to reveal the answer
1 is not matched. The string as
occurs in both americas and asia
Globbing files#
Fill in the blanks so that the code below does the following:
Find all of the CSV files in the data folder that contain GDP data
Read these files in using a
for
loopConcatenate the data files into a single pandas DataFrame
Print out the first 10 lines of the final combined DataFrame
Note that not all the Gapminder data files contain GDP data, but the file names will indicate which ones do.
import glob
import pandas as pd
data_files = ___(___)
df_list = []
for ____ in ____:
df_list.___(pd.read_csv(f))
df = ___
df.___(10)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 4
1 import glob
2 import pandas as pd
----> 4 data_files = ___(___)
6 df_list = []
8 for ____ in ____:
TypeError: 'list' object is not callable
Click the button to reveal the solution
import glob
import pandas as pd
data_files = glob.glob('data/*gdp*.csv')
df_list = []
for f in data_files:
df_list.append(pd.read_csv(f))
df = pd.concat(df_list)
df.head(10)
List comprehension#
Now rewrite the code above to use list comprehension rather than a for
loop, and only two lines of code total (excluding the import
commands and viewing the first 10 lines of the result).
Click the button to reveal the solution
Done in two lines of code:
df_list = [pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')]
df = pd.concat(df_list)
df.head(10)
For an even bigger challenge, see if you can reduce the code to a single line!
Click the button to reveal the solution
Done in one line of code:
df = pd.concat([pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')])
df.head(10)
Summary of Key Points:#
Use a
for
loop to process files given a list of their namesUse
glob.glob
to find sets of files whose names match a patternList comprehension can replace a
for
loop, resulting in more compact and efficient codeNaming your files in a consistent manner is just as important in data science, as writing the code to read them
When you want to combine multiple files into one pandas DataFrame, read each one in to a list of DataFrames, then run
pd.concat()
only once
This section was adapted from Aaron J. Newman’s Data Science for Psychology and Neuroscience - in Python and Software Carpentry’s Plotting and Programming in Python workshop.