Looping Over Data Files#


Questions:#

  • How can I efficiently read in many data sets from different files?

  • How can I combine data from different files into one pandas DataFrame?

Learning Objectives:#

  • Be able to write “globbing” expressions that match sets of files

  • Use glob to create lists of files

  • Write for loops to perform operations on many files

  • Write list comprehensions to perform operations on many files

  • combine pandas DataFrames


Use a for loop to process files given a list of their names#

We can use a for loop to read in a set of data files, and do a thing for each one. In this case, we’ll print the minimum value in each file:

import pandas as pd

data_files = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

for filename in data_files:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165876
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64

Use glob.glob to find sets of files whose names match a pattern.#

  • In Unix, the term globbing means matching a set of files with a pattern.

  • The most common patterns are:

    • * meaning match zero or more characters

    • ? meaning match exactly one character

  • Python’s standard library contains the glob module to provide pattern matching functionality

  • The glob module contains a function also called glob to match file patterns

  • E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.

  • Result is a list of strings.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_europe.csv', 'data/gapminder_all.csv', 'data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv', 'data/s2.csv', 'data/s3.csv', 'data/s1.csv', 'data/gapminder_life_expectancy_years.csv', 'data/gapminder_gdp_asia.csv']

Use glob and for to process batches of files.#

It’s good practice to name your files systematically. As you’ve learned, Python is very precise about things like capitalization, so if your file names are inconsistent (e.g., Gapminder_Europe.csv, gapminder_americas.csv, gapminder_Oceania.csv), then it is harder to write code with glob that works correctly.

For the Gapminder data, fortunately the file names are quite systematic and consistent (as are the names of the columns inside each file), so we can use the following to read in each one and print the minimum GDP from 1952:

for filename in glob.glob('data/gapminder_gdp*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_asia.csv 331.0

Appending Files to a Single DataFrame#

Often we don’t just want to open a file and extract a small bit of data (such as the minimum value in examples above). Rather, we might want to open a set of related data files and combine them into one big DataFrame. For example, in psychology and neuroscience most experiments involve multiple participants. For each participant, when we run the experiment we get a data file. To analyze the data across participants, we would want to read in all participants’ data files and combine them into one DataFrame.

pandas has a few methods that allow us to combine DataFrames, including:

We will focus here on the first one. concat stands for “concatenate” which essentially means combine files by “stacking” them. That is, start with one DataFrame, and add a new data frame to the bottom of it, creating additional rows. In what we’ll do here, we assume that all of the data files we’re reading have the same columns. For example, in the Gapminder GDP data sets, each file has a column for country plus a series of columns for GDP in different years — and the same years are in the columns of all the data sets.

Reading in data from multiple experimental participants#

Let’s say we have data from an experiment in which we ran three human participants (sometimes called “people”) on different days. For each participant, we have a data file. The columns in all the files are the same, because the files were generated by a computer program that ran the experiment.

We give the participants anonymized ID codes to protect their privacy, and allow for a simple, systematic naming convention for the files. The first participant’s data is saved in a file called s1.csv, the second’s in s2.csv, etc..

We can glob the data folder in which the files are stored, to find all the CSV files whose names start with an s followed by a single character, followed by .csv. We’ll save the result to a list that we can loop through later:

filenames = glob.glob('data/s?.csv')

Next, we create an empty list that we will store the DataFrames from each participant in. It will end up being a list of DataFrames (remember, lists can contain just about any other Python data type), and once we have read in all the data, we will combine them into one DataFrame. This is a trick that’s important to use in pandas. The reason has to do with how pandas combines DataFrames and stores them in memory. In simple terms, each time we concatenate DataFrames, pandas does a lot of internal checking to make sure there are no errors. Doing this checking once, when combining many DataFrames, is far more efficient (and thus faster) than doing it many times. Likewise, when a DataFrame is created, an appropriate amount of memory space is allocated for it on the computer. Each time we append additional data, we have to create a new, bigger block of memory. Allocating new blocks of memory, many times, takes more time than just doing it once.

df_list = []

Finally, use a for loop to read the files in. This will cycle through the items in the filenames list; each time through the loop, filename has the value of the current file name, and we use the list append() method to add the data from that file to df_list:

for f in filenames:
    df_list.append(pd.read_csv(f))

When we view the contents of the list, we see each data set, with its two columns (with headers saying what they are), and commas separating the list entries, as is typical of a list.

df_list
[  participantID  trial        RT
 0            s2      1  0.433094
 1            s2      2  0.392526
 2            s2      3  0.396831
 3            s2      4  0.417988
 4            s2      5  0.371810
 5            s2      6  0.659228
 6            s2      7  0.411051
 7            s2      8  0.409580
 8            s2      9  0.486828
 9            s2     10  0.468912,
   participantID  trial        RT
 0            s3      1  0.322099
 1            s3      2  0.396106
 2            s3      3  0.384297
 3            s3      4  0.364524
 4            s3      5  0.454075
 5            s3      6  0.494156
 6            s3      7  0.492787
 7            s3      8  0.506836
 8            s3      9  0.340722
 9            s3     10  0.704491,
   participantID  trial        RT
 0            s1      1  0.508971
 1            s1      2  0.389858
 2            s1      3  0.404175
 3            s1      4  0.269520
 4            s1      5  0.437765
 5            s1      6  0.368142
 6            s1      7  0.400544
 7            s1      8  0.335198
 8            s1      9  0.341722
 9            s1     10  0.439583]

Reading multiple files using list comprehension#

While the for loop above works fine, there is an alternative way to do this, using list comprehension. Recall that list comprehensions are basically just a compact version of a for loop, but they have some advantages:

  • they are more pythonic: they only require one line of code, whereas the for loop above required two

  • they are more efficient: list comprehensions actually run faster. This may not be an issue in the small examples here, but can make a big difference when working with real, large data sets

df_list = [pd.read_csv(f) for f in filenames]
df_list
[  participantID  trial        RT
 0            s2      1  0.433094
 1            s2      2  0.392526
 2            s2      3  0.396831
 3            s2      4  0.417988
 4            s2      5  0.371810
 5            s2      6  0.659228
 6            s2      7  0.411051
 7            s2      8  0.409580
 8            s2      9  0.486828
 9            s2     10  0.468912,
   participantID  trial        RT
 0            s3      1  0.322099
 1            s3      2  0.396106
 2            s3      3  0.384297
 3            s3      4  0.364524
 4            s3      5  0.454075
 5            s3      6  0.494156
 6            s3      7  0.492787
 7            s3      8  0.506836
 8            s3      9  0.340722
 9            s3     10  0.704491,
   participantID  trial        RT
 0            s1      1  0.508971
 1            s1      2  0.389858
 2            s1      3  0.404175
 3            s1      4  0.269520
 4            s1      5  0.437765
 5            s1      6  0.368142
 6            s1      7  0.400544
 7            s1      8  0.335198
 8            s1      9  0.341722
 9            s1     10  0.439583]

Combining DataFrames#

At this point, we’ve read each input file in and stored it as a DataFrame, but we have a list of three distinct DataFrames. In most cases, we’ll want to combine these in some way. Having built our list of DataFrames through reading a set of files, we can combine them into a single DataFrame using the pandas .concat() method:

df = pd.concat(df_list)

# Confirm this worked by viewing a random sample of rows
df.sample(8)
participantID trial RT
5 s1 6 0.368142
2 s3 3 0.384297
9 s1 10 0.439583
8 s3 9 0.340722
6 s3 7 0.492787
4 s3 5 0.454075
0 s2 1 0.433094
3 s1 4 0.269520

Setting the index column#

Recall that row labels in pands are called indexes. We can convert any column to an index using the .set_index() method. For this data, an appropriate index is the participant ID, which is in the column participantID. Note that we need to assign the result of the .set_index() operation back to df for the change to be stored:

df = df.set_index('participantID')
df.sample(8)
trial RT
participantID
s1 1 0.508971
s3 1 0.322099
s3 3 0.384297
s2 2 0.392526
s3 8 0.506836
s1 3 0.404175
s3 5 0.454075
s1 9 0.341722

Exercises#

Determining Matches#

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv

  2. data/gapminder_gdp_americas.csv

  3. data/gapminder_gdp_asia.csv

Globbing files#

Fill in the blanks so that the code below does the following:

  • Find all of the CSV files in the data folder that contain GDP data

  • Read these files in using a for loop

  • Concatenate the data files into a single pandas DataFrame

  • Print out the first 10 lines of the final combined DataFrame

Note that not all the Gapminder data files contain GDP data, but the file names will indicate which ones do.

import glob
import pandas as pd

data_files = ___(___)

df_list = []

for ____ in ____:
    df_list.___(pd.read_csv(f))
    
df = ___

df.___(10)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 4
      1 import glob
      2 import pandas as pd
----> 4 data_files = ___(___)
      6 df_list = []
      8 for ____ in ____:

TypeError: 'list' object is not callable

List comprehension#

Now rewrite the code above to use list comprehension rather than a for loop, and only two lines of code total (excluding the import commands and viewing the first 10 lines of the result).

For an even bigger challenge, see if you can reduce the code to a single line!

Summary of Key Points:#

  • Use a for loop to process files given a list of their names

  • Use glob.glob to find sets of files whose names match a pattern

  • List comprehension can replace a for loop, resulting in more compact and efficient code

  • Naming your files in a consistent manner is just as important in data science, as writing the code to read them

  • When you want to combine multiple files into one pandas DataFrame, read each one in to a list of DataFrames, then run pd.concat() only once


This section was adapted from Aaron J. Newman’s Data Science for Psychology and Neuroscience - in Python and Software Carpentry’s Plotting and Programming in Python workshop.