Introduction to pandas DataFrames#
Questions:#
How do I read in data from a file?
How can I work with data in tabular format (tables)?
How can I do basic descriptive statistics on tabular data?
Learning Objectives:#
Select individual values from a Pandas dataframe
Select entire rows or entire columns from a dataframe
Select a subset of both rows and columns from a dataframe in a single operation
Select a subset of a dataframe by a single Boolean criterion
Obtain descriptive statistics for subsets of data within a table
Use the split-apply-combine paradigm to work with data
What is pandas?#
pandas (whose official name starts with a lower-case “p”) is a Python library for working with data in a tabular format, such as is found in file formats like CSV, Microsoft Excel, and Google Sheets. (Sorry - there are no cute, black-and-white bears here.) Unlike Excel or Sheets, pandas is not a point-and click graphical interface for working with these files — everything is done through Python code. But compared to other formats for working with data that we have seen in this workshop, such as lists and dictionaries, and, most recently, NumPy arrays, pandas may seem more familiar, and it definitely lends itself more naturally to large data sets. Indeed, pandas’ mission statement is, “…to be the fundamental high-level building block for doing practical, real world data analysis in Python”.
The primary units of pandas data storage you will work with are DataFrames (essentially, tables of data organized as rows and columns). DataFrames are actually collections of pandas Series objects, which can be thought of as individual rows or columns (or vectors, or 1D arrays, which you should now feel familiar with).
Among the things that make pandas so attractive are the powerful interface to access individual records of the table, proper handling of missing values, and relational-database operations between DataFrames. As well, pandas functions and methods are written to work intuitively and efficiently with data organized in tables. Most operations are vectorized, which means that they will automatically apply to all values in a DataFrame or Series without the need to write for
loops to execute the same operation on a set of cells.
pandas is built on top of the NumPy library. While NumPy is a powerful and widely used Python library for working with numerical data, pandas provides greater flexibility in working with different types of data and provides an extremely powerful set of tools for data science, in general, and data wrangling (transforming raw data into more readily usable format), in particular. However, it’s worth noting that most of the methods defined for NumPy Arrays also apply to Pandas Series/DataFrames.
About Python Libraries#
pandas is an example of a Python library. A library is a collection of files (called modules) that contains functions for use by other programs. Libraries provide ways of extending Python’s functionality in different ways. They may also contain data values (e.g., numerical constants), entire sample data sets, and other things. A library’s contents are supposed to be related, although there’s no actual way to enforce that.
The Python standard library is an extensive suite of modules that comes with Python itself. Everything we’ve done so far in this workshop has been part of the standard library. Many additional libraries are available; UBC’s Jupyter Open has a large number of extra libraries already installed.
To use a library in a particular Jupyter notebook or other Python program, we must import it using the import
statement, like this:
import pandas
Once a library is imported, we can use functions and methods from it. But, for functions we have to tell Python that the function can be found in a particular library we imported. For example, pandas has a function to import data from CSV (comma-separated value) files, called read_csv
. To run this command, we would need to type:
pandas.read_csv()
Since some package names are long, and adding the name to every function can result in a lot of typing, Python also allows us to assign an alias — a shorter name — to a library when we import it. For example, the convention for pandas is to give it the alias pd
like this:
import pandas as pd
Then to read a CSV file we could use:
pd.read_csv()
In the cell below, import pandas with the alias pd:
import pandas as pd
Importing data with pandas#
As noted, we can read a CSV file and use it to create a pandas DataFrame, with the function pd.read_csv()
. CSV is a text format used for storing tabular data, in which each line of the file corresponds to a row in the table, and columns are separated with commas (“CSV” stands for “comma-separated values”). Often the first row of a CSV file will be the header, containing labels for each column.
The Gapminder data is in CSV format, so let’s load in one of the Gapminder datasets with the command below. Note that when we read in a DataFrame, we need to assign it to a variable name so that we can reference it later. A convention when working with pandas is to call the DataFrame df
. This works fine if you only have one DataFrame to work with, although if you are working with multiple DataFrames it is a good idea to give them more meaningful names.
The Gapminder data are stored in a subfolder called data
, so as the argument to pd.read_csv()
below we give the folder name followed by a slash, then the file name:
df = pd.read_csv('data/gapminder_gdp_europe.csv')
We can view the contents of the DataFrame df
by simply typing its name and running the cell. Note that, unlike most of the examples we’ve used in previous lessons, we don’t use the print()
function. Although it works, the result is not nicely formatted the way the output is if we just use the name of the data frame.
That is, run this command: df
— not print(df)
— in the cell below.
df
country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Albania | 1601.056136 | 1942.284244 | 2312.888958 | 2760.196931 | 3313.422188 | 3533.003910 | 3630.880722 | 3738.932735 | 2497.437901 | 3193.054604 | 4604.211737 | 5937.029526 |
1 | Austria | 6137.076492 | 8842.598030 | 10750.721110 | 12834.602400 | 16661.625600 | 19749.422300 | 21597.083620 | 23687.826070 | 27042.018680 | 29095.920660 | 32417.607690 | 36126.492700 |
2 | Belgium | 8343.105127 | 9714.960623 | 10991.206760 | 13149.041190 | 16672.143560 | 19117.974480 | 20979.845890 | 22525.563080 | 25575.570690 | 27561.196630 | 30485.883750 | 33692.605080 |
3 | Bosnia and Herzegovina | 973.533195 | 1353.989176 | 1709.683679 | 2172.352423 | 2860.169750 | 3528.481305 | 4126.613157 | 4314.114757 | 2546.781445 | 4766.355904 | 6018.975239 | 7446.298803 |
4 | Bulgaria | 2444.286648 | 3008.670727 | 4254.337839 | 5577.002800 | 6597.494398 | 7612.240438 | 8224.191647 | 8239.854824 | 6302.623438 | 5970.388760 | 7696.777725 | 10680.792820 |
5 | Croatia | 3119.236520 | 4338.231617 | 5477.890018 | 6960.297861 | 9164.090127 | 11305.385170 | 13221.821840 | 13822.583940 | 8447.794873 | 9875.604515 | 11628.388950 | 14619.222720 |
6 | Czech Republic | 6876.140250 | 8256.343918 | 10136.867130 | 11399.444890 | 13108.453600 | 14800.160620 | 15377.228550 | 16310.443400 | 14297.021220 | 16048.514240 | 17596.210220 | 22833.308510 |
7 | Denmark | 9692.385245 | 11099.659350 | 13583.313510 | 15937.211230 | 18866.207210 | 20422.901500 | 21688.040480 | 25116.175810 | 26406.739850 | 29804.345670 | 32166.500060 | 35278.418740 |
8 | Finland | 6424.519071 | 7545.415386 | 9371.842561 | 10921.636260 | 14358.875900 | 15605.422830 | 18533.157610 | 21141.012230 | 20647.164990 | 23723.950200 | 28204.590570 | 33207.084400 |
9 | France | 7029.809327 | 8662.834898 | 10560.485530 | 12999.917660 | 16107.191710 | 18292.635140 | 20293.897460 | 22066.442140 | 24703.796150 | 25889.784870 | 28926.032340 | 30470.016700 |
10 | Germany | 7144.114393 | 10187.826650 | 12902.462910 | 14745.625610 | 18016.180270 | 20512.921230 | 22031.532740 | 24639.185660 | 26505.303170 | 27788.884160 | 30035.801980 | 32170.374420 |
11 | Greece | 3530.690067 | 4916.299889 | 6017.190733 | 8513.097016 | 12724.829570 | 14195.524280 | 15268.420890 | 16120.528390 | 17541.496340 | 18747.698140 | 22514.254800 | 27538.411880 |
12 | Hungary | 5263.673816 | 6040.180011 | 7550.359877 | 9326.644670 | 10168.656110 | 11674.837370 | 12545.990660 | 12986.479980 | 10535.628550 | 11712.776800 | 14843.935560 | 18008.944440 |
13 | Iceland | 7267.688428 | 9244.001412 | 10350.159060 | 13319.895680 | 15798.063620 | 19654.962470 | 23269.607500 | 26923.206280 | 25144.392010 | 28061.099660 | 31163.201960 | 36180.789190 |
14 | Ireland | 5210.280328 | 5599.077872 | 6631.597314 | 7655.568963 | 9530.772896 | 11150.981130 | 12618.321410 | 13872.866520 | 17558.815550 | 24521.947130 | 34077.049390 | 40675.996350 |
15 | Italy | 4931.404155 | 6248.656232 | 8243.582340 | 10022.401310 | 12269.273780 | 14255.984750 | 16537.483500 | 19207.234820 | 22013.644860 | 24675.024460 | 27968.098170 | 28569.719700 |
16 | Montenegro | 2647.585601 | 3682.259903 | 4649.593785 | 5907.850937 | 7778.414017 | 9595.929905 | 11222.587620 | 11732.510170 | 7003.339037 | 6465.613349 | 6557.194282 | 9253.896111 |
17 | Netherlands | 8941.571858 | 11276.193440 | 12790.849560 | 15363.251360 | 18794.745670 | 21209.059200 | 21399.460460 | 23651.323610 | 26790.949610 | 30246.130630 | 33724.757780 | 36797.933320 |
18 | Norway | 10095.421720 | 11653.973040 | 13450.401510 | 16361.876470 | 18965.055510 | 23311.349390 | 26298.635310 | 31540.974800 | 33965.661150 | 41283.164330 | 44683.975250 | 49357.190170 |
19 | Poland | 4029.329699 | 4734.253019 | 5338.752143 | 6557.152776 | 8006.506993 | 9508.141454 | 8451.531004 | 9082.351172 | 7738.881247 | 10159.583680 | 12002.239080 | 15389.924680 |
20 | Portugal | 3068.319867 | 3774.571743 | 4727.954889 | 6361.517993 | 9022.247417 | 10172.485720 | 11753.842910 | 13039.308760 | 16207.266630 | 17641.031560 | 19970.907870 | 20509.647770 |
21 | Romania | 3144.613186 | 3943.370225 | 4734.997586 | 6470.866545 | 8011.414402 | 9356.397240 | 9605.314053 | 9696.273295 | 6598.409903 | 7346.547557 | 7885.360081 | 10808.475610 |
22 | Serbia | 3581.459448 | 4981.090891 | 6289.629157 | 7991.707066 | 10522.067490 | 12980.669560 | 15181.092700 | 15870.878510 | 9325.068238 | 7914.320304 | 7236.075251 | 9786.534714 |
23 | Slovak Republic | 5074.659104 | 6093.262980 | 7481.107598 | 8412.902397 | 9674.167626 | 10922.664040 | 11348.545850 | 12037.267580 | 9498.467723 | 12126.230650 | 13638.778370 | 18678.314350 |
24 | Slovenia | 4215.041741 | 5862.276629 | 7402.303395 | 9405.489397 | 12383.486200 | 15277.030170 | 17866.721750 | 18678.534920 | 14214.716810 | 17161.107350 | 20660.019360 | 25768.257590 |
25 | Spain | 3834.034742 | 4564.802410 | 5693.843879 | 7993.512294 | 10638.751310 | 13236.921170 | 13926.169970 | 15764.983130 | 18603.064520 | 20445.298960 | 24835.471660 | 28821.063700 |
26 | Sweden | 8527.844662 | 9911.878226 | 12329.441920 | 15258.296970 | 17832.024640 | 18855.725210 | 20667.381250 | 23586.929270 | 23880.016830 | 25266.594990 | 29341.630930 | 33859.748350 |
27 | Switzerland | 14734.232750 | 17909.489730 | 20431.092700 | 22966.144320 | 27195.113040 | 26982.290520 | 28397.715120 | 30281.704590 | 31871.530300 | 32135.323010 | 34480.957710 | 37506.419070 |
28 | Turkey | 1969.100980 | 2218.754257 | 2322.869908 | 2826.356387 | 3450.696380 | 4269.122326 | 4241.356344 | 5089.043686 | 5678.348271 | 6601.429915 | 6508.085718 | 8458.276384 |
29 | United Kingdom | 9979.508487 | 11283.177950 | 12477.177070 | 14142.850890 | 15895.116410 | 17428.748460 | 18232.424520 | 21664.787670 | 22705.092540 | 26074.531360 | 29478.999190 | 33203.261280 |
You’ll see that the rows are numbered in boldface, starting with 0 as is the norm in Python. This boldfaced, leftmost column is called the index of the DataFrame, and provides one way of accessing data by rows. Across the top, you’ll see that the column labels are also in boldface. pandas is pretty smart about automatically detecting when the first row of a CSV file contains header information (column names).
Heads or Tails?#
We might want to “peek” at the DataFrame without printing out the entire thing, especially if it’s big. We can see the first 5 rows of a DataFrame with the .head()
method:
df.head()
country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Albania | 1601.056136 | 1942.284244 | 2312.888958 | 2760.196931 | 3313.422188 | 3533.003910 | 3630.880722 | 3738.932735 | 2497.437901 | 3193.054604 | 4604.211737 | 5937.029526 |
1 | Austria | 6137.076492 | 8842.598030 | 10750.721110 | 12834.602400 | 16661.625600 | 19749.422300 | 21597.083620 | 23687.826070 | 27042.018680 | 29095.920660 | 32417.607690 | 36126.492700 |
2 | Belgium | 8343.105127 | 9714.960623 | 10991.206760 | 13149.041190 | 16672.143560 | 19117.974480 | 20979.845890 | 22525.563080 | 25575.570690 | 27561.196630 | 30485.883750 | 33692.605080 |
3 | Bosnia and Herzegovina | 973.533195 | 1353.989176 | 1709.683679 | 2172.352423 | 2860.169750 | 3528.481305 | 4126.613157 | 4314.114757 | 2546.781445 | 4766.355904 | 6018.975239 | 7446.298803 |
4 | Bulgaria | 2444.286648 | 3008.670727 | 4254.337839 | 5577.002800 | 6597.494398 | 7612.240438 | 8224.191647 | 8239.854824 | 6302.623438 | 5970.388760 | 7696.777725 | 10680.792820 |
…or the last 5 rows with .tail()
:
df.tail()
country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25 | Spain | 3834.034742 | 4564.802410 | 5693.843879 | 7993.512294 | 10638.75131 | 13236.921170 | 13926.169970 | 15764.983130 | 18603.064520 | 20445.298960 | 24835.471660 | 28821.063700 |
26 | Sweden | 8527.844662 | 9911.878226 | 12329.441920 | 15258.296970 | 17832.02464 | 18855.725210 | 20667.381250 | 23586.929270 | 23880.016830 | 25266.594990 | 29341.630930 | 33859.748350 |
27 | Switzerland | 14734.232750 | 17909.489730 | 20431.092700 | 22966.144320 | 27195.11304 | 26982.290520 | 28397.715120 | 30281.704590 | 31871.530300 | 32135.323010 | 34480.957710 | 37506.419070 |
28 | Turkey | 1969.100980 | 2218.754257 | 2322.869908 | 2826.356387 | 3450.69638 | 4269.122326 | 4241.356344 | 5089.043686 | 5678.348271 | 6601.429915 | 6508.085718 | 8458.276384 |
29 | United Kingdom | 9979.508487 | 11283.177950 | 12477.177070 | 14142.850890 | 15895.11641 | 17428.748460 | 18232.424520 | 21664.787670 | 22705.092540 | 26074.531360 | 29478.999190 | 33203.261280 |
We can also see a random sample of rows from the DataFrame with .sample()
, giving it a numerical argument to indicate the number of rows we want to see:
df.sample(10)
country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26 | Sweden | 8527.844662 | 9911.878226 | 12329.441920 | 15258.296970 | 17832.024640 | 18855.725210 | 20667.381250 | 23586.929270 | 23880.016830 | 25266.594990 | 29341.630930 | 33859.74835 |
29 | United Kingdom | 9979.508487 | 11283.177950 | 12477.177070 | 14142.850890 | 15895.116410 | 17428.748460 | 18232.424520 | 21664.787670 | 22705.092540 | 26074.531360 | 29478.999190 | 33203.26128 |
24 | Slovenia | 4215.041741 | 5862.276629 | 7402.303395 | 9405.489397 | 12383.486200 | 15277.030170 | 17866.721750 | 18678.534920 | 14214.716810 | 17161.107350 | 20660.019360 | 25768.25759 |
23 | Slovak Republic | 5074.659104 | 6093.262980 | 7481.107598 | 8412.902397 | 9674.167626 | 10922.664040 | 11348.545850 | 12037.267580 | 9498.467723 | 12126.230650 | 13638.778370 | 18678.31435 |
19 | Poland | 4029.329699 | 4734.253019 | 5338.752143 | 6557.152776 | 8006.506993 | 9508.141454 | 8451.531004 | 9082.351172 | 7738.881247 | 10159.583680 | 12002.239080 | 15389.92468 |
11 | Greece | 3530.690067 | 4916.299889 | 6017.190733 | 8513.097016 | 12724.829570 | 14195.524280 | 15268.420890 | 16120.528390 | 17541.496340 | 18747.698140 | 22514.254800 | 27538.41188 |
8 | Finland | 6424.519071 | 7545.415386 | 9371.842561 | 10921.636260 | 14358.875900 | 15605.422830 | 18533.157610 | 21141.012230 | 20647.164990 | 23723.950200 | 28204.590570 | 33207.08440 |
21 | Romania | 3144.613186 | 3943.370225 | 4734.997586 | 6470.866545 | 8011.414402 | 9356.397240 | 9605.314053 | 9696.273295 | 6598.409903 | 7346.547557 | 7885.360081 | 10808.47561 |
12 | Hungary | 5263.673816 | 6040.180011 | 7550.359877 | 9326.644670 | 10168.656110 | 11674.837370 | 12545.990660 | 12986.479980 | 10535.628550 | 11712.776800 | 14843.935560 | 18008.94444 |
15 | Italy | 4931.404155 | 6248.656232 | 8243.582340 | 10022.401310 | 12269.273780 | 14255.984750 | 16537.483500 | 19207.234820 | 22013.644860 | 24675.024460 | 27968.098170 | 28569.71970 |
Note that the .head()
and .tail()
methods also optionally take a numerical argument, if you want to view a different number of rows from the default of 5.
Accessing values in a DataFrame#
One thing we often want to do is access a single cell in a DataFrame, or a range of cells. Each cell is uniquely defined by a combination of its row and column locations.
Select a column using []
#
If we want to select an entire column of a pandas DataFrame, we just give the name of the DataFrame followed by the column name in square brackets:
df['gdpPercap_1992']
0 2497.437901
1 27042.018680
2 25575.570690
3 2546.781445
4 6302.623438
5 8447.794873
6 14297.021220
7 26406.739850
8 20647.164990
9 24703.796150
10 26505.303170
11 17541.496340
12 10535.628550
13 25144.392010
14 17558.815550
15 22013.644860
16 7003.339037
17 26790.949610
18 33965.661150
19 7738.881247
20 16207.266630
21 6598.409903
22 9325.068238
23 9498.467723
24 14214.716810
25 18603.064520
26 23880.016830
27 31871.530300
28 5678.348271
29 22705.092540
Name: gdpPercap_1992, dtype: float64
Note that if we ask for a single column the result is a pandas Series, but if we ask for two or more columns, the result is a DataFrame. Pay close attention to the syntax below — if we’re asking for more than one column, we need to provide a list of columns inside the square brackets (so there are two sets of nested square brackets in the code below):
df[['gdpPercap_1982', 'gdpPercap_1992']]
gdpPercap_1982 | gdpPercap_1992 | |
---|---|---|
0 | 3630.880722 | 2497.437901 |
1 | 21597.083620 | 27042.018680 |
2 | 20979.845890 | 25575.570690 |
3 | 4126.613157 | 2546.781445 |
4 | 8224.191647 | 6302.623438 |
5 | 13221.821840 | 8447.794873 |
6 | 15377.228550 | 14297.021220 |
7 | 21688.040480 | 26406.739850 |
8 | 18533.157610 | 20647.164990 |
9 | 20293.897460 | 24703.796150 |
10 | 22031.532740 | 26505.303170 |
11 | 15268.420890 | 17541.496340 |
12 | 12545.990660 | 10535.628550 |
13 | 23269.607500 | 25144.392010 |
14 | 12618.321410 | 17558.815550 |
15 | 16537.483500 | 22013.644860 |
16 | 11222.587620 | 7003.339037 |
17 | 21399.460460 | 26790.949610 |
18 | 26298.635310 | 33965.661150 |
19 | 8451.531004 | 7738.881247 |
20 | 11753.842910 | 16207.266630 |
21 | 9605.314053 | 6598.409903 |
22 | 15181.092700 | 9325.068238 |
23 | 11348.545850 | 9498.467723 |
24 | 17866.721750 | 14214.716810 |
25 | 13926.169970 | 18603.064520 |
26 | 20667.381250 | 23880.016830 |
27 | 28397.715120 | 31871.530300 |
28 | 4241.356344 | 5678.348271 |
29 | 18232.424520 | 22705.092540 |
Numerical indexing using .iloc[]
#
Often we don’t want to access an entire column, however, but just specific rows within a column (or range of columns). pandas provides two ways of accessing cell locations. One is using the numerical positions in the DataFrame, using the convention of [row, column] — with [0, 0] being the top left cell in the DataFrame. So for a pandas DataFrame with 3 rows and 3 columns, the indices of each cell are as shown:
col 0 |
col 1 |
col 2 |
col 3 |
|
---|---|---|---|---|
0 |
[0, 0] |
[0, 1] |
[0, 2] |
[0, 3] |
1 |
[1, 0] |
[1, 1] |
[1, 2] |
[1, 3] |
2 |
[2, 0] |
[2, 1] |
[2, 2] |
[2, 3] |
3 |
[3, 0] |
[3, 1] |
[3, 2] |
[3, 3] |
Numerical indexing of DataFrames is conceptually identical to indexing of NumPy arrays, however, with pandas it is performed with the .iloc[]
method. For example, to access the GDP value for Austria in 1952 — which is located in the second row, third column of our current DataFrame, we would use:
df.iloc[1, 2]
8842.59803
However, trying to access the same data without .iloc
will return a long error message.
Label-based indexing using .loc[]
#
The other way to access a location in a DataFrame is by its index and column labels, using the .loc[]
method. As noted earlier, in the DataFrame we imported, the indexes are currently numbers, which were created automatically when we imported the data. The .loc[]
method doesn’t work with numerical indexes (that’s what iloc
is for — and you can’t mix, say, a numerical row index with a column label), but in the data set we imported, the first column of this CSV file is actually meant to be its index: while all other columns are data values (GDP, in type float), the first column identifies the country with which each row of data is associated.
pandas has a method for setting an index column, .set_index()
, where the argument (in the parentheses) would be the name of the column to use as the index. So here we want to run:
df = df.set_index('country')
Note that we need to assign the result of this operation back to df
(using df =
), otherwise the change will not actually modify df
.
In the cell below, use the .set_index()
method to set the index of df
to country
, and then view the DataFrame again to see how it has changed.
df = df.set_index('country')
Alternatively, if we knew which column we wanted to use as the index before loading in the data file, we could have included the argument index_col=
in the pd.read_csv()
command:
df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
Now that we have defined the index, we can access the 1952 GDP value for Austria by its index and column names:
df.loc['Austria', 'gdpPercap_1952']
6137.076492
Use :
on its own to mean all columns or all rows.#
Using Python’s familiar slicing notation (which we’ve previously used for strings, lists, and NumPy arrays), we can use :
with .iloc[]
or .loc[]
, to specify a range in a DataFrame.
For example, to see the GDP of Albania for every year (column) in the DataFrame, we would use:
df.loc["Albania", :]
gdpPercap_1952 1601.056136
gdpPercap_1957 1942.284244
gdpPercap_1962 2312.888958
gdpPercap_1967 2760.196931
gdpPercap_1972 3313.422188
gdpPercap_1977 3533.003910
gdpPercap_1982 3630.880722
gdpPercap_1987 3738.932735
gdpPercap_1992 2497.437901
gdpPercap_1997 3193.054604
gdpPercap_2002 4604.211737
gdpPercap_2007 5937.029526
Name: Albania, dtype: float64
Likewise, we could see the GDP for every country (row) in the year 1957 with:
df.loc[:, 'gdpPercap_1957']
country
Albania 1942.284244
Austria 8842.598030
Belgium 9714.960623
Bosnia and Herzegovina 1353.989176
Bulgaria 3008.670727
Croatia 4338.231617
Czech Republic 8256.343918
Denmark 11099.659350
Finland 7545.415386
France 8662.834898
Germany 10187.826650
Greece 4916.299889
Hungary 6040.180011
Iceland 9244.001412
Ireland 5599.077872
Italy 6248.656232
Montenegro 3682.259903
Netherlands 11276.193440
Norway 11653.973040
Poland 4734.253019
Portugal 3774.571743
Romania 3943.370225
Serbia 4981.090891
Slovak Republic 6093.262980
Slovenia 5862.276629
Spain 4564.802410
Sweden 9911.878226
Switzerland 17909.489730
Turkey 2218.754257
United Kingdom 11283.177950
Name: gdpPercap_1957, dtype: float64
You can also just specify the row index; if you don’t specify anything for the columns, pandas assumes you want all columns:
df.loc["Albania"]
gdpPercap_1952 1601.056136
gdpPercap_1957 1942.284244
gdpPercap_1962 2312.888958
gdpPercap_1967 2760.196931
gdpPercap_1972 3313.422188
gdpPercap_1977 3533.003910
gdpPercap_1982 3630.880722
gdpPercap_1987 3738.932735
gdpPercap_1992 2497.437901
gdpPercap_1997 3193.054604
gdpPercap_2002 4604.211737
gdpPercap_2007 5937.029526
Name: Albania, dtype: float64
However, since the syntax for .iloc[]
and .loc[]
is [rows, columns], you cannot omit a row index; you need to use :
if you want all rows.
Slicing works on DataFrames#
Slicing using numerical indices works similarly for DataFrames as we previously saw for strings, lists, and ndarrays. For example, the following code will print the third through fifth rows of the DataFrame, and the fifth through eighth columns (remember, Python indexing starts at 0, and slicing does not include the “end” index):
df.iloc[2:5, 4:8]
gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | |
---|---|---|---|---|
country | ||||
Belgium | 16672.143560 | 19117.974480 | 20979.845890 | 22525.563080 |
Bosnia and Herzegovina | 2860.169750 | 3528.481305 | 4126.613157 | 4314.114757 |
Bulgaria | 6597.494398 | 7612.240438 | 8224.191647 | 8239.854824 |
The code below will print from the sixth to second-last row of the DataFrame, and from the ninth to the last column:
df.iloc[5:-1, 8:]
gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|
country | ||||
Croatia | 8447.794873 | 9875.604515 | 11628.388950 | 14619.222720 |
Czech Republic | 14297.021220 | 16048.514240 | 17596.210220 | 22833.308510 |
Denmark | 26406.739850 | 29804.345670 | 32166.500060 | 35278.418740 |
Finland | 20647.164990 | 23723.950200 | 28204.590570 | 33207.084400 |
France | 24703.796150 | 25889.784870 | 28926.032340 | 30470.016700 |
Germany | 26505.303170 | 27788.884160 | 30035.801980 | 32170.374420 |
Greece | 17541.496340 | 18747.698140 | 22514.254800 | 27538.411880 |
Hungary | 10535.628550 | 11712.776800 | 14843.935560 | 18008.944440 |
Iceland | 25144.392010 | 28061.099660 | 31163.201960 | 36180.789190 |
Ireland | 17558.815550 | 24521.947130 | 34077.049390 | 40675.996350 |
Italy | 22013.644860 | 24675.024460 | 27968.098170 | 28569.719700 |
Montenegro | 7003.339037 | 6465.613349 | 6557.194282 | 9253.896111 |
Netherlands | 26790.949610 | 30246.130630 | 33724.757780 | 36797.933320 |
Norway | 33965.661150 | 41283.164330 | 44683.975250 | 49357.190170 |
Poland | 7738.881247 | 10159.583680 | 12002.239080 | 15389.924680 |
Portugal | 16207.266630 | 17641.031560 | 19970.907870 | 20509.647770 |
Romania | 6598.409903 | 7346.547557 | 7885.360081 | 10808.475610 |
Serbia | 9325.068238 | 7914.320304 | 7236.075251 | 9786.534714 |
Slovak Republic | 9498.467723 | 12126.230650 | 13638.778370 | 18678.314350 |
Slovenia | 14214.716810 | 17161.107350 | 20660.019360 | 25768.257590 |
Spain | 18603.064520 | 20445.298960 | 24835.471660 | 28821.063700 |
Sweden | 23880.016830 | 25266.594990 | 29341.630930 | 33859.748350 |
Switzerland | 31871.530300 | 32135.323010 | 34480.957710 | 37506.419070 |
Turkey | 5678.348271 | 6601.429915 | 6508.085718 | 8458.276384 |
Note however, that when using label-based indexing with .loc[]
, pandas’ slicing behaviour is a bit different. Specifically, the output includes the last item in the range, whereas numerical indexing with .iloc[]
does not.
So, considering that the first three rows of the DataFrame correspond to the countries Albania, Austria, and Belgium, and that columns 6 and 7 are for the years 1982 and 1987 respectively, compare the output of:
df.iloc[0:2, 6:7]
gdpPercap_1982 | |
---|---|
country | |
Albania | 3630.880722 |
Austria | 21597.083620 |
with:
df.loc['Albania':'Belgium', 'gdpPercap_1982':'gdpPercap_1988']
gdpPercap_1982 | gdpPercap_1987 | |
---|---|---|
country | ||
Albania | 3630.880722 | 3738.932735 |
Austria | 21597.083620 | 23687.826070 |
Belgium | 20979.845890 | 22525.563080 |
The “inclusive” label-based indexing with .loc[]
is fairly intuitive, but it’s important to remember that it works differently from numerical indexing.
Use lists to select non-contiguous sections of a DataFrame#
While slicing can be very useful, sometimes we might want to extract values that aren’t next to each other in a DataFrame. For example, what if we only want values from two specific years (1992 and 2002), for Scandinavian countries (Denmark, Finland, Norway, and Sweden)? Neither the years nor countries are in adjacent columns/rows in the DataFrame. With .loc[]
, we can use lists, rather than ranges separated by :
, as selectors:
df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], ['gdpPercap_1992', 'gdpPercap_2002']]
gdpPercap_1992 | gdpPercap_2002 | |
---|---|---|
country | ||
Denmark | 26406.73985 | 32166.50006 |
Finland | 20647.16499 | 28204.59057 |
Norway | 33965.66115 | 44683.97525 |
Sweden | 23880.01683 | 29341.63093 |
We can equivalently write the command over several lines to make it a bit easier to read:
df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'],
['gdpPercap_1992', 'gdpPercap_2002']
]
gdpPercap_1992 | gdpPercap_2002 | |
---|---|---|
country | ||
Denmark | 26406.73985 | 32166.50006 |
Finland | 20647.16499 | 28204.59057 |
Norway | 33965.66115 | 44683.97525 |
Sweden | 23880.01683 | 29341.63093 |
We could also define those lists as variables, and pass the variables to .loc[]
. This might be useful if you were going to use the lists more than once, as well as for clarity:
scand_countries = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
years = ['gdpPercap_1992', 'gdpPercap_2002']
df.loc[scand_countries, years]
gdpPercap_1992 | gdpPercap_2002 | |
---|---|---|
country | ||
Denmark | 26406.73985 | 32166.50006 |
Finland | 20647.16499 | 28204.59057 |
Iceland | 25144.39201 | 31163.20196 |
Norway | 33965.66115 | 44683.97525 |
Sweden | 23880.01683 | 29341.63093 |
We can take this a step further, and assign the output of a .loc[]
selection like this to a new variable name. This makes a copy of the selected data, stored in a new DataFrame (or Series, if we only select one row or column) with its own name. This allows us to later reference and use that selection.
scand_data = df.loc[scand_countries, years]
It’s easy to do simple math and statistics in DataFrames#
We prevoiusly learned about methods to get simple statistical values out of a Python list, like .max()
, and .min()
. We took it a step further when learning about NumPy, as we were able to easily retrieve summary statistics with functions such as np.mean()
and np.std()
. pandas includes these and many more methods as well. For example, we can view the mean GDP for Italy across all years (columns) with:
df.loc['Italy'].mean()
16245.20900641667
Or the largest GDP in 1977 with:
df.loc[:, 'gdpPercap_1977'].max()
26982.29052
Another useful method is .describe()
, which prints out a range of descriptive statistics for the range of data you specify. Without any slicing it provides information for each column:
df.describe()
gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 |
mean | 5661.057435 | 6963.012816 | 8365.486814 | 10143.823757 | 12479.575246 | 14283.979110 | 15617.896551 | 17214.310727 | 17061.568084 | 19076.781802 | 21711.732422 | 25054.481636 |
std | 3114.060493 | 3677.950146 | 4199.193906 | 4724.983889 | 5509.691411 | 5874.464896 | 6453.234827 | 7482.957960 | 9109.804361 | 10065.457716 | 11197.355517 | 11800.339811 |
min | 973.533195 | 1353.989176 | 1709.683679 | 2172.352423 | 2860.169750 | 3528.481305 | 3630.880722 | 3738.932735 | 2497.437901 | 3193.054604 | 4604.211737 | 5937.029526 |
25% | 3241.132406 | 4394.874315 | 5373.536612 | 6657.939047 | 9057.708094 | 10360.030300 | 11449.870115 | 12274.570680 | 8667.113214 | 9946.599306 | 11721.851483 | 14811.898210 |
50% | 5142.469716 | 6066.721495 | 7515.733737 | 9366.067033 | 12326.379990 | 14225.754515 | 15322.824720 | 16215.485895 | 17550.155945 | 19596.498550 | 23674.863230 | 28054.065790 |
75% | 7236.794919 | 9597.220820 | 10931.085348 | 13277.182058 | 16523.017127 | 19052.412163 | 20901.729730 | 23321.587723 | 25034.243045 | 27189.530312 | 30373.363307 | 33817.962533 |
max | 14734.232750 | 17909.489730 | 20431.092700 | 22966.144320 | 27195.113040 | 26982.290520 | 28397.715120 | 31540.974800 | 33965.661150 | 41283.164330 | 44683.975250 | 49357.190170 |
Mini-Exercise#
In the cell below, use the scand_countries
and years
variables to view descriptive statistics for all Scandinavian countries in each year.
Evaluate cells based on conditions#
pandas allows an easy way to identify values in a DataFrame that meet a certain condition, using operators like <
, >
, and ==
. For example, let’s see which countries in a list had a GDP over 10,000 in 1962 and 1992. The result is reported in Booleans (True/False) for each cell.
countries = ['France', 'Germany', 'Italy', 'Spain', 'United Kingdom']
df.loc[countries, ['gdpPercap_1962', 'gdpPercap_1992']] > 10000
gdpPercap_1962 | gdpPercap_1992 | |
---|---|---|
country | ||
France | True | True |
Germany | True | True |
Italy | False | True |
Spain | False | True |
United Kingdom | True | True |
Select values or NaN using a Boolean mask.#
A DataFrame full of Booleans is sometimes called a mask because of how it can be used. A mask can be used to remove values that are not True, and replace them with NaN
— a special Python value representing “not a number”. This can be useful because pandas ignores NaN values when doing computations.
We create a mask by assigning the output of a conditional statement to a variable name:
mask = scand_data > 30000
mask
gdpPercap_1992 | gdpPercap_2002 | |
---|---|---|
country | ||
Denmark | False | True |
Finland | False | False |
Iceland | False | True |
Norway | True | True |
Sweden | False | False |
Then we can apply the mask to the DataFrame to get only the values that meet the criterion:
scand_data[mask]
gdpPercap_1992 | gdpPercap_2002 | |
---|---|---|
country | ||
Denmark | NaN | 32166.50006 |
Finland | NaN | NaN |
Iceland | NaN | 31163.20196 |
Norway | 33965.66115 | 44683.97525 |
Sweden | NaN | NaN |
As an example of how this might be used, the steps above would now allow us to find the lowest GDP value in each year, that was above 30,000:
scand_data[mask].min()
gdpPercap_1992 33965.66115
gdpPercap_2002 31163.20196
dtype: float64
Split-Apply-Combine#
A common task in data science is to split data into meaningful subgroups, apply an operation to each subgroup (e.g., compute the mean), and then combine the results into a single output, such as a table or a new DataFrame. This paradigm was famously described by Hadley Wickham in a 2011 paper.
pandas provides methods and grouping operations that are very efficient (vectorized) for split-apply-combine operations.
As an example, let’s say that we wanted to compare the average GDP for different regions of Europe, divided as northern, southern, eastern, and western. To do this, we first have to create lists defining the countries belonging to each of these regions:
northern = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
southern = ['Greece', 'Italy', 'Portugal', 'Spain']
eastern = ['Albania', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia',
'Czech Republic', 'Hungary', 'Montenegro', 'Poland', 'Romania',
'Serbia', 'Slovak Republic', 'Slovenia']
western = ['Austria', 'Belgium', 'France', 'Germany', 'Ireland',
'Netherlands', 'Switzerland', 'United Kingdom']
Next we can make a new column simply by using .loc[]
with the rows specified by one of the lists we just defined, a column name that doesn’t already exist (in this case, we’ll call it “region”), then assigning a region label to that combination of rows and column. We need to do this separately for each region. Note that when we first create the new column (“region”), pandas fills it with NaN values in any rows that were not defined by the assignment. For example, in the code below, the first line will create the column “region”, and fill it with “northern” for any row in the northern
list, and NaN
to every other row.
df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'
df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'
Let’s look at the last two columns of df
to see the region
column we created:
df.iloc[:, -2:]
gdpPercap_2007 | region | |
---|---|---|
country | ||
Albania | 5937.029526 | eastern |
Austria | 36126.492700 | western |
Belgium | 33692.605080 | western |
Bosnia and Herzegovina | 7446.298803 | eastern |
Bulgaria | 10680.792820 | eastern |
Croatia | 14619.222720 | eastern |
Czech Republic | 22833.308510 | eastern |
Denmark | 35278.418740 | northern |
Finland | 33207.084400 | northern |
France | 30470.016700 | western |
Germany | 32170.374420 | western |
Greece | 27538.411880 | southern |
Hungary | 18008.944440 | eastern |
Iceland | 36180.789190 | northern |
Ireland | 40675.996350 | western |
Italy | 28569.719700 | southern |
Montenegro | 9253.896111 | eastern |
Netherlands | 36797.933320 | western |
Norway | 49357.190170 | northern |
Poland | 15389.924680 | eastern |
Portugal | 20509.647770 | southern |
Romania | 10808.475610 | eastern |
Serbia | 9786.534714 | eastern |
Slovak Republic | 18678.314350 | eastern |
Slovenia | 25768.257590 | eastern |
Spain | 28821.063700 | southern |
Sweden | 33859.748350 | northern |
Switzerland | 37506.419070 | western |
Turkey | 8458.276384 | NaN |
United Kingdom | 33203.261280 | western |
Split#
Now we can use this “region” column to split the data into groups, using a pandas method called .groupby()
grouped_countries = df.groupby('region')
Note that this step doesn’t create a new DataFrame, it creates a special kind of pandas object that points to a grouping in the original DataFrame:
type(grouped_countries)
pandas.core.groupby.generic.DataFrameGroupBy
Apply#
Now that we have split the data, we can apply a function separately to each group. Here we’ll compute the mean GDP for each region, for each year:
mean_gdp_by_region = grouped_countries.mean()
Combine#
The combine step actually occurred with the apply step above — the result is automatically combined into a table of mean values organized by region. But since our apply step (.mean()
) saved the result to a variable, we can view the resulting table as the output of the combine step:
mean_gdp_by_region
gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
region | ||||||||||||
eastern | 3580.884612 | 4519.684445 | 5611.534264 | 6911.825724 | 8465.695242 | 10007.911765 | 10900.209963 | 11375.852107 | 8250.514199 | 9395.008143 | 10864.013821 | 14100.916656 |
northern | 8401.571825 | 9890.985483 | 11817.031712 | 14359.783322 | 17164.045376 | 19570.072280 | 22091.364430 | 25661.659678 | 26008.794966 | 29627.830970 | 33111.979754 | 37576.646170 |
southern | 3841.112208 | 4876.082568 | 6170.642960 | 8222.632153 | 11163.775519 | 12965.228980 | 14371.479317 | 16033.013775 | 18591.368087 | 20377.263280 | 23822.183125 | 26359.710762 |
western | 8439.962345 | 10434.519899 | 12191.949119 | 14232.125299 | 17359.111144 | 19305.504057 | 20693.785153 | 22798.712417 | 25344.134586 | 27914.214806 | 31703.386229 | 35080.387365 |
Chaining#
In Python, chaining refers to combining a number of operations in one command, using a sequence of methods. We can perform the above split-apply-combine procedure in a single step as follows. Note that because we don’t assign the output to a variable name, it is displayed as output but not saved.
df.groupby('region').mean()
gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
region | ||||||||||||
eastern | 3580.884612 | 4519.684445 | 5611.534264 | 6911.825724 | 8465.695242 | 10007.911765 | 10900.209963 | 11375.852107 | 8250.514199 | 9395.008143 | 10864.013821 | 14100.916656 |
northern | 8401.571825 | 9890.985483 | 11817.031712 | 14359.783322 | 17164.045376 | 19570.072280 | 22091.364430 | 25661.659678 | 26008.794966 | 29627.830970 | 33111.979754 | 37576.646170 |
southern | 3841.112208 | 4876.082568 | 6170.642960 | 8222.632153 | 11163.775519 | 12965.228980 | 14371.479317 | 16033.013775 | 18591.368087 | 20377.263280 | 23822.183125 | 26359.710762 |
western | 8439.962345 | 10434.519899 | 12191.949119 | 14232.125299 | 17359.111144 | 19305.504057 | 20693.785153 | 22798.712417 | 25344.134586 | 27914.214806 | 31703.386229 | 35080.387365 |
Exercises#
Selecting Individual Values#
Write an expression to find the Per Capita GDP of Serbia in 2007.
Click the button to reveal the solution
print(df.loc[‘Serbia’, ‘gdpPercap_2007’])
Extent of Slicing#
Do the two statements below produce the same output? (Hint: you might want to use the
.head()
method to remind yourself of the structure of the DataFrame)Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?
print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
Click the button to reveal the solution
Solution
No, they do not produce the same output! The output of the first statement is:
gdpPercap_1952 gdpPercap_1957
country
Albania 1601.056136 1942.284244
Austria 6137.076492 8842.598030
The second statement gives:
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
country
Albania 1601.056136 1942.284244 2312.888958
Austria 6137.076492 8842.598030 10750.721110
Belgium 8343.105127 9714.960623 10991.206760
Clearly, the second statement produces an additional column and an additional row compared to the first statement. What conclusion can we draw? We see that a numerical slice, 0:2, omits the final index (i.e. index 2) in the range provided, while a named slice, 'gdpPercap_1952':'gdpPercap_1962'
, includes the final element.
Reconstructing Data#
Explain what each line in the following short program does:
what is in df1
, df2
, etc.?
df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')
df2 = df1[df1['continent'] == 'Americas']
df3 = df2.drop('Puerto Rico')
df4 = df3.drop('continent', axis = 1)
df4.to_csv('result.csv')
Click the button to reveal the solution
Let’s go through this piece of code line by line.
df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')
This line loads the dataset containing the GDP data from all countries into a dataframe called
df1
. The index_col='country'
parameter selects which column to use as the
row labels in the dataframe.
df2 = df1[df1['continent'] == 'Americas']
This line makes a selection: only those rows of df1
for which the ‘continent’ column matches ‘Americas’ are extracted. Notice how the Boolean expression inside the brackets, df1['continent'] == 'Americas'
, is used to select only those rows where the expression is true. Try printing this expression! Can you print also its individual True/False elements? (hint: first assign the expression to a variable)
df3 = df2.drop('Puerto Rico')
As the syntax suggests, this line drops the row from df2
where the label is ‘Puerto Rico’. The resulting dataframe df3
has one row less than the original dataframe df2
.
df4 = df3.drop('continent', axis = 1)
Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we need to specify also the axis
parameter (we want to drop the second column which has index 1).
df4.to_csv('result.csv')
The final step is to write the data that we have been working on to a csv file. Pandas makes this easy with the to_csv()
function. The only required argument to the function is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.
Selecting Indices#
Explain in simple terms what idxmin
and idxmax
do. When would you use these methods?
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.idxmin()
data.idxmax()
Click the button to reveal the solution
For each column in data
, idxmin
will return the index value corresponding to each column’s minimum;
idxmax
will do accordingly the same for each column’s maximum value.
You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.
Practice with Selection#
From the previous exercise, the Gapminder GDP data for Europe should be loaded in as data
. Using this DataFrame, write an expression to select each of the following:
GDP per capita for all countries in 1982
Click the button to reveal the solution
data['gdpPercap_1982']
GDP per capita for Denmark for all years
Click the button to reveal the solution
data.loc['Denmark',:]
or
data.loc['Denmark',:'gdpPercap_2007']
GDP per capita for all countries for years after 1985
Click the button to reveal the solution
data.loc[:,'gdpPercap_1985':]
Note that pandas is smart enough to recognize the number at the end of the column label and does not give you an error, although no column named gdpPercap_1985
actually exists. This is useful if new columns are added to the CSV file later.
GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952
Click the button to reveal the solution
data[‘gdpPercap_2007’]/data[‘gdpPercap_1952’]
Summary of Key Points:#
pandas DataFrames are a powerful way of storing and working with tabular (row/column) data
pandas columns and rows can have names
pandas row names are called indexes which are numeric by default, but can be given other labels
Use the
.iloc[]
method with a DataFrame to select values by integer location, using [row, column] formatUse the
.loc[]
method with a DataFrame to select rows and/or columns, using named slicesUse
:
on its own to mean all columns or all rowsResult of slicing can be used in further operations
Use comparisons to select data based on value
Select values or
NaN
using a Boolean maskuse split-apply-combine to derive analytics from groupings within a DataFrame
This section was adapted from Aaron J. Newman’s Data Science for Psychology and Neuroscience - in Python and Software Carpentry’s Plotting and Programming in Python workshop.