Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I do statistical analysis of tabular data?

Objectives
  • Select individual values from a Pandas dataframe.

  • Select entire rows or entire columns from a dataframe.

  • Select a subset of both rows and columns from a dataframe in a single operation.

  • Select a subset of a dataframe by a single Boolean criterion.

Use DataFrame.loc[columns, rows] to select values by location using column and row names.

data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])
1601.056136
print(data.loc["Albania":"Germany", "gdpPercap_1952"])
country
Albania                   1601.056136
Austria                   6137.076492
Belgium                   8343.105127
Bosnia and Herzegovina     973.533195
Bulgaria                  2444.286648
Croatia                   3119.236520
Czech Republic            6876.140250
Denmark                   9692.385245
Finland                   6424.519071
France                    7029.809327
Germany                   7144.114393
Name: gdpPercap_1952, dtype: float64
print(data.loc["Albania":"Belgium", "gdpPercap_1952":"gdpPercap_1967"])
         gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967
country
Albania     1601.056136     1942.284244     2312.888958     2760.196931
Austria     6137.076492     8842.598030    10750.721110    12834.602400
Belgium     8343.105127     9714.960623    10991.206760    13149.041190

Use DataFrame.iloc[columns, rows] to select values by numerical indices.

print(data.iloc[15:20, 2:5])
             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country
Italy           8243.582340    10022.401310    12269.273780
Montenegro      4649.593785     5907.850937     7778.414017
Netherlands    12790.849560    15363.251360    18794.745670
Norway         13450.401510    16361.876470    18965.055510
Poland          5338.752143     6557.152776     8006.506993

Note that the second index is NOT inclusive when using .iloc. So [1:3] will only return values at index 1 and 2 (typical python behavior). However, when using .loc, the slicing is inclusive at both ends.

Two ways to access specific columns directly, without .loc

print(data["gdpPercap_1952"])
country
Albania                    1601.056136
Austria                    6137.076492
Belgium                    8343.105127
Bosnia and Herzegovina      973.533195
...
Spain                      3834.034742
Sweden                     8527.844662
Switzerland               14734.232750
Turkey                     1969.100980
United Kingdom             9979.508487
Name: gdpPercap_1952, dtype: float64
print(data[["gdpPercap_1952", "gdpPercap_1972"]])
                        gdpPercap_1952  gdpPercap_1972
country
Albania                    1601.056136     3313.422188
Austria                    6137.076492    16661.625600
Belgium                    8343.105127    16672.143560
Bosnia and Herzegovina      973.533195     2860.169750
...
Spain                      3834.034742    10638.751310
Sweden                     8527.844662    17832.024640
Switzerland               14734.232750    27195.113040
Turkey                     1969.100980     3450.696380
United Kingdom             9979.508487    15895.116410
print(data.gdpPercap_1952)
country
Albania                    1601.056136
Austria                    6137.076492
Belgium                    8343.105127
Bosnia and Herzegovina      973.533195
...
Spain                      3834.034742
Sweden                     8527.844662
Switzerland               14734.232750
Turkey                     1969.100980
United Kingdom             9979.508487
Name: gdpPercap_1952, dtype: float64

Keep in mind that the syntax presented in this section represent shortcuts, which can lead to confusion when first starting out. You should stick to .loc and .iloc until you are comfortable with pandas

Result of slicing can be used in further operations.

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())
gdpPercap_1962    13450.40151
gdpPercap_1967    16361.87647
gdpPercap_1972    18965.05551
dtype: float64
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())
gdpPercap_1962    4649.593785
gdpPercap_1967    5907.850937
gdpPercap_1972    7778.414017
dtype: float64

Use comparisons to select data based on value.

# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)
Subset of data:
             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country
Italy           8243.582340    10022.401310    12269.273780
Montenegro      4649.593785     5907.850937     7778.414017
Netherlands    12790.849560    15363.251360    18794.745670
Norway         13450.401510    16361.876470    18965.055510
Poland          5338.752143     6557.152776     8006.506993

Where are values large?
            gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy                False           True           True
Montenegro           False          False          False
Netherlands           True           True           True
Norway                True           True           True
Poland               False          False          False

Select values or NaN using a Boolean mask.

mask = subset > 10000
print(subset[mask])
             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country
Italy                   NaN     10022.40131     12269.27378
Montenegro              NaN             NaN             NaN
Netherlands     12790.84956     15363.25136     18794.74567
Norway          13450.40151     16361.87647     18965.05551
Poland                  NaN             NaN             NaN
print(subset[subset > 10000].describe())
       gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
count        2.000000        3.000000        3.000000
mean     13120.625535    13915.843047    16676.358320
std        466.373656     3408.589070     3817.597015
min      12790.849560    10022.401310    12269.273780
25%      12955.737547    12692.826335    15532.009725
50%      13120.625535    15363.251360    18794.745670
75%      13285.513523    15862.563915    18879.900590
max      13450.401510    16361.876470    18965.055510

Selection of Specific Rows

Load the Gapminder GDP data for Europe into a dataframe:

europe_df = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Display only the first and third rows of the data frame.

Hint

You can use a list with ‘iloc’ to select multiple, specific rows. Remember that python uses zero-based indexes, so rows 1 and 3 actually have indexes 0 and 2.

Solution

# You can specify rows by their indices using a list.
print(europe_df.iloc[[0, 2], :])

# Note that the following is equivalent:
print(europe_df.iloc[[0, 2]])

Selection of Individual Values

Write an expression to find the Per Capita GDP of Serbia in 2007.

Hint

Try using the ‘loc’ method

Solution

# You can specify a particular peice of data by referencing its coordinate headings
print(europe_df.loc["Serbia", "gdpPercap_2007"])

Extent of Slicing

  1. Do the two statements below produce the same output?
  2. Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?
print(data.iloc[0:2, 0:2])
print(data.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

Hint

Try it! Put the statements into a notebook cell.

Solution

         gdpPercap_1952  gdpPercap_1957
country                                
Albania     1601.056136     1942.284244
Austria     6137.076492     8842.598030
         gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
country                                                
Albania     1601.056136     1942.284244     2312.888958
Austria     6137.076492     8842.598030    10750.721110
Belgium     8343.105127     9714.960623    10991.206760

Numbered indexing is not inclusive of the second half of the slice (i.e., index ‘2’, in this case, is not included), while label indexing is inclusive.

Reconstructing Data

Explain what each line in the following short program does: what is in first, second, etc.?

first = pandas.read_csv('gapminder/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')

Solution

# Reads a CSV file into a new dataframe
first = pandas.read_csv('gapminder/gapminder_all.csv', index_col='country')

# Selects only those rows with the value 'Americas' in the 'continent' column 
second = first[first['continent'] == 'Americas']

# Remove the 'Puerto Rico' row
third = second.drop('Puerto Rico')

# Remove the 'continent' column
fourth = third.drop('continent', axis = 1)

# Write data to a new file
fourth.to_csv('result.csv')

Selecting Indices

Explain in simple terms what idxmin and idxmax do in the short program below. When might you use these methods?

data = pandas.read_csv('gapminder/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())

Hint

Don’t forget about the ‘help’ method

Compare the output of print(data.idxmin()) to the output of print(data)

Solution

idxmax’ returns the label of the row with the highest value in each column, and ‘idxmin’ returns the minimum.

These methods could be useful when looking for outliers in your data.

Practice with Selection

Write an expression to select each of the following from the Europe data:

  1. GDP per capita for all countries in 1982.
  2. GDP per capita for Denmark for all years.
  3. GDP per capita for all countries for years after 1985.
  4. Inflation of GDP per capita for each country between 2007 and 1952 (i.e., 2007 divided by 1952).

Hint

Everything can be done with the ‘loc’ method

Solution

# GDP per capita for all countries in 1982.
europe_df.loc[:, "gdpPercap_1982"]

# GDP per capita for Denmark for all years.
europe_df.loc["Denmark", :]

# GDP per capita for all countries for years *after* 1985.
europe_df.loc[:, "gdpPercap_1985":]

# GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
europe_df.loc[:, "gdpPercap_2007"] / europe_df.loc[:, "gdpPercap_1952"]

Changing Values in a Data Frame

You can select values to change using the same indexing strategies as for displaying values. Change the 1982 per-capita GDP of Denmark to a new value.

Hint

Try selecting the value using ‘.loc’ and then setting a new value using ‘=’.

Solution

# Any indexing command can be combined with '=' to change the value(s).
europe_df.loc['Denmark','gdpPercap_1982'] = 450000000

Key Points