Overview
Teaching: 15 min
Exercises: 15 minQuestions
How can I do statistical analysis of tabular data?
Objectives
Select individual values from a Pandas dataframe.
Select entire rows or entire columns from a dataframe.
Select a subset of both rows and columns from a dataframe in a single operation.
Select a subset of a dataframe by a single Boolean criterion.
DataFrame.loc[columns, rows]
to select values by location using column and row names..loc
is followed by square brackets []
, not parentheses ()
.data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])
1601.056136
:
notation to slice columns and/or rows (just like lists)print(data.loc["Albania":"Germany", "gdpPercap_1952"])
country
Albania 1601.056136
Austria 6137.076492
Belgium 8343.105127
Bosnia and Herzegovina 973.533195
Bulgaria 2444.286648
Croatia 3119.236520
Czech Republic 6876.140250
Denmark 9692.385245
Finland 6424.519071
France 7029.809327
Germany 7144.114393
Name: gdpPercap_1952, dtype: float64
print(data.loc["Albania":"Belgium", "gdpPercap_1952":"gdpPercap_1967"])
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
country
Albania 1601.056136 1942.284244 2312.888958 2760.196931
Austria 6137.076492 8842.598030 10750.721110 12834.602400
Belgium 8343.105127 9714.960623 10991.206760 13149.041190
DataFrame.iloc[columns, rows]
to select values by numerical indices.print(data.iloc[15:20, 2:5])
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
Note that the second index is NOT inclusive when using .iloc
. So
[1:3]
will only return values at index 1 and 2 (typical python
behavior). However, when using .loc
, the slicing is inclusive at
both ends.
.loc
[]
print(data["gdpPercap_1952"])
country
Albania 1601.056136
Austria 6137.076492
Belgium 8343.105127
Bosnia and Herzegovina 973.533195
...
Spain 3834.034742
Sweden 8527.844662
Switzerland 14734.232750
Turkey 1969.100980
United Kingdom 9979.508487
Name: gdpPercap_1952, dtype: float64
print(data[["gdpPercap_1952", "gdpPercap_1972"]])
gdpPercap_1952 gdpPercap_1972
country
Albania 1601.056136 3313.422188
Austria 6137.076492 16661.625600
Belgium 8343.105127 16672.143560
Bosnia and Herzegovina 973.533195 2860.169750
...
Spain 3834.034742 10638.751310
Sweden 8527.844662 17832.024640
Switzerland 14734.232750 27195.113040
Turkey 1969.100980 3450.696380
United Kingdom 9979.508487 15895.116410
.
print(data.gdpPercap_1952)
country
Albania 1601.056136
Austria 6137.076492
Belgium 8343.105127
Bosnia and Herzegovina 973.533195
...
Spain 3834.034742
Sweden 8527.844662
Switzerland 14734.232750
Turkey 1969.100980
United Kingdom 9979.508487
Name: gdpPercap_1952, dtype: float64
Keep in mind that the syntax presented in this section represent
shortcuts, which can lead to confusion when first starting out. You
should stick to .loc
and .iloc
until you are comfortable
with pandas
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())
gdpPercap_1962 13450.40151
gdpPercap_1967 16361.87647
gdpPercap_1972 18965.05551
dtype: float64
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())
gdpPercap_1962 4649.593785
gdpPercap_1967 5907.850937
gdpPercap_1972 7778.414017
dtype: float64
True
and False
.# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)
Subset of data:
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
Where are values large?
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy False True True
Montenegro False False False
Netherlands True True True
Norway True True True
Poland False False False
mask = subset > 10000
print(subset[mask])
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy NaN 10022.40131 12269.27378
Montenegro NaN NaN NaN
Netherlands 12790.84956 15363.25136 18794.74567
Norway 13450.40151 16361.87647 18965.05551
Poland NaN NaN NaN
print(subset[subset > 10000].describe())
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
count 2.000000 3.000000 3.000000
mean 13120.625535 13915.843047 16676.358320
std 466.373656 3408.589070 3817.597015
min 12790.849560 10022.401310 12269.273780
25% 12955.737547 12692.826335 15532.009725
50% 13120.625535 15363.251360 18794.745670
75% 13285.513523 15862.563915 18879.900590
max 13450.401510 16361.876470 18965.055510
Selection of Specific Rows
Load the Gapminder GDP data for Europe into a dataframe:
europe_df = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
Display only the first and third rows of the data frame.
Hint
You can use a list with ‘
iloc
’ to select multiple, specific rows. Remember that python uses zero-based indexes, so rows 1 and 3 actually have indexes 0 and 2.Solution
# You can specify rows by their indices using a list. print(europe_df.iloc[[0, 2], :]) # Note that the following is equivalent: print(europe_df.iloc[[0, 2]])
Selection of Individual Values
Write an expression to find the Per Capita GDP of Serbia in 2007.
Hint
Try using the ‘
loc
’ methodSolution
# You can specify a particular peice of data by referencing its coordinate headings print(europe_df.loc["Serbia", "gdpPercap_2007"])
Extent of Slicing
- Do the two statements below produce the same output?
- Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?
print(data.iloc[0:2, 0:2]) print(data.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
Hint
Try it! Put the statements into a notebook cell.
Solution
gdpPercap_1952 gdpPercap_1957 country Albania 1601.056136 1942.284244 Austria 6137.076492 8842.598030 gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 country Albania 1601.056136 1942.284244 2312.888958 Austria 6137.076492 8842.598030 10750.721110 Belgium 8343.105127 9714.960623 10991.206760
Numbered indexing is not inclusive of the second half of the slice (i.e., index ‘2’, in this case, is not included), while label indexing is inclusive.
Reconstructing Data
Explain what each line in the following short program does: what is in
first
,second
, etc.?first = pandas.read_csv('gapminder/gapminder_all.csv', index_col='country') second = first[first['continent'] == 'Americas'] third = second.drop('Puerto Rico') fourth = third.drop('continent', axis = 1) fourth.to_csv('result.csv')
Solution
# Reads a CSV file into a new dataframe first = pandas.read_csv('gapminder/gapminder_all.csv', index_col='country') # Selects only those rows with the value 'Americas' in the 'continent' column second = first[first['continent'] == 'Americas'] # Remove the 'Puerto Rico' row third = second.drop('Puerto Rico') # Remove the 'continent' column fourth = third.drop('continent', axis = 1) # Write data to a new file fourth.to_csv('result.csv')
Selecting Indices
Explain in simple terms what
idxmin
andidxmax
do in the short program below. When might you use these methods?data = pandas.read_csv('gapminder/gapminder_gdp_europe.csv', index_col='country') print(data.idxmin()) print(data.idxmax())
Hint
Don’t forget about the ‘
help
’ methodCompare the output of
print(data.idxmin())
to the output ofprint(data)
Solution
‘
idxmax
’ returns the label of the row with the highest value in each column, and ‘idxmin
’ returns the minimum.These methods could be useful when looking for outliers in your data.
Practice with Selection
Write an expression to select each of the following from the Europe data:
- GDP per capita for all countries in 1982.
- GDP per capita for Denmark for all years.
- GDP per capita for all countries for years after 1985.
- Inflation of GDP per capita for each country between 2007 and 1952 (i.e., 2007 divided by 1952).
Hint
Everything can be done with the ‘
loc
’ methodSolution
# GDP per capita for all countries in 1982. europe_df.loc[:, "gdpPercap_1982"] # GDP per capita for Denmark for all years. europe_df.loc["Denmark", :] # GDP per capita for all countries for years *after* 1985. europe_df.loc[:, "gdpPercap_1985":] # GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952. europe_df.loc[:, "gdpPercap_2007"] / europe_df.loc[:, "gdpPercap_1952"]
Changing Values in a Data Frame
You can select values to change using the same indexing strategies as for displaying values. Change the 1982 per-capita GDP of Denmark to a new value.
Hint
Try selecting the value using ‘
.loc
’ and then setting a new value using ‘=
’.Solution
# Any indexing command can be combined with '=' to change the value(s). europe_df.loc['Denmark','gdpPercap_1982'] = 450000000
Key Points
Use
DataFrame.loc[..., ...]
andDataFrame.iloc[..., ...]
to select values by location (name or index, respectively).Use
:
on its own to mean all columns or all rows.Result of slicing can be used in further operations.
Use comparisons to select data based on value.
Select values or NaN using a Boolean mask.