The core datasets used for this task were taken from Kaggle.
This is a well-analysed dataset with over 100 kernels pinned against it on Kaggle. To enrich my analysis, and make it original, I also scraped some other datasets:
- I used a dataset of capital city latitudes & longitudes from Techslides.
- I manually created a new dataset of Olympic host city altitudes, latitudes & longitudes from mapcoordinates.
I used a Jupyter Notebook (Python 3) to narrate my analysis.
ON FIRST RUN: execute the code below to install the geopy and fuzzywuzzy dependencies (fuzzywuzzy is used later for fuzzy matching of country names).
import sys
!{sys.executable} -m pip install geopy fuzzywuzzy
The following packages were used for the analysis:
#load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import unittest
from geopy.distance import geodesic  # vincenty was removed in geopy 2.x; geodesic supersedes it
%matplotlib inline
My analysis is laid out as follows. Following a review of the data available and a brainstorming session, I decided to focus my efforts on the following areas:
- Summary statistics to explore how athletes have changed over the years
#Loading athlete data
data = pd.read_csv('athlete_events.csv', index_col = 'ID')
data.head(10)
A couple of early observations upon loading the data:
- NAs in the Medal column have been converted to NaN; this needs correcting.
- The Team variable can combine more than one country, which is interesting but unhelpful; I will look to merge with the NOC table to clean this up.
- It looks like Games is just a concatenation of Year + Season; it should be removed once that's checked.
- There are some NaNs in the Height & Weight columns, which will need cleaning if those columns are used.
- There are some interesting sports from way back when, including Tug-Of-War; maybe these would be interesting to look at!
#Replace NA in Medal column with None
data['Medal'] = data['Medal'].fillna("None")
data['Medal'].head()
I want to use the NOC dataset to get a better representation of where an athlete is from than the 'Team' variable provides.
#Loading in the NOC dataset
NOC = pd.read_csv('noc_regions.csv')
Later in the analysis, I fuzzy-match country names in order to join with the capital cities metadata. After doing that, I came back and changed this part of the script to apply the replacements the fuzzy-matching algorithm highlighted, i.e. before the NOC table is merged into the main data frame.
NOC['region'].replace(['US', 'USA', 'UK', 'Saint Kitts', 'Trinidad', 'Boliva', 'Virgin Islands, US', 'Curacao'],
                      ['United States', 'United States', 'United Kingdom', 'Saint Kitts and Nevis',
                       'Trinidad and Tobago', 'Bolivia', 'US Virgin Islands', 'Curaçao'],
                      inplace=True)
data = pd.merge(data, NOC, left_on='NOC', right_on='NOC')
#before discarding, check that Games really is just Year + Season for every row, then discard it.
if data['Games'].equals(data['Year'].map(str) + " " + data['Season']):
    data = data.drop(['notes', 'Games'], axis=1)
else:
    data = data.drop(['notes'], axis=1)
data.sample()
We can see above that Games == Year + Season for all values, and so is removed from the dataset.
# assign a list of (ColumnName, NullCount) tuples to null_list using list comprehension
null_list = [(col, data[col].isnull().sum()) for col in data.columns]
print(null_list)
df_nulls = pd.DataFrame(null_list, columns=['Column', 'NullCount'])
df_nulls
We can see that only the Age, Height, Weight and the newly created region columns contain null values. My instinct tells me the age/height/weight nulls exist because these measurements weren't recorded before a certain date; a quick check of that hunch is sketched just below. We'll then have to investigate the region nulls a bit further.
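Here is that sketch (not part of the main pipeline), using Height as the example column:
# Quick check of the hunch: the share of missing Height values per Games year.
# If measurements only started being recorded at some point, early years should be close to 1.0.
height_nulls_by_year = data.groupby('Year')['Height'].apply(lambda s: s.isnull().mean())
print(height_nulls_by_year.head(10))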
As for the region nulls: we can see below that these are caused by the Refugee Olympic Athletes team, unknown entries and Tuvalu. I will remove these rows from the dataset.
#find out which points haven't merged properly
missing_region_datapoints = data[data.region.isnull()]
print(len(missing_region_datapoints))
missing_region_datapoints.sample(5)
# remove rows with NaN in the region column, using prints to check
print(len(data))
data = data[pd.notnull(data['region'])]
print(len(data))
I chose to use external datasets to enrich the analysis and build novel feature sets. This section shows how I went about it; most of the code here is wrapped in unit tests to ensure the transformations have been carried out as expected.
Firstly, I created a map of host cities to host countries, which wasn't originally in the dataset. This will be useful when looking at the 'home advantage' hypothesis.
#manually created a dictionary map from city -> country
city_to_country = { 'Barcelona':'Spain', 'London':'UK', 'Antwerpen':'Belgium', 'Paris':'France', 'Calgary':'Canada',
'Albertville':'France', 'Lillehammer':'Norway', 'Los Angeles':'USA', 'Salt Lake City':'USA',
'Helsinki':'Finland', 'Lake Placid':'USA', 'Sydney':'Australia', 'Atlanta':'USA', 'Stockholm':'Sweden',
'Sochi':'Russia', 'Nagano':'Japan', 'Torino':'Italy', 'Beijing':'China', 'Rio de Janeiro':'Brazil', 'Athina':'Greece',
'Squaw Valley':'USA', 'Innsbruck':'Austria', 'Sarajevo': 'Bosnia and Herzegovina', 'Mexico City':'Mexico', 'Munich': 'Germany',
'Seoul': 'South Korea', 'Berlin': 'Germany', 'Oslo': 'Norway', "Cortina d'Ampezzo":'Italy', 'Melbourne': 'Australia', 'Roma': 'Italy',
'Amsterdam': 'Netherlands', 'Montreal': 'Canada', 'Moskva': 'Russia', 'Tokyo':'Japan', 'Vancouver':'Canada', 'Grenoble':'France',
'Sapporo':'Japan', 'Chamonix':'France', 'St. Louis':'USA', 'Sankt Moritz':'Switzerland',
'Garmisch-Partenkirchen':'Germany'}
#simple dictionary map
def map_cities_to_countries(data, dict_city_to_country):
    data['Host_Country'] = data['City'].map(dict_city_to_country)
    return data
# unit test the mapping to ensure it did not introduce nulls into the dataset.
class TestMapping(unittest.TestCase):
    # Create the unit test
    def test_mapping(self):
        map_cities_to_countries(data, city_to_country)
        # check the new Host_Country column (not Medal) for nulls
        count_na = data['Host_Country'].isnull().sum()
        # Test that the mapping has not introduced NA's
        self.assertEqual(0, count_na)

# Run the test
unittest.main(argv=['ignored', '-v'], exit=False)
#have a look at the result - we now have a new variable Host_Country
data.head()
#read the host cities metadata; header=0 swaps the file's own header row for our column names (and keeps the lat/long/altitude columns numeric)
host_cities_metadata = pd.read_csv('hostcities.csv', header=0, names=['HostCity', 'HostLatitude', 'HostLongitude', 'HostAltitude'])
host_cities_metadata.head()
#merge the host cities metadata into the main data frame
data = pd.merge(data, host_cities_metadata, left_on='City', right_on='HostCity')
data.head(5)
missing_HostLongitude_datapoints = data[data.HostLongitude.isnull()]
missing_HostLongitude_datapoints
The empty dataframe above shows the merge was successful.
This dataset contains the CountryName, CapitalName, CapitalLatitude, CapitalLongitude, CountryCode (not used) and ContinentName. My plan is to merge this dataset on the athletes' 'region' variable, which we originally got from the NOC dataset. I'm doing this in preparation for answering the 'distance travelled' research question; my hypothesis is that athletes who travel the furthest for the games are most disadvantaged, due to jet lag, acclimatisation, etc. I appreciate that using the capital city lat/long of an athlete's country is not ideal (they may not live in the capital, or might live abroad), but it serves as an OK proxy for this analysis.
#load in the data; header=0 swaps the file's own header row for our column names (and keeps the lat/long columns numeric)
names = ['CountryName', 'CapitalName', 'CapitalLatitude', 'CapitalLongitude', 'CountryCode', 'ContinentName']
capital_cities_metadata = pd.read_csv('country-capitals.csv', header=0, names=names)
# describe method to understand the shape of the dataset.
capital_cities_metadata.describe()
Before I merge the capital cities metadata into the main 'data' frame, I'm going to perform an outer join and store the result in a new variable called data_new. In doing so, I can identify which country names are not joining properly. This then becomes a fuzzy-matching problem between two lists of strings from different sources.
# Creating my merged data frame
data_new = data.merge(capital_cities_metadata,
left_on='region',
right_on='CountryName',
how='outer',
)
#select the rows with a missing capital latitude - these are the failed joins
missing_latitude = data_new[data_new.CapitalLatitude.isnull()]
#Displaying results
missing_latitude.reset_index(inplace=True, drop=True)
missing_latitude.sample(40)
missing_latitude['region'].unique()
To facilitate as clean a merge as possible, I used a fuzzy-matching algorithm from the fuzzywuzzy package. This highlighted inconsistencies in the strings, which I then dealt with earlier in the analysis (up the page) before re-running the script; the string replacements needed to happen prior to merging the NOC and main data frames.
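For intuition before defining the matcher: fuzz.ratio scores string similarity from 0 to 100. A minimal sketch using one of the mismatches fixed earlier (exact scores can vary slightly by fuzzywuzzy version):
from fuzzywuzzy import fuzz
# the NOC spelling that failed to join, versus the capital-cities spelling
print(fuzz.ratio('Boliva', 'Bolivia'))        # high (~92), so safe to flag as a match
print(fuzz.ratio('Boliva', 'United States'))  # low, clearly not a match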
from fuzzywuzzy import fuzz
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # return an empty name when there is no match
    max_name = ""
    # iterate over all names in the other list
    for name2 in list_names:
        # find the fuzzy match score
        score = fuzz.ratio(name, name2)
        # check that we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)
# List of dicts for easy dataframe creation
dict_list = []
# iterate over the regions without a capital-city match, found above
for name in missing_latitude.region.unique():
    # Use our method to find the best match; we can set a threshold here
    match = match_name(name, capital_cities_metadata.CountryName, 50)
    # New dict for storing data
    dict_ = {}
    dict_.update({"country_name": name})
    dict_.update({"match_name": match[0]})
    dict_.update({"score": match[1]})
    dict_list.append(dict_)
merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
The above 'remainder' from the fuzzy matching will be discarded from the dataset (given the time constraints).
#merge the capital cities metadata into the main data frame
data = pd.merge(data, capital_cities_metadata, left_on='region', right_on='CountryName')
data.head()
# sanity check: the inner merge above should not have left NaNs in the region column; the prints confirm the row counts
print(len(data))
data = data[pd.notnull(data.region)]
print(len(data))
Now that we have lat/long pairs for 1) where the athlete lives and 2) where the games were held, we can calculate the distance between them using GeoPy.
The Vincenty distance is the geodesic distance between two points on the Earth's surface, taking the Earth's ellipsoidal shape into account. Recent versions of GeoPy replace the vincenty function with the more robust geodesic, which I use here while keeping 'vincenty_distance' as the column name.
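As a quick illustration of the GeoPy API before wrapping it (a minimal sketch with illustrative coordinates):
from geopy.distance import geodesic
# geodesic(point_a, point_b) takes (latitude, longitude) tuples and returns a
# Distance object; .km gives the distance in kilometres.
london = (51.5074, -0.1278)  # illustrative coordinates
paris = (48.8566, 2.3522)
print(geodesic(london, paris).km)  # roughly 340 km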
def calculate_vincenty_distance(distances):
    '''
    This function first creates (Latitude, Longitude) tuples for both the athlete's home capital and the Olympic host city.
    Then, it calculates a new column 'vincenty_distance' using np.vectorize (as the geodesic method requires two inputs).
    Inputs: distances - dataframe with cols = ['CapitalLatitude', 'CapitalLongitude', 'HostLatitude', 'HostLongitude']
    Outputs: same dataframe with new column ['vincenty_distance']
    '''
    distances['AthleteLatLong'] = list(zip(distances.CapitalLatitude, distances.CapitalLongitude))
    distances['HostCityLatLong'] = list(zip(distances.HostLatitude, distances.HostLongitude))
    distances['vincenty_distance'] = np.vectorize(geodesic)(distances['AthleteLatLong'], distances['HostCityLatLong'])
    return distances
# create a working copy away from the original data, then concat the result back into the original dataframe
new_data = data.copy()
distances = new_data.loc[:, ['CapitalLongitude', 'CapitalLatitude', 'HostLongitude', 'HostLatitude']]
#normally I'd use one class for all unit tests, but I've created a new one here for code readability in a notebook.
class TestCalculations(unittest.TestCase):
    # Create the unit test
    def test_vincenty_calculation(self):
        #create test tuples of data from the first row of the 'data' dataframe
        athleteLatLong0 = (39.91666667, 116.383333)  # first row
        hostLatLong0 = (41.3828939, 2.1774322)
        #calculate the geodesic distance directly on the test data
        distanceTest = geodesic(athleteLatLong0, hostLatLong0).km
        #calculate the distance using our function calculate_vincenty_distance
        calculate_vincenty_distance(distances)
        # get the result of the first row (a geopy Distance object) in km
        function_output = distances.loc[0, 'vincenty_distance'].km
        # Test that the results of the two methods assert equal.
        self.assertAlmostEqual(function_output, distanceTest, 5)

# Run the test
unittest.main(argv=['ignored', '-v'], exit=False)
#extract the float km value from each geopy Distance object (cleaner than slicing the ' km' suffix off its string form)
distances['vincenty_distance'] = distances['vincenty_distance'].apply(lambda d: d.km)
data = pd.concat([data, distances['vincenty_distance']], axis=1)
#take a look at the results
data.head()
# see what columns we now have
data.columns
cols_to_keep = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country',
'CapitalName', 'ContinentName', 'vincenty_distance', 'HostAltitude']
clean_data = pd.DataFrame(data, columns=cols_to_keep)
clean_data.head()
clean_data.columns = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
'City', 'Sport', 'Event', 'Medal', 'AthleteCountry', 'HostCountry',
'AthletesHomeCapital', 'AthletesHomeContinent', 'DistanceTravelled', 'HostCityAltitude']
clean_data.head()
df_medals = pd.get_dummies(clean_data['Medal'])
# Join the dummy variables to the main dataframe
clean_data = pd.concat([clean_data, df_medals], axis=1)
pd.options.display.max_columns = None
clean_data.head()
clean_data['TotalMedals'] = clean_data['Gold'] + clean_data['Silver'] + clean_data['Bronze']
clean_data['AnyMedals'] = np.where(clean_data.loc[:,'Medal'] == 'None', 0, 1)
clean_data.sample(10)
#Proportion of male vs female athletes at each games
ProportionMvW = data.groupby(['Year', 'Sex']).agg({'Name': 'count'})
percentages = ProportionMvW.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
percentages = percentages.reset_index()
a4_dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x="Year", y="Name", hue="Sex", data=percentages, palette="Set2")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
ax.set_title('Proportion of men and women competing in the olympic games')
ax.set_ylabel('Percentage of athletes for each gender')
ax.legend(loc="upper right")
plt.show()
Insights:
The visualisation below shows how the distribution of athletes' ages has changed since 1896, split by sex and by Summer/Winter Olympics.
sns.set(rc={'figure.figsize':(20,20)})
g = sns.FacetGrid(clean_data, col="Season", row="Sex", height=6)
g = g.map(sns.boxplot, "Year", 'Age')
g.set_xticklabels(rotation=90)
plt.show()
Insights:
#create a table of year,countryname & sex to be merged into a count of each later.
mini_country_table = clean_data[clean_data['Season'] == 'Summer'].loc[:, ['Year','AthleteCountry', 'Name', 'Sex']].drop_duplicates()
mini_country_table.head()
# Create a pivot table to count gender wise representation of each team in each year
CountAthletesByCountry = pd.pivot_table(mini_country_table,
index = ['Year', 'AthleteCountry'],
columns = 'Sex',
aggfunc = 'count').reset_index()
# rename columns as per column names in the 0th level
CountAthletesByCountry.columns = CountAthletesByCountry.columns.get_level_values(0)
# rename the columns appropriately
CountAthletesByCountry.columns = ['Year', 'AthleteCountry', 'Female_Athletes', 'Male_Athletes']
# get total athletes per team-year
CountAthletesByCountry['Total_Athletes'] = CountAthletesByCountry['Female_Athletes'] + \
CountAthletesByCountry['Male_Athletes']
# .copy() avoids pandas' SettingWithCopyWarning when mutating these slices below
uk_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "United Kingdom"].copy()
uk_athletes.fillna(0, inplace=True)
uk_athletes.set_index('Year', inplace=True)
swedish_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Sweden"].copy()
swedish_athletes.set_index('Year', inplace=True)
japanese_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Japan"].copy()
japanese_athletes.set_index('Year', inplace=True)
germany_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Germany"].copy()
germany_athletes.set_index('Year', inplace=True)
# Plot the values of male, female and total athletes using bar charts and line charts.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2, figsize = (20, 20), sharey = True)
fig.subplots_adjust(hspace = 0.3)
# Plot team UK's contingent size
ax1.bar(uk_athletes.index.values, uk_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax1.bar(uk_athletes.index.values, uk_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax1.plot(uk_athletes.index.values, uk_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax1.legend(loc="upper right")
ax1.set_title('UK Athletes :\nParticipation since 1896')
ax1.set_ylabel('Count of Athletes')
# Plot German athlete participation
ax2.bar(germany_athletes.index.values, germany_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax2.bar(germany_athletes.index.values, germany_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax2.plot(germany_athletes.index.values, germany_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax2.set_title('Germany Athletes :\nParticipation since 1896')
ax2.legend(loc="upper left")
ax2.set_ylabel('Count of Athletes')
# Plot Japan's contingent size
ax3.bar(japanese_athletes.index.values, japanese_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax3.bar(japanese_athletes.index.values, japanese_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax3.plot(japanese_athletes.index.values, japanese_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax3.set_title('Japanese Athletes :\nParticipation since 1896')
ax3.set_ylabel('Count of Athletes')
ax3.legend(loc="upper right")
# Plot team Swedens's contingent size
ax4.bar(swedish_athletes.index.values, swedish_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax4.bar(swedish_athletes.index.values, swedish_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax4.plot(swedish_athletes.index.values, swedish_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax4.set_title('Swedish Athletes :\nParticipation since 1896')
ax4.set_ylabel('Count of Athletes')
ax4.legend(loc="upper left")
plt.show()
Insights:
This is where we bring in the engineered feature 'DistanceTravelled': the geodesic distance between an athlete's home capital and where the games took place. My hypothesis is that distance travelled has an impact on the number of medals a country receives (due to physiological factors like jet lag, disturbed sleep, acclimatisation, etc.). I further hypothesise that this effect will be reduced in the modern era (I've chosen pre/post 1980 as the split).
To investigate these hypotheses, I've split the dataset (pre and post 1980) and calculated the correlation between this distance and medals won; a Spearman rank-correlation aside is also sketched below as a robustness check. Obviously, whether a country has a high medal count is a multivariate problem and would require a more in-depth analysis in practice.
# Get just the medalists
Subset_Medalists = clean_data['AnyMedals'] == 1
#Create a new dataset to be merged with the medal table dataset below
subset_distancetravelled = clean_data.loc[:, ['Year', 'AthleteCountry', 'DistanceTravelled']].drop_duplicates()
medal_tally = clean_data.groupby(['Year','AthleteCountry'])['TotalMedals'].agg('sum').reset_index()
medal_tally_byDistance = medal_tally.merge(subset_distancetravelled,
left_on = ['Year', 'AthleteCountry'],
right_on = ['Year', 'AthleteCountry'],
how = 'left')
#create two datasets to compare correlations
Pre_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']<=1980]
Post_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']>1980]
medal_tally_byDistance.head()
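As the small robustness aside mentioned above (not part of the original analysis): medal tallies are heavily right-skewed, so a Spearman rank correlation is a useful cross-check against the Pearson correlations computed below.
# Spearman rank correlation between distance travelled and medal tally, pre and post 1980
print(Pre_1980_MedalTallyByDistance[['DistanceTravelled', 'TotalMedals']].corr(method='spearman').loc['DistanceTravelled', 'TotalMedals'])
print(Post_1980_MedalTallyByDistance[['DistanceTravelled', 'TotalMedals']].corr(method='spearman').loc['DistanceTravelled', 'TotalMedals'])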
selected_rows = Pre_1980_MedalTallyByDistance['TotalMedals'] > 0
correlation = Pre_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr().loc['DistanceTravelled', 'TotalMedals']
plt.plot(Pre_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'],
         Pre_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'],
         linestyle = 'none',
         marker = 'o',
         alpha = 0.4)
plt.xlabel('Distance Travelled')
plt.ylabel('Number of Medals')
plt.title('PRE-1980 - Distance travelled versus medal tally')
plt.text(np.nanpercentile(Pre_1980_MedalTallyByDistance['DistanceTravelled'], 99.6),
         max(Pre_1980_MedalTallyByDistance['TotalMedals']) - 50,
         "Correlation = " + str(correlation))
selected_rows = Post_1980_MedalTallyByDistance['TotalMedals'] > 0
correlation = Post_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr().loc['DistanceTravelled', 'TotalMedals']
plt.plot(Post_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'],
         Post_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'],
         linestyle = 'none',
         marker = 'o',
         alpha = 0.4)
plt.xlabel('Distance Travelled')
plt.ylabel('Number of Medals')
plt.title('POST-1980 - Distance travelled versus medal tally')
plt.text(np.nanpercentile(Post_1980_MedalTallyByDistance['DistanceTravelled'], 99.6),
         max(Post_1980_MedalTallyByDistance['TotalMedals']) - 50,
         "Correlation = " + str(correlation))
Insights:
This was an enjoyable task! I focused my efforts on data engineering and manufactured some novel features. The following insights were of most interest to me:
Given more time, I would like to:
It wouldn't be a proper Olympic analysis without a look at how the Jamaican bobsleigh team has fared over the years.
jamaicans = clean_data[(clean_data['AthleteCountry'] == 'Jamaica') & (clean_data['Sport'] == 'Bobsleigh') ]
jamaican_grouped = jamaicans.groupby(['Year','City'])['TotalMedals'].agg('sum').reset_index()
jamaican_grouped
future_games = {'Year': [2018, 2022], 'City': ["Pyeongchang", "Beijing"], 'TotalMedals': [0, np.nan]}
future_results = pd.DataFrame(future_games)
combined = pd.concat([jamaican_grouped, future_results])
combined
plt.bar(combined['Year'], combined['TotalMedals'])
plt.title('Jamaican bobsleigh medal tally by Olympic year', fontsize=20)
plt.xlabel('Olympic Year', fontsize=18)
plt.ylabel('Number of medals won', fontsize=16)
plt.ylim((0, 10))
plt.xticks(combined['Year'])
plt.show()
Hope you enjoyed reading!