The core datasets used for this task were taken from Kaggle.
This is a well-analysed dataset with over 100 kernels pinned against it on Kaggle. To enrich my analysis, and make it original, I also scraped some other datasets:
- I used a dataset of capital city latitudes & longitudes from Techslides.
- I manually created a new dataset of Olympic host city altitudes, latitudes & longitudes from mapcoordinates.
I used a Jupyter Notebook (Python 3) to narrate my analysis.
ON FIRST RUN: execute the code below to install the geopy and fuzzywuzzy dependencies (fuzzywuzzy is used later for fuzzy matching of country names).
import sys
!{sys.executable} -m pip install geopy fuzzywuzzy
The following packages were used for the analysis:
#load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import unittest
from geopy.distance import geodesic  # vincenty was removed in geopy 2.x; geodesic supersedes it
%matplotlib inline
My analysis is laid out as follows. Following a review of the data available and a brainstorming session, I decided to focus my efforts on the following areas:
- Summary statistics to explore how athletes have changed over the years
#Loading athlete data
data = pd.read_csv('athlete_events.csv', index_col = 'ID')
data.head(10)
A couple of early observations upon loading the data:
- NAs in the Medal column have been converted to NaN; this needs correcting.
- The Team variable can combine more than one country, which is interesting but unhelpful; I will look to merge with the NOC table to clean this up.
- It looks like Games is just a concatenation of Year + Season; it should be removed once that's checked.
- There are some NaNs in the Height & Weight columns, which will need cleaning if those columns are used.
- There are some interesting sports from way back when, including Tug-Of-War; maybe these would be interesting to look at!
#Replace NA in Medal column with None
data['Medal'] = data['Medal'].fillna("None")
data['Medal'].head()
I want to use the NOC dataset to get a better representation of where an athlete is from than the 'Team' variable provides.
#Loading in the NOC dataset
NOC = pd.read_csv('noc_regions.csv')
Later in the analysis, I fuzzy-match country names in order to join with the capital cities metadata. After doing that, I came back and changed this part of the script to apply the replacements the fuzzy-matching algorithm highlighted, i.e. before the NOC table is merged into the main data frame.
NOC['region'].replace(['US', 'USA', 'UK', 'Saint Kitts', 'Trinidad', 'Boliva', 'Virgin Islands, US', 'Curacao'],
                      ['United States', 'United States', 'United Kingdom', 'Saint Kitts and Nevis',
                       'Trinidad and Tobago', 'Bolivia', 'US Virgin Islands', 'Curaçao'],
                      inplace=True)
data = pd.merge(data, NOC, left_on='NOC', right_on='NOC')
#before discarding, check that Games really is just Year + Season for every row, then discard it.
if data['Games'].equals(data['Year'].map(str) + " " + data['Season']):
    data = data.drop(['notes', 'Games'], axis=1)
else:
    data = data.drop(['notes'], axis=1)
data.sample()
We can see above that Games == Year + Season for all values, and so is removed from the dataset.
# assign a list of (ColumnName, NullCount) tuples to null_list using list comprehension
null_list = [(col, data[col].isnull().sum()) for col in data.columns]
print(null_list)
df_nulls = pd.DataFrame(null_list, columns=['Column', 'NullCount'])
df_nulls
We can see that only the Age, Height, Weight and the newly created region columns contain null values. My instinct tells me the age/height/weight nulls exist because these measurements weren't recorded before a certain date; a quick check of that hunch is sketched just below. We'll then have to investigate the region nulls a bit further.
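Here is that sketch (not part of the main pipeline), using Height as the example column:
# Quick check of the hunch: the share of missing Height values per Games year.
# If measurements only started being recorded at some point, early years should be close to 1.0.
height_nulls_by_year = data.groupby('Year')['Height'].apply(lambda s: s.isnull().mean())
print(height_nulls_by_year.head(10))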
As for the region nulls: we can see below that these are caused by the Refugee Olympic Athletes team, unknown entries and Tuvalu. I will remove these rows from the dataset.
#find out which points haven't merged properly
missing_region_datapoints = data[data.region.isnull()]
print(len(missing_region_datapoints))
missing_region_datapoints.sample(5)
# remove rows with NaN in the region column, using prints to check
print(len(data))
data = data[pd.notnull(data['region'])]
print(len(data))
I chose to use external datasets to enrich the analysis and build novel feature sets. This section shows how I went about it; most of the code here is wrapped in unit tests to ensure the transformations have been carried out as expected.
Firstly, I created a map of host cities to host countries, which wasn't originally in the dataset. This will be useful when looking at the 'home advantage' hypothesis.
#manually created a dictionary map from city -> country
city_to_country = { 'Barcelona':'Spain', 'London':'UK', 'Antwerpen':'Belgium', 'Paris':'France', 'Calgary':'Canada',
'Albertville':'France', 'Lillehammer':'Norway', 'Los Angeles':'USA', 'Salt Lake City':'USA',
'Helsinki':'Finland', 'Lake Placid':'USA', 'Sydney':'Australia', 'Atlanta':'USA', 'Stockholm':'Sweden',
'Sochi':'Russia', 'Nagano':'Japan', 'Torino':'Italy', 'Beijing':'China', 'Rio de Janeiro':'Brazil', 'Athina':'Greece',
'Squaw Valley':'USA', 'Innsbruck':'Austria', 'Sarajevo': 'Bosnia and Herzegovina', 'Mexico City':'Mexico', 'Munich': 'Germany',
'Seoul': 'South Korea', 'Berlin': 'Germany', 'Oslo': 'Norway', "Cortina d'Ampezzo":'Italy', 'Melbourne': 'Australia', 'Roma': 'Italy',
'Amsterdam': 'Netherlands', 'Montreal': 'Canada', 'Moskva': 'Russia', 'Tokyo':'Japan', 'Vancouver':'Canada', 'Grenoble':'France',
'Sapporo':'Japan', 'Chamonix':'France', 'St. Louis':'USA', 'Sankt Moritz':'Switzerland',
'Garmisch-Partenkirchen':'Germany'}
#simple dictionary map
def map_cities_to_countries(data, dict_city_to_country):
    data['Host_Country'] = data['City'].map(dict_city_to_country)
    return data
# unit test the mapping to ensure it did not introduce nulls into the dataset.
class TestMapping(unittest.TestCase):
    # Create the unit test
    def test_mapping(self):
        map_cities_to_countries(data, city_to_country)
        # check the new Host_Country column (not Medal) for nulls
        count_na = data['Host_Country'].isnull().sum()
        # Test that the mapping has not introduced NA's
        self.assertEqual(0, count_na)

# Run the test
unittest.main(argv=['ignored', '-v'], exit=False)
#have a look at the result - we now have a new variable Host_Country
data.head()
#read the host cities metadata; header=0 swaps the file's own header row for our column names (and keeps the lat/long/altitude columns numeric)
host_cities_metadata = pd.read_csv('hostcities.csv', header=0, names=['HostCity', 'HostLatitude', 'HostLongitude', 'HostAltitude'])
host_cities_metadata.head()
#merge the host cities metadata into the main data frame
data = pd.merge(data, host_cities_metadata, left_on='City', right_on='HostCity')
data.head(5)
missing_HostLongitude_datapoints = data[data.HostLongitude.isnull()]
missing_HostLongitude_datapoints
The empty dataframe above shows the merge was successful.
This dataset contains the CountryName, CapitalName, CapitalLatitude, CapitalLongitude, CountryCode (not used) and ContinentName. My plan is to merge this dataset on the athletes' 'region' variable, which we originally got from the NOC dataset. I'm doing this in preparation for answering the 'distance travelled' research question; my hypothesis is that athletes who travel the furthest for the games are most disadvantaged, due to jet lag, acclimatisation, etc. I appreciate that using the capital city lat/long of an athlete's country is not ideal (they may not live in the capital, or might live abroad), but it serves as an OK proxy for this analysis.
#load in the data; header=0 swaps the file's own header row for our column names (and keeps the lat/long columns numeric)
names = ['CountryName', 'CapitalName', 'CapitalLatitude', 'CapitalLongitude', 'CountryCode', 'ContinentName']
capital_cities_metadata = pd.read_csv('country-capitals.csv', header=0, names=names)
# describe method to understand the shape of the dataset.
capital_cities_metadata.describe()
Before I merge the capital cities metadata into the main 'data' frame, I'm going to perform an outer join and store the result in a new variable called data_new. In doing so, I can identify which country names are not joining properly. This then becomes a fuzzy-matching problem between two lists of strings from different sources.
# Creating my merged data frame
data_new = data.merge(capital_cities_metadata,
left_on='region',
right_on='CountryName',
how='outer',
)
#select the rows with a missing capital latitude - these are the failed joins
missing_latitude = data_new[data_new.CapitalLatitude.isnull()]
#Displaying results
missing_latitude.reset_index(inplace=True, drop=True)
missing_latitude.sample(40)
missing_latitude['region'].unique()
To facilitate as clean a merge as possible, I used a fuzzy-matching algorithm from the fuzzywuzzy package. This highlighted inconsistencies in the strings, which I then dealt with earlier in the analysis (up the page) before re-running the script; the string replacements needed to happen prior to merging the NOC and main data frames.
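For intuition before defining the matcher: fuzz.ratio scores string similarity from 0 to 100. A minimal sketch using one of the mismatches fixed earlier (exact scores can vary slightly by fuzzywuzzy version):
from fuzzywuzzy import fuzz
# the NOC spelling that failed to join, versus the capital-cities spelling
print(fuzz.ratio('Boliva', 'Bolivia'))        # high (~92), so safe to flag as a match
print(fuzz.ratio('Boliva', 'United States'))  # low, clearly not a match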
from fuzzywuzzy import fuzz
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # return an empty name when there is no match
    max_name = ""
    # iterate over all names in the other list
    for name2 in list_names:
        # find the fuzzy match score
        score = fuzz.ratio(name, name2)
        # check that we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)
# List of dicts for easy dataframe creation
dict_list = []
# iterate over the regions without a capital-city match, found above
for name in missing_latitude.region.unique():
    # Use our method to find the best match; we can set a threshold here
    match = match_name(name, capital_cities_metadata.CountryName, 50)
    # New dict for storing data
    dict_ = {}
    dict_.update({"country_name": name})
    dict_.update({"match_name": match[0]})
    dict_.update({"score": match[1]})
    dict_list.append(dict_)
merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
The above 'remainder' from the fuzzy matching will be discarded from the dataset (given the time constraints).
#merge the capital cities metadata into the main data frame
data = pd.merge(data, capital_cities_metadata, left_on='region', right_on='CountryName')
data.head()
# sanity check: the inner merge above should not have left NaNs in the region column; the prints confirm the row counts
print(len(data))
data = data[pd.notnull(data.region)]
print(len(data))
Now that we have lat/long pairs for 1) where the athlete lives and 2) where the games were held, we can calculate the distance between them using GeoPy.
The Vincenty distance is the geodesic distance between two points on the Earth's surface, taking the Earth's ellipsoidal shape into account. Recent versions of GeoPy replace the vincenty function with the more robust geodesic, which I use here while keeping 'vincenty_distance' as the column name.
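As a quick illustration of the GeoPy API before wrapping it (a minimal sketch with illustrative coordinates):
from geopy.distance import geodesic
# geodesic(point_a, point_b) takes (latitude, longitude) tuples and returns a
# Distance object; .km gives the distance in kilometres.
london = (51.5074, -0.1278)  # illustrative coordinates
paris = (48.8566, 2.3522)
print(geodesic(london, paris).km)  # roughly 340 km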
def calculate_vincenty_distance(distances):
    '''
    This function first creates (Latitude, Longitude) tuples for both the athlete's home capital and the Olympic host city.
    Then, it calculates a new column 'vincenty_distance' using np.vectorize (as the geodesic method requires two inputs).
    Inputs: distances - dataframe with cols = ['CapitalLatitude', 'CapitalLongitude', 'HostLatitude', 'HostLongitude']
    Outputs: same dataframe with new column ['vincenty_distance']
    '''
    distances['AthleteLatLong'] = list(zip(distances.CapitalLatitude, distances.CapitalLongitude))
    distances['HostCityLatLong'] = list(zip(distances.HostLatitude, distances.HostLongitude))
    distances['vincenty_distance'] = np.vectorize(geodesic)(distances['AthleteLatLong'], distances['HostCityLatLong'])
    return distances
# create a working copy away from the original data, then concat the result back into the original dataframe
new_data = data.copy()
distances = new_data.loc[:, ['CapitalLongitude', 'CapitalLatitude', 'HostLongitude', 'HostLatitude']]
#normally I'd use one class for all unit tests, but I've created a new one here for code readability in a notebook.
class TestCalculations(unittest.TestCase):
    # Create the unit test
    def test_vincenty_calculation(self):
        #create test tuples of data from the first row of the 'data' dataframe
        athleteLatLong0 = (39.91666667, 116.383333)  # first row
        hostLatLong0 = (41.3828939, 2.1774322)
        #calculate the geodesic distance directly on the test data
        distanceTest = geodesic(athleteLatLong0, hostLatLong0).km
        #calculate the distance using our function calculate_vincenty_distance
        calculate_vincenty_distance(distances)
        # get the result of the first row (a geopy Distance object) in km
        function_output = distances.loc[0, 'vincenty_distance'].km
        # Test that the results of the two methods assert equal.
        self.assertAlmostEqual(function_output, distanceTest, 5)

# Run the test
unittest.main(argv=['ignored', '-v'], exit=False)
#extract the float km value from each geopy Distance object (cleaner than slicing the ' km' suffix off its string form)
distances['vincenty_distance'] = distances['vincenty_distance'].apply(lambda d: d.km)
data = pd.concat([data, distances['vincenty_distance']], axis=1)
#take a look at the results
data.head()
# see what columns we now have
data.columns
cols_to_keep = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country',
'CapitalName', 'ContinentName', 'vincenty_distance', 'HostAltitude']
clean_data = pd.DataFrame(data, columns=cols_to_keep)
clean_data.head()
clean_data.columns = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
'City', 'Sport', 'Event', 'Medal', 'AthleteCountry', 'HostCountry',
'AthletesHomeCapital', 'AthletesHomeContinent', 'DistanceTravelled', 'HostCityAltitude']
clean_data.head()
df_medals = pd.get_dummies(clean_data['Medal'])
# Join the dummy variables to the main dataframe
clean_data = pd.concat([clean_data, df_medals], axis=1)
pd.options.display.max_columns = None
clean_data.head()
clean_data['TotalMedals'] = clean_data['Gold'] + clean_data['Silver'] + clean_data['Bronze']
clean_data['AnyMedals'] = np.where(clean_data.loc[:,'Medal'] == 'None', 0, 1)
clean_data.sample(10)
#Proportion of male vs female athletes at each games
ProportionMvW = data.groupby(['Year', 'Sex']).agg({'Name': 'count'})
percentages = ProportionMvW.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
percentages = percentages.reset_index()
a4_dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x="Year", y="Name", hue="Sex", data=percentages, palette="Set2")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
ax.set_title('Proportion of men and women competing in the olympic games')
ax.set_ylabel('Percentage of athletes for each gender')
ax.legend(loc="upper right")
plt.show()
Insights:
The visualisation below shows how the distribution of athletes' ages has changed since 1896, split by sex and by Summer/Winter Olympics.
sns.set(rc={'figure.figsize':(20,20)})
g = sns.FacetGrid(clean_data, col="Season", row="Sex", height=6)
g = g.map(sns.boxplot, "Year", 'Age')
g.set_xticklabels(rotation=90)
plt.show()
Insights:
#create a table of year,countryname & sex to be merged into a count of each later.
mini_country_table = clean_data[clean_data['Season'] == 'Summer'].loc[:, ['Year','AthleteCountry', 'Name', 'Sex']].drop_duplicates()
mini_country_table.head()
# Create a pivot table to count gender wise representation of each team in each year
CountAthletesByCountry = pd.pivot_table(mini_country_table,
index = ['Year', 'AthleteCountry'],
columns = 'Sex',
aggfunc = 'count').reset_index()
# rename columns as per column names in the 0th level
CountAthletesByCountry.columns = CountAthletesByCountry.columns.get_level_values(0)
# rename the columns appropriately
CountAthletesByCountry.columns = ['Year', 'AthleteCountry', 'Female_Athletes', 'Male_Athletes']
# get total athletes per team-year
CountAthletesByCountry['Total_Athletes'] = CountAthletesByCountry['Female_Athletes'] + \
CountAthletesByCountry['Male_Athletes']
# .copy() avoids pandas' SettingWithCopyWarning when mutating these slices below
uk_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "United Kingdom"].copy()
uk_athletes.fillna(0, inplace=True)
uk_athletes.set_index('Year', inplace=True)
swedish_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Sweden"].copy()
swedish_athletes.set_index('Year', inplace=True)
japanese_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Japan"].copy()
japanese_athletes.set_index('Year', inplace=True)
germany_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Germany"].copy()
germany_athletes.set_index('Year', inplace=True)
# Plot the values of male, female and total athletes using bar charts and line charts.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2, figsize = (20, 20), sharey = True)
fig.subplots_adjust(hspace = 0.3)
# Plot team UK's contingent size
ax1.bar(uk_athletes.index.values, uk_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax1.bar(uk_athletes.index.values, uk_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax1.plot(uk_athletes.index.values, uk_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax1.legend(loc="upper right")
ax1.set_title('UK Athletes :\nParticipation since 1896')
ax1.set_ylabel('Count of Athletes')
# Plot German athlete participation
ax2.bar(germany_athletes.index.values, germany_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax2.bar(germany_athletes.index.values, germany_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax2.plot(germany_athletes.index.values, germany_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax2.set_title('Germany Athletes :\nParticipation since 1896')
ax2.legend(loc="upper left")
ax2.set_ylabel('Count of Athletes')
# Plot Japan's contingent size
ax3.bar(japanese_athletes.index.values, japanese_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax3.bar(japanese_athletes.index.values, japanese_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax3.plot(japanese_athletes.index.values, japanese_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax3.set_title('Japanese Athletes :\nParticipation since 1896')
ax3.set_ylabel('Count of Athletes')
ax3.legend(loc="upper right")
# Plot team Swedens's contingent size
ax4.bar(swedish_athletes.index.values, swedish_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax4.bar(swedish_athletes.index.values, swedish_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax4.plot(swedish_athletes.index.values, swedish_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax4.set_title('Swedish Athletes :\nParticipation since 1896')
ax4.set_ylabel('Count of Athletes')
ax4.legend(loc="upper left")
plt.show()
Insights:
This is where we bring in the engineered feature 'DistanceTravelled': the geodesic distance between an athlete's home capital and where the games took place. My hypothesis is that distance travelled has an impact on the number of medals a country receives (due to physiological factors like jet lag, disturbed sleep, acclimatisation, etc.). I further hypothesise that this effect will be reduced in the modern era (I've chosen pre/post 1980 as the split).
To investigate these hypotheses, I've split the dataset (pre and post 1980) and calculated the correlation between this distance and medals won; a Spearman rank-correlation aside is also sketched below as a robustness check. Obviously, whether a country has a high medal count is a multivariate problem and would require a more in-depth analysis in practice.
# Get just the medalists
Subset_Medalists = clean_data['AnyMedals'] == 1
#Create a new dataset to be merged with the medal table dataset below
subset_distancetravelled = clean_data.loc[:, ['Year', 'AthleteCountry', 'DistanceTravelled']].drop_duplicates()
medal_tally = clean_data.groupby(['Year','AthleteCountry'])['TotalMedals'].agg('sum').reset_index()
medal_tally_byDistance = medal_tally.merge(subset_distancetravelled,
left_on = ['Year', 'AthleteCountry'],
right_on = ['Year', 'AthleteCountry'],
how = 'left')
#create two datasets to compare correlations
Pre_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']<=1980]
Post_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']>1980]
medal_tally_byDistance.head()
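As the small robustness aside mentioned above (not part of the original analysis): medal tallies are heavily right-skewed, so a Spearman rank correlation is a useful cross-check against the Pearson correlations computed below.
# Spearman rank correlation between distance travelled and medal tally, pre and post 1980
print(Pre_1980_MedalTallyByDistance[['DistanceTravelled', 'TotalMedals']].corr(method='spearman').loc['DistanceTravelled', 'TotalMedals'])
print(Post_1980_MedalTallyByDistance[['DistanceTravelled', 'TotalMedals']].corr(method='spearman').loc['DistanceTravelled', 'TotalMedals'])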
selected_rows = Pre_1980_MedalTallyByDistance['TotalMedals'] > 0
correlation = Pre_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr().loc['DistanceTravelled', 'TotalMedals']
plt.plot(Pre_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'],
         Pre_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'],
         linestyle = 'none',
         marker = 'o',
         alpha = 0.4)
plt.xlabel('Distance Travelled')
plt.ylabel('Number of Medals')
plt.title('PRE-1980 - Distance travelled versus medal tally')
plt.text(np.nanpercentile(Pre_1980_MedalTallyByDistance['DistanceTravelled'], 99.6),
         max(Pre_1980_MedalTallyByDistance['TotalMedals']) - 50,
         "Correlation = " + str(correlation))
selected_rows = Post_1980_MedalTallyByDistance['TotalMedals'] > 0
correlation = Post_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr().loc['DistanceTravelled', 'TotalMedals']
plt.plot(Post_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'],
         Post_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'],
         linestyle = 'none',
         marker = 'o',
         alpha = 0.4)
plt.xlabel('Distance Travelled')
plt.ylabel('Number of Medals')
plt.title('POST-1980 - Distance travelled versus medal tally')
plt.text(np.nanpercentile(Post_1980_MedalTallyByDistance['DistanceTravelled'], 99.6),
         max(Post_1980_MedalTallyByDistance['TotalMedals']) - 50,
         "Correlation = " + str(correlation))
Insights:
This was an enjoyable task! I focused my efforts on data engineering and manufactured some novel features. The following insights were of most interest to me:
Given more time, I would like to:
It wouldn't be a proper Olympic analysis without a look at how the Jamaican bobsleigh team has fared over the years.
jamaicans = clean_data[(clean_data['AthleteCountry'] == 'Jamaica') & (clean_data['Sport'] == 'Bobsleigh') ]
jamaican_grouped = jamaicans.groupby(['Year','City'])['TotalMedals'].agg('sum').reset_index()
jamaican_grouped
future_games = {'Year': [2018, 2022], 'City': ["Pyeongchang", "Beijing"], 'TotalMedals': [0, np.nan]}
future_results = pd.DataFrame(future_games)
combined = pd.concat([jamaican_grouped, future_results])
combined
plt.bar(combined['Year'], combined['TotalMedals'])
plt.title('Jamaican bobsleigh medal tally by Olympic year', fontsize=20)
plt.xlabel('Olympic Year', fontsize=18)
plt.ylabel('Number of medals won', fontsize=16)
plt.ylim((0, 10))
plt.xticks(combined['Year'])
plt.show()
Hope you enjoyed reading!