Analysis of the Olympics dataset

Datasets

The core datasets used for this task were taken from Kaggle.

This is a well-analysed dataset, with over 100 kernels pinned against it on Kaggle. To enrich my analysis and make it original, I also brought in some other datasets:

  • I used a dataset of capital city latitudes & longitudes from Techslides.
  • I manually created a new dataset of Olympic host city altitudes, latitudes & longitudes from mapcoordinates.

Running environment

I used a Jupyter Notebook (Python 3) to narrate my analysis.

Download dependencies

ON FIRST RUN: execute the code below to download the geopy dependency.

In [2]:
import sys
!{sys.executable} -m pip install geopy
Requirement already satisfied: geopy in /Users/williamneedham/anaconda/lib/python3.6/site-packages (1.17.0)
Requirement already satisfied: geographiclib<2,>=1.49 in /Users/williamneedham/anaconda/lib/python3.6/site-packages (from geopy) (1.49)

Load packages

The following packages were used for the analysis:

In [3]:
#load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import unittest
from geopy.distance import geodesic  # geodesic supersedes the deprecated vincenty method

%pylab inline
%matplotlib inline
Populating the interactive namespace from numpy and matplotlib

Contents

My analysis is laid out as follows:

Stage 1: Initial look at the data

Link to initial look at the data

Stage 2: Feature engineering

Link to feature engineering section of the report

Stage 3: Research questions

Following a review of the data available and a brainstorming session, I decided to focus my efforts on the following areas:

Summary statistics to explore how athletes have changed over the years

Analysis of some interesting countries

Is there a correlation between Distance Travelled to the Games and performance and has that changed since 1980?

Stage 4: Conclusions

Conclusions and further research options

...and finally - what about the Jamaican bobsleigh team?

Contact details

Contact details

Initial look at the data

In [4]:
#Loading athlete data
data = pd.read_csv('athlete_events.csv', index_col = 'ID')
data.head(10)
Out[4]:
Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
ID
1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 1,000 metres NaN
5 Christine Jacoba Aaftink F 25.0 185.0 82.0 Netherlands NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 500 metres NaN
5 Christine Jacoba Aaftink F 25.0 185.0 82.0 Netherlands NED 1992 Winter 1992 Winter Albertville Speed Skating Speed Skating Women's 1,000 metres NaN
5 Christine Jacoba Aaftink F 27.0 185.0 82.0 Netherlands NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 500 metres NaN
5 Christine Jacoba Aaftink F 27.0 185.0 82.0 Netherlands NED 1994 Winter 1994 Winter Lillehammer Speed Skating Speed Skating Women's 1,000 metres NaN

A few early observations, upon loading the data:

  • NA values in the Medal column have been converted to NaN; this needs correcting.
  • The Team variable sometimes combines more than one country, which is interesting but unhelpful; I will merge with the NOC table to clean this up.
  • It looks like Games is just a concatenation of Year + Season; it should be removed, once checked.
  • There are some NaNs in the Height & Weight columns, which will need cleaning if those columns are used.
  • There are some interesting sports from way back when - including Tug-Of-War - maybe these would be interesting to look at!

Fill NaN values in Medal column with 'None'

In [5]:
#Replace NA in Medal column with None
data['Medal'] = data['Medal'].fillna("None")
data['Medal'].head()
Out[5]:
ID
1    None
2    None
3    None
4    Gold
5    None
Name: Medal, dtype: object

Loading the NOC dataset

I want to use the NOC dataset to get a better representation of where an athlete is from than the 'Team' variable provides.

In [6]:
#Loading in the NOC dataset
NOC = pd.read_csv('noc_regions.csv')

String replacement: replacing incorrect/shortened country names in the NOC dataset

Later in the analysis, I fuzzy-match country names to join with the capital cities metadata dataset. After doing that, I came back and changed this part of the script to apply the replacements the fuzzy matching highlighted, i.e. before the NOC data is merged into the main data frame.

In [7]:
NOC['region'].replace(['US', 'USA', 'UK', 'Saint Kitts', 'Trinidad', 'Boliva', 'Virgin Islands, US', 'Curacao'], 
                      ['United States', 'United States', 'United Kingdom', 'Saint Kitts and Nevis', 
                       'Trinidad and Tobago', 'Bolivia', 'US Virgin Islands', 'Curaçao'], 
                      inplace = True)
In [8]:
data = pd.merge(data, NOC, left_on='NOC', right_on='NOC')
In [9]:
#before discarding, check to see if Games is in Year + Season for all values, then discard. 
if (data['Games'].equals(data['Year'].map(str) + " " + data['Season'])):
    data = data.drop(['notes', 'Games'], axis=1)
else: 
    data = data.drop(['notes'], axis=1)
In [10]:
data.sample()
Out[10]:
Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal region
258795 Miroslav Cerar M 20.0 172.0 73.0 Yugoslavia YUG 1960 Summer Roma Gymnastics Gymnastics Men's Parallel Bars None Serbia

We can see above that Games == Year + Season for all values, and so it has been removed from the dataset.

Review of null values in the dataset

In [11]:
# assign a list of (ColumnName, NullCount) tuples to null_list using list comprehension
null_list = [(col, data[col].isnull().sum()) for col in data.columns]
print(null_list)
[('Name', 0), ('Sex', 0), ('Age', 9462), ('Height', 60083), ('Weight', 62785), ('Team', 0), ('NOC', 0), ('Year', 0), ('Season', 0), ('City', 0), ('Sport', 0), ('Event', 0), ('Medal', 0), ('region', 21)]
In [12]:
df_nulls = pd.DataFrame(null_list)
df_nulls
Out[12]:
0 1
0 Name 0
1 Sex 0
2 Age 9462
3 Height 60083
4 Weight 62785
5 Team 0
6 NOC 0
7 Year 0
8 Season 0
9 City 0
10 Sport 0
11 Event 0
12 Medal 0
13 region 21

We can see that only the Age, Height, Weight & newly created region columns contain null values. My instinct is that age/height/weight measurements simply weren't recorded before a certain date. We'll have to investigate the region nulls a bit further.
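
As an aside, pandas can produce the same per-column null counts directly; a one-line equivalent of the list comprehension above:

#equivalent built-in: per-column null counts as a Series
data.isnull().sum()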

Investigating the region nulls

We can see below that these are being caused by the Refugee Olympic Athletes Team, Unknowns and Tuvalu. I will remove these rows from the dataset.

In [13]:
#find out which points haven't merged properly
missing_region_datapoints = data[data.region.isnull()]
print(len(missing_region_datapoints))
missing_region_datapoints.sample(5)
21
Out[13]:
Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal region
270681 A. Laffen M NaN NaN NaN Unknown UNK 1912 Summer Stockholm Art Competitions Art Competitions Mixed Architecture None NaN
264747 Yusra Mardini F 18.0 157.0 53.0 Refugee Olympic Athletes ROT 2016 Summer Rio de Janeiro Swimming Swimming Women's 100 metres Butterfly None NaN
270682 Logona Esau M 21.0 163.0 69.0 Tuvalu TUV 2008 Summer Beijing Weightlifting Weightlifting Men's Lightweight None NaN
264746 Yusra Mardini F 18.0 157.0 53.0 Refugee Olympic Athletes ROT 2016 Summer Rio de Janeiro Swimming Swimming Women's 100 metres Freestyle None NaN
264743 Yonas Kinde M 36.0 172.0 57.0 Refugee Olympic Athletes ROT 2016 Summer Rio de Janeiro Athletics Athletics Men's Marathon None NaN
In [14]:
# remove rows with NaN in the region column, using prints to check
print(len(data))
data = data[pd.notnull(data['region'])]
print(len(data))
270767
270746
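
An equivalent idiom (a sketch of what the filter above does, not run here) is dropna with a subset:

#drop rows whose 'region' is NaN
data = data.dropna(subset=['region'])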

Feature engineering

I chose to use external datasets to enrich the analysis and build novel feature sets. This section shows how I went about it. Most of the code here is wrapped in unit tests to ensure the transformations have been carried out as expected.

Mapping host cities to host countries

Firstly, I created a map of host cities to host countries, which wasn't originally in the dataset. This will be useful when looking at the 'home advantage' hypothesis.

In [15]:
#manually created a dictionary map from city -> country 
city_to_country = { 'Barcelona':'Spain', 'London':'UK', 'Antwerpen':'Belgium', 'Paris':'France', 'Calgary':'Canada',
       'Albertville':'France', 'Lillehammer':'Norway', 'Los Angeles':'USA', 'Salt Lake City':'USA',
       'Helsinki':'Finland', 'Lake Placid':'USA', 'Sydney':'Australia', 'Atlanta':'USA', 'Stockholm':'Sweden',
       'Sochi':'Russia', 'Nagano':'Japan', 'Torino':'Italy', 'Beijing':'China', 'Rio de Janeiro':'Brazil', 'Athina':'Greece',
       'Squaw Valley':'USA', 'Innsbruck':'Austria', 'Sarajevo': 'Bosnia and Herzegovina', 'Mexico City':'Mexico', 'Munich': 'Germany',
       'Seoul': 'South Korea', 'Berlin': 'Germany', 'Oslo': 'Norway', "Cortina d'Ampezzo":'Italy', 'Melbourne': 'Australia', 'Roma': 'Italy',
       'Amsterdam': 'Netherlands', 'Montreal': 'Canada', 'Moskva': 'Russia', 'Tokyo':'Japan', 'Vancouver':'Canada', 'Grenoble':'France',
       'Sapporo':'Japan', 'Chamonix':'France', 'St. Louis':'USA', 'Sankt Moritz':'Switzerland',
       'Garmisch-Partenkirchen':'Germany'}

#simple dictionary map
def map_cities_to_countries(data, dict_city_to_country) : 
    data['Host_Country'] = data['City'].map(dict_city_to_country)
    return data

# unit test the mapping to ensure the merge did not introduce nulls into the dataset. 
class TestMapping(unittest.TestCase):
    # Create the unit test
    def test_mapping(self):
        map_cities_to_countries(data, city_to_country)
        count_na = data['Host_Country'].isnull().sum()
        
        # Test that the mapping has not introduced NA's into the new column
        self.assertEqual(0, count_na)
        
 # Run the test 
unittest.main(argv=['ignored', '-v'], exit=False)        
test_mapping (__main__.TestMapping) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.071s

OK
Out[15]:
<unittest.main.TestProgram at 0x1138f12e8>
In [16]:
#have a look at the result - we now have a new variable Host_Country
data.head()
Out[16]:
Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal region Host_Country
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer Barcelona Basketball Basketball Men's Basketball None China Spain
1 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer London Judo Judo Men's Extra-Lightweight None China UK
2 Abudoureheman M 22.0 182.0 75.0 China CHN 2000 Summer Sydney Boxing Boxing Men's Middleweight None China Australia
3 Ai Linuer M 25.0 160.0 62.0 China CHN 2004 Summer Athina Wrestling Wrestling Men's Lightweight, Greco-Roman None China Greece
4 Ai Yanhan F 14.0 168.0 54.0 China CHN 2016 Summer Rio de Janeiro Swimming Swimming Women's 200 metres Freestyle None China Brazil

Read in and merge the host city metadata - giving the latitude, longitude and altitude of all the host cities

In [17]:
host_cities_metadata = pd.read_csv('hostcities.csv', names=['HostCity', 'HostLatitude','HostLongitude','HostAltitude'])
host_cities_metadata = host_cities_metadata.iloc[1:]
host_cities_metadata.head()
Out[17]:
HostCity HostLatitude HostLongitude HostAltitude
1 Barcelona 41.3828939 2.1774322 32
2 London 51.4893335 -0.14405509 11
3 Antwerpen 51.27704655 4.54451718 12
4 Paris 48.85881005 2.32003101 43
5 Calgary 51.02532675 -114.0498685 1043
In [18]:
#merge the host cities metadata into the main data frame
data = pd.merge(data, host_cities_metadata, left_on='City', right_on='HostCity')
data.head(5)
Out[18]:
Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal region Host_Country HostCity HostLatitude HostLongitude HostAltitude
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer Barcelona Basketball Basketball Men's Basketball None China Spain Barcelona 41.3828939 2.1774322 32
1 Bai Chongguang M 21.0 184.0 83.0 China CHN 1992 Summer Barcelona Boxing Boxing Men's Light-Heavyweight None China Spain Barcelona 41.3828939 2.1774322 32
2 Bai Mei F 17.0 166.0 46.0 China CHN 1992 Summer Barcelona Rhythmic Gymnastics Rhythmic Gymnastics Women's Individual None China Spain Barcelona 41.3828939 2.1774322 32
3 Bi Zhong M 23.0 188.0 110.0 China CHN 1992 Summer Barcelona Athletics Athletics Men's Hammer Throw None China Spain Barcelona 41.3828939 2.1774322 32
4 Cai Yanshu M 28.0 169.0 79.0 China CHN 1992 Summer Barcelona Weightlifting Weightlifting Men's Light-Heavyweight None China Spain Barcelona 41.3828939 2.1774322 32

Check to see if the merge was performed correctly, without introducing NA's

In [19]:
missing_HostLongitude_datapoints = data[data.HostLongitude.isnull()]
missing_HostLongitude_datapoints
Out[19]:
Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal region Host_Country HostCity HostLatitude HostLongitude HostAltitude

The empty dataframe above shows the merge was successful.
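
An equivalent inline guard (a sketch, rather than eyeballing the empty frame) would assert this directly:

#fail fast if the host city merge introduced any nulls
assert data['HostLongitude'].notnull().all(), "host city merge introduced NaNs"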

Read in and merge the athletes capital city metadata

This dataset contains CountryName, CapitalName, CapitalLatitude, CapitalLongitude, CountryCode (not used) and ContinentName. My plan is to merge this dataset on the athlete's 'region' variable, which we originally got from the NOC dataset. I'm doing this in preparation for answering the 'distance travelled' research question; my hypothesis is that athletes who travel the furthest to the Games are the most disadvantaged, due to jet-lag, acclimatisation etc. I appreciate that using the capital city lat/long of an athlete's country is not ideal (they may not live in the capital, or might live abroad), but it serves as an OK proxy for this analysis.

In [20]:
#load in the data
names = ['CountryName', 'CapitalName', 'CapitalLatitude','CapitalLongitude', 'CountryCode', 'ContinentName']
capital_cities_metadata = pd.read_csv('country-capitals.csv', names = names)

#remove the first row as it's a duplicate of the column headers
capital_cities_metadata = capital_cities_metadata.iloc[1:]

# describe method to understand the shape of the dataset. 
capital_cities_metadata.describe()
Out[20]:
CountryName CapitalName CapitalLatitude CapitalLongitude CountryCode ContinentName
count 245 241 245 245 242 245
unique 245 238 235 239 242 8
top Kiribati Jerusalem 0 0 NG Europe
freq 1 2 4 4 1 58

Merge on capital_cities_metadata.

Before I merge the capital cities metadata into the main 'data' frame, I'm going to perform an outer join into a new variable called data_new. In doing so, I can identify country names that are not joining properly. This then becomes a fuzzy-matching problem between two lists of strings from different sources.

In [21]:
# Creating my merged data frame
data_new = data.merge(capital_cities_metadata, 
                        left_on='region', 
                        right_on='CountryName', 
                        how='outer', 
                       )
In [22]:
#Select the rows where the capital city metadata failed to merge
missing_latitude = data_new[data_new.CapitalLatitude.isnull()]

#Displaying the regions that failed to match
missing_latitude.reset_index(inplace=True, drop=True)
missing_latitude['region'].unique()
Out[22]:
array(['Ivory Coast', 'Saint Vincent', 'Individual Olympic Athletes',
       'Virgin Islands, British', 'Antigua', 'Gambia', 'Brunei',
       'Micronesia'], dtype=object)

Fuzzy matching of Country Names from the NOC dataset and the country capitals dataset

To facilitate as clean a merge as possible, I used a fuzzy-matching algorithm from the fuzzywuzzy package. This highlighted inconsistencies in the strings, which I then dealt with earlier in the analysis (up the page) before re-running the script, since the string replacements had to be made prior to merging the NOC and main data frames.

In [23]:
from fuzzywuzzy import fuzz

def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the other list
    for name2 in list_names:
        #Finding fuzzy match score
        score = fuzz.ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)
/Users/williamneedham/anaconda/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
In [24]:
# List for dicts for easy dataframe creation
dict_list = []

# iterating over the regions without capital city matches found above
for name in missing_latitude.region.unique():
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, capital_cities_metadata.CountryName, 50)
    
    # New dict for storing data
    dict_ = {}
    dict_.update({"country_name" : name})
    dict_.update({"match_name" : match[0]})
    dict_.update({"score" : match[1]})
    dict_list.append(dict_)
    
merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
Out[24]:
country_name match_name score
0 Ivory Coast -1
1 Saint Vincent Saint Martin 64
2 Individual Olympic Athletes -1
3 Virgin Islands, British US Virgin Islands 70
4 Antigua Anguilla 67
5 Gambia Zambia 83
6 Brunei Burundi 77
7 Micronesia Indonesia 63

The 'remainder' above from the fuzzy matching will be discarded from the dataset (given the time constraints).
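
For explicitness, a hypothetical drop of these unmatched regions might look like the sketch below; the inner merge in the next cell discards them implicitly, so this is illustrative only.

#drop the regions the fuzzy matcher could not resolve (list taken from the table above)
unmatched = ['Ivory Coast', 'Saint Vincent', 'Individual Olympic Athletes',
             'Virgin Islands, British', 'Antigua', 'Gambia', 'Brunei', 'Micronesia']
data = data[~data['region'].isin(unmatched)]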

In [25]:
#merge the capital cities metadata into the main data frame
data = pd.merge(data, capital_cities_metadata, left_on='region', right_on='CountryName')
data.head()
Out[25]:
Name Sex Age Height Weight Team NOC Year Season City ... HostCity HostLatitude HostLongitude HostAltitude CountryName CapitalName CapitalLatitude CapitalLongitude CountryCode ContinentName
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer Barcelona ... Barcelona 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia
1 Bai Chongguang M 21.0 184.0 83.0 China CHN 1992 Summer Barcelona ... Barcelona 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia
2 Bai Mei F 17.0 166.0 46.0 China CHN 1992 Summer Barcelona ... Barcelona 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia
3 Bi Zhong M 23.0 188.0 110.0 China CHN 1992 Summer Barcelona ... Barcelona 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia
4 Cai Yanshu M 28.0 169.0 79.0 China CHN 1992 Summer Barcelona ... Barcelona 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia

5 rows × 25 columns

In [26]:
# check that no rows have a NaN region after the merge (the inner join above already dropped unmatched rows)
print(len(data))
data = data[pd.notnull(data.region)]
print(len(data))
270147
270147

Calculating how far each athlete has travelled to participate in the games

Now that we have lat/long pairs for 1) where the athlete lives, and 2) where the Games were held, we can calculate the Vincenty distance (from GeoPy).

The Vincenty distance is the geodesic distance between two points on the Earth's surface, taking into account the Earth's ellipsoidal shape. (GeoPy's geodesic function, used below, computes the same quantity via geographiclib and supersedes the deprecated vincenty method.)
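
As a quick sanity check (a sketch, not an executed cell; the coordinates are rounded from the merged data above), geodesic can be compared against the spherical great-circle approximation:

from geopy.distance import geodesic, great_circle

beijing = (39.9167, 116.3833)   # (CapitalLatitude, CapitalLongitude)
barcelona = (41.3829, 2.1774)   # (HostLatitude, HostLongitude)

print(geodesic(beijing, barcelona).km)      # ellipsoidal model: ~8823 km, matching the data above
print(great_circle(beijing, barcelona).km)  # spherical approximation: differs by a few km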

In [27]:
def calculate_vincenty_distance(distances) : 
    '''
    This function first creates (Latitude, Longitude) tuples for both the Athletes Home City and the Olympic Host City. 
    Then, calculates a new column 'vincenty_distance' using np.vectorize (as the geodesic method requires two inputs)
    
    Inputs: distances - dataframe with cols = ['CapitalLatitude', 'CapitalLongitude', 'HostLatitude', 'HostLongitude']
    Outputs: same dataframe with new column ['vincenty_distance']
    '''
        
    distances['AthleteLatLong'] = list(zip(distances.CapitalLatitude, distances.CapitalLongitude))
    distances['HostCityLatLong'] = list(zip(distances.HostLatitude, distances.HostLongitude))
    distances['vincenty_distance'] = np.vectorize(geodesic)(distances['AthleteLatLong'], distances['HostCityLatLong'])
    
    return distances

# take a copy of the relevant columns away from the original data; the distance column is concatenated back in later
new_data = data
distances = new_data.loc[:, ['CapitalLongitude', 'CapitalLatitude','HostLongitude','HostLatitude']]

#normally I'd use one class for all unit tests, but I've created a new one here for readability in a notebook. 
class TestCalculations(unittest.TestCase):
    # Create the unit test
    def test_vincenty_calculation(self):
        #create a test tuple from the first row of the 'data' dataframe (athlete's home capital)
        athleteLatLong0 = (39.91666667, 116.383333) # first row
        
        #and the corresponding host city lat/long
        hostLatLong0 = (41.3828939, 2.1774322) 
        
        #calculate the vincenty distance on the test data
        distanceTest = geodesic(athleteLatLong0, hostLatLong0).km
        
        #calculate vincenty distance using our function calculate_vincenty_distance
        calculate_vincenty_distance(distances)
        
        # get the result of the first row
        function_output = distances.loc[0,'vincenty_distance']
        
        # Test that the result of the two method asserts equal. 
        self.assertAlmostEqual(function_output, distanceTest,5)
        
 # Run the test 
unittest.main(argv=['ignored', '-v'], exit=False)        
test_vincenty_calculation (__main__.TestCalculations) ... ok
test_mapping (__main__.TestMapping) ... ok

----------------------------------------------------------------------
Ran 2 tests in 88.062s

OK
Out[27]:
<unittest.main.TestProgram at 0x103e642e8>

Merge the new vincenty_distance back into the main dataframe

In [28]:
#cast the vincenty distance from a geopy Distance object to float (strip the trailing ' km')
distances['vincenty_distance'] = distances['vincenty_distance'].astype(str).str[:-3].astype(float)

data = pd.concat([data, distances['vincenty_distance']], axis=1)

#take a look at the results
data.head()
Out[28]:
Name Sex Age Height Weight Team NOC Year Season City ... HostLatitude HostLongitude HostAltitude CountryName CapitalName CapitalLatitude CapitalLongitude CountryCode ContinentName vincenty_distance
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer Barcelona ... 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia 8822.807943
1 Bai Chongguang M 21.0 184.0 83.0 China CHN 1992 Summer Barcelona ... 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia 8822.807943
2 Bai Mei F 17.0 166.0 46.0 China CHN 1992 Summer Barcelona ... 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia 8822.807943
3 Bi Zhong M 23.0 188.0 110.0 China CHN 1992 Summer Barcelona ... 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia 8822.807943
4 Cai Yanshu M 28.0 169.0 79.0 China CHN 1992 Summer Barcelona ... 41.3828939 2.1774322 32 China Beijing 39.91666667 116.383333 CN Asia 8822.807943

5 rows × 26 columns
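
As an aside, a less brittle cast (a sketch, assuming the column still holds geopy Distance objects) reads the .km attribute directly instead of slicing the string form:

#extract kilometres directly from the geopy Distance objects
distances['vincenty_distance'] = distances['vincenty_distance'].apply(lambda d: d.km)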

Tidy up the resulting dataframe to remove columns we don't need

In [29]:
# see what columns we now have 
data.columns
Out[29]:
Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Year',
       'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country',
       'HostCity', 'HostLatitude', 'HostLongitude', 'HostAltitude',
       'CountryName', 'CapitalName', 'CapitalLatitude', 'CapitalLongitude',
       'CountryCode', 'ContinentName', 'vincenty_distance'],
      dtype='object')
In [30]:
cols_to_keep = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
       'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country', 
                'CapitalName', 'ContinentName', 'vincenty_distance', 'HostAltitude']

clean_data = pd.DataFrame(data, columns=cols_to_keep)
clean_data.head()
Out[30]:
Name Sex Age Team Height Weight Year Season City Sport Event Medal region Host_Country CapitalName ContinentName vincenty_distance HostAltitude
0 A Dijiang M 24.0 China 180.0 80.0 1992 Summer Barcelona Basketball Basketball Men's Basketball None China Spain Beijing Asia 8822.807943 32
1 Bai Chongguang M 21.0 China 184.0 83.0 1992 Summer Barcelona Boxing Boxing Men's Light-Heavyweight None China Spain Beijing Asia 8822.807943 32
2 Bai Mei F 17.0 China 166.0 46.0 1992 Summer Barcelona Rhythmic Gymnastics Rhythmic Gymnastics Women's Individual None China Spain Beijing Asia 8822.807943 32
3 Bi Zhong M 23.0 China 188.0 110.0 1992 Summer Barcelona Athletics Athletics Men's Hammer Throw None China Spain Beijing Asia 8822.807943 32
4 Cai Yanshu M 28.0 China 169.0 79.0 1992 Summer Barcelona Weightlifting Weightlifting Men's Light-Heavyweight None China Spain Beijing Asia 8822.807943 32

Rename some of the columns to make it obvious what they represent

In [31]:
clean_data.columns = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
       'City', 'Sport', 'Event', 'Medal', 'AthleteCountry', 'HostCountry', 
                'AthletesHomeCapital', 'AthletesHomeContinent', 'DistanceTravelled', 'HostCityAltitude']

clean_data.head()
Out[31]:
Name Sex Age Team Height Weight Year Season City Sport Event Medal AthleteCountry HostCountry AthletesHomeCapital AthletesHomeContinent DistanceTravelled HostCityAltitude
0 A Dijiang M 24.0 China 180.0 80.0 1992 Summer Barcelona Basketball Basketball Men's Basketball None China Spain Beijing Asia 8822.807943 32
1 Bai Chongguang M 21.0 China 184.0 83.0 1992 Summer Barcelona Boxing Boxing Men's Light-Heavyweight None China Spain Beijing Asia 8822.807943 32
2 Bai Mei F 17.0 China 166.0 46.0 1992 Summer Barcelona Rhythmic Gymnastics Rhythmic Gymnastics Women's Individual None China Spain Beijing Asia 8822.807943 32
3 Bi Zhong M 23.0 China 188.0 110.0 1992 Summer Barcelona Athletics Athletics Men's Hammer Throw None China Spain Beijing Asia 8822.807943 32
4 Cai Yanshu M 28.0 China 169.0 79.0 1992 Summer Barcelona Weightlifting Weightlifting Men's Light-Heavyweight None China Spain Beijing Asia 8822.807943 32

One-hot encode the Medal variable to make later analysis easier

In [32]:
df_medals = pd.get_dummies(clean_data['Medal'])

# Join the dummy variables to the main dataframe
clean_data = pd.concat([clean_data, df_medals], axis=1)
pd.options.display.max_columns = None
clean_data.head()
Out[32]:
Name Sex Age Team Height Weight Year Season City Sport Event Medal AthleteCountry HostCountry AthletesHomeCapital AthletesHomeContinent DistanceTravelled HostCityAltitude Bronze Gold None Silver
0 A Dijiang M 24.0 China 180.0 80.0 1992 Summer Barcelona Basketball Basketball Men's Basketball None China Spain Beijing Asia 8822.807943 32 0 0 1 0
1 Bai Chongguang M 21.0 China 184.0 83.0 1992 Summer Barcelona Boxing Boxing Men's Light-Heavyweight None China Spain Beijing Asia 8822.807943 32 0 0 1 0
2 Bai Mei F 17.0 China 166.0 46.0 1992 Summer Barcelona Rhythmic Gymnastics Rhythmic Gymnastics Women's Individual None China Spain Beijing Asia 8822.807943 32 0 0 1 0
3 Bi Zhong M 23.0 China 188.0 110.0 1992 Summer Barcelona Athletics Athletics Men's Hammer Throw None China Spain Beijing Asia 8822.807943 32 0 0 1 0
4 Cai Yanshu M 28.0 China 169.0 79.0 1992 Summer Barcelona Weightlifting Weightlifting Men's Light-Heavyweight None China Spain Beijing Asia 8822.807943 32 0 0 1 0

Add a binary AnyMedals variable

In [33]:
clean_data['TotalMedals'] = clean_data['Gold'] + clean_data['Silver'] + clean_data['Bronze']

clean_data['AnyMedals'] = np.where(clean_data.loc[:,'Medal'] == 'None', 0, 1)
clean_data.sample(10)
Out[33]:
Name Sex Age Team Height Weight Year Season City Sport Event Medal AthleteCountry HostCountry AthletesHomeCapital AthletesHomeContinent DistanceTravelled HostCityAltitude Bronze Gold None Silver TotalMedals AnyMedals
268023 Darko Majstorovi M 30.0 Yugoslavia 189.0 85.0 1976 Summer Montreal Rowing Rowing Men's Double Sculls None Serbia Canada Belgrade Europe 6925.483129 56 0 0 1 0 0 0
27524 George Charles Calnan M 24.0 United States 183.0 NaN 1924 Summer Paris Fencing Fencing Men's epee, Individual None United States France Washington D.C. North America 6177.680465 43 0 0 1 0 0 0
82598 Giuseppe Lupi M 33.0 Italy NaN NaN 1928 Summer Amsterdam Gymnastics Gymnastics Men's Team All-Around None Italy Netherlands Rome Europe 1297.207864 11 0 0 1 0 0 0
248717 Jin O-Hyeon M 24.0 South Korea 163.0 67.0 1960 Summer Roma Weightlifting Weightlifting Men's Lightweight None South Korea Italy Seoul Asia 8990.813118 21 0 0 1 0 0 0
249859 Chainarong Sophonpong M 20.0 Thailand 166.0 62.0 1964 Summer Tokyo Cycling Cycling Men's 100 kilometres Team Time Trial None Thailand Japan Bangkok Asia 4608.433126 10 0 0 1 0 0 0
147164 Makrem Ayed M 22.0 Tunisia 170.0 60.0 1996 Summer Atlanta Judo Judo Men's Extra-Lightweight None Tunisia USA Tunis Africa 8219.461425 311 0 0 1 0 0 0
248254 Park Dong-Ju M 25.0 South Korea 173.0 63.0 1988 Summer Seoul Equestrianism Equestrianism Mixed Three-Day Event, Team None South Korea South Korea Seoul Asia 1.904024 40 0 0 1 0 0 0
60895 Alain Bouffard M 25.0 France 165.0 52.0 1964 Summer Tokyo Rowing Rowing Men's Coxed Eights None France Japan Paris Europe 9738.883220 10 0 0 1 0 0 0
197450 Alfred Edward Flaxman M 28.0 Great Britain NaN NaN 1908 Summer London Athletics Athletics Men's Standing High Jump None United Kingdom UK London Europe 4.380864 11 0 0 1 0 0 0
185309 Thierry Vatrican M 20.0 Monaco 180.0 80.0 1996 Summer Atlanta Judo Judo Men's Half-Middleweight None Monaco USA Monaco Europe 7643.984453 311 0 0 1 0 0 0

======================================================

Summary statistics to explore how athletes have changed over the years

In this question, I'm looking to explore how the sex, age and number of competitors have changed over the last 120 years.

The changing profile of the Olympic athlete (part 1)

In [34]:
#Proportion of male vs female athletes per Games

ProportionMvW = data.groupby(['Year', 'Sex']).agg({'Name': 'count'})

percentages = ProportionMvW.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))
percentages = percentages.reset_index()
In [35]:
a4_dims = (11.7, 8.27)
fig, ax = pyplot.subplots(figsize=a4_dims)

ax = sns.barplot(x="Year", y="Name", hue="Sex", data=percentages, palette="Set2") 
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
ax.set_title('Proportion of men and women competing in the olympic games')
ax.set_ylabel('Percentage of athletes for each gender')
ax.legend(loc="upper right")
plt.show()

Insights:

  • Good news! The ratio of male to female competitors is slowly becoming more balanced - could 2020 be the year we see an equal number of female and male competitors?

The changing profile of the Olympic athlete (part 2)

The visualisation below shows how the distributions of men's and women's ages have changed since 1896, split by Sex and Summer/Winter Olympics.

In [36]:
sns.set(rc={'figure.figsize':(20,20)})

g = sns.FacetGrid(clean_data, col="Season",  row="Sex", size=6)
g = g.map(sns.boxplot, "Year", 'Age')
g.set_xticklabels(rotation=90, ha="right")


show()

Insights:

  • Some 'athletes' in the 20th century were over 80 years old when they competed! Worth a quick look back at the data to see what they competed in (see the sketch after this list).
  • As might be expected, the variance in the modern athlete's age is greatly reduced (with the median age being in the early 20s). Interestingly, the median age at the Winter Olympics seems to be higher.
  • Higher variance in the Summer Olympics - I would hypothesise this is down to the wider variety of sports on offer there.
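
A quick sketch of that look-back (hypothetical cut-off of 80 years; not an executed cell):

#who were the oldest competitors, and what did they enter?
oldest = clean_data[clean_data['Age'] > 80][['Name', 'Age', 'Year', 'Sport', 'Event']]
print(oldest.sort_values('Age', ascending=False).head())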

Analysis of some interesting countries

In [37]:
#create a table of year, country name & sex, to be counted per Games later 
mini_country_table = clean_data[clean_data['Season'] == 'Summer'].loc[:, ['Year','AthleteCountry', 'Name', 'Sex']].drop_duplicates()
mini_country_table.head()
Out[37]:
Year AthleteCountry Name Sex
0 1992 China A Dijiang M
1 1992 China Bai Chongguang M
2 1992 China Bai Mei F
3 1992 China Bi Zhong M
4 1992 China Cai Yanshu M
In [38]:
# Create a pivot table to count gender wise representation of each team in each year
CountAthletesByCountry = pd.pivot_table(mini_country_table,
                                        index = ['Year', 'AthleteCountry'],
                                        columns = 'Sex',
                                        aggfunc = 'count').reset_index()

# rename columns as per column names in the 0th level
CountAthletesByCountry.columns = CountAthletesByCountry.columns.get_level_values(0)

# rename the columns appropriately
CountAthletesByCountry.columns = ['Year', 'AthleteCountry', 'Female_Athletes', 'Male_Athletes']

# get total athletes per team-year
CountAthletesByCountry['Total_Athletes'] = CountAthletesByCountry['Female_Athletes'] + \
CountAthletesByCountry['Male_Athletes']

uk_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "United Kingdom"]
# note: filling on this slice triggers the SettingWithCopyWarning below; taking a .copy() first would silence it
uk_athletes.fillna(0, inplace = True)
uk_athletes.set_index('Year', inplace = True)

swedish_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Sweden"]
swedish_athletes.set_index('Year', inplace = True)

japanese_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Japan"]
japanese_athletes.set_index('Year', inplace = True)

germany_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Germany"]
germany_athletes.set_index('Year', inplace = True)

# Plot the values of male, female and total athletes using bar charts and the line charts.
fig, ((ax1, ax2), (ax3, ax4)) = subplots(nrows = 2, ncols = 2, figsize = (20, 20), sharey = True)
fig.subplots_adjust(hspace = 0.3)

# Plot team UK's contingent size
ax1.bar(uk_athletes.index.values, uk_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax1.bar(uk_athletes.index.values, uk_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax1.plot(uk_athletes.index.values, uk_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
        marker = 'o')
ax1.legend(loc="upper right")
ax1.set_title('UK Athletes :\nParticipation since 1896')
ax1.set_ylabel('Count of Athletes')

# Plot German athlete participation
ax2.bar(germany_athletes.index.values, germany_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax2.bar(germany_athletes.index.values, germany_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax2.plot(germany_athletes.index.values, germany_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
        marker = 'o')
ax2.set_title('Germany Athletes :\nParticipation since 1896')
ax2.legend(loc="upper left")

ax2.set_ylabel('Count of Athletes')

# Plot Japan's contingent size
ax3.bar(japanese_athletes.index.values, japanese_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax3.bar(japanese_athletes.index.values, japanese_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax3.plot(japanese_athletes.index.values, japanese_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes', 
         marker = 'o')
ax3.set_title('Japanese Athletes :\nParticipation since 1896')
ax3.set_ylabel('Count of Athletes')
ax3.legend(loc="upper right")


# Plot team Swedens's contingent size
ax4.bar(swedish_athletes.index.values, swedish_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax4.bar(swedish_athletes.index.values, swedish_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax4.plot(swedish_athletes.index.values, swedish_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
        marker = 'o')
ax4.set_title('Swedish Athletes :\nParticipation since 1896')
ax4.set_ylabel('Count of Athletes')
ax4.legend(loc="upper left")

show()
/Users/williamneedham/anaconda/lib/python3.6/site-packages/pandas/core/frame.py:2842: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)

Insights:

  • Overall gender balance has been improving slowly since the mid 20th century and is nearly equal for all countries considered.
  • Swedish female participation has risen above male participation in the last 3 Olympic Games.
  • UK participation peaked in 2012 - when London hosted the Games! This peak was mirrored by German participation in the 1972 Games, held in West Germany (Munich).
  • The 1916, 1940 and 1944 Games were cancelled due to the World Wars.

Is there a correlation between Distance Travelled to the Games and performance and has that changed since 1980?

This is where we bring in the engineered feature 'DistanceTravelled' - the Vincenty distance between an athlete's home capital and where the Games took place. My hypothesis is that distance travelled has an impact on the number of medals a country receives (due to physiological factors like jet-lag, blood flow etc.). I further hypothesise that this effect will be reduced in the modern era (I've chosen pre/post 1980 as the split).

To investigate these hypotheses, I've split the dataset (pre and post 1980) and calculated the correlation between this distance and medals won. Obviously, whether a country has a high medal count is a multivariate problem and would require a more in-depth analysis in practice.

In [39]:
# Boolean mask for the medalists (kept for reference; not used below)
Subset_Medalists = clean_data['AnyMedals'] == 1

#Create a new dataset to be merged with the medal table dataset below 
subset_distancetravelled = clean_data.loc[:, ['Year', 'AthleteCountry', 'DistanceTravelled']].drop_duplicates()

medal_tally = clean_data.groupby(['Year','AthleteCountry'])['TotalMedals'].agg('sum').reset_index()
medal_tally_byDistance = medal_tally.merge(subset_distancetravelled,
                                   left_on = ['Year', 'AthleteCountry'],
                                   right_on = ['Year', 'AthleteCountry'],
                                   how = 'left')

#create two datasets to compare correlations
Pre_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']<=1980]
Post_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']>1980]

medal_tally_byDistance.head()
Out[39]:
Year AthleteCountry TotalMedals DistanceTravelled
0 1896 Australia 3.0 15201.660597
1 1896 Austria 5.0 1282.025503
2 1896 Denmark 6.0 2135.520896
3 1896 France 11.0 2100.783748
4 1896 Germany 32.0 1802.949137

Pre-1980: impact of distance travelled on country performance

In [40]:
selected_rows = Pre_1980_MedalTallyByDistance['TotalMedals'] > 0

correlation = Pre_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr()['TotalMedals'][0]

plot(Pre_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'], 
     Pre_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'] , 
     linestyle = 'none', 
     marker = 'o',
     alpha = 0.4)
xlabel('Distance Travelled')

ylabel('Number of Medals')
title('PRE-1980 - Distance travelled versus medal tally')
text(np.nanpercentile(Pre_1980_MedalTallyByDistance['DistanceTravelled'], 99.6), 
     max(Pre_1980_MedalTallyByDistance['TotalMedals']) - 50,
     "Correlation = " + str(correlation))
Out[40]:
Text(18557.7,446,'Correlation = -0.146533766948')

Post-1980: impact of distance travelled on country performance

In [41]:
selected_rows = Post_1980_MedalTallyByDistance['TotalMedals'] > 0

correlation = Post_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr()['TotalMedals'][0]

plot(Post_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'], 
     Post_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'] , 
     linestyle = 'none', 
     marker = 'o', 
    alpha = 0.4)
xlabel('Distance Travelled')

ylabel('Number of Medals')
title('POST-1980 - Distance travelled versus medal tally')
text(np.nanpercentile(Post_1980_MedalTallyByDistance['DistanceTravelled'], 99.6), 
     max(Post_1980_MedalTallyByDistance['TotalMedals']) - 50,
     "Correlation = " + str(correlation))
Out[41]:
Text(18738.3,316,'Correlation = -0.0614926627557')

Insights:

  • Negative correlations were recorded for both the pre-1980 and post-1980 datasets, suggesting that a greater distance travelled is associated with a lower medal haul.
  • The correlation coefficient was -0.147 pre-1980 versus -0.061 post-1980. This is consistent with my hypothesis that the distance-travelled effect is weaker in the modern era (where flights are cheap and more comfortable, and athletes probably spend time preparing in the host country beforehand).
  • Admittedly, though, the coefficients are small, and as such the DistanceTravelled effect is weaker than I expected (see the sketch below for a natural follow-up).
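
A natural follow-up (a sketch, assuming scipy is available; not run here) is to check whether these small coefficients are statistically distinguishable from zero:

#significance test for the pre-1980 correlation
from scipy.stats import pearsonr

subset = Pre_1980_MedalTallyByDistance.dropna(subset=['DistanceTravelled', 'TotalMedals'])
r, p = pearsonr(subset['DistanceTravelled'], subset['TotalMedals'])
print("Pre-1980: r = %.3f, p = %.3f" % (r, p))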

Conclusions and further research options

This was an enjoyable task! I focused my efforts on data engineering and manufactured some novel features. The following insights were of most interest to me:

  • good to see gender balance finally becoming a reality (after over 120 years of competition!),
  • following high variance in the early years, most athletes are now in their early 20s (with a slightly higher median age at the Winter Olympics),
  • of the countries selected for this study (UK, Germany, Sweden & Japan), Germany sends the most athletes on average each year.

Given more time, I would like to:

  • add more summary statistics,
  • explore covariance,
  • build predictive models (i.e. predicting the medals table at the next Games),
  • conduct a clustering analysis to see what separates the best athletes from merely OK athletes.

...and finally - what about the Jamaican bobsleigh team?

It wouldn't be a proper Olympic analysis without a look at how the Jamaican bobsleigh team has done over the years.

In [42]:
#filter to Jamaican bobsleigh entries
jamaicans = clean_data[(clean_data['AthleteCountry'] == 'Jamaica') & (clean_data['Sport'] == 'Bobsleigh')]
jamaican_grouped = jamaicans.groupby(['Year','City'])['TotalMedals'].agg('sum').reset_index()
jamaican_grouped
Out[42]:
Year City TotalMedals
0 1988 Calgary 0
1 1992 Albertville 0
2 1994 Lillehammer 0
3 1998 Nagano 0
4 2002 Salt Lake City 0
5 2014 Sochi 0
In [43]:
new_data = {'Year': [2018, 2022], 'City': ["Pyeongchang", "Beijing"], 'TotalMedals': [0, np.nan]}
future_results = pd.DataFrame(new_data)
combined = pd.concat([jamaican_grouped, future_results])
combined
Out[43]:
City TotalMedals Year
0 Calgary 0.0 1988
1 Albertville 0.0 1992
2 Lillehammer 0.0 1994
3 Nagano 0.0 1998
4 Salt Lake City 0.0 2002
5 Sochi 0.0 2014
0 Pyeongchang 0.0 2018
1 Beijing NaN 2022
In [44]:
plt.bar(combined['Year'],combined['TotalMedals'] )
plt.title('Jamaican bobsleigh team: medal history', fontsize=20)
plt.xlabel('Olympic Year', fontsize=18)
plt.ylabel('Number of medals won', fontsize=16)
plt.ylim((0,10))
plt.xticks(combined['Year'])
Out[44]:
([<matplotlib.axis.XTick at 0x111dbfcf8>,
  <matplotlib.axis.XTick at 0x111b19710>,
  <matplotlib.axis.XTick at 0x111b199e8>,
  <matplotlib.axis.XTick at 0x111dfbe48>,
  <matplotlib.axis.XTick at 0x1185083c8>,
  <matplotlib.axis.XTick at 0x118508908>,
  <matplotlib.axis.XTick at 0x118508e48>,
  <matplotlib.axis.XTick at 0x11850e3c8>],
 <a list of 8 Text xticklabel objects>)

Unfortunately, no medals yet for the Jamaicans in the bobsleigh - maybe 2022 is their year?

Hope you enjoyed reading!

------------- > Back to Top < -------------