Football World Cup Datathon - Part 2: Data Aquisition

I introduced the Betfair World Cup Datathon in Part 1 of this series of posts. In this second post, I’m going to focus on getting data. Given I don’t have a lot of domain knowledge, I can’t go into developing anything too advanced myself. As such, getting as much data as possible is going to be my best bet. This post shows all the data sources I’ve been able to easily access.

# Set some defaults and load libraries
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(pacman)
pacman::p_load(tidyverse, here, ggthemes, rvest, lubridate)

Load Data

Betfair

Betfair has provided all participants with data. To get that data you need to sign up to the competition, which you can do here. I’m also hosting it on Github here.

# Load betfair data
betfair_dat <- read_csv(here::here("projects", "world-cup-2018", "wc_datathon_dataset.csv"))

# View
glimpse(betfair_dat)
## Observations: 14,568
## Variables: 12
## $ date                <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1              <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2              <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals        <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals        <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament          <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home      <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral          <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds   <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
summary(betfair_dat)
##       date               team_1             team_2         
##  Min.   :1998-07-15   Length:14568       Length:14568      
##  1st Qu.:2003-10-11   Class :character   Class :character  
##  Median :2008-06-14   Mode  :character   Mode  :character  
##  Mean   :2008-06-12                                        
##  3rd Qu.:2013-03-22                                        
##  Max.   :2017-11-15                                        
##                                                            
##   team_1_goals     team_2_goals     tournament        is_team_1_home 
##  Min.   : 0.000   Min.   : 0.000   Length:14568       Mode :logical  
##  1st Qu.: 0.000   1st Qu.: 0.000   Class :character   FALSE:2788     
##  Median : 1.000   Median : 1.000   Mode  :character   TRUE :11780    
##  Mean   : 1.608   Mean   : 1.051                                     
##  3rd Qu.: 2.000   3rd Qu.: 2.000                                     
##  Max.   :31.000   Max.   :15.000                                     
##                                                                      
##  is_team_2_home  is_neutral      team_1_betfair_odds draw_betfair_odds
##  Mode :logical   Mode :logical   Min.   :  1.010     Min.   :  1.010  
##  FALSE:14568     FALSE:11800     1st Qu.:  1.460     1st Qu.:  3.350  
##                  TRUE :2768      Median :  2.140     Median :  3.800  
##                                  Mean   :  5.947     Mean   :  6.685  
##                                  3rd Qu.:  3.600     3rd Qu.:  5.300  
##                                  Max.   :980.000     Max.   :980.000  
##                                  NA's   :5764        NA's   :5764     
##  team_2_betfair_odds
##  Min.   :   1.01    
##  1st Qu.:   2.30    
##  Median :   4.10    
##  Mean   :  13.28    
##  3rd Qu.:   9.00    
##  Max.   :1000.00    
##  NA's   :5764

So - it looks like each row is a game that contains the final score, the tournament, who is the home team and then the odds of each outcome. There is quite a lot of games (~14k) which is good! There are some missing values which is a bit annoying but we’ll work out how to deal with that later! I’ll be doing a more detailed data exploration in Part 3.

Kaggle Odds dataset

I found this data set on Kaggle that has the odds for almost 500k matches! Most of those are league games but we should be able to get rid of those when we join it to our other data sets.

# Load odds data
odds_dat <- read_csv(here::here("projects", "world-cup-2018", "closing_odds.csv"))

# View
glimpse(odds_dat)
## Observations: 479,440
## Variables: 19
## $ match_id            <int> 170088, 170089, 170090, 170091, 170092, 17...
## $ league              <chr> "England: Premier League", "England: Premi...
## $ match_date          <date> 2005-01-01, 2005-01-01, 2005-01-01, 2005-...
## $ home_team           <chr> "Liverpool", "Fulham", "Aston Villa", "Bol...
## $ home_score          <int> 0, 3, 1, 1, 1, 2, 2, 1, 5, 0, 0, 0, 0, 0, ...
## $ away_team           <chr> "Chelsea", "Crystal Palace", "Blackburn", ...
## $ away_score          <int> 1, 1, 0, 1, 3, 1, 1, 1, 2, 2, 1, 0, 2, 2, ...
## $ avg_odds_home_win   <dbl> 2.9944, 1.9456, 1.8522, 1.6122, 5.9878, 1....
## $ avg_odds_draw       <dbl> 3.1944, 3.2333, 3.2611, 3.4133, 3.4778, 3....
## $ avg_odds_away_win   <dbl> 2.2256, 3.6722, 4.0144, 5.4722, 1.5567, 4....
## $ max_odds_home_win   <dbl> 3.20, 2.04, 2.00, 1.67, 7.00, 1.73, 1.91, ...
## $ max_odds_draw       <dbl> 3.25, 3.30, 3.40, 3.57, 3.60, 3.50, 3.40, ...
## $ max_odds_away_win   <dbl> 2.29, 4.15, 4.50, 6.27, 1.62, 5.00, 4.33, ...
## $ top_bookie_home_win <chr> "Paddy Power", "Pinnacle Sports", "Pinnacl...
## $ top_bookie_draw     <chr> "Sportingbet", "bet-at-home", "Paddy Power...
## $ top_bookie_away_win <chr> "Expekt", "Expekt", "Sportingbet", "Pinnac...
## $ n_odds_home_win     <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, ...
## $ n_odds_draw         <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, ...
## $ n_odds_away_win     <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, ...
summary(odds_dat)
##     match_id         league            match_date        
##  Min.   :170088   Length:479440      Min.   :2005-01-01  
##  1st Qu.:394532   Class :character   1st Qu.:2009-07-05  
##  Median :566276   Mode  :character   Median :2011-12-17  
##  Mean   :551865                      Mean   :2011-06-19  
##  3rd Qu.:715668                      3rd Qu.:2013-10-08  
##  Max.   :876812                      Max.   :2015-06-30  
##   home_team           home_score       away_team           away_score   
##  Length:479440      Min.   :  0.000   Length:479440      Min.   :  0.0  
##  Class :character   1st Qu.:  1.000   Class :character   1st Qu.:  0.0  
##  Mode  :character   Median :  1.000   Mode  :character   Median :  1.0  
##                     Mean   :  1.539                      Mean   :  1.2  
##                     3rd Qu.:  2.000                      3rd Qu.:  2.0  
##                     Max.   :101.000                      Max.   :102.0  
##  avg_odds_home_win avg_odds_draw    avg_odds_away_win max_odds_home_win
##  Min.   :  0.000   Min.   : 1.010   Min.   :  0.000   Min.   :  0.000  
##  1st Qu.:  1.658   1st Qu.: 3.200   1st Qu.:  2.418   1st Qu.:  1.750  
##  Median :  2.084   Median : 3.361   Median :  3.200   Median :  2.200  
##  Mean   :  2.525   Mean   : 3.670   Mean   :  3.975   Mean   :  2.789  
##  3rd Qu.:  2.643   3rd Qu.: 3.735   3rd Qu.:  4.607   3rd Qu.:  2.850  
##  Max.   :151.000   Max.   :51.000   Max.   :103.024   Max.   :209.040  
##  max_odds_draw     max_odds_away_win top_bookie_home_win
##  Min.   :  1.010   Min.   :  0.000   Length:479440      
##  1st Qu.:  3.300   1st Qu.:  2.580   Class :character   
##  Median :  3.560   Median :  3.500   Mode  :character   
##  Mean   :  3.942   Mean   :  4.573                      
##  3rd Qu.:  4.000   3rd Qu.:  5.200                      
##  Max.   :150.000   Max.   :501.000                      
##  top_bookie_draw    top_bookie_away_win n_odds_home_win  n_odds_draw   
##  Length:479440      Length:479440       Min.   : 0.00   Min.   : 1.00  
##  Class :character   Class :character    1st Qu.: 9.00   1st Qu.: 9.00  
##  Mode  :character   Mode  :character    Median :17.00   Median :17.00  
##                                         Mean   :16.45   Mean   :16.46  
##                                         3rd Qu.:25.00   3rd Qu.:25.00  
##                                         Max.   :29.00   Max.   :29.00  
##  n_odds_away_win
##  Min.   : 0.00  
##  1st Qu.: 9.00  
##  Median :17.00  
##  Mean   :16.45  
##  3rd Qu.:25.00  
##  Max.   :29.00

So this data set has different odds but has it for way more data and nothing missing. One problem I might have is that I’m not sure I can get these specific odds (bookies) easily for the upcoming World Cup games. If I want to include it in a model that will be an issue. I’ll need to see what I can get.

FIFA

Logically, if we wanted to try and assess team ratings we could look towards the official FIFA World rankings. These are used for primarily for seeding in tournaments, such as the World Cup. In researching for this project, I found that these are generally critisied for being too simplistic and most people prefer ELO rating systems. Nonetheless, it is a data set I am able to get access to and, given it is used for seeding in the World Cup, there may be some value for including them in a model.

Scraping data

The code to get this information has been adapted from a Github respository I found from Github user Ross Boberg.

If you would prefer to take a scripted version of my code below, you can find that here. You can also just grab the data directly from here.

Firstly, we have to create a function that builds us a URL. We can then use this to get all the combinations of URLS we need. Based on Ross’ work, we know that each URL contains a query for rank, confediration, gender and page. This function will dynamically generate the URL with the values for those queries.

Now, let’s generate a bunch of URLs. Based on Ross’s code, I know the rough start and end points for rank and I also know the id’s of the confedirations.

# Specify parameters
start_rank = 286 # the last update (as of Jun 5th, 2018)
end_rank = 57 # corresponds to Jan99, when point method was revised
conf_ids = c(23913, 23914, 23915, 23916, 25998, 27275)

rank <- rep(start_rank:end_rank, each = length(conf_ids))
conf_vec <- rep(conf_ids, times = length(start_rank:end_rank))

urls <- rank %>%
  map2_chr(conf_vec, ~ get_ranking_table_URL(rank = .x, confederation = .y))

length(urls)
## [1] 1380

Now we’ve got a list of URLs to search through - I can pass them to read_html and generate local HTML files for me to use. This is the most time consuming part of this process - this function took my laptop ~13 min to run so be mindful if you are following along! I’ve actually just done the first 10 of 1380 for this blog post so comment that out if needed.

# Uncomment 1st line and comment out 2nd line to do all data. I've just done 1st 10 for illustration purposes
# ind <- seq_along(urls)
ind <- 1:10 

htmls <- urls[ind] %>%
  map(xml2::read_html)

Success! Now we’ve got those htmls, we can do some extraction and data cleaning. We basically just need to get the dates and then the table of data on each page. Again, using the handy purrr package to do our heavy lifting makes this relatively simple.

# Read date
dates <- htmls %>% 
  map(~html_nodes(.x, ".lb")) %>%
  map(html_text) %>%
  map_chr(~str_remove(.x, "Last Updated ")) %>%
  map(dmy) 

# Read Tables
tables <- htmls %>% 
  map(~html_nodes(.x, "#tbl_rankingTable")) %>%
  map(html_table) %>%
  map(as.data.frame)

# Add date, conf id and make into one dataframe
fifa_dat <- tables %>%
  map2(dates, ~ .x %>% mutate(Date = .y)) %>%
  map2(conf_vec[ind], ~ .x %>% mutate(Conference_id = .y)) %>%
  reduce(bind_rows) %>%
  select(Date, Team, Pts, Conference_id) %>%
  arrange(Date, desc(Pts))

head(fifa_dat)
##         Date      Team  Pts Conference_id
## 1 2018-04-12    Brazil 1384         23915
## 2 2018-04-12 Argentina 1254         23915
## 3 2018-04-12     Chile 1146         23915
## 4 2018-04-12      Peru 1106         23915
## 5 2018-04-12   Tunisia 1012         23913
## 6 2018-04-12    Mexico 1008         23914

For the purposes of joining my Data sets, I need the full set of data (the above code is just for the last 10 ratings). Here’s some I prepared earlier!.

fifa_dat <- read_csv(here::here("projects", "world-cup-2018", "fifa_rank_history.csv"))
glimpse(fifa_dat)
## Observations: 47,104
## Variables: 4
## $ Date          <date> 1999-01-27, 1999-01-27, 1999-01-27, 1999-01-27,...
## $ Team          <chr> "Brazil", "France", "Croatia", "Italy", "Germany...
## $ Pts           <int> 829, 787, 745, 745, 742, 733, 726, 720, 703, 698...
## $ Conference_id <int> 23915, 27275, 27275, 27275, 27275, 23915, 27275,...
summary(fifa_dat)
##       Date                Team                Pts         Conference_id  
##  Min.   :1999-01-27   Length:47104       Min.   :   0.0   Min.   :23913  
##  1st Qu.:2003-12-15   Class :character   1st Qu.: 162.0   1st Qu.:23913  
##  Median :2008-10-08   Mode  :character   Median : 360.0   Median :23916  
##  Mean   :2008-09-24                      Mean   : 397.3   Mean   :25221  
##  3rd Qu.:2013-08-08                      3rd Qu.: 568.0   3rd Qu.:27275  
##  Max.   :2018-05-17                      Max.   :1920.0   Max.   :27275

Other sources

I had looked at some other sources but found it either too tricky or they didn’t have wide enough covered. These including trying to include ‘team value’ information from Transfermarkt. I think this would be awesome but I could only get the value of each of the current World Cup squads. I’m sure it’s possible to try and find backdate information but I don’t have the time.

Another option that would be cool would be to use the EA Sports FIFA player ratings as a bit of a proxy for the talent in each team. I did find some promising websites but they were ultimately going to take too much time to scrape.

Combining dataset

Now that we have acquired our data, we need to combine it in some way to make one big data file. Again, I want this to be one row per match and it’s going to be fairly wide.

Now, what we want is to add the FIFA data to our betfair odds. This isn’t going to a simple join I don’t think - we want to match the home team name and the date of the match to the closest date and team in our FIFA file and return the rating. I’m pretty sure that the FIFA ratings get updated once per month so we may be able to join on year, month and team

# Add in a date/month to fifa and betfair data
fifa_dat <- fifa_dat %>%
  mutate(year = year(Date),
         month = month(Date)) %>%
  select(year, month, Team, Pts) 

betfair_dat <- betfair_dat %>%
  mutate(year = year(date),
         month = month(date))

odds_dat <- odds_dat %>%
  select(match_id, match_date, home_team, away_team, avg_odds_home_win:avg_odds_away_win)

# Join data together
dat <- betfair_dat %>%
  left_join(fifa_dat, by = c("year", "month", "team_1" = "Team")) %>%
  rename(team_1_fifa = Pts) %>%
  left_join(fifa_dat, by = c("year", "month", "team_2" = "Team")) %>%
  rename(team_2_fifa = Pts) %>%
  left_join(odds_dat, by = c("date" = "match_date", "team_1" = "home_team", "team_2" = "away_team"))

head(dat)
## # A tibble: 6 x 20
##   date       team_1      team_2       team_1_goals team_2_goals tournament
##   <date>     <chr>       <chr>               <int>        <int> <chr>     
## 1 2000-06-10 Belgium     Sweden                  2            1 Euro 2000 
## 2 2000-06-11 France      Denmark                 3            0 Euro 2000 
## 3 2000-06-11 Netherlands Czech Repub…            1            0 Euro 2000 
## 4 2000-06-11 Turkey      Italy                   1            2 Euro 2000 
## 5 2000-06-12 Germany     Romania                 1            1 Euro 2000 
## 6 2000-06-12 Portugal    England                 3            2 Euro 2000 
## # ... with 14 more variables: is_team_1_home <lgl>, is_team_2_home <lgl>,
## #   is_neutral <lgl>, team_1_betfair_odds <dbl>, draw_betfair_odds <dbl>,
## #   team_2_betfair_odds <dbl>, year <dbl>, month <dbl>, team_1_fifa <int>,
## #   team_2_fifa <int>, match_id <int>, avg_odds_home_win <dbl>,
## #   avg_odds_draw <dbl>, avg_odds_away_win <dbl>
glimpse(dat)
## Observations: 14,730
## Variables: 20
## $ date                <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1              <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2              <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals        <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals        <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament          <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home      <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral          <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds   <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
## $ year                <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ month               <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa         <int> 585, 758, 629, 568, 706, 647, NA, 728, 585...
## $ team_2_fifa         <int> 646, 662, 750, 660, 692, 682, 538, 703, 66...
## $ match_id            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_home_win   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_draw       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_away_win   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
summary(dat)
##       date               team_1             team_2         
##  Min.   :1998-07-15   Length:14730       Length:14730      
##  1st Qu.:2003-10-12   Class :character   Class :character  
##  Median :2008-06-22   Mode  :character   Mode  :character  
##  Mean   :2008-06-19                                        
##  3rd Qu.:2013-02-06                                        
##  Max.   :2017-11-15                                        
##                                                            
##   team_1_goals     team_2_goals    tournament        is_team_1_home 
##  Min.   : 0.000   Min.   : 0.00   Length:14730       Mode :logical  
##  1st Qu.: 0.000   1st Qu.: 0.00   Class :character   FALSE:2810     
##  Median : 1.000   Median : 1.00   Mode  :character   TRUE :11920    
##  Mean   : 1.603   Mean   : 1.05                                     
##  3rd Qu.: 2.000   3rd Qu.: 2.00                                     
##  Max.   :31.000   Max.   :15.00                                     
##                                                                     
##  is_team_2_home  is_neutral      team_1_betfair_odds draw_betfair_odds
##  Mode :logical   Mode :logical   Min.   :  1.010     Min.   :  1.010  
##  FALSE:14730     FALSE:11940     1st Qu.:  1.460     1st Qu.:  3.350  
##                  TRUE :2790      Median :  2.140     Median :  3.800  
##                                  Mean   :  5.892     Mean   :  6.647  
##                                  3rd Qu.:  3.600     3rd Qu.:  5.300  
##                                  Max.   :980.000     Max.   :980.000  
##                                  NA's   :5791        NA's   :5791     
##  team_2_betfair_odds      year          month         team_1_fifa    
##  Min.   :   1.01     Min.   :1998   Min.   : 1.000   Min.   :   0.0  
##  1st Qu.:   2.32     1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 333.0  
##  Median :   4.10     Median :2008   Median : 7.000   Median : 515.0  
##  Mean   :  13.18     Mean   :2008   Mean   : 6.728   Mean   : 536.1  
##  3rd Qu.:   9.00     3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.: 683.8  
##  Max.   :1000.00     Max.   :2017   Max.   :12.000   Max.   :1887.0  
##  NA's   :5791                                        NA's   :1412    
##   team_2_fifa        match_id      avg_odds_home_win avg_odds_draw   
##  Min.   :   0.0   Min.   :173540   Min.   : 1.010    Min.   : 1.737  
##  1st Qu.: 326.0   1st Qu.:337465   1st Qu.: 1.422    1st Qu.: 3.207  
##  Median : 503.0   Median :464972   Median : 2.035    Median : 3.491  
##  Mean   : 522.9   Mean   :496687   Mean   : 3.418    Mean   : 4.423  
##  3rd Qu.: 659.0   3rd Qu.:653966   3rd Qu.: 3.153    3rd Qu.: 4.527  
##  Max.   :1920.0   Max.   :876298   Max.   :60.687    Max.   :32.230  
##  NA's   :1429     NA's   :9722     NA's   :9722      NA's   :9722    
##  avg_odds_away_win
##  Min.   :  1.010  
##  1st Qu.:  2.215  
##  Median :  3.526  
##  Mean   :  6.195  
##  3rd Qu.:  7.170  
##  Max.   :103.024  
##  NA's   :9722

It seems promising at first glance but there are missing FIFA rankings which shouldn’t happen.

matched <- dat %>%
  group_by(team_1) %>%
  summarise(match = sum(!is.na(team_1_fifa)),
            no_match = sum(is.na(team_1_fifa))) %>%
  mutate(perc = match/(match + no_match)) %>%
  arrange(perc)

matched %>%
  filter(perc == 0) %>%
  arrange(desc(no_match))
## # A tibble: 30 x 4
##    team_1             match no_match  perc
##    <chr>              <int>    <int> <dbl>
##  1 China                  0      140    0.
##  2 Ireland                0      122    0.
##  3 Ivory Coast            0       98    0.
##  4 Macedonia              0       92    0.
##  5 Bosnia-Herzegovina     0       81    0.
##  6 Cape Verde             0       45    0.
##  7 Burma                  0       36    0.
##  8 Curacao                0       22    0.
##  9 Martinique             0       22    0.
## 10 Guadeloupe             0       17    0.
## # ... with 20 more rows

So - there are only 30 countries that don’t match at all. Most of those are really small countries, or countries that have changed names. China is a weird one to be missing. Let’s see what we can find in the FIFA data set.

fifa_dat %>%
  filter(str_detect(Team, "China"))
## # A tibble: 230 x 4
##     year month Team       Pts
##    <dbl> <dbl> <chr>    <int>
##  1 1999.    1. China PR   513
##  2 1999.    2. China PR   511
##  3 1999.    3. China PR   510
##  4 1999.    4. China PR   509
##  5 1999.    5. China PR   507
##  6 1999.    6. China PR   505
##  7 1999.    7. China PR   502
##  8 1999.    8. China PR   499
##  9 1999.    9. China PR   497
## 10 1999.   10. China PR   494
## # ... with 220 more rows
betfair_dat %>%
  filter(str_detect(team_1, "China"))
## # A tibble: 140 x 14
##    date       team_1 team_2    team_1_goals team_2_goals tournament       
##    <date>     <chr>  <chr>            <int>        <int> <chr>            
##  1 2002-06-04 China  Costa Ri…            0            2 World Cup 2002   
##  2 2003-02-12 China  Brazil               0            0 Friendly         
##  3 2003-08-20 China  Chile                0            0 Friendly         
##  4 2004-06-01 China  Hungary              2            1 Friendly         
##  5 2004-06-09 China  Malaysia             4            0 World Cup 2006 Q…
##  6 2004-07-17 China  Bahrain              2            2 Asian Cup 2004   
##  7 2004-07-21 China  Indonesia            5            0 Asian Cup 2004   
##  8 2004-07-25 China  Qatar                1            0 Asian Cup 2004   
##  9 2004-07-30 China  Iraq                 3            0 Asian Cup 2004   
## 10 2004-08-03 China  Iran                 1            1 Asian Cup 2004   
## # ... with 130 more rows, and 8 more variables: is_team_1_home <lgl>,
## #   is_team_2_home <lgl>, is_neutral <lgl>, team_1_betfair_odds <dbl>,
## #   draw_betfair_odds <dbl>, team_2_betfair_odds <dbl>, year <dbl>,
## #   month <dbl>
fifa_dat %>%
  filter(str_detect(Team, "Ireland"))
## # A tibble: 450 x 4
##     year month Team                  Pts
##    <dbl> <dbl> <chr>               <int>
##  1 1999.    1. Republic of Ireland   535
##  2 1999.    1. Northern Ireland      467
##  3 1999.    2. Republic of Ireland   550
##  4 1999.    2. Northern Ireland      465
##  5 1999.    3. Republic of Ireland   547
##  6 1999.    3. Northern Ireland      463
##  7 1999.    4. Republic of Ireland   540
##  8 1999.    4. Northern Ireland      459
##  9 1999.    5. Republic of Ireland   557
## 10 1999.    5. Northern Ireland      455
## # ... with 440 more rows
betfair_dat %>%
  filter(str_detect(team_1, "Ireland"))
## # A tibble: 204 x 14
##    date       team_1  team_2    team_1_goals team_2_goals tournament      
##    <date>     <chr>   <chr>            <int>        <int> <chr>           
##  1 2001-06-02 Ireland Portugal             1            1 World Cup 2002 …
##  2 2001-11-10 Ireland Iran                 2            0 World Cup 2002 …
##  3 2002-04-17 Ireland USA                  2            1 Friendly        
##  4 2002-06-01 Ireland Cameroon             1            1 World Cup 2002  
##  5 2003-06-07 Ireland Albania              2            1 Euro 2004 Quali…
##  6 2003-09-06 Ireland Russia               1            1 Euro 2004 Quali…
##  7 2004-02-18 Ireland Brazil               0            0 Friendly        
##  8 2004-03-31 Ireland Czech Re…            2            1 Friendly        
##  9 2004-05-27 Ireland Romania              1            0 Friendly        
## 10 2004-05-29 Ireland Nigeria              0            3 Friendly        
## # ... with 194 more rows, and 8 more variables: is_team_1_home <lgl>,
## #   is_team_2_home <lgl>, is_neutral <lgl>, team_1_betfair_odds <dbl>,
## #   draw_betfair_odds <dbl>, team_2_betfair_odds <dbl>, year <dbl>,
## #   month <dbl>

Hmm - so it didn’t join because the FIFA data using “China PR”. Also, it didn’t match Northern Ireland or Republic of Ireland. I’m guessing that is the case for the rest of these 172 countries as well. I really should explore doing some fuzzy matching but it’s only a small group and most of those teams aren’t featuring in many matches so I’m just going to do it for the top 5. I manually worked out what these differences were.

fix_teams <- function(team){
    case_when(
      team == "China PR" ~ "China",
      team == "Republic of Ireland" ~ "Ireland",
      team == "Côte d'Ivoire" ~ "Ivory Coast",
      team == "FYR Macedonia" ~ "Macedonia",
      team == "Bosnia and Herzegovina" ~ "Bosnia-Herzegovina",
      TRUE ~ team
    )
}

fifa_dat <- fifa_dat %>%
  mutate(Team = fix_teams(Team))

OK - that should at least pick the top 5. Those smaller teams I’ll just have to not have any FIFA data. I suspect, given those level of teams aren’t appearing in World Cups, that we won’t have a big issue. The other missing data, since we matched the name properly at least once, are likely due to some kind of missing month/year combination, which I can deal with later.

# Join data together
dat <- betfair_dat %>%
  left_join(fifa_dat, by = c("year", "month", "team_1" = "Team")) %>%
  rename(team_1_fifa = Pts) %>%
  left_join(fifa_dat, by = c("year", "month", "team_2" = "Team")) %>%
  rename(team_2_fifa = Pts) %>%
  left_join(odds_dat, by = c("date" = "match_date", "team_1" = "home_team", "team_2" = "away_team"))

head(dat)
## # A tibble: 6 x 20
##   date       team_1      team_2       team_1_goals team_2_goals tournament
##   <date>     <chr>       <chr>               <int>        <int> <chr>     
## 1 2000-06-10 Belgium     Sweden                  2            1 Euro 2000 
## 2 2000-06-11 France      Denmark                 3            0 Euro 2000 
## 3 2000-06-11 Netherlands Czech Repub…            1            0 Euro 2000 
## 4 2000-06-11 Turkey      Italy                   1            2 Euro 2000 
## 5 2000-06-12 Germany     Romania                 1            1 Euro 2000 
## 6 2000-06-12 Portugal    England                 3            2 Euro 2000 
## # ... with 14 more variables: is_team_1_home <lgl>, is_team_2_home <lgl>,
## #   is_neutral <lgl>, team_1_betfair_odds <dbl>, draw_betfair_odds <dbl>,
## #   team_2_betfair_odds <dbl>, year <dbl>, month <dbl>, team_1_fifa <int>,
## #   team_2_fifa <int>, match_id <int>, avg_odds_home_win <dbl>,
## #   avg_odds_draw <dbl>, avg_odds_away_win <dbl>
glimpse(dat)
## Observations: 14,740
## Variables: 20
## $ date                <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1              <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2              <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals        <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals        <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament          <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home      <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral          <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds   <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
## $ year                <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ month               <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa         <int> 585, 758, 629, 568, 706, 647, NA, 728, 585...
## $ team_2_fifa         <int> 646, 662, 750, 660, 692, 682, 538, 703, 66...
## $ match_id            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_home_win   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_draw       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_away_win   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
summary(dat)
##       date               team_1             team_2         
##  Min.   :1998-07-15   Length:14740       Length:14740      
##  1st Qu.:2003-10-12   Class :character   Class :character  
##  Median :2008-06-22   Mode  :character   Mode  :character  
##  Mean   :2008-06-20                                        
##  3rd Qu.:2013-02-06                                        
##  Max.   :2017-11-15                                        
##                                                            
##   team_1_goals     team_2_goals    tournament        is_team_1_home 
##  Min.   : 0.000   Min.   : 0.00   Length:14740       Mode :logical  
##  1st Qu.: 0.000   1st Qu.: 0.00   Class :character   FALSE:2814     
##  Median : 1.000   Median : 1.00   Mode  :character   TRUE :11926    
##  Mean   : 1.603   Mean   : 1.05                                     
##  3rd Qu.: 2.000   3rd Qu.: 2.00                                     
##  Max.   :31.000   Max.   :15.00                                     
##                                                                     
##  is_team_2_home  is_neutral      team_1_betfair_odds draw_betfair_odds
##  Mode :logical   Mode :logical   Min.   :  1.010     Min.   :  1.010  
##  FALSE:14740     FALSE:11946     1st Qu.:  1.460     1st Qu.:  3.350  
##                  TRUE :2794      Median :  2.140     Median :  3.800  
##                                  Mean   :  5.887     Mean   :  6.645  
##                                  3rd Qu.:  3.600     3rd Qu.:  5.300  
##                                  Max.   :980.000     Max.   :980.000  
##                                  NA's   :5791        NA's   :5791     
##  team_2_betfair_odds      year          month         team_1_fifa  
##  Min.   :   1.01     Min.   :1998   Min.   : 1.000   Min.   :   0  
##  1st Qu.:   2.32     1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 339  
##  Median :   4.10     Median :2008   Median : 7.000   Median : 517  
##  Mean   :  13.17     Mean   :2008   Mean   : 6.725   Mean   : 538  
##  3rd Qu.:   9.00     3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.: 685  
##  Max.   :1000.00     Max.   :2017   Max.   :12.000   Max.   :1887  
##  NA's   :5791                                        NA's   :901   
##   team_2_fifa        match_id      avg_odds_home_win avg_odds_draw   
##  Min.   :   0.0   Min.   :173540   Min.   : 1.010    Min.   : 1.737  
##  1st Qu.: 331.0   1st Qu.:337483   1st Qu.: 1.421    1st Qu.: 3.207  
##  Median : 506.0   Median :464970   Median : 2.035    Median : 3.492  
##  Mean   : 525.6   Mean   :496636   Mean   : 3.419    Mean   : 4.423  
##  3rd Qu.: 662.0   3rd Qu.:653965   3rd Qu.: 3.154    3rd Qu.: 4.527  
##  Max.   :1920.0   Max.   :876298   Max.   :60.687    Max.   :32.230  
##  NA's   :996      NA's   :9728     NA's   :9728      NA's   :9728    
##  avg_odds_away_win
##  Min.   :  1.010  
##  1st Qu.:  2.215  
##  Median :  3.526  
##  Mean   :  6.196  
##  3rd Qu.:  7.183  
##  Max.   :103.024  
##  NA's   :9728

Submission

One of the important things is getting a good submission file in the right format. There is no point building a big model with heaps of variables and then not having those variables available for our actual submission!

Luckily, Betfair has given us a starting point with the matches we need to predict. Let’s load that file in (I’m hosting it here if you want it.)

world_cup <- read_csv(here::here("projects", "world-cup-2018", "john_smith_numbersman.csv"))
glimpse(world_cup)
## Observations: 48
## Variables: 7
## $ date             <chr> "14/06/2018", "15/06/2018", "15/06/2018", "15...
## $ match_id         <chr> "a_1", "a_2", "b_1", "b_2", "c_1", "d_1", "c_...
## $ team_1           <chr> "russia", "egypt", "morocco", "portugal", "fr...
## $ team_2           <chr> "saudi arabia", "uruguay", "iran", "spain", "...
## $ prob_team_1_win  <dbl> 0.42, 0.20, 0.20, 0.20, 0.80, 0.82, 0.42, 0.6...
## $ prob_team_1_draw <dbl> 0.32, 0.30, 0.48, 0.45, 0.12, 0.10, 0.32, 0.1...
## $ prob_team_1_lose <dbl> 0.26, 0.50, 0.32, 0.35, 0.08, 0.08, 0.26, 0.2...

First thing that jumps out is that the team’s are in lower case. I’ll need to fix that. Also, the date isn’t being read properly. Lastly, those probability columns are just examples from Betfair so I can get rid of those.

world_cup <- world_cup %>%
  mutate(date = dmy(date)) %>%
  mutate_at(vars(team_1, team_2), tools::toTitleCase) %>%
  select(date, match_id, team_1, team_2)

Now the other fields that we don’t have are tournament, is_team_1_home, is_team_2_home and is_neutral. We can also hopefully get the betfair odds and the fifa points.

# Add extra columns
world_cup <- world_cup %>%
  mutate(
    tournament = "World Cup 2018",
    is_team_1_home = ifelse(team_1 == "Russia", TRUE, FALSE),
    is_team_2_home = ifelse(team_2 == "Russia", TRUE, FALSE),
    is_neutral = !is_team_1_home & !is_team_2_home,
    year = year(date),
    month = month(date)
    )

# Join with fifa
fifa_dat_final <- fifa_dat %>% 
  group_by(Team) %>% 
  filter(row_number() == n()) %>%
  select(Team, Pts)

world_cup <- world_cup %>%
  left_join(fifa_dat_final, by = c("team_1" = "Team")) %>%
  rename(team_1_fifa = Pts) %>%
  left_join(fifa_dat_final, by = c("team_2" = "Team")) %>%
  rename(team_2_fifa = Pts)

I’ve also used the betfair API to get the betfair data. I haven’t included the code for that as it uses login details, but you can see my script here or just load the data from here.

Let’s load that in and join it to our existing world cup data frame.

betfair_wc <- read_csv(here::here("projects", "world-cup-2018", "world_cup_2018_betfair.csv"))

world_cup <- world_cup %>%
  left_join(betfair_wc, by = c("date", "team_1", "team_2"))
glimpse(world_cup)
## Observations: 48
## Variables: 15
## $ date                <date> 2018-06-14, 2018-06-15, 2018-06-15, 2018-...
## $ match_id            <chr> "a_1", "a_2", "b_1", "b_2", "c_1", "d_1", ...
## $ team_1              <chr> "Russia", "Egypt", "Morocco", "Portugal", ...
## $ team_2              <chr> "Saudi Arabia", "Uruguay", "Iran", "Spain"...
## $ tournament          <chr> "World Cup 2018", "World Cup 2018", "World...
## $ is_team_1_home      <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ is_team_2_home      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral          <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,...
## $ year                <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ month               <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa         <int> 493, 636, 681, 1306, 1166, 1254, 1106, 975...
## $ team_2_fifa         <int> 462, 976, 798, 1162, 700, 930, 1054, 635, ...
## $ draw_betfair_odds   <dbl> 4.60, 3.80, 3.05, 3.50, 7.00, 5.30, 3.35, ...
## $ team_1_betfair_odds <dbl> 1.47, 7.80, 2.38, 4.80, 1.27, 1.35, 3.30, ...
## $ team_2_betfair_odds <dbl> 10.00, 1.65, 4.00, 1.99, 14.50, 15.00, 2.4...

Finishing up

Anyway, enough for now. Let’s save it and we’ll revisit in Part 3!

write_csv(dat, here::here("projects", "world-cup-2018", "combined-data.csv"))
write_csv(world_cup, here::here("projects", "world-cup-2018", "world-cup.csv"))

I’ve done the models and submitted to Betfair - you can see those here. I’ll post up my actual model write up tomorrow after the first game!

Project Page
Part 1 - Intro
Part 2 - Data Acquisition
Part 3 - Data Exploration and Feature Engineering
Part 4 - Models (coming soon)
Part 5 - Review (coming soon)

comments powered by Disqus