I introduced the Betfair World Cup Datathon in Part 1 of this series of posts. In this second post, I’m going to focus on getting data. Given I don’t have a lot of domain knowledge, I can’t go into developing anything too advanced myself. As such, getting as much data as possible is going to be my best bet. This post shows all the data sources I’ve been able to easily access.
# Set some defaults and load libraries
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(pacman)
pacman::p_load(tidyverse, here, ggthemes, rvest, lubridate)
Load Data
Betfair
Betfair has provided all participants with data. To get that data you need to sign up to the competition, which you can do here. I’m also hosting it on Github here.
# Load betfair data
betfair_dat <- read_csv(here::here("projects", "world-cup-2018", "wc_datathon_dataset.csv"))
# View
glimpse(betfair_dat)
## Observations: 14,568
## Variables: 12
## $ date <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1 <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2 <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
summary(betfair_dat)
## date team_1 team_2
## Min. :1998-07-15 Length:14568 Length:14568
## 1st Qu.:2003-10-11 Class :character Class :character
## Median :2008-06-14 Mode :character Mode :character
## Mean :2008-06-12
## 3rd Qu.:2013-03-22
## Max. :2017-11-15
##
## team_1_goals team_2_goals tournament is_team_1_home
## Min. : 0.000 Min. : 0.000 Length:14568 Mode :logical
## 1st Qu.: 0.000 1st Qu.: 0.000 Class :character FALSE:2788
## Median : 1.000 Median : 1.000 Mode :character TRUE :11780
## Mean : 1.608 Mean : 1.051
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :31.000 Max. :15.000
##
## is_team_2_home is_neutral team_1_betfair_odds draw_betfair_odds
## Mode :logical Mode :logical Min. : 1.010 Min. : 1.010
## FALSE:14568 FALSE:11800 1st Qu.: 1.460 1st Qu.: 3.350
## TRUE :2768 Median : 2.140 Median : 3.800
## Mean : 5.947 Mean : 6.685
## 3rd Qu.: 3.600 3rd Qu.: 5.300
## Max. :980.000 Max. :980.000
## NA's :5764 NA's :5764
## team_2_betfair_odds
## Min. : 1.01
## 1st Qu.: 2.30
## Median : 4.10
## Mean : 13.28
## 3rd Qu.: 9.00
## Max. :1000.00
## NA's :5764
So - it looks like each row is a game that contains the final score, the tournament, who is the home team and then the odds of each outcome. There is quite a lot of games (~14k) which is good! There are some missing values which is a bit annoying but we’ll work out how to deal with that later! I’ll be doing a more detailed data exploration in Part 3.
Kaggle Odds dataset
I found this data set on Kaggle that has the odds for almost 500k matches! Most of those are league games but we should be able to get rid of those when we join it to our other data sets.
# Load odds data
odds_dat <- read_csv(here::here("projects", "world-cup-2018", "closing_odds.csv"))
# View
glimpse(odds_dat)
## Observations: 479,440
## Variables: 19
## $ match_id <int> 170088, 170089, 170090, 170091, 170092, 17...
## $ league <chr> "England: Premier League", "England: Premi...
## $ match_date <date> 2005-01-01, 2005-01-01, 2005-01-01, 2005-...
## $ home_team <chr> "Liverpool", "Fulham", "Aston Villa", "Bol...
## $ home_score <int> 0, 3, 1, 1, 1, 2, 2, 1, 5, 0, 0, 0, 0, 0, ...
## $ away_team <chr> "Chelsea", "Crystal Palace", "Blackburn", ...
## $ away_score <int> 1, 1, 0, 1, 3, 1, 1, 1, 2, 2, 1, 0, 2, 2, ...
## $ avg_odds_home_win <dbl> 2.9944, 1.9456, 1.8522, 1.6122, 5.9878, 1....
## $ avg_odds_draw <dbl> 3.1944, 3.2333, 3.2611, 3.4133, 3.4778, 3....
## $ avg_odds_away_win <dbl> 2.2256, 3.6722, 4.0144, 5.4722, 1.5567, 4....
## $ max_odds_home_win <dbl> 3.20, 2.04, 2.00, 1.67, 7.00, 1.73, 1.91, ...
## $ max_odds_draw <dbl> 3.25, 3.30, 3.40, 3.57, 3.60, 3.50, 3.40, ...
## $ max_odds_away_win <dbl> 2.29, 4.15, 4.50, 6.27, 1.62, 5.00, 4.33, ...
## $ top_bookie_home_win <chr> "Paddy Power", "Pinnacle Sports", "Pinnacl...
## $ top_bookie_draw <chr> "Sportingbet", "bet-at-home", "Paddy Power...
## $ top_bookie_away_win <chr> "Expekt", "Expekt", "Sportingbet", "Pinnac...
## $ n_odds_home_win <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, ...
## $ n_odds_draw <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, ...
## $ n_odds_away_win <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, ...
summary(odds_dat)
## match_id league match_date
## Min. :170088 Length:479440 Min. :2005-01-01
## 1st Qu.:394532 Class :character 1st Qu.:2009-07-05
## Median :566276 Mode :character Median :2011-12-17
## Mean :551865 Mean :2011-06-19
## 3rd Qu.:715668 3rd Qu.:2013-10-08
## Max. :876812 Max. :2015-06-30
## home_team home_score away_team away_score
## Length:479440 Min. : 0.000 Length:479440 Min. : 0.0
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.: 0.0
## Mode :character Median : 1.000 Mode :character Median : 1.0
## Mean : 1.539 Mean : 1.2
## 3rd Qu.: 2.000 3rd Qu.: 2.0
## Max. :101.000 Max. :102.0
## avg_odds_home_win avg_odds_draw avg_odds_away_win max_odds_home_win
## Min. : 0.000 Min. : 1.010 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.658 1st Qu.: 3.200 1st Qu.: 2.418 1st Qu.: 1.750
## Median : 2.084 Median : 3.361 Median : 3.200 Median : 2.200
## Mean : 2.525 Mean : 3.670 Mean : 3.975 Mean : 2.789
## 3rd Qu.: 2.643 3rd Qu.: 3.735 3rd Qu.: 4.607 3rd Qu.: 2.850
## Max. :151.000 Max. :51.000 Max. :103.024 Max. :209.040
## max_odds_draw max_odds_away_win top_bookie_home_win
## Min. : 1.010 Min. : 0.000 Length:479440
## 1st Qu.: 3.300 1st Qu.: 2.580 Class :character
## Median : 3.560 Median : 3.500 Mode :character
## Mean : 3.942 Mean : 4.573
## 3rd Qu.: 4.000 3rd Qu.: 5.200
## Max. :150.000 Max. :501.000
## top_bookie_draw top_bookie_away_win n_odds_home_win n_odds_draw
## Length:479440 Length:479440 Min. : 0.00 Min. : 1.00
## Class :character Class :character 1st Qu.: 9.00 1st Qu.: 9.00
## Mode :character Mode :character Median :17.00 Median :17.00
## Mean :16.45 Mean :16.46
## 3rd Qu.:25.00 3rd Qu.:25.00
## Max. :29.00 Max. :29.00
## n_odds_away_win
## Min. : 0.00
## 1st Qu.: 9.00
## Median :17.00
## Mean :16.45
## 3rd Qu.:25.00
## Max. :29.00
So this data set has different odds but has it for way more data and nothing missing. One problem I might have is that I’m not sure I can get these specific odds (bookies) easily for the upcoming World Cup games. If I want to include it in a model that will be an issue. I’ll need to see what I can get.
FIFA
Logically, if we wanted to try and assess team ratings we could look towards the official FIFA World rankings. These are used for primarily for seeding in tournaments, such as the World Cup. In researching for this project, I found that these are generally critisied for being too simplistic and most people prefer ELO rating systems. Nonetheless, it is a data set I am able to get access to and, given it is used for seeding in the World Cup, there may be some value for including them in a model.
Scraping data
The code to get this information has been adapted from a Github respository I found from Github user Ross Boberg.
If you would prefer to take a scripted version of my code below, you can find that here. You can also just grab the data directly from here.
Firstly, we have to create a function that builds us a URL. We can then use this to get all the combinations of URLS we need. Based on Ross’ work, we know that each URL contains a query for rank
, confediration
, gender
and page
. This function will dynamically generate the URL with the values for those queries.
Now, let’s generate a bunch of URLs. Based on Ross’s code, I know the rough start and end points for rank
and I also know the id’s of the confedirations
.
# Specify parameters
start_rank = 286 # the last update (as of Jun 5th, 2018)
end_rank = 57 # corresponds to Jan99, when point method was revised
conf_ids = c(23913, 23914, 23915, 23916, 25998, 27275)
rank <- rep(start_rank:end_rank, each = length(conf_ids))
conf_vec <- rep(conf_ids, times = length(start_rank:end_rank))
urls <- rank %>%
map2_chr(conf_vec, ~ get_ranking_table_URL(rank = .x, confederation = .y))
length(urls)
## [1] 1380
Now we’ve got a list of URLs to search through - I can pass them to read_html
and generate local HTML files for me to use. This is the most time consuming part of this process - this function took my laptop ~13 min to run so be mindful if you are following along! I’ve actually just done the first 10 of 1380 for this blog post so comment that out if needed.
# Uncomment 1st line and comment out 2nd line to do all data. I've just done 1st 10 for illustration purposes
# ind <- seq_along(urls)
ind <- 1:10
htmls <- urls[ind] %>%
map(xml2::read_html)
Success! Now we’ve got those htmls, we can do some extraction and data cleaning. We basically just need to get the dates and then the table of data on each page. Again, using the handy purrr
package to do our heavy lifting makes this relatively simple.
# Read date
dates <- htmls %>%
map(~html_nodes(.x, ".lb")) %>%
map(html_text) %>%
map_chr(~str_remove(.x, "Last Updated ")) %>%
map(dmy)
# Read Tables
tables <- htmls %>%
map(~html_nodes(.x, "#tbl_rankingTable")) %>%
map(html_table) %>%
map(as.data.frame)
# Add date, conf id and make into one dataframe
fifa_dat <- tables %>%
map2(dates, ~ .x %>% mutate(Date = .y)) %>%
map2(conf_vec[ind], ~ .x %>% mutate(Conference_id = .y)) %>%
reduce(bind_rows) %>%
select(Date, Team, Pts, Conference_id) %>%
arrange(Date, desc(Pts))
head(fifa_dat)
## Date Team Pts Conference_id
## 1 2018-04-12 Brazil 1384 23915
## 2 2018-04-12 Argentina 1254 23915
## 3 2018-04-12 Chile 1146 23915
## 4 2018-04-12 Peru 1106 23915
## 5 2018-04-12 Tunisia 1012 23913
## 6 2018-04-12 Mexico 1008 23914
For the purposes of joining my Data sets, I need the full set of data (the above code is just for the last 10 ratings). Here’s some I prepared earlier!.
fifa_dat <- read_csv(here::here("projects", "world-cup-2018", "fifa_rank_history.csv"))
glimpse(fifa_dat)
## Observations: 47,104
## Variables: 4
## $ Date <date> 1999-01-27, 1999-01-27, 1999-01-27, 1999-01-27,...
## $ Team <chr> "Brazil", "France", "Croatia", "Italy", "Germany...
## $ Pts <int> 829, 787, 745, 745, 742, 733, 726, 720, 703, 698...
## $ Conference_id <int> 23915, 27275, 27275, 27275, 27275, 23915, 27275,...
summary(fifa_dat)
## Date Team Pts Conference_id
## Min. :1999-01-27 Length:47104 Min. : 0.0 Min. :23913
## 1st Qu.:2003-12-15 Class :character 1st Qu.: 162.0 1st Qu.:23913
## Median :2008-10-08 Mode :character Median : 360.0 Median :23916
## Mean :2008-09-24 Mean : 397.3 Mean :25221
## 3rd Qu.:2013-08-08 3rd Qu.: 568.0 3rd Qu.:27275
## Max. :2018-05-17 Max. :1920.0 Max. :27275
Other sources
I had looked at some other sources but found it either too tricky or they didn’t have wide enough covered. These including trying to include ‘team value’ information from Transfermarkt. I think this would be awesome but I could only get the value of each of the current World Cup squads. I’m sure it’s possible to try and find backdate information but I don’t have the time.
Another option that would be cool would be to use the EA Sports FIFA player ratings as a bit of a proxy for the talent in each team. I did find some promising websites but they were ultimately going to take too much time to scrape.
Combining dataset
Now that we have acquired our data, we need to combine it in some way to make one big data file. Again, I want this to be one row per match and it’s going to be fairly wide.
Now, what we want is to add the FIFA data to our betfair odds. This isn’t going to a simple join I don’t think - we want to match the home team name and the date of the match to the closest date and team in our FIFA file and return the rating. I’m pretty sure that the FIFA ratings get updated once per month so we may be able to join on year
, month
and team
# Add in a date/month to fifa and betfair data
fifa_dat <- fifa_dat %>%
mutate(year = year(Date),
month = month(Date)) %>%
select(year, month, Team, Pts)
betfair_dat <- betfair_dat %>%
mutate(year = year(date),
month = month(date))
odds_dat <- odds_dat %>%
select(match_id, match_date, home_team, away_team, avg_odds_home_win:avg_odds_away_win)
# Join data together
dat <- betfair_dat %>%
left_join(fifa_dat, by = c("year", "month", "team_1" = "Team")) %>%
rename(team_1_fifa = Pts) %>%
left_join(fifa_dat, by = c("year", "month", "team_2" = "Team")) %>%
rename(team_2_fifa = Pts) %>%
left_join(odds_dat, by = c("date" = "match_date", "team_1" = "home_team", "team_2" = "away_team"))
head(dat)
## # A tibble: 6 x 20
## date team_1 team_2 team_1_goals team_2_goals tournament
## <date> <chr> <chr> <int> <int> <chr>
## 1 2000-06-10 Belgium Sweden 2 1 Euro 2000
## 2 2000-06-11 France Denmark 3 0 Euro 2000
## 3 2000-06-11 Netherlands Czech Repub… 1 0 Euro 2000
## 4 2000-06-11 Turkey Italy 1 2 Euro 2000
## 5 2000-06-12 Germany Romania 1 1 Euro 2000
## 6 2000-06-12 Portugal England 3 2 Euro 2000
## # ... with 14 more variables: is_team_1_home <lgl>, is_team_2_home <lgl>,
## # is_neutral <lgl>, team_1_betfair_odds <dbl>, draw_betfair_odds <dbl>,
## # team_2_betfair_odds <dbl>, year <dbl>, month <dbl>, team_1_fifa <int>,
## # team_2_fifa <int>, match_id <int>, avg_odds_home_win <dbl>,
## # avg_odds_draw <dbl>, avg_odds_away_win <dbl>
glimpse(dat)
## Observations: 14,730
## Variables: 20
## $ date <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1 <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2 <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
## $ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa <int> 585, 758, 629, 568, 706, 647, NA, 728, 585...
## $ team_2_fifa <int> 646, 662, 750, 660, 692, 682, 538, 703, 66...
## $ match_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_home_win <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_draw <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_away_win <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
summary(dat)
## date team_1 team_2
## Min. :1998-07-15 Length:14730 Length:14730
## 1st Qu.:2003-10-12 Class :character Class :character
## Median :2008-06-22 Mode :character Mode :character
## Mean :2008-06-19
## 3rd Qu.:2013-02-06
## Max. :2017-11-15
##
## team_1_goals team_2_goals tournament is_team_1_home
## Min. : 0.000 Min. : 0.00 Length:14730 Mode :logical
## 1st Qu.: 0.000 1st Qu.: 0.00 Class :character FALSE:2810
## Median : 1.000 Median : 1.00 Mode :character TRUE :11920
## Mean : 1.603 Mean : 1.05
## 3rd Qu.: 2.000 3rd Qu.: 2.00
## Max. :31.000 Max. :15.00
##
## is_team_2_home is_neutral team_1_betfair_odds draw_betfair_odds
## Mode :logical Mode :logical Min. : 1.010 Min. : 1.010
## FALSE:14730 FALSE:11940 1st Qu.: 1.460 1st Qu.: 3.350
## TRUE :2790 Median : 2.140 Median : 3.800
## Mean : 5.892 Mean : 6.647
## 3rd Qu.: 3.600 3rd Qu.: 5.300
## Max. :980.000 Max. :980.000
## NA's :5791 NA's :5791
## team_2_betfair_odds year month team_1_fifa
## Min. : 1.01 Min. :1998 Min. : 1.000 Min. : 0.0
## 1st Qu.: 2.32 1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 333.0
## Median : 4.10 Median :2008 Median : 7.000 Median : 515.0
## Mean : 13.18 Mean :2008 Mean : 6.728 Mean : 536.1
## 3rd Qu.: 9.00 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.: 683.8
## Max. :1000.00 Max. :2017 Max. :12.000 Max. :1887.0
## NA's :5791 NA's :1412
## team_2_fifa match_id avg_odds_home_win avg_odds_draw
## Min. : 0.0 Min. :173540 Min. : 1.010 Min. : 1.737
## 1st Qu.: 326.0 1st Qu.:337465 1st Qu.: 1.422 1st Qu.: 3.207
## Median : 503.0 Median :464972 Median : 2.035 Median : 3.491
## Mean : 522.9 Mean :496687 Mean : 3.418 Mean : 4.423
## 3rd Qu.: 659.0 3rd Qu.:653966 3rd Qu.: 3.153 3rd Qu.: 4.527
## Max. :1920.0 Max. :876298 Max. :60.687 Max. :32.230
## NA's :1429 NA's :9722 NA's :9722 NA's :9722
## avg_odds_away_win
## Min. : 1.010
## 1st Qu.: 2.215
## Median : 3.526
## Mean : 6.195
## 3rd Qu.: 7.170
## Max. :103.024
## NA's :9722
It seems promising at first glance but there are missing FIFA rankings which shouldn’t happen.
matched <- dat %>%
group_by(team_1) %>%
summarise(match = sum(!is.na(team_1_fifa)),
no_match = sum(is.na(team_1_fifa))) %>%
mutate(perc = match/(match + no_match)) %>%
arrange(perc)
matched %>%
filter(perc == 0) %>%
arrange(desc(no_match))
## # A tibble: 30 x 4
## team_1 match no_match perc
## <chr> <int> <int> <dbl>
## 1 China 0 140 0.
## 2 Ireland 0 122 0.
## 3 Ivory Coast 0 98 0.
## 4 Macedonia 0 92 0.
## 5 Bosnia-Herzegovina 0 81 0.
## 6 Cape Verde 0 45 0.
## 7 Burma 0 36 0.
## 8 Curacao 0 22 0.
## 9 Martinique 0 22 0.
## 10 Guadeloupe 0 17 0.
## # ... with 20 more rows
So - there are only 30 countries that don’t match at all. Most of those are really small countries, or countries that have changed names. China is a weird one to be missing. Let’s see what we can find in the FIFA data set.
fifa_dat %>%
filter(str_detect(Team, "China"))
## # A tibble: 230 x 4
## year month Team Pts
## <dbl> <dbl> <chr> <int>
## 1 1999. 1. China PR 513
## 2 1999. 2. China PR 511
## 3 1999. 3. China PR 510
## 4 1999. 4. China PR 509
## 5 1999. 5. China PR 507
## 6 1999. 6. China PR 505
## 7 1999. 7. China PR 502
## 8 1999. 8. China PR 499
## 9 1999. 9. China PR 497
## 10 1999. 10. China PR 494
## # ... with 220 more rows
betfair_dat %>%
filter(str_detect(team_1, "China"))
## # A tibble: 140 x 14
## date team_1 team_2 team_1_goals team_2_goals tournament
## <date> <chr> <chr> <int> <int> <chr>
## 1 2002-06-04 China Costa Ri… 0 2 World Cup 2002
## 2 2003-02-12 China Brazil 0 0 Friendly
## 3 2003-08-20 China Chile 0 0 Friendly
## 4 2004-06-01 China Hungary 2 1 Friendly
## 5 2004-06-09 China Malaysia 4 0 World Cup 2006 Q…
## 6 2004-07-17 China Bahrain 2 2 Asian Cup 2004
## 7 2004-07-21 China Indonesia 5 0 Asian Cup 2004
## 8 2004-07-25 China Qatar 1 0 Asian Cup 2004
## 9 2004-07-30 China Iraq 3 0 Asian Cup 2004
## 10 2004-08-03 China Iran 1 1 Asian Cup 2004
## # ... with 130 more rows, and 8 more variables: is_team_1_home <lgl>,
## # is_team_2_home <lgl>, is_neutral <lgl>, team_1_betfair_odds <dbl>,
## # draw_betfair_odds <dbl>, team_2_betfair_odds <dbl>, year <dbl>,
## # month <dbl>
fifa_dat %>%
filter(str_detect(Team, "Ireland"))
## # A tibble: 450 x 4
## year month Team Pts
## <dbl> <dbl> <chr> <int>
## 1 1999. 1. Republic of Ireland 535
## 2 1999. 1. Northern Ireland 467
## 3 1999. 2. Republic of Ireland 550
## 4 1999. 2. Northern Ireland 465
## 5 1999. 3. Republic of Ireland 547
## 6 1999. 3. Northern Ireland 463
## 7 1999. 4. Republic of Ireland 540
## 8 1999. 4. Northern Ireland 459
## 9 1999. 5. Republic of Ireland 557
## 10 1999. 5. Northern Ireland 455
## # ... with 440 more rows
betfair_dat %>%
filter(str_detect(team_1, "Ireland"))
## # A tibble: 204 x 14
## date team_1 team_2 team_1_goals team_2_goals tournament
## <date> <chr> <chr> <int> <int> <chr>
## 1 2001-06-02 Ireland Portugal 1 1 World Cup 2002 …
## 2 2001-11-10 Ireland Iran 2 0 World Cup 2002 …
## 3 2002-04-17 Ireland USA 2 1 Friendly
## 4 2002-06-01 Ireland Cameroon 1 1 World Cup 2002
## 5 2003-06-07 Ireland Albania 2 1 Euro 2004 Quali…
## 6 2003-09-06 Ireland Russia 1 1 Euro 2004 Quali…
## 7 2004-02-18 Ireland Brazil 0 0 Friendly
## 8 2004-03-31 Ireland Czech Re… 2 1 Friendly
## 9 2004-05-27 Ireland Romania 1 0 Friendly
## 10 2004-05-29 Ireland Nigeria 0 3 Friendly
## # ... with 194 more rows, and 8 more variables: is_team_1_home <lgl>,
## # is_team_2_home <lgl>, is_neutral <lgl>, team_1_betfair_odds <dbl>,
## # draw_betfair_odds <dbl>, team_2_betfair_odds <dbl>, year <dbl>,
## # month <dbl>
Hmm - so it didn’t join because the FIFA data using “China PR”. Also, it didn’t match Northern Ireland or Republic of Ireland. I’m guessing that is the case for the rest of these 172 countries as well. I really should explore doing some fuzzy matching but it’s only a small group and most of those teams aren’t featuring in many matches so I’m just going to do it for the top 5. I manually worked out what these differences were.
fix_teams <- function(team){
case_when(
team == "China PR" ~ "China",
team == "Republic of Ireland" ~ "Ireland",
team == "Côte d'Ivoire" ~ "Ivory Coast",
team == "FYR Macedonia" ~ "Macedonia",
team == "Bosnia and Herzegovina" ~ "Bosnia-Herzegovina",
TRUE ~ team
)
}
fifa_dat <- fifa_dat %>%
mutate(Team = fix_teams(Team))
OK - that should at least pick the top 5. Those smaller teams I’ll just have to not have any FIFA data. I suspect, given those level of teams aren’t appearing in World Cups, that we won’t have a big issue. The other missing data, since we matched the name properly at least once, are likely due to some kind of missing month/year combination, which I can deal with later.
# Join data together
dat <- betfair_dat %>%
left_join(fifa_dat, by = c("year", "month", "team_1" = "Team")) %>%
rename(team_1_fifa = Pts) %>%
left_join(fifa_dat, by = c("year", "month", "team_2" = "Team")) %>%
rename(team_2_fifa = Pts) %>%
left_join(odds_dat, by = c("date" = "match_date", "team_1" = "home_team", "team_2" = "away_team"))
head(dat)
## # A tibble: 6 x 20
## date team_1 team_2 team_1_goals team_2_goals tournament
## <date> <chr> <chr> <int> <int> <chr>
## 1 2000-06-10 Belgium Sweden 2 1 Euro 2000
## 2 2000-06-11 France Denmark 3 0 Euro 2000
## 3 2000-06-11 Netherlands Czech Repub… 1 0 Euro 2000
## 4 2000-06-11 Turkey Italy 1 2 Euro 2000
## 5 2000-06-12 Germany Romania 1 1 Euro 2000
## 6 2000-06-12 Portugal England 3 2 Euro 2000
## # ... with 14 more variables: is_team_1_home <lgl>, is_team_2_home <lgl>,
## # is_neutral <lgl>, team_1_betfair_odds <dbl>, draw_betfair_odds <dbl>,
## # team_2_betfair_odds <dbl>, year <dbl>, month <dbl>, team_1_fifa <int>,
## # team_2_fifa <int>, match_id <int>, avg_odds_home_win <dbl>,
## # avg_odds_draw <dbl>, avg_odds_away_win <dbl>
glimpse(dat)
## Observations: 14,740
## Variables: 20
## $ date <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1 <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2 <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
## $ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa <int> 585, 758, 629, 568, 706, 647, NA, 728, 585...
## $ team_2_fifa <int> 646, 662, 750, 660, 692, 682, 538, 703, 66...
## $ match_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_home_win <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_draw <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_away_win <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
summary(dat)
## date team_1 team_2
## Min. :1998-07-15 Length:14740 Length:14740
## 1st Qu.:2003-10-12 Class :character Class :character
## Median :2008-06-22 Mode :character Mode :character
## Mean :2008-06-20
## 3rd Qu.:2013-02-06
## Max. :2017-11-15
##
## team_1_goals team_2_goals tournament is_team_1_home
## Min. : 0.000 Min. : 0.00 Length:14740 Mode :logical
## 1st Qu.: 0.000 1st Qu.: 0.00 Class :character FALSE:2814
## Median : 1.000 Median : 1.00 Mode :character TRUE :11926
## Mean : 1.603 Mean : 1.05
## 3rd Qu.: 2.000 3rd Qu.: 2.00
## Max. :31.000 Max. :15.00
##
## is_team_2_home is_neutral team_1_betfair_odds draw_betfair_odds
## Mode :logical Mode :logical Min. : 1.010 Min. : 1.010
## FALSE:14740 FALSE:11946 1st Qu.: 1.460 1st Qu.: 3.350
## TRUE :2794 Median : 2.140 Median : 3.800
## Mean : 5.887 Mean : 6.645
## 3rd Qu.: 3.600 3rd Qu.: 5.300
## Max. :980.000 Max. :980.000
## NA's :5791 NA's :5791
## team_2_betfair_odds year month team_1_fifa
## Min. : 1.01 Min. :1998 Min. : 1.000 Min. : 0
## 1st Qu.: 2.32 1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 339
## Median : 4.10 Median :2008 Median : 7.000 Median : 517
## Mean : 13.17 Mean :2008 Mean : 6.725 Mean : 538
## 3rd Qu.: 9.00 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.: 685
## Max. :1000.00 Max. :2017 Max. :12.000 Max. :1887
## NA's :5791 NA's :901
## team_2_fifa match_id avg_odds_home_win avg_odds_draw
## Min. : 0.0 Min. :173540 Min. : 1.010 Min. : 1.737
## 1st Qu.: 331.0 1st Qu.:337483 1st Qu.: 1.421 1st Qu.: 3.207
## Median : 506.0 Median :464970 Median : 2.035 Median : 3.492
## Mean : 525.6 Mean :496636 Mean : 3.419 Mean : 4.423
## 3rd Qu.: 662.0 3rd Qu.:653965 3rd Qu.: 3.154 3rd Qu.: 4.527
## Max. :1920.0 Max. :876298 Max. :60.687 Max. :32.230
## NA's :996 NA's :9728 NA's :9728 NA's :9728
## avg_odds_away_win
## Min. : 1.010
## 1st Qu.: 2.215
## Median : 3.526
## Mean : 6.196
## 3rd Qu.: 7.183
## Max. :103.024
## NA's :9728
Submission
One of the important things is getting a good submission file in the right format. There is no point building a big model with heaps of variables and then not having those variables available for our actual submission!
Luckily, Betfair has given us a starting point with the matches we need to predict. Let’s load that file in (I’m hosting it here if you want it.)
world_cup <- read_csv(here::here("projects", "world-cup-2018", "john_smith_numbersman.csv"))
glimpse(world_cup)
## Observations: 48
## Variables: 7
## $ date <chr> "14/06/2018", "15/06/2018", "15/06/2018", "15...
## $ match_id <chr> "a_1", "a_2", "b_1", "b_2", "c_1", "d_1", "c_...
## $ team_1 <chr> "russia", "egypt", "morocco", "portugal", "fr...
## $ team_2 <chr> "saudi arabia", "uruguay", "iran", "spain", "...
## $ prob_team_1_win <dbl> 0.42, 0.20, 0.20, 0.20, 0.80, 0.82, 0.42, 0.6...
## $ prob_team_1_draw <dbl> 0.32, 0.30, 0.48, 0.45, 0.12, 0.10, 0.32, 0.1...
## $ prob_team_1_lose <dbl> 0.26, 0.50, 0.32, 0.35, 0.08, 0.08, 0.26, 0.2...
First thing that jumps out is that the team’s are in lower case. I’ll need to fix that. Also, the date isn’t being read properly. Lastly, those probability columns are just examples from Betfair so I can get rid of those.
world_cup <- world_cup %>%
mutate(date = dmy(date)) %>%
mutate_at(vars(team_1, team_2), tools::toTitleCase) %>%
select(date, match_id, team_1, team_2)
Now the other fields that we don’t have are tournament
, is_team_1_home
, is_team_2_home
and is_neutral
. We can also hopefully get the betfair
odds and the fifa
points.
# Add extra columns
world_cup <- world_cup %>%
mutate(
tournament = "World Cup 2018",
is_team_1_home = ifelse(team_1 == "Russia", TRUE, FALSE),
is_team_2_home = ifelse(team_2 == "Russia", TRUE, FALSE),
is_neutral = !is_team_1_home & !is_team_2_home,
year = year(date),
month = month(date)
)
# Join with fifa
fifa_dat_final <- fifa_dat %>%
group_by(Team) %>%
filter(row_number() == n()) %>%
select(Team, Pts)
world_cup <- world_cup %>%
left_join(fifa_dat_final, by = c("team_1" = "Team")) %>%
rename(team_1_fifa = Pts) %>%
left_join(fifa_dat_final, by = c("team_2" = "Team")) %>%
rename(team_2_fifa = Pts)
I’ve also used the betfair API to get the betfair data. I haven’t included the code for that as it uses login details, but you can see my script here or just load the data from here.
Let’s load that in and join it to our existing world cup data frame.
betfair_wc <- read_csv(here::here("projects", "world-cup-2018", "world_cup_2018_betfair.csv"))
world_cup <- world_cup %>%
left_join(betfair_wc, by = c("date", "team_1", "team_2"))
glimpse(world_cup)
## Observations: 48
## Variables: 15
## $ date <date> 2018-06-14, 2018-06-15, 2018-06-15, 2018-...
## $ match_id <chr> "a_1", "a_2", "b_1", "b_2", "c_1", "d_1", ...
## $ team_1 <chr> "Russia", "Egypt", "Morocco", "Portugal", ...
## $ team_2 <chr> "Saudi Arabia", "Uruguay", "Iran", "Spain"...
## $ tournament <chr> "World Cup 2018", "World Cup 2018", "World...
## $ is_team_1_home <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ is_team_2_home <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,...
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa <int> 493, 636, 681, 1306, 1166, 1254, 1106, 975...
## $ team_2_fifa <int> 462, 976, 798, 1162, 700, 930, 1054, 635, ...
## $ draw_betfair_odds <dbl> 4.60, 3.80, 3.05, 3.50, 7.00, 5.30, 3.35, ...
## $ team_1_betfair_odds <dbl> 1.47, 7.80, 2.38, 4.80, 1.27, 1.35, 3.30, ...
## $ team_2_betfair_odds <dbl> 10.00, 1.65, 4.00, 1.99, 14.50, 15.00, 2.4...
Finishing up
Anyway, enough for now. Let’s save it and we’ll revisit in Part 3!
write_csv(dat, here::here("projects", "world-cup-2018", "combined-data.csv"))
write_csv(world_cup, here::here("projects", "world-cup-2018", "world-cup.csv"))
I’ve done the models and submitted to Betfair - you can see those here. I’ll post up my actual model write up tomorrow after the first game!
Project Page
Part 1 - Intro
Part 2 - Data Acquisition
Part 3 - Data Exploration and Feature Engineering
Part 4 - Models (coming soon)
Part 5 - Review (coming soon)