World Cup Datathon - Part 3: Feature Engineering

In Part 1 and
Part 2 of this series, I introduced the Betfair World Cup datathon and acquired some data. Now, let’s spend a bit of time exploring that data and creating some features!

Feature Engineering is where the art comes into our modelling process. Given I don’t have a lot of domain knowledge here, I’ve done a bit of quick reading. Again - I’m not going to go into some advanced Soccer analytics like Expected Goal (XG) or any kind of network analysis - so I’ll simply use the match results, FIFA rankings and betfair data.

First - let’s load our combined data from Part 2.

# Load libraries
library(pacman)
pacman::p_load(tidyverse, here, elo, ggthemes, zoo, naniar, visdat, lubridate)

dat <- read_csv(here::here("projects", "world-cup-2018", "combined-data.csv"))
world_cup <- read_csv(here::here("projects", "world-cup-2018", "world-cup.csv"))

Data Exploration

Let’s do a quick exploration of our dat data frame. I did a little it of this in Part 2 but we’ll try get a bit more in depth.

Before we start, let’s create our output variable and then look at it’s distribution.

dat <- dat %>%
  mutate(
    result = case_when(
      team_1_goals > team_2_goals ~ "win",
      team_1_goals < team_2_goals ~ "lose",
      TRUE ~ "draw"),
    result = as.factor(result))

dat %>%
  ungroup() %>%
  dplyr::count(result) %>%
  mutate(perc = n/sum(n))
## # A tibble: 3 x 3
##   result     n  perc
##   <fct>  <int> <dbl>
## 1 draw    3565 0.242
## 2 lose    4053 0.275
## 3 win     7122 0.483

So - that is essentially my baseline model. For an game I can naively put in those percentages. If any other model can’t beat that, it’s not adding anything!

What about getting an understanding of the rest of the data.

glimpse(dat)
## Observations: 14,740
## Variables: 21
## $ date                <date> 2000-06-10, 2000-06-11, 2000-06-11, 2000-...
## $ team_1              <chr> "Belgium", "France", "Netherlands", "Turke...
## $ team_2              <chr> "Sweden", "Denmark", "Czech Republic", "It...
## $ team_1_goals        <int> 2, 3, 1, 1, 1, 3, 3, 0, 0, 3, 1, 0, 0, 1, ...
## $ team_2_goals        <int> 1, 0, 0, 2, 1, 2, 3, 1, 2, 0, 0, 1, 1, 2, ...
## $ tournament          <chr> "Euro 2000", "Euro 2000", "Euro 2000", "Eu...
## $ is_team_1_home      <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA...
## $ is_team_2_home      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ is_neutral          <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE...
## $ team_1_betfair_odds <dbl> 2.72, 1.68, 1.80, 5.90, 2.22, 3.60, 1.76, ...
## $ draw_betfair_odds   <dbl> 2.90, 3.75, 3.45, 3.25, 3.10, 3.10, 3.65, ...
## $ team_2_betfair_odds <dbl> 3.30, 6.60, 5.90, 1.86, 4.20, 2.40, 5.80, ...
## $ year                <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ month               <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ team_1_fifa         <int> 585, 758, 629, 568, 706, 647, NA, 728, 585...
## $ team_2_fifa         <int> 646, 662, 750, 660, 692, 682, 538, 703, 66...
## $ match_id            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_home_win   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_draw       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ avg_odds_away_win   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ result              <fct> win, win, win, lose, draw, win, draw, lose...

The first thing that jumps out at me is the tournament column contains the tournament and the year in it. It would be nice to group say, all “Euro” tournaments together. I also want to get a match_id that is just the game number in our dataset.

dat <- dat %>%
  arrange(date) %>%
  mutate(tournament_cat = as.factor(str_remove(tournament, "[[:digit:]]+") %>% str_squish()),
         match_id = row_number())

dat %>% distinct(tournament_cat)
## # A tibble: 13 x 1
##    tournament_cat                   
##    <fct>                            
##  1 Friendly                         
##  2 African Cup of Nations Qualifiers
##  3 Euro Qualifiers                  
##  4 Oceania Nations Cup              
##  5 Copa America                     
##  6 Confederations Cup               
##  7 Asian Cup Qualifiers             
##  8 African Cup of Nations           
##  9 Gold Cup                         
## 10 World Cup Qualifiers             
## 11 Euro                             
## 12 Asian Cup                        
## 13 World Cup

That’s better. The other thing is that the odds data is being read as characters rather than numeric. That’s pretty easy to fix.

dat <- dat %>%
  mutate_at(vars(contains("odds")), as.numeric)

Now we can use skim to look at how each column is distributed or structured.

skimr::skim(dat)
## Skim summary statistics
##  n obs: 14740 
##  n variables: 22 
## 
## Variable type: character 
##    variable missing complete     n min max empty n_unique
##      team_1       0    14740 14740   3  30     0      228
##      team_2       0    14740 14740   3  30     0      230
##  tournament       0    14740 14740   8  38     0       81
## 
## Variable type: Date 
##  variable missing complete     n        min        max     median n_unique
##      date       0    14740 14740 1998-07-15 2017-11-15 2008-06-22     3220
## 
## Variable type: factor 
##        variable missing complete     n n_unique
##          result       0    14740 14740        3
##  tournament_cat       0    14740 14740       13
##                                 top_counts ordered
##     win: 7122, los: 4053, dra: 3565, NA: 0   FALSE
##  Fri: 6636, Wor: 3987, Eur: 1258, Afr: 895   FALSE
## 
## Variable type: integer 
##      variable missing complete     n    mean      sd p0     p25 median
##      match_id       0    14740 14740 7370.5  4255.22  1 3685.75 7370.5
##         month       0    14740 14740    6.73    3.15  1    4       7  
##   team_1_fifa     901    13839 14740  538.03  288.93  0  339     517  
##  team_1_goals       0    14740 14740    1.6     1.63  0    0       1  
##   team_2_fifa     996    13744 14740  525.65  287.65  0  331     506  
##  team_2_goals       0    14740 14740    1.05    1.21  0    0       1  
##       p75  p100     hist
##  11055.25 14740 ▇▇▇▇▇▇▇▇
##     10       12 ▅▅▃▇▂▇▆▅
##    685     1887 ▃▇▇▃▁▁▁▁
##      2       31 ▇▁▁▁▁▁▁▁
##    662     1920 ▃▇▇▃▁▁▁▁
##      2       15 ▇▂▁▁▁▁▁▁
## 
## Variable type: logical 
##        variable missing complete     n mean                        count
##      is_neutral       0    14740 14740 0.19 FAL: 11946, TRU: 2794, NA: 0
##  is_team_1_home       0    14740 14740 0.81 TRU: 11926, FAL: 2814, NA: 0
##  is_team_2_home       0    14740 14740 0               FAL: 14740, NA: 0
## 
## Variable type: numeric 
##             variable missing complete     n    mean    sd      p0     p25
##    avg_odds_away_win    9728     5012 14740    6.2   7.85    1.01    2.21
##        avg_odds_draw    9728     5012 14740    4.42  2.4     1.74    3.21
##    avg_odds_home_win    9728     5012 14740    3.42  5.22    1.01    1.42
##    draw_betfair_odds    5791     8949 14740    6.64 17.21    1.01    3.35
##  team_1_betfair_odds    5791     8949 14740    5.89 26.6     1.01    1.46
##  team_2_betfair_odds    5791     8949 14740   13.17 44.99    1.01    2.32
##                 year       0    14740 14740 2007.96  5.5  1998    2003   
##   median     p75    p100     hist
##     3.53    7.18  103.02 ▇▁▁▁▁▁▁▁
##     3.49    4.53   32.23 ▇▁▁▁▁▁▁▁
##     2.03    3.15   60.69 ▇▁▁▁▁▁▁▁
##     3.8     5.3   980    ▇▁▁▁▁▁▁▁
##     2.14    3.6   980    ▇▁▁▁▁▁▁▁
##     4.1     9    1000    ▇▁▁▁▁▁▁▁
##  2008    2013    2017    ▅▅▇▅▅▇▅▇

First thing to note is that we have a large chunk of our odds data missing. There is a smaller group of FIFA ratings missing and then after that, all our data is complete. We can also see from that plot that a lot of our numeric variables are skewed in their distribution. In particular, the betfair odds seem to have some really large outliers at the top end.

visdat::vis_dat(dat)

This plot shows us where the missing data occurs. You can see that a lot of the missing betfair data is also missing from the other odds data. One suspects that these are probably either early games in our data set or from pretty small competitions. The FIFA data doesn’t seem to have that same relationship.

Firstly, let’s just remove the odds columns - that’s way too many missing variables unfortunately.

dat <- dat %>%
  select(-avg_odds_draw, -avg_odds_away_win, -avg_odds_home_win)

Let’s see if we can find if the betfair or FIFA data that is missing has something common.

naniar::gg_miss_fct(dat, tournament_cat)

So our fifa data is mostly missing for World Cups. I suspect that this has something to do with when those ratings are done - typically out of season. Maybe the FIFA ratings aren’t updated as often enough.

dat %>%
  filter(tournament_cat == "World Cup") %>%
  select(date, team_1, team_2, team_1_fifa, team_2_fifa)
## # A tibble: 256 x 5
##    date       team_1    team_2       team_1_fifa team_2_fifa
##    <date>     <chr>     <chr>              <int>       <int>
##  1 2002-05-31 France    Senegal              802         599
##  2 2002-06-01 Germany   Saudi Arabia          NA          NA
##  3 2002-06-01 Ireland   Cameroon              NA          NA
##  4 2002-06-01 Uruguay   Denmark               NA          NA
##  5 2002-06-02 Argentina Nigeria               NA          NA
##  6 2002-06-02 England   Sweden                NA          NA
##  7 2002-06-02 Paraguay  South Africa          NA          NA
##  8 2002-06-02 Spain     Slovenia              NA          NA
##  9 2002-06-03 Brazil    Turkey                NA          NA
## 10 2002-06-03 Croatia   Mexico                NA          NA
## # ... with 246 more rows

Let’s pick a date and take a look. I’m going to look through the 2010 World Cup for Australia.

dat %>%
  filter((team_1 == "Australia" | team_2 == "Australia") & year == 2010) %>%
  select(date, team_1, team_2, team_1_fifa, team_2_fifa, tournament)
## # A tibble: 16 x 6
##    date       team_1      team_2      team_1_fifa team_2_fifa tournament  
##    <date>     <chr>       <chr>             <int>       <int> <chr>       
##  1 2010-01-06 Kuwait      Australia            NA          NA Asian Cup 2…
##  2 2010-03-03 Australia   Indonesia           867         155 Asian Cup 2…
##  3 2010-03-03 Australia   Indonesia           867         155 Asian Cup 2…
##  4 2010-03-03 Australia   Indonesia           898         155 Asian Cup 2…
##  5 2010-03-03 Australia   Indonesia           898         155 Asian Cup 2…
##  6 2010-05-24 Australia   New Zealand         886         410 Friendly    
##  7 2010-06-01 Australia   Denmark              NA          NA Friendly    
##  8 2010-06-05 Australia   USA                  NA          NA Friendly    
##  9 2010-06-13 Germany     Australia            NA          NA World Cup 2…
## 10 2010-06-19 Ghana       Australia            NA          NA World Cup 2…
## 11 2010-06-23 Australia   Serbia               NA          NA World Cup 2…
## 12 2010-08-11 Slovenia    Australia           917         911 Friendly    
## 13 2010-09-03 Switzerland Australia           882         874 Friendly    
## 14 2010-09-07 Poland      Australia           504         874 Friendly    
## 15 2010-10-09 Australia   Paraguay            862         840 Friendly    
## 16 2010-11-17 Egypt       Australia          1047         853 Friendly
fifa_dat <- read_csv(here::here("projects", "world-cup-2018", "fifa_rank_history.csv"))
fifa_dat %>%
  filter(Team == "Australia" & year(Date)  == 2010)
## # A tibble: 11 x 4
##    Date       Team        Pts Conference_id
##    <date>     <chr>     <int>         <int>
##  1 2010-02-03 Australia   857         25998
##  2 2010-03-03 Australia   867         25998
##  3 2010-03-31 Australia   898         25998
##  4 2010-04-28 Australia   883         25998
##  5 2010-05-26 Australia   886         25998
##  6 2010-07-14 Australia   911         25998
##  7 2010-08-11 Australia   911         25998
##  8 2010-09-15 Australia   874         25998
##  9 2010-10-20 Australia   862         25998
## 10 2010-11-17 Australia   853         25998
## 11 2010-12-15 Australia   816         25998

So - there was no rating update during the month of the World Cup. I think the best bet here is to simply move the data forward (i.e. take the rating from the month beforehand). That seems like a safe thing to do in general here.

# Make fifa data long again and then fill
fifa <- dat %>%
  gather(team, team_val, team_1:team_2) %>%
  select(date, match_id, team, team_val, team_1_fifa, team_2_fifa) %>%
  mutate(fifa_points = ifelse(team == "team_1", team_1_fifa, team_2_fifa)) %>%
  group_by(team_val) %>%
  arrange(team_val, date) %>%
  fill(fifa_points, .direction = "down") %>%
  fill(fifa_points, .direction = "up") %>%
  select(-team_1_fifa, -team_2_fifa, -team)

# Join back in with dat
dat <- dat %>%
  select(-team_1_fifa, -team_2_fifa) %>%
  left_join(fifa, by = c("date", "match_id", "team_1" = "team_val")) %>%
  rename(team_1_fifa = fifa_points) %>%
  left_join(fifa, by = c("date", "match_id","team_2" = "team_val")) %>%
  rename(team_2_fifa = fifa_points)
  
summary(dat)
##       date               team_1             team_2         
##  Min.   :1998-07-15   Length:14740       Length:14740      
##  1st Qu.:2003-10-12   Class :character   Class :character  
##  Median :2008-06-22   Mode  :character   Mode  :character  
##  Mean   :2008-06-20                                        
##  3rd Qu.:2013-02-06                                        
##  Max.   :2017-11-15                                        
##                                                            
##   team_1_goals     team_2_goals    tournament        is_team_1_home 
##  Min.   : 0.000   Min.   : 0.00   Length:14740       Mode :logical  
##  1st Qu.: 0.000   1st Qu.: 0.00   Class :character   FALSE:2814     
##  Median : 1.000   Median : 1.00   Mode  :character   TRUE :11926    
##  Mean   : 1.603   Mean   : 1.05                                     
##  3rd Qu.: 2.000   3rd Qu.: 2.00                                     
##  Max.   :31.000   Max.   :15.00                                     
##                                                                     
##  is_team_2_home  is_neutral      team_1_betfair_odds draw_betfair_odds
##  Mode :logical   Mode :logical   Min.   :  1.010     Min.   :  1.010  
##  FALSE:14740     FALSE:11946     1st Qu.:  1.460     1st Qu.:  3.350  
##                  TRUE :2794      Median :  2.140     Median :  3.800  
##                                  Mean   :  5.887     Mean   :  6.645  
##                                  3rd Qu.:  3.600     3rd Qu.:  5.300  
##                                  Max.   :980.000     Max.   :980.000  
##                                  NA's   :5791        NA's   :5791     
##  team_2_betfair_odds      year          month           match_id    
##  Min.   :   1.01     Min.   :1998   Min.   : 1.000   Min.   :    1  
##  1st Qu.:   2.32     1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 3686  
##  Median :   4.10     Median :2008   Median : 7.000   Median : 7370  
##  Mean   :  13.17     Mean   :2008   Mean   : 6.725   Mean   : 7370  
##  3rd Qu.:   9.00     3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:11055  
##  Max.   :1000.00     Max.   :2017   Max.   :12.000   Max.   :14740  
##  NA's   :5791                                                       
##   result                               tournament_cat  team_1_fifa  
##  draw:3565   Friendly                         :6636   Min.   :   0  
##  lose:4053   World Cup Qualifiers             :3987   1st Qu.: 341  
##  win :7122   Euro Qualifiers                  :1258   Median : 519  
##              African Cup of Nations Qualifiers: 895   Mean   : 540  
##              Asian Cup Qualifiers             : 458   3rd Qu.: 689  
##              African Cup of Nations           : 317   Max.   :1887  
##              (Other)                          :1189   NA's   :237   
##   team_2_fifa  
##  Min.   :   0  
##  1st Qu.: 333  
##  Median : 510  
##  Mean   : 528  
##  3rd Qu.: 670  
##  Max.   :1920  
##  NA's   :329

So we’ve fixed up most of the FIFA data.

Onto the betfair data. It looked like the most missing data was coming from Oceania Nations Cup, Friendlies and African Cup of Nations.

What about over time? We’d suspect older games to be less likely to have odds data.

dat %>%
  ggplot(aes(x = date, fill = is.na(team_1_betfair_odds))) +
  geom_histogram(binwidth = 30) + 
  theme_minimal()

Yep! So - the Betfair data isn’t just randomly missing. It is mostly likely to be missing for older data and most likely to be missing for certain less popular tournaments and friendlies. I’ll probably remove that data later but I need those games to do some feature engineering, like ELO, so I’ll leave it for now.

Feature Engineering

ELO

For my AFL ELO ratings, I’ve recently started using a fantastic R package called elo. You can find details of that package here and my weekly R code here. Again - not knowing how to best implement an ELO model in Football specifically, I’ve left it pretty simple. I’m following loosely the method employed at eloratings.net.

Firstly, I need to classify the tournaments into groupings so that I can assign different k values for them. Namely,

60 for World Cup finals;
50 for continental championship finals and major intercontinental tournaments;
40 for World Cup and continental qualifiers and major tournaments;
30 for all other tournaments;
20 for friendly matches.

I also need to adjust the k value based on the goal difference and give a 100 point HGA to any team playing at home.

# Function to clasify tournament
find_k <- function(x){
  case_when(
    x == "World Cup" ~ 60,
    x %in% c("Euro", "Copa America", "African Cup of Nations", 
             "Asian Cup", "Gold Cup", "Confederations Cup") ~ 50,
    str_detect(x, "Qualifiers") ~ 40,
    x == "Friendly" ~ 20,
    TRUE ~ 30
  )
}

# Function to adjust based on goal difference
k_adjust <- function(x){
  case_when(
    x == 2 ~ 1.5,
    x == 3 ~ 1.75,
    x > 3 ~ 1.75 + ((x - 3)/8),
    TRUE ~ 1
  )
}

Now, we can go and apply those functions to our data and perform the elo.run function.

dat <- dat %>%
  ungroup() %>%
  mutate(game_id = row_number()) %>%
  group_by(team_1) %>%
  mutate(
    team_1_result = case_when(
      team_1_goals < team_2_goals ~ 0,
      team_1_goals > team_2_goals ~ 1,
      TRUE ~ 0.5),
    team_2_result = 1 - team_1_result,
    team_1_goals_against = team_2_goals,
    team_2_goals_against = team_1_goals)

# Clasify tournemt and calculate k value
dat <- dat %>%
  mutate(
         k_val = find_k(tournament_cat),
         hga = ifelse(is_team_1_home, 100, 0)) %>%
  arrange(date)

# Run ELO
elo.data <- elo.run(
  team_1_result ~
    adjust(team_1, hga) + 
    team_2  +
    k(k_adjust(abs(team_1_goals - team_2_goals)) * k_val),
  data = dat
)

dat <- dat %>%
  bind_cols(as.data.frame(elo.data)) %>%
  mutate(team_1_elo = elo.A - update,
         team_2_elo = elo.B + update, 
         team_1_elo_prob = p.A) %>%
  select(date:k_val, team_1_elo:team_1_elo_prob)

Let’s do a quick exploration of those ELO values.

dat %>%
  ggplot(aes(team_1_elo)) +
  geom_histogram()

final.elos(elo.data) %>%
  data.frame(elo = .) %>%
  rownames_to_column("Team") %>%
  filter(Team %in% world_cup$team_1) %>%
  arrange(desc(elo))
##              Team      elo
## 1          Brazil 2076.715
## 2         Germany 2074.492
## 3           Spain 1989.650
## 4        Portugal 1963.858
## 5       Argentina 1956.581
## 6          France 1942.912
## 7        Colombia 1908.234
## 8            Peru 1893.122
## 9         Belgium 1886.362
## 10        England 1881.468
## 11           Iran 1864.434
## 12         Mexico 1844.136
## 13        Senegal 1831.587
## 14        Uruguay 1827.816
## 15          Japan 1826.275
## 16        Croatia 1821.667
## 17    Switzerland 1814.201
## 18      Australia 1793.659
## 19         Poland 1791.995
## 20     Costa Rica 1783.093
## 21        Iceland 1781.308
## 22        Denmark 1773.105
## 23          Egypt 1769.544
## 24         Sweden 1765.650
## 25 Korea Republic 1758.689
## 26        Nigeria 1756.263
## 27        Morocco 1752.234
## 28         Panama 1725.079
## 29         Serbia 1714.713
## 30        Tunisia 1705.453
## 31   Saudi Arabia 1700.544
## 32         Russia 1684.451

Team history

For these next few features, I actually need to make my data set ‘long’. This allows me to calculate some historical information such as the number of games or wins over a period of team, if they won a cup or other features that might be interesting.

team_dat <- dat %>%
  ungroup() %>%
  select(date, game_id, tournament_cat, contains("team")) %>%
  gather(variable, value, -date, -game_id, -tournament_cat) %>%
  separate(variable, into = c("team", "number", "measure"), extra = "merge") %>%
  mutate(measure = ifelse(is.na(measure), "team", measure)) %>%
  spread(measure, value) %>%
  select(date, game_id, tournament_cat, team, result, goals, goals_against) %>%
  na.omit()

head(team_dat)
## # A tibble: 6 x 7
##   date       game_id tournament_cat team        result goals goals_against
##   <date>       <int> <fct>          <chr>       <chr>  <chr> <chr>        
## 1 1998-07-15       1 Friendly       Ukraine     0      1     2            
## 2 1998-07-18       2 Friendly       Malawi      1      1     0            
## 3 1998-07-18       3 Friendly       Uganda      0.5    3     3            
## 4 1998-07-20       4 Friendly       Malawi      0.5    0     0            
## 5 1998-07-22       5 Friendly       Senegal     1      1     0            
## 6 1998-07-24       6 Friendly       Burkina Fa… 1      1     0
nrow(team_dat)
## [1] 29480
nrow(dat) * 2
## [1] 29480

OK - that was tricky. Oh well - we got it!

Values for Last 10 games

One thing that I’ve seen in a few prediction models is looking at the results and scores for a team over their last n games. In theory, some of this should be reflected in our ELO rating. But nonetheless, our ELO rating isn’t a very sophisticated one (it looks at just results rather than scores), so I can add this in. Since I’m going to put it into some machine learning models, we can trim these out if they aren’t very useful.

I’ve picked the last 10 games based on a couple examples I found here and

First - it makes sense to see how many games a team has won in it’s last 10 games. I’ll do this as their average value of result where a win is assigned 1, a draw is 0.5 and a loss is 0. I’ll also calculate the average goals for and against in that period.

# Specify variables
variables <- vars(goals, result, goals_against)
team_dat <- team_dat %>%
  group_by(team) %>%
  arrange(team, date) %>%
  mutate_at(variables, as.numeric) %>%
  mutate(result_game = result) %>%
  mutate_at(variables,
            .funs = rollmean, k = 10, na.rm = T, fill = NA, partial = T, align = "right") %>%
  rename_at(variables, funs(paste0("last_10_", .)))

World Cup Appearances

Being at a World Cup multiple times generally, you’d think, would be a good predictor of performing well at world cups. In particular, playing more games would seem to indicate that you’ve done well at world cups.

team_dat <- team_dat %>%
  group_by(team) %>%
  mutate(world_cup_wins = cumsum(tournament_cat == "World Cup" & result_game == 1),
         world_cup_games = cumsum(tournament_cat == "World Cup")) %>%
  select(-tournament_cat, -result_game, -date)

dat <- dat %>%
  left_join(team_dat, by = c("game_id", "team_1" = "team")) %>%
  left_join(team_dat, by = c("game_id", "team_2" = "team"), suffix = c("_team_1", "_team_2"))

Test Data Set

For our final data, we need all of these variables to exist.

# Add ELO
final_elo <- final.elos(elo.data) %>%
  data.frame(elo = .) %>%
  rownames_to_column("Team")

world_cup <- world_cup %>%
  left_join(final_elo, by = c("team_1" = "Team")) %>%
  rename(team_1_elo = elo) %>%
  left_join(final_elo, by = c("team_2" = "Team")) %>%
  rename(team_2_elo = elo) %>%
  mutate(hga = ifelse(is_team_1_home, 100, 0),
         team_1_elo_prob = elo.prob(team_1_elo, team_2_elo, adjust.A = hga),
         tournament_cat = "World Cup",
         k_val = find_k(tournament_cat))

  
# Add the team history data - take last value
team_dat_last <- team_dat %>%
  group_by(team) %>%
  filter(row_number() == n())

world_cup <- world_cup %>%
  left_join(team_dat_last, by = c("team_1" = "team")) %>%
  left_join(team_dat_last, by = c("team_2" = "team"), suffix = c("_team_1", "_team_2"))

Save

Let’s save our data.

write_csv(dat, here::here("projects", "world-cup-2018", "combined-data-cleaned.csv"))
write_csv(world_cup, here::here("projects", "world-cup-2018", "world-cup-cleaned.csv"))

This is part of a series of posts on the World Cup Betfair datathon. See the links to others below.

Project Page
Part 1 - Intro
Part 2 - Data Acquisition
Part 3 - Data Exploration and Feature Engineering
Part 4 - Models (coming soon)
Part 5 - Review (coming soon)

comments powered by Disqus