Here's some fake data showing how shot logs might look for a soccer match.
library(tibble)
library(dplyr)
df <- tibble::tibble(
  shot_id = seq.int(1, 8),
  minute = c(7, 13, 25, 30, 41, 44, 58, 78), ## doesn't matter, just for illustrative purposes
  side = c('home', 'home', 'away', 'home', 'away', 'away', 'home', 'home'),
  g = as.integer(c(0, 1, 0, 0, 1, 1, 0, 0)),
  xg = c(0.2, 0.6, 0.1, 0.3, 0.4, 0.1, 0.5, 0.1)
)
df
#> # A tibble: 8 × 5
#>   shot_id minute side      g    xg
#>     <int>  <dbl> <chr> <int> <dbl>
#> 1       1      7 home      0   0.2
#> 2       2     13 home      1   0.6
#> 3       3     25 away      0   0.1
#> 4       4     30 home      0   0.3
#> 5       5     41 away      1   0.4
#> 6       6     44 away      1   0.1
#> 7       7     58 home      0   0.5
#> 8       8     78 home      0   0.1
Here are some simple transformations to calculate the gamestate.
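calc_df <- df |>
  dplyr::mutate(
    # split goals and xG into home and away columns
    home_g = ifelse(side == 'home', g, 0L),
    away_g = ifelse(side == 'away', g, 0L),
    home_xg = ifelse(side == 'home', xg, 0),
    away_xg = ifelse(side == 'away', xg, 0)
  ) |>
  dplyr::mutate(
    # running score for each side
    dplyr::across(
      c(home_g, away_g),
      list(cumu = \(.x) cumsum(.x)),
      .names = 'cumu_{.col}'
    )
  ) |>
  dplyr::mutate(
    # scoreline from the home side's perspective, after and before each shot
    home_gamestate = cumu_home_g - cumu_away_g,
    home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L)
  )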
calc_df |>
  dplyr::select(
    shot_id,
    cumu_home_g,
    cumu_away_g,
    home_gamestate,
    home_pre_gamestate
  )
#> # A tibble: 8 × 5
#>   shot_id cumu_home_g cumu_away_g home_gamestate home_pre_gamestate
#>     <int>       <int>       <int>          <int>              <int>
#> 1       1           0           0              0                  0
#> 2       2           1           0              1                  0
#> 3       3           1           0              1                  1
#> 4       4           1           0              1                  1
#> 5       5           1           1              0                  1
#> 6       6           1           2             -1                  0
#> 7       7           1           2             -1                 -1
#> 8       8           1           2             -1                 -1
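The transformations above assume a single match. Here's a minimal sketch of how the same idea might extend to several matches, assuming a hypothetical match_id column (not part of the fake data above) and rows already ordered by time within each match. The running score and the lag are computed within each match, so the scoreline from one match doesn't carry over into the next. The signed running sum is a condensed stand-in for the separate home and away cumulative sums.

```r
library(tibble)
library(dplyr)

# Hypothetical two-match shot log; `match_id` is an assumed column
multi_df <- tibble::tibble(
  match_id = c(1L, 1L, 1L, 2L, 2L),
  side = c('home', 'away', 'home', 'away', 'home'),
  g = c(1L, 0L, 0L, 1L, 0L)
)

multi_calc_df <- multi_df |>
  dplyr::group_by(match_id) |>
  dplyr::mutate(
    # home goals count +1, away goals count -1, accumulated within a match
    home_gamestate = cumsum(ifelse(side == 'home', g, -g)),
    # the first shot of every match starts from a level scoreline
    home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L)
  ) |>
  dplyr::ungroup()

multi_calc_df
```

Without the group_by(), the first shot of the second match would inherit the final scoreline of the first match as its pre-shot gamestate.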
Note that we calculated a "pre"-shot gamestate, home_pre_gamestate, in the prior section. This is because we get fairly different xG differences (home_xgd) depending on whether a goal's xG is counted towards the post-shot gamestate or the pre-shot gamestate.
The summary below shows what gamestate xG difference looks like if we count a given shot's xG towards the post-shot gamestate. For example, the xG of the first shot scored in a match is counted towards the +1 (or -1) gamestate.
calc_df |>
  dplyr::group_by(home_gamestate) |>
  dplyr::summarize(
    shots = dplyr::n(),
    home_xgd = sum(home_xg - away_xg)
  )
#> # A tibble: 3 × 3
#>   home_gamestate shots home_xgd
#>            <int> <int>    <dbl>
#> 1             -1     3      0.5
#> 2              0     2     -0.2
#> 3              1     3      0.8
Now here's the same summary, but counting a given shot's xG towards the pre-shot gamestate. In this case, the first shot of a match is always counted towards the 0 gamestate.
calc_df |>
  dplyr::group_by(home_pre_gamestate) |>
  dplyr::summarize(
    shots = dplyr::n(),
    home_xgd = sum(home_xg - away_xg)
  )
#> # A tibble: 3 × 3
#>   home_pre_gamestate shots home_xgd
#>                <int> <int>    <dbl>
#> 1                 -1     2      0.6
#> 2                  0     3      0.7
#> 3                  1     3     -0.2
I prefer the second approach, using the pre-shot gamestate. Intuitively, a shot is taken under the scoreline that existed before it, so I think this prevents a "leakage" of the shot's own outcome into gamestate xGD calculations.
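To make the difference between the two approaches concrete, here's a small sketch that lists the shots whose pre-shot and post-shot gamestates differ. It rebuilds the fake shot log so that it runs on its own. The per-shot signed xG column and the moved_df name are just illustrative choices, and the signed running sum is a shortcut for the separate home and away cumulative sums.

```r
library(tibble)
library(dplyr)

# Same fake shot log as above (minute omitted since it isn't needed here)
df <- tibble::tibble(
  shot_id = seq.int(1, 8),
  side = c('home', 'home', 'away', 'home', 'away', 'away', 'home', 'home'),
  g = as.integer(c(0, 1, 0, 0, 1, 1, 0, 0)),
  xg = c(0.2, 0.6, 0.1, 0.3, 0.4, 0.1, 0.5, 0.1)
)

moved_df <- df |>
  dplyr::mutate(
    # xG from the home side's perspective: positive for home shots
    home_xgd = ifelse(side == 'home', xg, -xg),
    # running scoreline after each shot, then the scoreline before it
    home_gamestate = cumsum(ifelse(side == 'home', g, -g)),
    home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L)
  ) |>
  # keep only shots that land in a different bucket under the two approaches
  dplyr::filter(home_gamestate != home_pre_gamestate) |>
  dplyr::select(shot_id, side, g, home_xgd, home_pre_gamestate, home_gamestate)

moved_df
```

Only the three goals (shots 2, 5, and 6) change buckets. Every non-goal shot is attributed to the same gamestate under both approaches, so the whole gap between the two summaries comes from where each goal's own xG is counted.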