The correct way to calculate xG difference by gamestate

Here's some fake data showing how shot logs might look for a soccer match.

library(tibble)
library(dplyr)

df <- tibble::tibble(
  shot_id = seq.int(1, 8),
  minute = c(7, 13, 25, 30, 41, 44, 58, 78), ## doesn't matter, just for illustrative purposes
  side = c('home', 'home', 'away', 'home', 'away', 'away', 'home', 'home'),
  g = as.integer(c(0, 1, 0, 0, 1, 1, 0, 0)),
  xg = c(0.2, 0.6, 0.1, 0.3, 0.4, 0.1, 0.5, 0.1)
)
df
#> # A tibble: 8 × 5
#>   shot_id minute side      g    xg
#>     <int>  <dbl> <chr> <int> <dbl>
#> 1       1      7 home      0   0.2
#> 2       2     13 home      1   0.6
#> 3       3     25 away      0   0.1
#> 4       4     30 home      0   0.3
#> 5       5     41 away      1   0.4
#> 6       6     44 away      1   0.1
#> 7       7     58 home      0   0.5
#> 8       8     78 home      0   0.1

Here are some simple transformations to calculate the gamestate, i.e. the home side's goal difference (home goals minus away goals) after each shot.

calc_df <- df |> 
  dplyr::mutate(
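    home_g = ifelse(side == 'home', g, 0L),
    away_g = ifelse(side == 'away', g, 0L),
    ## split xG by side too, since the summaries below use home_xg and away_xg
    home_xg = ifelse(side == 'home', xg, 0),
    away_xg = ifelse(side == 'away', xg, 0)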
  ) |> 
  dplyr::mutate(
    dplyr::across(
      c(home_g, away_g),
      list(cumu = \(.x) cumsum(.x)),
      .names = 'cumu_{.col}'
    )
  ) |> 
  dplyr::mutate(
    home_gamestate = cumu_home_g - cumu_away_g,
    home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L)
  )
calc_df |> 
  dplyr::select(
    shot_id,
    cumu_home_g,
    cumu_away_g,
    home_gamestate,
    home_pre_gamestate
  )
#> # A tibble: 8 × 5
#>   shot_id cumu_home_g cumu_away_g home_gamestate home_pre_gamestate
#>     <int>       <int>       <int>          <int>              <int>
#> 1       1           0           0              0                  0
#> 2       2           1           0              1                  0
#> 3       3           1           0              1                  1
#> 4       4           1           0              1                  1
#> 5       5           1           1              0                  1
#> 6       6           1           2             -1                  0
#> 7       7           1           2             -1                 -1
#> 8       8           1           2             -1                 -1
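
The pipeline above relies on the rows already being in chronological order and coming from a single match. As a minimal sketch of how the same steps might carry over to several matches, the block below assumes a hypothetical `match_id` column (not part of the data above) and a hypothetical helper, `add_gamestate()`, that sorts and groups by match before taking the cumulative sums and the lag. Sorting by `minute` alone is also an assumption; real shot logs may need a finer-grained ordering for shots in the same minute.

```r
library(tibble)
library(dplyr)

## Two made-up matches. `match_id` is a hypothetical column added for this sketch.
shots <- tibble::tibble(
  match_id = c(1L, 1L, 1L, 2L, 2L),
  minute = c(10, 35, 70, 5, 60),
  side = c('home', 'away', 'home', 'away', 'home'),
  g = c(1L, 0L, 1L, 1L, 0L),
  xg = c(0.3, 0.1, 0.5, 0.2, 0.4)
)

## Hypothetical helper: same logic as above, but ordered and grouped per match.
add_gamestate <- function(data) {
  data |> 
    dplyr::arrange(match_id, minute) |> 
    dplyr::group_by(match_id) |> 
    dplyr::mutate(
      home_gamestate = cumsum(ifelse(side == 'home', g, 0L)) - 
        cumsum(ifelse(side == 'away', g, 0L)),
      ## within each match, the first shot starts from a level score
      home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L)
    ) |> 
    dplyr::ungroup()
}

multi_df <- add_gamestate(shots)
multi_df
```

Without the grouping, the first shot of the second match would inherit the final gamestate of the first match.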

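Note that the code above also calculates a "pre"-shot gamestate, home_pre_gamestate, which is the gamestate just before each shot was taken. This matters because the xG difference by gamestate (home_xgd) can come out quite differently depending on whether a shot's xG is counted towards the post-shot gamestate or the pre-shot gamestate. The two only disagree for shots that are scored, since a goal is the only thing that changes the gamestate.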
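
To make the difference concrete, here is a small sketch that lists the shots whose xG would land in a different bucket under the two definitions. It rebuilds the fake shot log so that it runs on its own; the names `shots` and `moved_shots` are just illustrative.

```r
library(tibble)
library(dplyr)

## Same fake shot log as above, rebuilt so that this snippet runs on its own.
shots <- tibble::tibble(
  shot_id = 1:8,
  side = c('home', 'home', 'away', 'home', 'away', 'away', 'home', 'home'),
  g = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L),
  xg = c(0.2, 0.6, 0.1, 0.3, 0.4, 0.1, 0.5, 0.1)
) |> 
  dplyr::mutate(
    home_gamestate = cumsum(ifelse(side == 'home', g, 0L)) - 
      cumsum(ifelse(side == 'away', g, 0L)),
    home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L)
  )

## Shots whose pre-shot and post-shot gamestates differ.
moved_shots <- shots |> 
  dplyr::filter(home_gamestate != home_pre_gamestate)
moved_shots
```

Only shots 2, 5, and 6 are returned, and all three are goals. Every other shot is assigned to the same gamestate either way.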

The summary below shows gamestate xG difference when a given shot's xG is counted towards the post-shot gamestate. For example, the xG of the first goal scored in a match is counted towards the +1 (or -1) gamestate.

calc_df |> 
  dplyr::group_by(home_gamestate) |> 
  dplyr::summarize(
    shots = dplyr::n(),
    home_xgd = sum(home_xg - away_xg)
  )
#> # A tibble: 3 × 3
#>   home_gamestate shots home_xgd
#>            <int> <int>    <dbl>
#> 1             -1     3      0.5
#> 2              0     2     -0.2
#> 3              1     3      0.8

Here is the same summary, but with each shot's xG counted towards the pre-shot gamestate. In this case, the first shot of a match is always counted towards the 0 gamestate, even if it is scored.

calc_df |> 
  dplyr::group_by(home_pre_gamestate) |> 
  dplyr::summarize(
    shots = dplyr::n(),
    home_xgd = sum(home_xg - away_xg)
  )
#> # A tibble: 3 × 3
#>   home_pre_gamestate shots home_xgd
#>                <int> <int>    <dbl>
#> 1                 -1     2      0.6
#> 2                  0     3      0.7
#> 3                  1     3     -0.2
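
For a side-by-side view, the two summaries can be joined on gamestate. This is only a sketch: `summarize_xgd()` is a hypothetical helper, and the signed `home_xgd` column (positive for home shots, negative for away shots) is a shortcut introduced here rather than the `home_xg - away_xg` form used above.

```r
library(tibble)
library(dplyr)

## Same fake shot log as above, with a signed xG column from the home side's view.
shots <- tibble::tibble(
  side = c('home', 'home', 'away', 'home', 'away', 'away', 'home', 'home'),
  g = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L),
  xg = c(0.2, 0.6, 0.1, 0.3, 0.4, 0.1, 0.5, 0.1)
) |> 
  dplyr::mutate(
    home_gamestate = cumsum(ifelse(side == 'home', g, 0L)) - 
      cumsum(ifelse(side == 'away', g, 0L)),
    home_pre_gamestate = dplyr::lag(home_gamestate, default = 0L),
    home_xgd = ifelse(side == 'home', xg, -xg)
  )

## Hypothetical helper: sum the signed xG within whichever gamestate column is passed.
summarize_xgd <- function(data, gamestate_col) {
  data |> 
    dplyr::group_by(gamestate = {{ gamestate_col }}) |> 
    dplyr::summarize(home_xgd = sum(home_xgd), .groups = 'drop')
}

comparison <- dplyr::inner_join(
  summarize_xgd(shots, home_gamestate) |> 
    dplyr::rename(post_shot_xgd = home_xgd),
  summarize_xgd(shots, home_pre_gamestate) |> 
    dplyr::rename(pre_shot_xgd = home_xgd),
  by = 'gamestate'
)
comparison
```

Both columns add up to the same match total of 1.1, so the choice of definition only changes how that total is split across gamestates.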

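I prefer the second approach, which uses the pre-shot gamestate. Intuitively, I think this prevents a "leakage" of data into gamestate xGD calculations: a shot is taken under the scoreline that existed before it, so crediting a goal's xG to the gamestate that the goal itself created lets the outcome of the shot influence which bucket it lands in.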

tonyelhabr/gamestate-xgd.md