4th-Down Audit

Methodology

How we score every 4th-down decision and what the numbers don't capture.

The win-probability model

We trained a gradient-boosted classifier (XGBoost, 126 trees, depth 6) on every play from 2018-2023, holding out 2024 as a clean evaluation set. Features are the standard game-state signals: score differential, time remaining, yards from the end zone, down and distance, timeouts, quarter, home/away, and whether the offense received the second-half kickoff.
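The feature set above flattens naturally into the numeric vector a tree model consumes. A minimal sketch, with illustrative field names (not the repo's actual schema):

```python
from dataclasses import dataclass, astuple

@dataclass
class GameState:
    """One pre-snap game state. Field names are illustrative only."""
    score_diff: int          # offense score minus defense score
    seconds_left: int        # in the game
    yardline: int            # yards from the opponent's end zone
    down: int
    ydstogo: int
    off_timeouts: int
    def_timeouts: int
    quarter: int
    is_home: bool
    received_2h_kickoff: bool

def to_features(state: GameState) -> list[float]:
    """Flatten the game state into a plain numeric feature vector."""
    return [float(x) for x in astuple(state)]

# 4th-and-2 from the opponent's 35, down 3, 7:00 left in Q4.
state = GameState(-3, 420, 35, 4, 2, 2, 3, 4, True, False)
vec = to_features(state)
```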

Plays trained: ~300k
2024 log-loss: 0.465
nflfastR log-loss: 0.463
AUC (held-out 2024): 0.848

The model lands within 0.002 log-loss (about 0.4%) of nflfastR's bundled WP model on the same held-out 2024 season. That's the sanity check we wanted: we're comparable to the field-standard model, so the decision rankings below sit on a credible base.
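The comparison metric is just the mean negative log-likelihood of each play's pre-snap win probability. A minimal sketch of the computation:

```python
import math

def log_loss(y_true: list[int], p_pred: list[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of predicted win probabilities.

    y_true: 1 if the offense eventually won, else 0.
    p_pred: the model's pre-snap P(offense wins) for each play.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A coin-flip model scores ln 2 ≈ 0.693; both WP models beat that handily.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```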

Scoring a 4th-down decision

For every 4th down in 2018-2024 with a real play (no penalty aborts, no kneeldowns), we compute three counterfactual expected win probabilities, all from the offense's pre-snap perspective:

E[WP | go] = P(convert) · WP(after gain) + (1 − P(convert)) · [1 − WP(opponent ball at our spot)]
E[WP | punt] = 1 − WP(opponent at expected post-punt yardline)
E[WP | FG] = P(make) · [1 − WP(opponent at kickoff, score +3)] + (1 − P(make)) · [1 − WP(opponent at spot of kick)]

The model's recommendation is whichever option maximizes E[WP]. The cost of the coach's actual decision is the gap between the best option and the chosen option. Costs are non-negative by construction.
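The three counterfactuals and the cost calculation translate directly to code. A sketch with made-up WP inputs (every number below is illustrative, not pipeline output):

```python
def e_wp_go(p_convert: float, wp_after_gain: float, wp_opp_at_spot: float) -> float:
    # E[WP | go] = P(convert)·WP(after gain) + (1 − P(convert))·[1 − WP(opponent at our spot)]
    return p_convert * wp_after_gain + (1 - p_convert) * (1 - wp_opp_at_spot)

def e_wp_punt(wp_opp_post_punt: float) -> float:
    # E[WP | punt] = 1 − WP(opponent at expected post-punt yardline)
    return 1 - wp_opp_post_punt

def e_wp_fg(p_make: float, wp_opp_after_make: float, wp_opp_at_spot_of_kick: float) -> float:
    # E[WP | FG] = P(make)·[1 − WP(opp at kickoff, +3)] + (1 − P(make))·[1 − WP(opp at spot)]
    return p_make * (1 - wp_opp_after_make) + (1 - p_make) * (1 - wp_opp_at_spot_of_kick)

def score_decision(option_wps: dict[str, float], chosen: str) -> tuple[str, float]:
    """Recommendation is the argmax option; cost is best minus chosen (≥ 0)."""
    best = max(option_wps, key=option_wps.get)
    return best, option_wps[best] - option_wps[chosen]

# Illustrative inputs for one 4th down:
options = {
    "go":   e_wp_go(0.55, 0.48, 0.60),   # 0.444
    "punt": e_wp_punt(0.62),             # 0.380
    "fg":   e_wp_fg(0.80, 0.70, 0.58),   # 0.324
}
best, cost = score_decision(options, chosen="punt")
```

Choosing the punt here costs 0.444 − 0.380 ≈ 6.4 percentage points of win probability relative to going for it.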

P(convert) is a logistic regression on yardline and yards-to-go, fit only on real go-for-it attempts (~53% base rate). P(make FG) is a logistic regression on kick distance (~85% base rate; attempts capped at 65 yards). Expected net punt yardage is an empirical mean by 5-yard field-position bucket (42.4 yards overall).
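The conversion sub-model is just the logistic functional form over those two inputs. A sketch with illustrative coefficients (the fitted values live in the pipeline, not here):

```python
import math

def p_convert(ydstogo: float, yardline: float,
              b0: float = 1.1, b_ytg: float = -0.32, b_yl: float = 0.002) -> float:
    """Logistic P(convert | yards-to-go, yardline).

    Coefficients are illustrative placeholders, chosen only so that
    longer to-go distances reduce the probability, as fitted models do.
    """
    z = b0 + b_ytg * ydstogo + b_yl * yardline
    return 1.0 / (1.0 + math.exp(-z))
```

With these placeholder coefficients, 4th-and-1 comes out around 70% and 4th-and-10 near 12%, which is the right shape even if the exact values differ from the fitted model.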

Confidence intervals

A single coach makes ~100-150 4th-down decisions per season. The variance in cumulative WP loss across that sample is real, so the leaderboard shows 90% bootstrap CIs: we resample each coach's plays with replacement 1,500 times and take the 5th and 95th percentiles of the resulting totals.

Practically, this means the gap between the 5th and 25th ranked coach is often inside the noise band, even though the gap between the 1st and 30th is real. Trust the bars more than the rank.

What the model doesn't know

  • Opponent strength. A 4th-and-2 against the 2024 Lions defense is a different proposition than against the Patriots' defense. Our P(convert) doesn't see the opponent. That's probably the biggest hole.
  • Weather. Cold-game punts come up short. Crosswind FGs miss. We don't model wind, temperature, or precipitation. The punt bucketing absorbs some of this on average, but not for a specific game.
  • Personnel. Down to your backup kicker after a hammy pull? Model says try the 52-yarder anyway, because the league-average kicker makes it.
  • Game script and clock leverage. "Up 14 with 4 minutes left, on 4th-and-1 from the opponent's 35" is a different conversation than the same situation in Q2. The WP model captures most of this through its time and score features, but not all of it.
  • Decision quality vs. outcome quality. We never penalize a coach for the outcome of a play, only for the option they picked. A goal-line stuff on 4th-and-goal counts the same as a 5-yard scamper if the model agreed with the decision. This is intentional: it's a decision audit, not a game review.

Reproducing the pipeline

All code is open source. From the backend folder, run scripts/fetch.py, then scripts/train.py, then scripts/score_season.py. Data comes from the public nflverse releases.