Model · Cycling BT Rankings

Section 1

Introduction and Motivation

Professional cycling poses a unique ranking challenge. Unlike team sports with balanced schedules, a cyclist in a given season may start anywhere from a handful to over a hundred races, against wildly varying fields, across radically different terrain types. A sprinter competes almost exclusively against other sprinters in flat finishes; a climber is mostly absent from those races. A single global Elo score conflates apples and mountains.

Three further complications make the problem harder:

Non-stationarity. Riders improve, peak, decline, get injured, change team, retire and return. A race five years ago is less informative about today's rider than last month's race.
Sparse, unbalanced comparisons. In a given race, only the riders who start are compared. A field of \(n\) starters (typically 100–200) yields \(\binom{n}{2}\) pairwise comparisons, but the comparison graph across the whole season is far from complete.
Heterogeneous importance. A Grand Tour stage counts very differently from a mid-week one-day race, both in prestige and in information content.

This project addresses all three with a time-aware, importance-weighted, multi-track partial-pairing Bradley–Terry model, using the openskill library's BradleyTerryPart backend, and exposes the results through an interactive web dashboard.

Section 2

The Bradley–Terry Model

Core probabilistic model

Given two riders \(i\) and \(j\) with scalar strengths \(\lambda_i, \lambda_j > 0\), the classical Bradley–Terry model assigns:

Classical BT

\[ P(i \text{ beats } j) = \frac{\lambda_i}{\lambda_i + \lambda_j} \]
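
As a minimal sketch of the classical formula, the snippet below evaluates it directly and on the log-strength scale \(\theta = \log \lambda\), where it becomes a logistic function of the strength difference. The function names are illustrative and not part of the project code.

```python
import math


def bt_win_probability(lambda_i: float, lambda_j: float) -> float:
    """Classical Bradley-Terry: P(i beats j) for positive strengths."""
    if lambda_i <= 0 or lambda_j <= 0:
        raise ValueError("strengths must be positive")
    return lambda_i / (lambda_i + lambda_j)


def bt_win_probability_log(theta_i: float, theta_j: float) -> float:
    """Same model with theta = log(lambda): a logistic in theta_i - theta_j."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


# Hypothetical strengths: rider i is three times as strong as rider j.
p_direct = bt_win_probability(3.0, 1.0)
p_logistic = bt_win_probability_log(math.log(3.0), math.log(1.0))
```

Both calls describe the same probability; the log-scale form is the one that most closely resembles the additive skill scale used in the rest of this document.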

In the Bayesian (TrueSkill / openskill) formulation each rider carries a Gaussian belief over their true skill \(s_i\): \( s_i \sim \mathcal{N}(\mu_i,\, \sigma_i^2) \). \(\mu_i\) is the current best estimate of skill; \(\sigma_i\) encodes uncertainty. A fresh rider has high \(\sigma\); a rider with thousands of observed head-to-heads has low \(\sigma\).

The Thurstone–Mosteller win-probability

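The probability that \(i\) beats \(j\) is obtained by marginalising over both latent skills and the per-performance noise. Each rider's performance noise has variance \(\beta^2\), so the difference of the two noises is \(\varepsilon \sim \mathcal{N}(0, 2\beta^2)\):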

Win probability — Eq. 1

\[ P(i \text{ beats } j) = \Phi\!\left( \frac{\mu_i - \mu_j}{\sqrt{2\beta^2 + \sigma_i^2 + \sigma_j^2}} \right) \]

where \(\Phi\) is the standard normal CDF and \(\beta = \mu_0 / 6 = 25/6 \approx 4.17\) is the per-race performance standard deviation.

Remark — safe mode and win probability

The formula uses raw \(\mu_i\) in the numerator — not a penalty-adjusted version. The uncertainty already enters through \(\sigma_i^2\) in the denominator: a high-\(\sigma\) rider pulls the probability toward \(0.5\) regardless of their \(\mu\). Applying an additional \(\mu - 2\sigma\) discount to both sides would double-count uncertainty. Win probability is therefore independent of the safe/normal toggle.
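
A short sketch of Eq. 1 makes the remark concrete. It uses the exact `math.erf` from the Python standard library rather than the browser-side approximation, and the two example riders are hypothetical values chosen only to show the effect of \(\sigma\).

```python
import math

MU0 = 25.0        # prior mean skill, as stated in the text
BETA = MU0 / 6.0  # per-race performance std. dev., as stated in the text


def normal_cdf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def win_probability(mu_i: float, sigma_i: float,
                    mu_j: float, sigma_j: float,
                    beta: float = BETA) -> float:
    """Eq. 1: raw mu in the numerator, all uncertainty in the denominator."""
    denom = math.sqrt(2.0 * beta ** 2 + sigma_i ** 2 + sigma_j ** 2)
    return normal_cdf((mu_i - mu_j) / denom)


# Same mu gap of 5 points, but the favourite is far less certain in the second case.
p_certain = win_probability(30.0, 1.0, 25.0, 1.0)
p_uncertain = win_probability(30.0, 8.0, 25.0, 1.0)
```

With the same gap in \(\mu\), the high-\(\sigma\) case yields a probability closer to 0.5, which is the behaviour the remark describes.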

Section 3

Time-Awareness: Sigma Inflation

The stationarity problem

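A standard BT update treats all historical results as equally informative. Two mechanisms relax this assumption: the first discounts the past, and the second lets races differ in how strongly they move ratings.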

Sigma inflation. Between races, each rider's \(\sigma_i\) is widened proportionally to elapsed calendar time:

Brownian drift
\[ \sigma_i(t + \Delta t) \leftarrow \sqrt{\sigma_i(t)^2 + \tau^2 \cdot \Delta t} \]
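where \(\tau\) is the drift rate and \(\Delta t\) is the elapsed time in days. Because \(\tau^2 \Delta t\) is a variance, \(\tau\) has units of skill per square-root day. This is the standard Brownian motion model of non-stationarity.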
Race importance weighting. Each race updates skill with a strength proportional to the race's \(\tau\)-weight, reflecting prestige tier (Grand Tour > Monument > World Cup > one-week race).

After a long period of inactivity, \(\sigma_i\) grows large. When the rider races again, the update is correspondingly large — data dominates a wide prior. This automatically handles riders returning from injury, changing teams, or switching focus.
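
The inflation step can be sketched as a single function. The numeric values of \(\tau\) and \(\sigma\) below are hypothetical, since the text does not state the tuned drift rate.

```python
import math


def inflate_sigma(sigma: float, tau: float, days_elapsed: float) -> float:
    """Brownian drift: add tau^2 * dt to the variance, then take the root."""
    if days_elapsed < 0:
        raise ValueError("days_elapsed must be non-negative")
    return math.sqrt(sigma ** 2 + tau ** 2 * days_elapsed)


# Hypothetical rider with sigma = 2.0 and an assumed drift rate of 0.05.
sigma_after_week = inflate_sigma(2.0, 0.05, 7)
sigma_after_year = inflate_sigma(2.0, 0.05, 365)

# Variances add, so inflating in two steps equals inflating once over the total gap.
two_steps = inflate_sigma(inflate_sigma(2.0, 0.05, 100), 0.05, 265)
```

Because the variance grows linearly in time, the result does not depend on how the idle period is split into intervals, which is convenient when riders have irregular race calendars.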

Section 4

Partial Pairing

A typical road race produces a full finishing order over \(n\) starters, generating \(\binom{n}{2}\) binary comparisons. Naively computing all pairings is \(O(n^2)\) per race. The BradleyTerryPart algorithm computes partial pairings: each rider is compared only to riders immediately above and below them in the finishing order (within a sliding window of width \(w\)), yielding \(O(n \cdot w)\) comparisons while still propagating global ordering information.

In a sprint finish, the riders in positions 1 to 50 are often given the same finishing time, and the order within the bunch is largely determined by bunch dynamics, not individual skill. Comparing every finisher against every other would overfit to noise. Partial pairing dampens this by concentrating updates on nearby neighbours.
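
To illustrate the difference in comparison counts, the following sketch enumerates the pairs a sliding window would produce. It is an illustration of the windowing idea only, not the openskill library's internal implementation, and `windowed_pairs` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def windowed_pairs(finish_order: Sequence[str], window: int = 1) -> List[Tuple[str, str]]:
    """Return (better, worse) pairs for riders at most `window` places apart."""
    if window < 1:
        raise ValueError("window must be at least 1")
    pairs = []
    n = len(finish_order)
    for pos, rider in enumerate(finish_order):
        # Only look ahead; each pair is emitted once with the better-placed rider first.
        for offset in range(1, window + 1):
            if pos + offset < n:
                pairs.append((rider, finish_order[pos + offset]))
    return pairs


order = ["r1", "r2", "r3", "r4", "r5"]      # hypothetical finishing order
neighbours_only = windowed_pairs(order, window=1)
full_pairing = windowed_pairs(order, window=len(order) - 1)
```

With a window of 1 the five-rider example yields only adjacent pairs, while a window of \(n - 1\) recovers all \(\binom{n}{2}\) comparisons.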

Section 5

Multi-Track Architecture

Eight discipline tracks

Track	Type	Example races
All races (ALL)	aggregate	all WT events
GC	type	Tour de France, Giro, Vuelta
Stage race	category	all multi-day races
Mountain	type	mountain finishes
Cobbles	type	Roubaix, Flanders
Punch	type	Liège, Amstel Gold
Sprint	type	Milan–San Remo finishes
Time trial	type	individual TTs
Classics	category	one-day UCI monuments

Mixing terrain types in a single rating conflates orthogonal skills. Van der Poel and Pogačar are both all-time greats, but near-complementary: Van der Poel dominates cobbled classics; Pogačar excels in mountain stages and GC. A global rating would obscure this.

Track types vs. categories

Types (disjoint): cobbles, punch, mountain, sprint, GC, time trial — a single race maps to at most one.
Categories (overlapping): stage race, classics — these aggregate across types.
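
One way to picture the distinction is as a routing rule from a race to the tracks it updates. The sketch below assumes every race feeds the ALL track, at most one type track, and any number of category tracks; the identifiers are hypothetical and do not come from the project code.

```python
from typing import Iterable, List, Optional

TYPE_TRACKS = {"cobbles", "punch", "mountain", "sprint", "gc", "time_trial"}
CATEGORY_TRACKS = {"stage_race", "classics"}


def tracks_for_race(race_type: Optional[str], categories: Iterable[str]) -> List[str]:
    """List the tracks a race would update: ALL, at most one type, any categories."""
    if race_type is not None and race_type not in TYPE_TRACKS:
        raise ValueError(f"unknown type track: {race_type}")
    cats = set(categories)
    unknown = cats - CATEGORY_TRACKS
    if unknown:
        raise ValueError(f"unknown category tracks: {sorted(unknown)}")
    tracks = ["all"]
    if race_type is not None:
        tracks.append(race_type)   # types are disjoint, so at most one entry here
    tracks.extend(sorted(cats))    # categories may overlap with the type
    return tracks


# A mountain stage inside a multi-day race (assumed classification).
mountain_stage = tracks_for_race("mountain", ["stage_race"])
```

Representing the type as a single optional value, rather than a set, enforces the "at most one" rule by construction.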

Section 6

Display Rating: Elo-Like Rescaling

Display rating — Eq. 2

\[ \text{Rating}(i) = 1500 + 60 \times (\mu_i - \mu_0) \]

A rider at the prior (\(\mu_i = \mu_0 = 25\)) displays as 1500. For a conservative ranking that penalises uncertainty:

Safe mode rating — Eq. 3

\[ \text{Rating}_{\text{safe}}(i) = 1500 + 60 \times (\mu_i - 2\sigma_i - \mu_0) \]

Under safe mode, a rider with high \(\mu\) but high \(\sigma\) (few races) drops significantly, while a rider with moderate \(\mu\) and very low \(\sigma\) (many consistent races) is penalised far less and so rises in the ordering. Safe mode is most useful early in the season, when race counts differ widely.
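
Eq. 2 and Eq. 3 can be sketched as one function with a mode switch. The two riders are hypothetical and chosen so that the ordering flips between modes.

```python
MU0 = 25.0     # prior mean skill, from the text
BASE = 1500.0  # display value at the prior, from Eq. 2
SCALE = 60.0   # display points per unit of mu, from Eq. 2


def display_rating(mu: float, sigma: float, safe: bool = False, z: float = 2.0) -> float:
    """Eq. 2 (normal) or Eq. 3 (safe): rescale mu, optionally minus z * sigma."""
    penalty = z * sigma if safe else 0.0
    return BASE + SCALE * (mu - penalty - MU0)


# Hypothetical newcomer: high mu, few races. Hypothetical veteran: lower mu, many races.
newcomer_normal = display_rating(30.0, 6.0)
newcomer_safe = display_rating(30.0, 6.0, safe=True)
veteran_normal = display_rating(28.0, 1.0)
veteran_safe = display_rating(28.0, 1.0, safe=True)
```

In normal mode the newcomer ranks ahead of the veteran; in safe mode the \(2\sigma\) penalty reverses the order.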

Section 7

Cross-Discipline Composite Score

The composite score answers "who is the most complete cyclist?" by aggregating five of the six disjoint type tracks (time trial is left out), which avoids the double-counting that the overlapping category tracks would introduce. For each rider \(i\), snapshot \(t\), and track \(k \in \{\text{cobbles, punch, mountain, sprint, GC}\}\):

Effective skill: \(r_{i,k} = \mu_{i,k} - z\,\sigma_{i,k}\) (\(z=2\) safe, \(z=0\) normal)
Field stats at \(t\): mean \(\bar{r}_k(t)\) and std \(s_k(t)\) over all riders who have raced in track \(k\)
Per-track z-score, inverse-\(\sigma\) weighted: \(\tilde{z}_{i,k} = \frac{r_{i,k} - \bar{r}_k(t)}{s_k(t)} \cdot \frac{1}{\sigma_{i,k}}\)

Composite score — Eq. 4

\[ C_i(t) = \frac{\displaystyle\sum_{k \in K_i} w_k \cdot \tilde{z}_{i,k}}{\displaystyle\sum_{k \in K_i} w_k} \]

Track	Weight \(w_k\)
Cobbles	1.00
Punch	1.00
Mountain	1.00
Sprint	1.00
GC	1.75

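GC is up-weighted because three-week stage racing requires a unique combination of endurance, climbing, time-trialling, and tactical acumen. \(K_i\) is the set of composite tracks in which rider \(i\) has raced. The z-score step standardises each track's effective skill to zero mean and unit variance within its field at each snapshot, which makes tracks with different rating spreads comparable; the additional \(1/\sigma_{i,k}\) factor then shrinks contributions from tracks where the rider's rating is still uncertain. Because of that factor, \(\tilde{z}_{i,k}\) is no longer unit-variance, so \(C_i(t)\) is best read as a field-adjusted relative score rather than a pure z-score.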
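
The three steps and Eq. 4 can be sketched for a single snapshot as follows. This is an illustration under stated assumptions: the field standard deviation is taken as the population standard deviation (the text does not say which estimator is used), tracks with fewer than two riders or zero spread are skipped, and the snapshot layout is hypothetical.

```python
import statistics
from typing import Dict, Tuple

WEIGHTS = {"cobbles": 1.0, "punch": 1.0, "mountain": 1.0, "sprint": 1.0, "gc": 1.75}

# Hypothetical layout: {track: {rider: (mu, sigma)}} for one snapshot.
Snapshot = Dict[str, Dict[str, Tuple[float, float]]]


def composite_scores(snapshot: Snapshot, z: float = 0.0) -> Dict[str, float]:
    """Eq. 4: weighted mean of inverse-sigma-scaled z-scores over raced tracks."""
    weighted_sum: Dict[str, float] = {}
    weight_total: Dict[str, float] = {}
    for track, field in snapshot.items():
        w = WEIGHTS[track]
        effective = {rider: mu - z * sigma for rider, (mu, sigma) in field.items()}
        values = list(effective.values())
        if len(values) < 2:
            continue                       # no field statistics available
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)  # assumption: population std. dev.
        if spread == 0.0:
            continue
        for rider, (_, sigma) in field.items():
            z_tilde = (effective[rider] - mean) / spread / sigma
            weighted_sum[rider] = weighted_sum.get(rider, 0.0) + w * z_tilde
            weight_total[rider] = weight_total.get(rider, 0.0) + w
    return {rider: weighted_sum[rider] / weight_total[rider] for rider in weighted_sum}


example = {
    "gc": {"a": (30.0, 2.0), "b": (20.0, 2.0)},
    "sprint": {"a": (20.0, 2.0), "c": (30.0, 2.0)},
}
scores = composite_scores(example)
```

Rider `a` appears in both tracks, so the GC weight of 1.75 pulls the composite towards the GC result; riders `b` and `c` average over a single track each.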

Section 8

All-Time Records: Reigns and GOAT

Reigns

A reign is a maximal contiguous interval during which rider \(i\) held the #1 position: \(\text{reign}(i) = [t_{\text{start}}, t_{\text{end}}]\). Reigns are computed over the full engine history, independent of any downsampling applied to the scrubber. Duration is visualised as a proportional coloured bar.
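
Extracting reigns from a chronological list of #1 holders amounts to run-length grouping. The sketch below assumes one entry per snapshot with an ISO date string; the input layout and the function name are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def compute_reigns(leaders: Sequence[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Collapse a chronological (date, rider_at_number_one) list into maximal runs."""
    reigns: List[Dict[str, str]] = []
    for date, rider in leaders:
        if reigns and reigns[-1]["rider"] == rider:
            reigns[-1]["end"] = date  # same leader: extend the current reign
        else:
            reigns.append({"rider": rider, "start": date, "end": date})
    return reigns


history = [
    ("2020-01-01", "A"),
    ("2020-02-01", "A"),
    ("2020-03-01", "B"),
    ("2020-04-01", "A"),
]
reigns = compute_reigns(history)
```

A rider who regains the top spot after losing it starts a new reign, so the same rider can appear several times in the result.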

GOAT

The GOAT ranking identifies the ten riders whose peak rating was highest at any point in the full history: \(\text{peak}(i) = \max_{t}\, \text{Rating}(i, t)\). Peak date and career race count are stored alongside the value. Top-3 receive gold / silver / bronze highlighting.
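
A minimal sketch of the peak computation, assuming each rider's history is a list of (date, rating) entries with one entry per race. The data layout and field names are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def goat_top(history: Dict[str, Sequence[Tuple[str, float]]], n: int = 10) -> List[dict]:
    """Return the n riders with the highest peak rating, best first."""
    peaks = []
    for rider, series in history.items():
        if not series:
            continue
        peak_date, peak = max(series, key=lambda entry: entry[1])
        peaks.append({
            "rider": rider,
            "peak": peak,
            "peak_date": peak_date,
            "races": len(series),  # assumption: one entry per race
        })
    return heapq.nlargest(n, peaks, key=lambda row: row["peak"])


ratings = {
    "A": [("2019-05-01", 1700.0), ("2020-05-01", 1850.0), ("2021-05-01", 1780.0)],
    "B": [("2020-07-01", 1900.0)],
    "C": [("2018-03-01", 1600.0), ("2018-04-01", 1650.0)],
}
top_two = goat_top(ratings, n=2)
```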

Section 9

Online Head-to-Head Win Probability

Precomputing win probabilities for all \(\binom{N}{2}\) pairs at every snapshot is infeasible: with \(N = 1{,}930\) riders, that is nearly 1.9 million pairs per snapshot across 10 tracks. Instead, raw \((\mu_i, \sigma_i)\) tuples are stored in the timeseries JSON, and Eq. 1 is evaluated client-side in the browser for any chosen pair.
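
The pair count quoted above follows directly from the binomial coefficient, and a two-line check shows the gap between storing pairs and storing per-rider tuples.

```python
import math

N_RIDERS = 1930  # rider count stated in the text

pairs_per_snapshot = math.comb(N_RIDERS, 2)  # one probability per unordered pair
tuples_per_snapshot = N_RIDERS               # one (mu, sigma) tuple per rider
ratio = pairs_per_snapshot / tuples_per_snapshot
```

Storing tuples grows linearly in the number of riders, whereas storing pairs grows quadratically, which is why the probability is evaluated on demand.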

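The standard normal CDF is computed as \(\Phi(x) = \tfrac{1}{2}\left[1 + \operatorname{erf}(x/\sqrt{2})\right]\), with the error function approximated using Abramowitz & Stegun formula 7.1.26:

erf approximation (max error < 1.5×10⁻⁷)

\[ \operatorname{erf}(x) \approx 1 - (a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5)\,e^{-x^2}, \quad t = \frac{1}{1 + 0.3275911\,|x|} \]

The formula holds for \(x \ge 0\); negative arguments use the odd symmetry \(\operatorname{erf}(-x) = -\operatorname{erf}(x)\).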
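
The dashboard evaluates this in JavaScript; the Python sketch below mirrors the same formula so it can be checked against `math.erf`. The coefficients are the published values for formula 7.1.26, and the function names are illustrative.

```python
import math

# Coefficients a1..a5 and p for Abramowitz & Stegun formula 7.1.26.
A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)
P = 0.3275911


def erf_approx(x: float) -> float:
    """Polynomial erf approximation, extended to x < 0 by odd symmetry."""
    sign = -1.0 if x < 0 else 1.0
    ax = abs(x)
    t = 1.0 / (1.0 + P * ax)
    poly = 0.0
    for coeff in reversed(A):  # Horner scheme for a1*t + a2*t^2 + ... + a5*t^5
        poly = (poly + coeff) * t
    return sign * (1.0 - poly * math.exp(-ax * ax))


def phi_approx(x: float) -> float:
    """Standard normal CDF built on the approximate erf."""
    return 0.5 * (1.0 + erf_approx(x / math.sqrt(2.0)))


# Largest deviation from the exact erf on a grid over [-4, 4].
max_error = max(abs(erf_approx(k / 100.0) - math.erf(k / 100.0)) for k in range(-400, 401))
```

Since the approximation error is far below the resolution of a displayed percentage, the choice has no visible effect on the head-to-head output.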

Carry-forward is used for the rating lookup: if rider \(i\) has no race in the selected track on the chosen date \(d\), their most recent \((\mu, \sigma)\) on or before \(d\) is used.
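
Carry-forward is a "last value on or before" lookup over a sorted series, which a binary search handles directly. The sketch assumes each rider's series is stored as parallel lists of ISO date strings and \((\mu, \sigma)\) tuples; that layout is an assumption, not the documented JSON schema.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def rating_at(dates: Sequence[str],
              ratings: Sequence[Tuple[float, float]],
              query_date: str) -> Optional[Tuple[float, float]]:
    """Return the latest (mu, sigma) recorded on or before query_date, else None."""
    idx = bisect_right(dates, query_date)  # number of entries dated <= query_date
    if idx == 0:
        return None                        # rider had not yet raced in this track
    return ratings[idx - 1]


dates = ["2023-03-01", "2023-04-15", "2023-07-02"]  # ISO strings sort chronologically
values = [(25.4, 7.9), (26.8, 6.5), (27.1, 5.8)]    # hypothetical (mu, sigma) entries
mid_season = rating_at(dates, values, "2023-05-20")
```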

Section 10

Static Data Export Architecture

The dashboard is a static GitHub Pages site with no backend. All computation happens at export time (Python) or render time (JavaScript).

File	Contents
`meta.json`	Race counts, date range, model parameters (\(\mu_0\), \(\beta\), display scale), composite config
`rider_timeseries.json`	Sparse \((\mu, \sigma)\) per rider per track per race event, plus composite z-scores
`top_history.json`	Top-10 snapshot at each race event in both normal and safe order, per track
`hall_of_fame.json`	Full reign sequence and GOAT top-10 per track and metric, over complete history

Quantity	Where computed	Depends on
Elo display rating	Browser (JS)	\(\mu\), \(\sigma\), mode
Safe/normal toggle	Browser (JS)	local state
Head-to-head probability	Browser (JS)	\(\mu\), \(\sigma\), \(\beta\)
Snapshot Top 10	Python export	track, metric
Reigns / GOAT	Python export	full history
Composite z-scores	Python export	the 5 composite type tracks

Section 11

CI/CD Pipeline

A push to data/** or model/** triggers the Build BT exports workflow.
The workflow installs dependencies via uv, runs the exporter, and commits the four JSON files (git add -f to override .gitignore).
That commit triggers GitHub's built-in pages build and deployment, deploying the /docs folder to the live site.

Because the JSON outputs are git-ignored, CI has to force-add them before committing. HTML-only changes trigger Pages deployment directly, without re-running the exporter.

Section 12

Limitations and Future Directions

Known limitations

DNF and DNS not modelled. Abandonments carry information but are currently excluded.
Team effects ignored. Domestique roles mean that finishing position can badly understate a rider's ability, while a well-supported leader's result can overstate it.
Composite requires participation. A rider who has never raced cobbled classics has no cobbles component — their composite averages over fewer tracks.
No cross-track transfer. A strong mountain rider gets no prior advantage in another track, such as GC, without having raced there.

Possible extensions

Hierarchical model. Pool information across tracks via a rider-level hyperprior, allowing cross-discipline transfer.
Ordinal likelihood. Replace binary win/loss with a full Plackett–Luce model over the finishing order.
Covariate adjustment. Include race-level features (elevation gain, expected finishing speed, weather) to separate terrain affinity from raw strength.
Form windows. Add a short-window component capturing hot/cold streaks separately from long-term ability.

Section 13

Notation Summary

Symbol	Meaning
\(\mu_i\)	Mean skill estimate for rider \(i\)
\(\sigma_i\)	Skill uncertainty (std. dev.) for rider \(i\)
\(\beta\)	Per-race performance noise (\(= 25/6\))
\(\mu_0\)	Prior mean skill (\(= 25\))
\(\tau\)	Sigma drift rate (time decay)
\(w_k\)	Composite weight for track \(k\)
\(\Phi\)	Standard normal CDF
\(C_i(t)\)	Composite score for rider \(i\) at time \(t\)
\(z\)	Safe-mode multiplier (\(z=2\) safe, \(z=0\) normal)

Time-Aware Bradley–Terry Rider Model