Design, inference, and dashboard architecture for the World Tour 2019–2026 cycling rankings system. Eight discipline tracks, composite scoring, and live head-to-head win probabilities.
Professional cycling poses a unique ranking challenge. Unlike team sports with balanced schedules, a cyclist in a given season may start anywhere from a handful to over a hundred races, against wildly varying fields, across radically different terrain types. A sprinter competes almost exclusively against other sprinters in flat finishes; a climber is mostly absent from those races. A single global Elo score conflates apples and mountains.
Three further complications make the problem harder:
This project addresses all three with a time-aware, importance-weighted, multi-track partial-pairing Bradley–Terry model, using the openskill library's BradleyTerryPart backend, and exposes the results through an interactive web dashboard.
Given two riders \(i\) and \(j\) with scalar strengths \(\lambda_i, \lambda_j > 0\), the classical Bradley–Terry model assigns:
\[ P(i \text{ beats } j) = \frac{\lambda_i}{\lambda_i + \lambda_j} \]
In the Bayesian (TrueSkill / openskill) formulation each rider carries a Gaussian belief over their true skill \(s_i\): \( s_i \sim \mathcal{N}(\mu_i,\, \sigma_i^2) \). \(\mu_i\) is the current best estimate of skill; \(\sigma_i\) encodes uncertainty. A fresh rider has high \(\sigma\); a rider with thousands of observed head-to-heads has low \(\sigma\).
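The probability that \(i\) beats \(j\), marginalised over both latent skills and per-performance noise \(\varepsilon \sim \mathcal{N}(0, 2\beta^2)\):
\[ P(i \text{ beats } j) = \Phi\!\left( \frac{\mu_i - \mu_j}{\sqrt{2\beta^2 + \sigma_i^2 + \sigma_j^2}} \right) \tag{1} \]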
where \(\Phi\) is the standard normal CDF and \(\beta = \mu_0 / 6 = 25/6 \approx 4.17\) is the per-race performance standard deviation.
The formula uses raw \(\mu_i\) in the numerator — not a penalty-adjusted version. The uncertainty already enters through \(\sigma_i^2\) in the denominator: a high-\(\sigma\) rider pulls the probability toward \(0.5\) regardless of their \(\mu\). Applying an additional \(\mu - 2\sigma\) discount to both sides would double-count uncertainty. Win probability is therefore independent of the safe/normal toggle.
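As a minimal sketch of the head-to-head probability described above, the function below evaluates \(\Phi\) of the raw mean difference divided by \(\sqrt{2\beta^2 + \sigma_i^2 + \sigma_j^2}\). The function names and the example riders are illustrative assumptions, not code from the project.

```python
import math

MU0 = 25.0        # prior mean, as stated in the text
BETA = MU0 / 6    # per-race performance std. dev. (25/6)

def phi(x: float) -> float:
    """Standard normal CDF via the exact error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_probability(mu_i, sigma_i, mu_j, sigma_j, beta=BETA):
    """P(i beats j): raw means in the numerator, all variance in the denominator."""
    denom = math.sqrt(2.0 * beta**2 + sigma_i**2 + sigma_j**2)
    return phi((mu_i - mu_j) / denom)

# Same 5-point gap in mu, different certainty (hypothetical riders).
p_certain = win_probability(30.0, 1.0, 25.0, 1.0)
p_uncertain = win_probability(30.0, 8.0, 25.0, 8.0)
```

With these made-up inputs, `p_uncertain` sits closer to 0.5 than `p_certain`, which is the pull toward 0.5 that the paragraph attributes to high \(\sigma\).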
A standard BT update treats all historical results as equally informative. Two mechanisms discount the past:
After a long period of inactivity, \(\sigma_i\) grows large. When the rider races again, the update is correspondingly large — data dominates a wide prior. This automatically handles riders returning from injury, changing teams, or switching focus.
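One way to picture the growth of \(\sigma_i\) during inactivity is additive variance drift. The sketch below is an assumption about the mechanism, not the project's actual rule: the per-day rate `TAU_PER_DAY`, the cap at the prior \(\sigma\), and the function name are all hypothetical.

```python
import math

SIGMA0 = 25.0 / 3     # openskill's default prior sigma (assumed to apply here)
TAU_PER_DAY = 0.05    # hypothetical drift rate; not taken from the project

def drift_sigma(sigma, idle_days, tau=TAU_PER_DAY, cap=SIGMA0):
    """Add tau^2 of variance per idle day, never exceeding the prior sigma."""
    inflated = math.sqrt(sigma**2 + tau**2 * idle_days)
    return min(inflated, cap)

after_month = drift_sigma(2.0, 30)    # short break: sigma barely moves
after_year = drift_sigma(2.0, 365)    # long break: noticeably wider belief
```

Under this assumed rule a rider who has not raced for a year re-enters with a wider belief than one who missed a month, so the next result moves their \(\mu\) more.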
A typical road race produces a full finishing order over \(n\) starters, generating \(\binom{n}{2}\) binary comparisons. Naively computing all pairings is \(O(n^2)\) per race. The BradleyTerryPart algorithm computes partial pairings: each rider is compared only to riders immediately above and below them in the finishing order (within a sliding window), yielding \(O(n \cdot w)\) comparisons while still propagating global ordering information.
In a sprint finish, the difference between positions 1 and 50 is often a few centimetres and largely determined by bunch dynamics, not individual skill. Comparing every finisher against every other would overfit to noise. Partial pairing dampens this by concentrating updates on nearby neighbours.
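The saving from partial pairing can be illustrated with a small sketch. This is not openskill's internal implementation: `windowed_pairs`, the window of 4, and the 150-rider field are assumptions chosen only to show how the comparison count scales.

```python
from math import comb

def windowed_pairs(finish_order, window):
    """Return (ahead, behind) pairs for riders at most `window` places apart."""
    pairs = []
    for pos, rider in enumerate(finish_order):
        # Only the next `window` finishers are compared against this rider.
        for other in finish_order[pos + 1 : pos + 1 + window]:
            pairs.append((rider, other))
    return pairs

order = [f"rider_{k}" for k in range(1, 151)]   # 150 hypothetical finishers
partial = windowed_pairs(order, window=4)
full = comb(len(order), 2)

print(len(partial), full)   # 590 11175
```

For this field the windowed scheme produces 590 comparisons instead of 11,175, and the winner is never compared directly with the rider in 50th place.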
| Track | Type | Example races |
|---|---|---|
| All races (ALL) | aggregate | all WT events |
| GC | type | Tour de France, Giro, Vuelta |
| Stage race | category | all multi-day races |
| Mountain | type | mountain finishes |
| Cobbles | type | Roubaix, Flanders |
| Punch | type | Liège, Amstel Gold |
| Sprint | type | Milan–San Remo finishes |
| Time trial | type | individual TTs |
| Classics | category | one-day UCI monuments |
Mixing terrain types in a single rating conflates orthogonal skills. Van der Poel and Pogačar are both all-time greats, but near-complementary: Van der Poel dominates cobbled classics; Pogačar excels in mountain stages and GC. A global rating would obscure this.
A rider at the prior (\(\mu_i = \mu_0 = 25\)) displays as 1500. For a conservative ranking that penalises uncertainty:
A rider with high \(\mu\) but high \(\sigma\) (few races) drops significantly; a rider with moderate \(\mu\) and very low \(\sigma\) (many consistent races) rises. This mode is most useful early in the season, when race counts differ widely.
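The effect of the safe toggle can be sketched as follows. The linear mapping and the value of `SCALE` are assumptions: the text only fixes that a rider at the prior displays as 1500 and that safe mode subtracts \(z\sigma\) with \(z = 2\). The real display scale is stored in meta.json, and both riders here are hypothetical.

```python
MU0 = 25.0      # prior mean from the text
BASE = 1500.0   # display value of a rider sitting at the prior
SCALE = 40.0    # hypothetical display scale; the real value lives in meta.json

def display_rating(mu, sigma, z=0.0, scale=SCALE):
    """Map (mu, sigma) to a display rating; z=0 is normal mode, z=2 is safe mode."""
    return BASE + scale * ((mu - z * sigma) - MU0)

newcomer = (32.0, 6.0)   # high mu, few races
veteran = (29.0, 1.5)    # moderate mu, many consistent races

normal = [display_rating(*r) for r in (newcomer, veteran)]        # [1780.0, 1660.0]
safe = [display_rating(*r, z=2.0) for r in (newcomer, veteran)]   # [1300.0, 1540.0]
```

Under these assumptions the newcomer leads in normal mode but falls behind the veteran once the uncertainty penalty is applied.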
The composite score answers "who is the most complete cyclist?" by aggregating only the five disjoint type tracks, avoiding double-counting. For each rider \(i\), snapshot \(t\), and track \(k \in \{\text{cobbles, punch, mountain, sprint, GC}\}\):
| Track | Weight \(w_k\) |
|---|---|
| Cobbles | 1.00 |
| Punch | 1.00 |
| Mountain | 1.00 |
| Sprint | 1.00 |
| GC | 1.75 |
GC is up-weighted because three-week stage racing requires a unique combination of endurance, climbing, time-trialling, and tactical acumen. Z-scoring normalises each track to zero mean and unit variance within its field at each snapshot, making the composite dimensionless and field-adjusted.
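A minimal sketch of the composite, assuming it is a plain weighted sum of per-track z-scores at one snapshot. Whether the project divides by the total weight, and how it treats riders missing from a track, is not stated in the text, so both choices below are assumptions, as are the three riders and their ratings.

```python
from statistics import mean, pstdev

WEIGHTS = {"cobbles": 1.0, "punch": 1.0, "mountain": 1.0, "sprint": 1.0, "gc": 1.75}

def z_scores(ratings):
    """Normalise one track's ratings to zero mean and unit variance."""
    m, s = mean(ratings.values()), pstdev(ratings.values())
    return {rider: (v - m) / s if s > 0 else 0.0 for rider, v in ratings.items()}

def composite(track_ratings):
    """track_ratings: {track: {rider: rating}} for a single snapshot."""
    totals = {}
    for track, weight in WEIGHTS.items():
        for rider, z in z_scores(track_ratings[track]).items():
            totals[rider] = totals.get(rider, 0.0) + weight * z
    return totals

flat = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
snapshot = {
    "cobbles": {"A": 1700.0, "B": 1500.0, "C": 1500.0},   # A is the cobbles specialist
    "punch": flat, "mountain": flat, "sprint": flat,
    "gc": {"A": 1500.0, "B": 1700.0, "C": 1500.0},        # B is the GC specialist
}
scores = composite(snapshot)
```

In this toy field A and B are equally dominant in one track each, yet B ends up ahead because the GC z-score is multiplied by 1.75.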
A reign is a maximal contiguous interval during which rider \(i\) held the #1 position: \(\text{reign}(i) = [t_{\text{start}}, t_{\text{end}}]\). Reigns are computed over the full engine history, independent of any downsampling applied to the scrubber. Duration is visualised as a proportional coloured bar.
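Extracting reigns amounts to grouping consecutive snapshots with the same leader. The sketch below assumes the input is a chronologically ordered list of (date, rider) pairs naming the #1 at each snapshot. That input shape and the sample history are hypothetical.

```python
from itertools import groupby

def reigns(leaders):
    """Collapse consecutive snapshots with the same #1 into (rider, start, end)."""
    out = []
    for rider, run in groupby(leaders, key=lambda item: item[1]):
        run = list(run)
        out.append((rider, run[0][0], run[-1][0]))
    return out

history = [
    ("2021-03-01", "A"),
    ("2021-03-08", "A"),
    ("2021-03-15", "B"),
    ("2021-03-22", "A"),
]
result = reigns(history)
# [("A", "2021-03-01", "2021-03-08"), ("B", "2021-03-15", "2021-03-15"),
#  ("A", "2021-03-22", "2021-03-22")]
```

A rider who loses and regains the top spot gets two separate reigns, which matches the "maximal contiguous interval" definition.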
The GOAT ranking identifies the ten riders whose peak rating was highest at any point in the full history: \(\text{peak}(i) = \max_{t}\, \text{Rating}(i, t)\). Peak date and career race count are stored alongside the value. Top-3 receive gold / silver / bronze highlighting.
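The peak lookup can be sketched in a few lines. The input shape (one (date, rating) entry per race for each rider) is an assumption, so the career race count here is simply the length of that list. The rider names and ratings are made up.

```python
def goat_top(history, n=10):
    """history: {rider: [(date, rating), ...]}; return the n highest peaks."""
    peaks = []
    for rider, series in history.items():
        peak_date, peak_rating = max(series, key=lambda point: point[1])
        peaks.append((rider, peak_rating, peak_date, len(series)))
    peaks.sort(key=lambda row: row[1], reverse=True)
    return peaks[:n]

sample = {
    "A": [("2020-04-05", 1810.0), ("2021-04-04", 1905.0), ("2022-04-03", 1870.0)],
    "B": [("2020-07-19", 1950.0), ("2021-07-18", 1920.0)],
    "C": [("2021-03-20", 1700.0)],
}
top2 = goat_top(sample, n=2)
```

Here B ranks first on the strength of a single 2020 peak, even though A has more races and a higher rating at the end of the sample.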
Precomputing win probabilities for all \(\binom{N}{2}\) pairs at every snapshot is infeasible: with \(N = 1{,}930\) riders, that is nearly 1.9 million pairs per snapshot across 10 tracks. Instead, raw \((\mu_i, \sigma_i)\) tuples are stored in the timeseries JSON, and Eq. 1 is evaluated client-side in the browser for any chosen pair.
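The pair count quoted above can be checked directly. The per-track multiplication assumes every track contains all \(N\) riders, which overstates the real figure for specialist tracks.

```python
from math import comb

N_RIDERS = 1930
N_TRACKS = 10

pairs_per_snapshot = comb(N_RIDERS, 2)             # 1,861,485
pairs_all_tracks = pairs_per_snapshot * N_TRACKS   # 18,614,850 (upper bound)
```

By contrast, storing one \((\mu, \sigma)\) tuple per rider per track needs at most 19,300 entries per snapshot.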
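The standard normal CDF is approximated using Abramowitz & Stegun formula 7.1.26, which gives the error function for \(x \ge 0\):
\[ \operatorname{erf}(x) \approx 1 - \left(a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5\right) e^{-x^2}, \qquad t = \frac{1}{1 + p\,x} \]
with \(p = 0.3275911\), \(a_1 = 0.254829592\), \(a_2 = -0.284496736\), \(a_3 = 1.421413741\), \(a_4 = -1.453152027\), \(a_5 = 1.061405429\), and maximum absolute error \(1.5 \times 10^{-7}\). Negative arguments use \(\operatorname{erf}(-x) = -\operatorname{erf}(x)\), and the CDF follows from \(\Phi(x) = \tfrac{1}{2}\left[1 + \operatorname{erf}\!\left(x / \sqrt{2}\right)\right]\).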
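A Python sketch of that approximation, for comparison against the exact `math.erf`. The dashboard evaluates the equivalent in JavaScript; the function names here are illustrative, while the constants are the published 7.1.26 coefficients.

```python
import math

P = 0.3275911
A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)

def erf_as(x: float) -> float:
    """Abramowitz & Stegun 7.1.26 approximation of erf, extended to x < 0 by symmetry."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + P * x)
    poly = 0.0
    for a in reversed(A):          # Horner evaluation of a1*t + ... + a5*t^5
        poly = (poly + a) * t
    return sign * (1.0 - poly * math.exp(-x * x))

def phi_as(x: float) -> float:
    """Standard normal CDF built on the approximate erf."""
    return 0.5 * (1.0 + erf_as(x / math.sqrt(2.0)))

grid = [k / 100.0 for k in range(-500, 501)]
worst = max(abs(phi_as(x) - 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))) for x in grid)
```

On this grid the discrepancy stays well below \(10^{-6}\), far finer than the precision at which a win probability is displayed.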
Carry-forward is used for the rating lookup: if rider \(i\) has no race in the selected track after date \(d\), their most recent \((\mu, \sigma)\) before \(d\) is used.
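Carry-forward over a sparse timeseries is a binary search for the last entry not later than the chosen date. The sketch below assumes each rider and track has a sorted list of ISO date strings with a parallel list of \((\mu, \sigma)\) tuples, and that an entry dated exactly \(d\) counts. The data layout and the sample values are hypothetical.

```python
from bisect import bisect_right

def rating_at(dates, values, d):
    """Return the latest (mu, sigma) recorded on or before d, or None if none exists."""
    idx = bisect_right(dates, d)
    return values[idx - 1] if idx else None

dates = ["2022-03-05", "2022-04-03", "2022-04-17"]
values = [(27.1, 4.0), (28.4, 3.6), (29.0, 3.3)]

mid_gap = rating_at(dates, values, "2022-04-10")      # (28.4, 3.6), carried forward
before_debut = rating_at(dates, values, "2022-01-01") # None
```

ISO-formatted dates sort lexicographically, so plain string comparison is enough for the search.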
The dashboard is a static GitHub Pages site with no backend. All computation happens at export time (Python) or render time (JavaScript).
| File | Contents |
|---|---|
| meta.json | Race counts, date range, model parameters (\(\mu_0\), \(\beta\), display scale), composite config |
| rider_timeseries.json | Sparse \((\mu, \sigma)\) per rider per track per race event, plus composite z-scores |
| top_history.json | Top-10 snapshot at each race event in both normal and safe order, per track |
| hall_of_fame.json | Full reign sequence and GOAT top-10 per track and metric, over complete history |
| Quantity | Where computed | Depends on |
|---|---|---|
| Elo display rating | Browser (JS) | \(\mu\), \(\sigma\), mode |
| Safe/normal toggle | Browser (JS) | local state |
| Head-to-head probability | Browser (JS) | \(\mu\), \(\sigma\), \(\beta\) |
| Snapshot Top 10 | Python export | track, metric |
| Reigns / GOAT | Python export | full history |
| Composite z-scores | Python export | all 5 type tracks |
A change under data/** or model/** triggers the Build BT exports workflow. The workflow uses uv, runs the exporter, and commits the four JSON files (git add -f to override .gitignore). GitHub Pages then serves the /docs folder as the live site. The JSON outputs are git-ignored but force-added by CI. HTML-only changes trigger Pages deployment directly, without re-running the exporter.
| Symbol | Meaning |
|---|---|
| \(\mu_i\) | Mean skill estimate for rider \(i\) |
| \(\sigma_i\) | Skill uncertainty (std. dev.) for rider \(i\) |
| \(\beta\) | Per-race performance noise (\(= 25/6\)) |
| \(\mu_0\) | Prior mean skill (\(= 25\)) |
| \(\tau\) | Sigma drift rate (time decay) |
| \(w_k\) | Composite weight for track \(k\) |
| \(\Phi\) | Standard normal CDF |
| \(C_i(t)\) | Composite score for rider \(i\) at time \(t\) |
| \(z\) | Safe-mode multiplier (\(z=2\) safe, \(z=0\) normal) |