We unleashed Claude Code and OpenAI Codex to create a machine learning model trained on 25+ years of NCAA tournament history. Here are their picks, plus a quick tutorial on how machine learning works.
This is a research project for educational purposes. You can learn how to do this on your own below.
Claude and ChatGPT, shown as equal model paths — click any team for details
Model vs market — sorted by ML probability
Where our model disagrees with Vegas — sorted by edge
| TEAM | SEED | REGION | MODEL CHAMP% | MARKET% | EDGE (Δ) | WIN% | PPG |
|---|---|---|---|---|---|---|---|
From raw data to bracket predictions — a walkthrough for a business audience
Every matchup probability is built from three independent signals blended into a single number. Each captures a different dimension of tournament reality — statistical dominance, historical seed behavior, and direct head-to-head history.
For every matchup, the model starts from current-season stats for both teams. Rather than feeding those raw values in directly, it computes the difference between the two teams' values for each stat — a "delta feature" — forcing the model to reason about relative strength, not individual team identity.
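The delta-feature step can be sketched in a few lines. This is a minimal illustration assuming a simple dict-per-team layout; the stat names mirror the table below, but the project's actual schema isn't shown in the write-up.

```python
# Illustrative delta-feature construction (stat names mirror the table;
# the real project's schema is an assumption here).
STATS = ["seed", "win_pct", "ppg", "opp_ppg", "pt_diff", "srs", "sos"]

def delta_features(team_a, team_b):
    """Return (A - B) for every stat, so the model sees only relative strength."""
    return [round(team_a[s] - team_b[s], 3) for s in STATS]

duke = {"seed": 1, "win_pct": 0.882, "ppg": 84.3, "opp_ppg": 63.1,
        "pt_diff": 21.2, "srs": 24.1, "sos": 13.8}
tcu = {"seed": 9, "win_pct": 0.647, "ppg": 73.8, "opp_ppg": 67.4,
       "pt_diff": 6.4, "srs": 8.9, "sos": 9.4}

print(delta_features(duke, tcu))
# → [-8, 0.235, 10.5, -4.3, 14.8, 15.2, 4.4]
```

The output reproduces the DELTA row of the table: the model never sees "Duke" or "TCU", only the gaps between them.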
| TEAM | SEED | WIN % | PPG | OPP PPG | PT DIFF | SRS | SOS |
|---|---|---|---|---|---|---|---|
| Duke | 1 | .882 | 84.3 | 63.1 | +21.2 | 24.1 | 13.8 |
| TCU | 9 | .647 | 73.8 | 67.4 | +6.4 | 8.9 | 9.4 |
| DELTA (A − B) → MODEL INPUT | −8 | +.235 | +10.5 | −4.3 | +14.8 | +15.2 | +4.4 |
↑ The DELTA row is the only row the model sees. Positive values generally favor Team A, but not always: a lower seed number and a lower Opp PPG are both advantages, so the −8 seed delta and the −4.3 Opp PPG delta each favor Duke — relationships the model learns from historical outcomes.
Gradient boosted trees trained on 1,600+ tournament games from 2000–2025. The model learns which statistical gaps matter most in March — seed difference alone explains ~37% of variance, but points-per-game differential and strength of schedule also carry significant weight. Gets 70% of the blend because it's trained directly on tournament outcomes, not just regular-season strength.
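The training step might look like the sketch below, assuming scikit-learn's `GradientBoostingClassifier` and a fabricated dataset standing in for the real 1,600+ tournament games; hyperparameters are illustrative, not the project's actual settings.

```python
# Sketch of training gradient boosted trees on delta features.
# The dataset here is synthetic — the real model trains on 1,600+
# tournament games from 2000-2025.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(seed=42)
n_games = 500
X = rng.normal(size=(n_games, 7))  # 7 delta features per historical matchup
# Fabricated labels: larger positive deltas make a Team A win more likely
y = (X.sum(axis=1) + rng.normal(scale=2.0, size=n_games) > 0).astype(int)

model = GradientBoostingClassifier(
    n_estimators=200, max_depth=3, learning_rate=0.05, random_state=0
)
model.fit(X, y)

# For a new matchup, the model outputs P(Team A wins)
new_matchup = np.array([[-8, 0.235, 10.5, -4.3, 14.8, 15.2, 4.4]])
p_team_a = model.predict_proba(new_matchup)[0, 1]
print(round(p_team_a, 3))
```

Because boosted trees expose `feature_importances_`, this is also how one would measure which statistical gaps (seed difference, PPG differential, strength of schedule) carry the most weight.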
12-seeds beat 5-seeds 35% of the time historically — a fact the model can learn from the data, and one this signal explicitly reinforces
Historical win rates for each seed matchup since 1985. These rates are remarkably stable across eras. Weighted more heavily in early rounds (R64/R32) where seeding is most predictive — by the Elite Eight, any remaining team has proven itself regardless of original seed, so the ML model carries more weight in later rounds.
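One way to wire up the round-dependent weighting is a simple lookup. The 1-vs-16 (99%) and 5-vs-12 (65%) rates come from the article, and the R64 weight matches the stated 70/30 ML/seed split; the weights for later rounds are illustrative assumptions, not the project's actual values.

```python
# Excerpt of historical seed-matchup rates: P(better seed wins), since 1985.
# Only the two values quoted in the article are included here.
SEED_HISTORY = {(1, 16): 0.99, (5, 12): 0.65}

# Weight given to the seed-history signal per round. R64 matches the
# article's 70/30 split; later rounds are hypothetical placeholders that
# follow the stated pattern (seed history fades, ML takes over).
SEED_WEIGHT = {"R64": 0.30, "R32": 0.25, "S16": 0.15,
               "E8": 0.10, "F4": 0.05, "NCG": 0.05}

def seed_weight(round_name):
    """Seed history counts for less in later rounds; the ML model gets the rest."""
    return SEED_WEIGHT[round_name]

# In the Round of 64, a 5-vs-12 game leans 30% on the 65% historical rate
w = seed_weight("R64")
print(w, 1 - w)  # → 0.3 0.7
```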
When teams have played each other, their head-to-head record nudges the final probability by up to ±5%. Capped intentionally — college rosters turn over completely every 4 years, and old matchups between different player generations are weak predictors. Acts as a small momentum signal, not a dominant one.
Real example — Duke (1) vs. Siena (16): ML model outputs 88% based on statistical gap. Historic 1-vs-16 seed rate is 99%. No H2H history between these programs. Blended: (0.70 × 0.88) + (0.30 × 0.99) + 0 = 0.616 + 0.297 = 0.913, i.e. 91.3%. The seed history pulls the prediction higher than the ML model alone, anchoring it to 40 years of tournament data. This is by design — the model occasionally underestimates dominant seeds, and the seed history acts as a calibration floor.
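The whole blend fits in one small function. The 70/30 split, the ±5% head-to-head cap, and the Duke–Siena numbers all come from the write-up; the function name and signature are ours.

```python
# Blending the three signals, reproducing the Duke-Siena worked example.
def blend(ml_prob, seed_prob, h2h_adj=0.0):
    """0.70 * ML model + 0.30 * seed history, plus a head-to-head
    nudge that is intentionally capped at +/-5%."""
    h2h_adj = max(-0.05, min(0.05, h2h_adj))  # enforce the cap
    return 0.70 * ml_prob + 0.30 * seed_prob + h2h_adj

# Duke (1) vs. Siena (16): ML says 88%, seed history says 99%, no H2H
p = blend(ml_prob=0.88, seed_prob=0.99, h2h_adj=0.0)
print(f"{p:.1%}")  # → 91.3%
```

Note that the cap means even a lopsided head-to-head record can only move the final number five points — the "small momentum signal, not a dominant one" described above.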
This project was built using Claude (Anthropic) as a coding collaborator. Here's an honest breakdown of what was machine-generated versus where human judgment shaped the final product.