Case Study 03

2026 FIFA World Cup Match Predictor

Machine learning pipeline + Monte Carlo tournament simulation

Mar 2026 – May 2026
Python, Scikit-learn, XGBoost, PyTorch, Optuna, SHAP, Pandas, NumPy, Plotly, BeautifulSoup
Key impact
Built an end-to-end forecasting pipeline (EECE 5644 — Intro to Machine Learning, Northeastern) over 20+ years of international football — roughly 50,000 matches plus Elo ratings, FIFA rankings, Transfermarkt market values, manager history, and injury records, scraped and merged from six sources.
flowsight.ai/dashboard
[Screenshot: 2026 FIFA World Cup Match Predictor app]

What I did

  1. Built an end-to-end forecasting pipeline (EECE 5644, Intro to Machine Learning, Northeastern) over 20+ years of international football: roughly 50,000 matches plus Elo ratings, FIFA rankings, Transfermarkt market values, manager history, and injury records, scraped and merged from six sources.

  2. Engineered 25+ leakage-safe "delta" features (home minus away) capturing team strength, recent form, squad depth, manager experience, and injury burden, with strictly chronological train / validation / test splits to prevent look-ahead leakage.

  3. Trained and benchmarked seven classifiers (Logistic Regression, KNN, Random Forest, XGBoost with multi:softprob, HistGradientBoosting, SVM, and a Stacking Ensemble) on the three-way outcome (home win / draw / away win), tuned with Optuna, evaluated on macro-F1, log loss, and ROC-AUC, and interpreted with SHAP attributions.

  4. Designed a custom PyTorch network with an Adaptive Focal Loss (per-class γ/α weighting to counter the hard-to-predict draw class) and Gaussian-noise augmentation, putting a neural model head-to-head with the gradient-boosted ensembles.

  5. Calibrated the match probabilities and ran a 10,000-iteration Monte Carlo simulation of the full 48-team, 104-match tournament (Poisson goal model, simulated group tables, third-place slotting, and a knockout bracket with a penalty-shootout model) to estimate every nation's advancement and title odds.

  6. Validated on the 2022 World Cup as a held-out test set (training only on pre-tournament data) to measure genuine out-of-sample forecasting skill rather than in-sample fit.
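
The home-minus-away delta features and the chronological split from step 2 can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the field names (`elo`, `form`), the sample values, and the cutoff dates are all hypothetical, and the real pipeline uses 25+ features.

```python
from datetime import date

# Hypothetical pre-match snapshots (illustrative values, not real data).
matches = [
    {"date": date(2018, 6, 14), "home": {"elo": 1700, "form": 2}, "away": {"elo": 1650, "form": 1}},
    {"date": date(2021, 6, 11), "home": {"elo": 1850, "form": 3}, "away": {"elo": 1790, "form": 1}},
    {"date": date(2022, 11, 20), "home": {"elo": 1600, "form": 1}, "away": {"elo": 1800, "form": 2}},
]


def delta_features(match):
    """Home-minus-away difference for every numeric field of the snapshot."""
    return {f"delta_{k}": match["home"][k] - match["away"][k] for k in match["home"]}


def chronological_split(rows, train_end, val_end):
    """Split by date cutoffs so that no later match can inform an earlier one."""
    rows = sorted(rows, key=lambda r: r["date"])
    train = [r for r in rows if r["date"] <= train_end]
    val = [r for r in rows if train_end < r["date"] <= val_end]
    test = [r for r in rows if r["date"] > val_end]
    return train, val, test


train, val, test = chronological_split(
    matches, train_end=date(2019, 12, 31), val_end=date(2022, 11, 19)
)
```

Splitting on date cutoffs rather than shuffling is what keeps the test set strictly in the future relative to everything the model saw during training.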
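
To make the evaluation metrics in step 3 concrete, here is a dependency-free sketch of macro-F1 and multiclass log loss for the three-way outcome. The project itself lists Scikit-learn in its stack, so it presumably relied on library implementations; the function names and class labels here are illustrative assumptions.

```python
import math

CLASSES = ["home_win", "draw", "away_win"]  # hypothetical label names


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so the rare draw class counts equally."""
    scores = []
    for c in CLASSES:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, probs):
        total -= math.log(max(p[CLASSES.index(t)], eps))
    return total / len(y_true)


y_true = ["home_win", "draw", "away_win", "home_win"]
y_pred = ["home_win", "home_win", "away_win", "home_win"]  # never predicts a draw
score = macro_f1(y_true, y_pred)
```

In this toy example the model is right on three of four matches, yet the missed draw drags the macro average down, which is why macro-F1 is a stricter yardstick than accuracy for this problem.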
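
The per-class γ/α weighting in step 4 builds on the standard focal loss, FL = -α_c (1 - p_t)^γ_c log(p_t). The sketch below shows only that standard form with fixed per-class values, written in plain Python rather than PyTorch to stay dependency-free. How the project's loss adapts γ and α during training is not described here, and the specific γ/α values are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma, alpha):
    """Focal loss for one sample with per-class gamma and alpha.

    (1 - p_t) ** gamma down-weights samples the model already gets right;
    alpha rescales the loss per class.
    """
    p_t = softmax(logits)[target]
    return -alpha[target] * (1.0 - p_t) ** gamma[target] * math.log(p_t)


# Class order assumed: 0 = home win, 1 = draw, 2 = away win.
gamma = [2.0, 1.0, 2.0]   # hypothetical: weaker focusing on draws
alpha = [1.0, 1.5, 1.0]   # hypothetical: extra weight on draws

draw_loss = focal_loss([0.2, 0.1, -0.3], target=1, gamma=gamma, alpha=alpha)
```

Setting every γ to 0 and every α to 1 recovers ordinary cross-entropy, which is a handy sanity check when experimenting with the weights.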
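
The Monte Carlo idea in step 5 can be illustrated on a single four-team group. This is a deliberately simplified sketch: each team gets one hypothetical expected-goals rate that ignores the opponent, ties in the table are broken at random, and third-place slotting, the knockout bracket, and the shootout model are left out. The real simulation derives its inputs from the calibrated match probabilities.

```python
import math
import random
from itertools import combinations


def poisson(lam, rng):
    """Draw a Poisson sample with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_group(rates, n_sims=10_000, seed=0):
    """Estimate each team's probability of finishing in the top two of a round-robin group."""
    rng = random.Random(seed)
    teams = list(rates)
    advanced = {t: 0 for t in teams}
    for _ in range(n_sims):
        table = {t: [0, 0, 0] for t in teams}  # points, goal difference, goals for
        for a, b in combinations(teams, 2):
            ga, gb = poisson(rates[a], rng), poisson(rates[b], rng)
            table[a][1] += ga - gb
            table[b][1] += gb - ga
            table[a][2] += ga
            table[b][2] += gb
            if ga > gb:
                table[a][0] += 3
            elif gb > ga:
                table[b][0] += 3
            else:
                table[a][0] += 1
                table[b][0] += 1
        # Rank by points, then goal difference, then goals scored, then a coin flip.
        ranked = sorted(teams, key=lambda t: (*table[t], rng.random()), reverse=True)
        for t in ranked[:2]:
            advanced[t] += 1
    return {t: advanced[t] / n_sims for t in teams}


# Hypothetical expected-goals rates for four placeholder teams.
odds = simulate_group({"A": 2.0, "B": 1.5, "C": 1.0, "D": 0.5}, n_sims=2000, seed=1)
```

Fixing the seed makes a run reproducible, and the spread of the estimates shrinks as `n_sims` grows, which is the usual reason to run iteration counts on the order of 10,000.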

Tech stack

Python, Scikit-learn, XGBoost, PyTorch, Optuna, SHAP, Pandas, NumPy, Plotly, BeautifulSoup