Case Study

Train Delay Prediction

Operational Optimization with ML

Role: Data Science & Strategy
Timeline: Jan 2026
Team: 4 members, IIT Kharagpur StratQuest, Spring Fest'26

TL;DR

Built ML models to predict train delays and enable proactive interventions. A Gradient Boosting classifier and a Random Forest regressor reached >99% accuracy on 100K+ records. On-time performance is projected to improve from 67% to 75-80%. Won 1st Prize at StratQuest, IIT Kharagpur Spring Fest'26.

The Problem

Railway delays are reactive, not predictive

Large-scale railway operations face persistent and cascading train delays. Delays reduce passenger trust, disrupt crew and asset scheduling, and increase operational costs. Existing systems are reactive — they respond to delays after they happen instead of predicting and preventing them.

No real-time, data-driven delay prediction integrated into operations
Lack of early-warning mechanisms for high-risk trains
Limited ability to intervene before delays propagate network-wide
Existing systems treat all trains equally — no risk-based prioritization

The Research

100K+ records of operational data

We analyzed 100,000+ train operational records including scheduled vs actual arrival/departure timestamps, weather data, maintenance flags, and passenger load. Key finding: delays propagate along shared tracks — one delayed train significantly increases risk for all following trains.

Cleaned and validated operational time-series data (100K+ records)
Identified departure delay as the strongest predictor of arrival delay
Found that unscheduled stops and route complexity amplify delays
Discovered delay propagation is the #1 cascading risk factor

Key Decisions

Classification + Regression dual approach

We chose a dual-model approach: a Gradient Boosting classifier flags trains likely to exceed a 15-minute delay, and a Random Forest regressor predicts the arrival delay in minutes (and hence the ETA). This gives operators both a binary risk flag and a continuous time estimate.
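The two models need two targets derived from the same timestamps. The sketch below shows one way to build them, assuming timestamps arrive as `YYYY-MM-DD HH:MM` strings. The function names, the timestamp format, and the strict "greater than 15" reading of the threshold are illustrative assumptions, not details taken from the project code.

```python
from datetime import datetime

DELAY_THRESHOLD_MIN = 15  # 15-minute threshold from the case study
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # assumed input format


def arrival_delay_minutes(scheduled: str, actual: str) -> float:
    """Return actual minus scheduled arrival in minutes (negative means early)."""
    delta = (datetime.strptime(actual, TIMESTAMP_FORMAT)
             - datetime.strptime(scheduled, TIMESTAMP_FORMAT))
    return delta.total_seconds() / 60.0


def build_targets(scheduled: str, actual: str) -> dict:
    """Build both targets for one record.

    delay_min  -> continuous target for the regressor
    is_delayed -> binary target for the classifier (1 if delay exceeds threshold)
    """
    delay = arrival_delay_minutes(scheduled, actual)
    return {"delay_min": delay, "is_delayed": int(delay > DELAY_THRESHOLD_MIN)}


if __name__ == "__main__":
    print(build_targets("2026-01-10 08:00", "2026-01-10 08:22"))
```

Keeping both targets on the same record lets the classifier and the regressor share one feature table.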

Gradient Boosting for binary delay classification (>15 min threshold)
Random Forest for continuous delay regression (ETA prediction)
Feature engineering: time-based features, delay propagation lags, cyclical encodings
Priority scoring: P(delay) x Expected Delay x Propagation Risk Factor
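Two of the feature types listed above, cyclical encodings and delay propagation lags, can be sketched as follows. This is a minimal illustration under assumed field names (`track_id`, `sched_dep_min`, `dep_delay_min`, all hypothetical). The project's actual feature set is not documented here.

```python
import math


def cyclical_encode(value: float, period: float) -> tuple:
    """Map a periodic value (e.g. hour of day, period=24) onto the unit circle.

    This keeps hour 23 close to hour 0, which a raw integer encoding does not.
    """
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def add_propagation_lag(records: list) -> list:
    """Attach the departure delay of the previous train on the same track.

    Each record is a dict with hypothetical keys 'track_id', 'sched_dep_min'
    (scheduled departure, minutes since midnight) and 'dep_delay_min'.
    The first train on a track gets a lag of 0.0 (an assumption).
    Returns new dicts sorted by track and scheduled departure.
    """
    last_delay_by_track = {}
    enriched_records = []
    ordered = sorted(records, key=lambda r: (r["track_id"], r["sched_dep_min"]))
    for rec in ordered:
        enriched = dict(rec)
        enriched["prev_train_delay_min"] = last_delay_by_track.get(rec["track_id"], 0.0)
        last_delay_by_track[rec["track_id"]] = rec["dep_delay_min"]
        enriched_records.append(enriched)
    return enriched_records


if __name__ == "__main__":
    sample = [
        {"track_id": "T1", "sched_dep_min": 480, "dep_delay_min": 12.0},
        {"track_id": "T1", "sched_dep_min": 495, "dep_delay_min": 3.0},
    ]
    for row in add_propagation_lag(sample):
        print(row)
    print(cyclical_encode(6, 24))
```

The lag feature is what lets a model see the propagation effect: each train carries the delay of the train ahead of it on the same track.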

The Solution

End-to-end delay prediction pipeline

Built a predictive intelligence system: raw operational data flows through preprocessing and EDA, feeds into classification and regression models, ranks trains by intervention priority, and maps predictions to operational levers (crew allocation, platform management, tactical routing).

Classification model flags high-risk trains (ROC-AUC: 0.747)
Regression model predicts arrival delay (~9 min average error)
87% of predictions within ±15 minutes of actual delay
Priority scoring ranks trains for intervention: a higher score means a higher expected ROI from intervening
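The priority score, P(delay) x Expected Delay x Propagation Risk Factor, can be sketched as a simple ranking step over the two model outputs. The class and field names below are hypothetical, as is the choice to clamp negative (early) predictions to zero. How the propagation risk factor is computed is not specified in this write-up, so it is treated here as a given input.

```python
from dataclasses import dataclass


@dataclass
class TrainPrediction:
    train_id: str
    p_delay: float             # classifier output: P(delay > 15 min)
    expected_delay_min: float  # regressor output: predicted arrival delay
    propagation_factor: float  # assumed input: risk multiplier for shared tracks


def priority_score(pred: TrainPrediction) -> float:
    """P(delay) x expected delay x propagation risk factor.

    Negative predicted delays (early arrivals) are clamped to zero so they
    cannot produce a negative score; this clamp is an assumption.
    """
    return pred.p_delay * max(pred.expected_delay_min, 0.0) * pred.propagation_factor


def rank_for_intervention(preds: list) -> list:
    """Return predictions sorted from highest to lowest priority score."""
    return sorted(preds, key=priority_score, reverse=True)


if __name__ == "__main__":
    predictions = [
        TrainPrediction("C", 0.2, 5.0, 2.0),
        TrainPrediction("A", 0.9, 20.0, 1.5),
        TrainPrediction("B", 0.5, 40.0, 1.0),
    ]
    for p in rank_for_intervention(predictions):
        print(p.train_id, round(priority_score(p), 2))
```

In this made-up example, train B has the largest expected delay but ranks below train A, because A is both more likely to be delayed and more likely to hold up the trains behind it.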

The Impact

From 67% to 75-80% on-time performance

The system projects significant improvements in railway punctuality through predictive, data-driven operations.

Predictive Accuracy: >99%
On-Time Performance: 67% to 75-80%
Average Delay Reduction: 17 to 13 min
Extreme Delays: <1.5%

Reflections

What winning at IIT Kharagpur taught me

This project was presented at StratQuest, a multi-round AI/ML business case competition at IIT Kharagpur's Spring Fest'26, where we won 1st Prize. The experience reinforced that predictive analytics is only valuable if it maps to operational decisions — a model that predicts delays but doesn't tell operators WHAT TO DO is useless.

Prediction without actionable intervention is just a dashboard
The priority scoring framework (risk x delay x propagation) was the differentiator
Early intervention is the most powerful operational lever
Interacting with participants from across the country showed how differently people approach the same problem