Post-hoc Prediction Correction: Long-Term Bias & Short-Term Drift

A trained regressor that performs well on historical data often drifts in production for two distinct reasons:
- Systematic model error. The model consistently over- or under-shoots on certain sub-populations, even on the training set itself.
- Target non-stationarity. The physical process generating y changes after the training cutoff: a sensor ages, a recipe is retuned, a market regime shifts. The model cannot know this.
Retraining is the principled fix for both, but it is expensive and sometimes impossible (no fresh labels, frozen model in a regulated pipeline). A cheap, production-friendly alternative is a two-term post-hoc correction:
y_pred_final = y_pred − residual_bias − recent_drift
                        └─long-term─┘   └short-term┘
1. Long-term bias (residual_bias)
Definition, computed once at the end of training on the training set itself:
residual_bias = mean( y_pred − y_true ) # over-prediction
If residual_bias > 0 the model is, on average, over-shooting; subtracting it re-centres future predictions onto the training-era y mean.
Key properties:
- Sub-population aware. Compute one residual_bias per meaningful group (product, channel, segment). A global constant washes out local biases.
- Uses training-set predictions, not a held-out split. It measures the fitted model’s systematic tilt, which is exactly what will be carried into inference. A validation residual is noisier and estimates a slightly different quantity.
- Non-invasive with respect to the model. Applied after model.predict, never during training, so it does not change gradients or feature ranking.
- Free variance floor. The residual std reported alongside it tells you how much of the error is not systematic, i.e. what the correction cannot fix.
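import numpy as np
# training: one bias per group g (groups_train: assumed array of group labels, one per training row)
preds_train = model.predict(X_train)
residual_bias = {}
for g in np.unique(groups_train):
    g_mask = groups_train == g
    residual_bias[g] = (preds_train[g_mask] - y_train[g_mask]).mean()
# inference (groups_new: assumed array of group labels, one per new row)
y_pred = model.predict(X_new)
y_pred -= np.array([residual_bias[g] for g in groups_new])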
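To see why a global constant can wash out local biases, and what the residual std adds, here is a minimal sketch on toy data. The helper name `bias_summary` and the numbers are illustrative assumptions, not part of any library; the std is the population std (NumPy's default `ddof=0`).

```python
import numpy as np

def bias_summary(preds, y_true, groups):
    """Return {group: (residual_bias, residual_std)} from training-set predictions."""
    preds, y_true, groups = map(np.asarray, (preds, y_true, groups))
    resid = preds - y_true  # positive = over-prediction
    summary = {}
    for g in np.unique(groups):
        r = resid[groups == g]
        summary[g.item()] = (float(r.mean()), float(r.std()))
    return summary

# Toy data: the model over-shoots group "A" by about 2 and under-shoots "B" by about 2.
y_true = np.array([10.0, 12.0, 11.0, 20.0, 22.0, 21.0])
preds = np.array([12.5, 13.5, 13.0, 18.0, 19.5, 19.5])
groups = np.array(["A", "A", "A", "B", "B", "B"])

global_bias = float((preds - y_true).mean())   # the two groups cancel out
per_group = bias_summary(preds, y_true, groups)
print(global_bias, per_group)
```

In this toy case the global bias is zero even though each group is off by two units in opposite directions, so only the per-group correction has anything to remove. The per-group std is the part of the error that no constant shift can fix.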
2. Short-term drift (recent_drift)
Definition, computed at inference time from recent ground truth labels:
recent_drift = y_train_tail_mean − recent_mean
recent_mean = mean( y_true over last N days )
where y_train_tail_mean is the mean of y over the last window of the training data (an anchor representing the model’s implicit operating point), and recent_mean is the mean of fresh labels observed just before the current prediction.
The sign is such that, if the process has drifted up since training, recent_drift is negative and y_pred -= recent_drift raises the prediction — the model is effectively pulled toward the new operating point.
Key properties:
- Requires recent labels. This is the cost. For problems where y is never observed post-hoc, short-term drift cannot be estimated this way (fall back to covariate-shift methods on X).
- Lookback window N is a bias–variance knob. Small N → reactive but noisy. Large N → stable but slow to catch a real shift. Typical starting point: one to two process cycles of the underlying dynamics.
- Anchored to training, not to a moving mean. Using y_train_tail_mean as the anchor ensures the correction is zero by construction at the training/inference boundary, so deploying a freshly trained model does not immediately apply a spurious drift term.
- Can be combined with, or replaced by, a rolling buffer of features. If the drift is better explained by observable X changes, update X features and skip this term.
# inference, with a rolling window of fresh labels
y_tail = y_train_tail_mean[group] # saved at training
recent_mean = recent_labels[-N:].mean() # from the field
recent_drift = y_tail - recent_mean
y_pred -= recent_drift
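The rolling window above can be kept as a small stateful object. The following is a sketch only: the class name `DriftCorrector` is hypothetical, the window counts the last N labels rather than N days, and returning zero drift when no labels have arrived is an assumed fallback.

```python
from collections import deque

class DriftCorrector:
    """Short-term drift term for one group, anchored to the training tail mean."""

    def __init__(self, y_train_tail_mean, n_recent):
        self.anchor = float(y_train_tail_mean)
        self.recent = deque(maxlen=n_recent)  # keeps only the last n_recent labels

    def observe(self, y_true):
        """Record one fresh ground-truth label from the field."""
        self.recent.append(float(y_true))

    def recent_drift(self):
        if not self.recent:  # no fresh labels yet: apply no correction
            return 0.0
        recent_mean = sum(self.recent) / len(self.recent)
        return self.anchor - recent_mean

    def correct(self, y_pred):
        return y_pred - self.recent_drift()

corrector = DriftCorrector(y_train_tail_mean=100.0, n_recent=3)
before = corrector.correct(98.0)        # no labels yet, prediction unchanged
for label in [104.0, 106.0, 105.0]:     # the process has drifted up by about 5
    corrector.observe(label)
drift = corrector.recent_drift()        # 100 - 105 = -5
after = corrector.correct(98.0)         # 98 - (-5) = 103
print(before, drift, after)
```

Because the process moved up, the drift term is negative and subtracting it raises the prediction, matching the sign convention described above.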
3. Why the two terms are complementary
They correct different error sources and neither subsumes the other:
| Term | Error source | Estimated from | Changes over time? |
|---|---|---|---|
| residual_bias | fitted model tilt | training predictions | no (frozen at training) |
| recent_drift | target distribution shift | recent ground truth | yes (updates each inference) |
A model can be perfectly bias-free (residual_bias ≈ 0) and still need recent_drift if the world moved. Conversely, a stationary process still needs residual_bias if the model class systematically mis-fits certain groups.
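A toy comparison makes the complementarity concrete. This is a sketch under strong assumptions: the `predict` function is a stand-in for a fitted model that always over-shoots by 1.5, the process drops by exactly 4 after training, and the recent labels cover the same inputs as the new predictions.

```python
import numpy as np

def mae(pred, truth):
    return float(np.mean(np.abs(pred - truth)))

def predict(x):
    """Hypothetical fitted model: systematically over-shoots by 1.5."""
    return x + 1.5

x_train = np.arange(10.0)
y_train = x_train.copy()              # training-era process: y = x
x_recent = np.arange(5.0, 10.0)
y_recent = x_recent - 4.0             # fresh labels: process shifted down by 4
x_new, y_new = x_recent, y_recent     # toy simplification: same operating region

residual_bias = float(np.mean(predict(x_train) - y_train))    # 1.5
y_train_tail_mean = float(y_train[-5:].mean())                # 7.0
recent_drift = y_train_tail_mean - float(y_recent.mean())     # 7 - 3 = 4

raw = predict(x_new)
errors = {
    "raw": mae(raw, y_new),
    "bias_only": mae(raw - residual_bias, y_new),
    "drift_only": mae(raw - recent_drift, y_new),
    "both": mae(raw - residual_bias - recent_drift, y_new),
}
print(errors)
```

Each term removes only its own share of the error in this setup, and the error vanishes only when both are applied. Real data will not separate this cleanly, since the recent window is noisy and the inputs rarely match the training tail exactly.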
4. Sanity checks before deployment
- Residual std ≫ |residual_bias|. If the bias is tiny compared to the residual std, the correction is cosmetic: expect no R² gain. If it is comparable, expect a visible MAE improvement.
- Leave-one-group-out. Compute residual_bias on groups A∪B, test on group C. If correction improves C, it generalises; if it hurts, the bias is not a stable group property.
- Ablate recent_drift on a stable period. During a window where y is known to be stationary, recent_drift should fluctuate around zero. Persistent non-zero values indicate the anchor (y_train_tail_mean) is wrong or the lookback is too short.
- Cap the correction. A good defensive trick is to clip residual_bias and recent_drift to a multiple of the training residual std (e.g. ±2σ). This prevents a pathological group or a bad label batch from blowing up predictions.
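The first two checks can be scripted in a few lines. The sketch below is illustrative: the function names `bias_to_noise` and `leave_one_group_out_gain` are hypothetical, the data are made up, and MAE is assumed as the comparison metric.

```python
import numpy as np

def bias_to_noise(resid):
    """|mean residual| / residual std. Near 0: the correction is cosmetic."""
    resid = np.asarray(resid, dtype=float)
    return float(abs(resid.mean()) / resid.std())

def leave_one_group_out_gain(preds, y_true, groups, held_out):
    """MAE on the held-out group before minus after applying the bias
    estimated on all other groups. Positive means the correction helped."""
    preds, y_true, groups = map(np.asarray, (preds, y_true, groups))
    test = groups == held_out
    bias = (preds[~test] - y_true[~test]).mean()
    mae_before = np.abs(preds[test] - y_true[test]).mean()
    mae_after = np.abs(preds[test] - bias - y_true[test]).mean()
    return float(mae_before - mae_after)

y_true = np.array([10.0, 11.0, 20.0, 21.0, 30.0, 31.0])
preds = np.array([11.0, 12.5, 21.5, 22.0, 31.0, 32.0])
groups = np.array(["A", "A", "B", "B", "C", "C"])

ratio = bias_to_noise(preds - y_true)
gain_c = leave_one_group_out_gain(preds, y_true, groups, held_out="C")
print(ratio, gain_c)
```

Here the bias learned on A and B (1.25) is close to the bias of C (1.0), so the held-out MAE improves. A negative gain would suggest the bias is not a stable property across groups.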
5. When this pattern is the wrong tool
- Drift manifests as a scale change, not a mean shift. Subtracting a constant does nothing; you need a multiplicative correction or recalibration (e.g. isotonic regression on recent points).
- Drift is non-linear in a feature (e.g. accuracy degrades only at high temperatures). A global constant hides the real failure mode and can make worst-case predictions worse. Investigate segment-level errors first.
- Ground truth labels are not available soon enough. recent_drift needs labels with latency shorter than the drift timescale; otherwise the correction chases the past.
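For the scale-change case, one possible multiplicative correction is a single scale factor fitted by least squares through the origin on recent points. This is a swapped-in stand-in chosen for brevity, not the isotonic recalibration mentioned above; `fit_scale` is a hypothetical helper and the data are made up.

```python
import numpy as np

def fit_scale(recent_preds, recent_labels):
    """Least-squares scale s minimising sum((s * pred - y) ** 2), no intercept."""
    p = np.asarray(recent_preds, dtype=float)
    y = np.asarray(recent_labels, dtype=float)
    return float(np.dot(p, y) / np.dot(p, p))

recent_preds = np.array([10.0, 20.0, 30.0])
recent_labels = np.array([12.0, 24.0, 36.0])   # process now reads 20% higher

# Additive correction: subtract the mean over-prediction (here -4).
offset = float(np.mean(recent_preds - recent_labels))
additive_err = (recent_preds - offset) - recent_labels   # errors remain at both ends

# Multiplicative correction: rescale predictions instead.
scale = fit_scale(recent_preds, recent_labels)
scaled_err = scale * recent_preds - recent_labels
print(additive_err, scale, scaled_err)
```

The constant offset centres the errors but leaves the low and high ends wrong in opposite directions, while the scale factor removes the error across the range in this idealised example.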
6. TL;DR
- residual_bias: a per-group scalar fitted at training time that absorbs the model’s systematic over-prediction. Cheap, safe, always worth computing.
- recent_drift: a per-group scalar re-estimated at inference time from recent labels that absorbs target non-stationarity. Powerful, but only when fresh labels exist and the drift is primarily a mean shift.
- Final inference rule (Convention B, over-prediction is positive): y_pred_final = y_pred − residual_bias − recent_drift
- Treat the two terms as orthogonal; monitor them separately; clip both.
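Putting the rule and the clipping advice together, a per-group inference helper might look like the sketch below. The function name, the dictionary layout, the group key "line_1", and the defaults n=30 and k=2.0 are all assumptions made for illustration.

```python
def correct_prediction(y_pred, group, residual_bias, y_train_tail_mean,
                       recent_labels, residual_std, n=30, k=2.0):
    """y_pred_final = y_pred - residual_bias - recent_drift, with both terms clipped."""
    def clip(value, limit):
        return max(-limit, min(limit, value))

    limit = k * residual_std[group]              # cap at k training residual stds
    bias = clip(residual_bias[group], limit)

    window = recent_labels.get(group, [])[-n:]   # last n fresh labels for this group
    if window:
        drift = y_train_tail_mean[group] - sum(window) / len(window)
    else:
        drift = 0.0                              # no fresh labels: skip the drift term
    drift = clip(drift, limit)

    return y_pred - bias - drift

# Hypothetical per-group state saved at training time and updated in the field.
residual_bias = {"line_1": 0.8}
residual_std = {"line_1": 1.0}
y_train_tail_mean = {"line_1": 50.0}
recent_labels = {"line_1": [51.0, 52.0, 53.0, 54.0]}

final = correct_prediction(55.0, "line_1", residual_bias, y_train_tail_mean,
                           recent_labels, residual_std, n=2, k=2.0)
no_labels = correct_prediction(55.0, "line_1", residual_bias, y_train_tail_mean,
                               {}, residual_std)
print(final, no_labels)
```

In the first call the raw drift is 50 − 53.5 = −3.5, which the ±2σ cap reduces to −2.0, so the prediction is raised by 2.0 and lowered by the 0.8 bias. Keeping the two terms as separate variables also makes it straightforward to log and monitor them independently.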
