Post-hoc Prediction Correction: Long-Term Bias & Short-Term Drift

A trained regressor that performs well on historical data often degrades in production for two distinct reasons:

Systematic model error. The model consistently over- or under-shoots on certain sub-populations, even on the training set itself.
Target non-stationarity. The physical process generating y changes after the training cutoff — a sensor ages, a recipe is retuned, a market regime shifts. The model cannot know this.

Retraining is the principled fix for both, but it is expensive and sometimes impossible (no fresh labels, frozen model in a regulated pipeline). A cheap, production-friendly alternative is a two-term post-hoc correction:

y_pred_final = y_pred − residual_bias − recent_drift
                       └─ long-term ─┘ └ short-term ┘

1. Long-term bias (residual_bias)

Definition, computed once at the end of training on the training set itself:

residual_bias = mean( y_pred − y_true )        # over-prediction

If residual_bias > 0 the model is, on average, over-shooting; subtracting it re-centres future predictions onto the training-era y mean.

Key properties:

Sub-population aware. Compute one residual_bias per meaningful group (product, channel, segment). A global constant washes out local biases.
Uses training-set predictions, not a held-out split. It measures the fitted model’s systematic tilt, which is exactly what will be carried into inference. A validation residual is noisier and estimates a slightly different quantity.
Non-invasive with respect to the model. Applied after model.predict, never during training, so it does not change gradients or feature ranking.
Free variance floor. The residual std reported alongside it tells you how much of the error is not systematic — what correction cannot fix.

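import numpy as np

# training: one bias per group g
# (groups_train is assumed to hold the group label of each training row)
preds_train = model.predict(X_train)
residual_bias = {}
for g in np.unique(groups_train):
    g_mask = groups_train == g
    residual_bias[g] = (preds_train[g_mask] - y_train[g_mask]).mean()

# inference (groups_new is assumed to hold the group label of each new row)
y_pred = model.predict(X_new)
y_pred = y_pred - np.array([residual_bias[g] for g in groups_new])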
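
The per-group recipe above can be sketched end to end on toy numbers. This is only an illustration under stated assumptions: the helper names fit_residual_bias and apply_residual_bias, the toy arrays, and the zero fallback for groups never seen in training are all hypothetical choices, not part of any particular pipeline.

```python
import numpy as np

def fit_residual_bias(y_pred_train, y_train, groups):
    """Return {group: (bias, residual_std)} from training-set residuals."""
    y_pred_train = np.asarray(y_pred_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    groups = np.asarray(groups)
    stats = {}
    for g in np.unique(groups).tolist():
        # residual > 0 means the model over-predicts for this group
        resid = y_pred_train[groups == g] - y_train[groups == g]
        stats[g] = (float(resid.mean()), float(resid.std()))
    return stats

def apply_residual_bias(y_pred, groups, stats, default_bias=0.0):
    """Subtract each row's group bias; unseen groups get default_bias (assumption)."""
    labels = np.asarray(groups).tolist()
    bias = np.array([stats.get(g, (default_bias, 0.0))[0] for g in labels])
    return np.asarray(y_pred, dtype=float) - bias

# toy training data: group "a" is over-predicted by 1, group "b" under-predicted by 1
y_train = [10.0, 12.0, 20.0, 22.0]
preds_train = [11.5, 12.5, 19.5, 20.5]
groups_train = ["a", "a", "b", "b"]
stats = fit_residual_bias(preds_train, y_train, groups_train)

# inference on three new rows; group "c" was never seen, so it is left uncorrected
corrected = apply_residual_bias([11.0, 19.0, 30.0], ["a", "b", "c"], stats)
```

Returning the residual std next to the bias keeps the "variance floor" available for the later sanity checks and for clipping.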

2. Short-term drift (recent_drift)

Definition, computed at inference time from recent ground truth labels:

recent_drift = y_train_tail_mean − recent_mean
recent_mean  = mean( y_true over last N days )

where y_train_tail_mean is the mean of y over the last window of the training data (an anchor representing the model’s implicit operating point), and recent_mean is the mean of fresh labels observed just before the current prediction.

The sign is such that, if the process has drifted up since training, recent_drift is negative and y_pred -= recent_drift raises the prediction — the model is effectively pulled toward the new operating point.

Key properties:

Requires recent labels. This is the cost. For problems where y is never observed post-hoc, short-term drift cannot be estimated this way (fall back to covariate-shift methods on X).
Lookback window N is a bias–variance knob. Small N → reactive but noisy. Large N → stable but slow to catch a real shift. Typical starting point: one to two process cycles of the underlying dynamics.
Anchored to training, not to a moving mean. Using y_train_tail_mean as the anchor ensures the correction is zero by construction at the training/inference boundary, so deploying a freshly trained model does not immediately apply a spurious drift term.
Can be combined with, or replaced by, a rolling buffer of features. If the drift is better explained by observable X changes, update X features and skip this term.

# inference, with a rolling window of fresh labels
y_tail      = y_train_tail_mean[group]              # saved at training
recent_mean = recent_labels[-N:].mean()             # from the field
recent_drift = y_tail - recent_mean
y_pred -= recent_drift
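
To make the sign convention and the effect of the lookback window concrete, here is a small sketch on made-up numbers. The function name recent_drift, the choice to return 0.0 when no labels have arrived yet, and the toy label series are assumptions for illustration only.

```python
import numpy as np

def recent_drift(y_train_tail_mean, recent_labels, n):
    """Anchor minus the mean of the last n fresh labels; 0.0 if none exist yet."""
    if n < 1:
        raise ValueError("n must be at least 1")
    recent = np.asarray(recent_labels, dtype=float)[-n:]
    if recent.size == 0:
        return 0.0  # assumption: no labels means no correction
    return float(y_train_tail_mean - recent.mean())

y_tail = 100.0                          # saved at training time
fresh = [100.0, 101.0, 104.0, 106.0]    # process drifting upward after training

drift_short = recent_drift(y_tail, fresh, n=2)  # reactive: uses only 104 and 106
drift_long = recent_drift(y_tail, fresh, n=4)   # smoother: uses all four labels

y_pred = 100.0
y_pred_corrected = y_pred - drift_short  # negative drift raises the prediction
```

With the short window the drift is -5.0 and the prediction moves to 105.0; with the long window the drift is only -2.75, which illustrates the reactive-versus-stable trade-off of N.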

3. Why the two terms are complementary

They correct different error sources and neither subsumes the other:

Term	Error source	Estimated from	Changes over time?
`residual_bias`	fitted model tilt	training predictions	no (frozen at training)
`recent_drift`	target distribution shift	recent ground truth	yes (updates each inference)

A model can be perfectly bias-free (residual_bias ≈ 0) and still need recent_drift if the world moved. Conversely, a stationary process still needs residual_bias if the model class systematically mis-fits certain groups.
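
A tiny worked example may help show why neither term subsumes the other. The numbers are invented: a model that over-predicts by 2 at training time, and a process whose mean moves up by 5 after the training cutoff. The helper name correct is hypothetical.

```python
def correct(y_pred, residual_bias=0.0, recent_drift=0.0):
    """Apply the two-term rule: y_pred - residual_bias - recent_drift."""
    return y_pred - residual_bias - recent_drift

y_train_tail_mean = 50.0   # operating point at the end of training
raw_pred = 52.0            # model over-predicts the training-era level by 2
y_true_now = 55.0          # the process has moved up by 5 since training

residual_bias = raw_pred - y_train_tail_mean    # +2, frozen at training
recent_drift = y_train_tail_mean - y_true_now   # -5, from fresh labels

bias_only = correct(raw_pred, residual_bias=residual_bias)   # 50.0, still 5 too low
drift_only = correct(raw_pred, recent_drift=recent_drift)    # 57.0, still 2 too high
both = correct(raw_pred, residual_bias, recent_drift)        # 55.0, matches y_true_now
```

In this idealised case the two corrections add up exactly; with noisy labels or drift that is not a pure mean shift, the match would only be approximate.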

4. Sanity checks before deployment

Compare |residual_bias| with the residual std. If the bias is tiny compared to the residual std, the correction is cosmetic: expect no R² gain. If it is comparable, expect a visible MAE improvement.
Leave-one-group-out. Compute a pooled residual_bias on groups A∪B and apply it to held-out group C. If the correction improves C, it generalises; if it hurts, the bias is not a stable group property.
Check recent_drift on a stable period. During a window where y is known to be stationary, recent_drift should fluctuate around zero. Persistent non-zero values indicate the anchor (y_train_tail_mean) is wrong or the lookback is too short.
Cap the correction. A good defensive trick is to clip residual_bias and recent_drift to a multiple of the training residual std (e.g. ±2σ). This prevents a pathological group or a bad label batch from blowing up predictions.
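
The capping rule can be sketched in a few lines. The function name clip_correction and the example values are illustrative assumptions; k=2.0 mirrors the ±2σ example and is a starting point rather than a recommendation.

```python
import numpy as np

def clip_correction(value, residual_std, k=2.0):
    """Limit a correction term to the range [-k * residual_std, +k * residual_std]."""
    limit = k * residual_std
    return float(np.clip(value, -limit, limit))

residual_std = 1.5                                   # training residual std for this group

safe_bias = clip_correction(0.4, residual_std)       # within the cap, left unchanged
safe_drift = clip_correction(-10.0, residual_std)    # bad label batch, capped at -3.0
```

Clipping each term separately keeps one runaway term from consuming the whole correction budget; whether to cap the terms individually or their sum is a design choice this sketch does not settle.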

5. When this pattern is the wrong tool

Drift manifests as a scale change, not a mean shift. Subtracting a constant does nothing; you need a multiplicative correction or recalibration (e.g. isotonic regression on recent points).
Drift is non-linear in a feature (e.g. accuracy degrades only at high temperatures). A global constant hides the real failure mode and can make worst-case predictions worse. Investigate segment-level errors first.
Ground truth labels are not available soon enough. recent_drift needs labels with latency shorter than the drift timescale; otherwise the correction chases the past.
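
The scale-change case is easy to see on toy numbers. In the sketch below the true values are 1.5 times the predictions, so an additive shift fixes the mean but not the spread. Fitting the factor by least squares through the origin is one simple way to get a multiplicative correction and is an assumption here, as are the arrays themselves.

```python
import numpy as np

y_pred = np.array([10.0, 20.0, 30.0])   # raw model outputs on recent points
y_true = np.array([15.0, 30.0, 45.0])   # fresh labels: process scaled by 1.5

# additive correction: subtract the mean residual (the constant-shift pattern)
offset = (y_pred - y_true).mean()
shifted = y_pred - offset
mae_shift = np.abs(shifted - y_true).mean()

# multiplicative correction: least-squares scale factor through the origin
scale = (y_pred @ y_true) / (y_pred @ y_pred)
scaled = scale * y_pred
mae_scale = np.abs(scaled - y_true).mean()
```

Here the shifted predictions still miss the low and high ends by 5 each, while the scaled predictions recover the labels; on real data neither fit would be exact.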

6. TL;DR

residual_bias: a per-group scalar fitted at training time that absorbs the model’s systematic over-prediction. Cheap, safe, always worth computing.
recent_drift: a per-group scalar re-estimated at inference time from recent labels that absorbs target non-stationarity. Powerful, but only when fresh labels exist and the drift is primarily a mean shift.
Final inference rule (sign convention: over-prediction is positive): y_pred_final = y_pred − residual_bias − recent_drift
Treat the two terms as orthogonal; monitor them separately; clip both.
