Post-hoc Prediction Correction: Long-Term Bias & Short-Term Drift

A trained regressor that performs well on historical data often drifts in production for two distinct reasons:

  1. Systematic model error. The model consistently over- or under-shoots on certain sub-populations, even on the training set itself.
  2. Target non-stationarity. The physical process generating y changes after the training cutoff — a sensor ages, a recipe is retuned, a market regime shifts. The model cannot know this.

Retraining is the principled fix for both, but it is expensive and sometimes impossible (no fresh labels, frozen model in a regulated pipeline). A cheap, production-friendly alternative is a two-term post-hoc correction:

y_pred_final = y_pred − residual_bias − recent_drift
                       └─ long-term ─┘ └ short-term ┘

1. Long-term bias (residual_bias)

Definition, computed once at the end of training on the training set itself:

residual_bias = mean( y_pred − y_true )        # positive means over-prediction

If residual_bias > 0 the model is, on average, over-shooting; subtracting it re-centres future predictions onto the training-era y mean.

Key properties:

  • Sub-population aware. Compute one residual_bias per meaningful group (product, channel, segment). A global constant washes out local biases.
  • Uses training-set predictions, not a held-out split. It measures the fitted model’s systematic tilt, which is exactly what will be carried
    into inference. A validation residual is noisier and estimates a slightly different quantity.
  • Non-invasive with respect to the model. Applied after model.predict, never during training, so it does not change gradients or feature ranking.
  • Free variance floor. The residual std reported alongside it tells you how much of the error is not systematic, i.e. what the correction cannot fix.

# training (repeat per group g; g_mask is the boolean row mask of group g)
preds_train = model.predict(X_train)
residual_bias[g] = (preds_train[g_mask] - y_train[g_mask]).mean()

# inference (X_new holds rows from a single group)
y_pred = model.predict(X_new)
y_pred -= residual_bias[group_of(X_new)]
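
To make the per-group bookkeeping concrete, here is a minimal NumPy sketch. The helper names fit_residual_bias and apply_residual_bias are hypothetical, the toy numbers are invented for illustration, and falling back to a zero correction for groups never seen in training is an assumption rather than something the method prescribes.

```python
import numpy as np

def fit_residual_bias(y_pred, y_true, groups):
    """Return {group: (bias, residual_std)} from training-set predictions."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    groups = np.asarray(groups)
    stats = {}
    for g in np.unique(groups).tolist():
        resid = y_pred[groups == g] - y_true[groups == g]
        # np.std uses ddof=0 (population std) by default
        stats[g] = (float(resid.mean()), float(resid.std()))
    return stats

def apply_residual_bias(y_pred, groups, stats, default=0.0):
    """Subtract each row's group bias; unseen groups get `default` (assumption)."""
    y_pred = np.asarray(y_pred, dtype=float)
    bias = np.array([stats.get(g, (default, None))[0]
                     for g in np.asarray(groups).tolist()])
    return y_pred - bias

# Toy data: group "a" is over-predicted by 1, group "b" under-predicted by 1.
y_train     = [10.0, 12.0, 20.0, 22.0]
preds_train = [11.0, 13.0, 19.0, 21.0]
groups      = ["a", "a", "b", "b"]
stats = fit_residual_bias(preds_train, y_train, groups)

# Inference on one row per group, plus a group "c" never seen in training.
corrected = apply_residual_bias([11.0, 19.0, 5.0], ["a", "b", "c"], stats)
```

Storing the residual std next to the bias costs nothing and is the quantity the sanity checks in section 4 rely on.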

2. Short-term drift (recent_drift)

Definition, computed at inference time from recent ground truth labels:

recent_drift = y_train_tail_mean  −  recent_mean
recent_mean  = mean( y_true over last N days )

where y_train_tail_mean is the mean of y over the last window of the training data (an anchor representing the model’s implicit operating
point), and recent_mean is the mean of fresh labels observed just before the current prediction.

The sign is such that, if the process has drifted up since training, recent_drift is negative and y_pred -= recent_drift raises the prediction — the model is effectively pulled toward the new operating point.

Key properties:

  • Requires recent labels. This is the cost. For problems where y is never observed post-hoc, short-term drift cannot be estimated this way
    (fall back to covariate-shift methods on X).
  • Lookback window N is a bias–variance knob. Small N → reactive but noisy. Large N → stable but slow to catch a real shift. Typical starting point: one to two process cycles of the underlying dynamics.
  • Anchored to training, not to a moving mean. Using y_train_tail_mean as the anchor ensures the correction is zero by construction at the
    training/inference boundary, so deploying a freshly trained model does not immediately apply a spurious drift term.
  • Can be combined with, or replaced by, a rolling buffer of features.
    If the drift is better explained by observable X changes, update X features and skip this term.

# inference, with a rolling window of fresh labels
y_tail       = y_train_tail_mean[group]             # saved at training
recent_mean  = recent_labels[-N:].mean()            # last N labels from the field
recent_drift = y_tail - recent_mean
y_pred -= recent_drift
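
A self-contained version of the same step, as a sketch for a single group. The function name recent_drift, the toy labels, and the choice to return zero when no fresh labels exist are assumptions made for illustration.

```python
import numpy as np

def recent_drift(y_train_tail_mean, recent_labels, n):
    """Training-tail anchor minus the mean of the last n fresh labels.

    Returns 0.0 when no fresh labels are available, so the correction
    becomes a no-op (an assumption, not a prescribed behaviour).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    window = np.asarray(recent_labels, dtype=float)[-n:]
    if window.size == 0:
        return 0.0
    return float(y_train_tail_mean - window.mean())

y_tail = 100.0                                  # saved at training
labels = [99.0, 101.0, 104.0, 106.0, 105.0]     # fresh labels, oldest first
drift = recent_drift(y_tail, labels, n=3)       # 100 - mean(104, 106, 105)

y_pred = np.array([100.0, 102.0])
y_corrected = y_pred - drift                    # process drifted up, so predictions rise
```

With these numbers the last three labels average 105, so the drift term is negative and subtracting it lifts both predictions by 5.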

3. Why the two terms are complementary

They correct different error sources and neither subsumes the other:

Term            Error source                Estimated from         Changes over time?
residual_bias   fitted model tilt           training predictions   no (frozen at training)
recent_drift    target distribution shift   recent ground truth    yes (updates each inference)

A model can be perfectly bias-free (residual_bias ≈ 0) and still need recent_drift if the world moved. Conversely, a stationary process still needs residual_bias if the model class systematically mis-fits certain groups.
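
A small deterministic example can make this concrete. Everything in it is made up for illustration: a hypothetical model that over-shoots by a constant 1, and a process that shifts up by 3 after training while the inputs stay in the same range as the training tail.

```python
import numpy as np

def model(x):
    """Hypothetical fitted model: right slope, but over-shoots by 1."""
    return 2.0 * x + 1.0

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = 2.0 * x_train                                   # process at training time

residual_bias = float((model(x_train) - y_train).mean())  # model tilt
y_train_tail_mean = float(y_train[-2:].mean())            # anchor from the training tail

# After deployment the process shifts up by 3 (same x range as the tail).
x_recent = np.array([3.0, 4.0])
y_recent = 2.0 * x_recent + 3.0
recent_drift = y_train_tail_mean - float(y_recent.mean())

x_new = np.array([3.5])
y_true_new = 2.0 * x_new + 3.0

raw        = model(x_new)
bias_only  = raw - residual_bias
drift_only = raw - recent_drift
both       = raw - residual_bias - recent_drift

for name, pred in [("raw", raw), ("bias only", bias_only),
                   ("drift only", drift_only), ("both", both)]:
    print(f"{name:10s} error = {float(pred[0] - y_true_new[0]):+.1f}")
```

In this toy setup each term removes only its own error source, and the prediction matches the shifted process only when both are applied. Real data will not separate this cleanly, since the recent-label mean also moves when the input mix changes.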

4. Sanity checks before deployment

  • Compare |residual_bias| with the residual std. If residual std ≫ |residual_bias|, the correction is cosmetic; expect no gain. If the two
    are comparable, expect a visible MAE improvement.
  • Leave-one-group-out. Compute residual_bias on groups A∪B, test on group C. If correction improves C, it generalises; if it hurts, the bias
    is not a stable group property.
  • Ablate recent_drift on a stable period. During a window where y is known to be stationary, recent_drift should fluctuate around zero.
    Persistent non-zero values indicate the anchor (y_train_tail_mean) is wrong or the lookback is too short.
  • Cap the correction. A good defensive trick is to clip residual_bias and recent_drift to a multiple of the training residual std (e.g.
    ±2σ). This prevents a pathological group or a bad label batch from blowing up predictions.
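
The first and last checks are easy to automate. The sketch below uses hypothetical helper names (bias_to_noise, clip_correction) and illustrative numbers; the multiplier k = 2 mirrors the ±2σ example above and should be tuned per problem.

```python
import numpy as np

def bias_to_noise(residual_bias, residual_std):
    """Ratio |bias| / std; values near 0 suggest the correction is cosmetic."""
    if residual_std == 0:
        return 0.0 if residual_bias == 0 else float("inf")
    return abs(residual_bias) / residual_std

def clip_correction(value, residual_std, k=2.0):
    """Clip a correction term to the band [-k * std, +k * std]."""
    limit = k * residual_std
    return float(np.clip(value, -limit, limit))

sigma = 1.5                                   # training residual std (illustrative)
ratio = bias_to_noise(0.03, sigma)            # tiny bias relative to noise
safe_bias  = clip_correction(0.8, sigma)      # inside the band, unchanged
safe_drift = clip_correction(-7.0, sigma)     # bad label batch, capped at -3.0
```

Clipping the two terms separately keeps one runaway estimate from consuming the whole correction budget.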

5. When this pattern is the wrong tool

  • Drift manifests as a scale change, not a mean shift. Subtracting a constant does nothing; you need a multiplicative correction or
    recalibration (e.g. isotonic regression on recent points).
  • Drift is non-linear in a feature (e.g. accuracy degrades only at high temperatures). A global constant hides the real failure mode and
    can make worst-case predictions worse. Investigate segment-level errors first.
  • Ground truth labels are not available soon enough. recent_drift needs labels with latency shorter than the drift timescale; otherwise
    the correction chases the past.
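
To illustrate the first case, here is a sketch with invented numbers in which the process is rescaled by 20% after training. The ratio-of-sums estimator is just one simple multiplicative correction, chosen here as an assumption; it is not the only option and not something the additive pattern above provides.

```python
import numpy as np

# Recent labelled points: the process now reads 20% higher than predicted.
pred_recent = np.array([10.0, 20.0, 40.0])
true_recent = 1.2 * pred_recent

offset = float((pred_recent - true_recent).mean())        # additive estimate
scale  = float(true_recent.sum() / pred_recent.sum())     # multiplicative estimate

# New predictions at the low and high ends of the range.
pred_new = np.array([5.0, 80.0])
true_new = 1.2 * pred_new

additive       = pred_new - offset
multiplicative = pred_new * scale

mae_additive       = float(np.abs(additive - true_new).mean())
mae_multiplicative = float(np.abs(multiplicative - true_new).mean())
```

The constant offset is right only near the middle of the recent range and leaves large errors at both ends, while the scale factor recovers the new readings in this idealised case.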

6. TL;DR

  • residual_bias: a per-group scalar fitted at training time that absorbs the model’s systematic over-prediction. Cheap, safe, always worth
    computing.
  • recent_drift: a per-group scalar re-estimated at inference time from recent labels that absorbs target non-stationarity. Powerful, but only
    when fresh labels exist and the drift is primarily a mean shift.
  • Final inference rule (sign convention: over-prediction is positive): y_pred_final = y_pred − residual_bias − recent_drift
  • Treat the two terms as orthogonal; monitor them separately; clip both.