{"id":6322,"date":"2026-04-18T02:35:18","date_gmt":"2026-04-18T07:35:18","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=6322"},"modified":"2026-04-18T03:12:06","modified_gmt":"2026-04-18T08:12:06","slug":"post-hoc-prediction-correction-long-term-bias-short-term-drift","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/post-hoc-prediction-correction-long-term-bias-short-term-drift-6322\/","title":{"rendered":"Post-hoc Prediction Correction: Long-Term Bias &#038; Short-Term Drift"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-1024x576.jpg\" alt=\"\" class=\"wp-image-6326\" style=\"width:800px\" srcset=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-1024x576.jpg 1024w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-300x169.jpg 300w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-768x432.jpg 768w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-1536x864.jpg 1536w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-2048x1152.jpg 2048w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-128x72.jpg 128w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A trained regressor that performs well on historical data often drifts in production for two distinct reasons:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Systematic model error.<\/strong> The model consistently over- or 
under-shoots on certain sub-populations, even on the training set itself.<\/li>\n\n\n\n<li><strong>Target non-stationarity.<\/strong> The physical process generating <code>y<\/code> changes after the training cutoff: a sensor ages, a recipe is retuned, a market regime shifts. The model cannot know this.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Retraining is the principled fix for both, but it is expensive and sometimes impossible (no fresh labels, frozen model in a regulated pipeline). A cheap, production-friendly alternative is a two-term <strong>post-hoc correction<\/strong>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\ny_pred_final = y_pred \u2212 residual_bias \u2212 recent_drift\n                       \u2514\u2500 long-term \u2500\u2518 \u2514 short-term \u2518\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">1. Long-term bias (residual_bias)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Definition, computed once at the end of training on the training set itself:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nresidual_bias = mean( y_pred \u2212 y_true )        # positive = over-prediction\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">If <code>residual_bias<\/code> is positive, the model is, on average, over-shooting; subtracting it re-centres future predictions onto the training-era <code>y<\/code> mean.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Key properties:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Sub-population aware.<\/strong> Compute one <code>residual_bias<\/code> per meaningful group (product, channel, segment). A global constant washes out local biases.<\/li>\n\n\n\n<li><strong>Uses training-set predictions, not a held-out split.<\/strong> It measures the <em>fitted model&#8217;s<\/em> systematic tilt, which is exactly what will be carried into inference. 
A validation residual is noisier and estimates a slightly different quantity.<\/li>\n\n\n\n<li><strong>Non-invasive with respect to the model.<\/strong> Applied after <code>model.predict<\/code>, never during training, so it does not change gradients or feature ranking.<\/li>\n\n\n\n<li><strong>Free variance floor.<\/strong> The residual std reported alongside it tells you how much of the error is <em>not<\/em> systematic, which is what the correction cannot fix.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n# training (per group g; g_mask selects the training rows of group g)\npreds_train  = model.predict(X_train)\nresidual_bias&#x5B;g] = (preds_train&#x5B;g_mask] - y_train&#x5B;g_mask]).mean()\n\n# inference (group_of maps new rows to their group)\ny_pred = model.predict(X_new)\ny_pred -= residual_bias&#x5B;group_of(X_new)]\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">2. Short-term drift (recent_drift)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Definition, computed at inference time from recent ground-truth labels:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nrecent_mean  = mean( y_true over last N days )\nrecent_drift = y_train_tail_mean  \u2212  recent_mean\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">where <code>y_train_tail_mean<\/code> is the mean of <code>y<\/code> over the last window of the training data (an anchor representing the model&#8217;s <em>implicit<\/em> operating point), and <code>recent_mean<\/code> is the mean of fresh labels observed just before the current prediction.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The sign is such that, if the process has drifted <strong>up<\/strong> since training, <code>recent_drift<\/code> is <strong>negative<\/strong> and <code>y_pred -= recent_drift<\/code> <em>raises<\/em> the prediction, so the model is effectively pulled toward the new operating point.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Key properties:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Requires recent labels.<\/strong> This is the cost. For problems where <code>y<\/code> is never observed post-hoc, short-term drift cannot be estimated this way (fall back to covariate-shift methods on <code>X<\/code>).<\/li>\n\n\n\n<li><strong>Lookback window <code>N<\/code> is a bias\u2013variance knob.<\/strong> Small <code>N<\/code> \u2192 reactive but noisy. Large <code>N<\/code> \u2192 stable but slow to catch a real shift. Typical starting point: one to two cycles of the underlying process.<\/li>\n\n\n\n<li><strong>Anchored to training, not to a moving mean.<\/strong> Using <code>y_train_tail_mean<\/code> as the anchor ensures the correction is zero <em>by construction<\/em> at the training\/inference boundary, provided the lookback window matches the training tail window, so deploying a freshly trained model does not immediately apply a spurious drift term.<\/li>\n\n\n\n<li><strong>Can be combined with, or replaced by, a rolling buffer of features.<\/strong> If the drift is better explained by observable <code>X<\/code> changes, update <code>X<\/code> features and skip this term.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n# inference, with a rolling window of fresh labels\ny_tail      = y_train_tail_mean&#x5B;group]              # saved at training\nrecent_mean = recent_labels&#x5B;-N:].mean()             # from the field\nrecent_drift = y_tail - recent_mean\ny_pred -= recent_drift\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">3. 
Why the two terms are complementary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">They correct different error sources and neither subsumes the other:<\/p>\n\n\n\n<figure style=\"padding-right:var(--wp--preset--spacing--40);padding-left:var(--wp--preset--spacing--40)\" class=\"wp-block-table\"><table><thead><tr><th>Term<\/th><th>Error source<\/th><th>Estimated from<\/th><th>Changes over time?<\/th><\/tr><\/thead><tbody><tr><td><code>residual_bias<\/code><\/td><td>fitted model tilt<\/td><td>training predictions<\/td><td>no (frozen at training)<\/td><\/tr><tr><td><code>recent_drift<\/code><\/td><td>target distribution shift<\/td><td>recent ground truth<\/td><td>yes (updated at each inference)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A model can be perfectly bias-free (<code>residual_bias \u2248 0<\/code>) and still need <code>recent_drift<\/code> if the world has moved. Conversely, a stationary process still needs <code>residual_bias<\/code> if the model class systematically mis-fits certain groups.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Sanity checks before deployment<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Compare residual std with |residual_bias|.<\/strong> If the bias is tiny compared to the residual std (std \u226b |residual_bias|), the correction is cosmetic: expect little to no <code>R\u00b2<\/code> gain. If it is comparable, expect a visible MAE improvement.<\/li>\n\n\n\n<li><strong>Leave-one-group-out.<\/strong> Compute <code>residual_bias<\/code> on groups A\u222aB, test on group C. 
If the correction improves C, it generalises; if it hurts, the bias is not a stable group property.<\/li>\n\n\n\n<li><strong>Check <code>recent_drift<\/code> on a stable period.<\/strong> During a window where <code>y<\/code> is known to be stationary, <code>recent_drift<\/code> should fluctuate around zero. Persistent non-zero values indicate the anchor (<code>y_train_tail_mean<\/code>) is wrong or the lookback is too short.<\/li>\n\n\n\n<li><strong>Cap the correction.<\/strong> A good defensive trick is to clip <code>residual_bias<\/code> and <code>recent_drift<\/code> to a multiple of the training residual std (e.g. <code>\u00b12\u03c3<\/code>). This prevents a pathological group or a bad label batch from blowing up predictions.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. When this pattern is the wrong tool<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Drift manifests as a <strong>scale<\/strong> change, not a mean shift. Subtracting a constant cannot fix this; you need a multiplicative correction or recalibration (e.g. isotonic regression on recent points).<\/li>\n\n\n\n<li>Drift is <strong>non-linear in a feature<\/strong> (e.g. accuracy degrades only at high temperatures). A global constant hides the real failure mode and can make worst-case predictions worse. Investigate segment-level errors first.<\/li>\n\n\n\n<li>Ground-truth labels are <strong>not available soon enough<\/strong>. <code>recent_drift<\/code> needs labels with latency shorter than the drift timescale; otherwise the correction chases the past.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">6. TL;DR<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>residual_bias<\/code>: a per-group scalar fitted at training time that absorbs the model&#8217;s systematic over- or under-prediction. Cheap, safe, always worth computing.<\/li>\n\n\n\n<li><code>recent_drift<\/code>: a per-group scalar re-estimated at inference time from recent labels that absorbs target non-stationarity. 
Powerful, but only when fresh labels exist and the drift is primarily a mean shift.<\/li>\n\n\n\n<li>Final inference rule (sign convention: over-prediction is positive): <code>y_pred_final = y_pred \u2212 residual_bias \u2212 recent_drift<\/code><\/li>\n\n\n\n<li>Treat the two terms as orthogonal; monitor them separately; clip both.<\/li>\n<\/ul>\n<div style='text-align:center' class='yasr-auto-insert-overall'><\/div><div style='text-align:center' class='yasr-auto-insert-visitor'><\/div>","protected":false},"excerpt":{"rendered":"<p>A trained regressor that performs well on historical data often drifts in production for two distinct reasons: Retraining is the principled fix for both, but it is expensive and sometimes impossible (no fresh labels, frozen model in a regulated pipeline). A cheap, production-friendly alternative is a two-term post-hoc correction: 1. Long-term bias (residual_bias) Definition, computed&#8230;<\/p>\n","protected":false},"author":4,"featured_media":6326,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":3,"footnotes":""},"categories":[56,372,371],"tags":[],"class_list":["post-6322","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science-s
lug","category-test-slug","category-training-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":0,"sum_votes":0},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/202604-On-the-way-from-Houston-to-Austin-scaled.jpg","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=6322"}],"version-history":[{"count":5,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6322\/revisions"}],"predecessor-version":[{"id":6330,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6322\/revisions\/6330"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/6326"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=6322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=6322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=6322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}