{"id":6548,"date":"2026-05-03T01:36:08","date_gmt":"2026-05-03T06:36:08","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=6548"},"modified":"2026-05-03T11:07:30","modified_gmt":"2026-05-03T16:07:30","slug":"a-taxonomy-of-ml-model-failures-in-the-training-testing-gap","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/a-taxonomy-of-ml-model-failures-in-the-training-testing-gap-6548\/","title":{"rendered":"A Taxonomy of ML Model Failures in the Training-Testing Gap"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/03\/Irregular-Stone-Wall-Texture-800x600px.jpg\" alt=\"\" class=\"wp-image-6474\" style=\"width:600px\" srcset=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/03\/Irregular-Stone-Wall-Texture-800x600px.jpg 800w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/03\/Irregular-Stone-Wall-Texture-800x600px-300x225.jpg 300w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/03\/Irregular-Stone-Wall-Texture-800x600px-768x576.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Machine learning (ML) models are designed under the assumption that the training distribution P_train equals the deployment distribution P_test. In reality, this assumption breaks frequently, causing sharp accuracy drops in deployed systems. This post organizes these failures into a clean taxonomy and summarizes practical mitigation strategies.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. 
Hierarchy of ML Failures from train \u2260 test<\/h2>\n\n\n<style>.kadence-column6548_41766d-0a > .kt-inside-inner-col,.kadence-column6548_41766d-0a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_41766d-0a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_41766d-0a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_41766d-0a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_41766d-0a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_41766d-0a{position:relative;}.kadence-column6548_41766d-0a, .kt-inside-inner-col > .kadence-column6548_41766d-0a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_41766d-0a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_41766d-0a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_41766d-0a\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">1.1 Two Orthogonal Axes \u2014 Why Not One Tree<\/h3>\n\n\n<style>.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col,.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_86ac0f-f0{position:relative;}.kadence-column6548_86ac0f-f0, .kt-inside-inner-col > 
.kadence-column6548_86ac0f-f0:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_86ac0f-f0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_86ac0f-f0\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">The &#8220;train \u2260 test&#8221; situation cannot be cleanly classified under a single hierarchical tree (Moreno-Torres 2012). Two widely used frameworks answer different questions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first axis, <strong>Distribution Shift<\/strong> (synonymous with Dataset Shift in the literature), is a <strong>population-level<\/strong> concept (Qui\u00f1onero-Candela 2009). It asks: &#8220;When we compare the entire training set with the entire deployment set, can both be considered samples from the same statistical population?&#8221; Here, &#8220;comparing the entire set&#8221; means examining overall statistical properties \u2014 means, variances, class ratios, feature histograms \u2014 not individual samples. For example, if 10,000 training images and 10,000 deployment images differ in average brightness, class proportions, or histogram shapes of certain features, distribution shift has occurred.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The second axis, <strong>Out-of-Distribution (OOD) Detection<\/strong>, is a <strong>sample-level<\/strong> concept (Yang 2021). 
It asks: &#8220;Is this single sample plausibly drawn from the training distribution?&#8221; The judgment can be made for even a single point.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By analogy, distribution shift asks &#8220;Has the water quality of this river changed since last year?&#8221;, while OOD asks &#8220;Is the water in this cup actually drawn from this river?&#8221; The two questions are related but fundamentally distinct. Therefore, rather than forcing them into a parent-child tree, treating them as <strong>two complementary axes<\/strong> is more accurate.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.2 Hierarchy Diagram<\/h3>\n\n\n<style>.kadence-column6548_aaa22a-82 > .kt-inside-inner-col,.kadence-column6548_aaa22a-82 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_aaa22a-82 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_aaa22a-82 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_aaa22a-82 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_aaa22a-82 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_aaa22a-82{position:relative;}.kadence-column6548_aaa22a-82, .kt-inside-inner-col > .kadence-column6548_aaa22a-82:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_aaa22a-82 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_aaa22a-82 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_aaa22a-82\"><div class=\"kt-inside-inner-col\">\n<pre style=\"font-family: Consolas,monospace; font-size: 1.3rem; white-space: pre; line-height:1.2; background-color: #fff; border: none\">\nML Failures from P_train \u2260 
P_test\n\u2502\n\u251c\u2500\u2500 [Axis 1] Distribution Shift (= Dataset Shift)\n\u2502   \u2502   Population-level: overall statistical comparison\n\u2502   \u2502\n\u2502   \u251c\u2500\u2500 Covariate Shift  : P(x) changes, P(y|x) stays stable\n\u2502   \u251c\u2500\u2500 Label Shift      : P(y) changes, P(x|y) stays stable\n\u2502   \u2514\u2500\u2500 Concept Drift    : P(y|x) changes (input-output relation shifts)\n\u2502\n\u2514\u2500\u2500 [Axis 2] Out-of-Distribution (OOD) Detection\n    \u2502   Sample-level: is a single sample inside the support of P_train?\n    \u2502\n    \u251c\u2500\u2500 Semantic OOD     : a class unseen during training appears\n    \u251c\u2500\u2500 Non-semantic OOD : same class but different domain or style\n    \u2514\u2500\u2500 Anomaly Detection: no class concept, only \"normal vs abnormal\"\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What is Support?<\/strong> The support of a Probability Density Function (PDF) or Probability Mass Function (PMF) is the set of input values (x) where the function takes a non-zero value. In simple terms, it is &#8220;the region where data can actually exist&#8221; or &#8220;the range where there is at least some probability of data occurring.&#8221; For example, if the training input range is [0, 1], then the support of P_train(x) is [0, 1]. 
If an input of 1.2 arrives at inference time, that sample is outside the support, i.e., OOD.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.3 The Two Axes Can Overlap<\/h3>\n\n\n<style>.kadence-column6548_caa406-6c > .kt-inside-inner-col,.kadence-column6548_caa406-6c > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_caa406-6c > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_caa406-6c > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_caa406-6c > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_caa406-6c > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_caa406-6c{position:relative;}.kadence-column6548_caa406-6c, .kt-inside-inner-col > .kadence-column6548_caa406-6c:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_caa406-6c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_caa406-6c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_caa406-6c\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">The two axes are not perfectly disjoint. If covariate shift is severe enough that part of P_test(x)&#8217;s support falls outside P_train(x)&#8217;s support, the samples in that region are effectively OOD. Some authors (Yang 2021) keep OOD as a separate axis from distribution shift; others view OOD as an extreme case of covariate shift. Either view is defensible; choose the framing that matches your monitoring needs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In real deployments, multiple shift types often co-occur as <strong>compound shift<\/strong>. 
The COVID-19 pandemic, for example, simultaneously triggered behavioral changes (covariate shift), disease prevalence changes (label shift), and the breakdown of relationships such as &#8220;frequent business traveler = low credit risk&#8221; (concept drift) (Gama 2014).<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">2. Probability Distributions and the Stable\/Shift Comparison Table<\/h2>\n\n\n<style>.kadence-column6548_6d6da2-03 > .kt-inside-inner-col,.kadence-column6548_6d6da2-03 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_6d6da2-03 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_6d6da2-03 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_6d6da2-03 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_6d6da2-03 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_6d6da2-03{position:relative;}.kadence-column6548_6d6da2-03, .kt-inside-inner-col > .kadence-column6548_6d6da2-03:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_6d6da2-03 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_6d6da2-03 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_6d6da2-03\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">2.1 The Four Probability Distributions<\/h3>\n\n\n<style>.kadence-column6548_ecd583-a0 > .kt-inside-inner-col,.kadence-column6548_ecd583-a0 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_ecd583-a0 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 
1rem);}.kadence-column6548_ecd583-a0 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_ecd583-a0 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_ecd583-a0 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_ecd583-a0{position:relative;}.kadence-column6548_ecd583-a0, .kt-inside-inner-col > .kadence-column6548_ecd583-a0:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_ecd583-a0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_ecd583-a0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_ecd583-a0\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">$P(x)$ is the marginal distribution of inputs \u2014 how the input $x$ is distributed regardless of labels. In image classification this is the overall pixel value distribution; in medical data it is the patient feature distribution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">$P(y)$ is the marginal distribution of labels \u2014 the proportion each class occupies in the data. In spam classification, ratios such as $P(\\text{spam})=0.2$, $P(\\text{ham})=0.8$ are $P(y)$.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">$P(y|x)$ is the conditional distribution of the label given the input \u2014 &#8220;the probability that this input belongs to class $y$.&#8221; This is typically what classification models learn directly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">$P(x|y)$ is the conditional distribution of the input given the class \u2014 &#8220;the typical appearance of each class.&#8221; For example, $P(\\text{image} \\mid \\text{cat})$ describes the distribution of visual features in cat images. 
Although it appears reverse to everyday intuition, $P(x|y)$ is critically important; this is discussed further in the appendix.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2.2 Stable\/Shift Comparison Table<\/h3>\n\n\n<style>.kadence-column6548_37019d-bc > .kt-inside-inner-col,.kadence-column6548_37019d-bc > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_37019d-bc > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_37019d-bc > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_37019d-bc > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_37019d-bc > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_37019d-bc{position:relative;}.kadence-column6548_37019d-bc, .kt-inside-inner-col > .kadence-column6548_37019d-bc:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_37019d-bc > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_37019d-bc > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_37019d-bc\"><div class=\"kt-inside-inner-col\">\n<figure class=\"wp-block-table\"><table><thead><tr><th>Shift Type<\/th><th>P(x)<\/th><th>P(y)<\/th><th>P(x|y)<\/th><th>P(y|x)<\/th><th>Representative Example<\/th><\/tr><\/thead><tbody><tr><td><strong>Covariate shift<\/strong><\/td><td>Shift<\/td><td>\u2014<\/td><td>\u2014<\/td><td>Stable<\/td><td>Daytime-trained autonomous driving meets night images<\/td><\/tr><tr><td><strong>Label shift<\/strong><\/td><td>\u2014<\/td><td>Shift<\/td><td>Stable<\/td><td>\u2014<\/td><td>Pre-pandemic disease classifier used during a pandemic<\/td><\/tr><tr><td><strong>Concept 
drift<\/strong><\/td><td>Stable<\/td><td>Stable<\/td><td>\u2014<\/td><td>Shift<\/td><td>Spammers change tactics to evade the filter<\/td><\/tr><tr><td><strong>Semantic OOD<\/strong><\/td><td>Shift<\/td><td>Shift<\/td><td>N\/A<\/td><td>N\/A<\/td><td>Car image fed to a dog\/cat classifier<\/td><\/tr><tr><td><strong>Non-semantic OOD<\/strong><\/td><td>Shift<\/td><td>Stable<\/td><td>Shift<\/td><td>Stable<\/td><td>Cartoon cat fed to a photo-trained cat classifier<\/td><\/tr><tr><td><strong>Anomaly<\/strong><\/td><td>Shift<\/td><td>N\/A<\/td><td>N\/A<\/td><td>N\/A<\/td><td>Fraud detection on a model trained on normal transactions<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">&#8220;\u2014&#8221; indicates a distribution that is automatically determined by the others (e.g., in covariate shift, $P(y)$ moves as a side effect of $P(x)$ moving but is not part of the definition). &#8220;N\/A&#8221; indicates a case where the distribution is not even well-defined under the training setup (e.g., $P(y)$ for a brand-new class in semantic OOD, or $P(y)$ when $y$ is absent in anomaly detection). &#8220;Stable\/Shift&#8221; is not a hard standard term but is commonly used in the distribution shift literature.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because of the Bayes identity $P(x, y) = P(x)P(y|x) = P(y)P(x|y)$, the four distributions cannot be independently fixed. 
Each shift type is therefore typically defined by <strong>which pair of terms moves together<\/strong>.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2.3 Detailed Examples<\/h3>\n\n\n<style>.kadence-column6548_7b633e-3b > .kt-inside-inner-col,.kadence-column6548_7b633e-3b > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_7b633e-3b > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_7b633e-3b > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_7b633e-3b > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_7b633e-3b > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_7b633e-3b{position:relative;}.kadence-column6548_7b633e-3b, .kt-inside-inner-col > .kadence-column6548_7b633e-3b:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_7b633e-3b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_7b633e-3b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_7b633e-3b\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Covariate shift \u2014 autonomous driving:<\/strong> A vehicle\/pedestrian recognition model trained on daytime driving images is used at night. The pixel distribution $P(x)$ has clearly changed (darker images, headlight glare, different color channel statistics). However, the relation $P(y|x)$ \u2014 &#8220;this pixel pattern is a pedestrian&#8221; \u2014 remains stable. 
A pedestrian in dim light is still a pedestrian.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Label shift \u2014 medical diagnosis:<\/strong> A disease classifier trained on pre-pandemic emergency room data is used during the COVID-19 pandemic. Class ratios shifted dramatically (e.g., $P(\\text{flu})=0.05$, $P(\\text{COVID})=0.001$ at training vs. $P(\\text{flu})=0.02$, $P(\\text{COVID})=0.4$ at deployment). Yet $P(\\text{symptoms} \\mid \\text{COVID})$ \u2014 what a COVID patient looks like \u2014 is unchanged. Same virus, same symptoms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Concept drift \u2014 spam filter:<\/strong> The phrase &#8220;Receive a free gift card&#8221; was clearly spam five years ago, but today it may be a legitimate marketing message. The text distribution $P(x)$ is similar; what changed is $P(y|x)$, the criterion for &#8220;is this spam?&#8221; This is the trickiest shift \u2014 input alone provides no signal, and new labels are required (Gama 2014).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Semantic OOD \u2014 new category:<\/strong> A car image fed into a dog\/cat classifier. Cars were never seen during training, so $P(y=\\text{car})$, $P(x|y=\\text{car})$, and $P(y=\\text{car}|x)$ are essentially undefined under the training distribution. The model is forced to label it &#8220;dog&#8221; or &#8220;cat&#8221; \u2014 close to a random output.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Non-semantic OOD \u2014 domain shift:<\/strong> A cartoon cat fed to a classifier trained on photographic cats. The class (&#8220;cat&#8221;) is in-distribution, but the visual style differs. 
Humans recognize the cartoon as a cat, but the model \u2014 relying on textures and color tones from training \u2014 may fail.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Anomaly \u2014 fraud detection:<\/strong> A model trained on normal credit card transactions detects fraud as &#8220;points outside the normal distribution.&#8221; There is no explicit label $y$; the system only judges &#8220;does this transaction belong to $P(x)$ of normal traffic?&#8221; Isolation Forest (Liu 2008), One-Class SVM, and autoencoder reconstruction error are representative methods.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">3. Distribution Shift<\/h2>\n\n\n<style>.kadence-column6548_394cfe-9b > .kt-inside-inner-col,.kadence-column6548_394cfe-9b > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_394cfe-9b > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_394cfe-9b > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_394cfe-9b > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_394cfe-9b > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_394cfe-9b{position:relative;}.kadence-column6548_394cfe-9b, .kt-inside-inner-col > .kadence-column6548_394cfe-9b:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_394cfe-9b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_394cfe-9b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_394cfe-9b\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">3.1 Problem Definition<\/h3>\n\n\n<style>.kadence-column6548_800598-f9 > .kt-inside-inner-col,.kadence-column6548_800598-f9 > 
.kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_800598-f9 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_800598-f9 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_800598-f9 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_800598-f9 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_800598-f9{position:relative;}.kadence-column6548_800598-f9, .kt-inside-inner-col > .kadence-column6548_800598-f9:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_800598-f9 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_800598-f9 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_800598-f9\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>What changes.<\/strong> Distribution shift is defined as $P_{\\text{train}}(x, y) \\neq P_{\\text{test}}(x, y)$. Factoring the joint distribution in the two ways shown below, three subtypes emerge depending on which factor moves (Moreno-Torres 2012).<\/p>\n\n\n\n<p style=\"background-color: #fff; border: none\">$$P(x, y) = P(x)\\,P(y|x) = P(y)\\,P(x|y)$$<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Covariate shift:<\/strong> $P(x)$ changes while $P(y|x)$ stays stable. In principle, aligning or importance-weighting the input distribution can recover performance, provided the test inputs stay within the training support.<\/li>\n\n\n\n<li><strong>Label shift:<\/strong> $P(y)$ changes while $P(x|y)$ stays stable. A relatively light post-hoc re-weighting of output probabilities by the ratio of new to old class priors is often sufficient (Lipton 2018).<\/li>\n\n\n\n<li><strong>Concept drift:<\/strong> $P(y|x)$ itself changes. 
Without new labels, this cannot be solved at the root.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How it changes.<\/strong> Drift is also classified by temporal pattern (Gama 2014):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Sudden drift:<\/strong> A step change \u2014 e.g., a policy update or new system deployment.<\/li>\n\n\n\n<li><strong>Gradual drift:<\/strong> The new distribution slowly replaces the old; both coexist for a while.<\/li>\n\n\n\n<li><strong>Incremental drift:<\/strong> Tiny accumulated changes, such as sensor aging or slow user-preference shifts.<\/li>\n\n\n\n<li><strong>Recurring drift:<\/strong> Periodic changes such as seasonality or weekday patterns.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">3.2 Mitigation Strategies<\/h3>\n\n\n<style>.kadence-column6548_8920bd-41 > .kt-inside-inner-col,.kadence-column6548_8920bd-41 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_8920bd-41 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_8920bd-41 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_8920bd-41 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_8920bd-41 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_8920bd-41{position:relative;}.kadence-column6548_8920bd-41, .kt-inside-inner-col > .kadence-column6548_8920bd-41:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_8920bd-41 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_8920bd-41 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_8920bd-41\"><div class=\"kt-inside-inner-col\">\n<p 
class=\"wp-block-paragraph\"><strong>a) Feature Engineering.<\/strong> Design inputs robust to distributional change. Reconsider normalization: Min-Max normalization to [0, 1] is sensitive to extreme values, Z-score standardization is less so, and Robust Scaling (median and Inter-Quartile Range, IQR) is the most stable of the three when new extremes arrive at inference. Self-supervised pretraining (contrastive learning, masked modeling) tends to yield features that transfer better across domains. For tabular data, run per-feature drift monitors such as the Kolmogorov-Smirnov (KS) test or Population Stability Index (PSI).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>b) Label Engineering.<\/strong> Class re-weighting upweights minority classes during training to harden against label shift. Importance weighting reweights training samples by $P_{\\text{test}}(x) \/ P_{\\text{train}}(x)$ to address covariate shift. Active learning selects the most informative new samples for labeling, which gives a label-efficient response to concept drift.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>c) Training.<\/strong> Build robustness into the model. Data augmentation (noise, rotation, color jitter) is among the most widely used tools against $P(x)$ shifts. Domain Adversarial Neural Network (DANN, Ganin 2016) and CORrelation ALignment (CORAL, Sun 2016) align feature distributions across source and target domains. Domain Generalization (Wang 2022) trains across multiple domains so that the model generalizes to unseen ones. Adversarial training keeps outputs stable under small input perturbations. Batch \/ Layer Normalization built into the model partially compensates for shifts in input statistics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>d) Test (Inference).<\/strong> Respond dynamically post-deployment. Drift detectors include Adaptive WINdowing (ADWIN, Bifet 2007), the Page-Hinkley test, Cumulative SUM (CUSUM, Page 1954), and the KS test. 
Test-Time Adaptation (TTA) methods adapt the model using only unlabeled test data; Tent (Wang 2021), for example, re-estimates BatchNorm statistics and updates the BatchNorm affine parameters by minimizing prediction entropy. Black Box Shift Estimation (BBSE, Lipton 2018) estimates the new $P(y)$ from the classifier&#8217;s confusion matrix and its predictions on unlabeled test data; the resulting class ratios $P_{\\text{test}}(y) \/ P_{\\text{train}}(y)$ can then be used to rescale output probabilities post hoc or to reweight training samples.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>e) Continual \/ Online Learning.<\/strong> When shifts persist over time, the model cannot be trained once and left untouched. <em>Online learning<\/em> updates the model on each batch (or single sample) as data streams in, typically with a small learning rate. <em>Incremental learning<\/em> adds new data or new classes to an existing model while preserving past knowledge.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The main risk is <strong>catastrophic forgetting<\/strong>, that is, losing previously learned knowledge. Mitigations include a <strong>Replay Buffer<\/strong> (mix stored old samples into each new batch), <strong>Elastic Weight Consolidation<\/strong> (EWC, Kirkpatrick 2017), which penalizes changes to parameters important for past tasks via Fisher information, and <strong>Learning without Forgetting<\/strong> (LwF, Li 2017), which uses the previous model&#8217;s outputs as soft targets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, an MLOps pipeline typically triggers retraining periodically or upon drift detection, with a regression test on a fixed validation set and automatic rollback if performance degrades. A common pattern is multi-tier scheduling: nightly fine-tuning plus weekly full retraining.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
OOD (Out-of-Distribution)<\/h2>\n\n\n<style>.kadence-column6548_01220d-63 > .kt-inside-inner-col,.kadence-column6548_01220d-63 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_01220d-63 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_01220d-63 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_01220d-63 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_01220d-63 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_01220d-63{position:relative;}.kadence-column6548_01220d-63, .kt-inside-inner-col > .kadence-column6548_01220d-63:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_01220d-63 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_01220d-63 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_01220d-63\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">4.1 Problem Definition<\/h3>\n\n\n<style>.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col,.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_7c6bf3-0a{position:relative;}.kadence-column6548_7c6bf3-0a, .kt-inside-inner-col > .kadence-column6548_7c6bf3-0a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 
1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_7c6bf3-0a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_7c6bf3-0a\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>What changes.<\/strong> OOD is a per-sample problem (Yang 2021). A sample $x$ is OOD when it falls outside the support of $P_{\\text{train}}(x)$ or in a region where that density is very low.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Semantic OOD:<\/strong> a class never seen during training appears, requiring expansion of the label space. Novelty detection and open-set recognition fall here.<\/li>\n\n\n\n<li><strong>Non-semantic OOD:<\/strong> the class is in-distribution but the input style\/domain differs (textures, lighting, color tone). Examples include applying a photo-trained model to cartoons, or transferring a model trained on Hospital A&#8217;s CT machine to Hospital B&#8217;s machine.<\/li>\n\n\n\n<li><strong>Anomaly Detection:<\/strong> an unsupervised setting with no label $y$, focused on detecting low-density regions of $P_{\\text{train}}(x)$ (Pang 2021). 
Common in fraud detection, manufacturing defect detection, and intrusion detection.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">By distance from the training distribution, OOD is further split into <strong>Near-OOD<\/strong> (a slight deviation, e.g., CIFAR-10 \u2192 CIFAR-100) and <strong>Far-OOD<\/strong> (an entirely different distribution, e.g., CIFAR-10 \u2192 SVHN).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How it changes.<\/strong> OOD samples appear in different patterns: <em>intermittent OOD<\/em> mixed into normal flow (fraud, system faults), <em>sudden OOD spikes<\/em> when service expands to new user groups or regions, <em>gradual OOD increase<\/em> as the deployment distribution slowly drifts from training (essentially the cumulative result of covariate shift), and <em>adversarial OOD<\/em> where attackers craft inputs deliberately outside the training distribution.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">4.2 Mitigation Strategies<\/h3>\n\n\n<style>.kadence-column6548_64676c-11 > .kt-inside-inner-col,.kadence-column6548_64676c-11 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_64676c-11 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_64676c-11 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_64676c-11 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_64676c-11 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_64676c-11{position:relative;}.kadence-column6548_64676c-11, .kt-inside-inner-col > .kadence-column6548_64676c-11:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_64676c-11 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_64676c-11 > 
.kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_64676c-11\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>a) Feature Engineering.<\/strong> Design representations suited to OOD detection. Distance-based representation learning (metric learning, contrastive learning) places same-class samples close together and different-class samples far apart, so feature-space distance becomes an OOD signal. Uncertainty-aware architectures \u2014 Bayesian Neural Networks, Deep Ensembles, Monte Carlo (MC) Dropout \u2014 naturally surface OOD via predictive variance. Features from pre-trained foundation models tend to be more OOD-robust thanks to large-scale pretraining.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>b) Label Engineering.<\/strong> Use OOD samples as a learning signal. Outlier Exposure (Hendrycks 2019) shows the model an auxiliary OOD dataset during training and pushes its predictions on those samples toward the uniform distribution. Open-set learning explicitly adds an &#8220;unknown&#8221; class so the model has a reject option. Active learning with a human in the loop routes suspected OOD samples to human labelers, gradually expanding the class space.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>c) Training.<\/strong> Bake OOD awareness into training. Energy-based training (Liu 2020) assigns low energy to in-distribution samples and high energy to OOD samples. One-class classification (One-Class SVM; Deep Support Vector Data Description, or Deep SVDD, Ruff 2018) learns the &#8220;normal region&#8221; using only normal data. Generative models \u2014 Variational Autoencoders (VAEs), normalizing flows, diffusion models \u2014 fitted to the training distribution support OOD detection via likelihood or reconstruction error. 
Self-supervised auxiliary tasks (e.g., rotation prediction) degrade on OOD samples, exposing them.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>d) Test (Inference).<\/strong> Detect and respond at inference. Score-based methods include Maximum Softmax Probability (MSP, Hendrycks 2017), Out-of-DIstribution detector for Neural networks (ODIN, Liang 2018) which combines temperature scaling with input preprocessing, Energy score (Liu 2020), and Mahalanobis distance in feature space (Lee 2018). Reconstruction-based detection flags inputs an autoencoder cannot reconstruct well. A reject option \/ abstention escalates flagged samples to humans. Calibration (temperature scaling, Platt scaling) makes confidence trustworthy so that low confidence becomes a reliable OOD signal.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>e) Continual \/ Online Learning.<\/strong> Continual learning under OOD differs from the distribution-shift case in one critical way: <strong>OOD samples come without labels<\/strong>. They cannot be used immediately for training, so a staged pipeline is required.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, in <em>OOD detection and isolation<\/em>, score-based detectors flag OOD samples and divert them into a separate buffer. Second, in <em>label acquisition<\/em>, active learning selects the most informative OOD samples for human labeling, or pseudo-labeling is used for high-confidence cases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once labels are obtained, samples drive <em>incremental model expansion<\/em>. For semantic OOD (new classes), the label space must grow \u2014 class-incremental learning techniques such as LwF (Li 2017) add classes while preserving old-class accuracy. For non-semantic OOD (same class, new domain), domain adaptation is applied to the new domain.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Catastrophic forgetting is even more dangerous here than in plain distribution shift. 
EWC (Kirkpatrick 2017) protects parameters important for old classes; the replay buffer must keep balanced samples per class. Furthermore, the <strong>OOD detector itself must be updated<\/strong> alongside the model \u2014 what was OOD yesterday becomes in-distribution once the new class is learned.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Production systems commonly run a <strong>two-loop architecture<\/strong>: a fast loop (hourly\/daily) updates the OOD detector and applies light fine-tuning, while a slow loop (weekly\/monthly) performs full retraining incorporating accumulated new classes\/domains. In adversarial-prone domains (security, finance), OOD samples are <strong>never auto-incorporated<\/strong> into training without human review.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n<style>.kadence-column6548_be7be3-4c > .kt-inside-inner-col,.kadence-column6548_be7be3-4c > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_be7be3-4c > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_be7be3-4c > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_be7be3-4c > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_be7be3-4c > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_be7be3-4c{position:relative;}.kadence-column6548_be7be3-4c, .kt-inside-inner-col > .kadence-column6548_be7be3-4c:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_be7be3-4c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_be7be3-4c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_be7be3-4c\"><div 
class=\"kt-inside-inner-col\">\n<ol class=\"wp-block-list\">\n<li>Bifet, A., &amp; Gavald\u00e0, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN). <em>SDM<\/em>.<\/li>\n\n\n\n<li>Gama, J., \u017dliobait\u0117, I., Bifet, A., Pechenizkiy, M., &amp; Bouchachia, A. (2014). A survey on concept drift adaptation. <em>ACM Computing Surveys<\/em>, 46(4), 1-37.<\/li>\n\n\n\n<li>Ganin, Y., et al. (2016). Domain-Adversarial Training of Neural Networks (DANN). <em>JMLR<\/em>, 17(1).<\/li>\n\n\n\n<li>Hendrycks, D., &amp; Gimpel, K. (2017). A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. <em>ICLR<\/em>.<\/li>\n\n\n\n<li>Hendrycks, D., Mazeika, M., &amp; Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. <em>ICLR<\/em>.<\/li>\n\n\n\n<li>Kirkpatrick, J., et al. (2017). Overcoming catastrophic forgetting in neural networks (EWC). <em>PNAS<\/em>, 114(13).<\/li>\n\n\n\n<li>Lee, K., Lee, K., Lee, H., &amp; Shin, J. (2018). A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks (Mahalanobis). <em>NeurIPS<\/em>.<\/li>\n\n\n\n<li>Li, Z., &amp; Hoiem, D. (2017). Learning without Forgetting (LwF). <em>IEEE TPAMI<\/em>.<\/li>\n\n\n\n<li>Liang, S., Li, Y., &amp; Srikant, R. (2018). Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks (ODIN). <em>ICLR<\/em>.<\/li>\n\n\n\n<li>Lipton, Z. C., Wang, Y. X., &amp; Smola, A. (2018). Detecting and Correcting for Label Shift with Black Box Predictors (BBSE). <em>ICML<\/em>.<\/li>\n\n\n\n<li>Liu, F. T., Ting, K. M., &amp; Zhou, Z. H. (2008). Isolation Forest. <em>ICDM<\/em>.<\/li>\n\n\n\n<li>Liu, W., Wang, X., Owens, J., &amp; Li, Y. (2020). Energy-based Out-of-distribution Detection. <em>NeurIPS<\/em>.<\/li>\n\n\n\n<li>Moreno-Torres, J. G., Raeder, T., Alaiz-Rodr\u00edguez, R., Chawla, N. V., &amp; Herrera, F. (2012). A unifying view on dataset shift in classification. 
<em>Pattern Recognition<\/em>, 45(1), 521-530.<\/li>\n\n\n\n<li>Page, E. S. (1954). Continuous Inspection Schemes (CUSUM). <em>Biometrika<\/em>, 41(1\/2), 100-115.<\/li>\n\n\n\n<li>Pang, G., Shen, C., Cao, L., &amp; Hengel, A. V. D. (2021). Deep Learning for Anomaly Detection: A Review. <em>ACM Computing Surveys<\/em>, 54(2).<\/li>\n\n\n\n<li>Qui\u00f1onero-Candela, J., Sugiyama, M., Schwaighofer, A., &amp; Lawrence, N. D. (Eds.). (2009). <em>Dataset Shift in Machine Learning<\/em>. MIT Press.<\/li>\n\n\n\n<li>Ruff, L., et al. (2018). Deep One-Class Classification (Deep SVDD). <em>ICML<\/em>.<\/li>\n\n\n\n<li>Sun, B., &amp; Saenko, K. (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. <em>ECCV Workshops<\/em>.<\/li>\n\n\n\n<li>Wang, D., Shelhamer, E., Liu, S., Olshausen, B., &amp; Darrell, T. (2021). Tent: Fully Test-Time Adaptation by Entropy Minimization. <em>ICLR<\/em>.<\/li>\n\n\n\n<li>Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., et al. (2022). Generalizing to Unseen Domains: A Survey on Domain Generalization. <em>IEEE TKDE<\/em>.<\/li>\n\n\n\n<li>Yang, J., Zhou, K., Li, Y., &amp; Liu, Z. (2021). Generalized Out-of-Distribution Detection: A Survey. <em>arXiv:2110.11334<\/em>.<\/li>\n<\/ol>\n<\/div><\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" style=\"margin-top:var(--wp--preset--spacing--60);margin-bottom:var(--wp--preset--spacing--60)\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix. 
The Generative View: Why P(x|y) Is Not Just a Reverse Direction<\/h2>\n\n\n<style>.kadence-column6548_b7d22e-fb > .kt-inside-inner-col,.kadence-column6548_b7d22e-fb > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6548_b7d22e-fb > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6548_b7d22e-fb > .kt-inside-inner-col{flex-direction:column;}.kadence-column6548_b7d22e-fb > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6548_b7d22e-fb > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6548_b7d22e-fb{position:relative;}.kadence-column6548_b7d22e-fb, .kt-inside-inner-col > .kadence-column6548_b7d22e-fb:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6548_b7d22e-fb > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6548_b7d22e-fb > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6548_b7d22e-fb\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">A.1 Meaning<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">$P(x|y)$ describes &#8220;given class $y$, what does the input $x$ typically look like?&#8221; \u2014 the <strong>typical appearance<\/strong> of each class. For example, $P(\\text{image} \\mid \\text{cat})$ encodes the distribution of fur, pointed ears, and large eyes characteristic of cat images. $P(\\text{symptoms} \\mid \\text{flu})$ captures the symptom patterns \u2014 fever, cough, body ache \u2014 exhibited by flu patients.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A.2 Why This Direction Matters<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The essence of generative models.<\/strong> ML models split into two camps. 
<em>Discriminative<\/em> models learn $P(y|x)$ directly (&#8220;which class is this input?&#8221;). <em>Generative<\/em> models learn $P(x|y)$ and $P(y)$ (&#8220;what does a typical input from this class look like?&#8221;). Naive Bayes, class-conditional Gaussian Mixture Models (GMMs), and conditional Generative Adversarial Networks (GANs) and diffusion models all model $P(x|y)$; their unconditional variants model $P(x)$ instead. A conditional generative model asked to &#8220;generate a cat image&#8221; samples from $P(x|y)$.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Core component of Bayes&#8217; theorem.<\/strong><\/p>\n\n\n\n<p style=\"background-color: #fff; border: none\">$$P(y \\mid x) = \\frac{P(x \\mid y)\\,P(y)}{P(x)}$$<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">$P(x|y)$ is the likelihood term in the numerator. We typically want $P(y|x)$, but data is often collected in $P(x|y)$ form. In a medical case-control study where &#8220;100 cancer patients and 100 healthy controls are recruited and their symptoms recorded,&#8221; the data directly estimates $P(\\text{symptoms} \\mid \\text{cancer status})$, not $P(\\text{cancer status} \\mid \\text{symptoms})$. Bayes&#8217; theorem bridges the two directions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Essential for the definition of label shift.<\/strong> Label shift can be defined as &#8220;$P(y)$ shifts, $P(x|y)$ stays stable&#8221; precisely because &#8220;the typical appearance of a class doesn&#8217;t change when the class is the same.&#8221; Early and late in the COVID pandemic, for instance, the symptom distribution $P(\\text{symptoms} \\mid \\text{COVID})$ stays approximately the same because the virus is the same, while the prevalence $P(\\text{COVID})$ changes sharply. 
When this assumption holds, lightweight corrections such as BBSE (Lipton 2018), which estimate the new $P(y)$ from a black-box classifier&#8217;s predictions on unlabeled deployment data and rescale its outputs accordingly, become possible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Class-conditional data augmentation.<\/strong> Synthetic Minority Over-sampling Technique (SMOTE) and similar minority-class generation methods implicitly approximate $P(x|y=\\text{minority})$, in SMOTE&#8217;s case by interpolating between neighboring minority samples, and draw new samples from it. GAN-based medical image synthesis similarly learns $P(\\text{image} \\mid \\text{disease})$ to produce missing rare-disease data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Anomaly detection and OOD.<\/strong> Generative models trained on $P(x|y)$ are a natural fit for anomaly\/OOD detection. The reasoning: &#8220;If a model that fits $P(x \\mid \\text{normal})$ assigns very low likelihood to a new input, that input is abnormal.&#8221; Autoencoder reconstruction error approximates this; normalizing flows compute the likelihood directly to flag OOD.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A.3 An Intuitive Analogy<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">$P(y|x)$ and $P(x|y)$ are two sides of the same coin, but the question direction is reversed. $P(y|x)$ is <strong>the detective&#8217;s question:<\/strong> &#8220;Given this evidence ($x$), who is the culprit ($y$)?&#8221; \u2014 input to conclusion, classification. $P(x|y)$ is <strong>the profiler&#8217;s question:<\/strong> &#8220;Given this type of culprit ($y$), what kind of trace ($x$) do they tend to leave?&#8221; \u2014 conclusion to input pattern, modeling\/generation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The two directions are linked by Bayes&#8217; theorem, and the choice between them defines the discriminative-vs-generative split. 
$P(x|y)$ feels &#8220;reverse&#8221; only because we are accustomed to the &#8220;input \u2192 prediction&#8221; inference flow \u2014 but for modeling the data-generating process itself, reflecting how data is collected, or surgically decomposing which factor of a shift moved, the $P(x|y)$ perspective is indispensable.<\/p>\n<\/div><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n<div style='text-align:center' class='yasr-auto-insert-overall'><\/div><div style='text-align:center' class='yasr-auto-insert-visitor'><\/div>","protected":false},"excerpt":{"rendered":"<p>Machine learning (ML) models are designed under the assumption that the training distribution P_train equals the deployment distribution P_test. In reality, this assumption breaks frequently, causing sharp accuracy drops in deployed systems. This post organizes these failures into a clean taxonomy and summarizes practical mitigation strategies. 1. Hierarchy of ML Failures from train \u2260 test&#8230;<\/p>\n","protected":false},"author":4,"featured_media":6474,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":0,"footnotes":""},"categories":[56],"tags":[],"class_list":["post-6548","post","type-post","status-publish","format-standard
","has-post-thumbnail","hentry","category-data-science-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":1,"sum_votes":4},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/03\/Irregular-Stone-Wall-Texture-800x600px.jpg","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6548","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=6548"}],"version-history":[{"count":9,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6548\/revisions"}],"predecessor-version":[{"id":6560,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6548\/revisions\/6560"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/6474"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=6548"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=6548"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=6548"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}