{"id":6483,"date":"2026-04-29T21:39:22","date_gmt":"2026-04-30T02:39:22","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=6483"},"modified":"2026-04-29T22:45:56","modified_gmt":"2026-04-30T03:45:56","slug":"missing-direction-learning-and-unknown-category-inference-in-gradient-boosting-libraries","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/missing-direction-learning-and-unknown-category-inference-in-gradient-boosting-libraries-6483\/","title":{"rendered":"Missing Values and Unknown Categories in Gradient Boosting Libraries"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/Texas-Flag-Over-Margaret-Hunt-Hill-Bridge-800x600px.png\" alt=\"\" class=\"wp-image-6485\" style=\"width:600px\" srcset=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/Texas-Flag-Over-Margaret-Hunt-Hill-Bridge-800x600px.png 800w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/Texas-Flag-Over-Margaret-Hunt-Hill-Bridge-800x600px-300x225.png 300w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/Texas-Flag-Over-Margaret-Hunt-Hill-Bridge-800x600px-768x576.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">1. 
Introduction<\/h2>\n\n\n<style>.kadence-column6483_693d41-df > .kt-inside-inner-col,.kadence-column6483_693d41-df > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_693d41-df > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_693d41-df > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_693d41-df > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_693d41-df > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_693d41-df{position:relative;}.kadence-column6483_693d41-df, .kt-inside-inner-col > .kadence-column6483_693d41-df:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_693d41-df > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_693d41-df > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_693d41-df\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">This article summarizes how three popular gradient boosting libraries \u2014 LightGBM (Light Gradient Boosting Machine), XGBoost (Extreme Gradient Boosting), and CatBoost (Categorical Boosting) \u2014 handle missing values and unknown categories that appear during inference. We examine how the missing direction is learned during training and how each library responds to category values it has never seen before.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">2. Terminology: Missing Value vs. 
Unknown Category<\/h2>\n\n\n<style>.kadence-column6483_cf8ead-0f > .kt-inside-inner-col,.kadence-column6483_cf8ead-0f > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_cf8ead-0f > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_cf8ead-0f > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_cf8ead-0f > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_cf8ead-0f > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_cf8ead-0f{position:relative;}.kadence-column6483_cf8ead-0f, .kt-inside-inner-col > .kadence-column6483_cf8ead-0f:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_cf8ead-0f > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_cf8ead-0f > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_cf8ead-0f\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">Before diving in, it helps to distinguish two related but different concepts. 
The 2&#215;2 matrix below clarifies when each term applies:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><\/th><th>Numerical Feature<\/th><th>Categorical Feature<\/th><\/tr><\/thead><tbody><tr><td><strong>Value is NaN \/ None<\/strong><\/td><td>Missing value (e.g., age=NaN)<\/td><td>Missing value (e.g., job=NaN)<\/td><\/tr><tr><td><strong>Value exists but not seen during training<\/strong><\/td><td>N\/A*<\/td><td>Unknown category (e.g., job=&#8217;pilot&#8217;)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>*For numerical features, an out-of-range value at inference is not treated as &#8220;unknown&#8221; \u2014 it is simply a larger or smaller number. Since tree models only compare against split thresholds, such values are routed normally without any special handling.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In short: <strong>missing values<\/strong> can occur in both numerical and categorical features, while <strong>unknown categories<\/strong> are exclusive to categorical features.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Handling Missing Value and Unknown Category<\/h2>\n\n\n<style>.kadence-column6483_02189b-24 > .kt-inside-inner-col,.kadence-column6483_02189b-24 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_02189b-24 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_02189b-24 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_02189b-24 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_02189b-24 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_02189b-24{position:relative;}.kadence-column6483_02189b-24, .kt-inside-inner-col > .kadence-column6483_02189b-24:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_02189b-24 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_02189b-24 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_02189b-24\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">This section establishes the conceptual framework for handling missing values and unknown categories during both training and inference. Each library implements these ideas differently, as detailed in the following chapters.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.1 Missing Value: Training-Time Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The dominant approach for missing values during training is <strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-theme-palette-12-color\">missing direction learning<\/mark><\/strong>. 
Because missing samples exist in the training data, the model can learn, at each split node, which child branch (left or right) the missing samples should be routed to.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The mechanism, used by LightGBM and XGBoost, works as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Enumerate candidate split points over the non-missing samples.<\/li>\n\n\n\n<li>For each candidate, compute the gain assuming all missing samples go left.<\/li>\n\n\n\n<li>For each candidate, compute the gain assuming all missing samples go right.<\/li>\n\n\n\n<li>Keep the split point and direction with the highest gain, and store that direction as the node&#8217;s <code>default_direction<\/code>.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Crucially, the direction is learned <strong>per node, not globally<\/strong>. Within the same tree, one node may route missing samples left while another routes them right, depending on which choice maximizes gain in that specific sub-population. The fact that a value is missing thus carries predictive signal: for example, in medical data a missing test result may indicate &#8220;the test was not performed because the patient was healthy.&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CatBoost takes a different path: it does <strong>not<\/strong> learn missing direction per node. Instead, a global policy (<code>nan_mode<\/code>) determines that NaN values are treated as either smaller than the feature minimum or larger than the maximum, so they always flow in one fixed direction.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.2 Unknown Category: Training-Time Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">By definition, unknown categories cannot be present during training: they are values the model has never seen. As a result, <strong>no library performs direct &#8220;unknown category learning&#8221;<\/strong>. 
There is no training mechanism analogous to missing direction learning for unknown categories.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Instead, libraries prepare a <strong>fallback policy<\/strong> at training time that will be applied whenever an unknown category appears at inference:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LightGBM and XGBoost<\/strong>: do not prepare anything specific for unknown categories. They reuse the missing direction learned for each node.<\/li>\n\n\n\n<li><strong>CatBoost<\/strong>: computes a Prior value during training (typically the global target mean) as part of its TS encoding pipeline. This prior is automatically used to encode unknown categories at inference.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The closest practical equivalent to &#8220;unknown category learning&#8221; is a preprocessing strategy: grouping rare categories under an explicit <code>_OTHER_<\/code> label during training. The model then learns an explicit direction for <code>_OTHER_<\/code>, and unknown categories at inference are mapped to it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.3 Missing Value: Inference-Time Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At inference, NaN values are routed according to the policy established during training:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LightGBM \/ XGBoost<\/strong>: each split node looks up its learned <code>default_direction<\/code> and routes the missing sample accordingly.<\/li>\n\n\n\n<li><strong>CatBoost<\/strong>: NaN is treated as the extreme value defined by <code>nan_mode<\/code>, and the standard split-threshold comparison routes it to one side.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Note that none of the libraries directly tells the user whether a given inference sample followed the missing path (see <strong>Appendix B<\/strong>).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.4 Unknown Category: Inference-Time 
Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When an unknown category appears at inference, each library converts it into something the trained model can handle:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LightGBM \/ XGBoost<\/strong>: the unknown category cannot be mapped to a known category ID, so it is silently converted to NaN. From there, the missing direction routing takes over.<\/li>\n\n\n\n<li><strong>CatBoost<\/strong>: the unknown category has no TS statistics, so the encoding formula reduces to the Prior value. The encoded numerical value is then routed through the tree like any normal value (see <strong>Appendix A<\/strong>).<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The key insight: LightGBM and XGBoost <strong>collapse unknown into missing<\/strong>, while CatBoost <strong>treats them separately<\/strong> \u2014 unknown becomes an &#8220;average category&#8221; rather than a missing value.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
LightGBM<\/h2>\n\n\n<style>.kadence-column6483_f3692f-a1 > .kt-inside-inner-col,.kadence-column6483_f3692f-a1 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_f3692f-a1 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_f3692f-a1 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_f3692f-a1 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_f3692f-a1 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_f3692f-a1{position:relative;}.kadence-column6483_f3692f-a1, .kt-inside-inner-col > .kadence-column6483_f3692f-a1:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_f3692f-a1 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_f3692f-a1 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_f3692f-a1\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">4.1 Missing Value Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Method<\/strong>: Per-node missing direction learning. At each categorical or numerical split, LightGBM evaluates the gain of routing missing samples to the left vs. right child and stores the higher-gain choice as the node&#8217;s <code>default_left<\/code> flag.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Principle<\/strong>: Missing samples form a meaningful group whose gradient and hessian statistics indicate which side of the split they &#8220;belong&#8221; to. 
The direction is learned independently for each node, allowing the same missing pattern to be routed differently in different parts of the tree.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Options<\/strong>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nparams = {\n    &#039;use_missing&#039;: True,        # enable special handling of missing values (default)\n    &#039;zero_as_missing&#039;: False,   # if True, zeros are also treated as missing (default: False)\n}\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\"><strong>Edge Case<\/strong>: If a node saw no missing samples during training, the gain comparison cannot distinguish the two directions, and LightGBM typically falls back to the left child. If the feature contained no missing values anywhere in the training data, LightGBM (as far as the current implementation goes) learns no missing direction for it and instead treats an inference-time NaN as 0 before comparing it with the split threshold.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.2 Unknown Category Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Method<\/strong>: Unknown category values encountered at inference are automatically converted to NaN, then routed according to the missing direction learned for that node.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Principle<\/strong>: LightGBM has no separate mechanism for unknown categories. Because categorical features must be non-negative integer category IDs, any value that fails to map (an unknown) is treated identically to a missing value.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Options<\/strong>: No dedicated parameter; the missing-handling options above govern unknown behavior as well.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">5. 
XGBoost<\/h2>\n\n\n<style>.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col,.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_7d1b7f-68{position:relative;}.kadence-column6483_7d1b7f-68, .kt-inside-inner-col > .kadence-column6483_7d1b7f-68:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_7d1b7f-68 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_7d1b7f-68\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">5.1 Missing Value Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Method<\/strong>: Sparsity-aware Split Finding (Algorithm 3 of Chen and Guestrin, 2016). Per node, XGBoost computes the gain of sending all missing samples left vs. all missing samples right, then stores the better choice as <code>default_direction<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Principle<\/strong>: All missing samples at a given node move together in one direction; XGBoost never partially splits the missing group. 
This keeps the algorithm efficient on sparse data and preserves the predictive signal carried by missingness itself.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Options<\/strong>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport numpy as np\nimport xgboost as xgb\n\nmodel = xgb.XGBClassifier(\n    enable_categorical=True,    # use native categorical splits\n    missing=np.nan,             # value treated as missing\n    max_cat_to_onehot=4,        # one-hot when fewer categories than this; partition split otherwise\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\"><strong>Edge Case<\/strong>: When a node has no missing samples during training, XGBoost defaults to the left direction. If missing values then appear only at inference, every such sample follows this unlearned default, which can skew predictions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5.2 Unknown Category Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Method<\/strong>: Since XGBoost 1.5, native categorical handling can be enabled with <code>enable_categorical=True<\/code>. Unknown categories at inference are routed via the default direction, the same path missing values take.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Principle<\/strong>: XGBoost shares LightGBM&#8217;s design choice: unknown is collapsed into missing. There is no separate handling mechanism, just a unified missing direction.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Options<\/strong>: <code>enable_categorical=True<\/code> activates the native categorical split. The same <code>missing<\/code> parameter governs both NaN and unknown routing.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">6. 
CatBoost<\/h2>\n\n\n<style>.kadence-column6483_85e78a-f9 > .kt-inside-inner-col,.kadence-column6483_85e78a-f9 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_85e78a-f9 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_85e78a-f9 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_85e78a-f9 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_85e78a-f9 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_85e78a-f9{position:relative;}.kadence-column6483_85e78a-f9, .kt-inside-inner-col > .kadence-column6483_85e78a-f9:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_85e78a-f9 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_85e78a-f9 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_85e78a-f9\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">6.1 Missing Value Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Method<\/strong>: Global policy via the <code>nan_mode<\/code> parameter. NaN is treated as either smaller than the feature minimum (<code>'Min'<\/code>) or larger than the maximum (<code>'Max'<\/code>), so it always flows in a fixed direction.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Principle<\/strong>: Unlike LightGBM and XGBoost, CatBoost does <strong>not<\/strong> learn missing direction per node. 
Missing values are absorbed into the split threshold comparison and routed consistently in one direction across the entire model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Options<\/strong>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom catboost import CatBoostClassifier\nmodel = CatBoostClassifier(\n    nan_mode=&#039;Min&#039;,     # NaN smaller than min --&gt; always left\n    # &#039;Max&#039;             # NaN larger than max --&gt; always right\n    # &#039;Forbidden&#039;       # raise error if NaN exists\n)\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">6.2 Unknown Category Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Method<\/strong>: TS (Target Statistics) encoding with Prior fallback. Categorical features are converted to numerical values during training; an unknown category at inference is encoded as the Prior \u2014 typically the global target mean.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Principle<\/strong>: Rather than collapsing unknown into missing, CatBoost treats unknown as an &#8220;average category.&#8221; The encoded value sits near the middle of the decision space, avoiding extreme predictions and keeping the prediction distribution stable when many unknowns appear (see <strong>Appendix A<\/strong> for the formula and a worked example).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Options<\/strong>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nmodel = CatBoostClassifier(\n    cat_features=&#x5B;&#039;job&#039;, &#039;city&#039;],\n\n    # Prior-related settings\n    simple_ctr=&#x5B;&#039;Borders:Prior=0.5:Prior=1.0&#039;, &#039;Counter:Prior=0.5&#039;],\n    # Borders: TS-based encoding\n    # Counter: frequency-based encoding\n    # Prior=X: explicit prior value\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">By default, 
CatBoost estimates the prior automatically from the data (see <strong>Appendix A.2<\/strong> for details on configuring the average category for categorical vs. numerical features).<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">7. Comparison Table<\/h2>\n\n\n<style>.kadence-column6483_6fc114-94 > .kt-inside-inner-col,.kadence-column6483_6fc114-94 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_6fc114-94 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_6fc114-94 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_6fc114-94 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_6fc114-94 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_6fc114-94{position:relative;}.kadence-column6483_6fc114-94, .kt-inside-inner-col > .kadence-column6483_6fc114-94:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_6fc114-94 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_6fc114-94 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_6fc114-94\"><div class=\"kt-inside-inner-col\">\n<figure class=\"wp-block-table\"><table><thead><tr><th>Aspect<\/th><th>LightGBM<\/th><th>XGBoost<\/th><th>CatBoost<\/th><\/tr><\/thead><tbody><tr><td>Default unknown category handling<\/td><td>Treated as missing<\/td><td>Treated as missing<\/td><td>Replaced by Prior (mean)<\/td><\/tr><tr><td>Missing direction learning<\/td><td>Per-node (via gain comparison)<\/td><td>Per-node (via gain comparison)<\/td><td>Not learned; global policy via <code>nan_mode<\/code><\/td><\/tr><tr><td>If no missing during training?<\/td><td>Falls back to left<\/td><td>Falls back to 
left<\/td><td><code>nan_mode<\/code> policy applied<\/td><\/tr><tr><td>Explicit option<\/td><td><code>use_missing<\/code><\/td><td><code>missing=np.nan<\/code><\/td><td><code>nan_mode='Min'\/'Max'\/'Forbidden'<\/code><\/td><\/tr><tr><td>Categorical encoding method<\/td><td>Partition split<\/td><td>Partition split \/ one-hot<\/td><td>Target statistics<\/td><\/tr><tr><td><strong>Identify missing samples at inference?<\/strong><\/td><td>Indirect (track NaN externally)<\/td><td>Indirect (same)<\/td><td>Indirect (same)<\/td><\/tr><tr><td><strong>Identify unknown samples at inference?<\/strong><\/td><td>No (treated identically to missing)<\/td><td>No (same)<\/td><td>No (replaced by Prior, indistinguishable)<\/td><\/tr><tr><td><strong>Missing branch traceable inside model?<\/strong><\/td><td>Yes (default direction visible in tree dump)<\/td><td>Yes (<code>default_left<\/code> exposed)<\/td><td>Partial (direction absorbed by split threshold under <code>nan_mode<\/code>)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For practical methods of identifying missing or unknown samples at inference, see <strong>Appendix B<\/strong>.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">8. 
Intuition Behind Missing Direction Learning<\/h2>\n\n\n<style>.kadence-column6483_a2ca03-75 > .kt-inside-inner-col,.kadence-column6483_a2ca03-75 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_a2ca03-75 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_a2ca03-75 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_a2ca03-75 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_a2ca03-75 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_a2ca03-75{position:relative;}.kadence-column6483_a2ca03-75, .kt-inside-inner-col > .kadence-column6483_a2ca03-75:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_a2ca03-75 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_a2ca03-75 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_a2ca03-75\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">The sparsity-aware approach in LightGBM and XGBoost rests on a simple idea: at each split, the summed gradient and hessian statistics of the missing samples are added first to the left child and then to the right child, and the side that yields the larger gain (the larger loss reduction) is kept. If the gradient profile of the missing group resembles that of the left child&#8217;s non-missing samples, sending them left yields higher gain; otherwise, right wins.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Missing samples therefore form a meaningful group on their own, and the fact that a value is missing carries predictive signal. 
In medical data, for example, a missing test result may indicate &#8220;the test was not performed because the patient was healthy.&#8221; This pattern is captured automatically.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">9. Practical Implications<\/h2>\n\n\n<style>.kadence-column6483_50df6c-3c > .kt-inside-inner-col,.kadence-column6483_50df6c-3c > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_50df6c-3c > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_50df6c-3c > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_50df6c-3c > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_50df6c-3c > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_50df6c-3c{position:relative;}.kadence-column6483_50df6c-3c, .kt-inside-inner-col > .kadence-column6483_50df6c-3c:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_50df6c-3c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_50df6c-3c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_50df6c-3c\"><div class=\"kt-inside-inner-col\">\n<ul class=\"wp-block-list\">\n<li><strong>Distribution shift risk<\/strong>: if there were no missing or unknown values during training, inference-time handling becomes unstable. 
To mitigate this, inject NaN into a portion of the training samples, or group rare categories under <code>_OTHER_<\/code>, so that a direction for missing values (or an explicit <code>_OTHER_<\/code> split) is learned from data.<\/li>\n\n\n\n<li><strong>Missing indicator features<\/strong>: adding a <code>feature_is_missing<\/code> binary feature lets the missingness itself become a split criterion, so the model does not depend solely on the learned default direction.<\/li>\n\n\n\n<li><strong>CatBoost&#8217;s strength<\/strong>: because unknowns are encoded as the prior, predictions do not swing to extremes. LightGBM and XGBoost, with a one-sided missing direction, can produce skewed prediction distributions when many unknowns appear.<\/li>\n\n\n\n<li><strong>Validation<\/strong>: when train-inference category distributions are likely to differ, evaluate model robustness with a validation set that intentionally contains injected unknowns.<\/li>\n\n\n\n<li><strong>Categorical cardinality<\/strong>: at very high cardinality (tens of thousands of categories), partition splits in LightGBM and XGBoost become inefficient. CatBoost&#8217;s TS approach is advantageous in this regime.<\/li>\n\n\n\n<li><strong>High-missing features<\/strong>: as a rule of thumb, features that are missing in more than about half of the rows are often better represented as a missing indicator plus an alternative encoding of the original signal.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">10. 
References<\/h2>\n\n\n<style>.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col,.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_e5c6bc-8c{position:relative;}.kadence-column6483_e5c6bc-8c, .kt-inside-inner-col > .kadence-column6483_e5c6bc-8c:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_e5c6bc-8c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_e5c6bc-8c\"><div class=\"kt-inside-inner-col\">\n<ul class=\"wp-block-list\">\n<li>Chen and Guestrin (2016), &#8220;XGBoost: A Scalable Tree Boosting System&#8221; \u2014 Algorithm 3 (Sparsity-aware Split Finding)<\/li>\n\n\n\n<li>Ke et al. (2017), &#8220;LightGBM: A Highly Efficient Gradient Boosting Decision Tree&#8221;<\/li>\n\n\n\n<li>Prokhorenkova et al. (2018), &#8220;CatBoost: Unbiased Boosting with Categorical Features&#8221;<\/li>\n\n\n\n<li>LightGBM official documentation: Advanced Topics \u2014 Missing Value Handling<\/li>\n\n\n\n<li>CatBoost official documentation: Parameters \u2014 <code>nan_mode<\/code>, <code>simple_ctr<\/code><\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix A. 
CatBoost: Prior Encoding and Average-Category Settings<\/h2>\n\n\n<style>.kadence-column6483_63023b-6a > .kt-inside-inner-col,.kadence-column6483_63023b-6a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_63023b-6a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_63023b-6a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_63023b-6a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_63023b-6a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_63023b-6a{position:relative;}.kadence-column6483_63023b-6a, .kt-inside-inner-col > .kadence-column6483_63023b-6a:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_63023b-6a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_63023b-6a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6483_63023b-6a\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">A.1 What &#8220;Replaced by Prior&#8221; Means<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">CatBoost replaces each category value with a numerical value using the formula:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\n$$TS(\\text{category}) = \\frac{\\text{TargetSum} + a \\cdot \\text{Prior}}{\\text{Count} + a}$$\n<\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>TargetSum<\/strong>: sum of target values for samples in the category<\/li>\n\n\n\n<li><strong>Count<\/strong>: number of samples in the category<\/li>\n\n\n\n<li><strong>Prior<\/strong>: prior value (typically the global target mean)<\/li>\n\n\n\n<li><strong>$a$<\/strong>: smoothing parameter (default 1)<\/li>\n<\/ul>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Definition of Prior<\/strong>. For binary classification:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\n$$\\text{Prior} = \\frac{\\text{number of positive samples in training data}}{\\text{total number of training samples}}$$\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">For regression, the prior equals the mean of the target across the training set.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Behavior on Unknown Categories<\/strong>. When a category unseen during training appears at inference, both TargetSum and Count equal zero, so the formula reduces to:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\n$$TS = \\frac{0 + a \\cdot \\text{Prior}}{0 + a} = \\text{Prior}$$\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The encoded value of that category becomes the Prior itself.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Intuitive Meaning<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>&#8220;Average category&#8221; assumption<\/strong>: with no information, the unknown is assigned the influence of the global average.<\/li>\n\n\n\n<li><strong>Avoids extreme predictions<\/strong>: rather than being pushed to one tree edge as a missing value, it routes through the middle of the decision space.<\/li>\n\n\n\n<li><strong>Distribution stability<\/strong>: even when many unknowns appear, the prediction distribution does not shift drastically.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Example<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Consider a credit scoring model with a <code>job<\/code> feature:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Training data: <code>doctor<\/code> default rate 0.05, <code>student<\/code> default rate 0.30, <code>engineer<\/code> default rate 0.10<\/li>\n\n\n\n<li>Global mean default rate: 0.15 \u2192 Prior = 0.15<\/li>\n\n\n\n<li>At inference, <code>pilot<\/code> (unseen during training) appears \u2192 encoding value = 
0.15<\/li>\n\n\n\n<li>The model treats <code>pilot<\/code> as a job with average risk and routes it accordingly.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">LightGBM or XGBoost would route <code>pilot<\/code> down the missing direction. CatBoost treats it as a &#8220;job with average risk&#8221; instead.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A.2 Setting the &#8220;Average Category&#8221; for Categorical vs. Numerical Data<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Categorical Features<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Handled automatically, with optional fine-tuning:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom catboost import CatBoostClassifier\nmodel = CatBoostClassifier(\n    cat_features=&#x5B;&#039;job&#039;, &#039;city&#039;],\n\n    # Prior-related settings\n    simple_ctr=&#x5B;&#039;Borders:Prior=0.5:Prior=1.0&#039;, &#039;Counter:Prior=0.5&#039;],\n    # Borders: TS-based encoding\n    # Counter: frequency-based encoding\n    # Prior=X: explicit prior value\n\n    # Model size limit (not a smoothing parameter)\n    ctr_leaf_count_limit=None,  # max number of leaves with categorical features (None = no limit)\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">By default, CatBoost estimates the prior automatically from the data, so no manual setting is required.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Numerical Features<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Numerical features have no built-in &#8220;average value&#8221; concept. 
The missing-handling policy is set via <code>nan_mode<\/code>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nmodel = CatBoostClassifier(\n    nan_mode=&#039;Min&#039;,  # &#039;Min&#039; \/ &#039;Max&#039; \/ &#039;Forbidden&#039;\n)\n<\/pre><\/div>\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Option<\/th><th>Meaning<\/th><th>Routing Result<\/th><\/tr><\/thead><tbody><tr><td><code>'Min'<\/code> (default for numerical)<\/td><td>Treats NaN as smaller than the feature minimum<\/td><td>All missing values go left<\/td><\/tr><tr><td><code>'Max'<\/code><\/td><td>Treats NaN as larger than the feature maximum<\/td><td>All missing values go right<\/td><\/tr><tr><td><code>'Forbidden'<\/code><\/td><td>Raises an error if NaN is present<\/td><td>Cannot proceed<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If you want mean imputation for numerical features, do it manually before training:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Replace NaN with the column mean before training.\n# Assign the result back: calling fillna(..., inplace=True) on a selected\n# column is chained assignment and may not update X in recent pandas.\nX&#x5B;&#039;feature&#039;] = X&#x5B;&#039;feature&#039;].fillna(X&#x5B;&#039;feature&#039;].mean())\n# CatBoost does not perform this preprocessing automatically.\n<\/pre><\/div>\n\n\n<h4 class=\"wp-block-heading\">Summary: Categorical vs. 
Numerical &#8220;Average&#8221; Treatment<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Aspect<\/th><th>Categorical<\/th><th>Numerical<\/th><\/tr><\/thead><tbody><tr><td>Automatic average substitution?<\/td><td>Yes (Prior applied automatically)<\/td><td>No (manual handling required)<\/td><\/tr><tr><td>Unknown \/ missing handling<\/td><td>Encoded as Prior, routed through tree<\/td><td>Routed in one direction by <code>nan_mode<\/code><\/td><\/tr><tr><td>Relevant parameters<\/td><td><code>simple_ctr<\/code>, <code>combinations_ctr<\/code>, <code>ctr_target_border_count<\/code><\/td><td><code>nan_mode<\/code><\/td><\/tr><tr><td>Intuitive meaning<\/td><td>Treated as an &#8220;average category&#8221;<\/td><td>Treated as an extreme value with fixed direction<\/td><\/tr><\/tbody><\/table><\/figure>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix B. Identifying Missing \/ Unknown Samples at Inference<\/h2>\n\n\n<style>.kadence-column6483_2f2a48-70 > .kt-inside-inner-col,.kadence-column6483_2f2a48-70 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6483_2f2a48-70 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6483_2f2a48-70 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6483_2f2a48-70 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6483_2f2a48-70 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6483_2f2a48-70{position:relative;}.kadence-column6483_2f2a48-70, .kt-inside-inner-col > .kadence-column6483_2f2a48-70:not(.specificity){margin-left:var(--global-kb-spacing-md, 2rem);}@media all and (max-width: 1024px){.kadence-column6483_2f2a48-70 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6483_2f2a48-70 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div 
class=\"wp-block-kadence-column kadence-column6483_2f2a48-70\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">None of the three libraries reports directly which samples followed the missing or unknown path during inference. You can reconstruct this indirectly:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">B.1 LightGBM<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Inspect default direction in the tree dump\nmodel.booster_.dump_model()  # contains default_left flags per node\n# Track NaN positions externally beforehand\nnan_mask = X_test.isna().any(axis=1)\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">B.2 XGBoost<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport xgboost as xgb\n\n# Inspect tree structure (booster is the underlying xgb.Booster,\n# e.g. model.get_booster() when using the scikit-learn wrapper)\nbooster.get_dump(with_stats=True)\n# Each split node carries yes \/ no \/ missing branch info.\n# pred_leaf=True reveals leaf indices but not whether missing was traversed.\nleaf_indices = booster.predict(xgb.DMatrix(X_test), pred_leaf=True)\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">B.3 CatBoost<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# With nan_mode=&#039;Min&#039;\/&#039;Max&#039;, missing is folded into the split threshold,\n# so tracing a separate &quot;missing branch&quot; is impractical.\n# In production, build a NaN \/ unknown mask in advance.\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">B.4 Practical Recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Save an <code>X_test.isna()<\/code> mask just before inference for downstream auditing.<\/li>\n\n\n\n<li>To detect unknown categories, store the training category set and compare against the inference data.<\/li>\n\n\n\n<li>For monitoring, track the missing and unknown rates as separate metrics.<\/li>\n<\/ul>\n<\/div><\/div>\n<div style='text-align:center' 
class='yasr-auto-insert-overall'><\/div><div style='text-align:center' class='yasr-auto-insert-visitor'><\/div>","protected":false},"excerpt":{"rendered":"<p>1. Introduction This article summarizes how three popular gradient boosting libraries \u2014 LightGBM (Light Gradient Boosting Machine), XGBoost (Extreme Gradient Boosting), and CatBoost (Categorical Boosting) \u2014 handle missing values and unknown categories that appear during inference. We examine how the missing direction is learned during training and how each library responds to category values it&#8230;<\/p>\n","protected":false},"author":4,"featured_media":6485,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":0,"footnotes":""},"categories":[56],"tags":[],"class_list":["post-6483","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":1,"sum_votes":4},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/Texas-Flag-Over-Margaret-Hunt-Hill-Bridge-800x600px.png","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/
wp\/v2\/posts\/6483","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=6483"}],"version-history":[{"count":6,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6483\/revisions"}],"predecessor-version":[{"id":6493,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6483\/revisions\/6493"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/6485"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=6483"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=6483"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=6483"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}