Missing Values and Unknown Categories in Gradient Boosting Libraries

1. Introduction
This article summarizes how three popular gradient boosting libraries — LightGBM (Light Gradient Boosting Machine), XGBoost (Extreme Gradient Boosting), and CatBoost (Categorical Boosting) — handle missing values and unknown categories that appear during inference. We examine how the missing direction is learned during training and how each library responds to category values it has never seen before.
2. Terminology: Missing Value vs. Unknown Category
Before diving in, it helps to distinguish two related but different concepts. The 2×2 matrix below clarifies when each term applies:
| | Numerical Feature | Categorical Feature |
|---|---|---|
| Value is NaN / None | Missing value (e.g., age=NaN) | Missing value (e.g., job=NaN) |
| Value exists but not seen during training | N/A* | Unknown category (e.g., job='pilot') |
*For numerical features, an out-of-range value at inference is not treated as “unknown” — it is simply a larger or smaller number. Since tree models only compare against split thresholds, such values are routed normally without any special handling.
In short: missing values can occur in both numerical and categorical features, while unknown categories are exclusive to categorical features.
3. Handling Missing Value and Unknown Category
This section establishes the conceptual framework for handling missing values and unknown categories during both training and inference. Each library implements these ideas differently, as detailed in Sections 4 through 6.
3.1 Missing Value: Training-Time Handling
The dominant approach for missing values during training is missing direction learning. Because missing samples exist in the training data, the model can learn — at each split node — which child branch (left or right) the missing samples should be routed to.
The mechanism, used by LightGBM and XGBoost, works as follows:
- Find the best split point using only non-missing samples.
- Compute the gain assuming all missing samples go left.
- Compute the gain assuming all missing samples go right.
- Store the higher-gain direction as the node’s default_direction.
Crucially, the direction is learned per node, not globally. Within the same tree, one node may route missing samples left while another routes them right, depending on which choice maximizes gain in that specific sub-population. The fact that a value is missing thus carries predictive signal — for example, in medical data a missing test result may indicate “the test was not performed because the patient was healthy.”
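The per-node comparison described above can be sketched in a few lines of plain Python. This is a simplified illustration, not either library's implementation: it evaluates a single candidate threshold, uses the second-order gain with an assumed L2 term `lam`, and omits histogram binning and the complexity penalty. The function names `leaf_score` and `choose_default_direction` are hypothetical.

```python
import math

def leaf_score(g, h, lam=1.0):
    """Structure score G^2 / (H + lambda) for one candidate child."""
    return g * g / (h + lam)

def choose_default_direction(x, grad, hess, threshold, lam=1.0):
    """Return (direction, gain) for one candidate threshold on one feature."""
    g_l = h_l = g_r = h_r = g_m = h_m = 0.0
    for xi, gi, hi in zip(x, grad, hess):
        if math.isnan(xi):
            # Missing samples are pooled and moved as one group.
            g_m += gi
            h_m += hi
        elif xi < threshold:
            g_l += gi
            h_l += hi
        else:
            g_r += gi
            h_r += hi

    parent = leaf_score(g_l + g_r + g_m, h_l + h_r + h_m, lam)
    # Gain if the whole missing group joins the left child.
    gain_left = 0.5 * (leaf_score(g_l + g_m, h_l + h_m, lam)
                       + leaf_score(g_r, h_r, lam) - parent)
    # Gain if the whole missing group joins the right child.
    gain_right = 0.5 * (leaf_score(g_l, h_l, lam)
                        + leaf_score(g_r + g_m, h_r + h_m, lam) - parent)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Toy node: the two missing samples have gradients that resemble the right child.
x = [1.0, 2.0, 8.0, 9.0, float("nan"), float("nan")]
grad = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 6
direction, gain = choose_default_direction(x, grad, hess, threshold=5.0)
print(direction)  # right
```

Flipping the gradients of the two missing samples to -1.0 makes the same node choose "left", which is the sense in which the direction depends on the sub-population reaching that node.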
CatBoost takes a different path: it does not learn missing direction per node. Instead, a global policy (nan_mode) determines that NaN values are treated as either smaller than the feature minimum or larger than the maximum, so they always flow in one fixed direction.
3.2 Unknown Category: Training-Time Handling
By definition, unknown categories cannot be present during training — they are values the model has never seen. As a result, no library performs direct “unknown category learning”. There is no training mechanism analogous to missing direction learning for unknown categories.
Instead, libraries prepare a fallback policy at training time that will be applied whenever an unknown category appears at inference:
- LightGBM and XGBoost: do not prepare anything specific for unknown categories. They reuse the missing direction learned for each node.
- CatBoost: computes a Prior value during training (typically the global target mean) as part of its target statistics (TS) encoding pipeline. This prior is automatically used to encode unknown categories at inference.
The closest practical equivalent to “unknown category learning” is a preprocessing strategy: grouping rare categories under an explicit _OTHER_ label during training. The model then learns an explicit direction for _OTHER_, and unknown categories at inference are mapped to it.
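A minimal sketch of that preprocessing step, using only the standard library. The helper names and the `min_count` cutoff are assumptions chosen for illustration; the appropriate rarity threshold depends on the dataset.

```python
from collections import Counter

OTHER = "_OTHER_"

def fit_category_vocab(train_values, min_count=2):
    """Keep categories seen at least min_count times in training."""
    counts = Counter(train_values)
    return {cat for cat, n in counts.items() if n >= min_count}

def apply_category_vocab(values, vocab):
    """Map anything outside the vocabulary (rare or unseen) to _OTHER_."""
    return [v if v in vocab else OTHER for v in values]

train_jobs = ["doctor", "doctor", "student", "student", "engineer", "nurse"]
vocab = fit_category_vocab(train_jobs, min_count=2)

# Rare training categories become _OTHER_, so the model sees that label during training.
train_encoded = apply_category_vocab(train_jobs, vocab)
# At inference, an unseen category lands on the same, already-learned label.
test_encoded = apply_category_vocab(["doctor", "pilot"], vocab)
print(test_encoded)  # ['doctor', '_OTHER_']
```

The same fitted `vocab` must be reused at inference; refitting it on inference data would defeat the purpose.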
3.3 Missing Value: Inference-Time Handling
At inference, NaN values are routed according to the policy established during training:
- LightGBM / XGBoost: each split node looks up its learned default_direction and routes the missing sample accordingly.
- CatBoost: NaN is treated as the extreme value defined by nan_mode, and the standard split-threshold comparison routes it to one side.
Note that none of the libraries directly tells the user whether a given inference sample followed the missing path (see Appendix B).
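To make the routing rule concrete, here is a toy tree traversal that follows a stored default direction and, unlike the libraries, also records which features triggered it. The `Node` class and `predict_with_trace` function are hypothetical and exist only for this illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None   # None marks a leaf
    threshold: float = 0.0
    default_left: bool = True       # direction learned for missing values
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # leaf prediction

def predict_with_trace(node, row):
    """Return (leaf value, features whose missing value used the default direction)."""
    used_default = []
    while node.feature is not None:
        x = row.get(node.feature, float("nan"))
        if math.isnan(x):
            used_default.append(node.feature)
            node = node.left if node.default_left else node.right
        else:
            node = node.left if x < node.threshold else node.right
    return node.value, used_default

tree = Node(
    feature="age", threshold=40.0, default_left=False,
    left=Node(value=-0.2),
    right=Node(
        feature="income", threshold=50.0, default_left=True,
        left=Node(value=0.1), right=Node(value=0.7),
    ),
)
value, trace = predict_with_trace(tree, {"age": float("nan"), "income": 80.0})
print(value, trace)  # 0.7 ['age']
```

The trace list is exactly the information the real predict calls do not return, which is why the mask-based auditing in Appendix B is needed.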
3.4 Unknown Category: Inference-Time Handling
When an unknown category appears at inference, each library converts it into something the trained model can handle:
- LightGBM / XGBoost: the unknown category cannot be mapped to a known category ID, so it is silently converted to NaN. From there, the missing direction routing takes over.
- CatBoost: the unknown category has no TS statistics, so the encoding formula reduces to the Prior value. The encoded numerical value is then routed through the tree like any normal value (see Appendix A).
The key insight: LightGBM and XGBoost collapse unknown into missing, while CatBoost treats them separately — unknown becomes an “average category” rather than a missing value.
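The "unknown collapses into missing" step can be reproduced with pandas alone. The sketch below assumes categorical columns are passed as pandas `category` dtype and that the training categories are re-applied at inference; it illustrates the encoding effect, not the internals of either library.

```python
import pandas as pd

# Vocabulary fixed at training time.
train_jobs = pd.Series(["doctor", "student", "engineer", "doctor"], dtype="category")
train_categories = train_jobs.cat.categories

# Inference data containing one unseen category and one true missing value.
test_jobs = pd.Series(["doctor", "pilot", None])
test_encoded = pd.Series(pd.Categorical(test_jobs, categories=train_categories))

# Both 'pilot' (unknown) and None (missing) end up as NaN with code -1.
print(test_encoded.isna().tolist())     # [False, True, True]
print(test_encoded.cat.codes.tolist())  # [0, -1, -1]
```

After this step nothing distinguishes the unknown from the missing value, so any distinction has to be captured before encoding.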
4. LightGBM
4.1 Missing Value Handling
Method: Per-node missing direction learning. At each categorical or numerical split, LightGBM evaluates the gain of routing missing samples to the left vs. right child and stores the higher-gain choice as the node’s default_left flag.
Principle: Missing samples form a meaningful group whose gradient and hessian statistics indicate which side of the split they “belong” to. The direction is learned independently for each node, allowing the same missing pattern to be routed differently in different parts of the tree.
Options:
params = {
    'use_missing': True,       # enable missing handling (default)
    'zero_as_missing': False,  # if True, zeros are treated as missing (default: False)
}
Edge Case: If a node had no missing samples during training, the direction is undefined and LightGBM typically falls back to the left child.
4.2 Unknown Category Handling
Method: Unknown category values encountered at inference are automatically converted to NaN, then routed according to the missing direction learned for that node.
Principle: LightGBM has no separate mechanism for unknown categories. Because categorical features must be non-negative integer category IDs, any value that fails to map (an unknown) is treated identically to a missing value.
Options: No dedicated parameter — the missing-handling options above govern unknown behavior as well.
5. XGBoost
5.1 Missing Value Handling
Method: Sparsity-aware Split Finding (Algorithm 3 of Chen and Guestrin, 2016). Per node, XGBoost computes the gain of sending all missing samples left vs. all missing samples right, then stores the better choice as default_direction.
Principle: All missing samples at a given node move together in one direction; XGBoost never partially splits the missing group. This keeps the algorithm efficient on sparse data and preserves the predictive signal carried by missingness itself.
Options:
import numpy as np
import xgboost as xgb
model = xgb.XGBClassifier(
    enable_categorical=True,
    missing=np.nan,       # value treated as missing
    max_cat_to_onehot=4,  # one-hot below this; partition split above
)
Edge Case: When a node has no missing samples during training, XGBoost defaults to the left direction. If missing values then appear only at inference, every such sample is sent left at that node regardless of what the data would have favored, so predictions can become skewed.
5.2 Unknown Category Handling
Method: From XGBoost 1.5+, native categorical handling is enabled with enable_categorical=True. Unknown categories at inference are routed via the default direction — the same path missing values take.
Principle: XGBoost shares LightGBM’s design choice: unknown is collapsed into missing. There is no separate handling mechanism, just a unified missing direction.
Options: enable_categorical=True activates the native categorical split. The same missing parameter governs both NaN and unknown routing.
6. CatBoost
6.1 Missing Value Handling
Method: Global policy via the nan_mode parameter. NaN is treated as either smaller than the feature minimum ('Min') or larger than the maximum ('Max'), so it always flows in a fixed direction.
Principle: Unlike LightGBM and XGBoost, CatBoost does not learn missing direction per node. Missing values are absorbed into the split threshold comparison and routed consistently in one direction across the entire model.
Options:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    nan_mode='Min',  # NaN smaller than min --> always left
    # 'Max'          # NaN larger than max --> always right
    # 'Forbidden'    # raise error if NaN exists
)
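Conceptually, the three modes behave as if NaN were replaced by an extreme value before the ordinary threshold comparison. The sketch below illustrates that idea only; `resolve_nan` and `goes_left` are hypothetical helpers, and CatBoost itself applies the policy during feature quantization rather than through code like this.

```python
import math

def resolve_nan(value, nan_mode="Min"):
    """Replace NaN with an extreme value according to the global policy."""
    if not math.isnan(value):
        return value
    if nan_mode == "Min":
        return -math.inf   # smaller than any observed value
    if nan_mode == "Max":
        return math.inf    # larger than any observed value
    raise ValueError("NaN encountered while nan_mode='Forbidden'")

def goes_left(value, threshold, nan_mode="Min"):
    """Ordinary split comparison; no per-node default direction is involved."""
    return resolve_nan(value, nan_mode) < threshold

nan = float("nan")
# The direction is the same at every threshold, i.e. at every node.
print([goes_left(nan, t, "Min") for t in (-5.0, 0.0, 100.0)])  # [True, True, True]
print([goes_left(nan, t, "Max") for t in (-5.0, 0.0, 100.0)])  # [False, False, False]
```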
6.2 Unknown Category Handling
Method: TS (Target Statistics) encoding with Prior fallback. Categorical features are converted to numerical values during training; an unknown category at inference is encoded as the Prior — typically the global target mean.
Principle: Rather than collapsing unknown into missing, CatBoost treats unknown as an “average category.” The encoded value sits near the middle of the decision space, avoiding extreme predictions and keeping the prediction distribution stable when many unknowns appear (see Appendix A for the formula and a worked example).
Options:
model = CatBoostClassifier(
    cat_features=['job', 'city'],
    # Prior-related settings
    simple_ctr=['Borders:Prior=0.5:Prior=1.0', 'Counter:Prior=0.5'],
    # Borders: TS-based encoding
    # Counter: frequency-based encoding
    # Prior=X: explicit prior value
)
By default, CatBoost estimates the prior automatically from the data (see Appendix A.2 for details on configuring the average category for categorical vs. numerical features).
7. Comparison Table
| Aspect | LightGBM | XGBoost | CatBoost |
|---|---|---|---|
| Default unknown category handling | Treated as missing | Treated as missing | Replaced by Prior (mean) |
| Missing direction learning | Per-node (via gain comparison) | Per-node (via gain comparison) | Not learned; global policy via nan_mode |
| If no missing during training? | Falls back to left | Falls back to left | nan_mode policy applied |
| Explicit option | use_missing | missing=np.nan | nan_mode='Min'/'Max'/'Forbidden' |
| Categorical encoding method | Partition split | Partition split / one-hot | Target statistics |
| Identify missing samples at inference? | Indirect (track NaN externally) | Indirect (same) | Indirect (same) |
| Identify unknown samples at inference? | No (treated identically to missing) | No (same) | No (replaced by Prior, indistinguishable) |
| Missing branch traceable inside model? | Yes (default direction visible in tree dump) | Yes (default_left exposed) | Partial (direction absorbed by split threshold under nan_mode) |
For practical methods of identifying missing or unknown samples at inference, see Appendix B.
8. Intuition Behind Missing Direction Learning
The sparsity-aware approach in LightGBM and XGBoost rests on a simple idea: at each split, the gradient statistics of the missing samples determine which side they join, namely the side where adding them reduces the loss most. If the gradient profile of the missing group resembles that of the left child’s non-missing samples, sending them left yields higher gain; otherwise, right wins.
Missing samples therefore form a meaningful group on their own. As in the medical example of Section 3.1, the fact that a value is missing carries predictive signal, and the per-node gain comparison captures that pattern automatically.
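For readers who prefer the algebra, the comparison can be written with the split gain from Chen and Guestrin (2016). Here $G$ and $H$ are sums of gradients and hessians over the left ($L$), right ($R$), and missing ($M$) samples at the node, and $\lambda$, $\gamma$ are the regularization terms from that paper. LightGBM's exact objective differs in details, so this should be read as the XGBoost form only.

```latex
\[
\mathrm{Gain}(G_L, H_L, G_R, H_R)
  = \frac{1}{2}\left[
      \frac{G_L^2}{H_L + \lambda}
    + \frac{G_R^2}{H_R + \lambda}
    - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
    \right] - \gamma
\]
\[
\mathrm{Gain}_{\text{left}}  = \mathrm{Gain}(G_L + G_M,\; H_L + H_M,\; G_R,\; H_R),
\qquad
\mathrm{Gain}_{\text{right}} = \mathrm{Gain}(G_L,\; H_L,\; G_R + G_M,\; H_R + H_M)
\]
\[
\text{default direction} =
\begin{cases}
  \text{left}  & \text{if } \mathrm{Gain}_{\text{left}} \ge \mathrm{Gain}_{\text{right}} \\
  \text{right} & \text{otherwise}
\end{cases}
\]
```

The parent term is identical in both cases because the missing group is part of the node either way; only the two child terms change.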
9. Practical Implications
- Distribution shift risk: if there were no missing or unknown values during training, inference-time handling becomes unstable. Inject NaN into a portion of training samples or group rare categories under _OTHER_ so that the missing direction is learned explicitly.
- Missing indicator features: adding a feature_is_missing binary feature lets the missingness itself become a split criterion, improving the reliability of missing handling.
- CatBoost’s strength: because unknowns are encoded as the prior, predictions do not swing to extremes. LightGBM and XGBoost, with a one-sided missing direction, can produce skewed prediction distributions when many unknowns appear.
- Validation: when train-inference category distributions are likely to differ, evaluate model robustness with a validation set that intentionally contains injected unknowns.
- Categorical cardinality: at very high cardinality (tens of thousands of categories), partition splits in LightGBM and XGBoost become inefficient. CatBoost’s TS approach is advantageous in this regime.
- High-missing features: features with more than 50% missing values are best split into a missing indicator plus an alternative encoding for the original signal.
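The validation point above can be made concrete with a small helper that overwrites a fixed fraction of a categorical column with a token that never occurs in training. The function name, the token, and the 30% rate are all assumptions for illustration.

```python
import random

def inject_unknowns(values, rate, unknown_token="__UNSEEN__", seed=0):
    """Return a copy of values with round(rate * n) entries replaced by an unseen token."""
    rng = random.Random(seed)                      # fixed seed keeps the stress test repeatable
    n_inject = round(len(values) * rate)
    idx = set(rng.sample(range(len(values)), n_inject))
    return [unknown_token if i in idx else v for i, v in enumerate(values)]

valid_jobs = ["doctor", "student", "engineer", "doctor", "student",
              "engineer", "doctor", "student", "engineer", "doctor"]
stressed_jobs = inject_unknowns(valid_jobs, rate=0.3)
print(stressed_jobs.count("__UNSEEN__"))  # 3
```

Scoring the model on both the original and the stressed column, and comparing the metric and the prediction distribution, gives a rough measure of how sensitive it is to unknown categories.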
10. References
- Chen and Guestrin (2016), “XGBoost: A Scalable Tree Boosting System” — Algorithm 3 (Sparsity-aware Split Finding)
- Ke et al. (2017), “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”
- Prokhorenkova et al. (2018), “CatBoost: Unbiased Boosting with Categorical Features”
- LightGBM official documentation: Advanced Topics — Missing Value Handling
- CatBoost official documentation: Parameters (nan_mode, simple_ctr)
Appendix A. CatBoost: Prior Encoding and Average-Category Settings
A.1 What “Replaced by Prior” Means
CatBoost replaces each category value with a numerical value using the formula:
$$\text{TS} = \frac{\text{TargetSum} + a \cdot \text{Prior}}{\text{Count} + a}$$
- TargetSum: sum of target values for samples in the category
- Count: number of samples in the category
- Prior: prior value (typically the global target mean)
- $a$: smoothing parameter (default 1)
Definition of Prior. For binary classification:
$$\text{Prior} = \frac{\text{number of positive training samples}}{\text{total number of training samples}}$$
For regression, the prior equals the mean of the target across the training set.
Behavior on Unknown Categories. When a category unseen during training appears at inference, both TargetSum and Count equal zero, so the formula reduces to:
$$\text{TS} = \frac{0 + a \cdot \text{Prior}}{0 + a} = \text{Prior}$$
The encoded value of that category becomes the Prior itself.
Intuitive Meaning
- “Average category” assumption: with no information, the unknown is assigned the influence of the global average.
- Avoids extreme predictions: rather than being pushed to one tree edge as a missing value, it routes through the middle of the decision space.
- Distribution stability: even when many unknowns appear, the prediction distribution does not shift drastically.
Example
Consider a credit scoring model with a job feature:
- Training data: doctor default rate 0.05, student default rate 0.30, engineer default rate 0.10
- Global mean default rate: 0.15 → Prior = 0.15
- At inference, pilot (unseen during training) appears → encoding value = 0.15
- The model treats pilot as a job with average risk and routes it accordingly.
LightGBM or XGBoost would route pilot down the missing direction. CatBoost treats it as a “job with average risk” instead.
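The example can be reproduced with a short sketch of smoothed target statistics. It uses plain whole-dataset statistics with smoothing parameter `a`, which is a simplification: CatBoost's training procedure uses ordered statistics, and the helper names and the toy data (20 samples per job, chosen to match the rates above) are assumptions.

```python
from collections import defaultdict

def fit_target_stats(categories, targets):
    """Collect per-category target sums and counts, plus the global mean as prior."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    prior = sum(targets) / len(targets)
    return dict(sums), dict(counts), prior

def encode(category, sums, counts, prior, a=1.0):
    """(TargetSum + a * Prior) / (Count + a); unseen categories have sum = count = 0."""
    return (sums.get(category, 0.0) + a * prior) / (counts.get(category, 0) + a)

# 20 samples per job with default rates 0.05, 0.30, and 0.10.
jobs = ["doctor"] * 20 + ["student"] * 20 + ["engineer"] * 20
defaults = [1] * 1 + [0] * 19 + [1] * 6 + [0] * 14 + [1] * 2 + [0] * 18

sums, counts, prior = fit_target_stats(jobs, defaults)
pilot_value = encode("pilot", sums, counts, prior)
doctor_value = encode("doctor", sums, counts, prior)
print(round(prior, 2), round(pilot_value, 2))  # 0.15 0.15
```

Known categories are pulled slightly toward the prior by the smoothing term, while the unseen category lands on the prior exactly.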
A.2 Setting the “Average Category” for Categorical vs. Numerical Data
Categorical Features
Handled automatically, with optional fine-tuning:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    cat_features=['job', 'city'],
    # Prior-related settings
    simple_ctr=['Borders:Prior=0.5:Prior=1.0', 'Counter:Prior=0.5'],
    # Borders: TS-based encoding
    # Counter: frequency-based encoding
    # Prior=X: explicit prior value
    # Model-size control (not smoothing): cap on the number of leaves that use categorical features
    ctr_leaf_count_limit=None,
)
By default, CatBoost estimates the prior automatically from the data, so no manual setting is required.
Numerical Features
Numerical features have no built-in “average value” concept. The missing-handling policy is set via nan_mode:
model = CatBoostClassifier(
    nan_mode='Min',  # 'Min' / 'Max' / 'Forbidden'
)
| Option | Meaning | Routing Result |
|---|---|---|
| 'Min' (default for numerical) | Treats NaN as smaller than the feature minimum | All missing values go left |
| 'Max' | Treats NaN as larger than the feature maximum | All missing values go right |
| 'Forbidden' | Raises an error if NaN is present | Cannot proceed |
If you want mean imputation for numerical features, do it manually before training:
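# Replace NaN with the column mean before training.
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may leave X unchanged.
X['feature'] = X['feature'].fillna(X['feature'].mean())
# CatBoost does not perform this preprocessing automatically.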
Summary: Categorical vs. Numerical “Average” Treatment
| Aspect | Categorical | Numerical |
|---|---|---|
| Automatic average substitution? | Yes (Prior applied automatically) | No (manual handling required) |
| Unknown / missing handling | Encoded as Prior, routed through tree | Routed in one direction by nan_mode |
| Relevant parameters | simple_ctr, combinations_ctr, ctr_target_border_count | nan_mode |
| Intuitive meaning | Treated as an “average category” | Treated as an extreme value with fixed direction |
Appendix B. Identifying Missing / Unknown Samples at Inference
None of the three libraries directly tells you which samples followed the missing or unknown path during inference. You can confirm this indirectly:
B.1 LightGBM
# Inspect default direction in the tree dump
model.booster_.dump_model() # contains default_left flags per node
# Track NaN positions externally beforehand
nan_mask = X_test.isna().any(axis=1)
B.2 XGBoost
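# Inspect tree structure
booster = model.get_booster()
booster.get_dump(with_stats=True)
# Each split node carries yes / no / missing branch info.
# Leaf indices show where a sample ended up, but not whether the missing branch was traversed.
# The scikit-learn wrapper exposes them via apply(); a raw Booster uses predict(dmatrix, pred_leaf=True).
leaf_indices = model.apply(X_test)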
B.3 CatBoost
# With nan_mode='Min'/'Max', missing is folded into the split threshold,
# so tracing a separate "missing branch" is impractical.
# In production, build a NaN / unknown mask in advance.
B.4 Practical Recommendations
- Save an X_test.isna() mask just before inference for downstream auditing.
- To detect unknown categories, store the training category set and compare against the inference data.
- For monitoring, track the missing and unknown rates as separate metrics.
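These recommendations can be combined into a short pandas routine. The column names, toy frames, and variable names below are assumptions for illustration; in practice the training vocabulary would be saved alongside the model.

```python
import pandas as pd

train = pd.DataFrame({"job": ["doctor", "student", "engineer"],
                      "age": [40.0, 21.0, 35.0]})
test = pd.DataFrame({"job": ["doctor", "pilot", None, "student"],
                     "age": [52.0, None, 30.0, 44.0]})

cat_cols = ["job"]
# Store the set of categories seen in training, per categorical column.
train_vocab = {col: set(train[col].dropna().unique()) for col in cat_cols}

# Missing mask: taken before any encoding turns unknowns into NaN.
nan_mask = test.isna()
# Unknown mask: value is present but was never seen during training.
unknown_mask = pd.DataFrame(
    {col: test[col].notna() & ~test[col].isin(train_vocab[col]) for col in cat_cols}
)

# Track the two rates as separate monitoring metrics.
missing_rate = nan_mask.mean()
unknown_rate = unknown_mask.mean()
print(missing_rate["job"], unknown_rate["job"])  # 0.25 0.25
```

Computing both masks on the raw inference frame matters: once the categorical column has been re-encoded against the training categories, unknown and missing values can no longer be told apart.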
