One-Hot Encoding Pitfalls and Countermeasures
9. CatBoost, XGBoost, and LightGBM — Practical Implementation
All three libraries offer native categorical handling that avoids the dimensionality blow-up one-hot encoding causes on high-cardinality features.
CatBoost — Ordered Target Encoding
CatBoost was designed around categorical data. Its Ordered Target Encoding draws random permutations of the training rows and computes each row's target statistic from only the rows that precede it in the permutation, which avoids the target leakage that plagues vanilla target encoding.
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
# Sample data with high-cardinality categoricals
df = pd.DataFrame({
    'city': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'],
    'product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'user_id': ['u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002'],
    'price': [100, 200, 150, 300, 250, 180],
    'target': [1, 0, 1, 0, 1, 0],
})
X = df.drop('target', axis=1)
y = df['target']
cat_features = ['city', 'product', 'user_id']
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    cat_features=cat_features,  # native categorical handling
    one_hot_max_size=4,         # one-hot only if cardinality <= 4
    eval_metric='AUC',
    verbose=0,
)
# Pool object lets you bundle data with categorical metadata
train_pool = Pool(X_tr, y_tr, cat_features=cat_features)
valid_pool = Pool(X_va, y_va, cat_features=cat_features)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
# Inspect categorical feature importance
print(model.get_feature_importance(prettified=True))
Key parameters:
- cat_features: list of column names or indices to treat as categorical
- one_hot_max_size: low-cardinality cutoff for which one-hot is preferred over target encoding
- Ordered Target Encoding is applied automatically for higher-cardinality features
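The "previous samples only" idea behind Ordered Target Encoding can be illustrated with a few lines of plain Python. This is a simplified sketch, not CatBoost's implementation: the function name, the `prior` and `prior_weight` arguments, and the use of a single permutation are assumptions made for illustration (CatBoost uses several permutations and its own prior formula).

```python
import random

def ordered_target_encode(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row using only the rows that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # one random permutation of the rows

    sums, counts = {}, {}               # running target sum / row count per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        prev_sum = sums.get(cat, 0.0)
        prev_count = counts.get(cat, 0)
        # The statistic uses earlier rows only; the row's own target is excluded
        encoded[idx] = (prev_sum + prior_weight * prior) / (prev_count + prior_weight)
        # Only now does the current row become visible to later rows
        sums[cat] = prev_sum + targets[idx]
        counts[cat] = prev_count + 1
    return encoded

cities = ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan']
target = [1, 0, 1, 0, 1, 0]
encoded = ordered_target_encode(cities, target)
print(encoded)
```

Because a row never contributes to its own statistic, the first row seen for any category falls back to the prior, and categories that occur once (here Daegu and Incheon) are always encoded as the prior.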
XGBoost — Native Categorical Support (1.5+)
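XGBoost introduced experimental native categorical support in version 1.5, initially for the GPU histogram method; support in the 'hist' and 'approx' tree methods, along with partition-based splitting as an alternative to one-hot encoding, followed in 1.6. It requires the category dtype in pandas.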
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
    'city': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'],
    'product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'grade': ['low', 'mid', 'high', 'mid', 'low', 'high'],  # ordinal
    'price': [100, 200, 150, 300, 250, 180],
    'target': [1, 0, 1, 0, 1, 0],
})
X = df.drop('target', axis=1)
y = df['target']
# Convert nominal columns to pandas 'category' dtype
for col in ['city', 'product']:
    X[col] = X[col].astype('category')
# Ordinal feature: encode as ordered integers so it is treated as a numeric
# feature and the monotone constraint below is well defined
X['grade'] = X['grade'].map({'low': 0, 'mid': 1, 'high': 2}).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    tree_method='hist',                 # histogram method; 'exact' does not support categoricals
    enable_categorical=True,            # turn on native categorical handling
    max_cat_to_onehot=4,                # one-hot based splits below this cardinality, else partition
    monotone_constraints={'grade': 1},  # enforce monotonic effect for ordinal feature
    eval_metric='auc',
)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
# Predict
preds = model.predict_proba(X_va)[:, 1]
print(model.feature_importances_)
Key parameters:
- enable_categorical=True: activates native handling
- tree_method='hist' or 'approx' (or 'gpu_hist' on pre-2.0 releases): required; the older 'exact' method does not support categoricals
- max_cat_to_onehot: threshold for one-hot vs. partition-based splitting
- monotone_constraints: combined with ordinal encoding, enforces monotonic predictions for ordered categories
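Since the category dtype is what carries the categorical information, it helps to fix the category set once, from the training data, and reuse it for every later frame. The sketch below uses only pandas; the frame names and `city_dtype` are illustrative. It shows that a shared `CategoricalDtype` gives the same integer code to the same label in both frames, and that a label unseen in training becomes missing (code -1) rather than silently taking a new code.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({'city': ['Seoul', 'Busan', 'Daegu', 'Seoul']})
valid = pd.DataFrame({'city': ['Busan', 'Incheon']})  # 'Incheon' never appears in train

# Fix the category set from the training data only
city_dtype = CategoricalDtype(categories=sorted(train['city'].unique()))

# Cast both frames with the same dtype so labels map to the same codes
train['city'] = train['city'].astype(city_dtype)
valid['city'] = valid['city'].astype(city_dtype)

train_codes = train['city'].cat.codes.tolist()
valid_codes = valid['city'].cat.codes.tolist()
print(train_codes)  # [2, 0, 1, 2]
print(valid_codes)  # [0, -1]  (unseen label -> missing)
```

Casting each frame separately with astype('category') would instead derive the categories from whatever labels that frame happens to contain, so the same label could receive different codes in different frames.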
LightGBM — categorical_feature Native Support
LightGBM was among the first major GBDT libraries to offer built-in categorical handling. Following Fisher (1958), it sorts a feature's categories by their accumulated gradient statistics (sum of gradients divided by sum of Hessians within each category) and then searches for the best split along that sorted order, which takes O(k·log k) time per split for k categories. For high-cardinality features this is typically far more efficient, and often more accurate, than one-hot encoding.
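The sort-then-scan idea can be sketched in plain Python. This is an illustration under simplifying assumptions, not LightGBM's code: the function name, the `l2` argument, and the standard second-order gain formula are choices made here, and the sketch leaves out cat_smooth, min_data_per_group, and max_cat_threshold.

```python
def best_categorical_split(grad_sum, hess_sum, l2=1.0):
    """Find a two-way partition of categories by sorting and scanning.

    grad_sum / hess_sum map each category to its summed gradients / Hessians
    (Hessian sums are assumed to be positive).
    """
    # Step 1: sort categories by gradient-to-Hessian ratio, O(k log k)
    order = sorted(grad_sum, key=lambda c: grad_sum[c] / hess_sum[c])

    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())

    def score(g, h):
        return g * g / (h + l2)

    # Step 2: scan the k - 1 prefixes of the sorted order, O(k)
    best_gain, best_left = 0.0, None
    left_g = left_h = 0.0
    for i, cat in enumerate(order[:-1]):
        left_g += grad_sum[cat]
        left_h += hess_sum[cat]
        gain = (score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - score(total_g, total_h))
        if gain > best_gain:
            best_gain, best_left = gain, set(order[:i + 1])
    return best_left, best_gain

grads = {'a': -4.0, 'b': -3.0, 'c': 3.0, 'd': 4.0}
hess = {'a': 2.0, 'b': 2.0, 'c': 2.0, 'd': 2.0}
left, gain = best_categorical_split(grads, hess)
print(left, gain)
```

Sorting first means only k - 1 candidate partitions are evaluated instead of the 2^(k-1) - 1 possible two-way groupings, which is where the efficiency gain over an exhaustive search comes from.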
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
df = pd.DataFrame({
    'city': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan',
             'Seoul', 'Daegu', 'Busan', 'Incheon'],
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B'],
    'user_id': ['u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002',
                'u_005', 'u_003', 'u_006', 'u_004'],
    'grade': ['low', 'mid', 'high', 'mid', 'low', 'high',
              'mid', 'high', 'low', 'mid'],
    'price': [100, 200, 150, 300, 250, 180, 220, 170, 310, 240],
    'target': [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})
# grade: ordinal → numeric (required for monotone constraint)
grade_map = {'low': 0, 'mid': 1, 'high': 2}
df['grade'] = df['grade'].map(grade_map).astype(int)
# nominal features: cast to category dtype so LightGBM encodes them internally
for col in ['city', 'product', 'user_id']:
    df[col] = df[col].astype('category')
X = df.drop('target', axis=1)
y = df['target']
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
feature_cols = X.columns.tolist()
mono = [1 if c == 'grade' else 0 for c in feature_cols] # +1 = increasing for grade only
# --- sklearn API ---
print("=" * 50)
print("sklearn API")
print("=" * 50)
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,
    num_leaves=31,
    min_data_in_leaf=1,    # relaxed for small dataset
    cat_smooth=10,         # smoothing for rare categories
    cat_l2=10,             # L2 regularization on categorical splits
    max_cat_threshold=32,  # max categories considered per split
    monotone_constraints=mono,
    objective='binary',
    metric='auc',
    verbose=-1,
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_va, y_va)],
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded: numeric with monotone constraint
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)
preds_proba = model.predict_proba(X_va)[:, 1]
preds_label = model.predict(X_va)
print(f"Best iteration : {model.best_iteration_}")
print(f"Validation AUC : {roc_auc_score(y_va, preds_proba):.4f}")
print()
print("Classification report:")
print(classification_report(y_va, preds_label, target_names=['class 0', 'class 1']))
importance = dict(zip(feature_cols, model.feature_importances_))
print("Feature importance (split):")
for feat, score in sorted(importance.items(), key=lambda x: -x[1]):
    print(f" {feat:<12}: {score}")
# --- Native (Dataset) API ---
print()
print("=" * 50)
print("Native API")
print("=" * 50)
train_ds = lgb.Dataset(
    X_tr, label=y_tr,
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded
    free_raw_data=False,
)
valid_ds = lgb.Dataset(
    X_va, label=y_va,
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded
    reference=train_ds,  # ensures consistent encoding
)
params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'cat_smooth': 10,
    'cat_l2': 10,
    'max_cat_threshold': 32,
    'monotone_constraints': mono,
    'verbose': -1,
}
booster = lgb.train(
    params,
    train_ds,
    num_boost_round=500,
    valid_sets=[valid_ds],
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)
native_proba = booster.predict(X_va)
native_label = (native_proba >= 0.5).astype(int)
print(f"Best iteration : {booster.best_iteration}")
print(f"Validation AUC : {roc_auc_score(y_va, native_proba):.4f}")
print()
print("Classification report:")
print(classification_report(y_va, native_label, target_names=['class 0', 'class 1']))
importance_native = booster.feature_importance(importance_type='split')
print("Feature importance (split):")
for feat, score in sorted(zip(feature_cols, importance_native), key=lambda x: -x[1]):
    print(f" {feat:<12}: {score}")
print()
print("=" * 50)
print("Monotone constraint check (grade: low=0, mid=1, high=2)")
print("=" * 50)
check_df = pd.DataFrame({
    'city': ['Seoul'] * 3,
    'product': ['A'] * 3,
    'user_id': ['u_001'] * 3,
    'grade': [0, 1, 2],  # low → mid → high
    'price': [200] * 3,
}).astype({'city': 'category', 'product': 'category', 'user_id': 'category'})
sklearn_proba = model.predict_proba(check_df)[:, 1]
native_proba_check = booster.predict(check_df)
print(f"{'grade':<10} {'sklearn prob':>14} {'native prob':>12}")
print("-" * 38)
for grade, sp, np_ in zip(['low', 'mid', 'high'], sklearn_proba, native_proba_check):
    print(f"{grade:<10} {sp:>14.4f} {np_:>12.4f}")
Results:
=======================================================
Monotone constraint check (grade: low=0, mid=1, high=2)
=======================================================
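grade        sklearn prob  native prob
--------------------------------------
low                0.5927       0.5714
mid                0.5927       0.5714
high               0.5927       0.5714
All three grades receive the same probability, so the output is non-decreasing only in the trivial sense: it is consistent with the constraint but does not demonstrate it. The native value 0.5714 equals 4/7, which is consistent with the positive rate of a seven-row training split, and the native params leave min_data_in_leaf at its default of 20, so no split is possible on this toy dataset. A larger dataset is needed to see the constraint take effect.
Key parameters: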
- categorical_feature: list of column names or integer indices. When using category dtype in pandas, this list can be omitted because LightGBM auto-detects such columns. Explicit specification is still recommended for clarity and reproducibility.
- cat_smooth (default 10): smoothing term for categorical target statistics; larger values reduce overfitting on rare categories.
- cat_l2 (default 10): L2 regularization applied specifically to categorical splits.
- max_cat_threshold (default 32): maximum number of category groups considered in a single split; caps computation for very high-cardinality features.
- min_data_per_group (default 100): minimum observations required per category group; raise this for high-cardinality features with many rare levels.
- max_cat_to_onehot (default 4): if cardinality ≤ this threshold, LightGBM falls back to one-hot style (one-vs-other) splits, which work well for very low cardinality; otherwise it uses partition-based splitting.
- monotone_constraints: list of -1 / 0 / +1 aligned with feature order; combines with ordinal encoding to enforce monotonic predictions.
Important caveats:
- Categorical values must be non-negative integers internally. When using the Dataset API directly with NumPy arrays, encode your categories to integer codes first (e.g., df[col].cat.codes). The pandas category dtype route handles this automatically.
- LightGBM treats categorical features fundamentally differently from numerical ones: it never treats them as ordered. To exploit an ordering, encode the feature as integers and leave it out of categorical_feature, as done for grade above; it is then handled as a numerical feature and can carry a monotone constraint.
- For extremely high-cardinality features (> 100,000 unique values), consider combining with target encoding or hashing upstream, as the internal Fisher grouping may still become expensive.
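The hashing option in the last caveat can be sketched with the standard library alone. The helper name `hash_bucket` and the bucket count of 1024 are illustrative assumptions; the right bucket count depends on the data and on how many collisions are tolerable.

```python
import hashlib

def hash_bucket(value, n_buckets=1024):
    """Map an arbitrary category label to a stable bucket in [0, n_buckets)."""
    # A cryptographic digest is used only for its stability across runs and
    # machines; Python's built-in hash() is salted per process for strings.
    digest = hashlib.sha256(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

user_ids = ['u_001', 'u_002', 'u_003', 'u_001']
buckets = [hash_bucket(u) for u in user_ids]
print(buckets)
```

The resulting integers are non-negative and bounded, so they can be passed on as a categorical column with at most n_buckets levels. The trade-off is that distinct labels may collide in the same bucket.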
Comparison: Entity Embedding vs. CatBoost vs. XGBoost vs. LightGBM
| Aspect | Entity Embedding (PyTorch) | CatBoost | XGBoost | LightGBM |
|---|---|---|---|---|
| Encoding mechanism | Learnable dense vectors | Ordered Target Encoding | Optimal partition splitting | Fisher-based partition grouping |
| Leakage protection | Inherent | Built-in (ordered boosting) | Manual | Manual (smoothing helps) |
| Speed on high cardinality | Fast (matrix lookup) | Moderate | Fast | Fastest (histogram + Fisher) |
| Memory efficiency | High | Moderate | High | Highest |
| Ordinal handling | Order embeddings | monotone_constraints | monotone_constraints | monotone_constraints |
| Cardinality cap for one-hot fallback | N/A | one_hot_max_size | max_cat_to_onehot | max_cat_to_onehot |
| Best fit | Deep learning pipelines | Category-heavy tabular data | XGBoost-based stacks | Large datasets, speed-critical |
| Setup overhead | Define embedding layers | Pass cat_features | Convert to category dtype | Convert to category dtype |
LightGBM is often the fastest choice for large tabular datasets with many high-cardinality categorical features. Its Fisher-based partition algorithm tends to scale better than XGBoost's approach when cardinality exceeds a few hundred, making it a strong default for industrial-scale categorical data, though the size of the gap depends on the dataset and the library versions involved.
