One-Hot Encoding Pitfalls and Countermeasures

9. CatBoost, XGBoost, and LightGBM — Practical Implementation

All three libraries offer native categorical handling that avoids most of one-hot encoding’s dimensionality problems.

CatBoost — Ordered Target Encoding

CatBoost was designed with categorical data in mind. Its Ordered Target Encoding shuffles the rows with random permutations and computes each row's target statistic from the “previous” rows only, so a row's own label never feeds its encoding. This greatly reduces the target leakage that plagues vanilla target encoding.

import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

# Sample data with high-cardinality categoricals
df = pd.DataFrame({
    'city':     ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'],
    'product':  ['A', 'B', 'A', 'C', 'B', 'A'],
    'user_id':  ['u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002'],
    'price':    [100, 200, 150, 300, 250, 180],
    'target':   [1, 0, 1, 0, 1, 0],
})

X = df.drop('target', axis=1)
y = df['target']

cat_features = ['city', 'product', 'user_id']

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    cat_features=cat_features,        # native categorical handling
    one_hot_max_size=4,               # one-hot only if cardinality <= 4
    eval_metric='AUC',
    verbose=0,
)

# Pool object lets you bundle data with categorical metadata
train_pool = Pool(X_tr, y_tr, cat_features=cat_features)
valid_pool = Pool(X_va, y_va, cat_features=cat_features)

model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

# Inspect feature importance (covers all features, categorical and numeric)
print(model.get_feature_importance(prettified=True))

Key parameters:

  • cat_features: list of column names or indices to treat as categorical
  • one_hot_max_size: low-cardinality cutoff for which one-hot is preferred over target encoding
  • Ordered Target Encoding is applied automatically for higher-cardinality features
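
The ordered target statistic described above can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's actual implementation: CatBoost uses several permutations and its own prior settings, and the names `ordered_target_stats`, `prior`, and `prior_weight` are hypothetical choices made for this example.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row using only rows that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums, counts = {}, {}              # running target sum / row count per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # The row's own target is NOT included: only "previous" rows contribute.
        encoded[idx] = (s + prior_weight * prior) / (n + prior_weight)
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded

cities = ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan']
target = [1, 0, 1, 0, 1, 0]
encoded = ordered_target_stats(cities, target)
print(encoded)
```

A category seen for the first time in the permutation falls back to the prior, and flipping a row's own label leaves that row's encoding unchanged, which is the property that limits leakage.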

XGBoost — Native Categorical Support (1.5+)

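XGBoost introduced native categorical support as an experimental feature in version 1.5; partition-based splitting and the max_cat_to_onehot parameter followed in 1.6. Instead of expanding a feature into one-hot columns, the tree builder splits the categories into two groups directly. Categorical columns must use the pandas category dtype.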

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'city':     ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'],
    'product':  ['A', 'B', 'A', 'C', 'B', 'A'],
    'grade':    ['low', 'mid', 'high', 'mid', 'low', 'high'],  # ordinal
    'price':    [100, 200, 150, 300, 250, 180],
    'target':   [1, 0, 1, 0, 1, 0],
})

X = df.drop('target', axis=1)
y = df['target']

# Convert categorical columns to pandas 'category' dtype
for col in ['city', 'product']:
    X[col] = X[col].astype('category')

# Ordinal feature: encode as integers so it is split numerically.
# A monotone constraint is only meaningful on a numeric (ordered) column.
X['grade'] = X['grade'].map({'low': 0, 'mid': 1, 'high': 2}).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    tree_method='hist',                  # 'hist' or 'approx'; 'exact' does not support categoricals
    enable_categorical=True,             # turn on native categorical handling
    max_cat_to_onehot=4,                 # one-hot splits below this cardinality, else partition
    monotone_constraints={'grade': 1},   # enforce monotonic effect for ordinal feature
    eval_metric='auc',
)

model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)

# Predict
preds = model.predict_proba(X_va)[:, 1]
print(model.feature_importances_)

Key parameters:

  • enable_categorical=True: activates native handling
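  • tree_method='hist' (or 'approx'): required, since the older 'exact' method does not support categoricals. For GPU training, XGBoost 2.0+ uses device='cuda' with tree_method='hist'; 'gpu_hist' is deprecated.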
  • max_cat_to_onehot: threshold for one-hot vs. partition-based splitting
  • monotone_constraints: combined with ordinal encoding, enforces monotonic predictions for ordered categories
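
Because the category dtype is the interface between pandas and the model, training and inference data should share the same category list. The sketch below uses only pandas and assumes the model relies on the integer codes behind the dtype; recent XGBoost releases may re-code categories automatically, so treat this as a precaution rather than a requirement. The frames `train` and `new` and the value 'Jeju' are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({'city': ['Seoul', 'Busan', 'Daegu', 'Seoul']})
new = pd.DataFrame({'city': ['Daegu', 'Seoul', 'Jeju']})   # 'Jeju' never appeared in training

# Freeze the category list learned from the training data.
city_dtype = pd.CategoricalDtype(categories=sorted(train['city'].unique()))

train['city'] = train['city'].astype(city_dtype)
new['city'] = new['city'].astype(city_dtype)   # values outside the list become NaN

print(train['city'].cat.codes.tolist())   # [2, 0, 1, 2]
print(new['city'].cat.codes.tolist())     # [1, 2, -1]  (-1 marks a missing value)
```

Calling astype('category') separately on each frame would instead derive the codes from whatever values happen to be present, so the same city could map to different codes at training and prediction time.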

LightGBM — categorical_feature Native Support

LightGBM was the first major GBDT library to offer built-in categorical handling. Following Fisher (1958), it sorts the categories at each split by their accumulated gradient statistics (sum of gradients divided by sum of hessians) and then searches for the best partition along that sorted order, which costs O(k·log k) per split for k categories. For high-cardinality features this is typically far more efficient than one-hot encoding and often more accurate.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report

df = pd.DataFrame({
    'city': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan',
             'Seoul', 'Daegu', 'Busan', 'Incheon'],
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B'],
    'user_id': ['u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002',
                'u_005', 'u_003', 'u_006', 'u_004'],
    'grade': ['low', 'mid', 'high', 'mid', 'low', 'high',
              'mid', 'high', 'low', 'mid'],
    'price': [100, 200, 150, 300, 250, 180, 220, 170, 310, 240],
    'target': [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

# grade: ordinal → numeric (required for monotone constraint)
grade_map = {'low': 0, 'mid': 1, 'high': 2}
df['grade'] = df['grade'].map(grade_map).astype(int)

# nominal features: cast to category dtype so LightGBM encodes them internally
for col in ['city', 'product', 'user_id']:
    df[col] = df[col].astype('category')

X = df.drop('target', axis=1)
y = df['target']

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

feature_cols = X.columns.tolist()
mono = [1 if c == 'grade' else 0 for c in feature_cols]  # +1 = increasing for grade only

# --- sklearn API ---
print("=" * 50)
print("sklearn API")
print("=" * 50)

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,
    num_leaves=31,
    min_data_in_leaf=1,      # relaxed for small dataset
    cat_smooth=10,           # smoothing for rare categories
    cat_l2=10,               # L2 regularization on categorical splits
    max_cat_threshold=32,    # max categories considered per split
    monotone_constraints=mono,
    objective='binary',
    metric='auc',
    verbose=-1,
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_va, y_va)],
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded: numeric with monotone constraint
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)

preds_proba = model.predict_proba(X_va)[:, 1]
preds_label = model.predict(X_va)

print(f"Best iteration   : {model.best_iteration_}")
print(f"Validation AUC   : {roc_auc_score(y_va, preds_proba):.4f}")
print()
print("Classification report:")
print(classification_report(y_va, preds_label, target_names=['class 0', 'class 1']))

importance = dict(zip(feature_cols, model.feature_importances_))
print("Feature importance (split):")
for feat, score in sorted(importance.items(), key=lambda x: -x[1]):
    print(f"  {feat:<12}: {score}")

# --- Native (Dataset) API ---
print()
print("=" * 50)
print("Native API")
print("=" * 50)

train_ds = lgb.Dataset(
    X_tr, label=y_tr,
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded
    free_raw_data=False,
)
valid_ds = lgb.Dataset(
    X_va, label=y_va,
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded
    reference=train_ds,                                   # ensures consistent encoding
)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'cat_smooth': 10,
    'cat_l2': 10,
    'max_cat_threshold': 32,
    'monotone_constraints': mono,
    'verbose': -1,
}

booster = lgb.train(
    params,
    train_ds,
    num_boost_round=500,
    valid_sets=[valid_ds],
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)

native_proba = booster.predict(X_va)
native_label = (native_proba >= 0.5).astype(int)

print(f"Best iteration   : {booster.best_iteration}")
print(f"Validation AUC   : {roc_auc_score(y_va, native_proba):.4f}")
print()
print("Classification report:")
print(classification_report(y_va, native_label, target_names=['class 0', 'class 1']))

importance_native = booster.feature_importance(importance_type='split')
print("Feature importance (split):")
for feat, score in sorted(zip(feature_cols, importance_native), key=lambda x: -x[1]):
    print(f"  {feat:<12}: {score}")

print()
print("=" * 50)
print("Monotone constraint check (grade: low=0, mid=1, high=2)")
print("=" * 50)
check_df = pd.DataFrame({
    'city':    ['Seoul'] * 3,
    'product': ['A'] * 3,
    'user_id': ['u_001'] * 3,
    'grade':   [0, 1, 2],          # low → mid → high
    'price':   [200] * 3,
}).astype({'city': 'category', 'product': 'category', 'user_id': 'category'})

sklearn_proba = model.predict_proba(check_df)[:, 1]
native_proba_check = booster.predict(check_df)

print(f"{'grade':<10} {'sklearn prob':>14} {'native prob':>12}")
print("-" * 38)
for grade, sp, np_ in zip(['low', 'mid', 'high'], sklearn_proba, native_proba_check):
    print(f"{grade:<10} {sp:>14.4f} {np_:>12.4f}")

Results (only the final monotone-constraint check is shown):

=======================================================
Monotone constraint check (grade: low=0, mid=1, high=2)
=======================================================
grade        sklearn prob  native prob
--------------------------------------
low                0.5927       0.5714
mid                0.5927       0.5714
high               0.5927       0.5714
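
All three grades receive the same probability in both models, so the output satisfies the non-decreasing constraint only trivially; on a dataset this small the trees most likely never split on grade. A small helper makes both facts explicit. The function names `is_non_decreasing` and `is_constant` are hypothetical, and the probabilities are copied from the output above.

```python
def is_non_decreasing(values, tol=1e-12):
    """Return True if each value is >= the previous one (within a tolerance)."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))

def is_constant(values, tol=1e-12):
    """Return True if all values are equal (within a tolerance)."""
    return max(values) - min(values) <= tol

# Probabilities ordered low -> mid -> high.
sklearn_probs = [0.5927, 0.5927, 0.5927]
native_probs = [0.5714, 0.5714, 0.5714]

for name, probs in [('sklearn', sklearn_probs), ('native', native_probs)]:
    print(name, 'monotone:', is_non_decreasing(probs), 'flat:', is_constant(probs))
```

On a realistically sized dataset the same check should report monotone: True with flat: False if grade actually influences the prediction.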

Key parameters:

  • categorical_feature: list of column names or integer indices. When using category dtype in pandas, this list can be omitted — LightGBM auto-detects. Explicit specification is still recommended for clarity and reproducibility.
  • cat_smooth (default 10): smoothing term for categorical target statistics; larger values reduce overfitting on rare categories.
  • cat_l2 (default 10): L2 regularization applied specifically to categorical splits.
  • max_cat_threshold (default 32): maximum number of category groups considered in a single split — caps computation for very high-cardinality features.
  • min_data_per_group (default 100): minimum observations required per category group; raise this for high-cardinality features with many rare levels.
  • max_cat_to_onehot (default 4): if cardinality ≤ this threshold, LightGBM uses one-vs-rest splits, the equivalent of one-hot encoding, which works well for very low cardinality; otherwise it uses partition-based splitting.
  • monotone_constraints: list of -1 / 0 / +1 aligned with feature order; combines with ordinal encoding to enforce monotonic predictions.
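
To make the role of cat_smooth and cat_l2 more concrete, here is a simplified sketch of a sorted-partition search in the spirit of the Fisher-style approach. It is an illustration under stated assumptions, not LightGBM's code: the real implementation works on histogram bins and also applies min_data_per_group and max_cat_threshold. The function `best_categorical_split` and the gradient values are invented for the example.

```python
def best_categorical_split(grad_sum, hess_sum, cat_smooth=10.0, cat_l2=10.0):
    """Find a left/right grouping of categories by scanning a sorted order.

    grad_sum / hess_sum map category -> summed gradient / hessian in the node.
    """
    # Sort categories by a smoothed gradient-to-hessian ratio: O(k log k).
    order = sorted(grad_sum, key=lambda c: grad_sum[c] / (hess_sum[c] + cat_smooth))

    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())

    def score(g, h):
        # Second-order leaf score with L2 regularization.
        return g * g / (h + cat_l2)

    best_gain, best_left = float('-inf'), None
    left_g = left_h = 0.0
    # Only the k - 1 prefix splits of the sorted order are evaluated,
    # instead of every possible subset of categories.
    for i, cat in enumerate(order[:-1]):
        left_g += grad_sum[cat]
        left_h += hess_sum[cat]
        gain = (score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - score(total_g, total_h))
        if gain > best_gain:
            best_gain, best_left = gain, set(order[:i + 1])
    return best_left, best_gain

grad = {'Seoul': -4.0, 'Busan': 3.0, 'Daegu': -3.5, 'Incheon': 2.5}
hess = {'Seoul': 5.0, 'Busan': 5.0, 'Daegu': 5.0, 'Incheon': 5.0}
left, gain = best_categorical_split(grad, hess)
print(sorted(left), round(gain, 3))
```

A larger cat_smooth pulls the sorting ratio of sparsely populated categories toward zero, and a larger cat_l2 shrinks the gain of every candidate split, which is why both act as regularizers.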

Important caveats:

  • Categorical values must be non-negative integers internally. When using the Dataset API directly with NumPy arrays, encode your categories to integer codes first (e.g., df[col].cat.codes). The pandas category dtype route handles this automatically.
  • LightGBM treats every feature listed as categorical as an unordered set, never as an ordered scale. To exploit a natural order, encode the column as integers and leave it out of categorical_feature so it is split numerically; monotone_constraints can then be applied to it.
  • For extremely high-cardinality features (> 100,000 unique values), consider combining with target encoding or hashing upstream, as the internal Fisher grouping may still become expensive.
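
The hashing option mentioned in the last caveat can be sketched with the standard library alone. This is a minimal illustration, assuming string-like category values; the helper `hash_bucket` and the bucket count are hypothetical and would need tuning against the collision rate you can tolerate.

```python
import hashlib

def hash_bucket(value, n_buckets=1024):
    """Map a category value to a stable bucket id in [0, n_buckets)."""
    # hashlib is used instead of built-in hash(), which is salted per process
    # for strings and would give different buckets on every run.
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

user_ids = ['u_001', 'u_002', 'u_003', 'u_001']
buckets = [hash_bucket(u, n_buckets=64) for u in user_ids]
print(buckets)
```

The resulting integers are non-negative and bounded, so they can be passed on as a categorical column with a fixed, manageable cardinality. Distinct values may collide in the same bucket, which trades some signal for lower cost.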

Comparison: Entity Embedding vs. CatBoost vs. XGBoost vs. LightGBM

| Aspect | Entity Embedding (PyTorch) | CatBoost | XGBoost | LightGBM |
|---|---|---|---|---|
| Encoding mechanism | Learnable dense vectors | Ordered Target Encoding | Optimal partition splitting | Fisher-based partition grouping |
| Leakage protection | Inherent | Built-in (ordered boosting) | Manual | Manual (smoothing helps) |
| Speed on high cardinality | Fast (matrix lookup) | Moderate | Fast | Fastest (histogram + Fisher) |
| Memory efficiency | High | Moderate | High | Highest |
| Ordinal handling | Order embeddings | monotone_constraints | monotone_constraints | monotone_constraints |
| Cardinality cap for one-hot fallback | N/A | one_hot_max_size | max_cat_to_onehot | max_cat_to_onehot |
| Best fit | Deep learning pipelines | Category-heavy tabular data | XGBoost-based stacks | Large datasets, speed-critical |
| Setup overhead | Define embedding layers | Pass cat_features | Convert to category dtype | Convert to category dtype |

LightGBM is often the fastest choice for large tabular datasets with many high-cardinality categorical features, which makes it a strong default for industrial-scale categorical data. How its Fisher-based partitioning compares with XGBoost’s approach at cardinalities in the hundreds or more depends on the data and the library versions, so benchmark both before committing.
