One-Hot Encoding Pitfalls and Countermeasures

9. CatBoost, XGBoost, and LightGBM — Practical Implementation

All three libraries offer native categorical handling that avoids the dimensionality blow-up of one-hot encoding.

CatBoost — Ordered Target Encoding

CatBoost is designed with categorical data in mind. Its Ordered Target Encoding draws random permutations of the training rows and computes each row's target statistic from only the rows that precede it in the permutation, which counters the target leakage that plagues vanilla target encoding.

import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

# Sample data with high-cardinality categoricals
df = pd.DataFrame({
    'city':     ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'],
    'product':  ['A', 'B', 'A', 'C', 'B', 'A'],
    'user_id':  ['u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002'],
    'price':    [100, 200, 150, 300, 250, 180],
    'target':   [1, 0, 1, 0, 1, 0],
})

X = df.drop('target', axis=1)
y = df['target']

cat_features = ['city', 'product', 'user_id']

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    cat_features=cat_features,        # native categorical handling
    one_hot_max_size=4,               # one-hot only if cardinality <= 4
    eval_metric='AUC',
    verbose=0,
)

# Pool object lets you bundle data with categorical metadata
train_pool = Pool(X_tr, y_tr, cat_features=cat_features)
valid_pool = Pool(X_va, y_va, cat_features=cat_features)

model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

# Inspect categorical feature importance
print(model.get_feature_importance(prettified=True))

Key parameters:

cat_features: list of column names or indices to treat as categorical
one_hot_max_size: cardinality cutoff; features with at most this many distinct values are one-hot encoded, and the rest are encoded with target statistics
Ordered Target Encoding is applied automatically for higher-cardinality features
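
To make the "only previous samples" idea concrete, here is a minimal pure-Python sketch of an ordered target statistic. It is not CatBoost's implementation: CatBoost uses several permutations and its own prior settings, while the function name `ordered_target_stats` and the `prior` / `prior_weight` arguments are hypothetical and chosen only for illustration.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row using only rows that precede it in a random permutation.

    Hypothetical helper for illustration; not part of the CatBoost API.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)          # one random permutation of the rows

    sums, counts = {}, {}                       # running target sum / count per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # The row's own target is NOT included: only earlier rows contribute.
        encoded[idx] = (s + prior_weight * prior) / (n + prior_weight)
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded, order


cities = ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan']
targets = [1, 0, 1, 0, 1, 0]

encoded, order = ordered_target_stats(cities, targets)
for idx in order:
    print(cities[idx], targets[idx], round(encoded[idx], 3))
```

The first time a category appears in the permutation it receives the prior, because no earlier rows exist to average over. Since a row's own label never enters its own encoding, the statistic cannot simply memorize the target.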

XGBoost — Native Categorical Support (1.5+)

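XGBoost introduced experimental native categorical support in version 1.5. Partition-based splitting, as opposed to one-hot style splits, followed in 1.6 together with the max_cat_to_onehot parameter. The feature requires the category dtype in pandas and enable_categorical=True.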

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'city':     ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'],
    'product':  ['A', 'B', 'A', 'C', 'B', 'A'],
    'grade':    ['low', 'mid', 'high', 'mid', 'low', 'high'],  # ordinal
    'price':    [100, 200, 150, 300, 250, 180],
    'target':   [1, 0, 1, 0, 1, 0],
})

X = df.drop('target', axis=1)
y = df['target']

# Convert categorical columns to pandas 'category' dtype
for col in ['city', 'product']:
    X[col] = X[col].astype('category')

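# Ordinal feature: encode as integers so XGBoost treats it as numeric.
# A 'category' column is split as an unordered set, so a monotone
# constraint on it would not be meaningful.
X['grade'] = X['grade'].map({'low': 0, 'mid': 1, 'high': 2}).astype(int)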

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    tree_method='hist',                  # required for categorical support
    enable_categorical=True,             # turn on native categorical handling
    max_cat_to_onehot=4,                 # one-hot if cardinality <= 4, else partition
    monotone_constraints={'grade': 1},   # enforce monotonic effect for ordinal feature
    eval_metric='auc',
)

model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)

# Predict
preds = model.predict_proba(X_va)[:, 1]
print(model.feature_importances_)

Key parameters:

enable_categorical=True: activates native handling
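tree_method='hist' or 'approx': required (the older 'exact' method does not support categoricals). On GPU, releases before 2.0 use 'gpu_hist'; from 2.0 onward the documented form is tree_method='hist' with device='cuda'.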
max_cat_to_onehot: threshold for one-hot vs. partition-based splitting
monotone_constraints: combined with ordinal encoding, enforces monotonic predictions for ordered categories
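
A monotone constraint is easy to sanity-check after training: predict on probe rows that differ only in the ordinal feature and confirm the predictions never move in the wrong direction. The helper below is a minimal sketch of that check; `is_monotone` is a hypothetical name, and the probability lists are illustrative placeholders rather than output from any model in this post.

```python
def is_monotone(preds, increasing=True, tol=1e-9):
    """Return True if preds never moves against the expected direction."""
    values = list(preds)
    pairs = zip(values, values[1:])
    if increasing:
        return all(b >= a - tol for a, b in pairs)
    return all(b <= a + tol for a, b in pairs)


# Placeholder probabilities for grade = low, mid, high (illustrative only)
ok_probs = [0.41, 0.41, 0.58]     # flat then rising: allowed under a +1 constraint
bad_probs = [0.41, 0.39, 0.58]    # dips at 'mid': violates a +1 constraint

print(is_monotone(ok_probs))
print(is_monotone(bad_probs))
```

Flat segments count as monotone, which matters in practice: a model that never splits on the constrained feature passes the check trivially.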

LightGBM — categorical_feature Native Support

LightGBM was among the first major GBDT libraries to offer built-in categorical handling. Following Fisher (1958), it sorts the k categories of a feature by their accumulated gradient statistics (sum of gradients divided by sum of hessians) and then scans the sorted order for the best partition, which costs O(k·log k) per split. This is often more efficient and more accurate than one-hot encoding for high-cardinality features.
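
The sort-then-scan idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LightGBM's code: the function name `best_categorical_split`, the gain formula, and the way `cat_smooth` and `reg_lambda` enter are assumptions made for the example, and real implementations add further limits such as max_cat_threshold and min_data_per_group.

```python
def best_categorical_split(grad_sums, hess_sums, reg_lambda=1.0, cat_smooth=10.0):
    """Return (gain, left_categories) for one categorical feature.

    grad_sums / hess_sums map each category to its summed gradient / hessian.
    Hypothetical helper for illustration only.
    """
    # Step 1: sort categories by smoothed gradient/hessian ratio, O(k log k).
    ordered = sorted(
        grad_sums,
        key=lambda c: grad_sums[c] / (hess_sums[c] + cat_smooth),
    )

    total_g = sum(grad_sums.values())
    total_h = sum(hess_sums.values())

    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(total_g, total_h)

    # Step 2: scan the k - 1 prefix splits of the sorted order, O(k).
    best_gain, best_left = 0.0, []
    left_g = left_h = 0.0
    for i, cat in enumerate(ordered[:-1]):
        left_g += grad_sums[cat]
        left_h += hess_sums[cat]
        gain = score(left_g, left_h) + score(total_g - left_g, total_h - left_h) - parent
        if gain > best_gain:
            best_gain, best_left = gain, ordered[: i + 1]
    return best_gain, best_left


grad = {'Seoul': -4.0, 'Busan': 3.0, 'Daegu': -3.5, 'Incheon': 2.5}
hess = {'Seoul': 5.0, 'Busan': 5.0, 'Daegu': 5.0, 'Incheon': 5.0}

gain, left = best_categorical_split(grad, hess)
print(left, round(gain, 4))
```

Only k - 1 candidate partitions are evaluated instead of the 2^(k-1) - 1 possible subsets, which is what keeps the search tractable for high-cardinality features.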

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report

df = pd.DataFrame({
    'city': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan',
             'Seoul', 'Daegu', 'Busan', 'Incheon'],
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B'],
    'user_id': ['u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002',
                'u_005', 'u_003', 'u_006', 'u_004'],
    'grade': ['low', 'mid', 'high', 'mid', 'low', 'high',
              'mid', 'high', 'low', 'mid'],
    'price': [100, 200, 150, 300, 250, 180, 220, 170, 310, 240],
    'target': [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

# grade: ordinal → numeric (required for monotone constraint)
grade_map = {'low': 0, 'mid': 1, 'high': 2}
df['grade'] = df['grade'].map(grade_map).astype(int)

# nominal features: cast to category dtype so LightGBM encodes them internally
for col in ['city', 'product', 'user_id']:
    df[col] = df[col].astype('category')

X = df.drop('target', axis=1)
y = df['target']

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

feature_cols = X.columns.tolist()
mono = [1 if c == 'grade' else 0 for c in feature_cols]  # +1 = increasing for grade only

# --- sklearn API ---
print("=" * 50)
print("sklearn API")
print("=" * 50)

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,
    num_leaves=31,
    min_data_in_leaf=1,      # relaxed for small dataset
    cat_smooth=10,           # smoothing for rare categories
    cat_l2=10,               # L2 regularization on categorical splits
    max_cat_threshold=32,    # max categories considered per split
    monotone_constraints=mono,
    objective='binary',
    metric='auc',
    verbose=-1,
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_va, y_va)],
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded: numeric with monotone constraint
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)

preds_proba = model.predict_proba(X_va)[:, 1]
preds_label = model.predict(X_va)

print(f"Best iteration   : {model.best_iteration_}")
print(f"Validation AUC   : {roc_auc_score(y_va, preds_proba):.4f}")
print()
print("Classification report:")
print(classification_report(y_va, preds_label, target_names=['class 0', 'class 1']))

importance = dict(zip(feature_cols, model.feature_importances_))
print("Feature importance (split):")
for feat, score in sorted(importance.items(), key=lambda x: -x[1]):
    print(f"  {feat:<12}: {score}")

# --- Native (Dataset) API ---
print()
print("=" * 50)
print("Native API")
print("=" * 50)

train_ds = lgb.Dataset(
    X_tr, label=y_tr,
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded
    free_raw_data=False,
)
valid_ds = lgb.Dataset(
    X_va, label=y_va,
    categorical_feature=['city', 'product', 'user_id'],  # grade excluded
    reference=train_ds,                                   # ensures consistent encoding
)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'cat_smooth': 10,
    'cat_l2': 10,
    'max_cat_threshold': 32,
    'monotone_constraints': mono,
    'verbose': -1,
}

booster = lgb.train(
    params,
    train_ds,
    num_boost_round=500,
    valid_sets=[valid_ds],
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)

native_proba = booster.predict(X_va)
native_label = (native_proba >= 0.5).astype(int)

print(f"Best iteration   : {booster.best_iteration}")
print(f"Validation AUC   : {roc_auc_score(y_va, native_proba):.4f}")
print()
print("Classification report:")
print(classification_report(y_va, native_label, target_names=['class 0', 'class 1']))

importance_native = booster.feature_importance(importance_type='split')
print("Feature importance (split):")
for feat, score in sorted(zip(feature_cols, importance_native), key=lambda x: -x[1]):
    print(f"  {feat:<12}: {score}")

print()
print("=" * 50)
print("Monotone constraint check (grade: low=0, mid=1, high=2)")
print("=" * 50)
check_df = pd.DataFrame({
    'city':    ['Seoul'] * 3,
    'product': ['A'] * 3,
    'user_id': ['u_001'] * 3,
    'grade':   [0, 1, 2],          # low → mid → high
    'price':   [200] * 3,
}).astype({'city': 'category', 'product': 'category', 'user_id': 'category'})

sklearn_proba = model.predict_proba(check_df)[:, 1]
native_proba_check = booster.predict(check_df)

print(f"{'grade':<10} {'sklearn prob':>14} {'native prob':>12}")
print("-" * 38)
for grade, sp, np_ in zip(['low', 'mid', 'high'], sklearn_proba, native_proba_check):
    print(f"{grade:<10} {sp:>14.4f} {np_:>12.4f}")

Results (output of the final monotone constraint check):

=======================================================
Monotone constraint check (grade: low=0, mid=1, high=2)
=======================================================
grade        sklearn prob  native prob
--------------------------------------
low                0.5927       0.5714
mid                0.5927       0.5714
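high               0.5927       0.5714

All three grades receive the same probability under both APIs, so the constraint is satisfied only trivially: on this toy dataset the predictions do not depend on grade at all. The native-API params leave min_data_in_leaf at its default of 20, which exceeds the 7 training rows, so no split is possible there. The value 0.5714 equals 4/7, which is consistent with a model that predicts only the training class prior. A larger dataset is needed to see the constraint actually shape the predictions.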

Key parameters:

categorical_feature: list of column names or integer indices. When using category dtype in pandas, this list can be omitted — LightGBM auto-detects. Explicit specification is still recommended for clarity and reproducibility.
cat_smooth (default 10): smoothing term for categorical target statistics; larger values reduce overfitting on rare categories.
cat_l2 (default 10): L2 regularization applied specifically to categorical splits.
max_cat_threshold (default 32): maximum number of category groups considered in a single split — caps computation for very high-cardinality features.
min_data_per_group (default 100): minimum observations required per category group; raise this for high-cardinality features with many rare levels.
max_cat_to_onehot (default 4): if cardinality ≤ this threshold, LightGBM uses one-vs-rest splits (the equivalent of one-hot encoding), which is usually adequate for very low cardinality; otherwise it uses partition-based splitting.
monotone_constraints: list of -1 / 0 / +1 aligned with feature order; combines with ordinal encoding to enforce monotonic predictions.

Important caveats:

Categorical values must be non-negative integers internally. When using the Dataset API directly with NumPy arrays, encode your categories to integer codes first (e.g., df[col].cat.codes). The pandas category dtype route handles this automatically.
LightGBM treats categorical features fundamentally differently from numerical ones: it never treats them as ordered. To apply monotone_constraints to an ordinal feature, encode it as an integer column and leave it out of categorical_feature so that it is handled as numeric, as is done for grade in the example above.
For extremely high-cardinality features (> 100,000 unique values), consider combining with target encoding or hashing upstream, as the internal Fisher grouping may still become expensive.
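
For the NumPy route mentioned in the first caveat, the integer codes have to be fitted on the training data and then reused unchanged at prediction time. The sketch below shows one way to do that in plain Python; `fit_category_codes` and `encode` are hypothetical helper names, and mapping unseen values to -1 relies on LightGBM's documented behavior of treating negative categorical values as missing.

```python
def fit_category_codes(values):
    """Map each distinct training value to a stable non-negative integer code."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, codes, unseen=-1):
    """Apply a fitted mapping; values not seen in training become `unseen`."""
    return [codes.get(value, unseen) for value in values]


train_city = ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan']

codes = fit_category_codes(train_city)      # fit once, on training data only
train_codes = encode(train_city, codes)     # integer column for the training matrix
new_codes = encode(['Seoul', 'Jeju'], codes)  # 'Jeju' was never seen in training

print(codes)
print(train_codes)
print(new_codes)
```

Persist the `codes` mapping alongside the model. Refitting it on new data would silently reassign integers and scramble the categories the trees were trained on.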

Comparison: Entity Embedding vs. CatBoost vs. XGBoost vs. LightGBM

| Aspect | Entity Embedding (PyTorch) | CatBoost | XGBoost | LightGBM |
| --- | --- | --- | --- | --- |
| Encoding mechanism | Learnable dense vectors | Ordered Target Encoding | Optimal partition splitting | Fisher-based partition grouping |
| Leakage protection | Inherent | Built-in (ordered boosting) | Manual | Manual (smoothing helps) |
| Speed on high cardinality | Fast (matrix lookup) | Moderate | Fast | Fastest (histogram + Fisher) |
| Memory efficiency | High | Moderate | High | Highest |
| Ordinal handling | Order embeddings | `monotone_constraints` | `monotone_constraints` | `monotone_constraints` |
| Cardinality cap for one-hot fallback | N/A | `one_hot_max_size` | `max_cat_to_onehot` | `max_cat_to_onehot` |
| Best fit | Deep learning pipelines | Category-heavy tabular data | XGBoost-based stacks | Large datasets, speed-critical |
| Setup overhead | Define embedding layers | Pass `cat_features` | Convert to `category` dtype | Convert to `category` dtype |

LightGBM is often the fastest choice for large tabular datasets with many high-cardinality categorical features. Its Fisher-based partition algorithm can scale better than XGBoost's approach once cardinality exceeds a few hundred, which makes it a strong default for industrial-scale categorical data, although the gap depends on the dataset and is worth benchmarking.
