
Addressing Random Seed Sensitivity in Feature Selection: A Survey of Methods and Recent Advances (2025–2026)

Ensemble Importance Averaging with Python Code

Overview

Ensemble Importance Averaging is a simple yet effective technique for stabilizing feature selection. Instead of relying on a single model’s feature importances, which vary with the random seed, it trains multiple models with different seeds and averages the importance scores across them.

Core Idea

  1. Train the same model N times, each with a different random seed.
  2. Extract feature importance scores from each model.
  3. Average (or take the median of) the scores across all runs.
  4. Rank features by the averaged score and select the top-K (or apply a threshold).

This reduces the variance in importance estimates caused by stochastic training elements (bootstrap sampling, feature subsampling, weight initialization, etc.) and yields a more reproducible ranking.

Python Example (Random Forest-based)

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1. Generate synthetic data: 500 samples, 30 features, only 8 informative
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=8,
    n_redundant=2, n_repeated=0, random_state=0
)

# 2. Ensemble averaging parameters
n_runs = 50                # number of models with different seeds
top_k = 10                 # number of top features to retain

n_features = X.shape[1]
importance_matrix = np.zeros((n_runs, n_features))

# 3. Train multiple models with different seeds
for i in range(n_runs):
    rf = RandomForestClassifier(
        n_estimators=200,
        random_state=i,       # vary seed across runs
        n_jobs=-1
    ).fit(X, y)
    importance_matrix[i] = rf.feature_importances_

# 4. Aggregate importances
mean_importance = importance_matrix.mean(axis=0)
std_importance = importance_matrix.std(axis=0)

# 5. Rank and select top-K features
ranking = np.argsort(mean_importance)[::-1]
selected_features = ranking[:top_k]

# 6. Report
results = pd.DataFrame({
    "feature": np.arange(n_features),
    "mean_importance": mean_importance.round(4),
    "std_importance": std_importance.round(4),
}).sort_values("mean_importance", ascending=False)

print(f"Selected top-{top_k} features: {sorted(selected_features.tolist())}")
print(results.head(top_k).to_string(index=False))
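
Step 4 of the core idea also allows a threshold instead of a fixed top-K. The following is a minimal sketch of that variant. The cutoff of 1/n_features (the share each feature would get if importance were spread uniformly) is an illustrative assumption, not a rule prescribed by the method, and the run count and forest size are reduced to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=30, n_informative=8,
    n_redundant=2, n_repeated=0, random_state=0
)

n_runs = 10  # kept small for a quick illustration

# One row of importances per seed
importance_matrix = np.array([
    RandomForestClassifier(n_estimators=50, random_state=seed)
    .fit(X, y)
    .feature_importances_
    for seed in range(n_runs)
])
mean_importance = importance_matrix.mean(axis=0)

# Assumed cutoff: the importance each feature would receive under a uniform split
threshold = 1.0 / X.shape[1]
selected_by_threshold = np.flatnonzero(mean_importance > threshold)

print(f"Threshold: {threshold:.4f}")
print(f"Features above threshold: {selected_by_threshold.tolist()}")
```

Because random forest importances sum to 1 within each model, this cutoff always keeps at least one feature and drops at least one whenever the averaged scores are not all equal. A domain-specific threshold may be more appropriate in practice.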

Key Parameters

| Parameter | Typical Value | Role |
|---|---|---|
| n_runs | 30 – 100 | More runs → more stable averaging |
| random_state | Varied per run | Ensures diversity across models |
| top_k or threshold | Domain-dependent | Defines how many features to keep |
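
One way to judge whether n_runs is large enough is to watch how the standard error of the averaged importance shrinks as runs accumulate. The sketch below illustrates this under assumptions: the helper name `mean_standard_error`, the checkpoints (5, 10, 20, 30), and the small dataset and forest size are arbitrary choices made to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_redundant=2, n_repeated=0, random_state=0
)

max_runs = 30
importance_matrix = np.array([
    RandomForestClassifier(n_estimators=25, random_state=seed)
    .fit(X, y)
    .feature_importances_
    for seed in range(max_runs)
])

def mean_standard_error(matrix, n):
    """Standard error of the mean importance over the first n runs,
    averaged across features (hypothetical helper for this sketch)."""
    subset = matrix[:n]
    return (subset.std(axis=0, ddof=1) / np.sqrt(n)).mean()

for n in (5, 10, 20, 30):
    se = mean_standard_error(importance_matrix, n)
    print(f"n_runs={n:>2}  mean standard error={se:.5f}")
```

If the standard error is already small relative to the gaps between neighboring features in the ranking, adding more runs is unlikely to change the selection much.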

Aggregation Options

  • Mean: Standard choice; sensitive to outliers.
  • Median: More robust when some runs produce extreme values.
  • Rank averaging: Convert each run’s importances to ranks, then average — useful when importance scales differ across models.
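
The three options can be compared side by side on a small example. The importance matrix below is hypothetical (made-up numbers, with one outlier run), chosen only to show how the aggregators can disagree.

```python
import numpy as np

# Hypothetical importances: 5 runs (rows) x 4 features (columns).
# The fourth run is an outlier that inflates feature 1 and deflates feature 0.
importance_matrix = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.42, 0.28, 0.19, 0.11],
    [0.38, 0.31, 0.21, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.41, 0.29, 0.20, 0.10],
])

# Mean and median of the raw scores across runs
mean_scores = importance_matrix.mean(axis=0)
median_scores = np.median(importance_matrix, axis=0)

# Rank averaging: within each run, rank 1 is the most important feature.
# A double argsort turns scores into ranks; ties are broken by column order.
order = np.argsort(-importance_matrix, axis=1, kind="stable")
ranks = np.argsort(order, axis=1, kind="stable") + 1
mean_ranks = ranks.mean(axis=0)

print("Top feature by mean:     ", int(np.argmax(mean_scores)))
print("Top feature by median:   ", int(np.argmax(median_scores)))
print("Top feature by mean rank:", int(np.argmin(mean_ranks)))
```

In this toy matrix the single outlier run is enough for the mean to put feature 1 first, while the median and the rank average both keep feature 0 on top.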

Practical Notes

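  • Works with any model exposing feature importances: RandomForest, GradientBoosting, XGBoost, LightGBM, or coefficient-based models (e.g., Lasso). If the learner is deterministic for a fixed dataset (Lasso with scikit-learn's default settings, for example), changing the seed alone changes nothing, so vary the training data per run instead, e.g., by bootstrap resampling.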
  • For deep learning or model-agnostic setups, combine with permutation importance or SHAP values across runs.
  • The standard deviation of importance across runs provides a useful uncertainty measure — features with high mean but also high std may be unreliable.
  • Computationally heavier than a single run; consider parallelization with joblib or n_jobs=-1.
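
The notes above on permutation importance, the spread across runs, and parallelization can be combined in one sketch. It rests on assumptions: `run_once` is a hypothetical helper, the held-out split and the small run count are arbitrary, and thread-based workers are used only to keep the example portable (joblib's default process-based backend also works).

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5,
    n_redundant=2, n_repeated=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def run_once(seed):
    """Fit one model and return its permutation importances on held-out data."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    result = permutation_importance(
        model, X_val, y_val, n_repeats=5, random_state=seed
    )
    return result.importances_mean

n_runs = 8  # kept small for a quick illustration

# Parallelize over seeds; each forest stays single-threaded (n_jobs unset)
# so the outer loop does not oversubscribe the CPU.
importance_matrix = np.array(
    Parallel(n_jobs=-1, prefer="threads")(
        delayed(run_once)(seed) for seed in range(n_runs)
    )
)

mean_importance = importance_matrix.mean(axis=0)
std_importance = importance_matrix.std(axis=0)

# Report mean and spread for the five highest-ranked features
top = np.argsort(mean_importance)[::-1][:5]
for j in top:
    print(f"feature {j:>2}: mean={mean_importance[j]:.4f}  std={std_importance[j]:.4f}")
```

A feature whose std is comparable to its mean deserves a closer look before it is kept, in line with the uncertainty note above.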