Addressing Random Seed Sensitivity in Feature Selection: A Survey of Methods and Recent Advances (2025–2026)

Stability Selection with Python Code

Overview

Stability Selection (Meinshausen & Bühlmann, 2010) is a general technique that improves the reliability of feature selection by combining subsampling with any base selection algorithm (e.g., Lasso, Random Forest).

Core Idea

  1. Draw B random subsamples without replacement (e.g., B = 100), each typically containing half of the data.
  2. Apply a base feature selector on each subsample.
  3. For each feature, compute its selection probability = (# times selected) / B.
  4. Retain only features whose selection probability exceeds a threshold π (e.g., 0.6–0.9).

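This filters out features that appear significant only because of random fluctuations in a single fit, yielding a more stable and reproducible feature set. Under the assumptions of Meinshausen & Bühlmann, the procedure also comes with a theoretical upper bound on the expected number of falsely selected features (a bound on false selections, not on the false discovery rate).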
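The four steps above can be written compactly. The notation here is an assumption chosen for illustration, not taken from the post: I_b is the b-th subsample, Ŝ(I_b) is the set of features the base selector picks on it, and Π̂_k is the selection probability of feature k.

```latex
\hat{\Pi}_k \;=\; \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\!\left\{ k \in \hat{S}(I_b) \right\},
\qquad
\hat{S}^{\text{stable}} \;=\; \left\{ k \,:\, \hat{\Pi}_k \ge \pi \right\}
```

For example, a feature selected in 83 of B = 100 subsamples has Π̂_k = 0.83, so it is retained at π = 0.7 but dropped at π = 0.9.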

Python Example (Lasso-based)

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic data: 200 samples, 50 features, only 10 informative
X, y, true_coef = make_regression(
    n_samples=200, n_features=50, n_informative=10,
    noise=5.0, coef=True, random_state=0
)
X = StandardScaler().fit_transform(X)

# 2. Stability Selection parameters
n_bootstrap = 100          # number of subsamples (B)
subsample_ratio = 0.5      # fraction of data per subsample
alpha = 1.0                # Lasso regularization strength; it is on the scale of y (unscaled here), so a very small value keeps almost every feature
threshold = 0.7            # selection probability cutoff (π)

n_samples, n_features = X.shape
selection_counts = np.zeros(n_features)

# 3. Repeated subsampling + Lasso selection
rng = np.random.default_rng(42)
sub_size = int(n_samples * subsample_ratio)

for _ in range(n_bootstrap):
    idx = rng.choice(n_samples, size=sub_size, replace=False)
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx])
    selection_counts += (lasso.coef_ != 0).astype(int)

# 4. Compute selection probabilities and apply threshold
selection_prob = selection_counts / n_bootstrap
stable_features = np.where(selection_prob >= threshold)[0]

# 5. Report
print(f"Stable features (prob ≥ {threshold}): {stable_features}")
print(f"Selection probabilities: {np.round(selection_prob, 2)}")
print(f"True informative features: {np.where(true_coef != 0)[0]}")
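Since the concern is sensitivity to the random seed, one quick sanity check is to rerun the subsampling with different seeds and compare the resulting stable sets. The sketch below is one way to do that, not part of the original example: the helper names `selection_probabilities` and `jaccard`, the seed values, and `alpha=1.0` are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def selection_probabilities(X, y, alpha, n_subsamples=100, ratio=0.5, seed=0):
    """Fraction of subsamples in which Lasso gives each feature a nonzero coefficient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    sub_size = int(n_samples * ratio)
    counts = np.zeros(n_features)
    for _ in range(n_subsamples):
        idx = rng.choice(n_samples, size=sub_size, replace=False)
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / n_subsamples


def jaccard(a, b):
    """Jaccard similarity of two index collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Same synthetic setup as the example above
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

threshold = 0.7
stable_sets = {}
for seed in (1, 2, 3):  # hypothetical seeds; any distinct values work
    prob = selection_probabilities(X, y, alpha=1.0, n_subsamples=50, seed=seed)
    stable_sets[seed] = np.where(prob >= threshold)[0]

# Pairwise agreement between the stable sets obtained with different seeds
for a, b in [(1, 2), (1, 3), (2, 3)]:
    print(f"seeds {a} vs {b}: Jaccard = {jaccard(stable_sets[a], stable_sets[b]):.2f}")
```

A Jaccard similarity close to 1 across seeds suggests the stable set does not depend much on the subsampling seed; noticeably lower values are a hint to increase B or revisit the threshold.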

Key Parameters

| Parameter | Typical Value | Role |
|---|---|---|
| n_bootstrap (B) | 100 | More iterations → smoother probability estimates |
| subsample_ratio | 0.5 | Half-sampling is standard |
| alpha | Cross-validated | Regularization strength of base selector |
| threshold (π) | 0.6 – 0.9 | Higher threshold → fewer but more reliable features |
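The table lists alpha as cross-validated. A minimal sketch of one way to obtain such a value is scikit-learn's `LassoCV`; the 5-fold setting and the variable names are assumptions made for illustration. An alpha chosen for prediction accuracy tends to keep more features than is ideal for selection, so it may be better treated as a starting point than as a final value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Same synthetic setup as the example above
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# 5-fold cross-validation over an automatically generated alpha grid
cv_model = LassoCV(cv=5, max_iter=10000, random_state=0).fit(X, y)
alpha_cv = cv_model.alpha_
n_selected = int(np.sum(cv_model.coef_ != 0))

print(f"Cross-validated alpha: {alpha_cv:.4f}")
print(f"Features with nonzero coefficient at that alpha: {n_selected} of {X.shape[1]}")
```

The resulting `alpha_cv` can then be passed as the `alpha` of the base Lasso inside the subsampling loop.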

Practical Notes

  • Lasso can be swapped for any selector that returns a subset of features (Random Forest importance with a cutoff, Boruta, mutual information with a top-k rule, etc.).
  • The error bound of Meinshausen & Bühlmann holds only for thresholds π > 0.5. For stricter control of false selections, raise π or restrict the regularization range so that the base selector picks fewer features per subsample.
  • The stability-selection package provides a scikit-learn compatible implementation.
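To make the note about the bound and the regularization range concrete, here is a hedged sketch of the grid version of the procedure: selection frequencies are computed for every alpha in a grid, each feature keeps its maximum frequency over the grid, and the expected number of false selections is bounded by q² / ((2π − 1) · p), where q is the average number of features selected anywhere on the grid per subsample and p is the total number of features. The grid values, the helper name `error_bound`, and the choice of 50 subsamples are illustrative assumptions, and the bound itself relies on the assumptions stated in Meinshausen & Bühlmann (2010).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def error_bound(q, p, pi):
    """Upper bound on the expected number of false selections; valid for 0.5 < pi <= 1."""
    if not 0.5 < pi <= 1.0:
        raise ValueError("the bound requires 0.5 < pi <= 1")
    return q ** 2 / ((2.0 * pi - 1.0) * p)


# Same synthetic setup as the example above
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)
n_samples, n_features = X.shape

alphas = np.logspace(0, 1.5, 8)   # hypothetical regularization grid, not tuned
n_subsamples = 50
pi = 0.7
rng = np.random.default_rng(0)

# selected[b, j, k] is True if feature k is selected on subsample b at alphas[j]
selected = np.zeros((n_subsamples, len(alphas), n_features), dtype=bool)
for b in range(n_subsamples):
    idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
    for j, a in enumerate(alphas):
        selected[b, j] = Lasso(alpha=a, max_iter=10000).fit(X[idx], y[idx]).coef_ != 0

freq = selected.mean(axis=0)                   # selection frequency per (alpha, feature)
max_prob = freq.max(axis=0)                    # maximum over the regularization grid
q = selected.any(axis=1).sum(axis=1).mean()    # avg. features selected anywhere on the grid
stable = np.where(max_prob >= pi)[0]
bound = error_bound(q, n_features, pi)

print(f"Stable features (max prob >= {pi}): {stable}")
print(f"Average q = {q:.1f}, bound on expected false selections = {bound:.2f}")
```

Dropping the smallest alphas from the grid lowers q and therefore tightens the bound, which is the trade-off the note above refers to.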