Stability Selection with Python Code

Overview

Stability Selection (Meinshausen & Bühlmann, 2010) is a general technique that improves the reliability of feature selection by combining subsampling with any base selection algorithm (e.g., Lasso or Random Forest importance ranking).

Core Idea

1. Repeatedly draw random subsamples without replacement (typically half of the data), B times in total (e.g., B = 100).
2. Apply a base feature selector to each subsample.
3. For each feature, compute its selection probability = (# times selected) / B.
4. Retain only features whose selection probability reaches a threshold π (e.g., 0.6–0.9).

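This filters out features that appear significant only due to random fluctuations, yielding a more stable and reproducible feature set. Under the assumptions stated by Meinshausen & Bühlmann, the expected number of falsely selected features can also be bounded from above.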
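
In symbols, the procedure above can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the post: I_b denotes the b-th subsample, Ŝ(I_b) the set of features the base selector picks on it, and Π̂_j the estimated selection probability of feature j.

```latex
\hat{\Pi}_j = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\{\, j \in \hat{S}(I_b) \,\},
\qquad
\hat{S}^{\mathrm{stable}} = \{\, j : \hat{\Pi}_j \ge \pi \,\}
```

For example, a feature selected in 83 of B = 100 subsamples has an estimated selection probability of 0.83, so it is kept at π = 0.7 and dropped at π = 0.9.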

Python Example (Lasso-based)

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic data: 200 samples, 50 features, only 10 informative
X, y, true_coef = make_regression(
    n_samples=200, n_features=50, n_informative=10,
    noise=5.0, coef=True, random_state=0
)
X = StandardScaler().fit_transform(X)

# 2. Stability Selection parameters
n_bootstrap = 100          # number of subsamples (B), drawn without replacement
subsample_ratio = 0.5      # fraction of data per subsample
alpha = 0.05               # Lasso regularization strength (illustrative; tune for your data)
threshold = 0.7            # selection probability cutoff (π)

n_samples, n_features = X.shape
selection_counts = np.zeros(n_features)

# 3. Repeated subsampling + Lasso selection
rng = np.random.default_rng(42)
sub_size = int(n_samples * subsample_ratio)

for _ in range(n_bootstrap):
    idx = rng.choice(n_samples, size=sub_size, replace=False)
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx])
    selection_counts += (lasso.coef_ != 0).astype(int)

# 4. Compute selection probabilities and apply threshold
selection_prob = selection_counts / n_bootstrap
stable_features = np.where(selection_prob >= threshold)[0]

# 5. Report
print(f"Stable features (prob ≥ {threshold}): {stable_features}")
print(f"Selection probabilities: {np.round(selection_prob, 2)}")
print(f"True informative features: {np.where(true_coef != 0)[0]}")
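
The loop above can also be wrapped in a small reusable function that accepts any selector. The following is a minimal sketch, not a library API: the helper names `stability_selection`, `lasso_select` and `top_k_correlation_select`, the `select_fn(X, y)` convention (return a boolean mask), and the values `alpha=1.0` and `k=10` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def stability_selection(X, y, select_fn, n_subsamples=100,
                        subsample_ratio=0.5, seed=0):
    """Return per-feature selection probabilities (hypothetical helper).

    select_fn(X_sub, y_sub) must return a boolean mask of length n_features.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    sub_size = int(n_samples * subsample_ratio)
    counts = np.zeros(n_features)
    for _ in range(n_subsamples):
        # Subsample without replacement, as in the example above
        idx = rng.choice(n_samples, size=sub_size, replace=False)
        counts += np.asarray(select_fn(X[idx], y[idx]), dtype=bool)
    return counts / n_subsamples


def lasso_select(alpha):
    """Selector that keeps features with non-zero Lasso coefficients."""
    def select(X_sub, y_sub):
        return Lasso(alpha=alpha, max_iter=10000).fit(X_sub, y_sub).coef_ != 0
    return select


def top_k_correlation_select(k):
    """Selector that keeps the k features with the largest |correlation| with y."""
    def select(X_sub, y_sub):
        Xc = X_sub - X_sub.mean(axis=0)
        yc = y_sub - y_sub.mean()
        corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        mask = np.zeros(X_sub.shape[1], dtype=bool)
        mask[np.argsort(-np.abs(corr))[:k]] = True
        return mask
    return select


X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# alpha=1.0 and k=10 are illustrative values, not recommendations
probs_lasso = stability_selection(X, y, lasso_select(alpha=1.0), n_subsamples=50)
probs_corr = stability_selection(X, y, top_k_correlation_select(k=10), n_subsamples=50)

print("Lasso-based stable features:", np.flatnonzero(probs_lasso >= 0.7))
print("Correlation-based stable features:", np.flatnonzero(probs_corr >= 0.7))
```

Keeping the selector behind a plain callable means the subsampling logic stays unchanged when the base method is replaced; only `select_fn` needs to return a boolean mask.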

Key Parameters

Parameter	Typical Value	Role
`n_bootstrap` (B)	100	More iterations → smoother probability estimates
`subsample_ratio`	0.5	Half-sampling is standard
`alpha`	Cross-validated	Regularization strength of base selector
`threshold` (π)	0.6 – 0.9	Higher threshold → fewer but more reliable features
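
To see how `alpha` and `threshold` interact, one option is to run the procedure over a small grid of regularization strengths and keep, for each feature, its highest selection probability across the grid, which is how Meinshausen & Bühlmann define the stable set over a regularization region. The sketch below assumes an arbitrary illustrative grid `alphas` and 50 subsamples; neither is a recommendation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

alphas = [0.5, 1.0, 2.0, 5.0]   # illustrative grid (assumption)
n_subsamples = 50
rng = np.random.default_rng(42)
n_samples, n_features = X.shape
sub_size = n_samples // 2

# counts[a, j] = number of subsamples in which feature j is selected at alphas[a]
counts = np.zeros((len(alphas), n_features))
for _ in range(n_subsamples):
    idx = rng.choice(n_samples, size=sub_size, replace=False)
    for a, alpha in enumerate(alphas):
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts[a] += coef != 0

probs = counts / n_subsamples        # shape: (len(alphas), n_features)
max_prob = probs.max(axis=0)         # best selection probability per feature

# A higher threshold can only shrink the selected set
n_selected = {}
for threshold in (0.6, 0.7, 0.8, 0.9):
    stable = np.flatnonzero(max_prob >= threshold)
    n_selected[threshold] = len(stable)
    print(f"threshold={threshold}: {len(stable)} features -> {stable}")
```

Taking the maximum over the grid makes the result less dependent on any single choice of `alpha`, at the cost of fitting the base selector once per grid point on every subsample.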

Practical Notes

Swap Lasso with any selector (Random Forest importance, Boruta, mutual information, etc.).
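For explicit error control, use the bound from Meinshausen & Bühlmann: for a threshold π > 0.5, the expected number of falsely selected features is at most q² / ((2π − 1) p), where q is the average number of features selected per subsample and p is the total number of features. The bound holds under the assumptions stated in the paper, and q is kept small by choosing an appropriate regularization range.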
The stability-selection package provides a scikit-learn compatible implementation.
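
The Meinshausen & Bühlmann bound mentioned above, E[V] ≤ q² / ((2π − 1) p) for 0.5 < π ≤ 1, is easy to evaluate directly. The sketch below uses hypothetical helper names (`false_selection_bound`, `threshold_for_target`) and illustrative values q = 10 and p = 50; it only computes the formula and does not check the assumptions under which the bound holds.

```python
def false_selection_bound(q, p, threshold):
    """Upper bound on the expected number of falsely selected features.

    q: average number of features selected per subsample
    p: total number of features
    threshold: selection probability cutoff (pi), must lie in (0.5, 1]
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("the bound is only defined for 0.5 < threshold <= 1")
    return q ** 2 / ((2 * threshold - 1) * p)


def threshold_for_target(q, p, max_false):
    """Smallest threshold for which the bound is at most max_false."""
    pi = 0.5 * (1 + q ** 2 / (p * max_false))
    if pi > 1.0:
        raise ValueError("target not reachable; reduce q (stronger regularization)")
    return pi


q, p = 10, 50  # illustrative values (assumption)
for threshold in (0.6, 0.7, 0.8, 0.9):
    bound = false_selection_bound(q, p, threshold)
    print(f"threshold={threshold}: E[V] <= {bound:.2f}")

print(f"threshold needed for E[V] <= 2.5: {threshold_for_target(q, p, 2.5):.2f}")
```

In the Lasso example above, q can be estimated as `selection_counts.sum() / n_bootstrap`. Because the bound grows with q², a weak regularization that selects many features per subsample makes it loose very quickly.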
