Addressing Random Seed Sensitivity in Feature Selection: A Survey of Methods and Recent Advances (2025–2026)



Random Seed Sensitivity in Feature Selection

1. Problem Statement

Feature selection results can vary significantly depending on the random seed used during model training, data splitting, or stochastic optimization. This instability undermines reproducibility and trustworthiness, particularly in high-stakes domains such as healthcare and genomics where consistent feature identification is critical for scientific validity.

2. Classical Approaches

Method	Core Idea	Strength
Stability Selection (Meinshausen & Bühlmann, 2010)	Repeated subsampling + selection frequency thresholding	Theoretical bound on the expected number of false selections
Ensemble Importance Averaging	Aggregate feature importances across multiple seeds/models	Simple, model-agnostic
Permutation Importance	Measure performance drop when a feature is shuffled	Model-agnostic; stable with repetition
Boruta Algorithm	Compare features against shuffled “shadow” copies via statistical test	Built-in significance testing
SHAP-based Selection	Rank features by Shapley values averaged over runs	Game-theoretic foundation; interpretable
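Knockoff Filter (Barber & Candès, 2015)	Generate synthetic knockoff variables that mimic the correlation structure of the real features but are conditionally independent of the target given those features; compare importances	Explicit FDR control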

Common principle: Repeat → Aggregate → Threshold. Run feature selection N times (typically 50–100) with different seeds, then retain only the features selected in at least a given fraction of runs (e.g., ≥80%). This is a practical stabilization strategy that works with almost any base method.
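The Repeat → Aggregate → Threshold pattern can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the random forest base selector, the top_k cutoff, the synthetic dataset, and the helper name selection_frequencies are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def selection_frequencies(X, y, n_runs=50, top_k=5):
    """Fraction of runs in which each feature lands in the top_k by importance."""
    counts = np.zeros(X.shape[1])
    for seed in range(n_runs):
        # Repeat: same data, different seed for the stochastic base selector.
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X, y)
        top = np.argsort(model.feature_importances_)[-top_k:]
        counts[top] += 1
    # Aggregate: convert counts into per-feature selection frequencies.
    return counts / n_runs


# Synthetic data (assumption): 5 informative features out of 20.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
freq = selection_frequencies(X, y, n_runs=50, top_k=5)

# Threshold: keep features selected in at least 80% of the runs.
stable = np.flatnonzero(freq >= 0.8)
print("selection frequencies:", np.round(freq, 2))
print("stable features:", stable)
```

Swapping in another base selector only changes the body of the loop; the counting and thresholding stay the same.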

3. Recent Advances (2025–2026)

3.1 MVFS-SHAP: Contribution-Driven Voting (2026)

Method: Bootstrap + 5-fold CV → apply base selector on each resample → majority voting → re-rank by SHAP contribution scores (Ridge + Linear SHAP).
Key innovation: Moves beyond pure frequency-based aggregation to contribution-driven aggregation, jointly considering how often and how much a feature contributes.
Results: Stability > 0.90 on two metabolomics datasets; ~80% of experiments above 0.80.
Ref: ScienceDirect
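The idea of voting first and then re-ranking by contribution can be sketched as follows. This is a simplified illustration and not the published MVFS-SHAP procedure: it uses bootstrap resampling only (no 5-fold CV), SelectKBest as a stand-in base selector, and computes Linear SHAP values by hand under a feature-independence assumption, where the SHAP value of feature j for sample i is coef_j * (x_ij - mean_j). The function name vote_then_rerank and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def vote_then_rerank(X, y, n_boot=40, k=8, seed=0):
    """Majority-vote over bootstrap selections, then rank by mean |SHAP|."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros(p)
    contrib = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # bootstrap resample
        Xb = StandardScaler().fit_transform(X[idx])
        yb = y[idx]
        mask = SelectKBest(f_regression, k=k).fit(Xb, yb).get_support()
        votes += mask
        # Linear SHAP under independence: phi_ij = coef_j * (x_ij - mean_j).
        # After standardization the column means are 0.
        ridge = Ridge(alpha=1.0).fit(Xb[:, mask], yb)
        contrib[mask] += np.abs(ridge.coef_ * Xb[:, mask]).mean(axis=0)
    majority = np.flatnonzero(votes > n_boot / 2)    # "how often"
    mean_contrib = contrib[majority] / votes[majority]  # "how much"
    order = np.argsort(mean_contrib)[::-1]
    return majority[order], votes / n_boot


# Synthetic data (assumption): 5 informative features out of 30.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, shuffle=False, random_state=0)
ranked, freq = vote_then_rerank(X, y)
print("features ranked by contribution:", ranked)
```

Two features with the same selection frequency can end up far apart in the final ranking, which is the point of adding the contribution score.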

3.2 Loss-Guided Stability Selection (2025)

Problem solved: Original stability selection tends to severely underfit on noisy high-dimensional data.
Method: Select the stability threshold in a data-driven manner by optimizing out-of-sample validation loss, optionally with exhaustive search.
Results: Significant precision improvement over raw Boosting while avoiding underfitting.
Ref: Springer ADAC
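A rough sketch of the data-driven threshold choice: compute selection frequencies on subsamples, then pick the threshold whose selected set gives the lowest validation loss. The paper uses Boosting as the base learner; the Lasso base selector, the least-squares refit, the threshold grid, and the function names here are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def subsample_frequencies(X, y, n_sub=50, alpha=1.0, seed=0):
    """Selection frequency of each feature over half-size subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=n // 2, replace=False)
        coef = Lasso(alpha=alpha, max_iter=5000).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / n_sub


def loss_guided_threshold(X_tr, y_tr, X_val, y_val, freq, grid):
    """Return (loss, threshold, selected indices) minimizing validation MSE."""
    best = (np.inf, None, np.array([], dtype=int))
    for t in grid:
        sel = np.flatnonzero(freq >= t)
        if sel.size == 0:
            # Empty model: fall back to predicting the training mean.
            pred = np.full(len(y_val), y_tr.mean())
        else:
            model = LinearRegression().fit(X_tr[:, sel], y_tr)
            pred = model.predict(X_val[:, sel])
        loss = mean_squared_error(y_val, pred)
        if loss < best[0]:
            best = (loss, t, sel)
    return best


# Synthetic data (assumption): 5 informative features out of 40.
X, y = make_regression(n_samples=300, n_features=40, n_informative=5,
                       noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
freq = subsample_frequencies(X_tr, y_tr)
grid = [0.5, 0.6, 0.7, 0.8, 0.9]
best_loss, best_t, selected = loss_guided_threshold(
    X_tr, y_tr, X_val, y_val, freq, grid)
print("chosen threshold:", best_t, "selected features:", selected)
```

A fixed, conservative threshold can leave out weak but real signals; letting validation loss choose the threshold is one way to counter that underfitting.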

3.3 GRIP2: Robust Deep Knockoff (2026)

Problem solved: Existing deep knockoff methods degrade under high feature correlation and low SNR.
Method: Improved deep generative knockoff framework with enhanced robustness guarantees.
Results: Superior FDR-controlled power vs. linear baselines; validated on HIV drug resistance mutations.
Ref: arXiv 2602.00218

3.4 Knockoff-ML: Clinical Risk Feature Selection (2025)

Method: Model-free integration of knockoff framework with ML algorithms + SHAP for interpretability.
Strength: Handles nonlinear feature–outcome relationships with FDR control; applied to EHR data.
Ref: Nature npj Digital Medicine
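Knockoff-based methods differ mainly in how they generate knockoffs and compute importance statistics; the final selection step of the original knockoff filter (Barber & Candès, 2015) is a simple data-dependent threshold. The sketch below shows only that step, assuming the statistics W_j (large positive values favor the real feature over its knockoff) have already been computed. The W values are made up for illustration, and whether GRIP2 or Knockoff-ML use exactly this rule is not confirmed here.

```python
import numpy as np


def knockoff_threshold(W, q=0.1, plus=True):
    """Smallest t whose estimated false discovery proportion is at most q.

    Estimated FDP at t: (offset + #{W_j <= -t}) / max(1, #{W_j >= t}),
    with offset = 1 for the knockoff+ variant and 0 otherwise.
    """
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold meets the target: select nothing


# Hypothetical statistics: ten strong positives, five small mixed-sign values.
W = np.array([5.2, 4.1, 3.8, 3.0, 2.7, 2.5, 2.2, 2.0, 1.9, 1.5,
              -0.3, 0.2, -0.1, 0.4, -0.5])
tau = knockoff_threshold(W, q=0.2)
selected = np.flatnonzero(W >= tau)
print("threshold:", tau, "selected features:", selected)
```

Negative W values act as a built-in estimate of how many of the positives are false, which is where the FDR control comes from.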

3.5 OSSFS: Stable Streaming Feature Selection (2025)

Problem solved: Streaming features arrive continuously; existing methods ignore stability.
Method: MeanShift-inspired incremental aggregation into hyperellipsoids; select representative features per cluster.
Results: Optimal stability without sacrificing predictive accuracy on real-world datasets.
Ref: ACM TKDD

3.6 Two-Stage RF + Improved GA (2025)

Method: Stage 1 — RF importance ranking; Stage 2 — improved genetic algorithm for global optimal subset search.
Strength: RF ensemble nature reduces seed dependency; GA explores search space systematically.
Ref: Nature Scientific Reports
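The two-stage structure can be sketched as below. This uses a plain genetic algorithm (elitism, uniform crossover, bit-flip mutation), not the paper's improved variant, and the logistic regression fitness, population size, and helper names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def rf_prefilter(X, y, keep=10, seed=0):
    """Stage 1: keep the top features by random forest importance."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1][:keep]


def fitness(mask, X, y):
    """Cross-validated accuracy of a classifier on the masked features."""
    if not mask.any():
        return 0.0
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()


def ga_search(X, y, pop_size=12, n_gen=5, p_mut=0.1, seed=0):
    """Stage 2: evolve boolean feature masks toward higher fitness."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, X.shape[1])) < 0.5
    for _ in range(n_gen):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        elite = pop[np.argsort(scores)[::-1][: pop_size // 2]]
        n_child = pop_size - len(elite)
        pa = elite[rng.integers(0, len(elite), n_child)]
        pb = elite[rng.integers(0, len(elite), n_child)]
        children = np.where(rng.random(pa.shape) < 0.5, pa, pb)  # crossover
        children ^= rng.random(children.shape) < p_mut            # mutation
        pop = np.vstack([elite, children])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(scores)], scores.max()


# Synthetic data (assumption): 5 informative features out of 30.
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)
candidates = rf_prefilter(X, y, keep=10)
best_mask, best_score = ga_search(X[:, candidates], y)
chosen = candidates[best_mask]
print("chosen features:", chosen, "CV accuracy:", round(best_score, 3))
```

The GA itself is seed-dependent, so in practice the search would still be repeated across seeds and aggregated as described in Section 2.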

4. Summary of 2025–2026 Trends

Two dominant research directions have emerged:

Interpretability-integrated aggregation — Combining repeated sampling with explainability metrics (especially SHAP) during the aggregation step, moving from “how often selected” to “how often selected AND how much it contributes” (e.g., MVFS-SHAP).
Deep knockoff extensions — Scaling the knockoff framework with deep generative models to handle nonlinear relationships, high correlation, and low-SNR regimes while maintaining rigorous FDR control (e.g., GRIP2, Knockoff-ML).

5. Practical Recommendations

Always aggregate: Run feature selection with K different seeds (K ≥ 50) and retain features exceeding a stability threshold.
Use SHAP-weighted voting over simple frequency counting for richer importance signals.
Apply knockoff methods when FDR control is a hard requirement (e.g., biomedical studies).
Increase ensemble size: For tree-based models, a larger n_estimators averages importances over more trees, which reduces seed-to-seed variance in the feature ranking.
Report stability metrics: Use Jaccard index, Kuncheva index, or selection probability distributions alongside accuracy to quantify reproducibility.
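As a small example of the stability metrics mentioned above, the Jaccard and Kuncheva indices can be computed from the feature subsets selected in different runs and averaged over all pairs of runs. The three runs and the total of 20 features below are made-up inputs; the function names are chosen for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def kuncheva(a, b, n_features):
    """Kuncheva index for two subsets of equal size k out of n_features.

    KI = (r * n - k^2) / (k * (n - k)), where r = |A ∩ B|.
    It corrects the overlap for the amount expected by chance.
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n_features:
        raise ValueError("subsets must share a size k with 0 < k < n_features")
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))


def mean_pairwise(subsets, metric):
    """Average a pairwise metric over all pairs of runs."""
    pairs = list(combinations(subsets, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three runs with different seeds.
runs = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 3, 5]]
mean_jaccard = mean_pairwise(runs, jaccard)
mean_kuncheva = mean_pairwise(runs, lambda a, b: kuncheva(a, b, n_features=20))
print(f"mean Jaccard:  {mean_jaccard:.3f}")
print(f"mean Kuncheva: {mean_kuncheva:.3f}")
```

The Kuncheva index requires equal-sized subsets, so it fits top-k selection; Jaccard also works when the number of selected features varies between runs.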

🏄 FDR (False Discovery Rate)

FDR stands for False Discovery Rate, an error rate metric used in multiple hypothesis testing.

Definition

FDR is the expected proportion of false positives among all items declared as “significant” (discoveries).

$$
FDR = E\left[\frac{V}{R}\right]
$$

V: Number of incorrectly selected features (false positives)
R: Total number of selected features (total discoveries)
Defined as 0 when R = 0

Meaning in Feature Selection

For example, controlling the FDR at 0.1 means that the expected proportion of irrelevant features (false discoveries) among the selected features is at most 10%.

If 100 features are selected:

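On average over repeated experiments, about 10 of them may be noise unrelated to the target
The remaining features are likely, but not guaranteed, to be relevant: FDR bounds an expectation, so any single selected set can contain more or fewer than 10 false discoveries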
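The difference between the FDR (an expectation) and the false discovery proportion of a single run can be made concrete with a toy calculation. The ground-truth set and the four selections below are invented for illustration; in real data the true relevant features are unknown, which is why procedures with FDR control are needed.

```python
import numpy as np


def false_discovery_proportion(selected, true_features):
    """V / R for one selection, with the convention FDP = 0 when R = 0."""
    selected = set(selected)
    if not selected:
        return 0.0
    v = len(selected - set(true_features))  # V: false positives
    return v / len(selected)                # R: total discoveries


# Assumed ground truth: features 0..7 are relevant.
true_features = range(8)

# Hypothetical selections from four repetitions of the same procedure.
runs = [
    [0, 1, 2, 3, 4],                   # V = 0, R = 5
    [0, 1, 2, 3, 8],                   # V = 1, R = 5
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],    # V = 2, R = 10
    [],                                # R = 0
]
fdps = [false_discovery_proportion(r, true_features) for r in runs]
fdr_estimate = float(np.mean(fdps))    # empirical average of V / R
print("per-run FDP:", fdps)
print("average FDP:", fdr_estimate)
```

Individual runs range from no false discoveries to 20%, while the average stays at 10%.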
