Noise-Induced Instability in Tree-based Feature Selection: Root Causes and Robust Countermeasures

When performing feature selection with tree-based models such as LightGBM (LGBM) or CatBoost, adding noise features to the existing set often causes truly important primary features to drop out of the importance ranking. This is not a data problem but a structural issue rooted in how features compete for splits during training and how importance is computed. Relying on default feature importance alone makes selection fragile against noise. This post breaks down the five main causes and explains why each robust countermeasure works.

1. What Is a Noise Feature?

A noise feature is a feature with no (or negligible) statistical relationship to the target. The term is standard in both academia and industry, and is typically split into three categories.

  • Irrelevant feature: independent of the target, with mutual information close to zero.
  • Random noise feature: deliberately generated from a random distribution (uniform, Gaussian, etc.) with no causal or statistical link to the target.
  • Redundant feature: related to the target but provides no additional information beyond what other features already carry. Distinct from noise.

Note: in the literature, terms like irrelevant features, uninformative features, and spurious predictors are used interchangeably. In the Boruta algorithm, artificially shuffled features are called shadow features.

2. Tree-Native Features

Ideal Tree-Native features let the model reach the correct answer through the shortest possible path, without having to grow deep, complex trees. They form the first line of defense against noise vulnerability before any selection technique is applied.

1) High-Fidelity Signal

  • Denoised continuous variables: clean sensor jitter and outliers so values reflect the true physical state.
  • Monotonicity: relationships where the target consistently increases or decreases with the feature (e.g., process temperature vs. yield). Gradient Boosted Decision Trees (GBDT) learn far more robustly under monotonic constraints.

2) Structural Determinants

  • High-gain splitters: variables that drastically reduce impurity in a single split (e.g., equipment ID, process step number).
  • Interaction-rich features: explicitly precomputed combinations of two or more variables, so the tree does not have to discover them on its own.

3) Clean Dimensionality

  • Zero redundancy: remove highly correlated variables (multicollinearity) so the model does not have to “choose” between equivalent options.
  • Optimized cardinality (low cardinality): prefer meaningful grouped categories over high-cardinality ID-like fields. High cardinality is the main channel through which noise features inflate importance (see Appendix B).

3. Why Noise Features Push Out Primary Features

3.1 How Other Model Families Behave

Model Family | Noise Sensitivity | Notes
Linear / Logistic Regression | Medium | Sensitive to multicollinearity but stable under L1/L2 regularization. Noise feature coefficients shrink toward zero.
Tree-based (LGBM, XGBoost, CatBoost, Random Forest) | High | Structurally vulnerable due to greedy splitting and cardinality bias. Main focus of this post.
Neural Networks (NN) | Low–Medium | Weight decay and dropout dilute noise impact, but interpretable importance is hard to obtain.
k-Nearest Neighbors (kNN), Support Vector Machine (SVM, RBF kernel) | Very High | Distance-based, so noise features directly distort distance computation. Exposed to the curse of dimensionality.

What makes tree-based models distinct is that features compete at every split, where a “split” means partitioning a node’s data by a single feature threshold. This competitive structure is the root cause of noise vulnerability.

3.2 Five Root Causes (Tree-based Models)

1) Lucky wins in split competition (limits of greedy splitting). At each node, the tree picks the split with the highest immediate gain. A noise feature can occasionally beat a primary feature on a specific subset of samples by sheer chance. Once the primary loses that node, its chance to contribute downstream collapses, and since importance is cumulative gain, losing early splits causes a sharp drop in score.

2) High cardinality bias (selection bias toward high-cardinality features). The most common cause. A continuous or high-cardinality noise feature (e.g., time-series-like values) offers many more split candidates, making it structurally more likely to win. A feature with 1000 unique values has ~999 split candidates; one with 5 unique values has only 4. More candidates means a higher chance of finding a split that fits the training data by accident. This was formally reported by Strobl et al. (2007) for Random Forest and applies equally to Gradient Boosting Machine (GBM) variants (see Appendix B).

3) Masking effect (feature correlation and redundancy). If a primary feature has even weak correlation with a noise feature, their importance gets distributed across both. If the noise accidentally explains some variation of the primary, the primary’s marginal contribution looks smaller. This mirrors multicollinearity in linear models.

4) Interaction with regularization and column subsampling. Parameters like LGBM’s feature_fraction or CatBoost’s rsm sample only a subset of features per tree or node. When a primary feature is not sampled in some iterations, a noise feature gets picked instead and accumulates importance. Adding M noise features to N existing ones makes this worse: if a fixed count k is sampled, each primary’s chance of being drawn falls from k/N to k/(N+M); if a fixed fraction is sampled, that chance stays the same, but every sampled subset contains proportionally more noise competitors.

5) Limitations of the importance metric itself. Default importance (split count, gain) measures “how much the model used a feature,” not “how genuinely related it is to the target.” All the biases above feed directly into the score.
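The dilution described in cause 4 can be put into numbers. The sketch below assumes a sampler that draws a fixed count k of features uniformly at random (closer to Random Forest's max_features than to a fractional setting); the function names and the sizes (10 primaries, k = 5) are hypothetical and chosen only for illustration.

```python
from math import comb

def p_primary_sampled(n_total: int, k: int) -> float:
    """Chance that one specific feature lands in a uniform sample of k out of n_total."""
    return k / n_total

def p_all_noise(n_primary: int, n_noise: int, k: int) -> float:
    """Chance that a sample of k features contains no primary feature at all."""
    if k > n_noise:
        return 0.0  # not enough noise features to fill the sample
    return comb(n_noise, k) / comb(n_primary + n_noise, k)

# Hypothetical setup: 10 primary features, 5 features sampled per tree.
for n_noise in (0, 10, 40, 90):
    n_total = 10 + n_noise
    print(
        f"noise={n_noise:3d}  "
        f"P(given primary sampled)={p_primary_sampled(n_total, 5):.3f}  "
        f"P(sample is all noise)={p_all_noise(10, n_noise, 5):.3f}"
    )
```

Under a fixed fraction instead of a fixed count, the first probability would stay constant while the share of noise competitors inside each sample still grows.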

4. Countermeasures

4.1 Permutation Importance

It measures how much model performance drops when a feature’s values are randomly shuffled.

What is cardinality bias? The number of split candidates in a tree model is proportional to the number of unique values in a feature. As a result, high-cardinality features can find a split that fits the training data by chance, inflating their importance even when unrelated to the target (see Appendix B for details).

Why permutation importance is immune. Shuffling only changes the order of values; the value set itself (uniques, range, frequency, distribution) is preserved. What gets broken is the row-level pairing between feature and target — the joint distribution.

$$ \text{Before shuffle: } P(\text{feature}, \text{target}) \quad\longrightarrow\quad \text{After shuffle: } P(\text{feature}) \times P(\text{target}) $$

Marginal distributions are preserved; only the joint distribution is destroyed.

Interpreting the Drop

Permutation importance = (performance before) − (performance after).

Feature Type | Before | After | Drop | Interpretation
Truly important | Area Under the Curve (AUC) 0.85 | AUC 0.70 | 0.15 (large) | Model breaks without it → genuine signal
High-cardinality noise | AUC 0.85 | AUC 0.849 | ≈ 0 (tiny) | No effect when broken → fake importance

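Why cardinality bias cancels out. The model is trained once and is not retrained after shuffling, so no new split search takes place and the “many split candidates” advantage never enters the measurement. Splits that fit training noise by chance carry no predictive power on held-out data, so breaking them costs almost nothing; only the contribution from the true feature-target relationship shows up as a drop. This holds when the drop is measured on held-out data. On training data, overfit noise splits can still register as importance.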
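A minimal sketch of this contrast, assuming scikit-learn is available and using a Random Forest on synthetic data as a stand-in for a GBM. The dataset (one binary signal feature, one continuous noise feature, 10% label noise) is made up for illustration; permutation_importance is scikit-learn's own helper and is evaluated here on a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
signal = rng.integers(0, 2, size=n)      # low-cardinality, truly predictive
noise = rng.normal(size=n)               # continuous (high-cardinality), unrelated to y
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - signal, signal)   # target = signal with 10% of labels flipped
X = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: accumulated on the training data during fitting.
gain_like = model.feature_importances_

# Permutation importance: drop in held-out accuracy when a column is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

for name, g, p in zip(["signal", "noise"], gain_like, perm.importances_mean):
    print(f"{name:6s}  impurity importance={g:.3f}  permutation drop={p:.3f}")
```

The noise column tends to collect a noticeable share of impurity importance because fully grown trees use it to memorize the flipped labels, while its permutation drop on the held-out half stays near zero.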

“Isn’t shuffled data garbage?”

This is not validation; it is a controlled experiment, like comparing a drug group with a placebo group. The original data is the drug, the shuffled data is the placebo, and the difference is the feature’s pure “drug effect.” Garbage is the point — it gives us a clean baseline. We are not asking “how good is the shuffled data?” but “how much did the trained model rely on this feature?”

Applicability to Time Series

Standard permutation assumes independent samples, which breaks for time series. Naive row-wise shuffling destroys autocorrelation and temporal order, producing unrealistic sequences. Recommended adaptations:

  • Block permutation: shuffle in fixed-length time blocks to partially preserve autocorrelation.
  • Time-series Cross-Validation (CV)-based permutation: walk-forward splitting, then shuffle only within validation folds.
  • Conditional permutation: swap values only within local time windows.
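As a rough illustration of the first adaptation, the sketch below shuffles a series in contiguous blocks and compares lag-1 autocorrelation against a naive row-wise shuffle. The helper names, the AR(1) test series, and the block length of 50 are assumptions made for this example, not a recommendation.

```python
import numpy as np

def block_permute(values, block_len, rng):
    """Shuffle a 1-D series in contiguous blocks, keeping order inside each block."""
    n_blocks = int(np.ceil(len(values) / block_len))
    blocks = [values[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]
    return np.concatenate([blocks[i] for i in rng.permutation(n_blocks)])

def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

rng = np.random.default_rng(0)
series = np.zeros(1000)
for t in range(1, len(series)):          # synthetic AR(1) process, coefficient 0.9
    series[t] = 0.9 * series[t - 1] + rng.normal()

row_shuffled = rng.permutation(series)
block_shuffled = block_permute(series, block_len=50, rng=rng)

print("original      :", round(lag1_autocorr(series), 3))
print("row shuffle   :", round(lag1_autocorr(row_shuffled), 3))
print("block shuffle :", round(lag1_autocorr(block_shuffled), 3))
```

Only the pairs that straddle a block boundary lose their temporal link, so longer blocks preserve more autocorrelation at the cost of a weaker break between feature and target.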

For lag or rolling features, SHapley Additive exPlanations (SHAP) or Boruta is often more stable. Summary: not directly applicable, but works with adaptations.

Known limits: unrealistic input combinations and distortion from correlated features. Conditional permutation importance and SHAP (see Appendix A) address these.

4.2 SHAP Values

Based on the Shapley value from cooperative game theory, SHAP computes how much each feature contributes on average across all possible feature subsets (see Appendix A for the mathematical background).

  • Consistency axiom: if the model becomes more dependent on a feature, that feature’s SHAP value never decreases. By contrast, gain importance (the cumulative loss reduction across all splits using that feature, reported by feature_importances_ in LGBM / XGBoost when importance_type is set to “gain”) does not satisfy consistency, so importance rankings can flip even after small model changes. SHAP guarantees this stability axiomatically.
  • Local accuracy: the baseline (the expected model output) plus the sum of the SHAP values of all features for a sample exactly equals that sample’s actual prediction. This means “feature contribution” is not just an analogy but a mathematically closed, decomposable definition.
  • Less sensitive to cardinality bias: SHAP measures marginal contribution, the average change in prediction when the feature is added to a coalition of other features, rather than counting splits or summing training gain. The mechanism that inflates importance with high cardinality does not enter the calculation directly. SHAP still explains the fitted model, though, so splits that overfit on a noise feature receive some attribution.

4.3 Null Importance / Boruta (Most Direct Solution)

These methods directly attack the problem of “noise features accidentally accumulating importance.”

  • Boruta: duplicate each feature, shuffle the copy to create a “shadow feature,” and train them together. A real feature is accepted only if its importance is statistically significantly higher than the maximum among shadow features.
  • Null importance: shuffle the target many times to estimate the null distribution of importance achievable by noise alone. A real feature counts as signal only if it exceeds a high percentile (e.g., 99th) of that distribution.

Why it works: cardinality bias, greedy randomness, and sampling dilution all act equally on shadow and real features, so the bias cancels. The criterion becomes “how much more important is this feature than noise?”, which is the right question.
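The shadow-feature comparison can be sketched in a few lines. This is a simplified illustration, not the Boruta package: it assumes scikit-learn, uses Random Forest impurity importance, counts how often each real feature beats the best shadow over a handful of rounds, and skips the statistical test that Boruta applies to those hit counts. The synthetic data (two signal columns, two noise columns) is invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))              # columns 0-1 carry signal, 2-3 are pure noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

n_rounds = 10
n_real = X.shape[1]
hits = np.zeros(n_real, dtype=int)       # rounds in which a real feature beat every shadow

for r in range(n_rounds):
    # Shadow copy: each column shuffled independently, which breaks its link to y
    # while keeping its marginal distribution (and cardinality) intact.
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_real)])
    model = RandomForestClassifier(n_estimators=50, random_state=r)
    model.fit(np.hstack([X, shadows]), y)
    imp = model.feature_importances_
    real_imp, shadow_imp = imp[:n_real], imp[n_real:]
    hits += real_imp > shadow_imp.max()

print("hits per real feature:", hits.tolist(), "out of", n_rounds)
```

Features that win in nearly every round are candidates to keep; the pure-noise columns should beat the best shadow only occasionally, since they are statistically interchangeable with the shadows.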

4.4 Target Encoding / CatBoost Ordered Boosting

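Cardinality bias stems from differences in split candidate counts. Target encoding replaces each categorical with a single numeric column (a smoothed target mean per category), so the tree searches ordered thresholds instead of arbitrary category groupings, which narrows the gap in candidate counts between features. Naive target encoding leaks the target into the feature; CatBoost’s ordered target statistics and ordered boosting add target leakage protection on top.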
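To make the leakage protection concrete, here is a simplified sketch in the spirit of ordered target statistics: each row is encoded using only the targets of rows that come before it in a random order, so its own label never enters its encoding. The function name, the prior, and the smoothing weight are assumptions for illustration; this is not CatBoost's actual implementation.

```python
import random

def ordered_target_encode(categories, targets, prior, weight=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        seen_sum = sums.get(cat, 0.0)
        seen_cnt = counts.get(cat, 0)
        # Smoothed mean of previously seen targets; the row's own target is excluded.
        encoded[idx] = (seen_sum + prior * weight) / (seen_cnt + weight)
        sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_cnt + 1
    return encoded

cats = ["a", "a", "b", "a", "b"]
ys = [1, 0, 1, 1, 0]
prior = 0.6                               # hypothetical prior, e.g. the global target mean
enc = ordered_target_encode(cats, ys, prior)
print([round(v, 3) for v in enc])
```

The first row visited in each category falls back to the prior, and later rows move toward the category mean as more history accumulates.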

4.5 Multi-Seed Averaging

Greedy splitting is sensitive to data order and subsampling seed. Averaging or taking the median of importance across multiple seeds reduces the variance of “lucky wins”: for $n$ roughly independent runs, the standard deviation of the averaged importance shrinks by a factor of $\sqrt{n}$ (the variance by a factor of $n$).
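A quick numerical check of the averaging effect, as a sketch under the idealized assumption that per-seed importance scores are independent draws around a fixed true value. The true importance and per-seed noise level below are hypothetical; real runs on the same data are correlated, so the practical reduction is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_importance = 1.0      # hypothetical "true" score of one feature
seed_noise_sd = 0.3        # hypothetical seed-to-seed standard deviation
n_seeds = 16
n_experiments = 5000       # how many times the whole procedure is repeated

# Importance from a single seed, repeated across experiments.
single = true_importance + seed_noise_sd * rng.normal(size=n_experiments)

# Importance averaged over n_seeds seeds, repeated across experiments.
runs = true_importance + seed_noise_sd * rng.normal(size=(n_experiments, n_seeds))
averaged = runs.mean(axis=1)

ratio = single.std() / averaged.std()
print(f"std single seed   : {single.std():.4f}")
print(f"std {n_seeds}-seed mean  : {averaged.std():.4f}")
print(f"ratio             : {ratio:.2f}  (sqrt(n) = {np.sqrt(n_seeds):.2f})")
```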

5. Cause–Countermeasure Mapping

Cause | Permutation | SHAP | Boruta/Null | Target Encoding | Multi-seed
1) Lucky greedy wins
2) Cardinality bias
3) Masking / correlation | □ (Conditional ◉)
4) Subsampling dilution
5) Importance metric limits

◉ Highly effective / ◎ Effective / □ Partially effective / – Not relevant

6. Recommended Practical Pipeline

Do not rely on a single method. Combine them in layers.

[Stage 1: Filter]     Boruta or Null Importance
                        ↓   (keep features significant vs. noise)
[Stage 2: Validate]   SHAP value analysis
                        ↓   (check direction and consistency)
[Stage 3: Stabilize]  Multi-seed repetition (≥ 5 runs)
                        ↓   (check importance variance)
[Final selection]     Features with stable, proven signal

Supporting practices: unify categoricals via target encoding (equalizes cardinality), use conditional permutation importance when correlations are strong, and avoid setting feature_fraction too low in LGBM (mitigates dilution).

7. Key Takeaways

  • Noise pushing out primary features is not a data problem — it is a structural property of tree models, born from greedy splitting, cardinality bias, and sampling dilution.
  • Default feature importance (gain, split count) measures “how much the model used a feature,” not its true relationship with the target.
  • Boruta and null importance are the most direct fix: they use noise itself as the baseline, canceling all biases at once.
  • SHAP and permutation are complementary: permutation handles cardinality bias, SHAP guarantees axiomatic consistency.
  • Robust feature selection requires a multi-layer verification pipeline, not a single metric.

Appendix A. Shapley Value and SHAP

A.1 Origin

The Shapley value was proposed by economist Lloyd Shapley in 1953 in cooperative game theory, answering: “When several players cooperate to earn a total reward, how do we fairly distribute it according to each player’s contribution?”

Example: A, B, and C work together and earn $1M. A alone earns $100K, A+B earn $500K, A+B+C earn $1M, and so on for every coalition. Shapley defined each player’s fair share as the average marginal contribution they bring across all possible joining orders.

A.2 Mathematical Definition

Let $N = \{1, 2, \dots, n\}$ be the feature set, $S \subseteq N \setminus \{i\}$ a subset, and $v(S)$ the model output on $S$. The Shapley value of feature $i$ is:

$$ \phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left[ v(S \cup \{i\}) - v(S) \right] $$

  • $v(S \cup \{i\}) - v(S)$ is the marginal contribution of feature $i$ when added to $S$.
  • The fraction is the weight of coalition $S$: the share of the $n!$ possible orderings in which exactly the features in $S$ come before $i$.
  • The result is the average marginal contribution of $i$ across all coalitions.
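The definition can be checked by brute force on the three-player example from A.1. The sketch below averages marginal contributions over all joining orders, which is equivalent to the subset-weighted sum above. Only the payoffs for A, A+B, and A+B+C come from the example; the remaining coalition values are made-up fillers, and the function name is hypothetical.

```python
from itertools import permutations

def shapley_values(players, v):
    """Exact Shapley values: average marginal contribution over all joining orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v[with_p] - v[coalition]   # marginal contribution of p
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}

# Payoffs in $K. A=100, A+B=500, A+B+C=1000 follow the example in A.1;
# the other coalition values are illustrative assumptions.
v = {
    frozenset(): 0,
    frozenset("A"): 100, frozenset("B"): 200, frozenset("C"): 300,
    frozenset("AB"): 500, frozenset("AC"): 600, frozenset("BC"): 700,
    frozenset("ABC"): 1000,
}

phi = shapley_values("ABC", v)
for player, share in phi.items():
    print(player, round(share, 2))
print("sum of shares:", round(sum(phi.values()), 2))
```

The shares add up to the grand-coalition payoff, which is the Efficiency property. Enumerating all orders costs $n!$ evaluations, so this only works for a handful of players; it is meant to make the formula tangible, not to replace TreeSHAP.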

A.3 The Four Axioms

The Shapley value is provably the unique distribution rule satisfying these four axioms.

Axiom | Meaning
Efficiency | The Shapley values sum to the total payoff $v(N) - v(\emptyset)$; for a model, the prediction minus the baseline. Nothing is left undistributed.
Symmetry | Two features with identical contributions across all subsets receive identical Shapley values.
Dummy | A feature whose marginal contribution is zero in every coalition gets a Shapley value of zero.
Additivity | Shapley values of two combined models equal the sum of their individual Shapley values.

A.4 SHAP (SHapley Additive exPlanations)

Lundberg & Lee (2017) applied the Shapley value to Machine Learning (ML) prediction explanation. The mapping is: players → features, total reward → model prediction, fair share → contribution to that prediction. For a sample $x$:

$$ f(x) = \phi_0 + \sum_{i=1}^{n} \phi_i $$

Here $\phi_0$ is the baseline (mean prediction with no features) and $\phi_i$ is feature $i$’s SHAP value for this sample.

A.5 Why SHAP Resists Noise

  • The Dummy axiom gives a feature that never changes the prediction a SHAP value of exactly zero, so a noise feature the model barely uses stays close to zero.
  • The Symmetry axiom means attribution depends only on contributions to the prediction: equal contributions yield equal values regardless of split candidate counts.
  • Consistency (the monotonicity property Lundberg & Lee (2017) establish for SHAP values) ensures that increasing a feature’s role in the model never reduces its SHAP value, unlike default gain importance.

A.6 Limitations

  • Compute cost: TreeSHAP runs in $O(TLD^2)$ where $T$ is the number of trees, $L$ the number of leaves, and $D$ the depth — heavy on large models.
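  • Independence assumption: KernelSHAP and interventional TreeSHAP perturb features as if they were independent, so strong correlations create unrealistic coalitions. Conditional (observational) SHAP variants are the usual alternative; they stay on the data distribution but can assign credit to features that are merely correlated with the ones the model uses.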
  • Local vs. global: SHAP is fundamentally local; aggregate $|\phi_i|$ for a global view.

Appendix B. Cardinality Bias in Detail

B.1 Definition

Cardinality bias (selection bias toward high-cardinality features): because the number of split candidates in a tree model is proportional to the number of unique values in a feature, high-cardinality features can find a split that fits the training data by chance, inflating their importance even when they are unrelated to the target.

B.2 What Cardinality Means

  • Sex: 2 values → low cardinality.
  • Postal code: thousands of values → high cardinality.
  • Continuous variables: nearly every value unique → very high cardinality.

B.3 The Mechanism

At each node, the tree tries every possible split point and picks the one with the highest gain.

  • 2 unique values: 1 candidate.
  • 1000 unique values: 999 candidates.

More candidates means a higher chance that at least one accidentally fits the training data well — like buying 999 lottery tickets versus one. Even if that split captures pure noise, the gain is still measured as high.
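The lottery-ticket effect is easy to reproduce. The sketch below, a simplified setup using NumPy only, draws a random binary target and two pure-noise features (one with 2 unique values, one continuous), then compares the best Gini gain each can achieve at a single node. The function name, sample size, and trial count are assumptions for illustration, and the exhaustive threshold search is a simplification of what histogram-based GBMs do.

```python
import numpy as np

def best_gini_gain(x, y):
    """Largest Gini impurity reduction over all cuts between distinct sorted values of x."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    left_n = np.arange(1, n)                  # size of the left child for each cut
    right_n = n - left_n
    left_pos = np.cumsum(ys)[:-1]             # positives falling in the left child
    right_pos = ys.sum() - left_pos

    def gini(pos, cnt):
        p = pos / cnt
        return 2 * p * (1 - p)

    parent = gini(ys.sum(), n)
    children = (left_n * gini(left_pos, left_n) + right_n * gini(right_pos, right_n)) / n
    valid = xs[1:] != xs[:-1]                 # a cut must separate two distinct values
    return float((parent - children)[valid].max()) if valid.any() else 0.0

rng = np.random.default_rng(0)
n, trials = 200, 500
gains_low, gains_high = [], []
for _ in range(trials):
    y = rng.integers(0, 2, size=n)                        # target unrelated to both features
    x_low = rng.integers(0, 2, size=n).astype(float)      # 2 unique values, 1 candidate
    x_high = rng.random(n)                                # all unique, n - 1 candidates
    gains_low.append(best_gini_gain(x_low, y))
    gains_high.append(best_gini_gain(x_high, y))

gains_low, gains_high = np.array(gains_low), np.array(gains_high)
win_rate = float((gains_high > gains_low).mean())
print(f"mean best gain, 2 unique values : {gains_low.mean():.4f}")
print(f"mean best gain, continuous      : {gains_high.mean():.4f}")
print(f"continuous noise wins in {win_rate:.0%} of trials")
```

Both features are equally uninformative, yet the continuous one wins the split competition in the large majority of trials simply because it gets more attempts.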

B.4 Symptoms

  • High-cardinality features get inflated importance even when unrelated to the target.
  • Genuinely important low-cardinality features are systematically underrated.
  • This is called cardinality bias or selection bias toward high-cardinality features. Strobl et al. (2007) formally reported it for Random Forest, and the same applies to GBM-family models.

References

  • Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics.
  • Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions (SHAP). NeurIPS.
  • Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software.
  • Altmann, A., et al. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics.