Introduction

This document classifies reproducibility problems in Python Machine Learning (ML) pipelines into three chapters, plus a fourth chapter on diagnostic techniques:

Known problems (Observed, Identified issues) — already documented in standard guides and library references.
Potential problems (Anticipated, Emerging issues) — latent risks that surface without process governance.
Unknown problems (Hidden, Latent issues) — subtle traps that standard guides rarely cover.

This classification aligns well with the Six Sigma philosophy and toolset, and is the methodology used in semiconductor process and product development.

1. Known Problems (Observed, Identified Issues)

Determinism traps recognized by standard libraries and the community. Most are fixed with one or two lines of code.

1.1 Direct set → list conversion

Library: builtins

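Problem. Unless PYTHONHASHSEED is fixed, Python randomizes string hashing per process, so hash() values for strings change on every run and set iteration order becomes non-deterministic. A dict iterates in insertion order (guaranteed since Python 3.7), but a dict or list built by iterating a set inherits the set’s order. When that order decides the next stage’s column or row order, the non-determinism cascades downstream.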

Code Before

selected = list(set(candidate_cols) - set(drop_cols))
result = list(my_set)

Code After

# (a) Alphabetical — generic dedup
selected = sorted(set(candidate_cols) - set(drop_cols))
# (b) Preserve input order, when column meaning matters (does not dedup)
drop_set = set(drop_cols)   # used only for membership tests, never iterated
selected = [c for c in candidate_cols if c not in drop_set]

Solution. Grep for list(set(...)) and list(<set_var>) patterns; replace with sorted() or an order-preserving list comprehension. Harmless when the order is irrelevant (e.g. drop targets), but mandatory wherever the result feeds the next stage’s input order.
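Option (b) keeps duplicates that are already present in the input. As a minimal sketch, assuming deduplication is also wanted, dict.fromkeys can supply it while keeping first-seen order. The helper name ordered_difference and the sample column names are hypothetical, not part of the original pipeline.

```python
def ordered_difference(candidates, to_drop):
    """Return candidates minus to_drop, deduplicated, in first-seen input order."""
    drop_set = set(to_drop)              # membership tests only; never iterated
    unique = dict.fromkeys(candidates)   # dicts keep insertion order (Python 3.7+)
    return [c for c in unique if c not in drop_set]


candidate_cols = ["temp", "pressure", "temp", "flow", "rpm"]
drop_cols = ["flow"]
selected = ordered_difference(candidate_cols, drop_cols)
print(selected)  # ['temp', 'pressure', 'rpm']
```

The set is still used, but only for lookups, so its iteration order never reaches the output.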

1.2 sorted() tie-break

Library: builtins

Problem. Python sorted() is stable: tied keys keep their input order. If the input is a set, or a list or dict that was built by iterating a set, the relative order of tied items becomes non-deterministic. This often hits in stability selection or multi-trial score averaging where ties are common.

Code Before

ranked = sorted(features_set, key=lambda f: scores[f], reverse=True)

Code After

ranked = sorted(features_set, key=lambda f: (-scores[f], f))

Solution. Add a deterministic secondary key (name, identifier) to the tuple key. Replace reverse=True with sign inversion (-score) so the primary key descends while the secondary ascends.
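The effect of the secondary key can be checked by feeding every possible input order into both versions. This is a small sketch with made-up feature names and scores; the permutations stand in for the arbitrary order a set might produce.

```python
import itertools

scores = {"f_a": 0.8, "f_b": 0.8, "f_c": 0.5}   # f_a and f_b are tied


def rank_naive(features):
    # Stable sort: tied items keep whatever order the input happened to have.
    return sorted(features, key=lambda f: scores[f], reverse=True)


def rank_tiebreak(features):
    # Primary key: score descending; secondary key: name ascending.
    return sorted(features, key=lambda f: (-scores[f], f))


orders = list(itertools.permutations(scores))
naive = {tuple(rank_naive(p)) for p in orders}
fixed = {tuple(rank_tiebreak(p)) for p in orders}
print(len(naive), len(fixed))  # 2 1
```

The naive ranking yields two distinct results depending on input order, while the tuple key always yields one.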

1.3 LightGBM multi-thread determinism

Library: lightgbm

Problem. With default settings, multi-threaded training drifts slightly across runs even with identical data and seed.

Code Before

import lightgbm as lgb
model = lgb.LGBMRegressor(n_estimators=200, random_state=42)

Code After

model = lgb.LGBMRegressor(
    n_estimators=200,
    random_state=42,
    deterministic=True,
    force_row_wise=True,    # or force_col_wise=True
    verbose=-1,
)

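Solution. Set deterministic=True together with force_row_wise=True (or force_col_wise=True). The LightGBM documentation recommends pairing them to avoid numerical instability from the automatic histogram-layout choice; deterministic=True on its own is not a complete guarantee. Results can still differ across LightGBM versions, compilers, and platforms.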

1.4 CatBoost Random Subspace Method (RSM) and seed

Library: catboost

Problem. Without random_seed, every run differs. With Random Subspace Method (RSM) enabled (rsm < 1), CatBoost samples column indices, but the index → real-column mapping depends on input column order. If upstream column order changes, RSM trains on a different feature subset (cascade risk — see 3.1).

Code Before

from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=800, rsm=0.5)

Code After

model = CatBoostRegressor(
    iterations=800, rsm=0.5,
    random_seed=42, verbose=0,
)

Solution. Set random_seed explicitly and preserve column order across the entire pipeline.

1.5 sklearn LassoCV / KFold

Library: sklearn.linear_model, sklearn.model_selection

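Problem. With shuffle=True and no random_state, fold splits are non-deterministic. An integer cv (as in cv=5) uses unshuffled folds, which are reproducible only as long as the row order is. LassoCV’s own random_state matters only when selection='random'.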

Code Before

from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5, n_jobs=-1)

Code After

from sklearn.model_selection import KFold
lasso = LassoCV(
    cv=KFold(5, shuffle=True, random_state=42),
    n_jobs=-1, random_state=42, max_iter=10000,
)

Solution. Always set random_state and pass an explicit splitter object as cv. Apply the same rule to train_test_split, StratifiedKFold, and any other split helper.

1.6 XGBoost

Library: xgboost

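Problem. An unset random_state leaves the seed implicit, so any row or column subsampling depends on library defaults rather than on a value recorded with the experiment. Graphics Processing Unit (GPU) training (tree_method='gpu_hist' in older releases, device='cuda' in newer ones) can be non-deterministic because the floating-point summation order is not fixed.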

Code Before

import xgboost as xgb
model = xgb.XGBRegressor(n_estimators=500, tree_method='hist')

Code After

model = xgb.XGBRegressor(
    n_estimators=500,
    tree_method='hist',
    random_state=42,
    n_jobs=-1,
)

Solution. Use Central Processing Unit (CPU) tree_method='hist' with random_state. If GPU is required, accept non-determinism or build a separate validation pipeline.

1.7 pandas categorical dtype

Library: pandas

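Problem. astype('category') infers the categories from the data: sorted unique values when they are sortable, first-appearance order otherwise (e.g. mixed types). If the set of observed values differs between runs or between training and inference, or the values are not sortable and row order is non-deterministic, the category codes shift, and Gradient Boosted Decision Tree (GBDT) models learn different splits.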

Code Before

df[col] = df[col].astype('category')

Code After

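import pandas as pd

df[col] = pd.Categorical(
    df[col],
    categories=sorted(df[col].dropna().unique()),   # explicit, sorted category list
)

Solution. Pass an explicit, deterministically sorted list to the categories argument, store that list with the training artifacts, and reuse the same list at inference time.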

1.8 File loading order

Library: pathlib

Problem. Path.glob() order is filesystem-dependent. Usually stable on the same machine with the same files, but it can shift across platforms or when files are added or mounts change.

Code Before

files = list(folder.glob('*.parquet'))

Code After

files = sorted(folder.glob('*.parquet'))

Solution. Wrap every file iteration in sorted().

1.9 NumPy global Random Number Generator (RNG)

Library: numpy, random

Problem. Calls like np.random.rand() or random.choice() hit the global RNG; without a seed they vary every run. Global seeds are also fragile because any dependency can reseed mid-pipeline.

Code Before

import numpy as np
sample_idx = np.random.choice(n, size=k)

Code After

rng = np.random.default_rng(42)
sample_idx = rng.choice(n, size=k)

Solution. Inject an explicit Generator instead of touching the global RNG. If unavoidable, call np.random.seed(N) and random.seed(N) right at process start.
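The same injection pattern applies to the standard-library random module. The sketch below assumes a hypothetical helper, sample_rows, that accepts its generator as an argument; it shows that a reseed of the global RNG elsewhere in the process does not disturb the result.

```python
import random


def sample_rows(n_rows, k, rng):
    """Draw k distinct row indices using the caller-supplied generator."""
    return rng.sample(range(n_rows), k)


first = sample_rows(1000, 5, random.Random(42))
second = sample_rows(1000, 5, random.Random(42))

random.seed(999)   # simulates a dependency reseeding the global RNG mid-pipeline
third = sample_rows(1000, 5, random.Random(42))

print(first == second == third)  # True
```

Because each call owns its generator, the draw depends only on the seed that was passed in.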

2. Potential Problems (Anticipated, Emerging Issues)

Areas to control by governance and process before failures surface. These cannot be fixed with a one-line edit; they require team rules, infrastructure, and tooling.

2.1 Code-level control (Seed Management)

Problem. Without a unified seed entry point, only some RNGs become deterministic. GPU operations that go through the CUDA Deep Neural Network library (cuDNN) sometimes use non-deterministic algorithms for performance and must be forced into deterministic mode.

Code Before

torch.manual_seed(42)
# Only the framework seed is set; global RNG and cuDNN are unguarded.

Code After

import os, random, numpy as np, torch

def set_seed(seed: int = 42):
    os.environ['PYTHONHASHSEED'] = str(seed)   # affects child processes only; this interpreter's hash seed was fixed at startup
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)   # raise on non-deterministic ops

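Solution. Call set_seed at every entry point (training and inference launchers). Export PYTHONHASHSEED in the launching shell or container as well, because assigning it inside a running interpreter affects only child processes. torch.use_deterministic_algorithms(True) promotes latent issues to runtime errors as soon as a non-deterministic op is invoked; on CUDA, some operations additionally require the CUBLAS_WORKSPACE_CONFIG environment variable to be set (for example :4096:8).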
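The interpreter reads PYTHONHASHSEED once, at startup. The following sketch illustrates that behaviour by launching two fresh interpreters with the variable already present in their environment and comparing a string hash. The helper name child_hash and the sample string are illustrative assumptions.

```python
import os
import subprocess
import sys

CHILD_CODE = "print(hash('feature_a'))"


def child_hash(hash_seed):
    """Start a fresh interpreter with PYTHONHASHSEED fixed; return hash('feature_a')."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


same_seed_runs = {child_hash(42), child_hash(42)}
print(len(same_seed_runs))  # 1: both children computed the same hash
```

In practice this means the variable belongs in the shell, the container definition, or the job scheduler configuration rather than in the training script.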

2.2 Data-level control (Data Lineage)

Problem. Identical code can produce different results when the data drifts. Without snapshots and lineage of datasets and preprocessing artifacts, reproduction is impossible.

Code Before

df = pd.read_csv('data/raw.csv')        # which raw? from when?
X_train, X_test, y_train, y_test = train_test_split(X, y)   # no seed

Code After

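# Shell: version the data with Data Version Control (DVC)
dvc add data/raw.csv
git add data/raw.csv.dvc

# Python
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/raw.csv')
# X and y are derived from df (not shown here)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True)
joblib.dump(preprocess_pipeline, 'artifacts/preprocess.pkl')   # serialize preprocessing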

Solution. Snapshot data with Data Version Control (DVC) — or LakeFS, Delta Lake — fix random_state on every split, and serialize the preprocessing pipeline (joblib, pickle) alongside model artifacts.
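Where a full data-versioning tool is not yet in place, a content hash of the input file can at least answer the "which raw? from when?" question. This is a stand-in sketch, not a replacement for DVC; the function name file_sha256 and the temporary file are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file; a real pipeline would hash data/raw.csv.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_bytes(b"id,value\n1,0.5\n")
    fingerprint = file_sha256(raw)

print(fingerprint[:12])
```

Logging this digest next to the model artifact makes it possible to verify later that a rerun used byte-identical input.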

2.3 Environment-level control (Environment Isolation)

Problem. The classic “works on my machine” trap. Operating System (OS), GPU drivers, and library versions can shift results. Even minor NumPy / Basic Linear Algebra Subprograms (BLAS) / Math Kernel Library (MKL) updates can change floating-point outputs.

Code Before

pip install lightgbm catboost   # versions floating; latest each time

Code After

pip freeze > requirements.txt          # pinned versions
# or Poetry / Conda
conda env export --no-builds > environment.yml

# Dockerfile: same training and inference environment
FROM python:3.10.14-slim
COPY requirements.txt .
RUN pip install --no-deps -r requirements.txt

Solution. Pin versions (Poetry or Conda lockfile) and build Docker images. Reusing the exact training image at inference time is the strongest guarantee.
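Pinned files describe the intended environment; it can also help to record the environment a run actually saw. The sketch below uses only the standard library to collect that information. The function name environment_snapshot and the layout of the resulting dict are assumptions, not an established format.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, OS, and installed package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:                 # skip broken or partial installs
            packages[name.lower()] = version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
manifest = json.dumps(snapshot, indent=2, sort_keys=True)   # stable key order
print(snapshot["python"], len(snapshot["packages"]))
```

Saving the manifest with each run lets two runs be compared for version drift before any model-level debugging starts.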

2.4 Experiment management (MLOps and Tracking)

Problem. “It worked this week, but I can’t reproduce it next week” — without explicit tracking of code, hyperparameters, and data versions, there is no audit trail.

Code Before

model.fit(X, y)
joblib.dump(model, f'model_{date.today()}.pkl')   # config detached, no git sha

Code After

import mlflow, subprocess
with mlflow.start_run():
    mlflow.log_params({'lr': 0.05, 'n_estimators': 800, 'seed': 42})
    mlflow.log_param(
        'git_sha',
        subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip(),
    )
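    # dvc_hash is a project-specific helper (not part of DVC or MLflow)
    mlflow.log_param('data_version', dvc_hash('data/raw.csv'))
    model.fit(X, y)
    mlflow.sklearn.log_model(model, 'model')
    mlflow.log_metric('cv_r2', cv_r2)   # cv_r2 comes from your own cross-validation step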

Solution. Use MLflow, Weights & Biases (W&B), or Neptune to log code version (git SHA), hyperparameters, data version (DVC hash), and metrics. Bundle the model artifact and its config together.

3. Unknown Problems (Hidden, Latent Issues)

Hard to debug and easy to miss. Often skipped because of the false belief that “if the determinism flags are set, we are done.”

3.1 Cascading column-order effect

Library: every GBDT — lightgbm, catboost, xgboost

Problem. Even with all determinism flags and seeds set, results still differ. The cause is upstream: the column order entering the model is non-deterministic, so tied information-gain splits are tie-broken differently, producing different trees. CatBoost’s RSM, LightGBM’s feature_fraction, and XGBoost’s colsample_bytree sample column indices, which makes them especially sensitive.

Code Before

# Upstream: stability-selection output
selected = sorted(stable_set, key=lambda f: scores[f], reverse=True)
# Tied scores resolve in set-iteration order — non-deterministic.

# Downstream: all determinism flags set
X_next = X[selected]
model = lgb.LGBMRegressor(deterministic=True, force_row_wise=True, random_state=42)
model.fit(X_next, y)   # different column order ⇒ different model!

Code After

selected = sorted(stable_set, key=lambda f: (-scores[f], f))   # tie-break

Solution. Enforce deterministic column and row order through every stage. Use stage-by-stage pickle diffing (4.1) to find the first stage that diverges.
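One way to make column order visible is to fingerprint it at each stage boundary and to fail fast when it differs from a recorded order. The sketch below assumes two hypothetical helpers, column_fingerprint and check_column_order; with pandas, the columns argument would be X.columns.

```python
import hashlib


def column_fingerprint(columns):
    """Hash the column names in order; any reordering changes the digest."""
    joined = "\x1f".join(columns)   # unit separator keeps name boundaries unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def check_column_order(columns, expected):
    """Raise ValueError if columns differ from expected in content or order."""
    columns, expected = list(columns), list(expected)
    if columns == expected:
        return
    if sorted(columns) == sorted(expected):
        pos = next(i for i, (c, e) in enumerate(zip(columns, expected)) if c != e)
        raise ValueError(f"same columns, different order (first mismatch at index {pos})")
    raise ValueError("column sets differ")


run_a = ["rpm", "temp", "flow"]
run_b = ["temp", "rpm", "flow"]
print(column_fingerprint(run_a) == column_fingerprint(run_b))  # False

try:
    check_column_order(run_b, expected=run_a)
except ValueError as err:
    print(err)  # same columns, different order (first mismatch at index 0)
```

Logging the fingerprint per stage gives a cheap signal that two runs fed the model differently ordered inputs, even when every model-level flag is set.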

3.2 Floating-point precision

Library: Python core — Institute of Electrical and Electronics Engineers (IEEE) 754 binary64

Problem. Python float follows IEEE 754 binary64. Most decimal fractions have no exact binary representation, so each conversion and arithmetic step can introduce a tiny rounding error. 0.1 + 0.2 == 0.3 evaluates to False. Threshold comparisons, equality checks, and accumulated sums can flip decisions because of these errors.

Code Before

if score == threshold:    # risky — Unit-in-the-Last-Place (ULP) drift can flip the decision
    keep = False
total = sum(values)       # sum order varies under parallel reduce

Code After

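import math
if math.isclose(score, threshold, rel_tol=1e-9):
    keep = False
# Order-independent float sum: math.fsum returns the correctly rounded total
total = math.fsum(values)
# When exact decimal arithmetic matters (e.g. money), use the decimal module instead
from decimal import Decimal
total = sum(Decimal(str(v)) for v in values)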

Solution. Replace exact equality with math.isclose or an explicit tolerance. Use decimal for finance-grade or precision-critical math. Design ML thresholds so feature scores rarely cluster near the boundary.
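The points above can be reproduced in a few lines. The sketch uses an explicit left-to-right loop for the naive sum so that the behaviour does not depend on how a particular Python version implements the built-in sum(); the helper name left_to_right is an assumption for illustration.

```python
import math


def left_to_right(values):
    """Plain sequential accumulation, as a simple reduction would do it."""
    total = 0.0
    for v in values:
        total += v
    return total


print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True

values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]        # same numbers, different order

naive_a, naive_b = left_to_right(values), left_to_right(reordered)
exact_a, exact_b = math.fsum(values), math.fsum(reordered)
print(naive_a, naive_b)   # 0.0 1.0
print(exact_a, exact_b)   # 1.0 1.0
```

The sequential sum loses the 1.0 when it is added to 1e16 first, while math.fsum tracks the lost low-order bits and returns the same total for both orders.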

3.3 BLAS multi-thread reduction order

Library: numpy, scipy, pandas, scikit-learn (BLAS backends: MKL, OpenBLAS, Accelerate)

Problem. Reductions like X.T @ X, df.corr(), np.sum, and np.dot may sum partial results from threads in different orders each run. Floating-point addition is non-associative ($(a+b)+c \neq a+(b+c)$), so the result drifts by a few ULP. Threshold branches such as corr > 0.95 can flip.

Code Before

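import numpy as np
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # skip the diagonal of 1.0
drop = upper.columns[(upper > self.threshold).any()]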

Code After

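import os
# Set before NumPy, SciPy, or pandas are imported; the BLAS thread pool reads these at load time
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
import numpy as np   # imported only after the env vars are set
# Or add a margin to the threshold comparison; the upper triangle skips the diagonal of 1.0
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop = upper.columns[(upper > self.threshold + 1e-12).any()]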

Solution. Force BLAS to a single thread only where determinism is critical (the performance cost is real). Alternatively, design the algorithm so threshold ties are rare (e.g. add an epsilon margin).
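The threshold flip can be shown without any threads: grouping the same three numbers differently stands in for two different thread-combination orders. The values and the 0.6 threshold are chosen purely for illustration.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # partial sums combined in one order
right = a + (b + c)   # same numbers, combined in another order
print(left, right)    # 0.6000000000000001 0.6

threshold = 0.6
print(left > threshold, right > threshold)   # True False

# With an epsilon margin, both groupings fall on the same side.
margin = 1e-12
flags = (left > threshold + margin, right > threshold + margin)
print(flags)          # (False, False)
```

The margin does not remove the boundary; it moves it away from values that differ only by rounding noise, so it helps only when genuine scores are unlikely to land within the margin.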

4. Reproducibility Diagnostics

Techniques to locate where non-determinism enters the pipeline once a reproducibility failure has occurred.

4.1 Stage-by-stage pickle diff

When to use. Two runs with identical input produce different final outputs, and you don’t know at which stage the non-determinism crept in.

Why byte-level diff (fc /b, cmp, Compare-Object) is not enough. Byte comparison only tells you whether two files are identical or not. It gives no clue about where they diverged. Non-determinism typically starts at one stage and cascades downstream (see 3.1), so the goal is to find that first divergence point. Comparing only the final output forces a binary search across the whole pipeline. Worse, ML pipeline stages are usually trained transformer objects, so meaningful diagnosis must compare them as objects, not as raw bytes.

Why pickle comparison is more powerful. When each pipeline stage (FeatureCleaner, Selector, Encoder, Model, etc.) saves its fit result to a .pkl file, that file contains the trained state — attributes such as selected_features_, kept_cols_, drop_corr_cols_, categories_. Unpickling the same stage from two runs and comparing those attributes tells you whether that stage’s learned output is deterministic. Walking from stage 1 onward, the first stage whose output diverges is the source of the non-determinism.

How to compare. A two-step check (set equality, then list equality) reveals not only divergence but also its nature; the two results combine into three possible outcomes:

import pickle

def load(path):
    with open(path, 'rb') as fh:   # close the file handle once loading is done
        return pickle.load(fh)

a = load('run_A/stage_5.pkl')
b = load('run_B/stage_5.pkl')
sa, sb = set(a.selected_), set(b.selected_)
print(f'set equal : {sa == sb}')                                  # same elements?
print(f'list equal: {list(a.selected_) == list(b.selected_)}')    # same order too?

set equal	list equal	Diagnosis up to that stage
True	True	Deterministic ✅
True	False	Same elements, different order — `list(set)` or `sorted` tie-break trap (1.1, 1.2)
False	False	Elements themselves differ: if every earlier stage reads (True, True), this stage is the source; otherwise the cascade started in an earlier stage (3.1)

Caveats:

Not a sufficient condition. “Deterministic up to that stage” only. If a later stage uses non-deterministic logic, the cascade restarts there. Full-pipeline determinism requires every stage to read (True, True).
Prerequisite. The technique only works if the pipeline serializes per-stage artifacts. It is worth designing pipelines to dump each stage as a .pkl — useful for partial-resume and other operational tasks beyond debugging.
Environment alignment. Both runs must use the same library versions; otherwise unpickling itself can fail or yield differently shaped attributes (see 2.3).
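The stage walk described above can be automated. The sketch below is one possible shape, assuming the learned attribute of each stage has already been unpickled into two dicts keyed by stage name in pipeline order; the function names, stage names, and feature lists are hypothetical.

```python
def diagnose(list_a, list_b):
    """Classify one stage using the set-then-list check."""
    if list(list_a) == list(list_b):
        return "deterministic"
    if set(list_a) == set(list_b):
        return "order differs"
    return "elements differ"


def first_divergence(run_a, run_b):
    """Return (stage_name, diagnosis) for the first mismatching stage, or None."""
    for name in run_a:                       # dicts iterate in insertion (pipeline) order
        verdict = diagnose(run_a[name], run_b[name])
        if verdict != "deterministic":
            return name, verdict
    return None


# Stand-ins for attributes loaded from run_A/*.pkl and run_B/*.pkl
run_a = {"cleaner": ["a", "b", "c", "d"], "selector": ["a", "c", "b"], "encoder": ["a", "c"]}
run_b = {"cleaner": ["a", "b", "c", "d"], "selector": ["a", "b", "c"], "encoder": ["a", "b"]}

print(first_divergence(run_a, run_b))  # ('selector', 'order differs')
```

Here the encoder also differs, but the walk stops at the selector, which is the stage to fix first; the encoder mismatch is a downstream consequence.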

