Python ML Pipeline Reproducibility — Field Notes

Introduction
This document classifies reproducibility problems in Python Machine Learning (ML) pipelines into three chapters, plus a fourth chapter on diagnostic techniques:
- Known problems (Observed, Identified issues) — already documented in standard guides and library references.
- Potential problems (Anticipated, Emerging issues) — latent risks that surface without process governance.
- Unknown problems (Hidden, Latent issues) — subtle traps that standard guides rarely cover.
This classification aligns well with the Six Sigma philosophy and toolset, and is the methodology used in semiconductor process and product development.
1. Known Problems (Observed, Identified Issues)
Determinism traps recognized by standard libraries and the community. Most are fixed with one or two lines of code.
1.1 Direct set → list conversion
Library: builtins
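Problem. String hashing is randomized per process unless PYTHONHASHSEED is fixed, so string hash() values change every run and set iteration order becomes non-deterministic. Dicts preserve insertion order (Python 3.7+), but a dict or list built by iterating a set inherits that set's order. When that order decides the next stage’s column or row order, the non-determinism cascades downstream.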
Code Before
selected = list(set(candidate_cols) - set(drop_cols))
result = list(my_set)
Code After
# (a) Alphabetical: generic dedup
selected = sorted(set(candidate_cols) - set(drop_cols))
# (b) Preserve input order: when column meaning matters
drop_set = set(drop_cols)
selected = [c for c in candidate_cols if c not in drop_set]
Solution. Grep for list(set(...)) and list(<set_var>) patterns; replace with sorted() or an order-preserving list comprehension. Harmless when the order is irrelevant (e.g. drop targets), but mandatory wherever the result feeds the next stage’s input order.
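A minimal sketch of both replacements on made-up column names (the values of candidate_cols and drop_cols are illustrative only). It assumes duplicates may appear in the candidate list, and uses dict.fromkeys as one common way to deduplicate while keeping first-seen order:

```python
candidate_cols = ["temp", "pressure", "temp", "flow", "humidity"]  # illustrative values
drop_cols = ["flow"]
drop_set = set(drop_cols)

# (a) Alphabetical: independent of the hash seed and of input order.
selected_sorted = sorted(set(candidate_cols) - drop_set)

# (b) Order-preserving dedup: dict keys keep insertion order (Python 3.7+).
selected_ordered = [c for c in dict.fromkeys(candidate_cols) if c not in drop_set]

print(selected_sorted)   # ['humidity', 'pressure', 'temp']
print(selected_ordered)  # ['temp', 'pressure', 'humidity']
```

Both results are identical on every run; which one to use depends on whether the original column order carries meaning.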
1.2 sorted() tie-break
Library: builtins
Problem. Python sorted() is stable: tied keys keep their input order. If the input is a set (or a dict or list built from one), the relative order of tied items becomes non-deterministic. This often hits in stability selection or multi-trial score averaging, where ties are common.
Code Before
ranked = sorted(features_set, key=lambda f: scores[f], reverse=True)
Code After
ranked = sorted(features_set, key=lambda f: (-scores[f], f))
Solution. Add a deterministic secondary key (name, identifier) to the tuple key. Replace reverse=True with sign inversion (-score) so the primary key descends while the secondary ascends.
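The effect of the secondary key can be checked with a small experiment. This is a sketch with invented feature names and scores; shuffling the input stands in for arbitrary set-iteration order:

```python
import random

scores = {"f_a": 0.9, "f_b": 0.7, "f_c": 0.7, "f_d": 0.7}  # illustrative three-way tie

def rank(features):
    # Primary key: score descending (negated). Secondary key: name ascending.
    return sorted(features, key=lambda f: (-scores[f], f))

features = list(scores)
shuffler = random.Random(0)
results = set()
for _ in range(20):
    shuffler.shuffle(features)          # simulate arbitrary input order
    results.add(tuple(rank(features)))

print(results)  # {('f_a', 'f_b', 'f_c', 'f_d')}
```

Every input order collapses to a single ranking because the tuple key leaves no ties for the stable sort to resolve by position.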
1.3 LightGBM multi-thread determinism
Library: lightgbm
Problem. With default settings, multi-threaded training drifts slightly across runs even with identical data and seed.
Code Before
import lightgbm as lgb
model = lgb.LGBMRegressor(n_estimators=200, random_state=42)
Code After
model = lgb.LGBMRegressor(
    n_estimators=200,
    random_state=42,
    deterministic=True,
    force_row_wise=True, # or force_col_wise=True
    verbose=-1,
)
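Solution. Set deterministic=True together with force_row_wise=True (or force_col_wise=True). The LightGBM documentation recommends fixing the histogram layout this way whenever deterministic=True is used, so that results stay stable across thread counts. Identical results are still only expected for the same data, parameters, LightGBM version, and platform.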
1.4 CatBoost Random Subspace Method (RSM) and seed
Library: catboost
Problem. Leaving random_seed implicit ties reproducibility to the library default, which is fragile across versions and wrappers. With Random Subspace Method (RSM) enabled (rsm < 1), CatBoost samples column indices, but the index → real-column mapping depends on input column order. If upstream column order changes, RSM trains on a different feature subset (cascade risk, see 3.1).
Code Before
from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=800, rsm=0.5)
Code After
model = CatBoostRegressor(
    iterations=800, rsm=0.5,
    random_seed=42, verbose=0,
)
Solution. Set random_seed explicitly and preserve column order across the entire pipeline.
1.5 sklearn LassoCV / KFold
Library: sklearn.linear_model, sklearn.model_selection
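Problem. With shuffle=True and no random_state, fold splits are non-deterministic. An integer cv such as cv=5 uses unshuffled KFold, which is deterministic for a fixed row order but changes whenever the row order does.
Code Before
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
lasso = LassoCV(cv=KFold(5, shuffle=True), n_jobs=-1) # shuffled folds, no seed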
Code After
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
lasso = LassoCV(
    cv=KFold(5, shuffle=True, random_state=42),
    n_jobs=-1, random_state=42, max_iter=10000, # random_state here only matters with selection='random'
)
Solution. Always set random_state and pass an explicit splitter object as cv. Apply the same rule to train_test_split, StratifiedKFold, and any other split helper.
1.6 XGBoost
Library: xgboost
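Problem. Leaving random_state unset makes any row or column subsampling (subsample, colsample_*) depend on the library default rather than on a seed you control. Graphics Processing Unit (GPU) training (device='cuda', or tree_method='gpu_hist' in releases before 2.0) can additionally differ between runs because the order of parallel floating-point summation is not fixed.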
Code Before
import xgboost as xgb
model = xgb.XGBRegressor(n_estimators=500, tree_method='hist')
Code After
model = xgb.XGBRegressor(
    n_estimators=500,
    tree_method='hist',
    random_state=42,
    n_jobs=-1,
)
Solution. Use Central Processing Unit (CPU) tree_method='hist' with random_state. If GPU is required, accept non-determinism or build a separate validation pipeline.
1.7 pandas categorical dtype
Library: pandas
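Problem. astype('category') infers categories from the values present in the column: sorted when the values are sortable, otherwise in order of first appearance. The integer codes therefore depend on which values a given frame happens to contain (train vs. inference, different shards) and, for unsortable mixed-type columns, on row order. When the codes shift, Gradient Boosted Decision Tree (GBDT) models learn different splits.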
Code Before
df[col] = df[col].astype('category')
Code After
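df[col] = pd.Categorical(
    df[col],
    categories=sorted(df[col].dropna().unique()),
)
Solution. Pass an explicit, deterministically sorted list to the categories argument. Where possible, persist the list fitted on the training data and reuse it at inference so the codes cannot shift between frames.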
1.8 File loading order
Library: pathlib
Problem. Path.glob() order is filesystem-dependent. Usually stable on the same machine with the same files, but it can shift across platforms or when files are added or mounts change.
Code Before
files = list(folder.glob('*.parquet'))
Code After
files = sorted(folder.glob('*.parquet'))
Solution. Wrap every file iteration in sorted().
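A minimal sketch of the sorted-glob pattern on a temporary folder; the file names are invented and the files are empty placeholders. It also shows one caveat: Path objects sort lexicographically, not numerically.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["part_10.parquet", "part_2.parquet", "part_1.parquet"]:
        (folder / name).touch()   # empty placeholder files

    files = sorted(folder.glob("*.parquet"))   # deterministic, lexicographic order
    names = [f.name for f in files]

print(names)  # ['part_1.parquet', 'part_10.parquet', 'part_2.parquet']
```

The order is reproducible on any filesystem, but part_10 sorts before part_2. If numeric order matters, zero-pad the shard numbers or sort on a parsed integer key.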
1.9 NumPy global Random Number Generator (RNG)
Library: numpy, random
Problem. Calls like np.random.rand() or random.choice() hit the global RNG; without a seed they vary every run. Global seeds are also fragile because any dependency can reseed mid-pipeline.
Code Before
import numpy as np
sample_idx = np.random.choice(n, size=k)
Code After
rng = np.random.default_rng(42)
sample_idx = rng.choice(n, size=k)
Solution. Inject an explicit Generator instead of touching the global RNG. If unavoidable, call np.random.seed(N) and random.seed(N) right at process start.
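The injection pattern can be illustrated with the standard-library random module, which keeps the sketch dependency-free; with NumPy, a Generator from np.random.default_rng(42) would be passed in the same way. The function names sample_rows and noisy_dependency are hypothetical.

```python
import random

def sample_rows(n, k, rng):
    """Draw k distinct row indices from range(n) using the injected generator."""
    return rng.sample(range(n), k)

def noisy_dependency():
    # Simulates a library that reseeds and consumes the global RNG mid-pipeline.
    random.seed()
    random.random()

run_1 = sample_rows(1000, 5, random.Random(42))
noisy_dependency()
run_2 = sample_rows(1000, 5, random.Random(42))

print(run_1 == run_2)  # True
```

Because each call owns its generator, nothing a dependency does to the global RNG state can change the sample.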
2. Potential Problems (Anticipated, Emerging Issues)
Areas to control by governance and process before failures surface. These cannot be fixed with a one-line edit; they require team rules, infrastructure, and tooling.
2.1 Code-level control (Seed Management)
Problem. Without a unified seed entry point, only some RNGs become deterministic. GPU operations via the CUDA Deep Neural Network library (cuDNN) sometimes use non-deterministic algorithms for performance and must be forced into deterministic mode.
Code Before
torch.manual_seed(42)
# Only the framework seed is set; global RNG and CuDNN are unguarded.
Code After
import os, random, numpy as np, torch
def set_seed(seed: int = 42):
    # Affects child processes only; the running interpreter fixed its hash seed at startup.
    os.environ['PYTHONHASHSEED'] = str(seed)
    # cuBLAS needs this for deterministic results on CUDA 10.2+; set it before the first CUDA call.
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True) # raise on non-deterministic ops
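Solution. Call set_seed at every entry point (training and inference launchers). torch.use_deterministic_algorithms(True) promotes latent issues to runtime errors as soon as a non-deterministic op is invoked. Export PYTHONHASHSEED in the launcher's shell or container environment as well, because assigning it inside Python does not change hashing in the interpreter that is already running.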
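PYTHONHASHSEED is read once at interpreter startup, so it has to be in the environment before a process launches. A minimal sketch that checks this by spawning child interpreters (the helper name child_hash is illustrative):

```python
import os
import subprocess
import sys

def child_hash(hash_seed=None):
    """Return hash('feature_a') as computed by a freshly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)            # start from an unpinned environment
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('feature_a'))"], env=env
    )
    return int(out.decode().strip())

pinned = {child_hash(0) for _ in range(3)}      # seed fixed before launch
unpinned = {child_hash() for _ in range(3)}     # randomized per process

print(len(pinned))    # 1: every child computes the same hash
print(len(unpinned))  # usually 3, since each child picks its own seed
```

The same mechanism is why a launcher script or container definition is the reliable place to pin the hash seed.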
2.2 Data-level control (Data Lineage)
Problem. Identical code can produce different results when the data drifts. Without snapshots and lineage of datasets and preprocessing artifacts, reproduction is impossible.
Code Before
df = pd.read_csv('data/raw.csv') # which raw? from when?
X_train, X_test, y_train, y_test = train_test_split(X, y) # no seed
Code After
# Shell: version the data with Data Version Control (DVC)
dvc add data/raw.csv
git add data/raw.csv.dvc
# Python: seed the split and serialize the preprocessing
import joblib
df = pd.read_csv('data/raw.csv')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True)
joblib.dump(preprocess_pipeline, 'artifacts/preprocess.pkl') # serialize preprocessing
Solution. Snapshot data with Data Version Control (DVC) — or LakeFS, Delta Lake — fix random_state on every split, and serialize the preprocessing pipeline (joblib, pickle) alongside model artifacts.
2.3 Environment-level control (Environment Isolation)
Problem. The classic “works on my machine” trap. Operating System (OS), GPU drivers, and library versions can shift results. Even minor NumPy / Basic Linear Algebra Subprograms (BLAS) / Math Kernel Library (MKL) updates can change floating-point outputs.
Code Before
pip install lightgbm catboost # versions floating; latest each time
Code After
pip freeze > requirements.txt # pinned versions
# or Poetry / Conda
conda env export --no-builds > environment.yml
Solution. Pin versions (Poetry or Conda lockfile) and build Docker images. Reusing the exact training image at inference time is the strongest guarantee.
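Pinned files describe the intended environment; it can also help to record the environment a run actually used. A minimal sketch with the standard library only (the function name environment_snapshot and the package list are illustrative assumptions):

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot(packages):
    """Collect interpreter, OS, and installed package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None   # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

snapshot = environment_snapshot(["numpy", "pandas", "lightgbm", "catboost"])
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Storing this JSON next to each model artifact makes it possible to compare two runs' environments before suspecting the code.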
2.4 Experiment management (MLOps and Tracking)
Problem. “It worked this week, but I can’t reproduce it next week” — without explicit tracking of code, hyperparameters, and data versions, there is no audit trail.
Code Before
model.fit(X, y)
joblib.dump(model, f'model_{date.today()}.pkl') # config detached, no git sha
Code After
import mlflow, subprocess
with mlflow.start_run():
    mlflow.log_params({'lr': 0.05, 'n_estimators': 800, 'seed': 42})
    mlflow.log_param(
        'git_sha',
        subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip(),
    )
    # dvc_hash is assumed to be a project-specific helper that returns the data hash
    mlflow.log_param('data_version', dvc_hash('data/raw.csv'))
    model.fit(X, y)
    mlflow.sklearn.log_model(model, 'model')
    mlflow.log_metric('cv_r2', cv_r2)
Solution. Use MLflow, Weights & Biases (W&B), or Neptune to log code version (git SHA), hyperparameters, data version (DVC hash), and metrics. Bundle the model artifact and its config together.
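Where a tracking server is not available, the same idea can be approximated with a small manifest file. This is a sketch under stated assumptions: file_sha256 and write_manifest are hypothetical helpers, and the SHA-256 content hash is a stand-in for a data version, not DVC's own hashing.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Content hash of a file, read in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(path, params, data_path):
    """Bundle hyperparameters and the data hash into one JSON file."""
    manifest = {"params": params, "data_sha256": file_sha256(data_path)}
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw.csv"
    data.write_bytes(b"x,y\n1,2\n")                      # tiny illustrative dataset
    manifest = write_manifest(Path(tmp) / "run.json", {"lr": 0.05, "seed": 42}, data)
    reloaded = json.loads((Path(tmp) / "run.json").read_text())

print(reloaded == manifest)  # True
```

Saving the manifest in the same directory as the model file keeps the config attached to the artifact it produced.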
3. Unknown Problems (Hidden, Latent Issues)
Hard to debug and easy to miss. Often skipped because of the false belief that “if the determinism flags are set, we are done.”
3.1 Cascading column-order effect
Library: every GBDT — lightgbm, catboost, xgboost
Problem. Even with all determinism flags and seeds set, results still differ. The cause is upstream: the column order entering the model is non-deterministic, so tied information-gain splits are tie-broken differently, producing different trees. CatBoost’s RSM, LightGBM’s feature_fraction, and XGBoost’s colsample_bytree sample column indices, which makes them especially sensitive.
Code Before
# Upstream: stability-selection output
selected = sorted(stable_set, key=lambda f: scores[f], reverse=True)
# Tied scores resolve in set-iteration order — non-deterministic.
# Downstream: all determinism flags set
X_next = X[selected]
model = lgb.LGBMRegressor(deterministic=True, force_row_wise=True, random_state=42)
model.fit(X_next, y) # different column order ⇒ different model!
Code After
selected = sorted(stable_set, key=lambda f: (-scores[f], f)) # tie-break
Solution. Enforce deterministic column and row order through every stage. Use stage-by-stage pickle diffing (4.1) to find the first stage that diverges.
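One lightweight way to watch for order drift is to log an order-sensitive fingerprint of the column list at each stage boundary. A minimal sketch, assuming string column names; column_fingerprint is a hypothetical helper, not part of any library named above:

```python
import hashlib

def column_fingerprint(columns):
    """Order-sensitive short hash of a column list."""
    # The unit separator keeps ['ab', 'c'] distinct from ['a', 'bc'].
    joined = "\x1f".join(columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]

run_a = ["f_a", "f_b", "f_c"]
run_b = ["f_a", "f_c", "f_b"]   # same features, tie resolved differently

same_set = set(run_a) == set(run_b)
same_order = column_fingerprint(run_a) == column_fingerprint(run_b)

print(same_set, same_order)  # True False
```

Comparing the logged fingerprints of two runs shows at a glance which stage first received its columns in a different order.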
3.2 Floating-point precision
Library: Python core — Institute of Electrical and Electronics Engineers (IEEE) 754 binary64
Problem. Python float follows IEEE 754 binary64; decimal-to-binary conversion accumulates tiny rounding errors. 0.1 + 0.2 == 0.3 evaluates to False. Threshold comparisons, equality checks, and accumulated sums can flip decisions because of these errors.
Code Before
if score == threshold: # risky: Unit-in-the-Last-Place (ULP) drift can flip the decision
    keep = False
total = sum(values) # rounding error depends on the order of the values
Code After
import math
if math.isclose(score, threshold, rel_tol=1e-9):
    keep = False
# When precision matters, use the decimal module
from decimal import Decimal
total = sum(Decimal(str(v)) for v in values)
Solution. Replace exact equality with math.isclose or an explicit tolerance. Use decimal for finance-grade or precision-critical math. Design ML thresholds so feature scores rarely cluster near the boundary.
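A short, self-contained check of these points. It uses an explicit accumulation loop rather than the built-in sum(), because Python 3.12 and later apply compensated summation to floats inside sum():

```python
import math

exact_equal = (0.1 + 0.2 == 0.3)              # False: binary64 rounding
close_enough = math.isclose(0.1 + 0.2, 0.3)   # True with the default rel_tol=1e-9

values = [0.1] * 10
running = 0.0
for v in values:
    running += v                  # rounding error accumulates term by term
compensated = math.fsum(values)   # correctly rounded sum of the same values

# A relative tolerance alone never matches a nonzero value against 0.0.
near_zero_rel = math.isclose(1e-12, 0.0)                  # False
near_zero_abs = math.isclose(1e-12, 0.0, abs_tol=1e-9)    # True

print(running, compensated)  # 0.9999999999999999 1.0
```

math.fsum is a lighter alternative to decimal when the inputs are already floats and only the summation error matters.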
3.3 BLAS multi-thread reduction order
Library: numpy, scipy, pandas, scikit-learn (BLAS backends: MKL, OpenBLAS, Accelerate)
Problem. Reductions like X.T @ X, df.corr(), np.sum, and np.dot can accumulate partial results in a different order depending on thread count, BLAS build, or machine; whether a given call goes through a multi-threaded BLAS kernel depends on the library and version. Floating-point addition is non-associative ($(a+b)+c \neq a+(b+c)$), so the result drifts by a few ULP. Threshold branches such as corr > 0.95 can flip.
Code Before
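import numpy as np
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)) # skip the diagonal, which is always 1.0
drop = upper.columns[(upper > self.threshold).any()]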
Code After
import os
# Set before importing numpy/pandas; the BLAS thread pools read these at load time
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
import numpy as np
# Or add a margin to the threshold comparison (upper triangle only; the diagonal is always 1.0)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop = upper.columns[(upper > self.threshold + 1e-12).any()]
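Solution. Force BLAS to a single thread only where determinism is critical (the performance cost is real). Alternatively, design the algorithm so threshold ties are rare. An epsilon margin shifts the decision boundary rather than removing it, so it helps only when the contested values sit exactly at the nominal threshold.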
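The threshold flip can be reproduced without any threading, simply by grouping the same three additions differently. A minimal sketch; the two groupings stand in for two orders in which partial sums might be combined:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c     # one combination order
right = a + (b + c)    # another order, same inputs

threshold = 0.6
flag_left = left > threshold
flag_right = right > threshold

print(left, right)            # 0.6000000000000001 0.6
print(flag_left, flag_right)  # True False
```

The two results differ by one ULP, yet the comparison gives opposite answers, which is exactly how a feature can be dropped in one run and kept in the next.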
4. Reproducibility Diagnostics
Techniques to locate where non-determinism enters the pipeline once a reproducibility failure has occurred.
4.1 Stage-by-stage pickle diff
When to use. Two runs with identical input produce different final outputs, and you don’t know at which stage the non-determinism crept in.
Why byte-level diff (fc /b, cmp, Compare-Object) is not enough. Byte comparison only tells you whether two files are identical or not. It gives no clue about where they diverged. Non-determinism typically starts at one stage and cascades downstream (see 3.1), so the goal is to find that first divergence point. Comparing only the final output forces a binary search across the whole pipeline. Worse, ML pipeline stages are usually trained transformer objects, so meaningful diagnosis must compare them as objects, not as raw bytes.
Why pickle comparison is more powerful. When each pipeline stage (FeatureCleaner, Selector, Encoder, Model, etc.) saves its fit result to a .pkl file, that file contains the trained state — attributes such as selected_features_, kept_cols_, drop_corr_cols_, categories_. Unpickling the same stage from two runs and comparing those attributes tells you whether that stage’s learned output is deterministic. Walking from stage 1 onward, the first stage whose output diverges is the source of the non-determinism.
How to compare. A two-step check (set → list) reveals not only divergence but also its nature:
import pickle
with open('run_A/stage_5.pkl', 'rb') as fa, open('run_B/stage_5.pkl', 'rb') as fb:
    a, b = pickle.load(fa), pickle.load(fb)
sa, sb = set(a.selected_), set(b.selected_)
print(f'set equal : {sa == sb}') # same elements?
print(f'list equal: {a.selected_ == b.selected_}') # same order too?
| set equal | list equal | Diagnosis up to that stage |
|---|---|---|
| True | True | Deterministic ✅ |
| True | False | Same elements, different order — list(set) or sorted tie-break trap (1.1, 1.2) |
| False | False | Elements themselves differ — cascade started in an earlier stage (3.1) |
Caveats:
- Not a sufficient condition. “Deterministic up to that stage” only. If a later stage uses non-deterministic logic, the cascade restarts there. Full-pipeline determinism requires every stage to read (True, True).
- Prerequisite. The technique only works if the pipeline serializes per-stage artifacts. It is worth designing pipelines to dump each stage as a .pkl, which is also useful for partial-resume and other operational tasks beyond debugging.
- Environment alignment. Both runs must use the same library versions; otherwise unpickling itself can fail or yield differently shaped attributes (see 2.3).
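The stage-by-stage walk can be automated along the following lines. This is a sketch, not a definitive tool: diagnose and first_divergence are hypothetical helpers, the stage_N.pkl layout and the selected_ attribute follow the example above, and SimpleNamespace objects stand in for fitted transformers. Only unpickle files from runs you trust.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace

def diagnose(path_a, path_b, attr):
    """Return (set_equal, list_equal) for one fitted attribute of one stage."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = pickle.load(fa), pickle.load(fb)
    va, vb = list(getattr(a, attr)), list(getattr(b, attr))
    return set(va) == set(vb), va == vb

def first_divergence(run_a, run_b, stages):
    """Walk stages in pipeline order; return the first that is not (True, True)."""
    for filename, attr in stages:
        result = diagnose(Path(run_a) / filename, Path(run_b) / filename, attr)
        if result != (True, True):
            return filename, result
    return None

# Demo: two fake runs whose second stage returns the same features in a different order.
with tempfile.TemporaryDirectory() as tmp:
    run_a, run_b = Path(tmp) / "run_A", Path(tmp) / "run_B"
    fitted = {
        run_a: [["x1", "x2", "x3"], ["x1", "x3"]],
        run_b: [["x1", "x2", "x3"], ["x3", "x1"]],
    }
    for run_dir, stage_outputs in fitted.items():
        run_dir.mkdir()
        for i, selected in enumerate(stage_outputs, start=1):
            with open(run_dir / f"stage_{i}.pkl", "wb") as fh:
                pickle.dump(SimpleNamespace(selected_=selected), fh)

    stages = [("stage_1.pkl", "selected_"), ("stage_2.pkl", "selected_")]
    culprit = first_divergence(run_a, run_b, stages)

print(culprit)  # ('stage_2.pkl', (True, False))
```

In a real pipeline the stages list would name each stage's file and the fitted attribute worth comparing, in execution order.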
