{"id":6637,"date":"2026-05-08T09:32:31","date_gmt":"2026-05-08T14:32:31","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=6637"},"modified":"2026-05-08T15:21:02","modified_gmt":"2026-05-08T20:21:02","slug":"python-ml-pipeline-reproducibility-field-notes","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/python-ml-pipeline-reproducibility-field-notes-6637\/","title":{"rendered":"Python ML Pipeline Reproducibility \u2014 Field Notes"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"750\" src=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/Lowes-DO-IT-RIGHT-1000x750px.jpg\" alt=\"\" class=\"wp-image-6642\" style=\"width:600px\" srcset=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/Lowes-DO-IT-RIGHT-1000x750px.jpg 1000w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/Lowes-DO-IT-RIGHT-1000x750px-300x225.jpg 300w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/Lowes-DO-IT-RIGHT-1000x750px-768x576.jpg 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n<style>.kadence-column6637_3324c7-92 > .kt-inside-inner-col,.kadence-column6637_3324c7-92 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_3324c7-92 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_3324c7-92 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_3324c7-92 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_3324c7-92 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_3324c7-92{position:relative;}.kadence-column6637_3324c7-92, .kt-inside-inner-col > 
.kadence-column6637_3324c7-92:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_3324c7-92 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_3324c7-92 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_3324c7-92\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003This document classifies reproducibility problems in Python Machine Learning (ML) pipelines into three chapters, plus a fourth chapter on diagnostic techniques:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Known problems (Observed, Identified issues)<\/strong> \u2014 already documented in standard guides and library references.<\/li>\n\n\n\n<li><strong>Potential problems (Anticipated, Emerging issues)<\/strong> \u2014 latent risks that surface without process governance.<\/li>\n\n\n\n<li><strong>Unknown problems (Hidden, Latent issues)<\/strong> \u2014 subtle traps that standard guides rarely cover.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003This classification aligns well with the Six Sigma philosophy and toolset, and is the methodology used in semiconductor process and product development.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. 
Known Problems (Observed, Identified Issues)<\/h2>\n\n\n<style>.kadence-column6637_a231d6-2b > .kt-inside-inner-col,.kadence-column6637_a231d6-2b > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_a231d6-2b > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_a231d6-2b > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_a231d6-2b > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_a231d6-2b > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_a231d6-2b{position:relative;}.kadence-column6637_a231d6-2b, .kt-inside-inner-col > .kadence-column6637_a231d6-2b:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_a231d6-2b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_a231d6-2b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_a231d6-2b\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003Determinism traps recognized by standard libraries and the community. 
Most are fixed with one or two lines of code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.1 Direct set \u2192 list conversion<\/h3>\n\n\n<style>.kadence-column6637_b8c037-33 > .kt-inside-inner-col,.kadence-column6637_b8c037-33 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_b8c037-33 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_b8c037-33 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_b8c037-33 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_b8c037-33 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_b8c037-33{position:relative;}.kadence-column6637_b8c037-33, .kt-inside-inner-col > .kadence-column6637_b8c037-33:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_b8c037-33 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_b8c037-33 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_b8c037-33\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>builtins<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Unless <code>PYTHONHASHSEED<\/code> is fixed, string hashing is randomized per process, so string <code>hash()<\/code> values change every run and set iteration order becomes non-deterministic. Dicts keep insertion order (Python 3.7+), but a dict filled by iterating a set inherits that set&#8217;s order. 
When that order decides the next stage&#8217;s column or row order, the non-determinism cascades downstream.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nselected = list(set(candidate_cols) - set(drop_cols))\nresult = list(my_set)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# (a) Alphabetical: generic dedup\nselected = sorted(set(candidate_cols) - set(drop_cols))\n# (b) Preserve input order: when column meaning matters\ndrop_set = set(drop_cols)   # set used only for membership tests, never iterated\nselected = &#x5B;c for c in candidate_cols if c not in drop_set]\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Grep for <code>list(set(...))<\/code> and <code>list(&lt;set_var&gt;)<\/code> patterns; replace with <code>sorted()<\/code> or an order-preserving list comprehension. Harmless when the order is irrelevant (e.g. 
drop targets), but mandatory wherever the result feeds the next stage&#8217;s input order.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.2 sorted() tie-break<\/h3>\n\n\n<style>.kadence-column6637_066823-95 > .kt-inside-inner-col,.kadence-column6637_066823-95 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_066823-95 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_066823-95 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_066823-95 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_066823-95 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_066823-95{position:relative;}.kadence-column6637_066823-95, .kt-inside-inner-col > .kadence-column6637_066823-95:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_066823-95 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_066823-95 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_066823-95\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>builtins<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Python <code>sorted()<\/code> is stable: tied keys keep their input order. If the input is a set, or a dict whose keys were inserted in set order, the relative order of tied items becomes non-deterministic. 
This often hits in stability selection or multi-trial score averaging where ties are common.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nranked = sorted(features_set, key=lambda f: scores&#x5B;f], reverse=True)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nranked = sorted(features_set, key=lambda f: (-scores&#x5B;f], f))\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Add a deterministic secondary key (name, identifier) to the tuple key. Replace <code>reverse=True<\/code> with sign inversion (<code>-score<\/code>) so the primary key descends while the secondary ascends.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.3 LightGBM multi-thread determinism<\/h3>\n\n\n<style>.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col,.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_fb1be5-c3{position:relative;}.kadence-column6637_fb1be5-c3, .kt-inside-inner-col > .kadence-column6637_fb1be5-c3:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 
767px){.kadence-column6637_fb1be5-c3 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_fb1be5-c3\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>lightgbm<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. With default settings, multi-threaded training drifts slightly across runs even with identical data and seed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport lightgbm as lgb\nmodel = lgb.LGBMRegressor(n_estimators=200, random_state=42)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nmodel = lgb.LGBMRegressor(\n    n_estimators=200,\n    random_state=42,\n    deterministic=True,\n    force_row_wise=True,    # or force_col_wise=True\n    verbose=-1,\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Both <code>deterministic=True<\/code> and <code>force_row_wise=True<\/code> (or <code>force_col_wise<\/code>) are required. 
The LightGBM documentation advises setting <code>force_row_wise<\/code> or <code>force_col_wise<\/code> whenever <code>deterministic=True<\/code> is used, to avoid numerical instability; results can still differ across LightGBM versions, compilers, and systems.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.4 CatBoost Random Subspace Method (RSM) and seed<\/h3>\n\n\n<style>.kadence-column6637_69a0fe-68 > .kt-inside-inner-col,.kadence-column6637_69a0fe-68 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_69a0fe-68 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_69a0fe-68 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_69a0fe-68 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_69a0fe-68 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_69a0fe-68{position:relative;}.kadence-column6637_69a0fe-68, .kt-inside-inner-col > .kadence-column6637_69a0fe-68:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_69a0fe-68 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_69a0fe-68 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_69a0fe-68\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>catboost<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Without <code>random_seed<\/code>, every run differs. With Random Subspace Method (RSM) enabled (<code>rsm &lt; 1<\/code>), CatBoost samples column indices, but the index \u2192 real-column mapping depends on input column order. 
If upstream column order changes, RSM trains on a different feature subset (cascade risk \u2014 see 3.1).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom catboost import CatBoostRegressor\nmodel = CatBoostRegressor(iterations=800, rsm=0.5)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nmodel = CatBoostRegressor(\n    iterations=800, rsm=0.5,\n    random_seed=42, verbose=0,\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Set <code>random_seed<\/code> explicitly and preserve column order across the entire pipeline.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.5 sklearn LassoCV \/ KFold<\/h3>\n\n\n<style>.kadence-column6637_7a610d-12 > .kt-inside-inner-col,.kadence-column6637_7a610d-12 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_7a610d-12 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_7a610d-12 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_7a610d-12 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_7a610d-12 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_7a610d-12{position:relative;}.kadence-column6637_7a610d-12, .kt-inside-inner-col > .kadence-column6637_7a610d-12:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_7a610d-12 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_7a610d-12 > 
.kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_7a610d-12\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>sklearn.linear_model<\/code>, <code>sklearn.model_selection<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. An integer <code>cv<\/code> builds unshuffled folds, so the split follows row order and inherits any upstream ordering drift. With <code>shuffle=True<\/code> and no <code>random_state<\/code>, fold splits are non-deterministic.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.linear_model import LassoCV\nlasso = LassoCV(cv=5, n_jobs=-1)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.linear_model import LassoCV\nfrom sklearn.model_selection import KFold\nlasso = LassoCV(\n    cv=KFold(5, shuffle=True, random_state=42),   # seeded, explicit splitter\n    n_jobs=-1, random_state=42, max_iter=10000,\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Always set <code>random_state<\/code> and pass an explicit splitter object as <code>cv<\/code>. 
Apply the same rule to <code>train_test_split<\/code>, <code>StratifiedKFold<\/code>, and any other split helper.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.6 XGBoost<\/h3>\n\n\n<style>.kadence-column6637_c73d38-2a > .kt-inside-inner-col,.kadence-column6637_c73d38-2a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_c73d38-2a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_c73d38-2a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_c73d38-2a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_c73d38-2a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_c73d38-2a{position:relative;}.kadence-column6637_c73d38-2a, .kt-inside-inner-col > .kadence-column6637_c73d38-2a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_c73d38-2a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_c73d38-2a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_c73d38-2a\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>xgboost<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Omitting <code>random_state<\/code> leaves the seed to XGBoost&#8217;s implicit default, so reproducibility rests on an unstated library default rather than on the pipeline&#8217;s own configuration. 
Graphics Processing Unit (GPU) training (<code>gpu_hist<\/code> in older releases, <code>device='cuda'<\/code> from XGBoost 2.0) can give slightly different results between runs because the floating-point summation order is not fixed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport xgboost as xgb\nmodel = xgb.XGBRegressor(n_estimators=500, tree_method=&#039;hist&#039;)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport xgboost as xgb\nmodel = xgb.XGBRegressor(\n    n_estimators=500,\n    tree_method=&#039;hist&#039;,\n    random_state=42,\n    n_jobs=-1,\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Use Central Processing Unit (CPU) <code>tree_method='hist'<\/code> with an explicit <code>random_state<\/code>. If GPU is required, accept small run-to-run differences or build a separate validation pipeline.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.7 pandas categorical dtype<\/h3>\n\n\n<style>.kadence-column6637_a7058e-d7 > .kt-inside-inner-col,.kadence-column6637_a7058e-d7 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_a7058e-d7 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_a7058e-d7 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_a7058e-d7 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_a7058e-d7 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_a7058e-d7{position:relative;}.kadence-column6637_a7058e-d7, .kt-inside-inner-col > .kadence-column6637_a7058e-d7:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_a7058e-d7 > 
.kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_a7058e-d7 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_a7058e-d7\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>pandas<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. <code>astype('category')<\/code> infers the categories from the values present: sorted when the values are sortable, otherwise in order of first appearance. If the observed values differ between runs or between training and inference, or the values are unsortable and row order shifts, category codes shift, and Gradient Boosted Decision Tree (GBDT) models learn different splits.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndf&#x5B;col] = df&#x5B;col].astype(&#039;category&#039;)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\n\n# Explicit category list: the code assignment no longer depends on inference rules\ndf&#x5B;col] = pd.Categorical(\n    df&#x5B;col],\n    categories=sorted(df&#x5B;col].dropna().unique()),\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. 
Pass a deterministically-sorted unique list to the <code>categories<\/code> argument.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.8 File loading order<\/h3>\n\n\n<style>.kadence-column6637_c918e5-16 > .kt-inside-inner-col,.kadence-column6637_c918e5-16 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_c918e5-16 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_c918e5-16 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_c918e5-16 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_c918e5-16 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_c918e5-16{position:relative;}.kadence-column6637_c918e5-16, .kt-inside-inner-col > .kadence-column6637_c918e5-16:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_c918e5-16 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_c918e5-16 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_c918e5-16\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>pathlib<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. <code>Path.glob()<\/code> order is filesystem-dependent. 
Usually stable on the same machine with the same files, but it can shift across platforms or when files are added or mounts change.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfiles = list(folder.glob(&#039;*.parquet&#039;))\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfiles = sorted(folder.glob(&#039;*.parquet&#039;))\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Wrap every file iteration in <code>sorted()<\/code>.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">1.9 NumPy global Random Number Generator (RNG)<\/h3>\n\n\n<style>.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col,.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_b19c0a-b0{position:relative;}.kadence-column6637_b19c0a-b0, .kt-inside-inner-col > .kadence-column6637_b19c0a-b0:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_b19c0a-b0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column 
kadence-column6637_b19c0a-b0\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>numpy<\/code>, <code>random<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Calls like <code>np.random.rand()<\/code> or <code>random.choice()<\/code> hit the global RNG; without a seed they vary every run. Global seeds are also fragile because any dependency can reseed mid-pipeline.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport numpy as np\nsample_idx = np.random.choice(n, size=k)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nrng = np.random.default_rng(42)\nsample_idx = rng.choice(n, size=k)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Inject an explicit <code>Generator<\/code> instead of touching the global RNG. If unavoidable, call <code>np.random.seed(N)<\/code> and <code>random.seed(N)<\/code> right at process start.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
Potential Problems (Anticipated, Emerging Issues)<\/h2>\n\n\n<style>.kadence-column6637_3685af-23 > .kt-inside-inner-col,.kadence-column6637_3685af-23 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_3685af-23 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_3685af-23 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_3685af-23 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_3685af-23 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_3685af-23{position:relative;}.kadence-column6637_3685af-23, .kt-inside-inner-col > .kadence-column6637_3685af-23:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_3685af-23 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_3685af-23 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_3685af-23\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003Areas to control by governance and process before failures surface. 
These cannot be fixed with a one-line edit; they require team rules, infrastructure, and tooling.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2.1 Code-level control (Seed Management)<\/h3>\n\n\n<style>.kadence-column6637_75d729-16 > .kt-inside-inner-col,.kadence-column6637_75d729-16 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_75d729-16 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_75d729-16 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_75d729-16 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_75d729-16 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_75d729-16{position:relative;}.kadence-column6637_75d729-16, .kt-inside-inner-col > .kadence-column6637_75d729-16:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_75d729-16 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_75d729-16 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_75d729-16\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Without a unified seed entry point, only some RNGs become deterministic. 
GPU operations via the CUDA Deep Neural Network library (cuDNN) sometimes use non-deterministic algorithms for performance and must be forced into deterministic mode.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ntorch.manual_seed(42)\n# Only the framework seed is set; global RNG and cuDNN are unguarded.\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport os\nimport random\n\nimport numpy as np\nimport torch\n\ndef set_seed(seed: int = 42):\n    # Affects child processes only; this interpreter&#039;s hash seed was fixed at startup.\n    os.environ&#x5B;&#039;PYTHONHASHSEED&#039;] = str(seed)\n    # Needed by some cuBLAS ops in deterministic mode (CUDA 10.2+).\n    os.environ.setdefault(&#039;CUBLAS_WORKSPACE_CONFIG&#039;, &#039;:4096:8&#039;)\n    random.seed(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    torch.cuda.manual_seed_all(seed)\n    torch.backends.cudnn.deterministic = True\n    torch.backends.cudnn.benchmark = False\n    torch.use_deterministic_algorithms(True)   # raise on non-deterministic ops\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Call <code>set_seed<\/code> at every entry point (training and inference launchers), and export <code>PYTHONHASHSEED<\/code> in the launching shell so the current interpreter is covered too. 
<code>torch.use_deterministic_algorithms(True)<\/code> promotes latent issues to runtime errors as soon as a non-deterministic op is invoked.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2.2 Data-level control (Data Lineage)<\/h3>\n\n\n<style>.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col,.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_9b1fb6-68{position:relative;}.kadence-column6637_9b1fb6-68, .kt-inside-inner-col > .kadence-column6637_9b1fb6-68:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_9b1fb6-68 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_9b1fb6-68\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Identical code can produce different results when the data drifts. Without snapshots and lineage of datasets and preprocessing artifacts, reproduction is impossible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndf = pd.read_csv(&#039;data\/raw.csv&#039;)        # which raw? 
from when?\nX_train, X_test, y_train, y_test = train_test_split(X, y) # no seed\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n# Version the data with Data Version Control (DVC)\ndvc add data\/raw.csv\ngit add data\/raw.csv.dvc data\/.gitignore\n<\/pre><\/div>\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport joblib\nimport pandas as pd\nfrom sklearn.model_selection import train_test_split\ndf = pd.read_csv(&#039;data\/raw.csv&#039;)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True)\njoblib.dump(preprocess_pipeline, &#039;artifacts\/preprocess.pkl&#039;)   # serialize preprocessing\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Snapshot data with Data Version Control (DVC), LakeFS, or Delta Lake; fix <code>random_state<\/code> on every split; and serialize the preprocessing pipeline (<code>joblib<\/code>, <code>pickle<\/code>) alongside model artifacts.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2.3 Environment-level control (Environment Isolation)<\/h3>\n\n\n<style>.kadence-column6637_fea1b2-fd > .kt-inside-inner-col,.kadence-column6637_fea1b2-fd > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_fea1b2-fd > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_fea1b2-fd > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_fea1b2-fd > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_fea1b2-fd > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_fea1b2-fd{position:relative;}.kadence-column6637_fea1b2-fd, .kt-inside-inner-col > .kadence-column6637_fea1b2-fd:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 
1024px){.kadence-column6637_fea1b2-fd > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_fea1b2-fd > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_fea1b2-fd\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. The classic &#8220;works on my machine&#8221; trap. Operating System (OS), GPU drivers, and library versions can shift results. Even minor NumPy \/ Basic Linear Algebra Subprograms (BLAS) \/ Math Kernel Library (MKL) updates can change floating-point outputs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\npip install lightgbm catboost   # versions floating; latest each time\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\npip freeze &gt; requirements.txt          # pinned versions\n# or Poetry \/ Conda\nconda env export --no-builds &gt; environment.yml\n<\/pre><\/div>\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n# Dockerfile: same training and inference environment\nFROM python:3.10.14-slim\nCOPY requirements.txt .\nRUN pip install --no-deps -r requirements.txt\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Pin versions (Poetry or Conda lockfile) and build Docker images. 
Reusing the exact training image at inference time is the strongest guarantee.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2.4 Experiment management (MLOps and Tracking)<\/h3>\n\n\n<style>.kadence-column6637_38c7ce-5a > .kt-inside-inner-col,.kadence-column6637_38c7ce-5a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_38c7ce-5a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_38c7ce-5a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_38c7ce-5a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_38c7ce-5a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_38c7ce-5a{position:relative;}.kadence-column6637_38c7ce-5a, .kt-inside-inner-col > .kadence-column6637_38c7ce-5a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_38c7ce-5a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_38c7ce-5a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_38c7ce-5a\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. 
&#8220;It worked this week, but I can&#8217;t reproduce it next week.&#8221; Without explicit tracking of code, hyperparameters, and data versions, there is no audit trail.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nmodel.fit(X, y)\njoblib.dump(model, f&#039;model_{date.today()}.pkl&#039;)   # config detached, no git sha\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport mlflow, subprocess\n# dvc_hash and cv_r2 are placeholders: a project helper that returns the DVC hash\n# of a tracked file, and a cross-validation score computed elsewhere.\nwith mlflow.start_run():\n    mlflow.log_params({&#039;lr&#039;: 0.05, &#039;n_estimators&#039;: 800, &#039;seed&#039;: 42})\n    mlflow.log_param(\n        &#039;git_sha&#039;,\n        subprocess.check_output(&#x5B;&#039;git&#039;, &#039;rev-parse&#039;, &#039;HEAD&#039;]).decode().strip(),\n    )\n    mlflow.log_param(&#039;data_version&#039;, dvc_hash(&#039;data\/raw.csv&#039;))\n    model.fit(X, y)\n    mlflow.sklearn.log_model(model, &#039;model&#039;)\n    mlflow.log_metric(&#039;cv_r2&#039;, cv_r2)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Use MLflow, Weights &amp; Biases (W&amp;B), or Neptune to log code version (git SHA), hyperparameters, data version (DVC hash), and metrics. Bundle the model artifact and its config together.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Unknown Problems (Hidden, Latent Issues)<\/h2>\n\n\n<style>.kadence-column6637_841c98-35 > .kt-inside-inner-col,.kadence-column6637_841c98-35 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_841c98-35 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_841c98-35 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_841c98-35 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_841c98-35 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_841c98-35{position:relative;}.kadence-column6637_841c98-35, .kt-inside-inner-col > .kadence-column6637_841c98-35:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_841c98-35 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_841c98-35 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_841c98-35\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003Hard to debug and easy to miss. 
Often skipped because of the false belief that &#8220;if the determinism flags are set, we are done.&#8221;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.1 Cascading column-order effect<\/h3>\n\n\n<style>.kadence-column6637_da4993-26 > .kt-inside-inner-col,.kadence-column6637_da4993-26 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_da4993-26 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_da4993-26 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_da4993-26 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_da4993-26 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_da4993-26{position:relative;}.kadence-column6637_da4993-26, .kt-inside-inner-col > .kadence-column6637_da4993-26:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_da4993-26 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_da4993-26 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_da4993-26\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: every GBDT \u2014 <code>lightgbm<\/code>, <code>catboost<\/code>, <code>xgboost<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Even with all determinism flags and seeds set, results still differ. The cause is upstream: the column order entering the model is non-deterministic, so tied information-gain splits are tie-broken differently, producing different trees. 
CatBoost&#8217;s <code>rsm<\/code> (random subspace method), LightGBM&#8217;s <code>feature_fraction<\/code>, and XGBoost&#8217;s <code>colsample_bytree<\/code> sample columns by index, so a reordered input selects different features even with the same seed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Upstream: stability-selection output\nselected = sorted(stable_set, key=lambda f: scores&#x5B;f], reverse=True)\n# Tied scores resolve in set-iteration order, which is non-deterministic.\n\n# Downstream: all determinism flags set\nX_next = X&#x5B;selected]\nmodel = lgb.LGBMRegressor(deterministic=True, force_row_wise=True, random_state=42)\nmodel.fit(X_next, y)   # different column order \u21d2 different model!\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nselected = sorted(stable_set, key=lambda f: (-scores&#x5B;f], f))   # ties broken by feature name\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Enforce deterministic column and row order through every stage. 
Use stage-by-stage pickle diffing (4.1) to find the first stage that diverges.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">3.2 Floating-point precision<\/h3>\n\n\n<style>.kadence-column6637_f41c24-2e > .kt-inside-inner-col,.kadence-column6637_f41c24-2e > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_f41c24-2e > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_f41c24-2e > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_f41c24-2e > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_f41c24-2e > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_f41c24-2e{position:relative;}.kadence-column6637_f41c24-2e, .kt-inside-inner-col > .kadence-column6637_f41c24-2e:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_f41c24-2e > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_f41c24-2e > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_f41c24-2e\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: Python core \u2014 Institute of Electrical and Electronics Engineers (IEEE) 754 binary64<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Python <code>float<\/code> follows IEEE 754 binary64; decimal-to-binary conversion accumulates tiny rounding errors. <code>0.1 + 0.2 == 0.3<\/code> evaluates to <code>False<\/code>. 
Threshold comparisons, equality checks, and accumulated sums can flip decisions because of these errors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nif score == threshold:    # risky: Unit-in-the-Last-Place (ULP) drift can flip the decision\n    keep = False\ntotal = sum(values)       # float result depends on the order of values\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport math\n# abs_tol is needed when the values can be at or near 0, where rel_tol alone never matches\nif math.isclose(score, threshold, rel_tol=1e-9, abs_tol=1e-12):\n    keep = False\n# When precision matters, use the decimal module\nfrom decimal import Decimal\ntotal = sum(Decimal(str(v)) for v in values)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Replace exact equality with <code>math.isclose<\/code> or an explicit tolerance. Use <code>decimal<\/code> for finance-grade or precision-critical math. 
Design ML thresholds so feature scores rarely cluster near the boundary.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">3.3 BLAS multi-thread reduction order<\/h3>\n\n\n<style>.kadence-column6637_e3d445-bb > .kt-inside-inner-col,.kadence-column6637_e3d445-bb > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_e3d445-bb > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_e3d445-bb > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_e3d445-bb > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_e3d445-bb > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_e3d445-bb{position:relative;}.kadence-column6637_e3d445-bb, .kt-inside-inner-col > .kadence-column6637_e3d445-bb:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_e3d445-bb > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_e3d445-bb > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_e3d445-bb\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>Library<\/strong>: <code>numpy<\/code>, <code>scipy<\/code>, <code>pandas<\/code>, <code>scikit-learn<\/code> (BLAS backends: MKL, OpenBLAS, Accelerate)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Problem<\/strong>. Reductions like <code>X.T @ X<\/code>, <code>df.corr()<\/code>, <code>np.sum<\/code>, and <code>np.dot<\/code> may sum partial results from threads in different orders each run. Floating-point addition is non-associative ($(a+b)+c \\neq a+(b+c)$), so the result drifts by a few ULP. 
Threshold branches such as <code>corr &gt; 0.95<\/code> can flip.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code Before<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport numpy as np\ncorr = X.corr().abs()\nupper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # mask the diagonal (always 1.0) and mirrored pairs\ndrop = upper.columns&#x5B;(upper &gt; self.threshold).any()]\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Code After<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport os\n# Set before importing numpy, scipy, pandas, or scikit-learn; the BLAS thread pool is sized at load time\nos.environ&#x5B;&#039;MKL_NUM_THREADS&#039;] = &#039;1&#039;\nos.environ&#x5B;&#039;OPENBLAS_NUM_THREADS&#039;] = &#039;1&#039;\nos.environ&#x5B;&#039;OMP_NUM_THREADS&#039;] = &#039;1&#039;\n# Or add a margin to the threshold comparison\ndrop = upper.columns&#x5B;(upper &gt; self.threshold + 1e-12).any()]\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Solution<\/strong>. Force BLAS to a single thread only where determinism is critical (the performance cost is real). Alternatively, design the algorithm so threshold ties are rare (e.g. add an epsilon margin).<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Reproducibility Diagnostics<\/h2>\n\n\n<style>.kadence-column6637_41f64f-f3 > .kt-inside-inner-col,.kadence-column6637_41f64f-f3 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_41f64f-f3 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_41f64f-f3 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_41f64f-f3 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_41f64f-f3 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_41f64f-f3{position:relative;}.kadence-column6637_41f64f-f3, .kt-inside-inner-col > .kadence-column6637_41f64f-f3:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_41f64f-f3 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_41f64f-f3 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_41f64f-f3\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003Techniques to locate where non-determinism enters the pipeline once a reproducibility failure has occurred.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.1 Stage-by-stage pickle diff<\/h3>\n\n\n<style>.kadence-column6637_28e024-80 > .kt-inside-inner-col,.kadence-column6637_28e024-80 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6637_28e024-80 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6637_28e024-80 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6637_28e024-80 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6637_28e024-80 > 
.kt-inside-inner-col:before{opacity:0.3;}.kadence-column6637_28e024-80{position:relative;}.kadence-column6637_28e024-80, .kt-inside-inner-col > .kadence-column6637_28e024-80:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6637_28e024-80 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6637_28e024-80 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6637_28e024-80\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">\u2003<strong>When to use<\/strong>. Two runs with identical input produce different final outputs, and you don&#8217;t know at which stage the non-determinism crept in.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Why byte-level diff (<code>fc \/b<\/code>, <code>cmp<\/code>, <code>Compare-Object<\/code>) is not enough<\/strong>. Byte comparison only tells you whether two files are identical or not. It gives no clue about <em>where<\/em> they diverged. Non-determinism typically starts at one stage and cascades downstream (see 3.1), so the goal is to find that first divergence point. Comparing only the final output forces a binary search across the whole pipeline. Worse, ML pipeline stages are usually trained transformer objects, so meaningful diagnosis must compare them <em>as objects<\/em>, not as raw bytes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Why pickle comparison is more powerful<\/strong>. When each pipeline stage (FeatureCleaner, Selector, Encoder, Model, etc.) saves its <code>fit<\/code> result to a <code>.pkl<\/code> file, that file contains the trained state \u2014 attributes such as <code>selected_features_<\/code>, <code>kept_cols_<\/code>, <code>drop_corr_cols_<\/code>, <code>categories_<\/code>. 
Unpickling the same stage from two runs and comparing those attributes tells you whether <em>that stage&#8217;s learned output is deterministic<\/em>. Walking from stage 1 onward, the first stage whose output diverges is the source of the non-determinism.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>How to compare<\/strong>. A two-step check (set, then list) reveals not only divergence but also its nature:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pickle\nwith open(&#039;run_A\/stage_5.pkl&#039;, &#039;rb&#039;) as fa, open(&#039;run_B\/stage_5.pkl&#039;, &#039;rb&#039;) as fb:\n    a, b = pickle.load(fa), pickle.load(fb)\nla, lb = list(a.selected_), list(b.selected_)   # plain lists, so == yields a single bool\nprint(f&#039;set equal : {set(la) == set(lb)}&#039;)      # same elements?\nprint(f&#039;list equal: {la == lb}&#039;)                # same order too?\n<\/pre><\/div>\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>set equal<\/th><th>list equal<\/th><th>Diagnosis up to that stage<\/th><\/tr><\/thead><tbody><tr><td>True<\/td><td>True<\/td><td>Deterministic \u2705<\/td><\/tr><tr><td>True<\/td><td>False<\/td><td>Same elements, different order: <code>list(set)<\/code> or <code>sorted<\/code> tie-break trap (1.1, 1.2)<\/td><\/tr><tr><td>False<\/td><td>False<\/td><td>Elements themselves differ: cascade started in an earlier stage (3.1)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">\u2003<strong>Caveats<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Not a sufficient condition<\/strong>. A (True, True) result shows determinism only up to that stage. If a later stage uses non-deterministic logic, the cascade restarts there. Full-pipeline determinism requires every stage to read (True, True).<\/li>\n\n\n\n<li><strong>Prerequisite<\/strong>. The technique only works if the pipeline serializes per-stage artifacts. 
It is worth designing pipelines to dump each stage as a <code>.pkl<\/code> \u2014 useful for partial-resume and other operational tasks beyond debugging.<\/li>\n\n\n\n<li><strong>Environment alignment<\/strong>. Both runs must use the same library versions; otherwise unpickling itself can fail or yield differently shaped attributes (see 2.3).<\/li>\n<\/ul>\n<\/div><\/div>\n<\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Introduction \u2003This document classifies reproducibility problems in Python Machine Learning (ML) pipelines into three chapters, plus a fourth chapter on diagnostic techniques: \u2003This classification aligns well with the Six Sigma philosophy and toolset, and is the methodology used in semiconductor process and product development. 1. Known Problems (Observed, Identified Issues) \u2003Determinism traps recognized by 
standard&#8230;<\/p>\n","protected":false},"author":4,"featured_media":6642,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":3,"footnotes":""},"categories":[56,376],"tags":[],"class_list":["post-6637","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science-slug","category-pipeline-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":1,"sum_votes":4},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/Lowes-DO-IT-RIGHT-1000x750px.jpg","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6637","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=6637"}],"version-history":[{"count":3,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/post
s\/6637\/revisions"}],"predecessor-version":[{"id":6643,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6637\/revisions\/6643"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/6642"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=6637"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=6637"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=6637"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}