{"id":6292,"date":"2026-04-16T22:55:55","date_gmt":"2026-04-17T03:55:55","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=6292"},"modified":"2026-04-17T01:15:06","modified_gmt":"2026-04-17T06:15:06","slug":"one-hot-encoding","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/one-hot-encoding-6292\/","title":{"rendered":"One-Hot Encoding Pitfalls and Countermeasures"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">1. What is One-Hot Encoding?<\/h2>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"400\" height=\"300\" src=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/one-hot-encoding-ACDB.png\" alt=\"\" class=\"wp-image-6293\" style=\"width:600px;height:auto\" srcset=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/one-hot-encoding-ACDB.png 400w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/one-hot-encoding-ACDB-300x225.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/figure>\n\n\n<style>.kadence-column6292_447142-89 > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_447142-89 > .kt-inside-inner-col,.kadence-column6292_447142-89 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_447142-89 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_447142-89 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_447142-89 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_447142-89 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_447142-89{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_447142-89 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_447142-89 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_447142-89\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">One-hot encoding is the most fundamental technique for converting categorical 
variables into numerical vectors that machine learning models can process. Given N unique categories, each category is represented as an N-dimensional vector with a <code>1<\/code> at the position corresponding to that category and <code>0<\/code> elsewhere.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Example: <code>['Seoul', 'Busan', 'Daegu']<\/code> \u2192 Seoul = [1,0,0], Busan = [0,1,0], Daegu = [0,0,1]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Its advantages are that it imposes no arbitrary ordering on the categories and gives each category its own dimension, so no category is treated as numerically closer to another. It remains the default starting point for low-cardinality categorical variables.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">2. Limitations of One-Hot Encoding<\/h2>\n\n\n<style>.kadence-column6292_74de80-29 > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_74de80-29 > .kt-inside-inner-col,.kadence-column6292_74de80-29 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_74de80-29 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_74de80-29 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_74de80-29 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_74de80-29 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_74de80-29{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_74de80-29 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_74de80-29 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_74de80-29\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">One-hot encoding suffers from two fundamentally 
different but equally important weaknesses: it explodes in dimensionality, and it destroys any natural ordering that may exist among categories. Both must be understood to choose the right alternative.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2.1 The Essence of the Curse of Dimensionality<\/h3>\n\n\n<style>.kadence-column6292_53d4b7-da > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_53d4b7-da > .kt-inside-inner-col,.kadence-column6292_53d4b7-da > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_53d4b7-da > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_53d4b7-da > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_53d4b7-da > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_53d4b7-da > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_53d4b7-da{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_53d4b7-da > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_53d4b7-da > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_53d4b7-da\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">First named by Richard Bellman in 1961, the curse of dimensionality refers to the exponential problems that arise as data dimensionality grows. 
In the context of one-hot encoding, the following effects become particularly severe:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(1) Explosion of Sparsity<\/strong><br>Encoding a feature with 10,000 unique values (e.g., <code>user_id<\/code>, <code>product_id<\/code>) produces a 10,000-dimensional vector in which only 1 entry is <code>1<\/code> and 9,999 are <code>0<\/code>. If the matrix is stored densely, memory grows as O(n\u00d7D) for n rows and D categories, while only a fraction 1\/D of each row's entries is non-zero.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(2) Distance Concentration<\/strong><br>For many data distributions, pairwise Euclidean distances become nearly equal as dimensionality grows. Mathematically, as dimensionality <code>d<\/code> grows, the ratio of maximum to minimum distance converges to 1: <code>(max_dist - min_dist) \/ min_dist \u2192 0 as d \u2192 \u221e<\/code>. This weakens distance-based algorithms like KNN and K-means, which rely on some points being clearly closer than others.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(3) Sample Complexity Explosion<\/strong><br>The number of samples needed to &#8220;fill&#8221; the input space grows exponentially with dimensionality. If 10 samples are enough to cover one dimension, covering a 100-dimensional grid at the same resolution would require 10^100 samples.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(4) Overfitting Risk<\/strong><br>A rare category becomes a one-hot column that is almost always <code>0<\/code>, so its effect is estimated from only a handful of rows and is highly susceptible to overfitting.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(5) Loss of Semantic Relationships<\/strong><br>&#8220;Seoul&#8221; and &#8220;Busan&#8221; share the semantic concept of &#8220;city,&#8221; but in one-hot encoding they are perfectly orthogonal vectors with cosine similarity of 0. The encoding itself carries no relational structure; any similarity between categories has to be learned from the target.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(6) Tree-Model Splitting Bias<\/strong><br>In decision-tree algorithms, a one-hot feature offers only a single &#8220;0 vs 1&#8221; split, which isolates one category at a time. 
Information gain becomes distorted, and these features are systematically disadvantaged compared to continuous features.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2.2 Loss of Sequence and Ordinal Structure<\/h3>\n\n\n<style>.kadence-column6292_65c199-8a > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_65c199-8a > .kt-inside-inner-col,.kadence-column6292_65c199-8a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_65c199-8a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_65c199-8a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_65c199-8a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_65c199-8a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_65c199-8a{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_65c199-8a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_65c199-8a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_65c199-8a\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">Many real-world categorical variables carry an intrinsic order \u2014 education level (<code>elementary &lt; middle &lt; high &lt; bachelor &lt; master &lt; PhD<\/code>), severity grade (<code>mild &lt; moderate &lt; severe &lt; critical<\/code>), Likert-scale survey responses (<code>strongly disagree \u2192 strongly agree<\/code>), age brackets, customer tiers (<code>bronze &lt; silver &lt; gold &lt; platinum<\/code>), and more. One-hot encoding <strong>erases this ordering completely<\/strong>, treating every level as equidistant and unrelated. 
This is a structural limitation that is independent of the curse of dimensionality, and in many domains it is the more damaging defect.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(1) Equidistance Assumption<\/strong><br>For one-hot vectors, every pair of distinct categories has the same Euclidean distance of \u221a2 and the same cosine similarity of 0. The model receives no signal that &#8220;bachelor&#8221; is closer to &#8220;master&#8221; than to &#8220;elementary.&#8221; Any ordinal information must be re-discovered by the model from the target signal alone, which requires far more data and is often unreliable for rare levels.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(2) Loss of Monotonicity<\/strong><br>In domains like credit scoring, healthcare risk, and pricing, regulators and stakeholders often require monotonic behavior, for example, &#8220;higher education should not lower predicted income.&#8221; With one-hot encoding, each level gets its own independent column, so monotonicity across levels cannot be expressed or enforced. Tools like <code>monotone_constraints<\/code> in GBDT libraries become inapplicable because the constraint requires a single ordered numeric feature.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(3) Inability to Interpolate or Extrapolate<\/strong><br>If training data lacks the &#8220;gold&#8221; tier but contains &#8220;silver&#8221; and &#8220;platinum,&#8221; a model with one-hot features has no basis for predicting &#8220;gold&#8221; sensibly. An ordinal representation (or an embedding that shares parameters across ordered levels) can interpolate between neighboring levels; one-hot cannot.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(4) Loss of Sequential \/ Temporal Pattern<\/strong><br>For categorical features that represent stages in a sequence (purchase funnel steps, disease progression stages, weekday\/month, or pipeline stages), one-hot encoding loses adjacency. 
&#8220;Monday&#8221; and &#8220;Tuesday&#8221; are no closer than &#8220;Monday&#8221; and &#8220;Saturday.&#8221; Neural sequence models built on top of one-hot inputs must spend capacity learning trivial adjacency facts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(5) Increased Sample Requirement for Ordinal Tasks<\/strong><br>Because no order prior is provided, every level effectively becomes its own free parameter. Rare ordinal levels require many examples to be learned correctly, whereas an ordinal encoding would inherit information from neighboring levels for free.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(6) Statistical Interpretability Loss<\/strong><br>In regression analysis, one-hot dummy variables produce a separate coefficient for each level, so the analyst cannot directly read off &#8220;the linear trend of severity on outcome.&#8221; Polynomial contrast coding exposes that trend as its own coefficient, and backward-difference coding exposes each step between adjacent levels.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why This Matters in Practice<\/strong><br>The two limitations compound. A high-cardinality ordinal feature (e.g., income split into 100 percentile bands, or 50 age buckets) suffers from <em>both<\/em> the curse of dimensionality <em>and<\/em> the loss of ordering. Choosing an encoding that respects order (ordinal integer + monotonic constraint, thermometer encoding, ordinal entity embedding, CORAL\/CORN, or order embeddings) can address both problems at once. This is why the practical guide in Section 8 explicitly distinguishes &#8220;order present&#8221; cases from &#8220;no order&#8221; cases.<\/p>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Classical Solutions<\/h2>\n\n\n<style>.kadence-column6292_01a3da-df > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_01a3da-df > .kt-inside-inner-col,.kadence-column6292_01a3da-df > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_01a3da-df > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_01a3da-df > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_01a3da-df > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_01a3da-df > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_01a3da-df{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_01a3da-df > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_01a3da-df > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_01a3da-df\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">These are deterministic or lightly-learned encodings that have served as the backbone of categorical handling for decades.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Label \/ Ordinal Encoding<\/strong> \u2014 Map categories directly to integers. Only valid when a meaningful order exists.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Frequency Encoding<\/strong> \u2014 Replace each category with its occurrence count. Simple, but cannot distinguish categories with equal frequency.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Target (Mean) Encoding<\/strong> \u2014 Replace categories with the mean of the target variable. 
This carries target-leakage risk, so it is usually combined with K-fold (out-of-fold) estimation, smoothing, or noise injection.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Hash Encoding (Feature Hashing)<\/strong> \u2014 Map each category to one of a fixed number of buckets via a hash function. Memory-efficient, but distinct categories can collide in the same bucket. Used in Vowpal Wabbit.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Binary \/ BaseN Encoding<\/strong> \u2014 Encode each category's integer index as binary (or base-N) digits, reducing the number of columns from N categories to roughly log\u2082(N), rounded up, in the binary case.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Leave-One-Out Encoding, WOE (Weight of Evidence)<\/strong> \u2014 Leave-one-out is a target-encoding variant that excludes the current row's own target from the category mean; WOE encodes each category by the log ratio of the two classes' shares falling in that category. Both are common in finance and credit scoring.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Ordinal-Aware Classical Encodings<\/strong> \u2014 For variables with intrinsic order (education level, severity, Likert scale):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Thermometer \/ Unary Encoding<\/em>: Encode level k as k ones followed by zeros, e.g., level 2 of 3 \u2192 [1,1,0]. Encodes the ordering cumulatively; the same cumulative idea appears in later ordinal deep learning methods (CORAL, CORN).<\/li>\n\n\n\n<li><em>Polynomial Contrast Coding<\/em>: Decomposes ordinal effects into linear, quadratic, cubic, and higher-order trends.<\/li>\n\n\n\n<li><em>Helmert \/ Reverse-Helmert Coding<\/em>: Compares each level to the mean of subsequent (or preceding) levels.<\/li>\n\n\n\n<li><em>Backward Difference Coding<\/em>: Compares each level to the level immediately before it.<\/li>\n\n\n\n<li><em>Sum (Deviation) Coding<\/em>: Compares each level to the grand mean.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These contrast-coding schemes remain standard in medicine, social sciences, and any setting where regression-coefficient interpretability matters.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Deep Learning-based Entity Embedding<\/h2>\n\n\n<style>.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col,.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_70c7d9-d1{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_70c7d9-d1 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_70c7d9-d1\"><div class=\"kt-inside-inner-col\">\n<p class=\"has-24292-e-color has-text-color wp-block-paragraph\">Although introduced after the strictly statistical era, <strong>entity embeddings are now considered a classical, standard tool<\/strong> in any deep-learning workflow. The 2016 paper by Guo &amp; Berkhahn \u2014 &#8220;Entity Embeddings of Categorical Variables&#8221; \u2014 proved their value in the Kaggle Rossmann Store Sales competition (3rd place). A decade later, they are the default choice for medium- and high-cardinality features.<\/p>\n\n\n\n<p class=\"has-24292-e-color has-text-color wp-block-paragraph\"><strong>Core Idea<\/strong><br>Each category is mapped to a learnable, low-dimensional dense vector via an embedding layer placed at the front of the network. 
This is essentially the same lookup-table mechanism used for word embeddings such as Word2Vec, applied to tabular data.<\/p>\n\n\n\n<ul class=\"wp-block-list has-24292-e-color has-text-color\">\n<li>Dimensionality: N \u2192 d, with d \u2248 min(50, (N+1)\/\/2) as a common rule of thumb<\/li>\n\n\n\n<li>Categories that behave similarly with respect to the target tend to cluster in embedding space<\/li>\n\n\n\n<li>End-to-end gradient-based training optimizes the embedding for the downstream task<\/li>\n<\/ul>\n\n\n\n<p class=\"has-24292-e-color has-text-color wp-block-paragraph\"><strong>Ordinal-Preserving Embedding Variants<\/strong><br>For categorical variables with intrinsic order, several extensions exist:<\/p>\n\n\n\n<ul class=\"wp-block-list has-24292-e-color has-text-color\">\n<li><em>Order Embeddings<\/em> (Vendrov et al., ICLR 2016): Encode partial order via coordinate-wise inequality in the embedding space.<\/li>\n\n\n\n<li><em>Poincar\u00e9 \/ Hyperbolic Embeddings<\/em> (Nickel &amp; Kiela, 2017): Embed hierarchical and ordered structures in hyperbolic space, where depth in the hierarchy maps naturally to distance from the origin.<\/li>\n\n\n\n<li><em>Monotonic Embedding Networks<\/em>: Use UMNN (unconstrained monotonic neural networks) or Deep Lattice Networks to enforce monotonic structure.<\/li>\n\n\n\n<li><em>Order-Preserving Contrastive Learning<\/em> (2023+): Triplet objectives encourage that <code>x &lt; y &lt; z<\/code> implies <code>d(x,y) &lt; d(x,z)<\/code>.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-24292-e-color has-text-color wp-block-paragraph\">The fastai <code>TabularLearner<\/code> and PyTorch&#8217;s <code>nn.Embedding<\/code> made this approach widely accessible across the industry.<\/p>\n\n\n\n<p class=\"has-24292-e-color has-text-color wp-block-paragraph\"><strong>PyTorch Implementation Example<\/strong><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" 
style=\"font-size:1rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#24292e;--cbp-line-number-width:calc(2 * 0.6 * 1rem);line-height:1.625rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#24292e;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>import torch\nimport torch.nn as nn\nimport pandas as pd\n\n# Sample data with mixed categorical and numerical features\ndf = pd.DataFrame({\n    'city':    &#091;'Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'&#093;,\n    'product': &#091;'A', 'B', 'A', 'C', 'B', 'A'&#093;,\n    'price':   &#091;100, 200, 150, 300, 250, 180&#093;,\n    'target':  &#091;1, 0, 1, 0, 1, 0&#093;,\n})\n\n# Map each category to an integer index\ncity_to_idx    = {c: i for i, c in enumerate(df&#091;'city'&#093;.unique())}\nproduct_to_idx = {p: i for i, p in enumerate(df&#091;'product'&#093;.unique())}\n\ndf&#091;'city_idx'&#093;    = df&#091;'city'&#093;.map(city_to_idx)\ndf&#091;'product_idx'&#093; = df&#091;'product'&#093;.map(product_to_idx)\n\nn_cities, n_products = len(city_to_idx), len(product_to_idx)\n\n# Rule of thumb: embedding dim = min(50, (cardinality + 1) \/\/ 2)\ncity_dim    = min(50, (n_cities + 1) \/\/ 2)\nproduct_dim = min(50, (n_products + 1) \/\/ 2)\n\nclass EntityEmbeddingNet(nn.Module):\n    def __init__(self):\n        super().__init__()\n        # Learnable lookup tables \u2014 replace one-hot entirely\n        self.city_emb    = nn.Embedding(n_cities, city_dim)\n        self.product_emb = nn.Embedding(n_products, product_dim)\n\n        input_dim = city_dim + product_dim + 1  # +1 for price\n        self.mlp = nn.Sequential(\n            nn.Linear(input_dim, 32),\n            nn.ReLU(),\n           
 nn.Dropout(0.2),\n            nn.Linear(32, 1),\n            nn.Sigmoid(),\n        )\n\n    def forward(self, city_idx, product_idx, price):\n        c = self.city_emb(city_idx)\n        p = self.product_emb(product_idx)\n        x = torch.cat(&#091;c, p, price.unsqueeze(1)&#093;, dim=1)\n        return self.mlp(x)\n\nmodel     = EntityEmbeddingNet()\noptimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\nloss_fn   = nn.BCELoss()\n\ncity_t  = torch.tensor(df&#091;'city_idx'&#093;.values,    dtype=torch.long)\nprod_t  = torch.tensor(df&#091;'product_idx'&#093;.values, dtype=torch.long)\nprice_t = torch.tensor(df&#091;'price'&#093;.values,       dtype=torch.float32)\ny_t     = torch.tensor(df&#091;'target'&#093;.values,      dtype=torch.float32)\n\nfor epoch in range(200):\n    pred = model(city_t, prod_t, price_t).squeeze()\n    loss = loss_fn(pred, y_t)\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n\n# The learned embedding matrix is now a dense, semantically meaningful\n# representation that completely sidesteps the curse of dimensionality.\nprint(\"Learned city embeddings:\\n\", model.city_emb.weight.detach())<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki github-light\" style=\"background-color: #fff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> 
torch<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> torch.nn <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> nn<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> pandas <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> pd<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Sample data with mixed categorical and numerical features<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> pd.DataFrame({<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">:    &#091;<\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Daegu&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Incheon&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: 
#032F62\">&#39;C&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;price&#39;<\/span><span style=\"color: #24292E\">:   &#091;<\/span><span style=\"color: #005CC5\">100<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">200<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">150<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">300<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">250<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">180<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">:  &#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Map each category to an integer index<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">city_to_idx    <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> {c: i <\/span><span 
style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> i, c <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">enumerate<\/span><span style=\"color: #24292E\">(df&#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">&#093;.unique())}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">product_to_idx <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> {p: i <\/span><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> i, p <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">enumerate<\/span><span style=\"color: #24292E\">(df&#091;<\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">&#093;.unique())}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df&#091;<\/span><span style=\"color: #032F62\">&#39;city_idx&#39;<\/span><span style=\"color: #24292E\">&#093;    <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">&#093;.map(city_to_idx)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df&#091;<\/span><span style=\"color: #032F62\">&#39;product_idx&#39;<\/span><span style=\"color: #24292E\">&#093; <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;<\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">&#093;.map(product_to_idx)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">n_cities, n_products <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">len<\/span><span style=\"color: 
#24292E\">(city_to_idx), <\/span><span style=\"color: #005CC5\">len<\/span><span style=\"color: #24292E\">(product_to_idx)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Rule of thumb: embedding dim = min(50, (cardinality + 1) \/\/ 2)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">city_dim    <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">min<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">, (n_cities <\/span><span style=\"color: #D73A49\">+<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">) <\/span><span style=\"color: #D73A49\">\/\/<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">2<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">product_dim <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">min<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">, (n_products <\/span><span style=\"color: #D73A49\">+<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">) <\/span><span style=\"color: #D73A49\">\/\/<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">2<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">class<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #6F42C1\">EntityEmbeddingNet<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #6F42C1\">nn<\/span><span style=\"color: #24292E\">.<\/span><span style=\"color: #6F42C1\">Module<\/span><span 
style=\"color: #24292E\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #D73A49\">def<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">__init__<\/span><span style=\"color: #24292E\">(self):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        <\/span><span style=\"color: #005CC5\">super<\/span><span style=\"color: #24292E\">().<\/span><span style=\"color: #005CC5\">__init__<\/span><span style=\"color: #24292E\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        <\/span><span style=\"color: #6A737D\"># Learnable lookup tables \u2014 replace one-hot entirely<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        <\/span><span style=\"color: #005CC5\">self<\/span><span style=\"color: #24292E\">.city_emb    <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> nn.Embedding(n_cities, city_dim)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        <\/span><span style=\"color: #005CC5\">self<\/span><span style=\"color: #24292E\">.product_emb <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> nn.Embedding(n_products, product_dim)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        input_dim <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> city_dim <\/span><span style=\"color: #D73A49\">+<\/span><span style=\"color: #24292E\"> product_dim <\/span><span style=\"color: #D73A49\">+<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">  <\/span><span style=\"color: #6A737D\"># +1 for price<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        <\/span><span style=\"color: #005CC5\">self<\/span><span style=\"color: #24292E\">.mlp <\/span><span style=\"color: 
#D73A49\">=<\/span><span style=\"color: #24292E\"> nn.Sequential(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">            nn.Linear(input_dim, <\/span><span style=\"color: #005CC5\">32<\/span><span style=\"color: #24292E\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">            nn.ReLU(),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">            nn.Dropout(<\/span><span style=\"color: #005CC5\">0.2<\/span><span style=\"color: #24292E\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">            nn.Linear(<\/span><span style=\"color: #005CC5\">32<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">            nn.Sigmoid(),<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        )<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #D73A49\">def<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #6F42C1\">forward<\/span><span style=\"color: #24292E\">(self, city_idx, product_idx, price):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        c <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">self<\/span><span style=\"color: #24292E\">.city_emb(city_idx)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        p <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">self<\/span><span style=\"color: #24292E\">.product_emb(product_idx)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        x <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> torch.cat(&#091;c, p, price.unsqueeze(<\/span><span 
style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">)&#093;, <\/span><span style=\"color: #E36209\">dim<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">        <\/span><span style=\"color: #D73A49\">return<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">self<\/span><span style=\"color: #24292E\">.mlp(x)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model     <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> EntityEmbeddingNet()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">optimizer <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> torch.optim.Adam(model.parameters(), <\/span><span style=\"color: #E36209\">lr<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">1e-2<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">loss_fn   <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> nn.BCELoss()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">city_t  <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> torch.tensor(df&#091;<\/span><span style=\"color: #032F62\">&#39;city_idx&#39;<\/span><span style=\"color: #24292E\">&#093;.values,    <\/span><span style=\"color: #E36209\">dtype<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">torch.long)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">prod_t  <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> torch.tensor(df&#091;<\/span><span style=\"color: #032F62\">&#39;product_idx&#39;<\/span><span style=\"color: #24292E\">&#093;.values, <\/span><span 
style=\"color: #E36209\">dtype<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">torch.long)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">price_t <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> torch.tensor(df&#091;<\/span><span style=\"color: #032F62\">&#39;price&#39;<\/span><span style=\"color: #24292E\">&#093;.values,       <\/span><span style=\"color: #E36209\">dtype<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">torch.float32)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">y_t     <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> torch.tensor(df&#091;<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">&#093;.values,      <\/span><span style=\"color: #E36209\">dtype<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">torch.float32)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> epoch <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">range<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #005CC5\">200<\/span><span style=\"color: #24292E\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    pred <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> model(city_t, prod_t, price_t).squeeze()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    loss <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> loss_fn(pred, y_t)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    optimizer.zero_grad()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    loss.backward()<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#24292E\">    optimizer.step()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># The learned embedding matrix is now a dense, semantically meaningful<\/span><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># representation that completely sidesteps the curse of dimensionality.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Learned city embeddings:<\/span><span style=\"color: #005CC5\">\\n<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">, model.city_emb.weight.detach())<\/span><\/span><\/code><\/pre><\/div>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">5. GBDT&#8217;s Native Categorical Handling<\/h2>\n\n\n<style>.kadence-column6292_78bcb9-73 > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_78bcb9-73 > .kt-inside-inner-col,.kadence-column6292_78bcb9-73 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_78bcb9-73 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_78bcb9-73 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_78bcb9-73 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_78bcb9-73 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_78bcb9-73{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_78bcb9-73 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_78bcb9-73 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_78bcb9-73\"><div class=\"kt-inside-inner-col\">\n<p 
class=\"wp-block-paragraph\">Gradient-boosted decision trees can sidestep one-hot encoding through native categorical support.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>CatBoost (Yandex, 2017)<\/strong> \u2014 Implements <em>Ordered Target Encoding<\/em>, a permutation-based scheme that computes each instance&#8217;s target statistic from &#8220;earlier&#8221; samples only, which avoids the target leakage of naive target encoding. One-hot encoding is used only for features at or below the <code>one_hot_max_size<\/code> threshold.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>LightGBM<\/strong> \u2014 Direct categorical support via the <code>categorical_feature<\/code> parameter. Following Fisher (1958), it sorts the k categories by their accumulated gradient statistics and then finds the best partition on the sorted order in O(k\u00b7log k).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>XGBoost 1.5+<\/strong> \u2014 Native categorical support via <code>enable_categorical=True<\/code> on columns with the pandas <code>category<\/code> dtype.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These libraries also support <strong>monotonic constraints<\/strong> (<code>monotone_constraints={\"grade\": 1}<\/code>), which combine cleanly with ordinal encoding to enforce, for example, that &#8220;higher education level \u2192 higher prediction.&#8221; Such guarantees are often required in regulated domains such as credit scoring and healthcare.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">6. 
Recent (2019+) Transformer\/Foundation Model Approaches<\/h2>\n\n\n<style>.kadence-column6292_86b550-d6 > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_86b550-d6 > .kt-inside-inner-col,.kadence-column6292_86b550-d6 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_86b550-d6 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_86b550-d6 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_86b550-d6 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_86b550-d6 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_86b550-d6{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_86b550-d6 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_86b550-d6 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_86b550-d6\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>TabTransformer (Amazon, 2020)<\/strong> \u2014 Embeds each categorical feature as a token and learns contextual embeddings via Transformer self-attention, automatically capturing interactions between categorical features.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>FT-Transformer (Yandex, 2021)<\/strong> \u2014 Feature Tokenizer + Transformer. Tokenizes both numerical and categorical features uniformly. 
It remains a strong baseline for tabular deep learning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>SAINT (2021)<\/strong> \u2014 Combines row attention and column attention with contrastive self-supervised pre-training.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>TabNet (Google, 2019)<\/strong> \u2014 Uses sequential attention for instance-wise feature selection, providing interpretability and automatic sparsity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>TabPFN (Hollmann et al., ICLR 2023)<\/strong> \u2014 A &#8220;Tabular Prior-data Fitted Network&#8221; that classifies small tabular datasets in a single forward pass, with no dataset-specific training: the pre-trained network conditions on the training set in context. Categorical features are handled inside the model, so no one-hot expansion is needed. Development continued through 2025\u20132026 with TabPFN v2, which adds regression support.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>TabDDPM, TabSyn<\/strong> \u2014 Diffusion-based generative models that learn the joint distribution of tabular data, enabling rare-category augmentation that mitigates the sparsity of rarely observed categories.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Ordinal Regression Deep Learning<\/strong> \u2014 CORAL (Cao et al., 2019) and CORN (Shi, Cao &amp; Raschka, 2022) decompose K-class ordinal classification into K-1 binary classifiers with rank-consistency constraints, providing principled order-aware deep learning.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">7. 
New Paradigms in the LLM Era<\/h2>\n\n\n<style>.kadence-column6292_1fca9f-cc > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_1fca9f-cc > .kt-inside-inner-col,.kadence-column6292_1fca9f-cc > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_1fca9f-cc > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_1fca9f-cc > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_1fca9f-cc > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_1fca9f-cc > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_1fca9f-cc{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_1fca9f-cc > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_1fca9f-cc > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6292_1fca9f-cc\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Text-as-Features<\/strong> \u2014 Treat category names as natural language and extract embeddings from pre-trained language models (BERT, Sentence-BERT, OpenAI embeddings). Semantic similarity between categories like &#8220;Seoul Metropolitan City&#8221; and &#8220;Busan Metropolitan City&#8221; is preserved, and zero-shot encoding of new categories becomes possible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>CARTE (Kim et al., 2024)<\/strong> \u2014 &#8220;Context-Aware Representation of Table Entries.&#8221; Embeds table entries and even column names with pre-trained language-model embeddings, enabling transfer learning across tables with different schemas. 
This helps with both cold-start and rare-category problems.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>LLM-generated Semantic Encoding<\/strong> \u2014 Ask an LLM (GPT-4, Claude) to describe each category, then embed the descriptions. This injects domain knowledge at little cost beyond the LLM calls themselves.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Graph-based Encoding (GNN + Embedding)<\/strong> \u2014 Model relationships between categories (e.g., category\u2013subcategory, user\u2013item) as a graph and learn embeddings via GNNs. Examples include PinSage, GraphSAGE, and related architectures.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Rank-Consistent Foundation Models<\/strong> \u2014 Extensions such as TabPFN-Ord, which attach ordinal regression heads to foundation models, are emerging.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">8. Practical Selection Guide<\/h2>\n\n\n<style>.kadence-column6292_97c407-77 > .kt-inside-inner-col{padding-right:var(--global-kb-spacing-xl, 4rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);}.kadence-column6292_97c407-77 > .kt-inside-inner-col,.kadence-column6292_97c407-77 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6292_97c407-77 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6292_97c407-77 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6292_97c407-77 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6292_97c407-77 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6292_97c407-77{position:relative;}@media all and (max-width: 1024px){.kadence-column6292_97c407-77 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6292_97c407-77 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column 
kadence-column6292_97c407-77\"><div class=\"kt-inside-inner-col\">\n<figure style=\"padding-right:var(--wp--preset--spacing--40);padding-left:var(--wp--preset--spacing--40)\" class=\"wp-block-table\"><table><thead><tr><th>Cardinality \/ Situation<\/th><th>Recommended Approach<\/th><\/tr><\/thead><tbody><tr><td>Low (&lt; 10), no order<\/td><td>One-hot encoding<\/td><\/tr><tr><td>Medium (10\u20131,000), no order<\/td><td>Target encoding + smoothing, or <strong>entity embeddings<\/strong><\/td><\/tr><tr><td>High (&gt; 1,000), no order<\/td><td>CatBoost ordered target encoding, hashing, or <strong>learned entity embeddings<\/strong><\/td><\/tr><tr><td>Order present, simple model<\/td><td>Ordinal encoding + monotonic constraint (GBDT)<\/td><\/tr><tr><td>Order present, statistical interpretation<\/td><td>Polynomial \/ Helmert Contrast Coding<\/td><\/tr><tr><td>Order present, neural classifier<\/td><td>Thermometer Encoding + CORAL\/CORN, or <strong>ordinal entity embeddings<\/strong><\/td><\/tr><tr><td>Hierarchical order (tree)<\/td><td>Order \/ Poincar\u00e9 Embeddings<\/td><\/tr><tr><td>Ordered + small dataset<\/td><td>TabPFN with ordinal head<\/td><\/tr><tr><td>Tabular deep learning project<\/td><td><strong>Entity embeddings<\/strong> + FT-Transformer or SAINT<\/td><\/tr><tr><td>Heavy domain knowledge \/ cold-start<\/td><td>LLM-based semantic embedding<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The overarching principle: <strong>avoid one-hot encoding beyond low-cardinality, unordered features<\/strong>; prefer learnable entity embeddings or native GBDT handling, use TabPFN-based approaches when data is small, and use LLM-based embeddings when new categories are frequent.<\/p>\n<\/div><\/div>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">9. 
CatBoost, XGBoost, and LightGBM \u2014 Practical Implementation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">All three libraries offer native categorical handling that avoids one-hot encoding&#8217;s dimensionality problems.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">CatBoost \u2014 Ordered Target Encoding<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">CatBoost was designed around categorical data. Its <em>Ordered Target Encoding<\/em> uses random permutations and only &#8220;previous&#8221; samples to compute target statistics, avoiding the leakage that plagues vanilla target encoding.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:1rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#24292e;--cbp-line-number-width:calc(2 * 0.6 * 1rem);line-height:1.625rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#24292e;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>import pandas as pd\nfrom catboost import CatBoostClassifier, Pool\nfrom sklearn.model_selection import train_test_split\n\n# Sample data with high-cardinality categoricals\ndf = pd.DataFrame({\n    'city':     &#091;'Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'&#093;,\n    'product':  &#091;'A', 'B', 'A', 'C', 'B', 'A'&#093;,\n    'user_id':  &#091;'u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002'&#093;,\n    'price':    &#091;100, 200, 150, 300, 250, 180&#093;,\n    'target':   &#091;1, 0, 1, 0, 1, 0&#093;,\n})\n\nX = df.drop('target', axis=1)\ny = df&#091;'target'&#093;\n\ncat_features = &#091;'city', 'product', 'user_id'&#093;\n\nX_tr, X_va, y_tr, y_va = 
train_test_split(X, y, test_size=0.33, random_state=42)\n\nmodel = CatBoostClassifier(\n    iterations=500,\n    learning_rate=0.05,\n    depth=6,\n    cat_features=cat_features,        # native categorical handling\n    one_hot_max_size=4,               # one-hot only if cardinality &lt;= 4\n    eval_metric='AUC',\n    verbose=0,\n)\n\n# Pool object lets you bundle data with categorical metadata\ntrain_pool = Pool(X_tr, y_tr, cat_features=cat_features)\nvalid_pool = Pool(X_va, y_va, cat_features=cat_features)\n\nmodel.fit(train_pool, eval_set=valid_pool, use_best_model=True)\n\n# Inspect categorical feature importance\nprint(model.get_feature_importance(prettified=True))<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki github-light\" style=\"background-color: #fff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> pandas <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> pd<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">from<\/span><span style=\"color: #24292E\"> catboost <\/span><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> CatBoostClassifier, Pool<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">from<\/span><span style=\"color: #24292E\"> sklearn.model_selection <\/span><span 
style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> train_test_split<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Sample data with high-cardinality categoricals<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> pd.DataFrame({<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">:     &#091;<\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Daegu&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Incheon&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">:  &#091;<\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;C&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: 
#032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">:  &#091;<\/span><span style=\"color: #032F62\">&#39;u_001&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_002&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_003&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_001&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_004&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_002&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;price&#39;<\/span><span style=\"color: #24292E\">:    &#091;<\/span><span style=\"color: #005CC5\">100<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">200<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">150<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">300<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">250<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">180<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">:   &#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: 
#24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">X <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df.drop(<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">axis<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">y <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">cat_features <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> &#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">X_tr, X_va, y_tr, y_va <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> train_test_split(X, y, <\/span><span style=\"color: #E36209\">test_size<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">0.33<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">random_state<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">42<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model <\/span><span style=\"color: 
#D73A49\">=<\/span><span style=\"color: #24292E\"> CatBoostClassifier(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">iterations<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">500<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">learning_rate<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">0.05<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">depth<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">6<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">cat_features<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">cat_features,        <\/span><span style=\"color: #6A737D\"># native categorical handling<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">one_hot_max_size<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">4<\/span><span style=\"color: #24292E\">,               <\/span><span style=\"color: #6A737D\"># one-hot only if cardinality &lt;= 4<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">eval_metric<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #032F62\">&#39;AUC&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">verbose<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Pool object lets you bundle data with categorical metadata<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">train_pool <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> Pool(X_tr, y_tr, <\/span><span style=\"color: #E36209\">cat_features<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">cat_features)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">valid_pool <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> Pool(X_va, y_va, <\/span><span style=\"color: #E36209\">cat_features<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">cat_features)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model.fit(train_pool, <\/span><span style=\"color: #E36209\">eval_set<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">valid_pool, <\/span><span style=\"color: #E36209\">use_best_model<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">True<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Inspect categorical feature importance<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(model.get_feature_importance(<\/span><span style=\"color: #E36209\">prettified<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">True<\/span><span style=\"color: #24292E\">))<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Key parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>cat_features<\/code>: list of column names or indices to treat as categorical<\/li>\n\n\n\n<li><code>one_hot_max_size<\/code>: low-cardinality 
cutoff for which one-hot is preferred over target encoding<\/li>\n\n\n\n<li>Ordered Target Encoding is applied automatically for higher-cardinality features<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">XGBoost \u2014 Native Categorical Support (1.5+)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">XGBoost introduced experimental native categorical support in version 1.5 and added optimal partitioning, together with the <code>max_cat_to_onehot<\/code> parameter, in 1.6. Categories can then be split into groups directly instead of being one-hot encoded first. With pandas input, categorical columns must use the <code>category<\/code> dtype.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:1rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#24292e;--cbp-line-number-width:calc(2 * 0.6 * 1rem);line-height:1.625rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#24292e;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>import pandas as pd\nimport xgboost as xgb\nfrom sklearn.model_selection import train_test_split\n\ndf = pd.DataFrame({\n    'city':     &#091;'Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan'&#093;,\n    'product':  &#091;'A', 'B', 'A', 'C', 'B', 'A'&#093;,\n    'grade':    &#091;'low', 'mid', 'high', 'mid', 'low', 'high'&#093;,  # ordinal\n    'price':    &#091;100, 200, 150, 300, 250, 180&#093;,\n    'target':   &#091;1, 0, 1, 0, 1, 0&#093;,\n})\n\nX = df.drop('target', axis=1)\ny = df&#091;'target'&#093;\n\n# Convert categorical columns to pandas 'category' dtype\nfor col in &#091;'city', 'product'&#093;:\n    X&#091;col&#093; = X&#091;col&#093;.astype('category')\n\n# Ordinal feature: preserve order explicitly\nX&#091;'grade'&#093; = pd.Categorical(X&#091;'grade'&#093;, 
categories=&#091;'low', 'mid', 'high'&#093;, ordered=True)\n\nX_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)\n\nmodel = xgb.XGBClassifier(\n    n_estimators=500,\n    learning_rate=0.05,\n    max_depth=6,\n    tree_method='hist',                  # required for categorical support\n    enable_categorical=True,             # turn on native categorical handling\n    max_cat_to_onehot=4,                 # one-hot if fewer than 4 categories, else partition\n    monotone_constraints={'grade': 1},   # enforce monotonic effect for ordinal feature\n    eval_metric='auc',\n)\n\nmodel.fit(X_tr, y_tr, eval_set=&#091;(X_va, y_va)&#093;, verbose=False)\n\n# Predict\npreds = model.predict_proba(X_va)&#091;:, 1&#093;\nprint(model.feature_importances_)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki github-light\" style=\"background-color: #fff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> pandas <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> pd<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> xgboost <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> xgb<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">from<\/span><span 
style=\"color: #24292E\"> sklearn.model_selection <\/span><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> train_test_split<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> pd.DataFrame({<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">:     &#091;<\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Daegu&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Incheon&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">:  &#091;<\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;C&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: 
#24292E\">:    &#091;<\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">&#093;,  <\/span><span style=\"color: #6A737D\"># ordinal<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;price&#39;<\/span><span style=\"color: #24292E\">:    &#091;<\/span><span style=\"color: #005CC5\">100<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">200<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">150<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">300<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">250<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">180<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">:   &#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: 
#24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">X <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df.drop(<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">axis<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">y <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Convert categorical columns to pandas &#39;category&#39; dtype<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> col <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> &#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">&#093;:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    X&#091;col&#093; <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> X&#091;col&#093;.astype(<\/span><span style=\"color: #032F62\">&#39;category&#39;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Ordinal feature: preserve order explicitly<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">X&#091;<\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">&#093; <\/span><span 
style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> pd.Categorical(X&#091;<\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">&#093;, <\/span><span style=\"color: #E36209\">categories<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;<\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">&#093;, <\/span><span style=\"color: #E36209\">ordered<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">True<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">X_tr, X_va, y_tr, y_va <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> train_test_split(X, y, <\/span><span style=\"color: #E36209\">test_size<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">0.33<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">random_state<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">42<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> xgb.XGBClassifier(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">n_estimators<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">500<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">learning_rate<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: 
0.05">
#005CC5\">0.05<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">max_depth<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">6<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">tree_method<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #032F62\">&#39;hist&#39;<\/span><span style=\"color: #24292E\">,                  <\/span><span style=\"color: #6A737D\"># required for categorical support<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">enable_categorical<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">True<\/span><span style=\"color: #24292E\">,             <\/span><span style=\"color: #6A737D\"># turn on native categorical handling<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">max_cat_to_onehot<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">4<\/span><span style=\"color: #24292E\">,                 <\/span><span style=\"color: #6A737D\"># one-hot if fewer than 4 categories, else partition<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">monotone_constraints<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">{<\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">},   <\/span><span style=\"color: #6A737D\"># enforce monotonic effect for ordinal feature<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">eval_metric<\/span><span style=\"color: #D73A49\">=<\/span><span 
style=\"color: #032F62\">&#39;auc&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model.fit(X_tr, y_tr, <\/span><span style=\"color: #E36209\">eval_set<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;(X_va, y_va)&#093;, <\/span><span style=\"color: #E36209\">verbose<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">False<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># Predict<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">preds <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> model.predict_proba(X_va)&#091;:, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(model.feature_importances_)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Key parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>enable_categorical=True<\/code>: activates native handling<\/li>\n\n\n\n<li><code>tree_method='hist'<\/code> or <code>'gpu_hist'<\/code>: required (the older <code>'exact'<\/code> method does not support categoricals)<\/li>\n\n\n\n<li><code>max_cat_to_onehot<\/code>: threshold for one-hot vs. partition-based splitting<\/li>\n\n\n\n<li><code>monotone_constraints<\/code>: combined with ordinal encoding, enforces monotonic predictions for ordered categories<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">LightGBM \u2014 categorical_feature Native Support<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">LightGBM was the first major GBDT library to offer built-in categorical handling. 
It follows Fisher (1958): at each split it sorts the k categories of a feature by their accumulated gradient statistics (<code>sum_gradient \/ sum_hessian<\/code>) and then searches the sorted histogram for the best partition, which takes O(k\u00b7log k) time. For high-cardinality features this often performs better than one-hot encoding, which tends to need deep, unbalanced trees to separate many categories.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:1rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#24292e;--cbp-line-number-width:calc(3 * 0.6 * 1rem);line-height:1.625rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#24292e;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>import pandas as pd\nimport lightgbm as lgb\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_auc_score, classification_report\n\ndf = pd.DataFrame({\n    'city': &#091;'Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon', 'Busan',\n             'Seoul', 'Daegu', 'Busan', 'Incheon'&#093;,\n    'product': &#091;'A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B'&#093;,\n    'user_id': &#091;'u_001', 'u_002', 'u_003', 'u_001', 'u_004', 'u_002',\n                'u_005', 'u_003', 'u_006', 'u_004'&#093;,\n    'grade': &#091;'low', 'mid', 'high', 'mid', 'low', 'high',\n              'mid', 'high', 'low', 'mid'&#093;,\n    'price': &#091;100, 200, 150, 300, 250, 180, 220, 170, 310, 240&#093;,\n    'target': &#091;1, 0, 1, 0, 1, 0, 1, 1, 0, 1&#093;,\n})\n\n# grade: ordinal \u2192 numeric (required for monotone constraint)\ngrade_map = {'low': 0, 'mid': 1, 'high': 2}\ndf&#091;'grade'&#093; = df&#091;'grade'&#093;.map(grade_map).astype(int)\n\n# nominal features: cast to 
category dtype so LightGBM encodes them internally\nfor col in &#091;'city', 'product', 'user_id'&#093;:\n    df&#091;col&#093; = df&#091;col&#093;.astype('category')\n\nX = df.drop('target', axis=1)\ny = df&#091;'target'&#093;\n\nX_tr, X_va, y_tr, y_va = train_test_split(\n    X, y, test_size=0.3, random_state=42, stratify=y\n)\n\nfeature_cols = X.columns.tolist()\nmono = &#091;1 if c == 'grade' else 0 for c in feature_cols&#093;  # +1 = increasing for grade only\n\n# --- sklearn API ---\nprint(\"=\" * 50)\nprint(\"sklearn API\")\nprint(\"=\" * 50)\n\nmodel = lgb.LGBMClassifier(\n    n_estimators=500,\n    learning_rate=0.05,\n    max_depth=-1,\n    num_leaves=31,\n    min_data_in_leaf=1,      # relaxed for small dataset\n    cat_smooth=10,           # smoothing for rare categories\n    cat_l2=10,               # L2 regularization on categorical splits\n    max_cat_threshold=32,    # max categories considered per split\n    monotone_constraints=mono,\n    objective='binary',\n    metric='auc',\n    verbose=-1,\n)\nmodel.fit(\n    X_tr, y_tr,\n    eval_set=&#091;(X_va, y_va)&#093;,\n    categorical_feature=&#091;'city', 'product', 'user_id'&#093;,  # grade excluded: numeric with monotone constraint\n    callbacks=&#091;lgb.early_stopping(stopping_rounds=30)&#093;,\n)\n\npreds_proba = model.predict_proba(X_va)&#091;:, 1&#093;\npreds_label = model.predict(X_va)\n\nprint(f\"Best iteration   : {model.best_iteration_}\")\nprint(f\"Validation AUC   : {roc_auc_score(y_va, preds_proba):.4f}\")\nprint()\nprint(\"Classification report:\")\nprint(classification_report(y_va, preds_label, target_names=&#091;'class 0', 'class 1'&#093;))\n\nimportance = dict(zip(feature_cols, model.feature_importances_))\nprint(\"Feature importance (split):\")\nfor feat, score in sorted(importance.items(), key=lambda x: -x&#091;1&#093;):\n    print(f\"  {feat:&lt;12}: {score}\")\n\n# --- Native (Dataset) API ---\nprint()\nprint(\"=\" * 50)\nprint(\"Native API\")\nprint(\"=\" * 50)\n\ntrain_ds = 
lgb.Dataset(\n    X_tr, label=y_tr,\n    categorical_feature=&#091;'city', 'product', 'user_id'&#093;,  # grade excluded\n    free_raw_data=False,\n)\nvalid_ds = lgb.Dataset(\n    X_va, label=y_va,\n    categorical_feature=&#091;'city', 'product', 'user_id'&#093;,  # grade excluded\n    reference=train_ds,                                   # ensures consistent encoding\n)\n\nparams = {\n    'objective': 'binary',\n    'metric': 'auc',\n    'learning_rate': 0.05,\n    'num_leaves': 31,\n    'cat_smooth': 10,\n    'cat_l2': 10,\n    'max_cat_threshold': 32,\n    'monotone_constraints': mono,\n    'verbose': -1,\n}\n\nbooster = lgb.train(\n    params,\n    train_ds,\n    num_boost_round=500,\n    valid_sets=&#091;valid_ds&#093;,\n    callbacks=&#091;lgb.early_stopping(stopping_rounds=30)&#093;,\n)\n\nnative_proba = booster.predict(X_va)\nnative_label = (native_proba >= 0.5).astype(int)\n\nprint(f\"Best iteration   : {booster.best_iteration}\")\nprint(f\"Validation AUC   : {roc_auc_score(y_va, native_proba):.4f}\")\nprint()\nprint(\"Classification report:\")\nprint(classification_report(y_va, native_label, target_names=&#091;'class 0', 'class 1'&#093;))\n\nimportance_native = booster.feature_importance(importance_type='split')\nprint(\"Feature importance (split):\")\nfor feat, score in sorted(zip(feature_cols, importance_native), key=lambda x: -x&#091;1&#093;):\n    print(f\"  {feat:&lt;12}: {score}\")\n\nprint()\nprint(\"=\" * 50)\nprint(\"Monotone constraint check (grade: low=0, mid=1, high=2)\")\nprint(\"=\" * 50)\ncheck_df = pd.DataFrame({\n    'city':    &#091;'Seoul'&#093; * 3,\n    'product': &#091;'A'&#093; * 3,\n    'user_id': &#091;'u_001'&#093; * 3,\n    'grade':   &#091;0, 1, 2&#093;,          # low \u2192 mid \u2192 high\n    'price':   &#091;200&#093; * 3,\n}).astype({'city': 'category', 'product': 'category', 'user_id': 'category'})\n\nsklearn_proba = model.predict_proba(check_df)&#091;:, 1&#093;\nnative_proba_check = 
booster.predict(check_df)\n\nprint(f\"{'grade':&lt;10} {'sklearn prob':>14} {'native prob':>12}\")\nprint(\"-\" * 38)\nfor grade, sp, np_ in zip(&#091;'low', 'mid', 'high'&#093;, sklearn_proba, native_proba_check):\n    print(f\"{grade:&lt;10} {sp:>14.4f} {np_:>12.4f}\")<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki github-light\" style=\"background-color: #fff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> pandas <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> pd<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> lightgbm <\/span><span style=\"color: #D73A49\">as<\/span><span style=\"color: #24292E\"> lgb<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">from<\/span><span style=\"color: #24292E\"> sklearn.model_selection <\/span><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> train_test_split<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">from<\/span><span style=\"color: #24292E\"> sklearn.metrics <\/span><span style=\"color: #D73A49\">import<\/span><span style=\"color: #24292E\"> roc_auc_score, classification_report<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df 
<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> pd.DataFrame({<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Daegu&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Incheon&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">             <\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Daegu&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Busan&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;Incheon&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;C&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: 
#032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;C&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;B&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;u_001&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_002&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_003&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_001&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_004&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_002&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">                <\/span><span style=\"color: #032F62\">&#39;u_005&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_003&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_006&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;u_004&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: 
#24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">              <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;price&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #005CC5\">100<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">200<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">150<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">300<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">250<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">180<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">220<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">170<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">310<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">240<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span 
style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># grade: ordinal \u2192 numeric (required for monotone constraint)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">grade_map <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> {<\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">2<\/span><span style=\"color: #24292E\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">df&#091;<\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">&#093; <\/span><span style=\"color: 
#D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;<\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">&#093;.map(grade_map).astype(<\/span><span style=\"color: #005CC5\">int<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># nominal features: cast to category dtype so LightGBM encodes them internally<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> col <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> &#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">&#093;:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    df&#091;col&#093; <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;col&#093;.astype(<\/span><span style=\"color: #032F62\">&#39;category&#39;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">X <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df.drop(<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">axis<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">y <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> df&#091;<\/span><span style=\"color: #032F62\">&#39;target&#39;<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><span style=\"color: #24292E\">X_tr, X_va, y_tr, y_va <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> train_test_split(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    X, y, <\/span><span style=\"color: #E36209\">test_size<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">0.3<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">random_state<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">42<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #E36209\">stratify<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">y<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">feature_cols <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> X.columns.tolist()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">mono <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> &#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">if<\/span><span style=\"color: #24292E\"> c <\/span><span style=\"color: #D73A49\">==<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">else<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> c <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> feature_cols&#093;  <\/span><span style=\"color: #6A737D\"># +1 = increasing for grade only<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span 
style=\"color: #6A737D\"># --- sklearn API ---<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;=&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;sklearn API&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;=&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> lgb.LGBMClassifier(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">n_estimators<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">500<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">learning_rate<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">0.05<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">max_depth<\/span><span style=\"color: #D73A49\">=-<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">num_leaves<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">31<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">min_data_in_leaf<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">,      <\/span><span style=\"color: #6A737D\"># relaxed for small dataset<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">cat_smooth<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">10<\/span><span style=\"color: #24292E\">,           <\/span><span style=\"color: #6A737D\"># smoothing for rare categories<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">cat_l2<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">10<\/span><span style=\"color: #24292E\">,               <\/span><span style=\"color: #6A737D\"># L2 regularization on categorical splits<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">max_cat_threshold<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">32<\/span><span style=\"color: #24292E\">,    <\/span><span style=\"color: #6A737D\"># max categories considered per split<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">monotone_constraints<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">mono,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">objective<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #032F62\">&#39;binary&#39;<\/span><span style=\"color: 
#24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">metric<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #032F62\">&#39;auc&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">verbose<\/span><span style=\"color: #D73A49\">=-<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">model.fit(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    X_tr, y_tr,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">eval_set<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;(X_va, y_va)&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">categorical_feature<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">&#093;,  <\/span><span style=\"color: #6A737D\"># grade excluded: numeric with monotone constraint<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">callbacks<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;lgb.early_stopping(<\/span><span style=\"color: #E36209\">stopping_rounds<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">30<\/span><span style=\"color: #24292E\">)&#093;,<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">preds_proba <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> model.predict_proba(X_va)&#091;:, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">preds_label <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> model.predict(X_va)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;Best iteration   : <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">model.best_iteration_<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;Validation AUC   : <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">roc_auc_score(y_va, preds_proba)<\/span><span style=\"color: #D73A49\">:.4f<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Classification report:&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(classification_report(y_va, 
preds_label, <\/span><span style=\"color: #E36209\">target_names<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;<\/span><span style=\"color: #032F62\">&#39;class 0&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;class 1&#39;<\/span><span style=\"color: #24292E\">&#093;))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">importance <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">dict<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #005CC5\">zip<\/span><span style=\"color: #24292E\">(feature_cols, model.feature_importances_))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Feature importance (split):&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> feat, score <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">sorted<\/span><span style=\"color: #24292E\">(importance.items(), <\/span><span style=\"color: #E36209\">key<\/span><span style=\"color: #D73A49\">=lambda<\/span><span style=\"color: #24292E\"> x: <\/span><span style=\"color: #D73A49\">-<\/span><span style=\"color: #24292E\">x&#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">&#093;):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;  <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">feat<\/span><span style=\"color: #D73A49\">:&lt;12<\/span><span style=\"color: 
#005CC5\">}<\/span><span style=\"color: #032F62\">: <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">score<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #6A737D\"># --- Native (Dataset) API ---<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;=&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Native API&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;=&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">train_ds <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> lgb.Dataset(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    X_tr, <\/span><span style=\"color: #E36209\">label<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">y_tr,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: 
#E36209\">categorical_feature<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">&#093;,  <\/span><span style=\"color: #6A737D\"># grade excluded<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">free_raw_data<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">False<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">valid_ds <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> lgb.Dataset(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    X_va, <\/span><span style=\"color: #E36209\">label<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">y_va,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">categorical_feature<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">&#093;,  <\/span><span style=\"color: #6A737D\"># grade excluded<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">reference<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">train_ds,                                   <\/span><span 
style=\"color: #6A737D\"># ensures consistent encoding<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">params <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> {<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;objective&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #032F62\">&#39;binary&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;metric&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #032F62\">&#39;auc&#39;<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;learning_rate&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">0.05<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;num_leaves&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">31<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;cat_smooth&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">10<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;cat_l2&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">10<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: 
#032F62\">&#39;max_cat_threshold&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #005CC5\">32<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;monotone_constraints&#39;<\/span><span style=\"color: #24292E\">: mono,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;verbose&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #D73A49\">-<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">booster <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> lgb.train(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    params,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    train_ds,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">num_boost_round<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">500<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">valid_sets<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;valid_ds&#093;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #E36209\">callbacks<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;lgb.early_stopping(<\/span><span style=\"color: #E36209\">stopping_rounds<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #005CC5\">30<\/span><span style=\"color: #24292E\">)&#093;,<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">native_proba <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> booster.predict(X_va)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">native_label <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> (native_proba <\/span><span style=\"color: #D73A49\">&gt;=<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">0.5<\/span><span style=\"color: #24292E\">).astype(<\/span><span style=\"color: #005CC5\">int<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;Best iteration   : <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">booster.best_iteration<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;Validation AUC   : <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">roc_auc_score(y_va, native_proba)<\/span><span style=\"color: #D73A49\">:.4f<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Classification report:&quot;<\/span><span style=\"color: 
#24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(classification_report(y_va, native_label, <\/span><span style=\"color: #E36209\">target_names<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\">&#091;<\/span><span style=\"color: #032F62\">&#39;class 0&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;class 1&#39;<\/span><span style=\"color: #24292E\">&#093;))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">importance_native <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> booster.feature_importance(<\/span><span style=\"color: #E36209\">importance_type<\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #032F62\">&#39;split&#39;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Feature importance (split):&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> feat, score <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">sorted<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #005CC5\">zip<\/span><span style=\"color: #24292E\">(feature_cols, importance_native), <\/span><span style=\"color: #E36209\">key<\/span><span style=\"color: #D73A49\">=lambda<\/span><span style=\"color: #24292E\"> x: <\/span><span style=\"color: #D73A49\">-<\/span><span style=\"color: #24292E\">x&#091;<\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">&#093;):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #005CC5\">print<\/span><span 
style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;  <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">feat<\/span><span style=\"color: #D73A49\">:&lt;12<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">: <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">score<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;=&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;Monotone constraint check (grade: low=0, mid=1, high=2)&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;=&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">50<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">check_df <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> pd.DataFrame({<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span 
style=\"color: #24292E\">:    &#091;<\/span><span style=\"color: #032F62\">&#39;Seoul&#39;<\/span><span style=\"color: #24292E\">&#093; <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">3<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;A&#39;<\/span><span style=\"color: #24292E\">&#093; <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">3<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">: &#091;<\/span><span style=\"color: #032F62\">&#39;u_001&#39;<\/span><span style=\"color: #24292E\">&#093; <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">3<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #24292E\">:   &#091;<\/span><span style=\"color: #005CC5\">0<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #005CC5\">2<\/span><span style=\"color: #24292E\">&#093;,          <\/span><span style=\"color: #6A737D\"># low \u2192 mid \u2192 high<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #032F62\">&#39;price&#39;<\/span><span style=\"color: #24292E\">:   &#091;<\/span><span style=\"color: #005CC5\">200<\/span><span style=\"color: #24292E\">&#093; <\/span><span style=\"color: #D73A49\">*<\/span><span 
style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">3<\/span><span style=\"color: #24292E\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">}).astype({<\/span><span style=\"color: #032F62\">&#39;city&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #032F62\">&#39;category&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;product&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #032F62\">&#39;category&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;user_id&#39;<\/span><span style=\"color: #24292E\">: <\/span><span style=\"color: #032F62\">&#39;category&#39;<\/span><span style=\"color: #24292E\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">sklearn_proba <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> model.predict_proba(check_df)&#091;:, <\/span><span style=\"color: #005CC5\">1<\/span><span style=\"color: #24292E\">&#093;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">native_proba_check <\/span><span style=\"color: #D73A49\">=<\/span><span style=\"color: #24292E\"> booster.predict(check_df)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #032F62\">&#39;grade&#39;<\/span><span style=\"color: #D73A49\">:&lt;10<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\"> <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #032F62\">&#39;sklearn prob&#39;<\/span><span style=\"color: #D73A49\">:&gt;14<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\"> <\/span><span style=\"color: 
#005CC5\">{<\/span><span style=\"color: #032F62\">&#39;native prob&#39;<\/span><span style=\"color: #D73A49\">:&gt;12<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #032F62\">&quot;-&quot;<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #D73A49\">*<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">38<\/span><span style=\"color: #24292E\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D73A49\">for<\/span><span style=\"color: #24292E\"> grade, sp, np_ <\/span><span style=\"color: #D73A49\">in<\/span><span style=\"color: #24292E\"> <\/span><span style=\"color: #005CC5\">zip<\/span><span style=\"color: #24292E\">(&#091;<\/span><span style=\"color: #032F62\">&#39;low&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;mid&#39;<\/span><span style=\"color: #24292E\">, <\/span><span style=\"color: #032F62\">&#39;high&#39;<\/span><span style=\"color: #24292E\">&#093;, sklearn_proba, native_proba_check):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">    <\/span><span style=\"color: #005CC5\">print<\/span><span style=\"color: #24292E\">(<\/span><span style=\"color: #D73A49\">f<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">grade<\/span><span style=\"color: #D73A49\">:&lt;10<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\"> <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">sp<\/span><span style=\"color: #D73A49\">:&gt;14.4f<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\"> <\/span><span style=\"color: #005CC5\">{<\/span><span style=\"color: #24292E\">np_<\/span><span 
style=\"color: #D73A49\">:&gt;12.4f<\/span><span style=\"color: #005CC5\">}<\/span><span style=\"color: #032F62\">&quot;<\/span><span style=\"color: #24292E\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Results:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro cbp-has-line-numbers\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:1rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;--cbp-line-number-color:#24292e;--cbp-line-number-width:calc(1 * 0.6 * 1rem);line-height:1.625rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span role=\"button\" tabindex=\"0\" style=\"color:#24292e;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>=======================================================\nMonotone constraint check (grade: low=0, mid=1, high=2)\n=======================================================\ngrade        sklearn prob  native prob\n--------------------------------------\nlow                0.5927       0.5714\nmid                0.5927       0.5714\nhigh               0.5927       0.5714<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki 
github-light\" style=\"background-color: #fff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #005CC5; font-weight: bold\">=======================================================<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">Monotone constraint check (grade: low=0, mid=1, high=2)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5; font-weight: bold\">=======================================================<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">grade        sklearn prob  native prob<\/span><\/span>\n<span class=\"line\"><span style=\"color: #005CC5; font-weight: bold\">--------------------------------------<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">low                0.5927       0.5714<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">mid                0.5927       0.5714<\/span><\/span>\n<span class=\"line\"><span style=\"color: #24292E\">high               0.5927       0.5714<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Key parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>categorical_feature<\/code>: list of column names or integer indices. When using category dtype in pandas, this list can be omitted \u2014 LightGBM auto-detects. 
Explicit specification is still recommended for clarity and reproducibility.<\/li>\n\n\n\n<li><code>cat_smooth<\/code> (default 10): smoothing term for categorical target statistics; larger values reduce overfitting on rare categories.<\/li>\n\n\n\n<li><code>cat_l2<\/code> (default 10): L2 regularization applied specifically to categorical splits.<\/li>\n\n\n\n<li><code>max_cat_threshold<\/code> (default 32): maximum number of category groups considered in a single split, which caps computation for very high-cardinality features.<\/li>\n\n\n\n<li><code>min_data_per_group<\/code> (default 100): minimum number of observations required per category group; raise this for high-cardinality features with many rare levels.<\/li>\n\n\n\n<li><code>max_cat_to_onehot<\/code> (default 4): if cardinality \u2264 this threshold, LightGBM uses one-vs-other splits, the equivalent of one-hot encoding (which works well for very low cardinality); otherwise it uses partition-based splitting.<\/li>\n\n\n\n<li><code>monotone_constraints<\/code>: list of -1 \/ 0 \/ +1 values aligned with feature order; combine it with ordinal encoding to enforce monotonic predictions.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Important caveats:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Categorical values must be non-negative integers internally. When using the Dataset API directly with NumPy arrays, encode your categories to integer codes first (e.g., <code>df[col].cat.codes<\/code>). 
The pandas <code>category<\/code> dtype route handles this automatically.<\/li>\n\n\n\n<li>LightGBM handles categorical features differently from numerical ones: a column declared categorical is never treated as ordered. To apply <code>monotone_constraints<\/code> to an ordinal feature, keep it as an ordinal-encoded integer column, which LightGBM then treats as numerical.<\/li>\n\n\n\n<li>For extremely high-cardinality features (more than 100,000 unique values), consider combining with target encoding or hashing upstream, as the internal Fisher grouping may still become expensive.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison: Entity Embedding vs. CatBoost vs. XGBoost vs. LightGBM<\/h3>\n\n\n\n<figure style=\"padding-right:var(--wp--preset--spacing--40);padding-left:var(--wp--preset--spacing--40)\" class=\"wp-block-table\"><table><thead><tr><th>Aspect<\/th><th>Entity Embedding (PyTorch)<\/th><th>CatBoost<\/th><th>XGBoost<\/th><th>LightGBM<\/th><\/tr><\/thead><tbody><tr><td>Encoding mechanism<\/td><td>Learnable dense vectors<\/td><td>Ordered Target Encoding<\/td><td>Optimal partition splitting<\/td><td>Fisher-based partition grouping<\/td><\/tr><tr><td>Leakage protection<\/td><td>Inherent<\/td><td>Built-in (ordered boosting)<\/td><td>Manual<\/td><td>Manual (smoothing helps)<\/td><\/tr><tr><td>Speed on high cardinality<\/td><td>Fast (matrix lookup)<\/td><td>Moderate<\/td><td>Fast<\/td><td><strong>Fastest<\/strong> (histogram + Fisher)<\/td><\/tr><tr><td>Memory efficiency<\/td><td>High<\/td><td>Moderate<\/td><td>High<\/td><td><strong>Highest<\/strong><\/td><\/tr><tr><td>Ordinal handling<\/td><td>Order embeddings<\/td><td><code>monotone_constraints<\/code><\/td><td><code>monotone_constraints<\/code><\/td><td><code>monotone_constraints<\/code><\/td><\/tr><tr><td>Cardinality cap for one-hot fallback<\/td><td>N\/A<\/td><td><code>one_hot_max_size<\/code><\/td><td><code>max_cat_to_onehot<\/code><\/td><td><code>max_cat_to_onehot<\/code><\/td><\/tr><tr><td>Best fit<\/td><td>Deep learning 
pipelines<\/td><td>Category-heavy tabular data<\/td><td>XGBoost-based stacks<\/td><td>Large datasets, speed-critical<\/td><\/tr><tr><td>Setup overhead<\/td><td>Define embedding layers<\/td><td>Pass <code>cat_features<\/code><\/td><td>Convert to <code>category<\/code> dtype<\/td><td>Convert to <code>category<\/code> dtype<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">LightGBM is often the fastest choice for large tabular datasets with many high-cardinality categorical features. Its Fisher-based partition algorithm tends to scale better than XGBoost&#8217;s approach once cardinality exceeds a few hundred, which makes it a strong default for industrial-scale categorical data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. What is One-Hot Encoding? One-hot encoding is the most fundamental technique for converting categorical variables into numerical vectors that machine learning models can process. Given N unique categories, each category is represented as an N-dimensional vector with a 1 at the position corresponding to that category and 0 elsewhere. 
Example: [&#8216;Seoul&#8217;, &#8216;Busan&#8217;, &#8216;Daegu&#8217;] \u2192&#8230;<\/p>\n","protected":false},"author":4,"featured_media":6293,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":0,"footnotes":""},"categories":[373,56],"tags":[],"class_list":["post-6292","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-feature-engineering-slug","category-data-science-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":0,"sum_votes":0},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/04\/one-hot-encoding-ACDB.png","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6292","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=6292"}],"version-history":[{"co
unt":7,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6292\/revisions"}],"predecessor-version":[{"id":6300,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6292\/revisions\/6300"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/6293"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=6292"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=6292"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=6292"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}