Data Science and AI Integration

This work applies experimental, AI‑assisted, data‑driven methodologies to optimize semiconductor manufacturing across process, device, and yield development. The methodologies are integrated into engineering platforms and draw on semiconductor, statistical, machine‑learning, and deep‑learning technologies. The key components of my work on AI‑driven engineering platforms are:

  • AI-assisted software: AI-agent
  • AI-assisted data analysis: yield analysis enabling yield-aware design and yieldable process/device
    • Machine Learning: PCA, SVM, Bayesian Optimization
    • Deep Learning: time-series data
  • Statistical data analysis: Gaussian, Poisson, Order statistics, Extreme Value Distribution
  • (Semiconductor) Technology-based analysis: Device physics, Small circuit simulation, Error propagation, Monte Carlo Simulation, DOE/RSM, Split-CV, Dielectric Conduction, Variability, BKM management, Soft/hard yield
  • Full-stack web platform using WordPress, Flask, or Next.js

Integrating semiconductor technology with data science is essential for building technology‑aware software that enables:

  1. Advancing semiconductor technology
  2. Improving engineers’ productivity
  3. Creating a more fulfilling work environment

A technology‑aware software engineer can deliver this integration effectively, since domain‑aware development fits well with Agile and DevOps practices.

A technology‑aware software tool provides several key benefits:

  1. Helps engineers quickly learn the legacy knowledge from previous technologies
  2. Enables engineers to absorb leading‑edge technology more effectively
  3. Speeds up computational workflows
  4. Ensures work is performed in a standardized manner
  5. Standardizes data by serving as a de facto specification
  6. Evolves through continuous improvement as the technology advances, which carries both pros and cons

Applied Statistics

AI-assisted Semiconductor Development ….


Related Posts below (or view All Articles)

Categories = “Data Science, AI-powered, Applied Statistics”

Optuna Metric Projection
Label Engineering | Data Science | Hyperparameter | Tree Based Model

By Wolf
Created: 2026.05.27 | Modified: 2026.05.27
 A concise report on projecting Optuna’s best-so-far trajectory with four saturation curves. The method estimates the expected best metric after $K$ additional trials (forward) or the trials needed to reach…
Read More
Wafer Level Zernike Polynomials
Data Science | Feature Engineering | Label Engineering

By Wolf
Created: 2026.05.12 | Modified: 2026.05.16
wlzpoly is a Python package that decomposes N-point wafer thickness measurements into M Zernike polynomial coefficients using LSQ or Ridge regression with LOOCV-tuned regularization. It ships with a reproducible three-stage…
Read More
A Taxonomy of Manufacturing Big Data: Integrating Machine and Human Data
Data Science | Pipeline

By Wolf
Created: 2026.05.09 | Modified: 2026.05.10
1. Introduction: The Missing Link in Smart Manufacturing  Investment in smart manufacturing and big data analytics has expanded rapidly, yet the focus has remained almost exclusively on Machine Data—the data…
Read More
Python ML Pipeline Reproducibility — Field Notes
Data Science | Pipeline

By Wolf
Created: 2026.05.08 | Modified: 2026.05.08
Introduction  This document classifies reproducibility problems in Python Machine Learning (ML) pipelines into three chapters, plus a fourth chapter on diagnostic techniques:  This classification aligns well with the Six Sigma…
Read More
An Introductory Survey on Polynomial Machine Learning: Taxonomic Axes and Hierarchical Levels
Semiconductor | AI-powered | Data Science | Label Engineering

By Wolf
Created: 2026.05.06 | Modified: 2026.05.06
 This report surveys Polynomial Machine Learning (PML) at an introductory level. PML refers to the family of techniques that exploit higher-order and interaction terms of input variables to learn nonlinear…
Read More
Modeling Thickness Variation in Semiconductor Thin-Film Processes — A Spatial Decomposition Approach to Machine Learning (ML)
Data Science | AI-powered | Label Engineering | Semiconductor | Tree Based Model

By Wolf
Created: 2026.05.04 | Modified: 2026.05.06
 Thickness uniformity in thin-film deposition determines downstream yield and device performance. Variation arises along two distinct axes — within a single wafer (Within-Wafer, WiW) and across wafers over time (Wafer-to-Wafer,…
Read More
Are Missing-Path Samples in Tree-Based Models OOD?
Data Science

By Wolf
Created: 2026.05.03 | Modified: 2026.05.04
Bottom Line: Strictly speaking, no, but in practice, treat them as Out-of-Distribution (OOD). Missing-path samples in tree-based boosting models such as LightGBM, CatBoost, and XGBoost do not match the…
Read More
A Taxonomy of ML Model Failures in the Training-Testing Gap
Data Science

By Wolf
Created: 2026.05.03 | Modified: 2026.05.03
Machine learning (ML) models are designed under the assumption that the training distribution P_train equals the deployment distribution P_test. In reality, this assumption breaks frequently, causing sharp accuracy drops in…
Read More
Why Raw Vectorization Is the Right Choice for Ultra-Short Time Series (T ≤ 10)
Data Science | Feature Engineering | Time Series

By Wolf
Created: 2026.05.02 | Modified: 2026.05.02
This report analyzes why standard vectorization methods — statistical summary (mean/var/AUC), automatic feature extraction (tsfresh, catch22), convolutional representations (MiniRocket), and self-supervised embeddings (TS2Vec) — fail when the time series length…
Read More
Missing Values and Unknown Categories in Gradient Boosting Libraries
Data Science

By Wolf
Created: 2026.04.29 | Modified: 2026.04.29
1. Introduction This article summarizes how three popular gradient boosting libraries — LightGBM (Light Gradient Boosting Machine), XGBoost (Extreme Gradient Boosting), and CatBoost (Categorical Boosting) — handle missing values and…
Read More
Noise-Induced Instability in Tree-based Feature Selection: Root Causes and Robust Countermeasures
Data Science

By Wolf
Created: 2026.04.29 | Modified: 2026.04.29
When performing feature selection with tree-based models such as LightGBM (LGBM) or CatBoost, adding noise features to the existing set often causes truly important primary features to drop out of…
Read More
Centered R² vs Uncentered R²
Data Science | Evaluation Metric

By Wolf
Created: 2026.04.25 | Modified: 2026.04.26
1. Introduction: R² and Its Relation to RSQ The coefficient of determination, denoted as R² (R-squared), is one of the most widely used validation metrics in statistics…
Read More
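
As a quick illustration of the last entry above, the following is a minimal sketch of the two R² variants using their standard textbook definitions: centered R² compares the residual sum of squares with the variation around the mean, while uncentered R² compares it with the raw sum of squares around zero. The function names and the sample data are hypothetical and are not taken from the linked article.

```python
import numpy as np


def centered_r2(y_true, y_pred):
    """Centered R^2: 1 - SS_res / sum((y - mean(y))^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def uncentered_r2(y_true, y_pred):
    """Uncentered R^2: 1 - SS_res / sum(y^2), i.e. the baseline is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot0 = np.sum(y_true ** 2)
    return 1.0 - ss_res / ss_tot0


# Hypothetical data: a target far from zero and a mean-only predictor.
y_true = np.array([10.0, 11.0, 12.0, 13.0])
y_pred = np.full_like(y_true, y_true.mean())

r2_c = centered_r2(y_true, y_pred)    # no better than predicting the mean
r2_u = uncentered_r2(y_true, y_pred)  # looks high because y is far from zero
print(f"centered R^2   = {r2_c:.4f}")
print(f"uncentered R^2 = {r2_u:.4f}")
```

In this sketch a predictor that only returns the mean scores zero on the centered metric yet close to one on the uncentered metric, which is why the two should not be compared directly when the target has a large offset.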