ML Methodology: Taxonomy for Within-Wafer Variation Prediction

This report presents a structured taxonomy of machine learning methodologies specialized for Within-Wafer (WIW) variation prediction in semiconductor manufacturing. General-purpose ML approaches often fail to exploit the unique spatial structure of wafer data: circular geometry, process-induced radial symmetry, and strong inter-site correlations. The seven categories below organize WIW-specific methods into methodological primitives that can be combined in practical pipelines.

Scope and Motivation

Within-wafer variation refers to systematic and random deviations of a measurement parameter (film thickness, critical dimension, overlay, material removal rate, etc.) across multiple sites on a single wafer. Typical setups measure $N_{sites}$ locations per wafer (commonly 9, 13, 17, 25, or 49 sites), with $N_{wafers}$ ranging from hundreds to tens of thousands depending on the fab context.

The central modeling challenge is to learn each term of the following decomposition:

$$T(r, \theta, \mathbf{s}) = T_{mean}(\mathbf{s}) + T_{WIW}(r, \theta, \mathbf{s}) + \epsilon$$

where $T$ is the measured parameter at polar coordinates $(r, \theta)$, $\mathbf{s}$ denotes sensor/process features, $T_{mean}$ captures wafer-level behavior, $T_{WIW}$ captures spatial variation across the wafer, and $\epsilon$ is irreducible noise. A good WIW-ML framework must model $T_{WIW}$ while respecting wafer geometry and the small-data regime.

A. Spatial Basis Decomposition

Core idea: Decompose the wafer spatial pattern into a linear combination of basis functions and learn the coefficients rather than raw site values. Replacing $N_{sites}$ raw targets with $K$ basis coefficients (typically $K \ll N_{sites}$) reduces target dimensionality and makes the targets physically interpretable (tilt, bowl, astigmatism). Zernike polynomials provide an orthogonal basis on the unit disk, which matches the circular wafer geometry; PCA offers a data-driven basis and the DCT a fixed frequency-domain one. On a finite set of measurement sites the orthogonality holds only approximately, so coefficients are usually obtained by least-squares fitting. A small number of coefficients often explains over 90% of the systematic wafer variation, which makes this approach effective in small-data regimes. Scanner Advanced Process Control (APC) systems already operate on this principle in high-volume manufacturing (Noh 2018).

Mathematical formulation:

$$T(r, \theta) = \sum_{k=1}^{K} c_{k} \cdot \psi_{k}(r, \theta)$$

where $\psi_{k}$ are basis functions (Zernike, polynomial, DCT, PCA) and $c_{k}$ are the coefficients that become the new ML targets:

$$\hat{\mathbf{c}} = f_{ML}(\mathbf{s}), \quad \hat{T}(r, \theta) = \sum_{k=1}^{K} \hat{c}_{k} \cdot \psi_{k}(r, \theta)$$

The Zernike basis satisfies orthogonality on the unit disk:

$$\int_{0}^{1} \int_{0}^{2\pi} Z_{n,m}(r, \theta) \cdot Z_{n',m'}(r, \theta) \cdot r \, d\theta \, dr = \frac{\varepsilon_{m} \pi}{2n+2} \delta_{n,n'} \delta_{m,m'}$$

where $\varepsilon_{m} = 2$ for $m = 0$ and $\varepsilon_{m} = 1$ otherwise.
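The fit-then-reconstruct step can be sketched with ordinary least squares. This is a minimal illustration, not a production recipe: the 13-site layout, the coefficient values, and the noise level are all made up, and only the first six unnormalized Zernike terms are used.

```python
import numpy as np

def zernike_basis(r, theta):
    """First six unnormalized Zernike terms evaluated at polar sites (r <= 1)."""
    return np.column_stack([
        np.ones_like(r),                 # piston
        r * np.cos(theta),               # tilt x
        r * np.sin(theta),               # tilt y
        2.0 * r**2 - 1.0,                # defocus ("bowl")
        r**2 * np.cos(2.0 * theta),      # astigmatism 0/90
        r**2 * np.sin(2.0 * theta),      # astigmatism 45
    ])

# Hypothetical 13-site layout: center, 4 sites at r = 0.5, 8 sites at r = 0.9.
r = np.concatenate([[0.0], np.full(4, 0.5), np.full(8, 0.9)])
theta = np.concatenate([[0.0],
                        np.arange(4) * np.pi / 2,
                        np.arange(8) * np.pi / 4])
Psi = zernike_basis(r, theta)            # shape (13, 6)

# Synthetic wafer: 100 nm mean, slight tilt, bowl-shaped profile, small noise.
rng = np.random.default_rng(0)
c_true = np.array([100.0, 1.5, 0.0, -3.0, 0.5, 0.0])
t_sites = Psi @ c_true + rng.normal(0.0, 0.05, size=r.size)

# Least-squares fit: 13 raw targets become 6 coefficients.
c_hat, *_ = np.linalg.lstsq(Psi, t_sites, rcond=None)
t_rec = Psi @ c_hat                      # reconstruction at the measured sites
rmse = float(np.sqrt(np.mean((t_rec - t_sites) ** 2)))
```

In a pipeline, `c_hat` would be computed per wafer and used as the regression target for $f_{ML}(\mathbf{s})$; the same `Psi` then maps predicted coefficients back to site values.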

Conceptual diagram:

   13 sites                         Basis coefficients
   +-----------+                    +------------------+
   | o o o o o |                    | c1 (piston)      |
   |o * * * * o|  --- fitting -->   | c2 (tilt_x)      |
   |o * @ * * o|                    | c3 (tilt_y)      |
   |o * * * * o|                    | c4 (bowl)        |
   | o o o o o |                    | c5 (astigmatism) |
   +-----------+                    +------------------+
   raw targets                      learn THESE
       |                                    |
       |                                    v
       |                        Sensors -> ML -> c_hat
       |                                    |
       +------ reconstruct <----------------+

Key concepts and references:

Zernike polynomial decomposition — Noh 2018 (overlay APC in HVM)
Radial-azimuthal polynomial decomposition — Dwivedi 2023
2D bivariate polynomial (2nd order) — Dwivedi 2023
DCT basis with sparsity prior — Zhang 2011
PCA / Karhunen-Loève on wafer maps — Kazemi 2020

B. Spatial Correlation Modeling (Gaussian Process Family)

Core idea: Encode the natural assumption that nearby wafer locations have correlated measurements into a kernel function. Gaussian Processes (GPs) provide not only point predictions but also predictive uncertainty, which is essential for fallback decisions in Virtual Metrology. Multi-task GPs (MTGPs) extend this by jointly predicting multiple sites while learning inter-site correlations through a coregionalization matrix. Hierarchical GPs further handle discontinuities between site clusters caused by probe-card parasitics or different measurement systems (Shintani 2021). GP-based methods are among the best-validated approaches in the hundreds-to-thousands wafer regime.

Mathematical formulation:

For a standard spatial GP:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}),\ k(\mathbf{x}, \mathbf{x}'))$$

with the Matérn 5/2 kernel commonly used for wafer spatial correlation:

$$k_{5/2}(\mathbf{x}, \mathbf{x}') = \sigma^{2} \left(1 + \frac{\sqrt{5}d}{\ell} + \frac{5d^{2}}{3\ell^{2}}\right) \exp\left(-\frac{\sqrt{5}d}{\ell}\right)$$

where $d = \|\mathbf{x} - \mathbf{x}'\|$, $\sigma^{2}$ is the signal variance, and $\ell$ is the length scale.

Multi-task GP with Intrinsic Coregionalization Model (ICM):

$$k_{MTGP}((\mathbf{x}, i),\ (\mathbf{x}', j)) = k_{process}(\mathbf{x}, \mathbf{x}') \cdot B[i, j]$$

where $B \in \mathbb{R}^{N_{sites} \times N_{sites}}$ is a learned positive semi-definite matrix capturing site-to-site correlations, and $i, j$ index the sites.
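When every site is measured on every wafer, the ICM covariance is a Kronecker product of $B$ and the process kernel matrix. The sketch below checks that identity numerically; the squared-exponential kernel is only a stand-in for $k_{process}$, and the low-rank-plus-diagonal form of $B$ is one common way to keep it positive semi-definite, assumed here for illustration.

```python
import numpy as np

def rbf(X1, X2, length=1.0):
    """Squared-exponential kernel on process features (stand-in for k_process)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
n_wafers, n_feat, n_sites = 5, 3, 4
X = rng.normal(size=(n_wafers, n_feat))   # synthetic process features per wafer

# B = L L^T + diag(v) is PSD by construction; L and v would normally be learned.
L = rng.normal(size=(n_sites, 2))
B = L @ L.T + np.diag(np.full(n_sites, 0.1))

K_process = rbf(X, X)
# Joint covariance over all (site, wafer) pairs, ordered site-major.
K_full = np.kron(B, K_process)            # shape (n_sites * n_wafers, n_sites * n_wafers)

# Entry for (wafer a, site i) vs (wafer b, site j) equals k_process(x_a, x_b) * B[i, j].
i, j, a, b = 1, 3, 0, 2
entry = K_full[i * n_wafers + a, j * n_wafers + b]
```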

The predictive distribution at a new point is:

$$p(f_{*} | \mathbf{X}, \mathbf{y}, \mathbf{x}_{*}) = \mathcal{N}(\mu_{*},\ \sigma_{*}^{2})$$
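For a zero-mean GP with Gaussian noise, $\mu_{*}$ and $\sigma_{*}^{2}$ have the standard closed form $\mu_{*} = \mathbf{k}_{*}^{\top}(K + \sigma_{n}^{2} I)^{-1}\mathbf{y}$ and $\sigma_{*}^{2} = k_{**} - \mathbf{k}_{*}^{\top}(K + \sigma_{n}^{2} I)^{-1}\mathbf{k}_{*}$. A minimal NumPy sketch with the Matérn 5/2 kernel follows; the 5-site layout, the measurements, and the hyperparameters are made up, and in practice the hyperparameters would be fit by maximizing the marginal likelihood.

```python
import numpy as np

def matern52(X1, X2, sigma2=1.0, length=0.5):
    """Matern 5/2 kernel between two sets of points."""
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    a = np.sqrt(5.0) * d / length
    return sigma2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)

def gp_predict(X, y, X_star, noise=1e-4):
    """Zero-mean GP posterior mean and variance via a Cholesky factorization."""
    K = matern52(X, X) + noise * np.eye(len(X))
    K_s = matern52(X, X_star)
    K_ss_diag = np.diag(matern52(X_star, X_star))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s)
    return K_s.T @ alpha, K_ss_diag - (v**2).sum(axis=0)

# Hypothetical 5-site cross pattern in normalized wafer coordinates.
X = np.array([[0.0, 0.0], [0.8, 0.0], [-0.8, 0.0], [0.0, 0.8], [0.0, -0.8]])
y = np.array([0.0, 1.0, 0.9, 1.1, 1.0])     # made-up thickness deviations
y_mean = y.mean()

# Query a measured site, a mid-radius point, and a far corner point.
X_star = np.array([[0.0, 0.0], [0.4, 0.4], [0.95, 0.95]])
mu, var = gp_predict(X, y - y_mean, X_star)
mu = mu + y_mean
```

The predictive variance grows with distance from the measured sites, which is the signal a Virtual Metrology system can use to decide when to fall back to a real measurement.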

Conceptual diagram:

         kernel k(x, x') measures similarity
              +---------------------+
              |                     |
         *----+--*                  |   high correlation
         site1  site2               |   (nearby)
              |                     |
              |         *-----------+--*   low correlation
              |         site3       site13   (far apart)
              +---------------------+

   13 x 13 task correlation matrix B (learned):
   +-------------------------------+
   | 1.0  0.9  0.8  ...  0.2       |  <- site correlations
   | 0.9  1.0  0.9  ...  0.3       |     captured automatically
   |  :                      :     |
   +-------------------------------+

   Prediction: mu(x*) +/- sigma(x*)   <- uncertainty included

Key concepts and references:

Kriging for spatial interpolation — Cressie 1993
Multi-task GP with coregionalization — Bonilla 2008
Spatial GP for wafer e-test with variance decomposition — Reda 2010
Multi-task GP for CMP with uncertainty — Cai 2020
Hierarchical GP for multi-site RF testing — Shintani 2021
Multilevel kernel methods for VM — Schirru 2011

C. Hierarchical Variation Decomposition

Core idea: Decompose total variation into physically meaningful components and apply ML only to the predictable ones. A typical decomposition separates layout-dependent effects, Intra-Wafer Systematic (IWS) variation, Wafer-to-Wafer (WTW) mean shifts, and random residuals, each corresponding to distinct physical origins (Dwivedi 2023). This prevents ML models from overfitting random noise and enables per-component model selection. Combined with ANOVA-style variance decomposition, it also provides diagnostic insight into which variation source dominates. This framework is widely used in silicon photonics and in variation analysis for mature fabs.

Mathematical formulation:

$$T_{total}(r, \theta, t) = T_{layout}(r, \theta) + T_{IWS}(r, \theta) + T_{WTW}(t) + \epsilon$$

Variance decomposition under orthogonality assumptions:

$$\sigma_{total}^{2} = \sigma_{layout}^{2} + \sigma_{IWS}^{2} + \sigma_{WTW}^{2} + \sigma_{random}^{2}$$

Each component is modeled separately:

$$\hat{T}_{IWS}(r, \theta) = \sum_{k=1}^{K_{IWS}} c_{k}^{IWS}(\mathbf{s}) \cdot \psi_{k}(r, \theta)$$

$$\hat{T}_{WTW}(t) = f_{WTW}(\mathbf{s}_{tool}(t))$$
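The variance split can be illustrated on a balanced wafer-by-site table, where the additive components are exactly orthogonal. The sketch below uses synthetic data and omits the layout term, since it would need pattern-density information that is not modeled here; the site effect stands in for IWS and the wafer effect for WTW.

```python
import numpy as np

rng = np.random.default_rng(1)
n_wafers, n_sites = 200, 13

# Synthetic data: fixed spatial signature + per-wafer shift + random noise.
iws_true = rng.normal(0.0, 2.0, size=n_sites)          # intra-wafer systematic
wtw_true = rng.normal(0.0, 1.0, size=(n_wafers, 1))    # wafer-to-wafer shift
T = 100.0 + iws_true + wtw_true + rng.normal(0.0, 0.5, size=(n_wafers, n_sites))

grand = T.mean()
iws_hat = T.mean(axis=0) - grand                  # site effect, shape (n_sites,)
wtw_hat = T.mean(axis=1, keepdims=True) - grand   # wafer effect, shape (n_wafers, 1)
resid = T - grand - iws_hat - wtw_hat             # what neither effect explains

var_total = T.var()
var_iws, var_wtw, var_rand = iws_hat.var(), wtw_hat.var(), resid.var()
shares = {name: v / var_total
          for name, v in [("IWS", var_iws), ("WTW", var_wtw), ("random", var_rand)]}
```

The `shares` dictionary gives the fraction of total variance per source, which indicates where a predictive model can help at all: a large random share bounds what any model can achieve.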

Conceptual diagram:

   Total Variation (observed)
            |
            v
   +-------------------------------------+
   |  Layout-dependent (pattern density) |  <- physical model
   +-------------------------------------+
   |  IWS (Intra-Wafer Systematic)       |  <- polynomial/Zernike
   |    radial + bivariate polynomial    |     ML predicts this
   +-------------------------------------+
   |  WTW (Wafer-to-Wafer mean shift)    |  <- tool state features
   |                                     |     ML predicts this
   +-------------------------------------+
   |  Random residual                    |  <- NOT predicted,
   |                                     |     monitored as Cpk
   +-------------------------------------+

   sigma^2_total = sigma^2_layout + sigma^2_IWS
                 + sigma^2_WTW + sigma^2_random

Key concepts and references:

Hierarchical model for photonic variation — Dwivedi 2023
Spatial estimation and variance decomposition — Reda 2010
Mixed-effect profile monitoring — Liu 2022
Hierarchical multi-task learning for wafer quality — He 2018
Multiplicative Product Factor for hierarchy — Rothe 2025

D. Compressed Sensing and Sparse Recovery

Core idea: Exploit the empirical observation that wafer spatial patterns are sparse in a transform domain (DCT, Fourier, wavelet). With $L_{1}$ regularization, the full wafer can be recovered from far fewer samples than the Nyquist rate would require. Unlike the Gaussian Process family, which uses spatial correlation as its prior, compressed sensing uses frequency-domain sparsity (Zhang 2011). The equality-constrained $L_{1}$ problem reduces to a linear program and can be solved efficiently. In industrial data, delay measurements on 50 dies reconstructed the delay of 269 dies on the same wafer, with up to 10x error reduction compared to 2D interpolation and Kriging. This is a primary framework for test cost reduction in production testing.

Mathematical formulation:

Let $\mathbf{g} \in \mathbb{R}^{PQ}$ be the vectorized full wafer map on a $P \times Q$ die grid and $\boldsymbol{\eta}$ its DCT coefficients:

$$\mathbf{g} = \mathbf{\Psi} \boldsymbol{\eta}, \quad \mathbf{\Psi} = \text{inverse DCT matrix}$$

Given sparse measurements $\mathbf{y} = \mathbf{A} \mathbf{g}$ where $\mathbf{A}$ is a selection matrix, recover $\boldsymbol{\eta}$ by:

$$\hat{\boldsymbol{\eta}} = \arg\min_{\boldsymbol{\eta}} \|\boldsymbol{\eta}\|_{1} \quad \text{s.t.} \quad \mathbf{y} = \mathbf{A} \mathbf{\Psi} \boldsymbol{\eta}$$

In practice, to tolerate measurement noise, the constrained form is relaxed to the penalized (LASSO) form:

$$\hat{\boldsymbol{\eta}} = \arg\min_{\boldsymbol{\eta}} \left\{ \|\mathbf{y} - \mathbf{A} \mathbf{\Psi} \boldsymbol{\eta}\|_{2}^{2} + \lambda \|\boldsymbol{\eta}\|_{1} \right\}$$

and $\lambda$ is selected via cross-validation. The full wafer is reconstructed as $\hat{\mathbf{g}} = \mathbf{\Psi} \hat{\boldsymbol{\eta}}$.
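A small end-to-end sketch of the penalized form is given below. It is not the solver used in the cited work: it uses plain iterative soft-thresholding (ISTA), a synthetic 8 x 8 map with three non-zero DCT coefficients, and a fixed $\lambda$ rather than a cross-validated one.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); its transpose is the inverse DCT."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

def ista(M, y, lam, n_iter=2000):
    """Minimize ||y - M eta||_2^2 + lam * ||eta||_1 by iterative soft-thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(M, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    eta = np.zeros(M.shape[1])
    for _ in range(n_iter):
        z = eta - step * 2.0 * M.T @ (M @ eta - y)   # gradient step on the quadratic term
        eta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return eta

P, Q = 8, 8
Psi = np.kron(dct_matrix(P).T, dct_matrix(Q).T)      # inverse 2-D DCT, row-major vectorization

# Synthetic smooth wafer map: only three low-frequency coefficients are non-zero.
eta_true = np.zeros(P * Q)
eta_true[[0, 1, Q]] = [50.0, 4.0, -3.0]
g_true = Psi @ eta_true

rng = np.random.default_rng(0)
measured = rng.choice(P * Q, size=24, replace=False)  # dies actually tested
A_Psi = Psi[measured, :]                              # A selects rows of Psi
y = g_true[measured]

def objective(eta, lam):
    """Value of the penalized objective for a candidate coefficient vector."""
    return float(np.sum((y - A_Psi @ eta) ** 2) + lam * np.sum(np.abs(eta)))

lam = 0.01
eta_hat = ista(A_Psi, y, lam)
g_hat = Psi @ eta_hat                                 # full-map reconstruction
rel_err = float(np.linalg.norm(g_hat - g_true) / np.linalg.norm(g_true))
```

`rel_err` measures reconstruction error over all 64 dies, including the 40 that were never measured; how small it gets depends on the sparsity of the map and on the number and placement of measured dies.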

Conceptual diagram:

   Wafer measurement              DCT coefficients
   +-------------+               +----------------+
   | . . * . . * |               | ##..........   |  <- most near zero
   | * . . . * . |   --- DCT --> | #...........   |     (SPARSE!)
   | . * . * . . |               | ............   |
   | . . * . . * |               | ............   |
   +-------------+               +----------------+
   few measured dies              prior: sparsity
         |                               |
         |                               v
         |           solve: min ||eta||_1  s.t.  y = A*Psi*eta
         |                  (linear programming)
         |                               |
         +--- reconstruct full wafer <---+
                   via inverse DCT

Key concepts and references:

Virtual Probe (DCT + L1) — Zhang 2011
Bayesian Virtual Probe — Zhang 2010
Joint Virtual Probe for multi-item correlation — Zhang 2014
Multi-Wafer Virtual Probe with WTW correlation — Zhang 2012
3D compressed sensing with KLT — Ahmadi 2015

E. Physics-Informed and Hybrid Approaches

Core idea: Embed PDEs, conservation laws, or reaction kinetics directly into neural network loss functions or architectures. Purely data-driven models tend to overfit when wafer counts are limited; physics acts as a regularizer and enables full-field prediction from sparse sensors. Physics-Informed Neural Networks (PINNs) penalize equation residuals, while Fourier Neural Operators (FNOs) learn the solution operator of a PDE family (Go 2025). Training is slow, but inference is typically much faster than a traditional numerical solver. This approach is particularly valuable for new processes with limited data and for processes where physical models are well established (heat transfer, diffusion, plasma kinetics).

Mathematical formulation:

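For a PDE of the form $\mathcal{N}[T](\mathbf{x}, t) = 0$, the PINN loss is:

$$\mathcal{L} = \mathcal{L}_{data} + \lambda \cdot \mathcal{L}_{physics}$$

$$\mathcal{L}_{data} = \frac{1}{N_{d}} \sum_{i=1}^{N_{d}} \left( T_{\theta}(\mathbf{x}_{i}, t_{i}) - T_{i}^{obs} \right)^{2}$$

$$\mathcal{L}_{physics} = \frac{1}{N_{c}} \sum_{j=1}^{N_{c}} \left( \mathcal{N}[T_{\theta}](\mathbf{x}_{j}, t_{j}) \right)^{2}$$

where $T_{\theta}$ is the network with trainable parameters $\theta$ (not the polar angle), $N_{d}$ is the number of measured points, and $N_{c}$ is the number of collocation points at which the PDE residual is evaluated.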

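For wafer thermal dynamics (e.g., the post-exposure bake (PEB) step), the governing equation is the heat equation:

$$\frac{\partial T}{\partial t} - \alpha \nabla^{2} T = Q(\mathbf{x}, t)$$

where $\alpha$ is the thermal diffusivity and $Q$ is the heat source term.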
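How the two loss terms interact can be shown without a neural network. The sketch below evaluates $\mathcal{L}_{data}$ and $\mathcal{L}_{physics}$ for two candidate fields of a source-free 1-D heat equation, using finite differences on a grid; a real PINN would instead differentiate the network output with automatic differentiation at collocation points. The diffusivity, grid, and sensor positions are assumed values.

```python
import numpy as np

alpha = 0.1                                   # assumed thermal diffusivity (arbitrary units)
x = np.linspace(0.0, 1.0, 101)
t = np.linspace(0.0, 1.0, 201)
dx, dt = x[1] - x[0], t[1] - t[0]
X, Tm = np.meshgrid(x, t, indexing="ij")      # fields have shape (len(x), len(t))

def heat_residual(field, source=0.0):
    """Finite-difference residual of dT/dt - alpha * d2T/dx2 - Q at interior points."""
    dT_dt = (field[1:-1, 2:] - field[1:-1, :-2]) / (2.0 * dt)
    d2T_dx2 = (field[2:, 1:-1] - 2.0 * field[1:-1, 1:-1] + field[:-2, 1:-1]) / dx**2
    return dT_dt - alpha * d2T_dx2 - source

def pinn_style_loss(field, obs_idx, obs_val, lam=1.0):
    """Return (L_data + lam * L_physics, L_data, L_physics) for a candidate field."""
    l_data = np.mean((field[obs_idx] - obs_val) ** 2)
    l_phys = np.mean(heat_residual(field) ** 2)
    return float(l_data + lam * l_phys), float(l_data), float(l_phys)

# Candidate 1: exact solution of the source-free heat equation.
exact = np.exp(-alpha * np.pi**2 * Tm) * np.sin(np.pi * X)
# Candidate 2: agrees with the sensors at t = 0 but ignores the decay in time.
frozen = np.sin(np.pi * X)

# Six sparse "sensor" readings taken from the exact field at t = 0.
obs_idx = (np.array([10, 25, 40, 60, 75, 90]), np.zeros(6, dtype=int))
obs_val = exact[obs_idx]

loss_exact, data_exact, phys_exact = pinn_style_loss(exact, obs_idx, obs_val)
loss_frozen, data_frozen, phys_frozen = pinn_style_loss(frozen, obs_idx, obs_val)
```

Both candidates fit the six sensors equally well, so the data term cannot tell them apart; only the physics term separates the physically valid field from the invalid one. That is the regularizing effect described above.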

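A Fourier Neural Operator learns a mapping between function spaces, $\mathcal{G}_{\theta}: a \mapsto u$, by stacking layers that apply a learned filter in the Fourier domain:

$$v_{l+1}(\mathbf{x}) = \sigma\left( W_{l}\, v_{l}(\mathbf{x}) + \mathcal{F}^{-1}\left( R_{l} \cdot \mathcal{F}(v_{l}) \right)(\mathbf{x}) \right)$$

where $a$ is the input function (e.g., temperature field), $u = \mathcal{G}_{\theta}(a)$ is the output function (e.g., warpage field), $v_{l}$ is the hidden representation at layer $l$, $\mathcal{F}$ is the Fourier transform, $R_{l}$ holds learned weights on the retained low-frequency modes, $W_{l}$ is a pointwise linear map, and $\sigma$ is a nonlinearity.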

Conceptual diagram:

   PINN loss combines data and physics:
   +-------------------------------------------------+
   |  L = L_data  +  lambda * L_physics              |
   |      ------      ---------------                |
   |      MSE on      PDE residual                   |
   |      measured    (e.g. dT/dt - alpha*nabla^2 T) |
   |      points      enforced everywhere            |
   +-------------------------------------------------+

   Sparse sensors         Full wafer field
   +-----------+          +--------------+
   | * . . * . |          | ~~~~~~~~~~~~ |
   | . . * . . | --PINN-->| ~~~~~~~~~~~~ |  <- continuous T(x,y)
   | * . . . . |          | ~~~~~~~~~~~~ |     warpage W(x,y)
   | . * . * . |          | ~~~~~~~~~~~~ |
   +-----------+          +--------------+
   6 points                full continuous field
                           (physics-constrained)

Key concepts and references:

PINN for semiconductor deposition review — Han 2025
PBSM + Fourier Neural Operator for PEB process — Go 2025
Neural Master Equation for plasma etch kinetics — Kim 2025
ACO + BPNN for SiC epitaxy uniformity optimization — Zhang 2024
CFD + ANN hybrid for ALD SiO2 — Li 2022

F. Multi-task and Multi-output Learning (Non-GP)

Core idea: Predict multiple sites or multiple measurements jointly with a single model that learns a shared representation across tasks. Unlike GP-based methods, which express correlation through a kernel, neural-network or gradient-boosting multi-output models capture correlation implicitly through shared parameters. The canonical architecture is a shared backbone with task-specific heads; regularization-based alternatives include multi-task Lasso. Multi-task learning acts as regularization and typically improves accuracy over single-task models when per-site wafer counts are low. CatBoost MultiRMSE is the tree-based representative (Dorogush 2018).

Mathematical formulation:

Shared-encoder multi-task architecture:

$$\mathbf{z} = f_{shared}(\mathbf{s}; \boldsymbol{\theta}_{shared})$$

$$\hat{y}_{i} = h_{i}(\mathbf{z}; \boldsymbol{\theta}_{i}), \quad i = 1, \ldots, N_{sites}$$

Joint loss with uncertainty weighting (Kendall 2018):

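$$\mathcal{L} = \sum_{i=1}^{N_{sites}} \left[ \frac{1}{2\sigma_{i}^{2}} \mathcal{L}_{i}(\boldsymbol{\theta}_{shared}, \boldsymbol{\theta}_{i}) + \log \sigma_{i} \right]$$

where $\sigma_{i}$ is a learned per-task uncertainty that automatically balances the multi-task loss. The $\log \sigma_{i}$ term penalizes large uncertainties, so the model cannot reduce the loss simply by inflating every $\sigma_{i}$.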
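The balancing effect can be checked numerically. For fixed task losses, setting the derivative with respect to $\sigma_{i}$ to zero gives $\sigma_{i}^{2} = \mathcal{L}_{i}$, so noisier sites receive smaller weights. The sketch below uses made-up per-site MSE values and the $\log \sigma_{i}$ parameterization that is commonly used for numerical stability.

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_sigma):
    """Sum over tasks of L_i / (2 sigma_i^2) + log sigma_i, given log sigma_i."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_sigma = np.asarray(log_sigma, dtype=float)
    return float(np.sum(0.5 * np.exp(-2.0 * log_sigma) * task_losses + log_sigma))

def optimal_log_sigma(task_losses):
    """Closed-form minimizer for fixed task losses: sigma_i^2 = L_i."""
    return 0.5 * np.log(np.asarray(task_losses, dtype=float))

site_mse = np.array([0.04, 0.25, 1.00])      # hypothetical MSEs (center, mid, edge)
equal = uncertainty_weighted_loss(site_mse, np.zeros(3))   # all sigma_i = 1
ls_opt = optimal_log_sigma(site_mse)
balanced = uncertainty_weighted_loss(site_mse, ls_opt)
weights = 0.5 * np.exp(-2.0 * ls_opt)        # effective weight 1 / (2 sigma_i^2)
# weights -> [12.5, 2.0, 0.5]
```

In training, `log_sigma` would be a trainable parameter updated jointly with the network weights rather than solved in closed form.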

Multi-task Lasso formulation:

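$$\hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \left\{ \sum_{i=1}^{N_{sites}} \|\mathbf{y}_{i} - \mathbf{X} \mathbf{w}_{i}\|_{2}^{2} + \lambda \sum_{j=1}^{p} \|\mathbf{W}_{j,:}\|_{2} \right\}$$

where $\mathbf{w}_{i}$ is column $i$ of $\mathbf{W} \in \mathbb{R}^{p \times N_{sites}}$ (the weights for site $i$), $\mathbf{W}_{j,:}$ is row $j$ (the weights of feature $j$ across all sites), and $p$ is the number of features. The row-wise $\ell_{2}$ (group) penalty zeroes out entire rows, so a feature is selected for all sites or for none.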
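A compact way to see the shared selection is proximal gradient descent with a row-wise group soft-threshold. This is a sketch on synthetic data, not a tuned solver; the problem sizes, the two active features, and $\lambda = 20$ are arbitrary choices.

```python
import numpy as np

def multitask_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient for sum_i ||y_i - X w_i||^2 + lam * sum_j ||W[j, :]||_2."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        Z = W - step * 2.0 * X.T @ (X @ W - Y)        # gradient step on the squared error
        row_norm = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * lam / np.maximum(row_norm, 1e-12), 0.0)
        W = shrink * Z                                # row-wise group soft-threshold
    return W

rng = np.random.default_rng(0)
n_wafers, p, n_sites = 60, 8, 5
X = rng.normal(size=(n_wafers, p))                    # synthetic sensor features
W_true = np.zeros((p, n_sites))
W_true[0] = rng.uniform(1.0, 2.0, size=n_sites)       # feature 0 drives every site
W_true[3] = rng.uniform(-2.0, -1.0, size=n_sites)     # so does feature 3
Y = X @ W_true + rng.normal(0.0, 0.1, size=(n_wafers, n_sites))

W_hat = multitask_lasso(X, Y, lam=20.0)
# Features whose whole row survived the threshold are kept for all sites.
selected = np.flatnonzero(np.linalg.norm(W_hat, axis=1) > 0)
```

Rows of `W_hat` are either exactly zero or non-zero as a whole, so `selected` is a single feature set shared by every site.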

Conceptual diagram:

   Shared encoder learns common representation

   Sensor features
        |
        v
   +-------------------+
   |  Shared backbone  |  <- learns process physics
   |   (MLP / CNN)     |     common across all sites
   +--------+----------+
            |
     +------+------+------+------+
     v      v      v      v      v
   +---+ +---+  +---+  +---+  +---+
   | H1| | H2|  | H3|  ...   | H13|  <- task-specific heads
   +-+-+ +-+-+  +-+-+         +-+-+     (site-level detail)
     v      v      v             v
    y1     y2     y3     ...    y13
   site1  site2  site3          site13

Key concepts and references:

Multi-task VM across chambers — Park 2018
Multi-level Lasso for VM — Schirru 2011
Hierarchical multi-task for wafer quality — He 2018
Multi-task uncertainty weighting — Kendall 2018
CatBoost MultiRMSE — Dorogush 2018
Shared encoder with site heads for thin film VM — Liu 2025

G. Ensemble and Hybrid (Mix and Match)

Core idea: Combine the preceding primitives to compensate for individual weaknesses. Typical patterns include running direct within-wafer non-uniformity (WIWNU) prediction and site-level prediction in parallel so that their error characteristics can be cross-checked, or ensembling GP uncertainty with gradient-boosting accuracy. Another common pattern is applying different models to different layers of a hierarchical decomposition: polynomial fitting for IWS, gradient boosting for WTW, and GP for residual uncertainty. Most production-grade papers fall into this category; pure single-primitive papers are the minority. Understanding the individual primitives is a prerequisite for designing effective hybrid pipelines.

Mathematical formulation:

General ensemble prediction:

$$\hat{T}_{ensemble}(r, \theta) = \sum_{m=1}^{M} w_{m} \cdot \hat{T}_{m}(r, \theta)$$

with weights learned by stacking or fixed by domain knowledge.
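Stacking in its simplest form is a linear regression of held-out truth on held-out base-model predictions. The sketch below uses three hypothetical base models simulated as truth plus noise; it leaves the weights unconstrained, whereas practical stacks often add non-negativity or sum-to-one constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(100.0, 2.0, size=200)      # held-out site measurements (synthetic)

# Out-of-fold predictions from M = 3 hypothetical base models with different noise levels.
preds = np.column_stack([truth + rng.normal(0.0, s, size=truth.size)
                         for s in (0.5, 1.0, 2.0)])

# Stacking weights w_m: least squares on the held-out predictions.
w, *_ = np.linalg.lstsq(preds, truth, rcond=None)

ensemble = preds @ w
rmse_base = np.sqrt(np.mean((preds - truth[:, None]) ** 2, axis=0))
rmse_ens = float(np.sqrt(np.mean((ensemble - truth) ** 2)))
```

On the data used to fit the weights, the ensemble error cannot exceed that of the best single model, because putting all weight on that model is one of the candidate solutions. Whether the gain holds on new wafers has to be checked on a separate hold-out set.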

Hierarchical hybrid pipeline:

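$$\hat{T}(r, \theta) = \underbrace{f_{A}(\mathbf{s})}_{\text{wafer mean}} + \underbrace{\sum_{k=1}^{K} f_{B,k}(\mathbf{s}) \cdot \psi_{k}(r, \theta)}_{\text{Zernike reconstruction}} + \underbrace{f_{C}(r, \theta, \mathbf{s})}_{\text{GP residual}}$$

If $\psi_{1}$ is the constant (piston) term, either it is dropped from the sum or $f_{A}$ is omitted, so that the wafer mean is not modeled twice.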

Cross-check between direct and derived WIWNU:

$$\widehat{WIWNU}_{direct} = f_{direct}(\mathbf{s})$$

$$\widehat{WIWNU}_{derived} = \frac{\text{std}(\hat{T}_{1}, \ldots, \hat{T}_{N_{sites}})}{\text{mean}(\hat{T}_{1}, \ldots, \hat{T}_{N_{sites}})}$$

A large discrepancy between the two indicates that the prediction is unreliable and triggers a fallback to real measurement.
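The fallback rule can be written in a few lines. The sketch below assumes the sample standard deviation (`ddof=1`) and a relative tolerance of 25%; both are illustrative choices, and the site predictions and direct-model outputs are made-up numbers.

```python
import numpy as np

def derived_wiwnu(site_preds):
    """WIWNU derived from site-level predictions: std / mean across sites, per wafer."""
    site_preds = np.atleast_2d(site_preds)
    return site_preds.std(axis=1, ddof=1) / site_preds.mean(axis=1)

def needs_metrology(wiwnu_direct, site_preds, rel_tol=0.25):
    """Flag wafers where direct and derived WIWNU disagree by more than rel_tol."""
    derived = derived_wiwnu(site_preds)
    gap = np.abs(np.asarray(wiwnu_direct) - derived) / np.maximum(derived, 1e-12)
    return gap > rel_tol, derived

# Two hypothetical wafers with 5 predicted sites each (nm).
site_preds = np.array([[100.0, 101.0, 99.0, 100.5, 99.5],
                       [100.0, 104.0, 96.0, 103.0, 97.0]])
wiwnu_direct = np.array([0.0080, 0.0150])     # output of a hypothetical direct model
flags, derived = needs_metrology(wiwnu_direct, site_preds)
# flags -> [False, True]: the second wafer would be sent to real metrology
```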

Conceptual diagram:

   Multiple primitives combined in pipeline

   Sensors --+--> [Input FE: tsfresh] --+
             |                          |
             |                          v
             |                 +------------------+
             |                 | Model A: GBM     |--> wafer_mean
             |                 | (scalar target)  |
             |                 +------------------+
             |
             +--> [Output FE: Zernike] --+
                                          v
                                 +------------------+
                                 | Model B: MTGP    |--> c1...c5
                                 | (coefficients)   |    + uncertainty
                                 +------------------+
                                          |
                                          v
                              +---------------------+
                              | Ensemble + physical |--> 13 sites
                              | reconstruction      |    + WIWNU
                              +---------------------+
                                          |
                                          v
                                  Cross-check with
                                  Model C: direct WIWNU

Key concepts and references:

Direct vs derived WIWNU comparison — Rothe 2025
GP + GBM stacking for CMP VM — Cai 2021
Adaptive active learning with MTGP for CVD — Cai 2022
Hierarchical decomposition with per-level models — Dwivedi 2023
Ensemble with Shapley interpretation — Liu 2025

Taxonomy Usage Guideline

Real-world papers combine two or three primitives rather than belonging cleanly to one category:

Zhang 2011 (Virtual Probe) = A (DCT basis) + D (sparse recovery)
Cai 2020 (CMP MTGP) = B (MTGP) + C (reference-based hierarchy)
Go 2025 (PEB PINN) = E (physics) + F (multi-output field)
Rothe 2025 (CMP VM) = C (product factor hierarchy) + F (multi-site) + G (direct-vs-derived ensemble)

When reviewing a paper, the productive question is: what is the primary primitive, and which secondary primitives are combined with it? This framing reveals each paper’s true contribution and makes cross-paper comparison meaningful.
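For a literature survey, the same bookkeeping can be kept in a small lookup table. The sketch below encodes only the four combinations listed above; the helper names are hypothetical.

```python
# Primitive combinations as listed above (category letters A-G).
PAPERS = {
    "Zhang 2011": ["A", "D"],
    "Cai 2020": ["B", "C"],
    "Go 2025": ["E", "F"],
    "Rothe 2025": ["C", "F", "G"],
}

def papers_using(primitive):
    """Return the listed papers that include the given primitive."""
    return [name for name, prims in PAPERS.items() if primitive in prims]

def shared_primitives(paper_a, paper_b):
    """Primitives two papers have in common, useful for cross-paper comparison."""
    return sorted(set(PAPERS[paper_a]) & set(PAPERS[paper_b]))

with_hierarchy = papers_using("C")                     # ['Cai 2020', 'Rothe 2025']
overlap = shared_primitives("Go 2025", "Rothe 2025")   # ['F']
```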

Four-Axis Classification for Paper Review

Each paper can additionally be indexed along four orthogonal axes:

$$\mathbf{x}_{paper} = [\ \text{InputFE},\ \text{OutputFE},\ \text{InductiveBias},\ \text{Model}\ ]$$

Input FE: sensor-side feature engineering (tsfresh, statistical features, FFT, wavelets, PCA on sensors, step-wise segmentation, chamber context)
Output FE: target-side representation (raw sites, wafer-mean-residual, reference-site-residual, Zernike coefficients, PCA scores, zone aggregates, WIWNU scalar)
Inductive Bias: spatial structure encoding (location as categorical, polar coordinates $(r, \theta)$, spatial kernel, graph structure, 2D wafer map, PDE constraint)
Model: predictor algorithm (Ridge, Lasso, PLS, SVR, GP, MTGP, Random Forest, XGBoost, LightGBM, CatBoost, MLP, 1D-CNN, Transformer, GNN, ensembles)

Summary

The seven categories (A–G) form a working primitive set for WIW-ML research. Categories A–F represent distinct methodological ideas; category G represents the combinatorial space in which most production-grade systems live. The four-axis classification clarifies where a paper’s novelty actually resides, which is frequently on the Output-FE or Inductive-Bias axes rather than on the Model axis. For researchers entering this field, six foundational papers (Zhang 2011, Bonilla 2008, Cai 2020, Dwivedi 2023, Rothe 2025, Shintani 2021) cover most of the landscape.

References

Ahmadi 2015 — Ahmadi, A., et al., “Joint exploration of multiple test items’ spatial patterns via compressed sensing,” IEEE Transactions on Semiconductor Manufacturing, 2015.
Bonilla 2008 — Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I., “Multi-task Gaussian process prediction,” Advances in Neural Information Processing Systems 20, 2008.
Cai 2020 — Cai, H., Feng, J., Yang, Q., Li, W., Li, X., and Lee, J., “A virtual metrology method with prediction uncertainty based on Gaussian process for chemical mechanical planarization,” Computers in Industry, 2020.
Cai 2021 — Cai, H., et al., “Reference-based virtual metrology method with uncertainty evaluation for material removal rate prediction based on Gaussian process regression,” 2021.
Cai 2022 — Cai, H., et al., “An improved virtual metrology method in chemical vapor deposition systems via multi-task Gaussian processes and adaptive active learning,” International Journal of Advanced Manufacturing Technology, 2022.
Cressie 1993 — Cressie, N. A. C., “Statistics for Spatial Data,” Wiley, 1993.
Dorogush 2018 — Dorogush, A. V., Ershov, V., and Gulin, A., “CatBoost: gradient boosting with categorical features support,” arXiv:1810.11363, 2018.
Dwivedi 2023 — Dwivedi, S., et al., “Capturing the effects of spatial process variations in silicon photonic circuits,” ACS Photonics, 2023.
Go 2025 — Go, J., et al., “Real-time monitoring of thermoelastic deformation of a silicon wafer with sparse measurements in the photolithography process using a physics-informed neural network and Fourier neural operator,” Engineering Applications of Artificial Intelligence, 2025.
Han 2025 — Han, T., et al., “Physics-Informed Neural Networks for Semiconductor Film Deposition: A Review,” arXiv:2507.10983, 2025.
He 2018 — He, J., and Zhu, Y., “Hierarchical multi-task learning with application to wafer quality prediction,” 2018.
Kazemi 2020 — Kazemi, P., et al., “Adaptive neural-based PCA framework for fault detection and diagnosis in time-varying industrial processes,” 2020.
Kendall 2018 — Kendall, A., Gal, Y., and Cipolla, R., “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” CVPR 2018.
Kim 2025 — Kim, S., et al., “A neural master equation framework for multiscale modeling of molecular processes: application to atomic-scale plasma processes,” npj Computational Materials, 2025.
Li 2022 — Li, X., et al., “CFD and ANN hybrid modeling for ALD SiO2 deposition,” 2022.
Liu 2022 — Liu, Y., et al., “Mixed-effect profile monitoring for wafer thickness in industrial wafer slicing,” 2022.
Liu 2025 — Liu, Y.-Y., Wang, Y.-C., Hsu, W.-C., Lin, C.-H., and Chang, K.-H., “An empirical study on enhancing wafer quality: Integrating big data and AI in virtual metrology for thin-film processing,” ScienceDirect, 2025.
Noh 2018 — Noh, H., et al., “Zernike polynomial modeling for wafer-level overlay correction in APC,” 2018.
Park 2018 — Park, C., et al., “Multitask learning for virtual metrology in semiconductor manufacturing systems,” Computers & Industrial Engineering, 2018.
Reda 2010 — Reda, S., and Nassif, S. R., “Accurate spatial estimation and decomposition techniques for variability characterization,” IEEE Transactions on Semiconductor Manufacturing, vol. 23, no. 3, pp. 345–357, 2010.
Rothe 2025 — Rothe, T., Lauff, A., Thieme, P., Langer, J., Günther, M., and Kuhn, H., “Process data-driven machine learning for non-uniformity prediction and virtual metrology in chemical mechanical planarization,” Journal of Intelligent Manufacturing, 2025.
Schirru 2011 — Schirru, A., Pampuri, S., and De Nicolao, G., “Multilevel kernel methods for virtual metrology in semiconductor manufacturing,” IFAC Proceedings, 2011.
Shintani 2021 — Shintani, M., Mian, R.-U.-H., Inoue, M., Nakamura, T., Kajiyama, M., and Eiki, M., “Wafer-level variation modeling for multi-site RF IC testing via hierarchical Gaussian process,” arXiv:2111.01369, 2021.
Zhang 2010 — Zhang, W., Li, X., and Rutenbar, R. A., “Bayesian virtual probe: Minimizing variation characterization cost for nanoscale IC technologies via Bayesian inference,” DAC 2010.
Zhang 2011 — Zhang, W., Li, X., Liu, F., Acar, E., Rutenbar, R. A., and Blanton, R. D., “Virtual probe: A statistical framework for low-cost silicon characterization of nanoscale integrated circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 12, pp. 1814–1827, 2011.
Zhang 2012 — Zhang, W., et al., “Multi-Wafer Virtual Probe: Minimum-cost variation characterization by exploring wafer-to-wafer correlation,” 2012.
Zhang 2014 — Zhang, W., et al., “Joint Virtual Probe: Joint exploration of multiple test items’ spatial patterns for efficient silicon characterization,” 2014.
Zhang 2024 — Zhang, Y., et al., “Ant Colony Optimization and Back Propagation Neural Network for 4H-SiC CVD epitaxy uniformity optimization,” 2024.
