
Modeling Thickness Variation in Semiconductor Thin-Film Processes — A Spatial Decomposition Approach to Machine Learning (ML)

 Thickness uniformity in thin-film deposition determines downstream yield and device performance. Variation arises along two distinct axes: within a single wafer (Within-Wafer, WiW) and across wafers over time (Wafer-to-Wafer, W2W). The two axes have different physical origins and call for different diagnostic treatments. Folding them into a single ML target forces one model to learn two largely unrelated physical mechanisms at once, which tends to degrade accuracy on both.

 This post presents an ML framework that incorporates Spatial Decomposition based on Zernike Polynomials (Zernike 1934; Noll 1976) into label engineering. It compresses 13-point wafer thickness measurements into 9 physically meaningful coefficients, which then serve as targets for a Two-Head ML architecture.

 The framework is organized along three axes: Domain (facts), Design (technical choices), and Delivery (user value).

1. Domain

Domain defines the set of facts the model operates within: the environment, the data, and the mathematical tools available.

1.1 WiW / W2W Variation — Industrial Background

  • W2W variation: temporal change in mean thickness per wafer. Driven by Run-to-Run drift, source/target depletion, recipe shifts.
  • WiW variation: spatial thickness distribution within a single wafer. Driven by chamber hardware asymmetry, gas flow imbalance, temperature non-uniformity.

The two axes have different physical origins, so they require different diagnostic and control approaches. Treating them as a single ML target conflates two physics regimes and degrades both.

1.2 Measurement Setup — 13-Point Wafer Location Pattern

| Item | Value |
|---|---|
| Wafer size | 300 mm |
| Edge Exclusion (EE) | 5 mm |
| Number of points | 13 |
| Layout | Center 1 + Middle ring (r = 75 mm, 4 cardinal directions) + Edge ring (r = 145 mm, 8 directions) |

  • Three radial levels (0 / 75 / 145 mm): enable radial-order decomposition.
  • Cardinal-aligned middle ring (0°/90°/180°/270°): stabilizes extraction of the 0°/90° astigmatism term ($Z_6$). The 45° term ($Z_5$) vanishes at the cardinal angles, so it is constrained only by the edge ring.
  • Edge ring at 8 directions: enables higher-order asymmetry (coma, trefoil) representation.

1.3 Dataset under Study

Output side (target source)

| Data | Form | Frequency |
|---|---|---|
| 13-point thickness measurements | 13 scalars per wafer (Å or nm) | Per wafer or lot |

Input side (feature source)

| Data | Form | Use |
|---|---|---|
| Equipment sensor time series (Fault Detection and Classification, FDC) | RF power, pressure, gas flow, temperature, etc. | Step-wise statistics for feature engineering |
| Process metadata | Recipe ID, Chamber ID, Timestamp | Grouping variables, context |
| Preventive Maintenance (PM) / maintenance history | Event log | Baseline definition for drift analysis |

1.4 Zernike Polynomials — Mathematical Foundation

  • Orthogonal basis functions defined on the unit disk (Zernike 1934).
  • Each term is a function of normalized radius $\rho \in [0,1]$ and angle $\theta \in [0, 2\pi)$.
  • Standard tool in optics and metrology for wavefront decomposition and surface form analysis (Born & Wolf 1999).

Decomposition formula:

$$T(\rho, \theta) = \sum_{k=1}^{N} a_k \cdot Z_k(\rho, \theta) + \varepsilon$$

where $T(\rho, \theta)$ is the thickness distribution, $Z_k$ is the $k$-th Zernike basis function, $a_k$ is the corresponding scalar coefficient, $\varepsilon$ is the residual, and $N$ is the number of terms used.

Physical meaning of low-order terms (Noll 1976 convention):

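| Index | Name | Physical meaning |
|---|---|---|
| $Z_1$ | Piston | Mean thickness |
| $Z_2$ | Tilt X | Slope along X |
| $Z_3$ | Tilt Y | Slope along Y |
| $Z_4$ | Defocus | Bowl / Dome (center–edge contrast) |
| $Z_5$ | Astigmatism 45° | 45°/135° asymmetry |
| $Z_6$ | Astigmatism 0° | 0°/90° asymmetry |
| $Z_7$ | Coma Y | Y-direction asymmetric variation |
| $Z_8$ | Coma X | X-direction asymmetric variation |
| $Z_9$ | Trefoil | 3-fold pattern ($\sin 3\theta$ orientation; the $\cos 3\theta$ partner is $Z_{10}$ and is not among the 9 terms) |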
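
The nine terms in the table can be written out explicitly. The following is a minimal pure-Python sketch of $Z_1$ to $Z_9$ in the Noll convention (Noll 1976); the function name `zernike9` is an illustrative choice and not something defined elsewhere in this post.

```python
import math

def zernike9(rho, theta):
    """Noll-normalized Z1..Z9 at one point (rho in [0, 1], theta in radians)."""
    r2, r3 = rho ** 2, rho ** 3
    coma_radial = 3.0 * r3 - 2.0 * rho
    return [
        1.0,                                            # Z1 piston
        2.0 * rho * math.cos(theta),                    # Z2 tilt X
        2.0 * rho * math.sin(theta),                    # Z3 tilt Y
        math.sqrt(3.0) * (2.0 * r2 - 1.0),              # Z4 defocus
        math.sqrt(6.0) * r2 * math.sin(2.0 * theta),    # Z5 astigmatism 45 deg
        math.sqrt(6.0) * r2 * math.cos(2.0 * theta),    # Z6 astigmatism 0 deg
        math.sqrt(8.0) * coma_radial * math.sin(theta), # Z7 coma Y
        math.sqrt(8.0) * coma_radial * math.cos(theta), # Z8 coma X
        math.sqrt(8.0) * r3 * math.sin(3.0 * theta),    # Z9 trefoil (sin 3*theta)
    ]

center = zernike9(0.0, 0.0)      # only piston and defocus are non-zero here
edge_east = zernike9(1.0, 0.0)   # point on the +X edge of the unit disk
```

At the wafer center only piston and defocus contribute, which is why the single center point mainly helps separate mean thickness from bowl shape.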

2. Design

Design covers the technical choices and optimization strategies built on top of the Domain — the realm of decisions and trade-offs.

2.1 Why Zernike Polynomials

| Candidate | Pros | Cons |
|---|---|---|
| Zernike Polynomials | Natural fit to circular domain, orthogonal, physically interpretable | Order must be capped for 13 points |
| Polynomial ($x^n, y^n$) | Simple to implement | Non-orthogonal, unstable near circular edge |
| Fourier series (polar) | Orthogonal in angle | Poor radial expressiveness |
| Spline interpolation | Passes through measured points exactly | No physical meaning, noise-sensitive |
| Principal Component Analysis (PCA) / Autoencoder | Data-driven compression | Not interpretable, requires large data |

Five reasons Zernike wins for this problem: (1) the wafer is a disk and Zernike polynomials are defined on a disk, so the coordinate systems align naturally; (2) orthogonality over the disk means each coefficient represents a distinct pattern (exactly under dense sampling, only approximately at 13 points; see Appendix E); (3) low-order terms map directly to known process drivers (tilt, bowl, astigmatism); (4) Zernike is the de facto standard in optics and semiconductor metrology (Wang & Silva 1980); (5) compression from 13 points to 9 coefficients gives the ML model fewer and better-structured targets.

2.2 Spatial Decomposition Structure — W2W / WiW Separation

[Measurement space]            [Zernike space]

13 point values         ──►   9 coefficients
(T1, T2, ..., T13)             (a1, a2, ..., a9)

13-D                           9-D
(location-dependent)           (meaning-dependent)

W2W / WiW group definition:

| Group | Component | Count | Zernike terms | Meaning |
|---|---|---|---|---|
| W2W | Mean component | 1 | $Z_1$ (Piston) | Wafer-wide mean thickness |
| WiW | Shape components | 8 | $Z_2$ to $Z_9$ | Spatial variation (tilt / bowl / astigmatism / coma / trefoil) |

Three reasons for separation:

  • Higher ML accuracy — W2W and WiW are different physics; separating them lets each head focus on its own signal, improving accuracy and convergence speed.
  • Equipment fingerprint generation — the 8-element WiW vector forms a unique chamber signature that enables matching and outlier detection.
  • Continuous thickness inference: the 9 coefficients fitted from the 13 measurements define a continuous, low-order thickness map, allowing estimates at unmeasured locations. Features finer than the 9 terms are not recovered.

Fitting in matrix form:

$$T = A \cdot a + \varepsilon$$

where $T \in \mathbb{R}^{13 \times 1}$ is the measurement vector, $A \in \mathbb{R}^{13 \times 9}$ is the Zernike basis matrix, $a \in \mathbb{R}^{9 \times 1}$ is the coefficient vector to be estimated, and $\varepsilon \in \mathbb{R}^{13 \times 1}$ is the residual. The standard solution is Least Squares (LSQ); for noise robustness, Ridge Regression (Hoerl & Kennard 1970) is recommended. The basis matrix is described in Appendix E, and the LSQ and Ridge derivations appear in Appendices F and G.

With 13 measurements and 9 unknowns, the residual carries 4 degrees of freedom, enough for a stable fit and for residual diagnostics. Higher-order terms ($Z_{10}$ and above) are not modeled: part of their content lands in the residual, and part can alias into the nine fitted coefficients because the sampling is sparse.
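
The fit described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the post's reference implementation: the normalization radius `R_NORM_MM = 150.0` (wafer radius rather than the 145 mm edge ring), the helper names, and the synthetic coefficients and noise level are all hypothetical.

```python
import numpy as np

R_NORM_MM = 150.0  # assumption: normalize radius by the wafer radius

def basis_matrix(x_mm, y_mm):
    """13 x 9 matrix A with A[i, j] = Z_{j+1}(rho_i, theta_i), Noll convention."""
    rho = np.hypot(x_mm, y_mm) / R_NORM_MM
    th = np.arctan2(y_mm, x_mm)
    coma = 3.0 * rho**3 - 2.0 * rho
    cols = [
        np.ones_like(rho),
        2.0 * rho * np.cos(th),
        2.0 * rho * np.sin(th),
        np.sqrt(3.0) * (2.0 * rho**2 - 1.0),
        np.sqrt(6.0) * rho**2 * np.sin(2.0 * th),
        np.sqrt(6.0) * rho**2 * np.cos(2.0 * th),
        np.sqrt(8.0) * coma * np.sin(th),
        np.sqrt(8.0) * coma * np.cos(th),
        np.sqrt(8.0) * rho**3 * np.sin(3.0 * th),
    ]
    return np.column_stack(cols)

def pattern_13():
    """Center + 4 cardinal points at r=75 mm + 8 points at r=145 mm."""
    pts = [(0.0, 0.0)]
    pts += [(75.0 * np.cos(np.radians(d)), 75.0 * np.sin(np.radians(d)))
            for d in (0, 90, 180, 270)]
    pts += [(145.0 * np.cos(np.radians(d)), 145.0 * np.sin(np.radians(d)))
            for d in range(0, 360, 45)]
    xy = np.array(pts)
    return xy[:, 0], xy[:, 1]

x, y = pattern_13()
A = basis_matrix(x, y)

# Synthetic wafer: hypothetical coefficients plus measurement noise.
a_true = np.array([1000.0, 3.0, -2.0, 8.0, 1.0, -1.5, 0.5, 0.8, -0.4])
rng = np.random.default_rng(0)
T = A @ a_true + rng.normal(0.0, 0.5, size=13)

a_hat, *_ = np.linalg.lstsq(A, T, rcond=None)    # least-squares coefficients
resid = T - A @ a_hat                            # 13-vector residual
dof = A.shape[0] - np.linalg.matrix_rank(A)      # residual degrees of freedom
```

`a_hat[0]` is the W2W (piston) target and `a_hat[1:]` is the 8-element WiW vector; `resid` feeds the residual monitoring described in Section 2.6.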

2.3 Model Architecture — Two-Head Design

                          ┌─► W2W Head ─► a1   (mean, 1 output)
Sensor data X ─► [Model] ─┤
                          └─► WiW Head ─► a2..a9 (shape, 8 outputs)
                                            │
                                            ▼
                                    Equipment fingerprint
  • Input: feature vector X derived from sensor time-series.
  • Output: 9 coefficients (1 W2W + 8 WiW).
  • Reconstruction: 9 coefficients combined with Zernike basis yield $\hat{T}(\rho, \theta)$ at any location.
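
The reconstruction step in the last bullet amounts to evaluating the basis at the requested location and taking a weighted sum. A minimal sketch follows, assuming a 150 mm normalization radius; the function name `reconstruct` and the example coefficients are hypothetical.

```python
import math

def reconstruct(a, x_mm, y_mm, r_norm_mm=150.0):
    """Evaluate T_hat(x, y) from nine Noll-ordered coefficients a[0..8]."""
    rho = math.hypot(x_mm, y_mm) / r_norm_mm
    th = math.atan2(y_mm, x_mm)
    coma = 3.0 * rho**3 - 2.0 * rho
    basis = [
        1.0,
        2.0 * rho * math.cos(th),
        2.0 * rho * math.sin(th),
        math.sqrt(3.0) * (2.0 * rho**2 - 1.0),
        math.sqrt(6.0) * rho**2 * math.sin(2.0 * th),
        math.sqrt(6.0) * rho**2 * math.cos(2.0 * th),
        math.sqrt(8.0) * coma * math.sin(th),
        math.sqrt(8.0) * coma * math.cos(th),
        math.sqrt(8.0) * rho**3 * math.sin(3.0 * th),
    ]
    return sum(ak * zk for ak, zk in zip(a, basis))

# Hypothetical wafer: piston 1000 and a pure bowl term of 5 (same thickness unit).
a_example = [1000.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
t_center = reconstruct(a_example, 0.0, 0.0)     # 1000 - 5*sqrt(3)
t_edge = reconstruct(a_example, 150.0, 0.0)     # 1000 + 5*sqrt(3)
```

Because the example contains only piston and defocus, the reconstructed value depends on radius alone; any location at the same radius returns the same thickness.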

Two-Head benefits: separates loss between very different output scales (W2W large, WiW small); allows feature subset specialization per head since the driving factors differ; enables independent monitoring and retraining per head in operation.
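
The scale argument can be made concrete with a small loss-bookkeeping sketch. It is plain NumPy and covers only how the two heads' errors might be combined, not a trained model; the per-coefficient `scale` vector, the head weights, and the example numbers are assumptions for illustration.

```python
import numpy as np

def two_head_loss(a_true, a_pred, scale, w_w2w=1.0, w_wiw=1.0):
    """Scaled MSE per head. Column 0 is the W2W head, columns 1..8 the WiW head."""
    z = (np.asarray(a_pred) - np.asarray(a_true)) / np.asarray(scale)
    loss_w2w = float(np.mean(z[:, :1] ** 2))
    loss_wiw = float(np.mean(z[:, 1:] ** 2))
    return w_w2w * loss_w2w + w_wiw * loss_wiw, loss_w2w, loss_wiw

# One hypothetical wafer: piston error of 10, shape errors of 0.5 each.
a_true = np.array([[1000.0, 1.0, -1.0, 4.0, 0.5, 0.5, 0.2, 0.2, 0.1]])
a_pred = a_true + np.array([[10.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]])
scale = np.array([10.0] + [0.5] * 8)   # assumed typical spread per coefficient

total, l_w2w, l_wiw = two_head_loss(a_true, a_pred, scale)
raw_mse = float(np.mean((a_pred - a_true) ** 2))   # dominated by the piston error
```

With per-head scaling both heads report a loss of about 1, whereas the unscaled MSE is driven almost entirely by the piston term and would hide the shape errors.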

2.4 Recommended Algorithms per Head

| Head / Group | 1st choice | 2nd choice | Rationale |
|---|---|---|---|
| W2W Head ($Z_1$) | Ridge Regression | XGBoost (shallow) | Strongly linear, interpretability priority, stable on small data |
| WiW low-order ($Z_2$ to $Z_4$) | LightGBM / XGBoost | Random Forest | Mild non-linearity, multivariate interactions |
| WiW high-order ($Z_5$ to $Z_9$) | XGBoost (heavy regularization) | 1D-CNN, Stacking | Small signal, noise robustness needed |

With sufficient data (> 5,000 wafers), a Multi-task Learning structure with a shared backbone and group-specific heads is effective.

2.5 Drift Tracking — Spatial × Temporal

Spatial decomposition compresses one wafer’s spatial pattern into 9 coefficients; temporal drift is then tracked on the time-series of those coefficients using Statistical Process Control (SPC) charts, Exponentially Weighted Moving Average (EWMA), or Cumulative Sum (CUSUM) — see Montgomery (2013).

[Spatial: Zernike]                   [Temporal: SPC / time-series]

T(ρ,θ; t)  →  (a1(t), ..., a9(t))  →  EWMA / CUSUM / ARIMA
(one wafer)    (coefficient series)    (drift detection)
  • Spatial drift: captured by Zernike coefficients (e.g., gradual rise in $a_4$ = bowl deepening).
  • Temporal drift: tracked on the 9 coefficient time-series using SPC, EWMA, or CUSUM.
  • Non-Zernike patterns: point defects and local hot-spots are caught by residual monitoring instead.
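
Temporal tracking of a single coefficient can be sketched with a textbook EWMA chart of the kind covered in Montgomery (2013). The smoothing weight `lam=0.2` and limit width `L=3.0` are common illustrative defaults, and the baseline mean and sigma are assumed to come from a stable reference period; none of these values are prescribed by this post.

```python
import math

def ewma_alarms(series, mu0, sigma, lam=0.2, L=3.0):
    """Return indices where the EWMA statistic leaves its control limits."""
    z = mu0
    alarms = []
    for t, x in enumerate(series, start=1):
        z = lam * x + (1.0 - lam) * z
        # Time-varying limit half-width; it widens toward its asymptote.
        half_width = L * sigma * math.sqrt(
            lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        )
        if abs(z - mu0) > half_width:
            alarms.append(t - 1)
    return alarms

# Hypothetical a_4 (bowl) series: stable for 10 wafers, then a 3-sigma shift.
a4_series = [0.0] * 10 + [3.0] * 10
alarms = ewma_alarms(a4_series, mu0=0.0, sigma=1.0)
```

The same function can be run independently on each of the nine coefficient series, so a bowl drift and a tilt drift raise separate alarms.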

2.6 Residual Interpretation and Use

The 13-D measurement decomposes into a 9-D Zernike fit plus a residual carrying 4 Degrees of Freedom (DOF):

13-D measurement
   │
   ├── 9-D (Zernike fit)  ──► ML training target
   │
   └── Residual (DOF 4)   ──► diagnostic information

| Residual component | Origin | Use |
|---|---|---|
| High-order spatial pattern | Process systematic missed by 9 terms | Signals model capacity insufficiency → consider order extension |
| Measurement system bias | Sensor calibration issue | Per-point reliability check |
| Local defect | Particle, scratch, etc. | Anomaly detection |
| Random noise | Measurement repeatability limit | Noise-floor estimation |

Two-layer monitoring strategy: Layer 1 (Zernike coefficients) tracks “expected variation” — drift detection, run-to-run control. Layer 2 (residual statistics) catches “unexpected anomalies” — alarm triggers, defect inspection.

2.7 Dimensionality Reduction Strategy

| Method | Characteristic | When to apply |
|---|---|---|
| Variance filter | Drop coefficients with low variability | After initial baseline analysis |
| Domain knowledge | Pick dominant terms per process type: Chemical Vapor Deposition (CVD) / Physical Vapor Deposition (PVD) / Atomic Layer Deposition (ALD) / Etch | When process priors are clear |
| Target correlation | Select terms most correlated with yield/quality | When outcome data is available |
| PCA on coefficients | Automatic compression (loses interpretability) | For ML input features only |
| Sparse Regression (Least Absolute Shrinkage and Selection Operator, LASSO) | Auto-selection during ML training | Integrated learning step |

Recommendation: split coefficients into “Active” (used for learning) and “Passive” (monitored only) groups. Don’t discard — keep computing and watching all coefficients.

3. Delivery

Delivery defines what the user gains by adopting this framework — operational and business value, not technical structure.

3.1 Application — Four Outcomes by W2W / WiW Group

| # | Group | Outcome | How |
|---|---|---|---|
| 1 | W2W | Run-to-Run thickness control | Mean trend tracking |
| 2 | WiW | Early outlier-tool detection | Fingerprint deviation |
| 3 | WiW | Preventive Maintenance (PM) timing optimization | Fingerprint trend |
| 4 | WiW | Hardware root-cause diagnosis | Pattern-to-factor mapping |

3.2 W2W Outcome — Run-to-Run Thickness Control

How: mean trend tracking. The W2W head’s predicted mean thickness ($\hat{a}_1$) drives recipe correction for the next wafer or lot, minimizing per-lot deviation and absorbing source-depletion or recipe-drift effects before they hit spec.

3.3 WiW Outcomes

Early outlier-tool detection. Monitor distance from a normal-baseline fingerprint (e.g., Mahalanobis distance) over the 8-element WiW vector. Detects deviating tools before yield impact, maintains chamber-to-chamber matching, ensures fleet-level consistency.
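
A minimal sketch of the fingerprint-distance idea, assuming a baseline set of 8-element WiW vectors collected from known-good wafers. The baseline here is synthetic, and the helper names are hypothetical; the pseudo-inverse is used so the sketch still runs if the baseline covariance is near-singular.

```python
import numpy as np

def fit_baseline(W):
    """Mean and inverse covariance of baseline WiW vectors (rows = wafers)."""
    W = np.asarray(W, dtype=float)
    mu = W.mean(axis=0)
    cov = np.cov(W, rowvar=False)
    return mu, np.linalg.pinv(cov)

def mahalanobis(w, mu, cov_inv):
    """Mahalanobis distance of one WiW vector from the baseline."""
    d = np.asarray(w, dtype=float) - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, size=(200, 8))   # synthetic normal-state fingerprints
mu, cov_inv = fit_baseline(baseline)

d_typical = mahalanobis(baseline[0], mu, cov_inv)
d_outlier = mahalanobis(mu + 10.0, mu, cov_inv)   # all 8 terms shifted by 10
```

In practice the alarm threshold on the distance would be set from the baseline distribution of distances rather than fixed in advance.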

PM timing optimization. Move from periodic PM to condition-based PM driven by fingerprint drift trends. Improves uptime and reduces maintenance cost simultaneously, avoiding both unnecessary PMs and delayed-PM excursions.

Hardware root-cause diagnosis. Each shape coefficient maps to specific hardware factors:

  • Tilt → chuck levelness, gas inlet position
  • Bowl → center-edge temperature delta, RF coupling
  • Astigmatism → showerhead directionality, magnetic-field asymmetry
  • Coma / Trefoil → pump location, 3-zone heater non-uniformity

Result: faster root-cause identification on excursions, better maintenance efficiency, standardized troubleshooting playbooks.


Appendix A. 13-Point JSON Coordinate Definition

{
  "wafer_size_mm": 300,
  "edge_exclusion_mm": 5,
  "pattern": "13-points",
  "points": [
    {"id": "P1",  "x":    0.0, "y":    0.0, "r":   0, "theta":   0, "zone": "Center"},
    {"id": "P2",  "x":   75.0, "y":    0.0, "r":  75, "theta":   0, "zone": "Mid_E"},
    {"id": "P3",  "x":    0.0, "y":   75.0, "r":  75, "theta":  90, "zone": "Mid_N"},
    {"id": "P4",  "x":  -75.0, "y":    0.0, "r":  75, "theta": 180, "zone": "Mid_W"},
    {"id": "P5",  "x":    0.0, "y":  -75.0, "r":  75, "theta": 270, "zone": "Mid_S"},
    {"id": "P6",  "x":  145.0, "y":    0.0, "r": 145, "theta":   0, "zone": "Edge_E"},
    {"id": "P7",  "x":  102.5, "y":  102.5, "r": 145, "theta":  45, "zone": "Edge_NE"},
    {"id": "P8",  "x":    0.0, "y":  145.0, "r": 145, "theta":  90, "zone": "Edge_N"},
    {"id": "P9",  "x": -102.5, "y":  102.5, "r": 145, "theta": 135, "zone": "Edge_NW"},
    {"id": "P10", "x": -145.0, "y":    0.0, "r": 145, "theta": 180, "zone": "Edge_W"},
    {"id": "P11", "x": -102.5, "y": -102.5, "r": 145, "theta": 225, "zone": "Edge_SW"},
    {"id": "P12", "x":    0.0, "y": -145.0, "r": 145, "theta": 270, "zone": "Edge_S"},
    {"id": "P13", "x":  102.5, "y": -102.5, "r": 145, "theta": 315, "zone": "Edge_SE"}
  ]
}
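
The coordinate table above can be regenerated from the ring radii and angles. The sketch below is illustrative (the helper name `ring` is hypothetical) and rounds to 0.1 mm, which is why the diagonal edge points appear as ±102.5 rather than the exact 145·cos 45° ≈ 102.53.

```python
import math

def ring(radius_mm, angles_deg):
    """Cartesian points on one ring, rounded to 0.1 mm (adding 0.0 removes -0.0)."""
    return [
        (round(radius_mm * math.cos(math.radians(a)), 1) + 0.0,
         round(radius_mm * math.sin(math.radians(a)), 1) + 0.0)
        for a in angles_deg
    ]

# P1 (center), P2..P5 (middle ring), P6..P13 (edge ring), same order as the JSON.
points = (
    [(0.0, 0.0)]
    + ring(75.0, range(0, 360, 90))
    + ring(145.0, range(0, 360, 45))
)

# Radius implied by the rounded coordinates of each edge point.
edge_radii = [math.hypot(x, y) for x, y in points[5:]]
```

The rounded diagonal points sit about 0.04 mm inside the nominal 145 mm ring, which is negligible for the fit but worth knowing when $\rho$ is computed from `x` and `y` instead of the `r` field.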

Appendix B. 13-Point Wafer Location Map

                          N (+Y)
                           │
                    . . . P8 . . .
                .       (0,145)      .
            P9 .                      . P7
        (-102,102).                  .(102,102)
           .          P3 (0,75)         .
          .               │              .
         .                │               .
        .                 │                .
       .                  │                 .
      P10─────P4─────────P1─────────P2─────P6   ── E (+X)
    (-145,0)(-75,0)    (0,0)     (75,0) (145,0)
       .                  │                 .
        .                 │                .
         .                │               .
          .          P5 (0,-75)          .
           .              │             .
       P11 .                            . P13
       (-102,-102) .                . (102,-102)
                .       (0,-145)      .
                    . . . P12 . . .
                           │
                          S (-Y)

Appendix C. Visualization of Low-Order Zernike Terms

Each Zernike term is rendered as a 17×17 ASCII grid normalized to its own peak, so shape patterns are directly comparable across terms.

Shading legend (negative ← zero → positive):

   #   @   *   +   -   ' '   .   :   o   O   0
strong                                       strong
negative              zero                   positive

C.1 $Z_1$ — Piston (mean)

$(n=0, m=0)$ — wafer-wide constant. Captures the mean thickness; the W2W variation lives here.

        0        
     0000000     
   00000000000   
  0000000000000  
  0000000000000  
 000000000000000 
 000000000000000 
 000000000000000 
00000000000000000
 000000000000000 
 000000000000000 
 000000000000000 
  0000000000000  
  0000000000000  
   00000000000   
     0000000     
        0        

C.2 $Z_2$ — Tilt X

$(n=1, m=1)$ — linear slope along X. Diagnoses chuck levelness or asymmetric gas inlet position.

                 
     +-- ..:     
   **+-- ..:oo   
  @**+-- ..:ooO  
  @**+-- ..:ooO  
 @@**+-- ..:ooOO 
 @@**+-- ..:ooOO 
 @@**+-- ..:ooOO 
#@@**+-- ..:ooOO0
 @@**+-- ..:ooOO 
 @@**+-- ..:ooOO 
 @@**+-- ..:ooOO 
  @**+-- ..:ooO  
  @**+-- ..:ooO  
   **+-- ..:oo   
     +-- ..:     
                 

C.3 $Z_3$ — Tilt Y

$(n=1, m=-1)$ — linear slope along Y. Diagnoses front-back chuck levelness or top-bottom flow asymmetry.

        #        
     @@@@@@@     
   @@@@@@@@@@@   
  *************  
  ************+  
 +++++++++++++++ 
 --------------- 
 --------------- 
                 
 ............... 
 ............... 
 ::::::::::::::: 
  ooooooooooooo  
  ooooooooooooo  
   OOOOOOOOOOO   
     OOOOOOO     
        0        

C.4 $Z_4$ — Defocus (Bowl / Dome)

$(n=2, m=0)$ — radially symmetric center-vs-edge contrast. Key diagnostic for center-edge temperature delta, RF coupling, and showerhead-to-wafer gap.

        0        
     OoooooO     
   0o:.....:o0   
  0o.  ---  .o0  
  o. -++*++- .o  
 O: -+**@**+- :O 
 o. +*@@@@@*+ .o 
 o.-+*@###@*+-.o 
0o.-*@@###@@*-.o0
 o.-+*@###@*+-.o 
 o. +*@@@@@*+ .o 
 O: -+**@**+- :O 
  o. -++*++- .o  
  0o.  ---  .o0  
   0o:.....:o0   
     OoooooO     
        0        

C.5 $Z_5$ — Astigmatism 45°

$(n=2, m=-2)$: 2-fold saddle pattern (four alternating lobes) with extremes along the diagonals. Diagnoses 45°/135°-direction flow asymmetry or magnetic-field bias.

                 
     o:. -+@     
   0Oo:. -+*@#   
  0Ooo:. -+**@#  
  Ooo:.. --+**@  
 ooo:..  ---+**@ 
 :::...   ---+++ 
 .....     ----- 
                 
 -----     ..... 
 +++---   ...::: 
 @**+--- ...:ooo 
  @**+-- ..:ooO  
  #@**+- .:ooO0  
   #@*+- .:oO0   
     *+- .:o     
                 

C.6 $Z_6$ — Astigmatism 0°

$(n=2, m=2)$: 2-fold saddle pattern (four alternating lobes) with extremes along the horizontal and vertical axes. Diagnoses showerhead directionality and 0°/90° pump-position effects.

        #        
     *@@@@@*     
   -++*****++-   
  . --+++++-- .  
  :. ------- .:  
 o:..  ---  ..:o 
 Oo:.       .:oO 
 Oo:..     ..:oO 
0Oo:..     ..:oO0
 Oo:..     ..:oO 
 Oo:.       .:oO 
 o:..  ---  ..:o 
  :. ------- .:  
  . --+++++-- .  
   -++*****++-   
     *@@@@@*     
        #        

C.7 $Z_7$ — Coma Y

$(n=3, m=-1)$ — Y-direction asymmetric tilt with stronger curvature on one side. Diagnoses asymmetric Y flow and pump-position bias.

        #        
     *++-++*     
   *+  ...  +*   
  *- .::o::. -*  
  - .:ooooo:. -  
 - .::ooooo::. - 
 - ..:::::::.. - 
    .........    
                 
    ---------    
 . --+++++++-- . 
 . -++*****++- . 
  . -+*****+- .  
  o. -++*++- .o  
   o:  ---  :o   
     o::.::o     
        0        

C.8 $Z_8$ — Coma X

$(n=3, m=1)$ — X-direction asymmetric tilt. Diagnoses asymmetric X gas inlet/outlet or one-sided chamber hardware bias.

                 
     --   ..     
   *-       .o   
  *- ..   -- .o  
  + .:.. --+- :  
 * .:::. -+++- o 
 + :oo:. -+**+ : 
 +.:oo:. -+**+-: 
#-.ooo:. -+***-.0
 +.:oo:. -+**+-: 
 + :oo:. -+**+ : 
 * .:::. -+++- o 
  + .:.. --+- :  
  *- ..   -- .o  
   *-       .o   
     --   ..     
                 

C.9 $Z_9$ — Trefoil (3-fold)

$(n=3, m=-3)$ — 3-fold pattern repeating every 120°. Marks 3-zone heater non-uniformity or 3-fold chamber hardware effects (3-leg lift pins, 3-port gas).

        0        
     :ooooo:     
   +-..:::..-+   
  @+- ..... -+@  
  @+-  ...  -+@  
 @*+--     --+*@ 
 *+--       --+* 
 ---         --- 
                 
 ...         ... 
 o:..       ..:o 
 Oo:..     ..:oO 
  O:.  ---  .:O  
  O:. ----- .:O  
   :.--+++--.:   
     +*****+     
        #        

C.10 Drift Diagnostic Guide

| Coefficient that suddenly grows | Hypothesized cause |
|---|---|
| $a_1$ (Piston) | Source depletion, deposition time/power shift |
| $a_2, a_3$ (Tilt) | Chuck levelness change, gas inlet position |
| $a_4$ (Defocus) | Center-edge temperature change, RF coupling, showerhead distance |
| $a_5, a_6$ (Astigmatism) | Showerhead directionality, magnetic-field asymmetry, pump position |
| $a_7, a_8$ (Coma) | Asymmetric gas flow, one-sided hardware bias |
| $a_9$ (Trefoil) | 3-zone heater or 3-fold hardware issues |

Appendix D. Why 13 Measurements Map to 9 Coefficients

For a linear measurement model $T = A \cdot a + \varepsilon$ with $m$ measurements and $N$ basis terms, three regimes exist:

| Condition | Name | Result |
|---|---|---|
| $N > m$ | Under-determined | Infinitely many solutions, no unique answer |
| $N = m$ | Exactly determined | Residual = 0, but noise is fitted too (overfitting) |
| $N < m$ | Over-determined | Residual-minimizing LSQ solution exists (recommended) |

For 13 points, the choice of $N$ is constrained:

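| $N$ | Patterns expressible | DOF | Verdict |
|---|---|---|---|
| 4 | Piston, Tilt X/Y, Defocus | 9 | Insufficient (no asymmetry beyond tilt) |
| 6 | + Astigmatism 0°/45° | 7 | Acceptable |
| 9 | + Coma X/Y, Trefoil ($\sin 3\theta$) | 4 | Recommended balance |
| 11 | + Trefoil ($\cos 3\theta$), Spherical | 2 | DOF too low |
| 13 | + Secondary astigmatism | 0 | No residual monitoring possible |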

A DOF of 4 means the residual $\varepsilon$ lies in a 4-dimensional subspace, which is the information channel for residual diagnostics (Section 2.6). The 13 → 9 mapping is therefore not arbitrary: it balances three constraints, namely an over-determined system, sufficient expressiveness, and enough DOF for residual monitoring.

Appendix E. Zernike basis matrix A

The model: $T = A \cdot a + e$

 The Zernike decomposition writes the measured values $T$ (a wavefront in optics, the thickness map here) as a linear combination of Zernike basis functions plus measurement noise $e$ (written $\varepsilon$ in the main text):

$$T = A \cdot a + e$$

 This appendix walks through what each piece of the equation means and what makes $A$ — the Zernike basis matrix — useful.

Symbols

 The pieces of the model carry the following meaning:

| Symbol | Meaning |
|---|---|
| $Z_j(\rho, \theta)$ | The $j$-th Zernike basis function on the unit disk |
| $(\rho_i, \theta_i)$ | Polar coordinates of the $i$-th measurement point |
| $a_j$ | Coefficient (weight) of $Z_j$, the unknown to fit |
| $T_i$ | Measured value at the $i$-th point (e.g. wafer thickness) |
| $e_i$ | Measurement noise at the $i$-th point |

 $Z_j$ is one element of a sequence of fixed shapes defined on the unit disk (the disk of radius $1$). Just as a Fourier series uses sines and cosines as a basis, a Zernike expansion uses $Z_1, Z_2, Z_3, \ldots$ as its basis. The Noll convention numbers the basis functions starting from $j = 1$.

| $j$ | Shape of $Z_j$ | Optical name |
|---|---|---|
| 1 | constant (flat) | Piston |
| 2 | tilt along $x$ | Tilt X |
| 3 | tilt along $y$ | Tilt Y |
| 4 | bowl (center vs. edge) | Defocus |
| $\ldots$ | $\ldots$ | $\ldots$ |

 For the $i$-th measurement point, $\rho_i$ is the radius from the disk center (normalized so that the edge sits at $\rho = 1$) and $\theta_i$ is the angle from the positive $x$-axis (in radians).

Per-point equation

 Written out for one point $i$, the model becomes:

$$T_i = a_1 Z_1(\rho_i, \theta_i) + a_2 Z_2(\rho_i, \theta_i) + \ldots + a_N Z_N(\rho_i, \theta_i) + e_i$$

 Each measurement is decomposed into a weighted sum of $N$ predefined shapes; the unknowns are the weights $a_j$. Stacking the $m$ such per-point equations gives the matrix form $T = A \cdot a + e$.

Structure of $A$ ($m \times N$)

 The entries of $A$ are Zernike-basis evaluations at the measurement points:

$$A[i, j] = Z_j(\rho_i, \theta_i)$$

 That is, $A[i, j]$ is “the value of the $j$-th basis at the $i$-th measurement point”. $A$ has $m$ rows (one per measurement point) and $N$ columns (one per Zernike basis function):

            j=1     j=2     j=3    ...   j=N
        ┌─────────────────────────────────────┐
i=1     │  Z_1    Z_2    Z_3    ...    Z_N    │ ← all basis values at point 1
i=2     │  Z_1    Z_2    Z_3    ...    Z_N    │ ← all basis values at point 2
...     │   …      …      …             …     │
i=m     │  Z_1    Z_2    Z_3    ...    Z_N    │ ← all basis values at point m
        └─────────────────────────────────────┘
          ↑
        column 1 = Z_1 evaluated at every measurement point

Row and column meaning

 The rows and columns of $A$ have very different meanings, and both are useful:

  • Column $j$: one fixed basis function $Z_j$ sampled at all $m$ measurement points. It is a snapshot of the continuous function $Z_j$ taken at the chosen $m$ points.
  • Row $i$: all $N$ basis functions evaluated at one measurement point. These are the values that multiply $a_1, \ldots, a_N$ on the right-hand side of the per-point equation for $T_i$.

 Here expansion means writing a function as a sum of basis functions. The Fourier series, which expands a function as a sum of sines and cosines, is the textbook example. In Zernike land we expand the wavefront $T(\rho, \theta)$ as a sum of $Z_j$.

Noll normalization

 The Noll convention scales each $Z_j$ so that all coefficients $a_j$ are directly comparable. After normalization, on the unit disk:

$$\iint_{\text{disk}} Z_j(\rho, \theta)^2 \, dA = \pi \quad \text{for every } j \quad\text{(unit mean square, since the disk area is } \pi\text{)}$$

$$\iint_{\text{disk}} Z_j \, Z_k \, dA = 0 \quad \text{for } j \neq k \quad\text{(orthogonality)}$$

 The scaling constants depend on the $(n, m)$ pair (radial order $n$, azimuthal frequency $m$) that each $j$ corresponds to:

  • For $m = 0$ terms (purely radial: Piston, Defocus, …): multiply by $\sqrt{n+1}$.
  • For $m \neq 0$ terms (with $\cos$ or $\sin$: Tilt, Astigmatism, …): multiply by $\sqrt{2(n+1)}$.

 The $j \rightarrow (n, m)$ conversion follows Noll’s 1976 algorithm. The benefit of normalization: for $j \geq 2$, $a_j^2$ is term $j$’s contribution to the mean-square deviation of the surface from its mean (the squared RMS), so coefficient magnitudes can be compared at a glance.
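
The index conversion and scaling rule can be sketched as follows. The function names are illustrative; the sign convention (even $j$ gives the cosine term with $m > 0$, odd $j$ gives the sine term with $m < 0$) matches the $(n, m)$ labels used in Appendix C.

```python
import math

def noll_to_nm(j):
    """Convert a Noll index j >= 1 to (n, m): radial order, signed azimuthal frequency."""
    if j < 1:
        raise ValueError("Noll index starts at 1")
    n = 0
    k = j - 1
    while k > n:          # find the radial order whose block contains j
        n += 1
        k -= n
    m_abs = (n % 2) + 2 * ((k + ((n + 1) % 2)) // 2)
    sign = 1 if j % 2 == 0 else -1   # even j: cosine term, odd j: sine term
    return n, sign * m_abs

def noll_norm(j):
    """Normalization factor: sqrt(n+1) for m == 0, sqrt(2(n+1)) otherwise."""
    n, m = noll_to_nm(j)
    return math.sqrt(n + 1) if m == 0 else math.sqrt(2 * (n + 1))

first_nine = [noll_to_nm(j) for j in range(1, 10)]
```

Running the conversion for $j = 10$ and $j = 11$ gives $(3, 3)$ and $(4, 0)$, i.e. the second trefoil orientation and primary spherical, which are the first terms left out of the 9-term fit.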

$A^{\!\top}\!A \approx m \cdot I$ — sampling density and regression stability

 The least-squares solution is $\hat{a} = (A^{\!\top}\!A)^{-1} A^{\!\top}\, T$, so the structure of $A^{\!\top}\!A$ controls the conditioning of the fit. Its $(j, k)$ entry is the inner product of column $j$ and column $k$ of $A$:

$$(A^{\!\top}\!A)_{jk} = \sum_{i=1}^{m} Z_j(\rho_i, \theta_i)\, Z_k(\rho_i, \theta_i)$$

Dense sampling ($m \rightarrow \infty$, with the points uniformly covering the disk): the discrete sum tends to the integral average:

$$\frac{1}{m} \sum_{i=1}^{m} Z_j(\rho_i, \theta_i)\, Z_k(\rho_i, \theta_i) \;\approx\; \frac{1}{\pi} \iint_{\text{disk}} Z_j Z_k \, dA \;=\; \delta_{jk}$$

hence $A^{\!\top}\!A \approx m \cdot I$ — a diagonal matrix with all diagonal entries equal to $m$. In this regime:

  • $(A^{\!\top}\!A)^{-1} \approx I/m$ — the inverse is essentially “multiply by $1/m$”. Numerically very stable.
  • $\hat{a}_j \approx \frac{1}{m} \sum_i Z_j(\rho_i, \theta_i)\, T_i$ — each coefficient is just a $Z_j$-weighted average of the measurements.
  • Condition number $\approx 1$ — measurement noise propagates to the coefficients with virtually no amplification.

Sparse sampling (e.g. the $m = 13$ measurement points used in this project): the integral approximation breaks. $A^{\!\top}\!A$ is no longer diagonal; it is a general $N \times N$ matrix with non-zero off-diagonal entries, and its condition number grows. With $m = 13 \geq N = 9$ the fit is still solvable, but if the condition number becomes too large, ridge regression (adding $\lambda I$ to $A^{\!\top}\!A$) is a standard way to stabilize the solve.
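
The contrast between the two regimes can be checked numerically. The sketch below compares the condition number of $A^{\!\top}\!A / m$ for dense random sampling of the unit disk against the 13-point pattern; the 150 mm normalization radius, the sample size, and the helper names are assumptions for illustration.

```python
import numpy as np

def basis9(rho, th):
    """Noll-normalized Z1..Z9 evaluated at arrays of (rho, theta)."""
    coma = 3.0 * rho**3 - 2.0 * rho
    return np.column_stack([
        np.ones_like(rho),
        2.0 * rho * np.cos(th),
        2.0 * rho * np.sin(th),
        np.sqrt(3.0) * (2.0 * rho**2 - 1.0),
        np.sqrt(6.0) * rho**2 * np.sin(2.0 * th),
        np.sqrt(6.0) * rho**2 * np.cos(2.0 * th),
        np.sqrt(8.0) * coma * np.sin(th),
        np.sqrt(8.0) * coma * np.cos(th),
        np.sqrt(8.0) * rho**3 * np.sin(3.0 * th),
    ])

def gram_cond(rho, th):
    """Condition number of A^T A / m for the given sample points."""
    A = basis9(rho, th)
    return float(np.linalg.cond(A.T @ A / len(rho)))

# Dense case: points uniform in area over the unit disk (rho = sqrt(u)).
rng = np.random.default_rng(0)
m_dense = 50_000
rho_d = np.sqrt(rng.uniform(0.0, 1.0, m_dense))
th_d = rng.uniform(0.0, 2.0 * np.pi, m_dense)

# Sparse case: the 13-point pattern, radius normalized by an assumed 150 mm.
rho_s = np.concatenate([[0.0], np.full(4, 75.0 / 150.0), np.full(8, 145.0 / 150.0)])
th_s = np.concatenate([[0.0],
                       np.radians([0, 90, 180, 270]),
                       np.radians(np.arange(0, 360, 45))])

cond_dense = gram_cond(rho_d, th_d)
cond_sparse = gram_cond(rho_s, th_s)
```

The dense case stays close to 1, while the 13-point pattern gives a noticeably larger value because columns such as piston and defocus are no longer orthogonal on the sample points.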

Appendix F. Least Squares Normal-Equation Derivation

Find the coefficients that minimize squared residual:

$$\hat{a} = \arg\min_{a} \| T - A \cdot a \|^2$$

Expanding the objective $J(a) = \| T - A \cdot a \|^2$:

$$J(a) = (T - A \cdot a)^\top (T - A \cdot a) = T^\top T - 2 \, T^\top A \cdot a + a^\top A^\top A \cdot a$$

Setting $\partial J / \partial a = 0$:

$$-2 \, A^\top T + 2 \, A^\top A \cdot a = 0 \quad \Rightarrow \quad A^\top A \cdot a = A^\top T$$

This is the normal equation. When $A^\top A$ is invertible (i.e., $A$ has full column rank), the solution is:

$$\hat{a} = (A^\top A)^{-1} A^\top T$$

Geometrically, $A \cdot \hat{a}$ is the orthogonal projection of $T$ onto the column space of $A$, so the residual $\varepsilon = T - A \cdot \hat{a}$ satisfies $A^\top \varepsilon = 0$. The Hessian $2 A^\top A$ is positive semi-definite, so the stationary point is a global minimum; when $A$ has full column rank the Hessian is positive definite and the minimizer is unique.

For 13-point fitting, $A$ must have full column rank — the cardinal-aligned 13-point pattern satisfies this for $N=9$. If $A^\top A$ is ill-conditioned, use SVD-based pseudo-inverse or Ridge Regression (Appendix G).
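
The derivation can be checked numerically on any full-column-rank matrix. The sketch below uses a random $13 \times 9$ matrix as a stand-in for the Zernike basis (an assumption made to keep the example short) and compares the normal-equation solution with NumPy's least-squares and pseudo-inverse routines.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(13, 9))   # stand-in for the 13 x 9 basis matrix
T = rng.normal(size=13)        # stand-in for one wafer's measurements

# Normal equations: (A^T A) a = A^T T
a_normal = np.linalg.solve(A.T @ A, A.T @ T)

# Same problem via an SVD-based least-squares solver and via the pseudo-inverse.
a_lstsq, *_ = np.linalg.lstsq(A, T, rcond=None)
a_pinv = np.linalg.pinv(A) @ T

# The residual is orthogonal to every column of A.
resid = T - A @ a_normal
orthogonality = A.T @ resid
```

All three routes give the same coefficients when $A$ is well conditioned; the SVD-based routes are the safer default when it is not, since they avoid forming $A^\top A$ explicitly.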

Appendix G. Ridge Regression Derivation (Tikhonov Regularization)

Pure LSQ becomes unstable when $A^\top A$ is ill-conditioned or measurement noise is large. Hoerl & Kennard (1970) addressed this by adding an L2 penalty to the objective:

$$J_{\text{ridge}}(a) = \| T - A \cdot a \|^2 + \lambda \| a \|^2$$

where $\lambda \geq 0$ is the regularization strength. With $\lambda = 0$ this reduces to LSQ; as $\lambda \to \infty$, $a \to 0$. Expanding:

$$J_{\text{ridge}}(a) = T^\top T - 2 \, T^\top A \cdot a + a^\top (A^\top A + \lambda I) \cdot a$$

Setting the gradient to zero:

$$(A^\top A + \lambda I) \cdot a = A^\top T \quad \Rightarrow \quad \hat{a}_{\text{ridge}} = (A^\top A + \lambda I)^{-1} A^\top T$$

Why this is always invertible: $A^\top A$ is always positive semi-definite, so for any $\lambda > 0$ the matrix $A^\top A + \lambda I$ is strictly positive definite, which guarantees a unique solution even when $A^\top A$ is singular.

| Aspect | LSQ | Ridge |
|---|---|---|
| Singular $A^\top A$ | No unique solution | Unique solution exists |
| Noise sensitivity | High | Low |
| High-order coefficient stability | Unstable | Stable |
| Bias | None | Mild bias introduced |
| 13-point recommendation | Only when noise is very small | Recommended in general |

To choose $\lambda$ for 13-point fitting: collect coefficient distributions from many normal wafers via LSQ, sweep $\lambda \in \{0.001, 0.01, 0.1, 1.0, 10.0\}$, run Leave-One-Out Cross-Validation (LOOCV) by holding out one of the 13 measurements, pick the $\lambda$ minimizing average prediction error, and re-evaluate periodically (every few months).
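
The selection loop can be sketched as below, with several assumptions labeled plainly: the wafers are synthetic, the normalization radius is taken as 150 mm, and the penalty leaves the piston term unpenalized (a generalized Tikhonov variant of the $\lambda I$ formula above, chosen here so that the mean thickness is not shrunk toward zero). Helper names are hypothetical.

```python
import numpy as np

def basis9(rho, th):
    coma = 3.0 * rho**3 - 2.0 * rho
    return np.column_stack([
        np.ones_like(rho), 2.0 * rho * np.cos(th), 2.0 * rho * np.sin(th),
        np.sqrt(3.0) * (2.0 * rho**2 - 1.0),
        np.sqrt(6.0) * rho**2 * np.sin(2.0 * th),
        np.sqrt(6.0) * rho**2 * np.cos(2.0 * th),
        np.sqrt(8.0) * coma * np.sin(th), np.sqrt(8.0) * coma * np.cos(th),
        np.sqrt(8.0) * rho**3 * np.sin(3.0 * th),
    ])

def ridge_fit(A, T, lam):
    """Ridge solve with the piston column left unpenalized (assumption)."""
    D = np.eye(A.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + lam * D, A.T @ T)

def loocv_error(A, wafers, lam):
    """Mean squared error when each of the 13 points is held out in turn."""
    m = A.shape[0]
    sq_err = []
    for T in wafers:
        for i in range(m):
            keep = np.arange(m) != i
            a = ridge_fit(A[keep], T[keep], lam)
            sq_err.append((T[i] - A[i] @ a) ** 2)
    return float(np.mean(sq_err))

# 13-point pattern in normalized polar coordinates (assumed 150 mm radius).
rho = np.concatenate([[0.0], np.full(4, 0.5), np.full(8, 145.0 / 150.0)])
th = np.concatenate([[0.0], np.radians([0, 90, 180, 270]),
                     np.radians(np.arange(0, 360, 45))])
A = basis9(rho, th)

# Synthetic "normal" wafers: hypothetical coefficients plus measurement noise.
rng = np.random.default_rng(0)
a_mean = np.array([1000.0, 3.0, -2.0, 8.0, 1.0, -1.5, 0.5, 0.8, -0.4])
wafers = [A @ (a_mean + rng.normal(0.0, 0.3, 9)) + rng.normal(0.0, 1.0, 13)
          for _ in range(20)]

grid = (0.001, 0.01, 0.1, 1.0, 10.0)
cv = {lam: loocv_error(A, wafers, lam) for lam in grid}
best_lam = min(cv, key=cv.get)
```

The useful range of $\lambda$ depends on the thickness unit and the noise level, so the grid itself is worth revisiting whenever the measurement setup changes.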

References

  • Born, M., & Wolf, E. (1999). Principles of Optics (7th ed.). Cambridge University Press.
  • Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
  • Montgomery, D. C. (2013). Introduction to Statistical Quality Control (7th ed.). Wiley.
  • Noll, R. J. (1976). Zernike polynomials and atmospheric turbulence. Journal of the Optical Society of America, 66(3), 207–211.
  • Wang, J. Y., & Silva, D. E. (1980). Wavefront interpretation with Zernike polynomials. Applied Optics, 19(9), 1510–1518.
  • Zernike, F. (1934). Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode. Physica, 1(7–12), 689–704.