Optuna Metric Projection

A concise report on projecting Optuna’s best-so-far trajectory with four saturation curves. The method estimates the expected best metric after $K$ additional trials (forward) or the trials needed to reach a target $T$ (inverse) without using hyperparameter (HP) coordinates and without the over-optimism that plagues surrogate-based projection.
1. Problem Setting and Motivation
Optuna (Akiba et al., 2019) is a widely used hyperparameter optimization (HPO) framework. After running $N$ trials, two natural questions arise:
- “If we run $K$ more trials, what is the expected best metric?”
- “How many additional trials are needed to reach a target $T$?”
This report fits a saturation curve to the running maximum of the metric and uses it to answer both questions.
1.1 Why Empirical Saturation, Not Surrogate Models
Surrogate-based projection methods — Gaussian Process (GP) or Prior-Fitted Network (PFN) — learn an abstract HP→metric model and then simulate $K$ future trials. Two fundamental problems arise.
Issue 1 — TPE’s active narrowing is not simulated. Optuna’s default sampler, the Tree-structured Parzen Estimator (TPE), concentrates new samples in a narrow good region (about 1–5% of the HP space) once it has converged. GP projection, however, draws $K$ random HP candidates from the entire search space. The two regimes mismatch: surrogate projection includes vast regions TPE will never visit, inflating the predicted max.
Issue 2 — GP epistemic uncertainty inflates max-of-$K$ statistics. GP has small uncertainty where data is dense, but very wide intervals in unexplored regions. The standard Monte Carlo (MC) routine is:
- Draw $K$ HP candidates from the search space.
- Sample each metric from the GP posterior (mean ± std).
- Take the maximum of the $K$ samples.
- Repeat $M$ times; average the maxima.
Candidates that fall in unexplored regions return draws with very large variance. Max-of-$K$ then preferentially selects these high-tail outliers, producing an over-optimistic projection.
Empirical saturation fits the observed trial number → best-so-far series directly. TPE’s active learning is already encoded in the trajectory, and no hypothetical exploration of the HP space is assumed. Both surrogate-side problems are avoided automatically.
1.2 The Best-so-Far Trajectory
Sort completed trials by trial number and track the cumulative maximum of the metric:
trial_num: 1 2 3 4 5 6 ... N
metric: 0.10 0.05 0.18 0.12 0.21 0.15 ... 0.31
|
running max (cumulative)
v
Y_n: 0.10 0.10 0.18 0.18 0.21 0.21 ... 0.31
-------- monotonically non-decreasing ------>
Formally, $Y_n = \max(metric_1, \ldots, metric_n)$. Equivalently, $Y_n = \max(Y_{n-1}, metric_n)$. The sequence of points $(n, Y_n)$ is the best-so-far trajectory.
Universal properties that saturation models fit:
- Monotonically non-decreasing. The running max satisfies $Y_n \le Y_{n+1}$ for every $n$. This is distinct from strictly increasing ($Y_n \lt Y_{n+1}$); the curve stays flat whenever a new trial fails to update the best.
- Diminishing returns. The probability that a new trial improves the running max decreases as $n$ grows.
- Asymptotic ceiling $Y_\infty$. The limit TPE can reach in the given HP space. Estimating $Y_\infty$ is the core target of this report.
2. Four Saturation Models
Saturation denotes a curve that approaches a ceiling and stops growing. Optuna best-so-far curves typically show three phases: rapid early rise, gradual mid-phase slowdown, and asymptotic plateau near $Y_\infty$. The four models below all carry $Y_\infty$ as an explicit parameter but differ in how they approach the ceiling.
The power law is the classical Learning Curve Extrapolation (LCE) form (Mosteller & Tukey, 1977); the remaining three are standard members of the LCE family (Domhan et al., 2015).
Common features: all have 3 parameters (rationale in §3.2), explicit $Y_\infty$, monotone shape, and fit with scipy.optimize.curve_fit.
2.1 Exponential Saturation (exp)
$$Y_n = Y_\infty – a \cdot e^{-n / \tau}$$
Parameters: $Y_\infty$ (ceiling), $a$ (initial gap), $\tau$ (e-folding time). Smooth approach to the ceiling; the gap shrinks by a factor $1/e$ at $n = \tau$. Best fit when TPE converges quickly and only noise remains in the tail.
2.2 Power Law (power)
$$Y_n = Y_\infty – a \cdot n^{-c}$$
Parameters: $Y_\infty$, $a$, $c$ (decay rate, $c \gt 0$). Polynomial decay; larger $c$ means faster convergence. The classical learning curve form used widely in machine learning.
2.3 Hill Function (hill)
$$Y_n = Y_\infty \cdot \frac{n^c}{k^c + n^c}$$
Parameters: $Y_\infty$, $k$ (half-max point, i.e., $Y_n = Y_\infty / 2$ at $n = k$), $c$ (steepness). An S-shape with a slow start, rapid mid-section, and plateau. Fits runs where TPE has a delayed discovery of a productive region.
2.4 Logistic (logistic)
$$Y_n = \frac{Y_\infty}{1 + e^{-a (n – n_0)}}$$
Parameters: $Y_\infty$, $a$ (steepness), $n_0$ (midpoint). A symmetric S-curve with the inflection at $n_0$. Fits trajectories with a clear phase transition.
3. Auto Mode — Model Selection
There is no a-priori best model; the right choice depends on the trajectory shape. The auto mode fits all four candidates and chooses the one with the smallest Akaike Information Criterion (AIC).
3.1 Procedure
For each model M in {exp, power, hill, logistic}:
1. fit via scipy.optimize.curve_fit(M, n_arr, Y_n, p0, bounds)
2. y_pred = M(n_arr, *fitted_params)
3. MSE_M = mean((Y_n - y_pred)^2)
4. AIC_M = N * log(MSE_M) + 2 * k (k = 3 for all)
Selected = argmin AIC
If a model fails to converge it is skipped and the best among successful fits is chosen.
3.2 Selection Criterion — AIC
The AIC (Akaike, 1974) balances fit quality against model complexity. Under a Gaussian residual assumption (derivation in Appendix B):
$$AIC = N \cdot \log(\mathrm{MSE}) + 2 k$$
where $N$ is the number of data points, MSE is the Mean Squared Error, and $k$ is the number of model parameters. The first term rewards low residual error; the second penalizes complexity. A smaller AIC indicates a better model.
Why $k = 3$ for every model. All four saturation models are deliberately constructed with three parameters. This choice (i) gives a fair, non-nested comparison, (ii) avoids overfit (three parameters express the ceiling, curvature, and timescale), and (iii) yields stable nonlinear fits.
Because $2k = 6$ is identical for all four models, the AIC ranking reduces to an MSE ranking (and equivalently to an R² ranking) for this specific case:
$$\arg\min_i AIC_i = \arg\min_i N \log(\mathrm{MSE}_i) = \arg\min_i \mathrm{MSE}_i$$
Even so, AIC is preferred as the standard naming, and the $2k$ penalty becomes active immediately if a 4-parameter model (e.g., Janoschek) is added later. Appendix A shows why R²/MSE is unreliable as soon as $k$ varies between candidates.
3.3 Sample Output
Model selection (AIC = N*log(MSE) + 2k, k=3): exp : MSE=7.74e-06, AIC=-7361.33 <- CHOSEN logistic : MSE=1.22e-05, AIC=-7079.04 hill : MSE=1.99e-05, AIC=-6771.67 power : MSE=2.03e-05, AIC=-6756.43 Chosen model: exp (lowest AIC) Fitted params: Y_inf=0.3218, a=0.0368, tau=338.96
4. Forward and Inverse Projection
Once a saturation curve $Y = f(n; \hat{\theta})$ is fitted, two questions can be answered in opposite directions:
| Direction | Input | Output | Plain-English form |
|---|---|---|---|
| Forward | $K$ (additional trials) | $Y_{N+K}$ | "If we run $K$ more trials, what's the best?" |
| Inverse | $T$ (target metric) | $K_{required}$ | "How many more trials to reach $T$?" |
Forward is a direct evaluation of $f$; inverse is a root-finding problem on the same curve. Three edge cases bracket the inverse:
- $T \le Y_{current}$ — already reached, $K = 0$.
- $Y_{current} \lt T \lt Y_\infty$ — forward and inverse both defined.
- $T \ge Y_\infty$ — unreachable; the inverse diverges.
4.1 Forward
$$Y_{proj} = f(N + K, \hat{\theta})$$
Evaluated for each $K$ in the user-supplied list (commonly 50, 100, 500, 1000, 5000, 10000).
4.2 Inverse
Solve $f(N + K, \hat{\theta}) = T$ for $K$. Each model admits a closed-form inverse:
| Model | Inverse |
|---|---|
| Exponential | $K = -\tau \log\!\left(\frac{Y_\infty - T}{a}\right) - N$ |
| Power law | $K = \left(\frac{a}{Y_\infty - T}\right)^{1/c} - N$ |
| Hill | $K = \left(\frac{k^c \, T}{Y_\infty - T}\right)^{1/c} - N$ |
| Logistic | $K = n_0 - \frac{1}{a} \log\!\left(\frac{Y_\infty - T}{T}\right) - N$ |
In practice, a unified scipy.optimize.brentq call covers all four models with the same dispatcher.
5. Python Code
The full reference implementation. Python 3.8+, depends only on numpy, scipy, and optuna.
5.1 Full Implementation
"""Optuna best-so-far saturation projection.
Importable module + standalone CLI for forward projection
(expected best after K more trials) and inverse projection
(trials needed to reach a target T).
"""
import argparse
import numpy as np
from scipy.optimize import curve_fit, brentq
# --- Four saturation models (all 3-parameter, asymptotic Y_inf) ---
def _model_exp(n, Y_inf, a, tau):
"""Y_n = Y_inf - a * exp(-n / tau)."""
return Y_inf - a * np.exp(-n / tau)
def _model_power(n, Y_inf, a, c):
"""Y_n = Y_inf - a * n^(-c)."""
return Y_inf - a * np.power(n, -c)
def _model_hill(n, Y_inf, k, c):
"""Y_n = Y_inf * n^c / (k^c + n^c)."""
return Y_inf * np.power(n, c) / (np.power(k, c) + np.power(n, c))
def _model_logistic(n, Y_inf, a, n0):
"""Y_n = Y_inf / (1 + exp(-a * (n - n0)))."""
z = np.clip(-a * (n - n0), -50, 50)
return Y_inf / (1 + np.exp(z))
SATURATION_MODELS = {
'exp': {'f': _model_exp, 'pnames': ('Y_inf', 'a', 'tau')},
'power': {'f': _model_power, 'pnames': ('Y_inf', 'a', 'c')},
'hill': {'f': _model_hill, 'pnames': ('Y_inf', 'k', 'c')},
'logistic': {'f': _model_logistic, 'pnames': ('Y_inf', 'a', 'n0')},
}
# --- Fit and auto-selection ---
def fit_saturation_model(name, n_arr, Y_n):
"""Fit one saturation model. Returns a dict with f, popt, mse."""
model = SATURATION_MODELS[name]
f = model['f']
Y_curr = float(Y_n[-1])
gap = max(0.01, Y_curr - float(Y_n[0]))
N = len(n_arr)
if name == 'exp':
p0 = [Y_curr + 0.01, gap, max(50.0, N / 3.0)]
bounds = ([Y_curr, 0.0, 1.0], [np.inf, np.inf, 1e6])
elif name == 'power':
p0 = [Y_curr + 0.01, gap, 0.5]
bounds = ([Y_curr, 0.0, 0.001], [np.inf, np.inf, 5.0])
elif name == 'hill':
p0 = [Y_curr + 0.01, max(1.0, N / 2.0), 1.0]
bounds = ([Y_curr, 1.0, 0.001], [np.inf, 1e6, 10.0])
else: # logistic
p0 = [Y_curr + 0.01, 0.01, max(1.0, N / 2.0)]
bounds = ([Y_curr, 0.0001, -1e3], [np.inf, 10.0, 1e6])
popt, pcov = curve_fit(f, n_arr, Y_n, p0=p0, bounds=bounds,
maxfev=10_000)
mse = float(np.mean((Y_n - f(n_arr, *popt)) ** 2))
return {'f': f, 'popt': popt, 'pcov': pcov, 'mse': mse,
'pnames': model['pnames'], 'name': name}
def select_best_saturation_model(n_arr, Y_n):
"""Fit all four models; return (fits_dict, chosen_name)
with the lowest AIC. Failed fits are skipped."""
fits = {}
for name in SATURATION_MODELS:
try:
fits[name] = fit_saturation_model(name, n_arr, Y_n)
except Exception as e:
print(f' [auto] {name:>9s} fit FAILED: {e}')
if not fits:
raise RuntimeError('All saturation model fits failed')
N, k = len(n_arr), 3
for name in fits:
fits[name]['aic'] = (
N * np.log(fits[name]['mse'] + 1e-12) + 2 * k)
chosen = min(fits, key=lambda nm: fits[nm]['aic'])
return fits, chosen
# --- Forward and inverse projection ---
def project_forward(fit, N, K):
"""Y_(N+K) from the fitted curve."""
return float(fit['f'](N + K, *fit['popt']))
def project_inverse(fit, N, T, K_max=1e8):
"""K such that Y_(N+K) = T. Returns 0 if already reached,
inf if unreachable, otherwise the brentq solution."""
f, popt = fit['f'], fit['popt']
Y_curr = float(f(N, *popt))
Y_inf = float(popt[0])
if T <= Y_curr:
return 0.0
if T >= Y_inf:
return float('inf')
def g(K, _T=T):
return f(N + K, *popt) - _T
if g(K_max) < 0:
return float('inf')
return brentq(g, 0, K_max, xtol=1.0)
# --- End-to-end pipeline ---
def print_extrapolation(study, *,
model='auto',
K_values=None,
target_values=None,
exclude_value_below=None):
"""Fit a saturation curve to study.trials and print forward
and inverse projection tables."""
import optuna
if K_values is None:
K_values = [50, 100, 500, 1000, 5000, 10000]
trials = [t for t in study.trials
if t.state == optuna.trial.TrialState.COMPLETE
and t.value is not None]
if exclude_value_below is not None:
trials = [t for t in trials if t.value > exclude_value_below]
trials.sort(key=lambda t: t.number)
if len(trials) < 10:
raise RuntimeError(
f'need >= 10 COMPLETE trials, got {len(trials)}')
values = np.array([t.value for t in trials], dtype=float)
Y_n = np.maximum.accumulate(values)
n_arr = np.arange(1, len(Y_n) + 1)
N, Y_curr = len(n_arr), float(Y_n[-1])
if model == 'auto':
fits, chosen = select_best_saturation_model(n_arr, Y_n)
for name in sorted(fits, key=lambda nm: fits[nm]['aic']):
tag = ' <- CHOSEN' if name == chosen else ''
print(f' {name:>9s}: MSE={fits[name]["mse"]:.6e}, '
f'AIC={fits[name]["aic"]:+.2f}{tag}')
fit = fits[chosen]
else:
fit = fit_saturation_model(model, n_arr, Y_n)
fit['aic'] = N * np.log(fit['mse'] + 1e-12) + 2 * 3
chosen = model
print(f'\n Forward -- K -> Y_(N+K), model={chosen}')
for K in K_values:
Y = project_forward(fit, N, K)
print(f' K={K:>6d} -> Y={Y:+.4f} (gain {Y - Y_curr:+.4f})')
Y_inf = float(fit['popt'][0])
if target_values is None:
target_values = list(np.linspace(Y_curr + 0.001,
Y_inf - 0.001, 5))
print(f'\n Inverse -- T -> required K, model={chosen}')
for T in target_values:
K = project_inverse(fit, N, T)
K_str = 'inf' if K == float('inf') else f'{K:.0f}'
print(f' T={T:+.4f} -> K={K_str}')
# --- CLI ---
def parse_args():
ap = argparse.ArgumentParser(
description='Optuna best-so-far saturation projection.')
ap.add_argument('--storage', required=True, type=str,
help='Optuna storage URL (e.g., sqlite:///optuna.db)')
ap.add_argument('--study_name', required=True, type=str)
ap.add_argument('--model', default='auto', type=str.lower,
choices=['auto'] + list(SATURATION_MODELS.keys()))
ap.add_argument('--K_values',
default='50,100,500,1000,5000,10000',
type=lambda s: [int(t.strip()) for t in s.split(',')
if t.strip()])
ap.add_argument('--target_values', default=None,
type=lambda s: ([float(t.strip()) for t in s.split(',')
if t.strip()] if s else None))
ap.add_argument('--exclude_value_below', default=None, type=float)
return ap.parse_args()
if __name__ == '__main__':
import optuna
args = parse_args()
study = optuna.load_study(
study_name=args.study_name, storage=args.storage)
print_extrapolation(
study=study,
model=args.model,
K_values=args.K_values,
target_values=args.target_values,
exclude_value_below=args.exclude_value_below)
5.2 Function Reference
| Symbol | Role |
|---|---|
_model_exp / _power / _hill / _logistic | The four saturation forms from §2. |
SATURATION_MODELS | Registry of (function, parameter names). |
fit_saturation_model(name, n_arr, Y_n) | Fit a single model with per-model starting values and bounds; enforces $Y_\infty \gt Y_{current}$. |
select_best_saturation_model(n_arr, Y_n) | Fit all four models, compute AIC, return the dictionary and the chosen name. |
project_forward(fit, N, K) | Direct evaluation $Y_{N+K}$. |
project_inverse(fit, N, T) | Root-finding for required $K$. |
print_extrapolation(study, ...) | End-to-end: load → fit → print both tables. |
parse_args() / __main__ | Command-line entry point. |
5.3 CLI Usage
<pre class="wp-block-syntaxhighlighter-code"># Auto mode (default — fit all four, pick lowest AIC)
python optuna_projection.py \
--storage sqlite:///path/to/optuna.db \
--study_name <study_name> \
--model auto
# Force the exponential model
python optuna_projection.py \
--storage sqlite:///path/to/optuna.db \
--study_name <study_name> \
--model exp \
--target_values 0.32,0.35,0.40
# Cross-check all four models
for m in exp power hill logistic; do
python optuna_projection.py \
--storage sqlite:///path/to/optuna.db \
--study_name <study_name> \
--model $m
done
</pre>
5.4 Example Output
[saturation] N_trials=626, Y_current=+0.3136
Model selection (AIC = N*log(MSE) + 2k, k=3):
exp: MSE=7.74e-06, AIC=-7361.33 <- CHOSEN
logistic: MSE=1.22e-05, AIC=-7079.04
hill: MSE=1.99e-05, AIC=-6771.67
power: MSE=2.03e-05, AIC=-6756.43
Fitted params: Y_inf=0.3218, a=0.0368, tau=338.96
Y_inf (asymptotic ceiling) = +0.3218, gap = +0.0082
Forward -- K -> Y_(N+K) Inverse -- T -> required K
K= 50 -> +0.3168 T=+0.3200 -> K = 400
K= 100 -> +0.3175 T=+0.3500 -> unreachable
K= 1000 -> +0.3215 (T >= Y_inf)
K= 10000 -> +0.3218
6. Limitations and Caveats
Saturation-curve projection is conservative and assumption-light, but it has well-defined failure modes. The tables below separate what the method captures faithfully from what it misses.
6.1 Underlying Assumptions
| Assumption | Meaning | Effect when violated |
|---|---|---|
| Saturation form | The trajectory fits one of the four families. | Out-of-distribution shapes (decreasing, oscillating) yield poor fits across all four. |
| Existence of asymptote | TPE eventually plateaus. | For studies still climbing, $Y_\infty$ estimates are unstable. |
| Fixed HP space | The search space does not change during the run. | Add/remove HPs requires refitting from scratch. |
| Stationary noise | Trial-to-trial variance is homoscedastic. | Heteroscedasticity calls for weighted least squares. |
6.2 Good Projection vs Bad Projection
Good Projection — what is captured well:
| Aspect | Why it works |
|---|---|
| Result of TPE's active narrowing | The best-so-far curve is the product of narrowing; fitting it learns the narrowing directly. |
| Diminishing returns | All four forms are inherently diminishing-returns shapes. |
| Asymptotic ceiling $Y_\infty$ | The fitted asymptote is the natural estimate of TPE's reachable maximum. |
| HP-space dimensionality | HP coordinates are unused, so 5-D and 50-D HP spaces share the same framework. |
Bad Projection — what is missed:
| Limitation | Why it is missed |
|---|---|
| Sudden discovery jumps | A new productive HP region creates a step in best-so-far; smooth saturation curves average it out and underestimate. |
| Effect of HP-space expansion | Fitted curves use historical data only; adding new HPs spawns a new trajectory. |
| Shapes outside the four families | Oscillations, multi-stage saturation, or positive curvature cannot be expressed. |
| Non-Gaussian residual tails | AIC absolute values lose meaning; rankings remain relatively robust. |
How to act on this: trust the projection for "expected $Y_{N+K}$ with $K \le 10000$" but treat the report as silent about sudden jumps or HP-space expansions; those questions require complementary diagnostics.
6.3 Recommended Usage
- Refit every ~100 trials to track drift in $\hat{Y}_\infty$.
- Even in auto mode, sanity-check by running each of the four models manually.
- If $\hat{Y}_\infty - Y_{current}$ falls below a decision threshold (say 0.05), additional trials are unlikely to help; consider structural changes (HP-space redesign, data augmentation, model change).
- Optionally cross-check with a surrogate-based projection (GP) and a PFN-based projection (Adriaensen et al., 2023). The saturation curve is typically the most conservative; surrogate is the most optimistic. Wide disagreement is a sign that an assumption is broken.
Appendix A. Why R² / MSE Fails When k Varies
This appendix explains why R² and MSE alone are unreliable selection criteria as soon as candidate models have different numbers of parameters.
A.1 The Monotone Improvement Property
Suppose $M_3$ has 3 parameters and $M_4$ has 4, with the function space of $M_4$ a superset of $M_3$'s. For least-squares fits on the same data,
$$\mathrm{SSres}(M_4) \le \mathrm{SSres}(M_3) \;\Rightarrow\; \mathrm{MSE}(M_4) \le \mathrm{MSE}(M_3) \;\Rightarrow\; R^2(M_4) \ge R^2(M_3).$$
Adding parameters therefore never worsens R²/MSE; equality holds only when the extra parameter is exactly zero. Picking by R² alone always favors the more complex model.
A.2 Polynomial Regression Illustration
Fit polynomials of degree 1, 3, and 9 to $N = 10$ noisy samples whose true generating function is quadratic. Compare AIC against R²:
| Model | $k$ | MSE | R² | $AIC = N \log(\mathrm{MSE}) + 2k$ | R² rank | AIC rank |
|---|---|---|---|---|---|---|
| Linear | 2 | 0.030 | 0.70 | $10 \ln(0.030)+4 = -31.1$ | worst | worst |
| Cubic | 4 | 0.005 | 0.95 | $10 \ln(0.005)+8 = -45.0$ | middle | best |
| Degree 9 | 10 | 0.003 | 0.98 | $10 \ln(0.003)+20 = -38.1$ | best | middle |
R²/MSE picks the degree-9 polynomial (an overfit; it captures the noise). AIC picks the cubic (correct: the true generator is quadratic). The trade-off is quantitative:
| Comparison | $N \log$ improvement | $2k$ penalty change | $\Delta$AIC |
|---|---|---|---|
| Cubic → Degree 9 | $10 \ln(0.003/0.005) = -5.1$ | $+12$ | $+6.9$ (worse) |
| Linear → Cubic | $10 \ln(0.005/0.030) = -17.9$ | $+4$ | $-13.9$ (better) |
Each extra parameter must improve $\log(\mathrm{MSE})$ by at least $2/N$ for AIC to accept it. R²/MSE has no such rule, which is why it slides toward overfit.
A.3 Same-k Comparisons Are Safe
The monotone improvement property requires nested models with different $k$. When all candidates have the same $k$ and belong to different function families — as in our four saturation models — R², MSE, and AIC produce identical rankings. The penalty matters only when a 4-parameter model (e.g., Janoschek) joins the candidate set.
Appendix B. AIC Under Gaussian Residuals — Derivation
The general AIC (Akaike, 1974) is
$$AIC = -2 \log \hat{L} + 2 k,$$
where $\hat{L}$ is the maximum value of the likelihood. The compact form $AIC = N \log(\mathrm{MSE}) + 2k$ used in §3.2 follows once Gaussian residuals are assumed and the additive constant is dropped.
B.1 What is the Likelihood $L$?
The likelihood $L$ is the probability density of the observed data $\{(x_i, y_i)\}_{i=1}^N$ under given model parameters $(\theta, \sigma^2)$. Informally: "If these parameters are correct, how plausible is the data we actually observed?"
B.2 Setup and Notation
Regression model: $y_i = f(x_i; \theta) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent and identically distributed (i.i.d.). Let $\varphi(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\!\big(-(x-\mu)^2 / (2\sigma^2)\big)$ denote the Gaussian density.
B.3 Step 1 — Density of a Single Observation
Conditional on $x_i, \theta, \sigma^2$, the response $y_i$ is normally distributed with mean $f(x_i; \theta)$ and variance $\sigma^2$. Its density is
$$p(y_i \mid x_i, \theta, \sigma^2) = \varphi\!\big(y_i - f(x_i; \theta);\, 0,\, \sigma^2\big),$$
i.e., the Gaussian density evaluated at the residual $r_i = y_i - f(x_i; \theta)$.
B.4 Step 2 — Joint Likelihood
Independence converts joint probability to a product of single-observation densities:
$$L(\theta, \sigma^2) = \prod_{i=1}^N \varphi\!\big(y_i - f(x_i; \theta);\, 0,\, \sigma^2\big).$$
B.5 Step 3 — Take the Log
A product over $N$ small numbers underflows to zero in floating-point; taking the logarithm converts it to a sum and simplifies differentiation. Because $\log$ is monotonic, $\arg\max L = \arg\max \log L$. Hence
$$\log L(\theta, \sigma^2) = \sum_{i=1}^N \log \varphi\!\big(y_i - f(x_i; \theta);\, 0,\, \sigma^2\big).$$
B.6 Step 4 — Expand the Gaussian
Substituting the Gaussian density and writing $r_i = y_i - f(x_i; \theta)$:
$$\log L(\theta, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N r_i^2.$$
The three terms are, respectively, a model-independent constant, a precision term, and a goodness-of-fit term proportional to the sum of squared residuals.
B.7 MLE and AIC Formula
Maximum Likelihood Estimation (MLE) maximizes $\log L$. The MLE for $\theta$ coincides with the Least Squares Estimate (LSE), and the MLE for $\sigma^2$ is
$$\hat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; \hat{\theta}))^2 = \mathrm{MSE}.$$
Plugging the MLEs into $\log L$ and then into $AIC = -2 \log \hat{L} + 2 k$:
$$AIC = N \log(\mathrm{MSE}) + N (\log(2\pi) + 1) + 2k.$$
The term $N(\log(2\pi) + 1)$ is identical across all candidate models and drops out of any ranking. Hence the operational form
$$AIC \approx N \log(\mathrm{MSE}) + 2k$$
recovers the §3.2 formula.
B.8 When the Gaussian Assumption Breaks
- Absolute AIC values lose interpretability for heavy-tailed or skewed residuals; rankings remain reasonably robust.
- The small-sample-corrected variant AICc adds $2k(k+1)/(N - k - 1)$ and is recommended when $N \lt 40k$.
- The Bayesian Information Criterion (BIC) uses penalty $k \log(N)$, which is stricter for large $N$.
- Cross-validation is distribution-free but costlier than AIC.
Our best-so-far series is monotone and therefore its residuals are not strictly Gaussian (one tail is truncated). Empirically, the order-of-magnitude AIC ranking is unaffected.
References
- Adriaensen, S., Rakotoarison, H., Müller, S., & Hutter, F. (2023). Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks. NeurIPS 2023.
- Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of KDD 2019.
- Domhan, T., Springenberg, J. T., & Hutter, F. (2015). Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. IJCAI 2015.
- Mosteller, F., & Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley.
