A concise report on projecting Optuna’s best-so-far trajectory with four saturation curves. The method estimates the expected best metric after $K$ additional trials (forward) or the trials needed to reach a target $T$ (inverse) without using hyperparameter (HP) coordinates and without the over-optimism that plagues surrogate-based projection.

1. Problem Setting and Motivation

Optuna (Akiba et al., 2019) is a widely used hyperparameter optimization (HPO) framework. After running $N$ trials, two natural questions arise:

“If we run $K$ more trials, what is the expected best metric?”
“How many additional trials are needed to reach a target $T$?”

This report fits a saturation curve to the running maximum of the metric and uses it to answer both questions.

1.1 Why Empirical Saturation, Not Surrogate Models

Surrogate-based projection methods — Gaussian Process (GP) or Prior-Fitted Network (PFN) — learn an abstract HP→metric model and then simulate $K$ future trials. Two fundamental problems arise.

Issue 1 — TPE’s active narrowing is not simulated. Optuna’s default sampler, the Tree-structured Parzen Estimator (TPE), concentrates new samples in a narrow good region (about 1–5% of the HP space) once it has converged. GP projection, however, draws $K$ random HP candidates from the entire search space. The two regimes mismatch: surrogate projection includes vast regions TPE will never visit, inflating the predicted max.

Issue 2 — GP epistemic uncertainty inflates max-of-$K$ statistics. GP has small uncertainty where data is dense, but very wide intervals in unexplored regions. The standard Monte Carlo (MC) routine is:

Draw $K$ HP candidates from the search space.
Sample each metric from the GP posterior (mean ± std).
Take the maximum of the $K$ samples.
Repeat $M$ times; average the maxima.

Candidates that fall in unexplored regions return draws with very large variance. Max-of-$K$ then preferentially selects these high-tail outliers, producing an over-optimistic projection.

Empirical saturation fits the observed trial number → best-so-far series directly. TPE’s active learning is already encoded in the trajectory, and no hypothetical exploration of the HP space is assumed. Both surrogate-side problems are avoided automatically.

1.2 The Best-so-Far Trajectory

Sort completed trials by trial number and track the cumulative maximum of the metric:

trial_num:   1     2     3     4     5     6   ...  N
metric:    0.10  0.05  0.18  0.12  0.21  0.15  ... 0.31
                                              |
                              running max (cumulative)
                                              v
Y_n:       0.10  0.10  0.18  0.18  0.21  0.21  ... 0.31
            -------- monotonically non-decreasing ------>

Formally, $Y_n = \max(metric_1, \ldots, metric_n)$. Equivalently, $Y_n = \max(Y_{n-1}, metric_n)$. The sequence of points $(n, Y_n)$ is the best-so-far trajectory.

Universal properties that saturation models fit:

Monotonically non-decreasing. The running max satisfies $Y_n \le Y_{n+1}$ for every $n$. This is distinct from strictly increasing ($Y_n \lt Y_{n+1}$); the curve stays flat whenever a new trial fails to update the best.
Diminishing returns. The probability that a new trial improves the running max decreases as $n$ grows.
Asymptotic ceiling $Y_\infty$. The limit TPE can reach in the given HP space. Estimating $Y_\infty$ is the core target of this report.

2. Four Saturation Models

Saturation denotes a curve that approaches a ceiling and stops growing. Optuna best-so-far curves typically show three phases: rapid early rise, gradual mid-phase slowdown, and asymptotic plateau near $Y_\infty$. The four models below all carry $Y_\infty$ as an explicit parameter but differ in how they approach the ceiling.

The power law is the classical Learning Curve Extrapolation (LCE) form (Mosteller & Tukey, 1977); the remaining three are standard members of the LCE family (Domhan et al., 2015).

Common features: all have 3 parameters (rationale in §3.2), explicit $Y_\infty$, monotone shape, and fit with scipy.optimize.curve_fit.

2.1 Exponential Saturation (exp)

$$Y_n = Y_\infty – a \cdot e^{-n / \tau}$$

Parameters: $Y_\infty$ (ceiling), $a$ (initial gap), $\tau$ (e-folding time). Smooth approach to the ceiling; the gap shrinks by a factor $1/e$ at $n = \tau$. Best fit when TPE converges quickly and only noise remains in the tail.

2.2 Power Law (power)

$$Y_n = Y_\infty – a \cdot n^{-c}$$

Parameters: $Y_\infty$, $a$, $c$ (decay rate, $c \gt 0$). Polynomial decay; larger $c$ means faster convergence. The classical learning curve form used widely in machine learning.

2.3 Hill Function (hill)

$$Y_n = Y_\infty \cdot \frac{n^c}{k^c + n^c}$$

Parameters: $Y_\infty$, $k$ (half-max point, i.e., $Y_n = Y_\infty / 2$ at $n = k$), $c$ (steepness). An S-shape with a slow start, rapid mid-section, and plateau. Fits runs where TPE has a delayed discovery of a productive region.

2.4 Logistic (logistic)

$$Y_n = \frac{Y_\infty}{1 + e^{-a (n – n_0)}}$$

Parameters: $Y_\infty$, $a$ (steepness), $n_0$ (midpoint). A symmetric S-curve with the inflection at $n_0$. Fits trajectories with a clear phase transition.

3. Auto Mode — Model Selection

There is no a-priori best model; the right choice depends on the trajectory shape. The auto mode fits all four candidates and chooses the one with the smallest Akaike Information Criterion (AIC).

3.1 Procedure

For each model M in {exp, power, hill, logistic}:
    1. fit via scipy.optimize.curve_fit(M, n_arr, Y_n, p0, bounds)
    2. y_pred = M(n_arr, *fitted_params)
    3. MSE_M  = mean((Y_n - y_pred)^2)
    4. AIC_M  = N * log(MSE_M) + 2 * k       (k = 3 for all)

Selected = argmin AIC

If a model fails to converge it is skipped and the best among successful fits is chosen.

3.2 Selection Criterion — AIC

The AIC (Akaike, 1974) balances fit quality against model complexity. Under a Gaussian residual assumption (derivation in Appendix B):

$$AIC = N \cdot \log(\mathrm{MSE}) + 2 k$$

where $N$ is the number of data points, MSE is the Mean Squared Error, and $k$ is the number of model parameters. The first term rewards low residual error; the second penalizes complexity. A smaller AIC indicates a better model.

Why $k = 3$ for every model. All four saturation models are deliberately constructed with three parameters. This choice (i) gives a fair, non-nested comparison, (ii) avoids overfit (three parameters express the ceiling, curvature, and timescale), and (iii) yields stable nonlinear fits.

Because $2k = 6$ is identical for all four models, the AIC ranking reduces to an MSE ranking (and equivalently to an R² ranking) for this specific case:

$$\arg\min_i AIC_i = \arg\min_i N \log(\mathrm{MSE}_i) = \arg\min_i \mathrm{MSE}_i$$

Even so, AIC is preferred as the standard naming, and the $2k$ penalty becomes active immediately if a 4-parameter model (e.g., Janoschek) is added later. Appendix A shows why R²/MSE is unreliable as soon as $k$ varies between candidates.

3.3 Sample Output

Model selection (AIC = N*log(MSE) + 2k, k=3):
  exp       : MSE=7.74e-06, AIC=-7361.33   <- CHOSEN
  logistic  : MSE=1.22e-05, AIC=-7079.04
  hill      : MSE=1.99e-05, AIC=-6771.67
  power     : MSE=2.03e-05, AIC=-6756.43

Chosen model: exp (lowest AIC)
Fitted params: Y_inf=0.3218, a=0.0368, tau=338.96

4. Forward and Inverse Projection

Once a saturation curve $Y = f(n; \hat{\theta})$ is fitted, two questions can be answered in opposite directions:

Direction	Input	Output	Plain-English form
Forward	$K$ (additional trials)	$Y_{N+K}$	"If we run $K$ more trials, what's the best?"
Inverse	$T$ (target metric)	$K_{required}$	"How many more trials to reach $T$?"

Forward is a direct evaluation of $f$; inverse is a root-finding problem on the same curve. Three edge cases bracket the inverse:

$T \le Y_{current}$ — already reached, $K = 0$.
$Y_{current} \lt T \lt Y_\infty$ — forward and inverse both defined.
$T \ge Y_\infty$ — unreachable; the inverse diverges.

4.1 Forward

$$Y_{proj} = f(N + K, \hat{\theta})$$

Evaluated for each $K$ in the user-supplied list (commonly 50, 100, 500, 1000, 5000, 10000).

4.2 Inverse

Solve $f(N + K, \hat{\theta}) = T$ for $K$. Each model admits a closed-form inverse:

Model	Inverse
Exponential	$K = -\tau \log\!\left(\frac{Y_\infty - T}{a}\right) - N$
Power law	$K = \left(\frac{a}{Y_\infty - T}\right)^{1/c} - N$
Hill	$K = \left(\frac{k^c \, T}{Y_\infty - T}\right)^{1/c} - N$
Logistic	$K = n_0 - \frac{1}{a} \log\!\left(\frac{Y_\infty - T}{T}\right) - N$

In practice, a unified scipy.optimize.brentq call covers all four models with the same dispatcher.

5. Python Code

The full reference implementation. Python 3.8+, depends only on numpy, scipy, and optuna.

5.1 Full Implementation

"""Optuna best-so-far saturation projection.

Importable module + standalone CLI for forward projection
(expected best after K more trials) and inverse projection
(trials needed to reach a target T).
"""
import argparse
import numpy as np
from scipy.optimize import curve_fit, brentq


# --- Four saturation models (all 3-parameter, asymptotic Y_inf) ---

def _model_exp(n, Y_inf, a, tau):
    """Y_n = Y_inf - a * exp(-n / tau)."""
    return Y_inf - a * np.exp(-n / tau)


def _model_power(n, Y_inf, a, c):
    """Y_n = Y_inf - a * n^(-c)."""
    return Y_inf - a * np.power(n, -c)


def _model_hill(n, Y_inf, k, c):
    """Y_n = Y_inf * n^c / (k^c + n^c)."""
    return Y_inf * np.power(n, c) / (np.power(k, c) + np.power(n, c))


def _model_logistic(n, Y_inf, a, n0):
    """Y_n = Y_inf / (1 + exp(-a * (n - n0)))."""
    z = np.clip(-a * (n - n0), -50, 50)
    return Y_inf / (1 + np.exp(z))


SATURATION_MODELS = {
    'exp':      {'f': _model_exp,      'pnames': ('Y_inf', 'a', 'tau')},
    'power':    {'f': _model_power,    'pnames': ('Y_inf', 'a', 'c')},
    'hill':     {'f': _model_hill,     'pnames': ('Y_inf', 'k', 'c')},
    'logistic': {'f': _model_logistic, 'pnames': ('Y_inf', 'a', 'n0')},
}


# --- Fit and auto-selection ---

def fit_saturation_model(name, n_arr, Y_n):
    """Fit one saturation model. Returns a dict with f, popt, mse."""
    model = SATURATION_MODELS[name]
    f = model['f']
    Y_curr = float(Y_n[-1])
    gap = max(0.01, Y_curr - float(Y_n[0]))
    N = len(n_arr)
    if name == 'exp':
        p0 = [Y_curr + 0.01, gap, max(50.0, N / 3.0)]
        bounds = ([Y_curr, 0.0, 1.0], [np.inf, np.inf, 1e6])
    elif name == 'power':
        p0 = [Y_curr + 0.01, gap, 0.5]
        bounds = ([Y_curr, 0.0, 0.001], [np.inf, np.inf, 5.0])
    elif name == 'hill':
        p0 = [Y_curr + 0.01, max(1.0, N / 2.0), 1.0]
        bounds = ([Y_curr, 1.0, 0.001], [np.inf, 1e6, 10.0])
    else:  # logistic
        p0 = [Y_curr + 0.01, 0.01, max(1.0, N / 2.0)]
        bounds = ([Y_curr, 0.0001, -1e3], [np.inf, 10.0, 1e6])
    popt, pcov = curve_fit(f, n_arr, Y_n, p0=p0, bounds=bounds,
                           maxfev=10_000)
    mse = float(np.mean((Y_n - f(n_arr, *popt)) ** 2))
    return {'f': f, 'popt': popt, 'pcov': pcov, 'mse': mse,
            'pnames': model['pnames'], 'name': name}


def select_best_saturation_model(n_arr, Y_n):
    """Fit all four models; return (fits_dict, chosen_name)
    with the lowest AIC. Failed fits are skipped."""
    fits = {}
    for name in SATURATION_MODELS:
        try:
            fits[name] = fit_saturation_model(name, n_arr, Y_n)
        except Exception as e:
            print(f'  [auto] {name:>9s} fit FAILED: {e}')
    if not fits:
        raise RuntimeError('All saturation model fits failed')
    N, k = len(n_arr), 3
    for name in fits:
        fits[name]['aic'] = (
            N * np.log(fits[name]['mse'] + 1e-12) + 2 * k)
    chosen = min(fits, key=lambda nm: fits[nm]['aic'])
    return fits, chosen


# --- Forward and inverse projection ---

def project_forward(fit, N, K):
    """Y_(N+K) from the fitted curve."""
    return float(fit['f'](N + K, *fit['popt']))


def project_inverse(fit, N, T, K_max=1e8):
    """K such that Y_(N+K) = T. Returns 0 if already reached,
    inf if unreachable, otherwise the brentq solution."""
    f, popt = fit['f'], fit['popt']
    Y_curr = float(f(N, *popt))
    Y_inf = float(popt[0])
    if T <= Y_curr:
        return 0.0
    if T >= Y_inf:
        return float('inf')
    def g(K, _T=T):
        return f(N + K, *popt) - _T
    if g(K_max) < 0:
        return float('inf')
    return brentq(g, 0, K_max, xtol=1.0)


# --- End-to-end pipeline ---

def print_extrapolation(study, *,
                        model='auto',
                        K_values=None,
                        target_values=None,
                        exclude_value_below=None):
    """Fit a saturation curve to study.trials and print forward
    and inverse projection tables."""
    import optuna
    if K_values is None:
        K_values = [50, 100, 500, 1000, 5000, 10000]
    trials = [t for t in study.trials
              if t.state == optuna.trial.TrialState.COMPLETE
              and t.value is not None]
    if exclude_value_below is not None:
        trials = [t for t in trials if t.value > exclude_value_below]
    trials.sort(key=lambda t: t.number)
    if len(trials) < 10:
        raise RuntimeError(
            f'need >= 10 COMPLETE trials, got {len(trials)}')

    values = np.array([t.value for t in trials], dtype=float)
    Y_n = np.maximum.accumulate(values)
    n_arr = np.arange(1, len(Y_n) + 1)
    N, Y_curr = len(n_arr), float(Y_n[-1])

    if model == 'auto':
        fits, chosen = select_best_saturation_model(n_arr, Y_n)
        for name in sorted(fits, key=lambda nm: fits[nm]['aic']):
            tag = '  <- CHOSEN' if name == chosen else ''
            print(f'    {name:>9s}: MSE={fits[name]["mse"]:.6e}, '
                  f'AIC={fits[name]["aic"]:+.2f}{tag}')
        fit = fits[chosen]
    else:
        fit = fit_saturation_model(model, n_arr, Y_n)
        fit['aic'] = N * np.log(fit['mse'] + 1e-12) + 2 * 3
        chosen = model

    print(f'\n  Forward -- K -> Y_(N+K), model={chosen}')
    for K in K_values:
        Y = project_forward(fit, N, K)
        print(f'    K={K:>6d}  ->  Y={Y:+.4f}  (gain {Y - Y_curr:+.4f})')

    Y_inf = float(fit['popt'][0])
    if target_values is None:
        target_values = list(np.linspace(Y_curr + 0.001,
                                         Y_inf - 0.001, 5))
    print(f'\n  Inverse -- T -> required K, model={chosen}')
    for T in target_values:
        K = project_inverse(fit, N, T)
        K_str = 'inf' if K == float('inf') else f'{K:.0f}'
        print(f'    T={T:+.4f}  ->  K={K_str}')


# --- CLI ---

def parse_args():
    ap = argparse.ArgumentParser(
        description='Optuna best-so-far saturation projection.')
    ap.add_argument('--storage', required=True, type=str,
                    help='Optuna storage URL (e.g., sqlite:///optuna.db)')
    ap.add_argument('--study_name', required=True, type=str)
    ap.add_argument('--model', default='auto', type=str.lower,
                    choices=['auto'] + list(SATURATION_MODELS.keys()))
    ap.add_argument('--K_values',
                    default='50,100,500,1000,5000,10000',
                    type=lambda s: [int(t.strip()) for t in s.split(',')
                                    if t.strip()])
    ap.add_argument('--target_values', default=None,
                    type=lambda s: ([float(t.strip()) for t in s.split(',')
                                     if t.strip()] if s else None))
    ap.add_argument('--exclude_value_below', default=None, type=float)
    return ap.parse_args()


if __name__ == '__main__':
    import optuna
    args = parse_args()
    study = optuna.load_study(
        study_name=args.study_name, storage=args.storage)
    print_extrapolation(
        study=study,
        model=args.model,
        K_values=args.K_values,
        target_values=args.target_values,
        exclude_value_below=args.exclude_value_below)

5.2 Function Reference

Symbol	Role
`_model_exp / _power / _hill / _logistic`	The four saturation forms from §2.
`SATURATION_MODELS`	Registry of (function, parameter names).
`fit_saturation_model(name, n_arr, Y_n)`	Fit a single model with per-model starting values and bounds; enforces $Y_\infty \gt Y_{current}$.
`select_best_saturation_model(n_arr, Y_n)`	Fit all four models, compute AIC, return the dictionary and the chosen name.
`project_forward(fit, N, K)`	Direct evaluation $Y_{N+K}$.
`project_inverse(fit, N, T)`	Root-finding for required $K$.
`print_extrapolation(study, ...)`	End-to-end: load → fit → print both tables.
`parse_args()` / `__main__`	Command-line entry point.

5.3 CLI Usage

<pre class="wp-block-syntaxhighlighter-code"># Auto mode (default — fit all four, pick lowest AIC)
python optuna_projection.py \
    --storage sqlite:///path/to/optuna.db \
    --study_name <study_name> \
    --model auto
# Force the exponential model
python optuna_projection.py \
    --storage sqlite:///path/to/optuna.db \
    --study_name <study_name> \
    --model exp \
    --target_values 0.32,0.35,0.40
# Cross-check all four models
for m in exp power hill logistic; do
    python optuna_projection.py \
        --storage sqlite:///path/to/optuna.db \
        --study_name <study_name> \
        --model $m
done
</pre>

5.4 Example Output

[saturation] N_trials=626, Y_current=+0.3136
  Model selection (AIC = N*log(MSE) + 2k, k=3):
          exp: MSE=7.74e-06, AIC=-7361.33   <- CHOSEN
     logistic: MSE=1.22e-05, AIC=-7079.04
         hill: MSE=1.99e-05, AIC=-6771.67
        power: MSE=2.03e-05, AIC=-6756.43
  Fitted params: Y_inf=0.3218, a=0.0368, tau=338.96
  Y_inf (asymptotic ceiling) = +0.3218, gap = +0.0082

  Forward -- K -> Y_(N+K)        Inverse -- T -> required K
    K=    50  ->  +0.3168          T=+0.3200  ->  K = 400
    K=   100  ->  +0.3175          T=+0.3500  ->  unreachable
    K=  1000  ->  +0.3215          (T >= Y_inf)
    K= 10000  ->  +0.3218

6. Limitations and Caveats

Saturation-curve projection is conservative and assumption-light, but it has well-defined failure modes. The tables below separate what the method captures faithfully from what it misses.

6.1 Underlying Assumptions

Assumption	Meaning	Effect when violated
Saturation form	The trajectory fits one of the four families.	Out-of-distribution shapes (decreasing, oscillating) yield poor fits across all four.
Existence of asymptote	TPE eventually plateaus.	For studies still climbing, $Y_\infty$ estimates are unstable.
Fixed HP space	The search space does not change during the run.	Add/remove HPs requires refitting from scratch.
Stationary noise	Trial-to-trial variance is homoscedastic.	Heteroscedasticity calls for weighted least squares.

6.2 Good Projection vs Bad Projection

Good Projection — what is captured well:

Aspect	Why it works
Result of TPE's active narrowing	The best-so-far curve is the product of narrowing; fitting it learns the narrowing directly.
Diminishing returns	All four forms are inherently diminishing-returns shapes.
Asymptotic ceiling $Y_\infty$	The fitted asymptote is the natural estimate of TPE's reachable maximum.
HP-space dimensionality	HP coordinates are unused, so 5-D and 50-D HP spaces share the same framework.

Bad Projection — what is missed:

Limitation	Why it is missed
Sudden discovery jumps	A new productive HP region creates a step in best-so-far; smooth saturation curves average it out and underestimate.
Effect of HP-space expansion	Fitted curves use historical data only; adding new HPs spawns a new trajectory.
Shapes outside the four families	Oscillations, multi-stage saturation, or positive curvature cannot be expressed.
Non-Gaussian residual tails	AIC absolute values lose meaning; rankings remain relatively robust.

How to act on this: trust the projection for "expected $Y_{N+K}$ with $K \le 10000$" but treat the report as silent about sudden jumps or HP-space expansions; those questions require complementary diagnostics.

6.3 Recommended Usage

Refit every ~100 trials to track drift in $\hat{Y}_\infty$.
Even in auto mode, sanity-check by running each of the four models manually.
If $\hat{Y}_\infty - Y_{current}$ falls below a decision threshold (say 0.05), additional trials are unlikely to help; consider structural changes (HP-space redesign, data augmentation, model change).
Optionally cross-check with a surrogate-based projection (GP) and a PFN-based projection (Adriaensen et al., 2023). The saturation curve is typically the most conservative; surrogate is the most optimistic. Wide disagreement is a sign that an assumption is broken.

Appendix A. Why R² / MSE Fails When k Varies

This appendix explains why R² and MSE alone are unreliable selection criteria as soon as candidate models have different numbers of parameters.

A.1 The Monotone Improvement Property

Suppose $M_3$ has 3 parameters and $M_4$ has 4, with the function space of $M_4$ a superset of $M_3$'s. For least-squares fits on the same data,

$$\mathrm{SSres}(M_4) \le \mathrm{SSres}(M_3) \;\Rightarrow\; \mathrm{MSE}(M_4) \le \mathrm{MSE}(M_3) \;\Rightarrow\; R^2(M_4) \ge R^2(M_3).$$

Adding parameters therefore never worsens R²/MSE; equality holds only when the extra parameter is exactly zero. Picking by R² alone always favors the more complex model.

A.2 Polynomial Regression Illustration

Fit polynomials of degree 1, 3, and 9 to $N = 10$ noisy samples whose true generating function is quadratic. Compare AIC against R²:

Model	$k$	MSE	R²	$AIC = N \log(\mathrm{MSE}) + 2k$	R² rank	AIC rank
Linear	2	0.030	0.70	$10 \ln(0.030)+4 = -31.1$	worst	worst
Cubic	4	0.005	0.95	$10 \ln(0.005)+8 = -45.0$	middle	best
Degree 9	10	0.003	0.98	$10 \ln(0.003)+20 = -38.1$	best	middle

R²/MSE picks the degree-9 polynomial (an overfit; it captures the noise). AIC picks the cubic (correct: the true generator is quadratic). The trade-off is quantitative:

Comparison	$N \log$ improvement	$2k$ penalty change	$\Delta$AIC
Cubic → Degree 9	$10 \ln(0.003/0.005) = -5.1$	$+12$	$+6.9$ (worse)
Linear → Cubic	$10 \ln(0.005/0.030) = -17.9$	$+4$	$-13.9$ (better)

Each extra parameter must improve $\log(\mathrm{MSE})$ by at least $2/N$ for AIC to accept it. R²/MSE has no such rule, which is why it slides toward overfit.

A.3 Same-k Comparisons Are Safe

The monotone improvement property requires nested models with different $k$. When all candidates have the same $k$ and belong to different function families — as in our four saturation models — R², MSE, and AIC produce identical rankings. The penalty matters only when a 4-parameter model (e.g., Janoschek) joins the candidate set.

Appendix B. AIC Under Gaussian Residuals — Derivation

The general AIC (Akaike, 1974) is

$$AIC = -2 \log \hat{L} + 2 k,$$

where $\hat{L}$ is the maximum value of the likelihood. The compact form $AIC = N \log(\mathrm{MSE}) + 2k$ used in §3.2 follows once Gaussian residuals are assumed and the additive constant is dropped.

B.1 What is the Likelihood $L$?

The likelihood $L$ is the probability density of the observed data $\{(x_i, y_i)\}_{i=1}^N$ under given model parameters $(\theta, \sigma^2)$. Informally: "If these parameters are correct, how plausible is the data we actually observed?"

B.2 Setup and Notation

Regression model: $y_i = f(x_i; \theta) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent and identically distributed (i.i.d.). Let $\varphi(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\!\big(-(x-\mu)^2 / (2\sigma^2)\big)$ denote the Gaussian density.

B.3 Step 1 — Density of a Single Observation

Conditional on $x_i, \theta, \sigma^2$, the response $y_i$ is normally distributed with mean $f(x_i; \theta)$ and variance $\sigma^2$. Its density is

$$p(y_i \mid x_i, \theta, \sigma^2) = \varphi\!\big(y_i - f(x_i; \theta);\, 0,\, \sigma^2\big),$$

i.e., the Gaussian density evaluated at the residual $r_i = y_i - f(x_i; \theta)$.

B.4 Step 2 — Joint Likelihood

Independence converts joint probability to a product of single-observation densities:

$$L(\theta, \sigma^2) = \prod_{i=1}^N \varphi\!\big(y_i - f(x_i; \theta);\, 0,\, \sigma^2\big).$$

B.5 Step 3 — Take the Log

A product over $N$ small numbers underflows to zero in floating-point; taking the logarithm converts it to a sum and simplifies differentiation. Because $\log$ is monotonic, $\arg\max L = \arg\max \log L$. Hence

$$\log L(\theta, \sigma^2) = \sum_{i=1}^N \log \varphi\!\big(y_i - f(x_i; \theta);\, 0,\, \sigma^2\big).$$

B.6 Step 4 — Expand the Gaussian

Substituting the Gaussian density and writing $r_i = y_i - f(x_i; \theta)$:

$$\log L(\theta, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N r_i^2.$$

The three terms are, respectively, a model-independent constant, a precision term, and a goodness-of-fit term proportional to the sum of squared residuals.

B.7 MLE and AIC Formula

Maximum Likelihood Estimation (MLE) maximizes $\log L$. The MLE for $\theta$ coincides with the Least Squares Estimate (LSE), and the MLE for $\sigma^2$ is

$$\hat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; \hat{\theta}))^2 = \mathrm{MSE}.$$

Plugging the MLEs into $\log L$ and then into $AIC = -2 \log \hat{L} + 2 k$:

$$AIC = N \log(\mathrm{MSE}) + N (\log(2\pi) + 1) + 2k.$$

The term $N(\log(2\pi) + 1)$ is identical across all candidate models and drops out of any ranking. Hence the operational form

$$AIC \approx N \log(\mathrm{MSE}) + 2k$$

recovers the §3.2 formula.

B.8 When the Gaussian Assumption Breaks

Absolute AIC values lose interpretability for heavy-tailed or skewed residuals; rankings remain reasonably robust.
The small-sample-corrected variant AICc adds $2k(k+1)/(N - k - 1)$ and is recommended when $N \lt 40k$.
The Bayesian Information Criterion (BIC) uses penalty $k \log(N)$, which is stricter for large $N$.
Cross-validation is distribution-free but costlier than AIC.

Our best-so-far series is monotone and therefore its residuals are not strictly Gaussian (one tail is truncated). Empirically, the order-of-magnitude AIC ranking is unaffected.

References

Adriaensen, S., Rakotoarison, H., Müller, S., & Hutter, F. (2023). Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks. NeurIPS 2023.
Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of KDD 2019.
Domhan, T., Springenberg, J. T., & Hutter, F. (2015). Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. IJCAI 2015.
Mosteller, F., & Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley.

Our Score

Click to rate this post!

[Total: 1 Average: 4]

Visited 13 times, 1 visit(s) today