What is Bayesian Optimization?
Home / Forums / AI & ML: Learn It Yourself / Machine Learning / What is Bayesian Optimization?
- This topic has 7 replies, 1 voice, and was last updated 3 months ago by
Wolf.
-
AuthorPosts
-
Understanding Bayesian Optimization
When training AI models, we often face a “black box” problem: evaluating a specific set of hyperparameters (like learning rate or batch size) is incredibly expensive in terms of time and compute. Bayesian Optimization is a strategy designed to find the global optimum of such functions with as few evaluations as possible.
1. The Core Intuition
Imagine you are looking for the highest point on a foggy mountain range. You can only see the ground exactly where you stand. You could walk randomly (Random Search) or check every meter (Grid Search), but that takes forever.
Instead, you:
1. Build a map of where you’ve already been.
2. Estimate where the peak might be based on that map (and how uncertain you are about unexplored areas).
3. Decide where to step next to gain the most information or find a higher point.
2. The Two Main Components
Bayesian Optimization relies on two mathematical pillars:
A. The Surrogate Model
Since the objective function $f(x)$ is expensive to evaluate, we use a probabilistic model to “stand in” for it. The most common choice is a Gaussian Process (GP). A GP doesn’t just give a single prediction; it gives a probability distribution (mean and variance) for every point in the search space.
We define the prior distribution over functions and update it with observed data $D = { (x_1, y_1), …, (x_t, y_t) }$ to get a posterior distribution:
$$P(f | D) \propto P(D | f) P(f)$$
B. The Acquisition Function
This is the “logic” used to pick the next point $x_{t+1}$. It balances two competing goals:
* Exploitation: Looking in areas where the surrogate model predicts a high value.
* Exploration: Looking in areas where uncertainty is high (we haven’t checked there yet).Common functions include Expected Improvement (EI):
$$EI(x) = E[\max(f(x) – f(x_{\ast}), 0)]$$
where $f(x_{\ast})$ is the current best observed value.
3. The Algorithm Step-by-Step
Step Action Description 1 Initialize Evaluate the objective function $f(x)$ at a few random points. 2 Update Fit the Surrogate Model (e.g., GP) to all current observations. 3 Optimize Find $x_{next}$ by maximizing the Acquisition Function. 4 Evaluate Run the actual expensive function $f(x_{next})$ to get $y_{next}$. 5 Repeat Add the new data to $D$ and loop until the budget is exhausted.
4. Mathematical Representation
In a Gaussian Process, we represent the relationship between points using a kernel function $K$. If we want to predict the value $f_{\ast}$ at a new point $x_{\ast}$, the joint distribution is:
$$\begin{pmatrix} y \cr f_{\ast} \end{pmatrix} \sim N \left( 0, \begin{pmatrix} K(X, X) + \sigma^2 I & K(X, x_{\ast}) \cr K(x_{\ast}, X) & K(x_{\ast}, x_{\ast}) \end{pmatrix} \right)$$
The predictive mean $\bar{f}_{\ast}$ and variance $V_{\ast}$ are then calculated to update our “map” of the function.
Summary for the AI Learner
Bayesian Optimization is essentially informed trial and error. It uses “Bayesian” logic—updating beliefs as new evidence arrives—to navigate high-dimensional spaces efficiently. It is the gold standard for hyperparameter tuning (AutoML) because it treats the training process as a budget that must be spent wisely.
The Python Implementation
Each time the code runs an iteration, it updates the Surrogate Model (the Gaussian Process). You can visualize the model getting smarter with every point sampled.
Python from bayes_opt import BayesianOptimization import numpy as np # 1. Define the "Black Box" function we want to maximize # In reality, this could be a machine learning model training loop def black_box_function(x, y): # This is just a mathematical hill with a peak at (x=2, y=3) return -1 * (x - 2)**2 - (y - 3)**2 + 10 # 2. Define the search space (the range for our variables) pbounds = {'x': (0, 4), 'y': (0, 5)} # 3. Initialize the Optimizer # We use a Gaussian Process as the surrogate model by default optimizer = BayesianOptimization( f=black_box_function, pbounds=pbounds, verbose=2, # 2 prints the steps, 1 only prints the best, 0 is silent random_state=1, ) # 4. Run the Optimization # init_points: How many random steps to take first (Exploration) # n_iter: How many Bayesian steps to take (Exploitation) optimizer.maximize( init_points=2, n_iter=10, ) # 5. Get the best result print("--- Result ---") print(f"Best parameters: {optimizer.max['params']}") print(f"Best value found: {optimizer.max['target']}")Stdout
| iter | target | x | y |
————————————————-
| 1 | 9.528 | 1.668 | 3.602 |
| 2 | 3.787 | 0.0004575 | 1.512 |
| 3 | 6.216 | 0.4351 | 1.844 |
| 4 | 9.613 | 1.415 | 3.213 |
| 5 | 9.288 | 2.649 | 2.461 |
| 6 | 5.842 | 3.724 | 4.09 |
| 7 | 5.908 | 0.459 | 4.31 |
| 8 | 9.307 | 1.753 | 2.205 |
| 9 | 4.449 | 3.095 | 0.9138 |
| 10 | 9.925 | 2.251 | 3.112 |
| 11 | 5.966 | 2.186 | 5.0 |
| 12 | 5.764 | 4.0 | 2.514 |
=================================================
— Result —
Best parameters: {‘x’: 2.250982156847981, ‘y’: 3.111795304557062}
Best value found: 9.92450976682293-
This reply was modified 3 months ago by
Wolf.
-
This reply was modified 3 months ago by
Wolf.
-
This reply was modified 3 months ago by
yRocket.
-
This reply was modified 3 months ago by
yRocket.
-
This reply was modified 3 months ago by
yRocket.
-
This reply was modified 3 months ago by
yRocket.
-
This reply was modified 3 months ago by
yRocket.
-
This reply was modified 3 months ago by
yRocket.
Global Minimum Search: Bayesian Optimization (BO) vs. Gradient Descent (GD)
For an AI learner, understanding the difference between these two is about understanding information. GD uses local “slope” information, while BO uses global “uncertainty” information.
1. Conceptual Framework
Gradient Descent (GD) is a local search algorithm. It calculates the derivative of the loss function $f$ at the current point $x_{n}$ and moves in the direction of the steepest descent.
$$x_{n+1} = x_{n} – \eta \nabla f(x_{n})$$
If the landscape has multiple valleys, GD will simply fall into the closest one.Bayesian Optimization (BO) is a global search strategy. It treats the objective function as a random variable and maintains a posterior distribution over possible functions. It doesn’t just ask “where is the slope pointing?” but “where is the best point likely to be, given everything I’ve seen so far?”
2. Comparison Table
Feature Gradient Descent (GD) Bayesian Optimization (BO) Search Scope Local (Point-to-point) Global (Area-to-area) Information Used Gradient (First-order derivative) Surrogate Model & Acquisition Function Global Minima Capability High risk of trapping in Local Minima High capability via Exploration Function Type Must be differentiable ($f \in C^{1}$) Black-box (No derivative needed) Computational Cost Low per iteration | High total (many steps) High per iteration | Low total (few steps)
3. The Mathematics of Global Search
Gradient Descent and Local Minima
GD relies on the local Taylor expansion. Because it only sees the immediate neighborhood, it converges to a point $x_{\ast}$ where $\nabla f(x_{\ast}) = 0$. In non-convex optimization, there is no guarantee that $f(x_{\ast})$ is the global minimum.
Bayesian Optimization and the Surrogate
BO uses a Gaussian Process (GP) to model the function. For any input $x$, the GP provides a mean $\mu(x)$ and a variance $\sigma^{2}(x)$.
The search for the global minimum is guided by an Acquisition Function, such as Expected Improvement (EI):
$$EI(x) = E[max(f(x_{best}) – f(x), 0)]$$
By evaluating points where the variance $\sigma(x)$ is high, BO explicitly forces the search to leave local valleys and explore unknown territory, effectively “jumping” out of local minima.
4. When to Use Which?
- Use Gradient Descent when you have millions of parameters (like a Neural Network) and the function is “cheap” to evaluate or you have an analytical gradient.
- Use Bayesian Optimization when the function is a “Black Box,” evaluation is extremely “expensive” (e.g., training a model for 10 hours), and you need to find the global best hyperparameters $x_{\ast}$ within a few dozen trials.
The Role of the Kernel in Bayesian Optimization
In Bayesian Optimization, we don’t just treat points as isolated data. We assume the function is “smooth.” The Kernel function (or Covariance function) $k(x, x’)$ is the mathematical engine that defines this smoothness, allowing the model to “share” information from a tested point to its neighbors.
1. The RBF (Radial Basis Function) Kernel
The RBF kernel (also known as the Squared Exponential kernel) is the most popular choice. It assumes that if two points $x$ and $x’$ are close in the input space, their function values $f(x)$ and $f(x’)$ are highly correlated.
The formula is defined as:
$$k(x, x’) = \sigma_{f}^{2} \exp\left( -\frac{|x – x’|^{2}}{2l^{2}} \right)$$Where:
* $\sigma_{f}^{2}$ (Signal Variance): Controls the vertical scale (how much the function fluctuates).
* $l$ (Lengthscale): Controls the horizontal scale (how far the influence of a point reaches).
2. How Information is Shared
When we use a Gaussian Process (GP) as our surrogate model, we define a Covariance Matrix $K$ for a set of points ${x_{1}, …, x_{n}}$:
$$K = \begin{pmatrix}
k(x_{1}, x_{1}) & k(x_{1}, x_{2}) & \cdots & k(x_{1}, x_{n}) \cr
k(x_{2}, x_{1}) & k(x_{2}, x_{2}) & \cdots & k(x_{2}, x_{n}) \cr
\vdots & \vdots & \ddots & \vdots \cr
k(x_{n}, x_{1}) & k(x_{n}, x_{2}) & \cdots & k(x_{n}, x_{n})
\end{pmatrix}$$The Mechanism:
1. If you evaluate the function at $x_{1}$ and find a high value, the Kernel tells the model: “Because $x_{2}$ is near $x_{1}$, its value is likely high too.”
2. The correlation decreases exponentially as the distance $|x – x’|$ increases.
3. This creates the “smooth” hills and valleys in the surrogate model, allowing BO to predict values in unexplored regions.
3. Comparing Kernels
Different kernels allow BO to share information in different ways:
Kernel Name Formula Concept Behavior RBF Exponential of squared distance Very smooth, infinitely differentiable. Matérn Incorporates Bessel functions Less smooth; better for modeling physical processes with “rougher” changes. Periodic $k(x, x’) = \exp(-\frac{2\sin^{2}(\pi|x-x’|/p)}{l^{2}})$ Shares info across repeating patterns (e.g., seasonal sales).
4. Summary for the Learner
The Kernel is the Prior Knowledge you give to the AI.
* A short lengthscale ($l$) means “only trust data points that are very close.”
* A long lengthscale ($l$) means “one data point tells me a lot about a wide area of the map.”By tuning these kernel parameters (often via Maximum Likelihood Estimation), the Bayesian Optimizer learns exactly how much it can generalize from each expensive test run.
Is Bayesian Optimization (BO) Robust to a Lack of Data?
The short answer is yes. In fact, robustness to small datasets is the primary reason researchers choose Bayesian Optimization over other methods. While Deep Learning requires thousands of data points, BO is specifically designed to perform well with as few as 10 to 50 samples.
1. Why BO Excels with Sparse Data
BO handles “data poverty” through three specific mathematical mechanisms:
A. The Power of the Prior
In traditional statistics, if you have no data, you know nothing. In BO, you start with a Prior (usually a Gaussian Process).
* The Prior defines your assumptions about the function’s “smoothness” and “variance” before any testing begins.
* Even with zero data points, the GP provides a baseline expectation across the entire search space.B. Quantifying “What We Don’t Know”
Most models (like Linear Regression or Neural Networks) provide a single point prediction. If data is sparse, these models often “overfit” or give wildly confident but wrong answers.
BO provides a Mean ($\mu$) and Uncertainty ($\sigma$). When data is missing in a specific region, the uncertainty $\sigma$ naturally increases.C. Informed Exploration
Because BO knows where its data is “thin,” the Acquisition Function can purposefully target those empty regions. It doesn’t guess randomly; it mathematically identifies the point that will provide the most information gain.
2. The Limits of Robustness
While BO is robust, “lack of data” can still cause issues if the search space is too large. This is known as the Curse of Dimensionality.
Scenario Robustness Level Why? Low Dim (1-5 variables) | Low Data Very High GP can easily map the correlations and find the peak. High Dim (20+ variables) | Low Data Low The volume of the search space grows exponentially; 10 points in a 20D space is like 10 drops of water in an ocean. Noisy Data | Low Data Medium The GP can filter noise, but with very few points, it may struggle to distinguish noise from the true signal.
3. How to Improve Robustness with Tiny Datasets
If you are forced to work with extremely limited data (e.g., only 5-10 trials), you can “help” the BO algorithm by:
- Choosing a Strong Kernel: Using a Matérn Kernel instead of RBF can be more robust if you expect the function to have sudden changes rather than perfect smoothness.
- Narrowing Bounds: Don’t search from 0 to 1,000 if you know the answer is likely between 10 and 20.
- Hyperparameter Priors: Instead of letting the GP “learn” the lengthscale $l$ from scratch, you can provide a “Prior” for the lengthscale based on domain knowledge.
4. Summary for the Learner
Bayesian Optimization is the gold standard for small-data optimization. It doesn’t just “survive” a lack of data; it uses the lack of data (uncertainty) as a compass to find the global minimum more efficiently than any other method.
Calculating Uncertainty ($\sigma$) in Gaussian Processes
The “Uncertainty” in Bayesian Optimization comes from the Conditional Variance of the Gaussian Process. When we observe data, the GP uses the Kernel’s covariance matrix to “pinch” the uncertainty at those points, while letting it grow in unobserved regions.
1. Defining the Components
Assume we have already tested $n$ points, which we call our training set $X$. We now want to predict the value and uncertainty at a new, untested point $x_{\ast}$.
We use three components derived from our Kernel function $k(x, x’)$:
* $K$: The $n \times n$ covariance matrix of the training points $X$.
* $K_{\ast}$: An $n \times 1$ vector of covariances between the training points $X$ and the new point $x_{\ast}$.
* $K_{\ast\ast}$: The scalar covariance of the new point $x_{\ast}$ with itself (the “prior” variance).
2. The Uncertainty Formula
The uncertainty (variance) at the new point, denoted as $\sigma^{2}(x_{\ast})$, is calculated by subtracting the “information we gained” from the “initial uncertainty.”
$$\sigma^{2}(x_{\ast}) = K_{\ast\ast} – K_{\ast}^{T} K^{-1} K_{\ast}$$
Breaking down the math:
- $K_{\ast\ast}$ (Prior Uncertainty): This is the maximum uncertainty we have about any point before seeing data. For an RBF kernel, this is usually $\sigma_{f}^{2}$.
- $K_{\ast}^{T} K^{-1} K_{\ast}$ (Information Gain): This term represents how much the points we’ve already seen ($X$) tell us about the new point ($x_{\ast}$).
- If $x_{\ast}$ is very close to a training point, $K_{\ast}$ will have high values.
- The subtraction will be large, making the resulting $\sigma^{2}(x_{\ast})$ very small (near zero).
- If $x_{\ast}$ is very far from all training points, $K_{\ast}$ will be near zero, and the uncertainty will remain high (near $K_{\ast\ast}$).
3. Visualizing the “Pinch”
When we calculate this for every possible $x_{\ast}$ across the search space, we get the famous “confidence envelope” seen in GP plots.
- At Training Points: $\sigma(x)$ drops to 0 (or the level of noise $\sigma_{n}^{2}$ if specified).
- Between Points: $\sigma(x)$ arches upward like a bridge, representing the growing uncertainty as we move away from known data.
4. Why this matters for BO
The Acquisition Function uses this specific $\sigma(x_{\ast})$ to decide where to go next. For example, in Upper Confidence Bound (UCB), the score is:
$$UCB(x) = \mu(x) + \kappa \sigma(x)$$- If $\sigma(x)$ is high, the score increases, forcing the “AI” to go and explore that area to reduce its ignorance.
- This calculation is the exact reason why BO is robust to a lack of data; it knows exactly how much it doesn’t know.
Incorporating Noise ($\epsilon$) into the Matrix Calculation
In real-world applications (like lab experiments or noisy sensor data), we rarely get the “perfect” value. Every time we measure $y$, we are actually seeing the true function value $f(x)$ plus some random noise $\epsilon$.
Mathematically, we model this as:
$$y = f(x) + \epsilon, \quad \epsilon \sim N(0, \sigma_{n}^{2})$$To make Bayesian Optimization robust to this messiness, we must adjust our Covariance Matrix calculation.
1. The Regularized Covariance Matrix
When data is noisy, we no longer want the Surrogate Model to pass exactly through every data point (which would be overfitting the noise). Instead, we add a “noise term” to the diagonal of our training covariance matrix $K$.
The noisy covariance matrix $\tilde{K}$ is defined as:
$$\tilde{K} = K + \sigma_{n}^{2} I$$Where:
* $K$: The original kernel covariance matrix.
* $\sigma_{n}^{2}$: The variance of the noise (how “messy” the data is).
* $I$: The Identity Matrix (a matrix with 1s on the diagonal and 0s elsewhere).$$ \tilde{K} = \begin{pmatrix}
k(x_{1}, x_{1}) + \sigma_{n}^{2} & k(x_{1}, x_{2}) & \cdots \cr
k(x_{2}, x_{1}) & k(x_{2}, x_{2}) + \sigma_{n}^{2} & \cdots \cr
\vdots & \vdots & \ddots
\end{pmatrix} $$
2. Updated Uncertainty with Noise
When we predict the uncertainty $\sigma^{2}(x_{\ast})$ at a new point using this noisy matrix, the formula becomes:
$$\sigma^{2}(x_{\ast}) = K_{\ast\ast} – K_{\ast}^{T} (K + \sigma_{n}^{2}I)^{-1} K_{\ast}$$
What changes for the AI learner?
1. Non-Zero Uncertainty at Observed Points: In a noiseless GP, $\sigma$ drops to exactly 0 at a tested point. With noise, the uncertainty at a tested point stays slightly above zero because the model knows the measurement itself might be slightly off.
2. Smoothing Effect: The mean prediction $\mu(x_{\ast})$ no longer has to “touch” every blue dot. It creates a smooth path that averages the noise, leading to more stable optimization.
3. Impact on the Acquisition Function
In noisy environments, the Acquisition Function becomes more “cautious.”
* Without noise, BO might find a huge “spike” in the data and assume it found the global maximum.
* With noise modeling, BO realizes that a single high point might just be a “lucky” noise artifact. It will often require multiple samples in a promising area to “confirm” that the peak is real.
4. Summary of the “Identity Matrix” Trick
By adding $\sigma_{n}^{2}$ to the diagonal (often called Tikhonov regularization or a Nugget term):
* We prevent numerical instability (it makes the matrix easier to invert).
* We tell the AI: “Trust the general trend, not the individual points.”
* We ensure the global minimum search isn’t distracted by outliers.
Optimizing the Model: Maximum Likelihood Estimation (MLE)
In Bayesian Optimization, we don’t just “guess” the noise $\sigma_{n}^{2}$ or the kernel lengthscale $l$. We find the values that make our observed data most probable. This process is called tuning the hyperparameters of the Gaussian Process.
1. The Marginal Log-Likelihood (MLL)
To find the best parameters (collectively called $\theta$), we maximize the Marginal Log-Likelihood. This function tells us how well our GP model explains the data we have already collected.
For a set of observations $y$ at points $X$, the Log-Likelihood is defined as:
$$\log p(y|X, \theta) = \underbrace{-\frac{1}{2} y^{T} \tilde{K}^{-1} y}_{\text{Data Fit}} – \underbrace{\frac{1}{2} \log |\tilde{K}|}_{\text{Complexity Penalty}} – \underbrace{\frac{n}{2} \log 2\pi}_{\text{Constant}}$$
Where $\tilde{K} = K_{\theta} + \sigma_{n}^{2}I$.
2. The Tug-of-War: Fit vs. Complexity
The MLE formula is a beautiful balancing act between two competing forces:
A. The Data Fit Term ($-\frac{1}{2} y^{T} \tilde{K}^{-1} y$)
This term rewards parameters that make the model pass close to the data points.
* If the lengthscale $l$ is very short, the model can “wiggle” to hit every point perfectly, making this term very high.B. The Complexity Penalty ($-\frac{1}{2} \log |\tilde{K}|$)
This term (the determinant of the covariance matrix) penalizes models that are too complex or “wiggly.”
* It prefers a simple, smooth line over a jagged one.
* As $l$ gets smaller (more complex), this penalty grows.
3. How the Optimizer Finds $\sigma_{n}^{2}$ and $l$
The AI learner should visualize the MLE process as a separate “mini-optimization” inside the BO loop:
- Start: Pick initial values for $l$ and $\sigma_{n}^{2}$.
- Calculate: Compute the Log-Likelihood using the formula above.
- Gradient Ascent: Calculate the derivative of the Likelihood with respect to the parameters and “climb” the hill to find the peak.
- Update: Set the GP to use these optimized parameters for the next prediction.
4. Why MLE makes BO Robust
By using MLE, the Bayesian Optimizer self-corrects:
* If the data is noisy: The MLE will naturally increase $\sigma_{n}^{2}$ to avoid the “Complexity Penalty” of trying to fit every noise spike.
* If the function is simple: The MLE will increase the lengthscale $l$, allowing the model to share information across much larger distances.Summary for the Learner
Parameter If too small… If too large… Lengthscale ($l$) Overfits (too wiggly) Underfits (too flat) Noise ($\sigma_{n}^{2}$) Mistrusts the trend (sees noise as signal) Ignores the data (sees signal as noise) -
This reply was modified 3 months ago by
-
AuthorPosts
- You must be logged in to reply to this topic.
