Marginals and Conditionals of MVN Distribution

This topic has 3 replies, 1 voice, and was last updated 3 months ago by Wolf.

Author

Posts

February 16, 2026 at 1:49 pm #5463

Understanding Marginals and Conditionals in Multivariate Normal (MVN) Distributions

When working with Multivariate Normal distributions, we often deal with high-dimensional data. Understanding how to “slice” this data (conditionals) or “collapse” it (marginals) is fundamental for Bayesian inference and Gaussian Processes.

The Setup: Partitioning the Distribution

Consider a random vector $X$ following an MVN distribution, partitioned into two subsets of variables, $X_1$ and $X_2$:

$$X = \begin{pmatrix} X_1 \cr X_2 \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mu_1 \cr \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \cr \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$

$\mu_1, \mu_2$: Mean vectors for the respective partitions.
$\Sigma_{11}, \Sigma_{22}$: Covariance matrices within each partition.
$\Sigma_{12}, \Sigma_{21}$: Cross-covariance matrices (where $\Sigma_{21} = \Sigma_{12}^T$).

1. Marginal Distributions

The marginal distribution tells us how one subset of variables behaves regardless of the others. A beautiful property of the MVN is that the marginal of a Gaussian is simply another Gaussian.

To find the marginal distribution of $X_1$, we effectively “ignore” $X_2$. We simply extract the corresponding mean and covariance blocks:

$$p(X_1) = \int p(X_1, X_2) dX_2 = \mathcal{N}(X_1 | \mu_1, \Sigma_{11})$$

Key Insight: You don’t need to perform any complex integration; you just look at the relevant sub-matrix and sub-vector.
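As a rough illustration of the "just extract the blocks" idea, here is a minimal NumPy sketch. The 3-D mean vector, covariance matrix, and the choice of which coordinates form $X_1$ are made-up values for demonstration only; the sampling step is just a sanity check, not part of the method.

```python
import numpy as np

# Hypothetical 3-D joint Gaussian: X1 = first two coordinates, X2 = the last one.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
])

idx1 = [0, 1]  # indices that make up X1

# Marginalizing out X2 is plain sub-vector / sub-matrix extraction.
mu_1 = mu[idx1]
Sigma_11 = Sigma[np.ix_(idx1, idx1)]

# Sanity check by simulation: sample the joint, then simply drop the X2 column.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
empirical_cov = np.cov(samples[:, idx1], rowvar=False)

print(mu_1)
print(Sigma_11)
print(empirical_cov)  # should be close to Sigma_11, up to sampling noise
```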

2. Conditional Distributions

The conditional distribution describes the behavior of $X_1$ given that we have observed specific values for $X_2$. This is the “slicing” of the joint density. The result is still a Gaussian:

$$p(X_1 | X_2 = x_2) = \mathcal{N}(X_1 | \bar{\mu}, \bar{\Sigma})$$

The updated mean and covariance are calculated using the following formulas:

Feature	Formula
Conditional Mean	$$\bar{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$$
Conditional Covariance	$$\bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$

Intuition for AI Learners:

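The Mean: The new mean $\bar{\mu}$ starts at the original $\mu_1$ and is adjusted based on the error $(x_2 - \mu_2)$, scaled by the matrix of regression coefficients $\Sigma_{12} \Sigma_{22}^{-1}$, which measures how strongly $X_1$ co-varies with $X_2$ relative to the variance of $X_2$.
The Covariance: The new covariance $\bar{\Sigma}$ is the original $\Sigma_{11}$ minus the positive semidefinite term $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$. This makes sense: observing $X_2$ provides information about $X_1$, thereby reducing our uncertainty (or leaving it unchanged when $\Sigma_{12} = 0$). The resulting matrix $\bar{\Sigma}$ is known as the Schur complement of $\Sigma_{22}$ in $\Sigma$. Note that it does not depend on the observed value $x_2$.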
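The two formulas above could be wrapped in a small helper along these lines. This is only a sketch: the function name `condition_mvn`, its argument layout, and the example numbers are my own assumptions, and it uses `np.linalg.solve` rather than an explicit inverse purely as a numerical nicety.

```python
import numpy as np

def condition_mvn(mu, Sigma, idx1, idx2, x2):
    """Return (mu_bar, Sigma_bar) of p(X1 | X2 = x2) for a joint N(mu, Sigma).

    idx1 / idx2 are lists of indices selecting X1 and X2 (hypothetical API).
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))

    mu_1, mu_2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]

    # gain = S12 @ inv(S22); computed via a linear solve, using S22 = S22^T.
    gain = np.linalg.solve(S22, S12.T).T

    mu_bar = mu_1 + gain @ (x2 - mu_2)   # conditional mean
    Sigma_bar = S11 - gain @ S12.T       # Schur complement of S22 (S21 = S12^T)
    return mu_bar, Sigma_bar

# Made-up 3-D example: condition the first two coordinates on the third.
mu = [1.0, 2.0, 3.0]
Sigma = [[4.0, 1.0, 0.5],
         [1.0, 3.0, 0.2],
         [0.5, 0.2, 2.0]]
mu_bar, Sigma_bar = condition_mvn(mu, Sigma, idx1=[0, 1], idx2=[2], x2=[4.0])
print(mu_bar)
print(Sigma_bar)
```

Notice that `Sigma_bar` is computed without ever touching `x2`: the amount of uncertainty removed is the same whatever value was observed.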

Comparison Summary

Property	Marginal ($X_1$)	Conditional ($X_1 \mid X_2$)
Operation	Projection / Collapsing	Slicing / Observing
Complexity	Extremely Simple (extraction)	Requires Matrix Inversion ($\Sigma_{22}^{-1}$)
Uncertainty	Same as original ($\Sigma_{11}$)	Reduced (via Schur Complement)

Wolf

February 16, 2026 at 1:52 pm #5464

Numerical Example: 2D Multivariate Normal

Let’s walk through a concrete example. Imagine we have a 2D Gaussian distribution representing the relationship between a student’s Math Score ($X_1$) and Physics Score ($X_2$).

1. The Joint Distribution

Suppose the scores are distributed as follows:

$$X = \begin{pmatrix} X_1 \cr X_2 \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 70 \cr 75 \end{pmatrix}, \begin{pmatrix} 100 & 80 \cr 80 & 100 \end{pmatrix} \right)$$

Here, the mean math score is 70, the mean physics score is 75, and they have a strong positive covariance of 80, which corresponds to a correlation of $80 / \sqrt{100 \times 100} = 0.8$ (indicating that students who do well in math usually do well in physics).

2. Finding the Marginal

If we only care about the Math Score ($X_1$) and ignore the physics score entirely, we look at the first row/column of our parameters:

Marginal Mean: $\mu_1 = 70$
Marginal Variance: $\Sigma_{11} = 100$

Thus, $p(X_1) = \mathcal{N}(70, 100)$. The standard deviation is $\sqrt{100} = 10$.

3. Finding the Conditional

Now, suppose we observe that a specific student scored a 90 in Physics ($x_2 = 90$). How does this change our prediction for their Math Score ($X_1 \mid X_2 = 90$)?

Step A: Calculate the New Mean ($\bar{\mu}$)

We use the formula: $\bar{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$

Difference from mean: $(x_2 - \mu_2) = (90 - 75) = 15$.
Regression Coefficient: $\Sigma_{12} \Sigma_{22}^{-1} = 80 / 100 = 0.8$.
Update: $\bar{\mu} = 70 + 0.8(15) = 70 + 12 = 82$.

The expected math score jumped from 70 to 82 because we saw a high physics score.

Step B: Calculate the New Variance ($\bar{\Sigma}$)

We use the formula: $\bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$

Reduction Term: $80 \times (1/100) \times 80 = 64$.
Update: $\bar{\Sigma} = 100 - 64 = 36$.

The uncertainty (variance) dropped from 100 to 36. Our new standard deviation is only 6, meaning we are much more confident in our prediction of 82 than we were before we knew the physics score.

Summary of Results

Distribution	Mean ($\mu$)	Variance ($\Sigma$)	Interpretation
Marginal ($X_1$)	70	100	General population math performance.
Conditional ($X_1 \mid X_2 = 90$)	82	36	Math performance for high-achieving physics students.

Notice how the conditional distribution is “tighter” and shifted toward the observed evidence.
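For anyone who wants to double-check the arithmetic, the numbers in this example can be reproduced with a few lines of NumPy. The variable names are my own; the last line uses the standard bivariate identity that the conditional variance equals $\Sigma_{11}(1 - \rho^2)$, where $\rho$ is the correlation.

```python
import numpy as np

mu = np.array([70.0, 75.0])            # [math, physics]
Sigma = np.array([[100.0, 80.0],
                  [80.0, 100.0]])
x2 = 90.0                              # observed physics score

coef = Sigma[0, 1] / Sigma[1, 1]             # regression coefficient, 0.8
mu_bar = mu[0] + coef * (x2 - mu[1])         # conditional mean, 82
var_bar = Sigma[0, 0] - coef * Sigma[1, 0]   # conditional variance, 36
std_bar = np.sqrt(var_bar)                   # conditional std, 6

# Cross-check via the correlation: var_bar = Sigma_11 * (1 - rho^2).
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
var_bar_check = Sigma[0, 0] * (1.0 - rho ** 2)

print(mu_bar, var_bar, std_bar, var_bar_check)
```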

Wolf

February 16, 2026 at 1:53 pm #5465

The Precision Matrix and MVN Distributions

In high-dimensional modeling and Gaussian Graphical Models (GGMs), we often work with the Precision Matrix $\Lambda$ (also denoted as $Q$ or $K$), which is the inverse of the covariance matrix:

$$\Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \cr \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

While the covariance matrix $\Sigma$ tells us about marginal relationships, the precision matrix $\Lambda$ tells us about conditional relationships.

1. Conditional Distribution via Precision

One of the primary advantages of the precision matrix is that the conditional distribution formulas become much simpler. If we want the distribution of $X_1$ given $X_2$, the parameters are:

Conditional Covariance: $\bar{\Sigma} = \Lambda_{11}^{-1}$
Conditional Mean: $\bar{\mu} = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (x_2 - \mu_2)$

Why this matters: In the covariance form, we had to compute a Schur complement. In the precision form, the conditional covariance is just the inverse of the top-left block.
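A quick way to convince yourself that the precision form gives the same answer is to run it on the Math/Physics numbers from the earlier post (mean $(70, 75)$, covariance with 100 on the diagonal and 80 off it, observation $x_2 = 90$). The sketch below is only a check under those numbers; the variable names are my own.

```python
import numpy as np

mu = np.array([70.0, 75.0])
Sigma = np.array([[100.0, 80.0],
                  [80.0, 100.0]])
x2 = np.array([90.0])

Lam = np.linalg.inv(Sigma)        # precision matrix
L11 = Lam[:1, :1]                 # top-left block
L12 = Lam[:1, 1:]                 # top-right block

# Precision-form conditional parameters.
Sigma_bar = np.linalg.inv(L11)
mu_bar = mu[:1] - np.linalg.solve(L11, L12 @ (x2 - mu[1:]))

print(mu_bar, Sigma_bar)  # should match the covariance-form results (82 and 36)
```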

2. Marginal Distribution via Precision

Conversely, finding the marginal distribution becomes harder with the precision matrix. To find the marginal of $X_1$, we must compute the Schur complement of the precision matrix:

Marginal Covariance: $\Sigma_{11} = (\Lambda_{11} - \Lambda_{12} \Lambda_{22}^{-1} \Lambda_{21})^{-1}$
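The identity can be verified numerically. The sketch below uses a made-up 3-D covariance matrix, with $X_1$ taken to be the first two coordinates, and checks that the Schur complement of the precision matrix recovers the $\Sigma_{11}$ block.

```python
import numpy as np

# Hypothetical covariance; X1 = first two coordinates, X2 = the last one.
Sigma = np.array([
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
])
Lam = np.linalg.inv(Sigma)

L11, L12 = Lam[:2, :2], Lam[:2, 2:]
L21, L22 = Lam[2:, :2], Lam[2:, 2:]

# Schur complement of L22 in the precision matrix, then invert it.
schur = L11 - L12 @ np.linalg.solve(L22, L21)
Sigma_11_from_precision = np.linalg.inv(schur)

print(Sigma_11_from_precision)  # should reproduce Sigma[:2, :2]
```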

3. The “Zero” Property (Conditional Independence)

This is the most critical concept for AI learners. There is a beautiful duality between $\Sigma$ and $\Lambda$:

Matrix	Entry Value	Meaning
Covariance ($\Sigma$)	$\Sigma_{ij} = 0$	$X_i$ and $X_j$ are marginally independent (this relies on joint Gaussianity; in general, zero covariance only means uncorrelated).
Precision ($\Lambda$)	$\Lambda_{ij} = 0$	$X_i$ and $X_j$ are conditionally independent given all other variables.

Numerical Example (Revisited)

Recall our Math ($X_1$) and Physics ($X_2$) example where $\Sigma = \begin{pmatrix} 100 & 80 \cr 80 & 100 \end{pmatrix}$. Let’s find $\Lambda$:

$$\det(\Sigma) = (100 \times 100) - (80 \times 80) = 10000 - 6400 = 3600$$

$$\Lambda = \frac{1}{3600} \begin{pmatrix} 100 & -80 \cr -80 & 100 \end{pmatrix} = \begin{pmatrix} 0.0278 & -0.0222 \cr -0.0222 & 0.0278 \end{pmatrix}$$

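The off-diagonal elements are non-zero, which confirms that Math and Physics scores are dependent. (With only two variables there is nothing else to condition on, so conditional and marginal dependence coincide.)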
If we added a third variable, Study Hours ($X_3$), and found that $\Lambda_{12}$ (Math-Physics) became zero while $\Lambda_{13}$ and $\Lambda_{23}$ were non-zero, it would mean Math and Physics are only related because they both depend on Study Hours.
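To make the Study Hours scenario concrete, here is a small sketch with an invented 3x3 precision matrix in which the Math-Physics entry is exactly zero. The numbers are purely illustrative and are not derived from the 2-D example. It relies on the standard identity that the partial correlation between $X_i$ and $X_j$ given the rest is $-\Lambda_{ij} / \sqrt{\Lambda_{ii} \Lambda_{jj}}$.

```python
import numpy as np

# Variable order: 0 = Math, 1 = Physics, 2 = Study Hours (made-up values).
Lam = np.array([
    [ 1.0,  0.0, -0.8],
    [ 0.0,  1.0, -0.8],
    [-0.8, -0.8,  2.0],
])
Sigma = np.linalg.inv(Lam)

# Partial correlations: direct links after conditioning on everything else.
d = np.sqrt(np.diag(Lam))
partial_corr = -Lam / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Marginal correlations, read off the covariance matrix.
s = np.sqrt(np.diag(Sigma))
marginal_corr = Sigma / np.outer(s, s)

print("partial  Math-Physics:", partial_corr[0, 1])
print("marginal Math-Physics:", marginal_corr[0, 1])
```

The partial correlation between Math and Physics is zero, yet their marginal correlation is clearly positive: all of the association flows through Study Hours.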

Summary Table

Feature	Covariance Matrix $\Sigma$	Precision Matrix $\Lambda$
Marginals	Easy (extract sub-matrix)	Hard (Schur complement)
Conditionals	Hard (Schur complement)	Easy (extract and invert sub-block)
Interpretation	Marginal correlation	Partial correlation / Direct links
Graph View	Dense connections	Sparse connections (edges)

Wolf

February 16, 2026 at 1:53 pm #5466

From MVN to Gaussian Processes and Kalman Filters

The marginal and conditional properties of the MVN are the “secret sauce” behind some of the most powerful algorithms in AI. Let’s look at how they power Gaussian Processes (GPs) and Kalman Filters.

1. Gaussian Processes (GPs): Predicting the Unknown

A Gaussian Process is essentially an MVN with infinite dimensions. We treat a function $f(x)$ as a collection of random variables, any finite number of which have a joint Gaussian distribution.

How it uses MVN Properties:

When we “train” a GP, we aren’t actually training weights like a Neural Network. Instead, we use the Conditional Distribution formulas we discussed earlier.

The Setup: We have observed data points $X_{train}$ (with values $y_{train}$) and we want to predict the values $y_{test}$ at new locations $X_{test}$.
The Joint Distribution: We define a joint MVN between the knowns and unknowns:
$$\begin{pmatrix} y_{train} \cr y_{test} \end{pmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{pmatrix} K(X_{train}, X_{train}) & K(X_{train}, X_{test}) \cr K(X_{test}, X_{train}) & K(X_{test}, X_{test}) \end{pmatrix} \right)$$
(Where $K$ is the kernel/covariance function.)
The Inference: To get the prediction, we simply calculate the Conditional Distribution $p(y_{test} | y_{train})$.

The “Mean” formula gives us our prediction, and the “Covariance” formula gives us the Uncertainty (the shaded area in GP plots).
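Here is a bare-bones sketch of that conditioning step for noise-free 1-D regression. Everything specific in it is an assumption for illustration: the squared-exponential (RBF) kernel, the `rbf_kernel` helper, the toy `sin` data, and the small `jitter` added to the diagonal for numerical stability. In the notation of the first post, $y_{test}$ plays the role of $X_1$ and $y_{train}$ the role of $X_2$.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between 1-D input arrays A and B."""
    diff = A[:, None] - B[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

# Toy data (hypothetical): three noise-free observations of sin(x).
X_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(X_train)
X_test = np.array([-1.0, 0.0, 3.0])

jitter = 1e-8  # assumption: tiny diagonal term to keep the solve well conditioned
K_tt = rbf_kernel(X_train, X_train) + jitter * np.eye(len(X_train))
K_ts = rbf_kernel(X_train, X_test)   # train-vs-test cross-covariance
K_ss = rbf_kernel(X_test, X_test)

# MVN conditioning with zero prior mean.
post_mean = K_ts.T @ np.linalg.solve(K_tt, y_train)
post_cov = K_ss - K_ts.T @ np.linalg.solve(K_tt, K_ts)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

for x, m, sd in zip(X_test, post_mean, post_std):
    print(f"x = {x:5.2f}   mean = {m:7.4f}   std = {sd:6.4f}")
```

The test point at $x = 0$ coincides with a training point, so its predictive standard deviation collapses to nearly zero, while the point at $x = 3$, far from all the data, keeps most of its prior uncertainty.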

2. Kalman Filters: Tracking Over Time

Kalman Filters are used in robotics and navigation (like GPS or self-driving cars) to estimate the state of a system (position, velocity) over time.

How it uses MVN Properties:

A Kalman Filter is essentially a recursive application of MVN properties, alternating between a Predict step and an Update step.

The Predict Step (Marginalization):
We move our estimate forward in time by applying the motion model and adding Gaussian process noise. Mathematically, this relies on the Marginal properties: a linear transformation of a Gaussian plus independent Gaussian noise is still Gaussian, so marginalizing out the previous state yields a Gaussian prediction.
The Update Step (Conditioning):
We receive a new, noisy sensor measurement (e.g., a GPS ping). We “condition” our current estimate on this new evidence.
- The Kalman Gain ($K$) is actually just the term $\Sigma_{12} \Sigma_{22}^{-1}$ from the MVN conditional mean formula, with the state playing the role of $X_1$ and the measurement playing the role of $X_2$!
- It determines how much we should trust the sensor vs. our internal model.
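One predict/update cycle can be sketched as follows, written so that the link to the MVN conditional formulas is visible. The function name `kalman_step`, the constant-velocity model, and all noise values are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def kalman_step(m, P, z, F, Q, H, R):
    """One predict + update cycle of a linear Kalman filter (sketch)."""
    # Predict: push the Gaussian through the linear motion model, add noise Q.
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q

    # Joint covariance blocks of (state, measurement):
    S12 = P_pred @ H.T              # cross-covariance state/measurement
    S22 = H @ P_pred @ H.T + R      # measurement covariance

    # Kalman gain = S12 @ inv(S22), via a linear solve (S22 is symmetric).
    K = np.linalg.solve(S22, S12.T).T

    # Update: exactly the MVN conditional mean and covariance.
    m_new = m_pred + K @ (z - H @ m_pred)
    P_new = P_pred - K @ S12.T
    return m_new, P_new, K

# Hypothetical constant-velocity model: state = [position, velocity], dt = 1.
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Q = 0.01 * np.eye(2)          # process noise
H = np.array([[1.0, 0.0]])    # we only measure position
R = np.array([[1.0]])         # measurement noise

m0 = np.array([0.0, 1.0])
P0 = np.eye(2)
z = np.array([1.2])           # noisy position reading

m1, P1, K = kalman_step(m0, P0, z, F, Q, H, R)
print(m1)
print(P1)
print(K)
```

A large `R` (noisy sensor) shrinks the gain so the filter leans on its internal model; a small `R` pushes the gain up so the filter follows the measurement.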

Comparison: GPs vs. Kalman Filters

Concept	Primary MVN Tool	Goal
Gaussian Process	Conditional Distribution	Predict values at unobserved spatial locations.
Kalman Filter	Marginal (Predict) + Conditional (Update)	Estimate hidden states in a temporal sequence.

Summary for AI Learners

The beauty of the MVN is that Inference = Algebra. Because the math stays Gaussian after marginalizing and conditioning, these models can provide exact solutions with closed-form equations, making them incredibly robust for uncertainty quantification.
