Marginals and Conditionals of MVN Distribution


    • February 16, 2026 at 1:49 pm #5463

      Understanding Marginals and Conditionals in Multivariate Normal (MVN) Distributions

      When working with Multivariate Normal distributions, we often deal with high-dimensional data. Understanding how to “slice” this data (conditionals) or “collapse” it (marginals) is fundamental for Bayesian inference and Gaussian Processes.

      The Setup: Partitioning the Distribution

      Consider a random vector $X$ following an MVN distribution, partitioned into two subsets of variables, $X_1$ and $X_2$:

      $$X = \begin{pmatrix} X_1 \cr X_2 \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mu_1 \cr \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \cr \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$

      • $\mu_1, \mu_2$: Mean vectors for the respective partitions.
      • $\Sigma_{11}, \Sigma_{22}$: Covariance matrices within each partition.
      • $\Sigma_{12}, \Sigma_{21}$: Cross-covariance matrices (where $\Sigma_{21} = \Sigma_{12}^T$).

      1. Marginal Distributions

      The marginal distribution tells us how one subset of variables behaves regardless of the others. A beautiful property of the MVN is that the marginal of a Gaussian is simply another Gaussian.

      To find the marginal distribution of $X_1$, we effectively “ignore” $X_2$. We simply extract the corresponding mean and covariance blocks:

      $$p(X_1) = \int p(X_1, X_2) dX_2 = \mathcal{N}(X_1 | \mu_1, \Sigma_{11})$$

      Key Insight: You don’t need to perform any complex integration; you just look at the relevant sub-matrix and sub-vector.
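The extraction step can be sketched in a few lines of NumPy. This is only an illustration: the helper name `marginal` and the 3-dimensional numbers are assumptions made up for the example, not part of the discussion above.

```python
import numpy as np

def marginal(mu, Sigma, keep):
    """Return the mean and covariance of the sub-vector X[keep]."""
    keep = np.asarray(keep)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    # Marginalizing a Gaussian only selects entries; no integration is needed.
    return mu[keep], Sigma[np.ix_(keep, keep)]

# Hypothetical 3-dimensional Gaussian (values chosen for illustration only).
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])

# Marginal of (X_0, X_2): drop everything that belongs to X_1.
mu_1, Sigma_11 = marginal(mu, Sigma, keep=[0, 2])
print(mu_1)
print(Sigma_11)
```

Note that the kept indices do not have to be contiguous; `np.ix_` picks the matching rows and columns in one step.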


      2. Conditional Distributions

      The conditional distribution describes the behavior of $X_1$ given that we have observed specific values for $X_2$. This is the “slicing” of the joint density. The result is still a Gaussian:

      $$p(X_1 | X_2 = x_2) = \mathcal{N}(X_1 | \bar{\mu}, \bar{\Sigma})$$

      The updated mean and covariance are calculated using the following formulas:

      | Feature | Formula |
      | --- | --- |
      | Conditional Mean | $$\bar{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$$ |
      | Conditional Covariance | $$\bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$ |

      Intuition for AI Learners:

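      • The Mean: The new mean $\bar{\mu}$ starts at the original $\mu_1$ and is adjusted by the deviation $(x_2 - \mu_2)$ of the observation from its own mean, scaled by the matrix of regression coefficients $\Sigma_{12} \Sigma_{22}^{-1}$. This matrix measures how strongly $X_1$ co-varies with $X_2$ relative to the spread of $X_2$.
      • The Covariance: The new covariance $\bar{\Sigma}$ is the original $\Sigma_{11}$ minus the positive semi-definite term $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$. This makes sense: observing $X_2$ provides information about $X_1$, so our uncertainty cannot increase. The resulting matrix $\bar{\Sigma}$ is known as the Schur complement of $\Sigma_{22}$ in $\Sigma$. Note that $\bar{\Sigma}$ does not depend on the observed value $x_2$.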
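The two formulas translate almost directly into NumPy. The sketch below is one possible way to write them; the function name `condition_mvn`, the index-list interface, and the 3-dimensional test values are assumptions for illustration. It uses `np.linalg.solve` rather than an explicit inverse, which is a common numerical preference and not something the formulas require.

```python
import numpy as np

def condition_mvn(mu, Sigma, idx1, idx2, x2):
    """Parameters of p(X1 | X2 = x2) for a jointly Gaussian vector X."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # S22 is symmetric, so solve(S22, S12.T).T equals S12 @ inv(S22).
    coef = np.linalg.solve(S22, S12.T).T
    mu_bar = mu[idx1] + coef @ (x2 - mu[idx2])
    Sigma_bar = S11 - coef @ S12.T   # S12.T is the Sigma_21 block
    return mu_bar, Sigma_bar

# Hypothetical zero-mean example: condition X_0 on (X_1, X_2) = (1, 1).
mu = np.zeros(3)
Sigma = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
mu_bar, Sigma_bar = condition_mvn(mu, Sigma, [0], [1, 2], [1.0, 1.0])
print(mu_bar, Sigma_bar)   # approximately [1/3] and [[4/3]]
```

Working the same case by hand gives $\bar{\mu} = 1/3$ and $\bar{\Sigma} = 4/3$, which is smaller than the prior variance of 2.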

      Comparison Summary

      | Property | Marginal ($X_1$) | Conditional ($X_1 \mid X_2$) |
      | --- | --- | --- |
      | Operation | Projection / Collapsing | Slicing / Observing |
      | Complexity | Extremely Simple (extraction) | Requires Matrix Inversion ($\Sigma_{22}^{-1}$) |
      | Uncertainty | Same as original ($\Sigma_{11}$) | Reduced (via Schur Complement) |
        February 16, 2026 at 1:52 pm #5464

        Numerical Example: 2D Multivariate Normal

        Let’s walk through a concrete example. Imagine we have a 2D Gaussian distribution representing the relationship between a student’s Math Score ($X_1$) and Physics Score ($X_2$).

        1. The Joint Distribution

        Suppose the scores are distributed as follows:

        $$X = \begin{pmatrix} X_1 \cr X_2 \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 70 \cr 75 \end{pmatrix}, \begin{pmatrix} 100 & 80 \cr 80 & 100 \end{pmatrix} \right)$$

        Here, the mean math score is 70, the mean physics score is 75, and they have a strong positive covariance of 80 (indicating that students who do well in math usually do well in physics).


        2. Finding the Marginal

        If we only care about the Math Score ($X_1$) and ignore the physics score entirely, we read off the first entry of the mean vector and the top-left entry of the covariance matrix:

        • Marginal Mean: $\mu_1 = 70$
        • Marginal Variance: $\Sigma_{11} = 100$

        Thus, $p(X_1) = \mathcal{N}(70, 100)$. The standard deviation is $\sqrt{100} = 10$.


        3. Finding the Conditional

        Now, suppose we observe that a specific student scored a 90 in Physics ($x_2 = 90$). How does this change our prediction for their Math Score ($X_1 \mid X_2 = 90$)?

        Step A: Calculate the New Mean ($\bar{\mu}$)

        We use the formula: $\bar{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$

        1. Difference from mean: $(x_2 - \mu_2) = (90 - 75) = 15$.
        2. Regression Coefficient: $\Sigma_{12} \Sigma_{22}^{-1} = 80 / 100 = 0.8$.
        3. Update: $\bar{\mu} = 70 + 0.8(15) = 70 + 12 = 82$.

        The expected math score jumped from 70 to 82 because we saw a high physics score.

        Step B: Calculate the New Variance ($\bar{\Sigma}$)

        We use the formula: $\bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$

        1. Reduction Term: $80 \times (1/100) \times 80 = 64$.
        2. Update: $\bar{\Sigma} = 100 - 64 = 36$.

        The uncertainty (variance) dropped from 100 to 36. Our new standard deviation is only 6, meaning we are much more confident in our prediction of 82 than we were before we knew the physics score.


        Summary of Results

        | Distribution | Mean ($\mu$) | Variance ($\Sigma$) | Interpretation |
        | --- | --- | --- | --- |
        | Marginal ($X_1$) | 70 | 100 | General population math performance. |
        | Conditional ($X_1 \mid X_2 = 90$) | 82 | 36 | Math performance for high-achieving physics students. |

        Notice how the conditional distribution is “tighter” and shifted toward the observed evidence.
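The arithmetic in Steps A and B can be double-checked with a few lines of plain Python. The variable names are just labels chosen for this check; all numbers come from the example above.

```python
# Parameters of the joint distribution from the example above.
mu_1, mu_2 = 70.0, 75.0               # mean math and physics scores
s11, s12, s22 = 100.0, 80.0, 100.0    # Sigma_11, Sigma_12, Sigma_22
x2 = 90.0                             # observed physics score

coef = s12 / s22                      # regression coefficient
mu_bar = mu_1 + coef * (x2 - mu_2)    # conditional mean
var_bar = s11 - coef * s12            # conditional variance
std_bar = var_bar ** 0.5              # conditional standard deviation

print(mu_bar, var_bar, std_bar)       # approximately 82, 36 and 6
```

Because both blocks are scalars here, the matrix inverse reduces to an ordinary division.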

          February 16, 2026 at 1:53 pm #5465

          The Precision Matrix and MVN Distributions

          In high-dimensional modeling and Gaussian Graphical Models (GGMs), we often work with the Precision Matrix $\Lambda$ (also denoted as $Q$ or $K$), which is the inverse of the covariance matrix:

          $$\Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \cr \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

          While the covariance matrix $\Sigma$ tells us about marginal relationships, the precision matrix $\Lambda$ tells us about conditional relationships.


          1. Conditional Distribution via Precision

          One of the primary advantages of the precision matrix is that the conditional distribution formulas become much simpler. If we want the distribution of $X_1$ given $X_2$, the parameters are:

          • Conditional Covariance: $\bar{\Sigma} = \Lambda_{11}^{-1}$
          • Conditional Mean: $\bar{\mu} = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (x_2 - \mu_2)$

          Why this matters: In the covariance form, we had to compute a Schur complement. In the precision form, the conditional covariance is just the inverse of the top-left block.
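As a sanity check, the precision-form and covariance-form formulas should give the same answer. The sketch below tries this on the Math/Physics numbers from the earlier post, where each block is a single number; the variable names are chosen here for illustration.

```python
import numpy as np

mu = np.array([70.0, 75.0])
Sigma = np.array([[100.0, 80.0],
                  [80.0, 100.0]])
x2 = 90.0

Lam = np.linalg.inv(Sigma)          # precision matrix
L11, L12 = Lam[0, 0], Lam[0, 1]     # scalar blocks in the 2D case

# Precision form: only the top-left block is inverted.
var_prec = 1.0 / L11
mu_prec = mu[0] - (L12 / L11) * (x2 - mu[1])

# Covariance form, for comparison.
var_cov = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
mu_cov = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])

print(mu_prec, mu_cov)     # both approximately 82
print(var_prec, var_cov)   # both approximately 36
```

In this toy case `Lam` is obtained by inverting `Sigma`, so nothing is saved. The benefit appears when a model is specified directly through its precision matrix.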


          2. Marginal Distribution via Precision

          Conversely, finding the marginal distribution becomes harder with the precision matrix. To find the marginal of $X_1$, we must compute the Schur complement of the precision matrix:

          • Marginal Covariance: $\Sigma_{11} = (\Lambda_{11} - \Lambda_{12} \Lambda_{22}^{-1} \Lambda_{21})^{-1}$
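The marginal covariance formula can be verified numerically. The 3-dimensional covariance matrix below is a made-up example (an assumption for illustration); the check recovers its top-left block from the precision matrix alone.

```python
import numpy as np

# Hypothetical covariance matrix; X_1 = (first two variables), X_2 = (third).
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])
Lam = np.linalg.inv(Sigma)

i1, i2 = [0, 1], [2]
L11 = Lam[np.ix_(i1, i1)]
L12 = Lam[np.ix_(i1, i2)]
L21 = Lam[np.ix_(i2, i1)]
L22 = Lam[np.ix_(i2, i2)]

# Schur complement of Lambda_22 in Lambda, then invert it.
schur = L11 - L12 @ np.linalg.solve(L22, L21)
Sigma_11_from_precision = np.linalg.inv(schur)

print(Sigma_11_from_precision)   # should match Sigma[:2, :2]
```

The extra solve with $\Lambda_{22}$ is the cost of marginalizing in precision form.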

          3. The “Zero” Property (Conditional Independence)

          This is the most critical concept for AI learners. There is a beautiful duality between $\Sigma$ and $\Lambda$:

          | Matrix | Entry Value | Meaning |
          | --- | --- | --- |
          | Covariance ($\Sigma$) | $\Sigma_{ij} = 0$ | $X_i$ and $X_j$ are marginally independent. |
          | Precision ($\Lambda$) | $\Lambda_{ij} = 0$ | $X_i$ and $X_j$ are conditionally independent given all other variables. |

          Numerical Example (Revisited)

          Recall our Math ($X_1$) and Physics ($X_2$) example where $\Sigma = \begin{pmatrix} 100 & 80 \cr 80 & 100 \end{pmatrix}$. Let’s find $\Lambda$:

          $$\det(\Sigma) = (100 \times 100) - (80 \times 80) = 10000 - 6400 = 3600$$

          $$\Lambda = \frac{1}{3600} \begin{pmatrix} 100 & -80 \cr -80 & 100 \end{pmatrix} = \begin{pmatrix} 0.0278 & -0.0222 \cr -0.0222 & 0.0278 \end{pmatrix}$$

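          • The off-diagonal elements are non-zero, which confirms that Math and Physics scores are dependent. With only two variables there is nothing else to condition on, so conditional and marginal dependence coincide here.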
          • If we added a third variable, Study Hours ($X_3$), and found that $\Lambda_{12}$ (Math-Physics) became zero while $\Lambda_{13}$ and $\Lambda_{23}$ were non-zero, it would mean Math and Physics are only related because they both depend on Study Hours.
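The three-variable scenario can be made concrete with a small made-up precision matrix. All numbers below are hypothetical and chosen only so that $\Lambda_{12} = 0$ while $\Lambda_{13}$ and $\Lambda_{23}$ are non-zero. The partial correlation is computed as $-\Lambda_{ij} / \sqrt{\Lambda_{ii} \Lambda_{jj}}$.

```python
import numpy as np

# Hypothetical precision matrix for (Math, Physics, Study Hours).
Lam = np.array([[ 2.0,  0.0, -1.0],
                [ 0.0,  2.0, -1.0],
                [-1.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Lam)

# Marginal correlations come from the covariance matrix.
d = np.sqrt(np.diag(Sigma))
marginal_corr = Sigma / np.outer(d, d)

# Partial correlations come from the precision matrix.
p = np.sqrt(np.diag(Lam))
partial_corr = -Lam / np.outer(p, p)
np.fill_diagonal(partial_corr, 1.0)

print(marginal_corr[0, 1])   # about 0.33: Math and Physics look related
print(partial_corr[0, 1])    # zero: no direct link once Study Hours is known
```

In graph terms, this model has edges Math–Study Hours and Physics–Study Hours but no Math–Physics edge.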

          Summary Table

          | Feature | Covariance Matrix $\Sigma$ | Precision Matrix $\Lambda$ |
          | --- | --- | --- |
          | Marginals | Easy (extract sub-matrix) | Hard (Schur complement) |
          | Conditionals | Hard (Schur complement) | Easy (extract sub-matrix) |
          | Interpretation | Marginal correlation | Partial correlation / Direct links |
          | Graph View | Dense connections | Sparse connections (edges) |
            February 16, 2026 at 1:53 pm #5466

            From MVN to Gaussian Processes and Kalman Filters

            The marginal and conditional properties of the MVN are the “secret sauce” behind some of the most powerful algorithms in AI. Let’s look at how they power Gaussian Processes (GPs) and Kalman Filters.


            1. Gaussian Processes (GPs): Predicting the Unknown

            A Gaussian Process is essentially an MVN with infinite dimensions. We treat a function $f(x)$ as a collection of random variables, any finite number of which have a joint Gaussian distribution.

            How it uses MVN Properties:

            When we “train” a GP, we aren’t actually training weights like a Neural Network. Instead, we use the Conditional Distribution formulas we discussed earlier.

            • The Setup: We have observed data points $X_{train}$ (with values $y_{train}$) and we want to predict the values $y_{test}$ at new locations $X_{test}$.
            • The Joint Distribution: We define a joint MVN between the knowns and unknowns:
              $$\begin{pmatrix} y_{train} \cr y_{test} \end{pmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{pmatrix} K(X_{train}, X_{train}) & K(X_{train}, X_{test}) \cr K(X_{test}, X_{train}) & K(X_{test}, X_{test}) \end{pmatrix} \right)$$
              (Where $K$ is the kernel/covariance function.)
            • The Inference: To get the prediction, we simply calculate the Conditional Distribution $p(y_{test} | y_{train})$.

            The “Mean” formula gives us our prediction, and the “Covariance” formula gives us the Uncertainty (the shaded area in GP plots).
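A minimal sketch of GP prediction as MVN conditioning is shown below. Several pieces are assumptions not specified in the post: the squared-exponential kernel, the `length_scale` default, the small `jitter` added to the diagonal for numerical stability, and the toy data. The function names `rbf_kernel` and `gp_posterior` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between 1-D input arrays A and B."""
    diff = A[:, None] - B[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(X_train, y_train, X_test, jitter=1e-8):
    """Mean and covariance of p(y_test | y_train) under a zero-mean GP prior."""
    K_tt = rbf_kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    K_ts = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    # Conditional mean: K_st K_tt^{-1} y_train (the prior mean is zero).
    mean = K_ts.T @ np.linalg.solve(K_tt, y_train)
    # Conditional covariance: K_ss - K_st K_tt^{-1} K_ts.
    cov = K_ss - K_ts.T @ np.linalg.solve(K_tt, K_ts)
    return mean, cov

# Toy data (assumed): noise-free samples of sin(x).
X_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(X_train)
X_test = np.array([0.0, 3.0, 0.75])

mean, cov = gp_posterior(X_train, y_train, X_test)
# Clip tiny negative values caused by round-off before taking the square root.
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(mean)
print(std)
```

The first test point coincides with a training input, so its predictive standard deviation is close to zero, while the point at 3.0 lies far from the data and stays much more uncertain.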


            2. Kalman Filters: Tracking Over Time

            Kalman Filters are used in robotics and navigation (like GPS or self-driving cars) to estimate the state of a system (position, velocity) over time.

            How it uses MVN Properties:

            A Kalman Filter is essentially a recursive application of MVN properties, alternating between a Predict step and an Update step.

            1. The Predict Step (Marginalization):
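              We move our estimate forward in time by applying the motion model and adding Gaussian process noise. The state stays Gaussian because a linear transformation of a Gaussian plus independent Gaussian noise is again Gaussian. The predicted state is then the marginal of the joint distribution over the current and next state.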
            2. The Update Step (Conditioning):
              We receive a new, noisy sensor measurement (e.g., a GPS ping). We “condition” our current estimate on this new evidence.

              • The Kalman Gain ($K$) is actually just the term $\Sigma_{12} \Sigma_{22}^{-1}$ from the MVN conditional mean formula!
              • It determines how much we should trust the sensor vs. our internal model.
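To make the link explicit, here is a sketch of a one-dimensional Kalman filter written in terms of the MVN blocks. The model parameters `F`, `Q`, `H`, `R` and all numbers are assumptions chosen for illustration; in the scalar case every matrix product becomes an ordinary multiplication.

```python
def predict(m, P, F=1.0, Q=1.0):
    """Push the estimate forward: x_next = F * x + w, with w ~ N(0, Q)."""
    return F * m, F * P * F + Q

def update(m, P, z, H=1.0, R=4.0):
    """Condition the state on a measurement z = H * x + v, with v ~ N(0, R)."""
    s12 = P * H            # Cov(x, z), the Sigma_12 block
    s22 = H * P * H + R    # Var(z), the Sigma_22 block
    K = s12 / s22          # Kalman gain = Sigma_12 * Sigma_22^{-1}
    m_new = m + K * (z - H * m)   # conditional mean
    P_new = P - K * s12           # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return m_new, P_new, K

m, P = 0.0, 3.0                   # current state estimate and variance
m, P = predict(m, P)              # m = 0.0, P = 4.0
m, P, K = update(m, P, z=2.0)     # K = 0.5, m = 1.0, P = 2.0
print(m, P, K)
```

With a predicted variance of 4 and a sensor variance of 4, the gain is 0.5, so the updated estimate lands halfway between the prediction and the measurement.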

            Comparison: GPs vs. Kalman Filters

            | Concept | Primary MVN Tool | Goal |
            | --- | --- | --- |
            | Gaussian Process | Conditional Distribution | Predict values at unobserved spatial locations. |
            | Kalman Filter | Marginal (Predict) + Conditional (Update) | Estimate hidden states in a temporal sequence. |

            Summary for AI Learners

            The beauty of the MVN is that Inference = Algebra. Because the distribution stays Gaussian after marginalizing and conditioning, these models yield exact, closed-form solutions for both the prediction and its uncertainty, which makes them well suited to uncertainty quantification.
