MMI-714 Study notes

Midterm Notes & Questions

1. Discriminative vs Generative models: what are the differences? What are the advantages and disadvantages of each?

Discriminative Models ($P(Y|X)$):

  • Concept: They learn the boundary between classes. They care about "what differentiates a cat from a dog", not "what makes a dog a dog".
  • Goal: Map input $X$ to label $Y$ directly.
  • Examples: Logistic Regression, SVM, Neural Nets (standard classifiers).
  • Advantages:
    • Generally higher accuracy for classification tasks because they focus purely on the decision boundary.
    • Often computationally cheaper to train and predict.
    • Robust to correlated features (doesn't double-count evidence like Naive Bayes).
  • Disadvantages:
    • Cannot generate data (you can't ask it to "draw a cat").
    • Requires labeled data (strictly supervised).
    • Can be prone to overfitting noise in the boundary.

Generative Models ($P(X, Y)$ or $P(X)$):

  • Concept: They learn the distribution of the data itself. They learn "what a dog looks like" and "what a cat looks like".
  • Goal: Model the underlying structure of the data.
  • Examples: Naive Bayes, GMMs, VAEs, GANs, Diffusion.
  • Advantages:
    • Can generate new samples (hallucinate new data).
    • Can handle missing data and are effective for semi-supervised learning.
    • Models the world, not just a boundary (more robust to outliers/adversarial attacks in some contexts).
  • Disadvantages:
    • Computationally expensive (modelling the whole distribution is hard).
    • "Double counting" evidence if features are correlated (e.g., Naive Bayes assumes independence).
    • May have lower classification accuracy because they solve a harder problem (modelling density) than necessary.

2. Bayes Theorem and Generative Models

Bayes Theorem: $$P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}$$

Where:

  • $P(\theta|X)$ = Posterior: Probability of parameters given data
  • $P(X|\theta)$ = Likelihood: Probability of data given parameters
  • $P(\theta)$ = Prior: Belief about parameters before seeing data
  • $P(X)$ = Evidence: Marginal probability of data (normalization constant)

Role in Generative Models:

  1. Parameter Estimation: Update beliefs about model parameters based on observed data

    • Start with prior $P(\theta)$
    • Observe data $X$
    • Update to posterior $P(\theta|X)$ using Bayes theorem
  2. Connection to different generative models:

    • VAEs: Approximate posterior $q(z|x)$ used because true posterior $p(z|x)$ intractable
    • GANs: Discriminator approximates likelihood ratio
    • Bayesian Neural Networks: Posterior over network weights
  3. Inference in generative models:

    • Given data $x$, infer latent $z$: $P(z|x) = \frac{P(x|z)P(z)}{P(x)}$
    • $P(x|z)$ = decoder (how likely is data given latent)
    • $P(z)$ = prior (usually $N(0,I)$)
    • $P(x)$ = evidence (intractable! This is why VAEs use ELBO)

Exam-style question: "How does Bayes theorem relate to the ELBO in VAEs? Why can't we compute the posterior $p(z|x)$ directly?"

Answer: By Bayes theorem, $p(z|x) = \frac{p(x|z)p(z)}{p(x)}$. The denominator $p(x) = \int p(x|z)p(z)dz$ is intractable (requires integrating over all $z$). VAEs introduce approximate posterior $q(z|x)$ and maximize ELBO, which is equivalent to minimizing $D_{KL}(q(z|x) || p(z|x))$ - making $q$ close to the true (but unknown) posterior.


3. Gaussian distributions: properties, why is it used in generative models?

Properties:

  • Bell Curve: Symmetric, defined entirely by Mean ($\mu$) and Variance ($\sigma^2$).
  • Central Limit Theorem (CLT): The sum of many independent random variables tends toward a Gaussian distribution. This makes it a natural choice for modeling noise or aggregate real-world phenomena.
    • Key insight: Many real-world phenomena can be modeled as a sum of multiple small contributions → naturally Gaussian
  • Math Magic: Analytical tractability. Differentiating, integrating, and multiplying Gaussians often results in closed-form Gaussian solutions.
  • Multivariate Gaussian: For multi-dimensional data, characterized by mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$
    • Covariance matrix captures correlations between features

Why used in Generative Models (e.g., VAEs, Diffusion):

  1. Smoothness: Forces the latent space to be continuous and densely packed (no "holes"), allowing for smooth interpolation between samples.
  2. Reparameterization: The "Reparameterization Trick" ($z = \mu + \sigma \odot \epsilon$) is easy with Gaussians, allowing backpropagation through stochastic nodes.
  3. Prior: It's the standard "blank canvas" prior ($N(0, I)$). We assume latent factors are independent and normally distributed, then the network learns to map this simple distribution to complex data.
  4. Natural: CLT justifies using Gaussian as default assumption for many processes.

4. Entropy, cross-entropy

  • Entropy ($H(P)$):

    • A measure of uncertainty or "surprise" in a distribution.
    • High entropy = Uniform distribution (maximum unpredictability).
    • Low entropy = Deterministic (we know exactly what will happen).
    • Formula: $H(P) = -\sum P(x) \log P(x)$.
  • Cross-Entropy ($H(P, Q)$):

    • A measure of the average number of bits needed to encode events from true distribution $P$ using a code optimized for distribution $Q$.
    • Basically: "How different is my predicted distribution $Q$ from the true distribution $P$?"
    • Formula: $H(P, Q) = -\sum P(x) \log Q(x)$.
    • In Deep Learning: Minimizing Cross-Entropy is equivalent to minimizing KL Divergence (since $H(P)$ is constant for training data).

From Midterm Q1: Need to know formulas, when they become undefined/infinite, and how to calculate for both discrete and continuous distributions.

Entropy $H(p)$

Discrete: $$H(p) = -\sum_{x} p(x) \log p(x)$$

Continuous: $$H(p) = -\int p(x) \log p(x) dx$$

Properties:

  • Always $\geq 0$
  • Maximum when distribution is uniform
  • Minimum (0) when distribution is deterministic (probability 1 at single point)
  • Convention: $0 \log 0 = 0$

Example Calculations:

  1. Deterministic (Dirac delta at $x=1$): $p(1)=1$, $p(x)=0$ elsewhere → $H(p) = -1\cdot\log(1) = 0$
  2. Uniform on $\{1,2,3,4\}$: $p(x)=0.25$ for all → $H(p) = -4(0.25\log_2 0.25) = 2$ bits
  3. Continuous Uniform on $[a,b]$: $p(x) = \frac{1}{b-a}$ $$H(p) = -\int_a^b p(x) \log p(x) dx = -\int_a^b \frac{1}{b-a} \log\left(\frac{1}{b-a}\right) dx$$ $$= -\log\left(\frac{1}{b-a}\right) \int_a^b \frac{1}{b-a} dx = -\log\left(\frac{1}{b-a}\right) \cdot 1 = \log(b-a)$$
  4. Gaussian $N(\mu, \sigma^2)$:
    • PDF: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
    • Entropy: $H(p) = \frac{1}{2}\log(2\pi e \sigma^2)$
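As a quick sanity check, here is a minimal NumPy sketch of these calculations (the `entropy` helper is illustrative, not a course API):

```python
# Verify the discrete entropy examples above (log base 2 -> bits).
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -(p[nz] * np.log(p[nz]) / np.log(base)).sum()

print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic -> 0.0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform on 4 outcomes -> 2.0 bits

# Differential entropy of N(mu, sigma^2) in nats: H = 0.5 * log(2*pi*e*sigma^2)
print(0.5 * np.log(2 * np.pi * np.e * 1.0))  # ~1.4189 for N(0,1)
```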

Cross-Entropy $H(p,q)$

Discrete: $$H(p,q) = -\sum_{x} p(x) \log q(x)$$

Continuous: $$H(p,q) = -\int p(x) \log q(x) dx$$

When undefined:

  • If $\exists x$ where $p(x) > 0$ but $q(x) = 0$ → undefined (or $+\infty$)
  • Midterm mistake: Writing "infinite" or "-infinite" instead of "undefined"

Properties:

  • Always $H(p,q) \geq H(p)$ (equality when $p=q$)
  • NOT symmetric: $H(p,q) \neq H(q,p)$
  • Commonly used loss in classification (true labels = $p$, predictions = $q$)

KL Divergence $D_{KL}(p||q)$

Formula: $$D_{KL}(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p,q) - H(p)$$

Properties:

  • Always $\geq 0$ (equals 0 iff $p=q$)
  • NOT symmetric: $D_{KL}(p||q) \neq D_{KL}(q||p)$
  • Undefined when $\exists x: p(x)>0$ but $q(x)=0$
  • Measures "information loss" when approximating $p$ with $q$

Relationship: $$H(p,q) = H(p) + D_{KL}(p||q)$$

Example scenario (like Midterm Q1):

  • $p$: Discrete with $p(1)=1$, $p(x)=0$ elsewhere
  • $q$: Uniform on $[0,2]$ → $q(x) = 0.5$ for $x \in [0,2]$

Calculations:

  • $H(p) = 0$ (deterministic)
  • $H(p,q) = -1 \cdot \log(0.5) = \log(2)$ (if we treat $p$ as having mass at $x=1$)
  • $D_{KL}(p||q) = \log(2) - 0 = \log(2)$
  • Warning: If $p$ and $q$ are of different types (discrete vs continuous), need to be careful about whether comparison is valid
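A small numeric sketch of this scenario (with the caveat above in mind: we simply evaluate $q$ at the point mass, as the notes do):

```python
# Midterm-style calculation: p puts all mass at x=1, q(x)=0.5 on [0,2].
import numpy as np

H_p = 0.0                      # deterministic distribution has zero entropy
H_pq = -1.0 * np.log(0.5)      # H(p,q) = -sum_x p(x) log q(x) = -log q(1)
D_kl = H_pq - H_p              # D_KL(p||q) = H(p,q) - H(p)
print(H_pq, D_kl, np.log(2))   # all equal log(2) ~ 0.693 nats
```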

Information Theory Fundamentals

Self-Information

Definition: Amount of information (surprise) from observing event with probability $p$ $$I(x) = -\log p(x) = \log \frac{1}{p(x)}$$

Properties:

  • Rare events (low $p$) → High information (surprising)
  • Common events (high $p$) → Low information (unsurprising)
  • Event with $p=1$ → Zero information (no surprise)
  • Additive: For independent events, $I(x,y) = I(x) + I(y)$

Units: Depends on logarithm base

  • Base 2 → bits
  • Base $e$ → nats

Mutual Information $I(X;Y)$

Definition: Amount of information shared between two variables $$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Alternative formulation: $$I(X;Y) = D_{KL}(P(X,Y) || P(X)P(Y))$$

Interpretation:

  • Measures reduction in uncertainty about $X$ after observing $Y$
  • If $X$ and $Y$ independent → $I(X;Y) = 0$
  • Symmetric: $I(X;Y) = I(Y;X)$

In generative models:

  • InfoGAN: Maximize $I(c; G(z,c))$ to encourage latent code $c$ to be meaningful
  • Disentanglement: High MI between latent dimensions and semantic factors
  • Regularization: Encourage mutual information between latent and data

Conditional Entropy $H(Y|X)$

Definition: Remaining uncertainty in $Y$ after observing $X$

Discrete: $$H(Y|X) = \sum_x P(x) H(Y|X=x) = -\sum_{x,y} P(x,y) \log P(y|x)$$

Properties:

  • $H(Y|X) \leq H(Y)$ (observing $X$ cannot increase uncertainty)
  • Equality when $X$ and $Y$ are independent
  • $H(Y|X) = 0$ when $Y$ is deterministic function of $X$

Relationship to mutual information: $$I(X;Y) = H(Y) - H(Y|X)$$

Exam-style question: "What does high mutual information between latent code $c$ and generated image $G(z,c)$ mean? How is this used in InfoGAN?"

Answer: High $I(c; G(z,c))$ means that if we know $c$, we have high certainty about characteristics of $G(z,c)$ - the latent code meaningfully controls generation. InfoGAN maximizes this to force generator to use latent code $c$ in interpretable way, achieving unsupervised disentanglement.


5. Different distribution distances: KL, JS, W1. What are they? How are they calculated? Which is better for what?

| Metric | Calculation / Concept | Best For / Properties |
| --- | --- | --- |
| KL (Kullback-Leibler Divergence) | $E_{x \sim P} [\log \frac{P(x)}{Q(x)}]$, the expected log-ratio. | Asymmetric: $D_{KL}(P \Vert Q) \neq D_{KL}(Q \Vert P)$. Used in VAEs (regularization). Measures "information loss". Fails if distributions don't overlap (division by zero). |
| JS (Jensen-Shannon Divergence) | Symmetrized KL: $\frac{1}{2}D_{KL}(P \Vert M) + \frac{1}{2}D_{KL}(Q \Vert M)$ where $M = \frac{1}{2}(P+Q)$. | Symmetric and bounded $[0, 1]$. Used in original GANs. More stable than KL, but can still suffer from vanishing gradients if supports are disjoint. |
| W1 (Wasserstein-1, Earth Mover's Distance) | "Minimum cost to move pile $P$ to pile $Q$". In 1D: area between CDFs, $\int \lvert F_P - F_Q \rvert$. | Geometric / disjoint support. Used in WGANs. Works even when distributions don't overlap (gradients don't vanish). Sensitive to the magnitude of the difference (location shifts), not just probability overlap. |

Summary: Use KL for compression/VAEs. Use W1 for GANs/geometric stability (prevents mode collapse, stable gradients). Use JS as a stable baseline comparison.


Additional Distribution Distances

Total Variation (TV) Distance

Definition: Half the L1 norm between probability mass functions $$d_{TV}(P,Q) = \frac{1}{2} \sum_x |P(x) - Q(x)|$$

For continuous distributions (using densities): $$d_{TV}(P,Q) = \frac{1}{2} \int |p(x) - q(x)| dx$$

Properties:

  • Symmetric: $d_{TV}(P,Q) = d_{TV}(Q,P)$
  • Bounded: $d_{TV} \in [0, 1]$
  • Equals 0 iff $P = Q$
  • Equals 1 iff supports are disjoint

Interpretation: The maximum difference between the probabilities that $P$ and $Q$ assign to the same event

Hellinger Distance

Definition: Square root of half the sum of squared differences between square roots of densities $$d_H(P,Q) = \sqrt{\frac{1}{2} \sum_x (\sqrt{P(x)} - \sqrt{Q(x)})^2}$$

For continuous: $$d_H(P,Q) = \sqrt{\frac{1}{2} \int (\sqrt{p(x)} - \sqrt{q(x)})^2 dx}$$

Properties:

  • Symmetric: $d_H(P,Q) = d_H(Q,P)$
  • Bounded: $d_H \in [0, 1]$
  • Less sensitive to outliers than KL divergence
  • Computationally efficient
  • A true metric: proportional to the $L_2$ distance between $\sqrt{p}$ and $\sqrt{q}$, so it satisfies the triangle inequality (its square does not)

Comparison with KL:

  • Hellinger is symmetric, KL is not
  • Hellinger bounded, KL can be infinite
  • Hellinger defined even when supports don't overlap, KL can be undefined
  • KL more sensitive to tail behavior

Exam-style question: "Compare KL divergence, Total Variation distance, and Hellinger distance. When would you prefer each?"

Answer:

  • KL: Asymmetric, unbounded, undefined for disjoint supports. Good for VAE regularization (emphasizes tail behavior).
  • TV: Symmetric, bounded [0,1], measures maximum probability difference. Simple interpretation.
  • Hellinger: Symmetric, bounded, less sensitive to outliers, works with disjoint supports, and a proper metric. Good for robust comparison.

6. What is a VAE? How does it work? What is the ELBO?

What is a VAE (Variational Autoencoder)? It's a generative model that learns a continuous, probabilistic latent space. Unlike a standard autoencoder (which maps each input to a fixed vector), a VAE maps each input to a distribution (mean $\mu$ and variance $\sigma^2$).

VAE Architecture

Encoder Network $q_\phi(z|x)$ (Recognition/Inference Network):

  • Input: Data $x$ (e.g., 28×28 image)
  • Output: Two vectors (not one!):
    • Mean vector $\mu$ (e.g., 20-dim)
    • Log-variance vector $\log \sigma^2$ (e.g., 20-dim)
  • These define a diagonal Gaussian distribution over latent space
  • Parameterized by neural network weights $\phi$

Decoder Network $p_\theta(x|z)$ (Generative Network):

  • Input: Latent vector $z$ (e.g., 20-dim)
  • Output: Reconstructed data $\hat{x}$ (e.g., 28×28 image)
  • Parameterized by neural network weights $\theta$

Prior Distribution $p(z)$:

  • Standard normal: $p(z) = N(0, I)$
  • Why standard normal?
    • Simple "blank canvas" that's easy to sample from
    • Forces structure: without it, encoder could map each input to arbitrary location
    • Enables generation: just sample $z \sim N(0,I)$ and decode

How VAE Works (Forward Pass)

  1. Encoder: Given input $x$, neural network outputs $\mu$ and $\log \sigma^2$
  2. Reparameterization Trick: Sample latent $z$ using: $$z = \mu + \sigma \odot \epsilon, \quad \text{where } \epsilon \sim N(0, I)$$
    • Why this trick? Sampling is non-differentiable, but this formulation is!
    • Randomness moved to $\epsilon$ (independent of parameters)
    • Gradients can flow through $\mu$ and $\sigma$
    • Essential for backpropagation through stochastic nodes
  3. Decoder: Reconstruct $\hat{x}$ from sampled $z$
  4. Loss Computation: Calculate ELBO (see below)

What is the ELBO (Evidence Lower Bound)?

The Problem: We want to maximize likelihood $p(x)$, but it's intractable!

Why intractable? $$p(x) = \int p(x|z)p(z)dz$$

  • Must integrate over all possible latent variables $z$
  • For high-dimensional $z$ (e.g., 100+ dims) and complex decoder, no closed form
  • Direct computation impossible

The Solution: Variational Inference

  • Introduce approximate posterior $q(z|x)$ (encoder)
  • Instead of computing true posterior $p(z|x) = \frac{p(x|z)p(z)}{p(x)}$ (requires intractable $p(x)$)
  • Learn $q(z|x)$ to approximate $p(z|x)$

Deriving ELBO:

Starting from log-likelihood: $$\log p(x) = \log \int p(x|z)p(z)dz$$

Introduce $q(z|x)$ and apply Jensen's inequality: $$\log p(x) = \log \mathbb{E}_{q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \geq \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] = \text{ELBO}$$

Expanding ELBO: $$\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) || p(z))$$

Key Relationship: $$\log p(x) = \text{ELBO} + D_{KL}(q(z|x) || p(z|x))$$

Since $D_{KL} \geq 0$, ELBO is indeed a lower bound on $\log p(x)$. Maximizing ELBO:

  • Tightens the bound (minimizes approximation gap)
  • Indirectly maximizes $\log p(x)$

ELBO Components (The Loss Function)

$$ \text{ELBO} = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction Term}} - \underbrace{D_{KL}(q(z|x) || p(z))}_{\text{KL Regularization}} $$

In plain English: $$ \text{Maximize ELBO} = \text{Minimize}[\text{Reconstruction Error} + \text{KL Divergence}] $$

Term 1: Reconstruction Loss (Data Fidelity)

  • Measures: How well can decoder reconstruct input from latent code?
  • $\mathbb{E}_{q(z|x)}[\log p(x|z)]$: Expected log-likelihood over sampled $z$
  • In practice (continuous data): $-||x - \hat{x}||^2$ (MSE)
  • In practice (binary data): Binary cross-entropy
  • Goal: Make output look like input

Term 2: KL Regularization (Latent Space Structure)

  • Measures: How different is $q(z|x)$ from prior $p(z)$?
  • $D_{KL}(q(z|x) || p(z))$: KL divergence between learned posterior and $N(0,I)$
  • For Gaussian posterior and prior, closed form: $$D_{KL}(q(z|x) || p(z)) = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1\right)$$
  • Goal: Keep latent space "organized" and continuous
  • Prevents cheating: Without this, encoder could map each input to isolated point → no interpolation possible

Training VAEs

Optimization: Maximize ELBO w.r.t. both $\phi$ (encoder) and $\theta$ (decoder) $$\max_{\phi,\theta} \mathbb{E}_{p_{data}(x)}\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))\right]$$

In practice:

  1. Sample minibatch of data $\{x_1, ..., x_m\}$
  2. For each $x_i$:
    • Encode: Compute $\mu_i, \sigma_i = \text{Encoder}(x_i)$
    • Sample: $z_i = \mu_i + \sigma_i \odot \epsilon_i$ where $\epsilon_i \sim N(0,I)$
    • Decode: $\hat{x}_i = \text{Decoder}(z_i)$
    • Compute loss: $L_i = ||x_i - \hat{x}_i||^2 + D_{KL}(q(z|x_i) || p(z))$
  3. Backpropagate and update $\phi, \theta$
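A minimal PyTorch sketch of one such training step, assuming MSE reconstruction and placeholder MLP encoder/decoder (an illustration of the procedure, not a reference implementation):

```python
# One VAE training step: encode -> reparameterize -> decode -> ELBO loss.
import torch
import torch.nn as nn

d_x, d_z = 784, 20
encoder = nn.Sequential(nn.Linear(d_x, 400), nn.ReLU(), nn.Linear(400, 2 * d_z))
decoder = nn.Sequential(nn.Linear(d_z, 400), nn.ReLU(), nn.Linear(400, d_x))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

x = torch.rand(64, d_x)                    # stand-in minibatch of "images"
mu, log_var = encoder(x).chunk(2, dim=-1)  # encoder outputs mu and log sigma^2
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
x_hat = decoder(z)

recon = ((x - x_hat) ** 2).sum(dim=-1)     # reconstruction term (MSE)
kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=-1)  # closed-form KL
loss = (recon + kl).mean()                 # minimizing this maximizes the ELBO

opt.zero_grad()
loss.backward()
opt.step()
```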

What happens without KL regularization? (Midterm Q2 scenario)

Scenario: Train autoencoder with only reconstruction loss (no KL term)

Consequences:

  • Latent space becomes arbitrarily shaped (not normally distributed)
  • Encoder maps each input to isolated points (no continuity)
  • Cannot generate by sampling $z \sim N(0,I)$ → decoder never saw such $z$ during training
  • Becomes deterministic autoencoder (no probabilistic interpretation)
  • Lacks continuity: Gaps between encoded points → undefined behavior when sampling there
  • Lacks completeness: Most of latent space never visited during training

Alternative Sampling Strategies (without proper latent structure):

  1. Mean of k encoded samples:
    • Pick k training samples, encode to get $z_1, ..., z_k$
    • Average: $z_{new} = \frac{1}{k}\sum z_i$
    • Decode: $x_{new} = \text{Decoder}(z_{new})$
  2. Interpolation between encoded samples:
    • Encode two samples: $z_1 = \text{Encoder}(x_1), z_2 = \text{Encoder}(x_2)$
    • Interpolate: $z = \alpha z_1 + (1-\alpha)z_2$
    • Decode: $x = \text{Decoder}(z)$
    • Risk: If latent space has "holes", interpolation path may pass through undefined regions
  3. Cluster-based sampling:
    • Encode all training data
    • Cluster in latent space (k-means)
    • Sample from cluster centroids

Tradeoff:

  • Without KL: Better reconstruction (encoder free to use latent space optimally) but no generative capability
  • With KL: Slightly worse reconstruction but true generative model (can sample novel data)

The fundamental insight: Regular autoencoders optimize for reconstruction only. VAEs optimize for reconstruction + generative capability via KL regularization.

Exam-Style Questions

Q1: "Why can't we directly maximize log p(x) in VAEs? What makes it intractable?"

Answer: Computing $p(x) = \int p(x|z)p(z)dz$ requires integrating over all possible latent variables $z$. For high-dimensional latent spaces and complex neural network decoders, this integral has no closed form and is computationally impossible (exponential in dimension). ELBO provides tractable lower bound we can maximize instead.

Q2: "Explain the reparameterization trick. Why is it necessary?"

Answer: Sampling $z \sim N(\mu, \sigma^2)$ is non-differentiable (can't backpropagate through random sampling). Reparameterization trick: $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim N(0,I)$. Moves randomness to $\epsilon$ (independent of parameters), making gradients flow through $\mu$ and $\sigma$. Essential for training VAEs with backpropagation.

Q3: "What is the relationship between ELBO and log p(x)?"

Answer: $\log p(x) = \text{ELBO} + D_{KL}(q(z|x) || p(z|x))$. Since KL divergence is always non-negative, ELBO is lower bound on log-likelihood. Maximizing ELBO: (1) increases log p(x), (2) minimizes gap between approximate posterior $q$ and true posterior $p(z|x)$.

Q4: "Derive the KL divergence for VAE (Gaussian posterior, Gaussian prior)."

Answer: Given $q(z|x) = N(\mu, \sigma^2 I)$ and $p(z) = N(0, I)$: $$D_{KL}(q||p) = \int q(z|x) \log \frac{q(z|x)}{p(z)} dz = \frac{1}{2}\sum_{i=1}^{d}(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1)$$

This is the closed-form expression used in practice (no numerical integration needed).

Q5: "What happens if you remove the KL term from VAE loss? Can you still generate new samples?"

Answer: Without KL regularization, model becomes deterministic autoencoder. Latent space not structured (no prior enforcement). Cannot generate by sampling $z \sim N(0,I)$ because decoder never trained on such $z$. Can only reconstruct training data or interpolate between encoded training samples. Loses generative capability.

Q6: "Why do VAEs produce blurrier images than GANs? Explain in terms of the loss function."

Answer: VAE loss uses pixel-wise reconstruction (MSE or BCE), which penalizes any deviation from training examples. To minimize loss, VAE averages over all plausible reconstructions, producing blurry outputs. MSE loss: $||x - \hat{x}||^2$ treats all pixel mismatches equally, doesn't capture perceptual quality. GANs use adversarial loss focusing on realism, not pixel-perfect reconstruction.

Q7: "What is amortized inference and how do VAEs implement it?"

Answer: Amortized inference means learning a single inference network (encoder) that works for all data points, instead of optimizing latent code separately for each sample. VAE encoder $q_\phi(z|x)$ amortizes inference: one forward pass gives approximate posterior for any $x$. Alternative (non-amortized): optimize $z$ separately for each $x$ via gradient descent on ELBO - much slower. Amortization trades off: faster inference but potentially less accurate per-sample.

VAE Latent Space Properties

The lecture emphasizes two key properties VAEs achieve:

1. Continuity: Points close in latent space → similar decoded outputs

  • Gradual changes when interpolating between latent codes
  • Example: Interpolating between "1" and "2" shows smooth digit morphing
  • Achieved by: KL regularization forcing smooth, continuous distributions

2. Completeness: Sampling anywhere in latent space → meaningful output

  • No "holes" or undefined regions in latent space
  • Any random sample $z \sim N(0,I)$ decodes to valid output
  • Achieved by: Prior $p(z) = N(0,I)$ ensures all regions are used during training

Why both matter for generation:

  • Continuity alone: Could have isolated islands (good locally, but gaps between)
  • Completeness alone: Could have abrupt transitions (coverage but not smooth)
  • Together: Enable both diverse sampling AND smooth interpolation

Exam-style question: "A VAE's latent space has continuity but not completeness. What would you observe? How would you fix it?"

Answer: You'd observe: smooth interpolations between training samples work well, but random sampling from $N(0,I)$ produces garbage (many regions never trained). Cause: KL regularization too weak - encoder maps data to isolated clusters, leaving gaps. Fix: Increase KL weight (like β-VAE with β>1) to force encoder to use entire prior space, ensuring completeness.


7.1 Sampling and Inference: The forward and reverse process

Sampling:

  • Definition: Generating random variables/data points from a given distribution
  • In generative modeling: Generate new samples from learned distribution
  • Two approaches:
    1. Analytical (GMMs): Know functional form of distribution, can sample directly
    2. Implicit (GANs): Don't know analytical form, learn mapping from noise $z$ to samples

Key concept: Random variable $z$ captures randomness in sampling

  • Often $z \sim N(0,I)$ (simple distribution)
  • Generator learns complex mapping: $G(z) \to x$
  • Different $z$ → different samples

Inference:

  • Definition: The reverse of sampling - given data, estimate what model/parameters generated it
  • Uses observed data to update beliefs about model parameters (Bayes!)
  • In discriminative models (AlexNet): Forward pass (input → prediction)
  • In generative models: Reverse process (data → latent code/parameters)
  • Integral part of training generative models

Exam-style question: "Explain the difference between sampling and inference in generative models. Why is inference more complex than a simple forward pass?"

Answer: Sampling generates new data from learned distribution (forward: $z \to x$). Inference estimates latent variables or parameters given data (reverse: $x \to z$ or $x \to \theta$). Inference is complex because it often requires computing intractable posteriors or optimizing in latent space, unlike discriminative models where forward pass is straightforward.


7.2. Likelihood and Maximum Likelihood Estimation (MLE)

Likelihood $P(X|\theta)$:

  • Definition: Probability of observing data $X$ given model parameters $\theta$
  • Measures "how well does the model explain the data?"
  • NOT the same as probability: Likelihood is a function of $\theta$ for fixed $X$

Maximum Likelihood Estimation:

  • Goal: Find parameters $\theta^*$ that maximize $P(X|\theta)$
  • $$\theta^* = \arg\max_\theta P(X|\theta)$$

Log-Likelihood:

  • In practice, maximize $\log P(X|\theta)$ instead (easier math, same result)
  • Converts products to sums: $\log(p_1 \cdot p_2) = \log p_1 + \log p_2$
  • Used as loss function in deep generative models

Why maximize likelihood?

  • Want model that assigns high probability to observed data
  • Equivalent to minimizing KL divergence between data distribution and model

Challenges in deep generative models:

  • Often intractable to compute directly
  • VAEs: Optimize lower bound (ELBO) instead
  • GANs: Use game-theoretic objective (implicit likelihood)
  • Normalizing Flows: Can compute exact likelihood via change of variables

Exam-style question: "Why do VAEs optimize ELBO instead of likelihood directly? What makes likelihood intractable?"

Answer: Computing $p(x) = \int p(x|z)p(z)dz$ requires integrating over all latent variables, which is intractable for high-dimensional $z$ and complex decoder networks. ELBO provides a tractable lower bound that can be optimized via gradient descent.


7.5. Distributions: Key Concepts

Cumulative Distribution Function (CDF) vs Probability Density Function (PDF)

CDF: $F(x) = P(X \leq x)$

  • Gives probability that $X$ takes value at most $x$
  • Always non-decreasing
  • Ranges from 0 to 1

PDF: $p(x)$ (for continuous distributions)

  • Describes probability density at point $x$
  • Can exceed 1 (it's a density, not probability!)
  • Derivative of CDF: $p(x) = \frac{dF(x)}{dx}$

Key difference:

  • CDF: "Probability of being ≤ x"
  • PDF: "Density of probability around x" (needs integration to get probability)

For continuous: $P(a < X < b) = \int_a^b p(x)dx = F(b) - F(a)$

Modality

Definition: Number of peaks (modes) in a distribution

Unimodal: Single peak

  • Example: Standard Gaussian $N(0,1)$
  • Most data concentrated around one value

Multimodal: Multiple peaks

  • Example: Gaussian Mixture Model with 3 components
  • Data has multiple "clusters" or preferred values

Why it matters for generative models:

  • Real data often multimodal (e.g., different classes)
  • Mode collapse in GANs: Generator only learns some modes, ignores others
  • Good generative model should capture all modes

Exam-style question: "What is the relationship between multimodal distributions and mode collapse in GANs?"

Answer: Real data distributions are often multimodal (e.g., different face types, multiple object classes). Mode collapse occurs when GAN generator learns to produce only a subset of modes (e.g., only certain face types), failing to capture full diversity of data distribution. This is a failure to learn the complete multimodal structure.


8. Normalizing Flows: How they work and the Jacobian determinant

What are Normalizing Flows? A generative model using an invertible transformation $f$ to map:

  • Simple prior distribution $p_z(z)$ (usually $N(0,I)$) → Complex data distribution $p_x(x)$

Key idea: Unlike VAE with separate encoder/decoder, flows use a single invertible function $f$ where:

  • Forward: $z = f(x)$ (encoding/likelihood evaluation)
  • Inverse: $x = f^{-1}(z)$ (generation/sampling)

Why Normalizing Flows were developed:

  • Exact likelihood: Unlike VAEs (approximate via ELBO) or GANs (no explicit likelihood), flows compute exact $p(x)$
  • Bidirectional: Same function for encoding and decoding (perfect inverse)
  • Latent space by design: Guaranteed to match chosen prior (no regularization needed like VAE's KL term)
  • Tractable training: Direct maximum likelihood optimization

Change of Variables Formula

The Foundation (from Midterm Q3): $$p_x(x) = p_z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right| = p_z(z) \left| \det J_f \right|$$

where $J_f$ is the Jacobian matrix of $f$.

Intuitive Explanation:

  • When you transform variables, probability mass must be conserved
  • Total probability before transformation = Total probability after transformation = 1
  • If transformation stretches a region → probability density gets compressed (spread thinner)
  • If transformation compresses a region → probability density gets concentrated (packed denser)
  • The Jacobian determinant $|\det J_f|$ measures exactly how much volume changes

Concrete Example:

  • 1D transformation $x = 2z$ (stretches space by factor 2)
  • Small interval $[z, z+dz]$ becomes $[x, x+2dz]$ (twice as wide)
  • Probability mass stays same, but spread over 2× wider region
  • Therefore density must be halved: $p_x(x) = p_z(z) \cdot \frac{1}{2}$
  • Jacobian determinant = 2, so formula gives: $p_x(x) = p_z(z) / |2|$

Why determinant preserves probability:

  • For transformation $x = f(z)$, volume element changes: $dx = |\det J_f| \cdot dz$
  • Probability in small region: $p_x(x)dx = p_z(z)dz$ (must be equal)
  • Solving: $p_x(x) = p_z(z) / |\det J_f|$ (or equivalently $p_x(x) = p_z(f(x)) |\det J_{f^{-1}}|$)
  • This ensures $\int p_x(x)dx = \int p_z(z)dz = 1$ (normalization preserved)

Example Calculation (like Midterm Q3):

  1. Start with $z \sim N(0,1)$ (simple prior)
  2. Apply transformation $x = f(z) = az + b$ (affine transformation)
  3. Jacobian: $\frac{\partial f}{\partial z} = a$
  4. Determinant: $|\det J_f| = |a|$
  5. Final distribution: $p_x(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-b)^2}{2a^2}} \cdot \frac{1}{|a|} = N(b, a^2)$
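A Monte Carlo sketch confirming this result, using illustrative values $a=2$, $b=3$ (so $x$ should be $N(3, 4)$):

```python
# Check the affine change-of-variables result empirically and pointwise.
import numpy as np
from scipy.stats import norm

a, b = 2.0, 3.0
z = np.random.randn(1_000_000)
x = a * z + b
print(x.mean(), x.var())  # ~3.0 and ~4.0, i.e. N(b, a^2)

# Density check at one point: p_x(x0) = p_z((x0 - b)/a) / |a|
x0 = 4.0
print(norm.pdf((x0 - b) / a) / abs(a))    # change-of-variables formula
print(norm.pdf(x0, loc=b, scale=abs(a)))  # N(3, 4) density directly -> same
```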

Requirements for Flow Transformations

Three Essential Properties:

  1. Invertible: Must have unique inverse $f^{-1}$

    • Bijection: one-to-one and onto
    • Every $x$ maps to exactly one $z$, and vice versa
    • Needed for both sampling ($z \to x$) and likelihood ($x \to z$)
  2. Differentiable: Need to compute Jacobian

    • Required for change of variables formula
    • Enables gradient-based training
  3. Efficient: Computing $\det J_f$ should be fast

    • Full matrix determinant: $O(d^3)$ - too slow!
    • Smart architectures make this $O(d)$ or $O(d^2)$

Computational Complexity:

| Architecture | Inverse | Determinant | Example |
| --- | --- | --- | --- |
| Full matrix | $O(d^3)$ | $O(d^3)$ | General linear layer (impractical) |
| Diagonal | $O(d)$ | $O(d)$ | Element-wise scaling |
| Triangular | $O(d^2)$ | $O(d)$ | Autoregressive flows |
| Block diagonal | $O(c^3 \cdot d/c)$ | $O(cd)$ | Multi-scale architectures |
| Coupling flows | $O(d)$ | $O(d)$ | RealNVP, Glow |

The Challenge: Design expressive flows with tractable determinants!

Coupling Flows (RealNVP Architecture)

Key Innovation: Make Jacobian triangular by design

How it works:

  1. Split input $x$ into two parts: $x = [x_1, x_2]$
  2. Transform only one part, conditioned on the other:
    • $y_1 = x_1$ (unchanged - identity transformation)
    • $y_2 = x_2 \odot \exp(s(x_1)) + t(x_1)$ (affine coupling)

where $s(\cdot)$ (scale) and $t(\cdot)$ (translation) are arbitrary neural networks (can be complex!)

Why this is brilliant:

1. Triangular Jacobian: $$J_f = \begin{bmatrix} I & 0 \\ \frac{\partial y_2}{\partial x_1} & \text{diag}(\exp(s(x_1))) \end{bmatrix}$$

  • Upper-right block is zero (because $y_1$ doesn't depend on $x_2$)
  • Determinant of triangular matrix = product of diagonal elements
  • $\det J_f = \prod_i \exp(s_i(x_1))$ - $O(d)$ computation!

2. Easy Inversion:

  • Reverse operations: $x_2 = (y_2 - t(x_1)) \odot \exp(-s(x_1))$
  • $x_1 = y_1$ (unchanged)
  • Same computational cost as forward pass
  • No need to invert neural networks $s$ or $t$!

3. Expressive Power:

  • $s$ and $t$ can be arbitrarily complex neural networks
  • Not limited to simple functions
  • But their complexity doesn't affect determinant computation!

Limitations:

  • Half of dimensions pass through unchanged ($x_1 = y_1$)
  • Single coupling layer is weak
  • Solution: Stack multiple layers with alternating partition (swap which half is transformed)
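A compact PyTorch sketch of a single affine coupling layer, with small placeholder MLPs for $s$ and $t$ (a toy illustration of the structure, not RealNVP itself):

```python
# Affine coupling: y1 = x1; y2 = x2 * exp(s(x1)) + t(x1); log|det J| = sum s.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.d1 = d // 2
        mlp = lambda: nn.Sequential(nn.Linear(self.d1, 64), nn.ReLU(),
                                    nn.Linear(64, d - self.d1))
        self.s, self.t = mlp(), mlp()

    def forward(self, x):
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        s = self.s(x1)
        y2 = x2 * torch.exp(s) + self.t(x1)   # x1 passes through unchanged
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)  # output, log|det J|

    def inverse(self, y):
        y1, y2 = y[:, :self.d1], y[:, self.d1:]
        x2 = (y2 - self.t(y1)) * torch.exp(-self.s(y1))  # no need to invert s, t
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling(4)
x = torch.randn(8, 4)
y, log_det = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5))  # True: exactly invertible
```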

Stacking Flows (Composition)

Building Deep Flows: Multiple transformations compose: $f = f_K \circ f_{K-1} \circ ... \circ f_1$

Change of variables for composition: $$p_x(x) = p_z(z) \prod_{k=1}^K \left| \det \frac{\partial f_k}{\partial h_{k-1}} \right|$$

where $h_0 = x$, $h_k = f_k(h_{k-1})$, and $h_K = z$

In log-space (used for training): $$\log p_x(x) = \log p_z(z) + \sum_{k=1}^K \log \left| \det \frac{\partial f_k}{\partial h_{k-1}} \right|$$

Why stack flows?:

  1. Expressiveness: Single coupling layer is limited, composition becomes arbitrarily complex
  2. Permutation: Alternate which dimensions are transformed
    • Layer 1: transform $x_2$ conditioned on $x_1$
    • Layer 2: transform $x_1$ conditioned on $x_2$ (swap!)
    • Ensures all dimensions get transformed
  3. Multi-scale: Can split off dimensions at different depths (like in Glow)

Example RealNVP architecture:

Input x (e.g., 28×28 image = 784 dims)
  ↓
Coupling layer 1: transform dims [392:784] | dims [0:392] unchanged
  ↓
Permutation (or 1×1 conv): shuffle dimensions
  ↓
Coupling layer 2: transform dims [392:784] | dims [0:392] unchanged
  ↓
... (repeat K times)
  ↓
Output z ~ N(0, I)

Training Normalizing Flows

Objective: Maximize log-likelihood $$\max_\theta \sum_{i=1}^N \log p_\theta(x^{(i)})$$

Using change of variables: $$\log p(x) = \log p_z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|$$

In practice:

  1. Forward pass: $z = f_\theta(x)$ (transform data to latent)
  2. Evaluate prior: $\log p_z(z)$ (usually Gaussian, easy!)
  3. Compute log-determinant: $\log |\det J_f|$ (designed to be efficient)
  4. Loss = $-(\log p_z(z) + \log |\det J_f|)$
  5. Backpropagate through entire flow

Sampling (generation):

  1. Sample $z \sim p_z(z)$ (e.g., $N(0,I)$)
  2. Inverse transform: $x = f^{-1}(z)$
  3. Return $x$ (generated sample)

Key advantage: Exact likelihood - no approximation like VAE's ELBO!

Autoregressive Flows

Alternative to coupling flows: Transform dimensions sequentially

Masked Autoregressive Flow (MAF): $$x_i = z_i \cdot \exp(s_i) + t_i$$ where $s_i = s_i(x_1, ..., x_{i-1})$ and $t_i = t_i(x_1, ..., x_{i-1})$

Properties:

  • Each dimension depends on all previous dimensions
  • Jacobian is triangular (autoregressive structure)
  • Determinant: $\det J = \prod_i \exp(s_i)$ - still $O(d)$!

MAF vs Coupling Flows:

| | MAF | RealNVP (Coupling) |
| --- | --- | --- |
| Forward (sampling) | Sequential, $O(d)$ passes | Parallel, $O(1)$ |
| Inverse (likelihood) | Parallel, $O(1)$ | Sequential, $O(d)$ passes |
| Expressiveness | More expressive per layer | Needs more layers |
| Use case | Density estimation | Fast sampling |

Inverse Autoregressive Flow (IAF):

  • Swap role of $x$ and $z$ in MAF
  • Fast sampling, slow likelihood
  • Used in VAE decoders

Why NF Compute Exact Likelihood (VAE Cannot)

Normalizing Flow - Exact Likelihood:

  • We have a bijection (one-to-one invertible mapping): $z = f(x)$ and $x = f^{-1}(z)$
  • Can directly apply change of variables formula: $$p(x) = p_z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|$$
  • All components are tractable:
    • $p_z(z)$ is known (we chose it, usually $N(0,I)$)
    • $f(x)$ is computed by forward pass through the network
    • $\det J_f$ is designed to be efficiently computable (coupling flows, triangular Jacobians)
  • No integration needed! Just plug in values and calculate.

VAE - Intractable Likelihood:

  • We want $p(x)$ but it requires integrating over all possible latent variables: $$p(x) = \int p(x|z)p(z) dz$$
  • This integral is intractable because:
    • The latent space is high-dimensional (e.g., 100+ dimensions)
    • $p(x|z)$ is a complex neural network (decoder)
    • No closed-form solution exists
  • Encoder and decoder are separate networks:
    • Encoder: $q(z|x)$ approximates posterior
    • Decoder: $p(x|z)$ generates data
    • They are not exact inverses of each other
    • The approximation gap is captured by $D_{KL}(q(z|x) || p(z|x))$
  • Solution: Use ELBO as a tractable lower bound instead of computing $p(x)$ directly

Key Insight:

  • NF trades flexibility for exactness: Must use invertible architectures, but get exact likelihood
  • VAE trades exactness for flexibility: Can use any encoder/decoder architecture, but only get approximate likelihood

Exam-Style Questions

Q1: "Explain the change of variables formula intuitively. Why is the Jacobian determinant needed?"

Answer: When transforming variables, probability mass must be conserved. If transformation stretches a region by factor $k$, the probability density must be compressed by factor $1/k$ to maintain total probability = 1. The Jacobian determinant measures exactly this volume change. Formula: $p_x(x) = p_z(f(x)) / |\det J_f|$ ensures the transformed distribution still integrates to 1.

Q2: "Why do coupling flows have triangular Jacobians? Why does this matter?"

Answer: Coupling flows split input $[x_1, x_2]$ and transform only $x_2$ conditioned on $x_1$ (leaving $x_1$ unchanged). This makes $y_1$ independent of $x_2$, creating zeros in upper-right block of Jacobian (triangular structure). Determinant of triangular matrix = product of diagonal elements = $\prod \exp(s(x_1))$, which is $O(d)$ instead of $O(d^3)$ for general matrices. This makes training tractable for high dimensions.

Q3: "Why stack multiple coupling layers? What problem does alternating the partition solve?"

Answer: Single coupling layer only transforms half of dimensions ($x_2$), leaving $x_1$ unchanged. This is too weak. Stacking K layers makes transformation arbitrarily complex. Alternating which half is transformed (swap partition) ensures all dimensions eventually get transformed. Without alternation, some dimensions would pass through completely unchanged.

Q4: "Compare MAF and RealNVP. When would you use each?"

Answer:

  • MAF: Sequential sampling $O(d)$, parallel likelihood $O(1)$ → Good for density estimation tasks
  • RealNVP: Parallel sampling $O(1)$, sequential likelihood $O(d)$ → Good for fast generation
  • Both have triangular Jacobians with $O(d)$ determinant computation
  • MAF more expressive per layer but slower sampling

Q5: "Why can normalizing flows compute exact likelihood while VAEs cannot? Explain in terms of the mathematical operations required."

Answer: NFs use a single invertible function with tractable Jacobian determinant, allowing direct application of change of variables formula without integration. VAEs have separate encoder/decoder networks and require integrating over the intractable posterior $p(z|x) = p(x|z)p(z)/p(x)$ where $p(x) = \int p(x|z)p(z)dz$. This integral is intractable for high-dimensional latent spaces with complex decoders. Hence VAEs use ELBO as an approximation.

Q6: "Given transformation $x = 3z + 5$ where $z \sim N(0,1)$, derive the distribution of $x$ using change of variables."

Answer:

  • Jacobian: $\frac{\partial x}{\partial z} = 3$
  • Determinant: $|\det J| = 3$
  • Change of variables: $p_x(x) = p_z(z) / |3| = \frac{1}{3\sqrt{2\pi}} \exp(-\frac{z^2}{2})$
  • Substitute $z = (x-5)/3$: $p_x(x) = \frac{1}{3\sqrt{2\pi}} \exp(-\frac{(x-5)^2}{18})$
  • Therefore: $x \sim N(5, 9)$ (mean = 5, variance = 9)

Q7: "Given a 2D transformation $\mathbf{x} = A\mathbf{z} + \mathbf{b}$ where $\mathbf{z} \sim N(\mathbf{0}, I)$, $A = \begin{bmatrix} 2 & 1 \ 0 & 3 \end{bmatrix}$, and $\mathbf{b} = \begin{bmatrix} 1 \ 2 \end{bmatrix}$, calculate the Jacobian determinant and derive the distribution of $\mathbf{x}$."

Answer:

  • Transformation: $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 2 \end{bmatrix}$
  • This gives: $x_1 = 2z_1 + z_2 + 1$ and $x_2 = 3z_2 + 2$
  • Jacobian matrix: $J = \frac{\partial \mathbf{x}}{\partial \mathbf{z}} = \begin{bmatrix} \frac{\partial x_1}{\partial z_1} & \frac{\partial x_1}{\partial z_2} \\ \frac{\partial x_2}{\partial z_1} & \frac{\partial x_2}{\partial z_2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}$
  • Determinant calculation: $\det J = \det \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} = (2)(3) - (1)(0) = 6 - 0 = 6$
  • Since $J = A$ (linear transformation), we have $|\det J| = 6$
  • Change of variables: $p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{z}}(\mathbf{z}) / |\det J| = \frac{1}{6} \cdot \frac{1}{2\pi} \exp(-\frac{1}{2}\mathbf{z}^T \mathbf{z})$
  • Substitute $\mathbf{z} = A^{-1}(\mathbf{x} - \mathbf{b})$: Since $A^{-1} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{6} \\ 0 & \frac{1}{3} \end{bmatrix}$, we get $\mathbf{z} = \begin{bmatrix} \frac{1}{2}(x_1 - 1) - \frac{1}{6}(x_2 - 2) \\ \frac{1}{3}(x_2 - 2) \end{bmatrix}$
  • Therefore: $\mathbf{x} \sim N(\mathbf{b}, AA^T) = N\left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 5 & 3 \\ 3 & 9 \end{bmatrix}\right)$

Q8: "What is the role of the base measure/prior $p_z(z)$ in normalizing flows?"

Answer: The base measure is the simple distribution we start from (typically $N(0,I)$). Flow learns transformation $f$ to map this simple distribution to complex data distribution $p_x(x)$. Choice affects: (1) Sampling ease - Gaussian easy to sample, (2) Likelihood computation - simple $p_z(z)$ for fast evaluation, (3) Latent space structure - flows guarantee $z$ matches chosen prior exactly (unlike VAE which encourages it via KL).

Q9: "Explain how log-determinants are computed and summed when stacking K flow transformations."

Answer: For composed flow $f = f_K \circ ... \circ f_1$, change of variables gives: $\log p(x) = \log p_z(z) - \sum_{k=1}^K \log|\det J_{f_k}|$. Each layer contributes one log-determinant term. For coupling flows, each term is $\sum_i s_i(x_1)$ (from $\det J = \prod \exp(s_i)$). Sum over layers gives total log-likelihood. Training maximizes this sum via gradient descent.


Flow vs VAE vs GAN:

| | VAE | Normalizing Flow | GAN |
| --- | --- | --- | --- |
| Architecture | Encoder + Decoder (separate) | Single invertible function | Generator + Discriminator |
| Likelihood | Approximate (ELBO) | Exact | Implicit (no explicit $p(x)$) |
| Training | Maximize ELBO | Maximize log-likelihood directly | Min-max game |
| Latent space | Encouraged to be Gaussian (KL reg.) | Exactly Gaussian (by design) | Unstructured |
| Flexibility | Very flexible architectures | Constrained to invertible architectures | Very flexible |
| Key limitation | Intractable integral $\int p(x \mid z)p(z)dz$ | Must design invertible & efficient Jacobian | No likelihood, training instability |
| Sampling speed | Fast (one decoder pass) | Depends (MAF slow, RealNVP fast) | Very fast |
| Likelihood evaluation | Approximate | Exact and fast | Not available |

9. Diffusion Models: Forward/reverse process and training

What are Diffusion Models? Generative models that:

  1. Forward process: Gradually destroy data by adding Gaussian noise over $T$ steps (fixed, no learning)
  2. Reverse process: Learn to denoise and recover data from noise (learned with neural network)

Core Idea: Similar to thermodynamic diffusion - data starts organized (low entropy) and gradually becomes random noise (high entropy). We learn to reverse this process.

Forward Process (Fixed Markov Chain)

Single step transition: $$q(x_t | x_{t-1}) = N(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

where $\beta_t \in (0,1)$ is a variance schedule (noise schedule).

Interpretation:

  • Mean: $\sqrt{1-\beta_t} x_{t-1}$ - slightly shrinks previous state
  • Variance: $\beta_t I$ - adds isotropic Gaussian noise
  • If $\beta_t \to 0$: no noise added (just copy $x_{t-1}$)
  • If $\beta_t \to 1$: complete jump to noise (too aggressive, loses information)

Reparameterization trick for sampling $x_t$ from $x_{t-1}$: $$x_t = \sqrt{1-\beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon \quad \text{where } \epsilon \sim N(0,I)$$

Direct sampling at any timestep (Key property from marginalization): $$q(x_t | x_0) = N(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where:

  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i = \prod_{i=1}^t (1-\beta_i)$ (cumulative product)

Reparameterization for direct sampling: $$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1-\bar{\alpha}_t} \cdot \epsilon$$

Why this formula works:

  • As $t$ increases, $\bar{\alpha}_t$ decreases (more noise accumulates)
  • At $t=0$: $\bar{\alpha}_0 = 1$ → $x_0 = x_0$ (no noise)
  • As $t \to T$: $\bar{\alpha}_T \to 0$ → $x_T \approx N(0,I)$ (pure noise)
  • The $(1-\bar{\alpha}_t)$ term ensures variance balances to 1 as $t \to T$
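A short sketch of this closed-form forward process with a linear $\beta$ schedule (the schedule endpoints are illustrative, in the range DDPM uses):

```python
# Direct sampling of x_t from x_0 via the cumulative product alpha_bar.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def q_sample(x0, t, eps):
    """One-step sample of x_t given x_0 and noise eps ~ N(0, I)."""
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps

x0, eps = torch.randn(4), torch.randn(4)
print(q_sample(x0, 0, eps))   # t=0: almost exactly x0 (barely any noise)
print(alphas_bar[-1])         # ~4e-5: at t=T the signal is essentially gone
```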

Why multiple steps instead of one big jump?

  • Training signal: Each intermediate timestep $t$ contributes to the loss function (like keeping activations in deep networks for backprop)
  • Smooth path: Gradual noise addition creates a smoother, more learnable trajectory from data to noise
  • Easier inversion: Reverse process is easier to learn when steps are small (local denoising vs global reconstruction)
  • Theoretical: Reverse process converges to true posterior as $\beta_t \to 0$ and $T \to \infty$

Reverse Process (Learned Markov Chain)

Goal: Learn to invert the forward process, starting from pure noise $x_T \sim N(0,I)$

Reverse transition: $$p_\theta(x_{t-1} | x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

where $\mu_\theta$ and $\Sigma_\theta$ are parameterized by neural networks.

Training objective: Make reverse process match the actual time-reversal of forward process

  • Minimize KL divergence between joint distributions: $$\min_\theta D_{KL}(q(x_{0:T}) || p_\theta(x_{0:T}))$$
  • This decomposes (via ELBO) into sum of KL divergences at each timestep
  • Key simplification: When both $q$ and $p$ are Gaussian, KL divergence reduces to L2 loss on means
  • Final loss (after math): Train network to predict the noise $\epsilon$ that was added

Simplified training loss: $$L = \mathbb{E}_{t, x_0, \epsilon} \left[ ||\epsilon - \epsilon_\theta(x_t, t)||^2 \right]$$

Training algorithm:

  1. Sample $x_0$ from training data
  2. Sample timestep $t \sim \text{Uniform}(1, T)$
  3. Sample noise $\epsilon \sim N(0,I)$
  4. Compute $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$
  5. Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  6. Compute loss: $L = ||\epsilon - \hat{\epsilon}||^2$
  7. Update $\theta$ via gradient descent
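A minimal sketch of this loop in PyTorch, with a toy MLP standing in for the U-Net noise predictor (the timestep conditioning here is deliberately crude):

```python
# One DDPM training step with the epsilon-prediction objective.
import torch
import torch.nn as nn

T, d = 1000, 32
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
eps_model = nn.Sequential(nn.Linear(d + 1, 128), nn.ReLU(), nn.Linear(128, d))
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

x0 = torch.randn(16, d)                        # 1. stand-in data minibatch
t = torch.randint(0, T, (16,))                 # 2. random timestep per sample
eps = torch.randn_like(x0)                     # 3. noise
ab = alphas_bar[t].unsqueeze(-1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # 4. closed-form x_t
t_in = t.float().unsqueeze(-1) / T             # crude timestep embedding
eps_hat = eps_model(torch.cat([x_t, t_in], dim=-1))  # 5. predict noise
loss = ((eps - eps_hat) ** 2).mean()           # 6. ||eps - eps_hat||^2
opt.zero_grad()
loss.backward()
opt.step()                                     # 7. gradient step on theta
```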

Sampling (generating new samples):

  1. Start with $x_T \sim N(0,I)$ (pure noise)
  2. For $t = T, T-1, ..., 1$:
    • Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
    • Compute mean: $\mu_\theta(x_t, t)$ using predicted noise
    • Sample: $x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\Sigma_\theta(x_t, t)} \cdot z$ where $z \sim N(0,I)$
  3. Return $x_0$ (denoised image)
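A sketch of this sampling loop, assuming the standard DDPM mean parameterization $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}\right)$ and the common fixed choice $\Sigma_t = \beta_t I$ (reusing `betas`, `alphas_bar`, and `eps_model` from the training sketch above):

```python
# Reverse process: start from noise, denoise step by step back to x_0.
import torch

@torch.no_grad()
def sample(eps_model, betas, alphas_bar, d, T):
    x = torch.randn(1, d)                          # 1. x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):                 # 2. t = T, ..., 1
        t_in = torch.full((1, 1), t / T)
        eps_hat = eps_model(torch.cat([x, t_in], dim=-1))
        alpha_t, beta_t = 1 - betas[t], betas[t]
        mu = (x - beta_t / (1 - alphas_bar[t]).sqrt() * eps_hat) / alpha_t.sqrt()
        z = torch.randn_like(x) if t > 0 else 0.0  # no noise at the final step
        x = mu + beta_t.sqrt() * z
    return x                                       # 3. generated x_0
```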

Architecture: U-Net

Why U-Net?

  • Takes noisy image $x_t$ and timestep $t$ as input
  • Outputs same-sized tensor (predicted noise $\epsilon_\theta(x_t, t)$)
  • Skip connections preserve spatial information across scales
  • Encoder-decoder structure with bottleneck

Time embedding:

  • Timestep $t$ embedded via sinusoidal positional encoding (like Transformers)
  • Injected into U-Net via adaptive normalization layers or concatenation
  • Allows network to learn different denoising strategies for different noise levels

Conditioning (for conditional generation):

  • Easy to add conditions (class labels, text embeddings) through decoder
  • Concatenate or add condition embeddings alongside time embeddings
  • Enables text-to-image (Stable Diffusion) or class-conditional generation

Exam-Style Questions

Q1: "Explain why we use $\bar{\alpha}t = \prod{i=1}^t (1-\beta_i)$ instead of $\sum_{i=1}^t \beta_i$ for the cumulative noise schedule."

Answer: The noise variance compounds multiplicatively, not additively. Each step multiplies the signal by $\sqrt{\alpha_t} = \sqrt{1-\beta_t}$ and adds independent noise. When combining multiple Gaussian steps with reparameterization, the signal scaling factors multiply: $\sqrt{\alpha_1} \cdot \sqrt{\alpha_2} \cdot ... = \sqrt{\prod \alpha_i} = \sqrt{\bar{\alpha}_t}$. This ensures the final distribution has correct variance (variance of sum of independent Gaussians). A sum would incorrectly model the compounding.

Q2: "If $\beta_t$ is constant across all timesteps, what happens to $\bar{\alpha}_t$ as $t$ increases? Why is this desirable?"

Answer: If $\beta_t = \beta$ (constant), then $\bar{\alpha}_t = (1-\beta)^t$ which decreases exponentially toward 0 as $t$ increases. This is desirable because:

  • Signal strength $\sqrt{\bar{\alpha}_t} x_0$ decays exponentially
  • Noise variance $(1-\bar{\alpha}_t)$ grows toward 1
  • At large $t$, $x_t$ approaches $N(0,I)$ regardless of $x_0$
  • Ensures forward process actually transforms data to simple prior (pure noise)

Q3: "In the training loss $L = ||\epsilon - \epsilon_\theta(x_t, t)||^2$, why do we predict noise $\epsilon$ instead of directly predicting $x_0$ or $x_{t-1}$?"

Answer: Predicting noise is equivalent but more stable:

  • Noise $\epsilon \sim N(0,I)$ has constant statistics (zero mean, unit variance) regardless of timestep $t$
  • Predicting $x_0$ directly requires reconstructing entire image from very noisy $x_t$ at large $t$ (harder)
  • Predicting $x_{t-1}$ requires modeling small differences (vanishing gradients)
  • Noise prediction: Network learns "what was added" rather than "what should be" - clearer learning signal
  • Can recover $x_0$ or $x_{t-1}$ from $\epsilon$ via reparameterization: $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\epsilon)/\sqrt{\bar{\alpha}_t}$

Q4: "What would happen if we set $\beta_1 = 0.9$ (very high noise from the start)? Why is a small, gradually increasing schedule preferred?"

Answer: High initial $\beta_1$ would:

  • Destroy most information in first few steps (large jump toward noise)
  • Make reverse process much harder (massive denoising required per step)
  • Lose fine details early (irreversible information loss)
  • Poor training signal (network can't learn gradual denoising)

Gradual schedule (e.g., $\beta_1 = 0.0001$ increasing to $\beta_T = 0.02$):

  • Preserves information longer (smooth degradation)
  • Each reverse step is small, local denoising operation (easier to learn)
  • Better training signal at all timesteps
  • Matches theoretical requirement: reverse process converges as $\beta_t \to 0$

Q5: "How does the forward process $q(x_t|x_{t-1})$ differ from the reverse process $p_\theta(x_{t-1}|x_t)$ in terms of conditioning and learnability?"

Answer:

  • Forward $q(x_t|x_{t-1})$: Conditions on less noisy state $x_{t-1}$, adds noise (easy, deterministic given noise schedule)
  • Reverse $p_\theta(x_{t-1}|x_t)$: Conditions on more noisy state $x_t$, removes noise (hard, must be learned)
  • Forward is tractable: Simple Gaussian with fixed parameters $\beta_t$
  • Reverse is intractable: True posterior $q(x_{t-1}|x_t, x_0)$ depends on unknown $x_0$ and entire data distribution
  • Solution: Approximate reverse with neural network $p_\theta$ that learns to denoise without knowing $x_0$

10. Score-Based Models: Avoiding normalization constant

The Normalization Problem

Likelihood-based models require computing: $$p_\theta(x) = \frac{f_\theta(x)}{Z_\theta} \quad \text{where } Z_\theta = \int f_\theta(x) dx$$

Problems:

  • $Z_\theta$ (partition function/normalizing constant) is intractable to compute for complex $f_\theta$ (neural networks)
  • Requires integrating over entire data space (e.g., all possible images)
  • For high-dimensional data: billions/trillions of dimensions to integrate over
  • Forces architectural constraints:
    • Autoregressive models: Product of conditionals makes $Z_\theta$ tractable
    • Normalizing flows: Invertibility + change of variables makes $Z_\theta$ tractable
    • VAEs: Use surrogate objective (ELBO) to approximate maximum likelihood

Example: Energy-based model $p_\theta(x) = \frac{1}{Z_\theta} \exp(-E_\theta(x))$

  • $E_\theta(x)$ is any neural network (energy function)
  • $Z_\theta = \int \exp(-E_\theta(x)) dx$ is intractable
  • Cannot evaluate $p_\theta(x)$ or train via maximum likelihood

Score-Based Solution: Model Gradient Instead

Key Idea: Instead of modeling density $p_\theta(x)$, model the score function (gradient of log-density)

Score Function (w.r.t. data variable $x$): $$s_\theta(x) = \nabla_x \log p_\theta(x)$$

Why exponential form? Assume $p_\theta(x) = \frac{\exp(-f_\theta(x))}{Z_\theta}$ where $f_\theta$ is any neural network

Key insight - Normalization constant disappears: $$\nabla_x \log p_\theta(x) = \nabla_x \log \frac{\exp(-f_\theta(x))}{Z_\theta}$$ $$= \nabla_x [\log \exp(-f_\theta(x)) - \log Z_\theta]$$ $$= \nabla_x [-f_\theta(x)] - \nabla_x \log Z_\theta$$ $$= -\nabla_x f_\theta(x) - 0$$ $$= -\nabla_x f_\theta(x)$$

The gradient of a constant ($Z_\theta$) is zero! No need to compute normalization constant!

Why exponential form specifically?

  • Logarithm of exponential simplifies nicely: $\log \exp(-f) = -f$
  • Exponentials can represent diverse shapes (via any neural network $f_\theta$)
  • Common in physics (Boltzmann distribution, energy-based models)
  • Mathematically convenient for score matching derivations

Note: Two types of "score":

  1. Fisher score (w.r.t. parameters $\theta$): $\nabla_\theta \log p_\theta(x)$ - used in classical statistics
  2. Data score (w.r.t. data $x$): $\nabla_x \log p_\theta(x)$ - used in score-based models

We use the data score because it describes the geometry of the data distribution (which direction increases density).

Training: Score Matching

Goal: Train $s_\theta(x)$ to match the true data score $\nabla_x \log p_{data}(x)$

Naive objective - Minimize Fisher divergence: $$L = \mathbb{E}_{p_{data}(x)} \left[ ||s_\theta(x) - \nabla_x \log p_{data}(x)||^2 \right]$$

Problem: We don't know $p_{data}(x)$, so we can't compute $\nabla_x \log p_{data}(x)$!

Solution: Hyvärinen's Score Matching Theorem (2005)

  • Showed the above loss is equivalent (up to a constant) to a tractable objective: $$L = \mathbb{E}_{p_{data}(x)} \left[ ||s_\theta(x)||^2 + 2 \cdot \text{tr}(\nabla_x s_\theta(x)) \right]$$
  • This only requires computing $s_\theta$ and its derivatives, not $p_{data}$!
  • $\text{tr}(\nabla_x s_\theta(x))$ = trace of the Jacobian of the score (sum of diagonal entries)
  • $||s_\theta(x)||^2$ = squared norm of the predicted score

In practice: Modern implementations use denoising score matching:

  • Perturb data: $\tilde{x} = x + \sigma \epsilon$ where $\epsilon \sim N(0,I)$
  • Train to denoise: $L = \mathbb{E}_{x, \epsilon} \left[ ||s_\theta(\tilde{x}, \sigma) + \frac{\epsilon}{\sigma}||^2 \right]$
  • Equivalent to score matching but simpler to implement

Sampling: Langevin Dynamics

Langevin Dynamics - MCMC method to sample from distribution using only its score: $$x_{k+1} = x_k + \epsilon \cdot s_\theta(x_k) + \sqrt{2\epsilon} \cdot z_k \quad \text{where } z_k \sim N(0,I)$$

Intuition:

  • Gradient term $\epsilon \cdot s_\theta(x_k)$: Follow score uphill toward high-density regions (like gradient ascent)
  • Noise term $\sqrt{2\epsilon} \cdot z_k$: Add stochasticity to explore and escape local optima
  • Balance: As $k \to \infty$ and $\epsilon \to 0$, converges to sampling from $p_\theta(x)$

Algorithm:

  1. Initialize: $x_0 \sim N(0,I)$ (random noise)
  2. For $k = 0, 1, ..., K-1$:
    • Compute score: $s_k = s_\theta(x_k)$
    • Update: $x_{k+1} = x_k + \epsilon s_k + \sqrt{2\epsilon} z_k$ where $z_k \sim N(0,I)$
  3. Return $x_K$ (sample from learned distribution)

Connection to physics: Langevin equation models Brownian motion of particles in a potential field - drift toward low energy + random thermal fluctuations.
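
A minimal NumPy sketch of this loop, plugging in the analytic score of a standard Gaussian ($s(x) = -x$) in place of a learned $s_\theta$ so the output can be sanity-checked against the target distribution:

```python
import numpy as np

def langevin_sample(score, x0, eps=1e-2, n_steps=1000, rng=None):
    """Run Langevin dynamics from x0 using a given score function."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * score(x) + np.sqrt(2 * eps) * z
    return x

# Target N(0, I) in 2D: its score is exactly -x.
rng = np.random.default_rng(0)
samples = np.stack([langevin_sample(lambda x: -x, np.zeros(2), rng=rng)
                    for _ in range(500)])
print(samples.mean(axis=0), samples.std(axis=0))  # ≈ [0, 0] and ≈ [1, 1]
```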

The Low-Density Region Problem

Challenge: Score matching minimizes Fisher divergence: $$L = \mathbb{E}_{p_{data}(x)} \left[ ||s_\theta(x) - \nabla_x \log p_{data}(x)||^2 \right]$$

Problem: Expectation is weighted by $p_{data}(x)$

  • Errors in high-density regions (where data exists) are heavily penalized
  • Errors in low-density regions (between data modes, far from data) are largely ignored
  • But Langevin dynamics starts in low-density regions (random noise initialization)!
  • Inaccurate scores in low-density regions derail sampling from the start

Visual intuition:

  • Imagine 2D swiss roll dataset (high-density spiral)
  • Score matching learns good scores on the spiral (plenty of data)
  • Score matching learns poor scores between spiral arms (no data, but weighted less)
  • Sampling starts between arms → follows wrong gradients → fails to reach spiral

Solution: Noise Conditional Score Networks (NCSN)

Idea: Perturb data with multiple noise scales to populate low-density regions

Multiple noise levels:

  • Choose $L$ noise scales: $\sigma_1 < \sigma_2 < ... < \sigma_L$ (e.g., geometric sequence)
  • Perturb data: $p_{\sigma_i}(x) = \int p_{data}(y) \cdot N(x; y, \sigma_i^2 I) dy$ (convolution with Gaussian)
  • High $\sigma_i$: Heavily blurred data (fills low-density regions)
  • Low $\sigma_i$: Slightly blurred data (preserves structure)

Train noise-conditional score network: $$s_\theta(x, \sigma_i) \approx \nabla_x \log p_{\sigma_i}(x)$$

Training objective (weighted sum over noise levels): $$L = \sum_{i=1}^L \lambda(\sigma_i) \mathbb{E}_{p_{\sigma_i}(x)} \left[ ||s_\theta(x, \sigma_i) - \nabla_x \log p_{\sigma_i}(x)||^2 \right]$$

where $\lambda(\sigma_i)$ is a weighting function (often $\lambda(\sigma_i) = \sigma_i^2$).

Using denoising score matching: $$L = \sum_{i=1}^L \lambda(\sigma_i) \mathbb{E}_{x_0 \sim p_{data}, \epsilon \sim N(0,I)} \left[ ||s_\theta(x_0 + \sigma_i \epsilon, \sigma_i) + \frac{\epsilon}{\sigma_i}||^2 \right]$$

Sampling: Annealed Langevin Dynamics

Algorithm:

  1. Initialize: $x_L \sim N(0, \sigma_L^2 I)$ (start at highest noise level)
  2. For $i = L, L-1, ..., 1$:
    • Run $K$ steps of Langevin dynamics with score $s_\theta(x, \sigma_i)$:
      • For $k = 1, ..., K$: $$x \leftarrow x + \epsilon_i \cdot s_\theta(x, \sigma_i) + \sqrt{2\epsilon_i} \cdot z \quad \text{where } z \sim N(0,I)$$
  3. Return $x$ (final sample)

Intuition:

  • High noise ($\sigma_L$): Scores are good (heavily blurred data everywhere), rough global structure
  • Low noise ($\sigma_1$): Scores are good (close to actual data), fine details
  • Annealing: Gradually reduce noise → progressively refine sample from coarse to fine
  • Similar to simulated annealing in optimization

Connection to Diffusion Models (DDPM ↔ Score-Based)

Key insight: Predicting noise $\epsilon$ ↔ Predicting score $\nabla_x \log p(x_t)$

Relationship: From $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, the score is: $$\nabla_{x_t} \log p(x_t) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$

Therefore: $$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

Equivalence:

  • DDPM: Discrete timesteps $t \in \{1, ..., T\}$, predicts noise $\epsilon_\theta(x_t, t)$
  • Score-based: Continuous time $t \in [0,1]$ (or discrete noise levels $\sigma_i$), predicts score $s_\theta(x_t, t)$
  • Unified view: Both are special cases of Stochastic Differential Equations (SDEs) framework
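
The conversion between the two parameterizations is a one-liner; a sketch (with `eps_pred` and `alpha_bar_t` assumed to come from a trained DDPM):

```python
import torch

def eps_to_score(eps_pred: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """Turn a DDPM noise prediction into a score estimate for x_t."""
    return -eps_pred / torch.sqrt(1.0 - alpha_bar_t)
```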

Benefits of score-based view:

  • More flexible sampling (can adjust step size, use different samplers)
  • Connections to continuous-time diffusion processes
  • Theoretical foundations from statistical physics

Exam-Style Questions

Q1: "Why can't we use standard maximum likelihood to train energy-based models $p_\theta(x) = \frac{1}{Z_\theta} \exp(-E_\theta(x))$? How do score-based models solve this?"

Answer: Maximum likelihood requires computing $\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$, but $Z_\theta = \int \exp(-E_\theta(x)) dx$ is intractable (high-dimensional integral over all possible $x$). Score-based models sidestep this by modeling the gradient $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$, where the normalization constant disappears because $\nabla_x \log Z_\theta = 0$ (gradient of constant is zero).

Q2: "Explain the low-density region problem in score matching. Why does it affect Langevin dynamics sampling?"

Answer: Score matching loss $\mathbb{E}_{p_{data}} [||s_\theta - \nabla_x \log p_{data}||^2]$ is weighted by $p_{data}(x)$. Errors in low-density regions (where $p_{data}(x) \approx 0$) contribute little to the loss, so the model learns inaccurate scores there. Langevin dynamics initializes from random noise (a low-density region). Inaccurate scores at initialization cause the sampler to follow wrong gradients, preventing it from reaching high-density data regions. Solution: Add noise to populate low-density regions during training (NCSN).

Q3: "Why use multiple noise scales in NCSN instead of a single large noise level?"

Answer: Single large noise heavily blurs the data distribution, making all regions high-density but losing fine structure. Multiple noise scales provide a curriculum:

  • Large $\sigma$: Fills low-density regions, learns global structure
  • Small $\sigma$: Preserves data details, learns fine structure Annealed Langevin dynamics progressively refines samples: coarse structure from high-noise scores → fine details from low-noise scores. Like coarse-to-fine optimization.

Q4: "How are DDPM and score-based models mathematically equivalent? What does each model learn?"

Answer: DDPM learns to predict added noise $\epsilon_\theta(x_t, t)$. Score-based models learn the score $s_\theta(x_t, t) = \nabla_{x_t} \log p(x_t)$. They're related by: $s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$. This follows from $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and computing $\nabla_{x_t} \log p(x_t)$. Both frameworks learn the same underlying denoising function, just parameterized differently.

Q5: "Explain Langevin dynamics intuitively. Why does adding noise help sampling?"

Answer: Langevin dynamics updates $x_{k+1} = x_k + \epsilon s_\theta(x_k) + \sqrt{2\epsilon} z_k$:

  • Gradient term $\epsilon s_\theta(x_k)$: Pushes toward high-density regions (deterministic)
  • Noise term $\sqrt{2\epsilon} z_k$: Adds randomness (stochastic)

Without noise, it's just gradient ascent → gets stuck in local maxima. Noise allows:

  • Escaping local optima (exploration)
  • Proportional sampling (visit high-density regions more, but not exclusively)
  • Detailed balance (necessary for MCMC convergence to target distribution)

Q6: "Compare the normalization challenges in VAE vs Score-Based Models. How does each solve them?"

Answer:

  • VAE: Intractable $p(x) = \int p(x|z)p(z) dz$ (integral over latent space). Solution: Use ELBO as surrogate objective (approximate inference with $q(z|x)$).
  • Score-based: Intractable $Z_\theta$ in $p_\theta(x) = f_\theta(x)/Z_\theta$ (integral over data space). Solution: Model gradient $\nabla_x \log p_\theta(x)$ instead, where $Z_\theta$ disappears (gradient of constant is zero).

Both avoid direct likelihood computation but for different intractable integrals using different mathematical tricks.


12. Generative Adversarial Networks (GANs): Adversarial training and architecture

What are GANs? GANs are a unique approach to generative modeling that frames the unsupervised problem as a supervised one using adversarial training between two networks.

Key Insight: Instead of directly modeling $p(x)$, GANs use a game-theoretic approach:

  • Generator $G$: Creates fake samples from noise $z \sim p_z(z)$ (usually $N(0,I)$)
  • Discriminator $D$: Classifies samples as real (from data) or fake (from $G$)
  • They compete in a zero-sum, two-player min-max game until equilibrium
    • Zero-sum: One player's gain is the other's loss (D maximizes what G minimizes)
    • Alternating optimization: Train D for k steps, then G for 1 step (repeat)

The Analogy:

  • Generator = Artist trying to create realistic paintings
  • Discriminator = Art critic trying to spot fakes
  • Training = Friendly competition that makes both better
  • Goal = Generator becomes so good that critic can only guess (50% accuracy)

Vanilla GAN Architecture

Generator:

  • Input: Random noise $z \sim p_z(z)$ (latent vector, e.g., 100-dim)
  • Output: Generated sample $G(z)$ (e.g., 28×28 image)
  • Architecture: Fully connected layers → Conv layers (in DCGAN)

Discriminator:

  • Input: Sample $x$ (either real from data or fake from $G$)
  • Output: Probability $D(x) \in [0,1]$ (1 = real, 0 = fake)
  • Architecture: Classifier network (conv layers + sigmoid output)

GAN Loss Function (Min-Max Game)

Original formulation: $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1-D(G(z)))]$$

Breaking it down:

  • First term $\mathbb{E}_{x \sim p_{data}}[\log D(x)]$: Discriminator correctly identifies real data
  • Second term $\mathbb{E}_{z \sim p_z}[\log(1-D(G(z)))]$: Discriminator correctly identifies fake data
  • Discriminator goal: Maximize both terms (correct classification)
  • Generator goal: Minimize second term (fool discriminator)

This is Binary Cross-Entropy (BCE) Loss!

Training Algorithm

For each iteration:

  1. Train Discriminator (k steps, typically k=1):

    • Sample minibatch of real data $\{x^{(1)}, ..., x^{(m)}\}$
    • Sample minibatch of noise $\{z^{(1)}, ..., z^{(m)}\}$
    • Generate fake samples: $\{\tilde{x}^{(1)} = G(z^{(1)}), ..., \tilde{x}^{(m)} = G(z^{(m)})\}$
    • Update $D$ to maximize: $\frac{1}{m}\sum_{i=1}^m [\log D(x^{(i)}) + \log(1-D(G(z^{(i)})))]$
  2. Train Generator (1 step):

    • Sample minibatch of noise $\{z^{(1)}, ..., z^{(m)}\}$
    • Update $G$ to minimize: $\frac{1}{m}\sum_{i=1}^m \log(1-D(G(z^{(i)})))$

Non-Saturating Generator Loss (used in practice):

  • Problem: $\log(1-D(G(z)))$ saturates when $D$ is confident (early training)
  • Solution: Instead minimize $-\log D(G(z))$ (maximize prob of fooling $D$)
  • Same gradient direction but stronger signal when $D$ confidently rejects
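
A minimal PyTorch sketch of one alternating update with the non-saturating generator loss; `G`, `D` (with sigmoid output) and their optimizers are assumed to exist:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_d, opt_g, x_real, z_dim=100):
    m = x_real.size(0)
    ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(m, z_dim)
    x_fake = G(z).detach()                       # don't backprop into G here
    d_loss = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator step (non-saturating): maximize log D(G(z)) ---
    z = torch.randn(m, z_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)   # = -log D(G(z))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```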

Convergence Criterion:

  • Optimal discriminator: $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$, which equals $\frac{1}{2}$ everywhere at equilibrium
  • When $p_g = p_{data}$: Generator has perfectly learned the data distribution

Connection to JS Divergence:

  • The original GAN loss minimizes the Jensen-Shannon (JS) divergence between $p_{data}$ and $p_g$
  • At the optimal discriminator $D^*$, the generator's objective becomes: $$\min_G V(G, D^*) = 2 \cdot D_{JS}(p_{data} || p_g) - 2\log 2$$
  • Problem: JS divergence is constant when supports don't overlap → vanishing gradients

DCGAN (Deep Convolutional GAN)

Key architectural guidelines:

  1. Replace pooling with strided convolutions (discriminator) and fractional-strided convolutions/transposed conv (generator)
  2. Use Batch Normalization in both G and D
  3. Remove fully connected layers for deeper architectures
  4. Generator activations:
    • ReLU for all layers except output
    • Tanh for output layer
  5. Discriminator activation:
    • LeakyReLU for all layers

Significance: Made GANs work reliably for image generation, became foundation for vision-based GANs

Conditional GANs (cGANs)

Idea: Control what the generator produces by conditioning on additional information $y$ (e.g., class labels)

Modified loss: $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x|y)] + \mathbb{E}_{z \sim p_z}[\log(1-D(G(z|y)|y))]$$

How conditioning works:

  • Generator: $G(z, y)$ - concatenate noise $z$ and label $y$ as input
  • Discriminator: $D(x, y)$ - receives both image and label
    • Must learn: Real image + correct label = Real
    • Real image + wrong label = Fake
    • Fake image + any label = Fake

Benefit: Can generate specific classes on demand (e.g., "generate a digit 7")

Key insight for cGANs:

  • Training discriminator with mismatched pairs (real image + wrong label) forces it to learn semantic meaning
  • This makes discriminator a better critic - it doesn't just judge "realistic", but "realistic AND matches condition"
  • Helps reduce mode collapse: generator can't just produce one realistic output for all conditions

13. GAN Failure Modes and Solutions

Mode Collapse

Problem: Generator learns to produce only a limited subset of the data distribution

  • Example: Face generator produces only a few face types instead of diverse faces
  • Root cause: GAN loss has no explicit diversity term
  • Generator finds a few "safe" outputs that always fool discriminator

Why it happens:

  • Discriminator only judges "real vs fake", not diversity
  • Generator exploits weaknesses: if one type of output fools $D$, keep producing it
  • No mechanism to learn the entire distribution, only to fool $D$

Vanishing Gradients

Problem: When discriminator becomes too good, gradients to generator vanish

  • When $D(G(z)) \approx 0$: $\log(1-D(G(z))) \approx 0$ and its gradient w.r.t. $G$ saturates → no learning signal
  • Generator stops improving

Cause: Sigmoid output $D(x) \in [0,1]$ with BCE loss causes saturation

Other Failure Modes

  1. Convergence issues: Hard to achieve simultaneous equilibrium of both networks
  2. Perfect Discriminator: No gradients flow to generator
  3. Poor Discriminator: Generator doesn't learn realistic features

Wasserstein GAN (WGAN) Solution

Key Insight: Use Wasserstein-1 (Earth Mover's) distance instead of JS divergence

Why Wasserstein distance?

  • JS divergence problem: When $p_{data}$ and $p_g$ have disjoint supports (don't overlap), JS = constant
    • No meaningful gradient signal
    • Common in high-dimensional spaces (images)
  • Wasserstein distance: Measures "how much work" to move one distribution to another
    • Always provides gradient even with disjoint supports
    • More stable, smoother gradients

WGAN Changes:

  1. Remove sigmoid from discriminator → Call it "Critic" instead

    • Output: $C(x) \in (-\infty, \infty)$ (unbounded, linear activation)
  2. New loss: $$\min_G \max_{C \in \mathcal{C}} \mathbb{E}_{x \sim p_{data}}[C(x)] - \mathbb{E}_{z \sim p_z}[C(G(z))]$$

    • Critic maximizes: score real data high, fake data low
    • Generator minimizes: make fake data score high
  3. Enforce Lipschitz constraint on critic:

    • Original WGAN: Weight clipping (clip weights to $[-0.01, 0.01]$ after each update)
    • WGAN-GP (improved): Gradient penalty instead of clipping $$L_{GP} = \lambda \mathbb{E}_{\hat{x}}[(||\nabla_{\hat{x}} C(\hat{x})||_2 - 1)^2]$$ where $\hat{x}$ is a point interpolated between real and fake samples
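
A sketch of the gradient penalty in PyTorch, assuming image-shaped batches `(N, C, H, W)` and a critic with scalar output:

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    """WGAN-GP: penalize ||grad C(x_hat)|| deviating from 1 at interpolated points."""
    alpha = torch.rand(x_real.size(0), 1, 1, 1)          # per-sample mixing weight
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=critic(x_hat).sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()
```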

Benefits:

  • Addresses mode collapse: Smoother gradients help explore full distribution
  • No vanishing gradients: Critic provides meaningful gradients even when far from optimal
  • Training stability: Can train critic to optimality without worrying about vanishing gradients

Implementation details:

  • Train critic 5 times per generator update (vs 1:1 in vanilla GAN)
  • Use RMSProp with small learning rate (0.00005)
  • No momentum-based optimizers

Comparison: BCE Loss vs W-Loss

| | BCE Loss (Vanilla GAN) | W-Loss (WGAN) |
|---|---|---|
| Discriminator output | $[0, 1]$ (sigmoid) | $(-\infty, \infty)$ (linear) |
| Loss function | $\log D(x)$ and $\log(1-D(G(z)))$ | $C(x) - C(G(z))$ |
| Gradient behavior | Can vanish/saturate | Always provides signal |
| Mode collapse | Common | Reduced |
| Constraint | None | Lipschitz (weight clipping or GP) |

14. Progressive Growing GANs and StyleGAN

Progressive Growing GAN (ProGAN)

Problem: GANs struggle with high-resolution images (1024×1024)

  • Higher resolution → easier to tell real from fake
  • Must learn all scales simultaneously (very hard)

Solution: Incrementally grow both G and D during training

  1. Start with 4×4 resolution
  2. Train until stable
  3. Add layers to increase resolution: 4×4 → 8×8 → 16×16 → ... → 1024×1024
  4. Smoothly fade in new layers to avoid shocking existing layers

Benefits:

  • Stability: Easier to learn simple (low-res) structure first, then details
  • Speed: Most iterations at low resolution → 2-6× faster training
  • Quality: Achieves unprecedented 1024×1024 image quality

Key technique - Fade-in mechanism: When adding new layer, blend between old and new:

  • $\alpha$ (fade-in factor) goes from 0 → 1 over time
  • Output = $(1-\alpha) \cdot \text{old path} + \alpha \cdot \text{new path}$
  • Why fade-in? Avoids "shocking" well-trained lower-resolution layers
  • New layers smoothly introduced while keeping existing layers trainable
  • Both G and D grow in synchrony (mirror images of each other)

Training details:

  • Uses WGAN-GP (Wasserstein GAN with Gradient Penalty) loss
  • WGAN-GP replaces weight clipping with gradient penalty for better stability
  • Progressive approach allows most iterations at low resolution (faster)

Evaluation Metrics for GANs

Why needed? Hard to objectively evaluate generated images

1. Fréchet Inception Distance (FID) - Most common:

  • Use pretrained InceptionV3 to extract features from real and fake images
  • Compute mean $\mu$ and covariance $\Sigma$ of feature distributions
  • FID = Fréchet distance between Gaussians: $$\text{FID} = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$$
  • Lower is better (0 = perfect match to real distribution)
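
A minimal NumPy/SciPy sketch of the FID formula, assuming `(N, d)` arrays of InceptionV3 features extracted elsewhere:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID from (N, d) Inception feature arrays for real and generated images."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sig_r = np.cov(feats_real, rowvar=False)
    sig_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sig_r @ sig_g)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(sig_r + sig_g - 2 * covmean))
```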

2. Inception Score (IS):

  • Classify generated images with InceptionV3
  • Good images should:
    • Have confident predictions (low entropy per image)
    • Cover diverse classes (high entropy overall)
  • IS = $\exp(\mathbb{E}_x[D_{KL}(p(y|x) \,||\, p(y))])$
  • Higher is better

3. Structural Similarity (MS-SSIM):

  • Perceptual metric comparing structure, luminance, contrast
  • SSIM formula: $$\text{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$ where $\mu$ = mean, $\sigma^2$ = variance, $\sigma_{xy}$ = covariance, $C_1, C_2$ = stability constants
  • MS-SSIM: SSIM applied at multiple scales (pyramid) for multi-resolution comparison
  • Higher is better (1 = identical images)

4. Sliced Wasserstein Distance (SWD):

  • Approximates Wasserstein distance using 1D projections
  • Projects high-dimensional distributions onto random directions, computes 1D Wasserstein distance
  • SWD = average over many random projections (directions)
  • Measures statistical similarity between real and generated distributions
  • Lower is better

Important: For non-image domains (audio, signals), retrain classifier on your domain!

StyleGAN Architecture

Key Innovation: Separate style from content using intermediate latent space

Architecture:

z (512-dim noise)
  → Mapping Network (8 FC layers)
  → w (512-dim intermediate latent)
  → AdaIN at each conv layer (style injection)
  → Generated image

Three main components:

  1. Mapping Network: $f: \mathcal{Z} \to \mathcal{W}$

    • 8 fully-connected layers
    • Maps Gaussian noise $z$ to intermediate latent $w$
    • Why? $z$ must follow fixed distribution, but $w$ is free to be disentangled
  2. Synthesis Network with AdaIN:

    • Starts from learned constant (4×4 tensor), not $z$!
    • At each conv layer: Apply Adaptive Instance Normalization (AdaIN)
    • AdaIN: $\text{AdaIN}(x, y) = \sigma(y) \frac{x - \mu(x)}{\sigma(x)} + \mu(y)$
      • Normalizes activation $x$, then scales/shifts by style $y$ (derived from $w$)
      • Injects style at multiple scales: coarse (4×4-16×16), middle (32×32-64×64), fine (128×128-1024×1024)
  3. Noise Injection:

    • Add Gaussian noise to each feature map
    • Controls stochastic variation (hair strands, pores) without affecting global structure (pose, identity)
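
A minimal PyTorch sketch of the AdaIN operation from component 2 above, assuming per-sample style scale/shift of shape `(N, C)` produced by a learned affine transform of $w$:

```python
import torch

def adain(x, style_scale, style_shift, eps=1e-5):
    """AdaIN on (N, C, H, W) features: normalize per channel, then re-style.

    style_scale / style_shift play the roles of sigma(y) and mu(y).
    """
    mu = x.mean(dim=(2, 3), keepdim=True)             # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps     # per-sample, per-channel std
    x_norm = (x - mu) / sigma
    return style_scale[:, :, None, None] * x_norm + style_shift[:, :, None, None]
```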

Disentanglement in StyleGAN

Disentanglement: Each dimension in latent space controls one independent factor of variation

  • Example: One dimension = age, another = gender, another = hair color
  • Linear subspaces control factors independently

Why $w$ is more disentangled than $z$:

  • $z \sim N(0,I)$ must match training data distribution (potentially entangled)
  • $w = f(z)$ is free from that constraint (learned mapping can untangle)
  • Hypothesis: Easier to generate from disentangled representation

Path Length Regularization (PLR):

  • Problem: Interpolating in latent space causes non-linear changes in image

    • Features absent in endpoints appear in middle (e.g., glasses appear mid-interpolation)
    • Image changes drastically for small latent moves (unpredictable)
  • Perceptual Path Length: Measures how much image changes during interpolation

    • Uses VGG16 embeddings to measure perceptual distance
    • "Full" metric: subdivide interpolation path, sum perceptual distances
    • "End" metric: only measure endpoints (biased toward input space)
  • Solution: Penalize large changes in image space during latent interpolation $$L_{PLR} = \mathbb{E}_{w,t}[(||\mathbf{J}^T_w \mathbf{y}||_2 - a)^2]$$ where $\mathbf{J}_w$ is Jacobian of $G$ w.r.t. $w$, $\mathbf{y} \sim N(0,I)$, $a$ is moving average

    • Penalizes deviation from expected path length (encourages consistency)
  • Effect:

    • Smoother, more linear interpolations in latent space
    • Better disentanglement (W-space more linear)
    • Easier inversion (more predictable mapping)

StyleGAN2 Improvements

Problem 1: Droplet artifacts

  • Blob-like artifacts appear at 64×64+ resolution in all feature maps
  • Visible in intermediate layers even when not obvious in final image
  • Cause: AdaIN normalizes mean/variance of each feature map independently
    • Destroys information in relative magnitudes between features
    • Generator exploits this: creates strong localized spike that dominates statistics
    • This allows generator to "sneak" signal strength information past normalization
  • Solution: Replace AdaIN with weight demodulation
    • Removes the normalization step that caused the artifact
    • Modulates convolution weights instead of activations
    • Retains full style controllability without artifacts

Problem 2: Progressive growing artifacts

  • Phase artifacts, location preference for details, compromised shift invariance
  • Solution: Remove progressive growing entirely
    • Use direct training at target resolution with improved regularization
    • Alternative architectures explored to achieve quality without progressive growing

Other improvements:

  • Lazy regularization: Apply regularization (R1 gradient penalty) every N minibatches instead of every batch
    • Typical: R1 penalty once every 16 minibatches (not every iteration)
    • Why it works: Main loss and regularization can be optimized at different frequencies
    • Reduces computation by ~15-30% with no quality loss
    • Greatly reduces memory usage

15. GAN Exam-Style Questions and Concepts

Conceptual Questions (instructor's style):

  1. "Why do vanilla GANs suffer from mode collapse? How does Wasserstein loss help?"

    • Answer: Vanilla GANs only optimize "real vs fake", no diversity term. Generator finds limited outputs that fool D. WGAN provides smooth gradients even with disjoint distributions, encouraging exploration of full distribution.
  2. "What is the equilibrium condition for a GAN? What does $D^*(x) = 0.5$ mean?"

    • Answer: At equilibrium, $p_g = p_{data}$. Optimal discriminator $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = 0.5$ means it can only guess (generator perfectly learned distribution).
  3. "Why is GAN training called a 'zero-sum game'? Why can't we just train G and D simultaneously with gradient descent?"

    • Answer: Zero-sum because D's objective (maximize $V$) is exactly opposite to G's objective (minimize $V$) - one's gain is other's loss. Can't train simultaneously because we need D to be optimal (or near-optimal) to provide meaningful gradients to G. If both updated together, D never reaches optimality, giving poor training signal to G. That's why we use alternating optimization: train D for k steps to near-optimality, then update G once.
  4. "Why does WGAN train the critic 5 times per generator update (5:1 ratio) while vanilla GAN uses 1:1?"

    • Answer: WGAN's theoretical guarantee requires critic to be nearly optimal for Wasserstein distance approximation to be accurate. Unlike vanilla GAN where perfect D causes vanishing gradients (bad), WGAN benefits from better critic (provides better distance estimate). The 5:1 ratio ensures critic stays ahead, giving reliable gradients. Vanilla GAN uses 1:1 because training D too well causes gradient vanishing.
  5. "Why does StyleGAN use an intermediate latent space $w$ instead of directly using $z$?"

    • Answer: $z$ must follow fixed Gaussian distribution matching training data (may be entangled). $w$ is free from this constraint, allowing learned mapping to disentangle factors of variation.
  6. "Explain the connection between the GAN min-max loss and JS divergence. Why is this problematic?"

    • Answer: At optimal discriminator $D^*$, minimizing GAN loss is equivalent to minimizing JS divergence between $p_{data}$ and $p_g$. Problem: When distributions have disjoint supports (common in high dimensions), JS divergence is constant (log 2), providing no gradient. This causes vanishing gradients - generator gets no learning signal about which direction to move.
  7. "What's the difference between BCE loss and Wasserstein loss in GANs?"

    • Answer: BCE uses sigmoid discriminator output [0,1], can vanish when D is confident. Wasserstein uses unbounded critic output, provides gradients even with disjoint supports. Key: W-loss is a proper distance metric that measures "how far" distributions are, not just "different or same".
  8. "In conditional GANs (cGANs), why do we feed the discriminator both 'real image + wrong label' pairs during training? Doesn't this confuse it?"

    • Answer: No, it improves training! The discriminator must learn three rejection cases: (1) fake images, (2) real images with wrong labels, (3) mismatched pairs. This forces D to understand semantic content, not just image quality. It becomes a better critic that judges "realistic AND semantically correct", which gives better gradient signal to G. Also helps prevent mode collapse - G can't fool D with one realistic output for all conditions.
  9. "Why does Progressive GAN use a fade-in mechanism instead of abruptly adding new layers?"

    • Answer: Abruptly adding layers would "shock" the well-trained low-resolution layers with random gradients from untrained high-resolution layers. Fade-in smoothly blends old path (trained) and new path (training) using $\alpha$: output = $(1-\alpha) \cdot \text{old} + \alpha \cdot \text{new}$ where $\alpha$ goes 0→1. This preserves learned knowledge while introducing new capacity. Both G and D grow synchronously.
  10. "Explain why StyleGAN's intermediate latent space W is more disentangled than input space Z. Use the 'warped distribution' argument."

  • Answer: Z must follow fixed Gaussian $N(0,I)$, but training data has correlations (e.g., beards mostly on males). To match training distribution, Z-space must "warp" - creating curved manifolds where correlated features cluster. This warping = entanglement. Mapping network $f: Z \to W$ can "unwarp" this: W is free from fixed distribution constraint, allowing learned transformation to straighten the manifolds into linear subspaces (disentanglement). Training encourages this because disentangled representations are easier to generate from.
  1. "What problem does Path Length Regularization solve in StyleGAN? Why can features appear/disappear during interpolation?"
  • Answer: Problem: Non-linear mapping from latent to image means linear interpolation in latent space causes non-linear changes in image space. Example: interpolating between "no glasses" and "no glasses" can produce "glasses" in middle because latent path crosses through "glasses" region (curved manifold). PLR penalizes large image changes for small latent moves using Jacobian norm, encouraging smoother, more linear geometry. Makes interpolation predictable and improves disentanglement.
  1. "Why did StyleGAN2 need to replace AdaIN with weight demodulation? Explain the 'sneaking signal strength' problem."
  • Answer: AdaIN normalizes each feature map independently (divides by std, centers at mean). This destroys information about relative magnitudes between different features. Generator exploited this flaw: it created strong localized spikes (droplet artifacts at 64×64+) that dominate the mean/std statistics of that feature map. By controlling spike magnitude, generator could "sneak" signal strength information past the normalization. Weight demodulation fixes this by modulating convolution weights instead of activations, avoiding the normalization that enabled the exploit.
  1. "What is 'lazy regularization' in StyleGAN2? Why doesn't it hurt performance to apply regularization less frequently?"
  • Answer: Lazy regularization applies R1 gradient penalty once every N minibatches (e.g., N=16) instead of every iteration, but with N× weight to compensate. Works because: (1) main loss and regularization have different time scales - regularization prevents long-term drift, doesn't need frequent updates, (2) computing gradients for regularization is expensive, doing it 1/16th as often saves 15-30% computation. No quality loss because regularization's role (smoothing, preventing pathological solutions) doesn't require immediate response.
  1. "Explain how AdaIN in StyleGAN controls style injection. What does the formula $\text{AdaIN}(x, y) = \sigma(y) \frac{x - \mu(x)}{\sigma(x)} + \mu(y)$ mean?"
  • Answer: AdaIN normalizes content features $x$ to zero mean and unit variance (removes original style), then applies affine transformation using style $y$'s statistics. $\sigma(y)$ controls scale/contrast, $\mu(y)$ controls shift/brightness. Style $y$ comes from learned affine transform of $w$ (intermediate latent). Different layers control different scales: early layers (4×4-16×16) = coarse features (pose, shape), middle (32×32-64×64) = facial features, late (128×128-1024×1024) = fine details (hair strands, skin texture). This hierarchical injection enables style mixing.

Calculation Questions:

  1. "Given discriminator outputs $D(x_{real}) = 0.9$ and $D(G(z)) = 0.3$, calculate the discriminator and generator losses."
  • D loss: $-[\log(0.9) + \log(1-0.3)] = -[\log(0.9) + \log(0.7)]$
  • G loss (non-saturating): $-\log(0.3)$
  1. "If the optimal discriminator outputs $D^*(x) = 0.7$ for a particular sample $x$, what can you infer about $p_{data}(x)$ and $p_g(x)$?"
  • Answer: Using $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = 0.7$, we get:
    • $p_{data}(x) = 0.7(p_{data}(x) + p_g(x))$
    • $0.3 p_{data}(x) = 0.7 p_g(x)$
    • Therefore: $\frac{p_{data}(x)}{p_g(x)} = \frac{0.7}{0.3} = \frac{7}{3} \approx 2.33$
    • Real data is ~2.3x more likely than generated data at $x$ → Generator underproduces this region
  1. "You're evaluating a GAN trained on face images. FID = 15, IS = 3.2, MS-SSIM = 0.85. Interpret these metrics."
  • Answer:
    • FID = 15: Low is good. 15 indicates decent quality (real distributions not perfectly matched but close). Feature distributions fairly similar to real data.
    • IS = 3.2: Relatively low score suggests either: (1) low confidence predictions (blurry/unclear images), or (2) limited diversity (not covering all face types). For faces, IS may be limited by dataset (faces aren't as diverse as ImageNet's 1000 classes).
    • MS-SSIM = 0.85: High structural similarity (max = 1.0) suggests good preservation of structure, luminance, contrast at multiple scales.
    • Overall: Decent quality but room for improvement in diversity or sharpness.
  1. "Why do we use Inception-based metrics (FID, IS) for GANs? Could we use any classifier?"
  • Answer: InceptionV3 pretrained on ImageNet provides rich feature representations that correlate well with human perception. We can use any classifier, but must retrain it on the target domain. Example: for audio GANs, train ResNet on audio spectrograms. For earthquake signals (no standard classifier exists), can't directly use FID - must first train domain-specific classifier. The classifier quality determines metric reliability.

Comparison Questions:

  1. "Compare GANs, VAEs, and Normalizing Flows in terms of: (a) latent space structure, (b) training objective, (c) mode coverage."
  • See table below

GAN vs Other Generative Models

| | VAE | Normalizing Flow | GAN |
|---|---|---|---|
| Latent space | Continuous, Gaussian (regularized) | Exactly Gaussian (by design) | Any (no explicit prior) |
| Training | Maximize ELBO | Maximize log-likelihood | Min-max game |
| Likelihood | Approximate (lower bound) | Exact | Implicit (no explicit $p(x)$) |
| Mode coverage | Good (encouraged by KL) | Good | Can suffer mode collapse |
| Sample quality | Often blurry | Good | Excellent (sharp images) |
| Training stability | Stable | Stable | Unstable (improved with WGAN) |
| Inference | Can encode x → z | Can encode x → z | No encoder (need inversion) |

16. Disentanglement: Understanding and Controlling Latent Representations

What is Disentanglement?

Disentanglement refers to having a latent representation where each dimension controls one independent factor of variation.

Ideal disentangled representation:

  • Each latent dimension $z_i$ controls exactly one semantic attribute
  • Changes in $z_i$ affect only that attribute, nothing else
  • Example: $z_1$ = age, $z_2$ = gender, $z_3$ = hair color (completely independent)

Entangled representation (reality):

  • Latent dimensions are correlated/interdependent
  • Changing one dimension affects multiple attributes
  • Example: Changing "beard" dimension also affects "gender" (beards correlated with males)

Why it matters:

  • Controllability: Want to manipulate specific attributes (e.g., add smile without changing identity)
  • Interpretability: Understand what each dimension represents
  • Generalization: Linear interpolation in latent space should produce smooth, meaningful changes

Controllability vs Conditioning

Conditioning (Week 9 concept):

  • Provide explicit labels during training: $p(x|y)$
  • Example: cGAN with class labels
  • Requires labeled data
  • Controls what class to generate (e.g., "generate digit 7")

Controllability (Week 11 concept):

  • Manipulate features in latent space without labels
  • Example: Adjusting $z$ to add beard, change age, etc.
  • No labeled data needed (unsupervised)
  • Controls how features appear (e.g., "make person smile more")

Key difference:

  • Conditioning: "Generate a cat" (discrete choice)
  • Controllability: "Make this cat fluffier" (continuous manipulation)

The Feature Correlation Problem

Desired case: Uncorrelated features

Original → Add beard → Still male, different age possible
Original → Change age → Still clean-shaven, same gender

Reality: Correlated features (entangled)

Original → Add beard → ALSO becomes more masculine, older
Original → Make feminine → ALSO loses beard, different pose

Root cause: Training data has natural correlations

  • Most beards appear on males → beard entangled with gender
  • Older people have different features → age entangled with wrinkles, hair color

Why entanglement can be good:

  • Preserves realism: Bearded females are rare, model reflects this
  • Learned from data distribution
  • Contemporary models handle via instruction following (text conditioning)

Finding Directions in Latent Space

Goal: Find direction $\mathbf{d}$ in latent space where moving along it changes specific attribute

Method 1: Gradient-based (supervised)

Use a pretrained classifier or discriminator:

  1. Start with latent code $z_0$
  2. Define target attribute via classifier: $y = C(G(z))$
  3. Compute gradient: $\frac{\partial y}{\partial z}$
  4. Update $z$ in gradient direction (like SGD, but on $z$, not weights): $$z_{t+1} = z_t + \alpha \frac{\partial y}{\partial z}$$
  5. Generate $x = G(z_{t+1})$ with enhanced attribute

Example: To increase "smile" attribute:

  • Use smile classifier $C_{smile}$
  • Optimize $z$ to maximize $C_{smile}(G(z))$
  • Results in $z$ that generates smiling face
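
A minimal PyTorch sketch of this latent gradient ascent; the pretrained generator `G` and attribute classifier `classifier` are assumed and kept frozen (only $z$ is optimized):

```python
import torch

def enhance_attribute(G, classifier, z0, alpha=0.1, n_steps=50):
    """Gradient ascent on the latent code to increase a classifier-defined attribute."""
    z = z0.clone().requires_grad_(True)
    for _ in range(n_steps):
        score = classifier(G(z)).sum()        # e.g., predicted "smile" probability
        grad = torch.autograd.grad(score, z)[0]
        z = (z + alpha * grad).detach().requires_grad_(True)
    return G(z).detach(), z.detach()
```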

Method 2: InterfaceGAN / GANSpace

  • Find semantic directions in latent space using labeled examples
  • Linear SVM to find decision boundary between attribute classes
  • Normal vector to boundary = control direction

17. Quantifying Disentanglement: DCI Metrics

Problem: How to objectively measure disentanglement quality?

DCI Framework requires:

  • Learned latent representation $\mathbf{c}$ (code) of dimension $D$
  • Ground truth generative factors $\mathbf{z}$ of dimension $K$
  • Example: 3D shapes with $z$ = [azimuth, elevation, red, green, blue]

Ideal case: If $D = K$, perfect disentanglement means $\mathbf{c}$ is a monomial matrix (generalized permutation) transformation of $\mathbf{z}$

  • Each $c_i$ is scaled version of exactly one $z_j$
  • If $D > K$: Some dimensions are "dead" (don't capture any factor)
  • DCI metrics quantify deviation from this ideal one-to-one mapping

Process:

  1. Train model on synthetic dataset with known factors $\mathbf{z}$
  2. Extract learned codes $\mathbf{c} = M(x)$ for all samples
  3. Train $K$ regressors to predict $z_j$ from $\mathbf{c}$: $\hat{z}_j = f_j(\mathbf{c})$
  4. Extract importance matrix $R \in \mathbb{R}^{D \times K}$
    • $R_{ij}$ = relative importance of $c_i$ in predicting $z_j$
  5. Compute three metrics:

1. Disentanglement Score

Measures: Does each code variable $c_i$ capture at most one generative factor?

Formula: $$D_i = 1 - \frac{H(P_i)}{\log K}$$

where $P_{ij} = \frac{R_{ij}}{\sum_{k} R_{ik}}$ (normalized importance), $H(P_i)$ is entropy

Interpretation:

  • $D_i = 1$: $c_i$ perfectly captures single factor (fully disentangled)
  • $D_i = 0$: $c_i$ equally important for all factors (maximally entangled)

Overall disentanglement: Weighted average across all dimensions $$D = \sum_i \rho_i \cdot D_i$$ where $\rho_i = \frac{\sum_j R_{ij}}{\sum_{k,j} R_{kj}}$ (relative importance of $c_i$)
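
A NumPy sketch of the per-dimension and overall disentanglement scores from an importance matrix $R$ (natural logs throughout; the small constant guards against $\log 0$):

```python
import numpy as np

def disentanglement_scores(R: np.ndarray):
    """DCI disentanglement from a (D, K) importance matrix R (non-negative)."""
    D_dims, K = R.shape
    P = R / R.sum(axis=1, keepdims=True)           # row-normalize importances
    H = -(P * np.log(P + 1e-12)).sum(axis=1)       # per-row entropy (nats)
    D_i = 1.0 - H / np.log(K)                      # per-dimension scores
    rho = R.sum(axis=1) / R.sum()                  # relative importance weights
    return D_i, float((rho * D_i).sum())           # per-dim scores and overall D

R = np.array([[0.8, 0.1],
              [0.1, 0.7],
              [0.1, 0.2]])
print(disentanglement_scores(R))   # D_1 ≈ 0.50 for the first code dimension
```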

2. Completeness Score

Measures: Is each generative factor $z_j$ captured by at most one code variable?

Formula: $$C_j = 1 - \frac{H(P_j)}{\log D}$$

where $P_{ij} = \frac{R_{ij}}{\sum_{k} R_{kj}}$ (column-wise normalized)

Interpretation:

  • $C_j = 1$: Single $c_i$ captures $z_j$ completely (complete)
  • $C_j = 0$: All code variables equally contribute (overcomplete)

Difference from Disentanglement:

  • Disentanglement: Row-wise (one code → one factor)
  • Completeness: Column-wise (one factor → one code)

3. Informativeness Score

Measures: How much information does $\mathbf{c}$ capture about $\mathbf{z}$?

Formula: Prediction error $$I_j = E(z_j, \hat{z}_j)$$

where $E$ is appropriate error function (MSE for continuous, accuracy for discrete)

Key point: Depends on regressor capacity

  • Linear regressor: Only captures explicitly represented information
  • Overlap with disentanglement metric (better disentanglement → easier linear prediction)

Visualizing DCI with Hinton Diagrams

Importance matrix $R$ visualized as grid:

  • Rows = code dimensions $c_i$
  • Columns = generative factors $z_j$
  • Square size = $R_{ij}$ (importance)

Ideal (disentangled):

      z1  z2  z3  z4  z5
c1    ■   ·   ·   ·   ·
c2    ·   ■   ·   ·   ·
c3    ·   ·   ■   ·   ·
c4    ·   ·   ·   ■   ·
c5    ·   ·   ·   ·   ■

Diagonal structure: one-to-one mapping

Entangled example:

      z1  z2  z3  z4  z5
c1    ▪   ▫   ▪   ·   ·
c2    ▫   ▪   ▫   ▪   ·
c3    ▪   ▫   ▪   ▫   ▫

Scattered: many-to-many relationships


18. Models Designed for Disentanglement

β-VAE (Beta-VAE)

Key idea: Increase weight on KL term to force disentanglement

Standard VAE loss: $$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,||\, p(z))$$

β-VAE loss: $$\mathcal{L}_{\beta} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \cdot D_{KL}(q(z|x) \,||\, p(z))$$

where $\beta > 1$ (typically 4-10)

Why it works (information bottleneck principle):

  • Stronger KL penalty forces $q(z|x)$ closer to $N(0,I)$
  • Creates information bottleneck: Limited capacity to encode information
  • Encoder faces pressure:
    • Reconstruction term wants to encode all information
    • KL term (weighted by β) limits how much can be encoded
  • Result: Encoder forced to be selective, prioritizes most important factors
  • To minimize loss efficiently, encoder allocates each $z_i$ to single most important factor
  • Redundant encoding (multiple $z_i$ for same factor) is penalized
  • Encourages independence between latent dimensions

Tradeoff:

  • Higher $\beta$ → better disentanglement, worse reconstruction
  • Lower $\beta$ → better reconstruction, worse disentanglement
  • Need to tune $\beta$ via hyperparameter search or visual inspection
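
A minimal PyTorch sketch of the β-VAE objective, assuming a Gaussian decoder (so the reconstruction term reduces to MSE) and a diagonal-Gaussian encoder outputting `mu`, `logvar`:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative beta-ELBO: reconstruction + beta-weighted KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="sum")    # -E_q[log p(x|z)] up to constants
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```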

Additional component:

  • Linear classifier trained on latent differences to identify target factors
  • Used for quantitative evaluation of disentanglement quality

InfoGAN

Key idea: Maximize mutual information between subset of latent variables and generated output

Standard GAN objective: $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1-D(G(z)))]$$

InfoGAN objective: Add mutual information term $$\min_{G,Q} \max_D V(D,G) - \lambda I(c; G(z, c))$$

where:

  • $z$ = incompressible noise (traditional GAN noise)
  • $c$ = latent code we want to be interpretable (categorical or continuous)
  • $I(c; G(z,c))$ = mutual information between $c$ and generated image
  • $Q$ = auxiliary network approximating $P(c|x)$ (posterior)

Mutual Information Lower Bound (tractable): $$I(c; G(z,c)) \geq \mathbb{E}_{c \sim P(c), x \sim G(z,c)}[\log Q(c|x)] + H(c)$$

This is called the variational lower bound - making mutual information optimization tractable.

Why this works conceptually:

  • Direct mutual information $I(c; G(z,c))$ is intractable to compute
  • Instead, maximize lower bound using auxiliary network $Q$
  • $Q(c|x)$ tries to predict latent code from generated image
  • If $c$ truly affects generation, $Q$ should be able to recover it
  • Forces generator to use $c$ meaningfully (information-theoretic constraint)

Q Network Loss Function:

  • Categorical $c$ (e.g., digit class): Cross-entropy loss comparing $Q(x)$ predictions to true $c$
  • Continuous $c$ (e.g., rotation): Mean squared error (MSE) or negative log-likelihood (Gaussian)
  • Training: Feed $c$ into $G$, generate $x = G(z,c)$, then train $Q$ to predict $c$ from $x$
  • Key insight: If $Q$ can accurately reconstruct $c$ from generated image, mutual information is high
  • In practice: $Q$ shares convolutional layers with discriminator $D$, adds small FC head for code prediction

Why split latent space into $z$ and $c$?:

  • $c$ (code): Structured, interpretable factors we want to control (digit class, rotation, style)
    • Gets mutual information loss → forced to be meaningful and recoverable
    • Must be distinct enough that $Q$ can guess it from image
  • $z$ (noise): Incompressible randomness for variation (background texture, lighting details)
    • No constraints → can be entangled and complex
    • Provides diversity: same $c$ can generate many different images via different $z$
  • Without $z$: Generator would be deterministic (same $c$ → identical image every time)
  • Without $c$: Generator has no incentive to learn interpretable, controllable factors

In practice:

  • Split latent input: $[z, c]$ where $c$ is structured (e.g., 10 categorical for digit class, 2 continuous for rotation/width)
  • Generator: $G(z, c)$
  • Discriminator: $D(x)$
  • Auxiliary network: $Q(c|x)$ shares parameters with $D$
  • Loss encourages: If we know $c$, we should be able to recover it from $G(z,c)$
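
A sketch of the mutual-information term in PyTorch; the auxiliary head `Q` is assumed to return categorical logits and continuous-code means for a generated batch:

```python
import torch
import torch.nn.functional as F

def info_loss(Q, x_fake, c_cat, c_cont, lam=1.0):
    """Variational MI lower bound term: train Q to recover the code from G(z, c).

    Q is assumed to return (logits over categorical c, mean of continuous c).
    """
    logits, cont_pred = Q(x_fake)
    loss_cat = F.cross_entropy(logits, c_cat)     # categorical code: cross-entropy
    loss_cont = F.mse_loss(cont_pred, c_cont)     # continuous code: Gaussian NLL up to const.
    return lam * (loss_cat + loss_cont)
```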

Results:

  • Unsupervised discovery of interpretable factors
  • MNIST: Discovers digit class, rotation, width automatically
  • 3D faces: Discovers pose, lighting, expression
  • No labels needed!

Comparison β-VAE vs InfoGAN:

| | β-VAE | InfoGAN |
|---|---|---|
| Framework | VAE-based | GAN-based |
| Supervision | Fully unsupervised | Fully unsupervised |
| Method | Stronger KL regularization | Maximize mutual information |
| Training | Stable | Can be unstable (GAN training) |
| Control | All latent dims | Specific code $c$ |
| Quality | Lower (higher β hurts reconstruction) | Higher (GAN quality) |

19. StyleGAN's Latent Spaces: Z, W, and S

The Hierarchy of Latent Spaces

Z-space (input latent space):

  • Dimension: 512
  • Distribution: $z \sim N(0, I)$
  • Problem: Must follow fixed Gaussian, leads to entanglement
  • Used as input to mapping network

W-space (intermediate latent space):

  • Dimension: 512
  • Obtained: $w = f(z)$ via mapping network (8 FC layers)
  • Benefit: Free from distribution constraint → more disentangled
  • Each layer in generator receives same $w$

S-space (StyleSpace):

  • Dimension: 9088 (for 1024×1024 generator with 18 layers)
  • Obtained: Affine transformation of $w$ at each layer
  • Formula: $s = A(w)$ where $A$ is learned layer-specific affine transform
  • Most disentangled of all three spaces

Why does disentanglement increase down the hierarchy?

  1. Z → W (Mapping Network):

    • $z$ must sample from $N(0,I)$ to match training data distribution
    • Training data has entangled factors (e.g., beard + male)
    • $z$ must be warped to avoid impossible combinations
    • Mapping network $f$ untangles this warping
  2. W → S (Layer-specific control):

    • Single $w$ → Multiple layer-specific styles $s_i$
    • Each layer controls different scale: coarse (4×4-16×16), medium (32×32-64×64), fine (128×128-1024×1024)
    • Channel-wise control allows finer-grained manipulation
    • Higher dimensionality (512 → 9088) allows more specific factors

Finding Controllable Directions in StyleSpace

Goal: Identify which of the 9088 style channels control specific attributes

Method (Wu et al., 2021):

  1. Generate images from pretrained StyleGAN2
  2. Compute gradient maps via backpropagation: $$\frac{\partial x}{\partial s_i}$$ where $x$ is generated image, $s_i$ is specific style channel
  3. Segment images into semantic regions (hair, face, background, etc.)
  4. Measure overlap between gradient maps and semantic regions
  5. Identify channels consistently active in each region → those control that region

Result: Thousands of localized, disentangled controls

  • Channel 6_364: Amount of hair
  • Channel 12_113: Hubcap style (for cars)
  • Channel 8_119: Pillow presence (for bedrooms)

Advantages over W-space:

  • More localized: Changes affect smaller regions
  • More disentangled: Attribute Dependency metric shows less interference between attributes
  • More controls: 9088 dims vs 512 dims
  • Layer-specific control: Different layers control different scales (coarse/medium/fine details)

Manipulating Real Images

Problem: StyleGAN trained on random $z$, but we want to edit real photos

Solution 1: Latent Optimization (GAN inversion)

  1. Start with real image $x_{real}$
  2. Randomly initialize latent code $w$
  3. Optimize $w$ to minimize: $||G(w) - x_{real}||^2$
  4. Iterate until $G(w) \approx x_{real}$
  5. Manipulate $w$ or $s$ to edit attributes
  6. Generate edited image: $x_{edited} = G(w + \Delta w)$
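
A minimal PyTorch sketch of this optimization loop (pixel-space loss only; practical implementations typically add a perceptual loss such as LPIPS):

```python
import torch

def invert(G, x_real, w_dim=512, lr=0.01, n_steps=500):
    """Optimize a latent code so that G(w) reconstructs a given real image."""
    w = torch.randn(1, w_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(n_steps):
        loss = ((G(w) - x_real) ** 2).mean()      # pixel reconstruction loss
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()
```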

Solution 2: Encoder-based Inversion

  1. Train encoder $E$ to predict $w$ from real images: $E: x \to w$
  2. For new image: $w = E(x_{real})$
  3. Manipulate and generate: $x_{edited} = G(w + \Delta w)$
  4. Faster than optimization (single forward pass)

Typical pipeline:

Real image → Encoder/Optimization → w or s
                                      ↓
                                 Manipulate specific channels
                                      ↓
                                 Generator → Edited image

20. Disentanglement: Exam-Style Questions

Conceptual Questions:

  1. "What is the difference between conditioning and controllability? Give examples."

    • Answer: Conditioning uses explicit labels during training to control class (cGAN: "generate cat"). Controllability manipulates latent space to adjust features without labels (StyleGAN: "make person smile more"). Conditioning = what to generate, Controllability = how features appear.
  2. "Why is the intermediate latent space W in StyleGAN more disentangled than Z?"

    • Answer: Z must follow fixed Gaussian distribution matching training data, which has entangled factors (beard + male correlation). W is free from this constraint - mapping network can untangle the warped distribution. Training encourages disentanglement because it's easier to generate from disentangled representation.
  3. "Explain why higher β in β-VAE leads to better disentanglement."

    • Answer: Higher β creates stronger information bottleneck via KL penalty. Encoder forced to be selective about what information to encode. To minimize loss, encoder allocates each dimension to most important factor, encouraging independence. Tradeoff: worse reconstruction quality.
  4. "What do the three DCI metrics measure? How are they different?"

    • Answer:
      • Disentanglement: Does each code variable control at most one factor? (row-wise in importance matrix)
      • Completeness: Is each factor controlled by at most one code variable? (column-wise)
      • Informativeness: How much information does code capture? (prediction error)
  5. "How does InfoGAN achieve disentanglement without labels?"

    • Answer: Maximizes mutual information between latent code c and generated image G(z,c). Forces generator to use c meaningfully - if we know c, should be able to recover it from output. Auxiliary network Q learns inverse mapping. Discovers interpretable factors automatically.

Comparison Questions:

  1. "Compare β-VAE and InfoGAN for learning disentangled representations."
    • See table in Section 18

Calculation/Application Questions:

  1. "Given importance matrix R (3×2), calculate disentanglement score for c₁."
    R = [[0.8, 0.1],
         [0.1, 0.7],
         [0.1, 0.2]]
    
    • Normalize row 1: P₁ = [0.8/0.9, 0.1/0.9] ≈ [0.89, 0.11]
    • Entropy (natural log): H(P₁) = -0.89 ln(0.89) - 0.11 ln(0.11) ≈ 0.35
    • D₁ = 1 - H(P₁)/ln(2) = 1 - 0.35/0.69 ≈ 0.50

Theoretical Understanding Questions:

  1. "Explain why InfoGAN uses a variational lower bound instead of directly maximizing mutual information."

    • Answer: Direct mutual information $I(c; G(z,c))$ requires computing $P(c|x)$ which is intractable (requires marginalizing over all possible $c$). Instead, use auxiliary network $Q(c|x)$ to approximate posterior and maximize variational lower bound: $I(c;G(z,c)) \geq E[\log Q(c|x)] + H(c)$. This makes optimization tractable via gradient descent while still encouraging generator to use latent code meaningfully.
  2. "In DCI framework, what does it mean for c to be a 'monomial matrix transformation' of z?"

    • Answer: Perfect disentanglement where each learned dimension $c_i$ is a scaled/permuted version of exactly one ground-truth factor $z_j$. One-to-one mapping. Example: If $z = [age, gender]$, ideal $c = [2·gender, 5·age]$ (scaled permutation). DCI metrics measure how close learned representation is to this ideal.

Practical Questions:

  1. "You want to edit a real photograph using StyleGAN. Outline the steps."
  • Answer:
    1. Invert image to latent code (optimization or encoder)
    2. Identify style channel controlling desired attribute (gradient-based or pretrained classifier)
    3. Manipulate that channel: s' = s + α·direction
    4. Generate edited image: x' = G(s')
    5. Verify change is localized and disentangled
  1. "A researcher trains β-VAE with β=1, β=4, and β=10. For each model, they compute DCI metrics. Predict the pattern of results and explain."
  • Answer:
    • β=1 (standard VAE): Low D, low C, high I. Entangled but captures lots of info. Good reconstruction.
    • β=4: Medium D, medium C, medium I. Balanced tradeoff. Some disentanglement emerging.
    • β=10: High D, high C, low I. Best disentanglement but information bottleneck too tight - loses details, poor reconstruction.
    • Pattern: As β↑, disentanglement (D,C)↑ but informativeness (I)↓. Stronger KL penalty forces selectivity but sacrifices information capacity.
  1. "How does StyleSpace (S) discovery use gradients to find controllable directions? Why does this work?"
  • Answer:
    • For each style channel $s_i$, compute gradient map $\frac{\partial x}{\partial s_i}$ showing which pixels change when $s_i$ changes
    • Segment image into semantic regions (hair, face, etc.)
    • Measure overlap between gradient maps and regions
    • Channels with high gradient overlap in specific region → control that region
    • Why it works: Backpropagation reveals causal relationship between style channel and image regions. High gradient = high sensitivity. Consistent gradients in one region = localized, disentangled control.

Key Takeaways for Exam

Disentanglement is about:

  • Independence of latent dimensions
  • One dimension → one semantic factor
  • Linear interpolation produces meaningful changes

Methods to achieve it:

  • β-VAE: Stronger KL penalty (information bottleneck)
  • InfoGAN: Maximize mutual information (enforce interpretability)
  • StyleGAN: Mapping network + hierarchical spaces (Z→W→S)

Measuring it:

  • DCI metrics: Quantitative evaluation when ground truth available
  • Visual inspection: Qualitative check of interpolations
  • Attribute dependency: How much changing one affects others

Why it matters:

  • Controllability and interpretability
  • Better generalization and editing
  • Foundation for instruction-following models

Appendix A - Common Formulas and Distributions

Gaussian (Normal) Distribution

Univariate Gaussian $N(\mu, \sigma^2)$: $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Standard Normal $N(0, 1)$: $$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

Multivariate Gaussian $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ (d-dimensional): $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Isotropic Gaussian (same variance in all dimensions: $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$): $$p(\mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{||\mathbf{x}-\boldsymbol{\mu}||^2}{2\sigma^2}\right)$$

Common Integrals

Power Functions: $$\int x^n dx = \frac{x^{n+1}}{n+1} + C \quad \text{(for } n \neq -1\text{)}$$

$$\int \frac{1}{x} dx = \ln|x| + C$$

Logarithmic: $$\int \ln(x) dx = x\ln(x) - x + C$$

$$\int \frac{\ln(x)}{x} dx = \frac{[\ln(x)]^2}{2} + C$$

Exponential: $$\int e^x dx = e^x + C$$

$$\int e^{ax} dx = \frac{e^{ax}}{a} + C$$

$$\int x e^{ax} dx = \frac{e^{ax}}{a^2}(ax - 1) + C$$

Definite Integrals (useful for normalization checks): $$\int_0^{\infty} x^n e^{-ax} dx = \frac{n!}{a^{n+1}} \quad \text{(for } n \geq 0, a > 0\text{)}$$

$$\int_0^{\infty} e^{-ax^2} dx = \frac{1}{2}\sqrt{\frac{\pi}{a}} \quad \text{(for } a > 0\text{)}$$

Trigonometric (less common in this course but useful): $$\int \sin(x) dx = -\cos(x) + C$$

$$\int \cos(x) dx = \sin(x) + C$$

Gaussian Integrals

$$\int_{-\infty}^{\infty} e^{-ax^2} dx = \sqrt{\frac{\pi}{a}}$$

$$\int_{-\infty}^{\infty} x e^{-ax^2} dx = 0 \quad \text{(odd function)}$$

$$\int_{-\infty}^{\infty} x^2 e^{-ax^2} dx = \frac{1}{2}\sqrt{\frac{\pi}{a^3}}$$

$$\int_{-\infty}^{\infty} x^n e^{-ax^2} dx = 0 \quad \text{(for odd } n\text{)}$$

Useful identities: $$\int_{-\infty}^{\infty} p(x) dx = 1 \quad \text{(normalization condition for any PDF)}$$

$$\int_a^b \frac{1}{b-a} dx = 1 \quad \text{(uniform distribution normalization)}$$

Derivative Rules (for Jacobian calculations)

Power rule: $$\frac{d}{dx}x^n = nx^{n-1}$$

Exponential: $$\frac{d}{dx}e^{ax} = ae^{ax}$$

Logarithmic: $$\frac{d}{dx}\ln(x) = \frac{1}{x}$$

$$\frac{d}{dx}\ln(f(x)) = \frac{f'(x)}{f(x)}$$

Chain rule: $$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$

Product rule: $$\frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)$$

Determinant Properties

  • $\det(AB) = \det(A)\det(B)$
  • $\det(A^{-1}) = \frac{1}{\det(A)}$
  • $\det(A^T) = \det(A)$
  • For diagonal matrix: $\det(D) = \prod_i d_{ii}$
  • For triangular matrix: $\det(T) = \prod_i t_{ii}$ (diagonal elements)

Log properties:

  • $\log(ab) = \log(a) + \log(b)$
  • $\log(a/b) = \log(a) - \log(b)$
  • $\log(a^b) = b\log(a)$
  • $\log|\det J| = \text{tr}(\log J)$ when $J$ is positive definite

Appendix B - Expectation maximization: pseudo-code, what it does

What it does: EM is an iterative algorithm used to find Maximum Likelihood Estimates (MLE) for models with latent (hidden) variables (like GMMs, where we don't know which Gaussian cluster a point belongs to). It alternates between "guessing the missing data" and "updating the model".

Pseudo-code:

# 1. Initialize parameters θ (e.g., means μ, covariances Σ, mixing weights π) randomly
theta = initialize_randomly()

repeat until convergence:
    # --- E-Step (Expectation) ---
    # "Fill in the blanks": Estimate the probability of the latent variables
    # given the current parameters.
    # Example (GMM): Calculate 'responsibility' (prob that point x_i belongs to cluster k)
    responsibilities = calculate_probabilities(data, theta)

    # --- M-Step (Maximization) ---
    # "Update the rules": Re-calculate parameters θ to maximize likelihood,
    # assuming the E-step guesses are correct.
    # Example (GMM): Update means μ based on weighted average of data points
    theta = update_parameters(data, responsibilities)

return theta

Key Application: Training Gaussian Mixture Models (GMMs)

  • GMMs approximate probability distribution as weighted sum of Gaussians
  • Each Gaussian represents a cluster in the data
  • Weights represent relative importance of each cluster
  • EM estimates cluster means, covariances, and mixing weights
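
A runnable NumPy version of the pseudo-code above for GMMs, kept deliberately bare (no covariance regularization or log-space tricks, so it can be numerically fragile on real data):

```python
import numpy as np

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture with full covariances on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                         # mixing weights
    mu = X[rng.choice(n, k, replace=False)]          # init means from data points
    sigma = np.stack([np.cov(X, rowvar=False)] * k)  # init covariances

    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = P(cluster j | x_i)
        r = np.zeros((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(sigma[j])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma[j]))
            r[:, j] = pi[j] * norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from responsibility-weighted data
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / Nk[j]
    return pi, mu, sigma
```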

Connection to MLE:

  • EM maximizes likelihood $P(X|\theta)$ when direct optimization is intractable
  • E-step: Compute expected log-likelihood w.r.t. latent variables
  • M-step: Find $\theta$ that maximizes this expected log-likelihood

Exam-style question: "Why can't we directly maximize likelihood in GMMs? How does EM solve this?"

Answer: In GMMs, we don't know which Gaussian generated each data point (latent cluster assignment). Direct maximization would require enumerating all possible assignments (exponential complexity). EM iteratively: (1) guesses cluster assignments given current parameters (E-step), (2) updates parameters assuming those assignments (M-step). This alternating optimization is tractable and converges to local maximum.

