1. Discriminative vs Generative models: what are the differences? What are the advantages and disadvantages of each?
Discriminative Models ($P(Y|X)$):
- Concept: They learn the boundary between classes. They care about "what differentiates a cat from a dog", not "what makes a dog a dog".
- Goal: Map input $X$ to label $Y$ directly.
- Examples: Logistic Regression, SVM, Neural Nets (standard classifiers).
- Advantages:
- Generally higher accuracy for classification tasks because they focus purely on the decision boundary.
- Often computationally cheaper to train and predict.
- Robust to correlated features (doesn't double-count evidence like Naive Bayes).
- Disadvantages:
- Cannot generate data (you can't ask it to "draw a cat").
- Requires labeled data (strictly supervised).
- Can be prone to overfitting noise in the boundary.
Generative Models ($P(X, Y)$, or $P(X|Y)P(Y)$):
- Concept: They learn the distribution of the data itself. They learn "what a dog looks like" and "what a cat looks like".
- Goal: Model the underlying structure of the data.
- Examples: Naive Bayes, GMMs, VAEs, GANs, Diffusion.
- Advantages:
- Can generate new samples (hallucinate new data).
- Can handle missing data and effective for semi-supervised learning.
- Models the world, not just a boundary (more robust to outliers/adversarial attacks in some contexts).
- Disadvantages:
- Computationally expensive (modelling the whole distribution is hard).
- "Double counting" evidence if features are correlated (e.g., Naive Bayes assumes independence).
- May have lower classification accuracy because they solve a harder problem (modelling density) than necessary.
Bayes Theorem: $P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}$
Where:
- $P(\theta|X)$ = Posterior: probability of parameters given data
- $P(X|\theta)$ = Likelihood: probability of data given parameters
- $P(\theta)$ = Prior: belief about parameters before seeing data
- $P(X)$ = Evidence: marginal probability of data (normalization constant)
Role in Generative Models:
- Parameter Estimation: Update beliefs about model parameters based on observed data
  - Start with prior $P(\theta)$
  - Observe data $X$
  - Update to posterior $P(\theta|X)$ using Bayes theorem
Connection to different generative models:
- VAEs: Approximate posterior $q(z|x)$ is used because the true posterior $p(z|x)$ is intractable
- GANs: Discriminator approximates a likelihood ratio
- Bayesian Neural Networks: Posterior over network weights
- Inference in generative models:
  - Given data $x$, infer latent $z$: $P(z|x) = \frac{P(x|z)P(z)}{P(x)}$
  - $P(x|z)$ = decoder (how likely is data given latent)
  - $P(z)$ = prior (usually $N(0,I)$)
  - $P(x)$ = evidence (intractable! This is why VAEs use ELBO)
Exam-style question: "How does Bayes theorem relate to the ELBO in VAEs? Why can't we compute the posterior
Answer: By Bayes theorem,
Properties:
- Bell Curve: Symmetric, defined entirely by mean ($\mu$) and variance ($\sigma^2$).
- Central Limit Theorem (CLT): The sum of many independent random variables tends toward a Gaussian distribution. This makes it a natural choice for modeling noise or aggregate real-world phenomena.
- Key insight: Many real-world phenomena can be modeled as a sum of multiple small contributions → naturally Gaussian
- Math Magic: Analytical tractability. Differentiating, integrating, and multiplying Gaussians often results in closed-form Gaussian solutions.
- Multivariate Gaussian: For multi-dimensional data, characterized by mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$
  - Covariance matrix captures correlations between features
Why used in Generative Models (e.g., VAEs, Diffusion):
- Smoothness: Forces the latent space to be continuous and densely packed (no "holes"), allowing for smooth interpolation between samples.
- Reparameterization: The "Reparameterization Trick" ($z = \mu + \sigma \odot \epsilon$) is easy with Gaussians, allowing backpropagation through stochastic nodes.
- Prior: It's the standard "blank canvas" prior ($N(0, I)$). We assume latent factors are independent and normally distributed, then the network learns to map this simple distribution to complex data.
- Natural: CLT justifies using Gaussian as default assumption for many processes.
Entropy ($H(P)$):
- A measure of uncertainty or "surprise" in a distribution.
- High entropy = Uniform distribution (maximum unpredictability).
- Low entropy = Deterministic (we know exactly what will happen).
- Formula: $H(P) = -\sum P(x) \log P(x)$.
Cross-Entropy ($H(P, Q)$):
- A measure of the average number of bits needed to encode events from true distribution $P$ using a code optimized for distribution $Q$.
- Basically: "How different is my predicted distribution $Q$ from the true distribution $P$?"
- Formula: $H(P, Q) = -\sum P(x) \log Q(x)$.
- In Deep Learning: Minimizing Cross-Entropy is equivalent to minimizing KL Divergence (since $H(P)$ is constant for training data).
From Midterm Q1: Need to know formulas, when they become undefined/infinite, and how to calculate for both discrete and continuous distributions.
Discrete: $H(p) = -\sum_x p(x) \log p(x)$
Continuous: $H(p) = -\int p(x) \log p(x)\, dx$
Properties:
- Always $\geq 0$ for discrete distributions (differential entropy can be negative, e.g., $\log(b-a) < 0$ when $b-a < 1$)
- Maximum when distribution is uniform
- Minimum (0) when distribution is deterministic (probability 1 at a single point)
- Convention: $0 \log 0 = 0$
Example Calculations:
- Deterministic (Dirac delta at $x=1$): $p(1)=1$, $p(x)=0$ elsewhere → $H(p) = -1\cdot\log(1) = 0$
- Uniform on $\{1,2,3,4\}$: $p(x)=0.25$ for all → $H(p) = -4(0.25\log_2 0.25) = 2$ bits
- Continuous Uniform on $[a,b]$: $p(x) = \frac{1}{b-a}$
  $$H(p) = -\int_a^b p(x) \log p(x)\, dx = -\int_a^b \frac{1}{b-a} \log\left(\frac{1}{b-a}\right) dx = -\log\left(\frac{1}{b-a}\right) \int_a^b \frac{1}{b-a}\, dx = -\log\left(\frac{1}{b-a}\right) \cdot 1 = \log(b-a)$$
- Gaussian $N(\mu, \sigma^2)$:
  - PDF: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  - Entropy: $H(p) = \frac{1}{2}\log(2\pi e \sigma^2)$
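A quick numpy check of these examples (a minimal sketch: base-2 logs for the discrete cases, natural log for the Gaussian):

```python
import numpy as np

# Discrete entropy in bits: H(p) = -sum p log2 p, with the 0*log(0) = 0 convention
def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy_bits([1.0, 0.0, 0.0, 0.0]))  # deterministic -> 0.0
print(entropy_bits([0.25] * 4))            # uniform on 4 values -> 2.0 bits

# Differential entropy of N(mu, sigma^2) in nats: 0.5 * log(2 pi e sigma^2)
sigma = 1.5
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))
```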
Discrete: $H(p,q) = -\sum_x p(x) \log q(x)$
Continuous: $H(p,q) = -\int p(x) \log q(x)\, dx$
When undefined:
- If $\exists x$ where $p(x) > 0$ but $q(x) = 0$ → undefined (or $+\infty$)
- Midterm mistake: Writing "infinite" or "-infinite" instead of "undefined"
Properties:
- Always $H(p,q) \geq H(p)$ (equality when $p=q$)
- NOT symmetric: $H(p,q) \neq H(q,p)$
- Commonly used loss in classification (true labels = $p$, predictions = $q$)
Formula: $D_{KL}(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
Properties:
- Always $\geq 0$ (equals 0 iff $p=q$)
- NOT symmetric: $D_{KL}(p||q) \neq D_{KL}(q||p)$
- Undefined when $\exists x: p(x)>0$ but $q(x)=0$
- Measures "information loss" when approximating $p$ with $q$
Relationship: $D_{KL}(p||q) = H(p,q) - H(p)$
Example scenario (like Midterm Q1):
- $p$: Discrete with $p(1)=1$, $p(x)=0$ elsewhere
- $q$: Uniform on $[0,2]$ → $q(x) = 0.5$ for $x \in [0,2]$

Calculations:
- $H(p) = 0$ (deterministic)
- $H(p,q) = -1 \cdot \log(0.5) = \log(2)$ (if we treat $p$ as having mass at $x=1$)
- $D_{KL}(p||q) = \log(2) - 0 = \log(2)$
- Warning: If $p$ and $q$ are of different types (discrete vs continuous), need to be careful about whether the comparison is valid
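A minimal sketch of these quantities for discrete distributions, including the undefined case:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum p log q; undefined if p > 0 where q = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        raise ValueError("undefined: p has mass where q is zero")
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def kl(p, q):
    """D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - cross_entropy(p, p)

p = [1.0, 0.0]      # deterministic at the first outcome
q = [0.5, 0.5]      # uniform
print(kl(p, q))     # log(2) ~ 0.693 nats, matching the example above
```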
Definition: Amount of information (surprise) from observing an event with probability $p(x)$: $I(x) = -\log p(x)$
Properties:
- Rare events (low $p$) → High information (surprising)
- Common events (high $p$) → Low information (unsurprising)
- Event with $p=1$ → Zero information (no surprise)
- Additive: For independent events, $I(x,y) = I(x) + I(y)$
Units: Depends on logarithm base
- Base 2 → bits
- Base $e$ → nats
Definition: Amount of information shared between two variables: $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$
Alternative formulation: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
Interpretation:
- Measures reduction in uncertainty about $X$ after observing $Y$
- If $X$ and $Y$ independent → $I(X;Y) = 0$
- Symmetric: $I(X;Y) = I(Y;X)$
In generative models:
- InfoGAN: Maximize $I(c; G(z,c))$ to encourage latent code $c$ to be meaningful
- Disentanglement: High MI between latent dimensions and semantic factors
- Regularization: Encourage mutual information between latent and data
Definition: Remaining uncertainty in $Y$ after observing $X$
Discrete: $H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x)$
Properties:
- $H(Y|X) \leq H(Y)$ (observing $X$ cannot increase uncertainty)
- Equality when $X$ and $Y$ are independent
- $H(Y|X) = 0$ when $Y$ is a deterministic function of $X$
Relationship to mutual information: $I(X;Y) = H(Y) - H(Y|X)$
Exam-style question: "What does high mutual information between latent code
Answer: High
5. Different distribution distances: KL, JS, W1. What are they? How are they calculated? Which is better for what?
| Metric | Name | Calculation / Concept | Best For / Properties |
|---|---|---|---|
| KL | Kullback-Leibler Divergence | $D_{KL}(P\|Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$. Expected log-ratio. | Asymmetric ($D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$). Used in VAEs (regularization). Measures "information loss". Fails if distributions don't overlap (div by zero). |
| JS | Jensen-Shannon Divergence | Symmetrized KL: $D_{JS}(P\|Q) = \frac{1}{2}D_{KL}(P\|M) + \frac{1}{2}D_{KL}(Q\|M)$ where $M = \frac{1}{2}(P+Q)$ | Symmetric & bounded ($0 \leq D_{JS} \leq \log 2$). Used in original GANs. Better stability than KL, but can still suffer from vanishing gradients if supports are disjoint. |
| W1 | Wasserstein-1 (Earth Mover's) | "Minimum cost to move pile $P$ to match pile $Q$". 1D: area between CDFs, $W_1(P,Q) = \int \lvert F_P(x) - F_Q(x) \rvert\, dx$ | Geometric / disjoint support. Used in WGANs. Works even when distributions don't overlap (gradients don't vanish). Sensitive to magnitude of difference (location shifts), not just probability overlap. |
Summary: Use KL for compression/VAEs. Use W1 for GANs/geometric stability (prevents mode collapse, stable gradients). Use JS as a stable baseline comparison.
Definition (Total Variation distance): Half the L1 norm between probability mass functions: $d_{TV}(P,Q) = \frac{1}{2}\sum_x |P(x) - Q(x)|$
For continuous distributions: $d_{TV}(P,Q) = \frac{1}{2}\int |p(x) - q(x)|\, dx$
Properties:
- Symmetric: $d_{TV}(P,Q) = d_{TV}(Q,P)$
- Bounded: $d_{TV} \in [0, 1]$
- Equals 0 iff $P = Q$
- Equals 1 iff supports are disjoint
Interpretation: Maximum difference between the probabilities the two distributions assign to any event: $d_{TV}(P,Q) = \sup_A |P(A) - Q(A)|$
Definition: Square root of half the sum of squared differences between square roots of densities: $d_H(P,Q) = \sqrt{\frac{1}{2}\sum_x \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2}$
For continuous: $d_H(P,Q) = \sqrt{\frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx}$
Properties:
- Symmetric: $d_H(P,Q) = d_H(Q,P)$
- Bounded: $d_H \in [0, 1]$
- Less sensitive to outliers than KL divergence
- Computationally efficient
- $d_H$ itself is a true metric; it is the squared Hellinger distance $d_H^2$ that does not satisfy the triangle inequality
Comparison with KL:
- Hellinger is symmetric, KL is not
- Hellinger bounded, KL can be infinite
- Hellinger defined even when supports don't overlap, KL can be undefined
- KL more sensitive to tail behavior
Exam-style question: "Compare KL divergence, Total Variation distance, and Hellinger distance. When would you prefer each?"
Answer:
- KL: Asymmetric, unbounded, undefined for disjoint supports. Good for VAE regularization (emphasizes tail behavior).
- TV: Symmetric, bounded [0,1], measures maximum probability difference. Simple interpretation.
- Hellinger: Symmetric, bounded, less sensitive to outliers, works with disjoint supports. Good for robust comparison ($d_H$ is a metric; its square is not).
What is a VAE (Variational Autoencoder)?
It's a generative model that learns a continuous, probabilistic latent space. Unlike a standard Autoencoder (which maps input to a fixed vector), a VAE maps input to a distribution (mean $\mu$ and variance $\sigma^2$) over the latent space.
Encoder Network
- Input: Data $x$ (e.g., 28×28 image)
- Output: Two vectors (not one!):
  - Mean vector $\mu$ (e.g., 20-dim)
  - Log-variance vector $\log \sigma^2$ (e.g., 20-dim)
- These define a diagonal Gaussian distribution over latent space
- Parameterized by neural network weights $\phi$
Decoder Network
- Input: Latent vector $z$ (e.g., 20-dim)
- Output: Reconstructed data $\hat{x}$ (e.g., 28×28 image)
- Parameterized by neural network weights $\theta$
Prior Distribution
- Standard normal: $p(z) = N(0, I)$
- Why standard normal?
  - Simple "blank canvas" that's easy to sample from
  - Forces structure: without it, encoder could map each input to an arbitrary location
  - Enables generation: just sample $z \sim N(0,I)$ and decode
- Encoder: Given input $x$, neural network outputs $\mu$ and $\log \sigma^2$
- Reparameterization Trick: Sample latent $z$ using:
  $$z = \mu + \sigma \odot \epsilon, \quad \text{where } \epsilon \sim N(0, I)$$
  - Why this trick? Sampling is non-differentiable, but this formulation is!
  - Randomness moved to $\epsilon$ (independent of parameters)
  - Gradients can flow through $\mu$ and $\sigma$
  - Essential for backpropagation through stochastic nodes
- Decoder: Reconstruct $\hat{x}$ from sampled $z$
- Loss Computation: Calculate ELBO (see below)
The Problem: We want to maximize the likelihood $p(x) = \int p(x|z)p(z)\, dz$, which is intractable.
Why intractable?
- Must integrate over all possible latent variables $z$
- For high-dimensional $z$ (e.g., 100+ dims) and a complex decoder, no closed form
- Direct computation impossible

The Solution: Variational Inference
- Introduce approximate posterior $q(z|x)$ (encoder)
- Instead of computing the true posterior $p(z|x) = \frac{p(x|z)p(z)}{p(x)}$ (requires intractable $p(x)$)
- Learn $q(z|x)$ to approximate $p(z|x)$
Deriving ELBO:
Starting from the log-likelihood: $\log p(x) = \log \int p(x|z)p(z)\, dz$
Introduce $q(z|x)$ and apply Jensen's inequality:
$$\log p(x) \geq \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] = \text{ELBO}$$
Expanding the ELBO:
$$\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$
Key Relationship:
$$\log p(x) = \text{ELBO} + D_{KL}(q(z|x) \| p(z|x))$$
Since $D_{KL}(q(z|x) \| p(z|x)) \geq 0$, the ELBO is a lower bound on $\log p(x)$. Maximizing the ELBO:
- Tightens the bound (minimizes approximation gap)
- Indirectly maximizes $\log p(x)$
In plain English: $$ \text{Maximize ELBO} = \text{Minimize}[\text{Reconstruction Error} + \text{KL Divergence}] $$
Term 1: Reconstruction Loss (Data Fidelity)
- Measures: How well can decoder reconstruct input from latent code?
- $\mathbb{E}_{q(z|x)}[\log p(x|z)]$: Expected log-likelihood over sampled $z$
- In practice (continuous data): $-||x - \hat{x}||^2$ (MSE)
- In practice (binary data): Binary cross-entropy
- Goal: Make output look like input
Term 2: KL Regularization (Latent Space Structure)
- Measures: How different is $q(z|x)$ from the prior $p(z)$?
- $D_{KL}(q(z|x) \| p(z))$: KL divergence between learned posterior and $N(0,I)$
- For Gaussian posterior and prior, closed form:
  $$D_{KL}(q(z|x) \| p(z)) = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1\right)$$
- Goal: Keep latent space "organized" and continuous
- Prevents cheating: Without this, encoder could map each input to an isolated point → no interpolation possible
Optimization: Maximize ELBO w.r.t. both encoder parameters $\phi$ and decoder parameters $\theta$.
In practice:
- Sample minibatch of data $\{x_1, ..., x_m\}$
- For each $x_i$:
  - Encode: Compute $\mu_i, \sigma_i = \text{Encoder}(x_i)$
  - Sample: $z_i = \mu_i + \sigma_i \odot \epsilon_i$ where $\epsilon_i \sim N(0,I)$
  - Decode: $\hat{x}_i = \text{Decoder}(z_i)$
  - Compute loss: $L_i = ||x_i - \hat{x}_i||^2 + D_{KL}(q(z|x_i) \| p(z))$
- Backpropagate and update $\phi, \theta$
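A minimal PyTorch sketch of this loop, assuming an MLP encoder/decoder with illustrative layer sizes (not the lecture's exact architecture):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, z_dim)
        self.logvar_head = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                 # reparameterization trick
        z = mu + torch.exp(0.5 * logvar) * eps     # z = mu + sigma * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians (see formula above)
    kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1)
    return recon + kl                              # negative ELBO

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                            # stand-in minibatch
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
```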
Scenario: Train autoencoder with only reconstruction loss (no KL term)
Consequences:
- Latent space becomes arbitrarily shaped (not normally distributed)
- Encoder maps each input to isolated points (no continuity)
- Cannot generate by sampling $z \sim N(0,I)$ → decoder never saw such $z$ during training
- Becomes deterministic autoencoder (no probabilistic interpretation)
- Lacks continuity: Gaps between encoded points → undefined behavior when sampling there
- Lacks completeness: Most of latent space never visited during training
Alternative Sampling Strategies (without proper latent structure):
- Mean of k encoded samples:
  - Pick k training samples, encode to get $z_1, ..., z_k$
  - Average: $z_{new} = \frac{1}{k}\sum z_i$
  - Decode: $x_{new} = \text{Decoder}(z_{new})$
- Interpolation between encoded samples:
  - Encode two samples: $z_1 = \text{Encoder}(x_1), z_2 = \text{Encoder}(x_2)$
  - Interpolate: $z = \alpha z_1 + (1-\alpha)z_2$
  - Decode: $x = \text{Decoder}(z)$
  - Risk: If latent space has "holes", interpolation path may pass through undefined regions
- Cluster-based sampling:
  - Encode all training data
  - Cluster in latent space (k-means)
  - Sample from cluster centroids
Tradeoff:
- Without KL: Better reconstruction (encoder free to use latent space optimally) but no generative capability
- With KL: Slightly worse reconstruction but true generative model (can sample novel data)
The fundamental insight: Regular autoencoders optimize for reconstruction only. VAEs optimize for reconstruction + generative capability via KL regularization.
Q1: "Why can't we directly maximize log p(x) in VAEs? What makes it intractable?"
Answer: Computing $p(x) = \int p(x|z)p(z)\,dz$ requires integrating over a high-dimensional latent space with a neural-network decoder inside the integral; no closed form exists and numerical integration is infeasible. We therefore maximize the ELBO, a tractable lower bound, instead.
Q2: "Explain the reparameterization trick. Why is it necessary?"
Answer: Sampling $z \sim q(z|x)$ directly is a non-differentiable operation, so gradients cannot flow back to the encoder. Rewriting the sample as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim N(0,I)$ moves the randomness into $\epsilon$, which is independent of the parameters, so gradients flow through $\mu$ and $\sigma$. It is necessary for backpropagation through the stochastic latent node.
Q3: "What is the relationship between ELBO and log p(x)?"
Answer: $\log p(x) = \text{ELBO} + D_{KL}(q(z|x) \| p(z|x)) \geq \text{ELBO}$, since the KL term is non-negative. Maximizing the ELBO therefore pushes up a lower bound on $\log p(x)$ and tightens the approximation gap.
Q4: "Derive the KL divergence for VAE (Gaussian posterior, Gaussian prior)."
Answer: Given $q(z|x) = N(\mu, \text{diag}(\sigma^2))$ and $p(z) = N(0,I)$, the KL divergence has the closed form $D_{KL}(q \| p) = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1\right)$.
This is the closed-form expression used in practice (no numerical integration needed).
Q5: "What happens if you remove the KL term from VAE loss? Can you still generate new samples?"
Answer: Without KL regularization, the model becomes a deterministic autoencoder. The latent space is not structured (no prior enforcement), so you cannot generate by sampling $z \sim N(0,I)$: the decoder never saw such latents during training. Generation can only be approximated by encoding training samples and averaging, interpolating, or clustering their latent codes (see above).
Q6: "Why do VAEs produce blurrier images than GANs? Explain in terms of the loss function."
Answer: VAE loss uses pixel-wise reconstruction (MSE or BCE), which penalizes any deviation from training examples. To minimize loss, the VAE averages over all plausible reconstructions, producing blurry outputs: the MSE loss $||x - \hat{x}||^2$ is minimized by the mean of the plausible outputs. GANs instead optimize "does this fool the discriminator", which rewards sharp, plausible samples rather than averages.
Q7: "What is amortized inference and how do VAEs implement it?"
Answer: Amortized inference means learning a single inference network (encoder) that works for all data points, instead of optimizing a latent code separately for each sample. The VAE encoder $q_\phi(z|x)$ amortizes: one forward pass yields the posterior parameters $(\mu, \sigma)$ for any input $x$, with no per-sample optimization.
The lecture emphasizes two key properties VAEs achieve:
1. Continuity: Points close in latent space → similar decoded outputs
- Gradual changes when interpolating between latent codes
- Example: Interpolating between "1" and "2" shows smooth digit morphing
- Achieved by: KL regularization forcing smooth, continuous distributions
2. Completeness: Sampling anywhere in latent space → meaningful output
- No "holes" or undefined regions in latent space
- Any random sample $z \sim N(0,I)$ decodes to valid output
- Achieved by: Prior $p(z) = N(0,I)$ ensures all regions are used during training
Why both matter for generation:
- Continuity alone: Could have isolated islands (good locally, but gaps between)
- Completeness alone: Could have abrupt transitions (coverage but not smooth)
- Together: Enable both diverse sampling AND smooth interpolation
Exam-style question: "A VAE's latent space has continuity but not completeness. What would you observe? How would you fix it?"
Answer: You'd observe: smooth interpolations between training samples work well, but random sampling from $N(0,I)$ often lands in unused regions ("holes") and decodes to meaningless outputs. Fix: strengthen the KL regularization toward the prior $N(0,I)$ so the aggregate posterior covers the whole latent space, restoring completeness.
Sampling:
- Definition: Generating random variables/data points from a given distribution
- In generative modeling: Generate new samples from learned distribution
- Two approaches:
  - Analytical (GMMs): Know functional form of distribution, can sample directly
  - Implicit (GANs): Don't know analytical form, learn mapping from noise $z$ to samples
Key concept: Random variable $z$ as the source of randomness
- Often $z \sim N(0,I)$ (simple distribution)
- Generator learns complex mapping: $G(z) \to x$
- Different $z$ → different samples
Inference:
- Definition: The reverse of sampling - given data, estimate what model/parameters generated it
- Uses observed data to update beliefs about model parameters (Bayes!)
- In discriminative models (AlexNet): Forward pass (input → prediction)
- In generative models: Reverse process (data → latent code/parameters)
- Integral part of training generative models
Exam-style question: "Explain the difference between sampling and inference in generative models. Why is inference more complex than a simple forward pass?"
Answer: Sampling generates new data from the learned distribution (forward: $z \to x$). Inference is the reverse: given observed data, estimate the latent code or parameters that generated it ($x \to z$). Inference is more complex than a forward pass because it inverts the generative process via Bayes' rule, and the posterior involves the intractable evidence $p(x)$, so it must be approximated (e.g., with an encoder $q(z|x)$).
Likelihood $P(X|\theta)$:
- Definition: Probability of observing data $X$ given model parameters $\theta$
- Measures "how well does the model explain the data?"
- NOT the same as probability: Likelihood is a function of $\theta$ for fixed $X$
Maximum Likelihood Estimation:
- Goal: Find parameters $\theta^*$ that maximize $P(X|\theta)$
  $$\theta^* = \arg\max_\theta P(X|\theta)$$
Log-Likelihood:
- In practice, maximize $\log P(X|\theta)$ instead (easier math, same result)
- Converts products to sums: $\log(p_1 \cdot p_2) = \log p_1 + \log p_2$
- Used as loss function in deep generative models
Why maximize likelihood?
- Want model that assigns high probability to observed data
- Equivalent to minimizing KL divergence between data distribution and model
Challenges in deep generative models:
- Often intractable to compute directly
- VAEs: Optimize lower bound (ELBO) instead
- GANs: Use game-theoretic objective (implicit likelihood)
- Normalizing Flows: Can compute exact likelihood via change of variables
Exam-style question: "Why do VAEs optimize ELBO instead of likelihood directly? What makes likelihood intractable?"
Answer: Computing $p(x) = \int p(x|z)p(z)\,dz$ requires integrating over all latent configurations; with high-dimensional $z$ and a neural-network decoder there is no closed form. The ELBO is a tractable lower bound ($\log p(x) \geq \text{ELBO}$) that can be estimated with Monte Carlo samples from $q(z|x)$ and optimized by gradient descent.
CDF: $F(x) = P(X \leq x)$
- Gives probability that $X$ takes value at most $x$
- Always non-decreasing
- Ranges from 0 to 1
PDF: $p(x)$
- Describes probability density at point $x$
- Can exceed 1 (it's a density, not a probability!)
- Derivative of CDF: $p(x) = \frac{dF(x)}{dx}$
Key difference:
- CDF: "Probability of being ≤ x"
- PDF: "Density of probability around x" (needs integration to get probability)
For continuous: $P(a \leq X \leq b) = \int_a^b p(x)\, dx = F(b) - F(a)$
Definition: Number of peaks (modes) in a distribution
Unimodal: Single peak
- Example: Standard Gaussian $N(0,1)$
- Most data concentrated around one value
Multimodal: Multiple peaks
- Example: Gaussian Mixture Model with 3 components
- Data has multiple "clusters" or preferred values
Why it matters for generative models:
- Real data often multimodal (e.g., different classes)
- Mode collapse in GANs: Generator only learns some modes, ignores others
- Good generative model should capture all modes
Exam-style question: "What is the relationship between multimodal distributions and mode collapse in GANs?"
Answer: Real data distributions are often multimodal (e.g., different face types, multiple object classes). Mode collapse occurs when GAN generator learns to produce only a subset of modes (e.g., only certain face types), failing to capture full diversity of data distribution. This is a failure to learn the complete multimodal structure.
What are Normalizing Flows?
A generative model using an invertible transformation $f$ that maps a simple prior distribution $p_z(z)$ (usually $N(0,I)$) to a complex data distribution $p_x(x)$.
Key idea: Unlike VAE with separate encoder/decoder, flows use a single invertible function:
- Forward: $z = f(x)$ (encoding/likelihood evaluation)
- Inverse: $x = f^{-1}(z)$ (generation/sampling)
Why Normalizing Flows were developed:
- Exact likelihood: Unlike VAEs (approximate via ELBO) or GANs (no explicit likelihood), flows compute exact $p(x)$
$p(x)$ - Bidirectional: Same function for encoding and decoding (perfect inverse)
- Latent space by design: Guaranteed to match chosen prior (no regularization needed like VAE's KL term)
- Tractable training: Direct maximum likelihood optimization
The Foundation (from Midterm Q3): the change of variables formula
$$p_x(x) = p_z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|$$
where $J_f = \frac{\partial f(x)}{\partial x}$ is the Jacobian of the transformation.
Intuitive Explanation:
- When you transform variables, probability mass must be conserved
- Total probability before transformation = Total probability after transformation = 1
- If transformation stretches a region → probability density gets compressed (spread thinner)
- If transformation compresses a region → probability density gets concentrated (packed denser)
- The Jacobian determinant $|\det J_f|$ measures exactly how much volume changes
Concrete Example:
- 1D transformation $x = 2z$ (stretches space by factor 2)
- Small interval $[z, z+dz]$ becomes $[x, x+2dz]$ (twice as wide)
- Probability mass stays the same, but spread over a 2× wider region
- Therefore density must be halved: $p_x(x) = p_z(z) \cdot \frac{1}{2}$
- Jacobian determinant = 2, so the formula gives: $p_x(x) = p_z(z) / |2|$ ✓
Why determinant preserves probability:
- For transformation $x = f(z)$, the volume element changes: $dx = |\det J_f| \cdot dz$
- Probability in a small region: $p_x(x)dx = p_z(z)dz$ (must be equal)
- Solving: $p_x(x) = p_z(z) / |\det J_f|$ (or equivalently, with $z = f^{-1}(x)$: $p_x(x) = p_z(f^{-1}(x)) \,|\det J_{f^{-1}}|$)
- This ensures $\int p_x(x)dx = \int p_z(z)dz = 1$ (normalization preserved)
Example Calculation (like Midterm Q3):
- Start with $z \sim N(0,1)$ (simple prior)
- Apply transformation $x = f(z) = az + b$ (affine transformation)
- Jacobian: $\frac{\partial f}{\partial z} = a$
- Determinant: $|\det J_f| = |a|$
- Final distribution: $p_x(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-b)^2}{2a^2}} \cdot \frac{1}{|a|} = N(b, a^2)$
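A quick numpy sanity check of this example (a sketch comparing the change-of-variables density against an empirical histogram of transformed samples):

```python
import numpy as np

a, b = 2.0, 1.0
z = np.random.randn(100_000)           # z ~ N(0, 1)
x = a * z + b                          # x = f(z) = a z + b

# Change of variables: p_x(x) = p_z((x - b) / a) / |a|  ->  N(b, a^2)
xs = np.linspace(-4.0, 6.0, 5)
p_z = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
p_x = p_z((xs - b) / a) / abs(a)

# Compare against the empirical density from the samples
hist, edges = np.histogram(x, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
emp = np.interp(xs, centers, hist)
print(np.round(p_x, 4), np.round(emp, 4))   # should roughly agree
```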
Three Essential Properties:
- Invertible: Must have a unique inverse $f^{-1}$
  - Bijection: one-to-one and onto
  - Every $x$ maps to exactly one $z$, and vice versa
  - Needed for both sampling ($z \to x$) and likelihood ($x \to z$)
- Differentiable: Need to compute the Jacobian
  - Required for the change of variables formula
  - Enables gradient-based training
- Efficient: Computing $\det J_f$ should be fast
  - Full matrix determinant: $O(d^3)$ - too slow!
  - Smart architectures make this $O(d)$ or $O(d^2)$
Computational Complexity:
| Architecture | Inverse | Determinant | Example |
|---|---|---|---|
| Full matrix | $O(d^3)$ | $O(d^3)$ | General linear layer (impractical) |
| Diagonal | $O(d)$ | $O(d)$ | Element-wise scaling |
| Triangular | $O(d^2)$ | $O(d)$ | Autoregressive flows |
| Block diagonal | Per-block cost | Product of block determinants | Multi-scale architectures |
| Coupling flows | $O(d)$ | $O(d)$ | RealNVP, Glow |
The Challenge: Design expressive flows with tractable determinants!
Key Innovation: Make Jacobian triangular by design
How it works:
- Split input $x$ into two parts: $x = [x_1, x_2]$
- Transform only one part, conditioned on the other:
  - $y_1 = x_1$ (unchanged - identity transformation)
  - $y_2 = x_2 \odot \exp(s(x_1)) + t(x_1)$ (affine coupling)
- where $s$ and $t$ are arbitrary neural networks (scale and translation)
Why this is brilliant:
1. Triangular Jacobian:
$$J_f = \begin{bmatrix} I & 0 \\ \frac{\partial y_2}{\partial x_1} & \text{diag}(\exp(s(x_1))) \end{bmatrix}$$
- Upper-right block is zero (because $y_1$ doesn't depend on $x_2$)
- Determinant of a triangular matrix = product of diagonal elements
- $\det J_f = \prod_i \exp(s_i(x_1))$ - an $O(d)$ computation!
2. Easy Inversion:
- Reverse the operations: $x_2 = (y_2 - t(x_1)) \odot \exp(-s(x_1))$
- $x_1 = y_1$ (unchanged)
- Same computational cost as the forward pass
- No need to invert the neural networks $s$ or $t$!
3. Expressive Power:
- $s$ and $t$ can be arbitrarily complex neural networks
- Not limited to simple functions
- But their complexity doesn't affect determinant computation!
Limitations:
- Half of the dimensions pass through unchanged ($x_1 = y_1$)
- A single coupling layer is weak
- Solution: Stack multiple layers with alternating partition (swap which half is transformed); see the sketch below
Building Deep Flows:
Multiple transformations compose: $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$
Change of variables for the composition:
$$p_x(x) = p_z(f(x)) \prod_{k=1}^{K} \left| \det J_{f_k} \right|$$
where $J_{f_k}$ is the Jacobian of the $k$-th transformation (evaluated at its own input).
In log-space (used for training):
$$\log p_x(x) = \log p_z(f(x)) + \sum_{k=1}^{K} \log \left| \det J_{f_k} \right|$$
Why stack flows?:
- Expressiveness: Single coupling layer is limited, composition becomes arbitrarily complex
- Permutation: Alternate which dimensions are transformed
  - Layer 1: transform $x_2$ conditioned on $x_1$
  - Layer 2: transform $x_1$ conditioned on $x_2$ (swap!)
  - Ensures all dimensions get transformed
- Multi-scale: Can split off dimensions at different depths (like in Glow)
Example RealNVP architecture:
Input x (e.g., 28×28 image = 784 dims)
↓
Coupling layer 1: transform dims [392:784] | dims [0:392] unchanged
↓
Permutation (or 1×1 conv): shuffle dimensions
↓
Coupling layer 2: transform dims [392:784] | dims [0:392] unchanged
↓
... (repeat K times)
↓
Output z ~ N(0, I)
Objective: Maximize the log-likelihood of the training data, $\max_\theta \sum_i \log p_\theta(x_i)$
Using change of variables:
$$\log p_\theta(x) = \log p_z(f_\theta(x)) + \log \left| \det J_{f_\theta}(x) \right|$$
In practice:
- Forward pass: $z = f_\theta(x)$ (transform data to latent)
- Evaluate prior: $\log p_z(z)$ (usually Gaussian, easy!)
- Compute log-determinant: $\log |\det J_f|$ (designed to be efficient)
- Loss = $-(\log p_z(z) + \log |\det J_f|)$
- Backpropagate through the entire flow
Sampling (generation):
- Sample $z \sim p_z(z)$ (e.g., $N(0,I)$)
- Inverse transform: $x = f^{-1}(z)$
- Return $x$ (generated sample)
Key advantage: Exact likelihood - no approximation like VAE's ELBO!
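A toy sketch of this training loop with a single diagonal affine flow (chosen so the block stays self-contained; real models stack coupling layers):

```python
import math
import torch
import torch.nn as nn

class DiagonalAffineFlow(nn.Module):
    """Toy flow z = (x - b) * exp(-s): diagonal Jacobian, O(d) log-det."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))   # log-scale
        self.b = nn.Parameter(torch.zeros(dim))   # shift

    def forward(self, x):
        z = (x - self.b) * torch.exp(-self.s)
        log_det = -self.s.sum()                   # log|det dz/dx|
        return z, log_det

def nll(flow, x):
    z, log_det = flow(x)
    # log p(x) = log N(z; 0, I) + log|det dz/dx|
    log_pz = -0.5 * (z**2 + math.log(2 * math.pi)).sum(dim=-1)
    return -(log_pz + log_det).mean()

flow = DiagonalAffineFlow(dim=2)
opt = torch.optim.Adam(flow.parameters(), lr=0.05)
data = torch.randn(512, 2) * 3.0 + 5.0            # target: N(5, 9) per dimension
for _ in range(300):
    loss = nll(flow, data)
    opt.zero_grad(); loss.backward(); opt.step()
print(flow.b.data, flow.s.exp().data)             # should approach mean 5, scale 3
```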
Alternative to coupling flows: Transform dimensions sequentially
Masked Autoregressive Flow (MAF): each dimension is transformed conditioned on all previous dimensions:
$$x_i = z_i \cdot \exp(s_i(x_{1:i-1})) + t_i(x_{1:i-1})$$
Properties:
- Each dimension depends on all previous dimensions
- Jacobian is triangular (autoregressive structure)
- Determinant: $\det J = \prod_i \exp(s_i)$ - still $O(d)$!
MAF vs Coupling Flows:
| | MAF | RealNVP (Coupling) |
|---|---|---|
| Forward (sampling) | Sequential | Parallel |
| Inverse (likelihood) | Parallel | Parallel (coupling inverts in closed form) |
| Expressiveness | More expressive per layer | Needs more layers |
| Use case | Density estimation | Fast sampling |
Inverse Autoregressive Flow (IAF):
- Swap the roles of $x$ and $z$ in MAF
- Fast sampling, slow likelihood
- Used to build flexible VAE posteriors (IAF posteriors)
Normalizing Flow - Exact Likelihood:
- We have a bijection (one-to-one invertible mapping): $z = f(x)$ and $x = f^{-1}(z)$
- Can directly apply the change of variables formula:
  $$p(x) = p_z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|$$
- All components are tractable:
  - $p_z(z)$ is known (we chose it, usually $N(0,I)$)
  - $f(x)$ is computed by a forward pass through the network
  - $\det J_f$ is designed to be efficiently computable (coupling flows, triangular Jacobians)
- No integration needed! Just plug in values and calculate.
VAE - Intractable Likelihood:
- We want $p(x)$ but it requires integrating over all possible latent variables:
  $$p(x) = \int p(x|z)p(z)\, dz$$
- This integral is intractable because:
  - The latent space is high-dimensional (e.g., 100+ dimensions)
  - $p(x|z)$ is a complex neural network (decoder)
  - No closed-form solution exists
- Encoder and decoder are separate networks:
  - Encoder: $q(z|x)$ approximates the posterior
  - Decoder: $p(x|z)$ generates data
  - They are not exact inverses of each other
  - The approximation gap is captured by $D_{KL}(q(z|x) \| p(z|x))$
- Solution: Use the ELBO as a tractable lower bound instead of computing $p(x)$ directly
Key Insight:
- NF trades flexibility for exactness: Must use invertible architectures, but get exact likelihood
- VAE trades exactness for flexibility: Can use any encoder/decoder architecture, but only get approximate likelihood
Q1: "Explain the change of variables formula intuitively. Why is the Jacobian determinant needed?"
Answer: When transforming variables, probability mass must be conserved. If the transformation stretches a region by factor $k$, the density there must shrink by $1/k$ so that total probability stays 1. The Jacobian determinant $|\det J_f|$ measures exactly this local volume change, so scaling by it keeps the transformed density normalized.
Q2: "Why do coupling flows have triangular Jacobians? Why does this matter?"
Answer: Coupling flows split the input $x = [x_1, x_2]$, leave $x_1$ unchanged, and transform $x_2$ conditioned on $x_1$. Since $y_1$ does not depend on $x_2$, the upper-right Jacobian block is zero, making $J_f$ triangular. This matters because the determinant of a triangular matrix is the product of its diagonal entries: $\det J_f = \prod_i \exp(s_i(x_1))$ costs $O(d)$ instead of $O(d^3)$.
Q3: "Why stack multiple coupling layers? What problem does alternating the partition solve?"
Answer: A single coupling layer only transforms half of the dimensions ($x_1$ passes through unchanged), so it is weakly expressive. Stacking layers composes into arbitrarily complex transformations, and alternating the partition between layers ensures every dimension gets transformed at some depth; otherwise half the input would never be modified.
Q4: "Compare MAF and RealNVP. When would you use each?"
Answer:
- MAF: Sequential sampling $O(d)$, parallel likelihood → good for density estimation tasks
- RealNVP: Parallel sampling and parallel likelihood (coupling layers invert in closed form) → good for fast generation
- Both have triangular Jacobians with $O(d)$ determinant computation
- MAF more expressive per layer but slower sampling
Q5: "Why can normalizing flows compute exact likelihood while VAEs cannot? Explain in terms of the mathematical operations required."
Answer: NFs use a single invertible function with tractable Jacobian determinant, allowing direct application of change of variables formula without integration. VAEs have separate encoder/decoder networks and require integrating over the intractable posterior
Q6: "Given transformation
Answer:
- Jacobian: $\frac{\partial x}{\partial z} = 3$
- Determinant: $|\det J| = 3$
- Change of variables: $p_x(x) = p_z(z) / |3| = \frac{1}{3\sqrt{2\pi}} \exp(-\frac{z^2}{2})$
- Substitute $z = (x-5)/3$: $p_x(x) = \frac{1}{3\sqrt{2\pi}} \exp(-\frac{(x-5)^2}{18})$
- Therefore: $x \sim N(5, 9)$ (mean = 5, variance = 9)
Q7: "Given a 2D transformation
Answer:
- Transformation: $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 2 \end{bmatrix}$
- This gives: $x_1 = 2z_1 + z_2 + 1$ and $x_2 = 3z_2 + 2$
- Jacobian matrix: $J = \frac{\partial \mathbf{x}}{\partial \mathbf{z}} = \begin{bmatrix} \frac{\partial x_1}{\partial z_1} & \frac{\partial x_1}{\partial z_2} \\ \frac{\partial x_2}{\partial z_1} & \frac{\partial x_2}{\partial z_2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}$
- Determinant calculation: $\det J = (2)(3) - (1)(0) = 6$
- Since $J = A$ (linear transformation), we have $|\det J| = 6$
- Change of variables: $p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{z}}(\mathbf{z}) / |\det J| = \frac{1}{6} \cdot \frac{1}{2\pi} \exp(-\frac{1}{2}\mathbf{z}^T \mathbf{z})$
- Substitute $\mathbf{z} = A^{-1}(\mathbf{x} - \mathbf{b})$, where $A^{-1} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{6} \\ 0 & \frac{1}{3} \end{bmatrix}$, giving $\mathbf{z} = \begin{bmatrix} \frac{1}{2}(x_1 - 1) - \frac{1}{6}(x_2 - 2) \\ \frac{1}{3}(x_2 - 2) \end{bmatrix}$
- Therefore: $\mathbf{x} \sim N(\mathbf{b}, AA^T) = N\left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 5 & 3 \\ 3 & 9 \end{bmatrix}\right)$
Q8: "What is the role of the base measure/prior
Answer: The base measure is the simple distribution we start from (typically $N(0,I)$). Flow learns transformation
Q9: "Explain how log-determinants are computed and summed when stacking K flow transformations."
Answer: For a composed flow $f = f_K \circ \cdots \circ f_1$, the Jacobian of the composition is the product of the individual Jacobians, so the log-determinants add: $\log |\det J_f| = \sum_{k=1}^{K} \log |\det J_{f_k}|$. In training, each layer returns its own log-determinant, and they are summed into the log-likelihood.
Flow vs VAE vs GAN:
| | VAE | Normalizing Flow | GAN |
|---|---|---|---|
| Architecture | Encoder + Decoder (separate) | Single invertible function | Generator + Discriminator |
| Likelihood | Approximate (ELBO) | Exact | Implicit (no explicit $p(x)$) |
| Training | Maximize ELBO | Maximize log-likelihood directly | Min-max game |
| Latent space | Encouraged to be Gaussian (KL reg) | Exactly Gaussian (by design) | Unstructured |
| Flexibility | Very flexible architectures | Constrained to invertible architectures | Very flexible |
| Key limitation | Intractable integral $p(x) = \int p(x|z)p(z)\,dz$ | Must design invertible & efficient Jacobian | No likelihood, training instability |
| Sampling speed | Fast (one decoder pass) | Depends (MAF slow, RealNVP fast) | Very fast |
| Likelihood evaluation | Approximate | Exact and fast | Not available |
What are Diffusion Models? Generative models that:
- Forward process: Gradually destroy data by adding Gaussian noise over $T$ steps (fixed, no learning)
- Reverse process: Learn to denoise and recover data from noise (learned with a neural network)
Core Idea: Similar to thermodynamic diffusion - data starts organized (low entropy) and gradually becomes random noise (high entropy). We learn to reverse this process.
Single step transition:
$$q(x_t|x_{t-1}) = N\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right)$$
where $\beta_t \in (0,1)$ is the noise schedule at step $t$.
Interpretation:
- Mean: $\sqrt{1-\beta_t} x_{t-1}$ - slightly shrinks the previous state
- Variance: $\beta_t I$ - adds isotropic Gaussian noise
- If $\beta_t \to 0$: no noise added (just copy $x_{t-1}$)
- If $\beta_t \to 1$: complete jump to noise (too aggressive, loses information)
Reparameterization trick for sampling: $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$, where $\epsilon \sim N(0,I)$

Direct sampling at any timestep (key property from marginalization):
$$q(x_t|x_0) = N\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I\right)$$
where:
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i = \prod_{i=1}^t (1-\beta_i)$ (cumulative product)

Reparameterization for direct sampling: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim N(0,I)$
Why this formula works:
- As $t$ increases, $\bar{\alpha}_t$ decreases (more noise accumulates)
- At $t=0$: $\bar{\alpha}_0 = 1$ → $x_0 = x_0$ (no noise)
- As $t \to T$: $\bar{\alpha}_T \to 0$ → $x_T \approx N(0,I)$ (pure noise)
- The $(1-\bar{\alpha}_t)$ term ensures the variance balances to 1 as $t \to T$
Why multiple steps instead of one big jump?
- Training signal: Each intermediate timestep $t$ contributes to the loss function (like keeping activations in deep networks for backprop)
- Smooth path: Gradual noise addition creates a smoother, more learnable trajectory from data to noise
- Easier inversion: The reverse process is easier to learn when steps are small (local denoising vs global reconstruction)
- Theoretical: The reverse process converges to the true posterior as $\beta_t \to 0$ and $T \to \infty$
Goal: Learn to invert the forward process, starting from pure noise
Reverse transition:
$$p_\theta(x_{t-1}|x_t) = N\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$$
where $\mu_\theta$ and $\Sigma_\theta$ are predicted by a neural network.
Training objective: Make reverse process match the actual time-reversal of forward process
- Minimize KL divergence between the joint distributions:
  $$\min_\theta D_{KL}(q(x_{0:T}) \| p_\theta(x_{0:T}))$$
- This decomposes (via ELBO) into a sum of KL divergences at each timestep
- Key simplification: When both $q$ and $p$ are Gaussian, KL divergence reduces to an L2 loss on the means
- Final loss (after the math): Train the network to predict the noise $\epsilon$ that was added
Simplified training loss: $$L = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]$$
Training algorithm:
- Sample $x_0$ from training data
- Sample timestep $t \sim \text{Uniform}(1, T)$
- Sample noise $\epsilon \sim N(0,I)$
- Compute $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$
- Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
- Compute loss: $L = ||\epsilon - \hat{\epsilon}||^2$
- Update $\theta$ via gradient descent
Sampling (generating new samples):
- Start with $x_T \sim N(0,I)$ (pure noise)
- For $t = T, T-1, ..., 1$:
  - Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  - Compute mean: $\mu_\theta(x_t, t)$ using the predicted noise
  - Sample: $x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\Sigma_\theta(x_t, t)} \cdot z$ where $z \sim N(0,I)$
- Return $x_0$ (denoised image)
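A minimal sketch of the forward sampling and noise-prediction loss (the `eps_model` below is a stand-in for the U-Net, just to make the block runnable; the linear schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)          # \bar{alpha}_t

def q_sample(x0, t, eps):
    """Direct sampling: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    ab = alpha_bar[t].view(-1, 1)                 # assumes x0 is (batch, dim)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

def training_loss(eps_model, x0):
    t = torch.randint(0, T, (x0.shape[0],))       # uniform random timestep
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return ((eps - eps_model(x_t, t)) ** 2).mean()   # ||eps - eps_theta||^2

# Toy stand-in for the U-Net: ignores t, only to make the sketch executable.
eps_model = lambda x_t, t: torch.zeros_like(x_t)
x0 = torch.randn(16, 8)                           # stand-in "images"
print(training_loss(eps_model, x0))
```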
Why U-Net?
- Takes noisy image $x_t$ and timestep $t$ as input
- Outputs a same-sized tensor (predicted noise $\epsilon_\theta(x_t, t)$)
- Skip connections preserve spatial information across scales
- Encoder-decoder structure with bottleneck
Time embedding:
- Timestep $t$ embedded via sinusoidal positional encoding (like Transformers)
- Injected into the U-Net via adaptive normalization layers or concatenation
- Allows network to learn different denoising strategies for different noise levels
Conditioning (for conditional generation):
- Easy to add conditions (class labels, text embeddings) through decoder
- Concatenate or add condition embeddings alongside time embeddings
- Enables text-to-image (Stable Diffusion) or class-conditional generation
Q1: "Explain why we use $\bar{\alpha}t = \prod{i=1}^t (1-\beta_i)$ instead of
Answer: The noise variance compounds multiplicatively, not additively. Each step multiplies the signal by
Q2: "If
Answer: If
- Signal strength $\sqrt{\bar{\alpha}_t} x_0$ decays exponentially
- Noise variance $(1-\bar{\alpha}_t)$ grows toward 1
- At large $t$, $x_t$ approaches $N(0,I)$ regardless of $x_0$
- Ensures the forward process actually transforms data to a simple prior (pure noise)
Q3: "In the training loss
Answer: Predicting noise is equivalent but more stable:
- Noise $\epsilon \sim N(0,I)$ has constant statistics (zero mean, unit variance) regardless of timestep $t$
- Predicting $x_0$ directly requires reconstructing the entire image from a very noisy $x_t$ at large $t$ (harder)
- Predicting $x_{t-1}$ requires modeling small differences (vanishing gradients)
- Noise prediction: the network learns "what was added" rather than "what should be" - a clearer learning signal
- Can recover $x_0$ or $x_{t-1}$ from $\epsilon$ via reparameterization: $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\epsilon)/\sqrt{\bar{\alpha}_t}$
Q4: "What would happen if we set
Answer: High initial
- Destroy most information in first few steps (large jump toward noise)
- Make reverse process much harder (massive denoising required per step)
- Lose fine details early (irreversible information loss)
- Poor training signal (network can't learn gradual denoising)
Gradual schedule (e.g., $\beta_t$ increasing slowly from a small value such as $10^{-4}$):
- Preserves information longer (smooth degradation)
- Each reverse step is small, local denoising operation (easier to learn)
- Better training signal at all timesteps
- Matches theoretical requirement: reverse process converges as
$\beta_t \to 0$
Q5: "How does the forward process
Answer:
- Forward $q(x_t|x_{t-1})$: Conditions on the less noisy state $x_{t-1}$, adds noise (easy, deterministic given the noise schedule)
- Reverse $p_\theta(x_{t-1}|x_t)$: Conditions on the more noisy state $x_t$, removes noise (hard, must be learned)
- Forward is tractable: Simple Gaussian with fixed parameters $\beta_t$
- Reverse is intractable: The true posterior $q(x_{t-1}|x_t, x_0)$ depends on the unknown $x_0$ and the entire data distribution
- Solution: Approximate the reverse with a neural network $p_\theta$ that learns to denoise without knowing $x_0$
Likelihood-based models require computing a normalized density:
$$p_\theta(x) = \frac{f_\theta(x)}{Z_\theta}, \quad Z_\theta = \int f_\theta(x)\, dx$$
Problems:
- $Z_\theta$ (partition function/normalizing constant) is intractable to compute for complex $f_\theta$ (neural networks)
- Requires integrating over the entire data space (e.g., all possible images)
- For high-dimensional data: billions/trillions of dimensions to integrate over
- Forces architectural constraints:
  - Autoregressive models: Product of conditionals makes $Z_\theta$ tractable
  - Normalizing flows: Invertibility + change of variables makes $Z_\theta$ tractable
  - VAEs: Use a surrogate objective (ELBO) to approximate maximum likelihood
Example: Energy-based model $p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}$
- $E_\theta(x)$ is any neural network (energy function)
- $Z_\theta = \int \exp(-E_\theta(x))\, dx$ is intractable
- Cannot evaluate $p_\theta(x)$ or train via maximum likelihood
Key Idea: Instead of modeling the density $p_\theta(x)$ directly, model its score $\nabla_x \log p_\theta(x)$.
Score Function (w.r.t. data variable $x$): $s_\theta(x) = \nabla_x \log p_\theta(x)$
Why exponential form? Assume $p_\theta(x) = \frac{\exp(-f_\theta(x))}{Z_\theta}$.
Key insight - the normalization constant disappears:
$$\nabla_x \log p_\theta(x) = \nabla_x \left(-f_\theta(x)\right) - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x)$$
The gradient of a constant ($\log Z_\theta$ does not depend on $x$) is zero, so the intractable normalizer drops out.
Why exponential form specifically?
- Logarithm of exponential simplifies nicely: $\log \exp(-f) = -f$
- Exponentials can represent diverse shapes (via any neural network $f_\theta$)
- Common in physics (Boltzmann distribution, energy-based models)
- Mathematically convenient for score matching derivations
Note: Two types of "score":
- Fisher score (w.r.t. parameters $\theta$): $\nabla_\theta \log p_\theta(x)$ - used in classical statistics
- Data score (w.r.t. data $x$): $\nabla_x \log p_\theta(x)$ - used in score-based models
We use the data score because it describes the geometry of the data distribution (which direction increases density).
Goal: Train a network $s_\theta(x)$ to approximate the true data score $\nabla_x \log p_{data}(x)$.
Naive objective - minimize the Fisher divergence:
$$L = \mathbb{E}_{p_{data}(x)} \left[ \|s_\theta(x) - \nabla_x \log p_{data}(x)\|^2 \right]$$
Problem: We don't know $\nabla_x \log p_{data}(x)$ - that would require knowing the data distribution itself.
Solution: Hyvärinen's Score Matching Theorem (2005)
- Showed the above loss is equivalent (up to a constant) to a tractable objective:
  $$L = \mathbb{E}_{p_{data}(x)} \left[ \|s_\theta(x)\|^2 + 2 \cdot \text{tr}(\nabla_x s_\theta(x)) \right]$$
- This only requires computing $s_\theta$ and its derivatives, not $p_{data}$!
- $\text{tr}(\nabla_x s_\theta(x))$ = trace of the Jacobian of the score (sum of its diagonal)
- $\|s_\theta(x)\|^2$ = squared norm of the predicted score
In practice: Modern implementations use denoising score matching:
- Perturb data: $\tilde{x} = x + \sigma \epsilon$ where $\epsilon \sim N(0,I)$
- Train to denoise: $L = \mathbb{E}_{x, \epsilon} \left[ \|s_\theta(\tilde{x}, \sigma) + \frac{\epsilon}{\sigma}\|^2 \right]$
- Equivalent to score matching but simpler to implement
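A sketch of this denoising score matching loss for a single noise level (`score_net` is a stand-in; real networks condition on $\sigma$):

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching: || s_theta(x + sigma*eps, sigma) + eps/sigma ||^2."""
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    target = -eps / sigma           # score of the Gaussian perturbation kernel
    return ((score_net(x_tilde, sigma) - target) ** 2).sum(dim=-1).mean()

# Stand-in "network": the exact score of N(0, I), just to make the sketch runnable.
score_net = lambda x, sigma: -x
x = torch.randn(64, 2)
print(dsm_loss(score_net, x, sigma=0.1))
```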
Langevin Dynamics - MCMC method to sample from a distribution using only its score:
$$x_{k+1} = x_k + \epsilon \cdot s_\theta(x_k) + \sqrt{2\epsilon} \cdot z_k, \quad z_k \sim N(0,I)$$
Intuition:
- Gradient term $\epsilon \cdot s_\theta(x_k)$: Follow the score uphill toward high-density regions (like gradient ascent)
- Noise term $\sqrt{2\epsilon} \cdot z_k$: Add stochasticity to explore and escape local optima
- Balance: As $k \to \infty$ and $\epsilon \to 0$, converges to sampling from $p_\theta(x)$
Algorithm:
- Initialize: $x_0 \sim N(0,I)$ (random noise)
- For $k = 0, 1, ..., K-1$:
  - Compute score: $s_k = s_\theta(x_k)$
  - Update: $x_{k+1} = x_k + \epsilon s_k + \sqrt{2\epsilon} z_k$ where $z_k \sim N(0,I)$
- Return $x_K$ (sample from the learned distribution)
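A numpy sketch of this sampler, using the known score of a Gaussian so convergence can be checked:

```python
import numpy as np

def langevin_sample(score, x0, step=0.01, n_steps=1000):
    """x_{k+1} = x_k + step * score(x_k) + sqrt(2 * step) * z_k."""
    x = x0.copy()
    for _ in range(n_steps):
        z = np.random.randn(*x.shape)
        x = x + step * score(x) + np.sqrt(2 * step) * z
    return x

# Score of N(mu, I) is mu - x; chains should drift toward mu = (3, -2).
mu = np.array([3.0, -2.0])
samples = langevin_sample(lambda x: mu - x, x0=np.random.randn(5000, 2))
print(samples.mean(axis=0))   # ~ [3, -2]
```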
Connection to physics: Langevin equation models Brownian motion of particles in a potential field - drift toward low energy + random thermal fluctuations.
Challenge: Score matching minimizes the Fisher divergence:
$$L = \mathbb{E}_{p_{data}(x)} \left[ \|s_\theta(x) - \nabla_x \log p_{data}(x)\|^2 \right]$$
Problem: The expectation is weighted by $p_{data}(x)$:
- Errors in high-density regions (where data exists) are heavily penalized
- Errors in low-density regions (between data modes, far from data) are largely ignored
- But Langevin dynamics starts in low-density regions (random noise initialization)!
- Inaccurate scores in low-density regions derail sampling from the start
Visual intuition:
- Imagine 2D swiss roll dataset (high-density spiral)
- Score matching learns good scores on the spiral (plenty of data)
- Score matching learns poor scores between spiral arms (no data, but weighted less)
- Sampling starts between arms → follows wrong gradients → fails to reach spiral
Idea: Perturb data with multiple noise scales to populate low-density regions
Multiple noise levels:
- Choose $L$ noise scales: $\sigma_1 < \sigma_2 < ... < \sigma_L$ (e.g., a geometric sequence)
- Perturb data: $p_{\sigma_i}(x) = \int p_{data}(y) \cdot N(x; y, \sigma_i^2 I)\, dy$ (convolution with a Gaussian)
- High $\sigma_i$: Heavily blurred data (fills low-density regions)
- Low $\sigma_i$: Slightly blurred data (preserves structure)
Train a noise-conditional score network: $s_\theta(x, \sigma_i) \approx \nabla_x \log p_{\sigma_i}(x)$
Training objective (weighted sum over noise levels):
$$L = \sum_{i=1}^L \lambda(\sigma_i) \mathbb{E}_{p_{\sigma_i}(x)} \left[ \|s_\theta(x, \sigma_i) - \nabla_x \log p_{\sigma_i}(x)\|^2 \right]$$
where $\lambda(\sigma_i)$ is a per-scale weighting (commonly $\lambda(\sigma_i) = \sigma_i^2$).
Using denoising score matching:
$$L = \sum_{i=1}^L \lambda(\sigma_i) \mathbb{E}_{x_0 \sim p_{data},\, \epsilon \sim N(0,I)} \left[ \|s_\theta(x_0 + \sigma_i \epsilon, \sigma_i) + \frac{\epsilon}{\sigma_i}\|^2 \right]$$
Algorithm:
- Initialize: $x_L \sim N(0, \sigma_L^2 I)$ (start at the highest noise level)
- For $i = L, L-1, ..., 1$:
  - Run $K$ steps of Langevin dynamics with score $s_\theta(x, \sigma_i)$:
    $$x \leftarrow x + \epsilon_i \cdot s_\theta(x, \sigma_i) + \sqrt{2\epsilon_i} \cdot z \quad \text{where } z \sim N(0,I)$$
- Return $x$ (final sample)
Intuition:
- High noise ($\sigma_L$): Scores are accurate everywhere (heavily blurred data), gives rough global structure
- Low noise ($\sigma_1$): Scores are accurate close to the actual data, gives fine details
- Annealing: Gradually reduce noise → progressively refine the sample from coarse to fine
- Similar to simulated annealing in optimization
- Similar to simulated annealing in optimization
Key insight: Predicting the noise $\epsilon$ (DDPM) is equivalent to predicting the score, up to scaling.
Relationship: From $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, the score is:
$$\nabla_{x_t} \log p(x_t) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$
Therefore: $s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$
Equivalence:
- DDPM: Discrete timesteps $t \in \{1, ..., T\}$, predicts noise $\epsilon_\theta(x_t, t)$
- Score-based: Continuous time $t \in [0,1]$ (or discrete noise levels $\sigma_i$), predicts score $s_\theta(x_t, t)$
- Unified view: Both are special cases of the Stochastic Differential Equations (SDE) framework
Benefits of score-based view:
- More flexible sampling (can adjust step size, use different samplers)
- Connections to continuous-time diffusion processes
- Theoretical foundations from statistical physics
Q1: "Why can't we use standard maximum likelihood to train energy-based models
Answer: Maximum likelihood requires computing
Q2: "Explain the low-density region problem in score matching. Why does it affect Langevin dynamics sampling?"
Answer: The score matching loss $\mathbb{E}_{p_{data}} [\|s_\theta - \nabla_x \log p_{data}\|^2]$ is weighted by $p_{data}(x)$, so errors in low-density regions barely affect the loss and the learned scores there are unreliable. But Langevin sampling starts from random noise, i.e., exactly in those low-density regions, so the inaccurate scores send early steps in the wrong directions and sampling may never reach the data manifold.
Q3: "Why use multiple noise scales in NCSN instead of a single large noise level?"
Answer: Single large noise heavily blurs the data distribution, making all regions high-density but losing fine structure. Multiple noise scales provide a curriculum:
- Large $\sigma$: Fills low-density regions, learns global structure
- Small $\sigma$: Preserves data details, learns fine structure

Annealed Langevin dynamics progressively refines samples: coarse structure from high-noise scores → fine details from low-noise scores. Like coarse-to-fine optimization.
Q4: "How are DDPM and score-based models mathematically equivalent? What does each model learn?"
Answer: DDPM learns to predict the added noise $\epsilon_\theta(x_t, t)$; score-based models learn the score $s_\theta(x_t, t) = \nabla_{x_t} \log p(x_t)$. The two are related by $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sqrt{1-\bar{\alpha}_t}$, so each can be converted into the other, and both are discretizations of the same underlying SDE framework.
Q5: "Explain Langevin dynamics intuitively. Why does adding noise help sampling?"
Answer: Langevin dynamics updates $x_{k+1} = x_k + \epsilon s_\theta(x_k) + \sqrt{2\epsilon} z_k$, combining two terms:
- Gradient term $\epsilon s_\theta(x_k)$: Pushes toward high-density regions (deterministic)
- Noise term $\sqrt{2\epsilon} z_k$: Adds randomness (stochastic)
Without noise, it's just gradient ascent → gets stuck in local maxima. Noise allows:
- Escaping local optima (exploration)
- Proportional sampling (visit high-density regions more, but not exclusively)
- Detailed balance (necessary for MCMC convergence to target distribution)
Q6: "Compare the normalization challenges in VAE vs Score-Based Models. How does each solve them?"
Answer:
- VAE: Intractable $p(x) = \int p(x|z)p(z)\, dz$ (integral over the latent space). Solution: Use the ELBO as a surrogate objective (approximate inference with $q(z|x)$).
- Score-based: Intractable $Z_\theta$ in $p_\theta(x) = f_\theta(x)/Z_\theta$ (integral over the data space). Solution: Model the gradient $\nabla_x \log p_\theta(x)$ instead, where $Z_\theta$ disappears (the gradient of a constant is zero).
Both avoid direct likelihood computation but for different intractable integrals using different mathematical tricks.
What are GANs? GANs are a unique approach to generative modeling that frames the unsupervised problem as a supervised one using adversarial training between two networks.
Key Insight: Instead of directly modeling $p(x)$, train two networks against each other:
- Generator $G$: Creates fake samples from noise $z \sim p_z(z)$ (usually $N(0,I)$)
- Discriminator $D$: Classifies samples as real (from data) or fake (from $G$)
- They compete in a zero-sum, two-player min-max game until equilibrium
  - Zero-sum: One player's gain is the other's loss (D maximizes what G minimizes)
  - Alternating optimization: Train D for k steps, then G for 1 step (repeat)
The Analogy:
- Generator = Artist trying to create realistic paintings
- Discriminator = Art critic trying to spot fakes
- Training = Friendly competition that makes both better
- Goal = Generator becomes so good that critic can only guess (50% accuracy)
Generator:
- Input: Random noise $z \sim p_z(z)$ (latent vector, e.g., 100-dim)
- Output: Generated sample $G(z)$ (e.g., 28×28 image)
- Architecture: Fully connected layers → Conv layers (in DCGAN)
Discriminator:
- Input: Sample $x$ (either real from data or fake from $G$)
- Output: Probability $D(x) \in [0,1]$ (1 = real, 0 = fake)
- Architecture: Classifier network (conv layers + sigmoid output)
Original formulation: $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1-D(G(z)))]$$
Breaking it down:
- First term $\mathbb{E}_{x \sim p_{data}}[\log D(x)]$: Discriminator correctly identifies real data
- Second term $\mathbb{E}_{z \sim p_z}[\log(1-D(G(z)))]$: Discriminator correctly identifies fake data
- Discriminator goal: Maximize both terms (correct classification)
- Generator goal: Minimize the second term (fool the discriminator)
This is Binary Cross-Entropy (BCE) Loss!
For each iteration:
- Train Discriminator (k steps, typically k=1):
  - Sample a minibatch of real data $\{x^{(1)}, ..., x^{(m)}\}$
  - Sample a minibatch of noise $\{z^{(1)}, ..., z^{(m)}\}$
  - Generate fake samples: $\tilde{x}^{(i)} = G(z^{(i)})$
  - Update $D$ to maximize: $\frac{1}{m}\sum_{i=1}^m [\log D(x^{(i)}) + \log(1-D(G(z^{(i)})))]$
- Train Generator (1 step):
  - Sample a minibatch of noise $\{z^{(1)}, ..., z^{(m)}\}$
  - Update $G$ to minimize: $\frac{1}{m}\sum_{i=1}^m \log(1-D(G(z^{(i)})))$
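A minimal PyTorch sketch of one such iteration (illustrative MLPs and a stand-in real batch; the generator update uses the non-saturating loss described next):

```python
import torch
import torch.nn as nn

z_dim, x_dim = 16, 64
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, x_dim)                 # stand-in for a real minibatch
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))
fake = G(torch.randn(32, z_dim)).detach()     # detach: don't update G here
d_loss = bce(D(real), ones) + bce(D(fake), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating): minimize -log D(G(z))
fake = G(torch.randn(32, z_dim))
g_loss = bce(D(fake), ones)                   # label fakes as "real"
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```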
Non-Saturating Generator Loss (used in practice):
- Problem: $\log(1-D(G(z)))$ saturates when $D$ is confident (early training)
- Solution: Instead minimize $-\log D(G(z))$ (maximize probability of fooling $D$)
- Same gradient direction but a stronger signal when $D$ confidently rejects
Convergence Criterion:
- Optimal discriminator: $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$, which equals $\frac{1}{2}$ everywhere at convergence
- When $p_g = p_{data}$: Generator has perfectly learned the data distribution
Connection to JS Divergence:
- The original GAN loss minimizes the Jensen-Shannon (JS) divergence between $p_{data}$ and $p_g$
- At the optimal discriminator $D^*$, the generator's objective becomes: $$\min_G V(G, D^*) = 2 \cdot D_{JS}(p_{data} \| p_g) - 2\log 2$$
- Problem: JS divergence is constant when supports don't overlap → vanishing gradients
Key architectural guidelines:
- Replace pooling with strided convolutions (discriminator) and fractional-strided convolutions/transposed conv (generator)
- Use Batch Normalization in both G and D
- Remove fully connected layers for deeper architectures
- Generator activations:
- ReLU for all layers except output
- Tanh for output layer
- Discriminator activation:
- LeakyReLU for all layers
Significance: Made GANs work reliably for image generation, became foundation for vision-based GANs
Idea: Control what the generator produces by conditioning on additional information
Modified loss: $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x|y)] + \mathbb{E}_{z \sim p_z}[\log(1-D(G(z|y)|y))]$$
How conditioning works:
- Generator: $G(z, y)$ - concatenate noise $z$ and label $y$ as input
- Discriminator: $D(x, y)$ - receives both image and label
  - Must learn: Real image + correct label = Real
  - Real image + wrong label = Fake
  - Fake image + any label = Fake
Benefit: Can generate specific classes on demand (e.g., "generate a digit 7")
Key insight for cGANs:
- Training discriminator with mismatched pairs (real image + wrong label) forces it to learn semantic meaning
- This makes discriminator a better critic - it doesn't just judge "realistic", but "realistic AND matches condition"
- Helps reduce mode collapse: generator can't just produce one realistic output for all conditions
Problem: Generator learns to produce only a limited subset of the data distribution
- Example: Face generator produces only a few face types instead of diverse faces
- Root cause: GAN loss has no explicit diversity term
- Generator finds a few "safe" outputs that always fool discriminator
Why it happens:
- Discriminator only judges "real vs fake", not diversity
- Generator exploits weaknesses: if one type of output fools $D$, keep producing it
- No mechanism to learn the entire distribution, only to fool $D$
Problem: When discriminator becomes too good, gradients to generator vanish
- $D(G(z)) \approx 0$ → $\log(1-D(G(z)))$ goes flat → no learning signal
- Generator stops improving

Cause: The sigmoid output of $D$ saturates; when $D$ confidently rejects fakes, the gradient of $\log(1-D(G(z)))$ with respect to the generator's parameters becomes vanishingly small.
- Convergence issues: Hard to achieve simultaneous equilibrium of both networks
- Perfect Discriminator: No gradients flow to generator
- Poor Discriminator: Generator doesn't learn realistic features
Key Insight: Use Wasserstein-1 (Earth Mover's) distance instead of JS divergence
Why Wasserstein distance?
-
JS divergence problem: When
$p_{data}$ and$p_g$ have disjoint supports (don't overlap), JS = constant- No meaningful gradient signal
- Common in high-dimensional spaces (images)
-
Wasserstein distance: Measures "how much work" to move one distribution to another
- Always provides gradient even with disjoint supports
- More stable, smoother gradients
WGAN Changes:
- Remove sigmoid from the discriminator → call it "Critic" instead
  - Output: $C(x) \in (-\infty, \infty)$ (unbounded, linear activation)
- New loss:
  $$\min_G \max_{C \in \mathcal{C}} \mathbb{E}_{x \sim p_{data}}[C(x)] - \mathbb{E}_{z \sim p_z}[C(G(z))]$$
  - Critic maximizes: score real data high, fake data low
  - Generator minimizes: make fake data score high
- Enforce a Lipschitz constraint on the critic:
  - Original WGAN: Weight clipping (clip weights to $[-0.01, 0.01]$ after each update)
  - WGAN-GP (improved): Gradient penalty instead of clipping
    $$L_{GP} = \lambda \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} C(\hat{x})\|_2 - 1)^2]$$
    where $\hat{x}$ is a point interpolated between real and fake samples; see the sketch below
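A sketch of the gradient penalty term under these definitions (vector-valued data for simplicity; the `critic` MLP is an illustrative stand-in):

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP: penalize (||grad_x C(x_hat)||_2 - 1)^2 at interpolated points."""
    alpha = torch.rand(real.size(0), 1)                  # per-sample mix factor
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                 create_graph=True)      # keep graph for backprop
    return lam * ((grads.norm(2, dim=-1) - 1) ** 2).mean()

critic = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.LeakyReLU(0.2),
                             torch.nn.Linear(32, 1))
real, fake = torch.randn(16, 8), torch.randn(16, 8)
print(gradient_penalty(critic, real, fake))
```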
Benefits:
- Addresses mode collapse: Smoother gradients help explore full distribution
- No vanishing gradients: Critic provides meaningful gradients even when far from optimal
- Training stability: Can train critic to optimality without worrying about vanishing gradients
Implementation details:
- Train critic 5 times per generator update (vs 1:1 in vanilla GAN)
- Use RMSProp with small learning rate (0.00005)
- No momentum-based optimizers
| | BCE Loss (Vanilla GAN) | W-Loss (WGAN) |
|---|---|---|
| Discriminator output | $D(x) \in [0,1]$ (sigmoid probability) | $C(x) \in (-\infty, \infty)$ (unbounded critic score) |
| Loss function | $\mathbb{E}[\log D(x)] + \mathbb{E}[\log(1-D(G(z)))]$ | $\mathbb{E}[C(x)] - \mathbb{E}[C(G(z))]$ |
| Gradient behavior | Can vanish/saturate | Always provides signal |
| Mode collapse | Common | Reduced |
| Constraint | None | Lipschitz (weight clipping or GP) |
Problem: GANs struggle with high-resolution images (1024×1024)
- Higher resolution → easier to tell real from fake
- Must learn all scales simultaneously (very hard)
Solution: Incrementally grow both G and D during training
- Start with 4×4 resolution
- Train until stable
- Add layers to increase resolution: 4×4 → 8×8 → 16×16 → ... → 1024×1024
- Smoothly fade in new layers to avoid shocking existing layers
Benefits:
- Stability: Easier to learn simple (low-res) structure first, then details
- Speed: Most iterations at low resolution → 2-6× faster training
- Quality: Achieves unprecedented 1024×1024 image quality
Key technique - Fade-in mechanism: When adding a new layer, blend between old and new paths (see the sketch after this list):
- $\alpha$ (fade-in factor) goes from 0 → 1 over time
- Output = $(1-\alpha) \cdot \text{old path} + \alpha \cdot \text{new path}$
- Why fade-in? Avoids "shocking" well-trained lower-resolution layers
- New layers are smoothly introduced while keeping existing layers trainable
- Both G and D grow in synchrony (mirror images of each other)
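A minimal sketch of the fade-in blend; `old_path` (the upsampled output of the trained low-res stack) and `new_path` (the new layer's output) are hypothetical names:

```python
import torch

def faded_output(old_path: torch.Tensor, new_path: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend the old (lower-res, upsampled) and new (higher-res) paths.

    alpha ramps from 0 to 1 over the fade-in phase, so the new layer is
    introduced gradually instead of shocking the trained low-res layers.
    """
    return (1.0 - alpha) * old_path + alpha * new_path

# During training, alpha is typically a schedule over images seen, e.g.:
# alpha = min(1.0, images_seen / fade_in_images)
```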
Training details:
- Uses WGAN-GP (Wasserstein GAN with Gradient Penalty) loss
- WGAN-GP replaces weight clipping with gradient penalty for better stability
- Progressive approach allows most iterations at low resolution (faster)
Why needed? Hard to objectively evaluate generated images
1. Fréchet Inception Distance (FID) - Most common:
- Use a pretrained InceptionV3 to extract features from real and fake images
- Compute mean $\mu$ and covariance $\Sigma$ of each feature distribution
- FID = Fréchet distance between the two Gaussians (a NumPy sketch follows):
  $$\text{FID} = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$$
- Lower is better (0 = perfect match to the real distribution)
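A minimal NumPy/SciPy sketch of the FID formula, assuming features were already extracted with InceptionV3; shapes and names are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of Inception features, each of shape (N, 2048)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))
```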
2. Inception Score (IS):
- Classify generated images with InceptionV3
- Good images should:
- Have confident predictions (low entropy per image)
- Cover diverse classes (high entropy overall)
- IS = $\exp(\mathbb{E}_x[D_{KL}(p(y|x) \,||\, p(y))])$
- Higher is better
3. Structural Similarity (MS-SSIM):
- Perceptual metric comparing structure, luminance, contrast
- SSIM formula:
  $$\text{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
  where $\mu$ = mean, $\sigma^2$ = variance, $\sigma_{xy}$ = covariance, $C_1, C_2$ = stability constants
- MS-SSIM: SSIM applied at multiple scales (pyramid) for multi-resolution comparison
- Higher is better (1 = identical images)
4. Sliced Wasserstein Distance (SWD):
- Approximates Wasserstein distance using 1D projections
- Projects high-dimensional distributions onto random directions, computes 1D Wasserstein distance
- SWD = average over many random projections (directions)
- Measures statistical similarity between real and generated distributions
- Lower is better
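A rough NumPy sketch of SWD, under the assumption that both point clouds contain the same number of samples; names and the projection count are arbitrary:

```python
import numpy as np

def sliced_wasserstein(x: np.ndarray, y: np.ndarray, n_projections: int = 128) -> float:
    """Approximate SWD between two point clouds of shape (N, D).

    Each random unit direction reduces the problem to 1D, where the
    W1 distance is a comparison of sorted projections (order statistics).
    Assumes x and y have the same number of samples N.
    """
    rng = np.random.default_rng(0)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)      # random unit direction
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py))   # 1D Wasserstein-1
    return total / n_projections
```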
Important: For non-image domains (audio, signals), retrain classifier on your domain!
Key Innovation: Separate style from content using intermediate latent space
Architecture:
z (512-dim noise)
→ Mapping Network (8 FC layers)
→ w (512-dim intermediate latent)
→ AdaIN at each conv layer (style injection)
→ Generated image
Three main components:
- Mapping Network: $f: \mathcal{Z} \to \mathcal{W}$
  - 8 fully-connected layers
  - Maps Gaussian noise $z$ to intermediate latent $w$
  - Why? $z$ must follow a fixed distribution, but $w$ is free to be disentangled
- Synthesis Network with AdaIN:
  - Starts from a learned constant (4×4 tensor), not $z$!
  - At each conv layer: Apply Adaptive Instance Normalization (AdaIN)
  - AdaIN: $\text{AdaIN}(x, y) = \sigma(y) \frac{x - \mu(x)}{\sigma(x)} + \mu(y)$
    - Normalizes activation $x$, then scales/shifts by style $y$ (derived from $w$); see the sketch after this list
  - Injects style at multiple scales: coarse (4×4-16×16), middle (32×32-64×64), fine (128×128-1024×1024)
- Noise Injection:
  - Add Gaussian noise to each feature map
  - Controls stochastic variation (hair strands, pores) without affecting global structure (pose, identity)
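To make the AdaIN formula concrete, a minimal PyTorch-style sketch; the `style_scale`/`style_shift` vectors stand in for the learned affine transform of $w$ (illustrative names):

```python
import torch

def adain(x: torch.Tensor, style_scale: torch.Tensor, style_shift: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y).

    x: feature maps (B, C, H, W); style_scale, style_shift: (B, C) vectors
    produced by a learned affine transform of w.
    """
    mu = x.mean(dim=(2, 3), keepdim=True)           # per-channel content mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps   # per-channel content std
    normalized = (x - mu) / sigma                   # strip the original "style"
    return style_scale[:, :, None, None] * normalized + style_shift[:, :, None, None]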
Disentanglement: Each dimension in latent space controls one independent factor of variation
- Example: One dimension = age, another = gender, another = hair color
- Linear subspaces control factors independently
Why $w$ instead of $z$?
- $z \sim N(0,I)$ must match the training data distribution (potentially entangled)
- $w = f(z)$ is free from that constraint (the learned mapping can untangle it)
- Hypothesis: It is easier to generate from a disentangled representation
Path Length Regularization (PLR):
- Problem: Interpolating in latent space causes non-linear changes in the image
  - Features absent at the endpoints appear in the middle (e.g., glasses appear mid-interpolation)
  - Image changes drastically for small latent moves (unpredictable)
- Perceptual Path Length: Measures how much the image changes during interpolation
  - Uses VGG16 embeddings to measure perceptual distance
  - "Full" metric: subdivide the interpolation path, sum perceptual distances
  - "End" metric: only measure endpoints (biased toward input space)
- Solution: Penalize large changes in image space during latent interpolation
  $$L_{PLR} = \mathbb{E}_{w,\mathbf{y}}[(||\mathbf{J}^T_w \mathbf{y}||_2 - a)^2]$$
  where $\mathbf{J}_w$ is the Jacobian of $G$ w.r.t. $w$, $\mathbf{y} \sim N(0,I)$, and $a$ is a moving average of the path length
  - Penalizes deviation from the expected path length (encourages consistency)
- Effect:
  - Smoother, more linear interpolations in latent space
  - Better disentanglement (W-space more linear)
  - Easier inversion (more predictable mapping)
Problem 1: Droplet artifacts
- Blob-like artifacts appear at 64×64+ resolution in all feature maps
- Visible in intermediate layers even when not obvious in final image
- Cause: AdaIN normalizes mean/variance of each feature map independently
- Destroys information in relative magnitudes between features
- Generator exploits this: creates strong localized spike that dominates statistics
- This allows generator to "sneak" signal strength information past normalization
- Solution: Replace AdaIN with weight demodulation
- Removes the normalization step that caused the artifact
- Modulates convolution weights instead of activations
- Retains full style controllability without artifacts
Problem 2: Progressive growing artifacts
- Phase artifacts, location preference for details, compromised shift invariance
- Solution: Remove progressive growing entirely
- Use direct training at target resolution with improved regularization
- Alternative architectures explored to achieve quality without progressive growing
Other improvements:
- Lazy regularization: Apply regularization (R1 gradient penalty) every N minibatches instead of every batch
- Typical: R1 penalty once every 16 minibatches (not every iteration)
- Why it works: Main loss and regularization can be optimized at different frequencies
- Reduces computation by ~15-30% with no quality loss
- Greatly reduces memory usage
Conceptual Questions (instructor's style):
- "Why do vanilla GANs suffer from mode collapse? How does Wasserstein loss help?"
  - Answer: Vanilla GANs only optimize "real vs fake", with no diversity term. The generator finds a few outputs that reliably fool D. WGAN provides smooth gradients even with disjoint distributions, encouraging exploration of the full distribution.
- "What is the equilibrium condition for a GAN? What does $D^*(x) = 0.5$ mean?"
  - Answer: At equilibrium, $p_g = p_{data}$. The optimal discriminator $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = 0.5$ means it can only guess (the generator has perfectly learned the distribution).
- "Why is GAN training called a 'zero-sum game'? Why can't we just train G and D simultaneously with gradient descent?"
  - Answer: Zero-sum because D's objective (maximize $V$) is exactly opposite to G's objective (minimize $V$) - one's gain is the other's loss. We can't train both simultaneously because we need D to be optimal (or near-optimal) to provide meaningful gradients to G. If both were updated together, D would never reach optimality, giving a poor training signal to G. That's why we use alternating optimization: train D for k steps to near-optimality, then update G once.
- "Why does WGAN train the critic 5 times per generator update (5:1 ratio) while vanilla GAN uses 1:1?"
  - Answer: WGAN's theoretical guarantee requires the critic to be nearly optimal for the Wasserstein distance approximation to be accurate. Unlike vanilla GAN, where a perfect D causes vanishing gradients (bad), WGAN benefits from a better critic (it provides a better distance estimate). The 5:1 ratio keeps the critic ahead, giving reliable gradients. Vanilla GAN uses 1:1 because training D too well causes gradient vanishing.
- "Why does StyleGAN use an intermediate latent space $w$ instead of directly using $z$?"
  - Answer: $z$ must follow a fixed Gaussian distribution matching the training data (which may be entangled). $w$ is free from this constraint, allowing the learned mapping to disentangle factors of variation.
- "Explain the connection between the GAN min-max loss and JS divergence. Why is this problematic?"
  - Answer: At the optimal discriminator $D^*$, minimizing the GAN loss is equivalent to minimizing the JS divergence between $p_{data}$ and $p_g$. Problem: When the distributions have disjoint supports (common in high dimensions), the JS divergence is constant ($\log 2$), providing no gradient. This causes vanishing gradients - the generator gets no learning signal about which direction to move.
- "What's the difference between BCE loss and Wasserstein loss in GANs?"
  - Answer: BCE uses a sigmoid discriminator output in $[0,1]$, which can vanish when D is confident. Wasserstein uses an unbounded critic output and provides gradients even with disjoint supports. Key: W-loss is a proper distance metric that measures "how far apart" the distributions are, not just "different or same".
- "In conditional GANs (cGANs), why do we feed the discriminator both 'real image + wrong label' pairs during training? Doesn't this confuse it?"
  - Answer: No, it improves training! The discriminator must learn three rejection cases: (1) fake images, (2) real images with wrong labels, (3) mismatched pairs. This forces D to understand semantic content, not just image quality. It becomes a better critic that judges "realistic AND semantically correct", which gives a better gradient signal to G. It also helps prevent mode collapse - G can't fool D with one realistic output for all conditions.
- "Why does Progressive GAN use a fade-in mechanism instead of abruptly adding new layers?"
  - Answer: Abruptly adding layers would "shock" the well-trained low-resolution layers with random gradients from the untrained high-resolution layers. Fade-in smoothly blends the old path (trained) and new path (training) using $\alpha$: output = $(1-\alpha) \cdot \text{old} + \alpha \cdot \text{new}$ where $\alpha$ goes 0→1. This preserves learned knowledge while introducing new capacity. Both G and D grow synchronously.
- "Explain why StyleGAN's intermediate latent space W is more disentangled than input space Z. Use the 'warped distribution' argument."
  - Answer: Z must follow a fixed Gaussian $N(0,I)$, but the training data has correlations (e.g., beards mostly on males). To match the training distribution, Z-space must "warp" - creating curved manifolds where correlated features cluster. This warping = entanglement. The mapping network $f: Z \to W$ can "unwarp" this: W is free from the fixed distribution constraint, allowing a learned transformation to straighten the manifolds into linear subspaces (disentanglement). Training encourages this because disentangled representations are easier to generate from.
- "What problem does Path Length Regularization solve in StyleGAN? Why can features appear/disappear during interpolation?"
- Answer: Problem: Non-linear mapping from latent to image means linear interpolation in latent space causes non-linear changes in image space. Example: interpolating between "no glasses" and "no glasses" can produce "glasses" in middle because latent path crosses through "glasses" region (curved manifold). PLR penalizes large image changes for small latent moves using Jacobian norm, encouraging smoother, more linear geometry. Makes interpolation predictable and improves disentanglement.
- "Why did StyleGAN2 need to replace AdaIN with weight demodulation? Explain the 'sneaking signal strength' problem."
- Answer: AdaIN normalizes each feature map independently (divides by std, centers at mean). This destroys information about relative magnitudes between different features. Generator exploited this flaw: it created strong localized spikes (droplet artifacts at 64×64+) that dominate the mean/std statistics of that feature map. By controlling spike magnitude, generator could "sneak" signal strength information past the normalization. Weight demodulation fixes this by modulating convolution weights instead of activations, avoiding the normalization that enabled the exploit.
- "What is 'lazy regularization' in StyleGAN2? Why doesn't it hurt performance to apply regularization less frequently?"
- Answer: Lazy regularization applies R1 gradient penalty once every N minibatches (e.g., N=16) instead of every iteration, but with N× weight to compensate. Works because: (1) main loss and regularization have different time scales - regularization prevents long-term drift, doesn't need frequent updates, (2) computing gradients for regularization is expensive, doing it 1/16th as often saves 15-30% computation. No quality loss because regularization's role (smoothing, preventing pathological solutions) doesn't require immediate response.
- "Explain how AdaIN in StyleGAN controls style injection. What does the formula
$\text{AdaIN}(x, y) = \sigma(y) \frac{x - \mu(x)}{\sigma(x)} + \mu(y)$ mean?"
- Answer: AdaIN normalizes content features
$x$ to zero mean and unit variance (removes original style), then applies affine transformation using style$y$ 's statistics.$\sigma(y)$ controls scale/contrast,$\mu(y)$ controls shift/brightness. Style$y$ comes from learned affine transform of$w$ (intermediate latent). Different layers control different scales: early layers (4×4-16×16) = coarse features (pose, shape), middle (32×64) = facial features, late (128×1024) = fine details (hair strands, skin texture). This hierarchical injection enables style mixing.
Calculation Questions:
- "Given discriminator outputs
$D(x_{real}) = 0.9$ and$D(G(z)) = 0.3$ , calculate the discriminator and generator losses."
- D loss:
$-[\log(0.9) + \log(1-0.3)] = -[\log(0.9) + \log(0.7)]$ - G loss (non-saturating):
$-\log(0.3)$
- "If the optimal discriminator outputs
$D^*(x) = 0.7$ for a particular sample$x$ , what can you infer about$p_{data}(x)$ and$p_g(x)$ ?"
- Answer: Using
$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = 0.7$ , we get:$p_{data}(x) = 0.7(p_{data}(x) + p_g(x))$ $0.3 p_{data}(x) = 0.7 p_g(x)$ - Therefore:
$\frac{p_{data}(x)}{p_g(x)} = \frac{0.7}{0.3} = \frac{7}{3} \approx 2.33$ - Real data is ~2.3x more likely than generated data at
$x$ → Generator underproduces this region
- "You're evaluating a GAN trained on face images. FID = 15, IS = 3.2, MS-SSIM = 0.85. Interpret these metrics."
- Answer:
- FID = 15: Low is good. 15 indicates decent quality (real distributions not perfectly matched but close). Feature distributions fairly similar to real data.
- IS = 3.2: Relatively low score suggests either: (1) low confidence predictions (blurry/unclear images), or (2) limited diversity (not covering all face types). For faces, IS may be limited by dataset (faces aren't as diverse as ImageNet's 1000 classes).
- MS-SSIM = 0.85: High structural similarity (max = 1.0) suggests good preservation of structure, luminance, contrast at multiple scales.
- Overall: Decent quality but room for improvement in diversity or sharpness.
- "Why do we use Inception-based metrics (FID, IS) for GANs? Could we use any classifier?"
- Answer: InceptionV3 pretrained on ImageNet provides rich feature representations that correlate well with human perception. We can use any classifier, but must retrain it on the target domain. Example: for audio GANs, train ResNet on audio spectrograms. For earthquake signals (no standard classifier exists), can't directly use FID - must first train domain-specific classifier. The classifier quality determines metric reliability.
Comparison Questions:
- "Compare GANs, VAEs, and Normalizing Flows in terms of: (a) latent space structure, (b) training objective, (c) mode coverage."
- See table below
| | VAE | Normalizing Flow | GAN |
|---|---|---|---|
| Latent space | Continuous, Gaussian (regularized) | Exactly Gaussian (by design) | Any (no explicit prior) |
| Training | Maximize ELBO | Maximize log-likelihood | Min-max game |
| Likelihood | Approximate (lower bound) | Exact | Implicit (no explicit p(x)) |
| Mode coverage | Good (encouraged by KL) | Good | Can suffer mode collapse |
| Sample quality | Often blurry | Good | Excellent (sharp images) |
| Training stability | Stable | Stable | Unstable (improved with WGAN) |
| Inference | Can encode x → z | Can encode x → z | No encoder (need inversion) |
Disentanglement refers to having a latent representation where each dimension controls one independent factor of variation.
Ideal disentangled representation:
- Each latent dimension $z_i$ controls exactly one semantic attribute
- Changes in $z_i$ affect only that attribute, nothing else
- Example: $z_1$ = age, $z_2$ = gender, $z_3$ = hair color (completely independent)
Entangled representation (reality):
- Latent dimensions are correlated/interdependent
- Changing one dimension affects multiple attributes
- Example: Changing "beard" dimension also affects "gender" (beards correlated with males)
Why it matters:
- Controllability: Want to manipulate specific attributes (e.g., add smile without changing identity)
- Interpretability: Understand what each dimension represents
- Generalization: Linear interpolation in latent space should produce smooth, meaningful changes
Conditioning (Week 9 concept):
- Provide explicit labels during training: model $p(x|y)$
- Example: cGAN with class labels
- Requires labeled data
- Controls what class to generate (e.g., "generate digit 7")
Controllability (Week 11 concept):
- Manipulate features in latent space without labels
- Example: Adjusting $z$ to add a beard, change age, etc.
- No labeled data needed (unsupervised)
- Controls how features appear (e.g., "make person smile more")
Key difference:
- Conditioning: "Generate a cat" (discrete choice)
- Controllability: "Make this cat fluffier" (continuous manipulation)
Desired case: Uncorrelated features
Original → Add beard → Still male, different age possible
Original → Change age → Still clean-shaven, same gender
Reality: Correlated features (entangled)
Original → Add beard → ALSO becomes more masculine, older
Original → Make feminine → ALSO loses beard, different pose
Root cause: Training data has natural correlations
- Most beards appear on males → beard entangled with gender
- Older people have different features → age entangled with wrinkles, hair color
Why entanglement can be good:
- Preserves realism: Bearded females are rare, model reflects this
- Learned from data distribution
- Contemporary models handle via instruction following (text conditioning)
Goal: Find a direction in latent space along which moving changes a single target attribute
Method 1: Gradient-based (supervised)
Use a pretrained classifier or discriminator:
- Start with latent code $z_0$
- Define the target attribute via a classifier: $y = C(G(z))$
- Compute the gradient: $\frac{\partial y}{\partial z}$
- Update $z$ in the gradient direction (like SGD, but on $z$, not weights): $$z_{t+1} = z_t + \alpha \frac{\partial y}{\partial z}$$
- Generate $x = G(z_{t+1})$ with the enhanced attribute
Example: To increase the "smile" attribute (a minimal sketch follows this list):
- Use a smile classifier $C_{smile}$
- Optimize $z$ to maximize $C_{smile}(G(z))$
- Results in a $z$ that generates a smiling face
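A minimal PyTorch-style sketch of this gradient-based editing loop, assuming differentiable `G` and `classifier` modules (hypothetical names, not from the lecture):

```python
import torch

def edit_latent(G, classifier, z0: torch.Tensor, steps: int = 50, lr: float = 0.05):
    """Gradient ascent on an attribute classifier's score w.r.t. the latent z."""
    z = z0.clone().requires_grad_(True)
    for _ in range(steps):
        score = classifier(G(z)).sum()          # e.g., the "smile" logit
        grad = torch.autograd.grad(score, z)[0]
        with torch.no_grad():
            z += lr * grad                      # move z to increase the attribute
    return z.detach()
```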
Method 2: InterfaceGAN / GANSpace
- Find semantic directions in latent space using labeled examples
- Linear SVM to find decision boundary between attribute classes
- Normal vector to boundary = control direction
Problem: How to objectively measure disentanglement quality?
DCI Framework requires:
- Learned latent representation $\mathbf{c}$ (code) of dimension $D$
- Ground truth generative factors $\mathbf{z}$ of dimension $K$
- Example: 3D shapes with $z$ = [azimuth, elevation, red, green, blue]
Ideal case: If $\mathbf{c}$ is a monomial matrix transformation of $\mathbf{z}$ (a scaled permutation):
- Each $c_i$ is a scaled version of exactly one $z_j$
- If $D > K$: Some dimensions are "dead" (don't capture any factor)
- DCI metrics quantify deviation from this ideal one-to-one mapping
Process:
- Train the model on a synthetic dataset with known factors $\mathbf{z}$
- Extract learned codes $\mathbf{c} = M(x)$ for all samples
- Train $K$ regressors to predict $z_j$ from $\mathbf{c}$: $\hat{z}_j = f_j(\mathbf{c})$
- Extract importance matrix $R \in \mathbb{R}^{D \times K}$
  - $R_{ij}$ = relative importance of $c_i$ in predicting $z_j$
- Compute three metrics:
Measures: Does each code variable $c_i$ capture at most one generative factor?
Formula: $$D_i = 1 - H_K(P_{i\cdot}), \qquad H_K(P_{i\cdot}) = -\sum_{j=1}^{K} P_{ij} \log_K P_{ij}$$
where $P_{ij} = R_{ij} / \sum_k R_{ik}$ is the row-normalized importance of $c_i$ for factor $z_j$
Interpretation:
- $D_i = 1$: $c_i$ perfectly captures a single factor (fully disentangled)
- $D_i = 0$: $c_i$ is equally important for all factors (maximally entangled)
Overall disentanglement: Weighted average across all code dimensions
Measures: Is each generative factor $z_j$ captured by at most one code variable?
Formula: $$C_j = 1 - H_D(\tilde{P}_{\cdot j}), \qquad H_D(\tilde{P}_{\cdot j}) = -\sum_{i=1}^{D} \tilde{P}_{ij} \log_D \tilde{P}_{ij}$$
where $\tilde{P}_{ij} = R_{ij} / \sum_k R_{kj}$ is the column-normalized importance
Interpretation:
- $C_j = 1$: A single $c_i$ captures $z_j$ completely (complete)
- $C_j = 0$: All code variables contribute equally (overcomplete)
Difference from Disentanglement:
- Disentanglement: Row-wise (one code → one factor)
- Completeness: Column-wise (one factor → one code)
Measures: How much information does the code $\mathbf{c}$ capture about each factor $z_j$?
Formula: Prediction error $$I_j = E(z_j, \hat{z}_j) = E(z_j, f_j(\mathbf{c}))$$
where $E$ is a (normalized) prediction error of the regressor $f_j$, e.g., normalized MSE
Key point: Depends on regressor capacity
- Linear regressor: Only captures explicitly represented information
- Overlap with disentanglement metric (better disentanglement → easier linear prediction)
Importance matrix $R$ can be visualized as a grid:
- Rows = code dimensions $c_i$
- Columns = generative factors $z_j$
- Square size = $R_{ij}$ (importance)
Ideal (disentangled):
z1 z2 z3 z4 z5
c1 ■ · · · ·
c2 · ■ · · ·
c3 · · ■ · ·
c4 · · · ■ ·
c5 · · · · ■
Diagonal structure: one-to-one mapping
Entangled example:
z1 z2 z3 z4 z5
c1 ▪ ▫ ▪ · ·
c2 ▫ ▪ ▫ ▪ ·
c3 ▪ ▫ ▪ ▫ ▫
Scattered: many-to-many relationships
Key idea: Increase the weight on the KL term to force disentanglement
Standard VAE loss: $$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,||\, p(z))$$
β-VAE loss: $$\mathcal{L}_{\beta} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \cdot D_{KL}(q(z|x) \,||\, p(z))$$
where $\beta > 1$ up-weights the KL term ($\beta = 1$ recovers the standard VAE)
Why it works (information bottleneck principle):
- A stronger KL penalty forces $q(z|x)$ closer to $N(0,I)$
- Creates an information bottleneck: limited capacity to encode information
- Encoder faces pressure:
  - Reconstruction term wants to encode all information
  - KL term (weighted by β) limits how much can be encoded
- Result: Encoder is forced to be selective and prioritizes the most important factors
- To minimize the loss efficiently, the encoder allocates each $z_i$ to a single most important factor
- Redundant encoding (multiple $z_i$ for the same factor) is penalized
- Encourages independence between latent dimensions
Tradeoff:
- Higher $\beta$ → better disentanglement, worse reconstruction
- Lower $\beta$ → better reconstruction, worse disentanglement
- Need to tune $\beta$ via hyperparameter search or visual inspection
$\beta$ via hyperparameter search or visual inspection
Additional component:
- Linear classifier trained on latent differences to identify target factors
- Used for quantitative evaluation of disentanglement quality
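For concreteness, a minimal PyTorch sketch of the β-VAE loss above, assuming a Gaussian encoder that outputs `mu` and `logvar` (β=1 recovers the standard VAE; names are illustrative):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta: float = 4.0):
    """ELBO with the KL term up-weighted by beta.

    mu, logvar parameterize q(z|x) = N(mu, diag(exp(logvar))).
    """
    # Reconstruction term, -E_q[log p(x|z)] up to constants
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```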
Key idea: Maximize mutual information between subset of latent variables and generated output
Standard GAN objective: $$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1-D(G(z)))]$$
InfoGAN objective: Add a mutual information term
$$\min_{G,Q} \max_D V(D,G) - \lambda I(c; G(z,c))$$
where:
- $z$ = incompressible noise (traditional GAN noise)
- $c$ = latent code we want to be interpretable (categorical or continuous)
- $I(c; G(z,c))$ = mutual information between $c$ and the generated image
- $Q$ = auxiliary network approximating $P(c|x)$ (the posterior)
Mutual Information Lower Bound (tractable):
$$I(c; G(z,c)) \geq \mathbb{E}_{c \sim p(c),\, x \sim G(z,c)}[\log Q(c|x)] + H(c)$$
This is called the variational lower bound - it makes mutual information optimization tractable.
Why this works conceptually:
- Direct mutual information $I(c; G(z,c))$ is intractable to compute
- Instead, maximize the lower bound using the auxiliary network $Q$
- $Q(c|x)$ tries to predict the latent code from the generated image
- If $c$ truly affects generation, $Q$ should be able to recover it
- Forces the generator to use $c$ meaningfully (an information-theoretic constraint)
Q Network Loss Function:
- Categorical $c$ (e.g., digit class): Cross-entropy loss comparing $Q(x)$ predictions to the true $c$
- Continuous $c$ (e.g., rotation): Mean squared error (MSE) or negative log-likelihood (Gaussian)
- Training: Feed $c$ into $G$, generate $x = G(z,c)$, then train $Q$ to predict $c$ from $x$
- Key insight: If $Q$ can accurately reconstruct $c$ from the generated image, mutual information is high
- In practice: $Q$ shares convolutional layers with the discriminator $D$ and adds a small FC head for code prediction
Why split the latent space into $z$ and $c$?
- $c$ (code): Structured, interpretable factors we want to control (digit class, rotation, style)
  - Gets the mutual information loss → forced to be meaningful and recoverable
  - Must be distinct enough that $Q$ can guess it from the image
- $z$ (noise): Incompressible randomness for variation (background texture, lighting details)
  - No constraints → can be entangled and complex
  - Provides diversity: the same $c$ can generate many different images via different $z$
- Without $z$: The generator would be deterministic (same $c$ → identical image every time)
- Without $c$: The generator has no incentive to learn interpretable, controllable factors
In practice:
- Split the latent input: $[z, c]$ where $c$ is structured (e.g., 10 categorical for digit class, 2 continuous for rotation/width)
- Generator: $G(z, c)$
- Discriminator: $D(x)$
- Auxiliary network: $Q(c|x)$ shares parameters with $D$ (a minimal loss sketch follows below)
- The loss encourages: if we know $c$, we should be able to recover it from $G(z,c)$
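A minimal sketch of the Q-head losses described above (cross-entropy for the categorical code, MSE for the continuous code); tensor names and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def q_loss(q_logits, q_mu, c_cat, c_cont):
    """Mutual-information (lower-bound) loss for the auxiliary network Q.

    q_logits: Q's logits for the categorical code, shape (B, n_classes)
    q_mu:     Q's predicted mean for the continuous code, shape (B, n_cont)
    c_cat:    true categorical code fed to G, shape (B,) long tensor
    c_cont:   true continuous code fed to G, shape (B, n_cont)
    """
    loss_cat = F.cross_entropy(q_logits, c_cat)   # recover the discrete code
    loss_cont = F.mse_loss(q_mu, c_cont)          # recover the continuous code
    return loss_cat + loss_cont
```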
Results:
- Unsupervised discovery of interpretable factors
- MNIST: Discovers digit class, rotation, width automatically
- 3D faces: Discovers pose, lighting, expression
- No labels needed!
Comparison β-VAE vs InfoGAN:
| | β-VAE | InfoGAN |
|---|---|---|
| Framework | VAE-based | GAN-based |
| Supervision | Fully unsupervised | Fully unsupervised |
| Method | Stronger KL regularization | Maximize mutual information |
| Training | Stable | Can be unstable (GAN training) |
| Control | All latent dims | Specific code |
| Quality | Lower (higher β hurts reconstruction) | Higher (GAN quality) |
Z-space (input latent space):
- Dimension: 512
- Distribution: $z \sim N(0, I)$
- Problem: Must follow a fixed Gaussian, which leads to entanglement
- Used as input to mapping network
W-space (intermediate latent space):
- Dimension: 512
- Obtained: $w = f(z)$ via the mapping network (8 FC layers)
- Benefit: Free from the distribution constraint → more disentangled
- Each layer in the generator receives the same $w$
S-space (StyleSpace):
- Dimension: 9088 (for 1024×1024 generator with 18 layers)
- Obtained: Affine transformation of $w$ at each layer
- Formula: $s = A(w)$ where $A$ is a learned, layer-specific affine transform
- Most disentangled of the three spaces
Why does disentanglement increase down the hierarchy?
- Z → W (Mapping Network):
  - $z$ is sampled from $N(0,I)$ yet must match the training data distribution
  - Training data has entangled factors (e.g., beard + male)
  - $z$-space must be warped to avoid impossible combinations
  - The mapping network $f$ untangles this warping
- W → S (Layer-specific control):
  - A single $w$ → multiple layer-specific styles $s_i$
  - Each layer controls a different scale: coarse (4×4-16×16), medium (32×32-64×64), fine (128×128-1024×1024)
  - Channel-wise control allows finer-grained manipulation
  - Higher dimensionality (512 → 9088) allows more specific factors
Goal: Identify which of the 9088 style channels control specific attributes
Method (Wu et al., 2021):
- Generate images from a pretrained StyleGAN2
- Compute gradient maps via backpropagation: $$\frac{\partial x}{\partial s_i}$$ where $x$ is the generated image and $s_i$ a specific style channel
- Segment the images into semantic regions (hair, face, background, etc.)
- Measure overlap between gradient maps and semantic regions
- Channels consistently active in a region → those channels control that region
Result: Thousands of localized, disentangled controls
- Channel 6364: Amount of hair
- Channel 12_113: Hubcap style (for cars)
- Channel 8_119: Pillow presence (for bedrooms)
Advantages over W-space:
- More localized: Changes affect smaller regions
- More disentangled: Attribute Dependency metric shows less interference between attributes
- More controls: 9088 dims vs 512 dims
- Layer-specific control: Different layers control different scales (coarse/medium/fine details)
Problem: StyleGAN is trained on random $z$; to edit a real photo, we first need the latent code that reproduces it
Solution 1: Latent Optimization (GAN inversion) - a minimal sketch follows this list
- Start with a real image $x_{real}$
- Randomly initialize a latent code $w$
- Optimize $w$ to minimize: $||G(w) - x_{real}||^2$
- Iterate until $G(w) \approx x_{real}$
- Manipulate $w$ or $s$ to edit attributes
- Generate the edited image: $x_{edited} = G(w + \Delta w)$
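A minimal PyTorch-style sketch of Solution 1, assuming a differentiable pretrained generator `G` and a 512-dim W-space (illustrative names); real pipelines typically add a perceptual loss such as LPIPS:

```python
import torch

def invert(G, x_real: torch.Tensor, steps: int = 500, lr: float = 0.01):
    """Optimize a latent w so that G(w) reproduces x_real (GAN inversion)."""
    w = torch.randn(1, 512, requires_grad=True)      # random init in W-space
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(w) - x_real) ** 2).mean()         # pixel reconstruction loss
        loss.backward()
        opt.step()
    return w.detach()
```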
Solution 2: Encoder-based Inversion
- Train an encoder $E$ to predict $w$ from real images: $E: x \to w$
- For a new image: $w = E(x_{real})$
- Manipulate and generate: $x_{edited} = G(w + \Delta w)$
- Faster than optimization (single forward pass)
Typical pipeline:
Real image → Encoder/Optimization → w or s
↓
Manipulate specific channels
↓
Generator → Edited image
Conceptual Questions:
- "What is the difference between conditioning and controllability? Give examples."
  - Answer: Conditioning uses explicit labels during training to control the class (cGAN: "generate cat"). Controllability manipulates the latent space to adjust features without labels (StyleGAN: "make person smile more"). Conditioning = what to generate; controllability = how features appear.
- "Why is the intermediate latent space W in StyleGAN more disentangled than Z?"
  - Answer: Z must follow a fixed Gaussian distribution matching the training data, which has entangled factors (beard + male correlation). W is free from this constraint - the mapping network can untangle the warped distribution. Training encourages disentanglement because it's easier to generate from a disentangled representation.
- "Explain why higher β in β-VAE leads to better disentanglement."
  - Answer: Higher β creates a stronger information bottleneck via the KL penalty. The encoder is forced to be selective about what information to encode. To minimize the loss, the encoder allocates each dimension to the most important factor, encouraging independence. Tradeoff: worse reconstruction quality.
- "What do the three DCI metrics measure? How are they different?"
  - Answer:
    - Disentanglement: Does each code variable control at most one factor? (row-wise in the importance matrix)
    - Completeness: Is each factor controlled by at most one code variable? (column-wise)
    - Informativeness: How much information does the code capture? (prediction error)
- "How does InfoGAN achieve disentanglement without labels?"
  - Answer: It maximizes mutual information between the latent code c and the generated image G(z,c). This forces the generator to use c meaningfully - if we know c, we should be able to recover it from the output. The auxiliary network Q learns the inverse mapping. Interpretable factors are discovered automatically.
Comparison Questions:
- "Compare β-VAE and InfoGAN for learning disentangled representations."
- See table in Section 18
Calculation/Application Questions:
- "Given importance matrix R (3×2), calculate disentanglement score for c₁."
R = [[0.8, 0.1], [0.1, 0.7], [0.1, 0.2]]- Normalize row 1: P₁ = [0.8/0.9, 0.1/0.9] = [0.89, 0.11]
- Entropy: H(P₁) = -0.89 log(0.89) - 0.11 log(0.11) ≈ 0.50
- D₁ = 1 - 0.50/log(2) = 1 - 0.50/0.69 ≈ 0.28
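A short NumPy check of this calculation (and the other rows), using the row-entropy formula from the DCI section:

```python
import numpy as np

def disentanglement_scores(R: np.ndarray) -> np.ndarray:
    """Per-code-dimension DCI disentanglement: D_i = 1 - H_K(P_i)."""
    K = R.shape[1]
    P = R / R.sum(axis=1, keepdims=True)           # normalize each row
    H = -(P * np.log(np.clip(P, 1e-12, None))).sum(axis=1)
    return 1.0 - H / np.log(K)                     # entropy normalized by log K

R = np.array([[0.8, 0.1], [0.1, 0.7], [0.1, 0.2]])
print(disentanglement_scores(R))   # approx [0.50, 0.46, 0.08]
```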
Theoretical Understanding Questions:
- "Explain why InfoGAN uses a variational lower bound instead of directly maximizing mutual information."
  - Answer: Direct mutual information $I(c; G(z,c))$ requires computing $P(c|x)$, which is intractable (it requires marginalizing over all possible $c$). Instead, use an auxiliary network $Q(c|x)$ to approximate the posterior and maximize the variational lower bound: $I(c;G(z,c)) \geq \mathbb{E}[\log Q(c|x)] + H(c)$. This makes optimization tractable via gradient descent while still encouraging the generator to use the latent code meaningfully.
- "In the DCI framework, what does it mean for c to be a 'monomial matrix transformation' of z?"
  - Answer: Perfect disentanglement, where each learned dimension $c_i$ is a scaled/permuted version of exactly one ground-truth factor $z_j$ - a one-to-one mapping. Example: If $z = [age, gender]$, an ideal $c = [2 \cdot gender, 5 \cdot age]$ (a scaled permutation). DCI metrics measure how close the learned representation is to this ideal.
Practical Questions:
- "You want to edit a real photograph using StyleGAN. Outline the steps."
- Answer:
- Invert image to latent code (optimization or encoder)
- Identify style channel controlling desired attribute (gradient-based or pretrained classifier)
- Manipulate that channel: s' = s + α·direction
- Generate edited image: x' = G(s')
- Verify change is localized and disentangled
- "A researcher trains β-VAE with β=1, β=4, and β=10. For each model, they compute DCI metrics. Predict the pattern of results and explain."
- Answer:
- β=1 (standard VAE): Low D, low C, high I. Entangled but captures lots of info. Good reconstruction.
- β=4: Medium D, medium C, medium I. Balanced tradeoff. Some disentanglement emerging.
- β=10: High D, high C, low I. Best disentanglement but information bottleneck too tight - loses details, poor reconstruction.
- Pattern: As β↑, disentanglement (D,C)↑ but informativeness (I)↓. Stronger KL penalty forces selectivity but sacrifices information capacity.
- "How does StyleSpace (S) discovery use gradients to find controllable directions? Why does this work?"
- Answer:
- For each style channel $s_i$, compute the gradient map $\frac{\partial x}{\partial s_i}$ showing which pixels change when $s_i$ changes
- Measure overlap between gradient maps and regions
- Channels with high gradient overlap in specific region → control that region
- Why it works: Backpropagation reveals causal relationship between style channel and image regions. High gradient = high sensitivity. Consistent gradients in one region = localized, disentangled control.
- For each style channel
Disentanglement is about:
- Independence of latent dimensions
- One dimension → one semantic factor
- Linear interpolation produces meaningful changes
Methods to achieve it:
- β-VAE: Stronger KL penalty (information bottleneck)
- InfoGAN: Maximize mutual information (enforce interpretability)
- StyleGAN: Mapping network + hierarchical spaces (Z→W→S)
Measuring it:
- DCI metrics: Quantitative evaluation when ground truth available
- Visual inspection: Qualitative check of interpolations
- Attribute dependency: How much changing one affects others
Why it matters:
- Controllability and interpretability
- Better generalization and editing
- Foundation for instruction-following models
Univariate Gaussian: $N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Standard Normal: $N(0, 1)$, i.e., $\mu = 0$, $\sigma^2 = 1$
Multivariate Gaussian: $N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$
Isotropic Gaussian (same variance in all dimensions): $\boldsymbol{\Sigma} = \sigma^2 I$
Power Functions: $\int x^n \, dx = \frac{x^{n+1}}{n+1} + C$ (for $n \neq -1$)
Logarithmic: $\int \frac{1}{x} \, dx = \ln|x| + C$
Exponential: $\int e^{ax} \, dx = \frac{1}{a} e^{ax} + C$
Definite Integrals (useful for normalization checks): $\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}$, $\int_{-\infty}^{\infty} e^{-(x-\mu)^2 / (2\sigma^2)} \, dx = \sqrt{2\pi\sigma^2}$
Trigonometric (less common in this course but useful): $\int \sin x \, dx = -\cos x + C$, $\int \cos x \, dx = \sin x + C$
Useful identities: integration by parts, $\int u \, dv = uv - \int v \, du$
Power rule: $\frac{d}{dx} x^n = n x^{n-1}$
Exponential: $\frac{d}{dx} e^{ax} = a e^{ax}$
Logarithmic: $\frac{d}{dx} \ln x = \frac{1}{x}$
Chain rule: $\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$
Product rule: $\frac{d}{dx}[f(x) g(x)] = f'(x) g(x) + f(x) g'(x)$
$\det(AB) = \det(A)\det(B)$ $\det(A^{-1}) = \frac{1}{\det(A)}$ $\det(A^T) = \det(A)$ - For diagonal matrix:
$\det(D) = \prod_i d_{ii}$ - For triangular matrix:
$\det(T) = \prod_i t_{ii}$ (diagonal elements)
Log properties:
$\log(ab) = \log(a) + \log(b)$ $\log(a/b) = \log(a) - \log(b)$ $\log(a^b) = b\log(a)$ -
$\log|\det J| = \text{tr}(\log J)$ when$J$ is positive definite
What it does: EM is an iterative algorithm used to find Maximum Likelihood Estimates (MLE) for models with latent (hidden) variables (like GMMs, where we don't know which Gaussian cluster a point belongs to). It alternates between "guessing the missing data" and "updating the model".
Pseudo-code:
# 1. Initialize parameters θ (e.g., means μ, covariances Σ, mixing weights π) randomly
theta = initialize_randomly()
repeat until convergence:
# --- E-Step (Expectation) ---
# "Fill in the blanks": Estimate the probability of the latent variables
# given the current parameters.
# Example (GMM): Calculate 'responsibility' (prob that point x_i belongs to cluster k)
responsibilities = calculate_probabilities(data, theta)
# --- M-Step (Maximization) ---
# "Update the rules": Re-calculate parameters θ to maximize likelihood,
# assuming the E-step guesses are correct.
# Example (GMM): Update means μ based on weighted average of data points
theta = update_parameters(data, responsibilities)
return theta
Key Application: Training Gaussian Mixture Models (GMMs)
- GMMs approximate probability distribution as weighted sum of Gaussians
- Each Gaussian represents a cluster in the data
- Weights represent relative importance of each cluster
- EM estimates cluster means, covariances, and mixing weights
Connection to MLE:
- EM maximizes the likelihood $P(X|\theta)$ when direct optimization is intractable
- E-step: Compute the expected log-likelihood w.r.t. the latent variables
- M-step: Find $\theta$ that maximizes this expected log-likelihood
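Fleshing out the pseudo-code above, a minimal runnable NumPy/SciPy sketch of EM for a GMM; the initialization, regularization, and fixed iteration count are simplifications:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X: np.ndarray, K: int, n_iter: int = 100, seed: int = 0):
    """Minimal EM for a Gaussian Mixture Model on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                     # mixing weights
    mu = X[rng.choice(N, K, replace=False)]      # init means from data points
    sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = P(cluster k | x_i, theta)
        r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, sigma
```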
Exam-style question: "Why can't we directly maximize likelihood in GMMs? How does EM solve this?"
Answer: In GMMs, we don't know which Gaussian generated each data point (latent cluster assignment). Direct maximization would require enumerating all possible assignments (exponential complexity). EM iteratively: (1) guesses cluster assignments given current parameters (E-step), (2) updates parameters assuming those assignments (M-step). This alternating optimization is tractable and converges to local maximum.