shinnku's Blog

Diffusion Probabilistic Models

By shinnku on Nov 28, 2023
Generated samples on CelebA-HQ 256 × 256


The original paper covers everything in detail, so these are just casual notes. P.S. [paper link]

Prior Methods

VAE:

$$
\begin{aligned}
L_{\mathrm{VAE}}(\theta, \phi) & =-\log p_\theta(\mathbf{x})+D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x})\right) \\
& =-\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})} \log p_\theta(\mathbf{x} \mid \mathbf{z})+D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z})\right) \\
\theta^*, \phi^* & =\arg \min _{\theta, \phi} L_{\mathrm{VAE}} \\
-L_{\mathrm{VAE}} & =\log p_\theta(\mathbf{x})-D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x})\right) \leq \log p_\theta(\mathbf{x})
\end{aligned}
$$

GAN:

$$
\begin{gathered}
\min _G \max _D L(D, G)=\mathbb{E}_{x \sim p_r(x)}[\log D(x)]+\mathbb{E}_{z \sim p_z(z)}[\log (1-D(G(z)))] \\
=\mathbb{E}_{x \sim p_r(x)}[\log D(x)]+\mathbb{E}_{x \sim p_g(x)}[\log (1-D(x))] \\
L\left(G, D^*\right)=2 D_{JS}\left(p_r \| p_g\right)-2 \log 2
\end{gathered}
$$
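The last line of the GAN objective can be checked numerically on a toy discrete example. This is only a sketch, not something from the paper: the distributions `p_r` and `p_g` are made-up values, and it assumes the standard optimal discriminator $D^*(x)=p_r(x)/(p_r(x)+p_g(x))$.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value_at_optimal_d(p_r, p_g):
    """E_{p_r}[log D*(x)] + E_{p_g}[log(1 - D*(x))] with D* = p_r / (p_r + p_g)."""
    total = 0.0
    for pr, pg in zip(p_r, p_g):
        d_star = pr / (pr + pg)
        total += pr * math.log(d_star) + pg * math.log(1.0 - d_star)
    return total

# Made-up "real" and "generated" distributions over three outcomes.
p_r = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

lhs = gan_value_at_optimal_d(p_r, p_g)
rhs = 2 * js(p_r, p_g) - 2 * math.log(2)
print(lhs, rhs)  # the two values should agree up to floating-point error
```

When the generator matches the data exactly, the JS term vanishes and the value drops to $-2\log 2$.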

Abstract

Here is the paper's abstract (copy and paste, copy and paste):

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.

Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.

On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.

Our implementation is available at https://github.com/hojonathanho/diffusion.

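Honestly, I could barely follow this abstract, but that doesn't seem to matter much. The whole paper is packed with probability calculations: it opens with a parameterized Markov chain and its transition distributions, then moves on to variational inference, then maximum likelihood. It is a little intimidating.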

Background

Diffusion models are latent variable models of the form $p_\theta\left(\mathbf{x}_0\right):=\int p_\theta\left(\mathbf{x}_{0: T}\right) d \mathbf{x}_{1: T}$, where $\mathbf{x}_1, \ldots, \mathbf{x}_T$ are latents of the same dimensionality as the data $\mathbf{x}_0 \sim q\left(\mathbf{x}_0\right)$.

The initial state $\mathbf{x}_0$ is the original, noise-free data, while the final state $\mathbf{x}_T$ is close to pure noise.

Reverse Process

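$p_\theta\left(\mathbf{x}_{0: T}\right)$ is the joint distribution over every state from the initial state $\mathbf{x}_0$ to the final state $\mathbf{x}_T$. The parameters $\theta$ are learned during training and define the transition probabilities from one state to the next. This is called the reverse process: it starts at $T$ and ends at $0$.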

The joint distribution $p_\theta\left(\mathbf{x}_{0: T}\right)$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p\left(\mathbf{x}_T\right)=\mathcal{N}\left(\mathbf{x}_T ; \mathbf{0}, \mathbf{I}\right)$:

$$
\begin{aligned}
&p_\theta\left(\mathbf{x}_{0: T}\right):=p\left(\mathbf{x}_T\right) \prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) \\
&p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right)
\end{aligned}
$$

Forward Process

Heh, another Markov chain: $q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)$. This process controls how noise is added through a sequence of variances $\beta_1, \ldots, \beta_T$. Each $\beta_t$ sets the amount of noise added at timestep $t$.

What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)$, called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \ldots, \beta_T$:

$$
\begin{aligned}
&q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right):=\prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) \\
&q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right):=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)
\end{aligned}
$$
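To get a feel for the forward process, here is a minimal scalar sketch of the chain. The linear schedule from `1e-4` to `0.02` over 1000 steps and the helper names (`linear_beta_schedule`, `forward_step`, `chain_moments`) are assumptions made for illustration; the definition above only requires some variance schedule $\beta_1, \ldots, \beta_T$.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule; index 0 holds beta_1."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, beta_t, rng):
    """Draw scalar x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t)."""
    noise = rng.gauss(0.0, 1.0)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Run the whole Markov chain and return [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

def chain_moments(x0, betas):
    """Exact mean and variance of x_T given x_0, propagated step by step."""
    mean, var = x0, 0.0
    for beta_t in betas:
        mean = math.sqrt(1.0 - beta_t) * mean
        var = (1.0 - beta_t) * var + beta_t
    return mean, var

betas = linear_beta_schedule(1000)
trajectory = forward_chain(1.0, betas, random.Random(0))
mean_T, var_T = chain_moments(1.0, betas)
print(mean_T, var_T)  # mean close to 0 and variance close to 1
```

Under this assumed schedule, $\mathbf{x}_T$ given $\mathbf{x}_0$ is almost a standard normal, which is why the reverse process can start from $\mathcal{N}(\mathbf{0}, \mathbf{I})$.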

Training

Training is performed by optimizing the usual variational bound on negative log likelihood:

$$
\begin{aligned}
\mathbb{E}\left[-\log p_\theta\left(\mathbf{x}_0\right)\right] & \leq \mathbb{E}_q\left[-\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] \\
& =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\
& =: L
\end{aligned}
$$

Generated by ChatGPT-4, verified by shinnku:

  1. Decomposing the log likelihood: first, consider the log likelihood $\log p_\theta\left(\mathbf{x}_0\right)$ of the data $\mathbf{x}_0$. Marginalizing over the latents and then multiplying and dividing by $q$, it can be written as:

    $$
    \begin{aligned}
    \log p_\theta\left(\mathbf{x}_0\right) &=\log \int p_\theta\left(\mathbf{x}_{0: T}\right) d \mathbf{x}_{1: T} \\
    &=\log \int \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)} q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) d \mathbf{x}_{1: T}
    \end{aligned}
    $$

    where $p_\theta\left(\mathbf{x}_{0: T}\right)$ is the reverse process.

  2. Applying Jensen's inequality: because the logarithm is concave, Jensen's inequality lets us move the log from outside the integral to inside it, at the cost of turning the equality into a lower bound:

    $$
    \log p_\theta\left(\mathbf{x}_0\right) \geq \int q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) \log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)} d \mathbf{x}_{1: T}
    $$
  3. Expectation form: writing this as an expectation (and also averaging over $\mathbf{x}_0 \sim q\left(\mathbf{x}_0\right)$):

    $$
    \mathbb{E}_q\left[\log p_\theta\left(\mathbf{x}_0\right)\right] \geq \mathbb{E}_q\left[\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] = -L
    $$
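The Jensen step is easy to sanity-check on a tiny discrete model. In the sketch below everything is made up for illustration: a single binary latent `z` stands in for $\mathbf{x}_{1:T}$, and `joint` holds hypothetical values of $p_\theta(\mathbf{x}_0, z)$ for one fixed observation.

```python
import math

def log_evidence(joint):
    """log p(x) = log sum_z p(x, z), for one fixed observation x."""
    return math.log(sum(joint))

def elbo(joint, q):
    """E_q[log p(x, z) / q(z)], the Jensen lower bound on log p(x)."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0)

# Hypothetical joint values p(x, z=0) and p(x, z=1).
joint = [0.10, 0.30]

# An arbitrary approximate posterior and the exact posterior p(z | x).
q_guess = [0.5, 0.5]
posterior = [pz / sum(joint) for pz in joint]

print(log_evidence(joint), elbo(joint, q_guess), elbo(joint, posterior))
```

The bound holds for any `q`, and it becomes tight exactly when `q` equals the true posterior; the gap is the KL divergence between the two.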

The forward process variances $\beta_t$ can be learned by reparameterization or held constant as hyperparameters, and expressiveness of the reverse process is ensured in part by the choice of Gaussian conditionals in $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$, because both processes have the same functional form when $\beta_t$ are small.

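  1. Negative log likelihood:
    • The goal of training a diffusion model is to minimize the negative log likelihood, which is done by optimizing the variational bound.
  2. Variational bound:
    • The variational bound $L$ is an upper bound on the negative log likelihood.
    • It has two parts: the expectation of $-\log p(\mathbf{x}_T)$, and the expectation of the sum over $t$ of the negative log ratio between $p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_{t})$ and $q(\mathbf{x}_{t} \mid \mathbf{x}_{t-1})$.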

A notable property of the forward process is that it admits sampling $\mathbf{x}_t$ at an arbitrary timestep $t$ in closed form: using the notation $\alpha_t:=1-\beta_t$ and $\bar{\alpha}_t:=\prod_{s=1}^t \alpha_s$, we have

$$
q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)
$$
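The closed form means nobody has to run the chain step by step during training. Here is a small scalar sketch that draws $\mathbf{x}_t$ in one shot and compares the closed-form moments with the ones obtained by iterating the single-step transitions. The linear schedule and the function names are assumptions for illustration only.

```python
import math
import random

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule; index 0 holds beta_1."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """abar_t = prod_{s<=t} (1 - beta_s); index 0 holds abar_1."""
    out, running = [], 1.0
    for beta_t in betas:
        running *= 1.0 - beta_t
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw scalar x_t ~ q(x_t | x_0) directly (t is 1-indexed); also return eps."""
    eps = rng.gauss(0.0, 1.0)
    a = abar[t - 1]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps, eps

def iterated_moments(x0, betas, t):
    """Mean and variance of x_t given x_0, obtained by chaining t single steps."""
    mean, var = x0, 0.0
    for beta_s in betas[:t]:
        mean = math.sqrt(1.0 - beta_s) * mean
        var = (1.0 - beta_s) * var + beta_s
    return mean, var

betas = linear_betas()
abar = alpha_bars(betas)
t = 500
x_t, eps = q_sample(2.0, t, abar, random.Random(0))
mean_iter, var_iter = iterated_moments(2.0, betas, t)
print(mean_iter, math.sqrt(abar[t - 1]) * 2.0)  # both means agree
print(var_iter, 1.0 - abar[t - 1])              # both variances agree
```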

Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction by rewriting $L$ (3) as:

$$
\mathbb{E}_q[\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t>1} \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)}_{L_{t-1}} \underbrace{-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}_{L_0}]
$$

Below is a derivation of the expression above, the reduced-variance variational bound for diffusion models. This material is from Sohl-Dickstein et al.:

$$
\begin{aligned}
L & =\mathbb{E}_q\left[-\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] \\
& =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\
& =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}-\log \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)}\right] \\
& =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)} \cdot \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}-\log \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)}\right] \\
& =\mathbb{E}_q\left[-\log \frac{p\left(\mathbf{x}_T\right)}{q\left(\mathbf{x}_T \mid \mathbf{x}_0\right)}-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)}-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right] \\
& =\mathbb{E}_q\left[D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)+\sum_{t>1} D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right]
\end{aligned}
$$

The following is an alternate version of $L$. It is not tractable to estimate, but it is useful for our discussion in Section 4.3.

$$
\begin{aligned}
L & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\
& =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)} \cdot \frac{q\left(\mathbf{x}_{t-1}\right)}{q\left(\mathbf{x}_t\right)}\right] \\
& =\mathbb{E}_q\left[-\log \frac{p\left(\mathbf{x}_T\right)}{q\left(\mathbf{x}_T\right)}-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}-\log q\left(\mathbf{x}_0\right)\right] \\
& =D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T\right) \| p\left(\mathbf{x}_T\right)\right)+\mathbb{E}_q\left[\sum_{t \geq 1} D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)\right]+H\left(\mathbf{x}_0\right)
\end{aligned}
$$

Equation (5) uses KL divergence to directly compare $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ against forward process posteriors, which are tractable when conditioned on $\mathbf{x}_0$:

$$
\begin{aligned}
q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) & =\mathcal{N}\left(\mathbf{x}_{t-1} ; \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right), \tilde{\beta}_t \mathbf{I}\right) \\
\text { where } \quad \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) & :=\frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t \quad \text { and } \quad \tilde{\beta}_t:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t
\end{aligned}
$$
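The posterior coefficients are straightforward to tabulate once the schedule is fixed. The sketch below assumes a linear schedule and the convention $\bar{\alpha}_0 = 1$ (an empty product), and the helper name `posterior_params` is hypothetical. It also cross-checks $\tilde{\beta}_t$ against the usual "precisions add" rule for multiplying two Gaussians, $1/\tilde{\beta}_t = \alpha_t/\beta_t + 1/(1-\bar{\alpha}_{t-1})$.

```python
import math

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear betas plus cumulative products abar_t (index 0 is t = 1)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    abar, running = [], 1.0
    for beta_t in betas:
        running *= 1.0 - beta_t
        abar.append(running)
    return betas, abar

def posterior_params(t, betas, abar):
    """Return (coef_x0, coef_xt, beta_tilde) of q(x_{t-1} | x_t, x_0), t 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0  # convention: abar_0 = 1
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0, coef_xt, beta_tilde

betas, abar = make_schedule()
first = posterior_params(1, betas, abar)
middle = posterior_params(500, betas, abar)
print(first)   # at t = 1 the posterior collapses onto x_0
print(middle)
```

At $t=1$ the posterior puts all of its weight on $\mathbf{x}_0$ and has zero variance, which matches the intuition that $\mathbf{x}_0$ is fully known once we condition on it.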

Consequently, all KL divergences in Eq. (5) are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed form expressions instead of high variance Monte Carlo estimates.
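For reference, the closed form in question is the KL divergence between two Gaussians. A minimal sketch for the isotropic case follows; the function name and the example numbers are made up. With the variances held fixed, the only part that depends on the means is the squared distance divided by twice the second variance.

```python
import math

def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q * I) || N(mu_p, var_p * I) ) for vectors given as lists."""
    dim = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    variance_part = 0.5 * dim * (math.log(var_p / var_q) + var_q / var_p - 1.0)
    mean_part = sq_dist / (2.0 * var_p)
    return variance_part + mean_part

# Made-up example: same variances, different means.
mu_q = [0.2, -0.1, 0.4]
mu_p = [0.0, 0.0, 0.5]
kl_full = kl_isotropic_gaussians(mu_q, 0.01, mu_p, 0.02)
kl_same_mean = kl_isotropic_gaussians(mu_q, 0.01, mu_q, 0.02)
sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
print(kl_full - kl_same_mean, sq_dist / (2.0 * 0.02))  # the two numbers agree
```

No sampling is involved, which is the point: each KL term is evaluated exactly instead of being estimated from noisy Monte Carlo draws.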

3. Diffusion models and denoising autoencoders

Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances $\beta_t$ of the forward process and the model architecture and Gaussian distribution parameterization of the reverse process. To guide our choices, we establish a new explicit connection between diffusion models and denoising score matching (Section 3.2) that leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4). Ultimately, our model design is justified by simplicity and empirical results (Section 4). Our discussion is categorized by the terms of Eq. (5).

3.1 Forward process and $L_T$

We ignore the fact that the forward process variances $\beta_t$ are learnable by reparameterization and instead fix them to constants (see Section 4 for details). Thus, in our implementation, the approximate posterior $q$ has no learnable parameters, so $L_T$ is a constant during training and can be ignored.

3.2 Reverse process and $L_{1: T-1}$

Now we discuss our choices in $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right)$ for $1<t \leq T$. First, we set $\boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)=\sigma_t^2 \mathbf{I}$ to untrained time dependent constants. Experimentally, both $\sigma_t^2=\beta_t$ and $\sigma_t^2=\tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ had similar results. The first choice is optimal for $\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and the second is optimal for $\mathbf{x}_0$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance.

Second, to represent the mean $\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)$, we propose a specific parameterization motivated by the following analysis of $L_t$. With $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \sigma_t^2 \mathbf{I}\right)$, we can write:

$$
L_{t-1}=\mathbb{E}_q\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right]+C
$$

where $C$ is a constant that does not depend on $\theta$. So, we see that the most straightforward parameterization of $\boldsymbol{\mu}_\theta$ is a model that predicts $\tilde{\boldsymbol{\mu}}_t$, the forward process posterior mean. However, we can expand Eq. (8) further by reparameterizing Eq. (4) as $\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)=\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applying the forward process posterior formula (7):

$$
\begin{aligned}
L_{t-1}-C & =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)-\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}\right)\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), t\right)\right\|^2\right] \\
& =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2 \sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), t\right)\right\|^2\right]
\end{aligned}
$$
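The step from the first line to the second is pure algebra on the posterior mean, and it can be spot-checked with numbers. The sketch below assumes a linear schedule, scalar data, and arbitrary made-up values for `x0`, `eps`, and `t`. It evaluates $\tilde{\boldsymbol{\mu}}_t$ from the posterior formula with $\mathbf{x}_0$ recovered from $(\mathbf{x}_t, \boldsymbol{\epsilon})$, then compares it with the simplified expression.

```python
import math

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear betas plus cumulative products abar_t (index 0 is t = 1)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    abar, running = [], 1.0
    for beta_t in betas:
        running *= 1.0 - beta_t
        abar.append(running)
    return betas, abar

def posterior_mean(x_t, x0, t, betas, abar):
    """mu_tilde_t(x_t, x_0) from the forward process posterior (t is 1-indexed)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0 + coef_xt * x_t

def simplified_mean(x_t, eps, t, betas, abar):
    """(x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)."""
    beta_t = betas[t - 1]
    return (x_t - beta_t / math.sqrt(1.0 - abar[t - 1]) * eps) / math.sqrt(1.0 - beta_t)

betas, abar = make_schedule()
t, x0, eps = 300, 0.8, -1.2  # arbitrary illustration values
x_t = math.sqrt(abar[t - 1]) * x0 + math.sqrt(1.0 - abar[t - 1]) * eps
x0_recovered = (x_t - math.sqrt(1.0 - abar[t - 1]) * eps) / math.sqrt(abar[t - 1])

via_posterior = posterior_mean(x_t, x0_recovered, t, betas, abar)
via_simplified = simplified_mean(x_t, eps, t, betas, abar)
print(via_posterior, via_simplified)  # identical up to rounding
```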

Equation (10) reveals that $\boldsymbol{\mu}_\theta$ must predict $\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right)$ given $\mathbf{x}_t$. Since $\mathbf{x}_t$ is available as input to the model, we may choose the parameterization

$$
\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)=\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t\right)\right)\right)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)
$$

where $\boldsymbol{\epsilon}_\theta$ is a function approximator intended to predict $\boldsymbol{\epsilon}$ from $\mathbf{x}_t$. To sample $\mathbf{x}_{t-1} \sim p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ is to compute $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z}$, where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with $\boldsymbol{\epsilon}_\theta$ as a learned gradient of the data density. Furthermore, with the parameterization (11), Eq. (10) simplifies to:

$$
\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t\left(1-\bar{\alpha}_t\right)}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, t\right)\right\|^2\right]
$$

which resembles denoising score matching over multiple noise scales indexed by $t$ [55]. As Eq. (12) is equal to (one term of) the variational bound for the Langevin-like reverse process (11), we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.
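Here is a rough scalar sketch of the sampling chain built from the update rule for $\mathbf{x}_{t-1}$ quoted earlier. Several things are assumptions for illustration: the linear schedule, the choice $\sigma_t^2=\tilde{\beta}_t$, skipping the noise at the final step, and above all the "oracle" noise predictor, which stands in for a trained network and is exact only for toy data concentrated on a single point `c`.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear betas plus cumulative products abar_t (index 0 is t = 1)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    abar, running = [], 1.0
    for beta_t in betas:
        running *= 1.0 - beta_t
        abar.append(running)
    return betas, abar

def reverse_step(x_t, t, eps_model, betas, abar, rng):
    """One draw x_{t-1} ~ p_theta(x_{t-1} | x_t) with sigma_t^2 = beta_tilde_t."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0
    mean = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_model(x_t, t)) / math.sqrt(alpha_t)
    sigma_t = math.sqrt((1.0 - abar_prev) / (1.0 - abar_t) * beta_t)
    z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
    return mean + sigma_t * z

def sample(eps_model, betas, abar, rng):
    """Start from x_T ~ N(0, 1) and walk the chain down to x_0."""
    x = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        x = reverse_step(x, t, eps_model, betas, abar, rng)
    return x

def make_point_mass_oracle(c, abar):
    """Hypothetical stand-in for a trained eps_theta: exact when all data equals c."""
    def eps_model(x_t, t):
        a = abar[t - 1]
        return (x_t - math.sqrt(a) * c) / math.sqrt(1.0 - a)
    return eps_model

betas, abar = make_schedule()
x0_sample = sample(make_point_mass_oracle(0.7, abar), betas, abar, random.Random(0))
print(x0_sample)  # lands on the single data point, 0.7
```

With a real dataset the oracle is replaced by a neural network, but the loop itself keeps the same shape.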

To summarize, we can train the reverse process mean function approximator $\boldsymbol{\mu}_\theta$ to predict $\tilde{\boldsymbol{\mu}}_t$, or by modifying its parameterization, we can train it to predict $\boldsymbol{\epsilon}$. (There is also the possibility of predicting $\mathbf{x}_0$, but we found this to lead to worse sample quality early in our experiments.) We have shown that the $\boldsymbol{\epsilon}$-prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model's variational bound to an objective that resembles denoising score matching. Nonetheless, it is just another parameterization of $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$, so we verify its effectiveness in Section 4 in an ablation where we compare predicting $\boldsymbol{\epsilon}$ against predicting $\tilde{\boldsymbol{\mu}}_t$.
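To make the "just another parameterization" point concrete, the sketch below computes the same one-sample loss term in two ways: in $\boldsymbol{\epsilon}$ space with the weight $\beta_t^2 / (2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t))$, and in mean space as $\|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\|^2 / (2\sigma_t^2)$. The linear schedule, the choice $\sigma_t^2=\beta_t$, the scalar data, and the deliberately crude `toy_eps_model` are all assumptions for illustration.

```python
import math

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear betas plus cumulative products abar_t (index 0 is t = 1)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    abar, running = [], 1.0
    for beta_t in betas:
        running *= 1.0 - beta_t
        abar.append(running)
    return betas, abar

def eps_loss_term(x0, eps, t, eps_model, betas, abar):
    """Weighted squared error on the noise, using sigma_t^2 = beta_t."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
    weight = beta_t ** 2 / (2.0 * beta_t * alpha_t * (1.0 - abar_t))
    return weight * (eps - eps_model(x_t, t)) ** 2

def mean_loss_term(x0, eps, t, eps_model, betas, abar):
    """Same quantity in mean space: (mu_tilde - mu_theta)^2 / (2 * sigma_t^2)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0
    x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
    mu_tilde = (math.sqrt(abar_prev) * beta_t * x0
                + math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)
    mu_theta = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_model(x_t, t)) / math.sqrt(alpha_t)
    return (mu_tilde - mu_theta) ** 2 / (2.0 * beta_t)

def toy_eps_model(x_t, t):
    """Hypothetical, untrained stand-in for eps_theta."""
    return 0.5 * x_t

betas, abar = make_schedule()
loss_eps = eps_loss_term(0.8, -1.2, 300, toy_eps_model, betas, abar)
loss_mean = mean_loss_term(0.8, -1.2, 300, toy_eps_model, betas, abar)
print(loss_eps, loss_mean)  # the two parameterizations give the same value
```

In actual training, `x0` would come from the dataset, `eps` from a standard normal, and `t` would be drawn at random for each example.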

Thanks for reading this far! By Shinnku