We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.
Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
Diffusion models are latent variable models of the form $p_\theta(x_0) := \int p_\theta(x_{0:T})\,dx_{1:T}$, where $x_1, \ldots, x_T$ are latents of the same dimensionality as the data $x_0 \sim q(x_0)$.
The joint distribution $p_\theta(x_{0:T})$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})$:
$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big)$$
What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q(x_{1:T} \mid x_0)$, called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \ldots, \beta_T$:
$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t; \sqrt{1 - \beta_t}\,x_{t-1}, \beta_t \mathbf{I}\big)$$
The forward process variances βt can be learned by reparameterization or held constant as hyperparameters, and expressiveness of the reverse process is ensured in part by the choice of Gaussian conditionals in pθ(xt−1∣xt), because both processes have the same functional form when βt are small.
A notable property of the forward process is that it admits sampling $x_t$ at an arbitrary timestep $t$ in closed form: using the notation $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^{t} \alpha_s$, we have
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t; \sqrt{\bar\alpha_t}\,x_0, (1 - \bar\alpha_t)\mathbf{I}\big)$$
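As a concrete illustration, this closed-form marginal can be sketched in NumPy. The linear $\beta$ schedule and the helper names below are assumptions for this example, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical linear variance schedule beta_1, ..., beta_T; the exact
# schedule is a design choice, not fixed by the derivation above.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                 # alpha_t := 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t := prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))     # toy "data" batch
xT = q_sample(x0, T, rng)            # alpha_bar_T ~ 0: nearly pure noise
```

Because $\bar\alpha_T$ is close to zero under such a schedule, $x_T$ is approximately distributed as $\mathcal{N}(\mathbf{0}, \mathbf{I})$, matching the prior $p(x_T)$.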
Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction by rewriting $L$ (Eq. 3) as:
$$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]$$
Equation (5) uses KL divergence to directly compare $p_\theta(x_{t-1} \mid x_t)$ against forward process posteriors, which are tractable when conditioned on $x_0$:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1}; \tilde\mu_t(x_t, x_0), \tilde\beta_t \mathbf{I}\big),$$
where
$$\tilde\mu_t(x_t, x_0) := \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\,x_t \quad\text{and}\quad \tilde\beta_t := \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t$$
Consequently, all KL divergences in Eq. (5) are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed form expressions instead of high variance Monte Carlo estimates.
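A minimal sketch of these closed-form quantities follows; the linear $\beta$ schedule is a placeholder assumption, and the diagonal-Gaussian KL formula is the standard closed form, not specific to this paper:

```python
import numpy as np

# Assumed linear schedule, for illustration only.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), for 1-indexed t >= 2."""
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    beta_t = betas[t - 1]
    mean = (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x0 \
         + (np.sqrt(alphas[t - 1]) * (1 - ab_prev) / (1 - ab_t)) * xt
    var = (1 - ab_prev) / (1 - ab_t) * beta_t   # beta_tilde_t
    return mean, var

def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ), summed over coordinates."""
    d = mu_q.size
    return 0.5 * (d * (np.log(var_p / var_q) + var_q / var_p - 1)
                  + np.sum((mu_q - mu_p) ** 2) / var_p)
```

Each $L_{t-1}$ term is then a KL divergence between two Gaussians with known parameters, so no Monte Carlo estimate over $x_{t-1}$ is needed.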
3. Diffusion models and denoising autoencoders
Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances βt of the forward process and the model architecture and Gaussian distribution parameterization of the reverse process. To guide our choices, we establish a new explicit connection between diffusion models and denoising score matching (Section 3.2) that leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4). Ultimately, our model design is justified by simplicity and empirical results (Section 4). Our discussion is categorized by the terms of Eq. (5).
3.1 Forward process and $L_T$
We ignore the fact that the forward process variances βt are learnable by reparameterization and instead fix them to constants (see Section 4 for details). Thus, in our implementation, the approximate posterior q has no learnable parameters, so LT is a constant during training and can be ignored.
3.2 Reverse process and $L_{1:T-1}$
Now we discuss our choices in $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$ for $1 < t \le T$. First, we set $\Sigma_\theta(x_t, t) = \sigma_t^2 \mathbf{I}$ to untrained time dependent constants. Experimentally, both $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t$ had similar results. The first choice is optimal for $x_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and the second is optimal for $x_0$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance.
Second, to represent the mean $\mu_\theta(x_t, t)$, we propose a specific parameterization motivated by the following analysis of $L_t$. With $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})$, we can write:
$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C$$
where $C$ is a constant that does not depend on $\theta$. So, we see that the most straightforward parameterization of $\mu_\theta$ is a model that predicts $\tilde\mu_t$, the forward process posterior mean. However, we can expand Eq. (8) further by reparameterizing Eq. (4) as $x_t(x_0, \epsilon) = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ for $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applying the forward process posterior formula (7):
Equation (10) reveals that $\mu_\theta$ must predict $\frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\epsilon\big)$ given $x_t$. Since $x_t$ is available as input to the model, we may choose the parameterization
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big)$$
where $\epsilon_\theta$ is a function approximator intended to predict $\epsilon$ from $x_t$. To sample $x_{t-1} \sim p_\theta(x_{t-1} \mid x_t)$ is to compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\big) + \sigma_t z$, where $z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with $\epsilon_\theta$ as a learned gradient of the data density. Furthermore, with the parameterization (11), Eq. (10) simplifies to:
$$\mathbb{E}_{x_0, \epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar\alpha_t)}\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, t\big)\big\|^2\right]$$
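Under the $\epsilon$-prediction parameterization, a single ancestral sampling step can be sketched as follows. Here `eps_model` is a hypothetical placeholder for the learned network $\epsilon_\theta$, and the linear $\beta$ schedule is assumed for illustration:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule, for illustration
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_step(eps_model, xt, t, rng):
    """One step x_t -> x_{t-1} of the reverse chain (t is 1-indexed).

    Uses sigma_t^2 = beta_t; no noise is added at the final step t = 1.
    """
    beta_t = betas[t - 1]
    coef = beta_t / np.sqrt(1.0 - alpha_bars[t - 1])
    mean = (xt - coef * eps_model(xt, t)) / np.sqrt(alphas[t - 1])
    if t == 1:
        return mean
    z = rng.standard_normal(xt.shape)
    return mean + np.sqrt(beta_t) * z

# Toy usage with a dummy "network" that always predicts zero noise.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # start from the prior p(x_T)
for t in range(T, 0, -1):
    x = p_sample_step(lambda xt, t: np.zeros_like(xt), x, t, rng)
```

The loop is the skeleton of Algorithm 2; a trained $\epsilon_\theta$ would replace the dummy lambda.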
which resembles denoising score matching over multiple noise scales indexed by t [55]. As Eq. (12) is equal to (one term of) the variational bound for the Langevin-like reverse process (11), we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.
To summarize, we can train the reverse process mean function approximator $\mu_\theta$ to predict $\tilde\mu_t$, or by modifying its parameterization, we can train it to predict $\epsilon$. (There is also the possibility of predicting $x_0$, but we found this to lead to worse sample quality early in our experiments.) We have shown that the $\epsilon$-prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model's variational bound to an objective that resembles denoising score matching. Nonetheless, it is just another parameterization of $p_\theta(x_{t-1} \mid x_t)$, so we verify its effectiveness in Section 4 in an ablation where we compare predicting $\epsilon$ against predicting $\tilde\mu_t$.
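One stochastic term of the resulting training objective can be sketched as follows: draw $t$, draw $\epsilon$, form $x_t$ in closed form, and regress the network output onto $\epsilon$. The zero-predicting lambda is a trivial placeholder, not the paper's U-Net, and the per-$t$ weighting of Eq. (12) is dropped here for simplicity:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule, for illustration
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(eps_model, x0, rng):
    """One Monte Carlo term of an unweighted epsilon-prediction objective:
    || eps - eps_theta(sqrt(ab_t) x0 + sqrt(1 - ab_t) eps, t) ||^2."""
    t = int(rng.integers(1, T + 1))  # uniform timestep in {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t - 1]
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))     # toy "data" batch
loss = training_loss(lambda xt, t: np.zeros_like(xt), x0, rng)
```

Each such term is an unbiased stochastic sample of the objective, so it can be minimized directly with SGD, one random $(t, \epsilon)$ draw per training step.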