In this post we’ll have a look at what’s known as variational inference (VI), a family of approximate Bayesian inference methods. In particular, we will focus on one of the more standard VI methods called Automatic Differentiation Variational Inference (ADVI).
Here we’ll have a look at the theory behind VI, but if you’re interested in how to use ADVI in Turing.jl, check out this tutorial.
Motivation
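In Bayesian inference one usually specifies a model as follows: given data $\{x_i\}_{i=1}^n$,

$$
\begin{align*}
\text{prior:} \quad z &\sim p(z) \\
\text{likelihood:} \quad x_i &\overset{\text{i.i.d.}}{\sim} p(x \mid z) \quad \text{where} \quad i = 1, \dots, n
\end{align*}
$$

where $\overset{\text{i.i.d.}}{\sim}$ denotes that the samples are independent and identically distributed. Our goal in Bayesian inference is then to find the posterior

$$
p(z \mid \{x_i\}_{i=1}^n) \propto p(z) \prod_{i=1}^n p(x_i \mid z).
$$

In general one cannot obtain a closed form expression for $p(z \mid \{x_i\}_{i=1}^n)$, but one might still be able to sample from it with guarantees of converging to the target posterior as the number of samples goes to $\infty$, e.g. MCMC.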
As you are hopefully already aware, Turing.jl provides a lot of different methods with asymptotic exactness guarantees that we can apply to such a problem!
Unfortunately, these unbiased samplers can be prohibitively expensive to run. As the model increases in complexity, the convergence of these unbiased samplers can slow down dramatically. Still, in the infinite limit, these methods should converge to the true posterior! But infinity is fairly large, like, at least more than 12, so this might take a while.
In such a case it might be desirable to sacrifice some of these asymptotic guarantees, and instead approximate the posterior $p(z \mid \{x_i\}_{i=1}^n)$ using some other model which we’ll denote $q(z)$.
There are multiple approaches to take in this case, one of which is variational inference (VI).
Variational Inference (VI)
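In VI, we’re looking to approximate $p(z \mid \{x_i\}_{i=1}^n)$ using some approximate or variational posterior $q(z)$.
To approximate something you need a notion of what "close" means. In the context of probability densities a standard such "measure" of closeness is the Kullback-Leibler (KL) divergence, though this is far from the only one. The KL-divergence is defined between two densities $q(z)$ and $p(z \mid \{x_i\}_{i=1}^n)$ as

$$
\begin{align*}
\mathrm{D_{KL}}\left(q(z), p(z \mid \{x_i\}_{i=1}^n)\right) &= \int \log\left(\frac{q(z)}{p(z \mid \{x_i\}_{i=1}^n)}\right) q(z) \, \mathrm{d}z \\
&= \mathbb{E}_{z \sim q(z)}\left[\log q(z)\right] - \mathbb{E}_{z \sim q(z)}\left[\log p(z \mid \{x_i\}_{i=1}^n)\right].
\end{align*}
$$

It’s worth noting that unfortunately the KL-divergence is not a metric/distance in the analysis-sense due to its lack of symmetry. On the other hand, it turns out that minimizing the KL-divergence is actually equivalent to maximizing the log-likelihood! Also, under reasonable restrictions on the densities at hand,

$$
\mathrm{D_{KL}}\left(q(z), p(z \mid \{x_i\}_{i=1}^n)\right) = 0 \quad \iff \quad q(z) = p(z \mid \{x_i\}_{i=1}^n), \quad \forall z.
$$

Therefore one could (and we will) attempt to approximate $p(z \mid \{x_i\}_{i=1}^n)$ using a density $q(z)$ by minimizing the KL-divergence between these two!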
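To get a feel for the KL-divergence, here is a minimal Python sketch (standard library only). It assumes both densities are univariate Gaussians, which is my choice for illustration and not something the post requires; the function names are hypothetical. It compares the closed-form KL-divergence between two Gaussians with a Monte Carlo estimate of the expectation of the log-density ratio under the first argument.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def kl_normal_closed_form(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q, p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)


def kl_normal_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, num_samples=100_000, seed=0):
    """Estimate E_{z ~ q}[log q(z) - log p(z)] by averaging over samples from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu_q, sigma_q)
        total += normal_logpdf(z, mu_q, sigma_q) - normal_logpdf(z, mu_p, sigma_p)
    return total / num_samples
```

Calling `kl_normal_closed_form(0.0, 1.0, 1.0, 2.0)` and then swapping the two densities gives two different numbers, which is the lack of symmetry mentioned above, while passing the same density twice gives zero.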
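One can also show that $\mathrm{D_{KL}} \ge 0$, which we’ll need later. Finally notice that the KL-divergence is only well-defined when $q(z)$ is zero everywhere $p(z \mid \{x_i\}_{i=1}^n)$ is zero, i.e.

$$
\mathrm{supp}\left(q(z)\right) \subseteq \mathrm{supp}\left(p(z \mid \{x_i\}_{i=1}^n)\right).
$$

Otherwise, there might be a point $z_0$ with $q(z_0) > 0$ such that $p(z_0 \mid \{x_i\}_{i=1}^n) = 0$, resulting in $\log\left(\frac{q(z_0)}{0}\right)$ which doesn’t make sense!
One major problem: as we can see in the definition of the KL-divergence, we need $p(z \mid \{x_i\}_{i=1}^n)$ for any $z$ if we want to compute the KL-divergence between this and $q(z)$. We don’t have that. The entire reason we even do Bayesian inference is that we don’t know the posterior! Clearly this isn’t going to work. Or is it?!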
Computing KL-divergence without knowing the posterior
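First off, recall that

$$
p(z \mid \{x_i\}_{i=1}^n) = \frac{p(\{x_i\}_{i=1}^n, z)}{p(\{x_i\}_{i=1}^n)}
$$

so we can write

$$
\begin{align*}
\mathrm{D_{KL}}\left(q(z), p(z \mid \{x_i\}_{i=1}^n)\right) &= \mathbb{E}_{z \sim q(z)}\left[\log q(z)\right] - \mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n, z) - \log p(\{x_i\}_{i=1}^n)\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log q(z)\right] - \mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n, z)\right] + \mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n)\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log q(z)\right] - \mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n, z)\right] + \log p(\{x_i\}_{i=1}^n),
\end{align*}
$$

where in the last equality we used the fact that $p(\{x_i\}_{i=1}^n)$ is independent of $z$.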
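Now you’re probably thinking "Oh great! Now you’ve introduced $p(\{x_i\}_{i=1}^n)$ which we also can’t compute (in general)!". Woah. Calm down human. Let’s do some more algebra. The above expression can be rearranged to

$$
\underbrace{\log p(\{x_i\}_{i=1}^n)}_{\text{constant}} = \mathrm{D_{KL}}\left(q(z), p(z \mid \{x_i\}_{i=1}^n)\right) + \underbrace{\mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n, z)\right] - \mathbb{E}_{z \sim q(z)}\left[\log q(z)\right]}_{=: \mathrm{ELBO}(q)}.
$$

See? The left-hand side is constant and, as we mentioned before, $\mathrm{D_{KL}} \ge 0$. What happens if we try to maximize the term we just gave the completely arbitrary name $\mathrm{ELBO}$? Well, if $\mathrm{ELBO}$ goes up while $\log p(\{x_i\}_{i=1}^n)$ stays constant then $\mathrm{D_{KL}}$ has to go down! That is, the $q(z)$ which minimizes the KL-divergence is the same $q(z)$ which maximizes $\mathrm{ELBO}(q)$:

$$
\underset{q}{\mathrm{argmin}} \ \mathrm{D_{KL}}\left(q(z), p(z \mid \{x_i\}_{i=1}^n)\right) = \underset{q}{\mathrm{argmax}} \ \mathrm{ELBO}(q)
$$

where

$$
\begin{align*}
\mathrm{ELBO}(q) &:= \mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n, z)\right] - \mathbb{E}_{z \sim q(z)}\left[\log q(z)\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log p(\{x_i\}_{i=1}^n, z)\right] + \mathbb{H}\left(q(z)\right)
\end{align*}
$$

and $\mathbb{H}\left(q(z)\right)$ denotes the (differential) entropy of $q(z)$.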
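Assuming the joint $p(\{x_i\}_{i=1}^n, z)$ and the entropy $\mathbb{H}\left(q(z)\right)$ are both tractable, we can use a Monte Carlo estimate for the remaining expectation. This leaves us with the following tractable expression

$$
\underset{q}{\mathrm{argmin}} \ \mathrm{D_{KL}}\left(q(z), p(z \mid \{x_i\}_{i=1}^n)\right) \approx \underset{q}{\mathrm{argmax}} \ \widehat{\mathrm{ELBO}}(q)
$$

where

$$
\widehat{\mathrm{ELBO}}(q) = \frac{1}{m} \sum_{k=1}^m \log p(\{x_i\}_{i=1}^n, z_k) + \mathbb{H}\left(q(z)\right) \quad \text{where} \quad z_k \sim q(z) \quad \forall k = 1, \dots, m,
$$

and $\log p(\{x_i\}_{i=1}^n, z_k) = \log p(z_k) + \sum_{i=1}^n \log p(x_i \mid z_k)$.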
Hence, as long as we can sample from $q(z)$ somewhat efficiently, we can indeed minimize the KL-divergence! Neat, eh?
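Here is a rough Python sketch of such a Monte Carlo ELBO estimate (standard library only). To keep it checkable it assumes a toy conjugate model, a latent mean with a standard normal prior and unit-variance Gaussian observations, together with a Gaussian variational density; the model, the data and all function names are assumptions made for illustration. For this particular model the log-evidence is available in closed form, so we can see the estimate sitting below it.

```python
import math
import random


def log_joint(xs, m):
    """log p(x_1, ..., x_n, m) for the assumed model m ~ N(0, 1), x_i ~ N(m, 1)."""
    log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * m ** 2
    log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - m) ** 2 for x in xs)
    return log_prior + log_lik


def normal_entropy(sigma):
    """Differential entropy of N(mu, sigma^2); it does not depend on mu."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def elbo_estimate(xs, mu, sigma, num_samples=20_000, seed=0):
    """Average log-joint over samples z_k ~ q = N(mu, sigma^2), plus the entropy of q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        total += log_joint(xs, z)
    return total / num_samples + normal_entropy(sigma)


def log_evidence(xs):
    """Closed-form log p(x_1, ..., x_n); only available because the toy model is conjugate."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(n + 1)
            - 0.5 * (ss - s * s / (n + 1)))
```

With `xs = [1.2, 0.8, 1.9, 1.4, 0.7]` the exact posterior is Gaussian with mean 1 and variance 1/6, so `elbo_estimate(xs, 1.0, 1 / math.sqrt(6))` should land close to `log_evidence(xs)`, while a poorer choice such as `elbo_estimate(xs, 0.0, 1.0)` should come out clearly lower.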
Sidenote: in the case where $q(z)$ is tractable but $\mathbb{H}\left(q(z)\right)$ is not, we can use a Monte Carlo estimate for this term too, but this generally results in a higher-variance estimate.
Also, I fooled you real good: the ELBO isn’t an arbitrary name, hah! In fact it’s an abbreviation for the evidence lower bound (ELBO) because it, uhmm, well, it’s a lower bound on the log-evidence $\log p(\{x_i\}_{i=1}^n)$ (remember $\mathrm{D_{KL}} \ge 0$). Yup.
Maximizing the ELBO
Finding the optimal $q$ over all possible densities of course isn’t feasible. Instead we consider a family of parameterized densities $\mathscr{Q}_{\Theta}$ where $\Theta$ denotes the space of possible parameters. Each density in this family $q_{\theta} \in \mathscr{Q}_{\Theta}$ is parameterized by a unique $\theta \in \Theta$. Moreover, we’ll assume
1. $q_{\theta}(z)$, i.e. evaluating the probability density $q$ at any point $z$, is differentiable
2. $z \sim q_{\theta}(z)$, i.e. the process of sampling from $q_{\theta}(z)$, is differentiable
We’re going to make use of a particular approach to differentiable sampling which goes under a bunch of different names: reparametrization trick, path derivative, etc. This refers to making the assumption that all elements $q_{\theta} \in \mathscr{Q}_{\Theta}$ can be considered as reparameterizations of some base density, say $\bar{q}(z)$. That is, if $\bar{z} \sim \bar{q}(z)$ then

$$
z \sim q_{\theta}(z) \quad \iff \quad z := g_{\theta}(\bar{z})
$$

for some function $g_{\theta}$ differentiable wrt. $\theta$. So all $q_{\theta} \in \mathscr{Q}_{\Theta}$ are using the same reparameterization-function $g$ but each $q_{\theta}$ corresponds to a different choice of $\theta$ for $g_{\theta}$.
Under this assumption we can differentiate the sampling process by taking the derivative of $g_{\theta}$ wrt. $\theta$, and thus we can differentiate the entire $\widehat{\mathrm{ELBO}}(q_{\theta})$ wrt. $\theta$! With the gradient available we can try to solve for optimality either by setting the gradient equal to zero or by maximizing $\widehat{\mathrm{ELBO}}(q_{\theta})$ stepwise, traversing $\Theta$ in the direction of steepest ascent. For the sake of generality, we’re going to go with the stepwise approach.
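The stepwise approach can be sketched in a few lines of Python (standard library only). This is a toy under stated assumptions, not how any particular library does it: the model is an assumed conjugate one (standard normal prior on a mean, unit-variance Gaussian observations), the variational density is Gaussian with its standard deviation written as the exponential of an unconstrained parameter, and the derivative of the log-joint is written out by hand where a real implementation would use automatic differentiation. All names are hypothetical.

```python
import math
import random


def fit_gaussian_vi(xs, num_steps=1000, num_samples=50, step_size=0.02, seed=0):
    """Stochastic gradient ascent on the ELBO for m ~ N(0, 1), x_i ~ N(m, 1).

    q is N(mu, sigma^2) with sigma = exp(omega), sampled as z = mu + sigma * eps
    where eps ~ N(0, 1) plays the role of the base density.
    """
    rng = random.Random(seed)
    n, total = len(xs), sum(xs)
    mu, omega = 0.0, 0.0
    for _step in range(num_steps):
        sigma = math.exp(omega)
        grad_mu, grad_omega = 0.0, 0.0
        for _sample in range(num_samples):
            eps = rng.gauss(0.0, 1.0)
            z = mu + sigma * eps                   # reparameterized sample z = g(eps)
            dlogp_dz = total - (n + 1) * z         # d/dz log p(x, z) for this toy model
            grad_mu += dlogp_dz                    # chain rule: dz/dmu = 1
            grad_omega += dlogp_dz * sigma * eps   # chain rule: dz/domega = sigma * eps
        grad_mu /= num_samples
        # The Gaussian entropy is 0.5 * log(2 * pi * e) + omega, so its gradient is 1.
        grad_omega = grad_omega / num_samples + 1.0
        mu += step_size * grad_mu                  # step in the direction of ascent
        omega += step_size * grad_omega
    return mu, math.exp(omega)
```

For `xs = [1.2, 0.8, 1.9, 1.4, 0.7]` the exact posterior has mean 1 and standard deviation `1 / math.sqrt(6)`, so `fit_gaussian_vi(xs)` should return values near those. Because the gradient is estimated from random samples, the result fluctuates a little around the optimum.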
With all this nailed down, we eventually reach the section on Automatic Differentiation Variational Inference (ADVI).
So let’s revisit the assumptions we’ve made at this point:
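1. The variational posterior $q_{\theta}$ is in a parameterized family of densities denoted $\mathscr{Q}_{\Theta}$, with $\theta \in \Theta$.
2. $\mathscr{Q}_{\Theta}$ is a space of reparameterizable densities with $\bar{q}(z)$ as the base-density.
3. The parameterization function $g_{\theta}$ is differentiable wrt. $\theta$.
4. Evaluation of the probability density $q_{\theta}(z)$ is differentiable wrt. $\theta$.
5. $\mathbb{H}\left(q_{\theta}(z)\right)$ is tractable.
6. Evaluation of the joint density $p(\{x_i\}_{i=1}^n, z)$ is tractable and differentiable wrt. $z$.
7. The support of $q(z)$ is a subspace of the support of $p(z \mid \{x_i\}_{i=1}^n)$: $\mathrm{supp}\left(q(z)\right) \subseteq \mathrm{supp}\left(p(z \mid \{x_i\}_{i=1}^n)\right)$.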
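Not all of these are necessary to do VI, but they are very convenient and result in a fairly flexible approach. One distribution which has a density satisfying all of the above assumptions except (7) (we’ll get back to this in a second) for any tractable and differentiable $p(z \mid \{x_i\}_{i=1}^n)$ is the good ole' Gaussian/normal distribution:

$$
z \sim \mathcal{N}(\mu, \Sigma) \quad \iff \quad z = g_{\mu, L}(\bar{z}) := \mu + L \bar{z} \quad \text{where} \quad \bar{z} \sim \bar{q}(z) := \mathcal{N}(0_d, I_{d \times d})
$$

where $\Sigma = L L^T$, with $L$ obtained from the Cholesky-decomposition. Abusing notation a bit, we’re going to write

$$
\theta = (\mu, \Sigma) := (\mu_1, \dots, \mu_d, L_{11}, \dots, L_{1d}, L_{21}, \dots, L_{dd}).
$$

With this assumption we finally have a tractable expression for $\widehat{\mathrm{ELBO}}(q_{\mu, \Sigma})$! Well, assuming (7) holds. Since a Gaussian has non-zero probability on the entirety of $\mathbb{R}^d$, we also require $p(z \mid \{x_i\}_{i=1}^n)$ to have non-zero probability on all of $\mathbb{R}^d$.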
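As a quick sanity check of the Gaussian reparameterization, the following Python sketch (standard library only) draws samples by shifting and scaling standard normal noise with a lower-triangular Cholesky factor, then compares the empirical mean and covariance with the targets. It is restricted to two dimensions so the factorization can be written by hand; the helper names and the numbers in the usage note are made up for illustration.

```python
import math
import random


def cholesky_2x2(cov):
    """Lower-triangular L with L @ L^T = cov, for a 2x2 positive-definite cov."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]


def sample_gaussian(mu, chol, rng):
    """Reparameterized draw: z = mu + L @ eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mu[0] + chol[0][0] * eps[0],
            mu[1] + chol[1][0] * eps[0] + chol[1][1] * eps[1]]


def empirical_moments(samples):
    """Sample mean and (biased) sample covariance of a list of 2-vectors."""
    n = len(samples)
    mean = [sum(s[i] for s in samples) / n for i in range(2)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
            for j in range(2)] for i in range(2)]
    return mean, cov
```

For example, with `chol = cholesky_2x2([[2.0, 0.6], [0.6, 1.0]])` and a large batch of `sample_gaussian([1.0, -2.0], chol, rng)` draws, `empirical_moments` should recover roughly that mean and covariance. All the randomness lives in `eps`, and the parameters only enter through a differentiable map, which is what makes the sampling process differentiable.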
Though not necessary, we’ll often make a mean-field assumption for the variational posterior $q(z)$, i.e. assume independence between the latent variables. In this case, we’ll write

$$
\theta = (\mu, \sigma^2) := (\mu_1, \dots, \mu_d, \sigma_1^2, \dots, \sigma_d^2).
$$
Examples
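As a (trivial) example we could apply the approach described above to the following generative model for $p(z \mid \{x_i\}_{i=1}^n)$:

$$
\begin{align*}
m &\sim \mathcal{N}(0, 1) \\
x_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, 1), \quad i = 1, \dots, n.
\end{align*}
$$

In this case $z = m$ and we have the posterior defined by $p(m \mid \{x_i\}_{i=1}^n) \propto p(m) \prod_{i=1}^n p(x_i \mid m)$. Then the variational posterior would be

$$
q_{\mu, \sigma} = \mathcal{N}(\mu, \sigma^2), \quad \text{where} \quad \mu \in \mathbb{R}, \ \sigma^2 \in \mathbb{R}^{+}.
$$

And since the prior of $m$, $\mathcal{N}(0, 1)$, has non-zero probability on the entirety of $\mathbb{R}$, same as $q(m)$, i.e. assumption (7) above holds, everything is fine and life is good.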
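But what about this generative model for $p(z \mid \{x_i\}_{i=1}^n)$:

$$
\begin{align*}
s &\sim \mathrm{InverseGamma}(2, 3) \\
m &\sim \mathcal{N}(0, s) \\
x_i &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, s), \quad i = 1, \dots, n,
\end{align*}
$$

with posterior $p(s, m \mid \{x_i\}_{i=1}^n) \propto p(s) \, p(m \mid s) \prod_{i=1}^n p(x_i \mid s, m)$, and the mean-field variational posterior $q(s, m)$ will be

$$
q_{\mu_1, \mu_2, \sigma_1^2, \sigma_2^2}(s, m) = p_{\mathcal{N}}(s \mid \mu_1, \sigma_1^2) \, p_{\mathcal{N}}(m \mid \mu_2, \sigma_2^2),
$$

where we’ve denoted the evaluation of the probability density of a Gaussian as $p_{\mathcal{N}}(x \mid \mu, \sigma^2)$.
Observe that $\mathrm{InverseGamma}(2, 3)$ has non-zero probability only on $\mathbb{R}^{+} := (0, \infty)$, which is clearly not all of $\mathbb{R}$ like $q(s, m)$ has, i.e.

$$
\mathrm{supp}\left(q(s, m)\right) \not\subseteq \mathrm{supp}\left(p(s, m \mid \{x_i\}_{i=1}^n)\right).
$$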
Recall from the definition of the KL-divergence that when this is the case, the KL-divergence isn’t well defined. This gets us to the automatic part of ADVI.
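The support problem is easy to see numerically. The Python sketch below (standard library only) takes an inverse-gamma density with shape 2 and scale 3 as the positive-only density and a Gaussian as the variational density for the same variable; the specific numbers and function names are illustrative assumptions. A noticeable fraction of the Gaussian draws land where the inverse-gamma log-density is minus infinity, which is exactly where the log-ratio in the KL-divergence breaks down.

```python
import math
import random


def inverse_gamma_logpdf(s, alpha=2.0, beta=3.0):
    """Log-density of InverseGamma(alpha, beta); -inf outside its support (0, inf)."""
    if s <= 0.0:
        return -math.inf
    return (alpha * math.log(beta) - math.lgamma(alpha)
            - (alpha + 1.0) * math.log(s) - beta / s)


def fraction_outside_support(mu, sigma, num_samples=10_000, seed=0):
    """Fraction of draws from N(mu, sigma^2) at which the inverse-gamma density is zero."""
    rng = random.Random(seed)
    draws = [rng.gauss(mu, sigma) for _ in range(num_samples)]
    outside = sum(1 for s in draws if inverse_gamma_logpdf(s) == -math.inf)
    return outside / num_samples
```

With `mu = 1.0` and `sigma = 1.0`, roughly one draw in six falls on the non-positive half-line, so any Monte Carlo average involving the log of the constrained density would contain infinite terms.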
"Automatic"? How?
For a lot of the standard (continuous) densities $p(z)$ we can actually construct a probability density $\tilde{p}(y)$ with non-zero probability on all of $\mathbb{R}^d$ by transforming the "constrained" probability density $p(z)$ to $\tilde{p}(y)$. In fact, in these cases this is a one-to-one relationship. As we’ll see, this helps solve the support-issue we’ve been going on and on about.
Transforming densities using change of variables
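If we want to compute the probability of $z$ taking a value in some set $A \subseteq \mathrm{supp}\left(p(z)\right)$, we have to integrate $p(z)$ over $A$, i.e.

$$
\mathbb{P}_p(z \in A) = \int_A p(z) \, \mathrm{d}z.
$$

This means that if we have a differentiable bijection $f: \mathrm{supp}\left(p(z)\right) \to \mathbb{R}^d$ with differentiable inverse $f^{-1}: \mathbb{R}^d \to \mathrm{supp}\left(p(z)\right)$, we can perform a change of variables

$$
\mathbb{P}_p(z \in A) = \int_{f(A)} p\left(f^{-1}(y)\right) \left| \det \mathcal{J}_{f^{-1}}(y) \right| \mathrm{d}y,
$$

where $\mathcal{J}_{f^{-1}}(y)$ denotes the Jacobian of $f^{-1}$ evaluated at $y$. Observe that this defines a probability distribution

$$
\mathbb{P}_{\tilde{p}}\left(y \in f(A)\right) = \int_{f(A)} \tilde{p}(y) \, \mathrm{d}y,
$$

since $f\left(\mathrm{supp}\left(p(z)\right)\right) = \mathbb{R}^d$ which has probability 1. This probability distribution has density $\tilde{p}(y)$ with $\mathrm{supp}\left(\tilde{p}(y)\right) = \mathbb{R}^d$, defined by

$$
\tilde{p}(y) = p\left(f^{-1}(y)\right) \left| \det \mathcal{J}_{f^{-1}}(y) \right|
$$

or equivalently

$$
\tilde{p}\left(f(z)\right) = \frac{p(z)}{\left| \det \mathcal{J}_{f}(z) \right|}
$$

due to the fact that

$$
\left| \det \mathcal{J}_{f^{-1}}(y) \right| = \left| \det \mathcal{J}_{f}(z) \right|^{-1} \quad \text{for} \quad y = f(z).
$$

Note: it’s also necessary that the log-abs-det-Jacobian term is non-vanishing. This can for example be accomplished by assuming $f$ to also be elementwise monotonic.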
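A small numerical check of the change-of-variables formula, as a Python sketch (standard library only). It assumes a positive-only density, here an inverse-gamma with shape 2 and scale 3, and the logarithm as the bijection onto the real line, so the inverse is the exponential and its Jacobian is again the exponential. Both choices and all function names are illustrative. If the formula is right, the transformed density should integrate to one over the real line.

```python
import math


def inverse_gamma_pdf(z, alpha=2.0, beta=3.0):
    """Density of InverseGamma(alpha, beta), which is zero for z <= 0."""
    if z <= 0.0:
        return 0.0
    return math.exp(alpha * math.log(beta) - math.lgamma(alpha)
                    - (alpha + 1.0) * math.log(z) - beta / z)


def transformed_pdf(y, alpha=2.0, beta=3.0):
    """Density of y = log(z): the original density at exp(y) times |d exp(y) / dy|."""
    z = math.exp(y)
    return inverse_gamma_pdf(z, alpha, beta) * z


def trapezoid(func, lower, upper, num_intervals=20_000):
    """Composite trapezoidal rule on [lower, upper]."""
    h = (upper - lower) / num_intervals
    total = 0.5 * (func(lower) + func(upper))
    for k in range(1, num_intervals):
        total += func(lower + k * h)
    return total * h
```

Integrating `transformed_pdf` over a wide interval such as `[-5, 15]` with `trapezoid` should give a value very close to 1, and the transformed density is strictly positive at negative `y`, where the original density had no mass at all.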
Back to VI
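So why is this useful? Well, we’re looking to generalize our approach using a normal distribution to cases where the supports don’t match up. How about defining $q(z)$ by

$$
\begin{align*}
\eta &\sim \mathcal{N}(\mu, \Sigma) \\
z &= f^{-1}(\eta)
\end{align*}
$$

where $f^{-1}: \mathbb{R}^d \to \mathrm{supp}\left(p(z \mid \{x_i\}_{i=1}^n)\right)$ is a differentiable bijection with differentiable inverse. Then $z \sim q_{\mu, \Sigma}(z) \implies z \in \mathrm{supp}\left(p(z \mid \{x_i\}_{i=1}^n)\right)$ as we wanted. The resulting variational density is

$$
q_{\mu, \Sigma}(z) = p_{\mathcal{N}}\left(f(z) \mid \mu, \Sigma\right) \left| \det \mathcal{J}_{f}(z) \right|,
$$

with $p_{\mathcal{N}}$ the Gaussian density.
Note that the way we’ve constructed $q(z)$ here is basically a reverse of the approach we described above. Here we sample from a distribution with support on $\mathbb{R}^d$ and transform to $\mathrm{supp}\left(p(z \mid \{x_i\}_{i=1}^n)\right)$.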
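If we want to write the ELBO explicitly in terms of $\eta$ rather than $z$, the first term in the ELBO becomes

$$
\mathbb{E}_{z \sim q_{\mu, \Sigma}(z)}\left[\log p(\{x_i\}_{i=1}^n, z)\right] = \mathbb{E}_{\eta \sim \mathcal{N}(\mu, \Sigma)}\left[\log p\left(\{x_i\}_{i=1}^n, f^{-1}(\eta)\right)\right].
$$

The entropy, however, is not invariant under a change of variables: it picks up the expected log-abs-det-Jacobian of the transformation,

$$
\mathbb{H}\left(q_{\mu, \Sigma}(z)\right) = \mathbb{H}\left(\mathcal{N}(\mu, \Sigma)\right) + \mathbb{E}_{\eta \sim \mathcal{N}(\mu, \Sigma)}\left[\log \left| \det \mathcal{J}_{f^{-1}}(\eta) \right|\right],
$$

where $\mathbb{H}\left(\mathcal{N}(\mu, \Sigma)\right)$ is simply the entropy of the normal distribution, which is known analytically.
Hence, the resulting empirical estimate of the ELBO is

$$
\widehat{\mathrm{ELBO}}(q_{\mu, \Sigma}) = \frac{1}{m} \sum_{k=1}^m \left( \log p\left(\{x_i\}_{i=1}^n, f^{-1}(\eta_k)\right) + \log \left| \det \mathcal{J}_{f^{-1}}(\eta_k) \right| \right) + \mathbb{H}\left(\mathcal{N}(\mu, \Sigma)\right) \quad \text{where} \quad \eta_k \sim \mathcal{N}(\mu, \Sigma) \quad \forall k = 1, \dots, m.
$$

And maximizing this wrt. $\mu$ and $\Sigma$ is what’s referred to as Automatic Differentiation Variational Inference (ADVI)!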
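To see the pieces fit together, here is a hedged Python sketch (standard library only) of an ADVI-style ELBO estimate for a single positive latent variable. Everything specific in it is an assumption for illustration: the model (an inverse-gamma prior with shape 2 and scale 3 on a variance, zero-mean Gaussian observations), the exponential as the map from the real line to the positive reals, the data in the usage note, and the function names. The estimate uses the Gaussian entropy plus the log-abs-det-Jacobian of the exponential map, which is the unconstrained sample itself. This model happens to be conjugate, so the log-evidence is available for comparison.

```python
import math
import random

ALPHA, BETA = 2.0, 3.0  # assumed shape and scale of the inverse-gamma prior


def log_joint(xs, s):
    """log p(x_1, ..., x_n, s) for s ~ InverseGamma(ALPHA, BETA), x_i ~ N(0, s)."""
    log_prior = (ALPHA * math.log(BETA) - math.lgamma(ALPHA)
                 - (ALPHA + 1.0) * math.log(s) - BETA / s)
    log_lik = sum(-0.5 * math.log(2 * math.pi * s) - 0.5 * x * x / s for x in xs)
    return log_prior + log_lik


def advi_elbo_estimate(xs, mu, sigma, num_samples=20_000, seed=0):
    """Monte Carlo ELBO with eta ~ N(mu, sigma^2) and s = exp(eta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        eta = rng.gauss(mu, sigma)        # unconstrained sample on the real line
        s = math.exp(eta)                 # mapped into the support (0, inf)
        total += log_joint(xs, s) + eta   # log|d exp(eta) / d eta| = eta
    gaussian_entropy = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
    return total / num_samples + gaussian_entropy


def log_evidence(xs):
    """Closed-form log p(x_1, ..., x_n), available because this toy model is conjugate."""
    n = len(xs)
    alpha_post = ALPHA + 0.5 * n
    beta_post = BETA + 0.5 * sum(x * x for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            + ALPHA * math.log(BETA) - math.lgamma(ALPHA)
            + math.lgamma(alpha_post) - alpha_post * math.log(beta_post))
```

With `xs = [1.0, -0.5, 1.5, 0.3, -1.2]`, a reasonable choice such as `advi_elbo_estimate(xs, 0.2, 0.47)` should sit just below `log_evidence(xs)`, and a poor one such as `advi_elbo_estimate(xs, -1.0, 0.5)` far below it. Feeding the gradient of this estimate to a stepwise optimizer is the remaining step, which in practice is where automatic differentiation comes in.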