## NIPS 2016: random thoughts

This week I was at NIPS along with a big crowd of machine learning fans. It was my second NIPS, and I’m still very excited to see the advances in the field. Unfortunately, I was too busy (or too lazy?) to read on papers. Probably because I’m slow to catch up new deep net architectures. I will try to do it during the holiday and hopefully then I will put some more blog posts on interesting papers. Our group will have a “your favourite NIPS paper in 5mins” session next term and I’m very looking forward to it.

General impressions.

1. This year the number of attendants was nearly doubled to almost 6,000. 6,000! And there were even two high school students! I asked these two boys about “why you come here”, and they answered “we did some toy deep learning projects and we’re very very excited about the future of AI”. So you get the idea of “the hype of machine learning”. Next year I will definitely register and book the hotel once the registration is open!

2. Again, too many people — so no NIPS breakfast, no NIPS banquet, and in the first day I didn’t manage to talk to any poster presenters — the queues were just too long. But we did have good things — more parties by start-ups and big names, including the new players Apple and Uber. Unfortunately this year I was physically too tired that I only went to some, and it was a shame to have missed the launching party of RocketAI (which turns out to be a big big joke). Maybe it seems a bit strange that I still don’t feel that hurry on finding a job now — maybe I will be desperate of that in next year’s NIPS?

3. Hottest topics: deep RL and GANs. Everyone’s talking about it. Every company claims that they’re working on it. I admit that I’m not that crazy keen on both topics (well this summer I did spend quite a bit of time on understanding GAN objective functions), but maybe in the future I might also touch on them — who knows?

4. Tutorials were very good. I went to deep RL, computational social science, and optimisation. Deep RL was a bit strange that it assumed you know MDPs, but overall it was fairly easy to follow. Computational social science was less “high tech”, but the problem they were solving are very interesting (e.g. what makes a movie quote memorizable). Optimisation started from the very basics and slowly built on it, and I managed to follow most of the convex part. Interestingly, three tutorials (deep RL, variational inference and optimisation) had spent quite a lot of time talking about stochastic gradient and variance reduction. So we’re essentially working on very similar things!

5. Workshops were great. This year it had 50 workshops, so apparently I didn’t manage to attend most of them. But the two main workshops of my field — approximate inference and Bayesian deep learning — were actually amazing. They got 58 and 45 posters, respectively, and many of them already had very promising results. I also went to the panel of adversarial training and they had some very nice comments there.

6. Old and new friends. I’ve met quite a lot of old and new friends in this year’s NIPS. Always good to catch up and talk about random things.

7. As a guest presenter. For those who found strange that I was presenting a poster on Monday as well: my collaborator Qiang Liu was too busy presenting 3 posters on the same day, so I went to help him present the Stein variational gradient descent one. This was actually a fun experience since I don't necessary need to "sell or defend" the thing I'm presenting. Maybe another way to thoroughly understand a paper is to help its presentation at conferences?

The following are more related to my research.

1. Alpha divergences.
I didn’t expect that my proposals have been largely recognised now in the approximate inference community. Many people (more than I expected) visited my Renyi divergence poster on Wed, also quite a lot people stopped by my Saturday's constraint relaxation one. Rich had got invited to the approximate inference workshop panel talking about divergence selection. Miguel and Finale Doshi-Velez also advertised the superior performance of alpha divergence methods in the Bayesian deep learning workshop. More excitingly, I saw with my own eyes a real example of stochastic EP applied to industrial machine learning systems. I feel my hard time in the first two years really pays off. I would like to thank Rich for his excellent mentorship, and Miguel, Daniel, Thang and Mark for their great efforts.

2. Stein’s method and wild variational approximations.
Quite strangely I came to work on things related to Stein’s method and what we called “wild variational approximations”. I learned a bit about Stein’s method before, but at that time I didn't think I would use it for approximate inference. I was reviewing the operator variational inference (OPVI) paper for this year’s NIPS, at that time the paper was even harder to follow compared to its current form, and that paper still failed to convince me that Stein’s method is promising in practice. Then in August the Stein variational gradient descent paper popped out, and I really enjoyed it. I wrote a blog post on it, sent it to Qiang for comments, and suddenly the “wild variational approximation” idea came out from our mind, and we decided to collaborate on that. Apparently the OPVI authors are also very excited about it, and so this NIPS we gathered for a lunch chat and had a “cheers” for Stein’s method.
In summary, the next big thing I’m very excited about is using “truly black-box”, or “wild” (as the black box VI name has already been used by MC methods) q distributions to enhance the accuracy of approximate inference, which allows us to use probabilistic models of their full power. As the “outsider of Bayesian inference community” Ian Goodfellow pointed out in the invited talk in Bayesian deep learning workshop, it would be really nice if we can use some inference engines (e.g. neural networks) to generate (approximate) posterior samples. He talked a little bit about the ideas borrowed from GANs, which is also discussed in my approximate inference workshop submission with Qiang, and we discussed a lot more on potential directions addition to that.
I feel like “wild variational approximation” is a long-term project that is definitely impossible to be done in say several conference paper submissions, or two chapters in my thesis. But I also think I am very lucky to have found this exciting research direction, which I’d love to work on also after– my PhD study. Finally as a side note, I’m still very excited about divergences and information theoretic approaches to Bayesian inference, and I would definitely continue working on better understandings of EP-related methods, I feel like I’m also starting to put in more theoretical analysis right now, which I think is a good thing.

## My paper list on deep generative models

I am quite interested in generative models. So I think it's a good idea to summarize what I know so far about it and which papers I think are "must read" if you want to understand what's going on in the field. I haven't finish all of them though and this will also be a good chance to create a "checklist" -- I'll go back and tick those unfinished papers! Also this list can be very long if I include many recent advances so I'll just keep those with big ideas -- can do another list for them later!

Variational Auto-Encoders (VAE)

I think the VAE idea has revolutionized the approximate inference and graphical models field. Since neural networks are so successful as functional approximators, why not use them to compute joint/conditional distributions/factors/potential functions for a directed/undirected graphical model, as well as for the approximate posterior/marginal distribution/factors? Hundreds of papers have then built on this idea.

The must-read papers start from Kingma & Welling's Auto-Encoding Variational Bayes paper. Roughly the same time Deepmind came up with the same idea. Then the above two papers authors teamed up together and published a paper on semi-supervised learning. In this context a recent attempt which adds auxiliary variables also deserves a detailed read. I'm not going to recommend the variational Gaussian process paper for practitioners but essentially it contains similar ideas to the auxiliary DGM paper.

The original framework uses the variational lower-bound to do approximate MLE. IWAE improved on this by proposing a new objective function involving importance sampling. Has been tested on RNN versions as well. Shameless plug here: I have shown that IWAE does not return the best approximation to the marginal likelihood. Should also read a similar algorithm "reweighed wake-sleep" if you want to know more and probably you might need to check out the original wake-sleep paper.

VAEs are very flexible as you can imagine from the summary. So they can be easily extended to model sequential data. In this regime the variational RNN paper could be interesting and you might want to check out some related papers it cited as well.

RNNs + attention model also work well on images, and the DRAW paper is the one that combined them with the VAE framework. Similar papers on the experimental side include the one for image captioning and the convolutional version of DRAW.

There are papers that combine the VAE idea with invertible transformations as well. But I'll list them in later paragraphs just to make sure this part is not too long.

GANs are even more flexible than VAEs. In short you want to train a powerful generative model by fooling another powerful discriminator that "the image I'm showing you is from the real data". I have a post that discussed the maths in detail and I also recommend reading a series of blog post by Ferenc Huszár.

Of course the must-read is the original paper. But the GAN idea has also been discussed in statistical inference context, which has been largely ignored and I think it shouldn't happen. A closely related idea is Noise Contrastive Estimation (NCE), and Ian Goodfellow also discussed the link and difference between both.

GAN is notoriously hard to train. Two experimental papers from Facebook & NYU (i.e. LAPGAN & DCGAN) are must-read for practitioners. I haven't read the recent proposal from OpenAI in detail though. But you should definitely try to train a GAN yourself to get a feeling about how to tune it -- I did so. I'll say this difficulty is one of the main reason that bias me towards VAEs, even when GANs have superior performance in image domain, but see this interesting paper that tries to combine the advantages from both side. Alternatively, the maximum mean discrepancy (MMD) and kernel two sample test have been considered, e.g. see this and this paper.

Some other interesting papers include adversarial auto-encoders (well this one is actually much easier to train), a DRAW-like version of GAN, and the one that includes some contextual information into the modeling process. I also like this paper that extend the GAN framework to f-divergence via duality. Using GAN principles to learn an inference network (and also the same idea here) seems promising, but my experience of playing them seems to suggest that this approach can be even more unstable than the original framework.

Generative models with invertible transformations

If you don't like approximate inference and want something exact, here's the part for you. I remember seeing this type of methods first from the NICE paper (nice name). The improved version is also on arxiv.

From my perspective the most interesting work in this regime is the combination of VAE and invertible transformations. I know this idea has been roughly around for 2-3 years (e.g. see this paper) but this ICML paper formalized it with lots of discussions. This year's NIPS has two interesting follow-up papers: one uses autoregressive functions and another tries to skip the randomness of transformations. Should consider blog posts for these papers later.

Deep Boltzmann machines & deep belief networks

These models are less considered in recent days, but I still love the beautiful derivations, and I think everyone who want to work on generative models should get to know these type of models. No need to introduce the seminal 2006 paper that took neural networks back to the stage. No need either to mention then people tried different variations of RBMs/deep belief networks for image/speech recognition. But then after the introduction of deep Boltzmann machines, people found that these models are very hard to train with contrastive divergence like algorithms. Pre-training as a DBN then switch to DBM training sort of worked and this paper provided quite a nice interpretation on that, but then in the fine-tuning step we still have this training difficulty problem. There's another pre-training technique that approximately performs variational inference, but you need to be careful about the rescaling part. The doubly intractable problem has scared lots of practitioners away from continue investigating DBMs.

On deep sigmoid belief network side, these two years Lawrence Carin's group has produced quite a few papers related to it. I haven't read them in detail though.

Other attempts

Although not very experimentally appealing, I found this diffusion process paper from Surya Ganguli's group an interesting read.

As Bayesian people have made Gaussian processes deep, there you go an attempt on using deep GP to generate images from Neil Lawrence's group. It uses VAE idea to derive inference steps.

David Blei and his students love conjugate models and SVI, so it's no surprising that they've come up with the deep version of exponential families.

If you like the Gibbs sampling procedure of RBMs for data generation, here's another direction to go deeper other than DBMs.

I found this paper's approach a bit strange but those who interested in wake-sleep should get to know it.

Auto-regressive models are also an important type of generative models. For neural network assisted version, must read NADE (and subsequence papers). This year's ICML best paper (pixel RNN) also belongs to this class.

Finally this note on how to evaluate the performance of generative models deserves a detailed read.

## Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

In the old days we were happy with mean field approximation. Currently we don't. As the model goes more complicated, Bayesian inference needs more accurate yet fast approximations to the exact posterior, and apparently mean-field is not a very good choice. To enhance the power of variational approximations people start to use invertible transformations (e.g. see the normalizing flow paper) that warp simple distributions (e.g. factorised Gaussian) to complicated ones that are still differentiable. However it introduces an extra cost: you need to be able to compute the determinant of the Jacobian of that transform in a fast way. Solutions of this include constructing the transform with a sequence of "simple" functions -- simple in the sense that the Jacobian is low-rank or triangular (e.g. see this paper). Recently I found a NIPS preprint that provides another solution of this, which is absolutely amazing: through their clever design you don't even need to compute the Jacobian at all! I found this paper a very interesting read so I decided to put a blog post here:

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Qiang Liu and Dilin Wang
http://arxiv.org/abs/1608.04471 (to appear at NIPS 2016)

You might also want to check out their previous work at this year's ICML.

Before going to the details, let's recap the idea of invertible transformations of random variables. Assume $z$ is a random variable distributed as $q(z)$, e.g. Gaussian. Now if a mapping $T$ is differentiable and invertible, then the transformation $x = T(z)$ is also a random variable, distributed as (using notations in the paper) $q_T(x) = q(T^{-1}(x)) | det(\nabla_x T^{-1}(x)) |$. If we use this distribution in VI, then the variational free energy becomes $\mathrm{KL}[q_T(x)||p(x|\mathcal{D})] - \log p(\mathcal{D})$, which means we need to evaluate the log determinant of the Jacobian $\log | det(\nabla_x T^{-1}(x)) |$. As said $T$ often contains a sequence of "simple" mappings, in math this is $T(z) = F_K(F_{K-1}(...F_1(z)))$, so the determinant term becomes $\sum_{k=1}^K \log |det(\nabla F_k^{-1}) |$.

Previous approaches parameterised the $F$ functions with carefully designed neural networks, and optimised all these network parameters jointly through the free energy. This paper, although not explicitly mentioned, used sequential optimisation instead: it first finds $F_1$ by minimising the energy, then fixes it and proceeds to $F_2$ and so on. No fine-tuning at the end. Now optimisation becomes easier, but potentially you need more transformations and thus longer training time in total. Also storing these functions can be very challenging for memory.

To solve these problems the authors proposed using functional gradients. They assumed a very simple form for the transform: $F_k(z) = z + f_k(z)$ where $f_k$ belongs to some RKHS defined by a kernel $K(x, \cdot)$, and used functional gradients to find the $f_k$ function. Then instead of finding a local optimum of $f_k$, we can do "early stopping", or even just run one gradient step, then move to the next transform. Furthermore, if you start your search at $f_k(z) = 0$, then there's no need to evaluate the determinant as now $\nabla F_{k}^{-1} = I$! This solves the running time problem even when we need much more of these transforms compared to previous approaches. Another nice explanation is that now $T$ becomes $T(z) = z + f(z)$ with $f(z) = \sum_{k=1}^K f_1(z)$, and if the norm of $f_k$ is small enough, then the above is also equivalent to one gradient step for $f$ at point $z + \sum_{i < k} f_i(z)$. I like both ways of interpreting this procedure and I would say it comes from the clever design of the sequential transform.

The authors provided the analytical form of this functional gradient which links back to the kernel Stein discrepancy discussed in their ICML paper:

$\nabla_{f_k} \mathrm{KL}[q_{F_k} || p] |_{f_k = 0} = - \mathbb{E}_{z \sim q} [ \nabla_z K(z, \cdot) + \nabla_z \log p(z) K(z, \cdot) ]$,

where $p(z)$ short-hands the joint distribution $p(z, \mathcal{D})$. Since now Monte Carlo estimation has become quite a standard approach for modern VI, we can use samples from the $q$ distribution to compute the above gradient. Assume we take $n$ samples $z_1, ..., z_n \sim q$. By taking only one gradient step with learning rate $\epsilon$ the current transform becomes $F_k(z) = z + \frac{\epsilon}{n} \sum_n [ \nabla_{z_n} K(z_n, z) + \nabla_{z_n} \log p(z_n) K(z_n, z) ]$. We can also use mini-batches to approximate $\log p(z, \mathcal{D})$ to make the algorithm scalable on large datasets. In fact they reported even faster speed than PBP on Bayesian neural network with comparable results to the state-of-the-art, which looks very promising.

After the gradient step of $f_k$, the algorithm moves to the next transform $F_{k+1}(z) = z + f_{k+1}(z)$ and again starts at $f_{k+1}(z) = 0$. Notice now the $q$ distribution that we simulate $z$ from in the above gradient equation becomes $q_{F_k}$, which do contain a non-identity Jacobian if we want to evaluate it directly. However recall the core idea of invertible transformation that we can simulate $x \sim q_T$ by first sample $z \sim q$ then apply $x = T(z)$, and we do have samples from $q$ from the last step. This means we can first apply the transform to update the samples (with Monte Carlo estimate) $z_i \leftarrow z_i + \frac{\epsilon}{n} \sum_n [ \nabla_{z_n} K(z_n, z_i) + \nabla_{z_n} \log p(z_n) K(z_n, z_i) ]$, then use them to compute this step's gradient. It makes the algorithm sampling-like and the memory consumption only comes from storing these samples, which can be very cheap compared to storing lots of parametric functionals. If one is still unhappy with storing samples (say due to limited memory), one can fit a parametric density to the samples after the last transform $F_K$. Another strategy is to only use 1 sample, and in this case this algorithm reduces to maximum a posteriori (MAP), which still finds a mode of posterior for you.

The authors used RBF kernel and also pointed out that this algorithm also recovers MAP when the bandwidth tends to zero. This make sense as you can understand it through analogies of kernel density estimation, and in general running VI with $q$ as a delta function is also equivalent to MAP. In other words, the number of samples $n$ and the kernel implicitly define the family of variational distribution $q$: the kernel mainly controls the smoothness, and the number of samples roughly indicates the complexity of your fit. Presumably the main criticism I have is also from the nature that $n$ affects the complexity of $q$, which is not the case for other black-box VI techniques. This means in high dimensions where using large number of samples has prohibitive memory cost, you need to carefully tune the kernel parameters in order to get a good fit, or even you might not be able to achieve significantly better results than MAP. Well using mean-field Gaussian in high-dimensions also seems insufficient, but at least the community has relatively clear understanding in this case. For this method it's unclear to me right now what the distribution looks like and whether it is often better than mean-field. Perhaps I should raise this questions to the authors and I’ll be very excited to hear more about it from their poster at NIPS!

## Graphical Models Meet Deep Learning: Two Recent Attempts

We've seem the power of deep learning. But not everyone is happy with just "throwing the dataset to a big neural network". A good model design can still perform well or even better than naive neural network method, but with much fewer number of model parameters. Probabilistic Graphical models (PGMs) method is this type of thing: it has beautiful theory, lots of applications, and many machine learning researchers have been developing models and algorithms based on it. I remember the first deep learning paper I read was about restricted Boltzmann machines (RBM), and today I'm still fascinated by the elegance of the maths.

However inference of PGMs has become the barrier for large scale applications and complicated graphical structure. Surely people have developed fast approximate inference method (VI/EP to name a few) that works very well in practice. But the main problem comes from the memory cost. To make your model powerful you probably want to add a hierarchy of latent variables to describe both local and global behaviour. However for large datasets the number of approximate distribution you need to maintain increases drastically. This is especially true for intermediate-level latent variables (consider topic mixture for each document in LDA for example) which you can't just throw them away in SVI. Perhaps one of the most well known papers to address this problem is Kingma & Welling's variational auto-encoder (VAE), which amortized the inference by learning a mapping (represented by a neural network) from data to local latent variables. Since then this inference network/recognition model idea has been very popular. In this post I'm gonna briefly discuss two recent papers on using neural networks to assist approximate inference which I think are interesting read.

The first paper comes from Ryan Adams group at Harvard:

Composing graphical models with neural networks for structured representations and fast inference
Johnson et al.
https://arxiv.org/pdf/1603.06277v2.pdf

I would say the general idea is very simple: the authors wanted to extend the original SVI method to non-conjugate models. They basically just approximated the non-exponential family likelihood terms with conjugate exponential family distributions, and to reduce memory cost the natural parameters were parameterised by a recognition model. Then you can reuse the SVI derivations to show the fixed point/natural gradient equations, and I think the algorithm can actually be very fast in practice.

I can sort of see why their approach works and in fact I like their recipe which has analytical forms. But the general framework is not that neat: they optimised different objective functions for different parts of variational distribution. It's true that the proposed loss function has the nice lower-bounding property just as SVI, however the authors proved it by saying "the learned local variational distribution using recognition model approximated likelihood terms is not optimal for the exact SVI". In other words, their argument holds for any recognition model, and it's not obvious to me how tight is their objective to the marginal likelihood. But I haven't read the paper in very detail (they have a long appendix) so maybe I'm missing something here.

UPDATE 19/08/16

I actually spent sometime today to read the paper again. Well I still have the above questions. However I noticed another very interesting point that their new parameterisation can potentially alleviate the co-adaptation problem of global/local variational approximation for neural-network based latent variable models. In the appendix of the VAE paper a full Bayesian treatment (i.e. also approximate the posterior of generative network weights) has been briefly sketched. However I've never seen any published results saying this works well. I thing the main problem comes from the factorised approximation, which uses independent variational parameters to the posterior of weights and the latent variables, and then throws their "interaction" to the optimisation process. This paper, though still assuming factorisations, explicitly introduced the dependence between the variational parameters of latent variables and the variational approximation of weights. My colleagues and I recently started to investigate full Bayesian learning for VAE, and I think this paper's observation can potentially be very helpful.

The second paper comes from Le Song's group at Georgia Tech:

Discriminative Embeddings of Latent Variable Models for Structured Data
Hanjun Dai, Bo Dai, Le Song
http://jmlr.org/proceedings/papers/v48/daib16.pdf

The problem they looked at is structural prediction where PGMs are a major class of models being used there. The original pipeline contains three steps: 1. learn the graphical model parameters with unsupervised learning, 2. extract feature using posterior inference on latent variables, and 3. learn a classifier on those features. But since classification performance is the main target here, the authors suggested chain the three steps together and directly minimise the classification loss. Furthermore since the inference step is the main bottle neck (imagine doing message passing on a big graph for every datapoint and every possible setting of graphical model parameters), they amortized the messages using recognition models, which also implicitly capture the graphical model parameters. To further reduce the dimensionality they used distribution embeddings by computing the messages on a feature map. All these things are learned by supervised learning and the results are pretty impressive.

I have a mixed feeling of this paper. I like the idea of discriminative training and directly working on the messages, which are really the key ideas to make the algorithm scalable. Some might complain that it's unclear what's the PGM distribution it has learned from data, but I think loopy BP messages (at convergence) can provide accurate approximations to the marginals and you can also use a tree-like construction to roughly see what's going on with the joint. The main criticism comes from the convergence part for message passing. First BP has no convergence guarantees on loopy graphs. But more importantly the inference network approach actually puts constraints (i.e. share parameters) on the beliefs across datapoint, which is unlikely to return you the optimal answer for everyone. It's also true for VAEs that you won't get the optimal posterior approximation for every latent variables. But at least VAEs still provide a lower-bound on the marginal likelihood, while the Bethe free energy doesn't have this nice property. I guess it works fine in the models the authors have considered, but I could imagine possible failure for this method when extended to unsupervised learning for densely connected graphs.

## A Variational Analysis of Stochastic Gradient Algorithms

I've been working on approximate inference for Bayesian neural networks for quite a while. But when I talk to deep learning people, most of them say "interesting, but I still prefer back-propagation with stochastic gradient descent". Then people in the approximate inference community start to think about how to link SGD to approximate inference, and in this line a very recent paper catched my eyes:

A Variational Analysis of Stochastic Gradient Algorithms
Stephan Mandt, Matt Hoffman, and David Blei, ICML 2016
http://jmlr.org/proceedings/papers/v48/mandt16.pdf

I'm not a big fan of sampling/stochastic dynamics methods (probably because most of the time I test things on neural networks), but I found this paper very enjoyable, and it makes me think I should learn more about this topic. It differs from SGLD and many other MCMC papers in that the authors didn't attempt to recover the exact posterior, but instead they did approximate SVI to keep the computations fast (although the assumptions they made are quite restrictive). There's no painstaking learning rate tuning: the paper also suggests optimal constant learning rate following the guidance of variational inference.

(Figure from the paper's ICML poster)

How does it work? Well, the authors wrote down the continuous dynamics of SGD, made a few assumptions based on CLT, quadratic approximation to the objective function near local optima, then figured out the stationary distribution $q$ of this process and optimize the learning rate as a hyper-parameter of that $q$. To be precise, let's first write down the objective function of MAP we want to minimize wrt. the model parameters $\theta$: $\mathcal{L}(\theta) = -\sum_{n=1}^N \log p(x_n|\theta) - \log p_0(\theta)$. For convenience the author considered working with $\mathcal{L}(\theta) / N$  and denote its gradient as $g(\theta)$. It's straight-forward to see that now the stochastic gradient $\hat{g}_S(\theta)$ estimated on minibatch $S$ is an unbiased estimate of $g(\theta) = \nabla \mathcal{L}(\theta) / N$, and from CLT we know that $\hat{g}_S$ is asymptotically Gaussian distributed. So here the authors make the first assumption:

(A1) $\hat{g}_S(\theta) \approx g(\theta) + \frac{1}{\sqrt{S}} \Delta g(\theta), \quad \Delta g(\theta) \sim \mathcal{N}(0; C(\theta))$

In the following are two more assumptions that when the algorithm gets close to convergence,

(A2) the noise covariance is a constant and can be decomposed, i.e. $C(\theta) = C = BB^T$; (this sounds severe but let's just assume it for simplicity)

(A3) near the local optimum the loss function $\mathcal{L}(\theta) /N$ can be well approximated with a quadratic term $\frac{1}{2}\theta^T A \theta$. (easy to derive from Taylor expansion and $A = \nabla \nabla \mathcal{L}(\theta) / N$)

Based on these assumptions the authors wrote down the continuous time stochastic differential equation of SGD as an Ornstein-Uhlenbeck process:

$d\theta(t) = -A\theta(t) dt + \sqrt{\frac{\epsilon}{S}} B dW(t)$

where $\epsilon$ is the learning rate we will optimize later. The OU process has a stationary distribution which is a Gaussian: $q(\theta) \propto \exp \left[-\frac{1}{2} \theta^T \Sigma^{-1} \theta \right]$ with the variance $\Sigma$ satisfying $\Sigma A^T + A \Sigma = \frac{\epsilon}{S}BB^T$.

Now comes the main contribution of the paper. What if we use this stationary distribution as an approximation of the true posterior? Note that $\Sigma$ is a function of matrix $A$, the mini-batch size $S$, but more importantly, the learning rate $\epsilon$. So the authors used variational inference (VI) to minimize $\mathrm{KL}[q(\theta)||p(\theta|x_1, ..., x_N)]$ wrt. $\epsilon$, and suggested running constant learning rate SGD with the minimizer $\epsilon^*$ as approximately sampling from the exact posterior. Similar procedure can be done if we use a pre-conditioning matrix $H$ (as a full or diagonal matrix) and they also worked out the optimal solutions for them. I'm not going to run through the maths here but here I just copy the solutions for interests (with $D$ dimensional data):

(constant SGD) $\epsilon^* = \frac{2DS}{N\mathcal{Tr}(BB^T)}$

(full pre-conditioned SGD) $H^* = \frac{2S}{\epsilon N}(BB^T)^{-1}$

(diagonal) $H^*_{kk} = \frac{2S}{\epsilon N (BB^T)_{kk}}$

The authors also discussed connections to Stochastic Gradient Fisher Scoring and RMSprop (I think they incorrectly claimed for Adagrad) which looks fun. But let's stop here and talk about what I think about the main results.

First we notice that $C(\theta)$ can be viewed as the empirical variance of the gradient $\hat{g}_n(\theta) = \nabla \log p(x_n|\theta)$. Now if the model is correct, then $C(\theta)$ approaches to the Fisher information matrix $I(\theta) = \mathbb{E}_x[\nabla \log p(x|\theta)^2]$ as the number of datapoint $N$ goes to infinity. Then a well known result says we can rewrite the Fisher information matrix as $I(\theta) = \mathbb{E}_x[-\nabla \nabla \log p(x|\theta)] = \lim_{N \rightarrow +\infty} \nabla \nabla \mathcal{L}(\theta) / N$. So my first guess is that if we run SGD on large datasets (we usually do) and it converges to a local optimum, then the pre-conditioned SGD with full/diagonal matrix should return very similar approximate posterior distributions to Laplace approximation/mean field VI, respectively. It's pretty easy to show them, just by substituting the optimal $H^*$ back to the constraints of $\Sigma$ and notice $BB^T \approx A$. You can do similar calculations for the constant SGD and get an isotropic Gaussian back with the precision equals to the average value of $A$'s diagonal entries.

However most of the time we know the model is wrong. In this case Laplace approximation can possible return terrible results. I think this paper might be useful for those who don't want to directly use traditional approximate inference schemes (in the cost of storing $q$ and using more parameters) and still want to get a good posterior approximation. But the performance of the SGD proposals really depend on how you actually estimate the matrix $BB^T$. Also note that case it's considered expensive to evaluate the empirical variance on the whole dataset, and instead people often use running average (see RMSprop for an example). Even if we assume the mini-batch variance is a good approximation, when the samples are hovering around, assumption (A2) doesn't hold apparently.  So the next question to ask is: can the popular adaptive learning rate schemes return some kind of approximation to the exact posterior? My experience with RMSprop says that it tends to move around the local optimum, so is it possible to get uncertainty estimates from the trajectory? And how reliable would that be? This sounds very interesting research problems and probably I need to think about it in a bit more detail.

UPDATE 11/08/16

After a happy chat with Stephan (the first author), we agreed that even when the model is wrong, the OU process predicted approximation of full-preconditioning SGD still gives you Laplace approximation (although practical simulation might disagree since the assumptions doesn't hold). My guess for the other two cases depends on that the data comes from the model.