NIPS 2016: random thoughts

This week I was at NIPS along with a big crowd of machine learning fans. It was my second NIPS, and I'm still very excited to see the advances in the field. Unfortunately, I was too busy (or too lazy?) to read up on papers, probably because I'm slow at catching up on new deep net architectures. I will try to do so during the holiday, and hopefully I will then put up some more blog posts on interesting papers. Our group will have a "your favourite NIPS paper in 5 mins" session next term and I'm really looking forward to it.

 

General impressions.

1. This year the number of attendees nearly doubled to almost 6,000. 6,000! And there were even two high school students! I asked these two boys why they came, and they answered "we did some toy deep learning projects and we're very very excited about the future of AI". So you get the idea of the hype around machine learning. Next year I will definitely register and book the hotel as soon as registration opens!

2. Again, too many people — so no NIPS breakfast, no NIPS banquet, and on the first day I didn't manage to talk to any poster presenters — the queues were just too long. But there were good things too — more parties by start-ups and big names, including the new players Apple and Uber. Unfortunately I was physically too tired this year, so I only went to a few of them, and it was a shame to have missed the launch party of RocketAI (which turned out to be a big big joke). Maybe it seems a bit strange that I still don't feel any hurry to find a job — maybe I will be desperate about it at next year's NIPS?

3. Hottest topics: deep RL and GANs. Everyone's talking about them. Every company claims to be working on them. I admit that I'm not that crazily keen on either topic (well, this summer I did spend quite a bit of time on understanding GAN objective functions), but maybe in the future I might also touch on them — who knows?

4. Tutorials were very good. I went to deep RL, computational social science, and optimisation. Deep RL was a bit strange in that it assumed you already knew MDPs, but overall it was fairly easy to follow. Computational social science was less "high tech", but the problems they were solving were very interesting (e.g. what makes a movie quote memorable). Optimisation started from the very basics and slowly built on them, and I managed to follow most of the convex part. Interestingly, three tutorials (deep RL, variational inference and optimisation) spent quite a lot of time talking about stochastic gradients and variance reduction. So we're essentially all working on very similar things!

5. Workshops were great. This year there were 50 workshops, so obviously I didn't manage to attend most of them. But the two main workshops in my field — approximate inference and Bayesian deep learning — were amazing. They had 58 and 45 posters respectively, and many of them already showed very promising results. I also went to the panel on adversarial training and there were some very nice comments there.

6. Old and new friends. I met quite a lot of old and new friends at this year's NIPS. Always good to catch up and talk about random things.

7. As a guest presenter. For those who found it strange that I was presenting a poster on Monday as well: my collaborator Qiang Liu was too busy presenting 3 posters on the same day, so I went to help him present the Stein variational gradient descent one. This was actually a fun experience, since I didn't necessarily need to "sell or defend" the thing I was presenting. Maybe another way to thoroughly understand a paper is to help present it at conferences?

 

The following are more related to my research.

1. Alpha divergences.
I didn't expect my proposals to have become so widely recognised in the approximate inference community. Many people (more than I expected) visited my Renyi divergence poster on Wednesday, and quite a lot of people also stopped by my Saturday constraint relaxation one. Rich was invited to the approximate inference workshop panel to talk about divergence selection. Miguel and Finale Doshi-Velez also advertised the superior performance of alpha-divergence methods in the Bayesian deep learning workshop. More excitingly, I saw with my own eyes a real example of stochastic EP applied to industrial machine learning systems. I feel that the hard time I had in my first two years has really paid off. I would like to thank Rich for his excellent mentorship, and Miguel, Daniel, Thang and Mark for their great efforts.

2. Stein’s method and wild variational approximations.
Quite unexpectedly, I have come to work on things related to Stein's method and what we call "wild variational approximations". I had learned a bit about Stein's method before, but at that time I didn't think I would use it for approximate inference. I was reviewing the operator variational inference (OPVI) paper for this year's NIPS (at that time the paper was even harder to follow than its current form), and it still failed to convince me that Stein's method is promising in practice. Then in August the Stein variational gradient descent paper popped up, and I really enjoyed it. I wrote a blog post on it, sent it to Qiang for comments, and suddenly the "wild variational approximation" idea emerged from our discussions, so we decided to collaborate on it. Apparently the OPVI authors are also very excited about it, and so this NIPS we gathered for a lunch chat and had a "cheers" for Stein's method.
In summary, the next big thing I'm very excited about is using "truly black-box", or "wild" (as the black-box VI name has already been taken by MC methods), q distributions to improve the accuracy of approximate inference, which would allow us to use probabilistic models to their full power. As the "outsider of the Bayesian inference community" Ian Goodfellow pointed out in his invited talk at the Bayesian deep learning workshop, it would be really nice if we could use some inference engine (e.g. a neural network) to generate (approximate) posterior samples. He talked a little bit about ideas borrowed from GANs, which are also discussed in my approximate inference workshop submission with Qiang, and we discussed many more potential directions in addition to that.
I feel that "wild variational approximation" is a long-term project that definitely cannot be finished in, say, a few conference paper submissions, or two chapters of my thesis. But I also think I am very lucky to have found this exciting research direction, which I'd love to keep working on after my PhD. Finally, as a side note, I'm still very excited about divergences and information-theoretic approaches to Bayesian inference, and I will definitely continue working towards a better understanding of EP-related methods. I feel I'm also starting to put in more theoretical analysis now, which I think is a good thing.

My paper list on deep generative models

I am quite interested in generative models, so I think it's a good idea to summarize what I know so far and which papers I think are "must reads" if you want to understand what's going on in the field. I haven't finished all of them though, so this will also be a good chance to create a "checklist" -- I'll go back and tick off the unfinished papers! Also, this list could get very long if I included many recent advances, so I'll just keep the ones with big ideas -- I can do another list for the rest later!

 

Variational Auto-Encoders (VAE)

I think the VAE idea has revolutionized the approximate inference and graphical models field. Since neural networks are so successful as function approximators, why not use them to compute the joint/conditional distributions/factors/potential functions of a directed/undirected graphical model, as well as the approximate posterior/marginal distributions/factors? Hundreds of papers have since built on this idea.
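To make the idea concrete, here is a minimal sketch (PyTorch, with hypothetical layer sizes) of the recipe: a neural-network encoder for q(z|x), a decoder for p(x|z), and training by maximizing the variational lower-bound. It is meant as an illustration of the general idea rather than any particular paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: Gaussian encoder q(z|x), Bernoulli decoder p(x|z)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):  # made-up sizes
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec_out(torch.tanh(self.dec(z)))
        # negative ELBO = reconstruction term + KL[q(z|x) || N(0, I)]
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (recon + kl) / x.shape[0]

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784).bernoulli()   # stand-in for a batch of binarised images
opt.zero_grad()
loss = model(x)
loss.backward()
opt.step()
```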

The must-read papers start from Kingma & Welling's Auto-Encoding Variational Bayes paper. At roughly the same time DeepMind came up with the same idea. Then the authors of the above two papers teamed up and published a paper on semi-supervised learning. In this context a recent attempt which adds auxiliary variables also deserves a detailed read. I'm not going to recommend the variational Gaussian process paper for practitioners, but essentially it contains similar ideas to the auxiliary DGM paper.

The original framework uses the variational lower-bound to do approximate MLE. IWAE improved on this by proposing a new objective function involving importance sampling, which has been tested on RNN versions as well. Shameless plug here: I have shown that IWAE does not return the best approximation to the marginal likelihood. You should also read a similar algorithm, "reweighted wake-sleep", if you want to know more, and you'll probably need to check out the original wake-sleep paper too.
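For reference (in my own notation, not the papers'), with q the encoder and K importance samples the two objectives are

\mathcal{L}_{\text{VAE}}(x) = \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right], \quad \mathcal{L}_{\text{IWAE}}^{K}(x) = \mathbb{E}_{z_1, \dots, z_K \sim q(z|x)} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k|x)} \right],

and \mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{IWAE}}^{1} \leq \mathcal{L}_{\text{IWAE}}^{K} \leq \log p(x), so the IWAE objective gives a tighter (or at least no looser) bound on the marginal likelihood as K grows.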

VAEs are very flexible, as you can imagine from the summary, so they can easily be extended to model sequential data. In this regime the variational RNN paper could be interesting, and you might want to check out some of the related papers it cites as well.

RNNs + attention models also work well on images, and the DRAW paper is the one that combined them with the VAE framework. Similar papers on the experimental side include the one on image captioning and the convolutional version of DRAW.

There are papers that combine the VAE idea with invertible transformations as well. But I'll list them in later paragraphs just to make sure this part is not too long.

 

Generative Adversarial Networks (GAN)

GANs are even more flexible than VAEs. In short, you train a powerful generative model by fooling another powerful discriminator into believing that "the image I'm showing you is from the real data". I have a post that discusses the maths in detail, and I also recommend reading a series of blog posts by Ferenc Huszár.
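For completeness, the objective from the original paper is the minimax game

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log (1 - D(G(z)))],

where the generator G maps noise z to images and the discriminator D tries to tell real images from generated ones.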

Of course the must-read is the original paper. But the GAN idea has also been discussed in the statistical inference context, which has been largely ignored, and I don't think it should be. A closely related idea is Noise Contrastive Estimation (NCE), and Ian Goodfellow has also discussed the links and differences between the two.

GANs are notoriously hard to train. Two experimental papers from Facebook & NYU (i.e. LAPGAN & DCGAN) are must-reads for practitioners. I haven't read the recent proposal from OpenAI in detail though. But you should definitely try to train a GAN yourself to get a feeling for how to tune it -- I did so. I'd say this difficulty is one of the main reasons that biases me towards VAEs, even though GANs have superior performance in the image domain; but see this interesting paper that tries to combine the advantages of both sides. Alternatively, the maximum mean discrepancy (MMD) and the kernel two-sample test have been considered, e.g. see this and this paper.
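If you want to try it yourself, here is a minimal sketch of the alternating updates (PyTorch, with made-up toy networks and data; real image models need far more care):

```python
import torch
import torch.nn as nn

z_dim, x_dim = 16, 2   # toy 2-D "data" so the sketch runs quickly
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def sample_data(n):
    return torch.randn(n, x_dim) * 0.5 + 2.0   # stand-in for real data

for step in range(1000):
    # 1) discriminator step: real -> label 1, fake -> label 0
    x_real = sample_data(128)
    x_fake = G(torch.randn(128, z_dim)).detach()
    loss_d = bce(D(x_real), torch.ones(128, 1)) + bce(D(x_fake), torch.zeros(128, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) generator step: non-saturating loss (push D to call fakes "real")
    x_fake = G(torch.randn(128, z_dim))
    loss_g = bce(D(x_fake), torch.ones(128, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```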

Some other interesting papers include adversarial auto-encoders (well, this one is actually much easier to train), a DRAW-like version of GAN, and the one that incorporates some contextual information into the modeling process. I also like this paper that extends the GAN framework to f-divergences via duality. Using GAN principles to learn an inference network (and also the same idea here) seems promising, but my experience of playing with them suggests that this approach can be even more unstable than the original framework.

 

Generative models with invertible transformations

If you don't like approximate inference and want something exact, here's the part for you. I remember first seeing this type of method in the NICE paper (nice name). The improved version is also on arxiv.
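The shared trick behind these models is the change-of-variables formula: if z = f(x) is invertible and z has a simple density p(z), then

\log p(x) = \log p(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,

so the exact likelihood is tractable whenever the Jacobian determinant is cheap to compute, which is exactly what the coupling layers in NICE are designed to ensure.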

From my perspective the most interesting work in this regime is the combination of VAEs and invertible transformations. I know this idea has been around for roughly 2-3 years (e.g. see this paper), but this ICML paper formalized it with lots of discussion. This year's NIPS has two interesting follow-up papers: one uses autoregressive functions and another tries to skip the randomness of the transformations. I should consider writing blog posts on these papers later.

 

Deep Boltzmann machines & deep belief networks

These models receive less attention these days, but I still love the beautiful derivations, and I think everyone who wants to work on generative models should get to know them. No need to introduce the seminal 2006 paper that brought neural networks back to the stage. No need either to mention that people then tried different variations of RBMs/deep belief networks for image/speech recognition. But after the introduction of deep Boltzmann machines, people found that these models are very hard to train with contrastive-divergence-like algorithms. Pre-training as a DBN and then switching to DBM training sort of worked, and this paper provided quite a nice interpretation of that, but the training difficulty remains in the fine-tuning step. There's another pre-training technique that approximately performs variational inference, but you need to be careful about the rescaling part. The doubly intractable problem has scared lots of practitioners away from investigating DBMs further.

On the deep sigmoid belief network side, over the last two years Lawrence Carin's group has produced quite a few related papers. I haven't read them in detail though.

 

Other attempts

Although not very appealing experimentally, this diffusion process paper from Surya Ganguli's group is an interesting read.

Since Bayesian people have made Gaussian processes deep, here's an attempt from Neil Lawrence's group at using deep GPs to generate images. It uses the VAE idea to derive the inference steps.

David Blei and his students love conjugate models and SVI, so it's not surprising that they've come up with a deep version of exponential families.

If you like the Gibbs sampling procedure of RBMs for data generation, here's another direction to go deeper other than DBMs.

I found this paper's approach a bit strange, but those who are interested in wake-sleep should get to know it.

Auto-regressive models are also an important class of generative models. For the neural-network-assisted version, NADE (and subsequent papers) is a must-read. This year's ICML best paper (pixel RNN) also belongs to this class.
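For reference, these models simply write the joint density with the product rule and let a network parameterize each conditional,

p(x) = \prod_{d=1}^{D} p(x_d | x_1, \dots, x_{d-1}),

so the likelihood is exact and sampling proceeds one dimension at a time.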

Finally this note on how to evaluate the performance of generative models deserves a detailed read.

GANs, mutual information, and possibly algorithm selection?

If you ask deep learning people "what is the best image generation model now", many of them would probably say "generative adversarial networks" (GAN). The original paper describes an adversarial process that asks the generator and the discriminator to fight against each other, and many people, including myself, like this intuition. What lessons can we learn from GANs for better approximate inference (which is my thesis topic)? I need to rewrite the framework in a language I'm comfortable with, hence this blog post as a research note.

 

Suppose there's a machine that can show you some images. This machine flips a coin to determine what it will show you. For heads, the machine shows you a gorgeous painting by a human artist. For tails, it shows you a forgery of a famous painting. Your job is to figure out whether the shown image is "real" or not.

 

Let me make this precise in mathematical language. Assume s \in \{0, 1\} denotes the outcome of that coin flip, where 0 represents tails and 1 represents heads. The coin could possibly be bent, so let's say your prior belief about the outcome is \tilde{p}(s = 0) = \pi. Then the above process is summarized as the following:

s \sim \tilde{p}(s), x \sim \tilde{p}(x|s),

\tilde{p}(x|s = 0) = p(x), \tilde{p}(x|s = 1) = p_D(x).

Here I use p_D(x) to denote the unknown data distribution and p(x) the actual generative model we want to learn. As discussed, the generative model doesn't want the image to reveal which side the coin landed on; in maths this is achieved by minimizing the mutual information

 \mathrm{I}[s; x] = \mathrm{KL}[\tilde{p}(s, x) || \tilde{p}(s) \tilde{p}(x)].

This is quite intuitive: \mathrm{I}[s; x] = 0 iff p_D(x) = p(x), in which case observing an image x tells you nothing about the outcome of the coin flip s. However, computing this mutual information requires evaluating p(x), which is generally intractable. Fortunately we can use a variational approximation similar to the variational information maximization algorithm. In detail, we construct a lower-bound by subtracting a KL divergence term:

\mathcal{L}[p; q] = \mathrm{I}[s; x] - \mathbb{E}_{\tilde{p}(x)}[\mathrm{KL}[\tilde{p}(s|x) || q(s|x)]] = \mathrm{H}[s] + \mathbb{E}_{\tilde{p}(s, x)}[\log q(s|x)].

Let's have a closer look at the equation. From the KL term we can interpret q(s|x) as the approximate posterior of s under the augmented model \tilde{p}. But more interestingly, q can also be viewed as the discriminator in the GAN framework. To see this, we expand the second term as

 \tilde{p}(s = 0) \mathbb{E}_{p(x)}[\log (1 - q(s = 1|x))] + \tilde{p}(s = 1) \mathbb{E}_{p_D(x)}[\log q(s = 1|x)].

Notice anything familiar? Indeed, when we pick the prior \pi = 0.5, the lower-bound \mathcal{L}[p; q] is exactly GAN's objective function (up to scaling and additive constants), and the mutual information \mathrm{I}[s; x] becomes the Jensen-Shannon divergence \mathrm{JS}[p(x)||p_D(x)], which is what GAN actually minimizes.
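As a quick sanity check (my own derivation, so read with care): maximizing the bound over q gives the exact posterior under the augmented model,

q^*(s = 1|x) = \tilde{p}(s = 1|x) = \frac{(1 - \pi) p_D(x)}{\pi p(x) + (1 - \pi) p_D(x)}, \quad \mathcal{L}[p; q^*] = \mathrm{I}[s; x],

and with \pi = 0.5 this is exactly the optimal discriminator D^*(x) = p_D(x) / (p(x) + p_D(x)) from the original GAN paper, so the inner maximization over q makes the bound tight and the outer minimization over p indeed minimizes the JS divergence.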

To summarize, GAN can be viewed as an augmented generative model trained by minimizing mutual information. The augmentation is smart in the sense that it uses label-like information s that can be obtained for free, which introduces a supervision signal to help unsupervised learning.

 

How useful is this interpretation? Well, in principle we can carefully design an augmented model and learn it by minimizing L2 loss/KL divergence/mutual information... you name it. But more interestingly, we can then do (automatic) divergence/algorithm selection as model selection in the "augmented" space. In the GAN example, the choice of \pi determines which JS divergence variant we use in training, e.g. see this paper. I'm not sure how useful it would be, but we could also learn \pi by, say, maximum likelihood, or even treat \pi as a latent variable and put a Beta prior on it.
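Concretely, for a general prior the mutual information is the \pi-weighted Jensen-Shannon divergence (in the notation above):

\mathrm{I}[s; x] = \pi \, \mathrm{KL}[p(x) || m(x)] + (1 - \pi) \, \mathrm{KL}[p_D(x) || m(x)], \quad m(x) = \pi p(x) + (1 - \pi) p_D(x),

which recovers the standard JS divergence at \pi = 0.5, so tuning \pi amounts to picking a member of this divergence family.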

Recently I started to think about automatic algorithm selection, probably because I'm tired of my reviewers complaining about my alpha-divergence papers: "I don't know how to choose alpha and you should give us some guidance". Tom Minka gives empirical guidance in his divergence measure tech report, and so do we in recent papers. I know this is an important but very difficult research topic, but at least I'm not the only one who has thought about it; e.g. in this paper the authors connected beta-divergences to Tweedie distributions and performed approximate maximum likelihood to select beta. Another interesting paper in this line is the "variational tempering" paper, which also models the annealing temperature in a probabilistic model. I like these papers because the core idea is very simple: we should use probabilistic modeling for algorithmic parameters too. Perhaps this also connects to Bayesian optimization, but I'm gonna stop here as the note is already a bit too long.

 

EDIT 06/08/16:

  1. I briefly presented this MI interpretation (also extended to ALI) to the MLG summer reading group and you can see some notes here.
  2. Ferenc also posted a discussion of infoGAN with the mutual information interpretation, which is very nice. He also has a series of great blog posts on GANs that I wish I had read earlier!

A general example of the failure of the 1-step learning algorithm

Here's a general case in which one-step learning fails to achieve the optimal parameters, inspired by [1].

Consider the exponential family model

P(x; \theta) = \frac{1}{Z(\theta)} \exp(\theta^T f(x)),

where f(x) = (f_1(x), f_2(x), ..., f_m(x)) (we can also think of it as a multi-expert system). Assume the variable x takes discrete states, and that there is a transition operator T(x) such that for any x there exists an integer k \in [1, m] with f_k(x) = f_k(T(x)). Then we can find a data set \{x_i\}_{i=1}^{N} satisfying

\exists k' \in [1, m] \text{ s.t. } f_{k'}(x) = f_{k'}(T(x)), \forall x \in \{x_i\}.

With these assumptions, we construct a Markov chain with T as the transition at each step, using the Metropolis method; then in the training process \Delta \theta_{k'} = 0, which means we cannot reach the optimum that the maximum likelihood algorithm would achieve.
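To see why the update vanishes: for this model the one-step learning rule compares feature averages at the data and after a single application of T,

\Delta \theta_k \propto \frac{1}{N} \sum_{i=1}^{N} \left[ f_k(x_i) - f_k(x_i') \right],

where x_i' is obtained from x_i by one Metropolis step using T. Since f_{k'}(x_i') = f_{k'}(x_i) on every training point (whether the proposal T(x_i) is accepted or rejected), every term cancels for k = k'.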

This statement indicates that the performance of one-step learning depends on the model distribution and on the transition used at each step of the Markov chain. The failure above will go away given enough training data points, but in some cases the parameters will never converge, or will converge to points other than the ML result; see the cases of the noisy swirl operator and the noisy star trek operator in [1] for details. Fortunately, training energy models by sampling from the factored likelihood and posterior at each step yields good results with CD-k, which will be explained below.

 

Let's consider the justification of the CD-k algorithm that Rich and I discussed in our meeting, to answer the questions I raised in the last research note. We learn energy models by maximum likelihood, i.e. we maximize

\langle \log P(x) \rangle_{P_0}

where P is our constructed (optimal) model and P_0 is the data distribution. It's easy to see

\max \langle \log P(x) \rangle_{P_0} \Leftrightarrow \min \left( -\mathrm{H}(P_0) - \langle \log P(x) \rangle_{P_0} \right),

and the right-hand side is the KL divergence \mathrm{KL}[P_0 || P]. For an energy model with parameters \theta we have

P(x) = \frac{1}{Z(\theta)} \exp(-E(x; \theta))

where E is differentiable w.r.t. \theta and Z(\theta) is some normalizer. Then by doing some math we get the derivative of the expectation of the log-likelihood w.r.t. \theta as follows:

- \frac{\partial}{\partial \theta} \langle \log P(x) \rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x; \theta) \rangle_{P_0} - \langle \frac{\partial}{\partial \theta} E(x; \theta) \rangle_P.
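The "some math" here is just the standard identity for the log-partition function:

\frac{\partial}{\partial \theta} \log Z(\theta) = \frac{1}{Z(\theta)} \int \exp(-E(x; \theta)) \left( -\frac{\partial}{\partial \theta} E(x; \theta) \right) dx = -\langle \frac{\partial}{\partial \theta} E(x; \theta) \rangle_P,

so differentiating \log P(x) = -E(x; \theta) - \log Z(\theta) and averaging over P_0 gives the two terms above.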

In the RBM case, we have the model joint distribution

P(x, h|\theta) = \frac{1}{Z(\theta)} \exp(-E(x, h; \theta)),

and similarly we get the derivative 

- \frac{\partial}{\partial \theta} \langle \log P(x) \rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x, h; \theta) \rangle_{P_0(x)P(h|x)} - \langle \frac{\partial}{\partial \theta} E(x, h; \theta) \rangle_{P(x, h|\theta)}.

However, the first expectation is intractable, and in practice we sample the training set \{ x_i\}_{i = 1}^{N} from P_0. In this case we approximate the first term of the derivative by \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial \theta} E(x_i, h; \theta), where h is either averaged over, or sampled from, the conditional distribution P(h|x_i; \theta). For better generalization, we don't want the ML algorithm to drive the KL divergence exactly to 0, since then P would just be the empirical distribution, i.e. an average of Dirac functions

P(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)

which assigns zero probability to any point outside the training set. That's why approximating the ML optimum can help smooth the model distribution.

Also notice that the derivative of the log-likelihood of the RBM contains the expectation of the energy function over both x and h under the joint distribution P(x, h|\theta), which is also intractable. In general we use samples from some MCMC method, e.g. Gibbs sampling, to approximate that term. Without loss of generality, assume \{x, h\} take discrete states and the Markov chain is aperiodic; then, alternately sampling from the conditionals P(h|x) and P(x|h), the Gibbs chain converges to an equilibrium distribution P_\infty with P_\infty = P. To make the statement precise, consider

P_1(x_2, h_2) = P(h_2|x_2) \sum_{h_1} P(x_2|h_1) P(h_1|x_1),

the joint distribution after one full step of Gibbs sampling starting from the training point x_1, and recursively P_n(x_{n+1}, h_{n+1}) = P(h_{n+1}|x_{n+1}) \sum_{x_n, h_n} P(x_{n+1}|h_n) P_{n-1}(x_n, h_n). Then we have P_t \rightarrow P as t \rightarrow \infty. This justifies truncating the Gibbs chain after k full steps of sampling, and Hinton asserted that experiments showed k = 1 already yields good training results.
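As an illustration, here is a small numpy sketch of CD-k for a binary RBM with E(x, h) = -x^T W h - b^T x - c^T h; the shapes and hyper-parameters are made up, but the update follows the positive-phase / negative-phase structure derived above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr, k = 784, 128, 0.05, 1      # k = 1 is Hinton's CD-1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b = np.zeros(n_vis)                          # visible bias
c = np.zeros(n_hid)                          # hidden bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_update(x):
    """One CD-k parameter update from a batch of binary visible vectors x."""
    global W, b, c
    # positive phase: P(h | x) evaluated at the data
    ph_data = sigmoid(x @ W + c)
    # negative phase: k full steps of Gibbs sampling starting from the data
    v = x
    for _ in range(k):
        h = (rng.random((x.shape[0], n_hid)) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(x.shape) < sigmoid(h @ W.T + b)).astype(float)
    ph_model = sigmoid(v @ W + c)
    n = x.shape[0]
    # gradient estimate: <.>_data minus <.>_model, the latter from the truncated chain
    W += lr * (x.T @ ph_data - v.T @ ph_model) / n
    b += lr * (x - v).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)

x_batch = (rng.random((64, n_vis)) < 0.5).astype(float)   # stand-in for data
cd_k_update(x_batch)
```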

 

References

[1] David J. C. MacKay, Failures of the One-Step Learning Algorithm, 2001.