The Information Sieve

I recently read an interesting paper on unsupervised learning:

The Information Sieve
Greg Ver Steeg and Aram Galstyan, ICML 2016,
http://arxiv.org/abs/1507.02284

I really like the paper. In their ICML talk (a shame that I didn't manage to attend!) they illustrated the intuition with a soup-and-sieve analogy.

[Figure: the soup-and-sieve analogy from the ICML talk]

The idea is very simple. Suppose we want to learn a good representation of some observed data. The information sieve procedure then rests on two ideas. First, we decompose learning into several stages, such that each stage extracts information only from the remainder left by the previous stage. In the soup analogy, the first stage catches the meat and lets the other ingredients through, then the second stage catches the mushrooms from the remainder of the first stage. Second, at each stage we want to extract as much information as possible. How much we can extract depends on the functional class we use to learn representations: in the soup analogy, if the sieve's holes are small enough, we might catch the mushrooms along with the meat already in the first stage.

 

Let's go back to the mathematical details. The representation learning problem is formulated as correlation characterization, i.e. for data X = (X_1, ..., X_N) with N components, we want to learn a representation Y such that the components X_i are independent of each other given Y. For many problems we won't be able to learn such a Y in one stage, so why not do it recursively? Assume the input at stage k is X^k, with X^0 = X at stage 0; then we repeat the following:

  1. find a function f^k such that Y^k = f^k(X^k) and TC(X^k|Y^k) is minimized; TC denotes "total correlation", defined as the KL divergence from the joint distribution to the product of its marginals (written out below), which is just a multivariate generalization of mutual information.
  2. construct the remainder X^{k+1} such that each X_i^{k+1} contains no information about Y^k, and X^k can be (perfectly) reconstructed from X^{k+1} and Y^k.
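For reference, the quantities involved are (standard definitions, written in the notation above):

TC(X) = \mathrm{KL}[p(x_1, ..., x_N) || \prod_{i=1}^{N} p(x_i)] = \sum_{i=1}^{N} \mathrm{H}(X_i) - \mathrm{H}(X),

TC(X^k|Y^k) = \sum_{i=1}^{N} \mathrm{H}(X_i^k|Y^k) - \mathrm{H}(X^k|Y^k),

and the remainder in step 2 should satisfy \mathrm{I}[X_i^{k+1}; Y^k] = 0 for every i, together with X^k = g(X^{k+1}, Y^k) for some reconstruction function g.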

 

The concept of the algorithm is sound and very intriguing. I think in general we can replace step 1 with other criteria for learning a representation from the input. Naturally, the construction of the remainder X^{k+1} depends on how we learn Y^k and on what we want to model in the next stage. For example, we could use a reconstruction loss in step 1 and put the reconstruction residuals into the remainder; a toy sketch of this variant follows.
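Here is a minimal sketch of that variant (my own toy illustration, not the paper's algorithm; the data and dimensions below are made up): each stage learns a one-dimensional linear code Y^k by minimizing squared reconstruction error, i.e. the top principal component, and passes the residual on as the remainder.

import numpy as np

def sieve_stage(X):
    """X: (n_samples, n_features) array. Returns code y, remainder R, direction w."""
    Xc = X - X.mean(axis=0)
    # the top principal direction gives the best rank-1 reconstruction in squared error
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    w = Vt[0]                    # unit-norm projection / reconstruction vector
    y = Xc @ w                   # the learned representation Y^k (one number per sample)
    R = Xc - np.outer(y, w)      # remainder: the part of X^k that Y^k fails to reconstruct
    return y, R, w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy correlated data
y1, R1, _ = sieve_stage(X)       # stage 1 extracts the dominant factor
y2, R2, _ = sieve_stage(R1)      # stage 2 sees only the remainder of stage 1
print(X.var(), R1.var(), R2.var())  # residual variance shrinks stage by stage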

 

I was very happy with the sieve idea until I hit a confusing point in the paper. The authors emphasize that the remainder vector X^{k+1} should contain Y^{k}, i.e. X^{k+1} = (X_1^{k+1}, ..., X_N^{k+1}, Y^{k}). This sounds like putting the meat back into the soup and draining it again, which is a bit strange -- so I guess the soup analogy is not entirely appropriate. I can sort of see why this proposal is sensible, but I think in general X^{k+1} just needs to be a (probabilistic) function of X^k and Y^k.

 

The sieve algorithm also reminds me of the residual network that won the ILSVRC and COCO challenges in 2015. There, a linear projection of the input is added to the output of the layer, i.e. \textbf{h} = f(\textbf{x}, \textbf{W}) + \textbf{W}_{s}\textbf{x}. Roughly speaking, the linear part \textbf{W}_s \textbf{x} can be viewed as Y^k, and the non-linear part f can be viewed as (X_1^{k+1}, ..., X_N^{k+1}). Does that mean we can adapt the sieve algorithm to train a residual network on unsupervised learning problems?

Before answering that, notice that the residual net uses the sum (instead of the concatenation) of these two parts. So you might worry, "am I losing something by replacing [\textbf{W}_s \textbf{x}, f(\textbf{x}, \textbf{W})] with the sum of the two?", and want to work out the loss in bits. I spent some time with the theorems in the paper trying to figure this out, but they seem to depend on the concatenation assumption and it's not clear to me how the bounds change when using summation.

However, there's another way to think about this analogy. A simple calculation shows that \textbf{W}(\textbf{a} + \textbf{b}) = [\textbf{W}, \textbf{W}] [\textbf{a}^T, \textbf{b}^T]^T (a quick numerical check is below). In other words, residual nets restrict the matrices that pre-multiply Y^k and f to be identical, and the sieve algorithm is more general here. I'm not saying you should always prefer the sieve alternative: for residual networks, which typically have tens or even hundreds of layers with hundreds of neurons per layer, you probably don't want to double the number of parameters just to get a bit of improvement. The original residual network paper motivated the architecture computationally, as a way to ease the optimization of very deep networks (e.g. avoiding vanishing/exploding gradients). So it would be very interesting to understand why residual learning helps statistically, and the sieve algorithm might provide some insight here.
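A quick numerical sanity check of that identity (illustration only, with arbitrary dimensions):

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
a, b = rng.normal(size=4), rng.normal(size=4)

lhs = W @ (a + b)                                   # residual-style sum
rhs = np.hstack([W, W]) @ np.concatenate([a, b])    # concatenation with tied weights
print(np.allclose(lhs, rhs))                        # True: the sum is a tied-weight special case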

 

 

GANs, mutual information, and possibly algorithm selection?

If you ask deep learning people "what is the best image generation model now?", many of them would probably say "generative adversarial networks" (GANs). The original paper describes an adversarial process in which the generator and the discriminator fight against each other, and many people, including myself, like this intuition. What lessons can we learn from GANs for better approximate inference (which is my thesis topic)? I need to rewrite the framework in a language I'm comfortable with, hence this blog post as a research note.

 

Suppose there's a machine that can show you some images. This machine flips a coin to determine what it will show you. For heads, the machine shows you a gorgeous painting by a human artist. For tails, it shows you a forgery of a famous painting. Your job is to figure out whether the image shown is "real" or not.

 

Let me make this precise in mathematical language. Assume s \in \{0, 1\} denotes the outcome of that coin flip, where 0 represents tails and 1 represents heads. The coin may be biased, so let's say your prior belief about the outcome is \tilde{p}(s = 0) = \pi. Then the above process is summarized as follows:

s \sim \tilde{p}(s), x \sim \tilde{p}(x|s),

\tilde{p}(x|s = 0) = p(x), \tilde{p}(x|s = 1) = p_D(x).

Here I use p_D(x) for the unknown data distribution and p(x) for the generative model we actually want to learn. As discussed, the generator doesn't want you to be able to tell how the coin landed; in math, this is achieved by minimizing the mutual information

 \mathrm{I}[s; x] = \mathrm{KL}[\tilde{p}(s, x) || \tilde{p}(s) \tilde{p}(x)].

This is quite intuitive: \mathrm{I}[s; x] = 0 if and only if p_D(x) = p(x), and in that case observing an image x tells you nothing about the outcome of the coin flip s. However, computing this mutual information requires evaluating p(x), which is generally intractable. Fortunately we can use a variational approximation similar to the variational information maximization algorithm. In detail, we construct a lower bound by subtracting a KL divergence term:

\mathcal{L}[p; q] = \mathrm{I}[s; x] - \mathbb{E}_{\tilde{p}(x)}[\mathrm{KL}[\tilde{p}(s|x) || q(s|x)]] = \mathrm{H}[s] + \mathbb{E}_{\tilde{p}(s, x)}[\log q(s|x)].
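To spell out the second equality (filling in a step left implicit above), write both terms with entropies:

\mathrm{I}[s; x] = \mathrm{H}[s] - \mathrm{H}[s|x], \quad \mathbb{E}_{\tilde{p}(x)}[\mathrm{KL}[\tilde{p}(s|x) || q(s|x)]] = -\mathrm{H}[s|x] - \mathbb{E}_{\tilde{p}(s, x)}[\log q(s|x)],

and subtracting the second from the first gives \mathrm{H}[s] + \mathbb{E}_{\tilde{p}(s, x)}[\log q(s|x)]. Since the KL term is non-negative, \mathcal{L}[p; q] is indeed a lower bound on \mathrm{I}[s; x], tight when q(s|x) = \tilde{p}(s|x).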

Let's have a closer look at the equations. From the KL term we can interpret q(s|x) as the approximate posterior of s under the augmented model \tilde{p}. But more interestingly, q can also be viewed as the discriminator in the GAN framework. To see this, we expand the second term as

 \tilde{p}(s = 0) \mathbb{E}_{p(x)}[\log (1 - q(s = 1|x))] + \tilde{p}(s = 1) \mathbb{E}_{p_D(x)}[\log q(s = 1|x)].

Notice anything familiar? Indeed, when we pick the prior \pi = 0.5, the lower bound \mathcal{L}[p; q] is exactly GAN's objective function (up to scaling and additive constants), and the mutual information \mathrm{I}[s; x] becomes the Jensen-Shannon divergence \mathrm{JS}[p(x)||p_D(x)], which is what GAN actually minimizes (a quick numerical check is below).
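Here is a toy numerical check of that claim on a small discrete example (illustration only; the distributions below are made up):

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p   = np.array([0.1, 0.2, 0.3, 0.4])        # model distribution p(x)
p_D = np.array([0.25, 0.25, 0.25, 0.25])    # "data" distribution p_D(x)
pi  = 0.5                                    # fair coin: p~(s = 0) = 0.5

joint = np.vstack([pi * p, (1 - pi) * p_D])  # p~(s, x), rows indexed by s
p_s = joint.sum(axis=1, keepdims=True)       # p~(s)
p_x = joint.sum(axis=0, keepdims=True)       # p~(x) = pi*p + (1-pi)*p_D
mi = np.sum(joint * np.log(joint / (p_s * p_x)))  # I[s; x]

m = 0.5 * (p + p_D)
js = 0.5 * kl(p, m) + 0.5 * kl(p_D, m)       # JS[p || p_D]
print(np.isclose(mi, js))                    # True: with pi = 0.5, I[s; x] = JS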

To summarize, GAN can be viewed as an augmented generative model trained by minimizing mutual information. The augmentation is clever in that it uses label-like information s that can be obtained for free, which introduces a supervision signal to help unsupervised learning.

 

How useful is this interpretation? Well, in principle we can carefully design an augmented model and learn it by minimizing L2 loss/KL divergence/mutual information... you name it. More interestingly, we can then do (automatic) divergence/algorithm selection as model selection in the "augmented" space. In the GAN example, the choice of \pi determines which variant of the JS divergence we use in training, see e.g. this paper. I'm not sure how useful it would be, but we could also learn \pi by, say, maximum likelihood, or even treat \pi as a latent variable and put a Beta prior on it.

Recently I started to think about automatic algorithm selection, probably because I'm tired of reviewers complaining about my alpha-divergence papers: "I don't know how to choose alpha and you should give us guidance". Tom Minka gives empirical guidance in his divergence measure tech report, and we do the same in recent papers. I know this is an important but very difficult research topic, but at least I'm not the only one thinking about it: in this paper the authors connect beta-divergences to Tweedie distributions and perform approximate maximum likelihood to select beta. Another interesting paper along this line is the "variational tempering" paper, which also models the annealing temperature as part of the probabilistic model. I like these papers because the core idea is very simple: we should use probabilistic modeling for algorithmic parameters as well. Perhaps this also connects to Bayesian optimization, but I'll stop here as the note is already a bit too long.

 

EDIT 06/08/16:

  1. I briefly presented this MI interpretation (also extended to ALI) to the MLG summer reading group and you can see some notes here.
  2. Ferenc also posted a very nice discussion of infoGAN through the mutual information interpretation. He also has a series of great blog posts on GANs that I wish I had read earlier!

A general example of the failure of the 1-step learning algorithm

Here's a general case in which 1-step learning fails to reach the optimal parameters, inspired by [1].

Consider the exponential family model

 P(x; \theta) = \frac{1}{Z(\theta)} exp(\theta^Tf(x)),

where f(x) = (f_1(x), f_2(x), ..., f_m(x)) (we can also view this as a multi-expert system). Assume the variable x takes discrete states, and that there is a transition operator T(x) such that for any x there exists an integer k \in [1, m] with f_k(x) = f_k(T(x)). Then we can find a data set \{x_i\}_{i=1}^{N} satisfying

\exists k' \in [1, m] \text{ s.t. } f_{k'}(x) = f_{k'}(T(x)), \forall x \in \{x_i\}_{i=1}^{N}.

With these assumptions, we construct a Markov chain that uses T as the proposal in each step with Metropolis acceptance; then throughout training \Delta \theta_{k'} = 0 (see the short derivation below), which means we cannot reach the optimum that the maximum likelihood algorithm would achieve.
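To see why this component of the update vanishes (my reading of the argument in [1]): the one-step update for \theta_k compares the empirical average of f_k with its average after a single step of the chain started at the data,

\Delta \theta_{k} \propto \frac{1}{N} \sum_{i=1}^{N} \left[ f_{k}(x_i) - f_{k}(x_i') \right],

where x_i' is x_i after one Metropolis step, i.e. either x_i itself or T(x_i). Since f_{k'}(x_i) = f_{k'}(T(x_i)) for every training point, every term vanishes for k = k', so \Delta \theta_{k'} = 0 regardless of the value of \theta.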

The statement indicates that the performance of one-step learning depends on the model distribution and on the transition used in each step of the Markov chain. The particular failure above can be fixed given enough training data, but in some cases the parameters never converge, or converge to points other than the ML solution; see the noisy swirl operator and the noisy star trek operator in [1] for details. Fortunately, training energy-based models by sampling from the factored likelihood and posterior in each step yields good results with CD-k, as explained below.

 

Let's now consider the justification of the CD-k algorithm that Rich and I discussed in our meeting, to answer the questions I raised in the last research note. We learn energy-based models by maximum likelihood, i.e. we maximize

 \langle \log P(x)\rangle_{P_0}

where P is the model we construct and P_0 is the data distribution. It is easy to see that

\max \langle \log P(x)\rangle_{P_0} \Leftrightarrow \min \left( -H(P_0) - \langle \log P(x)\rangle_{P_0} \right),
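Spelling out why the right-hand side is a KL divergence (a standard identity, included for completeness):

-H(P_0) - \langle \log P(x)\rangle_{P_0} = \langle \log P_0(x) \rangle_{P_0} - \langle \log P(x) \rangle_{P_0} = \mathrm{KL}[P_0 || P],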

so the right-hand side is exactly the KL divergence from the data distribution P_0 to the model P. For the energy model with parameters \theta we have

 P(x) = \frac{1}{Z(\theta)} exp(-E(x; \theta))

where E is differentiable w.r.t. \theta and Z(\theta) is the normalizing constant. A short calculation (written out below) gives the derivative of the expected log-likelihood w.r.t. \theta:

 - \frac{\partial}{\partial \theta} \langle \log P(x)\rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_{P_0} - \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_P.
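For completeness, the intermediate step: since \log P(x) = -E(x; \theta) - \log Z(\theta) and

\frac{\partial}{\partial \theta} \log Z(\theta) = \frac{1}{Z(\theta)} \sum_x \exp(-E(x; \theta)) \left( -\frac{\partial}{\partial \theta} E(x; \theta) \right) = -\langle \frac{\partial}{\partial \theta} E(x; \theta) \rangle_P,

averaging \frac{\partial}{\partial \theta} \log P(x) over P_0 and negating gives the expression above.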

In the RBM case, we have the model joint distribution

 P(x, h|\theta) = \frac{1}{Z(\theta)} exp(-E(x, h; \theta)),

and similarly we get the derivative 

 - \frac{\partial}{\partial \theta} \langle \log P(x)\rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x, h; \theta)\rangle_{P_0(x)P(h|x)} - \langle \frac{\partial}{\partial \theta} E(x, h; \theta)\rangle_{P(x, h|\theta)}.

However, these expectations are intractable. In practice the training set \{x_i\}_{i=1}^{N} serves as samples from P_0, so we approximate the first term of the derivative by \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial \theta} E(x_i, h; \theta), where h is either set to its conditional mean under, or sampled from, the conditional distribution P(h|x_i; \theta). For better generalization, we don't actually want to minimize the KL divergence exactly to 0, since then P would be the average of Dirac functions

P(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)

which puts zero probability on every point outside the training set. That's why only approximating the ML optimum can help smooth the model distribution.

Also notice that the derivative of the RBM log-likelihood contains an expectation of the energy gradient over both x and h under the joint distribution P(x, h|\theta), which is also intractable. In practice we use samples from some MCMC method, e.g. Gibbs sampling, to approximate that term. Assume \{x, h\} take discrete states and the Markov chain is aperiodic and irreducible; then, alternately sampling from the conditionals P(h|x) and P(x|h), the Gibbs chain converges to the equilibrium distribution P_\infty, and P_\infty = P. To make the statement concrete, consider

P_1(x_2, h_2) = \sum_{h_1} P(h_2|x_2) P(x_2|h_1) P(h_1|x_1),

the joint distribution after one full step of Gibbs sampling starting from a training point x_1, and recursively P_n(x_{n+1}, h_{n+1}) = \sum_{x_n, h_n} P(h_{n+1}|x_{n+1}) P(x_{n+1}|h_n) P_{n-1}(x_n, h_n); then P_t \rightarrow P as t \rightarrow \infty. This justifies truncating the Gibbs chain after k full steps, and Hinton reported that experiments show k = 1 already yields good training results (a minimal CD-1 sketch is below).
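To make the procedure concrete, here is a minimal CD-1 sketch for a small binary RBM (an illustrative toy implementation of the update described above, with made-up sizes, random "data" and an untuned learning rate; not meant as a definitive recipe). The energy is E(x, h) = -x^T W h - b^T x - c^T h.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr=0.05):
    """One CD-1 gradient step on a minibatch X of shape (n, n_visible)."""
    # positive phase: h ~ P(h|x) at the data
    ph_data = sigmoid(X @ W + c)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # one full Gibbs step: x' ~ P(x|h), then P(h|x')
    px_model = sigmoid(h_data @ W.T + b)
    x_model = (rng.random(px_model.shape) < px_model).astype(float)
    ph_model = sigmoid(x_model @ W + c)

    # parameter updates: data statistics minus one-step model statistics
    n = X.shape[0]
    W += lr * (X.T @ ph_data - x_model.T @ ph_model) / n
    b += lr * (X - x_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c

# toy usage: 6 visible and 3 hidden units, random binary "data"
n_visible, n_hidden = 6, 3
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
X = (rng.random((100, n_visible)) < 0.5).astype(float)
for _ in range(50):
    W, b, c = cd1_update(X, W, b, c)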

 

References

[1] David J. C. MacKay, Failures of the One-Step Learning Algorithm, 2001.