04/Oct/2013 – 2Cents

Here's a general case that the 1-step learning will failed to achieve the optimal parameters, inspired by [1].

Consider the exponential family model

$P(x; \theta) = \frac{1}{Z(\theta)} exp(\theta^Tf(x)),$

where

$f(x) = (f_1(x), f_2(x), ..., f_m(x))$ (we can also consider it as a multi-expert system). Assume the variable

$x$ takes discrete states and a transition operator

$T(x)$ s.t. for any

$x$ , there exists an integer

$k \in [1, N]$ such that

$f_k(x) = f_k(T(x))$ . Then we can find some data set

$\{x_i\}_{i=1}^{N}$ satisfying

$\exists k' \in [1, N] s.t. f_{k'}(x) = f_{k'}(T(x)), \forall x \in \{x_i\}.$

With these assumptions we construct a markov chain with T as the transition in each step, using the Metropolis method, then in the training process

$\Delta \theta_{k'} = 0$ , which means that we cannot achieve the optimum as the maximum likelihood algorithm does.

The statement indicates that the performance of the one-step learning depends on the model distribution and the transition in each step of the Markov chain. However, though the failure above will be addressed if provided enough training data points, in some cases the parameters will never converge or converge to the other points rather than the ML result. See the cases of noisy swirl operator and noisy star trek operator in [1] for details. But fortunately, training energy models with sampling from factored likelihood and posterior in each step yields good results with CD-k, which will be explained below.

Let's consider the justification of the CD-k algorithm Rich and I discussed in our meeting, to answer the questions I raised in the last research note. We learn the energy models by the maximum likelihood method, i.e. maximize

$\langle log P(x)\rangle_{P_0}$

where

$P$ is our constructed (optimal) model and

$P_0$ is the data distribution. It's easy to see

$max \langle log P(x)\rangle_{P_0} \Leftrightarrow min -H(P_0) - \langle log P(x)\rangle_{P_0},$

and the right hand side term is the KL-divergence

$P_0||P$ . For the energy model with coefficients

$\theta$ we have

$P(x) = \frac{1}{Z(\theta)} exp(-E(x; \theta))$

where

$E$ is differentiable w.r.t.

$\theta$ and

$Z(\theta)$ is some normalizer. Then by doing some math we get the derivative of the expectation of the log-likelihood w.r.t.

$\theta$ as follows:

$- \frac{\partial}{\partial \theta} \langle log P(x)\rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_{P_0} - \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_P.$

In the RBM case, we have the model joint distribution

$P(x, h|\theta) = \frac{1}{Z(\theta)} exp(-E(x, h; \theta)),$

and similarly we get the derivative

$- \frac{\partial}{\partial \theta} \langle log P(x)\rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x, h; \theta)\rangle_{P_0(x)P(h|x)} - \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_{P(x, h|\theta)}.$

However, the expectation is intractable, and in practice we sample the training set $\{ x_i\}_{i = 1}^{N}$ from $P_0$ . In this case we approximate the first term of the derivative by $\frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial \theta} E(x_i, h; \theta)$ , where $h$ equals to, or is sampled from, the conditional distribution $P(h|x_i; \theta)$ . For better generalization, we don't want the ML algorithm to minimize the KL-divergence exactly to 0, as P will be the average of Dirac functions

$P(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)$

which makes the probability be zero at the points other than those in the training set. That's why the approximation of the ML optimum can help to smooth the model distribution.

Also notice that the derivative of the log-likelihood of the RBM contains the expectation of the energy function over both $x$ and $h$ with the joint distribution $P(x, h|\theta)$ , which is also intractable. In general we use the samples from some MCMC method, e.g. Gibbs sampling, to approximate that derivative. Without loss of generality we assume $\{x, h\}$ have discrete states and the Markov chain is aperiodic, then consider the sampling conditional probability $P(h|x)$ and $P(x|h)$ , the Gibbs chain will converge to the equilibrium distribution $P_\infty$ and $P_\infty = P$ . To specify the statement, consider

$P_1(x_2, h_2) = P(h_2|x_2)P(x_2|h_1)P(h_1|x_1)$

is the joint distribution we get after 1 full-step sampling from the training data

$x_1$ and

$P_n(x_{n+1}, h_{n+1}) = P(h_{n+1}| x_{n+1})P(x_{n+1}|h_n)$ , then we have

$P_t \rightarrow P, t \rightarrow \infty.$ So that justify the truncating of the Gibbs train after k full-step sampling, and Hinton asserted that experiments showed that k = 1 yielded good training results.

References

[1] David J. C. MacKay, Failures of the One-Step Learning Algorithm, 2001.