A general example of the failure of the 1-step learning algorithm

Here is a general case in which one-step learning fails to reach the optimal parameters, inspired by [1].

Consider the exponential family model

 P(x; \theta) = \frac{1}{Z(\theta)} exp(\theta^Tf(x)),

where f(x) = (f_1(x), f_2(x), ..., f_m(x)) (we can also view it as a product-of-experts model). Assume the variable x takes discrete states, and that there is a transition operator T such that for any x there exists an integer k \in [1, m] with f_k(x) = f_k(T(x)). Then we can find a data set \{x_i\}_{i=1}^{N} satisfying

\exists k' \in [1, m] s.t. f_{k'}(x) = f_{k'}(T(x)), \forall x \in \{x_i\}.

With these assumptions, construct a Markov chain that uses T as the proposal at each step, accepted or rejected by the Metropolis rule. Then in the training process \Delta \theta_{k'} = 0 at every update: whether the move is accepted or not, f_{k'} is unchanged after one step, so the data term and the one-step sample term of the gradient estimate cancel exactly. Hence we cannot reach the optimum that the maximum likelihood algorithm achieves.
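As a toy numerical check of this failure mode (the features, transition operator, data set, and parameter values below are my own illustrative choices, not taken from [1]): since f_{k'} is invariant under T, the k'-th component of the one-step update is exactly zero whether or not the Metropolis move is accepted.

import numpy as np

# States are the integers 0..3 with two features; T flips the low bit, so the
# first feature f_1(x) = x // 2 is invariant under T for every state x.
def f(x):
    return np.array([x // 2, x % 2], dtype=float)

def T(x):
    return x ^ 1   # involution: T(T(x)) = x, so the proposal is symmetric

def one_metropolis_step(x, theta, rng):
    x_new = T(x)
    log_ratio = theta @ (f(x_new) - f(x))        # log P(x_new)/P(x); Z(theta) cancels
    return x_new if np.log(rng.random()) < log_ratio else x

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.7])
data = np.array([0, 1, 2, 3, 1, 2])              # toy "training set"

# One-step update direction: <f>_data - <f>_one-step samples.
f_data = np.mean([f(x) for x in data], axis=0)
f_step = np.mean([f(one_metropolis_step(x, theta, rng)) for x in data], axis=0)
print(f_data - f_step)                           # first component is exactly 0.0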

The argument above shows that the performance of one-step learning depends on both the model distribution and the transition used at each step of the Markov chain. Although the failure above can be mitigated given enough training data, in some cases the parameters never converge, or converge to points other than the ML result; see the noisy swirl operator and the noisy star trek operator in [1] for details. Fortunately, training energy models by sampling from the factored likelihood and posterior at each step still yields good results with CD-k, which will be explained below.

 

Let's consider the justification of the CD-k algorithm that Rich and I discussed in our meeting, to answer the questions I raised in the last research note. We learn energy models by the maximum likelihood method, i.e. we maximize

 \langle log P(x)\rangle_{P_0}

where P is the model we construct and P_0 is the data distribution. It is easy to see that

max \langle log P(x)\rangle_{P_0} \Leftrightarrow min \left[ -H(P_0) - \langle log P(x)\rangle_{P_0} \right],

and the right-hand side is the KL divergence KL(P_0 \| P) = \langle log P_0(x) - log P(x) \rangle_{P_0} = -H(P_0) - \langle log P(x) \rangle_{P_0}. For the energy model with coefficients \theta we have

 P(x) = \frac{1}{Z(\theta)} exp(-E(x; \theta))

where E is differentiable w.r.t. \theta and Z(\theta) is the normalizing constant. A short calculation then gives the derivative of the expected log-likelihood w.r.t. \theta as follows:

 - \frac{\partial}{\partial \theta} \langle log P(x)\rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_{P_0} - \langle \frac{\partial}{\partial \theta} E(x; \theta)\rangle_P.
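To spell out that calculation: since log P(x) = -E(x; \theta) - log Z(\theta) and

 \frac{\partial}{\partial \theta} log Z(\theta) = \frac{1}{Z(\theta)} \sum_x exp(-E(x; \theta)) \left( -\frac{\partial}{\partial \theta} E(x; \theta) \right) = -\langle \frac{\partial}{\partial \theta} E(x; \theta) \rangle_P,

taking the expectation of log P(x) over P_0 and negating the derivative yields exactly the two terms above.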

In the RBM case, we have the model joint distribution

 P(x, h|\theta) = \frac{1}{Z(\theta)} exp(-E(x, h; \theta)),

and similarly we get the derivative 

 - \frac{\partial}{\partial \theta} \langle log P(x)\rangle_{P_0} = \langle \frac{\partial}{\partial \theta} E(x, h; \theta)\rangle_{P_0(x)P(h|x)} - \langle \frac{\partial}{\partial \theta} E(x, h; \theta)\rangle_{P(x, h|\theta)}.

However, P_0 itself is not available; in practice we only have a training set \{ x_i \}_{i = 1}^{N} sampled from P_0, so we approximate the first term of the derivative by \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial \theta} E(x_i, h_i; \theta), where h_i is either set to the conditional expectation under, or sampled from, P(h|x_i; \theta) (see the sketch after this paragraph). For better generalization, we do not want the ML algorithm to drive the KL divergence (now taken against the empirical distribution) exactly to 0, since the minimizer P would then be the average of Dirac functions

P(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)

which assigns zero probability to every point outside the training set. This is why only approximating the ML optimum helps to smooth the model distribution.
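As a minimal sketch of this empirical positive-phase estimate for a binary RBM, assuming the usual energy E(x, h; \theta) = -x^T W h - b^T x - c^T h with \theta = (W, b, c) and the factored sigmoid posterior (the parameter names, shapes, and toy data are illustrative assumptions, not from this note):

import numpy as np

rng = np.random.default_rng(0)

# Toy minibatch of N binary visible vectors, standing in for samples from P_0.
X = rng.integers(0, 2, size=(8, 5)).astype(float)

# Assumed binary RBM parameters for E(x, h) = -x^T W h - b^T x - c^T h.
W = 0.01 * rng.standard_normal((5, 3))
b = np.zeros(5)
c = np.zeros(3)

# Factored posterior P(h_j = 1 | x) = sigmoid(c_j + (W^T x)_j); sample one h_i per x_i.
p_h = 1.0 / (1.0 + np.exp(-(c + X @ W)))
H = (rng.random(p_h.shape) < p_h).astype(float)

# Empirical positive-phase term (1/N) sum_i dE(x_i, h_i)/dtheta, with
# dE/dW = -x h^T, dE/db = -x, dE/dc = -h:
pos_dW = -(X.T @ H) / len(X)
pos_db = -X.mean(axis=0)
pos_dc = -H.mean(axis=0)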

Also notice that the derivative of the RBM log-likelihood contains the expectation of the energy gradient over both x and h under the joint distribution P(x, h|\theta), which is also intractable. In general we approximate it with samples from some MCMC method, e.g. Gibbs sampling. Assume that x and h take discrete states and that the Markov chain is irreducible and aperiodic; then, sampling alternately from the conditionals P(h|x) and P(x|h), the Gibbs chain converges to an equilibrium distribution P_\infty with P_\infty = P. To make the statement precise, consider

P_1(x_2, h_2) = \sum_{h_1} P(h_2|x_2)P(x_2|h_1)P(h_1|x_1)

which is the joint distribution we get after one full sampling step starting from the training point x_1, and recursively P_n(x_{n+1}, h_{n+1}) = \sum_{x_n, h_n} P(h_{n+1}|x_{n+1}) P(x_{n+1}|h_n) P_{n-1}(x_n, h_n); then P_t \rightarrow P as t \rightarrow \infty. This justifies truncating the Gibbs chain after k full sampling steps, and Hinton asserted that experiments show even k = 1 yields good training results.
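Putting the pieces together, here is a minimal sketch of the CD-k update for a binary RBM, under the same assumed energy E(x, h) = -x^T W h - b^T x - c^T h and sigmoid conditionals as in the sketch above (the function and parameter names are mine, not from this note); the Gibbs chain starts at the data and is truncated after k full steps, with k = 1 being the one-step case discussed throughout.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(X, W, b, c, k=1, lr=0.1, rng=None):
    """One CD-k update for a binary RBM with E(x, h) = -x^T W h - b^T x - c^T h,
    starting the Gibbs chain at the data X and truncating after k full steps."""
    if rng is None:
        rng = np.random.default_rng(0)

    # Positive phase: h ~ P(h | x) at the data points.
    H = (rng.random((len(X), W.shape[1])) < sigmoid(c + X @ W)).astype(float)

    # Negative phase: k full Gibbs steps  h -> x' -> h' -> ...
    Xk, Hk = X, H
    for _ in range(k):
        Xk = (rng.random(X.shape) < sigmoid(b + Hk @ W.T)).astype(float)
        Hk = (rng.random(H.shape) < sigmoid(c + Xk @ W)).astype(float)

    # <dE/dtheta>_data - <dE/dtheta>_{k-step chain} with dE/dW = -x h^T etc.;
    # descending that estimate of the negative log-likelihood gradient gives:
    W = W + lr * (X.T @ H - Xk.T @ Hk) / len(X)
    b = b + lr * (X - Xk).mean(axis=0)
    c = c + lr * (H - Hk).mean(axis=0)
    return W, b, c

# Usage on toy binary data; k = 1 is the one-step (CD-1) case discussed above.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(16, 6)).astype(float)
W, b, c = 0.01 * rng.standard_normal((6, 4)), np.zeros(6), np.zeros(4)
W, b, c = cd_k_update(X, W, b, c, k=1, rng=rng)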

 

References

[1] David J. C. MacKay, Failures of the One-Step Learning Algorithm, 2001.