## Deterministic Annealing

Deterministic annealing [13] is an optimization algorithm based on the principles of information theory, probability theory, and statistical mechanics. Shannon's information theory tells us that the entropy of a system decreases as the underlying probability distribution concentrates and increases as the distribution becomes more uniform. For a physical system with many degrees of freedom that can result in many possible states. A basic rule in statistical mechanics says that when the system is in thermal equilibrium, possibility of a state i follows Gibbs distribution.

where kB is Boltzmann's constant and Z is a constant independent of all states. Gibbs distribution tells us that states of low energy occur with higher probability than states of high energy, and that as the temperature of the system is lowered, the probability concentrates on a smaller subset of low energy states.

Let E be the average energy of the system, then

is the "free energy" of the system. We can see that as T approaches zero, F approaches the average energy E. From the principle of minimal free energy, Gibbs distribution collapses on the global minima of E when this happens. SA [33] is an optimization algorithm based on Metropolis algorithm [34] that captures this idea. However, SA moves randomly on the energy surface and converges to a configuration of minimal energy very slowly, if the control parameter T is lowered no faster than logarithmically. DA improves the speed of convergence; the effective energy is deterministically optimized at successively reduced T while maintaining the annealing process aiming at global minimum.

In our clustering problem, we would like to minimize the expected distortion of all the samples x's given a set of centroids y's. Let D be the average distortion,

x y x y where p(x, y) is the joint probability distribution of sample x and centroid y, p(y | x) is the association probability that relates x to y, and d(x, y) is the distortion measure. Shannon's entropy of the system is given by

If we take D as the average "energy" of the system, then the Lagrangian

is equivalent to the free energy of the system. The temperature T here is the Lagrange multiplier, or simply the pseudotemperature. Rose [13] described a probabilistic framework for clustering by randomization of the encoding rule, in which each sample is associated with a particular cluster with a certain probability. When F is minimized with respect to the association probability p(y | x), an encoding rule assigning x to y can be obtained, exp (-dp)

Ey exp(

Using the explicit expression for p(y | x) into the Lagrangian F in Eq. (6.9), we have the new Lagrangian

To get centroid values of {y}, we minimize F* with respect to y, yielding

x dy

With p(x, y) = p(x)p(y | x), where p(x) is given by the source and p(y | x) is also known (as given above.) The centroid values of y that minimize F* can be computed by an iteration that starts at a large value of T, tracking the minimum while decreasing T. The centroid rule is given by y Txxp(x) p( y 1 x) (613)

It is obvious that the parameter T controls the entire iterative process of deriving final centroids. As the number of clusters increases, the distortion, or the covariance between samples x and centroids yi will be reduced. Thus, when T is lowered, existing clusters split and the number of clusters will increase while maintaining minimum distortion. When T reaches a value at which the clusters split, it corresponds to a phase transition in the physical system. The exact value of T, at which a splitting will occur, is given by

where Tc is known as the critical temperature and Xmax is the maximum eigenvalue of the covariance matrix of the posterior distribution p(x | y) of the cluster corresponding to centroid y:

In mass-constrained DA, the constraint of i pi = 1 is applied. Here pi's are the centroids that coincide in the same cluster i at position yi. We call this the "repeated" centroids. This is because when the cluster splits, the annealing might result in multiple centroids in each effective cluster depending on the initial pertubation. Below is a simple description of implementation of DA:

1. Initialization: set the maximum number of clusters to be generated Kmax and the minimum temperature Tnin. Set K = 1, compute the first Tc and first centroid. Set py = 1. Select temperature reduction rate a, perturbation 5, and threshold R.

2. For all current clusters, compute centroids yi and p(yO according to Eqs. (6.13)-(6.15) until converge.

3. Check if T > Tnin, if yes, reduce pseudotemperature T = a T, otherwise let T = 0 and stop, output centroids and sample assignments.

4. If K > Kmax, stop and output centroids and sample assignments. Otherwise, for all clusters, check if T > Tc(i), if yes, go to step 3; otherwise, split the cluster.

Figure 6.3 gives the flow chart for implementing mass-constrained DA. As can be seen from the flow chart, there are a couple of parameters that govern the annealing process, each exerts its influence on the outcome, particularly the temperature cooling step parameter a. Theoretically, if a is reduced sufficiently slowly, local minima of the cost function can be skipped and a global minimum can be reached. However, it can be very time consuming if T is reduced too slowly.

The centroids y and the encoding rule p(yi | x) are illustrated in Fig. 6.4.

0 0