## Training Procedure

The memorized "knowledge" of the network is represented by the free parameters P = {(w_j, s_j, ρ_j)}, i.e., the weights w_j and s_j as well as the ranges ρ_j of the Gaussians in the hidden layer. It serves as the basis for the approximation of a function, which should be provided by using a training data set T = {(x^ν, y^ν) | ν ∈ {1, …, p}}, where p denotes the number of sample patterns. Here, two aspects should be considered: on one hand, the network should reproduce the given training data as well as possible; on the other hand, it should be able to provide some reasonable generalization with respect to unknown test data.

An obvious training strategy is the minimization of the error

E(P) = Σ_{ν=1}^{p} ||y^ν − ỹ(x^ν)||²,   (7)

where ỹ(x^ν) denotes the network output for the training pattern x^ν.
An attempt to optimize all the free parameters simultaneously by gradient descent on the cost function (7) involves two problems: on one hand, the procedure can be trapped in local minima, thus providing suboptimal solutions. One may cope with this difficulty by randomization employing noise (so-called simulated annealing), which enables escape from local minima. A second problem is the slow convergence of such a global optimization. In this context, the method proposed by Poggio [19] does not provide any advantage over the classical optimization strategy of error backpropagation for the multilayer perceptron (see, e.g., [23]). Such a global, i.e., simultaneous, optimization of all the parameters also does not correspond to the idea of biological information processing by local self-organization. This motivates the proposal by Moody and Darken [17] to split the parameter space P = {(w_j, s_j, ρ_j)} into three smaller partitions {w_j}, {s_j}, {ρ_j} that may be optimized separately. The resulting dimensionality reduction increases the convergence rate of the optimization within the different partitions. On the other hand, this strategy may sometimes lead to suboptimal minimization results for (7), as there is no longer a systematic search on the total parameter space P.

This concept of separate optimization within each of the three partitions of P is the basis of the GRBF architecture applied in this chapter. It works in three steps:

(1) Initially, the virtual positions w_j of the hidden layer neurons are calculated in a procedure called vector quantization (VQ). This is based on an unsupervised clustering algorithm, i.e., the target information y^ν of the training data set T = {(x^ν, y^ν) | ν ∈ {1, …, p}} is not used within the training procedure.

(2) Once the virtual positions w_j are known, the widths of the "receptive fields" ρ_j of the hidden layer neurons are determined by some appropriate heuristic rule or by optimization with respect to the quality of the function approximation result.

(3) Finally, the output weights s_j are computed. For this purpose, one can choose between a global gradient descent method and a local training procedure based on Hebbian correlation learning.

These training steps are explained in the following paragraphs.
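The three-step procedure above can be condensed into a short sketch. This is a minimal NumPy illustration, not the chapter's prescribed implementation: the function name `train_grbf` is hypothetical, step 1 uses a plain winner-takes-all update, step 2 uses a nearest-neighbor width heuristic as one possible "appropriate heuristic rule," and step 3 solves for the output weights by linear least squares rather than by gradient descent or Hebbian learning.

```python
import numpy as np

def train_grbf(X, Y, n_hidden, rho=None, seed=0):
    """Sketch of the three-step GRBF training procedure.

    Step 1: vector quantization places the virtual positions w_j.
    Step 2: a heuristic rule fixes the Gaussian widths rho_j.
    Step 3: the output weights s_j are fitted (here: least squares)."""
    rng = np.random.default_rng(seed)

    # Step 1: virtual positions w_j via unsupervised clustering
    # (winner-takes-all updates with a decaying learning rate).
    W = X[rng.choice(len(X), n_hidden, replace=False)].copy()
    for t in range(200):
        eps = 0.5 * 0.99**t                           # learning parameter
        x = X[rng.integers(len(X))]                   # random feature vector
        j = np.argmin(np.linalg.norm(W - x, axis=1))  # winner neuron
        W[j] += eps * (x - W[j])                      # move winner toward x

    # Step 2: widths rho_j from a heuristic (distance to the
    # nearest other codebook vector -- one common choice).
    D = np.linalg.norm(W[:, None] - W[None, :], axis=2)
    np.fill_diagonal(D, np.inf)
    rho = D.min(axis=1) if rho is None else np.full(n_hidden, rho)

    # Step 3: output weights s_j by least squares on the
    # hidden-layer Gaussian activations.
    G = np.exp(-np.linalg.norm(X[:, None] - W[None], axis=2)**2
               / (2 * rho**2))
    S, *_ = np.linalg.lstsq(G, Y, rcond=None)
    return W, rho, S
```

Note that steps 1 and 2 never look at the targets Y; only the final, linear step does, which is what makes the split optimization fast.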

### 3.1 Optimization of the w_j: Vector Quantization

Let n denote the dimension of the feature space, e.g., the number of gray levels obtained for each voxel. Let K denote the number of feature vectors x^ν ∈ ℝ^n. Clustering by VQ identifies groups j ∈ {1, …, N} of vectors sharing similar features. These groups are represented by prototypical vectors called codebook vectors (CVs) w_j located at the centers of the corresponding clusters.

VQ procedures determine these cluster centers by an iterative adaptive update according to

w_j(t + 1) = w_j(t) + ε(t) a_j(x(t), W(t), κ) (x(t) − w_j(t)),   (8)

where ε(t) denotes a learning parameter and a_j a so-called cooperativity function which, in general, depends on the codebook W(t), a cooperativity parameter κ, and a presented, in general randomly chosen, feature vector x(t) ∈ {x^ν | ν ∈ {1, …, K}}.

VQ algorithms can be classified according to the cooperativity function a_j. One can distinguish between so-called hard clustering methods that assign each feature vector x to exactly one cluster represented by the CV w_j, and soft clustering methods implying the concept of fuzzy membership of feature vectors in several clusters.

For example, a simple method for hard clustering is the algorithm proposed by Linde, Buzo, and Gray (LBG) [16]. Here, a_j is calculated by the "winner takes all" learning rule

a_j(x(t), W(t)) = δ_{i(x),j},   (9)

where i(x) is determined according to the minimal distance criterion

||x(t) − w_{i(x)}(t)|| = min_j ||x(t) − w_j(t)||.
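A single "winner takes all" update can be sketched as follows; this is an illustrative NumPy fragment (the function name `winner_takes_all_step` is an assumption, and Euclidean distance is used for the minimal distance criterion):

```python
import numpy as np

def winner_takes_all_step(W, x, eps):
    """One hard-clustering update: only the codebook vector closest
    to x (the winner i(x)) is moved toward the presented vector."""
    i = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winner index i(x)
    W = W.copy()
    W[i] += eps * (x - W[i])  # update rule (8) with a_j = delta_{i(x),j}
    return W, i
```

All non-winning codebook vectors remain untouched, which is exactly what distinguishes hard from soft clustering.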

If we change a_j in such a way that more than one CV can take part in the update at the same time t, (8) changes into a "winner takes most" learning rule. The resulting soft clustering methods can be classified according to the metric on which the cooperativity function a_j is based.

If we do not use the metric of the feature space itself, but define a mapping

r : {1, …, N} → ℝ^k,  j ↦ r(j),   (10)

of the CVs onto a target space, we obtain a scenario that may be interpreted with respect to a neural network model of sensory maps (e.g., [26]). Usually k is chosen such that k ≪ N and k ≪ n. The r(j)'s can be thought of as representing the physical locations of neurons in a cortex model. A common choice would be the ordering of the r(j)'s on a regular low-dimensional grid (k ∈ {2, 3}). However, this is not mandatory (see, e.g., [28]). By choosing the metric of the cooperativity function a_j according to the neuron positions r(j) in the target space, we obtain Kohonen's self-organizing map (SOM) algorithm [14, 15]. a_j is then chosen as a monotonically declining function of the distance to the "winner neuron" r'(x(t)), e.g., a Gaussian

a_j(r(j), r'(x(t)), σ(t)) = exp(−||r(j) − r'(x(t))||² / (2σ(t)²)),   (11)

or a characteristic function on a k-dimensional hypersphere around r'(x(t)),

a_j(r(j), r'(x(t)), σ(t)) = χ_{||r(j) − r'(x(t))|| ≤ σ(t)}
                          = { 1 : ||r(j) − r'(x(t))|| ≤ σ(t)
                              0 : ||r(j) − r'(x(t))|| > σ(t).   (12)

The update rule (8) then becomes

w_j(t + 1) = w_j(t) + ε(t) a_j(r(j), r'(x(t)), σ(t)) (x(t) − w_j(t)).   (13)

σ(t) denotes a concrete choice of the cooperativity parameter κ in (8). It is a measure for the "range" of the cooperativity function a_j(r(j), r'(x(t)), σ(t)) on the cortex model. Like the learning parameter ε(t), it is usually updated according to some heuristic annealing scheme. A common strategy is an exponential decay, i.e.,

σ(t) = σ(0) (σ(t_max)/σ(0))^(t/t_max).   (14)
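A single SOM update step with a Gaussian cooperativity function on the grid distance can be sketched as follows; the function name `som_step` and the explicit array of grid positions R are illustrative assumptions:

```python
import numpy as np

def som_step(W, R, x, eps, sigma):
    """One SOM update: the cooperativity is a Gaussian in the
    distance on the cortex model (grid positions R), not in
    feature space.

    W: (N, n) codebook vectors; R: (N, k) grid positions r(j)."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # r'(x(t))
    d2 = np.sum((R - R[winner])**2, axis=1)   # squared grid distances
    a = np.exp(-d2 / (2.0 * sigma**2))        # Gaussian cooperativity
    return W + eps * a[:, None] * (x - W)     # all CVs move, winner most
```

Note that the winner is still found in feature space, but the strength of each neuron's update is determined entirely by its distance to the winner on the grid; this is what produces topology preservation.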

In contrast to algorithms in which the cooperativity function is based on the metric of the feature space, such as k-means-type clustering or minimal free energy VQ, the SOM algorithm has the interesting property of topology preservation, i.e., neighboring points of the feature space are mapped onto neighboring neurons of the cortex model. In various data analysis problems, this allows for the construction of graphically appealing 2D maps that represent the CVs in some topological order. However, this order may be misleading because of folding of the feature map (see, e.g., [20]) in cases where the data are located on manifolds with a dimension > k.

If we avoid a mapping like (10) of the CVs onto a cortex model, but apply the metric of the feature space directly according to

a_j(x(t), W(t), κ = ρ(t)) = exp(−E_j(x(t)) / (2ρ(t)²)) / S,   (16)

we obtain the cooperativity function for the soft clustering scheme proposed by Rose, Gurewitz, and Fox [21].

Here, the "error" E_j(x(t)) = ||x(t) − w_j(t)||² measures the distance between the codebook vector w_j and the data vector x(t). S denotes a partition function given by

S = Σ_{i=1}^{N} exp(−E_i(x(t)) / (2ρ(t)²)),   (17)

and ρ is the cooperativity parameter of the model. This so-called "fuzzy range" ρ(t) defines a length scale in the feature space, not in a cortex model as σ(t) does in (11) or (12). According to the analogy to statistical mechanics (see later discussion), ρ can be interpreted in terms of the temperature T of a multiparticle system via T = 2ρ². It is annealed to successively smaller values in the VQ procedure, e.g., according to an annealing scheme such as (14).
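The soft assignments just described can be sketched directly; the function name `fuzzy_assignments` is an assumption, and the log-sum-exp shift is a standard numerical-stability device rather than part of the original formulation:

```python
import numpy as np

def fuzzy_assignments(W, x, rho):
    """Soft-clustering cooperativity a_j = exp(-E_j / (2 rho^2)) / S,
    with E_j = ||x - w_j||^2 and S the partition function.
    rho is the 'fuzzy range' (temperature T = 2 rho^2)."""
    E = np.sum((W - x)**2, axis=1)     # quadratic errors E_j
    log_a = -E / (2.0 * rho**2)
    log_a -= log_a.max()               # shift for numerical stability
    a = np.exp(log_a)
    return a / a.sum()                 # divide by partition function
```

As ρ is annealed toward 0 the memberships sharpen into a hard, winner-takes-all assignment, while a large ρ yields nearly uniform memberships across all clusters.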

The learning rule (8) with a_j given by (16) describes a stochastic gradient descent on the error function

F = −2ρ² ∫ f(x) ln( Σ_{i=1}^{N} exp(−E_i(x) / (2ρ²)) ) dx,   (18)

which, using the terminology of statistical mechanics, is a free energy in a mean-field approximation [21, 6]. Here, f(x) denotes the probability density of feature vectors x. The values of the cooperativity function are normalized,

Σ_{j=1}^{N} a_j(x(t), W(t), ρ(t)) = 1.
