Maximum Likelihood

The maximum likelihood (ML) classifier is one of the most popular classification methods [42]. The goal is to assign to each feature vector the most likely class $w_j$ from a set of $N$ classes $w_1, \ldots, w_N$. The most likely class $w_j$ for a given feature vector $x$ is the one with the maximum posterior probability $P(w_j \mid x)$ of belonging to the class. Using Bayes' theorem, we have

$$P(w_j \mid x) = \frac{P(x \mid w_j)\, P(w_j)}{P(x)}$$

On the left side of the equation is the a posteriori probability that a feature vector $x$ belongs to the class $w_j$. On the right side, the likelihood $P(x \mid w_j)$ expresses the probability of the feature vector $x$ being generated by the probability density function of $w_j$. $P(x)$ and $P(w_j)$ are the a priori probabilities of appearance of the feature vector $x$ and of the class $w_j$, respectively.
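As a small numerical illustration of Bayes' theorem as used here, the following sketch computes the posterior of each class from its likelihood and prior. The function name `posterior` and all numerical values are hypothetical, and $P(x)$ is obtained by summing $P(x \mid w_j) P(w_j)$ over the classes, which assumes the classes are exhaustive and mutually exclusive.

```python
import numpy as np

def posterior(likelihoods, priors):
    """Return P(w_j | x) for every class j using Bayes' theorem."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors      # P(x | w_j) * P(w_j)
    evidence = joint.sum()            # P(x), summed over all classes
    return joint / evidence

# Hypothetical values for a two-class problem (not taken from the text).
likelihoods = [0.2, 0.6]   # P(x | w_1), P(x | w_2)
priors = [0.7, 0.3]        # P(w_1), P(w_2)

post = posterior(likelihoods, priors)
predicted = int(np.argmax(post))   # index of the class with maximum posterior
```

With these values the second class obtains the larger posterior even though its prior is smaller, because its likelihood for this sample is three times higher.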

This model relies on knowledge of the probability density function underlying each class, as well as the probabilities of occurrence of the data and of the classes. To reduce the complexity of these estimations, some assumptions are made. The first assumption generally made is that all feature vectors, as well as all classes, are equally likely to appear. Under this assumption, $P(x)$ and $P(w_j)$ are constant, and Bayes' theorem reduces the problem to estimating the probability density function of each class:

$$P(w_j \mid x) \propto P(x \mid w_j)$$

Several methods can be used to estimate the class probability density function $P(x \mid w_j)$. Two of the most widespread are to assume a particular parametric form for the density and to use mixture models.

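A very common hypothesis is to identify the underlying probability density function with a multivariate normal distribution. In that case, for a feature vector of dimension $d$, the likelihood value is

$$P(x \mid w_j) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_j)^{T} \Sigma_j^{-1} (x - \mu_j)\right)$$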

where $\Sigma_j$ and $\mu_j$ are the covariance matrix and the mean value for class $j$, respectively. When the determinants of the covariance matrices of all classes are equal, maximizing the likelihood is equivalent to minimizing the Mahalanobis distance to the class mean. Figure 2.14(a) shows an example of the effect of this kind of classifier on a sample "X." Although the sample seems nearer to the left-hand distribution in terms of Euclidean distance, it is assigned to the right-hand class, since that class has a higher probability of generating the sample.

Figure 2.14: (a) Graphic example of the maximum likelihood classification assuming an underlying density model. (b) Unknown probability density function estimation by means of a 2 Gaussian mixture model. (c) Resulting approximation of the unknown density function.
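The behavior described for the sample "X" can be reproduced with a minimal sketch of a Gaussian maximum likelihood classifier. The class means, covariances, and the sample below are hypothetical values chosen for illustration, and the function names are not from the text. The log-likelihood is used instead of the likelihood for numerical stability, which does not change the selected class.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, cov):
    """Log of the multivariate normal density evaluated at x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.size
    diff = x - mean
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
    maha_sq = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)   # log |Sigma|
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha_sq)

def classify(x, means, covs):
    """Return the index of the class with the highest likelihood."""
    scores = [gaussian_log_likelihood(x, m, c) for m, c in zip(means, covs)]
    return int(np.argmax(scores))

# Hypothetical classes: a narrow one on the left, a wide one on the right.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [0.25 * np.eye(2), 4.0 * np.eye(2)]
x = np.array([1.8, 0.0])

euclid = [float(np.linalg.norm(x - m)) for m in means]  # distances to the means
label = classify(x, means, covs)
```

Here the sample is closer to the left-hand mean in Euclidean distance (1.8 versus 2.2), yet `classify` selects the right-hand class, whose wider spread makes the sample more probable under its density.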

The other approach is to estimate the model of the probability density function. In the mixture model approach, we assume that the probability density function can be modelled using an ensemble of simple known distributions. If the base distribution is the Gaussian function, it is called a Gaussian mixture model. The interest of this method lies in the estimation of complex density functions using low-level statistics.

The mixture model is a weighted sum of fundamental distributions, given by the following expression:

$$p(x) = \sum_{k=1}^{C} P_k\, p(x \mid \theta_k)$$

where $C$ is the number of mixture components, $P_k$ is the a priori probability of component $k$, and $\theta_k$ represents the unknown mixture parameters. In our case, we have chosen Gaussian mixture models, $\theta_k = \{P_k, \mu_k, \sigma_k\}$, for each set of texture data we want to model. Figures 2.14(b) and 2.14(c) show the approximation of a probability density function with a mixture of two Gaussians and the resulting approximation. Figure 2.14(b) shows the function to be estimated as a continuous line and the Gaussian functions used for the approximation as dotted lines. Figure 2.14(c) shows the resulting approximated function as a continuous line and, for reference, the function to be estimated as a dotted line. One can observe that an unknown probability density function can be well approximated with a suitable mixture of Gaussian distributions. The main problems with this kind of approach are its computational cost and the unknown number of base functions needed, as well as the values of their governing parameters. To estimate the parameters of each base distribution, general maximization methods are used, such as the expectation-maximization (EM) algorithm [42].
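To make the parameter estimation concrete, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture, written with NumPy only. It is not the implementation used by the authors: the quantile-based initialisation, the fixed iteration count, the function names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fit_gmm_1d(x, n_components=2, n_iter=100):
    """Estimate weights P_k, means mu_k and deviations sigma_k with EM."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: means spread over quantiles of the data.
    qs = (np.arange(n_components) + 0.5) / n_components
    mu = np.quantile(x, qs)
    sigma = np.full(n_components, x.std())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of component k for each sample.
        dens = weights * normal_pdf(x[:, None], mu, sigma)   # shape (n, C)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(var)
    return weights, mu, sigma

def mixture_pdf(x, weights, mu, sigma):
    """Evaluate the mixture density sum_k P_k p(x | theta_k)."""
    x = np.asarray(x, dtype=float)
    return (weights * normal_pdf(x[:, None], mu, sigma)).sum(axis=1)

# Synthetic data from two hypothetical Gaussians (not from the text).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])
weights, mu, sigma = fit_gmm_1d(data)
```

The number of components is fixed in advance here, which sidesteps the problem noted above of not knowing how many base functions are needed. In practice that number has to be chosen by some model selection procedure.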

However, these techniques are not well suited to cases where the number of dimensions is large and the number of training samples is small. Therefore, dimensionality reduction is needed to obtain a meaningful set of data. Principal component analysis and Fisher linear discriminant analysis are the most popular dimensionality reduction techniques in the literature.
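As an illustration of the first of these techniques, the sketch below projects data onto its leading principal components using the singular value decomposition of the centred data matrix. The function name `pca_reduce`, the data shape, and the number of retained components are hypothetical choices, not values from the text.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the first n_components principal directions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each feature
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    explained_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    return Xc @ components.T, components, explained_var

# Hypothetical setting: few samples (20) in a high-dimensional space (50).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
Z, components, explained_var = pca_reduce(X, 5)
```

The reduced vectors in `Z` could then be passed to a density-based classifier in place of the original feature vectors.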