## PCM

The possibilistic c-means (PCM) model minimizes

$$J_m(U, V; w) = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{m} D_{ik}^{2} + \sum_{i=1}^{c} w_i \sum_{k=1}^{n} (1 - u_{ik})^{m}$$

over (U, V) ∈ M_pcn × ℜ^cp, with m > 1 and w_i > 0 for all i. (For HCM, the corresponding first-order necessary condition on the memberships is the crisp assignment u_ik = 1 if D_ik ≤ D_jk for all j ≠ i and all k; u_ik = 0 otherwise.)

An important problem associated with unsupervised image segmentation is cluster validity. To understand this problem, let P = {U_i : 1 ≤ i ≤ N} denote N different partitions (crisp or soft) of a fixed unlabeled data set X that may arise as a result of:

(i) clustering X with one algorithm C at various values of c ; or

(ii) clustering X over other algorithmic parameters of C ; or

(iii) applying several C's to X, each with various parameters ; or

(iv) all of the above

Cluster validity is the study (selection or rejection) of which

TABLE 3 The HCM/FCM/PCM-AO algorithms

Store • unlabeled object data X ⊂ ℜ^p

Pick
• maximum number of iterations: T
• weighting exponent: 1 < m < ∞ (m = 1 for HCM-AO)
• inner product norm for J_m: ‖x‖²_A = xᵀAx
• termination measure: E_t = ‖V_t − V_{t−1}‖ (initialized to a big value)

Guess
• initial prototypes: V_0 = (v_{1,0}, …, v_{c,0}) ∈ ℜ^cp (12)

Iterate
t ← 0
REPEAT
  t ← t + 1
  U_t = F_c(V_{t−1}), where F_c(V_{t−1}) = (9a, 10a or 11a)
  V_t = G_c(U_t), where G_c(U_t) = (9b, 10b or 11b)
UNTIL (t = T or E_t ≤ ε)
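To fix ideas, here is a minimal sketch of the FCM branch of the AO loop in Table 3, using the Euclidean norm. The function name, the initialization scheme, and the numeric defaults are our own choices, not part of the table; the updates F_c and G_c are the standard FCM membership and prototype formulas.

```python
import numpy as np

def fcm_ao(X, c, m=2.0, T=100, eps=1e-5, seed=0):
    """Minimal FCM alternating optimization. X is (n, p); returns (U, V),
    where U is (c, n) with columns summing to 1 and V is (c, p)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, size=c, replace=False)]          # initial prototypes V_0
    for _ in range(T):
        # D[i, k] = distance from prototype v_i to datum x_k
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)                            # guard against divide-by-zero
        # membership update: u_ik = 1 / sum_j (D_ik / D_jk)^(2/(m-1))
        U = 1.0 / np.sum((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)  # prototype update
        E_t = np.linalg.norm(V_new - V)                   # termination measure
        V = V_new
        if E_t < eps:
            break
    return U, V
```

With m → 1 the memberships harden toward the crisp HCM assignment, and with the PCM updates substituted for F_c the same loop realizes PCM-AO.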

U_i ∈ P "best represents the data" in some well-defined sense, or for some well-defined purpose. One popular strategy, mathematical validation, uses a validity functional V : D_V → ℜ to rank the U_i ∈ P. There are hundreds of cluster validity functionals in the literature [3]. We will meet an application of validity functions later, so to fix this idea, here is one of the simplest: the partition coefficient V_PC, which is well-defined for any fuzzy partition U of X [3]:

$$V_{PC}(U) = \frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}^{2}.$$

Let's compute the partition coefficient of crisp U1 and fuzzy U2 in Table 1:

For crisp or fuzzy partitions, V_PC is known to lie between 1/c and 1; it takes the value 1 if and only if U is crisp; and it takes the fuzziest value 1/c if and only if U has the entry 1/c in all cn elements. Thus, the smaller V_PC is, the fuzzier U is. Some authors use these mathematical facts to justify defining the best U_i in P as the one that maximizes V_PC over the available choices, on the theory that the crispest partition is the best candidate in P. Unfortunately, this use of V_PC has some well-known deficiencies [3], but it shows you how validity functions work. We could compute V_PC(U_3) = 0.76 for the possibilistic partition U_3 in Table 1, but since the column sums are not constant, all we know about this validity function in the possibilistic case is that it lies between 0 and c, and that it takes the value 1 at crisp partitions of X. So V_PC does not have a clear interpretation for possibilistic partitions. However, the PCM clustering algorithm is sometimes used to segment images, so there are other validity functions that can be used for this case [3].
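As a quick illustration (with made-up membership matrices, not the U_1 and U_2 of Table 1), the partition coefficient and its extreme values can be checked in a few lines:

```python
import numpy as np

def partition_coefficient(U):
    """V_PC(U) = (sum of squared memberships) / n, for a c x n partition matrix U."""
    return float(np.sum(U ** 2) / U.shape[1])

U_crisp = np.array([[1, 1, 0], [0, 0, 1]])   # crisp partition: V_PC = 1
U_fuzziest = np.full((2, 3), 0.5)            # every entry 1/c: V_PC = 1/c = 0.5
```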

What's wrong with using validity functions? Well, the "true" parameters of a model that represents substructure in any data set are unknown. Consequently, validity functionals have little chance of being generally useful for identifying a "best" segmentation of an image obtained by clustering. More typically, V is used instead to eliminate badly misleading solutions. Thus, validity functionals are usually used prior to validation by humans or rule bases.

A classifier is any function D : ℜ^p → N_pc. The value y = D(z) is the label vector for z in ℜ^p. D is a crisp classifier if D[ℜ^p] = N_hc; otherwise, the classifier is fuzzy, possibilistic, or probabilistic, which for convenience we lump together as soft classifiers. When part or all of X is used to "train" D (find its parameters), the training and test data are denoted by X_tr and X_te. Many classifiers also produce point prototypes, and in this instance the prototypes will usually acquire physical class labels during training.

We denote the test (or generalization) error rate of D when trained on X_tr and tested with crisply labeled X_te as E_D(X_te | X_tr) = (number of mistakes committed by D)/|X_te|. If the same data, X = X_tr = X_te, are used for training and testing, E_D(X | X) is called the resubstitution error rate. A classifier for which E_D(X | X) = 0 is said to be consistent. E_D(X | X) is usually optimistically biased and so is generally regarded as a "ballpark" estimate of E_D(X_te | X_tr), the number we really want to know. There are lots of papers about the topic of this paragraph—how to set up proper experiments, how to use the samples you have most efficiently, how to estimate parameters, how many times, etc. [1].
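The error rates above are simple ratios of mistakes to test-set size; a minimal sketch (the helper name is ours):

```python
def error_rate(D, X_te, y_te):
    """E_D(X_te | X_tr): fraction of crisply labeled test data that the
    trained classifier D labels incorrectly."""
    mistakes = sum(1 for x, y in zip(X_te, y_te) if D(x) != y)
    return mistakes / len(X_te)
```

Calling `error_rate(D, X_tr, y_tr)` with the training data gives the resubstitution error rate E_D(X | X).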

Once point prototypes are found, they can be used to define a crisp nearest prototype (1-np) classifier D_{V,E,δ}. Let V be a set of c crisply labeled point prototypes, one per class, ordered so that e_i is the crisp label for v_i, 1 ≤ i ≤ c; let δ be any distance measure on ℜ^p, and let E = {e_i : i = 1, …, c} = N_hc. The crisp nearest prototype (1-np) classifier D_{V,E,δ} is defined, for z ∈ ℜ^p, as

$$D_{V,E,\delta}(z) = e_i \iff \delta(z, v_i) \le \delta(z, v_j)\ \forall j \ne i; \quad \text{ties arbitrarily resolved.} \quad (15)$$

For example, suppose the prototypical patient with disease A has heart rate of 90 bps and temperature of 101°, while a patient with the "textbook case" of disease B has heart rate 75 bps and temperature 95°. We can represent these prototypes with the vectors v_A = (90, 101)ᵀ and v_B = (75, 95)ᵀ. OK, your patient z has heart rate 82 bps and temperature 98°. If you use (15) to classify z = (82, 98)ᵀ and choose the Euclidean distance δ_2 for your implementation, you have

δ_2(z, v_A) = √((90 − 82)² + (101 − 98)²) = √73 ≈ 8.54 as the distance (from your patient) to disease A, and δ_2(z, v_B) = √((82 − 75)² + (98 − 95)²) = √58 ≈ 7.62

as the distance to disease B. Taking the smaller value (nearest prototype) yields the diagnosis "z has B." Would you do this in your practice? Of course not. But now you know how prototype classifiers work, and moreover, you do make diagnoses this way—it's just that your brain uses "non-Euclidean" calculations about z's relationship to (what you know about) A and B to help you, the trained clinician, render a diagnosis. This is what pattern recognition is all about.
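The diagnosis above is exactly what a 1-np classifier computes. A small sketch using the two hypothetical disease prototypes from the example:

```python
import math

def nearest_prototype(z, prototypes):
    """Crisp 1-np rule (15): return the label of the prototype nearest to z
    under Euclidean distance (ties resolved arbitrarily by min())."""
    def dist(v):
        return math.sqrt(sum((zi - vi) ** 2 for zi, vi in zip(z, v)))
    return min(prototypes, key=lambda label: dist(prototypes[label]))

prototypes = {"A": (90, 101), "B": (75, 95)}   # textbook patients for diseases A and B
diagnosis = nearest_prototype((82, 98), prototypes)  # the patient z from the example
```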

One last note: don't confuse the 1-np rule with the 1-nearest neighbor (1-nn) rule. Prototypes are new points built from the data, while neighbors are points in the labeled data. Neighbors are a special case of prototypes, but this distinction is very important. More generally, we may speak of the k-nn rule. This classifier simply finds the k nn's to any unlabeled point z, counts the number of "votes" for each class in z's neighborhood, and then assigns the majority label to z. The k-nn rule is a very good classifier that has been used quite successfully in many papers on medical image segmentation: see [3] for both the crisp and soft designs.
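A minimal sketch of the k-nn voting rule just described (names are ours; see [3] for the full crisp and soft designs):

```python
import math
from collections import Counter

def knn_label(z, labeled_points, k=3):
    """k-nn rule: find the k nearest labeled points to z and assign the
    majority label in that neighborhood. labeled_points is a list of
    (point, label) pairs."""
    neighbors = sorted(labeled_points, key=lambda pl: math.dist(z, pl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note the contrast with the 1-np sketch above: here the "prototypes" are the labeled data points themselves, not new points built from the data.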

### 2.3 Feature Extraction

This section considers the "front end" of medical image processing systems, where features are chosen and extracted from the raw image. We usually discuss images with two spatial dimensions, but most of what we say generalizes to images with N spatial dimensions, and to nonspatial dimensions such as time. Let IJ = {(i, j) : i = 1, …, m; j = 1, …, n} be a rectangular array or lattice of integers that specifies (mn) spatial locations (pixel addresses). In what follows, ij may be used as a short form for (i, j). Next let Q = {q : q = 0, 1, …, G − 1} be the integers from 0 to G − 1; Q is the set of quantization (or gray) levels of some picture function pix : ℜ² → ℜ^N. N is called the spectral dimension of pix. It is commonly assumed that pix is continuously differentiable at least once. Confinement of pix to the lattice IJ (which is done automatically by the digitizing scanner that realizes samples of pix) creates an m × n digital image denoted by P_IJ. For example, color photographs and magnetic resonance (MR) images usually have spectral dimension N = 3, whereas unispectral images (such as digitized X-ray mammograms) have N = 1. An "8-bit" image has 2⁸ = 256 gray levels, whereas a "16-bit" image has 2¹⁶ = 65,536 gray levels.

Image data are usually converted into sets of feature vectors X = {x_1, …, x_n} ⊂ ℜ^p, where p is the number of nonspatial features associated with each spatial location in the image. Don't confuse p, the dimension of the extracted feature space, with N, the spectral dimension of the image itself. N = p whenever the vectors in X are simply the quantized intensities (the raw digitized image). As soon as we extract new features from the raw intensities, it is possible that N ≠ p. Thus, we may have 7D feature vectors built from three intensities attached to two spatial dimensions (pixels); or 2D feature vectors for images with N = 3 intensities at each three-dimensional spatial location (voxels), etc. If the 2D spatial location of a feature vector is important, we write x_ij for the feature vector associated with pixel address (i, j). Feature vector x_ij may or may not include the coordinate values (i, j) among its elements. Thus, e.g., if location (2, 4) in an image has intensity value 37, we write x_24 = 37 and p = 1.
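As a small illustration of the N = p case (raw intensities as features), here is one way to flatten an (m, n, N) multispectral image into feature vectors, optionally appending the pixel coordinates to each vector; the function name is ours:

```python
import numpy as np

def pixel_features(img, include_coords=False):
    """Flatten an (m, n, N) multispectral image into mn feature vectors.
    p = N for raw intensities; p = N + 2 if the pixel address (i, j) is appended."""
    m, n, N = img.shape
    X = img.reshape(m * n, N).astype(float)
    if include_coords:
        ii, jj = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
        X = np.hstack([X, ii.reshape(-1, 1), jj.reshape(-1, 1)])
    return X
```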

Medical imaging systems use various sensors to collect spatial distributions of measurements which represent the underlying tissue characteristics. Figure 5 depicts several schemes for extracting features from an input image. These raw data support pixel-, edge-, and region-based segmentation (the distinction being which numerical features extracted from the image are used as the basis for processing). This is illustrated at the top of Fig. 5, which shows N = 3 spectral dimensions, and the extraction of a feature vector x_ij from just the intensities at the ijth pixel location on the left-hand side, and from a region (window) surrounding the ijth pixel on the right-hand side. The left side thus illustrates pixel-based features, while the right side illustrates either edge- or region-based features.

As an example, MRIs are composed of N = 3 spectral dimensions: two relaxation times as well as proton density at each 2D spatial location. Let T1_ij, T2_ij and r_ij denote the spin-lattice relaxation, transverse relaxation, and proton density of pixel (i, j) in an MR slice of overall dimensions (m × n). We can aggregate these three measurements into the pixel vector x_ij = (T1_ij, T2_ij, r_ij)ᵀ in ℜ³; and the pixel vectors so constructed comprise a data set X that supports pixel-based methods.

FIGURE 5 Extraction of features for image processing.

On the other hand, if we estimated the horizontal and vertical gradients, say g_h,ij and g_v,ij, of the intensity function at pixel (i, j) from intensities in some neighborhood of (i, j) in each of the three MR slices, there would be either three sets of features in ℜ² or one set of features in ℜ⁶ to support edge-based segmentation of X. (Some writers call gradients texture features.) Finally, we might instead extract and order the nine intensities from a 3 × 3 window centered at pixel (i, j) in each of the three slices. This would result in either three sets of features in ℜ⁹ or one set of features in ℜ²⁷ to support region-based segmentation of X.
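A sketch of the window-based (region) features just described, extracting the w × w intensities around each interior pixel of one slice into a vector in ℜ^(w·w); the function name is ours:

```python
import numpy as np

def window_features(img, w=3):
    """For each interior pixel (i, j) of a 2D slice, stack the w x w window of
    intensities centered at (i, j) into a feature vector in R^(w*w)."""
    m, n = img.shape
    r = w // 2
    feats = {}
    for i in range(r, m - r):
        for j in range(r, n - r):
            feats[(i, j)] = img[i - r:i + r + 1, j - r:j + r + 1].ravel().copy()
    return feats   # dict: pixel address -> feature vector
```

Applying this to each of the three MR slices and concatenating the results per pixel gives the single set of features in ℜ²⁷ mentioned above.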

It is both possible and sometimes profitable to combine pixel- and window-based features to make up each vector in X. For example, the spatial coordinates of the pixels can be used either as part of the feature vector, or in the extraction of the features selected for use. In the final analysis the choice of features is very important, but the "quality" of features is extremely domain dependent. In this sense, it is difficult to quantitatively describe a "generally useful'' feature space. This is especially true in the medical domain, where clinicians must be relied upon to supply insight and recommendations about the features they use when visually processing a medical image.

### 2.4 2D Image Segmentation

Many studies of nonfuzzy segmentation methods have been published. For example, Morrison and Attikiouzel [17] describe segmentation by statistical and neural network models; other multispectral methods are discussed by Just and Thelen [18], Hyman et al. [19], Vannier et al. [20], and Raman et al. [21]. Jain and Flynn [22] provide a wonderful survey of image segmentation by crisp cluster analysis. Fuzzy clustering and early work in fuzzy image processing is discussed in many papers reprinted in [4].

The objective of image segmentation is to divide an image into (meaningful) regions. Errors made in this stage will affect all higher level activities. Therefore, methods that incorporate the uncertainty of object and region definitions and the faithfulness of the features to represent various objects (regions) are desirable. Some writers regard edge detection as image segmentation, but we prefer to keep these two schemes for decomposing the elements of an input image separate.

In an ideally segmented image, each region should be homogeneous with respect to some predicate such as gray level or texture, and adjacent regions should have significantly different characteristics or features [14,47]. More formally, segmentation is the process of partitioning the entire image in its X representation into c crisp and maximally connected subregions {X_i} such that each X_i is homogeneous with respect to some predicate P, i.e.,

$$\bigcup_{i=1}^{c} X_i = X; \quad (16a)$$

$$X_i \cap X_j = \varnothing,\ i \ne j; \quad (16b)$$

$$\text{each } X_i \text{ is connected}; \quad (16c)$$

$$P(X_i) = \text{TRUE for all } i; \quad (16d)$$

$$P(X_i \cup X_j) = \text{FALSE if } i \ne j \text{ and } X_i \text{ is adjacent to } X_j. \quad (16e)$$

Conditions (16a) and (16b) together are just the set-theoretic representation of a crisp c-partition of X. The crisp membership function m_{X_i} : IJ → {0, 1} of region X_i is

$$m_{X_i}(i, j) = \begin{cases} 1, & (i, j) \in X_i \\ 0, & \text{otherwise} \end{cases} \quad (17)$$

and if the values of the c membership functions in (17) are arrayed as a matrix U, it is a crisp c-partition U ∈ M_hc of X. Notice that even though the spatial locations of the pixels in the image are not directly involved in the partition in (16), the image plane is implicitly partitioned by U, because each x_ij in X is associated with the ijth column of U.
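The correspondence between a segmented image and the matrix U can be made concrete: given a label image, the c crisp membership functions of (17) array into a c × (mn) matrix whose columns are crisp label vectors (the function name is ours):

```python
import numpy as np

def labels_to_partition(label_img, c):
    """Array the c crisp membership functions of (17) as a c x (mn) matrix U:
    the column for pixel (i, j) has a 1 in the row of its region, 0 elsewhere."""
    flat = label_img.ravel()                 # pixel (i, j) -> column i*n + j
    U = np.zeros((c, flat.size), dtype=int)
    U[flat, np.arange(flat.size)] = 1
    return U
```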

In many situations it is not easy to determine if a pixel should belong to a region or not. This is because the features used to determine homogeneity may not have sharp transitions at region boundaries. This is especially true when features are computed using, say, a local 3 x 3 or 5 x 5 window. To alleviate this situation, we can insert fuzzy sets concepts into the segmentation process. The first reference to fuzzy segmentation was made by Prewitt [24], who suggested that the results of image segmentation should be fuzzy subsets rather than crisp subsets of the image plane. In a fuzzy segmentation, each pixel is assigned a membership value in each of the c regions. If the memberships are taken into account while computing properties of regions, we often obtain more accurate estimates of region properties.

The result of a fuzzy segmentation of X is a fuzzy partition U of X into c fuzzy subsets {m_{X_i} : IJ → [0, 1] : i = 1, …, c}, which replaces the c membership functions in (17). For (i, j) ∈ IJ, m_{X_i}(i, j) represents the degree to which (i, j) belongs to X_i. This is how fuzzy clustering algorithms enter the picture—they produce the matrix U that partitions the image. Unfortunately, this does not preserve the connectivity among regions at (16c) that is presumably enforced by construction of the predicate P in (16d) and (16e). This is one of the most important problems that needs to be solved if segmented medical images are to be clinically useful, because without this property, clustering algorithms (crisp or soft) will "tear up tissue classes" into medically incorrect regions. We will return to this point in the last section.

### 2.5 Segmentation in Medical Applications

There are many ways to classify segmentation methods, none of which leads to a crisp partition of them. For example, Figure 92 in Dellepiane [23] shows a classification based on a tree rooted at image segmentation that subdivides algorithms based on the parameters that guide them to their goal. Dellepiane identifies three main groups based on density, topology, and geometry.

Our interest in the role and importance of human experts in medical image processing leads us to a somewhat different way to classify segmentation algorithms. Our focus is primarily on how and where knowledge beyond that possessed by the sensor data and computationally obtained low-level information is injected into the system. Based on this criterion, traditional pattern recognition methods for segmentation can be (roughly) classified into two groups, supervised (Su) and unsupervised (US) methods, depending on whether the vectors in X are labeled or unlabeled. Figure 6 illustrates the distinction we make here between the two groups of segmentation methods that seem popular in the fuzzy modeling domain, and further subdivides unsupervised methods into two subgroups based on whether a human (USA) or a rule base (USB) is used to assign labels (tissue classes) to the unlabeled clusters (or regions) in the image. In the sequel we refer to the three vertical paths (left to right) in Fig. 6 as tracks USA, USB and Su, respectively.

Note carefully that it is the labels of the data (and their use during training) that divide the methods illustrated in Figure 6 into supervised and unsupervised approaches. The fact that humans or rules are used to assign physical labels to tissue clusters in the two US tracks is of course supervision in some broader (and in medicine, crucially more important) sense, but here we use the term supervised in the usual context of classifier design—that is, when training data X_tr are used prior to segmentation of test images. There are also segmentation algorithms that use human knowledge about the image domain as a basis for nontraditional classifier designs (Herndon et al. [25]; Hata et al. [26,27]). These models, and others that appear in the section on 3D problems, don't fit in the framework of Fig. 6 very well.

FIGURE 6 A classification of segmentation methods based on human intervention.

The bottom portion of Fig. 6 reemphasizes the need for final evaluation of processed images by medically trained experts. Researchers rarely carry a system to this stage, much less incorporate this evaluation as part of the algorithm itself (cf. Dellepiane [23]). Prior to computer-aided medical diagnosis, problem domains were focused in areas where system designers were the experts (for example, finding chairs, cigars, fish, or guitars). Medical image analysis, however, presents a vastly more complicated problem when visual evaluation is relied upon to assess the quality and utility of algorithmic outputs. Usually, clinicians are provided with a set of images and asked to evaluate one or more "properties" of the enhancement process, such as judging the faithfulness of the replication of the film or sensor information, the utility of the enhancement (edges, regions, etc.), or the implication of region of interest (ROI) prompts (Hume et al. [11]). We illustrate these points with an example from digital mammography [10].

As shown in Fig. 7, radiologists (more generally, clinicians with expertise in the appropriate medical domain) can be involved in performance evaluation of image segmentation in three ways. For example, radiologists can compare the following:

(C1) Sensor outputs (e.g., films) to unenhanced digitizations (digital images)

(C2) Unenhanced digital images to enhanced digital images

(C3) Original sensor outputs to enhanced digital images

When radiologists examine mammogram films in a light box, the entire mammogram suite can be viewed. However, when a digital version of the film is viewed on a computer monitor, only a fraction of the digitized film can be viewed at one time, often with just 256 shades of gray (assuming an 8 bit display). When printed, these digitized images must be further cropped or compressed to fit the paper constraints of the printer. In this case, an average 600 dpi (dots per inch) laser

FIGURE 7 Clinician involvement in algorithm evaluation.

printer can resolve only 122 gray levels [12]. Research indicates that humans themselves can only resolve 32 gray levels [13], and yet many of these issues are ignored in the current literature on mammogram enhancement systems.

Since radiologists are most comfortable with films, while nonclinicians such as computer scientists and engineers are most comfortable with computer monitors and printouts of digitized images, some compromise is involved when developing and evaluating the performance of these systems. Access to all three data types (original films, digitized images, and processed images) maximizes flexibility when defining performance assessment instruments, while still ensuring the development of a sound and repeatable evaluation methodology.

As an example, the involvement of clinicians who are actively participating in breast cancer screening programs is critical for the successful emergence of mainstream digital mammography. When such clinicians have contributed to the creation of a database, true, medically registered correlations between radiologist and ground truth can be made. Clinicians can be asked to independently judge the enhancement quality of the breast skin line and breast substructure, the sufficiency of the detail level within and outside these areas, and the level of differentiation between normal and abnormal tissues. Instruments used by Bensaid et al. [15] to elicit feedback from radiologists viewing brain tumor segmentations are a good example of how this can be done.

In addition to clinician involvement in performance evaluation, mammographic image databases often contain ground truth information that may include, e.g., ratings for characterizations of breast tissues, or the size and locations of lesions. Ground truth information is derived by one or more domain experts in one of two ways:

GT1 (primary). Visually guided hand labeling of clinical features on sensor outputs

GT2 (secondary). Algorithmically determined locations of ROIs that are visually assessed and accepted by clinical experts

When ground truth of either type is available, performance analysis can be based on various pattern recognition methods [16]:

PR1. Using labeled test data to estimate error rate

PR2. Correlation between computed and ground truth labels

PR3. Analyzing receiver operating characteristic curves

There are several ways to use these measures. For example, algorithm A can be used with training data Xtr to parametrize a classifier D, and then the labels the classifier assigns to test data Xte can be used to estimate the generalization potential of D. If ground truth information is available, estimated regions of interest can be compared to the ground truth to assess the quality of the algorithm. Standard measures of agreement or goodness of fit such as correlation coefficients can be used to compare the relative quality of several algorithms A, B, ..., Z on common data sets that possess ground truth.

The scoring method used to determine a "hit" vs a "miss" when using ground truth information deserves careful consideration. For example, comparisons can be based on matching centroids of observed vs expected regions, intersections of bounding boxes for detected and benchmark regions, etc. Standardization of the technique employed for a given data set is critical to making unbiased assessments of an algorithm's performance. The general situation for assessment by pattern recognition techniques is summarized in Fig. 8 (for the mammographic image domain: "micro" stands for microcalcifications). As shown in Fig. 8, Xte may contain images (here micros) with abnormalities not represented in the training data Xtr.

Evaluation without clinician involvement such as that illustrated in Fig. 8 can provide insight into the success or utility of a proposed technique. However, clinician involvement is vital to developing a generalizable, non-database-specific, repeatable methodology that will be accepted by health care personnel.

Visual examination (e.g., shown as track USA in Fig. 6) of an algorithmically suggested, artificially colored image is termed human validation. In this case, a clinically knowledgeable operator inspects, say, an unsupervised segmentation of a medical image, and either rejects it or accepts it and assigns a physical label to each region (cluster). This approach can be successful in terms of labeling the segmented image correctly only if the operator can imagine tissue structure as it must be in the data. Since X is not labeled, the "true" substructure in the data is unknown. Thus, human validation is subjective and to some extent nonrepeatable. Nonetheless, this method is extremely important and still much in evidence, as clinicians historically trust human judgment more than computational evidence.

FIGURE 8 Evaluation of supervised approaches.

A third possibility (besides mathematical validation, which we have already discussed) is to instead use rule-based validation (e.g., as shown in track USB in Fig. 6), in which clinical knowledge is encapsulated in the form of a rule-based system. This circumvents the need for a human on-line either before or after the segmentation, and probably represents the best hope at present to realize a truly unsupervised medical assistant [28].

After clustering to find a partition U of unlabeled X in either US track, a crisp color is usually assigned to each tissue class. If U is already crisp, this is straightforward. Otherwise, the simplest way to do this is to harden each column of U with H as in Eq. (7). Another possibility is to assign "fuzzy" colors to each pixel by mixing c basic colors in proportion to their memberships. Lighter shades are usually assigned to the pixels with high membership values and darker shades are used for lower membership values. This has the effect of outlining borders where classes are intermixed and has been preferred by physicians (Bezdek et al. [29]). The choice of which color to use for which region, and how to shade regions in images, is a seemingly trivial part of segmentation. However, visual displays often have preestablished expectations in the medical community. The coloring scheme chosen significantly affects the utility of computed outputs, so this issue deserves careful attention.
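Both display options just described are one-liners. In the sketch below, hardening follows the usual maximum-membership rule for H, and the "fuzzy" coloring blends c basic colors in proportion to the memberships; the function names and the palette are ours:

```python
import numpy as np

def harden(U):
    """Harden each column of a c x n fuzzy partition U: each pixel gets the
    class of its maximum membership (ties broken by the lowest class index)."""
    return np.argmax(U, axis=0)

def mix_colors(U, palette):
    """'Fuzzy' coloring: blend c basic colors (rows of palette, shape (c, 3))
    in proportion to each pixel's memberships; returns (n, 3) RGB values."""
    return U.T @ palette
```

With `harden`, intermixed-class pixels are forced into a single crisp color; with `mix_colors`, those same pixels take intermediate shades, which is what outlines the borders between classes.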

Several fuzzy approaches to the tracks USA and Su in Fig. 6 are reviewed in [29]. Several fuzzy models for track USB were surveyed by Clark et al. [30]. For readers interested in image processing based on the fuzzy reasoning paradigm that has emerged from fuzzy control (not specifically for applications in the medical domain), we highly recommend the survey by Keller et al. [31] as a companion to this chapter.

## 3 Qualitative Discussion of a Few Fuzzy Image Segmentation Methods

This section contains a few examples of segmentation based on fuzzy models in each of the three tracks shown in Fig. 6. Much of this material is a subset of Bezdek and Sutton [38], which has more extensive discussions of many other papers and methods that will not fit into the space allotted for this chapter. Another source for fuzzy imaging models and their applications (though not much on medicine) is [3].

### 3.1 Unsupervised Segmentation: Track USA

Much recent work in track USA has been done at many institutions. We will discuss this area using case studies from research done at the University of South Florida (USF). Bensaid et al. [32] introduced a technique called validity-guided (re)clustering (VGC) and illustrated its use on MR images of patients with and without pathology. The VGC index is a generalization of the Xie-Beni cluster validity index [33]. The aim of this method is to enhance the quality of unsupervised fuzzy partitions of X by examining each FCM-derived cluster individually. When FCM terminates at U for a fixed c, the general procedure begins by selecting a particular cluster in X (say U_(i) = row i of U) for splitting. After hardening this row with H at (7), FCM with c = 2 is applied to the points corresponding to H(U_(i)). To preserve the chosen value of c, two other clusters in U are merged at the same step.

The overall effect of VGC is to join tightly coupled clusters, split loosely coupled ones, and preserve the chosen value for c. Hence, VGC is not a cluster validation scheme in the same sense as defined previously, since it is not applied to different partitions of X and is not used to choose a best value for c. However, it is driven by a validity functional that assesses the ith cluster, and then sums the VGC measure over i, providing an overall validity function for U. VGC continues reclustering until improvement becomes small.

Bensaid et al. [32] used 30 MR images to illustrate that VGC really improves outputs of FCM segmentation. For each image, an optimized supervised segmentation was constructed by an iterative process whereby the training set was repeatedly reselected in order to optimize the segmentation quality, as determined by two clinically knowledgeable investigators. Segmentation as in track Su of Fig. 6 was done with the crisp 7-nn rule using the Euclidean norm. The 7-nn rule was chosen as a way to construct GT2 type ground truth because this classifier is reportedly superior to various neural-like networks and probabilistic designs (Vaidyanathan et al. [34]).

The optimized 7-nn, FCM and VGC segmentations were subsequently evaluated by three expert radiologists in a blind study discussed in Bezdek et al. [35]. Each radiologist was provided with three (T1, T2, ρ) views of each unenhanced raw image and the three segmentations, and was asked to fill out a survey form (method C2). Individual panelists were asked to rate the quality of the first four performance indicators shown in Table 4 on a scale from 0 to 10, where 0 = very bad and 10 = excellent. Each radiologist was also asked to rate the last two items (5 and 6 in Table 4) on a percentage basis—that is, to estimate the percentage of true positive tumor (correctly classified tumor pixels) and the percentage of false positive

TABLE 4 Performance indicators rated by the radiologists (columns: Item, Description, 7-nn rule, …)