

6.8.5 Data Scaling

The final results of clustering depend heavily upon the relative values and ranges of the different parameters. It is therefore generally recommended that the parameter values be scaled so that their relative values and ranges are approximately equal. The four commonly used scaling techniques are as follows (a code sketch of all four follows the list):

1.  Normalize to Zero Mean and Unit Variance: Let {x_{1k}, x_{2k}, ..., x_{mk}} be the measured values of the kth parameter. The scaled value x'_ik corresponding to x_ik is given by

        x'_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}

Here x̄_k and s_k are the measured mean and standard deviation of the kth parameter, respectively.
2.  Weights:

        x'_{ik} = w_k x_{ik}

  
The weight w_k may be assigned according to the relative importance of the parameter, or it may be chosen to be inversely proportional to the standard deviation of the parameter values.
3.  Range Normalization: The range is changed from [x_min,k, x_max,k] to [0, 1]. The scaling formula is

        x'_{ik} = \frac{x_{ik} - x_{\min,k}}{x_{\max,k} - x_{\min,k}}

Here x_min,k and x_max,k are the minimum and maximum values, respectively, of the kth parameter. The main problem with this approach is that a few outliers can drastically affect x_min,k and x_max,k and hence the final outcome of the clustering. This disadvantage is overcome by using percentiles rather than the strict minimum and maximum, as discussed next.
4.  Percentile Normalization: The data is scaled so that 95% of the values fall between 0 and 1:

        x'_{ik} = \frac{x_{ik} - x_{2.5,k}}{x_{97.5,k} - x_{2.5,k}}

Here x_2.5,k and x_97.5,k are the 2.5- and 97.5-percentiles, respectively, of the kth parameter.
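
The four scalings lend themselves to a direct implementation. The sketch below is illustrative only: the function names and the NumPy-based representation (one 1-D array of measured values per parameter) are assumptions of mine, not part of the text.

```python
import numpy as np

def scale_zero_mean_unit_variance(x):
    """Scale one parameter's values to zero mean and unit variance."""
    return (x - x.mean()) / x.std(ddof=1)

def scale_weighted(x, w):
    """Multiply by a weight w_k, e.g. w = 1 / x.std(ddof=1)."""
    return w * x

def scale_range(x):
    """Map [min, max] onto [0, 1]; a few outliers can dominate this scaling."""
    return (x - x.min()) / (x.max() - x.min())

def scale_percentile(x):
    """Map the 2.5- and 97.5-percentiles onto 0 and 1."""
    p_lo, p_hi = np.percentile(x, [2.5, 97.5])
    return (x - p_lo) / (p_hi - p_lo)

# Hypothetical example: scale each column (parameter) of a measurement matrix
# to zero mean and unit variance before clustering.
data = np.array([[2.0, 400.0],
                 [3.0, 500.0],
                 [1.0, 600.0]])
scaled = np.column_stack([scale_zero_mean_unit_variance(col) for col in data.T])
```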

6.8.6 Distance Metric

Clustering analysis essentially consists of mapping each component to a point in an n-dimensional space, where n is the number of parameters, and identifying components that are close to each other. The closeness of two components is quantified by a distance measure. Three measures that have been used are as follows:

1.  Euclidean Distance: The distance d between two components {x_i1, x_i2, ..., x_in} and {x_j1, x_j2, ..., x_jn} is defined as

        d = \left[\sum_{k=1}^{n} (x_{ik} - x_{jk})^2\right]^{1/2}

2.  Weighted Euclidean Distance:

        d = \left[\sum_{k=1}^{n} a_k (x_{ik} - x_{jk})^2\right]^{1/2}

Here a_k, k = 1, 2, ..., n, are suitably chosen weights for the n parameters.
3.  Chi-Square Distance:

        d = \left[\sum_{k=1}^{n} \frac{(x_{ik} - x_{jk})^2}{x_{.k}}\right]^{1/2}

The Euclidean distance is the most commonly used distance metric. The weighted Euclidean distance is used if the parameters have not been scaled or if the parameters differ significantly in importance.
The chi-square distance is generally used in distribution fitting. In using this measure, it is important that the x.k values be close to each other, that is, that the parameters have been normalized; otherwise, parameters with low values of x.k get higher weights.
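
The three metrics can be sketched in code as follows. This is an illustrative sketch, assuming each component is a NumPy vector of (scaled) parameter values; the function names are mine, and the chi-square variant follows the reconstruction given above.

```python
import numpy as np

def euclidean(xi, xj):
    """Euclidean distance between two components (vectors of parameter values)."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def weighted_euclidean(xi, xj, a):
    """Weighted Euclidean distance with per-parameter weights a_k."""
    return np.sqrt(np.sum(a * (xi - xj) ** 2))

def chi_square(xi, xj, x_dot_k):
    """Chi-square-style distance; x_dot_k holds the x.k values for the n parameters."""
    return np.sqrt(np.sum((xi - xj) ** 2 / x_dot_k))

xi = np.array([2.0, 4.0])
xj = np.array([3.0, 5.0])
print(euclidean(xi, xj))                                  # ~1.414
print(weighted_euclidean(xi, xj, np.array([0.5, 2.0])))   # ~1.581
```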

6.8.7 Clustering Techniques

The basic aim of clustering is to partition the components into groups so that the members of a group are as similar as possible and members of different groups are as dissimilar as possible. Statistically, this implies that the intragroup variance should be as small as possible and the intergroup variance should be as large as possible. Fortunately, these two goals are redundant in the sense that achieving either one is sufficient. This is because

Total variance = intragroup variance + intergroup variance

Since the total variance is constant, minimizing intragroup variance automatically maximizes intergroup variance.
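
To see why (in sum-of-squares notation that is not in the original text), let x̄ denote the overall centroid, x̄_g the centroid of group g, and n_g the number of components in group g. Then

        \sum_{g}\sum_{i \in g} \lVert x_i - \bar{x} \rVert^2
            = \sum_{g}\sum_{i \in g} \lVert x_i - \bar{x}_g \rVert^2
            + \sum_{g} n_g \lVert \bar{x}_g - \bar{x} \rVert^2

The left side depends only on the data, so the two terms on the right trade off exactly: any reduction in the within-group (intragroup) term is matched by an equal increase in the between-group (intergroup) term.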

A number of clustering techniques have been described in the literature. These techniques fall into two classes: hierarchical and nonhierarchical. In nonhierarchical approaches, one starts with an arbitrary set of k clusters and moves components among the clusters until the intragroup variance is minimized. There are two kinds of hierarchical approaches: agglomerative and divisive. In the agglomerative hierarchical approach, given n components, one starts with n clusters (each cluster containing one component). Neighboring clusters are then merged successively until the desired number of clusters is obtained. In the divisive hierarchical approach, on the other hand, one starts with a single cluster of all n components and divides it successively into two, three, and so on, until the desired number of clusters is obtained. A popular technique, known as the minimum spanning tree method, is described next.

6.8.8 Minimum Spanning Tree Method

This is an agglomerative hierarchical clustering technique, which starts with n clusters of one component each and successively joins the nearest clusters (a code sketch of the procedure follows the list of steps):

1.  Start with k = n clusters.
2.  Find the centroid of the ith cluster, i = 1, 2, ..., k. The centroid has parameter values equal to the average of all points in the cluster.
3.  Compute the intercluster distance matrix. Its (i, j)th element is the distance between the centroids of cluster i and j. Any distance measure described in Section 6.8.6 can be used.
4.  Find the smallest nonzero element of the distance matrix. Let d_lm, the distance between clusters l and m, be the smallest. Merge clusters l and m. Also merge any other cluster pairs that have the same distance.
5.  Repeat steps 2 to 4 until all components are part of one cluster.
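
A minimal sketch of this procedure is given below. It assumes NumPy input (one row of scaled parameter values per component), uses Euclidean distances between centroids, and records the cluster composition after every merge pass; the function name and structure are illustrative assumptions, not the book's code.

```python
import numpy as np

def agglomerative_merge_history(points):
    """Centroid-based agglomerative clustering, following steps 1-5 above.

    points: (n, p) array of scaled parameter values, one row per component.
    Returns the list of cluster compositions after each merge pass.
    """
    clusters = [[i] for i in range(len(points))]         # step 1: n singleton clusters
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:                              # step 5: repeat until one cluster
        # Step 2: centroid of each cluster.
        centroids = np.array([points[c].mean(axis=0) for c in clusters])
        # Step 3: intercluster (Euclidean) distance matrix.
        diff = centroids[:, None, :] - centroids[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)                    # ignore zero self-distances
        d_min = dist.min()
        # Step 4: merge every cluster pair at the minimum distance.
        merged, new_clusters = set(), []
        for i in range(len(clusters)):
            if i in merged:
                continue
            group = list(clusters[i])
            for j in range(i + 1, len(clusters)):
                if j not in merged and np.isclose(dist[i, j], d_min):
                    group += clusters[j]
                    merged.add(j)
            new_clusters.append(group)
        clusters = new_clusters
        history.append([list(c) for c in clusters])
    return history
```

For example, agglomerative_merge_history(np.array([[2, 4], [3, 5], [1, 6], [4, 3], [5, 2]], dtype=float)) traces the merge sequence for the Example 6.3 data.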

Example 6.3 illustrates these steps.

Example 6.3 Consider a workload with five components and two parameters. The CPU time and the number of disk I/O's were measured for five programs. The parameter values after scaling are as shown in Table 6.6.

TABLE 6.6 Data for Clustering Example 6.3

Program CPU Time Disk I/O

A 2 4
B 3 5
C 1 6
D 4 3
E 5 2

Step 1: Consider five clusters, with the ith cluster consisting solely of the ith program.
Step 2: The centroids are {2,4}, {3,5}, {1,6}, {4,3}, and {5,2}. These are shown by the five points in Figure 6.6.
Step 3: Using the Euclidean distance measure, the distance matrix (entries rounded to two decimal places) is

            A       B       C       D       E
    A     0.00    1.41    2.24    2.24    3.61
    B     1.41    0.00    2.24    2.24    3.61
    C     2.24    2.24    0.00    4.24    5.66
    D     2.24    2.24    4.24    0.00    1.41
    E     3.61    3.61    5.66    1.41    0.00

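As a check on one entry of this matrix, the distance between programs A and B is

        d_{AB} = \sqrt{(2 - 3)^2 + (4 - 5)^2} = \sqrt{2} \approx 1.41

which, together with the equal distance between D and E, is the smallest nonzero entry, so step 4 would merge A with B and D with E first.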
FIGURE 6.6  Clustering example.

