

The Box-Cox transformation has the property that w has the same units as the response y for all values of the exponent a. All real values of a, positive or negative, can be tried. The transformation is continuous even at a = 0, since

$$\lim_{a \to 0} \frac{y^a - 1}{a\,g^{a-1}} = g \ln y$$

where g is the geometric mean of the responses.

One way to determine the parameter a is to run the regression with several different values of a and to use the one that gives the smallest sum of squared errors (SSE). A plot of the SSE versus a shows how sensitive the SSE is to a. Since the SSE generally varies by several orders of magnitude, a semilog plot with ln(SSE) on the vertical axis and a on the horizontal axis may be used.
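The search over a can be sketched in Python. This is a minimal sketch under stated assumptions, not the text's own procedure: it assumes the geometric-mean-normalized form of the transformation, w = (y^a - 1)/(a g^(a-1)) with w = g ln y at a = 0, a least-squares line with a single predictor, and hypothetical function names (box_cox, sse_of_line, sse_profile, best_exponent). The data are synthetic, constructed so that the square root of y is exactly linear in x.

```python
import math


def box_cox(ys, a):
    """Box-Cox transform normalized by the geometric mean g (all ys must be > 0)."""
    g = math.exp(sum(math.log(y) for y in ys) / len(ys))
    if abs(a) < 1e-12:
        # Limiting case a = 0.
        return [g * math.log(y) for y in ys]
    return [(y ** a - 1.0) / (a * g ** (a - 1.0)) for y in ys]


def sse_of_line(xs, ws):
    """SSE of the least-squares line w = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    w_bar = sum(ws) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxw = sum((x - x_bar) * (w - w_bar) for x, w in zip(xs, ws))
    b1 = sxw / sxx
    b0 = w_bar - b1 * x_bar
    return sum((w - b0 - b1 * x) ** 2 for x, w in zip(xs, ws))


def sse_profile(xs, ys, exponents):
    """List of (a, SSE) pairs, one per candidate exponent."""
    return [(a, sse_of_line(xs, box_cox(ys, a))) for a in exponents]


def best_exponent(xs, ys, exponents):
    """The (a, SSE) pair with the smallest SSE on the grid."""
    return min(sse_profile(xs, ys, exponents), key=lambda pair: pair[1])


# Synthetic data (an assumption for illustration): sqrt(y) = 1 + 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [(1.0 + 2.0 * x) ** 2 for x in xs]
exponents = [i / 10 for i in range(-10, 21)]  # -1.0, -0.9, ..., 2.0

a_best, sse_best = best_exponent(xs, ys, exponents)
print(a_best, sse_best)

# ln(SSE) values for a semilog plot; a tiny floor avoids log(0).
log_profile = [(a, math.log(max(sse, 1e-300))) for a, sse in sse_profile(xs, ys, exponents)]
```

Because w is rescaled by the geometric mean, the SSE values for different exponents are in the same units and can be compared directly, which is what makes choosing a by minimum SSE meaningful.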

In general, it is preferable to use simple values for a. For example, if a = 0.52 is found to give the minimum SSE and the SSE at a = 0.5 is not significantly higher, the latter value may be preferable. The 100(1 − α)% confidence interval for a includes all values of a for which the SSE is less than the following value:

$$\mathrm{SSE} = \mathrm{SSE}_{\min}\left(1 + \frac{t^2_{[1-\alpha/2;\,\nu]}}{\nu}\right)$$

where $\mathrm{SSE}_{\min}$ is the minimum SSE, $\nu$ is the number of degrees of freedom for the errors, and $t_{[1-\alpha/2;\,\nu]}$ is the (1 − α/2)-quantile of a t-variate with $\nu$ degrees of freedom. If the confidence interval for a includes the value a = 1, then the hypothesis that the relationship is linear cannot be rejected. In other words, there is no need for the transformation. The following case study illustrates the application of the Box-Cox family of transformations.

Case Study 15.2 The garbage collection time for a particular garbage collection algorithm was measured for various heap sizes, as shown in Table 15.10. The analyst hypothesizes that the square root of the time is linearly related to the inverse of the heap size. That is, the model is

$$\sqrt{\text{time}} = b_0 + b_1 \left(\frac{1}{\text{heap size}}\right)$$

TABLE 15.10 Garbage Collection Times for Various Heap Sizes

Heap Size   Garbage Collection Time   Heap Size   Garbage Collection Time
---------   -----------------------   ---------   -----------------------
      500                    594.34        1600                     63.64
      600                    247.42        1800                      1.00
      800                    114.24        2000                      1.00
     1000                     85.64        2200                      1.00
     1200                     49.60        2400                      1.00
     1400                     50.30        2600                      1.00


FIGURE 15.3  Scatter plot of the data for garbage collection study.

The transformed data along with the linear regression line are plotted in Figure 15.3. The points do not appear to be close to the straight line. Suppose we want to test whether the exponent on time differs from one-half. The Box-Cox family of transformations can be used for this purpose. Using several values of a ranging from –0.4 to 0.8, the SSE is computed for each value. Values of the SSE as a function of the exponent a are plotted in Figure 15.4. The minimum SSE of 2049 occurs at a = 0.45. Since the 0.95-quantile of a t-variate with 10 degrees of freedom is 1.812, a horizontal line is drawn at

$$\mathrm{SSE} = 2049\left(1 + \frac{(1.812)^2}{10}\right) \approx 2722$$


FIGURE 15.4  Plot of SSE versus the exponent a for the garbage collection study.

The line intersects the SSE curve at a = 0.2465 and a = 0.5726. Thus, the 90% confidence interval for a is (0.2465, 0.5726). Since the interval includes 0.5, we cannot reject the hypothesis that the exponent is 0.5.
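The arithmetic behind the horizontal line and the interval check can be sketched as follows. The sketch assumes the usual threshold SSE_min(1 + t²/ν), takes the t-quantile as an input read from a table (the Python standard library does not provide one), and uses hypothetical function names. The numbers 2049, 1.812, 10, 0.2465, and 0.5726 are those quoted in the case study. The grid-based interval is only as fine as the grid of exponents supplied, whereas the text reads the intersections off the plotted curve.

```python
def sse_threshold(sse_min, t_quantile, dof):
    """Height of the horizontal line: SSE_min * (1 + t^2 / dof)."""
    return sse_min * (1.0 + t_quantile ** 2 / dof)


def interval_from_profile(profile, threshold):
    """Smallest and largest grid exponents whose SSE lies below the threshold.

    profile is a list of (a, SSE) pairs; returns None if no exponent qualifies.
    """
    inside = [a for a, sse in profile if sse < threshold]
    if not inside:
        return None
    return (min(inside), max(inside))


# Values quoted in the case study.
threshold = sse_threshold(2049.0, 1.812, 10)
lower, upper = 0.2465, 0.5726

half_is_plausible = lower <= 0.5 <= upper    # exponent 0.5 cannot be rejected
linear_is_plausible = lower <= 1.0 <= upper  # a = 1 would mean no transformation is needed

print(round(threshold, 1), half_is_plausible, linear_is_plausible)
```

Here a = 1 lies outside the interval, so under this test the untransformed (linear) relationship would be rejected while the square-root model would not.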

The Box-Cox family of transformations, as described here, cannot be used if any of the measured responses are negative or zero. The solution is to add a constant c to all y’s, replacing them with y + c. Thus, with a shift of c, the Box-Cox family of transformations becomes

$$w = \frac{(y + c)^a - 1}{a\,g^{a-1}}$$

where

$$g = \left[\prod_{i=1}^{n} (y_i + c)\right]^{1/n}$$

In this case, c becomes another parameter in addition to a, which needs to be estimated.
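One possible way to estimate a and c together is a two-dimensional grid search that minimizes the SSE. The text does not prescribe an estimation method, so this is only a sketch under assumptions: the shifted transformation is taken as w = ((y + c)^a - 1)/(a g^(a-1)) with g the geometric mean of the shifted responses, the regression has a single predictor, and all function names are hypothetical. The data are synthetic, built so that sqrt(y + 12) is exactly linear in x.

```python
import math


def shifted_box_cox(ys, a, c):
    """Box-Cox transform of y + c, normalized by the geometric mean of y + c."""
    shifted = [y + c for y in ys]
    if min(shifted) <= 0:
        raise ValueError("c must make every y + c strictly positive")
    g = math.exp(sum(math.log(s) for s in shifted) / len(shifted))
    if abs(a) < 1e-12:
        # Limiting case a = 0.
        return [g * math.log(s) for s in shifted]
    return [(s ** a - 1.0) / (a * g ** (a - 1.0)) for s in shifted]


def sse_of_line(xs, ws):
    """SSE of the least-squares line w = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    w_bar = sum(ws) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxw = sum((x - x_bar) * (w - w_bar) for x, w in zip(xs, ws))
    b1 = sxw / sxx
    b0 = w_bar - b1 * x_bar
    return sum((w - b0 - b1 * x) ** 2 for x, w in zip(xs, ws))


def fit_shifted(xs, ys, exponents, shifts):
    """Grid search over (a, c); returns the (a, c, SSE) triple with minimum SSE."""
    best = None
    for c in shifts:
        for a in exponents:
            sse = sse_of_line(xs, shifted_box_cox(ys, a, c))
            if best is None or sse < best[2]:
                best = (a, c, sse)
    return best


# Synthetic data (an assumption for illustration); the first response is negative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [(1.0 + 2.0 * x) ** 2 - 12.0 for x in xs]

a_best, c_best, sse_best = fit_shifted(
    xs, ys, exponents=[0.25, 0.5, 0.75, 1.0], shifts=[4.0, 8.0, 12.0, 16.0]
)
print(a_best, c_best, sse_best)
```

Every candidate shift must exceed the magnitude of the most negative response, otherwise the logarithm and fractional powers are undefined; the function raises an error in that case rather than returning a meaningless value.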

15.5 OUTLIERS

Any observation that is atypical of the remaining observations may be considered an outlier. Notice the emphasis on the word "may" in the preceding sentence. Including the outlier in the analysis may change the conclusions significantly. Excluding the outlier from the analysis may lead to a misleading conclusion if the outlier in fact represents a correct observation of the system behavior.

A number of statistical tests have been proposed to test whether a particular value is an outlier. Most of these tests assume a certain distribution for the observations. If the observations do not satisfy the assumed distribution, the results of the statistical test will be misleading. In practice, the easiest way to identify outliers is to look at the scatter plot of the data. Any value significantly away from the remaining observations should be investigated for possible experimental errors. Other experiments in the neighborhood of the outlying observation may be conducted to verify that the response is typical of the system behavior in that operating region.

Once the possibility of errors in the experiment has been eliminated, the analyst may decide to include or exclude the suspected outlier based on intuition. One alternative is to repeat the analysis with and without the outlier and state the results separately. Another alternative is to divide the operating region into two (or more) subregions and obtain a separate model for each subregion.
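The alternative of repeating the analysis with and without the suspected outlier can be sketched as follows. This is an illustration only: the data are synthetic, the model is a one-predictor least-squares line, and the function names are hypothetical. Which observation to treat as suspect remains the analyst's judgment, as described above.

```python
def fit_line(xs, ys):
    """Least-squares line y = b0 + b1 * x; returns (b0, b1, SSE)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sse


def compare_with_and_without(xs, ys, suspect_index):
    """Fit the model twice: on all data and with one suspected outlier removed."""
    xs_kept = xs[:suspect_index] + xs[suspect_index + 1:]
    ys_kept = ys[:suspect_index] + ys[suspect_index + 1:]
    return {
        "with_outlier": fit_line(xs, ys),
        "without_outlier": fit_line(xs_kept, ys_kept),
    }


# Synthetic data (an assumption for illustration): y = 2x except for the last point.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 40.0]

results = compare_with_and_without(xs, ys, suspect_index=5)
for label, (b0, b1, sse) in results.items():
    print(label, b0, b1, sse)
```

Reporting both fits side by side lets the reader see how much the conclusions depend on the single suspect observation.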



Copyright © John Wiley & Sons, Inc.