

14.6 CONFIDENCE INTERVALS FOR PREDICTIONS

The purpose of developing a regression model is usually to predict the value of the response variable for values of the predictor variable that have not been measured. Given the regression equation, it is easy to predict the response for any given value x_p of the predictor variable:

ŷ_p = b_0 + b_1 x_p        (14.3)

This formula gives only the mean value of the predicted response based upon the sample. As with the other computations based on the sample, it is necessary to specify a confidence interval for this predicted mean. The formula for the standard deviation of the mean of a future sample of m observations at x_p is

s_ŷmp = s_e [ 1/m + 1/n + (x_p − x̄)² / (Σx² − n x̄²) ]^(1/2)

There are two special cases of this formula that are of interest. One case is m = 1, which gives the standard deviation of a single future observation:

s_ŷp = s_e [ 1 + 1/n + (x_p − x̄)² / (Σx² − n x̄²) ]^(1/2)


FIGURE 14.5  Confidence intervals for predictions from regression models.

The second case is m = ∞, which gives the standard deviation of the mean of a large number of future observations at x_p:

s_ŷ∞p = s_e [ 1/n + (x_p − x̄)² / (Σx² − n x̄²) ]^(1/2)

Notice that the standard deviation for the mean of an infinite future sample is lower than that of finite samples since in the latter case the error associated with the future observations should also be accounted for.

In all cases discussed above, a 100(1 – α)% confidence interval can be constructed as ŷ_p ∓ t[1−α/2; n−2]·s, where s is the appropriate standard deviation from the preceding formulas and the t quantile is read at n – 2 degrees of freedom.

It is interesting to note from the above expressions that the standard deviation of the prediction is minimal at the center of the measured range (at x_p = x̄) and increases as we move away from the center. Since the goodness of any statistical prediction is indicated by its standard deviation, the goodness of the prediction decreases as we move away from the center. This is shown schematically in Figure 14.5. In particular, if we try to predict far beyond the measured range, the variance of the prediction will be large, the confidence interval will be wide, and the accuracy of the prediction will be low.
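The computation above is easy to mechanize. The following sketch, in Python, assumes NumPy and SciPy are available; the function name prediction_interval and its argument names are our own. It fits the regression to a measured sample and returns the predicted mean at x_p together with the confidence interval for either a single future observation (m = 1) or the mean of a large number of future observations (m = ∞):

import numpy as np
from scipy import stats

def prediction_interval(x, y, x_p, m=np.inf, alpha=0.10):
    """Predicted mean at x_p and its 100(1 - alpha)% confidence interval.

    m = 1      -> bounds for a single future observation
    m = np.inf -> bounds for the mean of a large number of future observations
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum(x**2) - n * x_bar**2              # Σx² − n·x̄²

    # Least-squares regression coefficients b0 and b1.
    b1 = (np.sum(x * y) - n * x_bar * y_bar) / sxx
    b0 = y_bar - b1 * x_bar

    # Standard deviation of errors, s_e, with n − 2 degrees of freedom.
    e = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(e**2) / (n - 2))

    # Standard deviation of the predicted mean of m future observations at x_p.
    y_hat_p = b0 + b1 * x_p
    s_pred = s_e * np.sqrt((0.0 if np.isinf(m) else 1.0 / m)
                           + 1.0 / n + (x_p - x_bar)**2 / sxx)

    # 100(1 − alpha)% interval using the t quantile at n − 2 degrees of freedom.
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return y_hat_p, y_hat_p - t * s_pred, y_hat_p + t * s_pred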

Example 14.5 Using the disk I/O and CPU time data of Example 14.1, let us estimate the CPU time for a program with 100 disk I/O’s.

In this case, we have already seen that the regression equation is

CPU time = –0.0083 + 0.2438(number of disk I/O’s)

Therefore, for a program with 100 disk I/O’s, the mean CPU time is

CPU time = –0.0083 + 0.2438(100) = 24.3674

Standard deviation of errors s_e = 1.0834

The standard deviation of the predicted mean of a large number of observations is

s_ŷ∞p = 1.0834 [ 1/7 + (100 − x̄)² / (Σx² − 7x̄²) ]^(1/2)

where x̄ and Σx² are computed from the seven observations of Example 14.1.

From Table A.4 in the Appendix, the 0.95-quantile of the t-variate with five degrees of freedom is 2.015.

90% confidence interval for the predicted mean = 24.3674 ∓ (2.015)s_ŷ∞p = (21.9, 26.9)

Thus, we can say with 90% confidence that the mean CPU time for a program making 100 disk I/O’s will be between 21.9 and 26.9 milliseconds. This prediction assumes that we will take a large number of observations of such programs and then compute their mean.

To set bounds on the CPU time of a single future program with 100 disk I/O’s, the computation is as follows:

90% confidence interval for a single prediction = 24.3674 ∓ (2.015)s_ŷp

where s_ŷp is obtained from the single-observation (m = 1) formula above.

Notice that the confidence interval for the single prediction is wider than that for the mean of a large number of observations.
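The sketch presented earlier can be exercised on a small sample to see the two cases side by side; the measurements below are hypothetical and are not the data of Example 14.1 (the calls reuse the prediction_interval function defined above):

import numpy as np

# Hypothetical measurements: disk I/O counts and CPU times in milliseconds.
disk_io = [15, 22, 30, 41, 47, 58, 75]
cpu_ms = [4, 6, 8, 10, 12, 14, 18]

# Mean of many future observations at x_p = 100 versus a single future observation.
print(prediction_interval(disk_io, cpu_ms, x_p=100, m=np.inf, alpha=0.10))
print(prediction_interval(disk_io, cpu_ms, x_p=100, m=1, alpha=0.10))
# The interval printed second (single observation) is always the wider of the two.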

14.7 VISUAL TESTS FOR VERIFYING THE REGRESSION ASSUMPTIONS

In deriving the expressions for regression parameters, we made the following assumptions:

1.  The true relationship between the response variable y and the predictor variable x is linear.
2.  The predictor variable x is nonstochastic and it is measured without any error.
3.  The model errors are statistically independent.
4.  The errors are normally distributed with zero mean and a constant standard deviation.

If any of these assumptions is violated, the conclusions based on the regression model could be misleading. In this section, we describe a number of visual techniques to verify that these assumptions hold. Unlike statistical tests, all visual techniques are approximate. However, we have found them useful for two reasons. First, they are easier to explain to decision makers who may not understand statistical tests. Second, they often provide more information than the simple “pass-fail” answer obtained from a statistical test. Often, a visual test also reveals the cause of the problem.

The assumptions that can be tested visually and the corresponding tests are as follows; a sketch that produces all of these diagnostic plots appears after the list:

1.  Linear Relationship: Prepare a scatter plot of y versus x. Any nonlinear relationship can be easily seen from this plot. Figure 14.6 shows a number of hypothetical possibilities. In case (a), the relationship appears to be linear, and the linear model can be used. In case (b), there appear to be two different regions of operation, and the relationship is linear in each region; thus, two separate linear regressions should be used. In case (c), there is one point that is quite different from the remaining points. This may be due to a measurement error. The values must be rechecked, and if possible, measurements should also be made at other intermediate values. In case (d), the points appear to be related but the relationship is nonlinear; a curvilinear regression (discussed in Section 15.3) should be used in place of a linear regression.
2.  Independent Errors: After the regression, compute the errors e_i and prepare a scatter plot of e_i versus the predicted response ŷ_i. Any visible trend in the scatter plot would indicate a dependence of the errors on the predictor variable. Figure 14.7 shows three hypothetical plots of error versus predicted response. In case (a), there is no visible trend or clustering of points, and therefore, the errors appear to be independent. In case (b), we see that the errors increase with increasing response. In case (c), the trend is nonlinear. Any such trend is indicative of an inappropriate model. It is quite possible that a linear model is not appropriate for this case, and either the curvilinear regression discussed in Section 15.3 or one of the transformations discussed in Section 15.4 should be tried.


FIGURE 14.6  Possible patterns of scatter diagrams.


FIGURE 14.7  Possible patterns of residual versus predicted response graphs.


You may also want to plot the residuals as a function of the experiment number, where the experiments are numbered in the order in which they were conducted. As shown in Figure 14.8, any upward or downward trend in such a plot would indicate the presence of other factors, environmental conditions (temperature, humidity, and so on), or side effects (such as incorrect initializations) that varied from one experiment to the next and affected the response. The cause of such trends should be identified. If additional factors are found to affect the response, they should also be included in the analysis.


FIGURE 14.8  A trend in the residual versus experiment number may indicate side effects or incorrect initializations.


FIGURE 14.9  The normal quantile-quantile plots of the residuals should be a straight line.


We must point out that there is no foolproof test for independence. All tests for independence simply try to find dependence of one kind or another. Thus, passing a test proves only that that particular test was unable to find any dependence; it does not mean that another test will not find one. In other words, dependence can be proven in practice, but independence cannot.
3.  Normally Distributed Errors: Prepare a normal quantile-quantile plot of errors. If the plot is approximately linear, the assumption is satisfied. Figure 14.9 shows two hypothetical examples of such plots. In case (a), the plot is approximately linear, and so the assumption of normally distributed errors is valid. In case (b), there is no visible linearity and the errors do not seem to be normally distributed.
4.  Constant Standard Deviation of Errors: This property is also known as homoscedasticity. To verify it, observe the scatter plot of errors versus predicted response prepared for the independence test. If the spread in one part of the graph seems significantly different from that in other parts, the assumption of constant variance is not valid.
Figure 14.10 shows two hypothetical examples. In case (a), the spread is homogeneous. In case (b), the spread appears to increase as the predicted response increases. This implies that the distribution of the errors still depends on the predictor variable, that is, the regression model does not fully incorporate its effect; the linear model is not a good model in this case. A curvilinear regression should be tried instead. A transformation of the response, for example, using log(y) in place of y, may also help eliminate the problem. The transformations are discussed in Section 15.4.
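All four visual tests can be produced from the residuals of the fitted model. The sketch below, in Python, assumes NumPy, SciPy, and Matplotlib are available; the function name regression_diagnostics and its arguments are our own. It draws the scatter plot of y versus x, the residual-versus-predicted-response plot, the residual-versus-experiment-number plot, and the normal quantile-quantile plot of the residuals:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def regression_diagnostics(x, y, b0, b1):
    """Draw the four visual tests for a fitted simple linear regression y ≈ b0 + b1·x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = b0 + b1 * x          # predicted responses
    e = y - y_hat                # residuals (errors)

    fig, ax = plt.subplots(2, 2, figsize=(9, 7))

    # 1. Linearity: scatter plot of y versus x, with the fitted line overlaid.
    order = np.argsort(x)
    ax[0, 0].scatter(x, y)
    ax[0, 0].plot(x[order], y_hat[order])
    ax[0, 0].set(title="y versus x", xlabel="x", ylabel="y")

    # 2. Independence and 4. constant spread: residuals versus predicted response.
    ax[0, 1].scatter(y_hat, e)
    ax[0, 1].axhline(0.0)
    ax[0, 1].set(title="Residuals versus predicted response",
                 xlabel="predicted response", ylabel="residual")

    # Residuals versus experiment number (experiments in the order conducted).
    ax[1, 0].plot(np.arange(1, len(e) + 1), e, marker="o")
    ax[1, 0].set(title="Residuals versus experiment number",
                 xlabel="experiment number", ylabel="residual")

    # 3. Normality: normal quantile-quantile plot of the residuals.
    stats.probplot(e, dist="norm", plot=ax[1, 1])
    ax[1, 1].set_title("Normal Q-Q plot of residuals")

    fig.tight_layout()
    plt.show()

The function is called with the measured x and y values and the fitted coefficients, for example regression_diagnostics(disk_io, cpu_ms, b0=-0.0083, b1=0.2438), where disk_io and cpu_ms are the hypothetical measurements used in the earlier sketch.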


FIGURE 14.10  A trend in the spread of residuals as a function of the predicted response indicates a need for transformation or a nonlinear regression.


FIGURE 14.11  Graph of residual versus predicted response for the disk I/O and CPU time data.


