

The following example illustrates the application of these tests.

Example 14.6 For the disk I/O and CPU time data of Example 14.1, a scatter plot of the data was shown in Figure 14.2. The plot does appear to show a linear relationship. To check independence of errors, a plot of residuals as a function of the predicted CPU time (columns ŷi and ei in Table 14.1) is shown in Figure 14.11. There does not seem to be any definite trend in the plot. A plot of errors as a function of observation number is shown in Figure 14.12. This graph also does not show any trends.
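The two series plotted in these trend checks can be produced with a few lines of code. The following is a minimal sketch, not the procedure used to prepare the figures. The (disk I/Os, CPU time) pairs are assumed to be those of Example 14.1, which is not reproduced in this section, so they should be checked against Table 14.1; the function names `fit_line` and `residual_table` are ours.

```python
# Sketch: residuals for the trend plots of Example 14.6.
# ASSUMPTION: these (disk I/Os, CPU ms) pairs are taken to be the data of
# Example 14.1; verify them against Table 14.1 before relying on the output.
data = [(14, 2), (16, 5), (27, 7), (42, 9), (39, 10), (50, 13), (83, 20)]


def fit_line(pairs):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxy = sum(x * y for x, y in pairs) - n * x_bar * y_bar
    sxx = sum(x * x for x, _ in pairs) - n * x_bar ** 2
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residual_table(pairs):
    """Rows of (observation number, predicted y, residual)."""
    b0, b1 = fit_line(pairs)
    rows = []
    for i, (x, y) in enumerate(pairs, start=1):
        y_hat = b0 + b1 * x
        rows.append((i, y_hat, y - y_hat))
    return rows


rows = residual_table(data)
for i, y_hat, e in rows:
    # Plot e against y_hat for one check and e against i for the other.
    print(f"{i:2d}  y_hat={y_hat:7.3f}  e={e:+.3f}")
```

Plotting the third column against the second, and then against the first, gives the two views discussed above; in both, we look for any systematic pattern rather than for a particular numeric threshold.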


FIGURE 14.12  Residuals as a function of observation numbers for the disk I/O and CPU data.


FIGURE 14.13  Normal quantile-quantile plot for the residuals of disk I/O and CPU time data.

To check whether the normality assumption is valid, a normal quantile-quantile plot of the errors is shown in Figure 14.13. The graph is reasonably close to a straight line, leading us to believe that the normality assumption is approximately valid in this case.
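The points of a normal quantile-quantile plot can be computed directly from the residuals. The sketch below assumes the plotting position q_i = (i - 0.5)/n and uses Python's standard `statistics.NormalDist` for the inverse normal CDF; the sample residuals are hypothetical values chosen only to exercise the code, and the helper names are ours.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pairs (standard normal quantile, ordered residual) for a Q-Q plot.

    ASSUMPTION: plotting position q_i = (i - 0.5) / n for the i-th
    smallest residual; other conventions exist.
    """
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, e in enumerate(sorted(residuals), start=1):
        q = (i - 0.5) / n
        points.append((std_normal.inv_cdf(q), e))
    return points


def correlation(points):
    """Pearson correlation of the Q-Q points; near 1 suggests a straight line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical residuals, for illustration only.
sample_residuals = [-0.4, 1.1, 0.4, -1.2, 0.5, 0.8, -1.2]
points = normal_qq_points(sample_residuals)
r = correlation(points)
print(points)
print(f"correlation of Q-Q points: {r:.3f}")
```

The correlation is only a rough numeric companion to the plot; with few observations, the visual judgment described above remains the main tool.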

To check homoscedasticity, we notice that the errors do seem to have a larger spread toward the lower values of ŷ. It is difficult to make any judgments in this case due to the small number of observations. However, since the magnitude of the errors is small relative to the predictions, this is not a concern in this case.

Example 14.7 For the RPC performance study presented earlier in Case Study 14.1, a residual-versus-ŷ plot for the ARGUS data is shown in Figure 14.14. The spread on the right side of the graph (at larger values of ŷ) seems to be considerably higher than that on the left side. Since the magnitudes of the errors are not negligible, this is a cause for concern.
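A crude numeric companion to this visual check is to compare the spread of the residuals in the lower and upper halves of the predicted values. The sketch below is our own illustration, not a formal statistical test and not the method used in the case study. The ARGUS measurements are not reproduced here, so the input values are hypothetical, and `spread_ratio` is a name we introduce.

```python
from statistics import stdev


def spread_ratio(predicted, residuals):
    """Ratio of residual spread at high predictions to that at low predictions.

    The observations are ordered by predicted response and split in half
    (the middle one is dropped when the count is odd). A ratio far from 1
    hints that the error variance changes with the predicted response.
    """
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]
    return stdev(high) / stdev(low)


# Hypothetical values, for illustration only (NOT the ARGUS data).
predicted = [10, 20, 30, 40, 50, 60, 70, 80]
residuals = [0.5, -0.4, 0.6, -0.5, 2.0, -2.4, 3.1, -2.8]

ratio = spread_ratio(predicted, residuals)
print(f"spread ratio (high / low): {ratio:.2f}")
```

A ratio well above 1, as with these made-up numbers, corresponds to the widening pattern described for the right side of the graph. Whether it matters still depends on the size of the errors relative to the predictions.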


FIGURE 14.14   Graph of residual versus predicted response for the ARGUS data.


FIGURE 14.15   Normal quantile-quantile plot for the residuals of the ARGUS data.

A normal quantile-quantile plot for the same residuals is shown in Figure 14.15. Once again the departure from normality is high, at least in comparison to that in Figure 14.13.

The key results presented so far are summarized in Box 14.1.

Box 14.1 Simple Linear Regression

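1.  Model: $y_i = b_0 + b_1 x_i + e_i$
2.  Parameter estimation: $b_1 = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$, $b_0 = \bar{y} - b_1\bar{x}$
3.  Allocation of variation: $SSY = \sum y^2$, $SS0 = n\bar{y}^2$, $SST = SSY - SS0$, $SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy$, $SSR = SST - SSE$
4.  Coefficient of determination: $R^2 = \dfrac{SSR}{SST} = \dfrac{SST - SSE}{SST}$
5.  Standard deviation of errors: $s_e = \sqrt{\dfrac{SSE}{n - 2}}$
6.  Degrees of freedom: SST = SSY – SS0 = SSR + SSE

n – 1 = n – 1 = 1 + (n – 2)

7.  Standard deviation of parameters: $s_{b_0} = s_e \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum x^2 - n\bar{x}^2}}$, $s_{b_1} = \dfrac{s_e}{\sqrt{\sum x^2 - n\bar{x}^2}}$
8.  Prediction: Mean of future m observations: $\hat{y}_p = b_0 + b_1 x_p$, $s_{\hat{y}_{mp}} = s_e \sqrt{\dfrac{1}{m} + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum x^2 - n\bar{x}^2}}$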

9.  All confidence intervals are computed using t[1 – α/2;n – 2].
10.  Model assumptions:
(a)  Errors are independent and identically distributed normal variates with zero mean.
(b)  Errors have the same variance for all values of x.
(c)  Errors are additive.
(d)  x and y are linearly related.
(e)  x is nonstochastic and is measured without error.
11.  Visual tests:
(a)  Scatter plot of y versus x should be linear.
(b)  Scatter plot of errors versus predicted responses should not have any trends.
(c)  The normal quantile-quantile plot of errors should be linear.

If any test fails or if the ratio ymax/ymin is large, curvilinear regressions and transformations should be investigated.
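The quantities summarized in Box 14.1 can be computed in one pass using the standard least-squares formulas. The following is a minimal sketch under stated assumptions: the data pairs are taken to be those of Example 14.1 (not reproduced in this section, so verify them against Table 14.1), the t-value 2.015 is the 0.95-quantile of the t distribution with 5 degrees of freedom read from a table, and the function names are ours.

```python
from math import sqrt


def simple_linear_regression(x, y):
    """Estimates and variation breakdown for y = b0 + b1*x + e."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum(v * v for v in x) - n * x_bar ** 2
    b1 = (sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar) / sxx
    b0 = y_bar - b1 * x_bar
    ssy = sum(v * v for v in y)          # sum of squares of y
    ss0 = n * y_bar ** 2                 # sum of squares of the mean
    sst = ssy - ss0                      # total variation, n - 1 d.f.
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))  # n - 2 d.f.
    ssr = sst - sse                      # explained by regression, 1 d.f.
    se = sqrt(sse / (n - 2))             # standard deviation of errors
    return {
        "n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1,
        "sst": sst, "sse": sse, "ssr": ssr, "r2": ssr / sst, "se": se,
        "s_b0": se * sqrt(1 / n + x_bar ** 2 / sxx),
        "s_b1": se / sqrt(sxx),
    }


def prediction_std(fit, x_p, m):
    """Standard deviation of the mean of m future observations at x_p."""
    return fit["se"] * sqrt(
        1 / m + 1 / fit["n"] + (x_p - fit["x_bar"]) ** 2 / fit["sxx"]
    )


def confidence_interval(estimate, std, t_value):
    """Two-sided interval: estimate -/+ t_value * std."""
    return estimate - t_value * std, estimate + t_value * std


# ASSUMPTION: data of Example 14.1 (disk I/Os, CPU time in ms).
disk_ios = [14, 16, 27, 42, 39, 50, 83]
cpu_ms = [2, 5, 7, 9, 10, 13, 20]
fit = simple_linear_regression(disk_ios, cpu_ms)

T_90_5DF = 2.015  # t[0.95; 5], read from a t-table (assumed input)
ci_b1 = confidence_interval(fit["b1"], fit["s_b1"], T_90_5DF)
print(f"b0={fit['b0']:.4f}  b1={fit['b1']:.4f}  R2={fit['r2']:.4f}")
print(f"se={fit['se']:.4f}  90% CI for b1: ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
```

Passing the t-value in as an argument keeps the sketch free of third-party libraries; a statistics package could supply the quantile instead. A confidence interval that excludes zero, as the one for b1 does here, indicates that the parameter is significant at that level.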

EXERCISES

14.1  Using the Lagrange multiplier technique, find the values of b0 and b1 that minimize the sum

$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$

subject to the constraint that the mean error is zero:

$\dfrac{1}{n} \sum_{i=1}^{n} e_i = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$

Verify that these values of b0 and b1 are the same as those obtained by the unconstrained minimization of error variance and presented in Equations (14.1) and (14.2).
14.2  For the disk I/O and CPU data in Example 14.1, find a linear formula to predict the number of disk I/O’s given the CPU time. Answer the following questions about this regression:
a.  Which parameters are significant?
b.  What percentage of variation is explained by the regression?
c.  What is the expected number of disk I/O’s for a program with a CPU time of 40 milliseconds?
d.  What bounds would you put on your answer in c if you wanted to take a less than 10% chance of error on a single program to be measured tomorrow?
e.  Repeat d for the case in which you want to take a less than 10% chance of error on the mean of a large number of programs to be measured tomorrow.
14.3  The memory sizes of the seven programs mentioned in the disk I/O and CPU time data were also measured. The memory size (in kilobytes) and CPU time (in milliseconds) pairs observed are {(70, 2), (75, 5), (144, 7), (190, 9), (210, 10), (235, 13), (400, 20)}. Analyze the data using a simple regression model to predict CPU time as a function of the memory size.
14.4  The designers of a database information system that allows its users to search backward for several days wanted to develop a formula to predict the time it would take to search. Actual elapsed time was measured for several different values of days. The measured data is shown in Table 14.4. Prepare a simple regression model for this data to predict elapsed time as a function of the number of days and interpret results.



Copyright © John Wiley & Sons, Inc.