
proposed a heuristic algorithm called P² that allows dynamic calculation of percentiles as the observations are
generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage
requirement regardless of the number of observations. The algorithm has also been extended to allow
histogram plotting.

Formulas for various indices of central tendencies and dispersion are summarized in Box 12.1.


**The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling**
*by Raj Jain*
Wiley Computer Publishing, John Wiley & Sons, Inc.
**ISBN:** 0471503363 **Pub Date:** 05/01/91

**12.10 DETERMINING DISTRIBUTION OF DATA**

In the last two sections we discussed how a measured data set can be summarized by stating its average and variability. The next step in presenting a summary could be to state the type of distribution the data follows. For example, a statement that the number of disk I/O's is uniformly distributed between 1 and 25 is a more meaningful summary than specifying only that the mean is 13 and the variance is 48. The distribution information is also required if the summary is to be used later in simulation or analytical modeling.

**Box 12.1 Summarizing Observations**

Given: A sample {*x*1, *x*2, . . ., *x*n} of *n* observations.

**1.** Sample arithmetic mean: x̄ = (1/*n*) Σ *xi*

**2.** Sample geometric mean: (*x*1 *x*2 · · · *xn*)^(1/*n*)

**3.** Sample harmonic mean: *n*/Σ(1/*xi*)

**4.** Sample median: *x*((n+1)/2) if *n* is odd; otherwise the mean of *x*(n/2) and *x*(n/2+1). Here *x*(i) is the *i*th observation in the sorted set.

**5.** Sample mode = observation with the highest frequency (for categorical data).

**6.** Sample variance: *s*² = Σ(*xi* - x̄)²/(*n* - 1)

**7.** Sample standard deviation: *s* = √*s*²

**8.** Coefficient of variation = *s*/x̄

**9.** Coefficient of skewness = Σ(*xi* - x̄)³/(*n* *s*³)

**10.** Range: Specify the minimum and maximum.

**11.** Percentiles: The 100*p*-percentile is *x*((n-1)p+1), rounding the index to an integer.

**12.** Semi-interquartile range = (*Q*3 - *Q*1)/2, where *Q*1 and *Q*3 are the 25- and 75-percentiles.

**13.** Mean absolute deviation = (1/*n*) Σ |*xi* - x̄|

The simplest way to determine the distribution is to plot a **histogram** of the observations. This requires
determining the maximum and minimum of the values observed and dividing the range into a number of
subranges called **cells** or **buckets**. The count of observations that fall into each cell is determined. The counts
are normalized to cell frequencies by dividing by the total number of observations. The cell frequencies are
plotted as a column chart.

The key problem in plotting histograms is determining the cell size. Small cells lead to very few observations per cell and a large variation in the number of observations per cell. Large cells result in less variation but the details of the distribution are completely lost. Given a data set, it is possible to reach very different conclusions about the distribution shape depending upon the cell size used. One guideline is that if any cell has less than five observations, the cell size should be increased or a variable cell histogram should be used.
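The cell-counting procedure just described is simple enough to sketch directly. The following fragment is our own illustration (not from the book); the `histogram` function and sample data are assumptions for demonstration, and varying `k` shows how the cell count changes the picture:

```python
def histogram(data, k):
    """Divide [min, max] into k equal-width cells and return normalized cell frequencies."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # Clamp so that x == hi falls into the last cell rather than a nonexistent cell k.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    n = len(data)
    return [c / n for c in counts]

data = [23, 33, 14, 15, 42, 28, 33, 45, 23, 34, 39, 21, 36, 23, 34, 36]
freqs = histogram(data, 4)
print(freqs)  # normalized cell frequencies; they sum to 1
```

Re-running with different values of `k` and inspecting the raw counts is a direct way to apply the five-observations-per-cell guideline mentioned above.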

A better technique for small samples is to plot the observed quantiles versus the theoretical quantiles in a quantile-quantile plot. Suppose *y*(*i*) is the observed *qi*th quantile. Using the theoretical distribution, the *qi*th quantile *xi* is computed and a point is plotted at (*xi*, *y*(*i*)). If the observations do come from the given theoretical distribution, the quantile-quantile plot will be linear.

To determine the *qi*th quantile *xi*, we need to invert the cumulative distribution function. For example, if *F*(*x*)
is the CDF for the assumed distribution,

*qi* = *F*(*xi*)

or

*xi* = *F* -1(*qi*)

For those distributions whose CDF can be inverted, determining the *x*-coordinates of points on a quantile-quantile plot is straightforward. Table 28.1 lists the inverse CDFs for a number of distributions.

For other distributions one can use tables and interpolate the values if necessary. For the unit normal distribution *N*(0, 1), the following approximation is often used:

*xi* = 4.91[*qi*^0.14 - (1 - *qi*)^0.14]    (12.1)

For *N*(μ, σ), the *xi* values computed by Equation (12.1) are scaled to μ + σ*xi* before plotting.
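As an illustrative sketch (our own code, not the book's), the *x*-coordinates for a normal quantile-quantile plot can be computed using *qi* = (*i* - 0.5)/*n*, the quantile assignment used in Table 12.5, together with the approximation above:

```python
def qq_x(n):
    """x-coordinates for a normal quantile-quantile plot of n sorted observations.

    Uses q_i = (i - 0.5)/n and the approximation of Equation (12.1)
    for the inverse CDF of the unit normal distribution.
    """
    xs = []
    for i in range(1, n + 1):
        q = (i - 0.5) / n
        x = 4.91 * (q ** 0.14 - (1 - q) ** 0.14)
        xs.append(round(x, 3))
    return xs

print(qq_x(8))  # [-1.535, -0.885, -0.487, -0.157, 0.157, 0.487, 0.885, 1.535]
```

The eight values reproduce the *xi* column of Table 12.5 exactly.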

One advantage of a quantile-quantile plot is that often it is sufficient to know only the name of the possible distribution; the parameter values are not required. This happens if the effect of the parameters is simply to scale the quantiles. For example, in a normal quantile-quantile plot, the *x*-coordinates can be obtained using the unit normal *N*(0, 1) distribution. The intercept and the slope of the resulting line give the values of the location and scale parameters μ and σ.

**Example 12.5** The difference between the values measured on a system and those predicted by a model is called the modeling error. The modeling errors for eight predictions of a model were found to be -0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, and 0.09.

A normal quantile-quantile plot for the errors is shown in Figure 12.5. Computation of values plotted
along the horizontal axis is shown in Table 12.5. The *xi*’s in the last column of the table are computed
using Equation (12.1). In the figure, a straight line that provides a linear least-squares fit to the points is
also shown. From the figure, the errors do appear to be normally distributed.

**FIGURE 12.5** Normal quantile-quantile plot for the error data.

**TABLE 12.5 Data for Normal Quantile-Quantile Plot Example**

| *i* | *qi* | *y*(*i*) | *xi* |
| --- | --- | --- | --- |
| 1 | 0.0625 | -0.19 | -1.535 |
| 2 | 0.1875 | -0.14 | -0.885 |
| 3 | 0.3125 | -0.09 | -0.487 |
| 4 | 0.4375 | -0.04 | -0.157 |
| 5 | 0.5625 | 0.04 | 0.157 |
| 6 | 0.6875 | 0.09 | 0.487 |
| 7 | 0.8125 | 0.14 | 0.885 |
| 8 | 0.9375 | 0.19 | 1.535 |


Often it is found in a normal quantile-quantile plot that the data follows a straight line but departs from it at one or both ends. This is an indication of the data having shorter or longer tails than the normal distribution. For example, Figure 12.6b shows data that has longer tails at both ends. An S-shaped normal quantile-quantile plot, such as the one shown in Figure 12.6c, indicates that the observations have a distribution that is more peaked and has shorter tails than a normal distribution. If the distribution is asymmetric, so that it has a shorter tail on one end and a longer tail on the other, this shows up on the normal quantile-quantile plot as a combination of the two types of departures from normality just discussed. An example of an asymmetric plot is shown in Figure 12.6d.

**FIGURE 12.6** Interpretation of normal quantile-quantile plots.

**EXERCISES**

**12.1** A distributed system has three file servers, which are chosen independently and with equal
probabilities whenever a new file is created. The servers are named A, B, and C. Determine the
probabilities of the following events:

**a.** Server A is selected
**b.** Server A or B is selected
**c.** Servers A and B are selected
**d.** Server A is not selected
**e.** Server A is selected twice in a row
**f.** Server selection sequence ABCABCABC is observed (in nine successive file creations)

**12.2** The traffic arriving at a network gateway is bursty. The burst size *x* is geometrically distributed with the following pmf:

*f*(*x*) = *p*(1 - *p*)^(x-1),  *x* = 1, 2, . . .

Compute the mean, variance, standard deviation, and coefficient of variation of the burst size. Plot the pmf and CDF for *p* = 0.2.
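As a numerical check, the requested statistics can be computed by truncating the infinite sums. This is our own sketch, assuming the standard geometric pmf *f*(*x*) = *p*(1 - *p*)^(x-1) for *x* = 1, 2, . . .:

```python
p = 0.2
xs = range(1, 501)  # truncate the infinite sum; (1 - p)**500 is negligible for p = 0.2
pmf = [p * (1 - p) ** (x - 1) for x in xs]

mean = sum(x * f for x, f in zip(xs, pmf))                 # analytically 1/p = 5
var = sum(x * x * f for x, f in zip(xs, pmf)) - mean ** 2  # analytically (1-p)/p**2 = 20
std = var ** 0.5
cov = std / mean                                           # coefficient of variation

print(round(mean, 3), round(var, 3), round(cov, 3))
```

The numerical sums agree with the closed-form values 1/*p*, (1 - *p*)/*p*², and √(1 - *p*).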

**12.3** The number of I/O requests received at a disk during a unit interval follows a Poisson distribution with the following mass function:

*f*(*x*) = λ^x e^(-λ)/*x*!,  *x* = 0, 1, 2, . . .

Here, λ is a parameter. Determine the mean, variance, and coefficient of variation of the number of requests. Plot the pmf and CDF for λ = 8.

**12.4** Two Poisson streams (see Exercise 12.3) merge at a disk. The pmfs for the two streams are as follows:

Determine the following:

**a.** Mean of *x* + *y*
**b.** Variance of *x* + *y*
**c.** Mean of *x* - *y*
**d.** Variance of *x* - *y*
**e.** Mean of 3*x* - 4*y*
**f.** Coefficient of variation of 3*x* - 4*y*

**12.5** The response time of a computer system has an Erlang distribution with the following CDF:

Find expressions for the pdf, mean, variance, mode, and coefficient of variation of the response time.

**12.6** The CDF of a Pareto variate is given by

Find its pdf, mean, variance, mode, and coefficient of variation.

**12.7** The execution times of queries on a database are normally distributed with a mean of 5 seconds and a standard deviation of 1 second. Determine the following:

**a.** What is the probability of the execution time being more than 8 seconds?
**b.** What is the probability of the execution time being less than 6 seconds?
**c.** What percentage of responses will take between 4 and 7 seconds?
**d.** What is the 95-percentile execution time?
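A quick way to check the answers is with Python's standard `statistics.NormalDist`; this is our own illustrative sketch, not part of the book:

```python
from statistics import NormalDist

d = NormalDist(mu=5, sigma=1)           # execution time in seconds

p_more_8 = 1 - d.cdf(8)                 # a. P(T > 8)
p_less_6 = d.cdf(6)                     # b. P(T < 6)
p_4_to_7 = d.cdf(7) - d.cdf(4)          # c. P(4 < T < 7)
t_95 = d.inv_cdf(0.95)                  # d. 95-percentile execution time

print(round(p_more_8, 4), round(p_less_6, 4), round(p_4_to_7, 4), round(t_95, 2))
```

The same quantities follow by hand from the unit normal table after standardizing with *z* = (*x* - 5)/1.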

**12.8** What index of central tendency should be used to report
**a.** Response time (symmetrical pdf)
**b.** Number of packets per day (symmetrical pdf)
**c.** Number of packets per second (skewed pdf)
**d.** Frequency of keywords in a language

**12.9** How would you summarize an average personal computer configuration:

**a.** CPU type
**b.** Memory size
**c.** Disk type
**d.** Number of peripherals
**e.** Cost

**12.10** The CPU times in milliseconds for 11 workloads on a processor are 0.74, 0.43, 0.24, 2.24, 262.08, 8960, 4720, 19,740, 7360, 22,440, and 28,560. Which index of central tendency would you choose and why?

**12.11** The number of disk I/O’s performed by a number of programs were measured as follows: {23,
33, 14, 15, 42, 28, 33, 45, 23, 34, 39, 21, 36, 23, 34, 36, 25, 9, 11, 19, 35, 24, 31, 29, 16, 23, 34, 24, 38,
15, 13, 35, 28}. Which index of central tendency would you choose and why?
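The candidate indices for this data set can be computed directly (an illustrative sketch using Python's standard `statistics` module; choosing which index to report is still the point of the exercise):

```python
from statistics import mean, median, mode

data = [23, 33, 14, 15, 42, 28, 33, 45, 23, 34, 39, 21, 36, 23, 34, 36,
        25, 9, 11, 19, 35, 24, 31, 29, 16, 23, 34, 24, 38, 15, 13, 35, 28]

# Arithmetic mean, median (17th of 33 sorted values), and mode (most frequent value).
print(round(mean(data), 2), median(data), mode(data))
```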

**12.12** Choose a performance analysis problem. List all system performance metrics and workload
parameters. For each metric/parameter discuss the following:

**a.** Which index of central tendency would you use?
**b.** Which index of dispersion would you use?

Assume appropriate distributions, if necessary, and justify your assumptions. Include at least one example of each of the possible indices: mean, median, mode, geometric mean, range, variance, and percentiles. (This may require you to extend the problem or parameter list.)

**12.13** For the data of Exercise 12.10, which index of dispersion would you choose and why?
**12.14** For the data of Exercise 12.11, compute all possible indices of dispersion. Which index would
you choose and why?

**12.15** Plot a normal quantile-quantile plot for the following sample of errors:

-0.04444 -0.04439 -0.04165 -0.03268 -0.03235 -0.03182 0.02771 0.02650 -0.02569 -0.02358 0.02330 0.02305 0.02213 0.02128 0.01793 0.01668 -0.01565 -0.01509 0.01432 0.00978 0.00889 0.00687 0.00543 0.00084 -0.00083 -0.00048 0.00024 0.00079 0.00082 0.00106 0.00110 0.00132 0.00162 0.00181 0.00280 0.00379 0.00411 0.00424 0.00553 0.00865 0.01026 0.01085 0.01440 0.01562 0.01975 0.01996 0.02016 0.02078 0.02134 0.02252 0.02414 0.02568 0.02682 0.02855 0.02889 0.03072 0.03259 0.03754 0.04263 0.04276

Are the errors normally distributed?


**CHAPTER 13**
**COMPARING SYSTEMS USING SAMPLE DATA**

*Statistics are like alienists — they will testify for either side.*

— Fiorello La Guardia

The English words *sample* and *example* both originated from an Old French word *essample*. Although the two
words are now distinct, it is important to remember their common root. A sample is only an example. One
example is often not enough to prove a theory. Similarly, one sample is often not enough to make a definite
statement about all systems. Yet this distinction is often forgotten. We measure two systems on just 5 or 10
workloads and then declare one system definitely better than the other. The purpose of this chapter is to
reinforce the distinction and to discuss how to use sample data to compare two or more systems.

The basic idea is that a definite statement cannot be made about the characteristics of all systems, but a probabilistic statement about the range in which the characteristics of most systems would lie can be made. The concept of confidence interval introduced in this chapter is one of the fundamental concepts that every performance analyst needs to understand well. In the remainder of this book, most conclusions drawn from samples are stated in terms of confidence intervals.

**13.1 SAMPLE VERSUS POPULATION**

Suppose we write a computer program to generate several million random numbers with a given property, for instance, mean μ and standard deviation σ. We now put these numbers in an urn and draw a sample of *n* observations.

Suppose the sample {*x*1, *x*2, . . . , *xn*} has a sample mean x̄. The sample mean x̄ is likely to be different from the population mean μ. To distinguish between the two, x̄ is called the sample mean and μ is called the population mean. The word *population* denotes all the numbers inside the urn.

In most real-world problems, the population characteristics (for example, the population mean) are unknown, and the goal of the analyst is to estimate these characteristics. For example, in our experiment of measuring a program's processor time, the sample mean obtained from a single sample of *n* observations is simply an estimate of the population mean. To determine the population mean exactly, we would need to repeat the experiment infinitely many times, which is clearly impossible.

The population characteristics are called **parameters**, while the sample estimates are called **statistics**. For example, the population mean is a parameter while the sample mean is a statistic. It is necessary to distinguish between the two because the parameters are fixed while the statistics are random variables. For instance, if we draw two samples of size *n* from a normally distributed population with mean μ and standard deviation σ, the sample means x̄1 and x̄2 for the two samples would be different. In fact, we can draw many such samples and plot a distribution for the sample mean. No such distribution is possible for the population mean; it is fixed and can be determined only by considering the entire population. Traditionally, Greek letters such as μ and σ are used to denote parameters, while English letters such as x̄ and *s* are used to denote statistics.
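The distinction can be illustrated with a small simulation (our own sketch; the urn, its size, and the distribution are arbitrary choices): the population mean is a fixed parameter, while the sample mean varies from sample to sample.

```python
import random

random.seed(1)
population = [random.gauss(100, 10) for _ in range(100_000)]  # the numbers in the urn
pop_mean = sum(population) / len(population)                  # fixed parameter

def sample_mean(n):
    """Draw a sample of n observations and return its mean (a statistic)."""
    return sum(random.sample(population, n)) / n

m1, m2 = sample_mean(50), sample_mean(50)
print(m1 != m2)  # True: two samples of the same size give different sample means
```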

**13.2 CONFIDENCE INTERVAL FOR THE MEAN**

Each sample mean is an estimate of the population mean. Given *k* samples, we have *k* estimates, all of them different. The next problem is to get a single estimate of the population mean from these *k* estimates.

In fact, it is not possible to get a perfect estimate of the population mean from any finite number of finite-size samples. The best we can do is to get probabilistic bounds. Thus, we may be able to get two bounds, for instance, *c*1 and *c*2, such that there is a high probability, 1 - α, that the population mean is in the interval (*c*1, *c*2):

Probability{*c*1 ≤ μ ≤ *c*2} = 1 - α

The interval (*c*1, *c*2) is called the **confidence interval** for the population mean, α is called the **significance level**, 100(1 - α) is called the **confidence level**, and 1 - α is called the **confidence coefficient**. Notice that the confidence level is traditionally expressed as a percentage and is typically near 100%, for instance, 90 or 95%, while the significance level α is expressed as a fraction and is typically near zero, for instance, 0.05 or 0.1.

One way to determine a 90% confidence interval would be to use the 5-percentile and 95-percentile of the sample means as the bounds. For example, we could take *k* samples, find the sample mean of each, sort them in increasing order, and take the [1 + 0.05(*k* - 1)]th and [1 + 0.95(*k* - 1)]th elements of the sorted set.

Fortunately, it is not necessary to gather so many samples. It is possible to determine a confidence interval from just one sample. This is because of the central limit theorem, which allows us to determine the distribution of the sample mean. The theorem states that if the observations in a sample {*x*1, *x*2, . . . , *x*n} are independent and come from the same population with mean μ and standard deviation σ, then the sample mean for large samples is approximately normally distributed with mean μ and standard deviation σ/√*n*:

x̄ ~ *N*(μ, σ/√*n*)

The standard deviation of the sample mean is called the **standard error**. Again, the standard error is different from the population standard deviation. If the population standard deviation is σ, the standard error is only σ/√*n*. From this expression, it is easy to see that as the sample size *n* increases, the standard error decreases.

Using the central limit theorem, a 100(1 - α)% confidence interval for the population mean is given by

(x̄ - *z*[1-α/2] *s*/√*n*, x̄ + *z*[1-α/2] *s*/√*n*)

Here, x̄ is the sample mean, *s* is the sample standard deviation, *n* is the sample size, and *z*[1-α/2] is the (1 - α/2)-quantile of a unit normal variate. Since these quantiles are used very frequently, their values are listed in Table A.2 in the Appendix.

**Example 13.1** For the sample of Example 12.4, the mean x̄ = 3.90, the standard deviation *s* = 0.95, and *n* = 32:

A 90% confidence interval for the mean = 3.90 ± (1.645)(0.95)/√32 = (3.62, 4.17)

We can state with 90% confidence that the population mean is between 3.62 and 4.17. The chance of error in this statement is 10%. That is, if we take 100 samples and construct a confidence interval for each sample, as shown in Figure 13.1, in 90 cases the interval would include the population mean and in 10 cases it would not.

Similarly, a 95% confidence interval for the mean = 3.90 ± (1.960)(0.95)/√32 = (3.57, 4.23)

**FIGURE 13.1** Meaning of a confidence interval.
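Example 13.1's interval can be reproduced numerically (our own sketch). Carrying full precision gives (3.62, 4.18), matching the book's (3.62, 4.17) up to rounding of intermediate values:

```python
def z_confidence_interval(xbar, s, n, z):
    """Large-sample 100(1 - alpha)% CI: xbar +/- z * s / sqrt(n)."""
    half = z * s / n ** 0.5
    return (xbar - half, xbar + half)

# Example 13.1: xbar = 3.90, s = 0.95, n = 32; z_0.95 = 1.645 from Table A.2
lo, hi = z_confidence_interval(3.90, 0.95, 32, 1.645)
print(round(lo, 2), round(hi, 2))
```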

The preceding confidence interval applies only for large samples, that is, for samples of size greater than 30. For smaller samples, confidence intervals can be constructed only if the observations come from a normally distributed population. For such samples, the 100(1 - α)% confidence interval is given by

(x̄ - *t*[1-α/2;n-1] *s*/√*n*, x̄ + *t*[1-α/2;n-1] *s*/√*n*)

Here, *t*[1-α/2;n-1] is the (1 - α/2)-quantile of a *t*-variate with *n* - 1 degrees of freedom. These quantiles are listed in Table A.4 in the Appendix. The interval is based on the fact that for samples from a normal population, (x̄ - μ)/(σ/√*n*) has a *N*(0, 1) distribution and (*n* - 1)*s*²/σ² has a chi-square distribution with *n* - 1 degrees of freedom, and therefore, (x̄ - μ)/(*s*/√*n*) has a *t* distribution with *n* - 1 degrees of freedom (see Section 29.16 for a description of the *t* distribution). Figure 13.2 shows a sample *t* density function; the value *t*[1-α/2;n-1] is such that the probability of the random variable being less than -*t*[1-α/2;n-1] is α/2. Similarly, the probability of the random variable being more than *t*[1-α/2;n-1] is α/2. The probability that the variable will lie between -*t*[1-α/2;n-1] and *t*[1-α/2;n-1] is 1 - α.

**FIGURE 13.2** The ratio (x̄ - μ)/(*s*/√*n*) for samples from normal populations follows a *t*(*n* - 1) distribution.


**Example 13.2** Consider the error data of Example 12.5, which was shown to have a normal distribution. The eight error values are -0.04, -0.19, 0.14, -0.09, -0.14, 0.19, 0.04, and 0.09.

The mean of these values is zero and their sample standard deviation is 0.138. The value of *t*[0.95;7] from Table A.4 is 1.895. Thus, the 90% confidence interval for the mean error is

0 ± (1.895)(0.138)/√8 = (-0.092, 0.092)
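The small-sample interval of Example 13.2 can be sketched the same way, supplying the *t* quantile from the table (illustrative code, not the book's):

```python
def t_confidence_interval(xbar, s, n, t):
    """Small-sample 100(1 - alpha)% CI: xbar +/- t * s / sqrt(n)."""
    half = t * s / n ** 0.5
    return (xbar - half, xbar + half)

# Example 13.2: mean error 0, s = 0.138, n = 8; t[0.95;7] = 1.895 from Table A.4
lo, hi = t_confidence_interval(0.0, 0.138, 8, 1.895)
print(round(lo, 3), round(hi, 3))  # (-0.092, 0.092): the interval includes zero
```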

**13.3 TESTING FOR A ZERO MEAN**

A common use of confidence intervals is to check whether a measured value is significantly different from zero. When comparing a random measurement with zero, the statement has to be made probabilistically, that is, at a desired level of confidence. If the value is different from zero at the specified level of confidence, 100(1 - α)%, then the value is said to be significantly different from zero.

The test consists of determining a confidence interval and simply checking if the interval includes zero. The
four possibilities are shown in Figure 13.3, where **CI** is used as an abbreviation for confidence interval. The
CI is shown by a vertical line stretching between the lower and upper confidence limits. The sample mean is
indicated by a small circle. In cases (a) and (b), the confidence interval includes zero, and therefore, the
measured values are not significantly different from zero. In cases (c) and (d), the confidence interval does not
include zero, and therefore, the measured value is significantly different from zero.

**FIGURE 13.3** Testing for a zero mean.

**Example 13.3** The difference in the processor times of two different implementations of the same algorithm was measured on seven similar workloads. The differences are {1.5, 2.6, -1.8, 1.3, -0.5, 1.7, 2.4}. Can we say with 99% confidence that one implementation is superior to the other?

Sample size = *n* = 7

Mean = 7.20/7 = 1.03

Sample variance = (22.84 - 7.20 × 7.20/7)/6 = 2.57

Sample standard deviation = √2.57 = 1.60

100(1 - α) = 99, α = 0.01, 1 - α/2 = 0.995

From Table A.4 in the Appendix, the *t*-value at six degrees of freedom is *t*[0.995;6] = 3.707, and

99% confidence interval = 1.03 ± (3.707)(1.60)/√7 = (-1.21, 3.27)

The confidence interval includes zero. Therefore, we cannot say with 99% confidence that the mean difference is significantly different from zero.
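Working through Example 13.3 from the raw data (our own sketch; carrying full precision gives (-1.22, 3.28), agreeing with the book's rounded (-1.21, 3.27)):

```python
diffs = [1.5, 2.6, -1.8, 1.3, -0.5, 1.7, 2.4]
n = len(diffs)
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # unbiased sample variance
s = var ** 0.5
t = 3.707                                             # t[0.995;6] from Table A.4
half = t * s / n ** 0.5
ci = (mean - half, mean + half)
significant = not (ci[0] <= 0 <= ci[1])               # zero inside CI => not significant
print(round(ci[0], 2), round(ci[1], 2), significant)
```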

The procedure for testing for a zero mean applies equally well to testing for any other value. For example, to test whether the mean is equal to a given value *a*, a confidence interval is constructed as before, and if the interval includes *a*, then the hypothesis that the mean is equal to *a* cannot be rejected at the given level of confidence. The following example illustrates this extension of the test.

**Example 13.4** Consider again the data of Example 13.3. To test whether the difference is equal to 1 at the 99% confidence level, we use the confidence interval determined in that example, (-1.21, 3.27). The interval includes 1. Thus, the hypothesis that the difference is equal to 1 cannot be rejected at this confidence level.

**13.4 COMPARING TWO ALTERNATIVES**

A majority of performance analysis projects require comparing two or more systems and finding the best among them. This is the problem addressed in this section. The discussion, however, is limited to comparing just two systems on very similar workloads. If there are more than two systems or if the workloads are significantly different, the experimental design and analysis techniques discussed later in Part IV of this book should be used.

The statistical procedures to compare two systems are an extension of the test for a zero mean described in Section 13.3. The procedures for paired and unpaired observations are different. These terms and the corresponding procedures are described next.

**13.4.1 Paired Observations**

If we conduct *n* experiments on each of the two systems such that there is a one-to-one correspondence between the *i*th test on system A and the *i*th test on system B, then the observations are called **paired**. For example, if *xi* and *yi* represent the performance on the *i*th workload, the observations would be called paired. If there is no correspondence between the two samples, the observations are called **unpaired**.

The analysis of paired observations is straightforward. The two samples are treated as one sample of *n* pairs. For each pair, the difference in performance is computed, and a confidence interval is constructed for the differences. If the confidence interval includes zero, the systems are not significantly different.

**Example 13.5** Six similar workloads were used on two systems. The observations are {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}. Is one system better than the other?

The performance differences constitute a sample of six observations, {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}. For this sample:

Sample mean = -0.32

Sample variance = 81.62

Sample standard deviation = 9.03

The 0.95-quantile of a *t*-variate with five degrees of freedom is 2.015:

90% confidence interval = -0.32 ± (2.015)(9.03)/√6 = (-7.75, 7.11)

The confidence interval includes zero. Therefore, the two systems are not different.
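The paired analysis of Example 13.5 can be sketched as follows (illustrative code, not the book's):

```python
a = [5.4, 16.6, 0.6, 1.4, 0.6, 7.3]
b = [19.1, 3.5, 3.4, 2.5, 3.6, 1.7]
diffs = [x - y for x, y in zip(a, b)]   # paired differences, one per workload
n = len(diffs)
mean = sum(diffs) / n
s = (sum((d - mean) ** 2 for d in diffs) / (n - 1)) ** 0.5
t = 2.015                               # t[0.95;5] from Table A.4
half = t * s / n ** 0.5
includes_zero = mean - half <= 0 <= mean + half
print(includes_zero)  # True: the systems are not significantly different
```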

**13.4.2 Unpaired Observations**

The analysis of unpaired observations is a bit more complicated than that of paired observations. Suppose we have two samples of size *n*a and *n*b for alternatives A and B, respectively. The observations are unpaired in the sense that there is no correspondence between the *i*th observations in the two samples. Determining the confidence interval for the difference in mean performance requires an estimate of the variance and an effective number of degrees of freedom. The procedure is as follows:

**1.** Compute the sample means: x̄a = (1/*n*a) Σ *xi* and x̄b = (1/*n*b) Σ *yi*

**2.** Compute the sample standard deviations *s*a and *s*b.

**3.** Compute the mean difference: x̄a - x̄b

**4.** Compute the standard deviation of the mean difference:

*s* = √(*s*a²/*n*a + *s*b²/*n*b)

**5.** Compute the effective number of degrees of freedom:

ν = (*s*a²/*n*a + *s*b²/*n*b)² / [(*s*a²/*n*a)²/(*n*a + 1) + (*s*b²/*n*b)²/(*n*b + 1)] - 2

**6.** Compute the confidence interval for the mean difference:

(x̄a - x̄b) ± *t*[1-α/2;ν] *s*

Here, *t*[1-α/2;ν] is the (1 - α/2)-quantile of a *t*-variate with ν degrees of freedom.

**7.** If the confidence interval includes zero, the difference is not significant at the 100(1 - α)% confidence level. If the confidence interval does not include zero, then the sign of the mean difference indicates which system is better.
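The steps above can be sketched as follows (our own code; the *t* quantile is supplied from a table lookup, with *t*[0.95;12] = 1.782 an assumed Table A.4 value):

```python
def unpaired_t_test(a, b, t_table):
    """Confidence interval for the difference in means of two unpaired samples.

    Implements steps 1-6 above; t_table maps the rounded effective degrees
    of freedom to the required t quantile.
    """
    def mean(x):
        return sum(x) / len(x)

    def var(x):
        m = mean(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)

    na, nb = len(a), len(b)
    va, vb = var(a) / na, var(b) / nb
    diff = mean(a) - mean(b)                                              # step 3
    s = (va + vb) ** 0.5                                                  # step 4
    dof = (va + vb) ** 2 / (va ** 2 / (na + 1) + vb ** 2 / (nb + 1)) - 2  # step 5
    t = t_table[round(dof)]
    return diff - t * s, diff + t * s                                     # step 6

# Example 13.6 data
a = [5.36, 16.57, 0.62, 1.41, 0.64, 7.26]
b = [19.12, 3.52, 3.38, 2.50, 3.60, 1.74]
lo, hi = unpaired_t_test(a, b, {12: 1.782})
print(lo < 0 < hi)  # the interval includes zero: not significantly different
```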

This procedure is known as a *t*-test.

**Example 13.6** The processor time required to execute a task was measured on two systems. The times on system A were {5.36, 16.57, 0.62, 1.41, 0.64, 7.26}. The times on system B were {19.12, 3.52, 3.38, 2.50, 3.60, 1.74}. Are the two systems significantly different?

For system A: x̄a = 5.31 and *s*a = 6.16.

For system B: x̄b = 5.64 and *s*b = 6.64.

Then

Mean difference = 5.31 - 5.64 = -0.33

Standard deviation of mean difference = √(6.16²/6 + 6.64²/6) = 3.698

Effective number of degrees of freedom ν = 11.921, which rounds to 12

0.95-quantile of a *t*-variate with 12 degrees of freedom = 1.782

90% confidence interval for the difference = -0.33 ± (1.782)(3.698) = (-6.92, 6.26)

The confidence interval includes zero. Therefore, at this confidence level the two systems are not different.


**13.4.3 Approximate Visual Test**

A simpler visual test to compare two unpaired samples is to compute the confidence interval for each alternative separately, using x̄ ± *t*[1-α/2;n-1] *s*/√*n* for each sample.

There are three possibilities as shown graphically in Figure 13.4:

**FIGURE 13.4** Comparing two alternatives.

**1.** The confidence intervals do not overlap. The alternative with higher sample mean is significantly
better.

**2.** The confidence intervals overlap considerably such that the mean of one falls in the interval for the
other. The two alternatives are equal with the desired confidence.

**3.** The confidence intervals overlap slightly such that the mean of either is outside the confidence interval for the other. In this case, no visual conclusion can be drawn. We need to do the *t*-test as described previously.

**Example 13.7** For the data of Example 13.6, the *t*-value at five degrees of freedom and 90% confidence is 2.015.

90% confidence interval for the mean of A = 5.31 ± (2.015)(6.16)/√6 = (0.24, 10.38)

90% confidence interval for the mean of B = 5.64 ± (2.015)(6.64)/√6 = (0.18, 11.10)

The two confidence intervals overlap and the mean of one falls in the confidence interval for the other. Therefore, the two systems are not different at this level of confidence.

**13.5 WHAT CONFIDENCE LEVEL TO USE**

Throughout this book, we use confidence levels of 90 or 95%. This should not lead one to believe that the confidence levels should always be that high. The choice of the confidence level is based on the loss one would sustain if the parameter is outside the range and the gain one would have if the parameter is inside the range. If the loss is high compared to the gain, the confidence levels should be high. If the loss is negligible compared to the gain, a lower confidence level is fine.

Consider, for example, a lottery in which a ticket costs one dollar but pays five million dollars to the winner. Suppose the probability of winning is 10⁻⁷, or one in ten million. To win the lottery with 90% confidence would require one to buy nine million tickets. It is clear that no one would be willing to spend that much to win just five million. For most people, a very low confidence level such as 0.01% would be fine in this case.

The system design decisions that computer systems performance analysts and designers face are not very different from those of the lottery. Although the loss (if the decision turns out wrong) and the gain (if the decision turns out correct) are not as widely different as in the case of the lottery, the risk level is decided essentially in the same manner.

The point is that if you come across a parameter that is significant only at 50% confidence level, do not automatically assume that the confidence is low and it is not worth making a decision based on that parameter. Similarly, even if a parameter is significant at 99% confidence level, it is possible for the decision makers to say that the confidence level is not high enough if the loss due to wrong decision is enormous.

**13.6 HYPOTHESIS TESTING VERSUS CONFIDENCE INTERVALS**

Most books on statistics have a chapter devoted to hypothesis testing. Here, we prefer an alternate method of
doing the same thing. This alternate method makes use of confidence intervals and helps us to easily solve a
few commonly encountered cases of hypothesis testing. Testing for zero mean is the first example in this
book of a problem that can be solved by a hypothesis test as well as by a confidence interval. In practice, a
confidence interval is preferable because it provides more information. A hypothesis test is usually a yes-no
decision. We either reject a hypothesis or accept it. A confidence interval not only gives that answer, it also
provides information about the possible range of values for the parameter. A narrow confidence interval
indicates that the parameter has been estimated with a high degree of precision. A wide confidence interval,
on the other hand, indicates that the precision is not high. Knowing the precision is often more helpful to
decision makers than the simple yes-no answer from a hypothesis test. For example, if the difference *A* − *B* in the mean performance of two systems has a confidence interval of (−100, 100), we can say that there is no difference between the systems, since the interval includes zero. On the other hand, if the interval were (−1, 1), the conclusion would still be "no difference," but we could now say it more loudly. Thus, confidence intervals tell us not only what to say but also how loudly to say it.

Confidence intervals are in general easier to understand and explain to decision makers than hypothesis tests. This is because the width of the interval is in the same units of measurements as the parameter being estimated. The decision makers find it easier to grasp. For example, it is more useful to know that a parameter is in the range, for instance, 100 to 200, than to know that the probability of the parameter being 110 is less than 3%.

**13.7 ONE-SIDED CONFIDENCE INTERVALS
**

In all the tests so far, two-sided confidence intervals have been used. For such intervals, if the confidence level is 100(1 − α)%, there is a 100(α/2)% chance that the difference will be more than the upper confidence limit and a 100(α/2)% chance that it will be less than the lower confidence limit. For example, with a 90% two-sided confidence interval, there is a 5% chance that the difference will be below the lower limit and a 5% chance that it will be above the upper limit.

Sometimes only a one-sided comparison is desired. For example, you may want to test the hypothesis that the mean is greater than a certain value. In this case, a one-sided lower confidence interval for the population mean is desired, and it is given by

(x̄ − *t*[1−α;n−1] *s*/√*n*, ∞)

Notice that the *t*-value is read at 1 − α confidence rather than at 1 − α/2. Similarly, the one-sided upper confidence interval for the population mean is given by

(−∞, x̄ + *t*[1−α;n−1] *s*/√*n*)

For large samples, *z*-values are used instead of *t*-values.

**Example 13.8** Time between crashes was measured for two systems A and B. The means and standard deviations of the times are listed in Table 13.1. To check if System A is more susceptible to failures than System B, we use the procedure for unpaired observations. The mean difference is

x̄_A − x̄_B = 124.10 − 141.47 = −17.37

The standard deviation of the difference is

√(198.20²/972 + 226.11²/153) = 19.35

The effective number of degrees of freedom works out to well over 30.

**TABLE 13.1 Time between System Crashes**

| System | Number | Mean | Standard Deviation |
|---|---|---|---|
| A | 972 | 124.10 | 198.20 |
| B | 153 | 141.47 | 226.11 |

Since the degrees of freedom are more than 30, we use the unit normal quantiles instead of *t* quantiles. Also,
since this is a one-sided test, we use *z*0.90 = 1.28 for a 90% confidence interval, which is

(-17.37, -17.37 + 1.28 x 19.35) = (-17.37, 7.402)

Since the confidence interval includes zero, we cannot conclude that System A is more susceptible to crashes than System B.
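The arithmetic in Example 13.8 can be checked with a short script. This is a sketch, not from the book; `statistics.NormalDist` supplies the *z* quantile (the 1.28 used above).

```python
from math import sqrt
from statistics import NormalDist

# Data from Table 13.1: time between crashes for Systems A and B.
n_a, mean_a, sd_a = 972, 124.10, 198.20
n_b, mean_b, sd_b = 153, 141.47, 226.11

mean_diff = mean_a - mean_b                    # -17.37
sd_diff = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)  # about 19.35

# One-sided 90% interval: read z at 0.90, not at 0.95.
z = NormalDist().inv_cdf(0.90)
upper = mean_diff + z * sd_diff

# The interval (mean_diff, upper) includes zero, so we cannot
# conclude that System A is more susceptible to crashes.
print(round(mean_diff, 2), round(upper, 2))
```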


**13.8 CONFIDENCE INTERVALS FOR PROPORTIONS
**

For categorical variables, the statistical data often consists of probabilities associated with various categories.
Such probabilities are called **proportions.** Estimation of proportions is very similar to estimation of means.
Each sample of *n* observations gives a sample proportion. We need to obtain a confidence interval to get a
bound. Given that *n*1 of *n* observations are of type 1, a confidence interval for the proportion is obtained as
follows:

Sample proportion = *p* = *n*₁/*n*

Confidence interval for proportion = *p* ∓ *z*₁₋α/₂ √(*p*(1 − *p*)/*n*)

Here, *z*₁₋α/₂ is the (1 − α/2)-quantile of a unit normal variate. Its values are listed in Table A.2 in the Appendix.
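As a sketch (the function name `proportion_ci` is mine, not the book's), the interval can be computed with the Python standard library alone:

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(n1, n, confidence=0.90):
    """Normal-approximation confidence interval for a proportion.

    The approximation is reasonable only when n * p >= 10.
    """
    p = n1 / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z at 1 - alpha/2
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half
```

For instance, `proportion_ci(10, 1000)` gives roughly (0.005, 0.015).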

The preceding formula for proportions is based on approximating the binomial distribution (see Section 29.3) by a normal distribution, and it is valid only if *np* ≥ 10. If this condition is not satisfied, the computations are too complex to discuss here; they require using binomial tables. In particular, *t*-values cannot be used.

**Example 13.9** If 10 of 1000 pages printed on a laser printer are illegible, then the proportion of illegible pages is characterized as follows:

Sample proportion = *p* = 10/1000 = 0.01

Since the condition *np* ≥ 10 is satisfied (*np* = 1000 × 0.01 = 10), Equation (13.1) can be used:

Confidence interval = 0.01 ∓ 1.645√((0.01 × 0.99)/1000) = (0.0048, 0.0152)

Thus, at 90% confidence we can state that 0.5 to 1.5% of the pages from the printer are illegible. The chance of error in this statement is 10%. If we want to minimize the chance of error to 5%, the 95% confidence numbers should be used.

The test for zero mean can be easily extended to test proportions, as shown by the following example.

**Example 13.10** A single experiment was repeated on two systems 40 times. System A was found to be superior to System B in 26 repetitions. Can we state with 99% confidence that System A is superior?

Sample proportion = *p* = 26/40 = 0.65

The 99% confidence interval is 0.65 ∓ 2.576√((0.65 × 0.35)/40) = (0.456, 0.844).

The confidence interval includes 0.5 (the point of equality). Therefore, we cannot say with 99% confidence that System A is superior.

Let us repeat the computations at 90% confidence:

0.65 ∓ 1.645√((0.65 × 0.35)/40) = (0.526, 0.774)

The confidence interval does not include 0.5. Therefore, we can say with 90% confidence that System A is superior.
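Both computations in Example 13.10 can be reproduced as follows (a sketch; the variable names are mine):

```python
from math import sqrt
from statistics import NormalDist

n, wins = 40, 26
p = wins / n  # sample proportion, 0.65

def half_width(confidence):
    # Two-sided interval: quantile at 1 - alpha/2.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)

ci99 = (p - half_width(0.99), p + half_width(0.99))
ci90 = (p - half_width(0.90), p + half_width(0.90))

# ci99 contains 0.5 (no conclusion at 99% confidence);
# ci90 does not, so at 90% confidence System A is superior.
print(ci99, ci90)
```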

**13.9 DETERMINING SAMPLE SIZE
**

The confidence level of conclusions drawn from a set of measured data depends upon the size of the data set. The larger the sample, the higher the associated confidence. However, larger samples also require more effort and resources. Thus, the analyst's goal is to find the smallest sample size that will provide the desired confidence. In this section, we present formulas for determining the sample sizes required to achieve a given level of accuracy and confidence. Three cases are considered: single-system measurement, proportion determination, and two-system comparison. In each case, a small set of preliminary measurements is made to estimate the variance, which is then used to determine the sample size required for the given accuracy.

**13.9.1 Sample Size for Determining Mean
**

Suppose we want to estimate the mean performance of a system with an accuracy of ±*r*% and a confidence level of 100(1 − α)%. The number of observations *n* required to achieve this goal can be determined as follows.

We know that for a sample of size *n*, the 100(1 − α)% confidence interval of the population mean is

x̄ ∓ *z* (*s*/√*n*)

The desired accuracy of *r* percent implies that the confidence interval should be (x̄(1 − *r*/100), x̄(1 + *r*/100)). Equating the desired interval with that obtained with *n* observations, we can determine *n*:

*z* (*s*/√*n*) = *r*x̄/100, that is, *n* = ((100 *z* *s*)/(*r* x̄))²


Here, *z* is the unit normal quantile for the desired confidence level.

**Example 13.11** Based on a preliminary test, the sample mean of the response time is 20 seconds, and the sample standard deviation is 5. How many repetitions are needed to get the response time accurate within 1 second at 95% confidence?

Required accuracy *r* = 1 in 20 = 5%. For 95% confidence, *z* = 1.960, so

*n* = ((100 × 1.960 × 5)/(5 × 20))² = (9.8)² = 96.04

A total of 97 observations is needed.
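The sample-size formula is easy to wrap in a helper (a sketch; `sample_size_for_mean` is my own name for it):

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(mean, sd, r_percent, confidence=0.95):
    """Observations needed to estimate the mean within +/- r percent."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((100 * z * sd / (r_percent * mean)) ** 2)

# Example 13.11: mean 20 s, standard deviation 5, 5% accuracy.
print(sample_size_for_mean(20, 5, 5))  # 97
```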

**13.9.2 Sample Size for Determining Proportions
**

This technique can be extended to the determination of proportions. The confidence interval for a proportion was shown in Section 13.8 to be

*p* ∓ *z* √(*p*(1 − *p*)/*n*)

To get a half-width (accuracy) of *r*, set *z*√(*p*(1 − *p*)/*n*) = *r*, that is,

*n* = *z*² *p*(1 − *p*)/*r*²

**Example 13.12** A preliminary measurement of a laser printer showed an illegible print rate of 1 in 10,000. How many pages must be observed to get an accuracy of 1 per million at 95% confidence?

Here *p* = 1/10,000 = 10⁻⁴, *r* = 10⁻⁶, and *z* = 1.960, so *n* = *z*²*p*(1 − *p*)/*r*².

A total of 384.16 million pages must be observed.
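A corresponding helper for proportions (again a sketch, with my own naming):

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(p, r, confidence=0.95):
    """Observations needed to estimate a proportion within +/- r."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / r**2)

# Example 13.12: p = 1/10,000, accuracy 1 per million.
n = sample_size_for_proportion(1e-4, 1e-6)
print(n / 1e6, "million pages")
```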

**13.9.3 Sample Size for Comparing Two Alternatives
**

The requirement of nonoverlapping confidence intervals allows us to compute the sample size required to compare two alternatives as shown by the following example.

**Example 13.13** Two packet-forwarding algorithms were measured. Preliminary measurements showed
that algorithm A loses 0.5% of packets and algorithm B loses 0.6%. How many packets do we need to
observe to state with 95% confidence that algorithm A is better than the algorithm B?

For the two confidence intervals to be nonoverlapping, the upper edge of the lower confidence interval should be below the lower edge of the upper confidence interval:

*p*_A + *z*√(*p*_A(1 − *p*_A)/*n*) ≤ *p*_B − *z*√(*p*_B(1 − *p*_B)/*n*)

Substituting *p*_A = 0.005, *p*_B = 0.006, and *z* = 1.960 and solving for *n* shows that we need to observe 85,000 packets.
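Under the nonoverlapping-interval criterion, the required sample size can be computed by solving the inequality for *n* (a sketch; the function name is mine):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_to_compare(p_a, p_b, confidence=0.95):
    """Packets to observe so the two proportion confidence
    intervals do not overlap; assumes p_a < p_b."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    spread = sqrt(p_a * (1 - p_a)) + sqrt(p_b * (1 - p_b))
    return ceil((z * spread / (p_b - p_a)) ** 2)

# Example 13.13: about 84,000 packets, consistent with the
# 85,000 quoted above after rounding up.
print(sample_size_to_compare(0.005, 0.006))
```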

The formulas presented in this chapter are summarized in Box 13.1.

**EXERCISES
**

**13.1** Given two samples {*x*1, *x*2, . . ., *x*n} and {*y*1, *y*2, . . ., *y*n} from a normal population *N*(μ, 1), what is the distribution of

**a.** Sample means:

**b.** Difference of the means:

**c.** Sum of the means:

**d.** Mean of the means:
**e.** Normalized sample variances: *S*x², *S*y²
**f.** Sum of sample variances: *S*x² + *S*y²
**g.** Ratio of sample variances: *S*x²/*S*y²
**h.** Ratio

**Box 13.1** Confidence Intervals

**13.2** Answer the following for the data of Exercise 12.11:
**a.** What are the 10-percentile and 90-percentile of the sample?
**b.** What is the mean number of disk I/O’s per program?
**c.** What is the 90% confidence interval for the mean?
**d.** What fraction of programs make less than or equal to 25 I/O’s and what is the 90% confidence
interval for the fraction?

**e.** What is the one-sided 90% confidence interval for the mean?

**13.3** For the code size data of Table 11.2, find the confidence intervals for the average code sizes on various
processors. Choose any two processors and answer the following:

**a.** At what level of significance can you say that one is better than the other?
**b.** How many workloads would you need to decide the superiority at 90% confidence?

** Note: ** Since the code sizes vary over several orders of magnitude, the arithmetic mean and its confidence
interval are not very useful. Do not make any conclusions from the results of this exercise. This data is
reconsidered in Chapter 21.


**CHAPTER 14
SIMPLE LINEAR REGRESSION MODELS
**

*Statistics is the art of lying by means of figures.*

—Dr. Wilhelm Stekhel

Among the statistical models used by analysts, regression models are the most common. A regression model
allows one to estimate or predict a random variable as a function of several other variables. The estimated
variable is called the **response variable**, and the variables used to predict the response are called **predictor
variables**, **predictors**, or **factors**. Regression analysis assumes that all predictor variables are quantitative so
that arithmetic operations such as addition and multiplication are meaningful.

Most people are familiar with least-squares fitting of straight lines to data. Our objective in discussing
regression models is twofold. First, we want to highlight the mistakes that analysts commonly make in using
such models. Second, the concepts used in regression models, such as confidence intervals for the model
parameters, are applicable to other types of models. In particular, a knowledge of these concepts is required to
understand the analysis of experimental designs discussed in Part IV of this book. Although regression
techniques can be used to develop a variety of linear and nonlinear models, their most common use is for
finding the best linear model. Such models are called **linear regression models**. To simplify the problem,
initially we limit our discussion to the case of a single predictor variable. Because of their simplicity, such
models are called **simple linear regression models**.

**14.1 DEFINITION OF A GOOD MODEL
**

The first issue in developing a regression model is to define what is meant by a good model and a bad model. Figure 14.1 shows three examples of measured data and attempted linear models. The measured data is shown by scattered points while the model is shown by a straight line. Most people would agree that the model in the first two cases looks reasonably close to the data while that for the third one does not appear to be a good model. What is good about the first two models? One possible answer is that the model line in the first two cases is close to more observations than in the third case. Thus, it is obvious that the goodness of the model should be measured by the distance between the observed points and the model line. The next issue, then, is how to measure the distance.

Regression models attempt to minimize the distance measured vertically between the observation point and
the model line (or curve). The motivation for this is as follows. Given any value of the predictor variable *x*,
we can estimate the corresponding response using the linear model by simply reading the *y*-value on the
model line at the given *x*-value. The line segment joining this “predicted point” and the observed point is
vertical since both points have the same *x*-coordinate. The length of this segment is the difference between
the observed response and the predicted response. This difference is called the **residual**, **modeling error**, or simply **error**.
The terms *residual* and *error* are used interchangeably.

**FIGURE 14.1** Good and bad regression models.

Some of the errors are positive because the estimated response is less than the observed response while others
are negative. One obvious requirement would be to have zero overall error, that is, the negative and positive
errors cancel out. Unfortunately, there are many lines that will satisfy this criterion. We need additional
criteria. One such criterion could be to choose the line that minimizes the sum of squares of the errors. This
criterion is called the **least-squares** criterion and is the criterion that is used to define the best model.

A mathematical definition of the least-squares criterion is as follows. Suppose the linear model is

ŷ = *b*₀ + *b*₁*x*

where ŷ is the predicted response when the predictor variable is *x*. The parameters *b*₀ and *b*₁ are fixed **regression parameters** to be determined from the data. Given *n* observation pairs {(*x*₁, *y*₁),..., (*x*n, *y*n)}, the estimated response ŷᵢ for the *i*th observation is

ŷᵢ = *b*₀ + *b*₁*x*ᵢ

The error is

*e*ᵢ = *y*ᵢ − ŷᵢ

The best linear model is given by the regression parameter values that minimize the Sum of Squared Errors (**SSE**):

SSE = Σᵢ *e*ᵢ² = Σᵢ (*y*ᵢ − *b*₀ − *b*₁*x*ᵢ)²

subject to the constraint that the mean error is zero:

(1/*n*) Σᵢ *e*ᵢ = 0
It can be shown that this constrained minimization problem is equivalent to minimizing the variance of errors (see Exercise 14.1).
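The SSE and the zero-mean-error constraint translate directly into code (a sketch; the function names are mine):

```python
def sse(xs, ys, b0, b1):
    """Sum of squared errors of the linear model y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def mean_error(xs, ys, b0, b1):
    """Mean error; zero for the least-squares line."""
    return sum(y - (b0 + b1 * x) for x, y in zip(xs, ys)) / len(xs)

# A model that fits the data exactly has zero SSE and zero mean error.
print(sse([1, 2, 3], [2, 4, 6], 0, 2))  # 0
```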

**14.2 ESTIMATION OF MODEL PARAMETERS
**

As shown later, the regression parameters that give the minimum error variance are

*b*₁ = (Σ*xy* − *n* x̄ȳ)/(Σ*x*² − *n* x̄²)   (14.1)

and

*b*₀ = ȳ − *b*₁x̄   (14.2)

where

x̄ = mean of the values of the predictor variable = (1/*n*) Σ*x*ᵢ

ȳ = mean response = (1/*n*) Σ*y*ᵢ

Before deriving these expressions, let us look at an example that illustrates an application of these formulas.

**Example 14.1** The number of disk I/O's and processor times of seven programs were measured
as {(14, 2), (16, 5), (27, 7), (42, 9), (39, 10), (50, 13), (83, 20)}.

A linear model to predict CPU time as a function of disk I/O's can be developed as follows.

Given the data: *n* = 7, Σ*xy* = 3375, Σ*x* = 271, Σ*x*² = 13,855, Σ*y* = 66, Σ*y*² = 828, x̄ = 38.71, and ȳ = 9.43. Therefore,

*b*₁ = (3375 − 7 × 38.71 × 9.43)/(13,855 − 7 × 38.71²) = 0.2438

*b*₀ = 9.43 − 0.2438 × 38.71 = −0.0083

The desired linear model is

CPU time = −0.0083 + 0.2438(number of disk I/O's)

A scatter plot of the data is shown in Figure 14.2. A straight line with intercept -0.0083 and slope 0.2438 is also shown in the figure. Notice that the line does give an estimate close to the observed values.
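Equations (14.1) and (14.2) can be checked against the data of Example 14.1 (a sketch; `linear_regression` is my own name for the helper):

```python
def linear_regression(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x,
    using Equations (14.1) and (14.2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (sxy - n * mean_x * mean_y) / (sxx - n * mean_x ** 2)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Disk I/O counts and CPU times from Example 14.1.
xs = [14, 16, 27, 42, 39, 50, 83]
ys = [2, 5, 7, 9, 10, 13, 20]
b0, b1 = linear_regression(xs, ys)
print(round(b0, 4), round(b1, 4))  # close to -0.0083 and 0.2438
```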

**FIGURE 14.2** Scatter plot of disk I/O and CPU time data.

**TABLE 14.1 Error Computation for Disk I/O's and CPU Time Data**

| Disk I/O's, *x*ᵢ | CPU Time, *y*ᵢ | Estimate, ŷᵢ | Error, *e*ᵢ | Error Squared, *e*ᵢ² |
|---|---|---|---|---|
| 14 | 2 | 3.4043 | −1.4043 | 1.9721 |
| 16 | 5 | 3.8918 | 1.1082 | 1.2281 |
| 27 | 7 | 6.5731 | 0.4269 | 0.1822 |
| 42 | 9 | 10.2295 | −1.2295 | 1.5116 |
| 39 | 10 | 9.4982 | 0.5018 | 0.2518 |
| 50 | 13 | 12.1795 | 0.8205 | 0.6732 |
| 83 | 20 | 20.2235 | −0.2235 | 0.0500 |
| Σ 271 | 66 | 66.0000 | 0.0000 | 5.8690 |

In Table 14.1, we have listed the CPU time predicted by the model, the measured values, errors,