









Applied statistics 3rd lecture
Typology: Schemes and Mind Maps










Characteristics of the Mean:
o The arithmetic mean is the most widely used measure of location. It requires at least interval-scale data. Its major characteristics are:
o All values are used.
o It is unique.
o The sum of the deviations from the mean is 0.
o It is calculated by summing the values and dividing them by the number of values.
MD is the mean deviation, f is the class frequency, and M is the class midpoint.
o It is calculated by summing the values and dividing them by the number of values.
o EX: There are 12 automobile manufacturing companies in the United States. Listed below is the number of patents granted by the United States government to each company in a recent year.

Company            Number of Patents Granted    Company        Number of Patents Granted
General Motors     511                          Mazda          210
Nissan             385                          Chrysler       97
DaimlerChrysler    275                          Porsche        50
Toyota             257                          Mitsubishi     36
Honda              249                          Volvo          23
Ford               234                          BMW            13

Is this information a sample or a population? What is the arithmetic mean number of patents granted?
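The patent example can be checked with a few lines of Python. This is only a sketch: the dictionary name `patents` is introduced here for illustration, and the data are treated as a population because the example states that all 12 companies are listed.

```python
# Patents granted to each of the 12 companies (values from the example above).
patents = {
    "General Motors": 511, "Nissan": 385, "DaimlerChrysler": 275,
    "Toyota": 257, "Honda": 249, "Ford": 234,
    "Mazda": 210, "Chrysler": 97, "Porsche": 50,
    "Mitsubishi": 36, "Volvo": 23, "BMW": 13,
}

# Every company is included, so the data form a population: mu = sum(x) / N.
N = len(patents)
total_patents = sum(patents.values())
population_mean = total_patents / N

print(f"N = {N}, sum = {total_patents}, mean = {population_mean}")
```

With these values the sum is 2,340 and the mean is 195 patents per company.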
Population Mean: μ = Σx / N
o For ungrouped data, the sample mean is the sum of all the sample values divided by the number of sample values.
o EX: SunCom is studying the number of minutes used monthly by the clients in a particular cell phone rate plan. A random sample of 12 clients showed the following number of minutes used last month:
90 77 94 89 119 112 91 110 92 100 113 83
What is the arithmetic mean number of minutes used?
o Every set of interval-level and ratio-level data has a mean.
o All the values are included in computing the mean.
o A set of data has a unique mean.
o The mean is affected by unusually large or small data values.
o The arithmetic mean is the only measure of central tendency where the sum of the deviations of each value from the mean is zero.
Sample Mean: x̄ = Σx / n
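The properties listed above can be illustrated with the SunCom sample of monthly minutes. The sketch below computes the sample mean, confirms that the deviations from it sum to zero, and shows how one unusually large value moves the mean. The value 1190 is a made-up outlier used only for illustration; it is not part of the original data.

```python
# Monthly minutes for the 12 sampled SunCom clients (from the example).
minutes = [90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83]

n = len(minutes)
sample_mean = sum(minutes) / n          # x-bar = sum(x) / n

# Deviations of each value from the mean always sum to zero.
deviations = [x - sample_mean for x in minutes]
deviation_sum = sum(deviations)

# Hypothetical outlier: replace the largest value (119) with 1190
# to see how strongly a single extreme value pulls the mean.
with_outlier = [1190 if x == 119 else x for x in minutes]
mean_with_outlier = sum(with_outlier) / n

print(f"sample mean = {sample_mean}")
print(f"sum of deviations = {deviation_sum}")
print(f"mean with hypothetical outlier = {mean_with_outlier}")
```

The sample mean is 97.5 minutes, and the single made-up outlier raises it to 186.75, which no longer describes a typical client.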
Florida 60,000    Massachusetts 40,000    West Virginia 60,
Idaho 40,000      New Jersey 65,000       Wyoming 55,
Answer: A perusal of the salaries reveals that the annual salary of $60,000 appears more often (six times) than any other salary. The mode is, therefore, $60,000.
A skewed distribution occurs when one tail is longer than the other. Skewness defines the asymmetry of a distribution. Unlike the familiar normal distribution with its bell-shaped curve, these distributions are asymmetric. The two halves of the distribution are not mirror images because the data are not distributed equally on both sides of the distribution’s peak.
o The mean, median, and mode are all equal in the normal distribution and other symmetric distributions.
o However, an asymmetric distribution changes the relationship between these measures of central tendency. The mean is sensitive to extreme values. Consequently, the longer tail in an asymmetrical distribution pulls the mean away from the most common values.
o The mean is greater than the median. The mean overestimates the most common values in a positively skewed distribution.
o These distributions tend to occur when there is a lower limit, and most values are relatively close to the lower bound. Values can't be less than this bound but can fall far from the peak on the high end, causing them to skew positively.
o Example: Income and wealth are classic examples of right-skewed distributions. Most people earn a modest amount, but some millionaires and billionaires extend the right tail into very high values. Meanwhile, the left tail cannot extend below zero. This situation creates a positive skew. Consequently, reports frequently refer to median incomes because the mean overestimates the most common values. These data are based on the U.S. household income for 2006. Notice how the mean is greater than the median.
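A small numeric sketch can make the ordering of the mean and the median in a right-skewed data set concrete. The incomes below are made-up values (in thousands of dollars) chosen only to mimic a long right tail; they are not the 2006 U.S. household income data referred to above.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars: most values are
# modest and one very large value stretches the right tail.
incomes = [22, 25, 25, 28, 30, 32, 35, 40, 48, 315]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
mode_income = statistics.mode(incomes)

print(f"mean = {mean_income}, median = {median_income}, mode = {mode_income}")

# In a positively skewed set the mean is pulled toward the long tail.
mean_exceeds_median = mean_income > median_income
```

For these made-up values the mean is 60, the median is 31, and the mode is 25, so the mean overstates what a typical household earns.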
o The mean is less than the median. The mean underestimates the most common values in a negatively skewed distribution.
o Left-skewed distributions occur less frequently than their right-skewed counterparts, but they exist. Frequently, they occur when there is an upper limit that values cannot exceed, and most scores are near that limit. Values can't exceed the cap, but they can extend relatively far from the peak on the lower side, causing a negative skew.
o Example: Age at death is a classic example of asymmetrically distributed data. Most people die at a relatively advanced age, while comparatively few die much earlier. If such data are plotted along a horizontal axis, most of the values fall on the right side and only a few fall on the left side, forming a long left tail. Hence, the distribution is left, or negatively, skewed.
o Dispersion (or spread) describes how widely the data are distributed around a central value or point. It aids in understanding data distribution. Lower dispersion indicates higher precision in a manufacturing process or in data measurements, whereas higher dispersion indicates lower precision.
o A measure of location, such as the mean or the median, only describes the center of the data. It is valuable from that standpoint, but it does not tell us anything about the spread of the data.
o For example, if your nature guide told you that the river ahead averaged 3 feet in depth, would you want to wade across on foot without additional information? Probably not. You would want to know something about the variation in the depth. Is the maximum depth of the river 3.25 feet and the minimum 2.75 feet? If that is the case, you would probably agree to cross. What if you learned the river depth ranged from 0.50 foot to 5.5 feet? Your decision would probably be not to cross. Before making a decision about crossing the river, you want information on both the typical depth and the dispersion in the depth of the river.
o A second reason for studying the dispersion in a set of data is to compare the spread in two or more distributions.
Tucson plant (because a range of 4 is less than a range of 20). Thus, the mean production in the Baton Rouge plant (50 computer monitors) is a more representative measure of location than the mean of 50 computer monitors for the Tucson plant.
2. Mean Deviation: The mean deviation can be measured from any one of the measures of central tendency, though the mean deviation from the mean and from the median are the most commonly used. It is the arithmetic mean of the absolute values of the deviations.
EX: The number of cappuccinos sold at the Starbucks location in the Orange County Airport between 4 and 7 p.m. for a sample of 5 days last year were 20, 40, 50, 60, and 80. Determine the mean deviation for the number of cappuccinos sold.
Solution: The mean is (20 + 40 + 50 + 60 + 80) / 5 = 50.

Number of Cappuccinos Sold Daily    (X − X̅)              Absolute Deviation
20                                  (20 − 50) = −30      30
40                                  (40 − 50) = −10      10
50                                  (50 − 50) = 0        0
60                                  (60 − 50) = 10       10
80                                  (80 − 50) = 30       30
Total                                                    80

The mean deviation is 80 / 5 = 16 cappuccinos per day.
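The cappuccino example can be reproduced as a short sketch. The variable names are chosen here for illustration; the computation takes the mean of the absolute deviations from the sample mean.

```python
# Cappuccinos sold on each of the 5 sampled days (from the example).
sold = [20, 40, 50, 60, 80]

n = len(sold)
mean_sold = sum(sold) / n

# Absolute deviation of each day from the mean, ignoring the sign.
absolute_deviations = [abs(x - mean_sold) for x in sold]

# Mean deviation: MD = sum(|x - mean|) / n
mean_deviation = sum(absolute_deviations) / n

print(f"mean = {mean_sold}, absolute deviations = {absolute_deviations}")
print(f"mean deviation = {mean_deviation}")
```

The result, 16, says that daily sales deviate from the mean of 50 by 16 cappuccinos on average.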
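Mean Deviation: MD = Σ|x − x̄| / n
3. Variance and Standard Deviation: A limitation of the range is that it is based on only two values, the maximum and the minimum; it does not take into consideration all of the values. The variance does. It measures the mean amount by which the values in a population, or sample, vary from their mean. In terms of a definition: the variance is the arithmetic mean of the squared deviations from the mean.
Population Variance: σ² = Σ(x − μ)² / N
Population Standard Deviation: σ = √( Σ(x − μ)² / N )
EX: The number of traffic citations issued during the last five months in Beaufort County, South Carolina, is 38, 26, 13, 41, and 22. What is the population variance?
Solution: The population mean is μ = 140 / 5 = 28.

Number (x)    x − μ    (x − μ)²
38            +10      100
26            −2       4
13            −15      225
41            +13      169
22            −6       36
Total 140     0        534

The population variance is σ² = 534 / 5 = 106.8, and the population standard deviation is σ = √106.8 ≈ 10.33.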
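The traffic citation example can be verified with a minimal sketch. The names `citations`, `mu`, and so on are chosen here for illustration; the divisor is N because the five months are treated as the whole population, as in the example.

```python
import math

# Traffic citations issued in each of the last five months (from the example).
citations = [38, 26, 13, 41, 22]

N = len(citations)
mu = sum(citations) / N                      # population mean

# Squared deviation of each value from the population mean.
squared_deviations = [(x - mu) ** 2 for x in citations]

population_variance = sum(squared_deviations) / N
population_sd = math.sqrt(population_variance)

print(f"mu = {mu}, sum of squared deviations = {sum(squared_deviations)}")
print(f"variance = {population_variance}, standard deviation = {population_sd:.2f}")
```

The squared deviations sum to 534, which gives a variance of 106.8 and a standard deviation of about 10.33 citations.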
4. Sample Variance: The formula for the population mean is μ = Σx/N. We just changed the symbols for the sample mean; that is, x̄ = Σx/n. Unfortunately, the conversion from the population variance to the sample variance is not as direct. It requires a change in the denominator. Instead of substituting n (number in the sample) for N (number in the population), the denominator is n − 1. Thus, the formula for the sample variance is:
Sample Variance: s² = Σ(x − x̄)² / (n − 1)
As the sample size increases, the effect of subtracting 1 from the denominator becomes negligible, and the result approaches the one obtained with the population formula.
EX: The hourly wages for a sample of part-time employees at Home Depot are $12, $20, $16, $18, and $19. What is the sample variance?
Solution: The sample mean is x̄ = 85 / 5 = $17, and the squared deviations (25, 9, 1, 1, and 4) sum to 40. Using the sample variance formula, s² = 40 / (5 − 1) = 10 (in dollars squared).
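The Home Depot wage example can be worked through in code as a sketch. It computes the sample variance by hand with the n − 1 denominator and compares it with `statistics.variance` from the Python standard library, which uses the same denominator.

```python
import statistics

# Hourly wages (in dollars) for the sampled part-time employees.
wages = [12, 20, 16, 18, 19]

n = len(wages)
sample_mean = sum(wages) / n

# Sum of squared deviations from the sample mean.
sum_squared = sum((x - sample_mean) ** 2 for x in wages)

# Sample variance divides by n - 1, not n.
sample_variance = sum_squared / (n - 1)

# Dividing by n instead gives a smaller value; the gap narrows as n grows.
divided_by_n = sum_squared / n

print(f"mean = {sample_mean}, sum of squares = {sum_squared}")
print(f"sample variance = {sample_variance}, dividing by n would give {divided_by_n}")
print(f"statistics.variance agrees: {statistics.variance(wages)}")
```

The sample variance is 10 (dollars squared), whereas dividing by n would give 8.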
o EMPIRICAL RULE: For a symmetrical, bell-shaped frequency distribution, approximately 68% of the observations will lie within plus and minus one standard deviation of the mean; about 95% of the observations will lie within plus and minus two standard deviations of the mean; and practically all (99.7%) will lie within plus and minus three standard deviations of the mean.
o EX: These relationships are portrayed graphically in the chart for a bell-shaped distribution with a mean of 100 and a standard deviation of 10. Applying the Empirical Rule, if a distribution is symmetrical and bell-shaped, practically all of the observations lie between the mean plus and minus three standard deviations. Thus, if x̄ = 100 and s = 10, practically all the observations lie between 100 − 3(10) and 100 + 3(10), or 70 and 130.
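The Empirical Rule intervals for a mean of 100 and a standard deviation of 10 can be tabulated with a short sketch. The comparison with `statistics.NormalDist` assumes the distribution is exactly normal, which is stronger than the "symmetrical, bell-shaped" condition in the rule; it is included only to show where the rounded percentages come from.

```python
from statistics import NormalDist

mean, sd = 100, 10

# Rounded percentages quoted by the Empirical Rule for k = 1, 2, 3.
rule_of_thumb = {1: 0.68, 2: 0.95, 3: 0.997}

# Interval mean +/- k standard deviations for each k.
intervals = {k: (mean - k * sd, mean + k * sd) for k in rule_of_thumb}

# Share of an exactly normal distribution that falls inside each interval.
dist = NormalDist(mu=mean, sigma=sd)
normal_share = {k: dist.cdf(hi) - dist.cdf(lo) for k, (lo, hi) in intervals.items()}

for k, (lo, hi) in intervals.items():
    print(f"within {k} sd: {lo} to {hi}, "
          f"rule says {rule_of_thumb[k]:.1%}, normal gives {normal_share[k]:.2%}")
```

The three intervals are 90 to 110, 80 to 120, and 70 to 130.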
Sample Standard Deviation: s = √( Σ(x − x̄)² / (n − 1) )
3, 3,000 up to 3, 4 3,200 12, Total 180 333,
o Solving for the arithmetic mean using the formula x̄ = ΣfM / n, we get: x̄ = 333,200 / 180 = 1,851.11
We conclude that the mean profit per vehicle is about $1,851.
o To calculate the standard deviation of data grouped into a frequency distribution, we need to adjust formula (3–10) slightly. We weight each of the squared differences by the number of frequencies in each class. The formula is:
Standard Deviation of Grouped Data: s = √( Σf(M − x̄)² / (n − 1) )
o EX: Use the same table from the previous example. Compute the standard deviation of the vehicle profits.
o Solution: Following the same practice used earlier for computing the mean of data grouped into a frequency distribution, f is the class frequency, M the class midpoint, and n the number of observations.

Profit          Frequency (f)   Midpoint (M)   fM        (M − x̄)   (M − x̄)²    f(M − x̄)²
200 up to 600   8               400            3,200     −1,451     2,105,401    16,843,
600 up to 1,    11              800            8,800     −1,051     1,104,601    12,150,
1,000 up to 1,  23              1,200          27,600    −651       423,801      9,747,
1,400 up to 1,  38              1,600          60,800    −251       63,001       2,394,
1,800 up to 2,  45              2,000          90,000    149        22,201       999,
2,200 up to 2,  32              2,400          76,800    549        301,401      9,644,
2,600 up to 3,  19              2,800          53,200    949        900,601      17,111,
3,000 up to 3,  4               3,200          12,800    1,349      1,819,801    7,279,
Total           180                            333,200                           76,169,

o To find the standard deviation:
Step 1: Subtract the mean from the class midpoint. That is, find (M − x̄) = (400 − 1,851 = −1,451) for the first class, (800 − 1,851 = −1,051) for the second class, and so on.
Step 2: Square the difference between the class midpoint and the mean. For the first class, it would be (400 − 1,851)² = 2,105,401, for the second class (800 − 1,851)² = 1,104,601, and so on.
Step 3: Multiply the squared difference between the class midpoint and the mean by the class frequency. For the first class, the value is 8(400 − 1,851)² = 16,843,208; for the second, 11(800 − 1,851)² = 12,150,611, and so on.
Step 4: Sum the f(M − x̄)². The total is 76,169,920.
To find the standard deviation, we insert these values in the formula: s = √( 76,169,920 / (180 − 1) ) ≈ 652.33
The mean and the standard deviation calculated from the data grouped into a frequency distribution are usually close to the values calculated from raw data. The grouped data result in some loss of information. For the vehicle profit example, the mean profit reported in the Excel output on page 64 is $1,843.17 and the standard deviation is $643.63. The respective values estimated from data grouped into a frequency distribution are $1,851.11 and $652.33. The difference in the means is $7.94, or about 0.4%. The standard deviations differ by $8.70, or 1.4%. Based on the percentage difference, the estimates are very close to the actual values.
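The grouped-data mean and standard deviation can be recomputed from the class midpoints and frequencies as a sketch. Only the midpoints and frequencies from the table are used. The code keeps the unrounded mean (1,851.11...) instead of the rounded 1,851 used in the hand calculation, so its sum of squares differs slightly from the hand-computed total, while the standard deviation still rounds to the same value.

```python
import math

# (class midpoint M, class frequency f) for the vehicle profit table.
classes = [
    (400, 8), (800, 11), (1200, 23), (1600, 38),
    (2000, 45), (2400, 32), (2800, 19), (3200, 4),
]

n = sum(f for _, f in classes)                    # number of observations
sum_fM = sum(f * M for M, f in classes)           # sum of f * M

# Grouped mean: x-bar = sum(f * M) / n
grouped_mean = sum_fM / n

# Grouped standard deviation: s = sqrt(sum(f * (M - x-bar)^2) / (n - 1))
weighted_squares = sum(f * (M - grouped_mean) ** 2 for M, f in classes)
grouped_sd = math.sqrt(weighted_squares / (n - 1))

print(f"n = {n}, sum fM = {sum_fM}")
print(f"grouped mean = {grouped_mean:.2f}, grouped sd = {grouped_sd:.2f}")
```

The output reproduces the estimates quoted above: a mean of about $1,851.11 and a standard deviation of about $652.33.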
A stem-and-leaf display is a statistical technique to present a set of data. Each numerical value is divided into two parts. The leading digit(s) becomes the stem and the trailing digit the leaf. The stems are located along the vertical axis, and the leaf values are stacked against each other along the horizontal axis.
For example, Theater of the Republic in Erie, Pennsylvania, books live theater and musical performances. The theater's capacity is 160 seats. Last year, among the forty-five performances, there were eight different plays and twelve different bands. The frequency distribution shows that eighty up to ninety people attended two of the forty-five performances; there were seven performances where ninety up to one hundred people attended. However, is the attendance within this class clustered about 90, spread evenly throughout the class, or clustered near 99? We cannot tell: the frequency of 7 for the second class does not show how the seven values are distributed between 90 and 100.
The advantage of the stem-and-leaf display over a frequency distribution is that the identity of each observation is not lost. In the above example, we would not know the identity of the values in the 90 up to 100 class. To illustrate the construction of a stem-and-leaf display using the number of people attending each performance, suppose the seven observations in the 90 up to 100 class are 96, 94, 93, 94, 95, 96, and 97. The stem value is the leading digit or digits, in this case 9. The leaves are the trailing digits. The stem is placed to the left of a vertical line and the leaf values to the right. The values in the 90 up to 100 class would appear as follows:
9 | 6 4 3 4 5 6 7
It is also customary to sort the values within each stem from smallest to largest. Thus, the second row of the stem-and-leaf display would appear as follows:
9 | 3 4 4 5 6 6 7
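The construction described above can be sketched in a few lines. The function name `stem_and_leaf` is introduced here for illustration, and it assumes whole-number values with a single trailing digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (leading digits) with sorted leaves."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # the last digit is the leaf
        rows[stem].append(leaf)
    return dict(sorted(rows.items()))

# The seven observations in the 90 up to 100 class from the example.
attendance = [96, 94, 93, 94, 95, 96, 97]

display = stem_and_leaf(attendance)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

For the seven attendance values this prints the single sorted row `9 | 3 4 4 5 6 6 7`, so every original observation can still be read off the display.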
o Compute the first and third quartiles. Locate L25 and L75 using:
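A sketch of locating quartiles is given below. It assumes the common location rule Lp = (n + 1)P / 100 with linear interpolation between neighboring ordered values; the rule itself is not shown in the text above, so treat it as an assumption. The data are the SunCom minutes from the sample mean example, reused here only for illustration.

```python
import statistics

def percentile_location(n, p):
    """Assumed location rule: Lp = (n + 1) * P / 100 (1-based position)."""
    return (n + 1) * p / 100

def value_at(sorted_values, location):
    """Interpolate linearly between the ordered values around a location."""
    position = int(location)             # whole part of the 1-based position
    fraction = location - position
    if position < 1:
        return sorted_values[0]
    if position >= len(sorted_values):
        return sorted_values[-1]
    lower = sorted_values[position - 1]
    upper = sorted_values[position]
    return lower + fraction * (upper - lower)

minutes = sorted([90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83])
n = len(minutes)

L25 = percentile_location(n, 25)         # 3.25
L75 = percentile_location(n, 75)         # 9.75
q1 = value_at(minutes, L25)
q3 = value_at(minutes, L75)

print(f"L25 = {L25}, Q1 = {q1}")
print(f"L75 = {L75}, Q3 = {q3}")
```

Under this rule Q1 is 89.25 and Q3 is 111.5 for the 12 values. The default (exclusive) method of `statistics.quantiles(minutes, n=4)` follows the same convention and returns the same quartiles.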
A box plot is a graphical display, based on quartiles, that helps us picture a set of data. To construct a box plot, we need only five statistics: the minimum value, Q1 (the first quartile), the median, Q3 (the third quartile), and the maximum value. An example will help to explain.
EX1: Alexander's Pizza offers free delivery of its pizza within 15 miles. Alex, the owner, wants some information on the time it takes for delivery. How long does a typical delivery take? Within what range of times will most deliveries be completed? For a sample of 20 deliveries, he determined the following information:
Minimum Value = 13 minutes
Q1 = 15 minutes
Median = 18 minutes
Q3 = 22 minutes
Maximum Value = 30 minutes
Develop a box plot for the delivery times. What conclusions can you make about the delivery times?
Solution: The first step in drawing a box plot is to create an appropriate scale along the horizontal axis. Next, we draw a box that starts at Q1 (15 minutes) and ends at Q3 (22 minutes). Inside the box we place a vertical line to represent the median (18 minutes). Finally, we extend horizontal lines from the box out to the minimum value (13 minutes) and the maximum value (30 minutes). These horizontal lines outside of the box are sometimes called "whiskers" because they look a bit like a cat's whiskers.
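The pieces of the box plot for the delivery times can be derived from the five statistics with a short sketch. The dictionary keys and variable names are chosen here for illustration, and the final comparison is only a rough indicator of the direction of skewness, not a formal test.

```python
# Five-number summary for the 20 pizza deliveries (minutes), from the example.
summary = {"min": 13, "q1": 15, "median": 18, "q3": 22, "max": 30}

# The box spans Q1 to Q3; its width is the interquartile range.
iqr = summary["q3"] - summary["q1"]

# Whiskers run from the box out to the minimum and maximum values.
left_whisker = summary["q1"] - summary["min"]
right_whisker = summary["max"] - summary["q3"]

# Position of the median inside the box.
lower_half_box = summary["median"] - summary["q1"]
upper_half_box = summary["q3"] - summary["median"]

# A longer right whisker and a wider upper half of the box both point
# toward positive (right) skewness.
suggests_positive_skew = (right_whisker > left_whisker
                          and upper_half_box > lower_half_box)

print(f"IQR = {iqr} minutes")
print(f"whiskers: left = {left_whisker}, right = {right_whisker}")
print(f"box halves: lower = {lower_half_box}, upper = {upper_half_box}")
print(f"suggests positive skew: {suggests_positive_skew}")
```

The middle 50% of deliveries span 7 minutes, the right whisker (8 minutes) is four times as long as the left one (2 minutes), and the median sits closer to Q1 than to Q3.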
The box plot also shows the interquartile range of delivery times between Q1 and Q3. The interquartile range is 7 minutes and indicates that 50% of the deliveries are between 15 and 22 minutes. The box plot also reveals that the distribution of delivery times is positively skewed. In Chapter 3, we defined skewness as the lack of symmetry in a set of data. How do we know this distribution is positively skewed? In this case, there are actually two pieces of information that suggest this. First, the dashed line to the right of the box from 22 minutes (Q3) to the maximum time of 30 minutes is longer than the dashed line to the left of the box from 15 minutes (Q1) to the minimum value of 13 minutes. To put it another way, the 25% of the data larger than the third quartile is more spread out than the 25% less than the first quartile. A second indication of positive skewness is that the median is not in the center of the box. The distance from the first quartile to the median is smaller than the distance from the median to the third quartile. We know that the number of delivery times between 15 minutes and 18 minutes is the same as the number of delivery times between 18 minutes and 22 minutes.
EX2: Refer to the Applewood Auto Group data. Develop a box plot for the variable age of the buyer. What can we conclude about the distribution of the age of the buyer?
The median age of the purchaser is 46 years, 25% of the purchasers are less than 40 years of age, and 25% are more than 52.75 years of age. Based on the summary information and the box plot, we conclude: