




Material Type: Notes; Professor: Thistleton; Class: Statistical Methods; Subject: Statistics; University: SUNY Institute of Technology at Utica-Rome; Term: Unknown 1989;





As usual, the first thing we’ll need is some data to consider. Please navigate to your textbook’s companion website and take a moment to download the data set “Miles per Gallon Gasoline Consumption” from
http://college.hmco.com/mathematics/brase/understandable_statistics/9e/resources/datasets/sv/index.html
You should have this data set open in a spreadsheet to play with as you read this lecture.
Also, for convenience the data are ordered from low to high and presented below.
10 13 13 15 18 18 20 20 21 22 23 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 26 27 27 27 27 27 28 28 28 28 29 29 29 30 30 30 31 31 32 32 33 33 33 35 35 35 37 49 52
At this point you should feel comfortable presenting the data in a histogram with several categories. I’ve done this for 5 categories here. The raw class width is (52 - 10)/5 = 8.4, which I round up to a width of 9.
class  lower limit  upper limit  lower boundary  upper boundary  mark  frequency
1      10           18            9.5            18.5            14     6
2      19           27           18.5            27.5            23    26
3      28           36           27.5            36.5            32    20
4      37           45           36.5            45.5            41     1
5      46           54           45.5            54.5            50     2
Sum of frequencies                                                     55
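The class limits, marks, and frequencies in the table can be reproduced with a short sketch. The function name `frequency_table` and the choice to round the raw width up with `math.ceil` are assumptions made for this illustration, not something the textbook prescribes in this form.

```python
import math

mpg = [10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
       24, 24, 24, 24, 24, 24, 24, 24, 24,
       25, 25, 25, 25, 25, 25, 26, 27, 27, 27, 27, 27,
       28, 28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
       32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52]

def frequency_table(data, num_classes):
    """Return (lower limit, upper limit, mark, frequency) for each class."""
    # Raw width is range / number of classes; round it up to a whole number.
    width = math.ceil((max(data) - min(data)) / num_classes)
    rows = []
    lower = min(data)
    for _ in range(num_classes):
        upper = lower + width - 1          # limits are inclusive
        mark = (lower + upper) / 2         # class mark = midpoint of the limits
        freq = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, mark, freq))
        lower = upper + 1
    return rows

for row in frequency_table(mpg, 5):
    print(row)
```

The class boundaries are not computed here; for whole-number data they sit half a unit below each lower limit and half a unit above each upper limit.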
This gives us a picture of the data. Suppose, however, you would like to know the average MPG for cars in this data set. You can look at the table or the graph and get this approximately as somewhere between 23 and 32, but the table and graph don’t allow you to see the original data: we’ve made a trade-off and thrown away information for simplicity.
[Figure: MPG Histogram. Bars at the class marks 14, 23, 32, 41, and 50 with frequencies 6, 26, 20, 1, and 2; the vertical axis runs from 0 to 30.]
A Stem and Leaf plot overcomes this dilemma (if you don’t have thousands or millions of data points). When we construct a “stem and leaf” we try to show the reader each data point. A natural way to “break the data up” is to use the categories of “tens” as a way to sort the data into groups and then show each data point. If we were to write this out as follows:
10 13 13 15 18 18
20 20 21 22 23 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 26 27 27 27 27 27 28 28 28 28 29 29 29
30 30 30 31 31 32 32 33 33 33 35 35 35 37
49
52
Then we are carrying around some extra baggage. Why keep telling your reader over and over that the data points are in the twenties? Here’s a nice idea: tell them you’re going to give them a stem (the tens digit, such as 2 for the twenties) and then the leaves (the ones digits: the 0’s, the 1’s, etc.). For example, write
1| 0 3 3 5 8 8
Instead of
10 13 13 15 18 18
Less space, more efficient. This now gives us both a picture of the data and a way to reconstruct each data point:
1 | 0 3 3 5 8 8
2 | 0 0 1 2 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 6 7 7 7 7 7 8 8 8 8 9 9 9
3 | 0 0 0 1 1 2 2 3 3 3 5 5 5 7
4 | 9
5 | 2
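For readers who would rather let a program do the sorting, here is a minimal sketch of the same construction. The function name `stem_and_leaf` is made up for this example, and it assumes whole-number data with the tens digit as the stem.

```python
from collections import defaultdict

mpg = [10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
       24, 24, 24, 24, 24, 24, 24, 24, 24,
       25, 25, 25, 25, 25, 25, 26, 27, 27, 27, 27, 27,
       28, 28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
       32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52]

def stem_and_leaf(data):
    """Return one text line per stem: tens digit, a bar, then the ones digits."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)     # stem = tens, leaf = ones
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        # .get() keeps empty stems visible without adding them to the dict
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

print("\n".join(stem_and_leaf(mpg)))
```

Stems with no data points are still printed (with no leaves), so gaps in the data stay visible in the picture.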
This is a complete Stem and Leaf plot. You may wish to divide the categories a little more finely, however, since there are so many values in the 20’s. To do this, divide each category into a lower part and an upper part: for example, 20, 21, 22, 23, and 24 form the lower part and 25, 26, 27, 28, and 29 form the upper part. Do this for each category to obtain:
1 (lower)| 0 3 3
1 (upper)| 5 8 8
2 (lower)| 0 0 1 2 3 4 4 4 4 4 4 4 4 4
2 (upper)| 5 5 5 5 5 5 6 7 7 7 7 7 8 8 8 8 9 9 9
3 (lower)| 0 0 0 1 1 2 2 3 3 3
3 (upper)| 5 5 5 7
4 (lower)|
4 (upper)| 9
5 (lower)| 2
5 (upper)|

One last thought: many programs will use simple symbols for lower and upper (or lower, middle, upper, etc.), or even just leave it blank and assume the reader will notice the “split stems”.
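The split-stem version can be sketched the same way. The name `split_stem_and_leaf` and the "(lower)"/"(upper)" labels are choices made for this illustration to match the plot above; other programs label the halves differently.

```python
mpg = [10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
       24, 24, 24, 24, 24, 24, 24, 24, 24,
       25, 25, 25, 25, 25, 25, 26, 27, 27, 27, 27, 27,
       28, 28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
       32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52]

def split_stem_and_leaf(data):
    """Stem-and-leaf lines with each stem split into leaves 0-4 and 5-9."""
    rows = {}
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        half = "lower" if leaf <= 4 else "upper"
        rows.setdefault((stem, half), []).append(leaf)
    lines = []
    for stem in range(min(data) // 10, max(data) // 10 + 1):
        for half in ("lower", "upper"):
            leaves = " ".join(str(leaf) for leaf in rows.get((stem, half), []))
            lines.append(f"{stem} ({half})| {leaves}".rstrip())
    return lines

print("\n".join(split_stem_and_leaf(mpg)))
```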
given text we can write down the frequencies per line in order and find the number such that 6 points fall below this number and 6 fall above. I have the following:
Line of Text                                                   Number of Words
Alas, poor Yorick! I knew him, Horatio: a fellow                9
of infinite jest, of most excellent fancy: he hath              9
borne me on his back a thousand times; and now, how            11
abhorred in my imagination it is! my gorge rims at             10
it. Here hung those lips that I have kissed I know             11
not how oft. Where be your gibes now? your                      9
gambols? your songs? your flashes of merriment,                 7
that were wont to set the table on a roar? Not one             12
now, to mock your own grinning? quite chap-fallen?              9
Now get you to my lady's chamber, and tell                      9
her, let her paint an inch thick, to this favour she must      12
come; make her laugh at that. Prithee, Horatio, tell            9
me one thing.                                                   3
Writing the frequencies from smallest to largest:

Lower half      Middle Value    Upper half
3 7 9 9 9 9     9               9 10 11 11 12 12
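In general, after the data have been sorted from low to high as $x_1 \le x_2 \le \dots \le x_n$, we denote the median of a set of $n$ data points as

$$\text{median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases}$$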
What do you get for the median of the MPG data set? I get the number 27 as the median because we have 55 data points and the number 27 lies at the (55+1)/2 = 28th position. Also, the “Weights of Pro Football Players” data set has 50 data points:
222 222 223 225 225 226 227 229 230 230 230 231 232 233
234 235 235 235 235 236 237 238 238 239 240 240 240 240
240 241 242 242 243 243 244 244 245 245 245 246 247 247
248 250 250 251 252 253 255 255
And so I look in the 25th position to find 240 and in the 26th position to find 240, which I average to obtain 240 as the median. For good measure I’ll print out the Stem and Leaf I obtained using the program SPSS below:
Frequency    Stem &  Leaf

     3.00       22 .  223
     5.00       22 .  55679
     7.00       23 .  0001234
     9.00       23 .  555567889
    12.00       24 .  000001223344
     7.00       24 .  5556778
     5.00       25 .  00123
     2.00       25 .  55
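The rule for the median (take the middle value when the count is odd, average the two middle values when it is even) is short enough to sketch in code. The function name `median` and the variable names are just choices for this example; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
word_counts = [9, 9, 11, 10, 11, 9, 7, 12, 9, 9, 12, 9, 3]

weights = [222, 222, 223, 225, 225, 226, 227, 229, 230, 230, 230, 231, 232, 233,
           234, 235, 235, 235, 235, 236, 237, 238, 238, 239, 240, 240, 240, 240,
           240, 241, 242, 242, 243, 243, 244, 244, 245, 245, 245, 246, 247, 247,
           248, 250, 250, 251, 252, 253, 255, 255]

def median(data):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # positions n/2 and n/2 + 1

print(median(word_counts))   # 13 lines of text, so the 7th value
print(median(weights))       # 50 weights, so the 25th and 26th values averaged
```

This gives 9 for the words-per-line data and 240 for the football weights, matching the values found by hand.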
As noted above, the median tells us where a certain measure of central tendency lies for a data set. Another measure of central tendency which has a nice physical interpretation is the arithmetic mean. This is a useful measure, especially when working with normally distributed data. For the data set above, what is the arithmetic mean?
It is often useful to sketch another type of plot corresponding to your data. We have already seen stem and leaf plots, histograms, etc. One more plot, called a box plot or a box and whiskers plot, is obtained by forming a 5 number summary of the data. We first write down the largest and smallest data points (the max and the min) and also the median. Next we find the median of the bottom half of the data (excluding the median from the whole data set) and of the top half of the data (again excluding the median from the whole data set). These 5 numbers are called quartiles since they divide the data set into 4 subsets. For the “Weights of Pro Football Players” data the quartiles are:
Q0 = minimum =
Q1 = first quartile =
Q2 = median =
Q3 = third quartile =
Q4 = maximum =

Since our data had an even number of data points we just take Q1 as the median of the bottom 25 data points:
222 222 223 225 225 226 227 229 230 230 230 231 232 233 234 235 235 235 235 236 237 238 238 239 240
And Q3 as the median of the top 25 data points:
240 240 240 240 241 242 242 243 243 244 244 245 245 245 246 247 247 248 250 250 251 252 253 255 255
And so our 5-Number summary is

Q0 = minimum = 222
Q1 = first quartile = 232
Q2 = median = 240
Q3 = third quartile = 245
Q4 = maximum = 255
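The procedure just described (split the sorted data at the median, leaving the median itself out of both halves when the count is odd, then take the median of each half) can be sketched as follows. The name `five_number_summary` is an assumption for this example. Be aware that many software packages use slightly different quartile conventions, so their Q1 and Q3 may not agree exactly with this method.

```python
weights = [222, 222, 223, 225, 225, 226, 227, 229, 230, 230, 230, 231, 232, 233,
           234, 235, 235, 235, 235, 236, 237, 238, 238, 239, 240, 240, 240, 240,
           240, 241, 242, 242, 243, 243, 244, 244, 245, 245, 245, 246, 247, 247,
           248, 250, 250, 251, 252, 253, 255, 255]

def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the overall median so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary(weights))
```

For the football weights, Q1 and Q3 come out as 232 and 245, the same values found by hand.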
The median and the interquartile range (IQR, the distance Q3 - Q1 between the third and first quartiles) are often used together for data sets with “heavy tails”. For example, the arithmetic mean (average) of household incomes in a community is usually pulled upward by a relatively small number of high-income households, so it overstates the typical income. In this case we prefer the quartiles approach.
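A small made-up example shows the effect. The incomes below are hypothetical numbers (in thousands of dollars) invented purely for illustration; they are not taken from any data set in the textbook.

```python
import statistics

# Hypothetical household incomes, in thousands of dollars (illustration only).
incomes = [32, 35, 38, 40, 41, 43, 45, 48, 52, 900]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("mean:  ", mean_income)
print("median:", median_income)
```

Nine of the ten households earn between 32 and 52, yet the single very large income drags the mean far above all of them, while the median stays in the middle of the pack.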
Please put on your seatbelts for this one. This is a concept which will be with us all semester.
We’ve seen that a data set (really just a bunch of numbers) can be summarized with a picture (like a histogram or box plot) or a number (like an arithmetic mean or a median), etc. When talking about how spread out a data set is, it is quite typical to talk about a “variance” or a “standard deviation”, especially when a data set is normally distributed. Consider the following toy data set:

1, 2, 3, 4, 10

If we’d like a measure of “spread”, one thing to do is to see “on average” how far away from the mean each data point is. Our mean here is

$$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$

So, the average distance from the mean is

$$\frac{(1-4) + (2-4) + (3-4) + (4-4) + (10-4)}{5} = \frac{-3 - 2 - 1 + 0 + 6}{5} = 0$$

This isn’t a “fluke”, and in fact we can show that the average signed distance from the points of a data set to their mean is always 0. So, what can we do?
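The reason the deviations always cancel takes only one line to see. Writing the data points as $x_1, \dots, x_N$ and their mean as $\mu$ (symbols chosen here for the illustration), we have

$$\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu) = \frac{1}{N}\sum_{i=1}^{N} x_i - \frac{1}{N}\cdot N\mu = \mu - \mu = 0$$

The first term on the right is the mean itself, so subtracting the mean from it leaves nothing, no matter what the data are.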
One common solution (there are others!) is to work around the cancellation of negatives and positives by making all of the numbers have the same sign. We can do this if we square each deviation (multiply each number by itself). We then obtain

$$\frac{(-3)^2 + (-2)^2 + (-1)^2 + 0^2 + 6^2}{5} = \frac{9 + 4 + 1 + 0 + 36}{5} = \frac{50}{5} = 10$$

This is one way to define a variance. We then just take the square root (since we artificially “squared” each number to get what we wanted we are really just “coming back home”) to get

$$\sqrt{10} \approx 3.16$$
This is one way to get the standard deviation. Take a moment to calculate the variance and standard deviation
A little notation: when we have a population (recall our discussion of samples and populations) we denote the mean of that population as $\mu$. This is the Greek letter “mu”, pronounced “mew” (like a kitten). If we have several numbers $x_1, x_2, \dots, x_N$ (say $N$ (upper case) of them) we can define this in a natural way as

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Then the variance, denoted by the Greek letter “sigma-squared” or $\sigma^2$, is defined as

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
This is what we did above, only now we are using letters rather than actual numbers. In order to keep this organized, many people like to put their calculation in a table as follows:
x        x - mu (deviations)    (x - mu)^2
1        1-4 = -3                9
2        2-4 = -2                4
3        3-4 = -1                1
4        4-4 = 0                 0
10       10-4 = 6               36
sums 20                         50

This gives us $\mu = 20/5 = 4$, $\sigma^2 = 50/5 = 10$, and $\sigma = \sqrt{10} \approx 3.16$.
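The table calculation translates almost line for line into code. This is a minimal sketch for the toy data set; the function name `population_variance` is a label chosen for this example.

```python
import math

data = [1, 2, 3, 4, 10]

def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    N = len(xs)
    mu = sum(xs) / N
    squared_deviations = [(x - mu) ** 2 for x in xs]
    return sum(squared_deviations) / N

mu = sum(data) / len(data)
print("x", "x - mu", "(x - mu)^2")
for x in data:
    # One row of the table: the value, its deviation, its squared deviation.
    print(x, x - mu, (x - mu) ** 2)

variance = population_variance(data)
std_dev = math.sqrt(variance)
print("variance:", variance)
print("standard deviation:", std_dev)
```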
One last wrinkle: sometimes we want the standard deviation of a sample rather than a population. In our textbook (and other places) a common thing to do is to use the symbol $s^2$ for a sample variance and $s$ for a sample standard deviation. If we call the number of data points in a sample $n$ (lower case), then for a technical reason we divide (in a sample) by the number of data points minus 1:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$

where $\bar{x}$ is the mean of the sample.
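To see how much the choice of divisor matters, here is a small sketch that treats the toy data set first as a population and then as a sample. The function name `variance` and its `sample` flag are inventions for this example; Python's standard `statistics` module offers `pvariance` (divide by N) and `variance` (divide by n - 1), which are used here only as a cross-check.

```python
import math
import statistics

data = [1, 2, 3, 4, 10]

def variance(xs, sample=False):
    """Sum of squared deviations divided by n - 1 (sample) or N (population)."""
    n = len(xs)
    mean = sum(xs) / n
    total = sum((x - mean) ** 2 for x in xs)
    return total / (n - 1) if sample else total / n

pop_var = variance(data)                   # divide by N = 5
samp_var = variance(data, sample=True)     # divide by n - 1 = 4

print("population variance:", pop_var, " std dev:", math.sqrt(pop_var))
print("sample variance:    ", samp_var, " std dev:", math.sqrt(samp_var))

# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample version is always a little larger than the population version, and the gap shrinks as the number of data points grows.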