Stem Leaf Plots - Lecture Notes - Statistical Methods | STA 100, Study notes of Data Analysis & Statistical Methods

State University of New York Polytechnic - Utica-Rome Data Analysis & Statistical Methods

Prof. William Thistleton

Material Type: Notes; Professor: Thistleton; Class: Statistical Methods; Subject: Statistics; University: SUNY Institute of Technology at Utica-Rome; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 08/09/2009

koofers-user-k5v 🇺🇸


Prof. Thistleton STA100 Statistical Methods Lecture 3

Text Sections: Chapter 2 Section 3, Chapter 3 Sections 1 and 3

Stem and Leaf Plots

As usual, the first thing we’ll need is some data to consider. Please navigate to your textbook’s companion website and take a moment to download the data set “Miles per Gallon Gasoline Consumption” from

http://college.hmco.com/mathematics/brase/understandable_statistics/9e/resources/datasets/sv/index.html

You should have this data set open in a spreadsheet to play with as you read this lecture.

Also, for convenience the data are ordered from low to high and presented below.

10 13 13 15 18 18 20 20 21 22 23

24 24 24 24 24 24 24 24 24 25 25

25 25 25 25 26 27 27 27 27 27 28

28 28 28 29 29 29 30 30 30 31 31

32 32 33 33 33 35 35 35 37 49 52

At this point you should feel comfortable presenting the data in a histogram with several categories. I’ve done this for 5 categories here. (The raw class width is (52 - 10)/5 = 8.4, which is rounded up to a class width of 9.)

class   lower limit   upper limit   lower boundary   upper boundary   mark   frequency
1       10            18             9.5             18.5             14      6
2       19            27            18.5             27.5             23     26
3       28            36            27.5             36.5             32     20
4       37            45            36.5             45.5             41      1
5       46            54            45.5             54.5             50      2
Sum of frequencies                                                           55
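The table can be rebuilt from the raw data with a few lines of Python. This is only a sketch: the function name `frequency_table` is made up for illustration, and the rule "take (max - min)/classes and go up to the next whole number" is assumed from the width note above.

```python
import math

# MPG data as listed in the notes (55 values, already sorted).
mpg = [
    10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
    25, 25, 25, 25, 26, 27, 27, 27, 27, 27, 28,
    28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
    32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52,
]

def frequency_table(data, num_classes):
    """Rows of (class, lower limit, upper limit,
    lower boundary, upper boundary, mark, frequency)."""
    lo, hi = min(data), max(data)
    raw_width = (hi - lo) / num_classes    # 8.4 for the MPG data
    width = math.floor(raw_width) + 1      # assumed rule: next whole number up
    rows = []
    for k in range(num_classes):
        lower = lo + k * width             # lower class limit
        upper = lower + width - 1          # upper class limit
        freq = sum(1 for x in data if lower <= x <= upper)
        rows.append((k + 1, lower, upper,
                     lower - 0.5, upper + 0.5,
                     (lower + upper) / 2, freq))
    return rows

rows = frequency_table(mpg, 5)
for row in rows:
    print(row)
```

The class mark is the midpoint of the two class limits, and the boundaries sit half a unit outside the limits so that neighboring classes touch.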

This gives us a picture of the data. Suppose, however, you would like to know the average MPG for cars in this data set. You can look at the table or the graph and estimate it as somewhere between 23 and 32, but the table and graph don’t allow you to see the original data. We’ve made a trade-off and thrown away information for simplicity.
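To see how much information the grouping throws away, one can compare a mean estimated from the table (each class mark weighted by its frequency) with the mean of the raw data. The weighted-mark estimate is a standard technique but is not spelled out in these notes, so treat this as an illustrative sketch.

```python
# Class marks and frequencies from the 5-class table in the notes.
marks = [14, 23, 32, 41, 50]
freqs = [6, 26, 20, 1, 2]

# Estimate from the table alone: every value in a class is treated
# as if it sat at the class mark.
grouped_mean = sum(m * f for m, f in zip(marks, freqs)) / sum(freqs)

# Exact mean from the original 55 data points.
mpg = [
    10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
    25, 25, 25, 25, 26, 27, 27, 27, 27, 27, 28,
    28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
    32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52,
]
exact_mean = sum(mpg) / len(mpg)

print("mean estimated from the table:", round(grouped_mean, 2))
print("mean of the raw data:         ", round(exact_mean, 2))
```

The two numbers are close but not equal, which is the trade-off described above.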

[Figure: MPG Histogram. Bars of height 6, 26, 20, 1, and 2 at class marks 14, 23, 32, 41, and 50; vertical axis runs from 0 to 30.]


A Stem and Leaf plot overcomes this dilemma (if you don’t have thousands or millions of data points). When we construct a “stem and leaf” we try to show the reader each data point. A natural way to “break the data up” is to use the categories of “tens” as a way to sort the data into groups and then show each data point. If we were to write this out as follows:

10 13 13 15 18 18
20 20 21 22 23 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 26 27 27 27 27 27 28 28 28 28 29 29 29
30 30 30 31 31 32 32 33 33 33 35 35 35 37
49
52

Then we are carrying around some extra baggage: why keep telling your reader that the data points are in the twenties over and over? Here’s a nice idea: tell them you’re going to give them a stem (the twenties) and then the leaves (the 0’s and the 1’s, etc.). For example, write

1| 0 3 3 5 8 8

Instead of

10 13 13 15 18 18

Less space, more efficient. This now gives us both a picture of the data and a way to reconstruct each data point:

1 | 0 3 3 5 8 8
2 | 0 0 1 2 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 6 7 7 7 7 7 8 8 8 8 9 9 9
3 | 0 0 0 1 1 2 2 3 3 3 5 5 5 7
4 | 9
5 | 2

This is a complete Stem and Leaf plot. You may wish to divide the categories a little more finely, however, since there are so many 20’s. To do this, divide each category into the lower part and the upper part, for example 20, 21, 22, 23, and 24 as the lower part and 25, 26, 27, 28, and 29 as the upper part. Do this for each category and obtain:
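Both the plain and the split version of the plot can be produced mechanically. The following is a minimal sketch, assuming two-digit data with the tens digit as the stem; the function name `stem_and_leaf` and its `split` flag are hypothetical, and halves that contain no data are simply left out.

```python
from collections import defaultdict

mpg = [
    10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
    25, 25, 25, 25, 26, 27, 27, 27, 27, 27, 28,
    28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
    32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52,
]

def stem_and_leaf(data, split=False):
    """Return the plot as a list of text lines, one per stem.

    With split=True each stem gets a lower line (leaves 0-4)
    and an upper line (leaves 5-9).
    """
    groups = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)          # 27 -> stem 2, leaf 7
        half = 1 if (split and leaf >= 5) else 0
        groups[(stem, half)].append(leaf)
    lines = []
    for key in sorted(groups):
        leaves = " ".join(str(leaf) for leaf in groups[key])
        lines.append(f"{key[0]} | {leaves}")
    return lines

plain = stem_and_leaf(mpg)
finer = stem_and_leaf(mpg, split=True)
print("\n".join(plain))
print()
print("\n".join(finer))
```

Because every leaf is kept, each original data point can still be read back from either version.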

given text we can write down the frequencies per line in order and find the number such that 6 points fall below this number and 6 fall above. I have the following:

Line of Text                                                   Number of Words
Alas, poor Yorick! I knew him, Horatio: a fellow                9
of infinite jest, of most excellent fancy: he hath              9
borne me on his back a thousand times; and now, how            11
abhorred in my imagination it is! my gorge rises at            10
it. Here hung those lips that I have kissed I know             11
not how oft. Where be your gibes now? your                      9
gambols? your songs? your flashes of merriment,                 7
that were wont to set the table on a roar? Not one             12
now, to mock your own grinning? quite chap-fallen?              9
Now get you to my lady's chamber, and tell                      9
her, let her paint an inch thick, to this favour she must      12
come; make her laugh at that. Prithee, Horatio, tell            9
me one thing.                                                   3

Writing the frequencies from smallest to largest:

Lower half:    3 7 9 9 9 9
Middle value:  9
Upper half:    9 10 11 11 12 12

In general, after the data have been sorted from low to high, we define the median of a set of $n$ data points as the data point in location $(n+1)/2$ (for $n$ an odd number). When $n$ is an even number, we take the average of the data points occupying the $n/2$ and the $n/2 + 1$ positions.

What do you get for the median of the MPG data set? I get the number 27 as the median because we have 55 data points and the number 27 lies at the (55+1)/2 = 28th position. Also, the “Weights of Pro Football Players” data set has 50 data points:

222 222 223 225 225 226 227 229 230 230 230 231 232 233

234 235 235 235 235 236 237 238 238 239 240 240 240 240

240 241 242 242 243 243 244 244 245 245 245 246 247 247

248 250 250 251 252 253 255 255

And so I look in the 25th position to find 240 and in the 26th position to find 240, which I average to obtain 240 as the median. For good measure I’ll print out the Stem and Leaf I obtained using the program SPSS below:

Frequency    Stem &  Leaf

  3.00       22 .  223
  5.00       22 .  55679
  7.00       23 .  0001234
  9.00       23 .  555567889
 12.00       24 .  000001223344
  7.00       24 .  5556778
  5.00       25 .  00123
  2.00       25 .  55
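The position rule for the median (location (n+1)/2 for odd n, average of locations n/2 and n/2 + 1 for even n) translates directly into code. Here is a small sketch applied to the three data sets used so far; the helper name `median` and the variable names are chosen for illustration only.

```python
def median(data):
    """Median by the position rule, using 1-based positions as in the notes."""
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]             # position (n+1)/2
    return (xs[n // 2 - 1] + xs[n // 2]) / 2    # positions n/2 and n/2 + 1

# Words per line of the Yorick passage (13 lines).
word_counts = [9, 9, 11, 10, 11, 9, 7, 12, 9, 9, 12, 9, 3]

# Miles per gallon (55 values).
mpg = [
    10, 13, 13, 15, 18, 18, 20, 20, 21, 22, 23,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
    25, 25, 25, 25, 26, 27, 27, 27, 27, 27, 28,
    28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31,
    32, 32, 33, 33, 33, 35, 35, 35, 37, 49, 52,
]

# Weights of pro football players (50 values).
weights = [
    222, 222, 223, 225, 225, 226, 227, 229, 230, 230, 230, 231, 232, 233,
    234, 235, 235, 235, 235, 236, 237, 238, 238, 239, 240, 240, 240, 240,
    240, 241, 242, 242, 243, 243, 244, 244, 245, 245, 245, 246, 247, 247,
    248, 250, 250, 251, 252, 253, 255, 255,
]

print(median(word_counts))   # odd n = 13, 7th position
print(median(mpg))           # odd n = 55, 28th position
print(median(weights))       # even n = 50, average of 25th and 26th
```

The `- 1` in the index arithmetic is there only because Python lists count from 0 while the positions in the notes count from 1.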

Measures of Central Tendency and Location: The 5-Number Summary

As noted above, the median tells us where a certain measure of central tendency lies for a data set. Another measure of central tendency which has a nice physical interpretation is the arithmetic mean. This is a useful measure, especially when working with normally distributed data. For the data set above, what is the arithmetic mean?

It is often useful to sketch another type of plot corresponding to your data. We have already seen stem and leaf plots, histograms, etc. One more plot, called a box plot or a box and whiskers plot, is obtained by forming a 5 number summary of the data. We first write down the largest and smallest data points (the max and the min) and also the median. Next we find the median of the bottom half of the data (excluding the median from the whole data set) and of the top half of the data (again excluding the median from the whole data set). These 5 numbers are called quartiles since they divide the data set into 4 subsets. For the “Weights of Pro Football Players” data the quartiles are:

Q0 = minimum =
Q1 = first quartile =
Q2 = median =
Q3 = third quartile =
Q4 = maximum =

Since our data had an even number of data points we just take Q1 as the median of the bottom 25 data points:

222 222 223 225 225 226 227 229 230 230 230 231 232 233 234 235 235 235 235 236 237 238 238 239 240

And Q3 as the median of the top 25 data points:

240 240 240 240 241 242 242 243 243 244 244 245 245 245 246 247 247 248 250 250 251 252 253 255 255

And so our 5-Number summary is

Q0 = minimum = 222
Q1 = first quartile = 232
Q2 = median = 240
Q3 = third quartile = 245
Q4 = maximum = 255

The interquartile range (IQR). This is the distance occupied by the middle half of your data, or IQR = Q3 - Q1. For the football data this is 245 - 232 = 13.
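The 5-number summary and the IQR can be computed with the "median of each half" recipe described above. This is a sketch under that recipe only; statistical software often uses other quartile conventions and may give slightly different Q1 and Q3. The function names are hypothetical.

```python
def median(xs):
    """Median of an already sorted list."""
    n = len(xs)
    if n % 2 == 1:
        return xs[n // 2]
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n the overall median is left out of both halves.
    upper = xs[half + 1:] if n % 2 == 1 else xs[half:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

weights = [
    222, 222, 223, 225, 225, 226, 227, 229, 230, 230, 230, 231, 232, 233,
    234, 235, 235, 235, 235, 236, 237, 238, 238, 239, 240, 240, 240, 240,
    240, 241, 242, 242, 243, 243, 244, 244, 245, 245, 245, 246, 247, 247,
    248, 250, 250, 251, 252, 253, 255, 255,
]

q0, q1, q2, q3, q4 = five_number_summary(weights)
iqr = q3 - q1
print("5-number summary:", q0, q1, q2, q3, q4)
print("IQR:", iqr)
```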

The median and IQR are often used together for data sets with “heavy tails”. For example, the arithmetic mean (average) of household incomes in a community is usually pulled up to a pretty high number by a relatively small number of high-income individuals. In this case we prefer the quartiles approach.

Measures of Dispersion: The Standard Deviation

Please put on your seatbelts for this one. This is a concept which will be with us all semester.

We’ve seen that a data set (really just a bunch of numbers) can be summarized with a picture (like a histogram or box plot) or a number (like an arithmetic mean or a median), etc. When talking about how spread out a data set is, it is quite typical to talk about a “variance” or a “standard deviation”, especially when a data set is normally distributed. Consider the following toy data set:

If we’d like a measure of “spread”, one thing to do is to see “on average” how far away from the mean each data point is. Our mean here is

So, the average distance from the mean is

This isn’t a “fluke” and in fact we can show that the average signed distance from the data points to the mean is always 0, for any data set. So, what can we do?
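For readers who want to see why, here is a short derivation. It writes $x_1, \dots, x_N$ for the data points and $\mu$ for their mean; these symbols are assumed here for convenience.

```latex
\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)
  = \frac{1}{N}\sum_{i=1}^{N} x_i \;-\; \frac{1}{N}\cdot N\mu
  = \mu - \mu
  = 0
```

The positive and negative deviations always cancel exactly, because the mean is defined as the balance point of the data.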

One common solution (there are others!) is to work around the cancelation of negatives and positives by making all of the numbers have the same sign. We can do this if we square each number (multiply each number by itself). We then obtain

This is one way to define a variance. We then just take the square root (since we artificially “squared” each number to get what we wanted we are really just “coming back home”) to get

This is one way to get the standard deviation. Take a moment to calculate the variance and standard deviation of the numbers 1, 2, 3, 4, 10.

A little notation: when we have a population (recall our discussion of samples and populations) we denote the mean of that population as $\mu$. This is the Greek letter “mu”, pronounced “mew” (like a kitten). If we have several numbers (say $N$ (upper case) of them) we can define this in a natural way as

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Then the variance, denoted by the Greek letter “sigma-squared” or $\sigma^2$, is defined as

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2$$

This is what we did above, only now we are using letters rather than actual numbers. In order to keep this organized, many people like to put their calculation in a table as follows:

         x      x - mu (deviations)    (x - mu)^2
         1      1 - 4 = -3              9
         2      2 - 4 = -2              4
         3      3 - 4 = -1              1
         4      4 - 4 = 0               0
        10      10 - 4 = 6             36
sums    20                             50

This gives us $\sigma^2 = 50/5 = 10$ and $\sigma = \sqrt{10} \approx 3.16$.
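The same table can be reproduced step by step in Python. This is just a sketch of the hand calculation for the numbers 1, 2, 3, 4, 10, with variable names chosen for illustration.

```python
import math

data = [1, 2, 3, 4, 10]
N = len(data)

mu = sum(data) / N                       # population mean
deviations = [x - mu for x in data]      # x - mu for each data point
squared = [d ** 2 for d in deviations]   # squaring removes the signs

variance = sum(squared) / N              # average squared deviation
sigma = math.sqrt(variance)              # back to the original units

# Print one table row per data point, then the summary values.
for x, d, sq in zip(data, deviations, squared):
    print(x, d, sq)
print("sum of deviations:        ", sum(deviations))
print("sum of squared deviations:", sum(squared))
print("variance:", variance)
print("standard deviation:", round(sigma, 2))
```

Note that the deviations add up to zero, which is exactly the cancellation that squaring is meant to work around.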

One last wrinkle: sometimes we want the standard deviation of a sample rather than a population. In our textbook (and other places) a common thing to do is to use the symbol $s^2$ for a sample variance and $s$ for a sample standard deviation. If we call the number of data points in a sample $n$ (lower case) then for a technical reason we divide (in a sample) by the number of data points minus 1, that is, by $n - 1$.
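To see the effect of dividing by n - 1 instead of by the number of data points, the sketch below treats the numbers 1, 2, 3, 4, 10 first as a whole population and then as a sample. Writing `x_bar` for the sample mean is an assumption made here; the cross-check uses Python's standard `statistics` module, whose `pvariance` and `variance` functions compute the population and sample versions.

```python
import math
import statistics

values = [1, 2, 3, 4, 10]
n = len(values)
x_bar = sum(values) / n                          # mean of the data
sum_sq = sum((x - x_bar) ** 2 for x in values)   # sum of squared deviations

pop_var = sum_sq / n          # population variance: divide by n
samp_var = sum_sq / (n - 1)   # sample variance: divide by n - 1
samp_sd = math.sqrt(samp_var)

print("population variance:", pop_var)
print("sample variance:    ", samp_var)
print("sample std dev:     ", round(samp_sd, 2))

# Cross-check against the standard library.
print(statistics.pvariance(values), statistics.variance(values))
```

The sample version is always a little larger, and the gap shrinks as the number of data points grows.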
