2: Frequency Distributions, Exercises of Statistics

Uploaded on 02/28/2023 by markzck

Page 2.1 (freq.docx; last update 2/7/16)
2: Frequency Distributions
Stem-and-Leaf Plots (Stemplots)
The stem-and-leaf plot (stemplot) is an excellent way to begin an analysis. Consider this small data set:
218 426 53 116 309 504 281 270 246 523
To construct a stemplot, start by drawing the stem. Stem-values represent either the first or the first two
significant digits of each value. As a rule of thumb, we want between 3 and 15 stem-values on the stem.
You are forming a number line, so stem-values must be evenly spaced and no stem-value can be skipped.
Values for our data range from 53 to 523, so a reasonable first approximation for the stem is:
0|
1|
2|
3|
4|
5|
×100
A stem-multiplier is included so the reader can decipher the magnitude of the values. The stem-multiplier
of ×100 on this stem shows that a stem-value of 2 represents about two hundred (and not two or
twenty, etc.). Values between 200 and 299 will be stored next to the “stem bin” of 2.
We plot the next significant digit of each value, here the “tens” place, truncating (not rounding) any
remaining significant digits.* For example, 218 is plotted as:
0|
1|
2|1
3|
4|
5|
×100
The remaining leaves are plotted in rank order:
0|5
1|1
2|1478
3|0
4|2
5|02
×100
Note the value of 53 in the above plot is shown as 0|5 (0 in the hundreds-place and 5 in the tens-place).
* These rules differ from the simplified rules taught in California public schools and are instead based on the original
intent of John W. Tukey in his groundbreaking book Exploratory Data Analysis (1977).
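
The construction steps above can be sketched in a few lines of Python. This is only a minimal illustration, not part of the original handout: the function name `stemplot` and its `stem_unit` parameter are assumptions made for the example, and it handles non-negative integers only.

```python
from collections import defaultdict

def stemplot(values, stem_unit):
    """Return stemplot rows; leaves are truncated, not rounded (hypothetical helper)."""
    leaf_unit = stem_unit // 10
    leaves = defaultdict(list)
    for v in sorted(values):                    # rank order, so leaves come out sorted
        stem = v // stem_unit                   # e.g. 218 // 100 -> 2
        leaf = (v % stem_unit) // leaf_unit     # e.g. (218 % 100) // 10 -> 1; the 8 is dropped
        leaves[stem].append(leaf)
    lo, hi = min(leaves), max(leaves)
    # Print every stem-value from lo to hi so that none is skipped.
    return [f"{s}|{''.join(str(d) for d in leaves[s])}" for s in range(lo, hi + 1)]

data = [218, 426, 53, 116, 309, 504, 281, 270, 246, 523]
rows = stemplot(data, 100)
for row in rows:
    print(row)
print("×100")
```

With the data set above and a stem unit of 100, the rows match the finished plot: 53 lands on stem 0 with leaf 5, and the four values in the 200s give the leaves 1, 4, 7, 8.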
Shape, location, and spread

Shape

I’m going to rotate the plot 90 degrees to display the distribution in a more familiar way.

        8
        7
        4           2
5   1   1   0   2   0
0   1   2   3   4   5   (×100)

This is now just a histogram. Note that the batch of numbers forms a distribution with a shape, location, and spread.

The shape of a distribution is seen as a “skyline silhouette”:

        X
        X
        X           X
X   X   X   X   X   X
0   1   2   3   4   5   (×100)
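
As a rough sketch of how the silhouette follows from the leaf counts, the snippet below stacks one X per observation above each stem-value. The `counts` dictionary is read off the first stemplot in this chapter; the function name `skyline` is an assumption for illustration.

```python
# Number of leaves on each stem-value of the first stemplot (stem -> count).
counts = {0: 1, 1: 1, 2: 4, 3: 1, 4: 1, 5: 2}

def skyline(counts):
    """Return the rows of a text 'skyline', tallest level first (hypothetical helper)."""
    stems = sorted(counts)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):
        # Put an X above a stem only if it has at least `level` observations.
        marks = ["X" if counts[s] >= level else " " for s in stems]
        rows.append(" ".join(marks).rstrip())
    rows.append(" ".join(str(s) for s in stems))   # the axis of stem-values
    return rows

for row in skyline(counts):
    print(row)
```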

Describe the shape narratively. While it is difficult to make reliable statements about the shape of a distribution when the data set is small, you can still get a general impression of whether (A) a mound is present, (B) the data are symmetrical, and (C) any data separate from the rest of the distribution (i.e., outliers). The current stemplot is mound-shaped and is [almost] symmetrical. There are no clear outliers.

Note:

(1) When the data set is this small, keep a soft focus when describing “shape.” If a shift in just a few data points can change your impression of the shape, avoid overstatements.

(2) Identification of outliers depends on context and can be subjective. There is no uniform definition of an “outlier.” See http://www.tufts.edu/~gdallal/out.htm for further remarks (optional for now).

Location

We will simply summarize the central location of a distribution by its median. The median is the value that is greater than or equal to half of the values in the data set. It’s easier to see the median if we stretch out the data in rank order:

53  116  218  246  270  281  309  426  504  523
                      ^
                    Median

The median has a depth of (n + 1) / 2. For the current data, n = 10 and the median has a depth of (10 + 1) / 2 = 5.5. Count in from either the top or bottom of the ordered array to the depth of the median. When n is even, the median falls between two values, as it does here. Under these circumstances, interpolate the median as the average of the two adjacent values. In this case, the median = average(270, 281) = 275.5.
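
The depth rule can be checked with a short sketch. The function name `median_by_depth` is an assumption for this example; it simply averages the values at the positions just below and just above the depth, which are the same position when n is odd.

```python
import math

def median_by_depth(values):
    """Return (depth, median) using the (n + 1) / 2 depth rule (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2                       # 1-based position in the ordered array
    lower = ordered[math.floor(depth) - 1]    # value at or just below the depth
    upper = ordered[math.ceil(depth) - 1]     # value at or just above the depth
    return depth, (lower + upper) / 2

data = [218, 426, 53, 116, 309, 504, 281, 270, 246, 523]
depth, median = median_by_depth(data)
print(depth, median)   # 5.5 275.5
```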

Third stemplot example (split stem-values)

The following pollution levels were found in water samples: {1.4, 1.7, 1.8, 1.9, 2.2, 2.2, 2.3, 2.4, 2.6, 2.6, 2.7, 2.8, 2.9, 3.0, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.4, 3.5, 3.6, 3.7, 3.8}. Here is our first cut at the stemplot:

1|4789
2|223466789
3|000123445678
×1

The above stemplot is too squashed to display the distribution’s shape. Let’s try “splitting” the stem-values so that values between 1.0 and 1.4 are listed on the first stem-value of 1 and values between 1.5 and 1.9 are listed on the second stem-value of 1:

1|4
1|789
2|2234
2|66789
3|00012344
3|5678
×1

Narrative interpretation: This distribution has a tail toward its lower values. This “left tail” is called a “left” or “negative skew.” The median has a depth of (25 + 1) / 2 = 13. Count the leaves from either end of the stemplot and you will see that the median is approximately equal to 2.9. Data spread from 1.4 to 3.8.
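
Splitting stem-values is easy to mimic in code. The sketch below is an assumption-laden illustration (the name `split_stemplot` and its `splits` parameter are not from the handout): it expects values with one decimal place, uses the units digit as the stem and the tenths digit as the leaf, and gives each stem-value `splits` lines.

```python
def split_stemplot(values, splits=2):
    """Return stemplot rows with each stem-value split into `splits` lines (hypothetical helper)."""
    width = 10 // splits                     # leaf digits covered by one line
    bins = {}
    for v in sorted(values):
        tenths = round(v * 10)               # work in whole tenths to avoid float error
        stem, leaf = divmod(tenths, 10)      # 2.6 -> stem 2, leaf 6
        line = stem * splits + leaf // width # running index of the stem line
        bins.setdefault(line, []).append(leaf)
    first, last = min(bins), max(bins)
    # Emit every line between the first and last, even if it holds no leaves.
    return [f"{i // splits}|{''.join(map(str, bins.get(i, [])))}"
            for i in range(first, last + 1)]

pollution = [1.4, 1.7, 1.8, 1.9, 2.2, 2.2, 2.3, 2.4, 2.6, 2.6, 2.7, 2.8, 2.9,
             3.0, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.4, 3.5, 3.6, 3.7, 3.8]
split_rows = split_stemplot(pollution, splits=2)
for row in split_rows:
    print(row)
print("×1")

# The median sits at depth (25 + 1) / 2 = 13, i.e. the 13th ordered value.
print(sorted(pollution)[13 - 1])
```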

Fourth stemplot example (quintuple split)

Sometimes splitting stem-values in two still leaves an unclear picture of shape. Consider these 9 values:

You may be tempted to start the process as follows:

×

This is too spread out to draw out the shape, so you try a stem-multiplier of ×100 with split stem-values. We’ll put values between 0 and 49 on the first 0 and values between 50 and 99 on the second 0:

×

Now try a quintuple-split. This means you will divide the range 0 to 99 into five class intervals, each 20 units in width. The first interval will contain values between 0 and 19, the second interval will contain values between 20 and 39, and so on. The stem will store the hundreds place and the leaves represent the tens place. The ones-place will be truncated. Thus:

× 100

This distribution is relatively symmetrical with no outliers.
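
To make the quintuple-split intervals concrete, here is a small sketch that reports which 20-unit class interval a value falls in and which leaf it would be plotted as. The function name `quintuple_split` is an assumption, and the values passed to it are hypothetical, chosen only to exercise the intervals; they are not the nine values from the example.

```python
def quintuple_split(value, stem_unit=100):
    """Return (stem, leaf, interval) for one value under a quintuple split (hypothetical helper)."""
    width = stem_unit // 5                      # 100 / 5 = 20 units per class interval
    stem = value // stem_unit                   # hundreds place
    leaf = (value % stem_unit) // (stem_unit // 10)   # tens place; ones place truncated
    low = (value // width) * width              # lower bound of the class interval
    return stem, leaf, (low, low + width - 1)

# Hypothetical values, for illustration only.
for v in (7, 37, 99, 104):
    print(v, quintuple_split(v))
```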

Frequency Tables

Frequency table example #

A more traditional way to explore a distribution is in tabular form. It’s easier to show you how to construct a frequency table than to provide formulas. In addition, it is best not to be mechanical in our approach toward statistics.

If you’ve already created a stemplot, the hard work is behind you. Here’s the stemplot for the first illustrative data set in this chapter:

0|5
1|1
2|1478
3|0
4|2
5|02
×100

The stemplot has already sorted the data into class intervals. The class intervals are 100 units in width. The first class interval contains values from 0 to 99, the second class interval contains values from 100 to 199, and so on.

Count the number of observations that fall into each class interval. This is the frequency. Also determine the relative frequency of each count. The relative frequency is just the proportion of all observations that fall in the interval. It doesn’t matter if you report the relative frequency as a proportion or a percentage, as long as it’s labeled clearly.

Class interval   Frequency   Relative frequency
0–99             1           10%
100–199          1           10%
200–299          4           40%
300–399          1           10%
400–499          1           10%
500–599          2           20%
Total            10          100%
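
The counting can also be done by machine. The following is a minimal sketch under the assumption of equal-width class intervals starting at a multiple of the width; the function name `frequency_table` is made up for this example.

```python
def frequency_table(values, width):
    """Return rows of (low, high, frequency, relative frequency) (hypothetical helper)."""
    counts = {}
    for v in values:
        low = (v // width) * width            # lower bound of the class interval
        counts[low] = counts.get(low, 0) + 1
    n = len(values)
    rows = []
    # Walk every interval from the lowest to the highest, including empty ones.
    for low in range(min(counts), max(counts) + width, width):
        freq = counts.get(low, 0)
        rows.append((low, low + width - 1, freq, freq / n))
    return rows

data = [218, 426, 53, 116, 309, 504, 281, 270, 246, 523]
table = frequency_table(data, 100)
for low, high, freq, rel in table:
    print(f"{low}-{high}: {freq} ({rel:.0%})")
print("Total:", sum(row[2] for row in table))
```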

An additional concept worth noting is called cumulative frequency. The cumulative frequency is the frequency up to and including the current interval. This can be reported as a count or as a proportion of the total (cumulative relative frequency), as shown in the table on the next page.
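
A running total is all that cumulative frequency requires. The sketch below uses the frequencies read off the first stemplot in this chapter (one entry per 100-unit class interval); the variable names are assumptions for illustration.

```python
from itertools import accumulate

# Frequencies for the intervals 0-99, 100-199, ..., 500-599 from the first stemplot.
frequencies = [1, 1, 4, 1, 1, 2]
n = sum(frequencies)

cumulative = list(accumulate(frequencies))         # frequency up to and including each interval
cumulative_relative = [c / n for c in cumulative]  # same thing as a proportion of the total

for c, p in zip(cumulative, cumulative_relative):
    print(c, f"{p:.0%}")
```

The last cumulative frequency always equals n, and the last cumulative relative frequency is always 100%, which is a quick check on the arithmetic.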