Tools for Exploring Univariate Data - Lecture Slides | STATS 13, Study notes of Statistics

Material Type: Notes; Professor: Dinov; Class: Introduction to Statistical Methods for Life and Health Sciences; Subject: Statistics; University: University of California - Los Angeles; Term: Fall 2004;

Uploaded on 08/26/2009


STAT 13, UCLA, Ivo Dinov Slide 1

UCLA STAT 13

Introduction to Statistical Methods for the Life and Health Sciences

  • Instructor: Ivo Dinov, Asst. Prof. of Statistics and Neurology
  • Teaching Assistants: Chris Barr & Ming Zheng

University of California, Los Angeles, Fall 2004

http://www.stat.ucla.edu/~dinov/courses_students.html

STAT 13, UCLA, Ivo Dinov Slide 2

Chapter 2: Tools for Exploring Univariate Data

  • Types of variables
  • Presentation of data
  • Simple plots
  • Numerical summaries
  • Repeated and grouped data
  • Qualitative variables

Slide 3 STAT 13, UCLA, Ivo Dinov

TABLE 2.1.1 Data on Male Heart Attack Patients

A subset of the data collected at a hospital is summarized in this table. Each patient has measurements recorded for a number of variables: ID, ejection factor (ventricular output), systolic/diastolic blood pressure, etc.

  • Reading the table
  • Which of the measured variables (age, ejection, etc.) are useful in predicting how long the patient may live?
  • Are there relationships between these predictors?
  • Variability and noise in the observations hide the message of the data.

Slide 4 STAT 13, UCLA, Ivo Dinov

TABLE 2.1.1 Data on Male Heart Attack Patients

 ID  EJEC  SYSVOL  DIAVOL  OCCLU  STEN  TIME  OUTCOME  AGE  SMOKE  BETA  CHOL(a)  SURG
390    72      36     131      0     0   143        0   49      2     2       59     0
279    52      74     155     37    63   143        0   54      2     2       68     1
391    62      52     137     33    47    16        2   56      2     2       52     0
201    50     165     329     33    30   143        0   42      2     2       39     0
202    50      47      95      0   100   143        0   46      2     2       74     1
 69    27     124     170     77    23   143        0   57      2     2       NA     2
310    60      86     215      7    50    40        0   51      2     2       58     0
392    72      37     132     40    10     9        5   56      2     2       75     0
311    60      65     163      0    40   142        0   45      2     2       72     0
393    63      52     140      0    10   142        0   46      2     2       90     0
 70    29     117     164     50     0   142        0   48      2     2       72     0
203    48      69     133      0    27   142        0   54      2     2       NA     0
394    59      54     133     30    13   142        0   39      2     1       NA     0
204    50      67     135     37    63   141        0   49      2     2       86     2
280    53      65     138      0    33   140        0   58      2     1       49     0
 55    17     184     221     57    13     5        1   50      2     2       70     2
 79    37      88     140     37    47   118        5   58      2     2       NA     0
205    45     106     193     33    43   140        0   47      1     1       38     1
206    43      85     150      0    50    23        5   51      2     2       61     0
312    60      59     149      7    37   139        0   43      2     1       56     0
 80    38     103     168     47    43   100        1   55      2     2       62     1
281    57      53     124      0    57   140        0   58      2     1       93     0
207    44      68     121     27    60   139        0   55      2     2       63     1
282    51      53     109      0    77   139        0   41      2     2       45     4
396    63      58     157      0    73   139        0   51      2     2       60     0
208    49      81     157     13    13   139        0   49      2     2       60     0
209    48      58     112      0     0    72        1   56      2     2       57     0
283    58      71     167     27     0   138        0   45      2     1       46     0
210    42      92     159      0     0   139        0   57      2     2       58     0
397    68      50     156      0   100   138        0   51      2     1       NA     0
211    43     146     259     47    33     3        1   56      2     2       70     0
398    67      43     130      0    70   138        0   49      2     2       NA     3
284    52      70     146      0    23   137        0   47      1     2       NA     0
399    63      73     195     27     0   136        0   36      1     1       61     0
285    54      62     133     33    23   137        0   38      2     2       NA     0
 71    37      93     148     47     0   137        0   59      2     2       NA     0
286    51      65     133     43     7   136        0   54      2     2       NA     0
212    42      95     163     40    10   109        3   57      2     2       NA     4
400    66      49     144     10    50    65        1   52      2     2       55     0
287    54      66     145      7    40   136        0   47      2     2       62     0
 81    39     144     237     13    87   136        0   39      2     2       56     3
813    63      52     141      0    47    43        3   48      2     2       NA     0
 68    30     219     314     33    45    76        1   53      1     2       NA     0
288    59      39      94      0     0   135        0   47      1     2       63     0
407    67      39     117      0    73    53        1   57      2     2       62     2

(a) NA = Not Available (missing data code).

Slide 5 STAT 13, UCLA, Ivo Dinov

  • Quantitative variables are measurements and counts.
      - Variables with few repeated values are treated as continuous.
      - Variables with many repeated values are treated as discrete.
  • Qualitative variables (a.k.a. factors or class variables) describe group membership.

Types of variable

Slide 6 STAT 13, UCLA, Ivo Dinov

Types of Variables
  Quantitative (measurements and counts)
      Continuous (few repeated values)
      Discrete (many repeated values)
  Qualitative (define groups)
      Categorical (no idea of order)
      Ordinal (fall in natural order)

Figure 2.1.1 Tree diagram of types of variable.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Distinguishing between types of variable

Slide 7 STAT 13, UCLA, Ivo Dinov

Questions …

  • What is the difference between quantitative and qualitative variables?
  • What is the difference between a discrete variable and a continuous variable?
  • Name two ways in which observations on qualitative variables can be stored on a computer. (strings/indexes)
  • When would you treat a discrete random variable as though it were a continuous random variable?
      - Can you give an example? ($34.45, bill)
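The strings/indexes answer above can be illustrated with a minimal Python sketch. The smoking-status labels are hypothetical and chosen only for illustration; they are not the coding used in Table 2.1.1.

```python
# Hypothetical qualitative observations (labels invented for illustration).
observations = ["never", "current", "former", "never", "current"]

# Option 1: store each observation as a string label (as above).

# Option 2: store an integer index into a table of levels.
levels = sorted(set(observations))                 # distinct group labels
codes = [levels.index(label) for label in observations]

# Both representations carry the same information.
decoded = [levels[code] for code in codes]

print(levels)
print(codes)
```

The indexes are only labels for groups: arithmetic on them (for example, averaging the codes) has no meaning.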

Slide 8 STAT 13, UCLA, Ivo Dinov

Storing and Reporting Numbers

  • Round numbers for presentation.
  • Maintain complete accuracy in numbers to be used in calculations. If you need to round off, this should be the very last operation.

Slide 9 STAT 13, UCLA, Ivo Dinov

Table before simplification

TABLE 2.2.1 Gold Reserves of Gold-Holding IMF Countries

Country         1970     1975     1980     1985    1990
Belgium        42.01    42.17    34.18    34.18    30.
Canada         22.59    21.95    20.98    20.11    14.
France        100.91   100.93    81.85    81.85    81.
Italy          82.48    82.48    66.67    66.67    66.
Japan          15.22    21.11    24.23    24.33    24.
Netherlands    51.06    54.33    43.94    43.94    43.
Switzerland    78.03    83.2     83.28    83.28    83.
U.K.           38.52    21.03    18.84    19.03    18.
U.S.A.        316.34   274.71   264.32   262.65   261.

Units: millions of troy ounces. Source: The World Almanac and Book of Facts.

Slide 10 STAT 13, UCLA, Ivo Dinov

TABLE 2.2.2 Simplified Table of Gold Reserves of IMF Countries

Country       1970   1975   1980   1985   1990   Average
US             320    270    260    260    260       280
Switzerland     78     83     83     83     83        82
France         100    100     82     82     82        89
Italy           82     82     67     67     67        73
Netherlands     51     54     44     44     44        47
Belgium         42     42     34     34     30        37
Japan           15     21     24     24     24        22
UK              39     21     19     19     19        23
Canada          23     22     21     20     15        20

Average         83     78     71     71     70

Units: millions of troy ounces.

Table after simplification
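The rule "round for presentation, but round last" can be sketched in Python using the 1970 column of Table 2.2.1. The helper `round_sig` is an illustrative name, and rounding to two significant figures is an assumption that happens to reproduce the 1970 column of the simplified table.

```python
import math

def round_sig(x, sig=2):
    """Round x to `sig` significant figures (illustrative helper)."""
    if x == 0:
        return 0.0
    digits = sig - 1 - math.floor(math.log10(abs(x)))
    return round(x, digits)

# 1970 gold reserves from Table 2.2.1 (millions of troy ounces), full precision.
reserves_1970 = {
    "Belgium": 42.01, "Canada": 22.59, "France": 100.91, "Italy": 82.48,
    "Japan": 15.22, "Netherlands": 51.06, "Switzerland": 78.03,
    "U.K.": 38.52, "U.S.A.": 316.34,
}

# Calculate with full precision, then round as the very last step.
mean_full = sum(reserves_1970.values()) / len(reserves_1970)
mean_for_display = round_sig(mean_full)

# Rounded values are for presentation only, not for further calculation.
display = {country: round_sig(value) for country, value in reserves_1970.items()}

print(display)
print(mean_for_display)
```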

Slide 11 STAT 13, UCLA, Ivo Dinov

Figure 2.6.3 Percentages of the world's gold production in 1991, drawn as (a) Bar graph, (b) Pie chart, (c) Segmented bar.

Values shown: S. Africa 29%, U.S. 14%, USSR 11%, Austr. 11%, Can. 8%, China 6%, Rest 21%.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Different graphs of the same set of numbers

Slide 12 STAT 13, UCLA, Ivo Dinov

Questions …

  • For what two purposes are tables of numbers presented? (To convey information about trends in the data; detailed analysis.)
  • When should you round numbers, and when should you preserve full accuracy?
  • How should you arrange the numbers you are most interested in comparing? (Arrange numbers you want to compare in columns, not rows. Provide written/verbal summaries/footnotes. Show row/column averages.)
  • Should a table be left to tell its own story?

Slide 13 STAT 13, UCLA, Ivo Dinov

Figure 2.3.1 Dot plot.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Figure 2.3.2 Dot plot showing special features. [Labeled features: cluster, gap, outlier (atypical observation).]

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

The dot plot

Slide 14 STAT 13, UCLA, Ivo Dinov

Figure 2.3.3 Grading of a university course. [Dot plot of marks with letter grades marked along the scale: F, D, C-, C, C+, B-, B, B+, A-, A, A+.]

Example of exploiting gaps and clusters

Slide 15 STAT 13, UCLA, Ivo Dinov

(a) Unbroken scale

scale break

(b) Broken scale

Figure 2.3.4 Dot plot with and without a scale break.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Scale breaks

Slide 16 STAT 13, UCLA, Ivo Dinov

Figure 2.3.5 Forecast of percent growth in GDP for 1990 for some South-East Asian and Pacific countries. [Axis: % growth in GDP.]

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

A labeled dot plot

Slide 17 STAT 13, UCLA, Ivo Dinov

Example of a stem-and-leaf plot

Stem-plot of the 45 observations of the Ejection variable in the Heart Attack data table.

Units: 7 | 2 =

Stem | Leaves
   1 | 7
   2 | 7 9
   3 | 0 7 7 8 9
   4 | 2 2 3 3 4 5 8 8 9
   5 | 0 0 0 1 1 2 2 3 4 4 7 8 9 9
   6 | 0 0 0 2 3 3 3 3 6 7 7 8
   7 | 2 2

Annotation on the slide: the leaves 2 and 4 on stem 5 show the values 52 and 54 and their frequencies.
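The construction of this stem-plot can be sketched in Python. The values are the EJEC column of Table 2.1.1; the function name `stem_and_leaf` is illustrative, and the sketch assumes whole-number data with the tens digit as stem and the units digit as leaf.

```python
from collections import defaultdict

# EJEC column of Table 2.1.1 (45 patients).
ejec = [72, 52, 62, 50, 50, 27, 60, 72, 60, 63, 29, 48, 59, 50, 53,
        17, 37, 45, 43, 60, 38, 57, 44, 51, 63, 49, 48, 58, 42, 68,
        43, 67, 52, 63, 54, 37, 51, 42, 66, 54, 39, 63, 30, 59, 67]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}; stem = tens digit, leaf = units digit."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

plot = stem_and_leaf(ejec)

# Print every stem between the smallest and largest, including empty ones,
# so that gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Sorting before splitting into stems is what gives the plot its data-sorting property: each row lists its leaves in increasing order.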

Slide 18 STAT 13, UCLA, Ivo Dinov

Traffic death-rates data

TABLE 2.3.1 Traffic Death-Rates (per 100,000 Population) for 30 Countries

17.4  Australia        20.1  Austria          19.9  Belgium
12.5  Bulgaria         15.8  Canada           10.1  Czechoslovakia
13.0  Denmark          11.6  Finland          20.0  France
12.0  E. Germany       13.1  W. Germany       21.1  Greece
 5.4  Hong Kong        17.1  Hungary          15.3  Ireland
10.3  Israel           10.4  Japan            26.8  Kuwait
11.3  Netherlands      20.1  New Zealand      10.5  Norway
14.6  Poland           25.6  Portugal         12.6  Singapore
 9.8  Sweden           15.7  Switzerland      18.6  United States
12.1  N. Ireland       12.0  Scotland         10.1  England & Wales

Data for 1983, 1984 or 1985 depending on the country (prior to reunification of Germany). Source: Hutchinson [1987, page 3].

Slide 19 STAT 13, UCLA, Ivo Dinov

FIGURE 2.3.7 Two stem-and-leaf plots for the traffic deaths data

(a) Units: 17 | 4 = 17.4 deaths per 100,000

 5 | 4
 6 |
 7 |
 8 |
 9 | 8
10 | 1 1 3 4 5
11 | 3 6
12 | 0 0 1 5 6
13 | 0 1
14 | 6
15 | 3 7 8
16 |
17 | 1 4
18 | 6
19 | 9
20 | 0 1 1
21 | 1
22 |
23 |
24 |
25 | 6
26 | 8

(b) Round-off and collapse to 12 stems. Units: 1 | 7 = 17 deaths per 100,000

0 | 5
0 |
0 |
1 | 0 0 0 0 0 1 1
1 | 2 2 2 2 3 3 3
1 | 5 5
1 | 6 6 7 7
1 | 9
2 | 0 0 0 0 1
2 |
2 |
2 | 6 7

Slide 20 STAT 13, UCLA, Ivo Dinov

TABLE 2.3.2 Coyote Lengths Data (cm)

Females
 93.0   97.0   92.0  101.6   93.0   84.5  102.5   97.8   91.0   98.0   93.5   91.
 90.2   91.5   80.0   86.4   91.4   83.5   88.0   71.0   81.3   88.5   86.5   90.
 84.0   89.5   84.0   85.0   87.0   88.0   86.5   96.0   87.0   93.5   93.5   90.
 85.0   97.0   86.0   73.

Males
 97.0   95.0   96.0   91.0   95.0   84.5   88.0   96.0   96.0   87.0   95.0  100.
101.0   96.0   93.0   92.5   95.0   98.5   88.0   81.3   91.4   88.9   86.4  101.
 83.8  104.1   88.9   92.0   91.0   90.0   85.0   93.5   78.0  100.5  103.0   91.
105.0   86.0   95.5   86.5   90.5   80.0   80.

Coyotes captured in Nova Scotia, Canada. Data courtesy of Dr Vera Eastwood.

TABLE 2.3.3 Frequency Table for Female Coyote Lengths (body length)

Class Interval   Tally             Frequency   Stem-and-leaf plot
70-75-           ||                    2         7 | 1 4
75-80-                                 0         7 |
80-85-           |||| |                6         8 | 0 1 4 4 4
85-90-           |||| |||| ||         12         8 | 5 5 5 6 6 7 7 7 7 8 8 9
90-95-           |||| |||| |||        13         9 | 0 0 0 0 1 1 2 2 2 3 3 4 4 4
95-100-          ||||                  5         9 | 6 7 7 8 8
100-105-         ||                    2        10 | 2 3
Total                                 40

Slide 21 STAT 13, UCLA, Ivo Dinov

Figure 2.3.8 Histogram of the female coyote-lengths data: (a) Histogram, (b) Stem-and-leaf plot rotated. [Horizontal axis: length (cm).] From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Compare the histogram with TABLE 2.3.3 (Frequency Table for Female Coyote Lengths).

Slide 22 STAT 13, UCLA, Ivo Dinov

Figure 2.3.9 Histograms and density trace of female coyote-lengths data. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Panels (horizontal axis: length (cm), ticks at 70, 80, 90, 100; vertical axis ticks at 0, 4, 8, 12):
(a) Original histogram (interval width = 5)
(b) Change class-interval width (interval width = 3): histogram bin-size change
(c) Same widths, different boundaries (interval width = 5): histogram bin-boundary change
(d) Density trace (window width = 5)
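The effect of bin width and bin boundaries on a histogram can be sketched in Python. Because some raw female lengths are cut off in this copy of Table 2.3.2, the sketch uses the 40 lengths rounded to whole centimetres as read from the stem-and-leaf column of Table 2.3.3, so its counts differ slightly from the Frequency column, which was built from the unrounded data. The function name `bin_counts` and the shifted starting point 67.5 are assumptions made for illustration.

```python
# Female coyote lengths rounded to whole cm, read off the stem-and-leaf
# column of Table 2.3.3.
lengths = [71, 74, 80, 81, 84, 84, 84,
           85, 85, 85, 86, 86, 87, 87, 87, 87, 88, 88, 89,
           90, 90, 90, 90, 91, 91, 92, 92, 92, 93, 93, 94, 94, 94,
           96, 97, 97, 98, 98, 102, 103]

def bin_counts(data, start, width, n_bins):
    """Count observations in the intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

original = bin_counts(lengths, start=70, width=5, n_bins=7)    # boundaries 70, 75, ...
shifted = bin_counts(lengths, start=67.5, width=5, n_bins=8)   # same width, new boundaries
narrower = bin_counts(lengths, start=70, width=3, n_bins=12)   # interval width 3

print(original)
print(shifted)
print(narrower)
```

The three lists describe the same 40 observations, yet the bar heights differ, which is the point made by panels (a) to (c).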

Slide 23 STAT 13, UCLA, Ivo Dinov

Questions …

  • What advantages does a stem-and-leaf plot have over a histogram? (Stem-and-leaf plots retain information on individual values, are quick to produce by hand, and provide a data-sorting mechanism. But histograms are more attractive and more understandable.)
  • The shape of a histogram can be quite drastically altered by choosing different class-interval boundaries. What type of plot does not have this problem? (Density trace.) What other factor affects the shape of a histogram? (Bin size.)
  • What was another reason given for plotting data on a variable, apart from interest in how the data on that variable behaves? (It shows features such as clusters/gaps and outliers, as well as trends.)

Slide 24 STAT 13, UCLA, Ivo Dinov

Interpreting Stem-plots and Histograms

(a) Unimodal   (b) Bimodal   (c) Trimodal
(d) Symmetric   (e) Positively skewed (long upper tail)   (f) Negatively skewed (long lower tail)
(g) Symmetric   (h) Bimodal with gap   (i) Exponential shape

Curve annotations: $e^{-|x|}$, $e^{x}$, and a third, partly illegible expression in $x$.

Slide 25 STAT 13, UCLA, Ivo Dinov

(j) Spike in pattern   (k) Outliers   (l) Truncation plus outlier

Figure 2.3.10 Features to look for in histograms and stem-and-leaf plots. [Labels in the panels mark the spike and the outliers.]

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Interpreting Stem-plots and Histograms

Slide 26 STAT 13, UCLA, Ivo Dinov

Fascinations with histograms: histogram of heights using the actual people

Subjects are university genetics students, females in white and males in dark tops.

Slide 27 STAT 13, UCLA, Ivo Dinov

Questions …

  • What does it mean for a histogram or stem-and-leaf plot to be bimodal? What do we suspect when we see a bimodal plot?
  • What are outliers, and how do they show up in these plots? What should we try to do when we see them?
  • What do we mean by symmetry and positive and negative skewness?
  • What shape do we call exponential?
  • Should we be suspicious of abrupt changes? Why? (Yes! Try to establish the reason; the jump may have to be rectified.)

Slide 28 STAT 13, UCLA, Ivo Dinov

Descriptive statistics from computer programs like STATA

Minitab output

Descriptive Statistics

Variable    N     Mean   Median   TrMean   StDev   SE Mean
age        45   50.133   51.000   50.366   6.092   0.

Variable   Minimum   Maximum       Q1    Q3
age         36.000    59.000   46.500   56.

(StDev = standard deviation; Q1 = lower quartile; Q3 = upper quartile.)

STATA Output

.summarize
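The same kind of summary can be sketched with Python's standard library. The values are the AGE column of Table 2.1.1; this is only an illustration of what each statistic is, not a reproduction of how Minitab or STATA compute their output.

```python
import math
import statistics

# AGE column of Table 2.1.1 (45 patients).
age = [49, 54, 56, 42, 46, 57, 51, 56, 45, 46, 48, 54, 39, 49, 58,
       50, 58, 47, 51, 43, 55, 58, 55, 41, 51, 49, 56, 45, 57, 51,
       56, 49, 47, 36, 38, 59, 54, 57, 52, 47, 39, 48, 53, 47, 57]

n = len(age)
mean = statistics.mean(age)
median = statistics.median(age)
stdev = statistics.stdev(age)        # sample standard deviation (divisor n - 1)
se_mean = stdev / math.sqrt(n)       # standard error of the mean

print(f"N={n} Mean={mean:.3f} Median={median:.3f} StDev={stdev:.3f}")
print(f"SE Mean={se_mean:.3f} Minimum={min(age)} Maximum={max(age)}")
```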

Slide 29 STAT 13, UCLA, Ivo Dinov

  • The sample mean is denoted by $\bar{x}$.

Descriptive statistics …

$$\bar{x} = \text{the sample mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$

Slide 30 STAT 13, UCLA, Ivo Dinov

Mean
(a) (b) (c)
Figure 2.4.1 Mechanical construction representing a dot plot:
(a) shows a balanced rod while (b) and (c) show unbalanced rods.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

The sample mean is where the dot plot balances

Slide 31 STAT 13, UCLA, Ivo Dinov

The sample median

For n observations, $\{x_1, x_2, x_3, \ldots, x_n\}$, suppose we order the observations min-to-max to get $\{x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n)}\}$.

Then the sample median is the $[(n+1)/2]$-st largest observation, $x_{\left(\frac{n+1}{2}\right)}$.

If $\frac{n+1}{2}$ is not a whole number, the median is the average of the two observations on either side.

Slide 32 STAT 13, UCLA, Ivo Dinov

Figure 2.4.2 The mean and the median.
(a) Data symmetric about P: Med = $\bar{x}$ = P.
(b) Two largest points moved to the right: Med stays at P, while $\bar{x}$ moves to the right.

[Grey disks in (b) are the "ghosts" of the points that were moved.] From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Effect of outliers on the mean and median
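The $(n+1)/2$ position rule and the behaviour in Figure 2.4.2 can be sketched in Python. The two small data sets are hypothetical, invented to mimic the figure (symmetric data, then the two largest points moved to the right); the function names are illustrative.

```python
def sample_median(values):
    """Median via the (n + 1) / 2 position rule on the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                  # 1-based position in the ordered list
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]      # observation just below the position
    upper = ordered[int(position)]          # observation just above the position
    return (lower + upper) / 2

def sample_mean(values):
    return sum(values) / len(values)

# Hypothetical data, symmetric about 5 (not taken from the slides).
symmetric = [1, 3, 4, 5, 6, 7, 9]
# The same data with the two largest points moved to the right.
shifted = [1, 3, 4, 5, 6, 17, 29]

print(sample_mean(symmetric), sample_median(symmetric))
print(sample_mean(shifted), sample_median(shifted))
print(sample_median([2, 4, 6, 8]))          # position 2.5: average of 4 and 6
```

Moving the two largest points changes the mean but leaves the median where it was, which is why the median is described as resistant to outliers.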

Slide 33 STAT 13, UCLA, Ivo Dinov

[Cartoon road sign] Welcome to MEANSTOWN: Founded, Area, Altitude, Population, Average 711.
Suggested by a 1977 cartoon in The New Yorker magazine by Dana Fradon.

Beware of inappropriate averaging

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 1999.

Slide 34 STAT 13, UCLA, Ivo Dinov

Questions …

  • How is the sample mean related to the dot plot?
  • If the index (n+1)/2 is not a whole number (e.g., 23.5), how do we obtain the sample median?
  • Why is the sample median usually preferred to the sample mean for skewed data? Why is it preferred for "dirty" data?
  • Under what circumstances may quoting a single center (be it mean or median) not make sense? (Multi-modal data.)
  • What can we say about the sample mean of a qualitative variable? (It is meaningless.)


Slide 35 STAT 13, UCLA, Ivo Dinov

The first quartile (Q1) is the median of all the observations whose position is strictly below the position of the median, and the third quartile (Q3) is the median of those above.

Quartiles

[Diagram: ordered data divided at Q1, the median, and Q3 into sections of 25% each.]

Slide 36 STAT 13, UCLA, Ivo Dinov

The five-number summary = (Min, Q1, Med, Q3, Max)

Five number summary

Slide 37 STAT 13, UCLA, Ivo Dinov

$$\mathrm{IQR} = Q_3 - Q_1$$

Inter-quartile Range
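The quartile definition, the five-number summary, and the inter-quartile range can be sketched together in Python, using the AGE column of Table 2.1.1. The function names are illustrative. The quartiles follow the definition on the Quartiles slide (medians of the observations strictly below and strictly above the median position); statistical packages may use other conventions and give slightly different values.

```python
# AGE column of Table 2.1.1 (45 patients).
age = [49, 54, 56, 42, 46, 57, 51, 56, 45, 46, 48, 54, 39, 49, 58,
       50, 58, 47, 51, 43, 55, 58, 55, 41, 51, 49, 56, 45, 57, 51,
       56, 49, 47, 36, 38, 59, 54, 57, 52, 47, 39, 48, 53, 47, 57]

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """(Min, Q1, Med, Q3, Max), with quartiles as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # positions strictly below the median position
    upper = ordered[n - half:]      # positions strictly above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

minimum, q1, med, q3, maximum = five_number_summary(age)
iqr = q3 - q1

print((minimum, q1, med, q3, maximum))
print(iqr)
```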

Slide 38 STAT 13, UCLA, Ivo Dinov

Figure 2.4.3 Box plot for SYSVOL. [A box plot, with Q1, Median, and Q3 marked, drawn above a dot plot of the same SYSVOL data.]

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Box plot compared to dot plot

Slide 39 STAT 13, UCLA, Ivo Dinov

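Figure 2.4.4 Construction of a box plot. [The data are shown against a scale with a box from Q1 to Q3 and the median (Med) marked inside. On each side, a whisker reaches 1.5 IQR beyond the box and is then pulled back until it hits an observation.]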

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Construction of a box plot
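The construction in Figure 2.4.4 can be sketched in Python for the SYSVOL column of Table 2.1.1. The function name `box_plot_stats` is illustrative, the quartiles use the definition from the Quartiles slide, and listing the points beyond the whiskers separately follows the usual box-plot convention, which is an assumption here since the figure itself is not reproduced.

```python
# SYSVOL column of Table 2.1.1 (45 patients).
sysvol = [36, 74, 52, 165, 47, 124, 86, 37, 65, 52, 117, 69, 54, 67, 65,
          184, 88, 106, 85, 59, 103, 53, 68, 53, 58, 81, 58, 71, 92, 50,
          146, 43, 70, 73, 62, 93, 65, 95, 49, 66, 144, 52, 219, 39, 39]

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def box_plot_stats(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    reach = 1.5 * (q3 - q1)                 # whiskers extend at most 1.5 IQR
    # Pull each whisker back until it hits an observation.
    low_whisker = min(x for x in ordered if x >= q1 - reach)
    high_whisker = max(x for x in ordered if x <= q3 + reach)
    outside = [x for x in ordered if x < low_whisker or x > high_whisker]
    return {"q1": q1, "median": median(ordered), "q3": q3,
            "whiskers": (low_whisker, high_whisker), "outside": outside}

stats = box_plot_stats(sysvol)
print(stats)
```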

Slide 40 STAT 13, UCLA, Ivo Dinov

Stem-and-leaf of strength N = 33
Leaf Unit = 10
(8) 21 55668899

Figure 2.4.5 Three graphs of the breaking-strength data for gear-teeth in positions 4 & 10 (Minitab output). [The other two graphs share the axis label "strength".]

Comparing 3 plots of the same data

Slide 41 STAT 13, UCLA, Ivo Dinov

TABLE 2.5.1 Word Lengths for the First 100 Words on a Randomly Chosen Page

3 2 2 4 4 4 3 9 9 3 6 2 3 2 3 4 6 5 3 4
2 3 4 5 2 9 5 8 3 2 4 5 2 4 1 4 2 5 2 5
3 6 9 6 3 2 3 4 4 4 2 2 4 2 3 7 4 2 6 4
2 5 9 2 3 7 11 2 3 6 4 4 7 6 6 10 4 3 5 7
7 7 5 10 3 2 3 9 4 5 5 4 4 3 5 2 5 2 4 2

Frequency Table

Value u_j        1   2   3   4   5   6   7   8   9  10  11
Frequency f_j    1  22  18  22  13   8   6   1   6   2   1

Slide 42 STAT 13, UCLA, Ivo Dinov

Mean from a frequency table

$$\bar{x} = \frac{1}{n}\,(\text{Sum of all observations}) = \frac{1}{n}\,\text{Sum of (value} \times \text{frequency of occurrence)}$$

Example: {2, 4, 2, 4, 2}

Value   Frequency   Value × Frequency
  2         3               6
  4         2               8
Total       5              14

Mean = 14/5 = 2.8
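The frequency-table form of the mean can be sketched in Python. The function name `mean_from_frequencies` is illustrative; the two tables are the small example {2, 4, 2, 4, 2} and the word-length frequencies of Table 2.5.1.

```python
def mean_from_frequencies(table):
    """table maps each distinct value u_j to its frequency f_j."""
    n = sum(table.values())                                  # number of observations
    total = sum(value * freq for value, freq in table.items())
    return total / n

# The small example: value 2 occurs 3 times, value 4 occurs 2 times.
small = {2: 3, 4: 2}

# Word-length frequencies from Table 2.5.1.
word_lengths = {1: 1, 2: 22, 3: 18, 4: 22, 5: 13, 6: 8,
                7: 6, 8: 1, 9: 6, 10: 2, 11: 1}

print(mean_from_frequencies(small))
print(mean_from_frequencies(word_lengths))

# Same answer as averaging the raw observations directly.
raw = [2, 4, 2, 4, 2]
print(sum(raw) / len(raw))
```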

Slide 43 STAT 13, UCLA, Ivo Dinov

TABLE 2.5. Frequency Table for the Occurrence of Fish Species in Ocean Strata

No. of strata in which   Frequency          Percentage    Cumulative
species occur            (No. of species)   of species    Percentage
(u_j)                    (f_j)              (f_j / n × 100)

 1                       117                35.5           35.
 2                        61                18.5           53.
 3                        37                11.2           65.
 4                        24                 7.3           72.
 5                        23                 7.0           79.
 6                        12                 3.6           83.
 7                        14                 4.2           87.
 8                        10                 3.0           90.
 9                         9                 2.7           93.
10+                       23                 7.0          100.

n = 330 (100%). Source: Haedrich and Merrett [1988]
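The percentage and cumulative-percentage columns can be sketched in Python from the frequencies alone. The variable names are illustrative; the frequencies are those of the fish-species table.

```python
# Number of species found in 1, 2, ..., 9, and 10+ strata.
strata = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10+"]
species = [117, 61, 37, 24, 23, 12, 14, 10, 9, 23]

n = sum(species)

# Percentage of species in each row: f_j / n * 100.
percent = [100 * f / n for f in species]

# Cumulative percentage, computed from running counts so that
# the last row comes out as exactly 100.
cumulative = []
running = 0
for f in species:
    running += f
    cumulative.append(100 * running / n)

for label, f, p, c in zip(strata, species, percent, cumulative):
    print(f"{label:>3} {f:>4} {p:5.1f} {c:6.1f}")
```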

Slide 44 STAT 13, UCLA, Ivo Dinov

Figure 2.5.1 Bar graph for species data. [Horizontal axis: number of strata occupied, 0 to 10+; vertical axis: percentage, 0% to 30%.] From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Slide 45 STAT 13, UCLA, Ivo Dinov

Labeled bar graphs to convey size

Figure 2.6.2 Cost of commercial rents around the world. [Labeled bar graph: gross rents ($ per ft²) by city.] From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

  • Can use bar graphs to relate labels to relative size.
  • Where possible, order items by size.