Tools for Exploring Univariate Data - Lecture Slides | STATS 13, Study notes of Statistics

Material Type: Notes; Professor: Dinov; Class: Introduction to Statistical Methods for Life and Health Sciences; Subject: Statistics; University: University of California - Los Angeles; Term: Fall 2004;



STAT 13, UCLA, Ivo Dinov Slide 1

UCLA STAT 13

Introduction to Statistical Methods for the Life and Health Sciences

  • Instructor: Ivo Dinov, Asst. Prof. of Statistics and Neurology
  • Teaching Assistants: Chris Barr & Ming Zheng

University of California, Los Angeles, Fall 2004

http://www.stat.ucla.edu/~dinov/courses_students.html

STAT 13, UCLA, Ivo Dinov Slide 2

Chapter 2: Tools for Exploring Univariate Data

  • Types of variables
  • Presentation of data
  • Simple plots
  • Numerical summaries
  • Repeated and grouped data
  • Qualitative variables

Slide 3 STAT 13, UCLA, Ivo Dinov

TABLE 2.1.1 Data on Male Heart Attack Patients

A subset of the data collected at a hospital is summarized in this table. Each patient has measurements recorded for a number of variables: ID, ejection factor (ventricular output), systolic/diastolic volume, etc.

  • Reading the table
  • Which of the measured variables (age, ejection, etc.) are useful in predicting how long the patient may live?
  • Are there relationships between these predictors?
  • Variability and noise in the observations hide the message of the data.

Slide 4 STAT 13, UCLA, Ivo Dinov

TABLE 2.1.1 Data on Male Heart Attack Patients

 ID  EJEC  SYSVOL  DIAVOL  OCCLU  STEN  TIME  OUTCOME  AGE  SMOKE  BETA  CHOL^a  SURG
390    72      36     131      0     0   143        0   49      2     2      59     0
279    52      74     155     37    63   143        0   54      2     2      68     1
391    62      52     137     33    47    16        2   56      2     2      52     0
201    50     165     329     33    30   143        0   42      2     2      39     0
202    50      47      95      0   100   143        0   46      2     2      74     1
 69    27     124     170     77    23   143        0   57      2     2      NA     2
310    60      86     215      7    50    40        0   51      2     2      58     0
392    72      37     132     40    10     9        5   56      2     2      75     0
311    60      65     163      0    40   142        0   45      2     2      72     0
393    63      52     140      0    10   142        0   46      2     2      90     0
 70    29     117     164     50     0   142        0   48      2     2      72     0
203    48      69     133      0    27   142        0   54      2     2      NA     0
394    59      54     133     30    13   142        0   39      2     1      NA     0
204    50      67     135     37    63   141        0   49      2     2      86     2
280    53      65     138      0    33   140        0   58      2     1      49     0
 55    17     184     221     57    13     5        1   50      2     2      70     2
 79    37      88     140     37    47   118        5   58      2     2      NA     0
205    45     106     193     33    43   140        0   47      1     1      38     1
206    43      85     150      0    50    23        5   51      2     2      61     0
312    60      59     149      7    37   139        0   43      2     1      56     0
 80    38     103     168     47    43   100        1   55      2     2      62     1
281    57      53     124      0    57   140        0   58      2     1      93     0
207    44      68     121     27    60   139        0   55      2     2      63     1
282    51      53     109      0    77   139        0   41      2     2      45     4
396    63      58     157      0    73   139        0   51      2     2      60     0
208    49      81     157     13    13   139        0   49      2     2      60     0
209    48      58     112      0     0    72        1   56      2     2      57     0
283    58      71     167     27     0   138        0   45      2     1      46     0
210    42      92     159      0     0   139        0   57      2     2      58     0
397    68      50     156      0   100   138        0   51      2     1      NA     0
211    43     146     259     47    33     3        1   56      2     2      70     0
398    67      43     130      0    70   138        0   49      2     2      NA     3
284    52      70     146      0    23   137        0   47      1     2      NA     0
399    63      73     195     27     0   136        0   36      1     1      61     0
285    54      62     133     33    23   137        0   38      2     2      NA     0
 71    37      93     148     47     0   137        0   59      2     2      NA     0
286    51      65     133     43     7   136        0   54      2     2      NA     0
212    42      95     163     40    10   109        3   57      2     2      NA     4
400    66      49     144     10    50    65        1   52      2     2      55     0
287    54      66     145      7    40   136        0   47      2     2      62     0
 81    39     144     237     13    87   136        0   39      2     2      56     3
813    63      52     141      0    47    43        3   48      2     2      NA     0
 68    30     219     314     33    45    76        1   53      1     2      NA     0
288    59      39      94      0     0   135        0   47      1     2      63     0
407    67      39     117      0    73    53        1   57      2     2      62     2

^a NA = Not Available (missing data code).

Slide 5 STAT 13, UCLA, Ivo Dinov

  • Quantitative variables are measurements and counts
      • Variables with few repeated values are treated as continuous.
      • Variables with many repeated values are treated as discrete.
  • Qualitative variables (a.k.a. factors or class-variables) describe group membership

Types of variable

Slide 6 STAT 13, UCLA, Ivo Dinov

Types of Variables

  • Quantitative (measurements and counts)
      • Continuous (few repeated values)
      • Discrete (many repeated values)
  • Qualitative (define groups)
      • Categorical (no idea of order)
      • Ordinal (fall in natural order)

Figure 2.1.1 Tree diagram of types of variable.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Distinguishing between types of variable

Slide 7 STAT 13, UCLA, Ivo Dinov

Questions …

  • What is the difference between quantitative and qualitative variables?
  • What is the difference between a discrete variable and a continuous variable?
  • Name two ways in which observations on qualitative variables can be stored on a computer. (strings/indexes)
  • When would you treat a discrete random variable as though it were a continuous random variable?
      • Can you give an example? ($34.45, bill)
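The two storage options named in the third question (strings and indexes) can be sketched as follows. This is only an illustration: the labels are hypothetical and are not taken from the course data.

```python
# Hypothetical qualitative observations (labels invented for illustration).
labels = ["smoker", "non-smoker", "non-smoker", "smoker", "non-smoker"]

# Option 1: store every observation as a string.
as_strings = list(labels)

# Option 2: store a lookup table of the distinct levels once,
# plus one small integer index per observation.
levels = sorted(set(labels))
index_of = {level: i for i, level in enumerate(levels)}
as_indexes = [index_of[label] for label in labels]

# The indexes can always be decoded back into the original strings.
decoded = [levels[i] for i in as_indexes]

print(levels)
print(as_indexes)
```

The index form is more compact for long data sets, but it is only meaningful together with its table of levels.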

Slide 8 STAT 13, UCLA, Ivo Dinov

Storing and Reporting Numbers

  • Round numbers for presentation
  • Maintain complete accuracy in numbers to be used in calculations. If you need to round off, this should be the very last operation …
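A small sketch of why rounding should come last. The three readings are made up for illustration; the point is only that rounding each value before summing gives a different answer from rounding the final sum.

```python
# Illustrative readings (not from the course data).
readings = [0.44, 0.44, 0.44]

# Rounding each value to one decimal place BEFORE adding loses 0.04 three times.
rounded_early = round(sum(round(x, 1) for x in readings), 1)

# Keeping full accuracy and rounding only the final result.
rounded_late = round(sum(readings), 1)

print(rounded_early)  # 1.2
print(rounded_late)   # 1.3
```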

Slide 9 STAT 13, UCLA, Ivo Dinov

Table before simplification

TABLE 2.2.1 Gold Reserves of Gold-Holding IMF Countries

Country        1970     1975     1980     1985    1990
Belgium       42.01    42.17    34.18    34.18     30.
Canada        22.59    21.95    20.98    20.11     14.
France       100.91   100.93    81.85    81.85     81.
Italy         82.48    82.48    66.67    66.67     66.
Japan         15.22    21.11    24.23    24.33     24.
Netherlands   51.06    54.33    43.94    43.94     43.
Switzerland   78.03     83.2    83.28    83.28     83.
U.K.          38.52    21.03    18.84    19.03     18.
U.S.A.       316.34   274.71   264.32   262.65    261.

Units: millions of troy ounces. Source: The World Almanac and Book of Facts.

Slide 10 STAT 13, UCLA, Ivo Dinov

TABLE 2.2.2 Simplified Table of Gold Reserves of IMF Countries

Country       1970   1975   1980   1985   1990   Average
US             320    270    260    260    260       280
Switzerland     78     83     83     83     83        82
France         100    100     82     82     82        89
Italy           82     82     67     67     67        73
Netherlands     51     54     44     44     44        47
Belgium         42     42     34     34     30        37
Japan           15     21     24     24     24        22
UK              39     21     19     19     19        23
Canada          23     22     21     20     15        20

Average         83     78     71     71     70

Units: millions of troy ounces.

Table after simplification
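The simplified table appears to round each entry to about two significant figures. The sketch below applies that rule to the 1970 column of Table 2.2.1; the helper `round_sig` is a hypothetical name introduced here, and the two-significant-figure rule is inferred from the simplified values rather than stated on the slides.

```python
from math import floor, log10

def round_sig(x, sig=2):
    """Round x to `sig` significant figures (assumed simplification rule)."""
    if x == 0:
        return 0.0
    digits = sig - 1 - floor(log10(abs(x)))
    return round(x, digits)

# 1970 column of Table 2.2.1 (millions of troy ounces).
reserves_1970 = {
    "Belgium": 42.01, "Canada": 22.59, "France": 100.91,
    "Italy": 82.48, "Japan": 15.22, "Netherlands": 51.06,
    "Switzerland": 78.03, "U.K.": 38.52, "U.S.A.": 316.34,
}

simplified = {c: int(round_sig(v)) for c, v in reserves_1970.items()}

# Listing from largest to smallest makes the column easier to scan.
for country, value in sorted(simplified.items(), key=lambda kv: -kv[1]):
    print(f"{country:<12}{value:>5}")
```

The rounded values agree with the 1970 column of the simplified table (for example 316.34 becomes 320 and 38.52 becomes 39).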

Slide 11 STAT 13, UCLA, Ivo Dinov

[Figure: the same percentages drawn three ways: (a) bar graph, (b) pie chart, (c) segmented bar. Categories: S. Africa, U.S., USSR, Austr., Can., China, Rest. One slice is labeled 21%.]

Figure 2.6.3 Percentages of the world's gold production in 1991.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Different graphs of the same set of numbers

Slide 12 STAT 13, UCLA, Ivo Dinov

Questions …

  • For what two purposes are tables of numbers presented? (convey information about trends in the data, detailed analysis)
  • When should you round numbers, and when should you preserve full accuracy?
  • How should you arrange the numbers you are most interested in comparing? (Arrange numbers you want to compare in columns, not rows. Provide written/verbal summaries/footnotes. Show row/column averages.)
  • Should a table be left to tell its own story?

Slide 19 STAT 13, UCLA, Ivo Dinov

FIGURE 2.3.7 Two stem-and-leaf plots for the traffic deaths data

[Figure: (a) stem-and-leaf plot, Units: 17 | 4 = 17.4 deaths per 100,; (b) the same data after round-off and collapse to 12 stems, Units: 1 | 7 = 17 deaths per 100,]

Slide 20 STAT 13, UCLA, Ivo Dinov

TABLE 2.3.2 Coyote Lengths Data (cm): body length

Females
 93.0   97.0   92.0  101.6   93.0   84.5  102.5   97.8   91.0   98.0   93.5   91.
 90.2   91.5   80.0   86.4   91.4   83.5   88.0   71.0   81.3   88.5   86.5   90.
 84.0   89.5   84.0   85.0   87.0   88.0   86.5   96.0   87.0   93.5   93.5   90.
 85.0   97.0   86.0   73.

Males
 97.0   95.0   96.0   91.0   95.0   84.5   88.0   96.0   96.0   87.0   95.0  100.
101.0   96.0   93.0   92.5   95.0   98.5   88.0   81.3   91.4   88.9   86.4  101.
 83.8  104.1   88.9   92.0   91.0   90.0   85.0   93.5   78.0  100.5  103.0   91.
105.0   86.0   95.5   86.5   90.5   80.0   80.

Coyotes captured in Nova Scotia, Canada. Data courtesy of Dr Vera Eastwood.

TABLE 2.3.3 Frequency Table for Female Coyote Lengths

Class Interval   Tally            Frequency   Stem-and-leaf plot
70-75            ||                       2    7 | 1 4
75-80                                     0    7 |
80-85            |||| |                   6    8 | 0 1 4 4 4
85-90            |||| |||| ||            12    8 | 5 5 5 6 6 7 7 7 7 8 8 9
90-95            |||| |||| |||           13    9 | 0 0 0 0 1 1 2 2 2 3 3 4 4 4
95-100           ||||                     5    9 | 6 7 7 8 8
100-105          ||                       2   10 | 2 3
Total                                    40

Slide 21 STAT 13, UCLA, Ivo Dinov

[Figure: (a) histogram and (b) rotated stem-and-leaf plot of the female coyote lengths; horizontal axis: length (cm). Compare with the frequency table, Table 2.3.3.]

Figure 2.3.8 Histogram of the female coyote-lengths data.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Slide 22 STAT 13, UCLA, Ivo Dinov

[Figure: four panels for the female coyote lengths, each with horizontal axis Length (cm) and vertical scale 0 to 12:
  (a) Original histogram (interval width = 5)
  (b) Change class-interval width (interval width = 3)
  (c) Same widths, different boundaries (interval width = 5)
  (d) Density trace (window width = 5)]

Figure 2.3.9 Histograms and density trace of female coyote-lengths data.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Histogram bin-size change

Histogram bin-boundary change
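The effect of bin size and bin boundaries can be sketched by counting the same data with different class intervals. The helper `histogram_counts` is a hypothetical name, intervals are assumed to be closed on the left (so 85.0 falls in 85-90), and the data are only the 36 female lengths from Table 2.3.2 that are fully legible here (the four cut-off entries are left out).

```python
def histogram_counts(data, start, width):
    """Count observations in intervals [start, start+width), [start+width, ...), ...

    Assumes start <= min(data).
    """
    n_bins = int((max(data) - start) // width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // width)] += 1
    return counts

# Female coyote lengths (cm): the 36 complete values from Table 2.3.2.
females = [
    93.0, 97.0, 92.0, 101.6, 93.0, 84.5, 102.5, 97.8, 91.0, 98.0, 93.5,
    90.2, 91.5, 80.0, 86.4, 91.4, 83.5, 88.0, 71.0, 81.3, 88.5, 86.5,
    84.0, 89.5, 84.0, 85.0, 87.0, 88.0, 86.5, 96.0, 87.0, 93.5, 93.5,
    85.0, 97.0, 86.0,
]

original = histogram_counts(females, start=70, width=5)    # like panel (a)
narrower = histogram_counts(females, start=70, width=3)    # like panel (b)
shifted = histogram_counts(females, start=67.5, width=5)   # like panel (c)

print(original)
print(narrower)
print(shifted)
```

All three lists describe the same observations, yet the heights of the bars differ, which is why a histogram's shape should not be over-interpreted.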

Slide 23 STAT 13, UCLA, Ivo Dinov

Questions …

  • What advantages does a stem-and-leaf plot have over a histogram? (Stem-and-leaf plots retain information on individual values, are quick to produce by hand, and provide a way of sorting the data. But histograms are more attractive and easier to understand.)
  • The shape of a histogram can be quite drastically altered by choosing different class-interval boundaries. What type of plot does not have this problem? (density trace) What other factor affects the shape of a histogram? (bin size)
  • What was another reason given for plotting data on a variable, apart from interest in how the data on that variable behaves? (shows features, clusters/gaps, outliers, as well as trends)
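As a minimal sketch of the first answer, the stem-and-leaf construction below sorts the data and keeps every individual value visible. It uses the AGE column of Table 2.1.1, with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is introduced here for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} for whole-number data (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Include every stem between the smallest and largest so gaps stay visible.
    return {stem: leaves.get(stem, []) for stem in range(min(leaves), max(leaves) + 1)}

# AGE column of Table 2.1.1 (45 patients), in table order.
ages = [49, 54, 56, 42, 46, 57, 51, 56, 45, 46, 48, 54, 39, 49, 58,
        50, 58, 47, 51, 43, 55, 58, 55, 41, 51, 49, 56, 45, 57, 51,
        56, 49, 47, 36, 38, 59, 54, 57, 52, 47, 39, 48, 53, 47, 57]

plot = stem_and_leaf(ages)
for stem, leaf_list in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaf_list)}")
```

Reading along each row recovers the sorted ages, which a histogram of the same data would not allow.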

Slide 24 STAT 13, UCLA, Ivo Dinov

  (a) Unimodal
  (b) Bimodal
  (c) Trimodal
  (d) Symmetric
  (e) Positively skewed (long upper tail)
  (f) Negatively skewed (long lower tail)
  (g) Symmetric
  (h) Bimodal with gap
  (i) Exponential shape

Interpreting Stem-plots and Histograms

Annotated curve shapes: $e^{-|x|}$, $e^{x}$, $x^{-1}$

Slide 25 STAT 13, UCLA, Ivo Dinov

  (j) Spike in pattern
  (k) Outliers
  (l) Truncation plus outlier

Figure 2.3.10 Features to look for in histograms and stem-and-leaf plots.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Interpreting Stem-plots and Histograms

Slide 26 STAT 13, UCLA, Ivo Dinov

Fascinations with histograms: histogram of heights using the actual people

Subjects are university genetics students, females in white and males in dark tops.

Slide 27 STAT 13, UCLA, Ivo Dinov

Questions …

  • What does it mean for a histogram or stem-and-leaf plot to be bimodal? What do we suspect when we see a bimodal plot?
  • What are outliers, and how do they show up in these plots? What should we try to do when we see them?
  • What do we mean by symmetry and positive and negative skewness?
  • What shape do we call exponential?
  • Should we be suspicious of abrupt changes? Why? (Yes! Try to establish the reason; the jump may have to be rectified!)

Slide 28 STAT 13, UCLA, Ivo Dinov

Descriptive Statistics

Minitab output

Variable    N     Mean   Median   TrMean   StDev   SE Mean
age        45   50.133   51.000   50.366   6.092   0.

Variable   Minimum   Maximum       Q1      Q3
age         36.000    59.000   46.500     56.

Annotations: StDev is the standard deviation, Q1 the lower quartile, Q3 the upper quartile.

STATA Output

.summarize

Descriptive statistics from computer programs like STATA
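The summary figures can be cross-checked with a short sketch that uses Python's standard `statistics` module on the AGE column of Table 2.1.1. One assumption to note: quartile conventions differ between programs, and the default "exclusive" method of `statistics.quantiles` is used here because it reproduces the Q1 value shown.

```python
import statistics

# AGE column of Table 2.1.1 (45 patients), in table order.
ages = [49, 54, 56, 42, 46, 57, 51, 56, 45, 46, 48, 54, 39, 49, 58,
        50, 58, 47, 51, 43, 55, 58, 55, 41, 51, 49, 56, 45, 57, 51,
        56, 49, 47, 36, 38, 59, 54, 57, 52, 47, 39, 48, 53, 47, 57]

n = len(ages)
mean = statistics.mean(ages)
median = statistics.median(ages)
stdev = statistics.stdev(ages)                 # sample SD, divides by n - 1
q1, _, q3 = statistics.quantiles(ages, n=4)    # default method="exclusive"

print(f"N = {n}, mean = {mean:.3f}, median = {median:.3f}, StDev = {stdev:.3f}")
print(f"min = {min(ages)}, max = {max(ages)}, Q1 = {q1:.3f}, Q3 = {q3:.3f}")
```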

Slide 29 STAT 13, UCLA, Ivo Dinov

  • The sample mean is denoted by $\bar{x}$.

$$\text{The sample mean} = \bar{x} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$

Descriptive statistics …

Slide 30 STAT 13, UCLA, Ivo Dinov

[Figure: three rods, (a), (b) and (c), carrying the same dots, with the mean marked.]

Figure 2.4.1 Mechanical construction representing a dot plot: (a) shows a balanced rod while (b) and (c) show unbalanced rods.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

The sample mean is where the dot plot balances
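One way to see the balance property is that the deviations from the mean cancel exactly. The notation $x_1, \dots, x_n$ for the observations is an assumption introduced here; the slides only name the mean $\bar{x}$.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\,\bar{x}
  = n\,\bar{x} - n\,\bar{x}
  = 0
```

The total distance of the points below the mean therefore equals the total distance of the points above it, which is the condition for the rod to balance.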

Slide 37 STAT 13, UCLA, Ivo Dinov

$$\mathrm{IQR} = Q_3 - Q_1$$

Inter-quartile Range

Slide 38 STAT 13, UCLA, Ivo Dinov

[Figure: box plot drawn above a dot plot of SYSVOL, with Q1, the median and Q3 marked.]

Figure 2.4.3 Box plot for SYSVOL.

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Box plot compared to dot plot

Slide 39 STAT 13, UCLA, Ivo Dinov

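[Figure: a box is drawn from Q1 to Q3 along the data scale with the median (Med) marked inside. Each whisker is first extended 1.5 IQR beyond its quartile and then pulled back until it hits an observation.]

Figure 2.4.4 Construction of a box plot.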

From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Construction of a box plot
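The construction can be sketched in code using the SYSVOL column of Table 2.1.1. The function name `box_plot_summary` is introduced here for illustration, and the quartiles use the default "exclusive" method of `statistics.quantiles`; other programs may use a slightly different quartile rule and so give slightly different boxes.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker ends and outside points for a box plot (1.5 IQR rule)."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # "Pull back until hit observation": whiskers end at the most extreme
    # observations that still lie inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outside = sorted(x for x in data if x < low_fence or x > high_fence)
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "whisker_low": min(inside), "whisker_high": max(inside),
            "outliers": outside}

# SYSVOL column of Table 2.1.1 (45 patients), in table order.
sysvol = [36, 74, 52, 165, 47, 124, 86, 37, 65, 52, 117, 69, 54, 67, 65,
          184, 88, 106, 85, 59, 103, 53, 68, 53, 58, 81, 58, 71, 92, 50,
          146, 43, 70, 73, 62, 93, 65, 95, 49, 66, 144, 52, 219, 39, 39]

summary = box_plot_summary(sysvol)
for name, value in summary.items():
    print(f"{name:>12}: {value}")
```

Under these assumptions the upper whisker stops at 146 and the three largest volumes are plotted individually as outside values.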

Slide 40 STAT 13, UCLA, Ivo Dinov

[Figure: Minitab stem-and-leaf display of strength (N = 33, Leaf Unit = 10), together with two further plots on a strength axis.]

Figure 2.4.5 Three graphs of the breaking-strength data for gear-teeth in positions 4 & 10 (Minitab output).

Comparing 3 plots of the same data

Slide 41 STAT 13, UCLA, Ivo Dinov

TABLE 2.5.1 Word Lengths for the First 100 Words on a Randomly Chosen Page

3  2  2  4  4  4  3  9  9  3  6  2  3  2  3  4  6  5  3  4
2  3  4  5  2  9  5  8  3  2  4  5  2  4  1  4  2  5  2  5
3  6  9  6  3  2  3  4  4  4  2  2  4  2  3  7  4  2  6  4
2  5  9  2  3  7 11  2  3  6  4  4  7  6  6 10  4  3  5  7
7  7  5 10  3  2  3  9  4  5  5  4  4  3  5  2  5  2  4  2

Frequency Table

Value u_j       1   2   3   4   5   6   7   8   9  10  11
Frequency f_j   1  22  18  22  13   8   6   1   6   2   1

Slide 42 STAT 13, UCLA, Ivo Dinov

Mean from a frequency table

$$\bar{x} = \frac{1}{n} \times (\text{Sum of all observations}) = \frac{1}{n} \times \text{Sum of (value} \times \text{frequency of occurrence)}$$

Example: {2, 4, 2, 4, 2}

Value   Frequency   Value × Frequency
2       3           6
4       2           8
Total   5           14

Mean = 14/5 = 2.8
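A minimal sketch of the same calculation in code, applied first to the small example and then to the word-length frequencies of Table 2.5.1. The function name `mean_from_frequency_table` is introduced here for illustration; each table is written as a mapping from value to frequency.

```python
def mean_from_frequency_table(freq):
    """Mean = (1/n) * sum of (value * frequency), where n = sum of frequencies."""
    n = sum(freq.values())
    return sum(value * count for value, count in freq.items()) / n

# Small example {2, 4, 2, 4, 2}: value 2 occurs 3 times, value 4 occurs twice.
small = {2: 3, 4: 2}

# Word-length frequencies from Table 2.5.1 (value u_j -> frequency f_j).
word_lengths = {1: 1, 2: 22, 3: 18, 4: 22, 5: 13, 6: 8,
                7: 6, 8: 1, 9: 6, 10: 2, 11: 1}

small_mean = mean_from_frequency_table(small)
word_mean = mean_from_frequency_table(word_lengths)

print(small_mean)   # 14 / 5
print(word_mean)    # 435 / 100
```

The frequency form gives the same answer as averaging the raw list, but needs only one multiplication per distinct value.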

Slide 43 STAT 13, UCLA, Ivo Dinov

TABLE 2.5. Frequency Table for the Occurrence of Fish Species in Ocean Strata

Columns:
  • No. of strata in which species occur ($u_j$)
  • Frequency, i.e. No. of species ($f_j$)
  • Percentage of species ($\frac{f_j}{n} \times 100$)
  • Cumulative Percentage

Totals: n = 330 species, 100 percent.

Source: Haedrich and Merrett [1988]

Slide 44 STAT 13, UCLA, Ivo Dinov

[Figure: bar graph; horizontal axis: Number of strata occupied.]

Figure 2.5.1 Bar graph for species data. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

Slide 45 STAT 13, UCLA, Ivo Dinov

Labeled bar graphs to convey size

[Figure: labeled bar graph of Gross Rents ($ per ft²) by City.]

Figure 2.6.2 Cost of commercial rents around the world. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.

  • Can use bar graphs to relate labels to relative size.
  • Where possible, order items by size.