STATISTICAL DATA ANALYSIS
Dr. Yan Liu
Department of Biomedical, Industrial and Human Factors Engineering
Wright State University
Statistics
Types of Statistics
Descriptive statistics
Comprises the statistical methods dealing with the collection, tabulation and
summarization of data, so as to present meaningful information of the data
Inferential statistics
Consists of the methods involved with the analysis and interpretation of data that will
enable the researcher to develop meaningful inferences about the data
These two areas are interrelated
While descriptive statistics organizes the collected data in a systematic manner,
inferential statistics analyzes the data and enables one to produce significant
inferences about it
Measures of Central Tendency
Indicate the central point or the greatest frequency concerning a set of
data
Mean
The statistical mean of a set of data is its average
Population mean vs. sample mean
Population mean, μ , is the expected value E( x ), such that if an infinite number of
measurements are made, the average of the infinite measurements is the result; this
represents the true value of a measurement
The sample mean, x̄, is the average value of a sample, which is a finite series of
measurements, and is an estimate of the population mean
Median
The median of a set of data, x̃, is the value which, when the data are arranged in
ascending or descending order, satisfies the following conditions: 1) if the
number of observations is odd, the median is the middle value; and 2) if the number of
observations is even, the median is the average of the two middle values
The same as the 50th percentile of a set of data
Measures of Central Tendency (Cont’d)
Mode
The mode of a set of data is the specific value that occurs with the greatest
frequency
A data set may have more than one mode, or none
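The three measures can be checked on a small data set. The sketch below uses Python's standard `statistics` module; the sample values are hypothetical and chosen only so the results are easy to verify by hand.

```python
import statistics

# Hypothetical sample with an even number of observations.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)        # sum / count = 30 / 6
median = statistics.median(data)    # average of the two middle values, 3 and 5
modes = statistics.multimode(data)  # every value tied for the highest frequency

# A data set can have more than one mode.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean, median, modes, tied_modes)
```

Note that `multimode` returns every value when all values occur exactly once, which corresponds to the "no mode" case described above.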
Correlation Analysis
Purpose
Measures how strongly two attributes correlate with each other
Correlation Coefficient
Correlation analysis for numerical variables
Indicates the strength and direction of a linear relationship between two numeric
random variables
Pearson’s product moment coefficient
Curvilinear Relationship
If the relationship between two variables is curvilinear, the Pearson product-moment
correlation coefficient may fail to indicate that a relationship exists
To check whether the relationship is linear, the easiest way is to construct a
scatterplot which gives a visual indication of the shape of the relationship
$$ r_{AB} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B} $$
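A minimal sketch of Pearson's product-moment coefficient, assuming population standard deviations (divisor n) for σ_A and σ_B. The function name `pearson_r` and the data are hypothetical; the quadratic series is included to illustrate how a clear curvilinear relationship can still produce r = 0.

```python
import math

def pearson_r(a, b):
    """Pearson's r using population standard deviations (divisor n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    # Numerator in the form sum(a_i * b_i) - n * mean_a * mean_b
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / (n * sd_a * sd_b)

x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]   # perfect positive linear relationship
quadratic = [v ** 2 for v in x]   # curvilinear, with an intermediate minimum

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)
print(r_linear, r_quadratic)
```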
[Figure: four scatterplots showing a perfect positive linear relationship, a perfect negative linear relationship, a curvilinear relationship (quadratic relationship with an intermediate minimum), and no relationship]
Correlation Analysis (Cont’d)
Chi-Square (χ^2 ) Test
Correlation analysis for categorical variables
Contingency table of A (columns) by B (rows):

        A_1       A_2       …    A_p
B_1     A_1B_1    A_2B_1    …    A_pB_1
B_2     A_1B_2    A_2B_2    …    A_pB_2
…       …         …         …    …
B_q     A_1B_q    A_2B_q    …    A_pB_q
Cell (Ai, Bj) represents the joint event that A= Ai and B= Bj
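$$ \chi^2 = \sum_{i=1}^{p} \sum_{j=1}^{q} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$
χ^2 with (p-1)(q-1) degrees of freedom: test statistic of the hypothesis
that A and B are independent
$$ e_{ij} = \frac{count(A = A_i) \times count(B = B_j)}{n} $$
o_ij: the observed frequency of the joint event (Ai, Bj)
e_ij: the expected frequency of the joint event (Ai, Bj) under independence
n: the total number of cases
Count ( A=Ai ): the observed frequency of the event A=Ai
Count ( B=Bj ): the observed frequency of the event B=Bj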
Gender and Preferred Reading Example

                     Gender
Preferred_Reading    male    female    Row Margin
fiction              250     200       450
non-fiction          50      1000      1050
Column Margin        300     1200      n = 1500

$$ e_{11} = \frac{count(male) \times count(fiction)}{n} = \frac{300 \times 450}{n} $$
$$ e_{12} = \frac{count(male) \times count(non\text{-}fiction)}{n} = \frac{300 \times 1050}{n} $$
$$ e_{21} = \frac{count(female) \times count(fiction)}{n} = \frac{1200 \times 450}{n} $$
$$ e_{22} = \frac{count(female) \times count(non\text{-}fiction)}{n} = \frac{1200 \times 1050}{n} $$
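The expected frequencies and the χ^2 statistic for the gender and preferred reading counts can be worked through as below. This is a plain-Python sketch of the arithmetic; the variable names are illustrative, and n is taken as the sum of the four observed cells.

```python
# Observed counts: rows are preferred reading (fiction, non-fiction),
# columns are gender (male, female).
observed = [
    [250, 200],
    [50, 1000],
]

row_totals = [sum(row) for row in observed]        # 450, 1050
col_totals = [sum(col) for col in zip(*observed)]  # 300, 1200
n = sum(row_totals)

# e_ij = (row margin * column margin) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# chi-square = sum over all cells of (o - e)^2 / e
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi2, 2), dof)
```

The statistic is then compared with the critical value of the χ^2 distribution with `dof` degrees of freedom at the chosen significance level.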
Standardized Difference Between Means
Cohen’s d
Defined as mean difference divided by the pooled standard deviation
Interpretation of Cohen’s d
Small effect size: ~[0.0, 0.5)
Medium effect size: ~[0.5, 0.8)
Large effect size: ~[0.8, +∞)
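$$ d = \frac{\bar{X}_1 - \bar{X}_2}{\sigma_p} $$
$$ \sigma_p = \sqrt{\frac{n_1 \sigma_1^2 + n_2 \sigma_2^2}{n_1 + n_2}} $$
σp is the pooled population standard deviation of X1 and X2
When n1 = n2,
$$ \sigma_p = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}} $$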
Standardized Difference Between Means (Cont’d)
Hedges’ g
Virtually the same as Cohen’s d in large sample sizes
$$ g = \frac{\bar{X}_1 - \bar{X}_2}{S_p}, \qquad S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}} $$
Sp is the pooled sample standard deviation of X1 and X2
S1 = σ1 √(n1/(n1-1)), S2 = σ2 √(n2/(n2-1))
When n1 = n2,
$$ S_p = \sqrt{\frac{S_1^2 + S_2^2}{2}} $$
Some software (e.g. Effect Size Generator) calculates g by adjusting the overall
effect size based on the sample sizes, as follows
$$ g = \frac{\bar{X}_1 - \bar{X}_2}{S_p} \left( 1 - \frac{3}{4 (n_1 + n_2) - 9} \right) $$
Standardized Difference Between Means (Cont’d)
Glass’s Delta ∆
Defined as the mean difference between the experimental and control group
divided by the standard deviation of the control group
$$ \Delta = \frac{\bar{X}_1 - \bar{X}_2}{S_{control}} $$
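Visual Interface Example (I)

Interface (A)
A1    6  5  5  7  4  3  5  4
A2    8  6  9  6  6  5  5  7

$$ \bar{X}_1 = 4.875, \qquad \bar{X}_2 = 6.5 $$
$$ \sigma_1 = \sqrt{\frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1}} = 1.166, \qquad \sigma_2 = \sqrt{\frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2}} = 1.323 $$
$$ S_1 = \sqrt{\frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1}} = 1.246, \qquad S_2 = \sqrt{\frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2 - 1}} = 1.414 $$

Cohen’s d
$$ d = \frac{-1.625}{\sqrt{(1.166^2 + 1.323^2)/2}} \approx -1.303 $$

Hedges’ g
$$ g = \frac{-1.625}{\sqrt{(1.246^2 + 1.414^2)/2}} \approx -1.219 $$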
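The independent-samples effect sizes for the two interfaces can be recomputed from the raw scores. The sketch below follows the pooled-standard-deviation definitions of Cohen's d and Hedges' g given earlier; the helper names `mean` and `variance` are illustrative, and the last line applies the small-sample adjustment factor 1 - 3/(4(n1 + n2) - 9).

```python
import math

a1 = [6, 5, 5, 7, 4, 3, 5, 4]  # scores with interface A1
a2 = [8, 6, 9, 6, 6, 5, 5, 7]  # scores with interface A2

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs, ddof):
    """Variance with divisor n - ddof (ddof=0: population, ddof=1: sample)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

n1, n2 = len(a1), len(a2)
diff = mean(a1) - mean(a2)  # 4.875 - 6.5

# Cohen's d: pooled population standard deviation
sigma_p = math.sqrt((n1 * variance(a1, 0) + n2 * variance(a2, 0)) / (n1 + n2))
cohen_d = diff / sigma_p

# Hedges' g: pooled sample standard deviation
s_p = math.sqrt(((n1 - 1) * variance(a1, 1) + (n2 - 1) * variance(a2, 1)) / (n1 + n2 - 2))
hedges_g = diff / s_p

# Sample-size-adjusted g
hedges_g_adjusted = hedges_g * (1 - 3 / (4 * (n1 + n2) - 9))

print(round(cohen_d, 3), round(hedges_g, 3), round(hedges_g_adjusted, 3))
```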
Visual Interface Example

Participant (S)    1    2    3    4    5    6    7    8
Interface (A)
A1                 6    5    5    7    4    3    5    4
A2                 8    6    9    6    6    5    5    7
D12 = A1 - A2     -2   -1   -4    1   -2   -2    0   -3

$$ \bar{D}_{12} = -1.625 $$
$$ \sigma_D = \sqrt{\frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n}} = 1.495 $$
Cohen’s d
$$ d = \frac{-1.625}{1.495} = -1.087 $$
$$ S_D = \sqrt{\frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}} = 1.598 $$
Hedges’ g
$$ g = \frac{-1.625}{1.598} = -1.017 $$
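For the paired (within-participant) version of the example, the effect sizes are based on the difference scores D = A1 - A2. A minimal sketch, assuming d divides the mean difference by the population standard deviation of D and g divides it by the sample standard deviation of D:

```python
import math

a1 = [6, 5, 5, 7, 4, 3, 5, 4]
a2 = [8, 6, 9, 6, 6, 5, 5, 7]

# Difference score for each participant
d_scores = [x - y for x, y in zip(a1, a2)]
n = len(d_scores)
mean_d = sum(d_scores) / n

sum_sq = sum((d - mean_d) ** 2 for d in d_scores)
sigma_d = math.sqrt(sum_sq / n)    # population SD of the differences
s_d = math.sqrt(sum_sq / (n - 1))  # sample SD of the differences

cohen_d = mean_d / sigma_d
hedges_g = mean_d / s_d

print(d_scores, round(cohen_d, 3), round(hedges_g, 3))
```

The negative sign only reflects the direction of subtraction (A1 scores are lower on average than A2 scores).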
Effect Sizes of Correlation
Pearson Product-Moment Correlation Coefficient
Correlation between numeric variables
Point Biserial Correlation Coefficient (rpb)
Used when one variable, say X 1 , is continuous but the other variable, say X 2 , is
dichotomous
Assuming that X 2 has two values, 0 and 1, the data set can be divided into two
groups, group 1 which receives the value "1" on X 2 and group 2 which receives
the value "0" on X 2. Then rpb is calculated as follows
$$ r_{pb} = \frac{M_1 - M_0}{S_X} \sqrt{\frac{n_1 n_0}{(n_1 + n_0)(n_1 + n_0 - 1)}} $$
where M1 is the mean of X1 for all data points in group 1 of X2, M0 is the
mean of X1 for all data points in group 2 of X2, n1 is the number of data points
in group 1, n0 is the number of data points in group 2, and SX is the sample
standard deviation of X1 (computed with divisor n1 + n0 - 1)
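A small sketch of the point biserial calculation, assuming S_X is the sample standard deviation of the continuous variable (divisor n - 1). The function name `point_biserial` and the data are hypothetical.

```python
import math

def point_biserial(x1, x2):
    """Point biserial correlation between continuous x1 and dichotomous x2 (0/1)."""
    n = len(x1)
    group1 = [v for v, g in zip(x1, x2) if g == 1]
    group0 = [v for v, g in zip(x1, x2) if g == 0]
    n1, n0 = len(group1), len(group0)
    m1 = sum(group1) / n1
    m0 = sum(group0) / n0
    mean_all = sum(x1) / n
    # Sample standard deviation of the continuous variable over all data points
    s_x = math.sqrt(sum((v - mean_all) ** 2 for v in x1) / (n - 1))
    return (m1 - m0) / s_x * math.sqrt(n1 * n0 / (n * (n - 1)))

# Hypothetical data: x1 is continuous, x2 marks group membership.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [0, 0, 0, 1, 1, 1]

r_pb = point_biserial(x1, x2)
print(round(r_pb, 4))
```

The result equals the Pearson coefficient computed on the same two columns, which is why the point biserial coefficient can be treated as a special case of Pearson's r.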
Effect Sizes for ANOVA
Measure the degree of association between an effect (i.e., a main effect, an
interaction) and the dependent variable
Can be thought of as the correlation between an effect and the dependent
variable
If the value of the measure of association is squared, it can be interpreted as the
proportion of variance in the dependent variable that is attributable to each effect
Commonly used measures of effect size in ANOVA
Eta squared, η^2
Partial Eta squared, ηp^2
Omega squared, ω^2
Intraclass correlation, ρI
η^2 and ηp^2 are estimates of degree of association for the sample, while ω^2 and ρI are
estimates of the degree of association in the population
Eta Squared
Eta Squared, η^2
The proportion of the total variance that is attributed to an effect
Statistical issue
The effect size of an effect is dependent upon the number and magnitude of other
effects
$$ \eta^2 = \frac{SS_{Effect}}{SS_T} $$

Effect            Sum of Squares    η^2
Drive             24                3.93%
Reward            112               18.36%
Reward * Drive    144               23.61%
Error             330               54.10%
SST               610
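The η^2 column can be reproduced directly from the sums of squares. A short sketch, with the dictionary layout chosen only for illustration:

```python
# Sums of squares from the Drive x Reward table.
sums_of_squares = {
    "Drive": 24,
    "Reward": 112,
    "Reward * Drive": 144,
    "Error": 330,
}

ss_total = sum(sums_of_squares.values())

# Eta squared: share of the total sum of squares attributed to each source
eta_squared = {name: ss / ss_total for name, ss in sums_of_squares.items()}

for name, value in eta_squared.items():
    print(f"{name}: {value:.2%}")
```

Because every source is divided by the same total, the shares sum to 1, which is why the η^2 of one effect depends on the number and magnitude of the other effects.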
Within-Subject Design
Effect                                  Sum of Squares    ηp^2
Interface                               10.56             10.56/(10.56+8.94) = 54.2%
Participant                             15.94
Participant * Interface (error term)    8.94
Total                                   35.43
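The partial eta squared for Interface follows the ratio shown in the table, SS_Effect / (SS_Effect + SS_Error), where the error term is the Participant * Interface sum of squares (8.94). A minimal check of the arithmetic:

```python
ss_interface = 10.56  # sum of squares for the Interface effect
ss_error = 8.94       # Participant * Interface sum of squares (error term)

# Partial eta squared: effect variance relative to effect plus its error term
partial_eta_squared = ss_interface / (ss_interface + ss_error)

print(f"{partial_eta_squared:.1%}")
```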
Omega Squared
Omega Squared, ω^2
An estimate of the dependent variance accounted for by the independent
variable in the population for a fixed effects model
ω^2 for between-subjects, fixed effects is
$$ \omega^2 = \frac{SS_{Effect} - (df_{Effect})(MS_{Err})}{SS_T + MS_{Err}} $$
ω^2 is always smaller than either η^2 or ηp^2
Effect            Sum of Squares    df    Mean Squares    ω^2
Drive             24                1     24
Reward            112               2     56              12.0%
Reward * Drive    144               2     72
Error             330               18    18.33
SST               610
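The ω^2 value for Reward can be verified with the between-subjects, fixed-effects formula. In this sketch the interaction sum of squares (144) is taken from the η^2 table, and its degrees of freedom (2) are an assumption derived as the product of the Drive and Reward degrees of freedom.

```python
ss_total = 610
ms_error = 330 / 18  # Error mean square = SS_Error / df_Error

# (sum of squares, degrees of freedom) per effect; the interaction df is assumed.
effects = {
    "Drive": (24, 1),
    "Reward": (112, 2),
    "Reward * Drive": (144, 2),
}

# omega^2 = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
omega_squared = {
    name: (ss - df * ms_error) / (ss_total + ms_error)
    for name, (ss, df) in effects.items()
}
eta_squared = {name: ss / ss_total for name, (ss, df) in effects.items()}

for name in effects:
    print(f"{name}: omega^2 = {omega_squared[name]:.1%}, eta^2 = {eta_squared[name]:.1%}")
```

Each ω^2 comes out below the corresponding η^2, consistent with the statement that ω^2 is the smaller of the two estimates.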
Within-Subject Design
Effect                                  Sum of Squares    df    Mean Squares    ω^2
Interface                               10.56             1     10.56
Participant                             15.94             7     2.28
Participant * Interface (error term)    8.94              7     1.28
Total                                   35.43             15
Intraclass Correlation
Intraclass Correlation, ρI
An estimate of the dependent variance accounted for by the independent
variable in the population for a random effects model
$$ \rho_I = \frac{MS_{Effect} - MS_{Err}}{MS_{Effect} + (df_{Effect})(MS_{Err})} $$
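A sketch of the intraclass correlation formula as a function. The name `intraclass_correlation` is hypothetical, and the numbers passed in (the Reward mean square of 56 with 2 degrees of freedom, and the error mean square of 330/18) are reused from the earlier ANOVA table purely to exercise the arithmetic; the source does not state that this table comes from a random effects model.

```python
def intraclass_correlation(ms_effect, ms_error, df_effect):
    """rho_I = (MS_effect - MS_error) / (MS_effect + df_effect * MS_error)."""
    return (ms_effect - ms_error) / (ms_effect + df_effect * ms_error)

# Illustrative inputs only (see the assumptions above).
rho_i = intraclass_correlation(ms_effect=56, ms_error=330 / 18, df_effect=2)
print(round(rho_i, 4))
```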
31
σ^2 = 60
n = 48
Critical region
Pr(reject H 0 |H 0 is false)= π = 0.
Compare Two Population Means: Independent Samples (II)
Compare Two Population Means with
Independent Samples
For large samples (n>30), the sample size per group n needs to satisfy
$$ n \ge \frac{2 (z_{1-\alpha/2} + z_{\pi})^2 \sigma_{error}^2}{\Delta^2} $$
σ^2 error: the within-group variance
∆: the smallest difference between the two groups you wish to detect
z1-α/2 : the percentile of the normal distribution used as the critical value in a two-tailed
test of size α (1.96 for α = 0.05)
zπ : the π ×100-th percentile of the normal distribution (0.84 for the 80th percentile)
For small samples (n<30), the sample size per group n needs to satisfy
$$ n \ge \frac{2 (t_{1-\alpha/2,\,n-1} + t_{\pi,\,n-1})^2 \sigma_{error}^2}{\Delta^2} $$
Since the particular t distribution depends on the sample size, the equation must be
solved iteratively (trial-and-error)
The sample size increases with σerror and decreases with ∆
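The large-sample formula can be evaluated with the standard library's normal distribution. This is a sketch under the normal approximation only (it does not perform the iterative t-based solution for small samples); the function name `n_per_group` and its default arguments are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma_error, alpha=0.05, power=0.80):
    """Smallest n per group satisfying n >= 2 (z_{1-alpha/2} + z_power)^2 sigma^2 / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma_error ** 2 / delta ** 2)

# With sigma_error = 1, delta is the standardized difference (Cohen's d).
n_small = n_per_group(delta=0.2, sigma_error=1.0)
n_medium = n_per_group(delta=0.5, sigma_error=1.0)
n_large = n_per_group(delta=0.8, sigma_error=1.0)
print(n_small, n_medium, n_large)
```

Halving the detectable difference roughly quadruples the required sample size, and doubling the within-group standard deviation has the same effect. Results from the normal approximation can fall a sample or two below values obtained with the t distribution.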
Compare Two Population Means with
Independent Samples (Cont’d)
Estimate the Within-Group Standard Deviation
Often comes from previous similar studies
Sometimes it is necessary to run a pilot study to get some idea of the inherent
variability
Conservative estimates (estimates that lead to a slightly larger sample size) are
preferable to underestimates
Rules of Thumb
For 80% power, need 393 samples for each group when Cohen’s d = 0.2, 64
samples when d = 0.5, and 26 samples when d = 0.8
Compare Two Population Means with
Paired Samples
The formula for the total number of pairs is the same as for the number of
independent samples except that the factor of 2 is dropped, i.e.
$$ n \ge \frac{(z_{1-\alpha/2} + z_{\pi})^2 \sigma_{error}^2}{\Delta^2} \quad (\text{when } n > 30) $$
$$ n \ge \frac{(t_{1-\alpha/2,\,n-1} + t_{\pi,\,n-1})^2 \sigma_{error}^2}{\Delta^2} \quad (\text{when } n < 30) $$
Rules of Thumb
For 80% power, need 196 samples for each group when Cohen’s d = 0.2, 32
samples when d = 0.5, and 13 samples when d = 0.8