An introduction to univariate statistics and descriptive statistics, with a focus on summarizing data using histograms, skewness, mean, median, and interquartile range. The document also includes examples and exercises using R code.
Professor B. Jones
University of California, Davis
Fun With Numbers
Some Univariate Statistics
Learning to Describe Data
Research is empirically based…therefore, we must work with data.
You just did so with your plots.
There are no statistics in the plots, but they do summarize information usefully.
Visualizing Data
Often the first place to start is with visualization.
Works best with continuous data (but we’ll learn tricks for understanding data measured at other levels of measurement).
Start with an example.
Useful to Visualize Data
[Figure: Histogram of Variable Y. Horizontal axis: Y (0 to 4000); vertical axis: Frequency (0 to 40).]
Main Features
Exhibits “Right Skew”
Contrast this with “Left Skew”
Some “Outlying” Data Points?
Question: Are the outlying data points also “influential” data points (on measures of central tendency)?
Let’s check…
The Mean
Formally, the mean is given by:
$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_N}{N}$$
Or more compactly:
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$$
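As a minimal sketch of the definition of the mean (the sum of the observations divided by N), the snippet below computes it both ways, written out term by term and as a single sum. The sample values are hypothetical and chosen only for illustration; they are not the data set used in these notes.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("mean requires at least one observation")
    return sum(values) / n


# Hypothetical sample, for illustration only (not the notes' data set).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Written out term by term: (Y_1 + Y_2 + ... + Y_N) / N
expanded = (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8

# Compact form: (1/N) * sum over i of Y_i
compact = mean(sample)

print(expanded, compact)
```

Both forms give the same number because the compact form is only shorthand for the written-out sum.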
Our Data
Mean of Y is 260.67.
Mechanically…
(263 + 73 + … + 88)/67 = 260.67.
Problems with the mean?
No indication of dispersion or variability.
That is, it is a single indicator of central tendency…but is it a good indicator?
What about “variability” around the mean?
The variance is a statistic that describes (squared) deviations around the mean:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$
Why “N-1”?
Interpretation: “Average squared deviations from the mean.”
Our Data
Variance = 202,431.
Mechanically:
$$\frac{(263-260.67)^2 + (73-260.67)^2 + \cdots + (88-260.67)^2}{66}$$
“The average squared deviation around $\bar{Y}$ is 202,431.”
Standard Deviation
Take the square root of the variance and you get the standard deviation.
Why we like this:
Metric is now in original units of Y.
Interpretation
S.D. gives “average deviation” around the mean.
It’s a measure of dispersion that is in a metric that makes sense to us.
$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
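The variance and standard deviation with the N-1 denominator can be sketched as follows. The function names and the sample values are hypothetical, introduced only for this illustration. Python's standard `statistics` module also uses the N-1 denominator in `variance` and `stdev`, so it serves as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("variance with N - 1 needs at least two observations")
    y_bar = sum(values) / n
    squared_deviations = [(y - y_bar) ** 2 for y in values]
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """Square root of the variance, so it is back in the units of Y."""
    return math.sqrt(sample_variance(values))


# Hypothetical sample, for illustration only (not the notes' data set).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

variance = sample_variance(sample)  # squared deviations sum to 32, so 32 / 7
sd = sample_sd(sample)

print(variance, statistics.variance(sample))
print(sd, statistics.stdev(sample))
```

The variance is in squared units of Y, which is why the standard deviation is usually the number reported alongside the mean.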
Our Data
The standard deviation is: 449.92
Mechanically:
$$\left\{\frac{(263-260.67)^2 + \cdots + (88-260.67)^2}{66}\right\}^{1/2}$$
Interpretation: “The average deviation around the mean of 260.67 is 449.92.”
Now, suppose Y = Votes…
The average number of votes is “about 261 and the average deviation around this number is about 450 votes.”
The dispersion is very large.
(Imagine the opposite case: mean test score is 85 percent; average deviation is 5 percent.)
Revisiting our Data
[Figure: Histogram of Variable Y. Horizontal axis: Y (0 to 4000); vertical axis: Frequency (0 to 40).]
Skewness and The Mean
Data often exhibit skew.
This is often true with political variables.
We have a measure of central tendency and deviation about this measure (Mean, s.d.)
However, are there other indicators of central tendency?
How about the median?
Median
“50th” Percentile: Location at which 50 percent of the cases lie above; 50 percent lie below.
Since it’s a locational measure, you need to “locate it.”
Example Data: 32, 5, 23, 99, 54
As is, not informative.
Median
Rank it: 5, 23, 32, 54, 99
Median Location = (N+1)/2 (when N is odd) = 6/2 = 3
Location of the median is data point 3
This is 32.
Hence, M=32, not 3!!
Interpretation: “50 percent of the data lie above 32; 50 percent of the data lie below 32.”
What would the mean be?
(42.6…data are __________ skewed)
Median
When N is even: -67, 5, 23, 32, 54, 99
M is usually taken to be the average of the two middle scores:
(N+1)/2 = 7/2 = 3.5
The median location is 3.5, which is between 23 and 32
M = (23+32)/2 = 27.5
All pretty straightforward stuff.
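The rank-then-locate procedure for the median can be sketched as follows, using the two example data sets from these notes. The function name is a hypothetical choice for this illustration.

```python
def median(values):
    """Rank the data, find the location (N + 1) / 2, and read off the value."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("median requires at least one observation")
    location = (n + 1) / 2  # 1-based position in the ranked data
    if n % 2 == 1:
        # Odd N: the location is a whole number, so take that data point.
        return ranked[int(location) - 1]
    # Even N: the location falls halfway between two data points,
    # so average the two middle scores.
    lower = ranked[n // 2 - 1]
    upper = ranked[n // 2]
    return (lower + upper) / 2


odd_example = [32, 5, 23, 99, 54]
even_example = [-67, 5, 23, 32, 54, 99]

print(median(odd_example))   # location 3 in the ranked data
print(median(even_example))  # location 3.5, between 23 and 32
```

The function returns the value found at the median location, not the location itself, which is the distinction the odd-N example stresses.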
Dispersion around the Median
The mean has its standard deviation…
What about the median?
No such thing as “standard deviation” per se, around the median.
But, there is the IQR
Interquartile Range
The median is the 50th percentile.
Suppose we compute the 25th and the 75th percentiles and then take the difference.
25th Percentile is the “median” of the lower half of the data; the 75th Percentile is the “median” of the upper half.
Data: -67, 5, 23, 32, 54, 99
25th Percentile = 5
75th Percentile = 54
IQR is difference between 75th and 25th percentiles: 54-5 = 49
Hence, M=27.5; IQR=49
“Five Number Summary”: Min, 25th, 50th, 75th Percentiles, Max:
-67, 5, 27.5, 54, 99
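A minimal sketch of the "median of each half" approach to the quartiles, the IQR, and the five number summary is shown below, run on the six-value example. The function names are hypothetical. When N is odd, the sketch leaves the middle value out of both halves, which is an assumption: these notes only show an even-N case, and conventions differ.

```python
def _median_of_ranked(ranked):
    """Median of an already sorted, non-empty list."""
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2


def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower_half = ranked[:half]
    # Assumption: for odd N the middle value belongs to neither half.
    upper_half = ranked[half:] if n % 2 == 0 else ranked[half + 1:]
    return (
        ranked[0],
        _median_of_ranked(lower_half),
        _median_of_ranked(ranked),
        _median_of_ranked(upper_half),
        ranked[-1],
    )


data = [-67, 5, 23, 32, 54, 99]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # 75th percentile minus 25th percentile

print(summary, iqr)
```

For this example the lower half is -67, 5, 23 and the upper half is 32, 54, 99, so the quartiles are the middle values of those halves.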
Finding Percentiles
General Formula
$$L = \frac{p}{100}\, n$$
p is desired percentile
n is sample size
If L is a whole number: The value of the p-th percentile is between the L-th value and the next value. Find the mean of those values.
If L is not a whole number: Round L up. The value of the p-th percentile is the L-th value.
Example
25th Percentile: L=(25*6)/100=1.5
Round to 2. The 25th Percentile is 5.
75th Percentile: L=(75*6)/100=4.5
Round to 5. The 75th Percentile is 54.
50th Percentile: L=(50*6)/100=3
Take average of locations 3 and 4
This is (23+32)/2=27.5.
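The location rule (compute L = p·n/100; average the L-th and next values when L is whole, otherwise round up) can be sketched as follows. The function name is hypothetical, and the sketch assumes p is an integer strictly between 0 and 100 so that the whole-number check is exact. Statistical libraries often interpolate between ranked values by default, so their percentiles may differ from this rule.

```python
def percentile(values, p):
    """p-th percentile by the location rule L = p * n / 100.

    Assumes p is an integer with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("percentile requires at least one observation")
    # whole is the integer part of L; remainder == 0 means L is whole.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # L is whole: average the L-th value and the next one (1-based).
        return (ranked[whole - 1] + ranked[whole]) / 2
    # L is not whole: round up and take that value.
    # Rounding up gives 1-based position whole + 1, i.e. index whole.
    return ranked[whole]


data = [-67, 5, 23, 32, 54, 99]

p25 = percentile(data, 25)  # L = 1.5, round up to 2
p50 = percentile(data, 50)  # L = 3, average locations 3 and 4
p75 = percentile(data, 75)  # L = 4.5, round up to 5

print(p25, p50, p75, p75 - p25)
```

Run on the six-value example, this gives the same quartiles and median as the hand calculation.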
Our Data
Median=120 Votes (i.e. [50*67]/100)
25th Percentile=46 Votes
75th Percentile=289 Votes
IQR: 243 Votes
5 number summary:
Min=9, 25th P=46, Median=120, 75th P=289, Max=3407
(massive dispersion!)
Mean was 260.67. Median=120.
The Mean is much closer to the 75th percentile.
That’s SKEW in action.
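The comparison of mean and median can be written as a rough check on the direction of skew. This is only a rule of thumb and can mislead for some distributions. The function name and labels are hypothetical.

```python
def skew_direction(mean, median):
    """Rule-of-thumb skew direction from the mean and the median.

    A mean pulled above the median suggests a long right tail;
    a mean pulled below it suggests a long left tail.
    """
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none indicated"


# Summary figures reported for the votes data in these notes.
direction = skew_direction(mean=260.67, median=120)
print(direction)
```

With a mean of 260.67 against a median of 120, the check points to right skew, which agrees with the histogram.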
Revisiting our Data: Odd Ball Cases
[Figure: Histogram of Variable Y. Horizontal axis: Y (0 to 4000); vertical axis: Frequency (0 to 40).]
“Influential Observations”
Two data points: Y=(1013, 3407)
Suppose we omit them (not recommended in applied research)
Mean plummets to 200.69 (drop of 60 votes)
s.d. is cut by more than half: 203.92
Med=114 (note, it hardly changed)
Let’s look at a scatterplot
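The drop in the mean can be checked from the reported summary figures alone: recover the total from the mean and N, subtract the omitted points, and divide by the new N. The sketch below does that, then uses a small hypothetical sample (not the votes data) to show how little the median moves when two extreme values are dropped. The function name is hypothetical.

```python
import statistics


def mean_after_removing(mean, n, removed):
    """Mean of the remaining cases, given the old mean, old N, and removed values."""
    total = mean * n
    return (total - sum(removed)) / (n - len(removed))


# Figures reported in these notes: mean 260.67 over 67 cases.
trimmed_mean = mean_after_removing(260.67, 67, [1013, 3407])
print(round(trimmed_mean, 2))

# Hypothetical sample with two extreme values, for illustration only.
toy = [10, 20, 30, 40, 50, 60, 70, 1000, 3000]
toy_trimmed = toy[:-2]  # drop the two largest values

print(statistics.mean(toy), statistics.mean(toy_trimmed))      # mean moves a lot
print(statistics.median(toy), statistics.median(toy_trimmed))  # median moves little
```

The first calculation returns about 200.69, matching the reported figure up to rounding of the original mean.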
Useful to Visualize Data
[Figure: Scatterplot of Y (0 to 4000) against X (0 to 300000).]