
Understanding Data Analysis: Univariate Statistics and Descriptive Statistics - Prof. Brad, Study notes of Political Science

An introduction to univariate statistics and descriptive statistics, with a focus on summarizing data using histograms, skewness, mean, median, and interquartile range. The document also includes examples and exercises using R code.

Typology: Study notes

2012/2013

Uploaded on 04/28/2013


The Scientific Study of Politics (POL 51)

Professor B. Jones

University of California, Davis

Fun With Numbers

Some Univariate Statistics

Learning to Describe Data

Research is empirically based; therefore, we must work with data.

You just did with your plots.

No statistics in the plots, but they do summarize information usefully.

Visualizing Data

Often the first place to start is with visualization.

Works best with continuous data (but we’ll learn tricks for understanding data measured at other levels of measurement).

Start with an example.

Useful to Visualize Data

[Figure: Histogram of Variable Y. x-axis: Variable Y (0 to 4000); y-axis: Frequency (0 to 40).]

Main Features

Exhibits “Right Skew”

Contrast this with “Left Skew”

Some “Outlying” Data Points?

Question: Are the outlying data points also “influential” data points (on measures of central tendency)?

Let’s check…

The Mean

Formally, the mean is given by:

$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_N}{N}$$

Or more compactly:

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$

Our Data

Mean of Y is 260.67.

Mechanically…

(263 + 73 + … + 88)/67 = 260.67

Problems with the mean?

No indication of dispersion or variability.

That is, it is a single indicator of central tendency…but is it a good indicator?

What about “variability” around the mean?

Variance

The variance is a statistic that describes (squared) deviations around the mean:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$

Why “N-1”?

Interpretation: “Average squared deviations from the mean.”

Our Data

Variance = 202,431.

Mechanically:

$$\frac{(263-260.67)^2 + (73-260.67)^2 + \cdots + (88-260.67)^2}{66}$$

Interpretation: “The average squared deviation around Y is 202,431.”

Who thinks in terms of squared deviations?

Answer: no one.

That’s why we have a standard deviation.

Standard Deviation

Take the square root of the variance and you get the standard deviation.

Why we like this: Metric is now in original units of Y.

Interpretation: S.D. gives “average deviation” around the mean.

It’s a measure of dispersion that is in a metric that makes sense to us.

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$

Our Data

The standard deviation is 449.92.

Mechanically:

$$\left\{\frac{(263-260.67)^2 + (73-260.67)^2 + \cdots + (88-260.67)^2}{66}\right\}^{1/2}$$

Interpretation: “The average deviation around the mean of 260.67 is 449.92.”

Now, suppose Y = Votes…

The average number of votes is “about 261 and the average deviation around this number is about 450 votes.”

The dispersion is very large.

(Imagine the opposite case: mean test score is 85 percent; average deviation is 5 percent.)
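The mean, variance, and standard deviation can be traced step by step in a short Python sketch. The full set of 67 vote counts is not reproduced in these notes, so the sample below is a small illustrative one rather than the lecture data, and the variable names are arbitrary.

```python
import math
import statistics

# Illustrative sample only (NOT the lecture's 67-observation vote data).
y = [32, 5, 23, 99, 54]
n = len(y)

# Mean: sum of the values divided by N.
mean_y = sum(y) / n

# Sample variance: sum of squared deviations from the mean, divided by N - 1.
squared_devs = [(yi - mean_y) ** 2 for yi in y]
variance_y = sum(squared_devs) / (n - 1)

# Standard deviation: square root of the variance, back in the units of Y.
sd_y = math.sqrt(variance_y)

print(f"mean = {mean_y:.2f}")
print(f"variance = {variance_y:.2f}")
print(f"sd = {sd_y:.2f}")

# Cross-check against the standard library, which also divides by N - 1.
assert math.isclose(variance_y, statistics.variance(y))
assert math.isclose(sd_y, statistics.stdev(y))
```

Dividing by N instead of N - 1 (as `statistics.pvariance` does) gives a slightly smaller number. The N - 1 version is the usual choice when the data are a sample and the mean is itself estimated from that sample.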

Revisiting our Data

[Figure: Histogram of Variable Y. x-axis: Variable Y (0 to 4000); y-axis: Frequency (0 to 40).]

Skewness and The Mean

Data often exhibit skew.

This is often true with political variables.

We have a measure of central tendency and deviation about this measure (mean, s.d.).

However, are there other indicators of central tendency?

How about the median?

Median

“50th Percentile”: Location at which 50 percent of the cases lie above; 50 percent lie below.

Since it’s a locational measure, you need to “locate it.”

Example Data: 32, 5, 23, 99, 54

As is, not informative.

Median

Rank it: 5, 23, 32, 54, 99

Median Location = (N+1)/2 (when N is odd) = 6/2 = 3

Location of the median is data point 3.

This is 32.

Hence, M=32, not 3!!

Interpretation: “50 percent of the data lie above 32; 50 percent of the data lie below 32.”

What would the mean be?

(42.6…data are __________ skewed)

Median

When N is even: -67, 5, 23, 32, 54, 99

M is usually taken to be the average of the two middle scores:

(N+1)/2 = 7/2 = 3.5

The median location is 3.5, which is between 23 and 32.

M = (23+32)/2 = 27.5

All pretty straightforward stuff.
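The location rule for odd and even N can be written as a small Python function. This is a minimal sketch; the function name `median_by_location` is made up for illustration.

```python
def median_by_location(values):
    """Median via the (N + 1) / 2 location rule applied to the ranked data."""
    ranked = sorted(values)             # "Rank it" first
    n = len(ranked)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    location = (n + 1) / 2              # 1-based position in the ranked data
    if n % 2 == 1:
        # Odd N: the location is a whole number, so take that data point.
        return ranked[int(location) - 1]
    # Even N: the location ends in .5, so average the two neighbouring points.
    lower = ranked[int(location) - 1]   # e.g. location 3.5 -> 3rd value
    upper = ranked[int(location)]       # ... and 4th value
    return (lower + upper) / 2


print(median_by_location([32, 5, 23, 99, 54]))
print(median_by_location([-67, 5, 23, 32, 54, 99]))
```

Note that the function returns the data value at the median location, not the location itself, which is the "M=32, not 3" point.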

Dispersion around the Median

The mean has its standard deviation…

What about the median?

No such thing as “standard deviation,” per se, around the median.

But, there is the IQR.

Interquartile Range

The median is the 50th percentile.

Suppose we compute the 25th and the 75th percentiles and then take the difference.

The 25th percentile is the “median” of the lower half of the data; the 75th percentile is the “median” of the upper half.

IQR and the 5 Number Summary

Data: -67, 5, 23, 32, 54, 99

25th Percentile = 5

75th Percentile = 54

IQR is difference between 75th and 25th percentiles: 54-5 = 49

Hence, M=27.5; IQR=49

“Five Number Summary”: Min, 25th, 50th, 75th Percentiles, Max:

-67, 5, 27.5, 54, 99
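The "median of each half" description of the quartiles can be sketched in Python as follows. The function name `five_number_summary` is hypothetical, and for odd N this sketch assumes the middle observation is left out of both halves, which is one common convention among several.

```python
import statistics


def five_number_summary(values):
    """Min, 25th, 50th, 75th percentiles, and max via medians of halves."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower_half = ranked[:half]      # observations below the median location
    upper_half = ranked[-half:]     # observations above the median location
    return {
        "min": ranked[0],
        "q1": statistics.median(lower_half),
        "median": statistics.median(ranked),
        "q3": statistics.median(upper_half),
        "max": ranked[-1],
    }


summary = five_number_summary([-67, 5, 23, 32, 54, 99])
iqr = summary["q3"] - summary["q1"]
print(summary)
print("IQR =", iqr)
```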

Finding Percentiles

General Formula:

$$L = \frac{p \cdot n}{100}$$

p is desired percentile

n is sample size

If L is a whole number: The value of the pth percentile is between the Lth value and the next value. Find the mean of those values.

If L is not a whole number: Round L up. The value of the pth percentile is the Lth value.

Example

-67, 5, 23, 32, 54, 99

25th Percentile: L = (25*6)/100 = 1.5

Round to 2. The 25th Percentile is 5.

75th Percentile: L = (75*6)/100 = 4.5

Round to 5. The 75th Percentile is 54.

50th Percentile: L = (50*6)/100 = 3

Take average of locations 3 and 4.

This is (23+32)/2 = 27.5.
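The percentile rule worked through above can be expressed as a short Python function. This is a sketch of that specific rule only: the name `percentile` is illustrative, and it assumes p is an integer strictly between 0 and 100.

```python
import math


def percentile(values, p):
    """p-th percentile using the location rule L = (p * n) / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("percentile of empty data is undefined")
    if (p * n) % 100 == 0:
        # L is a whole number: average the L-th value and the next one.
        loc = (p * n) // 100
        return (ranked[loc - 1] + ranked[loc]) / 2
    # L is not a whole number: round up and take the L-th value.
    loc = math.ceil(p * n / 100)
    return ranked[loc - 1]


data = [-67, 5, 23, 32, 54, 99]
for p in (25, 50, 75):
    print(p, percentile(data, p))
```

Statistical packages often use other interpolation conventions for percentiles, so their answers may differ slightly from this rule, especially in small samples.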

Our Data

Median = 120 Votes (i.e. the value at location L = [50*67]/100)

25th Percentile = 46 Votes

75th Percentile = 289 Votes

IQR: 243 Votes

5 number summary:

Min=9, 25th P=46, Median=120, 75th P=289, Max=3407

(massive dispersion!)

Mean was 260.67. Median=120.

The Mean is much closer to the 75th percentile.

That’s SKEW in action.

Revisiting our Data: Odd Ball Cases

[Figure: Histogram of Variable Y. x-axis: Variable Y (0 to 4000); y-axis: Frequency (0 to 40).]

“Influential Observations”

Two data points: Y=(1013, 3407)

Suppose we omit them (not recommended in applied research).

Mean plummets to 200.69 (drop of 60 votes).

s.d. is cut by more than half: 203.92

Med=114 (note, it hardly changed)

Let’s look at a scatterplot.
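The same pattern, where the mean and s.d. react strongly to an outlying point while the median barely moves, can be reproduced with a few lines of Python. The numbers below are invented for illustration and are not the lecture's vote data.

```python
import statistics

# Hypothetical vote counts, invented for illustration; one extreme value.
votes = [40, 55, 90, 120, 150, 210, 3400]

# Same data with the outlying point dropped (for comparison only).
trimmed = [v for v in votes if v < 1000]


def describe(values):
    """Return (mean, sample s.d., median) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.stdev(values),   # divides by N - 1
        statistics.median(values),
    )


mean_all, sd_all, med_all = describe(votes)
mean_trim, sd_trim, med_trim = describe(trimmed)

print(f"with outlier:    mean={mean_all:.1f} sd={sd_all:.1f} median={med_all}")
print(f"without outlier: mean={mean_trim:.1f} sd={sd_trim:.1f} median={med_trim}")
```

Because the median depends only on the ranked position of the middle case, removing an extreme value shifts it by at most one position in the ranking, whereas the mean and s.d. use the magnitude of every observation.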

Useful to Visualize Data

[Figure: Scatterplot of Y against X. x-axis: X (0 to 300000); y-axis: Y (0 to 4000).]