Data Analysis - Introduction to Statistics - Lecture notes, Study notes of Statistics

Data Analysis, Data Description, Parameter, Statistic, Measures of central tendency, Arithmetic average, Mean for grouped Data, Median, Mode, Modal class, Revisiting distribution shapes are learning points available in this lecture notes.

Typology: Study notes

2011/2012

Uploaded on 11/14/2012

dharm
dharm 🇮🇳

Section III
Data Analysis
Data Description
When measuring data it is important to note the difference between studies on samples
and studies on populations.
A parameter is a measure or characteristic obtained by studying all data values from a
population while a statistic is derived from a sample.
For many attributes we will have separate symbols for a statistic and a parameter even
though the method for computing them is the same.
The number of data values in a sample will be n as before, but for a population it will be
denoted N.
For writing equations, a generic sample or population will be denoted with X values for
each datum:
Example Data:
Sample: $\{X_1, X_2, X_3, \ldots, X_n\}$
Population: $\{X_1, X_2, X_3, \ldots, X_N\}$
The ambiguous term “average” is actually a category known in statistics as measures of
central tendency that includes the mean, median, mode and midrange. Another often-
used average is the weighted mean.
How data varies compared to these averages is a very useful characteristic to study.
Measures of Central Tendency
“A person has on average 1460 dreams in 1 year”
The mean:
Using a population or sample (the classical arithmetic average).
Sample (of size n)
The mean: $\bar{X} = \dfrac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$
Population (of size N)
The mean: $\mu = \dfrac{X_1 + X_2 + X_3 + \cdots + X_N}{N}$

In shortened form: $\bar{X} = \dfrac{\sum X}{n}$ and $\mu = \dfrac{\sum X}{N}$

Keep track of and memorize symbols like $\bar{X}$ and μ, as other equations will sometimes include them without review.

Find the mean for the following population and label appropriately: 22 19 8 2 4 13 16 7

Math tips:

  • If you are doing a calculation with intermediate steps do not round off until the very end. Frequent rounding can affect the computation significantly.
  • The mean should be rounded to one more decimal place than the raw data.

The preceding formulas allow us to calculate the mean given a set of data. If we are given data that has already been organized into a frequency distribution we can also find the mean.

Finding the Mean for grouped Data:
Because data in a class can fall anywhere in the class range this is not the exact mean but a good approximation. We will use the class midpoint value Xm to represent every datum in the class. Looking at our previous example, how would we calculate the mean record high for all 50 states? (Where would your guess be?)

Class       Tally                Frequency   Cumulative
100 - 104   //                   2           2
105 - 109   ////////             8           10
110 - 114   //////////////////   18          28
115 - 119   /////////////        13          41
120 - 124   ///////              7           48
125 - 129   /                    1           49
130 - 134   /                    1           50
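As a rough sketch of both calculations, the snippet below computes the mean of the raw population from the exercise and the approximate mean of the grouped temperature data. It assumes the usual grouped-data rule, sum of f · Xm divided by n, with Xm taken as the midpoint of each class; the function names `mean` and `grouped_mean` are illustrative and not part of the notes.

```python
# Raw data from the exercise above, treated as a population.
population = [22, 19, 8, 2, 4, 13, 16, 7]

# Record-high frequency distribution: (lower limit, upper limit, frequency).
classes = [
    (100, 104, 2),
    (105, 109, 8),
    (110, 114, 18),
    (115, 119, 13),
    (120, 124, 7),
    (125, 129, 1),
    (130, 134, 1),
]


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def grouped_mean(classes):
    """Approximate mean: each class midpoint Xm weighted by its frequency f."""
    n = sum(f for _, _, f in classes)
    total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total / n


mu = mean(population)
approx_mean = grouped_mean(classes)
print(round(mu, 1), round(approx_mean, 1))
```

With these inputs the raw mean is 91/8 = 11.375, reported as 11.4 under the rounding tip, and the grouped approximation is 5710/50 = 114.2, which falls inside the 110 - 114 class.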

The Median:
The median is the halfway point of the data set.

Finding the median MD:
  1. Arrange the data in order (data array)
  2. Select the middle point

Ex Data: 292, 300, 311, 401, 595, 618, 713
  1. 292, 300, 311, 401, 595, 618, 713
  2. MD = 401

Ex Data: 1, 7, 4, 2, 3, 4
  1. 1, 2, 3, 4, 4, 7
  2. MD = (3 + 4) / 2 = 3.5

Notes on the median:
  • A marker for which values fall into the upper and lower half of a distribution
  • Not as affected by outliers. Ex Data: 2, 3, 4, 5, 7, 6, 5, 4, 36 (the 36 pulls the mean up to 8, while the median stays at 5)
  • Can be used for open ended distributions
  • The median will either be a specific data value or will fall between two data values.
  • There is no median type measurement for frequency distributions.

The Mode:
The value that occurs most often in a data set is called the mode. A data set can be bimodal or multimodal, depending on the number of modes. On the other hand, if no datum appears more than once then the data set has no mode.

Ex Data: 100, 101, 105, 110, 100, 105, 103, 105
Since 105 occurs most often, it is the mode.
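The median and mode procedures can be sketched in a few lines. This is a minimal illustration under the definitions given in these notes, including the convention that a data set with no repeated value has no mode; the helper names `median` and `modes` are hypothetical.

```python
from collections import Counter


def median(values):
    """Sort the data, then take the middle value or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """Return every value tied for the highest count; empty if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no datum appears more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


print(median([292, 300, 311, 401, 595, 618, 713]))
print(median([1, 7, 4, 2, 3, 4]))
print(modes([100, 101, 105, 110, 100, 105, 103, 105]))
```

Returning a list from `modes` is one way to handle bimodal and multimodal data sets without a special case.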

For frequency distributions we speak of the modal class. The modal class is the class with the highest frequency.

Class       Tally                Frequency
100 - 104   //                   2
105 - 109   ////////             8
110 - 114   //////////////////   18
115 - 119   /////////////        13
120 - 124   ///////              7
125 - 129   /                    1
130 - 134   /                    1

Clearly in our record high temperature example, the modal class is 110˚ – 114˚.

Notes on the mode:

  • Gives us most typical case.
  • Easy to compute
  • Can be used with nominal data
      o Most frequent eye color, political party, etc.
  • The mode is not always unique
  • There is no symbol for the mode

Re-visiting distribution shapes

Measures of Variation
Averages are useful concepts, but they become even more useful when you combine them with the concept of variation. One measure of variation is the distance between the highest and lowest values, or the range. Perhaps the most important measure has to do with the average distance of a datum from the mean.

  • Since the mean is usually toward the middle of the distribution, some data lie in the negative direction from it and some in the positive direction, so the signed differences cancel when added.
  • As with most distances we only care how far, so to get around this problem we square the differences and later apply a square root. This gives us the useful concepts of variance and standard deviation.

Paint Example
Test: Brand A vs. Brand B
Variable: Months before fading
Two small populations of 6 cans of each brand are tested with the following results:

Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25

We can calculate the means:
Brand A: $\mu = \dfrac{\sum X}{N} = 210/6 = 35$ months
Brand B: $\mu = \dfrac{\sum X}{N} = 210/6 = 35$ months

Difference (Brand A):
60 - 35 = 25
50 - 35 = 15
40 - 35 = 5
30 - 35 = -5
20 - 35 = -15
10 - 35 = -25

For variance and standard deviation we only want to know how far off on average, not in which direction. Square the differences, add them up and divide by N; this gives the average of the squared distances from the mean, called the variance.

Variance = (625 + 225 + 25 + 25 + 225 + 625) / 6 = 1750 / 6 ≈ 291.7

To get the standard deviation we simply return to our original scale by taking the square root.

Standard deviation = √291.7 ≈ 17.1

Standard deviation for a population: σ
Standard deviation for a sample: s

III. Variance and standard deviation for populations
The algorithm we just used gives us the formula for variance when using a population and in turn the standard deviation for populations:

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$ and $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$

Method:
These are the same steps we took from the Brand A population’s raw data set:

  1. Calculate $\mu = \dfrac{\sum X}{N}$
  2. Subtract the mean from each value
  3. Square each difference
       • Sometimes it helps to set up the following table: X | X - μ | (X - μ)^2
  4. Sum the squared differences
  5. Divide by N to get the variance σ^2
  6. Take the square root to get the standard deviation σ

Now do the same calculations for Brand B:

$\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$

Go For It:

Conclusion:

Brand A Brand B
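One way to check the hand calculations is to code the method steps directly. The sketch below follows the population formula from these notes (divide by N, not N - 1) and applies it to both paint brands; the function name `population_variance` is an illustrative choice, not something defined in the lecture.

```python
import math

# Months before fading for the two populations of 6 cans each.
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]


def population_variance(values):
    """Average of the squared differences from the population mean."""
    n = len(values)
    mu = sum(values) / n                           # step 1: the mean
    squared = [(x - mu) ** 2 for x in values]      # steps 2-3: subtract, square
    return sum(squared) / n                        # steps 4-5: sum, divide by N


for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    variance = population_variance(data)
    sigma = math.sqrt(variance)                    # step 6: square root
    print(name, round(variance, 1), round(sigma, 1))
```

Both brands share a mean of 35 months, so any difference in the printed values reflects spread alone: Brand A gives a variance near 291.7 and σ near 17.1, while Brand B gives a variance near 41.7 and σ near 6.5.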

V. Variance and standard deviation - grouped data (frequency distributions):
We have so far only computed variance for samples and populations from raw data.

Finding Sample Variance and Standard Deviation for Grouped Data:
We will again use the midpoint of each class as an average value to get an approximate answer. The adjusted formula for variance is:

$s^2 = \dfrac{n \sum f \cdot X_m^2 - \left( \sum f \cdot X_m \right)^2}{n(n - 1)}$

Example: Compute s^2 and s for our earlier data for record temperatures.

  1. Make a table:

     Class       f
     100 - 104   2
     105 - 109   8
     110 - 114   18
     115 - 119   13
     120 - 124   7
     125 - 129   1
     130 - 134   1

  2. Find n = ∑ f, ∑ f · Xm, and ∑ f · Xm^2
  3. Use the above formula for s^2
  4. Find the standard deviation
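The four steps above can be sketched as follows, using the grouped-data formula for s^2 given in this section. The midpoint Xm is assumed to be the average of each class's lower and upper limits, and the name `grouped_sample_variance` is hypothetical.

```python
import math

# Step 1: the table, as (lower limit, upper limit, frequency f).
classes = [
    (100, 104, 2),
    (105, 109, 8),
    (110, 114, 18),
    (115, 119, 13),
    (120, 124, 7),
    (125, 129, 1),
    (130, 134, 1),
]


def grouped_sample_variance(classes):
    """s^2 = [n * sum(f * Xm^2) - (sum(f * Xm))^2] / [n * (n - 1)]."""
    # Step 2: the three sums, with Xm as the class midpoint.
    n = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)
    # Step 3: apply the formula.
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))


s2 = grouped_sample_variance(classes)
s = math.sqrt(s2)  # Step 4: the standard deviation
print(round(s2, 1), round(s, 1))
```

For this table the sums are n = 50, ∑ f · Xm = 5710 and ∑ f · Xm^2 = 653890, giving s^2 = 90400/2450, about 36.9, and s of about 6.1.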