data science notes of basic statistics, Study notes of Data Analysis & Statistical Methods

Typology: Study notes

2019/2020

Available from 09/26/2021

project-contact-ajay00294.at.gmail
04-03-2017
Disclaimer: This material is protected under copyright act AnalytixLabs ©, 2011-2016. Unauthorized use and/or duplication of this material or any part of this material including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal actions.
Basic Statistics
Business Statistics - Snapshot

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions.

Types of Data

Quantitative data is numerical data which can be measured.

  • Discrete: a random variable which takes only isolated values in its range of variation. For example, the number of heads in 10 tosses of a coin
  • Continuous: a random variable which takes any value in its range of variation. For example, the height of a person

Categorical data is non-numeric; it can be observed but not measured. E.g. favorite color, place of birth.

  • Nominal

    ▪ Values do not have ordering

    ▪ Example: categorical variables like color, nationality and so on

  • Ordinal

    ▪ Values are ordered

    ▪ Example: satisfaction scores

Types of Statistics

Descriptive statistics describes the data set that’s being analyzed, but doesn’t allow us to draw any conclusions or make any inferences about the data.

  • Measures of Central Tendency
  • Measures of Dispersion
  • Tables and Graphs

Inferential statistics is a set of methods used to draw conclusions or inferences about the characteristics of populations based on data from a sample.

  • Estimation
  • Hypothesis Testing

Averages alone can mislead:

  • There is the man who drowned crossing a stream with an average depth of six inches.
  • Say you were standing with one foot in the oven and one foot in an ice bucket. According to the averages, you should be perfectly comfortable.
  • Time taken by different modes of transport:

              Auto    Office Transport    Own Car
    Mean      9       9                   9
    Median    9       9                   9
    Mode      9       9                   9

But are these sufficient?

Dispersion refers to the spread or variability in the data. It describes how spread out the scores are around the mean.

Why is Dispersion important?

  • It gives additional information that helps judge the reliability of the measure of central tendency
  • If data are widely spread, the central location is less representative of the data as a whole than it would be for data more closely centered around the mean
  • Since some problems are peculiar to widely dispersed data, measuring dispersion helps identify and tackle them
  • It allows the dispersions of various samples to be compared
  • For example, if a wide spread of values away from the center is undesirable or presents a risk, one may avoid choosing that distribution

[Figure: distributions with different dispersions]

Measures of Variation/Dispersion

  • Range
  • Inter-Quartile Range
  • Mean Deviation
  • Standard Deviation
  • Variance
  • Percentiles/Quartiles
  • Box-plot
    • Reveals the spread of the data
    • Outliers are defined as values below Q1 - 1.5(Q3-Q1) or above Q3 + 1.5(Q3-Q1)
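The box-plot outlier rule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the sample values are invented, and the quartiles use the "inclusive" method of the standard library's `statistics.quantiles`, which is only one of several conventions. Other tools may give slightly different fences.

```python
import statistics


def iqr_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that fall outside the box-plot fences."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample, invented purely for illustration.
sample = [12, 14, 14, 15, 16, 17, 18, 19, 21, 45]

print("fences:", iqr_fences(sample))
print("outliers:", find_outliers(sample))
```

With this convention only the value 45 falls outside the fences.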

Common Measures of Variation/Dispersion

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Where

  • X is the value of an observation in the population
  • μ is the arithmetic mean of the population
  • N is the number of observations in the population

Common measures of dispersion

Standard Deviation is a measure of how spread out numbers are

Variance is defined as the average of the squared differences from the Mean

Example: You have just measured the heights of your dogs (in millimeters). The heights are: 600mm, 470mm, 170mm, 430mm and 300mm.

Mean = 394mm

Using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
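The dog-height example can be worked through in code. The sketch below assumes the five dogs are treated as the whole population (dividing by N, as in the variance definition above). The "within one standard deviation" band is one illustrative reading of what counts as "normal" here.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean_height = statistics.mean(heights_mm)       # arithmetic mean
variance = statistics.pvariance(heights_mm)     # population variance (divide by N)
std_dev = statistics.pstdev(heights_mm)         # square root of the variance

# Heights within one standard deviation of the mean.
low, high = mean_height - std_dev, mean_height + std_dev
typical = [h for h in heights_mm if low <= h <= high]
unusual = [h for h in heights_mm if h < low or h > high]

print(f"mean = {mean_height} mm")
print(f"variance = {variance} mm^2")
print(f"standard deviation = {std_dev:.2f} mm")
print("within 1 SD:", typical)
print("outside 1 SD:", unusual)
```

Under these assumptions the variance is 21,704 mm² and the standard deviation about 147 mm, so the 600 mm and 170 mm dogs fall outside one standard deviation of the mean.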

Descriptive Statistics

Central Tendency: the middle point of a distribution. Measures of Central Tendency include Mean, Median and Mode.

Dispersion: the spread of the data in a distribution, or the extent to which the observations are scattered.

Skewness: when the data is asymmetrical, i.e. the values are not distributed equally on both sides. In this case, values are concentrated either on the low end or on the high end of the scale on the horizontal axis.

If the distribution trails off to the right or positive end of the scale, it is said to be “positively skewed”.

If the distribution trails off to the left or negative side of the scale, it is said to be “negatively skewed”.

Outliers

An outlier is an observation that is numerically distant from the rest of the data.

An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. Outliers can occur by chance in any distribution, but they are often indicative either of measurement error or that the population has a heavy-tailed distribution.

Example: Bill Gates makes $500 million a year. He’s in a room with 9 teachers, 4 of whom make $40k, 3 make $45k, and 2 make $55k a year. What is the mean salary of everyone in the room? What would be the mean salary if Gates wasn’t included?

Mean With Gates: $50,040,500
Mean Without Gates: $45,000
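The salary example can be checked directly. The sketch below uses only the figures stated in the example. The median comparison is an addition for illustration, showing that the median barely moves when the outlier is included.

```python
from statistics import mean, median

gates_salary = 500_000_000
teacher_salaries = [40_000] * 4 + [45_000] * 3 + [55_000] * 2

everyone = teacher_salaries + [gates_salary]

mean_with_gates = mean(everyone)
mean_without_gates = mean(teacher_salaries)

# Added for comparison: the median is much less sensitive to the outlier.
median_with_gates = median(everyone)
median_without_gates = median(teacher_salaries)

print(f"mean with Gates:      ${mean_with_gates:,.0f}")
print(f"mean without Gates:   ${mean_without_gates:,.0f}")
print(f"median with Gates:    ${median_with_gates:,.0f}")
print(f"median without Gates: ${median_without_gates:,.0f}")
```

A single extreme value moves the mean from $45,000 to about $50 million, while the median stays at $45,000 in both cases.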

[Figure: scatterplot]

A scatterplot is useful for "eyeballing" the presence of outliers. We can also use the standard deviation to identify outliers!

Normal Distribution

Normal distribution is a pattern for the distribution of a set of data which follows a bell-shaped curve. It is also called the Gaussian distribution.

The normal distribution is a theoretical ideal distribution. Real-life empirical distributions never match this model perfectly. However, many things in life do approximate the normal distribution, and are said to be “normally distributed.”

  • Normal Distribution has the mean, the median, and the mode all coinciding at its peak
  • The curve is concentrated in the center and decreases on either side ie most observations are close to the mean
  • The bell shaped curve is symmetric and Unimodal
  • It can be determined entirely by the values of mean and std dev
  • Area under the curve = 1
  • The empirical 68-95-99.7 rule states that for a normal distribution:
    • 68.3% of the data will fall within 1 SD of mean
    • 95.4% of the data will fall within 2 SD’s of the mean
    • Almost all (99.7%) of the data will fall within 3 SD’s of the mean

Standard Normal Distribution

Standard Normal distribution is a special case of the Normal distribution which has a mean of 0 and a standard deviation of 1.

Any normal distribution can be converted to a Standard normal distribution through:

$$Z = \frac{X - \mu}{\sigma}$$

Example: If X is a normally distributed continuous random variable with a mean of 40 and a standard deviation of 10, what proportion of observations are

a) Less than 50
b) Less than 20
c) Between 20 and 50

a) P(x<50)?

$$Z = \frac{50 - 40}{10} = 1$$

P(x<50) = P(Z<1) = 0.8413

b) P(x<20)?

$$Z = \frac{20 - 40}{10} = -2$$

P(x<20) = P(Z<-2) = 0.0228

c) P(20<x<50)?

P(-2 < Z < 1) = 0.8413 - 0.0228 = 0.8185 (using the rounded table values)
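The z-score conversion and the three probabilities above can be reproduced without a printed table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8+). The helper name `z_score` is introduced here for illustration.

```python
from statistics import NormalDist

MU, SIGMA = 40, 10                      # mean and SD from the example
x_dist = NormalDist(mu=MU, sigma=SIGMA)
standard_normal = NormalDist()          # mean 0, SD 1


def z_score(x, mu=MU, sigma=SIGMA):
    """Convert a raw value to a standard normal Z value."""
    return (x - mu) / sigma


p_less_50 = standard_normal.cdf(z_score(50))    # a) P(X < 50) = P(Z < 1)
p_less_20 = standard_normal.cdf(z_score(20))    # b) P(X < 20) = P(Z < -2)
p_between = p_less_50 - p_less_20               # c) P(20 < X < 50)

# The 68-95-99.7 rule: area within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

print(f"a) {p_less_50:.4f}  b) {p_less_20:.4f}  c) {p_between:.4f}")
for k, area in within.items():
    print(f"within {k} SD: {area:.1%}")
```

Computed without intermediate rounding, part c) comes out at about 0.8186. Subtracting the two rounded table values gives 0.8185; the difference is purely rounding.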

Answer

a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling birth records at random?

$$Z = \frac{141 - 109}{13} = 2.46$$

From the chart or SAS, a Z of 2.46 corresponds to a right tail (greater than) area of:

P(Z≥2.46) = 1-(.9931)= .0069 or .69 %

b. What is the chance of obtaining a birth weight of 120 oz or lighter?

$$Z = \frac{120 - 109}{13} = 0.85$$

From the chart, a Z of .85 corresponds to a left tail area of:

P(Z≤.85) = .8023= 80.23%

So far we have determined the results associated with individual observations or sample means when the true population parameters are known. In reality, the true population parameters are seldom known. We now learn how to infer levels of confidence, or a measure of accuracy, for parameters estimated using samples.

POINT ESTIMATOR

  • If we take a sample from a population, we can estimate parameters of the population using sample statistics
  • Example: the sample mean (x̄) is our best estimate of the population mean (μ)
  • However, we don’t know how close the estimate is to the true parameter
  • Example: The mean annual rainfall of Melbourne is 620mm per year

INTERVAL ESTIMATOR

  • If we estimate a range or interval within which the true population parameter lies, then we are using an interval estimation method
  • This is the most common method of estimation. We can also attach a level of confidence to the estimate
  • Example: In 80% of all years Melbourne receives between 440 and 800 mm rain

Populations and Samples

Sampling is required because it is seldom possible to measure all the individuals in a population. Researchers hence use samples and generalize their results to the population of interest. Eg: Election polls, market research surveys, etc.

For a sample to be a “good sample”, it is imperative that the sample size is adequate and that there is no bias in the sample.

Sampling methodologies

Simple Random Sample is one in which every member of the population is equally likely to be measured. Eg: Allocate a number to each member of the population and use a random number generator to determine which individuals will be measured.

Stratified Sampling separates the population into mutually exclusive groups and randomly samples within the groups. Eg: Randomly select a number of people within each demographic cell, while maintaining overall proportions like gender ratio, income ratio, etc.

Other methodology

Cluster sampling is used when there is considerable variation within each group but the groups are essentially similar to each other. Here we divide the population into groups, or clusters, and then select a random sample of these clusters.

Central Limit Theorem

It is not always possible to get the true information about the population. In this case we have to live with samples. For eg: we don’t know the actual average income for India, but can estimate it based on a random sample picked from the Indian population.

In this case, the average we have is not the population average μ but an estimate x̄.

If we take a similar second sample, it is extremely unlikely that the average calculated for the second sample will be the same as the average calculated for the first sample. In fact, repeated samples from the same population give different sample means. Statisticians have also proven that, for a parent population with finite variance, the distribution of these sample means becomes approximately normal as the sample size grows, regardless of the shape of the parent population. This is known as the Central Limit Theorem.

Given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as N, the sample size, increases.

The amazing and counter-intuitive thing about the central limit theorem is that the distribution of an average tends to be Normal, even when the distribution from which the average is computed is decidedly non-Normal.

As the sample size N increases, the variance of the sampling distribution decreases. This is logical, because the larger the sample size, the closer we are to measuring the true population parameters.
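A small simulation can make the Central Limit Theorem statement concrete. The sketch below is illustrative only: the parent population is assumed to be exponential with mean 1 and variance 1 (a clearly non-normal shape), and the sample size, number of repetitions and seed are arbitrary choices. It checks that the sample means centre on μ with variance close to σ²/N.

```python
import random
import statistics


def simulate_sample_means(sample_size, n_samples, rng):
    """Draw n_samples samples of the given size and return their means."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(0)     # fixed seed so the run is repeatable
SAMPLE_SIZE = 50           # N in the text
N_SAMPLES = 5_000          # number of repeated samples

means = simulate_sample_means(SAMPLE_SIZE, N_SAMPLES, rng)

mean_of_means = statistics.fmean(means)
variance_of_means = statistics.pvariance(means)
expected_variance = 1.0 / SAMPLE_SIZE   # sigma^2 / N with sigma^2 = 1

print(f"mean of sample means:     {mean_of_means:.4f} (population mean 1.0)")
print(f"variance of sample means: {variance_of_means:.4f} (sigma^2/N = {expected_variance:.4f})")
```

Increasing `SAMPLE_SIZE` should shrink the variance of the sample means in proportion, as described above.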

Confidence Intervals

We can extend this principle further. With SE denoting the standard error of the sample mean (σ/√n):

  • We can be 90% confident that the true population mean lies within x̄ ± 1.645(SE)
  • We can be 95% confident that the true population mean lies within x̄ ± 1.960(SE)
  • We can be 99% confident that the true population mean lies within x̄ ± 2.576(SE)

Student’s t - distribution

  • The z-distribution applies when the population standard deviation σ is known; the t-distribution is used when σ is unknown and is estimated from the sample.
  • The shape of the ‘t’ sampling distribution is similar to that of the ‘z’ sampling distribution in that it is a) symmetrical and b) centered over a mean of zero; unlike ‘z’, c) its variance depends on the sample size, more specifically on the degrees of freedom (abbreviated as df)
  • As the number of degrees of freedom increases, the variance of the t distribution approaches that of z
  • For n ≥ 30, the two shapes are almost identical
  • n = 30 is therefore taken as the dividing point between small & large samples
  • The t-test statistic for a population mean, used when n < 30, is:

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$
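The multipliers 1.645, 1.960 and 2.576 in the confidence-interval bullets above are standard normal quantiles. The sketch below recovers them with `statistics.NormalDist` and builds an interval from them. The function names and the sample mean and standard error in the usage lines are hypothetical, chosen only to show the arithmetic.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Two-sided standard normal multiplier for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def confidence_interval(sample_mean, standard_error, confidence):
    """Return (lower, upper) = sample_mean -/+ z * SE."""
    margin = z_multiplier(confidence) * standard_error
    return sample_mean - margin, sample_mean + margin


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_multiplier(level):.3f}")

# Hypothetical inputs, invented for illustration: mean 100, SE 2.
lower, upper = confidence_interval(100.0, 2.0, 0.95)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

This covers the z-based interval only. For small samples with σ unknown, the multiplier would come from a t table with n-1 degrees of freedom, which the standard library does not provide.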

When to use

z-test:

  • σ is known and the population is normal
  • σ is known and the sample size is at least 30. (The population need not be normal)

t-test:

  • Whenever σ is not known
  • The population is assumed to be normal
  • And n < 30
  • The correct distribution to use is the ‘t’ distribution with n-1 df
Hypothesis testing

  • In a test procedure, to start with, a hypothesis is made.
  • The validity of the hypothesis is tested against sample data.
  • If the data are consistent with the hypothesis, it is not rejected (loosely, it is "accepted").
  • If the data are inconsistent with it, it is rejected.
  • The hypothesis which is being tested for possible rejection is called the null hypothesis
  • The null hypothesis is denoted by H0
  • The hypothesis which is accepted when the null hypothesis is rejected is called the Alternate Hypothesis, Ha
  • Ex. H0: The drug doesn’t work - any effect you saw was due to chance.
    Ha: The drug works - it has a real effect.

Type I and Type II Error

The process of testing a hypothesis carries a possibility of making an error. There are two types of errors:

Type I error: The error of rejecting the null hypothesis H0 even though H0 was true.

Type II error: The error of failing to reject (accepting) the null hypothesis H0 even though H0 was false.

P - value

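  • The area outside the confidence interval is cumulatively known as α (alpha), the significance level
  • Confidence level = 1 - α
  • Example: for a 95% confidence interval, α = 0.05
  • The p-value is related to α but is not the same thing: it is the probability, assuming H0 is true, of obtaining a test statistic at least as extreme as the one actually observed
  • Hence, H0 is rejected when the p-value is smaller than α; α itself is the probability that a randomly picked sample will have its mean lying outside the confidence interval when H0 is true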

Example 1

Suppose that we have been told that the price of petrol in Melbourne is normally distributed with a mean of 92 cents per litre, and a standard deviation of 3.1 cents/litre. To test whether this price is in fact true, we sample 50 service stations and obtain a mean of 93.6 cents/litre.

Solution:

  • Step 1: State the null and alternative hypotheses
    • H0: μ = 92
    • Ha: μ ≠ 92
  • Step 2: Determine the appropriate test statistic and its distribution. Because we know the population standard deviation, we can use the z distribution.
  • Step 3: Specify the significance level, say α = 0.05
  • Step 4: Define the decision rule. Using a z distribution (from tables), if α = 0.05, the rejection region is > +1.96 and < -1.96, i.e., if the test statistic is > 1.96 or < -1.96, we will reject H0 and accept Ha.

[Figure: standard normal curve centered at 0, with a rejection region in each tail]
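The text of Example 1 ends at the decision rule. As a hedged sketch of how the remaining steps would go, the code below computes the z statistic from the figures stated in the example and applies the ±1.96 rule. The variable names are illustrative, and the two-sided p-value is an extra not shown in the original.

```python
import math
from statistics import NormalDist

mu_0 = 92.0        # hypothesised mean price (cents/litre)
sigma = 3.1        # known population standard deviation
n = 50             # number of service stations sampled
sample_mean = 93.6
alpha = 0.05

standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Two-sided critical value; approximately 1.96 for alpha = 0.05.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_critical

# Two-sided p-value: probability of a |Z| at least this large if H0 is true.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, critical value = ±{z_critical:.2f}")
print(f"p-value = {p_value:.5f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

With these inputs the statistic is about 3.65, well inside the rejection region, so H0: μ = 92 would be rejected at the 5% level.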

Example 2

A company pays production workers $630 per week. The union claims that these workers are paid below the industry average for their work. A sample of 15 workers from other sites gives a mean wage of $670/week with a standard deviation of $58/week. Is the union's claim justified?

Solution:

Step 1: H0: μ ≤ $630 (industry weekly average is not significantly different to $630)

Ha: μ > $630 (The industry weekly average is greater than $630)

Step 2: Test Statistic - As we don’t know the population variance, and the sample size is < 30, we shall use the t test.

Step 3: Significance level - We will use α = 0.10 (as we want to be liberal rather than conservative)

Step 4: Decision rule - From ‘t’ table, t (0.1, 14df) = 1.345

[Figure: t distribution with non-rejection region (1 - α = 0.90) and right-tail rejection region (α = 0.10)]

Step 5: Calculate test statistic:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{58}{\sqrt{15}} \approx 14.98, \qquad t = \frac{\bar{x} - \mu}{SE_{\bar{x}}} = \frac{670 - 630}{14.98} \approx 2.67$$

Step 6: Make a decision - As 2.67 is > 1.345, we will reject H0

Step 7: Conclusion - “Production workers at the company earn an average of $40 per week less than the industry standard (t = 2.67, df = 14, p < 0.1)”
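The arithmetic of Example 2 can be reproduced as follows. This is a minimal sketch. The critical value 1.345 is copied from the example's t table rather than computed, because the Python standard library has no t-distribution quantile function.

```python
import math

mu_0 = 630.0         # company wage, used as the hypothesised mean ($/week)
sample_mean = 670.0  # mean wage of the 15 sampled workers
s = 58.0             # sample standard deviation
n = 15
df = n - 1

# One-tailed critical value for alpha = 0.10 and 14 df, taken from the example.
t_critical = 1.345

standard_error = s / math.sqrt(n)
t = (sample_mean - mu_0) / standard_error
reject_h0 = t > t_critical          # right-tailed test: Ha is mu > 630

print(f"SE = {standard_error:.2f}")
print(f"t = {t:.2f} (df = {df}), critical value = {t_critical}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The statistic comes out at about 2.67, matching the value in Steps 6 and 7.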

Comparison of two populations

Hypothesis testing for two samples:

  • Difference between independent samples & dependent samples
  • Two sample z test for means using independent samples
  • Two sample t test for means using independent samples
  • Two sample t test for means using dependent samples

Chi-square test

Two properties are associated if the probability of having one property affects the probability of having another. Sometimes it is not known whether two properties are associated or not. What is required is a test of association, or, what is equivalent, a test of independence.

The Chi-square (χ²) distribution can be used as a test of independence.

Example:

A psychologist conducted a survey into the relationship between the way in which a calculator was held and the speed with which 10 arithmetical operations were performed. The calculator could be either placed on a table or held in the hand; the sums could be performed in either less than 2 minutes, between 2 and 3 minutes or more than 3 minutes.

The following results were obtained for a sample of 150 children between 12 and 13 years old.

Mode of Computation On Table Hand Held Speed Of Computation