Chapter 1-3 review

Data is the information we gather with experiments and with surveys.

Ex: Say we want to know how well a group of students did in a statistics course. The data could be each student's final exam grade.

3 Statistical Methods

Design: planning how to obtain data. Ex: conduct an experiment/survey

Description: summarizing the raw data and presenting it in a useful format. Ex: charts, graphs, average (mean), median, etc.

Inference: making decisions or predictions based on data

Subjects are the entities that we measure in a study.

Ex: people, schools, rats, etc.

Population vs. Sample

Population: all subjects of interest

Sample: subset of the population for whom we have data

Ex: Let’s say we want to know how many Texas Tech students like coffee. To figure this out, we surveyed 50 random students in the S.U.B.

Population: every Texas Tech student

Sample: the 50 random students we surveyed

Parameter vs. Statistic

Parameter: numerical summary of the population, usually unknown

Statistic: numerical summary of a sample taken from the population

Ex: The average number of cigarettes smoked by all teenagers last year -------- parameter

The average number of cigarettes smoked by a sample of teenagers last year ------ statistic

The proportion of all teenagers who smoked in the last month ------ parameter

The proportion of teenagers who smoked last month out of a sample of 50 teenagers ------ statistic

After looking at every car in the North Commuter parking lot, we conclude that 67% of the people who park in North Commuter drive trucks -------- 67% is a parameter

A survey of 50 car lots in America found that 35% of the cars in those lots are BMWs --------- 35% is a statistic

Randomness: each subject in the population has the same chance of being included in the sample. (Random sampling enables the sample to be a good reflection of the population.)

Ex: Let’s say I want to know if everyone in the class understands a question, and I can only ask some of the students.

The top 10 scorers on the first exam ------ not random

Everyone sitting in the last row ------- not random

Picking 10 names at random off the attendance sheet ------ random
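The random choice in the last example can be sketched in a few lines of Python. This is only an illustration: the roster below is made up (placeholder names, an assumed class size of 40), and the seed is fixed just so the draw can be repeated.

```python
import random

# Hypothetical class roster; the names and the class size are placeholders.
roster = [f"student_{i:02d}" for i in range(1, 41)]

rng = random.Random(2300)  # fixed seed so the same draw can be reproduced

# Simple random sample without replacement:
# every student on the roster has the same chance of being picked.
sample = rng.sample(roster, k=10)
print(sample)
```

Picking the top scorers or the last row would instead use a rule tied to the students themselves, which is why those choices are not random.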

Variability

Note that measurements may vary from subject to subject and from sample to sample.

Ex: Suppose I want to know how the class did on an exam. If I take the average of everyone whose name starts with an ‘s’, it will likely differ from the average of everyone whose name starts with an ‘m’.

That said, we can get a more accurate idea of the population if we take larger samples.
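A small simulation can make the point about sample size concrete. The sketch below assumes a made-up population of 500 exam scores (the numbers are generated, not real data) and compares how much the sample mean moves around for samples of size 5 versus size 50. The function name `spread_of_sample_means` is just an illustrative helper.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for repeatability

# Hypothetical population: exam scores for 500 students (generated, not real).
population = [min(100, max(0, round(rng.gauss(75, 12)))) for _ in range(500)]
mu = statistics.mean(population)  # parameter: mean of the whole population

def spread_of_sample_means(n, repeats=1000):
    """Standard deviation of the sample mean over many random samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

small = spread_of_sample_means(5)    # statistics from small samples
large = spread_of_sample_means(50)   # statistics from larger samples

print(f"population mean (parameter): {mu:.2f}")
print(f"spread of sample means, n=5:  {small:.2f}")
print(f"spread of sample means, n=50: {large:.2f}")
```

Each sample mean is a statistic and varies from sample to sample; the larger samples should give means that cluster more tightly around the population mean.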

Computers and Statistics

Data file: large sets of data are typically organized in a spreadsheet format

Database: an existing archived collection of data files

Applet: short application program for performing a specific task. Ex: random number generator


Variable: any characteristic that is recorded for the subjects in a study

Categorical Variable: described by words. Ex: gender, marital status

Quantitative Variable: described by numbers. Ex: number of pets in a household, height

Discrete Variable: there is a finite number of possible values. Ex: number of pets in a household

Continuous Variable: the values form an interval. Ex: height (All forms of measurement are continuous variables. Ex: time, height, volume)

Proportion

Frequency: number of times an observation has occurred

Proportion/Relative frequency: frequency divided by the total number of observations (The proportion will always be between 0 and 1.)

Percentage: proportion multiplied by 100

Ex: 4 students out of 40 received an A.

The frequency of getting an A is 4.

The proportion of students who got an A is 4/40 = 0.1.

The percentage of students who got an A is 0.1 × 100 = 10%.

Frequency Table

Lists the possible values of a variable together with their frequency or relative frequency (proportion).

Ex: The president of student council wanted to know how many hours Tech students party. Here were his results:

Number of Party Hours    0-1     2-3     3-4     4 or more
Count                    4       10      22      44

Variable of interest: number of hours Tech students party

Type of variable: quantitative

Discrete or continuous: continuous

Add proportions to the frequency table (each count divided by the total of 80):

Number of Party Hours    0-1     2-3     3-4     4 or more
Relative Frequency       0.05    0.125   0.275   0.55

Distribution

A distribution tells us the possible values a variable takes as well as the occurrence of those values (frequency or relative frequency).

A graph or frequency table describes a distribution.

Graphs for Categorical Variables

Pie Charts

Bar Graphs

Graphs for Quantitative Data

Dot Plot (small data set, discrete variable)

Stem-and-leaf plots (small data set, discrete variable)
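The relative frequencies in the party-hours table can be checked with a short Python sketch. The counts are copied from the table in the notes; the dictionary layout and variable names are just one convenient way to hold them.

```python
# Counts copied from the party-hours frequency table in the notes.
counts = {"0-1": 4, "2-3": 10, "3-4": 22, "4 or more": 44}

total = sum(counts.values())  # total number of observations

# Proportion (relative frequency) = frequency / total number of observations.
proportions = {label: n / total for label, n in counts.items()}

# Percentage = proportion multiplied by 100.
percentages = {label: 100 * p for label, p in proportions.items()}

for label in counts:
    print(f"{label:>10}  count={counts[label]:>2}  "
          f"proportion={proportions[label]:.3f}  percent={percentages[label]:.1f}%")
```

Every proportion falls between 0 and 1, and together they add up to 1, which is a quick way to check a relative frequency table.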

Interquartile Range: IQR = Q1 subtracted from Q3, that is, $\mathrm{IQR} = Q_3 - Q_1$

Z-Score: $z = \frac{x - \bar{x}}{s}$

Outliers:

An outlier falls far from the rest of the data.

Outliers are represented in the tails of a distribution.

Detecting Potential Outliers:

1.5 × IQR rule: $x < Q_1 - 1.5 \times \mathrm{IQR}$ or $x > Q_3 + 1.5 \times \mathrm{IQR}$

z-score rule: $z < -3$ or $z > 3$

Five-number summary: min (not including potential outliers), Q1, Q2, Q3, max (not including potential outliers)

Box Plot: drawn from the five-number summary; the horizontal lines extending from the box represent the tails.

Resistant Measures: A numerical summary measure is resistant if extreme observations (outliers) have little, if any, influence on its value.

Ex: resistant to outliers: median

not resistant to outliers: mean, range, standard deviation, linear correlation

Two variables:

Response variable

Explanatory variable

Association

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Association between Two Categorical Variables

Contingency Table

Conditional proportion

Association between Two Quantitative Variables

Scatter Plot:

Horizontal axis: explanatory variable, x

Vertical axis: response variable, y

Trend: linear, curved, clusters, no pattern

Direction:

Positively associated: y increases as x increases

Negatively associated: y decreases as x increases

Strength: linear correlation

$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$

r only measures strength of linear relationship. r is always between -1 and 1. r > 0 => positive association, r < 0 => negative association. r is close to -1 or 1 => strong relationship, r is close to 0 => weak relationship.

r is unitless (does not depend on the variables’ units). Two variables have the same correlation no matter which is treated as the response variable. Squared correlation r^2 : r^2 x 100% of the variation in y can be explained by x.

$r = \pm\sqrt{r^2}$ (positive association => +, negative association => -)
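The properties of r listed above can be checked numerically. The sketch below computes r as the average product of z-scores, dividing by n - 1 and using sample standard deviations. The data (hours studied and exam scores) are invented for illustration, and the function name `correlation` is just a label for this sketch.

```python
import math
import statistics

def correlation(xs, ys):
    """r = 1/(n-1) * sum of (z-score of x) * (z-score of y)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std. deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5, 6]
score = [52, 60, 61, 70, 74, 83]

r = correlation(hours, score)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")

# Same correlation no matter which variable is treated as the response.
r_swapped = correlation(score, hours)

# r is unitless: measuring study time in minutes instead of hours changes nothing.
minutes = [60 * h for h in hours]
r_minutes = correlation(minutes, score)

print(math.isclose(r, r_swapped), math.isclose(r, r_minutes))
```

Because the scores rise with hours studied in this made-up data, r comes out positive and close to 1.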

Regression Line: $\hat{y} = a + bx$

$\hat{y}$ is the predicted value of y when x is given.

Residual: $y - \hat{y}$ (measures the size of the prediction errors, the vertical distance between the point and the regression line.)

$b = r \frac{s_y}{s_x}$ (slope, b > 0 => positive association, b < 0 => negative association)

$a = \bar{y} - b\bar{x}$ (y-intercept, predicted value for y when x = 0)
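As a rough illustration of the slope, intercept, and residual formulas, the sketch below fits a regression line using b = r(s_y / s_x) and a = ȳ - b x̄. The data are invented (hours studied and exam scores), and `fit_line` is a hypothetical helper name, not something from the course materials.

```python
import statistics

def fit_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x, with b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Correlation r, written as one sum divided by (n - 1) * s_x * s_y.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5, 6]
score = [52, 60, 61, 70, 74, 83]

a, b = fit_line(hours, score)
predicted = [a + b * x for x in hours]                      # y_hat for each x
residuals = [y - y_hat for y, y_hat in zip(score, predicted)]  # y - y_hat

print(f"y_hat = {a:.2f} + {b:.2f} x")
for x, y, e in zip(hours, score, residuals):
    print(f"x={x}  y={y}  residual={e:+.2f}")
```

The residuals of a line fitted this way add up to zero (apart from rounding), so some points sit above the line and some below.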

Regression Outlier: an outlier that lies far away from the trend that the rest of the data follows

An observation is influential if:

Its x value is relatively low or high compared to the remainder of the data

The observation is a regression outlier

Lurking Variable: usually unobserved, influences the association between the variables of primary interest

Simpson’s Paradox: when the direction of an association between two variables changes after we include a third variable and analyze the data at separate levels of that variable.
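Simpson's Paradox is easier to see with numbers. The sketch below uses counts invented purely for illustration: two study methods (A and B) and pass rates in two course sections, where the section plays the role of the third variable.

```python
# Hypothetical (passed, total) counts, invented for illustration only.
data = {
    "section 1": {"A": (18, 20), "B": (64, 80)},
    "section 2": {"A": (32, 80), "B": (6, 20)},
}

def rate(passed, total):
    """Proportion of students who passed."""
    return passed / total

# Pass rates at each level of the third variable (the section).
by_section = {
    section: {method: rate(*counts) for method, counts in methods.items()}
    for section, methods in data.items()
}

# Pass rates after combining the sections (ignoring the third variable).
overall = {}
for method in ("A", "B"):
    passed = sum(data[section][method][0] for section in data)
    total = sum(data[section][method][1] for section in data)
    overall[method] = rate(passed, total)

for section, rates in by_section.items():
    print(f"{section}: A={rates['A']:.2f}  B={rates['B']:.2f}")
print(f"overall:   A={overall['A']:.2f}  B={overall['B']:.2f}")
```

In these made-up counts, method A has the higher pass rate inside each section, yet method B looks better once the sections are combined, because most of B's students are in the section with the higher pass rates.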
