An introduction to statistical inference, focusing on confidence intervals and hypothesis testing. It covers sample statistics, random variables, probability distributions, standardized values, and the normal distribution. The document also explains how to find cumulative probabilities for normal random variables using Excel and how to construct confidence intervals for population parameters. The Central Limit Theorem is discussed, and examples illustrate the concepts.
Typology: Summaries
Review of Basic Statistical Concepts

The purpose of this review is to summarize the basic statistical concepts. Introductory statistics dealt with three main areas: descriptive statistics, probability, and inference.

Descriptive Statistics

Sample data may be summarized graphically or with summary statistics. Sample statistics include the mean \( \bar{x} \), variance \( s^2 \), standard deviation \( s \), and median. For the following definitions let \( x_1, x_2, \ldots, x_n \) represent the values obtained from a random sample of size \( n \) drawn from a population of interest.

Sample Mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
The mean is just the average of the \( n \) values observed.

Sample Variance
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The sample variance equals the mean squared deviation from \( \bar{x} \) (computed with divisor \( n - 1 \)). A small \( s^2 \) means that the observed values cluster around the average, while a large variance means that they are more spread out. Thus, the variance is a measure of the “spread” in the sampled values.

Sample Standard Deviation
\[ s = \sqrt{s^2} \]
The sample standard deviation, \( s \), is often a more useful measure of spread than the sample variance, \( s^2 \), because \( s \) has the same units (inches, pounds, etc.) as the sampled values and \( \bar{x} \).

StatGraphics
Common descriptive statistics can be obtained by following: Describe > Numeric Data > One-Variable Analysis > Tabular Options > Summary Statistics

Example
The file LMF contains the three-year return for a random sample of 26 mutual funds. All of these funds involve a load (a type of sales charge).
[StatGraphics output: summary statistics for the LMF sample]

Random Variables and their Probability Distributions

Random Variable
A variable whose numerical value is determined by chance. The key elements here are that the variable assumes a number (sales volume, rate of return, test score, etc.) and that the sample selection process generates the numbers randomly, i.e., by a “random” selection. (In these notes, a random variable will be designated by a capital letter, such as \( X \), to differentiate it from observed values \( x \). For instance, \( X \) might represent the height of a man to be selected randomly. Once the man has been selected, his height is given by the value \( x \), say \( x = 68 \) inches.)

Probability Distribution
Although the values of a random variable are subject to chance, some values are more likely to occur than others. For instance, the height of a randomly selected man is more likely to measure 6’ than 7’. It is the random variable’s probability distribution that determines the relative likelihood of possible values.
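The sample mean, variance, and standard deviation defined earlier in this section can be computed directly from their definitions. The following is a minimal Python sketch, not part of the original notes. The function names are illustrative, and because the LMF data are not reproduced here, the five values are borrowed from the component-lifetime example in the confidence interval section.

```python
import math

def sample_mean(values):
    """Average of the n observed values."""
    return sum(values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sample_mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def sample_std(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

# Illustrative data only: five component lifetimes (hours) from a later example.
lifetimes = [92, 110, 115, 103, 98]

x_bar = sample_mean(lifetimes)
s2 = sample_variance(lifetimes)
s = sample_std(lifetimes)

print(f"mean = {x_bar:.1f} hours")
print(f"variance = {s2:.1f} hours^2")
print(f"std dev = {s:.2f} hours")
```

Note that the variance is reported in squared hours while the standard deviation is in hours, which is why the standard deviation is usually the easier measure to interpret.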
Standardized Values
For the value \( x \) drawn from a population with mean \( \mu \) and standard deviation \( \sigma \), the standardized value is \( z = \frac{x - \mu}{\sigma} \). For example, if incomes have a mean and standard deviation of $48,000 and $16,000, respectively, then someone making $56,000 has a standardized income of \( z = \frac{56{,}000 - 48{,}000}{16{,}000} = 0.5 \) because their income is one-half standard deviation above the mean income. The advantage of standardizing is that it facilitates the comparison of values drawn from different populations.

Standardized Random Variables
For the random variable \( X \) with mean \( \mu \) and standard deviation \( \sigma \), \( Z = \frac{X - \mu}{\sigma} \) is the standardized random variable. (Note: The standardized variable always has mean 0 and standard deviation 1.)

The Normal Distribution
In this course we will make use of (at least) four distributions designed to model continuous data: the Normal, t, F, and Chi-Square. Of these, the normal distribution is by far the most important because of its role in statistical inference. Much of the logic behind what we do and why we do it is based upon an understanding of the properties of the normal distribution, and of the theorems involving it, particularly the Central Limit Theorem.

Properties
1. Normal distributions are bell-shaped. (In fact, the normal distribution is sometimes called the “Bell Curve”.)
Cumulative probabilities in Excel
Cumulative probabilities of the form \( P(X < x) \) for normal random variables can be found in Excel. Follow: fx > Statistical > NORMDIST and enter TRUE in the Cumulative field. Probabilities of the form \( P(X > x) \) or \( P(a < X < b) \) can be obtained by subtraction.

Example
To find \( P(-1.2 < Z < 2) \), note that \( P(-1.2 < Z < 2) = P(Z < 2) - P(Z < -1.2) \) and use the values Excel returns. Answer = 0.9772 - 0.1151 = 0.8621.

Critical Values
\( z_\alpha \) is defined by \( P(Z > z_\alpha) = \alpha \). Critical values are used in the construction of confidence intervals and (optionally) in hypothesis testing. To find the critical value associated with the significance level \( \alpha \), follow: fx > Statistical > NORMINV and enter \( 1 - \alpha \) in the Probability field.

Example
From the Excel output we see that \( z_{0.05} = 1.645 \).
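The standardization and the Excel lookups described above can also be reproduced outside Excel. The sketch below, not part of the original notes, uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) as a stand-in for NORMDIST and NORMINV. The helper names `standardize` and `upper_critical_value` are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def standardize(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

def upper_critical_value(alpha):
    """z_alpha such that P(Z > z_alpha) = alpha, i.e. the (1 - alpha) quantile."""
    return Z.inv_cdf(1 - alpha)

# Income example from the notes: mean $48,000, standard deviation $16,000.
z_income = standardize(56_000, 48_000, 16_000)

# P(-1.2 < Z < 2) by subtracting two cumulative probabilities.
p_between = Z.cdf(2) - Z.cdf(-1.2)

# Critical value for significance level 0.05.
z_05 = upper_critical_value(0.05)

print(f"standardized income = {z_income}")
print(f"P(-1.2 < Z < 2) = {p_between:.4f}")
print(f"z_0.05 = {z_05:.3f}")
```

As in Excel, only cumulative (left-tail) probabilities are available directly, so interval and right-tail probabilities are obtained by subtraction.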
The Distribution of the Sample Mean

Because, when we take a random sample, the values of a random variable are determined by chance, statistics such as the sample mean that are calculated from the values are themselves random variables. Thus the random variable \( \bar{X} \) has a probability distribution of its own. If we intend to use the sample mean to estimate the mean of the population from which the sample was drawn, then we need to know what values the random variable \( \bar{X} \) can assume and with what probability, i.e., we need to know the probability distribution of \( \bar{X} \). It can be shown (using advanced calculus) that \( \bar{X} \) has the following properties:

The mean of \( \bar{X} \) equals the mean of \( X \), i.e., \( E(\bar{X}) = \mu \). This just says that the sample mean \( \bar{X} \) is an unbiased estimator of the population mean \( \mu \).

The variance of \( \bar{X} \) is less than that of \( X \). In fact \( \operatorname{Var}(\bar{X}) = \sigma^2 / n \). This states that there is less variability in averaged values than there is in individual values, and that the variability decreases as the size of the sample increases. Hence, you might not be surprised if a randomly selected man measured 7’, but you would be suspicious if someone claimed that 100 randomly chosen men averaged 7’!

If the variable \( X \) is normally distributed, then \( \bar{X} \) will also be normal.

The properties above, however, don’t describe the shape of the distribution of \( \bar{X} \) (needed for making inferences about \( \mu \)) except in the special case where \( X \) is normal! They only contribute information about the mean and spread of the distribution. In general, the shape of the distribution of \( \bar{X} \) may be difficult to determine for non-normal populations and small samples. However:

For large samples the Central Limit Theorem states that \( \bar{X} \) will be at least approximately normal. (Most introductory statistics texts consider a sample large whenever \( n > 30 \).)

Example
The dean of a business school claims that the average weekly income of graduates of his school 1 year after graduation is $600, with a standard deviation of $100. Find the probability that a random sample of 36 graduates averages less than $570.

Solution: Let \( X \) = weekly income of a sampled graduate 1 year after graduation. We are asked to find \( P(\bar{X} < 570) \) for 36 graduates. If the dean's claim is correct, the Central Limit Theorem says that \( \bar{X} \) is approximately normal with mean 600 and standard deviation \( 100/\sqrt{36} \approx 16.67 \), so \( P(\bar{X} < 570) \approx P(Z < -1.8) \approx 0.0359 \).

Note: Without the Central Limit Theorem we could not have approximated the probability that a sample of graduates averages less than $570 because the distribution of incomes is not usually normal.

Statistical Inference: Estimation

Point Estimate
A single number used to estimate a parameter. For example, the sample mean \( \bar{x} \) is typically used to estimate the population mean \( \mu \).

Interval Estimate
A range of values used as an estimate of a population parameter. The width of the interval provides a sense of the accuracy of the point estimate.

Confidence Interval Estimates for \( \mu \)
Confidence intervals for \( \mu \) have a characteristic format: point estimate ± CV × standard error, where CV stands for Critical Value and the standard error is the (usually estimated) standard deviation of \( \bar{X} \).

Case I: \( X \) normal or \( n > 30 \), and \( \sigma \) is known
A \( (1 - \alpha)100\% \) confidence interval estimate for \( \mu \) is given by \( \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \)

Case II: \( n > 30 \) and \( \sigma \) is unknown
A \( (1 - \alpha)100\% \) confidence interval estimate for \( \mu \) is given by \( \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \), with \( n - 1 \) degrees of freedom

Case III: \( X \) is normal and \( \sigma \) is unknown
A \( (1 - \alpha)100\% \) confidence interval estimate for \( \mu \) is given by \( \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \), with \( n - 1 \) degrees of freedom

Case III requires some explanation. When \( X \) is normal, and we must use the sample standard deviation \( s \) to estimate the unknown population standard deviation \( \sigma \), the studentized statistic \( T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \) has a t distribution with \( n - 1 \) degrees of freedom. Hence, we must use the critical value \( t_{\alpha/2} \) from the t distribution with \( n - 1 \) degrees of freedom. The properties of the t distributions are similar to those of the standard normal distribution \( Z \), except that the t has a larger spread to reflect the added uncertainty involved in estimating \( \sigma \) by \( s \).

Note: For large samples, where \( n > 30 \), there is very little difference between the t distribution with \( n - 1 \) degrees of freedom and the standard normal distribution \( Z \). Therefore, for large samples (Case II above) some texts replace \( t_{\alpha/2} \) with \( z_{\alpha/2} \) even when \( X \) is normal and \( \sigma \) is unknown!
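All three cases share the same arithmetic once the critical value and standard error are chosen. The following Python sketch, not part of the original notes, illustrates that shared format. The function names are illustrative. The Python standard library has no t distribution, so the t critical value is typed in by hand (2.776 for 4 degrees of freedom, the value these notes quote for the component-lifetime example), while the z critical value is computed with `statistics.NormalDist`.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(point_estimate, critical_value, standard_error):
    """Point estimate plus or minus critical value times standard error."""
    margin = critical_value * standard_error
    return point_estimate - margin, point_estimate + margin

def z_critical(alpha):
    """z_{alpha/2} for a (1 - alpha)100% interval when sigma is known (Case I)."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# Case III illustration: five component lifetimes (hours), assumed normal.
lifetimes = [92, 110, 115, 103, 98]
n = len(lifetimes)
x_bar = mean(lifetimes)
s = stdev(lifetimes)   # sample standard deviation (divisor n - 1)
t_crit = 2.776         # t_{0.025} with n - 1 = 4 degrees of freedom, taken from a table

low, high = confidence_interval(x_bar, t_crit, s / math.sqrt(n))
print(f"95% CI for the mean lifetime: ({low:.1f}, {high:.1f}) hours")

# For comparison, the z critical value a Case I interval would use.
print(f"z_0.025 = {z_critical(0.05):.3f}")
```

The t critical value (2.776) is noticeably larger than the corresponding z value, which reflects the wider spread of the t distribution for small samples.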
Example
A manufacturer wants to estimate the average life of an expensive component. Because the components are destroyed in the process, only 5 components are tested. The lifetimes (in hours) of the 5 randomly selected components are 92, 110, 115, 103, and 98. Assuming that component lifetimes are normal, construct a 95% confidence interval estimate of the component’s life expectancy.

Solution: Using Excel, \( \bar{x} = 103.6 \) hours, and \( s \approx 9.18 \) hours. From the discussion above, the critical value is \( t_{0.025} = 2.776 \). (Note: In Excel, to find the critical value associated with the t distribution and significance level \( \alpha \), follow: fx > Statistical > TINV and enter \( \alpha \) in the Probability field.) Thus a 95% CIE for the mean lifetime of the components is given by \( 103.6 \pm 2.776 \cdot \frac{9.18}{\sqrt{5}} \), or (92.2, 115.0) hours.

Statistical Inference: Decision Making

In hypothesis testing we are asked to evaluate a claim about something, such as a claim about a population mean. For instance, in a previous example a business dean claimed that the average weekly income of graduates of his school one year after graduation is $600. Suppose that you suspect the dean’s claim may be exaggerated. Hypothesis testing provides a systematic framework, grounded in probability, for evaluating the dean’s claim against your suspicions.

Although hypothesis testing uses probability distributions to arrive at a reasonable (and defensible) decision either to reject or "fail to reject" the claim associated with the null hypothesis of the test, \( H_0 \), it does not guarantee that the decision is correct! The table below outlines the possible outcomes of a hypothesis test. (Note: We avoid "accepting" the null hypothesis for the same reason juries return verdicts of "not guilty" rather than of "innocent".)

                 Decision
TRUTH            Fail to reject H0      Reject H0
H0 True          correct decision       Type I error
H0 False         Type II error          correct decision

Type I error
The error of incorrectly rejecting \( H_0 \) when, in fact, it's true. In a hypothesis test conducted at the significance level \( \alpha \), the probability of making a Type I error, if \( H_0 \) is true, is at most \( \alpha \).

Type II error
The error of incorrectly failing to reject \( H_0 \) when, in fact, it's false.

For a fixed sample size \( n \), you cannot simultaneously reduce the probability of making a Type I error and the probability of making a Type II error. (This is the statistician’s version of “there is no such thing as a free lunch.”) However, if you can afford to take a larger sample, it is possible to reduce both probabilities.
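The claim that a test at significance level α makes a Type I error with probability at most α can be checked by simulation. The sketch below is not part of the original notes and rests on several assumptions chosen for illustration: incomes are normal, the standard deviation is known, the null hypothesis (mean 600) is true, and the test is lower-tailed. The function name and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist

def type_i_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated samples, drawn with H0 true, in which H0 is rejected."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value z_alpha
    standard_error = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from the population described by H0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        z = (x_bar - mu0) / standard_error
        # Lower-tailed test: reject H0 when the sample mean is unusually small.
        if z < -z_alpha:
            rejections += 1
    return rejections / trials

rate = type_i_error_rate(mu0=600, sigma=100, n=36, alpha=0.05, trials=5000)
print(f"simulated Type I error rate: {rate:.3f}")
```

With these settings the simulated rate should land close to 0.05. Every rejection counted here is a Type I error, because the samples were generated with the null hypothesis true.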
Decision Making: Hypothesis Testing Example Suppose that a sample of 36 graduates of the business school averaged $570 per week one year after graduation. Test the dean’s claim, against your suspicion, at the 5% level of significance. Solution:
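One way to carry out this test is sketched below in Python. This is an illustration under stated assumptions, not the worked solution from the original notes. It takes the hypotheses to be H0: μ = 600 against H1: μ < 600 (the suspicion that the claim is exaggerated), treats the dean's standard deviation of $100 as known, and relies on the Central Limit Theorem since n = 36. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def lower_tail_z_test(x_bar, mu0, sigma, n, alpha):
    """Return the z statistic, the lower-tail p-value, and the reject decision."""
    standard_error = sigma / math.sqrt(n)
    z = (x_bar - mu0) / standard_error
    p_value = NormalDist().cdf(z)  # P(Z < z): evidence that the mean is below mu0
    return z, p_value, p_value < alpha

z, p_value, reject = lower_tail_z_test(x_bar=570, mu0=600, sigma=100, n=36, alpha=0.05)

print(f"z = {z:.2f}")
print(f"p-value = {p_value:.4f}")
print("reject H0" if reject else "fail to reject H0")
```

Under these assumptions the p-value is the same quantity as the probability \( P(\bar{X} < 570) \) from the earlier sample-mean example, and H0 is rejected at the 5% level because that p-value is below 0.05.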
Because p-values are routinely computed by StatGraphics and Excel, we will usually use p-values to conduct significance tests. Note: Many of the (hypothesis) tests conducted in this course are two-sided, and assume that we are sampling from a normal population with unknown variance. When this is the case, StatGraphics will automatically return the correct p-value for the two-sided t test.