




























































































Course: Statistics II, Degree: Administració i Direcció d'Empreses - Anglès, University: UAB
Type: Notes





























































































Notes on Statistics II
You are free:
to Share: to copy, distribute, display, and perform the work
to Remix: to make derivative works
Under the following conditions:
Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
Noncommercial. You may not use this work for commercial purposes.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
For any reuse or distribution, you must make clear to others the license terms of this work. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. Copyright © 1998-2012 Xavier Vilà. This is a human-readable summary of the Legal Code (the full license) available in http://creativecommons.org
Think of a researcher who seeks to explain some fact in the real world. For instance, imagine Newton trying to explain why apples fall. As a more familiar example, imagine an economist trying to explain why unemployment exists. Usually, the task of a researcher consists of three parts:
Statistics becomes extremely important for the first of these three items.^1
It is clear that, in order to study a "real problem", the researcher must observe the "real" world. Nevertheless, it is also clear that no researcher can observe the whole reality. Newton cannot observe all the falling apples, nor can an economist interview the whole population of a country to determine the unemployment rate. It is thus necessary to somehow "summarize" the reality, but this task has to be done so that such a "summary" closely fits the reality being studied. Then, and only then, the conclusions drawn from the "summary" can be applied reliably to the whole population.
Statistics (more precisely, statistical inference) is a collection of techniques by means of which we can draw conclusions about a reality from the study of a summary of that reality.
^1 Very often the researcher does not start by gathering information using statistical techniques. On the contrary, in many cases the initial activity consists of detecting general patterns of behavior for a given fact. From here, researchers are able to build up an abstract theory in order to explain the phenomenon under study. This is, for example, Newton's way, and also the way Economic Theory works. Once this "abstract theory" is logically constructed, statistical techniques are often used to check whether such a theory fits the reality, as we will see in Chapter 5.
Hereafter we will study in detail how this is done.
Chapter one explains how the reality is rigorously summarized and what are the main features of the results obtained in this process.
Chapter two deals with the first approach on how to draw conclusions about some real issues based on what we observe in the summary.
Chapters three and four introduce more sophisticated techniques to make inferences about the reality using some of the more elementary results seen in Chapter two.
Finally, Chapter five introduces linear regression analysis, a technique widely used in economic analysis (and other sciences) to study the relationship between variables.
It is worth saying that a clear understanding of the topics in Chapter one is important in order to easily understand what the other chapters deal with, and also to get a global idea of the whole process of statistical inference.
It is important to understand that statistics is based on probabilistic techniques. Hence, any statistical conclusion drawn from this kind of summary will not be true for sure when applied to the whole reality, but only with a certain probability. For instance, when an electoral survey is conducted it is clear that its results will not exactly coincide with the results in the final election. Nevertheless, if the survey is "well done", that is, if the summary of the reality (which in this case is the set of people interviewed) closely represents the whole reality (which in this case is the whole population that has the right to vote), then the survey result will be close to the final results with a high probability.
In the sections below we will see which are the basic ingredients of any statistical analysis and its probabilistic features.
1.1 Inferential Statistics: Definition and Inference Methods
Statistical inference is mainly built upon four main concepts, which will be defined and described below. These concepts are closely related to each other and it is very important to clearly understand each of them and not to mistake one for another.
Population. The set of elements that are the object of study. The goal is to draw some conclusion regarding some specific feature of this population.
Example 1.1.1 All the apples in the world. The feature at study is whether apples fall down or not.
Example 1.1.2 Labor force in the European Union. The feature at study is whether workers are unemployed or not.
Example 1.1.3 Production of Intel chips in a given day. The feature at study is whether chips are faulty or not.
This process can be represented as in Figure 1.1.
[Figure 1.1: The process of Statistical Inference. Diagram labels: Population, Sample, Parameter (unknown), Statistic (known), Sampling, Statistical Inference.]
We can now provide a better definition for Statistics (or Statistical Inference).
Definition 1.1.13 Statistical Inference is a subject whose main objective is to draw conclusions regarding a population through the study of one sample by means of probabilistic techniques.
1.2 Definition and properties of Simple Random Sampling
We will see next what a sample is, that is, how a sample can be selected out of a population. Since we want to study this sample to produce conclusions about the population, it cannot be selected arbitrarily. In this sense, there exist rigorous techniques specially tailored for this purpose. In what follows, the most basic techniques will be introduced, while more sophisticated analyses are out of the scope of these notes. The following definition introduces the idea of sampling.
Definition 1.2.1 Sampling is a systematic technique to select a sample out of a population in such a way that it is representative of that population.
Here, the keyword is representative. Indeed, if we want our sample to be used in order to produce "reliable" conclusions regarding the original population, we had better have a sample that closely resembles (in its structure) the original population. For instance, if we want to conduct an electoral survey and we only interview people living in a "rich" neighborhood, then it is clear that their answers will not be representative of the whole population.
There are different types of sampling techniques, depending on the specific features of the study at hand. The most important are:
Simple Random Sampling (SRS) is the "most random" of all the sampling methods, and throughout these notes we will normally assume that samples are obtained using this technique. Its main feature is that all elements in the population have the same probability of being selected to be incorporated into the sample. In other words, the sample is constructed completely at random. If we think for a moment of all the possible different samples of a given size that can be selected from a given population, simple random sampling means that each of these samples has the same probability of being selected as "the sample", i.e., they are equally likely.
Example 1.2.2 Consider a population consisting of only 4 elements
Population = {A, B, C, D}
If, for instance, we want to draw a sample of size 2, there are 6 possible samples
Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
{A, B}     {A, C}     {A, D}     {B, C}     {B, D}     {C, D}
Table 1.1: Possible Samples
Hence, in Simple Random Sampling, each of these samples has the same probability of being selected, $1/6$ in this case. Analogously, we may also say that each of the four elements in the Population has the same probability of being drawn to enter the selected sample. Indeed, since each of the elements belongs to exactly 3 of the possible samples and each possible sample has probability $1/6$ of being the selected sample, the probability for any of the elements in the Population of entering the selected sample is $1/6 + 1/6 + 1/6 = 1/2$.
This probability ($1/2$) can also be understood as follows: each element in the Population has probability $1/4$ of being the first element to enter the sample, and probability $\frac{1}{3} \cdot \frac{3}{4} = \frac{1}{4}$ of being selected as the second element in the sample (it is not selected in the first place with probability $3/4$ and, given that, it is chosen among the three remaining elements with probability $1/3$). This yields a total probability of $1/4 + 1/4 = 1/2$ of being one of the two elements in the sample.
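The counting argument of Example 1.2.2 can be checked by brute-force enumeration. The following Python sketch is only an illustration: the variable names are ours, and it assumes sampling without replacement where the order of selection is ignored, as in Table 1.1.

```python
from fractions import Fraction
from itertools import combinations

population = ["A", "B", "C", "D"]
sample_size = 2

# All possible samples of size 2 drawn without replacement (order ignored).
samples = list(combinations(population, sample_size))

# Under Simple Random Sampling every possible sample is equally likely.
sample_probability = Fraction(1, len(samples))

# Probability that a given element enters the selected sample:
# add up the probabilities of the samples that contain it.
inclusion_probability = {
    element: sum(sample_probability for s in samples if element in s)
    for element in population
}

print("number of samples:", len(samples))
print("probability of each sample:", sample_probability)
print("inclusion probabilities:", inclusion_probability)
```

Exact fractions are used instead of floating point numbers so that the results can be compared directly with the values derived by hand.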
Systematic Sampling is a variant of SRS. It is useful when the population to sample is not "static", but changes often. The following example shows how this method works.
Example 1.2.3 Consider a factory that manufactures Intel "chips". The managers want to study how many of these chips turn out to be faulty every day. The factory has a "chain" process so that once the "chip" has been assembled, it automatically enters the packaging process and then moves into the warehouse. Let us suppose that the factory produces 100 "chips" a day, and that a sample of size 5 is going to be selected. It is clear that the managers cannot wait until the end of the day, then stop all processes, randomly select 5 chips, and start
Example 1.2.5 We want to conduct a survey to know the situation of the public schools in Catalonia. Since this is a very delicate topic, we must travel to each of the schools that have been picked to belong to the sample and interview the Director. In this context, an SRS might very well select a sample composed of schools scattered all over the territory, which would imply a high level of travel expenditure. To avoid this, we can do the following:
In this way, we have selected 200 public schools to visit in Catalonia with travel costs lower than with an SRS. The problem, though, is that the sample obtained will be less representative.
In some cases the sample is obtained without any randomness at all. For instance, if we want to test a new drug against malaria, we cannot just randomly select "subjects" and force them to take the drug. In cases like this, a call for volunteers is made. These techniques usually yield much less representative samples than a random technique. Furthermore, since there are no random components in the sample, we cannot use probabilistic tools to study the sample and, therefore, statistical inference techniques cannot be properly applied.
In what follows, we will always assume (implicitly) that the sample at hand has been obtained by means of an SRS.
1.3 Distribution of the main sample statistics: mean, variance and proportion
Once the sample is obtained (we will always assume by means of an SRS), the process of working with it and drawing conclusions begins.
In this sense, the main task is now to obtain a statistic, one of the main concepts in statistical inference. We will use it to obtain conclusions regarding the unknown population parameter that is of interest to us.
The definition that follows will remind us what a statistic is (as introduced in the previous section). Then, the concept of estimate is defined. Although these two concepts are very similar and closely related, it is very important to notice that they are not the same thing.
Definition 1.3.1 A statistic (or estimator)^4 is a formula that uses the values in the sample at hand (observations) in order to produce an approximation to the true value of an unknown population parameter.
Definition 1.3.2 An estimate (or estimation) is the particular value of an estimator that is obtained from a particular sample.
Hence, a statistic is not a number but a formula, while an estimate is the number that is obtained when the formula (the estimator) is applied to the observations of the specific sample that we have at hand.
At this point, it becomes crucial to understand that, given that the sample is obtained by means of a random technique, the statistic will produce different estimates with different probabilities (depending on the specific sample that is finally "selected" at random). To put it more formally, a statistic is a Random Variable, that is, a variable that takes different values with different probabilities. In this sense, an estimate is a specific realization of this random variable. The following example aims to clarify this idea.
Example 1.3.3 We want to know the average number of cars per family in a given population. To keep the example simple, we will assume that the population is very small, only 4 families,
Population = {A, B, C, D}
Let us now assume that family A owns one car, families B and C have 2 cars each, and family D has 4.^5
For our study, we want to obtain a random sample of size 2. We can then compute the average number of cars in the sample and use it to infer some conclusion regarding the true average in the population. Hence, the sample mean (or just mean, for short) will play the role of statistic in this example, and we can use it to draw conclusions on the true population parameter that is of interest to us: the average number of cars per family in the whole population, that is, the population mean.
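To see the difference between the statistic (a formula) and its estimates (numbers), we can enumerate every possible sample in Example 1.3.3. The sketch below is an illustration under our own assumptions: samples of size 2 are drawn without replacement and are equally likely, as in Example 1.2.2, and all names in the code are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Number of cars owned by each family in the population.
cars = {"A": 1, "B": 2, "C": 2, "D": 4}


def sample_mean(sample):
    """The statistic: a formula applied to the observations in a sample."""
    return Fraction(sum(cars[family] for family in sample), len(sample))


# The (normally unknown) population parameter.
population_mean = Fraction(sum(cars.values()), len(cars))

# Every possible sample of size 2 and the estimate it produces.
samples = list(combinations(cars, 2))
estimates = {s: sample_mean(s) for s in samples}

# Distribution of the statistic: each estimate with its probability.
counts = Counter(estimates.values())
distribution = {value: Fraction(c, len(samples)) for value, c in counts.items()}

# Expectation of the statistic, viewed as a random variable.
expected_sample_mean = sum(value * p for value, p in distribution.items())

for s, estimate in estimates.items():
    print(s, "->", estimate)
print("distribution of the sample mean:", distribution)
print("population mean:", population_mean)
print("expected sample mean:", expected_sample_mean)
```

Each line printed in the loop is one estimate; the dictionary `distribution` describes the statistic as a random variable.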
Table 1.3 summarizes:
we have seen in the previous example. Indeed, in that example we have seen that the population is distributed so that there is 1 element with 1 car, 2 elements with 2 cars, and 1 element with 4 cars. Therefore, if we pick the sample element $x_i$ at random from this population, we will have that:
$$p(x_i = a) = \begin{cases} \frac{1}{4} & \text{if } a = 1 \\ \frac{1}{2} & \text{if } a = 2 \\ \frac{1}{4} & \text{if } a = 4 \\ 0 & \text{otherwise} \end{cases}$$
This is, in this case, the distribution of the population. Figure 1.2 shows it.
[Figure 1.2: Population distribution in Example 1.3.3. Horizontal axis: x (from 0 to 4); vertical axis: p.]
In general,^a we will assume that the Sample has been obtained by means of an SRS from a population distributed according to a Normal Distribution with some Population Mean $\mu$ and some Population Variance $\sigma^2$.
^a There are special cases that we will discuss in due time.
What does it mean? It means that for any two numbers $a$ and $b$, and for any element $x_i$ in our sample, we have that
$$p(a \le x_i \le b) = p(a - \mu \le x_i - \mu \le b - \mu) = p\left(\frac{a - \mu}{\sigma} \le \frac{x_i - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) = p\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right)$$
where $Z$ represents the Standard Normal distribution, usually denoted by $N(0, 1)$, whose associated probabilities are found in tables. Graphically, Figure 1.3 shows it.
[Figure 1.3: Normal Distribution. Labels on the plot: a, b, μ, p(a<x<b).]
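The standardization step can also be carried out numerically instead of with tables. The following Python sketch uses `statistics.NormalDist` from the standard library; the function name and the values of μ, σ, a and b are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def normal_interval_probability(a, b, mu, sigma):
    """p(a <= X <= b) for X ~ N(mu, sigma^2), computed by standardizing."""
    z = NormalDist()            # the Standard Normal, N(0, 1)
    z_a = (a - mu) / sigma      # standardized lower bound
    z_b = (b - mu) / sigma      # standardized upper bound
    return z.cdf(z_b) - z.cdf(z_a)


# Hypothetical population: mean 100, standard deviation 15.
mu, sigma = 100.0, 15.0
a, b = 85.0, 115.0

p_standardized = normal_interval_probability(a, b, mu, sigma)

# Same probability computed directly on N(mu, sigma^2), as a cross-check.
x = NormalDist(mu, sigma)
p_direct = x.cdf(b) - x.cdf(a)

print(p_standardized, p_direct)
```

Both computations give the same number, which is what the chain of equalities above states: subtracting μ and dividing by σ does not change the probability of the event.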
We turn next to the study of the distributions of the three main statistics. These, as we have discussed above, will depend on the distribution of the population from which we obtain the sample. For each case, we will also be interested in knowing the expectation and the variance of these statistics.
Sample mean, denoted by $\bar{X}$, is the statistic that is obtained from the sample using the formula:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
It is normally used to infer conclusions regarding the true value of the Population mean $\mu$. Its distribution depends on the characteristics of both the population and the sample.
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
because the sample mean is a linear combination of Normal random variables.
$$\frac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}} \sim N(0, 1) \text{ (approx.)}$$
because of the Central Limit Theorem that will be introduced later on.
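The claim that a Normal population yields a sample mean with expectation μ and variance σ²/n can be illustrated with a small simulation. This is only a sketch under assumptions of our own: the values of μ, σ, n, the number of replications and the random seed are all hypothetical, and observations are generated independently.

```python
import random
from statistics import fmean, pvariance

random.seed(12345)  # fixed seed so that the run is reproducible

mu, sigma, n = 10.0, 2.0, 25   # hypothetical population and sample size
replications = 20_000          # number of independent samples drawn

# Draw many samples of size n from N(mu, sigma^2) and keep each sample mean.
sample_means = [
    fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(replications)
]

mean_of_means = fmean(sample_means)
variance_of_means = pvariance(sample_means)

print("average of the sample means:", mean_of_means, "(theory:", mu, ")")
print("variance of the sample means:", variance_of_means,
      "(theory:", sigma**2 / n, ")")
```

The simulated values only approximate the theoretical ones, and the agreement improves as the number of replications grows.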
Since we only know the distribution of the sample variance when the population is Normal, we will use the fact that in that case its distribution (suitably scaled) is $\chi^2_{n-1}$ to find the expectation and variance easily. In this sense, we know that for any $\chi^2$ variable we have that $E(\chi^2_{n-1}) = n - 1$ and $V(\chi^2_{n-1}) = 2(n - 1)$. Hence, we will assume that the sample has been obtained from a Normal population with population mean $\mu$ and population variance $\sigma^2$. That is, $x_i \sim N(\mu, \sigma^2)$ for any element $x_i$ in the sample. Hence:
$$\frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
and therefore
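$$E\left(\frac{(n - 1)S^2}{\sigma^2}\right) = n - 1 \Rightarrow \frac{(n - 1)}{\sigma^2} E(S^2) = n - 1 \Rightarrow E(S^2) = \sigma^2$$
$$V\left(\frac{(n - 1)S^2}{\sigma^2}\right) = 2(n - 1) \Rightarrow \frac{(n - 1)^2}{(\sigma^2)^2} V(S^2) = 2(n - 1) \Rightarrow V(S^2) = \frac{2\sigma^4}{n - 1}$$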
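The two results E(S²) = σ² and V(S²) = 2σ⁴/(n − 1) can be checked by simulation. The sketch below assumes that S² is the sample variance with divisor n − 1 (which is consistent with E(S²) = σ²) and a Normal population; the parameter values, the seed and the helper name `sample_variance` are hypothetical.

```python
import random
from statistics import fmean, pvariance

random.seed(2024)  # fixed seed so that the run is reproducible

mu, sigma, n = 0.0, 2.0, 10    # hypothetical Normal population, sample size
replications = 20_000


def sample_variance(observations):
    """S^2 with divisor n - 1 (assumed definition of the sample variance)."""
    mean = sum(observations) / len(observations)
    return sum((x - mean) ** 2 for x in observations) / (len(observations) - 1)


# One value of S^2 per simulated sample.
s2_values = [
    sample_variance([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(replications)
]

mean_s2 = fmean(s2_values)
variance_s2 = pvariance(s2_values)

theoretical_mean = sigma**2                    # E(S^2) = sigma^2
theoretical_variance = 2 * sigma**4 / (n - 1)  # V(S^2) = 2 sigma^4 / (n - 1)

print("E(S^2):", mean_s2, "vs", theoretical_mean)
print("V(S^2):", variance_s2, "vs", theoretical_variance)
```

The variance formula relies on the population being Normal, so the second comparison should not be expected to hold if `random.gauss` is replaced by another distribution.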
Sample proportion is a special case. It is used when we are interested in knowing the true proportion of elements in a population that have a given characteristic. For instance, it might be of interest to know the proportion of smokers among the second year students in this school (in this case, the characteristic of interest is "whether a student smokes or not"), or the proportion of faulty Intel chips in a day (in this case, the characteristic of interest is "whether a chip is faulty or not").
Sample proportion, denoted by $\hat{\pi}$, is the statistic that is obtained from the sample using the formula:
$$\hat{\pi} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $x_i = 1$ if the $i$-th element in the sample has the characteristic that we are studying and $x_i = 0$ if it does not.
Sample proportion is normally used to infer conclusions regarding the true population proportion $\pi$. In this case, the population is never Normal since each observation $x_i$ comes from a Bernoulli random variable. Indeed, let us assume that we are looking at a population of 100 individuals out of which 45 are smokers. That is, the true population proportion is 45% or $\pi = 0.45$. Imagine that from this population we want to obtain a sample of size 10. It is clear that for any element $x_i$ of the sample we will have that:
$$p(x_i = 1) = 0.45$$
$$p(x_i = 0) = 0.55$$
Hence, we see that each element $x_i$ in the sample follows a Bernoulli distribution with parameter $\pi$ (where $\pi$ is the true and unknown population proportion).
It can be shown then that $\hat{\pi} = \sum_{i=1}^{n} x_i / n$ is a Binomial random variable divided by $n$ (the sum $\sum_{i=1}^{n} x_i$ is Binomial with parameters $n$ and $\pi$). Also, given that when samples are large a Binomial distribution can be approximated by a Normal distribution, we can conclude that, in general:
$$\hat{\pi} \sim N\left(\pi, \frac{\pi(1 - \pi)}{n}\right)$$
This approximation is better the closer $\pi$ is to 0.5 and the larger the sample.
With regard to the expectation and variance of the sample proportion, we have:
$$E(\hat{\pi}) = \pi$$
$$V(\hat{\pi}) = \frac{\pi(1 - \pi)}{n}$$
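For the smokers example (π = 0.45, n = 10) the expectation and variance of the sample proportion can be computed exactly from the Binomial distribution of the number of smokers in the sample. The sketch below assumes independent observations (for instance, sampling with replacement or a very large population); the function names and the second sample size are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(n, pi):
    """P(sum of x_i = k) for k = 0..n, with independent Bernoulli(pi) draws."""
    return [comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(n + 1)]


def moments_of_sample_proportion(n, pi):
    """Exact expectation and variance of pi_hat = (sum of x_i) / n."""
    pmf = binomial_pmf(n, pi)
    mean = sum((k / n) * p for k, p in enumerate(pmf))
    variance = sum((k / n - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, variance


pi, n = 0.45, 10
mean_pi_hat, variance_pi_hat = moments_of_sample_proportion(n, pi)
print("E(pi_hat):", mean_pi_hat, "vs", pi)
print("V(pi_hat):", variance_pi_hat, "vs", pi * (1 - pi) / n)

# Exact probability p(pi_hat <= 0.5) next to its Normal approximation,
# for a small and for a larger sample (the second size is illustrative).
for size in (10, 400):
    pmf = binomial_pmf(size, pi)
    exact = sum(p for k, p in enumerate(pmf) if 2 * k <= size)
    approx = NormalDist(pi, sqrt(pi * (1 - pi) / size)).cdf(0.5)
    print(size, exact, approx)
```

The loop at the end prints the exact and the approximate probability side by side, so the quality of the Normal approximation for each sample size can be judged directly from the output.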
1.4 Central Limit Theorem
This theorem presents a mathematical fact that is of the highest importance in statistical inference. Basically, the theorem states that the sum of a large number of independent random variables with identical distribution, whatever that distribution is, approximately follows a Normal distribution.
From a practical point of view this result allows us to work with the sample mean as if it were a Normal random variable even when the population from which we obtain the sample does not follow a Normal distribution. Moreover, the larger the sample the better the approximation.
Formally,
Theorem 1.4.1 Let $X_1, X_2, \ldots, X_n$ be a series of independent random variables with identical distribution, expectation $\mu$ and variance $\sigma^2$. Then, when $n$ is large enough, the random variable
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
follows, approximately, a Normal distribution with $\mu_{\bar{X}} = \mu$ and $\sigma^2_{\bar{X}} = \frac{\sigma^2}{n}$.
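The theorem can be illustrated with a population that is clearly not Normal. The following sketch reuses the car counts of Example 1.3.3 as the population and draws independent observations (that is, with replacement, so that the hypotheses of the theorem hold). The sample size, number of replications and seed are hypothetical choices of ours.

```python
import random
from math import sqrt
from statistics import fmean, pvariance

random.seed(7)  # fixed seed so that the run is reproducible

population = [1, 2, 2, 4]        # a discrete, asymmetric population
mu = fmean(population)           # population mean: 2.25
sigma2 = pvariance(population)   # population variance: 1.1875

n = 50                           # sample size
replications = 20_000            # number of independent samples

# Sample means of independent, identically distributed draws.
sample_means = [
    fmean(random.choices(population, k=n)) for _ in range(replications)
]

mean_of_means = fmean(sample_means)
variance_of_means = pvariance(sample_means)

# If the sample mean is approximately N(mu, sigma2 / n), roughly 95% of the
# simulated means should lie within 1.96 standard deviations of mu.
standard_deviation = sqrt(sigma2 / n)
share_within = sum(
    abs(m - mu) <= 1.96 * standard_deviation for m in sample_means
) / replications

print("mean of sample means:", mean_of_means, "vs", mu)
print("variance of sample means:", variance_of_means, "vs", sigma2 / n)
print("share within 1.96 standard deviations:", share_within)
```

Repeating the experiment with a smaller `n` gives a way to see, informally, how the quality of the Normal approximation depends on the sample size.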