










Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Material Type: Notes; Professor: Abayomi; Class: Basic Statistical Meth; Subject: Industrial & Systems Engr; University: Georgia Institute of Technology-Main Campus; Term: Spring 2009;
Typology: Study notes
1 / 18
This page cannot be seen from the preview
Don't miss anything!











In statistics we worry about what there is to observe (what we expect to see) and what we have actually observed (what we do in fact see).
Data are the quantitative characterizations of what we see, often based upon what we expect (or often even desire to see). Data are our observations: observed and quantified. We call the population the set of all possible observations. We call a sample the observations we see at a glance, inspection or study. A statistic is any quantity we derive from the observed data – i.e. any quantity we can generate from the sample. A parameter is any quantity we cannot observe about the data; one that is specific to the population. We often seek to estimate parameters using samples from the population. 1
We often call the individual objects described by a set of data call units of observation or cases. A variable is an object that holds information about the same characteristic for many cases. A data table is an arrangement of data – the convention is to let rows represent cases and columns represent variables.
Here is an example of a possible data table.
ID Name ShoeSize TestScore Classlevel FashionLevel 1 Kobi 11 95 Graduate Low 2 Djleroy 11 100 PhD. High 3 Ronald 22 50.15 Pre-K Very Low ... ... ... ... ... (^1) These are my heuristic definitions...
We classify variables as either quantitative - where the numbers act as numerical values or categorical
Quantitative data can be either discrete – in that we can list all the possible values – or continuous in that we cannot. A more particular distinction is to say a variable is discrete if it can take countably many values and that a variable is continuous otherwise. The distinction is often apparent in use.
Quantitative variables in which the order (in the greater than, less than sense) and distance be- tween data can be determined are called interval variables. Percent scores are examples of interval variables.
Quantitative variables in which the order of data points can be determined, but not the distance are called ordinal. Examples are letter grades..
Categorical variables which are determined by categories that cannot be ordered, such as gender and color, are called nominal.
In math notation a data table is a multivariate vector – i.e. matrix – x with dimension n x k or (n, k) or n rows and k columns. The observations are the rows n; the number of measurement types is k, the number of columns. We select the ith observation of the jth variable with element xi,j from x.
In R data are held in arrays. Think of an array as the broadest class of matrices - having at least two dimensions. A data table, then, is a two dimensional array: a matrix.
In R matrices are restricted to be only numeric. Generally, we work in R with data table objects called a data frames. The data frame is the most general object for holding data; in R the convention remains rows as observations and columns as variables. For example:
In R
worms<-read.csv(file="worms.csv")
#reads in the data frame as a comma delimited file -- i like this #format
#replace file="" with your file name, notice the "/"
is.data.frame(worms)
#checks if the object is a data frame
names(worms);dim(worms)
#returns the column names i.e. the variable names; the number of #rows and columns
worms[7,5]
R has a built in shorter way...
sum(stuff)
How useful.
You should familiarize yourself with this notation now. Prove to yourself that I’m not telling you any lies below.
∑n i=1 c^ ∗^ i^ =^ c^ ∗^
∑n i=1 i
∑n i=1 c^ +^ i^ =^
∑n i=1 c^ +^
∑n i=1 i^ =^ c^ ∗^ n^ +^
∑n i=1 i.
4 Descriptive Techniques - Data Tables
The first thing to do with data, where you can, is to make a picture. Part of being a statistician is using graphical methods to display and interpret data. Rudimentary graphical methods are data tables which we use to illustrate data.
A frequency table is a list of the categories in a categorical variable and the counts or percentage of observations in each category.
Example:
Classlevel Count Percent Graduate 2 4 PhD. 1 2 Pre-K 30 60 ... ... ... ... ...
What should the counts column in a frequency table always add up to? What should the percent column add up to, always?
A frequency table is one way of illustrating the distribution of a variable. The distribution the complete information about a variable: its possible values and the relative frequency of each value.
R can give you a frequency table real, real easy.
table(stuff)
for quantitative as well as categorical data.
hamm<-c("toe","foot","eye","tail","tail","foot","snout") table(hamm)
R can also give you the percents. Again, real, real easy.
table(stuff)/length(stuff)
table(hamm)/length(hamm)
We usually rotate these tables when we include them in reports and such.
A contingency table shows how cases are distributed along each variable, contingent on all other variables.
Let’s look at our example...:
Fashion Level Classlevel Low Middle High Graduate 6 4 1 PhD. 5 1 2 Pre-K 30 25 75
How many graduate students have a low fashion level? How many Pre-K’s are well dressed? The contingency table can be used to reveal patterns in variables that may be contingent on the category of others.
In a contingency table the marginal distribution of a variable is the distribution of that variable alone. The marginal distributions are displayed in the margins of the table.
Fashion Level Classlevel Low Middle High Total Graduate 6 4 1 11 PhD. 5 1 2 8 Pre-K 30 25 75 130 Total 41 30 78 149
The conditional distribution, in a contingency table, is viewed by looking at one column, or one row of the table. The remaining distribution of a variable, then is conditional on the value of the other variable(s) in that restricted view. For illustration...
Fashion Level Classlevel Low Middle High Total Graduate 4 PhD. 1 Pre-K 25 Total 30
is the distribution of class level, after conditioning on the middle fashion level.
We say that two variables are independent if the conditional distributions of one are the same no matter what we condition on in the second, and vice versa.
For instance let’s look at the marginal distribution of fashion level by class level..
Do the Pre-K students seem to be the best dressed? Why? Does fashion level appear to be independent of class level?
5 Continuing with more descriptive techniques
Above we began to look at ways of describing data, principally via tabular illustration. We intro- duced the data table - with cases on rows and variables on columns - and the contingency table - with two or more variables at a time.
Now we’ll continue to look at statistical methods: methods that depend on functions of the data as a prelude to when we will begin to draw pictures, or plots of our observed data.
6 Statistical Methods, Functions of Data
Recall from our definition of a statistic: any quantity that we derive from our observed data.
It is useful to think of statistics in the way they are used: to summarize and condense information.
In passing we introduced this function:
∑^ n
i=
xi n
n
∑^ n
i=
xi =
∑ (^) x 1 + x 2 + · · · + xn n
= x (4)
Explicitly, this x^3 is called the sample mean. It is the arithmetic average of the observed data.
Remember that we can call x = x 1 , x 2 , ..., xn our observations - the results of some experiment. n is the number of cases or observations. And x 1 can be the shoe size of case 1, for instance.
It is worth highlighting here the distinction between population and sample. The sample mean is an estimate of the population mean. Remember that we don’t usually see the population entirely - we only see a sample. We take the arithmetic average of the sample, and we use it as an estimate for the unknown, true mean of the population.
By example, let’s refer to the ”data” from lecture 1:
(^3) ”x - bar” in words
ID Name ShoeSize TestScore Classlevel FashionLevel 1 Kobi 11 95 Graduate Low 2 Djleroy 11 100 PhD. High 3 Ronald 22 50.15 Pre-K Very Low ... ... ... ... ...
In this example we can think of the population as all statistics students with shoes and fashion – we take a sample, that is, we only record data on the class here in Math 417 during fashion week.
Let’s say our class size n = 10 and we get these observations for shoe sizes: x = { 11 , 11 , 22 , 6 , 8 , 10 , 10 , 9 , 7 , 9 }. The sample mean, our estimate for the population mean, is 10.3. How did I get that?
In R:
x<-c(11,11,22,6,8,10,10,9,7,9)
mean(x)
The mean is known as a measure of central tendency or a location measure....the mean tells us where the arithmetic center of the data is.
Another measure of central tendency is the sample median. While the sample mean is the arithmetic center of the distribution of the data, the sample median is the ”physical” center of the data, i.e. the point in the very middle of the distribution. We find the median by:
Allow some more notation: Call x(1), x(2), ..., x(n) the ordered sample values (from least to greatest) or the order statistics for the sample. Then I can restate how to generate the sample median in a formulaic way.
x( n 2 )+x( n 2 +1)
In R
x<-c(11,11,22,6,8,10,10,9,7,9)
median(x)
What could be easier?
In R
x<-c(11,11,22,6,8,10,10,9,7,9)
range(x)
range(x)[2]-range(x)[1]
order(x)
orderedx<-x[order(x)]
orderedx
fractionalpart<-function(x){ceiling(x)-x}
fractionalpart(1.5)
orderedx[1.6]
orderedx[2.5]
n<-length(x)
upperquartile<- fractionalpart(.75n) (orderedx[ceiling(.75n)]-orderedx[floor(.75n)])+orderedx[floor(.75*n)]
upperquartile
lowerquartile<- fractionalpart(.25n) (orderedx[ceiling(.25n)]-orderedx[floor(.25n)])+orderedx[floor(.25*n)]
iqr<-upperquartile-lowerquartile
The summation notation is good to interpret for programming, or for more theoretical work – also perhaps for quick understanding. Though you will rarely be asked to calculate statistics by hand, it’s good to have a quick way to do so.
A simple way is to arrange the data in a columnar fashion with the observations being the first
column, and the following columns being the statistics you will be calculating. For example, let’s take our data, x, of shoe sizes.
i x(i) deviations xi − x squared deviations (xi − x)^2 1 6 -4.3 18. 2 7 -3.3 10. 3 8 -2.3 5. 4 9 -1.3 1. 5 9 -1.3 1. 6 10 -0.3 0. 7 10 -0.3 0. 8 11 .7 0. 9 11 .7 0. 10 22 11.7 136. x =sum / n s^2 =sum / n − 1, s =
s^2
7 Graphical Techniques, Plotting the Data
Remember our definition for a statistic, a function on observed data. That is: we observe something (take observations) and we provide summaries of it (report statistics). Remember that we call our observations samples - representatives from a population of all possible outcomes. Remember statistics (like the sample mean or sample variance) are estimates of population parameters which, usually, we do not know but seek to gain inference about.
Most graphical techniques, or plots of the data, can be viewed as extensions of the introductory statistics covered in chapter 1.
8 Univariate Graphs: The Histogram
Let’s refer again to our example from above, The data table:
ID Name ShoeSize TestScore Classlevel FashionLevel 1 Kobi 11 95 Graduate Low 2 Djleroy 11 100 PhD. High 3 Ronald 22 50.15 Pre-K Very Low ... ... ... ... ...
Let’s take our observations from lecture 2 for shoe size. Remember x = { 11 , 11 , 22 , 6 , 8 , 10 , 10 , 9 , 7 , 9 }, are our shoe sizes.
A frequency table for just the shoe sizes could be:
ShoeSizes
Figure 1: :A histogram of the shoe size data. I used this code hist(x,density=10,labels=T,breaks=c(0,10,20,30),col="red",main="ShoeSizes"). I would call this graph unimodal, asymmetric, and right skewed.
ShoeSizes
Figure 2: Here, I used this code hist(x,density=10,labels=T,freq=FALSE,breaks=c(0,10,20,30), col="red",main="ShoeSizes"). Notice that the labels above the have changed.
hist(testscores,breaks=3) #R will also bin the data with a #suggested number of bins. hist(testscores,breaks=10)
hist(testscores,breaks=30)
Notice how the histogram differs with the number and size of bins. Notice how you can obscure or highlight features of the observed distribution with binning.
Notice also, for the histograms with a frequency axis, that the sum of all the heights is the number of cases. Notice that for the histograms with a density axis, the sum of the widths of the bins times the heights of each bin sum to 1.
We call the most frequent bin or class the modal class. A unimodal histogram has a single mode or peak. A bimodal histogram has two modes, not necessarily equal in height. We call a histogram symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size. A histogram that is flat^5 is called uniform.
Here are some examples in R
unimodaldata<-c(1,2,3,3,3,3,4,5) hist(unimodaldata)
bimodaldata<-c(1,1,1,2,3,4,4,4,4,5) hist(bimodaldata)
uniformdata<-c(1,1,2,2,3,3,4,4,5,5) hist(uniformdata,,breaks=c(0.5,1.5,2.5,3.5,4.5,5.5))
bellshaped<-rnorm(100) hist(bellshaped)
skewedleft<-c(bellshaped,-15,-16) hist(skewedleft)
skewedright<-c(bellshaped,15,16) hist(skewedright)
(^5) which illustrates data that are equally frequent
Remember that the sample median is at the physical center of a distribution of observed values and the sample mean is at the arithmetic center. A symmetric distribution will have the sample median and sample mean relatively close. A distribution that is negatively skewed (or skewed left) will have the sample median greater than the sample mean. A distribution that is positively skewed (or skewed right) will have the sample median less than the sample mean.
In R
mediansl<-median(skewedleft)
meansl<-mean(skewedleft)
hist(skewedleft)
abline(v=mediansl,col="red")
abline(v=meansl,col="blue")
mediansr<-median(skewedright)
meansr<-mean(skewedright)
hist(skewedright)
abline(v=mediansr,col="red")
abline(v=meansr,col="blue")
We call outliers values that differ greatly from the distribution of the sample data.
9 Other Univariate graphs
A stem and leaf plot is, basically, a text histogram. A dotplot is a histogram with dots instead of bars.^6. A boxplot is an illustration of the interquartile range, median, and range of the data.
In R
stem(testscores} (^6) I’ve never used a dotplot