Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

What is Data - Lecture Notes | Basic Statistical Methods | ISYE 2028, Study notes of Data Analysis & Statistical Methods

Georgia Institute of Technology - Main Campus Data Analysis & Statistical Methods

Prof. Kobi Abayomi

Material Type: Notes; Professor: Abayomi; Class: Basic Statistical Meth; Subject: Industrial & Systems Engr; University: Georgia Institute of Technology-Main Campus; Term: Spring 2009;

Typology: Study notes

Pre 2010

Uploaded on 08/04/2009

koofers-user-pn4 🇺🇸

10 documents

1 / 18

This page cannot be seen from the preview

Don't miss anything!

ISYE 2028 A and B

Spring 2009

Lecture 1

Dr. Kobi Abayomi

January 8, 2009

1 Introduction: What is Data

In statistics we worry about what there is to observe (what we expect to see) and what we have

actually observed (what we do in fact see).

Data are the quantitative characterizations of what we see, often based upon what we expect (or

often even desire to see). Data are our observations: observed and quantified. We call the population

the set of all possible observations. We call a sample the observations we see at a glance, inspection

or study. A statistic is any quantity we derive from the observed data – i.e. any quantity we can

generate from the sample. A parameter is any quantity we cannot observe about the data; one

that is specific to the population. We often seek to estimate parameters using samples from the

population. 1

2 Classifying and Observing Data

We often call the individual objects described by a set of data call units of observation or cases.

Avariable is an object that holds information about the same characteristic for many cases. A

data table is an arrangement of data – the convention is to let rows represent cases and columns

represent variables.

Here is an example of a possible data table.

ID Name ShoeSize TestScore Classlevel FashionLevel

1 Kobi 11 95 Graduate Low

2 Djleroy 11 100 PhD. High

3 Ronald 22 50.15 Pre-K Very Low

... ... ... ... ...

1These are my heuristic definitions...

1

Discover Study notes of Data Analysis & Statistical Methods Georgia Institute of Technology - Main Campus

Partial preview of the text

Download What is Data - Lecture Notes | Basic Statistical Methods | ISYE 2028 and more Study notes Data Analysis & Statistical Methods in PDF only on Docsity!

ISYE 2028 A and B

Spring 2009

Lecture 1

Dr. Kobi Abayomi

January 8, 2009

1 Introduction: What is Data

In statistics we worry about what there is to observe (what we expect to see) and what we have actually observed (what we do in fact see).

Data are the quantitative characterizations of what we see, often based upon what we expect (or often even desire to see). Data are our observations: observed and quantified. We call the population the set of all possible observations. We call a sample the observations we see at a glance, inspection or study. A statistic is any quantity we derive from the observed data – i.e. any quantity we can generate from the sample. A parameter is any quantity we cannot observe about the data; one that is specific to the population. We often seek to estimate parameters using samples from the population. 1

2 Classifying and Observing Data

We often call the individual objects described by a set of data call units of observation or cases. A variable is an object that holds information about the same characteristic for many cases. A data table is an arrangement of data – the convention is to let rows represent cases and columns represent variables.

Here is an example of a possible data table.

ID Name ShoeSize TestScore Classlevel FashionLevel 1 Kobi 11 95 Graduate Low 2 Djleroy 11 100 PhD. High 3 Ronald 22 50.15 Pre-K Very Low ... ... ... ... ... (^1) These are my heuristic definitions...

We classify variables as either quantitative - where the numbers act as numerical values or categorical

which are either word or numerals that are treated as non-numeric. For the above example, which variables are which.

Quantitative data can be either discrete – in that we can list all the possible values – or continuous in that we cannot. A more particular distinction is to say a variable is discrete if it can take countably many values and that a variable is continuous otherwise. The distinction is often apparent in use.

Quantitative variables in which the order (in the greater than, less than sense) and distance be- tween data can be determined are called interval variables. Percent scores are examples of interval variables.

Quantitative variables in which the order of data points can be determined, but not the distance are called ordinal. Examples are letter grades..

Categorical variables which are determined by categories that cannot be ordered, such as gender and color, are called nominal.

In math notation a data table is a multivariate vector – i.e. matrix – x with dimension n x k or (n, k) or n rows and k columns. The observations are the rows n; the number of measurement types is k, the number of columns. We select the ith observation of the jth variable with element xi,j from x.

In R data are held in arrays. Think of an array as the broadest class of matrices - having at least two dimensions. A data table, then, is a two dimensional array: a matrix.

In R matrices are restricted to be only numeric. Generally, we work in R with data table objects called a data frames. The data frame is the most general object for holding data; in R the convention remains rows as observations and columns as variables. For example:

In R

worms<-read.csv(file="worms.csv")

#reads in the data frame as a comma delimited file -- i like this #format

#replace file="" with your file name, notice the "/"

is.data.frame(worms)

#checks if the object is a data frame

names(worms);dim(worms)

#returns the column names i.e. the variable names; the number of #rows and columns

worms[7,5]

R has a built in shorter way...

sum(stuff)

How useful.

You should familiarize yourself with this notation now. Prove to yourself that I’m not telling you any lies below.

∑n i=1 c^ ∗^ i^ =^ c^ ∗^

∑n i=1 i

∑n i=1 c^ +^ i^ =^

∑n i=1 c^ +^

∑n i=1 i^ =^ c^ ∗^ n^ +^

∑n i=1 i.

4 Descriptive Techniques - Data Tables

The first thing to do with data, where you can, is to make a picture. Part of being a statistician is using graphical methods to display and interpret data. Rudimentary graphical methods are data tables which we use to illustrate data.

A frequency table is a list of the categories in a categorical variable and the counts or percentage of observations in each category.

Example:

Classlevel Count Percent Graduate 2 4 PhD. 1 2 Pre-K 30 60 ... ... ... ... ...

What should the counts column in a frequency table always add up to? What should the percent column add up to, always?

A frequency table is one way of illustrating the distribution of a variable. The distribution the complete information about a variable: its possible values and the relative frequency of each value.

R can give you a frequency table real, real easy.

table(stuff)

for quantitative as well as categorical data.

hamm<-c("toe","foot","eye","tail","tail","foot","snout") table(hamm)

R can also give you the percents. Again, real, real easy.

table(stuff)/length(stuff)

table(hamm)/length(hamm)

We usually rotate these tables when we include them in reports and such.

A contingency table shows how cases are distributed along each variable, contingent on all other variables.

Let’s look at our example...:

Fashion Level Classlevel Low Middle High Graduate 6 4 1 PhD. 5 1 2 Pre-K 30 25 75

How many graduate students have a low fashion level? How many Pre-K’s are well dressed? The contingency table can be used to reveal patterns in variables that may be contingent on the category of others.

In a contingency table the marginal distribution of a variable is the distribution of that variable alone. The marginal distributions are displayed in the margins of the table.

Fashion Level Classlevel Low Middle High Total Graduate 6 4 1 11 PhD. 5 1 2 8 Pre-K 30 25 75 130 Total 41 30 78 149

The conditional distribution, in a contingency table, is viewed by looking at one column, or one row of the table. The remaining distribution of a variable, then is conditional on the value of the other variable(s) in that restricted view. For illustration...

Fashion Level Classlevel Low Middle High Total Graduate 4 PhD. 1 Pre-K 25 Total 30

is the distribution of class level, after conditioning on the middle fashion level.

We say that two variables are independent if the conditional distributions of one are the same no matter what we condition on in the second, and vice versa.

For instance let’s look at the marginal distribution of fashion level by class level..

Do the Pre-K students seem to be the best dressed? Why? Does fashion level appear to be independent of class level?

5 Continuing with more descriptive techniques

Above we began to look at ways of describing data, principally via tabular illustration. We intro- duced the data table - with cases on rows and variables on columns - and the contingency table - with two or more variables at a time.

Now we’ll continue to look at statistical methods: methods that depend on functions of the data as a prelude to when we will begin to draw pictures, or plots of our observed data.

6 Statistical Methods, Functions of Data

Recall from our definition of a statistic: any quantity that we derive from our observed data.

It is useful to think of statistics in the way they are used: to summarize and condense information.

6.1 Measures of Central Tendency: Mean, Median

In passing we introduced this function:

∑^ n

i=

xi n

n

∑^ n

i=

xi =

∑ (^) x 1 + x 2 + · · · + xn n

= x (4)

Explicitly, this x^3 is called the sample mean. It is the arithmetic average of the observed data.

Remember that we can call x = x 1 , x 2 , ..., xn our observations - the results of some experiment. n is the number of cases or observations. And x 1 can be the shoe size of case 1, for instance.

It is worth highlighting here the distinction between population and sample. The sample mean is an estimate of the population mean. Remember that we don’t usually see the population entirely - we only see a sample. We take the arithmetic average of the sample, and we use it as an estimate for the unknown, true mean of the population.

By example, let’s refer to the ”data” from lecture 1:

(^3) ”x - bar” in words

ID Name ShoeSize TestScore Classlevel FashionLevel 1 Kobi 11 95 Graduate Low 2 Djleroy 11 100 PhD. High 3 Ronald 22 50.15 Pre-K Very Low ... ... ... ... ...

In this example we can think of the population as all statistics students with shoes and fashion – we take a sample, that is, we only record data on the class here in Math 417 during fashion week.

Let’s say our class size n = 10 and we get these observations for shoe sizes: x = { 11 , 11 , 22 , 6 , 8 , 10 , 10 , 9 , 7 , 9 }. The sample mean, our estimate for the population mean, is 10.3. How did I get that?

In R:

x<-c(11,11,22,6,8,10,10,9,7,9)

mean(x)

The mean is known as a measure of central tendency or a location measure....the mean tells us where the arithmetic center of the data is.

Another measure of central tendency is the sample median. While the sample mean is the arithmetic center of the distribution of the data, the sample median is the ”physical” center of the data, i.e. the point in the very middle of the distribution. We find the median by:

Ordering the data from least to greatest.
When n is odd: the median is the point at the n+1 2 position.
When n is even: we take the n 2 and the n 2 + 1 points and average them.

Allow some more notation: Call x(1), x(2), ..., x(n) the ordered sample values (from least to greatest) or the order statistics for the sample. Then I can restate how to generate the sample median in a formulaic way.

Generate x(1), x(2), ..., x(n)
When n is odd: the median is the x( n+1 2 ) observation.
When n is even: the median is

x( n 2 )+x( n 2 +1)

In R

x<-c(11,11,22,6,8,10,10,9,7,9)

median(x)

What could be easier?

In R

x<-c(11,11,22,6,8,10,10,9,7,9)

range(x)

range(x)[2]-range(x)[1]

order(x)

orderedx<-x[order(x)]

orderedx

fractionalpart<-function(x){ceiling(x)-x}

fractionalpart(1.5)

orderedx[1.6]

orderedx[2.5]

n<-length(x)

upperquartile<- fractionalpart(.75n) (orderedx[ceiling(.75n)]-orderedx[floor(.75n)])+orderedx[floor(.75*n)]

upperquartile

lowerquartile<- fractionalpart(.25n) (orderedx[ceiling(.25n)]-orderedx[floor(.25n)])+orderedx[floor(.25*n)]

iqr<-upperquartile-lowerquartile

6.3 Calculations by hand

The summation notation is good to interpret for programming, or for more theoretical work – also perhaps for quick understanding. Though you will rarely be asked to calculate statistics by hand, it’s good to have a quick way to do so.

A simple way is to arrange the data in a columnar fashion with the observations being the first

column, and the following columns being the statistics you will be calculating. For example, let’s take our data, x, of shoe sizes.

i x(i) deviations xi − x squared deviations (xi − x)^2 1 6 -4.3 18. 2 7 -3.3 10. 3 8 -2.3 5. 4 9 -1.3 1. 5 9 -1.3 1. 6 10 -0.3 0. 7 10 -0.3 0. 8 11 .7 0. 9 11 .7 0. 10 22 11.7 136. x =sum / n s^2 =sum / n − 1, s =

s^2

7 Graphical Techniques, Plotting the Data

Remember our definition for a statistic, a function on observed data. That is: we observe something (take observations) and we provide summaries of it (report statistics). Remember that we call our observations samples - representatives from a population of all possible outcomes. Remember statistics (like the sample mean or sample variance) are estimates of population parameters which, usually, we do not know but seek to gain inference about.

Most graphical techniques, or plots of the data, can be viewed as extensions of the introductory statistics covered in chapter 1.

8 Univariate Graphs: The Histogram

Let’s refer again to our example from above, The data table:

ID Name ShoeSize TestScore Classlevel FashionLevel 1 Kobi 11 95 Graduate Low 2 Djleroy 11 100 PhD. High 3 Ronald 22 50.15 Pre-K Very Low ... ... ... ... ...

Let’s take our observations from lecture 2 for shoe size. Remember x = { 11 , 11 , 22 , 6 , 8 , 10 , 10 , 9 , 7 , 9 }, are our shoe sizes.

A frequency table for just the shoe sizes could be:

ShoeSizes

x

Frequency

Figure 1: :A histogram of the shoe size data. I used this code hist(x,density=10,labels=T,breaks=c(0,10,20,30),col="red",main="ShoeSizes"). I would call this graph unimodal, asymmetric, and right skewed.

ShoeSizes

x

Density

Figure 2: Here, I used this code hist(x,density=10,labels=T,freq=FALSE,breaks=c(0,10,20,30), col="red",main="ShoeSizes"). Notice that the labels above the have changed.

hist(testscores,breaks=3) #R will also bin the data with a #suggested number of bins. hist(testscores,breaks=10)

hist(testscores,breaks=30)

Notice how the histogram differs with the number and size of bins. Notice how you can obscure or highlight features of the observed distribution with binning.

Notice also, for the histograms with a frequency axis, that the sum of all the heights is the number of cases. Notice that for the histograms with a density axis, the sum of the widths of the bins times the heights of each bin sum to 1.

8.1 Descriptions of Histograms

We call the most frequent bin or class the modal class. A unimodal histogram has a single mode or peak. A bimodal histogram has two modes, not necessarily equal in height. We call a histogram symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size. A histogram that is flat^5 is called uniform.

Here are some examples in R

unimodaldata<-c(1,2,3,3,3,3,4,5) hist(unimodaldata)

bimodaldata<-c(1,1,1,2,3,4,4,4,4,5) hist(bimodaldata)

uniformdata<-c(1,1,2,2,3,3,4,4,5,5) hist(uniformdata,,breaks=c(0.5,1.5,2.5,3.5,4.5,5.5))

bellshaped<-rnorm(100) hist(bellshaped)

skewedleft<-c(bellshaped,-15,-16) hist(skewedleft)

skewedright<-c(bellshaped,15,16) hist(skewedright)

(^5) which illustrates data that are equally frequent

8.2 Some things to note...

Remember that the sample median is at the physical center of a distribution of observed values and the sample mean is at the arithmetic center. A symmetric distribution will have the sample median and sample mean relatively close. A distribution that is negatively skewed (or skewed left) will have the sample median greater than the sample mean. A distribution that is positively skewed (or skewed right) will have the sample median less than the sample mean.

In R

mediansl<-median(skewedleft)

meansl<-mean(skewedleft)

hist(skewedleft)

abline(v=mediansl,col="red")

abline(v=meansl,col="blue")

mediansr<-median(skewedright)

meansr<-mean(skewedright)

hist(skewedright)

abline(v=mediansr,col="red")

abline(v=meansr,col="blue")

We call outliers values that differ greatly from the distribution of the sample data.

9 Other Univariate graphs

A stem and leaf plot is, basically, a text histogram. A dotplot is a histogram with dots instead of bars.^6. A boxplot is an illustration of the interquartile range, median, and range of the data.

In R

stem(testscores} (^6) I’ve never used a dotplot

What is Data - Lecture Notes | Basic Statistical Methods | ISYE 2028, Study notes of Data Analysis & Statistical Methods

Related documents

Partial preview of the text

Download What is Data - Lecture Notes | Basic Statistical Methods | ISYE 2028 and more Study notes Data Analysis & Statistical Methods in PDF only on Docsity!

ISYE 2028 A and B

Spring 2009

Lecture 1

Dr. Kobi Abayomi

January 8, 2009

1 Introduction: What is Data

2 Classifying and Observing Data

6.1 Measures of Central Tendency: Mean, Median

6.3 Calculations by hand

x

Frequency

x

Density

8.1 Descriptions of Histograms

8.2 Some things to note...