Stat 528 (Autumn 2007) – Data Analysis I

Overview
Reading: Section 1.1
• Introduction
• Looking at data
  – Individuals and variables
  – Categorical and quantitative variables
  – Variation
• Graphical summaries for data (displaying distributions)
  – Bar and pie charts
  – Stemplots
  – Histograms
  – Time series plots

The language of statistics
• Statistics is largely about learning the language and definitions.
  – By learning the “lingo” we can communicate and present our ideas to others.
• We will introduce many definitions:
  – It will help you to learn these terms as we go through this term.

Data
• Data is everywhere.
  – We collect it ourselves, or find it in newspapers, on television, on the internet, etc.
• Studying data can give insight, or it can confuse (“lies, damn lies and statistics”).
• The way data is presented is very important.
• An individual is something that has a characteristic of interest that can be measured or observed, e.g., a person, animal, chemical process, stock index.
• The actual characteristic of interest that is recorded or obtained for each individual is called the variable. Variables can be classified as categorical or quantitative.

Categorical and quantitative variables
• Categorical variables record an individual’s group or characteristic.
  – We cannot perform arithmetic on these variables.
  – Instead, we often summarize these variables for a number of individuals at once.
• Quantitative variables describe a numeric characteristic.
  – We can perform arithmetic on these variables.
  1. Discrete variables take on a countable number of values.
  2. Continuous variables take on a continuum of values.

Thinking about data
• In order to analyze and present data, ask the following questions:
  1. Where does the data come from?
  2. Why was the data collected?
  3. Who collected the data and why? (Most good data sources have references!)
Thinking about the structure of the data
• How much data was collected?
  – How many individuals are there? (Normally, the more the better!)
  – How many variables?
• Are there variables missing that you think should have been collected?
• How do you define the variables?
  – How and under what conditions were the variables measured/collected?
  – What are the units of measurement?
  – Is there missing data? i.e., are there individuals for which the variable was not recorded? Why?
  – Missing data can be a big problem in statistics.

A sheepish example
• A vet is observing some vital statistics of a flock of 74 sheep at an agricultural research station. The vet records the following variables:
  – weight;
  – body temperature;
  – pulse rate;
  – estrus (“heat”) cycle.

Reasons for variation
• Reasons for variation include:
  – natural variation, e.g., one sheep ate more than another sheep; male sheep are in general heavier than females.
  – measurement error (depends on the measuring instrument).
  – rounding/numerical error.
• Key idea: We should account for, and adjust for, variation in the data. e.g., the average weight of a number of sheep varies less than the individual measurements.

Contrasting descriptive with inferential statistics
• Descriptive statistics – we summarize only the data we have collected.
  e.g., for the sheep example:
• Inferential statistics – based on the data we have collected (and some assumptions) we try to infer something about a more general, larger group of individuals.
  e.g., for the sheep example:

Distributions
• For a quantitative variable, the distribution tells us the values that the variable can take, and how often these values occur.
• Examining the distribution allows us to pick up patterns of variation in the data.
• We can examine the distribution of a variable either graphically or numerically.

More plots for categorical data
• Now present the data as a pie chart:
• Which display is more useful? Why?
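The key idea on the "Reasons for variation" slide (an average of several sheep varies less than a single sheep's weight) can be checked with a small simulation. This is only a sketch under stated assumptions: the weights are simulated, not taken from the flock in the notes, and the normal model, its mean of 170, its standard deviation of 10, and the group size of 23 are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(528)  # fixed seed so the sketch is reproducible

# Hypothetical model (an assumption, not from the notes):
# individual sheep weights are normal with mean 170 and standard deviation 10.
ASSUMED_MEAN = 170.0
ASSUMED_SD = 10.0
GROUP_SIZE = 23      # number of sheep averaged in each group (assumed)
N_REPEATS = 2000     # number of simulated individuals / simulated groups


def simulate_weight():
    """Draw one simulated sheep weight from the assumed model."""
    return random.gauss(ASSUMED_MEAN, ASSUMED_SD)


# Spread of single measurements.
individual_weights = [simulate_weight() for _ in range(N_REPEATS)]

# Spread of group averages: each entry is the mean weight of GROUP_SIZE sheep.
group_means = [
    statistics.mean([simulate_weight() for _ in range(GROUP_SIZE)])
    for _ in range(N_REPEATS)
]

sd_individual = statistics.stdev(individual_weights)
sd_means = statistics.stdev(group_means)

print(f"SD of individual weights: {sd_individual:.2f}")
print(f"SD of averages of {GROUP_SIZE} sheep: {sd_means:.2f}")
```

Under this model the averages should be far less spread out than the individual weights, which is the sense in which averaging "adjusts for" natural variation.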
Stemplots (stem and leaf displays)
• Stemplots show the shape of data (the distribution).
• Good for small datasets.
• We illustrate with the sheep weight data.

  Stem-and-leaf of sheep  N = 23
  Leaf Unit = 1.0

  15  5779
  16  05568899
  17  03579
  18  012569

• Using another scale:

  Stem-and-leaf of sheep  N = 23
  Leaf Unit = 1.0

  15  5779
  16  0
  16  5568899
  17  03
  17  579
  18  012
  18  569

Stemplots – changing the scale

  Stem-and-leaf of sheep  N = 23
  Leaf Unit = 1.0

  15  5
  15  77
  15  9
  16  0
  16
  16  55
  16  6
  16  8899
  17  0
  17  3
  17  5
  17  7
  17  9
  18  01
  18  2
  18  5
  18  6
  18  9

• The last two stemplots here have split stems.
• Different scales can be useful for picking up different features of the data.

Histograms: a discrete example
Temperature transducers of a certain type are shipped in batches of 50. A sample of 60 batches is selected, and the number of nonconforming transducers in each batch is determined:

  2 1 2 4 0 1 3 2 0 5 3 3 1 3 2 4 7 0 2 3
  0 4 2 1 3 1 1 3 4 1 2 3 2 2 8 4 5 1 3 1
  5 0 2 3 2 1 0 6 4 2 1 6 0 3 3 3 6 1 2 3

We have

  class   frequency   relative frequency
  0        7          7/60 = 0.117
  1       12          12/60 = 0.200
  2       13          13/60 = 0.217
  3       14          14/60 = 0.233
  4        6          6/60 = 0.100
  5        3          3/60 = 0.050
  6        3          3/60 = 0.050
  7        1          1/60 = 0.017
  8        1          1/60 = 0.017

Drawing the histogram
[Figure: two histograms of the transducer counts (horizontal axis: transducers, 0 to 8). The first has vertical axis Frequency (0 to 15); the second has vertical axis Density (0.0 to 0.2).]

Histograms: a continuous example
The following are losses for n = 38 hurricanes occurring over a period of 38 years. The losses are in units of millions of dollars, and are adjusted for inflation. They appear here sorted.
    2.93    3.08    4.47    6.77    7.12   10.56   14.47   15.35
   16.98   18.38   19.03   25.30   29.11   30.15   33.73   40.60
   41.41   47.91   49.40   52.60   59.92   63.12   77.81  102.94
  103.22  123.68  140.14  192.01  198.45  227.34  329.51  361.20
  421.68  513.59  545.78  750.39  863.88 1638.00

  class       frequency   relative frequency
  0-199       29          29/38 = 0.763
  200-399      3          3/38 = 0.079
  400-599      3          3/38 = 0.079
  600-799      1          1/38 = 0.026
  800-999      1          1/38 = 0.026
  1000-1199    0          0
  1200-1399    0          0
  1400-1599    0          0
  1600-1799    1          1/38 = 0.026

Shapes of distributions
• symmetric, left skewed and right skewed
[Figure: three histograms (vertical axis: Frequency) illustrating these shapes.]

Shapes of distributions (cont.)
• unimodal, and bimodal
[Figure: two histograms (vertical axis: Frequency) illustrating these shapes.]

Time series plots
• A time series is a set of observations made sequentially in time.
• R. A. Fisher: “One damned thing after another”.
• Time series analysis is the area of statistics which deals with the analysis of dependency between different observations in time series data.
• Example: Annual measurements of the level of Lake Huron in feet.
[Figure: time series plot of the level of Lake Huron. Horizontal axis: Time (ticks from 1880 to 1960); vertical axis: Lake Huron (ticks from 576 to 582).]
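Returning to the continuous histogram example, the class frequencies and relative frequencies for the hurricane losses can be recomputed with a short sketch. The 38 values are copied from the notes; the choice to treat each class as the interval from its lower limit up to (but not including) the next lower limit, e.g. 0-199 as [0, 200), is an assumption about how the classes were defined.

```python
# Hurricane losses (millions of dollars, inflation adjusted), as listed in the notes.
losses = [
    2.93, 3.08, 4.47, 6.77, 7.12, 10.56, 14.47, 15.35,
    16.98, 18.38, 19.03, 25.30, 29.11, 30.15, 33.73, 40.60,
    41.41, 47.91, 49.40, 52.60, 59.92, 63.12, 77.81, 102.94,
    103.22, 123.68, 140.14, 192.01, 198.45, 227.34, 329.51, 361.20,
    421.68, 513.59, 545.78, 750.39, 863.88, 1638.00,
]

CLASS_WIDTH = 200  # classes 0-199, 200-399, ... (assumed to mean [0, 200), [200, 400), ...)

n = len(losses)

# Tally how many losses fall in each class, keyed by the class's lower limit.
counts = {}
for loss in losses:
    lower = int(loss // CLASS_WIDTH) * CLASS_WIDTH
    counts[lower] = counts.get(lower, 0) + 1

# Print every class up to the one containing the largest loss, including empty ones.
top_class = int(max(losses) // CLASS_WIDTH) * CLASS_WIDTH
print("class       frequency   relative frequency")
for lower in range(0, top_class + CLASS_WIDTH, CLASS_WIDTH):
    freq = counts.get(lower, 0)
    label = f"{lower}-{lower + CLASS_WIDTH - 1}"
    print(f"{label:<12}{freq:<12}{freq / n:.3f}")
```

Dividing each relative frequency by the class width (200 here) would give the height of a density-scaled histogram, so that the bar areas sum to 1.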