






An introduction to graphical and numerical summaries for displaying and analyzing data. It covers methods for displaying categorical and numerical data, including bar charts, pie charts, stem-and-leaf plots, and histograms. The document also discusses how to interpret these displays and identify key features such as center, spread, shape, and outliers.







Read sections 1.6 - 1.
How to display data:
Displaying Categorical Data
To display a categorical variable measured from a sample:
[Figure: Bar Chart of Percent (0 to 70) by Ethnic Group, and Pie Chart, for the categories White, Asian, Hispanic, Black, Other]
> freq = c(679,190,77,51,3)      # the frequencies in each category
> rel.freq = prop.table(freq)    # relative frequencies
> rel.freq
[1] 0.679 0.190 0.077 0.051 0.003
> percent = rel.freq*100
> percent
[1] 67.9 19.0  7.7  5.1  0.3
> Race = c("White","Asian","Hispanic","Black","Other")
> rbind(Race,percent)
        [,1]    [,2]    [,3]       [,4]    [,5]
Race    "White" "Asian" "Hispanic" "Black" "Other"
percent "67.9"  "19"    "7.7"      "5.1"   "0.3"
> barplot(percent,main="Bar Chart",names=Race,xlab="Ethnic Group",ylab="Percent",ylim=c(0,70))
> pie(percent,main="Pie Chart",labels=Race)
[Figure: stacked bar chart of Survival Status (%) (0 to 100; Alive, Dead) by Treatment (A, B, C, D)]
> freq = matrix(c(58,43,56,45,57,77,42,75),ncol=4,byrow=TRUE)
> freq
     [,1] [,2] [,3] [,4]
[1,]   58   43   56   45
[2,]   57   77   42   75
> rownames(freq) = c("Dead","Alive")
> colnames(freq) = c("A ","B ","C ","D ")
> freq
      A  B  C  D
Dead  58 43 56 45
Alive 57 77 42 75
> prop.table(freq,2)    # Takes proportions along each column
             A         B         C     D
Dead  0.5043478 0.3583333 0.5714286 0.375
Alive 0.4956522 0.6416667 0.4285714 0.625
> percent = prop.table(freq,2)*100
> ang = c(60,120)
> index = c(2,1)
> barplot(percent,beside=FALSE,angle=ang,density=20,col="black",
    ylab="Survival Status (%)",xlab="Treatment")
> legend(1.85,90,fill=TRUE,legend=rownames(freq)[index],angle=ang[index],
    density=20,merge=TRUE,bg="white")
> hist(x=BirthYear,xlab="Year of Birth",main=" ")
Bins        Frequency
[20, 25)     2
[25, 30)     5
[30, 35)     8
[35, 40)    12
[40, 45)     7
[45, 50)     2
[50, 55)     2
[55, 60)     7
[60, 65)     2
[65, 70)     3
[Figure: histogram of Frequency (0 to 12) against Year of Birth (20 to 70)]
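The bins in the table above are closed on the left, for example [20, 25), whereas R's hist() closes each bin on the right unless told otherwise. A minimal sketch of the difference follows. The vector years holds made-up values for illustration only; it is not the course's BirthYear data.

```r
# Hypothetical values, invented for this sketch (not the BirthYear data).
years = c(20,24,25,29,30,34,35,35,40)
breaks = seq(20,45,by=5)

# right=FALSE gives left-closed bins [20,25), [25,30), ..., as in the table.
left.closed = hist(years,breaks=breaks,right=FALSE,plot=FALSE)$counts

# The default (right=TRUE) gives right-closed bins (20,25], (25,30], ...
right.closed = hist(years,breaks=breaks,plot=FALSE)$counts

rbind(left.closed,right.closed)
```

Values that fall exactly on a bin boundary (25, 30, 35, 40 here) move to a different bin depending on the convention, so the two rows of counts differ.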
What to Look For in Displays of Numerical Data:
Modes:
Unimodal - one major peak
Bimodal - two major peaks
Multimodal - more than two major peaks
Shapes:
Symmetric
Right-skewed (positively skewed)
Left-skewed (negatively skewed)
Do not expect perfection in the histogram of sample data! Due to sampling variability, there will be small peaks, valleys, and gaps. Do not focus on slight irregularities! Do not put too much weight on features caused by one or a few data values.
Graphs For Paired Numerical Data:
Paired or bivariate means that there are two variables to be studied. One variable, called the explanatory variable X, is used to describe the other variable, the response Y. So a sample of paired data consists of ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
[Figure: scatterplot of humerus (40 to 80) against femur (40 to 75)]
> femur = c(38,50,59,64,74)
> humerus = c(41,63,70,72,84)
> plot(femur,humerus)
Different colors or symbols can be used to distinguish between groups.
[Figure: scatterplot of Year of Birth (30 to 70) against Year of Hire (50 to 90), with separate plotting symbols for Kept and Laid Off]
> levels(Status)
[1] "Kept"    "LaidOff"
> n = length(Status)
> Status.num = rep(1,n)              # open circle (pch 1) by default
> Status.num[Status=="Kept"] = 16    # filled circle (pch 16) for "Kept", matching the legend
> plot(x=HireYear,y=BirthYear,pch=Status.num,xlab="Year of Hire",ylab="Year of Birth")
> Status.leg = levels(Status)
> Status.leg[2] = "Laid Off"
> Status.leg
[1] "Kept"     "Laid Off"
> legend(x=43,y=69,legend=Status.leg,pch=c(16,1))
How to Describe the Relationship between two variables:
(a) Form - linear, non-linear (curved), clustered, etc.
(b) Association - positive or negative. A positive association indicates that increasing values of one variable are associated with increasing values of the other variable. A negative association indicates that increasing values of one variable are associated with decreasing values of the other variable.
(c) Strength - strong, moderate, or weak
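As a rough illustration of the direction of association, the femur and humerus data from the scatterplot above can be checked directly. This is only a sketch: it detects a perfectly consistent direction, and for real data with scatter the judgment usually has to come from the plot itself.

```r
femur = c(38,50,59,64,74)
humerus = c(41,63,70,72,84)

# Order the pairs by femur, then look at the successive changes in humerus.
ord = order(femur)
steps = diff(humerus[ord])

positive = all(steps > 0)   # TRUE if humerus rises every time femur rises
negative = all(steps < 0)   # TRUE if humerus falls every time femur rises
c(positive=positive,negative=negative)
```

Here every step is an increase, which agrees with the upward pattern in the scatterplot: a positive association.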
QUESTION: House prices in $1000’s: 143.5 132.0 154.5 169.3 134.7 2500
(a) Find the sample mean.
(b) Find the sample median.
(c) Why are the mean and median so different?
IMPORTANT! The mean is strongly affected by (not resistant to) outliers and skewness, whereas the median is not affected by (resistant to) outliers and skewness.
Outliers - the mean is pulled toward the outlier(s)
Skewness - the mean is pulled toward the longer tail
NOTE: The mean is sensitive to outliers because it uses all the data values. The median is insensitive to outliers because it uses only 1 or 2 of the middle values in the ordered list.
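The house-price question above can be checked numerically. The sketch below computes the mean and median with and without the value 2500. Treating 2500 as the outlier, and the cutoff of 1000 used to drop it, are assumptions made for this illustration.

```r
price = c(143.5,132.0,154.5,169.3,134.7,2500)   # in $1000's

m.all = mean(price)       # 539
med.all = median(price)   # 149

# Assumption: drop the one extreme value to see how each summary reacts.
typical = price[price < 1000]
m.typ = mean(typical)       # 146.8
med.typ = median(typical)   # 143.5

c(m.all,med.all,m.typ,med.typ)
```

Removing one value moves the mean from 539 to 146.8 but moves the median only from 149 to 143.5.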
Measures Of Variability (Spread):
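Sample variance:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$$
Sample standard deviation: $s = \sqrt{s^2}$
Note that $\sum (x_i - \bar{x}) = 0$.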
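A small numerical check may help here. The sketch below reuses the femur values from earlier in these notes (the choice of data is only for illustration) and confirms that the definitional and shortcut forms of the sample variance agree, and that the deviations from the mean sum to zero.

```r
x = c(38,50,59,64,74)
n = length(x)

dev = x - mean(x)                        # deviations from the mean (mean is 57)
s2.def = sum(dev^2)/(n-1)                # definitional form
s2.short = (sum(x^2) - sum(x)^2/n)/(n-1) # shortcut form

c(s2.def,s2.short,var(x))   # all three give 188
sqrt(s2.def)                # standard deviation, same as sd(x)
sum(dev)                    # 0
```

R's built-in var() and sd() use the same n - 1 divisor.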
Interquartile range: IQR = $Q_3 - Q_1$,
where $Q_1$ is the first quartile (25% below, 75% above) and
$Q_3$ is the third quartile (75% below, 25% above).
Note, $Q_1$ is the median of the lower half of the ordered list and $Q_3$ is the median of the upper half of the ordered list.
Data: 1 1 2 4 5 7 7 7 8 9 10
Find the IQR.
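The quartile rule described above (medians of the lower and upper halves) can be sketched as a short function. Two points are assumptions here: the helper name half.quartiles is hypothetical, and when n is odd the overall median is left out of both halves, which these notes do not state explicitly.

```r
# Hypothetical helper: quartiles as medians of the lower and upper halves.
# Assumption: for odd n, the overall median belongs to neither half.
half.quartiles = function(x) {
  x = sort(x)
  n = length(x)
  k = n %/% 2                  # number of values in each half
  lower = x[1:k]
  upper = x[(n-k+1):n]
  c(Q1=median(lower),Q3=median(upper))
}

x = c(1,1,2,4,5,7,7,7,8,9,10)
q = half.quartiles(x)
iqr = q[["Q3"]] - q[["Q1"]]
q      # Q1 = 2, Q3 = 8
iqr    # 6
```

R's built-in IQR() and quantile() use a different interpolation rule by default, so they may not reproduce these hand-computed values.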
IMPORTANT! Standard deviation and variance are both strongly affected by (not resistant to) outliers and skewness, whereas IQR is not affected by (resistant to) outliers and skewness.
Five-number Summary: Minimum, $Q_1$, Median, $Q_3$, Maximum
Boxplot: plot of the five-number summary
[Figure: boxplot of Year of Birth (30 to 70)]
Outlier Guidelines for a Boxplot:
Comparative (Side-by-Side) Boxplots:
[Figure: side-by-side boxplots of Year of Birth (30 to 70) for the groups Kept and LaidOff]
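A comparative boxplot like the one above can be drawn with the formula interface of boxplot(). The sketch below is only an illustration: the variable names follow the earlier scatterplot code, but the values are made up and are not the course data.

```r
# Hypothetical data, invented for this sketch (not the course data set).
BirthYear = c(30,35,41,48,52,60,64,25,31,38,44,50,57)
Status = factor(c(rep("Kept",7),rep("LaidOff",6)))

# One box per level of Status, drawn on a common vertical scale.
b = boxplot(BirthYear ~ Status,ylab="Year of Birth")
b$names   # group labels, in plotting order
b$stats   # one column per group: the five values used to draw each box
```

Because the boxes share one axis, centers and spreads of the groups can be compared at a glance.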
Graphs for paired data, p. 65: 1.39 and 1.
Numerical summaries, p. 66: 1.43 and 1.
Graphs for numerical variables, p. 66: 1.47 (instead of a dot plot, consider a stem and leaf plot), 1.49 - 1.61 odd
Graphs for categorical variables, p. 71: 1.65, 1.67, 1.69abc