Data Visualization: Graphical and Numerical Summaries, Slides of Statistics

University of Melbourne (UM)Statistics

An introduction to graphical and numerical summaries for displaying and analyzing data. It covers methods for displaying categorical and numerical data, including bar charts, pie charts, stem-and-leaf plots, and histograms. The document also discusses how to interpret these displays and identify key features such as center, spread, shape, and outliers.

Typology: Slides

2021/2022

Uploaded on 07/05/2022

barbara_gr 🇦🇺

Chapter 1 - Graphical and Numerical Summaries

Read sections 1.6 - 1.7

Graphical Methods

How to display data:

1. Give the observed values of a categorical or numerical variable taken from a sample, denoted as x1, x2, x3, ..., xn. The sample size is n.

2. Indicate how often the variable takes on these values.

Displaying Categorical Data

To display a categorical variable measured from a sample:

• Encode the values of the variable with respect to each category into binary numbers: a 0 or a 1.

• Frequency is the count of 1's in each category, that is, the number of times a category appears.

• Relative Frequency is the proportion Frequency / (Total Number of Observations).

• Percentage = (Relative Frequency) × 100
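
The encoding and counting steps can be sketched in R starting from raw category labels. The vector `blood` and its values are hypothetical, made up only for illustration; `table()` and `prop.table()` are base R functions.

```r
# Hypothetical sample of a categorical variable (not data from these notes)
blood = c("A", "O", "A", "B", "O", "A", "AB", "O", "A", "O")
n = length(blood)                 # sample size

# Encode one category as 0/1: 1 if the observation is "A", 0 otherwise
is.A = as.numeric(blood == "A")
freq.A = sum(is.A)                # frequency = count of 1's

# table() does the same count for every category at once
freq = table(blood)
rel.freq = prop.table(freq)       # frequency / total number of observations
percent = rel.freq * 100

freq.A
freq
percent
```

The relative frequencies always sum to 1, and the percentages to 100, which is a quick check on the arithmetic.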

1. Bar Chart: A bar chart is a graph of frequencies, relative frequencies, or percentages on the vertical axis versus the categories of a categorical variable on the horizontal axis.

[Figure: "Bar Chart" of Percent (0 to 70) by Ethnic Group (White, Asian, Hispanic, Black, Other), shown next to a "Pie Chart" of the same categories]

> freq = c(679,190,77,51,3) # the frequencies on each category

> rel.freq=prop.table(freq) # relative frequencies

> rel.freq

[1] 0.679 0.190 0.077 0.051 0.003

> percent=rel.freq*100

> percent

[1] 67.9 19.0 7.7 5.1 0.3

> Race = c("White","Asian","Hispanic","Black","Other")

> rbind(Race,percent)

[,1] [,2] [,3] [,4] [,5]

Race "White" "Asian" "Hispanic" "Black" "Other"

percent "67.9" "19" "7.7" "5.1" "0.3"

> barplot(percent,main="Bar Chart",names=Race,xlab="Ethnic Group",ylab="Percent",ylim=c(0,70))

> pie(percent,main="Pie Chart",labels=Race)


Pie Chart: A graph of the categories of a categorical variable as pieces of a pie, where the size of each piece is proportional to the frequency, relative frequency, or percentage of the category. Pie charts are inferior to bar charts because humans have a more difficult time judging the difference between angles than the difference between heights or lengths of bars.
Segmented Bar Chart: A segmented bar chart compares two categorical variables. The categories of one of the categorical variables are on the horizontal axis, and percentage is on the vertical axis. Each bar is partitioned into pieces, where each piece represents the categories of the second categorical variable. The textbook describes a comparative or side-by-side bar chart which serves the same purpose as a segmented bar chart.

[Figure: segmented bar chart of Survival Status (%) (0 to 100) by Treatment (A, B, C, D), with segments for Alive and Dead]

> freq = matrix(c(58,43,56,45,57,77,42,75),ncol=4,byrow=TRUE)
> freq
     [,1] [,2] [,3] [,4]
[1,]   58   43   56   45
[2,]   57   77   42   75
> rownames(freq) = c("Dead","Alive")
> colnames(freq) = c("A ","B ","C ","D ")
> freq
      A  B  C  D
Dead  58 43 56 45
Alive 57 77 42 75
> prop.table(freq,2) # Takes proportions along each column
             A         B         C     D
Dead  0.5043478 0.3583333 0.5714286 0.375
Alive 0.4956522 0.6416667 0.4285714 0.625
> percent = prop.table(freq,2)*100 # column percentages, matching the (%) axis label
> ang=c(60,120)
> index=c(2,1)
> barplot(percent,beside=FALSE,angle=ang,density=20,col="black",ylab="Survival Status (%)",xlab="Treatment")
> legend(1.85,90,fill=TRUE,legend=rownames(freq)[index],angle=ang[index],density=20,merge=TRUE,bg="white")

Histogram: A graph of frequencies, relative frequencies or percentages on the vertical axis versus the values of the numerical variable on the horizontal axis.

> hist(x=BirthYear,xlab="Year of Birth",main=" ")

Bins       Frequency
[20, 25)   2
[25, 30)   5
[30, 35)   8
[35, 40)   12
[40, 45)   7
[45, 50)   2
[50, 55)   2
[55, 60)   7
[60, 65)   2
[65, 70)   3

[Figure: histogram of Frequency (0 to 12) versus Year of Birth (20 to 70)]
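
A minimal sketch of how half-open bins such as [20, 25) can be requested from `hist()`. The data vector `x` is hypothetical, since the `BirthYear` values are not listed in these notes; `breaks`, `right`, and `plot` are arguments of base R's `hist()`.

```r
# Hypothetical data (the BirthYear values are not given in the notes)
x = c(20, 24, 25, 29, 30, 33, 34)

# Bin edges 20, 25, 30, 35; right=FALSE makes each bin closed on the left,
# i.e. [20,25), [25,30), [30,35] (the last bin also includes its right edge)
h = hist(x, breaks = seq(20, 35, by = 5), right = FALSE, plot = FALSE)

h$breaks   # bin edges
h$counts   # frequency in each bin
```

Dropping `plot = FALSE` draws the histogram as well as returning the counts.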

What to Look For in Displays of Numerical Data:

Center
Spread (narrow or wide?)
Shape (modes and symmetry)
Are there outliers (data values that do not follow the overall pattern)?

Modes:

Unimodal - one major peak
Bimodal - two major peaks
Multimodal - more than two major peaks

Shapes:

Symmetric
Right-skewed (Positively-skewed)
Left-skewed (Negatively-skewed)

Do not expect perfection in the histogram of sample data! Due to sampling variability, there will be small peaks, valleys, and gaps. Do not focus on slight irregularities! Do not put too much weight on features caused by one or a few data values.

Graphs For Paired Numerical Data: Paired or bivariate means that there are two variables to be studied. One variable, called the explanatory variable X, is used to describe the other variable, the response Y. So a sample of paired data consists of ordered pairs (x1, y1), (x2, y2), ..., (xn, yn).

Scatterplot: A graphical display of the relationship between two numerical variables. The explanatory variable is along the horizontal axis. The response variable is along the vertical axis.

[Figure: scatterplot of humerus (vertical axis) versus femur (horizontal axis). Bone lengths (in) from n=5 dinosaurs]

> femur=c(38,50,59,64,74)
> humerus=c(41,63,70,72,84)
> plot(femur,humerus)

Different colors or symbols can be used to distinguish between groups.

[Figure: scatterplot of Year of Birth (vertical axis) versus Year of Hire (horizontal axis), with different plotting symbols for Kept and Laid Off]

> levels(Status)
[1] "Kept" "LaidOff"
> n=length(Status)
> Status.num = rep(1,n)
> Status.num[Status=="Kept"]=16 # solid circle for Kept, matching pch=c(16,1) in the legend
> plot(x=HireYear,y=BirthYear,pch=Status.num,xlab="Year of Hire",ylab="Year of Birth")
> Status.leg=levels(Status)
> Status.leg[2] = "Laid Off"
> Status.leg
[1] "Kept" "Laid Off"
> legend(x=43,y=69,legend=Status.leg,pch=c(16,1))

How to Describe the Relationship between two variables:

(a) Form - linear, non-linear (curved), clustered, etc.

(b) Association - positive or negative. A positive association indicates that increasing values of one variable are associated with increasing values of the other variable. A negative association indicates that increasing values of one variable are associated with decreasing values of the other variable.

(c) Strength - strong, moderate, or weak

QUESTION: House prices in $1000’s: 143.5 132.0 154.5 169.3 134.7 2500

(a) Find the sample mean.

(b) Find the sample median.

(c) Why are the mean and median so different?
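
As a check on parts (a) and (b), the house prices from the question can be passed to base R's `mean()` and `median()`. The variable name `price` is chosen here only for illustration.

```r
# House prices in $1000's, copied from the question
price = c(143.5, 132.0, 154.5, 169.3, 134.7, 2500)

mean(price)        # sample mean: sum of the values divided by n   -> 539
median(price)      # sample median: middle of the ordered list     -> 149

# With the single very large value set aside, the two measures are much closer
mean(price[price < 2500])      # -> 146.8
median(price[price < 2500])    # -> 143.5
```

Setting the 2500 aside moves the mean from 539 to 146.8 but the median only from 149 to 143.5, which is the contrast part (c) asks about.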

IMPORTANT! The mean is strongly affected by (not resistant to) outliers and skewness, whereas the median is not affected by (resistant to) outliers and skewness.

Outliers - the mean is pulled toward the outlier(s)
Skewness - the mean is pulled toward the longer tail

Symmetric: Mean = Median
Left-skewed (Negatively-skewed): Mean < Median
Right-skewed (Positively-skewed): Mean > Median

NOTE: The mean is sensitive to outliers because it uses all the data values. The median is insensitive to outliers because it uses only 1 or 2 of the middle values in the ordered list.

Measures Of Variability (Spread):

Sample Variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n-1} = \dfrac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1}$

Sample Standard Deviation: $s = +\sqrt{s^2}$

A deviation is the signed difference between a data value and the sample mean, $x_i - \bar{x}$.
Standard deviation should be thought of as the “average (or typical) deviation”.
Deviations sum to zero: $\sum (x_i - \bar{x}) = 0$.

The mean and standard deviation have the same units as the data values (e.g. inches, pounds). The variance has units^2 (e.g. inches^2, pounds^2).
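
A short numerical check of the two variance formulas and of the fact that deviations sum to zero, using the femur lengths from the dinosaur example. The names `dev`, `s2.def`, and `s2.short` are chosen here for illustration; `var()` and `sd()` are base R functions.

```r
# Femur lengths from the dinosaur scatterplot example
x = c(38, 50, 59, 64, 74)
n = length(x)
xbar = mean(x)

dev = x - xbar                      # deviations from the sample mean
sum(dev)                            # deviations sum to zero

# Definition formula: sum of squared deviations over n - 1
s2.def = sum(dev^2) / (n - 1)

# Shortcut formula: (sum of x^2 - (sum of x)^2 / n) over n - 1
s2.short = (sum(x^2) - sum(x)^2 / n) / (n - 1)

s = sqrt(s2.def)                    # standard deviation, same units as x

c(s2.def, s2.short, var(x))         # all three agree
c(s, sd(x))
```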

Interquartile Range (IQR): IQR = Q3 - Q1

where Q1 is the first quartile (25% below, 75% above) and Q3 is the third quartile (75% below, 25% above).

Note: Q1 is the median of the lower half of the ordered list and Q3 is the median of the upper half of the ordered list.

QUESTION:

Data: 1 1 2 4 5 7 7 7 8 9 10 Find IQR.
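
One way to carry out the "median of each half" rule in R. This is a sketch under an assumption the notes do not state: when n is odd, the overall median is left out of both halves. The helper name `quartiles.by.halves` is made up for this example. Base R's `IQR()` uses an interpolating convention by default, so its answer can differ from the hand method.

```r
# Quartiles as medians of the lower and upper halves of the ordered list.
# Assumption: for odd n the middle value is excluded from both halves.
quartiles.by.halves = function(x) {
  x = sort(x)
  n = length(x)
  half = floor(n / 2)
  lower = x[1:half]                 # smallest half of the data
  upper = x[(n - half + 1):n]       # largest half of the data
  c(Q1 = median(lower), Q3 = median(upper))
}

x = c(1, 1, 2, 4, 5, 7, 7, 7, 8, 9, 10)
q = quartiles.by.halves(x)
q                                   # Q1 = 2, Q3 = 8
iqr.halves = unname(q["Q3"] - q["Q1"])
iqr.halves                          # 6

# R's default convention interpolates between data values
IQR(x)                              # 4.5
```

The two answers differ only because of the quartile convention; for coursework, the hand rule given in the notes is presumably the one intended.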

IMPORTANT! Standard deviation and variance are both strongly affected by (not resistant to) outliers and skewness, whereas IQR is not affected by (resistant to) outliers and skewness.

Use the mean and standard deviation (or variance) as the measures of center and spread (respectively) when neither outliers nor skewness are present.
Use the median and IQR as the measures of center and spread (respectively) when either outliers or skewness are present.

Five-number Summary: Minimum, Q1, Median, Q3, Maximum

The five-number summary provides measures of center (median) and spread (IQR and range).

Boxplot: plot of the five-number summary

[Figure: boxplot of Year of Birth (30 to 70)]

Outlier Guidelines for a Boxplot:

A “mild” outlier falls between 1.5(IQR) and 3(IQR) away from the nearest quartile. Use a solid circle to denote a mild outlier.
An “extreme” outlier falls more than 3(IQR) away from the nearest quartile. Use an open circle to denote an extreme outlier.
Plot the whiskers to the most extreme non-outlier data value.
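
The guidelines above can be sketched in R using the house prices from the earlier question. Two assumptions are made here: quartiles are taken as medians of the lower and upper halves of the ordered list, and values exactly on a 1.5(IQR) or 3(IQR) boundary are assigned to the lower class. The names `beyond` and `label` are illustrative.

```r
# House prices in $1000's from the earlier question
x = sort(c(143.5, 132.0, 154.5, 169.3, 134.7, 2500))
n = length(x)
half = floor(n / 2)

# Quartiles as medians of the lower and upper halves (assumed convention)
Q1 = median(x[1:half])
Q3 = median(x[(n - half + 1):n])
iqr = Q3 - Q1

# Distance of each value beyond its nearest quartile (0 if inside the box)
beyond = pmax(Q1 - x, x - Q3, 0)

# Label each value using the 1.5(IQR) and 3(IQR) guidelines
label = ifelse(beyond > 3 * iqr, "extreme",
        ifelse(beyond > 1.5 * iqr, "mild", "none"))
data.frame(x, beyond, label)

# Whiskers extend to the most extreme values that are not outliers
range(x[label == "none"])
```

Here the price of 2500 lies far more than 3(IQR) above Q3, so it is labelled an extreme outlier, and the whiskers stop at 132.0 and 169.3.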

Comparative (Side-by-Side) Boxplots:

Great for comparing two or more distributions.
Compare centers and spread.

[Figure: side-by-side boxplots of Year of Birth (30 to 70) for the groups Kept and LaidOff]
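
A minimal sketch of how side-by-side boxplots might be drawn with base R's `boxplot()` and a formula. The values below are hypothetical, made up for illustration; they are not the data behind the figure. The variable names `BirthYear` and `Status` follow the names used in the earlier plotting code.

```r
# Hypothetical data, made up for illustration (not the data behind the figure)
BirthYear = c(35, 41, 44, 52, 58, 63, 31, 33, 38, 40, 47, 55)
Status = factor(rep(c("Kept", "LaidOff"), each = 6))

# One box per level of Status, drawn on a common vertical axis
b = boxplot(BirthYear ~ Status, ylab = "Year of Birth")

# Compare centers and spreads numerically as well
tapply(BirthYear, Status, median)
tapply(BirthYear, Status, IQR)
```

Because both boxes share one axis, differences in center (the median lines) and spread (the box lengths) can be read off directly.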

Exercises

Graphs for paired data, p. 65: 1.39 and 1.
Numerical summaries, p. 66: 1.43 and 1.
Graphs for numerical variables, p. 66: 1.47 (instead of a dot plot, consider a stem and leaf plot), 1.49 - 1.61 odd
Graphs for categorical variables, p. 71: 1.65, 1.67, 1.69abc
