STAT5002 - Introduction to Statistics: Tutorial 1 - Exploring Data and Simpson's Paradox

Tutorial 1

STAT5002 - Introduction to statistics

Feb 27, 2019

Summary

The main objectives of the first lecture were to get you familiar with:
1. Creating an RProject and placing all of your data in the same folder.
2. Reading data into R via functions like read.csv.
3. Creating your own RMarkdown file and calculating simple graphical and numerical summaries.

Some R functions in this tutorial were not covered in the lecture. We will cover them in Week 02. Before then, try looking these functions up on Google and experimenting with them before approaching your tutors.

This week we also discussed the ways data can be collected and the potential sources of bias. We examined various approaches for exploring and summarising data numerically and graphically, along with their limitations. Remember, when choosing a graphical or numerical summary, think...

• Why am I using this summary?
• What properties of the data will this summary highlight?
• Is this summary appropriate for communicating what I want to communicate?

When using a summary, think...

• What is this summary showing me?
• What is this summary not showing me?

And remember:

• Always think critically about how your data was collected, when exploring raw or summarised data, and when someone offers an interpretation of data.
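The question "What is this summary not showing me?" can be made concrete with a small sketch. The two samples below are invented for illustration (they are not from the tutorial data): they share the same mean and median, so a summary of the centre alone cannot tell them apart, while the standard deviation and a boxplot can.

```r
# Two made-up samples with the same centre but very different spread.
a <- c(4, 5, 5, 5, 6)     # tightly clustered around 5
b <- c(0, 1, 5, 9, 10)    # widely spread around 5

# Collect a few numerical summaries side by side.
summaries <- data.frame(
  sample = c("a", "b"),
  mean   = c(mean(a), mean(b)),
  median = c(median(a), median(b)),
  sd     = c(sd(a), sd(b))
)
print(summaries)

# A graphical summary shows the difference that the centre hides.
boxplot(list(a = a, b = b), main = "Same centre, different spread")
```

Reporting only the mean here would suggest the two samples are alike, which is why it helps to pair a measure of centre with a measure of spread or a plot.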

Question 1: Simpson’s Paradox

Three groups of students were asked to sit two tests. The (standardised) test scores were recorded as x and y in the data file Simpsons.csv. The group information was recorded as group, which indicates the high school years completed by the students.

(a). Read the data Simpsons.csv into R. What does header = TRUE do? (Hint: change it to FALSE to see what happens.) Also run the summary() function on the data.

       x                 y               group
 Min.   :-2.4137   Min.   :-1.7786   Min.   :1
 1st Qu.: 0.3374   1st Qu.: 0.3746   1st Qu.:1
 Median : 1.4073   Median : 1.4144   Median :2
 Mean   : 1.5110   Mean   : 1.4912   Mean   :2
 3rd Qu.: 2.6773   3rd Qu.: 2.5812   3rd Qu.:3
 Max.   : 6.2826   Max.   : 5.1187   Max.   :3
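The effect of header = TRUE can be seen without the real file. The sketch below writes a tiny stand-in CSV to a temporary location (the values and the object names with_header and no_header are invented for illustration, not taken from Simpsons.csv) and reads it both ways.

```r
# Write a tiny stand-in file; these values are made up.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y,group", "0.5,1.2,1", "1.8,0.9,2", "3.1,2.4,3"), tmp)

# header = TRUE: the first line of the file supplies the column names.
with_header <- read.csv(tmp, header = TRUE)
names(with_header)
nrow(with_header)

# header = FALSE: the first line is treated as a data row, the columns get
# default names (V1, V2, V3), and the columns are no longer numeric because
# each one now mixes a name with numbers.
no_header <- read.csv(tmp, header = FALSE)
names(no_header)
nrow(no_header)

# summary() gives the minimum, quartiles, mean and maximum of each numeric column.
summary(with_header)
```

For the tutorial itself you would replace tmp with "Simpsons.csv", assuming the file sits in your RProject folder.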

(b). Extract x = dat$x and y = dat$y, and then create a scatter plot of these two variables. What can you conclude from this data? Is the trend between x and y positive or negative? In other words, if a student scored well in the first test (x), is the student more likely to achieve a good score or a poor score in the second test (y)? (If you know about correlation, you can also try running cor(x, y).)
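As a rough illustration of why the group variable matters, here is a sketch on simulated data. The numbers are generated, not read from Simpsons.csv, and the simulation is deliberately built so that the overall trend and the within-group trends disagree. Whether the real data behaves this way is something to check yourself.

```r
set.seed(1)
n <- 50
group <- rep(1:3, each = n)

# Assumed set-up: group centres rise together, but within each group
# a higher x goes with a lower y.
centre <- c(0, 2, 4)[group]
x_sim <- centre + rnorm(3 * n)
y_sim <- centre - 0.8 * (x_sim - centre) + rnorm(3 * n, sd = 0.3)
dat <- data.frame(x = x_sim, y = y_sim, group = group)

# Same steps as in part (b).
x <- dat$x
y <- dat$y
plot(x, y, col = dat$group, pch = 19,
     xlab = "x (test 1)", ylab = "y (test 2)")

# Correlation ignoring the groups, then separately within each group.
overall <- cor(x, y)
by_group <- sapply(split(dat, dat$group), function(d) cor(d$x, d$y))
overall
by_group
```

In this simulation the overall correlation is positive while every within-group correlation is negative, which is the pattern known as Simpson's paradox. Colouring the points by group makes the reversal visible in the scatter plot.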


[Figure: scatter plot of y against x]

[1] 1.
[1] 1.
[1] 1.
[1] 1.
[1] 1.
[1] 1.
[1] 2.

[Figure: histogram of y; vertical axis: Frequency]

50%
1.

50%
1.

       1        2        3
1.046477 1.257155 1.

[Figure: histogram of x$Glasses; vertical axis: Frequency]

Let's get rid of the extreme value.

[Figure: histogram of x$Glasses[use]; vertical axis: Frequency]
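A minimal sketch of dropping an extreme value with a logical vector is shown below. The data frame, its values and the cutoff are all invented for illustration. The source does not show how use was defined, so the rule here is an assumption.

```r
# Invented data with one extreme entry.
x <- data.frame(Glasses = c(2, 3, 4, 4, 5, 6, 8, 250))

hist(x$Glasses, main = "All values")

# Assumed rule: keep only values below a hand-picked cutoff.
use <- x$Glasses < 50

# Indexing with a logical vector keeps the entries where `use` is TRUE.
hist(x$Glasses[use], main = "Extreme value removed")

# The mean moves a lot when the extreme value is dropped; the median barely changes.
mean(x$Glasses)
mean(x$Glasses[use])
median(x$Glasses)
median(x$Glasses[use])
```

Before discarding a value like this, it is worth asking how the data was collected and whether the value is an error or a genuine observation.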

L M U

## [1] 4
## $U
## [1] 3

Warning in mean.default(x$glasses): argument is not numeric or logical: returning NA
[1] NA

Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
NULL
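Warnings like the two above are what R produces when a column name is spelled with the wrong case: x$glasses refers to a column that does not exist if the column is actually called Glasses. The sketch below reproduces this on an invented data frame. The values, including the missing entry, are assumptions for illustration.

```r
# Invented data frame with a capitalised column name and one missing value.
x <- data.frame(Glasses = c(2, 3, 4, NA, 6))

# R is case-sensitive: there is no column called `glasses`, so `$` returns NULL.
is.null(x$glasses)

# mean() of NULL gives a warning and returns NA.
# suppressWarnings() is used here only to keep the demonstration quiet.
wrong <- suppressWarnings(mean(x$glasses))
wrong

# With the exact column name the mean is computed;
# na.rm = TRUE skips the missing entry.
right <- mean(x$Glasses, na.rm = TRUE)
right
```

Running names(x) is a quick way to check the exact spelling of the columns before using them.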

[1] 31.
[1] 4
[1] 4.
[1] 8.
[1] 4
[1] 5.
[1] 13.

Q