









[Week 1] Mean, mode, median, range, IQR










Summary
The main objectives of the first lecture were to get you familiar with:
1. Creating an RProject and placing all of your data in the same folder.
2. Reading data into R via functions like read.csv.
3. Creating your own RMarkdown file and calculating simple graphical and numerical summaries.
Some R functions in this tutorial were not covered in the lecture. We will cover them in Week 02, but before then you should try Googling these functions and experimenting with them before approaching your tutors. This week we also discussed the ways that data can be collected and the potential sources of bias. We examined various approaches for exploring and summarising data numerically and graphically, along with their limitations. Remember, when choosing a graphical or numerical summary think...
When using a summary think...
Question 1: Simpson’s Paradox
Three groups of students were asked to sit two tests. The (standardised) test scores were recorded as x and y in the data file Simpsons.csv. The group information was recorded as group, which is an indication of the high school years completed by the students.
(a). Read the data Simpsons.csv into R. What does header = TRUE do? (Hint: change it to FALSE to see what happens.) Also run the summary() function on the data.
       x                 y               group
 Min.   :-2.4137   Min.   :-1.7786   Min.   :
 1st Qu.: 0.3374   1st Qu.: 0.3746   1st Qu.:
 Median : 1.4073   Median : 1.4144   Median :
 Mean   : 1.5110   Mean   : 1.4912   Mean   :
 3rd Qu.: 2.6773   3rd Qu.: 2.5812   3rd Qu.:
 Max.   : 6.2826   Max.   : 5.1187   Max.   :
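To see what header = TRUE does without needing Simpsons.csv on hand, the sketch below writes a tiny made-up CSV to a temporary file and reads it both ways. The three data rows are hypothetical and only mimic the column names x, y and group.

```r
# Write a small, made-up CSV to a temporary file (values are illustrative only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y,group",
             "0.5,1.2,1",
             "1.7,0.4,2",
             "2.9,2.2,3"), csv_path)

# header = TRUE: the first line supplies the column names.
with_header <- read.csv(csv_path, header = TRUE)
names(with_header)   # "x" "y" "group"
nrow(with_header)    # 3

# header = FALSE: the first line is treated as data, so the columns get the
# default names V1, V2, V3 and are no longer numeric.
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)     # "V1" "V2" "V3"
nrow(no_header)      # 4

summary(with_header)
```

With header = FALSE, summary() can no longer report a minimum, mean or quartiles, because every column now contains the text of the header row.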
(b). Extract x = dat$x and y = dat$y, and then create a scatter plot of these two variables. What can you conclude from this data? Is the trend between x and y positive or negative? In other words, if a student scored well in the first test (x), is the student more likely to achieve a good score or a poor score in the second test (y)? (If you know about correlation, then you can also try to run cor(x,y)).
x = dat$x
y = dat$y
plot(x, y)
cor(x, y)
[1] 0.
(c). Calculate the mean, median, standard deviation, and IQR of x and y. (Hint: use in-built functions for these.)
mean(x)
mean(y)
median(x)
median(y)
sd(x)
sd(y)
IQR(x)
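As a quick check of what these built-in functions return, here is a sketch on a small made-up vector (not the Simpsons.csv data). It also illustrates that IQR() is the distance between the default 25% and 75% quantiles.

```r
# Hypothetical scores, chosen only so the results are easy to verify by hand.
scores <- c(2, 4, 4, 5, 7, 9)

mean(scores)     # arithmetic average
median(scores)   # middle value: the average of 4 and 5, i.e. 4.5
sd(scores)       # sample standard deviation (divides by n - 1)
IQR(scores)      # interquartile range: 2.5

# IQR() is the difference between the upper and lower quartiles.
quartiles <- quantile(scores, c(0.25, 0.75))
unname(quartiles[2] - quartiles[1])   # 2.5
```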
[Figure: Histogram of y]
boxplot(x)
boxplot(y)
(e). Check the documentation for the quantile function by typing ?quantile into R. Then, calculate the median of x and y using type = 1 and type = 7. Which one is “more” valid?
quantile(x, 0.5, type = 7)
quantile(x, 0.5, type = 1)
These different definitions are equally valid.
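The two definitions only disagree when the requested quantile falls between two observations. A minimal sketch on a made-up vector with an even number of points shows the difference:

```r
values <- c(1, 2, 3, 4)  # made-up sample with an even number of points

# type = 1 inverts the empirical distribution function, so it always
# returns one of the observed values.
q_type1 <- unname(quantile(values, 0.5, type = 1))   # 2

# type = 7 (the default) interpolates linearly between order statistics.
q_type7 <- unname(quantile(values, 0.5, type = 7))   # 2.5

# median() agrees with type = 7 here.
median(values)                                        # 2.5
```

With an odd number of points, such as c(1, 2, 3), both types return the middle observation, so the choice matters less as the sample grows.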
(f). Create group = dat$group and then make another scatter plot of x and y, this time using plot(x, y, col = group). What is the trend between x and y within each group now? Should your conclusion in (b) be altered?
group = dat$group
plot(x, y, col = group)
tapply(y, group, IQR)
It should be clear that as the group number increases, the mean and median also increase. The other statistics have more or less the same value across the groups.
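The same tapply() pattern works for any summary function. The sketch below uses a made-up vector (not the Simpsons.csv data) in which the centre rises with the group number while the spread stays fixed, so the pattern described above is easy to see.

```r
# Hypothetical data: three groups of three values each.
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
group <- rep(1:3, each = 3)

tapply(y, group, mean)     # 2 5 8
tapply(y, group, median)   # 2 5 8
tapply(y, group, sd)       # 1 1 1
tapply(y, group, IQR)      # 1 1 1

# Several statistics at once: one row per group, one column per statistic.
group_stats <- sapply(list(mean = mean, median = median, sd = sd, IQR = IQR),
                      function(f) tapply(y, group, f))
group_stats
```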
(h). Create a boxplot for x and y, but now, try to split the boxplot by the group variable.
boxplot(x ~ group)
boxplot(y ~ group)
(i). What did you learn from this exercise? Why was the conclusion reversed when the extra group variable was taken into account? Is pooling data from different sources always a good idea? How should we analyse data when a confounding variable is present?
Some possible ideas:
(j). Reading exercise: read the Wikipedia article on “Simpson’s paradox”.
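The reversal discussed in part (i) can be reproduced with simulated data. The sketch below does not use Simpsons.csv; the seed, group centres, slope and noise level are all assumptions chosen so that the group centres rise together while x and y move in opposite directions within each group.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

n_per_group <- 100
group <- rep(1:3, each = n_per_group)

# Group centres rise with the group number (0, 3, 6 for groups 1, 2, 3).
centre <- 3 * (group - 1)

# Within each group, a higher x goes with a lower y.
e <- rnorm(length(group))
x <- centre + e
y <- centre - 0.8 * e + rnorm(length(group), sd = 0.3)

# Pooled: the rising group centres dominate, so the correlation is positive.
pooled_cor <- cor(x, y)

# Within groups: the correlation is negative in every group.
within_cor <- sapply(split(data.frame(x, y), group),
                     function(d) cor(d$x, d$y))

pooled_cor
within_cor

plot(x, y, col = group)
```

Here the group variable acts as the confounder: it drives both x and y upward, which hides the negative relationship that holds inside each group.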
Question 2
A government agency wanted to know how many glasses of water a person drinks per day, on average. To do this, they randomly called the home phones of 20 people. As part of the survey, they asked which of 3 age groups each person fell into; the groups represent the lower (L), middle (M) and upper (U) thirds of the population.
Glasses  8  7  3  4  3  3  2  4  5  8  9  2  4  5  6  3  2  4  9  100
Age      L  M  U  U  U  M  U  M  M  L  L  M  M  U  U  U  U  U  M  L
a. Characterise and explore the data using numerical and graphical summaries. What do you notice?
b. Calculate the mean and median number of glasses of water that were drunk by the survey participants.
c. Calculate the mean and median number of glasses of water that were drunk by survey participants in each age group.
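The solution code for this question refers to a data frame x with columns Glasses and Age. One way to enter the table above by hand is sketched here; the name x and the column names are taken from that code, and storing Age as plain text is an assumption.

```r
# Survey data copied from the table above.
x <- data.frame(
  Glasses = c(8, 7, 3, 4, 3, 3, 2, 4, 5, 8, 9, 2, 4, 5, 6, 3, 2, 4, 9, 100),
  Age = c("L", "M", "U", "U", "U", "M", "U", "M", "M", "L",
          "L", "M", "M", "U", "U", "U", "U", "U", "M", "L")
)

nrow(x)        # 20 respondents
table(x$Age)   # L: 4, M: 7, U: 9
```

The counts per age group are already uneven, even though each group represents a third of the population.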
a)
From these summaries we might see...
# Numerical summary
summary(x)
hist(x$Glasses, breaks = 20)
[Figure: Histogram of x$Glasses]
boxplot(x$Glasses[x$Age == 'L'], x$Glasses[x$Age == 'M'], x$Glasses[x$Age == 'U'])
# boxplot(x$Glasses ~ x$Age)
iqr = quantile(x$Glasses, .75) - quantile(x$Glasses, .25)
# Use an arbitrary rule to filter out extreme values (outliers)
use = x$Glasses > median(x$Glasses) - 1.5 * iqr & x$Glasses < median(x$Glasses) + 1.5 * iqr
hist(x$Glasses[use], breaks = 20)
[Figure: Histogram of x$Glasses[use]]
boxplot(x$Glasses[use] ~ x$Age[use])
# boxplot(x$Glasses[x$Age == 'L' & use], x$Glasses[x$Age == 'M' & use], x$Glasses[x$Age == 'U' & use])
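The filtering rule above is centred on the median. A more conventional variant measures 1.5 × IQR from the quartiles instead, which is close to what boxplot() does by default. The sketch below compares the two on the survey values; the variable names are illustrative.

```r
glasses <- c(8, 7, 3, 4, 3, 3, 2, 4, 5, 8, 9, 2, 4, 5, 6, 3, 2, 4, 9, 100)

iqr <- IQR(glasses)

# Rule used above: keep values within 1.5 * IQR of the median.
keep_median_rule <- glasses > median(glasses) - 1.5 * iqr &
  glasses < median(glasses) + 1.5 * iqr

# Quartile-based fences: keep values within 1.5 * IQR of the quartiles.
q <- quantile(glasses, c(0.25, 0.75))
keep_quartile_rule <- glasses >= q[1] - 1.5 * iqr &
  glasses <= q[2] + 1.5 * iqr

glasses[!keep_median_rule]     # 100
glasses[!keep_quartile_rule]   # 100
boxplot.stats(glasses)$out     # 100, the point boxplot() draws separately
```

For this sample all three approaches flag only the value 100, so the choice of rule does not change the conclusion.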
d)
To get an estimate of the population’s center (mode) we have many options, and none are “right”. One might simply take the mean of all samples (or the median, to reduce the influence of the outlier).
By definition, each age group is a third of the population, so all the age groups should be seen equally often in our survey; however, we do not observe this. This may be due to response bias. We might be able to work around this bias by weighting the estimates of center from each of the groups equally. I like taking the mean of the medians... Taking the medians within each age group reduces the influence of the outlier. I am more comfortable taking the mean of the three age groups than the median, as I “feel” this is better (I “feel” that a mean of 3 numbers is probably better than the median of 3 numbers).
# Naive
mean(x$Glasses)
median(x$Glasses)
# Weighting all age groups equally.
# The group values are wrapped in c(): mean(a, b, c) would average only a.
mean(c(mean(x$Glasses[x$Age == 'L']), mean(x$Glasses[x$Age == 'M']), mean(x$Glasses[x$Age == 'U'])))
median(c(median(x$Glasses[x$Age == 'L']), median(x$Glasses[x$Age == 'M']), median(x$Glasses[x$Age == 'U'])))
median(c(mean(x$Glasses[x$Age == 'L']), mean(x$Glasses[x$Age == 'M']), mean(x$Glasses[x$Age == 'U'])))
mean(c(median(x$Glasses[x$Age == 'L']), median(x$Glasses[x$Age == 'M']), median(x$Glasses[x$Age == 'U'])))
# The same estimates, written more compactly
median(unlist(lapply(split(x$Glasses, x$Age), median)))
mean(unlist(lapply(split(x$Glasses, x$Age), median)))
mean(unlist(lapply(split(x$Glasses, x$Age), mean)))
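When averaging several group summaries, note that mean() treats only its first argument as data; the second positional argument is trim. The sketch below shows the pitfall and then computes the equally weighted estimates with tapply(); the vector names glasses and age are illustrative.

```r
# Pitfall: extra positional arguments are not treated as data.
mean(1, 2, 3)      # 1
mean(c(1, 2, 3))   # 2

# Survey data from the question.
glasses <- c(8, 7, 3, 4, 3, 3, 2, 4, 5, 8, 9, 2, 4, 5, 6, 3, 2, 4, 9, 100)
age <- c("L", "M", "U", "U", "U", "M", "U", "M", "M", "L",
         "L", "M", "M", "U", "U", "U", "U", "U", "M", "L")

# One median per age group: L = 8.5, M = 4, U = 3.
group_medians <- tapply(glasses, age, median)

mean(group_medians)     # mean of the medians, about 5.17
median(group_medians)   # median of the medians, 4
mean(glasses)           # naive mean, 9.55, pulled up by the value 100
```

The mean of the medians sits well below the naive mean because the single response of 100 affects only the median of group L, and only slightly.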
Question 3
Download one or more datasets from http://www.maths.usyd.edu.au/u/UG/JM/StatsData.html or elsewhere.
a. Use numerical and graphical summaries to explore the datasets.
# If needed, explore the documentation for each function, i.e. ?stripchart
# or help('stripchart')
install.packages('vioplot')  # Install the vioplot library
library('vioplot')           # Load the vioplot library
plot(table(x))
barplot(table(x))
hist(x)
plot(density(x))
boxplot(x)
stripchart(x)
vioplot(x)
c. Many of the plotting functions allow you to overlay two plots in the one window. Can you combine these with any of the plots above? Do these change any of your previous interpretations or your assessment of advantages and disadvantages?
points(density(x))
stripchart(x, add = TRUE, vertical = TRUE)
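One combination that often works well is a histogram with a density curve drawn on top. The sketch below uses simulated values as a stand-in for a downloaded dataset; the seed and sample size are arbitrary assumptions.

```r
set.seed(2)        # arbitrary seed for reproducibility
x <- rnorm(200)    # simulated data standing in for a real dataset

# freq = FALSE puts the histogram on the density scale, so the bars and the
# density curve share the same y-axis.
h <- hist(x, freq = FALSE)

# lines() adds to the existing plot instead of starting a new one.
lines(density(x))

# rug() marks each observation along the x-axis.
rug(x)
```

Without freq = FALSE the histogram shows counts, and the density curve would appear as a nearly flat line at the bottom of the plot.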
d. Many functions have parameters that alter their behaviour. Do these alter your interpretations?
stripchart(list(x, x), vertical = TRUE, method = 'jitter', jitter = 1)
points(density(x), type = 'l')
hist(x, breaks = 100)
e. Can you use any of these functions to compare two variables? Does this alter your findings?
The answers to these questions will depend greatly on the datasets you look at. A general, but non-exhaustive list of observations may include.