Excel: Summarizing Continuous Data with Titanic Dataset: Location, Spread, Scatterplot, Schemes and Mind Maps of Statistics

Instructions on how to summarize continuous data using Excel, specifically using the Titanic dataset. It covers measures of location (mean, median), measures of spread (standard deviation, quartiles), minimum and maximum values, and scatterplots to assess relationships between variables. The document also includes steps to create histograms and adjust chart settings.

Typology: Schemes and Mind Maps

2021/2022

Uploaded on 07/05/2022

lee_95
lee_95 🇦🇺

4.6

(59)

999 documents

1 / 12

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Summarising data in Excel
Maths and Statistics Help Centre
1
Data
After collection of data, most people will enter their data into a spreadsheet using the following layout:
Each row represents an individual and each measurement (variable) for that individual is in a separate
column. This example shows data available about the passengers aboard the ship Titanic when it sank in
1912. This dataset will be used to demonstrate summary statistics and charts and is available with this
handout so that everything can be reproduced by the reader.
Summarising continuous data
Continuous variables are those measured on a numerical scale such as height, blood pressure and urine
flow. The continuous variables for the Titanic data are age and fare.
Summary Statistics for continuous variables
Charts
Measures of location: Mean and median
Histogram
Measures of spread: Standard deviation and quartiles
Scatterplot
Minimum and maximum values
Boxplot
Line plot
The data analysis toolpak has an option for Decriptive Statistics. If the toolpak is loaded up, the data
analysis icon will appear on the data sheet.
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Excel: Summarizing Continuous Data with Titanic Dataset: Location, Spread, Scatterplot and more Schemes and Mind Maps Statistics in PDF only on Docsity!

Maths and Statistics Help Centre

Data

After collection of data, most people will enter their data into a spreadsheet using the following layout:

Each row represents an individual and each measurement (variable) for that individual is in a separate column. This example shows data available about the passengers aboard the ship Titanic when it sank in

  1. This dataset will be used to demonstrate summary statistics and charts and is available with this handout so that everything can be reproduced by the reader.

Summarising continuous data

Continuous variables are those measured on a numerical scale such as height, blood pressure and urine flow. The continuous variables for the Titanic data are age and fare.

Summary Statistics for continuous variables Charts Measures of location: Mean and median Histogram Measures of spread: Standard deviation and quartiles Scatterplot Minimum and maximum values Boxplot Line plot

The data analysis toolpak has an option for Decriptive Statistics. If the toolpak is loaded up, the data analysis icon will appear on the data sheet.

Maths and Statistics Help Centre

If it is not there, go to file options  add ins, select the Analysis toolpak option and then ‘Go’

To produce summary statistics for age, select the ‘Descriptive Statistics’ option from the toolpak.

Select the column of data to be analysed.

Tick so the first row used as labels.

Select summary statistics

Maths and Statistics Help Centre

This is the output but the chart is not actually a histogram, it is more like a bar chart and therefore some adjustments are needed. There should be no gap between the bars and the x-axis labels should be right aligned to be closer to the tick marks. The tick marks represent the boundaries where two classes meet.

Upper class boundary (bin) Frequency 10 86 20 162 30 361 40 210 50 132 60 62 70 27 80 6 More 0

If you just need a frequency table do not select the chart output

0

50

100

150

200

250

300

350

400

Frequency

Upper class boundary (bin)

Histogram

Frequency

Maths and Statistics Help Centre

Highlight the bars, right click, select format data series and reduce the gap width to ‘No gap’.

Tidy up the graph so that it looks like this:

Scatterplots

A scatter plot is a useful tool when assessing the strength of a relationship between two continuous variables and correlation helps determine whether relationships between these variables exist. Pearson’s correlation co-efficient (r) measures the strength of a relationship between two continuous variables. It is a number between - 1 and 1 where - 1 is perfect negative correlation and +1 is perfect positive correlation. The command for calculating the correlation coefficient is CORREL(array 1, array 2) where array 1 is the 1st variable and array 2 the 2nd.

0

100

200

300

400

10 20 30 40 50 60 70 80 More

Frequency

Age (years)

Histogram showing age range of

passengers aboard the Titanic

Maths and Statistics Help Centre

The scatterplot shows no real relationship between age and cost of ticket and the very small correlation co-efficient of 0.18 confirms this. It’s clear from the chart that there are some outliers with very expensive tickets.

Using bar charts to display means and standard deviations

Sometimes, bar charts are used to compare means and spread for different groups. If this is required, the means and standard deviations for each group must be contained within a summary table. If the data for all groups is contained within one column, the first step is to re-organise the data into separate columns one for each group. Age within each class will be used to demonstrate here. Sort the class and age variables by class, then move 2nd^ and 3rd^ class into separate columns. Use the descriptive option in the data analysis toolpak to summarise by class. Highlight the class names and the row containing the means before selecting column chart.

It is clear from the resulting chart that the mean age decreases with class i.e. the average age in first class was higher than the average age in 2nd^ and 3rd^ class. The addition of confidence interval bars can be useful especially if you are looking for evidence of a difference between groups.

Maths and Statistics Help Centre

As every sample of people results in a different mean, confidence intervals are often quoted in studies alongside the mean. The 95% confidence interval states the two values between which we would expect to find 95% of sample means if the experiment was repeated numerous times. The wider the interval, the less reliable the sample mean. The confidence level given in the descriptive statistics output is NOT the confidence interval. Confidence interval = mean confidence level. For example, for 1st^ class the confidence interval is 39.

39. 16  1. 7  37. 46 , 40. 86 . So if there were 100 other ships and samples of 284 1st^ class passengers were

taken from each, 95 of those samples would be expected to have a mean age between 37.46 and 40.86.

To add this information to the bar chart go to layout  error bars  more error bar options  custom and select the row with the confidence level values for both the positive and negative error values.

The confidence intervals are very narrow suggesting that the mean age by class is a good estimate of the general population. The sample sizes are quite large which leads to narrower confidence intervals. Confidence intervals are also a quick way of looking for significant differences between groups. If the confidence intervals for two groups do not overlap, there is evidence of a significant difference between the means of the groups. If you just want to display the confidence intervals, choose ‘no fill’ and ‘no line’ for the bars so that they disappear.

Maths and Statistics Help Centre

Contingency tables

Stacked or multiple bar charts require count data for two categorical variables displayed in a contingency table. A contingency table or cross-tabulation table displays the frequencies for each combination of the two variables. For example, the following table shows the numbers of passengers who died within each class.

1st class

2nd class

3rd class Total Died 123 158 528 809 Survived 200 119 181 500 Total 323 277 709 1309

To create the table above, put the variable ‘Survived’ into three columns, (one for each class) and then use the countif(data, criteria) command. The commands for calculating row and column percentages are also included below.

Maths and Statistics Help Centre

Exercise 3: Which percentages are better for answering the question ‘Was chance of survival equal in every class?’.

Column %’s 1st class

2nd class

3rd class Died 38% 57% 74% Survived 62% 43% 26% Total 100% 100% 100%

Row %’s 1st class

2nd class

3rd class Total Died 15% 20% 65% 100% Survived 40% 24% 36% 100%