DATA ANALYSIS REPORT (TEAM 5), Slides of Machine Learning

Australian Lutheran College (ALC), Machine Learning

Typology: Slides

2021/2022

Uploaded on 07/05/2022

DATA ANALYSIS REPORT

(TEAM 5)

Technical Team Challenge

Mohamed Mohamed, Joel Aderemi, Taiye Lawal, Nonyelum Anigbo, Anndior Boateng and Muhsen Hussein

27.07.2021

HDR UK Black Internship

INTRODUCTION
DATA EXPLORATION
METHODS
MODELLING AND RESULTS
CONCLUSION
REFERENCES

DATA EXPLORATION

How The Data Was Collected

The dataset was collected from Kaggle Datasets. It contains summary COVID-19-related case counts for each of 220 countries, as of 30 June 2021. It was downloaded to a desktop folder and later loaded into a pandas DataFrame for analysis.

Features Identified for Analysis

The features identified for the analysis are total confirmed cases, total recovered cases, total deaths, continent, and population. We chose these features because we want to know how COVID-19 has affected various parts of the world, by comparing confirmed cases with deaths and recoveries in relation to each country's population.

Screenshots of Pandas-Profiling Reports

Fig 1: Loading the Dataset in Pandas Dataframe
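The loading and feature-selection step described above might look roughly like the sketch below. It is not the team's actual code: the column names, the helper functions `load_features` and `add_case_rates`, and the three sample rows are hypothetical placeholders, and an in-memory CSV stands in for the downloaded Kaggle file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded Kaggle CSV. The column names and
# the numbers are illustrative placeholders, not values from the report.
SAMPLE_CSV = """Country,Continent,Population,Total Cases,Total Recovered,Total Deaths
A,Asia,1000000,50000,48000,500
B,Europe,2000000,80000,70000,2000
C,Asia,500000,10000,9500,100
"""

# The five features named in the report (column spellings are assumed).
FEATURES = ["Continent", "Population", "Total Cases", "Total Recovered", "Total Deaths"]


def load_features(source) -> pd.DataFrame:
    """Read the CSV and keep only the country name and the chosen features."""
    df = pd.read_csv(source)
    return df[["Country"] + FEATURES]


def add_case_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple ratios that make countries of different sizes comparable."""
    out = df.copy()
    out["Death Rate"] = out["Total Deaths"] / out["Total Cases"]
    out["Recovery Rate"] = out["Total Recovered"] / out["Total Cases"]
    out["Cases Per Capita"] = out["Total Cases"] / out["Population"]
    return out


covid = add_case_rates(load_features(io.StringIO(SAMPLE_CSV)))

# Total confirmed cases per continent, the quantity shown in a bar plot.
by_continent = covid.groupby("Continent")["Total Cases"].sum()
print(by_continent)
```

With a real file, `load_features` would be given the path to the CSV instead of the `io.StringIO` object.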

Fig 2: Line Plot Showing Population Distributions

Fig 3: Bar Plot Showing Total Confirmed Cases by Continents

Fig 6: Histogram Showing Mean Total Deaths Distributions

Visualisations Generated Using Power BI

We also generated visualisations in Power BI to compare them with those obtained from the pandas-profiling reports.

Fig 7: Visualisation Generated Using Power BI

MODELLING AND RESULTS

Using the coronavirus data, we set out to predict one feature from the others using Machine Learning methods in Python. The feature we chose to predict was the total number of deaths in each country, as we felt it was the most relevant and important quantity for a country to try to predict.

After understanding and cleaning the data, we normalised the values and removed features that showed a low correlation or could contribute to over-fitting. This took us from our original list of columns to a restricted list that omits the following: Total Deaths Per 1m Population, Total Cases Per 1m Population, Total Tests Per 1m Population. We did not lose any information about coronavirus by omitting these columns, as the information they contain is already reflected in a combination of other parameters.

We split the normalised data into training and validation sets and fitted 7 linear Machine Learning regression models to the training data. We then evaluated each model against the held-out validation data by calculating the normalised mean squared error (MSE), the mean absolute error (MAE) and the R-squared score.
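A minimal sketch of this workflow with scikit-learn is given below, under several assumptions. The report names only the Theil-Sen, SGD and Lasso LARS models, so the other four estimators are guesses chosen to make up seven. The report does not say which normalisation was used, so standardisation is assumed. Synthetic data stands in for the COVID-19 table.

```python
import numpy as np
from sklearn.linear_model import (
    BayesianRidge,
    ElasticNet,
    LassoLars,
    LinearRegression,
    Ridge,
    SGDRegressor,
    TheilSenRegressor,
)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 features, linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scalers on the training split only, then apply them to both splits.
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_train_s = x_scaler.transform(X_train)
X_val_s = x_scaler.transform(X_val)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_val_s = y_scaler.transform(y_val.reshape(-1, 1)).ravel()

# Seven linear models; only TheilSen, SGD and LassoLars appear in the report.
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "ElasticNet": ElasticNet(),
    "BayesianRidge": BayesianRidge(),
    "TheilSen": TheilSenRegressor(random_state=0),
    "SGD": SGDRegressor(random_state=0),
    "LassoLars": LassoLars(),
}

# Fit each model and score it on the normalised validation data.
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train_s)
    pred = model.predict(X_val_s)
    results[name] = {
        "MSE": mean_squared_error(y_val_s, pred),
        "MAE": mean_absolute_error(y_val_s, pred),
        "R2": r2_score(y_val_s, pred),
    }

for name, scores in results.items():
    print(f"{name:>16}: " + ", ".join(f"{k}={v:.4f}" for k, v in scores.items()))
```

Fitting the scalers on the training split alone keeps information about the validation rows out of the models, which is why the split happens before the normalisation in this sketch.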

Findings

Results using all parameters

We first ran our models using all parameters, omitting only the following columns: Total Deaths Per 1m Population, Total Cases Per 1m Population, Total Tests Per 1m Population. This gave the following performance results.

The Theil-Sen estimator performed the best by far, with very low errors and a very high R-squared score. This tells us that it fits this data set very well and can accurately predict the number of deaths.

Figure 1: A graph showing the parameter weightings from the 7 models

If we zoom into the parameter weightings given by the Theil-Sen model (shown in brown), we can see that the weights of confirmed cases and recovered cases are large in magnitude compared to the other parameters and models. This is expected, as the number of deaths should be close to the difference between the number of confirmed cases and the number of people who have recovered, the remainder being cases that are still active. However, the Theil-Sen estimator appears to be

where there is a significant decline in the performance of the regressor. The model with the best predictive performance is now the SGD Regressor, which outperforms the Theil-Sen model in every metric except the mean absolute error.

The performance of the Lasso LARS regressor remains poor in both sets of results. However, its default $\alpha$ is 1, and optimising the penalty would produce a better Lasso estimate. For example, by setting $\alpha = 10^{-6}$, the Lasso results improved to MSE = 0.202308, MAE = 0.171545 and R$^2$ = 0.

Figure 2: A graph showing the parameter weightings from the 7 models

We also plotted and compared the fitted parameter weightings of each of the models in Figure 2. From this we can see that most models weighted the parameters similarly. The highest weightings were given to Total Tests, Active Cases and Total Recovered, which suggests that these are the most important features for predicting total coronavirus deaths with a linear model. We can also conclude that Population and Continent had minimal impact on the total number of deaths compared to the other parameters.
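The effect of the Lasso penalty described above can be illustrated with a small sketch. It uses scikit-learn's `LassoLars` on synthetic data, which is an assumption made for illustration, so its scores will not match the figures quoted in the report. With the default `alpha=1.0` the penalty is strong enough here to shrink every weight to zero, while `alpha=1e-6` leaves the fit close to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import LassoLars
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption): small true weights on a normalised scale.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.6, -0.3, 0.1]) + rng.normal(scale=0.05, size=300)

# Simple ordered split: first 200 rows for training, the rest for validation.
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

scores = {}
coefs = {}
for alpha in (1.0, 1e-6):
    model = LassoLars(alpha=alpha).fit(X_train, y_train)
    scores[alpha] = r2_score(y_val, model.predict(X_val))
    coefs[alpha] = model.coef_
    print(f"alpha={alpha:g}: R2={scores[alpha]:.4f}, weights={np.round(model.coef_, 3)}")
```

In practice the penalty would usually be chosen by cross-validation (for example with `LassoLarsCV`) rather than by trying a single smaller value.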

CONCLUSION

According to the box plot, the highest number of confirmed cases was in Asia. However, the number of total deaths in Asia is low compared to the other continents because its recovery rate is high.

The challenges faced in this work were the limited coding experience in the team and the complexity of the data sets. The initial data was too complex for the team to use, so we had to use another data set. A success of this work was that the team was able to use Python to apply data science techniques and visualisations.
