









DATA ANALYSIS REPORT (TEAM 5)










Mohamed Mohamed, Joel Aderemi, Taiye Lawal, Nonyelum Anigbo, Anndior Boateng and Muhsen Hussein 27.07. HDR UK Black Internship
DATA EXPLORATION

How The Data Was Collected

The dataset was collected from Kaggle Datasets. It contains summary COVID-19-related cases for each of the 220 countries, as of 30th of June 2021. It was downloaded into a desktop folder and later loaded into a pandas DataFrame for analysis.

Features Identified for Analysis

The features identified for the analysis are total confirmed cases, total recovered cases, total deaths, continent, and population. We chose these features because we wanted to know how COVID-19 has affected various parts of the world, by comparing confirmed cases with deaths and recoveries relative to each country's population.

Screenshots of Pandas-Profiling Reports

Fig 1: Loading the Dataset in Pandas Dataframe
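The loading and feature-selection step described above could be sketched roughly as follows. This is only an illustration, not the team's actual notebook: the three-row extract, the column names and the helper names `load_features` and `continent_summary` are all assumptions, since the report does not show the real Kaggle file.

```python
import io

import pandas as pd

# Hypothetical three-row extract. The real Kaggle file and its exact
# column names are not shown in the report, so these are assumptions.
CSV_TEXT = """Country,Continent,Population,Total Cases,Total Recovered,Total Deaths
A,Asia,1000000,50000,48000,500
B,Europe,500000,40000,36000,1000
C,Asia,2000000,60000,57000,600
"""

FEATURES = ["Continent", "Population", "Total Cases",
            "Total Recovered", "Total Deaths"]


def load_features(source):
    """Read the CSV and keep only the features chosen for analysis."""
    df = pd.read_csv(source)
    return df[FEATURES]


def continent_summary(df):
    """Sum the counts per continent and relate them to cases and population."""
    totals = df.groupby("Continent")[FEATURES[1:]].sum()
    totals["Deaths per Case"] = totals["Total Deaths"] / totals["Total Cases"]
    totals["Recovered per Case"] = totals["Total Recovered"] / totals["Total Cases"]
    totals["Cases per 1m Population"] = (
        totals["Total Cases"] / totals["Population"] * 1_000_000
    )
    return totals


features = load_features(io.StringIO(CSV_TEXT))
summary = continent_summary(features)
print(summary)
```

With the real dataset, `io.StringIO(CSV_TEXT)` would be replaced by the path of the downloaded CSV file.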
Fig 2: Line Plot Showing Population Distributions
Fig 3: Bar Plot Showing Total Confirmed Cases by Continents
Fig 6: Histogram Showing Mean Total Deaths Distributions
Visualisations Generated Using Power BI

We also generated visualisations with Power BI, to compare the analysis with the one obtained from the pandas-profiling reports.

Fig 7: Visualisation Generated Using Power BI
MODELLING AND RESULTS

Using the coronavirus data, we set out to predict one of the features with Machine Learning methods in Python. The feature we chose to predict was the total number of deaths in each country, as we felt it was the most relevant and important quantity for a country to try to predict.

After understanding and cleaning the data, we normalised the values and removed features that showed a low correlation or could contribute to over-fitting. This took us from our original list of headers to a restricted list without the following columns: Total Deaths Per 1m Population, Total Cases Per 1m Population, Total Tests Per 1m Population. We did not lose any information about coronavirus by omitting these columns, as the information they contain is already reflected in a combination of other parameters.

We split the normalised data into training and validation sets and applied 7 linear Machine Learning Regression models to the training data. We then evaluated each model against the held-out validation (test) data by calculating the normalised mean squared error (MSE), the mean absolute error (MAE) and the R-squared score.
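The normalisation, split and three evaluation scores mentioned above can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the report does not say which normalisation or split ratio was used, so min-max scaling and a 20% held-out fraction are assumed here, and all function names and numbers are hypothetical.

```python
import random


def min_max_normalise(values):
    """Rescale a list of numbers to the range [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant target")
    return 1.0 - ss_res / ss_tot


# Made-up death counts, normalised to [0, 1].
normalised = min_max_normalise([500, 1000, 600, 2500, 1500])
train_idx, test_idx = split_indices(10)

# Made-up normalised targets and predictions for the held-out rows.
y_true = [0.0, 0.5, 1.0]
y_pred = [0.1, 0.5, 0.8]
print(mean_squared_error(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(r_squared(y_true, y_pred))
```

A lower MSE and MAE and an R-squared closer to 1 indicate a better fit, which is how the models are ranked in the results.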
We first ran our models using all parameters, omitting only the following columns: Total Deaths Per 1m Population, Total Cases Per 1m Population, Total Tests Per 1m Population. This gave the following performance results.
As you can see, the Theil-Sen estimator performed best by far, with very low errors and a very high R-squared score. This suggests it is a very good model for this data set and can accurately predict the number of deaths.

Figure 1: A Graph showing the parameter weightings from the 7 Models

If we zoom into the parameter weightings given by the Theil-Sen model (shown in brown), we can see that the magnitudes of the weights for confirmed cases and recovered cases are large compared to the other parameters and models. This is expected, as the number of deaths should be roughly the number of confirmed cases minus the number of people who have recovered, once active cases are accounted for. However the Theil-Sen estimator appears to be
where there is a significant decline in the performance of the Regressor. The model with the best predictive performance is now the SGD Regressor, which outperforms the Theil-Sen model in all categories except the mean absolute error. The performance of the Lasso Lars Regressor remains poor in both sets of results. However, its default $\alpha$ is 1, and optimising the penalty would produce a better Lasso estimate. For example, changing $\alpha$ to 10$^{-6}$ improved the Lasso results to MSE = 0.202308, MAE = 0.171545 and R$^2$ = 0.

Figure 2: A Graph showing the parameter weightings from the 7 Models

We also plotted and compared the parameter weightings of each of the models in Figure 2. From this we can see that most models weighted the parameters similarly. The highest weightings were for Total Tests, Active Cases and Total Recovered, which suggests that these are the most important features for predicting total coronavirus deaths with a linear model. We can also conclude that Population and Continent had minimal impact on the Total Deaths number compared to the other parameters.
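The sensitivity of the Lasso results to $\alpha$ noted above can be illustrated with a single-feature sketch. It assumes the usual Lasso objective, (1/(2n)) Σ(y − w·x)² + α·|w|, with no intercept, for which the minimiser has a closed form via soft-thresholding. The data and function names are made up for illustration; this is not the multi-feature Lasso Lars model from the report and does not reproduce its scores.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_1d(x, y, alpha):
    """Minimise (1/(2n)) * sum((y_i - w*x_i)**2) + alpha*|w| over w."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n  # feature-target covariance term
    z = sum(xi * xi for xi in x) / n                # feature scale term
    return soft_threshold(rho, alpha) / z


# Made-up normalised data with true relationship y = 0.8 * x.
x = [0.0, 0.25, 0.5, 0.75, 1.0]
y = [0.0, 0.2, 0.4, 0.6, 0.8]

w_default = lasso_1d(x, y, alpha=1.0)   # penalty swamps the signal
w_small = lasso_1d(x, y, alpha=1e-6)    # penalty is almost negligible
print(w_default, w_small)
```

On data normalised to [0, 1], the term `rho` is well below 1, so a penalty of $\alpha$ = 1 shrinks the weight all the way to zero and the model predicts nothing useful, while a very small $\alpha$ recovers a weight close to 0.8. This is one plausible reason why the default setting performs poorly on normalised data.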
CONCLUSION

According to the box plot, the highest number of confirmed cases was in Asia. However, the number of total deaths in Asia is lower relative to the other continents because its recovery rate is high.

The challenges faced in this work were the limited coding experience in the team and the complicated data sets. The initial data was too complex for the team to use, so we had to use another data set. A success of this work was that the team was able to use Python to apply data science techniques and visualisations