

Machine learning in R: kNN, neural network


The objective is to predict whether a patient has diabetes, based on diagnostic measurements. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the diabetes outcome and measurements such as Glucose, Insulin, and BMI. I will apply the kNN machine learning algorithm to these measurements to predict the diabetes outcome.
The data can be downloaded from https://www.kaggle.com/mathchi/diabetes-data-set
KNN stands for K-Nearest Neighbors. It is a supervised machine learning algorithm based on similarity: here it is used as a classifier that predicts the class of a target variable from the classes of a defined number (K) of nearest neighbors in the training data. A common rule of thumb is to start with K near the square root of N, the total number of points in the training data set, and then tune it. KNN has the advantage of being nonparametric, so it assumes no particular form for the distribution of the data.
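To make the nearest-neighbor idea concrete, here is a minimal sketch in base R. It is not the code from the original write-up: the function name knn_predict, the synthetic two-cluster data, and the choice to round K to an odd number are all assumptions made for illustration.

```r
# Minimal k-nearest-neighbors classifier in base R (illustrative sketch only).
# knn_predict is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    # Euclidean distance from this point to every training point
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    # Labels of the k closest training points
    nearest <- train_y[order(d)[seq_len(k)]]
    # Majority vote; on a tie, the first factor level wins
    names(which.max(table(nearest)))
  })
}

# Toy data: two well-separated clusters (synthetic, not the diabetes data)
set.seed(1)
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 5), ncol = 2))
train_y <- factor(rep(c("neg", "pos"), each = 20))

# Rule-of-thumb starting value: K near sqrt(N), made odd to reduce ties
k <- floor(sqrt(nrow(train_x)))
if (k %% 2 == 0) k <- k + 1

new_x <- rbind(c(0, 0), c(5, 5))
pred <- knn_predict(train_x, train_y, new_x, k)
print(pred)
```

In practice the square-root value is only a starting point; K is usually tuned by cross-validation.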
STEP 1: Load the Libraries
STEP 2: Load the data set
STEP 3: Exploring and preparing the data
Convert the diabetes Outcome into a factor.
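The original code for this step is not shown, so the following is only a sketch of one way to do the conversion. The small data frame stands in for the full data set, and the labels "neg" and "pos" are assumptions chosen for readability.

```r
# Small stand-in for the diabetes data (a few illustrative rows only).
diabetes <- data.frame(
  Glucose = c(148, 85, 183, 89),
  BMI     = c(33.6, 26.6, 23.3, 28.1),
  Outcome = c(1, 0, 1, 0)
)

# Recode the 0/1 outcome as a factor with readable labels.
# The labels "neg"/"pos" are an assumption, not taken from the source.
diabetes$Outcome <- factor(diabetes$Outcome,
                           levels = c(0, 1),
                           labels = c("neg", "pos"))

str(diabetes$Outcome)
table(diabetes$Outcome)
```

Storing the outcome as a factor tells modeling functions to treat the task as classification instead of regression.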
Visualization of the association between the diabetes outcome and all 8 variables
From the figures, we noticed that the predictors most strongly associated with diabetes are Glucose, Age, Insulin, Pregnancies, and Pedigree. The least associated are SkinThick and BP.
Here is the correlation matrix. First, no pair of predictors has abs(r) > 0.75. Second, the histograms illustrate how the distributions differ by diabetes outcome. Lastly, it shows the significance of the correlations among the predictors.
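The abs(r) > 0.75 check can be reproduced with base R alone. The sketch below uses synthetic, independently generated columns as a stand-in for the real predictors, so the values it prints say nothing about the actual data; the variable names high and high_pairs are hypothetical.

```r
# Synthetic stand-in predictors (independent by construction).
set.seed(42)
n <- 200
predictors <- data.frame(
  Glucose = rnorm(n, 120, 30),
  BMI     = rnorm(n, 32, 7),
  Age     = rnorm(n, 33, 11)
)

# Pearson correlation matrix of the predictors
r <- cor(predictors)
print(round(r, 2))

# Flag pairs with |r| above 0.75; use the upper triangle so each pair
# is listed once and the diagonal is skipped
high <- which(abs(r) > 0.75 & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(r)[high[, "row"]],
                         var2 = colnames(r)[high[, "col"]])
print(high_pairs)

# Significance of a single pairwise correlation
p_value <- cor.test(predictors$Glucose, predictors$BMI)$p.value
```

An empty high_pairs means no pair of predictors is correlated strongly enough to consider dropping one of them.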
STEP 4: Data Splitting
The data is split, stratified on Outcome, with 80% going to the training data and 20% to the test data.
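A stratified 80/20 split can be sketched in base R as follows. The original code is not shown; with the caret package loaded, createDataPartition(y, p = 0.8, list = FALSE) is a common way to get the same kind of split, but whether the author used it is an assumption. The data here are synthetic.

```r
# Synthetic stand-in data: 65 negative and 35 positive cases.
set.seed(123)
diabetes <- data.frame(
  Glucose = rnorm(100, 120, 30),
  Outcome = factor(rep(c("neg", "pos"), times = c(65, 35)))
)

# Within each outcome class, sample 80% of the row indices for training,
# so both sets keep roughly the same class proportions
rows_by_class <- split(seq_len(nrow(diabetes)), diabetes$Outcome)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_data <- diabetes[train_idx, ]
test_data  <- diabetes[-train_idx, ]

table(train_data$Outcome)
table(test_data$Outcome)
```

Splitting within each class keeps the minority class from being under-represented in either set by chance.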
STEP 5: Model Training and Tuning
The caret function trainControl can be used to specify the type of resampling.
We used 10-fold cross-validation repeated 3 times, via the “repeatedcv” method.
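The resampling setup described here might look like the sketch below. It assumes the caret package is installed. The synthetic training data, the centering and scaling step, and tuneLength = 10 are illustrative assumptions, not settings confirmed by the source.

```r
library(caret)

# Synthetic training data in which Outcome depends loosely on Glucose.
set.seed(2024)
n <- 200
glucose <- rnorm(n, 120, 30)
train_data <- data.frame(
  Glucose = glucose,
  BMI     = rnorm(n, 32, 7),
  Outcome = factor(ifelse(glucose + rnorm(n, 0, 20) > 130, "pos", "neg"))
)

# 10-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fit kNN over a grid of candidate K values.
# Centering and scaling are assumed here because kNN is distance-based.
knn_fit <- train(Outcome ~ ., data = train_data,
                 method = "knn",
                 trControl = ctrl,
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

print(knn_fit)
knn_fit$bestTune
```

The bestTune element holds the K value that gave the best resampled accuracy among the candidates tried.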