machine learning in R, Summaries of Machine Learning

machine learning in R knn Neural network

Typology: Summaries

2021/2022

Uploaded on 05/27/2025

qizhi-fang


KNN Implementation for predicting Diabetes Outcome

Classification using KNN

Objective

The objective is to predict whether a patient has diabetes, based on diagnostic measurements. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the diabetes outcome and measurements such as Glucose, Insulin and BMI. I will apply kNN machine learning to these measurements to predict the diabetes outcome. The data can be downloaded from https://www.kaggle.com/mathchi/diabetes-data-set

The KNN algorithm is a supervised machine learning algorithm based on similarity. KNN stands for K-Nearest Neighbors. A common rule of thumb is to choose K close to the square root of N, the total number of points in the training data set.

KNN has the advantage of being nonparametric. It is basically a classification algorithm that predicts the class of a target variable from a defined number of nearest neighbors.
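To make the neighbor vote concrete, here is a minimal sketch in base R. The function name `knn_predict` and the six toy points are illustrative assumptions, not part of the original analysis, which relies on caret rather than hand-written code.

```r
# Toy kNN classifier: Euclidean distance plus majority vote (illustrative only).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(d)[seq_len(k)]]
  # Majority vote among those labels
  names(which.max(table(nearest)))
}

# Six made-up patients described by two standardized measurements
train_x <- matrix(c(-1.0, -0.8,
                    -0.9, -1.1,
                    -1.2, -0.7,
                     1.0,  0.9,
                     0.8,  1.2,
                     1.1,  1.0),
                  ncol = 2, byrow = TRUE)
train_y <- c("0", "0", "0", "1", "1", "1")

# Rule-of-thumb starting value for k: sqrt(N), nudged to an odd number
k <- floor(sqrt(nrow(train_x)))   # sqrt(6) is about 2.45, so 2
if (k %% 2 == 0) k <- k + 1       # use 3 to avoid tied votes

pred <- knn_predict(train_x, train_y, new_x = c(0.9, 1.0), k = k)
print(pred)
```

An odd k is used here only so that a two-class vote cannot tie; the square-root rule itself is just a starting point, and cross-validation is the more reliable way to settle on k.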

Data Descriptions

Using k-nearest neighbors to predict the diabetes outcome

  • STEP 1: Load the Libraries

  • STEP 2: Load the data set

  • STEP 3: Exploring and preparing the data

    • Examine the structure of the Dmdata frame
    • The results show zeros in Glucose, BP, SkinThickness, Insulin and BMI. A zero is not a plausible value for these measurements, so presumably those entries are missing (NA)
    • Replace those zeros with NA
    • Examine the missing values (the percentage missing per variable)
    • From the results, more than 40% of the Insulin values are missing; they will be imputed from the values of the nearest neighbors via knnImpute
    • Examine the structure of the data
  • Convert the diabetes Outcome into a factor

  • Visualize the association between the diabetes outcome and each of the 8 variables

  • From the figures, we noticed that the more important predictors of diabetes are Glucose, Age, Insulin, Pregnancies and Pedigree. The least associated are SkinThick and BP.

  • Here is the correlation matrix. First, no pair of predictors has abs(r) > 0.75. Second, the histograms illustrate how the distributions differ by diabetes outcome. Lastly, it shows the significance of the correlations among the predictors

  • STEP 4: Data Splitting

  • The data is split, stratified on Outcome, with 80% going to the training data and 20% to the test data

  • STEP 5: Model Training and Tuning

  • The function trainControl can be used to specify the type of resampling

  • We used 10-fold cross-validation repeated 3 times, via the “repeatedcv” method
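Steps 3 to 5 above can be sketched roughly as follows. This is a sketch under stated assumptions: the data frame `dm` is a made-up stand-in with only three measurement columns (the real file is not loaded here), the names `zero_cols`, `train_df`, `test_df` and `ctrl` and the "neg"/"pos" labels are hypothetical, and the caret package is assumed to be installed.

```r
library(caret)  # assumed installed; provides createDataPartition() and trainControl()

set.seed(123)
# Made-up stand-in for the diabetes data; values are NOT from the real data set
n <- 200
dm <- data.frame(
  Glucose = round(rnorm(n, 120, 30)),
  Insulin = round(runif(n, 15, 300)),
  BMI     = round(rnorm(n, 32, 6), 1),
  Outcome = rbinom(n, 1, 0.35)
)
dm$Insulin[sample(n, 80)] <- 0   # plant zeros to mimic unrecorded measurements

# STEP 3: treat zeros in the measurement columns as missing
zero_cols <- c("Glucose", "Insulin", "BMI")
dm[zero_cols] <- lapply(dm[zero_cols], function(x) replace(x, x == 0, NA))

# Percentage of missing values per column
pct_missing <- round(100 * colMeans(is.na(dm)), 1)
print(pct_missing)

# Convert the outcome into a factor so caret treats the task as classification
dm$Outcome <- factor(dm$Outcome, levels = c(0, 1), labels = c("neg", "pos"))

# STEP 4: stratified 80/20 split on Outcome
idx      <- createDataPartition(dm$Outcome, p = 0.8, list = FALSE)
train_df <- dm[idx, ]
test_df  <- dm[-idx, ]

# STEP 5: 10-fold cross-validation repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```

Because `createDataPartition()` samples within each level of the outcome, both subsets keep roughly the same share of positive cases as the full data.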

Performance improvement techniques and improved accuracy achieved

  • STEP 6: Customizing the Tuning Process
    • Pre-processing options: we use the knnImpute imputation method, together with “center” and “scale” so that all variables are on the same scale.
    • The head of the transformed data shows that the variables are centered and scaled, with no missing values.
  • ‘k’ in KNN is the number of nearest neighbors that take part in the majority vote. It is a very important tuning parameter. In the caret train function, you can specify the tuneLength of k:
    • By default, the search starts at k = 5 and continues in increments of 2: k = 5, 7, 9.
    • Another simple approach is k = sqrt(n), where n is the number of data points.
    • When cross-validation is performed, caret reports the best option among all the parameter values tested. The optimal k in this model is 21, according to the output.
  • STEP 7: Making predictions
    • Using the optimal k from knnfit, predict on the testing data
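The tuning and prediction steps might look like the sketch below. It runs on made-up, complete data with two predictors, so the selected k will not match the value of 21 reported above. The `tuneLength = 5` setting is an assumption chosen for illustration, as are all names other than `knnfit`; caret is assumed to be installed. "knnImpute" is left out of the `preProcess` vector only because this toy data has no missing values.

```r
library(caret)  # assumed installed

set.seed(42)
# Made-up two-class data with two informative predictors (not the real data set)
n  <- 300
y  <- factor(rep(c("neg", "pos"), each = n / 2))
df <- data.frame(
  Glucose = rnorm(n, mean = ifelse(y == "pos", 140, 110), sd = 20),
  BMI     = rnorm(n, mean = ifelse(y == "pos", 35, 30), sd = 5),
  Outcome = y
)

idx      <- createDataPartition(df$Outcome, p = 0.8, list = FALSE)
train_df <- df[idx, ]
test_df  <- df[-idx, ]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# tuneLength = 5 asks caret to try five candidate values of k
knnfit <- train(
  Outcome ~ ., data = train_df,
  method     = "knn",
  trControl  = ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 5
)

print(knnfit$results$k)   # candidate k values that were cross-validated
print(knnfit$bestTune)    # k with the best resampled accuracy

# STEP 7: predict() uses the final model refit with the selected k
pred <- predict(knnfit, newdata = test_df)
```

Passing `preProcess` to `train()` means the centering and scaling are estimated on the training data and then applied to the test data automatically when `predict()` is called.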

Evaluate the model performance

  • The accuracy of our model on the testing set is 73%
  • A confusion matrix illustrates the accuracy, sensitivity and specificity
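For reference, the three metrics can be computed by hand from a confusion matrix. The sketch below uses base R and ten made-up predictions; the labels and values are illustrative assumptions, not the model output behind the 73% figure.

```r
# Made-up predictions and true labels for ten test patients (illustrative only)
truth <- factor(c("pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"),
                levels = c("neg", "pos"))
pred  <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos", "neg"),
                levels = c("neg", "pos"))

# Rows are predictions, columns are the true classes
cm <- table(Predicted = pred, Actual = truth)
print(cm)

tp <- cm["pos", "pos"]; tn <- cm["neg", "neg"]
fp <- cm["pos", "neg"]; fn <- cm["neg", "pos"]

accuracy    <- (tp + tn) / sum(cm)   # share of all cases classified correctly
sensitivity <- tp / (tp + fn)        # share of true "pos" cases detected
specificity <- tn / (tn + fp)        # share of true "neg" cases detected

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

caret's `confusionMatrix()` reports these same quantities. By default it treats the first factor level as the positive class, so the `positive` argument is worth setting explicitly when the diabetic class is the second level.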