machine learning in R, Summaries of Machine Learning

University of Southern California (USC), Machine Learning

machine learning in R knn Neural network

Typology: Summaries

2021/2022

Uploaded on 05/27/2025

qizhi-fang 🇺🇸



KNN Implementation for predicting Diabetes Outcome

Classification using KNN

Objective

The objective is to predict whether a patient has diabetes, based on diagnostic measurements. The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the diabetes outcome and measurements such as Glucose, Insulin and BMI. I will apply kNN to these measurements to predict the diabetes outcome. The data can be downloaded from https://www.kaggle.com/mathchi/diabetes-data-set

KNN (K-Nearest Neighbors) is a supervised machine learning algorithm based on similarity: it predicts the class of a target variable from the classes of a defined number (K) of nearest neighbors in the training data. A common rule of thumb is to start with K near the square root of N, the total number of points in the training data set. KNN also has the advantage of being nonparametric.
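To make the neighbor vote concrete, here is a minimal base-R sketch of kNN classification for a single new point. It is an illustration only, not the code used in this summary: the function name `knn_predict_one`, the toy coordinates and the odd-number adjustment of k are all assumptions introduced for the example.

```r
# Minimal k-nearest-neighbors classifier in base R (illustrative only).
# train_x: numeric matrix of training features; train_y: factor of class labels.
# new_x: numeric vector for the point to classify; k: number of neighbors.
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(train_x, 2, new_x)   # subtract new_x from each row
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (on a tie, the first level with the top count wins)
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Toy data: two well-separated clusters (made-up values)
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

# Rule-of-thumb starting value: about sqrt(N), nudged to an odd number
n <- nrow(train_x)
k <- round(sqrt(n))
if (k %% 2 == 0) k <- k + 1

pred <- knn_predict_one(train_x, train_y, c(7.5, 8.5), k)
print(k)     # 3
print(pred)  # "1"
```

An odd k is chosen here only to avoid tied votes in a two-class problem; the square-root rule is a starting point, and cross-validation is the usual way to settle the final value.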

Data Descriptions

Using k-nearest neighbors to predict the Diabetes Outcome

STEP 1: Load the Libraries
STEP 2: Load the data set
STEP 3: Exploring and preparing the data
- Examine the structure of the Dmdata frame
- From the results, we notice zeros in Glucose, BP, SkinThickness, Insulin and BMI, which is unusual for these measurements; presumably those are missing values (NA)
- Replace those zeros with NA
- Examine the missing values (the percentage missing per variable)
- From the results, we noticed that more than 40% of the Insulin values are missing; those will be imputed with the values of the nearest neighbors via knnImpute
- the structure of data
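The zero-to-NA replacement and the missing-value percentages can be sketched in base R as follows. The data frame `dm`, its values, and the column names (taken to follow the Kaggle file's usual headers) are assumptions for illustration; the summary's own object names are not shown in the preview.

```r
# Toy stand-in for the diabetes data (values are made up).
dm <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0),
  Glucose       = c(148, 85, 183, 0, 137),
  BloodPressure = c(72, 66, 64, 66, 40),
  SkinThickness = c(35, 29, 0, 23, 35),
  Insulin       = c(0, 0, 0, 94, 168),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 0),
  Outcome       = c(1, 0, 1, 0, 1)
)

# Columns where a zero is treated as "not recorded".
# Pregnancies and Outcome are left alone because 0 is a valid value there.
zero_as_na <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Replace zeros with NA in those columns only
dm[zero_as_na] <- lapply(dm[zero_as_na], function(x) replace(x, x == 0, NA))

# Percentage of missing values per column
pct_missing <- round(100 * colMeans(is.na(dm)), 1)
print(pct_missing)
```

Restricting the replacement to a named set of columns keeps legitimate zeros (no pregnancies, negative outcome) from being turned into missing values.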
Convert the diabetes Outcome into a factor
Visualization of the association between diabetes outcome and all 8 variables
From the figures, we noticed that the more important predictors of diabetes are Glucose, Age, Insulin, Pregnancies and Pedigree. The least associated are SkinThickness and BP.
Here is the correlation matrix. First, no pair of variables has abs(r) > 0.75. Second, it illustrates the difference in distribution by diabetes outcome with histograms. Last, it shows the significance of the correlations among the predictors.
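A check for highly correlated predictor pairs of the kind described above might look like the following base-R sketch. The simulated predictors and the object names are assumptions; only the 0.75 threshold comes from the text.

```r
set.seed(1)  # reproducible made-up data
n <- 200
glucose <- rnorm(n, mean = 120, sd = 30)
predictors <- data.frame(
  Glucose = glucose,
  Insulin = 0.5 * glucose + rnorm(n, sd = 40),  # loosely related to Glucose
  BMI     = rnorm(n, mean = 32, sd = 7),
  Age     = rnorm(n, mean = 33, sd = 11)
)

# Pairwise Pearson correlations; pairwise.complete.obs tolerates NAs
r <- cor(predictors, use = "pairwise.complete.obs")

# Look at each pair once (upper triangle) and flag those with |r| > 0.75
high <- which(abs(r) > 0.75 & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[high[, 1]],
  var2 = colnames(r)[high[, 2]],
  r    = r[high]
)

print(round(r, 2))
print(nrow(high_pairs))  # number of predictor pairs above the threshold
```

An empty `high_pairs` corresponds to the situation reported in the text, where no predictor needs to be dropped for redundancy.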
STEP 4: Data Splitting
The data are split on Outcome, with 80% going to the training set and 20% to the test set
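One way to obtain an 80/20 split that preserves the Outcome proportions is caret's `createDataPartition`. The preview does not show which function the author called, so treat this as a sketch under that assumption; the data frame and its values are made up.

```r
library(caret)

set.seed(42)
dm <- data.frame(
  Glucose = rnorm(100, mean = 120, sd = 30),
  Outcome = factor(rep(c("0", "1"), times = c(65, 35)))
)

# Row indices for the training set, sampled within each Outcome level
train_idx <- createDataPartition(dm$Outcome, p = 0.8, list = FALSE)

train_df <- dm[train_idx, ]
test_df  <- dm[-train_idx, ]

# Class proportions should be similar in both parts
print(prop.table(table(train_df$Outcome)))
print(prop.table(table(test_df$Outcome)))
```

Sampling within each class matters when the outcome is imbalanced, because a purely random split can leave the test set with too few positive cases.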
STEP 5: Model Training and Tuning
The function trainControl can be used to specify the type of resampling
We used 10-fold cross-validation repeated 3 times, via the “repeatedcv” method
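The resampling scheme described here maps onto a `trainControl` call like the one below. The object name `ctrl` is an assumption.

```r
library(caret)

# 10-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Each candidate model is therefore fit number * repeats times during resampling
n_fits_per_candidate <- ctrl$number * ctrl$repeats
print(n_fits_per_candidate)  # 30
```

Repeating the 10-fold procedure averages out the luck of any single fold assignment, at the cost of three times as many model fits.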

Performance improvement techniques and improved accuracy achieved

STEP 6: Customizing the Tuning Process
- Pre-processing options: we impute missing values with the knn method (knnImpute) and apply “center” and “scale” so that all variables have an impact on the same scale.
- The head of the transformed data showed that the variables were centered and scaled, with no missing values.
‘k’ in KNN is a parameter that refers to the number of nearest neighbors included in the majority vote. It is a very important tuning parameter. In the caret train function, you can control the candidate values of k with tuneLength:
1. By default, it starts with k = 5 and continues in increments of 2 for 3 values: k = 5, 7, 9
2. Another simple approach is k = sqrt(n), where n is the number of data points
3. When cross-validation is performed, caret displays the results for all the parameter values tested and selects the best one. The optimal k in this model is 21 from the output
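A tuning run of this shape can be sketched with caret as follows. The simulated data, the object names and `tuneLength = 10` are assumptions, since the preview does not show the actual call. The made-up data has no missing values, so the "knnImpute" step is left out; with real NAs it would be added to `preProcess`, together with `na.action = na.pass` when the formula interface is used.

```r
library(caret)

set.seed(123)
n <- 300
glucose <- rnorm(n, 120, 30)
bmi     <- rnorm(n, 32, 7)
# Made-up outcome: higher Glucose and BMI raise the chance of class "pos"
p <- plogis((glucose - 120) / 15 + (bmi - 32) / 10)
train_df <- data.frame(
  Glucose = glucose,
  BMI     = bmi,
  Outcome = factor(rbinom(n, 1, p), levels = c(0, 1), labels = c("neg", "pos"))
)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# tuneLength = 10 asks caret for 10 candidate values of k
knn_fit <- train(
  Outcome ~ .,
  data       = train_df,
  method     = "knn",
  trControl  = ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 10
)

print(knn_fit$results[, c("k", "Accuracy")])  # resampled accuracy per k
print(knn_fit$bestTune$k)                     # k with the best accuracy
```

Centering and scaling inside `train` means the statistics are recomputed on each resampling fold, so the held-out fold does not leak into the pre-processing.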
STEP 7: Making predictions
- Using the optimal k from knnfit, apply the prediction to the testing data
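The prediction step can be sketched like this. The data are simulated and the model is refit at a fixed k purely to keep the example short; k = 21 is the value reported in the text, while the object names and columns are assumptions.

```r
library(caret)

set.seed(7)
n <- 250
glucose <- rnorm(n, 120, 30)
dm <- data.frame(
  Glucose = glucose,
  BMI     = rnorm(n, 32, 7),
  Outcome = factor(ifelse(glucose + rnorm(n, sd = 20) > 125, "pos", "neg"))
)

idx      <- createDataPartition(dm$Outcome, p = 0.8, list = FALSE)
train_df <- dm[idx, ]
test_df  <- dm[-idx, ]

# Fit once at a fixed k with no resampling
knn_fit <- train(
  Outcome ~ ., data = train_df, method = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "none"),
  tuneGrid   = data.frame(k = 21)
)

# Predicted classes for the held-out rows
pred <- predict(knn_fit, newdata = test_df)
print(head(pred))
print(mean(pred == test_df$Outcome))  # test-set accuracy
```

Because the pre-processing is stored in the fitted object, `predict` applies the training-set centering and scaling to the test rows automatically.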

Evaluate the model performance

The accuracy of our model on the testing set is 73%
A confusion matrix illustrates the accuracy, sensitivity and specificity
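The three metrics can be computed from a confusion matrix in base R as sketched below. The labels are made up for ten hypothetical test cases and do not reproduce the 73% reported above.

```r
# Made-up true labels and predictions for 10 test cases
truth <- factor(c("pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"),
                levels = c("neg", "pos"))
pred  <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos", "pos"),
                levels = c("neg", "pos"))

# Rows are predictions, columns are the true classes
cm <- table(Predicted = pred, Actual = truth)

tp <- cm["pos", "pos"]  # true positives
tn <- cm["neg", "neg"]  # true negatives
fp <- cm["pos", "neg"]  # false positives
fn <- cm["neg", "pos"]  # false negatives

accuracy    <- (tp + tn) / sum(cm)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)       # share of true positives detected
specificity <- tn / (tn + fp)       # share of true negatives detected

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

In a screening setting, sensitivity and specificity are worth reading alongside accuracy, since a missed diabetes case and a false alarm carry different costs.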
