PCA machine learning, Summaries of Machine Learning

University of California - Berkeley Machine Learning

PCA machine learning in R berkeley

Typology: Summaries

2021/2022

Uploaded on 05/29/2025

qizhi-fang 🇺🇸



PCA and Clustering

Objective

PCA is a dimension reduction technique. The data are represented as components, where each component is a linear combination of the original variables. It is particularly useful when the variables are correlated.

Clustering, or grouping, is the detection of similarities. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest. It is an unsupervised learning method.

PCA in conjunction with k-means is a powerful method for visualizing high-dimensional data.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the diabetes outcome and measurements such as Glucose, Insulin and BMI. I will apply PCA and clustering to these measurements to detect similarities in the data and group the observations into an optimal number of clusters. The data can be downloaded from https://www.kaggle.com/mathchi/diabetes-data-set

Data Descriptions

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BP: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
Pedigree: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 = No; 1 = Yes)

Exploring the data using PCA

STEP 1: Load the libraries
STEP 2: Load the data set
Examine the structure of the Dmdata data frame
From the results we notice zeros in Glucose, BP, SkinThickness, Insulin and BMI, which are not physiologically plausible; presumably these represent missing values (NA)
The data need to be scaled, after which the covariance matrix is computed to identify correlations. The eigenvalues and eigenvectors of the covariance matrix are then calculated to identify the principal components. Finally, the main components are represented visually.
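
The preprocessing and PCA steps listed above can be sketched in base R roughly as follows. This is a minimal illustration rather than the author's code: the data frame `dm` is simulated (the real data would be read from the Kaggle file), the choice of mean imputation for the missing values is an assumption, and only a subset of the variables is mimicked.

```r
# Simulated stand-in for the diabetes data (hypothetical values, not the real file).
set.seed(1)
n <- 200
dm <- data.frame(
  Glucose = round(rnorm(n, 120, 30)),
  BP      = round(rnorm(n, 70, 12)),
  BMI     = round(rnorm(n, 32, 6), 1),
  Age     = round(runif(n, 21, 70))
)
dm$Glucose[sample(n, 5)] <- 0   # inject impossible zeros, as reported for the real data
dm$BMI[sample(n, 5)]     <- 0

# Treat zeros in physiological measurements as missing values.
zero_as_na <- c("Glucose", "BP", "BMI")
dm[zero_as_na] <- lapply(dm[zero_as_na], function(x) replace(x, x == 0, NA))

# Assumption: fill each missing value with its column mean (one simple choice among many).
dm[] <- lapply(dm, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

# Scale the data, then eigendecompose the covariance matrix of the scaled data.
x_scaled <- scale(dm)
cov_mat  <- cov(x_scaled)   # for scaled data this is the correlation matrix of dm
eig      <- eigen(cov_mat)

var_explained <- eig$values / sum(eig$values)   # proportion of variance per component
scores        <- x_scaled %*% eig$vectors       # observations projected onto the components
colnames(scores) <- paste0("PC", seq_len(ncol(scores)))

print(round(var_explained, 3))
```

The eigenvalues obtained this way match the squared standard deviations returned by `prcomp(dm, scale. = TRUE)`, which is a convenient check that the manual steps and the built-in function agree.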

Applying PCA with Linear Model

k-means Visualization With Evaluation Statistics

Step 1. Compute PCA via prcomp.
Step 2. Visualize the eigenvalues (scree plot), showing the percentage of variance explained by each principal component.
Step 3. Graph of individuals. Individuals with a similar profile are grouped together. habillage: an optional factor variable for coloring the observations by group.
Step 4. To obtain the optimal number of clusters, statistical methods such as the elbow, silhouette and gap statistic methods are used.
Step 5. Evaluate the clustering tendency with the Hopkins statistic. The Hopkins statistic evaluates the clustering tendency of the data. A value close to 1 indicates that the data are highly clustered, 0.5 indicates random data, and a value close to 0 indicates uniformly (regularly) spaced data. Some implementations report the complement (1 - H), so check which convention the function in use follows. A silhouette score close to 1 indicates that the clusters are well separated, and a score close to -1 indicates that they are not.
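
Steps 1, 2 and 4 can be outlined with `prcomp`, `kmeans` and `silhouette` from the `cluster` package. This is an illustrative sketch under stated assumptions, not the original analysis: the data are simulated with two built-in groups, clustering is run on the first two principal components, and the candidate range k = 2 to 6 is arbitrary. Only the numbers behind the plots are computed here.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

# Simulated data with two groups (hypothetical stand-in for the real measurements).
set.seed(42)
x <- rbind(
  matrix(rnorm(100 * 4, mean = 0), ncol = 4),
  matrix(rnorm(100 * 4, mean = 4), ncol = 4)
)
colnames(x) <- c("Glucose", "BP", "BMI", "Age")

# Step 1: PCA on centered, scaled data.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Step 2: percentage of variance explained by each component (the scree plot values).
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(pct_var, 1))

# Step 4: compare candidate cluster counts on the first two components.
pcs <- pca$x[, 1:2]
d   <- dist(pcs)
ks  <- 2:6
wss <- numeric(length(ks))   # total within-cluster sum of squares (elbow method)
sil <- numeric(length(ks))   # average silhouette width
for (i in seq_along(ks)) {
  km     <- kmeans(pcs, centers = ks[i], nstart = 25)
  wss[i] <- km$tot.withinss
  sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}
best_k <- ks[which.max(sil)]   # k with the highest average silhouette width
print(data.frame(k = ks, wss = round(wss, 1), silhouette = round(sil, 3)))
```

The elbow in `wss` is usually read off a plot by eye, whereas the silhouette criterion gives a single number per k, which is why `best_k` is taken from the silhouette values here.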

Implementing Hierarchical Clustering Algorithms

Hierarchical clustering produces a tree-based representation of the objects, which is also known as a dendrogram. Hierarchical clustering can be subdivided into two types:

Agglomerative clustering, in which each observation is initially considered a cluster of its own (leaf). Then the most similar clusters are successively merged until there is just one single big cluster (root).
Divisive clustering (also known as DIANA analysis), the inverse of agglomerative clustering, which begins with the root, in which all objects are included in one cluster. Then the most heterogeneous clusters are successively divided until all observations are in their own cluster.

- Visualization of the individuals in the two-cluster grouping produced by the hierarchical clustering algorithms
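
Both variants can be tried with `hclust` (agglomerative) and `diana` from the `cluster` package (divisive). The following is a small sketch on simulated data, not the original analysis: the two-group data set, the Ward linkage and the cut at two clusters are all assumptions made for illustration.

```r
library(cluster)  # provides diana() for divisive clustering

# Simulated two-group data (hypothetical stand-in for the scaled measurements).
set.seed(7)
x <- scale(rbind(
  matrix(rnorm(30 * 3, mean = 0), ncol = 3),
  matrix(rnorm(30 * 3, mean = 5), ncol = 3)
))

d <- dist(x)  # Euclidean distances between observations

# Agglomerative: start from single observations and merge upward.
hc_agg  <- hclust(d, method = "ward.D2")   # linkage choice is an assumption
grp_agg <- cutree(hc_agg, k = 2)

# Divisive (DIANA): start from one cluster and split downward.
hc_div  <- diana(d)
grp_div <- cutree(as.hclust(hc_div), k = 2)

# Cross-tabulate the two 2-cluster solutions to see how far they agree.
print(table(agglomerative = grp_agg, divisive = grp_div))

# plot(hc_agg) or plot(as.hclust(hc_div)) draws the corresponding dendrogram.
```

Cutting both trees at the same number of clusters and cross-tabulating the memberships is a quick way to see whether the bottom-up and top-down approaches recover the same grouping.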
