

PCA machine learning in R berkeley
Typology: Summaries


PCA is a dimension reduction technique. The data is represented as components, each of which is a linear combination of the original variables. It is well suited to correlated data. Clustering, or grouping, is the detection of similarities: the goal is to identify patterns or groups of similar objects within a data set of interest. It is an unsupervised learning method. PCA in conjunction with k-means is a powerful method for visualizing high-dimensional data.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the diabetes outcome and measurements such as Glucose, Insulin and BMI, among others. I will apply PCA and clustering to these measurements to detect similarities in the data and group them into optimal clusters. The data can be downloaded from https://www.kaggle.com/mathchi/diabetes-data-set
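As a minimal sketch of how PCA and k-means can be combined for visualization, the example below runs both on synthetic stand-in data rather than on the diabetes measurements. The variable names V1 to V4, the two simulated groups and the choice of k = 2 are assumptions made only for illustration.

```r
# Synthetic stand-in data (not the diabetes measurements):
# two groups of 50 observations in 4 variables.
set.seed(42)
n <- 50
group_a <- matrix(rnorm(n * 4, mean = 0), ncol = 4)
group_b <- matrix(rnorm(n * 4, mean = 4), ncol = 4)
x <- rbind(group_a, group_b)
colnames(x) <- paste0("V", 1:4)

# PCA on centered and scaled variables; each component is a
# linear combination of V1..V4 (weights are in pca$rotation).
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# k-means on the first two component scores; k = 2 is an assumption
# that matches how the toy data was generated.
km <- kmeans(scores, centers = 2, nstart = 25)

# Show the clusters in the plane of the first two components.
plot(scores, col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```

On real data the number of clusters is not known in advance, so k would have to be chosen with a criterion such as the within-cluster sum of squares.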
STEP 1: Load the Libraries
STEP 2: Load the data set
Examine the structure of the Dm data frame
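The original code for these steps is not shown here, so the following is only a sketch of how loading and inspecting the data might look. The helper load_diabetes, the local file name diabetes.csv and the data frame name Dm are assumptions; the file has to be downloaded manually from the Kaggle link first.

```r
# Hypothetical helper: read the downloaded CSV into a data frame.
load_diabetes <- function(path) {
  read.csv(path, header = TRUE, stringsAsFactors = FALSE)
}

csv_path <- "diabetes.csv"  # assumed local file name after the Kaggle download

# Only runs when the file is actually present in the working directory.
if (file.exists(csv_path)) {
  Dm <- load_diabetes(csv_path)
  str(Dm)             # column names, types and first values
  print(summary(Dm))  # per-column minimum, quartiles, mean and maximum
}
```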
From the results we notice the unusual occurrence of zeros in Glucose, BP, SkinThickness, Insulin and BMI; presumably these represent missing values (NA).
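One way to act on this observation is to recode the zeros in those columns as NA. The sketch below uses a few invented rows, not real records, and assumes the blood pressure column is named BloodPressure; the Outcome column is left alone because 0 is a valid value there.

```r
# Toy rows with invented values; column names are assumed to match the CSV.
Dm <- data.frame(
  Glucose       = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 66),
  SkinThickness = c(35, 29, 0, 23),
  Insulin       = c(0, 0, 0, 94),
  BMI           = c(33.6, 26.6, 23.3, 0),
  Outcome       = c(1, 0, 1, 0)
)

# Columns in which a zero is physiologically implausible.
zero_as_na <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Replace zeros with NA in those columns only.
Dm[zero_as_na] <- lapply(Dm[zero_as_na], function(v) replace(v, v == 0, NA))

# Count the missing values per column.
colSums(is.na(Dm))
```

How the NA values are then handled (dropping rows or imputing) is a separate decision that this sketch does not make.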
The data needs to be scaled, followed by the creation of a covariance matrix to identify correlations. The eigenvalues and eigenvectors of the covariance matrix are then calculated to identify the principal components. Finally, the main components are represented visually.
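These steps can be sketched by hand in base R and cross-checked against prcomp. The data below is synthetic and the column names a, b and c are placeholders; this is an illustration of the procedure, not the author's original code.

```r
# Synthetic correlated data: a and b share a common signal, c is independent.
set.seed(1)
n <- 200
z <- rnorm(n)
x <- cbind(a = z + rnorm(n, sd = 0.3),
           b = 2 * z + rnorm(n, sd = 0.5),
           c = rnorm(n))

# 1. Scale: center each column to mean 0 and standard deviation 1.
x_scaled <- scale(x)

# 2. Covariance matrix of the scaled data (equal to the correlation matrix of x).
cov_mat <- cov(x_scaled)

# 3. Eigen decomposition: eigenvalues are the component variances,
#    eigenvectors are the loadings (returned in decreasing order).
eig <- eigen(cov_mat)
var_explained <- eig$values / sum(eig$values)

# 4. Project the scaled data onto the eigenvectors to get component scores.
scores <- x_scaled %*% eig$vectors
colnames(scores) <- paste0("PC", seq_len(ncol(scores)))

# Cross-check: prcomp should give the same variances.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
all.equal(eig$values, pca$sdev^2)

# 5. Visual representation of the first two components.
plot(scores[, 1:2], pch = 19, xlab = "PC1", ylab = "PC2")
```

The signs of individual components may differ between eigen and prcomp, since an eigenvector is only defined up to its sign.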
Hierarchical clustering produces a tree-based representation of the observations, which is also known as a dendrogram. Hierarchical clustering can be subdivided into two types. In agglomerative clustering, each observation is initially considered a cluster of its own (a leaf); the most similar clusters are then successively merged until there is just one single big cluster (the root). Divisive clustering (also known as DIANA, for DIvisive ANAlysis) is the inverse of agglomerative clustering: it begins with the root, in which all objects are included in one cluster, and the most heterogeneous clusters are then successively divided until all observations are in their own cluster.
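Both variants can be tried side by side in R, using hclust for the agglomerative approach and diana from the cluster package for the divisive one. The sketch below uses synthetic two-group data; the complete linkage method and the cut at k = 2 are assumptions chosen for the toy example.

```r
library(cluster)  # provides diana(); a recommended package shipped with R

# Synthetic stand-in data: two groups of 10 observations in 2 variables.
set.seed(7)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
x <- scale(x)

# Agglomerative: start from single observations and merge upwards.
d <- dist(x, method = "euclidean")
agg <- hclust(d, method = "complete")
agg_groups <- cutree(agg, k = 2)

# Divisive (DIANA): start from one cluster and split downwards.
div <- diana(x, metric = "euclidean")
div_groups <- cutree(as.hclust(div), k = 2)

# Dendrograms for both approaches.
plot(agg, main = "Agglomerative (complete linkage)")
plot(as.hclust(div), main = "Divisive (DIANA)")

# Compare the two partitions.
table(agg_groups, div_groups)
```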