PCA machine learning, Summaries of Machine Learning

PCA machine learning in R berkeley

Typology: Summaries

2021/2022

Uploaded on 05/29/2025

qizhi-fang
PCA and Clustering

Objective

PCA is a dimension-reduction technique. It represents the data as components, each of which is a linear combination of the original variables, and it is most useful when those variables are correlated. Clustering, or grouping, is the detection of similarities: the goal is to identify patterns or groups of similar objects within a data set of interest. It is an unsupervised learning method. PCA in conjunction with k-means is a powerful method for visualizing high-dimensional data.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the diabetes outcome and measurements such as Glucose, Insulin, and BMI. I will apply PCA and clustering to these measurements to detect similarities in the data and group the observations into an optimal number of clusters. The data can be downloaded from https://www.kaggle.com/mathchi/diabetes-data-set

Data Descriptions

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BP: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • Pedigree: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 = No; 1 = Yes)

Exploring the data using PCA

  • STEP 1: Load the Libraries

  • STEP 2: Load the data set

  • Examine the structure of the Dm data frame

  • From the results, we notice implausible zero values in Glucose, BP, SkinThickness, Insulin, and BMI; presumably these represent missing values (NA).

  • The data needs to be scaled, and then the covariance matrix is computed to identify correlations. The eigenvalues and eigenvectors of the covariance matrix are calculated to identify the principal components. Finally, the main components are represented visually.
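
The preprocessing and decomposition steps above can be sketched in base R roughly as follows. This is a minimal illustration rather than the original analysis: the data frame `dm`, its values, and the choice to drop incomplete rows are all assumptions made for the example.

```r
# Hypothetical toy data standing in for the diabetes measurements;
# the values are made up for illustration only.
dm <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116, 78, 0),
  BP      = c(72, 66, 64, 66, 40, 74, 50, 70),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3),
  Age     = c(50, 31, 32, 21, 33, 30, 26, 29)
)

# Treat physiologically implausible zeros as missing values.
zero_as_na <- c("Glucose", "BP", "BMI")
dm[zero_as_na] <- lapply(dm[zero_as_na], function(x) replace(x, x == 0, NA))

# Keep complete rows only (one simple choice; imputation is another).
dm_complete <- na.omit(dm)

# Scale to mean 0 and standard deviation 1, so the covariance matrix
# of the scaled data equals the correlation matrix of the original.
dm_scaled <- scale(dm_complete)
cov_mat <- cov(dm_scaled)

# Eigenvectors give the principal component directions (loadings);
# eigenvalues give the variance along each direction.
eig <- eigen(cov_mat)
var_explained <- eig$values / sum(eig$values)

# Project the scaled data onto the components to obtain the scores.
scores <- dm_scaled %*% eig$vectors

print(round(var_explained, 3))
```

Plotting the first two columns of `scores` against each other gives the usual two-dimensional view of the principal components.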

Applying PCA with Linear Model

k-means Visualization With Evaluation Statistics

  • Step 1. Compute PCA via prcomp.
  • Step 2. Visualize the eigenvalues (scree plot), showing the percentage of variance explained by each principal component.
  • Step 3. Graph of individuals. Individuals with a similar profile are grouped together. The habillage argument is an optional factor variable for coloring the observations by groups.
  • Step 4. To obtain the optimal number of clusters, statistical methods such as the elbow, silhouette, and gap statistic methods are used.
  • Step 5. Evaluate the clustering tendency of the data with the Hopkins statistic. A value close to 1 indicates that the data is highly clustered, 0.5 indicates random data, and 0 indicates uniformly distributed data. A silhouette score close to 1 indicates that the clusters are well separated, while a score close to -1 indicates that they are poorly separated (observations are likely assigned to the wrong cluster).
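
A rough base-R sketch of Steps 1, 2, and 4 is given below, together with a hand-written mean silhouette width. The synthetic matrix `x` and the helper `mean_silhouette` are assumptions introduced for illustration; the plots, the gap statistic, and the Hopkins statistic are not covered here.

```r
set.seed(42)

# Hypothetical data: two well-separated groups in four dimensions,
# standing in for the scaled diabetes measurements.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 4),
  matrix(rnorm(100, mean = 5), ncol = 4)
)
colnames(x) <- c("V1", "V2", "V3", "V4")

# Step 1: PCA on centered and scaled variables.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Step 2: percentage of variance explained by each component,
# which is the quantity a scree plot displays.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Step 4 (elbow method): total within-cluster sum of squares for
# k = 1..6, computed on the first two principal component scores.
scores <- pca$x[, 1:2]
wss <- sapply(1:6, function(k) {
  kmeans(scores, centers = k, nstart = 20)$tot.withinss
})

# Mean silhouette width for a cluster assignment with at least 2 clusters.
mean_silhouette <- function(data, cluster) {
  d <- as.matrix(dist(data))
  widths <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # singleton cluster: width defined as 0
    a <- mean(d[i, own])      # mean distance to own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # smallest mean distance to any other cluster
    b <- min(sapply(others, function(cl) mean(d[i, cluster == cl])))
    (b - a) / max(a, b)
  })
  mean(widths)
}

km2 <- kmeans(scores, centers = 2, nstart = 20)
sil2 <- mean_silhouette(scores, km2$cluster)

print(round(pct_var, 1))
print(round(wss, 1))
print(round(sil2, 3))
```

In the elbow method, the value of k where `wss` stops dropping sharply is taken as a candidate for the number of clusters; the silhouette method instead picks the k with the largest mean silhouette width.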

Implementing Hierarchical Clustering Algorithms

Hierarchical clustering builds a tree-based representation of the objects, which is also known as a dendrogram. It can be subdivided into two types:

  • Agglomerative clustering, in which each observation is initially considered a cluster of its own (leaf). The most similar clusters are then successively merged until there is just one single big cluster (root).
  • Divisive clustering (also known as DIANA, for DIvisive ANAlysis), the inverse of agglomerative clustering, which begins with the root, in which all objects are included in one cluster. The most heterogeneous clusters are then successively divided until each observation is in its own cluster.

  • Visualization of the individuals in the two-cluster solution produced by hierarchical clustering
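
As a minimal sketch of the agglomerative variant and the two-cluster cut, base R's `hclust` and `cutree` can be used as below. The synthetic data and the choice of Ward linkage are assumptions for the example, not necessarily what the original analysis used.

```r
set.seed(1)

# Hypothetical data: two well-separated groups of ten observations each.
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 4),
  matrix(rnorm(40, mean = 6), ncol = 4)
)

# Agglomerative clustering: start from single observations and merge
# the closest clusters step by step. Ward linkage is one common choice.
d <- dist(scale(x))
hc <- hclust(d, method = "ward.D2")

# Cut the dendrogram into two clusters.
groups <- cutree(hc, k = 2)
print(table(groups))

# plot(hc) would draw the dendrogram itself.
```

For the divisive variant, the `cluster` package provides `diana()`, whose result can be cut into groups in a similar way after conversion with `as.hclust()`.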