Principal Component Analysis: Understanding Data Variation and Dimensionality Reduction, Study notes of Mathematical Statistics

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. These notes cover the ideas, steps, and applications of PCA, including gene expression analysis and data visualization.

Typology: Study notes

2023/2024

Uploaded on 12/26/2023

abhinandan-uk

Principal Component Analysis
by ABHINANDAN
USN: 22BTRCB002

Principal Component Analysis (PCA): Ideas

  • Does the data set ‘span’ the whole of d-dimensional space?
  • For a matrix of m samples × n genes, create a new covariance matrix of size n × n.
  • Transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
  • PCA was developed to capture as much of the variation in the data as possible.
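The first two ideas can be sketched as follows, assuming NumPy and a small made-up data matrix (the values and the names `X`, `C`, `spanned_dims` are illustrative, not from the notes). The rank of the centered data shows how many dimensions the samples actually span, and the covariance matrix of m samples by n variables has size n × n.

```python
import numpy as np

# Hypothetical data: m = 5 samples (rows), n = 3 variables (columns).
# The third column is the sum of the first two, so the data do not
# span the whole 3-dimensional space.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 3.0],
    [3.0, 5.0, 8.0],
    [4.0, 2.0, 6.0],
    [5.0, 7.0, 12.0],
])

X_centered = X - X.mean(axis=0)

# Rank of the centered data = number of dimensions the data actually span.
spanned_dims = np.linalg.matrix_rank(X_centered)

# Covariance matrix between variables: n x n, regardless of m.
C = np.cov(X, rowvar=False)

print("dimensions spanned:", spanned_dims)
print("covariance matrix shape:", C.shape)
```

Here the data lie in a 2-dimensional subspace of the 3-dimensional space, which is the situation in which a smaller number of principal components can describe the data.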

[Figure: scatter plot of data points on the original axes X1 and X2, with the principal axes Y1 and Y2 overlaid.]

Note: Y1 is the first eigenvector and Y2 is the second; Y2 is ignorable.

Key observation: the variance along Y1 is the largest.

Principal Component Analysis: one attribute first

  • Question: how much spread is there in the data along the axis (distance to the mean)?
  • Variance = (standard deviation)^2
  • Example data (Temperature): 30, 40, 30, 35, 30, 15, 30, 15, 18, 15, 30, 24, 40, 42
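As a small worked example, the spread of the temperature values can be computed directly. This is a sketch assuming NumPy and the sample variance with an n − 1 denominator; the notes do not say which denominator they use, so that choice is an assumption.

```python
import numpy as np

# Temperature values as listed in the notes.
temperature = np.array(
    [30, 40, 30, 35, 30, 15, 30, 15, 18, 15, 30, 24, 40, 42], dtype=float
)

mean = temperature.mean()

# Sample variance: average squared distance to the mean.
# Assumption: n - 1 in the denominator (unbiased sample variance).
variance = np.sum((temperature - mean) ** 2) / (len(temperature) - 1)

# The standard deviation is the square root of the variance.
std_dev = np.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std_dev)
```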

More than two attributes: covariance matrix

  • Contains covariance values between all possible dimensions (= attributes):
  • Example for three attributes (x, y, z):
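For reference, under the usual sample-covariance definition (the n − 1 denominator is an assumption; the notes do not state it), the covariance matrix for three attributes has the form:

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
C = \begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}
```

The diagonal entries are the variances of the individual attributes, and the matrix is symmetric because cov(x, y) = cov(y, x).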

Eigenvalues & eigenvectors

  • Nonzero vectors x that have the same direction as Ax are called eigenvectors of A (A is an n by n matrix).
  • In the equation $A x = \lambda x$, $\lambda$ is called an eigenvalue of A.
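The defining relation can be checked numerically. The sketch below assumes NumPy and a made-up symmetric 2 × 2 matrix `A` (not from the notes); `np.linalg.eigh` is used because covariance matrices, the case of interest here, are symmetric.

```python
import numpy as np

# Hypothetical symmetric 2 x 2 matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices: eigenvalues are returned in
# ascending order, eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, x in zip(eigenvalues, eigenvectors.T):
    # A @ x points along x and is scaled by the eigenvalue lam.
    print("eigenvalue:", lam, "A x == lambda x:", np.allclose(A @ x, lam * x))
```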

Principal components

  • First principal component (PC1)
    • The largest eigenvalue of the covariance matrix indicates that the data have the largest variance along its eigenvector, the direction along which there is greatest variation.
  • Second principal component (PC2)
    • The direction with the maximum variation left in the data, orthogonal to PC1.
  • In general, only a few directions manage to capture most of the variability in the data.
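A minimal sketch of these statements, assuming NumPy and randomly generated correlated 2-D data (the data and all variable names are illustrative): the variance of the data projected onto PC1 equals the largest eigenvalue, and PC2 is orthogonal to PC1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data: the second attribute mostly follows the first.
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
C = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so the last column is PC1.
eigenvalues, eigenvectors = np.linalg.eigh(C)
pc1 = eigenvectors[:, -1]
pc2 = eigenvectors[:, -2]

# Variance of the data along each principal direction.
var_along_pc1 = np.var(centered @ pc1, ddof=1)
var_along_pc2 = np.var(centered @ pc2, ddof=1)

print("variance along PC1:", var_along_pc1, "largest eigenvalue:", eigenvalues[-1])
print("variance along PC2:", var_along_pc2)
print("PC1 . PC2:", pc1 @ pc2)
```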

Steps of PCA

  • Let $\bar{X}$ be the mean vector (taking the mean of all rows).
  • Adjust the original data by the mean: $X' = X - \bar{X}$
  • Compute the covariance matrix C of the adjusted data X'.
  • Find the eigenvectors and eigenvalues of C.
    • For the matrix C, the eigenvectors are the (column) vectors e that have the same direction as Ce:
    • e is an eigenvector of C if $C e = \lambda e$,
    • $\lambda$ is called an eigenvalue of C.
    • $C e = \lambda e \Leftrightarrow (C - \lambda I) e = 0$
    • Most data mining packages do this for you.
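The steps above can be sketched in a few lines of NumPy. The function name `pca_eigen` and the small data set are hypothetical, chosen only for illustration; this is one straightforward way to follow the steps, not necessarily what any particular package does internally.

```python
import numpy as np

def pca_eigen(X):
    """Follow the listed steps for X with samples as rows.

    Returns the mean vector, the covariance matrix, and its
    eigenvalues (ascending) and eigenvectors (as columns).
    """
    mean = X.mean(axis=0)                 # mean vector over all rows
    X_adj = X - mean                      # adjust the data by the mean
    C = np.cov(X_adj, rowvar=False)       # covariance matrix of adjusted data
    eigenvalues, eigenvectors = np.linalg.eigh(C)  # C is symmetric
    return mean, C, eigenvalues, eigenvectors

# Hypothetical 2-attribute data set (10 samples), for illustration only.
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

mean, C, eigenvalues, eigenvectors = pca_eigen(X)

for lam, e in zip(eigenvalues, eigenvectors.T):
    # Check (C - lambda I) e = 0 for each eigenpair.
    residual = (C - lam * np.eye(C.shape[0])) @ e
    print("lambda:", lam, "residual norm:", np.linalg.norm(residual))
```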

Principal components - Variance

Transformed Data

  • Each eigenvalue $\lambda_j$ corresponds to the variance along component j.
  • Thus, sort the eigenvectors by $\lambda_j$ in decreasing order.
  • Take the first p eigenvectors $e_i$, where p is the number of top eigenvalues retained.
  • These are the directions with the largest variances.
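A minimal sketch of this selection and of the resulting transformed data, assuming NumPy, randomly generated data, and p = 2 (all of these are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 samples, 4 correlated attributes.
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

X_adj = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_adj, rowvar=False))

# Sort eigenpairs by eigenvalue, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the first p eigenvectors (p is chosen by hand here).
p = 2
E_p = eigenvectors[:, :p]

# Transformed data: coordinates of each sample along the top p directions.
Y = X_adj @ E_p

# Fraction of the total variance captured by the p retained components.
explained = eigenvalues[:p].sum() / eigenvalues.sum()

print("transformed shape:", Y.shape)
print("fraction of variance retained:", explained)
```

The variance of each column of `Y` equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the directions by how much variation they capture.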

Covariance Matrix

  • $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$
  • Using MATLAB, we find out:
    • Eigenvectors:
    • e1=(-0.98,-0.21), λ1=51.
    • e2=(0.21,-0.98), λ2=560.
    • Thus the second eigenvector is more important, because it has the larger eigenvalue!
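The same decomposition can be reproduced with NumPy instead of MATLAB. This sketch assumes the four covariance entries given above form the symmetric matrix with 75 and 482 on the diagonal and 106 off the diagonal; that layout is an assumption. The computed values need not match the figures quoted above exactly, which may have been derived from unrounded data, and eigenvector signs are arbitrary.

```python
import numpy as np

# Assumed layout of the covariance entries from the notes.
C = np.array([[75.0, 106.0],
              [106.0, 482.0]])

# eigh: eigenvalues in ascending order, eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(C)

for lam, e in zip(eigenvalues, eigenvectors.T):
    print("eigenvalue:", lam, "eigenvector:", e)

# The eigenvector with the larger eigenvalue is the more important direction.
dominant = eigenvectors[:, np.argmax(eigenvalues)]
print("dominant direction:", dominant)
```

The dominant direction has its larger component on the second attribute, in line with the conclusion drawn in the notes.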

Principal components

  • General properties of principal components
    • summary variables
    • linear combinations of the original variables
    • uncorrelated with each other
    • capture as much of the original variance as possible
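These properties can be checked on a small example. The sketch assumes NumPy and randomly generated correlated data (illustrative only): the component scores are linear combinations of the centered variables, their covariance matrix is diagonal, and the total variance is preserved.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 100 samples, 3 correlated attributes.
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(100, 3)) @ mixing

X_adj = X - X.mean(axis=0)
C = np.cov(X_adj, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Each score column is a linear combination of the original (centered)
# variables, with the eigenvector entries as weights.
scores = X_adj @ eigenvectors

# Covariance of the scores: off-diagonal entries vanish (uncorrelated).
C_scores = np.cov(scores, rowvar=False)
off_diagonal = C_scores - np.diag(np.diag(C_scores))

print("largest off-diagonal covariance:", np.abs(off_diagonal).max())
print("total variance before:", np.trace(C), "after:", np.trace(C_scores))
```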

Two Way (Angle) Data Analysis

[Figure: a gene expression matrix with genes (10^3 to 10^4) on one axis and samples or conditions (10^1 to 10^2) on the other, analyzed in two ways: sample space analysis and gene space analysis.]

PCA - example