Principal Component Analysis
by- ABHINANDAN
usn-22BTRCB
Principal Components Analysis (PCA) Ideas
- Does the data set "span" the whole of d-dimensional space?
- For a matrix of m samples x n genes, create a new covariance matrix of size n x n.
- Transform a large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
- PCs are constructed to capture as much of the variation in the data as possible.
Principal Component Analysis
[Figure: scatter plot of data points (x) in the original axes X1 and X2, with the principal axes Y1 and Y2 overlaid.]
Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.
Key observation: the variance along Y1 is the largest!
Principal Component Analysis: one attribute first
- Question: how much spread is in the data along the axis (distance to the mean)?
- Variance = (standard deviation)^2
- Example attribute, Temperature: 30, 40, 30, 35, 30, 15, 30, 15, 18, 15, 30, 24, 40, 42
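The spread of the single Temperature attribute can be sketched with Python's standard `statistics` module. This is a minimal illustration that assumes the sample definition of variance (division by n - 1); the notes do not say whether the sample or the population form is meant.

```python
import statistics

# Temperature readings taken from the example above.
temperature = [30, 40, 30, 35, 30, 15, 30, 15, 18, 15, 30, 24, 40, 42]

mean = statistics.mean(temperature)
# Sample variance: summed squared distances to the mean, divided by (n - 1).
variance = statistics.variance(temperature)
# Sample standard deviation: the square root of the sample variance.
std_dev = statistics.stdev(temperature)

print(f"mean = {mean:.2f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```

Swapping in `statistics.pvariance` and `statistics.pstdev` gives the population form (division by n) instead.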
More than two attributes: covariance matrix
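- Contains the covariance values between all possible pairs of dimensions (= attributes).
- Example for three attributes (x, y, z):

$$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$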
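A covariance matrix for three attributes can be built by computing the covariance of every pair of attributes. The sketch below uses made-up measurements for x, y and z and assumes the sample covariance (division by n - 1); the helper names `covariance` and `covariance_matrix` are illustrative, not taken from the notes.

```python
def covariance(a, b):
    """Sample covariance of two equally long lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def covariance_matrix(columns):
    """Covariance between every pair of attributes (one list per attribute)."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]


# Hypothetical measurements for three attributes x, y, z (illustrative only).
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]
z = [9.0, 7.0, 4.0, 1.0]

C = covariance_matrix([x, y, z])
for row in C:
    print([round(value, 3) for value in row])
```

The diagonal holds the variance of each attribute, and the matrix is symmetric because cov(a, b) = cov(b, a).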
Eigenvalues & eigenvectors
- Nonzero vectors $x$ having the same direction as $Ax$ are called eigenvectors of $A$ ($A$ is an $n \times n$ matrix).
- In the equation $Ax = \lambda x$, $\lambda$ is called an eigenvalue of $A$.
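The defining equation can be checked directly: multiply the matrix by a candidate vector and compare the result with the scaled vector. The matrix below is an illustrative choice, not one from the notes, and `mat_vec` and `is_eigenpair` are hypothetical helper names.

```python
def mat_vec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def is_eigenpair(A, x, lam, tol=1e-9):
    """Return True when A x equals lam * x within a tolerance."""
    Ax = mat_vec(A, x)
    return all(abs(ax_i - lam * x_i) <= tol for ax_i, x_i in zip(Ax, x))


# Illustrative 2 x 2 matrix (not from the notes).
A = [[2.0, 1.0],
     [1.0, 2.0]]

print(is_eigenpair(A, [1.0, 1.0], 3.0))   # A x = (3, 3) = 3 x
print(is_eigenpair(A, [1.0, -1.0], 1.0))  # A x = (1, -1) = 1 x
print(is_eigenpair(A, [1.0, 0.0], 2.0))   # A x = (2, 1), not a multiple of x
```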
Principal components
- First principal component (PC1)
  - The largest eigenvalue of the covariance matrix indicates that the data have the largest variance along its eigenvector, the direction of greatest variation.
- Second principal component (PC2)
  - The direction with the maximum variation left in the data, orthogonal to PC1.
- In general, only a few directions capture most of the variability in the data.
Steps of PCA
- Let $\bar{X}$ be the mean vector (taking the mean of all rows).
- Adjust the original data by the mean: $X' = X - \bar{X}$.
- Compute the covariance matrix $C$ of the adjusted data $X'$.
- Find the eigenvectors and eigenvalues of $C$.
  - For matrix $C$, these are the vectors $e$ (column vectors) having the same direction as $Ce$.
  - The eigenvectors of $C$ are the vectors $e$ such that $Ce = \lambda e$.
  - $\lambda$ is called an eigenvalue of $C$.
  - $Ce = \lambda e \Leftrightarrow (C - \lambda I)e = 0$
  - Most data mining packages do this for you.
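The steps above can be sketched in plain Python for the special case of two attributes, where the eigenpairs of the symmetric 2 x 2 covariance matrix have a closed form. The data set is hypothetical, the function names are illustrative, and the sample covariance (division by n - 1) is an assumption; a real analysis would normally rely on a numerical library instead.

```python
import math


def center(rows):
    """Subtract the mean vector (the mean over all rows) from every row."""
    n = len(rows)
    mean = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, mean)] for row in rows], mean


def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered rows (one row per sample)."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(p * q for p, q in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def eig_sym_2x2(C):
    """Eigenpairs (lambda, unit eigenvector) of a symmetric 2 x 2 matrix."""
    a, b, d = C[0][0], C[0][1], C[1][1]
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    lams = (half_trace + radius, half_trace - radius)
    if b != 0:
        # Solving (C - lambda I) e = 0 gives e proportional to (b, lambda - a).
        vecs = [(b, lam - a) for lam in lams]
    else:
        # Diagonal matrix: the coordinate axes are the eigenvectors.
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
    units = [tuple(c / math.hypot(*v) for c in v) for v in vecs]
    return list(zip(lams, units))


# Hypothetical data: 6 samples (rows) of 2 attributes (columns).
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [1.0, 1.1]]

X_centered, mean = center(X)
C = covariance_matrix(X_centered)
for lam, e in eig_sym_2x2(C):
    print(f"lambda = {lam:.4f}, e = ({e[0]:.4f}, {e[1]:.4f})")
```

The eigenpairs come back with the larger eigenvalue first; the sign of each eigenvector is arbitrary.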
Principal components - Variance
Transformed Data
- Each eigenvalue λ_j corresponds to the variance along component j.
- Thus, sort the eigenvectors by λ_j in descending order.
- Take the first p eigenvectors e_i, where p is the number of top eigenvalues kept.
- These are the directions with the largest variances.
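Sorting by eigenvalue and keeping the top p directions can be sketched as follows. The eigenpairs and the centered rows are hypothetical values chosen for readability, and `top_components` and `project` are illustrative names rather than functions from any package.

```python
def top_components(eigenvalues, eigenvectors, p):
    """Return the p eigenpairs with the largest eigenvalues, largest first."""
    pairs = sorted(zip(eigenvalues, eigenvectors),
                   key=lambda pair: pair[0], reverse=True)
    return pairs[:p]


def project(centered_rows, components):
    """Coordinates of each mean-centered row along the selected eigenvectors."""
    return [[sum(x * e for x, e in zip(row, vec)) for _, vec in components]
            for row in centered_rows]


# Hypothetical eigenpairs of a 3 x 3 covariance matrix (illustrative values only).
eigenvalues = [0.5, 4.0, 1.5]
eigenvectors = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

selected = top_components(eigenvalues, eigenvectors, p=2)
# Share of the total variance carried by the kept components.
explained = sum(lam for lam, _ in selected) / sum(eigenvalues)

# Hypothetical mean-centered samples to transform.
centered = [[1.0, -2.0, 0.5], [-1.0, 2.0, -0.5]]
print(selected)
print(f"fraction of variance kept: {explained:.2f}")
print(project(centered, selected))
```

The fraction of variance kept is one common way to choose p, since it shows how much of the variability the discarded directions would have carried.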
Covariance Matrix
- $$C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$$
- Using MATLAB, we find the eigenvectors and eigenvalues:
  - e1 = (-0.98, -0.21), λ1 = 51.
  - e2 = (0.21, -0.98), λ2 = 560.
- Thus the second eigenvector is more important!
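For a symmetric 2 x 2 matrix the eigen-decomposition can also be worked out by hand with the closed-form formula. The sketch below assumes the covariance matrix has rows (75, 106) and (106, 482), as read from the example; eigenvector signs are arbitrary, and the printed numbers are simply what the formula yields for that matrix, so they need not reproduce the MATLAB figures quoted above.

```python
import math

# Covariance matrix entries as read from the example (an assumption).
C = [[75.0, 106.0],
     [106.0, 482.0]]

a, b, d = C[0][0], C[0][1], C[1][1]
half_trace = (a + d) / 2
radius = math.hypot((a - d) / 2, b)
lam_small, lam_large = half_trace - radius, half_trace + radius

# For b != 0, (b, lam - a) is an eigenvector for eigenvalue lam.
vx, vy = b, lam_large - a
norm = math.hypot(vx, vy)
e_large = (vx / norm, vy / norm)

print(f"eigenvalues: {lam_small:.2f}, {lam_large:.2f}")
print(f"dominant direction (up to sign): ({e_large[0]:.2f}, {e_large[1]:.2f})")
print(f"share of total variance: {lam_large / (lam_small + lam_large):.2%}")
```

Whatever the exact figures, the direction with the larger eigenvalue is the one that points mostly along the second attribute, which is the attribute with the larger variance on the diagonal of C.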
Principal components
- General properties of principal components:
  - summary variables
  - linear combinations of the original variables
  - uncorrelated with each other
  - capture as much of the original variance as possible
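These properties can be checked numerically on a small example. The data below are made up and chosen so that the principal directions are exactly (1, 1) and (1, -1) after normalization; the sample covariance (division by n - 1) is an assumption.

```python
import math


def cov(a, b):
    """Sample covariance of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (n - 1)


# Illustrative data, chosen so the principal directions are (1, 1) and (1, -1).
x = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0]
y = [1.0, -1.0, 2.0, -2.0, -1.0, 1.0]

s = 1 / math.sqrt(2)
# Each principal component is a linear combination of the original variables.
pc1 = [s * xi + s * yi for xi, yi in zip(x, y)]
pc2 = [s * xi - s * yi for xi, yi in zip(x, y)]

print("cov(x, y)     =", round(cov(x, y), 6))      # original variables are correlated
print("cov(pc1, pc2) =", round(cov(pc1, pc2), 6))  # components are uncorrelated
print("total variance before:", round(cov(x, x) + cov(y, y), 6))
print("total variance after: ", round(cov(pc1, pc1) + cov(pc2, pc2), 6))
```

The total variance is unchanged by the transformation, but most of it ends up in the first component.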
Two Way (Angle) Data Analysis
[Figure: a gene expression matrix of 10^3 to 10^4 genes by 10^1 to 10^2 samples (conditions), analyzed in two ways: sample space analysis and gene space analysis.]
PCA - example