




















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Various methods for creating sparse statistical models, including sparse linear models with Lasso and Ridge regression, Sparse PCA and SVD, Sparse CCA, and Sparse clustering. how adding a penalization term to the objective function makes the solution more sparse and provides examples of how these methods can be applied to real data. Tom Bartlett is the author.
Typology: Lecture notes
1 / 28
This page cannot be seen from the preview
Don't miss anything!





















Tom Bartlett
‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’[1]
The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼ 30000 genes)
These measurements are often highly correlated, i.e., contain much redundant information
This scenario is particularly relevant in the age of ‘big-data’
(^1) Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015
A linear model can be written as
yi =α +
∑^ p
j=
xij βj + i , i = 1, ..., n
=α + x> i β + i
Hence, the model can be fit by minimising the objective function
minimise a,β
i=
(yi − α − x> i β)^2
Adding a penalisation term to the objective function makes the solution more sparse:
minimise a,β
i=
(yi − α − x> i β)^2 + λ‖β‖qq
, where q = 1 or 2
The penalty term λ‖β‖qq means that only the bare minimum is used of all the information available in the p predictor variables xij , j = 1, ...p.
minimise a,β
i=
(yi − α − x> i β)^2 + λ‖β‖qq
q is typically chosen as q = 1 or q = 2, because these produce convex solutions and hence are computationally much nicer! q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero q = 2 is called ‘ridge regression’, and it tends to minimise the size of all the elements of β Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models etc
Gene expression data, for p = 17280 genes, for nc = 530 cancer samples + nh = 61 healthy tissue samples Fit logistic (i.e., 2 class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation Out of 17280 possible genes for prediction, lasso chooses just these 25 (shown with their fitted model coefficients) ADAMTS5 -0.0666 HPD -0.00679 NUP210 0. ADH4 -0.165 HS3ST4 -0.0863 PAFAH1B3 0. CA4 -0.151 IGSF10 -0.356 TACC3 0. CCDC36 -0.335 LRRTM2 -0.0711 TESC -0. CDH12 -0.253 LRRC3B -0.211 TRPM3 -1. CES1 -0.302 MEG3 -0.022 TSLP -0. COL10A1 0.747 MMP11 0.22 WDR51A 0. DPP6 -0.107 NUAK2 0.0354 WISP1 0. HHATL -0. Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, lasso would choose an entirely new set of genes which might be almost as good at predicting!
Ordinary PCA finds v by carrying out the optimisation:
maximise ‖v‖ 2 =
v>^
n
v
with X ∈ Rn×p^ (i.e., n samples and p variables). With p >> n, the eigenvectors of the sample covariance matrix X>X/n are not necessarily close to those of the population covariance matrix [2]. Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v by carrying out the optimisation:
maximise ‖v‖ 2 =
v>X>Xv
, subject to: ‖v‖ 1 ≤ t.
In effect this discards some variables such that p is closer to n. (^2) Iain M Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. In: Annals of statistics (2001), pp. 295–
SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem
The sparse PCA can thus be re-formulated as:
maximise ‖u‖ 2 =‖v‖ 2 =
u>Xv
, subject to: ‖v‖ 1 ≤ t,
which is biconvex in u and v and can be solved by alternating between the updates:
u ←
Xv ‖Xv‖ 2
, and v ←
Sλ
X>u
‖Sλ (X>u) ‖ 2
where Sλ is the soft-thresholding operator Sλ = sign(x) (|x| − λ)+.
Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.
Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
Generate n samples x ∼ Normal( 0 , Σ)
Estimate Σ̂ =
(x − ¯x)(x − ¯x)>/n
Correlate eigenvectors of Σ with eigenvectors of Σ̂
Repeat 100 times for each different value of n
0.0 0.2 0.4 0.6 0.8 1.
1.0 Top 10 PCs
n/p
Eigenvector correlation
The plot shows the means of these correlations over the 100 repetitions for different values of n.
0.0 0.2 0.4 0.6 0.8 1.
Top 10 PCs
n/p
Eigenvector correlation
The plot shows the result with ‖u‖ 1 =
p/2.
0.0 0.2 0.4 0.6 0.8 1.
1.0 Top 10 PCs
n/p
Eigenvector correlation
The plot shows the result with ‖u‖ 1 =
p/3.
I carried out PCA on expression levels of 10138 genes in individual cells from developing brains
There are many different cell types in the data - some mature, some immature, and some in between
Different cell-types are characterised by different gene expression profiles
We would therefore expect to be able visualise some separation of the cell-types by dimensionality reduction to three dimensions
The plot shows the cells plotted in terms of the top three (standard) PCA components.
In CCA, the aim is to find coefficient vectors u ∈ Rp^ and v ∈ Rq which project the data-matrices X ∈ Rn×p^ and Y ∈ Rn×q^ so as to maximise the correlations between these projections. Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other. The CCA problem can be solved by carrying out the optimisation:
maximise u∈Rp^ ,v∈Rq^
Cor(Xu, Yv)
This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1. Sparse CCA solves this problem by carrying out the optimisation:
maximise u∈Rp^ ,v∈Rq^
Cor(Xu, Yv), subject to ‖u‖ 1 < t 1 and ‖v‖ 1 < t 2.
‘Cell cycle’ is a biological process involved in the replication of cells
Cell-cycle can be thought of as a latent process which is not directly observable in genomics data
It is driven by a small set of genes (particularly cyclins and cyclin- dependent kinases) from which it may be inferred
It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes
Used CCA here as an exploratory tool, with Y the data for the cell cycle genes, and X the data for all the other genes.
The decision boundary
log
P(G = k|xi ) P(G = l|xi )
= log
πk πl
(μk + μl )>Σ−^1 (μk − μl )
then naturally leads to the decision rule:
G (xi ) = argmax k
log πk + x> i Σ−^1 μk − μ> k Σ−^1 μk
By assuming Σ is diagonal, i.e., there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:
G (xi ) = argmin k
∑^ p
j=
(xj − μjk )^2 σ^2 j
− log πk
Typically, Σ (or σ) are estimated from the data as Σ̂ (or ̂σ), and the μk are estimated as μ̂ k whilst training the classifier.
The nearest centroids classifier
Ĝ (xi ) = argmin k
∑^ p
j=
(xj − μ̂ jk )^2 ̂ σ j^2
− log πk
will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue. Define μ̂ = ¯x + αk , where ¯x is the data-mean across all classes, and αk is the class-specific deviation of the mean from ¯x. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
minimise αk ∈Rp^ ,k∈{1,...,K }
2 n
k=
i∈Ck
∑^ p
j=
(xij − x¯j − αjk )^2 ˆσ^2
+λ
k=
∑^ p
j=
nk σ ˆ^2
|αjk |
where Ck and nk are the set and number of samples in group k.