Sparse Statistical Modelling: Regularizing Linear Models and Clustering, Lecture notes of Genomics

Various methods for creating sparse statistical models, including sparse linear models with Lasso and Ridge regression, Sparse PCA and SVD, Sparse CCA, and Sparse clustering. how adding a penalization term to the objective function makes the solution more sparse and provides examples of how these methods can be applied to real data. Tom Bartlett is the author.

Typology: Lecture notes

2021/2022

Uploaded on 09/27/2022

ekadant
ekadant 🇺🇸

4.3

(32)

267 documents

1 / 28

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Sparse statistical modelling
Tom Bartlett
Sparse statistical modelling Tom Bartlett 1 / 28
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c

Partial preview of the text

Download Sparse Statistical Modelling: Regularizing Linear Models and Clustering and more Lecture notes Genomics in PDF only on Docsity!

Sparse statistical modelling

Tom Bartlett

Introduction

‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’[1]

The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼ 30000 genes)

These measurements are often highly correlated, i.e., contain much redundant information

This scenario is particularly relevant in the age of ‘big-data’

(^1) Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015

Sparse linear models

A linear model can be written as

yi =α +

∑^ p

j=

xij βj + i , i = 1, ..., n

=α + x> i β + i

Hence, the model can be fit by minimising the objective function

minimise a,β

{ N

i=

(yi − α − x> i β)^2

Adding a penalisation term to the objective function makes the solution more sparse:

minimise a,β

2 N

∑^ N

i=

(yi − α − x> i β)^2 + λ‖β‖qq

, where q = 1 or 2

Sparse linear models

The penalty term λ‖β‖qq means that only the bare minimum is used of all the information available in the p predictor variables xij , j = 1, ...p.

minimise a,β

2 N

∑^ N

i=

(yi − α − x> i β)^2 + λ‖β‖qq

q is typically chosen as q = 1 or q = 2, because these produce convex solutions and hence are computationally much nicer! q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero q = 2 is called ‘ridge regression’, and it tends to minimise the size of all the elements of β Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models etc

Sparse linear models - genomics example

Gene expression data, for p = 17280 genes, for nc = 530 cancer samples + nh = 61 healthy tissue samples Fit logistic (i.e., 2 class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation Out of 17280 possible genes for prediction, lasso chooses just these 25 (shown with their fitted model coefficients) ADAMTS5 -0.0666 HPD -0.00679 NUP210 0. ADH4 -0.165 HS3ST4 -0.0863 PAFAH1B3 0. CA4 -0.151 IGSF10 -0.356 TACC3 0. CCDC36 -0.335 LRRTM2 -0.0711 TESC -0. CDH12 -0.253 LRRC3B -0.211 TRPM3 -1. CES1 -0.302 MEG3 -0.022 TSLP -0. COL10A1 0.747 MMP11 0.22 WDR51A 0. DPP6 -0.107 NUAK2 0.0354 WISP1 0. HHATL -0. Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, lasso would choose an entirely new set of genes which might be almost as good at predicting!

Sparse PCA

Ordinary PCA finds v by carrying out the optimisation:

maximise ‖v‖ 2 =

v>^

X>X

n

v

with X ∈ Rn×p^ (i.e., n samples and p variables). With p >> n, the eigenvectors of the sample covariance matrix X>X/n are not necessarily close to those of the population covariance matrix [2]. Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v by carrying out the optimisation:

maximise ‖v‖ 2 =

v>X>Xv

, subject to: ‖v‖ 1 ≤ t.

In effect this discards some variables such that p is closer to n. (^2) Iain M Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. In: Annals of statistics (2001), pp. 295–

Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem

The sparse PCA can thus be re-formulated as:

maximise ‖u‖ 2 =‖v‖ 2 =

u>Xv

, subject to: ‖v‖ 1 ≤ t,

which is biconvex in u and v and can be solved by alternating between the updates:

u ←

Xv ‖Xv‖ 2

, and v ←

X>u

‖Sλ (X>u) ‖ 2

where Sλ is the soft-thresholding operator Sλ = sign(x) (|x| − λ)+.

Sparse PCA - simulation study

Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.

Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.

Generate n samples x ∼ Normal( 0 , Σ)

Estimate Σ̂ =

(x − ¯x)(x − ¯x)>/n

Correlate eigenvectors of Σ with eigenvectors of Σ̂

Repeat 100 times for each different value of n

0.0 0.2 0.4 0.6 0.8 1.

1.0 Top 10 PCs

n/p

Eigenvector correlation

The plot shows the means of these correlations over the 100 repetitions for different values of n.

Sparse PCA - simulation study

0.0 0.2 0.4 0.6 0.8 1.

Top 10 PCs

n/p

Eigenvector correlation

The plot shows the result with ‖u‖ 1 =

p/2.

0.0 0.2 0.4 0.6 0.8 1.

1.0 Top 10 PCs

n/p

Eigenvector correlation

The plot shows the result with ‖u‖ 1 =

p/3.

Sparse PCA - real data example

I carried out PCA on expression levels of 10138 genes in individual cells from developing brains

There are many different cell types in the data - some mature, some immature, and some in between

Different cell-types are characterised by different gene expression profiles

We would therefore expect to be able visualise some separation of the cell-types by dimensionality reduction to three dimensions

The plot shows the cells plotted in terms of the top three (standard) PCA components.

Sparse CCA

In CCA, the aim is to find coefficient vectors u ∈ Rp^ and v ∈ Rq which project the data-matrices X ∈ Rn×p^ and Y ∈ Rn×q^ so as to maximise the correlations between these projections. Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other. The CCA problem can be solved by carrying out the optimisation:

maximise u∈Rp^ ,v∈Rq^

Cor(Xu, Yv)

This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1. Sparse CCA solves this problem by carrying out the optimisation:

maximise u∈Rp^ ,v∈Rq^

Cor(Xu, Yv), subject to ‖u‖ 1 < t 1 and ‖v‖ 1 < t 2.

Sparse CCA - real data example

‘Cell cycle’ is a biological process involved in the replication of cells

Cell-cycle can be thought of as a latent process which is not directly observable in genomics data

It is driven by a small set of genes (particularly cyclins and cyclin- dependent kinases) from which it may be inferred

It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes

Used CCA here as an exploratory tool, with Y the data for the cell cycle genes, and X the data for all the other genes.

Sparse LDA

The decision boundary

log

P(G = k|xi ) P(G = l|xi )

= log

πk πl

  • x> i Σ−^1 (μk − μl )

(μk + μl )>Σ−^1 (μk − μl )

then naturally leads to the decision rule:

G (xi ) = argmax k

log πk + x> i Σ−^1 μk − μ> k Σ−^1 μk

By assuming Σ is diagonal, i.e., there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:

G (xi ) = argmin k

∑^ p

j=

(xj − μjk )^2 σ^2 j

− log πk

Typically, Σ (or σ) are estimated from the data as Σ̂ (or ̂σ), and the μk are estimated as μ̂ k whilst training the classifier.

Sparse LDA

The nearest centroids classifier

Ĝ (xi ) = argmin k

∑^ p

j=

(xj − μ̂ jk )^2 ̂ σ j^2

− log πk

will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue. Define μ̂ = ¯x + αk , where ¯x is the data-mean across all classes, and αk is the class-specific deviation of the mean from ¯x. Then, the nearest shrunken centroids classifier proceeds with the optimisation:

minimise αk ∈Rp^ ,k∈{1,...,K }

2 n

∑^ K

k=

i∈Ck

∑^ p

j=

(xij − x¯j − αjk )^2 ˆσ^2

∑^ K

k=

∑^ p

j=

nk σ ˆ^2

|αjk |

where Ck and nk are the set and number of samples in group k.