Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Sparse Statistical Modelling: Regularizing Linear Models and Clustering, Lecture notes of Genomics

University of Idaho (U of I)Genomics

Various methods for creating sparse statistical models, including sparse linear models with Lasso and Ridge regression, Sparse PCA and SVD, Sparse CCA, and Sparse clustering. how adding a penalization term to the objective function makes the solution more sparse and provides examples of how these methods can be applied to real data. Tom Bartlett is the author.

Typology: Lecture notes

2021/2022

Uploaded on 09/27/2022

ekadant 🇺🇸

4.3

(32)

267 documents

1 / 28

This page cannot be seen from the preview

Don't miss anything!

Sparse statistical modelling

Tom Bartlett

Sparse statistical modelling Tom Bartlett 1 / 28

Discover Lecture notes of Genomics University of Idaho (U of I)

Partial preview of the text

Download Sparse Statistical Modelling: Regularizing Linear Models and Clustering and more Lecture notes Genomics in PDF only on Docsity!

Sparse statistical modelling

Tom Bartlett

Introduction

‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’[1]

The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼ 30000 genes)

These measurements are often highly correlated, i.e., contain much redundant information

This scenario is particularly relevant in the age of ‘big-data’

(^1) Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015

Sparse linear models

A linear model can be written as

yi =α +

∑^ p

xij βj + i , i = 1, ..., n

=α + x> i β + i

Hence, the model can be fit by minimising the objective function

minimise a,β

{ N

(yi − α − x> i β)^2

Adding a penalisation term to the objective function makes the solution more sparse:

minimise a,β

2 N

∑^ N

(yi − α − x> i β)^2 + λ‖β‖qq

, where q = 1 or 2

Sparse linear models

The penalty term λ‖β‖qq means that only the bare minimum is used of all the information available in the p predictor variables xij , j = 1, ...p.

minimise a,β

2 N

∑^ N

(yi − α − x> i β)^2 + λ‖β‖qq

q is typically chosen as q = 1 or q = 2, because these produce convex solutions and hence are computationally much nicer! q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero q = 2 is called ‘ridge regression’, and it tends to minimise the size of all the elements of β Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models etc

Sparse linear models - genomics example

Gene expression data, for p = 17280 genes, for nc = 530 cancer samples + nh = 61 healthy tissue samples Fit logistic (i.e., 2 class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation Out of 17280 possible genes for prediction, lasso chooses just these 25 (shown with their fitted model coefficients) ADAMTS5 -0.0666 HPD -0.00679 NUP210 0. ADH4 -0.165 HS3ST4 -0.0863 PAFAH1B3 0. CA4 -0.151 IGSF10 -0.356 TACC3 0. CCDC36 -0.335 LRRTM2 -0.0711 TESC -0. CDH12 -0.253 LRRC3B -0.211 TRPM3 -1. CES1 -0.302 MEG3 -0.022 TSLP -0. COL10A1 0.747 MMP11 0.22 WDR51A 0. DPP6 -0.107 NUAK2 0.0354 WISP1 0. HHATL -0. Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, lasso would choose an entirely new set of genes which might be almost as good at predicting!

Sparse PCA

Ordinary PCA finds v by carrying out the optimisation:

maximise ‖v‖ 2 =

v>^

X>X

with X ∈ Rn×p^ (i.e., n samples and p variables). With p >> n, the eigenvectors of the sample covariance matrix X>X/n are not necessarily close to those of the population covariance matrix [2]. Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v by carrying out the optimisation:

maximise ‖v‖ 2 =

v>X>Xv

, subject to: ‖v‖ 1 ≤ t.

In effect this discards some variables such that p is closer to n. (^2) Iain M Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. In: Annals of statistics (2001), pp. 295–

Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem

The sparse PCA can thus be re-formulated as:

maximise ‖u‖ 2 =‖v‖ 2 =

u>Xv

, subject to: ‖v‖ 1 ≤ t,

which is biconvex in u and v and can be solved by alternating between the updates:

u ←

Xv ‖Xv‖ 2

, and v ←

Sλ

X>u

‖Sλ (X>u) ‖ 2

where Sλ is the soft-thresholding operator Sλ = sign(x) (|x| − λ)+.

Sparse PCA - simulation study

Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.

Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.

Generate n samples x ∼ Normal( 0 , Σ)

Estimate Σ̂ =

(x − ¯x)(x − ¯x)>/n

Correlate eigenvectors of Σ with eigenvectors of Σ̂

Repeat 100 times for each different value of n

0.0 0.2 0.4 0.6 0.8 1.

1.0 Top 10 PCs

n/p

Eigenvector correlation

The plot shows the means of these correlations over the 100 repetitions for different values of n.

Sparse PCA - simulation study

0.0 0.2 0.4 0.6 0.8 1.

Top 10 PCs

n/p

Eigenvector correlation

The plot shows the result with ‖u‖ 1 =

p/2.

0.0 0.2 0.4 0.6 0.8 1.

1.0 Top 10 PCs

n/p

Eigenvector correlation

The plot shows the result with ‖u‖ 1 =

p/3.

Sparse PCA - real data example

I carried out PCA on expression levels of 10138 genes in individual cells from developing brains

There are many different cell types in the data - some mature, some immature, and some in between

Different cell-types are characterised by different gene expression profiles

We would therefore expect to be able visualise some separation of the cell-types by dimensionality reduction to three dimensions

The plot shows the cells plotted in terms of the top three (standard) PCA components.

Sparse CCA

In CCA, the aim is to find coefficient vectors u ∈ Rp^ and v ∈ Rq which project the data-matrices X ∈ Rn×p^ and Y ∈ Rn×q^ so as to maximise the correlations between these projections. Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other. The CCA problem can be solved by carrying out the optimisation:

maximise u∈Rp^ ,v∈Rq^

Cor(Xu, Yv)

This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1. Sparse CCA solves this problem by carrying out the optimisation:

maximise u∈Rp^ ,v∈Rq^

Cor(Xu, Yv), subject to ‖u‖ 1 < t 1 and ‖v‖ 1 < t 2.

Sparse CCA - real data example

‘Cell cycle’ is a biological process involved in the replication of cells

Cell-cycle can be thought of as a latent process which is not directly observable in genomics data

It is driven by a small set of genes (particularly cyclins and cyclin- dependent kinases) from which it may be inferred

It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes

Used CCA here as an exploratory tool, with Y the data for the cell cycle genes, and X the data for all the other genes.

Sparse LDA

The decision boundary

log

P(G = k|xi ) P(G = l|xi )

= log

πk πl

x> i Σ−^1 (μk − μl )

(μk + μl )>Σ−^1 (μk − μl )

then naturally leads to the decision rule:

G (xi ) = argmax k

log πk + x> i Σ−^1 μk − μ> k Σ−^1 μk

By assuming Σ is diagonal, i.e., there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:

G (xi ) = argmin k

∑^ p

(xj − μjk )^2 σ^2 j

− log πk

Typically, Σ (or σ) are estimated from the data as Σ̂ (or ̂σ), and the μk are estimated as μ̂ k whilst training the classifier.

Sparse LDA

The nearest centroids classifier

Ĝ (xi ) = argmin k

∑^ p

(xj − μ̂ jk )^2 ̂ σ j^2

− log πk

will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue. Define μ̂ = ¯x + αk , where ¯x is the data-mean across all classes, and αk is the class-specific deviation of the mean from ¯x. Then, the nearest shrunken centroids classifier proceeds with the optimisation:

minimise αk ∈Rp^ ,k∈{1,...,K }

2 n

∑^ K

i∈Ck

∑^ p

(xij − x¯j − αjk )^2 ˆσ^2

+λ

∑^ K

∑^ p

nk σ ˆ^2

|αjk |

where Ck and nk are the set and number of samples in group k.

Sparse Statistical Modelling: Regularizing Linear Models and Clustering, Lecture notes of Genomics

Related documents

Partial preview of the text

Download Sparse Statistical Modelling: Regularizing Linear Models and Clustering and more Lecture notes Genomics in PDF only on Docsity!

Sparse statistical modelling

Introduction

Sparse linear models

{ N

2 N

∑^ N

Sparse linear models

2 N

∑^ N

Sparse linear models - genomics example

Sparse PCA

X>X

Sparse PCA and SVD - an algorithm

Sparse PCA - simulation study

Sparse PCA - simulation study

Sparse PCA - real data example

Sparse CCA

Sparse CCA - real data example

Sparse LDA

Sparse LDA

∑^ K

∑^ K