Sparse Coding - Matrix Computation - Lecture Slides, Slides of Advanced Computer Architecture

These lecture slides are easy to follow and helpful for building an understanding of matrix computation. The key points discussed in these slides are: Sparse Coding, Overcomplete Dictionary, Matching Pursuit, Basis Pursuit, Sparse Representation of Signals, Compressive Sensing, Orthogonal Matching Pursuit, Design of Dictionaries, Maximum Likelihood Methods

Typology: Slides

2012/2013

Uploaded on 04/27/2013

ashalata

Lecture 25

Overview

Sparse coding, overcomplete dictionary, matching pursuit, basis pursuit, K-SVD, applications

Sparse representation of signals

Using an overcomplete dictionary matrix $D \in \mathbb{R}^{n \times K}$ that contains $K$ prototype signal-atoms as columns $\{d_j\}_{j=1}^{K}$, a signal $y \in \mathbb{R}^{n}$ can be represented as a sparse linear combination of these atoms:

$$y = Dx, \quad \text{or} \quad y \approx Dx \ \ \text{subject to} \ \ \|y - Dx\|_p \le \varepsilon$$

where the vector $x \in \mathbb{R}^{K}$ contains the representation coefficients of the signal $y$, and the $\ell_p$-norms for $p = 1, 2$, and $\infty$ are often used.

If $n < K$ and $D$ is a full-rank matrix, an infinite number of solutions are available for the representation problem, hence constraints on the solution must be set.

The sparsest representation is the solution of either

$$(P_0) \quad \min_x \|x\|_0 \ \ \text{subject to} \ \ y = Dx \qquad (1)$$

$$(P_{0,\varepsilon}) \quad \min_x \|x\|_0 \ \ \text{subject to} \ \ \|y - Dx\|_2 \le \varepsilon \qquad (2)$$

where $\|\cdot\|_0$ is the $\ell_0$-norm, counting the nonzero entries of a vector.
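As a small illustration of the quantities in (1) and (2), the sketch below counts nonzero entries and checks the error constraint. The helper names `l0_norm` and `is_feasible`, the tolerance, and the toy dictionary are assumptions made for this example, not part of the slides.

```python
import numpy as np

def l0_norm(x, tol=1e-12):
    """Count the entries of x whose magnitude exceeds tol (the l0 'norm')."""
    return int(np.sum(np.abs(x) > tol))

def is_feasible(D, x, y, eps):
    """Check the constraint ||y - D x||_2 <= eps from problem (2)."""
    return bool(np.linalg.norm(y - D @ x) <= eps)

# Hypothetical toy dictionary with n = 2 rows and K = 3 atoms (n < K),
# so y = D x has infinitely many solutions.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([2.0, 2.0])

x_dense = np.array([2.0, 2.0, 0.0])   # exact representation using two atoms
x_sparse = np.array([0.0, 0.0, 2.0])  # exact representation using one atom
```

Both vectors satisfy $y = Dx$ exactly, but `x_sparse` has the smaller $\ell_0$-norm, so it is the one preferred by $(P_0)$.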

The choice of the dictionary

The dictionary can either be chosen as a prespecified set of functions (i.e., non-adaptive) or designed by adapting its content to fit a given set of signal examples.

Prespecified transform matrix: wavelets, curvelets, contourlets, steerable wavelet filters, short-time Fourier transforms, random matrices, and more.

K-SVD: learn a dictionary D from training examples.

Compressive sensing: use random matrices.

Matching pursuit

Matching pursuit (MP) is a greedy algorithm that finds the best matching projections of multidimensional data onto an overcomplete dictionary $D$.

Each such dictionary $D$ is a collection of waveforms $(\phi_\gamma)_{\gamma \in \Gamma}$, with $\gamma$ a parameter, giving

$$y = \sum_{\gamma \in \Gamma} \alpha_\gamma \phi_\gamma, \quad \text{or} \quad y = \sum_{i=1}^{m} \alpha_{\gamma_i} \phi_{\gamma_i} + R^{(m)}$$

as an approximate decomposition with residual $R^{(m)}$.

Start with an initial approximation $y^{(0)} = 0$ and residual $R^{(0)} = y$, and build up a sequence of sparse approximations stepwise.

At step $k$, identify the atom that best correlates with the residual (by sweeping all samples), and then add to the current approximation a scalar multiple of that atom, so that $y^{(k)} = y^{(k-1)} + \alpha_k \phi_{\gamma_k}$, where $\alpha_k = \langle R^{(k-1)}, \phi_{\gamma_k} \rangle$ (for unit-norm atoms) and $R^{(k)} = y - y^{(k)}$.

After $m$ steps, obtain the approximate decomposition above with residual $R = R^{(m)}$.
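The stepwise update described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a tuned implementation: the function name `matching_pursuit` is hypothetical, and the atoms (columns of `D`) are assumed to have unit $\ell_2$ norm so that the inner product is the correct step size.

```python
import numpy as np

def matching_pursuit(D, y, n_steps):
    """Greedy MP sketch; assumes the columns of D have unit l2 norm."""
    x = np.zeros(D.shape[1])
    residual = np.array(y, dtype=float)        # R^(0) = y
    for _ in range(n_steps):
        correlations = D.T @ residual          # <R^(k-1), phi_j> for every atom
        k = int(np.argmax(np.abs(correlations)))
        alpha = correlations[k]
        x[k] += alpha                          # the same atom may be chosen again
        residual = residual - alpha * D[:, k]  # R^(k) = y - y^(k)
    return x, residual
```

By construction `D @ x + residual` equals `y` after every step, and the residual norm never increases.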

Orthogonal matching pursuit

When the dictionary is orthogonal (e.g., orthogonal wavelets), MP recovers the underlying sparse structure well.

The computational complexity of MP at the encoder is high.

Improvements include the use of approximate dictionary representations and suboptimal ways of choosing the best match at each iteration (atom extraction).

Orthogonal matching pursuit (OMP): an extra orthogonalization step in MP.

Take all $m$ terms that have entered by step $m$ and solve the least squares problem

$$\min_{(\alpha_i)} \left\| y - \sum_{i=1}^{m} \alpha_i \phi_{\gamma_i} \right\|_2$$

for the coefficients $(\alpha_i^{(m)})$.

Then form the residual $R^{[m]} = y - \sum_{i=1}^{m} \alpha_i^{(m)} \phi_{\gamma_i}$, which will be orthogonal to all terms currently in the model.
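A minimal NumPy sketch of the OMP step follows: after each greedy selection, all coefficients are refit by least squares over the atoms chosen so far. The function name, the stopping tolerance, and the use of `np.linalg.lstsq` for the refit are choices made for this illustration.

```python
import numpy as np

def orthogonal_matching_pursuit(D, y, n_atoms, tol=1e-12):
    """OMP sketch: greedy atom selection plus a least-squares refit each step."""
    y = np.array(y, dtype=float)
    support = []
    coeffs = np.zeros(0)
    residual = y.copy()
    for _ in range(n_atoms):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0        # never reselect an atom
        k = int(np.argmax(correlations))
        if correlations[k] <= tol:
            break                              # nothing left to explain
        support.append(k)
        # Least squares over all atoms selected so far.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs  # orthogonal to the chosen atoms
    x = np.zeros(D.shape[1])
    x[support] = coeffs
    return x, residual
```

Because of the refit, an atom that has entered the model never needs to be selected twice, which is the practical difference from plain MP.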

Why $\ell_1$-norm?

Consider a two-dimensional case
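One way to see the effect in two dimensions is to compare the minimum-norm solutions of a single linear constraint. The numbers below are an assumed toy instance chosen for illustration, not values from the slides. Take the constraint

$$2x_1 + x_2 = 2, \qquad x = (x_1, x_2)^\top \in \mathbb{R}^2 .$$

The minimum $\ell_2$-norm solution is the orthogonal projection of the origin onto this line, with $a = (2, 1)^\top$ and $b = 2$:

$$x^{(\ell_2)} = \frac{b}{\|a\|_2^2}\, a = \left(\tfrac{4}{5}, \tfrac{2}{5}\right)^\top, \qquad \|x^{(\ell_2)}\|_0 = 2 .$$

For the $\ell_1$-norm, substitute $x_2 = 2 - 2x_1$ and minimize $f(x_1) = |x_1| + |2 - 2x_1|$. On $0 \le x_1 \le 1$ this equals $2 - x_1$, which decreases to $1$ at $x_1 = 1$; for $x_1 > 1$ it equals $3x_1 - 2 > 1$, and for $x_1 < 0$ it equals $2 - 3x_1 > 2$. Hence

$$x^{(\ell_1)} = (1, 0)^\top, \qquad \|x^{(\ell_1)}\|_1 = 1, \qquad \|x^{(\ell_1)}\|_0 = 1 .$$

Geometrically, the $\ell_1$ ball is a diamond whose corners lie on the coordinate axes, so an expanding diamond typically first touches the constraint line at a corner, which is a sparse point. The $\ell_2$ ball is a circle and touches the line at a point with both coordinates nonzero.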

Design of dictionaries

There is an intriguing relation between sparse representation and clustering (i.e., vector quantization).

In clustering, a set of descriptive vectors $\{d_k\}_{k=1}^{K}$ is learned, and each sample is represented by one of these vectors (based on a distance metric, e.g., the $\ell_2$-norm).

This can be thought of as an extreme sparse representation, where only one atom is allowed in the signal decomposition.

The K-means algorithm, also known as the generalized Lloyd algorithm (GLA), is the most commonly used procedure for clustering.

Dictionary learning can be considered a generalization of the K-means algorithm:

  • given $\{d_k\}_{k=1}^{K}$, assign the training examples to their nearest neighbor
  • given that assignment, update $\{d_k\}_{k=1}^{K}$ to better fit the examples
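The two alternating steps listed above can be sketched as one K-means iteration. The function name `kmeans_step` and the column-wise data layout (examples as columns of `Y`, codewords as columns of `D`) are assumptions made to match the notation used for dictionaries in these slides.

```python
import numpy as np

def kmeans_step(Y, D):
    """One K-means iteration. Y is n x N (examples as columns), D is n x K."""
    # Assignment: squared l2 distance from every codeword to every example.
    dists = ((Y[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)  # shape K x N
    labels = np.argmin(dists, axis=0)
    # Update: each codeword becomes the mean of the examples assigned to it.
    D_new = np.array(D, dtype=float)
    for k in range(D.shape[1]):
        members = Y[:, labels == k]
        if members.shape[1] > 0:       # leave unused codewords unchanged
            D_new[:, k] = members.mean(axis=1)
    return D_new, labels
```

In sparse-coding terms, `labels` encodes a coefficient vector with exactly one nonzero entry (equal to 1) per example.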

Maximum likelihood methods (cont’d)

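Assuming the prior on $x$ is a Laplace distribution,

$$p(y_i \mid D) = \int p(y_i \mid x, D)\, p(x)\, dx = C \int \exp\left(-\frac{1}{2\sigma^2} \|Dx - y_i\|^2\right) \exp\left(-\lambda \|x\|_1\right) dx$$

This integral is difficult to evaluate, but it can be simplified by replacing the integral with the extremal value of the integrand:

$$D = \arg\max_D \sum_{i=1}^{N} \max_{x_i} p(y_i, x_i \mid D) = \arg\min_D \sum_{i=1}^{N} \min_{x_i} \left\{ \|Dx_i - y_i\|^2 + \lambda \|x_i\|_1 \right\} \qquad (3)$$

This problem does not penalize the entries of $D$ as it does those of $x_i$, so the solution tends to increase the dictionary entries (which allows the coefficients to shrink toward zero).

An iterative method was suggested: first calculate the coefficients $x_i$ using a simple gradient descent procedure, and then update the dictionary using

$$D^{(n+1)} = D^{(n)} - \eta \sum_{i=1}^{N} \left(D^{(n)} x_i - y_i\right) x_i^\top$$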

Related to independent component analysis (ICA) which maximizes the mutual information between inputs (samples) and outputs (coefficients)
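The dictionary update in this iterative method is a single gradient step on the squared error, which can be written in matrix form. The sketch below assumes the examples $y_i$ and coefficients $x_i$ are stacked as the columns of `Y` and `X`; the function name and the step size `eta` are illustrative.

```python
import numpy as np

def ml_dictionary_step(D, X, Y, eta):
    """One gradient step D <- D - eta * sum_i (D x_i - y_i) x_i^T."""
    # With examples stacked as columns, the sum over i equals (D X - Y) X^T.
    return D - eta * (D @ X - Y) @ X.T
```

Since nothing here constrains the size of the entries of `D`, practical variants usually add some control on the column norms; that step is not shown because the slides do not specify one.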

Method of optimal directions (MOD)

MOD follows the K-means outline closely, with a sparse coding stage that uses either OMP or FOCUSS, followed by an update of the dictionary.

Assuming that the sparse coding for each example is known, define the errors $e_i = y_i - Dx_i$; the overall representation error is

$$\|E\|_F^2 = \|[e_1, e_2, \ldots, e_N]\|_F^2 = \|Y - DX\|_F^2$$

Assuming $X$ is fixed, we can seek an update to $D$ that minimizes the above error by setting its derivative with respect to $D$ to zero, $(Y - DX)X^\top = 0$, which gives

$$D^{(n+1)} = Y X^{(n)\top} \left( X^{(n)} X^{(n)\top} \right)^{-1}$$

Related to the maximum likelihood methods
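The closed-form MOD update can be sketched directly from the normal equations. The function name `mod_update` is hypothetical, and the sketch assumes $X X^\top$ is invertible (that is, $X$ has full row rank); `np.linalg.lstsq` would be a more forgiving choice when it is not.

```python
import numpy as np

def mod_update(Y, X):
    """MOD dictionary update D = Y X^T (X X^T)^{-1} for fixed coefficients X."""
    # Solve (X X^T) D^T = X Y^T instead of forming an explicit inverse.
    return np.linalg.solve(X @ X.T, X @ Y.T).T
```

The returned dictionary satisfies the stationarity condition $(Y - DX)X^\top = 0$ up to rounding error.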

K-SVD: Generalizing the K-means

The sparse representation problem can be viewed as a generalization of the VQ problem (4) in which we allow each input signal to be represented by a linear combination of atoms:

$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \forall i,\ \|x_i\|_0 \le T_0 \qquad (5)$$

or

$$\min_{D,X} \sum_i \|x_i\|_0 \quad \text{subject to} \quad \|Y - DX\|_F^2 \le \varepsilon \qquad (6)$$

Minimize (5) iteratively: first fix $D$ and find the coefficient matrix $X$ using any pursuit method, then search for a better dictionary.

The dictionary search updates one column at a time: fixing all the other columns, it finds a new column $d_k$ and new values for its coefficients that best reduce the MSE.

Updating only one column of $D$ at a time is a problem with a straightforward solution based on the SVD.

Updating dictionary

Assume that both $X$ and $D$ are fixed, except for one column $d_k$ of the dictionary and the coefficients that correspond to it, the $k$-th row of $X$, denoted $x_T^k$ (different from the vector $x_k$, which is the $k$-th column of $X$).

The objective function can be rewritten as

$$\|Y - DX\|_F^2 = \left\| Y - \sum_{j=1}^{K} d_j x_T^j \right\|_F^2 = \left\| \left( Y - \sum_{j \ne k} d_j x_T^j \right) - d_k x_T^k \right\|_F^2 = \left\| E_k - d_k x_T^k \right\|_F^2$$

This decomposes $DX$ into the sum of $K$ rank-1 matrices, where $K - 1$ terms are fixed and the $k$-th term remains in question.

It would be tempting to use the SVD to find alternative $d_k$ and $x_T^k$: the SVD finds the closest rank-1 matrix that approximates $E_k$.

However, this minimization does not take sparsity into consideration.
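The rewriting of the objective and the unconstrained rank-1 fit can be checked numerically. In the sketch below, the helper names `residual_without_atom` and `best_rank1` are hypothetical, and the rows of `X` play the role of $x_T^j$.

```python
import numpy as np

def residual_without_atom(Y, D, X, k):
    """E_k = Y - sum_{j != k} d_j x_T^j, the error with atom k removed."""
    return Y - D @ X + np.outer(D[:, k], X[k, :])

def best_rank1(E):
    """Closest rank-1 matrix to E in Frobenius norm, returned as (d, x_row)."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    return U[:, 0], s[0] * Vt[0, :]
```

Applying `best_rank1` to the full `E_k` generally returns a coefficient row with no zero entries, which is the loss of sparsity noted above.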

Updating dictionary (cont’d)

Taking the restricted matrix $E_k^R$ (the columns of $E_k$ corresponding to the examples that currently use atom $d_k$), the SVD decomposes it as $E_k^R = U \Sigma V^\top$.

Define the solution for $\tilde{d}_k$ as the first column of $U$, and the coefficient vector $x_R^k$ as the first column of $V$ multiplied by $\sigma_1$.

In the K-SVD algorithm, one sweeps through the columns and always uses the most updated coefficients as they emerge from the SVD steps.

The K-SVD algorithm

Initialize: normalize the columns of the dictionary matrix $D^{(0)} \in \mathbb{R}^{n \times K}$

for $J = 1, 2, \ldots$ do

  Sparse coding: use any pursuit algorithm to compute the representation vector $x_i$ for each example $y_i$, by approximating the solution of, for $i = 1, \ldots, N$,

  $$\min_{x_i} \|y_i - D x_i\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0$$

  Codebook update: for each column $k = 1, \ldots, K$ in $D^{(J-1)}$ do

    Define the group of examples that use this atom, $\omega_k = \{ i \mid 1 \le i \le N,\ x_T^k(i) \ne 0 \}$

    Compute the overall representation error $E_k = Y - \sum_{j \ne k} d_j x_T^j$

    Restrict $E_k$ by choosing only the columns corresponding to $\omega_k$, and obtain $E_k^R$

    Apply the SVD $E_k^R = U \Sigma V^\top$. Choose the updated dictionary column $\tilde{d}_k$ to be the first column of $U$. Update the coefficient vector $x_R^k$ to be the first column of $V$ multiplied by $\sigma_1$

  end for

end for
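The algorithm above can be put together in a compact NumPy sketch. Several choices here are assumptions rather than part of the slides: the dictionary is initialized from randomly chosen training examples, a plain OMP routine is used as the pursuit method, a fixed number of outer iterations replaces a convergence test, and unused atoms are left unchanged.

```python
import numpy as np

def omp(D, y, T0, tol=1e-12):
    """Sparse-code y with at most T0 atoms using plain OMP."""
    support, coeffs, residual = [], np.zeros(0), y.copy()
    for _ in range(T0):
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = 0.0
        k = int(np.argmax(corr))
        if corr[k] <= tol:
            break
        support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x = np.zeros(D.shape[1])
    x[support] = coeffs
    return x

def ksvd(Y, K, T0, n_iter=10, seed=0):
    """K-SVD sketch. Y is n x N; returns D (n x K) and codes X (K x N)."""
    Y = np.array(Y, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialization: K random examples, normalized to unit l2 norm.
    D = Y[:, rng.choice(Y.shape[1], size=K, replace=False)]
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        # Sparse coding stage: one pursuit problem per example.
        X = np.column_stack([omp(D, Y[:, i], T0) for i in range(Y.shape[1])])
        # Codebook update stage: one atom at a time, using updated values.
        for k in range(K):
            omega = np.flatnonzero(X[k, :])   # examples that use atom k
            if omega.size == 0:
                continue                      # unused atom: left unchanged
            E_k = Y - D @ X + np.outer(D[:, k], X[k, :])
            U, s, Vt = np.linalg.svd(E_k[:, omega], full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, omega] = s[0] * Vt[0, :]
    return D, X
```

Restricting the SVD to the columns in `omega` is what keeps the support of each row of `X` from growing, so the constraint $\|x_i\|_0 \le T_0$ is preserved by the codebook update.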