Independent Components Analysis - Lectures Notes - 12, Study notes of Machine Learning

Stanford University Machine Learning

Artificial Intelligence. Lectures Notes of Machine Learning. Prof. Andrew Ng - Stanford University - Contents: Independent Components Analysis

Typology: Study notes

2010/2011

Uploaded on 10/30/2011


CS229 Lecture notes

Andrew Ng

Part XII

Independent Components Analysis


Our next topic is Independent Components Analysis (ICA). Similar to PCA, this will find a new basis in which to represent our data. However, the goal is very different.

As a motivating example, consider the “cocktail party problem.” Here, $n$ speakers are speaking simultaneously at a party, and any microphone placed in the room records only an overlapping combination of the $n$ speakers’ voices. But let’s say we have $n$ different microphones placed in the room, and because each microphone is a different distance from each of the speakers, it records a different combination of the speakers’ voices. Using these microphone recordings, can we separate out the original $n$ speakers’ speech signals?

To formalize this problem, we imagine that there is some data $s \in \mathbb{R}^n$ that is generated via $n$ independent sources. What we observe is

$$x = As,$$

where $A$ is an unknown square matrix called the mixing matrix. Repeated observations give us a dataset $\{x^{(i)}; i = 1, \ldots, m\}$, and our goal is to recover the sources $s^{(i)}$ that had generated our data ($x^{(i)} = As^{(i)}$).

In our cocktail party problem, $s^{(i)}$ is an $n$-dimensional vector, and $s_j^{(i)}$ is the sound that speaker $j$ was uttering at time $i$. Also, $x^{(i)}$ is an $n$-dimensional vector, and $x_j^{(i)}$ is the acoustic reading recorded by microphone $j$ at time $i$.

Let $W = A^{-1}$ be the unmixing matrix. Our goal is to find $W$, so that given our microphone recordings $x^{(i)}$, we can recover the sources by computing $s^{(i)} = Wx^{(i)}$. For notational convenience, we also let $w_i^T$ denote the $i$-th row of $W$, so that

$$W = \begin{bmatrix} w_1^T \\ \vdots \\ w_n^T \end{bmatrix}.$$

Thus, $w_i \in \mathbb{R}^n$, and the $j$-th source can be recovered by computing $s_j^{(i)} = w_j^T x^{(i)}$.
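The mixing model and its inversion can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the mixing matrix `A`, the Laplace-distributed sources, and the sizes `n` and `m` are made-up values, and `A` is treated as known here, which is exactly what is not available in the real problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                        # n sources/microphones, m time steps (illustrative)

# Column i of S is s^(i). Laplace is an arbitrary non-Gaussian choice.
S = rng.laplace(size=(n, m))

# Hypothetical mixing matrix; strictly diagonally dominant, hence invertible.
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.6, 0.1, 1.0]])

X = A @ S                          # column i of X is x^(i) = A s^(i)

W = np.linalg.inv(A)               # unmixing matrix W = A^{-1}
S_rec = W @ X                      # s^(i) = W x^(i) for every i at once

# Row j of W is w_j^T, so a single source value is s_j^(i) = w_j^T x^(i).
j, i = 1, 2
s_ji = W[j] @ X[:, i]
```

The rest of the notes deal with the harder case in which `A` is unknown and `W` has to be estimated from `X` alone.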

1 ICA ambiguities

To what degree can $W = A^{-1}$ be recovered? If we have no prior knowledge about the sources and the mixing matrix, it is not hard to see that there are some inherent ambiguities in $A$ that cannot be resolved given only the $x^{(i)}$’s.

Specifically, let $P$ be any $n$-by-$n$ permutation matrix. This means that each row and each column of $P$ has exactly one “1.” Here are some examples of permutation matrices:

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}; \quad P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}; \quad P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
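To make the definition of a permutation matrix concrete, here is a small sketch. The particular matrix `P` and vector `z` are illustrative choices, not values taken from the notes.

```python
import numpy as np

# An illustrative 3-by-3 permutation matrix: it swaps the first two coordinates.
P = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
z = np.array([10.0, 20.0, 30.0])

# Each row and each column contains exactly one 1 (all other entries are 0).
rows_ok = bool(np.all(P.sum(axis=1) == 1))
cols_ok = bool(np.all(P.sum(axis=0) == 1))

# Multiplying by P reorders the coordinates of z without changing their values.
Pz = P @ z
```

Here `Pz` holds the same three numbers as `z`, with the first two swapped.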

If $z$ is a vector, then $Pz$ is another vector that contains a permuted version of $z$’s coordinates. Given only the $x^{(i)}$’s, there will be no way to distinguish between $W$ and $PW$. Specifically, the permutation of the original sources is ambiguous, which should be no surprise. Fortunately, this does not matter for most applications.

Further, there is no way to recover the correct scaling of the $w_i$’s. For instance, if $A$ were replaced with $2A$, and every $s^{(i)}$ were replaced with $(0.5)s^{(i)}$, then our observed $x^{(i)} = 2A \cdot (0.5)s^{(i)}$ would still be the same. More broadly, if a single column of $A$ were scaled by a factor of $\alpha$, and the corresponding source were scaled by a factor of $1/\alpha$, then there is again no way, given only the $x^{(i)}$’s, to determine that this had happened. Thus, we cannot recover the “correct” scaling of the sources. However, for the applications that we are concerned with, including the cocktail party problem, this ambiguity also does not matter. Specifically, scaling a speaker’s speech signal $s_j^{(i)}$ by some positive factor $\alpha$ affects only the volume of that speaker’s speech. Also, sign changes do not matter, and $s_j^{(i)}$ and $-s_j^{(i)}$ sound identical when played on a speaker.

Thus, if the $w_i$ found by an algorithm is scaled by any non-zero real [...]
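The permutation and scaling ambiguities described above can be checked numerically. The following is a minimal sketch; the matrix `A`, the source vector `s`, and the factor `alpha` are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.3, 1.0]])          # illustrative mixing matrix
s = np.array([2.0, -1.0])           # illustrative source vector
x = A @ s                           # what the microphones observe

# Scaling ambiguity: scale column 0 of A by alpha and source 0 by 1/alpha.
alpha = 4.0
A_scaled = A.copy()
A_scaled[:, 0] *= alpha
s_scaled = s.copy()
s_scaled[0] /= alpha
x_scaled = A_scaled @ s_scaled

# Permutation ambiguity: reorder the sources with P and the columns of A
# with P^T (for a permutation matrix, P^{-1} = P^T).
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x_perm = (A @ P.T) @ (P @ s)
```

Both `x_scaled` and `x_perm` equal `x`, so no procedure that sees only the observations can tell these alternative explanations apart.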

$A = 2$, so that $x = 2s$. Clearly, $x$ is distributed uniformly in the interval $[0, 2]$. Thus, its density is given by $p_x(x) = (0.5)\,1\{0 \le x \le 2\}$. This does not equal $p_s(Wx)$, where $W = 0.5 = A^{-1}$. Instead, the correct formula is $p_x(x) = p_s(Wx)|W|$.

More generally, if $s$ is a vector-valued random variable with density $p_s$, and $x = As$ for a square, invertible matrix $A$, then the density of $x$ is given by

$$p_x(x) = p_s(Wx) \cdot |W|,$$

where $W = A^{-1}$.
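The change-of-variables formula can be sketched directly in code. This example assumes that $s$ is uniform on the unit hypercube, which is what the stated density of $x$ implies in the one-dimensional case; the function names `p_s` and `p_x` are hypothetical helpers, and $|W|$ is computed as the absolute value of the determinant.

```python
import numpy as np

def p_s(s):
    """Density of s uniform on the unit hypercube [0, 1]^n (an assumption)."""
    s = np.atleast_1d(s)
    return float(np.all((s >= 0.0) & (s <= 1.0)))

def p_x(x, A):
    """Density of x = A s, computed as p_s(W x) * |det W| with W = A^{-1}."""
    A = np.atleast_2d(A)
    W = np.linalg.inv(A)
    return p_s(W @ np.atleast_1d(x)) * abs(np.linalg.det(W))

# One-dimensional example from the text: A = 2, so x is uniform on [0, 2].
inside = p_x(1.0, 2.0)     # a point inside [0, 2]
outside = p_x(3.0, 2.0)    # a point outside [0, 2]

# Two-dimensional example: A = 2I stretches the unit square to [0, 2]^2.
inside_2d = p_x(np.array([1.0, 1.5]), 2.0 * np.eye(2))
```

In the one-dimensional case the density inside the interval is 0.5, matching $(0.5)\,1\{0 \le x \le 2\}$; without the $|W|$ factor it would incorrectly be 1.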

Remark. If you’ve seen the result that $A$ maps $[0, 1]^n$ to a set of volume $|A|$, then here’s another way to remember the formula for $p_x$ given above, which also generalizes our previous 1-dimensional example. Specifically, let $A \in \mathbb{R}^{n \times n}$ be given, and let $W = A^{-1}$ as usual. Also let $C_1 = [0, 1]^n$ be the $n$-dimensional hypercube, and define $C_2 = \{As : s \in C_1\} \subseteq \mathbb{R}^n$ to be the image of $C_1$ under the mapping given by $A$. Then it is a standard result in linear algebra (and, indeed, one of the ways of defining determinants) that the volume of $C_2$ is given by $|A|$. Now, suppose $s$ is uniformly distributed in $[0, 1]^n$, so its density is $p_s(s) = 1\{s \in C_1\}$. Then clearly $x$ will be uniformly distributed in $C_2$. Its density is therefore found to be $p_x(x) = 1\{x \in C_2\}/\mathrm{vol}(C_2)$ (since it must integrate over $C_2$ to 1). But using the fact that the determinant of the inverse of a matrix is just the inverse of the determinant, we have $1/\mathrm{vol}(C_2) = 1/|A| = |A^{-1}| = |W|$. Thus, $p_x(x) = 1\{x \in C_2\}|W| = 1\{Wx \in C_1\}|W| = p_s(Wx)|W|$.
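The volume claim in the remark can be sanity-checked with a rough Monte Carlo estimate. This is a sketch under assumptions: the 2-by-2 matrix `A`, the sample count, and the random seed are arbitrary, and membership in $C_2$ is tested through the equivalent condition $Wx \in C_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])           # illustrative matrix, det(A) = 2.5
W = np.linalg.inv(A)

# Bounding box of C2 = {A s : s in [0, 1]^2}, from the images of the corners.
corners = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float) @ A.T
lo, hi = corners.min(axis=0), corners.max(axis=0)

# Sample uniformly in the box; x lies in C2 exactly when W x lies in C1.
pts = rng.uniform(lo, hi, size=(200_000, 2))
back = pts @ W.T
in_C2 = np.all((back >= 0.0) & (back <= 1.0), axis=1)

vol_est = in_C2.mean() * np.prod(hi - lo)   # estimated area of C2
vol_det = abs(np.linalg.det(A))             # area predicted by the determinant
```

The estimate `vol_est` should land close to `vol_det`, and `1 / vol_det` equals `abs(np.linalg.det(W))`, which is the factor $|W|$ in the density formula.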

3 ICA algorithm

We are now ready to derive an ICA algorithm. The algorithm we describe is due to Bell and Sejnowski, and the interpretation we give will be of their algorithm as a method for maximum likelihood estimation. (This is different from their original interpretation, which involved a complicated idea called the infomax principle, which is no longer necessary in the derivation given the modern understanding of ICA.)

We suppose that the distribution of each source $s_i$ is given by a density $p_s$, and that the joint distribution of the sources $s$ is given by

$$p(s) = \prod_{i=1}^{n} p_s(s_i).$$

Note that by modeling the joint distribution as a product of the marginals, we capture the assumption that the sources are independent. Using our formulas from the previous section, this implies the following density on $x = As = W^{-1}s$:

$$p(x) = \prod_{i=1}^{n} p_s(w_i^T x) \cdot |W|.$$

All that remains is to specify a density for the individual sources $p_s$.

Recall that, given a real-valued random variable $z$, its cumulative distribution function (cdf) $F$ is defined by $F(z_0) = P(z \le z_0) = \int_{-\infty}^{z_0} p_z(z)\,dz$. Also, the density of $z$ can be found from the cdf by taking its derivative: $p_z(z) = F'(z)$.

Thus, to specify a density for the $s_i$’s, all we need to do is to specify some cdf for it. A cdf has to be a monotonic function that increases from zero to one. Following our previous discussion, we cannot choose the cdf to be the cdf of the Gaussian, as ICA doesn’t work on Gaussian data. What we’ll choose instead for the cdf, as a reasonable “default” function that slowly increases from 0 to 1, is the sigmoid function $g(s) = 1/(1 + e^{-s})$. Hence, $p_s(s) = g'(s)$.¹

The square matrix $W$ is the parameter in our model. Given a training set $\{x^{(i)}; i = 1, \ldots, m\}$, the log likelihood is given by

$$\ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'(w_j^T x^{(i)}) + \log |W| \right).$$
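The log likelihood above translates almost line by line into NumPy. The sketch below makes a few assumptions that are not in the notes: the data is stored as an `(m, n)` array `X` with one example per row, $|W|$ is taken to be the absolute value of the determinant, and $\log g'$ is evaluated through the identity $\log g'(z) = -|z| - 2\log(1 + e^{-|z|})$, which follows from $g'(z) = g(z)(1 - g(z))$ and avoids overflow for large $|z|$.

```python
import numpy as np

def log_sigmoid_prime(z):
    """log g'(z) for the sigmoid g, where g'(z) = g(z) * (1 - g(z))."""
    a = np.abs(z)
    return -a - 2.0 * np.log1p(np.exp(-a))

def log_likelihood(W, X):
    """l(W) for data X of shape (m, n), where row i of X is x^(i)."""
    Z = X @ W.T                              # Z[i, j] = w_j^T x^(i)
    _, logabsdet = np.linalg.slogdet(W)      # log |det W|
    m = X.shape[0]
    return np.sum(log_sigmoid_prime(Z)) + m * logabsdet

# Tiny usage example with made-up data.
X_demo = np.zeros((3, 2))
ll_demo = log_likelihood(np.eye(2), X_demo)
```

With `W` equal to the identity and all-zero data, every term is $\log g'(0) = \log 0.25$ and $\log|W| = 0$, so `ll_demo` is $6 \log 0.25$.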

We would like to maximize the log likelihood in terms of $W$. By taking derivatives and using the fact (from the first set of notes) that $\nabla_W |W| = |W|(W^{-1})^T$, we easily derive a stochastic gradient ascent learning rule. For a training example $x^{(i)}$, the update rule is:

$$W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{bmatrix} {x^{(i)}}^T + (W^T)^{-1} \right),$$

where $\alpha$ is the learning rate.
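The update rule can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the identity initialization, the default learning rate, the number of epochs, and the random visiting order are all assumptions introduced here, and the data is assumed to have zero mean.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd_step(W, x, alpha):
    """One stochastic gradient ascent step on a single example x of shape (n,)."""
    g = sigmoid(W @ x)                                    # g(w_j^T x) for each row j
    # Row j of the outer product is (1 - 2 g(w_j^T x)) x^T; the second term
    # is the gradient of log |W|.
    grad = np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T)
    return W + alpha * grad

def fit_ica(X, alpha=0.01, epochs=20, seed=0):
    """Run the update over X (shape (m, n), zero-mean rows) and return W."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)                                         # assumed initialization
    for _ in range(epochs):
        for i in rng.permutation(m):                      # visit examples in random order
            W = ica_sgd_step(W, X[i], alpha)
    return W
```

Given an estimate `W = fit_ica(X)`, the sources would be recovered as `X @ W.T`, up to the permutation and scaling ambiguities discussed earlier. In practice the learning rate typically needs tuning for the data at hand.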

¹ If you have prior knowledge that the sources’ densities take a certain form, then it is a good idea to substitute that in here. But in the absence of such knowledge, the sigmoid function can be thought of as a reasonable default that seems to work well for many problems. Also, the presentation here assumes that either the data $x^{(i)}$ has been preprocessed to have zero mean, or that it can naturally be expected to have zero mean (such as acoustic signals). This is necessary because our assumption that $p_s(s) = g'(s)$ implies $E[s] = 0$ (the derivative of the logistic function is a symmetric function, and hence gives a density corresponding to a random variable with zero mean), which implies $E[x] = E[As] = 0$.
