Deep Belief Networks - Machine Learning and Pattern Recognition - Lecture Notes, Study notes of Machine Learning

Main points of this lecture are: Deep Belief Networks, Existing Methods, Algorithms, Draw Backs, Gaussian Noise, Conditional Distributions, Learning Stacks, Topic Space, Semantic Hashing

Typology: Study notes

2012/2013

Uploaded on 04/30/2013

bassu
bassu 🇮🇳

4.5

(42)

141 documents

1 / 23

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
DEEP BELIEF NETWORKS WITH APPLICATIONS
TO FAST DOCUMENT RETRIEVAL
1
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17

Partial preview of the text

Download Deep Belief Networks - Machine Learning and Pattern Recognition - Lecture Notes and more Study notes Machine Learning in PDF only on Docsity!

DEEP^ BELIEF^ N

ETWORKS WITH APPLICATIONSTO FAST DOCUMENT^ R

ETRIEVAL 1

Existing Methods

-^ One of the most popular and widely used in practicealgorithms for document retrieval tasks is TF-IDF. •^ TF-IDF weights each word by:^ ^ its frequency in the query document (Term Frequency)^ ^ the logarithm of the reciprocal of its frequency in the whole set ofdocuments (Inverse Document Frequency).^ However, TF-IDF has several limitations:^ ^ It computes document similarity directly in the word-count space,which may be slow for large vocabularies.^ ^ It assumes that the counts of different words provide independentevidence of similarity.^ ^ It makes no use of semantic similarities between words.

2

Drawbacks of Existing Methods • LSA is a linear method so it can only capture pairwisecorrelations between words. • Numerous methods, in particular probabilistic versions ofLSA were introduced in the machine learning community. • These models can be viewed as graphical models in which asingle layer of hidden topic variables have directedconnections to variables that represent word-counts. • There are limitations on the types of structure that can berepresented efficiently by a single layer of hidden variables. • We will build a network with multiple hidden layers and withmillions of parameters and show that it can discover latentrepresentations that work much better.

4

RBM’s revisited

-^ A joint configuration (

v,^ h) has an energy:∑∑ E(v, h) =^ −^ bv−^ bhii^ j^ i^ j

∑− vhW.j ijij^ i,j

-^ The probability that the model assigns to

v^ is: ∑ p(v) = p(v,^ h) =^ h ∑ 1 exp(−E(v,^ h)) Z^ h

5

RBM’s for count data • Hidden units remain binary and the visible word counts aremodeled by the Poisson model. • The energy is defined as:∑∑∑ E(v, h) = − bv−^ bh−^ ii^ jj^ i^ j^

∑vhW+ log^ vijij (^) i,j i^ !.i

-^ Conditional distributions over hidden and visible units are:^ p(h= 1|vj^

1 ∑) = 1 + exp(−b−^ j , Wv)iji i p(v=^ n|h) =^ Poissoni^

(∑^ n,^ exp (b+^ hWi^ j^ j

) )^ ,ij wherePoisson

n() λ−λn, λ= e. n!^7

Learning Stacks of RBM’s

30 W^4500 RBM 500 W^31000 RBM 1000 W^22000 RBM 2000 W^1 RBM

-^ Perform greedy, layer-by-layer learning:^ ^ Learn and Freeze

Wusing Poisson^1 Model. Treat the existing feature detectorsas if they were data. Learn and Freeze W.^2 Greedily learn many layers.

-^ Each layer of features captures stronghigh-order correlations between the activitiesof units in the layer below.

8

20 newsgroup: 2-D topic space^ Autoencoder 2−D Topic Space^ talk.religion.miscsci.cryptographycomp.graphics

rec.sport.hockey misc.forsale talk.politics.mideast

LSA 2−D Topic Space

-^ The 20 newsgroup corpus contains 18,845 postings (11,314training and 7,531 test) taken from the Usenet newsgroups. •^ We use a 2000-500-250-125-10 autoencoder to convert adocument into a low-dimensional code. •^ We used a simple “bag-of-words” representation.

10

Reuters Corpus: 2-D topic space Autoencoder 2−D Topic SpaceEuropean CommunityMonetary/EconomicInterbank MarketsEnergy MarketsDisasters andAccidents Leading Ecnomic^ Legal/JudicialIndicatorsGovernmentBorrowingsAccounts/Earnings

LSA 2−D Topic Space

-^ We use a 2000-500-250-125-2 autoencoder to convert testdocuments into a two-dimensional code. •^ The Reuters Corpus Volume II contains 804,414 newswirestories (randomly split into

402,207^ training and

402,207^ test).

-^ We used a simple “bag-of-words” representation.

11

Semantic Hashing

W^ +ε W^ +ε^ W^ +ε W W

WW W W W W W

WWW

2000 500 500 500 20002000 (^2000500500) Gaussian^3 3 Noise^5002 25001 (^500 )

11 22 33^ Code Layer 20 3 2 5001 Fine−tuning

6 5 4 Code Layer 20 Unrolling RBM 20 3 RBM 500 500 RBM (^500) Bag of WordsRecursive Pretraining TT TT T^ T

-^ Learn to map documents into

semantic^ 20-D binary code and use these codes as memory addresses. • We have the ultimate retrieval tool: Given a query document,compute its 20-bit address and retrieve all of the documentsstored at the similar addresses

with no search at all

. 13

The Main Idea of Semantic Hashing

SemanticallySimilarDocuments

Memory f Document^14

Semantic Hashing Reuters 2−D Embedding of 20−bit codesEuropean CommunityMonetary/EconomicDisasters andAccidents GovernmentBorrowingEnergy Markets Accounts/Earnings^ 0.1^ 0.2^ 0.

TF−IDF 50 TF−IDF using 20 bits Locality Sensitive Hashing (^4030) Precision (%) 20 10 0 0.8 1.6 3.2 6.4^ 12.8^ 25.6^ 51.2^100 Recall (%)

-^ We used a simple C implementation on Reuters dataset(402,212 training and 402,212 test documents). •^ For a given query, it takes about 0.5 milliseconds to create ashort-list of about 3,000 semantically similar documents. •^ It then takes 10 milliseconds to retrieve the top few matchesfrom that short-list using TF-IDF, and it is more accurate thanfull TF-IDF.

16

Learning nonlinear embedding • Learning a similarity measure over the input space

X.

-^ Given a distance metric

D^ (e.g. Euclidean) we can measure similarity between two input vectors

nk^ x,^ x∈^ X^ by ncomputing D[f (x| kW ), f (x|W^ )].

-^ “Push-Pull” Idea: Pull points belonging to the same classtogether. Push points belonging to the different classes apart.

d[f(x1),f(x2)] 1010 WW (^33500500) WW (^22500500) WW (^1120002000) x1 x2^17

Learning Nonlinear NCA • Probability that point^ n^ belongs to class

a^ is: ∑n (^) p(c= a) = k^ k:c=a p.nk

-^ Maximize the expected number of correctly classified pointson the training data:^ O

∑^ ∑N (^1) = (^) NCA n=1^ k:c N p.nk nk=c

-^ By considering a linear perceptron we arrive at linear NCA.

19

2-D codes 0

1 2

3 7 9 8 4 6 5

atheismreligion.christiansci.cryptographysci.spacerec.hokeyrec.autoscomp.windowscomp.hardware 20