















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Main points of this lecture are: Deep Belief Networks, Existing Methods, Algorithms, Draw Backs, Gaussian Noise, Conditional Distributions, Learning Stacks, Topic Space, Semantic Hashing
Typology: Study notes
1 / 23
This page cannot be seen from the preview
Don't miss anything!
















-^ One of the most popular and widely used in practicealgorithms for document retrieval tasks is TF-IDF. •^ TF-IDF weights each word by:^ –^ its frequency in the query document (Term Frequency)^ –^ the logarithm of the reciprocal of its frequency in the whole set ofdocuments (Inverse Document Frequency).^ However, TF-IDF has several limitations:^ –^ It computes document similarity directly in the word-count space,which may be slow for large vocabularies.^ –^ It assumes that the counts of different words provide independentevidence of similarity.^ –^ It makes no use of semantic similarities between words.
2
4
-^ A joint configuration (
v,^ h) has an energy:∑∑ E(v, h) =^ −^ bv−^ bhii^ j^ i^ j
∑− vhW.j ijij^ i,j
-^ The probability that the model assigns to
v^ is: ∑ p(v) = p(v,^ h) =^ h ∑ 1 exp(−E(v,^ h)) Z^ h
5
∑vhW+ log^ vijij (^) i,j i^ !.i
-^ Conditional distributions over hidden and visible units are:^ p(h= 1|vj^
1 ∑) = 1 + exp(−b−^ j , Wv)iji i p(v=^ n|h) =^ Poissoni^
(∑^ n,^ exp (b+^ hWi^ j^ j
) )^ ,ij wherePoisson
n() λ−λn, λ= e. n!^7
30 W^4500 RBM 500 W^31000 RBM 1000 W^22000 RBM 2000 W^1 RBM
-^ Perform greedy, layer-by-layer learning:^ –^ Learn and Freeze
Wusing Poisson^1 Model. – Treat the existing feature detectorsas if they were data. – Learn and Freeze W.^2 – Greedily learn many layers.
-^ Each layer of features captures stronghigh-order correlations between the activitiesof units in the layer below.
8
rec.sport.hockey misc.forsale talk.politics.mideast
LSA 2−D Topic Space
-^ The 20 newsgroup corpus contains 18,845 postings (11,314training and 7,531 test) taken from the Usenet newsgroups. •^ We use a 2000-500-250-125-10 autoencoder to convert adocument into a low-dimensional code. •^ We used a simple “bag-of-words” representation.
10
LSA 2−D Topic Space
-^ We use a 2000-500-250-125-2 autoencoder to convert testdocuments into a two-dimensional code. •^ The Reuters Corpus Volume II contains 804,414 newswirestories (randomly split into
402,207^ training and
402,207^ test).
-^ We used a simple “bag-of-words” representation.
11
W^ +ε W^ +ε^ W^ +ε W W
WW W W W W W
+ε W +ε W +ε W
2000 500 500 500 20002000 (^2000500500) Gaussian^3 3 Noise^5002 25001 (^500 )
11 22 33^ Code Layer 20 3 2 5001 Fine−tuning
6 5 4 Code Layer 20 Unrolling RBM 20 3 RBM 500 500 RBM (^500) Bag of WordsRecursive Pretraining TT TT T^ T
-^ Learn to map documents into
semantic^ 20-D binary code and use these codes as memory addresses. • We have the ultimate retrieval tool: Given a query document,compute its 20-bit address and retrieve all of the documentsstored at the similar addresses
with no search at all
. 13
SemanticallySimilarDocuments
TF−IDF 50 TF−IDF using 20 bits Locality Sensitive Hashing (^4030) Precision (%) 20 10 0 0.8 1.6 3.2 6.4^ 12.8^ 25.6^ 51.2^100 Recall (%)
-^ We used a simple C implementation on Reuters dataset(402,212 training and 402,212 test documents). •^ For a given query, it takes about 0.5 milliseconds to create ashort-list of about 3,000 semantically similar documents. •^ It then takes 10 milliseconds to retrieve the top few matchesfrom that short-list using TF-IDF, and it is more accurate thanfull TF-IDF.
16
-^ Given a distance metric
D^ (e.g. Euclidean) we can measure similarity between two input vectors
nk^ x,^ x∈^ X^ by ncomputing D[f (x| kW ), f (x|W^ )].
-^ “Push-Pull” Idea: Pull points belonging to the same classtogether. Push points belonging to the different classes apart.
d[f(x1),f(x2)] 1010 WW (^33500500) WW (^22500500) WW (^1120002000) x1 x2^17
a^ is: ∑n (^) p(c= a) = k^ k:c=a p.nk
-^ Maximize the expected number of correctly classified pointson the training data:^ O
∑^ ∑N (^1) = (^) NCA n=1^ k:c N p.nk nk=c
-^ By considering a linear perceptron we arrive at linear NCA.
19
1 2
3 7 9 8 4 6 5
atheismreligion.christiansci.cryptographysci.spacerec.hokeyrec.autoscomp.windowscomp.hardware 20