

















































An overview of the PageRank and HITS algorithms used for information retrieval in web search. The PageRank algorithm calculates the importance of web pages from the link structure of the web, while HITS identifies good hubs and authorities. Both algorithms have been modified over the years to address various issues. This document also includes references to relevant research papers.
Typology: Exams


















































Topics
- What is PageRank
- PageRank algorithm [Brin and Page 2002]
- Topic-sensitive PageRank [Richardson et al. 2002, Haveliwala 2002]
- HITS [Kleinberg 2002]
- Other variations
- Applications

PageRank intuition
- How can we tell a page's importance?
- PageRank relies on the link structure of the web to indicate a page's value.
- A hyperlink from page A to page B counts as a vote for page B by page A.
- A page is "important" if:
  - There are many pages pointing to it.
  - These pages are themselves considered valuable.

Example
[Figure: example PageRank graph, taken from http://en.wikipedia.org/wiki/PageRank]

Links as votes (1)
- A link from page B to page A denotes endorsement:
  - Page B considers page A an authority on a subject.
- An authority value, or PageRank (PR) value, is assigned to every page (document).

Links as votes (2)
- The PR conferred by an outgoing link is the document's own PR divided by the number of its outgoing links:
  $$PR(A) = \frac{PR(t_1)}{O(t_1)} + \dots + \frac{PR(t_n)}{O(t_n)}$$
  - $t_i$: a page linking to A
  - $O(t_i)$: number of links from $t_i$
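
The voting rule above can be sketched in a few lines of Python. This is only an illustration: the three-page graph, the starting PR values, and the helper name `pagerank_of` are assumptions made for the example, not part of the slides.

```python
def pagerank_of(page, links, pr):
    """Sum PR(t) / O(t) over every page t that links to `page`.

    links: dict mapping each page to the list of pages it links to
           (hypothetical data structure chosen for this sketch).
    pr:    dict mapping each page to its current PR value.
    """
    return sum(pr[t] / len(out) for t, out in links.items() if page in out)


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {"A": 0.4, "B": 0.2, "C": 0.4}

# C is linked from A (2 out-links) and B (1 out-link):
# PR(C) = PR(A)/2 + PR(B)/1
value_c = pagerank_of("C", links, pr)
value_b = pagerank_of("B", links, pr)
```

Each linking page splits its own PR evenly over its out-links, so a page with many out-links passes on less per link.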

Matrix notation
Hyperlink matrix $H = [H_{ij}]$, in which the entry in the $i$-th row and $j$-th column is:
$$H_{ij} = \begin{cases} 1/O_i & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$
where $E$ is the set of links and $O_i$ is the number of links from page $i$.

Example for a graph with three pages A, B, C:
$$H = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
Stochastic matrix: all entries $\ge 0$; each row sums to 1.

Let $PR = (PR(1), \dots, PR(n))^T$. We can then write the equations as:
$$PR = H^T PR$$
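
As a minimal sketch of the definition of $H$, the following Python builds the hyperlink matrix from an edge list. The function name `hyperlink_matrix` and the 0-based page numbering (A=0, B=1, C=2) are assumptions for illustration, and the sketch assumes the edge list contains no duplicate links.

```python
from fractions import Fraction


def hyperlink_matrix(n, edges):
    """Build H with H[i][j] = 1/O_i if (i, j) is a link, else 0.

    n:     number of pages, numbered 0..n-1 (assumed numbering).
    edges: list of (i, j) pairs meaning "page i links to page j".
    """
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    H = [[Fraction(0)] * n for _ in range(n)]
    for i, j in edges:
        H[i][j] = Fraction(1, out_degree[i])
    return H


# Three-page example: A -> B, A -> C, B -> C, C -> A.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
H = hyperlink_matrix(3, edges)

# Row-stochastic check: every row sums to 1 when every page has an out-link.
row_sums = [sum(row) for row in H]
```

Exact fractions are used here so that the row sums can be compared with 1 without rounding error; a page with no out-links would produce an all-zero row, which is not stochastic.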

Solve for PageRank
A well-known mathematical technique called power iteration can be used to find PR:
$$PR^{k+1} = H^T PR^k$$
for the example matrix
$$H = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
[Table: successive iterates $PR^0$, $PR^1$, $PR^2$, $PR^3$, ..., $PR^k$ for the example graph]

The solution PR is an eigenvector of $H^T$ with the corresponding eigenvalue 1. If some conditions are satisfied, 1 is the largest eigenvalue and the PageRank vector PR is the principal eigenvector.

Problem: the Web graph does not meet the conditions (later).
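
Power iteration on the three-page example can be sketched as follows. The uniform starting vector, the tolerance, and the iteration cap are assumptions chosen for this illustration; the slides do not specify them.

```python
def power_iteration(H, tol=1e-10, max_iter=1000):
    """Iterate PR^{k+1} = H^T PR^k until successive iterates agree.

    H is a row-stochastic matrix given as a list of rows.
    Starts from the uniform vector (an assumption for this sketch).
    """
    n = len(H)
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        # (H^T pr)[j] = sum over i of H[i][j] * pr[i]
        nxt = [sum(H[i][j] * pr[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pr)) < tol:
            return nxt
        pr = nxt
    return pr


# Example matrix: A -> B, A -> C, B -> C, C -> A.
H = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
pr = power_iteration(H)
```

For this particular graph the iteration settles near (0.4, 0.2, 0.4): pages A and C end up with equal rank, and B receives half of A's.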

Convergence
- Markov Chain Theorem: A finite Markov chain defined by the stochastic matrix $H$ has a unique stationary probability distribution if $H$ is irreducible and aperiodic. After a number of transitions, $PR^k$ converges to a steady-state probability vector.
- When the steady state is reached, $PR^k = PR^{k+1} = PR$, so as $k \to \infty$ the iteration $PR^{k+1} = H^T PR^k$ becomes $PR = H^T PR$.
- $PR$ is the principal eigenvector of $H^T$ with eigenvalue 1.
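
As a worked check of the steady-state condition $PR = H^T PR$, take the three-page example matrix $H$ with rows $(0, 1/2, 1/2)$, $(0, 0, 1)$, $(1, 0, 0)$. The candidate vector $(2/5, 1/5, 2/5)^T$ is computed here for illustration and is not given in the slides.

```latex
H^T \, PR =
\begin{bmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 2/5 \\ 1/5 \\ 2/5 \end{bmatrix}
=
\begin{bmatrix} 2/5 \\ \tfrac{1}{2}\cdot\tfrac{2}{5} \\ \tfrac{1}{2}\cdot\tfrac{2}{5} + \tfrac{1}{5} \end{bmatrix}
=
\begin{bmatrix} 2/5 \\ 1/5 \\ 2/5 \end{bmatrix}
= PR
```

The entries are non-negative and sum to 1, so this vector is a stationary probability distribution with eigenvalue 1.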

Web graph
Questions:
- Is $H$ a stochastic matrix?
- Is $H$ irreducible?
- Is $H$ aperiodic?
- If the above conditions are not satisfied, how can the simplified equation be extended to the actual PageRank formula?

Irreducible:
- States $v_i$ and $v_j$ communicate in the directed graph if a path exists from each to the other.
- Irreducible Markov chain: all states communicate.

Aperiodic:
- State $v_i$ is periodic if every return to $v_i$ takes a number of steps that is a multiple of some integer greater than 1.
- Aperiodic Markov chain: none of the states is periodic.

[Figures: two example directed graphs on states $v_0$, $v_1$, $v_2$, $v_3$, $v_4$]
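
The two conditions can be checked on a small graph with breadth-first search. This is a sketch under stated assumptions: the adjacency-dict representation and the function names are hypothetical, and the period computation uses the standard result that, for an irreducible chain, the period equals the gcd of `dist[u] + 1 - dist[v]` over all edges `(u, v)`, where `dist` holds BFS distances from any fixed start state.

```python
from collections import deque
from math import gcd


def bfs_levels(adj, start):
    """Return BFS distances from `start` to every reachable state."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_irreducible(adj):
    """True if every state can reach every other state."""
    nodes = list(adj)
    rev = {u: [] for u in nodes}
    for u in nodes:
        for v in adj[u]:
            rev[v].append(u)
    start = nodes[0]
    # Strongly connected iff start reaches all states and all reach start.
    return (len(bfs_levels(adj, start)) == len(nodes)
            and len(bfs_levels(rev, start)) == len(nodes))


def period(adj):
    """Period of an irreducible chain; 1 means aperiodic."""
    start = next(iter(adj))
    dist = bfs_levels(adj, start)
    g = 0
    for u in adj:
        for v in adj[u]:
            g = gcd(g, dist[u] + 1 - dist[v])
    return g


# Hypothetical examples: every state must appear as a key.
three_page = {0: [1, 2], 1: [2], 2: [0]}      # cycles of length 2 and 3
ring = {0: [1], 1: [2], 2: [3], 3: [0]}       # single cycle of length 4
dangling = {0: [1], 1: []}                    # state 1 has no out-link
```

The three-page graph is irreducible with period 1, the ring is irreducible but has period 4, and the graph with a dangling state is not irreducible.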

Final PageRank equation (1)
So at a page, the random surfer has two options:
- With probability $d$, randomly chooses an out-link to follow.
- With probability $1-d$, jumps to a random page.

The equation below gives the improved model:
$$PR = \left( (1-d)\,\frac{E}{n} + d\,H^T \right) PR$$
where $E = ee^T$ ($e$ is a column vector of all 1's), so $E$ is an $n \times n$ square matrix of all 1's.
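
The improved model can be sketched as a damped power iteration. The value d = 0.85, the uniform starting vector, and the stopping tolerance are assumptions for this illustration; the slides do not fix them. Because $E = ee^T$, the product $E \cdot PR$ is the vector whose every entry is the sum of the entries of $PR$, so the full $n \times n$ matrix never needs to be built.

```python
def pagerank(H, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate PR <- ((1 - d) * E / n + d * H^T) PR.

    H: row-stochastic hyperlink matrix as a list of rows.
    d: probability of following an out-link (0.85 is an assumed value).
    """
    n = len(H)
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        # (E pr)[j] = sum(pr) for every j, so the jump term is constant.
        jump = (1.0 - d) * sum(pr) / n
        nxt = [jump + d * sum(H[i][j] * pr[i] for i in range(n))
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pr)) < tol:
            return nxt
        pr = nxt
    return pr


# Three-page example: A -> B, A -> C, B -> C, C -> A.
H = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
pr = pagerank(H)
```

With the random jump included, every page receives at least $(1-d)/n$ of the total rank, which makes the effective transition matrix irreducible and aperiodic regardless of the link structure.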