Understanding PageRank and HITS Algorithms for Information Retrieval, Exams of Computer Science

An overview of the PageRank and HITS algorithms used for information retrieval in web search. PageRank calculates the importance of web pages from the link structure of the web, while HITS identifies good hubs and authorities. Both algorithms have been modified over the years to address various issues. This document also includes references to relevant research papers.

Uploaded on 07/30/2009

Node Ranking
Lidan Wang and Brad Skaggs
CMSC 828G

Topics
• What is PageRank
• PageRank algorithm [Brin and Page 2002]
• Topic-sensitive PageRank [Richardson et al. 2002, Haveliwala 2002]
• HITS [Kleinberg 2002]
• Other variations
• Applications

PageRank intuition
• How can we tell a page's importance?
• PageRank relies on the link structure of the web to indicate a page's value.
• A hyperlink from page A to page B is treated as a vote for page B by page A.
• A page is "important" if:
  - There are many pages pointing to it.
  - The pages pointing to it are themselves considered valuable.

Example

[Figure: example PageRank graph, taken from http://en.wikipedia.org/wiki/PageRank]

Links as votes (1)
• A link from page B to page A denotes endorsement:
  - Page B considers page A an authority on a subject.
• An authority value, or PageRank (PR) value, is assigned to every page (document).

[Figure: graph on pages A, B, C, D; the labels read PR(A) = 0.75 and 0.25 (near C).]

Links as votes (2)
• The PR conferred by an outgoing link is the document's own PR divided by its number of outgoing links:

\[ PR(A) = \frac{PR(B)}{2} + PR(C) + \frac{PR(D)}{O(D)} \]

[Figure: graph on pages A, B, C, D; the links into A carry PR(B)/2, PR(C) and D's share PR(D)/O(D).]

• In general:

\[ PR(A) = \frac{PR(t_1)}{O(t_1)} + \dots + \frac{PR(t_n)}{O(t_n)} \]

  where $t_i$ is a page linking to A and $O(t_i)$ is the number of links from $t_i$.
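As a rough illustration of the voting rule above, the sketch below sums PR(t)/O(t) over the pages t that link to a target page. The function name `received_rank`, the dictionary layout, the sample graph and the PR values are all assumptions made for this example; they are not taken from the slides.

```python
def received_rank(target, out_links, pr):
    """Sum PR(t)/O(t) over every page t that links to `target`."""
    total = 0.0
    for page, targets in out_links.items():
        if target in targets:
            # Each out-link of `page` carries an equal share of its PR.
            total += pr[page] / len(targets)
    return total


# Hypothetical graph: B links to A and C, C links to A only.
out_links = {"A": [], "B": ["A", "C"], "C": ["A"]}
pr = {"A": 0.0, "B": 0.5, "C": 0.25}  # illustrative values (assumed)

print(received_rank("A", out_links, pr))  # PR(B)/2 + PR(C)/1
```

Here B splits its rank between two out-links while C passes its whole rank to A, which is the "own PR divided by the number of outgoing links" rule.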

Matrix notation
• Hyperlink matrix $H = [H_{ij}]$, in which the entry in the $i$-th row and $j$-th column is:

\[
H_{ij} =
\begin{cases}
\dfrac{1}{O_i} & \text{if } (i, j) \in E \\
0 & \text{otherwise}
\end{cases}
\]

• Example for the three-page graph on A, B, C (rows and columns ordered A, B, C):

\[
H =
\begin{bmatrix}
0 & \tfrac{1}{2} & \tfrac{1}{2} \\
0 & 0 & 1 \\
1 & 0 & 0
\end{bmatrix}
\]

• Stochastic matrix: all entries >= 0; each row sums to 1.
• Let $PR = (PR(1), \dots, PR(n))^T$.
• We can then write the equations as:

\[ PR = H^T PR \]
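The definition of H can be sketched in a few lines of code. This is only an illustration: the function name `hyperlink_matrix` and the list-of-rows representation are assumptions, and the edge list is read off the three-page example matrix (A links to B and C, B links to C, C links to A).

```python
def hyperlink_matrix(pages, edges):
    """Return H as a list of rows, with H[i][j] = 1/O_i if (i, j) is a link, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)

    # O_i: number of out-links of page i.
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[index[src]] += 1

    H = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        i, j = index[src], index[dst]
        H[i][j] = 1.0 / out_degree[i]
    return H


pages = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
H = hyperlink_matrix(pages, edges)

for row in H:
    print(row, "row sum =", sum(row))
```

Every page in this small graph has at least one out-link, so each row sums to 1 and H is stochastic.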

Solve for PageRank
• A well-known mathematical technique called power iteration can be used to find PR, i.e.

\[ PR^{k+1} = H^T PR^{k} \]

• The solution PR is an eigenvector of $H^T$ with corresponding eigenvalue 1.
• If some conditions are satisfied, 1 is the largest eigenvalue and the PageRank vector PR is the principal eigenvector.
• Problem: the Web graph does not meet the conditions (later).

[Figure: the three-page graph on A, B, C, its matrix H with rows (0, 1/2, 1/2), (0, 0, 1), (1, 0, 0), and a table of the iterates PR^0, PR^1, PR^2, PR^3, ..., PR^k; the numeric entries of the table are not recoverable.]
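Power iteration can be sketched as below for the three-page example. The function name `power_iteration`, the uniform starting vector, the tolerance and the iteration cap are assumptions chosen for this illustration; the slides do not specify them.

```python
def power_iteration(H, tol=1e-12, max_iter=10_000):
    """Repeat PR <- H^T PR until successive iterates differ by less than tol."""
    n = len(H)
    pr = [1.0 / n] * n  # assumed start: uniform probability vector
    for _ in range(max_iter):
        # (H^T pr)[i] = sum over j of H[j][i] * pr[j]
        nxt = [sum(H[j][i] * pr[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pr)) < tol:
            return nxt
        pr = nxt
    return pr


# Three-page example: A -> B, A -> C, B -> C, C -> A.
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

pr = power_iteration(H)
print([round(x, 4) for x in pr])
```

For this particular graph the iterates settle near (0.4, 0.2, 0.4). The iteration is not guaranteed to converge for an arbitrary H.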

Convergence
• Markov Chain Theorem:
  - A finite Markov chain defined by the stochastic matrix H has a unique stationary probability distribution if H is irreducible and aperiodic.
• After a number of transitions, $PR^k$ will converge to a steady-state probability vector:

\[ \lim_{k \to \infty} PR^{k} = PR \]

• When the steady state is reached, $PR^{k} = PR^{k+1} = PR$, so as $k \to \infty$ the iteration $PR^{k+1} = H^T PR^{k}$ becomes:

\[ PR = H^T PR \]

• PR is the principal eigenvector of $H^T$ with eigenvalue 1.
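As a small worked example of the steady-state condition, take the three-page matrix H with rows A = (0, 1/2, 1/2), B = (0, 0, 1), C = (1, 0, 0). Writing out $PR = H^T PR$ component by component, and normalizing PR to sum to 1 (an assumption, so that PR is a probability vector), gives:

```latex
\begin{aligned}
PR(A) &= PR(C), \\
PR(B) &= \tfrac{1}{2}\, PR(A), \\
PR(C) &= \tfrac{1}{2}\, PR(A) + PR(B), \\
PR(A) + PR(B) + PR(C) &= 1
\;\Longrightarrow\;
PR = \left( \tfrac{2}{5},\ \tfrac{1}{5},\ \tfrac{2}{5} \right)^{T}.
\end{aligned}
```

The third equation follows from the first two, so the normalization is what pins down a unique solution.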

Web graph
• Questions:
  - Is H a stochastic matrix?
  - Is H irreducible?
  - Is H aperiodic?
• If the above conditions are not satisfied, how do we extend the simplified equation to the actual PageRank formula?

Is H stochastic? (2)
• Is the sum of entries in each row equal to 1?
• Not if some web pages have no out-links: the corresponding rows of the transition matrix H are all 0's. Such pages are called dangling pages.

[Figure: a graph on nodes v0 to v4 and its matrix H, shown before and after the fix; in the version labeled "Fixed", the row of the dangling page has entries 1/5.]
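One common way to repair dangling rows, consistent with the 1/5 entries that appear in the five-node "Fixed" matrix, is to replace each all-zero row with the uniform row (1/n, ..., 1/n). The sketch below assumes that is the intended fix; the function name `fix_dangling` and the three-page matrix are hypothetical.

```python
def fix_dangling(H):
    """Replace every all-zero row of H with the uniform row (1/n, ..., 1/n)."""
    n = len(H)
    fixed = []
    for row in H:
        if all(x == 0 for x in row):
            # Dangling page: assume the surfer jumps to any page uniformly.
            fixed.append([1.0 / n] * n)
        else:
            fixed.append(list(row))
    return fixed


# Hypothetical 3-page matrix in which the last page has no out-links.
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

S = fix_dangling(H)
for row in S:
    print(row, "row sum =", sum(row))
```

After the fix every row sums to 1, so the matrix is stochastic. The input matrix is left unchanged.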

Is H irreducible and aperiodic?

Irreducible:
• States $v_i$ and $v_j$ communicate if, in the directed graph, a path exists from each to the other.
• Irreducible Markov chain: all states communicate.

Aperiodic:
• State $v_i$ is periodic if the chain can return to $v_i$ only at multiples of some period greater than 1.
• Aperiodic Markov chain: none of the states is periodic.

[Figure: two example graphs, each on nodes v0 to v4.]
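Both properties can be checked mechanically on a small graph. The sketch below tests irreducibility as strong connectivity (a breadth-first search on the graph and on its reverse) and computes the period of an irreducible chain as the gcd of `level[u] + 1 - level[v]` over all edges, a standard graph technique. The function names and the two five-node graphs are assumptions for illustration; they are not the graphs from the slide's figure.

```python
from collections import deque
from math import gcd


def bfs_levels(adj, start):
    """Return {node: BFS distance from start} for every reachable node."""
    level = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_irreducible(adj):
    """True if every state can reach every other state."""
    nodes = list(adj)
    reverse = {u: [] for u in nodes}
    for u in nodes:
        for v in adj[u]:
            reverse[v].append(u)
    root = nodes[0]
    return (len(bfs_levels(adj, root)) == len(nodes)
            and len(bfs_levels(reverse, root)) == len(nodes))


def period(adj):
    """Period of an irreducible chain; the chain is aperiodic when this is 1."""
    level = bfs_levels(adj, next(iter(adj)))
    g = 0
    for u in adj:
        for v in adj[u]:
            g = gcd(g, abs(level[u] + 1 - level[v]))
    return g


# Hypothetical graphs: a 5-cycle, and the same cycle with a chord 0 -> 2.
ring = {0: [1], 1: [2], 2: [3], 3: [4], 4: [0]}
chord = {0: [1, 2], 1: [2], 2: [3], 3: [4], 4: [0]}

print(is_irreducible(ring), period(ring))
print(is_irreducible(chord), period(chord))
```

The plain cycle is irreducible but periodic, since every return takes a multiple of 5 steps. The chord adds a cycle of length 4, which makes the chain aperiodic.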

Final PageRank equation (1)
• So at a page, the random surfer has two options:
  - With probability d, randomly chooses an out-link to follow.
  - With probability 1 - d, jumps to a random page.
• The equation below gives the improved model:

\[ PR = \left( \frac{1 - d}{n}\, E + d\, H^T \right) PR \]

• Where $E = e e^T$ (e is a column vector of all 1's), so E is an n x n square matrix of all 1's.
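The improved model can be illustrated by building the random-surfer matrix explicitly. In the sketch below, `google_matrix` is a hypothetical name, H is assumed to be row-stochastic already (no dangling rows), and d = 0.85 is an assumed value, since the slides do not fix d. The function returns the row-oriented matrix (1 - d)E/n + dH; its transpose is the matrix that multiplies PR in the equation above.

```python
def google_matrix(H, d=0.85):
    """Return G with G[i][j] = (1 - d)/n + d * H[i][j]."""
    n = len(H)
    return [[(1.0 - d) / n + d * H[i][j] for j in range(n)]
            for i in range(n)]


# Three-page example: A -> B, A -> C, B -> C, C -> A.
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

G = google_matrix(H)
for row in G:
    print([round(x, 4) for x in row], "row sum =", round(sum(row), 12))
```

Every entry of G is strictly positive, so any page can be reached from any other in one step. That is why the new chain is irreducible and aperiodic.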

Final PageRank equation (2)
• The new matrix is stochastic, irreducible and aperiodic.
• Scale the equation so that $e^T PR = n$. Then $E\,PR = e\,(e^T PR) = n\,e$, and the equation becomes:

\[ PR = (1 - d)\, e + d\, H^T PR \]

• PageRank for each page i is:

\[ PR(i) = (1 - d) + d \sum_{j=1}^{n} H_{ji}\, PR(j) = (1 - d) + d \sum_{(j, i) \in E} \frac{PR(j)}{O_j} \]
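The per-page formula lends itself to a direct iterative implementation. The following is a minimal sketch under stated assumptions: every page has at least one out-link (no dangling pages), d = 0.85 is an assumed value, and the function name `pagerank`, the tolerance and the iteration cap are chosen only for illustration.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(i) = (1 - d) + d * sum over links (j, i) of PR(j) / O_j."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}  # start with total rank n
    for _ in range(max_iter):
        nxt = {p: 1.0 - d for p in pages}
        for j, targets in out_links.items():
            for i in targets:
                # Page j passes an equal share of its rank along each out-link.
                nxt[i] += d * pr[j] / len(targets)
        if max(abs(nxt[p] - pr[p]) for p in pages) < tol:
            return nxt
        pr = nxt
    return pr


# Three-page example: A -> B, A -> C, B -> C, C -> A.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(out_links)

for page, value in pr.items():
    print(page, round(value, 4))
print("total =", round(sum(pr.values()), 6))
```

With this scaling the ranks sum to n, the number of pages, rather than to 1, which matches the condition $e^T PR = n$.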