



Google Page Rank Linear Algebra project




JONATHAN MACHADO
Abstract. Google’s PageRank algorithm is what makes Google such a strong search engine. The pioneering PageRank algorithm redefined how a search engine operates and executes. In this paper, the underlying mathematical basics for understanding how the algorithm functions are provided. A basic analysis of hyperlinks and its connection to the PageRank algorithm is presented. Ultimately, this paper sheds light on a neat application of linear algebra coupled with graph theory.
2 J. MACHADO
the web and truly displayed these pages in order of significance. In essence, the algorithm proposes that the relevance or importance of a web page is dictated by the number of quality hyperlinks linking to it. It is useful to represent these networks of hyperlinks between web pages as directed graphs. It turns out that linear algebra coupled with graph theory provides the tools needed to calculate web page rankings with the PageRank algorithm. The focus of this paper is to explain the underlying mathematics behind Google’s PageRank algorithm. We dive into the fundamentals of the algorithm, providing an overview of the important linear algebra and graph theory concepts that apply to this process. In the end, the reader should have a basic understanding of how Google’s PageRank algorithm computes the ranks of web pages and how to interpret the results.
2.1. Markov Chains. We begin by introducing Markov chains. We define a Markov chain as a mathematical model that describes an experiment or measurement that is performed many times in the same way, where the outcome of a given experiment can affect the outcome of the next experiment. The process starts at an initial state, namely $x_0$, and transitions successively from one state to another, say $x_1, x_2, \ldots, x_k$. The outcome of a given state depends only on the immediately preceding state.
Definition 2.1. A probability vector is a vector with nonnegative entries that add up to 1.
We note that probability vectors are the states in a Markov chain; hence these vectors are often referred to as state vectors.
Definition 2.2. A column-stochastic matrix is a square matrix in which all entries are greater than or equal to zero (nonnegative) and whose columns are probability vectors.
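Definitions 2.1 and 2.2 can be checked mechanically. The following is a minimal Python sketch, not part of the original paper; the function names `is_probability_vector` and `is_column_stochastic` and the sample matrix are illustrative assumptions. Exact fractions are used so that the column sums are not affected by floating-point round-off.

```python
from fractions import Fraction as F

def is_probability_vector(v):
    """True if all entries are nonnegative and they sum to exactly 1."""
    return all(x >= 0 for x in v) and sum(v) == 1

def is_column_stochastic(A):
    """True if A is square and every column is a probability vector."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False
    columns = [[A[i][j] for i in range(n)] for j in range(n)]
    return all(is_probability_vector(col) for col in columns)

# Hypothetical 2 x 2 example: each column sums to 1.
A = [[F(1, 2), F(1, 3)],
     [F(1, 2), F(2, 3)]]
print(is_column_stochastic(A))  # True
```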
Definition 2.3. A matrix is positive if all its entries are positive (greater than zero) real numbers.
Ultimately, we are interested in analyzing the chain’s long-term behavior after starting at some initial state. Thus, a Markov chain can be expressed as a first-order difference equation, also referred to as a dynamical system:
(1) $x_{k+1} = A x_k$ for $k = 0, 1, 2, \ldots$
where A is a column-stochastic matrix. Note that to compute $x_k$ in general, we can use
(2) $x_k = A^k x_0$ for $k = 0, 1, 2, \ldots$
So, we ask ourselves this question: what is the outcome at state $x_k$ as time goes on? When studying these Markov chains, the state vectors usually seem to approach an equilibrium as the system passes through time. This special long-term outcome leads to the concepts of eigenvalues and eigenvectors.
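As a rough illustration of equations (1) and (2), the sketch below iterates $x_{k+1} = A x_k$ for a small column-stochastic matrix and prints a few of the state vectors. The matrix, the initial state, and the helper name `mat_vec` are assumptions chosen for illustration; they do not come from the paper.

```python
def mat_vec(A, x):
    """Return the matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Hypothetical column-stochastic matrix (each column sums to 1).
A = [[0.9, 0.5],
     [0.1, 0.5]]
x = [1.0, 0.0]  # assumed initial state x_0

states = [x]
for k in range(50):
    x = mat_vec(A, x)  # equation (1): x_{k+1} = A x_k
    states.append(x)

for k in (0, 1, 2, 5, 50):
    print(k, [round(v, 4) for v in states[k]])
```

For this particular matrix the iterates settle near (5/6, 1/6), a vector that satisfies Ax = x.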
Definition 2.4. An eigenvector of a square matrix A is a nonzero vector $\vec{x}$ such that $A\vec{x} = \lambda\vec{x}$ for some scalar λ, where λ is an eigenvalue.
Such an $\vec{x}$ is an eigenvector corresponding to λ. Additionally, in dynamical systems, if A is a column-stochastic matrix, then λ = 1 is an eigenvalue of A.
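The claim that a column-stochastic matrix has the eigenvalue 1 follows from a short standard argument, sketched here for completeness since it is not spelled out in the text. The symbol $\mathbf{1}$ for the all-ones column vector and the entries $a_{ij}$ of A are notation introduced only for this sketch.

```latex
% Each column of A sums to 1, so the all-ones vector is fixed by A^T:
\[
  (A^{T}\mathbf{1})_j = \sum_{i=1}^{n} a_{ij} = 1
  \quad\Longrightarrow\quad
  A^{T}\mathbf{1} = 1\cdot\mathbf{1}.
\]
% A and A^T have the same characteristic polynomial:
\[
  \det(A - \lambda I) = \det\bigl((A - \lambda I)^{T}\bigr) = \det(A^{T} - \lambda I),
\]
% so \lambda = 1, an eigenvalue of A^T, is an eigenvalue of A as well.
```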
Figure 2. A strongly connected graph.
3.1. Hyperlink Analysis. Important properties and interesting outcomes of networks or graphs can be drawn out through matrix representation. The matrix representation of a graph captures the characteristics of a given network and makes it possible to analyze its behavior in depth, which in turn enables many applications. The entire web can be viewed as a directed graph, with nodes representing webpages and edges representing the hyperlinks connecting them.
Definition 3.1. An adjacency matrix is an n × n matrix whose entry in row i, column j is 1 if there is an edge from node i to node j and 0 otherwise.
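Definition 3.1 translates directly into code. The sketch below builds an adjacency matrix from a list of directed edges; the helper name `adjacency_matrix` and the three-node graph are hypothetical and serve only as an illustration.

```python
def adjacency_matrix(n, edges):
    """Build the n x n adjacency matrix from directed edges (i, j), 1-indexed."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1  # row i, column j marks an edge i -> j
    return A

# Hypothetical 3-node graph: 1 -> 2, 2 -> 3, 3 -> 1, 3 -> 2.
edges = [(1, 2), (2, 3), (3, 1), (3, 2)]
A = adjacency_matrix(3, edges)
for row in A:
    print(row)
# [0, 1, 0]
# [0, 0, 1]
# [1, 1, 0]
```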
It follows that the web, or any portion of the web in which one is interested, can be represented by an adjacency matrix. Any such network has a finite number n of nodes, or webpages. Each webpage is indexed by a distinct integer p with 1 ≤ p ≤ n. Now consider the web graph shown in Figure 3. This network can be represented as the adjacency matrix A:
Since we are ultimately interested in how the webpages are connected throughout the network, in the hope of reaching a conclusion about its long-term behavior, let’s take the matrix A and multiply it by itself:
As it turns out, the entries of the resulting matrix $A^2$ give the number of different paths of length 2 from webpage i to webpage j. For instance, there are 2 paths of length 2 from webpage 3 to webpage 1: page 3 to page 2 to page 1, and page 3 to page 4 to page 1. On the other hand, there is no path of length 2 from page 4 to page 1. Similarly, $A^3$ gives the number of different paths of length 3 from webpage i to webpage j, and so on.
LINEAR ALGEBRA APPLICATION: GOOGLE PAGERANK ALGORITHM. 5
Figure 3. A strongly connected web graph representing hyperlinks linking four different websites. Pages 1 and 2 each link to the other.
Theorem 3.1. Consider a directed graph and a positive integer k. Then the number of directed walks from node i to node j of length k is the entry on row i and column j of the matrix $A^k$, where A is the adjacency matrix.
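Theorem 3.1 can be spot-checked numerically. The sketch below uses an assumed four-page edge list, chosen to agree with the path counts quoted in the text (two walks of length 2 from page 3 to page 1, none from page 4 to page 1). The adjacency matrix of Figure 3 itself is not reproduced in this text, so the graph should be read as a stand-in. The helpers `mat_mul`, `mat_pow`, and `count_walks` are illustrative names.

```python
def mat_mul(A, B):
    """Product of two square list-of-rows matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, k):
    """A^k for a positive integer k, by repeated multiplication."""
    result = A
    for _ in range(k - 1):
        result = mat_mul(result, A)
    return result

def count_walks(A, i, j, k):
    """Brute-force count of directed walks of length k from i to j (0-indexed)."""
    if k == 0:
        return 1 if i == j else 0
    return sum(count_walks(A, m, j, k - 1) for m in range(len(A)) if A[i][m])

# Assumed graph (1-indexed pages): 1->2,3,4; 2->1,4; 3->2,4; 4->1.
A = [[0, 1, 1, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [1, 0, 0, 0]]

A2 = mat_pow(A, 2)
print(A2[2][0])  # walks of length 2 from page 3 to page 1: 2
print(A2[3][0])  # walks of length 2 from page 4 to page 1: 0

# Compare the matrix power with brute-force enumeration for k = 1, 2, 3.
for k in (1, 2, 3):
    Ak = mat_pow(A, k)
    assert all(Ak[i][j] == count_walks(A, i, j, k)
               for i in range(4) for j in range(4))
```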
This neat result for adjacency matrices gives insight into how a user starting on a particular webpage can transition to other pages. Consequently, as a user surfs the web in relation to a query, the user will in time tend to visit the webpages with the most incoming hyperlinks, since many other pages lead to them. Google’s PageRank algorithm ultimately uses this information about hyperlink connections to compute the ranks of the pages.
3.2. PageRank Algorithm Analysis. Google’s PageRank algorithm takes the hyperlink analysis slightly further. In addition to the number of hyperlinks pointing to a particular webpage, the PageRank algorithm pays close attention to how reputable and authoritative the pages behind those incoming hyperlinks are. To incorporate this factor into a web graph, weights are given to each hyperlink.
Definition 3.2. The indegree of a node is the number of edges pointing to it.
Definition 3.3. The outdegree of a node is the number of edges pointing away from it.
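With the row-i, column-j convention of Definition 3.1, outdegrees are row sums of the adjacency matrix and indegrees are column sums. The short sketch below shows this on an assumed four-page adjacency matrix (edges 1->2,3,4; 2->1,4; 3->2,4; 4->1); the matrix is a stand-in for illustration and is not copied from the paper.

```python
# Assumed adjacency matrix: entry [i][j] is 1 if there is an edge i -> j.
A = [[0, 1, 1, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [1, 0, 0, 0]]

n = len(A)
outdegree = [sum(A[i]) for i in range(n)]                      # row sums
indegree = [sum(A[i][j] for i in range(n)) for j in range(n)]  # column sums

print(outdegree)  # [3, 2, 2, 1]
print(indegree)   # [2, 2, 1, 3]
```

Both lists sum to the number of edges, since every edge leaves exactly one node and enters exactly one node.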
Weights are computed as follows: if there is an edge from i to j and the outdegree of node i is $d_i$, then the weight for that edge is $\frac{1}{d_i}$. The application of weights brings fairness to this ranking system. Think of it this way: weights are motions. A page that links to another casts a vote that the other page is important and therefore makes a motion to raise that page’s rank. The incorporation of weights keeps pages that link to many others, commonly referred to as hubs, from unreasonably affecting the ranks, essentially treating each of a page’s links with equal value. Additionally, a page that has many links pointing to it from these hubs will not receive an overwhelming influence that results in an unfair ranking.
Definition 3.4. A transition matrix, corresponding to the adjacency matrix of a graph, incorporates weights to better model the behavior of the network.
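A transition matrix can be sketched by combining the edge weights with the adjacency matrix. The code below places the weight of an edge from i to j in row j, column i, so that every column sums to 1 and the result is column-stochastic in the sense of Definition 2.2. The function name `transition_matrix` and the four-page adjacency matrix are assumptions made for illustration; a node with no outgoing links would leave a zero column, a case this sketch does not handle.

```python
from fractions import Fraction as F

def transition_matrix(A):
    """Weight each edge i -> j by 1/outdegree(i); store it in row j, column i."""
    n = len(A)
    P = [[F(0)] * n for _ in range(n)]
    for i in range(n):
        d_i = sum(A[i])  # outdegree of node i
        for j in range(n):
            if A[i][j]:
                P[j][i] = F(1, d_i)
    return P

# Assumed adjacency matrix (edges 1->2,3,4; 2->1,4; 3->2,4; 4->1).
A = [[0, 1, 1, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [1, 0, 0, 0]]

P = transition_matrix(A)
for row in P:
    print([str(x) for x in row])
# ['0', '1/2', '0', '1']
# ['1/3', '0', '1/2', '0']
# ['1/3', '0', '0', '0']
# ['1/3', '1/2', '1/2', '0']
```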
We note that the weight corresponding to the edge from i to j is placed in column i and row j. Returning to our web graph shown in Figure 3, we now consider constructing a transition
Thus, the ranks of the four webpages are given above. Page 1 is ranked highest with .387; page 4 is ranked second highest with .290; followed by page 2 with .194; and lastly page 3 with .129.
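The quoted ranks can be reproduced, under stated assumptions, by iterating equation (1) with a transition matrix. The matrix `P` below is built from an assumed edge list (1->2,3,4; 2->1,4; 3->2,4; 4->1) that is consistent with the path counts and ranks quoted in this paper; the paper's own matrix is not reproduced in this text, and no damping factor is applied. The function name `power_iteration`, its starting vector, and its tolerance are illustrative choices.

```python
def power_iteration(P, tol=1e-12, max_iter=1000):
    """Iterate x <- P x from the uniform vector until successive iterates agree."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        x_next = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x

# Assumed column-stochastic transition matrix (column i holds page i's weights).
P = [[0,   1/2, 0,   1],
     [1/3, 0,   1/2, 0],
     [1/3, 0,   0,   0],
     [1/3, 1/2, 1/2, 0]]

ranks = power_iteration(P)
print([round(r, 3) for r in ranks])  # [0.387, 0.194, 0.129, 0.29]
```

For this assumed matrix the limit is exactly (12/31, 6/31, 4/31, 9/31), which rounds to the values reported for pages 1 through 4.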
Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC 27402, USA E-mail address: [email protected]