Simple Data Mining: Simple PageRank Algorithm Description

This blog post will give a simple explanation of the original PageRank algorithm This is the algorithm that differentiated Google from the web search incumbents in the late 1990's (e.g. Yahoo!, Altavista, etc.) and helped to make it what it is today. I will be using material I learned in ChengXiang Zhai's "Text Retrieval and Search Engines" Coursera course, as well as the original technical paper on PageRank by Sergei Brin and Larry Page entitled "The PageRank Citation Ranking: Bringing Order to the Web", 1998.

For this blog post I'm going to use a VERY simplified example of the web so that we can work through it and understand what is happening. The example below has only 6 web pages, but has the complexity we need for our discussion.

We'll consider each of the items in the network (e.g. A, B, C) to be webpages. PageRank can actually be used for other things like social networks, etc., but we'll stick with the web page version for this post. Each black arrow represents a link from where the arrow starts to where the arrow ends. For example, webpage B has a link to webpage A, and that's it. Webpage A has links to webpage B, and D. These links, and the somewhat hidden information that they contain, are the most important part of the PageRank algorithm.

Before PageRank, there were actually lots of people that tried to use the information from these links to help with search. These mostly focused on the count of how many links a webpage had pointing to it. If we use this type of thinking page A seems to be the most important with 4 links to it, C and D come in 2nd place with 2, B comes in 3rd with 1 and E and F come in dead last with 0. This may seem to make sense, but there is something interesting that this method misses. Since page A has so many links, it's pretty obvious it is important, but since it's important, the links that page A has to other pages are more important that other links in the network. PageRank takes this into account.

Imagine you are web-surfing in the diagram above. Say you pick a web page at random and then read it, find a link on that webpage and click it, read that webpage, find a link on that webpage and click it...lather, rinse, repeat...until you get bored, or stuck. At this point you decide to not follow a link from the page you're on, but to just pick a webpage at random from the web. Once you're on this new random webpage you start reading and clicking on links all over again. If you did all of this, this is essentially what the PageRank algorithm does. It follows this type of random web-surfing and then calculates the probability of arriving at all of the pages in the web.

Now that I've explained that process conceptually, lets get into the math that matches this type of random web-surfer behaviour. If we start by creating a grid/table where we put a 1 for links from one webpage to another, we'd get something like this

This is helpful because now we can do some analysis on this table, but it's not really what we need. If you're at a webpage, we want to be able to determine how likely you are to pick one of the links on the webpage. If there is only one link on the webpage, this is easy; 100%. If there are 2 or more, the PageRank algorithm just divides this probability up evenly. If there are 2 links, each link has a 1 out of 2 chance of getting clicked (as a side note, in my opinion it would be more interesting if these probabilities were altered based on where the links were on the screen, or some other characteristic like font size/color). If we slightly alter our table above to give these probabilities, instead of just the links, it will look like this

This matrix is called the transition matrix (later we'll refer to this matrix as M). As mentioned above when describing the random web-surfer, there are 2 parts to this algorithm: clicking links and jumping to pages without links. The transition matrix is useful for the link clicking part of the calculation. Let's say that we start at time t=0 by randomly selecting one of the webpages to start from (each page has a 1 out of 6 chance of being the page we start on) and we want to figure out the probability that our next clicked link will take us to webpage A. To do this we would calculate this

For our example this equals 1/2, 50%, or 0.5 (however you want to write it). The method can be generalized to all of the webpages after we pick a starting webpage. The generalize version of this equation looks like this

Don't freak out, I'm going to explain what all the symbols mean. The part of the left of the equal sign is just saying that you're going to be calculating the probability that the next click (t+1) is going to land on page dj. 'j' is a subscript that helps you keep track of what page you're calculating the probability for. In our transition matrix above, if j=3, then dj = C. The right side is a summation starting at 1 and going to N, where N is the number of pages in your web (in our case N=6). Mij is the element from the transition matrix in row i and column j. pt(di) is the probability at the current time that you have arrived at page di. Since we're still on our first step (t=0), we set pt(di) equal to 1/6 in our example. Now we can calculate the probability of being on all of the pages in our web (not just A) at t=1. The result looks like this

You can see that because there are no webpages that point to pages E and F, we have 0 probability of being on those pages after we've clicked just once. The PageRank algorithm gets to rankings of all of the pages after many repetitions of what we just did. I'll walk you through another step (t=1 to t=2) so you get the feel of it. Let's calculate the probability that we're on page A at t=2.

This is very similar to the step from t=0 to t=1. The first value in each multiple is from the transition matrix (M) and the 2nd value comes from the probability of being at each page at time t=1. Once you get the hang of this, it can be repeated quickly and converges to a final answer very quickly. Here's how our web diagram converges.

When we get to about 21 cycles, we see that nothing is really changing and now we have values that can tell us how important each of the webpages are. Based on this example, page A is most important, then pages B and D tie for 2nd. This is not what we got when we just counted how many links each website had. Page B only has 1 link, but it is linked by page A which is the most important, so that link counts for more. Remember though, that this isn't the whole page rank algorithm. We have to add in the random jumping to a different page now. To do this we have to set a probability that, at any given time, we'll "get bored" and jump to another page. Let's say that we have a 15% chance of getting bored on the page we happen to be on. We'll call this value alpha (α=15%)

Here you can see that the first term in the probability is basically the same, but that it get's multiplied by (1-α). This is because we need to discount this probability by the amount that we get get bored and randomly pick a page. The 2nd term is adding in the probability of randomly jumping to another page. The 2nd term could be simpler; it could just be α*(1/N). It is written the way it is because this equation can be turned into a matrix math problem, which I'll show you later. For not just go with it.

If we start over at t=0, using this new form of the equation, we can calculate a couple examples and show you how this equation converges as well. If we want to calculate the probability of being at webpage A at t=1 we would do the following:

You can see that in the first term in the first line, I just used the value we calculated in our prior example without jumping. I also expanded out the second summation so that it is easier to see. The next 2 lines substitute the actual values into the equation and simplify to an answer. If you follow this calculation for the other webpages you get this.

I'll walk you through one more step and then show you the whole table and how it converges to a final answer. If we want to calculate the probability of being at webpage A at time t=2 we get this

The first thing I want you to notice is that I couldn't just use the value from the old method for the first summation. This is because the probability of being at webpage A at t=1 has changed. So, we get to do all of the math this time. Remember that the M terms here come from the transition matrix. Because we're calculating the probability of being at webpage A, we take the values from column A of the transition matrix. The rest is basically the same as the last example. We can do this for all of the webpages (I did it in Excel for this post) and get the table below

You can see that this table converges where values really aren't changing any more by the time it gets to t=17. The decision of when to stop iterating is based on a user defined variable. In the original PageRank white paper, the authors suggest that you loop through 't' until a value, delta (δ) gets smaller than this user defined value epsilon (Ɛ). To calculate delta you need to know this.

The first line shows how to calculate delta, but if you're anything like me, you may have never been exposed to an "L1 Norm" before. Each of the R terms inside the double bars is a vector containing the probabilities we've calculated for all of the webpages at the given time. So R1 = {0.45, 0.095833, 0.2375, 0.1666, 0.025, 0.025} using the clicking and jumping values from our most recent example. R2 would be {0.389792, 0.21625, 0.117083, 0.226875, 0.025, 0.025}. To calculate our delta value we first have to subtract each term in R1 from R2, then we will take the L1 norm of the result. R2 - R1 = {-0.06021, 0.120417, -0.12042, 0.060208, 0, 0}. Now we need to calculate the L1. This is shown in the last equation in the list above. Basically, you just take the absolute value of every term in the vector (R2-R1) and add them all up. For our example from t=1 to t=2, we get delta equal to 0.36125. If I add this delta value to the last column of the convergence table you can see how this value gets smaller as we iterate.

I promised earlier I'd show you how the equations to calculate page rank can be turned into a matrix equation. If you're familiar with linear algebra, this will be interesting to you. The other MAJOR advantage of having equations in matrix form is that the processing time in a computer is MUCH faster for matrix math; this is because ridiculously smart people have optimized matrix math routines in most math libraries so that they get the right answers while minimizing processing time. We want to be able to stand on the shoulders of these giants. So, here's how the equations can get morphed into matrix equations

If you know how to use matrices in Matlab, a programming language, Excel, etc. This last equation can be used to efficiently calculate each iteration.

Before I end this post, I need to explain one more thing. If our web diagram happened to look like the diagram below we would have a problem

The only thing that I changed is that the arrow that was going from F to C, is now going from C to F. The reason why this is a problem is that F isn't pointing to anything. If we randomly select F as our starting point, how can we click on a link to move forward??? We can't. A simple way to solve this problem is for there to be an exception in the PageRank algorithm. If the page has no links to other pages, we don't allow the option to click on a link, we force the algorithm to randomly jump to another page. In essence, in this special case, we set alpha equal to 1. This keeps the algorithm from getting hung up in a dead end.

Well, I think that wraps it up. Let me know if you have any additional questions about PageRank that I can answer for you. As always, I hope this helps somebody out there!

Simple Data Mining

Pages

Saturday, April 18, 2015

Simple PageRank Algorithm Description

1 comment: