PageRank Algorithm in the Cloud using the Google App Engine
by Kristian Kraljic, January 14, 2012
Two articles a day, this is madness! Madness? This is /Sparta+/. But enough with the old age jokes, pimped up with a little pseudo nerd regex skill to make it even unfunnier!… Sorry for that, let’s get serious!
Have you ever read the famous paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by the two godfathers Sergey Brin and Lawrence (Larry) Page themselves? No? Then you probably should. Having the chance to study at Stanford University must be heaven on earth. Anyway, today I’d like to give you a quick introduction to the PageRank algorithm, in my opinion another great example of a beautiful computer algorithm.
The PageRank algorithm, named after Larry Page, is a link analysis algorithm used by Google. It assigns a numerical weight to each element of a hyperlinked set of documents, such as a collection of web pages. The PageRank of a page is intended to measure its relative importance within that set.
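To make "numerical weight" a bit more concrete, here is the commonly quoted normalized form of the PageRank formula. The symbol names are my own choice for illustration: N is the number of pages, d is the damping factor (0.85 is a typical value), M(p_i) is the set of pages linking to p_i, and L(p_j) is the number of outgoing links of p_j. The original paper states the formula without the division by N, so treat this as one common variant rather than the definitive one.

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```

In words: every page hands out its own rank in equal shares to the pages it links to, and a page's rank is a small base value plus the damped sum of all the shares it receives.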
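A good approach to implementing the PageRank algorithm is MapReduce. MapReduce is both a framework and a programming model for processing highly distributable problems across huge datasets using a large number of nodes. It fits PageRank well because each page can hand out its rank shares independently (map), and the shares arriving at a page are then simply summed up (reduce).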
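To show how the map and reduce steps could look, here is a minimal sketch in plain Python 3 without any framework. It is not the code running on the App Engine. The function names, the damping factor of 0.85, the fixed iteration count and the tiny example graph are all assumptions made for illustration. The sketch also assumes that every page has at least one outgoing link and only links to pages inside the graph.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor, a commonly used value


def map_phase(rank, out_links):
    """Map: a page splits its current rank evenly among the pages it links to."""
    for target in out_links:
        yield target, rank / len(out_links)


def reduce_phase(shares, num_pages):
    """Reduce: sum all shares sent to one page and apply the damping."""
    return (1 - DAMPING) / num_pages + DAMPING * sum(shares)


def pagerank(graph, iterations=50):
    """Run a fixed number of map/reduce rounds over a dict of page -> out-links."""
    num_pages = len(graph)
    ranks = {page: 1.0 / num_pages for page in graph}
    for _ in range(iterations):
        # "Shuffle" step: group the emitted shares by their target page.
        grouped = defaultdict(list)
        for page, out_links in graph.items():
            for target, share in map_phase(ranks[page], out_links):
                grouped[target].append(share)
        ranks = {page: reduce_phase(grouped[page], num_pages) for page in graph}
    return ranks


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In a real MapReduce setup the map calls would run on many nodes in parallel and the framework would take care of the grouping. Here a simple dictionary plays that role.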
For my implementation of the PageRank algorithm I used Python. I did not use any MapReduce framework, to keep the code as easy to understand as possible. I deployed the result to the Google App Engine, so it should be very easy to access:
It crawls a given resource, calculates the PageRank from the crawled data using MapReduce, and prints the result. The source is freely available, so feel free to check it out. Hopefully it will help you understand this great piece of computer science.
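If you wonder what the crawling part boils down to: before any rank can be calculated, you need the outgoing links of every page. Here is a minimal sketch of that step using only the Python 3 standard library. It is not the deployed code. The class name LinkExtractor, the helper extract_links and the sample HTML are made up for illustration, and the page is passed in as a string instead of being downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page itself.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return all outgoing links found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real crawler would fetch this over HTTP.
sample_html = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
links = extract_links("http://example.com/", sample_html)
print(links)
```

Run this for every crawled page and you get a mapping from each page to its outgoing links, which is the kind of input a PageRank calculation works on.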