CS61C Fall 2015 Project 5: PageRank With Spark


TAs: Xinghua Dou, Chris Hsu
Due 11/24 @ 23:59:59

Updates and clarifications

Goals

In this project, you will use the MapReduce programming paradigm to implement two different variations of the classic PageRank algorithm in the Spark framework.

Getting Started

Initialize your repository and get the skeleton code files by entering the following commands:

git clone https://mybitbucketusername@bitbucket.org/mybitbucketusername/proj5-xxx-yyy.git
cd proj5-xxx-yyy
git remote add proj5-starter https://github.com/cs61c-fall2015/proj5-starter.git
git fetch proj5-starter
git merge proj5-starter/master -m "merge proj5 skeleton code"

If you are not familiar with Spark, read this programming guide, especially the section on Resilient Distributed Datasets (RDDs) and on RDD Operations. For this part of the project, you will be transforming an input RDD of a list of edges into an output RDD of a list of nodes and their PageRank scores.
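As a small illustration of the kind of RDD transformation involved (this snippet is illustrative only and not part of the skeleton code; the names and toy data are made up), here is how an RDD of (source, target) edge pairs can be regrouped into (node, list of targets) pairs:

from pyspark import SparkContext

sc = SparkContext("local", "rdd-example")

# A toy edge list: (source, target) pairs.
edges = sc.parallelize([(0, 1), (0, 2), (2, 3)])

# Group the outgoing edges by their source node.
adjacency = edges.groupByKey().mapValues(list)

print(adjacency.collect())  # e.g. [(0, [1, 2]), (2, [3])]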

Background

PageRank is an algorithm developed by Sergey Brin and Larry Page that built the early foundation of the Google search engine. The original paper is available here, and we highly recommend you read it, but it is not necessary for the project. We will be implementing two different versions of a simplified PageRank algorithm.

PageRank

The intuition behind PageRank is that the web can be represented as a graph, with links between pages as directed edges. The importance or "popularity" of a page on the web can be determined by how many other pages link to it. But should every link be treated equally? A link from a popular site like Google's homepage should be more important than a link from an unpopular blog. We want to weigh links from more important pages more heavily than links from less important pages when computing the PageRank score of a page. We need PageRank values to determine other PageRank values, and since all nodes depend on each other and the graph isn't necessarily acyclic, this becomes a chicken-and-egg problem!

Discrete Time Markov Chains

We can model PageRank scores as representing the probability that, at a given point in time, a user is at some page. More popular pages will have higher PageRank scores and be more important, and less popular pages will have lower scores and be less important. We can then model users browsing the web as a discrete time Markov chain. This may sound complicated, but it's just a series of states along with transition probabilities between each state. For example, in the 2-state Markov chain below, at every timestep, if we are at state "A", we will remain in "A" with a probability of 0.6, and move to "E" with a probability of 0.4. If we are at "E", we will remain in "E" with a probability of 0.3 and move to "A" with a probability of 0.7.
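For example, here is a tiny simulation (purely illustrative, not part of the project code) of the 2-state chain described above; the fraction of steps spent in each state settles toward a fixed distribution, which is exactly the kind of steady state PageRank looks for:

import random

# Transition probabilities of the 2-state chain described above.
transitions = {"A": [("A", 0.6), ("E", 0.4)],
               "E": [("E", 0.3), ("A", 0.7)]}

def step(state):
    r, cumulative = random.random(), 0.0
    for next_state, p in transitions[state]:
        cumulative += p
        if r < cumulative:
            return next_state
    return state

counts = {"A": 0, "E": 0}
state = "A"
for _ in range(100000):
    state = step(state)
    counts[state] += 1

print(counts)  # roughly 64% of the steps in "A" and 36% in "E"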

The Random Surfer Model

We can then represent users browsing the web under a random surfer model. Sometimes the user will stay on a page, sometimes the user will click on a link and go to another page, and sometimes the user might enter a URL in the address bar and randomly "teleport" to any page on the web. We can then use this model to iterate through timesteps, transferring PageRank scores between nodes, until we reach convergence and the scores no longer change between iterations.

Part 1: SimplePageRank

Algorithm

In this part of the project, we will consider a random surfer model where a user has 3 options at every timestep: stay on the page, randomly follow a link on the page, or randomly go to any page in the graph. The probabilities are as follows:

- Stay on the current page: 0.05
- Randomly follow one of the links on the page: 0.85
- Randomly go to any page in the graph: 0.10

If a page does not have any outlinks, then the 0.85 chance of following a link is changed to randomly going to another page NOT including the current page.

As an example, we will show how to perform the update for one timestep. In the example graph, assume that all weights are initialized to 1.0. (The probability is just the weight / total weight, which would be 0.25 for each node)

Each timestep, every node will distribute its weight to the other nodes based on the update model.

Let's calculate the contribution that node 0 gives to every other node. First, we'll take the first option. 5% of the time, we will stay on the page. So we give 5% of our weight (which is 5% of 1, or 0.05) back to the node itself. 85% of the time, we will randomly follow a link on the page. There are two links, one to 1 and one to 2, and we will follow each with a probability of 42.5%. So we give 42.5% of our weight (which is 42.5% of 1, or 0.425) to node 1 and 42.5% of our weight to node 2. Let's skip the third option for now.

Now let's do the same calculation for node 1. 5% of the time, we will stay on the page, so we give 5% of our weight (which is 5% of 1, or 0.05) back to the node itself. For the second option, there are no outgoing links on the page, so we replace that option with randomly going to any node in the graph, not including node 1 itself. There are 3 other nodes in the graph, so we give 85% / 3 of our weight (which is 28.3% of 1, or 0.283) to each of nodes 0, 2, and 3.

The updates for nodes 2 and 3 can be done similarly.

Now let's think about option 3. It would be really inefficient to add this random chance of going to any node directly, because that would make the algorithm O(V^2) in the number of nodes! However, notice that this option is the same for every single node in the graph; it does not depend on the node's edges at all. For every node, we know that 10% of its weight will be distributed evenly among all the nodes in the graph, so 10% of the sum of all the weights will be distributed evenly among all the nodes in the graph. The sum of the weights has to be 4.0 for every iteration, so 0.4 is distributed across the graph, or 0.1 to each node. This is called the dampening factor.

Here is a final table of all the values we calculated. The rows represent the weights to a node, and the columns represent the weights from a contribution source.

To/From       0       1       2       3   Random Factor   New Weight
0          0.05   0.283     0.0   0.283            0.10        0.716
1         0.425    0.05     0.0   0.283            0.10        0.858
2         0.425   0.283    0.05   0.283            0.10        1.141
3          0.00   0.283    0.85    0.05            0.10        1.283
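As a sanity check, these numbers can be reproduced with a few lines of plain Python. This is only an illustration of the update rule, not the shape your Spark code should take, and the edge list below is an assumption consistent with the table (0 -> 1, 0 -> 2, 2 -> 3; nodes 1 and 3 have no outlinks):

# Assumed example graph: 0 -> 1, 0 -> 2, 2 -> 3.
targets = {0: [1, 2], 1: [], 2: [3], 3: []}
weights = {n: 1.0 for n in targets}

new_weights = {n: 0.0 for n in targets}
for node, w in weights.items():
    # Option 1: 5% of the weight stays on the node.
    new_weights[node] += 0.05 * w
    # Option 2: 85% follows an outlink; with no outlinks it is spread
    # evenly over every OTHER node.
    outs = targets[node] or [m for m in targets if m != node]
    for t in outs:
        new_weights[t] += 0.85 * w / len(outs)

# Option 3: 10% of the total weight is spread evenly over all nodes.
total = sum(weights.values())
for node in new_weights:
    new_weights[node] += 0.10 * total / len(weights)

print(new_weights)  # about {0: 0.716, 1: 0.858, 2: 1.141, 3: 1.283}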

Your Job

Your job is to implement the distribute_weights and collect_weights functions in pagerank/simple_page_rank.py. distribute_weights is the mapper and collect_weights is the reducer. The input to distribute_weights consists of tuples of the form (node, (weight, targets)), and the output of the reducer should be tuples of the same form, with the weights updated by the update process. The intermediate output of the mapper is something you decide on yourself, but make sure that it contains ALL the information the reducer needs.

Alternatively, you can define your own structure but then you will need to change the other methods in the class. Do NOT change the method signature of either __init__ or compute_pagerank.

See the diagram for more details on how the data pipeline works.
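To give a feel for the shape of the computation, here is a rough sketch of a possible mapper under one hypothetical intermediate format. It is deliberately incomplete (it ignores the no-outlink case and the dampening contribution, and it does not preserve the target lists across the shuffle), and the actual signatures in the skeleton code may differ:

def distribute_weights(pair):
    node, (weight, targets) = pair
    emitted = []
    # Option 1: 5% of the weight goes back to the node itself.
    emitted.append((node, 0.05 * weight))
    # Option 2: 85% is split evenly among the outgoing links
    # (assuming here that targets is non-empty).
    for t in targets:
        emitted.append((t, 0.85 * weight / len(targets)))
    return emitted

# Hypothetical pipeline: something along the lines of
#   nodes.flatMap(distribute_weights) followed by a reduce/group step
# in which collect_weights combines the contributions for each node.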

Part 2: BackedgesPageRank

Algorithm

In this part of the project we want to model a random surfer like in Part 1. However, we will be replacing the 10% chance of going to a random page with a 10% chance of pressing the "back" button and returning to the previous page. So, for example, if a user went to node 1 from node 0 in the previous step, then in the current step the backwards behavior gives the user a 10% chance of going back to node 0. If node 1 also has a link to node 0, then this chance is added on top of the chance of following the link; it does not replace it. Assume that pressing the "back" button during the first iteration will take a user to the starting page.

To reiterate, the probabilities are as follows:

- Stay on the current page: 0.05
- Randomly follow one of the links on the page (or, if the page has no outlinks, randomly go to another page NOT including the current page): 0.85
- Press the "back" button and return to the previous page: 0.10

Let's step through the same example as with SimplePageRank. Here is the graph from last time. We will initialize all weights to 1.0, and set the "back" pointer for all nodes to themselves.

The contributions of the first and the second options are the same as in the SimplePageRank case. For the contribution of the third option, the back pointers for every node are initially set to the nodes themselves. That means node 0 has an added 10% chance of going back to itself, giving a contribution of 0.10 to its next weight. Note that this is coincidentally the same as the 0.10 from the third option of SimplePageRank. This will not be the case for future iterations.

To/From            0             1             2             3   New Weight
0        0.05 + 0.10         0.283           0.0         0.283        0.716
1              0.425   0.05 + 0.10           0.0         0.283        0.858
2              0.425         0.283   0.05 + 0.10         0.283        1.141
3               0.00         0.283          0.85   0.05 + 0.10        1.283

Now let's do the update for another iteration on node 0. Its total weight is 0.716. The probability that we came from node 0 is (0.05 + 0.10) / 0.716 = 0.21. So, as the contribution of the third option, we give back to node 0 a weight of 0.21 (the chance that we came from node 0) * 0.1 (the chance of taking the third option) * 0.716 (our weight). Note how the total weight ends up cancelling out, and it just becomes (0.05 + 0.10) * (0.1). Similarly, the contributions to nodes 1 and 3 from the third option are (0.283) * (0.1).
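A quick arithmetic check of that cancellation (illustrative only):

weight = 0.716                       # node 0's weight after the first iteration
came_from_0 = 0.05 + 0.10            # portion of that weight contributed by node 0
prob_from_0 = came_from_0 / weight   # about 0.21

# Contribution of option 3 back to node 0:
print(round(prob_from_0 * 0.1 * weight, 3))  # 0.015
print(round(came_from_0 * 0.1, 3))           # also 0.015 -- the total weight cancels out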

Make sure you understand how this works before beginning on the implementation.

Your Job

This time, you will also have to define your own format for storing the data of the nodes between iterations. Therefore, you will have to implement initialize_nodes and format_output as well as update_weights in pagerank/backedges_page_rank.py, since they depend on the data format. Feel free to copy snippets from the provided code in SimplePageRank, since they are mostly the same and only the data format changes.

Think about what additional information you need to store compared to SimplePageRank. One hint: consider how the weight of a node from 1 step ago can be used to determine the incoming weight from the third option for the next step. Try doing a few examples by hand and see if you can find a pattern.

Running Your Program, Debugging, and Testing

To run your MapReduce program through Spark, type:

spark-submit run_pagerank.py [method] [input] [iterations]

from the root directory of the project.

This will print the output to stdout. The method is either "s" for SimplePageRank or "b" for BackedgesPageRank, and the input is the location of the input file you want to process.

For example, to run 2 iterations of SimplePageRank on the data/simple1 input and direct the output to a file called out, type:

spark-submit run_pagerank.py s data/simple1 2 > out

Spark can be slow to start up and cause issues with debugging, so we provided a FakeRDD class that implements some of the RDD interface of Spark through standard Python data structures. If an RDD transformation available through Spark is not implemented in the FakeRDD class, feel free to add it yourself. We also provided a run_mock_pagerank.py script that will run your implementation using FakeRDDs. This will help for local testing as well if you don't want to install Spark on your own computer. However, grading will be done EXCLUSIVELY using Spark and not the FakeRDD, so make sure your code works out of the box on the Hive machines.

To run this version, type:

python run_mock_pagerank.py [method] [input] [iterations]

We have also provided a simple script that performs a sanity check using the example inputs and outputs. To run this, type:

python check.py

This is a simple sanity check and provides no guarantees towards your actual grade. check.py runs your implementations against FakeRDD and not a real Spark RDD.

Again, to reiterate: grading will be done using a real Spark RDD, not a FakeRDD.

Your code will need to work not only with small examples, but also with medium-sized ones. One great dataset to use for testing is links from Wikipedia. Download the tar.gz file from here, and extract links.tsv to the data directory. The links in this dataset use names instead of numbers, so we have provided scripts to convert them to numbers for use in your MapReduce program. To run the full process, do:

python utils/generate_mapping.py data/links.tsv > data/wiki_mapping.json
python utils/map_to_numbers.py data/links.tsv data/wiki_mapping.json > data/wiki
spark-submit run_pagerank.py b data/wiki 20 > temp
python utils/map_to_names.py temp data/wiki_mapping.json > wiki_out
rm temp 

Your code should be linear in the number of edges, and definitely NOT quadratic in the number of nodes. A run on the Wikipedia dataset for 20 iterations should not take more than 5 minutes.

Which pages have the highest weights? Does that surprise you considering what you know about the PageRank algorithm?

Submission and Grading

To submit, type:

submit proj5-1

on a hive machine.

You should only submit pagerank/simple_page_rank.py and pagerank/backedges_page_rank.py. Anything else will be overwritten.

In addition, you should push your submission to your Bitbucket repository:

cd proj5-XXX-YYY
git add pagerank/simple_page_rank.py
git add pagerank/backedges_page_rank.py
git commit -m "proj5 submission"
git tag -f "proj5-sub"
git push origin proj5 --tags

Project 5-1 will be worth 2/3 of the entire grade for Project 5. The grading for this part will be 40% on Part 1 and 60% on Part 2.