CS 162 Project 4 – Cloud Computing and MapReduce

In this assignment, you will be using MapReduce, a parallel programming model for large computer clusters, to perform some calculations on Wikipedia. The end goal is to implement PageRank, an algorithm used by search engines such as Google to find the most "important" pages on the Internet, and run it on a 39 GB Wikipedia dataset. You will be able to test your code locally on smaller datasets, but to run it at scale, you will have access to the Amazon Elastic Compute Cloud (EC2). Your code will be using Hadoop, a popular open-source implementation of MapReduce.

You can revisit the lecture slides for more information on the MapReduce programming model and Hadoop. In addition, you can find helpful documentation at http://hadoop.apache.org/common/docs/r0.18.3.

Overview

Start by grabbing http://inst.eecs.berkeley.edu/~cs162/sp10/Nachos/proj4.tar.gz. Included in this package is some basic code for both parts of the project, as well as a basic Makefile. Each part of the project includes a component where you will run your code "in the cloud" using Amazon Elastic MapReduce. However, you should plan on performing all of your implementation and testing on either the instructional machines or on your own personal computers. To assist in development on the instructional machines, we have included a number of helpful datasets and tools in the /home/ff/cs162/proj4 directory. These tools (discussed below) will be a critical part of running your code in the cloud. In addition, your group will be given a private key for accessing Amazon Elastic MapReduce. You may use up to 20 virtual machines in the cloud; our class account has a limit, so do not use more than 20 at a time.

For this project we will be using Hadoop 0.18.3 (downloadable from http://apache.mirrors.tds.net/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz). Please use the documentation and API associated with the 0.18.3 release, as this is the release available on Amazon Elastic MapReduce.

The Dataset

The Wikipedia dataset we are using consists of text files with one line for each Wikipedia article. Each line contains five fields, separated by tabs:

<article ID> <title> <last modified time> <XML version of article> <plain text version>

You can view a sample article by looking at the first line in the "mini" dataset provided to you (/home/ff/cs162/proj4/mini.tsv), using:

head -1 /home/ff/cs162/proj4/mini.tsv

In addition to mini.tsv, we have also included a slightly larger ("tiny") dataset split into two files at /home/ff/cs162/proj4/wikipedia-tiny. You can use both of these for local testing and debugging. For running in the cloud, we have also provided larger datasets (discussed below).

You need to be aware of a few aspects of the data format for this assignment:

1)    Within the XML and plain text fields, newlines have been escaped as \n. You probably want to convert these to real newlines before processing the text.

2)    Articles representing "special" pages, such as file uploads or "category" pages, do not have the fifth field (plain text version).

3)    Some articles have \N for one of the fields, meaning "null" or unknown. Your code should not crash when it encounters such an article (the parsing sketch after this list shows one way to handle all three of these points).
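
For example, a map function might pre-process each input line along the lines of the following sketch. The helper name and field handling are ours, not part of the provided code:

    // Splits one dataset line into its tab-separated fields, restores real
    // newlines, and turns the \N marker into null. Special pages yield only
    // four fields, so check fields.length before reading the plain text.
    static String[] parseArticle(String line) {
        String[] fields = line.split("\t", -1);   // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals("\\N")) {
                fields[i] = null;                           // unknown field
            } else {
                fields[i] = fields[i].replace("\\n", "\n"); // unescape newlines
            }
        }
        return fields;
    }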

The Tools

We have installed a number of tools that you will use throughout the assignment on the instructional machines:

-       Hadoop 0.18.3 is installed in /home/ff/cs162/proj4/hadoop.

-       Elastic-mapreduce, a tool for submitting jobs to Amazon Elastic MapReduce in order to run them in the cloud, is available as /home/ff/cs162/proj4/emr/elastic-mapreduce.

-       S3cmd, a tool for reading and writing data to Amazon Simple Storage Service (S3), is available as /home/ff/cs162/proj4/s3cmd/s3cmd.

You may wish to create aliases for these tools, or add the necessary directories to your PATH, so that you can run them without typing their full paths.

The Project Directory

Included within proj4.tar.gz is the proj4 directory, which includes a Makefile and two subdirectories, textsearch and pagerank.

Compiling a MapReduce Job

To compile your job against Hadoop, you will need to add /home/ff/cs162/proj4/hadoop/hadoop-0.18.3-core.jar to your CLASSPATH. You will also need to package your files into a JAR. The provided makefiles already do both of these things for you. Feel free to investigate the included makefiles to learn more, and make any necessary modifications.

Running a Job Locally

To run your job on top of Hadoop, execute /home/ff/cs162/proj4/hadoop/bin/hadoop jar <your-jar-file> <main-class> <args>. The main class that you provide is responsible for running one or more jobs on top of Hadoop using the JobClient API.
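
For reference, a main class that configures and submits a single job through the 0.18 API might look roughly like the sketch below. It uses Hadoop's built-in IdentityMapper and IdentityReducer purely as placeholders; your real jobs will plug in their own mapper, reducer, and key/value classes.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ExampleDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ExampleDriver.class);
            conf.setJobName("example job");

            FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not exist yet

            conf.setMapperClass(IdentityMapper.class);    // replace with your mapper
            conf.setReducerClass(IdentityReducer.class);  // replace with your reducer
            conf.setOutputKeyClass(LongWritable.class);   // key type your job emits
            conf.setOutputValueClass(Text.class);         // value type your job emits

            JobClient.runJob(conf);   // submits the job and blocks until it finishes
        }
    }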

Using S3

When working in the cloud you will use Amazon Simple Storage Service (S3) for storing your code (JAR files), and reading and writing input and output data. We have already set up directories (buckets) for each group on S3, as well as a directory including three differently sized Wikipedia datasets. Note that everyone is sharing these directories! Be careful to work in your own group's directory!

You can list the directories in S3 using the following command:

s3cmd ls

To put your JAR file in S3, use the following command (replacing XX with your group number):

s3cmd put <local-jar-file> s3://cs162-group-XX/<filename>

To get the output of your job, or its logs, use the following command:

s3cmd get s3://cs162-group-XX/<path> <local-path>

You can find more documentation for s3cmd at http://s3tools.org/s3cmd and by running it with --help.

Running a job on Amazon Elastic MapReduce

You can submit a job using the following command (again, replacing XX with your group number):

elastic-mapreduce --key-pair group-XX --key-pair-file /path/to/group-XX.pem --create --name "Job name; include group #" --instance-type <type> --num-instances <num> --jar s3n://cs162-group-XX/<jar-file> --arg <main-class> --arg <other-arg> ...

You can include as many arguments with --arg as you wish.

Note that if you are passing your input and output directories to Hadoop, they must be URIs starting with s3n://. For example, the full Wikipedia dataset is in s3n://cs162-data/wikipedia-full. Note that s3cmd uses s3:// instead of s3n://.

You may also ask Elastic MapReduce to save logs from your job in order to debug your code. You can do this by adding the flag --log-uri s3n://cs162-group-XX/<log-dir> to the command above. You will then be able to fetch the logs with s3cmd get, as described above.

You can find additional documentation for the elastic-mapreduce tool at http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=273. And of course, don't forget --help.

Part I: Text Search

To get familiar with Hadoop, you will start by running and then modifying a simple MapReduce job that searches for Wikipedia articles containing a given string. The job counts the number of occurrences of the word in each article in the map phase, and sorts the articles that contained the word by number of occurrences in the reduce phase. You can find the source code for this job in the textsearch/TextSearch.java file included with the project code.

1)    Compile the TextSearch job by running gmake.
Note: If you are doing this on your own machine, you will need to modify the makefiles to set HADOOP_HOME to a location where you have unzipped Hadoop 0.18.3.

2)    Run the job on mini.tsv, searching for the string "system". Which article has the most occurrences? How many occurrences does it have?

3)    Run the job on Amazon Elastic MapReduce, again searching for the string "system", but using the s3n://cs162-data/wikipedia-full dataset. Which article has the most occurrences now?

4)    You will notice that the provided job performs a case-sensitive search, so it cannot find sentences such as "she stood up against the System". In addition, the job will happily match occurrences of "system" inside another word, such as "systematically". Modify the Java code in TextSearch.java to fix these two problems; that is, make the search case-insensitive and match only exact word occurrences of "system" (one possible approach is sketched after this list).

5)    Run your modified job again on both mini.tsv and the full Wikipedia corpus (on the cloud of course). How different are the results from those of the original version?
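
One possible approach to the matching changes in step 4 (sketched here only as a starting point; the method is our own, and other tokenization schemes are equally acceptable) is to count matches of a case-insensitive, word-boundary regular expression:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Counts case-insensitive, whole-word occurrences of searchWord in text,
    // so "System" matches but "systematically" does not.
    static int countWordOccurrences(String text, String searchWord) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(searchWord) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }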

For this part of the project, your initial design document only needs to talk about question 4. Your final design document should include short answers to questions 3 and 5 (no more than one paragraph each).

Part II: PageRank

The PageRank algorithm is used to identify the most "important" pages in a set of interlinked documents. The algorithm outputs a numerical score for each page, which can then be used to sort search results and otherwise identify trustworthy pages.

The key notion behind PageRank is that a page is more important if many pages link to it, or if the pages linking to it are themselves important (i.e. have high PageRank). For example, CNN.com is an important site because many web pages link to it, and any page that CNN.com links to is likely important as well, because an important page links to it.

The PageRank of a document roughly corresponds to the probability that a web surfer starting at a random web page and clicking random links will get to that document. The algorithm for computing PageRank operates in steps that simulate many random surfers clicking links, and compute what fraction of them will get to a particular page after each iteration.

You can read more about PageRank at http://en.wikipedia.org/wiki/PageRank, amongst other places – for this assignment, the goal is to implement it.

To implement PageRank, you will need to complete three steps:

1)    Write a MapReduce job that extracts a link graph out of Wikipedia. This will be a dataset that contains, for each article, a list of links going out of that article, as well as the article's title and PageRank (initialize page ranks to 1 for each article). This is the data structure that all of the PageRank steps will operate on. The input for this job should be the Wikipedia dataset, and the job should look at the XML text for each article, extract links, and output a link graph dataset.

 

In the Wikipedia dataset, the XML text will contain <link> tags that themselves contain <target> tags. The value within the target tag is the name of the article being linked to; for example, <target>Berkeley</target> would be a link to the Berkeley article (or, in this case, a disambiguation page).

 

Unlike in the TextSearch portion of this project, you should create your own subclass of Writable for emitting keys and values. You can store serialized versions of these objects using SequenceFileOutputFormat rather than TextOutputFormat (a sketch of one possible value type follows).
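
As a starting point, the link extraction and the link-graph value type might look roughly like the following sketch. LinkGraphNode is our own name, not something provided with the project; adapt the fields and the regular expression to whatever your design needs.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.Writable;

    // Value for one link-graph entry: the article's current PageRank and its
    // outgoing links. The article title would be the record's key (a Text).
    public class LinkGraphNode implements Writable {
        public double pageRank = 1.0;                       // every page starts at 1
        public List<String> outlinks = new ArrayList<String>();

        // Matches the article name inside each <target>...</target> tag.
        private static final Pattern TARGET = Pattern.compile("<target>(.*?)</target>");

        public static List<String> extractLinks(String articleXml) {
            List<String> links = new ArrayList<String>();
            Matcher m = TARGET.matcher(articleXml);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }

        public void write(DataOutput out) throws IOException {
            out.writeDouble(pageRank);
            out.writeInt(outlinks.size());
            for (String link : outlinks) {
                out.writeUTF(link);
            }
        }

        public void readFields(DataInput in) throws IOException {
            pageRank = in.readDouble();
            int n = in.readInt();
            outlinks = new ArrayList<String>();
            for (int i = 0; i < n; i++) {
                outlinks.add(in.readUTF());
            }
        }
    }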

2)    Write a MapReduce job that performs one iteration of PageRank. This job should take as input a link graph, update the PageRank of each page using the propagation formula shown below, and output a new link graph.

Your map function for this job should output two types of intermediate record from each link graph entry. First, given a link graph entry of the form (title, pageRank, outlinks), you should output a record with key title and value (ORIGINAL, outlinks) to remember the links. Second, for each outlink, you should output a record with key destination and value (CONTRIBUTION, pageRank, numOutlinks).

Your reduce function will receive one ORIGINAL record for each article, as well as one CONTRIBUTION record from each page linking to the document. It should output a link graph record of the form (title, newPageRank, outlinks). The outlinks can be copied from the ORIGINAL record. The newPageRank is computed from the contributions, using the following formula:

newPageRank = (1 - d) + d * Σ (pageRank / numOutlinks)

In this formula, the sum is over all of the CONTRIBUTION records for the page, and the constant d should be 0.85. (This is the value recommended by Google, but you can try other values to see how they change the results.)
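
In code, the update in your reducer amounts to the following (a sketch only; in the real job the contributions arrive as Writable values rather than arrays):

    // newPageRank = (1 - d) + d * sum over contributions of (pageRank / numOutlinks)
    static double updateRank(double[] contribRanks, int[] contribOutlinkCounts) {
        final double d = 0.85;                 // damping factor recommended by Google
        double sum = 0.0;
        for (int i = 0; i < contribRanks.length; i++) {
            // each linking page splits its rank evenly among its outlinks
            sum += contribRanks[i] / contribOutlinkCounts[i];
        }
        return (1 - d) + d * sum;
    }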

3)    Write a MapReduce job that takes your final link graph and prints out a human readable list of article names and PageRank scores, sorted in order of PageRank. Hint: You will need to use a single reduce task, so that all of the data is sorted by the same process. What should your key be and what should your value be if you are to sort the output records by PageRank?

You should invoke all of these jobs from inside a driver program in PageRank.java that takes command line arguments for the input directory, the output directory, the number of iterations, and the number of reduce tasks for each iteration. We have provided a skeleton for a driver program for you, which parses the command-line arguments.
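
If it helps to see the overall shape, the driver's job-chaining logic typically looks something like the sketch below. The configure* helpers are hypothetical stand-ins for code that builds the JobConfs described in steps 1-3; the provided skeleton does not define them.

    // Sketch of chaining the three jobs from the driver. Each iteration reads the
    // link graph written by the previous job and writes a new one.
    public static void runPageRank(String inputDir, String outputDir,
                                   int iterations, int numReduceTasks) throws Exception {
        String graphDir = outputDir + "/graph-0";
        JobClient.runJob(configureGraphJob(inputDir, graphDir));           // step 1

        for (int i = 1; i <= iterations; i++) {
            String nextDir = outputDir + "/graph-" + i;
            JobConf iteration = configureIterationJob(graphDir, nextDir);  // step 2
            iteration.setNumReduceTasks(numReduceTasks);
            JobClient.runJob(iteration);
            graphDir = nextDir;                                            // feed the next round
        }

        JobClient.runJob(configureRankedOutputJob(graphDir, outputDir + "/ranked"));  // step 3
    }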

After you have implemented PageRank and tested it locally, you should try to run it on the full Wikipedia dataset on Amazon Elastic MapReduce. Run at least 10 iterations to make sure that the algorithm converges, and no more than 30. Output the top 100 ranked pages and their PageRanks.

NOTE: The output of the third job is quite large – about 900 MB. You can make Hadoop compress it by calling jobConf.setBoolean("mapred.output.compress", true) on your JobConf object. However, even the compressed file is ~200 MB. To further save space, make your mapper output only pages with PageRank greater than 0.5. There should still be roughly 100 MB of these, and the top 100 pages will have PageRank much higher than this value.

Sample Outputs

To help you test your project, we've created the following test cases that you can run: