CS61C Lab 13: MapReduce

CS61C Summer 2011 Lab 13: MapReduce

Setup

You may work on this lab with a partner or individually. If you aren’t confident in your Java 1337 h4x0r skills, we strongly recommend that you find a partner with good Java skills.

MapReduce is primarily designed to be used on large distributed clusters, but that makes jobs much harder to debug. So instead, for this lab, we’ll be using Hadoop in “local mode”, where your Map and Reduce code is run entirely within one process.

You should complete this lab on the machines in 200 Dai. If you are not sitting physically in front of a lab machine, you can access one (list of machine names) remotely by following these instructions. In either case, you should use your course account to complete this lab. (If you did not get a course account yet, talk to your TA.)

After you have finished all the exercises (or when the lab time expires, if you don't finish), show your TA your work. If you don't get checked off with your TA, you will not get credit for this lab.

Exercise 0: Running Word Count

Copy the template files for this lab from ~cs61c/lab/13 using

cp -R ~cs61c/labs/13 lab13

The resulting directory will contain two files:

WordCount.java — source code for Hadoop WordCount
Makefile — a build file for the example

Run 'make' to compile and package 'wc.jar'. Then, run the word count example:

hadoop jar wc.jar WordCount ~cs61c/data/billOfRights.txt.seq wc-out

This will run word count over a sample input file (the US Bill of Rights.) Your output should be visible in wc-out/part-r-00000. If you had multiple reduces, then the output would be split across part-r-xyz, where Reducer xyz outputs to the corresponding file. The plain-text for the test code is in “~cs61c/data/billOfRights.txt”. For the input to your MapReduce job, the map()’s key is a document identifier and the actual text is in the value.

Once you have things working on the small test case, try your code on the larger input file ~cs61c/data/sample.seq (approx. 34 MiB). This file contains the text of one week's worth of newsgroup posts (extracted from this corpus). Since Hadoop requires the output directory not to exist when a MapReduce job is executed, you'll need to delete the wc-out directory (using the command rm -rf wc-out) first, or choose a different output directory.

You may notice that the Reduce percentage-complete percentage moves in strange ways. There’s a reason for it. Your code is only the last third of the progress counter. Hadoop treats the distributed shuffle as the first third of the Reduce. The sort is the second third. The actual Reduce code is the last third. Locally, the sort is quick and the shuffle doesn’t happen at all. So don’t be surprised if progress jumps to 66% and then slows.

Exercise 1: Documents-using-a-Word Count

Copy WordCount.java to DocWordCount.java. Rename the class (in the file) from 'WordCount' to 'DocWordCount'. Modify it to count the number of documents containing each word rather than the number of times each word occurs in the input. Run make to compile your modified version, and then run it on the same inputs as before.

You should only need to modify the code inside the map() function for this part. Each call to map() gets a single document, and each document is passed to exactly one map().

Exercise 2: Document which uses a word most

Create a new file called 'MostUsage.java' based on WordCount.java. Modify it to output, for each word, the ID of the document in which that word occurs the most times (compared to all other documents). The ID for each document is provided as the key to the mapper. Your output should have lines that look like:

word document-id

For this exercise, you will need to modify map(), reduce() and the type signature for Reduce. You will also need to make a minor edit to main() to tell the framework about the new type signature for Reduce.

Compile and run this MapReduce program on the same inputs as for Exercise 1.

Check off:

Before you leave:

Show your TA your DocWordCount.java and MostUsage.java
Show your TA the first page of your MostUsage.java's output file when run on sample.seq

We recommend deleting output directories when you have completed the lab, so you don't run out of your 500MB of disk quota.

Further resources;

The Java API documentation is on the web: http://download.oracle.com/javase/6/docs/api/

The classes java.util.HashMap, java.util.HashSet and java.util.ArrayList are particularly likely to be useful to you.

The Hadoop Javadoc is also available: http://hadoop.apache.org/common/docs/r0.20.2/api/index.html

You mostly shouldn’t need this, but it may be handy for org.apache.hadoop.io.Text.