- Get hands-on experience running MapReduce on a local installation of Hadoop.
- Start learning how to fit certain problems/algorithms into the MapReduce paradigm.
Copy the contents of ~cs61c/labs/su14/12 to a suitable location in your home directory.
It will be especially helpful if inexperienced Java programmers partner with experienced Java programmers for this lab.
In lecture we've talked about how the MapReduce system is set up and executed, but now it's time to get some hands-on experience running programs on it!
The MapReduce programming framework is primarily designed to be used on large distributed clusters. However, large, distributed jobs are harder to debug. For this lab, we'll be using Hadoop -- an open source platform which implements MapReduce programming -- in "local mode", where your Map and Reduce routines are run entirely within one process. Because Hadoop is open source, you can download and install it (see the Hadoop webpage) on your home machine! This is rather difficult to setup on Windows, but manageable on Mac or Linux.
Avoid Global Variables
One of the core tenets of MapReduce is that we want to avoid multiple machines working on a single, unpartitioned data set because of the associated overhead. As a result, your algorithms will very rarely need to use global variables. In the worst case, you may need to share one or two variables for configuration across machines. If this is necessary, we will indicate it to you specifically in the spec.
Don't Hash Mutable Objects
Hashing mutable objects is dangerous and can be a key source of bugs that are very difficult to find and resolve. For more background, see this post on StackOverflow.
Useful classes, Hadoop Documentation, and Additional Resources
- The Java API documentation is on the web: http://download.oracle.com/javase/6/docs/api/. The classes java.util.HashMap, java.util.HashSet and java.util.ArrayList are particularly likely to be useful to you.
- The Hadoop Javadoc is also available: http://archive.cloudera.com/cdh/3/hadoop/api/index.html. You mostly shouldn't need this, but it may be handy for org.apache.hadoop.io.Text.
The following exercises use three different sample input files, two of which are provided by the staff and can be found in ~cs61c/data:
- billOfRights.txt.seq -- the 10 Amendments split into separate documents (a very small input)
- complete-works-mark-twain.txt.seq -- The Complete Works of Mark Twain (a medium-sized input)
Notice the .seq extension, which signifies a Hadoop sequence file. These are NOT human-readable. To get a sense of the texts you'll be using, simply drop the .seq portion to view the text file (i.e. ~cs61c/data/billOfRights.txt).
Although an exercise may not explicitly ask you to use it, we recommend testing your code on the billOfRights data set first in order to verify correct behavior and help you debug.
We recommend deleting output directories when you have completed the lab, so you don't run out of your 500MB of disk quota. You can do this by running:
$ make destroy-all
Please be careful with this command as it will delete all MapReduce outputs generated in this lab.
Exercise 0: Generating an Input File for MapReduce
For this exercise you will need the Makefile and Importer.java. In this lab, we'll be working heavily with textual data. We have some pre-generated datasets as indicated above, but it's always more fun to use a dataset that you find interesting. This section of the lab will walk you through generating your own dataset using works from Project Gutenberg (a database of public-domain literary works).
Step 1: Head over to Project Gutenberg, pick a work of your choosing, and download the "Plain Text UTF-8" version into your lab directory.
Step 2: Open up the file you downloaded in your favorite text editor and insert "---END.OF.DOCUMENT---" (without the quotes) by itself on a new line wherever you want MapReduce to split the input file into separate (key, value) pairs. The importer we're using will assign an arbitrary key (like "doc_xyz") and the value will be the contents of our input file between two "---END.OF.DOCUMENT---" markers. You'll want to break the work into reasonably-sized chunks, but don't spend too much time on this part (chapters/sections within a single work or individual works in a body of works are good splitting points).
Step 3: Now, we're going to run our Importer to generate a .seq file that we can pass into the MapReduce programs we'll write. The importer is actually itself a MapReduce program! You can take a look at Importer.java if you want, but the implementation details aren't important for this part of the lab. You can generate your input file like so:
$ make generate-input myinput=YOUR_FILE_FROM_STEP_2.txt
Your generated .seq file can now be found in the convertedOut directory in your lab12 directory. Throughout the rest of this lab, you'll be able to run the mapreduce programs we write using make commands. The make commands will be of the form make PROGRAMNAME-INPUTSIZE. If you wish to try out the input file you generated here, you can instead run:
$ make PROGRAMNAME myinput=YOUR_SEQ_FILE_FROM_STEP_3.txt.seq # Output in wc-out-PROGRAMNAME/ directory
Exercise 1: Running Word Count
For this exercise you will need the Makefile and already-completed WordCount.java. You must compile and package the .java source file into a .jar and then run it on our desired input. Luckily, this is available as a convenient make command:
$ make wordcount-small
This will run WordCount over billOfRights.txt.seq. Your output should be visible in wc-out-wordcount-small/part-r-00000. If we had used multiple reduces, the output would be split across part-r-[id.num], where Reducer "id.num" outputs to the corresponding file. The key-value pair for your Map tasks is a document identifier and the actual document text.
Next, try your code on the larger input file complete-works-mark-twain.txt.seq. In general, Hadoop requires that the output directory not exist when a MapReduce job is executed, however our Makefile takes care of this by removing our old output directory. Remember that we DON'T need to rebuild wc.jar, separately; the Makefile takes care of all the details.
$ make wordcount-medium
Your output for this command will be located in the wc-out-wordcount-medium directory. The first few lines will be confusing since the words you see there are actually numbers (for example, chapter numbers). Search through the file for a word like "the" to get a better understanding of the output. You may also notice that the Reduce "percentage-complete" moves in strange ways. There's a reason for it -- your code is only the last third of the progress counter. Hadoop treats the distributed shuffle as the first third of the Reduce. The sort is the second third. The actual Reduce code is the last third. Locally, the sort is quick and the shuffle doesn't happen at all. So don't be surprised if progress jumps to 66% and then slows.
Exercise 2: Document Word Count
Open DocWordCount.java. Notice that it currently contains the same code as WordCount.java (but with modified class names), which you just compiled and tried for yourself. Modify it to count the number of documents containing each word rather than the number of times each word occurs in the input.
You should only need to modify the code inside the map() function for this part. Each call to map() gets a single document, and each document is passed to exactly one map().
You can test DocWordCount using either of the following (for our two data sets):
$ make docwordcount-small # Output in wc-out-docwordcount-small/
$ make docwordcount-medium # Output in wc-out-wordcount-medium/
- Explain your modifications to DocWordCount.java to your TA.
- Show your output for billOfRights (aka the output for running make docwordcount-small). In particular, what values did you get for "Amendment", "the", and "arms"? Do these values make sense?
Exercise 3: Full Text Index Creation
Open Index.java. Notice again that it contains the same code as WordCount.java (With class names modified). Modify it to output every word and a list of locations (document identifier followed by the word index of EACH time that word appears in that document). Make sure your word indices start at zero. Your output should have lines that look like (minor line formatting details don't matter):
word1 document1-id: word#, word#, ... word1 document2-id: word#, word#, ... . . . word2 document1-id: word#, word#, ... word2 document3-id: word#, word#, ... . . .
Notice that there will be a line of output for EACH document in which that word appears and EACH word and document pair should only have ONE list of indices. Don't forget that the key-value pair for your Map tasks is a document identifier and the actual document text. You can assume that there's just one reducer and hence just one output file.
Note: Remember that the Reduce task will execute on key-value pairs in whatever order they are passed to it. You should NOT expect your final output to be ordered nicely.
For this exercise, you may need to modify map(), reduce(), and the type signature for reduce(). You will also need to make a minor edit to main() to tell the framework about the new type signature for reduce().
You can test Index using either of the following (for our two data sets):
$ make index-small # Output in wc-out-index-small/
$ make index-medium # Output in wc-out-index-medium/
Compile and run Index.java on both data sets. The output from running make index-medium will be a large file. In order to more easily look at its contents, you can use the commands cat, head, more, and grep:
$ head -25 OUTPUTFILE # view the first 25 lines of output $ cat OUTPUTFILE | more # scroll through output one screen at a time (use Space) $ cat OUTPUTFILE | grep the # output only lines containing 'the' (case-sensitive)
Make sure to verify your output. Open complete-works-mark-twain.txt and pick a few words. Manually count a few of their word indices and make sure they all appear in your output file.
- Explain your code in Index.java to your TA.
- Show your TA the first page of your output for the word "Mark" in complete-works-mark-twain.txt.seq to verify correct output. You can do this by running: cat wc-out-index-medium/part-r-00000 | grep Mark | less