CS61C Summer 2013 Lab 13: MapReduce II

Goals

Setup

Copy the contents of ~cs61c/labs/13 to a suitable location in your home directory.

It will be especially helpful if inexperienced Java programmers partner with experienced Java programmers for this lab.

MapReduce is primarily designed to be used on large distributed clusters, so in this lab we're going to have you run jobs on mid-size clusters on EC2. EC2, as mentioned in lecture, is an Amazon service that lets you rent machines by the hour.

You should complete this lab on the machines in 330 Soda, which have the relevant tools and scripts for starting up a Hadoop cluster on EC2. If you are not sitting physically in front of a lab machine, you can access one (list of machine names) remotely by following these instructions. These directions rely on commands available only on the instructional machines, so you won't be able to complete this lab from your own desktop. As a result, you'll use your course account to complete this lab.

Background Information

In our last lab, we tried out some intro-level MapReduce problems. In this lab, we'll tackle a real-world problem and then run it on a real-world-sized dataset on EC2. The background section below is mostly the same as MapReduce Lab I, with some more detailed documentation references to help you along.

The MapReduce programming framework is primarily designed to be used on large distributed clusters. However, large, distributed jobs are harder to debug. For this lab, we'll be using Hadoop -- an open source platform which implements MapReduce programming -- in "local mode" initially, where your Map and Reduce routines are run entirely within one process. Once your program works, we'll try it out on a larger dataset on Amazon EC2!

Avoid Global Variables

One of the core tenets of MapReduce is that we want to avoid multiple machines working on a single, unpartitioned data set because of the associated overhead. As a result, your algorithms will very rarely need to use global variables. In the worst case, you may need to share one or two variables for configuration across machines. If this is necessary, we will indicate it to you specifically in the spec.
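To make this tenet concrete, here is a toy plain-Java sketch (an illustrative stand-in, not the Hadoop API) of a word count in which each "mapper" keeps purely local state for its own partition, and a "reducer" merges the independent results -- no shared globals needed:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NoGlobalsSketch {
    // Toy stand-in for a map task: each "mapper" sees only its own input
    // partition and keeps purely local state -- no shared globals.
    public static Map<String, Integer> mapPartition(List<String> partition) {
        Map<String, Integer> localCounts = new HashMap<>();
        for (String word : partition) {
            localCounts.merge(word, 1, Integer::sum);
        }
        return localCounts;
    }

    // Toy stand-in for the reduce phase: merge the independent local results.
    public static Map<String, Integer> reduce(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> local : mapOutputs) {
            local.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = reduce(Arrays.asList(
                mapPartition(Arrays.asList("a", "b", "a")),
                mapPartition(Arrays.asList("b", "c"))));
        System.out.println(result);  // a=2, b=2, c=1 (map ordering may vary)
    }
}
```

Because no mapper ever touches another mapper's state, the partitions could run on different machines with no coordination until the merge.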

Don't Hash Mutable Objects

Hashing mutable objects is dangerous and can be a key source of bugs that are very difficult to find and resolve. For more background, see this post on StackOverflow.
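To see the failure mode concretely, here is a minimal self-contained example: once a mutable object is inserted into a HashSet, mutating it changes its hashCode(), and the set can no longer find it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MutableHashDemo {
    public static void main(String[] args) {
        // ArrayList is mutable and computes hashCode() from its contents.
        List<Long> key = new ArrayList<>(Arrays.asList(1L, 2L));
        Set<List<Long>> seen = new HashSet<>();
        seen.add(key);
        System.out.println(seen.contains(key));  // true

        // Mutating the list after insertion changes its hashCode(), so the
        // HashSet now probes the wrong bucket and can't find the element.
        key.add(3L);
        System.out.println(seen.contains(key));  // false
        System.out.println(seen.size());         // still 1 -- the entry is stranded
    }
}
```

The stranded entry can neither be found nor removed by value, which is exactly the kind of silent bug that is painful to track down in a distributed job.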

Useful classes/java tips (specifically for implementing MutualFriends.java)

More Useful classes, Hadoop Documentation, and Additional Resources

Exercise Background

The following exercises use two different sample input files that can be found in ~cs61c/data/mrlab2 or on Amazon S3 (the filestore for EC2):

  1. small.seq -- a very small social network, for debugging
  2. large.seq -- a large social network, for testing out EC2 (only available on S3)

Notice the .seq extension, which signifies a Hadoop sequence file. These are NOT human-readable. For debugging purposes, ~cs61c/data/mrlab2/small.txt is provided (and is also shown as an example below).

We recommend deleting output directories when you have completed the lab, so you don't exceed your 500MB disk quota. You can do this by running:

$ make clean

Please be careful with this command as it will delete all MapReduce outputs generated in this lab.

Exercise 1: Generating a Mutual Friends List on Social Network Data

For this exercise you will need the Makefile and MutualFriends.java. We will be working locally (NOT on EC2) for this section, since debugging on a smaller dataset is much easier.

Suppose that, as an intern at a new social networking company, you're given the task of implementing the daily Mutual Friends List computation using MapReduce (a mutual friend of a person A and a person B is a person C that both A and B consider a friend). Unfortunately, the data has not been given to you in a very clean format: you're given key-value pairs of the form (person_1, person_2), where such a pair indicates a friendship between person_1 and person_2, but the input may or may not contain the corresponding pair (person_2, person_1). (As you may have noticed, social networks built on the concept of "friendship" are undirected graphs.)

It is your job as the intern to process this data and output a list of mutual friendships for every pair of friends in the dataset. For the input:

1 2
2 3
1 4
1 3
2 1

Your output should look like:

(1, 2)  Friends List: 3
(1, 3)  Friends List: 2
(1, 4)  Friends List: 
(2, 3)  Friends List: 1

In order to maintain user privacy, all usernames have been encoded as Longs, so in the above input, 1 is friends with 2, 2 is friends with 3, 1 is friends with 4, and 3 is friends with 1 (notice how order does not matter and we potentially have duplicates, like the first and last input pairs in this example).

Here is our general plan of attack:

  1. Use one set of Map/Reduce tasks to generate a friends list for each person. We must handle the following cases:
    1. We must generate the complete friends list for each person. For example, if (1, 3) was the only input pair representing the friendship between person 1 and person 3, both friend lists should reflect the friendship (3's friend list should have 1 as a friend and vice versa).
    2. At the same time, if we have been given both input pairs (1, 3) and (3, 1), we must ensure that the individual friends lists for 1 and 3 do not have duplicates. Data structures like HashSets will be helpful here.
  2. We use another set of Map/Reduce tasks to actually generate our mutual friends lists (the output shown above). As input, the mapper of this stage will receive the output of our previous reduce stage.

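As a sanity check on the algorithm (not the Hadoop skeleton itself), the plan above can be sketched in plain Java, with HashMaps standing in for the two Map/Reduce stages:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MutualFriendsSketch {
    public static Map<List<Long>, Set<Long>> mutualFriends(long[][] edges) {
        // Stage 1: build a friend list per person. Recording every edge in
        // both directions handles missing reversed pairs, and the HashSet
        // silently drops duplicates like (1, 2) and (2, 1).
        Map<Long, Set<Long>> friends = new HashMap<>();
        for (long[] e : edges) {
            friends.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            friends.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        // Stage 2: for each friend pair (a, b) with a < b, intersect the
        // two friend lists to get their mutual friends.
        Map<List<Long>, Set<Long>> out = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> entry : friends.entrySet()) {
            long a = entry.getKey();
            for (long b : entry.getValue()) {
                if (a < b) {  // emit each unordered pair exactly once
                    Set<Long> mutual = new HashSet<>(friends.get(a));
                    mutual.retainAll(friends.get(b));
                    out.put(Arrays.asList(a, b), mutual);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] edges = {{1, 2}, {2, 3}, {1, 4}, {1, 3}, {2, 1}};
        mutualFriends(edges).forEach((pair, mutual) ->
                System.out.println(pair + "  Friends List: " + mutual));
    }
}
```

On the sample input above this produces the four pairs (1, 2), (1, 3), (1, 4), and (2, 3) with the expected mutual friends. In the real MapReduce version, stage 1's "emit in both directions" happens in the mapper and the HashSet deduplication in the reducer, with the stage-2 job consuming stage 1's output.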
Fill in the blank areas in MutualFriends.java. The skeleton is heavily commented and a thorough read through it will be essential to completing the lab. We have provided various classes in MutualFriends.java that will be helpful for your implementation, in addition to various hints. Once you've completed your implementation, test it out on the small input set and confirm that we are in fact getting the correct output (the output shown above).

To run your code on the small dataset, do the following:

$ make clean
$ make
$ make runh

The output file part-r-00000 will be located in the output-out directory.

Exercise 2 Setup

First, you'll need to run the following commands to configure your account to work with EC2.

[in your lab13 directory]
$ bash ec2-init.sh
$ source ~/ec2-environment.sh

You should only need to run these commands once. If you have Hadoop or EC2 problems later on this lab, see the "When Things Go Wrong" section below.

Next, we must alter our code so that we have more than 1 reducer in use. Find the line in the main method in MutualFriends.java that looks like:

job.setNumReduceTasks(1);

And change it to:

job.setNumReduceTasks(40);

Once you've made this change, be sure to run make clean and make to recompile your code.

Make sure that you remember to make the above change! If you do not, your job will be very slow and you won't be able to finish the lab during the lab period.

Our Test Corpus and Accessing it With Hadoop

We have a larger dataset in the same format as the "small.seq" file you used to debug Exercise 1. We have stored this on Amazon's Simple Storage Service (S3), which provides storage that is located in the same datacenter as the EC2 virtual machines. You can access it from Hadoop jobs by specifying the "filename" s3n://cs61cMR2 ("s3n" stands for S3 Native). The data is stored compressed, so you should expect the output from a job run against it to be substantially bigger than the input.

EC2 Billing

Amazon EC2 rents virtual machines by the hour, rounded up to the next hour. Amazon provides several price points of virtual machines. In this lab, we will be using "High-CPU Extra-Large" ("c1.xlarge") virtual machines, which cost around $0.58 an hour (when rented on an on-demand basis) and provide the equivalent of about eight 2.5 GHz commodity processor cores and 7GB of RAM. Note that we are billed for all time when the machines are on, regardless of whether the machines are active, and we pay for at least one hour every time new machines are started. (So starting a machine for 5 minutes, terminating it, and starting an identical one for another 5 minutes causes us to be billed for 2 hours of machine time.)

In addition to billing for virtual machine usage by the hour, Amazon also charges by usage for out-of-datacenter network bandwidth and long-term storage. Usually these costs are negligible compared to the virtual machine costs.
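The round-up rule is easy to apply by hand, but as a worked example (using the approximate $0.58/hour c1.xlarge rate quoted above):

```java
public class Ec2CostEstimate {
    // Sketch of the billing rule described above: each machine is billed
    // per hour, rounded UP to the next whole hour. The $0.58/hr c1.xlarge
    // rate is the approximate figure quoted in this lab.
    public static double cost(int machines, double hoursPerMachine, double ratePerHour) {
        long billedHours = (long) Math.ceil(hoursPerMachine);
        return machines * billedHours * ratePerHour;
    }

    public static void main(String[] args) {
        // A 5-node cluster run for ~1.5 hours is billed for 2 hours/machine:
        System.out.printf("$%.2f%n", cost(5, 1.5, 0.58));  // 5 * 2 * $0.58 = $5.80
    }
}
```

Note that restarting a cluster resets the clock: two separate 5-minute runs of the same machine are billed as 2 full hours, not 10 minutes.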

How we're able to use EC2 "For Free"

Amazon generously gives CS61C a grant every semester for a limited number of EC2 credits for our activities. Thus, you get to try out EC2 at no cost to you!

Starting a Cluster

Go ahead and start a Hadoop cluster of 4 c1.xlarge worker nodes and 1 c1.xlarge master and worker node using:

$ hadoop-ec2 launch-cluster --auto-shutdown=170 large 5

This command may take a couple minutes to complete. When it is done, it should give you 2 URLs, one for the "namenode" and one for the "jobtracker". Open the two URLs in a web browser. If you lose track of these URLs, you can list them again with

$ hadoop-ec2 list large

(The --auto-shutdown option specifies to terminate the cluster after 170 minutes. This is intended to reduce the likelihood of expensive accidents, but please do not rely on this option to terminate your machines.  Amazon rounds up to the nearest hour, so 170 is intended to give you nearly 3 hours -- which should be plenty for a two-hour lab!)

Hadoop includes both a distributed filesystem implementation (called the Hadoop Distributed File System (HDFS)), which is similar to the Google File System, and a MapReduce implementation. Both of these use a single master program and several worker programs. The master program for HDFS is called the "namenode" since it stores all the filesystem metadata (file and directory names, locations, permissions, etc.), and the worker programs for HDFS are called "datanodes" since they handle unnamed blocks of data. Similarly, the master program for the MapReduce implementation is called the "jobtracker" (which keeps track of the state of MapReduce jobs) and the workers are called "tasktrackers" (which keep track of tasks within a job). Storage on HDFS tends to be substantially faster but more expensive than S3, so we're going to use it for outputs. This storage is automatically cleared when you shut down your instances.

Exercise 2: Run MutualFriends on EC2 with a Large Dataset

Run MutualFriends against the large dataset on S3 using:

$ hadoop-ec2 proxy large
$ hc large jar mutualfriends.jar MutualFriends s3n://cs61cMR2/large.seq hdfs:///mutfr-out

The first command sets up an encrypted connection from your lab machine to the master machine of the cluster. The hc command runs the "hadoop" command you ran earlier in lab against the remote cluster, using the connection created by the first command to submit jobs, etc. (The "large" here refers to the sizes of the individual machines, not to the size of your cluster.)

(If you get errors like "11/01/24 14:47:41 INFO ipc.Client: Retrying connect to server: .... Already tried 0 time(s)." from hc, that indicates that the connection probably went down, and you should run the proxy command again to reconnect.)

The filename "hdfs:///mutfr-out" specifies a location on the distributed filesystem of the cluster you have started.

Monitoring the Job

You should watch the job's progress through the web interface for the jobtracker. After the job starts, the main webpage for the job tracker will have an entry for the running job. While the job is running, use the web interface to look at a map task's logs. (Click on the number under "running map tasks".)

Find out where the following are displayed on the web interface and record (to a file) the final values when the job completes. (You'll need to show this file for checkoff.)

  1. The total runtime
  2. The total size of the input in bytes and records. (labeled as S3N_BYTES_READ)
  3. The total size of the map output in bytes and records.
  4. The total size of the "shuffled" (sent from map tasks to reduce tasks) bytes
  5. The number of killed map and reduce task attempts.
  6. The number of output records.

Retrieving the Output

After the job is complete you could in principle retrieve your output with the command:

$ hc large dfs -cp hdfs:///FILE_NAME DESTINATION

The output file for these jobs is prohibitively large, however, so we won't require you to actually retrieve the output.

Exercise 3: Run MutualFriends on a larger cluster

Now, run the same job on a larger cluster. Terminate your existing cluster with:

$ hadoop-ec2 terminate-cluster large

and start another with 10 worker nodes using:

$ hadoop-ec2 launch-cluster --auto-shutdown=170 large 10

From the web interface, record the following:

  1. The total time the job took
  2. The total number of killed map and reduce task attempts.

Answer the following and record your answers in a file:

  1. How many times faster was the larger cluster than the smaller one?
  2. What kind of scaling does this problem exhibit (strong or weak)?

Check off:

First, terminate any clusters using:

$ hadoop-ec2 terminate-cluster large

Then, answer the following:

  1. How much money did you spend on EC2? (See the "EC2 Billing" section for rates). Sanity check your estimate against the output of the 'ec2-usage' command. (Note: sometimes, the number supplied by ec2-usage is very wrong. Do not be alarmed if this is the case.)


When Things Go Wrong

Deleting Old Output Directories

Since the local disks on these instances are very large, you can feel free to invent a new name for outputs on HDFS to avoid the error of output directories already existing. If, however, you want to delete an old output from HDFS, you can do it with

$ hc large dfs -rmr hdfs:///NAME

Stopping running Hadoop jobs

Often it is useful to kill a job. Stopping the Java program that launches the job is not enough; Hadoop on the cluster will continue running the job without it. The command to tell Hadoop to kill a job is:

$ hc large job -kill JOBID

where JOBID is a string like "job_201101121010_0002" that you can get from the output of your console log or the web interface.

"Retrying connect to server ec2-....."

If you get this error from 'hc' try running

$ eval `hadoop-ec2 proxy large`

again. If you continue getting this error from hc after doing that, check that your cluster is still running using

$ hadoop-ec2 list large

and by making sure the web interface is accessible.

Last resort

It's okay to stop and restart a cluster if things are broken. But it wastes time and money.

Deleting configuration files

If you've accidentally deleted one of the configuration files created by bash ec2-init.sh, you can recreate it by rerunning bash ec2-init.sh.