CS 61C Fall 2012 Lab 3:
Hadoop Performance on EC2


Setup

You may work on this lab with a partner or individually. Partners are often more educational and more fun, and we recommend having one. Additionally, the surge in enrollment means that we have fewer ec2 credits available per student this semester than in previous semesters. We don't expect this to be a problem, but conserving credits, e.g. by doing this lab with a partner, will help to make sure that it doesn't become a problem.

MapReduce is primarily designed to be used on large distributed clusters. And now, we're going to have you run on mid-size clusters on EC2. EC2, as was mentioned in lecture, is an Amazon service that lets you rent machines by the hour.

You should complete this lab on the machines in 330 Soda, which have the relevant tools and scripts for starting up a Hadoop cluster on EC2. If you are not sitting physically in front of a lab machine, you can access one (list of machine names) remotely by following these instructions. These directions rely on commands on the instructional machines. You won't be able to do this from your desktop. As a result, you'll use your course account to complete this lab. (If you did not get a course account yet, talk to your TA.)

Lab Setup Commands

First, pull the files for this lab into your working directory.

    git pull ~cs61c/labs/fa12/03 master

Now run this command to configure your account to work with ec2.

    cd <working_repository>/lab03
    bash ec2-init.sh
    source ~/ec2-environment.sh

You should only need to run these commands once.

If you have Hadoop or EC2 problems later on this lab, see the "When Things Go Wrong" section below.

There are some slight changes to "WordCount.java" from last week's Lab 2, so do be sure to use this week's copy. As with last week's lab, you can build "wc.jar" by typing "make". Do so.

Our Test Corpus and Accessing it With Hadoop

We have a larger subset of newsgroup posts like the "sample.seq" file you used in lab last week. We have stored this on Amazon's Simple Storage Service (S3), which provides storage that is located in the same datacenter as the EC2 virtual machines. You can access it from Hadoop jobs by specifying the "filename" s3n://cs61cUsenet ("s3n" stands for S3 Native). The data is stored compressed, so you should expect output from a job run against it to be substantially bigger.

EC2 Billing

Amazon EC2 rents virtual machines by the hour, rounded up to the next hour. Amazon provides several price points of virtual machines. In this lab, we will be using "High-CPU Extra-Large " ("c1.xlarge") virtual machines, which cost around 0.68 dollars an hour (when rented on an on-demand basis) and provide the equivalent of about 8 2.5GHz commodity processor cores and 7GB of RAM. Note that we are billed for all time when the machines are on, regardless of whether the machines are active, and we pay for at least one hour every time new machines are started. (So starting a machine for 5 minutes, terminating it, and starting an identical one for another 5 minutes causes us to be billed for 2 hours of machine time.)

In addition to billing for virtual machine usage by the hour, Amazon also charges by usage for out-of-datacenter network bandwidth and long-term storage. Usually these costs are negligible compared to the virtual machine costs.

Starting a Cluster


Start a Hadoop cluster of 4 c1.xlarge worker nodes and 1 c1.xlarge master and worker node using:

    hadoop-ec2 launch-cluster --auto-shutdown=170 large 5 

This command may take a couple minutes to complete. When it is done, it should give you 2 URLs, one for the "namenode" and one for the "jobtracker". Open the two URLs in a web browser. If you lose track of these URLs, you can list them again with

    hadoop-ec2 list large

(The --auto-shutdown option specifies to terminate the cluster after 170 minutes. This is intended to reduce the likelihood of expensive accidents, but please do not rely on this option to terminate your machines.  Amazon rounds up to the nearest hour, so 170 is intended to give you nearly 3 hours -- which should be plenty for a two-hour lab!)

Hadoop includes both a distributed filesystem implementation (called the Hadoop Distributed File System (HDFS)), which is similar to the Google File System, and MapReduce implementation. Both of these use a single master program and several worker programs. The master program HDFS is called the "namenode" since it stores all the filesystem metadata (file and directory names, locations, permissions, etc.), and the worker programs for HDFS are called "datanodes" since they handle unnamed blocks of data. Similarly, the master program for the MapReduce implementation is called the "jobtracker" (which keeps track of the state of MapReduce jobs) and the workers are called "tasktrackers" (which keep track of tasks within a job). Storage on HDFS tends to be substantially faster but more expensive than S3, so we're going to use it for outputs. This storage is automatically cleared when you shut down your instances.

Exercise 0: running WordCount, small cluster

Run WordCount against the corpus using

    hadoop-ec2 proxy large
    hc large jar wc.jar WordCount s3n://cs61cUsenet/s2006 hdfs:///wordcount-out


The first command sets up an encrypted connection from your lab machine to the master machine on cluster. The hc command runs the "hadoop" command you ran in last week's lab against the remote cluster, using the connection created by the first command to submit jobs, etc. (The "large" here refers to the sizes of the individual machines, not to the size of your cluster.)

(If you get errors like "11/01/24 14:47:41 INFO ipc.Client: Retrying connect to server: .... Already tried 0 time(s)." from hc, that indicates that the connection probably went down, and you should run the proxy command again to reconnect.)

The filename "hdfs:///wordcount-out" specifies a location on the distributed filesystem of the cluster you have started.

Monitoring the Job

You should watch the job's progress through the web interface for the jobtracker. After the job starts, the main webpage for the job tracker will have an entry for the running job. While the job is running, use the web interface to look at a map task's logs. (Click on the number under "running map tasks".)
 
Find out where the following are displayed on the web interface and record (to a file) the final values when the job completes:
  1. The total runtime
  2. The total size of the input in bytes and records. (labeled as S3N_BYTES_READ)
  3. The total size of the map output in bytes and records. 
  4. The total size of the "shuffled" (sent from map tasks to reduce tasks) bytes
  5. The number of records (i.e. words) in the output
  6. The number of killed map and reduce task attempts.
  7. The number of combine input and output records. (This job uses a combiner, which is a function that is run between the mapper and the reducer before all the map output is available. Some of the work of the reducer is moved into this combiner.)

Retrieving the Output

After the job is complete you could in principle retrieve your output with the command

    hc large dfs -cp hdfs:///FILE_NAME <dst>

The output file for these jobs is prohibitively large, however, so we won't require you to actually retrieve the output.

Exercise 1: On a larger cluster

Now, run the same job on a larger cluster. Terminate your existing cluster with

    hadoop-ec2 terminate-cluster large

and start another with 10 worker nodes as with

    hadoop-ec2 launch-cluster --auto-shutdown=170 large 10

From the web interface, record the following:
  1. The total time the job took
  2. The total number of killed map and reduce task attempts.

Answer the following and record your answers in a file:
  1. How many times faster was the larger cluster than the smaller one?
  2. What was the throughput of the large and small cluster (in MB/sec)? 
  3. What was the total size of the "shuffled" bytes on the larger cluster?

Exercise 2: big and little data

And now for one last experiment. This is designed to give you some insight about the difference between "strong scaling" and "weak scaling". Strong scaling is when adding more machines makes your algorithm faster on a given data. Weak scaling is when adding more machines lets you process more data in the same amount of time. As we alluded to in lecture, these don't always go together.

You should have the bigger cluster running at this point. Try running the job with a smaller input, 's3n://cs61cUsenet/s2005'. (This is a substantially smaller amount of data than 's3n://cs61cUsenet/s2006' contained.) You should use a different output directory than you did for exercise 1. There is plenty of space, but if you want to delete it, see "Deleting Old Output Directories" below for instructions.

Record or compute the following:
  1. How much data did your job input?
  2. How long did it take?
  3. How does this compare in MB/sec to the larger data set?

Check off:

First, terminate any clusters using

    hadoop-ec2 terminate-cluster large

Then, answer the following:
  1. How much money did you spent on EC2? Sanity check your estimate with the output with of the 'ec2-usage' command. (Note: this number, too, is an estimate.)

Before you leave:

Checkoff:

When things go wrong

Deleting Old Output Directories

Since the local disks on these instances are very large, you can feel free to invent a new name for outputs on HDFS to avoid the error of output directories already existing. If, however, you want to delete an old output from HDFS, you can do it with

    hc large dfs -rmr hdfs:///NAME

Stopping running Hadoop jobs

Often it is useful to kill a job. Stopping the Java program that launches the job is not enough; Hadoop on the cluster will continue running the job without it. The command to tell Hadoop to kill a job is:
    
    hc large job -kill JOBID

where JOBID is a string like "job_201101121010_0002" that you can get from the output of your console log or the web interface.

"Retrying connect to server ec2-....."

If you get this error from 'hc' try runnig

    eval `hadoop-ec2 proxy large`

again. If you continue getting this error from hc after doing that, check that your cluster is still running using
    
    hadoop-ec2 list large

and by making sure the web interface is accessible.

Last resort

It's okay to stop and restart a cluster if things are broken. But it wastes time and money.

Deleting configuration files

If you've accidentally deleted one of the configuration files created by bash ec2-init.sh, you can recreate it by rerunning bash ec2-init.sh.