CS61C Fall 2010 — Lab 3b: EC2
Introduction: EC2 and Hadoop
Amazon’s Elastic Compute Cloud (EC2) is one of the most popular “cloud computing” services. EC2 allows users to rent virtual machines on an hourly basis. These virtual machines are hosted on physical machines in Amazon’s datacenters. Because of the economies of scale involved in datacenters and the higher utilization achieved through more sharing of machines, Amazon can rent the virtual machines at a fairly low cost and still make a profit.
EC2 offers several types of virtual machines, called “instances,” two of which you will use in this lab. A small instance (“m1.small”) has the equivalent of roughly a 1 GHz CPU core, 2GB of RAM, a single hard drive and 400Megabit network. A high-CPU extra large instance (“c1.xlarge”) has the equivalent of 8 3GHz cores, 8 GB of RAM, four hard drives and a Gigabit network. Amazon sells these instances by the hour, rounded up to the next hour; for a small instance, Amazon bills $0.085 per hour, for a high-CPU extra large $0.68 per hour. The fee per virtual machine hour is the same no matter what the activity on the virtual machine is; however, there are some additional charges for data transfer outside the datacenter. Amazon charges $0.15 per gigabyte transferred out of EC2. (Currently, Amazon does not charge for data transfer into EC2.)
Along with EC2, Amazon offers related services. In this lab, we will also be using Amazon’s Simple Storage Service (S3). S3 offers long-term storage at a rate of about 15 cents per gigabyte-month, billed by the hour. S3 charges data transfer rates like EC2 does; however, data transfer between S3 and EC2 (in either direction) is free.
Most new users of EC2 and S3 would use them through Amazon’s web interface. For this class, we will be sharing one account and so cannot usefully give you access to Amazon’s web interface. So, instead, you will be accessing EC2 and S3 through a variety of command-line tools. Most of these tools were not created for this class; they were developed to be used by “real” users of EC2 and S3. We will give you credentials that will allow you to use these and other third-party EC2 and S3 tools with our account. For some functionality, for example, starting virtual machines, we have developed custom wrappers so we can track students’ EC2 usage accurately.
The Apache foundation’s Hadoop project is the most popular open source implementation of Map Reduce. Google has never released the original implementation of Map Reduce or much of the infrastructure it depends on (for example, the Google File System), so Hadoop provides an implementation of both. In this lab, we will be deploying Hadoop’s Distributed File System (HDFS) and Hadoop’s Map Reduce implementation across a cluster of virtual machines from EC2.
HDFS divides files up into fixed-sized chunks and distributes multiple copies of those chunks across all the machines in the cluster. This provides opportunities for Hadoop’s MapReduce implementation, when using one of those files as input, to avoid reading the file over the network and, when outputting files, to spread the load out over the cluster evenly. Hadoop’s Map Reduce implementation can also take input from other sources or send output to other destinations, for example, S3. (Of course, accessing files not stored on the cluster will be slower than accessing files stored in HDFS.)
Setup
Copy the lab materials to your home directory:
cp -R ~cs61c/labs/03b ./lab3b
Further commands will assume you are working from the “lab3b” directory you copied.
We are assuming that you are using the lab machines (possibly remotely). This lab is unlikely to work on other machines.
Exercise 0. Setup AWS Access
From your class account, run the commands
ec2-util --init
new-ec2-certificate
. ~/ec2-environment.sh
These will allocate you an AWS access key and secret access key (two short random-looking strings) and an SSH private key (a file called username-default.pem in your home directory) and set up several configuration files for command-line utilities you will use in this lab to access EC2 and S3. See Note #1 for more information regarding your AWS keys.
If you find that you are having trouble with your AWS secret key, see Note #2 below!
For more interesting information about command-line utilities, see Note #3.
Accounting for your EC2 Usage
We have provided a command ec2-usage which will estimate the cost of your EC2 usage.
Note that this is an estimate and does not account for networking or storage charges. The estimate of instance hours may be off because it does not observe the exact finishing times of instances, so it may not realize when the hour-mark is not crossed before termination.
Exercise 1. Testing Locally
The lab directory contains a Map Reduce implementation of word count. Type make to build them.
The program files in the directory:
(bar, 1)
(bar, 4)
(foo, 5)
(foo, 7)
(foo, 3)
(quux, 1)
and outputs a list like
(bar, 5)
(foo, 15)
(quux, 1)
Now run
./wordcount-mapper <~cs61c/data/shaks12.txt | sort | ./wordcount-reducer flip >out.txt
This runs wordcount-mapper with ~cs61c/data/shaks12.txt as input, sorts the output (using the Unix sort utility), and then runs wordcount-reducer on the sorted output, saving the result in out.txt. You can use the Unix sort utility on out.txt to see the most common word in our original corpus, which is Project Gutenberg’s version of the complete works of Shakespeare.
To parallelize this task across multiple machines, we will be running this in Hadoop Streaming, an interface to the Hadoop MapReduce implementation which allows the mapper, reducer, and combiner programs to be ordinary executables. Before using EC2 credits, you should test the program on Hadoop in local mode.
Test the same sequence with hadoop using
hadoop jar ~cs61c/hadoop/streaming.jar \
-mapper ./wordcount-mapper \
-reducer './wordcount-reducer flip' \
-combiner './wordcount-reducer' \
-input ~cs61c/data/shaks12.txt \
-output hadoop-local-out
(The "combiner" program takes reducer input and produces new, ideally smaller reducer input. Hadoop runs combiners to reduce the volume of data saved to disks or sent over the network.)
This will create a directory called 'hadoop-local-out' containing a file named 'part-00000'. Hadoop MapReduce creates output directories like this with one file for each reducer. (In local mode, Hadoop only supports one reducer.) The output collected in part-00000 should be the same as that collected in 'out.txt'.
Question 1, Part 1:
What were the most common words in our corpus?
Question 2, Part 2:
The input file is around 5.3 MiB. Use the shell's "time" command ("time ./wordcount-mapper ...") to estimate the throughput of the manual and Hadoop local mode. Estimate the startup overhead of Hadoop Streaming by using the empty input file "~cs61c/data/empty.txt" instead of "~cs61c/data/shaks12.txt". What have you found out about the overhead associated with Hadoop?
Exercise 2a. Testing Remotely
Running make should have produced Linux versions of the above programs, called wordcount-mapper-linux-x86 and wordcount-reducer-linux-x86.
First, let's upload these files to S3. You can use the command-line program 's3cmd' to access S3. First create a new bucket on S3:
s3cmd mb s3://YOUR-USERNAME
Then upload the binaries into that bucket:
s3cmd put wordcount-mapper-linux-x86 wordcount-reducer-linux-x86 s3://YOUR-USERNAME/
Now edit wordcount.sh, and replace each instance of “BUCKET” with the name of the S3 bucket you created during setup. As you might guess, a URL like “s3n://BUCKET/name” specifies the file “/name” stored in “BUCKET” on S3 (Hadoop expects 's3n://' and not 's3://' unlike s3cmd). The “-files" option instructs Hadoop to make a file available to every map and reduce task which is run. “-D mapred.reduce.tasks=2” specifies that 2 reducers should be run (the default is 1).
After editing wordcount.sh, start an MapReduce cluster on EC2 using:
hadoop-ec2 launch-cluster --auto-shutdown=115 YOUR-USERNAME-testing 1 nn,snn,dn,jt,tt
For details regarding this command and the associated options, see Note #2.
It will take some time (possibly minutes) for the virtual machine to boot. Note the URL that is output.
Now, copy wordcount.sh to your instance with
hadoop-ec2 push YOUR-USERNAME-testing wordcount.sh
And run it with
hadoop-ec2 exec YOUR-USERNAME-testing bash wordcount.sh
When complete, this job will create a directory in your S3 bucket called “wordcount-out” containing two files (“part-00000” and “part-00001”).
List these files with
s3cmd ls s3://YOUR-USERNAME/wordcount-out/
and download them with
s3cmd get s3://YOUR-USERNAME/wordcount-out/part-00000
These files are the output of each reducer. Though each reducer received its keys in sorted order, the keys (the words) were split randomly (by a hash function) across the reducers, so each file will contain the complete word counts for approximately half the words.
We’d probably be more interested in a sorted output to get the most frequently occurring words. Hadoop provides a preconstructed sorting MapReduce job that partitions the keys between reducers evenly. To divide the keys approximately evenly across the reducers at the same time, the first step of this sorting job is to take a sample of the input files and choose places to divide the keyspace. With two reducers on your wordcount output with number of occurances as keys, 0 through 2 will be assigned to the first reducer and the remaining keys to the second reducer.
sort.sh contains a template to run this sort job. Edit it to use your S3 bucket and run it as before. Collect its output as before. (Note that our script first copies the output from S3 to a local filesystem (HDFS) because the sort job is not written to handle input on S3.)
You can monitor your Hadoop cluster through the URL that was output by your launch-cluster command. For security, access to this URL is restricted to the machine you ran hadoop-ec2 on. If you have misplaced this URL, look at the name "ec2-..." output in 'ec2din | grep YOUR-USERNAME' and goto "http://ec2-...:50030/".
If you are doing the lab remotely, see Note #6 below!!
Question 2:
From the web interface, how long did each of the MapReduce map and reduce tasks take? What was the size of their inputs and outputs?
After you have gathered your answer to the previous question, shutdown this 1-node cluster by running:
hadoop-ec2 terminate-cluster YOUR-USERNAME-testing
Exercise 3. Scaling Up
Now, we’d like run this program over a much larger dataset. We will use wordcount-big.sh. Modify this file to use your S3 bucket (to get the wordcount-mapper and wordcount-reducer programs) as before. This file specifies a sequence of two jobs; the first is the wordcount job, set to output to the Hadoop Distributed File System (HDFS). HDFS storage is local to the virtual machines of the cluster you will start, and is thus faster (and cheaper) than S3. The second job reads that output and sorts it, and stores the result back in HDFS. (Since the output is quite large and we are only interested in a small part of it, we don't bother writing it all to S3.)
The input to this set of jobs is in s3n://cs61c-data/enwiki-pages/*, a copy of the article text from the English language Wikipedia as of January 2010. This corpus is about 20G and contains about 3 billion words of text.
We will use 5 high-cpu extra large instancesfor our cluster:
hadoop-ec2 launch-cluster --auto-shutdown=60 YOUR-USERNAME-large 1 nn,snn,dn,jt,tt 4 dn,tt
Copy wordcount-big.sh to that cluster and run it as before. With this much compute power, the job should complete in about 20 minutes. Observe the jobs through the web interface as before.
Question 3, Part 1:
While your code is running on the cluster, consider this problem:
“Determine which Wikipedia articles can't be reached by clicking links in articles up to 10 times starting with the article "2009".”
Outline way to solve this problem using Map Reduce over the Wikipedia text. (Hint: You may use multiple Map Reduce jobs) Discuss your approach with the TA.
To download files from the Hadoop cluster securely, we need to setup an encrypted network tunnel from your local machine. Run that with
eval `hadoop-ec2 proxy YOUR-USERNAME-large`
Now, look at the hdfs:///user/root/wikipedia-wordcount-sorted-out using a command 'hc' we have provided,:
hc YOUR-USERNAME-large dfs -ls hdfs:///user/root/wikipedia-wordcount-sorted-out
And download the last reducer's output (because this output file is large, download it to /tmp or similar, not your home directory) using:
hc YOUR-USERNAME-large dfs -copyToLocal \
hdfs:///user/root/wikipedia-wordcount-sorted-out/part-00031 /tmp/YOUR-NAME-out
The hc command runs the local 'hadoop' command we used to test locally configured to use your Hadoop cluster through the proxy we ran.
Now stop the tunnel you started earlier with:
kill $HADOOP_CLOUD_PROXY_PID
Question 3, Part 2:
What was the throughput of this large word count + sort? What's the ratio of the number of times "a" appears in the English Wikipedia to the number of times "an" appears?
After you're done, shutdown your cluster with:
hadoop-ec2 terminate-cluster YOUR-USERNAME-large
And remove the output file you downloaded.
Question 3, Part 3:
Estimate the total cost of your Amazon Web Services (EC2 + S3) usage today.
Before You Go (even if you don’t finish)
Cleanup (save money!):
Notes
We have also provided a copy of the EC2 command-line API tools, which includes commands like:
ec2-authorize YOUR-USERNAME-testing -p 50000-60000 -s YOUR-IP/32
ec2-authorize YOUR-USERNAME-testing -p 80 -s YOUR-IP/32
Alternately, you can proxy access over an SSH tunnel: