In this project you will use MapReduce to analyze data from real social networks. In Part 1, you will develop and test your code locally; in Part 2, you will run it on EC2. You will submit your code and answers via git to your class github repo. You may work with a partner.
Have you ever heard of the infamous six degrees of separation or played the game Six Degrees of Kevin Bacon? They are both outcomes of the small-world property of social networks, which states that the average distance between any two vertices grows very slowly (logarithmically) relative to the size of the network. In practice, even for very large social networks, the median distance is small -- Twitter user follows (4), MSN Message Conversations (6), Facebook friendships (5), and Hollywood co-actors (4).
We will use the large-scale processing capabilities of a Hadoop cluster to analyze data from real social networks and approximate the distance distribution, to get a feel for how far apart most people are. To do this, we will randomly pick a subset of the vertices in the graph and compute their distances to all other vertices. Your final program will take in a social graph and a constant for sampling, and it will return a histogram of the approximated distance distribution.
Formally, we can think of each input social network as defining a graph. Each person is a vertex and each relationship is an edge. This graph is unweighted since all relationships (friendships, follows, collaborations) are equivalent. The distance between any two vertices is the number of edges in the shortest path between them. To find all of the distances from one starting point, one would normally use a Single-Source Shortest Path algorithm such as Dijkstra's Algorithm, but since the graph is unweighted, we can get away with a simple Breadth-First Search (BFS). Better yet, the BFS will be much easier to parallelize. To get a feel for how this works, examine the pseudocode for a serial BFS implementation below:
def bfs(vertices, s):
    dist = {}
    for v in vertices:
        dist[v] = -1
    dist[s] = 0
    queue = [s]
    for v in queue:
        for n in neighbors(v):
            if dist[n] == -1:
                dist[n] = dist[v] + 1
                queue += [n]
    return dist
This variant of breadth-first search returns the distance of each vertex from the starting point s. Notice that each vertex is visited at most once: its distance is set the first time it is reached, and never changed afterward. Your parallel version for MapReduce will be different, but the above example shows that BFS can compute distances on an unweighted graph.
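To see the structure that maps well onto MapReduce, it helps to rewrite the search to advance level by level: each pass over the current frontier corresponds to one round (in the project, one mapreduce job). This is an illustrative sketch only, not the required structure; the names `bfs_levels` and `adj` (a dict mapping each vertex to its out-neighbors) are ours.

```python
def bfs_levels(adj, s, max_iterations=20):
    """Level-synchronized BFS: one pass over the frontier per level."""
    dist = {s: 0}
    frontier = [s]
    level = 0
    while frontier and level < max_iterations:
        next_frontier = []
        for v in frontier:
            for n in adj.get(v, []):
                if n not in dist:          # first visit fixes the distance
                    dist[n] = level + 1
                    next_frontier.append(n)
        frontier = next_frontier
        level += 1
    return dist
```

Because every vertex in the frontier is expanded independently within a level, each pass parallelizes naturally.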
We will only perform breadth-first searches from a subset of the vertices to save time, since examining all of them adds little fidelity. The program will: load the social graph, perform multiple breadth-first searches, and compute the histogram. We recommend performing the searches simultaneously, since doing so copies the graph fewer times and requires fewer mapreduces, both of which accelerate processing. To pick the starting points, we search from each vertex with probability 1/denom, so in expectation the program performs num_vertices/denom searches. We leave the actual number of searches to chance because this method suits a distributed setting like MapReduce: for each vertex, the choice of whether to search from it is completely parallel, and the only global information needed is the selection probability. It does not require calculating (and distributing) how many vertices are in the graph, or making sure each starting vertex exists.
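The per-vertex selection can be sketched in a few lines. This is an illustrative sketch; the function name and the `seed` parameter (added only for reproducibility) are ours, not part of the skeleton code.

```python
import random

def pick_starting_points(vertices, denom, seed=0):
    """Independently select each vertex with probability 1/denom."""
    rng = random.Random(seed)
    return [v for v in vertices if rng.random() < 1.0 / denom]
```

In expectation this yields len(vertices)/denom starting points, and each vertex's decision requires no global knowledge of the graph.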
The histogram simply totals how many shortest paths there are of each distance.
The input graphs are encoded in Hadoop's SequenceFileType, and each element is (key, value) = (LongWritable source, LongWritable destination). The output should be in Hadoop's TextFormat, with (key, value) = (LongWritable distance, LongWritable total), where total is the number of shortest paths with that distance. The breadth-first searches should stop early if there are no more distances to be computed, but they also should not run more than the provided MAX_ITERATIONS = 20, to limit runtime in the presence of a few outliers.
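The early-stop condition can be driven by a per-round count of how many distances were updated (in Hadoop, typically read from a counter after each job finishes). A hypothetical driver loop, where `step` stands in for launching one BFS mapreduce round:

```python
MAX_ITERATIONS = 20

def run_bfs_rounds(step, max_iterations=MAX_ITERATIONS):
    """step() runs one BFS round and returns how many distances it
    updated (a stand-in for reading a Hadoop counter).
    Returns the number of rounds executed."""
    for i in range(max_iterations):
        if step() == 0:
            return i + 1
    return max_iterations
```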
The vertices in each graph are given by long identifiers. The identifier range is not necessarily contiguous (e.g. a graph could have vertices {0,1,5,9}). Each input relation is intended to be treated as a directed edge. If the original relation is undirected, the other direction for that relation will be somewhere in the input. There can be repeat relations, but there will not be loops (self-edges).
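Since identifiers may be non-contiguous and relations may repeat, a graph loader can normalize the edge list into adjacency sets. A small sketch (the edge values here are made up for illustration):

```python
from collections import defaultdict

edges = [(0, 1), (0, 1), (1, 5), (9, 5)]  # repeats allowed, ids not contiguous
adj = defaultdict(set)
for src, dst in edges:
    adj[src].add(dst)   # each relation is a directed edge; the set collapses repeats
```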
The denom constant will be fed in on the command line. To propagate it to all Hadoop instances in the cluster, the skeleton code will actually write it to a file and use the HDFS caching feature to distribute it. The skeleton code takes care of this for you.
If a vertex is unreachable (or has a distance greater than MAX_ITERATIONS), it should not contribute to the histogram.
To help understand the intended operation, we provide an example.
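For instance, on the ring4 graph with denom=1 (search from every vertex), the expected histogram is {(0,4),(1,4),(2,4),(3,4)}. A self-contained Python sketch of that computation (the code is ours, not part of the skeleton):

```python
from collections import Counter

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]           # ring4: directed ring
adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)

def bfs(adj, s):
    dist = {s: 0}
    queue = [s]
    for v in queue:                                # queue grows as vertices are found
        for n in adj.get(v, []):
            if n not in dist:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist

histogram = Counter()
for start in adj:                                  # denom = 1: every vertex is a start
    histogram.update(bfs(adj, start).values())
# histogram: {0: 4, 1: 4, 2: 4, 3: 4}
```

Each of the 4 searches finds one vertex at each distance 0 through 3, so every distance totals 4.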
We provide you with SmallWorld.java and a Makefile in ~cs61c/proj1. You can copy them to your home directory by:
$ cp -r ~cs61c/proj1 ~
There should not be any need for you to create additional files, but if you do, be sure that make compiles them all and that main is still present in the SmallWorld class in sw.jar. Feel free to modify SmallWorld.java however you want while still meeting the program requirements. The code in there is intended to take care of tedious coding details and to provide examples of useful things you may want to modify or mimic in your solution.
The skeleton code assumes your code will use three types of mapreduces (graph loading, breadth-first search, and histogram making), and the code in main supports that. The skeleton code is intended to demonstrate how to chain and even iterate multiple mapreduce jobs. Currently all of the maps and reduces are the identity, except LoaderMap, which filters out some of the edges. It will need to be changed, but it is there to demonstrate accessing and using denom.
The EValue class is provided to give an example implementation of a class that implements Hadoop's Writable interface. This allows it to be the value of a map or reduce phase. For it to be used as a key type for map or reduce, it would need to implement the WritableComparable interface. EValue currently packages an enumerated type and a long together, but feel free to modify it to suit your implementation.
Counters are a useful feature of Hadoop that lets you count properties of your program as the data flows by. There are default counters that already count things such as how many map and reduce tasks are done each phase. The skeleton code provides an example of making and reading a new counter on lines 122 and 180. It reads in how many edges the graph input has and prints them out.
You should complete this project on the machines in 330 Soda. If you are not sitting physically in front of one of these lab machines, you can access one of them remotely by following these instructions. The code should run both locally and remotely the same as in lab2 and lab3. We recommend spending the majority of your development time working locally and with a small dataset to speed up and simplify debugging. The syntax for using the completed program is:
$ hadoop jar sw.jar SmallWorld input_graph output_dir denom
The input graphs are available in the following locations:

Locally (in ~cs61c/p1data/):
- ring4.seq - 4 vertices in a ring (0→1, 1→2, 2→3, 3→0)
- cit-HepPh.sequ - 35K vertices from High Energy Physics collaborations

On EC2 (in s3n://cs61cSp12/):
- hollywood.sequ - Hollywood co-star database circa 2009 (1M actors)
- wikipedia.seq - Wikipedia links circa 2011 (5.7M articles) - For fun: not part of assignment
- twitter - Twitter user follow graph circa 2009 (60M profiles, 1.2B follows) - Very large: if course has budget, will make available

Some tips:
- Use git to manage your code for this project. You will need to submit your code using git, so you might as well take advantage of it while you are developing.
- The output between mapreduce phases does not need to be TextOutput. The SequenceFileOutputFormat is the fastest because it is a compressed binary encoding.
- The correct output for ring4.seq with denom=1 is {(0,4),(1,4),(2,4),(3,4)}.
- When running cit-HepPh.sequ on hive, you should be able to complete within your disk quota with denom=10000.
- ring4.seq and cit-HepPh.sequ are also available in s3n://cs61cSp12/ to test.
- To remove old output directories on the cluster:
$ hc large dfs -rmr hdfs:///*-out hdfs:///user/cs61c-XX/*-out
- To set the number of reduce tasks for a job, you can set it in your source file with:
job.setNumReduceTasks(1);
To set the value globally for all reducers from the command line, use:
$ hc large jar sw.jar SmallWorld -Dmapred.reduce.tasks=20 ...
If you use the command line option, override it to be 1 for your histogram reduce (with the source snippet above) because we want only 1 output file.

Complete the problem locally and submit SmallWorld.java. Submit the Makefile (if modified) or any additional source files if needed.
Before running your code on EC2, run it locally to be sure it is correct and decently efficient. We are providing a solution at ~cs61c/proj1/sw_ref.jar to use for Part 2 for those whose Part 1 code isn't fast enough. If your code is efficient, please use it.
Submit SmallWorld.java, the Makefile (if modified), and any additional source files it needs. Additionally, complete the following questions and submit the answers in proj1.txt.
Run hollywood.sequ with denom=100000 on clusters of size 6, 9, and 12. How long does each take? How many searches did each perform? How many reducers did you use for each? (Read the rest of the questions to see what other data you will need.) If ec2-usage returns bogus values, please try to approximate (and indicate this).

You will submit this project via git. You should have set up access to your own CS 61C repo in lab1. Place your code in a subdirectory called proj1 inside of your class repo. Tag the commit for your Part 1 submission proj1-1 and tag the commit for your Part 2 submission proj1-2. To be fully submitted, those commits need to be pushed to github by their respective deadlines. The code below assumes the most recent commit is the one you want to submit for Part 1.
$ cd (your local repo)
$ git tag proj1-1
$ git push origin (branch of your commit)
$ git push --tags
Be sure to put your name and login (and your partner's name and login if you have one) at the top of your submitted source files, especially SmallWorld.java