University of California, Berkeley
College of Engineering
Department of Electrical Engineering and Computer Science
CS 61C, Spring 2012 HW 1
TA: Alan Christopher
Due Sunday, January 22, 2011 @ 11:59pm
Goals
This assignment is designed to get you back into the swing of things, thinking about Warehouse Scale Computing, and to help you make sure your account works so that you can submit assignments.
Submitting Your Solution
Submit your solution by creating a directory named "hw1". In this directory, create a file named "hw1.txt" and a file named cs61c-XX.jpg (see problem 0) and put your answers inside. Then run submit hw1
.
If you have an account already, or you have not yet had your first lab (accounts will be distributed then), do not email me with your submission. Seriously. Use the standard inst submission mechanism.
In order to avoid blowing through our disk quota we are requiring that the cs61c-XX.jpg's submitted for problem 0 be of size less than approximately 250 kB.
Problem 0: Photograph
Acquire a jpeg of yourself in which your face is clearly visible. Submit this file as cs61c-XX.jpg, where cs61c-XX is your account login (these will be distributed in lab). Additionally, we request that the image you submit not be much larger than 250 kB in size. Note also that your submission can go through without this file, so make sure that you include it in your submission.
Problem 1: Basics Potpourri
- Match these words to their definitions (each is used exactly once): virtual worlds, desktop computers, servers, low-end servers, supercomputers, terabyte, petabyte, datacenters, embedded computers, multicore processors, VHDL, RAM, CPU, operating system, compiler, bit, instruction, assembly language, machine language, C, assembler, high-level language, system software, application software, cobol, fortran.
- Computer used to run large problems and usually accessed via a network
- 2^50 bytes
- Computers composed of hundreds to thousands of processors and terabytes of memory
- Today's science fiction application that will be available in the near future
- Part of a computer called central processor unit
- A kind of memory called random access memory
- Desktop computer without screen or keyboard usually accessed via a network
- A microprocessor containing several processors in the same chip
- Thousands of processors forming a large cluster
- Currently the largest class of computer that runs one application or one set of related applications
- Program that translates statements in high-level language to assembly language
- Special language used to describe hardware components
- Personal computer delivering good performance to single users at low cost
- Program that translates symbolic instructions to binary instructions
- High level language for scientific computation
- Commands that the processors understand
- Binary language that the processor can understand
- High level language for business data processing
- Symbolic representation of machine instructions
- Interface between user's program and hardware providing a variety of services and supervision functions
- Software/programs developed by the users
- Software layer between the application software and the hardware that includes the operating system and the compilers
- High-level language used to write application and system software
- Portable language composed of words and algebraic expressions that must be translated into assembly language before run in a computer
- 2^40 bytes
- Binary digit (0 or 1)
- If a computer connected to a 1 gigabit Ethernet network needs to send a 256 Kbytes files, how long would it take.
- Assuming that a cache memory is ten times faster than a DRAM memory, that DRAM is 100,000 times faster than magnetic disk, and that flash memory is 1000 times faster than disk, find how long it takes to read a file from a DRAM, a disk, and a flash memory if it takes 2 microseconds from cache memory.
- Lots of service providers boast about offering "five nines" of uptime in their service level agreements.
- What does this mean?
- How much downtime can they tolerate per year without breaking their SLA?
-
Why do the designs of warehouse scale computers need to be inherently fault-tolerant? Give a quantitative example that shows why (don't just cite a failure rate!).
-
Why does Google use desktop-grade hardware, instead of server-grade hardware? What challenges does this decision create?
Problem 2: MapReduce
For the parts that ask you to design an algorithm, feel free to use pseudocode, explaining at a high level what the algorithm is.
-
Give an example of a problem that, while in some sense parallellizable, is not a good fit for MapReduce. Justify your answer.
-
Assume you have a list of key-value pairs of the form (document, text), associating a body of text with the name of a document. Give the format of intermediate key-value pairs, and the pseudocode for the mapper and reducer, that you could use to calculate for each word, how many times it was used across all documents.
-
A fun game to play with your friends is "degrees of Wikipedia separation":
given article A and article B, who can get from A to B with the fewest number of clicks?
The number of pages you can get to in just a few clicks from a popular article is quite
large, larger than you could wrap your head around, so you decide to use a MapReduce
framework to search for the optimal path through Wikipedia.
Assume that you have access to a list of 2-tuples (article-name, outgoing-link) that
describe Wikipedia's site graph. Design an algorithm that uses MapReduce to find the
pages reachable in n
clicks or less from some start page s
.
You should describe any mappers, reducers, and key-value formats you use in your
solution.
Hint: you may find it useful to think of there being two or more different "classes" of
input key-value pairs, one of which will be the (article, outgoing-link) tuples
described above. You could tell these classes of key-value pairs apart by having the
first part of every key be a type tag, for instance.