CS61C Lab 15

Multicore Programming and POSIX Threads

Survey

Before getting checked off for this lab, please complete the following two surveys. We take your comments very seriously, so please answer thoughtfully. You will receive 1 EPA point (yes, they are real) for completing both surveys.
The cs61c survey is specific to our course and can be checked for completion using this utility.
The department survey goes to the CS department as well as the course staff.

Background

Goals

This is a brand new and experimental CS 61C lab! Please be patient with any bumps in the road...

You will run this lab on the new instructional cluster, which consists of 26 compute nodes each with:

  • 2 Quad-Core Intel Xeon CPUs @ 2.33GHz (8 cores total)
  • 64+64 KiB of L1 cache
  • 4 MiB of L2 cache
  • 8 GiB of RAM
  • 10K RPM disks.

Sweet! Thanks, Google and Intel!

The purpose of this lab is to acquaint you with basic POSIX Thread (PThread) programming and to introduce some of the common design issues in multicore programming. The first part of the lab is a short PThread tutorial, while the second part investigates benchmarking and speedup issues.

For an excellent PThread tutorial to supplement this lab, please refer to POSIX Thread Programming.

Setting Up the Files

First, log into the icluster front-end node:

$ ssh -X icluster.eecs.berkeley.edu					
				

Once logged into the icluster, create a subdirectory in your home folder called "lab15" (you MUST do this), and copy the lab files into the directory:

$ mkdir ~/lab15
$ cd ~/lab15
$ cp -R ~cs61c/labs/15/* .
				

When compiling any code with PThreads, be sure that you call gcc with the -lpthread switch:

$ gcc -o HelloWorld HelloWorld.c -lpthread					
				

Several warnings will print when you compile. Don't worry about them.

PThread Programming Exercises

Exercise 1: Hello World!

In this exercise you will write a simple Hello World program in which multiple threads print messages.

First, you will need to be able to create new threads, and you can do just that with the following PThread API routine:

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
				

The first argument, pthread_t *thread, allows the calling thread to keep a structure that contains data relevant to the thread it creates; you can think of it as keeping a reference. We will not use the second argument, attr (our examples simply pass NULL). The start_routine argument is a function pointer to the function in which the new thread will start executing; in C you will always write a function name for this argument. Finally, the arg argument is a value of unspecified type (hence the void * typing) that will be passed to start_routine when the thread begins.

A typical call to create a thread looks like:

pthread_t the_other_thread;
pthread_create(&the_other_thread, NULL, startingFunction, NULL);					
				

The call above creates and runs a new independent thread of execution which starts at the top of startingFunction.

Read the file HelloWorld.c and add a call to pthread_create that will create threads that call the sayHello function. You will need to compile your code with the -lpthread flag. Disregard any compiler warnings.

$ gcc -o HelloWorld HelloWorld.c -lpthread					
				

You will notice that not all of the threads were able to print their messages. When the original thread of a program (the one that started with main) exits, the entire process quits and all threads within that process are killed. In this case, the printer threads that we created with pthread_create were killed because the original thread created them and then exited immediately, without waiting for them to complete.

To make the original thread wait for its child threads to complete, we must add another loop to HelloWorld.c with a call to pthread_join, which has the following prototype:

int pthread_join(pthread_t thread, void **value_ptr);
				

When thread A joins thread B, thread A will not continue past the call to pthread_join until thread B has completed and exited. The first argument is the thread that the caller will join. We will not use the second argument.
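For example, joining the thread created in the earlier pthread_create example might look like the following minimal sketch (reusing the names from that example):

pthread_t the_other_thread;
pthread_create(&the_other_thread, NULL, startingFunction, NULL);
/* ... the main thread can do other work here ... */
pthread_join(the_other_thread, NULL);   /* block until the_other_thread exits */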

Add a loop to HelloWorld.c so that the main thread joins each child thread before main finishes and returns. Your new program should allow all the child threads to print.

Checkoff

Show your TA the output of your HelloWorld program. 

Exercise 2: Hello World with a Shared Counter

In this exercise, you will introduce some data that is shared between the threads. Namely, you will have each thread print a number indicating its place in the printing order.

First, add a global integer variable to the file (outside any functions). We want each thread to increment the counter when it prints. However, since the counter variable is shared across multiple threads, we will need to introduce a mutex to ensure that multiple threads do not access the variable at the same time and introduce race conditions.

You will need to use the pthread_mutex_t type to create a global mutex. One can lock and unlock a mutex with the functions prototyped below:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
				

Some typical lock code would be:

// as a global variable
pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;

// before a critical section corresponding to lock1
pthread_mutex_lock(&lock1);
// critical section
pthread_mutex_unlock(&lock1);
// after critical section
				

The PTHREAD_MUTEX_INITIALIZER is a macro that initializes the mutex variable to contain the necessary lock information. Mutexes must be initialized before use.
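As a rough sketch of how these pieces fit together (the counter and lock names below are placeholders, not the names you must use), a guarded increment looks like:

/* hypothetical globals -- choose your own names */
int global_counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* inside a thread's function: increment and read the counter while
   holding the lock so no two threads observe the same value */
pthread_mutex_lock(&counter_lock);
global_counter++;
printf("I was #%d to increment the counter\n", global_counter);
pthread_mutex_unlock(&counter_lock);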

Edit the sayHello function so that the threads which call it will first try to acquire a lock before incrementing the global counter variable. You should also change the print line to output the current counter value:

printf("Hello! I am thread id %d and I was #%d to print!\n",id,global_counter);
				

Checkoff

Show your TA the output of your new HelloWorld program. 

Exercise 3: Hello World with a Controlled Order

Now we want to have our threads say hello in a specific order: we want those with an even id to print first, and those with an odd id to print after all the evens have printed. Since we are not guaranteed that the threads will start in any given order, we must have the odd threads wait until all the even threads have printed.

Enter the dreaded condition variable (or CV for short)! A condition variable is a construct that allows threads to sleep until a certain condition is met, where the condition is determined by some shared information (such as a counter variable). CVs are always associated with locks because the shared information that they depend on must be synchronized across threads.

You can think of a CV as a "big pillow" in the sense that threads can fall asleep on a condition variable and be woken from that condition variable. One story to keep in your head is: Tommy the Thread wanted to access shared information. He acquired the appropriate lock but then was disappointed to see that the shared information wasn't ready yet. Without anything else to do, he decided to sleep (or wait) on a nearby condition variable until another thread updated the shared information and woke him up. Timmy the Thread finally came along and updated the shared information while Tommy was asleep on the nearby CV. Timmy noticed Tommy asleep, so Timmy signaled the condition variable to wake him up. Timmy then went on his way and left Tommy to play with his newly updated shared information.

The above story describes a pretty useful thing for threads to be able to do, but there's one detail we glossed over: we know that a thread needs to hold the lock to access the shared information. Since Timmy needs to update the shared information while Tommy is asleep and waiting, any thread that sleeps on a condition variable must release the lock while it is asleep. But since Tommy held the lock when he fell asleep, he expects to still hold it when he wakes up, and so the condition variable semantics guarantee that a thread sleeping on a condition variable will not fully wake up until (1) it receives a wake-up signal and (2) it can re-acquire the lock that it held when it fell asleep.

For more of a description (with no silly story), you can check out POSIX Thread Programming.

To achieve the evens-print-first functionality, you will need to use the condition variable routines in the PThread library:

int pthread_cond_wait(pthread_cond_t *condition, pthread_mutex_t *lock);
int pthread_cond_signal(pthread_cond_t *condition);
int pthread_cond_broadcast(pthread_cond_t *condition);
				

The pthread_cond_wait routine puts the caller to sleep on the condition variable condition and releases the mutex lock; it guarantees that when the line after the call executes (i.e., after the caller has woken up), the caller once again holds lock. The pthread_cond_signal routine wakes up one thread sleeping on the condition variable condition, and pthread_cond_broadcast wakes up all threads sleeping on condition. IMPORTANT: to call any of the above routines, the caller must first hold the lock that is associated with that condition variable.
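As a generic sketch of the standard pattern (using a hypothetical shared flag named ready, protected by lock, rather than the lab's actual variables):

/* hypothetical shared state, protected by lock:  int ready = 0; */

/* waiting side: sleep until another thread sets ready */
pthread_mutex_lock(&lock);
while (!ready)                      /* re-check the condition after every wake-up */
    pthread_cond_wait(&CV, &lock);  /* releases lock while asleep, re-acquires it before returning */
/* ... ready is now true and we hold lock ... */
pthread_mutex_unlock(&lock);

/* updating side: change the shared state, then wake the sleepers */
pthread_mutex_lock(&lock);
ready = 1;
pthread_cond_broadcast(&CV);        /* wake all threads sleeping on CV; they re-check ready */
pthread_mutex_unlock(&lock);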

You'll need to have these global variables in your HelloWorld.c file:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t CV = PTHREAD_COND_INITIALIZER;
int number_evens_finished = 0;
				

Add the logic necessary to have all the evens print before the odds. You will need to edit the sayHello function to contain calls to pthread_mutex_lock, pthread_mutex_unlock, pthread_cond_wait, and pthread_cond_broadcast. Do not use pthread_cond_signal.

Checkoff

Show your TA the output of your new HelloWorld program in which the threads with even-numbered IDs print before any threads with odd-numbered IDs. 

Exercise 4: (Extra for Experts) Implementing a Barrier

Read about barriers on Wikipedia.

Barriers are not standard in the PThread library (though some implementations contain them). Implement a my_barrier_t type and associated functions.
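If you want a starting point, here is one possible counter-based sketch built from a mutex and a condition variable; the names and structure are just one choice, not the required design.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int count;        /* threads that have arrived at the barrier so far */
    int threshold;    /* threads required before anyone may proceed */
    int generation;   /* distinguishes successive uses of the barrier */
} my_barrier_t;

void my_barrier_init(my_barrier_t *b, int n) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->count = 0;
    b->threshold = n;
    b->generation = 0;
}

void my_barrier_wait(my_barrier_t *b) {
    int my_generation;
    pthread_mutex_lock(&b->lock);
    my_generation = b->generation;
    if (++b->count == b->threshold) {
        /* last thread to arrive: reset for reuse and wake everyone */
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cv);
    } else {
        /* sleep until the last thread of this generation arrives */
        while (my_generation == b->generation)
            pthread_cond_wait(&b->cv, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}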

Benchmarking Exercises

For these exercises you will benchmark programs on an eight-core machine with no other student processes running. Instead of executing the code directly, you will submit jobs to the Torque batch scheduler, which will ensure that your program runs on a dedicated machine. Torque will also load-balance jobs across many of the machines in the cluster. The instructions for using Torque are given in the exercises below.

Exercise 5: Benchmarking the Game of Life

In this exercise you will benchmark two implementations of a multithreaded Game of Life. It is an example of a domain decomposition problem. You can read details about the game of life on Wikipedia and on MathWorld. A quick-and-dirty explanation is below.

The game of life involves a 2D array of cells that are either "alive" or "dead." The game defines rules that determine how one board evolves into the next after a time step: each cell in the subsequent board is alive when certain conditions are met in its local area on the previous board. A program that implements the game of life has a main loop that creates a new game board evolved one step from the old game board. The main loop iterates over the entire 2D board, and can itself be run several times so that the board is evolved through multiple time steps. However, before the threads can begin another (e.g., a second) update loop, they must wait until all other threads have finished generating the current update (e.g., the first update).

We have two game of life implementations in which the main loop is parallelized, and we want you to measure and explain the performance gains from threading. The first implementation decomposes the board in a row-based way: each thread is given several rows that it is responsible for updating. The second implementation chooses a column-based decomposition: each thread is given several columns that it is responsible for updating.
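To make the row-based idea concrete, here is a simplified, hypothetical sketch of how one thread might update its share of rows; the board representation, names, and thread-argument convention are assumptions for illustration and do not match life_rows.c exactly.

#include <pthread.h>
#include <stddef.h>

#define NROWS    1024
#define NCOLS    512
#define NTHREADS 8

/* Hypothetical boards: cells are 0 (dead) or 1 (alive). */
char old_board[NROWS][NCOLS];
char new_board[NROWS][NCOLS];

/* Apply the standard Game of Life rules to cell (r, c) of board b. */
char next_state(char b[NROWS][NCOLS], int r, int c) {
    int live = 0, dr, dc;
    for (dr = -1; dr <= 1; dr++)
        for (dc = -1; dc <= 1; dc++) {
            int rr = r + dr, cc = c + dc;
            if (dr == 0 && dc == 0) continue;
            if (rr >= 0 && rr < NROWS && cc >= 0 && cc < NCOLS)
                live += b[rr][cc];
        }
    return b[r][c] ? (live == 2 || live == 3) : (live == 3);
}

/* Row-based decomposition: thread t updates rows [start, end). */
void *update_rows(void *arg) {
    long t = (long) arg;                 /* thread index passed as the arg */
    int start = t * (NROWS / NTHREADS);
    int end = (t == NTHREADS - 1) ? NROWS : start + NROWS / NTHREADS;
    int r, c;
    for (r = start; r < end; r++)
        for (c = 0; c < NCOLS; c++)
            new_board[r][c] = next_state(old_board, r, c);
    return NULL;   /* threads must then synchronize before the next generation */
}

The column-based version is the same idea with the roles of rows and columns swapped, which changes the memory access pattern; keep that in mind when you interpret the benchmark results.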

Move to the GoL directory:

$ cd ~/lab15/GoL
				

and change some file permissions:

$ chmod a+rx *.sh					
				

In that directory you can find three implementations of the game of life: serial_life.c (a serial implementation for your reference), life_rows.c (which uses the row-based decomposition), and life_cols.c (which uses the column-based decomposition). There is also a bitboard.c file that you can use to generate game of life boards of any size and geometry.

Briefly skim over the code in the three life implementations to get some idea of the computation process. Next, compile all the code files:

$ gcc -o life_rows life_rows.c -lpthread
$ gcc -o life_cols life_cols.c -lpthread
$ gcc -o bitboard bitboard.c
				

You MUST use the above executable names for your files.

Use bitboard to generate a board that has 1024 (1 Ki) rows and 512 columns:

$ ./bitboard 1024 512 1024x512.board
				

Then, you must edit the last line of the Torque submission script so that your program will use the board you generated. Open the launch_life.sh file in any text editor and make the last two lines read:

### YOUR SHELL COMMAND HERE:
./run_life.sh 1024x512.board
				

Finally, submit the script to Torque by running the following command:

$ qsub launch_life.sh
				

You can view the status of your program by running qstat or qstat | grep $USER. When your job completes it will disappear from the qstat list. It should take less than five minutes to complete.

To view the benchmark results after the job completes, open gnuplot and load "gol.gnuplot":
$ gnuplot
gnuplot> load "gol.gnuplot"
				

Examine the results. Then, repeat the process with a board of very different dimensions:

$ ./bitboard 4096 128 4096x128.board
                

Be sure to change the line at the end of launch_life.sh so that the new board is used, then resubmit the job and look at the plot for the 4096x128 board.

You should be able to answer the following questions:

  1. The machine you are benchmarking on has 8 cores, so why would the performance taper off before there are even 8 threads? Hint: the threads must synchronize after every generation.
  2. Why might the row- and column-decompositions perform differently? Can you explain why one might or might not perform better than the other?

Checkoff

Answer the above questions for your TA. 

Exercise 6: (Extra for Experts) Benchmarking an Image Blur

In this exercise you will benchmark a multithreaded image blur. It is important to note that, unlike in the game of life above, the threads working on a blur do not need to synchronize, since the blurring process happens in one step (one pass over the entire image).

Move to the blur directory:

$ cd ~/lab15/blur/
				

and change some file permissions:

$ chmod a+rx *.sh					
				

Read over the code to get an idea of the computation. Then, compile it with the following commands:

$ gcc -o blur_row blur_row.c -lpthread
$ gcc -o blur_col blur_col.c -lpthread
				

Submit your job to Torque with:

$ qsub launch_blur.sh					
				

and examine the results by loading the "blur.gnuplot" file in gnuplot. You can also view any .ppm image by using the command display image.ppm.

You might want to benchmark the program for different image sizes (larger images would be interesting, but might fill up your disk quota). You can download an image and convert it to the appropriate format with mogrify:

$ mogrify -format ppm YOUR_IMAGE_FILENAME
				

Then edit the last line of the launch_blur.sh file so that dan.ppm is replaced with your image filename.



Lab created for CS 61C in Spring 2008 by Matt Johnson and Ramesh Sridharan.