CS61C Fall 2016 Project 5: Image Compression with Spark

TAs: Stephan Liu, Derek Ahmed
Due 12/04 @ 23:59:59


In this project, you will use the MapReduce programming paradigm to parallelize a common image compression algorithm in Spark to process multiple images at once.

Getting Started

Initialize your repository and get the skeleton code files by entering the following commands:

git clone https://mybitbucketusername@bitbucket.org/mybitbucketusername/proj5-xxx-yyy.git
cd proj5-xxx-yyy
git remote add proj5-starter https://github.com/61c-teach/fa16-proj5-starter.git
git fetch proj5-starter
git merge proj5-starter/master -m "merge proj5 skeleton code"

If you are not familiar with Spark, read this programming guide, especially the section on Resilient Distributed Datasets (RDDs) and on RDD Operations.


Many methods of video compression exist to make videos easier to render on devices with less memory or computing capacity. We will be implementing a scheme that uses the Discrete Cosine Transform (DCT) to perform video compression. This form of lossy compression is typically used alongside lossless compression to compress video files. For the purposes of this project, we will focus only on the lossy aspect.

Video Compression using DCT

The Discrete Cosine Transform (DCT) is a transformation used in JPEG compression. It re-expresses a block of pixel values as a weighted sum of cosine patterns of different frequencies, which makes it easy to discard the higher-frequency components and reduce the amount of space used to store an image. The idea is that the human eye is much less sensitive to very high-frequency detail (e.g. you can easily identify a shade of red as red, but you would probably have a harder time identifying American Rose as anything other than red). Thus, these high-frequency components can be, and in post-editing quite often are, discarded when an image is stored, with very little loss in perceived quality.
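To make the frequency idea concrete, here is a minimal sketch of an 8 x 8 two-dimensional DCT written with plain numpy. It is only an illustration: the names dct_matrix, dct2, and idct2 are made up for this example and are not the functions in helper_functions.py, which may compute the transform differently (for instance through OpenCV).

```python
import numpy as np

def dct_matrix(n=8):
    """Return the n x n orthonormal DCT-II matrix C, so that C.dot(C.T) is the identity."""
    k = np.arange(n).reshape(-1, 1)  # frequency index, one per row
    x = np.arange(n).reshape(1, -1)  # sample index, one per column
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2.0 * n))
    c[0, :] = np.sqrt(1.0 / n)       # the DC (zero-frequency) row has its own scale
    return c

C = dct_matrix(8)

def dct2(block):
    """2D DCT: transform the columns, then the rows."""
    return C.dot(block).dot(C.T)

def idct2(coeffs):
    """Inverse 2D DCT (C is orthonormal, so its inverse is its transpose)."""
    return C.T.dot(coeffs).dot(C)

# A block that only varies from left to right: every row is the same gentle ramp.
ramp = np.tile(np.linspace(-40.0, 40.0, 8), (8, 1))
coeffs = dct2(ramp)
restored = idct2(coeffs)

print(np.round(coeffs[0], 1))           # the only nonzero coefficients are in row 0
print(np.abs(coeffs[1:]).max() < 1e-9)  # rows 1..7 are numerically zero
print(np.allclose(restored, ramp))      # the transform by itself loses nothing
```

Note that the DCT on its own is reversible. The loss comes from what is done to the coefficients afterwards.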

In video compression, videos are typically broken down into frames, at which point lossy and lossless methods of compression are applied, the lossy practices sometimes including DCT.

Quantization Matrix

The quantization matrix is the part of the lossy compression that actually "shaves off" the less noticeable parts of an image. As mentioned above, because human perception is far more sensitive to lower frequencies, higher frequencies can be disregarded.
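As a rough illustration of how a quantization matrix discards detail, the sketch below divides each DCT coefficient by the matching matrix entry, rounds, and multiplies back. The table used is the example luminance table from the JPEG standard; it is an assumption here, and the matrix in the project files (and the way the QF variable scales it) may well differ. The function names are likewise hypothetical.

```python
import numpy as np

# Example luminance quantization table from the JPEG standard (an assumption:
# the project's own matrix may be different or scaled by a quality factor).
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=np.float64)

def quantize(coeffs, q=Q_LUMA):
    """Divide element-wise by the table and round; this is where information is lost."""
    return np.round(coeffs / q)

def dequantize(levels, q=Q_LUMA):
    """Multiply back by the table; the rounding error cannot be recovered."""
    return levels * q

# Toy coefficient block: a strong low-frequency term and weak higher-frequency terms.
coeffs = np.zeros((8, 8))
coeffs[0, 0] = 400.0   # top-left entry: lowest frequency
coeffs[0, 1] = -37.0
coeffs[7, 7] = 30.0    # bottom-right entry: highest frequency

restored = dequantize(quantize(coeffs))
print(restored[0, 0], restored[0, 1], restored[7, 7])  # 400.0 -33.0 0.0
```

The entries grow toward the bottom right of the table, so high-frequency coefficients are rounded much more aggressively and small ones collapse to zero.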

Lossy Video Compression and Decompression Encoder


In addition to Spark MapReduce, we allow (and highly recommend) the use of OpenCV and NumPy. These should be preinstalled on the hive machines. Keep in mind that image processing is not the main focus of the project, which is why we are allowing these tools to simplify that aspect of it. The project is set up so that familiarity with these tools is helpful but not strictly necessary (e.g. the helper functions in helper_functions.py should already take care of the OpenCV and NumPy library functions that you might rely on).


All the steps of this project and of JPEG image compression are clearly specified here. For this project, however, you will only need to do the lossy parts. The main steps are as follows:

Images will be loaded in for you using cv2.imread(). They will be a list of (image_id, image) key value pairs. You will not have to worry about the details of how this is done. Inside the function run(images), we have initialized a Spark RDD for you. The first thing you want to do now is to convert all your images to the YCbCr colorspace. We have provided a helper function to do this in helper_functions.py. This will give you an array of 3 matrices, the first of which is an inputted image's Y channel, the second a chroma subsampled version of the image's Cb channel, and the third a chroma subsampled version of the image's Cr channel. At this point, your RDD should have 3 times as many entries.
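The "3 times as many entries" step can be pictured with ordinary Python lists. The sketch below is not the project's code: split_channels_stub is a hypothetical stand-in for the provided helper (it performs no real color conversion and simply keeps every other chroma sample), and the (image_id, channel_index) key is just one possible choice.

```python
import numpy as np

def split_channels_stub(image):
    """Hypothetical stand-in for the provided helper: returns [Y, Cb, Cr].

    No real color conversion happens here; the chroma planes are simply
    downsampled by 2 in each direction to mimic chroma subsampling.
    """
    y = image[:, :, 0]
    cb = image[::2, ::2, 1]
    cr = image[::2, ::2, 2]
    return [y, cb, cr]

def to_channel_records(record):
    """Turn one (image_id, image) pair into three ((image_id, channel), matrix) pairs."""
    image_id, image = record
    channels = split_channels_stub(image)
    return [((image_id, idx), matrix) for idx, matrix in enumerate(channels)]

images = [
    (0, np.zeros((16, 16, 3), np.uint8)),
    (1, np.zeros((8, 24, 3), np.uint8)),
]

# In Spark this would be roughly rdd.flatMap(to_channel_records);
# the list comprehension below does the same thing locally.
records = [rec for pair in images for rec in to_channel_records(pair)]

print(len(records))                        # 6: three records per input image
print(records[1][0], records[1][1].shape)  # (0, 1) (8, 8)
```

Keeping the channel index in the key is what later lets each matrix find its way back to the right image and the right plane.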

Next, you will want to create 8 x 8 "sub-blocks" (also called macroblocks) for each of these matrices, as that is the block size we will use for our transformations. Think about what you need to put in your key and value pairs so that you can reassemble the blocks back together later. Hint: Remember, you can have multiple parts to a key by using tuples!
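One possible way to cut a matrix into 8 x 8 blocks while remembering where each block came from is sketched below. It assumes the incoming key is an (image_id, channel_index) tuple and that both dimensions are multiples of 8; both are assumptions made for the example, and split_into_blocks is a made-up name.

```python
import numpy as np

BLOCK = 8  # side length of a sub-block

def split_into_blocks(record):
    """Split one ((image_id, channel), matrix) record into keyed 8 x 8 blocks.

    Each output key is (image_id, channel, block_row, block_col), which is
    enough to put the block back in the right place later.
    Assumes the matrix dimensions are multiples of BLOCK.
    """
    (image_id, channel), matrix = record
    rows, cols = matrix.shape
    out = []
    for r in range(0, rows, BLOCK):
        for c in range(0, cols, BLOCK):
            key = (image_id, channel, r // BLOCK, c // BLOCK)
            out.append((key, matrix[r:r + BLOCK, c:c + BLOCK]))
    return out

matrix = np.arange(16 * 24).reshape(16, 24)
blocks = split_into_blocks(((7, 0), matrix))

print(len(blocks))         # 6 blocks: 2 block rows x 3 block columns
print(blocks[4][0])        # (7, 0, 1, 1)
print(blocks[4][1][0, 0])  # value at row 8, column 8 of the original matrix
```

With an RDD, a function of this shape could be passed to flatMap so that every block becomes its own record.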

Once you have an RDD of all your sub-blocks, apply the following transformations in this order: DCT (making sure the pixel ranges are [-128, 127]), Quantization, De-Quantization, Inverse DCT. Most of the code for these transformations is already in helper_functions.py. Focus on how you can apply them to all the sub-blocks most efficiently. Each of the functions takes a block as input and outputs a transformed block. As stated earlier, before you apply DCT, subtract 128 from each element of the block so that pixel values lie within [-128, 127] (shifted from the initial range of [0, 255]). Correspondingly, you should add 128 to each element of the block after you apply inverse DCT.
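The order of operations, including the shift by 128, might be sketched as follows. The four transforms are passed in as parameters because the real ones live in helper_functions.py and their names are not given here; identity functions stand in for them in the demo, so only the level shift is exercised. One detail worth noticing is the cast to float before subtracting, since arithmetic on uint8 arrays can wrap around instead of going negative.

```python
import numpy as np

def level_shift(block):
    """Move pixel values from [0, 255] to [-128, 127], in floating point."""
    return block.astype(np.float64) - 128.0

def level_unshift(block):
    """Undo level_shift after the inverse DCT."""
    return block + 128.0

def process_block(block, dct, quantize, dequantize, idct):
    """Apply the lossy steps to one block, in the order the spec lists them.

    The four callables are placeholders for the provided helper functions.
    """
    coeffs = dct(level_shift(block))
    approx = idct(dequantize(quantize(coeffs)))
    return level_unshift(approx)

def identity(b):
    """Stand-in transform used only for this demonstration."""
    return b

block = np.array([[0, 255], [128, 64]], dtype=np.uint8)

print(level_shift(block))  # values now lie in [-128, 127]
print(process_block(block, identity, identity, identity, identity))

# With keyed blocks in an RDD, one option is something like
#   rdd.mapValues(lambda b: process_block(b, dct, quantize, dequantize, idct))
# which transforms every value while leaving the keys untouched.
```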

Finally, we want to put the image back together. First, make sure all elements are between 0 and 255 (the valid pixel range for an image to be saved). Then, think about how you can combine blocks together using reduce and then write a mapper function that populates a matrix with the correct sub-block. You may find the function np.zeros((num_rows, num_cols), np.uint8) helpful as a starting point for the matrix you wish to return. Lastly, we want to put together a 3D matrix to represent an image in the YCrCb colorspace. This matrix should be the same size as np.zeros((height, width, 3), np.uint8), in which height and width are obtained from the original image. Now, you can convert the image back to BGR colorspace! Try using cv2.imwrite(filename, img_matrix) if you would like to see the processed image at any time.
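Clamping values and filling in a matrix block by block could look like the sketch below. It works on a plain list of ((block_row, block_col), block) pairs for a single channel of a single image; how those pairs are gathered from the RDD is left open, and the function name assemble is made up for the example.

```python
import numpy as np

def assemble(blocks, num_rows, num_cols, block=8):
    """Place keyed sub-blocks into one uint8 matrix.

    blocks is a list of ((block_row, block_col), data) pairs. Values are
    rounded and clamped to [0, 255] before being stored as uint8.
    """
    out = np.zeros((num_rows, num_cols), np.uint8)
    for (block_row, block_col), data in blocks:
        r, c = block_row * block, block_col * block
        out[r:r + block, c:c + block] = np.clip(np.round(data), 0, 255).astype(np.uint8)
    return out

# Four blocks for a 16 x 16 channel; two of them hold out-of-range values.
blocks = [
    ((0, 0), np.full((8, 8), 300.0)),  # too bright, clamps to 255
    ((0, 1), np.full((8, 8), -5.0)),   # too dark, clamps to 0
    ((1, 0), np.full((8, 8), 100.4)),
    ((1, 1), np.full((8, 8), 7.0)),
]

channel = assemble(blocks, 16, 16)
print(channel[0, 0], channel[0, 8], channel[8, 0], channel[15, 15])  # 255 0 100 7
```

Clamping before the cast matters: converting an out-of-range float straight to uint8 does not saturate, so bright pixels could turn dark.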

How Images are Stored

Images are stored as 3D numpy arrays once they are loaded by OpenCV. They have the shape (height, width, depth), where height is the number of rows in the image, width is the number of columns, and depth is 3 because the colorspace has 3 channels. If you have an image matrix named image in the YCrCb colorspace, indexing is as follows:

    import cv2
    import numpy as np

    image = cv2.imread(filename, cv2.IMREAD_UNCHANGED) # load the image (loads in BGR color space)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb) # convert to YCrCb, there is a helper function for this
    Y = image[:,:,0]  # all rows and all columns of channel 0 (Y)
    Cr = image[:,:,1] # all rows and all columns of channel 1 (Cr)
    Cb = image[:,:,2] # all rows and all columns of channel 2 (Cb)
    Y.shape                   # will return (height, width) because Y is now a 2D matrix
    subblock = Y[16:24,32:40] # An 8 by 8 sublock starting at row 16 and column 32 of Y
    # image.shape returns a tuple in the form of (height, width, depth) for 3D matrices

    # constructing a copy of the original image and its three channels (YCrCb)
    image_copy = np.zeros(image.shape, image.dtype) # same shape and dtype as the original
    image_copy[:,:,0] = Y
    image_copy[:,:,1] = Cr
    image_copy[:,:,2] = Cb


To test, you can go on hive and run

spark-submit run_image_processor.py -i INPUT_FILE -o OUTPUT_FILE

Then you can run

image_diff.py -f1 OUTPUT_FILE -f2 REF_OUTPUT_FILE

to see if your solution matches the staff solution when you run it with any of the given reference images (test*.jpeg). You can also run pyspark to use Spark interactively in Python. Try running your code several times, varying the QF variable in preconstants.py, and observe any changes in image quality.

Finally, running

spark-submit run_image_processor.py -t

will run a test with 100 random images and write the output to a file. It should match the given reference text file. Although the images are "random," we are able to provide the expected output because they are generated using a seed.
We will ignore the above.

Important Notes and Reminders:

Submission and Grading

To submit, type:

submit proj5

on a hive machine.

You should only submit spark_image_compressor.py. Anything else will be overwritten.

In addition, you should push your submission to your Bitbucket repository:

cd proj5-xxx-yyy
git add spark_image_compressor.py
git commit -m "proj5 submission"
git tag -f "proj5-sub"
git push origin proj5-sub --tags