Project 6: Classification

Which Digit?
Which are Faces?

Due 12/06 at 11:59pm

[12/3/07] Small bug fix in the readCommand method.



In this checkpoint, you will design two classifiers: a naive Bayes classifier and a perceptron classifier. You will test your classifiers on two image datasets: a set of scanned handwritten digit images and a set of face images in which edges have been detected. Even these simple classifiers will do quite well on these tasks given enough training data.

Optical character recognition (OCR) is the task of extracting text from sources in image formats. The first dataset you will run your classifiers on is a set of handwritten numerical digits (0-9). This is a commercially valuable technology, similar to the technique used by the US Postal Service to route mail by zip code. There are systems that achieve over 99% classification accuracy (see LeNet-5 for an example system in action).

Face detection is the task of localizing faces within video or still images, where the faces can be at any location and vary in size. There are many applications for face detection, including human-computer interaction and surveillance. You will attempt a reduced face detection task in which you are presented with an image on which an edge detection algorithm has been run. Your task will be to determine whether the edge image you have been presented with is a face or not. There are several systems in use that perform quite well at the face detection task. One good system is the Face Detector by Schneiderman and Kanade. You can even try it out on your own photos in this demo.

The code for this project contains the following files and data files, available as a zip file.

- Data file including digit and face data.
- Abstract superclass for the classifiers you will write. You do not need to modify this file.
- Code to read in the classification data. You do not need to modify this file.
- Code defining some useful tools. You may be familiar with some of these by now, and they will save you a lot of time.
- A simple example classifier that labels every instance as the most frequent class. You do not need to modify this file.
- The main code where you will write your naive Bayes classifier.
- The main code where you will write your perceptron classifier.
- The wrapper code that will call your classifiers. You will also write your enhanced feature extractor here. You can also use this code to analyze the results of your classifier.

What to submit: You will fill in portions of the provided classifier and feature extractor files (and only those) during the assignment, and submit them.

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. Instead, contact the course staff if you are having trouble.

Getting Started

To try out the classification pipeline, run the classification driver script from the command line. This will classify the digit data using the default classifier (mostfrequent), which classifies every example as the most frequent class. Which digit is it picking?

> python

Doing classification
--------------------
data:		digits
classifier:		mostfrequent
using enhanced features?:	False
training set size:	100
Extracting features...
Training...
Validating...
14 correct out of 100 (14.0%).
Testing...
14 correct out of 100 (14.0%).
<Press enter/return to continue>

As usual, you can learn more about the possible command line options by running:
> python -h

Simple Features

We have defined some simple features for you to use for your naive Bayes and perceptron classifiers. Later you will implement more intelligent features. Our simple feature set has one feature for each pixel location, which can take values 0 or 1. The features are encoded as a dictionary of feature location, value pairs (where location is represented as (column,row) and value is 0 or 1), using the util.Counter wrapper for convenience.

The data for the digit instances are encoded as 28x28 pixel images giving a vector of 784 features for each data item. Each feature is set to 0 or 1 depending on whether the pixel is on or not.

A Canny edge detector has been run on some face and non-face images of size 60x70 pixels, giving a vector of 4200 features for each item. Like the digits, these features take value 0 or 1 depending on whether an edge was detected at each pixel.
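As a sketch of this encoding, a 2-D binary image can be flattened into a dictionary of pixel features keyed by (column, row). This uses a plain dict in place of the project's util.Counter (which behaves like a dict whose missing keys default to 0); the function name is hypothetical, not part of the provided code.

```python
def basic_features(image):
    """Map each (col, row) pixel location to a binary feature:
    1 if the pixel is on (nonzero), 0 otherwise."""
    features = {}
    for row, pixels in enumerate(image):
        for col, value in enumerate(pixels):
            features[(col, row)] = 1 if value else 0
    return features

# A tiny 2x2 "image" with one lit pixel in the top-right corner.
image = [[0, 1],
         [0, 0]]
print(basic_features(image))  # {(0, 0): 0, (1, 0): 1, (0, 1): 0, (1, 1): 0}
```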

Naive Bayes

A skeleton implementation of a naive Bayes classifier is provided for you. You will fill in the trainAndTune function, the calculateLogPosteriorProbabilities function, and the findHighOddsFeatures function.


A naive Bayes classifier models a joint distribution over a label $Y$ and a set of observed random variables, or features, $\{F_1, F_2, \ldots F_n\}$, using the assumption that the full joint distribution can be factored as follows:

P(F_1 \ldots F_n, Y) = P(Y) \prod_i P(F_i \vert Y)

To classify a datum, we can find the most probable class given the feature values for each pixel:

\textmd{arg max}_{y} P(y \vert f_1, \ldots, f_m) &=& \textmd{arg max}_{y} \frac{P(f_1, \ldots, f_m \vert y) P(y)}{P(f_1, \ldots, f_m)} \\
&=& \textmd{arg max}_{y} P(y) \prod_{i = 1}^m P(f_i \vert y)

Because multiplying many probabilities together often results in underflow, we will instead compute log probabilities, which have the same argmax:

\textmd{arg max}_{y} \log(P(y \vert f_1, \ldots, f_m)) &=& \textmd{arg max}_{y} (\log(P(y)) + \sum_{i = 1}^m \log(P(f_i \vert y)))
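The log-space decision rule can be sketched as follows (hypothetical helper names; assuming priors and per-feature conditional probabilities arrive as plain nested dicts, which is not exactly how the skeleton code stores them):

```python
import math

def classify(features, priors, conditionals, labels):
    """Return the label maximizing log P(y) + sum_i log P(f_i | y).
    conditionals[y][i] holds P(F_i = 1 | y); a feature that is off
    contributes log(1 - P(F_i = 1 | y)) instead."""
    best_label, best_score = None, -math.inf
    for y in labels:
        score = math.log(priors[y])
        for i, f in features.items():
            p_on = conditionals[y][i]
            score += math.log(p_on if f == 1 else 1.0 - p_on)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Toy model: one binary feature that is usually on for class 'a'.
priors = {'a': 0.5, 'b': 0.5}
conditionals = {'a': {0: 0.9}, 'b': {0: 0.1}}
print(classify({0: 1}, priors, conditionals, ['a', 'b']))  # a
```

Working in log space keeps the sum of 784 (or 4200) per-pixel terms numerically stable, whereas the raw product would underflow to zero.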

Parameter Estimation

Our naive Bayes model has several parameters to estimate. One parameter is the prior distribution over labels (digits, or face/not-face), $P(Y)$.

We can estimate $P(Y)$ directly from the training data:

\hat{P}(y) = \frac{c(y)}{n}

where $c(y)$ is the number of training instances with label $y$ and $n$ is the total number of training instances.

The other parameters to estimate are the conditional probabilities of our features given each label $y$: $P(F_i \vert Y = y)$. We do this for each possible feature value ($f_i \in \{0,1\}$).

\hat{P}(F_i=f_i\vert Y=y) = \frac{c(f_i,y)}{\sum_{f_i}{c(f_i,y)}}

where $c(f_i,y)$ is the number of times pixel $F_i$ took value $f_i$ in the training examples of class $y$.


Your current parameter estimates are unsmoothed, that is, you are using the empirical estimates for the parameters $P(f_i\vert y)$. These estimates are rarely adequate in real systems. Minimally, we need to make sure that no parameter ever receives an estimate of zero, but good smoothing can boost accuracy quite a bit by reducing overfitting.

The basic smoothing method we'll use here is Laplace smoothing, which essentially adds $k$ counts to every possible observation value:

$P(F_i=f_i\vert Y=y) = \frac{c(F_i=f_i,Y=y)+k}{\sum_{f_i}{(c(F_i=f_i,Y=y)+k)}}$

If $k=0$, the probabilities are unsmoothed; as $k$ grows larger, the probabilities are smoothed more and more. You can use your validation set to determine a good value for $k$ (note: you don't have to smooth $P(Y)$).
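The estimation formulas above can be sketched in a few lines (a hypothetical standalone helper, assuming training data arrives as (features, label) pairs with binary feature values; the skeleton code organizes this differently):

```python
from collections import defaultdict

def estimate_parameters(training_data, k):
    """Estimate P(y) and Laplace-smoothed P(F_i = 1 | y) from a list of
    (features, label) pairs, where features maps locations to 0/1."""
    label_counts = defaultdict(int)                    # c(y)
    on_counts = defaultdict(lambda: defaultdict(int))  # c(F_i = 1, y)
    for features, y in training_data:
        label_counts[y] += 1
        for i, f in features.items():
            on_counts[y][i] += f
    n = len(training_data)
    priors = {y: c / n for y, c in label_counts.items()}
    conditionals = {}
    for y, c_y in label_counts.items():
        # Binary features: the smoothed denominator sums both values,
        # (c(1, y) + k) + (c(0, y) + k) = c(y) + 2k.
        conditionals[y] = {i: (on_counts[y][i] + k) / (c_y + 2 * k)
                           for i in on_counts[y]}
    return priors, conditionals
```

With $k=0$ this reduces to the unsmoothed empirical estimates; any feature value never seen with a label then gets probability zero, which is exactly what smoothing is meant to prevent.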

Question 1 (8 points)

We will test your code with the following commands (on a new test set though), so make sure that they work:
> python -a -d digits -c naivebayes -t 1000
> python -a -d faces -c naivebayes -t 100

Odds Ratios

One important skill in using classifiers in real domains is being able to inspect what they have learned. One way to inspect a naive Bayes model is to look at the most likely features for a given class.

Another tool for understanding the parameters is to look at odds ratios. For each pixel feature $F_i$ and classes $y_1, y_2$, consider the odds ratio:

\mbox{odds}(F_i=on, y_1, y_2) = \frac{P(F_i=on\vert y_1)}{P(F_i=on\vert y_2)}

This ratio will be greater than one for features which cause belief in $y_1$ to increase relative to $y_2$.

The features that will have the greatest impact at classification time are those with both a high probability (because they appear often in the data) and a high odds ratio (because they strongly bias one label versus another).
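One way to sketch this inspection step (hypothetical helper name, assuming the same nested-dict layout of conditional probabilities as above):

```python
def top_odds_features(conditionals, y1, y2, count=3):
    """Return the `count` features with the highest odds ratio
    P(F_i = on | y1) / P(F_i = on | y2)."""
    ratios = {i: conditionals[y1][i] / conditionals[y2][i]
              for i in conditionals[y1]}
    return sorted(ratios, key=ratios.get, reverse=True)[:count]

# Toy model with three features: feature 0 strongly favors 'a',
# feature 1 favors 'b', feature 2 is uninformative.
conditionals = {'a': {0: 0.9, 1: 0.2, 2: 0.5},
                'b': {0: 0.3, 1: 0.4, 2: 0.5}}
print(top_odds_features(conditionals, 'a', 'b', 2))  # [0, 2]
```

With smoothed parameters the denominator is never zero, so the ratio is always well defined.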

Question 2 (2 points)


Perceptron

A skeleton implementation of a perceptron classifier is provided for you. You will fill in the train function and the findHighOddsFeatures function.

Unlike the naive Bayes classifier, the perceptron does not use probabilities to make its decisions. Instead, it keeps a prototype $w^y$ of each class $y$. Given a feature list $f$, the perceptron predicts the class $y$ whose prototype is most similar to the input vector $f$. Formally, given a feature vector $f$ (a map from properties to counts, pixels to intensities), we score each class with:

\mbox{score}(f,y) = \sum_i f_i w^y_i

Then we pick the class with highest score as the label for that datum. In the code, we will represent $w^y$ as a Counter, which maps features (pixels) to their count in digit $y$'s prototype.
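The scoring rule amounts to a sparse dot product. A minimal sketch, using plain dicts in place of util.Counter and hypothetical function names:

```python
def score(features, weights):
    """Dot product of a feature vector and one class's weight vector,
    both stored as dicts from feature location to value."""
    return sum(value * weights.get(i, 0) for i, value in features.items())

def predict(features, weight_vectors):
    """Pick the label whose prototype scores highest on `features`."""
    return max(weight_vectors,
               key=lambda y: score(features, weight_vectors[y]))

# Two classes whose prototypes prefer opposite features.
weight_vectors = {'a': {0: 1, 1: -1}, 'b': {0: -1, 1: 1}}
print(predict({0: 1, 1: 0}, weight_vectors))  # a
```

Note that util.Counter's `*` operator computes exactly this dot product, so the real implementation can be even shorter.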

Learning weights

What we need is a method of learning the prototype weights. In the basic multi-class perceptron, we scan over the data, one instance at a time. When we come to an instance $(f, y)$, we calculate the model prediction:

y' = \textmd{arg max}_{y''} \mbox{score}(f,y'')

We compare $y'$ to the true label $y$. If $y' = y$, we've gotten the instance correct, and we do nothing. Otherwise, we guessed $y'$ but we should have guessed $y$. That means that the prototype $w^y$ needs to be more like $f$ and the prototype $w^{y'}$ needs to be less like $f$ to help prevent this error in the future. We update these two prototypes:

w^y += f

w^{y'} -= f

Using the adding, subtracting, and multiplying functionality of the Counter class in util, the perceptron updates should be relatively easy to code. Certain implementation issues have been taken care of for you, such as handling iterations over the training data and ordering the update trials. Furthermore, the code sets up the weights data structure for you. Each legal label needs its own prototype Counter full of weights.
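The whole update loop can be sketched as follows (a hypothetical standalone version using plain dicts; the skeleton code handles iteration order and data structures for you):

```python
def train_perceptron(training_data, labels, iterations=3):
    """Multi-class perceptron: on each mistake, add the features to the
    true label's prototype and subtract them from the guessed one."""
    weights = {y: {} for y in labels}

    def score(features, w):
        return sum(v * w.get(i, 0) for i, v in features.items())

    for _ in range(iterations):
        for features, y in training_data:
            guess = max(labels, key=lambda c: score(features, weights[c]))
            if guess != y:
                for i, v in features.items():
                    weights[y][i] = weights[y].get(i, 0) + v       # w^y += f
                    weights[guess][i] = weights[guess].get(i, 0) - v  # w^y' -= f
    return weights
```

Correctly classified instances trigger no update, so on separable data the weights eventually stop changing; in practice you run a fixed number of iterations.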

Question 3 (6 points)

Visualizing weights

Perceptron classifiers and other discriminative methods are often criticized because the parameters they learn are hard to interpret. To see a demonstration of this issue, we can repeat the visualization exercise from the naive Bayes classifier.

Question 4 (2 points)

Feature Design

Building classifiers is only a small part of getting a good system working for a task. Indeed, the main difference between a good system and a bad one is usually not the classifier itself (e.g. perceptron vs. naive Bayes), but rather rests on the quality of the features used. So far, we have used the simplest possible features: the identity of each pixel.

To increase your classifier's accuracy further, you will need to extract more useful features from the data. The EnhancedFeatureExtractorDigit function is your new playground. Look at some of your errors. You should look for characteristics of the input that would give the classifier useful information about the label. For instance, in the digit data, consider the number of separate, connected regions of white pixels, which varies by digit type. 1, 2, 3, 5, 7 tend to have one contiguous region of white space, while the loops in 6, 8, 9 create more. The number of white regions in a 4 depends on the writer. This is an example of a feature that is not directly available to the classifier from the per-pixel information. If your feature extractor adds new features that encode these properties, the classifier will be able to exploit them.
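The white-region count can be computed with a flood fill over the background pixels. A sketch of one way to do it (hypothetical helper, assuming ink/edge pixels are 1 and background pixels are 0, with 4-connectivity):

```python
def count_white_regions(image):
    """Count 4-connected regions of white (0) pixels in a binary image,
    using an explicit-stack flood fill."""
    rows, cols = len(image), len(image[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 and (r, c) not in seen:
                regions += 1          # found a new, unexplored region
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen or image[y][x] != 0:
                        continue
                    seen.add((y, x))
                    for ny, nx in ((y+1, x), (y-1, x), (y, x+1), (y, x-1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            stack.append((ny, nx))
    return regions

# A tiny ring of ink, like a "0": one outer background region
# plus one enclosed hole.
ring = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
print(count_white_regions(ring))  # 2
```

A feature extractor could emit this count directly, or as a few indicator features (e.g. "has exactly 1 / 2 / 3+ white regions"), alongside the raw pixel features.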

Question 5 (6 points)