[Figure: sample handwritten digits. Which digit?]

[12/3/07] Small bug fix in the readCommand method in dataClassifier.py.

[12/2/07] Bug fix in dataClassifier.py in the runClassifier method. The data.zip file and the whole project 6 zip file contain the updated dataset. If you still want to play with the original 60k examples, you may download them here (you may use either in your experiments; we are only using the first 1000 in our evaluation anyway).
In this checkpoint, you will design two classifiers: a naive Bayes classifier and a perceptron classifier. You will test your classifiers on two image datasets: a set of scanned handwritten digit images and a set of face images in which edges have been detected. Even your simple classifiers will be able to do quite well on these tasks with enough training data.
Optical character recognition (OCR) is the task of extracting text from sources in image formats. The first set of data you will run your classifiers on is a set of handwritten numerical digits (0-9). This is a very commercially useful technology, similar to the technique used by the US post office to route mail by zip codes. There are systems that can perform with over 99% classification accuracy (see LeNet5 for an example system in action).
Face detection is the task of localizing faces within video or still images, where the faces can be at any location and vary in size. There are many applications for face detection, including human-computer interaction and surveillance. You will attempt a reduced face detection task in which you are presented with an image on which an edge detection algorithm has been computed. Your task will be to determine whether the edge image you have been presented is a face or not. There are several systems in use that perform quite well at the face detection task. One good system is the Face Detector by Schneiderman and Kanade. You can even try it out on your own photos in this demo.
The code for this project contains the following files and data files, available as a zip file.
Classification

data.zip: Data file, including digit and face data.
classificationMethod.py: Abstract superclass for the classifiers you will write. You do not need to modify this file.
samples.py: Code to read in the classification data. You do not need to modify this file.
util.py: Code defining some useful tools. You may be familiar with some of these by now, and they will save you a lot of time.
mostFrequent.py: A simple example classifier that labels every instance as the most frequent class. You do not need to modify this file.
naiveBayes.py: The main code where you will write your naive Bayes classifier.
perceptron.py: The main code where you will write your perceptron classifier.
dataClassifier.py: The wrapper code that will call your classifiers. You will also write your enhanced feature extractor here. You can also use this code to analyze the results of your classifier.
What to submit: You will fill in portions of naiveBayes.py, perceptron.py, and dataClassifier.py (only) during the assignment, and submit them.
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. Instead, contact the course staff if you are having trouble.
To try out the classification pipeline, run dataClassifier.py from the command line. This will classify the digit data using the default classifier (mostfrequent), which classifies every example as the most frequent class. Which digit is it picking?
> python dataClassifier.py
As usual, you can learn more about the possible command line options by running:
> python dataClassifier.py -h
The data is encoded using the util.Counter wrapper for more convenience.
The data for the digit instances are encoded as 28x28 pixel images, giving a vector of 784 features for each data item. Each feature is set to 0 or 1 depending on whether the pixel is on or not.
A Canny edge detector has been run on some face and non-face images of size 60x70 pixels, giving a vector of 4200 features for each item. Like the digits, these features can take values 0 or 1 depending on whether an edge was detected at each pixel.
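To make the encoding concrete, here is a small sketch (not the project's actual loader, which lives in samples.py) of how a binary image can be flattened into a per-pixel feature map keyed by (row, col), in the same spirit as the util.Counter features the classifiers consume:

```python
# Sketch only: flatten a 2D binary image into a feature map.
# A plain dict stands in for util.Counter here.
def basic_features(image):
    """image: list of rows, each a list of 0/1 pixel values."""
    features = {}
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            # One binary feature per pixel position.
            features[(r, c)] = 1 if pixel else 0
    return features

digit = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]
feats = basic_features(digit)
print(len(feats))      # 9 features for a 3x3 image
print(feats[(1, 1)])   # 1
```

A real 28x28 digit would yield 784 such features, and a 60x70 edge image 4200.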
A skeleton implementation of a naive Bayes classifier is provided for you in naiveBayes.py. You will fill in the trainAndTune function, the calculateLogPosteriorProbabilities function, and the findHighOddsFeatures function.
A naive Bayes classifier models a joint distribution over a label Y and a set of observed random variables, or features, F_1, F_2, ..., F_n, using the assumption that the full joint distribution can be factored as follows:

    P(F_1, ..., F_n, Y) = P(Y) * prod_i P(F_i | Y)
To classify a datum, we can find the most probable class given the feature values for each pixel:

    y* = argmax_y P(y | f_1, ..., f_n) = argmax_y P(y) * prod_i P(f_i | y)
Because multiplying many probabilities together often results in underflow, we will instead compute log probabilities, which have the same argmax:

    y* = argmax_y ( log P(y) + sum_i log P(f_i | y) )
We can estimate P(Y) directly from the training data:

    P_hat(y) = c(y) / n

where c(y) is the number of training instances with label y and n is the total number of training instances.
The other parameters to estimate are the conditional probabilities of our features given each label y: P(F_i | Y = y). We do this for each possible feature value (f_i in {0, 1}).
The basic smoothing method we'll use here is Laplace smoothing, which essentially adds k counts to every possible observation value:

    P_hat(F_i = f_i | Y = y) = (c(f_i, y) + k) / ( sum_{f_i'} (c(f_i', y) + k) )

where c(f_i, y) is the number of times feature F_i took value f_i among training instances with label y.
If k = 0, the probabilities are unsmoothed; as k grows larger, the probabilities are smoothed more and more. You can use your validation set to determine a good value for k (note: you don't have to smooth P(Y)).
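As a concrete illustration of these estimates, here is a toy sketch of Laplace-smoothed training and log-posterior scoring for binary features. The function names and plain-dict data structures are illustrative only, not the assignment's actual API:

```python
import math

# Toy naive Bayes with binary features and Laplace smoothing (sketch only).
def train_naive_bayes(data, labels, k):
    """data: list of feature dicts mapping feature -> 0/1; labels: class labels."""
    n = len(data)
    features = list(data[0].keys())
    classes = set(labels)
    prior = {y: labels.count(y) / n for y in classes}   # P(Y), unsmoothed
    cond = {}                                           # (feature, y) -> P(F = 1 | y)
    for y in classes:
        rows = [d for d, lab in zip(data, labels) if lab == y]
        for f in features:
            on = sum(d[f] for d in rows)
            # Laplace smoothing: add k to both possible values (0 and 1).
            cond[(f, y)] = (on + k) / (len(rows) + 2 * k)
    return prior, cond

def log_posterior(datum, prior, cond, classes):
    """Return log P(y) + sum_i log P(f_i | y) for each class y."""
    scores = {}
    for y in classes:
        s = math.log(prior[y])
        for f, value in datum.items():
            p_on = cond[(f, y)]
            s += math.log(p_on if value == 1 else 1 - p_on)
        scores[y] = s
    return scores

data = [{'a': 1, 'b': 0}, {'a': 1, 'b': 1}, {'a': 0, 'b': 1}]
labels = [0, 0, 1]
prior, cond = train_naive_bayes(data, labels, k=1)
scores = log_posterior({'a': 1, 'b': 0}, prior, cond, {0, 1})
print(max(scores, key=scores.get))  # 0
```

Note how the log-space sum avoids the underflow that a product of 784 small probabilities would cause.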
Question 1 (8 points)
Implement the trainAndTune method in naiveBayes.py. Your code should estimate the conditional probabilities from the training data using the different values of the smoothing parameter given in the list kgrid, and evaluate the performance (accuracy) on the validation set to choose the parameter with the highest accuracy (in case of ties, prefer the lowest value of the smoothing parameter).
Also fill in the calculateLogPosteriorProbabilities code, which will use the conditional probability tables constructed by the trainAndTune method to compute the log posterior probability (as described above) for each class y for a given feature vector. Read the comments to see what the returned data structure should be.
As a start, you can test your implementation of naive Bayes with a specific value of the smoothing parameter with the command:

> python dataClassifier.py -c naivebayes -k 2.0
You can use the analysis method in dataClassifier.py to explore the mistakes that your classifier is making. Select the dataset with the -d option (-d digits or -d faces).
What are your classification accuracies? Explore the effect of varying the smoothing parameter k on the performance of your classifier. Now compare the performance of your classifiers when using 100 and 1000 training examples (e.g., with the -t 1000 option). Finally, make sure your implementation works when using the -a flag, which activates the automatic tuning of the smoothing parameter.
We highly suggest that your code print out the validation set accuracy for each value of k tried (though this is not required).
As a sanity check of your implementation, run the following command:

> python dataClassifier.py -a -d digits -c naivebayes -t 100

Hint: you may find the util.Counter.argMax function useful.
Can you explain why the optimal value of k varies as it does when going from 100 to 1000 training examples? Look at the validation set accuracy for digits while varying the size of the training set up to 2500 training examples. What can you observe about the performance? Does it look like it has leveled off?
> python dataClassifier.py -a -d digits -c naivebayes -t 1000
> python dataClassifier.py -a -d faces -c naivebayes -t 100
Another tool for understanding the parameters is to look at odds ratios. For each pixel feature F_i and classes y_1, y_2, consider the odds ratio:

    odds(F_i = 1, y_1, y_2) = P(F_i = 1 | y_1) / P(F_i = 1 | y_2)
The features that will have the greatest impact at classification time are those with both a high probability (because they appear often in the data) and a high odds ratio (because they strongly bias one label versus another).
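A sketch of how such odds ratios could be computed from a table of conditional probabilities (the names here are illustrative, not those required by the assignment):

```python
# Sketch: rank features by odds ratio of class1 over class2.
def high_odds_features(cond_prob, class1, class2, features, top=3):
    """cond_prob: dict mapping (feature, class) -> P(feature = 1 | class)."""
    odds = {f: cond_prob[(f, class1)] / cond_prob[(f, class2)] for f in features}
    # Sort features by descending odds ratio and keep the top few.
    return sorted(features, key=lambda f: odds[f], reverse=True)[:top]

cond = {('a', 1): 0.9, ('a', 2): 0.1,
        ('b', 1): 0.5, ('b', 2): 0.5,
        ('c', 1): 0.2, ('c', 2): 0.8}
print(high_odds_features(cond, 1, 2, ['a', 'b', 'c'], top=1))  # ['a']
```

With Laplace smoothing, every conditional probability is strictly positive, so the division above is safe.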
Question 2 (2 points)
Implement findHighOddsFeatures(self, class1, class2). It should return three lists: featuresClass1, the 100 features with the largest P(F_i = 1 | class1); featuresClass2, the 100 features with the largest P(F_i = 1 | class2); and featuresOdds, the 100 features with the highest odds ratios for class1 over class2.
Use the option -o to activate the odds ratio analysis, and the options -1 class1 -2 class2 to specify which class1 and class2 to use in your odds ratio analysis. Look at the 100 most likely pixels for a pair of digit classes as well as the pixels with the highest odds ratios. Plot the most likely pixels for the face and non-face classes as well as the pixels with the highest odds ratios.
Why do these plots look like they do?
A skeleton implementation of a perceptron classifier is provided for you in perceptron.py. You will fill in the train function and the findHighOddsFeatures function.
Unlike the naive Bayes classifier, the perceptron does not use probabilities to make its decisions. Instead, it keeps a prototype weight vector w^y for each class y. Given a feature vector f, the perceptron predicts the class whose prototype is most similar to the input vector. Formally, given a feature vector f (a map from properties to counts, pixels to intensities), we score each class with:

    score(f, y) = sum_i f_i * w_i^y

and predict the class with the highest score.
Using the adding, subtracting, and multiplying functionality of the Counter class in util.py, the perceptron updates should be relatively easy to code. Certain implementation issues have been taken care of for you in perceptron.py, such as handling iterations over the training data and ordering the update trials. Furthermore, the code sets up the weights data structure for you. Each legal label needs its own prototype Counter full of weights.
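For intuition, here is a minimal sketch of multi-class perceptron training on dict-based feature vectors. The helper names are illustrative, and plain dicts stand in for util.Counter; it is not the structure of perceptron.py:

```python
# Sketch of the standard multi-class perceptron update rule.
def score(weights, datum):
    """Dot product of a weight dict and a feature dict."""
    return sum(weights.get(f, 0) * v for f, v in datum.items())

def perceptron_train(data, labels, legal_labels, iterations=5):
    weights = {y: {} for y in legal_labels}
    for _ in range(iterations):
        for datum, y in zip(data, labels):
            guess = max(legal_labels, key=lambda c: score(weights[c], datum))
            if guess != y:
                # Wrong guess: raise the true class's weights,
                # lower the guessed class's weights.
                for f, v in datum.items():
                    weights[y][f] = weights[y].get(f, 0) + v
                    weights[guess][f] = weights[guess].get(f, 0) - v
    return weights

data = [{'x': 1, 'y': 0}, {'x': 0, 'y': 1}]
labels = ['A', 'B']
w = perceptron_train(data, labels, ['A', 'B'])
guess = max(['A', 'B'], key=lambda c: score(w[c], data[0]))
print(guess)  # 'A'
```

With util.Counter, the inner update loop collapses to two vector operations (adding f to the true class's Counter and subtracting it from the guessed class's Counter).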
Question 3 (6 points)
Implement the train method for the perceptron algorithm and test it using the basic pixel features on the face and digit data (use the -c perceptron option).
What classification performance do you get for each?
As a sanity check, run the command:

> python dataClassifier.py -d digits -t 100 -c perceptron

You can control the number of training iterations with the -i iterations option. Try different numbers of iterations and see how they influence the performance. In practice, you could use the performance on the validation set to figure out how many iterations to use, though you don't need to implement this for this assignment.
Question 4 (2 points)
Implement findHighOddsFeatures(self, class1, class2) in perceptron.py. It should return three lists: featuresClass1, the 100 features with the largest weights for class1; featuresClass2, the 100 features with the largest weights for class2; and featuresOdds, the 100 features with the highest difference in feature weights.
Building classifiers is only a small part of getting a good system working for a task. Indeed, the main difference between a good system and a bad one is usually not the classifier itself (e.g. perceptron vs. naive Bayes), but rather rests on the quality of the features used. So far, we have used the simplest possible features: the identity of each pixel.
To increase your classifier's accuracy further, you will need to extract
more useful features from the data. The EnhancedFeatureExtractorDigit
in dataClassifier.py is your new playground. Look at some of your errors.
You should look for characteristics of the input that would
give the classifier useful information about the label.
For instance, in the digit data, consider the number of separate, connected regions of white pixels, which varies by digit type. 1, 2, 3, 5, and 7 tend to have one contiguous region of white space, while the loops in 6, 8, and 9 create more. The number of white regions in a 4 depends on the writer. This is an example of a feature that is not directly available to the classifier from the per-pixel information. If your feature extractor adds new features that encode these properties, the classifier will be able to exploit them.
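As an illustration of such a feature, here is a sketch that counts connected regions of background ("off") pixels with a flood fill. The helper is hypothetical, not part of the provided code:

```python
from collections import deque

# Sketch: count 4-connected regions of 0-valued (background) pixels.
def count_white_regions(image):
    rows, cols = len(image), len(image[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 and (r, c) not in seen:
                regions += 1
                # Breadth-first flood fill from this background pixel.
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    cr, cc = queue.popleft()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and image[nr][nc] == 0 and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
    return regions

# A tiny "8"-like shape: the outer background plus two enclosed holes.
eight = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 0, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 0, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
print(count_white_regions(eight))  # 3
```

The region count (1, 2, or 3) could then be exposed to the classifier as a small number of binary features, as discussed below.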
Question 5 (6 points)
Add new features for the digit dataset (in EnhancedFeatureExtractorDigit) in such a way that they work with your implementation of the naive Bayes classifier: this means that, for this part, you are restricted to features which can take a finite number of discrete values (and if you have used the simpler implementation where you assumed that the features were binary valued, then you are restricted to binary features). Note that you can encode a feature which takes three values [1, 2, 3] by using three binary features, of which only one is on at a time, to indicate which of the three possibilities holds. This doesn't fit well with the conditional independence assumption of naive Bayes, but it can still work in practice.
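A tiny sketch of that encoding (the feature name here is made up for illustration):

```python
# Sketch: encode a feature taking values {1, 2, 3} as three binary
# indicator features, exactly one of which is on.
def one_hot(name, value, values=(1, 2, 3)):
    return {(name, v): (1 if value == v else 0) for v in values}

print(one_hot('regions', 2))
# {('regions', 1): 0, ('regions', 2): 1, ('regions', 3): 0}
```

The resulting indicator features can be merged into the per-pixel feature map before it is handed to the classifier.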
We will test your classifier with the following command:
> python dataClassifier.py -d digits -c naivebayes -f -a -t 1000
With the basic features (without the -f option), your optimal choice of smoothing parameter should yield 82% on the validation set, with a test performance of 78%. You will get 4 points for implementing a new feature which yields an improvement, and 2 additional points if your new feature gives you a test performance greater than or equal to 82% with the above command (note the automatic tuning of the smoothing parameter).