Written Assignment 3: Probability and Classification
Due 3/2 at 11:59pm
Question 1 (2 points). Consider the following 3-sided dice with the given side values. Assume the dice are all fair and all rolls are independent.
A: 2, 2, 5
B: 1, 4, 4
C: 3, 3, 3
- What is the expected value of each die?
- Consider the indicator function better(X, Y), which has value 1 if X > Y and value -1 if X < Y. What are the expected values of better(A, B), better(B, C), and better(C, A)? Why are these sometimes called non-transitive dice? (A brute-force enumeration sketch follows this question.)
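One way to check the Question 1 arithmetic is brute-force enumeration over the equally likely side pairs. A minimal Python sketch, assuming the ±1 convention for better stated above (the function and variable names are illustrative only):

```python
from itertools import product

# Side values from Question 1; each side of a fair die is equally likely.
dice = {"A": [2, 2, 5], "B": [1, 4, 4], "C": [3, 3, 3]}

def expected_value(sides):
    return sum(sides) / len(sides)

def better(x, y):
    # +1 if x beats y, -1 if y beats x (a tie, which cannot occur with these dice, gives 0)
    return (x > y) - (x < y)

def expected_better(first, second):
    # Independent rolls: all 3 x 3 side pairs are equally likely.
    pairs = list(product(dice[first], dice[second]))
    return sum(better(x, y) for x, y in pairs) / len(pairs)

for name, sides in dice.items():
    print(name, expected_value(sides))
for pair in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(pair, expected_better(*pair))
```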
Question 2 (2 points). On a day when an assignment
is due (A=a), the newsgroup tends to be busy (B=b), and the computer lab tends
to be full (C=c). Consider the following
conditional probability tables for the domain.
P(A)

P(B | A)
  B     A     P
  b     a     0.80
  ¬b    a     0.20
  b     ¬a    0.40
  ¬b    ¬a    0.60

P(C | A)
  C     A     P
  c     a     0.70
  ¬c    a     0.30
  c     ¬a    0.50
  ¬c    ¬a    0.50

- Construct the joint distribution from these conditional probability tables, assuming B and C are independent given A.
- What is the marginal distribution P(B, C)? Are these two variables absolutely independent in this model? Justify your answer using the actual probabilities, not your intuitions.
- What is the posterior distribution over A given that B=b, P(A | B=b)? What is the posterior distribution over A given that C=c, P(A | C=c)? What about P(A | B=b, C=c)? Explain the pattern among these posteriors and why it holds. (A sketch for checking these computations follows this question.)
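A minimal Python sketch of the construct-then-normalize steps in Question 2. The P_A values below are placeholders only, since the P(A) table must be read from the handout; B and C are taken to be conditionally independent given A, as the question states, and all names are illustrative:

```python
P_A = {"a": 0.5, "¬a": 0.5}   # placeholder values; substitute the numbers from the P(A) table
P_B_given_A = {("b", "a"): 0.80, ("¬b", "a"): 0.20,
               ("b", "¬a"): 0.40, ("¬b", "¬a"): 0.60}
P_C_given_A = {("c", "a"): 0.70, ("¬c", "a"): 0.30,
               ("c", "¬a"): 0.50, ("¬c", "¬a"): 0.50}

# Joint via the chain rule with B and C conditionally independent given A:
#   P(a, b, c) = P(a) P(b | a) P(c | a)
joint = {(a, b, c): P_A[a] * P_B_given_A[(b, a)] * P_C_given_A[(c, a)]
         for a in P_A for b in ("b", "¬b") for c in ("c", "¬c")}

# Posterior P(A | B=b): keep consistent rows, sum out C, then normalize over A.
unnormalized = {a: sum(p for (a2, b, c), p in joint.items() if a2 == a and b == "b")
                for a in P_A}
total = sum(unnormalized.values())
print({a: p / total for a, p in unnormalized.items()})
```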
Question 3 (2 points). Sometimes, there is traffic
on the freeway (C=c). This could either be because of a ball game (B=b) or because of an accident (A=a). Consider the following
joint probability table for the domain A = {a, ¬a}, B = {b, ¬b}, C = {c, ¬c}.
P(A, B, C)
  A     B     C     P
  a     b     c     0.018
  a     b     ¬c    0.002
  a     ¬b    c     0.144
  a     ¬b    ¬c    0.036
  ¬a    b     c     0.056
  ¬a    b     ¬c    0.024
  ¬a    ¬b    c     0.144
  ¬a    ¬b    ¬c    0.576

- What is the distribution P(A, B)? Are A and B independent in this model given no evidence? Justify your answer using the actual probabilities, not your intuitions.
- What is the marginal distribution over A given no evidence?
- How does this change if we observe that C=c; what is the posterior distribution P(A | C=c)? Does this change intuitively make sense? Why or why not?
- What is the conditional distribution over A if we then learn there is a ball game, P(A | B=b, C=c)? Does it make sense that observing B should cause this update to A (called explaining away)? Why or why not? (A sketch for computing these posteriors from the joint table follows this question.)
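A minimal Python sketch for Question 3 showing the mechanical part of the computation: filter the joint-table rows consistent with the evidence, sum out the remaining variables, and normalize over A (the helper name is illustrative):

```python
# Joint table from Question 3, keyed by (A, B, C).
joint = {
    ("a", "b", "c"): 0.018,   ("a", "b", "¬c"): 0.002,
    ("a", "¬b", "c"): 0.144,  ("a", "¬b", "¬c"): 0.036,
    ("¬a", "b", "c"): 0.056,  ("¬a", "b", "¬c"): 0.024,
    ("¬a", "¬b", "c"): 0.144, ("¬a", "¬b", "¬c"): 0.576,
}

def posterior_over_A(**evidence):
    """P(A | evidence): sum the consistent joint entries, then normalize over A."""
    names = ("A", "B", "C")
    unnormalized = {}
    for assignment, p in joint.items():
        values = dict(zip(names, assignment))
        if all(values[var] == val for var, val in evidence.items()):
            unnormalized[values["A"]] = unnormalized.get(values["A"], 0.0) + p
    total = sum(unnormalized.values())
    return {a: p / total for a, p in unnormalized.items()}

print(posterior_over_A())              # marginal P(A), no evidence
print(posterior_over_A(C="c"))         # P(A | C=c)
print(posterior_over_A(B="b", C="c"))  # P(A | B=b, C=c)
```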
Question 4 (2 points). Often we need to carry out reasoning over some pair of variables X, Y conditioned on the value of some other variable E=e.
- Using the definition of conditional probability, prove the conditionalized version of the product rule: P(X, Y | e) = P(X | Y, e) P(Y | e).
- Prove the conditionalized version of Bayes' rule: P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e). (A numeric sanity check follows this question.)
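The sketch below is not the requested algebraic proof; it only verifies the conditionalized Bayes' rule numerically on an arbitrary joint distribution over three binary variables, which can be a useful sanity check before (or after) writing the derivation. All names are illustrative:

```python
import itertools
import random

random.seed(0)
# An arbitrary joint distribution over three binary variables X, Y, E.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def prob(**fixed):
    """Probability that the named variables among (X, Y, E) take the given values."""
    names = ("X", "Y", "E")
    return sum(p for o, p in joint.items()
               if all(dict(zip(names, o))[k] == v for k, v in fixed.items()))

def cond(target, given):
    return prob(**target, **given) / prob(**given)

lhs = cond({"Y": 1}, {"X": 1, "E": 1})                         # P(Y | X, e)
rhs = (cond({"X": 1}, {"Y": 1, "E": 1})                        # P(X | Y, e)
       * cond({"Y": 1}, {"E": 1}) / cond({"X": 1}, {"E": 1}))  # * P(Y | e) / P(X | e)
print(abs(lhs - rhs) < 1e-12)
```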
Question 5 (2 points). Suppose we wish to calculate P(C | A, B).
- If we have no conditional independence information, which of the following sets of tables are sufficient to calculate P(C | A, B)?
  1. P(A, B), P(C), P(A | C), P(B | C)
  2. P(A, B), P(C), P(A, B | C)
  3. P(C), P(A | C), P(B | C)
- What if we know that A and B are conditionally independent given C?
Question 6 (3 points). A common, simple way of
improving classification accuracy is to train an ensemble of multiple
classifiers for a single task, then to classify new instances by the majority
output of the individual classifiers. Suppose we have M binary
classifiers, C_1 ... C_M, each of which has an error rate ε. Assume M is odd.
- Assume that the classifiers make independent errors, that is, for any input x, the chance of classifier C_i making a mistake is independent of whether the other classifiers were correct. In this case, give a formula for the error rate of the ensemble in terms of ε and M. If ε = 0.1 and M = 3, what is the error rate of the ensemble?
- For a fixed component error rate ε < 0.5, what happens to the ensemble error rate as M gets very large, and why? Why does this not solve the world's classification problems?
- If we do not know that the errors are independent, is it possible that the ensemble will have a higher error rate than its members? Again, assume all classifiers share the same error rate ε < 0.5. (A simulation sketch for the independent-errors case follows this question.)
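A minimal Monte Carlo sketch for the independent-errors case of Question 6. It estimates the majority-vote error rate empirically rather than giving the closed-form expression the question asks for, so it can serve as a check on a derived formula; the function name and trial count are arbitrary:

```python
import random

def ensemble_error_rate(epsilon, M, trials=100_000, seed=0):
    """Estimate the majority-vote error rate of M classifiers that err
    independently, each with probability epsilon (M assumed odd)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        mistakes = sum(rng.random() < epsilon for _ in range(M))
        if mistakes > M // 2:   # the majority of the M classifiers are wrong
            failures += 1
    return failures / trials

print(ensemble_error_rate(epsilon=0.1, M=3))    # compare with the closed-form answer
print(ensemble_error_rate(epsilon=0.1, M=101))  # behaviour as M grows
```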
Question 7 (4 points). Consider building a naive Bayes and a perceptron classifier on the following examples, where X and Y are evidence variables and Z is the target class variable.
Z=0: X=0,Y=1
Z=0: X=1,Y=0
Z=1: X=1,Y=1
Z=0: X=1,Y=0
Z=0: X=0,Y=1
Z=0: X=1,Y=0
Z=0: X=0,Y=1
- For each data point, construct a feature vector [X, Y, 1], where the final element is the bias feature. What parameters will a two-class perceptron learn on this training set if trained until convergence? Assume that ties result in updates. What will its accuracy be when tested on the same set?
- What parameters will a naive Bayes classifier learn on this training set, assuming there is no smoothing? What will its accuracy be if the test set is identical to the training set?
- Instead of a naive Bayes model, consider predicting P(Z | X, Y) directly from the full empirical joint distribution over X, Y, and Z. What will the accuracy of this model be if the test set is identical to the training set?
- Explain the difference between the accuracies of the two Bayesian classifiers in terms of conditional independence assumptions.
- Give a general disadvantage of each of these three classifiers. (A perceptron training sketch for the first part follows this question.)
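A minimal sketch of the perceptron training loop referred to in the first part of Question 7. It assumes zero initial weights, the [X, Y, 1] feature encoding, the listed example order, and the stated convention that a tie (a score of exactly zero) triggers an update; the epoch cap is only a safety guard, and the answers themselves should still be derived by hand:

```python
# Training data from Question 7 as (label z, feature vector [X, Y, 1]).
data = [
    (0, [0, 1, 1]), (0, [1, 0, 1]), (1, [1, 1, 1]), (0, [1, 0, 1]),
    (0, [0, 1, 1]), (0, [1, 0, 1]), (0, [0, 1, 1]),
]

def train_perceptron(examples, max_epochs=100):
    w = [0, 0, 0]                                  # weights for [X, Y, bias]
    for _ in range(max_epochs):
        updates = 0
        for z, f in examples:
            score = sum(wi * fi for wi, fi in zip(w, f))
            predicted = 1 if score > 0 else 0
            if predicted != z or score == 0:       # ties trigger updates, per the question
                sign = 1 if z == 1 else -1
                w = [wi + sign * fi for wi, fi in zip(w, f)]
                updates += 1
        if updates == 0:                           # converged: a full pass with no updates
            break
    return w

weights = train_perceptron(data)
accuracy = sum((sum(wi * fi for wi, fi in zip(weights, f)) > 0) == (z == 1)
               for z, f in data) / len(data)
print(weights, accuracy)
```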
Question 8 (3 points). In this exercise, you will
show that the relative frequency estimates discussed in class are the maximum
likelihood estimates for a single random variable's distribution. Consider
a random variable X with values x_1, x_2, ... x_n. Imagine that we have a
data set in which we observe k samples from X, e.g. [x_7, x_3, x_6, ...], in
which each x_i occurred c_i times. Let p_i be the probability of x_i.
- To begin with, let X = {h, t} be a coin flip. Assume the data is [h, t, h, h, t]. Write the likelihood of this exact sample in terms of p_h and p_t (you may wish to use 1 - p_h instead of p_t). Find the maximizing parameters by differentiating the likelihood function and setting the derivative to zero.
- Extend this argument to arbitrary data sets over two-valued random variables. Does the order of the data matter?
- Show that the relative frequency estimate holds in the general case where n > 2. You may assume that the likelihood has a unique maximum and no other local extrema. (A symbolic check for the two-valued case follows this question.)
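A small symbolic sketch (using sympy, an extra dependency) for checking the two-valued part of Question 8: it maximizes the log-likelihood for generic head and tail counts rather than reproducing the hand derivation the question asks for.

```python
import sympy as sp

p, c_h, c_t = sp.symbols("p c_h c_t", positive=True)

# Log-likelihood of a data set with c_h heads and c_t tails, writing p for p_h.
log_likelihood = c_h * sp.log(p) + c_t * sp.log(1 - p)

# Set the derivative to zero and solve for p; the relative-frequency estimate should appear.
print(sp.solve(sp.diff(log_likelihood, p), p))
```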