## Written Assignment 3: Probability and Classification

Due 3/2 at 11:59pm

Question 1 (2 points). Consider the following 3-sided dice with the given side values.  Assume the dice are all fair and all rolls are independent.

A: 2, 2, 5
B: 1, 4, 4
C: 3, 3, 3

1. What is the expected value of each die?
2. Consider the indicator function better(X,Y) which has value 1 if X>Y and value -1 if X<Y.  What are the expected values of better(A, B), better(B, C), better(C, A)?  Why are these sometimes called non-transitive dice?

Question 2 (2 points). On a day when an assignment is due (A=a), the newsgroup tends to be busy (B=b), and the computer lab tends to be full (C=c).  Consider the following conditional probability tables for the domain.

P(A) P(B|A) P(C|A)
 A P a 0.10 ¬a 0.90
 B A P b a 0.80 ¬b a 0.20 b ¬a 0.40 ¬b ¬a 0.60
 C A P c a 0.70 ¬c a 0.30 c ¬a 0.50 ¬c ¬a 0.50
1. Construct the joint distribution out of these conditional probabilities tables assuming B and C are independent given A.
2. What is the marginal distribution P(B,C)?  Are these two variables absolutely independent in this model?  Justify your answer using the actual probabilities, not your intuitions.
3. What is the posterior distribution over A given that B=b, P(A | B=b)?  What is the posterior distribution over A given that C=c, P(A | C=c)?  What about P(A | B=b, C=c)?  Explain the pattern among these posteriors and why it holds.

Question 3 (2 points). Sometimes, there is traffic on the freeway (C=c).  This could either be because of a ball game (B=b) or because of an accident (A=a).  Consider the following joint probability table for the domain A = {a, ¬a}, B = {b, ¬b}, C = {c, ¬c}.

P(A, B, C)
 A B C P a b c 0.018 a b ¬c 0.002 a ¬b c 0.144 a ¬b ¬c 0.036 ¬a b c 0.056 ¬a b ¬c 0.024 ¬a ¬b c 0.144 ¬a ¬b ¬c 0.576

1. What is the distribution P(A,B)?  Are A and B independent in this model given no evidence?  Justify your answer using the actual probabilities, not your intuitions.
2. What is the marginal distribution over A given no evidence?
3. How does this change if we observe that C=c; what is the posterior distribution P(A | C=c)?  Does this change intuitively make sense?  Why or why not?
4. What is the conditional distribution over A if we then learn there is a ball game, P(A | B=b, C=c)?  Does it make sense that observing B should cause this update to A (called explaining-away)?  Why or why not?

Question 4 (2 points). Often we need to carry out reasoning over some pair of variables X, Y conditioned on the value of other variable E=e.

1. Using the definitions of conditional probabilities, prove the conditionalized version of the product rule: P(X, Y | e) = P(X | Y, e) P(Y | e)
2. Prove the conditionalized version of Bayes' rule: P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e)

Question 5 (2 points). Suppose we wish to calculate P(C | A, B).

1. If we have no conditional independence information, which of the following tables are sufficient to calculate P(C | A, B)?
1. P(A, B), P(C), P(A | C), P(B | C)
2. P(A, B), P(C), P(A, B | C)
3. P(C), P(A| C), P(B | C)
2. What if we know that A and B are conditionally independent given C?

Question 6 (3 points). A common, simple way of improving classification accuracy is to train an ensemble of multiple classifiers for a single task, then to classify new instances by the majority output of the individual classifiers.  Suppose we have M binary classifiers, C_1 ... C_M, each of which has an error rate ε. Assume M is odd.

1. Assume that the classifiers make independent errors, that is, for any input x, the chance of classifier Ci making a mistake is independent of whether the other classifiers were correct.  In this case, give a formula for the error rate of the ensemble in terms of ε and M.  If ε = 0.1 and M = 3, what is the error rate of the ensemble?
2. For a fixed component error rate ε < 0.5, what happens to the ensemble error rate as M gets very large and why?  Why does this not solve the world's classification problems?
3. If we do not know that the errors are independent, is it possible that the ensemble will have a higher error rate than its members?  Again, assume all classifiers share the same error rate ε < 0.5.

Question 7 (4 points). Consider building a naive Bayes and a perceptron classifer on the following example, where X and Y are evidence variables and Z is the target class variable.

Z=0: X=0,Y=1
Z=0: X=1,Y=0
Z=1: X=1,Y=1
Z=0: X=1,Y=0
Z=0: X=0,Y=1
Z=0: X=1,Y=0
Z=0: X=0,Y=1

1. For each data point, construct a feature vector that consists of [X, Y, 1] where the final element is the bias feature. What parameters will a two-class perceptron learn on this training set if trained until convergence?  Assume that ties result in updates.  What will its accuracy be when tested on the same set?
2. What parameters will a naive Bayes classifier learn on this training set, assuming there is no smoothing.  What will its accuracy be if the test set is identical to the training set?
3. Instead of a naive Bayes model, consider predicting P(Z|X,Y) directly from the full empirical joint distribution over X, Y, and Z.  What will the accuracy of this model be if the test set is identical to the training set?
4. Explain the difference between the accuracies of the two Bayesian classifiers in terms of conditional independence assumptions.
5. Give a general disadvantage of each of these three classifiers.

Question 8 (3 points). In this exercise, you will show that the relative frequency estimates discussed in class are the maximum likelihood estimates for a single random variable's distribution.  Consider a random variable X with values x_1, x_2, ... x_n.  Imagine that we have a data set in which we observe k samples from X, e.g. [x_7, x_3, x_6, .... ], in which each x_i occurred c_i times.  Let p_i be the probability of x_i.

1. To begin with, let X = {h,t} be a coin flip.  Assume the data is [h, t, h, h, t].  Write the likelihood of this exact sample in terms of p_h and p_t (you may wish to use 1 - p_h instead of p_t).  Find the maximizing parameters by differentiating the likelihood function and setting the derivative to zero.
2. Extend this argument to arbitrary data sets over two-valued random variables.  Does the order of the data matter?
3. Show that the relative frequency estimate holds in the general case where n > 2. You may assume that the likelihood has a unique maximum and no local extrema.