Reinforcement Learning


Due Friday March 21 at 11:59pm

You may work on this homework either individually or in pairs.

What to submit: Submit 3 files online under the project name p5:

blackjack.py - Your code for solving the problems. We should be able to run this code and reproduce your output.txt file.
output.txt - The output of your blackjack.py program, containing the answers to the questions below.
partners.txt - The usual partners.txt file as described in the submission instructions.

The goal of this programming assignment is to discover the optimal policy for blackjack. The rules are as stated in Example 5.1 of Sutton and Barto.
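Under the Example 5.1 conventions, a state is (player sum 12-21, dealer's showing card, whether the player holds a usable ace), and cards are dealt from an infinite deck. A minimal sketch of those conventions (function names here are my own, not required by the assignment):

```python
import random

def draw_card(rng=random):
    """Draw from an infinite deck: ranks 1 (ace) through 13, with all
    face cards counting as 10, so P(ten-valued) = 4/13 and every other
    value has probability 1/13."""
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    """Return (total, usable_ace): an ace counts as 11 whenever doing
    so does not bust the hand."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False
```

For example, `hand_value([1, 9])` is `(20, True)`, while drawing a 5 on top of it gives `hand_value([1, 9, 5]) == (15, False)` because the ace must revert to counting as 1.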

You are asked to implement two different algorithms:

(a) Monte Carlo ES, as in Example 5.3 of Sutton and Barto. The result of your algorithm should resemble what is displayed in Fig. 5.5. You should print out a table giving the V* function, with numerical values for each state.
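One way the pieces of Monte Carlo ES fit together, sketched under the Example 5.1 conventions (exploring starts over all 200 states, first-visit averaging, greedy policy improvement; the helper names and the incremental-mean update are my own choices, not mandated by the text):

```python
import random
from collections import defaultdict

def draw():
    """Infinite deck: 1 (ace) through 9 at 1/13 each, ten-valued at 4/13."""
    return min(random.randint(1, 13), 10)

def add_card(total, usable, card):
    """Add one card to a hand, tracking whether an ace counts as 11."""
    if card == 1 and total + 11 <= 21:
        return total + 11, True
    total += card
    if total > 21 and usable:
        return total - 10, False      # ace reverts from 11 to 1
    return total, usable

def dealer_play(showing):
    """Dealer draws a hole card and then hits until reaching 17."""
    total, usable = add_card(0, False, showing)
    total, usable = add_card(total, usable, draw())
    while total < 17:
        total, usable = add_card(total, usable, draw())
    return total

def episode(policy, state, action):
    """Play one hand from an exploring start; return the visited
    (state, action) pairs and the terminal reward (+1, 0, or -1)."""
    psum, dealer, usable = state
    visited = [(state, action)]
    while action == 1:                                # 1 = hit, 0 = stick
        psum, usable = add_card(psum, usable, draw())
        if psum > 21:
            return visited, -1                        # player busts
        state = (psum, dealer, usable)
        action = policy[state]
        visited.append((state, action))
    dsum = dealer_play(dealer)
    if dsum > 21 or psum > dsum:
        return visited, 1
    return visited, -1 if psum < dsum else 0

def mc_es(n_episodes=500000):
    states = [(p, d, u) for p in range(12, 22)
              for d in range(1, 11) for u in (False, True)]
    policy = {s: 0 if s[0] >= 20 else 1 for s in states}  # stick on 20/21
    Q = defaultdict(float)
    N = defaultdict(int)
    for _ in range(n_episodes):
        s0 = random.choice(states)                    # exploring start:
        a0 = random.choice((0, 1))                    # random state AND action
        visited, reward = episode(policy, s0, a0)
        for s, a in set(visited):                     # first-visit update
            N[s, a] += 1
            Q[s, a] += (reward - Q[s, a]) / N[s, a]   # incremental mean
            policy[s] = max((0, 1), key=lambda b: Q[s, b])
    return policy, Q
```

The V* table the assignment asks for then falls out as V*(s) = max_a Q(s, a) for each of the 200 states.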

(b) The Q-learning algorithm, as in Figure 6.12 of Sutton and Barto. Initialize using the same policy as in part (a) above, namely stick only on 20 or 21. Print out a table of the Q* function.
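The Figure 6.12 algorithm reduces to the update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)] applied along trajectories from an ε-greedy behaviour policy. A minimal tabular sketch on a toy MDP (the state names, the `transitions` table format, and the toy example are invented for illustration; the blackjack state space plugs into the same loop):

```python
import random
from collections import defaultdict

def q_learning(transitions, start, n_episodes=2000,
               alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Q-learning on a small MDP described by
    transitions[(state, action)] = (next_state, reward);
    next_state None marks the end of an episode."""
    Q = defaultdict(float)
    actions = defaultdict(list)
    for (s, a) in transitions:
        actions[s].append(a)
    for _ in range(n_episodes):
        s = start
        while s is not None:
            # epsilon-greedy behaviour policy
            if random.random() < eps:
                a = random.choice(actions[s])
            else:
                a = max(actions[s], key=lambda b: Q[s, b])
            s2, r = transitions[s, a]
            # off-policy target: greedy (max) action value in s2
            best_next = 0.0 if s2 is None else max(Q[s2, b] for b in actions[s2])
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
            s = s2
    return Q

# Toy two-state MDP: from A, 'hit' leads through B to a reward of 2,
# which beats sticking immediately for 1.
toy = {('A', 'hit'): ('B', 0), ('A', 'stick'): (None, 1),
       ('B', 'hit'): (None, 2), ('B', 'stick'): (None, 0)}
```

On this toy problem Q('A','hit') converges toward 2 and Q('A','stick') toward 1, so the greedy policy learns to hit in A.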

In addition, answer the following questions:

(c) How many episodes does each of the two learning algorithms take to converge to the optimal policy?

(d) Suppose that the card deck has become rich in tens and aces, so that the probability of drawing one of these cards is 10% greater than for an unbiased deck. This implies, of course, that the probability of drawing a card in the 2-9 range is reduced accordingly. Calculate the new value function V*.
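Reading "10% greater" as multiplying each probability by 1.1: the unbiased deck has P(ace) = 1/13 and P(ten-valued) = 4/13, so the biased deck has P(ace) = 1.1/13 and P(ten-valued) = 4.4/13, leaving mass 7.5/13 spread evenly over the eight values 2-9. A sketch of a draw function under that interpretation (reusing it in place of the unbiased draw re-runs either algorithm on the rich deck):

```python
import random

def draw_biased(rng=random):
    """Draw from a deck rich in tens and aces: P(ace) = 1.1/13,
    P(ten-valued) = 4.4/13, and the remaining 7.5/13 probability
    spread uniformly over the values 2 through 9."""
    r = rng.random() * 13
    if r < 1.1:
        return 1                      # ace
    if r < 1.1 + 4.4:
        return 10                     # ten-valued card
    return rng.randrange(2, 10)       # 2-9, each at (7.5/13)/8
```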

(e) Overall, what is your expected gain or loss in each of the two cases (unbiased and biased deck)?