Project 2.1: Gridworld MDPs

Due 10/16 at 11:59pm
Update: 10/7: Minor corrections to the text of 1(a) and some typo fixes.
Update: 10/14: Non-critical change to which removed duplicate entries from getTransitionStatesAndProbs().
Update: 10/15: Clarification to the text of question 1(d) (no code change).

MDP solvers
and reinforcement learning.
Make a robot crawl.


In this checkpoint, you will experiment with both value iteration for known MDPs and Q-learning for reinforcement learning. You will test your systems on a simple Gridworld domain, but also apply them to the task of teaching a simple simulated robot to crawl.

The code for this checkpoint contains the following files, available as a zip archive.

Checkpoint files: The file in which you will write your agents. Abstract class for general MDPs. Abstract class for general reinforcement learning environments (compare to The Gridworld code and test harness. Plug-in for the Gridworld graphical display. You can ignore this file entirely. GUI code. You can ignore this file entirely. Plug-in for the Gridworld text interface. You can ignore this file entirely. The crawler code and test harness. You will run this but not edit it, and so can ignore the contents. GUI for the crawler robot. You can ignore this file entirely.


What to submit: You will fill in portions of (only) during the assignment. You should submit this file along with a text or pdf file containing answers to the discussion questions. If you change other files, submit those as well, but you should not need to.

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. Your answers to discussion questions will also be graded.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. Instead, contact the course staff if you are having trouble.


MDPs and Reinforcement Learning

To get started, run the Gridworld harness in interactive mode:

    python -m

You will see the two-exit Gridworld from class.  Your agent's position is given by the blue dot, and you can move with the arrow keys.  Notice that the agent's value estimates are shown, and are all zero.  Manual control may be a little frustrating if the noise level is not turned down (-n), since you will sometimes move in an unexpected direction.  Such is the life of a Gridworld agent!  You can control many aspects of the simulation.  A full list is available by running:

    python -h

You check out the other grids, change the noise or discount, change the number of episodes to run and so on.  If you drop the manual flag (-m) you will get the RandomAgent by default.  Try:

    python -g MazeGrid

You should see the random agent bounce around the grid until it happens upon the exit.  Not the finest hour for an AI agent; we will build better ones soon.

Next, either use the text interface (-t) or look at the console output that accompanies the graphical output.  Do a manual run through any grid you like, and notice that unless you specify quiet (-q) output, you will be told about each transition the agent experiences.  As in previous code, coordinates are in (row, col) format and any arrays are indexed by [row][col], with 'north' being the direction of decreasing row, etc.  By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r).  Note particularly, that the MDP is such that you first must enter a pre-terminal state and then take the special 'exit' action before the episode actually ends (in the true terminal state (-1, -1)).  Your total return may have been less than you expected, due to the discount rate (-d).

You should definitely look at,, and closely, and investigate parts of as needed.  The support code should be ignored.

Hint: The util.Counter class in will make your life much easier in this assignment.  It acts like a dictionary, but has a getCount() method which returns zero for items not in the Counter (rather than raising an exception like a dictionary).


Question 1 (10 points)  Write a value iteration agent in ValueIterationAgent, which has been partial specified for you in  You can select this agent with '-a value'.  Your value iteration agent is an offline planner, not a reinforcement agent, and so the relevant training option is the number of iterations of value iteration it should run (-i).  It should take an MDP on construction and run value iteration for the specified iterations on that MDP before the constructor returns.  Recall that value iteration computes estimates of the optimal values.  From these value estimates, you should synthesize responses to getPolicy(state) and getQValue(state, action).  (If your implementation is such that the policy or q-values are precomputed, you may simply return them.)  You may assume that 100 iterations is enough for convergence in the questions below. (Note: to actually run your agent with the learned policy, use the -k option; press enter to cycle through viewing the learned values, q-values and policy execution.)

Written short questions (two sentence answers, max!):

  1. How many rounds of value iteration are needed before the start state of MazeGrid becomes non-zero?  Why?
  2. Consider the policy learned on BridgeGrid with the default discount of 0.9 and the default noise of 0.2.  Which one of these two parameters must we change before the agent dares to cross the bridge, and to what value?
  3. On the DiscountGrid, give parameter values which produce the following optimal policy types or state that they are impossible:
    1. Prefer the close exit (+1), risking the cliff (-10)
    2. Prefer the close exit (+1), but avoiding the cliff (-10)
    3. Prefer the distant exit (+10), risking the cliff (-10)
    4. Prefer the distant exit (+10), avoiding the cliff (-10)
    5. Avoid both exits (also avoiding the cliff)
  4. On the MazeGrid, with the default parameters, compare the value estimate of the start state after 100 iterations to the empirical returns from running 10 episodes (-k 10 -q).  How close is the value estimate to the empirical average discounted rewards?   How about if we run 10000 episodes (-k 10000 -q)?   Does the value estimate predict the long-term average discounted rewards correctly?

Note that your value iteration agent does not actually learn from experience.  Rather, it ponders its knowledge of the MDP to arrive at an optimal policy before ever interacting with a real environment.  This distinction may be subtle in a simulated environment like a Gridword, but it's very important in the real world, where the real MDP is not available. 

Question 2 (9 points) You will now write a Q-learning agent, which does very little on construction, but then learns by trial and error interactions with the environment through its update(state, action, nextState, reward) method.  A stub of a Q-learner is specified in QLearningAgent in, and you can select it with the option '-a q'.  You should first run your Q-learner through several episodes under manual control (e.g. '-k 5 -m'), for example on the MazeGrid.  Watch how the agent learns about the state it was just in and 'leaves learning in its wake.'  Your agent should be an epsilon-greedy learner, meaning it chooses random actions epsilon of the time, and follows its current best q-values otherwise.

Written short questions:

  1. Train your Q-learner on the MazeGrid for 100 episodes.  How are the learned values different from those learned by value iteration, and why?  How can you make them closer to the optimal values?
  2. Train your Q-learner on the BridgeGrid with no noise (-n 0.0) for 100 episodes.  How do the learned q-values compare to those of the value iteration agent?  Why will your agent usually not learn the optimal policy?
  3. Train your Q-learner on the CliffGrid for 100 episodes.  Compare the value it learns for the start state with the average returns from the training episodes (printed out automatically).  Why are they so different?

Question 3 (1 point) Try the following command:


This will invoke the crawling robot from class using your Q-learner.  Play around with the various learning parameters to see how they affect the agent's policies and actions.   Your formal requirement for this question is simply to ensure that your Q-learner works in this non-episodic environment.