CS152 Computer Architecture and Engineering

Lab #4: The Final Project

John Lazzaro

Fall 2005


Lab 4 is your final project. You will design a memory system for your 5-stage pipelined processor: an instruction cache, a data cache, and an interface to DRAM. The caches will have 128-bit block sizes, which you will be expected to fill from DRAM in a burst. We will provide you with a sample DRAM controller to get you started -- however, you will need to make extensive modifications to it to use it in your project.

Lab 4 has several "checkoffs" and a final deadline:

  1. By Thursday 10/20 at 11:59 PM, your group will submit a preliminary design document to the TAs via email. We describe this document in Problem 0c below. Team Evaluations (described below in Problem 0b) are also due by 9 PM on Thursday 10/20. Each team member emails a separate team evaluation report.
  2. On Friday 10/21 in lab section, your TA will review your design document with you, and suggest changes. A final version of the document, incorporating these changes, is due via email by Monday 10/24 at 11:59 PM.
  3. On Friday 10/28 in lab section, you will show your DRAM interface working on the Calinx board, running your DRAM interface test vector suite.
  4. On Friday 11/4 in lab section, you will show your memory system (interconnected I-cache, D-cache, and DRAM interface) working on the Calinx board, running your memory system test vector suite.
  5. On Friday 11/18 in lab section, you will show your memory system integrated into your Lab 3 pipelined processor and running instructions on the Calinx board. During this demo, the TA will provide you with secret test code. If you are able to pass these tests on your first try, you will receive bonus points. You can also receive bonus points if you fix your processor to pass the tests within your section time. If your processor is not fixed by the end of section, your TA will provide source for the secret test code, for use in your weekend debugging sessions.
  6. On Monday 11/22 at 11:59 PM the final project (including the lab report) is due.
  7. On Tuesday 11/29 at 9 PM Team Evaluations for the final project are due. Each team member emails a separate team evaluation report. The report format follows the format for the Lab 3 Team Evaluation described in Problem 0b.
  8. On Thursday 12/8 in class, your group will give a final project presentation. It will be 20 minutes in length maximum, and all four group members will take turns talking during the presentation. Focus on practical issues that will be helpful for next year's students, as we may be showing your presentation to them next time we teach 152! Focus on testing techniques, CAD, maintaining good group dynamics, design methodology, etc. Include a section describing the problems that came up during the project, and how you solved them.
  9. On Thursday 12/8 at 11:59 PM a version of the slides for your final project presentation is due by email to the instructor (PowerPoint, PDF, or Keynote files are fine).

Lab Report Submission Policies: To submit your lab report, run m:\bin\submit-fall2005.exe, or type "submit-fall2005.exe" at a command prompt, then follow the instructions. The required format for lab reports is shown on the resources page, as is the required format for your design notebook.

Lab 4 Document History:

  1. 8/30 Initial document released.

  



Problem 0: Pre-Flight

Several important final project tasks must be done before actual hardware design begins. We list them below.

Problem 0a: Design Notebook

As part of this lab, your group will keep an on-line notebook, so you need to initialize the directory and files that will hold the notebook entries. See the Lab 2 writeup for detailed information about the notebook.

Problem 0b: Team Evaluation for Lab 3 and Final Project

To help us understand how your team is functioning we require you to evaluate yourself and each of your team members individually. At the start of the final project, you will submit the team evaluations for Lab 3. After the conclusion of the final project, you will submit team evaluations for the final project.

See the Lab 2 handout for how you should determine the evaluation score for each team member. See the schedule at the front of the lab for the due date for the evaluations. Note that each team member emails a separate evaluation report.

Problem 0c: Preliminary Design Document

Below we show the block diagram of the final project.

Figure 0: Project Block Diagram

The preliminary design document will contain the following information:

  1. Timing diagrams and signal names for the IC, IM, DC, and DM buses. Defining these will require studying the section below describing the DRAM controller example IP, and also studying the section below describing the cache requirements. Note that the I-cache in the sample IP doesn't meet the cache requirements, and the memory bus in the sample IP doesn't meet the performance requirements of the caches. So, thinking will be required.
  2. A preliminary block diagram of the datapaths of the instruction cache, data cache, and DRAM controller. By preliminary, we mean drawing boxes showing the major parts of the data path (write buffer, cache tags, etc), and the general connections between them.
  3. During the course of the project, you will create and demonstrate three test suites: (1) a DRAM controller test bench that interfaces to the IM and DM buses, (2) a memory system test bench that interfaces to the IC and DC buses, and (3) MIPS assembly language running on a pipelined processor connected to the memory system. For the preliminary design document, make a comprehensive list of the types of bugs you will target with each test suite. Hints: (1) arbitration between IM and DM will be a key source of bugs for the DRAM controller, so your list should describe exactly how you will test for arbitration bugs; (2) the cache controller state machines will be a key source of bugs for the caches, so your list should describe all of the classes of state machine bugs you intend to test; (3) handling stalls due to variable latency in the IC and DC buses will be a key source of bugs for the processor, so your list should describe the classes of assembly language tests you will write to test this.
  4. The identity of the spokesperson for the lab, and a roster of group members. The responsibility of the spokesperson is to communicate questions to the TA and reply to questions from the TA. Choose a different spokesperson than the one you had for Labs 2 and 3.
  5. A tentative division of labor, showing the tasks each group member intends to do.

See the start of this document for the deadlines associated with the preliminary design document (preliminary submission and TA review).

  



Problem 1: The Memory System IP

In this section, we describe the example memory system IP (IP stands for Intellectual Property -- industry jargon for large pre-made blocks) we have created for you to examine. The IP is written in Verilog, and contains a DRAM controller and an instruction cache.

Note that this IP does not have the full functionality needed for your final project. For example, it doesn't support burst reading of 128-bit cache lines, and the I-cache only has 32-bit blocks. The IP may be deficient in other ways too -- for example, it may have bugs that were not uncovered last semester, when the class used it as is for their Lab 4. To learn how to add burst capability to your memory controller, look at the Resources pages for the SDRAM data sheet. The Calinx documentation may also be useful.

So, your job is to examine how it works, and then modify it (or if you prefer, start from scratch and redesign it -- your choice) to meet the needs of your Final Project.

The example memory system, and other Verilog code and documentation files for this lab are located in m:\lab4. The directory contains a new TopLevel.v file. The directory also contains a DRAM simulator (mt48lc8m16a2.v) so that you can test your memory system without going to the board.

The IP does not include DRAM. Instead, you connect the IP DRAM port to the Xilinx pins that connect to the Calinx DRAM (when going to the board) or to the DRAM simulator (to use ModelSim).

In addition to the DRAM port, the IP contains two independent processor-facing ports: the I port and the M port. The I port provides access to the instruction cache (the IP fills this cache from the DRAM on a miss). The M port provides access to DRAM memory.

Note that these buses may not be a good match for your IC, DC, IM, and DM buses. You need to define the right bus structure for the project requirements and for your CPU design.

Note that the IP time-shares access to the DRAM across several sources: M port accesses, instruction cache misses, and DRAM refresh. We call this time-sharing arbitration.

On the M port, a processor sees the impact of arbitration via the data_ready signal. If the processor requests data on the M port, but the IP is busy doing a DRAM refresh, the IP will signal this busy condition by keeping the data_ready line low until refresh is complete.
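The time-sharing described above can be sketched as a small behavioral model. This is only an illustration in Python, not the logic in arbiter.v: the function name `arbitrate` and the fixed priority order (refresh first, then the M port, then instruction cache misses) are assumptions made for the example, and your own arbiter may need a different policy.

```python
def arbitrate(refresh_due, icache_miss, m_port_request):
    """Pick which source owns the DRAM for the next transaction.

    Hypothetical fixed-priority policy (an assumption, not taken from the IP):
    refresh > M port > instruction cache miss.
    """
    if refresh_due:
        return "refresh"      # DRAM contents are lost if refresh is starved
    if m_port_request:
        return "m_port"
    if icache_miss:
        return "icache"
    return "idle"


if __name__ == "__main__":
    # An M port request that collides with a refresh has to wait.
    print(arbitrate(refresh_due=True, icache_miss=False, m_port_request=True))
```

A requester that loses arbitration simply sees its ready signal stay low for longer, which is why the processor-side protocol must tolerate arbitrary latency.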

The M port interface is distributed across the arbiter.v and sdramControl.v files (see the "Note" in Figures 1 and 2 for details). Below, we show the bus signals for the M port:


input data_read;        // processor sets it to 1 to make a read request
input data_write;       // processor sets it to 1 to make a write request

input [31:0] DATA_ADDR;        // processor puts read address on these lines
input [31:0] DATA_WRITE_ADDR;  // processor puts write address on these lines

input [31:0] DIN;              // input data
output [31:0] DOUT;            // output data

output data_ready;      // signals data has been read
output data_wr_ready;   // signals data has been written

The port also defines a clock input, to be driven by the processor clock, and a reset signal.

We now describe M port transaction semantics. We assume the data_ready signal is low. To start a read, the processor asserts the "data_read" line before a rising clock edge. On this edge, the processor also ensures the "DATA_ADDR" lines are stable with the read address. On subsequent rising clock edges, the processor samples the data_ready line. If data_ready is high, the data is present on the DOUT lines. The data_ready signal is guaranteed to return to zero for the next rising clock edge, on which the processor may initiate a new transaction. The figure below shows a read transaction.

Figure 1: Read Timing

A write transaction is similar. To start a write, the processor asserts the "data_write" line on a rising clock edge. On this edge, the processor also ensures the "DATA_WRITE_ADDR" and "DIN" lines are stable with the write address and the write data. The processor must keep the data_write line high on subsequent rising clock edges, until the data_wr_ready signal is high on a rising clock edge. The data_wr_ready signal will return to zero for the next clock edge, on which the processor may initiate a new transaction. The figure below shows a write transaction.

Figure 2: Write Timing
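The read and write handshakes described above can be exercised with a cycle-level reference model before writing any Verilog. The sketch below is a Python illustration under stated assumptions: the class `MPortModel`, the helpers `read_word` and `write_word`, and the fixed `latency` parameter are all hypothetical, and the real IP has variable latency because of arbitration and refresh. Each call to `tick` stands for one rising clock edge.

```python
class MPortModel:
    """Behavioral stand-in for the M port (not the real IP)."""

    def __init__(self, latency=3):
        self.mem = {}            # word-addressed backing store
        self.latency = latency   # assumed fixed latency, in clock edges
        self.busy = 0            # edges left before the transaction completes
        self.pending = None      # (kind, address, data) of the open transaction

    def tick(self, data_read=0, data_write=0, addr=0, waddr=0, din=0):
        """One rising edge. Returns (DOUT, data_ready, data_wr_ready)."""
        if self.pending is None:
            # Ready signals are low here, so a new request may be accepted.
            if data_read:
                self.pending = ("read", addr, None)
                self.busy = self.latency
            elif data_write:
                self.pending = ("write", waddr, din)
                self.busy = self.latency
            return (0, 0, 0)
        self.busy -= 1
        if self.busy > 0:
            return (0, 0, 0)     # still waiting: both ready lines stay low
        kind, address, data = self.pending
        self.pending = None      # ready drops again on the following edge
        if kind == "read":
            return (self.mem.get(address, 0), 1, 0)
        self.mem[address] = data
        return (0, 0, 1)


def read_word(port, addr):
    """Issue a read, then sample data_ready on each later edge."""
    cycles = 1
    port.tick(data_read=1, addr=addr)
    while True:
        dout, ready, _ = port.tick()
        cycles += 1
        if ready:
            return dout, cycles


def write_word(port, addr, data):
    """Hold data_write high until data_wr_ready is sampled high."""
    cycles = 1
    port.tick(data_write=1, waddr=addr, din=data)
    while True:
        _, _, wr_ready = port.tick(data_write=1, waddr=addr, din=data)
        cycles += 1
        if wr_ready:
            return cycles
```

A model like this can double as the golden reference in a test bench: drive the same request sequence into the Verilog and into the model, then compare the returned data.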

Like the M port, the instruction cache I port has variable latency -- if a cache miss occurs, the port may take many cycles to return the next instruction. See the documentation in M:\lab4 for a description of the I port, including timing diagrams.

  



Problem 2: Memory System Requirements

In this section, we describe the requirements for your memory system. Your memory system MUST implement what we request. You are free to choose HOW you implement it, and HOW you choose to test it.

Both caches must have the following properties

Your data cache must implement a write-through policy, with "no allocate on write" semantics, and implement a 4-entry write buffer.

The implementation of all memory elements for your cache is up to you (data bits, valid bits, tag bits, etc). If you wish, you can use CoreGen to generate your own custom RAM components. Feel free to use any feature of CoreGen you wish (dual-porting, FIFOs, etc). You may also generate asynchronous RAM blocks. However, you should keep in mind that the asynchronous RAMs are built of LUTs, and that if you attempt to build your entire cache system out of asynchronous RAM you are likely to exhaust all the LUTs on the board.

Remove the data memory and instruction memory from your Lab 3 processor, and replace it with interfaces to the IC and DC buses. Keep the same number of pipeline stages in your processor (5). As both the IC and DC buses are variable latency, your processor modifications will have to cope with new stall conditions caused by cache misses.

Except for the interface change noted above, and the memory-mapped I/O changes described later in this assignment, the processor should have the same features as in Lab 3. Also, note that your processor does not have to handle self modifying code (except for the level-0 boot). If the processor receives self modifying code its behavior is undefined.


Problem 2a: Tag File

Build the tag file for the cache. Make sure that you can reset all the valid bits to zero after reset. You are allowed to use smaller SRAM components for tags, etc, if you wish (make sure that they compile properly to SRAMs that work with the board).
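As a starting point for sizing the tag file, the sketch below shows how a 32-bit address splits into tag, index, and word-select fields for 128-bit (16-byte) blocks, together with a behavioral tag file whose valid bits clear on reset. It is a Python illustration only. The direct-mapped organization and the `NUM_LINES = 64` value are assumptions chosen for the example, and the names `split_address` and `TagFile` are hypothetical.

```python
BLOCK_BYTES = 16    # 128-bit block, as required by the handout
NUM_LINES = 64      # hypothetical cache size; pick your own

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 4 bits of byte offset
INDEX_BITS = NUM_LINES.bit_length() - 1      # 6 bits of line index


def split_address(addr):
    """Return (tag, index, word_select) for a 32-bit byte address."""
    word_select = (addr & (BLOCK_BYTES - 1)) >> 2   # which of the 4 words
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, word_select


class TagFile:
    """Direct-mapped tag store with one valid bit per line (assumed layout)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # All valid bits go to zero, so every lookup misses after reset.
        self.valid = [False] * NUM_LINES
        self.tags = [0] * NUM_LINES

    def lookup(self, addr):
        tag, index, _ = split_address(addr)
        return self.valid[index] and self.tags[index] == tag

    def fill(self, addr):
        tag, index, _ = split_address(addr)
        self.valid[index] = True
        self.tags[index] = tag
```

With these example sizes the tag is 22 bits wide, so each tag file entry needs 23 bits including the valid bit.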


Problem 2b: Write Buffer

Processor writes to the data cache will go into the cache (if a particular cache line is cached) and will also go directly to DRAM (via the DM port). This can be a serious bottleneck, since it means that every processor write takes a complete DRAM write cycle.

To ameliorate this problem, design a 4-entry write buffer for your system, as we showed in class. This buffer should take writes from the processor and hold them until the DM port is ready for a new transaction. Whenever the buffer is full, you must stall the processor from writing.

Of course, reads and writes may happen to an address that is currently in the write buffer. The TAs will be testing for correct memory semantics in these cases, and so your design should be pre-tested for this issue. We discuss aspects of this issue below.

Each write buffer entry will have a 32-bit address, a 32-bit data word, and a valid bit. Make sure to make this 4-entry write buffer fully-associative so that values sitting in the buffer will be returned properly from load instructions.

Processor load instructions should be handled as follows. First, the memory controller should check the write buffer. If there is an entry there with the right address, the controller should return the value directly from the write buffer. Otherwise, the controller should look in the cache. If there is a value available, then the controller returns the value immediately. Otherwise, the controller stalls the load and requests a cache fill from the DM port.

Processor store instructions should be handled as follows. First, the controller checks the write buffer.  If the write buffer contains an entry with the same address as the store instruction, the controller overwrites the entry. Otherwise, if there is a free write buffer entry, the controller uses it for the store.  Otherwise, the controller stalls the store until an entry is free. As we are implementing no-allocate-on-write, the store should update an existing data cache entry for the address with the new value (we discuss the timing for cache update below), but should not allocate a cache line if the data cache does not hold the address already.

The DM port will see two types of transactions from the data cache. Either (1) it will see reads during a cache miss, or (2) it will see a single word write. Note that when we decide to empty a single-word write from the write buffer, we will write it to the DM port, and also write it to the cache if the word is properly cached. The data cache does not need to be written earlier, because loads always check the write buffer.

The write buffer should be emptied in FIFO order (oldest write first).
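The write buffer rules above (overwrite on an address match, stall when full, forward to loads, drain oldest first) can be captured in a short behavioral model. The Python sketch below is one possible reading of those rules, not a required structure: the class and method names are hypothetical, and it assumes that overwriting an existing entry keeps that entry's original position in the drain order.

```python
from collections import OrderedDict


class WriteBuffer:
    """Behavioral 4-entry, fully-associative write buffer."""

    DEPTH = 4

    def __init__(self):
        # Maps address -> data, kept in arrival order (oldest first).
        self.entries = OrderedDict()

    def store(self, addr, data):
        """Return True if the store is accepted, False if it must stall."""
        if addr in self.entries:
            self.entries[addr] = data   # overwrite in place, order unchanged
            return True
        if len(self.entries) < self.DEPTH:
            self.entries[addr] = data   # take a free entry
            return True
        return False                    # buffer full: stall the processor

    def load(self, addr):
        """Return the buffered value for addr, or None if not present."""
        return self.entries.get(addr)

    def drain_one(self):
        """Remove and return the oldest (addr, data) pair for the DM port."""
        if not self.entries:
            return None
        return self.entries.popitem(last=False)
```

A load first calls `load`; only when it returns None does the controller go on to check the cache. In hardware the address match is four parallel comparators, one per entry.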


Problem 2c: Pipeline Stalling Mechanism

Carefully think through how data cache misses will stall the memory stage of your pipeline. Also carefully think through how an instruction cache miss will stall the fetch stage of your pipeline.

For pipeline stalls, be sure to "freeze up and bubble down". Thus, for instance, if you get a cache miss on a load in the memory stage, you should stall this load by (1) freezing the Fetch, Decode, Execute, and Memory stages (i.e. these stages don't move forward and keep repeating the same operations over and over), and (2) sending a bubble on to the write stage. This will keep the load instruction in the memory stage. You should then keep stalling until the cache is filled and then release the pipeline to move forward; in the cycle that you release things, the load will be retried and continue as if nothing happened. Stores will stall if the write buffer is full; same thing -- the store keeps retrying until it is able to write into the write buffer.

The important idea here is that requests are submitted to the cache at the beginning of the cycle. Optionally (by the end of the cycle), the cache can reject a request and continue rejecting this request on subsequent cycles until it is ready to satisfy the request. If you do this correctly, you can allow the cache to stall for an arbitrary time.
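The "freeze up and bubble down" rule for a memory-stage stall can be written as a one-step pipeline update. The Python sketch below is a simplified illustration: the pipeline is a list of five instruction labels in the order Fetch, Decode, Execute, Memory, Write, and the names `step` and `BUBBLE` are hypothetical. It models only a data cache stall; an instruction cache miss needs its own handling in the fetch stage.

```python
BUBBLE = "nop"


def step(pipe, next_instr, mem_stall):
    """Advance the pipeline by one clock.

    pipe is [fetch, decode, execute, memory, write]. On a memory-stage
    stall the first four stages hold their contents and a bubble enters
    the write stage; otherwise every instruction moves down one stage.
    """
    if mem_stall:
        return pipe[:4] + [BUBBLE]      # freeze up, bubble down
    return [next_instr] + pipe[:4]      # normal advance


if __name__ == "__main__":
    pipe = ["i4", "i3", "i2", "lw", "i0"]
    pipe = step(pipe, "i5", mem_stall=True)    # miss: lw stays in memory stage
    print(pipe)
    pipe = step(pipe, "i5", mem_stall=False)   # fill done: lw moves to write
    print(pipe)
```

Note that the instruction that was in the write stage when the stall began still completes; only the stalled load and the instructions behind it are held.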

Come up with a complete mechanism for handling stalls generated by the IC and DC buses. Explain your mechanism, and provide timing diagrams.


Problem 2d: Testing

Testing the final project on checkoff days will require several test benches. These tests should have ModelSim and Xilinx versions.

For the DRAM checkoff, you will write a DRAM test bench that drives the IM and DM buses with a test vector suite that comprehensively checks for the types of DRAM controller bugs you described in your preliminary design document. Be sure to test multiple concurrent transactions on both buses, that overlap in time in different ways.

For the memory system checkoff, you will write a test bench to drive the IC and DC buses with test vectors that comprehensively check for the types of cache controller bugs you described in your preliminary design document. This test bench should catch bugs in operations such as writing an address already in the write buffer, writing an address that is (or is not) already in the cache, reading an address that is in the write buffer and is (or is not) in the cache, etc.

For the final checkoff, you should prepare a machine language code suite for full processor testing. The suite should include your old test codes from Lab 3, and also new tests that catch bugs that may be introduced by the new stalling behaviors the CPU needs to handle the variable latency IC and DC buses. After seeing this code work, the TAs will give you their secret MIPS machine language code test suite to verify your processor.

Both your Verilog test vector suites and your MIPS code suite should be documented to indicate which classes of bugs each set of vectors catches -- this documentation will play a key part in the TAs' assessment of the work. In your final report, please keep the testing readable and concise for grading. Remember, the TAs don't like looking at waveforms in the lab report.

  



Problem 3: Enhancing the I/O Module

We have several enhancements that you must make to the I/O module.  We will be adding to the address space we used in Lab 3.  There are 4 distinct address ranges to handle.


Problem 3a: Miscellaneous I/O

In the previous lab, we set I/O space in the top-4 words. Now, we are using some of the areas that were "Reserved for future use".
 

Address                  Reads                      Writes
0x80000000-0x80002000    See 3c                     See 3c
0x80002004-0xFFFFFEDC    Reserved for future use    Reserved for future use
0xFFFFFEE0-0xFFFFFEE8    See 3d                     See 3d
0xFFFFFEEC-0xFFFFFEFC    Reserved for future use    Reserved for future use
0xFFFFFF00-0xFFFFFFEC    See 3b                     See 3b
0xFFFFFFF0               DP0                        DP0
0xFFFFFFF4               DP1                        DP1
0xFFFFFFF8               Input switches             Nothing
0xFFFFFFFC               Cycle Counter              Nothing

As in Lab 3, DP0 and DP1 are the registers whose outputs appear on the HEX LEDs. The new entity is the cycle counter. This is a 32-bit counter that counts once per cycle. It will be used to measure statistics. Notice that it should be reset to zero on processor RESET and just count from that point on.
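The address map in the table above amounts to a range decoder in front of the data memory path. The Python sketch below shows one way to express it. The function name `decode` and the region labels are hypothetical, addresses are assumed to be word-aligned, and anything below 0x80000000 is assumed to go to ordinary memory.

```python
def decode(addr):
    """Map a 32-bit, word-aligned address to the region that handles it."""
    if addr < 0x80000000:
        return "memory"            # assumed: ordinary cached memory
    if addr <= 0x80002000:
        return "data_source"       # Problem 3c
    if 0xFFFFFEE0 <= addr <= 0xFFFFFEE8:
        return "ascii"             # Problem 3d
    if 0xFFFFFF00 <= addr <= 0xFFFFFFEC:
        return "boot_rom"          # Problem 3b
    if addr == 0xFFFFFFF0:
        return "dp0"
    if addr == 0xFFFFFFF4:
        return "dp1"
    if addr == 0xFFFFFFF8:
        return "switches"
    if addr == 0xFFFFFFFC:
        return "cycle_counter"
    return "reserved"


def effective_address(base, offset):
    """Base plus sign-extended offset, wrapped to 32 bits as in lw/sw."""
    return (base + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    # A store with offset -16 from register $0 lands on DP0.
    print(decode(effective_address(0, -16)))
```

The negative offsets from register $0 are how a program reaches the top of the address space with a single store instruction.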


Problem 3b: Level 0 Boot

The I port interface includes a 28-word ROM that appears from Address 0xFFFFFF00 - 0xFFFFFF6C. Note that this work is done for you -- you do not have to create the ROM yourself.

Instruction reads from this address range will return the corresponding instruction. All data memory accesses to this range will have undefined results. To use this ROM to boot, arrange your RESET sequence so that the first PC is always 0xFFFFFF00.

Address        Instruction
0xFFFFFF00     lui   $8, 0x4849      #Initial Display
0xFFFFFF04     ori   $8, $8, 0x2045
0xFFFFFF08     sw    $8, -288($0)
0xFFFFFF0C     lui   $8, 0x4152
0xFFFFFF10     ori   $8, $8, 0x5448
0xFFFFFF14     sw    $8, -284($0)
0xFFFFFF18     sw    $8, -16($0)    #Put "DEADBEEF" on display
0xFFFFFF1C     lui   $1, 0x8000      #Instruction I/O Space
0xFFFFFF20     ori   $7, $1, 0x2000 #Limit of 8K
0xFFFFFF24     j     L3               #Go copy first block
0xFFFFFF28     lw    $8, 4($1)      #Save instruction address
0xFFFFFF2C L1: addiu $1, $1, 8      #Skip block header
0xFFFFFF30 L2: lw    $4, 0($1)      #Next word
0xFFFFFF34     sw    $4, 0($2)      #Copy to memory
0xFFFFFF38     addiu $3, $3, -1     #Decrement count
0xFFFFFF3C     addiu $1, $1, 4      #Increment source
0xFFFFFF40     bne   $3, $0, L2     #Not done
0xFFFFFF44     addiu $2, $2, 4      #Increment destination
0xFFFFFF48     sltu   $5, $1, $7    #Run over limit?
0xFFFFFF4C     beq   $5, $0, BADEND #Yes.  Format problem?
0xFFFFFF50 L3: lw    $3, 0($1)      #Get next length
0xFFFFFF54     bne   $3, $0, L1     #Non-zero? Yes, copy
0xFFFFFF58     lw    $2, 4($1)      #Get next address
0xFFFFFF5C END:sw    $8, -16($0)    #Put execution addr in DP0
0xFFFFFF60     jr    $8              #Start Executing
0xFFFFFF64     break 0xAA             #Pause with 10101010
0xFFFFFF68 BADEND: j BADEND           #Loop forever
0xFFFFFF6C     break 0x7F             #Indicate problem!

You can modify the level 0 boot ROM any way you like, but it must fit within the following address range 0xFFFFFF00-0xFFFFFFEC and it must incorporate the basic functionality laid out above. Notice that what happens here is that the code looks for a compact description of instructions starting at address 0x80000000.  The format of the block of memory at that address is:

Block 0:  Length0
          Address0
          Block0[0]
          Block0[1]
            ...
          Block0[Length0-1]
Block 1:  Length1
          Address1
          Block1[0]
          Block1[1]
            ...
          Block1[Length1-1]
Block 2:  Length2

            ...

This sequence is terminated with a zero Length field.   It is assumed that Block 0 is a block of instructions and that the system should start executing at Address0 after it is finished copying data.  The idea here is that you can have a sequence of instructions that is copied one place in memory and a sequence of data that is copied elsewhere.
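To check a boot image before downloading it, the copy loop can be emulated in software. The Python sketch below walks a list of 32-bit words laid out in the block format above and returns the resulting memory contents plus the execution address. It is an illustration, not a translation of the ROM: the function name `load_boot_image` is hypothetical, and it reports a malformed image by raising an error instead of reproducing the ROM's limit check and BADEND loop.

```python
def load_boot_image(words):
    """Emulate the level 0 boot copy.

    words is the image as a list of 32-bit values:
    length, address, data..., repeated, ending with a zero length.
    Returns (memory, entry) where memory maps byte address -> word and
    entry is the address of block 0.
    """
    memory = {}
    entry = None
    pos = 0
    while True:
        if pos >= len(words):
            raise ValueError("image has no zero-length terminator")
        length = words[pos]
        if length == 0:
            break                                # end of the block list
        if pos + 2 + length > len(words):
            raise ValueError("block runs past the end of the image")
        dest = words[pos + 1]
        if entry is None:
            entry = dest                         # execution starts at Address0
        for i in range(length):
            memory[dest + 4 * i] = words[pos + 2 + i]
        pos += 2 + length                        # skip header and payload
    return memory, entry


if __name__ == "__main__":
    image = [2, 0x00000000, 0xAAAA, 0xBBBB,      # block 0: two instructions
             1, 0x00100000, 0x1234,              # block 1: one data word
             0]                                  # terminator
    mem, entry = load_boot_image(image)
    print(hex(entry), {hex(a): hex(w) for a, w in mem.items()})
```

Running an image through a check like this catches a wrong length field or a missing terminator before the board does.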


Problem 3c: Data Source

You should use either the TFTP Blackbox (highly recommended) or one of the synchronous RAM blocks (only if the black box doesn't work) from Lab 3 for your data source. Assume that you produce data in the above format. Compile it into a 2Kx32 block, then download it to your board. Reads from addresses 0x80000000 - 0x80001FFC should read from this block. To reproduce what we had in Lab 3, you will simply add a length header and an address of 0x00000000 to the front of your instructions output from MIPSASM and a word of 0x00000000 to the end of the output. You may find some of MIPSASM's more advanced features useful in specifying higher address ranges and automatically calculating the length of the code.

For example here is a code sample that will generate the proper header and footer for a simple code block:

.count words begin end # Count the number of words between the begin and end labels
.word 0x00000000   # Place the starting address
.address 0x00000000 # Direct MIPSASM to use 0x00000000 as the address when performing jumps
begin:
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
end:
.word 0x00000000 # A trailing 0x00000000 terminates the instruction stream

Something more fancy might be:

.count words begin1 end1 # Count the number of words in the instruction segment
.word 0x00000000   # Instruction segment starting address
.address 0x00000000 # Direct MIPSASM to use 0x00000000 as the address when performing jumps
begin1:
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
end1:
.count words begin2 end2 # Count the number of words in the data segment
.word 0x00100000 # Data segment starting address
.address 0x00100000 # Direct MIPSASM to use 0x00100000 as the address when performing jumps
begin2:
.word 0x00000001
.word 0x00000002
.word 0x00000003

...

.word 0x00000500
end2:
.word 0x00000000 # A trailing 0x00000000 terminates the instruction stream

Note that writes to this address range are undefined.


Problem 3d: ASCII Text Conversion

The TAs have provided an ASCII conversion tool that takes an ASCII code in 7 bits and outputs 7 bits of coding information for the hexadecimal LEDs. Note that not all characters can be displayed on the LEDs, so some characters will be converted to a blank space.

ASCII_REG1 is a 28 bit register that contains the ASCII-converted display information for hexadecimal LEDs 1-4.
ASCII_REG2 is a 28 bit register that contains the ASCII-converted display information for hexadecimal LEDs 5-8.
POINT_REG is an 8 bit register that contains the display information for the LED decimal point segments.  The high bit corresponds to LED point 1.  A high value indicates that the LED point is turned on.

Address       Reads      Writes
0xFFFFFEE0    Nothing    Convert the stored value into ASCII and store it in ASCII_REG1.
0xFFFFFEE4    Nothing    Convert the stored value into ASCII and store it in ASCII_REG2.
0xFFFFFEE8    Nothing    Store the low 8 bits of the word into POINT_REG. Bit 7 corresponds to the high bit of POINT_REG and bit 0 corresponds to the low bit of POINT_REG. Ignore the high 24 bits.

Switch 7 should still be used to toggle between DP0 and DP1.
Switch 1 should be used to toggle between displaying DP0 or DP1 and the ASCII registers. The LED decimal points should always be visible.

  



Problem 4: Building the Processor

You will build the processor in three stages: first complete the DRAM controller, then complete the memory system, then complete the entire processor. Each milestone has an associated checkoff, where you will demonstrate the system working on the Calinx board, running your test suite. We describe this test suite in Problem 2d above.

Note that for the first two checkoffs, you will need to create synthesizable test logic to run the test vectors, and to display success or failure to the TAs. You should also have a version that runs under ModelSim. Note that the TAs may choose to add some test vectors themselves, so be prepared to add a few to the list at the last minute!

For the final processor checkoff, you should have all of the I/O from your previous lab.  For ModelSim testing, you should make a new file called "boardlevel.v" that includes (1) the new TopLevel.v from the high-level directory and (2) a simulation of the DRAM banks as they appear on the board.

Your final processor testing should use the MIPS test suite described in Problem 2d. This should include Lab 3 test programs as well as the new Lab 4 programs.

After verifying your complete design works in simulation, push the design to the board, and run it on real hardware. After seeing your machine code programs run, the TAs will give you the secret test code.

Your lab report should contain a description of (a) how the four buses operate (DC, IC, DM, IM) (b) how your caches operate, complete with state machines (c) how you handle I-cache and D-cache pipeline stalls. Also, tell us the size of your processor: how many slices did you use/what fraction of the Xilinx chip did you use?

  



Problem 5: Performance Measurement

Determine the critical path of your processor. Document (using data from the Xilinx timing tools) why you believe the critical path is what you claim. How did you determine your component delay values?

Next, measure the running speed of your processor. What is the fastest clock that you think you can run with? Can your memory subsystem run at the same speed as your processor clock? How does this measurement square with your critical path determination?

  



Final Step: Lab Report

Turn in a copy of your Verilog code (including test benches), schematics, diagnostic program(s) and your on-line logs. Also turn in simulation logs that show correct operation of the processor + memory system. These logs should show the operations that were performed, and then the contents of memory with the correct values in it. Also turn in logs from your test benches.

As part of your writeup, do a post-mortem for your test plan. How did the three-stage testing help your verification and debugging? Show bug curves, and give examples of the type of bugs you found early on because of the test plan (as well as "escaped" bugs you found later than you would have hoped).

How much time did your team spend on this assignment?