CS152 Computer Architecture and Engineering

Lab #5: Processor Racing!

Professor David Patterson

John Lazzaro

Fall 2004

In Lab 5, you will increase the pipeline depth of your processor to 6 or more stages, and add support for integer multiplication to the instruction set. In addition, you will add a performance-enhancing feature to your processor, chosen from a list of options. Your processor will correctly execute the implemented instruction set -- you will create a test plan for your processor, and use the test plan to confirm that your processor correctly executes instructions. At the end of the semester, we will race the processors, using a test suite consisting of programs contributed by the project groups.

The schedule for Lab 5 follows:

By 9 PM on Thursday 11/18, your group will submit a preliminary design document to your TAs via email. We describe the design document (which includes a test plan) in Problem 0a. Team Evaluations (described below in Problem 0b) are also due by 9 PM on Thursday 11/18. Each team member emails a separate team evaluation report.
On Friday 11/19 in lab section, your TA will review your design document with you, and suggest changes. A final version of the document, incorporating these changes, is due via email by 11:59 PM on Sunday 11/21.
On Friday 12/3 in lab section, you will demonstrate that your deeply pipelined processor correctly runs the test programs the TA provides. These programs will test the integer multiply instructions. Note that your chosen option (A, B, or D) does not need to be integrated into the processor for your demo, but may be if you wish. If your processor runs the test programs on your first try, you will receive bonus points. You can also receive bonus points if you fix your processor to pass the tests within your section time. If your processor is not fixed by the end of section, your TA will provide source for the secret test code, for use in your weekend debugging sessions.
On Friday 12/3 at 11:59 PM, your benchmark program (Problem 3) is due.
On Tuesday 12/7, groups will give 30-minute presentations of their project. Time and location to be announced (time will probably overlap the 11-12:30 class time slot).
On Wednesday 12/8 at 7:30 PM, we will meet in 119 Cory for the processor racing contest. Pizza at LaVals after the race!
Thursday 12/9 will be the final class lecture, usual time and place (11AM-12:30PM, 320 Soda).
On Friday 12/10 at 11:59 PM, the final project report is due.

Lab Report Submission Policies: To submit your lab report, run m:\bin\submit-fall2004.exe, or type "submit-fall2004.exe" at a command prompt, then follow the instructions. The required format for lab reports is shown on the resources page, as is the required format for your design notebook.

Lab 5 Document History:

11/8 Initial release.

Problem 0: Pre-Flight

Before your group begins the design, you will perform several preparatory tasks.

Problem 0a: Design Document

Your group will prepare a design document. The design document will be 2-4 pages in length, and will contain:

The identity of the spokesperson for the lab, and a roster of group members. The responsibility of the spokesperson is to communicate questions to the TA and reply to questions from the TA. If your group has more than 3 members, choose a different spokesperson than the one you had for Labs 2-4.
A short description of the structure of the design. The description will be accompanied by preliminary high-level schematics for the deep pipeline, the multiplier, and the chosen design option.
A test plan, using the epoch charting method shown in the 9/7 lecture. Label events in the test plan with the checkoff dates shown at the beginning of this lab.
A tentative division of labor, showing the tasks each group member intends to do.
The "paranoia" section: discuss potential areas of difficulty in the lab. An early guess of critical timing paths for the design should be a part of this section.

See the start of this document for the deadlines associated with the design document (preliminary submission, TA review, and final submission). You are encouraged to submit the preliminary document early, to speed the TA review process.

Problem 0b: Team Evaluation for Lab 4

To help us understand how your team is functioning, we require you to evaluate yourself and each of your team members individually.

To evaluate yourself, give us a list of the portions of Lab 4 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).

Next, based on your own observations, evaluate the performance of the other members of your team during the last lab assignment. Do not evaluate yourself. Assume an average team member would receive a score of 20 points. Top performers would receive more points, poor performers would receive fewer points.

The maximum score for a person is 40 points. Each evaluation should have a one or two sentence justification for the evaluation. See the Lab 2 writeup for an example evaluation.

You should reevaluate your team members after every lab assignment and base your evaluation only on their performance during that lab assignment. These scores will be used for grading. Be honest and fair as you would hope others will be.

See the schedule at the front of the lab for the due date for the evaluations. Note that each team member emails a separate evaluation report.

Problem 0c: Design Notebook

As part of this lab, your group will keep an on-line notebook. See the Lab 2 writeup for detailed information about the notebook.

Problem 1: Enhancing Your Processor

For the final project, you will increase the number of pipeline stages in your processor, from 5 stages to 6 or more stages. We call this enhancement deep pipelining.

By breaking the pipeline up into more than five stages, deep pipelining decreases the cycle time of your processor. Lengthening the pipeline, however, will increase the difficulties that hazards cause in the system. Some components, such as the ALU, may have to be broken up into two stages.

You will also enhance your processor by adding four new MIPS instructions: mult, multu, mfhi, mflo.
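As a reference point for your test plan, here is a minimal behavioral sketch of what the two multiply instructions leave in HI and LO, following the standard MIPS definitions (mult treats its operands as signed, multu as unsigned). The function names are made up for illustration; this is a software model for checking expected values, not a hardware design.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def multiply(rs, rt, signed):
    """Return (hi, lo) as written by mult (signed=True) or multu (signed=False)."""
    a = to_signed32(rs) if signed else rs & MASK32
    b = to_signed32(rt) if signed else rt & MASK32
    product = (a * b) & 0xFFFFFFFFFFFFFFFF  # wrap the result to 64 bits
    hi = (product >> 32) & MASK32           # value later read by mfhi
    lo = product & MASK32                   # value later read by mflo
    return hi, lo
```

For example, with rs = 0xFFFFFFFF and rt = 2, mult computes -1 * 2 and leaves HI = 0xFFFFFFFF, LO = 0xFFFFFFFE, while multu leaves HI = 0x00000001, LO = 0xFFFFFFFE. Operand pairs like this, where the signed and unsigned results differ, make good test cases.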

In addition to deep pipelining and multiplication support, you must do one of options A, B, or D described below (but not C). Note that if you choose Option A (high-speed multiplier) it satisfies the multiplication support requirement.

AFTER successful completion of deep pipelining, multiplication support, and option A, B, or D, you may choose to add a second option for extra credit (A, B, C, or D, although C may only be chosen if B was your first option). You MUST save a complete copy of the single-option design, so that you can revert to it for processor racing if your extra-credit option breaks your processor.

Update the instruction tracing support in your monitor to support deep pipelining, the new multiplication instructions, and your chosen options. The tracer should only show committed instructions (i.e. not the squashed ones). A useful debugging option would be to make another tracer which showed instructions as they were dispatched -- this would show more instructions than your primary instruction tracer.

Make sure to include the current cycle count in your trace. Since it is part of the debugging code, all of the logic for the cycle counter can be included in your monitor code. It should be reset to zero on reset, and increment with each clock cycle (since CPI is not necessarily 1, you will actually see the delay in your instruction traces).

The option descriptions below are brief -- for more information, please consult with your TA.

Option A: High-Speed Multiplier

This option adds a performance requirement to your mult and multu instructions: the instructions must execute in 10 or fewer processor clock cycles. In addition, while the multiply is executing, your CPU should continue to process subsequent instructions in the stream in parallel. Only mfhi and mflo instructions should stall, and only when the multiply result is not yet ready. Feel free to use CoreGen to implement this option (both parallel and sequential features of the CoreGen Multiplier Generator may be used).
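The interlock behavior described for Option A can be sketched with a small timing model. The sketch below assumes one instruction issues per cycle, a fixed multiply latency of 10 cycles (the upper bound allowed by this option), and at most one multiply in flight; the function name and the list-of-mnemonics program format are illustrative only.

```python
MULT_LATENCY = 10  # assumed: the maximum latency Option A permits

def issue_cycles(program, latency=MULT_LATENCY):
    """Return the cycle in which each instruction issues.

    program is a list of mnemonics. Every instruction issues one cycle
    after the previous one, except that mfhi/mflo wait until the
    in-flight multiply has produced its result.
    """
    cycle = 0
    result_ready = 0  # first cycle in which HI/LO may be read
    issued = []
    for op in program:
        if op in ("mfhi", "mflo") and cycle < result_ready:
            cycle = result_ready  # stall until the product is available
        issued.append(cycle)
        if op in ("mult", "multu"):
            result_ready = cycle + latency
        cycle += 1
    return issued
```

With this model, the sequence mult, addu, addu, mflo issues in cycles 0, 1, 2, and 10: the two addu instructions overlap with the multiply, and only mflo waits. A benchmark that places enough independent work between mult and mflo hides the multiply latency entirely.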

Option B: Data Cache Enhancements

This option adds a requirement to your data cache: the number of words in a cache block must be greater than 1 (recall this was an option in Lab 4).

In addition, this option requires that you fill this cache block from main memory in a single M bus transaction. Thus, completing this option requires the modification of the Lab 4 IP, so that it is able to deliver multiple words in a single transaction.

The new M bus may keep the same handshaking as the original M bus, but with a wider data bus (128 bits instead of 32 bits). Alternatively, the new M bus may keep the 32-bit width of the original M bus, but define "burst" semantics, so that a single transaction delivers a burst of words over the 32-bit data bus over several clock cycles.

In either case, your IP modifications should exploit the underlying nature of the Calinx SDRAMs (the fact that the Calinx DRAM port has a word width larger than 32 bits, and the fact that the SDRAMs are capable of burst mode), rather than emulate the new M bus interface using the original M bus (as emulation would not improve performance).

Note that the total amount of data memory in your data cache must remain constant, at 8K bytes.
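Because the data storage is fixed at 8K bytes, changing the block size changes how the address splits into tag, index, and offset fields. The sketch below works through that arithmetic. It assumes 32-bit byte addresses and power-of-two sizes; the ways parameter is included only so you can explore associative organizations, and the function name is hypothetical.

```python
CACHE_BYTES = 8 * 1024  # total data storage, fixed by the lab
WORD_BYTES = 4

def cache_geometry(words_per_block, ways=1, address_bits=32):
    """Return the set count and address field widths for the data cache."""
    for n in (words_per_block, ways):
        if n < 1 or n & (n - 1):
            raise ValueError("sizes must be powers of two")
    block_bytes = words_per_block * WORD_BYTES
    num_sets = CACHE_BYTES // (block_bytes * ways)
    if num_sets < 1:
        raise ValueError("block size and associativity exceed the cache size")
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size in bytes
    index_bits = num_sets.bit_length() - 1      # log2 of the number of sets
    return {
        "sets": num_sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
    }
```

For a direct-mapped cache with 4-word blocks (one 128-bit bus transfer per block), this gives 512 sets, a 4-bit offset, a 9-bit index, and a 19-bit tag. With 1-word blocks the same storage has 2048 sets and an 11-bit index.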

Option C: Set-Associative Cache, Write-Back Cache

This option requires the completion of Option B. Thus, it cannot serve as the initial option you choose to implement, but may be chosen as the "extra-credit" option after you have successfully completed Option B.

In Option C, we further enhance the data cache, by adding one or both of the features described below.

Enhance your cache to be 2-way set associative. A key issue here is the state machine changes to support the replacement strategy (random or LRU).
Convert your cache from write-through to write-back. Key issues here are the interaction with the write buffer and the handling of writes that miss the cache.

Note that the total amount of data memory in the cache must not be greater than 8K bytes.

Adding associativity is considerably simpler than adding write-back functionality -- we recommend associativity over write-back for all but the most ambitious groups. The amount of extra credit for this option will depend on the correct operation of the enhancement -- a failed attempt at write-back will receive significantly less credit than a successful attempt at associativity.

Option D: Branch Prediction

In this option, we address the performance of your deeply-pipelined processor during a branch. With more than 5 pipeline stages, the branch delay slot is no longer sufficient to keep your pipeline from stalling after a taken branch.

Branch prediction can help. In this option, you are required to implement a branch predictor that depends on the history of the instruction stream (i.e. predicting "always taken" or "always not taken" will not suffice, nor will random prediction).

At a minimum, your branch predictor should implement the 2-bit prediction scheme described in the 10/7 Pipelined CPU III lecture. More ambitious groups may choose to read one of the branch prediction papers under "Final Project" on the Resources page. A good choice would be the GShare algorithm.
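To make the minimum requirement concrete, here is a sketch of one common 2-bit scheme: a table of saturating counters indexed by the branch address. The state transitions in the lecture's version may differ, so treat this as an illustration rather than the required design. The table size, the indexing by low PC bits, and the initial counter value are all assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 predict taken)."""

    def __init__(self, entries=64):
        self.entries = entries          # assumed table size
        self.counters = [1] * entries   # assumed start: weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset, use low PC bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong
```

For a loop branch that is taken nine times and then falls through, executed twice in a row, this model mispredicts 3 of the 20 branches: once while warming up, and once at each loop exit. The second counter bit is what keeps the single not-taken outcome at the loop exit from flipping the prediction for the next pass.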

Note that you need to think carefully about what it means to have a predicted instruction stream and how you will recover from a mis-prediction. When a branch is mis-predicted, instructions in the pipeline will have to be squashed and correct execution resumed.

As part of this option, you must update the instruction tracing support in your monitor to support a predicted instruction stream. The tracer should only show committed instructions (i.e. not the squashed ones). Make sure to include the current cycle count in your trace. A useful debugging option would be to make another tracer which showed instructions as they were dispatched -- this would show more instructions than your primary instruction tracer.

Problem 2: Memory-mapped I/O

Here is the specification for the memory-mapped I/O space for your processor. Note that there may be some simple additions to this specification for specialized output for the processor race.

`Address`	`Description`
0x80000000-0x80001FFC	TFTP Memory Source (See 2b)
0x80002000-0xFFFFFE6C	Reserved for future use
0xFFFFFE70-0xFFFFFE90	TFTP Memory Source (See 2b)
0xFFFFFE94	Reserved for future use
0xFFFFFE98-0xFFFFFEDC	Reserved for future use
0xFFFFFEE0-0xFFFFFEE8	ASCII Display (Unchanged from Lab 4)
0xFFFFFEEC	Reserved for future use
0xFFFFFEF0-0xFFFFFEFC	Gianormous Cycle Counter (See 2a)
0xFFFFFF00-0xFFFFFFEC	Boot ROM (Unchanged from Lab 4)
0xFFFFFFF0-0xFFFFFFFC	Basic I/O (Unchanged from Lab 4)

Problem 2a: 64-bit Cycle Counter

For the final project, some of the longer-running programs will run for more cycles than a 32-bit counter can hold. You must therefore implement a new 64-bit cycle counter.

`Address`	`Reads`	`Writes`
`0xFFFFFEF0`	`Lower 32 Bits of Big Cycle Counter`	`Nothing`
`0xFFFFFEF4`	`Upper 32 Bits of Big Cycle Counter`	`Nothing`
`0xFFFFFEF8`	`Timer State`	`Timer Command`
`0xFFFFFEFC`	`Reserved for Future Use`	`Reserved for Future Use`

Your 64-bit cycle counter will have two states, running and stopped. A load from the Timer State location should return 1 if the counter is running and 0 if it is stopped. The counter accepts three commands, issued by storing a word to the Timer Command location.

`Value`	`Action`
`0x00000001`	`Reset the Counter`
`0x00000002`	`Start the Counter`
`0x00000004`	`Stop the Counter`

Resetting the counter should NOT start or stop the counter.

Note that this counter should be independent of your 32-bit cycle counter and that both should be functional.
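The register behavior above can be captured in a short software model, which may be handy as a golden reference for your test bench. The sketch assumes the counter powers up stopped at zero, that command values other than the three listed are ignored, and that the reserved word reads as zero; none of these are stated in the specification, so check with your TA. The class and method names are illustrative.

```python
class BigCycleCounter:
    """Behavioral model of the 64-bit memory-mapped cycle counter."""

    LOW, HIGH, CONTROL = 0xFFFFFEF0, 0xFFFFFEF4, 0xFFFFFEF8
    RESET, START, STOP = 0x00000001, 0x00000002, 0x00000004

    def __init__(self):
        self.count = 0
        self.running = False  # assumed power-on state

    def tick(self):
        """Advance one processor clock cycle."""
        if self.running:
            self.count = (self.count + 1) & 0xFFFFFFFFFFFFFFFF

    def store(self, addr, value):
        if addr != self.CONTROL:
            return  # stores to the count words do nothing
        if value == self.RESET:
            self.count = 0  # reset does not change running/stopped
        elif value == self.START:
            self.running = True
        elif value == self.STOP:
            self.running = False
        # other values are ignored here (assumption)

    def load(self, addr):
        if addr == self.LOW:
            return self.count & 0xFFFFFFFF
        if addr == self.HIGH:
            return self.count >> 32
        if addr == self.CONTROL:
            return 1 if self.running else 0
        return 0  # reserved word (assumption)
```

One case worth adding to your test plan is the carry from the lower word into the upper word: software that reads the two halves with separate loads can see an inconsistent pair if the counter rolls over between the loads, so stopping the counter before reading it is the safer sequence.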

Problem 2b: TFTP Memory Source

Up to now, you have been ignoring the write port to the TFTP. In this lab, you will use this additional feature to fill the TFTP Black Box with data and then download it off the board to the computer. This means that stores in the range 0x80000000-0x80001FFC should now write to the corresponding address of the TFTP black box. Bits [12:2] of the memory address should be used as the input to the ExternalAddress_ port of the black box. M:\lab3\FPGA_TOP.v includes a test circuit that tests the writability of your black box. After a file has been loaded onto the board, select a length using the dipswitches and push button 2, and the data in the black box will be replaced with words of the format DEADxxxx, where xxxx is the memory location of the word. You can now download the file off of the board and see that it has changed, both in length and in content.

`Address`	`Reads`	`Writes`
`0x80000000-0x80001FFC`	`Read from corresponding address of black box`	`Write to corresponding address of black box`
`0xFFFFFE70`	`The first four characters of the filename in the black box in ASCII. The high bits of the word correspond to the earlier characters.`	`Nothing`
`0xFFFFFE74`	`The fifth and sixth characters of the filename in the black box in ASCII. The high bits of the word correspond to the earlier characters. The lower sixteen bits of the word should be 0.`	`Nothing`
`0xFFFFFE78-0xFFFFFE8C`	`Reserved For Future Use`	`Reserved For Future Use`
`0xFFFFFE90`	`Nothing`	`Writes the length of the file in bytes. The maximum legal value is 0x00002000. Values higher than that have undefined results.`
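Two details of this table are easy to get wrong: which address bits drive the black box, and the byte order of the filename words. The sketch below models both. It assumes filenames shorter than six characters are padded with zero bytes, which the specification does not state; the function names are hypothetical.

```python
TFTP_BASE = 0x80000000
TFTP_LAST = 0x80001FFC

def external_address(addr):
    """Map a word-aligned CPU address to the ExternalAddress_ value (bits [12:2])."""
    if not TFTP_BASE <= addr <= TFTP_LAST or addr & 0x3:
        raise ValueError("not a word address in the TFTP memory range")
    return (addr >> 2) & 0x7FF  # 11 bits select one of 2048 words

def filename_words(name):
    """Return the words read at 0xFFFFFE70 and 0xFFFFFE74 for a filename."""
    raw = name.encode("ascii")[:6].ljust(6, b"\0")  # zero padding is an assumption
    first = int.from_bytes(raw[0:4], "big")         # earlier characters in the high bits
    second = int.from_bytes(raw[4:6], "big") << 16  # lower sixteen bits stay zero
    return first, second
```

For a file named "abcdef", the two loads would return 0x61626364 and 0x65660000. The last word of the range, 0x80001FFC, maps to black box address 0x7FF.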

Problem 3: Benchmark Program

As part of your final project, each group will write a benchmark program. We will use these programs as the test suite for the processor race. See the top of the document for the due date for the program.

Your benchmark must conform to the following guidelines. It must occupy at most 384 words of instructions and preallocated static storage, all of which must lie below address 0x00800000. It must not be self-modifying. The stack pointer should be initialized to 0x03FFFFFC and the heap should be initialized to 0x00800000. If you use any static storage, it should be clearly labeled, and we should be able to easily modify your code to remap the addresses of your static storage. You must save any $s registers before using them, and you must leave the high 32 bits of the cycle count in $v1 and the low 32 bits of the cycle count in $v0. Finally, you must conform to the instruction set specified by Labs 3-5. Failure to follow these conventions will definitely mean that your program will not be used in the final race, and it will also mean a loss of credit for the problem.

Problem 4: Performance Measurement

Determine the critical path of your processor. Document (using data from the Xilinx timing tools) why you believe the critical path is what you claim.

Next, measure the running speed of your processor. What is the fastest clock that you think you can run with? Can your memory subsystem run at the same speed as your processor clock? How does this measurement square with your critical path determination?

Processor Racing!

During the last week of class, we will meet in the lab to race processors (see the schedule at the top of this lab for the date and time). Extra credit will be assigned for this lab based on the results of this friendly competition. Your processor does not need to be "the overall best" to receive extra credit. We will have competitions for: best CPI, highest Clock Rate, best Performance / Hardware Resource usage (LUT count), and best overall performance, based on total execution time on the benchmarks.
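The race categories are linked by the usual relation: execution time = instruction count x CPI / clock rate. The sketch below shows how the 64-bit cycle count and a known instruction count yield CPI and execution time. All numbers mentioned afterward are hypothetical, chosen only to illustrate the trade-off, and the function names are made up.

```python
def cpi(cycles, instructions):
    """Average cycles per committed instruction."""
    return cycles / instructions

def execution_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """Execution time = instruction count * CPI / clock rate."""
    return instructions * cycles_per_instruction / clock_hz
```

Suppose, hypothetically, that a 5-stage design runs a million-instruction benchmark at a CPI of 1.2 on a 27 MHz clock, and a deeper pipeline runs it at a CPI of 1.5 on a 40 MHz clock. The first takes about 44.4 ms and the second 37.5 ms, so the deeper pipeline wins on total time despite losing on CPI. The reverse can also happen if hazards raise CPI faster than the clock rate improves.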

The benchmark suite will consist of the test programs submitted by all groups for Problem 3. The TAs may also add test programs to the suite.

Increasing the number of words of data cache in your system is not allowed -- as in Lab 4, the total amount of data cache storage must be 2048 words (extra memory for tags, etc, does not count against this limit).

Project Presentations

Your group must do a final presentation to the class (see the schedule at the top of this lab for the date and time). Your presentation should be 20 minutes long, with an additional 10 minutes for questions from the floor.

Everybody in your group must present; your individual grade will include your presentation. Good presentations (and write-ups, for that matter) will cover the specific sub-projects you chose to implement, and how they affected your processor's performance. What design decisions were made and why? How successful were you? How did you measure your performance? Detailed descriptions of your project datapath are not appropriate for a 20-minute presentation. However, high-level data paths and explanations of specific implementations might be appropriate.

Final Step: Lab Report

Turn in a copy of your Verilog code (including test benches), schematics, diagnostic program(s) and your on-line logs. Also turn in simulation logs that show correct operation of the processor + memory system. These logs should show the operations that were performed, and then the contents of memory with the correct values in it. Also turn in logs from your test benches.

Please make sure that your write-up is clear, concise, and complete. To receive full credit, you must include a discussion on how deep pipelining and your chosen option improved the performance of your project (or as the case may be, "not improved" -- experience is what you get when you don't get what you want).

You will not get a chance to discuss your final project with your TA, so make sure that you effectively communicate the key points of your project. One good technique is to analyze the performance of select benchmarks with and without your new features.

As part of your writeup, do a post-mortem for your test plan. How did unit testing, multi-unit testing, and complete processor testing help your verification and debugging? Show bug curves, and give examples of the type of bugs you found early on because of your test plan (as well as "escaped" bugs you found later than you would have hoped).

How much time did your team spend on this assignment?